Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa: technology from seed. Frontiers of GPU Computing 2010, 29-06-10.


Efficient Independent Component Analysis on a GPU
Rui Ramalho, Pedro Tomás, Leonel Sousa

Outline
– Motivation
– Independent Component Analysis
– FastICA Algorithm
– Experimental Results
– Conclusions

Blind Source Separation
Blind Source Separation (BSS) is a signal processing technique that separates a set of signals (sources) from a set of mixed signals.
Little is known about the original signals or the mixing process, only that the original signals are uncorrelated.
[Figure: mixing (mix) and separation (sep) stages]

Blind Source Separation: Cocktail Party
A classical example of blind source separation is the cocktail party problem.
– A number of people are talking simultaneously in a crowded room (at a cocktail party).
– Despite all the noise and cross-talk, a human brain has little difficulty following a conversation.
– Machines have to rely on blind source separation.

Blind Source Separation: Applications
Blind Source Separation has also been used in several other domains:
– EEG/MEG measurements (each sensor picks up a mixture of brain electrical activity, and BSS can be used to separate and identify the individual components).
– Denoising images (by treating the noise as an independent source, it is possible to separate it from the image's original components).
– Financial analysis (BSS can be used to uncover hidden factors in financial data).

Independent Component Analysis
Independent Component Analysis (ICA) is a special case of Blind Source Separation.
The sources of the mixed signals are assumed to be statistically independent (BSS only assumes the sources are statistically uncorrelated).

Independent Component Analysis: ICA Model
Under the ICA model, the observed variables are assumed to be a linear combination of several independent sources/signals.
The objective of ICA is to find the matrix W that inverts the mixing operation performed by the matrix A, without knowledge of A or s.
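As a compact restatement of the model described above, the usual ICA notation can be written as follows. The symbols x (observed mixtures) and y (estimated components) are the conventional ones and are assumed here; A, s and W are the quantities named on the slide.

```latex
\begin{aligned}
\mathbf{x} &= A\,\mathbf{s} && \text{(observed mixtures of the independent sources)}\\
\mathbf{y} &= W\,\mathbf{x} && \text{(estimated independent components)}\\
W &\approx A^{-1} && \text{(up to permutation and scaling of the rows)}
\end{aligned}
```

The permutation and scaling ambiguity is inherent to ICA: neither the order nor the amplitude of the sources can be determined from the mixtures alone.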

Independent Component Analysis: Measuring Statistical Independence
One way of measuring statistical independence is through the negentropy J(y), which is defined in terms of H(y), the differential information entropy of y.
In practice J(y) needs to be estimated. The estimator used by FastICA is built from a nonquadratic nonlinear function G and a Gaussian variable of zero mean and unit variance.
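For reference, the standard definitions from the FastICA literature are sketched below. They are given as the commonly published forms, not as a verbatim copy of the slide; the symbols y_gauss (a Gaussian variable with the same variance as y), f (the density of y) and ν (the zero-mean, unit-variance Gaussian variable) are assumed notation.

```latex
\begin{aligned}
J(y) &= H(y_{\text{gauss}}) - H(y)\\
H(y) &= -\int f(y)\,\log f(y)\,\mathrm{d}y\\
J(y) &\propto \bigl[\,E\{G(y)\} - E\{G(\nu)\}\,\bigr]^{2}
\end{aligned}
```

Since a Gaussian variable has the largest entropy among all variables of equal variance, J(y) is non-negative and is zero only when y is Gaussian.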

FastICA Algorithm
The procedure for computing the independent components can be divided into three stages:
– Pre-processing: allows a number of simplifications to the FastICA algorithm.
– Weight vector computation: the FastICA algorithm itself.
– Decorrelation: prevents the weight vectors from converging to the same solution.

FastICA Algorithm: Preprocessing & Weight Vector Computation
Preprocessing includes general tasks such as centering, whitening or filtering the data.
Each of the weight vectors is computed by a fixed-point iteration, where:
– g is the derivative of the nonquadratic function G used in the contrast function J.
– The algorithm can be modified to compute all the ICs simultaneously (a symmetric approach).
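The weight vector computation can be sketched as follows, assuming the standard one-unit FastICA update w ← E{x g(wᵀx)} − E{g'(wᵀx)} w followed by normalization, with g = tanh. This is a minimal CPU illustration, not the authors' GPU code. The helper `make_whitened_mixture`, the rotation angle, the sample count and the uniform sources are hypothetical test data; a rotation is used as the mixing matrix so that the data is already white and the preprocessing stage can be skipped.

```python
import math
import random

def fastica_one_unit(samples, iterations=50, seed=0):
    """Estimate one weight vector from whitened 2-D samples [(x1, x2), ...]."""
    rng = random.Random(seed)
    angle = rng.uniform(0.0, 2.0 * math.pi)
    w = [math.cos(angle), math.sin(angle)]       # random unit-norm start
    n = len(samples)
    for _ in range(iterations):
        ex_g = [0.0, 0.0]   # accumulates E{x g(w^T x)}
        e_dg = 0.0          # accumulates E{g'(w^T x)}
        for x in samples:
            u = w[0] * x[0] + w[1] * x[1]
            t = math.tanh(u)                     # g(u) = tanh(u)
            ex_g[0] += x[0] * t
            ex_g[1] += x[1] * t
            e_dg += 1.0 - t * t                  # g'(u) = 1 - tanh(u)^2
        # Fixed-point update: w+ = E{x g(w^T x)} - E{g'(w^T x)} w
        w_new = [ex_g[0] / n - (e_dg / n) * w[0],
                 ex_g[1] / n - (e_dg / n) * w[1]]
        norm = math.hypot(w_new[0], w_new[1])
        w = [w_new[0] / norm, w_new[1] / norm]   # renormalize to unit length
    return w

def make_whitened_mixture(n=5000, theta=0.6, seed=1):
    """Hypothetical test data: two uniform unit-variance sources mixed by a rotation."""
    rng = random.Random(seed)
    half_width = math.sqrt(3.0)   # uniform on [-sqrt(3), sqrt(3)] has unit variance
    c, s = math.cos(theta), math.sin(theta)
    data = []
    for _ in range(n):
        s1 = rng.uniform(-half_width, half_width)
        s2 = rng.uniform(-half_width, half_width)
        data.append((c * s1 - s * s2, s * s1 + c * s2))   # x = A s, A a rotation
    # Rows of the ideal unmixing matrix W = A^T
    return data, [(c, s), (-s, c)]

data, true_rows = make_whitened_mixture()
w = fastica_one_unit(data)
alignment = max(abs(w[0] * r[0] + w[1] * r[1]) for r in true_rows)
print("estimated w:", w, "alignment with closest true row:", alignment)
```

The sign of w may flip between iterations, which is harmless because an independent component is only defined up to sign; this is why the comparison against the true rows uses an absolute value.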

FastICA Algorithm: GPU Implementation
The preprocessing stage is generally inexpensive and was implemented on the CPU.
The FastICA algorithm is composed mostly of matrix operations that can be efficiently implemented using CUBLAS.
– The computations of the non-linear function g and of its derivative g' have no dependencies.
– The expected value is computed using hierarchical additions, storing the intermediate results in the GPU's shared memory.
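The hierarchical additions mentioned above follow the usual tree-reduction pattern. The sketch below emulates it sequentially in Python, assuming a power-of-two padded buffer that stands in for a block's shared memory; the function name `tree_mean` is hypothetical, and a real kernel would run the inner loop as parallel threads with a synchronization barrier between strides.

```python
def tree_mean(values):
    """Mean computed by pairwise (hierarchical) additions, as in a GPU reduction."""
    n = len(values)
    if n == 0:
        raise ValueError("tree_mean needs at least one value")
    size = 1
    while size < n:               # round the buffer length up to a power of two
        size *= 2
    buf = list(values) + [0.0] * (size - n)   # stand-in for shared memory
    stride = size // 2
    while stride > 0:
        for i in range(stride):   # on a GPU, each i would be handled by one thread
            buf[i] += buf[i + stride]
        stride //= 2              # halve the number of active elements
    return buf[0] / n

print(tree_mean([1.0, 2.0, 3.0, 4.0, 5.0]))
```

With this layout the sum of N values takes about log2(N) parallel steps instead of N − 1 sequential additions.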

FastICA Algorithm: Decorrelation
To keep the estimated weight vectors from converging to the same results, they need to be decorrelated:
– After estimating p independent components, subtract from the (p+1)-th estimate its projections onto the previous p components.
– An alternative is to apply a symmetric decorrelation after every iteration.
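The two decorrelation schemes described above are usually written as follows in the FastICA literature. These are the commonly published forms, given here as an assumed reconstruction; the index j and the explicit renormalization step are assumed notation.

```latex
\begin{aligned}
\text{Deflation:}\quad
& \mathbf{w}_{p+1} \leftarrow \mathbf{w}_{p+1} - \sum_{j=1}^{p} \bigl(\mathbf{w}_{p+1}^{T}\mathbf{w}_{j}\bigr)\,\mathbf{w}_{j},
\qquad
\mathbf{w}_{p+1} \leftarrow \frac{\mathbf{w}_{p+1}}{\lVert \mathbf{w}_{p+1} \rVert}\\[4pt]
\text{Symmetric:}\quad
& W \leftarrow \bigl(W W^{T}\bigr)^{-1/2}\, W
\end{aligned}
```

The deflation form is a Gram-Schmidt step and estimates the components one at a time; the symmetric form treats all weight vectors equally and is the one that requires the inverse matrix square root.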

Decorrelation: The Tricky Bit
The computation of (WW^T)^(-1/2) is complex and can be done using the eigendecomposition (eigenvalues and eigenvectors) of WW^T.
– This can be done using the already available CPU-based high-performance libraries (LAPACK).
– Alternatively, the eigenvalues can be computed directly on the GPU.

Jacobi Eigenvalue Algorithm
The Jacobi Eigenvalue Algorithm successively uses Jacobi rotations to annihilate the off-diagonal elements of a given matrix A.
A Jacobi rotation is given by A' = J^T A J, where:
– J is a Jacobi rotation matrix
– c = cos(θ)
– s = sin(θ), with θ the rotation angle
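A minimal sequential sketch of the Jacobi eigenvalue algorithm is shown below, assuming a real symmetric input and a cyclic sweep over all (p, q) pairs. The function name, the sweep limit and the tolerance are hypothetical choices; the rotation angle is taken from tan(2θ) = 2·a_pq / (a_qq − a_pp), which zeroes the element a_pq.

```python
import math

def jacobi_eigenvalues(matrix, max_sweeps=30, tol=1e-20):
    """Eigenvalues of a symmetric matrix (list of lists) via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]            # work on a copy
    n = len(a)
    for _ in range(max_sweeps):
        # Sum of squared off-diagonal elements; zero for a diagonal matrix
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                num, den = 2.0 * a[p][q], a[q][q] - a[p][p]
                if den < 0.0:                 # keep |theta| <= pi/4
                    num, den = -num, -den
                theta = 0.5 * math.atan2(num, den)
                c, s = math.cos(theta), math.sin(theta)
                # A <- A J: only columns p and q change
                for i in range(n):
                    aip, aiq = a[i][p], a[i][q]
                    a[i][p] = c * aip - s * aiq
                    a[i][q] = s * aip + c * aiq
                # A <- J^T A: only rows p and q change
                for j in range(n):
                    apj, aqj = a[p][j], a[q][j]
                    a[p][j] = c * apj - s * aqj
                    a[q][j] = s * apj + c * aqj
    return sorted(a[i][i] for i in range(n))

print(jacobi_eigenvalues([[2.0, 1.0], [1.0, 2.0]]))
```

Each rotation touches only two rows and two columns, which is the property that allows several rotations on disjoint index pairs to run at the same time.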

Jacobi Eigenvalue Algorithm
Each Jacobi rotation only changes two columns and two rows of the matrix A.
By carefully choosing the order of the rotations, up to N/2 rotations can be done simultaneously.
The matrix J is a very sparse matrix, making CUBLAS unsuitable for this algorithm.
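One way to obtain N/2 independent rotations per step is a round-robin (tournament) ordering of the index pairs. The slides do not say which ordering the authors used, so the sketch below is only an assumed illustration; the function name `parallel_rotation_rounds` is hypothetical and N is taken to be even.

```python
def parallel_rotation_rounds(n):
    """Return n - 1 rounds, each holding n / 2 disjoint (p, q) index pairs (n even)."""
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be an even number >= 2")
    order = list(range(n))
    rounds = []
    for _ in range(n - 1):
        # Pair position i with position n - 1 - i: no index appears twice in a round
        pairs = [tuple(sorted((order[i], order[n - 1 - i]))) for i in range(n // 2)]
        rounds.append(pairs)
        # Keep the first index fixed and rotate the others by one position
        order = [order[0], order[-1]] + order[1:-1]
    return rounds

for number, pairs in enumerate(parallel_rotation_rounds(4)):
    print("round", number, pairs)
```

Because the pairs within a round share no index, their rotations modify different rows and columns and can be applied concurrently; after N − 1 rounds every (p, q) pair has been visited once, which completes one sweep.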

Decorrelation: Iterative Algorithm
Another alternative is to avoid the computation of the eigenvalues altogether.
An iterative algorithm (Algorithm 4) converges to the decorrelation expression presented earlier.
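The listing of Algorithm 4 is not part of this transcript. A well-known iteration with the stated property, described by Hyvärinen and Oja, first scales W by the square root of a norm of WW^T and then repeats W ← (3/2)W − (1/2)WW^T W. The sketch below assumes that this is the scheme meant; the helper names, the iteration count and the use of the maximum absolute row sum as the norm are choices made for the illustration.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def iterative_decorrelation(w, iterations=30):
    """Drive W towards (W W^T)^(-1/2) W using only matrix products."""
    wwt = matmul(w, transpose(w))
    # Step 1: scale so that the eigenvalues of W W^T are at most 1
    scale = max(sum(abs(v) for v in row) for row in wwt) ** 0.5
    w = [[v / scale for v in row] for row in w]
    # Step 2: W <- 3/2 W - 1/2 W W^T W, repeated a fixed number of times
    for _ in range(iterations):
        wwtw = matmul(matmul(w, transpose(w)), w)
        w = [[1.5 * w[i][j] - 0.5 * wwtw[i][j] for j in range(len(w[0]))]
             for i in range(len(w))]
    return w

w0 = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 0.0, 4.0]]
w = iterative_decorrelation(w0)
print(matmul(w, transpose(w)))   # should be close to the identity matrix
```

The appeal on a GPU is that every step is a dense matrix product, which maps directly onto a BLAS-style library, and no eigenvalue solver is needed.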

Decorrelation: Comparison of Decorrelation Algorithms
Experimental results show that the proposed GPU-based Jacobi eigenvalue algorithm is outperformed by a CPU-based LAPACK eigenvalue algorithm using multiple relatively robust representations (MRRR).
However, avoiding the explicit computation of the eigenvalues is still the fastest process.

Experimental Results: Experimental Setup
– A hyperbolic tangent was chosen as a typical non-linear function g.
– The iterative decorrelation algorithm that avoids the explicit computation of the eigenvalues is used in the decorrelation step.

                   CPU               GPU
Model              AMD Opteron 170   NVIDIA GeForce 8800 GTX
Number of cores    2                 128
Clock frequency    2 GHz             1.35 GHz
Main memory        2 GB              768 MB

Experimental Results: Single-Core CPU vs. GPU
The accelerated portion of the algorithm (the loop) is sped up by up to 110x, for estimating 256 ICs with [...] samples.
As the accelerated portion gets faster, the influence of the unaccelerated part of the algorithm (the preprocessing stage) grows. This noticeably reduces the global speedup.
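The effect described above is Amdahl's law: the overall speedup is bounded by the share of the run time that is not accelerated. The sketch below illustrates it with the 110x loop speedup quoted on the slide; the 99% accelerated fraction is a hypothetical value chosen for the example, not a measurement from the talk.

```python
def overall_speedup(accelerated_fraction, loop_speedup):
    """Amdahl's law: global speedup when only part of the run time is accelerated."""
    serial_part = 1.0 - accelerated_fraction           # e.g. preprocessing on the CPU
    parallel_part = accelerated_fraction / loop_speedup
    return 1.0 / (serial_part + parallel_part)

# Hypothetical split: 99% of the original run time is spent in the accelerated loop
print(overall_speedup(0.99, 110.0))
```

Under that assumed split, a 110x faster loop yields a global speedup of only about 53x, and no loop speedup, however large, could push the total beyond 100x.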

Experimental Results: Single-Core CPU vs. GPU
The accelerated loop component ceases to be the bottleneck.
The additional penalty of transferring data to and from the GPU is negligible.

Experimental Results: Multicore CPU vs. GPU
The parallelized GPU algorithm was also tested on a more powerful GeForce GTX 285, with 240 cores.
This implementation was compared with a CPU-based implementation on an Intel Core 2 Quad Q9950 using Intel's high-performance MKL library.
It was possible to attain a speedup of around 12x.

Conclusions
By using a GPU it was possible to speed up the FastICA algorithm by 55x for estimating 256 ICs with 1000 samples each, in comparison with a serial version running on a single core of a CPU.
These results can be further improved, as the current bottleneck lies in the preprocessing stage, which is still done on the CPU.
