Computer Architecture II (Arquitectura de Computadores II), Paulo Marques, Departamento de Eng. Informática, Universidade de Coimbra, 2004/2005. 4. Examples of Some Current Processors

4. Examples of Some Current Processors
4.1. The IA-32 Architecture

2 "The x86 isn't all that complex – it just doesn't make a lot of sense."
Mike Johnson, leader of the 80x86 design at AMD, Microprocessor Report (1994)

3 A brief history
- 1978: The Intel 8086 is announced (16-bit architecture)
- 1980: The 8087 floating-point coprocessor is added
- 1982: The 80286 increases the address space to 24 bits and adds instructions
- 1985: The 80386 extends the architecture to 32 bits and adds new addressing modes
- 1989-1995: The 80486, Pentium and Pentium Pro add a few instructions (mostly designed for higher performance)
- 1997: 57 new MMX instructions are added (Pentium II)
- 1999: The Pentium III adds another 70 instructions (SSE)
- 2001: Another 144 instructions (SSE2)
- 2003: AMD extends the architecture to increase the address space to 64 bits, widens all registers to 64 bits and makes other changes (AMD64)
- 2004: Intel capitulates and embraces AMD64 (calling it EM64T) and adds more media extensions
The recurring problem: legacy and backward compatibility.

4 Overview
Complexity:
- Instructions can be 1 to 17 bytes long
- One operand always acts as both source and destination
- One operand can come from memory
- Complex addressing modes
What has saved the architecture over the years:
- The most frequent instructions are not hard to implement
- Compilers do not generate the slow instructions and avoid the slow parts of the architecture
- Internally the processor was converted to a RISC design, keeping only a front end that decodes the complex instructions into simple RISC µOPs
- Market volume

5 Registers (FP not shown)

6 Instructions
- Two operands (e.g. ADD AX, BX)
- Different source/destination combinations:
  - Register/Register
  - Register/Immediate
  - Register/Memory
  - Memory/Register
  - Memory/Immediate
- Multiple addressing modes:
  - Absolute (e.g. MOV AX, [1000])
  - Register indirect (e.g. MOV AX, [SI])
  - Base with 8/16/32-bit displacement (e.g. MOV AX, [SI+100])
  - Indexed (e.g. MOV AX, [SI+BX])
  - Based indexed (e.g. MOV AX, [SI+BX+100])
  - Base + scaled index (address = BaseReg + 2^Scale * IndexReg)
  - Base + scaled index with displacement (as above, plus a displacement)
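The addressing modes listed above all reduce to one formula, address = base + 2^scale * index + displacement, with unused terms set to zero. The sketch below is only an illustration of that arithmetic: the function name, the register contents and the `bits` parameter are assumptions made for the example, and segment registers are ignored.

```python
def effective_address(base=0, index=0, scale=0, disp=0, bits=32):
    """Return base + 2**scale * index + disp, wrapped to `bits` bits."""
    if scale not in (0, 1, 2, 3):
        # IA-32 encodes the scale in 2 bits, giving a multiplier of 1, 2, 4 or 8.
        raise ValueError("scale must be 0..3")
    return (base + (index << scale) + disp) & ((1 << bits) - 1)


# Made-up register contents, for illustration only.
regs = {"ESI": 0x2000, "EBX": 0x0010}

absolute = effective_address(disp=1000)                            # [1000]
indirect = effective_address(base=regs["ESI"])                     # [ESI]
base_disp = effective_address(base=regs["ESI"], disp=100)          # [ESI+100]
based_indexed = effective_address(base=regs["ESI"],
                                  index=regs["EBX"], disp=100)     # [ESI+EBX+100]
scaled = effective_address(base=regs["ESI"],
                           index=regs["EBX"], scale=2, disp=100)   # [ESI+4*EBX+100]

for name, value in [("absolute", absolute), ("indirect", indirect),
                    ("base+disp", base_disp), ("based indexed", based_indexed),
                    ("scaled", scaled)]:
    print(f"{name:14s} {value:#x}")
```

The scaled forms are shown with 32-bit registers because the scale factor is only available with 32-bit addressing; the 16-bit examples on the slide (SI, BX) correspond to `scale=0`.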

7 Multiple addressing modes

8 Instructions (just a few...)
In many cases the registers are not general purpose!

9 Instruction encoding

10 Extensions to the IA-32 architecture
MMX, SSE and SSE2 instructions. They consist of:
- MMX: operations on integer vectors (64-bit vectors holding 8-, 16- or 32-bit numbers)
- SSE: operations on single-precision floating-point vectors (vectors of 4 IEEE 754 floats)
- SSE2: operations on double-precision floating-point vectors (vectors of 2 IEEE 754 doubles), plus an extension of the integer vectors (128-bit vectors holding 8-, 16-, 32- or 64-bit numbers)
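To make "operations on integer vectors" concrete, the sketch below emulates a lane-wise wrap-around addition on a 64-bit vector, in the spirit of the MMX packed-add instructions. It is plain Python written for illustration, not a model of the hardware; the function name and the sample values are assumptions.

```python
def packed_add(a, b, lane_bits, vector_bits=64):
    """Add two packed integer vectors lane by lane, wrapping inside each lane."""
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, vector_bits, lane_bits):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        # Mask the sum so a carry never leaks into the neighbouring lane.
        result |= (lane_sum & lane_mask) << shift
    return result


# Number of lanes in a 64-bit MMX vector for each element size.
for lane_bits in (8, 16, 32):
    print(f"{lane_bits}-bit elements: {64 // lane_bits} lanes")

# The same bit patterns give different results depending on the lane width.
print(hex(packed_add(0x00FF, 0x0101, lane_bits=8)))   # lanes are independent
print(hex(packed_add(0x00FF, 0x0101, lane_bits=16)))  # one 16-bit lane
```

With 8-bit lanes the low lane wraps from 0xFF to 0x00 without disturbing its neighbour, which is exactly what distinguishes a packed add from an ordinary 64-bit add.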

4.2. Intel Pentium 4

12 IA-32 instructions and µOPs
- All modern implementations of the IA-32 architecture convert the original instructions into a sequence of micro-instructions. Intel calls these µOPs
- µOPs closely resemble RISC instructions: constant size, uniform format, etc.
- An IA-32 instruction is at least 1 µOP. A complex instruction can correspond to hundreds of them (!) (e.g. REP MOVSB)
(Figure: MOV AX, [1000] decoded into µOP 1, µOP 2, µOP 3, µOP 4)

13 Some characteristics of the Pentium 4 (2000)
- Speculative-execution pipeline with several functional units (NetBurst architecture)
  - 20-stage pipeline
  - 7 functional units
  - Up to 126 µOPs in flight in the pipeline (of which 48 LOADs and 24 STOREs)
  - Completes up to 3 µOPs per clock cycle
  - ALUs run at twice the clock speed
- Uses a trace cache
- Two branch target buffers
  - Front end: 4K entries
  - Trace cache: 512 entries
- Uses register renaming (8 architectural registers mapped onto 128 physical registers) in addition to a re-order buffer
  - Register renaming eliminates name dependences
  - The re-order buffer guarantees the commit order of the instructions
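The claim that register renaming eliminates name dependences can be illustrated with a very small renamer. This is a textbook-style sketch, not the Pentium 4 mechanism: the instruction format `(dest, src1, src2)`, the free-list policy and the absence of register reclamation at commit are all simplifying assumptions.

```python
def rename(instructions, num_physical=128):
    """Map architectural registers onto physical ones.

    `instructions` is a list of (dest, src1, src2) architectural register
    names. Every write gets a fresh physical register, so only true
    (read-after-write) dependences remain after renaming.
    """
    free = list(range(num_physical))   # hypothetical free list, never refilled
    mapping = {}                       # architectural name -> physical number
    renamed = []

    def read(reg):
        # A register read before any write still needs a physical home.
        if reg not in mapping:
            mapping[reg] = free.pop(0)
        return mapping[reg]

    for dest, src1, src2 in instructions:
        p1, p2 = read(src1), read(src2)
        pd = free.pop(0)               # fresh register for every result
        mapping[dest] = pd
        renamed.append((pd, p1, p2))
    return renamed


program = [
    ("EAX", "EBX", "ECX"),   # EAX = EBX op ECX
    ("EBX", "EDX", "EDX"),   # overwrites EBX: write-after-read on EBX
    ("EAX", "EBX", "EDX"),   # overwrites EAX: write-after-write on EAX
]
for before, after in zip(program, rename(program)):
    print(before, "->", tuple(f"p{n}" for n in after))
```

After renaming, the second instruction writes a different physical register than the one the first instruction reads, and the two writes to EAX land in different physical registers, so both could execute out of order. The third instruction still reads the result of the second, which is the true dependence that must be respected.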

14 Overview of the Pentium 4

15 Layout of the pipeline

16 Trace Cache
A trace cache is a sophisticated version of an instruction cache (L1).
When the trace cache is accessed with the address of a given IA-32 instruction, one of three things happens:
- The translation of the instruction is in the cache. Up to 3 µOPs are produced. Those 3 may represent between 1 and 3 IA-32 instructions, so the IA-32 PC is advanced by 1 to 3 instructions.
- The translation of the instruction is in the cache, but it needs more than 4 µOPs. For these complex instructions, control is passed to a program in a micro-ROM until the complete sequence has been produced.
- The translation is not in the cache. In this case the IA-32 decoder is used to translate the instruction, and the result is placed in the cache. Note that the next time the instruction is executed, it will typically already be decoded in the cache.
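The three outcomes described above can be sketched as a lookup function. This is a toy model under stated assumptions: the cache is a plain dictionary keyed by instruction address, `decode` stands in for the IA-32 decoder, the µOP names and addresses are invented, and the grouping of up to 3 µOPs per access is not modelled.

```python
MAX_UOPS_FROM_CACHE = 4   # longer translations are handed to the micro-ROM


def fetch(trace_cache, address, decode):
    """Return (source, uops) for the IA-32 instruction at `address`."""
    if address not in trace_cache:
        # Translation missing: run the decoder and fill the cache.
        trace_cache[address] = decode(address)
        return ("decoder", trace_cache[address])
    uops = trace_cache[address]
    if len(uops) > MAX_UOPS_FROM_CACHE:
        # Complex instruction: the micro-ROM sequences the µOPs.
        return ("micro-rom", uops)
    # Simple instruction: µOPs come straight from the trace cache.
    return ("trace-cache", uops)


# Hypothetical decoded program: address -> µOP sequence.
program = {
    0x100: ["load"],
    0x104: ["u1", "u2", "u3", "u4", "u5"],
}
decode = program.__getitem__

cache = {}
for address in (0x100, 0x100, 0x104, 0x104):
    source, uops = fetch(cache, address, decode)
    print(f"{address:#x}: {len(uops)} µOP(s) from {source}")
```

The first access to each address goes through the decoder; the second is served by the cache, or by the micro-ROM when the translation is too long.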

17 Trace Cache (2)
The trace cache stores sequences of executed instructions that extend across branches.

18 Detailed view of the Pentium 4 (2000)

19 Pentium 4 Die

4.3. AMD Opteron (& Athlon64)

21 Top processors on SPEC2000 (July/04): CPU integer performance

22 Top processors on SPEC2000 (July/04): CPU floating-point performance

23 Processor Market
- The PC market has led Intel and AMD to really boost the integer performance of their processors, to the point where they have largely surpassed the performance available in classical RISC chips
- Floating-point performance is increasing, although RISC/Vector/VLIW processors still have an edge
  - No consumer need in the PC market
  - Scientific workstations need FP performance
- In the server market, what matters is not so much peak performance as throughput and reliability
  - Xeon systems
  - Itanium
  - POWER4+

24 64-bit World
- 64-bit machines have been available for a long time in the scientific and business markets, e.g. SPARCv9, Alpha, POWER4+, ...
- What does 64-bit bring?
  - Increased address space (32-bit: 4 GByte max; 64-bit: PBytes!)
  - Increased dynamic range for variables (32-bit int versus 64-bit int)
- 64-bit does not automatically bring increased performance! It may have the opposite effect: pointers and 64-bit integers are twice the size of their 32-bit counterparts, so memory traffic can up to double when going from 32-bit to 64-bit.
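The numbers behind these bullets are easy to recompute. The short sketch below works out the address-space sizes, the range of two's-complement integers, and the footprint of a pointer array at both widths; the helper names and the array length are arbitrary choices for the example.

```python
GIB = 2 ** 30
PIB = 2 ** 50


def address_space_bytes(bits):
    """Bytes addressable with a `bits`-wide byte address."""
    return 2 ** bits


def signed_range(bits):
    """(min, max) of a two's-complement integer of the given width."""
    return (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)


def pointer_array_bytes(count, pointer_bits):
    """Memory occupied by an array of `count` pointers."""
    return count * pointer_bits // 8


print(address_space_bytes(32) // GIB, "GiB with 32-bit addresses")
print(address_space_bytes(64) // PIB, "PiB with full 64-bit addresses")
print("32-bit int range:", signed_range(32))
print("64-bit int range:", signed_range(64))
print("1000 pointers:", pointer_array_bytes(1000, 32), "bytes vs",
      pointer_array_bytes(1000, 64), "bytes")
```

The last line is the source of the extra memory traffic: a pointer-heavy data structure occupies twice as many bytes, so it fills caches and memory bandwidth twice as fast.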

25 Main contenders in the 64-bit server market
- SPARCv9 (Sun and Fujitsu): Future uncertain; mostly used in the high-end market; keeps going partly because of its installed customer base.
- Intel Itanium2: Future uncertain. AMD's processors are much better, and Intel EM64T is a copy of AMD's design. Poor performance for its price when compared with the competition.
- AMD64 Opteron (and Athlon64): Has taken the lead in the market by proposing an architecture that executes both 32-bit and 64-bit applications with good performance. Superior memory bandwidth. Problem: IT'S NOT INTEL!
- Intel's Extended Memory 64 processors: Intel licensed the AMD technology and has launched an architecture that is exactly (or almost) the same. It is currently available in high-end Xeon machines.
Note: IBM POWER4+ still dominates the high-end multi-way server market.

26 AMD64 – Dual Mode
- AMD has proposed an architecture that allows the execution of both 32-bit and 64-bit applications (x86-64)
  - No need to recompile old applications
  - 32-bit applications execute with the same performance
  - 64-bit applications take advantage of a larger address space, more registers, etc.
- Operating system support:
  - Linux (SuSE, Redhat, ...)
  - Windows Server 2003 (beta)
  - Solaris (2nd half of 2004)
  - FreeBSD & NetBSD
  - Java 1.5
(Figure: a 64-bit operating system, e.g. Linux64 or Windows, running a legacy 32-bit application with a 4 GB memory limit alongside a 64-bit application)

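27 The Instruction Set Architecture
Registers (figure):
- In x86: general-purpose registers EAX to EDI (with sub-registers such as AH and AL, bits 15-8 and 7-0), EIP, the x87 registers, XMM0 to XMM7
- Added by AMD64: general-purpose registers widened to 64 bits (RAX, bits 63-0), R8 to R15, XMM8 to XMM15
Instructions:
- IA-32 instructions + new prefixes
- 64-bit mode instructions
(Intel's look alike!)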
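The register figure shows how the legacy names alias the low bits of the widened registers. The sketch below extracts those views from a single 64-bit value using masks and shifts; the function name and the sample value are assumptions for illustration, and only the read view is modelled (the rules for writes to sub-registers are not covered).

```python
def sub_registers(rax):
    """Return the legacy views that alias the low bits of a 64-bit RAX value."""
    rax &= 0xFFFFFFFFFFFFFFFF
    return {
        "RAX": rax,                  # bits 63..0 (added by AMD64)
        "EAX": rax & 0xFFFFFFFF,     # bits 31..0
        "AX": rax & 0xFFFF,          # bits 15..0
        "AH": (rax >> 8) & 0xFF,     # bits 15..8
        "AL": rax & 0xFF,            # bits 7..0
    }


for name, value in sub_registers(0x1122334455667788).items():
    print(f"{name:3s} = {value:#x}")
```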

28 Why More Registers?
(Figure: number of registers each function in the program needs)
Question: If processors do register renaming, why do we need more programmer-visible registers?

29 AMD Opteron Architecture
- The memory controller is included in the CPU (6.4 GB/s)
- HyperTransport: a standard point-to-point link for high-speed circuits (international consortium)
  - 3x 6.4 GB/s inter-processor connections
  - Up to 19.2 GB/s peak aggregate bandwidth
  - (AMD Athlon64 only has one HyperTransport link)
(Figure: AMD Opteron processor architecture. AMD64 core, L1 instruction cache, L1 data cache, L2 cache, DDR memory controller connected directly to memory, HyperTransport technology connecting to other processors/devices)

30 Difference to traditional systems
(Figure, traditional system: CPU, North Bridge, South Bridge, DDR memory, PCI, PCI-X, IDE, FDC, USB, etc., PCI-X bridge, other CPUs or devices)
(Figure, Opteron system: Opteron CPUs, DDR memory, PCI-X bridges, I/O hubs, PCI, PCI-X, IDE, FDC, USB, etc., other CPUs or devices)

31 AMD64 Core (Opteron – Hammer)
- Superscalar, out-of-order, multi-issue processor
- 10 execution units
  - 3 integer ALUs
  - 3 FP ALUs
  - 3 address calculation units
  - 1 load/store unit
- 12-stage pipeline (17 stages for FP)
- The IA-32 instructions are translated into MacroOps (MOPs)
  - Single-part MOPs: arithmetic operations or memory accesses
  - Two-part MOPs: an arithmetic operation and a memory access
- Dynamic branch prediction
  - Local history table + global history table (16K entries)
  - Branch target buffer: 2K branches
- Integrated DDR memory controller
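As background for the "dynamic branch prediction" bullet, here is a minimal sketch of a history-based predictor. It is a generic textbook scheme (2-bit saturating counters indexed by the branch address XORed with a global history register, often called gshare), swapped in for illustration; it is not the Opteron's actual predictor and it models only a global-history table, not the local one. The class name, the initial counter value and the branch address are assumptions.

```python
class GsharePredictor:
    """2-bit saturating counters indexed by (pc XOR global history)."""

    def __init__(self, entries=16 * 1024):
        # `entries` must be a power of two so the mask works as a modulo.
        self.mask = entries - 1
        self.counters = [1] * entries   # start at "weakly not taken"
        self.history = 0                # most recent outcomes, one bit each

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the real outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = GsharePredictor()
loop_branch = 0x400   # hypothetical address of a loop-closing branch
mispredictions = 0
for _ in range(40):
    if not predictor.predict(loop_branch):
        mispredictions += 1
    predictor.update(loop_branch, taken=True)
print("mispredictions while warming up:", mispredictions)
print("prediction once trained:", predictor.predict(loop_branch))
```

An always-taken branch is mispredicted while the history register fills up and the counters warm; once the history is stable, the same table entry is hit every time and the prediction settles on "taken".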

32 Opteron's Core

33 Moving Instructions from Memory to Cache
When code is first moved into the Athlon's L1 instruction cache, the processor's predecode logic examines the newly cached lump of code in order to detect individual instruction boundaries, and it marks those boundaries with a small amount of "metadata" so that the front end has less work to perform. The predecode logic also marks static branches.
This predecoding process moves some of the front-end work to an earlier portion of the pipeline, speeding the actual fetch and decode phases later. The drawback is that the extra metadata eats up valuable L1 I-cache space.
(Figure: processor, instruction cache, memory)

34 Processor Front End
- 16 bytes are read at a time (roughly 5 IA-32 instructions)
- FastPath decoder (instructions that translate into at most 2 MOPs)
  - max 3 IA-32 instr./clock
  - max 3 MOPs/clock
- Micro-ROM (everything else)
  - max 1 IA-32 instr./clock
  - max 3 MOPs/clock
- Issue slots (3 instructions)

35 Opteron's Pipeline

36 Opteron's Die

37 Reading material
Computer Architecture: A Quantitative Approach
- Section 3.10
- Appendix D
Articles
- Jon "Hannibal" Stokes, The Pentium 4 and the G4e: an Architectural Comparison: Part I, in Ars Technica, July
- Jon "Hannibal" Stokes, The Pentium 4 and the G4e: an Architectural Comparison: Part II, in Ars Technica, July
- Jon "Hannibal" Stokes, Inside AMD's Hammer: the 64-bit architecture behind the Opteron and Athlon 64, in Ars Technica, January
- Viktor Kartunov, Facts & Assumptions about the Architecture of AMD Opteron and Athlon 64, in Digit-Life