Services Economies of Scale
• Substantial economies of scale possible
• 2006 comparison of very large service with small/mid-sized (~1000 servers):
– Large service [$13/Mb/s/mth]: $0.04/GB; medium [$95/Mb/s/mth]: $0.30/GB (7.1x)
– Large service: $4.6/GB/year (2x in 2 DC); medium: $26.00/GB/year (5.7x)
– Large service: over 1,000 servers/admin; enterprise: ~140 servers/admin (7.1x)
• High cost of entry
– Physical plant expensive: 15MW roughly $200M
• Summary: significant economies of scale but at very high cost of entry
– Small number of large players likely outcome
Thanks, James Hamilton, amazon
CS 352H: Computer Systems Architecture
Lecture 1: What is Computer Architecture
and why should I care?
Professor Emmett Witchel
University of Texas at Austin
witchel@cs.utexas.edu
Lecture 1
1
Goals
• Understand the “how” and “why” of computer system organization
– Instruction Set Architecture
– System Organization (processor, memory, I/O)
– Microarchitecture
– Virtualization
• Learn methods of evaluating performance
– Metrics & benchmarks
• Learn how to make systems go fast
– Pipelining, caching
– Parallelism (ILP, DLP, TLP)
– Application specific architectures (graphics, signal proc.)
• Preview of where architecture is heading
Logistics
Lectures: T/Th 12:30-2:00pm, PAI 3.14
Instructor: Prof. Emmett Witchel, W 1:15-2:15
TA: Shalini Sahoo, MW 11:30-1:00pm, PAI 5.38 Desk1
Grading: see web page
Texts: Hennessy & Patterson, Computer Organization and Design (Fourth Edition), including CD. Revised Fourth Edition preferred, not required.
CS352H Online
URL: www.cs.utexas.edu/users/witchel/CS352H
I will occasionally email you via Blackboard and at your registered
email address. I expect this channel to be reliable and timely.
Discussion group: via Blackboard
Log in at courses.utexas.edu
General, Homeworks, Project
Computer Architecture Seminar Series:
www.cs.utexas.edu/users/cart/arch
Assignment for Next Tuesday
• Turn in student survey forms, if you want
• Read the Moore paper (see webpage)
– Write a review of 1/2-1 page (see syllabus)
– Review should include
• Summary of content of paper
• Your observations on the most interesting/important aspects
• Your observations on its relevance today
– Be prepared to discuss on Tuesday in class
Discussion
• Are you interested in taking this course?
• One question about computer science
• One question about computer architecture
CS352H
Fall 2007
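Arch vs. µarch
Levels of abstraction, from top to bottom:
• Specification: compute the fibonacci sequence
• Program:
for(i=2; i<100; i++) {
a[i] = a[i-1]+a[i-2];}
• ISA (Instruction Set Architecture):
load r1, a[i];
add r2, r2, r1;
• microArchitecture [diagram: registers and datapath blocks]
• Logic [diagram: logic gates]
• Transistors [diagram: transistors with gate (G), source (S), and drain (D) terminals]
• Physics/Chemistry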
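As a minimal sketch of the Specification and Program levels, the loop on this slide can be written out in Python. The seed values a[0] = 0 and a[1] = 1 are an assumption, since the slide does not state them, and the function name `fibonacci_table` is only illustrative.

```python
def fibonacci_table(n=100):
    """Fill a[0..n-1] with the slide's recurrence a[i] = a[i-1] + a[i-2]."""
    a = [0] * n
    if n > 1:
        a[1] = 1  # assumed seeds: a[0] = 0, a[1] = 1 (not given on the slide)
    for i in range(2, n):  # same bounds as the slide's loop: i = 2 .. n-1
        a[i] = a[i - 1] + a[i - 2]
    return a


table = fibonacci_table()
print(table[:10])
```

Python integers grow as needed, so the later entries do not overflow here; a C array of fixed-width integers would need a wide enough element type for the same loop.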
CS352H Topics
• Technology Trends
• Instruction set architectures
• Pipelining
• Modern pipelined architectures
– Dynamic ILP machines
– Static ILP machines
• Cache memory systems
• Virtual memory
• Multiprocessors
• Computer system implementation
Making This Class Work For You
• Plus and minus grades
• Clickers
What is Computer Architecture?
[Figure: the Computer Architect in relation to Applications, Technology, Interfaces (API, ISA, Link, I/O Chan), Machine Organization (Regs, IR), and Measurement & Evaluation]
Technology Constraints
• Yearly improvement
– Semiconductor technology
• 60% more devices per chip (doubles every 18 months)
• 15% faster devices (doubles every 5 years)
• Slower wires
– Magnetic Disks
• 60% increase in density
– Circuit boards
• 5% increase in wire density
– Cables
• no change
[Figure: process generations 1000nm, 800nm, 350nm, 250nm, 130nm, 90nm on a timeline marked 1989, 1992, 1995, 1998, 2002, 2006]
>100x more devices since 1989
10x faster devices
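The yearly rates and doubling times on this slide are related by compound growth: a quantity growing by a fraction r per year doubles every ln 2 / ln(1 + r) years. The sketch below checks the slide's figures under that model. The 17-year span (1989 to 2006) is an assumption read off the timeline figure, and the function names are illustrative.

```python
import math


def doubling_time_years(annual_growth):
    """Years needed to double at a fixed compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def cumulative_factor(annual_growth, years):
    """Total growth factor after compounding for the given number of years."""
    return (1 + annual_growth) ** years


YEARS = 2006 - 1989  # assumed span, taken from the timeline figure

print(f"60%/year doubles every {doubling_time_years(0.60) * 12:.1f} months")
print(f"15%/year doubles every {doubling_time_years(0.15):.1f} years")
print(f"device count over {YEARS} years: {cumulative_factor(0.60, YEARS):.0f}x")
print(f"device speed over {YEARS} years: {cumulative_factor(0.15, YEARS):.1f}x")
```

Under these assumptions the doubling times come out close to the slide's 18 months and 5 years, and the compounded speed gain is consistent with the "10x faster devices" summary.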
Changing Technology leads to Changing Architecture
• 1970s
– multi-chip CPUs
– semiconductor memory very expensive
– microcoded control
– complex instruction sets (good code density)
• 1980s
– single-chip CPUs, on-chip RAM feasible
– simple, hard-wired control
– simple instruction sets
– small on-chip caches
• 1990s
– lots of transistors
– complex control to exploit instruction-level parallelism
• 2000s
– even more transistors
– Power wall
– Transition to CMPs
– Multi-level caches
• 2010s
– Embedded vs. Desktop vs. Data center (cloud)
– New storage (PCM, flash)
– Simpler cores and lots of them
– Optimizing for power
Intel 4004 - 1971
• The first microprocessor
• 2,300 transistors
• 108 kHz
• 10µm process
Some Recent Chips!
Intel Pentium IV
• 42 million transistors
• 4GHz
• 0.13µm process
• Could fit ~15,000 4004s on this chip!
NVidia - GeForce 6800
• 222 million transistors
• 400MHz
• 0.13µm process
Intel Itanium II (Montecito)
• 1.7 billion transistors
• 1.6 GHz
• 90nm process
IBM Cell
• 8 vector processors + 1 PPC
• 4 GHz
• 90nm process
Net revenue was around $35 billion a year for most of the …
R&D about $5 billion a year
Any Architecture You Want (as long as it is x86)
[Figure: Processor families in TOP500 supercomputers; number of systems by year]
Application Constraints
• Applications drive machine ‘balance’
– Numerical simulations
• floating-point performance
• main memory bandwidth
– Transaction processing
• I/Os per second
• integer CPU performance
– Decision support
• I/O bandwidth
– Embedded control
• I/O timing, power
– Media processing
• low-precision ‘pixel’ arithmetic
Application-Driven Architectures
• General purpose - good performance on “all” programs
– x86 family, ARM, PowerPC, etc.
• Application specificity can focus on:
– Types of concurrency available
– Domain of deployment (server, handheld, desktop)
• Today - overview of graphics processors
– Interface (instruction set architecture - ISA)
– Processor organization
– Concurrent elements
Apple’s iPad/iPhone4 Powered by A4 Chip
• A4 is modified ARM Cortex run at 1GHz
– Integrated processor, graphics, memory controller
• Among other claims, ARM says the processor gets a near
"25 percent processing power boost, even at same
processor speed, from the use of a new instruction pipelining
system."
– We will cover pipelining in this class.
• Claim: 10 hours of 1024x768 video at 25W
• Let’s look at the Freescale i.MX51
Performance: Latency and Throughput
• Latency: time to complete an operation
• Throughput: work completed per unit time
• Consider plumbing
– Low latency: turn on faucet and water comes out
– High bandwidth: lots of water (e.g., to fill a pool)
• What is “High speed Internet?”
– Low latency: needed for interactive gaming
– High bandwidth: needed for downloading large files
– Marketing departments like to conflate latency and bandwidth…
Relationship between Latency and Throughput
• Latency and bandwidth only loosely coupled
– Henry Ford: assembly lines increase bandwidth without reducing
latency
• My factory takes 1 day to make a Model-T Ford.
– But I can start building a new car every 10 minutes
– At 24 hrs/day, I can make 24 * 6 = 144 cars per day
– A special order for 1 green car still takes 1 day
– Throughput is increased, but latency is not.
• Latency reduction is difficult
• Often, one can buy bandwidth
– E.g., more memory chips, more disks, more computers
– Big server farms (e.g., Google) are high bandwidth
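The Model-T arithmetic can be sketched as a tiny model in which throughput depends only on how often a new car is started, while latency is the time one car spends on the line. The function name `assembly_line` and its parameters are illustrative assumptions, not part of the lecture.

```python
def assembly_line(latency_minutes, start_interval_minutes, minutes_per_day=24 * 60):
    """Return (cars_per_day, latency_minutes) for a line running in steady state.

    Throughput is set by the start interval; the latency of any single car
    is unchanged by how many other cars are in flight.
    """
    cars_per_day = minutes_per_day // start_interval_minutes
    return cars_per_day, latency_minutes


ONE_DAY = 24 * 60  # minutes

cars, latency = assembly_line(latency_minutes=ONE_DAY, start_interval_minutes=10)
print(f"start every 10 min: {cars} cars/day, {latency / 60:.0f} h per car")

# Starting cars twice as often (buying bandwidth) doubles throughput,
# but the special-order green car still takes a full day.
cars, latency = assembly_line(latency_minutes=ONE_DAY, start_interval_minutes=5)
print(f"start every 5 min: {cars} cars/day, {latency / 60:.0f} h per car")
```

The same separation applies to pipelined processors: adding stages or parallel units raises the completion rate without shortening the time any one operation takes.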
What is cloud computing?
• Cloud computing is where dynamically scalable and often
virtualized resources are provided as a service over the
Internet (thanks, Wikipedia!)
• Infrastructure as a service (IaaS)
– Amazon’s EC2 (Elastic Compute Cloud)
• Platform as a service (PaaS)
– Google Gears
– Microsoft Azure
• Software as a service (SaaS)
– Gmail
– Facebook
– Flickr
Thanks, James Hamilton, amazon
Graphics has dedicated chip in PCs
[Diagram: PC block diagram with the following components]
• CPU: 582 Million transistors (Intel “Kentsfield” quad core, QX6700, 65nm, two dies, 8MB L2$)
• Memory Controller Chip (“North Bridge”), attached to Memory
• Graphics Processor: 681 Million transistors (GeForce 8800, 90nm), attached via AGP or PCIe
• Input/Output Glue Chip (“South Bridge”): Disk, Keyboard, PCIe, etc.
GPU/CPU Performance comparison
[Figure: GFLOPS of successive NVIDIA GPUs, with IBM Cell (~200 GFLOPS) and Core 2 Quad 3GHz (96 GFLOPS) marked for comparison]
G80 = GeForce 8800 GTX
G71 = GeForce 7900 GTX
G70 = GeForce 7800 GTX
NV40 = GeForce 6800 Ultra
NV35 = GeForce FX 5950 Ultra
NV30 = GeForce FX 5800
Source: NVIDIA (except CELL and Core2 Quad)
Why a dedicated processing chip?
• 1) Specialization – becoming less important with time
• 2) Parallelism – becoming more important
Graphics processors are the only highly-parallel
processors in every desktop machine.
128 “processors” * 2 FLOPS @ 1.35 GHz
You can program them!
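The product on this slide can be worked out directly. The sketch below assumes that "2 FLOPS" means two floating-point operations per processor per clock cycle, which the slide does not spell out; the function name `peak_gflops` is illustrative. The 96 GFLOPS figure for the Core 2 Quad is taken from the performance comparison slide.

```python
def peak_gflops(units, flops_per_cycle, clock_ghz):
    """Peak rate in GFLOPS: units x operations per cycle x clock rate in GHz."""
    return units * flops_per_cycle * clock_ghz


# Assumption: "2 FLOPS" is read as 2 floating-point operations per cycle.
geforce_8800 = peak_gflops(units=128, flops_per_cycle=2, clock_ghz=1.35)
core2_quad = 96.0  # GFLOPS, from the GPU/CPU comparison slide

print(f"GeForce 8800 peak: {geforce_8800:.1f} GFLOPS")
print(f"ratio to Core 2 Quad: {geforce_8800 / core2_quad:.1f}x")
```

This is a peak figure only; it says nothing about how much of that rate a given program can sustain.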
Graphics requires programmability
Every application does something a bit different.
Example Cg “shader” program (invoked like a “callback”
function):
void normalmapped(float2 normalMapTexCoord : TEXCOORD0,
…
out float4 color : COLOR,
uniform float ambient,
…)
{
float3 normalTex, …;
normalTex = tex2D(normalMap, normalMapTexCoord).xyz;
…
diffuse = saturate(dot(normal, normLightDir));
…
color = Kd * (ambient + diffuse) +
Ks * pow(specular, specularExponent);
}
GeForce 8800
Next Time
• Performance evaluation
• Basic computer organization
• How chips are made
• Start in on instruction set review/overview
• Always check web page for assignments