Lecture 2 - Performance

Integrated Circuits Costs


 

Real World Examples (1994)

Chip         Metal   Line width  Wafer  Defects  Area   Dies/  Yield  Die
             layers  (microns)   cost   /cm2     (mm2)  wafer         cost
386DX        2       0.90        $900   1.0      43     360    71%    $4
486DX2       3       0.80        $1200  1.0      81     181    54%    $12
PowerPC 601  4       0.80        $1700  1.3      121    115    28%    $53
HP PA 7100   3       0.80        $1300  1.0      196    66     27%    $73
DEC Alpha    3       0.70        $1500  1.2      234    53     19%    $149
SuperSPARC   3       0.70        $1700  1.6      256    48     13%    $272
Pentium      3       0.80        $1500  1.5      296    40     9%     $417
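
The die cost column can be reproduced from the other columns. The sketch below assumes the usual relation, die cost = wafer cost / (dies per wafer * die yield); that formula is not stated in the surviving notes, so treat it as an assumption that happens to match every row after rounding. The name die_cost is illustrative.

```python
# Rows from the 1994 table: (chip, wafer cost in $, dies per wafer, die yield).
CHIPS = [
    ("386DX",        900, 360, 0.71),
    ("486DX2",      1200, 181, 0.54),
    ("PowerPC 601", 1700, 115, 0.28),
    ("HP PA 7100",  1300,  66, 0.27),
    ("DEC Alpha",   1500,  53, 0.19),
    ("SuperSPARC",  1700,  48, 0.13),
    ("Pentium",     1500,  40, 0.09),
]


def die_cost(wafer_cost, dies_per_wafer, die_yield):
    """Cost per good die: the wafer cost spread over the dies that work.

    Assumed relation; it is not spelled out in the notes themselves.
    """
    return wafer_cost / (dies_per_wafer * die_yield)


for chip, wafer_cost, dies, die_yield in CHIPS:
    print(f"{chip:12s} ${die_cost(wafer_cost, dies, die_yield):7.2f}")
```

Note how yield dominates: the Pentium wafer costs less than the PowerPC 601 wafer, but with only 9% of 40 dies usable, each good die costs far more.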

Other Costs

IC cost = ( Die cost + Testing cost + Packaging cost ) / Final test yield

Packaging Cost: depends on pins, heat dissipation, etc...

 
 
Chip         Die    Package  Package  Package  Test &    Total
             cost   pins     type     cost     assembly
386DX        $4     132      QFP      $1       $4        $9
486DX2       $12    168      PGA      $11      $12       $35
PowerPC 601  $53    304      QFP      $3       $21       $77
HP PA 7100   $73    504      PGA      $35      $16       $124
DEC Alpha    $149   431      PGA      $30      $23       $202
SuperSPARC   $272   293      PGA      $20      $34       $326
Pentium      $417   273      PGA      $19      $37       $473
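
The IC cost formula can be checked against the table with a short sketch. Every total in the table equals the plain sum of die, package, and test and assembly costs, which suggests the final test yield is either taken as 1.0 or already folded into the test and assembly column; that reading is an assumption, as is the function name ic_cost.

```python
def ic_cost(die_cost, testing_cost, packaging_cost, final_test_yield=1.0):
    """IC cost = (die + testing + packaging) / final test yield."""
    return (die_cost + testing_cost + packaging_cost) / final_test_yield


# Rows from the table: (chip, die cost, package cost, test & assembly, total).
ROWS = [
    ("386DX",         4,  1,  4,   9),
    ("486DX2",       12, 11, 12,  35),
    ("PowerPC 601",  53,  3, 21,  77),
    ("HP PA 7100",   73, 35, 16, 124),
    ("DEC Alpha",   149, 30, 23, 202),
    ("SuperSPARC",  272, 20, 34, 326),
    ("Pentium",     417, 19, 37, 473),
]

for chip, die, package, test, total in ROWS:
    computed = ic_cost(die, test, package)
    print(f"{chip:12s} computed ${computed:.0f}, table ${total}")

# A hypothetical 90% final test yield would raise the 386DX cost
# from $9 to about $10, since failed parts still consume the other costs.
print(ic_cost(4, 4, 1, final_test_yield=0.9))
```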

CMOS improvements

Die size doubles every 3 years; Line widths halve every 7 years
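
To compare these two trends on a common scale, each can be converted to an equivalent per-year factor. This is a minimal sketch that assumes smooth exponential change between the quoted points; the helper annual_factor is an illustrative name.

```python
def annual_factor(total_factor, years):
    """Per-year multiplier that compounds to total_factor over `years`."""
    return total_factor ** (1.0 / years)


die_growth = annual_factor(2.0, 3)    # die size doubles every 3 years
line_shrink = annual_factor(0.5, 7)   # line width halves every 7 years

print(f"die size:   x{die_growth:.3f} per year")   # roughly 26% growth
print(f"line width: x{line_shrink:.3f} per year")  # roughly 9% shrink
```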

Technology Trends


 

Processor Performance

The bottom line: Performance (and cost)


 

Metrics of performance

Relating Processor Metrics

CPU execution time = (CPU clock cycles for pgm) * (clock cycle time)

or CPU execution time = (CPU clock cycles for pgm) / (clock rate)

CPU clock cycles for pgm = (Instructions for pgm) * (avg. clock cycles per instr.)

or CPI = (CPU clock cycles for pgm) / (Instructions for pgm)

CPI tells us something about the Instruction Set Architecture, the Implementation of that architecture, and the program measured
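
The relations above combine into execution time = instructions * CPI / clock rate. A minimal sketch with made-up numbers (the instruction count, CPI, and clock rate below are hypothetical, chosen only to make the arithmetic easy to follow):

```python
def cpu_clock_cycles(instruction_count, cpi):
    """CPU clock cycles = instructions * average clock cycles per instruction."""
    return instruction_count * cpi


def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU execution time in seconds = clock cycles / clock rate."""
    return cpu_clock_cycles(instruction_count, cpi) / clock_rate_hz


def average_cpi(clock_cycles, instruction_count):
    """CPI = clock cycles / instructions."""
    return clock_cycles / instruction_count


# Hypothetical program: 10 million instructions, CPI of 2.0, 100 MHz clock.
instructions = 10_000_000
cycles = cpu_clock_cycles(instructions, 2.0)
seconds = cpu_time(instructions, 2.0, 100e6)

print(f"{cycles:.0f} cycles, {seconds:.3f} s, CPI {average_cpi(cycles, instructions)}")
```

Here 20 million cycles at 100 MHz give 0.2 seconds. Halving the CPI or doubling the clock rate would each halve the time, as long as the other two factors stay fixed.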
 

Aspects of CPU Performance


 

Example


 

Marketing Metrics

MIPS = Instruction Count / (Time * 10^6)
= Clock Rate / (CPI * 10^6)

What About:

Generally MIPS is not correlated with performance
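
A small sketch shows why a higher MIPS rating need not mean a faster program. The two compiled versions below are hypothetical: version B executes fewer but slower instructions on the same 100 MHz machine, so it finishes sooner while posting a lower MIPS rating.

```python
def exec_time(instruction_count, cpi, clock_rate_hz):
    """Execution time in seconds = instructions * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


def mips_from_time(instruction_count, seconds):
    """MIPS = instruction count / (time * 10^6)."""
    return instruction_count / (seconds * 1e6)


def mips_from_cpi(clock_rate_hz, cpi):
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


CLOCK = 100e6  # hypothetical 100 MHz machine

# Version A: many simple instructions. Version B: fewer, costlier ones.
time_a = exec_time(10_000_000, 1.0, CLOCK)
time_b = exec_time(5_000_000, 1.5, CLOCK)
mips_a = mips_from_time(10_000_000, time_a)
mips_b = mips_from_time(5_000_000, time_b)

print(f"A: {time_a:.3f} s at {mips_a:.1f} MIPS")
print(f"B: {time_b:.3f} s at {mips_b:.1f} MIPS")
```

Both forms of the MIPS formula agree for each version, yet B is the faster program with the lower rating, because MIPS ignores how much work each instruction does.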

MFLOP/S = Floating point Operations / (Time * 10^6)

This is very machine dependent and FP operations are often not where time is spent
 

Benchmarks

Programs used to Evaluate Processor Performance

Benchmarks should represent a large class of important programs
Improving benchmark performance should help many programs

Types

Toy Benchmarks

simple 10-100 line programs
e.g., sieve, puzzle, quicksort

Synthetic Benchmarks

Programs which attempt to match average frequencies of real workloads
e.g., Whetstone, Dhrystone

Kernels

Time critical excerpts of real programs
e.g., Livermore loops

Real programs

e.g., gcc, spice

A Successful Benchmark: SPEC
 

SPEC

In 1988 six companies banded together to form the Systems Performance Evaluation Committee (SPEC), among them Sun, MIPS, HP, Apollo, and DEC.

SPEC uses a standard list of programs, some of them real programs, including OS calls and some I/O.

SPEC first round - 1989

10 programs, single number to summarize performance

One program spent 99% of its time in a single line of code, so a new compiler front end could improve its result dramatically. Compare two results on the same machine using different compilers.

Second round: SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs)

Compiler flags were unlimited. Flags from the March 1993 results for the DEC 4000 Model 610:

spice: unix.c:/def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy(b,a,c)"
wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas

Add SPECbase: one flag setting for integer programs & one for FP
 

Third round: SPECint95 and SPECfp95, a new set of programs

"benchmarks useful for 3 years"
 

How to Summarise Performance

Arithmetic mean (or weighted arithmetic mean) of times: SUM(Ti)/n or SUM(Wi*Ti)

Harmonic mean (or weighted harmonic mean) of rates (e.g., MFLOPS):
n/SUM(1/Ri) or 1/SUM(Wi/Ri), with the weights Wi summing to 1
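
The reason rates call for the harmonic mean can be seen in a short sketch. The workload is hypothetical: two programs with the same number of floating point operations, run at different MFLOPS rates. The harmonic mean of the rates matches the overall rate (total operations / total time), while the arithmetic mean of the rates overstates it.

```python
def arithmetic_mean(values):
    """SUM(Ti) / n."""
    return sum(values) / len(values)


def weighted_arithmetic_mean(values, weights):
    """SUM(Wi * Ti); the weights are assumed to sum to 1."""
    return sum(w * v for w, v in zip(weights, values))


def harmonic_mean(rates):
    """n / SUM(1 / Ri)."""
    return len(rates) / sum(1.0 / r for r in rates)


# Hypothetical workload: 100 million FP operations per program.
mflop = [100.0, 100.0]   # millions of operations
rates = [10.0, 40.0]     # MFLOPS achieved on each program
times = [ops / rate for ops, rate in zip(mflop, rates)]  # seconds

overall_rate = sum(mflop) / sum(times)

print(f"mean time:            {arithmetic_mean(times):.2f} s")
print(f"overall rate:         {overall_rate:.2f} MFLOPS")
print(f"harmonic mean rate:   {harmonic_mean(rates):.2f} MFLOPS")
print(f"arithmetic mean rate: {arithmetic_mean(rates):.2f} MFLOPS")
```

The agreement between the harmonic mean and the overall rate holds here because both programs do the same amount of work; with unequal work, the weighted forms are the ones to use.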

Normalized execution time is handy for scaling performance
(e.g., time on reference machine ÷ time on measured machine)

But do not take the arithmetic mean of normalized execution times; use the geometric mean, (PROD(Ri))^(1/n), because the comparison it gives between machines does not depend on which machine is the reference.

Unfortunately, the geometric mean rewards all improvements equally:
program A going from 2 seconds to 1 second counts as much as
program B going from 2000 seconds to 1000 seconds
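
Both properties of the geometric mean can be seen in a small sketch. The execution times are hypothetical, and normalized is an illustrative helper that follows the definition above (time on reference machine / time on measured machine). math.prod requires Python 3.8 or later.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    """(PROD(Ri)) ** (1/n)."""
    return math.prod(values) ** (1.0 / len(values))


def normalized(reference, measured):
    """Per-program ratio: time on reference / time on measured machine."""
    return [r / m for r, m in zip(reference, measured)]


# Hypothetical execution times in seconds for two programs.
times_a = [1.0, 1000.0]   # machine A
times_b = [10.0, 100.0]   # machine B

b_vs_a = normalized(times_a, times_b)   # A is the reference
a_vs_b = normalized(times_b, times_a)   # B is the reference

# Arithmetic mean: each machine appears about 5x better than the other,
# depending only on which one was picked as the reference.
print(arithmetic_mean(b_vs_a), arithmetic_mean(a_vs_b))

# Geometric mean: the two results are reciprocals, so the verdict
# (here, a tie) is the same whichever machine is the reference.
print(geometric_mean(b_vs_a), geometric_mean(a_vs_b))

# Equal reward: halving a 2 s program and halving a 2000 s program
# contribute the same ratio, even though one saves 1000x more time.
print(normalized([2.0], [1.0]), normalized([2000.0], [1000.0]))
```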

Amdahl's Law

Speedup due to enhancement E:
             ExTime w/o E     Performance w/ E 
Speedup(E) = ------------  =  ----------------- 
             ExTime w/ E      Performance w/o E

Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected. Then:

ExTime(with E) = ((1-F) + F/S) * ExTime(without E)

Speedup(with E) = ExTime(without E) / (((1-F) + F/S) * ExTime(without E))
                = 1 / ((1-F) + F/S)
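
Since ExTime(without E) cancels, the speedup depends only on F and S. A minimal sketch with illustrative values (the function name amdahl_speedup and the sample numbers are not from the lecture):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Speeding up half of the task by 2x gives well under 2x overall.
print(f"F=0.5, S=2:    {amdahl_speedup(0.5, 2):.3f}")

# Speeding up 10% of the task, however much, is capped near 1 / (1 - F).
print(f"F=0.1, S=2:    {amdahl_speedup(0.1, 2):.3f}")
print(f"F=0.1, S=1000: {amdahl_speedup(0.1, 1000):.3f}")
print(f"limit for F=0.1: {1.0 / (1.0 - 0.1):.3f}")
```

The unaffected fraction (1-F) sets a ceiling on the overall speedup, which is why enhancements pay off most when they target where the time is actually spent.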