Computing s Energy Problem: - PDF Free Download

Computing s Energy Problem: (and what we can do about it) Mark Horowitz Stanford University horowitz@ee.stanford.edu 1 of 46

Everything Has A Computer Inside 2of 46

The Reason is Simple: Moore s Law Made Gates Cheap 3of 46

Dennard s Scaling Made Them Fast & Low Energy The triple play: Get more gates, 1/L 2 1/α 2 Gates get faster, CV/i α Energy per switch CV 2 α 3 Dennard, JSSC, pp. 256-268, Oct. 1974 4of 46

Our Expectation Cray-1: world s fastest computer 1976-1982 64Mb memory (50ns cycle time) 40Kb register (6ns cycle time) ~1 million gates (4/5 input NAND) 80MHz clock 115kW In 45nm (30 years later) < 3 mm 2 > 1 GHz ~ 1 W CRAY-1 5of 46

Supporting Evidence http://cpudb.stanford.edu/ 6of 46

Houston, We Have A Problem http://cpudb.stanford.edu/ 7of 46

The Power Limit Watts/mm 2 http://cpudb.stanford.edu/ 8of 46

Clever Power Increased Because We Were Greedy 10x too large http://cpudb.stanford.edu/ 9of 46

This Power Problem Is Not Going Away: P = α C * Vdd 2 * f L 0.6 http://cpudb.stanford.edu/ 10 of 46

Think About It 11 of 46

Technology to the Rescue? 12 of 46

Problems w/ Replacing CMOS Pretty fundamental physics Avoiding this problem will be hard e- Its capability is pretty amazing fj/gate, 10ps delays, 10 9 working devices 13 of 46

Catch - 22 Investment Risk Building Computers = Large $ Very Different = High Risk Capital you need 14 of 46

The Truth About Innovation Start by creating new markets 15 of 46

Our CMOS Future Will see tremendous innovative uses of computation Capability of today s technology is incredible Can add computing and communication for nearly $0 Key questions are what problems need to be solved? Most performance system will be energy limited These systems will be optimized for energy efficiency Power = Energy/Op * Ops/sec 16 of 46

Processor Energy Delay Trade-off http://cpudb.stanford.edu/ 17 of 46

The Rise of Multi-Core Processors http://cpudb.stanford.edu/ 18 of 46

The Stagnation of Multi-Core Processors http://cpudb.stanford.edu/ 19 of 46

Optimizing Parallel Machines (GPUs) Galal et al, Trans on Computers, July 2011, pp 913,922 20 of 46

Have A Shiny Ball, Now What? 21 of 46

Signal Processing ASICs 10000 Dedicated 1000 Energy Efficiency (MOPS/mW) 100 10 1 CPUs CPUs+GPUs GP DSPs ~1000x 0.1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Markovic, EE292 Class, Stanford, 2013 22 of 46

The Push For Specialized Hardware International Solid-State Circuits Conference 1.1: Computing s Energy Problem: (and what we can do about it) 23 of 46

Before Talking About Specialization 24 of 46

Don t Forget Memory System Energy 25 of 46

Processor Energy w/ Corrected Cache Sizes 26 of 46

Processor Energy Breakdown 8 cores L1/reg/TLB L2 L3 27 of 46

Data Center Energy Specs Malladi, ISCA, 2012 28 of 46

29 of 46

What Is Going On Here? 10000 Dedicated 1000 Energy Efficiency (MOPS/mW) 100 10 1 CPUs CPUs+GPUs GP DSPs ~1000x 0.1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 30 of 46

ASIC s Dirty Little Secret All the ASIC applications have absurd locality And work on short integer data Video Frames Inter Prediction Intra Prediction CABAC Entropy Encoder Compressed Bit Stream Integer Motion Estimation Fractional Motion Estimation H.264 90% of Execution time is here Hamid et al, ISCA, 2010 31 of 46

Rough Energy Numbers (45nm) Integer FP Memory Add FAdd Cache (64bit) 8 bit 0.03pJ 16 bit 0.4pJ 8KB 10pJ 32 bit 0.1pJ 32 bit 0.9pJ 32KB 20pJ Mult FMult 1MB 100pJ 8 bit 0.2pJ 16 bit 1pJ DRAM 1.3-2.6nJ 32 bit 3 pj 32 bit 4pJ Instruction Energy Breakdown 25pJ 6pJ Control 70 pj I-Cache Access Register File Access Add 32 of 46

The Truth: It s More About the Algorithm then the Hardware All Algorithms GPU Alg 33 of 46

34 of 46

Highly Local Computation Model 35 of 46

Highly Local Computation Model 36 of 46

Highly Local Computation Model 37 of 46

Highly Local Computation Model 38 of 46

Compose These Cores into a Pipeline Program in space, not time Makes building programmable hardware more difficult 39 of 46

Working on System to Explore This Space Takes high-level program Graph of stencil kernels Maps to hardware level assembly Compute graph of operations for each kernel Currently we map the result to: FPGA, custom ASIC 40 of 46

Enabling Innovation You don t just compile applications to efficiency Need to tweak the application to fit constraints Need to enable application experts to play They know how to cheat and still get good results 41 of 46

Remember This Trade-off? Investment Risk Need to reduce cost to play Building constructors, not instances Capital you need http://genesis2.stanford.edu/ 42 of 46

Not All Systems Are On The Bleeding Edge 43 of 46

App Store For Hardware 44 of 46

Challenge 45 of 46

A New Hope If technology is scaling more slowly We can incorporate current design knowledge into tools To create extensible system constructors If killer products are going to be application driven Application experts need to design them We can leverage the 1 st bullet to enable the 2 nd To usher in a new wave of innovative computing products 46 of 46