Computing's Energy Problem:

Computing's Energy Problem: (and what we can do about it) Mark Horowitz Stanford University horowitz@ee.stanford.edu 1 of 46

Everything Has A Computer Inside 2 of 46

The Reason is Simple: Moore's Law Made Gates Cheap 3 of 46

Dennard's Scaling Made Them Fast & Low Energy. The triple play: Get more gates, $1/L^2$ scales as $1/\alpha^2$; Gates get faster, $CV/i$ scales as $\alpha$; Energy per switch, $CV^2$ scales as $\alpha^3$. Dennard, JSSC, pp. 256-268, Oct. 1974 4 of 46
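
The three scaling terms on this slide can be checked with a few lines of arithmetic. The sketch below assumes that α is the linear shrink factor (feature size L becomes αL, with 0 < α ≤ 1), which is how the slide's terms read; the function name `dennard_scale` and the derived `power_density` entry are illustrative additions, not part of the talk. Under that assumption, energy per switch times switching rate times gate density comes out unchanged.

```python
def dennard_scale(alpha):
    """Relative changes under ideal Dennard scaling.

    Assumption: alpha is the linear shrink factor (L -> alpha * L).
    All returned values are ratios relative to the unscaled technology.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    gate_density = 1 / alpha**2      # gates per area, from 1/L^2
    gate_delay = alpha               # CV/i
    switch_energy = alpha**3         # CV^2
    # Power per area = energy per switch * switching rate * gates per area.
    power_density = switch_energy * (1 / gate_delay) * gate_density
    return {
        "gate_density": gate_density,
        "gate_delay": gate_delay,
        "switch_energy": switch_energy,
        "power_density": power_density,
    }


if __name__ == "__main__":
    for name, value in dennard_scale(0.5).items():
        print(f"{name}: {value}")
```

With α = 0.5 the sketch gives four times the gates, half the delay, and one eighth the switching energy, while the power per unit area stays at its original value.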

Our Expectation. Cray-1: world's fastest computer, 1976-1982; 64Mb memory (50ns cycle time); 40Kb register (6ns cycle time); ~1 million gates (4/5 input NAND); 80MHz clock; 115kW. In 45nm (30 years later): < 3 mm^2, > 1 GHz, ~ 1 W 5 of 46

Supporting Evidence http://cpudb.stanford.edu/ 6 of 46

Houston, We Have A Problem http://cpudb.stanford.edu/ 7 of 46

The Power Limit: Watts/mm^2 http://cpudb.stanford.edu/ 8 of 46

Clever Power Increased Because We Were Greedy: 10x too large http://cpudb.stanford.edu/ 9 of 46

This Power Problem Is Not Going Away: $P = \alpha C V_{dd}^2 f$ (plot annotation: $L^{0.6}$) http://cpudb.stanford.edu/ 10 of 46

Think About It 11 of 46

Technology to the Rescue? 12 of 46

Problems w/ Replacing CMOS: Pretty fundamental physics; avoiding this problem will be hard. Its capability is pretty amazing: fJ/gate, 10ps delays, 10^9 working devices 13 of 46

Catch-22: Building Computers = Large $; Very Different = High Risk (figure axes: Investment Risk, Capital you need) 14 of 46

The Truth About Innovation Start by creating new markets 15 of 46

Our CMOS Future: Will see tremendous innovative uses of computation. Capability of today's technology is incredible; can add computing and communication for nearly $0. Key question: what problems need to be solved? Most performance systems will be energy limited. These systems will be optimized for energy efficiency: Power = Energy/Op * Ops/sec 16 of 46
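
The relation Power = Energy/Op * Ops/sec can be turned around to show why an energy-limited system must lower energy per operation to gain throughput. The sketch below is only a worked illustration; the function names and the numbers in the example (a 1 W budget and 100 pJ per operation) are hypothetical and do not come from the talk.

```python
def power_watts(energy_per_op_joules, ops_per_sec):
    """Power = Energy/Op * Ops/sec."""
    return energy_per_op_joules * ops_per_sec


def max_ops_per_sec(power_budget_watts, energy_per_op_joules):
    """Throughput ceiling for a fixed power budget."""
    return power_budget_watts / energy_per_op_joules


if __name__ == "__main__":
    budget_w = 1.0            # hypothetical power budget
    energy_j = 100e-12        # hypothetical 100 pJ per operation
    print(f"{max_ops_per_sec(budget_w, energy_j):.3g} ops/sec at {budget_w} W")
    # Halving energy per op doubles the throughput available at the same power.
    print(f"{max_ops_per_sec(budget_w, energy_j / 2):.3g} ops/sec at {budget_w} W")
```

At a fixed budget, throughput scales inversely with energy per operation, which is the sense in which energy efficiency becomes the optimization target.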

Processor Energy Delay Trade-off http://cpudb.stanford.edu/ 17 of 46

The Rise of Multi-Core Processors http://cpudb.stanford.edu/ 18 of 46

The Stagnation of Multi-Core Processors http://cpudb.stanford.edu/ 19 of 46

Optimizing Parallel Machines (GPUs) Galal et al., Trans. on Computers, July 2011, pp. 913-922 20 of 46

Have A Shiny Ball, Now What? 21 of 46

Signal Processing ASICs [Figure: Energy Efficiency (MOPS/mW), log scale from 0.1 to 10000, for chips 1-20 grouped as CPUs, CPUs+GPUs, GP DSPs, and Dedicated; gap of ~1000x] Markovic, EE292 Class, Stanford, 2013 22 of 46

The Push For Specialized Hardware. International Solid-State Circuits Conference 1.1: Computing's Energy Problem: (and what we can do about it) 23 of 46

Before Talking About Specialization 24 of 46

Don't Forget Memory System Energy 25 of 46

Processor Energy w/ Corrected Cache Sizes 26 of 46

Processor Energy Breakdown (figure labels: 8 cores, L1/reg/TLB, L2, L3) 27 of 46

Data Center Energy Specs Malladi, ISCA, 2012 28 of 46

29 of 46

What Is Going On Here? [Figure: Energy Efficiency (MOPS/mW), log scale from 0.1 to 10000, for chips 1-20 grouped as CPUs, CPUs+GPUs, GP DSPs, and Dedicated; gap of ~1000x] 30 of 46

ASIC's Dirty Little Secret: All the ASIC applications have absurd locality, and work on short integer data. [Figure: H.264 encoder blocks: Video Frames, Inter Prediction, Intra Prediction, Integer Motion Estimation, Fractional Motion Estimation, CABAC Entropy Encoder, Compressed Bit Stream; annotation: 90% of execution time is here] Hameed et al., ISCA, 2010 31 of 46

Rough Energy Numbers (45nm)

Integer              FP                   Memory
Add                  FAdd                 Cache (64bit)
  8 bit   0.03pJ       16 bit  0.4pJ        8KB   10pJ
  32 bit  0.1pJ        32 bit  0.9pJ        32KB  20pJ
Mult                 FMult                  1MB   100pJ
  8 bit   0.2pJ        16 bit  1pJ        DRAM    1.3-2.6nJ
  32 bit  3pJ          32 bit  4pJ

Instruction Energy Breakdown (figure, 70pJ): I-Cache Access 25pJ, Register File Access 6pJ, Control, Add

32 of 46
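
The point of these numbers is easier to see as ratios between memory accesses and arithmetic. The sketch below copies the slide's 45nm values into a dictionary and divides them; the key names and the `ratio` helper are illustrative choices, and the DRAM entries simply take the two ends of the quoted 1.3-2.6nJ range.

```python
# Energy per operation in picojoules, copied from the slide's 45nm table.
ENERGY_PJ = {
    "int_add_8": 0.03, "int_add_32": 0.1,
    "int_mult_8": 0.2, "int_mult_32": 3.0,
    "fp_add_16": 0.4, "fp_add_32": 0.9,
    "fp_mult_16": 1.0, "fp_mult_32": 4.0,
    "cache_8KB": 10.0, "cache_32KB": 20.0, "cache_1MB": 100.0,
    "dram_low": 1300.0, "dram_high": 2600.0,
}


def ratio(costly, cheap):
    """How many `cheap` operations fit in the energy of one `costly` one."""
    return ENERGY_PJ[costly] / ENERGY_PJ[cheap]


if __name__ == "__main__":
    for mem in ("cache_8KB", "cache_1MB", "dram_low"):
        print(f"{mem} vs 32-bit add: {ratio(mem, 'int_add_32'):.0f}x")
    print(f"32-bit vs 8-bit add: {ratio('int_add_32', 'int_add_8'):.1f}x")
```

Even the smallest cache access in the table costs about a hundred 32-bit integer adds, and a DRAM access costs on the order of ten thousand.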

The Truth: It's More About the Algorithm than the Hardware (figure labels: All Algorithms, GPU Alg) 33 of 46

34 of 46

Highly Local Computation Model 35 of 46

Highly Local Computation Model 36 of 46

Highly Local Computation Model 37 of 46

Highly Local Computation Model 38 of 46

Compose These Cores into a Pipeline: Program in space, not time. Makes building programmable hardware more difficult 39 of 46

Working on System to Explore This Space: Takes high-level program: graph of stencil kernels. Maps to hardware-level assembly: compute graph of operations for each kernel. Currently we map the result to: FPGA, custom ASIC 40 of 46
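
To make "graph of stencil kernels" concrete, here is a minimal sketch of what a stencil kernel and a two-stage pipeline of them might look like in software. It is a hypothetical illustration, not the system described in the talk: `stencil_1d`, `pipeline`, and the two weight lists are invented for this example. Each output element reads only a small fixed window of neighboring inputs, which is the locality that a hardware pipeline can exploit.

```python
def stencil_1d(signal, weights):
    """Sliding-window weighted sum: each output reads len(weights) neighbors."""
    k = len(weights)
    return [
        sum(w * signal[i + j] for j, w in enumerate(weights))
        for i in range(len(signal) - k + 1)
    ]


def pipeline(signal, stages):
    """Chain stencil kernels so each stage consumes the previous stage's output."""
    for weights in stages:
        signal = stencil_1d(signal, weights)
    return signal


if __name__ == "__main__":
    blur = [1, 2, 1]      # hypothetical smoothing kernel
    diff = [-1, 0, 1]     # hypothetical edge-detection kernel
    print(pipeline([0, 0, 4, 4, 4, 0, 0], [blur, diff]))
```

In this software version the stages run one after another in time; laying the same chain out as adjacent hardware blocks, each with a small local buffer, corresponds to programming in space.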

Enabling Innovation: You don't just compile applications to efficiency. Need to tweak the application to fit constraints. Need to enable application experts to play: they know how to cheat and still get good results 41 of 46

Remember This Trade-off? Need to reduce cost to play. Building constructors, not instances (figure axes: Investment Risk, Capital you need) http://genesis2.stanford.edu/ 42 of 46

Not All Systems Are On The Bleeding Edge 43 of 46

App Store For Hardware 44 of 46

Challenge 45 of 46

A New Hope: If technology is scaling more slowly, we can incorporate current design knowledge into tools to create extensible system constructors. If killer products are going to be application driven, application experts need to design them. We can leverage the 1st bullet to enable the 2nd, to usher in a new wave of innovative computing products 46 of 46