A Single Cycle Processor. ECE4680 Computer Organization and Architecture. Designing a Pipeline Processor. Overview of a Multiple Cycle Implementation

Similar documents
Outline Single Cycle Processor Design Multi cycle Processor. Pipelined Processor: Hazards and Removal. Instruction Pipeline. Time

ELE 455/555 Computer System Engineering. Section 2 The Processor Class 4 Pipelining with Hazards

Surfing Interconnect

Pipelined Implementation

Lecture 7: Pipelined Implementation. James C. Hoe Department of ECE Carnegie Mellon University

MICROPROCESSOR ARCHITECTURE

VLSI Design I; A. Milenkovic 1

CS650 Computer Architecture. Lecture 5-1 Branch Hazards

CPE/EE 427, CPE 527 VLSI Design I L21: Sequential Circuits. Review: The Regenerative Property

82C288 BUS CONTROLLER FOR PROCESSORS (82C C C288-8)

Instruction Cache Compression for Embedded Systems by Yujia Jin and Rong Chen

Design of AMBA APB Protocol

VLSI Design 14. Memories

Lecture Topics. Overview ECE 486/586. Computer Architecture. Lecture # 9. Processor Organization. Basic Processor Hardware Pipelining

A 64 Bit Pipeline Based Decimal Adder Using a New High Speed BCD Adder

Data Hazards. Result does not feed back around in time for next operation Pipelining has changed behavior of system. Comb. logic A. R e g.

FLUID POWER FLUID POWER EQUIPMENT TUTORIAL OTHER FLUID POWER VALVES. This work covers part of outcome 2 of the Edexcel standard module:

PRELAB: COLLISIONS IN TWO DIMENSIONS

ECE 757. Homework #2

THE PROBLEM OF ON-LINE TESTING METHODS IN APPROXIMATE DATA PROCESSING

Advanced Hydraulics Prof. Dr. Suresh A. Kartha Department of Civil Engineering Indian Institute of Technology, Guwahati

Fast Software-managed Code Decompression

GOLOMB Compression Technique For FPGA Configuration

CS3350B Computer Architecture. Lecture 6.2: Instructional Level Parallelism: Hazards and Resolutions

Lab 1c Isentropic Blow-down Process and Discharge Coefficient

Profile-driven Selective Code Compression

Multimodal Approach to Planning & Implementation of Transit Signal Priority within Montgomery County Maryland

Introduction to Pneumatics

Design of Low Power and High Speed 4-Bit Ripple Carry Adder Using GDI Multiplexer

Simulation with IBIS in Tight Timing Budget Systems

Revision 2013 Vacuum Technology 1-3 day Good Vacuum Practice 1 Day Course Outline

Stand-Alone Bubble Detection System

Reduction of Bitstream Transfer Time in FPGA

EB251. Motorola Semiconductor Engineering Bulletin. How to Calculate Instruction Times on the MC68HC16. Freescale Semiconductor, I.

Layout Design II. Lecture Fall 2003

ADVANCED TRANSIT OPERATIONS AND MODELING

A 28nm SoC with a 1.2GHz 568nJ/ Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications

CBC2 performance with switched capacitor DC-DC converter. systems meeting, 12/2/14

Master Level Rules Review

IA-64: Advanced Loads Speculative Loads Software Pipelining

VLSI Design 8. Design of Adders

Computing s Energy Problem:

Special Provisions for Left Turns at Signalized Intersections to Increase Capacity and Safety

Designing of Low Power and Efficient 4-Bit Ripple Carry Adder Using GDI Multiplexer

Distributed Control Systems

Comparing ALES to F5J This section discusses key differences between ALES and F5J rules

HW #5: Digital Logic and Flip Flops

Policy Gradient RL to learn fast walk

MASBATE PRODUCTION OVERVIEW (i)

Risk Recon Overview. Risk Recon Overview Prepared by: Lisa Graf and Mike Olsem October 28, 2010

CS Lecture 5. Vidroha debroy. Material adapted courtesy of Prof. Xiangnan Kong and Prof. Carolina Ruiz at Worcester Polytechnic Institute

What we have learned over the years Richard Williams I Sales Manager

ECE520 VLSI Design. Lecture 9: Design Rules. Payman Zarkesh-Ha

Stack Height Analysis for FinFET Logic and Circuit

Walking with coffee: when and why coffee spills

Reducing Code Size with Run-time Decompression

Design and Simulation of a Pipelined Decompression Architecture for Embedded Systems

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

Bike Routes Assessment: 95 Ave, 106 St & 40 Ave. Replace with appropriate image in View > Master.

Lecture 1 Temporal constraints: source and characterization

Simple Mathematical, Dynamical Stochastic Models Capturing the Observed Diversity of the El Niño Southern Oscillation (ENSO) Andrew J.

Discovering Special Triangles Learning Task

THE CANDU 9 DISTRffiUTED CONTROL SYSTEM DESIGN PROCESS

COMP219: Artificial Intelligence. Lecture 8: Combining Search Strategies and Speeding Up

Overview. Depth Limited Search. Depth Limited Search. COMP219: Artificial Intelligence. Lecture 8: Combining Search Strategies and Speeding Up

Time-Delay Electropneumatic Applications

LEVEL 1 SKILL DEVELOPMENT MANUAL

Standard League WRO Football Simple Simon EV3 Training Curriculum

Bowhead Whales: An Introduction to our Whales Unit

Application Note PowerPC 750GX Lockstep Facility

Paper Presentation: COMPLETE AND SIMULTANEOUS DCS FAILURE IN TWO 500MW UNITS

Efficient Placement of Compressed Code for Parallel Decompression

Module 3 Developing Timing Plans for Efficient Intersection Operations During Moderate Traffic Volume Conditions

Functional safety. Functional safety of Programmable systems, devices & components: Requirements from global & national standards

Dynamic Analysis of a Multi-Stage Compressor Train. Augusto Garcia-Hernandez, Jeffrey A. Bennett, and Dr. Klaus Brun

A Study on Algorithm for Compression and Decompression of Embedded Codes using Xilinx

Statistical Analysis of PGA Tour Skill Rankings USGA Research and Test Center June 1, 2007

HANDBOOK. Squeeze Off Unit SOU 250. Please refer any queries to:

Synchronous Sequential Logic. Topics. Sequential Circuits. Chapter 5 Steve Oldridge Dr. Sidney Fels. Sequential Circuits

Physical Design of CMOS Integrated Circuits

ATM 322 Basic Pneumatics H.W.6 Modules 5 7

Technology issues regarding Blends of Refrigerants Buffalo Research Laboratory

Machine Learning an American Pastime

An Efficient Code Compression Technique using Application-Aware Bitmask and Dictionary Selection Methods

FHWA Safety Performance for Intersection Control Evaluation (SPICE) Tool

Tank Continuous Liquid Level Detector

CS 4649/7649 Robot Intelligence: Planning

System Operating Limit Definition and Exceedance Clarification

ENHANCED PARKWAY STUDY: PHASE 2 CONTINUOUS FLOW INTERSECTIONS. Final Report

Tenth USA/Europe Air Traffic Management Research and Development Seminar

An Overview of Types of Isolation Valves

The Kanban Guide for Scrum Teams

TAKING IT TO THE EXTREME The WAGO-I/O-SYSTEM for extreme Applications

RRS 2009-'12: New Section C

An Architecture for Combined Test Data Compression and Abort-on-Fail Test

Uninformed Search (Ch )

Optimizing Bus Rapid Transit, Downtown to Oakland

CHM Gas Chromatography: Our Instrument Charles Taylor (r10)

G53CLP Constraint Logic Programming

Distributed Systems [Fall 2013]

Transcription:

ECE468 Computer Organization and Architecture Designing a Pipeline Processor A Single Cycle Processor <3:26> Jump nstruction<3:> op Main nstruction : Fetch op 3 mm6 <5:> func 5 5 5 ctr 3 m Rw busw -bit Registers En Adr n imm6 nstr<5:> 6 mory <6:2> <2:25> Extender <:5> <:5> ECE468 Pipeline. 22-4-3 ECE468 Pipeline.2 22-4-3 Drawbacks of this Single Cycle Processor Long cycle time: Cycle time must be long enough for the load instruction: - s -to-q + - nstruction mory Access Time + - Register File Access Time + - Delay (address calculation) + - mory Access Time + - Register File Setup Time + - Skew Cycle time is much longer than needed for all other instructions. Examples: instructions do not require data memory access Jump does not require operation nor data memory access Overview of a Multiple Cycle mplementation The root of the single cycle processor s problems: The cycle time has to be long enough for the slowest instruction Solution: Break the instruction into smaller steps ute each step (instead of the entire instruction) in one cycle - Cycle time: time it takes to execute the longest step - Keep all the steps to have similar length This is the essence of the multiple cycle processor The advantages of the multiple cycle processor: Cycle time is much shorter fferent instructions take different number of cycles to complete - Load takes five cycles - Jump only takes three cycles Allows a functional unit to be used more than once per instruction ECE468 Pipeline.3 22-4-3 Why will this hinder ECE468 Pipeline.4 the pipeline? 22-4-3 Multiple Cycle Processor MCP: A functional unit to be used more than once per instruction Cond Src Br ord m R SelA Target RAdr 5 deal 5 Reg File mory 4 Adr Rw n ut busw 2 3 << 2 nstruction Reg Outline of Today s Lecture--- Pipelining ntroduction to the Concept of Pipelined Processor Pipelined path and Pipelined How to Avoid ce Condition in a Pipeline Design? Pipeline Example: nstructions nteraction mm Extend 6 SelB Op ECE468 Pipeline.5 22-4-3 ECE468 Pipeline.6 22-4-3

Timing agram of a Load nstruction,,, Op, Func ctr nstruction Fetch nstr Decode / Reg. Fetch -to-q nstruction mory Access Time Delay through Logic mory 2 Register File Access Time Delay through Extender & 3 Delay Reg mory Access Time busw New ECE468 Pipeline.7 22-4-3 Register File ite Time The Five Stages of Load Cycle Cycle 2 Cycle 3 Cycle 4 Cycle 5 Load fetch Reg/Dec m fetch: nstruction Fetch Fetch the instruction from the nstruction mory Reg/Dec: Registers Fetch and nstruction Decode : Calculate the memory address m: Read the data from the mory : ite the data back to the register file ECE468 Pipeline.8 22-4-3 Key deas Behind Pipelining Pipelining the Load nstruction Grading the mid term exams: 5 problems, five people grading the exam Each person ONLY grades one problem Pass the exam to the next person as soon as one finishes his part Assume each problem takes.5 hour to grade - Each individual exam still takes 2.5 hours to grade - But with 5 people, all exams can be graded much quicker The load instruction has 5 stages: Five independent functional units to work on each stage - Each functional unit is used only once The 2nd load can start as soon as the st finishes its fet stage Each load still takes five cycles to complete The throughput, however, is much higher ECE468 Pipeline.9 22-4-3 Cycle Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 st lw fetch Reg/Dec m 2nd lw fetch Reg/Dec m 3rd lw fetch Reg/Dec m The five independent functional units in the pipeline datapath are: nstruction mory for the fetch stage Register File s Read ports (bus A and ) for the Reg/Dec stage for the stage Regiester file is used 2 times but no problem! mory for the m stage Register File s ite port (bus W) for the stage One instruction enters the pipeline every cycle One instruction comes out of the pipeline (complete) every cycle Comparison with m-cycle processor: 5x3 cycles versus 7 cycles The Effective Cycles per nstruction (CP) is ECE468 Pipeline. 22-4-3 The Four Stages of Cycle Cycle 2 Cycle 3 Cycle 4 fetch Reg/Dec fetch: nstruction Fetch Fetch the instruction from the nstruction mory Reg/Dec: Registers Fetch and nstruction Decode : operates on the two register operands : ite the output back to the register file Pipelining the and Load nstruction Cycle Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 fetch Reg/Dec Ops! We have a problem! fetch Reg/Dec Load fetch Reg/Dec m fetch Reg/Dec fetch Reg/Dec deal Case each functional unit is used once per instruction AND all instructions are the same, just as discussed in the previous 2 slides. We have a problem: Two instructions try to write to the register file at the same time! We will see many other problems involved in pipelining. ECE468 Pipeline. 22-4-3 ECE468 Pipeline.2 22-4-3

mportant Observation Each functional unit can only be used once per instruction: necessary but not sufficient. Each functional unit must be used at the same stage for all instructions: Load uses Register File s ite Port during its 5th stage 2 3 4 5 Load fetch Reg/Dec m uses Register File s ite Port during its 4th stage 2 3 4 fetch Reg/Dec Problem! Solution : nsert Bubble into the Pipeline Cycle Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 fetch Reg/Dec Load fetch Reg/Dec m fetch Reg/Dec fetch Reg/Dec Pipeline fetch Reg/Dec Bubble Reg/Dec fetch Reg/Dec fetch Reg/Dec nsert a bubble into the pipeline to prevent 2 writes at the same cycle The control logic can be complex No instruction is completed during Cycle 5: The Effective CP for load is 2 ECE468 Pipeline.3 22-4-3 ECE468 Pipeline.4 22-4-3 Solution 2: Delay s ite by One Cycle Delay s register write by one cycle: Now instructions also use Reg File s write port at Stage 5 m stage is a NOOP stage: nothing is being done 2 3 4 5 fetch Reg/Dec m Cycle Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 fetch Reg/Dec m fetch Reg/Dec m Load fetch Reg/Dec m The Four Stages of Store Cycle Cycle 2 Cycle 3 Cycle 4 Store fetch Reg/Dec m fetch: nstruction Fetch Fetch the instruction from the nstruction mory Reg/Dec: Registers Fetch and nstruction Decode : Calculate the memory address m: ite the data into the mory : is NOOP stage so that the pipeline diagram looks more uniform. fetch Reg/Dec m fetch Reg/Dec m ECE468 Pipeline.5 22-4-3 ECE468 Pipeline.6 22-4-3 The Four Stages of Beq A Pipelined path -to-q delay Cycle Cycle 2 Cycle 3 Cycle 4 NOOP! Beq fetch Reg/Dec m fetch Reg/Dec m fetch: nstruction Fetch Fetch the instruction from the nstruction mory Reg/Dec: Registers Fetch and nstruction Decode Find the difference of Beq : See slide 23 for more details between pipeline processor and M-cycle processor, think why? compares the two register operands Adder calculates the branch target address m: f the registers we compared in the stage are the same, ite the branch target address into the A F/D Register mm6 Rw D/Ex Register Op mm6 Ex/m Register m/ Register No datapath uder? Why need such registers? m ECE468 Pipeline.7 22-4-3 ECE468 Pipeline.8 22-4-3

The nstruction Fetch Stage Location : lw $, x($2) $ <- m[($2) + x] A Detail View of the nstruction Location : lw $, x($2) fetch Reg/Dec m Op fetch Reg/Dec = 4 A F/D: lw $, ($2) mm6 Rw D/Ex Register mm6 Ex/m Register m m/ Register new value old output? = 4 4 nstruction mory nstruction Adder F/D: lw $, ($2) ECE468 Pipeline.9 22-4-3 ECE468 Pipeline.2 22-4-3 The Decode / Register Fetch Stage Location : lw $, x($2) $ <- m[($2) + x] Load s Calculation Stage Location : lw $, x($2) $ <- m[($2) + x] fetch Reg/Dec m Op fetch Reg/Dec m Op=Add = A F/D: mm6 Rw D/Ex: Reg. 2 & x mm6 Ex/m Register m/ Register A F/D: mm6 Rw D/Ex Register mm6 Ex/m: Load s m/ Register m ECE468 Pipeline.2 22-4-3 Remember / was = = m connected to Rfile before? ECE468 Pipeline.22 22-4-3 A Detail View of the ution Load s mory Access Stage Location : lw $, x($2) $ <- m[($2) + x] m fetch Reg/Dec m D/Ex Register imm6 6 Extender << 2 Adder Target 3 out ctr Ex/m: Load s mory A F/D: mm6 Rw D/Ex Register Op mm6 Ex/m Register = m RA m/: Load s = = 3 Op=Add m= ECE468 Pipeline.23 22-4-3 ECE468 Pipeline.24 22-4-3

Load s ite Back Stage Location : lw $, x($2) $ <- m[($2) + x] You are somewhere out there! fetch Reg/Dec m = Op How About Signals? Key Observation: Signals at Stage N = Func (nstr. at Stage N) N =, m, or Example: s Signals at Stage = Func(Load s ) fetch Reg/Dec m Op=Add = A F/D: mm6 Rw D/Ex Register mm6 Ex/m Register m RA m/ Register A F/D: mm6 Rw D/Ex Register mm6 Ex/m: Load s m RA m/ Register m = 22-4-3 ECE468 Pipeline.25 Why no control signals = = m at st and 2 nd stages? ECE468 Pipeline.26 22-4-3 Pipeline Beginning of the s Stage: A Real World Problem The Main generates the control signals during Reg/Dec signals for (,,...) are used cycle later signals for m (m ) are used 2 cycles later signals for ( m) are used 3 cycles later RegAdr s -to-q Adr m m s -to-q F/D Register Reg/Dec m Main Op mw r D/Ex Register Op mw r Ex/m Register mw r m/ Register m/ RegAdr RegAdr s -to-q Reg File At the beginning of the stage, we have a problem if: RegAdr s ( or ) -to-q > s -to-q Similarly, at the beginning of the m stage, we have a problem if: Adr s -to-q > m s -to-q We have a race condition between and ite Enable! Ex/m m Adr Adr s -to-q mory ECE468 Pipeline.27 22-4-3 Why can M-cycle processor s approach not be useful? ECE468 Pipeline.28 22-4-3 The Pipeline Problem Multiple Cycle design prevents race condition between Addr and En: Make sure is stable by the end of Cycle N Asserts En during Cycle N + This approach can NOT be used in the pipeline design because: Must be able to write the register file every cycle Must be able write the data memory every cycle Store fetch Reg/Dec m Store fetch Reg/Dec m fetch Reg/Dec m fetch Reg/Dec m Solution? Recall -cycle processor s approach. ECE468 Pipeline.29 22-4-3 Synchronize Register File & Synchronize mory Solution: And the ite Enable signal with the This is the ONLY place where gating the clock is used MUST consult circuit expert to ensure no timing violation: - Example: High Time > ite Access Delay _Addr _En C_En En _En _Addr _ C_En Reg File or mory Actual write Synchronize mory and Register File,, and En must be stable at least set-up time before the edge ite occurs at the cycle following the clock edge that captures the signals ECE468 Pipeline.3 22-4-3 En Reg File or mory

: Load A More Extensive Pipelining Example Cycle Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 fetch Reg/Dec m 4: fetch Reg/Dec m 8: Store fetch Reg/Dec m 2: Beq (target is ) fetch Reg/Dec m Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 4: Load s m, s, Store s Reg, Beq s fetch Cycle 5: Load s, s m, Store s, Beq s Reg Cycle 6: s, Store s m, Beq s Cycle 7: Store s, Beq s m ECE468 Pipeline.3 22-4-3 Pipelining Example: Cycle 4 : Load s m 4: s 8: Store s Reg 2: Beq s fetch = 6 2: Beq s fet A F/D: Beq nstruction 8: Store s Reg 4: s : Load s m = mm6 Rw Op= =x D/Ex: Store s & B mm6 = = =x m= 22-4-3 ECE468 Pipeline. Ex/m: s Result = m RA m/: Load s ut Pipelining Example: Cycle 5 : Lw s 4: R s m 8: Store s 2: Beq s Reg 6: R s fetch Pipelining Example: Cycle 6 4: R s 8: Store s m 2: Beq s 6: R s Reg 2: R s fet 6: R s fet 2: Beq s Reg 8: Store s 4: s m : Load s = Op=Add = = 2: s fet 6: s Reg 2: Beq s 8: Store s m 4: s Op=Sub = = = = 2 A F/D: nstruction @ 6 mm6 Rw D/Ex: Beq s & B mm6 Ex/m: Store s m/: s Result = 24 A F/D: nstruction @ 2 mm6 Rw D/Ex: s & B mm6 Ex/m: Beq s Results m/: Nothing for St =x = = m= 22-4-3 ECE468 Pipeline.33 =x = m= = ECE468 Pipeline.34 22-4-3 Pipelining Example: Cycle 7 8: Store s 2: Beq s m 6: R s 2: R s Reg 24: R s fet = 24: s fet A F/D: nstruction @ 24 2: s Reg 6: s 2: Beq s m 8: Store s Op= = =x = mm6 Rw D/Ex: s & B mm6 = = =x m= 22-4-3 ECE468 Pipeline.35 Ex/m: ype s Results m/:nothing for Beq The Delay Phenomenon Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle Cycle 2: Beq fetch Reg/Dec m (target is ) 6: fetch Reg/Dec m 2: fetch Reg/Dec m 24: fetch Reg/Dec m : Target of Br fetch Reg/Dec m Although Beq is fetched during Cycle 4: Target address is NOT written into the until the end of Cycle 7 s target is NOT fetched until Cycle 8 3-instruction delay before the branch take effect This is referred to as Hazard: Clever design techniques can reduce the delay to ONE instruction ECE468 Pipeline.36 22-4-3

The Delay Load Phenomenon Cycle Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 : Load fetch Reg/Dec m Plus fetch Reg/Dec m Plus 2 fetch Reg/Dec m Plus 3 fetch Reg/Dec m Plus 4 fetch Reg/Dec m Although Load is fetched during Cycle : The data is NOT written into the Reg File until the end of Cycle 5 We cannot read this value from the Reg File until Cycle 6 3-instruction delay before the load take effect This is referred to as Hazard: Clever design techniques can reduce the delay to ONE instruction Summary sadvantages of the Single Cycle Processor Long cycle time Cycle time is too long for all instructions except the Load Multiple Cycle Processor: vide the instructions into smaller steps ute each step (instead of the entire instruction) in one cycle Pipeline Processor: Natural enhancement of the multiple clock cycle processor Each functional unit can only be used once per instruction f a instruction is going to use a functional unit: - it must use it at the same stage as all other instructions Pipeline : - Each stage s control signal depends ONLY on the instruction that is currently in that stage ECE468 Pipeline.37 22-4-3 ECE468 Pipeline.38 22-4-3 Single Cycle, Multiple Cycle, vs. Pipeline Cycle Cycle 2 Single Cycle mplementation: Load Store Waste Cycle Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle Multiple Cycle mplementation: Load fetch Reg m Store fetch Reg m fetch Pipeline mplementation: Load fetch Reg m Store fetch Reg m fetch Reg m ECE468 Pipeline.39 22-4-3