ECE468 Computer Organization and Architecture Designing a Pipeline Processor A Single Cycle Processor <3:26> Jump nstruction<3:> op Main nstruction : Fetch op 3 mm6 <5:> func 5 5 5 ctr 3 m Rw busw -bit Registers En Adr n imm6 nstr<5:> 6 mory <6:2> <2:25> Extender <:5> <:5> ECE468 Pipeline. 22-4-3 ECE468 Pipeline.2 22-4-3 Drawbacks of this Single Cycle Processor Long cycle time: Cycle time must be long enough for the load instruction: - s -to-q + - nstruction mory Access Time + - Register File Access Time + - Delay (address calculation) + - mory Access Time + - Register File Setup Time + - Skew Cycle time is much longer than needed for all other instructions. Examples: instructions do not require data memory access Jump does not require operation nor data memory access Overview of a Multiple Cycle mplementation The root of the single cycle processor s problems: The cycle time has to be long enough for the slowest instruction Solution: Break the instruction into smaller steps ute each step (instead of the entire instruction) in one cycle - Cycle time: time it takes to execute the longest step - Keep all the steps to have similar length This is the essence of the multiple cycle processor The advantages of the multiple cycle processor: Cycle time is much shorter fferent instructions take different number of cycles to complete - Load takes five cycles - Jump only takes three cycles Allows a functional unit to be used more than once per instruction ECE468 Pipeline.3 22-4-3 Why will this hinder ECE468 Pipeline.4 the pipeline? 22-4-3 Multiple Cycle Processor MCP: A functional unit to be used more than once per instruction Cond Src Br ord m R SelA Target RAdr 5 deal 5 Reg File mory 4 Adr Rw n ut busw 2 3 << 2 nstruction Reg Outline of Today s Lecture--- Pipelining ntroduction to the Concept of Pipelined Processor Pipelined path and Pipelined How to Avoid ce Condition in a Pipeline Design? Pipeline Example: nstructions nteraction mm Extend 6 SelB Op ECE468 Pipeline.5 22-4-3 ECE468 Pipeline.6 22-4-3
Timing agram of a Load nstruction,,, Op, Func ctr nstruction Fetch nstr Decode / Reg. Fetch -to-q nstruction mory Access Time Delay through Logic mory 2 Register File Access Time Delay through Extender & 3 Delay Reg mory Access Time busw New ECE468 Pipeline.7 22-4-3 Register File ite Time The Five Stages of Load Cycle Cycle 2 Cycle 3 Cycle 4 Cycle 5 Load fetch Reg/Dec m fetch: nstruction Fetch Fetch the instruction from the nstruction mory Reg/Dec: Registers Fetch and nstruction Decode : Calculate the memory address m: Read the data from the mory : ite the data back to the register file ECE468 Pipeline.8 22-4-3 Key deas Behind Pipelining Pipelining the Load nstruction Grading the mid term exams: 5 problems, five people grading the exam Each person ONLY grades one problem Pass the exam to the next person as soon as one finishes his part Assume each problem takes.5 hour to grade - Each individual exam still takes 2.5 hours to grade - But with 5 people, all exams can be graded much quicker The load instruction has 5 stages: Five independent functional units to work on each stage - Each functional unit is used only once The 2nd load can start as soon as the st finishes its fet stage Each load still takes five cycles to complete The throughput, however, is much higher ECE468 Pipeline.9 22-4-3 Cycle Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 st lw fetch Reg/Dec m 2nd lw fetch Reg/Dec m 3rd lw fetch Reg/Dec m The five independent functional units in the pipeline datapath are: nstruction mory for the fetch stage Register File s Read ports (bus A and ) for the Reg/Dec stage for the stage Regiester file is used 2 times but no problem! mory for the m stage Register File s ite port (bus W) for the stage One instruction enters the pipeline every cycle One instruction comes out of the pipeline (complete) every cycle Comparison with m-cycle processor: 5x3 cycles versus 7 cycles The Effective Cycles per nstruction (CP) is ECE468 Pipeline. 22-4-3 The Four Stages of Cycle Cycle 2 Cycle 3 Cycle 4 fetch Reg/Dec fetch: nstruction Fetch Fetch the instruction from the nstruction mory Reg/Dec: Registers Fetch and nstruction Decode : operates on the two register operands : ite the output back to the register file Pipelining the and Load nstruction Cycle Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 fetch Reg/Dec Ops! We have a problem! fetch Reg/Dec Load fetch Reg/Dec m fetch Reg/Dec fetch Reg/Dec deal Case each functional unit is used once per instruction AND all instructions are the same, just as discussed in the previous 2 slides. We have a problem: Two instructions try to write to the register file at the same time! We will see many other problems involved in pipelining. ECE468 Pipeline. 22-4-3 ECE468 Pipeline.2 22-4-3
mportant Observation Each functional unit can only be used once per instruction: necessary but not sufficient. Each functional unit must be used at the same stage for all instructions: Load uses Register File s ite Port during its 5th stage 2 3 4 5 Load fetch Reg/Dec m uses Register File s ite Port during its 4th stage 2 3 4 fetch Reg/Dec Problem! Solution : nsert Bubble into the Pipeline Cycle Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 fetch Reg/Dec Load fetch Reg/Dec m fetch Reg/Dec fetch Reg/Dec Pipeline fetch Reg/Dec Bubble Reg/Dec fetch Reg/Dec fetch Reg/Dec nsert a bubble into the pipeline to prevent 2 writes at the same cycle The control logic can be complex No instruction is completed during Cycle 5: The Effective CP for load is 2 ECE468 Pipeline.3 22-4-3 ECE468 Pipeline.4 22-4-3 Solution 2: Delay s ite by One Cycle Delay s register write by one cycle: Now instructions also use Reg File s write port at Stage 5 m stage is a NOOP stage: nothing is being done 2 3 4 5 fetch Reg/Dec m Cycle Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 fetch Reg/Dec m fetch Reg/Dec m Load fetch Reg/Dec m The Four Stages of Store Cycle Cycle 2 Cycle 3 Cycle 4 Store fetch Reg/Dec m fetch: nstruction Fetch Fetch the instruction from the nstruction mory Reg/Dec: Registers Fetch and nstruction Decode : Calculate the memory address m: ite the data into the mory : is NOOP stage so that the pipeline diagram looks more uniform. fetch Reg/Dec m fetch Reg/Dec m ECE468 Pipeline.5 22-4-3 ECE468 Pipeline.6 22-4-3 The Four Stages of Beq A Pipelined path -to-q delay Cycle Cycle 2 Cycle 3 Cycle 4 NOOP! Beq fetch Reg/Dec m fetch Reg/Dec m fetch: nstruction Fetch Fetch the instruction from the nstruction mory Reg/Dec: Registers Fetch and nstruction Decode Find the difference of Beq : See slide 23 for more details between pipeline processor and M-cycle processor, think why? compares the two register operands Adder calculates the branch target address m: f the registers we compared in the stage are the same, ite the branch target address into the A F/D Register mm6 Rw D/Ex Register Op mm6 Ex/m Register m/ Register No datapath uder? Why need such registers? m ECE468 Pipeline.7 22-4-3 ECE468 Pipeline.8 22-4-3
The nstruction Fetch Stage Location : lw $, x($2) $ <- m[($2) + x] A Detail View of the nstruction Location : lw $, x($2) fetch Reg/Dec m Op fetch Reg/Dec = 4 A F/D: lw $, ($2) mm6 Rw D/Ex Register mm6 Ex/m Register m m/ Register new value old output? = 4 4 nstruction mory nstruction Adder F/D: lw $, ($2) ECE468 Pipeline.9 22-4-3 ECE468 Pipeline.2 22-4-3 The Decode / Register Fetch Stage Location : lw $, x($2) $ <- m[($2) + x] Load s Calculation Stage Location : lw $, x($2) $ <- m[($2) + x] fetch Reg/Dec m Op fetch Reg/Dec m Op=Add = A F/D: mm6 Rw D/Ex: Reg. 2 & x mm6 Ex/m Register m/ Register A F/D: mm6 Rw D/Ex Register mm6 Ex/m: Load s m/ Register m ECE468 Pipeline.2 22-4-3 Remember / was = = m connected to Rfile before? ECE468 Pipeline.22 22-4-3 A Detail View of the ution Load s mory Access Stage Location : lw $, x($2) $ <- m[($2) + x] m fetch Reg/Dec m D/Ex Register imm6 6 Extender << 2 Adder Target 3 out ctr Ex/m: Load s mory A F/D: mm6 Rw D/Ex Register Op mm6 Ex/m Register = m RA m/: Load s = = 3 Op=Add m= ECE468 Pipeline.23 22-4-3 ECE468 Pipeline.24 22-4-3
Load s ite Back Stage Location : lw $, x($2) $ <- m[($2) + x] You are somewhere out there! fetch Reg/Dec m = Op How About Signals? Key Observation: Signals at Stage N = Func (nstr. at Stage N) N =, m, or Example: s Signals at Stage = Func(Load s ) fetch Reg/Dec m Op=Add = A F/D: mm6 Rw D/Ex Register mm6 Ex/m Register m RA m/ Register A F/D: mm6 Rw D/Ex Register mm6 Ex/m: Load s m RA m/ Register m = 22-4-3 ECE468 Pipeline.25 Why no control signals = = m at st and 2 nd stages? ECE468 Pipeline.26 22-4-3 Pipeline Beginning of the s Stage: A Real World Problem The Main generates the control signals during Reg/Dec signals for (,,...) are used cycle later signals for m (m ) are used 2 cycles later signals for ( m) are used 3 cycles later RegAdr s -to-q Adr m m s -to-q F/D Register Reg/Dec m Main Op mw r D/Ex Register Op mw r Ex/m Register mw r m/ Register m/ RegAdr RegAdr s -to-q Reg File At the beginning of the stage, we have a problem if: RegAdr s ( or ) -to-q > s -to-q Similarly, at the beginning of the m stage, we have a problem if: Adr s -to-q > m s -to-q We have a race condition between and ite Enable! Ex/m m Adr Adr s -to-q mory ECE468 Pipeline.27 22-4-3 Why can M-cycle processor s approach not be useful? ECE468 Pipeline.28 22-4-3 The Pipeline Problem Multiple Cycle design prevents race condition between Addr and En: Make sure is stable by the end of Cycle N Asserts En during Cycle N + This approach can NOT be used in the pipeline design because: Must be able to write the register file every cycle Must be able write the data memory every cycle Store fetch Reg/Dec m Store fetch Reg/Dec m fetch Reg/Dec m fetch Reg/Dec m Solution? Recall -cycle processor s approach. ECE468 Pipeline.29 22-4-3 Synchronize Register File & Synchronize mory Solution: And the ite Enable signal with the This is the ONLY place where gating the clock is used MUST consult circuit expert to ensure no timing violation: - Example: High Time > ite Access Delay _Addr _En C_En En _En _Addr _ C_En Reg File or mory Actual write Synchronize mory and Register File,, and En must be stable at least set-up time before the edge ite occurs at the cycle following the clock edge that captures the signals ECE468 Pipeline.3 22-4-3 En Reg File or mory
: Load A More Extensive Pipelining Example Cycle Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 fetch Reg/Dec m 4: fetch Reg/Dec m 8: Store fetch Reg/Dec m 2: Beq (target is ) fetch Reg/Dec m Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 4: Load s m, s, Store s Reg, Beq s fetch Cycle 5: Load s, s m, Store s, Beq s Reg Cycle 6: s, Store s m, Beq s Cycle 7: Store s, Beq s m ECE468 Pipeline.3 22-4-3 Pipelining Example: Cycle 4 : Load s m 4: s 8: Store s Reg 2: Beq s fetch = 6 2: Beq s fet A F/D: Beq nstruction 8: Store s Reg 4: s : Load s m = mm6 Rw Op= =x D/Ex: Store s & B mm6 = = =x m= 22-4-3 ECE468 Pipeline. Ex/m: s Result = m RA m/: Load s ut Pipelining Example: Cycle 5 : Lw s 4: R s m 8: Store s 2: Beq s Reg 6: R s fetch Pipelining Example: Cycle 6 4: R s 8: Store s m 2: Beq s 6: R s Reg 2: R s fet 6: R s fet 2: Beq s Reg 8: Store s 4: s m : Load s = Op=Add = = 2: s fet 6: s Reg 2: Beq s 8: Store s m 4: s Op=Sub = = = = 2 A F/D: nstruction @ 6 mm6 Rw D/Ex: Beq s & B mm6 Ex/m: Store s m/: s Result = 24 A F/D: nstruction @ 2 mm6 Rw D/Ex: s & B mm6 Ex/m: Beq s Results m/: Nothing for St =x = = m= 22-4-3 ECE468 Pipeline.33 =x = m= = ECE468 Pipeline.34 22-4-3 Pipelining Example: Cycle 7 8: Store s 2: Beq s m 6: R s 2: R s Reg 24: R s fet = 24: s fet A F/D: nstruction @ 24 2: s Reg 6: s 2: Beq s m 8: Store s Op= = =x = mm6 Rw D/Ex: s & B mm6 = = =x m= 22-4-3 ECE468 Pipeline.35 Ex/m: ype s Results m/:nothing for Beq The Delay Phenomenon Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle Cycle 2: Beq fetch Reg/Dec m (target is ) 6: fetch Reg/Dec m 2: fetch Reg/Dec m 24: fetch Reg/Dec m : Target of Br fetch Reg/Dec m Although Beq is fetched during Cycle 4: Target address is NOT written into the until the end of Cycle 7 s target is NOT fetched until Cycle 8 3-instruction delay before the branch take effect This is referred to as Hazard: Clever design techniques can reduce the delay to ONE instruction ECE468 Pipeline.36 22-4-3
The Delay Load Phenomenon Cycle Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 : Load fetch Reg/Dec m Plus fetch Reg/Dec m Plus 2 fetch Reg/Dec m Plus 3 fetch Reg/Dec m Plus 4 fetch Reg/Dec m Although Load is fetched during Cycle : The data is NOT written into the Reg File until the end of Cycle 5 We cannot read this value from the Reg File until Cycle 6 3-instruction delay before the load take effect This is referred to as Hazard: Clever design techniques can reduce the delay to ONE instruction Summary sadvantages of the Single Cycle Processor Long cycle time Cycle time is too long for all instructions except the Load Multiple Cycle Processor: vide the instructions into smaller steps ute each step (instead of the entire instruction) in one cycle Pipeline Processor: Natural enhancement of the multiple clock cycle processor Each functional unit can only be used once per instruction f a instruction is going to use a functional unit: - it must use it at the same stage as all other instructions Pipeline : - Each stage s control signal depends ONLY on the instruction that is currently in that stage ECE468 Pipeline.37 22-4-3 ECE468 Pipeline.38 22-4-3 Single Cycle, Multiple Cycle, vs. Pipeline Cycle Cycle 2 Single Cycle mplementation: Load Store Waste Cycle Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle Multiple Cycle mplementation: Load fetch Reg m Store fetch Reg m fetch Pipeline mplementation: Load fetch Reg m Store fetch Reg m fetch Reg m ECE468 Pipeline.39 22-4-3