Radix-64 Floating-Point Divider
Javier D. Bruguera
ARITH25, June 25-27, 2018
2018 Arm Limited
Overview
  Main features
    General architecture
    Performance
  Radix-64 digit-recurrence division
    Overlapping of three radix-4 iterations
  Microarchitecture
    Pre-processing
    Digit iteration
      Digit selection
      Next remainder calculation
  Evaluation and comparison
Main Features
Operands and Result
  Floating-point division, $q = x/d$
  Normalized operands, $x, d \in [1,2)$, although
    subnormal operands are accepted -> operand normalization before the digit iterations
  To simplify the rounding, the result is forced to be in $q \in [1,2)$
    If the result is in $q \in [0.5,2)$ -> rounding needs a guard bit and a round bit, and the result can need a 1-bit left shift
      $q \in (0.5,1)$ if $x < d$, $q \in [1,2)$ if $x \ge d$
    Early detection of $x < d$ -> $q = 2x/d$, $q \in [1,2)$
      Same mantissa as in $x/d$, but the exponent needs to be decremented
  Support for double, single and half precision
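The range forcing described above can be illustrated with exact rational arithmetic. This is a minimal sketch, not the divider's datapath: the function name `force_unit_range` and the use of Python's `Fraction` are assumptions made only for the example.

```python
from fractions import Fraction

def force_unit_range(x, d):
    """Return (mantissa, exponent_adjust) with the mantissa in [1, 2).

    x and d are significands in [1, 2). When x < d the dividend is doubled
    and the exponent is decremented by one to compensate.
    """
    x, d = Fraction(x), Fraction(d)
    assert 1 <= x < 2 and 1 <= d < 2
    if x < d:
        return 2 * x / d, -1   # x/d is in (0.5, 1): use 2x/d, exponent - 1
    return x / d, 0            # x/d is already in [1, 2)

# Example: 1.25 / 1.5 = 5/6, represented as (5/3) * 2^-1.
m, e = force_unit_range(Fraction(5, 4), Fraction(3, 2))
```

Because the mantissa never falls below 1, no post-normalization shift is needed before rounding.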
Digit-Recurrence Division Algorithm
  Radix-64, overlapping three radix-4 iterations: 6 bits of the result are obtained every cycle (6 bits/cycle)
    Each radix-4 iteration gives 2 bits of the result
  Each radix-4 iteration:
    1. Quotient digit selection, digits {-2, -1, 0, +1, +2}
    2. Remainder update
  Operand pre-scaling to have a simple quotient-digit selection function
    If the divisor is close enough to 1, the digit selection is independent of the divisor and depends only on the remainder
    Dividend scaled as well to preserve the result
  The first quotient digit (integer digit) can take values {+1, +2} -> simplified selection logic
    Obtained in parallel with the pre-scaling
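Why a divisor close to 1 allows a divisor-independent selection can be checked numerically. The sketch below is an illustration under stated assumptions: the thresholds 1/2 and 3/2 in `select` are values chosen for this example (the slides do not list the selection constants), the selection uses the exact value of 4*rem rather than a few-bit estimate, and the scaled-divisor interval [1 - 1/64, 1 + 1/8] is the one given on the pre-scaling slide. The check is that the next remainder stays within (2/3) of the divisor, the usual convergence bound for radix 4 with digits -2..+2.

```python
from fractions import Fraction as F

RHO = F(2, 3)  # redundancy factor for radix 4 with digits -2..+2

def select(rem4):
    """Pick a digit from 4*rem alone, using fixed thresholds (assumed values)."""
    if rem4 >= F(3, 2):
        return 2
    if rem4 >= F(1, 2):
        return 1
    if rem4 >= F(-1, 2):
        return 0
    if rem4 >= F(-3, 2):
        return -1
    return -2

def bound_holds(ds, steps=64):
    """Check |4*rem - q*ds| <= RHO*ds on a grid of rem with |rem| <= RHO*ds."""
    for k in range(-steps, steps + 1):
        rem = RHO * ds * F(k, steps)
        nxt = 4 * rem - select(4 * rem) * ds
        if abs(nxt) > RHO * ds:
            return False
    return True

# Scaled divisors sampled across [1 - 1/64, 1 + 1/8].
divisors = [F(63, 64) + F(9, 64) * F(j, 18) for j in range(19)]
all_ok = all(bound_holds(ds) for ds in divisors)

# A divisor far from 1 breaks the same fixed thresholds.
far_ok = bound_holds(F(3, 2))
```

With these assumed thresholds the bound holds over the whole sampled interval and fails for a divisor of 1.5, which is the motivation for scaling the divisor toward 1 first.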
Early Termination Mode
  Occurs when
    Any of the operands is NaN, infinity, or 0
    Division by a power of 2
  The result is not obtained in the digit calculation iterations
    Shorter latency
Latency
  Number of result bits: 1 integer bit + n fractional bits
    The integer bit is obtained in parallel with the pre-scaling
    Fractional bits include the guard bit, $n = 53$ (DP), $24$ (SP), $11$ (HP)
  Number of digit cycles is $\lceil n/6 \rceil$ (6 bits/cycle)
    Half-precision:   2 digit cycles -> 6 radix-4 iterations -> 12 fractional bits
    Single-precision: 4 digit cycles -> 12 radix-4 iterations -> 24 fractional bits
    Double-precision: 9 digit cycles -> 27 radix-4 iterations -> 54 fractional bits
  Latency for normal operation (no subnormals, result normalized)
    Half-precision:   PSC DGT DGT RND
    Single-precision: PSC DGT DGT DGT DGT RND
    Double-precision: PSC DGT DGT DGT DGT DGT DGT DGT DGT DGT RND
  Obtaining the first digit in parallel with the pre-scaling and forcing the result into [1,2) contribute to saving 1 cycle of latency
Digit-Recurrence Division
Radix-r Digit-Recurrence Division
  Iterative algorithm
    Iteration i computes:
      a radix-r quotient digit, $q_{i+1}$
      a remainder, $rem[i+1]$
    The remainder is used to get the next quotient digit
  $r = 4$
    Partial quotient before iteration i, $Q[i] = \sum_{j=0}^{i} q_j \, 4^{-j}$
    At iteration i
      $q_{i+1} = \mathrm{SEL}(4 \times \widehat{rem}[i])$
      $rem[i+1] = 4 \times rem[i] - q_{i+1} \times d$
    Usually, $rem[i]$ is in redundant carry-save or signed-digit representation
    $\widehat{rem}[i]$ is an estimation of $rem[i]$ with few bits (6 bits in radix 4)
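The recurrence above can be exercised with a short exact-arithmetic model. This is a sketch, not the hardware algorithm: the selection is idealized as rounding 4*rem/d to the nearest integer on the full-precision remainder, which stands in for SEL applied to a few-bit estimate. The function name `radix4_divide` is hypothetical.

```python
from fractions import Fraction as F

def radix4_divide(x, d, iterations):
    """Return (Q, rem) after `iterations` radix-4 steps of x / d."""
    x, d = F(x), F(d)
    q = round(x / d)               # integer digit (idealized selection)
    rem = x - q * d
    Q = F(q)
    for i in range(1, iterations + 1):
        q = round(4 * rem / d)     # stands in for SEL on a few-bit estimate
        assert -2 <= q <= 2        # holds because |rem| <= d/2 at every step
        rem = 4 * rem - q * d      # remainder update
        Q += F(q, 4 ** i)          # accumulate the digit with weight 4^-i
    return Q, rem

# 27 iterations, as for double precision.
Q, rem = radix4_divide(F(3, 2), F(5, 4), 27)
error = F(3, 2) / F(5, 4) - Q      # equals rem / d / 4**27
```

The final remainder carries the exact error of the partial quotient, which is what the rounding stage relies on.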
Number of Iterations and Cycles
  Radix-4 iterations
    Number of iterations is $it = \lceil n / \log_2 4 \rceil = \lceil n/2 \rceil$
    For example, double-precision: $n = 53$, $it = 27$
  Radix-64 division, three radix-4 iterations per cycle
    Number of cycles for normal division is $cycles = \lceil it/3 \rceil + 2$
      1 pre-scaling cycle, $\lceil it/3 \rceil$ digit cycles, 1 rounding cycle
    For example, double-precision: $it = 27$, $cycles = 9 + 2 = 11$
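The cycle count can be reproduced for the three formats with a few lines of integer arithmetic. The function name `radix64_cycles` is an assumption for this example; the values of n are the fractional bit counts (including the guard bit) given on the latency slide.

```python
def radix64_cycles(n):
    """Return (radix-4 iterations, digit cycles, total cycles) for n fractional bits."""
    iterations = (n + 1) // 2             # ceil(n / 2): 2 bits per radix-4 iteration
    digit_cycles = (iterations + 2) // 3  # ceil(it / 3): three iterations per cycle
    return iterations, digit_cycles, digit_cycles + 2  # + pre-scaling + rounding

table = {name: radix64_cycles(n)
         for name, n in (("half", 11), ("single", 24), ("double", 53))}
```

The totals match the 4, 6 and 11 cycles quoted for half, single and double precision.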
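Radix-64 Divider Naive Implementation
  [Figure: three radix-4 iterations in sequence. Each stage takes the MSBs of 4 x rem[i] for digit selection and updates the remainder, $rem[i+1] = 4 \times rem[i] - q_{i+1} \times d$. Labels: rem[i], rem[i+1], rem[i+2], rem[i+3], q[i+3].]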
Radix-64 Divider Microarchitecture
Overlapping Radix-4 Iterations with Speculation/Replication
  [Figure: three overlapped radix-4 iterations. Labels: MSBs of 4 x rem[i]; rem[i], rem[i+1], rem[i+2], rem[i+3]; q[i+3].]
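The figure for this slide did not survive extraction, so the following is only a behavioral sketch of the general speculate-then-select idea named in the title, not the slide's datapath. It assumes exact remainders and illustrative selection thresholds (1/2 and 3/2, not taken from the slides). The work for the next digit is replicated for all five candidate values of the current digit, and the actual digit then picks one result, so the digits do not have to be computed strictly one after another.

```python
from fractions import Fraction as F

DIGITS = (-2, -1, 0, 1, 2)

def select(rem4):
    """Fixed-threshold selection on 4*rem (assumed thresholds)."""
    for q, threshold in ((2, F(3, 2)), (1, F(1, 2)), (0, F(-1, 2)), (-1, F(-3, 2))):
        if rem4 >= threshold:
            return q
    return -2

def cycle_sequential(rem, ds):
    """Three radix-4 iterations, one strictly after the other."""
    digits = []
    for _ in range(3):
        q = select(4 * rem)
        rem = 4 * rem - q * ds
        digits.append(q)
    return digits, rem

def cycle_speculative(rem, ds):
    """Same result, with the next step precomputed for every candidate digit."""
    digits = []
    q = select(4 * rem)                        # only the first digit waits on rem
    for _ in range(3):
        nxt = {k: 4 * rem - k * ds for k in DIGITS}      # replicated updates
        spec = {k: select(4 * nxt[k]) for k in DIGITS}   # speculative digits
        digits.append(q)
        rem, q = nxt[q], spec[q]               # late selection by the actual digit
    return digits, rem

result = cycle_speculative(F(1, 5), F(1))
```

Both functions return the same three digits and the same remainder; the speculative form trades replicated logic for a shorter dependency chain.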
Radix-64 Iteration
  [Figure, repeated over several slides with the remainder update and digit selection parts called out: datapath of one radix-64 iteration, split into a digit selection block and a remainder calculation block. Labels: 6 MSBs of 4 x rem[i]; "after a 2b left shift"; 7 MSBs of -qd^ for the candidates -2d^, -d^, 0, d^, 2d^; five 9b adders and five 7b adders, each followed by 6 MSBs; 4rem[i]-qd, 4rem[i+1]-qd, 4rem[i+2]-qd with the multiples -2d, -d, d, 2d; rem[i+1], rem[i+2], rem[i+3]; q[i+3].]
First Cycle: Operands Pre-Scaling
  1. Scaling of divisor and dividend
    Divisor close to 1 to have a simpler digit selection function
      The digit selection function depends only on the remainder; it does not depend on the divisor
    Dividend is scaled as well to preserve the result
  2. Operand comparison
    If $x < d$ the dividend is left-shifted by 1 bit, $q = 2x/d$
      Result in [1,2)
      Makes rounding easier: only a guard bit, there is no round bit
  3. Integer quotient-digit calculation
    Result is in [1,2), so the integer digit is in {+1, +2}
      Simplified selection function
    Replicated for $x > d$ and $x \le d$
  4. Initial remainder
    The scaled divisor and scaled dividend are used for the initial redundant remainder
    Includes the 1-bit left shift if $x < d$
First Cycle: Operands Pre-Scaling
  [Figure: pre-scaling datapath. Inputs: dividend and divisor. Blocks: redundant scaled dividend, redundant scaled divisor, SUB, operand comparison, two quotient digit selection blocks. Outputs: scaled divisor, q1, rem[1].]
First Cycle: Operands Pre-Scaling
  Operand pre-scaling
    Scaling factor M selected by the leading bits of the divisor (1.xxx), so that the scaled divisor lies in $[1 - 1/64, 1 + 1/8]$

    1.xxx   M
    000     1 + 1/2 + 1/2
    001     1 + 1/4 + 1/2
    010     1 + 1/2 + 1/8
    011     1 + 1/2 + 0
    100     1 + 1/4 + 1/8
    101     1 + 1/4 + 0
    110     1 + 0 + 1/8
    111     1 + 0 + 1/8
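The interval quoted for the scaled divisor can be checked against the table with exact fractions. One assumption is needed and is not stated on the slide: the interval is taken to apply to the divisor read with its binary point one place to the left, that is to d/2 in [0.5, 1), since M times a value in [1, 2) cannot itself be close to 1. The names `M` and `scaled_range` are hypothetical.

```python
from fractions import Fraction as F

# Scaling factors from the table, indexed by the three leading fraction bits.
M = {
    0b000: 1 + F(1, 2) + F(1, 2),
    0b001: 1 + F(1, 4) + F(1, 2),
    0b010: 1 + F(1, 2) + F(1, 8),
    0b011: 1 + F(1, 2),
    0b100: 1 + F(1, 4) + F(1, 8),
    0b101: 1 + F(1, 4),
    0b110: 1 + F(1, 8),
    0b111: 1 + F(1, 8),
}

def scaled_range(k):
    """Range [lo, hi) of M * d/2 for divisors d = 1.xxx... with xxx == k."""
    lo = F(8 + k, 16)   # d/2 at the bottom of the bin (assumed binary point)
    hi = F(9 + k, 16)   # d/2 at the excluded top of the bin
    return M[k] * lo, M[k] * hi

lows, highs = zip(*(scaled_range(k) for k in range(8)))
lowest, highest = min(lows), max(highs)
```

Under that assumption the table reproduces the stated bounds: the smallest scaled divisor is 1 - 1/64 and the values stay below 1 + 1/8.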
First Cycle: Operands Pre-Scaling
  Operand comparison
    To save time, the non-scaled operands are compared
First Cycle: Operands Pre-Scaling
  Integer quotient digit
First Cycle: Operands Pre-Scaling
  Initial remainder
    $rem[1] = M x - q_1 M d$
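The four first-cycle steps can be combined into one small model. This is a sketch under several assumptions that are not in the slides: the extra factor 1/2 folded into the scaling so that the scaled divisor is near 1, the threshold 3/2 used to pick the integer digit, exact (non-redundant) arithmetic, and the names `M_TABLE` and `first_cycle`. The scaling factors themselves are the ones listed on the operand pre-scaling slide.

```python
from fractions import Fraction as F

# M for divisor bits 1.xxx, xxx = 000 .. 111.
M_TABLE = [F(2), F(7, 4), F(13, 8), F(3, 2), F(11, 8), F(5, 4), F(9, 8), F(9, 8)]

def first_cycle(x, d):
    """Return (q1, rem1, scaled_divisor, exponent_adjust) for x, d in [1, 2)."""
    x, d = F(x), F(d)
    k = int((d - 1) * 8)             # three leading fraction bits of d
    m = M_TABLE[k] / 2               # assumed: fold in 1/2 so that m*d is near 1
    shift = x < d                    # comparison on the non-scaled operands
    xs = m * (2 * x if shift else x) # scaled dividend, left-shifted if x < d
    ds = m * d                       # scaled divisor
    q1 = 2 if xs >= F(3, 2) else 1   # integer digit in {+1, +2} (assumed threshold)
    rem1 = xs - q1 * ds              # initial remainder
    return q1, rem1, ds, (-1 if shift else 0)

q1, rem1, ds, adj = first_cycle(F(5, 4), F(3, 2))
```

For this example the quotient 2x/d is 5/3, the integer digit is 2, and the initial remainder is within (2/3) of the scaled divisor, so the digit iterations can start from it.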
Evaluation
Evaluation and Comparison
  Evaluation
    Latency
    Area
  Comparison with other recent processors
    AMD K7              multiplicative division algorithm
    AMD Jaguar          multiplicative division algorithm
    IBM z13             radix-4 digit-recurrence division algorithm
    HAL Sparc           multiplicative division algorithm
    Intel 2018 (*lake)  radix-1024 digit-recurrence division algorithm
  * No information about the microarchitecture, just some notes with the radix and SP/DP latencies
Latency
                                            Double     Single     Half
                                            precision  precision  precision
  Regular input, normalized result          11         6          4
  Regular input, subnormal result           12         7          5
  One subnormal input, normalized result    13         8          6
  One subnormal input, subnormal result     14         9          7
  Two subnormal inputs, normalized result   14         9          7

  Example: single precision
    Regular input, normalized result:         PSC DGT DGT DGT DGT RND1
    Regular input, subnormal result:          PSC DGT DGT DGT DGT RND1 RND2
    One subnormal input, normalized result:   UNP NM PSC DGT DGT DGT DGT RND1
    One subnormal input, subnormal result:    UNP NM PSC DGT DGT DGT DGT RND1 RND2
    Two subnormal inputs, normalized result:  UNP NM NM PSC DGT DGT DGT DGT RND1

  PSC -> pre-scaling, UNP -> unpacking, DGT -> digit iteration, NM -> normalization, RND1,2 -> rounding
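The stage sequences in the single-precision example suggest a simple rule for building the pipeline of any case. The sketch below encodes that rule; the assumption that it generalizes unchanged to double and half precision is mine, checked only against the latency table above, and the function name `stages` is hypothetical.

```python
def stages(n_frac_bits, subnormal_inputs=0, subnormal_result=False):
    """Return the list of pipeline stages for one division."""
    digit_cycles = -(-n_frac_bits // 6)          # ceil(n / 6), 6 bits per cycle
    seq = []
    if subnormal_inputs:
        # One unpacking cycle, then one normalization cycle per subnormal input.
        seq += ["UNP"] + ["NM"] * subnormal_inputs
    seq += ["PSC"] + ["DGT"] * digit_cycles + ["RND1"]
    if subnormal_result:
        seq.append("RND2")                       # second rounding cycle
    return seq

sp_two_subnormals = " ".join(stages(24, subnormal_inputs=2))
dp_latency = len(stages(53, subnormal_inputs=1, subnormal_result=True))
```

The length of the returned list is the latency in cycles for that case.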
Latency - Comparison
                      Algorithm       Half-precision  Single-precision  Double-precision
  AMD K7              multiplicative  N/A             16                20
  AMD Jaguar          multiplicative  N/A             14                19
  IBM z13             Radix-4         N/A             23                37
  HAL Sparc           multiplicative  N/A             16                19
  Intel 2018 (*lake)  Radix-1024      N/A             10                13
  ARM                 Radix-64        4               6                 11

  Latencies include pre-processing (unpacking, pre-scaling), iteration cycles, and post-processing (rounding), assuming normalized input/output
  Dividers based on multiplicative algorithms:
    Latency limited by the latency of multiplication or multiply-and-accumulate
      Can be significant
  Radix-1024 divider (10 bits/cycle)
    Pre-processing (probably) needs several cycles
Area
  Large area
    Radix-64 iteration
      fifteen 58-bit CSAs (five 58-bit CSAs per radix-4 iteration)
      five 9-bit adders
      five 7-bit adders
      Selection of three quotient digits
      Muxes
    Pre-scaling
      Three 58-bit adders
      Two 58-bit CSAs
      Reduced selection logic
      Muxes
    Rounding, normalization

                         DGT   Pre-proc   Total
  58-bit adders          --    3          3
  Small adders           10    2          12
  58-bit CSAs            15    2          17
  Selection logic        3     2          5
  58-bit 4-to-1 mux      6     --         6
  58-bit 2-to-1 mux      --    6          6
  Narrow muxes           2     1          3
Area - Comparison
  Multiplicative algorithms
    Modest area, smaller than in the radix-64 divider
      Reusing the existing FP multipliers
      Look-up table for initial seed
      Muxes
  Radix-4 algorithm
    Redundant remainder: 116-bit sum word, 28-bit carry word
      6 most-significant bits of the remainder are non-redundant
    Area: 3-to-2 CSA, 2 small CPAs, digit selection table
  Radix-1024 algorithm
    Large area (probably)
      Pre-scaling: 1/d approximation, 2 multipliers
      Iteration: small adder for digit selection, rectangular multiplier for remainder update
Conclusions
Conclusions
  Radix-64 floating-point divider
    6 bits/cycle
    Overlapping of three radix-4 iterations
    Pre-scaling to have a simple selection function
    Result in [1,2)
    First iteration in parallel with the pre-scaling
  Low-latency FP divider
    11, 6, and 4 cycles for double, single, and half precision
    Additional cycles in case of subnormal input/output
    Lower latency than other implementations, although area is larger
  Integer division could be easily integrated
  Shared logic with floating-point square root
Thank You
Danke  Merci  Gracias  Kiitos  감사합니다  धन्यवाद  תודה
Floating-Point Division
  Iterative digit-recurrence algorithm
    In a radix-r algorithm each iteration computes a radix-r digit of the quotient
    A radix-r digit represents $\log_2 r$ bits of the quotient
    Number of iterations depends on the result precision and on the radix
  Several cycles
    Unpacking
    Pre-scaling
    Normalization (1 cycle per subnormal input)
    Digit calculation (several cycles)
    Rounding (2 rounding cycles if quotient is subnormal)