Radix-64 Floating-Point Divider

Javier D. Bruguera
ARM Austin Design Center
Email: javier.bruguera@arm.com

Abstract: Digit-recurrence division is widely used in actual high-performance microprocessors because it presents a good trade-off in terms of performance, area and power consumption. In this paper we present a radix-64 divider, providing 6 bits per cycle. To have an affordable implementation, each iteration is composed of three radix-4 iterations; speculation is used between consecutive radix-4 iterations to get a reduced timing. The result is a fast, low-latency floating-point divider, requiring 11, 6, and 4 cycles for double-precision, single-precision and half-precision floating-point division with normalized operands and result. One or two additional cycles are needed in case of subnormal operand(s) or result.

I. INTRODUCTION

Division is one of the most representative floating-point functions in modern processors. There exist two main families of algorithms for calculating division in hardware [3]: digit-recurrence algorithms, which have linear convergence and are based on subtraction, and multiplicative algorithms, which are based on multiplication and have quadratic convergence. The energy efficiency of both approaches has been recently analyzed, and the conclusion is that the digit-recurrence approach is much more energy efficient [7] and requires less area. In addition, for the floating-point precisions of interest (double, single and half precision), digit-recurrence methods are much faster. Multiplicative methods rely on several iterations of a multiply-add fused (MAF) operation, and the latency of a single MAF is between 3 and 6 cycles [5], [9], [11]. In some cases, a single MAF therefore takes as long as a complete single-precision division in the proposed divider.

In this paper a radix-64 digit-recurrence divider is described. It is hard to get an energy- and timing-efficient radix-64 implementation; therefore, three radix-4 iterations are overlapped in a single cycle, providing 6 bits of the quotient per cycle, which is equivalent to a radix-64 iteration.
In order to reduce the timing, speculation is used between consecutive radix-4 iterations in the cycle. The divider has been implemented in a processor with a frequency of 3 GHz.

Probably the most critical point in digit-recurrence division is the quotient-digit selection. Every iteration, a digit of the quotient is obtained. To have a simple radix-4 selection function, independent of the divisor, the divisor needs to be scaled to a value close enough to 1 [2]. This scaling is carried out before the digit iterations. In addition, the first iteration, which gives the integer digit of the quotient, with value +1 or +2, is carried out in parallel with the operand scaling, which saves one cycle in single precision.

The result is a low-latency divider with 11, 6, and 4 cycles latency for double-precision, single-precision and half-precision, respectively, when the input operands and the result are normalized. These latencies include the scaling and rounding cycles. In case of subnormal operands, one or two additional normalization cycles are needed. Similarly, in case of a subnormal result a second rounding cycle is needed.

The rest of the paper is organized as follows: In Section II the main features of the proposed divider are outlined. Section III is a brief description of the foundations of digit-recurrence division. In Section IV the detailed implementation of the divider is described. Finally, in Section V the divider is compared with other implementations in actual processors, and in Section VI the main conclusions are presented.

II. MAIN FEATURES

The divider performs the floating-point division of a dividend, $x$, and a divisor, $d$, to obtain a quotient, $q = x/d$. The two operands need to be normalized, $x, d \in [1, 2)$, although subnormal operands are accepted; in this case, the subnormal operands are normalized before the digit iterations. If the two operands are normalized in $[1, 2)$, the result is in $[0.5, 2)$; this way, two bits to the right of the least significant bit (LSB) of the quotient are needed for rounding, the guard and the round bits.
The guard bit is used for rounding when the result is normalized, $q \in [1, 2)$, whereas the round bit is used for rounding when the result is not normalized, $q \in [0.5, 1)$. In this latter case, the result is left-shifted by 1 bit, and the guard and round bits become the LSB and the guard bit, respectively, of the normalized result. However, to simplify the rounding, the result is forced to be in $[1, 2)$. Note that $q < 1$ only if $x < d$. This situation is detected in an early stage and the dividend is left-shifted by 1 bit, in such a way that $q = 2x/d$ and $q \in [1, 2)$. Of course, the mantissa is the same as in $x/d$, but the exponent needs to be decremented.

The algorithm used for the division is the radix-4 digit-recurrence algorithm with three iterations per cycle, with a signed-digit representation of the quotient with digit set $\{-2, -1, 0, +1, +2\}$; that is, radix $r = 4$ and maximum digit value $a = 2$. Each iteration, a digit of the quotient is obtained by means of a selection function. In order to have a quotient-digit selection function independent of the divisor, the divisor has to be scaled close to 1. Of course, to preserve the result the dividend needs to be scaled by the same amount as the divisor.

© 2018 IEEE. 25th IEEE Symposium on Computer Arithmetic (ARITH 2018).

With the radix-4 algorithm, 2 bits of the quotient are obtained every iteration. As three radix-4 iterations are performed per clock cycle, 6 bits of the quotient are obtained every cycle, which is equivalent to a radix-64 divider.

In addition, note that the first quotient digit, which is the integer digit of the result, can take only values $\{+1, +2\}$, and its calculation is much simpler than the calculation of the remaining digits. It is therefore obtained in parallel with the operand prescaling, saving one additional iteration in single precision.

On the other hand, there is an early-termination mode for exceptional operands. The early termination occurs when any of the operands is NaN, infinity, or zero, or in case of a division by a power of 2 with both operands normalized. In the latter case, the result is obtained by merely decrementing the exponent of the dividend.

In summary, the main features of the radix-64 divider are the following:
- Prescaling of divisor and dividend
- First quotient digit (integer digit) obtained in parallel with the operand prescaling
- Comparison of the scaled dividend and divisor, and left shift of the dividend to have the result in $[1, 2)$
- Three radix-4 iterations per cycle, giving 6 bits per cycle
- Half, single and double precision
- Subnormal support, with normalization cycles before the iterations
- Early termination for exceptional operands

III. DIGIT-RECURRENCE DIVISION

Digit-recurrence division is an iterative algorithm which computes a radix-$r$ quotient digit $q_{i+1}$ and a remainder every iteration. The remainder $\mathrm{rem}[i]$ is used to obtain the next radix-$r$ digit. For a fast iteration, the remainder is kept in a carry-save or signed-digit redundant representation. In our implementation, we have chosen a radix-2 signed-digit representation for the remainder, with a positive and a negative word.

Particularizing to radix 4, $r = 4$, the partial quotient before iteration $i$ is defined as

$$Q[i] = \sum_{j=0}^{i} q_j\, 4^{-j} \qquad (1)$$

and the radix-4 algorithm, considering a scaled divisor close to 1, is described by the following equations,

$$q_{i+1} = \mathrm{SEL}(\widehat{\mathrm{rem}}[i]) \qquad (2)$$

$$\mathrm{rem}[i+1] = 4\,\mathrm{rem}[i] - d\, q_{i+1} \qquad (3)$$

where $\widehat{\mathrm{rem}}[i]$ is an estimation of the remainder $\mathrm{rem}[i]$ with a few bits.
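As an illustration of equations (1) to (3), the following Python sketch runs the radix-4 recurrence with exact rational arithmetic. It is a behavioral model under stated assumptions, not the hardware datapath: the names `select_digit` and `radix4_divide` are hypothetical; the thresholds are the standard selection intervals of Table II(a); the estimate is obtained by truncating $4\,\mathrm{rem}[i]$ to three fractional bits; the operands are assumed to be already prescaled, with the quotient in $[1, 2)$; and the integer digit is chosen by a plain comparison against $1.5\,d$, which is a simplification made only for this sketch.

```python
from fractions import Fraction
import math


def select_digit(shifted_rem):
    """Pick a digit in {-2..+2} from an estimate of 4*rem (Table II(a) intervals).

    The estimate keeps three fractional bits (units of 1/8), by truncation.
    """
    est = math.floor(shifted_rem * 8)      # estimate in eighths
    assert -32 <= est <= 31, "estimate must fit in 6 bits"
    if est >= 13:
        return 2
    if est >= 4:
        return 1
    if est >= -3:
        return 0
    if est >= -12:
        return -1
    return -2


def radix4_divide(xs, ds, iterations):
    """Behavioral radix-4 digit recurrence on prescaled operands.

    Assumptions of this sketch: ds lies in [1 - 1/64, 1 + 1/8] and xs/ds in [1, 2).
    Returns (digits, quotient, final remainder); digits[0] is the integer digit.
    """
    assert Fraction(63, 64) <= ds <= Fraction(9, 8)
    assert ds <= xs < 2 * ds
    # Integer digit: +2 or +1 (illustrative rule, not the paper's logic).
    q0 = 2 if 2 * xs >= 3 * ds else 1
    digits = [q0]
    rem = xs - q0 * ds
    for _ in range(iterations):
        q = select_digit(4 * rem)          # equation (2)
        rem = 4 * rem - q * ds             # equation (3)
        digits.append(q)
    # Equation (1): weighted sum of the signed digits.
    quotient = sum(Fraction(q, 4 ** j) for j, q in enumerate(digits))
    return digits, quotient, rem
```

For example, `radix4_divide(Fraction(3, 2), Fraction(17, 16), 12)` starts with the digits 1, 2, -1, -2, 1. In this model the remainder stays within $\pm\tfrac{2}{3}d$, so after $n$ iterations the quotient error is at most $\tfrac{2}{3}\,4^{-n}$.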
For this implementation, it has been determined that only the 6 most-significant bits (MSBs) of the remainder are required, three integer bits and three fractional bits [3]. Then, every iteration a quotient digit is obtained from the current remainder, and a new remainder is computed for the next iteration. The number of iterations is

$$\mathit{it} = \lceil n / \log_2(4) \rceil \qquad (4)$$

where $n$ is the number of bits of the result, including the bits required for rounding.

The latency of the division, that is, the number of cycles, is directly related to the number of iterations. It also depends on the number of iterations performed per cycle. Three iterations per cycle have been implemented to obtain 6 bits per cycle, which is equivalent to a radix-64 division. Then, the latency for a normal division is

$$\mathit{cycles} = \lceil \mathit{it}/3 \rceil + 2 \qquad (5)$$

Apart from the cycles needed for the iterations, $\lceil \mathit{it}/3 \rceil$, there are two additional cycles for operand prescaling and rounding. Some examples of digit-recurrence division, including radix-4, can be found in [1], [3].

The naive implementation is shown in Figure 1. Note that only the most-significant bits of the remainder are used to select the quotient digit. The remainder is updated using carry-save adders (CSA) and stored in redundant representation. The quotient-digit selection then needs the most-significant bits of the remainder to be added in a carry-propagate adder (CPA) to get their non-redundant representation. However, this naive implementation is too slow; to speed up the cycle, speculation in remainder calculation and quotient-digit selection between iterations has been used, as explained in the following section.

IV. ARCHITECTURE

The divider is composed of three parts: prescaling logic, digit-recurrence logic, and rounding logic. The prescaling and rounding take one cycle each, whereas the digit-recurrence logic, because of the iterative nature of digit-recurrence algorithms, is reused during several consecutive cycles. In the following subsections the prescaling and digit logic are described. Rounding can be the standard digit-recurrence division rounding.
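Equations (4) and (5) of Section III can be evaluated directly. The short Python sketch below does so, rounding up to whole iterations and whole cycles; the function names and the default parameters are hypothetical and only mirror the quantities named in the text (2 bits per radix-4 iteration, 3 iterations per cycle, 2 extra cycles for prescaling and rounding).

```python
import math


def radix4_iterations(n_bits):
    """Equation (4): each radix-4 iteration produces log2(4) = 2 quotient bits."""
    return math.ceil(n_bits / math.log2(4))


def division_cycles(n_bits, iterations_per_cycle=3, overhead_cycles=2):
    """Equation (5): digit cycles plus the prescaling and rounding cycles."""
    digit_cycles = math.ceil(radix4_iterations(n_bits) / iterations_per_cycle)
    return digit_cycles + overhead_cycles


if __name__ == "__main__":
    for name, n in (("double", 53), ("single", 24), ("half", 11)):
        print(name, radix4_iterations(n), division_cycles(n))
```

With $n = 53$, $24$ and $11$ this gives 11, 6 and 4 cycles; with $n = 25$ it gives 13 iterations and 7 cycles, which is the single-precision case without the early integer digit.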
In addition, to reduce the latency by 1 cycle in single precision, the first quotient digit, the integer digit which can take values +1 or +2 only, is calculated in parallel with the prescaling.

The number of bits $n$ to be obtained in the iterations includes the integer bit, the fractional bits, and the guard bit for rounding. If the integer digit is obtained in the operand prescaling cycle, the number of bits $n$ is decremented by one. Then, for practical floating-point formats, the number of bits $n$, the number of iterations, and the number of cycles (see equations (4) and (5)) are:
- Double precision ($n = 53$): $\mathit{it} = 27$, $\mathit{cycles} = 11$
- Single precision ($n = 24$): $\mathit{it} = 12$, $\mathit{cycles} = 6$
- Half precision ($n = 11$): $\mathit{it} = 6$, $\mathit{cycles} = 4$

Note that, in addition to the digit cycles, the latencies include the prescaling and rounding cycles. Note also that, because the first quotient digit is calculated in parallel with the prescaling, in single precision the number of bits has been reduced from 25 to 24; with $n = 25$ the number of iterations would be $\mathit{it} = 13$, and the number of cycles would be $\mathit{cycles} = 7$. Therefore, the latency has been reduced by 1 cycle.

Fig. 1. Naive implementation of a radix-64 divider with three radix-4 iterations per cycle.

A. Operand Prescaling and Integer Digit Calculation

During the prescaling, the divisor is scaled to a value close to 1 so that the quotient-digit selection is independent of the divisor. It has been determined that, for a radix-4 digit recurrence, it is enough to have the scaled divisor in the range $[1 - 1/64,\ 1 + 1/8]$ [2]. The divisor is multiplied by a scaling factor $M = 1 + b \cdot 2^{-3}$, with $0 \le b \le 8$ and $b \ne 7$; this scaling factor depends only on the value of the divisor. As shown in Table I, only three bits of the divisor need to be checked to get $M$. Note that for the prescaling, the divisor is supposed to be in $[0.5, 1)$. The prescaling has been implemented as the addition of the divisor plus 2 (or 1) multiples of the divisor [2]. The dividend has to be prescaled by the same amount to get the correct result.

TABLE I
DETERMINATION OF THE PRESCALING FACTOR

Divisor $0.1x_1x_2x_3$ ($x_1 x_2 x_3$)    $M$
-----------------------------------    -----------------
000                                    1 + 1/2 + 1/2
001                                    1 + 1/4 + 1/2
010                                    1 + 1/2 + 1/8
011                                    1 + 1/2 + 0
100                                    1 + 1/4 + 1/8
101                                    1 + 1/4 + 0
110                                    1 + 0 + 1/8
111                                    1 + 0 + 1/8

The block diagram of this cycle is shown in Figure 2. During this cycle, in addition to the operand prescaling, the first iteration is carried out:
1) The operands are scaled. As part of the scaling, redundant carry-save representations of divisor and dividend are obtained.
2) The redundant prescaled divisor and dividend are assimilated to a non-redundant representation to get the remainder after the first iteration. The non-redundant divisor is used in the digit iterations as well.
3) The operands are compared and the dividend is left-shifted by 1 bit if $x < d$. To save time, the comparison is carried out in parallel with the prescaling.
In parallel with the redundant to non-redundant conversion of the operands, the integer digit of the quotient is obtained as well.
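The prescaling step described above can be sketched in Python with exact rationals. The dictionary below transcribes the two terms that Table I adds to 1 for each value of the divisor bits $x_1 x_2 x_3$; the names `PRESCALE_TERMS`, `prescale_factor` and `prescale` are hypothetical, and the sketch models only the arithmetic, not the carry-save adders of the actual datapath.

```python
from fractions import Fraction
import math

HALF, QUARTER, EIGHTH = Fraction(1, 2), Fraction(1, 4), Fraction(1, 8)

# Table I: terms added to 1, indexed by bits x1 x2 x3 of a divisor 0.1x1x2x3...
PRESCALE_TERMS = {
    0b000: (HALF, HALF),
    0b001: (QUARTER, HALF),
    0b010: (HALF, EIGHTH),
    0b011: (HALF, 0),
    0b100: (QUARTER, EIGHTH),
    0b101: (QUARTER, 0),
    0b110: (0, EIGHTH),
    0b111: (0, EIGHTH),
}


def prescale_factor(d):
    """Return M for a divisor d in [0.5, 1), looking only at bits x1 x2 x3."""
    if not Fraction(1, 2) <= d < 1:
        raise ValueError("divisor must be in [0.5, 1)")
    index = math.floor(d * 16) - 8      # drop the leading 0.1, keep x1 x2 x3
    term1, term2 = PRESCALE_TERMS[index]
    return 1 + term1 + term2


def prescale(x, d):
    """Scale dividend and divisor by the same factor, so x/d is unchanged."""
    m = prescale_factor(d)
    return x * m, d * m
```

For instance, a divisor of 5/8 (bits 010) gets $M = 1 + 1/2 + 1/8$ and is scaled to 65/64. Sweeping divisors over $[0.5, 1)$ with this table keeps the scaled divisor inside $[1 - 1/64,\ 1 + 1/8]$, the range quoted from [2].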
This is a simplified quotient-digit calculation because, as the quotient is positive and in $[1, 2)$, the integer radix-4 digit can only take values $q_1 = +1$ or $q_1 = +2$. A simplified radix-4 iteration is performed to obtain the integer digit of the quotient. The integer digit calculation is replicated for $x < d$ and for $x \ge d$, obtaining two quotient-digit candidates, one for the dividend larger than the divisor and one for the dividend smaller than the divisor. The result of the comparison selects the correct digit and next remainder. Note that the difference between both cases is that the dividend is left-shifted by 1 bit if the divisor is larger than the dividend.

The next remainder, $\mathrm{rem}[1]$ (see equation (3)), is obtained from the non-redundant scaled dividend (positive word of the remainder) and the non-redundant scaled divisor (negative word of the remainder), with the divisor shifted 1 bit to the left if the quotient digit is +2 and not shifted if the quotient digit is +1.

Fig. 2. Prescaling and integer quotient-digit calculation.

B. Digit Iteration

The actual implementation of the floating-point divider performs three radix-4 iterations per cycle, so the logic has been optimized taking this fact into account. Figure 3 shows the block diagram of a digit-iteration cycle, that is, the computation of three radix-4 iterations. Note that the implementation in Figure 3 is split into two parts, (1) digit selection and (2) remainder calculation.

The remainders are computed speculatively according to equation (3). Five remainders are computed every iteration, one for each value of the quotient digit, and the correct remainder is selected when the digit has been obtained. Note that the remainder has to be left-shifted by two bits as part of the computation of the next remainder.

The quotient-digit selection uses an estimation of the remainder to obtain the next quotient digit (equation (2)). As said in Section III, it has been determined that only the 6 most-significant bits (MSBs) of the remainder are required, three integer bits and three fractional bits. The quotient-digit selection function is shown in Table II(a). The intervals of $4\,\mathrm{rem}[i]$ for the selection of every digit have been obtained following the methodology described in [3].

To select digits $q_{i+2}$ and $q_{i+3}$, the 9 MSBs of the speculative $\mathrm{rem}[i+1]$ are assimilated. Note that 9 MSBs are assimilated because, although the selection function for $q_{i+2}$ only needs the 6 MSBs, 2 additional bits are required for the selection of $q_{i+3}$ because of the 2-bit left shift of $\mathrm{rem}[i+2]$, and another additional bit is used to catch the carry into the least-significant position of the 8 bits. Digit $q_{i+1}$ selects, among the 5 speculatively calculated sets of MSBs, the one that is going to be used in the selection of $q_{i+2}$. Note that only 6 bits are used in the selection.

The 6 MSBs so obtained may be different from the 6 MSBs obtained directly from $\mathrm{rem}[i+1]$, because the +1 to complete the 2's complement of the sum word, in the assimilation of $\mathrm{rem}[i+1]$, is added at a different position. In the actual implementation in Figure 3, it is added at the position of the 8th MSB whereas, in case of being obtained directly from $\mathrm{rem}[i+1]$, it would be added at the position of the 6th MSB. Consequently, the carry into the 6th MSB can be different.
This difference makes the end-points of the intervals in Table II(a) give a wrong selection when the carry into the 6th MSB is zero. This is corrected with the selection function shown in Table II(b). Note that the selection at the interval end-points depends on the carry into the 6th MSB (carry column in the table).
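The two selection functions of Table II can be written down and compared exhaustively. In the Python sketch below the estimate of $4\,\mathrm{rem}[i]$ is an integer in units of 1/8, so a 6-bit two's complement value ranges from -32 to 31; the function names are hypothetical, and the negative signs of the intervals are read as in a symmetric digit set. One observation, which is a property of the two tables as transcribed here and not a claim about how the authors derived them, is that the modified table agrees with the standard table applied to the estimate plus one unit whenever the carry is 0, wrapping around in 6 bits.

```python
def select_standard(est):
    """Table II(a): est is the 6-bit estimate of 4*rem in eighths (-32..31)."""
    if est >= 13:
        return 2
    if est >= 4:
        return 1
    if est >= -3:
        return 0
    if est >= -12:
        return -1
    return -2


# Table II(b): end-points whose digit depends on the carry into the 6th MSB.
# Each entry maps estimate -> (digit if carry == 0, digit if carry == 1).
CARRY_DEPENDENT = {
    31: (-2, 2),
    12: (2, 1),
    3: (1, 0),
    -4: (0, -1),
    -13: (-1, -2),
}


def select_modified(est, carry):
    """Table II(b): same intervals, with carry-dependent end-points."""
    if est in CARRY_DEPENDENT:
        return CARRY_DEPENDENT[est][carry]
    return select_standard(est)


def wrap6(value):
    """Wrap an integer into the 6-bit two's complement range -32..31."""
    return (value + 32) % 64 - 32


def tables_agree():
    """Check the observation: (b) equals (a) on est + (1 - carry), wrapped."""
    return all(
        select_modified(est, carry) == select_standard(wrap6(est + 1 - carry))
        for est in range(-32, 32)
        for carry in (0, 1)
    )
```

Calling `tables_agree()` checks all 128 combinations of estimate and carry.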

Fig. 3. Digit cycle logic, split into digit selection and remainder calculation.

TABLE II
QUOTIENT-DIGIT SELECTION

(a) Standard selection intervals

$4\,\mathrm{rem}[i]$      $q_{i+1}$
----------------    ---------
[13/8, 31/8]        +2
[4/8, 12/8]         +1
[-3/8, 3/8]          0
[-12/8, -4/8]       -1
[-32/8, -13/8]      -2

(b) Modified selection intervals

$4\,\mathrm{rem}[i]$      carry    $q_{i+1}$
----------------    -----    ---------
31/8                1        +2
[13/8, 30/8]        -        +2
12/8                0        +2
12/8                1        +1
[4/8, 11/8]         -        +1
3/8                 0        +1
3/8                 1         0
[-3/8, 2/8]         -         0
-4/8                0         0
-4/8                1        -1
[-12/8, -5/8]       -        -1
-13/8               0        -1
-13/8               1        -2
[-32/8, -14/8]      -        -2
31/8                0        -2

Therefore, it is clear that the carry into the 6th bit is required for the selection of the $q_{i+2}$ digit. Hence, the 9-bit adder has to be split into a 6-bit adder and a 3-bit adder to get access to this carry.

In parallel with the selection of $q_{i+2}$, the 6 MSBs to be used in the selection of digit $q_{i+3}$ are computed speculatively for every value of $q_{i+2}$. Thus, the non-redundant estimation of $\mathrm{rem}[i+2]$ is obtained in the five 7-bit adders, by adding the shifted 7 MSBs of $\mathrm{rem}[i+1]$ plus the 7 MSBs of $-q_{i+2}\,d$. Then, digit $q_{i+2}$ is used to select the correct adder output, and $q_{i+3}$ is selected according to Table II(a). This way, the delay of the logic in the cycle has been reduced with respect to a plain implementation of the three quotient-digit selection functions.

In the quotient-digit selection, block SEL in the figure, the quotient digit is coded as a 1-hot 5-bit code {qp2, qp1, qz, qn1, qn2}, so that, for example, qp2 = 1 and qp1 = qz = qn1 = qn2 = 0 if $q_{j+1} = +2$. The logic function to get every bit in the 1-hot 5-bit code is relatively simple, a 3-level 2-input gate logic function.

The quotient is obtained in a redundant signed-digit representation with 2 words, a positive word (quo_pos) storing the positive digits and a negative word (quo_neg) storing the negative digits. For example, a final single-precision quotient quo = 1 0 0 (-1) 2 1 (-1) 0 0 0 1 1 1 is represented by

quo_pos = 1 000 210 000 111
quo_neg = 0 001 001 000 000

Note that if the quotient digit is 0, there will be a 0 in both the positive and the negative words. The final quotient $Q[n]$ in equation (1) is obtained by subtracting both words and rounding the result. Alternatively, an on-the-fly conversion [1] could be used, but in our implementation this results in a worse cycle time.

V. EVALUATION

In this section we evaluate our design in terms of latency, area and timing, and compare it with other recent dividers.

A. Latency

The number of fractional bits of the quotient the algorithm has to obtain is 53 for double precision (52 fractional bits plus the guard bit), 24 for single precision (23 fractional bits plus the guard bit) and 11 for half precision (10 fractional bits plus the guard bit). Additionally, there is an integer digit which can be 1 or 2. This integer digit is obtained in parallel with the prescaling. Hence, the number of digit cycles required for half, single and double precision is 2, 4 and 9, respectively. So, for normal operands the latency is
- Half precision, 4 cycles: E1/PS-DGT-DGT-RND1
- Single precision, 6 cycles: E1/PS-DGT-DGT-DGT-DGT-RND1
- Double precision, 11 cycles: E1/PS-DGT-DGT-DGT-DGT-DGT-DGT-DGT-DGT-DGT-RND1
where E1/PS is the initial cycle (where operands are unpacked and some conditions are determined) combined with the prescaling cycle (scaling of divisor and dividend, left shift of the dividend if $x < d$, and integer digit calculation), DGT is the digit cycle (3 radix-4 iterations per digit cycle), and RND1 is the rounding cycle.

In case of normal operands, the initial stage and the prescaling stage are done in the same cycle. If any of the operands is subnormal, the initial and the prescaling stages are done in different cycles; this is because the operand needs to be normalized, that is, in $[1, 2)$, before being prescaled. So, in case of subnormal inputs there will be 1 or 2 additional normalization stages, NM1 and NM2, before the prescaling (PS) cycle. In case of a tiny result an additional rounding stage RND2, after RND1, is needed. As an example, the latency of a single-precision division with one subnormal operand and tiny result is 9 cycles: E1-NM1-PS-DGT-DGT-DGT-DGT-RND1-RND2.

Table III compares the latency of the proposed divider with the latency of some other recent processors for floating-point half, single and double precision with normalized operands and result [4], [6], [8], [10]. The latencies shown in the table include the iteration cycles and the pre- and post-processing cycles, such as unpacking, prescaling and rounding. Note that no cycles for normalization have been included because it has been assumed that the operands are already normalized, although, as stated previously, the proposed divider can handle subnormal inputs and outputs.

TABLE III
LATENCY COMPARISON

                   HP     SP    DP
---------------    ---    --    --
AMD K7 [8]         N/A    16    20
AMD Jaguar [10]    N/A    14    19
IBM z13 [4]        N/A    23    37
HAL Sparc [6]      N/A    16    19
This paper         4      6     11

Most of the designs in the table use a multiplicative division algorithm [6], [8], [10], and one of them uses a radix-4 digit-recurrence implementation [4]. As shown in the table, our proposal gets much lower latencies. The multiplicative implementations are limited by the latency of the multiplier or multiply-and-accumulate units which, as stated in the introduction, can be very significant. On the other hand, the implementation in [4] uses a very low radix, which implies a high number of iterations, although its implementation is quite simple. In our implementation we have been able to put three radix-4 iterations in a single cycle by using speculation between iterations in the same cycle. In addition, there are only two pre- and post-processing cycles: one before the iterations, for unpacking of operands and prescaling, and one after, for rounding.

B. Area

On the other hand, the divider area is larger in our implementation than in the other implementations in the table. Our divider uses a large number of CSAs and CPAs in the iterative part: five 58-bit CSAs per iteration for a total of 15 CSAs, five 9-bit CPAs, and five 7-bit CPAs, plus the logic for the selection of 3 quotient digits and the multiplexers, twelve 58-bit 4:1 muxes and two small-width 5:1 muxes. In addition, in the prescaling logic three 58-bit adders and some additional logic (CSAs, multiplexers, and a reduced selection logic) are needed.

Multiplicative division algorithms involve only modest additional cost because the existing FP multipliers are reused to perform each algorithm iteration. Only a look-up table for the initial seed and some additional logic is needed to implement the divider.

The area of the radix-4 divider is also modest.
The redundant partial remainder consists of a sum part of 116 bits and a carry part of 28 bits (only 1 out of 4 carries is flopped); the 6 most-significant bits must be in non-redundant format because they are used for the quotient-digit selection. The iteration is implemented with a stage of 3:2 CSAs and one stage of 4-bit CPAs; an additional 6-bit CPA is needed to deliver the 6 most-significant bits to the digit selection table.

The area of the rounding stage has not been included in the discussion because it should be roughly the same for all the implementations.

C. Timing

For the critical path delay estimation the Logical Effort model [12] is used in this section. Table IV summarizes the delay of the basic gates (upper part) and of the main modules in Figures 2 and 3 (middle and lower parts, respectively) in terms of FO4 and its equivalent in picoseconds. We have considered a FO4 delay of 6 ps. The load of every signal has been taken into account, so that a fanout of $n$ adds a delay equivalent to $\log_4 n$ FO4. The fanout especially affects the SEL module in the figure. This module consists of 4 levels of 2-input logic, but the module output, the quotient digit, has a high fanout, roughly 64 gates.

TABLE IV
DELAY OF BASIC GATES AND MODULES OF THE DIVIDER

Group          Module                 FO4                    ps
-----------    -------------------    -------------------    --
basic gates    inverter               1                      6
               2-input gate           1.33                   8
               3-input gate           1.67                   10
               xor gate               2                      12
               2:1 mux                2.66                   16
prescaling     58-bit adder           14.35                  86
               54-bit sub             14.35                  86
               reduced SEL logic      4                      24
               2:1 mux with load      2.66 + log_4(58) *     40
digit cycle    6-bit adder            9.43                   56
               7-bit adder            9.34                   56
               9-bit adder            11                     66
               3:2 CSA                4                      24
               5:1 mux                4.33                   26
               SEL logic              5.33 + log_4(64) *     50

* Due to fanout

Then, in the prescaling cycle there are two paths with roughly the same estimated delay,

54-bit sub → 2:1 mux with large fanout → 2:1 mux

and

2:1 mux → 3:2 CSA → 58-bit adder → 2:1 mux

with a delay of 142 ps for each path. Note the large fanout of the first 2:1 mux in the first path.

In the digit cycle, there are several candidates to be the critical path but, due to the large fanout at the output of the SEL logic, the critical path is the one marked in blue in Figure 3. It consists of

6-bit adder → SEL logic → 5:1 mux → SEL logic → 5:1 mux → SEL logic → 5:1 mux

with an estimated delay of 300 ps. In conclusion, the critical path of the divider is in the digit cycle and has an estimated delay of 300 ps.

VI. CONCLUSIONS

The architecture of a radix-64 floating-point divider providing 6 bits of the quotient per cycle is presented. To get a simple implementation and an affordable timing, the radix-64 iteration is built with 3 radix-4 iterations, each one providing 2 bits of the quotient for a throughput of 6 bits per cycle, using speculation between consecutive iterations in the cycle.

Additionally, to have a simple digit selection logic, the divisor has been prescaled to a value close to 1, in such a way that the digit selection function does not depend on the divisor; it depends only on the 6 most-significant bits of the remainder. Prescaling has been implemented as the addition of three terms, which depend on the most-significant bits of the divisor. Of course, the dividend has to be scaled by the same amount as the divisor as well.

Further latency reductions for some floating-point precisions are obtained by left-shifting the dividend by 1 bit when it is smaller than the divisor to have the result in $[1, 2)$, and by performing the first iteration, which gives the integer digit of the result, in parallel with the prescaling.

The result is a low-latency floating-point digit-recurrence divider, with latencies of 11, 6 and 4 cycles for double precision, single precision and half precision, respectively.

REFERENCES

[1] M. Ercegovac and T. Lang. Division and Square Root: Digit-Recurrence Algorithms and Implementations. Kluwer Academic Publishers. (1994).
[2] M. Ercegovac and T. Lang. Simple Radix-4 Division with Operand Scaling. IEEE Transactions on Computers, Vol. 39, No. 9, pp. 1204-1208, (1994).
[3] M. Ercegovac and T. Lang. Digital Arithmetic. San Mateo, CA, USA: Morgan Kaufmann, (2004).
[4] G. Gerwig, H. Wetter, E. M. Schwarz, J. Haess. High Performance Floating-Point Unit with 116 bit Wide Divider. Proceedings of 16th IEEE International Symposium on Computer Arithmetic. (2003).
[5] T. Lang and J. D. Bruguera. Floating-Point Multiply-Add-Fused with Reduced Latency. IEEE Transactions on Computers, Vol. 53, No. 8, pp. 988-1003, (2004).
[6] A. Naini, A. Dhablania. 1-GHz HAL SPARC64 Dual Floating-Point Unit with RAS Features. Proceedings of 15th IEEE International Symposium on Computer Arithmetic. (2001).
[7] A. Nannarelli. Performance/Power Space Exploration for Binary64 Division Units. IEEE Transactions on Computers, Vol. 65, No. 5, pp. 1671-1677, (2016).
[8] S. F. Oberman. Floating Point Division and Square Root Algorithms and Implementation in the AMD-K7 Microprocessor. Proceedings of 14th IEEE International Symposium on Computer Arithmetic. (1999).
[9] J. Preiss, M. Boersma and S. M. Mueller. Advanced Clockgating Schemes for Fused-Multiply-Add-Type Floating-Point Units. Proceedings of 19th IEEE International Symposium on Computer Arithmetic. (2009).
[10] J. Rupley, J. King, E. Quinnell, F. Galloway, K. Patton, P. M. Seidel, J. Dinh, H. Bui, A. Bhowmik. The Floating-Point Unit of the Jaguar x86 Core. Proceedings of 21st IEEE International Symposium on Computer Arithmetic. (2013).
[11] S. Srinivasan, K. Bhudiya, R. Ramanarayanan, P. S. Babu, T. Jacob, S. K. Mathew, R. Krishnamurthy and V. Erraguntla. Split-path Fused Floating Point Multiply Accumulate (FPMAC). Proceedings of 21st IEEE International Symposium on Computer Arithmetic. (2013).
[12] I. Sutherland, B. Sproull, and D. Harris. Logical Effort: Designing Fast CMOS Circuits. The Morgan Kaufmann Series in Computer Architecture and Design. Morgan Kaufmann Publishers. (1999).