Performance Comparison of Dynamic Voltage Scaling Algorithms for Hard Real-Time Systems

Woonseok Kim*  Dongkun Shin†  Han-Saem Yun†  Jihong Kim†  Sang Lyul Min*
School of Computer Science and Engineering
Seoul National University
ENG4190, Seoul, Korea, 151-742
wskim@archi.snu.ac.kr, {sdk, hsyun, jihong}@davinci.snu.ac.kr, symin@dandelion.snu.ac.kr

* This work was supported in part by the Ministry of Education under the BK21 program, and by the Ministry of Science and Technology under the National Research Laboratory program.
† This work was supported by grant No. R01-2001-00360 from the Korea Science & Engineering Foundation.

Abstract

Dynamic voltage scaling (DVS) is an effective low-power design technique for embedded real-time systems. In recent years, many DVS algorithms have been proposed for reducing the energy consumption of embedded hard real-time systems. However, the proposed DVS algorithms were not quantitatively evaluated under a unified framework, making it a difficult task to select an appropriate DVS algorithm for a given application/system. In this paper, we compare several key DVS algorithms recently proposed for hard real-time periodic task sets, analyze their energy efficiency, and discuss the performance differences quantitatively. Our evaluation results give quantitative answers to several important DVS questions.

1 Introduction

Dynamic voltage scaling (DVS), which adjusts the supply voltage and correspondingly the clock frequency dynamically, is an effective low-power design technique for embedded real-time systems. Since the energy consumption E of CMOS circuits has a quadratic dependency on the supply voltage V_dd, lowering the supply voltage V_dd is one of the most effective ways of reducing the energy consumption. With the recent explosive growth in the portable and mobile embedded device market, where low power consumption is an important design requirement, several commercial variable-voltage microprocessors [19, 1, 8] were developed. Targeting these microprocessors, many DVS algorithms have been proposed or developed, especially for hard
real-time systems [7, 9, 18, 2, 14, 16, 5, 10]. Since lowering the supply voltage also decreases the maximum achievable clock speed [15], various DVS algorithms for hard real-time systems have the goal of reducing the supply voltage dynamically to the lowest possible level while satisfying the tasks' timing constraints.

Although each DVS algorithm is shown to be quite effective in reducing the energy/power consumption of a target system under its own experimental scenarios, these recent DVS algorithms have not been quantitatively evaluated under a unified framework, making it a difficult task for low-power embedded system developers to select an appropriate DVS algorithm for a given application/system. A quantitative analysis of the energy efficiency is particularly important because most of these DVS algorithms are based on both static and dynamic slack analysis techniques whose performance is difficult to predict analytically. In addition, their energy efficiency fluctuates significantly depending on the workload variations, task set characterizations, and execution paths taken, further requiring a quantitative comparison study.

In this paper, we quantitatively evaluate the energy efficiency of several recent DVS algorithms proposed for hard real-time systems using a unified DVS simulation environment called SimDVS [17]. We focus on preemptive hard real-time systems in which periodic real-time tasks are scheduled with the Earliest-Deadline-First (EDF) algorithm or the Rate-Monotonic (RM) algorithm, the two most widely used real-time system models [12]. Our study differs from previous performance comparisons such as [13, 6]: [13] and [6] focus on aperiodic tasks in hard real-time systems and non-real-time systems, respectively, while our study focuses on periodic tasks in hard real-time systems.

For the target hard real-time systems, two categories of algorithms are used: inter-task DVS (InterDVS) and intra-task DVS (IntraDVS). InterDVS algorithms determine the supply voltage on a task-by-task basis, while IntraDVS algorithms adjust the supply voltage within an individual task boundary. For a comparative study, we use eight InterDVS algorithms [18, 2, 14, 10] and two IntraDVS algorithms [16, 5] that were recently proposed. We also evaluate the energy efficiency of HybridDVS algorithms. (If a DVS algorithm uses both the IntraDVS and InterDVS approaches, we call the algorithm a hybrid DVS algorithm (HybridDVS).)

Since many factors affect the energy efficiency of DVS algorithms, our comparative study cannot answer all the DVS performance questions. In this paper, we limit our evaluation goals to the following questions, which represent some of the most important unanswered questions:

• InterDVS: What is the best InterDVS algorithm under given conditions? How close is the algorithm's energy efficiency to the theoretical lower bound? What restrictions of variable-voltage processors, if any, limit the achievable energy efficiency of InterDVS algorithms?
• IntraDVS: Which IntraDVS algorithm performs better under what condition?
• HybridDVS: Can we achieve better energy efficiency if we combine an InterDVS algorithm and an IntraDVS algorithm?

Our comparative study shows that the existing EDF InterDVS algorithms such as [2, 14, 10] are very effective; their energy consumption is only 9~12% worse than the theoretical lower bound. Moreover, this gap can be further reduced by using a more intelligent slack distribution method. With a better slack distribution heuristic, we strongly believe that the energy efficiency of the current state-of-the-art EDF InterDVS algorithms is very close to that of the theoretical optimal algorithm. However, in the RM InterDVS algorithms, there still remains room for improvement. Also, the energy efficiency of each algorithm can vary from 10% to 32% according to the number of voltage levels supported by the target variable-voltage processor. For the IntraDVS algorithms, our results indicate that the path-based IntraDVS [16] achieves better performance than the stochastic IntraDVS [5] when the slack time is limited.
On the other hand, when there is a large amount of slack time, the stochastic IntraDVS algorithm works better. For the HybridDVS algorithms, our experiments show that the energy efficiency of HybridDVS is better than the one that can be achieved by using an IntraDVS algorithm or an InterDVS algorithm alone.

The rest of the paper is organized as follows. Before the selected DVS algorithms are evaluated, we first classify existing DVS techniques in Section 2. In Section 3, we summarize the selected DVS algorithms using the classification framework of Section 2. Simulation environments are described in Section 4. We present the performance evaluation results in Section 5, and Section 6 concludes with a summary.

2 Classification of DVS algorithms

In this section, we classify the existing DVS techniques and briefly describe the key characteristics of each technique. (See Table 1 for a summary.) For hard real-time systems, there are two kinds of voltage scheduling approaches depending on the voltage scaling granularity: intra-task DVS (IntraDVS) and inter-task DVS (InterDVS). The intra-task DVS algorithms [16, 5] adjust the voltage within an individual task boundary, while the inter-task DVS algorithms determine the voltage on a task-by-task basis at each scheduling point. The main difference between them is whether the slack times are used for the current task or for the tasks that follow. InterDVS algorithms distribute the slack times from the current task to the following tasks, while IntraDVS algorithms use the slack times from the current task for the current task itself.

Table 1. Classification of DVS techniques.

Category   Scaling Decision   Voltage Scaling Methods
--------   ----------------   ---------------------------------
IntraDVS   Off-line           (1) Path-based method
                              (2) Stochastic method
InterDVS   Off-line           (3) Maximum constant speed
           On-line            (4) Stretching to NTA
                              (5) Priority-based slack-stealing
                              (6) Utilization updating

2.1 Intra-task DVS algorithm design factors

In scheduling hard real-time tasks, in order to guarantee the timing constraint of each task, the execution times of tasks are usually assumed to be the worst case execution times (WCETs).
However, since a task has many possible execution paths, there are large execution time variations among them. So, when the execution path taken at run time is not the worst case execution path (WCEP), the task may complete its execution before its WCET, resulting in a slack time. In that case, IntraDVS exploits such slack times and adjusts the processor speed. IntraDVS algorithms can be classified into two types depending on how to estimate slack times and how to adjust speeds.

2.1.1 Path-based method

In the path-based IntraDVS, the voltage and clock speed are determined based on a predicted reference execution path, such as WCEP. For example, when the actual execution deviates from the predicted reference execution path (say, by a branch instruction), the clock speed is adjusted. If the new path takes significantly longer to complete its execution than the reference path, the clock speed is raised to meet the deadline constraint. On the other hand, if the new path can finish its execution earlier than the reference path, the clock speed is lowered to reduce the energy consumption. In the path-based IntraDVS, program locations for possible speed scaling are identified using static program analysis [16] or execution time profiling [11].

2.1.2 Stochastic method

The stochastic method is based on the idea that it is better to start the execution at a low speed and accelerate the execution later when needed than to start with a high speed and reduce the speed later when the slack time is found. By starting at a low speed, if the task finishes earlier than its WCET, it does not need to execute at a high speed. Theoretically, if the probability density function of execution times of a task is known a priori, the optimal speed schedule can be computed [5]. Under the stochastic method, the clock speed is raised at specific time instances, regardless of the execution paths taken. Unlike the path-based IntraDVS that can utilize all the slack times of the task in scaling speed, the stochastic IntraDVS may not utilize all the potential slack times.

2.2 Inter-task DVS algorithm design factors

InterDVS algorithms exploit the run-calculate-assign-run strategy to determine the supply voltage, which can be summarized as follows: (1) run a current task, (2) when the task is completed, calculate the maximum allowable execution time for the next task, (3) assign the supply voltage for the next task, and (4) run the next task. Most InterDVS algorithms differ during step (2) in computing the maximum allowed time for the next task τ, which is the sum of the WCET of τ and the slack time available for τ.

A generic InterDVS algorithm consists of two parts: slack estimation and slack distribution. The goal of the slack estimation part is to identify as much slack time as possible, while the goal of the slack distribution part is to distribute the resulting slack times so that the resulting speed schedule is as uniform as possible.
Slack times generally come from two sources: static slack times are the extra times available for the next task that can be identified statically, while dynamic slack times are caused by run-time variations of the task executions.

2.2.1 Slack estimation methods

(1) Static slack estimation

Maximum constant speed. One of the most commonly used static slack estimation methods is to compute the maximum constant speed, which is defined as the lowest possible clock speed that guarantees a feasible schedule of a task set [18]. For example, in EDF scheduling, if the worst case processor utilization (WCPU) U of a given task set is lower than 1.0 under the maximum speed f_max, the task set can be scheduled with a new maximum speed f'_max = U · f_max. Although more complicated, the maximum constant speed can be statically calculated as well for RM scheduling [18, 5].

[Figure 1. Examples of Stretching-to-NTA. (a) A single task activation. (b) Multiple task activations (Case I and Case II).]

(2) Dynamic slack estimation

Three widely-used techniques of estimating dynamic slack times are briefly described below.

Stretching to NTA. Even though a given task set is scheduled with the maximum constant speed, since the actual execution times of tasks are usually much less than their WCETs, the tasks usually have dynamic slack times. One simple method to estimate the dynamic slack time is to use the arrival time of the next task [18]. (The arrival time of the next task is denoted by NTA.) Assume that the current task τ is scheduled at time t. If the NTA of τ is later than (t + wcet(τ)), task τ can be executed at a lower speed so that its execution completes exactly at the NTA. Figure 1 shows examples of the Stretching-to-NTA method. When a single task τ is activated as shown in Figure 1(a), the execution of τ can be stretched to NTA. When multiple tasks are activated, there can be several alternatives in stretching options. For example, the dynamic slack time may be given to a single task or distributed equally to all activated tasks.
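As a rough illustration of the two estimates described above, the following Python sketch computes the EDF maximum constant speed f'_max = U · f_max and the speed that stretches a single active task to its NTA. The function names, the (WCET, period) tuple layout, and the f_min floor are assumptions made for this example; they are not taken from [18] or from SimDVS. The sketch also assumes that execution time scales inversely with the clock speed.

```python
def max_constant_speed_edf(tasks, f_max):
    """Return the static EDF speed f'_max = U * f_max.

    tasks: list of (wcet, period) pairs, with WCETs measured at f_max
    (a hypothetical layout chosen for this sketch).
    """
    utilization = sum(wcet / period for wcet, period in tasks)
    if utilization > 1.0:
        raise ValueError("task set is not EDF-schedulable at f_max")
    return utilization * f_max


def stretch_to_nta_speed(now, nta, wcet, f_cur, f_min):
    """Return a speed at which one active task finishes exactly at NTA.

    wcet is the remaining worst-case time at speed f_cur. If NTA is not
    later than now + wcet there is no slack, so f_cur is kept.
    """
    available = nta - now
    if available <= wcet:
        return f_cur
    # Stretch the work over the whole interval, but never below f_min.
    return max(f_min, f_cur * wcet / available)
```

For instance, two tasks with (WCET, period) of (1, 4) and (2, 8) have U = 0.5, so a 100 MHz processor could run them at a constant 50 MHz under EDF.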
Cases I and II of Figure 1(b) illustrate these two options, respectively.

Priority-based slack stealing. This method exploits the basic properties of priority-driven scheduling such as RM and EDF. The basic idea is that when a higher-priority task completes its execution earlier than its WCET, the following lower-priority tasks can use the slack time from the completed higher-priority task. It is also possible for a higher-priority task to utilize the slack times from completed lower-priority tasks. However, the latter type of slack stealing is computationally expensive to implement precisely. Therefore, the existing algorithms are based on heuristics [2, 10].

Utilization updating. The actual processor utilization during run time is usually lower than the worst case processor utilization. The utilization updating technique estimates the required processor performance at the current scheduling point by recalculating the expected worst case processor utilization using the actual execution times of completed task instances [14]. When the processor utilization is updated, the clock speed can be adjusted accordingly. The main merit of this method is its simple implementation, since only the processor utilization of completed task instances has to be updated at each scheduling point.

2.2.2 Slack distribution methods

In distributing slack times, most InterDVS algorithms have adopted a greedy approach, where all the slack times are given to the next activated task. This approach is not an optimal solution, but it is widely used because of its simplicity.

3 Target DVS algorithms

Table 2. Target DVS algorithms.

Category   Scheduling Policy    DVS Policy         Used Methods†
--------   ------------------   ----------------   -------------
InterDVS   EDF                  lppsEDF [18]       (3)+(4)
                                ccEDF [14]         (6)
                                laEDF [14]         (6)*
                                DRA [2]            (3)+(4)+(5)
                                AGR [2]            (4)*+(5)
                                lpSHE [10]         (3)+(4)+(5)*
           RM                   lppsRM [18]        (3)+(4)
                                ccRM [14]          (3)+(4)*
IntraDVS   Path-based Method    intraShin [16]     (1)
           Stochastic Method    intraGruian [5]    (2)

† Numbers indicate the corresponding techniques in Table 1. (n)* indicates an improved version of (n).

Table 2 summarizes the DVS algorithms selected for the comparative study. Here, eight InterDVS algorithms are chosen, two [18, 14] of which are based on the RM scheduling policy, while the other six algorithms [18, 14, 2, 10] are based on the EDF scheduling policy.
For IntraDVS algorithms, two algorithms are selected, one from the path-based IntraDVS algorithms [16] and the other from the stochastic methods [5]. In these selected DVS algorithms, one or sometimes more than one of the slack estimation methods explained in the previous section are used.

In lppsEDF and lppsRM, which were proposed by Shin et al. in [18], the slack time of a task is estimated using the maximum constant speed and Stretching-to-NTA methods. The ccRM algorithm proposed by Pillai et al. [14] is similar to lppsRM in the sense that it uses both the maximum constant speed and the Stretching-to-NTA methods. However, while lppsRM can adjust the voltage and clock speed only when a single task is active (Figure 1(a)), ccRM extends the Stretching-to-NTA method to the case where multiple tasks are active (Case II in Figure 1(b)).

Pillai et al. also proposed two other DVS algorithms [14], ccEDF and laEDF, for the EDF scheduling policy. These algorithms estimate the slack time of a task using the utilization updating method. While ccEDF adjusts the voltage and clock speed based on the run-time variation in processor utilization alone, laEDF takes a more aggressive approach by estimating the amount of work required to be completed before NTA.

DRA and AGR, which were proposed by Aydin et al. in [2], are two representative DVS algorithms that are based on the priority-based slack stealing method. The DRA algorithm estimates the slack time of a task using the priority-based slack stealing method along with the maximum constant speed and the Stretching-to-NTA methods. Aydin et al. also extended the DRA algorithm and proposed another DVS algorithm called AGR for more aggressive slack estimation and voltage/clock scaling. In AGR, in addition to the priority-based slack stealing, more slack times are identified by computing the amount of work required to be completed before NTA (Case I in Figure 1(b)).

lpSHE is another DVS algorithm which is based on the priority-based slack stealing method [10].
Unlike DRA and AGR, lpSHE extends the priority-based slack stealing method by adding a procedure that estimates the slack time from lower-priority tasks that were completed earlier than expected. The DRA, AGR, and lpSHE algorithms are somewhat similar to one another in the sense that all of them use the maximum constant speed in the off-line phase and the Stretching-to-NTA method in the on-line phase in addition to the priority-based slack stealing method.

For IntraDVS algorithms, Shin's intra-task DVS algorithm [16] (intraShin) and Gruian's algorithm [5] (intraGruian) are used as representative algorithms of the path-based method and the stochastic method, respectively. (The details of these algorithms were described in Section 2.)

4 Simulation environment

In this section, we describe SimDVS [17], a unified DVS simulation environment, used for the quantitative analysis. In order to support a wide variety of DVS algorithms and simulation scenarios, SimDVS was designed to achieve the following goals: 1) support both IntraDVS and InterDVS algorithms, 2) integrate different DVS algorithms easily, 3) support different task workloads, variations in execution paths taken, and different task set configurations easily, and 4) support different variable-voltage processors easily.

[Figure 2. Overview of the SimDVS simulation environment. Inputs: task set specification, executable program, profile information, machine specification. Output: energy consumption.]

Figure 2 shows an overview of SimDVS, which consists of three main modules: 1) the InterDVS module, 2) the IntraDVS module, and 3) the IntraDVS pre-processing module. SimDVS takes as an input a task set specification for an InterDVS algorithm and a DVS-aware control flow graph (CFG) for an IntraDVS algorithm. The DVS-aware CFG is built from the input binary program. As output, SimDVS reports the energy consumption of the input task set (or the input CFG). The InterDVS module is responsible for the overall operation of SimDVS. It simulates a given task set under the selected scheduling policy using a given slack estimation heuristic. The IntraDVS module simulates IntraDVS algorithms using the Intra-task simulator. The input to the IntraDVS module is pre-processed by the tools available in the IntraDVS pre-processing module. For faster simulations of IntraDVS algorithms, the CFG of the input program is simulated rather than the instructions in the program. For a comparative study, SimDVS supports all the DVS algorithms described in Section 3.

4.1 Submodules of InterDVS module

The InterDVS module, responsible for scheduling tasks, plays the role of a real-time scheduler in a hard real-time system. It takes as an input the specification of a periodic task set.
The task set specification describes the properties of the simulated periodic tasks, such as the period and WCET of each task and the workload variation factors (e.g., the worst case utilization and execution time distribution). To simulate a given InterDVS scheduling algorithm, it has two modules, one for slack estimation and the other for slack distribution. Slack estimation is done by the slack estimation module that computes the total available time of the scheduled task, and the slack distribution is done by the task execution module that determines the operating speed of the scheduled task and simulates the execution of the task instance. To simulate a new InterDVS algorithm, these two modules for the new algorithm need to be added.

Slack estimation module. This module is highly dependent on the simulated target InterDVS algorithm. Therefore, the exact implementation of this module depends on the DVS algorithm. Currently, all the InterDVS algorithms described in Table 2 are supported. In addition, an optimal slack stealing method under EDF scheduling is also supported to evaluate the effectiveness of the slack estimation parts of various InterDVS algorithms. Some DVS algorithms (e.g., [5]) may require off-line pre-processing steps for more efficient on-line slack estimation. In this case, the slack estimation module takes such off-line information as an additional input.

Task execution module. This module has two roles. First, it determines the voltage and clock speed based on the available execution time t for the current task. Using the voltage levels supported by the target machine (specified in the machine specification file), it sets the voltage and clock speed so that the activated task finishes its execution within t time units even in the case where its execution takes WCEP. Second, it simulates the execution of the task. It generates the effective workload of each task based on the input workload variation factor, calculates the elapsed time and the unused time from the assigned available time interval, and reports this timing/speed information to the energy estimation module.
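The first role of the task execution module, choosing a supported level so that the worst case path still fits within t, might look roughly like the sketch below. This is not SimDVS code: the function name and argument layout are hypothetical, and the sketch assumes that execution time scales inversely with the clock speed and that each clock level implies its own supply voltage.

```python
def select_speed(levels_mhz, wcet_at_fmax, available):
    """Pick the lowest supported clock speed that still meets `available`.

    levels_mhz: supported clock speeds (hypothetical machine specification).
    wcet_at_fmax: worst-case execution time measured at max(levels_mhz).
    available: time t granted to the task by the slack estimation step.
    """
    f_max = max(levels_mhz)
    if available <= wcet_at_fmax:
        return f_max  # no slack: run at full speed
    required = f_max * wcet_at_fmax / available
    # Round up to a supported level so the WCEP still finishes within t.
    return min(f for f in levels_mhz if f >= required)
```

With levels of 25, 50, 75 and 100 MHz, a task whose WCET is 3 ms at 100 MHz and which is granted 5 ms needs at least 60 MHz, so the sketch returns 75 MHz. Rounding up rather than to the nearest level is what keeps the deadline guarantee.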
If intra-task scheduling is used, this module calls the Intra-task simulator of the IntraDVS module to simulate intra-task voltage scaling.

Energy estimation module. This module takes the timing and speed information from the task execution module, and computes the energy consumption of the current task execution using the current machine configuration. By default, the energy consumption is estimated based on the equations described in [3]. The current version of SimDVS supports the specifications of XScale [8], AMD's K6-2+ [1], and Crusoe [19] processors.

4.2 Submodules of IntraDVS & its pre-processing modules

The IntraDVS module that contains the intra-task simulator has two roles: it simulates the execution behavior of real applications, and performs intra-task DVS. To reflect the execution behavior of real applications, the CFG generator in the IntraDVS pre-processing module produces CFGs from a SimpleScalar 2.0 [4] binary program. Each node of a CFG is annotated with extra information (e.g., the number of instructions in a basic block) necessary for proper simulation runs. In order to support the simulation of path-based IntraDVS algorithms and stochastic IntraDVS algorithms, voltage scaling locations within a task should be determined during the off-line phase. The following two submodules in the IntraDVS pre-processing module are responsible for this.

Voltage scaler. This module takes the CFG of the target application and extracts the timing information from the CFG. It analyzes the given CFG and computes the predicted remaining execution times from each basic block. Then, it inserts the voltage scaling information at selected scaling points. Finally, the voltage scaler generates the DVS-aware CFG, which includes the voltage scaling information, and passes it to the Intra-task simulator for the path-based IntraDVS.

Speed transition table. To simulate stochastic IntraDVS algorithms, the stochastic data (such as the cumulative distribution function of task execution times) should be collected from profiling. Based on the stochastic data, the speed transition table, which describes when the execution speed is changed to what level, is constructed. Then, the speed transition table is passed to the Intra-task simulator for the stochastic IntraDVS.

5 Experimental results

The DVS algorithms described in Section 3 are evaluated by implementing them in SimDVS and performing experiments with various key parameters that may affect the energy efficiency of the DVS algorithms. Three classes of DVS algorithms were evaluated: InterDVS algorithms, IntraDVS algorithms, and HybridDVS algorithms. For the experiments, the energy consumption model based on the ARM8 microprocessor core is used. The clock speed can be varied in the range of [8, 100] MHz with a step size of 1 MHz and the supply voltage can be varied in the range of [1.1, 3.3] V. We assume that the system enters a power-down mode whenever the system becomes idle and that no energy is consumed in the power-down mode.
We also assume that the voltage scaling overhead is negligible both in the time and the energy consumed.

5.1 Performance evaluation of InterDVS algorithms

The energy efficiency of InterDVS algorithms depends significantly on the accuracy of slack estimation and the appropriateness of slack distribution. To evaluate the effectiveness of the slack estimation method used in each InterDVS algorithm, extensive experiments were performed while varying the number of tasks and the WCPUs of task sets. Then, the energy efficiency of the algorithms was measured while changing the number of available voltage levels, in order to evaluate their adaptability to different machine specifications. Finally, to evaluate the effect of slack distribution methods, experiments were performed while restricting the amount of slack time that a task can utilize.

5.1.1 Number of tasks

To evaluate the impact of the number of tasks on the energy efficiency of DVS algorithms, experiments with various numbers of tasks were performed. For each task set size n (where n = 2, 4, 6, ..., 16), 100 task sets were randomly generated. The period and the WCET of each task were randomly generated using a uniform distribution with the ranges of [10, 100] ms and [1, period] ms, respectively. To eliminate the effect of static slack times, we chose the task sets which have a high worst case processor utilization; WCPUs are equal to 1.0 for EDF InterDVS algorithms and 0.9 for RM InterDVS algorithms. The execution time of each task instance was randomly drawn from a Gaussian distribution (footnote 1) with the range of [WCET/10, WCET] of each task, and the resulting average case processor utilization (ACPU) was set to 0.55.

Figure 3 shows the impact of the number of tasks on the energy consumption. In the figure, the y-axis indicates the energy consumption normalized to that of an application running on a DVS-unaware system with a power-down mode only.
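The random workload described in Section 5.1.1 can be approximated with a few lines of Python. This is only a sketch under stated assumptions: the function names are hypothetical, redrawing out-of-range Gaussian samples is one possible way to restrict the range (the text does not say how the range is enforced), and the selection of task sets with a WCPU of 1.0 or 0.9 is not reproduced here. The mean and standard deviation follow footnote 1.

```python
import random


def generate_task(rng):
    """Draw (period, wcet) in ms from the uniform ranges given in the text."""
    period = rng.uniform(10.0, 100.0)
    wcet = rng.uniform(1.0, period)
    return period, wcet


def draw_execution_time(rng, wcet):
    """Draw one execution time from a Gaussian restricted to [wcet/10, wcet].

    Out-of-range samples are redrawn; this truncation rule is an assumption.
    """
    low, high = wcet / 10.0, wcet
    mean = (low + high) / 2.0   # (WCET/10 + WCET) / 2
    sigma = 0.9 * wcet / 6.0    # range width equals six standard deviations
    while True:
        sample = rng.gauss(mean, sigma)
        if low <= sample <= high:
            return sample
```

Passing a seeded `random.Random` instance keeps runs repeatable, which matters when the same task sets must be replayed under several DVS algorithms.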
As the number of tasks increases, the energy efficiency of lppsEDF, lppsRM, and ccRM, which only use the Stretching-to-NTA technique, does not significantly improve, while that of the other more aggressive InterDVS algorithms improves significantly. This can be explained by the fact that, in the Stretching-to-NTA method, the slack time that can be exploited is limited to the time between the completion of a task instance and the arrival time of the next task instance, which is largely independent of the number of tasks in the system. On the other hand, for the other InterDVS algorithms, since the slack times can be taken from any completed task instance, as the number of tasks increases, each task has more slack sources and can be scheduled with a lowered clock speed. Since the energy efficiency of each InterDVS algorithm is not affected by the number of tasks when there are more than eight tasks, the rest of the experiments were performed using task sets with 8 tasks.

Footnote 1: With the mean m = (WCET/10 + WCET)/2 and the standard deviation σ = (0.9 · WCET)/6.

[Figure 3. Impact of the number of tasks. (a) EDF InterDVS. (b) RM InterDVS. x-axis: number of tasks; y-axis: normalized energy consumption.]

[Figure 4. Impact of WCPU and the number of scaling levels. (a) WCPU. (b) Number of scaling levels. y-axis: normalized energy consumption.]

5.1.2 Worst case processor utilization of task set

When the WCPU of a given task set is less than 1.0, the tasks have inherent static slack times. Figure 4(a) shows the results for varying WCPUs of 8-task task sets. The results indicate that, except for lppsEDF, the energy consumption of InterDVS algorithms increases as a linear function of the WCPU of a task set. For lppsEDF, the energy consumption increases faster than a linear function of the WCPU of a task set. This indirectly indicates that the dynamic slack estimation method of lppsEDF is not very effective. One interesting observation from Figure 4(a) is that lppsEDF shows better energy efficiency than ccEDF when the WCPU is less than 0.7. This is because, in ccEDF, the clock speed is determined using the actual processor utilization (footnote 2) at the scheduling point. Since the actual processor utilization increases when a low-speed task instance completes its execution, the next task instance needs to be executed at a higher speed. Such voltage fluctuation occurs more often as the WCPU decreases. Thus, as the WCPU decreases, the energy efficiency of ccEDF becomes worse than that of lppsEDF. Because of the space limitation, the results for lppsRM and ccRM are not included, but they are very similar to those of lppsEDF.

5.1.3 Machine specification

Variable-voltage processors provide a finite number of voltage levels, from two to as many as 100 levels. To evaluate the impact of the number of scaling levels on the energy efficiency of the InterDVS algorithms, several different machine specifications were tested. In the experiments, when there are k scaling levels, the voltage and the clock speed can be varied with a step size of 92/k MHz within the range of [8, 100] MHz.
Figure 4(b) shows the effect of the number of scaling levels on the energy efficiency of the InterDVS algorithms. As shown, the energy consumption increases as the number of scaling levels decreases. For more aggressive algorithms (e.g., DRA, AGR, laEDF, and lpSHE), the impact of the number of scaling levels is relatively marginal (roughly 8%) compared to that of less aggressive algorithms (e.g., lppsEDF and ccEDF).

Footnote 2: The actual processor utilization U is computed by summing the individual task processor utilizations, i.e., U = Σ_i (c_i / p_i), where p_i is the period of task τ_i and c_i is assumed to be the WCET if τ_i is not completed, and otherwise the actual execution time of τ_i.

5.1.4 Speed bound

In the previous experiments, we assumed the greedy method in the slack distribution. That is, all the slack time identified is given to the current task instance. While the greedy policy is simple, it is not the best one. For example, in aggressive InterDVS algorithms such as laEDF, AGR and lpSHE, slack times may be distributed unevenly among task instances. When the current task instance exhausts its assigned slack time by the greedy distribution policy, task instances that follow may not benefit from slack times at all.

In order to understand the effect of different slack distribution policies, we experimented by varying the amount of usable slack times. In the experiments, we specified the lower bound on the clock speed regardless of available slack times. Figure 5 shows the experimental results for various minimum speeds. In each experiment, it is assumed that the clock speed can be varied within the range of [α · f_max, f_max] with a step size of 1 MHz, where f_max = 100 MHz and α is the speed bound factor. As α becomes larger, the task instances are scheduled with a lowered clock speed less aggressively because the clock scaling is restricted by α · f_max. When α · f_max is close to the lowest possible clock speed of the target machine, it is similar to when the greedy slack distribution is used. The experiments were performed varying α from 0.1 to 0.9.
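The speed bound used in these experiments amounts to clamping whatever speed the slack estimation suggests into [α · f_max, f_max]. A minimal Python sketch follows; the function name and the default f_max of 100 MHz (taken from the setup above) are choices made for this example, not part of SimDVS.

```python
def bounded_speed(requested_mhz, alpha, f_max=100.0):
    """Clamp a requested clock speed to [alpha * f_max, f_max]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return min(f_max, max(alpha * f_max, requested_mhz))
```

With α = 0.5, a greedy request of 20 MHz is raised to 50 MHz, so the slack that the current task instance cannot consume remains available to the instances that follow.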
In Figure 5, the x-axis indicates the speed bound factor α. The energy efficiency of InterDVS algorithms (except for lppsEDF and ccEDF) is generally higher when α values are between 0.3 and 0.5. For example, when the speed bound factor is 0.5 in Figure 5(a), an improvement of 6~11% was achieved over when the greedy policy is used. In Figure 5, it is shown that the energy efficiency of AGR and lpSHE is very close to the theoretical lower bound (footnote 3) when the speed bound factor is near 0.5. In fact, one interesting observation is that for the aggressive InterDVS algorithms, the energy efficiency is highest when the speed bound factor was set to ACPU. This trend can be noted in Figures 5(a) and 5(b).

[Figure 5. Impact of speed bound. (a) Under WCPU=1.0 and ACPU=0.55 (theoretical lower bound = 0.46). (b) Under WCPU=0.6 and ACPU=0.33 (theoretical lower bound = 0.24). x-axis: speed bound factor (α).]

To show the relationship between the speed bound and ACPU, extensive experiments were performed for various task sets while varying ACPU and the scaling bound. Figure 6 shows the results. (Due to the lack of space, only the results for laEDF (an example of aggressive InterDVSs) and ccEDF (an example of non-aggressive InterDVSs) are shown. The results for AGR and lpSHE are very similar to those of laEDF.) The results confirm that when the selected speed bound factor is close to ACPU (= 0.55 · WCPU), the best energy efficiency is achieved for laEDF. For ccEDF, however, this trend does not hold, as we can notice in Figure 6(b).

[Figure 6. Impact of speed bound. (a) Normalized energy consumption of laEDF. (b) Normalized energy consumption of ccEDF. Axes: WCPU and speed bound factor.]

A similar study with the RM InterDVS algorithms shows that the performance gap between the energy efficiency of the RM InterDVS algorithms and that of the theoretical lower bound was roughly 35~40%. This result indicates that there is substantial room for improvement in developing more energy-efficient RM InterDVS algorithms.

Footnote 3: The theoretical lower bound is computed with the complete execution trace information using Yao's algorithm [20].
5.2 Performance evaluation of Intra-Task DVS algorithms

We have evaluated the energy efficiency of intraShin and intraGruian using an MPEG4 video decoder and an MPEG4 video encoder that were previously used in [16]. Both applications were pre-processed for speed/voltage changes using the tools in the IntraDVS preprocessing module described in Section 4.2. For intraGruian, the execution times of both the MPEG4 decoder and encoder were assumed to follow a normal distribution $N_o = N(m_1, (m_2/6)^2)$, where $m_1 = \frac{1}{2}\,\mathrm{WCET}$ and $m_2 = \frac{9}{10}\,\mathrm{WCET}$. For intraShin, we first collected a large number of execution paths; in SimDVS, each execution path can be represented by a pair of parameters [17]. For each execution path, we estimated its energy consumption using the IntraDVS simulator. The overall average energy consumption is computed by taking the weighted average of the estimated energy consumptions, using the execution path distributions used for intraGruian.

Since the energy efficiency of intraGruian largely depends on the slack ratio (see footnote 4) given in the on-line phase and on the accuracy of the execution time distribution used in the off-line profiling, we performed experiments varying these two factors. Figure 7 shows the relative energy consumption ratio of intraGruian over intraShin. If the ratio is larger (smaller) than 1, intraGruian performs better (worse) than intraShin. In Figure 7, the $N_o$ line represents the case in which the actual execution times follow the assumed $N_o$ distribution. The $N_a$, $N_b$ and $N_c$ lines indicate the cases in which the actual execution times follow normal distributions different from the assumed $N_o$, where $N_a = N(m_1, (m_2/5)^2)$, $N_b = N(m_1, (m_2/7)^2)$ and $N_c = N(1.5\,m_1, (m_2/7)^2)$. When the slack ratio is less than 1.2, intraShin outperforms intraGruian because intraShin spends more time in the lower speed region than intraGruian. When the slack ratio is increased, intraGruian spends more time in the lower speed region than intraShin.
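The experimental setup above can be sketched in a few lines. The sketch reads the garbled parameters as m1 = WCET/2 and m2 = 9·WCET/10, which is an interpretation of the text and not a confirmed value. It also assumes that sampled execution times are truncated to the interval (0, WCET]. The function names are hypothetical and do not come from SimDVS.

```python
import random


def make_distributions(wcet):
    """Return (mean, std) for the assumed distribution N_o and the three
    alternative distributions N_a, N_b and N_c."""
    m1 = 0.5 * wcet                      # assumed reading: m1 = WCET / 2
    m2 = 0.9 * wcet                      # assumed reading: m2 = 9 * WCET / 10
    return {
        "N_o": (m1, m2 / 6),
        "N_a": (m1, m2 / 5),             # wider spread than assumed
        "N_b": (m1, m2 / 7),             # narrower spread than assumed
        "N_c": (1.5 * m1, m2 / 7),       # shifted mean
    }


def sample_execution_time(rng, mean, std, wcet):
    """Draw one execution time, redrawing until it lies in (0, wcet].
    The truncation is an assumption of this sketch."""
    while True:
        x = rng.gauss(mean, std)
        if 0.0 < x <= wcet:
            return x


def weighted_average_energy(energies, weights):
    """Weighted average of per-path energy estimates; each weight is the
    relative frequency of the corresponding execution path."""
    total = sum(weights)
    return sum(e * w for e, w in zip(energies, weights)) / total


if __name__ == "__main__":
    rng = random.Random(0)
    wcet = 10.0
    for name, (mean, std) in make_distributions(wcet).items():
        xs = [sample_execution_time(rng, mean, std, wcet) for _ in range(1000)]
        print(name, round(sum(xs) / len(xs), 3))
```

Keeping the assumed distribution (`N_o`) separate from the distribution that actually generates the samples is what allows a profiling mismatch such as `N_c` to be studied.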
Figure 7 also shows that intraShin works better than intraGruian when the distribution of actual execution times is significantly different from the assumed distribution, as shown in the $N_c$ line.

4 The slack ratio is defined as the ratio of WCET to the assigned execution time.

[Figure 7. Energy consumption ratio of intraShin and intraGruian. (a) MPEG4 Decoder. (b) MPEG4 Encoder. x-axis: slack ratio; y-axis: relative energy consumption; lines: N_a, N_o, N_b, N_c.]

5.3 Performance evaluation of hybrid methods

In this section, we investigate whether HybridDVS algorithms perform better than pure IntraDVS algorithms or pure InterDVS algorithms. Although both intraShin and intraGruian can be used for a comparative study, we use intraShin as the base IntraDVS algorithm. This is because intraShin is less likely to generate dynamic slack times, which makes the distinctions among the different HybridDVS methods clearer.

HybridDVS algorithms select either the intra mode or the inter mode when slack times are produced during the execution of the current task instance. In the inter mode, the slack time is used not for the current task instance but for the following task instances. In the intra mode, all the slack times are used for the current task instance, allowing it to execute at a lower speed. Table 3 summarizes the four heuristics [17] for HybridDVS algorithms considered in this section. The heuristics differ in how close they are to the pure IntraDVS approach or the pure InterDVS approach.

Table 3. Four heuristics for HybridDVS algorithms.

Heuristic  Description
H1         Uses the inter mode as default, but uses the intra mode if no activated task instance exists.
H2         Uses the intra mode first, but changes into the inter mode when the current task instance has used a predefined amount of slack time.
H3         Uses the inter mode first, but changes into the intra mode when the unused slack time is more than a predefined amount of slack time.
H4         Alternates the intra mode and the inter mode, keeping the balance of slack consumption in each mode.
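The four heuristics in Table 3 can be read as a mode-selection rule that is applied whenever the current task instance produces slack. The following is a minimal sketch of that reading, not the implementation from [17]. The `SlackState` fields, the single `threshold` parameter standing in for the "predefined amount of slack time", the interpretation of H1 as "no other activated task instance", and the H4 tie-break are all assumptions made for illustration.

```python
from dataclasses import dataclass

INTRA, INTER = "intra", "inter"


@dataclass
class SlackState:
    """Hypothetical run-time state at the moment slack is produced."""
    other_active_instances: int    # activated task instances besides the current one
    slack_used_by_current: float   # slack already consumed by the current instance
    unused_slack: float            # slack produced but not yet consumed
    intra_consumed: float          # total slack consumed so far in intra mode
    inter_consumed: float          # total slack consumed so far in inter mode


def choose_mode(heuristic, state, threshold=1.0):
    """Return INTRA or INTER for the given heuristic name ("H1".."H4")."""
    if heuristic == "H1":
        # Inter by default; intra only if nobody else could use the slack.
        return INTRA if state.other_active_instances == 0 else INTER
    if heuristic == "H2":
        # Intra first; switch to inter once the instance has used enough slack.
        return INTER if state.slack_used_by_current >= threshold else INTRA
    if heuristic == "H3":
        # Inter first; switch to intra when much slack is still unused.
        return INTRA if state.unused_slack > threshold else INTER
    if heuristic == "H4":
        # Keep slack consumption balanced; ties go to intra (assumption).
        return INTRA if state.intra_consumed <= state.inter_consumed else INTER
    raise ValueError(f"unknown heuristic: {heuristic}")


if __name__ == "__main__":
    state = SlackState(other_active_instances=2, slack_used_by_current=0.2,
                       unused_slack=1.5, intra_consumed=3.0, inter_consumed=1.0)
    for h in ("H1", "H2", "H3", "H4"):
        print(h, choose_mode(h, state))
```

Expressed this way, H1 and H3 default to passing slack on to later task instances, while H2 defaults to using it for the current instance.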
We have experimented with the four heuristics in Table 3 using the six EDF InterDVS algorithms and two RM InterDVS algorithms in Table 2. H1 and H3 are close to the pure InterDVS approach, and H2 is close to the pure IntraDVS approach. The performance of HybridDVS algorithms depends on the dynamic slack estimation methods adopted by each InterDVS algorithm. In laEDF, DRA, AGR, and lpSHE, where slack times are identified more aggressively, it is a good idea to pass some (or all) of the slack times produced by the current task instance to the following tasks. However, in lppsEDF/RM and ccEDF/RM, where slack times are identified less aggressively, it is better for the current task instance to utilize most of the slack time generated. Therefore, if HybridDVS is based on laEDF, DRA, AGR, or lpSHE, H1 and H3 are better choices. On the other hand, for lppsEDF/RM and ccEDF/RM, H2 and H4 are better choices.

[Figure 8. Energy efficiency of HybridDVS algorithms. (a) ccEDF. (b) laEDF. x-axis: WCPU.]

Figure 8 shows the energy efficiency of the HybridDVS methods. The graphs show the energy consumption for various WCPUs. As explained before, if a HybridDVS algorithm is based on a non-aggressive InterDVS algorithm, the heuristic H2 gives good results, as shown in Figure 8(a). For an aggressive InterDVS algorithm, H1 and H3 give good results, as shown in Figure 8(b). Although the performance of HybridDVS algorithms also depends on the properties of the task set tested and on the execution time variations, in these experiments HybridDVS algorithms reduce the energy consumption by 5-20% over that of the pure DVS algorithms.

6 Conclusions

We have compared the energy efficiency of recent DVS algorithms for hard real-time periodic tasks. The evaluated DVS algorithms include eight InterDVS algorithms and two IntraDVS algorithms. We also performed experiments with four versions of HybridDVS algorithms. For a fair and efficient comparative study, we have also developed SimDVS, a unified DVS simulation environment.

Our comparative study shows that the existing EDF InterDVS algorithms such as AGR, laEDF and lpSHE are close to optimal; for our test task sets, their power consumption is only 9-12% worse than the theoretical lower bound. We demonstrated that the performance gap from the theoretical lower bound can be further reduced with a more intelligent slack distribution policy. For the RM InterDVS algorithms, however, our study indicates that there is still a significant performance gap from the theoretical lower bound. Our findings therefore strongly suggest that more research should be directed toward developing better RM InterDVS algorithms.

From the evaluation of IntraDVS algorithms, we demonstrated that the two representative IntraDVS algorithms perform quite differently depending on the available slack times. Our study indicates that a HybridDVS algorithm can perform better than a pure IntraDVS algorithm or a pure InterDVS algorithm. However, the differences in energy efficiency depend on the characteristics of both the IntraDVS and the InterDVS components used in the HybridDVS algorithm. One interesting future research topic is to devise an intelligent guideline for selecting the best HybridDVS algorithm for a given task set.

References

[1] AMD Corporation. PowerNow! Technology. http://www.amd.com, December 2000.
[2] H. Aydin, R. Melhem, D. Mosse, and P. M. Alvarez. Dynamic and Aggressive Scheduling Techniques for Power-Aware Real-Time Systems. In Proceedings of IEEE Real-Time Systems Symposium, December 2001.
[3] T. Burd and R. Brodersen. Design Issues for Dynamic Voltage Scaling. In Proceedings of the International Symposium on Low Power Electronics and Design, pages 9-14, July 2000.
[4] D. Burger and T. M. Austin. The SimpleScalar Tool Set, version 2.0. Technical Report 1342, University of Wisconsin-Madison, CS Department, June 1997.
[5] F. Gruian. Hard Real-Time Scheduling Using Stochastic Data and DVS Processors. In Proceedings of the International Symposium on Low Power Electronics and Design, pages 46-51, August 2001.
[6] D. Grunwald, P. Levis, and K. I. Farkas. Policies for Dynamic Clock Scheduling. In Proceedings of the 4th Symposium on Operating Systems Design and Implementation, pages 73-86, October 2000.
[7] I. Hong, G. Qu, M. Potkonjak, and M. B. Srivastava. Synthesis Techniques for Low-Power Hard Real-Time Systems on Variable Voltage Processor. In Proceedings of the IEEE Real-Time Systems Symposium, pages 178-187, December 1998.
[8] Intel Corporation. Intel XScale Technology. http://developer.intel.com/design/intelxscale/, November 2001.
[9] T. Ishihara and H. Yasuura. Voltage Scheduling Problem for Dynamically Variable Voltage Processors. In Proceedings of the International Symposium on Low Power Electronics and Design, pages 197-202, August 1998.
[10] W. Kim, J. Kim, and S. L. Min. A Dynamic Voltage Scaling Algorithm for Dynamic-Priority Hard Real-Time Systems Using Slack Time Analysis. In Proceedings of Design, Automation and Test in Europe (DATE'02), pages 788-794, March 2002.
[11] S. Lee and T. Sakurai. Run-time Voltage Hopping for Low-power Real-Time Systems. In Proceedings of the 37th Design Automation Conference, pages 806-809, June 2000.
[12] W.-S. Liu. Real-Time Systems. Prentice Hall, Englewood Cliffs, NJ, June 2000.
[13] T. Pering and R. Brodersen. Energy Efficient Voltage Scheduling for Real-Time Operating Systems. In Proceedings of the 4th IEEE Real-Time Technology and Applications Symposium, Work in Progress Session, June 1998.
[14] P. Pillai and K. G. Shin. Real-Time Dynamic Voltage Scaling for Low-Power Embedded Operating Systems. In Proceedings of 18th ACM Symposium on Operating Systems Principles (SOSP'01), pages 89-102, October 2001.
[15] T. Sakurai and A. Newton. Alpha-power Law MOSFET Model and Its Application to CMOS Inverter Delay and Other Formulas. IEEE Journal of Solid State Circuits, 25(2):584-594, 1990.
[16] D. Shin, J. Kim, and S. Lee. Intra-Task Voltage Scheduling for Low-Energy Hard Real-Time Applications. IEEE Design and Test of Computers, 18(2):20-30, March 2001.
[17] D. Shin, W. Kim, J. Jeon, J. Kim, and S. L. Min. SimDVS: An Integrated Simulation Environment for Performance Evaluation of Dynamic Voltage Scaling Algorithms. In Proceedings of Workshop on Power-Aware Computer Systems (PACS 2002), February 2002.
[18] Y. Shin, K. Choi, and T. Sakurai. Power Optimization of Real-Time Embedded Systems on Variable Speed Processors. In Proceedings of the International Conference on Computer-Aided Design, pages 365-368, November 2000.
[19] Transmeta Corporation. Crusoe Processor. http://www.transmeta.com, June 2000.
[20] F. Yao, A. Demers, and S. Shenker. A Scheduling Model for Reduced CPU Energy. In Proceedings of the IEEE Foundations of Computer Science, pages 374-382, October 1995.