Design and Simulation of a Pipelined Decompression Architecture for Embedded Systems

Haris Lekatsas, NEC USA; Jörg Henkel, NEC USA; Wayne Wolf, Princeton University

Abstract

In the past, systems utilizing code compression have been shown to be advantageous over traditional systems, especially in terms of a smaller memory requirement. However, in order to take full advantage of other design criteria, like increasing performance and/or minimizing power consumption, the decompression should take place as close as possible to the CPU. We have designed such a decompression unit that, in spite of the higher bandwidth constraints close to the CPU, improves performance and minimizes the power consumption of a whole embedded system. By means of extensive simulations we have designed and eventually sized the various parameters of the decompression engine (number of pipelines, number of pipeline stages, input/output buffer sizes, etc.). As a result, the system's performance is increased by up to 46%. Unlike other approaches, we have implemented our engine as a soft IP core such that it can be used directly within a SOC design without any modification of the CPU architecture.

1 Introduction

Code compression has been studied and deployed extensively during the past decade, relying on data compression techniques which have been around almost as long as computers themselves. However, the focus has mainly been on saving expensive memory, since a compressed, i.e. smaller, program resides in the memory. Typically, instructions are then decompressed on the fly and transferred to an instruction cache from where the CPU finally accesses the already decompressed code. Due to the rapidly increasing sizes of SOCs (Systems on a Chip), which can comprise more than 400 million transistors on a single chip using 0.07 micron technology [5], memory size is no longer the most important design constraint. Rather, with the emergence of complex mobile handheld computing/communication/Internet devices, performance and power are the driving constraints in embedded system design.

We designed an architecture where the decompression unit is located between the cache and the CPU (see Fig. 1). The advantages are apparent: first, the bandwidth on Bus1 is effectively increased, since compressed code is transmitted to the cache rather than larger decompressed code. Assume, for example, a compression ratio of 50% (compressed code size over uncompressed code size); then the number of instruction-related transactions through Bus1 would be cut in half (this assumes that more than one compressed instruction, i.e. two in the example, is packed into one word that is transmitted through the bus). Conversely, we can argue that in the same amount of time twice as many instructions can be transmitted via Bus1. Thus the performance of such a system is increased via the higher bandwidth on Bus1. The power consumption on Bus1 also decreases, because fewer total (instruction-related) transactions imply fewer bus transitions and thus smaller power consumption. A similar argument holds for Bus2 when the decompression unit is close to the CPU, i.e. Bus2 also transmits compressed code.
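To make the bandwidth argument concrete, the following small C sketch reproduces the arithmetic of the example; it is purely illustrative (the function name and the simple word-packing model are ours, not part of the paper's design):

    #include <math.h>
    #include <stdio.h>

    /* With compression ratio r (compressed size / uncompressed size), fetching
     * n 32-bit instruction words requires only about ceil(n * r) transactions
     * on Bus1, assuming several compressed instructions are packed into each
     * bus word as noted in the text. */
    static unsigned bus1_transactions(unsigned n_instr_words, double r)
    {
        return (unsigned)ceil(n_instr_words * r);
    }

    int main(void)
    {
        /* r = 0.5 as in the example: the instruction-related traffic is halved. */
        printf("%u\n", bus1_transactions(1000, 0.5));   /* prints 500 */
        return 0;
    }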
The instruction cache reveals additional benefits: the cache can hold twice as many instructions (in our example). That means the cache effectively appears larger (i.e. double its size in the example). Consequently, time-consuming cache misses occur less often, thus increasing the performance of such a system. In addition, power consumption decreases as well, since far fewer of the power-consuming bus transactions related to cache misses occur. The flip side of the coin is the increased constraints imposed on a decompression engine that is located between cache and CPU, close to the CPU: high-performance CPUs feature a high instruction fetch rate, thus requiring a high bandwidth on Bus2. On the other side, decompression takes time, implying that a decompression unit could slow down the traffic on Bus1. This is the problem we are focusing on in this paper. The question is how we can design a decompression unit that does not slow down the traffic on Bus1 (compared to an identical system using no code compression) and can even increase performance through the effects discussed before. Our proposed architecture consists of four pipelines, I/O buffers, and control logic to minimize penalties during branches, etc. We have simulated this architecture extensively and eventually dimensioned its parameters such that it improves performance without introducing substantial complexity. This paper is a companion paper to our other work [6]. The focus here is on the pipelined architecture.

This paper is structured as follows: Section 2 presents related work in the code compression for embedded systems area. Section 3 describes the architecture, which includes a compressed-code cache. Section 4 explores the design space in order to locate the best parameters for the cache size in the system and presents experimental results. We conclude in Section 5.

Figure 1: Basic system architecture with De-Compression Engine (DCE)

2 Related work

Wolfe et al. [3] were the first to apply code compression to an embedded system design. They compressed cache blocks and used Huffman coding [9] for compression. Dictionary-based code compression approaches [7] have been applied by Lefurgy et al. [4] and by Liao et al. [8]. Dictionary-based approaches are easy to decompress, since they are look-up tables assigning codes to instructions based on the frequency of their occurrence within a program. Yoshida et al. [2] propose a logarithm-based compression scheme where they compress 32-bit instructions to a smaller but fixed bit-size of compressed instructions. Other research has focused on architectural issues. Industrial approaches that modify the processor architecture to accomplish execution of compressed code have been conducted for the MIPS architecture [11] and for the ARM Thumb architecture [12]. Both of these approaches modify the decoding unit of the particular processor. Wolfe et al. [3] focused on a pre-cache architecture, i.e. a stand-alone decompression unit located between main memory and cache. Other approaches, like Benini et al. [10] and Lefurgy et al. [4], do not investigate the impact of a cache. Our approach is new in the sense that it is the first architecture that is a stand-alone implementation located between cache and CPU, in order to maximize the benefits discussed in the introduction. Please note that the focus of this paper is on the design and dimensioning of our decompression unit that is located between cache and CPU. This paper is not about our decompression/compression algorithm (which is described in [6]). Only those aspects of our decompression/compression algorithm that are relevant for the design of the architecture implementing it are briefly explained here.

3 Architecture of the decompression engine

Figure 2: Combined pipeline of decompression engine and CPU

In this section the architecture of our decompression engine is described. As opposed to other approaches using compressed instruction code, our architecture is distinct in that it is:
a) located between cache and CPU;
b) a stand-alone decompression unit (not integrated into the CPU pipeline) and can thus be used in core-based designs with relatively small effort;
c) efficient in all three major design goals, namely increasing performance, decreasing power and reducing the area of the whole system.
Note that we only briefly repeat some major characteristics of our compression/decompression scheme here, as they are important for the following explanations. For our compression and decompression algorithms, please refer to [6]. Instructions are separated into four groups before compression in order to enhance compressibility and ease the decoding effort (i.e. reduce the complexity of the decompression engine). The four groups consist of instructions with immediates (Group 1), branch and call instructions (Group 2), instructions without any immediate fields (Group 3), and finally instructions that are hard to compress and are left uncompressed (Group 4). For each instruction group a pipeline has been designed within the decompression engine.
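As an illustration of this grouping step, the sketch below classifies a 32-bit SPARC word using the standard V8 instruction formats; the exact rules used by our tool chain are not spelled out in this paper, and the poorly_compressible flag merely stands in for whatever statistic the compressor uses to leave an instruction uncompressed (Group 4):

    #include <stdint.h>

    typedef enum { G1_IMMEDIATE, G2_BRANCH_CALL, G3_NO_IMMEDIATE, G4_UNCOMPRESSED } insn_group_t;

    /* Illustrative classifier: op (bits 31:30) selects the SPARC V8 format,
     * op2 (bits 24:22) separates branches from SETHI in format 2, and the
     * i bit (bit 13) selects the 13-bit immediate in format 3. */
    insn_group_t classify(uint32_t insn, int poorly_compressible)
    {
        uint32_t op   = insn >> 30;
        uint32_t op2  = (insn >> 22) & 0x7u;
        uint32_t ibit = (insn >> 13) & 0x1u;

        if (poorly_compressible)                 /* Group 4: left uncompressed */
            return G4_UNCOMPRESSED;
        if (op == 1)                             /* format 1: CALL */
            return G2_BRANCH_CALL;
        if (op == 0)                             /* format 2: Bicc/FBfcc/CBccc or SETHI */
            return (op2 == 2 || op2 == 6 || op2 == 7) ? G2_BRANCH_CALL : G1_IMMEDIATE;
        return ibit ? G1_IMMEDIATE : G3_NO_IMMEDIATE;   /* format 3: i bit selects imm13 */
    }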
Fig. 2 shows the four pipelines (plus the CPU pipeline stages for later discussion). It shows the pipe for Group 1, the pipeline for Group 2 (branch unit, BU), the pipeline for Group 3 (fast dictionary, FD), and the bypass for uncompressed instructions. The gray-shaded path represents the critical path, i.e. the path with the largest combined number of pipeline stages in both the decompression unit and the CPU. Within the decompression unit the longest path goes through the G1 pipeline. Also note that the buffers (I-Buf and O-Buf) do not matter in terms of delay, since data at any location within a buffer can be accessed directly (it is not a FIFO), and writing to and reading from the buffers is done as part of the adjacent pipeline stage. Since we have a maximum of three stages and each stage is designed to complete in one cycle, in the worst case a (decompressed) instruction word will not be produced in every cycle. But note (as we will show later) that all four pipelines can execute at the same time. It might occur that an instruction instr_b that follows an instruction instr_a in the program is decoded faster and thus becomes available earlier at the exit of the decompression unit, since, for example, instr_b belongs to Group 1 and instr_a belongs to Group 2. As a result, the decompressed instructions in the output buffer O-Buf may be out of order.

Though additional complexity is imposed on the design of the DCE (as we will discuss later), this characteristic, the parallel execution of the various pipelines, is a major design feature that enables high throughput and keeps the CPU almost always busy (we discuss exceptions later on), i.e. it prevents stalling. In fact, the CPU is fed with more instructions per cycle compared to a system not using compression. One effect that helps is the increased bandwidth on Bus1 and Bus2 mentioned in the introduction. Details are discussed in the following sections.

3.1 Input buffer algorithms

Figure 3: Input buffer strategies (perfect vs. greedy scheduling of a 32-bit bus word)

In the following we will use the SPARC instruction set. Compressed instructions are fed to the decompression engine through data bus Bus1 (Fig. 1). In each cycle 32 bits of compressed code are written into the input buffer. As should be clear from the previous section, these compressed 32 bits can correspond to anything from a portion of a single uncompressed instruction up to multiple uncompressed instructions (note that uncompressed instructions occupy more than 32 bits, since the tag code 101 is appended to them). During the first stage of the decompression engine's pipeline the tag bits are read and one or more compressed instructions are sent to the appropriate paths. A buffer is needed because in some cases there will not be any whole instruction for the decoder to send to its path, and the 32 bits must be stored until the next cycle, when the rest of the instruction arrives. This is essentially the uncompressed-instruction case (Group 4). Another reason for having a buffer is that if the pipeline (where instructions pass through multiple times to be decompressed) is full, or if more than one instruction from the same group arrives from the cache, the instructions must be buffered until there is an empty slot in the pipeline.

Designing the input buffer is tricky; a simple circular buffer would be ideal and can be implemented very efficiently in hardware. However, this buffer cannot be a plain circular buffer, because it is possible for two instructions of the same kind to appear in the same cycle. Fig. 3 illustrates this problem: suppose the current bus transaction brings three compressed instructions, two of Group 1 and one of Group 2. A circular buffer would use two pointers: the read pointer to point to the next location to be read, and the write pointer to point to the location that must be written next. A power-of-two buffer size is helpful because we can avoid unnecessary comparisons and make the pointers wrap at the boundaries with a simple AND operation.

Fig. 3 illustrates two different cases and shows what happens when two instructions of the same kind appear in the same cycle. As seen in the figure, when two instructions of the same kind appear, followed by another instruction, a gap appears in the input buffer, since the second C1 instruction cannot be scheduled in the current cycle. To avoid increasing the complexity of the input buffer by introducing extra pointers, we propose a greedy algorithm: if a second instruction of the same kind appears in the buffer, no more instructions are scheduled. That means that in Fig. 3 instruction C2 will not be scheduled in the current cycle, because the second C1 stops any further scheduling.
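A minimal C sketch of this greedy policy is shown below; the buffer layout and the issue_to_pipeline() hook are hypothetical, but the stopping rule (halt as soon as a second instruction of an already-scheduled group is seen) follows the description above:

    #include <stdbool.h>

    #define NUM_GROUPS 4
    #define IBUF_SIZE  8                        /* power of two: pointers wrap with a mask */

    typedef struct { int group; unsigned pc; } cinsn_t;   /* one compressed instruction */

    typedef struct {
        cinsn_t  slot[IBUF_SIZE];
        unsigned rd, wr;                        /* free-running read/write pointers */
    } ibuf_t;

    static void issue_to_pipeline(const cinsn_t *c) { (void)c; /* hand-off not modeled */ }

    /* Greedy scheduling: stop at the first instruction whose group was already
     * scheduled this cycle, so the buffer stays a simple circular queue with
     * no gaps and no extra pointers. */
    void schedule_greedy(ibuf_t *b)
    {
        bool used[NUM_GROUPS] = { false };

        while (b->rd != b->wr) {
            cinsn_t *c = &b->slot[b->rd & (IBUF_SIZE - 1)];
            if (used[c->group])
                break;                          /* second instruction of the same kind */
            used[c->group] = true;
            issue_to_pipeline(c);
            b->rd++;
        }
    }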
3.2 Handling branches

Figure 4: Pipeline stages over time: example where a branch is not taken

As shown in Fig. 2, the effective pipeline is six stages long in the worst case, comprising the Group 2 pipeline of the DCE plus the three pipeline stages of the CPU. This makes the system especially sensitive in case a branch or call etc. is not taken. We explain some of the effects and discuss how we resolved the problem of flushing the whole combined pipeline. Remember that performance is key in order to make the DCE useful for an embedded system design, as discussed earlier. Also note that our goal is not to modify an already optimized architecture like the CPU. Hence, the only means to guarantee the performance constraints is the design of the DCE architecture itself. Whether a branch is taken or not is decided in the ID stage of the CPU. The following code includes a conditional branch (be):

    12288: be    0x12298
    1228c: mov   %g1, %o0
    12290: call  atexit
    12294: nop
    12298: sethi %hi(0x2d400), %o0
    1229c: call  atexit

None of the involved instructions uses the longest path through the DCE (this is a demonstrative example only; a similar discussion can be given for the case where pipe G1 is used). In the case shown in Fig. 4 the branch is taken and only three instructions need to be flushed from the pipelines, i.e. mov, call, nop.

Figure 5: Pipeline stages over time: example where a branch is taken

The whole unit consisting of DCE and CPU can host up to eight instructions (i.e. the total number of combined pipeline stages) plus the number of already decompressed instructions (if any) residing in the buffer between DCE and CPU. In fact, we only need three cycles to get the new branch target into the EX stage of the CPU. This is because of an architectural feature in the DCE called ITT (Instruction Trace Tagging). The ITT keeps track of any instruction that is in any stage of the DCE. It is a set of 4 bits accompanying each instruction on its way through the DCE. In case a branch (or call, jmp) is taken, the CU (Control Unit) of the DCE finds the target (if it is in the DCE) and avoids flushing the whole DCE. Rather, it selectively flushes only those instructions that will not be used and preserves the others. This can save time, as such small branches appear frequently in our benchmark programs. A comparison shows that the given sequence needs seven cycles to execute when the branch is not taken, compared to nine cycles when the branch is taken (Fig. 5). As a result, we have a delay of only two cycles even though we have a very long combined (DCE plus CPU) pipeline.

3.3 The DCE pipeline in detail

As mentioned before, instructions are separated into 4 groups. Instructions in Group 1 generally take longer to decompress than instructions in the other classes, since our compression/decompression algorithm (SAMC) decodes on average about 6-8 bits/cycle [6]. This is due to a compromise between performance and implementation complexity. Since Group 1 (i.e. SAMC) instructions are the bottleneck, we designed our DCE such that subsequent instructions are decompressed in parallel as long as they do not belong to the same group and are thus decompressed in different pipelines. In each of the 4 instruction groups the output is written to either buf1 or buf2 (see Figure 6). Groups 2, 3 and 4 always write 32 bits (one decoded instruction) to their local buffer. Local buf1 has enough capacity to handle multiple 32-bit instructions. It must have a storage capacity of at least 4 (since 4 is the number of parallel pipelines) so that the highest-throughput case (i.e. each pipeline produces an instruction) can be handled.

Figure 6: Combined pipeline of decompression engine and CPU (block diagram: instruction decoder, decompression table, branch unit, fast dictionary table, local buffers buf1 and buf2, priority MUX, out_buf, and the Controller with its anull, instr_order, stall_sig and cpu_addr signals)

Among other things, the Controller (see Figure 6) is used to reorder the results in the output buffers so that the processor can collect the decompressed instructions in the right order. This is mandatory because Group 1 instructions generally take longer to decompress (see above); hence instructions are produced out of order. For that reason, each instruction is assigned a set of 4 bits (ITT: Instruction Trace Tagging), a number which is used by the Controller to decide which one should be written to out_buf. This is controlled by the instr_order signal (see Figure 6). The Controller also sends the appropriate signals to the various sub-units carrying out decompression. Depending on the bits in the input buffer, one of 4 paths is chosen. In case of a branch the anull signal is set by the branch unit, which may result in flushing.
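The reordering decision made by the Controller, as described above, can be sketched as follows; this is a simplified software model (the slot structure and function names are ours), not the actual hardware:

    #include <stdbool.h>
    #include <stdint.h>

    /* One decompressed instruction waiting in a local buffer (buf1 or buf2),
     * carrying its 4-bit program-order tag (the ITT / priority number). */
    typedef struct { uint32_t word; uint8_t tag; bool valid; } slot_t;

    /* Forward to out_buf the entry whose tag matches the next program-order
     * number expected by the CPU; if it is not present yet, nothing is sent
     * (the in-order result is still in a pipeline). */
    bool select_next(slot_t *slots, int nslots, uint8_t expected_tag, uint32_t *out)
    {
        for (int i = 0; i < nslots; i++) {
            if (slots[i].valid && slots[i].tag == expected_tag) {
                *out = slots[i].word;
                slots[i].valid = false;          /* free the slot for a new result */
                return true;                     /* drives the instr_order / MUX select */
            }
        }
        return false;
    }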
The stall signal is also generated if the Group 1 pipeline or buf2 is full. In such cases, the input buffer is not updated with the compressed code sent via the bus. The pseudo-code in Figure 7 illustrates the input buffer's operation. The function schedule() directs the compressed instructions from the input buffer into the appropriate pipelines. The function delete_from_pipes() is responsible for flushing from the pipelines all instructions whose addresses equal the PCs given as its arguments. In general, the DCE may stall for the following reasons:

1. The latch storing SAMC instructions is full and cannot accept any more words from the input buffer (all stages are busy). If it is full, a stall signal is used to tell the CPU to stop providing addresses, and nothing is written to the input buffer.

2. Buffer 2, which stores the results of Groups 2, 3 and 4, may be full. A stall signal prevents any further writing and fetching from the cache until an empty slot is available.

3. A branch instruction has altered the control flow and some entries need to be flushed. For the SPARC architecture, this is typically the case where an instruction has been inserted in the delay slot and must be flushed due to a change in program flow.
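Taken together, the stall decision amounts to an OR of a few status flags; a minimal sketch, with hypothetical names standing in for the corresponding hardware signals:

    #include <stdbool.h>

    typedef struct {
        bool g1_pipe_full;     /* reason 1: SAMC (Group 1) latch/pipeline cannot accept words */
        bool buf2_full;        /* reason 2: buffer holding Group 2/3/4 results is full        */
        bool branch_flush;     /* reason 3: a change of flow forces entries to be flushed     */
    } dce_status_t;

    bool dce_stall(const dce_status_t *s)
    {
        return s->g1_pipe_full || s->buf2_full || s->branch_flush;
    }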

    for (;;) {
        if (ibuffer_not_full())
            read_from_cache(new_pc);
        if (new_pc != pc + 4) {                      // we have a branch
            delete_from_pc_history_table();
            if (new_pc > pc)                         // forward branch
                delete_from_pipes(pc+8 ... target_pc-4);
            else
                delete_from_pipes(all pc's > pc+4);
        }
        schedule();
        if (obuffer_not_full())
            write_to_obuffer();
        if (earliest pc in history table is in obuffer) {
            delete_from_obuffer(earliest_pc);        // CPU gets it
            delete_from_history_table(earliest_pc);
        }
    }

Figure 7: Pseudo-code of the Controller Unit

The Controller is shown in Figure 6. Its inputs are the address requested by the CPU and the annulling bit from the branch unit. The Controller outputs a stall signal for the CPU, a signal which contains the priority number for each instruction, and the instruction-order signal which tells the MUX which local output buffer result should be used. When an instruction is retrieved from the input buffer, a number (priority) is assigned to it which corresponds to the order required by the processor. This is needed because decompression time is variable, and later instructions may finish before earlier ones.

4 Experimental results

In this section we discuss some of the design issues and present experimental results. The architecture we have used consists of main memory, a cache, a decompression engine and the CPU. Furthermore, the decompression engine is placed as a stand-alone unit (soft core). This means that all levels of the memory hierarchy store compressed code, allowing us to benefit from reduced cache misses. Fig. 8 shows miss ratios of one of our applications for different cache sizes. The figure shows that code compression can have a big impact on the cache size chosen for a given application. The question is which cache size will give us the best performance while not consuming too much area. In terms of cache miss ratios, a cache size of 1024 seems to be the best choice. There is no change to the cache design in order to store compressed code. The cache is a regular cache that can be accessed on word boundaries only. Thus a compressed instruction will in general be stored at an arbitrary byte boundary in the cache and can possibly span two cache lines. Therefore, one instruction can cause, in the worst case, two cache misses. We take this effect into account in our experiments.

Figure 8: Cache miss ratios for MPEG (miss ratio vs. cache sizes of 128 to 16384 bytes; compressed vs. non-compressed)

Figure 9: CPI for MPEG (CPI vs. cache size; compressed vs. non-compressed)

Fig. 9 shows CPI results for the MPEG application for different cache sizes. Clearly, a cache size of 1 or 2K will give an advantage to the compressed architecture, as there is a significant difference in cache miss ratios. However, a larger cache will result in a lower cache miss ratio for the non-compressed architecture as well, making it a better choice. Since we are designing an embedded system, such cache sizes may not make sense, as a small chip may be a major design goal. From Figures 8 and 9 it is clear that the best design point to achieve our goals would be a cache size of 2048 bytes (a 1024-byte cache gives even better results; however, it is not a good design choice for a non-compression architecture). Similar graphs can be compiled for other applications. Table 2 shows the compression results for diverse applications. CR denotes compression ratio and G1 denotes Group 1 instructions, etc. We use compression ratios defined as the compressed size over the original instruction segment size. Hence, smaller numbers mean higher compression.
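Two small helper functions, given here only to mirror the definitions used in this section (the names are ours): the first computes the compression ratio exactly as defined above, and the second captures why a byte-aligned compressed instruction can cost up to two cache misses:

    /* Compression ratio as defined in the text: compressed size over original
     * instruction segment size, so smaller is better. */
    double compression_ratio(unsigned long compressed_bytes, unsigned long original_bytes)
    {
        return (double)compressed_bytes / (double)original_bytes;
    }

    /* A compressed instruction stored at an arbitrary byte offset may straddle
     * a cache-line boundary and thus cause up to two misses. */
    int spans_two_lines(unsigned byte_addr, unsigned insn_bytes, unsigned line_bytes)
    {
        return (byte_addr / line_bytes) != ((byte_addr + insn_bytes - 1) / line_bytes);
    }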
Note that the overall compression ratio is not the average of the compression ratios of branches (Group 2), instructions with immediate fields (Group 1), and instructions with no immediate fields (Group 3), since these groups have different percentages in our applications. In particular, we found that around 53.0% of the instructions belong to Group 1, 26.7% to Group 2, 19.7% to Group 3, and finally 0.6% to the fourth group (see Table 3). Furthermore, the overall compression ratio takes into account the byte alignment of branch targets.

Table 1: Pipeline statistics

    Application   # Scheduled instructions   Compressed I-fetches   G1 usage   G2 usage   G3 usage
    i3d                    19911                    11990              10941       6167       2803
    diesel                 34368                    19743              17661       9283       7424
    key                  9849864                  5894826            6047993    1633457    2168414
    smo                  1716150                   976439             954312     290985     480853
    trick                 520860                   281544             239706     125906     155248

Table 2: Compression results

    Application   G1     G2     G3     Overall CR
    i3d           0.53   0.50   0.34   0.53
    cpr           0.58   0.53   0.34   0.56
    diesel        0.53   0.50   0.34   0.52
    key           0.57   0.50   0.34   0.55
    mpeg          0.56   0.50   0.34   0.55
    smo           0.55   0.53   0.34   0.54
    trick         0.58   0.49   0.34   0.54

Table 3: Number of instructions in each group

    Benchmark   G1 [%]   G2 [%]   G3 [%]   G4 [%]
    i3d          53.5     28.8     16.0     0.7
    cpr          50.6     32.3     17.0     1.1
    diesel       51.0     28.9     20.0     0.1
    key          57.7     19.9     21.7     0.7
    mpeg         57.9     23.9     17.3     0.9
    smo          51.1     26.3     22.2     0.4
    trick        48.9     26.9     23.7     0.5

Finally, in Table 1 we show some pipeline statistics for our applications. The number of compressed I-fetches shows the gain over the total number of scheduled instructions: we observe fewer fetches because each instruction fetch now contains, on average, more than one instruction. Note that the total number of scheduled instructions is equal to the total number of instructions in the trace. The other columns show the total number of cycles each pipeline was busy. As G1 instructions are the most common, the G1 pipeline is the most frequently used. Clearly there is some redundancy, as in some cycles certain pipes will stay idle.

5 Conclusions

We presented and discussed the design of a pipelined decompression architecture for embedded systems. In addition to a decreased main memory size, we especially achieve performance gains for the whole embedded system. This is achieved by carefully selecting the cache size and adapting the parameters of the decompression architecture accordingly. It is the first decompression architecture that improves all three major design constraints: performance, power and area. For example, we achieve a performance increase of up to 46%. Our decompression architecture is only useful in embedded systems where the software running on the system is known a priori.

References

[1] T.M. Kemp, R.K. Montoye, J.D. Harper, J.D. Palmer and D.J. Auerbach, A Decompression Core for PowerPC, IBM Journal of Research and Development, vol. 42(6), pp. 807-812, November 1998.

[2] Y. Yoshida, B.-Y. Song, H. Okuhata and T. Onoye, An Object Code Compression Approach to Embedded Processors, Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), pp. 265-268, ACM, August 1997.

[3] A. Wolfe and A. Chanin, Executing Compressed Programs on an Embedded RISC Architecture, Proc. 25th Annual International Symposium on Microarchitecture, pp. 81-91, Portland, OR, December 1992.

[4] C. Lefurgy, P. Bird, I. Cheng and T. Mudge, Improving Code Density Using Compression Techniques, Proc. of the 30th Annual International Symposium on Microarchitecture, pp. 194-203, December 1997.

[5] TI's 0.07 Micron CMOS Technology Ushers In Era of Gigahertz DSP and Analog Performance, Texas Instruments, published on the Internet, http://www.ti.com/sc/docs/news/1998/98079.htm, 1998.

[6] H. Lekatsas, J. Henkel and W. Wolf, Code Compression for Low Power Embedded System Design, Submitted for publication, Design Automation Conference 2000.

[7] T.C. Bell, J.G. Cleary and I.H. Witten, Text Compression, Prentice Hall, New Jersey, 1990.
[8] S.Y. Liao, S. Devadas and K. Keutzer, Code Density Optimization for Embedded DSP Processors Using Data Compression Techniques, Proceedings of the 1995 Chapel Hill Conference on Advanced Research in VLSI, pp. 393-399, 1995.

[9] D.A. Huffman, A Method for the Construction of Minimum-Redundancy Codes, Proceedings of the IRE, vol. 40, pp. 1098-1101, September 1952.

[10] L. Benini, A. Macii, E. Macii and M. Poncino, Selective Instruction Compression for Memory Energy Reduction in Embedded Systems, IEEE/ACM Proc. of the International Symposium on Low Power Electronics and Design (ISLPED '99), pp. 206-211, 1999.

[11] K.D. Kissell, MIPS16: High-Density MIPS for the Embedded Market, Silicon Graphics Group, 1997.

[12] Advanced RISC Machines Ltd., An Introduction to Thumb, March 1995.