Reducing Code Size with Run-time Decompression
Charles Lefurgy, Eva Piccininni, and Trevor Mudge
Advanced Computer Architecture Laboratory
Electrical Engineering and Computer Science Dept.
The University of Michigan, Ann Arbor
High-Performance Computer Architecture (HPCA-6), January 10-12, 2000
Motivation

Problem: embedded code size
  - Constraints: cost, area, and power
  - The program must fit in on-chip memory
  - Compilers vs. hand-coded assembly: better portability and lower development cost, but code bloat

Solution: code compression
  - Reduce compiled code size
  - Take advantage of instruction repetition

Implementation questions
  - Hardware or software?
  - Code size?
  - Execution speed?

[Figure: two embedded systems (CPU, I/O, RAM, ROM), one whose ROM holds the original program and one whose ROM holds the compressed program]
Software decompression

Previous work
  - Decompression unit: whole program [Taunton91]
      - No memory savings
  - Decompression unit: procedures [Kirovski97] [Ernst97]
      - Requires a large decompression memory
      - Fragmentation of the decompression memory
      - Slow

Our work
  - Decompression unit: 1 or 2 cache lines
  - Focus on high performance
  - New profiling method
Dictionary compression algorithm

Goal: fast decompression
  - The dictionary contains each unique 32-bit instruction once
  - Program instructions are replaced with short 16-bit indices into the dictionary (a sketch of this step follows)

[Figure: the original program's .text segment of 32-bit instructions (e.g. two copies of "lw r2,r3" and three of "lw r15,r3") vs. the compressed program, whose .text segment holds 16-bit indices (5, 5, 30, 30, 30) and whose .dictionary segment holds the unique 32-bit instructions]
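As a rough illustration of the compression step, here is a minimal C sketch. It is not the paper's actual tool; the linear dictionary search and the 64K-entry limit are simplifying assumptions. It collects unique 32-bit instructions into a dictionary and rewrites the text segment as 16-bit indices:

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_DICT 65536          /* a 16-bit index can name at most 64K entries */

static uint32_t dict[MAX_DICT]; /* one copy of each unique instruction */
static int dict_size = 0;

/* Return the index of insn in the dictionary, adding it if necessary.
 * A linear search keeps the sketch short; a real tool would hash. */
static uint16_t dict_index(uint32_t insn)
{
    for (int i = 0; i < dict_size; i++)
        if (dict[i] == insn)
            return (uint16_t)i;
    if (dict_size >= MAX_DICT)
        abort();                /* too many unique instructions for 16-bit indices */
    dict[dict_size] = insn;
    return (uint16_t)dict_size++;
}

/* Compress n 32-bit instructions into n 16-bit indices. */
void compress(const uint32_t *text, uint16_t *indices, int n)
{
    for (int i = 0; i < n; i++)
        indices[i] = dict_index(text[i]);
}
```

The point is only that every instruction word maps to one short index, so the compressed .text shrinks to half size plus the dictionary.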
Decompression Algorithm

1. An I-cache miss invokes the decompressor (an exception handler)
2. Fetch the index
3. Fetch the dictionary word
4. Place the instruction in the I-cache (special instruction)
   - Writes directly into the I-cache
   - Decompressed instructions exist only in the I-cache

[Figure: processor with I-cache and D-cache; on an I-cache miss the handler reads an index (e.g. 5) from the compressed .text in memory, looks up the instruction (e.g. "add r1,r2,r3") in the dictionary, and writes it into the I-cache]
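The C sketch below shows what such a handler might look like under some simplifying assumptions: icache_store is a hypothetical wrapper for the special store-instruction support, the text segment is assumed to start at address 0, and a 32-byte line holds 8 instructions as in the simulated I-cache.

```c
#include <stdint.h>

#define LINE_WORDS 8   /* 32-byte cache line = 8 instructions */

extern const uint16_t compressed_text[];   /* .text segment of 16-bit indices      */
extern const uint32_t dictionary[];        /* .dictionary segment of 32-bit insns  */

/* Hypothetical primitive: write one decompressed instruction directly into
 * the I-cache at the given address (the special store-instruction). */
void icache_store(uint32_t addr, uint32_t insn);

/* Invoked on an I-cache miss for the line containing miss_addr.
 * Decompresses exactly one cache line; decompressed code lives only in the I-cache. */
void icache_miss_handler(uint32_t miss_addr)
{
    uint32_t line_addr  = miss_addr & ~(uint32_t)(LINE_WORDS * 4 - 1);
    uint32_t first_insn = line_addr / 4;   /* instruction number of the line start */

    for (int i = 0; i < LINE_WORDS; i++) {
        uint16_t index = compressed_text[first_insn + i];  /* step 2: fetch index           */
        uint32_t insn  = dictionary[index];                /* step 3: fetch dictionary word */
        icache_store(line_addr + 4 * i, insn);             /* step 4: place in I-cache      */
    }
}
```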
Overview: CodePack

CodePack (IBM PowerPC)
  - First system with instruction-stream compression
  - Decompresses during an I-cache miss
  - We also implement CodePack decompression in software (software CodePack)

Software CodePack vs. dictionary compression:

                             Dictionary          CodePack
  Codewords (indices)        fixed-length        variable-length
  Decompress granularity     1 cache line        2 cache lines
  Decompression overhead     75 instructions     1120 instructions
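To see why variable-length codewords cost so much more to decode in software than fixed-length indices, here is a toy decoder for a made-up variable-length format. This is illustrative only and is not CodePack's actual encoding: the point is that each codeword must be extracted bit by bit and decoded sequentially, whereas a fixed 16-bit index is simply loaded and used directly.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    const uint8_t *bits;   /* compressed bit stream    */
    size_t         pos;    /* current bit position     */
} bitstream_t;

/* Pull n bits (MSB first) from the stream. */
static uint32_t get_bits(bitstream_t *bs, int n)
{
    uint32_t v = 0;
    for (int i = 0; i < n; i++, bs->pos++) {
        int bit = (bs->bits[bs->pos / 8] >> (7 - bs->pos % 8)) & 1;
        v = (v << 1) | (uint32_t)bit;
    }
    return v;
}

/* Decode one instruction in this toy format: a 2-bit tag selects the payload
 * width. Tags 0-2 are indices into a table of common instructions (the table
 * must hold at least 4096 entries); tag 3 carries a raw 32-bit instruction. */
uint32_t decode_one(bitstream_t *bs, const uint32_t *table)
{
    static const int payload_bits[4] = { 4, 8, 12, 32 };
    uint32_t tag = get_bits(bs, 2);
    uint32_t payload = get_bits(bs, payload_bits[tag]);
    return (tag == 3) ? payload : table[payload];
}
```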
Compression ratio

compression ratio = compressed size / original size

  - CodePack: 55% - 63%
  - Dictionary: 65% - 82%

[Figure: compression ratio (0% to 100%) of Dictionary and CodePack for cc1, ghostscript, go, ijpeg, mpeg2enc, pegwit, perl, and vortex]
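As a purely illustrative worked example with made-up sizes (not benchmark data): a 500 KB program that compresses to 300 KB has a compression ratio of 300 KB / 500 KB = 60%, so a smaller ratio means better compression.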
Simulation environment

SimpleScalar
  - Pipeline: 5-stage, in-order
  - I-cache: 16 KB, 32 B lines, 2-way
  - D-cache: 8 KB, 16 B lines, 2-way
  - Memory: 10-cycle latency, 2-cycle rate
Performance

  - CodePack: very high overhead
  - Reduce the overhead by reducing cache misses

[Figure: Go benchmark, slowdown relative to native code (up to roughly 22x) at 4 KB, 16 KB, and 64 KB I-cache sizes for CodePack, Dictionary, and native code]
Cache miss

Control the slowdown by optimizing the I-cache miss ratio

[Figure: slowdown relative to native code vs. I-cache miss ratio (0% to 8%) for CodePack and Dictionary with 4 KB, 16 KB, and 64 KB I-caches]
Hybrid programs

Selective compression
  - Compress only some procedures
  - Trade size for speed: avoid decompression overhead

Profile methods for choosing what to leave native (see the sketch after this slide)
  - Count dynamic instructions (example: Thumb)
      - Use when compressed code executes more instructions
      - Goal: reduce the number of executed instructions
  - Count cache misses (example: CodePack)
      - Use when compressed code has a longer cache-miss latency
      - Goal: reduce cache-miss latency
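A minimal sketch of how a cache-miss profile could drive selective compression (names and structure are hypothetical, not the paper's implementation): procedures are ranked by attributed I-cache misses, and the hottest ones are left in native code until a size budget is exhausted.

```c
#include <stdlib.h>

typedef struct {
    const char *name;
    long        misses;       /* I-cache misses attributed to this procedure */
    long        size;         /* native code size in bytes                   */
    int         keep_native;  /* output: 1 = leave uncompressed              */
} proc_t;

/* Sort procedures by miss count, highest first. */
static int by_misses_desc(const void *a, const void *b)
{
    long ma = ((const proc_t *)a)->misses, mb = ((const proc_t *)b)->misses;
    return (mb > ma) - (mb < ma);
}

/* native_budget: how many bytes of the program we are willing to leave
 * uncompressed (the size/speed trade-off knob). */
void select_native(proc_t *procs, int n, long native_budget)
{
    long used = 0;
    qsort(procs, n, sizeof *procs, by_misses_desc);
    for (int i = 0; i < n; i++) {
        procs[i].keep_native = (used + procs[i].size <= native_budget);
        if (procs[i].keep_native)
            used += procs[i].size;
    }
}
```

Varying the budget traces out the size/speed curves shown on the following slides.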
Cache miss profiling

  - A cache-miss profile reduces overhead by 50%
  - Loop-oriented benchmarks benefit the most and approach the performance of native code

[Figure: Pegwit (encryption), slowdown relative to native code (1.00 to 1.12) vs. compression ratio (60% to 100%), comparing CodePack with a dynamic-instruction profile against CodePack with a cache-miss profile]
CodePack vs. Dictionary

More compression can also mean better performance
  - CodePack produces smaller code than dictionary compression
  - Even with some procedures left in native code, CodePack remains smaller
  - At equal overall code size, CodePack is faster because it can leave more code native

[Figure: Ghostscript, slowdown relative to native code (0.5 to 4.0) vs. compression ratio (60% to 100%) for CodePack and Dictionary, both using cache-miss profiles]
Conclusions

  - High-performance software decompression is possible
  - Dictionary decompression is faster than CodePack, but its compression ratio is 5-25% worse
  - Hardware support: I-cache miss exception, store-instruction instruction
  - Tune performance by reducing cache misses: cache size, code placement, selective compression
  - Use a cache-miss profile for loop-oriented benchmarks
  - Code placement affects decompression overhead
  - Future work: unify code placement and compression
Web page

http://www.eecs.umich.edu/compress