Reducing Code Size with Run-time Decompression

Similar documents
Fast Software-managed Code Decompression

Reducing Code Size with Run-time Decompression

Evaluation of a High Performance Code Compression Method

Compact Binaries with Code Compression in a Software Dynamic Translator

Profile-driven Selective Code Compression

Compact Binaries with Code Compression in a Software Dynamic Translator

Instruction Cache Compression for Embedded Systems by Yujia Jin and Rong Chen

GOLOMB Compression Technique For FPGA Configuration

EMBEDDED computing systems are space and cost sensitive.

Efficient Placement of Compressed Code for Parallel Decompression

An Efficient Code Compression Technique using Application-Aware Bitmask and Dictionary Selection Methods

The Implementation and Evaluation of Dynamic Code Decompression Using DISE

A Study on Algorithm for Compression and Decompression of Embedded Codes using Xilinx

MEMORY is one of the key driving factors in embeddedsystem

Reduction of Bitstream Transfer Time in FPGA

MEMORY is one of the most constrained resources in an

EMBEDDED systems have become more and more important

Design and Simulation of a Pipelined Decompression Architecture for Embedded Systems

Energy Savings Through Compression in Embedded Java Environments

A Novel Decode-Aware Compression Technique for Improved Compression and Decompression

Fast Floating Point Compression on the Cell BE Processor

International Journal of Engineering Trends and Technology (IJETT) Volume 18 Number2- Dec 2014

Matrix-based software test data decompression for systems-on-a-chip

Computing s Energy Problem:

A Hybrid Code Compression Technique using Bitmask and Prefix Encoding with Enhanced Dictionary Selection

COMPRESSION OF FPGA BIT STREAMS USING EFFECTIVE RUN LENGTH ENCODING TECHIQUES AND ITS PERFORMANCE ESTIMATION

AccuRAID iscsi Auto-Tiering Best Practice

AddressBus 32. D-cache. Main Memory. D- Engine CPU. DataBus 1. DataBus 2. I-cache

Code Compression for Low Power Embedded System Design

Compression of FPGA Bitstreams Using Improved RLE Algorithm

An Architecture for Combined Test Data Compression and Abort-on-Fail Test

Persistent Memory Performance Benchmarking & Comparison. Eden Kim, Calypso Systems, Inc. John Kim, Mellanox Technologies, Inc.

A Space-Efficient On-Chip Compressed Cache Organization for High Performance Computing 1

An Architecture of Embedded Decompressor with Reconfigurability for Test Compression

Spacecraft Simulation Tool. Debbie Clancy JHU/APL

Satoshi Yoshida and Takuya Kida Graduate School of Information Science and Technology, Hokkaido University

CS 341 Computer Architecture and Organization. Lecturer: Bob Wilson Cell Phone: or

WHEN WILL YOUR MULTI-TERABYTE IMAGERY STOP REQUIRING YOU TO BUY MORE DATA STORAGE?

Decompression Method For Massive Compressed Files In Mobile Rich Media Applications

COMPRESSORS WITH SIDE STREAM

1.1 The size of the search space Modeling the problem Change over time Constraints... 21

Image compression: ER Mapper 6.0 ECW v2.0 versus MrSID 1.3

Design and Evaluation of Adaptive Traffic Control System for Heterogeneous flow conditions

EVOLVING HEXAPOD GAITS USING A CYCLIC GENETIC ALGORITHM

The Constrained Ski-Rental Problem and its Application to Online Cloud Cost Optimization

Solving the problem of serving large image mosaics. Using ECW Connector and Image Web Server with ArcIMS

IAC-06-D4.1.2 CORRELATIONS BETWEEN CEV AND PLANETARY SURFACE SYSTEMS ARCHITECTURE PLANNING Larry Bell

Instructors: Randy H. Katz David A. PaGerson hgp://inst.eecs.berkeley.edu/~cs61c/fa10. Fall Lecture #39. Agenda

Decompression of run-time compressed PE-files

Solving MINLPs with BARON. Mustafa Kılınç & Nick Sahinidis Department of Chemical Engineering Carnegie Mellon University

CPE/EE 427, CPE 527 VLSI Design I L21: Sequential Circuits. Review: The Regenerative Property

VLSI Design I; A. Milenkovic 1

Word-based Statistical Compressors as Natural Language Compression Boosters

Fingerprint Recompression after Segmentation

Out-of-Core Cholesky Factorization Algorithm on GPU and the Intel MIC Co-processors

Policy Gradient RL to learn fast walk

UNIVERSITY OF WATERLOO

Computational Challenges in Cold QCD. Bálint Joó, Jefferson Lab Computational Nuclear Physics Workshop SURA Washington, DC July 23-24, 2012

- 2 - Companion Web Site. Back Cover. Synopsis

Design and Evaluation of Adaptive Traffic Control System for Heterogeneous flow conditions. SiMTraM & CosCiCost2G

Simulation Model Portability 2 standard support in EuroSim Mk4

Strategy, Developments & Outlook SESP September 2010 ESTEC, Noordwijk, The Netherlands

Delta Compressed and Deduplicated Storage Using Stream-Informed Locality

Analyzing the Processor Bottlenecks in SPEC CPU 2000

DESIGN OF SPECIALIST BAROCHAMBERS FOR THE STUDY OF BAROTRAUMA

Arithmetic Coding Modification to Compress SMS

The Mechanical Advantage

Functional safety. Functional safety of Programmable systems, devices & components: Requirements from global & national standards

NNCI ETCH WORKSHOP SI DRIE IN PLASMATHERM DEEP SILICON ETCHER. Usha Raghuram Stanford Nanofabrication Facility Stanford, CA May 25, 2016


Light Loss-Less Data Compression, With GPU Implementation

CONSUMER MODEL INSTALLATION GUIDE

CS3350B Computer Architecture. Lecture 6.2: Instructional Level Parallelism: Hazards and Resolutions

Next Generation Life Support (NGLS): Variable Oxygen Regulator Element

A Novel Gear-shifting Strategy Used on Smart Bicycles

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

International olympiads in Informatics in Kazakhstan. A. Iglikov Z. Gamezardashvili B. Matkarimov

: A WEB-BASED SOLVER FOR COMPRESSIBLE FLOW CALCULATIONS

Functional Development Process of the electric Anti-Roll-Stabilizer ears. Dipl. Ing. Daniel Lindvai-Soos Dr. techn.

Design Strategies for ARX with Provable Bounds: SPARX and LAX

Constraining a global, eddying, ocean and sea ice model with scatterometer data

Adiabatic Switching. A Survey of Reversible Computation Circuits. Benjamin Bobich, 2004

unsignalized signalized isolated coordinated Intersections roundabouts Highway Capacity Manual level of service control delay

A New Reference Frame Recompression Algorithm and Its VLSI Architecture for UHDTV Video Codec

MICROPROCESSOR ARCHITECTURE

Software Design of the Stiquito Micro Robot

The Incremental Evolution of Gaits for Hexapod Robots

TAKING IT TO THE EXTREME The WAGO-I/O-SYSTEM for extreme Applications

Engineering Note. Algorithms. Overview. Detailed Algorithm Description. NeoFox Calibration and Measurement. Products Affected: NeoFox

THE CANDU 9 DISTRffiUTED CONTROL SYSTEM DESIGN PROCESS

Benefits of Detailed Compressor Modeling in Optimizing Production from Gas-Lifted Fields. Manickam S. Nadar Greg Stephenson

215 Liquid Handler Up-Grade Manual

SUPPLEMENT MATERIALS

The system design must obey these constraints. The system is to have the minimum cost (capital plus operating) while meeting the constraints.

#19 MONITORING AND PREDICTING PEDESTRIAN BEHAVIOR USING TRAFFIC CAMERAS

Industrial Compressor Controls Standard Custom

The Race Director. Race Director Go [RACE DIRECTOR GO] This document describes the implementation of Race Director Go with a beginning to end example.

IA-64: Advanced Loads Speculative Loads Software Pipelining

IEEE TRANSACTIONS ON COMPUTERS, VOL.54, NO.11, NOVEMBER The VPC Trace-Compression Algorithms

2007 Gas-Lift Workshop

Transcription:

Reducing Code Size with Run-time Decompression Charles Lefurgy, Eva Piccininni, and Trevor Mudge Advanced Computer Architecture Laboratory Electrical Engineering and Computer Science Dept. The University of Michigan, Ann Arbor High-Performance Computer Architecture (HPCA-6) January 10-12, 2000

Motivation Problem: embedded code size Constraints: cost, area, and power Fit program in on-chip memory Compilers vs. hand-coded assembly Portability Development costs Code bloat Solution: code compression Reduce compiled code size Take advantage of instruction repetition Implementation Hardware or software? Code size? Execution speed? CPU I/O RAM Original Program CPU I/O RAM ROM Program ROM Compressed Program Embedded Systems 2

Software decompression Previous work Decompression unit: whole program [Tauton91] No memory savings Decompression unit: procedures [Kirovski97][Ernst97] Requires large decompression memory Fragmentation of decompression memory Slow Our work Decompression unit: 1 or 2 cache-lines High performance focus New profiling method 3

Dictionary compression algorithm Goal: fast decompression Dictionary contains unique instructions Replace program instructions with short index 32 bits 16 bits 32 bits lw r2,r3 5 lw r2,r3 lw r2,r3 5 lw r15,r3 lw r15,r3 lw r15,r3 30 30.dictionary segment lw r15,r3 30.text segment Original program.text segment (contains indices) Compressed program 4

Decompression Algorithm 1. I-cache miss invokes decompressor (exception handler) 2. Fetch index 3. Fetch dictionary word 4. Place instruction in I-cache (special instruction) Write directly into I-cache Decompressed instructions only exist in I-cache I-cache í Memory Add r1,r2,r3 Dictionary Proc. D-cache ô 5... Indices 5

Overview CodePack IBM PowerPC First system with instruction stream compression Decompress during I-cache miss Software CodePack Dictionary CodePack Codewords (indices) Fixed-length Variable-length Decompress granularity 1 cache line 2 cache lines Decompression overhead 75 instructions 1120 instructions 6

compressio n ratio = CodePack: 55% - 63% Dictionary: 65% - 82% Compression ratio compressed size original size Compression ratio 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Dictionary CodePack cc1 ghostscript go ijpeg mpeg2enc pegwit perl vortex 7

Simulation environment SimpleScalar Pipeline: 5 stage, in-order I-cache: 16KB, 32B lines, 2-way D-cache: 8KB, 16B lines, 2-way Memory: 10 cycle latency, 2 cycle rate 8

Performance CodePack: very high overhead Reduce overhead by reducing cache misses Slowdown relative to native code 22 20 18 16 14 12 10 8 6 4 2 0 Go CodePack Dictionary Native 4KB 16KB 64KB I-cache size (KB) 9

Cache miss Control slowdown by optimizing I-cache miss ratio Slowdown relative to native code 40 35 30 25 20 15 10 5 0 0% 2% 4% 6% 8% I-cache miss ratio CodePack 4KB CodePack 16KB CodePack 64KB Dictionary 4KB Dictionary 16KB Dictionary 64KB 10

Hybrid programs Selective compression Only compress some procedures Trade size for speed Avoid decompression overhead Profile methods Count dynamic instructions Example: Thumb Use when compressed code has more instructions Reduce number of executed instructions Count cache misses Example: CodePack Use when compressed code has longer cache miss latency Reduce cache miss latency 11

Cache miss profiling Cache miss profile reduces overhead 50% Loop-oriented benchmarks benefit most Approach performance of native code Slowdown relative to native code 1.12 1.10 1.08 1.06 1.04 1.02 Pegwit (encryption) CodePack: dynamic instructions CodePack: cache miss 1.00 60% 70% 80% 90% 100% Compression ratio 12

CodePack vs. Dictionary More compression may have better performance CodePack has smaller size than Dictionary compression Even with some native code, CodePack is smaller CodePack is faster due to using more native code 4.0 Ghostscript Slowdown relative to native code 3.5 3.0 2.5 2.0 1.5 1.0 CodePack: cache miss Dictionary: cache miss 0.5 60% 70% 80% 90% 100% Compression ratio 13

Conclusions High-performance SW decompression possible Dictionary faster than CodePack, but 5-25% compression ratio difference Hardware support I-cache miss exception Store-instruction instruction Tune performance by reducing cache misses Cache size Code placement Selective compression Use cache miss profile for loop-oriented benchmarks Code placement affects decompression overhead Future: unify code placement and compression 14

Web page http://www.eecs.umich.edu/compress 15