Fast Software-managed Code Decompression

Similar documents
Reducing Code Size with Run-time Decompression

Evaluation of a High Performance Code Compression Method

Reducing Code Size with Run-time Decompression

Compact Binaries with Code Compression in a Software Dynamic Translator

Compact Binaries with Code Compression in a Software Dynamic Translator

Profile-driven Selective Code Compression

Instruction Cache Compression for Embedded Systems by Yujia Jin and Rong Chen

Efficient Placement of Compressed Code for Parallel Decompression

MEMORY is one of the key driving factors in embeddedsystem

An Efficient Code Compression Technique using Application-Aware Bitmask and Dictionary Selection Methods

EMBEDDED computing systems are space and cost sensitive.

Design and Simulation of a Pipelined Decompression Architecture for Embedded Systems

MEMORY is one of the most constrained resources in an

Energy Savings Through Compression in Embedded Java Environments

A Study on Algorithm for Compression and Decompression of Embedded Codes using Xilinx

GOLOMB Compression Technique For FPGA Configuration

The Implementation and Evaluation of Dynamic Code Decompression Using DISE

EMBEDDED systems have become more and more important

Spacecraft Simulation Tool. Debbie Clancy JHU/APL

Reduction of Bitstream Transfer Time in FPGA

Computing s Energy Problem:

A Hybrid Code Compression Technique using Bitmask and Prefix Encoding with Enhanced Dictionary Selection

Decompression Method For Massive Compressed Files In Mobile Rich Media Applications

An Architecture for Combined Test Data Compression and Abort-on-Fail Test

International Journal of Engineering Trends and Technology (IJETT) Volume 18 Number2- Dec 2014

A Space-Efficient On-Chip Compressed Cache Organization for High Performance Computing 1

Fast Floating Point Compression on the Cell BE Processor

Compiling for Multi, Many, and Anycore. Rudi Eigenmann Purdue University

Matrix-based software test data decompression for systems-on-a-chip

Code Compression for Low Power Embedded System Design

AddressBus 32. D-cache. Main Memory. D- Engine CPU. DataBus 1. DataBus 2. I-cache

Instructors: Randy H. Katz David A. PaGerson hgp://inst.eecs.berkeley.edu/~cs61c/fa10. Fall Lecture #39. Agenda

The system design must obey these constraints. The system is to have the minimum cost (capital plus operating) while meeting the constraints.

A Novel Decode-Aware Compression Technique for Improved Compression and Decompression

Light Loss-Less Data Compression, With GPU Implementation

Decompression of run-time compressed PE-files

The Constrained Ski-Rental Problem and its Application to Online Cloud Cost Optimization

CS 341 Computer Architecture and Organization. Lecturer: Bob Wilson Cell Phone: or

AccuRAID iscsi Auto-Tiering Best Practice

(12) United States Patent (10) Patent No.: US 6,618,728 B1

COMPRESSION OF FPGA BIT STREAMS USING EFFECTIVE RUN LENGTH ENCODING TECHIQUES AND ITS PERFORMANCE ESTIMATION

MICROPROCESSOR ARCHITECTURE

Functional safety. Functional safety of Programmable systems, devices & components: Requirements from global & national standards

WHEN WILL YOUR MULTI-TERABYTE IMAGERY STOP REQUIRING YOU TO BUY MORE DATA STORAGE?

Analyzing the Processor Bottlenecks in SPEC CPU 2000

Appendix A COMPARISON OF DRAINAGE ALGORITHMS UNDER GRAVITY- DRIVEN FLOW DURING GAS INJECTION

Delta Compressed and Deduplicated Storage Using Stream-Informed Locality

JPEG-Compatibility Steganalysis Using Block-Histogram of Recompression Artifacts

Instrument pucks. Copyright MBARI Michael Risi SIAM design review November 17, 2003

The Evolution of Transport Planning

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

VLSI Design I; A. Milenkovic 1

Data Extraction from Damage Compressed File for Computer Forensic Purposes

LOCOMOTION CONTROL CYCLES ADAPTED FOR DISABILITIES IN HEXAPOD ROBOTS

IA-64: Advanced Loads Speculative Loads Software Pipelining

CPE/EE 427, CPE 527 VLSI Design I L21: Sequential Circuits. Review: The Regenerative Property

Compression of FPGA Bitstreams Using Improved RLE Algorithm

: A WEB-BASED SOLVER FOR COMPRESSIBLE FLOW CALCULATIONS

Morgan DeLuca 11/2/2012

Image compression: ER Mapper 6.0 ECW v2.0 versus MrSID 1.3

THE CANDU 9 DISTRffiUTED CONTROL SYSTEM DESIGN PROCESS

Simulation Model Portability 2 standard support in EuroSim Mk4

Solving MINLPs with BARON. Mustafa Kılınç & Nick Sahinidis Department of Chemical Engineering Carnegie Mellon University

Software Design of the Stiquito Micro Robot

A Quadtree-Based Lightweight Data Compression Approach to Processing Large-Scale Geospatial Rasters

SUPPLEMENT MATERIALS

A 28nm SoC with a 1.2GHz 568nJ/ Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications

DAV CENTENARY COLLEGE, FARIDABAD

Maximum CPU utilization is obtained by multiprogramming. CPU I/O Burst Cycle Process execution consists of a cycle of CPU execution and I/O wait

Rescue Rover. Robotics Unit Lesson 1. Overview

SoundCast Design Intro

ECE 757. Homework #2

Precision level sensing with low-pressure module MS

Out-of-Core Cholesky Factorization Algorithm on GPU and the Intel MIC Co-processors

An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc.

FIRE PROTECTION. In fact, hydraulic modeling allows for infinite what if scenarios including:

Word-based Statistical Compressors as Natural Language Compression Boosters

Solving the problem of serving large image mosaics. Using ECW Connector and Image Web Server with ArcIMS

#19 MONITORING AND PREDICTING PEDESTRIAN BEHAVIOR USING TRAFFIC CAMERAS

Constraining a global, eddying, ocean and sea ice model with scatterometer data

PRODUCTION AND OPERATIONAL ISSUES

UNDERSTANDING A DIVE COMPUTER. by S. Angelini, Ph.D. Mares S.p.A.

EVOLVING HEXAPOD GAITS USING A CYCLIC GENETIC ALGORITHM

Autodesk Moldflow Communicator Process settings

ICES REPORT November gfpc: A Self-Tuning Compression Algorithm. Martin Burtscher and Paruj Ratanaworabhan

Persistent Memory Performance Benchmarking & Comparison. Eden Kim, Calypso Systems, Inc. John Kim, Mellanox Technologies, Inc.

Solving Problems by Searching chap3 1. Problem-Solving Agents

A Single Cycle Processor. ECE4680 Computer Organization and Architecture. Designing a Pipeline Processor. Overview of a Multiple Cycle Implementation

CS650 Computer Architecture. Lecture 5-1 Branch Hazards

Flight Software Overview

Agenda Item I.5.b Supplemental NOAA Presentation November 2012

Designing A Low-Latency Cuckoo Hash Table for Write-Intensive Workloads Using RDMA

Decompression Plans October 26, 2009

VPC3: A Fast and Effective Trace-Compression Algorithm

82C288 BUS CONTROLLER FOR PROCESSORS (82C C C288-8)

Design & Analysis of Natural Laminar Flow Supercritical Aerofoil for Increasing L/D Ratio Using Gurney Flap

Design of AMBA APB Protocol

IEEE TRANSACTIONS ON COMPUTERS, VOL.54, NO.11, NOVEMBER The VPC Trace-Compression Algorithms

CS61C Machine Structures: End of Episode I. (Lecture 28) May 7, 1999 Dave Patterson (http.cs.berkeley.edu/~patterson)

The Future of Retailing in Canada to 2018

Transcription:

Fast Software-managed Code Decompression Charles Lefurgy and Trevor Mudge Advanced Computer Architecture Laboratory Electrical Engineering and Computer Science Dept. The University of Michigan, Ann Arbor Compiler and Architecture Support for Embedded Systems (CASES) October 1-3, 1999

Motivation Problem: embedded code size Constraints: cost, area, and power Fit program in on-chip memory Compilers vs. hand-coded assembly Solution: code compression Reduce compiled code size Take advantage of instruction repetition Benefits On-chip memory used more effectively Trade-off performance for code density Systems use cheaper processors with smaller on-chip memories CPU Original Program CPU I/O I/O RAM RAM ROM Program ROM Compressed Program Embedded Systems 2

Hardware or software decompression? Hardware Faster translation CodePack, MIPS-16, Thumb Software Smaller physical area Lower cost Quicker re-targeting to new compression algorithms Rivals HW solutions on some (loopy) benchmarks 3

Overview Kirovski et al., 1997 Procedure Compression Decompress and execute 1 procedure at a time Store decompressed code in procedure cache Cache management Results 60% compression ratio on SPARC 166% execution penalty with 64KB procedure cache HLL F() {...} HLL G() {...} Compile Native F: load r5,4... Native G: addi r7,8... LZ Compress LZ LZ F: 10010... G: 00101... Decompressor P-cache manager 4

Dictionary compression algorithm Dictionary contains unique instructions Replace program instructions with short index 32 bits 16 bits 32 bits Add r1,r2,r3 5 Add r1,r2,r3 Add r1,r2,r3 5 Add r1,r2,r4 Add r1,r2,r4 Add r1,r2,r4 Add r1,r2,r4 30 30 30.dictionary segment.text segment.text segment (contains indices) Original program Compressed program 5

Compression ratio compressio n ratio = compressed size original size Compression ratios Dictionary: 65% - 82% LZRW1: 55% - 63% Benchmark Original Dict. Compression LZRW1 Compression cc1 1,083,168 65.4% 60.4% vortex 495,248 65.8% 55.5% go 310,576 69.6% 63.9% perl 267,568 73.7% 60.2% ijpeg 198,272 77.2% 61.5% mpeg2enc 119,600 82.5% 60.5% pegwit 88,800 79.5% 56.7% 6

Simple Decompression code Small static code size: 25 instructions Fast Less than 3 instructions per output byte 74 dynamic instructions per decompressed cache line Algorithm Invoke decompressor on L1 I-cache miss Decompress 1 complete cache line For each instruction in cache line Read index Reference dictionary with index to get instruction Put instruction in I-cache HW Support L1-cache miss exception Write into I-cache 7

Partial decompression Optimizations compress from missed instruction to end of cache line use a valid bit per word in cache line to mark instructions at beginning of line as invalid avoids decompressing instructions that may not be executed up to 12% speedup Second register file Many embedded processors have an additional register file Avoid save/restore of registers when decompressor runs 2nd register file with partial decompression: up to 16% speedup 8

SimpleScalar Simulation environment Modified to support compression 5 stage, in-order pipeline Simple embedded processor D-cache 8KB, 16B lines, 2-way I-cache 1 to 64KB, 32B lines, 2-way Memory 10 cycle latency, 2 cycle rate 9

Performance: cc1 Slowdown relative to native code 6 5 4 3 2 1 0 1KB 4KB 16KB 64KB I-cache size (KB) compressed partial partial+regfile native 10

Performance: ijpeg Slowdown relative to native code 6 5 4 3 2 1 0 compressed partial partial+regfile native 1KB 4KB 16KB 64KB I-cachesize(KB) 11

Performance summary Data from CINT95, MediaBench with several cache sizes Control slowdown by optimizing I-cache miss ratio Code layout may help 6 5 Slowdown relative to native code 4 3 2 1 compressed partial partial+regfile 0 0% 5% 10% 15% I-cache miss ratio 12

Performance summary, cont. Magnification of previous graph Slowdown under 3x when I-miss ratio is under 2% Slowdown under 2x when I-miss ratio is under 1% 4 3 Slowdown relative to native code 2 1 compressed partial partial+regfile 0 0.0% 0.5% 1.0% 1.5% 2.0% 2.5% 3.0% I-cache miss ratio 13

Conclusions Line-based decompression beats procedure-based use normal cache as decompression buffer no fragmentation management as in procedure-based decompression order of magnitude performance difference A previous decompressor with procedure granularity had 100x slowdown on gcc and go [Kirovski97] Compressed code fills gap has quick execution of native code has small size of interpreted code 14

Web page http://www.eecs.umich.edu/~tnm/compress 15