Delta Compressed and Deduplicated Storage Using Stream-Informed Locality

Similar documents
Log2fs or how to achieve IO/s

Instructors: Randy H. Katz David A. PaGerson hgp://inst.eecs.berkeley.edu/~cs61c/fa10. Fall Lecture #39. Agenda

XC2 Client/Server Installation & Configuration

SUPPLEMENT MATERIALS

SQL LiteSpeed 3.0 Installation Guide

We release Mascot Server 2.6 at the end of last year. There have been a number of changes and improvements in the search engine and reports.

Fast Floating Point Compression on the Cell BE Processor

A Quadtree-Based Lightweight Data Compression Approach to Processing Large-Scale Geospatial Rasters

Microsoft System Center Data

DataCore Cloud Service Provider Program (DCSPP) Product Guide

Designing A Low-Latency Cuckoo Hash Table for Write-Intensive Workloads Using RDMA

AccuRAID iscsi Auto-Tiering Best Practice

Protecting shared storage from rogue jobs by I/O profiling and dynamic load balancing

Bareos, ZFS and Puppet. Christian Reiß Symgenius

If you need to reinstall FastBreak Pro you will need to do a complete reinstallation and then install the update.

Fencing Time Version 4.3

Parsimonious Linear Fingerprinting for Time Series

The Evolution of Transport Planning

Solving the problem of serving large image mosaics. Using ECW Connector and Image Web Server with ArcIMS

GOLOMB Compression Technique For FPGA Configuration

Fast Software-managed Code Decompression

Inspection User Manual

Inspection User Manual This application allows you to easily inspect equipment located in Onix Work.

Helium: A Data Driven User Tool for SAR Analysis. May 17 th 2011 Karen Worsfold

Multi Cylinder TPA and DI Pulse Model Development using GT-SUITE. Lakshmidhar Reddy Isuzu Technical Center America

LiteSpeed for SQL Server 6.5. Integration with TSM

[CROSS COUNTRY SCORING]

A history-based estimation for LHCb job requirements

WHEN WILL YOUR MULTI-TERABYTE IMAGERY STOP REQUIRING YOU TO BUY MORE DATA STORAGE?

Scalable Data Structure to Compress Next- Generation Sequencing Files and its Application to Compressive Genomics

HyperSecureLink V6.0x User Guide

SVM Root Volume Protection Express Guide

Image compression: ER Mapper 6.0 ECW v2.0 versus MrSID 1.3

POLICY GUIDE. DataCore Cloud Service Provider Program (DCSPP) DCSPP OVERVIEW POLICY GUIDE INTRODUCTION PROGRAM MEMBERSHIP DCSPP AGGREGATORS

PROJECT and MASTER THESES 2016/2017

Shearwater Cloud Desktop Release Notes

Typical Wayside Reader System Quotation

[CROSS COUNTRY SCORING]

Instrument pucks. Copyright MBARI Michael Risi SIAM design review November 17, 2003

Learning Locomotion Test 3.1 Results

Data Extraction from Damage Compressed File for Computer Forensic Purposes

imágenes en el Hospital Clinic

Unisys. Imagine it. Done. c Consulting. c Systems Integration. c Outsourcing. c Infrastructure. c Server Technology. Unisys NDP 30 and NDP 110

Reducing Code Size with Run-time Decompression

Transition from Scrum to Flow

FAQs GOLF CANADA KIOSK

Registering Club players in Whole Game Club Official Training Guide

ITF SCORER ONLINE TRAINING SETUP

HP Integrity Superdome 2

AFG FITNESS APP OWNER S MANUAL AFG MANUEL DU PROPRIÉTAIRE DU TAPIS ROULANT AFG MANUAL DEL PROPIETARIO DE LA CAMINADORA

DATA SCIENCE SUMMER UNI VIENNA

ROSE-HULMAN INSTITUTE OF TECHNOLOGY Department of Mechanical Engineering. Mini-project 3 Tennis ball launcher

ID: Cookbook: browseurl.jbs Time: 15:40:31 Date: 11/04/2018 Version:

Learning Locomotion Test 3.2 Results

An Architectural Approach for Improving Availability in Web Services

Electronic Volume Correctors

NASCAR Media Group CASE STUDY: LOCATION: Charlotte, NC GOAL: SOLUTION:

MPCS: Develop and Test As You Fly for MSL

A Novel Decode-Aware Compression Technique for Improved Compression and Decompression

Scratch Hands-on Assignments CS4HS - Summer 2017

FIRE PROTECTION. In fact, hydraulic modeling allows for infinite what if scenarios including:

ArcLink Additional API support for Wayback Machines

Imperfectly Shared Randomness in Communication

Now every device for small & medium businesses, at zero upfront

Prudhoe Bay Oil Production Optimization: Using Virtual Intelligence Techniques, Stage One: Neural Model Building

Smart Card based application for IITK Swimming Pool management

Mac Software Manual for FITstep Pro Version 2

CRL Processing Rules. Santosh Chokhani March

2017 Census Reporting To access the SOI s Census Reporting web site go to:

WildCat RF unit (0.5W), full 32-byte transmissions with built-in checksum

CS 341 Computer Architecture and Organization. Lecturer: Bob Wilson Cell Phone: or

Education Services LAGAN Upgrade Training Brochure

CROSS CONTAMINATION OF IN-SUITE MURB VENTILATION SYSTEMS

IA-64: Advanced Loads Speculative Loads Software Pipelining

Automatic Identification and Analysis of Basketball Plays: NBA On-Ball Screens

CE Owner s Manual For PRS-, Buddy and Desktop Tools

PEAPOD. Pneumatically Energized Auto-throttled Pump Operated for a Developmental Upperstage. Test Readiness Review

Session Objectives. At the end of the session, the participants should: Understand advantages of BFD implementation on S9700

Emerging Crash Trend Analysis. Mark Logan Department of Main Roads, Queensland. Patrick McShane Queensland Transport

Dante B. Fascell Port of Miami-Dade

Cover Page for Lab Report Group Portion. Boundary Layer Measurements

Quick Start Guide. For Gold and Silver Editions

APPLICATION OF TOOL SCIENCE TECHNIQUES TO IMPROVE TOOL EFFICIENCY FOR A DRY ETCH CLUSTER TOOL. Robert Havey Lixin Wang DongJin Kim

This GC has two columns (I think new columns can be ordered from Restek)

SteelHead Product Family

Technical Bulletin, Communicating with Gas Chromatographs

BLACKLIST ECOSYSTEM ANALYSIS: JULY DECEMBER, 2017

EAD: The UK Experience

High usability and simple configuration or extensive additional functions the choice between Airlock Login or Airlock IAM is yours!

Session: I03 TSM is not just a black box. Carolyn Sanders EDS. May 19, :30 p.m. 02:30 p.m. Platform: LUW

Volume A Question No : 1 You can monitor your Steelhead appliance disk performance using which reports? (Select 2)

ASCAT Level 2 Soil Moisture Reprocessing Phase 1 - Dataset Description

STAND alone & p.c. VERSION

Alternative architectures for distributed ledgers. Sarah Meiklejohn (University College London)

Modeling Signalized Traffic Intersections Using SAS Simulation Studio. Leow Soo Kar

UNITY 2 TM. Air Server Series 2 Operators Manual. Version 1.0. February 2008

Aviation Unit Safety Management System

Computing s Energy Problem:

TYPE DOSAODOR-D SOFTWARE FOR CONFIGURATION OF TYPE DOSAODOR-D ODORANT INJECTION SYSTEM

P500/700/900 Part Number. P510/710/910 Part Number N/A N/A SBB0K65455 N/A N/A N/A SBB0K65456 N/A N/A N/A SBB0K65457 N/A N/A N/A SBB0K65521 N/A

Transcription:

Delta Compressed and Deduplicated Storage Using Stream-Informed Locality Philip Shilane, Grant Wallace, Mark Huang, & Windsor Hsu Backup Recovery Systems Division EMC Corporation

Motivation and Approach Improve storage compression Decrease price per GB Decrease data center space Decrease power Decrease management Combine deduplication and delta compression Remove identical data regions Compress with similar data regions

Previous Work on Similarity Indexing Version information [Burns 97, MacDonald 00] Similarity index in memory [Aronovich 09, Kulkarni 04] Similarity index on-disk [You 11] Stream-informed delta locality for WAN replication [Shilane 12] Low WAN throughput Did not store delta compressed data

Contributions 1. First deduplicated and delta compressed storage implementation using streaminformed locality 2. Quantify throughput and suggest improvement areas 3. Explore new complexities related to data integrity and cleaning 4. Report combination of deduplication and delta compression across chunk sizes

Stream-informed Deduplication backup.tar

Stream-informed Deduplication backup.tar 1 2 3 4 5 6 Content Defined s

Stream-informed Deduplication backup.tar Fp 1 Fp 2 Fp 3 Fp 4 Fp 5 Fp 6 Secure Fingerprint of s

Stream-informed Deduplication backup.tar Fp 1 Fp 2 Fp 3 Fp 4 Fp 5 Fp 6 Fp Index Fp 1 -> C1 Fp 6 -> C1 s stored together on disk Container C1 Meta Data Fp 1 6 Data s 1 6 Store fingerprints in a metadata section and chunks to a data section of a container. Also create an index from fingerprint to container.

Stream-informed Deduplication backup.tar Fp 1 Fp 2 Fp 3 Fp 4 Fp 5 Fp 6 1 week later backup.tar Fp Index Fp 1 -> C1 Fp 6 -> C1 s stored together on disk Container C1 Meta Data Fp 1 6 Data s 1 6

Stream-informed Deduplication backup.tar Fp 1 Fp 2 Fp 3 Fp 4 Fp 5 Fp 6 1 week later backup.tar 1 2 3 4 5 6 Content Defined s Fp Index Fp 1 -> C1 Fp 6 -> C1 s stored together on disk Container C1 Meta Data Fp 1 6 Data s 1 6

Stream-informed Deduplication backup.tar Fp 1 Fp 2 Fp 3 Fp 4 Fp 5 Fp 6 1 week later backup.tar Fp 1 Fp 2 Fp 3 Fp 4 Fp 5 Fp 6 Secure Fingerprint of s Fp Index Fp 1 -> C1 Fp 6 -> C1 s stored together on disk Container C1 Meta Data Fp 1 6 Data s 1 6

Stream-informed Deduplication backup.tar Fp 1 Fp 2 Fp 3 Fp 4 Fp 5 Fp 6 1 week later backup.tar Fp 1 Fp 2 Fp 3 Fp 4 Fp 5 Fp 6 Lookup fingerprint 1 in the index Fp Index Fp 1 -> C1 Fp 6 -> C1 s stored together on disk Container C1 Meta Data Fp 1 6 Data s 1 6

Stream-informed Deduplication backup.tar Fp 1 Fp 2 Fp 3 Fp 4 Fp 5 Fp 6 1 week later backup.tar Fp 1 Fp 2 Fp 3 Fp 4 Fp 5 Fp 6 Load all fingerprints from container 1 into an in-memory cache Fp Index Fp 1 -> C1 Fp 6 -> C1 s stored together on disk Container C1 Meta Data Fp 1 6 Data s 1 6 Fingerprint Cache Fp 1 Fp 2 Fp 3 Fp 4 Fp 5 Fp 6

Stream-informed Deduplication backup.tar Fp 1 Fp 2 Fp 3 Fp 4 Fp 5 Fp 6 1 week later backup.tar Fp 1 Fp 2 Fp 3 Fp 4 Fp 5 Fp 6 FP cache hit on 1,2,3,5, & 6 Fp Index Fp 1 -> C1 Fp 6 -> C1 s stored together on disk Container C1 Meta Data Fp 1 6 Data s 1 6 Fingerprint Cache Fp 1 Fp 2 Fp 3 Fp 4 Fp 5 Fp 6

Stream-informed Deduplication and Delta Compression backup.tar

Stream-informed Deduplication and Delta Compression backup.tar 1 2 3 4 5 6 Content Defined s

Stream-informed Deduplication and Delta Compression backup.tar Fp 1 Sk 1 Fp 2 Sk 2 Fp 3 Sk 3 Fp 4 Sk 4 Fp 5 Sk 5 Fp 6 Sk 6 Secure Fingerprint and Sketch of s Calculate fingerprints used for deduplication and sketches used for similarity detection

Stream-informed Deduplication and Delta Compression backup.tar Fp 1 Sk 1 Fp 2 Sk 2 Fp 3 Sk 3 Fp 4 Sk 4 Fp 5 Sk 5 Fp 6 Sk 6 Fp Index Fp 1 -> C1 Fp 6 -> C1 s stored together on disk Container C1 Meta Data Data Fp 1 6 Sk 1 6 s 1 6 Store fingerprints and sketches in a metadata section and chunks to a data section of a container. Also create an index from fingerprint to container.

Stream-informed Deduplication and Delta Compression backup.tar Fp 1 Sk 1 Fp 2 Sk 2 Fp 3 Sk 3 Fp 4 Sk 4 Fp 5 Sk 5 Fp 6 Sk 6 1 week later backup.tar Fp Index Fp 1 -> C1 Fp 6 -> C1 s stored together on disk Container C1 Meta Data Data Fp 1 6 Sk 1 6 s 1 6

Stream-informed Deduplication and Delta Compression backup.tar Fp 1 Sk 1 Fp 2 Sk 2 Fp 3 Sk 3 Fp 4 Sk 4 Fp 5 Sk 5 Fp 6 Sk 6 1 week later backup.tar 1 2 3 4 5 6 Content Defined s Fp Index Fp 1 -> C1 Fp 6 -> C1 s stored together on disk Container C1 Meta Data Data Fp 1 6 Sk 1 6 s 1 6

Stream-informed Deduplication and Delta Compression backup.tar Fp 1 Sk 1 Fp 2 Sk 2 Fp 3 Sk 3 Fp 4 Sk 4 Fp 5 Sk 5 Fp 6 Sk 6 1 week later backup.tar Fp 1 Sk 1 Fp 2 Sk 2 Fp 3 Sk 3 Fp 4 Sk 4 Fp 5 Sk 5 Fp 6 Sk 6 Secure Fingerprint and Sketch of s Fp Index Fp 1 -> C1 Fp 6 -> C1 s stored together on disk Container C1 Meta Data Data Fp 1 6 Sk 1 6 s 1 6

Stream-informed Deduplication and Delta Compression backup.tar Fp 1 Sk 1 Fp 2 Sk 2 Fp 3 Sk 3 Fp 4 Sk 4 Fp 5 Sk 5 Fp 6 Sk 6 1 week later backup.tar Fp 1 Sk 1 Fp 2 Sk 2 Fp 3 Sk 3 Fp 4 Sk 4 Fp 5 Sk 5 Fp 6 Sk 6 Lookup fingerprint 1 in the index Fp Index Fp 1 -> C1 Fp 6 -> C1 s stored together on disk Container C1 Meta Data Data Fp 1 6 Sk 1 6 s 1 6

Stream-informed Deduplication and Delta Compression backup.tar Fp 1 Sk 1 Fp 2 Sk 2 Fp 3 Sk 3 Fp 4 Sk 4 Fp 5 Sk 5 Fp 6 Sk 6 1 week later backup.tar Fp 1 Sk 1 Fp 2 Sk 2 Fp 3 Sk 3 Fp 4 Sk 4 Fp 5 Sk 5 Fp 6 Sk 6 Load all fingerprints and sketches from container 1 into an in-memory cache Fp Index Fp 1 -> C1 Fp 6 -> C1 s stored together on disk Container C1 Meta Data Data Fp 1 6 Sk 1 6 s 1 6 Fingerprint and Sketch Cache Fp 1 Fp 6 Sk 1 Sk 6

Stream-informed Deduplication and Delta Compression backup.tar Fp 1 Sk 1 Fp 2 Sk 2 Fp 3 Sk 3 Fp 4 Sk 4 Fp 5 Sk 5 Fp 6 Sk 6 1 week later backup.tar Fp 1 Sk 1 Fp 2 Sk 2 Fp 3 Sk 3 Fp 4 Sk 4 Fp 5 Sk 5 Fp 6 Sk 6 FP cache hit on 1,2,3,5, & 6 Fp Index Fp 1 -> C1 Fp 6 -> C1 s stored together on disk Container C1 Meta Data Data Fp 1 6 Sk 1 6 s 1 6 Fingerprint and Sketch Cache Fp 1 Fp 6 Sk 1 Sk 6

Stream-informed Deduplication and Delta Compression backup.tar Fp 1 Sk 1 Fp 2 Sk 2 Fp 3 Sk 3 Fp 4 Sk 4 Fp 5 Sk 5 Fp 6 Sk 6 1 week later backup.tar Fp 1 Sk 1 Fp 2 Sk 2 Fp 3 Sk 3 Fp 4 Sk 4 Fp 5 Sk 5 Fp 6 Sk 6 Sketch cache hit on 4 Delta encode Fp Index Fp 1 -> C1 Fp 6 -> C1 s stored together on disk Container C1 Meta Data Data Fp 1 6 Sk 1 6 s 1 6 Fingerprint and Sketch Cache Fp 1 Fp 6 Sk 1 Sk 6

Deduplication and Delta Compression fp Sketches based on Broder 97

Deduplication and Delta Compression fp sk Maximal Value 1 Maximal Value 2 Maximal Value 3 Maximal Value 4 super_feature = Rabin_fp(feature 1 feature 4 ) sketch is one or more super_features Sketches based on Broder 97

Deduplication and Delta Compression fp sk Maximal Value 1 Maximal Value 2 Maximal Value 3 Maximal Value 4 Store this chunk to disk super_feature = Rabin_fp(feature 1 feature 4 ) sketch is one or more super_features Sketches based on Broder 97

Deduplication and Delta Compression fp sk fp (duplicate of earlier chunk) Fingerprint is a match, so do not store super_feature = Rabin_fp(feature 1 feature 4 ) sketch is one or more super_features Sketches based on Broder 97

Deduplication and Delta Compression fp sk (similar to earlier chunk) Regions of difference super_feature = Rabin_fp(feature 1 feature 4 ) sketch is one or more super_features Sketches based on Broder 97

Deduplication and Delta Compression fp sk fp (similar to earlier chunk) Regions of difference Fingerprint is not a match, so calculate a sketch super_feature = Rabin_fp(feature 1 feature 4 ) sketch is one or more super_features Sketches based on Broder 97

Deduplication and Delta Compression fp sk fp sk (similar to earlier chunk) Maximal Value 1 Maximal Value 2 Regions of difference Maximal Value 3 Maximal Value 4 Sketch match super_feature = Rabin_fp(feature 1 feature 4 ) sketch is one or more super_features Sketches based on Broder 97

Deduplication and Delta Compression fp sk fp sk (similar to earlier chunk) Regions of difference Calculate a delta and store the changed bytes and a reference to the earlier chunk Store fp and differences fp Sketches based on Broder 97

Backup Datasets Dataset Type Backup Policy TB Months Workstations 16 desktops Weekly full Daily incremental 2.3 4 Email MS Exchange server Daily full 2.5 5 Source Code Version control repository Weekly full Daily incremental 4.5 6 System Logs Server s /var directory Weekly full Daily incremental 5.3 4

Compression Results Delta adds 1.4 3.5X compression improvement over deduplication and LZ Dataset Deduplication Delta LZ Total Compression Delta Improvement Workstations 5.0 4.2 1.6 33.6 3.5 Email 4.9 2.6 2.1 26.8 2.1 Source Code 16.7 3.6 2.5 150.3 1.4 System Logs 25.2 3.3 1.8 149.7 1.5

Throughput Delta compression requires extra computation and I/O Compare to deduplicated storage as baseline Throughput: 74% on first full backup Throughput: 53% on later full backups Calculate difference - = fp Store delta fp

Throughput Stages Single-stream timing for each stage I/O to a hard drive is the bottleneck, switching to SSD may help Read Alternatives Dataset Sketch MB/s Lookup MB/s Encode Mb/s HDD MB/s SSD MB/s Workstations 47 1,528 94 5 400 Email 49 1,441 69 1 80 Source Code 30 30 31 2 160 System Logs 30 70 50 2 160 Aggregate throughput is higher due to: deduplication, 2/3 are delta encoded, multi-threading and asynchronous reads across multiple disks

Indirection Complexities Writing duplicates causes unintended read paths Unpredictable read back times Read back cat: There are multiple options, some that involve delta references cat catch catch cat cat cat+ch catch catch

Indirection Complexities Writing duplicates causes unintended read paths Unpredictable read back times Multi-level delta increases compression and complexity We implemented 1-level delta cat catch catch cat catcher cat cat+ch catch+er

Indirection Complexities End-to-end validity checks are slow because of remote references Reconstructing a delta chunk requires reading the base Verify catcher cat catch catch cat catcher cat catch catch+er cat+ch catch

Indirection Complexities End-to-end validity checks are slow because of remote references Reconstructing a delta chunk requires reading the base Incorrect garbage collection can cause loops and data loss Verify catcher: Loop due to incorrect GC indicates data loss cat catch catch cat catcher cat+ch catch catch+er

Garbage Collection Cleaning deleted chunks in a log structured file system Reference counts Mark-and-sweep Copying live chunks forward changes data locality, which impacts delta compression apple banana zebra C1 apple banana cat C2 whale zebra C3 apple banana zebra

Garbage Collection Impact on Delta Compression GC changes data locality but has only a small impact on delta compression

Compression vs. Size Consistent results on all datasets

Compression vs. Size Delta adds significant compression beyond deduplication and LZ

Compression vs. Size Delta helps maintain high total compression as chunk size varies

Summary and Future Work Deduplication and delta compression prototype Stream-informed locality replaces sketch indexes and improves write path throughput Adds 1.4X 3.5X compression Studied throughput Throughput 50% of underlying deduplication system Areas for improvement: SSD Garbage collection and data integrity Remote reference complexity Affects speed and validity Delta helps maintain overall compression across a broad range of chunk sizes

Questions?