Dealing with Dependent Failures in Distributed Systems

Similar documents
Adaptability and Fault Tolerance

THE CANDU 9 DISTRffiUTED CONTROL SYSTEM DESIGN PROCESS

Designing Mechanisms for Reliable Internet-based Computing

Better Search Improved Uninformed Search CIS 32

Lecturers. Multi-Agent Systems. Exercises: Dates. Lectures. Prof. Dr. Bernhard Nebel Room Dr. Felix Lindner Room

Non-Interactive Secure Computation Based on Cut-and-Choose

Application of Dijkstra s Algorithm in the Evacuation System Utilizing Exit Signs

Safety-critical systems: Basic definitions

The Cooperative Cleaners Case Study: Modelling and Analysis in Real-Time ABS

/435 Artificial Intelligence Fall 2015

Online Companion to Using Simulation to Help Manage the Pace of Play in Golf

Simulation Analysis of Intersection Treatments for Cycle Tracks

Using what we have. Sherman Eagles SoftwareCPR.

Efficient Gait Generation using Reinforcement Learning

INTRODUCTION TO PATTERN RECOGNITION

CASE STUDY. Compressed Air Control System. Industry. Application. Background. Challenge. Results. Automotive Assembly

Using MATLAB with CANoe

#19 MONITORING AND PREDICTING PEDESTRIAN BEHAVIOR USING TRAFFIC CAMERAS

Product Overview. Product Description CHAPTER

SOME GEOTECHNICAL CONSIDERATIONS FOR PROBABILISTIC ANALYSIS IN SLOPE DESIGN

Lecture 04 ( ) Hazard Analysis. Systeme hoher Qualität und Sicherheit Universität Bremen WS 2015/2016

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag

A quantitative software testing method for hardware and software integrated systems in safety critical applications

Paper Presentation: COMPLETE AND SIMULTANEOUS DCS FAILURE IN TWO 500MW UNITS

Failure modes and models

SAFE EGRESS FROM DEEP STATIONS Flawed Criteria in NFPA 130

Ingersoll Rand. X-Series System Automation

Safety Analysis Methodology in Marine Salvage System Design

Questions & Answers About the Operate within Operate within IROLs Standard

CS472 Foundations of Artificial Intelligence. Final Exam December 19, :30pm

Analysis and realization of synchronized swimming in URWPGSim2D

Modeling Planned and Unplanned Store Stops for the Scenario Based Simulation of Pedestrian Activity in City Centers

Tutorial for the. Total Vertical Uncertainty Analysis Tool in NaviModel3

Lecture 10. Support Vector Machines (cont.)

1.1 The size of the search space Modeling the problem Change over time Constraints... 21

Bayesian Optimized Random Forest for Movement Classification with Smartphones

Introduction to Pattern Recognition

Volume A Question No : 1 You can monitor your Steelhead appliance disk performance using which reports? (Select 2)

Traffic Safety Barriers to Walking and Bicycling Analysis of CA Add-On Responses to the 2009 NHTS

Simulating Street-Running LRT Terminus Station Options in Dense Urban Environments Shaumik Pal, Rajat Parashar and Michael Meyer

Deriving an Optimal Fantasy Football Draft Strategy

Game Theory. 1. Introduction. Bernhard Nebel and Robert Mattmüller. Summer semester Albert-Ludwigs-Universität Freiburg

Engineering Response to. Workplace Accidents. Oregon Governor s Occupational Safety and Health Conference. Introduction

Fail Operational Controls for an Independent Metering Valve

Training Fees 3,400 US$ per participant for Public Training includes Materials/Handouts, tea/coffee breaks, refreshments & Buffet Lunch.

RoboCup German Open D Simulation League Rules

Assessing the Traffic and Energy Impacts of Connected and Automated Vehicles (CAVs)

SoundCast Design Intro

Safety Critical Systems

Modeling of Hydraulic Hose Paths

C. Mokkapati 1 A PRACTICAL RISK AND SAFETY ASSESSMENT METHODOLOGY FOR SAFETY- CRITICAL SYSTEMS

Application of Bayesian Networks to Shopping Assistance

A Game Theoretic Study of Attack and Defense in Cyber-Physical Systems

Fisher FIELDVUE DVC6200f Digital Valve Controller PST Calibration and Testing using ValveLink Software

D-Case Modeling Guide for Target System

Failure Mode and Effect Analysis (FMEA) for a DMLC Tracking System

SPE The paper gives a brief description and the experience gained with WRIPS applied to water injection wells. The main

Happy Birthday to... Two?

FRDS GEN II SIMULATOR WORKBOOK

UFLTI Seminar. Capacity and Delay Implications of Connected and Automated Vehicles at Signalized Intersections. Alexander Skabardonis

TR Electronic Pressure Regulator. User s Manual

siot-shoe: A Smart IoT-shoe for Gait Assistance (Miami University)

Safety of railway control systems: A new Preliminary Risk Analysis approach

Introduction to Pattern Recognition

Using Markov Chains to Analyze a Volleyball Rally

Deciding When to Quit: Reference-Dependence over Slot Machine Outcomes

Analysis of the Article Entitled: Improved Cube Handling in Races: Insights with Isight

APPENDIX P U.S. EPA WARM MODEL OUTPUT

The ICC Duckworth-Lewis-Stern calculator. DLS Edition 2016

Chapter 4 Traffic Analysis

The Role of Olive Trees Distribution and Fruit Bearing in Olive Fruit Fly Infestation

Safety Assessment of Installing Traffic Signals at High-Speed Expressway Intersections

Road Data Input System using Digital Map in Roadtraffic

Upgrading Vestas V47-660kW

Introduction Roundabouts are an increasingly popular alternative to traffic signals for intersection control in the United States. Roundabouts have a

RoboCup Soccer Simulation 3D League

Keeping EGI secure. EGI CSIRT: Prevention - Response - Training

Determining bicycle infrastructure preferences A case study of Dublin

POLICY GUIDE. DataCore Cloud Service Provider Program (DCSPP) DCSPP OVERVIEW POLICY GUIDE INTRODUCTION PROGRAM MEMBERSHIP DCSPP AGGREGATORS

Scaling up of ADAS Traffic Impacts to German Cities

Using Perceptual Context to Ground Language

Introduction to Parallelism in CASTEP

Basic Autoclave #1 Ideal Gas Law, Inert Gas

A study on the relation between safety analysis process and system engineering process of train control system

2011 ScheduALL FOXTEL

Section 10 - Hydraulic Analysis

Legendre et al Appendices and Supplements, p. 1

Advances in Low Voltage Motor Control Center (MCC) Technology Help Reduce Arc-Flash Hazards and Minimize Risks

Service & Support. Questions and Answers about the Proof Test Interval. Proof Test According to IEC FAQ August Answers for industry.

Safety Manual OPTISWITCH series relay (DPDT)

Imperfectly Shared Randomness in Communication

POWER Quantifying Correction Curve Uncertainty Through Empirical Methods

Section 3- Repair and Calibration Procedures

A SEMI-PRESSURE-DRIVEN APPROACH TO RELIABILITY ASSESSMENT OF WATER DISTRIBUTION NETWORKS

Critical Systems Validation

Introduction to Roundabout Analysis Using ARCADY

Focus on Enforcement. 3/14/2017 Presentation to the Vision Zero Taskforce. Joe Lapka Corina Monzón

ORF 201 Computer Methods in Problem Solving. Final Project: Dynamic Programming Optimal Sailing Strategies

Virtual Breadboarding. John Vangelov Ford Motor Company

IMPACTS OF COASTAL PROTECTION STRATEGIES ON THE COASTS OF CRETE: NUMERICAL EXPERIMENTS

Transcription:

Dealing with Dependent Failures in Distributed Systems Ranjita Bhagwan, Henri Casanova, Alejandro Hevia, Flavio Junqueira, Yanhua Mao, Keith Marzullo, Geoff Voelker, Xianan Zhang University California San Diego Dept. of Computer Science & Engineering

Research Interests Work in the formal/practical boundary of distributed systems. Basic strategy: Identify a practical issue Develop an abstraction of the problem Work at abstract level Bring back down to practical issue May 2006 Dealing with Dependent Failures 2

Being Abstract About Failures When designing or analyzing a protocol, it is usually done in the context of a failure model What are the base components? (processes, networks, processors, controllers,...) How can components deviate from correct behavior? (crash, produce the wrong output, drop input,...) How many of them can fail? (it is highly unlikely that more than this set can fail) May 2006 Dealing with Dependent Failures 3

Threshold Failure Models Consider only processes. Out of n processes, no more than t can be faulty. Easy to reason about. Useful for expressing bounds. In practice, compute t and n offline Some examples of lower bounds n > 2t for crash consensus n > 3t for arbitrary consensus n > 4t for Byzantine quorum systems n > 3t/2 for receive-omission primary-backup May 2006 Dealing with Dependent Failures 4

Implication of using Threshold Model Threshold model is an outgrowth of an early, influential project in fly-by-wire (SIFT). Any lower bounds derived on it are based on an implicit assumption of independent and identically distributed failures (IID). Any subset of t or fewer processes can be faulty Attained by careful design of the system May 2006 Dealing with Dependent Failures 5

Dealing with Dependent Failures Nonidentically distributed and dependent failures are now the common case. Heterogeneous components. Clusters with shared components. Networks of hosts with shared vulnerabilities under attack by malware. To use a protocol designed with the threshold model, choose t large enough to cover all failure situations. May 2006 Dealing with Dependent Failures 6

Generalizing thresholds Definition A core is a minimal set of processes such that in all executions, at least one of them will be nonfaulty. It isn't known a priori which are faulty. With threshold model, any subset of t + 1 processes. May 2006 Dealing with Dependent Failures 7

An example n = 12 May 2006 Dealing with Dependent Failures 8

An example cores n = 12 May 2006 Dealing with Dependent Failures 9

An example cores n = 12 May 2006 Dealing with Dependent Failures 10

An example cores n = 12 May 2006 Dealing with Dependent Failures 11

An example cores n = 12 May 2006 Dealing with Dependent Failures 12

An example cores n = 12 May 2006 Dealing with Dependent Failures 13

An example cores n = 12 May 2006 Dealing with Dependent Failures 14

An example cores n = 12 May 2006 Dealing with Dependent Failures 15

An example cores n = 12 May 2006 Dealing with Dependent Failures 16

An example cores n = 12, t=5 n 3t May 2006 Dealing with Dependent Failures 17

An example cores n = 12, t=5 n 3t May 2006 Dealing with Dependent Failures 18

Protocols for Dependent Failures Wish a simple abstraction different from the threshold model that can represent non-iid failures. Rather than being a source of despair, can exploit knowledge about dependent failures. Understand the limitations (eg, lower bounds) when failures are not IID. May 2006 Dealing with Dependent Failures 19

Modeling dependent failures Dual representations core c: a minimal set of processes such that, in every run, there is some process in c that does not fail. survivor set s: a minimal set of processes such that there is a run in which no processes in s fails. Each survivor set has a non-empty intersection with each core. Can be formally defined either probabilistically or in terms of admissible runs.... using cores is useful for lower bound proofs while using survivor sets is useful for protocol design. May 2006 Dealing with Dependent Failures 20

Modeling dependent failures Let R be the set of runs, π be the set of processes and up(r) be the set of processes that do not fail during the run r. C π = {c π ( r R: c up(r) ) ( c' c: r R: c' up(r) )} S π = {s π ( r R: s = up(r)) ( s' s: r R: s' = up(r))} May 2006 Dealing with Dependent Failures 21

Modeling dependent failures Under the threshold model: any set of t + 1 processes is a core; any set of n t processes is a survivor set. Cores and survivor sets are duals of each other. Eg, cores {a, b}, {a, d} x process x does not crash (a b) (a d) vs. a (b d) survivor sets {a}, {b, d} May 2006 Dealing with Dependent Failures 22

An example cores survivor sets May 2006 Dealing with Dependent Failures 23

Replication Requirements Replication requirements of the form n > k t becomes Any partition of the processes into k subsets results in one of the subsets containing a core. The intersection of any k survivor sets is not empty. The intersection of any k 1 survivor sets contains a core. May 2006 Dealing with Dependent Failures 24

An Example cores n = 12, t=5 survivor sets May 2006 Dealing with Dependent Failures 25

Fractional Replication Requirements n > a t / b There are some surprising consequences (a = 4, b = 2) is not the same as (a = 2, b = 1) The survivor set properties aren't easily expressed in a general way (n, n 1) and (3, 2) May 2006 Dealing with Dependent Failures 26

Consensus and Dependent Failures We have re-examined consensus protocols in this model. The protocols are often easily derived from existing protocols Synchronous crash consensus: decide in smallest core, with last round delivering result to all. Asynchronous crash consensus: Paxos over quorum system with survivor sets as coterie rather than majority coterie system (which has optimal availability in IID) The lower bound proofs are similar, but occasionally surprising Time to decide in synchronous consensus. Same in IID, but arbitrary slower than crash in general. May 2006 Dealing with Dependent Failures 27

Generalizing We have translated, by hand, many protocols. For a restricted class of protocols, this can be automated. Can a generalized technique be developed? May 2006 Dealing with Dependent Failures 28

Practical Use: Grid Systems In grid systems (GriPhyN, BIRN, Geon) the application for non-iid failures reduces to construction of coteries in a wide-area network. Individual machines can fail Whole sites can fail or can be disconnected... in such an environment, simple majority quorums do not have optimal availability. May 2006 Dealing with Dependent Failures 29

Multisite Failure Model Consider case where failures within sites is IID and failures of sites is IID At most t s unavailable sites, and at most t unavailable within a site For example, t s = 1 One site can be unavailable Two sites: at least n - t working properly Survivor sets: n - t processes from two different sites Site A Site B Site C n processes t failures max. n processes t failures max. n processes t failures max. May 2006 Dealing with Dependent Failures 30

Quorums on PlanetLab Toy application: quorums of acceptors (Paxos) Client: issues requests Two access for each request (ballot) Sites used Three sites for the acceptors Three acceptors from each site One UCSD host: client Two ways of constructing quorums SimpleMaj Quorum: any five processes 3SitesMaj Quorum: four hosts, majority from each of two sites UC Davis UC San Diego UT Austin Duke May 2006 Dealing with Dependent Failures 31

Practical Use: Internet Catastrophes Vulnerabilities shared among many hosts allow an Internet pathogen to quickly spread. Can replication be used to preserve function in the face of such an attack? May 2006 Dealing with Dependent Failures 32

Replicating for Internet Catastrophes Phoenix: Cooperative Backup Informed replication Replica sets based on attributes Different attributes indicate disjoint vulnerabilities Attributes Common software systems Challenges Is there sufficient diversity in an Internet setting? Can we select small replica sets? May 2006 Dealing with Dependent Failures 33

Host diversity Study of the UCSD network Vulnerability represented by open port nmap utility Port scans OS fingerprinting 2,963 general-purpose hosts (port data + OS) Host configuration OS + Applications May 2006 Dealing with Dependent Failures 34

Replica sets A set S of hosts is a core No attribute in all the hosts S is minimal {,, } Cores Hosts {,, } {,, } Configurations {,, } May 2006 Dealing with Dependent Failures 35

Heuristics for Selecting Cores Random selection Greedy selection... with limits on how many cores a given host participates in. May 2006 Dealing with Dependent Failures 36

Core size and Coverage 10 9 Random Uniform 1 8 0.95 Average core size 7 6 5 Average coverage 0.9 0.85 4 0.8 3 2 2 3 4 5 6 7 8 9 10 Load limit 0.75 2 3 4 5 6 7 8 9 10 Load limit Random Uniform Uniform: most hosts require a single extra replica Three times more storage for random May 2006 Dealing with Dependent Failures 37

Phoenix Recovery System Data backup On cores using Uniform Implement on a DHT Pastry Macedon framework Advertising configurations Container Zone Sub-container Sub-zone To get a core Request messages To recover Periodic announcement messages Core members only May 2006 Dealing with Dependent Failures 38

Prototype evaluation (USENIX '05) On PlanetLab Total number of hosts: 63 62 PlanetLab hosts 1 UCSD host Configurations manually set 63 randomly chosen out of the 2,963 L Core size Imp. Sim. Coverage Imp. Sim. Simulating a catastrophe Failed Windows hosts Recovery time Announcement period: 120s For 35: ~ 100s One site: order of minutes 3 5 7 2.12 2.10 2.10 2.22 2.23 2.22 1.0 1.0 1.0 1.0 1.0 1.0 May 2006 Dealing with Dependent Failures 39

Wrap up Designing for non-iid failures is possible and worth doing. On the formal side, working on automated methods for transforming protocols from IID to non-iid. For grid computing, empirical study to see what failures occur in Geon and how replication can be done to increase availability. For Internet catastrophes, empirical study on the nature of actual vulnerabilities in an enterprise network. May 2006 Dealing with Dependent Failures 40