Failure Management and Fault Tolerance for Avionics and Automotive Systems


Failure Management and Fault Tolerance for Avionics and Automotive Systems Prof. Nancy G. Leveson Aeronautics and Astronautics Engineering Systems MIT

Outline Intro to Fault Tolerance and Failure Management Hardware Reliability Software Reliability System Safety Tools and Techniques Important topics not covered: Human Error Networked and distributed systems

Why Need Fault Tolerance and Failure Management? Safety: freedom from losses, concerned with eliminating hazards Reliability: freedom from failures, concerned with failure rate reduction Reliability vs. Safety Reliability is the probability an item will perform its required function in the specified manner over a given time period and under specified or assumed conditions. Most accidents result from errors in specified requirements or functions and deviations from assumed conditions.

Accident with No Component Failures

Types of Accidents Component Failure Accidents Single or multiple component failures Usually assume random failure Component Interaction Accidents Arise in interactions among components Related to interactive complexity and tight coupling Exacerbated by introduction of computers and software New technology introduces unknowns and unknown unknowns ("unk-unks")

Reliability ≠ Safety Many accidents occur without any component failure Caused by equipment operation outside parameters and time limits upon which reliability analyses are based. Caused by interactions of components all operating according to specification. Highly reliable components are not necessarily safe For relatively simple, electro-mechanical systems with primarily component failure accidents, reliability engineering can increase safety. But this is untrue for complex, software-intensive or human-intensive systems.

It's only a random failure, sir! It will never happen again.

Hardware Reliability Concerned primarily with failures and failure rate reduction: Parallel redundancy Standby sparing Safety factors and margins Derating Screening Timed replacements Others

Parallel Redundancy Same functions performed by two or more components at same time. Usually some type of voting mechanism to decide which one to use If one unit fails, function can still be performed by remaining unit or units (Diagram: N-Modular Redundancy, with inputs feeding N identical modules whose outputs go to a voter)
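A minimal sketch of the majority-voting idea behind N-modular redundancy (Python; the channel function and all names are invented for illustration, not from the lecture):

    from collections import Counter

    def majority_vote(outputs):
        # Return the value a majority of the redundant channels agree on,
        # or None if no majority exists.
        value, count = Counter(outputs).most_common(1)[0]
        return value if count > len(outputs) / 2 else None

    # Triple modular redundancy: three nominally identical channels compute
    # the same function; a single faulty channel is out-voted by the others.
    def channel(x):                          # hypothetical channel computation
        return 2 * x

    outputs = [channel(5), channel(5), 11]   # third channel has failed
    print(majority_vote(outputs))            # prints 10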

Standby Sparing (Series Redundancy) Inoperable or idling standby units take over if operating unit fails. Need failure detection (monitoring) and switchover devices to activate standby unit. Monitor may detect failure and provide some other type of response (e.g., graceful degradation or fail-soft) (Diagram: Primary unit, Failure Detector/Switch, Spare)
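A rough sketch of the standby-sparing pattern, assuming a hypothetical is_healthy() failure detector (illustrative only, not from the lecture):

    class StandbySparing:
        # A primary unit performs the function, a failure detector monitors it,
        # and a switch activates the spare when the primary is judged failed.
        def __init__(self, primary, spare, is_healthy):
            self.units = [primary, spare]
            self.is_healthy = is_healthy      # hypothetical failure detector
            self.active = 0

        def output(self, demand):
            if self.active == 0 and not self.is_healthy(self.units[0]):
                self.active = 1               # switch over to the standby unit
            return self.units[self.active](demand)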

Safety Factors and Safety Margins Used to cope with uncertainties in engineering Inaccurate calculations or models Limitations in knowledge Variation in strength of a specific material due to differences in composition, manufacturing, assembly, handling, environment, or usage. Appropriate for continuous and non-action systems (e.g., structures and mechanical equipment) Some serious limitations


Derating Equivalent of safety margins but for electronic components Electronic components fail under specific conditions and stresses Reducing the stress levels under which components are used reduces their failure rates

Screening and Timed Replacement Screening: eliminate components that pass operating tests but show indications that they will fail within an unacceptable time. Timed replacement: replace components before they wear out (preventive maintenance)

Others Isolation (failure containment): so a failure in one component does not affect another Damage tolerance: surrounding systems should be able to tolerate failures that cannot be contained Designed failure paths: direct high-energy failures along a safe path, e.g., design so the engine will fall off before it damages the structure. Design to fail into safe states (passive vs. active failure protection) A safety rather than a reliability design measure

Passive vs. Active Protection Passive safeguards: Maintain safety by their presence Fail into safe states Active safeguards: Require hazard or condition to be detected and corrected Tradeoffs Passive rely on physical principles Active depend on less reliable detection and recovery mechanisms BUT Passive tend to be more restrictive in terms of design freedom and not always feasible to implement

Software Reliability How define? An abstraction Errors do not occur randomly Maybe really just talking about correctness (consistency with requirements)

The Computer Revolution General Purpose Machine + Software = Special Purpose Machine Software is simply the design of a machine abstracted from its physical realization Machines that were physically impossible or impractical to build become feasible Design can be changed without retooling or manufacturing Can concentrate on steps to be achieved without worrying about how steps will be realized physically

Advantages = Disadvantages Computer so powerful and useful because has eliminated many of physical constraints of previous technology Both its blessing and its curse No longer have to worry about physical realization of our designs But no longer have physical laws that limit the complexity of our designs.

The Curse of Flexibility Software is the resting place of afterthoughts No physical constraints To enforce discipline in design, construction, and modification To control complexity So flexible that we start working with it before fully understanding what we need to do And they looked upon the software and saw that it was good, but they just had to add one other feature

Abstraction from Physical Design Software engineers are doing logical design (Diagram: Autopilot Expert → Requirements → Software Engineer → Design of Autopilot) Most operational software errors related to requirements (particularly incompleteness) Software failure modes are different Usually does exactly what you tell it to do Problems occur from operation, not lack of operation Usually doing exactly what software engineers wanted

Redundancy Goal is to increase reliability and reduce failures Common-cause and common-mode failures May add so much complexity that causes failures More likely to operate spuriously May lead to false confidence (Challenger) Useful to reduce hardware failures. But what about software? Design redundancy vs. design diversity Bottom line: Claims that multiple version software will achieve ultra-high reliability levels are not supported by empirical data or theoretical models

Redundancy (2) Standby spares vs. concurrent use of multiple devices (with voting) Identical designs or intentionally different ones (diversity) Diversity must be carefully planned to reduce dependencies Can also introduce dependencies in maintenance, testing, repair Redundancy most effective against random failures not design errors

Redundancy (3) Software errors are design errors Data redundancy: extra data for detecting errors: e.g., parity bit and other codes checksums message sequence numbers duplicate pointers and other structural information Algorithmic redundancy: 1. Acceptance tests (hard to write) 2. Multiple versions with voting on results
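As a small illustration of data redundancy, a sketch of an even-parity check (checksums and message sequence numbers follow the same pattern of adding extra data purely so errors can be detected):

    def add_parity(bits):
        # Append an even-parity bit so a single flipped bit in the stored
        # or transmitted word becomes detectable.
        return bits + [sum(bits) % 2]

    def parity_ok(word):
        return sum(word) % 2 == 0

    msg = add_parity([1, 0, 1, 1])
    corrupted = msg.copy()
    corrupted[2] ^= 1                              # single-bit error
    print(parity_ok(msg), parity_ok(corrupted))    # prints True False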

Software Redundancy (Multi-Version Programming) Assumptions: Probability of correlated failures is very low for independently developed software Software errors occur at random and are unrelated Neither of these assumptions is true Even small probabilities of correlated errors cause a substantial reduction in expected reliability gains All experiments (and there are lots of them) have shown ultrahigh reliability does not result

Failure Independence Experiment Experimental Design 27 programs, one requirements specification Graduate students and seniors from two universities Simulation of a production environment: 1,000,000 input cases Individual programs were high quality Results Analysis of potential reliability gains must include effect of dependent errors Statistically correlated failures result from: Nature of application Hard cases in input space Programs with correlated failures were structurally and algorithmically very different Using different programming languages and compilers will not help Conclusion: Correlations due to fact that working on same problem, not due to tools used or languages used or even algorithms used

N-Version Programming (Summary) Need to have realistic expectations of benefits to be gained and costs involved: Costs very high (more than N times) In practice, end up with lots of similarity in designs (more than in our experiments) Overspecification Cross Checks Safety of system thus dependent on quality that has been systematically eliminated during development Costs so great that must cut somewhere else Requirements flaws not handled, which is where most problems (particularly safety) arise anyway

Recovery Blocks
Module A:
    Save state in recovery cache
    Block A1
    Acceptance test
    Block A2
End Module A;
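A hedged sketch of how the structure above might look in code: save the state, run the primary block, apply the acceptance test, and fall back to the alternate block on failure (function and parameter names are invented):

    import copy

    def recovery_block(state, primary, alternate, acceptance_test):
        # state is assumed to be a mutable dict; primary and alternate mutate it.
        saved = copy.deepcopy(state)            # save state in recovery cache
        result = primary(state)                 # Block A1
        if acceptance_test(result):             # acceptance test
            return result
        state.clear()
        state.update(saved)                     # roll back to the saved state
        result = alternate(state)               # Block A2
        if acceptance_test(result):
            return result
        raise RuntimeError("no block passed the acceptance test")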

Recovery Blocks Have never been used on any real system in over 30 years since proposed Experiments have found it difficult to impossible to write acceptance tests except in very simple situations (reversible mathematical functions) Is actually just n-version programming performed serially (acceptance test is a different version)

Recovery Backward recovery: Assumes the error can be detected before it does any damage Assumes the alternative will be more effective Forward recovery: Robust data structures Dynamically altering flow of control Ignoring single-cycle errors But the real problem is detecting erroneous states

Monitoring Difficult to make monitors independent Checks usually require access to the information being monitored, but that access brings the possibility of corrupting that information Depends on assumptions about behavior of system and about errors that may or may not occur May be incorrect under certain conditions Common incorrect assumptions may be reflected both in design of monitor and devices being monitored.


Self-Monitoring (Checking) In general, the farther down the hierarchy a check can be made, the better Detect the error closer to the time it occurred and before erroneous data is used Easier to isolate and diagnose the problem More likely to be able to fix the erroneous state rather than recover to a safe state Writing effective checks is very hard and the number is usually limited by time and memory Added monitoring and checks can cause failures themselves

Self-Checking Software Experimental design Launch Interceptor Programs (LIP) from previous study 24 graduate students from UCI and UVA employed to instrument 8 programs (chosen randomly from subset of 27 in which we had found errors) Provided with identical training materials Checks written using specifications only at first and then subjects were given a program to instrument Allowed to make any number or type of check. Subjects treated this as a competition among themselves

Software Reliability Conclusions Not very encouraging but some things may help for particular errors: Defensive programming Assertions and run-time checks Separation of critical functions Elimination of unnecessary functions Exception handling Enforce discipline and control complexity Limits have changed from structural integrity and physical constraints of materials to intellectual limits Improve communication among engineers Use state-of-the-art software engineering practices to prevent or detect software errors before use. Use defensive design at the system level
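To make the first two items concrete, a sketch of defensive programming with run-time checks on a critical input (the function, units, and safe envelope are invented for the example):

    def set_descent_rate(rate_m_s):
        # Defensive programming / run-time checks: reject out-of-range or
        # non-numeric commands instead of letting a bad value propagate
        # into the control loop.  The 0-3 m/s envelope is invented.
        if not isinstance(rate_m_s, (int, float)):
            raise TypeError("descent rate must be numeric")
        if not (0.0 <= rate_m_s <= 3.0):
            raise ValueError(f"descent rate {rate_m_s} m/s outside assumed safe envelope")
        return rate_m_s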

Safety Safety and (component or system) reliability are different qualities in complex systems! Increasing one will not necessarily increase the other. So need different approaches to achieve it.

Reliability Approach to Software Safety Using standard engineering techniques of Preventing failures through redundancy Increasing component reliability Reuse of designs and learning from experience will not work for software and system accidents

Preventing Failures Through Redundancy Redundancy simply makes complexity worse NASA experimental aircraft example Any solutions that involve adding complexity will not solve problems that stem from intellectual unmanageability and interactive complexity Majority of software-related accidents caused by requirements errors Does not work for software even if accident is caused by a software implementation error Software errors not caused by random wear-out failures

Increasing Software Reliability (Integrity) Appearing in many new international standards for software safety (e.g., IEC 61508) Safety integrity level (SIL) Sometimes given as a reliability number (e.g., 10^-9) Can software reliability be measured? What does it mean? What does it have to do with safety? Safety involves more than simply getting the software correct: Example: altitude switch 1. Signal safety-increasing: Require any of three altimeters to report below threshold 2. Signal safety-decreasing: Require all three altimeters to report below threshold
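The altitude-switch example amounts to a small voting rule; this sketch simply restates the slide's logic (function and parameter names are invented):

    def issue_signal(altimeters_below_threshold, signal_is_safety_increasing):
        # altimeters_below_threshold: three booleans, one per altimeter.
        # If issuing the signal increases safety, issue it when ANY altimeter
        # reports below the threshold; if issuing it decreases safety, require
        # ALL three altimeters to agree before issuing it.
        if signal_is_safety_increasing:
            return any(altimeters_below_threshold)
        return all(altimeters_below_threshold)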

Software Component Reuse One of most common factors in software-related accidents Software contains assumptions about its environment Accidents occur when these assumptions are incorrect Therac-25 Ariane 5 U.K. ATC software Mars Climate Orbiter Most likely to change the features embedded in or controlled by the software COTS makes safety analysis more difficult

System Safety Engineering Identify hazards Design to eliminate or control them Approaches developed for relatively simple electromechanical devices. Focus on reliability (preventing failures) Software creates new problems

Software-Related Accidents Are usually caused by flawed requirements Incomplete or wrong assumptions about operation of controlled system or required operation of computer Unhandled controlled-system states and environmental conditions Merely trying to get the software correct or to make it reliable will not make it safer under these conditions.

Software-Related Accidents (2) Software may be highly reliable and correct and still be unsafe: Correctly implements requirements but specified behavior unsafe from a system perspective. Requirements do not specify some particular behavior required for system safety (incomplete) Software has unintended (and unsafe) behavior beyond what is specified in requirements.

MPL Requirements Tracing Flaw SYSTEM REQUIREMENTS 1. The touchdown sensors shall be sampled at a 100-Hz rate. 2. The sampling process shall be initiated prior to lander entry to keep processor demand constant. 3. However, the use of the touchdown sensor data shall not begin until 12 m above the surface. SOFTWARE REQUIREMENTS 1. The lander flight software shall cyclically check the state of each of the three touchdown sensors (one per leg) at 100 Hz during EDL. 2. The lander flight software shall be able to cyclically check the touchdown event state with or without touchdown event generation enabled. (No software requirement traces to system requirement 3.)
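To make the gap concrete, a hypothetical sketch of the kind of guard implied by system requirement 3 but never captured in the software requirements above (names and structure invented; not the actual MPL code):

    def touchdown_decision(sensor_states, altitude_m, touchdown_enabled):
        # Sampling can run throughout EDL (software requirement 1), but system
        # requirement 3 says the data must not be USED above 12 m.  Without a
        # guard like the altitude check below, an indication latched earlier
        # (e.g., from a transient) could be treated as a real touchdown.
        if touchdown_enabled and altitude_m <= 12.0 and any(sensor_states):
            return "TOUCHDOWN"
        return "DESCENDING"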

A Possible Solution Multi-faceted approach: Requirements completeness and safety New, extended causality models for accidents Enforce constraints on system and software behavior Controller contributes to accidents not by failing but by: 1. Not enforcing safety-related constraints on behavior 2. Commanding behavior that violates safety constraints

Requirements Completeness Most software-related accidents involve software requirements deficiencies Accidents often result from unhandled and unspecified cases. We have defined a set of criteria to determine whether a requirements specification is complete. Derived from accidents and basic engineering principles. Validated (at JPL) and used on industrial projects. Completeness: Requirements are sufficient to distinguish the desired behavior from that of any other undesired program that might be designed.

Requirements Completeness Criteria About 60 in all including human-computer interaction criteria (in the book) Startup, shutdown Mode transitions Inputs and outputs Value and timing Load and capacity Environmental capacity Failure states and transitions Human-Computer interface Robustness Data Age Latency Feedback Reversibility Pre-emption Path Robustness

Traditional Accident Causation Model

Chain-of-Events Model Explains accidents in terms of multiple events, sequenced as a forward chain over time. Simple, direct relationship between events in chain Ignores non-linear relationships, feedback, etc. Events almost always involve component failure, human error, or energy-related event Forms the basis for most safety-engineering and reliability engineering analysis: e.g., FTA, PRA, FMECA, Event Trees, etc. and design: e.g., redundancy, overdesign, safety margins, etc.

Chain-of-events example

STAMP A new accident causation model using Systems Theory (vs. Reliability Theory)

STAMP: A Systems Approach to Risk Management New accident (risk) model based on systems theory Safety viewed as a dynamic control problem rather than a component failure problem. O-ring did not control propellant gas release by sealing gap in field joint Software did not adequately control descent speed of Mars Polar Lander Temperature in batch reactor not adequately controlled Train speed not adequately controlled on a curve Events result from the inadequate control Result from lack of enforcement of safety constraints by system design and operations

Example Control Structure

Control processes operate between levels (Diagram: Controller, containing a Model of Process, issues Control Actions to the Controlled Process and receives Feedback from it) Process models must contain: - Required relationship among process variables - Current state (values of process variables) - The ways the process can change state
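One way to read the three things a process model must contain is as a small data structure; this is only a sketch with invented names, not a prescribed representation:

    from dataclasses import dataclass, field

    @dataclass
    class ProcessModel:
        # Current state: the controller's beliefs about the process variables.
        variables: dict = field(default_factory=dict)

        def update_from_feedback(self, feedback):
            # The ways the process can change state: here, simply via feedback
            # overwriting the believed values of process variables.
            self.variables.update(feedback)

        def consistent(self, required_relationships):
            # Required relationships among process variables, expressed as
            # predicates over the current state.
            return all(rel(self.variables) for rel in required_relationships)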

Relationship Between Safety and Process Models Accidents occur when models do not match process and Incorrect control commands given Correct ones not given Correct commands given at wrong time (too early, too late) Control stops too soon

Relationship Between Safety and Process Models (2) How do they become inconsistent? Wrong from beginning Missing or incorrect feedback Not updated correctly Time lags not accounted for Resulting in Uncontrolled disturbances Unhandled process states Inadvertently commanding system into a hazardous state Unhandled or incorrectly handled system component failures

STAMP Accidents arise from unsafe interactions among humans, machines, and the environment (not just component failures) Goal is not merely to prevent failures but to enforce safety constraints on system behavior Losses are the result of complex dynamic processes, not simply chains of failure events Most major accidents arise from a slow migration of the entire system toward a state of high risk Need to control and detect this migration

A Broad View of Control Does not imply need for a controller Component failures and unsafe interactions may be controlled through design (e.g., redundancy, interlocks, fail-safe design) or through process Manufacturing processes and procedures Maintenance processes Operations Does imply the need to enforce the safety constraints in some way New model includes what we do now and more

Safety Constraints Build safety in by enforcing safety constraints on behavior in system design and operations System Safety Constraint: Water must be flowing into reflux condenser whenever catalyst is added to reactor Software Safety Constraint: Software must always open water valve before catalyst valve We have new hazard analysis and safety-driven design techniques to: Identify system and component safety constraints Perform hazard analysis in parallel with design to guide engineering design process
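The software safety constraint on this slide can be enforced directly as an interlock in the control logic; a minimal sketch, with invented class and method names:

    class ReactorController:
        # Enforce the software safety constraint: the catalyst valve may only
        # be opened while water is flowing into the reflux condenser.
        def __init__(self):
            self.water_valve_open = False

        def open_water_valve(self):
            self.water_valve_open = True          # start water flow

        def open_catalyst_valve(self):
            if not self.water_valve_open:
                raise RuntimeError("safety constraint violated: open water valve first")
            # ... command the catalyst valve here ...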

STPA: A New Hazard Analysis Technique Based on STAMP (Diagram: generic control loop annotated with causal factors - Controller: inadequate control algorithm, process model wrong, control input wrong or missing; inadequate control commands issued to Actuator(s): inadequate actuator operation; Controlled Process: component failure, process input wrong or missing, disturbances unidentified or out of range, process output wrong or missing; Sensor(s): inadequate sensor operation, feedback wrong or missing)

Summary: Accident Causality Accidents occur when Control structure or control actions do not enforce safety constraints Unhandled environmental disturbances or conditions Unhandled or uncontrolled component failures Dysfunctional (unsafe) interactions among components Control structure degrades over time (asynchronous evolution) Control actions inadequately coordinated among multiple controllers

Uncoordinated Control Agents (Diagram: an unsafe state arises when both TCAS and ATC provide uncoordinated, independent instructions to both aircraft - Control Agent (TCAS) and Control Agent (ATC) each issue instructions to the planes with no coordination between them)

Uses for STAMP Basis for new, more powerful hazard analysis techniques (STPA) Safety-driven design (physical, operational, organizational) More comprehensive accident/incident investigation and root cause analysis Organizational and cultural risk analysis Identifying physical and project risks Defining safety metrics and performance audits Designing and evaluating potential policy and structural improvements Identifying leading indicators of increasing risk ("canary in the coal mine") New holistic approaches to security

System Engineering Tools and Safety Driven Design Example

Based on Intent Specifications Research on how experts solve problems Basic principles of system engineering Goals Enhance communication and review Capture domain knowledge and design rationale Integrate safety information into decision-making environment Provide traceability from requirements to code For verification and validation To support change and upgrade process

Intent Specifications (2) Differs in structure, not content, from other specifications Hierarchy of models Describe system from different viewpoints Traceability between models (levels) Seven levels Represent different views of system (not refinement) From different perspectives To support different types of reasoning

Intent Specifications

Level 1: System Purpose (TCAS) Introduction Historical Perspective Environment Description Environment Assumptions Altitude information is available from intruders with a minimum precision of 100 feet All aircraft have legal identification numbers Environment Constraints The behavior or interaction of non-TCAS components with TCAS must not degrade the performance of the TCAS equipment

Create a Functional Decomposition

Design High-Level System Control Structure

Hazard List H1: Near midair collision (NMAC): An encounter for which, at the closest point of approach, the vertical separation is less than 100 feet and the horizontal separation is less than 500 feet. H2: TCAS causes controlled maneuver into ground e.g., descend command near terrain H3: TCAS causes pilot to lose control of aircraft H4: TCAS interferes with other safety-related systems e.g., interferes with ground proximity warning or ground ATC system

Perform STPA Identify inadequate control actions 1. A required control action is not provided or not followed 2. An incorrect or unsafe control action is provided 3. A potentially correct or inadequate control action is provided too late or too early (at the wrong time) 4. A correct control action is stopped too soon. Identify safety-related requirements and design constraints along with design decisions to eliminate or control the hazardous control actions
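The four types of inadequate control actions act as a checklist when tabulating unsafe control actions; a minimal sketch of that checklist as an enumeration (wording paraphrased from the slide):

    from enum import Enum

    class UCAType(Enum):
        # Each control action is examined against these four types to find
        # the contexts in which it would be hazardous.
        NOT_PROVIDED     = "required control action not provided or not followed"
        UNSAFE_PROVIDED  = "incorrect or unsafe control action provided"
        WRONG_TIMING     = "control action provided too early or too late"
        STOPPED_TOO_SOON = "correct control action stopped too soon"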

Example to be provided separately

Example Level 2 TCAS System Design

Example from Level 3 Model of Collision Avoidance Logic

Spectrm-RL Combined requirements specification and modeling language. A state machine with a domain-specific notation on top of it Includes a task modeling language Can add other notations and visualizations of state machine Enforces or includes most of completeness criteria Supports specifying systems in terms of modes Control modes Operational modes Supervisory modes Display modes

Attribute Templates to Assist in Completeness

Visualization Tools

Templates and tools to manage hazard information

Editor for creating and maintaining control structure diagrams

Hyperlinking to more detailed views of control structure

Requirements Analysis

Black Box Model: Simulation

Timeline view of simulation

Halt on Hazardous Condition

Built-in static and dynamic analysis tools plus API to add others

Executable Specifications as Prototypes Easily changed At end, have requirements specification to use Can be reused (product families) Can be more easily reviewed than code If formal, can be analyzed Can be used in hardware-in-the-loop or operator-in-the-loop simulations. Saves the enormous amount of time and money otherwise spent generating code for prototypes.

Does it Work? Is it Practical? MDA risk assessment of inadvertent launch (technical) Architectural trade studies for the space exploration initiative (technical) Safety driven design of a NASA JPL spacecraft (technical) NASA Space Shuttle Operations (risk analysis of a new management structure) NASA Exploration Systems (risk management tradeoffs among safety, budget, schedule, performance in development of replacement for Shuttle) JAXA (Japanese Space Agency)

Does it Work? Is it Practical? (2) Orion (Space Shuttle Replacement) Accident analysis (spacecraft losses, bacterial contamination of water supply, aircraft collision, oil refinery explosion, train accident, chemical plant runaway reaction, etc.) Pharmaceutical safety Hospital safety (risks of outpatient surgery at Beth Israel MC) Food safety Train safety (Japan)

Summary Safety ≠ reliability Lots of good hardware reliability techniques Most do not apply to software Safety requires more than just reliability Requirements completeness Extended accident model (STAMP) Treat safety as a control problem (systems or control theory rather than reliability theory) Identify and enforce safety constraints on behavior New types of hazard analysis and safety-driven design

Summary (2) Tool support needed Traceability Design rationale Reviewable and readable by anyone Unambiguous semantics Completeness analysis Executable models Visualizations Organized to assist in problem solving