Real-Time & Embedded Systems Agenda Safety Critical Systems Project 6 continued
Safety Critical Systems "Safe enough" looks different at 35,000 feet. - Bruce Powell Douglass "The Air Force has a perfect operating record: everything we put in the air has come back down." - Unknown
Ubiquity of Control Systems Electro-mechanical devices are migrating to software-driven systems: automobiles, planes, home appliances, medical equipment, nuclear power plants
Software Failures Therac-25 Radiation therapy device Software-driven Bugs allowed massive radiation overdoses Killed 3 people, contributed to the death of a fourth
Software Failures Patriot missiles: clock drift reduced their effectiveness from 95% to 13%; allowed a SCUD missile through the defense perimeter; killed 29, injured 97. Aegis tracking system: failure contributed to the shooting down of an Iranian airliner; 290 lives lost.
Software Failures 8080-based factory control software Mistakenly stacked large boulders 80 feet high Crushed cars and damaged a building Robotics Stray EM interference blamed for 19 deaths Cardiac pacemakers Low-energy radiation reprogrammed Caused several deaths
Software Failures Medical database software: incorrectly informed a woman that she had incurable syphilis and had passed it on to her children; she strangled one, and attempted to kill another and herself. Sunlight filtering software: failed to remove false missile detections caused by sunlight reflecting off clouds; a Soviet commander averted nuclear war based on "a funny feeling in my gut."
Terms Reliability: the measure of up-time, or availability, of a system; the probability that a task will complete before the system fails; measured in Mean Time Between Failures (MTBF). Security: permitting access to systems only to authorized and authenticated persons. Safety: the system does not incur too much risk to persons or property. Risk: the chance that something bad will happen. Common-mode failure: a single failure results in the failure of multiple control paths.
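The two readings of reliability above (up-time, and the probability that a task completes before a failure) can be made concrete with a short sketch. It assumes a constant failure rate (the exponential model), which the slides do not state, and it introduces a mean time to repair (MTTR) figure that is also not in the slides. The function names are hypothetical.

```python
import math


def survival_probability(task_hours: float, mtbf_hours: float) -> float:
    """Probability that a task of the given length completes before a failure.

    ASSUMPTION: constant failure rate (exponential model), so
    R(t) = exp(-t / MTBF).
    """
    if mtbf_hours <= 0 or task_hours < 0:
        raise ValueError("MTBF must be positive and task length non-negative")
    return math.exp(-task_hours / mtbf_hours)


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state fraction of time the system is up.

    ASSUMPTION: mttr_hours (mean time to repair) is an extra input that
    the slides do not mention.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # A 10-hour task on a system with a 1000-hour MTBF.
    print(survival_probability(10, 1000))
    # 99 hours up for every 1 hour of repair.
    print(availability(99, 1))
```

Note that a high MTBF says nothing about safety on its own: a system can be reliable and still fail in an unsafe way.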
Fundamental Hazards Release of energy Release of toxins Interference of life-support functions Supplying misleading information to safety personnel or control systems Failure to alarm when hazardous conditions arise Failure to limit or act when unwanted events occur, inputs are flawed or outputs are outside correct levels
System Issues Safety is a system issue Multiple solutions may address a concern Interlocks Redundant hardware Redundant software The interaction of the components determines the safety of the system
Software Failures Software does not fail Failures represent a change in the capability of the system Broken switch Failed component Bad sensor If software does something wrong, it does it every time! Software may respond poorly to failures
Single-point Failures A device is considered safe if a single failure in the system does not result in an unsafe condition Single-point assessments tree:
Fail-Safe State A condition a safety-critical system must attain when it encounters an unrecoverable fault: emergency stop, partial shutdown, hold, manual control, or restart. The choice is driven by the needs of the problem domain.
Fail-Safe states An airliner jet engine fails? Unmanned space vehicle launch? Attended medical devices? Hazardous area robotics? Unmanned aircraft control failure? Cruise ship rudder failure?
Achieving Safety Separation of safety channels from non-safety channels Firewall pattern Any component failure in the channel fails the entire channel Isolation of safety systems from non-safety systems is common and justifiable Redundancy Small or large scale Homogenous or diverse
Achieving Safety Homogenous redundancy: channels are replicated verbatim; detects only faults, not errors; inexpensive. Diverse redundancy: a different channel is implemented; detects faults and errors; more expensive.
Achieving Safety Diverse redundancy is stronger Protects against systemic faults / errors Data corruption detection Parity bit Hamming codes (parity bits) Checksums CRCs Redundant storage
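As an illustration of the data corruption checks listed above, the sketch below computes an even parity bit and appends a CRC-32 to a payload using Python's standard `zlib.crc32`. The frame layout (payload followed by a 4-byte big-endian CRC) and the function names are assumptions made for this example, not a format taken from the slides.

```python
import zlib


def even_parity_bit(data: bytes) -> int:
    """Return the bit that makes the total number of 1 bits even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2


def protect(payload: bytes) -> bytes:
    """Append a CRC-32 (4 bytes, big-endian) to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the stored one."""
    if len(frame) < 4:
        return False
    payload, stored = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == stored


if __name__ == "__main__":
    frame = protect(b"dose=20")
    print(verify(frame))  # intact frame
    corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]  # flip a single bit
    print(verify(corrupted))  # corrupted frame
```

A single parity bit only detects an odd number of flipped bits, which is one reason CRCs are preferred for larger blocks.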
Achieving Safety Reasonableness checks A second algorithm validating the results of the first Usually much simpler Feedback error detection Identify potential fault conditions May cause a fail-safe transition Feedback error correction Identify and correct potential fault conditions Attempts to keep the system operating, and may reduce capability
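A reasonableness check can be as small as a range test plus a rate-of-change test on a sensor value. The sketch below is one possible shape, with hypothetical limits for a temperature sensor. `accept_reading` then shows a simple form of feedback error correction: it substitutes the last good value and reports the fault so the caller can decide whether to continue with reduced capability or move to a fail-safe state.

```python
def is_reasonable(reading, previous, low=-40.0, high=125.0, max_step=5.0):
    """Simple second opinion on a sensor value.

    ASSUMPTION: the limits are hypothetical, chosen only for illustration.
    """
    if not (low <= reading <= high):
        return False  # outside the physically plausible range
    return abs(reading - previous) <= max_step  # implausibly fast change?


def accept_reading(reading, last_good):
    """Return (value_to_use, fault_detected).

    On a failed check, keep operating on the last good value and flag
    the fault to the caller.
    """
    if is_reasonable(reading, last_good):
        return reading, False
    return last_good, True


if __name__ == "__main__":
    print(accept_reading(21.0, 20.0))   # plausible reading is accepted
    print(accept_reading(300.0, 20.0))  # out of range, last good value is kept
```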
Safety Architectures Single-Channel Protected Design A single flow of control A break in the channel induces a failure Safeguards are added to ensure correct fail-safe behavior A single point of failure Multi-channel Voting Pattern An odd number of redundant channels Each channel votes on the task Majority rules Homogenous or diverse
Safety Architectures Homogenous Redundancy Pattern Identical channels run in parallel. With an odd number of channels, the majority channels detect and correct the minority channels. Must be fully redundant. Inexpensive to design and implement, since the channels are copies. Detects only faults, not errors. May be expensive to build due to redundant hardware.
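The majority voting described in the two slides above can be sketched as follows. The channel names and the use of exact equality to compare outputs are assumptions for this example; a real system would more likely compare within a tolerance.

```python
from collections import Counter


def vote(channel_outputs):
    """Majority vote over redundant channels.

    channel_outputs maps a channel name to its output value.
    Returns (winner, dissenters). winner is None when no value has a
    strict majority, in which case the caller should go fail-safe.
    """
    if not channel_outputs:
        raise ValueError("at least one channel is required")
    value, count = Counter(channel_outputs.values()).most_common(1)[0]
    if 2 * count <= len(channel_outputs):
        return None, sorted(channel_outputs)
    dissenters = sorted(
        name for name, out in channel_outputs.items() if out != value
    )
    return value, dissenters


if __name__ == "__main__":
    # Channel C disagrees and is outvoted by A and B.
    print(vote({"A": 10, "B": 10, "C": 11}))
    # No majority at all.
    print(vote({"A": 1, "B": 2, "C": 3}))
```

With identical (homogenous) channels this catches a fault in one channel, but a design error shared by all channels produces the same wrong answer everywhere and wins the vote.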
Safety Architectures Diverse Redundancy Pattern Redundant, but uniquely implemented channels Different but equal Lightweight redundancy Separation of monitoring and actuation
Safety Architectures Watchdog Pattern A secondary process monitors the primary process Primary process periodically feeds the secondary process Secondary process can alarm or restart should the primary process fail May include a periodic test suite
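A minimal software sketch of the watchdog idea: the primary process calls `feed()` periodically, and the secondary process calls `check()` to see whether the primary has gone quiet. The class, its method names, and the injectable clock are assumptions for illustration. Embedded systems usually rely on a hardware watchdog timer instead.

```python
import time


class Watchdog:
    """Software watchdog that fires a callback when it is not fed in time."""

    def __init__(self, timeout_s, on_timeout, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._on_timeout = on_timeout  # e.g. raise an alarm or restart
        self._clock = clock            # injectable so tests can fake time
        self._last_feed = clock()

    def feed(self):
        """Called by the primary process to signal that it is alive."""
        self._last_feed = self._clock()

    def check(self):
        """Called by the secondary process. Returns True if expired."""
        expired = self._clock() - self._last_feed > self._timeout_s
        if expired:
            self._on_timeout()
        return expired


if __name__ == "__main__":
    now = [0.0]  # fake clock, advanced by hand
    dog = Watchdog(1.0, lambda: print("restart primary"), clock=lambda: now[0])
    now[0] = 0.5
    dog.feed()
    print(dog.check())  # fed in time
    now[0] = 2.0
    print(dog.check())  # primary went quiet
```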
Safety Architectures Safety Executive Pattern A centralized coordinator for monitoring safety A really smart watchdog Watchdog timeouts Software error assertions Continuous or periodic built-in tests Faults identified by monitors
Safety Architecture Monitor-actuator pattern Separation of algorithms Actuation performs the actions Monitoring tracks the actions Additional cost and complexity
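The separation of actuation from monitoring might look like the sketch below. The valve example, the tolerance, and the latched emergency stop are all hypothetical. The point is only that the monitor judges the actuator using an independent measurement rather than the actuator's own view of the world.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    EMERGENCY_STOP = "emergency_stop"


class Actuator:
    """Actuation channel: performs the action (here, sets a valve opening)."""

    def __init__(self):
        self.commanded_percent = 0.0

    def command(self, opening_percent):
        self.commanded_percent = opening_percent


class Monitor:
    """Monitoring channel: compares the command with an independent sensor."""

    def __init__(self, tolerance_percent):
        self.tolerance_percent = tolerance_percent
        self.state = State.RUNNING

    def check(self, commanded_percent, measured_percent):
        if abs(commanded_percent - measured_percent) > self.tolerance_percent:
            # Latch the fail-safe state: it stays set until the monitor
            # is replaced or reset by some external procedure.
            self.state = State.EMERGENCY_STOP
        return self.state


if __name__ == "__main__":
    actuator, monitor = Actuator(), Monitor(tolerance_percent=2.0)
    actuator.command(50.0)
    print(monitor.check(actuator.commanded_percent, 49.5))
    print(monitor.check(actuator.commanded_percent, 80.0))
```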
Eight Steps to Safety Identify the hazards Determine the risks Define the safety measures Create safe requirements Create safe designs Implement safety Assure the safety process Test, test, test (Peer Reviews!)
Identify the Hazards Identify the hazard. Determine the level of risk. Determine the tolerance time. Determine the source of the hazard: the fault leading to the hazard, the likelihood of the fault, the fault detection time. The means by which the hazard is handled: the means, the fault reaction (exposure time).
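One common way to relate the three times listed above is to require that detection time plus reaction time stay below the tolerance time. The slides do not spell out this rule, so treat it as an assumed reading. The function name and the numbers are hypothetical and are not taken from the ventilator example.

```python
def handled_within_tolerance(tolerance_ms, detection_ms, reaction_ms):
    """True when the fault is detected and handled before harm occurs.

    ASSUMPTION: the hazard is tolerable for tolerance_ms, and exposure
    lasts for detection_ms + reaction_ms.
    """
    if min(tolerance_ms, detection_ms, reaction_ms) < 0:
        raise ValueError("times must be non-negative")
    return detection_ms + reaction_ms < tolerance_ms


if __name__ == "__main__":
    # Hypothetical hazard that is tolerable for 5 seconds.
    print(handled_within_tolerance(5000, detection_ms=500, reaction_ms=1000))
    # Same check with a slower detector and a tighter tolerance.
    print(handled_within_tolerance(1000, detection_ms=800, reaction_ms=400))
```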
Identify the Hazards Patient Ventilator Example:
Fault Analysis Fault-tree analysis (FTA) Identify the hazards Work backward from the hazard to identify the causal conditions Diagram with a boolean flow chart UML Activity diagram Failure mode effect analysis (FMEA) Identify potential faults Work forward to the consequences
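A fault tree is a boolean expression over basic faults, so it can be evaluated directly. The sketch below builds a tree from AND and OR gates and then searches for single-point failures, meaning any one fault that triggers the top-level hazard by itself. The heater example and all fault names are hypothetical.

```python
def basic(name):
    """Leaf of the tree: true when the named fault is present."""
    return lambda faults: name in faults


def and_gate(*inputs):
    """True only when every input condition holds."""
    return lambda faults: all(gate(faults) for gate in inputs)


def or_gate(*inputs):
    """True when at least one input condition holds."""
    return lambda faults: any(gate(faults) for gate in inputs)


# Hypothetical hazard: overheating requires the heater to stick on AND
# at least one of the two protection mechanisms to fail.
overheat = and_gate(
    basic("heater_stuck_on"),
    or_gate(basic("thermostat_failed"), basic("cutoff_logic_fault")),
)


def single_point_failures(top_event, fault_names):
    """List the faults that cause the top event on their own."""
    return [name for name in fault_names if top_event({name})]


if __name__ == "__main__":
    names = ["heater_stuck_on", "thermostat_failed", "cutoff_logic_fault"]
    print(overheat({"heater_stuck_on"}))
    print(overheat({"heater_stuck_on", "thermostat_failed"}))
    print(single_point_failures(overheat, names))
```

FMEA runs in the other direction: start from each basic fault and work forward to see which hazards it can reach.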
Determine the Risks FDA levels of concern: Minor (not expected to result in injury or death), Moderate (results in minor to moderate injury), Major (results in major injury or death). German TUV characterization: (S) severity of the risk, (E) duration of the period of exposure, (G) prevention of the danger, (W) probability of occurrence.
Determine the Risks German TUV characterization
Determine the Risks German TUV Example
Define the Safety Measure Obviation: make the hazard physically impossible. Education: user training. Alarming: announce the hazard so action can be taken. Interlocks: the hazard is removed via a secondary device or logic that intercedes. Internal checking: the system detects and handles the malfunction prior to an incident. Safety equipment: goggles, gloves, etc. Restriction of access: access to potential hazards is restricted to trained personnel. Labeling: "High Voltage, do not touch".
Create Safe Requirements Consider the requirements from a safety perspective Specify the negations The system shall not move hardware before user input
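A negated requirement such as "the system shall not move hardware before user input" maps naturally onto an interlock in code plus a test that tries to violate it. The controller below is a hypothetical sketch. The class name, the units, and the rule that every move needs fresh input are assumptions.

```python
class InterlockError(RuntimeError):
    """Raised when a move is requested without prior user input."""


class MotionController:
    def __init__(self):
        self.position_mm = 0
        self._armed = False

    def user_input(self):
        """Record that the operator has confirmed the next move."""
        self._armed = True

    def move(self, delta_mm):
        if not self._armed:
            raise InterlockError("move requested before user input")
        self.position_mm += delta_mm
        self._armed = False  # ASSUMPTION: each move needs fresh input


if __name__ == "__main__":
    controller = MotionController()
    try:
        controller.move(10)
    except InterlockError as error:
        print("blocked:", error)
    controller.user_input()
    controller.move(10)
    print(controller.position_mm)
```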
Create Safe Designs Work from safe requirements Adopt a safe architecture Revisit, revise the hazard analysis during development Select measures that provide appropriate levels of detection and correction Ensure independent channels lack common-mode failures Adopt consistent strategies for handling faults Include POST and periodic run-time tests
Implementing Safety Language Choice Strong compile-time checking Strong run-time checking Support for encapsulation and abstraction (but not just because ) Exception handling Safe language constructs Void*?
Assure the Safety Process Continuously track against hazard analysis Utilize peer reviews to assure quality Verify design adherence Verify coding standards Identify how each hazard is handled
Test, test, test Black box testing White box testing Monkey testing Fault seeding Load testing Simulations System testing Unit testing