C. Mokkapati 1

A PRACTICAL RISK AND SAFETY ASSESSMENT METHODOLOGY FOR SAFETY-CRITICAL SYSTEMS

Chinnarao Mokkapati
Ansaldo Signal
Union Switch & Signal Inc.
1000 Technology Drive
Pittsburgh, PA 15219

Abstract

This paper presents a practical methodology for a) assessment of risks associated with the intended application of a safety-critical system, and b) verification that the system meets the safety design requirements that enable the risks to be kept at acceptable levels throughout its lifecycle. The methodology consists of the following steps: 1) Define the system and analyze its intended operation to determine all potential hazards; 2) Analyze the risks (potential consequences after considering the available procedural, circumstantial and physical risk reduction barriers in the intended operation of the system); 3) Determine the tolerable hazard rates for the system functions by comparing the remaining risks with industry-accepted tolerable levels; 4) Apportion the tolerable hazard rates and corresponding safety integrity levels to various subsystems/equipment within the system; and 5) Analyze the design of the subsystems/equipment and the system to show that the tolerable hazard rates will not be exceeded, and that the required levels of safety integrity (assurance against systematic failures) have been built into the system. Suitability of the methodology for railroad signaling systems is shown with the help of an example.

1.0 INTRODUCTION

When an organization such as a Railway desires to install a new product/system for the purpose of improving the efficiency and/or safety of its operations, there must be verifiable proof that the
new product/system does indeed provide the desired improvements. Specific to safety, the improvements should come in the form of a reduced level of risk (of accidents/mishaps) relative to the current level of risk (if known), or relative to commonly accepted tolerable risk levels.

This paper presents an approach that can be used for risk and safety assessment of a safety-critical system. This approach, broadly based upon U.S. Military Standard 882C (1), AREMA C&S Manual Section 17 (2), IEEE Standard 1483-2000 (3), and the CENELEC Standards EN 50126 (4), EN 50128 (5), and EN 50129 (6), has been used by the author's organization for the assessment of Automatic Train Control Systems furnished for the Copenhagen Metro and the Kuala Lumpur Monorail System. It can be applied in a practical manner to other systems, such as PTC Systems, Train Protection Warning Systems, Train Collision Avoidance Systems, etc., that use newer technologies and architectures for meeting defined risk and safety requirements.

The concepts of Safety Integrity Levels (SILs) and Tolerable Hazard Rates (THRs) are used in this approach. Reference (6) provides a detailed description of these concepts.

Section 2 of this paper presents an overview of the risk and safety analysis methodology. Section 3 presents details of risk analysis, while Section 4 outlines the system design analysis that provides proof that the system meets its safety requirements derived from the risk analysis. Section 5 gives an example.

2.0 OVERVIEW OF RISK ANALYSIS AND SYSTEM DESIGN ANALYSIS

A methodology, derived from CENELEC Report prr009-004 (7), for risk analysis and system design analysis is presented in this section. At the heart of this approach is a well-defined interface between the operational environment and the architectural design of the system. From
the safety point of view, this interface is defined by a list of hazards and tolerable hazard rates associated with the system.

The general steps of the risk analysis and system design analysis methodology are shown in Figure 1 and can be summarized as follows:

1. Define the system adequately.
2. Identify key operational hazards.
3. Determine the tolerable hazard rate (THR) for each hazard by analyzing the consequences of the hazards (taking into account the operational parameters).
4. For each hazard: analyze the causes down to a functional level, taking into account the system definition and architecture.
5. Decide which functions are implemented by which subsystem. Then, for each subsystem:
   - Collect the contributions of each function realised by the subsystem to all hazards
   - Calculate the overall tolerable hazard rate $THR_S$ for the subsystem
   - Translate $THR_S$ into a safety integrity level $SIL_S$ for the subsystem using a SIL table
   - Determine failure rates for the system elements to meet $THR_S$ for the subsystem
   - Verify & validate that $THR_S$ and $SIL_S$ are met.

This methodology can be divided into two parts: Risk Analysis, consisting of Steps 1-3, and System Design Analysis, consisting of Steps 4-5. Risk Analysis deals with the real world of the system operation. System Design Analysis deals with the technical solutions for managing the risks.
3.0 DETAILS OF RISK ANALYSIS

The Risk Analysis steps are shown in Figure 2.

3.1 System Definition

The system under investigation must be defined completely. This is typically done in the form of the following documents:

- System Requirements Specification
- System Architecture Description
- System Design Description Documents

These documents should give details of the system's:

- Functional Requirements
- Type of Operation (e.g., signaling principles)
- Operational Parameters (e.g., train schedules, speeds, density, ...)
- System Boundaries

3.2 Hazard Identification

Through a structured Hazard Identification study (e.g., as described in AREMA C&S Manual Part 17.3.5), and based on existing data from the End User's sources, the potential hazards associated with the intended operation of the system shall be identified and documented in a Hazard Log.

The following terminology is used:

1. An individual $i$ uses the technical system (e.g., a train, a Level Crossing). The usage profile is described by the number of uses $N_i$ (per year or per hour). For reference, a total exposure
per use $E_i$ (hours) may be defined (i.e., the duration of a train journey or the time needed to pass a level crossing).

2. While using the technical system, the individual is exposed to hazards arising from failure of the technical system (or its subsystems, etc.). Let there be $n$ hazards associated with the technical system. Let each hazard $H_j$ have a hazard rate $HR_j$ (hazards/hour), $j = 1, \ldots, n$. The tolerable value of each $HR_j$ is what the Risk Analysis process determines.

The probability that the individual is exposed to the hazard depends additionally on the hazard duration $D_j$ and the exposure time $E_{ij}$ of the individual to the hazard. This probability is the sum of the probability that the hazard already exists when the individual enters the system (approximately $HR_j D_j$) and the probability that the hazard occurs while the individual is exposed (approximately $HR_j E_{ij}$). Note that the exposure to the hazard $H_j$ may be shorter than or equal to the total exposure: $E_{ij} \le E_i$.

3.3 Risk Determination

From each hazard one or several types of accidents may occur. This is described for each hazard by the consequence probability $C_{jk}$ that accident $k$ occurs. Associated with each type of accident $A_k$ is a corresponding severity, which from the individual point of view is described as the probability of fatality $F_{ik}$ for the single individual. This causal chain determines the individual risk of fatality:

$$IRF_i = \sum_{\text{all hazards } H_j} N_i \left( HR_j \times (D_j + E_{ij}) \sum_{\text{accidents } A_k} C_{jk} \times F_{ik} \right) \qquad (1)$$
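As an illustration of how Equation (1) can be evaluated, the following is a minimal Python sketch. The class and function names (`Accident`, `Hazard`, `individual_risk_of_fatality`) and all numeric values are hypothetical and chosen only for illustration; the sketch assumes $N_i$ is given in uses per year, hazard rates in hazards per hour, and durations in hours.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Accident:
    consequence_prob: float  # C_jk: probability that accident k follows hazard j
    fatality_prob: float     # F_ik: probability of fatality for individual i in accident k


@dataclass(frozen=True)
class Hazard:
    hazard_rate: float       # HR_j, hazards per hour
    duration_h: float        # D_j, hours the hazard persists
    exposure_h: float        # E_ij, hours the individual is exposed per use
    accidents: Sequence[Accident]


def individual_risk_of_fatality(uses_per_year: float, hazards: Sequence[Hazard]) -> float:
    """Evaluate Equation (1): IRF_i in fatalities per year."""
    total = 0.0
    for h in hazards:
        # Probability of meeting the hazard on one use: HR_j * (D_j + E_ij)
        p_exposed = h.hazard_rate * (h.duration_h + h.exposure_h)
        # Sum over accidents A_k of C_jk * F_ik
        p_fatal = sum(a.consequence_prob * a.fatality_prob for a in h.accidents)
        total += p_exposed * p_fatal
    return uses_per_year * total


# Hypothetical input values, not taken from the paper.
example_hazard = Hazard(
    hazard_rate=1e-6,
    duration_h=5.0,
    exposure_h=0.5,
    accidents=[Accident(consequence_prob=0.01, fatality_prob=0.5)],
)
irf = individual_risk_of_fatality(uses_per_year=100, hazards=[example_hazard])
print(f"IRF = {irf:.3e} fatalities per year")
```

Keeping the per-hazard exposure term and the per-accident sum as separate intermediate values mirrors the two factors inside Equation (1), which makes it easier to see which input dominates the result.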
If the resulting IRF is less than the Tolerable Individual Risk (TIR), usually expressed in fatalities per year, then the calculated or estimated hazard rates (HR) are called tolerable hazard rates (THR).

In Equation (1), the individual probability of fatality $F_{ik}$ can be calculated from the severity $S_k$ (e.g., number of fatalities) in accident $k$, out of a population of $N_k$ exposed to accident $k$ (concept of collective measure of severity). That is,

$$F_{ik} = S_k / N_k \qquad (2)$$

Note: Accident $k$ could result in other types of potential losses, namely commercial loss and environmental loss. It is possible to quantify these losses (convert them into an equivalent number of fatalities) in order to include them in the term $S_k$ in Equation (2). This requires discussion and agreement with the User.

3.4 Risk Tolerability Criteria and THR Determination

To determine the tolerable level of risk, the GAMAB (Globalement Au Moins Aussi Bon), the ALARP (As Low As Reasonably Practicable), or the MEM (Minimum Endogenous Mortality) principle can be used. Reference (8), a report by Dr. Hendrik Schäbe of the Institute for Software, Electronics, Railroad Technology, TÜV InterTraffic GmbH, provides a detailed treatment of these principles. The GAMAB principle requires the risk of the new system to be no higher than that associated with the system being replaced. An upper and a lower bound on TIR (fatality rate in fatalities per year) can be derived from the ALARP principle. A single value of TIR can be derived from the MEM principle.
The $IRF_i$ in Equation (1) is now equated to the TIR in order to determine the tolerable value of each hazard rate $HR_j$. These are denoted $THR_j$.

4.0 DETAILS OF SYSTEM DESIGN ANALYSIS

The System Design Analysis Process is shown in Figure 3.

The Risk Analysis detailed in Section 3.0 results in a list of $n$ hazards $H_1, \ldots, H_n$ together with their tolerable hazard rates $THR_1, \ldots, THR_n$ respectively. Further analysis is then required to arrive at a suitable system architecture for the control of such hazards. This is called System Design Analysis, which is essentially a causal analysis of the hazards $H_1, \ldots, H_n$. It consists of the following tasks:

- Define the system functions and architecture (technical solution)
- Analyze the causes leading to each hazard
- Determine the safety integrity requirements (SIL and hazard rates) for the subsystems
- Determine the reliability requirements for the equipment

Causal analysis of hazards comprises two key phases. In the first phase, each THR is apportioned to a functional level (system functions). The hazard rate for a function is then translated to a SIL using the SIL table below, taken from (6). The SILs are defined at this functional level for the subsystems implementing the functionality.

Tolerable Hazard Rate (THR)        Safety Integrity Level
per hour and per function
$THR < 10^{-8}$                    4
$10^{-8} < THR < 10^{-7}$          3
$10^{-7} < THR < 10^{-6}$          2
$10^{-6} < THR < 10^{-5}$          1
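The table lookup can be sketched in a few lines of Python. The function name `sil_for_thr` is hypothetical. The table as printed uses strict inequalities and so leaves the exact powers of ten unassigned; this sketch assumes each band includes its lower bound, which is an assumption of the sketch rather than something stated in the table above.

```python
def sil_for_thr(thr_per_hour: float) -> int:
    """Map a tolerable hazard rate (per hour, per function) to a SIL.

    Assumption: each band includes its lower bound, so a THR of exactly
    1e-8 is treated as SIL 3, 1e-7 as SIL 2, and 1e-6 as SIL 1.
    """
    if thr_per_hour <= 0:
        raise ValueError("THR must be a positive rate")
    if thr_per_hour < 1e-8:
        return 4
    if thr_per_hour < 1e-7:
        return 3
    if thr_per_hour < 1e-6:
        return 2
    if thr_per_hour < 1e-5:
        return 1
    # Rates of 1e-5 per hour or more fall outside the table.
    raise ValueError("THR is outside the range covered by the SIL table")


# Illustrative rates only, one per band of the table.
for rate in (5e-9, 5e-8, 2e-7, 5e-6):
    print(f"THR = {rate:.0e} per hour -> SIL {sil_for_thr(rate)}")
```

Raising an error for rates outside the table, rather than returning a default level, keeps an out-of-range THR from being silently treated as a valid safety requirement.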
A sub-system, i.e., a combination of equipment, may implement a number of Safety-Related Functions, each of which could require a different SIL. Where this is the case, the sub-system must be designed to meet the highest Safety Integrity Level of those functions.

In the second phase of the causal analysis, the hazard rates for subsystems are further apportioned, leading to failure rates for the equipment, but at this physical or implementation level the SIL remains unchanged. Consequently, the software SIL defined in (5) would also be the same as the subsystem SIL, except in the case described in clause 5.2.3 of (5).

The apportionment process may be performed by any method that allows a suitable representation of the combinational logic, e.g., reliability block diagrams, failure modes & effects analyses, fault trees, binary decision diagrams, Markov models, etc. In any case, particular care must be taken when independence of items is required. While in the first part of the causal analysis functional independence is required (i.e., the failure of functions shall be independent with respect to systematic and random faults), physical independence is sufficient in the second part (i.e., the failure of subsystems shall be independent with respect to random faults). Assumptions made in the causal analysis must be checked and may lead to safety-relevant application rules for the implementation.

System design analysis is essentially a combination of various qualitative and quantitative hazard analyses and safety verification & validation steps. A disciplined approach to system design
analysis using a structured Safety Assurance Program (e.g., as outlined in AREMA C&S Manual Part 17.3.1) is recommended.

5.0 EXAMPLE

A hypothetical Train Protection Warning System (TPWS), shown in Figure 4, is used as an example for detailing the steps involved in the Risk Analysis. The Safety Analysis portion is not covered in detail for this hypothetical system. The desired functions of the TPWS are a) Provide Emergency Brake application to prevent Signals Passed at Danger (SPADs), and b) Provide driver warning and speed supervision with the ability to stop the train if an overspeed condition is ignored by the driver. This system is intended to be used on a Railroad with heavy passenger train traffic, and the goal is to reduce the risk of fatalities due to SPADs to a tolerable level.

The following steps are as outlined in Section 3. The quantitative numbers used in the example calculations are the author's assumed data and are not reflective of any particular Railroad's statistics.

HAZARD $H_1$: TPWS fails to prevent a SPAD that could result in a collision and ensuing fatalities.

RISK ANALYSIS

1. Determine Risk Tolerability

A reasonably practical scheme shall be implemented with the aim of ensuring that train collisions due to SPADs pose a risk of fatality no higher than 1 in 1,000,000 per year. That is,
Tolerable Individual Risk (TIR) $\le 10^{-6}$ per year

(Risk of SPAD-caused fatality to the train driver, also assumed to be the same for a passenger if the train involved in the event is a passenger-carrying train.)

2. Determine Risk Exposure

$N_i$ = Number of times/year train $i$ passes signals = 10,000

$D_1$ = Duration of Hazard $H_1$ = 10 hours (a pessimistic estimate)

Hazard $H_1$ exists when the TPWS has a wrong-side (hazardous) failure that remains non-negated or unrepaired. Hazard $H_1$ has a hazard rate of $HR_1$ failures/hour. The goal is to determine this $HR_1$ before the design of the TPWS can proceed.

$E_{i1}$ = Exposure time of the train to Hazard $H_1$ (the time taken by the train to pass a signal at a failed TPWS location; very short relative to $D_1$, and therefore ignored)

3. Cause-Consequence Analysis

Done in the form of an Event Tree Analysis (ETA), as shown in Figure 5.

4. Loss Analysis

From the ETA, two types of accidents and their probabilities of occurrence are determined and listed below. For the sake of simplicity, assume the probabilities of fatality in each accident as shown below.

No. ($k$)   Accident ($A_k$)        Probability of Occurrence ($C_{1k}$)   Probability of Fatality ($F_{ik}$)
1           High Speed Collision    0.00005                                0.9
2           Low Speed Collision     0.00001                                0.5

5. Determine THR

Substitute the above values in Equation (1):

$$IRF_i = N_i \left\{ HR_1 \times (D_1 + E_{i1}) \sum_k (C_{1k} \times F_{ik}) \right\}$$
$$= 10{,}000 \times HR_1 \times 10 \times (0.00005 \times 0.9 + 0.00001 \times 0.5) \le TIR = 10^{-6}$$

The bracketed sum evaluates to $5 \times 10^{-5}$, so the condition reduces to $5 \times HR_1 \le 10^{-6}$. Taking the limiting value results in $HR_1 = 2 \times 10^{-7}$ failures/hour, which is now called $THR_1$.

SYSTEM DESIGN ANALYSIS

Apportion $THR_1$ to individual pieces of equipment in the TPWS by using Failure Modes and Effects Analysis (FMEA) and Fault Tree Analysis (FTA) techniques. Guidance given in AREMA C&S Manual Part 17.3.3 (2) and IEEE Std 1483 (3) can be used. Make sure physical, functional and process dependencies within the TPWS equipment are properly handled with the use of AND gates in the FTA. An iterative approach is needed to arrive at a cost-effective design. Different parts of the TPWS equipment may end up being designed to different SILs for systematic failure integrity.

6.0 CONCLUSIONS

A practical methodology for risk and safety analysis using the concepts of tolerable risk, safety integrity levels, and tolerable hazard rates is presented in this paper with the help of a simple example. This methodology can be applied to signaling and train control systems that use new technologies and architectures, and is expected to provide a cost-effective approach to both design and assessment of such systems.

7.0 REFERENCES

(1) United States Department of Defense (January 19, 1993). Military Standard MIL-STD-882C: System Safety Program Requirements.
(2) AREMA Communications & Signal Manual, Section 17: Quality Principles. Parts 17.3.1 (2004), 17.3.3 (2004), and 17.3.5 (2004).
(3) IEEE Standard 1483-2000: Verification of Vital Functions for Processor-Based Systems Used in Signal and Train Control.
(4) CENELEC Standard EN 50126: Railway Applications - The Specification and Demonstration of Dependability, Reliability, Availability, Maintainability and Safety (RAMS). Issue: March 2000.
(5) CENELEC Standard EN 50128: Railway Applications - Communications, signaling and processing systems - Software for railway control and protection systems. Issue: March 2001.
(6) CENELEC Standard EN 50129: Railway Applications - Communications, signaling and processing systems - Safety related electronic systems for signaling. Issue: May 2002.
(7) CENELEC Report prr009-004: Railway Applications - Systematic Allocation of Safety Integrity Requirements (March 1999).
(8) Different Approaches For Determination Of Tolerable Hazard Rates, by Dr. Hendrik Schäbe, Institute for Software, Electronics, Railroad Technology, TÜV InterTraffic GmbH, 51105 Köln.
List of Figures in the Paper "A Practical Risk and Safety Assessment Methodology for Safety-Critical Systems"

Figure 1. Risk and Safety Analysis Overview (From Reference (4))
Figure 2. Process Details of Risk Analysis (From Reference (4))
Figure 3. System Design Analysis Summary (From Reference (4))
Figure 4. A Simple Train Protection Warning System
Figure 5. Cause-Consequence Analysis (Determination of External Risk Reduction)
[Figure 1 is a flowchart with Input, Activity and Output columns. The recoverable content is listed below.]

Risk Analysis:
1. Define System (functions, boundary, interfaces, environment, ...). Output: System definition.
2. Identify (system) hazards. Output: top-level hazards, Hazard Log.
3. Analyze consequences of hazards. Input: Risk tolerability criteria (Safety). Output: Risk, THRs, System Requirements Specification.

System Design Analysis:
4. Analyze causes of hazards; identify additional hazards. Input: (Sub-)System Architecture. Output: Hazard Analysis. Iterate until system element level.
5. Allocate Safety Integrity Requirements to subsystems/equipment. Output: SILs, Failure Rates, Subsystem Requirements Specification.

Figure 1. Risk and Safety Analysis Overview (From Reference (4))
[Figure 2 is a flowchart. Recoverable box labels, in reading order: System Definition; Analyze Operation; Identify Hazards; Estimate Hazard Rates; Identify Consequences (Accidents, Near Misses, Safe State); Hazard Log; Determine Risk; Risk Tolerability Criteria (Safety); Determine THR; System Design Analysis; System Requirements Specification (Safety Requirements).]

Figure 2. Process Details of Risk Analysis (From Reference (4))
[Figure 3 is a flowchart. The recoverable content is listed below.]

Input: Hazards $H_1, \ldots, H_n$ and their tolerable hazard rates.

For Each Hazard:
- For each AND: Common Cause Failure Analysis; Fault detection mechanism and time; Safety-related application conditions
- Use FMEAs, FTAs, Reliability Block Diagrams, Binary Decision Diagrams, Markov models, etc. as appropriate

For Each Subsystem (inputs: System Architecture, SIL Table):
1. Collect contributions to hazards
2. Determine THR and SIL

Output: SIL and THR for subsystems. Apportion failure rates to elements. Output: SIL and THR for elements.

Conduct Verification & Validation of SILs and THRs.

Figure 3. System Design Analysis Summary (From Reference (4))
[Figure 4 is a diagram of the TPWS with numbered components.]

1. Onboard Computer (OBC)
2. Transponder Transmission Module
3. Transponder Antenna
4. Driver's Console
5. Tachometer
6. Emergency Brake Interface
7. Signal Control Logic
8. Lineside Electronic Unit
9. Transponder

BASIC FUNCTIONALITY DESIRED:
- Provide driver warning, then Emergency Brake Application, to prevent Signal Passed at Danger.
- Provide driver warning and speed supervision with the ability to stop the train if an overspeed condition is ignored by the driver.

Figure 4. A Simple Train Protection Warning System
[Figure 5 is an event tree. The recoverable content is listed below.]

Initiating event: $H_1$, train approaches a Signal at Danger.

Branch events: Engineer passes Signal at Danger (Yes: 0.001); Engineer does not notice obstruction, plows ahead; Engineer notices obstruction, starts braking, but can't stop short of obstruction. Further "Yes" branch probabilities shown in the figure: 0.5, 0.1, 0.2.

Outcomes:
High Speed Collision   0.00005
Low Speed Collision    0.00001
Safe State             0.99994

Figure 5. Cause-Consequence Analysis (Determination of External Risk Reduction)
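The outcome probabilities of Figure 5 and the THR derived in Section 5 can be cross-checked numerically. The following minimal Python sketch uses only the example's stated values ($N_i$ = 10,000 per year, $D_1$ = 10 hours, $E_{i1}$ ignored, and the accident table of the Loss Analysis step). The variable names are hypothetical, and treating the ignored $E_{i1}$ as exactly zero is an assumption of the sketch.

```python
import math

TIR_PER_YEAR = 1e-6        # tolerable individual risk from Section 5
PASSES_PER_YEAR = 10_000   # N_i
HAZARD_DURATION_H = 10.0   # D_1
EXPOSURE_H = 0.0           # E_i1, ignored in the example (assumed exactly zero here)

# Accident type -> (C_1k probability of occurrence, F_ik probability of fatality)
ACCIDENTS = {
    "High Speed Collision": (0.00005, 0.9),
    "Low Speed Collision": (0.00001, 0.5),
}

# Whatever does not end in an accident ends in the safe state.
p_safe_state = 1.0 - sum(c for c, _ in ACCIDENTS.values())

# Sum over accidents of C_1k * F_ik
fatality_weight = sum(c * f for c, f in ACCIDENTS.values())

# Equation (1) with a single hazard is linear in HR_1, so it can be solved directly.
irf_per_unit_hazard_rate = PASSES_PER_YEAR * (HAZARD_DURATION_H + EXPOSURE_H) * fatality_weight
thr_1 = TIR_PER_YEAR / irf_per_unit_hazard_rate

print(f"Safe state probability: {p_safe_state:.5f}")
print(f"THR_1: {thr_1:.1e} failures/hour")
```

Because Equation (1) is linear in $HR_1$ when there is a single hazard, no iteration is needed; with several hazards sharing one TIR budget, the budget would first have to be split among them.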