Reliability of Safety-Critical Systems
Chapter 10. Common-Cause Failures - part 1
Mary Ann Lundteigen and Marvin Rausand
mary.a.lundteigen@ntnu.no & marvin.rausand@ntnu.no
RAMS Group, Department of Production and Quality Engineering, NTNU
(Version 1.0 per July 2015)
M.A.Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 1 / 31

Slides related to the book
Reliability of Safety-Critical Systems: Theory and Applications, by Marvin Rausand. Wiley, 2014.
Homepage of the book: http://www.ntnu.edu/ross/books/sis

About this slide series
These slides build on material presented in Sections 10.1-10.3 of the textbook.

Common cause failures (CCFs) in brief
Some keywords:
- One cause, multiple effects
- Modeling was introduced in the nuclear power industry in the 1970s.
- Increased attention also in other industries and in key standards like IEC 61508.
Figure: NUREG/CR-4780: a foundation stone

CCFs as threats to safety
CCFs are a particular concern in relation to safety barriers.
Figure: Hazards or threats -> Safety barriers -> Victim / vulnerable target
CCFs may lead to:
- failure of one safety barrier, OR
- simultaneous failure of several safety barriers

Potential effects of CCFs
Figure (diagram not reproduced): a cause acting on the input elements, logic solver, and final elements of two safety systems (e.g., PSD and ESD), with the questions "Detected & isolated (primary)?", "Detected and isolated (secondary)?", and "Production stopped?", each answered Yes or No.

Dependent failures: Several types of dependent failures
A CCF belongs to the category of dependent failures. Other categories of dependent failures are:
- Common mode failures (CMFs), which are a subcategory of CCFs
- Cascading failures
Systematic failures may in some cases be CCFs, or be the cause of CCFs.

Dependent failures: Cascading failures
Cascading failures: A sequence of item failures where the first failure shifts its load to one or more nearby items such that these fail and again shift their load to other items, and so on.
Cascading failures are sometimes referred to as a domino effect.

Definition of CCFs: Single point of failure and CCFs
Single point of failure: A single failure that results in the failure to function of other items and functions.
- Discussed in an article by Gentile and Summers (2006).
- Single points of failure may be the loss of a power supply, router, communication link, logic solver, etc.

Definition of CCFs: Single point of failure and CCFs
Figure: From Gentile and Summers (2006), DOI: 10.1002/prs.10145

Dependent failures: Statistical properties
Consider two items, 1 and 2, and let $E_i$ denote the event that item $i$ is in a failed state. The probability that both items are in a failed state is
$$\Pr(E_1 \cap E_2) = \Pr(E_1 \mid E_2)\,\Pr(E_2) = \Pr(E_2 \mid E_1)\,\Pr(E_1)$$
The two events, $E_1$ and $E_2$, are said to be statistically independent if
$$\Pr(E_1 \mid E_2) = \Pr(E_1) \quad\text{and}\quad \Pr(E_2 \mid E_1) = \Pr(E_2)$$
such that
$$\Pr(E_1 \cap E_2) = \Pr(E_1)\,\Pr(E_2)$$
Note that when $E_1 \cap E_2 = \emptyset$, then $\Pr(E_1 \cap E_2) = 0$ and $\Pr(E_1 \mid E_2) = 0$. Events with nonzero probabilities therefore cannot be both mutually exclusive and independent.

Dependent failures: Statistical properties
Two items, 1 and 2, are dependent when
$$\Pr(E_1 \mid E_2) \neq \Pr(E_1) \quad\text{and}\quad \Pr(E_2 \mid E_1) \neq \Pr(E_2)$$
Items 1 and 2 are said to have a positive dependence when $\Pr(E_1 \mid E_2) > \Pr(E_1)$ and $\Pr(E_2 \mid E_1) > \Pr(E_2)$, such that
$$\Pr(E_1 \cap E_2) > \Pr(E_1)\,\Pr(E_2)$$
Items 1 and 2 are said to have a negative dependence when $\Pr(E_1 \mid E_2) < \Pr(E_1)$ and $\Pr(E_2 \mid E_1) < \Pr(E_2)$, such that
$$\Pr(E_1 \cap E_2) < \Pr(E_1)\,\Pr(E_2)$$
where $E_i$ is the event that item $i$ is in a failed state.
A CCF represents a positive dependence.

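The definitions above can be turned into a small numerical check. The following is a minimal sketch, not part of the original slides: the function names, the tolerance, and the example probabilities are all illustrative assumptions.

```python
import math


def conditional_probability(p_joint: float, p_given: float) -> float:
    """Return Pr(E1 | E2) = Pr(E1 and E2) / Pr(E2), assuming Pr(E2) > 0."""
    if p_given <= 0:
        raise ValueError("the conditioning event must have nonzero probability")
    return p_joint / p_given


def classify_dependence(p1: float, p2: float, p_joint: float) -> str:
    """Classify two failure events as independent, positive, or negative.

    p1, p2  : Pr(E1) and Pr(E2), both assumed to be in (0, 1]
    p_joint : Pr(E1 and E2), which cannot exceed min(p1, p2)
    """
    if not (0 < p1 <= 1 and 0 < p2 <= 1):
        raise ValueError("p1 and p2 must be in (0, 1]")
    if not (0 <= p_joint <= min(p1, p2)):
        raise ValueError("p_joint must be in [0, min(p1, p2)]")
    product = p1 * p2
    # A relative tolerance (assumed value) absorbs floating-point rounding.
    if math.isclose(p_joint, product, rel_tol=1e-9):
        return "independent"
    return "positive" if p_joint > product else "negative"


# Hypothetical numbers: two items, each failed with probability 0.01.
print(classify_dependence(0.01, 0.01, 0.0001))
print(classify_dependence(0.01, 0.01, 0.001))
print(classify_dependence(0.01, 0.01, 0.00001))
```

Mutually exclusive events (joint probability 0) are classified as negatively dependent here, which agrees with the remark that events with nonzero probabilities cannot be both mutually exclusive and independent.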
Dependent failures: Positive or negative
There are two types of dependencies:
- Positive dependence is usually most relevant in reliability and risk analyses.
- Negative dependence may also be relevant in some cases.
Example of a negative dependence: Consider two items that influence each other by producing vibration or heat. When one item fails and is down for repair, the other item will have an improved operating environment, and its probability of failure is reduced.

Dependent failures: Intrinsic or extrinsic
NUREG/CR-6268 defines two additional types of dependencies, intrinsic and extrinsic.
Intrinsic dependency: A situation where the functional status of a component is affected by the functional status of other components. Sub-classes:
- Functional requirement dependency
- Functional input dependency
- Cascading failure
Remark: NUREG/CR-6268 is a successor of NUREG/CR-4780.

Dependent failures: Intrinsic or extrinsic
Extrinsic dependency: A situation where the dependency or coupling is not inherent or intended in the functional characteristics of the system. Extrinsic dependencies may be related to:
- Physical or environmental stresses
- Human intervention

Definition of CCFs: One unique interpretation?
Did you think that CCF has one, unique interpretation?

Definition of CCFs: Smith and Watson
The definition of Smith and Watson (1980) is perhaps the most comprehensive one:
1. The items affected are unable to perform as required.
2. Multiple failures exist within (but not limited to) redundant configurations.
3. The failures are first-in-line type of failures and not the result of cascading failures.
4. The failures occur within a defined critical time period (e.g., the time a plane is in the air during a flight).
5. The failures are due to a single underlying defect or physical phenomenon (the "common cause").
6. The effect of the failures must lead to some major disabling of the system's ability to perform as required.

Definition of CCFs: Reflections
Demanding environment: Consider a system of m gas detectors that are installed at an offshore facility. The performance of the detectors may be affected by salty, humid air. The exposure will normally not result in failure of all detectors at the same time. The time between detector failures may be rather long, and the question may then be: Is it or is it not a CCF?
Cheap and unreliable: You have been given the honorable task of buying new light bulbs for a large room. You buy some cheap ones to save money, and after some time they start to fail one after the other. Are the multiple bulb failures a CCF, or just the consequence of having bought light bulbs with a shorter life?

Definition of CCFs: Various industry sectors
- Nuclear industry (NEA, 2004): A dependent failure in which two or more component fault states exist simultaneously or within a short time interval, and are a direct result of a shared cause.
- Space industry (NASA PRA guide, 2002): The failure (or unavailable state) of more than one component due to a shared cause during the system mission.
- Process industry (IEC 61511, 2003): Failure, which is the result of one or more events, causing failures of two or more separate channels in a multiple channel system, leading to system failure.

Definition of CCFs: Other examples
Lundteigen and Rausand (2007), related to safety-instrumented systems:
1. The CCF event comprises complete failures of two or more redundant components or two or more safety instrumented functions (SIFs) due to a shared cause.
2. The multiple failures occur within the same inspection or function test interval.
3. The CCF event may lead to failure of a single SIF or loss of several SIFs.

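Criterion 2 above lends itself to a simple screening of failure records. The sketch below is an illustration only, not a method from the slides: it assumes failure times measured from the first function test at time 0, a fixed test interval, and hypothetical detector tags. It flags test intervals in which two or more distinct components failed; whether those failures share a cause (criterion 1) still has to be judged separately.

```python
from collections import defaultdict


def ccf_candidates(failures, test_interval):
    """Group failure records by function test interval.

    failures      : iterable of (component, failure_time) tuples, with times
                    in the same unit as test_interval and counted from the
                    first test at time 0 (an assumption of this sketch)
    test_interval : length of the function test interval, must be > 0

    Returns a dict {interval_index: sorted component names}, keeping only
    intervals in which at least two distinct components failed.
    """
    if test_interval <= 0:
        raise ValueError("test_interval must be positive")
    by_interval = defaultdict(set)
    for component, failure_time in failures:
        index = int(failure_time // test_interval)
        by_interval[index].add(component)
    return {
        index: sorted(components)
        for index, components in sorted(by_interval.items())
        if len(components) >= 2
    }


# Hypothetical records: failure times in hours, annual function test.
records = [("GD-1", 100), ("GD-2", 900), ("GD-3", 8000), ("GD-4", 20000)]
print(ccf_candidates(records, test_interval=8760))
```

With these assumed records, the first three detectors fall in the first test interval and are flagged as a candidate CCF event, while the fourth fails alone in a later interval.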
Attributes: Two types
There are two types of CCFs:
- CCFs that occur at the same time due to a shock, or
- CCFs that occur over a certain time interval due to an increase in stress.
For the latter type, it is key to define what is meant by "time interval".

Attributes: Root causes and coupling factors
The shared cause of a CCF may be split into two elements: the root cause and the coupling factor.
- Root cause: Why did the component fail? (i.e., looking at each affected component)
- Coupling factor: Why were several components affected? (i.e., looking at the relationships between the affected components)
Figure: Root causes and coupling factors leading to the failure events E1, E2, and E3.

Attributes: Root causes and coupling factors
Root cause: Most basic cause of item failure that, if corrected, would prevent recurrence of this and similar failures.
Coupling factor: Property that makes multiple items susceptible to the same root cause. A coupling factor is also called a coupling mechanism.

Attributes: Types of root causes
We may distinguish between pre-operational and operational causes:
- Pre-operational root causes: Design, manufacturing, construction, installation, and commissioning errors.
- Operational root causes:
  - Operation and maintenance-related: Inadequate maintenance and operational procedures, execution, competence, and scheduling.
  - Environmental stresses: Internal and external exposure outside the design envelope, or energetic events such as earthquake, fire, and flooding.

Attributes: Types of coupling factors
To look for coupling factors is the same as to look for similarities:
- Same design (principles)
- Same hardware
- Same function
- Same software
- Same installation staff
- Same maintenance and operational staff
- Same procedures
- Same system/item interface
- Same environment
- Same (physical) location

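Looking for similarities can be sketched as a comparison of component attributes. The example below is a minimal illustration under stated assumptions: the attribute names, the valve tags, and their values are hypothetical, and a shared attribute only marks a possible coupling factor that needs engineering judgment.

```python
def shared_coupling_factors(components):
    """Return the attributes on which all components have the same value.

    components : dict mapping component name -> dict of attribute -> value
                 (attribute names are illustrative, e.g. "design")
    """
    records = list(components.values())
    if len(records) < 2:
        return {}  # a coupling factor needs at least two components
    first, others = records[0], records[1:]
    shared = {}
    for attribute, value in first.items():
        if all(other.get(attribute) == value for other in others):
            shared[attribute] = value
    return shared


# Hypothetical data for two redundant safety valves.
valves = {
    "PSV-A": {"design": "type X", "location": "module M1",
              "maintenance_staff": "crew 1", "procedure": "P-101"},
    "PSV-B": {"design": "type X", "location": "module M2",
              "maintenance_staff": "crew 1", "procedure": "P-101"},
}
print(shared_coupling_factors(valves))
```

In this assumed data set the valves share design, maintenance staff, and procedure but not location, so only the first three would be examined as candidate coupling factors.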
Visualization
Figure (diagram not reproduced): two redundant safety valves, annotated with the labels "Design & engineering errors", "O&M errors", "Environmental stresses", "Single point of failure", "Same environment", "Same hardware/software", "Same physical location", "Same staff", "Same installation/tie-in/interface", and "Same procedures".

Reflection
The distinction between independent failures and CCFs is not always clear:
- Assume that you buy some low-price light bulbs, and that most of them have failed after a short period. Is this a CCF (cheap bulbs of the same brand), or is the main explanation just that each bulb is not very reliable and has a higher failure rate?
- Some sensors are located on an offshore facility (topside) and are exposed to ocean weather (high humidity, salty water droplets). After a couple of years, some of the sensors fail. Is this due to a CCF, or is it the harsh environment that has resulted in an increased failure rate?

Defense measures
Nevertheless, it is important to reduce the chance of having CCFs, as the effect of introducing redundancy may be significantly reduced in the absence of such measures. Possible defenses are:
- Increase separation and/or segregation: Two sensors may send their signals to two separate input cards of the logic solver. Cables for very critical systems, such as those from an uninterruptible power supply, may be routed through separate locations.
- Diversity: A vessel may be equipped with different level measurement types: ultrasonic or radar, and magnetic float.

Defense measures
Possible defenses are (continued):
- Increase robustness: Mount sensors inside weather protection cabinets. Mount a fan in a cabinet to reduce temperature.
- Increase inherent reliability of each channel/element: All measures taken to increase the inherent reliability are effective for reducing independent failures as well as CCFs.
- Keep design simple: Complex systems that are difficult to comprehend are more subject to systematic faults. If the faults are due to misunderstanding, it is more likely that the fault will be replicated for other equipment.

Defense measures
Possible defenses are (continued):
- Analysis: Analysis is a way to reveal systematic faults, often those that are not evident without detailed investigation. Analysis may reveal whether systematic faults have been introduced in several pieces of equipment, or whether they can result in a single point of failure.
- Procedures and human interface: This measure is also related to the possible avoidance of systematic faults.
- Competence and training: This measure is also related to the possible avoidance of systematic faults.

Defense measures
Possible defenses are (continued):
- Environmental control: Same meaning as increasing robustness, but the main focus is on robustness against environmental exposure.
- Diagnostics and coverage: All faults revealed by diagnostics can be acted upon immediately or otherwise managed by compensating measures. This means that it is likely that the effect of a subsequent failure will be reduced.

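The remark that opens the defense measures, that the effect of redundancy may be significantly reduced without them, can be illustrated with the relation Pr(E1 and E2) = Pr(E1 | E2) Pr(E2) from the slides on statistical properties. This is a minimal sketch with hypothetical numbers. It assumes two identical items in a configuration that fails only when both items are failed, and the conditional probability 0.10 is an invented value, not data from the slides.

```python
def both_failed(p_single: float, p_conditional: float) -> float:
    """Pr(E1 and E2) = Pr(E1 | E2) * Pr(E2) for two identical items."""
    return p_conditional * p_single


def redundancy_gain(p_single: float, p_conditional: float) -> float:
    """Factor by which duplication reduces the failure probability,
    relative to a single item (assumes failure only if both items fail)."""
    return p_single / both_failed(p_single, p_conditional)


p = 0.01  # hypothetical probability that one item is in a failed state

# Independent items: Pr(E1 | E2) = Pr(E1) = p.
gain_independent = redundancy_gain(p, p)

# Positive dependence: Pr(E1 | E2) = 0.10 > Pr(E1) (assumed value).
gain_dependent = redundancy_gain(p, 0.10)

print(gain_independent, gain_dependent)
```

Under these assumptions the gain from the second item drops from roughly a factor of 100 to roughly a factor of 10, which is why defenses that lower the conditional failure probability matter.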