Reliability of Safety-Critical Systems Chapter 3. Failures and Failure Analysis

Reliability of Safety-Critical Systems Chapter 3. Failures and Failure Analysis Mary Ann Lundteigen and Marvin Rausand mary.a.lundteigen@ntnu.no RAMS Group Department of Production and Quality Engineering NTNU (Version 1.1 per July 2015) Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 1 / 29

Reliability of Safety-Critical Systems Slides related to the book Reliability of Safety-Critical Systems Theory and Applications Wiley, 2014 Theory and Applications Marvin Rausand Homepage of the book: http://www.ntnu.edu/ross/ books/sis Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 2 / 29

Learning objectives To become familiar with key terms and concepts related to failures and failure classification To become familiar with different ways of failure classification strategies, using the following sources as basis: IEC 61508 (and IEC 61511) The PDS method OREDA To become aware of some typical SIS related failures Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 3 / 29

Definition of failure Failure: The termination of the ability of an item to perform a required function. [IEV 191-04-01] A failure is always related to a required function. The function is often specified together with a performance requirement. 1 A failure occurs when the function cannot be performed or has a performance that falls outside the performance requirement. Shutdown valve According to the performance requirement, the maximum closing time of a shutdown valve shall be no longer than 15 seconds. A failure of the closing function occurs when the closing time exceeds 15 seconds. 1 Also called a functional requirement Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 4 / 29

Failure attributes A failure is an event that occurs at a specific point in time. A failure may: Develop gradually Occur as a sudden event The failure may sometimes be revealed: On demand (i.e., when the function is needed) ( hidden ) During a functional test (also hidden ) By monitoring or diagnostics ( evident ) Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 5 / 29

Fault Fault: The state of an item characterized by inability to perform a required function [IEV 191-05-01] While a failure is an event that occurs at a specific point in time, a fault is a state that will last for a shorter or longer period. In most cases, an item will have a fault after a hardware failure has occurred and we say that the item is in a failed state. Design and installation errors may also prevent the item from performing its required function. The item has a fault that is not preceded by any hardware failure and we call this fault a systematic fault. Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 6 / 29

Error Error: Discrepancy between a computed, observed, or measured value or condition and the true, specified, or theoretically correct value or condition. [IEC 191-05-24]. An error is present when the performance of a function deviates from the target performance (i.e., the theoretically correct performance), but still satisfies the performance requirement. An error will often, but not always, develop into a failure. Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 7 / 29

Relationship failure, fault and error A failure may originate from an error. When the failure occurs, the item enters a fault state. Performance Actual performance Error Failure (event) Fault (state) Target value Acceptable deviation Time Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 8 / 29

Failure Mode definition Failure mode: The way a failure is observed on a failed item. [IEC 191-05-22] A failure mode is the way in which an item could fail to perform its required function. An item can fail in many different ways a failure mode is a description of a possible state of the item after it has failed. Pump Performance requirement: The pump must provide an output between 100 and 110 liters per minute. Associated failure modes may be: No output Too low output Too high output Too much fluctuation in output Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 9 / 29

Failure Mode attributes A failure mode is always related to a required function and the associated performance requirement. A failure mode is description of a fault (i.e., a state) and not of a failure (i.e., an event). A more correct term would therefore be fault mode 2 Some data sources list, for example, corrosion as a failure mode. This is wrong. Corrosion is a failure mechanism and may be a cause of a failure mode. It is, however, not a description of a lost function. 2 This term is used in IEC 60300, but failure mode is so common that a change of the term might confuse the users. Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 10 / 29

Classification of Failures Failures may be classified according to their: Causes: To avoid future occurrences and make judgments about repair Effects: To rank between critical and not so critical failures Detectability: To distinguish failures that may be revealed automatically (and shortly after their occurrence) and those that may be hidden until special effort is taken, such as proof tests. And several other criteria. Special category: Common-cause failures (CCFs) Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 11 / 29

Failure Classification in IEC 61508 IEC 61508 classify failures according to their: Causes: Effects: Random (hardware) faults Systematic faults (including software faults) Safe failures Dangerous failures Detectability: Detected - revealed by online diagnostics Undetected - revealed by functional tests or upon a real demand for activation Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 12 / 29

Random and systematic faults Random (hardware) faults: Failures resulting from the natural degradation (i.e. degradation within the design envelope) mechanisms of the item. Some additional random failure causes may be added: e.g, some types of human errors (ref. ISO-TR 12489) Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 13 / 29

Random and systematic faults Systematic faults (non-physical causes): Faults that can be related to a particular cause other than natural degradation and foreseen stressors. Systematic faults may be due to errors made during specification, design, operation and maintenance phases of the lifecycle. Each systematic failure may be regarded as one of a kind: can normally be eliminated by a modification, either of the design or manufacturing process, the testing and operating procedures, the training of the operators or documentation. Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 14 / 29

Safe and dangerous faults Safe (S) failure: The item may operate without any demand (PDS method handbook) Faiure which does not have the potential to put the safety-related system in a hazardous or fail-to function state (IEC 61508, part 4). or a Dangerous (D) failure: The item does not operate upon a demand (PDS method handbook) Failure which has the potential to put the safety-related system in a hazardous or fail-to function state (IEC 61508, part 4). Non-critical failure (or no-part/no-effect faults): Failures where the main functions of the item are not affected (adapted from IEC 61508-4). Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 15 / 29

Detected and undetected faults Safe and dangerous failures may be classified further as either detected (by diagnostics) or undetected (not detected by diagnostics): Safe undetected (SU): A spurious closure of a valve usually falls into this category Safe detected (SD): A non-critical alarm from a sensor may fall into this category Dangerous detected (DD): A critical alarm from a detector, such as out of range or loss of signal, may fall into this category. As long as the fault is not corrected, the item will remain unable to carry out the function. Dangerous undetected (DU): A failure that is hidden and which will prevent the execution of a safety-critical failure. A shutdown valve that is stuck in open position may be such an example. Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 16 / 29

Other failure classifications: PDS method The PDS method (www.sintef.no/pds) distinguishes between: Critical failures Dangerous failures: detected and undetected Safe detected and safe undetected (spurious) failures Non-critical failures Total λ Critical λ Crit Undetected Detected λ DU λ SU λ DD λ SD Taken into account for SFF calculations λ NONC Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 17 / 29

Other failure classifications: PDS method Failure Random hardware failure Systematic failure Aging failure - random failures due to natural (and foreseen) stressors Software failure - Inadequate specificaton - Programming error - Error during software update Installation failure - Gas detector capsulating left on after commisioning - Valve installed in wrong direction - Incorrect sensor location Operational failure - Valve left in wrong position - Sensor calibration failure - Detector in bypass mode Source: PDS method handbook (2009) Design related failure - Inadequate specificaton - Inadequate implementation Excessive stress failure - Excessive vibration - Unforeseen sand prod. - Too high temperature Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 18 / 29

Examples of Systematic Faults Systematic faults may be: Software faults: Programming errors, compilation errors, inadequate testing, unforeseen application conditions, change of system parameters, etc. Design related faults: Faults (other than software faults) introduced during the design phase of the equipment. It may be a fault in the system specification itself, a fault in the manufacturing process and/or in the quality assurance of the item. Installation faults: Faults introduced during the last phases prior to operation, i.e., during installation or commissioning. If detected, such faults are typically removed during the first months of operation and such faults are therefore often excluded from data bases. Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 19 / 29

Examples of Systematic Faults Systematic faults may be (continued): Excessive stress: Failures that occur from stresses beyond the design specification are placed upon the component. The excessive stresses may be caused either by external causes or by internal influences from the medium. Operational errors: Initiated by human errors during operation or maintenance/testing Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 20 / 29

Failure Classification Illustration Causes Effects Design errors Human interaction errors Systematic fault Safe Dangerous Usually not quantified* Detected Undetected Detected Undetected Environmental stresses Normal degradation *See ISO TR 12489 Random hardware failures Safe Dangerous l SD l SU l DD l DU Detected Undetected Detected Undetected *) Some systematic failures may be included in historical databases. Some systematic failures may also be catered for in the CCF rate. Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 21 / 29

PDS method vs. in IEC 61508 Failure Dangerous (D) Safe (S) IEC 61508 Dangerous detected (DD) Dangerous undetected (DD) Safe (spurious) No part No effect PDS Dangerous detected (DD) Dangerous undetected (DD) Safe undetected (spurious) Safe detected Non-critical (NONC) Dangerous (D) Safe (S) Failure Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 22 / 29

Failure classification in OREDA Critical, degraded, and incipient (further explained on a separate slide) Failure mechanisms: Pre-defined categories of immidiate failure causes defined for each maintainable item Boundary conditions, maintainable items: Maintainable items: Repairable items and subsystems for which failures are recorded and data are presented Boundary conditions: Assumptions made about maintainable item, including boundaries for what to consider as part of the maintainable item and what is not. Ref: ISO 14224 and OREDA handbooks Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 23 / 29

Failure classification in OREDA Critical failure: A failure of an item that causes immidiate cessation of its ability to perform a required function. In this case, its ability comprises two elements: Loss of ability to function on demand (safety-related) Loss of ability to maintain production (production-availability related) This means that critical failures usually include what IEC 61508 defines as DU, DD, and SU failures. Degraded failure: A partial failure where the item has a degraded performance, but is still able to perform i s essential functions Incipient failure: Also a partial failure, but its degradation is barely noticable and can be regarded as a very early symptom of a degradation under development Ref: ISO 14224 and OREDA handbooks Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 24 / 29

Definition of CCF Common cause failure: A dependent failure in which two or more component fault states exist simultaneously or within a short time interval, and are a direct result of a shared cause. Example: Two pressure transmitters in a SIF have failed due to a calibration error. Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 25 / 29

What does reliability data include? Reliability data supplied by manufacturers are often based on random hardware faults alone, while data sources based on industry experience may include both. It is sometimes argued that systematic faults should be excluded from the input data used for reliability quantification, as they do not (necessarily) share the properties of random hardware faults. A random hardware fault may re-occur, while a systematic fault (ideally) is removed forever once it has been corrected. Many standards on reliability analysis therefore limit reliability quantification to random hardware faults, and rely on procedures, testing, reviews, proper training, and so on to avoid systematic faults. Reflection In what way may the statements above impact the confidence in the results from the reliability calculations? Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 26 / 29

Failure Classification based on IEC 61508 Consider a shutdown valve: Failure mode Fail to close (FTC) Leakage in closed position (LCP) Premature (spurious) closure (PC) Fail to open (FTO) Leakage to environment (LTE) Classification DU DU SU SU SD www.franklinvalve.com Remark: Valves have usually limited diagnostic features, as opposed to sensors/transmitters and logic solvers. Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 27 / 29

FMEDA - a variant of FMECA An Failure modes, Effects, and Diagnostic Analysis (FMEDA) is an extension of an FMECA that is tailured-made for a SIS. FMEDA as a method was developed by the company Exida Is in principle, very similar to an FMECA, and a FMECA-like table is used Focus is placed on (and columns in the table are allocated to) the classification of each failure mode into DU, DD, SU or SD Failiure rates can be estimated for each failure category with basis in the classification and the overall failure rate of the item Also proof test coverage may be considered The approach can supplement manufacturers calculations of failure rates, and specific measures like the safe failure fraction (SFF) and diagnostic coverage factor (DC) More information is available from the book Safety instrumented systems verification, by William M. Goble and Harry Cheddie. Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 28 / 29

Possible extensions Example Reference: Goble, W.M. and Brombacher, A. Using a failure modes, effects and diagnosis analysis (FMEDA) to mesure diagnostic coverage in programmable electronic systems. DOI:10.1016/S0951-8320(99)00031-9 (Journal of Reliability Engineering and System Safety) Mary Ann Lundteigen (RAMS Group) Reliability of Safety-Critical Systems (Version 1.1) 29 / 29