Resilience Engineering: The changing nature of safety Professor & ndustrial Safety Chair MNES ParisTech Sophia Antipolis, France Erik Hollnagel E-mail: erik.hollnagel@gmail.com Professor NTNU Trondheim, Norge
Normal accident theory (1984) n the whole, we have complex systems because we don t know how to produce the output through linear systems.
MT view: sharp-end, blunt-end Blunt-end failures happened earlier and somewhere else Morals, social norms Government Authority Company Management Workplace Activity Sharp-end failures happen here and now
Focus on operation (sharp end) Design Scope of work system (1984) rganisation (management) Downstream Maintenance Work has clear objectives and takes place in well-defined situations. Systems and technologies are loosely coupled and tractable. Upstream Technology, automation
Vertical and horizontal extensions Scope of work system (2008) rganisation (management) Downstream Design Maintenance Upstream Technology, automation A vertical extension to cover the entire system, from technology to organisation
Vertical and horizontal extensions Scope of work system (2008) rganisation (management) Downstream Design Maintenance ne horizontal extension, to cover the lifecycle, from design to maintenance Upstream Technology, automation A vertical extension to cover the entire system, from technology to organisation
Vertical and horizontal extensions Scope of work system (2008) rganisation (management) A second horizontal extension, to cover Downstream upstream and downstream processes Design Maintenance ne horizontal extension, to cover the lifecycle, from design to maintenance Upstream Technology, automation A vertical extension to cover the entire system, from technology to organisation
Tractable and intractable systems Tractable system (independent, clockwork) Description are simple with few details Principles of functioning known (white box) System does not change while being described Complicacy Comprehensibility Stability ntractable system (interdependent, teamwork) Elaborate descriptions with many details Principles of functioning unknown (black box) System changes before description is completed Fully specified Partly specified Underspecified
Performance variability is necessary Systems are so complex that work situations always are underspecified hence partly unpredictable Few if any tasks can successfully be carried out unless procedures and tools are adapted to the situation. Performance variability is both normal and necessary.? Many socio-technical systems are intractable. The conditions of work therefore never completely match what has been specified or prescribed. ndividuals, groups, and organisations normally adjust their performance to meet existing conditions, specifically actual resources and requirements. Because resources (time, manpower, information, etc.) always are finite, such adjustments will always be approximate rather than exact. Performance variability Success Failure
Efficiency-Thoroughness Trade-ff Thoroughness: Time to think Recognising situation. Choosing and planning. f thoroughness dominates, there may be too little time to carry out the actions. Neglect pending actions Miss new events Efficiency: Time to do mplementing plans. Executing actions. f efficiency dominates, actions may be badly prepared or wrong Miss pre-conditions Look for expected results Time & resources needed Time & resources available
The ETT principle The ETT principle describes the fact that people (and organisations) as part of their activities practically always must make a trade-off between the resources (time and effort) they spend on preparing an activity and the resources (time, effort and materials) they spend on doing it. ETTing favours thoroughness over efficiency if safety and quality are the dominant concerns, and efficiency over thoroughness if throughput and output are the dominant concerns. The ETT principle means that it is impossible to maximise efficiency and thoroughness at the same time. Neither can an activity expect to succeed, if there is not a minimum of either.
Failures or successes? When something goes wrong, e.g., 1 event out of 10.000 (10E-4), humans are assumed to be responsible in 80-90% of the cases. When something goes right, e.g., 9.999 events out of 10.000, are humans also responsible in 80-90% of the cases? Who or what are responsible for the remaining 10-20%? nvestigation of failures is accepted as important. Who or what are responsible for the remaining 10-20%? nvestigation of successes is rarely undertaken.
Why only look at what goes wrong? Safety = Reduced number of adverse events. Focus is on what goes wrong. Look for failures and malfunctions. Try to eliminate causes and improve barriers. Safety and core business compete for resources. Learning only uses a fraction of the data available 10-4 := 1 failure in 10.000 events 1-10 -4 := 9.999 nonfailures in 10.000 events Safety = Ability to succeed under varying conditions. Focus is on what goes right. Use that to understand normal performance, to do better and to be safer. Safety and core business help each other. Learning uses most of the data available
Risk profile (= lack of safety) Consequence Negligible Marginal Critical Very critical Low Low Low Moderate Moderate Low Low Moderate High High Moderate Moderate High High Extreme High High Extreme Extreme Extreme Rare Unlikely Possible Likely Certain Probability
Benefit profile (= safety) nnovative High High Very high Very high Very high Consequence Effective Acceptable Moderate Moderate Low Low High Moderate High High Very high High Negligible Low Low Low Moderate Moderate Rare Unlikely Possible Likely Certain Probability
Range of event outcomes utcome Positive Negative Neutral Serendipity Disasters Very low Good luck Accidents Normal outcomes (things that go right) ncidents Near misses Mishaps Very high Predictability
Frequency of event outcomes utcome Positive Serendipity Good luck Normal outcomes (things that go right) Negative Neutral Disasters Very low Accidents ncidents Near misses Mishaps Very high 10 6 4 10 10 2 Predictability
Being safe versus being unsafe utcome Positive Serendipity Safe Functioning (invisible) Normal outcomes (things that go right) Good luck Negative Neutral Disasters Very low Unsafe Functioning (visible) Functioning Accidents ncidents Near misses Mishaps Very high 10 6 4 10 10 2 Predictability
H i g h w o r k l o a d M e c h a n i c s H i g h w o r k l o a d E q u i p m e n t E x p e r t i s e E x p e r t i s e P G r o c e d u r e s P r o c e d u r e s r e a s e E x c e s s i v e e n d - p l a y L u b r i c a t i o n n t e r v a l a p p r o v a l s n t e r v a l a p p r o v a l s A l l o w a b l e e n d - p l a y n F A A i r c r a f t d e s i g n k n o w l e d g e A C o n t r o l l e d s t a b i l i z e r m o v e m e n t A i r c r a f t R e d u n d a n t d e s i g n L i m i t e d s t a b i l i z e r m o v e m e n t Non-linear accident model Assumption: Accidents result from unexpected combinations (resonance) of variability of normal performance. T C M a i n t e n a n c e o v e r s i g h t T C P R C e r t i f i c a t i o n T C P R T C E n d - p l a y c h e c k i n g P R T C T C J a c k s c r e w u p - d o w m o v e m e n t P R A ir c r a f t d e s i g n P R T C H o r i z o n t a l s t a b i l i z e r m o v e m e n t T C L i m i t i n g s t a b i l i z e r m o v e m e n t P R T C L u b r i c a t i o n T C P R A i r c r a f t p i t c h c o n t r o l P R J a c k s c r e w r e p l a c e m e n t P R P R Consequence: Accidents are prevented by monitoring and damping variability. Safety requires constant ability to anticipate future events. Hazardsrisks: Functional Resonance Accident Model Emerge from combinations of normal variability (socio-technical system), hence looking for ETT* and sacrificing decision * ETT = Efficiency-Thoroughness Trade-ff
High workload Mechanics High workload Equipment Expertise Expertise Procedures Procedures Lubrication Grease Excessive end-play nterval approvals nterval approvals Allowable end-play FAA Aircraft design knowledge Controlled stabilizer movement Aircraft Redundant design Limited stabilizer movement High workload Mechanics High workload Equipment Expertise Expertise Procedures Procedures Grease Excessive end-play Lubrication nterval approvals nterval approvals Allowable end-play FAA Aircraft design knowledge Controlled stabilizer movement Aircraft Redundant design Limited stabilizer movement Risks as non-linear combinations T C T C f accidents happen like this... End-play checking Maintenance oversight Aircraft design Certification Jackscrew up-down movement Horizontal stabilizer movement Aircraft pitch control Lubrication Limiting stabilizer movement Jackscrew replacement... then risks can be found like this... End-play checking Maintenance oversight Aircraft design Certification Jackscrew up-down movement Horizontal stabilizer movement Aircraft pitch control Lubrication Limiting stabilizer movement Jackscrew replacement P R P R Unexpected combinations (resonance) of variability of normal performance. Systems at risk are intractable rather than tractable. Unexpected combinations (resonance) of variability of normal performance. The established assumptions therefore have to be revised
Methods and reality Simple linear Focus on technology Low Hazard High Loose Coupling Tight Dams Marine transport Fault tree Post FMEA offices Chemical plants Railways Mining Power grids HAZP Emergency department Assembly lines Financial markets NPPs Nanotech Military LSCTS adventures Air traffic control Healthcare Manufacturing FMECA Universities ntegrated operations Space missions Public services R&D companies Low (tractable) Uncertainty High (intractable)
Methods and reality Simple linear Focus on technology Complex linear Focus on Human Factors Low Hazard High Loose Coupling Tight Dams Marine transport Railways Mining HPES Power grids Emergency department Chemical plants ATHEANA HAZP Fault Healthcare AEB tree TRACEr Manufacturing FMECA Universities Post FMEA HERA HEAT offices HCR THERP Assembly lines RCA CSN Financial markets NPPs Nanotech Military LSCTS adventures Air traffic control ntegrated operations Space missions Public services R&D companies Low (tractable) Uncertainty High (intractable)
Hazard and uncertainty Simple linear Focus on technology Complex linear Focus on Human Factors Complex linear Focus on organisations Non-linear Focus on normal functions Low Hazard High Loose Coupling Tight Dams Marine transport Power NPPs grids CREAM Emergency MT MERMS department AcciMap Chemical Swiss TRPD plants ATHEANA cheese Railways MRT Mining HPES STEP HAZP Fault Healthcare AEB tree TRACEr Manufacturing FMECA Universities Post FMEA HERA HEAT offices HCR THERP Assembly lines RCA CSN Low (tractable) Uncertainty Financial markets Nanotech RE FRAM STAMP Military LSCTS adventures Air traffic control ntegrated operations Space missions Public services R&D companies High (intractable)
From the negative to the positive Negative outcomes are caused by failures and malfunctions. All outcomes (positive and negative) are due to performance variability.. Safety = Reduced number of adverse events. Safety = Ability to respond when something fails. Safety = Ability to succeed under varying conditions. Eliminate failures and malfunctions as far as possible. mprove ability to respond to adverse events. mprove resilience.
What is safety? The ability to succeed under varying conditions (respond, monitor, anticipate, learn) The reduction of unnecessary harm to an acceptable minimum Systemic (process view) Safety should be As High As Reasonably Practicable (AHARP) Particularistic (product view) Risk should be As Low As Reasonably Practicable (ALARP)
Conclusions so far... Complex socio-technical systems can only function if performance is adjusted to conditions (ETT) Performance variability is the reason why things go right, but also the reason why things sometimes go wrong. We need to understand how things go right before we can understand how they go wrong. Resilience engineering is about how we can ensure that systems remain productive and safe in expected and unexpected conditions alike.