Introducing reliability and availability requirements into TOC models


Introducing reliability and availability requirements into TOC models

Evin Stump, Senior Systems Engineer, Galorath Incorporated, estump@galorath.com, 951.676.7804
Wendy Lee, Systems Engineer / Cost Analyst, Galorath Incorporated, wlee@galorath.com, 310.414.3222 Ext 655

Abstract

Two important system requirements are reliability and availability. Reliability is the probability of no disabling failures over a certain span of time, while availability is the ratio of the time the system will actually fulfill its operational expectations to the total time it is expected to fulfill them. Both are often included in system specifications. An explicit reliability requirement may be stated somewhat as follows: The probability of no disabling system failures in 5,000 hours of operation shall be at least 99.9%. An availability requirement may be stated something like this: The system shall be available for operations at least 95% of the times when it is expected to be available.

These requirements are often set with highly imperfect knowledge of their effects on total ownership cost (TOC). The result, if the requirements are rigorously maintained, can be a huge effect on total ownership cost. The effects can be felt in several areas of cost, particularly design labor hours, prototype material cost, production labor hours, production material cost, and support and possibly operational costs after the system is fielded. This paper explores ways that parametric TOC models are often deficient in addressing reliability and maintainability costs, and suggests how they can be made to better deal with them.

Arrangement of this Paper

This paper has eight major sections and several subsections. These are the major sections and their concerns:

Why a Reliability Requirement? This section defines the necessary content of a reliability requirement and explains why such a requirement is often imposed.

Reliability Flowdown Here it is discussed that reliability requirements are typically imposed at the system level and must be flowed down to all elements of hardware and software, even to the component level. During any such flowdown, the required reliability gradually increases above the system level requirement.

How a Reliability Goal Is Achieved This section discusses the more important tools available to reliability engineers for meeting reliability goals.

Lifetime Phases of an Element Each hardware element typically passes through various phases of its life during which the environmental stresses, and hence the failure rates, vary. These phases and their consequences are discussed.

Determining MTBF of WBS Elements Mean time between failures (MTBF) is typically the number that most drives maintenance costs due to random failures. Its determination is discussed.

Non-Random Failures Most of the concerns of reliability engineers, and also most cost models, are with random failures, whose intensity is typically measured by MTBF. But substantial failure costs are in some cases attributable to non-random failures. These costs are sometimes greater than the costs of random failures but are often ignored in cost analysis.

Overhauls Many hardware items are required to be overhauled from time to time. The various conditions of overhauls are discussed.

What Is an Availability Requirement and Why Is it Imposed? The nature of an availability requirement and the reasons for its imposition are discussed.

Why a Reliability Requirement?

Not infrequently, a reliability requirement is stated in vague terms: The equipment shall operate reliably throughout its lifetime. Such a requirement is all but meaningless. It is equivalent to a requirement that software be user friendly. The problem is that reliability, like user friendliness, can have many meanings depending on both context and personal point of view.

A meaningful and universally understandable reliability requirement always requires two numbers, plus a clear definition of failure, and also an environmental context. The definition of failure is important because reliability is about the absence of failure. Generally, failure is defined in terms of loss of a desired functionality. Loss of the nameplate attached to an item of equipment would not normally be classified as a failure, but the failure of a single key on a 100 key keyboard to perform probably would be. Minor degradations of capability, such as a single sticky key, might or might not be classified as a failure depending on circumstances. Whatever the definition, it must be clearly expressed to be useful.

Two numbers are needed to define reliability concisely: 1) a timespan and 2) a probability. The timespan can be expressed in any convenient units of time, such as hours, days, or months. The probability, at least for high reliability equipment, is generally expressed in 9s, e.g., 0.9995, or equivalently, 99.95%. The probability number always holds true only for a specific operating environment, or sometimes a non-operating environment. For example, the reliability of a printed circuit board (PCB) might be defined in this way: For at least 50,000 operating hours, the PCB must have a probability of at least 0.9995 of no failures. Environmental limits are listed below. A failure is defined as any loss of signal output of longer than one second duration.

Why would a request for proposal contain such a requirement? From the customer point of view, it provides high confidence that the product will be functional when the customer needs it to be. However, both customers and providers should understand that product acquisition costs virtually always rise sharply when:

- the definition of reliability becomes more comprehensive and exacting
- the count of 9s in the reliability probability measure increases
- the required reliability timespan increases
- the operating (or for some systems the non-operating) environment becomes more severe and more damaging to the product.

Higher reliability is often traded off against higher acquisition costs. On the other hand, higher reliability generally decreases post-production support cost, and sometimes operating cost as well. Higher development and production costs may well be repaid many times over by lower operations and support (O&S) costs. Of course, from a time value of money standpoint the front-end acquisition costs have more weight, and often are more intensely scrutinized.

Users of parametric cost models should be aware of the critical dependency of acquisition cost on reliability. They need to understand how their development and production models, as well as their O&S models, deal with this issue, and in particular how reliability and operating environment information is captured by their model. In this paper, these issues are considered in more detail.

Reliability Flowdown

Most project efforts to develop either hardware or software describe the work to be done in a hierarchical work breakdown structure (WBS). When a reliability requirement is imposed on hardware, it most likely will be imposed from the top down, that is, from the highest rollup WBS element. However, in a system of systems situation, reliability requirements may be imposed at more than one rollup level. In any event, what is imposed at a high level must be flowed down to lower levels. Mathematically, the reliability at any rollup level is the product of the reliabilities of all lower level elements that directly feed that rollup.

This is because reliability is a probability, and it follows the rules of probabilities. The relevant rule here is that the joint probability of two or more independent events is the product of their separate probabilities.

For example, in Figure 1 the top level rollup has been assigned a reliability requirement of 0.9995 (99.95%) over some period of time and within a specified environment. The top level element is fed by three lower level rollups. The product of their reliabilities must be at least 0.9995. If the second level rollups are all assumed to have equal reliability, then their reliabilities must each be at least (0.9995)^(1/3), i.e., the cube root of 0.9995. This happens to be 0.99983, as shown in Figure 1. (Equal assignment of reliabilities to lower level children in the WBS hierarchy is probably the most common allocation method used by reliability engineers, but it is not the only one. Other methods are sometimes used, but the results seldom stray far from those of equal assignment.)

In Figure 1 we have assumed that below the second level rollups there are only leaf elements, that is, elements that have no children. Note that the first leaf element has the same reliability requirement as its parent, because it stands alone. However, the next two leaf elements report to the same parent, so their reliability allocations must be (0.99983)^(1/2), i.e., the square root of 0.99983. Finally, the last three leaf elements are assigned reliabilities of (0.99983)^(1/3) = 0.99994. Noteworthy is that as the number of sub-elements increases, the flowdown reliability allocations increase somewhat. In Figure 1 the biggest increase was from 0.9995 to 0.99994, an increase of only a few hundredths of one percent. But to a reliability engineer it is more difficult than it would at first appear: it adds almost another 9, and more 9s can be hard, and expensive, to come by.
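The equal-assignment arithmetic is simple enough to sketch in a few lines of Python. The helper function name allocate_equal is hypothetical; the numbers are those of Figure 1:

    # Minimal sketch of equal-allocation reliability flowdown (illustrative only).
    # Each parent requirement is split among its n children by taking the n-th
    # root, so that the product of the children's allocations equals the parent's.

    def allocate_equal(parent_reliability: float, n_children: int) -> float:
        """Reliability allocated to each of n children under equal assignment."""
        return parent_reliability ** (1.0 / n_children)

    top = 0.9995                                    # system-level requirement
    second_level = allocate_equal(top, 3)           # ~0.999833
    leaf_pair = allocate_equal(second_level, 2)     # ~0.999917 (two siblings)
    leaf_trio = allocate_equal(second_level, 3)     # ~0.999944 (three siblings)

    print(f"second level: {second_level:.6f}")
    print(f"leaf (pair):  {leaf_pair:.6f}")
    print(f"leaf (trio):  {leaf_trio:.6f}")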

This is not the end of the story. Each of the leaf elements typically contains a variety of components, often referred to as parts. In certain hardware items these parts may be quite numerous. For example, a printed circuit board (PCB) could contain, say, 100 parts, such as ASICs, resistors, capacitors, diodes, connectors, etc. Each of these parts has what is known to reliability engineers as a failure rate, typically measured in failures per hour, or alternately in failures per thousand or per billion hours. The failure rate of a part depends on certain factors such as its quality (the degree of care taken in its design and manufacture to assure suitability to purpose), its ruggedness, its complexity, and its operating environment. The operating environment of interest often includes the ambient temperature range, the shocks and vibrations to which an item might be subject, chemical or other attacks to which it might be subject, and possibly other factors. Electronic parts generally are more prone to higher failure rates when applied voltages are higher and also when ambient temperatures are higher.

Table 1 Failure Rate vs. Reliability Example

λ (failures/hr)   R (1,000 hours)   R (7,000 hours)   R (10,000 hours)
1.00E-05          99.00498%         93.23938%         90.48374%
1.00E-06          99.90005%         99.30244%         99.00498%
1.00E-07          99.99000%         99.93002%         99.90005%
1.00E-08          99.99900%         99.99300%         99.99000%

So how are reliability and failure rate connected? In a version of reliability theory commonly used by engineers, namely the exponential theory, the reliability of a single part is given by R(t) = exp(-λt), where t is the elapsed time in hours and λ is the failure rate in failures per hour. Some typical values of this function are tabulated in Table 1 for three periods of time: 1,000 hours, 7,000 hours, and 10,000 hours. (In military hardware, and indeed in most reliability critical commercial hardware, the failure rate of a part is often assumed to be constant over the useful life of the hardware. This is a very good approximation in most cases. Note that one calendar year is approximately 8,760 hours.) The failure rate (λ) ranges from 1E-05 to 1E-08 failures per hour, fairly typical of high reliability hardware.

In a WBS leaf element having multiple parts connected in series, the failure rate of the whole is the sum of the failure rates of the individual parts. (Connected in series means connected in such a way that a failure of a single part is a failure of the entire element. Later we will discuss parallel connections, in which the element does not fail unless all of its parallel connections fail.) If the parts all have a similar failure rate, which is fairly common, then as an approximation we can write:

λ_c ≈ N · λ_avg

In this equation λ_c is the composite failure rate of the leaf element, N is the parts count, and λ_avg is the average failure rate of the individual parts. The composite reliability of the element, R_c, is then given by:

R_c(t) = exp(-λ_c · t)

By reference to Figure 1, it can be seen that R_c(t) must be equal to or greater than the flowdown reliability requirement R_req. With the aid of a bit of algebraic manipulation (solving for λ_c), this implies:

λ_c ≤ -ln(R_req) / t

If in fact λ_c exceeds this value, there are things that can be done to still achieve the desired reliability, as will be discussed in the next section.
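Under the exponential model just described, the series roll-up and the allocation check can be sketched as follows; the parts count, average part failure rate, and allocated reliability are assumed illustrative values:

    import math

    # Sketch of the exponential-model bookkeeping above (illustrative numbers).
    t = 50_000          # required failure-free timespan, hours
    r_req = 0.9995      # reliability flowed down to this leaf element
    n_parts = 100       # series parts count on the board
    lam_avg = 1.0e-10   # assumed average part failure rate, failures/hour

    lam_c = n_parts * lam_avg           # composite failure rate (series parts)
    r_c = math.exp(-lam_c * t)          # composite reliability over t hours
    lam_max = -math.log(r_req) / t      # largest rate that still meets the allocation

    print(f"composite failure rate: {lam_c:.3e} /hr")
    print(f"composite reliability over {t} hr: {r_c:.6f}")
    print(f"allowed failure rate for R >= {r_req}: {lam_max:.3e} /hr")
    print("meets allocation" if lam_c <= lam_max else "needs redundancy or better parts")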

How a Reliability Goal Is Achieved

Engineers use several approaches to achieve a desired reliability goal. The most important of these are:

- Reduction of component count
- Increased component quality
- Redundancy of circuits or of elements (here an element is a WBS element; a circuit is a part of an element)
- Environmental Protection
- Test and Fix
- Cross strapping

Reduction of component count

Recall these two equations, previously cited:

R_c(t) = exp(-λ_c · t)
λ_c ≈ N · λ_avg

From the first equation it can be seen that the composite reliability depends on the composite failure rate, λ_c. From the second, it can be seen that the composite failure rate depends on N, the parts count. Therefore, if the same functionality can be achieved with fewer parts, there is a gain in reliability. For electronic elements, the use of integrated circuits of various types effectively reduces parts count, and this is one reason they are widely used. For mechanical elements, a reduction in parts count can often be achieved by combining two or more parts into one. This has the added benefit of reducing assembly costs. Also for mechanical elements, minimizing the number of moving parts is especially helpful, because moving parts, such as bearings, gears, valves, pistons, threads, springs, and levers, almost always have higher failure rates than non-moving parts such as structures.

Increased component quality

High component quality gets a lot of attention from reliability engineers. Achieving it is both difficult and costly, but it can result in big gains in reliability. Most major NASA projects, as well as many Air Force and other projects, involve spacecraft that are not repairable if they fail, and that are expected to operate without disabling failure for many years, sometimes as many as fifteen. It is in projects of this type that component quality matters most. Little wonder, then, that NASA pays great attention to parts quality. A number of other government agencies and commercial companies have the same concerns, so the pursuit of high quality components is widespread. Equally widespread is the concern about the cost of such components, and a variety of strategies have been and are being used to get first rate quality without incurring first rate costs. This has led to a rather confusing mix of parts classifications and parts testing requirements, with various organizations or authorities tending to have their own approach to the problem. The most likely classification systems encountered by an aerospace/defense cost analyst are NASA levels 1, 2, and 3, with level 1 being the highest quality, and the S, B, and C system used by several organizations, including the USAF.

S level parts are generally required for spacecraft and some aircraft applications, B level is often regarded as sufficient for other military applications, and C level may be adequate for commercial applications other than heavy duty industrial applications. Generally, the higher the level, the more costly the part, mainly because of requirements for expensive testing and documentation.

Always a consideration in selecting parts quality is the issue of reparability. If a system can be accessed for reasonably rapid repair, and if occasional failures that can be quickly repaired are acceptable, then lower quality parts may be selected in the interests of lower development and production costs. But of course, there will be an increase in maintenance costs.

Redundancy of circuits or elements

If, when using the highest quality parts and the minimum number of parts, a certain leaf element has a calculated reliability of 98.7%, but the flowdown reliability required is 99.99%, is the situation hopeless? Not necessarily. Redundancy may come to the rescue. Redundancy is one technique for achieving fault tolerant designs, that is, designs in which certain failures do not prevent continued operation. Redundancy is mostly needed only for electronic elements, because mechanical elements generally have both lower parts counts and higher reliabilities for individual parts. Many mechanical elements are essentially designed for infinite life, due to the safety factors employed.

For electronics, however, element redundancy seldom involves using (say) two parallel resistors instead of one, because if one of them fails open circuit, the circuit is still likely to fail because the circuit resistance will suddenly increase sharply. Most electronic circuits are tolerant of small to moderate variations in the electrical properties of a component, but they usually will not be tolerant of a large change. Of course, a short circuit failure is generally even more disastrous. Redundancy is most likely to take place at either the circuit level or at the element level.

At the circuit level an entire circuit, say perhaps an oscillator or an analog to digital converter, may be replicated on a single PCB. This can be done two ways, both of which increase reliability, but by different amounts. One way is to have both redundant circuits in operation continuously. Another circuit, called a voter, tests the output of both circuits and selects the output deemed best. If one redundant circuit fails, the voter will select the output of the one that has not failed. The other way is to have one of the circuits non-operational until the other one fails, at which time it is switched on. This also requires a voter, but a somewhat more sophisticated (and costly) one. It should be noted that introducing redundancy on a PCB increases its size, weight, and cost, not to mention, in most configurations, its power draw and its heat output. One's first impulse about including redundant elements on a PCB is that this increases parts count and thus reduces reliability, and it does, but the offsetting effects of redundancy generally more than make up for this loss.

When using redundancy at the element level, two (sometimes three) PCBs replace one. Again, a voter is needed, and again there is the option of both PCBs being in continuous operation, or only one in operation until it fails, at which point a different one is turned on.

Can redundancy be more than two deep? In principle it can be any number deep, but the reality is that it usually becomes unwieldy beyond three deep. Beyond three deep, engineering considerations of volume, weight, power draw, heat generation, and cost generally call a halt. (There is also the issue of diminishing returns. Double redundancy increases MTBF by a factor of 1.5. Triple redundancy increases it only by a factor of 1.833, and quadruple redundancy only by a factor of 2.08. Every added layer of redundancy has less and less effect, and there is a similar diminishing returns effect with respect to reliability. This is another reason why redundancy usually does not go beyond three layers.)

Is redundancy limited to simple parallel configurations? No, redundancy can take a number of different and often complex configurations. Note in Figure 2 that there are three paths to successful operation: A-B, D-E, and A-C-E. A total failure would require either:

- a failure of both A and D,
- a failure of both B and E, or
- a failure of A, C, and E.

While such a circuit can result in very high reliabilities over long periods of time, it is also costly, high in power consumption, difficult to design, and difficult to test. A reliability engineer would call for such a circuit only if the reliability requirements are extreme. However, such arrangements are not unusual in mechanical or electromechanical subsystems, and the various elements involved are likely not to be identical. For example, it is fairly common to see a power source such as a generator backed up by a battery, plus circuitry to change the battery's low voltage DC output into an AC output at a much higher voltage.

It should be kept in mind that a redundancy arrangement used to boost reliability may not have the same reliability over its useful life. Consider, for example, a simple redundancy arrangement comprising two identical circuits in parallel, both operating continuously. Suppose that the useful life expectancy is ten years (87,600 hours), and further suppose that the circuit is not repairable. Also assume that each of the circuits individually has a reliability of 0.999 over 87,600 hours. The math, shown below, predicts that with two equal parallel components both operating properly, the reliability is increased from three 9s to six 9s:

R_parallel = 1 - (1 - R_single)^2 = 1 - (1 - 0.999)^2 = 0.999999

But if during the lifetime one of the parallel circuits should fail, then for the remainder of the lifetime the reliability decreases to 0.999. This can have serious consequences in some situations.
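Both the simple parallel case and arrangements like Figure 2 can be checked numerically. The sketch below uses the 0.999 figure from the example above for the parallel pair, and assumed per-element reliabilities plus the three stated paths for the Figure 2 network; it brute-forces every combination of element states, which is practical only for small networks:

    import itertools

    # Two identical circuits in parallel, both operating (example above).
    r_single = 0.999
    r_parallel = 1 - (1 - r_single) ** 2          # 0.999999

    # Figure 2 arrangement: success if at least one of the listed paths is
    # fully working. Per-element reliabilities are assumed for illustration.
    rel = {"A": 0.999, "B": 0.999, "C": 0.999, "D": 0.999, "E": 0.999}
    paths = [("A", "B"), ("D", "E"), ("A", "C", "E")]

    def network_reliability(rel: dict, paths) -> float:
        names = list(rel)
        total = 0.0
        # Sum the probability of every up/down combination in which at least
        # one path is completely up.
        for states in itertools.product([True, False], repeat=len(names)):
            up = dict(zip(names, states))
            p = 1.0
            for name, ok in up.items():
                p *= rel[name] if ok else (1 - rel[name])
            if any(all(up[e] for e in path) for path in paths):
                total += p
        return total

    print(f"two in parallel:  {r_parallel:.6f}")
    print(f"Figure 2 network: {network_reliability(rel, paths):.9f}")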

While the major concerns of a reliability engineer are usually assessing reliability, and whether a reliability requirement has been met, the major concern of a cost analyst is usually mean time between failures (MTBF), because that primarily drives repair costs in repairable systems subject to random failures. In systems with no redundancy, MTBF_c = 1/λ_c, a very simple relationship. With the aid of a bit of algebra it can further be written that:

MTBF_c = -t / ln(R(t))

This equation directly relates MTBF_c to reliability when there is no redundancy. Thus the concerns of the reliability engineer can be directly related to the concerns of the cost analyst. Generally, long term costs of maintenance due to random failures are more or less inversely proportional to MTBF_c.

The picture changes significantly, however, when multiple redundancy is employed. For a redundancy of two circuits in parallel, MTBF_c is no longer equal to 1/λ_c. The derivation is not shown, but the new relationship is MTBF_c = 3/(2λ_c). (Some O&S models assume that MTBF is always the reciprocal of the failure rate. This can be grossly in error.) This reduces corrective maintenance costs by roughly the same factor, but it may increase preventive maintenance costs, because under some preventive maintenance policies it may be required to frequently check parallel redundant arrangements to see if one of the circuits or elements has failed. (Such a check can be part of preventive maintenance, but currently the trend is to design in a capability to overcome such a failure while still remaining operational.) As mentioned previously, a failure of one parallel redundant element may reduce the reliability considerably. The need for such preventive maintenance is prompted by requirements such as this, which are appearing more often:

The unit shall have a 95% chance of remaining fully operational after a second failure of a similar device.

(This is a direct quote from a recent NASA requirements document, with emphasis added by the authors of this paper. It is likely that the NASA author actually intended to say "at least a 95% chance.")
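These relationships are easy to sketch; the failure rate and mission time below are assumed values, not data from the paper:

    import math

    lam = 1.0e-5          # assumed element failure rate, failures/hour
    t = 7_000             # mission time of interest, hours

    mtbf_single = 1.0 / lam                  # no redundancy: 100,000 hr
    r_t = math.exp(-lam * t)                 # reliability over t hours
    mtbf_from_r = -t / math.log(r_t)         # recovers 1/lambda when there is no redundancy
    mtbf_two_parallel = 3.0 / (2.0 * lam)    # two identical units, both operating

    print(f"MTBF (single unit):      {mtbf_single:,.0f} hr")
    print(f"MTBF from R({t} hr):     {mtbf_from_r:,.0f} hr")
    print(f"MTBF (two in parallel):  {mtbf_two_parallel:,.0f} hr")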

Environmental Protection

Reliability of an element is specific to the extremes of environment under which the element is expected to operate. The harsher the environment, the lower will be the reliability. Electronics are especially sensitive to applied voltage and working temperature. They are also sensitive to nuclear radiation. Reliability can be increased by reducing the environmental impacts, often by providing some form of protection, such as barriers, cooling, isolation, or insulation. While environmental protections incur some costs, those costs are often far less than the costs of not providing them. Protection may also be useful in reducing non-random failures, such as those induced by corrosion. As an example, naval aircraft are sometimes washed down with precious fresh water to minimize salt corrosion.

Test and Fix

The basic idea in test and fix is to subject an element, almost always an electronic element (PCB), to intensive accelerated life testing until it fails. The testing is usually a combination of hot/cold cycling and random vibration (vibration across a wide spectrum of frequencies) at levels sufficiently high to severely stress but not destroy the tested element. The cause of the failure, commonly called the failure mode, is then thoroughly investigated, and a specific fix is designed and implemented. This process may be repeated until no further failure modes are found.

This process of reliability enhancement, often called Highly Accelerated Life Testing (HALT), tends to produce very high but somewhat unknowable reliability values. Unfortunately for accurate support cost analysis, it also creates fairly large uncertainties as to mean time between failures (MTBF), although it is generally the case that the MTBF will exceed the useful life of the product, given the rapid changes in electronic technology currently being experienced. At this writing, we believe that a reasonable approach when practices such as HALT are being used is to assume that they are equivalent to three deep parallel redundancies. This provides high values of reliability and long values of MTBF. The result may be less than one predicted failure in a lifetime, which from a cost standpoint may be negligible. While processes such as HALT are expensive, especially when applied to every PCB delivered, as they sometimes are, there is a large payback in terms of reliability, reduced corrective maintenance cost, and availability.

Cross strapping

Cross strapping is a method of providing multiple interconnections so that the normal effects of a failure are avoided by reliance on an alternate power or signal source in the event of a failure of the current source. Some cross strapping schemes are quite complex. Care must be taken in designing cross strapped systems to avoid inadvertent application of an incorrect voltage or signal. Cross strapped systems are commonly difficult and expensive to design and test. Analysis of the reliability and MTBF of cross strapped systems can be very complex and may require use of an iterative computer algorithm. It is not uncommon for the MTBF of such systems to far exceed the useful life of the system. The block diagram below (Figure 3) has multiple cross strapping and illustrates the high level of complexity that can result.

Figure 3 Cross Strapping Example

Design diversity

Design diversity means effecting redundancy by using redundant elements that are unlike. This approach can make a redundant element more robust with respect to certain damaging environments, or can increase the likelihood of continued operation in the event of unexpected stresses. It will, of course, likely be more expensive than using identical designs for the redundant elements.

Lifetime Phases of an Element

Most elements that are subject to random failures are also subject to various lifetime phases that can increase (or decrease) their propensity to fail. The projected useful life of an element is the first consideration in addressing the element's phases. The second consideration is how the useful life divides up into phases that may have significantly different failure rates. Various allocation schemes have been used to portray this. One that is fairly universal is the following:

Operational phase: the phase of useful life in which the element is fully operational and in active use. This generally, but not always, is the phase having the highest stresses and consequently the highest failure rates. For an aircraft, the operational phase most often coincides with flight. For a ship, the operational phase most often coincides with being under way. However, it should be noted that certain elements are not necessarily operating when their platform is.

Alert phase: the part of useful life in which the element is not operational, but can rapidly transition to an operational state. In the alert state the element is generally quiescent, and its failure rate is typically much lower than in the operational state (perhaps only about ten to twenty percent of it), but it is usually not negligible, as is often incorrectly assumed. Some systems, such as containerized missiles, spend most of their life in this state.

Out of service phase: the phase of useful life in which the element is unavailable for operational service. For example, most elements of an aircraft are out of service when it is undergoing tests for fatigue cracks in the fuselage. Most elements of a ship are out of service when it is in drydock. Elements returned to a depot for overhaul are out of service until they are returned to inventory. An element down for maintenance is also out of service. The out of service failure rate is usually the same as or similar to the alert failure rate, but there can be exceptions.

Note that an element in either the operational or the alert state is normally considered to be available for purposes of calculating its availability, while an element that is out of service is considered to be unavailable. Availability is discussed later in this paper.

Patterns of service can differ considerably from element type to element type, and can also differ considerably from operational environment to operational environment. For example, an element of a containerized missile will generally spend most of its useful life in the alert state. Its operational state coincides with the missile being expended, and there can be no maintenance cost in that phase. There may be an out of service phase if the missile is periodically recycled to a depot for inspection and/or overhaul. An element of a spacecraft typically spends its life in the operational state, but some elements may be turned off for significant periods of time, so there could be a significant alert state for them. These considerations are of great interest to a spacecraft reliability engineer, but not to a cost analyst, because spacecraft elements are not accessible for maintenance (except possibly for certain commands that can be issued from a ground station to reconfigure a partially failed spacecraft).

For elements of earthbound platforms such as aircraft, ships, tanks, trucks, ground stations, etc., the life phases usually coincide closely with the life phases of the platform. There are exceptions, such as electronic items that often remain unused. For example, some ships still carry a LORAN navigation system, which usually is not turned on unless the GPS and inertial navigation systems both fail. While an aircraft carrier's air search radar generally operates continuously when the ship is at sea, that is not true for submarines, especially not for nuclear submarines, which spend most of their time at sea under the surface, where air search radar cannot be used. Active sonar is another example of a system whose use pattern often does not coincide with the operational pattern of the ship that carries it. It is generally turned on only when there is a tactical need for it. On the other hand, passive sonar is likely to be constantly in use in some classes of ships when they are under way.

The key takeaway with regard to lifetime phases is that understanding them is vitally important to the estimation of both operation and support costs.
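A minimal sketch of phase-aware failure bookkeeping is shown below; the phase hours and failure rates are assumptions chosen for illustration, not values from the paper:

    # Expected random failures per year when the failure rate differs by phase.
    # Hours and rates below are illustrative assumptions only.
    year_hours = 8_760
    phases = {
        # name             (hours per year, failure rate per hour)
        "operational":    (1_200, 1.0e-5),
        "alert":          (6_560, 1.5e-6),   # quiescent, ~15% of the operational rate
        "out of service": (1_000, 1.5e-6),
    }

    expected_failures = sum(hours * lam for hours, lam in phases.values())
    print(f"expected random failures per year: {expected_failures:.4f}")

    # Treating every hour as an operational hour would overstate the figure:
    print(f"naive (all hours at operational rate): {year_hours * 1.0e-5:.4f}")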

Determining MTBF of WBS Elements

From all that has been said up to this point, it should be clear that to get accurate estimates of the cost of corrective maintenance of random failures, one needs reasonably good MTBF values. Where do those come from?

We should first recognize that reliability is the probability of an event, the event being the non-failure of an item over a stated period of time and under specified environmental conditions. To arrive at such probabilities, we must compile statistical data on the failure-free performance of items of interest in the time domain. We do this by observing items under specified conditions of operation and environment, and measuring their time to failure. For simple components, a fairly large sample is usually selected for testing. The times to failure will vary randomly, but it is possible to compute a mean value, the MTBF. From the same data it is also possible to infer failure rates.

A complication arises in that for most hardware there are three types of failure: early (often called infant mortality), random (with an essentially constant failure rate), and wearout (with a failure rate that increases with time). This gives rise to the well-known bathtub curve of reliability engineering, an example of which is shown in Figure 4. Normally for critical equipment, early electronic failures are weeded out by a process called burn-in, and equipment is retired or overhauled before wearout begins. (If equipment remains in use after wearout begins, the failure rate increases, and even accelerates, and corrective maintenance becomes much more costly. An adequate reliability theory exists for this case, but we do not address it in this paper.) Therefore most reliability analysis concerns itself only with random failures based on a constant failure rate.

A huge majority of projects do not test all of their components for MTBF from scratch. That would be prohibitively expensive. Fortunately, reliability engineers have available libraries listing typical failure rates for most commonly used components. Testing is reserved for unique situations where reliability is critical and for which no reliable failure data exist.
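Under the constant failure rate assumption, the MTBF estimate from a complete life test is simply the mean of the observed times to failure. A minimal sketch, using made-up test data:

    # Estimating MTBF and failure rate from a life test, assuming the constant
    # failure-rate (exponential) model. The times to failure are made-up data.
    times_to_failure_hr = [8_200, 15_400, 23_900, 31_700, 44_100, 52_600]

    mtbf = sum(times_to_failure_hr) / len(times_to_failure_hr)
    lam = 1.0 / mtbf                    # corresponding failure rate estimate

    print(f"estimated MTBF: {mtbf:,.0f} hr")
    print(f"estimated failure rate: {lam:.2e} /hr")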

Sometimes it is necessary to use components in environments that differ from those in which they are most commonly used. To that end, derating factors are compiled. A derating is a correction factor applied to failure rates when components are used in extreme environments. Deratings are probably most commonly applied to temperature sensitive electronic parts that must be used in hot or high voltage environments, but there are many other applications as well, for example, derating of ball bearings for higher than normal loads.

From the failure rates of simple components, failure rates and MTBF values for various assemblies of components can be derived. Recall that it matters how the components are assembled. There are major differences in MTBF and reliability between assemblies that contain redundancies in the form of parallel arrangements or cross strapping and assemblies that do not. Some O&S models do not recognize this.

One frequent issue regarding MTBF is that values predicted from laboratory tests do not match well with results reported from field operations. This is an extremely difficult issue because of the many gaps, lack of timeliness, errors, and lack of specificity in field failure reports, versus the close controls usually used in laboratory testing. The reliability engineers who screen and tabulate data from field reports must sometimes perform heroic feats of data interpretation. Still, imperfect data is usually better than no data.

Non-random Failures

Many failures of hardware are classified as random because their actual time of occurrence is not predictable. But there is a class of failures for which the time to failure may not be perfectly predictable, but can be estimated reasonably well. O&S models often do not recognize this important class of failures. An example is aircraft tires. While occasionally they do fail randomly, more often they are replaced after a certain number of landings, because after that number of landings they are deemed to be unsafe. Cargo extraction parachutes are another example. They often get deployment friction burns and lengthy exposure to sunlight, both of which weaken them. After a certain number of deployments they are typically removed from service.

A non-random failure is sometimes an end-of-life event, but this is not quite the same thing as the useful life estimate, because useful life in calendar terms can be much prolonged simply by using the item less often, thus exposing it less often to a harmful environment. Often the amount of exposure to a harmful environment deemed to result in failure is a policy matter, and is determined by testing or from experience. Since that exposure is likely to vary considerably across the lifetime phases listed above, it should be estimated separately for each phase and the results combined.

One particular non-random failure mode, corrosion, is often omitted from O&S analyses. But across all DoD equipment it accounts for an average of about a fourth of all corrective maintenance costs. For a few systems, such as the C-130 aircraft, it accounts for almost half of all maintenance costs.
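A sketch of how an O&S model might fold in both ideas, a derating factor on a handbook failure rate and a usage-driven replacement schedule for a non-random failure mode such as tires, is shown below; every number is an assumption chosen for illustration:

    # Two adjustments an O&S model can make (illustrative numbers only).

    # 1) Derating: scale a handbook failure rate for a harsher environment.
    base_rate = 2.0e-7          # handbook failure rate, failures/hour
    derating_factor = 4.0       # assumed penalty for high temperature/voltage
    derated_rate = base_rate * derating_factor

    # 2) Non-random failures: scheduled replacement driven by usage, not time.
    landings_per_year = 600
    landings_per_tire = 150     # replacement limit set by policy (assumed)
    tire_cost = 2_500           # dollars per replacement (assumed)
    tires_per_year = landings_per_year / landings_per_tire
    annual_tire_cost = tires_per_year * tire_cost

    print(f"derated failure rate: {derated_rate:.1e} /hr")
    print(f"tire replacements per year: {tires_per_year:.1f}  (${annual_tire_cost:,.0f})")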

Overhauls

High value assets subject to random and/or non-random failures are sometimes sent to depots for extensive repairs and renewals intended to make them like new. The motivation is that the intrinsic value of the item makes the overhaul expense worthwhile. O&S models sometimes do not give adequate treatment to overhauls. Overhauls may be fairly complex to deal with from a modeling standpoint. For example, the frequency of overhauls, usually a policy matter, can be driven by:

- cumulative operating hours
- cumulative alert hours
- cumulative out of service hours
- combinations of the above
- cumulative calendar hours regardless of operational state
- occurrence of a random failure
- random selection from a population

Sometimes overhauls are not done on an entire population. Instead, they are based on a sampling plan. This is most likely to be done for systems in which detection of some failures is possible only in a well equipped depot. The costs of overhauls usually are generated by:

- inspections to determine equipment status
- repairs, which may include some remanufacturing
- testing to verify condition before return to service

Overhauls of electronic items may not be advisable, because any repair or remanufacture of them often reduces their reliability.

What Is an Availability Requirement and Why Is it Imposed?

Certain systems are operationally critical in the sense that if they fail to operate there can be substantial losses. In industry, these losses are typically of time and money. But for military equipment, the loss could be a battle or even a war. Recall the famous military proverb:

For want of a nail the shoe was lost.
For want of a shoe the horse was lost.
For want of a horse the rider was lost.
For want of a rider the battle was lost.
For want of a battle the kingdom was lost.
And all for want of a horseshoe nail.

This proverb, the sources of which have been traced back as far as the year 1390, was until recent years applied primarily to failures of logistic supply. But in recent years it has also been applied to equipment design and maintainability planning. For example, would it not have been nice to have horseshoes that have no need of nails? Or perhaps even horses that have no need of shoes? In that sense its message can be summarized in a single word: AVAILABILITY.

The general sense of availability can be grasped easily enough: Will I ever need it? Will it be there if I need it? But in today's risk conscious world, we need both a clear definition and a way to express it quantitatively. Here is one definition that is easily understood: availability is the proportion of need time that a system (or subsystem or equipment) is in an adequately functioning condition. Obviously, as reliability increases so does availability, all else being equal. In terms of the representative partitioning of useful life previously described in this paper, equipment generally is desired to be available in both its operational and alert states, but not in the out of service state.

What can make equipment unavailable when it is expected to be available? Ignoring the possibility of a secondary failure, such as a general power outage that forces the equipment to the off condition, the answer is a failure of the equipment itself. If by some miracle a failure could always be repaired instantaneously, availability would always be 100%. But repairs take time, the average of which is commonly called mean time to repair, or sometimes mean time to recover. Either way, the acronym is MTTR. Probably the most commonly cited mathematical representation of availability is:

A = E[Uptime] / (E[Uptime] + E[Downtime])

Here, A = availability, which like reliability is commonly expressed in nines; E[Uptime] = expected uptime during system life; and E[Downtime] = expected downtime during system life.

Example: An equipment item has an MTBF of 100,000 hours and an MTTR of 1 hour. Its availability is 100,000/100,001 = 0.99999. The unavailability is 1 - 0.99999 = 0.00001. Defining a calendar year as 8,760 hours and assuming that the equipment is in continuous operation, this unavailability translates to 0.0876 hours (5.256 minutes) of downtime per calendar year.

Obviously important to availability are reliability, MTBF, accessibility of the equipment to maintenance staff, skill level of maintenance staff, adequate diagnostic equipment or built-in test capability, and readiness of spares. All of these and perhaps other factors should be considered when assigning an MTTR value to equipment in an O&S model.
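The worked example above can be reproduced in a few lines; the only inputs are the MTBF and MTTR values already given:

    # Availability from MTBF and MTTR, reproducing the worked example above.
    mtbf_hr = 100_000
    mttr_hr = 1
    year_hr = 8_760

    availability = mtbf_hr / (mtbf_hr + mttr_hr)            # 0.99999
    unavailability = 1 - availability
    downtime_min_per_year = unavailability * year_hr * 60   # ~5.26 minutes

    print(f"availability: {availability:.5f}")
    print(f"expected downtime: {downtime_min_per_year:.2f} minutes per calendar year")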

Availability flowdown

As previously described in this paper, reliability flows down to the leaf elements of a WBS from a higher level reliability requirement in a mathematically simple way. This flowdown emulates what a reliability engineer has presumably done, namely meeting the reliability requirement in the most economical way. Availability, like reliability, is a probability, and the flowdown math is the same. Can an availability flowdown be used to check the availability compliance of a leaf element? Yes, it can. Assuming that an MTBF number and an MTTR number are both available for a leaf element, as they should be, the availability calculation shown above can be used to calculate the element availability. That calculation can then be compared to the flowdown value; it should be at least as large.

Summary

The goal of this paper is to review some of the ways that TOC models are often deficient in addressing reliability and availability requirements as cost drivers, and to suggest how to approach fixing these deficiencies. Specifically, the following have been noted:

- Reliability and availability are strong drivers of project cost in aerospace and defense, and models sometimes ignore this
- Reliability requirements directly influence the MTBF values of hardware items, and MTBF assignments often do not recognize this direct relationship
- Reliability is a strong function of the operating (and sometimes also the non-operating) environment, and this must be accounted for but often is not
- Models do not always adequately recognize the beneficial effects of various types of redundancy and other reliability enhancements
- It is sometimes not recognized that a hardware element may experience more than one life cycle phase, with failure rates varying significantly between phases
- For many systems, non-random failure costs are major costs, and are often ignored
- Overhaul costs often are not carefully analyzed

Further Reading

The equations and most other information presented in this paper are readily available from many sources. Two very good ones are:

Bazovsky, Reliability Theory and Practice, Dover, 2004.
Pecht (editor), Product Reliability, Maintainability, and Supportability Handbook, CRC Press, 1995.