Achieving Maximum Reliability Growth in Newly Designed Systems

Harry W. Jones, Ph.D., NASA Ames Research Center

Key Words: new designs, reliability growth

SUMMARY & CONCLUSIONS

Newly designed systems often have unexpectedly high failure rates which can be reduced by successive redesign until the system achieves an acceptable failure rate. Reliability can grow at the maximum possible rate if the causes of all the failures that occur are identified and removed without introducing new failure modes. Typically, the expected failures are random, infrequent, and need not be corrected. But there are many causes of early and unexpected failures that do require correction. New systems may have incorrect specifications, design oversights, or unanticipated environmental or operational challenges. These occur during the “infant mortality” phase and correcting such unexpected problems as soon as they are observed can produce large early reliability growth.

The failure rate reduction that can be obtained depends on the number and the failure rates of the correctable failures. Under the strong assumption that the failure causes can be identified and removed, the decline in overall failure rate can be predicted. If a failure occurs at the rate of λ per unit time, the expected time before the failure occurs and can be corrected is 1/λ, the Mean Time Before Failure (MTBF). Finding and fixing this failure reduces the total system failure rate by λ. The failure rate is thus on average expected to decline by λ at time 1/λ. Reliability growth can be predicted as the expected decline in the current failure rate, and this depends directly on the failure rates of the remaining undetected failure modes. Insight into the reliability growth mechanism is gained by considering different numbers of failure modes with specific arrays of failure rates.

The reliability growth process includes the detection and repair of common cause failures, which can be due to easily correctable design mistakes.
All software failures or bugs are design errors, essentially common cause failures, because they affect all copies of the software and they are usually corrected. The more common reliability assumption is that most failures are random and independent, they occur at a low rate, they can be repaired using a small stock of spare parts, and they do not indicate a need for redesign.

U.S. Government work not protected by U.S. copyright

The general process of reliability improvement can be described with simple mathematics. Suppose there are N correctable high probability failure modes. Since they have high failure rates, they will all probably occur early in testing, say before time T. After testing starts, the number of failures will gradually increase until N have occurred at time T. The failure rate is then N(t)/t = N/T. If all of the N correctable failure modes have been removed, no more correctable failures will occur. As time goes on, the cumulative failure rate N(t)/t = N/t will decline as 1/t or t^-1. This is the best case of failure rate decline. The commonly used Duane-Crow reliability growth model represents this, with the main issue being the time exponent of failure rate decline.

A problem in using the Duane-Crow reliability growth model is that it does not include an acceptable residual failure rate due to random or uncorrectable failure modes. The assumption is that reliability growth will continue and the failure rate will decrease indefinitely. As the constant rate acceptable failures accumulate beyond the period of reliability growth, the reliability growth time exponent decreases toward zero. Using two classes of failures, correctable and not correctable or random, provides a better model for actual reliability growth data than reliability growth models that assume reliability growth continues indefinitely.

The approach taken here combines the continuous failure rate reduction model with the constant acceptable failure rate model, to form the abcd model. The failure rate is N(t)/t = a t^-b + c, where a t^-b describes continuous reliability growth and c is the constant acceptable random failure rate. The parameter d represents an additional constant failure rate due to correctable but uncorrected failure modes. After the reliability growth process is terminated, the failure rate is N(t)/t = c + d. The abcd model seems to be useful in describing and understanding reliability growth data. It helps to predict future reliability growth if the reliability improvement effort continues and the failure rates do not change.

The abcd reliability improvement model can be useful in planning and guiding the failure rate reduction process. The expected system failure rate due to component failure rates is usually computed during design.
An estimate or data on the actual initial failure rate can be used to estimate the required test time and level of effort needed to achieve the needed reliability growth. As the reliability improvement process is carried out, the current failure rate and an updated model could be used to track progress and adjust planning.

1 THE BATHTUB CURVE

The failure rate is the number of failures per unit time. A system’s changing failure rate over time often follows the “bathtub curve.” The failure rate first decreases with time, then remains constant during the system’s useful life, and finally increases due to component wear out. The initial high “infant mortality” failure rate is due to burn-in, to failure of defective components, to detection and correction of design faults, and to other improvements in design and operations. The failures during useful life are usually assumed to be random events caused by unpredictable internal degradation. A failure rate increase at end-of-life can be caused by mechanical wear or aging related to chemical or thermal activity. The bathtub curve is shown in Figure 1.

[Figure 1 plot: “Bathtub curve”; failure rate versus time units]
Figure 1. The bathtub curve

2 DUANE-CROW RELIABILITY GROWTH

Duane observed that if N(t) is the number of failures occurring until time t, the cumulative failure rate, N(t)/t, often declines as a fractional power of the cumulative test time, t. The cumulative failure rate is

N(t)/t = k t^-α    (1)

The reliability growth rate is α, the downward slope of N(t)/t versus t on a log-log plot. It usually varies from 0.2 to 0.6. [1] [2] Crow provided a theoretical basis for the Duane model by assuming that failures occur according to a non-homogeneous (time-varying) Poisson process with a power law mean value function, m(t). The mean number of failures over time is

m(t) = k t^β    (2)

where β is between zero and one. The expected cumulative failure rate is

N(t)/t = m(t)/t = k t^(β-1)    (3)

The Crow and Duane reliability growth models are equivalent, with the Duane α equal to Crow’s 1 - β. The parameter k is the same in both. The reliability growth parameters can be estimated from the failure time data. [2] [3]

2.1 Applying the Duane-Crow reliability growth model

Crow used a 56 failure data set to illustrate reliability growth. [4] A graphical Duane model fit to the data gives

Duane N(t)/t = 0.640 t^-0.283    (4)

Crow’s computational analysis of this data found

Crow N(t)/t = 0.217 t^-0.073    (5)

The data and models are shown in Figure 2.

[Figure 2 plot: “Cumulative failure rate, log-log”; N(t)/t versus time; series: Data, Duane, Crow]
Figure 2. Failure rate N(t)/t with Duane and Crow models

The downward slope of the Duane line shows reasonable reliability growth, but this is due to a high early failure rate, infant mortality. The Crow model is much less influenced by the early infant mortality data and gives a barely noticeable projection of future reliability growth.

2.2 Crow’s data does not show continuous reliability growth

Figure 3 shows the cumulative failure rate, N(t)/t, plotted versus time, t, but in a linear rather than log-log graph. A two part failure model is also shown, rather than the single model equation used by Duane and Crow.
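
The two fits quoted in Section 2.1 can be compared numerically. The following is a minimal sketch, assuming only the power law form of equations (1) and (3) and the coefficients of equations (4) and (5); the function and variable names are illustrative and do not come from the paper.

```python
# Illustrative sketch only: names are assumptions, not taken from the paper.
# The coefficients are the fits quoted in equations (4) and (5).


def cumulative_failure_rate(t, k, alpha):
    """Duane form of the cumulative failure rate: N(t)/t = k * t**(-alpha)."""
    return k * t ** (-alpha)


def crow_beta(alpha):
    """Crow's exponent for the same power law: beta = 1 - alpha."""
    return 1.0 - alpha


DUANE_FIT = {"k": 0.640, "alpha": 0.283}  # graphical Duane fit, equation (4)
CROW_FIT = {"k": 0.217, "alpha": 0.073}   # Crow's computed fit, equation (5)

# Tabulate both model curves at a few test times.
print(f"{'t':>6} {'Duane':>8} {'Crow':>8}")
for t in (1, 10, 100, 400):
    duane = cumulative_failure_rate(t, **DUANE_FIT)
    crow = cumulative_failure_rate(t, **CROW_FIT)
    print(f"{t:>6} {duane:>8.3f} {crow:>8.3f}")

# Under a power law, doubling the test time multiplies N(t)/t by 2**(-alpha).
for name, fit in (("Duane", DUANE_FIT), ("Crow", CROW_FIT)):
    factor = 2.0 ** (-fit["alpha"])
    print(f"{name}: beta = {crow_beta(fit['alpha']):.3f}, "
          f"factor per doubling of test time = {factor:.3f}")
```

Under these fits, each doubling of test time lowers the cumulative failure rate by roughly 18 percent for the Duane line but only about 5 percent for the Crow line, which is the numerical form of the "barely noticeable" growth projection noted above.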
[Figure 3 plot: “Cumulative failure rate and two part fit”; N(t)/t versus time, linear axes; series: N(t)/t data, t^-0.5 curve, flat line]
Figure 3. Cumulative failure rate

A Duane-Crow equation with N(t)/t = 1.11 t^-0.5 fits the data from time 0 to 100. The high initial failure rate declines and becomes constant. A flat line fits the data points from time 100 to 400.

The Duane-Crow model fits the failure rate data with a single line on a log-log plot. It is well known, and illustrated by the “bathtub curve,” that failure rates often decline strongly during an initial period of “infant mortality,” and then tend to be constant during the operational phase. Using a Duane-Crow log-log line fit to the early data can exaggerate the reliability growth potential. Adding more data from the later operational period gives a false picture of continuing reliability growth occurring at an ever decreasing rate. If the test and fix process is terminated and the system put into operation, the best predictor of the future failure rate would be the failure rate at the end of the reliability growth effort, assuming no “end-of-life” increase in failure rate occurs. Clearly the two phases, initial improvement and continued operation, have different failure rate behavior and require different modeling.

3 RELIABILITY GROWTH FOR DISCRETE FAILURES

Reliability growth models are developed based on assumed discrete correctable failure modes.

3.1 Reliability grows by fixing high rate failures

Reliability growth will occur if failure causes are removed by redesign or otherwise corrected. A mature, well tested system will have occasional failures, but these are usually considered random, unpredictable, and unpreventable. But an all new design may have an unacceptably high failure rate due to design oversights, specification errors, improper operation, etc. Curing any failure mode decreases the expected system failure rate by exactly the failure rate of the failure mode removed. Failure modes with high failure rates allow rapid reliability growth.

To predict reliability growth, it usually is necessary to consider the number and failure rates of correctable failure modes. In order to be fixed, a failure mode must cause a failure, and the probability of its occurrence over time depends on its failure rate. The higher the failure rate, λ, the sooner the failure occurs. The Mean Time Before Failure is MTBF = 1/λ. The failure modes with the highest failure rates will occur first. If these observed high rate failure modes are then cured, the failure rate declines rapidly and reliability growth occurs.

3.2 Reliability growth for a single correctable failure

Consider a new design with multiple failure modes, including independent random failures and correctable common cause failures (CCFs). Reliability growth depends on the failure rates of the correctable failures. Suppose that the combined failure rate of all the non-correctable failure modes is λnon. Suppose that the combined failure rate of all the correctable failure modes is λcor. The total initial failure rate is λtotal = λnon + λcor. The minimum final failure rate after all possible reliability growth is λmin = λnon.

How does reliability growth occur over time? The failure process is probabilistic. Suppose that there is only one correctable failure with failure rate λcor. The probability of this failure having occurred by time t is 1 - exp(-λcor t). The expectation of the failure occurring and being removed gradually increases from zero to one, so the expected failure rate decreases from λtotal = λnon + λcor to λmin = λnon. This is a process of exponential decay, λ(t) = λnon + λcor exp(-λcor t). Figure 4 shows the expected failure rate decline for a single correctable failure. Actually, the single correctable failure will occur at one point in time, and when it is corrected λ(t) drops from λtotal to λmin = λnon.

3.3 Reliability growth for ten correctable failures

If there are several correctable failure modes, each occurs according to its own failure rate and each has a contribution to the unrepaired but correctable failure rate that declines with time. The worst case, with the slowest reliability growth, occurs when all the failure rates are equal.

[Figure 4 plot: “Reliability growth and failure rate decline”; failure rate per unit time versus time; series: non-CCF failure rate; total failure rate, one correctable failure; total failure rate, ten correctable failures]
Figure 4. Reliability growth due to removal of correctable failures.

Figure 4 also shows the expected failure rate decline for ten correctable failures, each having the same failure rate. If the same total failure rate is due to ten rather than one failure mode, the rate of reliability growth is cut by a factor of ten. If there is only one correctable failure with a failure rate of 1 in 100 time units, it will probably occur and be fixed in 100 time units. If there are 10 correctable failures each with a failure rate of 1 in 1,000 time units, the overall correctable failure rate is still 1 in 100 time units. The first failure will probably occur and be fixed in 100 time units, but that will leave nine undetected and unfixed. It will require 1,000 time units before we can expect to find and fix all 10 correctable failures. The time for reliability growth is increased by a factor of 10.

4 A SIMPLE UPPER BOUND ON FAILURE RATE

If it is assumed that all correctable failures are removed when they first occur, a simple upper bound on the failure rate can be derived without knowing the actual correctable failure rates. [5]

The average expected failure rate for each correctable failure declines with time, λcor,i(t) = λcor,i exp(-λcor,i t). At any given time, there is some initial λcor,i that maximizes its current average expected failure rate. Taking the derivative of λcor,i(t) with respect to λcor,i and setting it to zero, it can be shown that the maximum value of λcor,i(t) occurs when t equals 1/λcor,i, the MTBF. A failure mode with a given failure rate is more likely to occur near the time equal to its MTBF. The maximum value of λcor,i(t) over all initial rates λcor,i is a bound on the current expected remaining failure rate. The bound, found by substituting t = 1/λcor,i, is λcor,i(t) ≤ 1/(e t). [5] Figure 5 plots the upper bound 1/(e t) and the individual expected failure rates
over time, λcor,i exp(-λcor,i t), for λcor,i = 0.1, 0.01, 0.001, 0.0001, and 0.00001.

[Figure 5 plot: “Failure rate bound independent of failure rate”; failure rate per unit time versus time, logarithmic axes; series: 1/(e t) and λcor,i = 0.1, 0.01, 0.001, 0.0001, 0.00001]
Figure 5. The expected failure rate is always lower than the bound 1/(e t)

For each individual failure mode, the expected failure rate declines exponentially with time. The current expected failure rate never exceeds the bound 1/(e t), regardless of the original failure rate. The assumptions are that the failure rates are constant and independent and that a failure is immediately corrected, without introducing a new failure mode. The bound directly decreases with increasing test and redesign time. The bound is surprising because it proves reliability growth must occur under the assumptions and because it allows the maximum future failure rate to be predicted from the current failure rate. The bound shows that reliability growth will occur if correctable failure modes are removed.

The failure rate bound of λcor,i(t) ≤ 1/(e t) is tight only near t = MTBF. The expected failure rate exactly equals the bound when t = 1/λcor,i, the MTBF. If there are N different failure modes, the bound on the total failure rate is N/(e t). If all the N failure modes have the same initial failure rate λcor,i, the bound is tight at the time equal to the MTBF. However, if the individual failure rates are very different, the total failure rate can be significantly less than the bound and the time to complete reliability growth stretches to the longest MTBF. As shown in Figure 5, only the few failure modes with MTBFs close to the current time contribute substantially to the current failure rate, λ(t).

5 THE ABCD HEROIC RELIABILITY GROWTH MODEL

The explanation of failure mode correction suggests a two phase model of reliability growth. It also implies the need for a heroic reliability growth effort and suggests metrics for heroic reliability growth.

5.1 The reliability growth graph

Typical expected reliability growth is shown in Figure 6.

[Figure 6 plot: “Reliability growth”; failure rate versus time units; series: 0.9 t^-0.5 + 0.1; 0.9 t^-0.5 + 0.1 to t = 50, then 0.23]
Figure 6. The expected reliability growth curve

The reliability growth curve in Figure 6 has two components, a correctable and therefore declining failure rate equal to 0.9 t^-0.5 and a constant random failure rate of 0.1. At time equal to 50 time units, the failure correction process is terminated. A correctable failure rate of 0.13 remains, producing a constant total failure rate of 0.23 after time 50. Continuing to remove the remaining failure modes would have produced the continually declining failure rate shown.

5.2 The abcd model

The mathematical model in Figure 6 is

Failure rate = 0.9 t^-0.5 + 0.1 from t = 0 to 50    (6a)
             = 0.9 × 50^-0.5 + 0.1 = 0.23 after t = 50    (6b)

A simple abcd mathematical model for reliability growth and failure rate decline is

Failure rate = a t^-b + c from t = 0 to td    (7a)
             = c + d after td, where d = a td^-b    (7b)

The heroic reliability growth effort continues and the failure rate declines until the time td when reliability growth stops and the failure rate becomes constant.

6 THE HEROIC RELIABILITY EFFORT, DURATION, AND GROWTH METRICS

Heroic reliability growth requires correctly diagnosing all the correctable failure modes and then removing the failures without introducing any new ones. If this is done, the failure rate will decline faster than the bound 1/(e t). For this bound, the
abcd reliability growth model a t^-b would equal (1/e) t^-1, so the worst case bound reliability growth exponent for a heroic effort is b = 1. If no effort is made to reduce the failure rate, the reliability growth time exponent is 0 and the failure rate is constant. It seems appropriate to define the reliability growth exponent “b” as the heroic reliability growth effort metric. The heroic reliability effort metric varies from 0 for no effort to 1 and even higher if the failure reduction problem is unusually easy. The bound 1/(e t) is an upper limit on the failure rate during the reliability growth effort. The failure rate can fall faster if, for instance, there are only a few failure modes with high failure rates. However, there are many more likely reasons why the failure rate decline exponent, which equals the heroic reliability effort metric, can be much less than 1, even approaching 0. If a failure is not detected or corrected the first time it occurs, failure rate reduction is slowed. A particular failure mode may never be removed. An attempted fix may introduce a new failure mode.

Measured reliability growth provides another metric. In the abcd reliability growth model, with a failure rate = a t^-b + c, the minimum final failure rate is “c,” the residual random failure rate. But all correctable failures are removed only if the failure rate reduction process continues without end. If the process is terminated at td, the final failure rate is c + d, where d = a td^-b. The failure rate reduction achieved is Max - (c + d), where Max is the highest initial failure rate, Max = a t1^-b + c, and t1 is the time of the first failure. The greatest possible failure rate reduction, achieved over a very long time, is Max - c. It seems appropriate to define the achieved failure rate reduction ratio (Max - c - d)/(Max - c) as the heroic reliability improvement metric. Computing this metric requires that the model distinguish the remaining correctable failure rate, “d,” from the uncorrectable continuous failure rate, “c.”

The abcd model assumes that the failure reduction effort stops at some time td. At this point in time, the remaining correctable failure rate is d. With the model’s continuing unending decline in the probability of failure, there is never a time when all correctable faults have been fixed. Suppose that the objective is to reduce the correctable failure rate to r, and that this requires time tr. From the reliability growth model, r = a tr^-b, and time tr = EXP[-LN(r/a)/b], that is, tr = (a/r)^(1/b). As a difficult reference target, the remaining correctable failure rate, r, is set equal to r = 0.1 c, ten percent of the originally expected random failure rate. The final failure rate at time tr would then be a tr^-b + c = 0.1 c + c = 1.1 c. If the reliability growth effort stops at td but should extend to tr, it seems appropriate to define the heroic reliability growth duration metric as td/tr. For r = 0.1 c, tr = EXP[-LN(0.1 c/a)/b]. A combined metric, the heroic reliability growth metric, can be created by multiplying the three reliability growth metrics for effort, improvement, and duration.

7 RELIABILITY GROWTH IN THE DUANE-CROW DATA

The “abcd” model is applied to the Duane-Crow reliability growth data set. Figure 3 showed a rough fit to the Duane-Crow data set, with failure rate equal to 1.11 t^-0.5 to time 100. The abcd model gives a better fit.

Failure rate = 0.97 t^-0.83 + 0.14 from t = 0 to td = 200    (8a)
             = 0.14 + 0.01 = 0.15 after td = 200    (8b)

Figure 7 shows the Duane-Crow failure rate data, the abcd model, and the N/(e t) bound.

[Figure 7 plot: “Duane-Crow failure rate data, model, and bound”; N(t)/t versus time; series: N(t)/t data; 0.97 t^-0.83 + 0.14 (fit 0 to 200); N/(e t)]
Figure 7. Duane-Crow data, model, and bound

The data and model include the constant failure rate c = 0.14. The abcd model parameters were computed using the N(t)/t data out to time t = 200 without the constant failure rate c, which was added back for Figure 7. The two-part model combines initial reliability growth with a constant background failure rate and fits the data closely. The upper bound on the failure rate, N/(e t), applies only to the correctable failures and so falls below the constant failure rate.

The bound is correct, as shown in Figure 8, where the blue dots are the N(t)/t data minus c = 0.14 and the red line is the abcd model minus c = 0.14. The broken orange line is the abcd model fit using data only out to time = 100. For time 0 to 100, the reliability effort metric, b, was higher, 0.99, than the 0.83 found for the full period from 0 to 200. The failure rate decline is rapid and both abcd models fit the data. The N/(e t) upper bound is significantly above the actual correctable failure rate data. Significant reliability growth was achieved, with a strong continued effort.

The model predicts that continuing the reliability growth effort at the same level from time 200 to 400 would have reduced d from 0.012 to 0.007, which would be a very small gain for doubling the reliability growth time. The Duane-Crow data set does show dramatic reliability growth until time 100. There is a small but observable reliability growth from time 100 until time 200, but from time 200 to 400, the failure rate decrease is less than one-tenth of the constant random failure rate.
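
The predictions quoted above follow directly from equations (7) and (8). The following is a minimal sketch, assuming the rounded parameters a = 0.97, b = 0.83, c = 0.14, and td = 200 from equation (8); the function names are illustrative and are not taken from the paper.

```python
import math

# Illustrative sketch only: function names are assumptions, not the paper's.
# Parameters are the rounded values quoted in equation (8).
A, B, C, T_D = 0.97, 0.83, 0.14, 200.0


def correctable_rate(t, a, b):
    """Remaining correctable failure rate a * t**(-b) while growth continues."""
    return a * t ** (-b)


def abcd_failure_rate(t, a, b, c, t_d):
    """abcd model: a*t**(-b) + c up to t_d, then the constant c + d (t > 0)."""
    return correctable_rate(min(t, t_d), a, b) + c


def time_to_reach(r, a, b):
    """Time t_r at which a * t_r**(-b) equals r: t_r = exp(-ln(r/a)/b)."""
    return math.exp(-math.log(r / a) / b)


d_200 = correctable_rate(200.0, A, B)   # d if the effort stops at t = 200
d_400 = correctable_rate(400.0, A, B)   # d if the effort continued to t = 400
t_r = time_to_reach(0.1 * C, A, B)      # time to reach the target r = 0.1 c

print(f"d at t = 200: {d_200:.3f}")
print(f"d at t = 400: {d_400:.3f}")
print(f"failure rate after t_d: {abcd_failure_rate(400.0, A, B, C, T_D):.2f}")
print(f"t_r for r = 0.1 c: {t_r:.1f}")
```

With these rounded inputs the sketch gives d of about 0.012 at time 200 and about 0.007 at time 400, and a constant failure rate of about 0.15 after td, in line with the text. The computed tr comes out near 165 time units, somewhat below the 175.5 listed in Table 1, presumably because the tabulated value was computed from unrounded fit parameters.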
[Figure 8 plot: “Duane-Crow failure rate data and models”; N(t)/t - 0.14 versus time; series: N(t)/t - 0.14 data; 0.97 t^-0.83 (fit 0 to 200); 1.37 t^-0.99 (fit 0 to 100)]
Figure 8. Data and model without constant failure rate

7.1 Duane-Crow abcd model parameters and metrics

The “abcd” parameters for the model fit from t = 0 to 200 are shown in Table 1.

Table 1. Duane-Crow abcd model parameters
a      b      c      d      td     tr
0.97   0.83   0.14   0.01   200    175.5

Table 2. Duane-Crow heroic reliability growth metrics
Metric       Reliability effort   Reliability improvement     Reliability duration   Reliability growth
Definition   b                    (Max - c - d) / (Max - c)   td/tr                  product of the three
Value        0.83                 0.99                        1.14                   0.94

The heroic metrics are shown in Table 2. The heroic metrics in Table 2 indicate good reliability growth, including good effort, improvement, duration, and overall reliability growth.

8 COMMENTS

Modeling failure rates helps understand and predict them. The best general model is probably the graphical bathtub curve, with infant mortality, constant failure operational phase, and final wear out. An ideal mathematical model of failure rates would include all three phases of the bathtub curve. The abcd model includes the first two phases.

The most common mathematical reliability model assumes all failure modes have constant failure rates. The system can be kept operating by using spares to replace failed components. But this model does not account for the many early failures due to design oversights.

The Duane-Crow model of reliability growth reflects the cure of infant mortality problems and predicts the decline of failure rates. If the failure causes are corrected, the cumulative failure rate necessarily declines. When the reliability growth model is the only one used, it extends out into the constant failure rate phase. Even after reliability growth has ceased, the model shows it continuing at an ever decreasing rate.

The abcd model was developed to include both early reliability growth and a later constant low failure rate. The abcd model provides a better fit to the usually expected failure rate data, but there are two problems in its application. First, determining the precise time of transition between reliability growth and a constant failure rate affects the model parameters and is somewhat subjective. It is assumed that all failures are corrected in the reliability growth phase and none in the constant failure rate phase, but the actual transition may be gradual.

Second, the key assumption that failure rates are constant will not be true if a system experiences the third wear out phase of the bathtub curve. Wear out could be modeled as an increasing failure rate near the expected lifetime. Poorly designed mechanical and chemical systems can wear out prematurely, leading to a redesign and a repeat of the bathtub curve.

REFERENCES

1. Duane, J. T., “Learning Curve Approach to Reliability Monitoring,” IEEE Transactions on Aerospace, Vol. 2, No. 2, 1964, pp. 563-566.
2. Yamada, S., and S. Osaki, “Reliability growth models for hardware and software systems based on nonhomogeneous Poisson processes: a survey,” Microelectronics Reliability, Vol. 23, No. 1, 1983, pp. 91-112.
3. Crow, L. H., Reliability Analysis for Complex, Repairable Systems, US AMSAA Technical Report No. 138, December 1975.
4. Crow, L. H., “An Extended Reliability Growth Model for Managing and Assessing Corrective Actions,” Proceedings of RAMS 2004 Symposium, pp. 73-80.
5. Bishop, P., and R. Bloomfield, “A Conservative Theory for Long-Term Reliability-Growth Prediction,” IEEE Transactions on Reliability, Vol. 45, No. 4, 1996.

BIOGRAPHY

Harry W. Jones, Ph.D., MBA
N239-8
NASA Ames Research Center
Moffett Field, CA 94035, USA
e-mail: [email protected]

Harry Jones is a NASA systems engineer working in life support. He previously worked on missiles, satellites, Apollo, digital video communications, the Search for Extra Terrestrial Intelligence (SETI), and the International Space Station (ISS)