3
REDUNDANCY, SPARES, AND
REPAIRS
Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design
Martin L. Shooman
Copyright
2002
John Wiley & Sons, Inc.
ISBNs:
0
-
471
-
29342
-
3
(Hardback);
0
-
471
-
22460
-X (Electronic)
83
3
.
1
INTRODUCTION
This chapter deals with a variety of techniques for improving system reliability
and availability. Underlying all these techniques is the basic concept of redun-
dancy, providing alternate paths to allow the system to continue operation even
ing does not increase cost, but in general, improved reliability requires
higher cost or increases in weight or volume. In most cases, however, the
gains in reliability and decreases in life-cycle costs justify the expendi-
tures.
2
. Parallel redundancy, where one or more extra components are operating
and waiting to take over in case of a failure of the primary system. In
the case of two computers and, say, two disk memories, synchronization
of the primary and the extra systems may be a bit complex.
3
. A standby system is like parallel redundancy; however, power is off in
the extra system so that it cannot fail while in standby. Sometimes the
sensing of primary system failure and switching over to the standby sys-
tem is complex.
4
. Often the use of replacement components or repairs in conjunction with
parallel or standby systems increases reliability by another substantial
factor. Essentially, once the primary system fails, it is a race to fix or
replace it before the extra system(s) fails. Since the repair rate is gener-
ally much higher than the failure rate, the repair almost always wins the
race, and reliability is greatly increased.
Because fault-tolerant systems generally have very low failure rates, it is
hard and expensive to obtain failure data from tests. Thus second-order factors,
such as common mode and dependent failures, may become more important
than they usually are.
The reader will need to use the concepts of probability in Appendix A,
Sections A
1
–A
6
Figure
3
.
1
A system model composed of k major subsystems, all of which are nec-
essary for system success.
3
.
2
APPORTIONMENT
One might conceive system design as an optimization problem in which one
has a budget of resources (dollars, pounds, cubic feet, watts, etc.), and the goal
is to achieve the highest reliability within the constraints of the available bud-
get. Such an approach is discussed in Chapter
7
; however, we need to use some
of the simple approaches to optimization as a structure for comparison of the
various methods discussed in this chapter. Also, in a truly large system, there
are too many possible combinations of approach; a top–down design philoso-
phy is therefore useful to decompose the problem into simpler subproblems.
The technique of apportionment serves well as a “divide and conquer” strategy
to break down a large problem.
Apportionment techniques generally assume that the highest level—the over-
all system—can be divided into
5
–
10
major subsystems, all of which must work
for the system to work. Thus we have a series structure as shown in Fig.
3
U
x
2
·· ·
U
x
k
)(
3
.
1
a)
and if we use the more common engineering notation, this equation becomes
R
s
P(x
1
x
2
·· ·x
k
)(
3
.
1
b)
If we assume that all the elements are independent, Eq. (
3
.
Α
Α
Α
i
1
c
i
(
3
.
3
)
86
REDUNDANCY, SPARES, AND REPAIRS
We assume that the system reliability given by Eq. (
3
.
2
) is below the sys-
tem specification or goal, and that the designer must improve the reliability
of the system. We further assume that the maximum allowable system cost,
c
0
, is generally sufficiently greater than c so that the system reliability can be
improved to meet its reliability goal, R
s
≥ R
0
; otherwise, the goal cannot be
∏
i
1
r
i
(r
1
)
k
(
3
.
4
)
and solving for r
1
yields
r
1
(R
0
)
1
/
k
(
3
ponent and system redundancy. In fact, one can prove that component redun-
dancy is superior to system redundancy in a wide variety of situations.
Consider the three systems shown in Fig.
3
.
2
. The reliability expression for
system (a) is
SYSTEM VERSUS COMPONENT REDUNDANCY
87
x
1
x
2
x
1
x
3
x
2
x
4
x
1
x
3
x
2
x
4
)
P(x
2
)
p. The
reliability expression for system (b) is given simply by
R
b
(p)
P(x
1
x
2
+ x
3
x
4
)(
3
.
7
a)
For independent identical units (IIU) with reliability of p,
R
b
(p)
+ x
4
)(
3
.
8
a)
Assuming IIU, we obtain
R
c
(p)
p
2
(
2
− p)
2
(
3
.
8
b)
To compare Eqs. (
3
.
8
b) and (
3
.
2
)
(
3
.
9
)
Algebraic manipulation yields
R
c
(p)
R
b
(p)
(
2
− p)
2
(
2
− p
2
)
4
−
4
p + p
2
.
10
)
88
REDUNDANCY, SPARES, AND REPAIRS
Because
0
< p <
1
, the term
2
− p
2
>
0
, and R
c
(p)
/
R
b
(p) ≥
1
; thus compo-
nent redundancy is superior to system redundancy for this structure. (Of course,
they are equal at the extremes when p
0
or p
Roberts [
1964
, p.
260
] proves by induction that this ratio is always greater
than
1
and that component redundancy is superior regardless of the number of
elements n.
The superiority of component redundancy over system redundancy also
holds true for nonidentical elements; an algebraic proof is given in Shooman
[
1990
, p.
282
].
A simpler proof of the foregoing principle can be formulated by consider-
ing the system tie-sets. Clearly, in Fig.
3
.
2
(b), the tie-sets are x
1
x
2
and x
3
x
4
,
3
(a) that can be viewed as a
simple series structure if the parallel combination of x
1
and x
2
is replaced by
an equivalent branch that we will call x
5
. Then x
5
, x
3
, and x
4
form a simple
chain structure, and component redundancy, as shown in Fig.
3
.
3
(b), is clearly
superior. Many complex configurations can be examined in a similar manner.
Unit and component redundancy are compared graphically in Fig.
3
.
4
.
Another interesting case in which one can compare component and unit
x
3
Component redundancy: (a) original system and (b) redundant system.
SYSTEM VERSUS COMPONENT REDUNDANCY
89
1
1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
0
0
0.2
0.2
0.4
0.4
0.6
0.6
0.8
elements
Rp= [1 – (1 – ) ]
mn
Rp= 1 – (1 – )
nm
Number of series elements ( )n
Number of series elements ( )n
(a)
(b)
Reliability ( )R
Reliability ( )R
Figure
3
.
4
Redundancy comparison: (a) component redundancy and (b) unit redun-
dancy. [Adapted from Figs.
7
.
10
and
7
.
11
, Reliability Engineering, ARINC Research
Corporation, used with permission, Prentice-Hall, Englewood Cliffs, NJ,
1964
.]
90
REDUNDANCY, SPARES, AND REPAIRS
C
o
m
p
o
n
e
n
t
r
e
d
u
n
d
a
n
c
y
C
o
m
p
o
n
e
n
t
r
e
u
n
c
n
d
a
S
i
n
g
l
e
2
:
4
e
s
y
s
tm
S
i
n
g
l
e
3
:
4
s
≤ r < n. The results for
2
-out-of-
4
and
3
-out-of-
4
systems are plotted in Fig.
3
.
5
. Again, component redundancy is superior.
The superiority of component over unit redundancy in an r-out-of-n system is
easily proven by considering the system tie-sets.
All the above analysis applies to two-state systems. Different results are
obtained for multistate models; see Shooman [
1990
, p.
286
].
SYSTEM VERSUS COMPONENT REDUNDANCY
91
(a) System redundancy
(one coupler)
(b) Component redundancy
(three couplers)
x
3
x
x
2
’
x
3
x
3
’
Figure
3
.
6
Comparison of system and component redundancy, including coupling.
In a practical case, implementing redundancy is a bit more complex than
indicated in the reliability graphs used in the preceding analyses. A simple
example illustrates the issues involved. We all know that public address sys-
tems consisting of microphones, connectors and cables, amplifiers, and speak-
ers are notoriously unreliable. Using our principle that component redundancy
is better, we should have two microphones that are connected to a switching
box, and we should have two connecting cables from the switching box to dual
inputs to amplifier
1
or
2
that can be selected from a front panel switch, and we
select one of two speakers, each with dual wires from each of the amplifiers.
We now have added the reliability of the switches in series with the parallel
components, which lowers the reliability a bit; however, the net result should
be a gain. Suppose we carry component redundancy to the extreme by trying
to parallel the resistors, capacitors, and transistors in the amplifier. In most
x
′
2
x
′
3
)P(x
c
)(
3
.
12
)
and if we have IIU and P(x
c
)
Kp(x
c
)
Kp,
R
a
(
2
p
3
− p
c
1
)P(x
c
2
)P(x
c
3
)(
3
.
14
)
92
REDUNDANCY, SPARES, AND REPAIRS
and if we have IIU and P(x
c
1
)
P(x
c
2
)
P(x
c
3
)
2
p
3
− p
6
)Kp
(
2
p − p
2
)
3
K
3
p
3
(
3
.
16
a)
Solving for K yields
K
2
(
2
p
3
, and the cou-
pling reliability Kp becomes
0
.
9772006509
. The easiest way to interpret this
result is to say that if the component failure probability
1
− p is
0
.
1
, then
component and system reliability are equal if the coupler failure probability is
0
.
0228
. In other words, if the coupler failure probability is less than
22
.
8
% of
the component failure probability, component redundancy is superior. Clearly,
coupler reliability will probably be significant in practical situations.
Most reliability models deal with two element states—good and bad; how-
ever, in some cases, there are more distinct states. The classical case is a diode,
which has three states: good, failed-open, and failed-shorted. There are also
analogous elements, such as leaking and blocked hydraulic lines. (One could
contemplate even more than three states; for example, in the case of a diode,
the two “hard”-failure states could be augmented by an “intermittent” short-
of e
−z
about Z
0
can be written as follows:
e
−Z
1
− Z +
Z
2
2
!
−
Z
3
3
!
+·· ·+
(−Z)
n
n!
+·· · (
3
.
17
)
We can also write the series in n terms and a remainder term [Thomas,
.
18
)
where
R
n
(Z)
(−
1
)
n +
1
∫
Z
0
(Z − y)
n
n!
e
−y
dy (
3
.
19
)
We can therefore approximate e
−Z
by n terms of the series and use R
n
!
−
2
Z
3
3
!
+·· ·+
2
(−Z)
n
n!
+ · · ·
+
−
1
+
2
Z −
(
2
Z)
2
2
!
+
(
2
.
20
)
Two- and three-term approximations to Eqs. (
3
.
17
) and (
3
.
20
) are compared
with the complete expressions in Fig.
3
.
7
(a) and (b). Note that the two-term
approximation is a “pessimistic” one, whereas the three-term expression is
slightly “optimistic”; inclusion of additional terms will give a sequence of alter-
nate upper and lower bounds. In Shooman [
1990
, p.
217
], it is shown that the
magnitude of the nth term is an upper bound on the error term, R
n
(Z), in an
n-term approximation.
If the system being modeled involves repair, generally a Markov model is
used, and oftentimes Laplace transforms are used to solve the Markov equa-
0.80
0.7
0.85
0.8
0.90
0.9
0.95
1.0
1.00
(a)
(b)
Reliability
Reliability
1 – Z
1 –
2
Z +
Z
2
1 – +ZZ
23
1 – Z
2
e
–Z
2e
–Z
– e
–2Z
Figure
P(x
1
+ x
2
x
3
)
e
−lt
+ e
−
2
lt
− e
−
3
lt
(
3
.
21
)
APPROXIMATE RELIABILITY FUNCTIONS
95
From Appendix B, we see that z(t) is given by the density function divided
by the reliability function, which can be written as the negative of the time
derivative of the reliability function divided by the reliability function.
z(t)
.
22
)
Expanding z(t) in a Taylor series,
z(t)
1
+ lt −
3
l
2
t
2
/
2
+·· · (
3
.
23
)
We can use such approximations to compare the equivalent hazard of various
systems.
3
.
4
.
3
Mean Time to Failure
In the last section, it was shown that reliabiilty calculations become very com-
plicated in a large system when there are many components and a diverse reli-
R(t)
exp
[
−
n
Α
Α
Α
i
1
Z
i
(t)
]
(
3
.
25
a)
and the MTTF is given by
MTTF
∫
∞
0
{
exp
[
20
].
Different equations hold for a parallel system. For two parallel elements,
the reliability expression is written as R(t)
e
−Z
1
(t)
+ e
−Z
2
(t)
− e
[− Z
1
(t)+Z
2
(t)]
.If
both system components have a constant-hazard rate, and we apply Eq. (
3
.
24
)
to each term in the reliability expression,
MTTF
1
l
l
n
−
1
l
1
+ l
2
+
1
l
1
+ l
3
+ · · · +
1
l
i
+ l
j
+
1
l
1
+ l
2
1
l
i
(
3
.
27
)
If the n units are identical—that is, l
1
l
2
· · ·
l
n
l—then Eq. (
3
.
27
)
becomes
MTTF
1
l
n
Α
Α
Α
i
1
1
i
(
3
.
28
a)
The preceding series is called the harmonic series; the summation form is
given in Jolley [
1961
, p.
26
, Eq. (
200
)] or Courant [
1951
, pp.
0
.
577
+ ln n +
1
2
n
−
1
12
n(n +
1
)
· · ·
]
(
3
.
28
b)
PARALLEL REDUNDANCY
97
x
3
x
n
x
1
x
c
R(t)
P(x
1
+ x
2
+ · · · + x
n
)
1
− P(x
1
x
2
· · · x
n
)(
3
.
29
)
In the case of constant-hazard components, P
f
P(x
i
)
1
30
)
In the case of linearly increasing hazard, the expression becomes
R(t)
1
−
[
n
∏
i −
1
(
1
− e
−K
i
t
2
/
2
)
]
(
3
.
31
)
We recall that in the example of Fig.
3
)
If we have IIU with constant-failure rates, then Eq. (
3
.
32
) becomes
98
REDUNDANCY, SPARES, AND REPAIRS
R(t)
[
1
− (
1
− e
−lt
)
n
]e
−l
c
t
(
3
.
33
a)
where l is the element failure rate and l
c
is the coupler failure rate. Assuming
− l
c
t. Substituting these approximations into Eq. (
3
.
33
a),
R(t) ≈ [
1
− (lt)
n
](
1
− l
c
t)(
3
.
33
b)
Neglecting the last term in Eq. (
3
.
33
b), we have
R(t) ≈
1
− l
c
t − (lt)
)
For the case of n
3
and a comparison at lt
0
.
1
, we see that l
c
/
l <
0
.
01
.
Thus the failure rate of the coupling device must be less than
1
/
100
that of the
element. In this example, if l
c
0
.
01
l, then the coupling system probability of
failure is equal to the parallel system probability of failure. This is a limiting
x
2
x
1
x
2
(a) (b) (c)
Figure
3
.
9
Three reliability structures: (a) single element, (b) two series elements,
and (c) two parallel elements.
PARALLEL REDUNDANCY
99
0
0
0.5
0.5
1.0
1.0
1.5
1.5
2.0
2.0
0
0
0.2
0.2
0.4
2
Single element e
–t
Single element e
–t
2
/2
Normalized time =tlt
Normalized time =t kt
√
Figure
3
.
10
Comparison of reliability functions: (a) constant-hazard elements and
(b) linearly increasing hazard elements.
3
.
5
.
2
Dependent and Common Mode Effects
There are two additional effects that must be discussed in analyzing a parallel
system: that of common mode (common cause) failures and that of depen-
dent failures. A common mode failure is one that affects all the elements in a
redundant system. The term was popularized when the first reliability and risk
analyses of nuclear reactors were performed in the
1970
s [McCormick,
1981
4
. Fortunately, there was a third redundant mode that depended on a com-
pletely different mechanism, the scribe marks, and visual alignment.
When parallel elements are purposely chosen to involve devices with
different failure mechanisms to avoid common mode failures, the term
diversity is used.
In terms of analysis, common mode failures behave much like failures of
a coupling mechanism that was studied previously. In fact, we can use Eq.
(
3
.
33
) to analyze the effect if we use l
c
to represent the sum of coupling and
common mode failure rates. (A fortuitous choice of subscript!)
Another effect to consider in parallel systems is the effect of dependent
failures. Suppose we wish to use two parallel satellite channels for reliable
communication, and the probability of each channel failure is
0
.
01
. For a single
channel, the reliability would be
0
.
99
; for two parallel channels, c
1
and c
1
c
2
)
1
− P(c
1
)P(c
2
|
c
1
)(
3
.
37
)
If the failures of both channels, c
1
and c
2
, are independent, Eq. (
3
.
37
) yields
R
1
−
0
.
01
×
0
.
25
0
.
9975
. Thus for a single channel, the probability of failure is
AN r-OUT-OF-n STRUCTURE
101
0
.
01
; with two independent parallel channels, it is
0
.
0001
, but for dependent
channels, it is
0
.
0025
. This means that dependency has reduced the expected
100
-fold reduction in failure probabilities to a reduction by only a factor of
used. Success of exactly r-out-of-n identical and independent items is given
by
B(r : n)
n
r
p
r
(
1
− p)
n − r
(
3
.
38
)
where r : n stands for r out of n, and the success of at least r-out-of-n items is
given by
P
s
n
Α
Α
Α
k
)
n − k
(
3
.
40
)
102
REDUNDANCY, SPARES, AND REPAIRS
Similarly, for linearly increasing or Weibull components, the reliability func-
tions are
R(t)
n
Α
Α
Α
k
r
n
k
e
−kKt
2
/
2
(
1
/
(m +
1
)
(
1
− e
−Kt
m +
1
/
(m +
1
)
)
n − k
(
3
.
41
b)
Clearly, Eqs. (
3
.
39
)–(
3
.
41
and n ≥
20
,
which represents the low-reliability region. If we are interested in the high-
reliability region, we switch to failure probabilities, requiring q
1
− p ≤
0
.
05
and n ≥
20
. Since we are assuming different components, we define average
probabilities of success and failure p and q as
p
1
n
n
Α
Α
Α
i
1
p
i
1
k
0
(nq)
k
e
−nq
k!
(
3
.
43
)
and for the low-reliability region, we compute the probability of r or more
successes as
R(t)
n
Α
Α
Α
k
r
(np)
k
e
−np
k!
(
of Eqs. (
3
.
39
)–(
3
.
41
) a simple procedure, and Eqs. (
3
.
43
) and (
3
.
44
) are useful
mainly as simplified analytical expressions that provide a check on computa-
tions. [Note that Eqs. (
3
.
43
) and (
3
.
44
) also hold true for IIU with p
p and
q
0
.
9995
)
20
0
.
990047
. Another option is to use two paral-
lel
20
-channel cables and associated electronics switch from cable A to cable
B whenever there is any failure in cable A. The reliability of such an ordinary
parallel system of two
20
-channel cables is given by R
2
/
20
2
(
0
.
990047
) −
(
0
.
21
q
0
+
21
p
20
q
(
0
.
9995
)
21
+
21
(
0
.
9995
)
20
(
0
.
0005
)
0
), we compute the approximation Eq. (
3
.
43
) for n
21
, r
20
.
R(t)
1
Α
Α
Α
k
0
(nq)
k
e
−nq
k!
(
1
+ nq)e
−nq
104
REDUNDANCY, SPARES, AND REPAIRS
TABLE
3
.
1
Comparison of Design for Fiber-Optic Cable
Example
Unreliability,
System Reliability, R
(
1
− R)
Single
20
-channel cable
0
.
990047 0
.
00995
Two
20
-channel cables
0
.
9999009 0
.
000099
in parallel
7
STANDBY SYSTEMS
3
.
7
.
1
Introduction
Suppose we consider two components, x
1
and x
′
1
, in parallel. For discussion
purposes, we can think of x
1
as the primary system and x
′
1
as the backup;
however, the systems are identical and could be interchanged. In an ordinary
parallel system, both x
1
and x
′
1
begin operation at time t
0
, and both can fail.
′
1
the standby
system. Sometimes an ordinary parallel system is called a “hot” standby, and
a standby system is called a “cold” standby. The time to system failure for
a standby system is given by t
t
1
+ t
2
. Clearly, t
1
+ t
2
> max(t
1
, t
2
), and a
standby system is superior to a parallel system. The “coupler” element in a
standby system is more complex than in a parallel system, requiring a more
detailed analysis.
One can take a number of different approaches to deriving the equations for
a standby system. One is to determine the probability distribution of t
t
1
+ t
2
1
x
1
x
2
x
1
, good; x
2
, failed.
s
2
x
1
x
2
x
1
, failed; x
2
, good.
s
3
x
1
describe the system. The probability that element x fails in time interval Dt
is given by the product of the failure rate l (failures per hour) and Dt. Similarly,
the probability of no failure in this interval is (
1
− lDt). We can summarize
this information by the probabilistic state model (probabilistic graph, Markov
model) shown in Fig.
3
.
11
.
The probability that the system makes a transition from state s
0
to state s
1
in
time Dt is given by l
1
Dt, and the transition probability for staying in state s
0
is
(
1
− l
1
Dt). Similar expressions are shown in the figure for staying in state s
1
or
making a transition to state s
2
0
(t)+ (
1
− l
2
Dt)P
s
1
(t)(
3
.
47
b)
P
s
2
(t + Dt)
l
2
DtP
s
1
(t)+ (
1
)P
s
2
(t)(
3
On-line failed and standby component good.
s
2
x
1
x
2
On-line and standby components failed.
106
REDUNDANCY, SPARES, AND REPAIRS
1 – l
1
Dt
l
1
Dt
l
2
Dt
sx=
102
x
1 – l
2
Dt
1
sx=
112
s
0
(t)
Dt
−l
1
P
s
0
(t)(
3
.
48
b)
Taking the limit of the left-hand side of Eq. (
3
.
48
b) as Dt
0
yields the time
derivative, and the equation becomes
dP
s
0
(t)
dt
+ l
1
t
+ l
1
Ae
−l
1
t
0
The value of A is determined from the initial condition. If we start with a good
system, P
s
0
(t
0
)
1
; thus A
1
and
P
s
0
e
−l
1
)
This equation has the solution
P
s
1
(t)
B
1
e
−l
1
t
+ B
2
e
−l
2
t
(
3
.
52
)
Substitution of Eq. (
3
.
52
) into Eq. (
3
B
1
l
1
l
2
− l
1
(
3
.
54
)
We can obtain the other constant by substituting the initial condition P
s
1
(t
0
)
0
, and solving for B
2
yields
B
2
−B
−l
2
t
](
3
.
56
)
Note that the system is successful if we are in state
0
or state
1
(state
2
is
a failure). Thus the reliability is given by
R(t)
P
s
0
(t)+P
s
1
(t)(
3
.
57
)
Equation (
+ lte
−lt
(
3
.
58
)
A few general comments are appropriate at this point.
1
. The solution given in Eq. (
3
.
58
) can be recognized as the first two terms
in the Poisson distribution, the probability of zero occurrences in time
t plus the probability of one occurrence in time t hours, where l is the
occurrence rate per hour. Since the “exposure time” for the standby com-
ponent does not start until the on-line element has failed, the occurrences
are a sequence in time that follows the Poisson distribution.
2
. The model in Fig.
3
.
11
could have been extended to the right to incorpo-
rate a very large number of components and states. The general solution
of such a model would have yielded the Poisson distribution.