4
N-MODULAR REDUNDANCY
Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design
Martin L. Shooman
Copyright
2002
John Wiley & Sons, Inc.
ISBNs:
0
-
471
-
29342
-
3
(Hardback);
0
-
471
-
22460
-X (Electronic)
145
4
.
1
INTRODUCTION
In the previous chapter, parallel and standby systems were discussed as means
of introducing redundancy and ways to improve system reliability. After the
concepts were introduced, we saw that one of the complicating design fea-
tures was that of the coupler in a parallel system and that of the decision unit
led to the development of this concept, now called N-modular redundancy.
This chapter and Chapter
3
are linked in many ways. For example, the tech-
nique of voting reliability joins the parallel and standby system reliability of
the previous chapter as the three most common techniques for fault tolerance.
(Also, the analytical techniques involving binomial probabilities and Markov
models are used in both chapters.) Thus many of the analyses in this chapter
that are aimed at comparing the three techniques constitute a continuation of
the analyses that were begun in the previous chapter.
The reader not familiar with the binomial distribution discussed in Sections
A
5
.
3
and B
2
.
4
or the concepts of Markov modeling in Sections A
8
and B
7
should read the material in these appendix sections first. Also, the introductory
material on digital logic in Appendix C is used in this chapter for discussing
voter circuitry.
4
.
2
THE HISTORY OF N-MODULAR REDUNDANCY
and was put into operation by the Bureau of the Census in March
1951
. This
machine was designed to operate
24
hours per day,
7
days per week (
168
hours), except for approximately
32
hours of regularly scheduled preventa-
tive maintenance per week. Thus the availability would be
136
/
168
(
81
%) if
there were no failures. In the
7
-month period from June to December
1951
, the
computer experienced about
22
hours of nonscheduled engineering time (repair
time due to failures), which reduced availability to
114
/
lished in the work of Moore and Shannon [
1956
]), who developed the basic
idea of majority voting into a sophisticated scheme with many NAND elements
in parallel. Each input to the NAND element is supplied by a bundle of N iden-
tical inputs, and the
2
N inputs are cross-coupled so that each NAND element
has one input from each bundle. One of von Neuman’s elements was called
a restoring organ, since erroneous data that entered at the input was com-
pared with the correct input data, producing the correct output and restoring
the data.
4
.
3
TRIPLE MODULAR REDUNDANCY
4
.
3
.
1
Introduction
The basic modular redundancy circuit is triple modular redundancy (often
called TMR). The system shown in Fig.
4
.
1
consists of three parallel digi-
tal circuits—A, B, and C—all with the same input. The outputs of the three
circuits are compared by the voter, which sides with the majority and gives
105
–
107
] and [
1992
,
pp.
22
;
32
;
35
;
37
;
357
;
804
].
148
N-MODULAR REDUNDANCY
Digital circuit
Digital circuit
Digital circuit
A
C
B
Voter
System
inputs
1
)
If all the digital circuits are independent and identical with probability of suc-
cess p, then this equation can be rewritten as follows in terms of the binomial
theorem.
R
B(
3
:
3
)+B(
2
:
3
)
3
3
p
3
(
1
− p)
0
+
3
of the correct input may not be valid. (It is, however, a worst-case type of
result and should yield a lower bound, i.e., a pessimistic answer.)
4
.
3
.
3
System Error Rate
The probability model derived in the previous secton enabled us to compute
the system reliability, that is, the probability of no failures. In many prob-
lems, this is the primary measure of interest; however, there are also a number
of applications in which another approach is important. In a digital commu-
nications system, for example, we are interested not only in the probability
that the system makes no errors but also in the error rate. In other words, we
TRIPLE MODULAR REDUNDANCY
149
assume that errors from temporary equipment malfunction or noise are not
catastrophic if they occur only rarely, and we wish to compute the probability
of such occurrence. Similarly, in digital computer processing of non-safety-
critical data, we could occasionally tolerate an error without shutting down
the operation for repair. A third, less clear-cut example is that of an inertial
guidance computer for a rocket. At every computation cycle, the computer gen-
erates a course change and directs the missile control system accordingly. An
error in one computation will direct the missile off course. If the error is large,
the time between computations moderately long, the missile control system
and dynamics quick to respond, and the flight near its end, the target may be
missed, from which a catastrophic failure occurs. If these factors are reversed,
however, a small error will temporarily steer the missile off course, much as
a wind gust does. As long as the error has cleared in one or two computa-
tion cycles, the missile will rapidly return to its proper course. A model for
B
0
+ A
0
C
0
+ B
0
C
0
)(
4
.
3
a)
Equation (
4
.
3
a) states that the probability of correctly indicating a one output is
given by unity minus the probability of two or more “zero failures.” Similarly,
the probability of correctly indicating zero output is given by Eq. (
4
.
3
b):
P
0
1
3
b). If we let
P(A)
P(B)
P(C )
p (
4
.
4
a)
P(A
1
)
P(B
1
)
P(C
1
)
q
1
(
4
.
4
.
4
a–c) in Eq. (
4
.
3
a) yields the following equations:
P
1
1
− P(A
0
B
0
) − P(A
0
C
0
) − P(B
0
C
0
) +
2
P(A
0
B
0
0
1
− P(A
1
B
1
) − P(A
1
C
1
) − P(B
1
C
1
) +
2
P(A
1
B
1
C
1
)(
4
.
6
a)
1
2
(
4
.
7
a)
−
1
2
(
3
q
2
0
+
3
q
2
1
−
2
q
3
0
−
2
q
3
1
(
1
− p)
/
2
. Substitution in Eq. (
4
.
7
b) yields
P
1
2
+
3
4
p −
1
4
p
3
(
4
.
8
)
The two probabilities, Eq. (
4
.
750
,
the system reliability becomes
0
.
844
; however, the probability that any one
message is successfully transmitted is
0
.
957
. To put the result another way,
if
1
,
000
such digital circuits were operated for
1
year, on average
156
would
not be operating properly at that time. However, the mistakes made by these
machines would amount to
43
mistakes per
1
,
000
on the average. Thus, for
1 0.75 0.50 0.25 0
Element reliability, p
Probability of success
A
n
y
o
n
e
t
r
a
n
s
m
i
s
s
i
o
n
c
o
r
r
e
c
t
A
l
t
yo
f
a
s
i
n
g
l
e
e
l
e
m
e
n
t
Figure
4
.
2
Comparison of probability of successful transmission with the reliability.
tem functions properly if there are no system failures or one system failure.
The reliability expression was previously derived in terms of the probability
of element success, p, as
R
3
p
2
3
l t
(
4
.
10
)
We can compute the MTTF for this system by integrating the reliability func-
tion, which yields
MTTF
3
2
l
−
2
3
l
5
6
l
(
4
.
11
)
Toy calls this a TMR
3
–
will change, and we must add the binomial probability of
1
:
3
to the equation,
that is, B(
1
:
3
)
3
p(
1
− p)
2
, yielding
R
3
p
2
−
2
p
3
+
3
p(
1
−
2
l t
+
3
e
−l t
(
4
.
12
b)
and an MTTF calculation yields
MTTF
1
3
l
−
3
2
l
+
3
l
11
6
l
(
−l t
is
R
s
Х
1
− l t (
4
.
14
)
For a TMR
3
–
2
system, the truncated expansion of the reliability function, Eq.
(
4
.
9
), is
R
TMR
(
3
–
2
)
e
2
/
2
)]
Х
1
−
3
(l t)
2
(
4
.
15
)
For a TMR
3
–
2
–
1
system, the truncated expansion of the reliability function,
Eq. (
4
.
12
b), is
R
TMR
(
/
2
− (
3
l t)
3
/
6
]
−
3
[
1
−
2
l t + (
2
l t)
2
/
2
− (
2
l t)
3
/
6
]
+
3
), and (
4
.
16
) are plotted in Fig.
4
.
3
showing the
superiority of the TMR systems in the high-reliability region. Note that the
TMR(
3
–
2
) system reliability decreases to about the same value as a single
N-MODULAR REDUNDANCY
153
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35
Normalized time, l t
.
2
, whereas TMR (
3
–
2
–
1
) is of greater benefit and provides a
considerably higher reliability for l t <
0
.
5
.
For further comparisons of MTTF and reliability for N-modular systems,
refer to the problems at the end of the chapter.
4
.
4
N-MODULAR REDUNDANCY
4
.
4
.
1
Introduction
The preceding section introduced TMR as a majority voting scheme for
improving the reliability of digital systems and components. Of course, this is
the most common implementation of majority logic because of the increased
cost of replicating systems. However, with the reduction in cost of digital sys-
2
and
4
.
7
.
4
.
In addition, if we contemplate using N-modular redundancy for a digital
system composed of the three subsystems A, B, and C, the question arises:
Do we use N-modular redundancy on three systems (A
1
B
1
C
1
, A
2
B
2
C
2
, and
A
3
B
3
C
3
) with one voter, or do we apply voting on a lower level, with one
2
System Voting
A general treatment of N-modular redundancy was developed in the
1960
s
[Knox-Seith,
1953
; Pierce,
1961
]. If one considers a system of
2
n +
1
voters
(note that this is an odd number), parallel digital elements, and a single perfect
voter, the reliability expression is given by
R
2
n +
1
Α
Α
Α
i
n +
1
B(i :
2
4
.
17
)
The preceding expression is plotted in Fig.
4
.
4
for the case of one, three,
five, and nine elements, assuming p
e
−l t
. Note that as n ∞, the MTTF of
the system
0
.
69
/
l. The limiting behavior of Eq. (
4
.
17
) as n ∞ is dis-
cussed in Shooman [
1990
, p.
302
]; the reliability function approaches the three
straight lines shown in Fig.
4
.
5
. Since this configuration is
composed of just the m-independent series groups of the same configuration
N-MODULAR REDUNDANCY
155
0
0.5
1.0
0 0.5 0.69 1.0 1.5 t
Rt()
n t ∞
n = 0
n = 0
n = 1
n = 1
n = 2
n = 2
n = 4
n = 4
e
–t
Figure
4
.
4
Reliability of a majority voter containing
2
n +
i
p
i
ss
(
1
− p
ss
)
2
n +
1
− i
]
m
(
4
.
18
)
where p
ss
is the subsystem reliability.
The subsystem reliability p
ss
is, of course, not equal to a fixed value of p; it
instead decays in time. In fact, if we assume that all subsystems are identical
and have constant-hazard and -failure rates, and if the system failure rate if
l, the subsystem failure rate would be l
MTTF ≈
0
.
7
m
/
l. This is a direct consequence of the limiting behavior of Eq.
(
4
.
17
), as was discussed previously.
To use Eq. (
4
.
18
) in design, one chooses values of n and m that yield a
value of R, which meets the design goals. If there is a choice of values (n,
m) that yield the desired reliability, one would choose the pair that represents
the lowest cost system. The subject of optimizing voter systems is discussed
further in Chapter
7
.
156
N-MODULAR REDUNDANCY
11 1
22 2
2+1n 2+1n 2+1n
2+1
inputs
Limitations on Voter Reliability
One of the main reasons for using a voter to implement redundancy in a digital
circuit or system is the ease with which a comparison is made of the digital
signals. In this section, we consider an imperfect voter and compute the effect
that voter failure will have on the system reliability. (The reader should com-
pare the following analysis with the analogous effect of coupler reliability in
the discussion of parallel redundancy in Section
3
.
5
.)
In the analysis presented so far in this chapter, we have assumed that the
voter itself cannot fail. This is, of course, untrue; in fact, intuition tells us that
if the voter is poor, its unreliability will wipe out the gains of the redundancy
scheme. Returning to the example of Fig.
4
.
1
, the digital circuit reliability will
be called p
c
, and the voter reliability will be called p
v
. The system reliability
formerly given by Eq. (
4
.
2
) must be modified to yield
R
)
To achieve an overall gain, the voting scheme with the imperfect voter must
be better than a single element, and
R > p
c
or
R
p
c
>
1
(
4
.
20
)
Obviously, this requires that
IMPERFECT VOTERS
157
0
0
0
0.2
0.2
0.2
0.4
0.4
0.4
0.6
0.6
7
7
7
Rt()
Rt()
Rt()
n t∞
n t ∞
n t ∞
n = 0
n = 0
n = 0
n = 4
n = 2
n = 1
n = 1
n = 1
n = 4
n = 4
n = 4
n = 0
m = 1
m = 4
m = 16
l t
l t
l t
Figure
4
.
.
7
Plot of function p
c
(
3
−
2
p
c
) versus p
c
.
R
p
c
p
v
p
c
(
3
−
2
p
c
) >
1
(
.
7
. One can
obtain information on the values of p
v
that allow improvement over a single cir-
cuit by studying this equation. To begin with, we know that since p
v
is a proba-
bility,
0
< p
v
<
1
. Furthermore, a study of Fig.
4
.
3
(lower curve) and Fig.
4
.
4
(note that e
−
0
.
69
0
2
p
2
c
. Differentiating with respect
to p
c
and equating to zero yields p
c
3
/
4
, which agrees with Fig.
4
.
7
. Substitut-
ing this value of p
c
into [p
v
p
c
(
3
−
2
p
c
.
1
. This table provides lower bounds on voter reliability that are useful during
design; however, most voters have a much higher reliability. The main objective
is to make p
v
close enough to unity by using reliable components, by derating,
and by exercising conservative design so that the voter reliability has only a neg-
ligible effect on the value of R given in Eq. (
4
.
19
).
4
.
5
.
2
Use of Redundant Voters
In some cases, it is not possible to devise individual voters that have a high
enough reliability to meet the requirements of an ultrareliable system. Since the
voter reliability multiplies the N-modular redundancy reliability, as illustrated
in Eq. (
4
.
19
), the system reliability can never exceed that of the voter. If voting
IMPERFECT VOTERS
159
TABLE
4
.
18
) is multiplied by p
m
v
, which can
significantly lower the reliability of the N-modular redundancy scheme. In such
cases, one should consider the possibility of using redundant voters.
The standard TMR configuration including redundant voters is shown in Fig.
4
.
8
. Note that Fig.
4
.
8
depicts a system composed of n subsystems with a triple
of subsystems A, B, and C and a triple of voters V, V
′
, V
′′
. Also, in the last stage
of voting, only a single voter can be employed. One interesting property of the
circuit in Fig.
4
.
8
is that errors do not propagate more than one stage. If we assume
that subsystems A
′′
1
fails and yields an erroneous output of zero.
Circuits A
2
and B
2
will have the correct inputs and outputs, and C
2
will have an
incorrect output since it has an incorrect input. However, the next stage of voters
will have two correct inputs from A
2
and B
2
, and these will outvote the erroneous
output from V
′′
1
; thus, voters V
2
, V
′
2
, and V
′′
2
will all have the correct output. One
can say that single circuit errors do not propagate at all and that single voter errors
only propagate for one stage.
i
, and V
′′
i
are all on the same chip.
A
1
A
2
A
n
B
1
B
2
B
n
C
1
C
2
C
n
V
1
V
2
V
n–
V
. All voters V
i
, V
′
i
, and V
′′
i
are independent circuits or independent inte-
grated circuit chips, and circuits A
i
, B
i
, and C
i
are all on the same chip.
4
. All circuits A
i
, B
i
, and C
i
are all on the same chip, and voters V
i
, V
′
i
,
and V
will also prevail for the remainder of the text, there are limitations. This section
will briefly discuss a few situations that limit the accuracy of analytical models.
The following situations can be viewed as effects that are difficult to model
analytically, that lead to pessimistic results from analytical models, and that
represent cases in which the methods of Appendix D would be warranted.
1
. Some of the failures in digital (and analog) systems are transient in nature
[compare the rationale behind adaptive voting; see Eq. (
4
.
63
)]. A trans-
ient failure only occurs over a brief period of time or following certain
triggering events. Thus the equipment may or may not be operating at
any point in time. The analysis associated with the upper curve in Fig.
4
.
2
took such effects into account.
2
. Sometimes, the resulting output of a TMR circuit is correct even if there
are two failures. Suppose that all three circuits compute one bit, that unit
two is good, unit one has failed s-a-
1
, and that unit three has failed s-a-
0
. If the correct output should be a one, then the good unit produces a
one output that votes along with the failed unit one, producing a correct
voter output. Similarly, if zero were the correct output, unit three would
vote with the good unit, producing a correct voter output.
x
1
x
2
x
3
f
v
(x
1
x
2
x
3
)
0000
Two
0010
or
0100
three
1000
zeroes
1101
Two
1011
or
0111
three
1111
A direct approach to designing a majority voter is to include a term for
all the minterms in Table
4
.
2
, that is, the last four rows corresponding to an
output of one. The logic circuit would require three three-input AND gates, a
three-input OR gate, and three inverters (NOT gates) for each bit.
f
v
(x
1
x
2
x
3
)
x
1
x
2
x
3
+ x
1
x
2
x
3
Minterm Simplification for Table
4
.
3
00 01 11 10
0 0010
1 0111
x
1
x
23
x
The minterm simplification for the TMR voter is shown in Table
4
.
4
and
yields the logic function given in Eq. (
4
.
23
). The result of the simplification
yields a voter logic function, as follows:
f
v
(x
1
x
2
x
Digital circuit
A
Digital circuit
A
Digital circuit
B
Digital circuit
B
Digital circuit
C
Digital circuit
C
System
inputs
(0,1)
System
inputs
(0,1)
System
output
(0,1)
System
output
(0,1)
x
1
x
1
x
1
NAND gates are used. The voter in Fig.
4
.
9
(b) can be seen as equivalent to
that in Fig.
4
.
9
(a) if one examines the output and applies DeMorgan’s theorem:
f
v
(x
1
x
2
x
3
)
(x
1
x
2
)
.
(x
1
x
3
Voting and Error Detection
There are many reasons why it is important to know which circuit has failed
when N-modular redundancy is employed, such as the following:
1
. If a panel with light-emitting diodes (LEDs) indicates circuit failures, the
operator has a warning about which circuits are operative and can initiate
replacement or repair of the failed circuit. This eliminates much of the
need for off-line testing.
2
. The operator can take the failure information into account in making a
decision.
3
. The operator can automatically lock out a failed circuit.
4
. If spare circuits are available, they can be powered up and switched in
to replace a failed component.
If one compares the voter inputs the first time that a circuit disagrees with
the majority, a failed warning can be initiated along with any automatic action.
We can illustrate this by deriving the logic circuits that would be obtained
for a TMR system. If we let f
v
(x
1
x
2
x
3
) represent the voter output as before
and f
e
5
holds.
A simple logic realization of these
4
outputs using NAND gates is shown in
TABLE
4
.
5
Truth Table for a TMR Voter Including Error-Detection
Outputs
Inputs
Outputs
x
1
x
2
x
3
f
v
f
e
1
f
e
2
f
e
3
1
x
2
x
3
x
1
x
2
x
3
Circuit badA
Circuit badB
Circuit badC
Figure
4
.
10
Circuit that realizes the four switching functions given in Table
4
.
5
for
a TMR majority voter and error detector.
Fig.
4
.
10
. The reader should realize that this circuit, with
13
076
or about
50
,
000
gates.
The implication of this computation is clear: One should not employ voters
to improve the reliability of small circuits because the voter reliability may
wipe out most of the intended improvement. Clearly, it would also be wise
to consult an experienced logic circuit designer to see if the
512
-gate circuit
just discussed could be simplified by using other technology, semicustom gate
circuits, available microelectronic chips, and so forth.
The circuit given in Fig.
4
.
10
could also be used to solve the chip test prob-
lem mentioned in Section
4
.
4
.
1
. If the entire circuit of Fig.
4
.
10
were on a
and availability can be found in Siewiorek [
1992
, p.
335
] and in Toy [
1987
].
4
.
7
.
2
Reliability Computations
One might expect that it would be most efficient to seek a general solution
for the reliability and availability of a system with N-modular redundancy and
repair, then specify that N
3
for a TMR system, N
5
for
5
-level voting, and
so on. A moment’s thought, however, suggests quite a different approach. The
conventional solution for the reliability and availability of a system with repair
involves making a Markov model and solving it much as was done in Chapter
3
. In the process, the Laplace transform was computed, and a partial fraction
expansion was used to find the individual exponential terms in the solution. For
–
481
]. Ludovico
Ferrari, a pupil of Cardan, developed the formula for the fourth-order equation.
166
N-MODULAR REDUNDANCY
Neils Henrik Abel developed a proof that no closed-form solution exists for
n ≥
5
[Iyanaga, p.
1
].
The conclusion from the foregoing information on polynomial roots is that
we should start with TMR and other simpler systems if we wish to use alge-
braic solutions. Numerical solutions are always possible for higher-order equa-
tions, and the mathematical software discussed in Appendix D expedites such
an approach; however, the insight of an analytical solution is generally lacking.
Another approach is to use simplifications and approximations such as those
discussed in Appendix B (Sections B
8
.
2
and B
8
.
3
). We will use the tried and
true three-step engineering approach:
1
. Represent the main features of the system by a low-order model that is
same failure rate, l, and that the voter does not fail.
If we compare Fig.
4
.
11
with the model given in Fig.
3
.
14
of Chapter
3
,
we see that they are essentially the same, only with different parameter values
(transition rates). There are three states in both models: repair occurs from
state s
1
to s
0
, and state s
2
is an absorbing state. (Actually, a complete model
for Fig.
4
.
11
would have a fourth state, s
3
, which is reached by an additional
failure from state s
2
2
x
3
+ x
1
x
2
x
3
and s
3
x
1
x
2
x
3
by reformulating
the model as a more complex four-state model. However, the four-state model
is not needed to solve for the upstate probabilities P
s
0
and P
s
1
. Thus the simpler
three-state model of Fig.
4
.
Dt
t
t
t
t
m
+ m)1
s
0
s
1
s
2
Figure
4
.
11
A Markov reliability model for a TMR system with repair.
In the TMR model of Fig.
4
.
11
, there are three ways to experience a single
failure from s
0
to s
1
and two ways for failures to move the system state from
s
1
l (three ways to go
from state s
1
to state s
2
); l
2
l (two ways to go from state s
2
to state s
3
);
and m
′
m (single repairman in both cases). Substituting these values in Eqs.
(
3
.
65
) yields
P
s
0
(s)
s +
2
l + m
.
25
b)
P
s
2
(s)
6
l
s[s
2
+(
5
l + m)s +
6
l
2
]
(
4
.
25
c)
Note that as a check, we sum Eqs. (
4
.
25
a–c) and obtain the value
1
.
26
a)
The denominator polynomial factors into (s +
2
l) and (s +
3
l), and partial
fraction expansion yields
168
N-MODULAR REDUNDANCY
R
TMR
(s)
3
l + m
l
s +
2
l
−
2
l + m
l
s +
m
l
e
−
3
l t
(
4
.
26
c)
One can check the above result by letting m
0
(no repair), which yields
R
TMR
(t)
3
e
−
2
l t
−
2
e
−
3
the initial behavior of the TMR system simply by using the transform for t
n
,
which is discussed in Appendix B, Section B
8
.
3
. We begin with Eq. (
4
.
26
a),
where division of the denominator into the numerator using polynomial long
division yields for the first three terms:
R
TMR
(s)
1
s
−
6
l
2
s
3
+
6
l
2
n
(
4
.
27
b)
Setting a
0
yields
L
{
1
(n −
1
)!
t
n −
1
}
1
(s)
n
(
4
.
27
c)
Using the transform in Eq. (
27
d)
We previously studied the first two terms in the Taylor series expansion of
N-MODULAR REDUNDANCY WITH REPAIR
169
the TMR reliability expansion in Eq. (
4
.
15
). In Eq. (
4
.
27
d), we have a three-
term solution, and one can compare Eqs. (
4
.
15
) and (
4
.
27
b) by calculating an
additional third term in the expansion of Eq. (
4
.
15
). The expansions in Eq.
(
4
)
1
−
3
l
2
t
2
+
5
l
3
t
3
(
4
.
27
e)
Thus the first three terms of Eq. (
4
.
15
) and Eq. (
4
.
27
d) are identical for the
case of no repair, m
4
.
27
d) becomes +
15
l
3
t
3
rather than +
5
l
3
t
3
with no repair. One can evaluate the increase due to m
10
l
at one point in time by letting t
0
.
1
/
l. At this point in time, the TMR
reliability without repair is equal to
0
.
975
in Eq. (
4
.
11
) yields a value of
5
/
6
l. This implies that a single element is bet-
ter than TMR, but we know that TMR has a higher reliability than a single
element (see also Siewiorek [
1992
, p.
294
]). The explanation of this apparent
contradiction is simple if we examine the n
0
and n
1
curves in Fig.
4
.
4
.
In the region of primary interest,
0
< lt <
0