1 INTRODUCTION
Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design
Martin L. Shooman
Copyright © 2002 John Wiley & Sons, Inc.
ISBNs: 0-471-29342-3 (Hardback); 0-471-22460-X (Electronic)

The central theme of this book is the use of reliability and availability
computations as a means of comparing fault-tolerant designs. This chapter defines
fault-tolerant computer systems and illustrates the prime importance of such
techniques in improving the reliability and availability of digital systems that
are ubiquitous in the 21st century. The main impetus for complex, digital
systems is the microelectronics revolution, which provides engineers and scien-

generally allows the implementation of the techniques to be faster, better, and
cheaper. Siewiorek [1992] cites four other reasons for an increasing need for
fault tolerance: harsher environments, novice users, increasing repair costs, and
larger systems. One might also point out that the ubiquitous computer system
is at present so taken for granted that operators often have few clues on how
to cope if the system should go down.
Many books cover the architecture of fault tolerance (the way a fault-tolerant
system is organized). However, there is a need to cover the techniques required
to analyze the reliability and availability of fault-tolerant systems. A proper
comparison of fault-tolerant designs requires a trade-off among cost, weight,
volume, reliability, and availability. The mathematical underpinnings of these
analyses are probability theory, reliability theory, component failure rates, and
component failure density functions.
The obvious technique for adding redundancy to a system is to provide a
duplicate (backup) system that can assume processing if the operating (on-line)
system fails. If the two systems operate continuously (sometimes called hot
redundancy), then either system can fail first. However, if the backup system
is powered down (sometimes called cold redundancy or standby redundancy),
it cannot fail until the on-line system fails and it is powered up and takes over.
A standby system is more reliable (i.e., it has a smaller probability of failure);
however, it is more complex because it is harder to deal with synchronization
and switching transients. Sometimes the standby element does have a small
probability of failure even when it is not powered up. One can further enhance
the reliability of a duplicate system by providing repair for the failed system.
The average time to repair is much shorter than the average time to failure.
Thus, the system will only go down in the rare case where the first system fails
and the backup system, when placed in operation, experiences a short time to
failure before an unusually long repair on the first system is completed.
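
The comparison of hot and standby redundancy above can be made concrete with a
small numerical sketch. It is not taken from this chapter. It assumes two
identical, independent units with a constant failure rate $\lambda$, no repair,
a perfect switch, and a standby unit that cannot fail while powered down. Under
those assumptions the standard results are $R = 1 - (1 - e^{-\lambda t})^2$ for
the hot pair and $R = e^{-\lambda t}(1 + \lambda t)$ for the standby pair. The
function names and the sample values (4e-5 failures per hour, 1,000 hours) are
illustrative choices.

```python
import math


def single_unit(lam: float, t: float) -> float:
    """Reliability of one unit with constant failure rate lam (per hour)."""
    return math.exp(-lam * t)


def hot_pair(lam: float, t: float) -> float:
    """Two powered units in parallel: the system works if either unit works."""
    r = single_unit(lam, t)
    return 1.0 - (1.0 - r) ** 2


def standby_pair(lam: float, t: float) -> float:
    """One on-line unit plus an unpowered spare that takes over on failure.

    Assumes an ideal switch and a spare that cannot fail while powered down.
    """
    return math.exp(-lam * t) * (1.0 + lam * t)


if __name__ == "__main__":
    lam, t = 4e-5, 1000.0  # illustrative values, not from this passage
    for name, model in (("single", single_unit),
                        ("hot pair", hot_pair),
                        ("standby pair", standby_pair)):
        print(f"{name:<13} R({t:.0f} h) = {model(lam, t):.6f}")
```

With these inputs the standby pair comes out slightly ahead of the hot pair,
and both are well ahead of a single unit. Repair, imperfect switching, and
standby failures would all change the numbers.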

The above schemes apply to digital hardware; however, many of the reliability
problems in modern systems involve software errors. Modeling the number
of software errors and the frequency with which they cause system failures
requires approaches that differ from hardware reliability. Thus, software
reliability theory must be developed to compute the probability that a software
error might cause system failure. Software is made more reliable by testing to
find and remove errors, thereby lowering the error probability. In some cases,
one can develop two or more independent software programs that accomplish
the same goal in different ways and can be used as redundant programs. The
meaning of independent software, how it is achieved, and how partial software
dependencies reduce the effects of redundancy are studied in Chapter 5, which
discusses software.
Fault-tolerant design involves more than just reliable hardware and software.
System design is also involved, as evidenced by the following personal examples.
Before a departing flight I wished to change the date of my return, but the
reservation computer was down. The agent knew that my new return flight was
seldom crowded, so she wrote down the relevant information and promised to
enter the change when the computer system was restored. I was advised to confirm
the change with the airline upon arrival, which I did. Was such a procedure
part of the system requirements? If not, it certainly should have been.
Compare the above example with a recent experience in trying to purchase
tickets by phone for a concert in Philadelphia 16 days in advance. On my
Monday call I was told that the computer was down that day and that nothing
could be done. On my Tuesday and Wednesday calls I was told that the computer
was still down for an upgrade, and so it took a week for me to receive
a call back with an offer of tickets. How difficult would it have been to print

years. The low cost, small size, and low power
consumption of microelectronics and especially digital electronics allow practical
systems of tremendous sophistication but with concomitant hardware and
software complexity. Similarly, the progress in storage systems and computer
networks has led to the rapid growth of networks and systems.
A timeline of the progress in electronics is shown in Shooman [1990, Table
K-1]. The starting point is the 1874 discovery that the contact between a metal
wire and the mineral galena was a rectifier. Progress continued with the vacuum
diode and triode in 1904 and 1905. Electronics developed for almost a half-century
based on the vacuum tube and included AM radio, transatlantic radiotelephony,
FM radio, television, and radar. The field began to change rapidly after
the discovery of the point contact and field effect transistor in 1947 and 1949
and, ten years later in 1959, the integrated circuit.
The rise of the computer occurred over a time span similar to that of microelectronics,
but the more significant events occurred in the latter half of the

44). The ENIAC developed at the
University of Pennsylvania between 1942 and 1945 with U.S. Army support
is generally recognized as the first electronic computer; it used vacuum tubes.
Major theoretical developments were the general mathematical model of computation
by Alan Turing in 1936 and the stored program concept of computing
published by John von Neumann in 1946. The next hardware innovations were
in the storage field: the magnetic-core memory in 1950 and the disk drive
THE RISE OF MICROELECTRONICS AND THE COMPUTER
in 1956. Electronic integrated circuit memory came later in 1975. Software
improved greatly with the development of high-level languages: FORTRAN
(1954–58

64), the first time-sharing systems at Dartmouth using the BASIC language
(1966) and the MULTICS system at MIT written in the PL-I language
(1965–70), and the first computer network, the ARPA net, that began in 1969.
The concept of RAID fault-tolerant memory storage systems was first published
in 1988. The major developments in operating system software were
the UNIX operating system (1969–70), the CP/M operating system for the
8086 Microprocessor (1980), and the MS-DOS operating system (1981). The choice
of MS-DOS to be the operating system for IBM’s PC, and Bill Gates’ fledgling
company as the developer, led to the rapid development of Microsoft.
The first home computer design was the Mark-

1985, and the first version of the
Office business software in 1989. For more details on the historical development
of microelectronics and computers in the 20th century, see the following
sources: Ditlea [1984], Randall [1975], Sammet [1969], and Shooman [1983].
Also see www.intel.com and www.microsoft.com.
This historical development leads us to the conclusion that today one can
build a very powerful computer for a few hundred dollars with a handful of
memory chips, a microprocessor, a power supply, and the appropriate input,
output, and storage devices. The accelerating pace of development is breathtaking,
and of course all the computer memory will be filled with software
that is also increasing in size and complexity. The rapid development of the
microprocessor—in many ways the heart of modern computer progress—is
outlined in the next section.
1.2.

1975    64,000    2^16 = 65,536
of Fairchild Semiconductor, to predict the future of the microchip industry.
From the chronology in Table 1.1, we see that the first microchip was invented
in 1959. Thus the complexity was then one transistor. In 1964, complexity had
grown to 32 transistors, and in 1965, a chip in the Fairchild R&D lab had
64 transistors. Moore projected that chip complexity was doubling every year,
based on the data for 1959, 1964

Transistor Complexity of Microprocessors and Moore’s Law
Assuming a Doubling Period of Two Years

Year     CPU          Microchip Complexity:  Moore’s Law Complexity:
                      Transistors            Transistors
1971.50  4004         2,300                  (2^0) × 2,300 = 2,300
1978.75  8086         31,000                 (2 [...] 507
1985.25  80386        280,000                (2^(2.5/2)) × 113,507 = 269,967
1989.75  80486        1,200,000              (2 [...] 5/2) × 1,284,185 = 4,319,466
1995.25  Pentium Pro  5,500,000              (2^(2/2)) × [...] ) × 8,638,933 = 18,841,647
         (P6 + MMX)
1998.50  Merced (P7)  14,000,000             (2^(3. [...] 2) × 26,646,112 = 41,093,922
2000.75  Pentium 4    42,000,000             (2^(1/2)) × 41,

Intel microprocessor complexities fell slightly behind Moore’s Law. Some
say that Moore’s Law no longer holds because transistor spacing cannot be
reduced rapidly with present technologies [Mann, 2000; Markov, 1999]; however,
Moore, now Chairman Emeritus of Intel Corporation, sees no fundamental
barriers to increased growth until 2012 and also sees that the physical
limitations on fabrication technology will not be reached until 2017 [Moore,
2000].
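
The Moore’s Law column of Table 1.2 can be approximated with a short sketch. It
assumes, as the table heading states, a doubling period of two years, and it
anchors the projection on the 4004 (2,300 transistors at 1971.50). The function
name `moore_projection` and the `CHIPS` list are illustrative. Only rows whose
year, CPU, and actual transistor count survive in the table are included.

```python
def moore_projection(year, base_year=1971.50, base_transistors=2_300,
                     doubling_years=2.0):
    """Projected transistor count, doubling every `doubling_years` years."""
    return base_transistors * 2.0 ** ((year - base_year) / doubling_years)


# (year, CPU, actual transistor count) as listed in Table 1.2
CHIPS = [
    (1971.50, "4004", 2_300),
    (1978.75, "8086", 31_000),
    (1985.25, "80386", 280_000),
    (1989.75, "80486", 1_200_000),
    (1995.25, "Pentium Pro", 5_500_000),
    (1998.50, "Merced (P7)", 14_000_000),
    (2000.75, "Pentium 4", 42_000_000),
]

if __name__ == "__main__":
    for year, cpu, actual in CHIPS:
        projected = moore_projection(year)
        print(f"{year:7.2f}  {cpu:<12} actual={actual:>11,}  "
              f"projected={projected:>13,.0f}  ratio={actual / projected:.2f}")
```

The table builds each entry from the previous row and rounds along the way, so
the closed-form projection can differ from the tabulated figures by a few
transistors. A ratio below 1 means the actual chip falls below the projection.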
The data in Table 1.2 is plotted in Fig. 1.1 and shows a close fit to Moore’s
Law. The three data points between 1997 and 2000 seem to be below the curve;

million megawatt hours (the
energy produced by all the world’s nuclear power plants in 72 hours); the ultimate
speed is $5.4 \times 10^{50}$ hertz (about $10^{43}$ times the speed of the Pentium 4); and
the memory size would be $2.1 \times 10^{31}$ bits, which is $4 \times 10^{30}$ bytes (

1981, the IBM personal computer was limited
to 640 kilobytes of memory by the operating system’s nearsighted specifications,
even though many “workaround” solutions were common. By the
early 1990s, 4 or 8 megabyte memories for PCs were the rule, and in 2000,
the standard PC memory size has grown to 64–128 megabytes. Disk memory
has also increased rapidly: from small 32–128 kilobyte disks for the PDP 8e 8

% per year, yielding an eighteenfold increase in capacity [Fisher, 1997; Markoff,
1999]. In 2001, the standard desk PC came with a 40 gigabyte hard drive.
If Moore’s Law predicts a doubling of microprocessor complexity every two
years, disk storage capacity has increased by 2.56 times each two years, faster
than Moore’s Law.
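
The two growth rates quoted above are easier to compare once they are put on
the same footing. The sketch below converts a growth factor over a period into
an equivalent annual rate and then compounds both over a longer horizon. The
helper names and the ten-year horizon are illustrative; the factors 2 and 2.56
per two years are the ones in the text.

```python
def annual_rate(factor, years):
    """Equivalent yearly growth rate for `factor`-fold growth over `years`."""
    return factor ** (1.0 / years) - 1.0


def growth_over(factor, years, horizon):
    """Total growth after `horizon` years at `factor`-fold per `years`."""
    return factor ** (horizon / years)


if __name__ == "__main__":
    for label, factor in (("Moore's Law (x2 per 2 years)", 2.0),
                          ("Disk storage (x2.56 per 2 years)", 2.56)):
        print(f"{label:<34} {annual_rate(factor, 2.0):6.1%} per year, "
              f"x{growth_over(factor, 2.0, 10.0):,.1f} over 10 years")
```

Since 2.56 is 1.6 squared, the disk figure corresponds to about 60 percent
growth per year, against roughly 41 percent per year for a two-year doubling.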
1.2.4 Digital Electronics in Unexpected Places
The examples of the need for fault tolerance discussed previously focused on
military, space, and other large projects. There is no less a need for fault tolerance
in the home now that electronics and most electrical devices are digital,
which has greatly increased their complexity. In the 1940

alarm systems and elderly medic alert systems; irrigation systems; pacemakers;
video games; Web-surfing devices; copying machines; calculators; toothbrushes;
musical greeting cards; pet identification tags; and toys. Of course
this list does not even include the cellular phone, which may soon assume
the functions of both a personal digital assistant and a portable Internet interface.
It has been estimated that the typical American home in 1999 had 40–60
microprocessors, a number that could grow to 280 by 2004. In addition, a
modern family sedan contains about 20 microprocessors, while a luxury car
may have 40–60 microprocessors, which in some designs are connected via a
local area network [Stepler, 1998; Hafner, 1999].
Not all these devices are that simple either. An electronic toothbrush has

controls others. They have modified Billy Bass to speak the hackers’ dialog
and sing their songs.
Late in 2000, Sony introduced a second-generation dog-like robot called
Aibo (Japanese for “pal”). With 20 motors, a 32-bit RISC processor, 32
megabytes of memory, and an artificial intelligence program, Aibo acts like
a frisky puppy. It has color-camera eyes and stereo-microphone ears, touch
sensors, a sound-synthesis voice, and gyroscopes for balance. Four different
“personality” modules make this $1,500 robot more than a toy [Pogue, 2001].
What is the need for fault tolerance in such devices? If a Furby fails, you
discard it, but it would be disappointing if that were the only sensible choice
for a microwave oven or a washing machine. It seems that many such devices
are designed without thought of recovery or fault tolerance. Lawn irrigation
timers, VCRs, microwave ovens, and digital phone answering machines are all
upset by power outages, and only the best designs have effective battery backups.
My digital answering machine was designed with an effective recovery

very complex systems. Thus, a system designer should formulate a number of
different approaches to a problem and weigh the pluses and minuses of each
design before recommending an approach. One should be careful to base conclusions
on an analysis of facts, not on conjecture. Sometimes the best solution
includes simplifying the design a bit by leaving out some marginal, complex
features. It may be difficult to convince the authors of the requirements that
sometimes “less is more,” but this is sometimes the best approach. Design decisions
often change as new technology is introduced. At one time any attempt to
digitize the Library of Congress would have been judged infeasible because of
the storage requirement. However, by using modern technology, this could be
accomplished with two modern RAID disk storage systems such as the EMC
Symmetrix systems, which store more than nine terabytes ($9 \times 10^{12}$ bytes)
[EMC Products-At-A-Glance, www.emc.com]. The computation is outlined in
the problems at the end of this chapter.
Reliability and availability of the system should always be two factors that
are included, along with cost, performance, time of development, risk of failure,
and other factors. Sometimes it will be necessary to discard a few design
objectives to achieve a good design. The system engineer should always keep
RELIABILITY AND AVAILABILITY
in mind that the design objectives generally contain a list of key features and a
list of desirable features. The design must satisfy the key features, but if one or
two of the desirable features must be eliminated to achieve a superior design,
the trade-off is generally a good one.
1

,000) = 0.04. Clearly the probability
of success, $P_s$, which is known as the reliability, $R$, is given by
$R(1{,}000) = P_s(1{,}000) = 1 - P_f(1{,}000)$

or, as it is sometimes
stated, fr = z = 40 failures per million operating hours, where $z$ is often
called the hazard function. The units used in the telecommunications industry
are fits (failures in time), which are failures per billion operating hours. More
detailed mathematical development relates the reliability, the failure rate, and
time. For the simplest case where the failure rate $z$ is a constant (one generally
uses $\lambda$ to represent a constant failure rate), the reliability function can be
shown to be $R(t) = e^{-\lambda t}$. If we substitute the preceding values, we obtain
$R(1{,}000) = e^{-4 \times 10^{-5} \times 1{,}000} = e^{-0.04} \approx 0.96$

$e^{-n\lambda t}$
Consider the case of the first supercomputer, the CDC 6600 [Thornton,
1970]. This computer had 400,000 transistors, for which the estimated failure
rate was then $4 \times 10^{-9}$ failures per hour. Thus, even though the failure
rate of each transistor was very small, the computer reliability for 1,000 hours
would be
$R(1{,}000) = e^{-400{,}000 \times 4 \times 10^{-9} \times 1{,}000} = e^{-1.6} \approx 0.20$
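
The constant-failure-rate formulas $R(t) = e^{-\lambda t}$ and $e^{-n\lambda t}$
can be checked numerically with the figures quoted in this section: 40 failures
per million operating hours for the single-item example, and 400,000 transistors
at $4 \times 10^{-9}$ failures per hour for the CDC 6600. The sketch assumes that
the $n$ components are identical and that all of them must work, which is how
the $n\lambda t$ exponent is read here. The function names are illustrative.

```python
import math


def reliability(failure_rate, hours, n_components=1):
    """R(t) = exp(-n * lambda * t) for n identical components that must all work."""
    return math.exp(-n_components * failure_rate * hours)


def per_hour_to_fits(failure_rate):
    """Convert failures per hour to fits (failures per billion operating hours)."""
    return failure_rate * 1e9


if __name__ == "__main__":
    lam = 40 / 1e6  # 40 failures per million operating hours
    print(f"single item : R(1000 h) = {reliability(lam, 1000):.4f}")
    print(f"in fits     : {per_hour_to_fits(lam):,.0f}")
    print(f"CDC 6600    : R(1000 h) = {reliability(4e-9, 1000, 400_000):.4f}")
```

The single-item value agrees with the 0.96 obtained from $1 - P_f$, while the
400,000-transistor case shows how a very small per-component failure rate still
yields a low system reliability over 1,000 hours.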
