COMPUTER BASED NUMERICAL AND STATISTICAL TECHNIQUES
The basic operations specified by IEEE arithmetic are first and foremost addition, subtraction,
multiplication, and division. Square roots and remainders are also included. The default rounding
for these operations is “to nearest even”. This means that the floating point result fl(a op b) of the
exact operation (a op b) is the nearest floating point number to (a op b), breaking ties by rounding
to the floating point number whose bottom bit is zero (the “even” one). It is also possible to round
up, round down, or truncate (round towards zero). Rounding up and down are useful for interval
arithmetic, which can provide guaranteed error bounds; unfortunately most languages and/or
compilers provide no access to the status flag which selects the rounding direction. When the
result of a floating point operation is not representable as a normalized floating point number, an
exception occurs.
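The rounding directions described above can be tried out with a small sketch. It uses Python's standard `decimal` module, so it rounds decimal digits rather than the binary digits of IEEE hardware; the helper name `round_tenths` is only an illustration and not part of any standard.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN

def round_tenths(text, mode):
    """Round the decimal string `text` to one decimal place using `mode`."""
    return Decimal(text).quantize(Decimal("0.1"), rounding=mode)

# Round to nearest, ties going to the neighbour whose last digit is even.
print(round_tenths("0.25", ROUND_HALF_EVEN))   # tie between 0.2 and 0.3 -> 0.2
print(round_tenths("0.35", ROUND_HALF_EVEN))   # tie between 0.3 and 0.4 -> 0.4

# Directed roundings, the ones interval arithmetic relies on.
print(round_tenths("-0.27", ROUND_CEILING))    # round up (towards +infinity)
print(round_tenths("-0.27", ROUND_FLOOR))      # round down (towards -infinity)
print(round_tenths("-0.27", ROUND_DOWN))       # truncate (towards zero)
```

Rounding a lower bound down and an upper bound up in this way is what lets interval arithmetic keep the exact result inside the computed interval.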
1.8 FLOATING POINT ARITHMETIC AND THEIR COMPUTATION
The computer performs the basic arithmetic operations of addition, subtraction,
multiplication and division. Decimal numbers are first converted to machine numbers. A machine
number is stored using only the digits 0 and 1, in a base that depends on the computer. If the
base is two the number system is called the binary number system, if the base is eight it is called
the octal number system, and if the base is sixteen it is called the hexadecimal number system.
The decimal number system has base 10. In numerical computation, there are mainly two
types of arithmetic operations:
(a) Integer arithmetic, which deals with integer operands and
(b) Real or floating-point arithmetic, which deals with operands that have a fractional part.
Most computers carry out scientific calculations in floating-point arithmetic, which avoids the
difficulty of keeping every number less than 1 in magnitude during computation. A floating-point
number is characterized by three parameters: the base b, the number of digits n, and the exponent
range (m, M).
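An n-digit floating-point number with base b has the form:
$$x = \pm (0.d_1 d_2 \dots d_n)_b \times b^{e},$$
where the digits satisfy $0 \le d_i \le b - 1$ and the exponent satisfies $m \le e \le M$. The number 0 is written as:
$$+0.000 \dots 0 \times b^{e}$$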
The floating-point number is said to be normalized if $d_1 \neq 0$, or else $d_1 = d_2 = \dots = d_n = 0$. If $d_1, d_n \neq 0$ the number is said to have n significant digits.
There are two commonly used ways to translate a given real number x into an n-digit
floating-point number $f_p(x)$ with base b: rounding and chopping.
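For a real number $x = \pm (0.d_1 d_2 d_3 \dots d_n d_{n+1} \dots)_b \times b^{e}$, chopping keeps the digits $d_1, \dots, d_n$ of the mantissa and discards the rest. The floating-point number is in rounding form if the mantissa of $f_p(x)$ is obtained from
$$0.d_1 d_2 \dots d_n d_{n+1} + \frac{1}{2} b^{-n},$$
where the first n digits are used to write the floating-point number.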
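As an illustration of chopping and rounding, the sketch below converts an exact fraction into n mantissa digits in base b. The function name `to_digits` and its return convention (sign, list of digits, exponent) are assumptions made for this example only; rounding is done by adding $\frac{1}{2} b^{-n}$ to the normalized mantissa and then chopping.

```python
from fractions import Fraction

def to_digits(x, n, b=10, rounding=False):
    """Return (sign, [d1, ..., dn], e) such that x is approximately
    sign * (0.d1 d2 ... dn)_b * b**e, by chopping or by rounding."""
    x = Fraction(x)
    if x == 0:
        return 1, [0] * n, 0
    sign = -1 if x < 0 else 1
    x = abs(x)
    e = 0
    while x >= 1:                      # bring the mantissa below 1
        x /= b
        e += 1
    while x < Fraction(1, b):          # make the leading digit nonzero
        x *= b
        e -= 1
    if rounding:
        x += Fraction(1, 2) / b ** n   # add half a unit in the n-th digit
        if x >= 1:                     # carry out of the mantissa
            x /= b
            e += 1
    digits = []
    for _ in range(n):                 # peel off n digits (this is the chop)
        x *= b
        d = int(x)
        digits.append(d)
        x -= d
    return sign, digits, e

print(to_digits(Fraction(2, 3), 4))                  # chopping
print(to_digits(Fraction(2, 3), 4, rounding=True))   # rounding
print(to_digits(Fraction(1, 10), 4, b=2))            # 0.1 in base 2, chopped
```

Exact fractions are used so that the digits produced are not themselves affected by binary rounding in the host language.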
ERRORS AND FLOATING POINT
Example 1. Digit normalized form of $\frac{2}{3}$
Sol.
()
Suppose computer memory stores 6 digits in each location, together with one or more signs.
To represent real numbers, the computer may assume a fixed position for the decimal point, and all
numbers are stored after appropriate shifting relative to this assumed decimal point. In that case the
largest number that can be stored is 9999.99 and the smallest is 0000.01, both in magnitude.
To preserve the maximum number of significant digits in a real number and to increase the range of
values it can take, a different representation is used, called the normalized floating-point
mode.
Example 2. The number $58.72 \times 10^{5}$ is represented as $0.5872 \times 10^{7}$ or 0.5872e7.
Sol. Here the mantissa is 0.5872 and the exponent is 7. Shifting the mantissa to the left
until its most significant digit is nonzero is called normalization.
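Normalization of a decimal number can be sketched with Python's `decimal` module. The helper `normalized_form` is a hypothetical name, and chopping to four mantissa digits is an assumption taken from the examples in this section.

```python
from decimal import Decimal, ROUND_DOWN

def normalized_form(text, n=4):
    """Return (mantissa, exponent) with 0.1 <= |mantissa| < 1,
    the mantissa being chopped to n decimal digits."""
    d = Decimal(text)
    if d == 0:
        return Decimal(0), 0
    e = d.adjusted() + 1                 # position of the leading digit, plus one
    m = d.scaleb(-e)                     # move the decimal point in front of that digit
    m = m.quantize(Decimal(1).scaleb(-n), rounding=ROUND_DOWN)
    return m, e

m, e = normalized_form("58.72e5")
print(f"{m}e{e}")                        # 0.5872e7
```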
1.8.1 Arithmetic Operations on Floating Point Numbers
There are four basic arithmetic operations: addition, subtraction, multiplication and
division. They are applied to floating-point numbers as follows:
Example 3. Add the floating-point numbers 0.4546e3 and 0.5433e7.
Sol. The exponents are unequal. To add these floating-point numbers, keep the operand
with the larger exponent and rewrite the other operand with the same exponent:
0.5433e7 + 0.0000e7 = 0.5433e7
(because 0.4546e3, rewritten with exponent 7 and a four-digit mantissa, becomes 0.0000e7).
Example 4. Add the floating-point numbers 0.6434e3 and 0.4845e3.
Sol. The exponents are equal, but on adding we get 1.1279e3; the mantissa
has 5 digits and is greater than 1, so it is shifted right one place and the last digit is
dropped. Hence the result is 0.1127e4.
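The alignment and carry steps used in Examples 3 and 4 can be sketched as follows. A number such as 0.5433e7 is stored as the pair `(5433, 7)`; this integer representation, the name `fadd`, and the restriction to non-negative operands are assumptions made for the illustration.

```python
DIGITS = 4                       # mantissa length used in the examples
LIMIT = 10 ** DIGITS

def fadd(x, y):
    """Add two non-negative normalized numbers given as (mantissa_int, exponent).

    Digits shifted out of the four-digit mantissa are chopped.
    """
    (mx, ex), (my, ey) = x, y
    if ex < ey:                              # let x carry the larger exponent
        (mx, ex), (my, ey) = (my, ey), (mx, ex)
    my //= 10 ** (ex - ey)                   # shift the smaller operand to the right
    m, e = mx + my, ex
    if m >= LIMIT:                           # mantissa reached 1: shift right once
        m //= 10
        e += 1
    return m, e

print(fadd((4546, 3), (5433, 7)))   # Example 3 -> (5433, 7)
print(fadd((6434, 3), (4845, 3)))   # Example 4 -> (1127, 4)
```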
Example 5. Add the floating-point numbers 0.6434e99 and 0.4845e99.
Sol. In this example the mantissa is shifted right and the exponent is increased by 1
(because the sum of the mantissas exceeds 1), resulting in a value of 100 for the exponent.
This condition is called an overflow condition.
Example 8. Subtract the following floating-point numbers:
1. 0.5424e–99 from 0.5452e–99
2. 0.3862e–7 from 0.9682e–7
Sol. 1. On subtracting we get 0.0028e–99. This is a floating-point number, but it is not in
normalized form. To normalize it, shift the mantissa to the left by two places, which gives
0.2800e–101. The exponent is now smaller than –99; this condition is called an
underflow condition.
2. Similarly, after subtraction we get 0.5820e–7.
The examples above (7 and 8) show the subtraction of floating-point numbers, including the
underflow condition. In general, to add or subtract two numbers in normalized floating-point notation
the exponents of the numbers must be equal; if they are not, they are made equal and the
mantissa is shifted appropriately.
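A minimal sketch of subtraction with renormalization and an underflow check is given below. It stores 0.5452e–99 as `(5452, -99)` and assumes a four-digit mantissa and an exponent range of –99 to 99; the name `fsub` and the status strings are hypothetical. The sketch normalizes fully, shifting left until the leading mantissa digit is nonzero.

```python
DIGITS = 4
E_MIN, E_MAX = -99, 99           # assumed two-digit exponent range

def fsub(x, y):
    """Return ((mantissa_int, exponent), status) for x - y.

    Both operands are normalized, non-negative, and x >= y is assumed.
    """
    (mx, ex), (my, ey) = x, y
    if ex < ey:
        raise ValueError("this sketch expects x to have the larger exponent")
    my //= 10 ** (ex - ey)               # align y with x, chopping shifted digits
    m, e = mx - my, ex
    if m == 0:
        return (0, 0), "ok"
    while m < 10 ** (DIGITS - 1):        # leading zero: shift mantissa left
        m *= 10
        e -= 1
    status = "underflow" if e < E_MIN else "ok"
    return (m, e), status

print(fsub((5452, -99), (5424, -99)))    # ((2800, -101), 'underflow')
print(fsub((9682, -7), (3862, -7)))      # ((5820, -7), 'ok')
```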
Example 9. Multiply the following floating-point numbers:
1. 0.1111e74 and 0.2000e80
2. 0.1234e–49 and 0.1111e–54
Sol. 1. On multiplying 0.1111e74 × 0.2000e80 we get 0.2222e153. This
shows the overflow condition of normalized floating-point numbers.
2. Similarly, the second multiplication gives 0.1370e–104, which shows the underflow
condition of floating-point numbers.
This example shows that two numbers in normalized floating-point representation are multiplied
by multiplying the mantissas and adding the exponents. Similarly, division is carried out by
dividing the mantissa of the numerator by that of the denominator and subtracting the denominator
exponent from the numerator exponent. In both cases the resulting mantissa is normalized and
the exponent is adjusted accordingly.
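The multiplication and division rules can be sketched in the same style. Numbers are pairs such as `(1111, 74)` for 0.1111e74; the four-digit chopped mantissa, the exponent limits of –99 and 99, and the names `fmul`, `fdiv` and `_finish` are assumptions for this illustration.

```python
DIGITS = 4
E_MIN, E_MAX = -99, 99           # assumed two-digit exponent range

def _finish(m, e):
    """Attach an overflow/underflow status to a normalized result."""
    if e > E_MAX:
        return (m, e), "overflow"
    if e < E_MIN:
        return (m, e), "underflow"
    return (m, e), "ok"

def fmul(x, y):
    """Multiply mantissas, add exponents, then normalize and chop."""
    (mx, ex), (my, ey) = x, y
    p = mx * my                          # up to 8 digits of the mantissa product
    e = ex + ey
    if p < 10 ** (2 * DIGITS - 1):       # product below 0.1: shift left once
        p *= 10
        e -= 1
    return _finish(p // 10 ** DIGITS, e)

def fdiv(x, y):
    """Divide mantissas, subtract exponents, then normalize and chop."""
    (mx, ex), (my, ey) = x, y
    q = mx * 10 ** DIGITS // my          # quotient scaled to four digits
    e = ex - ey
    if q >= 10 ** DIGITS:                # quotient reached 1: shift right once
        q //= 10
        e += 1
    return _finish(q, e)

print(fmul((1111, 74), (2000, 80)))      # ((2222, 153), 'overflow')
print(fmul((1234, -49), (1111, -54)))    # ((1370, -104), 'underflow')
```

Because both mantissas lie between 0.1 and 1, a single shift is always enough to renormalize the product or the quotient.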
Example 10. Calculate the sum of given floating-point numbers:
1. 0.4546e5 and 0.5433e7
2. 0.4546e5 and 0.5433e5
Sol. 1. When the exponents are not equal, the operand with the larger exponent is kept unchanged.
$e^{.5250e1} = e^{5} \times e^{.25}$
$e^{5}$ = (.2718e1) × (.2718e1) × (.2718e1) × (.2718e1) × (.2718e1)
= .1484e3
Also, we find $e^{.25}$:
$$e^{.25} = 1 + (.25) + \frac{(.25)^{2}}{2!} + \frac{(.25)^{3}}{3!}$$
= 1.25 + .03125 + .002604 = .1284e1
Hence $e^{.5250e1}$ = (.1484e3) × (.1284e1) = .1905e3
Example 13. Compute the middle value of the numbers a = 4.568 and b = 6.762 using four-digit
arithmetic and compare the result with that obtained by taking $c = a + \frac{b - a}{2}$.
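The two midpoint formulas of Example 13 can be compared with a short sketch. It models four-digit arithmetic with a `decimal.Context` of precision 4 and round-half-up, which is an assumption about the arithmetic intended; the function name `midpoints` and the second pair of values (5.001 and 5.002) are illustrative additions, not taken from the text.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=4, rounding=ROUND_HALF_UP)   # every operation keeps 4 digits

def midpoints(a, b):
    """Return ((a + b)/2, a + (b - a)/2), both in four-digit arithmetic."""
    a, b = Decimal(a), Decimal(b)
    two = Decimal(2)
    direct = ctx.divide(ctx.add(a, b), two)
    shifted = ctx.add(a, ctx.divide(ctx.subtract(b, a), two))
    return direct, shifted

print(midpoints("4.568", "6.762"))   # both formulas agree for the data of Example 13
print(midpoints("5.001", "5.002"))   # here (a + b)/2 falls outside [a, b]
```

For the second pair, a + b needs five digits, so the direct formula loses the last digit before halving, while a + (b – a)/2 stays inside the interval.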
Example 14. Evaluate 1 – cos x at x = 0.1396 radian. Assume cos(0.1396) = 0.9903 and compare
the result with that obtained by evaluating $2 \sin^{2}\frac{x}{2}$. Also assume sin(0.0698) = 0.6974e–1.
Sol. Since x = 0.1396,
1 – cos(0.1396) = 0.1000e1 – 0.9903e0
= 0.1000e1 – 0.0990e1 = 0.1000e–1
Now $\sin\frac{x}{2}$ = sin(0.0698) = 0.6974e–1, so
$2 \sin^{2}\frac{x}{2}$ = (0.2000e1) × (0.6974e–1) × (0.6974e–1) = 0.9727e–2
The value obtained by the alternate formula is close to the true value 0.9728e–2.
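The same loss of significance appears in ordinary double precision once x is small enough. The sketch below uses Python floats rather than the four-digit arithmetic of the example, and the function names are hypothetical.

```python
import math

def one_minus_cos_naive(x):
    """Direct formula: subtracts two nearly equal numbers when x is small."""
    return 1.0 - math.cos(x)

def one_minus_cos_stable(x):
    """Equivalent form 2 sin^2(x/2), which involves no subtraction."""
    return 2.0 * math.sin(x / 2.0) ** 2

for x in (0.1396, 1e-9):
    print(x, one_minus_cos_naive(x), one_minus_cos_stable(x))
```

At x = 1e-9 the cosine rounds to exactly 1, so the direct formula returns zero, while the rewritten formula still returns a value close to $x^{2}/2$.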
Example 15. Evaluate the following floating-point numbers:
1. 0.5334e9 × 0.1132e–25
2. 0.1111e10 × 0.1234e15
3. 0.9998e – 5 ÷ 0.1000e98
4. 0.1111e51 × 0.4444e50
5. 0.1000e5 ÷ 0.9999e3
6. 0.5543e12 × 0.4111e – 15
7. 0.9998e1 + 0.1000e–99
Sol. Since x = 0.4845, y = 0.4800
Hence x + y = 0.4845e0 + 0.4800e0 or 0.9645e0.
Again,
$x^{2}$ = (0.4845e0) × (0.4845e0) = 0.2347e0
$y^{2}$ = (0.4800e0) × (0.4800e0) = 0.2304e0
$x^{2} - y^{2}$ = 0.2347e0 – 0.2304e0 = 0.0043e0
Therefore,
$$\frac{x^{2} - y^{2}}{x + y} = \frac{0.0043e0}{0.9645e0}$$
Hence the roots are:
$$\frac{0.1000e4 + 0.1000e4}{2} \quad \text{and} \quad \frac{0.1000e4 - 0.1000e4}{2}$$
which are 0.1000e4 and 0.0000e4 respectively. One of the roots becomes zero due to the limited
precision allowed in the computation. Since in a quadratic equation $ax^{2} + bx + c = 0$ the product
of the roots is $\frac{c}{a}$, the smaller root may be obtained by dividing (c/a) by the larger root.
Therefore the first root is 0.1000e4 and the second root is
$$\frac{25}{0.1000e4} = \frac{0.2500e2}{0.1000e4} = 0.2500e{-}1.$$
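The effect of computing the smaller root as (c/a) divided by the larger root can be sketched in four-digit arithmetic. The equation $x^{2} - 1000x + 25 = 0$ used below is a hypothetical example chosen to be consistent with the numbers above (larger root 0.1000e4, c/a = 25); the `decimal` context with precision 4 only approximates the hand arithmetic of the text, so the naive small root comes out inaccurate rather than exactly zero.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)   # four significant digits

def roots_four_digit(a, b, c):
    """Return (larger, naive_smaller, stable_smaller) for ax^2 + bx + c = 0.

    Assumes b < 0 and real roots, so that -b + sqrt(disc) involves no cancellation.
    """
    a, b, c = Decimal(a), Decimal(b), Decimal(c)
    four_ac = ctx.multiply(Decimal(4), ctx.multiply(a, c))
    disc = ctx.subtract(ctx.multiply(b, b), four_ac)
    root = ctx.sqrt(disc)
    two_a = ctx.multiply(Decimal(2), a)
    larger = ctx.divide(ctx.add(-b, root), two_a)
    naive_smaller = ctx.divide(ctx.subtract(-b, root), two_a)   # cancellation here
    stable_smaller = ctx.divide(ctx.divide(c, a), larger)       # (c/a) / larger root
    return larger, naive_smaller, stable_smaller

print(roots_four_digit(1, -1000, 25))
```

The exact smaller root of this equation is approximately 0.025, which the product-of-roots formula recovers while the direct formula does not.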
Example 18. The associative and distributive laws are not always valid in normalized floating-
point representation. Give an example to demonstrate this statement.
Sol. As a consequence of the normalized floating-point representation, the
associative and the distributive laws of arithmetic are not always valid. The example given below
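A small sketch of such failures in four-digit arithmetic follows. The operand values and the function names are illustrative choices, not taken from the text, and the four-digit arithmetic is modeled with a `decimal.Context` of precision 4.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=4, rounding=ROUND_HALF_UP)   # four significant digits

def associative_demo():
    """Compare (a + b) + c with a + (b + c)."""
    a, b, c = Decimal("1000"), Decimal("0.4"), Decimal("0.4")
    left = ctx.add(ctx.add(a, b), c)
    right = ctx.add(a, ctx.add(b, c))
    return left, right

def distributive_demo():
    """Compare a(b - c) with ab - ac."""
    a, b, c = Decimal("5.555"), Decimal("4.545"), Decimal("4.535")
    left = ctx.multiply(a, ctx.subtract(b, c))
    right = ctx.subtract(ctx.multiply(a, b), ctx.multiply(a, c))
    return left, right

print(associative_demo())    # the two groupings give 1000 and 1001
print(distributive_demo())   # the two forms give 0.05555 and 0.06
```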
$$x = \frac{-b + \sqrt{b^{2} - 4ac}}{2a} \quad \text{and} \quad x = \frac{-b - \sqrt{b^{2} - 4ac}}{2a}$$
Here $b^{2} \gg |4ac|$ and the product of the roots is $\frac{c}{a}$.
Therefore smaller root is
2
/
4
2
ca
bb ac
0.4000 3 0.4000 3 0.8000 3
ee
e
ee e
×
==−=
+
.
PROBLEM SET 1.2
1. Round off the following numbers to four significant figures:
38.46235, 0.70029, 0.0022218, 19.235101 [Ans. 38.46, 0.7003, 0.002222, 19.24]
2. Round off the following numbers to two decimal places:
48.21416, 2.385, 52.275, 81.255, 2.3742 [Ans. 48.21, 2.39, 52.28, 81.26, 2.37]
3. Obtain the range of values within which the exact value of $\frac{1.265(10.21 - 7.54)}{47}$ lies, if all the
numerical quantities are rounded off. [Hint. On taking $e_a$ < 1%] [Ans. 0.06186 < x < 0.08186]
4. Calculate the value of
102 1
rounding:
1. (a) + (b) + (c) 2. (a) – (b) – (c) 3. (a)/(c)
4. (a)(b)/(c) 5. (a) – (b) 6. (b)/(c) (a)
[Ans. 1. 0.2585e1 2. 0.2581e1 3. 1.7511e–8
4. 0.3717e–8 5. –0.1663e–3 6. 0.1823e3]
10. Give example to show that most of the laws of arithmetic fail to hold for floating-point
arithmetic.
11. Find the root of smaller magnitude of the equation $x^{2}$ + 0.4002e0 x + 0.8e–4 = 0. Work in
floating-point arithmetic using a four decimal place mantissa. [Ans. –0.2e–3]
12. Give the normalized floating-point representation for the following:
1. 22/7 2. –22.75 3. 0.01
4. $9\frac{3}{8}$ 5. $-\frac{3}{64}$ 6. 3/6
[Ans. 1. 0.3143e1 2. –0.2275e2 3. 0.1e–1
4. 0.9375e1 5. –0.4688e–1 6. 0.5e0]
13. Using 5-digit arithmetic with rounding, calculate the sum of two numbers x = 0.78596e –2
and y = 0.786327e1. [Ans. 0.78712 e1]
14. Compute 403000 × 0.197 by 3-digit arithmetic with rounding. [Ans. 0.7939e5]
15. Evaluate
$\dots + a_{n-1}x + a_n$
where the a’s are constants ($a_0 \neq 0$) and n is a positive integer, is called a polynomial in x of degree
n, and the equation f (x) = 0 is called an algebraic equation of degree n. If f (x) contains some other
functions like exponential, trigonometric, logarithmic etc., then f (x) = 0 is called a transcendental
equation. For example,
$x^{3} - 3x + 6 = 0$, $x^{5} - 7x^{4} + 3x^{2} + 36x - 7 = 0$
are algebraic equations of third and fifth degree, whereas $x^{2} - 3\cos x + 1 = 0$, $xe^{x} - 2 = 0$,
$x \log_{10} x = 1.2$ etc., are transcendental equations. In both the cases, if the coefficients are pure
numbers, they are called numerical equations.
In this chapter, we shall describe some numerical methods for the solution of f(x) = 0 where
f(x) is algebraic or transcendental or both.
These methods, also known as trial and error methods, are based on the idea of successive
approximations, i.e., starting with one or more initial approximations to the value of the root, we
obtain the sequence of approximations by repeating a fixed sequence of steps over and over again
till we get the solution with reasonable accuracy. These methods generally give only one root at
a time.
For the human problem solver, these methods are very cumbersome and time consuming,
but on the other hand they are more natural for use on computers, for the following reasons:
(1) These methods can be concisely expressed as computational algorithms.
(2) It is possible to formulate algorithms which can handle a class of similar problems. For
example, algorithms to solve polynomial equations of degree n may be written.
(3) Rounding errors are negligible as compared to methods based on closed form solutions.
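As one illustration of successive approximation, the sketch below repeatedly halves an interval that brackets a root of the transcendental equation $xe^{x} - 2 = 0$ mentioned above. Interval halving is only one possible choice of iterative method; the starting interval [0, 1], the tolerance, and the function names are assumptions made for the example.

```python
import math

def f(x):
    """Left-hand side of the transcendental equation x e^x - 2 = 0."""
    return x * math.exp(x) - 2.0

def bisect(func, lo, hi, tol=1e-10, max_iter=200):
    """Repeat one fixed step (halve the bracketing interval) until it is short."""
    flo, fhi = func(lo), func(hi)
    if flo * fhi > 0:
        raise ValueError("func must change sign on [lo, hi]")
    for _ in range(max_iter):
        mid = lo + (hi - lo) / 2
        fmid = func(mid)
        if fmid == 0 or (hi - lo) / 2 < tol:
            return mid
        if flo * fmid < 0:          # root lies in the left half
            hi = mid
        else:                       # root lies in the right half
            lo, flo = mid, fmid
    return lo + (hi - lo) / 2

root = bisect(f, 0.0, 1.0)
print(root, f(root))
```

The same `bisect` routine works unchanged for any continuous function with a sign change on the starting interval, which is the sense in which one algorithm handles a class of similar problems.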
2.3 ORDER (OR RATE) OF CONVERGENCE OF ITERATIVE METHODS
Convergence of an iterative method is judged by the order at which the error between successive
approximations to the root decreases.
An iterative method is said to be kth order convergent if k is
the largest positive real number such that
$$\lim_{i \to \infty} \left| \frac{e_{i+1}}{e_i^{k}} \right| \le A$$
where A is a non-zero finite number called the asymptotic error constant, and it depends on the
derivative of f(x) at an approximate root x. e
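The order k can be estimated numerically from three successive errors, since $e_{i+1} \approx A e_i^{k}$ implies $k \approx \log(e_{i+1}/e_i) / \log(e_i/e_{i-1})$. The sketch below applies this estimate to the errors of Newton's iteration for $x^{2} - 2 = 0$; the choice of Newton's iteration as the test case and the function names are assumptions made for the illustration.

```python
import math

def estimate_order(errors):
    """Estimate k from consecutive error magnitudes e_{i-1}, e_i, e_{i+1}."""
    return [
        math.log(errors[i + 1] / errors[i]) / math.log(errors[i] / errors[i - 1])
        for i in range(1, len(errors) - 1)
    ]

def newton_sqrt2_errors(steps=4):
    """Errors of Newton's iteration x <- x - (x^2 - 2)/(2x), started at x = 1."""
    x, root = 1.0, math.sqrt(2.0)
    errors = [abs(x - root)]
    for _ in range(steps):
        x = x - (x * x - 2.0) / (2.0 * x)
        errors.append(abs(x - root))
    return errors

orders = estimate_order(newton_sqrt2_errors())
print(orders)    # the estimates settle near 2
```

Only a few steps are used because the error soon reaches the limit of double precision, after which the ratios are no longer meaningful.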