5.7 The Wavelet Transform 213
clear; % main program
filename=’lena128’; dim=128;
fid=fopen(filename,’r’);
if fid==-1 disp(’file not found’)
else img=fread(fid,[dim,dim])’; fclose(fid);
end
thresh=0.0; % percent of transform coefficients deleted
figure(1), imagesc(img), colormap(gray), axis off, axis square
w=harmatt(dim); % compute the Haar dim x dim transform matrix
timg=w*img*w’; % forward Haar transform
tsort=sort(abs(timg(:)));
tthresh=tsort(floor(max(thresh*dim*dim,1)));
cim=timg.*(abs(timg) > tthresh);
[i,j,s]=find(cim);
dimg=sparse(i,j,s,dim,dim);
% figure(2) displays the remaining transform coefficients
%figure(2), spy(dimg), colormap(gray), axis square
figure(2), image(dimg), colormap(gray), axis square
cimg=full(w’*sparse(dimg)*w); % inverse Haar transform
density = nnz(dimg);
disp([num2str(100*thresh) ’% of smallest coefficients deleted.’])
disp([num2str(density) ’ coefficients remain out of ’ ...
num2str(dim) ’x’ num2str(dim) ’.’])
figure(3), imagesc(cimg), colormap(gray), axis off, axis square
File harmatt.m with two functions
function x = harmatt(dim)
num=log2(dim);
p = sparse(eye(dim));q=p;
i=1;
while i<=dim/2;
A
1
=
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎝
1
2
1
2
000000
00
1
2
1
2
0000
0000
1
2
1
2
1
2
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎠
,A
1
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎝
255
224
192
159
47.5
15.5
16.5
16.0
15.5
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎠
, (5.10)
A
2
=
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎠
,A
3
=
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎝
1
2
1
2
000000
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎝
239.5
175.5
111.0
47.5
15.5
16.5
16.0
15.5
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎠
=
⎛
⎜
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎝
207.5
79.25
32.0
31.75
15.5
16.5
16.0
15.5
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎠
⎠
. (5.11)
Instead of calculating averages and differences, all we have to do is construct matri-
ces A
1
, A
2
,andA
3
, multiply them to get W = A
1
A
2
A
3
, and apply W to all the columns
of an image I by multiplying W·I:
W
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎝
255
⎝
1
8
1
8
1
8
1
8
1
8
1
8
1
8
1
8
1
8
1
8
1
8
1
8
−1
8
−1
8
−1
2
0000
0000
1
2
−1
2
00
000000
1
2
−1
2
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎠
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎝
143.375
64.125
32
31.75
15.5
16.5
16
15.5
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎠
.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
5.7 The Wavelet Transform 215
This, of course, is only half the job. In order to compute the complete transform, we
tr
·(W
−1
)
T
,
and this is where the normalized Haar transform (mentioned on page 200) becomes
important. Instead of calculating averages [quantities of the form (d
i
+ d
i+1
)/2] and
differences [quantities of the form (d
i
− d
i+1
)], it is better to compute the quantities
(d
i
+d
i+1
)/
√
2and(d
i
−d
i+1
)/
√
Wavelet Analysis: The Scalable Structure of Information (1998)
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
216 5. Image Compression
5.8 Filter Banks
So far, we have worked with the Haar transform, the simplest wavelet (and subband)
transform. We are now ready for the general subband transform. As a preparation for
the material in this section, we again examine the two main types of image transforms,
orthogonal and subband. An orthogonal linear transform is performed by computing the
inner product of the data (pixel values or audio samples) with a set of basis functions.
The result is a set of transform coefficients that can later be quantized and encoded. In
contrast, a subband transform is performed by computing a convolution of the data with
a set of bandpass filters. Each of the resulting subbands encodes a particular portion of
the frequency content of the data.
Note. The discrete inner product of the two vectors f
i
and g
i
is defined as the
following sum of products
f,g =
i
f
i
g
i
.
The discrete convolution h is denoted by fgandisdefinedas
h
i
smallest transform matrix that can be constructed in this case is
11
1 −1
/
√
2. It is a 2×2
matrix, and it generates two transform coefficients, an average and a difference. (Notice
that these are not exactly an average and a difference, because
√
2 is used instead of 2.
Better names for them are coarse detail and fine detail, respectively.) In general, the
DWT can use any set of wavelet filters, but it is computed in the same way regardless
of the particular filter used.
We start with one of the most popular wavelets, the Daubechies D4. As its name
implies, it is based on four filter coefficients c
0
, c
1
, c
2
,andc
3
, whose values are listed in
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
5.8 Filter Banks 217
Equation (5.13). The transform matrix W is [compare with matrix A
1
, Equation (5.10)]
−c
0
00... 0
00c
0
c
1
c
2
c
3
... 0
00c
3
−c
2
c
1
−c
0
... 0
.
.
.
.
.
.
.
.
.
−c
2
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎠
.
When this matrix is applied to a column vector of data items (x
1
,x
2
,...,x
n
), its top
row generates the weighted sum s
1
= c
0
x
1
6
, and the other odd-numbered rows
generate similar weighted sums s
i
. Such sums are convolutions of the data vector x
i
with the four filter coefficients. In the language of wavelets, each of them is called a
smooth coefficient, and together they are termed an H smoothing filter.
In a similar way, the second row of the matrix generates the quantity d
1
= c
3
x
1
−
c
2
x
2
+ c
1
x
3
− c
0
x
4
, and the other even-numbered rows generate similar convolutions.
Each d
i
⎜
⎝
c
0
c
3
00... c
2
c
1
c
1
−c
2
00... c
3
−c
0
c
2
c
1
c
0
c
3
... 00
c
3
−c
1
c
0
c
3
c
3
−c
0
c
1
−c
2
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎠
.
It can be shown that in order for W to be orthonormal, the four coefficients have to
satisfy the two relations c
3
− 1c
2
+2c
1
− 3c
0
= 0. They represent the vanishing of the first two moments of the
sequence (c
3
,−c
2
,c
1
,−c
0
). The solutions are
c
0
=(1+
√
3)/(4
√
2) ≈ 0.48296,c
1
=(3+
√
3)/(4
√
2) ≈ 0.8365,
cally developed to compress fingerprint images. Other compression methods that employ
the wavelet transform can be found in [Salomon 07].
Most of us may not realize it, but fingerprints are “big business.” The FBI started
collecting fingerprints in the form of inked impressions on paper cards back in 1924,
and today they have about 200 million cards, occupying an acre of filing cabinets in
the J. Edgar Hoover building in Washington, D.C. (The FBI, like many of us, never
throws anything away. They also have many “repeat customers,” which is why “only”
about 29 million out of the 200 million cards are distinct; these are the ones used for
running background checks.) What’s more, these cards keep accumulating at a rate of
30,000–50,000 new cards per day (this is per day, not per year)! There’s clearly a need
to digitize this collection, so it will occupy less space and will lend itself to automatic
search and classification. The main problem is size (in bits). When a typical fingerprint
card is scanned at 500 dpi, with eight bits/pixel, it results in about 10 Mb of data. Thus,
the total size of the digitized collection would be more than 2,000 terabytes (a terabyte
is 2
40
bytes); huge even by current (2008) standards.
Exercise 5.16: Apply these numbers to estimate the size of a fingerprint card.
Compression is therefore a must. At first, it seems that fingerprint compression
must be lossless because of the small but important details involved. However, lossless
image compression methods produce typical compression ratios of 0.5, whereas in order
to make a serious dent in the huge amount of data in this collection, compressions of
about 1 bpp or better are needed. What is needed is a lossy compression method that
results in graceful degradation of image details, and does not introduce any artifacts
into the reconstructed image. Most lossy image compression methods involve the loss
of small details and are therefore unacceptable, since small fingerprint details, such as
sweat pores, are admissible points of identification in court. This is where wavelets come
into the picture. Lossy wavelet compression, if carefully designed, can satisfy the criteria
above and result in efficient compression where important small details are preserved or
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
A simple test of fwt1 is
n=16; t=(1:n)./n;
dat=sin(2*pi*t)
filt=[0.4830 0.8365 0.2241 -0.1294];
wc=fwt1(dat,1,filt)
which outputs
dat=
0.3827 0.7071 0.9239 1.0000 0.9239 0.7071 0.3827 0
-0.3827 -0.7071 -0.9239 -1.0000 -0.9239 -0.7071 -0.3827 0
wc=
1.1365 -1.1365 -1.5685 1.5685 -0.2271 -0.4239 0.2271 0.4239
-0.0281 -0.0818 -0.0876 -0.0421 0.0281 0.0818 0.0876 0.0421
Figure 5.52: Code for the One-Dimensional Forward Discrete
Wavelet Transform.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
220 5. Image Compression
are at least identifiable. Figure 5.53a,b (obtained, with permission, from Christopher
M. Brislawn), shows two examples of fingerprints and one detail, where ridges and sweat
pores can clearly be seen.
Figure 5.53: Examples of Scanned Fingerprints (courtesy Christopher Brislawn).
Compression is also necessary, because fingerprint images are routinely sent between
law enforcement agencies. Overnight delivery of the actual card is too slow and risky
(there are no backup cards), and sending 10 Mb of data through a 9,600 baud modem
takes about three hours.
The method described here [Bradley et al. 93] has been adopted by the FBI as its
standard for fingerprint compression [Federal Bureau of Investigations 93]. It involves
three steps: (1) a discrete wavelet transform, (2) adaptive scalar quantization of the
wavelet transform coefficients, and (3) a two-pass Huffman coding of the quantization
indices. This is the reason for the name wavelet/scalar quantization, or WSQ. The
method typically produces compression factors of about 20. Decoding is the reverse of
A =
−14
√
15 + 63
1080
√
15
1/3
, and B =
−14
√
15 − 63
1080
√
15
1/3
.
This wavelet image decomposition can be called symmetric. It is shown in Fig-
ure 5.55. The SWT is first applied to the image rows and columns, resulting in 4×4=16
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
5.9 WSQ, Fingerprint Compression 221
Tap Exact value Approximate value
h
0
(0) −5
√
|
2
+4Rx
2
− 1)/16 −0.110624404418420
h
0
(±3) −5
√
2x
1
(Rx
2
)/8 −0.023849465019380
h
0
(±4) −5
√
2x
1
/64 0.037828455506995
h
1
(−1)
√
2(6x
1
− 1)/16x
1
0.788485616405660
23
10
16
18
40
42
48
50
51 54 55
5756 60 61
5958 62 63
52 53
8
21
27
29
19
22
28
30
20
25
31
33
23
26
32
34
24
9
information is important and should be quantized lightly.
The transform coefficients in the 64 subbands are floating-point numbers to be
denoted by a. They are quantized to a finite number of floating-point numbers that are
denoted by ˆa. The WSQ encoder maps a transform coefficient a to a quantization index
p (an integer that is later mapped to a code that is itself Huffman encoded). The index
p can be considered a pointer to the quantization bin where a lies. The WSQ decoder
receives an index p and maps it to a value ˆa that is close, but not identical, to a.This
is how WSQ loses image information. The set of ˆa values is a discrete set of floating-
point numbers called the quantized wavelet coefficients. The quantization depends on
parameters that may vary from subband to subband, since different subbands have
different quantization requirements.
Figure 5.56 shows the setup of quantization bins for subband k. Parameter Z
k
is
the width of the zero bin, and parameter Q
k
is the width of the other bins. Parameter
C is in the range [0, 1]. It determines the reconstructed value ˆa.ForC =0.5, for
example, the reconstructed value for each quantization bin is the center of the bin.
Equation (5.14) shows how parameters Z
k
and Q
k
are used by the WSQ encoder to
quantize a transform coefficient a
k
(m, n) (i.e., a coefficient in position (m, n) in subband
k) to an index p
k
(m, n) (an integer), and how the WSQ decoder computes a quantized
k
(m, n) ≤ Z
k
/2,
a
k
(m,n)+Z
k
/2
Q
k
+1,a
k
(m, n) < −Z
k
/2,
(5.14)
ˆa
k
(m, n)=
⎧
⎪
⎨
⎪
⎩
p
k
following steps:
Step 1: Let the width and height of subband k be denoted by X
k
and Y
k
, respectively.
We compute the six quantities
W
k
=
3X
k
4
,H
k
=
7Y
k
16
,
x
0k
=
X
k
0k
) to position (x
1k
,y
1k
) to estimate the variance
σ
2
k
of the subband by
σ
2
k
=
1
W·H − 1
x
1k
n=x
0k
y
1k
m=y
0k
a
k
(m, n) − µ
, 4 ≤ k ≤ 59, and σ
2
k
≥ 1.01,
0, 60 ≤ k ≤ 63, or σ
2
k
< 1.01,
where q is a proportionality constant that controls the bin widths Q
k
and thereby the
overall level of compression. The procedure for computing q is complex and will not be
described here. The values of the constants A
k
are
A
k
=
⎧
⎪
⎪
⎪
⎨
⎪
⎪
⎪
⎩
1.32,k=52, 56,
1.08,k=53, 58,
1.42,k=54, 57,
The second group is codes 107 through 254. They specify small indices, in the range
[−73, +74]. The third group consists of the six escape codes 101 through 106. They
indicate large indices or run lengths of more than 100 zero indices. Code 180 (which
corresponds to an index p
k
(m, n)=0)isnotused,becausethiscaseisreallyarun
length of a single zero. An escape code is followed by the (8-bit or 16-bit) raw value of
the index (or size of the run length). Here are some examples:
An index p
k
(m, n)=−71 is coded as 109. An index p
k
(m, n)=−1 is coded as 179.
An index p
k
(m, n) = 75 is coded as 101 (escape for positive 8-bit indices) followed by
75 (in eight bits). An index p
k
(m, n)=−259 is coded as 104 (escape for negative large
indices) followed by 259 (the absolute value of the index, in 16 bits). An isolated index
of zero is coded as 1, and a run length of 260 zeros is coded as 106 (escape for large run
lengths) followed by 260 (in 16 bits). Indices or run lengths that require more than 16
bits cannot be encoded, but the particular choices of the quantization parameters and
the wavelet transform virtually guarantee that large indices will never be generated.
Code Index or run length
1 run length of 1 zeros
2 run length of 2 zeros
3 run length of 3 zeros
.
.
third blocks contain the highpass detail subbands 19–51 and 52–59, respectively (recall
that subbands 60–63 are completely discarded). Two Huffman code tables are prepared,
one for the first block and the other for the second and third blocks.
A Huffman code table for a block of subbands is prepared by counting the number
of times each of the 254 codes of Table 5.57 appears in the block. The counts are used
to determine the length of each code and to construct the Huffman code tree. This is a
two-pass job (one pass to determine the code tables and another pass to encode), and
it is done in a way similar to the use of the Huffman code by JPEG (Section 5.6.4).
O’Day figured that that was more than he’d had the right to expect under the cir-
cumstances. A fingerprint identification ordinarily required ten individual points—the
irregularities that constituted the art of fingerprint identification—but that number
had always been arbitrary. The inspector was certain that Cutter had handled this
computer disk, even if a jury might not be completely sure, if that time ever came.
—Tom Clancy, Clear and Present Danger
Chapter Summary
Images are an important type of digital multimedia data. Images are popular, they are
easy to create (by a digital camera, scanning a document, or by creating a drawing or
an illustration), and they feature several types of redundancies, which makes it easy to
come up with methods for compressing them. In addition, the human visual system
can perceive the general form and many details of an image, but it cannot register the
precise color of every pixel. We therefore say that a typical image has much noise,
and this feature allows for much loss of original image information when the image is
compressed and then decompressed.
The chapter starts by discussing the various types of images, bi-level, grayscale,
continuous-tone, discrete-tone, and cartoon-like. It then states the main principle of im-
age compression, a principle that stems from the correlation of pixels. Eight approaches
to image compression are briefly discussed, all of them based on the main principle.
The remainder of the chapter concentrates on image transforms (Section 5.3) and in
particular on orthogonal and wavelet transforms. The popular JPEG method is based
on the discrete cosine transform (DCT), one of the important orthogonal transforms,
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
6
Audio Compression
In the Introduction, it is mentioned that the electronic digital computer was originally
conceived as a fast, reliable calculating machine. It did not take computer users long to
realize that a computer can also store and process nonnumeric data. The term “multi-
media,” which became popular in the 1990s, refers to the ability to digitize, store, and
manipulate in the computer all kinds of data, not just numbers. Previous chapters dis-
cussed digital images and methods for their compression, and this chapter concentrates
on audio data.
An important fact about audio compression is that decoding must be fast. Given
a compressed text file, we don’t mind waiting until it is fully decompressed before we
can read it. However, given a compressed audio file, we often want to listen to it while
it is decompressed (in fact, we decompress it only in order to listen to it). This is why
audio compression methods tend to be asymmetric. The encoder can be sophisticated,
complex, and slow, but the decoder must be fast.
First, a few words about audio and how it is digitized. The term audio refers to the
recording and reproduction of sound. Sound is a wave. It can be viewed as a physical
disturbance in the air (or some other medium) or as a pressure wave propagated by the
vibrations of molecules. A microphone is a device that senses sound and converts it to
an electrical wave, a voltage that varies continuously with time in the same way as the
sound. To convert this voltage into a format where it can be input into a computer,
stored, edited, and played back, the voltage is sampled many times each second. Each
audio sample is a number whose value is proportional to the voltage at the time of
sampling. Figure Intro.1, duplicated here, shows a wave sampled at three points in
time. It is obvious that the first sample is a small number and the third sample is a
large number, close to the maximum.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
228 6. Audio Compression
Amplitude
voltage would result in audio samples of zero and played back as silence. This is why
most ADC converters create 16-bit audio samples. Such a sample can have 2
16
=65,536
values, so it can distinguish sounds as low as 1/65,536 volt ≈ 15 microvolt (µv). Thus,
the sample size can be considered quantization of the original, analog, audio signal.
Eight-bit samples correspond to coarse quantization, while 16-bit samples lead to fine
quantization and thus to better quality of the played-back sound.
Audio sampling (or ADC) is also known as pulse-code-modulation (PCM), a term
often found in the professional literature.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
Prelude 229
Armed with this information, we can estimate the sizes of various audio files and
thereby show why audio compression is so important. A typical 3-minute song lasts
180 sec and results in 180 × 44,100 = 7,938,000 audio samples when it is digitized (for
stereo sound, the number of samples is double that). For 16-bit samples, this translates
to close to 16 Mb, bigger than most still images. A 30-minute symphony is ten times
longer, so it results in a 160 Mb file when digitized. Thus, audio files are much bigger
than text files and can easily be bigger than (raw) image files. Another point to consider
is that audio compression, similar to image compression, can be lossy and thus feature
large compression factors.
Exercise 6.1: It is a handy rule of thumb that an average book occupies about a million
bytes. Explain why this makes sense.
Approaches to audio compression. The problem of compressing an audio file
can be approached from various directions, because audio data has several sources of
redundancy. The discussion that follows concentrates on three common approaches.
The main source of redundancy in digital audio is the fact that adjacent audio
samples tend to be similar; they are correlated. With 44,100 samples each second, it is
no wonder that adjacent samples are virtually always similar. Audio data where many
audio samples are very different from their immediate neighbors would sound harsh and
The idea is to analyze the raw audio second by second, to identify those parts of the
audio to which the ear is not sensitive, and to heavily quantize the audio samples in
those parts.
This approach is the principle of the popular lossy mp3 method [Brandenburg and
Stoll 94]. The mp3 encoder is complex, because (1) it has to identify the frequencies
contained in the input sound at each point in time and (2) it has to decide which parts
of the original audio will not be heard by the ear. Recall that the input to the encoder is
a set of audio samples, not a sound wave. Thus, the encoder has to prepare overlapping
subsets of the samples and apply a Fourier transform to each subset to determine the
frequencies of the sound contained in it. The encoder also has to include a psychoacoustic
model in order to decide which sounds will not be heard by the ear. The decoder, in
contrast, is very simple.
This chapter starts with a detailed discussion of companding (Section 6.1), the
human auditory system (Section 6.2), and linear prediction (Section 6.3). This material
is followed by descriptions of three audio compression algorithms, µ-law and A-law
companding (Section 6.4) and Shorten (Section 6.5).
6.1 Companding
Companding (short for “compressing/expanding”) is a simple nonlinear technique based
on the experimental fact that the ear requires more precise samples at low amplitudes
(soft sounds), but is more forgiving at higher amplitudes. The typical ADC found in
many personal computers converts voltages to numbers linearly. If an amplitude a is
converted to the number n, then amplitude 2a will be converted to the number 2n.A
compression method based on companding, however, is nonlinear. It examines every
audio sample in the sound file, and employs a nonlinear relation to reduce the number
of bits devoted to it. For 16-bit samples, for example, a companding encoder may use a
formula as simple as
mapped = 32,767
2
sample
1,100 → 383
35
40,100 → 17,309
53
10,000 → 3,656
50,000 → 22,837
10,100 → 3,694
38
50,100 → 22,896
59
20,000 → 7,719
60,000 → 29,040
20,100 → 7,762
43
60,100 → 29,105
65
Table 6.1: 16-Bit Samples Mapped to 15-Bit Numbers.
Reducing 16-bit numbers to 15 bits doesn’t produce much compression. Better
results can be achieved by substituting a smaller number for 32,767 in equations (6.1)
and (6.2). A value of 127, for example, would map each 16-bit sample into an 8-bit
integer, yielding a compression ratio of 0.5. However, decoding would be less accurate.
A 16-bit sample of 60,100, for example, would be mapped into the 8-bit number 113,
but this number would produce 60,172 when decoded by Equation (6.2). Even worse,
the small 16-bit sample 1,000 would be mapped into 1.35, which has to be rounded to 1.
When Equation (6.2) is used to decode a 1, it produces 742, significantly different from
the original sample. The amount of compression should therefore be a user-controlled
parameter, and this is an interesting example of a compression method where the com-
pression ratio is known in advance!
In practice, there is no need to go through Equations (6.1) and (6.2), since the
mapping of all the samples can be prepared in advance in a table. Both encoding and
20
10
20
30
40
46810121416
(a)
dB
KHz
Frequency
20
10
20
30
40
46810121416
(b)
dB
KHz
Frequency
x
Figure 6.2: Threshold and Masking of Sound.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.