Tài liệu Bài 6: Principal Component Analysis and Whitening - Pdf 92

6
Principal Component
Analysis and Whitening
Principal component analysis (PCA) and the closely related Karhunen-Lo
`
eve trans-
form, or the Hotelling transform, are classic techniques in statistical data analysis,
feature extraction, and data compression, stemming from the early work of Pearson
[364]. Given a set of multivariate measurements, the purpose is to find a smaller set of
variables with less redundancy, that would give as good a representation as possible.
This goal is related to the goal of independent component analysis (ICA). However,
in PCA the redundancy is measured by correlations between data elements, while
in ICA the much richer concept of independence is used, and in ICA the reduction
of the number of variables is given less emphasis. Using only the correlations as in
PCA has the advantage that the analysis can be based on second-order statistics only.
In connection with ICA, PCA is a useful preprocessing step.
The basic PCA problem is outlined in this chapter. Both the closed-form solution
and on-line learning algorithms for PCA are reviewed. Next, the related linear
statistical technique of factor analysis is discussed. The chapter is concluded by
presenting how data can be preprocessed by whitening, removing the effect of first-
and second-order statistics, which is very helpful as the first step in ICA.
6.1 PRINCIPAL COMPONENTS
The starting point for PCA is a random vector
x
with
n
elements. There is available
a sample
x(1) ::: x(T )
from this random vector. No explicit assumptions on the
probability density of the vectors are made in PCA, as long as the first- and second-

(see
Chapter 4). Let us assume in the following that the centering has been done and thus
E
fxg = 0
.Next,
x
is linearly transformed to another vector
y
with
m
elements,
m < n
, so that the redundancy induced by the correlations is removed. This is
done by finding a rotated orthogonal coordinate system such that the elements of
x
in the new coordinates become uncorrelated. At the same time, the variances of
the projections of
x
on the new coordinate axes are maximized so that the first axis
corresponds to the maximal variance, the second axis corresponds to the maximal
variance in the direction orthogonal to the first axis, and so on.
For instance, if
x
has a gaussian density that is constant over ellipsoidal surfaces
in the
n
-dimensional space, then the rotated coordinate system coincides with the
principal axes of the ellipsoid. A two-dimensional example is shown in Fig. 2.7 in
Chapter 2. The principal components are now the projections of the data points on the
two principal axes,

image window can still
be reconstructed from it. This kind of compression is possible because neighboring
elements of
x
, which are the gray levels of neighboring pixels in the digital image,
are heavily correlated. These correlations are utilized by PCA, allowing almost the
same information to be represented by a much smaller vector
y
. PCA is a linear
technique, so computing
y
from
x
is not heavy, which makes real-time processing
possible.
PRINCIPAL COMPONENTS
127
6.1.1 PCA by variance maximization
In mathematical terms, consider a linear combination
y
1
=
n
X
k=1
w
k1
x
k
= w

y
1
is called the first principal component of
x
, if the variance of
y
1
is
maximally large. Because the variance depends on both the norm and orientation of
the weight vector
w
1
and grows without limits as the norm grows, we impose the
constraint that the norm of
w
1
is constant, in practice equal to 1. Thus we look for a
weight vector
w
1
maximizing the PCA criterion
J
PCA
1
(w
1
)=
E
fy
2

f:g
is the expectation over the (unknown) density of input vector
x
,andthe
norm of
w
1
is the usual Euclidean norm defined as
kw
1
k =(w
T
1
w
1
)
1=2
=
n
X
k=1
w
2
k1
]
1=2
The matrix
C
x
in Eq. (6.1) is the

n
satisfy
d
1
 d
2
 :::  d
n
. The solution maximizing (6.1) is
given by
w
1
= e
1
Thus the first principal component of
x
is
y
1
= e
T
1
x
.
The criterion
J
PCA
1
in eq. (6.1) can be generalized to
m

fy
m
y
k
g =0 k<m:
(6.4)
Note that the principal components
y
m
have zero means because
E
fy
m
g = w
T
m
E
fxg =0
128
PRINCIPAL COMPONENT ANALYSIS AND WHITENING
The condition (6.4) yields:
E
fy
m
y
k
g =
E
f(w
T

1
= e
1
. We are thus looking for maximal variance
E
fy
2
2
g =
E
f(w
T
2
x)
2
g
in the subspace orthogonal to the first eigenvector of
C
x
.The
solution is given by
w
2
= e
2
Likewise, recursively it follows that
w
k
= e
k

with maximal variance, under the constraints that the weights
are normalized and the principal components are uncorrelated with each other. It
turns out that this is strongly related to minimum mean-square error compression
of
x
, which is another way to pose the PCA problem. Let us search for a set of
m
orthonormal basis vectors, spanning an
m
-dimensional subspace, such that the mean-
square error between
x
and its projection on the subspace is minimal. Denoting again
the basis vectors by
w
1
:::w
m
, for which we assume
w
T
i
w
j
= 
ij
the projection of
x
on the subspace spanned by them is
P

2
g
(6.7)
It is easy to show (see exercises) that due to the orthogonality of the vectors
w
i
,this
criterion can be further written as
J
PCA
MSE
=
E
fkxk
2
g
E
f
m
X
j =1
(w
T
j
x)
2
g
(6.8)
=
trace

the same optimal compression. While this ambiguity can be seen as a disadvantage,
it should be noted that there may be some other criteria by which a certain basis in
the PCA subspace is to be preferred over others. Independent component analysis is
a prime example of methods in which PCA is a useful preprocessing step, but once
the vector
x
has been expressed in terms of the first
m
eigenvectors, a further rotation
brings out the much more useful independent components.
It can also be shown [112] that the value of the minimum mean-square error of
(6.7) is
J
PCA
MSE
=
n
X
i=m+1
d
i
(6.10)
the sum of the eigenvalues corresponding to the discarded eigenvectors
e
m+1
:::e
n
.
If the orthonormality constraint is simply changed to
w

E
fe
T
m
xx
T
e
m
g = e
T
m
C
x
e
m
= d
m
(6.12)
The variances of the principal components are thus directly given by the eigenvalues
of
C
x
. Note that, because the principal components have zero means, a small eigen-
value (a small variance)
d
m
indicates that the value of the corresponding principal
component
y
m

m = n
or all
the principal components are included. A very important practical problem is how to
130
PRINCIPAL COMPONENT ANALYSIS AND WHITENING
choose
m
in (6.13); this is a trade-off between error and the amount of data needed
for the expansion. Sometimes a rather small number of principal components are
sufficient.
Fig. 6.1
Leftmost column: some digital images in a
32  32
grid. Second column: means
of the samples. Remaining columns: reconstructions by PCA when 1, 2, 5, 16, 32, and 64
principal components were used in the expansion.
Example 6.1 In digital image processing, the amount of data is typically very large,
and data compression is necessary for storage, transmission, and feature extraction.
PCA is a simple and efficient method. Fig. 6.1 shows 10 handwritten characters that
were represented as binary
32  32
matrices (left column) [183]. Such images, when
scanned row by row, can be represented as 1024-dimensional vectors. For each of the
10 character classes, about 1700 handwritten samples were collected, and the sample
means and covariance matrices were computed by standard estimation methods. The
covariance matrices were
1024  1024
matrices. For each class, the first 64 principal
component vectors or eigenvectors of the covariance matrix were computed. The
second column in Fig. 6.1 shows the sample means, and the other columns show the

X
i=1
a
i
s
i
+ n
(6.14)
where
m<n
.There
a
i
are some fixed vectors and the coefficients
s
i
are random
numbers that are zero mean and uncorrelated. We can assume that their variances
have been absorbed in vectors
a
i
so that they have unit variances. The term
n
is
white noise, for which E
fnn
T
g = 
2
I

i=1
a
i
a
T
i
, added by the constant

2
.
But the matrix
P
m
i=1
a
i
a
T
i
has at most
m
nonzero eigenvalues, and these correspond
to eigenvectors that span the signal subspace. When the eigenvalues of
C
x
are
computed, the first
m
form a decreasing sequence and the rest are small constants,
equal to

of the sample
x(1):::x(T )
and on the eigenvalues
d
1
:::d
n
of the matrix
C
x
. Finding the minimum point gives a good value for
m
.
6.1.4 Closed-form computation of PCA
To use the closed-form solution
w
i
= e
i
given earlier for the PCA basis vectors, the
eigenvectors of the covariance matrix
C
x
must be known. In the conventional use of
132
PRINCIPAL COMPONENT ANALYSIS AND WHITENING
PCA, there is a sufficiently large sample of vectors
x
available, from which the mean
and the covariance matrix

the discrete cosine transform [154].
6.2 PCA BY ON-LINE LEARNING
Another alternative is to derive gradient ascent algorithms or other on-line methods
for the preceding maximization problems. The algorithms will then converge to the
solutions of the problems, that is, to the eigenvectors. The advantage of this approach
is that such algorithms work on-line, using each input vector
x(t)
once as it becomes
available and making an incremental change to the eigenvector estimates, without
computing the covariance matrix at all. This approach is the basis of the PCA neural
network learning rules.
Neural networks provide a novel way for parallel on-line computation of the PCA
expansion. The PCA network [326] is a layer of parallel linear artificial neurons
shown in Fig. 6.2. The output of the
i
th unit (
i =1:::m
)is
y
i
= w
T
i
x
, with
x
denoting the
n
-dimensional input vector of the network and
w


Nhờ tải bản gốc

Tài liệu, ebook tham khảo khác

Music ♫

Copyright: Tài liệu đại học © DMCA.com Protection Status