Document: Lesson 10: ICA by Minimization of Mutual Information - PDF 92

10 ICA by Minimization of Mutual Information
An important approach for independent component analysis (ICA) estimation,
inspired by information theory, is minimization of mutual information.
The motivation of this approach is that it may not be very realistic in many cases
to assume that the data follows the ICA model. Therefore, we would like to develop
an approach that does not assume anything about the data. What we want to have
is a general-purpose measure of the dependence of the components of a random
vector. Using such a measure, we could define ICA as a linear decomposition that
minimizes that dependence measure. Such an approach can be developed using
mutual information, which is a well-motivated information-theoretic measure of
statistical dependence.
One of the main utilities of mutual information is that it serves as a unifying
framework for many estimation principles, in particular maximum likelihood (ML)
estimation and maximization of nongaussianity. Notably, this approach gives a
rigorous justification for the heuristic principle of nongaussianity.
10.1 DEFINING ICA BY MUTUAL INFORMATION
10.1.1 Information-theoretic concepts
The information-theoretic concepts needed in this chapter were explained in
Chapter 5. Readers not familiar with information theory are advised to read that
chapter before this one.
Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja
Copyright © 2001 John Wiley & Sons, Inc.
ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)

vectors. Mutual information $I$ between $m$ (scalar) random variables $y_i$, $i = 1, \ldots, m$, is
defined as follows:
$$I(y_1, y_2, \ldots, y_m) = \sum_{i=1}^{m} H(y_i) - H(\mathbf{y}) \tag{10.3}$$
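As a small illustration of definition (10.3), the following sketch evaluates it for a Gaussian vector, where every entropy has a closed form. The Gaussian assumption, the function names, and the correlation value `rho` are ours, chosen only because the result can be checked against the known value $-\frac{1}{2}\log(1-\rho^2)$ for two unit-variance Gaussian variables.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (nats) of a zero-mean Gaussian with covariance `cov`."""
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    n = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (n * np.log(2.0 * np.pi * np.e) + logdet)

def gaussian_mutual_information(cov):
    """Evaluate sum_i H(y_i) - H(y), as in (10.3), for a Gaussian vector y."""
    cov = np.asarray(cov, dtype=float)
    marginal = sum(gaussian_entropy(cov[i, i]) for i in range(cov.shape[0]))
    return marginal - gaussian_entropy(cov)

rho = 0.8  # illustrative correlation, not taken from the text
cov_dep = np.array([[1.0, rho], [rho, 1.0]])
mi_dep = gaussian_mutual_information(cov_dep)    # correlated components
mi_ind = gaussian_mutual_information(np.eye(2))  # independent components
```

In this sketch `mi_ind` comes out as zero and `mi_dep` as a positive number, in line with mutual information vanishing only under independence.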
10.1.2 Mutual information as measure of dependence
We have seen earlier (Chapter 5) that mutual information is a natural measure of the
dependence between random variables. It is always nonnegative, and zero if and only
if the variables are statistically independent. Mutual information takes into account
the whole dependence structure of the variables, and not just the covariance, as
principal component analysis (PCA) and related methods do.
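The difference between covariance and mutual information can be made concrete with a toy case. The sketch below is a discrete analogue that we introduce for illustration (the chapter itself works with continuous variables and differential entropy): $X$ is uniform on $\{-1, 0, 1\}$ and $Y = X^2$, so the two variables are uncorrelated yet clearly dependent.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a pmf given as an array; zero cells are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Joint pmf of (X, Y) with X uniform on {-1, 0, 1} and Y = X**2.
x_vals = np.array([-1.0, 0.0, 1.0])
y_vals = np.array([0.0, 1.0])
joint = np.array([[0.0, 1 / 3],   # X = -1 -> Y = 1
                  [1 / 3, 0.0],   # X =  0 -> Y = 0
                  [0.0, 1 / 3]])  # X =  1 -> Y = 1

px, py = joint.sum(axis=1), joint.sum(axis=0)
mean_x, mean_y = px @ x_vals, py @ y_vals
covariance = x_vals @ joint @ y_vals - mean_x * mean_y

# Mutual information as sum of marginal entropies minus joint entropy.
mutual_info = entropy(px) + entropy(py) - entropy(joint)
```

Here `covariance` is zero while `mutual_info` is strictly positive, so a method that looks only at second-order statistics would miss this dependence.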

$$I(y_1, y_2, \ldots, y_n) = \sum_i H(y_i) - H(\mathbf{x}) - \log|\det \mathbf{B}| \tag{10.5}$$
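Now, let us consider what happens if we constrain the $y_i$ to be uncorrelated and of
unit variance. This means $E\{\mathbf{y}\mathbf{y}^T\} = \mathbf{B} E\{\mathbf{x}\mathbf{x}^T\} \mathbf{B}^T = \mathbf{I}$, which implies
$$\det \mathbf{I} = 1 = \det(\mathbf{B} E\{\mathbf{x}\mathbf{x}^T\} \mathbf{B}^T) = (\det \mathbf{B})(\det E\{\mathbf{x}\mathbf{x}^T\})(\det \mathbf{B}^T)$$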
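Since $\det \mathbf{B}^T = \det \mathbf{B}$, the identity above gives $(\det \mathbf{B})^2 \det E\{\mathbf{x}\mathbf{x}^T\} = 1$, so $|\det \mathbf{B}|$ is the same for every $\mathbf{B}$ that satisfies the constraint. The following numerical check is a sketch under our own assumptions: the covariance matrix `C`, the whitening matrix `V`, and the helper `random_orthogonal` are hypothetical and serve only to generate several admissible matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariance E{x x^T} of the observed data (symmetric positive definite).
A = rng.normal(size=(3, 3))
C = A @ A.T + 3.0 * np.eye(3)

# One whitening matrix: V = C^{-1/2}, computed from the eigendecomposition of C.
eigvals, eigvecs = np.linalg.eigh(C)
V = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return Q

# Every B = U V with U orthogonal also satisfies B C B^T = I.
abs_dets = []
for _ in range(5):
    B = random_orthogonal(3, rng) @ V
    assert np.allclose(B @ C @ B.T, np.eye(3))
    abs_dets.append(abs(np.linalg.det(B)))

expected = np.linalg.det(C) ** -0.5  # |det B| = (det E{x x^T})^{-1/2}
```

All entries of `abs_dets` agree with `expected`, which is why the $\log|\det \mathbf{B}|$ term in (10.5) behaves as a constant under the decorrelation constraint.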

where the constant term does not depend on $\mathbf{B}$. This shows the fundamental relation
between negentropy and mutual information.
We see in (10.7) that finding an invertible linear transformation $\mathbf{B}$ that minimizes
the mutual information is roughly equivalent to finding directions in which the
negentropy is maximized. We have seen previously that negentropy is a measure of
nongaussianity. Thus, (10.7) shows that ICA estimation by minimization of mutual
information is equivalent to maximizing the sum of nongaussianities of the estimates of
the independent components, when the estimates are constrained to be uncorrelated.
Thus, we see that the formulation of ICA as minimization of mutual information
gives another rigorous justification of our more heuristically introduced idea of finding
maximally nongaussian directions, as used in Chapter 8.
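The equivalence can be tried out in two dimensions, where every orthogonal $\mathbf{B}$ acting on white data is a rotation. The sketch below makes several assumptions of our own: the sources are uniform, the mixing is a rotation by a hypothetical angle `true_angle` (so the mixtures are already white), and negentropy is replaced by the simple kurtosis-based proxy $\mathrm{kurt}(y)^2/48$. It then searches a grid of angles for the one that maximizes the sum of the proxies.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 50_000

# Two independent, unit-variance, uniform (subgaussian) sources.
S = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(2, n_samples))

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

true_angle = 0.5                  # hypothetical mixing angle (radians)
X = rotation(true_angle) @ S      # mixtures stay white because the mixing is orthogonal

def negentropy_proxy(y):
    """Kurtosis-based proxy kurt(y)^2 / 48 for zero-mean, unit-variance y."""
    kurt = np.mean(y ** 4) - 3.0
    return kurt ** 2 / 48.0

def sum_of_proxies(theta):
    Y = rotation(theta).T @ X     # orthogonal B keeps the estimates uncorrelated
    return sum(negentropy_proxy(y) for y in Y)

# The criterion has period pi/2 in theta, so one quarter turn is enough.
angles = np.linspace(0.0, np.pi / 2, 181, endpoint=False)
best_angle = angles[np.argmax([sum_of_proxies(t) for t in angles])]
```

With these settings `best_angle` lands close to `true_angle`, that is, the rotation with the largest total nongaussianity is the one that undoes the mixing. A grid search is used only for transparency; it is not one of the algorithms of Chapter 8.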
In practice, however, there are also some important differences between these two
criteria.
1. Negentropy, and other measures of nongaussianity, enable the deflationary, i.e.,
one-by-one, estimation of the independent components, since we can look for
the maxima of nongaussianity of a single projection $\mathbf{b}^T \mathbf{x}$. This is not possible
with mutual information or most other criteria, like the likelihood.
2. A smaller difference is that in using nongaussianity, we force the estimates of
the independent components to be uncorrelated. This is not necessary when
using mutual information, because we could use the form in (10.5) directly,
as will be seen in the next section. Thus the optimization space is slightly
reduced when nongaussianity is used.

$\sum_i H(\mathbf{b}_i^T \mathbf{x})$. Thus the likelihood would be equal, up to an additive constant given
by the total entropy of $\mathbf{x}$, to the negative of mutual information as given in Eq. (10.5).
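The step can be written out as a short derivation. It is a sketch that assumes the usual form of the expected log-likelihood per sample for the noise-free ICA model with assumed densities $p_i$ and sample size $T$ (this notation is our assumption here), and that each $p_i$ equals the actual pdf of $\mathbf{b}_i^T \mathbf{x}$, so that $E\{\log p_i(\mathbf{b}_i^T \mathbf{x})\} = -H(\mathbf{b}_i^T \mathbf{x})$.

```latex
\begin{align*}
\frac{1}{T} E\{\log L(\mathbf{B})\}
  &= \sum_i E\{\log p_i(\mathbf{b}_i^T \mathbf{x})\} + \log|\det \mathbf{B}| \\
  &= -\sum_i H(\mathbf{b}_i^T \mathbf{x}) + \log|\det \mathbf{B}| \\
  &= -\Bigl[\sum_i H(y_i) - H(\mathbf{x}) - \log|\det \mathbf{B}|\Bigr] - H(\mathbf{x}) \\
  &= -I(y_1, y_2, \ldots, y_n) - H(\mathbf{x})
\end{align*}
```

The last line uses Eq. (10.5) with $y_i = \mathbf{b}_i^T \mathbf{x}$. Since $H(\mathbf{x})$ does not depend on $\mathbf{B}$, maximizing this likelihood and minimizing mutual information select the same $\mathbf{B}$ under the stated assumption.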
In practice, the connection may be just as strong, or even stronger, because
we do not know the distributions of the independent components that are
needed in ML estimation. A reasonable approach would be to estimate the density
of $\mathbf{b}_i^T \mathbf{x}$ as part of the ML estimation method, and use this as an approximation of the
density of $s_i$. This is what we did in Chapter 9. Then, the $p_i$ in this approximation
of likelihood are indeed equal to the actual pdf's of $\mathbf{b}_i^T \mathbf{x}$

$$\ldots{}_i)\} - \log|\det \mathbf{B}| - H(\mathbf{x}) \tag{10.9}$$
Now we see that this approximation is equal to the approximation of the likelihood
used in Chapter 9 (except, again, for the global sign and the additive constant given by
$H(\mathbf{x})$). This also gives an alternative method of approximating mutual information
that is different from the approximation that uses the negentropy approximations.
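A minimal sketch of this likelihood-style approximation follows. It assumes that each marginal entropy $H(y_i)$ is replaced by $-E\{G(y_i)\}$, where $G$ is a fixed log-density; the unit-variance Laplacian log-density, the mixing matrix `A`, and all function names are our own illustrative choices, not taken from the text. The constant $-H(\mathbf{x})$ is dropped because it does not depend on $\mathbf{B}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 50_000

# Two independent unit-variance Laplacian (supergaussian) sources.
S = rng.laplace(scale=1.0 / np.sqrt(2.0), size=(2, n_samples))

def log_laplace_pdf(y):
    """Assumed fixed log-density G(y) = log p(y) of a unit-variance Laplacian."""
    return -np.sqrt(2.0) * np.abs(y) - 0.5 * np.log(2.0)

def mi_up_to_constant(B, X, G=log_laplace_pdf):
    """-sum_i E{G(y_i)} - log|det B| with y = B x; the constant -H(x) is dropped."""
    Y = B @ X
    _, logabsdet = np.linalg.slogdet(B)
    return -np.sum(np.mean(G(Y), axis=1)) - logabsdet

A = np.array([[1.0, 0.5], [0.3, 1.0]])   # hypothetical mixing matrix
X = A @ S
B_true = np.linalg.inv(A)
rot45 = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2.0)

score_true = mi_up_to_constant(B_true, X)
score_rotated = mi_up_to_constant(rot45 @ B_true, X)  # remixes the sources
score_identity = mi_up_to_constant(np.eye(2), X)      # no unmixing at all
```

In this setting the approximated mutual information is smallest at the true unmixing matrix, so `score_true` is below both alternatives. The comparison is only meaningful when the assumed density is a reasonable match for the sources, which is the situation discussed in Chapter 9.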
10.4 ALGORITHMS FOR MINIMIZATION OF MUTUAL INFORMATION
To use mutual information in practice, we need some method of estimating or
approximating it from real data. Earlier, we saw two methods for approximating mutual
information. The first one was based on the negentropy approximations introduced in
Section 5.6. The second one was based on using more or less fixed approximations
for the densities of the ICs in Chapter 9.
Thus, using mutual information leads essentially to the same algorithms as used for
maximization of nongaussianity in Chapter 8, or for maximum likelihood estimation
in Chapter 9. In the case of maximization of nongaussianity, the corresponding
algorithms are those that use symmetric orthogonalization, since we are maximizing
the sum of nongaussianities, so that no order exists between the components. Thus,
we do not present any new algorithms in this chapter; the reader is referred to the two
preceding chapters.
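For reference, the symmetric orthogonalization step mentioned above can be sketched as $\mathbf{W} \leftarrow (\mathbf{W}\mathbf{W}^T)^{-1/2}\mathbf{W}$. The code below is only that single step, not a full ICA algorithm, and the function name and the random starting matrix are our own choices.

```python
import numpy as np

def symmetric_orthogonalize(W):
    """Return (W W^T)^{-1/2} W, which has orthonormal rows."""
    eigvals, eigvecs = np.linalg.eigh(W @ W.T)
    inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return inv_sqrt @ W

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))          # arbitrary full-rank starting matrix
W_orth = symmetric_orthogonalize(W)

# Reordering the rows of W only reorders the rows of the result.
perm = [2, 0, 1]
W_orth_perm = symmetric_orthogonalize(W[perm])
```

Unlike deflationary (Gram-Schmidt-like) orthogonalization, this step treats all rows alike: `W_orth_perm` equals `W_orth[perm]`, so no component is privileged by its position, which matches the criterion being a sum over all components.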
10.5 EXAMPLES

The corresponding results for two supergaussian independent components are shown
in Fig. 10.2. Convergence was obtained after three iterations, after which mutual
information was practically zero.
10.6 CONCLUDING REMARKS AND REFERENCES
A rigorous approach to ICA that is different from the maximum likelihood approach
is given by minimization of mutual information. Mutual information is a natural
information-theoretic measure of dependence, and therefore it is natural to estimate
the independent components by minimizing the mutual information of their estimates.
Mutual information gives a rigorous justification of the principle of searching for
maximally nongaussian directions, and in the end turns out to be very similar to the
likelihood as well.
Mutual information can be approximated by the same methods that are used to
approximate negentropy. Alternatively, it can be approximated in the same way as likelihood.
