16
ICA with Overcomplete
Bases
A difficult problem in independent component analysis (ICA) is encountered if the
number of mixtures is smaller than the number of independent components .This
means that the mixing system is not invertible: We cannot obtain the independent
components (ICs) by simply inverting the mixing matrix . Therefore, even if
we knew the mixing matrix exactly, we could not recover the exact values of the
independent components. This is because information is lost in the mixing process.
This situation is often called ICA with overcomplete bases. This is because we
have in the ICA model
(16.1)
where the number of “basis vectors”, , is larger than the dimension of the space of
: thus this basis is “too large”, or overcomplete. Such a situation sometimes occurs
in feature extraction of images, for example.
As with noisy ICA, we actually have two different problems. First, how to estimate
the mixing matrix, and second, how to estimate the realizations of the independent
components. This is in stark contrast to ordinary ICA, where these two problems are
solved at the same time. This problem is similar to the noisy ICA in another respect
as well: It is much more difficult than the basic ICA problem, and the estimation
methods are less developed.
305
Independent Component Analysis. Aapo Hyv
¨
arinen, Juha Karhunen, Erkki Oja
Copyright
2001 John Wiley & Sons, Inc.
ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)
306
ICA WITH OVERCOMPLETE BASES
16.1 ESTIMATION OF THE INDEPENDENT COMPONENTS
not a very satisfactory solution in many cases.
In general, therefore, the estimator given by (16.4) can only be obtained by
numerical optimization. A gradient ascent method can be easily derived. One
case where the optimization is easier than usual is when the have a Laplacian
distribution:
(16.5)
ESTIMATION OF THE MIXING MATRIX
307
Ignoring uninteresting constants, we have
arg (16.6)
which can be formulated as a linear program and solved by classic methods for linear
programming [275].
16.1.2 The case of supergaussian components
Using a supergaussian distribution, such as the Laplacian distribution, is well justified
in feature extraction, where the components are supergaussian. Using the Laplacian
density also leads to an interesting phenomenon: The ML estimator gives coefficients
of which only are nonzero. Thus, only the minimum number of the components
are activated. Thus we obtain a sparse decomposition in the sense that the components
are quite often equal to zero.
It may seem at first glance that it is useless to try to estimate the ICs by ML
estimation, because they cannot be estimated exactly in any case. This is not so,
however; due to this phenomenon of sparsity, the ML estimation is very useful. In
the case where the independent components are very supergaussian, most of them
are very close to zero because of the large peak of the pdf at zero. (This is related to
the principle of sparse coding that will be treated in more detail in Section 21.2.)
Thus, those components that are not zero may not be very many, and the system
may be invertible for those components. If we first determine which components are
likely to be clearly nonzero, and then invert that part of the linear system, we may
be able to get quite accurate reconstructions of the ICs. This is done implicitly in
the ML estimation method. For example, assume that there are three speech signals
(16.9)
which means that the rows of form an orthonormal system. This orthonormality
could be enforced after every step of (16.8), for further stabilization.
16.2.2 Maximizing likelihood approximations
Maximization of the joint likelihood is a rather crude method of estimation. From
a Bayesian viewpoint, what we really want to maximize is the marginal posterior
probability of the mixing matrix. (For basic concepts of Bayesian estimation, see
Section 4.6.)
A more sophisticated form of maximum likelihood estimation can be obtained
by using a Laplace approximation of the posterior distribution of . This improves
the stability of the algorithm, and has been successfully used for estimation of
overcomplete bases from image data [274], as well as for separation of audio signals
[272]. For details on the Laplace approximation, see [275]. An alternative for the
Laplace approximation is provided by ensemble learning; see Section 17.5.1.
A promising direction of research is given by Monte Carlo methods. These are
a class of methods often used in Bayesian estimation, and are based on numerical
integration using stochastic algorithms. One method in this class, Gibbs sampling,
has been used in [338] for overcomplete basis estimation. Monte Carlo methods
typically give estimators with good statistical properties; the drawback is that they
are computationally very demanding.
Also, one could use an expectation-maximization (EM) algorithm [310, 19]. Using
gaussian mixtures as models for the distributions of the independent components, the
algorithm can be derived in analytical form. The problem is, however, that its
complexity grows exponentially with the dimension of , and thus it can only be used
,
,
ESTIMATION OF THE MIXING MAT RIX
309
in small dimensions. Suitable approximations of the algorithm might alleviate this
limitation [19].
We can also assume, for simplicity, that the data is prewhitened as a preprocessing
step, as in most ICA method in Part II. Then the independent components are simply
given by the dot-products of the whitened data vector with the basis vectors .
Due to the preceding considerations, we assume in our approach that what is
usually needed, is a collection of basis vectors that has the following two properties:
1. The dot-products of the observed data with the basis vectors have sparse
(supergaussian) marginal distributions.
2. The should be approximately uncorrelated (“quasiuncorrelated”). Equiva-
lently, the vectors should be approximately orthogonal (“quasiorthogonal”).
310
ICA WITH OVERCOMPLETE BASES
A decomposition with these two properties seems to capture the essential properties
of the decomposition obtained by estimation of the ICA model. Such decompositions
could be called sparse approximately uncorrelated decompositions.
It is clear that it is possible to find highly overcomplete basis sets that have the
first property of these two. Classic ICA estimation is usually based on maximizing
the sparseness (or, in general, nongaussianity) of the dot-products, so the existence
of several different classic ICA decompositions for a given image data set shows the
existence of decompositions with the first property.
What is not obvious, however, is that it is possible to find strongly overcomplete
decompositions such that the dot-products are approximately uncorrelated. The main
point here is that this is possible because of the phenomenon of quasiorthogonality.
Quasiorthogonality in high-dimensional spaces
Quasiorthogonality [247]
is a somewhat counterintuitive phenomenon encountered in very high-dimensional
spaces. In a certain sense, there is much more room for vectors in high-dimensional
spaces. The point is that in an
-dimensional space, where is large, it is possible
to have (say) vectors that are practically orthogonal, i.e., their angles are close to
90 degrees. In fact, when grows, the angles can be made arbitrarily close to 90
where
is a constant determining the force of quasiorthogonalization. If ,we
have ordinary, perfect orthogonalization. We have found in our experiments that an
in the range is sufficient in spaces where the dimension is 64.
In certain applications it may be desirable to use a symmetric version of quasi-
orthogonalization, in which no vectors are “privileged” over others [210, 197]. This
can be accomplished, for example, by the following algorithm:
1.
2. Normalize each column of to unit norm
(16.11)
which is closely related to the iterative symmetric orthogonalization method used for
basic ICA in Section 8.4.3. The present algorithm is simply doing one iteration of the
iterative algorithm. In some cases, it may be necessary to do two or more iterations,
although in the experiments below, just one iteration was sufficient.
Thus, the algorithm that we propose is similar to the FastICA algorithm as de-
scribed, e.g. in Section 8.3.5 in all other respects than the orthogonalization, which
is replaced by one of the preceding quasiorthogonalization methods.
Experiments with over complete image bases
We applied our algorithm on
image windows (patches) of pixels taken from natural images. Thus, we used
ICA for feature extraction as explained in detail in Chapter 21.
The mean of the image window (DC component) was removed as a preprocess-
ing step, so the dimension of the data was 63. Both deflationary and symmetric
quasiorthogonalization were used. The nonlinearity used in the FastICA algorithm
was the hyperbolic tangent. Fig. 16.1 shows an estimated approximately 4 times
overcomplete basis (with 240 components). The sample size was 14000. The results
shown here were obtained using the symmetric approach; the deflationary approach
yielded similar results, with the parameter fixed at .
The results show that the estimated basis vectors are qualitatively quite similar
to those obtained by other, computationally more expensive methods [274]; they
To obtain computationally efficient algorithms, strong approximations are necessary.
For example, one can use a modification of the FastICA algorithm that is based on
finding a quasidecorrelating sparse decomposition. This algorithm is computationally
very efficient, reducing the complexity of overcomplete basis estimation to that of
classic ICA estimation.