Independent Component Analysis - Chapter 20: Other Extensions

20
Other Extensions
In this chapter, we present some additional extensions of the basic independent
component analysis (ICA) model. First, we discuss the use of prior information
on the mixing matrix, especially on its sparseness. Second, we present models that
somewhat relax the assumption of the independence of the components. In the model
called independent subspace analysis, the components are divided into subspaces that
are independent, but the components inside the subspaces are not independent. In the
model of topographic ICA, higher-order dependencies are modeled by a topographic
organization. Finally, we show how to adapt some of the basic ICA algorithms to the
case where the data is complex-valued instead of real-valued.
20.1 PRIORS ON THE MIXING MATRIX
20.1.1 Motivation for prior information
No prior knowledge on the mixing matrix is used in the basic ICA model. This has the
advantage of giving the model great generality. In many application areas, however,
information on the form of the mixing matrix is available. Using prior information on
the mixing matrix is likely to give better estimates of the matrix for a given number
of data points. This is of great importance in situations where the computational
costs of ICA estimation are so high that they severely restrict the amount of data that
can be used, as well as in situations where the amount of data is restricted due to the
nature of the application.
Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja
Copyright © 2001 John Wiley & Sons, Inc.
ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)
This situation can be compared to that found in nonlinear regression, where

entries. Thus this form of prior is analogous to the widely-used prior knowledge on
the supergaussianity or sparseness of the independent components. In fact, due to this
similarity, sparse priors are so-called conjugate priors, which implies that estimation
using this kind of prior is particularly easy: ordinary ICA methods can be simply
adapted to use such priors.
20.1.2 Classic priors
In the following, we assume that the estimator $\mathbf{B}$ of the inverse of the mixing matrix
$\mathbf{A}$ is constrained so that the estimates of the independent components $\mathbf{y} = \mathbf{B}\mathbf{x}$ are
white, i.e., decorrelated and of unit variance: $E\{\mathbf{y}\mathbf{y}^T\} = \mathbf{I}$. This restriction greatly
facilitates the analysis. It is basically equivalent to first whitening the data and then
restricting $\mathbf{B}$ to be orthogonal, but here we do not want to restrict the generality of
these results by whitening. We concentrate here on formulating priors for $\mathbf{B} = \mathbf{A}^{-1}$.
Completely analogous results hold for priors on $\mathbf{A}$.
Jeffreys’ prior
The classic prior in Bayesian inference is Jeffreys’ prior. It
is considered a maximally uninformative prior, which already indicates that it is
probably not useful for our purpose.
Indeed, it was shown in [342] that Jeffreys’ prior for the basic ICA model has the
form:
$$p(\mathbf{B}) \propto |\det \mathbf{B}^{-1}| \qquad (20.1)$$
Now, the constraint of whiteness of the $\mathbf{y} = \mathbf{B}\mathbf{x}$ means that $\mathbf{B}$ can be expressed
as $\mathbf{B} = \mathbf{W}\mathbf{V}$, where $\mathbf{V}$ is a constant whitening matrix, and $\mathbf{W}$ is restricted to
be orthogonal. But we have $|\det \mathbf{B}| = |\det \mathbf{W}|\,|\det \mathbf{V}| = |\det \mathbf{V}|$, which implies that
Jeffreys’ prior is constant in the space of allowed estimators (i.e., decorrelating $\mathbf{B}$).
Thus we see that Jeffreys’ prior has no effect on the estimator, and therefore cannot
reduce overlearning.
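The constancy argument can be checked numerically. The following is a minimal sketch, assuming only that the prior depends on the estimator through the absolute value of its determinant and that every allowed estimator is the product of an orthogonal matrix and a fixed whitening matrix. The 2x2 whitening matrix `V` and the helper names are hypothetical, chosen for illustration.

```python
import math

def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def rotation(theta):
    """A 2x2 orthogonal matrix (rotation by the angle theta)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

# Hypothetical whitening matrix; any fixed invertible matrix works here.
V = [[2.0, 0.5], [0.0, 0.25]]

# |det(W V)| for several orthogonal W: the value does not depend on W,
# so a prior that is a function of |det B| is flat over the allowed estimators.
abs_dets = [abs(det2(matmul(rotation(t), V))) for t in (0.0, 0.3, 1.2, 2.5)]
print(abs_dets)
```

Every entry of `abs_dets` equals `abs(det2(V))` up to rounding, which is the sense in which such a prior cannot favor one decorrelating estimator over another.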
Quadratic priors

In other words, the quadratic prior is constant. The same result can be proven for a
quadratic prior on $\mathbf{A}$. Thus, quadratic priors are of little interest in ICA.
20.1.3 Sparse priors
Motivation
A much more satisfactory class of priors is given by what we call
sparse priors. This means that the prior information says that most of the elements
of each row of $\mathbf{B}$ are zero; thus their distribution is supergaussian or sparse. The
motivation for considering sparse priors is both empirical and algorithmic.
Empirically, it has been observed in feature extraction of images (see Chapter 21)
that the obtained ﬁlters tend to be localized in space. This implies that the distribution
of the elements of the ﬁlter tends to be sparse, i.e., most elements are practically
zero. A similar phenomenon can be seen in analysis of magnetoencephalography,
where each source signal is usually captured by a limited number of sensors. This is
due to the spatial localization of the sources and the sensors.
The algorithmic appeal of sparsifying priors, on the other hand, is based on the
fact that sparse priors can be made to be conjugate priors (see below for deﬁnition).
This is a special class of priors, and means that estimation of the model using this
prior requires only very simple modiﬁcations in ordinary ICA algorithms.
Another motivation for sparse priors is their neural interpretation. Biological
neural networks are known to be sparsely connected, i.e., only a small proportion
of all possible connections between neurons are actually used. This is exactly what
sparse priors model. This interpretation is especially interesting when ICA is used in
modeling of the visual cortex (Chapter 21).
Measuring sparsity
The sparsity of a random variable, say $s$, can be measured by
expectations of the form $E\{G(s)\}$, where $G$ is a nonquadratic function, for example,
the following
$$G(s) = -|s| \qquad (20.6)$$
The use of such measures requires that the variance of $s$ is normalized to a fixed

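$$\log p(\mathbf{W} \mid \mathbf{z}(1), \ldots, \mathbf{z}(T)) = \sum_{i=1}^{n} \Big[ \sum_{t=1}^{T} G(\mathbf{w}_i^T \mathbf{z}(t)) + \sum_{j=1}^{n} G(\mathbf{w}_i^T \mathbf{e}_j) \Big] + \text{const} \qquad (20.9)$$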
This form shows that the posterior distribution has the same form as the prior
distribution (and, in fact, the original likelihood). Priors with this property are called
conjugate priors in Bayesian theory. The usefulness of conjugate priors resides in the
property that the prior can be considered to correspond to a “virtual” sample. The
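posterior distribution in (20.9) has the same form as the likelihood of a sample of size
$T + n$, which consists of both the observed $\mathbf{z}(t)$ and the canonical basis vectors $\mathbf{e}_i$.
In other words, the posterior in (20.9) is the likelihood of the augmented (whitened)
data sample
$$\tilde{\mathbf{z}}(t) = \begin{cases} \mathbf{z}(t), & \text{if } 1 \le t \le T \\ \mathbf{e}_{t-T}, & \text{if } T < t \le T + n \end{cases} \qquad (20.10)$$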
Thus, using conjugate priors has the additional beneﬁt that we can use exactly the
same algorithm for maximization of the posterior as in ordinary maximum likelihood
estimation of ICA. All we need to do is to add this virtual sample to the data; the
virtual sample is of the same size as the dimension of the data.
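The virtual-sample idea can be sketched in a few lines. The code below is a minimal illustration, assuming a Laplacian log-density $G(s) = -|s|$ for both the components and the prior; the toy data, the function names, and the choice of the identity as separating matrix are hypothetical. It appends the canonical basis vectors to the whitened sample and evaluates the resulting log-posterior (up to an additive constant) as an ordinary log-likelihood of the augmented sample.

```python
def G(s):
    """Assumed log-density of a sparse (Laplacian) variable, up to a constant."""
    return -abs(s)

def augment_with_virtual_sample(z):
    """Append the n canonical basis vectors to a list of n-dimensional samples."""
    n = len(z[0])
    basis = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
    return [list(v) for v in z] + basis

def log_posterior(w_rows, z):
    """Sum of G(w_i^T z~(t)) over all rows w_i and all augmented samples z~(t)."""
    augmented = augment_with_virtual_sample(z)
    return sum(G(sum(wi * vi for wi, vi in zip(w, v)))
               for w in w_rows for v in augmented)

# Toy whitened sample with T = 3 points in n = 2 dimensions (made-up numbers).
z = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
W = [[1.0, 0.0], [0.0, 1.0]]

print(len(augment_with_virtual_sample(z)))  # T + n points
print(log_posterior(W, z))
```

Because the prior enters only through the extra data points, any maximum likelihood routine that accepts the original sample can be run unchanged on the augmented one.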
For experiments using sparse priors in image feature extraction, see [209].
Modifying prior strength
The conjugate priors given above can be generalized
by considering a family of supergaussian priors given by
$$\log p(\mathbf{W}) = \sum_{i} \sum_{j} \alpha\, G(\mathbf{w}_i^T \mathbf{e}_j) + \text{const} \qquad (20.11)$$
Using this kind of prior means that the virtual sample points are weighted by some
parameter $\alpha$. This parameter expresses the degree of belief that we have in the prior.
A large $\alpha$ means that the belief in the prior is strong. Also, the parameter $\alpha$ could
be different for different $i$, but this seems less useful here. The posterior distribution
then has the form:

does stay constant. This means that the sparsity measure is now measuring rather the
global sparsity of the whole matrix, instead of the sparsities of individual rows.
In practice, one usually wants to whiten the data for technical reasons. Then the
problem arises: how to impose the sparseness on the original separating matrix even
when the data used in the estimation algorithm needs to be whitened? The preceding
framework can be easily modified so that the sparseness is imposed on the original
separating matrix. Denote by $\mathbf{V}$ the whitening matrix and by $\mathbf{B}$ the separating matrix
for original data. Thus, we have $\mathbf{W}\mathbf{V} = \mathbf{B}$ and $\mathbf{z} = \mathbf{V}\mathbf{x}$ by definition. Now, we can
express the prior in (20.8) as
$$\log p(\mathbf{B}) = \sum_{i} \sum_{j} G(\mathbf{b}_i^T \mathbf{e}_j) + \text{const.} = \sum_{i} \sum_{j} G(\mathbf{w}_i^T (\mathbf{V} \mathbf{e}_j)) + \text{const.} \qquad (20.14)$$
Thus, we see that the virtual sample added to $\mathbf{z}(t)$ now consists of the columns of the
whitening matrix, instead of the identity matrix.
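The equivalence between a sparse prior on the original separating matrix and a virtual sample made of the columns of the whitening matrix can be verified on a small example. This is a sketch under stated assumptions: the separating matrix for the original data is the product of an orthogonal matrix `W` and a whitening matrix `V`, and the log-prior is a sum of $G$ over matrix entries with $G(s) = -|s|$. The numerical values and helper names are hypothetical.

```python
def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def column(m, j):
    """The j-th column of a matrix given as a list of rows."""
    return [row[j] for row in m]

def matmul(a, b):
    """Matrix product of nested-list matrices."""
    return [[dot(row, column(b, j)) for j in range(len(b[0]))] for row in a]

def G(s):
    """Assumed sparse log-density (Laplacian), up to a constant."""
    return -abs(s)

V = [[2.0, 0.5], [0.0, 0.25]]   # hypothetical whitening matrix
W = [[0.6, -0.8], [0.8, 0.6]]   # orthogonal separating matrix for whitened data
B = matmul(W, V)                # separating matrix for the original data

# Virtual sample: the columns of the whitening matrix.
virtual_sample = [column(V, j) for j in range(len(V))]

prior_on_B = sum(G(b_ij) for row in B for b_ij in row)
prior_via_virtual_sample = sum(G(dot(w, v)) for w in W for v in virtual_sample)
print(prior_on_B, prior_via_virtual_sample)
```

The two quantities agree, so an algorithm that works on whitened data can impose sparseness on the original separating matrix simply by appending these columns to its input.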
Incidentally, a similar manipulation of (20.8) shows how to put the prior on the
original mixing matrix instead of the separating matrix. We always have
$\mathbf{V}\mathbf{A} = \mathbf{W}^{-1} = \mathbf{W}^T$. Thus, we obtain
$\mathbf{a}_i^T \mathbf{e}_j = \mathbf{a}_i^T \mathbf{V}^T (\mathbf{V}^{-1})^T \mathbf{e}_j = \mathbf{w}_i^T (\mathbf{V}^{-1})^T \mathbf{e}_j$. This
shows that imposing a sparse prior on $\mathbf{A}$ is done by using the virtual sample given
by the rows of the inverse of the whitening matrix. (Note that for whitened data,
the mixing matrix is the transpose of the separating matrix, so the fourth logical
possibility of formulating a prior for the whitened mixing matrix is not different from
using a prior on the whitened separating matrix.)
In practice, the problems implied by whitening can often be solved by using a
whitening matrix that is sparse in itself. Then imposing sparseness on the whitened
separating matrix is meaningful. In the context of image feature extraction, a sparse
whitening matrix is obtained by the zero-phase whitening matrix (see [38] for
discussion), for example. Then it is natural to impose the sparseness on the whitened
separating matrix, and the complications discussed in this subsection can be ignored.
20.1.4 Spatiotemporal ICA
When using sparse priors, we typically make rather similar assumptions on both the

rather similar to using sparse priors. The basic idea is to form a virtual sample where
the data consists of two parts, the original data and the data obtained by transposing
the data matrix. The dimensions of these data sets must be strongly reduced and
made equal to each other, using PCA-like methods. This is possible because it was
assumed that both $\mathbf{A}$ and $\mathbf{S}$ have the same kind of redundancy: many more rows
than columns. For details, see [412], where the infomax criterion was applied to this
estimation task.
20.2 RELAXING THE INDEPENDENCE ASSUMPTION
In the ICA data model, it is assumed that the components are independent. However,
ICA is often applied to data sets, for example, to image data, in which the
obtained estimates of the independent components are not very independent, even
approximately. In fact, it is not possible, in general, to decompose a random vector
linearly into components that are independent. This raises questions on the utility
and interpretation of the components given by ICA. Is it useful to perform ICA on
real data that does not give independent components, and if it is, how should the
results be interpreted?
One approach to this problem is to reinterpret the estimation results. A straightforward
reinterpretation was offered in Chapter 10: ICA gives components that are as
independent as possible. Even in cases where this is not enough, we can still justify
the utility by other arguments. This is because ICA simultaneously serves certain
useful purposes other than dependence reduction. For example, it can be interpreted
as projection pursuit (see Section 8.5) or sparse coding (see Section 21.2). Both of
these methods are based on the maximal nongaussianity property of the independent
components, and they give important insight into what ICA algorithms are really
doing.
A different approach to the problem of not ﬁnding independent components is to
relax the very assumption of independence, thus explicitly formulating new data
models. In this section, we consider this approach, and present three recently developed

probability density inside the $j$th $k$-tuple of $s_i$. The determinant term appears here as in
any expression of the probability density of a transformation, giving the change in
volume produced by the linear transformation, as in Chapter 9.
The $k$-dimensional probability density is not specified in advance in the
general definition of multidimensional ICA [66]. Thus, the question arises how
to estimate the model of multidimensional ICA. One approach is to estimate the
basic ICA model, and then group the components into $k$-tuples according to their
dependence structure [66]. This is meaningful only if the independent components
are well defined and can be accurately estimated; in general we would like to utilize
the subspace structure in the estimation process. Another approach is to model
the distributions inside the subspaces by a suitable model. This is potentially very
difficult, since we then encounter the classic problem of estimating $k$-dimensional
distributions. One solution for this problem is given by independent subspace
analysis, to be explained next.
20.2.2 Independent subspace analysis
Independent subspace analysis [204] is a simple model that models some dependencies
between the components. It is based on combining multidimensional ICA with
the principle of invariant-feature subspaces.
Invariant-feature subspaces
To motivate independent subspace analysis, let us
consider the problem of feature extraction, treated in more detail in Chapter 21. In the
most basic case, features are given by linear transformations, or ﬁlters. The presence
of a given feature is detected by computing the dot-product of input data with a given
feature vector. For example, wavelet, Gabor, and Fourier transforms, as well as most
models of V1 simple cells, use such linear features (see Chapter 21). The problem
with linear features, however, is that they necessarily lack any invariance with respect
to such transformations as spatial shift or change in (local) Fourier phase [373, 248].
Kohonen [248] developed the principle of invariant-feature subspaces as an ab-

In the case of clearly supergaussian components, we can use the following probability
distribution:
$$\log p(s_i, i \in S_j) = -\alpha \Big[ \sum_{i \in S_j} s_i^2 \Big]^{1/2} + \beta \qquad (20.20)$$
which could be considered a multi-dimensional version of the exponential distribution.
The scaling constant $\alpha$ and the normalization constant $\beta$ are determined so as
to give a probability density that is compatible with the constraint of unit variance of
the $s_i$, but they are irrelevant in the following. Thus we see that the estimation of
the model consists of finding subspaces such that the norms of the projections of the
(whitened) data on those subspaces have maximally sparse distributions.
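To make the role of the subspace norms concrete, here is a minimal sketch. It assumes that the log-density of each subspace is, up to a scaling constant and an additive constant, the negative Euclidean norm of the components in that subspace; the function names, the default constants (`alpha=1.0`, `beta=0.0`), and the example vector are hypothetical.

```python
import math

def subspace_norms(s, subspaces):
    """Euclidean norm of the components of s inside each index set."""
    return [math.sqrt(sum(s[i] ** 2 for i in idx)) for idx in subspaces]

def isa_log_density(s, subspaces, alpha=1.0, beta=0.0):
    """Sum over subspaces of (-alpha * norm + beta); constants are placeholders."""
    return sum(-alpha * r + beta for r in subspace_norms(s, subspaces))

# Four components grouped into two 2-D subspaces; only the first is "active".
s = [3.0, 4.0, 0.0, 0.0]
pairs = [(0, 1), (2, 3)]
singletons = [(0,), (1,), (2,), (3,)]

print(subspace_norms(s, pairs))        # norms of the two projections
print(isa_log_density(s, pairs))       # subspace model
print(isa_log_density(s, singletons))  # 1-D subspaces: sum of -|s_i|
```

With one-dimensional subspaces the norm of each projection is just the absolute value of a component, so the expression reduces to the Laplacian log-density used in ordinary ICA with sparse, symmetric components.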
Independent subspace analysis is a natural generalization of ordinary ICA. In fact,
if the projections on the subspaces are reduced to dot-products, i.e., projections on
one-dimensional (1-D) subspaces, the model reduces to ordinary ICA, provided that,
in addition, the independent components are assumed to have symmetric distributions.
It is to be expected that the norms of the projections on the subspaces represent some
higher-order, invariant features. The exact nature of the invariances has not been
speciﬁed in the model but will emerge from the input data, using only the prior
information on their independence.
If the subspaces have supergaussian (sparse) distributions, the dependency implied
by the model is such that components in the same subspace tend to be nonzero at the
same time. In other words, the subspaces are somehow “activated” as a whole, and
then the values of the individual components are generated according to how strongly
the subspaces are activated. This is the particular kind of dependency that is modeled
by independent subspaces in most applications, for example, with image data.
For more details on independent subspace analysis, the reader is referred to [204].
Some experiments on image data are reported in Section 21.5 as well.
20.2.3 Topographic ICA
Another way of approaching the problem of nonexistence of independent components
is to try to somehow make the dependency structure of the estimated components

dependent in the sense of higher-order correlations.
To obtain topographic ICA, we generalize the model defined by (20.19) so that
it models a dependence not only inside the $k$-tuples, but among all neighboring
components. A neighborhood relation defines a topographical order. We define the
likelihood of the model as follows:
$$\log L(\mathbf{B}) = \sum_{t=1}^{T} \sum_{j=1}^{n} G\Big( \sum_{i=1}^{n} h(i,j) (\mathbf{b}_i^T \mathbf{x}(t))^2 \Big) + T \log |\det \mathbf{B}| + \text{const} \qquad (20.21)$$
Here, the $h(i,j)$ is a neighborhood function, which expresses the strength of the
connection between the $i$th and $j$th units. It can be defined in the same way as in
other topographic maps, like the self-organizing map (SOM) [247]. The function $G$
is similar to the one in independent subspace analysis. The additive constant depends
only on $h(i,j)$.
This model thus can be considered a generalization of the model of independent
subspace analysis. In independent subspace analysis, the latent variables are clearly
divided into $k$-tuples or subspaces, whereas in topographic ICA, such subspaces are
completely overlapping: every neighborhood corresponds to one subspace.
Just like independent subspace analysis, topographic ICA usually models a situation
where nearby components tend to be active (nonzero) at the same time. This seems
to be a common dependency structure for natural sparse data [404]. In fact, the
likelihood given earlier can also be derived as an approximation of the likelihood of
a model where the variance of the ICs is controlled by some higher-order variables,
so that the variances of nearby components are strongly dependent.
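The overlapping-subspace picture can be illustrated with a small sketch. It assumes a one-dimensional ring topology with a binary neighborhood function and a nonlinearity $G(u) = -\sqrt{u}$ applied to the neighborhood-weighted sum of squared components; the function names, the ring layout, and the choice of $G$ are assumptions made for illustration, and the determinant term of the likelihood is left out.

```python
import math

def ring_neighborhood(n, width=1):
    """h[i][j] = 1.0 if units i and j are at most `width` steps apart on a ring."""
    return [[1.0 if min((i - j) % n, (j - i) % n) <= width else 0.0
             for j in range(n)] for i in range(n)]

def local_energies(s, h):
    """For each unit j, the neighborhood-weighted sum of squared components."""
    n = len(s)
    return [sum(h[i][j] * s[i] ** 2 for i in range(n)) for j in range(n)]

def neg_sqrt(u):
    """Assumed sparse nonlinearity acting on a local energy."""
    return -math.sqrt(u)

def topographic_term(s, h, G=neg_sqrt):
    """Data-dependent part of the log-likelihood for one sample of components."""
    return sum(G(e) for e in local_energies(s, h))

h = ring_neighborhood(4, width=1)
s = [1.0, 0.0, 0.0, 0.0]
print(local_energies(s, h))   # the active unit raises its neighbors' energies
print(topographic_term(s, h))

# With width 0 every neighborhood contains a single unit, and the term
# reduces to a sum of -|s_i|, as in ordinary ICA with a Laplacian density.
h0 = ring_neighborhood(4, width=0)
print(topographic_term([1.0, 2.0, 0.0, 0.0], h0))
```

Each column of `h` plays the role of one subspace, and neighboring columns share units, which is what makes the subspaces overlap.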
For more details on topographic ICA, the reader is referred to [206]. Some
experiments on image data are reported in Chapter 21 as well.
20.3 COMPLEX-VALUED DATA
Sometimes in ICA, the ICs and/or the mixing matrix are complex-valued. For example,
in signal processing, frequency (Fourier) domain representations of
signals sometimes have advantages over time-domain representations. Especially in the
separation of convolutive mixtures (see Chapter 19) it is quite common to Fourier transform

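$$\mathbf{C}_{\mathbf{y}\mathbf{y}} = E\{\mathbf{y}\mathbf{y}^H\} = \begin{pmatrix} C_{11} & \cdots & C_{1n} \\ \vdots & \ddots & \vdots \\ C_{n1} & \cdots & C_{nn} \end{pmatrix} \qquad (20.22)$$
where $C_{jk} = E\{y_j y_k^*\}$ and $\mathbf{y}^H$ stands for the Hermitian of $\mathbf{y}$, that is, $\mathbf{y}$ transposed
and conjugated. The data can be whitened in the usual way.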
In our complex ICA model, all ICs $s_j$ have zero mean and unit variance. Moreover,
we require that they have uncorrelated real and imaginary parts of equal variances.
This can be equivalently expressed as $E\{\mathbf{s}\mathbf{s}^H\} = \mathbf{I}$ and $E\{\mathbf{s}\mathbf{s}^T\} = \mathbf{O}$. In the latter,
the expectation of the outer product of a complex random vector without the conjugate
is a null matrix. These assumptions imply that $s_j$ must be strictly complex; that is,
the imaginary part of $s_j$ may not in general vanish.
The definition of kurtosis can be easily generalized. For a zero-mean complex
random variable $y$ it could be defined, for example, as [305, 319]
$$\text{kurt}(y) = E\{|y|^4\} - E\{y y^*\} E\{y y^*\} - E\{y y\} E\{y^* y^*\} - E\{y y^*\} E\{y^* y\} \qquad (20.23)$$
but the definitions vary with respect to the placement of conjugates; actually,
there are several ways to define the kurtosis [319]. We choose the definition in [419],
where
$$\text{kurt}(y) = E\{|y|^4\} - 2 (E\{|y|^2\})^2 - |E\{y^2\}|^2 = E\{|y|^4\} - 2 \qquad (20.24)$$
where the last equality holds if $y$ is white, i.e., the real and imaginary parts of $y$
are uncorrelated and their variances are equal to $1/2$. This definition of kurtosis is
intuitive since it vanishes if $y$ is gaussian.
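The vanishing of the complex kurtosis for gaussian data can be checked empirically. The sketch below assumes the definition $\text{kurt}(y) = E\{|y|^4\} - 2(E\{|y|^2\})^2 - |E\{y^2\}|^2$ for a zero-mean complex variable and replaces expectations by sample averages; the function name, the sample size, and the two test signals are hypothetical choices.

```python
import cmath
import math
import random

def complex_kurtosis(samples):
    """Sample version of E|y|^4 - 2 (E|y|^2)^2 - |E y^2|^2 for zero-mean data."""
    n = len(samples)
    m4 = sum(abs(y) ** 4 for y in samples) / n
    m2 = sum(abs(y) ** 2 for y in samples) / n
    pseudo_variance = sum(y * y for y in samples) / n
    return m4 - 2.0 * m2 ** 2 - abs(pseudo_variance) ** 2

rng = random.Random(0)
sigma = math.sqrt(0.5)  # real and imaginary parts each have variance 1/2

# White complex gaussian variable: kurtosis close to zero.
gaussian = [complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
            for _ in range(50000)]

# Eight equally spaced points on the unit circle: constant modulus,
# hence strongly subgaussian with kurtosis 1 - 2 - 0 = -1.
unit_circle = [cmath.exp(2j * math.pi * k / 8) for k in range(8)]

print(complex_kurtosis(gaussian))
print(complex_kurtosis(unit_circle))
```

The constant-modulus signal is a convenient sanity check because all three moments can be computed by hand.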
20.3.2 Indeterminacy of the independent components
The independent components in the ICA model are found by searching for a matrix
$\mathbf{W}$ such that $\mathbf{s} = \mathbf{W}^H \mathbf{x}$. However, as in basic ICA, there are some indeterminacies. In

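a constraint of orthogonality. Thus one obtains the following optimization problem:
$$\text{maximize } \sum_{j=1}^{n} J_G(\mathbf{w}_j) \text{ with respect to } \mathbf{w}_j,\ j = 1, \ldots, n$$
$$\text{under constraint } E\{(\mathbf{w}_k^H \mathbf{x})(\mathbf{w}_j^H \mathbf{x})^*\} = \delta_{jk} \qquad (20.27)$$
where $J_G$ is the contrast function of (20.26), and $\delta_{jk} = 1$ for $j = k$ and $\delta_{jk} = 0$ otherwise.
It is highly preferable that the estimator given by the contrast function is robust
against outliers. The more slowly $G$ grows as its argument increases, the more robust
is the estimator. For the choice of $G$ we propose now three different functions, the
derivatives $g$ of which are also given:
$$G_1(y) = \sqrt{a_1 + y}, \qquad g_1(y) = \frac{1}{2\sqrt{a_1 + y}} \qquad (20.28)$$
$$G_2(y) = \log(a_2 + y), \qquad g_2(y) = \frac{1}{a_2 + y} \qquad (20.29)$$
$$G_3(y) = \frac{1}{2} y^2, \qquad g_3(y) = y \qquad (20.30)$$
where $a_1$ and $a_2$ are some arbitrary constants (for example, $a_1 \approx 0.1$ and $a_2 \approx 0.1$
seem to be suitable). Of the preceding functions, $G_1$ and $G_2$ grow more slowly than
$G_3$ and thus they give more robust estimators. $G_3$ is motivated by kurtosis (20.24).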
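A short sketch can make the comparison of growth rates tangible. It assumes the three contrast functions take the forms $G_1(y) = \sqrt{a_1 + y}$, $G_2(y) = \log(a_2 + y)$ and $G_3(y) = y^2/2$, with both constants set to 0.1; these specific forms, the constant values, and the helper names are assumptions of this illustration. Each stated derivative is compared against a central finite difference.

```python
import math

A1 = 0.1  # assumed value of the constant in G1
A2 = 0.1  # assumed value of the constant in G2

def G1(y):
    return math.sqrt(A1 + y)

def g1(y):
    return 1.0 / (2.0 * math.sqrt(A1 + y))

def G2(y):
    return math.log(A2 + y)

def g2(y):
    return 1.0 / (A2 + y)

def G3(y):
    return 0.5 * y * y

def g3(y):
    return y

def numerical_derivative(G, y, h=1e-6):
    """Central finite-difference approximation of dG/dy."""
    return (G(y + h) - G(y - h)) / (2.0 * h)

# The argument y stands for a squared modulus, so it is nonnegative.
for name, G, g in (("G1", G1, g1), ("G2", G2, g2), ("G3", G3, g3)):
    y = 1.5
    print(name, g(y), numerical_derivative(G, y), G(100.0))
```

The last printed column shows how differently the functions react to a large argument: the quadratic function is dominated by such values, whereas the square root and the logarithm dampen them, which is the intuition behind their robustness to outliers.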
20.3.4 Consistency of estimator
In Chapter 8 it was stated that any nonlinear learning function divides the space of
probability distributions into two half-spaces. Independent components can be
estimated by either maximizing or minimizing a function similar to (20.26), depending
on which half-space their distribution lies in. A theorem for real-valued signals was
presented that distinguished between maximization and minimization and gave the
exact conditions for convergence. Now we show how this idea can be generalized
to complex-valued random variables. We have the following theorem on the local
consistency of the estimators [47]:
Theorem 20.1 Assume that the input data follows the complex ICA model. The
observed mixtures are prewhitened so that $E\{\mathbf{x}\mathbf{x}^H\} = \mathbf{I}$. The independent
components have zero mean, unit variance, and uncorrelated real and imaginary parts
of equal variances. Also, $G$ is a sufficiently smooth even func-

In our complex ICA, the nongaussianity measure operates on $|\mathbf{w}^H \mathbf{x}|^2$, which
can be interpreted as the squared norm of a projection onto a subspace. The subspace
is two-dimensional, corresponding to the real and imaginary parts of a complex
number. In contrast to the subspace method, one of the basis vectors is determined
straightforwardly from the other basis vector. In independent subspace analysis, the
independent subspace is determined only up to an orthogonal matrix factor. In
complex ICA, however, the indeterminacy is less severe: the sources are determined
up to a complex factor $v$, $|v| = 1$.
It can be concluded that complex ICA is a restricted form of independent subspace
methods.
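The phase indeterminacy can be seen directly from the quantity the contrast operates on. In the minimal sketch below, which assumes the contrast depends on the data only through the squared modulus of the projection $\mathbf{w}^H \mathbf{x}$, multiplying the weight vector by any unit-modulus complex factor leaves that quantity unchanged. The vectors, the phase value, and the function name are hypothetical.

```python
import cmath

def contrast_argument(w, x):
    """|w^H x|^2 for complex vectors given as lists."""
    projection = sum(wi.conjugate() * xi for wi, xi in zip(w, x))
    return abs(projection) ** 2

w = [0.6 + 0.0j, 0.0 + 0.8j]   # unit-norm weight vector
x = [1.0 - 2.0j, 0.5 + 0.5j]   # one (whitened) observation

v = cmath.exp(0.7j)            # arbitrary factor with |v| = 1
rotated_w = [v * wi for wi in w]

print(contrast_argument(w, x))
print(contrast_argument(rotated_w, x))
```

Both calls give the same value, so no contrast of this type can distinguish a source from a phase-rotated copy of it.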
20.4 CONCLUDING REMARKS
The methods presented in the ﬁrst two sections of the chapter were all related to
the case where we know more about the data than just the blind assumption of
independence. Using sparse priors, we incorporate some extra knowledge on the
sparsity of the mixing matrix in the estimation procedure. This was made very easy
by the algorithmic trick of conjugate priors.
In the methods of independent subspaces or topographic ICA, on the other hand,
we assume that we cannot really ﬁnd independent components; instead we can
ﬁnd groups of independent components, or components whose dependency structure
can be visualized. A special case of the subspace formalism is encountered if the
independent components are complex-valued.
Another class of extensions that we did not treat in this chapter are the so-called
semiblind methods, that is, methods in which much prior information on the mixing
is available. In the extreme case, the mixing could be almost completely known, in
which case the “blind” aspect of the method disappears. Such semiblind methods
are quite application-dependent. Some methods related to telecommunications are
treated in Chapter 23. A closely related theoretical framework is the “principal”
ICA proposed in [285]. See also [415] for a semiblind method in a brain imaging
application.
