17
Nonlinear ICA
This chapter deals with independent component analysis (ICA) for nonlinear mixing
models. A fundamental difficulty in the nonlinear ICA problem is that it is highly
nonunique withoutsome extra constraints, which are often realized by using a suitable
regularization. We also address the nonlinear blind source separation (BSS) problem.
Contrary to the linear case, we consider it different from the respective nonlinear ICA
problem. After considering these matters, some methods introduced for solving the
nonlinear ICA or BSS problems are discussed in more detail. Special emphasis is
given to a Bayesian approach that applies ensemble learning to a flexible multilayer
perceptron model for finding the sources and nonlinear mixing mapping that have
most probably given rise to the observed mixed data. The efficiency of this method is
demonstrated using both artificial and real-world data. At the end of the chapter, other
techniques proposed for solving the nonlinear ICA and BSS problems are reviewed.
17.1 NONLINEAR ICA AND BSS
17.1.1 The nonlinear ICA and BSS problems
In many situations, the basic linear ICA or BSS model
(17.1)
is too simple for describing the observed data adequately. Hence, it is natural to
consider extension of the linear model to nonlinear mixing models. For instantaneous
315
Independent Component Analysis. Aapo Hyv
¨
arinen, Juha Karhunen, Erkki Oja
Copyright
2001 John Wiley & Sons, Inc.
ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)
316
NONLINEAR ICA
mixtures, the nonlinear mixing model has the general form
(17.2)
Thus the sources , are first mixed linearly according to the basic
ICA/BSS model (17.1), but after that a nonlinear function is applied to them to get
the final observations . It can be shown [418] that for the post-nonlinear mixtures,
the indeterminacies are usually the same as for the basic linear instantaneous mixing
model (17.1). That is, the sources can be separated or the independent compo-
nents estimated up to the scaling, permutation, and sign indeterminacies under weak
conditions on the mixing matrix and source distributions. The post-nonlinearity
assumption is useful and reasonable in many signal processing applications, because
NONLINEAR ICA AND BSS
317
it can be thought of as a model for a nonlinear sensor distortion. In more general
situations, it is a restrictive and somewhat arbitrary constraint. This model will be
treated in more detail below.
Another difficulty in the general nonlinear BSS (or ICA) methods proposed thus
far is that they tend to be computationally rather demanding. Moreover, the compu-
tational load usually increases very rapidly with the dimensionality of the problem,
preventing in practice the application of nonlinear BSS methods to high-dimensional
data sets.
The nonlinear BSS and ICA methods presented in the literature could be divided
into two broad classes: generative approaches and signal transformation approaches
[438]. In the generative approaches, the goal is to find a specific model that explains
how the observations were generated. In our case, this amounts to estimating both
the source signals and the unknown mixing mapping that have generated the
observed data
through the general mapping (17.2). In the signal transformation
methods, one tries to estimate the sources directly using the inverse transformation
(17.3). In these methods, the number of estimated sources is the same as the number
of observed mixtures [438].
17.1.2 Existence and uniqueness of nonlinear ICA
The question of existence and uniqueness of solutions for nonlinear independent
that have a joint uniform distribution in the unit cube .Let be any scalar
random variable. Define as in (17.5), and set
(17.6)
Then is independent from the , and the variables are
jointly uniformly distributed in the unit cube .
The theorem is proved in [213]. The constructive method given above can be used
to decompose variables into independent components ,
giving a solution for the nonlinear ICA problem.
This construction also clearly shows that the decomposition in independent com-
ponents is by no means unique. For example, we could first apply a linear trans-
formation on the to obtain another random vector = , and then compute
= with being defined using the above procedure, where is replaced by
. Thus we obtain another decomposition of into independent components. The
resulting decomposition = is in general different from , and cannot be
reduced to by any simple transformations. A more rigorous justification of the
nonuniqueness property has been given in [213].
Lin [278] has recently derived some interesting theoretical results on ICA that
are useful in describing the nonuniqueness of the general nonlinear ICA problem.
Let the matrices and denote the Hessians of the logarithmic probability
densities and of the source vector and mixture (data) vector ,
respectively. Then for the basic linear ICA model (17.1) it holds that
(17.7)
where is the mixing matrix. If the components of are truly independent,
should be a diagonal matrix. Due to the symmetry of the Hessian matrices and
, Eq. (17.7) imposes constraints for the elements of the matrix
. Thus a constant mixing matrix can be solved by estimating at two different
points, and assuming some values for the diagonal elements of .
SEPARATION OF POST-NONLINEAR MIXTURES
319
If the nonlinear mapping (17.2) is twice differentiable, we can approximate it
leads to the familiar Bell-Sejnowski algorithm (see Chapters 10 and 9)
E (17.8)
where components of the vector are score functions of the components of
the output vector :
(17.9)
320
NONLINEAR ICA
Here is the probability density function of and its derivative. In practice,
the natural gradient algorithm is used instead of the Bell-Sejnowski algorithm (17.8);
see Chapter 9.
For the nonlinear stage, one can derive the gradient learning rule [418]
E E
Here is the th component of the input vector, is the element of the matrix
,and is the derivative of the th nonlinear function . The exact computation
algorithm depends naturally on the specific parametric form of the chosen nonlinear
mapping . In [418], a multilayer perceptron network is used for modeling
the functions , .
In linear BSS, it suffices that the score functions (17.9) are of the right type for
achieving separation. However, their appropriate estimation is critical for the good
performance of the proposed nonlinear separation method. The score functions (17.9)
must be estimated adaptively from the output vector . Several alternative ways to
do this are considered in [418]. An estimation method based on the Gram-Charlier
expansion performs appropriately only for mild post-nonlinear distortions. However,
another method, which estimates the score functions directly, also provides very good
results for hard nonlinearities. Experimental results are presented in [418]. A well
performing batch type method for estimating the score functions has been introduced
in a later paper [417].
Before proceeding, we mention that separation of post-nonlinear mixtures also
has been studied in [271, 267, 469] using mainly extensions of the natural gradient
algorithm.
0.5
1
1.5
Original signals
Fig. 17.1
Original source signals.
0 5 10 15 20 25 30 35 40 45 50
−10
−5
0
5
10
0 5 10 15 20 25 30 35 40 45 50
−20
−10
0
10
20
Fig. 17.2
Nonlinear mixtures.
The following experiment [345] illustrates the use of the self-organizing map in
nonlinear blind source separation. There were two subgaussian source signals
shown in Fig. 17.1, consisting of a sinusoid and uniformly distributed white noise.
Each source vector was first mixed linearly using the mixing matrix
(17.10)
After this, the data vectors were obtained as post-nonlinear mixtures of the sources
by applying the formula (17.4), where the nonlinearity = , .These
mixtures are depicted in Fig. 17.2.
0 5 10 15 20 25 30 35 40 45 50
−20
Generally speaking, there are several difficulties in applying self-organizing maps
to nonlinear blind source separation. If the sources are uniformly distributed, then
it can be heuristically justified that the regularization of the nonlinear separating
mapping provided by the SOM approximately separates the sources. But if the true
sources are not uniformly distributed, the separating mapping providing uniform
densities inevitably causes distortions, which are in general the more serious the
farther the true source densities are from the uniform ones. Of course, the SOM
method still provides an approximate solution to the nonlinear ICA problem, but this
solution may have little to do with the true source signals.
Another difficulty in using SOM for nonlinear BSS or ICA is that computational
complexity increases very rapidly with the number of the sources (dimensionality of
the map), limiting the potential application of this method to small-scale problems.
Furthermore, the mapping provided by the SOM is discrete, where the discretization
is determined by the number of grid points.
17.4 A GENERATIVE TOPOGRAPHIC MAPPING APPROACH TO
NONLINEAR BSS *
17.4.1 Background
The self-organizing map discussed briefly in the previous section is a nonlinear
mapping method that is inspired by neurobiological modeling arguments. Bishop,
Svensen and Williams introduced the generative topographic mapping (GTM) method
as a statistically more principled alternative to SOM. Their method is presented in
detail in [49].
In the basic GTM method, mutually similar impulse (delta) functions that are
equispaced on a rectangular grid are used to model the discrete uniform density in the
space of latent variables, or the joint density of the sources in our case. The mapping
from the sources to the observed data, corresponding in our nonlinear BSS problem
to the nonlinear mixing mapping (17.2), is modeled using a mixture-of-gaussians
model. The parameters of the mixture-of-gaussians model, defining the mixing
mapping, are then estimated using a maximum likelihood (ML) method (see Section
4.5) realized by the expectation-maximization (EM) algorithm [48, 172]. After this,
forming a regular array in the -dimensionallatent space. As in SOM, the dimension
of the latent space is usually . Vectors lying in the latent space are denoted by
; in our application they will be source vectors. The GTM method uses a set of
fixed nonlinear basis functions , , which form a nonorthogonal
basis set. These basis functions typically consist of a regular array of spherical
gaussian functions, but the basis functions can at least in principle be of other types.
The mapping from the -dimensional latent space to the -dimensional data
space, which is in our case the mixing mapping of Eq. (17.2), is in GTM modeled as
a linear combination of basis functions :
(17.11)
Here is an matrix of weight parameters.
Denote the node locations in the latent space by . Eq. (17.11) then defines a
corresponding set of reference vectors
(17.12)
in data space. Each of these reference vectors then forms the center of an isotropic
gaussian distribution in data space. Denoting the common variance of these gaussians
by ,weget
(17.13)
324
NONLINEAR ICA
The probability density function for the GTM model is obtained by summing over
all of the gaussian components, yielding
(17.14)
Here
is the total number of gaussian components, which is equal to the number of
grid points in latent space, and the prior probabilities of the gaussian components
are all equal to .
GTM tries to represent the distribution of the observed data in the -dimensional
data space in terms of a smaller -dimensional nonlinear manifold [49]. The gaussian
distribution in (17.13) represents a noise or error model which is needed because the
points in the grid are denoted by , . One can then
write
(17.17)
The coefficients
= ,where and are the values of the marginal
densities and corresponding to the location of the node point .In
(17.17), is the Dirac delta function or impulse function which has the special
property that = if the integration extends over the point ,
otherwise the integral is zero. In the last phase, the node points and their respective
probabilities have been reindexed again using a single index for easier notation.
This can be done easily by going through all the node points in some prescribed
order, for example rowwise.
Inserting (17.17) into (17.15) yields
(17.18)
Computing the gradient of this expression with respect to the weight matrix and
setting it to zero, after some manipulations yields the following equation for updating
the weight matrix :
old
new
old
(17.19)
In this formula, = is the data matrix, and the -th element
= of the matrix is the value of the th basis function at
the th node point . Furthermore, is a diagonal matrix with elements
(17.20)
and the elements of the responsibility matrix
are
(17.21)
Then the variance parameter can be updated using the formula
new
responsibility
max
=argmax , for each sample index .
17.4.3 An experiment
In the following, a simple experiment involving two sources shown in Fig. 17.5
and three noisy nonlinear mixtures is described. The mixed data was generated by
transforming linear mixtures of the original sources using a multilayer perceptron
A GENERATIVE TOPOGRAPHIC MAPPING APPROACH *
327
Fig. 17.7
Joint mixture densities with superimposed maps. Top left: Joint density
of mixtures and . Top right: Joint density of mixtures and . Bottom left:
Joint density
of mixtures and . Bottom right: Joint density of the estimated
two source signals.
network with a volume conserving architecture (see [104]). Such an architecture was
chosen for ensuring that the total mixing mapping is bijective and therefore reversible,
and for avoiding highly complex distortions of the source densities. However, this
choice has the advantage that it makes the total mixing mapping more complex than
the post-nonlinear model (17.4). Finally, gaussian noise was added to the mixtures.
The mixtures were generated using the model
(17.24)
where is an upper-diagonal matrix with zero diagonal elements. The nonzero
elements of were drawn from a standard gaussian distribution. The matrix
ensures volume conservation of the nonlinearity applied to .
The modified GTM algorithm presented above was used to learn a separating
mapping. For reducing scaling effects, the mixtures were first whitened. After
whitening the mixtures are uncorrelated and have unit variance. Then the modified
GTM algorithm was run for eight iterations using a grid. The number of basis
functions was .
planations for the noise in addition to the true sources or independent components.
Choosing too simple a model results in underfitting, leaving hidden some of the true
sources that have generated the data.
An appropriate solution to this problem is that no single model should actually
be chosen. Instead, all the possible explanations should be taken into account
and weighted according to their posterior probabilities. This approach, known as
Bayesian learning [48], optimally solves the trade-off between under- and overfitting.
All the relevant information needed in choosing an appropriate model is contained
in the posterior probability density functions (pdf’s) of different model structures.
The posterior pdf’s of too simple models are low, because they leave a lot of the data
unexplained. On the other hand, too complex models occupy little probability mass,
AN ENSEMBLE LEARNING APPROACH TO NONLINEAR BSS
329
even though they often show a high but very narrow peak in their posterior pdf’s
corresponding to the overfitted parameters.
In practice, exact computation of the posterior pdf’s of the models is impossible.
Therefore, some suitable approximation method must be used. Ensemble learning
[180, 25, 260], also known as variational learning, is a method for parametric approx-
imation of posterior pdf’s where the search takes into account the probability mass
of the models. Therefore, it does not suffer from overfitting. The basic idea in en-
semble learning is to minimize the misfit between the posterior pdf and its parametric
approximation.
Let us denote by the set of available mixture (data) vectors, and by
the respective source vectors. Denote by all the unknown parameters
of the mixture (data) model, which will be described in more detail in the next
subsection. Furthermore, let denote the exact posterior pdf and
its parametric approximation. The misfit is measured with the Kullback-Leibler (KL)
divergence
between the densities and , which is defined by the cost function
E
assumed to be gaussian, except for the sources, which are modeled as mixtures of
gaussians. This does not limit the generality of the approach severely, but makes
computational implementation simpler and much more efficient. The hierarchical
description of the distributions of the parameters of the model used here is a standard
procedure in probabilistic Bayesian modeling. Its strength lies in that knowledge
about equivalent status of different parameters can be easily incorporated. For
example, all the variances of the noise components have a similar status in the model.
This is reflected by the fact that their distributions are assumed to be governed by
common hyperparameters.
17.5.3 Computing Kullback-Leibler cost function *
In this subsection, the Kullback-Leibler cost function defined earlier in Eq.
(17.25) in considered in more detail. For approximating and then minimizing it, we
need two things: the exact formulation of the posterior density and its
parametric approximation .
According to the Bayes’ rule, the posterior pdf of the unknown variables and
is
(17.27)
The probability density of the data given the sources and the
parameters is obtained from Eq. (17.26). Let us denote the variance of the th
component of the noise vector by . The distribution
of the th component of the data vector is thus gaussian with the mean
+ and variance . Here denotes the th row vector of the
weight matrix ,and is the th component of the bias vector . As usually, the
noise components are assumed to be independent, and therefore
(17.28)
The terms and in (17.27) are also products of simple gaussian dis-
tributions, and they are obtained directly from the definition of the model structure
[259, 436]. The term does not depend on the model parameters and can be
neglected.
The approximation must be simple enough for mathematical tractabil-
estimates.
Let us denote the two parts of the cost function (17.25) arising from the denom-
inator and numerator of the logarithm by, respectively, = E and =
E . The variances are obtained by differentiating (17.25) with respect to
[259, 436]:
(17.32)
Equating this to zero yields a fixed-point iteration for updating the variances:
(17.33)
The means can be estimated from the approximate Newton iteration [259, 436]
(17.34)
332
NONLINEAR ICA
The formulas (17.33) and (17.34) have a central role in learning.
17.5.4 Learning procedure *
Usually MLP networks learn the nonlinear input–output mapping in a supervised
manner using known input–output training pairs, for which the mean-square mapping
error is minimized using the back-propagation algorithm or some alternative method
[172, 48]. In our case, the inputs are the unknown source signals , and only
the outputs of the MLP network, namely the observed data vectors , are known.
Hence, unsupervised learning must be applied. Detailed account of the learning
method and discussion of potential problems can be found in [259, 436]. In the
following, we give an overall description of the learning method.
The practical learning procedure used in all the experiments was the same. First,
linear principal component analysis (PCA) (see Chapter 6) was applied to find sensible
initial values for the posterior means of the sources. Even though PCA is a linear
method, it yields much better initial values than a random choice. The posterior
variances of the sources are initialized to small values. Good initial values are
important for the method because it can effectively prune away unused parts of the
MLP network
1
4
−4 −2 0 2 4
−4
−2
0
2
4
−4 −2 0 2 4
−4
−2
0
2
4
−4 −2 0 2 4
−4
−2
0
2
4
−4 −2 0 2 4
−4
−2
0
2
4
−4 −2 0 2 4
−4
−2
0
2
a nonlinear mapping, which was obtained by using a randomly initialized MLP
network having 30 hidden neurons and 20 output neurons. Gaussian noise having a
standard deviation of 0.1 was added to the data. The nonlinearity used in the hidden
neurons was chosen to be the inverse hyperbolic sine . This guarantees
that the nonlinear source separation algorithm using the MLP network with tanh
nonlinearities cannot use exactly the same weights.
Several different numbers of hidden neurons were tested in order to optimize
the structure of the MLP network, but the number of sources was assumed to be
known. This assumption is reasonable because it seems to be possible to optimize
the number of sources simply by minimizing the cost function, as experiments with
purely gaussian sources have shown [259, 438]. The MLP network which minimized
334
NONLINEAR ICA
−4 −2 0 2 4
−4
−2
0
2
4
−4 −2 0 2 4
−4
−2
0
2
4
−4 −2 0 2 4
−4
−2
0
2
4
Fig. 17.9
Scatter plots of the sources after 2000 sweeps of nonlinear factor analysis followed
by a rotation with a linear ICA. Signal-to-noise ratio is 13.2 dB.
−4 −2 0 2 4
−4
−2
0
2
4
−4 −2 0 2 4
−4
−2
0
2
4
−4 −2 0 2 4
−4
−2
0
2
4
−4 −2 0 2 4
−4
−2
0
2
4
−4 −2 0 2 4
−4
scatter plots, corresponding to each of the eight original sources. The original source
AN ENSEMBLE LEARNING APPROACH TO NONLINEAR BSS
335
0 5 10 15 20 25
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Number of sources
Remaining energy
linear FA
nonlinear FA
Fig. 17.11
The remaining energy of the process data as a function of the number of extracted
components using linear and nonlinear factor analysis
which was used for generating the data appears on the -axis, and the respective
estimated source in on the -axis of each plot. Each point corresponds to one sample
point . The upper plots of each figure correspond to the supergaussian and the
lower plots to the subgaussian sources. The optimal result is a straight line implying
that the estimated values of the sources coincide with the true values.
Figure 17.8 shows the result given by the linear FastICA algorithm alone. Linear
ICA is able to retrieve the original sources with only 0.7 dB signal-to-noise ratio
(SNR). In practice a linear method could not deduce the number of sources, and
10 sources (outputs) and 30 hidden neurons. The same network size was chosen for
nonlinear blind source separation based on the mixture-of-gaussians model for the
sources. After 2000 sweeps using the nonlinear factor analysis model, the sources
were rotated with FastICA, and each source was modeled with a mixture of three
gaussian distributions. The learning process was then continued using this refined
model for 5500 more sweeps. The resulting sources are shown in Fig. 17.12.
Figure 17.13 shows 30 subimages, each corresponding to a specific measurement
made from the process. The original measurement appears as the upper time series
in each subimage, and the lower time series is the reconstruction of this measurement
given by the network. These reconstructions are the posterior means of the outputs
of the network when the inputs were the estimated sources shown in Fig. 17.12. The
original measurements show a great variability, but the reconstructions are strikingly
accurate. In some cases it seems that the reconstruction is less noisy than the original
signal. This is not so surprising, because generally an estimate of data reconstructed
from a smaller number of most significant components is often able to remove some
noise.
The experiments suggest that the estimated source signals can have meaningful
physical interpretations. The results are encouraging, but further studies are needed
to verify the interpretations of the signals.
The proposed ensemble learning method for nonlinear blind source separation
can be extended in several ways. An obvious extension is inclusion of time delays
into the data model, making use of the temporal information often present in the
sources. This would probably help in describing the process data even better. Using
the Bayesian framework, it is also easy to treat missing observations or only partly
known inputs.
17.6 O THER APPROACHES
In this section, other methods proposed for nonlinear independent component analysis
or blind source separation are briefly reviewed. The interested reader can find more
information on them in the given references.
Already in 1987, Jutten [226] used soft nonlinear mixtures for assessing the
Most works on autoassociative MLPs use point estimates for weights and sources
obtained by minimizing the mean-square representation error for the data. It is then
impossible to reliably choose the structure of the model, and problems with over- or
underfitting can be severe. Hecht-Nielsen [176, 177] proposed so-called replicator
networks for universal optimal nonlinear coding of input data. These networks
are autoassociate MLP networks, where the data vectors are mapped onto a unit
hypercube so that the mapped data is uniformly distributed inside the hypercube.
The coordinates of the mapped data on the axes of the hypercube, called natural
coordinates, then in fact form a nonlinear ICA solution, even though this has not been
noted in the original papers [176, 177].
Hochreiter and Schmidhuber [181] have used in context with MLP networks a
method based on minimum description length, called LOCOCODE. This method
does estimate the distribution of the weights, but has no model for the sources. It
is then impossible to measure the description length of the sources. Anyway, their
method shows interesting connections with ICA; sometimes it provides a nonlinear
ICA solution, sometimes it does not [181]. Another well-known information theoretic
criterion, mutual information, is applied to measuring independence in [2, 469]. In
these papers, methods based on various MLP network structures are also introduced
for nonlinear blind separation. In particular, Yang, Amari, and Cichocki [469] deal
with extensions of the basic natural gradient method (see Chapter 9), for nonlinear
BSS, presenting also an extension based on entropy maximization and experiments
with post-nonlinear mixtures.
Xu has developed the general Bayesian Ying-Yang framework which can be
applied on ICA as well; see e.g. [462, 463].
Other general approachesproposed for solving the nonlinear ICA or BSS problems
include a pattern repulsion based method [295], a state-space modeling approach [86],
and an entropy-based method [134]. Various separation methods for the considerably
simpler case of post-nonlinear mixtures (17.4) have been introduced at least in [271,
267, 365, 418, 417].
In the ensemble learning method discussed before, the necessary regularization
ICA or BSS problems have been briefly reviewed in the previous section. Another
possibility is to have more information about the sources or mixtures themselves.
An example of such an approach is the method based on the generative topographic
mapping (GTM), which requires knowledge of the probability distributions of the
sources.
A large part of this chapter has been devoted to a recently introduced fully Bayesian
approach based on ensemble learning for solving the nonlinear BSS problem. This
method applies the well-known MLP network, which is well-suited to modeling
both mildly and strongly nonlinear mappings. The proposed unsupervised ensemble-
learning method tries to find the sources and the mapping that together have most
probably generated the observed data. This regularization principle has a firm theo-
retical foundation, and it is intuitively satisfying for the nonlinear source separation
problem. The results are encouraging for both artificial and real-world data. The
ensemble-learning method allows nonlinear source separation for larger scale prob-