17
Nonlinear ICA
This chapter deals with independent component analysis (ICA) for nonlinear mixing
models. A fundamental difficulty in the nonlinear ICA problem is that it is highly
nonunique without some extra constraints, which are often realized by using a suitable
regularization. We also address the nonlinear blind source separation (BSS) problem.
Contrary to the linear case, we consider it different from the respective nonlinear ICA
problem. After considering these matters, some methods introduced for solving the
nonlinear ICA or BSS problems are discussed in more detail. Special emphasis is
given to a Bayesian approach that applies ensemble learning to a flexible multilayer
perceptron model for finding the sources and nonlinear mixing mapping that have
most probably given rise to the observed mixed data. The efficiency of this method is
demonstrated using both artificial and real-world data. At the end of the chapter, other
techniques proposed for solving the nonlinear ICA and BSS problems are reviewed.
17.1 NONLINEAR ICA AND BSS
17.1.1 The nonlinear ICA and BSS problems
In many situations, the basic linear ICA or BSS model
x = As =
n
X
j =1
s
j
a
j
(17.1)
is too simple for describing the observed data
x
adequately. Hence, it is natural to
consider extension of the linear model to nonlinear mixing models. For instantaneous
315
equals
the number of mixtures
m
. The general nonlinear ICA problem then consists of
finding a mapping
h : R
n
! R
n
that gives components
y = h(x)
(17.3)
that are statistically independent. A fundamental characteristic of the nonlinear
ICA problem is that in the general case, solutions always exist, and they are highly
nonunique. One reason for this is that if
x
and
y
are two independent random
variables, any of their functions
f (x)
and
g (y )
are also independent. An even more
serious problem is that in the nonlinear case,
x
and
y
can be mixed and still statistically
independent, as will be shown below. This is not unlike in the case of gaussian ICs
A
i =1::: n
(17.4)
Thus the sources
s
j
,
j = 1::: n
are first mixed linearly according to the basic
ICA/BSS model (17.1), but after that a nonlinear function
f
i
is applied to them to get
the final observations
x
i
. It can be shown [418] that for the post-nonlinear mixtures,
the indeterminacies are usually the same as for the basic linear instantaneous mixing
model (17.1). That is, the sources can be separated or the independent compo-
nents estimated up to the scaling, permutation, and sign indeterminacies under weak
conditions on the mixing matrix
A
and source distributions. The post-nonlinearity
assumption is useful and reasonable in many signal processing applications, because
NONLINEAR ICA AND BSS
317
it can be thought of as a model for a nonlinear sensor distortion. In more general
situations, it is a restrictive and somewhat arbitrary constraint. This model will be
treated in more detail below.
Another difficulty in the general nonlinear BSS (or ICA) methods proposed thus
conformal mapping together with some other assumptions; see [213] for details.
In the following, we present in more detail the constructive method introduced in
[213] that always yields at least one solution to the nonlinear ICA problem. This
procedure might be considered as a generalization of the well-known Gram-Schmidt
orthogonalization method. Given
m
independent variables
y
=
(y
1
::: y
m
)
and a
variable
x
, a new variable
y
m+1
=
g (yx)
is constructed so that the set
y
1
:::y
m+1
is mutually independent.
The construction is defined recursively as follows. Assume that we have already
m
m
b p
yx
)=P (x bjy
1
= a
1
::: y
m
= a
m
)
(17.5)
=
R
b
1
p
yx
(a
1
::: a
m
)d
p
y
(a
1
:::a
m
m =0
,
g
is simply the cumulative
distribution function of
x
.Now,
g
as defined above gives a nonlinear decomposition,
as stated in the following theorem.
Theorem 17.1 Assume that
y
1
::: y
m
are independent scalar random variables
that have a joint uniform distribution in the unit cube
0 1]
m
.Let
x
be any scalar
random variable. Define
g
as in (17.5), and set
y
m+1
= g (y
1
:::y
n
into
n
independent components
y
1
:::y
n
,
giving a solution for the nonlinear ICA problem.
This construction also clearly shows that the decomposition in independent com-
ponents is by no means unique. For example, we could first apply a linear trans-
formation on the
x
to obtain another random vector
x
0
=
Lx
, and then compute
y
0
=
g
0
(x
0
)
with
g
x
denote the Hessians of the logarithmic probability
densities
log p
s
(s)
and
log p
x
(x)
of the source vector
s
and mixture (data) vector
x
,
respectively. Then for the basic linear ICA model (17.1) it holds that
H
s
= A
T
H
x
A
(17.7)
where
A
is the mixing matrix. If the components of
s
are truly independent,
H
@ f (s)=@ s
of the Taylor series expansion of
f (s)
at the desired point.
But now
A
generally changes from point to point, so that the constraint conditions
(17.7) still leave
n(n 1)=2
degrees of freedom for determining the mixing matrix
A
(omitting the diagonal elements). This also shows that the nonlinear ICA problem
is highly nonunique.
Taleb and Jutten have considered separability of nonlinear mixtures in [418, 227].
Their general conclusion is the same as earlier: Separation is impossible without
additional prior knowledge on the model, since the independence assumption alone
is not strong enough in the general nonlinear case.
17.2 SEPARATION OF POST-NONLINEAR MIXTURES
Before discussing approaches applicable to general nonlinear mixtures, let us briefly
consider blind separation methods proposed for the simpler case of post-nonlinear
mixtures (17.4). Especially Taleb and Jutten have developed BSS methods for this
case. Their main results have been represented in [418], and a short overview of their
studies on this problem can be found in [227]. In the following, we present the the
main points of their method.
A separation method for the post-nonlinear mixtures (17.4) should generally con-
sist of two subsequent parts or stages:
1. A nonlinear stage, which should cancel the nonlinear distortions
f
i
i =
between the components
y
1
::: y
n
of the output vector (see Chapter 10) as the cost function and independence
criterion in both stages. For the linear part, minimization of the mutual information
leads to the familiar Bell-Sejnowski algorithm (see Chapters 10 and 9)
@I(y )
@ B
=
E
f x
T
g(B
T
)
1
(17.8)
where components
i
of the vector
are score functions of the components
y
i
of
the output vector
y
(u)
its derivative. In practice,
the natural gradient algorithm is used instead of the Bell-Sejnowski algorithm (17.8);
see Chapter 9.
For the nonlinear stage, one can derive the gradient learning rule [418]
@I(y )
@
k
=
E
@ log j g
0
k
(
k
x
k
) j
@
k
E
(
n
X
i=1
i
is the derivative of the
k
th nonlinear function
g
k
. The exact computation
algorithm depends naturally on the specific parametric form of the chosen nonlinear
mapping
g
k
(
k
x
k
)
. In [418], a multilayer perceptron network is used for modeling
the functions
g
k
(
k
x
k
)
,
k =1::: n
.
In linear BSS, it suffices that the score functions (17.9) are of the right type for
achieving separation. However, their appropriate estimation is critical for the good
performance of the proposed nonlinear separation method. The score functions (17.9)
achieving the desired goals.
0 5 10 15 20 25 30 35 40 45 50
−1
−0.5
0
0.5
1
0 5 10 15 20 25 30 35 40 45 50
−1.5
−1
−0.5
0
0.5
1
1.5
Original signals
Fig. 17.1
Original source signals.
0 5 10 15 20 25 30 35 40 45 50
−10
−5
0
5
10
0 5 10 15 20 25 30 35 40 45 50
−20
−10
0
10
20
mixtures
x
i
are depicted in Fig. 17.2.
0 5 10 15 20 25 30 35 40 45 50
−20
−15
−10
−5
0
0 5 10 15 20 25 30 35 40 45 50
−15
−10
−5
0
5
Fig. 17.3
Signals separated by SOM.
−10 −8 −6 −4 −2 0 2 4 6 8 10
−15
−10
−5
0
5
10
15
Fig. 17.4
Converged SOM map.
The sources separated by the SOM method are shown in Fig. 17.3, and the
converged SOM map is illustrated in Fig. 17.4. The estimates of the source signals
In the basic GTM method, mutually similar impulse (delta) functions that are
equispaced on a rectangular grid are used to model the discrete uniform density in the
space of latent variables, or the joint density of the sources in our case. The mapping
from the sources to the observed data, corresponding in our nonlinear BSS problem
to the nonlinear mixing mapping (17.2), is modeled using a mixture-of-gaussians
model. The parameters of the mixture-of-gaussians model, defining the mixing
mapping, are then estimated using a maximum likelihood (ML) method (see Section
4.5) realized by the expectation-maximization (EM) algorithm [48, 172]. After this,
the inverse (separating) mapping from the data to the latent variables (sources) can
be determined.
It is well-known that any continuous smooth enough mapping can be approximated
with arbitrary accuracy using a mixture-of-gaussians model with sufficiently many
gaussian basis functions [172, 48]. Roughly stated, this provides the theoretical
basis of the GTM method. A fundamental difference of the GTM method compared
with SOM is that GTM is based on a generative approach that starts by assuming
a model for the latent variables, in our case the sources. On the other hand, SOM
A GENERATIVE TOPOGRAPHIC MAPPING APPROACH *
323
tries to separate the sources directly by starting from the data and constructing a
suitable separating signal transformation. A key benefit of GTM is its firm theoretical
foundation which helps to overcome some of the limitations of SOM. This also
provides the basis of generalizing the GTM approach to arbitrary source densities.
Using the basic GTM method instead of SOM for nonlinear blind source separation
does not yet bring out any notable improvement, because the densities of the sources
are still assumed to be uniform. However, it is straightforward to generalize the GTM
method to arbitrary known source densities. The advantage of this approach is that
one can directly regularize the inverse of the mixing mapping by using the known
source densities. This modified GTM method is then used for finding a noncomplex
mixing mapping. This approach is described in the following.
17.4.2 The modified GTM method
-dimensional data
space, which is in our case the mixing mapping of Eq. (17.2), is in GTM modeled as
a linear combination of basis functions
'
j
:
x = f (s)=M'(s) ' ='
1
'
2
::: '
L
]
T
(17.11)
Here
M
is an
n L
matrix of weight parameters.
Denote the node locations in the latent space by
i
. Eq. (17.11) then defines a
corresponding set of reference vectors
m
i
= M'(
i
)
p
x
(x(t) j M) =
K
X
i=1
P (i)p
x
(x j i)
=
K
X
i=1
1
K
2
n=2
exp
2
k m
i
x k
2
of
the basis functions is clearly smaller than the number
K
of node locations and their
respective noise distributions (17.13). In this way, one can avoid overfitting and
prevent the mixing mapping (17.11) to become overly complicated.
The unknown parameters in this model are the weight matrix
M
and the inverse
variance
. These parameters are estimated by fitting the model (17.14) to the
observed data vectors
x(1) x(2)::: x(T )
using the maximum likelihood method
discussed earlier in Section 4.5. The log likelihood function of the observed data is
given by
L(M)=
T
X
t=1
log p
x
(x(t)jM)=
T
X
t=1
log
Z
p
2
::: s
m
are statistically independent, this joint density can be evaluated as the product of the
marginal densities of the individual sources:
p
s
(s)=
m
Y
i=1
p
i
(s
i
)
(17.16)
Each marginal density is here a discrete density defined at the sampling points
corresponding to the locations of the node vectors.
The latent space in the GTM method usually has a small dimension, typically
m =2
. The method can be applied in principle for
m>2
, but its computational
load then increases quite rapidly just like in the SOM method. For this reason, only