14
Overview and Comparison
of Basic ICA Methods
In the preceding chapters, we introduced several different estimation principles and
algorithms for independent component analysis (ICA). In this chapter, we provide
an overview of these methods. First, we show that all these estimation principles
are intimately connected, and the main choices are between cumulant-based vs.
negentropy/likelihood-based estimation methods, and between one-unit vs. multi-
unit methods. In other words, one must choose the nonlinearity and the decorrelation
method. We discuss the choice of the nonlinearity from the viewpoint of statistical
theory. In practice, one must also choose the optimization method. We compare the
algorithms experimentally, and show that the main choice here is between on-line
(adaptive) gradient algorithms vs. fast batch fixed-point algorithms.
At the end of this chapter, we provide a short summary of the whole of Part II,
that is, of basic ICA estimation.
14.1 OBJECTIVE FUNCTIONS VS. ALGORITHMS
A distinction that has been used throughout this book is between the formulation of
the objective function, and the algorithm used to optimize it. One might express this
in the following “equation”:
ICA method
=
objective function
+
optimization algorithm
:
In the case of explicitly formulated objective functions, one can use any of the
classic optimization methods, for example, (stochastic) gradient methods and Newton
273
Independent Component Analysis. Aapo Hyv
¨
arinen, Juha Karhunen, Erkki Oja
14.2.1 Similarities between estimation principles
Mutual information gives a convenient starting point for showing the similarity be-
tween different estimation principles. We have for an invertible linear transformation
y = Bx
:
I (y
1
y
2
:::y
n
)=
X
i
H (y
i
) H (x) log j det Bj
(14.1)
If we constrain the
y
i
to be uncorrelated and of unit variance, the last term on the
right-hand side is constant; the second term does not depend on
B
anyway (see
Chapter 10). Recall that entropy is maximized by a gaussian distribution, when
variance is kept constant (Section 5.3). Thus we see that minimization of mutual
information means maximizing the sum of the nongaussianities of the estimated
components. If these entropies (or the corresponding negentropies) are approximated
by the approximations used in Chapter 8, we obtain the same algorithms as in that
(y )
2
(14.2)
where the first term could be omitted, leaving just the term containing kurtosis.
Likewise, cumulants could be used to approximate mutual information, since mutual
information is based on entropy. More explicitly, we could consider the following
approximation of mutual information:
I (y) c
1
c
2
X
i
kurt
(y
i
)
2
(14.3)
where
c
1
and
c
2
are some constants. This shows clearly the connection between
cumulants and minimization of mutual information. Moreover, the tensorial methods
in Chapter 11 were seen to lead to the same fixed-point algorithm as the maximization
of nongaussianity as measured by kurtosis,which shows that they are doing very much
the same thing as the other kurtosis-based methods.
Thus, from a statistical viewpoint, the choice of estimation method is more or less
reduced to the choice of the nonquadratic function
G
that gives information on the
higher-order statistics in the form of the expectation
E fG(b
T
i
x)g
. In the algorithms,
this choice corresponds to the choice of the nonlinearity
g
that is the derivative of
G
.
In this section, we analyze the statistical properties of different nonlinearities. This
is based on the family of approximations of negentropy given in (8.25). This family
includes kurtosis as well. For simplicity, we consider here the estimation of just one
IC, given by maximizing this nongaussianity measure. This is essentially equivalent
to the problem
max
E f(b
T
x)
2
g=1
E fG(b
T
x)g
(14.4)
T
as
T !1
. This gives an approximation of the mean-square error of
b
b
, as was already
STATISTICALLY OPTIMAL NONLINEARITIES
277
discussed in Chapter 4. Comparison of, say, the traces of the asymptotic variances of
two estimators enables direct comparison of the accuracy of two estimators. One can
solve analytically for the asymptotic variance of
b
b
, obtaining the following theorem
[193]:
Theorem 14.1 The trace of the asymptotic variance of
b
b
as defined above for the
estimation of the independent component
s
i
, equals
V
G
= C (A)
E fg
2
(s
nonquadratic functions
G
boils down to a comparison of the
V
G
. In particular, one
can use variational calculus to find a
G
that minimizes
V
G
. Thus one obtains the
following theorem [193]:
Theorem 14.2 The trace of the asymptotic variance of
b
b
is minimized when
G
is of
the form
G
opt
(y )=c
1
log p
i
(y )+c
2
y
2
approach.
14.3.2 Comparison of robustness *
Another very desirable property of an estimator is robustness against outliers. This
means that single, highly erroneous observations do not have much influence on the
estimator. In this section, we shall treat the question: How does the robustness of
the estimator
^
b
depend on the choice of the function
G
? The main result is that the
function
G(y )
should not grow fast as a function of
jy j
if we want robust estimators.
In particular, this means that kurtosis gives nonrobust estimators, which may be very
disadvantagous in some situations.
1
One has to take into account, however, that in the definition of negentropy, the nonquadratic function is
not fixed in advance, whereas in our nongaussianity measures,
G
is fixed. Thus, the statistical properties
of negentropy can be only approximatively derived from our analysis.
278
OVERVIEW AND COMPARISON OF BASIC ICA METHODS
First, note that the robustness of
^
b
depends also on the method of estimation used
^
w
can be analyzed using the theory of M-
estimators. Without going into technical details, the definition of an M-estimator
can be formulated as follows: an estimator is called an M-estimator if it is defined as
the solution
^
for
of
E f (z)g =0
(14.7)
where
z
is a random vector and
is some function defining the estimator. Now, the
point is that the estimator
^
w
is an M-estimator. To see this, define
=(w)
,where
is the Lagrangian multiplier associated with the constraint. Using the Lagrange
conditions, the estimator
^
w
can then be formulated as the solution of Eq. (14.7) where
the influence of single observations on the estimator. It would be desirable to have
an influence function that is bounded as a function of
z
,asthisimpliesthateven
the influence of a far-away outlier is “bounded”, and cannot change the estimate
too much. This requirement leads to one definition of robustness, which is called
B-robustness. An estimator is called B-robust, if its influence function is bounded
as a function of
z
, i.e.,
sup
z
kIF (z
^
)k
is finite for every
^
. Even if the influence
function is not bounded, it should grow as slowly as possible when
kzk
grows, to
reduce the distorting effect of outliers.
It can be shown that the influence function of an M-estimator equals
IF (z
^
)=B (z
^
)
(14.9)
h(w
T
z)+C
3
(14.10)
where
C
1
C
2
C
3
are constants that do not depend on
z
,and
h(y )=yg(y )
. Thus we
see that the robustness of
^
w
essentially depends on the behavior of the function
h(u)
.