18
Methods using Time
Structure
The model of independent component analysis (ICA) that we have considered so
far consists of mixing independent random variables, usually linearly. In many
applications, however, what is mixed is not random variables but time signals, or
time series. This is in contrast to the basic ICA model in which the samples of
have no particular order: We could shuffle them in any way we like, and this would
have no effect on the validity of the model, nor on the estimation methods we have
discussed. If the independent components (ICs) are time signals, the situation is quite
different.
In fact, if the ICs are time signals, they may contain much more structure than sim-
ple random variables. For example, the autocovariances (covariances over different
time lags) of the ICs are then well-defined statistics. One can then use such additional
statistics to improve the estimation of the model. This additional information can
actually make the estimation of the model possible in cases where the basic ICA
methods cannot estimate it, for example, if the ICs are gaussian but correlated over
time.
In this chapter, we consider the estimation of the ICA model when the ICs are
time signals, ,where is the time index. In the previous chapters,
we denoted by the sample index, but here has a more precise meaning, since it
defines an order between the ICs. The model is then expressed by
(18.1)
where is assumed to be square as usual, and the ICs are of course independent. In
contrast, the ICs need not be nongaussian.
In the following, we shall make some assumptions on the time structure of the ICs
that allow for the estimation of the model. These assumptions are alternatives to the
341
Independent Component Analysis. Aapo Hyv
¨
arinen, Juha Karhunen, Erkki Oja
that give decorrelated components. This
is why in basic ICA, we have to use the nongaussian structure of the independent
components, for example, by minimizing the higher-order dependencies as measured
by mutual information.
The key point here is that the information in a time-lagged covariance matrix
could be used instead of the higher-order information [424, 303]. What we do is
to find a matrix so that in addition to making the instantaneous covariances of
go to zero, the lagged covariances are made zero as well:
for all (18.4)
The motivation for this is that for the ICs , the lagged covariances are all zero due
to independence. Using these lagged covariances, we get enough extra information
to estimate the model, under certain conditions specified below. No higher-order
information is then needed.
SEPARATION BY AUTOCOVARIANCES
343
18.1.2 Using one time lag
In the simplest case, we can use just one time lag. Denote by such a time lag, which
is very often taken equal to 1. A very simple algorithm can now be formulated to find
a matrix that cancels both the instantaneous covariances and the ones corresponding
to lag .
Consider whitened data (see Chapter 6), denoted by . Then we have for the
orthogonal separating matrix :
(18.5)
(18.6)
Let us consider a slightly modified version of the lagged covariance matrix as defined
in (18.2), given by
(18.7)
We have by linearity and orthogonality the relation
(18.8)
Due to the independence of the
the eigenvalues are distinct, but this is not always possible: If the signals have
identical power spectra, that is, identical autocovariances, then no value of makes
estimation possible.
18.1.3 Extension to several time lags
An extension of the AMUSE method that improves its performance is to consider
several time lags instead of a single one. Then, it is enough that the covariances for
one of these time lags are different. Thus the choice of is a somewhat less serious
problem.
In principle, using several time lags, we want to simultaneously diagonalize all the
corresponding lagged covariance matrices. It must be noted that the diagonalization
is not possible exactly, since the eigenvectors of the different covariance matrices
are unlikely to be identical, except in the theoretical case where the data is exactly
generated by the ICA model. So here we formulate functions that express the degree
of diagonalization obtained and find its maximum.
One simple way of measuring the diagonality of a matrix is to use the operator
off (18.10)
which gives the sum of squares of the off-diagonal elements .Whatwenow
want to do is to minimize the sum of the off-diagonal elements of several lagged
covariances of . As before, we use the symmetric version of the lagged
covariance matrix. Denote by the set of the chosen lags . Then we can write this
as an objective function :
off (18.11)
Minimizing under the constraint that is orthogonal gives us the estimation
method. This minimization could be performed by (projected) gradient descent.
Another alternative is to adapt the existing methods for eigenvalue decomposition to
this simultaneous approximate diagonalization of several matrices. The algorithm
called SOBI (second-order blind identification) [43] is based on these principles, and
so is TDSEP [481].
SEPARATION BY AUTOCOVARIANCES
345
trace trace .
346
METHODS USING TIME STRUCTURE
in common is concavity, so one might speculate that many other concave functions
could be used as well.
The gradient of can be evaluated as
(18.18)
with
diag (18.19)
Thus we obtain the gradient descent algorithm
(18.20)
Here,
should be orthogonalized after every iteration. Moreover, care must
be taken so that in the inverse in (18.19), very small entries do not cause numerical
problems. A very similar gradient descent can be obtained for (18.12), the main
difference being the scalar function in the definition of .
Thus we obtain an algorithm that estimates based on autocorrelations with
several time lags. This gives a simpler alternative to methods based on joint approx-
imative diagonalization. Such an extension allows estimation of the model in some
cases where the simple method using a single time lag fails. The basic limitation
cannot be avoided, however: if the ICs have identical autocovariances (i.e., identical
power spectra), they cannot be estimated by the methods using time-lagged covari-
ances only. This is in contrast to ICA using higher-order information, where the
independent components are allowed to have identical distributions.
Further work on using autocovariances for source separation can be found in
[11, 6, 106]. In particular, the optimal weighting of different lags has be considered
in [472, 483].
18.2 SEPARATION BY NONSTATIONARITY OF VARIANCES
An alternative approach to using the time structure of the signals was introduced in
[296], where it was shown that ICA can be performed by using the nonstationarity
we find a matrix so that the components of are uncorrelated at every
time point t, we have estimated the ICs. Note that due to nonstationarity, the covari-
ance of depends on , and thus if we force the components to be uncorrelated
for every , we obtain a much stronger condition than simple whitening.
The (local) uncorrelatedness of could be measured using the same measures
of diagonality as used in Section 18.1.3. We use here a measure based on (18.14):
(18.21)
The subscript in the expectation emphasizes that the signal is nonstationary, and the
expectation is the expectation around the time point . This function is minimized by
the separating matrix .
348
METHODS USING TIME STRUCTURE
Expressing this as a function of we obtain
(18.22)
Note that the term does not depend on at all. Furthermore,
to take into account all the time points, we sum the values of in different time
points, and obtain the objective function
const.
(18.23)
As usual, we can whiten the data to obtain whitened data , and force the separating
matrix to be orthogonal. Then the objective function simplifies to
const.
(18.24)
Thus we can compute the gradient of as
diag
(18.25)
The question is now: How to estimate the local variances ?We
cannot simply use the sample variances, due to nonstationarity, which leads to de-
pendence between these variances and the . Instead, we have to use some local
estimates at time point . A natural thing to do is to assume that the variance changes
for using nonstationarity will be considered next.
18.2.2 Using cross-cumulants
Nonlinear autocorrelations
A second method of using nonstationarity is based
on interpreting variance nonstationarity in terms of higher-order cross-cumulants.
Thus we obtain a very simple criterion that expresses nonstationarity of variance.
To see how this works, consider the energy (i.e., squared amplitude) of the signal
in Fig. 18.1. The energies of the initial 1000 time points are shown in Fig. 18.2.
What is clearly visible is that the energies are correlated in time. This is of course a
consequence of the assumption that the variance changes smoothly in time.
Before proceeding, note that the nonstationarity of a signal depends on the time-
scale and the level of the detail in the model of the signal. If the nonstationarity of
the variance is incorporated in the model (by hidden Markov models, for example),
the signal no longer needs to be considered nonstationary [370]. This is the approach
that we choose in the following. In particular, the energies are not considered
nonstationary, but rather they are considered as stationary signals that are time-
correlated. This is simply a question of changing the viewpoint.
So, we could measure the variance nonstationarity of a signal
using a measure based on the time-correlation of energies: where
is some lag constant, often equal to one. For the sake of mathematical simplicity, it
is often useful to use cumulants instead of such basic higher-order correlations. The
350
METHODS USING TIME STRUCTURE
cumulant corresponding to the correlation of energies is given by the fourth-order
cross cumulant
cum
(18.28)
This could be considered as a normalized version of the cross-correlation of energies.
In our case, where the variances are changing smoothly, this cumulant is positive
because the first term dominates the two normalizing terms.
Note that this statement requires that we identify nonstationarity with the energy correlations, which may
or may not be meaningful depending on the context.
SEPARATION PRINCIPLES UNIFIED
351
Thus we see that maximization of the nonstationarity, as measured by the cross-
cumulant, of a linear combination of the observed mixtures allows for the estimation
of one IC. This also gives a one-unit approach to source separation by nonstationarity.
A fixed-point algorithm
To maximize the variance nonstationarity as measured
by the cross-cumulant, one can use a fixed-point algorithm derived along the same
lines as the FastICA algorithm for maximizing nongaussianity.
To begin with, let us whiten the data to obtain . Now, using the principle of
fixed-point iteration as in Chapter 8, let us equate to the gradient of the cross-
cumulant of . This gives, after straightforward calculations, the following
update for :
(18.31)
where we have multiplied the gradient by for notational simplicity, and the matrix
is equal to , as above. The algorithm
thus consists of iteratively computing the new value of according to (18.31), and
normalizing to unit norm after every step.
The convergence of the algorithm can be proven to be cubic, i.e., very fast. A
detailed proof can be constructed as with kurtosis. Let us only note here that if
the algorithm is expressed with respect to the transformed variable , which can be
simply obtained by computing the gradient of (18.29), we have
cum (18.32)
followed by normalization of the norm of . This can be easily seen to lead to
convergence of to a vector where only one of the is nonzero. The index for
which will be nonzero depends on the initial value of .
Thus we have obtained a fast fixed-point algorithm for separating ICs by non-
stationarity, using cross-cumulants. This gives an alternative for the algorithm in
time-dependencies. The basic methods do not use the time structure, but this does
not mean that they would be disturbed by such structure. It must be noted, though,
that the basic ICA methods may then be very far from optimal, since they do not use
the whole structure of the data.
18.3.2 Kolmogoroff complexity as a unifying framework
It is also possible to combine different types of information. For example, methods
that combine nongaussianity with autocorrelations were proposed in [202, 312]. What
is interesting is that one can formulate a general framework that encompasses all these
different principles. This framework that includes basic ICA and methods using time
structure was proposed by Pajunen [342, 343], based on the information-theoretic
concept of Kolmogoroff complexity.
As has been argued in Chapters 8 and 10, ICA can be seen as a method of finding
a transformation into components that are somehow structured. It was argued that
nongaussianity is a measure of structure. Nongaussianity can be measured by the
information-theoretic concept of entropy.
Entropy of a random variable measures the structure of its marginal distribution
only. On the other hand, in this section we have been dealing with time signals that
have different kinds of time structure, like autocorrelations and nonstationarity. How
could one measure such a more general type of structure using information-theoretic
criteria? The answer lies in Kolmogoroff complexity. One can define a very general
form of linear ICA by finding projections of the data that have low complexity in this
sense. First, let us look at the definition of Kolmogoroff complexity.
Definition of Kolmogoroff complexity
The information-theoretic measures
of structure are based on the interpretation of coding length as structure. Suppose
SEPARATION PRINCIPLES UNIFIED
353
that we want to code a signal . For simplicity, let us assume that
the signal is binary, so that every value is 0 or 1. We can code such a signal
so that every bit in the code gives the value of for one ; this is the obvious
Furthermore, in [344] it was shown how to approximate by criteria using the
time structure of the signals, which proves that Kolmogoroff complexity gives a gen-
eralization of methods using time correlations as well. (More on this connection can
be found in Section 23.3.)
Kolmogoroff complexity is a rather theoretical measure, since its computation
involves finding the best coding scheme for the signal. The number of possible
coding schemes is infinite, so this optimization cannot be performed in practice with
much accuracy. In some special cases of Kolmogoroff complexity the optimization
can be performed rather accurately however. These include the above-mentioned
cases of mutual information and time-correlations.
354
METHODS USING TIME STRUCTURE
18.4 CONCLUDING REMARKS
In addition to the fundamental assumption of independence, another assumption that
assures that the signals have enough structure is needed for successful separation of
signals. This is because the ICA model cannot be estimated for gaussian random
variables. In basic ICA, which was treated in Part II, the assumption was the
nongaussianity of the ICs. In this chapter, we used the alternative assumption that
the ICs are time signals that have some time dependencies. Here we have at least
two cases in which separation is possible. First, the case where the signals have
different power spectra, i.e., different autocovariance functions. Second, the case
where they have nonstationary variances. All these assumptions can be considered
in the unifying framework of Kolmogoroff complexity: They can all be derived as
special cases of complexity minimization.