Part I MATHEMATICAL PRELIMINARIES
Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja
Copyright 2001 John Wiley & Sons, Inc.
ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)
2 Random Vectors and Independence
In this chapter, we review central concepts of probability theory, statistics, and random
processes. The emphasis is on multivariate statistics and random vectors. Matters
that will be needed later in this book are discussed in more detail, including, for
example, statistical independence and higher-order statistics. The reader is assumed
to have basic knowledge of single-variable probability theory, so that fundamental
definitions such as probability, elementary events, and random variables are familiar.
Readers who already have a good knowledge of multivariate statistics can skip most
of this chapter. For those who need a more extensive review or more information on
advanced matters, many good textbooks exist, ranging from elementary ones to advanced
treatments. A widely used textbook covering probability, random variables, and
stochastic processes is [353].
2.1 PROBABILITY DISTRIBUTIONS AND DENSITIES
2.1.1 Distribution of a random variable
In this book, we assume that random variables are continuous-valued unless stated
otherwise. The cumulative distribution function (cdf) $F_x$ of a random variable
Fig. 2.1 A gaussian probability density function with mean $m$ and standard deviation $\sigma$.
$0 \le F_x(x) \le 1$. From the definition, it also follows directly that $F_x(-\infty) = 0$, and $F_x(+\infty) = 1$.
Usually a probability distribution is characterized in terms of its density function
$$\int_{-\infty}^{x_0} p_x(\xi)\, d\xi \qquad (2.3)$$
For simplicity, $F_x(x)$ is often denoted by $F(x)$ and $p_x(x)$ by $p(x)$, respectively. The
subscript referring to the random variable in question must be used when confusion
is possible.
Example 2.1 The gaussian (or normal) probability distribution is used in numerous
models and applications, for example to describe additive noise. Its density function
is given by
$$p_x(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-m)^2}{2\sigma^2}\right)$$
$x_0 \to \infty$
. However, the values of the
cdf can be computed numerically using, for example, tabulated values of the error
function
$$\mathrm{erf}(x) = \frac{1}{\sqrt{2\pi}} \int_0^x \exp\left(-\frac{\xi^2}{2}\right) d\xi \qquad (2.5)$$
The error function is closely related to the cdf of a normalized gaussian density, for
which the mean $m = 0$ and the variance $\sigma^2 = 1$. See [353] for details.
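The relation between the gaussian cdf and the error function can be tried out numerically. The following is a minimal sketch, not the book's procedure: it assumes Python's standard `math.erf`, which uses the convention $\mathrm{erf}(t) = \frac{2}{\sqrt{\pi}}\int_0^t e^{-u^2}\,du$ and is therefore scaled differently from Eq. (2.5). The function names and the midpoint-rule check are illustrative choices.

```python
import math

def gaussian_pdf(x, m=0.0, sigma=1.0):
    """Gaussian density with mean m and standard deviation sigma."""
    z = (x - m) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_cdf(x0, m=0.0, sigma=1.0):
    """Gaussian cdf expressed through math.erf (standard convention,
    which differs in scaling from the error function of Eq. (2.5))."""
    return 0.5 * (1.0 + math.erf((x0 - m) / (sigma * math.sqrt(2.0))))

def cdf_by_quadrature(x0, m=0.0, sigma=1.0, steps=20000):
    """Midpoint-rule approximation of the integral of the pdf up to x0.
    Assumes x0 lies above m - 10*sigma, below which the mass is negligible."""
    lower = m - 10.0 * sigma
    h = (x0 - lower) / steps
    return h * sum(gaussian_pdf(lower + (k + 0.5) * h, m, sigma)
                   for k in range(steps))

if __name__ == "__main__":
    # The two values should agree to several decimal places.
    print(gaussian_cdf(1.3, m=1.0, sigma=2.0))
    print(cdf_by_quadrature(1.3, m=1.0, sigma=2.0))
```

With the scaling of Eq. (2.5), the same cdf for $m = 0$, $\sigma^2 = 1$ would read $F(x) = \frac{1}{2} + \mathrm{erf}(x)$.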
In particular, the cumulative distribution function of $\mathbf{x}$ is defined by
$$F_\mathbf{x}(\mathbf{x}_0) = P(\mathbf{x} \le \mathbf{x}_0) \qquad (2.7)$$
where $P(\cdot)$ again denotes the probability of the event in parentheses, and $\mathbf{x}_0$ is
some constant value of the random vector $\mathbf{x}$. The notation $\mathbf{x} \le \mathbf{x}_0$ means that each
component of the vector $\mathbf{x}$ is less than or equal to the respective component of the
vector $\mathbf{x}_0$. The multivariate cdf in Eq. (2.7) has similar properties to that of a single
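$F_\mathbf{x}(\mathbf{x})$ with respect to all components of the random vector $\mathbf{x}$:
$$p_\mathbf{x}(\mathbf{x}_0) = \left. \frac{\partial}{\partial x_1} \frac{\partial}{\partial x_2} \cdots \frac{\partial}{\partial x_n} F_\mathbf{x}(\mathbf{x}) \right|_{\mathbf{x} = \mathbf{x}_0}$$
$$F_\mathbf{x}(\mathbf{x}_0) = \int_{-\infty}^{x_{0,1}} \int_{-\infty}^{x_{0,2}} \cdots \int_{-\infty}^{x_{0,n}} p_\mathbf{x}(\mathbf{x})\, dx_n \cdots dx_2\, dx_1 \qquad (2.9)$$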
where $x_{0,i}$ is the $i$th component of the vector $\mathbf{x}_0$. Clearly,
$$\int_{-\infty}^{+\infty} p_\mathbf{x}(\mathbf{x})\, d\mathbf{x} = 1 \qquad (2.10)$$
This provides the appropriate normalization condition that a true multivariate probability
density $p_\mathbf{x}(\mathbf{x})$
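$x \le 0$ or $y \le 0$, the density $p_\mathbf{z}(\mathbf{z})$ and consequently
also the cdf is zero. In the region where $0 < x \le 2$ and $0 < y \le 1$, the cdf is given
by
$$F_\mathbf{z}(\mathbf{z}) = F_{x,y}(x, y) = \int_0^y \int_0^x \frac{3}{7}(2 - \xi)(\xi + \eta)\, d\xi\, d\eta = \frac{3}{7}\, x y \left( x + y - \frac{1}{3} x^2 - \frac{1}{4} x y \right)$$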
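$x > 2$ and $y > 1$, the
cdf becomes unity, showing that the probability density $p_\mathbf{z}(\mathbf{z})$ has been normalized
correctly. Collecting these results yields
$$F_\mathbf{z}(\mathbf{z}) = \begin{cases}
0, & x \le 0 \text{ or } y \le 0 \\
\frac{3}{7}\, x y \left( x + y - \frac{1}{3} x^2 - \frac{1}{4} x y \right), & 0 < x \le 2,\ 0 < y \le 1 \\
\frac{3}{7}\, x \left( 1 + \frac{3}{4} x - \frac{1}{3} x^2 \right), & 0 < x \le 2,\ y > 1 \\
\frac{6}{7}\, y \left( \frac{2}{3} + \frac{1}{2} y \right), & x > 2,\ 0 < y \le 1 \\
1, & x > 2 \text{ and } y > 1
\end{cases}$$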
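The cdf of this example can be checked numerically. The sketch below assumes the joint density $\frac{3}{7}(2-x)(x+y)$ on $[0,2] \times [0,1]$ that appears in the integrand above, and compares the closed-form cdf with a midpoint-rule double integral. The function names and the grid size are illustrative assumptions.

```python
def density(x, y):
    """Joint density of the example: (3/7)(2 - x)(x + y) on [0,2] x [0,1]."""
    if 0.0 <= x <= 2.0 and 0.0 <= y <= 1.0:
        return 3.0 / 7.0 * (2.0 - x) * (x + y)
    return 0.0

def cdf_closed_form(x, y):
    """Closed-form cdf; clamping to the support reproduces every case,
    because the density vanishes outside the rectangle."""
    x = min(max(x, 0.0), 2.0)
    y = min(max(y, 0.0), 1.0)
    return 3.0 / 7.0 * x * y * (x + y - x * x / 3.0 - x * y / 4.0)

def cdf_numeric(x, y, n=400):
    """Midpoint-rule approximation of the double integral of the density
    over [0, x] x [0, y], with the limits clamped to the support."""
    x = min(max(x, 0.0), 2.0)
    y = min(max(y, 0.0), 1.0)
    hx, hy = x / n, y / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += density((i + 0.5) * hx, (j + 0.5) * hy)
    return total * hx * hy

if __name__ == "__main__":
    print(cdf_closed_form(1.2, 0.7), cdf_numeric(1.2, 0.7))
    print(cdf_closed_form(3.0, 5.0))  # both arguments beyond the support
```

Evaluating the closed form at $x = 2$, $y = 1$ gives unity, which is the normalization property stated in the text.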
2.1.3 Joint and marginal distributions
The joint distribution of two different random vectors can be handled in a similar
manner. In particular, let $\mathbf{y}$ be another random vector having in general a dimension
$m$ different from the dimension $n$ of $\mathbf{x}$. The vectors $\mathbf{x}$ and $\mathbf{y}$ can be concatenated to
a "supervector" $\mathbf{z}^T = (\mathbf{x}^T, \mathbf{y}^T)$
respectively, and Eq. (2.11) defines the joint probability of the event
$\mathbf{x} \le \mathbf{x}_0$ and $\mathbf{y} \le \mathbf{y}_0$.
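The joint density function $p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y})$ of $\mathbf{x}$ and $\mathbf{y}$ is again defined formally by
differentiating the joint distribution function $F_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y})$ with respect to all components
of the random vectors $\mathbf{x}$ and $\mathbf{y}$. Hence, the relationship
$$F_{\mathbf{x},\mathbf{y}}(\mathbf{x}_0, \mathbf{y}_0) = \int_{-\infty}^{\mathbf{x}_0} \int_{-\infty}^{\mathbf{y}_0} p_{\mathbf{x},\mathbf{y}}(\boldsymbol{\xi}, \boldsymbol{\eta})\, d\boldsymbol{\eta}\, d\boldsymbol{\xi}$$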
and $p_\mathbf{y}(\mathbf{y})$ of $\mathbf{y}$ are obtained by integrating
over the other random vector in their joint density $p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y})$:
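$$p_\mathbf{x}(\mathbf{x}) = \int_{-\infty}^{\infty} p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \boldsymbol{\eta})\, d\boldsymbol{\eta} \qquad (2.13)$$
$$p_\mathbf{y}(\mathbf{y}) = \int_{-\infty}^{\infty} p_{\mathbf{x},\mathbf{y}}(\boldsymbol{\xi}, \mathbf{y})\, d\boldsymbol{\xi}$$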
$$p_y(y) = \int_0^2 \frac{3}{7}(2 - x)(x + y)\, dx, \qquad y \in [0, 1]$$
$$= \begin{cases} \frac{2}{7}(2 + 3y), & y \in [0, 1] \\ 0 & \text{elsewhere} \end{cases}$$
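The intermediate steps of this integration are not shown above. A worked version, using only the density $\frac{3}{7}(2-x)(x+y)$ from the integrand, might read as follows; the final line is an added sanity check that the marginal density integrates to one.

```latex
\begin{align*}
p_y(y) &= \frac{3}{7}\int_0^2 \left(2x + 2y - x^2 - xy\right) dx
        = \frac{3}{7}\left[x^2 + 2xy - \frac{x^3}{3} - \frac{x^2 y}{2}\right]_{x=0}^{x=2} \\
       &= \frac{3}{7}\left(4 + 4y - \frac{8}{3} - 2y\right)
        = \frac{3}{7}\left(\frac{4}{3} + 2y\right)
        = \frac{2}{7}\,(2 + 3y), \qquad y \in [0, 1], \\
\int_0^1 p_y(y)\, dy &= \frac{2}{7}\left(2 + \frac{3}{2}\right) = 1 .
\end{align*}
```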
2.2 EXPECTATIONS AND MOMENTS
2.2.1 Definition and general properties
In practice, the exact probability density function of a vector- or scalar-valued random
variable is usually unknown. However, one can instead use expectations of some
functions of that random variable for performing useful analyses and processing. A
great advantage of expectations is that they can be estimated directly from the data,
even though they are formally defined in terms of the density function.
Let $\mathbf{g}(\mathbf{x})$ denote any quantity derived from the random vector $\mathbf{x}$. The quantity $\mathbf{g}(\mathbf{x})$
1. Linearity. Let $\mathbf{x}_i$, $i = 1, \ldots, m$ be a set of different random vectors, and
$a_i$, $i = 1, \ldots, m$, some nonrandom scalar coefficients. Then
$$E\left\{ \sum_{i=1}^{m} a_i \mathbf{x}_i \right\} = \sum_{i=1}^{m} a_i E\{\mathbf{x}_i\}$$
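$$\int_{-\infty}^{\infty} \mathbf{y}\, p_\mathbf{y}(\mathbf{y})\, d\mathbf{y} = \int_{-\infty}^{\infty} \mathbf{g}(\mathbf{x})\, p_\mathbf{x}(\mathbf{x})\, d\mathbf{x} \qquad (2.18)$$
Thus $E\{\mathbf{y}\} = E\{\mathbf{g}(\mathbf{x})\}$, even though the integrations are carried out over
different probability density functions.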
These properties can be proved using the definition of the expectation operator
and properties of probability density functions. They are important and very helpful
in practice, allowing expressions containing expectations to be simplified without
actually needing to compute any integrals (except possibly in the last phase).
2.2.2 Mean vector and correlation matrix
Moments of a random vector $\mathbf{x}$ are typical expectations used to characterize it. They
are obtained when $\mathbf{g}(\mathbf{x})$ consists of products of components of $\mathbf{x}$
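of the $n$-vector $\mathbf{m}_\mathbf{x}$ is given by
$$m_{x_i} = E\{x_i\} = \int_{-\infty}^{\infty} x_i\, p_\mathbf{x}(\mathbf{x})\, d\mathbf{x} = \int_{-\infty}^{\infty} x_i\, p_{x_i}(x_i)\, dx_i$$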
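between the $i$th and $j$th component of $\mathbf{x}$ is given
by the second moment
$$r_{ij} = E\{x_i x_j\} = \int_{-\infty}^{\infty} x_i x_j\, p_\mathbf{x}(\mathbf{x})\, d\mathbf{x} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x_i x_j\, p_{x_i,x_j}(x_i, x_j)\, dx_j\, dx_i$$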
$$\mathbf{R}_\mathbf{x} = E\{\mathbf{x}\mathbf{x}^T\} \qquad (2.22)$$
of the vector $\mathbf{x}$ represents in a convenient form all its correlations, $r_{ij}$ being the
element in row $i$ and column $j$ of $\mathbf{R}_\mathbf{x}$.
The correlation matrix $\mathbf{R}_\mathbf{x}$ has some important properties:
1. It is a symmetric matrix: $\mathbf{R}_\mathbf{x} = \mathbf{R}_\mathbf{x}^T$.
2. It is positive semidefinite:
postponed to Section 2.7. Instead, we shall first consider the corresponding central and
second-order moments for two different random vectors.
2.2.3 Covariances and joint moments
Central moments are defined in a similar fashion to usual moments, but the mean
vectors of the random vectors involved are subtracted prior to computing the
expectation. Clearly, central moments are only meaningful above the first order. The
quantity corresponding to the correlation matrix $\mathbf{R}_\mathbf{x}$ is called the covariance matrix
$\mathbf{C}_\mathbf{x}$ of $\mathbf{x}$, and is given by
$$\mathbf{C}_\mathbf{x} = E\{(\mathbf{x} - \mathbf{m}_\mathbf{x})(\mathbf{x} - \mathbf{m}_\mathbf{x})^T\} \qquad (2.24)$$
The elements
$\mathbf{x}$. Using the properties of the expectation operator, it is easy to see that
$$\mathbf{R}_\mathbf{x} = \mathbf{C}_\mathbf{x} + \mathbf{m}_\mathbf{x} \mathbf{m}_\mathbf{x}^T \qquad (2.26)$$
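Relation (2.26) also holds exactly for sample estimates, provided the averages use the same $1/K$ normalization. The following minimal sketch illustrates this on synthetic two-dimensional data; the data model, the sample size, and the helper names `sample_mean` and `average_outer` are assumptions made for the illustration.

```python
import random

def sample_mean(samples):
    """Component-wise average of a list of equal-length vectors."""
    k = len(samples)
    return [sum(col) / k for col in zip(*samples)]

def average_outer(samples, shift):
    """Average of (x - shift)(x - shift)^T over the sample (1/K normalization)."""
    k, n = len(samples), len(shift)
    acc = [[0.0] * n for _ in range(n)]
    for x in samples:
        d = [xi - si for xi, si in zip(x, shift)]
        for i in range(n):
            for j in range(n):
                acc[i][j] += d[i] * d[j] / k
    return acc

random.seed(0)
# Hypothetical data: nonzero mean and correlated components.
samples = []
for _ in range(5000):
    u, v = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    samples.append([1.0 + u, -2.0 + 0.5 * u + v])

m = sample_mean(samples)
R = average_outer(samples, [0.0, 0.0])  # estimate of the correlation matrix
C = average_outer(samples, m)           # estimate of the covariance matrix
R_from_C = [[C[i][j] + m[i] * m[j] for j in range(2)] for i in range(2)]

if __name__ == "__main__":
    print(R)
    print(R_from_C)  # agrees with R up to rounding error
```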
If the mean vector $\mathbf{m}_\mathbf{x} = \mathbf{0}$, the correlation and covariance matrices become the
same. If necessary, the data can easily be made zero mean by subtracting the
(estimated) mean vector from the data vectors as a preprocessing step. This is
common practice in independent component analysis, and thus in later chapters, we simply
denote by $\mathbf{C}_\mathbf{x}$ the correlation/covariance matrix, often even dropping the subscript
$\mathbf{x}$ for simplicity.
For a single random variable $x$, the mean vector reduces to its mean value $m$
$+\, m_x^2$.
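The expectation operation can be extended for functions $\mathbf{g}(\mathbf{x}, \mathbf{y})$ of two different
random vectors $\mathbf{x}$ and $\mathbf{y}$ in terms of their joint density:
$$E\{\mathbf{g}(\mathbf{x}, \mathbf{y})\} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \mathbf{g}(\mathbf{x}, \mathbf{y})\, p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y})\, d\mathbf{y}\, d\mathbf{x} \qquad (2.28)$$
The integrals are computed over all the components of $\mathbf{x}$ and $\mathbf{y}$.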
Of the joint expectations, the most widely used are the cross-correlation matrix
Fig. 2.2 An example of negative covariance between the random variables $x$ and $y$.
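$$\mathbf{R}_{\mathbf{x}\mathbf{y}} = \mathbf{R}_{\mathbf{y}\mathbf{x}}^T, \qquad \mathbf{C}_{\mathbf{x}\mathbf{y}} = \mathbf{C}_{\mathbf{y}\mathbf{x}}^T \qquad (2.31)$$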
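If the mean vectors of $\mathbf{x}$ and $\mathbf{y}$ are zero, the cross-correlation and cross-covariance
matrices become the same. The covariance matrix $\mathbf{C}_{\mathbf{x}+\mathbf{y}}$ of the sum of two random
vectors $\mathbf{x}$ and $\mathbf{y}$ having the same dimension is often needed in practice. It is easy to
see that
$$\mathbf{C}_{\mathbf{x}+\mathbf{y}} = \mathbf{C}_\mathbf{x} + \mathbf{C}_{\mathbf{x}\mathbf{y}} + \mathbf{C}_{\mathbf{y}\mathbf{x}} + \mathbf{C}_\mathbf{y}$$
Hence, their covariance $c_{xy}$
$0$.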
2.2.4 Estimation of expectations
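Usually the probability density of a random vector $\mathbf{x}$ is not known, but there is often
available a set of $K$ samples $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_K$ from $\mathbf{x}$. Using them, the expectation
(2.15) can be estimated by averaging over the sample using the formula [419]
$$E\{\mathbf{g}(\mathbf{x})\} \approx \frac{1}{K} \sum_{j=1}^{K} \mathbf{g}(\mathbf{x}_j)$$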
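$p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y})$ of the random vectors $\mathbf{x}$ and $\mathbf{y}$, we know $K$ sample pairs
$(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_K, \mathbf{y}_K)$, we can estimate the
expectation (2.28) by
$$E\{\mathbf{g}(\mathbf{x}, \mathbf{y})\} \approx \frac{1}{K} \sum_{j=1}^{K} \mathbf{g}(\mathbf{x}_j, \mathbf{y}_j)$$
$\mathbf{C}_{\mathbf{x}\mathbf{y}}$.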
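The sample-average estimators of this section can be sketched in a few lines. The example below is an illustration under assumed conditions: the uniform samples, the noise level, and the helper names `estimate_expectation` and `estimate_joint_expectation` are not from the text. For the chosen distributions the exact values are $E\{x^2\} = 1/3$ and $E\{xy\} = 2/3$, so the estimates should land close to them.

```python
import random

def estimate_expectation(g, samples):
    """Sample average (1/K) * sum_j g(x_j)."""
    return sum(g(x) for x in samples) / len(samples)

def estimate_joint_expectation(g, pairs):
    """Sample average (1/K) * sum_j g(x_j, y_j) over sample pairs."""
    return sum(g(x, y) for x, y in pairs) / len(pairs)

random.seed(1)
K = 100000
xs = [random.random() for _ in range(K)]  # x uniform on [0, 1)

# Estimate of the second moment E{x^2}; the exact value is 1/3.
second_moment = estimate_expectation(lambda x: x * x, xs)

# Hypothetical pairs with y = 2x + small zero-mean noise, so E{xy} = 2/3.
pairs = [(x, 2.0 * x + random.gauss(0.0, 0.1)) for x in xs]
cross_moment = estimate_joint_expectation(lambda x, y: x * y, pairs)

if __name__ == "__main__":
    print(second_moment, cross_moment)
```

The accuracy of such estimates improves as $K$ grows, typically at a rate proportional to $1/\sqrt{K}$ for independent samples.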
2.3 UNCORRELATEDNESS AND INDEPENDENCE
2.3.1 Uncorrelatedness and whiteness
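Two random vectors $\mathbf{x}$ and $\mathbf{y}$ are uncorrelated if their cross-covariance matrix
$\mathbf{C}_{\mathbf{x}\mathbf{y}}$ is a zero matrix:
$$\mathbf{C}_{\mathbf{x}\mathbf{y}} = E\{(\mathbf{x} - \mathbf{m}_\mathbf{x})(\mathbf{y} - \mathbf{m}_\mathbf{y})^T\} = \mathbf{0} \qquad (2.37)$$
This is equivalent to the condition
$$\mathbf{R}_{\mathbf{x}\mathbf{y}} = E\{\mathbf{x}\mathbf{y}^T\} = E\{\mathbf{x}\} E\{\mathbf{y}^T\} = \mathbf{m}_\mathbf{x} \mathbf{m}_\mathbf{y}^T$$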
is zero:
$$c_{xy} = E\{(x - m_x)(y - m_y)\} = 0 \qquad (2.39)$$
or equivalently
$$r_{xy} = E\{xy\} = E\{x\} E\{y\} = m_x m_y \qquad (2.40)$$
Again, in the case of zero-mean variables, zero covariance is equivalent to zero
correlation.
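Another important special case concerns the correlations between the components
of a single random vector $\mathbf{x}$
$$\mathbf{C}_\mathbf{x} = \mathrm{diag}(c_{11}, c_{22}, \ldots, c_{nn}) = \mathrm{diag}(\sigma_{x_1}^2, \sigma_{x_2}^2, \ldots, \sigma_{x_n}^2) \qquad (2.42)$$
whose $n$ diagonal elements are the variances $\sigma_{x_i}^2 = E\{(x_i - m_{x_i})^2\}$
$$= \mathbf{I} \qquad (2.43)$$
where $\mathbf{I}$ is the $n \times n$ identity matrix.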
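Assume now that an orthogonal transformation defined by an $n \times n$ matrix $\mathbf{T}$ is
applied to the random vector $\mathbf{x}$. Mathematically, this can be expressed
$$\mathbf{y} = \mathbf{T}\mathbf{x}, \quad \text{where } \mathbf{T}^T \mathbf{T} = \mathbf{T}\mathbf{T}^T = \mathbf{I} \qquad (2.44)$$
An orthogonal matrix $\mathbf{T}$ defines a rotation (change of coordinate axes) in the
$n$-dimensional space, preserving norms and distances. Assuming that $\mathbf{x}$
$$\mathbf{R}_\mathbf{y} = \mathbf{T} \mathbf{R}_\mathbf{x} \mathbf{T}^T = \mathbf{T}\mathbf{T}^T = \mathbf{I} \qquad (2.46)$$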
showing that $\mathbf{y}$ is white, too. Hence we can conclude that the whiteness property is
preserved under orthogonal transformations. In fact, whitening of the original data
can be done in infinitely many ways. Whitening will be discussed in more detail
in Chapter 6, because it is a highly useful and widely used preprocessing step in
independent component analysis.
It is clear that there also exist infinitely many ways to decorrelate the original
data, because whiteness is a special case of the uncorrelatedness property.
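That whiteness survives an orthogonal transformation can be observed on simulated data. The sketch below is an illustration only: it assumes two-dimensional gaussian samples with unit variance as the white input and an arbitrary rotation angle, and the function names are hypothetical. The sample covariance matrices before and after the rotation should both be close to the identity matrix.

```python
import math
import random

def covariance_2d(points):
    """Sample covariance matrix (1/K normalization) of 2-D points."""
    k = len(points)
    mx = sum(p[0] for p in points) / k
    my = sum(p[1] for p in points) / k
    cxx = sum((p[0] - mx) ** 2 for p in points) / k
    cyy = sum((p[1] - my) ** 2 for p in points) / k
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / k
    return [[cxx, cxy], [cxy, cyy]]

def rotate(points, angle):
    """Apply the orthogonal matrix T = [[c, -s], [s, c]] to every point."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

random.seed(2)
# Assumed white input: independent zero-mean, unit-variance components.
white = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(50000)]
rotated = rotate(white, math.pi / 5)  # arbitrary rotation angle

C_white = covariance_2d(white)
C_rot = covariance_2d(rotated)

if __name__ == "__main__":
    print(C_white)
    print(C_rot)  # both are close to the 2 x 2 identity matrix
```

Since any rotation angle gives the same result, the sketch also illustrates why a whitening transformation is determined only up to an orthogonal factor.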
Example 2.5 Consider the linear signal model
$$\mathbf{x} = \mathbf{A}\mathbf{s} + \mathbf{n} \qquad (2.47)$$
where $\mathbf{x}$ is an $n$-dimensional random or data vector, $\mathbf{A}$ an $n \times m$ constant matrix, $\mathbf{s}$ an
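$$\mathbf{R}_\mathbf{x} = E\{\mathbf{x}\mathbf{x}^T\} = E\{(\mathbf{A}\mathbf{s} + \mathbf{n})(\mathbf{A}\mathbf{s} + \mathbf{n})^T\}$$
$$= E\{\mathbf{A}\mathbf{s}\mathbf{s}^T\mathbf{A}^T\} + E\{\mathbf{A}\mathbf{s}\mathbf{n}^T\} + E\{\mathbf{n}\mathbf{s}^T\mathbf{A}^T\} + E\{\mathbf{n}\mathbf{n}^T\}$$
$$= \mathbf{A} E\{\mathbf{s}\mathbf{s}^T\} \mathbf{A}^T + \mathbf{A} E\{\mathbf{s}\mathbf{n}^T\} + E\{\mathbf{n}\mathbf{s}^T\} \mathbf{A}^T + E\{\mathbf{n}\mathbf{n}^T\}$$
$$\mathbf{R}_{\mathbf{s}\mathbf{n}} = E\{\mathbf{s}\mathbf{n}^T\} = E\{\mathbf{s}\} E\{\mathbf{n}^T\} = \mathbf{0} \qquad (2.49)$$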
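Similarly, $\mathbf{R}_{\mathbf{n}\mathbf{s}} = \mathbf{0}$, and the correlation matrix of $\mathbf{x}$ simplifies to
$$\mathbf{R}_\mathbf{x} = \mathbf{A} \mathbf{R}_\mathbf{s} \mathbf{A}^T + \mathbf{R}_\mathbf{n} \qquad (2.50)$$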
Another often made assumption is that the noise is white, which means here that
the components of the noise vector $\mathbf{n}$ are all uncorrelated and have equal variance
$\sigma^2$, so that in (2.50) $\mathbf{R}_\mathbf{n} = \sigma^2 \mathbf{I}$
(2.52)
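where $s_1, s_2, \ldots, s_m$ are components of the signal vector $\mathbf{s}$. Then (2.50) can be
written in the form
$$\mathbf{R}_\mathbf{x} = \mathbf{A} \mathbf{D}_\mathbf{s} \mathbf{A}^T + \sigma^2 \mathbf{I} = \sum_{i=1}^{m} E\{s_i^2\}\, \mathbf{a}_i \mathbf{a}_i^T + \sigma^2 \mathbf{I}$$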
and $y$. The random variable $x$ is independent of $y$, if knowing the
value of $y$ does not give any information on the value of $x$. For example, $x$ and $y$ can
be outcomes of two events that have nothing to do with each other, or random signals
originating from two quite different physical processes that are in no way related to
each other. Examples of such independent random variables are the value of a die
thrown and of a coin tossed, or a speech signal and background noise originating from
a ventilation system at a certain time instant.
Mathematically, statistical independence is defined in terms of probability
densities. The random variables $x$ and $y$ are said to be independent if and only if
$$p_{x,y}(x, y) = p_x(x)\, p_y(y)$$
$$E\{g(x) h(y)\} = E\{g(x)\}\, E\{h(y)\} \qquad (2.55)$$
where $g(x)$ and $h(y)$ are any absolutely integrable functions of $x$ and $y$, respectively.
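This is because
$$E\{g(x) h(y)\} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x) h(y)\, p_{x,y}(x, y)\, dy\, dx \qquad (2.56)$$
$$= \int_{-\infty}^{\infty} g(x)\, p_x(x)\, dx \int_{-\infty}^{\infty} h(y)\, p_y(y)\, dy = E\{g(x)\}\, E\{h(y)\}$$
$\mathbf{x}, \mathbf{y}, \mathbf{z}, \ldots$ is then
$$p_{\mathbf{x},\mathbf{y},\mathbf{z},\ldots}(\mathbf{x}, \mathbf{y}, \mathbf{z}, \ldots) = p_\mathbf{x}(\mathbf{x})\, p_\mathbf{y}(\mathbf{y})\, p_\mathbf{z}(\mathbf{z}) \cdots \qquad (2.57)$$
and the basic property (2.55) generalizes to
$$E\{g_\mathbf{x}(\mathbf{x})\, g_\mathbf{y}(\mathbf{y})\, g_\mathbf{z}(\mathbf{z}) \cdots\} = E\{g_\mathbf{x}(\mathbf{x})\}\, E\{g_\mathbf{y}(\mathbf{y})\}\, E\{g_\mathbf{z}(\mathbf{z})\} \cdots$$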
and $\mathbf{z}$. Clearly, the components
of $\mathbf{x}$ can be mutually dependent, while they are independent with respect to the
components of the other random vectors $\mathbf{y}$ and $\mathbf{z}$, and (2.57) still holds. A similar
argument applies to the random vectors $\mathbf{y}$ and $\mathbf{z}$.
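The factorization property (2.55) can be probed with a small Monte Carlo experiment. The sketch below is illustrative only: the distributions, the test functions `g` and `h`, and the sample size are assumptions. For independently drawn $x$ and $y$ the two sides of (2.55) should nearly coincide, whereas for a dependent pair (here $y = x$) they should differ noticeably.

```python
import math
import random

def mean(values):
    """Sample average of a list of numbers."""
    return sum(values) / len(values)

def g(x):
    return x * x

def h(y):
    return math.cos(y)

random.seed(3)
K = 100000
xs = [random.random() for _ in range(K)]         # x uniform on [0, 1)
ys = [random.gauss(0.0, 1.0) for _ in range(K)]  # y drawn independently of x

# Independent case: estimate of E{g(x)h(y)} versus E{g(x)} E{h(y)}.
lhs = mean([g(x) * h(y) for x, y in zip(xs, ys)])
rhs = mean([g(x) for x in xs]) * mean([h(y) for y in ys])

# Dependent case for contrast: y = x, so E{g(x)g(y)} = E{x^4} = 1/5,
# while E{g(x)} E{g(y)} = (1/3)^2 = 1/9.
lhs_dep = mean([g(x) * g(x) for x in xs])
rhs_dep = mean([g(x) for x in xs]) ** 2

if __name__ == "__main__":
    print(lhs, rhs)
    print(lhs_dep, rhs_dep)
```

A single pair of functions cannot establish independence, since (2.55) must hold for all absolutely integrable $g$ and $h$; the experiment only illustrates the property.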
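Example 2.6 First consider the random variables $x$ and $y$ discussed in Examples 2.2
and 2.3. The joint density of $x$ and $y$, reproduced here for convenience,
$$p_{x,y}(x, y) = \begin{cases} \frac{3}{7}(2 - x)(x + y), & x \in [0, 2],\ y \in [0, 1] \\ 0 & \text{elsewhere} \end{cases}$$
and $y$.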
Consider then the joint density of a two-dimensional random vector $\mathbf{x} = (x_1, x_2)^T$
and a one-dimensional random vector $\mathbf{y} = y$ given by [419]
$$p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y}) = \begin{cases} (x_1 + 3 x_2)\, y, & x_1, x_2 \in [0, 1],\ y \in [0, 1] \\ 0 & \text{elsewhere} \end{cases}$$
Using the above argument, we see that the random vectors $\mathbf{x}$
$\mathbf{y}_0$ is typically a specific realization
of a measurement vector $\mathbf{y}$.
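Assuming that the joint density $p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y})$ of $\mathbf{x}$ and $\mathbf{y}$ and their marginal densities
exist, the conditional probability density of $\mathbf{x}$ given $\mathbf{y}$ is defined as
$$p_{\mathbf{x}|\mathbf{y}}(\mathbf{x}|\mathbf{y}) = \frac{p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y})}{p_\mathbf{y}(\mathbf{y})} \qquad (2.59)$$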
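This can be interpreted as follows: assuming that the random vector
are some constant vectors, and both $\Delta\mathbf{x}$ and $\Delta\mathbf{y}$ are small. Similarly,
$$p_{\mathbf{y}|\mathbf{x}}(\mathbf{y}|\mathbf{x}) = \frac{p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y})}{p_\mathbf{x}(\mathbf{x})} \qquad (2.60)$$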
In conditional densities, the conditioning quantity, $\mathbf{y}$ in (2.59) and $\mathbf{x}$ in (2.60), is
thought to be like a nonrandom parameter vector, even though it is actually a random
vector itself.
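Definitions (2.59) and (2.60) can be made concrete with a scalar example. The sketch below reuses, as an assumption, the joint density $\frac{3}{7}(2-x)(x+y)$ of Examples 2.2 and 2.3. It obtains the marginal $p_x(x_0)$ by numerical integration and forms the conditional density of $y$ given $x = x_0$; the function names and the value $x_0 = 0.5$ are illustrative choices.

```python
def joint(x, y):
    """Joint density (3/7)(2 - x)(x + y) on [0,2] x [0,1], zero elsewhere."""
    if 0.0 <= x <= 2.0 and 0.0 <= y <= 1.0:
        return 3.0 / 7.0 * (2.0 - x) * (x + y)
    return 0.0

def marginal_x(x, n=2000):
    """p_x(x): midpoint-rule integral of the joint density over y in [0, 1]."""
    h = 1.0 / n
    return h * sum(joint(x, (j + 0.5) * h) for j in range(n))

def conditional_y_given_x(y, x0):
    """p_{y|x}(y | x0) = p_{x,y}(x0, y) / p_x(x0), as in Eq. (2.60).
    Only meaningful where p_x(x0) is nonzero."""
    return joint(x0, y) / marginal_x(x0)

def integral_of_conditional(x0, n=2000):
    """Integral of the conditional density over y; should equal one."""
    h = 1.0 / n
    return h * sum(conditional_y_given_x((j + 0.5) * h, x0) for j in range(n))

if __name__ == "__main__":
    x0 = 0.5
    print(marginal_x(x0))                  # analytically (3/7)(2 - x0)(x0 + 1/2)
    print(conditional_y_given_x(0.5, x0))  # analytically (x0 + y)/(x0 + 1/2)
    print(integral_of_conditional(x0))
```

For this density the factor $\frac{3}{7}(2 - x_0)$ cancels in the ratio, so the conditional density depends on $x_0$ only through $(x_0 + y)/(x_0 + \frac{1}{2})$.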
Example 2.7 Consider the two-dimensional joint density $p_{x,y}(x, y)$ depicted in
Fig. 2.4. For a given constant value $x_0$
merely a scaling constant that does not affect the shape of the conditional distribution
$p_{y|x}(y|x_0)$ as a function of $y$.
Similarly, the conditional distribution $p_{x|y}(x|y_0)$ can be obtained geometrically
by slicing the joint distribution of Fig. 2.4 parallel to the $x$-axis at the point $y = y_0$.
The resulting conditional distributions are shown in Fig. 2.5 for the value $x_0 = 1.27$,
and Fig. 2.6 for $y_0$
$$\int_{-\infty}^{\infty} p_{\mathbf{y}|\mathbf{x}}(\boldsymbol{\eta}|\mathbf{x})\, d\boldsymbol{\eta} = 1 \qquad (2.61)$$
If the random vectors $\mathbf{x}$ and $\mathbf{y}$ are statistically independent, the conditional density
$p_{\mathbf{x}|\mathbf{y}}(\mathbf{x}|\mathbf{y})$ equals the unconditional density $p_\mathbf{x}(\mathbf{x})$ of $\mathbf{x}$, since $\mathbf{x}$ does not depend
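$p_\mathbf{y}(\mathbf{y})$, and both Eqs. (2.59) and (2.60)
can be written in the form
$$p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y}) = p_\mathbf{x}(\mathbf{x})\, p_\mathbf{y}(\mathbf{y}) \qquad (2.62)$$
which is exactly the definition of independence of the random vectors $\mathbf{x}$ and $\mathbf{y}$.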
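In the general case, we get from Eqs. (2.59) and (2.60) two different expressions
for the joint density of $\mathbf{x}$ and $\mathbf{y}$:
$$p_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y}) = p_{\mathbf{y}|\mathbf{x}}(\mathbf{y}|\mathbf{x})\, p_\mathbf{x}(\mathbf{x}) = p_{\mathbf{x}|\mathbf{y}}(\mathbf{x}|\mathbf{y})\, p_\mathbf{y}(\mathbf{y})$$
$$p_{\mathbf{x}|\mathbf{y}}(\mathbf{x}|\boldsymbol{\eta})\, p_\mathbf{y}(\boldsymbol{\eta})\, d\boldsymbol{\eta} \qquad (2.65)$$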