4
Estimation Theory
An important issue encountered in various branches of science is how to estimate the
quantities of interest from a given finite set of uncertain (noisy) measurements. This
is studied in estimation theory, which we shall discuss in this chapter.
There exist many estimation techniques developed for various situations; the
quantities to be estimated may be nonrandom or have some probability distributions
themselves, and they may be constant or time-varying. Certain estimation methods
are computationally less demanding but they are statistically suboptimal in many
situations, while statistically optimal estimation methods can have a very high
computational load, or they cannot be realized in many practical situations. The choice
of a suitable estimation method also depends on the assumed data model, which may
be either linear or nonlinear, dynamic or static, random or deterministic.
In this chapter, we concentrate mainly on linear data models, studying the
estimation of their parameters. The two cases of deterministic and random parameters
are covered, but the parameters are always assumed to be time-invariant. The
methods that are widely used in connection with independent component analysis (ICA) are
emphasized in this chapter. More information on estimation theory can be found in
books devoted entirely or partly to the topic, for example [299, 242, 407, 353, 419].
Prior to applying any estimation method, one must select a suitable model that
describes the data well, as well as measurements containing relevant information on
the quantities of interest. These important, but problem-specific issues will not be
discussed in this chapter. Of course, ICA is one of the models that can be used. Some
topics related to the selection and preprocessing of measurements are treated later in
Chapter 13.
Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja


$$\theta = (\theta_1, \theta_2, \ldots, \theta_m)^T \qquad (4.1)$$
Hence, the parameter vector $\theta$ is an $m$-dimensional column vector having as its
elements the individual parameters. Similarly, the measurements can be represented
as the $T$-dimensional measurement or data vector¹
$$x_T = [x(1), x(2), \ldots, x(T)]^T \qquad (4.2)$$
Quite generally, an estimator $\hat{\theta}$ of the parameter vector $\theta$ is the mathematical
expression or function by which the parameters can be estimated from the measurements:
$$\hat{\theta} = h(x_T) = h(x(1), x(2), \ldots, x(T))$$

i
.
Example 4.1 Two parameters that are often needed are the mean $\mu$ and variance $\sigma^2$
of a random variable $x$. Given the measurement vector (4.2), they can be estimated
from the well-known formulas, which will be derived later in this chapter:
$$\hat{\mu} = \frac{1}{T} \sum_{j=1}^{T} x(j) \qquad (4.5)$$
$$\hat{\sigma}^2 = \frac{1}{T-1} \sum_{j=1}^{T} [x(j) - \hat{\mu}]^2 \qquad (4.6)$$
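The two formulas of Example 4.1 can be sketched directly in code. The function names and the sample data below are illustrative assumptions, not part of the original text.

```python
from typing import Sequence


def sample_mean(x: Sequence[float]) -> float:
    """Sample mean as in Eq. (4.5): (1/T) times the sum of the x(j)."""
    return sum(x) / len(x)


def sample_variance(x: Sequence[float]) -> float:
    """Sample variance as in Eq. (4.6): squared deviations divided by T - 1."""
    T = len(x)
    if T < 2:
        raise ValueError("at least two measurements are needed")
    mu_hat = sample_mean(x)
    return sum((xj - mu_hat) ** 2 for xj in x) / (T - 1)


# Hypothetical measurement vector x_T, used only for illustration.
x_T = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat = sample_mean(x_T)
var_hat = sample_variance(x_T)
print(mu_hat, var_hat)
```

Note the divisor $T - 1$ rather than $T$ in the variance, exactly as written in (4.6).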

estimate some of the parameters $A$, $\omega$, and $\phi$, or all of them. In the latter case, the
parameter vector becomes $\theta = (A, \omega, \phi)^T$. Clearly, different formulas must be used
for estimating $A$, $\omega$, and $\phi$. The amplitude $A$ depends linearly on the measurements
$x(j)$, while the angular frequency $\omega$ and the phase $\phi$ depend nonlinearly on the $x(j)$.

$$\hat{\theta}(j+1) = h_1(\hat{\theta}(j)) + h_2(x(j+1), \hat{\theta}(j)) \qquad (4.8)$$
where $\hat{\theta}(j)$ denotes the estimate based on the $j$ first measurements $x(1), x(2), \ldots, x(j)$.
The correction or update term $h_2(x(j+1), \hat{\theta}(j))$ depends only on the new incoming
$(j+1)$-th sample $x(j+1)$ and the current estimate $\hat{\theta}(j)$.
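As a minimal sketch of an update of the form (4.8), the sample mean can be computed one measurement at a time. The choice of the mean as the estimated quantity, the function name `update_mean`, and the data are assumptions made for illustration. Note that the correction term here also uses the sample count $j$ as a gain, which goes slightly beyond the dependence stated for $h_2$.

```python
def update_mean(estimate: float, j: int, x_new: float) -> float:
    """One recursive step: estimate based on j samples -> estimate based on j + 1."""
    h1 = estimate                          # carry over the current estimate
    h2 = (x_new - estimate) / (j + 1)      # correction from the new sample
    return h1 + h2


# Hypothetical measurement stream, used only for illustration.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

estimate = samples[0]                      # estimate based on the first sample
for j, x_new in enumerate(samples[1:], start=1):
    estimate = update_mean(estimate, j, x_new)

batch_mean = sum(samples) / len(samples)
print(estimate, batch_mean)
```

The recursive result agrees with the batch formula up to rounding, while only the current estimate and the sample count need to be stored.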

But it is impossible to meet these extremely stringent requirements for a finite data
set. Therefore, one must consider less demanding criteria for the estimation error.
Unbiasedness and consistency
The first requirement is that the mean value of the error $E\{\tilde{\theta}\}$ should be zero.
Taking expectations of both sides of Eq. (4.10) leads to the condition
$$E\{\hat{\theta}\} = E\{\theta\} \qquad (4.11)$$
Estimators that satisfy the requirement (4.11) are called unbiased. The preceding
definition is applicable to random parameters. For nonrandom parameters, the respective
definition is
$$E\{\hat{\theta} \mid \theta\} = \theta \qquad (4.12)$$
Generally, conditional probability densities and expectations, conditioned by the
parameter vector $\theta$, are used throughout in dealing with nonrandom parameters to
indicate that the parameters $\theta$

when the number of
measurements grows infinitely large. Estimators satisfying this asymptotic property
are called consistent. Consistent estimators need not be unbiased; see [407].
Example 4.3 Assume that the observations $x(1), x(2), \ldots, x(T)$ are independent.
The expected value of the sample mean (4.5) is
$$E\{\hat{\mu}\} = \frac{1}{T} \sum_{j=1}^{T} E\{x(j)\} = \frac{1}{T} \, T \mu = \mu \qquad (4.14)$$
² See for example [299, 407] for various definitions of stochastic convergence.
Thus the sample mean is an unbiased estimator of the true mean $\mu$. It is also consistent,
which can be seen by computing its variance $E\{(\hat{\mu} - \mu)^2\}$
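The unbiasedness and the shrinking variance of the sample mean can be checked exactly on a small case. The sketch below assumes, beyond the independence stated in the example, that the observations are identically distributed fair coin flips (values 0 and 1). It enumerates all equally likely outcomes and uses exact fractions; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product
from typing import Sequence, Tuple


def sample_mean_moments(values: Sequence[int], T: int) -> Tuple[Fraction, Fraction]:
    """Exact mean and variance of the sample mean of T independent draws,
    each uniformly distributed over `values` (an assumption of this sketch)."""
    mu = Fraction(sum(values), len(values))            # true mean of one draw
    outcomes = list(product(values, repeat=T))         # all equally likely data vectors
    means = [Fraction(sum(o), T) for o in outcomes]    # sample mean of each outcome
    expected = sum(means) / len(means)
    variance = sum((m - mu) ** 2 for m in means) / len(means)
    return expected, variance


coin = [0, 1]  # fair coin: true mean 1/2, variance 1/4
for T in (1, 2, 4, 8):
    expected, variance = sample_mean_moments(coin, T)
    print(T, expected, variance)
```

For each $T$ the expected value equals the true mean, and the variance decreases in proportion to $1/T$, which is the behavior that consistency requires.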

It is useful to introduce a scalar-valued loss function $L(\tilde{\theta})$
for describing the relative importance of specific estimation errors $\tilde{\theta}$. A popular loss
function is the squared estimation error $L(\tilde{\theta}) = \|\tilde{\theta}\|^2 = \|\theta - \hat{\theta}\|^2$ because of its
mathematical tractability. More generally, typical properties required from a valid
loss function are that it is symmetric: $L(\tilde{\theta}) = L(-\tilde{\theta})$

$$\mathcal{E} = E\{L(\tilde{\theta}) \mid \theta\} \qquad (4.16)$$
where the first definition is used for random parameters $\theta$ and the second one for
deterministic ones.
A widely used error criterion is the mean-square error (MSE)
$$\mathcal{E}_{\mathrm{MSE}} = E\{\|\theta - \hat{\theta}\|^2\} \qquad (4.17)$$
If the mean-square error tends asymptotically to zero with increasing number of
measurements, the respective estimator is consistent. Another important property of
the mean-square error criterion is that it can be decomposed as (see (4.13))
$\mathcal{E}_{\mathrm{MSE}} = E\{\|\tilde{\theta}\|^2\}$

(square root of the variance $\sigma^2$) for an estimator $\hat{\theta}$ of a single scalar parameter
$\theta$. In a Bayesian interpretation (see Section 4.6), the bias and variance of the
estimator $\hat{\theta}$ are, respectively, the mean

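Fig. 4.1 Bias $b$ and standard deviation $\sigma$ [figure: the density $p(\hat{\theta} \mid x)$ plotted against $\hat{\theta}$, with $E(\hat{\theta})$ and the bias $b$ marked]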


$$E\{\tilde{\theta} \tilde{\theta}^T\} = E\{(\theta - \hat{\theta})(\theta - \hat{\theta})^T\} \qquad (4.19)$$
It measures the errors of individual parameter estimates, while the mean-square error
is an overall scalar error measure for all the parameter estimates. In fact, the mean-
square error (4.17) can be obtained by summing up the diagonal elements of the error
covariance matrix (4.19), or the mean-square errors of individual parameters.
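The relation between the error covariance matrix and the mean-square error can be illustrated numerically. In this sketch the expectations are replaced by averages over a small, made-up set of error vectors; the data and function names are assumptions for illustration only.

```python
from typing import List, Sequence


def error_covariance(errors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Average of the outer products of the error vectors (empirical stand-in for (4.19))."""
    n, m = len(errors), len(errors[0])
    return [[sum(e[i] * e[k] for e in errors) / n for k in range(m)]
            for i in range(m)]


def mean_square_error(errors: Sequence[Sequence[float]]) -> float:
    """Average squared Euclidean norm of the error vectors (empirical stand-in for (4.17))."""
    return sum(sum(c * c for c in e) for e in errors) / len(errors)


# Hypothetical estimation errors theta - theta_hat for a two-parameter problem.
errors = [[1.0, -2.0], [0.5, 0.0], [-1.5, 1.0], [0.0, 1.0]]

C = error_covariance(errors)
trace = sum(C[i][i] for i in range(len(C)))   # sum of the diagonal elements
mse = mean_square_error(errors)
print(C, trace, mse)
```

The diagonal elements are the mean-square errors of the individual parameters, and their sum equals the overall scalar MSE; the off-diagonal elements carry the extra information that the scalar criterion discards.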
Efficiency
An estimator that provides the smallest error covariance matrix among
all unbiased estimators is the best one with respect to this quality criterion. Such
an estimator is called an efficient one, because it optimally uses the information
contained in the measurements. A symmetric matrix $A$ is said to be smaller than
another symmetric matrix $B$, or $A < B$, if the matrix

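$$J = E\left\{ \left[ \frac{\partial}{\partial \theta} \ln p(x_T \mid \theta) \right] \left[ \frac{\partial}{\partial \theta} \ln p(x_T \mid \theta) \right]^T \,\Big|\, \theta \right\} \qquad (4.21)$$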
Here it is assumed that the inverse $J^{-1}$ exists. The term
$\frac{\partial}{\partial \theta} \ln p(x_T \mid \theta)$ is

problem with many estimators is that they may be quite sensitive to outliers, that is,
observations that are very far from the main bulk of data. For example, consider the
estimation of the mean from 100 measurements. Assume that all the measurements
(but one) are distributed between $-1$ and $1$, while one of the measurements has the
value 1000. Using the simple estimator of the mean given by the sample average
in (4.5), the estimator gives a value that is not far from the value 10. Thus, the
single, probably erroneous, measurement of 1000 had a very strong influence on the
estimator. The problem here is that the average corresponds to minimization of the
squared distance of measurements from the estimate [163, 188]. The square function
implies that measurements far away dominate.
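The outlier effect described here is easy to reproduce. The sketch below assumes, for determinism, that the 99 ordinary measurements are evenly spaced between -1 and 1. It also computes the median, which is the standard minimizer of the sum of absolute deviations and is shown only as one familiar robust alternative, not as the method of the text.

```python
import statistics

# 99 ordinary measurements evenly spread over [-1, 1] (an illustrative assumption),
# plus one outlier with the value 1000.
inliers = [-1.0 + 2.0 * k / 98 for k in range(99)]
measurements = inliers + [1000.0]

mean_estimate = statistics.mean(measurements)      # sample average, Eq. (4.5)
median_estimate = statistics.median(measurements)  # robust alternative

print(mean_estimate, median_estimate)
```

The sample average is pulled to about 10 by the single outlier, while the median stays inside the bulk of the data near zero.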
Robust estimators can be obtained, for example, by considering instead of the
square error other optimization criteria that grow slower than quadratically with
the error. Examples of such criteria are the absolute value criterion and criteria
³ We have here omitted the subscript $x \mid \theta$ of the density function $p(x \mid \theta)$

::: 
m
)
T
in (4.1). Recall from
Section 2.7 that the
j
th moment

j
of
x
is defined by

j
=
E
fx
j
j g =
Z
1
1
x
j
p(x j )dx j =1 2:::
(4.22)
Here the conditional expectations are used to indicate that the parameters $\theta$ are

$\alpha_j$ with the estimated ones $d_j$:
$$\alpha_j(\theta) = \alpha_j(\theta_1, \theta_2, \ldots, \theta_m) = d_j \qquad (4.24)$$
Usually, $m$ equations for the $m$ first moments $j = 1, \ldots, m$ are sufficient for
solving the $m$ unknown parameters $\theta_1$

$$\sum_{i=1}^{T} [x(i) - d_1]^j \qquad (4.26)$$
to form the $m$ equations
$$\mu_j(\theta_1, \theta_2, \ldots, \theta_m) = s_j, \qquad j = 1, 2, \ldots, m \qquad (4.27)$$
for solving the unknown parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_m)^T$.

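$\theta_2 > 0$. We wish to estimate the parameter vector $\theta = (\theta_1, \theta_2)^T$
using the method of moments. For doing this, let us first compute the
theoretical moments $\alpha_1$ and $\alpha_2$:
$$\alpha_1 = E\{x \mid \theta\} = \int_{\theta_1}^{\infty} \frac{x}{\theta_2} \exp\left[ -\frac{x - \theta_1}{\theta_2} \right] dx = \theta_1 + \theta_2$$
$$\alpha_2 = E\{x^2 \mid \theta\} = \int_{\theta_1}^{\infty} \frac{x^2}{\theta_2} \exp\left[ -\frac{x - \theta_1}{\theta_2} \right] dx = (\theta_1 + \theta_2)^2 + \theta_2^2 \qquad (4.30)$$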
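The moment estimators are obtained by equating these expressions with the first two
sample moments $d_1$ and $d_2$, respectively, which yields
$$\theta_1 + \theta_2 = d_1, \qquad (\theta_1 + \theta_2)^2 + \theta_2^2 = d_2$$
Solving these two equations gives the moment estimates
$$\hat{\theta}_{1,\mathrm{MM}} = d_1 - (d_2 - d_1^2)^{1/2} \qquad (4.33)$$
$$\hat{\theta}_{2,\mathrm{MM}} = (d_2 - d_1^2)^{1/2} \qquad (4.34)$$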
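The moment estimates of this example can be sketched as follows. The code assumes that the model density is the shifted exponential $p(x \mid \theta) = \theta_2^{-1} \exp[-(x - \theta_1)/\theta_2]$ for $x \ge \theta_1$, whose first moment is $\theta_1 + \theta_2$ and whose second moment is $(\theta_1 + \theta_2)^2 + \theta_2^2$. The function name and the data are hypothetical.

```python
import math
from typing import Sequence, Tuple


def moment_estimates(x: Sequence[float]) -> Tuple[float, float]:
    """Method-of-moments estimates (theta1, theta2) from the first two sample moments."""
    T = len(x)
    d1 = sum(x) / T                       # first sample moment
    d2 = sum(v * v for v in x) / T        # second sample moment
    spread = d2 - d1 * d1
    if spread <= 0.0:
        raise ValueError("sample moments do not allow a positive theta2")
    theta2 = math.sqrt(spread)            # positive root, as in (4.34)
    theta1 = d1 - theta2                  # from theta1 + theta2 = d1
    return theta1, theta2


# Hypothetical measurements, used only for illustration.
x = [1.0, 2.0, 3.0, 6.0]
theta1_mm, theta2_mm = moment_estimates(x)
print(theta1_mm, theta2_mm)
```

Only the positive square root is taken, so the estimate of $\theta_2$ respects the constraint $\theta_2 > 0$ by construction.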
The other possible solution $\hat{\theta}_{2,\mathrm{MM}} = -(d_2 - d_1^2)^{1/2}$ must be rejected because the
parameter $\theta_2$ must be positive. In fact, it can be observed that $\hat{\theta}$

These negative remarks have implications in independent component analysis.
Algebraic, cumulant-based methods proposed for ICA are typically based on estimating
fourth-order moments and cross-moments of the components of the observation (data)
vectors. Hence, one could claim that cumulant-based ICA methods inefficiently
utilize, in general, the information contained in the data vectors. On the other hand,
these methods have some advantages. They will be discussed in more detail in
Chapter 11, and related methods can be found in Chapter 8 as well.
4.4 LEAST-SQUARES ESTIMATION
4.4.1 Linear least-squares method
The least-squares method can be regarded as a deterministic approach to the
estimation problem where no assumptions on the probability distributions, etc., are
necessary. However, statistical arguments can be used to justify the least-squares
method, and they give further insight into its properties. Least-squares estimation is
discussed in numerous books, and more thoroughly from the estimation point of
view in, for example, [407, 299].
In the basic linear least-squares method, the $T$-dimensional data vectors $x_T$ are
assumed to obey the following model:
$$x_T = H\theta + v_T \qquad (4.35)$$
Here $\theta$ is again the


$\theta = H^{-1} x_T$. If there were more unknown parameters than measurements ($m > T$),
infinitely many solutions would exist for Eqs. (4.35) satisfying the condition
$v = 0$. However, if the measurements are noisy or contain errors, it is generally highly
desirable to have many more measurements than there are parameters to be estimated,
in order to obtain more reliable estimates. So, in the following we shall concentrate
on the case $T > m$.
When $T > m$, equation (4.35) has no solution for which $v_T = 0$. Because the
measurement errors $v_T$

$\mathcal{E}_{\mathrm{LS}}$ tries to minimize the measurement errors $v$, and not
directly the estimation error $\theta - \hat{\theta}$.
Minimization of the criterion (4.36) with respect to the unknown parameters $\theta$
leads to the so-called normal equations [407, 320, 299]
$$(H^T H) \hat{\theta}_{\mathrm{LS}} = H^T x_T \qquad (4.37)$$
for determining the least-squares estimate $\hat{\theta}_{\mathrm{LS}}$ of $\theta$

$(H^T H)^{-1} H^T$ is the pseudoinverse of $H$ (assuming that $H$ has maximal
rank $m$ and more rows than columns: $T > m$) [169, 320, 299].
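The normal equations (4.37) can be solved numerically without forming the inverse explicitly. The following is a minimal pure-Python sketch using Gaussian elimination with partial pivoting; the function name, the rank tolerance, and the data are assumptions for illustration, and a numerical library would normally be used instead.

```python
from typing import List, Sequence


def solve_normal_equations(H: Sequence[Sequence[float]],
                           x: Sequence[float]) -> List[float]:
    """Solve (H^T H) theta = H^T x for theta, assuming H has full column rank."""
    T, m = len(H), len(H[0])
    # Build the augmented system [H^T H | H^T x].
    M = [[sum(H[t][i] * H[t][k] for t in range(T)) for k in range(m)]
         + [sum(H[t][i] * x[t] for t in range(T))]
         for i in range(m)]
    # Forward elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:     # heuristic rank check (assumed tolerance)
            raise ValueError("H does not appear to have full column rank")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    theta = [0.0] * m
    for i in reversed(range(m)):
        known = sum(M[i][k] * theta[k] for k in range(i + 1, m))
        theta[i] = (M[i][m] - known) / M[i][i]
    return theta


# Hypothetical data: T = 4 measurements, m = 2 parameters (offset and slope).
H = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
x = [1.1, 2.9, 5.2, 6.8]
theta_ls = solve_normal_equations(H, x)
print(theta_ls)
```

At the solution the residual $x_T - H\hat{\theta}_{\mathrm{LS}}$ is orthogonal to every column of $H$, which is just a restatement of (4.37).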
The least-squares estimator can be analyzed statistically by assuming that the
measurement errors have zero mean: $E\{v_T\} = 0$. It is easy to see that the
least-squares estimator is unbiased: $E\{\hat{\theta}_{\mathrm{LS}} \mid \theta\} = \theta$.

$\phi_i(t)$, $i = 1, 2, \ldots, m$, are $m$ basis functions that can be generally nonlinear
functions of the argument $t$; it suffices that the model (4.39) be linear with respect
to the unknown parameters $a_i$. Assume now that there are available measurements
$y(t_1), y(t_2), \ldots, y(t_T)$ at argument values $t_1, t_2, \ldots, t_T$, respectively. The linear
model (4.39) can be easily written in the vector form (4.35), where now the parameter
vector is given by

$v_T = [v(t_1), v(t_2), \ldots, v(t_T)]^T$ contains the error terms $v(t_i)$.
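The observation matrix becomes
$$H = \begin{bmatrix} \phi_1(t_1) & \phi_2(t_1) & \cdots & \phi_m(t_1) \\ \phi_1(t_2) & \phi_2(t_2) & \cdots & \phi_m(t_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(t_T) & \phi_2(t_T) & \cdots & \phi_m(t_T) \end{bmatrix} \qquad (4.42)$$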
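Building the observation matrix (4.42) amounts to evaluating every basis function at every argument value. The sketch below assumes a polynomial basis $1, t, t^2$ and made-up argument values; the function name is hypothetical.

```python
from typing import Callable, List, Sequence


def observation_matrix(basis: Sequence[Callable[[float], float]],
                       t_values: Sequence[float]) -> List[List[float]]:
    """Row for each argument value t, column for each basis function evaluated at t."""
    return [[phi(t) for phi in basis] for t in t_values]


# Assumed basis functions: nonlinear in t, while the model stays linear in the a_i.
basis = [lambda t: 1.0, lambda t: t, lambda t: t * t]
t_values = [0.0, 0.5, 1.0, 2.0]           # hypothetical argument values

H = observation_matrix(basis, t_values)
for row in H:
    print(row)
```

With $T = 4$ argument values and $m = 3$ basis functions the matrix has four rows and three columns, so the case $T > m$ discussed for the least-squares method applies.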

