Class Notes in Statistics and Econometrics Part 4 - Pdf 16

CHAPTER 7
Chebyshev Inequality, Weak Law of Large
Numbers, and Central Limit Theorem
7.1. Chebyshev Inequality
If the random variable y has finite expected value µ and standard deviation σ,
and k is some positive number, then the Chebyshev Inequality says
(7.1.1) $\Pr\bigl[|y - \mu| \ge k\sigma\bigr] \le \frac{1}{k^2}.$
In words, the probability that a given random variable y differs from its expected
value by more than k standard deviations is less than $1/k^2$. (Here “more than”
and “less than” are short forms for “more than or equal to” and “less than or equal
to.”) One does not need to know the full distribution of y for that, only its expected
value and standard deviation. We will give here a proof only if y has a discrete
distribution, but the inequality is valid in general. Going over to the standardized
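variable $z = \frac{y - \mu}{\sigma}$, we have to show $\Pr[|z| \ge k] \le \frac{1}{k^2}$. Let z take the values $z_i$ with probabilities $p(z_i)$. Then
(7.1.3) $k^2 \Pr[|z| \ge k] = \sum_{i:\,|z_i| \ge k} k^2\, p(z_i)$
(7.1.4) $\le \sum_{i:\,|z_i| \ge k} z_i^2\, p(z_i)$
(7.1.5) $\le \sum_{\text{all } i} z_i^2\, p(z_i) = \operatorname{var}[z] = 1.$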
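The Chebyshev inequality is sharp for all k ≥ 1. Proof: the random variable
which takes the value −k with probability $\frac{1}{2k^2}$, the value +k with probability $\frac{1}{2k^2}$, and the value 0 with the remaining probability $1 - \frac{1}{k^2}$ has expected value 0 and variance 1, and for it the ≤ sign in (7.1.1) becomes an equal sign.
[...]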
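As a quick numerical check of (7.1.1), the following Python sketch evaluates both sides exactly for two discrete distributions. The helper name `chebyshev_check`, the fair die, and the choice k = 2 are illustrative assumptions; the three-point distribution on −k, 0, +k with probabilities 1/(2k²), 1 − 1/k², 1/(2k²) is the standard example for which the bound holds with equality.

```python
from fractions import Fraction

def chebyshev_check(values, probs, k):
    """Return (Pr[|y - mu| >= k*sigma], 1/k^2) for a discrete distribution."""
    mu = sum(v * p for v, p in zip(values, probs))
    var = sum((v - mu) ** 2 * p for v, p in zip(values, probs))
    # Compare squared deviations with k^2 * var to avoid square roots.
    tail = sum(p for v, p in zip(values, probs) if (v - mu) ** 2 >= k ** 2 * var)
    return tail, Fraction(1, k ** 2)

# A fair die: the bound holds with room to spare.
die = [Fraction(i) for i in range(1, 7)]
die_tail, die_bound = chebyshev_check(die, [Fraction(1, 6)] * 6, k=2)

# Three-point distribution for k = 2: the bound is attained exactly.
k = 2
sharp_tail, sharp_bound = chebyshev_check(
    [Fraction(-k), Fraction(0), Fraction(k)],
    [Fraction(1, 2 * k**2), 1 - Fraction(1, k**2), Fraction(1, 2 * k**2)],
    k,
)
print(die_tail, die_bound, sharp_tail, sharp_bound)
```

Exact fractions are used so that the equality case is not blurred by floating-point rounding.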
Hint: first compute what Chebyshev will tell you about the left-hand side, and then
you will need still another inequality.
Answer. E[y/n] = p and var[y/n] = pq/n (where q = 1 − p). Chebyshev says therefore
(7.1.7) $\Pr\left[\left|\frac{y}{n} - p\right| \ge k\sqrt{\frac{pq}{n}}\right] \le \frac{1}{k^2}.$
Setting $\varepsilon = k\sqrt{pq/n}$, therefore $1/k^2 = pq/(n\varepsilon^2)$ [...]
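The Chebyshev bound pq/(nε²) can be compared with the exact binomial tail. This is a minimal Python sketch, assuming y counts the successes in n independent trials with success probability p; the function name `exact_tail` and the values n = 100, p = 0.3, ε = 0.1 are illustrative and not taken from the notes. It also uses the fact that pq ≤ 1/4 for every p, which gives a cruder bound that does not depend on p.

```python
from fractions import Fraction
from math import comb

def exact_tail(n, p, eps):
    """Exact Pr[|y/n - p| >= eps] for y ~ Binomial(n, p)."""
    q = 1 - p
    return sum(
        comb(n, y) * p**y * q**(n - y)
        for y in range(n + 1)
        if abs(Fraction(y, n) - p) >= eps
    )

n, p, eps = 100, Fraction(3, 10), Fraction(1, 10)
tail = exact_tail(n, p, eps)
chebyshev = p * (1 - p) / (n * eps**2)   # pq / (n eps^2)
crude = Fraction(1, 4) / (n * eps**2)    # uses pq <= 1/4
print(float(tail), float(chebyshev), float(crude))
```

The exact tail is typically far below either bound, which is the price of not using anything about the distribution beyond its mean and variance.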

Let $y_1, y_2, y_3, \ldots$ be a sequence of independent random variables all of which
have the same expected value µ and variance $\sigma^2$. Then $\bar y_n = \frac{1}{n}\sum_{i=1}^n y_i$ has expected
value µ and variance $\frac{\sigma^2}{n}$. I.e., its probability mass is clustered much more closely
around the value µ than the individual $y_i$. To make this statement more precise we
need a concept of convergence of random variables. It is not possible to define it in
the “obvious” way that the sequence of random variables $y_n$ [...]
In many applications, the limiting variable y is a degenerate random variable, i.e., it
is a constant.
The Weak Law of Large Numbers says that, if the expected value exists, then the
probability limit of the sample means of an ever increasing sample is the expected
value, i.e., $\operatorname{plim}_{n\to\infty} \bar y_n = \mu$.
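The Weak Law of Large Numbers can be illustrated by simulation. The following Python sketch assumes rolls of a fair die (expected value 3.5) as the underlying distribution; the function name, the sample sizes and the fixed seed are illustrative choices.

```python
import random

def sample_mean_of_die(n, rng):
    """Mean of n simulated rolls of a fair die (expected value 3.5)."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

rng = random.Random(0)  # fixed seed so the run is reproducible
means = {n: sample_mean_of_die(n, rng) for n in (10, 1_000, 100_000)}
for n, m in means.items():
    print(f"n = {n:>6}: sample mean = {m:.4f}, |error| = {abs(m - 3.5):.4f}")
```

A single run only shows one realization; the law itself is a statement about the probability of large deviations becoming small as n grows.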
Problem 117. 5 points Assuming that not only the expected value but also the
variance exists, derive the Weak Law of Large Numbers, which can be written as
(7.2.3) $\lim_{n\to\infty} \Pr\bigl[|\bar y_n - E[y]| \ge \delta\bigr] = 0$ for all $\delta > 0$,
from the Chebyshev inequality
(7.2.4) $\Pr[|x - \mu| \ge k\sigma] \le \frac{1}{k^2}$ where $\mu = E[x]$ and $\sigma^2 = \operatorname{var}[x]$.
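Answer. From nonnegativity of probability and the Chebyshev inequality for $x = \bar y$ follows
$0 \le \Pr\bigl[|\bar y - \mu| \ge \frac{k\sigma}{\sqrt{n}}\bigr] \le \frac{1}{k^2}$ for all k. Setting $k = \delta\sqrt{n}/\sigma$ gives $0 \le \Pr[|\bar y_n - \mu| \ge \delta] \le \frac{\sigma^2}{n\delta^2}$, and for any fixed $\delta > 0$ the upper bound converges to zero as $n \to \infty$.
[...] $\bar y = \frac{1}{n}\sum_{i=1}^n y_i$ and sample variance $s^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \bar y)^2$. Show that the data satisfy the following “sample equivalent” of
the Chebyshev inequality: if k is any fixed positive number, and m is the number of
observations $y_j$ which satisfy $|y_j - \bar y|$ [...]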
• a. 3 points What happens to this result when the distribution from which the
$y_i$ are taken does not have an expected value or a variance?
Answer. The result still holds but $\bar y$ and $s^2$ do not converge as the number of observations
increases.
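The statement of the sample version is cut off above. Assuming it is the usual one (m counts the observations with |y_j − ȳ| ≥ ks, and the claim is m ≤ n/k² whenever s > 0), the following Python sketch checks it on a small made-up data set. The data, the function name and k = 2 are illustrative assumptions.

```python
from math import sqrt

def sample_chebyshev(data, k):
    """Return (m, n / k^2), where m counts observations with |y_j - ybar| >= k*s."""
    n = len(data)
    ybar = sum(data) / n
    s = sqrt(sum((y - ybar) ** 2 for y in data) / n)  # divisor n, as in the text
    m = sum(1 for y in data if abs(y - ybar) >= k * s)
    return m, n / k**2

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]  # one gross outlier; no distribution assumed
m, bound = sample_chebyshev(data, k=2)
print(m, bound)
```

Because the check only uses the numbers in the sample, it does not depend on whether the underlying distribution has an expected value or a variance.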
7.3. Central Limit Theorem
Assume all $y_i$ are independent and have the same distribution with mean µ,
variance $\sigma^2$, and also a moment generating function. Again, let $\bar y_n$ be the sample
mean of the first n observations. The central limit theorem says that the probability
distribution for
(7.3.1) $\frac{\bar y_n - \mu}{\sigma/\sqrt{n}}$
converges to a N(0, 1). This is a different concept of convergence than the probability
limit: it is convergence in distribution.

[...] us therefore what happens to the shape of the cumulative distribution function of $\bar y_n$.
If we disregard the fact that it becomes more and more concentrated (by multiplying
it by a factor which is chosen such that the variance remains constant), then we see
that its geometric shape comes closer and closer to a normal distribution.
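Convergence in distribution can be made concrete by measuring how far the cumulative distribution function of the standardized sample mean is from that of N(0, 1). The following Python sketch assumes Bernoulli(p) observations with p = 0.5, so that the exact distribution is binomial; the function names and sample sizes are illustrative.

```python
from math import comb, erf, sqrt

def normal_cdf(x):
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def max_cdf_gap(n, p=0.5):
    """Largest gap between the c.d.f. of the standardized mean of n
    Bernoulli(p) variables and the N(0, 1) c.d.f."""
    mu, sigma = p, sqrt(p * (1 - p))
    gap, cdf = 0.0, 0.0
    for k in range(n + 1):
        x = (k / n - mu) / (sigma / sqrt(n))   # standardized value as in (7.3.1)
        phi = normal_cdf(x)
        gap = max(gap, abs(cdf - phi))         # just below the jump at x
        cdf += comb(n, k) * p**k * (1 - p) ** (n - k)
        gap = max(gap, abs(cdf - phi))         # at the jump
    return gap

gaps = {n: max_cdf_gap(n) for n in (10, 100, 1000)}
print(gaps)
```

Since the binomial c.d.f. is a step function, the largest gap is always found at one of its jump points, which is why only those points are inspected.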
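Proof of the Central Limit Theorem: By Problem 120,
(7.3.2) $\frac{\bar y_n - \mu}{\sigma/\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{y_i - \mu}{\sigma} = \frac{1}{\sqrt{n}} \sum_{i=1}^n z_i$,
where the $z_i = (y_i - \mu)/\sigma$ have expected value 0 and variance 1. Writing $m_3, m_4, \ldots$ for their higher moments, the m.g.f. of each $z_i$ is
(7.3.3) $1 + \frac{t^2}{2!} + \frac{m_3 t^3}{3!} + \frac{m_4 t^4}{4!} + \cdots$
Therefore the m.g.f. of $\frac{1}{\sqrt{n}} \sum_{i=1}^n z_i$ is (multiply and substitute $t/\sqrt{n}$ for t):
(7.3.4) $\Bigl(1 + \frac{t^2}{2!\,n} + \frac{m_3 t^3}{3!\,n\sqrt{n}} + \cdots\Bigr)^n = \Bigl(1 + \frac{w_n}{n}\Bigr)^n$
where
$w_n = \frac{t^2}{2!} + \frac{m_3 t^3}{3!\,\sqrt{n}} + \frac{m_4 t^4}{4!\,n} + \cdots.$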
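Now use Euler’s limit, this time in the form: if $w_n \to w$ for $n \to \infty$, then
$\bigl(1 + \frac{w_n}{n}\bigr)^n \to e^w$. Since here $w_n \to t^2/2$, the m.g.f. converges to $e^{t^2/2}$, which is the m.g.f. of the N(0, 1) distribution. [...]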
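The limit used in this proof can be watched numerically. As an illustrative assumption, take the standardized variables to be fair coin flips with values ±1 (mean 0, variance 1). Their m.g.f. is cosh t, so the m.g.f. of the sum divided by √n is cosh(t/√n)^n, which should approach e^{t²/2}. The function name and the values of t and n are hypothetical.

```python
from math import cosh, exp, sqrt

def mgf_standardized_sum(t, n):
    """m.g.f. at t of (1/sqrt(n)) * sum of n independent +/-1 coin flips."""
    return cosh(t / sqrt(n)) ** n

t = 1.0
limit = exp(t**2 / 2)  # m.g.f. of N(0, 1) at t
errors = {n: abs(mgf_standardized_sum(t, n) - limit) for n in (1, 100, 10_000)}
print(limit, errors)
```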

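[...] $\frac{\bar y_n - \mu}{\sigma/\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{y_i - \mu}{\sigma}$.
Answer. Lhs $= \frac{\sqrt{n}}{\sigma}\Bigl(\frac{1}{n}\sum_{i=1}^n y_i - \mu\Bigr) = \frac{\sqrt{n}}{\sigma} \cdot \frac{1}{n}\sum_{i=1}^n (y_i - \mu) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{y_i - \mu}{\sigma}$ = rhs.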
Problem 121. 3 points Explain clearly in words what the law of large numbers
means, what the Central Limit Theorem means, and what their difference is.
Problem 122. (For this problem, a table is needed.) [Lar82, exercise 5.6.1,
p. 301] If you roll a pair of dice 180 times, what is the approximate probability that
the sum seven appears 25 or more times? Hint: use the Central Limit Theorem (but
don’t worry about the continuity correction, which is beyond the scope of this class).
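Answer. Let $x_i$ be the random variable that equals one if the i-th roll is a seven, and zero
otherwise. Since 7 can be obtained in six ways (1+6, 2+5, 3+4, 4+3, 5+2, 6+1), the probability
to get a 7 (which is at the same time the expected value of $x_i$) is 6/36 = 1/6. Since $x_i^2 = x_i$,
$\operatorname{var}[x_i] = E[x_i^2] - (E[x_i])^2 = \frac{1}{6} - \frac{1}{36} = \frac{5}{36}$. The number of sevens $x = \sum_{i=1}^{180} x_i$ therefore has expected value 180 · 1/6 = 30 and variance 180 · 5/36 = 25, and by the Central Limit Theorem it is approximately distributed like a normal variable y ∼ N(30, 25). [...] Since x takes only integer values,
Pr[x≥25] = Pr[x>24]. Therefore Pr[y≥25] and Pr[y>24] are two alternative good approximations;
but the best is Pr[y≥24.5] = .8643. This is the continuity correction.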
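The normal approximation can be compared with the exact binomial probability, so no table is needed. The following Python sketch is an illustration under the setup of Problem 122 (180 rolls, probability 1/6 of a seven); the variable names are hypothetical, and the normal c.d.f. is computed from the error function in the standard library.

```python
from math import comb, erf, sqrt

def normal_cdf(x):
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n, p = 180, 1 / 6
mean, sd = n * p, sqrt(n * p * (1 - p))         # 30 and 5

# Exact binomial probability of 25 or more sevens.
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(25, n + 1))

# Normal approximations without and with the continuity correction.
plain = 1 - normal_cdf((25 - mean) / sd)         # Pr[y >= 25]
corrected = 1 - normal_cdf((24.5 - mean) / sd)   # Pr[y >= 24.5]
print(round(exact, 4), round(plain, 4), round(corrected, 4))
```

The corrected value reproduces the .8643 quoted in the answer, and it lies closer to the exact binomial probability than the uncorrected one.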
CHAPTER 8
Vector Random Variables
In this chapter we will look at two random variables x and y defined on the same
sample space U, i.e.,
(8.0.6) $x\colon U \ni \omega \mapsto x(\omega) \in \mathbb{R}$ and $y\colon U \ni \omega \mapsto y(\omega) \in \mathbb{R}$.
As we said before, x and y are called independent if all events of the form x ≤ x
are independent of any event of the form y ≤ y. But now let us assume they are
not independent. In this case, we do not have all the information about them if we
merely know the distribution of each.
The following example from [Lar82, example 5.1.7. on p. 233] illustrates the
issues involved. This example involves two random variables that have only two
possible outcomes each. Suppose you are told that a coin is to be flipped two times
and that the probability of a head is .5 for each flip. This information is not enough
to determine the probability of the second flip giving a head conditionally on the
first flip giving a head.
For instance, the above two probabilities can be achieved by the following
experimental setup: a person has one fair coin and flips it twice in a row. Then the
two flips are independent.
But the probabilities of 1/2 for heads and 1/2 for tails can also be achieved as
follows: The person has two coins in his or her pocket. One has two heads, and one
has two tails. If at random one of these two coins is picked and flipped twice, then
the second flip has the same outcome as the first flip.
What do we need to get the full picture? We must consider the two variables not
separately but jointly, as a totality. In order to do this, we combine x and y into one
entity, a vector $\begin{bmatrix} x \\ y \end{bmatrix}$. [...]
[...] $= \Pr[x \le x \text{ and } y \le y]$.
For discrete random variables, for which the cumulative distribution function is
a step function, the joint probability mass function provides the same information:
(8.0.8) $p_{x,y}(x, y) = \Pr\left[\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x \\ y \end{bmatrix}\right] = \Pr[x = x \text{ and } y = y]$.
Problem 123. Write down the joint probability mass functions for the two
versions of the two coin flips discussed above.
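Answer. Here are the probability mass functions for these two cases:
(8.0.9)
One fair coin flipped twice:
               Second Flip
               H      T      sum
First   H     .25    .25    .50
Flip    T     .25    .25    .50
        sum   .50    .50   1.00

One two-headed or two-tailed coin, picked at random and flipped twice:
               Second Flip
               H      T      sum
First   H     .50    .00    .50
Flip    T     .00    .50    .50
        sum   .50    .50   1.00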
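The two coin-flip setups can be written down as joint probability mass functions and compared directly. In this Python sketch the dictionary layout and the function names are illustrative assumptions; the probabilities follow from the two experimental setups described in the text.

```python
from fractions import Fraction

half, quarter = Fraction(1, 2), Fraction(1, 4)

# Joint p.m.f. keyed by (first flip, second flip).
fair_coin = {("H", "H"): quarter, ("H", "T"): quarter,
             ("T", "H"): quarter, ("T", "T"): quarter}
two_coins = {("H", "H"): half, ("H", "T"): Fraction(0),
             ("T", "H"): Fraction(0), ("T", "T"): half}

def marginal_second(joint, outcome):
    """Marginal probability that the second flip equals `outcome`."""
    return sum(p for (_, second), p in joint.items() if second == outcome)

def second_given_first(joint, second, first):
    """Conditional probability Pr[second flip | first flip]."""
    pr_first = sum(p for (f, _), p in joint.items() if f == first)
    return joint[(first, second)] / pr_first

for name, joint in (("one fair coin", fair_coin), ("two-headed/two-tailed", two_coins)):
    print(name, marginal_second(joint, "H"), second_given_first(joint, "H", "H"))
```

Both setups give the same marginal probability of heads, but the conditional probabilities differ, which is exactly the information the marginal distributions alone cannot provide.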


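[...]
(8.0.11) $\Pr\left[\begin{bmatrix} x \\ y \end{bmatrix} \in B\right] = \iint_B f(x, y)\, dx\, dy$,
or one says, for an infinitesimal two-dimensional volume element $dV_{x,y}$ located at $\begin{bmatrix} x \\ y \end{bmatrix}$,
which has the two-dimensional volume (i.e., area) |dV|,
(8.0.12) $\Pr\left[\begin{bmatrix} x \\ y \end{bmatrix} \in dV_{x,y}\right] = f(x, y)\, |dV|$.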
The vertical bars here do not mean the absolute value but the volume of the argument
inside.
8.1. Expected Value, Variances, Covariances
To get the expected value of a function of x and y, one simply has to put this
function together with the density function into the integral, i.e., the formula is
(8.1.1) $E[g(x, y)] = \iint_{\mathbb{R}^2} g(x, y)\, f_{x,y}(x, y)\, dx\, dy$.
[...] 68], and $\begin{bmatrix} 71 \\ 72 \end{bmatrix}$, compute the probability that the bus will be preferred.
Answer. The probability is 9/40. u and v have a joint density function that is uniform in
the rectangle below and zero outside (u, the preference for buses, is on the horizontal, and v, the
preference for cars, on the vertical axis). The probability is the fraction of this rectangle below the
diagonal.
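[Figure: rectangle on which the joint density of u and v is uniform, with the diagonal u = v. Horizontal axis: u, ticks from 66 to 71. Vertical axis: v, ticks from 68 to 72.]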
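The value 9/40 can be reproduced by exact integration. The problem statement is partly lost, so this Python sketch assumes, as the surviving corner coordinates and axis ticks suggest, that u is uniform on [66, 71], v is uniform on [68, 72], and the two are independent; the function names are hypothetical.

```python
from fractions import Fraction

def antiderivative_of_cdf(u, c, d):
    """Integral from -infinity to u of the c.d.f. of a uniform variable on [c, d]."""
    if u <= c:
        return Fraction(0)
    if u <= d:
        return Fraction((u - c) ** 2, 2 * (d - c))
    return Fraction(d - c, 2) + (u - d)

def prob_u_exceeds_v(a, b, c, d):
    """Pr[u > v] for independent u ~ U[a, b] and v ~ U[c, d] (integer bounds)."""
    # Pr[u > v] = (1 / (b - a)) * integral over [a, b] of F_v(u) du
    return (antiderivative_of_cdf(b, c, d) - antiderivative_of_cdf(a, c, d)) / (b - a)

print(prob_u_exceeds_v(66, 71, 68, 72))  # fraction of the rectangle below the diagonal
```

Geometrically this is the triangle with legs of length 3 (area 9/2) divided by the area 20 of the rectangle.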




• b. 2 points How would you criticize an econometric study which argued along
the above lines?
Answer. The preferences are not for a bus or a car, but for whole transportation systems.
And these preferences are not formed independently and individualistically, but they depend on
which other infrastructures are in place, whether there is suburban sprawl or concentrated walkable
cities, etc. This is again the error of detotalization (which favors the status quo).


(8.1.3) $\operatorname{cov}[x, y] = E\bigl[(x - E[x])(y - E[y])\bigr]$
Computation rules with covariances are
(8.1.4) cov[x, z] = cov[z, x],   cov[x, x] = var[x],   cov[x, α] = 0
(8.1.5) cov[x + y, z] = cov[x, z] + cov[y, z],   cov[αx, y] = α cov[x, y]
Problem 125. 3 points Using definition (8.1.3) prove the following formula:
(8.1.6) cov[x, y] = E[xy] − E[x] E[y].
Write it down carefully; you will lose points for unbalanced or missing parentheses
and brackets.
Answer. Here it is side by side with and without the notation E[x] = µ and E[y] = ν:
(8.1.7)
cov[x, y] = E[(x − E[x])(y − E[y])]                        cov[x, y] = E[(x − µ)(y − ν)]
          = E[xy − x E[y] − E[x] y + E[x] E[y]]                      = E[xy − xν − µy + µν]
          = E[xy] − E[x] E[y] − E[x] E[y] + E[x] E[y]                = E[xy] − µν − µν + µν
          = E[xy] − E[x] E[y].                                       = E[xy] − µν.
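Formula (8.1.6) can be checked against definition (8.1.3) on any discrete joint distribution. The joint probability mass function in this Python sketch is a made-up example, and the helper `expect` is a hypothetical name.

```python
from fractions import Fraction

# Hypothetical joint p.m.f. of (x, y), chosen only for illustration.
joint = {(0, 0): Fraction(1, 2), (0, 1): Fraction(1, 8),
         (1, 0): Fraction(1, 8), (1, 1): Fraction(1, 4)}

def expect(g):
    """E[g(x, y)] under the joint p.m.f. above."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

ex, ey = expect(lambda x, y: x), expect(lambda x, y: y)
cov_definition = expect(lambda x, y: (x - ex) * (y - ey))   # (8.1.3)
cov_shortcut = expect(lambda x, y: x * y) - ex * ey          # (8.1.6)
print(cov_definition, cov_shortcut)
```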

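Problem 126. 1 point Using (8.1.6) prove the five computation rules with covariances [...]
[...] $\begin{bmatrix} \operatorname{var}[x] & \operatorname{cov}[x, y] \\ \operatorname{cov}[y, x] & \operatorname{var}[y] \end{bmatrix}$. (8.1.10)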
An important computation rule for the covariance matrix is
(8.1.11) $\mathcal{V}[x] = \Psi \;\Rightarrow\; \mathcal{V}[Ax] = A \Psi A^\top$.
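Problem 128. 4 points Let $x = \begin{bmatrix} y \\ z \end{bmatrix}$ be a vector consisting of two random
variables, with covariance matrix $\mathcal{V}[x] = \Psi$, and let $A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$ be an arbitrary
2 × 2 matrix. Prove that
(8.1.12) $\mathcal{V}[Ax] = A \Psi A^\top$.
[...] $\begin{bmatrix} \operatorname{var}[ay + bz] & \operatorname{cov}[ay + bz, cy + dz] \\ \operatorname{cov}[cy + dz, ay + bz] & \operatorname{var}[cy + dz] \end{bmatrix}$
On the other hand,
$A \Psi A^\top = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} \operatorname{var}[y] & \operatorname{cov}[y, z] \\ \operatorname{cov}[y, z] & \operatorname{var}[z] \end{bmatrix} \begin{bmatrix} a & c \\ b & d \end{bmatrix} = \begin{bmatrix} a \operatorname{var}[y] + b \operatorname{cov}[y, z] & a \operatorname{cov}[y, z] + b \operatorname{var}[z] \\ c \operatorname{var}[y] + d \operatorname{cov}[y, z] & c \operatorname{cov}[y, z] + d \operatorname{var}[z] \end{bmatrix} \begin{bmatrix} a & c \\ b & d \end{bmatrix}$
Multiply out and show that it is the same thing.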
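The rule for the covariance matrix of Ax can also be checked numerically by computing both sides exactly. In this Python sketch the joint distribution of (y, z), the matrix A and all function names are made-up illustrations; covariances are computed with formula (8.1.6).

```python
from fractions import Fraction

# Hypothetical joint p.m.f. of (y, z) and matrix A, chosen only for illustration.
joint = {(0, 0): Fraction(1, 2), (0, 1): Fraction(1, 8),
         (1, 0): Fraction(1, 8), (2, 1): Fraction(1, 4)}
A = [[1, 2], [3, -1]]

def expect(f):
    """E[f(y, z)] under the joint p.m.f. above."""
    return sum(f(y, z) * p for (y, z), p in joint.items())

def cov(g, h):
    """cov[g, h] = E[gh] - E[g] E[h], formula (8.1.6)."""
    return expect(lambda y, z: g(y, z) * h(y, z)) - expect(g) * expect(h)

def cov_matrix(components):
    """Covariance matrix of a vector whose components are functions of (y, z)."""
    return [[cov(g, h) for h in components] for g in components]

def linear(row):
    """The random variable row[0]*y + row[1]*z."""
    return lambda y, z: row[0] * y + row[1] * z

def matmul(P, Q):
    """Product of two matrices given as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]

psi = cov_matrix([linear([1, 0]), linear([0, 1])])   # covariance matrix of (y, z)
direct = cov_matrix([linear(row) for row in A])      # covariance matrix of Ax
A_transpose = [list(col) for col in zip(*A)]
via_rule = matmul(matmul(A, psi), A_transpose)       # A Psi A^T
print(direct == via_rule)
```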
Since the variances are nonnegative, one can see from equation (8.1.11) that
covariance matrices are nonnegative definite (which in econometrics is often also
called positive semidefinite). By definition, a symmetric matrix $\Sigma$ [...]
[...] i, j element is $\operatorname{cov}[x_i, y_j]$.
The correlation coefficient of two scalar random variables is defined as
(8.1.14) $\operatorname{corr}[x, y] = \frac{\operatorname{cov}[x, y]}{\sqrt{\operatorname{var}[x] \operatorname{var}[y]}}$.
The advantage of the correlation coefficient over the covariance is that it is always
between −1 and +1. This follows from the Cauchy-Schwarz inequality
(8.1.15) $(\operatorname{cov}[x, y])^2 \le \operatorname{var}[x] \operatorname{var}[y]$.
Problem 130. 4 points Given two random variables y and z with var[y] ≠ 0,
compute that constant a for which var[ay − z] is the minimum. Then derive the
Cauchy-Schwarz inequality from the fact that the minimum variance is nonnegative.
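Answer.
(8.1.16) $\operatorname{var}[ay - z] = a^2 \operatorname{var}[y] - 2a \operatorname{cov}[y, z] + \operatorname{var}[z]$
(8.1.17) First order condition: $0 = 2a \operatorname{var}[y] - 2 \operatorname{cov}[y, z]$
Therefore the minimum is attained at $a^* = \operatorname{cov}[y, z] / \operatorname{var}[y]$, for which the cross product term is −2 times
the first item:
$0 \le \operatorname{var}[a^* y - z] = \frac{(\operatorname{cov}[y, z])^2}{\operatorname{var}[y]} - \frac{2 (\operatorname{cov}[y, z])^2}{\operatorname{var}[y]} + \operatorname{var}[z] = \operatorname{var}[z] - \frac{(\operatorname{cov}[y, z])^2}{\operatorname{var}[y]}$.
Multiplying by var[y] gives the Cauchy-Schwarz inequality.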
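The minimization in this answer can be traced with concrete numbers. The second moments in this Python sketch are made-up values that satisfy the Cauchy-Schwarz inequality; the function name is hypothetical.

```python
from fractions import Fraction

# Hypothetical second moments, chosen only for illustration.
var_y, var_z, cov_yz = Fraction(4), Fraction(9), Fraction(3)

def var_ay_minus_z(a):
    """var[a*y - z] = a^2 var[y] - 2 a cov[y, z] + var[z], as in (8.1.16)."""
    return a**2 * var_y - 2 * a * cov_yz + var_z

a_star = cov_yz / var_y                      # from the first order condition (8.1.17)
minimum = var_ay_minus_z(a_star)             # equals var[z] - cov^2 / var[y]
nearby = [var_ay_minus_z(a_star + d) for d in (Fraction(-1, 10), Fraction(1, 10))]
print(a_star, minimum, nearby)
```

Moving a away from the minimizing value in either direction increases the variance, and the minimum itself is nonnegative, which is the fact the Cauchy-Schwarz inequality rests on.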

y: p(x,y)=0
p
x,y
(x, y).
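For discrete variables the marginal probability mass function is obtained by summing the joint one over the other variable. A minimal Python sketch, with a made-up joint distribution and a hypothetical function name:

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint p.m.f. of (x, y), chosen only for illustration.
joint = {(0, 1): Fraction(1, 6), (0, 2): Fraction(1, 3),
         (1, 1): Fraction(1, 4), (1, 3): Fraction(1, 4)}

def marginal_x(joint_pmf):
    """p_x(x): sum the joint p.m.f. over all y with nonzero probability."""
    p_x = defaultdict(Fraction)
    for (x, _y), p in joint_pmf.items():
        p_x[x] += p
    return dict(p_x)

p_x = marginal_x(joint)
print(p_x)
```

Only pairs with nonzero probability are stored in the dictionary, which mirrors the restriction of the sum to those y.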
For density functions, the following argument can be given:
(8.2.2) $\Pr[x \in dV_x] = \Pr\left[\begin{bmatrix} x \\ y \end{bmatrix} \in dV_x \times \mathbb{R}\right]$.
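By the definition of a product set: $\begin{bmatrix} x \\ y \end{bmatrix} \in A \times B \Leftrightarrow x \in A$ and $y \in B$. Split $\mathbb{R}$ into
many small disjoint intervals, $\mathbb{R} = \bigcup_i dV_{y_i}$, then
(8.2.3) $\Pr[x \in dV_x] = \sum_i \Pr\left[\begin{bmatrix} x \\ y \end{bmatrix} \in dV_x \times dV_{y_i}\right]$
(8.2.4) $= \sum_i f_{x,y}(x, y_i)\, |dV_x|\, |dV_{y_i}|$
(8.2.5) $= |dV_x| \sum_i f_{x,y}(x, y_i)\, |dV_{y_i}|$.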
Therefore $\sum_i f_{x,y}(x, y_i)\, |dV_{y_i}|$ is the density function we are looking for. Now the
$|dV_{y_i}|$ are usually written as dy, and the sum is usually written as an integral (i.e.,
an infinite sum each summand of which is infinitesimal), therefore we get
(8.2.6) $f_x(x) = \int_{y=-\infty}^{+\infty} f_{x,y}(x, y)\, dy$.
[...] located at x and condition on $x \in dV_x$:
(8.3.2) $\Pr[y \in dV_y \mid x \in dV_x] = \frac{\Pr[y \in dV_y \text{ and } x \in dV_x]}{\Pr[x \in dV_x]}$
$= \frac{f_{x,y}(x, y)\, |dV_x|\, |dV_y|}{f_x(x)\, |dV_x|}$ [...]
Problem 131. 2 points The conditional density is the joint divided by the marginal:
(8.3.6) $f_{y|x}(y, x) = \frac{f_{x,y}(x, y)}{f_x(x)}$.
Show that this density integrates out to 1.
Answer. The conditional is a density in y with x as parameter. Therefore its integral with
respect to y must be = 1. Indeed,
$\int_{y=-\infty}^{+\infty} f_{y|x=x}(y, x)\, dy = \frac{\int_{y=-\infty}^{+\infty} f_{x,y}(x, y)\, dy}{f_x(x)} = \frac{f_x(x)}{f_x(x)} = 1$.
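The same computation can be carried out numerically. This Python sketch assumes a made-up joint density f(x, y) = x + y on the unit square (zero elsewhere), for which the marginal density is x + 1/2; the function names, the point x0 = 0.3 and the midpoint rule are illustrative choices.

```python
def joint_density(x, y):
    """Hypothetical joint density f(x, y) = x + y on the unit square, zero elsewhere."""
    return x + y if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0

def integrate(f, lo, hi, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

def marginal_x(x):
    """f_x(x): integrate the joint density over y."""
    return integrate(lambda y: joint_density(x, y), 0.0, 1.0)

x0 = 0.3
fx0 = marginal_x(x0)                 # analytically x0 + 1/2

def conditional_density(y):
    """f_{y|x}(y, x0) = f_{x,y}(x0, y) / f_x(x0), as in (8.3.6)."""
    return joint_density(x0, y) / fx0

total = integrate(conditional_density, 0.0, 1.0)
print(fx0, total)
```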
