CHAPTER 15
Hypothesis Testing
Imagine you are a businessperson considering a major investment in order to
launch a new product. The sales prospects of this product are not known with
certainty. You have to rely on the outcome of n marketing surveys that measure
the demand for the product once it is offered. If µ is the actual (unknown) rate of
return on the investment, each of these surveys will be modeled as a random
variable, which has a Normal distribution with this mean µ and known variance 1.
Let $y_1, y_2, \ldots, y_n$ be the observed survey results. How would you decide whether to
build the plant?
The intuitively reasonable thing to do is to go ahead with the investment if
the sample mean of the observations is greater than a given value c, and not to do
it otherwise. This is indeed an optimal decision rule, and we will discuss in what
respect it is, and how c should be picked.
Your decision can be the wrong decision in two different ways: either you decide
to go ahead with the investment although there will be no demand for the product,
or you fail to invest although there would have been demand. There is no decision
rule which eliminates both errors at once; the first error would be minimized by the
rule never to produce, and the second by the rule always to produce. In order to
determine the right tradeoff between these errors, it is important to be aware of their
asymmetry. The error of going ahead with production although there is no demand has
potentially disastrous consequences (loss of a lot of money), while the other error
may cause you to miss a profit opportunity, but there is no actual loss involved, and
$y_i$ are independent. Compute the value c which satisfies Pr[¯y > c | µ = 0] =
α. You should either look it up in a table and include a xerox copy of the table with
the entry circled and the complete bibliographic reference written on the xerox copy,
or do it on a computer, writing down exactly which commands you used. In R, the function
qnorm does what you need; find out about it by typing help(qnorm).
Answer. In the case n = 400, ¯y has variance 1/400 and therefore standard deviation 1/20 =
0.05. Therefore 20¯y is a standard normal: from Pr[¯y > c |µ = 0] = 0.05 follows Pr[20¯y > 20c |µ =
0] = 0.05. Therefore 20c = 1.645, which can be looked up in a table; one may use [JHG+88, p. 986], the
row for ∞ d.f.
Let us do this in R. The p-“quantile” of the distribution of the random variable y is defined
as that value q for which Pr[y ≤ q] = p. If y is normally distributed, this quantile is computed
by the R-function qnorm(p, mean=0, sd=1, lower.tail=TRUE). In the present case we need either
qnorm(p=1-0.05, mean=0, sd=0.05) or qnorm(p=0.05, mean=0, sd=0.05, lower.tail=FALSE) which
gives the value 0.08224268.
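The calculation in this answer can be reproduced with a few lines of R. This is a minimal sketch under the assumptions of the answer (n = 400, α = 0.05, survey variance 1); the variable names are chosen here for illustration only.

```r
# Critical value c with Pr[ybar > c | mu = 0] = alpha
alpha   <- 0.05
n       <- 400              # number of surveys, as in the answer above
sd_ybar <- 1 / sqrt(n)      # each survey has variance 1, so sd(ybar) = 1/sqrt(n)

# Three equivalent ways to obtain the same quantile
c_upper  <- qnorm(1 - alpha, mean = 0, sd = sd_ybar)
c_tail   <- qnorm(alpha, mean = 0, sd = sd_ybar, lower.tail = FALSE)
c_scaled <- qnorm(1 - alpha) * sd_ybar   # standard normal quantile, rescaled

print(c(c_upper, c_tail, c_scaled))
```

The third variant mirrors the table lookup: the standard normal quantile 1.645 is divided by 20.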
Choosing a decision which makes a loss unlikely is not enough; your decision
must also give you a chance of success. E.g., the decision rule to build the plant if
−0.06 ≤ ¯y ≤ −0.05 and not to build it otherwise is completely perverse, although
the significance level of this decision rule is approximately 4% (if n = 100). In other
words, the significance level is not enough information for evaluating the performance
of the test. You also need the “power function,” which gives you the probability
with which the test advises you to make the “critical” decision, as a function of
the true parameter values. (Here the “critical” decision is that decision which might
potentially lead to an error of type one.) By the definition of the significance level, the
power function does not exceed the significance level for those parameter values for
which going ahead would lead to a type one error. But only those tests are “powerful”
whose power function is high for those parameter values for which it would be correct
to go ahead. In our case, the power function must not exceed 0.05 when µ ≤ 0, and
we want it as high as possible when µ > 0. Figure 1 shows the power function for
the decision rule to go ahead whenever ¯y ≥ c, where c is chosen in such a way that
the significance level is 5%, for n = 100.
Figure 1. Eventually this Figure will show the Power function of
a one-sided normal test, i.e., the probability of error of type one as
a function of µ; right now this is simply the cdf of a Standard Normal
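The two power functions discussed here can be tabulated numerically. The following R sketch assumes n = 100 and survey variance 1 as in the text; the function names power_one_sided and power_perverse are hypothetical labels introduced for this illustration.

```r
n       <- 100
alpha   <- 0.05
sd_ybar <- 1 / sqrt(n)
c_crit  <- qnorm(1 - alpha, mean = 0, sd = sd_ybar)

# Power of the one-sided rule "go ahead if ybar >= c_crit"
power_one_sided <- function(mu) {
  pnorm(c_crit, mean = mu, sd = sd_ybar, lower.tail = FALSE)
}

# Power of the perverse rule "go ahead if -0.06 <= ybar <= -0.05"
power_perverse <- function(mu) {
  pnorm(-0.05, mean = mu, sd = sd_ybar) - pnorm(-0.06, mean = mu, sd = sd_ybar)
}

mu_grid <- seq(-0.3, 0.5, by = 0.1)
print(round(cbind(mu        = mu_grid,
                  one_sided = power_one_sided(mu_grid),
                  perverse  = power_perverse(mu_grid)), 4))
```

Both rules keep the probability of going ahead small when µ ≤ 0, but only the one-sided rule has power that rises toward 1 as µ grows; the perverse rule almost never recommends the investment when it would be profitable.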
The hypothesis whose rejection, although it is true, constitutes an error of type
one, is called the null hypothesis, and its alternative the alternative hypothesis. (In the
examples the null hypotheses were: the return on the investment is zero or negative,
the defendant is innocent, or the results about which one wants to publish a research
paper are wrong.) The null hypothesis is therefore the hypothesis that nothing is
the case. The test asks whether this hypothesis should be rejected, and it safeguards
against erroneously rejecting a hypothesis that one would like to reject. If
you reject the null hypothesis, you don’t want to regret it.
Mathematically, every test can be identified with its null hypothesis, which is
a region in parameter space (often consisting of one point only), and its “critical
region,” which is the event that the test comes out in favor of the “critical decision,”
i.e., rejects the null hypothesis. The critical region is usually an event of the form
that the value of a certain random variable, the “test statistic,” is within a given
range, usually that it is too high. The power function of the test is the probability
of the critical region as a function of the unknown parameters, and the significance
level is the maximum (or, if this maximum depends on unknown parameters, any
Problem 215. 7 points Someone makes an extended experiment throwing a coin
10,000 times. The relative frequency of heads in these 10,000 throws is a random
variable. Given that the probability of getting a head is p, what are the mean and
standard deviation of the relative frequency? Design a test, at 1% significance level,
of the null hypothesis that the coin is fair, against the alternative hypothesis that
p < 0.5. For this you should use the central limit theorem. If the head showed 4,900
times, would you reject the null hypothesis?
Answer. Let $x_i$ be the random variable that equals one when the $i$-th throw is a head, and
zero otherwise. The expected value of $x$ is $p$, the probability of throwing a head. Since $x^2 = x$,
$\operatorname{var}[x] = E[x] - (E[x])^2 = p(1-p)$. The relative frequency of heads is simply the average of all $x_i$,
call it $\bar{x}$. It has mean $p$ and variance $\sigma_{\bar{x}}^2 = \frac{p(1-p)}{10{,}000}$. Given that it is a fair coin, its mean is 0.5 and
its standard deviation is 0.005. Reject if the actual frequency $< 0.5 - 2.326\,\sigma_{\bar{x}} = 0.48837$. Another
approach:
(15.0.33) Pr(¯x ≤ 0.49) = Pr
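The numbers in this answer can be checked in R. This is a sketch under the normal approximation that the problem asks for (central limit theorem); all variable names are illustrative.

```r
n     <- 10000
p0    <- 0.5      # null hypothesis: fair coin
alpha <- 0.01

sd_freq  <- sqrt(p0 * (1 - p0) / n)          # standard deviation of the relative frequency
crit     <- p0 + qnorm(alpha) * sd_freq      # reject if the relative frequency is below crit
observed <- 4900 / n

z       <- (observed - p0) / sd_freq         # standardized observation
p_value <- pnorm(z)                          # one-sided probability Pr[xbar <= 0.49] under the null
reject  <- observed < crit

print(c(sd_freq = sd_freq, crit = crit, z = z, p_value = p_value))
print(reject)
```

With 4,900 heads the observed frequency 0.49 lies above the critical value, so the null hypothesis is not rejected at the 1% level.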
0
∈ Ω.
Mathematically, confidence regions and such families of tests are one and the
same thing: if one has a confidence region R(y), one can define a test of the null
hypothesis φ = φ
0
as follows: for an observed outcome y reject the null hypothesis
if and only if φ
0
is not contained in R(y). On the other hand, given a family of tests,
one can build a confidence region by the prescription: R(y) is the set of all those
parameter values which would not be rejected by a test based on observation y.
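The correspondence between a family of tests and a confidence region can be illustrated numerically. The sketch below assumes a Normal sample with unknown mean and known variance 1 and uses two-sided tests at the 5% level; the data vector is hypothetical and the grid only approximates the parameter space.

```r
y     <- c(0.3, -0.1, 0.8, 1.2, 0.4, 0.9, -0.2, 0.6)   # hypothetical observations
n     <- length(y)
ybar  <- mean(y)
alpha <- 0.05

# Family of tests: for each phi0, reject "mean = phi0" if the standardized
# distance between ybar and phi0 is too large
rejects <- function(phi0) abs(sqrt(n) * (ybar - phi0)) > qnorm(1 - alpha / 2)

# Confidence region: all parameter values on the grid that are not rejected
grid   <- seq(-1, 2, by = 0.01)
region <- grid[!rejects(grid)]

# Closed-form interval for comparison
interval <- ybar + c(-1, 1) * qnorm(1 - alpha / 2) / sqrt(n)

print(range(region))
print(interval)
```

Up to the grid resolution, the set of non-rejected values coincides with the usual interval around the sample mean.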
Problem 216. Show that with these definitions, equations (14.0.5) and (15.1.1)
are equivalent.
Answer. Since $\phi_0 \in R(y)$ iff $y \in C'(\phi_0)$ (the complement of the critical region rejecting that
the parameter value is $\phi_0$), it follows $\Pr[R(y) \ni \phi_0 \mid \phi = \phi_0] = 1 - \Pr[C(\phi_0) \mid \phi = \phi$
other critical region with same probability of type one error has equal or higher
probability of committing error of type two, regardless of the true value of µ.
Here are formulation and proof of the Neyman Pearson lemma, first for the
case that both null hypothesis and alternative hypothesis are simple: $H_0: \theta = \theta_0$,
$H_A: \theta = \theta_1$. In other words, we want to determine on the basis of the observations of
the random variables $y_1, \ldots, y_n$ whether the true $\theta$ was $\theta_0$ or $\theta_1$, and a determination
$\theta = \theta_1$ when in fact $\theta = \theta_0$ is an error of type one. The critical region C is the set of
all outcomes that lead us to conclude that the parameter has value $\theta_1$.
ical region of a different test with same significance level α, i.e., if the null hypothesis
is correct, then C and D reject (and therefore commit an error of type one) with
equally low probabilities α. In formulas, $\Pr[C|\theta_0] = \Pr[D|\theta_0] = \alpha$. Look at figure 2
with $C = U \cup V$ and $D = V \cup W$. Since C and D have the same significance level,
it follows
(15.2.2) $\Pr[U|\theta_0] = \Pr[W|\theta_0].$
Also
(15.2.3) $\Pr[U|\theta_1] \ge k \Pr[U|\theta_0],$
since $U \subset C$ and C was chosen such that the likelihood (density) function of the
alternative hypothesis is high relative to that of the null hypothesis. Since W lies
outside C, the same argument gives
(15.2.4) $\Pr[W|\theta_1] \le k \Pr[W|\theta_0].$
Linking those two inequalities and the equality gives
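$C = \Bigl\{ y_1, \ldots, y_n : \Bigl(\frac{1}{\sqrt{2\pi}}\Bigr)^{n} e^{-\frac{1}{2}((y_1-t)^2+\cdots+(y_n-t)^2)} \ge k \Bigl(\frac{1}{\sqrt{2\pi}}\Bigr)^{n} e^{-\frac{1}{2}(y_1^2+\cdots+y_n^2)} \Bigr\}$
(15.2.7) $= \Bigl\{ y_1, \ldots, y_n : t(y_1+\cdots+y_n) - \frac{t^2 n}{2} \ge \ln k \Bigr\}$
(15.2.8) $= \Bigl\{ y_1, \ldots, y_n : \bar{y} \ge \frac{\ln k}{nt} + \frac{t}{2} \Bigr\}$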
written in the form $C = \{y_1, \ldots, y_4 : y_1 + \cdots + y_4 \ge 3.290\}$.
Answer. Here is the equation which determines when $y_1, \ldots, y_4$ lie in C:
(15.2.12) $(2\pi)^{-2} \exp\Bigl(-\frac{1}{2}\bigl((y_1-1)^2 + \cdots + (y_4-1)^2\bigr)\Bigr) \ge 3.633 \cdot (2\pi)^{-2} \exp\Bigl(-\frac{1}{2}\bigl(y_1^2 + \cdots + y_4^2\bigr)\Bigr)$
(15.2.13) $y_1 + \cdots + y_4 - 2 \ge 1.290$
15.2. THE NEYMAN PEARSON LEMMA AND LIKELIHOOD RATIO TESTS 439
Since $\Pr[y_1 + \cdots + y_4 \ge 3.290] = \Pr[z = (y_1 + \cdots + y_4)/2 \ge 1.645]$ and $z$ is a standard normal, one
obtains the significance level of 5% from the standard normal table or the t-table.
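The arithmetic of this answer can be verified in R. The sketch takes k = 3.633, t = 1 and n = 4 from the answer; the variable names are illustrative.

```r
k     <- 3.633
t_alt <- 1      # mean under the alternative hypothesis
n     <- 4

ln_k          <- log(k)                     # right-hand side after taking logs
sum_threshold <- ln_k + t_alt^2 * n / 2     # threshold for y_1 + ... + y_4

# Under the null hypothesis the sum of 4 independent N(0,1) variables is N(0, 4)
level <- pnorm(sum_threshold, mean = 0, sd = sqrt(n), lower.tail = FALSE)

print(c(ln_k = ln_k, sum_threshold = sum_threshold, level = level))
```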
Note that due to the properties of the Normal distribution, this critical region,
for a given significance level, does not depend at all on the value of t. Therefore this
test is uniformly most powerful against the composite hypothesis µ > 0.
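That the critical region does not depend on t can also be seen numerically. The sketch below assumes n = 4 and a 5% level; for several hypothetical alternatives t it computes the constant k that yields this level and then the resulting threshold for the sample mean, using the form $\bar{y} \ge \ln k/(nt) + t/2$.

```r
n     <- 4
alpha <- 0.05

# Threshold for ybar at level alpha; it is computed without reference to t
c_bar <- qnorm(1 - alpha) / sqrt(n)

t_values <- c(0.5, 1, 2, 5)                              # hypothetical alternatives
k_values <- exp(n * t_values * (c_bar - t_values / 2))   # k that gives level alpha for each t

# Threshold implied by each pair (t, k)
thresholds <- log(k_values) / (n * t_values) + t_values / 2

print(cbind(t = t_values, k = k_values, threshold = thresholds))
```

The constant k changes with t, but the threshold for the sample mean stays the same; for t = 1 the value of k is close to the 3.633 used in the problem above.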
One can also write the null hypothesis as the composite hypothesis µ ≤ 0, because
the highest probability of type one error will still be attained when µ = 0. This
completes the proof that the test given in the original fertilizer example is uniformly
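$\Bigl\{ x_1, \ldots, x_n : \dfrac{\sup_{\theta \in \Omega} L(x_1, \ldots, x_n; \theta)}{\sup_{\theta \in \omega} L(x_1, \ldots, x_n; \theta)} \ge k \Bigr\}$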
where k is chosen such that the probability of the critical region when the null
hypothesis is true has as its maximum the desired significance level. It can be shown
that twice the log of this quotient is asymptotically distributed as a $\chi^2_{q-s}$, where q
is the dimension of Ω and s the dimension of ω. (Sometimes the likelihood ratio
is defined as the inverse of this ratio, but whenever possible we will define our test
statistics so that the null hypothesis is rejected if the value of the test statistic is
too large.)
In order to perform a likelihood ratio test, the following steps are necessary:
First construct the MLE’s for θ ∈ Ω and θ ∈ ω, then take twice the difference of the
attained levels of the log likelihood functions, and compare with the $\chi^2$ tables.
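The steps just listed can be sketched in R for a simple case. The example assumes observations modeled as Normal with unknown mean and known variance 1, with Ω the whole real line (q = 1) and ω = {0} (s = 0); the data are hypothetical.

```r
y <- c(0.3, -0.1, 0.8, 1.2, 0.4, 0.9, -0.2, 0.6)   # hypothetical observations

# Log likelihood of the sample as a function of the mean
loglik <- function(mu) sum(dnorm(y, mean = mu, sd = 1, log = TRUE))

mu_hat_Omega <- mean(y)   # MLE over Omega (unrestricted)
mu_hat_omega <- 0         # MLE over omega (the null hypothesis fixes the mean)

# Twice the difference of the attained log likelihood levels
lr_stat <- 2 * (loglik(mu_hat_Omega) - loglik(mu_hat_omega))

# Compare with the chi-square distribution with q - s = 1 degree of freedom
crit    <- qchisq(0.95, df = 1)
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
reject  <- lr_stat > crit

print(c(lr_stat = lr_stat, crit = crit, p_value = p_value))
```

In this particular model the statistic equals n times the squared sample mean, so the chi-square distribution holds exactly and not only asymptotically.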
15.3. The Runs Test
[Spr98, pp. 171–175] is a good introductory treatment, similar to the one given
here. More detail in [GC92, Chapter 3] (not in University of Utah Main Library)
and even more in [Bra68, Chapters 11 and 23] (which is in the Library).
Each of your three research assistants has to repeat a certain experiment 9 times,
dence): the probability of a given sequence of failures and successes only depends on
the number of failures and successes, not on the order in which they occur. Then the
conditional distribution of the number of runs can be obtained by simple counting.
How many arrangements of 5 zeros and 4 ones are there? The answer is
$\binom{9}{4} = \frac{(9)(8)(7)(6)}{(1)(2)(3)(4)} = 126$. How many of these arrangements have 9 runs? Only one, i.e., the
probability of having 9 runs (conditionally on observing 4 successes) is 1/126. The
probability of having 2 runs is 2/126, since one can either have the zeros first, or the
ones first.
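The counting argument can be checked by brute force in R. This sketch enumerates all arrangements of 4 ones and 5 zeros and counts the runs in each; the helper count_runs is a name introduced here for illustration.

```r
# Each column of positions gives the places of the 4 ones among 9 trials
positions <- combn(9, 4)

# Number of runs in the 0/1 sequence with ones at the given positions
count_runs <- function(pos) {
  x <- integer(9)
  x[pos] <- 1L
  length(rle(x)$lengths)
}

runs   <- apply(positions, 2, count_runs)
counts <- table(runs)

print(ncol(positions))   # number of arrangements
print(counts)            # how many arrangements have each number of runs
```

Dividing the counts by the number of arrangements gives the conditional distribution of the number of runs.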
In order to compute the probability of 7 runs, let's first ask: what is the probability
of having 4 runs of ones and 3 runs of zeros? Since there are only 4 ones, each
run of ones must have exactly one element. So the distribution of ones and zeros
must be:
1 − one or more zeros − 1 − one or more zeros − 1 − one or more zeros − 1.
In order to specify the distribution of ones and zeros completely, we must therefore
count how many ways there are to split the sequence of 5 zeros into 3 nonempty
batches. Here are the possibilities:
(15.3.1)
0 0 0 | 0 | 0
0 0 | 0 0 | 0
0 0 | 0 | 0 0
0 | 0 0 0 | 0
0 | 0 0 | 0 0
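$\Pr[r = 2s+1] = \dfrac{\binom{m-1}{s-1}\binom{n-1}{s} + \binom{m-1}{s}\binom{n-1}{s-1}}{\binom{m+n}{m}}$
(15.3.3) $\Pr[r = 2s] = 2\,\dfrac{\binom{m-1}{s-1}\binom{n-1}{s-1}}{\binom{m+n}{m}}$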
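These formulas are easy to evaluate in R. The sketch below assumes that m and n denote the numbers of ones and zeros and that r is the number of runs; the function name runs_prob is hypothetical. It is applied to the example with 4 ones and 5 zeros.

```r
# Conditional probability of observing r runs, given m ones and n zeros
runs_prob <- function(r, m, n) {
  s     <- r %/% 2
  total <- choose(m + n, m)
  if (r %% 2 == 0) {
    # even number of runs: s runs of each kind, either kind may come first
    2 * choose(m - 1, s - 1) * choose(n - 1, s - 1) / total
  } else {
    # odd number of runs: one kind has s + 1 runs, the other has s
    (choose(m - 1, s - 1) * choose(n - 1, s) +
       choose(m - 1, s) * choose(n - 1, s - 1)) / total
  }
}

r_values <- 2:9
probs    <- sapply(r_values, runs_prob, m = 4, n = 5)

print(cbind(runs = r_values, count = round(probs * 126)))
print(sum(probs))
```

The counts for 2 and 9 runs agree with the values 2/126 and 1/126 obtained by direct counting.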
even if the unconditional significance level is needed, there is one way out. If we
were to specify a decision rule for every number of successes in such a way that the
conditional probability of rejecting is the same in all of them, then this conditional
probability is also equal to the unconditional probability. The only problem here
is that, due to discreteness, we can make the probability of type one errors only
approximately equal; but with increasing sample size this problem disappears.
Figure 3. Distribution of runs in 7 trials, if there are 4 successes
and 3 failures
Problem 218. Write approximately 200 x’s and o’s on a piece of paper, trying
to do it in a random manner. Then perform a runs test of whether these x’s and o’s were
indeed random. Would you want to run a two-sided or one-sided test?
The law of rare events literature can be considered a generalization of the run
test. For epidemiology compare [Cha96], [DH94], [Gri79], and [JL97].
15.4. Pearson’s Goodness of Fit Test.
Given an experiment with r outcomes, which have probabilities $p_1, \ldots, p_r$, where
$p_i$
$\sum_{i=1}^{r} \dfrac{(x_i - np_i^0)^2}{np_i^0}.$
This test statistic is often called the Chi-Square statistic. It is asymptotically distributed
as a $\chi^2_{r-1}$; reject the null hypothesis when the observed value of this statistic
is too big. The critical region can be read off a table of the $\chi^2$.
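The statistic and its critical value can be computed directly in R. The sketch below uses hypothetical counts from 120 throws of a die and the null hypothesis that all six faces are equally likely; it also compares the hand computation with R's built-in chisq.test.

```r
x  <- c(18, 22, 16, 25, 19, 20)   # hypothetical observed counts
p0 <- rep(1 / 6, 6)               # null hypothesis probabilities
n  <- sum(x)
r  <- length(x)

# Chi-square statistic: weighted squared deviations from the expected counts
chi_sq <- sum((x - n * p0)^2 / (n * p0))

crit    <- qchisq(0.95, df = r - 1)                     # 5% critical value
p_value <- pchisq(chi_sq, df = r - 1, lower.tail = FALSE)
reject  <- chi_sq > crit

# Cross-check with the built-in goodness of fit test
builtin <- unname(chisq.test(x, p = p0)$statistic)

print(c(chi_sq = chi_sq, builtin = builtin, crit = crit, p_value = p_value))
```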
Why does one get a $\chi^2$ distribution in the limiting case? Because the $x_i$ themselves
are asymptotically normal, and certain quadratic forms of normal distributions
are $\chi^2$
$\chi^2_k$ iff
ΨΩΨΩΨ = ΨΩΨ and k is the rank of Ω. If Ψ is singular, i.e., does not have an
inverse, and Ω is a g-inverse of Ψ, then condition (10.4.9) holds. A matrix Ω is a
g-inverse of Ψ iff ΨΩΨ = Ψ. Every matrix has at least one g-inverse, but may have
more than one.
Now back to our multinomial distribution. By the central limit theorem, the $x_i$
are asymptotically jointly normal; their mean and covariance matrix are given by
equation (8.4.2). This covariance matrix is singular (has rank r − 1), and a g-inverse
is given by (15.4.2), which has in its diagonal exactly the weighting factors used in
the statistic for the goodness of fit test.
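The g-inverse property and its link to the goodness of fit statistic can be checked numerically. The sketch below assumes the usual multinomial covariance matrix n(diag(p) − pp′), which is presumably what equation (8.4.2) states, and takes the diagonal matrix with entries 1/(n p_i) as the candidate g-inverse; the probabilities and counts are hypothetical.

```r
p <- c(0.2, 0.3, 0.5)   # hypothetical outcome probabilities
n <- 50

Psi   <- n * (diag(p) - outer(p, p))   # multinomial covariance matrix (assumed form)
Omega <- diag(1 / p) / n               # candidate g-inverse

# g-inverse condition: Psi Omega Psi = Psi
is_g_inverse <- isTRUE(all.equal(Psi %*% Omega %*% Psi, Psi))

# The quadratic form in the deviations reproduces the chi-square statistic
x         <- c(12, 13, 25)             # hypothetical observed counts, summing to n
dev       <- x - n * p
quad_form <- drop(t(dev) %*% Omega %*% dev)
pearson   <- sum(dev^2 / (n * p))

print(is_g_inverse)
print(c(quad_form = quad_form, pearson = pearson))
print(qr(Psi)$rank)                    # numerical rank of the covariance matrix
```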
Problem 219. 2 points A matrix Ω is a g-inverse of Ψ iff ΨΩΨ = Ψ. Show
that the following matrix
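(15.4.2) $\dfrac{1}{n} \begin{bmatrix} \frac{1}{p_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{p_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{p_r} \end{bmatrix}$
(15.4.3) $\begin{bmatrix} p_1 - p_1^2 & -p_1 p_2 & \cdots & -p_1 p_r \\ -p_2 p_1 & p_2 - p_2^2 & \cdots & -p_2 p_r \\ \vdots & \vdots & \ddots & \vdots \\ -p_r p_1 & -p_r p_2 & \cdots & p_r - p_r^2 \end{bmatrix}$
$\dfrac{1}{n} \begin{bmatrix} \frac{1}{p_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{p_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{p_r} \end{bmatrix}$
$\begin{bmatrix} \vdots & \vdots & \ddots & \vdots \\ -p_r & -p_r & \cdots & 1 - p_r \end{bmatrix},$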