Applied Econometrics Outliers, Leverage and Influence
Applied Econometrics
Lecture 3: Outliers, Leverage and Influence
‘Life is the art of drawing sufficient conclusions from insufficient premises’
SAMUEL BUTLER
1) Introduction
Estimates of the regression parameters can be strongly influenced by a few extreme observations. A
residual plot lets us pick out individual data points whose residuals are unusually high or low. We
may therefore use the residual plot to find outliers, that is, points that are inadequately captured
by the regression model itself.
2) Identification of outliers
• The percentiles that cut the data into four quarters have special names: the 25th and 75th percentiles are called the lower and upper quartiles ($Q_L$ and $Q_U$).
• The lower quartile is the [integer((n+1)/2)+1]/2 value from the bottom of the ordered list; the upper quartile is the [integer((n+1)/2)+1]/2 value from the top.
• A data point $Y_0$ is considered to be an outlier if
Written by Nguyen Hoang Bao May 20, 2004
$$t_i = \frac{e_i}{s(i)\sqrt{1-h_i}}$$
where
$e_i$ is the residual ($e_i = y_i - \hat{y}_i$);
$s(i)$ is the standard error of the regression estimated after dropping the $i$th observation from the sample;
$h_i$ is the hat statistic for the $i$th observation, which is defined as
$$h_i = \frac{1}{n} + \frac{(X_i-\bar{X})^2}{\sum_{j=1}^{n}(X_j-\bar{X})^2}$$
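As an illustration of the studentized residual, the sketch below computes $t_i$ for a simple regression of y on x by refitting the line with each observation left out in turn. The function names and the small data set are hypothetical, chosen only to show the mechanics, and s(i) is taken here to be the root of the residual sum of squares over n - 3 degrees of freedom (n - 1 observations, two estimated coefficients), which is an assumption about the intended definition.

```python
import math

def fit_line(x, y):
    """Least squares fit of y = a + b*x; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def studentized_residuals(x, y):
    """t_i = e_i / (s(i) * sqrt(1 - h_i)) for every observation i."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    a, b = fit_line(x, y)
    t = []
    for i in range(n):
        e_i = y[i] - (a + b * x[i])                # residual from the full fit
        h_i = 1 / n + (x[i] - x_bar) ** 2 / sxx    # hat statistic
        # Refit without observation i to obtain s(i).
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        a_i, b_i = fit_line(x_rest, y_rest)
        rss = sum((yj - (a_i + b_i * xj)) ** 2 for xj, yj in zip(x_rest, y_rest))
        s_i = math.sqrt(rss / (n - 3))             # assumed degrees of freedom
        t.append(e_i / (s_i * math.sqrt(1 - h_i)))
    return t

# Hypothetical data: roughly y = 2x, with the sixth Y value shifted upwards.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 18.0, 14.1, 15.9]
t = studentized_residuals(x, y)
most_extreme = max(range(len(t)), key=lambda i: abs(t[i]))
print(most_extreme, [round(ti, 2) for ti in t])
```

Because s(i) is estimated without the observation in question, a single aberrant Y value cannot inflate its own yardstick, which is why the shifted point stands out so clearly in this example.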
4) Leverage
A data point has high leverage if it is far removed in the X-direction (i.e., it lies a disproportionate
distance away from the middle of the range of X values) (Myers, 1990).
The points of high leverage can exert undue influence on the outcome of a least squares regression
line. That is, points with high leverage are capable of exerting a strong pull on the slope of the
regression line.
In univariate analysis, the definitions of an outlier and a point of leverage are the same: a point
that is an outlier also has high leverage with respect to the mean. In bivariate analysis, a point of
high leverage (with respect to the slope coefficient) is one that is far removed in the X-direction
(as opposed to an outlier, which is far removed in the Y-direction).
A test statistic for the leverage is the hat statistic:
$$h_i = \frac{1}{n} + \frac{(X_i-\bar{X})^2}{\sum_{j=1}^{n}(X_j-\bar{X})^2}$$
$\max(h_i) < 0.2$: little to worry about
$0.2 < \max(h_i) < 0.5$: risky
$0.5 < \max(h_i)$: too much leverage
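The hat statistic and the rule of thumb above can be sketched in a few lines. The helper names and the X values are hypothetical, and since the rule uses strict inequalities, the treatment of values exactly equal to 0.2 or 0.5 is an assumption of this sketch.

```python
def hat_values(x):
    """h_i = 1/n + (X_i - mean)^2 / sum_j (X_j - mean)^2 for a simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xj - x_bar) ** 2 for xj in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

def leverage_verdict(h):
    """Apply the rule of thumb to the largest hat value."""
    h_max = max(h)
    if h_max < 0.2:
        return "little to worry about"
    if h_max < 0.5:          # boundary handling is an assumption
        return "risky"
    return "too much leverage"

# Hypothetical X values: ten evenly spaced points and one far to the right.
x = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 30]
h = hat_values(x)
verdict = leverage_verdict(h)
print(round(max(h), 3), verdict)
```

Note that the hat statistic depends only on the X values: no Y data are needed to decide whether a point has leverage. In a simple regression with an intercept the hat values sum to 2, which gives a quick check on the calculation.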
5) Influence
A data point is influential if removing it from the sample would markedly change the position of the
least squares regression line (Moore and McCabe, 1989). Hence, influential data points pull the
regression line in their direction.
The influential data points do not necessarily produce large residuals. That is, they are not always
outliers as well, although they can be. Conversely, an outlier is not necessarily an influential point,
particularly when it is a point with little leverage.
In univariate analysis, an outlier has high leverage and will be influential. In bivariate analysis, high
leverage is a necessary condition for influence on the slope, but not a sufficient one. Similarly, an
outlier may not be influential if it has low leverage, and a point of high leverage may not show up as
an outlier if its leverage is strong enough to pull the fitted line close to it.
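A test for influence is the DFBETA statistic, which is defined as
$$\mathrm{DFBETA}_i = \frac{\beta_1 - \beta_1(i)}{\mathrm{SE}(\beta_1(i))}$$
if $|\mathrm{DFBETA}_i| > 3/\sqrt{n}$, the point is influential
if $2/\sqrt{n} < |\mathrm{DFBETA}_i| < 3/\sqrt{n}$, the point is inconclusive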
Regression analysis should capture the general pattern in the data, and an influential point can
prevent it from doing so. Hence, influential points are often best dropped from the regression.
DFBETAs should always be used in conjunction with diagnostic regression graphics. It is always
possible that a cluster of points is exerting influence rather than a single data point.
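A minimal sketch of the DFBETA calculation for the slope of a simple regression follows. Each observation is dropped in turn, the line is refitted, and the change in the slope is scaled by the standard error of the slope from the reduced sample. The function names and data are hypothetical. The cut-offs 2/√n and 3/√n use the full sample size n, and labelling values below 2/√n as "no influence" follows the summary table rather than the rule stated in the text.

```python
import math

def slope_and_se(x, y):
    """Least squares slope of y on x and its standard error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(rss / (n - 2))       # standard error of the regression
    return slope, s / math.sqrt(sxx)

def dfbeta_slope(x, y):
    """DFBETA_i = (b - b(i)) / SE(b(i)) for each observation i."""
    b, _ = slope_and_se(x, y)
    out = []
    for i in range(len(x)):
        b_i, se_i = slope_and_se(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        out.append((b - b_i) / se_i)
    return out

def classify(dfbeta, n):
    """Compare |DFBETA| with the 2/sqrt(n) and 3/sqrt(n) cut-offs."""
    size = abs(dfbeta)
    if size > 3 / math.sqrt(n):
        return "influential"
    if size > 2 / math.sqrt(n):
        return "inconclusive"
    return "no influence"

# Hypothetical data: seven points close to y = x and one high-leverage
# point lying well below that line.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.0]
d = dfbeta_slope(x, y)
labels = [classify(di, len(x)) for di in d]
for xi, di, label in zip(x, d, labels):
    print(xi, round(di, 2), label)
```

In this example the last point drags the fitted slope well below the slope of the other seven points, so its DFBETA is large and negative. A statistic of this kind examines one point at a time, so it will not reveal a cluster of points acting together.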
Table 5: Summary measures of outliers, leverage, and influence
| Statistic | Formula | Use | Critical value |
|---|---|---|---|
| Studentized residual ($t_i$) | $t_i = \frac{e_i}{s(i)\sqrt{1-h_i}}$ | Outliers | Critical values available (higher than usual t-test), but recommend use t ... |
| Hat statistic ($h_i$) | $h_i = \frac{1}{n} + \frac{(X_i-\bar{X})^2}{\sum_{j=1}^{n}(X_j-\bar{X})^2}$ | Leverage | ... leverage); values above 0.5 indicate excessive leverage and values over 0.2 indicate the observation may give problems |
| DFBETA | $\mathrm{DFBETA}_i = \frac{\beta_1 - \beta_1(i)}{\mathrm{SE}(\beta_1(i))}$ | Influence | Under $2/\sqrt{n}$, the point has no influence; over $3/\sqrt{n}$, the point is influential, and strongly so if DFBETA exceeds 2 |
Note: n is the sample size; k is the number of regressors; the subscript (i) (i.e., with parentheses) indicates an estimation from the
sample omitting observation i. In each case you should use the absolute value of the calculated statistic.
Source: Mukherjee Chandan, Howard White and Marc Wuyts (1998), ‘Econometrics and Data Analysis for
Developing Countries’ published by Routledge, London, UK.
Workshop 3: Outliers, Leverage and Influence
1) Look carefully at the four plots in the attached figure. For each plot, write down whether any of
the points is an outlier, a point of high leverage, an influential point, or some combination of
these. Briefly comment on your findings.
Hint:
Outliers are not necessarily influential (plot 4)
But they can be so (depending on leverage) (plot 3)
Yet high leverage points are not always influential (plot 1)
And influential points are not necessarily outliers (plot 2)
Plot summary
| Plot | Outliers | Leverage | Influence |
|---|---|---|---|
| Plot 1 | ______ | ______ | ______ |
| Plot 2 | ______ | ______ | ______ |
| Plot 3 | ______ | ______ | ______ |
| Plot 4 | ______ | ______ | ______ |
Y: 8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68
X: 10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5
Y: 9.14, 8.14, 8.74, 8.77, 9.26, 8.1
X: 13, 9, 11, 14, 6, 4, 12, 7, 5
Y: 6.58, 5.76, 7.71, 8, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89
X: 8, 8, 8, 8, 8, 8, 8, 19, 8
4.5) Show algebraically that a point with no leverage cannot have any influence on the slope
coefficient.
5) Outliers in bivariate analysis
5.1) Using the data file HOLMQ, which contains the data for EDUEXP and EAID, examine the
figure and test whether any of the points is:
a) an outlier
b) a point of high leverage
c) an influential point
5.2) Draw the scatter plot, showing the fitted line. Briefly comment on your findings.