TABLE 8.3 The effect of missing values (?.??) on the summary values of x and y.
n
x y x2 0.30 0.28 0.29 2
0.75 0.32 0.83 0.10 0.69 0.27
0.18 5
0.43 0.54 0.18 2.14 1.25 1
0.55 0.53 0.37 ?.?? 0.14 ?.?? 3 4
0.21 ?.?? 0.04 ?.??
0.29 0.23 Sum
?.?? ?.?? ?.??
Since getting the aggregated values correct is critical, the modeler requires some method
to determine the appropriate values, even with missing values. This sounds a bit like
pulling oneself up by one’s bootstraps: estimate the missing values in order to estimate the
missing values! However, things are not quite so difficult.
In a representative sample, for any particular joint distribution, the ratios between the
various values Σx and Σx², and Σy and Σy², remain constant. So too do the ratios between
Σx and Σxy, and between Σy and Σxy. When these ratios are found, they are the equivalent of setting
the value of n to 1. One way to see why is that in any representative sample
the ratios are constant regardless of the number of instance values, and that includes
n = 1. More mathematically, the effect of the number of instances cancels out. The end
result is that when using ratios, n can be set to unity. In the linear regression formulae,
values are multiplied by n, and multiplying a value by 1 leaves the original value
unchanged, so n can be left out of the expression. In the
calculations that follow, that piece is dropped since it has no effect on the result.
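The cancellation of n can be checked numerically. The sketch below is illustrative only: the data pairs and the function names are assumptions, and the formulae are the standard least-squares ones rather than a transcription of the book's own expressions. It fits a line once from the raw sums (where n appears explicitly) and once from per-instance means (which is the same formula with n set to 1).

```python
# Hypothetical jointly present (x, y) pairs, for illustration only.
pairs = [(0.2, 0.9), (0.4, 0.7), (0.5, 0.4), (0.9, 0.2)]

def fit_from_sums(pairs):
    """Least-squares intercept a and slope b using raw sums and an explicit n."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

def fit_from_means(pairs):
    """Same fit using mean values, i.e. the sums formula with n = 1."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    mxx = sum(x * x for x, _ in pairs) / n
    mxy = sum(x * y for x, y in pairs) / n
    b = (mxy - mx * my) / (mxx - mx * mx)
    a = my - b * mx
    return a, b

a_sums, b_sums = fit_from_sums(pairs)
a_means, b_means = fit_from_means(pairs)
print(a_sums, b_sums)
print(a_means, b_means)
```

Both calls should agree to within floating-point rounding, which is the sense in which n has no effect on the result.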
The key to building the regression equations lies in discovering the needed ratios for
those values that are jointly present. Given the present and missing values that are shown
in the lower part of Table 8.3, what are the ratios?
Ratio Σx to:
Σx²  0.45
Σxy  0.61
identical to using their mean values. The mean values of variable x and of variable y are
taken for the values of each that are jointly present as shown in Table 8.5.
TABLE 8.5 Mean values of x and y for estimating missing values.
n
x y
0.37 3
0.32 0.83 4
0.21
1.30 1.90 Mean
0.43 0.63 Est Σx²
Est Σxy 0.43
0.43 × 0.45 = 0.1935
The a value is 1.06. With suitable values discovered for a and b, and using the formula for
a straight line, an expression can be built that will provide an appropriate estimate for any
missing value of y, given a value of x. That expression is
y = a + bx
y = 1.06 + (–1)x
y = 1.06 – x
Table 8.7 uses this expression to estimate the values of y, given x, for all of the original
values of x.
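As a minimal sketch of applying this expression, the following Python function substitutes the a and b values given in the text. The function name estimate_y and the sample x values are illustrative assumptions.

```python
def estimate_y(x, a=1.06, b=-1.0):
    """Estimate y from x with the straight-line expression y = a + bx."""
    return a + b * x

# Illustrative x values only.
for x in (0.55, 0.32, 0.21):
    print(f"x = {x:.2f}  estimated y = {estimate_y(x):.2f}")
```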
TABLE 8.7 Derived estimates of y given an x value using linear regression based
0.55 0.53 0.51 0.02 0.75 0.74 0.09 0.21 0.86 0.85
These estimates of y are quite close to the original values in this example. The error, the
difference between the original value and the estimate, is small compared to the actual
value.
Multiple Linear Regression
The equations used for performing multiple regression are extensions of those already
used for linear regression. They are built from the same components as linear
regression (Σx, Σx², Σxy) for every pair of variables included in the multiple regression.
(Each variable becomes x in turn, and for that x, each of the other variables becomes y in
turn.) All of these values can be estimated by finding the ratio relationships for those
variables’ values that are jointly present in the initial sample data set. With this information
available, good linear estimates of the missing values of any variable can be made using
whatever variable instance values are actually present.
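The pairwise bookkeeping described above might be sketched as follows. This is an assumption-laden illustration: records are dictionaries with None marking a missing value, the names pairwise_sums and fit_line are hypothetical, and only the per-pair components and a simple per-pair line fit are shown, not the full multiple regression solution.

```python
from itertools import permutations

def pairwise_sums(records, names):
    """For each ordered pair (x, y), accumulate n, Σx, Σy, Σx², Σxy
    over only those instances where both values are present."""
    sums = {}
    for x_name, y_name in permutations(names, 2):
        n = 0
        sx = sy = sxx = sxy = 0.0
        for rec in records:
            x, y = rec.get(x_name), rec.get(y_name)
            if x is None or y is None:
                continue  # skip instances that are not jointly present
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
        sums[(x_name, y_name)] = {"n": n, "sx": sx, "sy": sy, "sxx": sxx, "sxy": sxy}
    return sums

def fit_line(s):
    """Least-squares intercept a and slope b for one pair's sums."""
    denom = s["n"] * s["sxx"] - s["sx"] ** 2
    b = (s["n"] * s["sxy"] - s["sx"] * s["sy"]) / denom
    a = (s["sy"] - b * s["sx"]) / s["n"]
    return a, b

# Hypothetical data: v = 2u + 1 wherever both values are present.
records = [
    {"u": 1.0, "v": 3.0},
    {"u": 2.0, "v": 5.0},
    {"u": 3.0, "v": None},
    {"u": None, "v": 9.0},
    {"u": 4.0, "v": 9.0},
]
sums = pairwise_sums(records, ["u", "v"])
a, b = fit_line(sums[("u", "v")])
print(sums[("u", "v")]["n"], a, b)
```

Each variable takes the role of x in turn, so the dictionary holds an entry for ("u", "v") and another for ("v", "u").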
With the ratio information known for all of the variables, a suitable multiple regression can
relationship is what the modeler needs to explore, not some pattern artificially constructed
by replacing missing values!
Alternative Methods of Missing-Value Replacement
Preserving joint variability between variables is far more effective at providing unbiased
replacement values than methods that do not preserve variability. In practice, many
variables do have essentially linear between-variable relationships. Even where the
relationship is nonlinear, a linear estimate, for the purpose of finding a replacement for a
missing value, is often perfectly adequate. The minute amount of bias introduced is often
below the noise level in the data set anyway and is effectively unnoticeable.
Compared to finding nonlinear relationships, discovering linear relationships is both fast
and easy. This means that linear techniques can be implemented to run fast on modern
computers, even when the dimensionality of a data set is high. Considering the small
amount of distortion usually associated with linear techniques, the trade-offs in terms of
speed and flexibility are heavily weighted in favor of their use. The replacement values
can be generated dynamically (on the fly) at run time and substituted as needed.
An advantage of this method is that the ratio method already described can be extended
to capture nonlinear relationships. The level of computational complexity increases
considerably, but not as much as with some other methods. The difficulty is that choosing
the degree of nonlinearity to use is fairly arbitrary. There are robust methods to determine
the amount of nonlinearity that can be captured at any chosen degree of nonlinearity
without requiring that the full nonlinear multiple regressions be built at every level. This
allows a form of optimization to be included in the nonlinearity estimation and capture.
However, there is still no guarantee that nonlinearities that are actually present will be
captured. The amount of data that has to be captured is quite considerable but relatively
modest compared with other methods, and remains quite tractable.
At run time, missing-value estimates can be produced very quickly using various
optimization techniques. The missing-value replacement rate is highly dependent on
many factors, including the dimensionality of the data set and the speed of the computer,
to name only two. However, in practical deployed production systems, replacement rates
exceeding 1000 replacements per second, even in large or high-throughput data sets,
can be easily achieved on modern PCs.
Nonlinear Submodels
increases.
An additional problem is that it is hard to determine the appropriate level of complexity.
Missing-value estimates are produced slowly at run time since, for every value, the
appropriate network has to be looked up, loaded, run, and output produced.
Autoassociative Neural Networks
Autoassociative neural networks are briefly described in Chapter 10. In this architecture,
all of the inputs are also used as predicted outputs. Using such an architecture, only a
single neural network need be built. When one or more missing values are detected, the network can
be used in a back-propagation mode, though not a training mode, since no internal weights are
adjusted. Instead, the errors are propagated all the way back to the inputs. At the input, an
appropriate weight can be derived for each missing value so that it least disturbs the
internal structure of the network. The values so derived for any set of inputs reflect, and
least disturb, the nonlinear relationship captured by the autoassociative neural network.
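The idea of propagating error back to the inputs while leaving the network untouched can be illustrated with a deliberately tiny toy. Everything here is an assumption made for illustration: the "network" is a frozen linear encoder and decoder with a single bottleneck unit rather than a real nonlinear autoassociative network, the weight vector W is hypothetical, and the missing input value itself is adjusted by gradient descent.

```python
# Toy stand-in for a trained autoassociative network: a frozen linear
# encoder/decoder with one bottleneck unit. W is hypothetical and is
# never updated below (no training takes place).
W = (0.6, 0.8)  # unit-length weight vector

def reconstruct(v):
    """Encode the two inputs to one hidden value, then decode back."""
    h = W[0] * v[0] + W[1] * v[1]
    return (W[0] * h, W[1] * h)

def fill_missing(x_known, y_start=0.0, rate=0.5, steps=200):
    """Adjust only the missing input y to reduce reconstruction error."""
    y = y_start
    for _ in range(steps):
        out = reconstruct((x_known, y))
        residual_y = y - out[1]
        # For this linear projection network, the gradient of the squared
        # reconstruction error with respect to y is 2 * residual_y.
        y -= rate * 2.0 * residual_y
    return y

estimate = fill_missing(0.3)
print(estimate)
```

The estimate settles on the value of y that is most consistent with the relationship the frozen network encodes; a real autoassociative network would capture a nonlinear relationship in the same way.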
Such methods are inherently nonlinear so long as representative near neighbors can be
found.
The main drawbacks are storage and speed. Keeping the training data set available, even in some
collapsed form, may require very significant storage. Lookup times for neighbors can be very slow, so
finding replacement values is slow too.
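A brute-force version of neighbor-based replacement, sketched below under stated assumptions, shows where both costs come from: the whole training set is kept in memory, and every lookup scans all of it. The function name, the use of squared Euclidean distance over the known values, averaging the k nearest neighbors, and the data are all illustrative choices, not details taken from the text.

```python
def nearest_neighbor_fill(record, training, k=2):
    """Fill None entries in record with the mean of the k nearest training rows.

    Distance is measured only on the values that are present in record.
    """
    known = [i for i, v in enumerate(record) if v is not None]
    missing = [i for i, v in enumerate(record) if v is None]

    def distance(row):
        return sum((row[i] - record[i]) ** 2 for i in known)

    # Full scan and sort of the stored training data on every lookup.
    neighbors = sorted(training, key=distance)[:k]

    filled = list(record)
    for i in missing:
        filled[i] = sum(row[i] for row in neighbors) / len(neighbors)
    return filled

# Hypothetical, complete training rows.
training = [(0.1, 1.0), (0.2, 2.0), (0.9, 9.0), (1.0, 10.0)]
filled = nearest_neighbor_fill((0.15, None), training, k=2)
print(filled)
```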
Chapter 9: Series Variables
Overview
Series variables have a number of characteristics that are sufficiently different from other
types of variables that they need examining in more detail. Series variables are always at
least two-dimensional, although one of the dimensions may be implicit. The most common
type of series variable is a time series, in which a series of values of some feature or
event are recorded over a period of time. The series may consist of only a list of
measurements, giving the appearance of a single dimension, but the ordering is by time,
which, for a time series, is the implicit variable.
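The implicit dimension can be made explicit in a few lines. This is a sketch that assumes equally spaced measurements; the function name, starting point, and step size are hypothetical.

```python
def with_explicit_index(values, start=0.0, step=1.0):
    """Pair each measurement with its index point, assuming equal spacing."""
    return [(start + i * step, v) for i, v in enumerate(values)]

# A bare list of measurements looks one-dimensional...
readings = [5.0, 5.5, 6.1]
# ...but its ordering carries the second dimension.
indexed = with_explicit_index(readings, start=0.0, step=0.5)
print(indexed)
```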
2. Find methods for manipulating the unique features of series data to expose the
information content to mining tools
Series data has features that require more involvement by the miner in the preparation
process than for nonseries data. Where miner involvement is required, fully automated
preparation tools cannot be used. The miner must be involved in the preparation,
exercising judgment and experience. Much of the preparation requires visualizing the data set
and manipulating the series features discussed. There are a number of excellent commercial
tools for series data visualization and manipulation, so the demonstration software does not
include support for these functions. Thus, instead of concluding with implementation notes
on how the features discussed are put into practice, this
chapter concludes with a suggested checklist of actions for preparing series data for the
miner to use.
9.1 Here There Be Dragons!
Mariners and explorers of old used fanciful and not always adequate maps. In unexplored or
unknown territory, the map warned of dragons—terrors of the unknown. So it is when
preparing data, for the miner knows at least some of the territory. Many data explorers have
measurements of the other variable (or variables) are made as time is “displaced,” or
changed. The displacement variable is also called the index variable. That is because the
points along the displacement variable at which the measurements are taken are called
the index points.
Dimensions other than time can serve as the displacement dimension. Distance, for
instance, can be used. For example, measuring the height of the American continent
above sea level at different points on a line extending from the Atlantic to the Pacific
produces a distance displacement series.
Since time series are the most common series, where this chapter makes assumptions, a
time series will be assumed. The issues and techniques described for time series also
apply to any other displacement series. Series, however indexed, share many features,
and techniques that apply to one type of series usually apply to other types of
series. Although the exact nature of the displacement variable may make little difference to
the preparation and even, to some degree, the analysis of the series itself, it makes all the
difference to the interpretation of the result!
9.3 Describing Series Data
9.3.1 Constructing a Series
A series is constructed by measuring and recording a feature of an object or event at
defined index points on a displacement dimension.
This statement sufficiently identifies a series for mining purposes. It is not a formal
definition but a conceptual description, which also includes the following assumptions:
1. The feature or event is recorded as numerical information.
2. The index point information is either recorded, or at least the displacements are
There is no reason at all why several features should not be captured at each index, the
same as in any nonseries multidimensional data set. However, just as each of the
variables can be considered separately from each other during much of the nonseries
data preparation process, so too can each series variable in a multidimensional series be
considered separately during preparation.
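One possible in-memory representation of such a multidimensional series is sketched below. The type name Observation, the feature names, and the values are hypothetical; the point is only that each feature can be pulled out as its own series against the shared index points.

```python
from typing import Dict, List, NamedTuple, Tuple

class Observation(NamedTuple):
    index_point: float          # position on the displacement dimension
    features: Dict[str, float]  # several features captured at this index

# Hypothetical series with two features measured at each index point.
series = [
    Observation(0.0, {"temp": 20.1, "pressure": 1012.0}),
    Observation(1.0, {"temp": 20.4, "pressure": 1011.5}),
    Observation(2.0, {"temp": 21.0, "pressure": 1010.8}),
]

def variable_series(series: List[Observation], name: str) -> List[Tuple[float, float]]:
    """Extract one feature as its own (index point, value) series."""
    return [(obs.index_point, obs.features[name]) for obs in series]

temp_series = variable_series(series, "temp")
print(temp_series)
```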
9.3.2 Features of a Series
By its nature a series has some implicit pattern within the ordering. That pattern may
repeat itself over a period. Often, time series are thought of by default as repetitive, or
cyclic, but there is no reason that any repeating pattern should in fact exist. There is, for
example, a continuing debate about whether the stock market exhibits a repetitive pattern
or is simply the result of a random walk (touched on later). Enormous effort has been put
into detecting any cyclic pattern that may exist, and still the debate continues. There is,
nonetheless, a pattern in series data, albeit not necessarily a repeating one. One of the
objectives of analyzing series data is to describe that pattern, identify it as recognizable if
possible, and find any parts that are repetitive. Preparing series data for modeling, then,
must preserve the nature of the pattern that exists. Preparation also includes putting the
data into a form in which the desired information is best exposed to a modeling tool. Once
again, a warning: this is not always easy!