system that can affect the outcome. The more of them there are, the more likely it is
that, purely by happenstance, some particular but actually meaningless pattern will show
up. The number of variables in a data set and the number of weights in a neural network
both represent things that can change. So, yet again, high-dimensionality problems turn up,
this time expressed as degrees of freedom. Fortunately for the purposes of data
preparation, a definition of degrees of freedom is not needed as, in any case, this is a
problem previously encountered in many guises. Much discussion, particularly in this
chapter, has been about taming the dimensionality/combinatorial explosion problem
(which is degrees of freedom in disguise) by reducing dimensionality. Nonetheless, a data
set always has some dimensionality, for if it does not, there is no data set! And having
some particular dimensionality, or number of degrees of freedom, implies some particular
chance that spurious patterns will turn up. It also has implications about how much data is
needed to ensure that any spurious patterns are swamped by valid, real-world patterns.
The difficulty is that the calculations are not exact because several needed measures,
such as the number of significant system states, while definable in theory, seem
impossible to pin down in practice. Also, each modeling tool introduces its own degrees of
freedom (weights in a neural network, for example), which may be unknown to the
miner.
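The link between dimensionality, sample size, and spurious patterns can be illustrated with a small simulation. What follows is a minimal sketch, not part of the demonstration code: it assumes NumPy is available, and the function name and parameters are purely illustrative. Every variable, and the target too, is random noise, so any correlation that turns up is spurious by construction.

```python
import numpy as np

def spurious_correlations(n_instances, n_variables, seed=0):
    """Absolute correlation of each pure-noise variable with a pure-noise target."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_instances, n_variables))
    y = rng.standard_normal(n_instances)
    # Standardize so that the mean product of z-scores is the Pearson correlation.
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    return np.abs(Xz.T @ yz) / n_instances

# More variables: the strongest spurious correlation can only grow.
corr = spurious_correlations(n_instances=100, n_variables=500)
for k in (5, 50, 500):
    print(f"{k:>4} variables,    100 instances: max |r| = {corr[:k].max():.3f}")

# More instances: the spurious correlations shrink toward zero.
for n in (100, 1_000, 10_000):
    r = spurious_correlations(n_instances=n, n_variables=50).max()
    print(f"  50 variables, {n:>6} instances: max |r| = {r:.3f}")
```

The first loop shows the strongest meaningless pattern growing as variables are added; the second shows additional instances swamping it.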
The ideal, if the miner has access to software that can make the measurements (such as
data surveying software), is to use a multivariable sample determined to be
representative to a suitable degree of confidence. Failing that, a rule of thumb for the
minimum amount of data to accept for mining (as opposed to data preparation) is at least
twice the number of instances required for a data preparation representative sample. The
key is to have enough representative instances of data to swamp the spurious patterns.
Each significant system state needs sufficient representation, and having a truly
representative sample of data is the best way to assure that.
Suppose a marketing department needs to improve a direct-mail marketing campaign.
The normal response rate for the random mailings so far is 1.5%. Mailing rolls out, results
trickle in. A (neophyte) data miner is asked to improve response. “Aha!” says the miner, “I
have just the thing. I’ll whip up a quick response model, infer who’s responding, and
redirect the mail to similar likely responders. All I need is a genuinely representative
sample, and I’ll be all set!” With this terrific idea, the miner applies the modeling tools, and
after furiously mining, the best prediction is that no one at all will respond! Panic sets in;
staring failure in the face, the neophyte miner begins the balding process by tearing out
hair in chunks while wondering what to do next.
Fleeing the direct marketers with a modicum of hair, the miner tries an industrial chemical
manufacturer. Some problem in the process occasionally curdles a production batch. The
exact nature of the process failure is not well understood, but the COO just read a
business magazine article extolling the miraculous virtues of data mining. Impressed by
the freshly minted data miner (who has a beautiful certificate attesting to skill in mining),
the COO decides that this is a solution to the problem. Copious quantities of data are
available, and plenty more if needed. The process is well instrumented, and continuous
chemical batches are being processed daily. Oodles of data representative of the process
are on hand. Wielding mining tools furiously, the miner conducts an onslaught designed to
wring every last confession of failure from the recalcitrant data. Using every art and
artifice, the miner furiously pursues the problem until, staring again at failure and with
desperation setting in, the miner is forced to fly from the scene, yet more tufts of hair
flying.
simply increasing the proportion of responders in the sample may not help. It’s assumed
that there are some other features in the sample that actually do vary as response varies.
It’s just that they’re swamped by spurious patterns, but only because of their low density
in the sample. Enhancing the density of responders is intended to enhance the variability
of connected features. The hope is that when enhanced, these other features become
visible to the predictive mining tool and, thus, are useful in predicting likely responders.
These assumptions are to some extent true. Some performance improvement may be
obtained this way, usually more by happenstance than design, however. The problem is
that low-density features have more than just low-level interactions with other, potentially
predictive features. The instances with the low-density feature represent some small
proportion of the whole sample and form a subsample—the subsample containing only
those instances that have the required feature. Considered alone, because it is so small,
the subsample almost certainly does not represent the sample as a whole—let alone the
population. There is, therefore, a very high probability that the subsample contains much
noise and bias that are in fact totally unrelated to the feature itself, but are simply
concomitant to it in the sample taken for modeling.
Simply increasing the desired feature density also increases the noise and bias patterns
that the subsample carries with it—and those noise and bias patterns will then appear to
be predictive of the desired feature. Worse, the enhanced noise and bias patterns may
swamp any genuinely predictive feature that is present.
population—except for the desired feature. To do this, divide the source data set into two
subsets such that one subset has only instances that contain the feature of interest and
the other subset has no instances that contain the feature of interest. Use the already
described techniques (Chapter 5) to extract a representative sample from each subset,
ignoring the effect of the key feature. This results in two separate subsets, both similar to
each other and representative of the population as a whole when ignoring the effect of the
key feature. They are effectively identical except that one has the key feature and the
other does not.
Any difference in distribution between the two subsets is due to noise, bias, or the
effect of the key feature. Whatever differences there are should be investigated and
validated whatever else is done, but this procedure minimizes noise and bias since both
data sets are representative of the population, save for the effect of the key feature.
Adding the two subsets together gives a composite data set that has an enhanced
presence of the desired feature, yet is as free from other bias and noise as possible.
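The split, sample, and recombine procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the demonstration code: simple random sampling stands in for the representative-sample extraction of Chapter 5, and the function name, field names, and data are hypothetical.

```python
import random

def enhance_feature(instances, has_feature, n_with, n_without, seed=0):
    """Build a composite sample with an enhanced density of the key feature.

    Assumption: plain random sampling is used as a stand-in for a properly
    validated representative sample of each subset.
    """
    rng = random.Random(seed)
    # Divide the source data into instances with and without the key feature.
    with_feature = [row for row in instances if has_feature(row)]
    without_feature = [row for row in instances if not has_feature(row)]
    # Sample each subset separately, ignoring the key feature itself.
    sample_with = rng.sample(with_feature, min(n_with, len(with_feature)))
    sample_without = rng.sample(without_feature, min(n_without, len(without_feature)))
    # Add the two samples together and shuffle them into one composite set.
    composite = sample_with + sample_without
    rng.shuffle(composite)
    return composite

# Hypothetical mailing data: 150 responders in 10,000 mailings (1.5%).
data_rng = random.Random(42)
mailings = [
    {"age": data_rng.randint(18, 80), "responded": i < 150}
    for i in range(10_000)
]
composite = enhance_feature(
    mailings, lambda row: row["responded"], n_with=100, n_without=400
)
density = sum(row["responded"] for row in composite) / len(composite)
print(f"responder density in composite: {density:.1%}")  # prints 20.0%
```

Because each subset is sampled on its own, the responder density rises from 1.5% to 20% without preferentially duplicating whatever noise the small responder subset happens to carry.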
Feature Enhancement with Limited Data
Feature enhancement is more difficult when there is only limited data available. This is the
case in the second example of the chemical processor. The production staff bends every
a second data subset. (See Chapter 9 for a discussion of noise and colored noise.) The
interesting thing about the second subset is that its variables all have the same mean
values, distributions and so on, as the original data set—yet no two instance values,
except by some small chance, are identical. Of course, the noise-added data set can be
made as large as the miner needs. If duplicates do exist, they should be removed.
When added to the original data set, these now appear as more instances with the
feature, increasing the apparent count and increasing the feature density in the overall
data set. The added density means that mining tools will generalize their predictions from
the multiplied data set. A problem is that any noise or bias present will be multiplied too.
Can this be reduced? Maybe.
A technique called color matching helps. Adding white noise multiplies everything exactly
as it is, warts and all. Instead of white noise, specially constructed colored noise can be
added. The multidimensional distribution of a data sample representative of the
population determines the precise color. Color matching adds noise that matches the
multivariable distribution found in the representative sample (i.e., it is the same color, or
has the same spectrum). Any noise or bias present in the original key feature subsample
is still present, but color matching attempts to avoid duplicating the effect of the original
bias, even diluting it somewhat in the multiplication.
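The difference between multiplying with white noise and with colored noise can be sketched in a few lines. This is a rough illustration only, assuming NumPy and at least two variables. In particular, it approximates the "color" of the representative sample by its covariance matrix alone, which is a simplifying assumption and not the color-matching procedure itself. The function name, the scale parameter, and the example data are all hypothetical.

```python
import numpy as np

def multiply_with_noise(subsample, n_copies, scale=0.1, reference=None, seed=0):
    """Create noisy copies of `subsample` (rows = instances, columns = variables).

    reference=None  -> white noise: independent for each variable, sized by
                       the subsample's own standard deviations.
    reference=array -> colored noise: correlated across variables, shaped by
                       the covariance of a representative reference sample
                       (assumption: covariance stands in for the full
                       multivariable distribution).
    """
    rng = np.random.default_rng(seed)
    subsample = np.asarray(subsample, dtype=float)
    n_vars = subsample.shape[1]
    base = np.tile(subsample, (n_copies, 1))
    if reference is None:
        noise = rng.standard_normal(base.shape) * subsample.std(axis=0) * scale
    else:
        cov = np.cov(np.asarray(reference, dtype=float), rowvar=False)
        noise = rng.multivariate_normal(
            np.zeros(n_vars), cov * scale**2, size=base.shape[0]
        )
    return base + noise

# Hypothetical data: two strongly related variables, and a 20-instance
# key-feature subsample drawn from them.
rng = np.random.default_rng(1)
reference = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
subsample = reference[:20]

white = multiply_with_noise(subsample, n_copies=50)
colored = multiply_with_noise(subsample, n_copies=50, reference=reference)
print(white.shape, colored.shape)  # both (1000, 2)
```

In the white-noise case each variable is jittered independently around the subsample's own values, so whatever bias the subsample carries is copied as is. In the colored case the jitter follows the relationships between variables found in the reference sample. Any duplicate instances that result should still be removed before the multiplied data is added to the original data set.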
evaluation sets. With the best of intentions, the mining data has been distorted and, to at
least that extent, no longer accurately represents the population. The only place that the
inferences or predictions can be examined to ensure that they do not carry an unacceptable
distortion through into the real world is to test them against data that is as undistorted (that
is, as representative of the real world) as possible.
10.8 Implementation Notes
Of the four topics covered in this chapter, the demonstration code implements algorithms
for the problems that can be automatically adjusted without high risk of unintended data
set damage. Some of the problems discussed are only very rarely encountered or could
cause more damage than benefit to the data if applied without care. Where no preparation
code is available, this section includes pointers to procedures the miner can follow to
perform the particular preparation activity.
10.8.1 Collapsing Extremely Sparsely Populated Variables
The demonstration code has no explicit support for collapsing extremely sparsely
populated variables. It is usual to ignore such variables, and only in special circumstances
3. Check the number of discrete values for each candidate sparse variable.
4. Look in the complete-content file, which lists all of the values for all of the variables.
5. Extract the lists for the sparse variables.
6. Access the sample data set with your tool of choice and search for, and list, those
cases where the sparse variables simultaneously have values. (This won’t happen
often, even in sparse data sets.)
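The checks in the steps above can be sketched with a general-purpose tool. The following is a minimal illustration in plain Python, not part of the demonstration code. It assumes the sample data has already been loaded as a list of dictionaries, and that an empty string or None marks a missing value. The function names, variable names, and data are hypothetical.

```python
def discrete_value_counts(rows, sparse_vars, missing=(None, "")):
    """Number of distinct non-missing values for each candidate sparse variable."""
    return {
        var: len({row.get(var) for row in rows if row.get(var) not in missing})
        for var in sparse_vars
    }

def simultaneous_cases(rows, sparse_vars, missing=(None, "")):
    """List (row index, values) where two or more sparse variables have values."""
    hits = []
    for index, row in enumerate(rows):
        present = {
            var: row[var] for var in sparse_vars if row.get(var) not in missing
        }
        if len(present) >= 2:
            hits.append((index, present))
    return hits

# Hypothetical sample with three sparsely populated variables.
customers = [
    {"id": 1, "fax_ext": "", "pager": None, "telex": ""},
    {"id": 2, "fax_ext": "12", "pager": None, "telex": ""},
    {"id": 3, "fax_ext": "17", "pager": "555-0100", "telex": ""},
    {"id": 4, "fax_ext": "", "pager": None, "telex": "TX9"},
]
sparse = ["fax_ext", "pager", "telex"]

print(discrete_value_counts(customers, sparse))
# {'fax_ext': 2, 'pager': 1, 'telex': 1}
print(simultaneous_cases(customers, sparse))
# [(2, {'fax_ext': '17', 'pager': '555-0100'})]
```

Only the third instance (index 2) has values in more than one sparse variable at once, which is the kind of case the final step asks the miner to list.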
Neural networks comprise a vast topic on their own. The brief introduction in this chapter
only scratched the surface. In keeping with all of the other demonstration code segments,
the neural network design is intended mainly for humans to read and understand.
Obviously, it also has to be read (and executed) by computer systems, but the primary
focus is that the internal working of the code be as clearly readable as possible. Of all the
demonstration code, this requirement for clarity most affects the network code. The
network is not optimized for speed, performance, or efficiency. The sparsity mechanism is
modified random assignment without any dynamic interconnection. Compression factor
(hidden-node count) is discovered by random search.
The included code demonstrates the key principles involved and compresses information.
Code for a fully optimized autoassociative neural network, including dynamic connection
search with modified cascade hidden-layer optimization, is an impenetrable beast! The
full version, from which the demonstration is drawn, also includes many other obfuscating
(as far as clarity of reading goes) “bells and whistles.” For instance, it includes
modifications to allow maximum compression of information into the hidden layer, rather
than spreading it between hidden and output layers, as well as modifications to remove
linear relationships and represent those separately. While improving performance and
compression, such features completely obscure the underlying principles.
10.8.3 Measuring Variable Importance
function.
10.9 Where Next?
A pause at this point. Data preparation, the focus of this book, is now complete. By
applying all of the insights and techniques so far covered, raw data in almost any form is
turned into clean prepared data ready for modeling. Many of the techniques are illustrated
with computer code on the accompanying CD-ROM, and so far as data preparation for
data mining is concerned, the journey ends here.
However, the data is still unmined. The ultimate purpose of preparing data is to gain
understanding of what the data “means” or predicts. The prepared data set still has to be
used. How is this data used? The last two chapters look not at preparing data, but at
surveying and using prepared data.
Chapter 11: The Data Survey
These three families all have very different interests and desires for their perfect vacation.
Can they all be satisfied? Of course. The locations that each family would like to find and
enjoy exist in many places; their only problem is to find them and narrow down the
possibilities to a final choice. The obvious starting point is with a map. Any map of the
whole country indicates broad features—mountains, forests, deserts, lakes, cities, and
probably roads. The Abbotts will find, perhaps, the Finger Lakes in upstate New York a
place to focus their attention. The Bennigans may look at the deserts of the Southwest,
while the Calloways look to Florida. Given their different interests, each family starts by
narrowing down the area of search for their ideal vacation to those general areas of the
country that seem likely to meet their needs and interests.
Once they have selected a general area, a more detailed map of the particular territory
lets each family focus in more closely. Eventually, each family will decide on the best
choice they can find and leave for their various vacations. Each family explores its own
vacation site in detail. While the explorations do not seem to produce maps, they reveal
small details—the very details that the vacations are aimed at. The Abbotts find particular
lake coves, see particular trees, and watch specific birds and deer. The Bennigans find
individual artifacts in specific places. The Calloways enjoy particular cabaret performers
and see specific exhibits at particular museums. It is these detailed explorations that each
family feels to be the whole purpose for their vacations.
Each family started with a general search to find places likely to be of interest. Their initial
This chapter deals entirely with the data survey, a topic at least as large as data
preparation. The introduction to the use, purposes, and methods of data surveying in this
chapter discusses how prepared data is used during the survey. Most, if not all, of the
surveying techniques can be automated. Indeed, the full suite of programs from which the
data preparation demonstration code is drawn is a full data preparation and survey tool
set. This chapter touches only on the main topics of data surveying. It is an introduction to
the territory itself. The introduction starts with understanding the concept of “information.”
This book mentions “information” in several places. “Information is embedded in a data
set.” “The purpose of data preparation is to best expose information to a mining tool.”
“Information is contained in variability.” Information, information, information. Clearly,
“information” is a key feature of data preparation. In fact, information—its discovery,
exposure, and understanding—is what the whole preparation-survey-mining endeavor is
about. A data set may represent information in a form that is not easily, or even at all,
understandable by humans. When the data set is large, understanding significant and
salient points becomes even more difficult. Data mining is devised as a tool to transform
the impenetrable information embedded in a data set into understandable relationships or
predictions.
However, it is important to keep in mind that mining is not designed to extract information.
Data, or the data set, enfolds information. This information describes many and various
relationships that exist enfolded in the data. When mining, the information is being mined
Everything begins with information. The data set embeds it. The data survey surveys it.
Data mining translates it. But what exactly is information? The Oxford English Dictionary
begins its definition with “The act of informing, . . .” and continues in the same definition a
little later, “Communication of instructive knowledge.” The act referred to is clearly one
where this thing, “information,” is passed from one person to another. The latter part of the
definition explicates this by saying it is “communication.” It is in this sense of
communicating intelligence—transferring insight and understanding—that the term
“information” is used in data mining. Data possesses information only in its latent form.
Mining provides the mechanism by which any insight potentially present is explicated.
Since information is so important to this discussion, it is necessary to try to clarify, and if
possible quantify, the concept.
Because information enables the transferring of insight and understanding, there is a
sense in which quantity of information relates to the amount of insight and understanding
generated; that is, more information produces greater insight. But what is it that creates
greater insight?
A good mystery novel—say, a detective story—sets up a situation. The situation
described includes all of the necessary pieces to solve the mystery, but in a nonobvious
way. Insight comes when, at the end of the story, some key information throws all of the
the information in different contexts (more evidence) before you would accept this as
valid. (Speaking personally, it would take an enormous readjustment of my world view to
accept any rational explanation that includes several trillion tons of curdled milk products
hanging in the sky a quarter of a million miles distant!)
These two very fundamental points about information—how surprising the communication
is, and how much existing knowledge requires revision—both indicate something about
how much information is communicated. But these seem very subjective measures, and
indeed they are, which is partly why defining information is so difficult to come to grips
with.
Claude E. Shannon did come to grips with the problem in 1948. In what has turned out to
be one of the seminal scientific papers of the twentieth century, “A Mathematical Theory
of Communication,” he grappled directly with the problem. This was published the next
year as a book and established a whole field of endeavor, now called “information theory.”
Shannon himself referred to it as “communication theory,” but its effects and applicability
have reached out into a vast number of areas, far beyond communications. In at least one
sense it is only about communications, because unless information is communicated, it