Maclennan c01.tex V2 - 10/04/2008 1:59am Page 3
Introduction to Data Mining in SQL Server 2008 3
Figure 1-1 Student table
In contrast, the data mining approach for this problem is almost the reverse
of the query-and-explore method. Instead of guessing a hypothesis and trying
it out in different ways, you ask the question in terms of the data that can
support many hypotheses, and allow your data mining system to explore them
for you.
In this case, you indicate that the columns
IQ, Gender, ParentIncome,
and
ParentEncouragement are to be used as hypotheses in determining
CollegePlans. As the data mining system passes over the data, it analyzes the
influence of each input column on the target column.
Figure 1-2 shows the hypothetical result of a decision tree algorithm operat-
ing on this data set. In this case, each path from the root node to the leaf node
forms a rule about the data. Looking at this tree, you see that students with IQs
greater than 100 and who are encouraged by their parents are highly likely to
attend college. In this case, you have extracted knowledge from the data.
As shown here, data mining applies algorithms such as decision trees,
clustering, association, time series, and so on to a data set, and then analyzes
its contents. This analysis produces patterns, which can be explored for
valuable information. Depending on the underlying algorithm, these patterns
can be in the form of trees, rules, clusters, or simply a set of mathematical
formulas. The information found in the patterns can be used for reporting (to
Maclennan c01.tex V2 - 10/04/2008 1:59am Page 4
4 Chapter 1 ■ Introduction to Data Mining in SQL Server 2008
guide marketing strategies, for instance) and for prediction. For example, if
you could collect data about undecided students, you could select those who
are likely to be interested in continued education and preemptively market to
that audience.
are provided appropriate and timely recommendations are likely to be
more valuable (because they purchase more) and more loyal (because
they feel a stronger relationship to the vendor). For example, if you go to
online stores such as Amazon.com or Barnesandnoble.com to purchase
an item, you are provided with recommendations about other items
you may be interested in. These recommendations are derived from
using data mining to analyze purchase behavior of all of the retailer’s
customers, and applying the derived rules to your personal information.
Maclennan c01.tex V2 - 10/04/2008 1:59am Page 5
Business Problems for Data Mining 5
Anomaly detection — How do you know whether your data is ‘‘good’’
or not? Data mining can analyze your data and pick out those items that
don’t fit with the rest. Credit card companies use data mining–driven
anomaly detection to determine if a particular transaction is valid. If
the data mining system flags the transaction as anomalous, you get a
call to see if it was really you who used your card. Insurance compa-
nies also use anomaly detection to determine if claims are fraudulent.
Because these companies process thousands of claims a day, it is impos-
sible to investigate each case, and data mining can identify which claims
are likely to be false. Anomaly detection can even be used to validate
data entry — checking to see if the data entered is correct at the point
of entry.
Churn analysis — Which customers are most likely to switch to a com-
petitor? The telecom, banking, and insurance industries face severe com-
petition. On average, obtaining a single new mobile phone subscriber
costs more than $200. Every business would like to retain as many cus-
tomers as possible. Churn analysis can help marketing managers identify
the customers who are likely to leave and why, and as a result, they can
improve customer relations and retain customers.
Risk management — Should a loan be approved for a particular cus-
classification.
Classification is the act of assigning a category to each case. Each case
contains a set of attributes, one of which is the class attribute. The task requires
finding a model that describes the class attribute as a function of input
attributes. In the College Plans data set shown in Figure 1-1, the class is the
CollegePlans attribute with two states: Yes and No. A classification model will
use the other attributes of a case (the input attributes) to determine patterns
about the class (the output attribute). Data mining algorithms that require a
target to learn against are considered supervised algorithms.
Typical classification algorithms include decision trees, neural network, and
Na
¨
ıve Bayes.
Clustering
Clustering is also called segmentation.Itisusedtoidentifynaturalgroupingsof
cases based on a set of attributes. Cases within the same group have more or
less similar attribute values.
Figure 1-3 shows a very simple customer data set containing two attributes:
Age and Income. The clustering algorithm groups the data set into three seg-
ments based on these two attributes. Cluster 1 contains a younger population
with low income. Cluster 2 contains middle age customers with higher income.
Cluster 3 is a group of older individuals with a relatively low income.
Clustering is an unsupervised data mining task. There is no single attribute
used to guide the training process, so all input attributes are treated equally.
Most clustering algorithms build the model through a number of iterations,
and stop when the model converges (that is, the boundaries of these segments
are stabilized).
Maclennan c01.tex V2 - 10/04/2008 1:59am Page 7
Data Mining Tasks 7
Income
linear regression and logistic regression. Other techniques supported by SQL
Server Data Mining are regression trees (part of the Microsoft Decision Trees
algorithm) and neural networks.
Regression is used to solve many business problems — for example, to
predict a couponredemption rate based onthe face value, distribution method,
distribution volume, and season, or to predict wind velocities based on
temperature, air pressure, and humidity.
Forecasting
Forecasting is yet another important data mining task. What will the stock
value of Microsoft Corporation (NASDAQ symbol MSFT) be tomorrow? What
will the sales amount of wine be next month? Forecasting can help answer
these questions. As input, it takes sequences of numbers indicating a series
of values through time, and then it imputes future values of those series
using a variety of machine-learning and statistical techniques that deal with
seasonality, trending, and noisiness of data.
Figure 1-5 shows two curves. The solid line curve is the actual time-series
data on Microsoft stock value, and the dotted curve is a time-series model that
predicts values based on past values.
38
36
34
32
30
28
26
24
22
20
MSFT 3-year price history
Figure 1-5 Time series
Science
Figure 1-6 Web navigation sequence
Deviation Analysis
Deviation analysis is used to find rare cases that behave very differently from
the norm. Deviation analysis is widely applicable, the most common usage
being credit card fraud detection. Identifying abnormal cases among millions
of transactions is a very challenging task. Other applications include network
intrusion detection, manufacture error analysis, and so on.
There is no standard technique for deviation analysis. Usually, analysts
apply decision trees, clustering, or neural network algorithms for this task.
Data Mining Project Cycle
From the initial business problem formation through to deployment and
sustained management, most data mining projects pass through the same
phases.
Maclennan c01.tex V2 - 10/04/2008 1:59am Page 10
10 Chapter 1 ■ Introduction to Data Mining in SQL Server 2008
Business Problem Formation
What are the problems you are trying to solve? What techniques are you going
to apply to solve the problem? How do you know if you will be successful?
These are important questions to ask before embarking on any project.
You may find that a simple OLAP, reporting, or data integration solution
may be sufficient. A predictive or data mining solution involves determining
the unknown, relying on a belief that making sense of that unknown will add
value. This is a shaky precipice from which to begin any business endeavor.
Luckily, successful data mining solutions have been shown to have an average
of 150-percent return on investment (ROI), so that makes justification easier.
Data Collection
Business data is stored in many systems across an enterprise. For example,
at Microsoft, there are hundreds of online transaction processing (OLTP)
databases and more than 70 data warehouses. The first step is to pull the
ful. You can group these values to reduce the model complexity. For
example, the column
Profession may have many different types of engi-
neers, such as Software Engineer, Telecom Engineer, Mechanical Engi-
neer, and so on. You can group all of these professions to the single value
Engineer.
Aggregation — Aggregation is an important transformation to derive
additional value from your data. Suppose you want to group customers
based on their phone usage. If the call detail record information is too
detailed for the model, you must aggregate all the calls into a few
derived attributes such as total number of calls and the average call
duration. These derived attributes can later be used in the model.
Missing value handling — Most data sets contain missing values. This
can be caused by many different things. For example, you may have two
customer tables coming from two OLTP databases that, when merged,
have missing values because the tables are not aligned. Another example
occurs when customers don’t supply data values such as age. Another is
when you have stock market values with blanks because the markets are
closed on weekends and holidays.
Addressing missing values is important, because it is reflected in the
business value of your solution. You may need to retain the missing
data (for example, customers who refuse to report their age may have
other interesting things in common). You may need to discard the entire
record (having too many unknowns could pollute your model). Or, you
may simply be able to replace missing values with some other value
(such as the previous value for time-series data such as stock market val-
ues, or the most popular value). For more advanced cases, you can use
data mining to predict the most likely value for each missing case.
Removing outliers — Outliers are abnormal data and can be real or (as
is often the case) errors. Abnormal data has an effect on the quality of
business value. It is very important to work with business analysts who have
the proper domain knowledge to validate the discoveries.
Sometimes, the model doesn’t contain useful patterns. This is generally
becausethesetofvariablesinthemodelarenottherightonestosolveyour
business problem. You may need to repeat the data cleaning and transforma-
tion steps, or even redefine your problem in order to derive more meaningful
variables. Data mining is an exploratory process, and it often takes a few
iterations before you find the right model.
Reporting and Prediction
In many organizations, the goal of data miners is todeliver reports to marketing
executives. SQL Server Data Mining is integrated with SQL Server Reporting
Services to generate reports directly from data mining results. Reports may
contain predictions (such as lists of customers with the highest value potential)
or the rules found in the data mining analysis.
To provide predictions, you apply the selected model against new cases of
data. Consider a banking scenario where you build a model about loan risk
prediction. Every day there are thousands of new loan applications. You can