470643 c10.qxd 3/8/04 11:16 AM Page 346
346 Chapter 10
Second, link analysis can apply the concepts generated by visualization to
larger sets of customers. For instance, a churn reduction program might avoid
targeting customers who have high inertia or be sure to target customers with
high influence. This requires traversing the call graph to calculate the inertia or
influence for all customers. Such derived characteristics can play an important
role in marketing efforts.
Different marketing programs might suggest looking for other features in
the call graph. For instance, perhaps the ability to place a conference call
would be desirable, but who would be the best prospects? One idea would be
to look for groups of customers that all call each other. Stated as a graph prob-
lem, this group is a fully connected subgraph. In the telephone industry, these
subgraphs are called “communities of interest.” A community of interest may
represent a group of customers who would be interested in the ability to place
conference calls.
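
The graph formulation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer names and call pairs are made up, calls are treated as undirected links, and the helper names `is_community_of_interest` and `communities_of_size` are hypothetical. The brute-force search checks every candidate group, so it is practical only for small graphs.

```python
from itertools import combinations

def is_community_of_interest(group, calls):
    """True if every pair of customers in group is linked by a call."""
    return all(frozenset(pair) in calls for pair in combinations(group, 2))

def communities_of_size(customers, calls, size):
    """All fully connected groups of the given size (brute force)."""
    return [set(g) for g in combinations(sorted(customers), size)
            if is_community_of_interest(g, calls)]

# Hypothetical call records: each pair of customers who have called each other.
call_pairs = [("ann", "bob"), ("bob", "cat"), ("ann", "cat"), ("cat", "dan")]
calls = {frozenset(p) for p in call_pairs}
customers = {c for p in call_pairs for c in p}

triples = communities_of_size(customers, calls, 3)
```

In this made-up data only ann, bob, and cat all call one another, so they form the single three-person community.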
Lessons Learned
Link analysis is an application of the mathematical field of graph theory. As a
data mining technique, link analysis has several strengths:
■■ It capitalizes on relationships.
■■ It is useful for visualization.
■■ It creates derived characteristics that can be used for further mining.
Some data and data mining problems naturally involve links. As the two
case studies about telephone data show, link analysis is very useful for
telecommunications, since a telephone call is a link between two people.
Opportunities for link analysis are easiest to see in fields where the links
are explicit, such as telephony, transportation, and the World Wide Web. Link
analysis is also appropriate in other areas where the connections do not have
such a clear manifestation, such as physician referral patterns, retail sales
data, and forensic analysis for crimes.
Links are a very natural way to visualize some types of data. Direct visual-
out. As with radio reception, too many competing signals add up to noise.
Clustering provides a way to learn about the structure of complex data, to
break up the cacophony of competing signals into its components.
When human beings try to make sense of complex questions, our natural
tendency is to break the subject into smaller pieces, each of which can be
explained more simply. If someone were asked to describe the color of trees in
the forest, the answer would probably make distinctions between deciduous
trees and evergreens, and between winter, spring, summer, and fall. People
know enough about woodland flora to predict that, of all the hundreds of vari-
ables associated with the forest, season and foliage type, rather than say age
and height, are the best factors to use for forming clusters of trees that follow
similar coloration rules.
Once the proper clusters have been defined, it is often possible to find simple
patterns within each cluster. “In Winter, deciduous trees have no leaves so the
trees tend to be brown” or “The leaves of deciduous trees change color in the
autumn, typically to oranges, reds, and yellows.” In many cases, a very noisy
dataset is actually composed of a number of better-behaved clusters. The ques-
tion is: how can these be found? That is where techniques for automatic cluster
detection come in—to help see the forest without getting lost in the trees.
This chapter begins with two examples of the usefulness of clustering—one
drawn from astronomy, another from clothing design. It then introduces the
K-Means clustering algorithm which, like the nearest neighbor techniques dis-
cussed in Chapter 8, depends on a geometric interpretation of data. The geo-
metric ideas used in K-Means bring up the more general topic of measures of
similarity, association, and distance. These distance measures are quite sensi-
tive to variations in how data is represented, so the next topic addressed is
data preparation for clustering, with special attention being paid to scaling
the clusters mean. When clustering is successful, the results can be dramatic:
One famous early application of cluster detection led to our current under-
standing of stellar evolution.
Star Light, Star Bright
Early in the twentieth century, astronomers trying to understand the
relationship between the luminosity (brightness) of stars and their
temperatures made scatter plots like the one in Figure 11.1. The vertical
scale measures luminosity in multiples of the brightness of our own sun. The
horizontal scale measures surface temperature in degrees Kelvin (degrees
centigrade above absolute 0, the theoretical coldest possible temperature).
[Figure 11.1 plot: vertical axis, Luminosity (Sun = 1), logarithmic from 10^-4 to 10^6; horizontal axis, Temperature (Degrees Kelvin), running from 40,000 down to 2,500; labeled regions: Red Giants, Main Sequence, White Dwarfs]
Figure 11.1 The Hertzsprung-Russell diagram clusters stars by temperature and luminosity.
cult to visualize clusters. Our intuition about how close things are to each
other also quickly breaks down with more dimensions.
Saying that a problem has many dimensions is an invitation to analyze it
geometrically. A dimension is each of the things that must be measured inde-
pendently in order to describe something. In other words, if there are N vari-
ables, imagine a space in which the value of each variable represents a distance
along the corresponding axis in an N-dimensional space. A single record con-
taining a value for each of the N variables can be thought of as the vector that
defines a particular point in that space. When there are two dimensions, this is
easily plotted. The HR diagram was one such example. Figure 11.2 is another
example that plots the height and weight of a group of teenagers as points on
a graph. Notice the clustering of boys and girls.
The chart in Figure 11.2 begins to give a rough idea of people’s shapes. But
if the goal is to fit them for clothes, a few more measurements are needed!
In the 1990s, the U.S. Army commissioned a study on how to redesign the
uniforms of female soldiers. The Army’s goal was to reduce the number of
different uniform sizes that have to be kept in inventory, while still
providing each soldier with well-fitting uniforms.
As anyone who has ever shopped for women’s clothing is aware, there is
already a surfeit of classification systems (even sizes, odd sizes, plus sizes,
junior, petite, and so on) for categorizing garments by size. None of these
systems was designed with the needs of the U.S. military in mind. Susan
Ashdown and Beatrix Paal, researchers at Cornell University, went back to the
basics; they designed a new set of sizes based on the actual shapes of women
in the army.
corresponding to two-element vectors $(x_1, x_2)$, the points correspond to n-element
vectors $(x_1, x_2, \ldots, x_n)$. The procedure itself is unchanged.
Three Steps of the K-Means Algorithm
In the first step, the algorithm randomly selects K data points to be the seeds.
MacQueen’s algorithm simply takes the first K records. In cases where the
records have some meaningful order, it may be desirable to choose widely
spaced records, or a random selection of records. Each of the seeds is an
embryonic cluster with only one element. This example sets the number of
clusters to 3.
The second step assigns each record to the closest seed. One way to do this
is by finding the boundaries between the clusters, as shown geometrically
in Figure 11.3. The boundaries between two clusters are the points that are
equally close to each cluster. Recalling a lesson from high-school geometry
makes this less difficult than it sounds: given any two points, A and B, all
points that are equidistant from A and B fall along a line (called the perpen-
dicular bisector) that is perpendicular to the one connecting A and B and
halfway between them. In Figure 11.3, dashed lines connect the initial seeds;
the resulting cluster boundaries shown with solid lines are at right angles to
Figure 11.4 The centroids are calculated from the points that are assigned to each cluster.
The centroids become the seeds for the next iteration of the algorithm. Step 2
is repeated, and each point is once again assigned to the cluster with the closest
centroid. Figure 11.5 shows the new cluster boundaries—formed, as before, by
drawing lines equidistant between each pair of centroids. Notice that the point
with the box around it, which was originally assigned to cluster number 2, has
now been assigned to cluster number 1. The process of assigning points to
clusters and then recalculating centroids continues until the cluster
boundaries stop changing. In practice, the K-means algorithm usually finds a
set of stable clusters after a few dozen iterations.
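
The three steps can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, chooses the seeds at random rather than taking the first K records, keeps the old center if a cluster ever ends up empty, and uses made-up sample points. The function names are hypothetical.

```python
import math
import random

def closest(point, centers):
    """Index of the center nearest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def centroid(points):
    """Average position of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)        # step 1: pick K records as seeds
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign each record to the closest seed or centroid.
        new_assignment = [closest(p, centers) for p in points]
        if new_assignment == assignment:   # assignments stopped changing
            break
        assignment = new_assignment
        # Step 3: recompute each centroid from the records assigned to it.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:                    # assumption: keep old center if empty
                centers[i] = centroid(members)
    return centers, assignment

# Made-up sample data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = k_means(points, k=2)
```

Comparing assignments between iterations is equivalent to checking that the cluster boundaries have stopped moving, because the boundaries are determined entirely by the centroids.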
What K Means
Clusters describe underlying structure in data. However, there is no one right
description of that structure. For instance, someone not from New York City
may think that the whole city is “downtown.” Someone from Brooklyn or
Queens might apply this nomenclature to Manhattan. Within Manhattan, it
might only be neighborhoods south of 23rd Street. And even there, “downtown”
might still be reserved only for the taller buildings at the southern tip of
the island. There is a similar problem with clustering; structures in data exist
at many different levels.
Figure 11.5 At each iteration, all cluster assignments are reevaluated.
et voilà! The problem, of course, is that the databases encountered in market-
ing, sales, and customer support are not about points in space. They are about
purchases, phone calls, airplane trips, car registrations, and a thousand other
things that have no obvious connection to the dots in a cluster diagram.
Clustering records of this sort requires some notion of natural association;
that is, records in a given cluster are more similar to each other than to records
in another cluster. Since it is difficult to convey intuitive notions to a computer,
this vague concept of association must be translated into some sort of numeric
measure of the degree of similarity. The most common method, but by no
means the only one, is to translate all fields into numeric values so that the
records may be treated as points in space. Then, if two points are close in
the geometric sense, they represent similar records in the database. There are
two main problems with this approach:
■■ Many variable types, including all categorical variables and many
numeric variables such as rankings, do not have the right behavior to
properly be treated as components of a position vector.
■■ In geometry, the contributions of each dimension are of equal impor-
tance, but in databases, a small change in one field may be much more
important than a large change in another field.
The following section introduces several alternative measures of similarity.
Similarity Measures and Variable Type
Geometric distance works well as a similarity measure for well-behaved
numeric variables. A well-behaved numeric variable is one whose value indi-
cates its placement along the axis that corresponds to it in our geometric
model. Not all variables fall into this category. For this purpose, variables fall
into four classes, listed here in increasing order of suitability for the geometric
Geometric distance metrics are well-defined for interval variables and true
measures. In order to use categorical variables and rankings, it is necessary to
transform them into interval variables. Unfortunately, these transformations
may add spurious information. If ice cream flavors are assigned arbitrary
numbers 1 through 28, it will appear that flavors 5 and 6 are closely related
while flavors 1 and 28 are far apart.
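
A small sketch makes the spurious ordering concrete. The flavor list and the helper names are hypothetical, and the simple match/mismatch measure shown for contrast is just one possible alternative for categorical values, not a recommendation from the text.

```python
# Hypothetical flavor list; the numeric codes are arbitrary labels, not measurements.
flavors = ["vanilla", "chocolate", "strawberry", "pistachio"]
code = {name: i + 1 for i, name in enumerate(flavors)}

def coded_distance(a, b):
    """Distance implied by treating the arbitrary codes as interval values."""
    return abs(code[a] - code[b])

def mismatch(a, b):
    """0 if the flavors are the same, 1 otherwise; no ordering is implied."""
    return 0 if a == b else 1

near = coded_distance("vanilla", "chocolate")   # adjacent codes look "close"
far = coded_distance("vanilla", "pistachio")    # distant codes look "far apart"
```

With arbitrary codes, vanilla appears three times as far from pistachio as from chocolate, although nothing about the flavors justifies that. The match/mismatch measure treats every pair of different flavors alike.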
These and other data transformation and preparation issues are discussed
extensively in Chapter 17.
Formal Measures of Similarity
There are dozens if not hundreds of published techniques for measuring the
similarity of two records. Some have been developed for specialized applica-
tions such as comparing passages of text. Others are designed especially for
use with certain types of data such as binary variables or categorical variables.
Of the three presented here, the first two are suitable for use with interval vari-
ables and true measures, while the third is suitable for categorical variables.
Geometric Distance between Two Points
When the fields in a record are numeric, the record represents a point in
n-dimensional space. The distance between the points represented by two
records is used as the measure of similarity between them. If two points are
close in distance, the corresponding records are similar.
There are many ways to measure the distance between two points, as
discussed in the sidebar “Distance Metrics.” The most common one is the
Euclidean distance familiar from high-school geometry. To find the Euclidean
distance between X and Y, first find the differences between the corresponding
elements of X and Y (the distance along each axis) and square them. The
distance is the square root of the sum of the squared differences.
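
As a short sketch, the calculation just described can be written in Python as follows. The function name is hypothetical, and records are assumed to be equal-length sequences of numbers.

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared differences along each axis."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of fields")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

d = euclidean_distance((0, 0), (3, 4))   # the familiar 3-4-5 right triangle
```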
Any function that takes two points and produces a single number describing a
◆ Distance(X,Y) = 0 if and only if X = Y
Take the values for length of whiskers, length of tail, overall body length,
length of teeth, and length of claws for a lion and a house cat and plot them as
single points; they will be very far apart. But if the ratios of lengths of these
body parts to one another are similar in the two species, then the vectors will
be nearly collinear.
The angle between vectors provides a measure of association that is not
influenced by differences in magnitude between the two things being com-
pared (see Figure 11.7). Actually, the sine of the angle is a better measure since
it will range from 0 when the vectors are closest (most nearly parallel) to 1
when they are perpendicular. Using the sine ensures that an angle of 0 degrees
is treated the same as an angle of 180 degrees, which is as it should be since for
this measure, any two vectors that differ only by a constant factor are consid-
ered similar, even if the constant factor is negative. Note that the cosine of the
angle measures correlation; it is 1 when the vectors are parallel (perfectly
correlated) and 0 when they are orthogonal.
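
The sine measure can be sketched from the dot product, since the cosine of the angle is the dot product divided by the product of the vector lengths. This is an illustration only: the function name is hypothetical, the measurements are invented, and both vectors are assumed to be nonzero.

```python
import math

def sine_of_angle(x, y):
    """Sine of the angle between two nonzero vectors: 0 when parallel, 1 when perpendicular."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cosine = dot / (norm_x * norm_y)
    cosine = max(-1.0, min(1.0, cosine))   # guard against rounding just outside [-1, 1]
    return math.sqrt(1.0 - cosine * cosine)

# Invented measurements: the big cat is the little cat scaled up by ten.
little_cat = (1.0, 2.0, 3.0)
big_cat = (10.0, 20.0, 30.0)
same_shape = sine_of_angle(little_cat, big_cat)
perpendicular = sine_of_angle((1.0, 0.0), (0.0, 1.0))
```

Because the cosine is squared before taking the root, vectors pointing in exactly opposite directions also score 0, matching the statement that a negative constant factor still counts as similar.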
[Figure 11.7 plot: vectors labeled Big Fish, Little Fish, Big Cat, Little Cat]
Figure 11.7 The angle between vectors as a measure of similarity.
Manhattan Distance
Another common distance metric gets its name from the rectangular grid
pattern of streets in midtown Manhattan. It is simply the sum of the distances
traveled along each axis. This measure is sometimes preferred to the Euclidean
distance because the distances along each axis are not squared, so a large
difference in one dimension is less likely to dominate the total distance.
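
A brief sketch shows the effect of not squaring. The function names and the sample points are invented for illustration. Two records differ from the origin by the same total amount, one spread across three axes and one concentrated on a single axis.

```python
import math

def manhattan_distance(x, y):
    """Sum of the absolute differences along each axis."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean_distance(x, y):
    """Square root of the sum of squared differences, for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

origin = (0, 0, 0)
spread = (4, 4, 4)    # moderate difference on every axis
spike = (12, 0, 0)    # the same total difference, all on one axis

manhattan_pair = (manhattan_distance(origin, spread), manhattan_distance(origin, spike))
euclidean_pair = (euclidean_distance(origin, spread), euclidean_distance(origin, spike))
```

The Manhattan distance rates both records as equally far from the origin, while the Euclidean distance rates the record with one large difference as considerably farther away.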
Number of Features in Common
When the preponderance of fields in the records are categorical variables, geo-
ference of 185,200 in Y or 2,025 in X. Clearly, they must all be converted to a
common scale before distances will make any sense.
Unfortunately, in commercial data mining there is usually no common scale
available because the different units being used are measuring quite different
things. If variables include plot size, number of children, car ownership, and
family income, they cannot all be converted to a common unit. On the other
hand, it is misleading that a difference of 20 acres is indistinguishable from
a change of $20. One solution is to map all the variables to a common
range (often 0 to 1 or –1 to 1). That way, at least the ratios of change become
comparable—doubling the plot size has the same effect as doubling income.
Scaling solves this problem, in this case by remapping to a common range.
TIP It is very important to scale different variables so their values fall roughly
into the same range, by normalizing, indexing, or standardizing the values.
Here are three common ways of scaling variables to bring them all into com-
parable ranges:
■■ Divide each variable by the range (the difference between the lowest
and highest value it takes on) after subtracting the lowest value. This
maps all values to the range 0 to 1, which is useful for some data
mining algorithms.
■■ Divide each variable by the mean of all the values it takes on. This is
often called “indexing a variable.”
■■ Subtract the mean value from each variable and then divide it by the
standard deviation. This is often called standardization or “converting to
z-scores.” A z-score tells you how many standard deviations away from
the mean a value is.
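
The three scaling methods can be sketched with the Python standard library. The function names and the income figures are invented for illustration, and the sketch assumes the variable is not constant (otherwise the range or standard deviation is zero) and, for indexing, that its mean is not zero. Using the population standard deviation rather than the sample standard deviation is also an assumption.

```python
import statistics

def scale_to_unit_range(values):
    """Subtract the lowest value, then divide by the range; results fall in 0 to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def index_by_mean(values):
    """Divide each value by the mean of all the values ("indexing")."""
    mean = statistics.mean(values)
    return [v / mean for v in values]

def z_scores(values):
    """Subtract the mean, then divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)   # population standard deviation (an assumption)
    return [(v - mean) / sd for v in values]

incomes = [20000, 40000, 60000, 80000]   # invented values
unit = scale_to_unit_range(incomes)
indexed = index_by_mean(incomes)
standardized = z_scores(incomes)
```

After range scaling the values run from 0 to 1, after indexing they average 1, and after standardization they average 0 with a standard deviation of 1.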
Normalizing a single variable simply changes its range. A closely related
concept is vector normalization which scales all variables at once. This too has a
geometric interpretation. Consider the collection of values in a single record or
observation as a vector. Normalizing them scales each value so as to make the
length of the vector equal one. Transforming all the vectors to unit length
Of course, if you want to evaluate the effects of different weighting strate-
gies, you will have to add another outer loop to the clustering process.
Other Approaches to Cluster Detection
The basic K-means algorithm has many variations. Many commercial software
tools that include automatic cluster detection incorporate some of these varia-
tions. Among the differences are alternate methods of choosing the initial
seeds and the use of probability density rather than distance to associate
records with clusters. This last variation merits additional discussion. In
addition, there are several different approaches to clustering, including
agglomerative clustering, divisive clustering, and self-organizing maps.
Gaussian Mixture Models
The K-means method as described has some drawbacks:
■■ It does not do well with overlapping clusters.
■■ The clusters are easily pulled off-center by outliers.
■■ Each record is either inside or outside of a given cluster.
Gaussian mixture models are a probabilistic variant of K-means. The name
comes from the Gaussian distribution, a probability distribution often
assumed for high-dimensional problems. The Gaussian distribution general-
izes the normal distribution to more than one variable. As before, the algo-
rithm starts by choosing K seeds. This time, however, the seeds are considered
to be the means of Gaussian distributions. The algorithm proceeds by iterating
over two steps called the estimation step and the maximization step.
The estimation step calculates the responsibility that each Gaussian has for
each data point (see Figure 11.8). Each Gaussian has strong responsibility
for points that are close to its mean and weak responsibility for points that are
distant. The responsibilities are used as weights in the next step.
In the maximization step, a new centroid is calculated for each cluster
taking into account the newly calculated responsibilities. The centroid for a
Figure 11.9 Each Gaussian mean is moved to the centroid of all the data points weighted
by its responsibilities for each point. Thicker arrows indicate higher weights.
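
The two steps can be sketched as follows. This is a deliberately simplified illustration, not a full Gaussian mixture model: it assumes every Gaussian is spherical with the same fixed width `sigma` and equal prior weight, whereas a complete implementation would also re-estimate the spread and weight of each Gaussian. The function names, starting means, and sample points are invented.

```python
import math

def responsibilities(point, means, sigma=1.0):
    """Estimation step for one point: the share of responsibility of each Gaussian."""
    weights = [math.exp(-math.dist(point, m) ** 2 / (2 * sigma ** 2)) for m in means]
    total = sum(weights)   # assumes the point is not so distant that every weight underflows to 0
    return [w / total for w in weights]

def weighted_centroids(points, resp, k):
    """Maximization step: move each mean to the responsibility-weighted centroid."""
    dims = len(points[0])
    new_means = []
    for j in range(k):
        weight_sum = sum(r[j] for r in resp)
        new_means.append(tuple(
            sum(r[j] * p[d] for p, r in zip(points, resp)) / weight_sum
            for d in range(dims)))
    return new_means

def soft_k_means(points, means, iterations=20, sigma=1.0):
    """Alternate the estimation and maximization steps a fixed number of times."""
    for _ in range(iterations):
        resp = [responsibilities(p, means, sigma) for p in points]
        means = weighted_centroids(points, resp, len(means))
    return means, resp

# Invented sample data and starting means.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
means, resp = soft_k_means(points, [(0.0, 0.0), (10.0, 10.0)])
```

Unlike K-means, every record contributes to every mean; a record simply counts for less where its responsibility is low.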
Agglomerative Clustering
The K-means approach to clustering starts out with a fixed number of clusters
and allocates all records into exactly that number of clusters. Another class of
methods works by agglomeration. These methods start out with each data point
forming its own cluster and gradually merge them into larger and larger clusters
until all points have been gathered together into one big cluster. Toward the
beginning of the process, the clusters are very small and very pure: the members
of each cluster are few and closely related. Toward the end of the process, the
clusters are large and not as well defined. The entire history is preserved, making
it possible to choose the level of clustering that works best for a given application.
An Agglomerative Clustering Algorithm
The first step is to create a similarity matrix. The similarity matrix is a table of
all the pair-wise distances or degrees of similarity between clusters. Initially,
the similarity matrix contains the pair-wise distance between individual pairs
of records. As discussed earlier, there are many measures of similarity between
records, including the Euclidean distance, the angle between vectors, and the
ratio of matching to nonmatching categorical fields. The issues raised by the
choice of distance measures are exactly the same as those previously discussed
in relation to the K-means approach.
It might seem that with N initial clusters for N data points, $N^2$ measurement
calculations are required to create the distance table. If the similarity measure
member of its cluster than to any point outside it.
Another approach is the complete linkage method, where the distance between
two clusters is given by the distance between their most distant members. This
method produces clusters with the property that all members lie within some
known maximum distance of one another.
A third method is the centroid distance, where the distance between two clusters
is measured between the centroids of each. The centroid of a cluster is its average
element. Figure 11.10 gives a pictorial representation of these three methods.
[Figure 11.10 plot: clusters C1, C2, and C3 on axes X1 and X2, with annotations marking the closest clusters by the centroid method, by the complete linkage method, and by the single linkage method]
Figure 11.10 Three methods of measuring the distance between clusters.
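
The three cluster-distance measures, and the merging loop that uses them, can be sketched as follows. This is an illustration under stated assumptions: Euclidean distance between records, single linkage taken as the distance between the closest members of two clusters, a brute-force search for the closest pair instead of a maintained similarity matrix, and invented function names and sample points.

```python
import math
from itertools import product

def single_linkage(a, b):
    """Distance between the closest members of two clusters."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """Distance between the most distant members of two clusters."""
    return max(math.dist(p, q) for p, q in product(a, b))

def centroid_of(cluster):
    """Average element of a cluster."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def centroid_distance(a, b):
    """Distance between the centroids of two clusters."""
    return math.dist(centroid_of(a), centroid_of(b))

def agglomerate(points, cluster_distance):
    """Merge the two closest clusters repeatedly, recording every level."""
    clusters = [[p] for p in points]
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: cluster_distance(clusters[ij[0]],
                                                          clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
        history.append([list(c) for c in clusters])
    return history

# Invented sample points along one axis.
points = [(0, 0), (0, 1), (0, 5), (0, 6), (0, 20)]
history = agglomerate(points, single_linkage)
```

Each entry in `history` is one level of the hierarchy, from one cluster per record down to a single cluster, so any level can be chosen afterward.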
Clusters and Trees
The agglomeration algorithm creates hierarchical clusters. At each level in the
hierarchy, clusters are formed from the union of two clusters at the next level
down. A good way of visualizing these clusters is as a tree. Of course, such a
tree may look like the decision trees discussed in Chapter 6, but there are some
important differences. The most important is that the nodes of the cluster tree