Citation
Frequently used methods in cluster analysis

Material Information

Title:
Frequently used methods in cluster analysis
Creator:
Bruntz, Merribeth
Publication Date:
Language:
English
Physical Description:
vii, 63 leaves : illustrations ; 29 cm

Subjects

Subjects / Keywords:
Cluster analysis ( lcsh )
Cluster analysis ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 61-63).
General Note:
Submitted in partial fulfillment of the requirements for the degree of Master of Science, Department of Mathematics, 1987.
Statement of Responsibility:
by Merribeth Bruntz.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
17800841 ( OCLC )
ocm17800841
Classification:
LD1190.L62 1987m .B78 ( lcc )

Full Text
FREQUENTLY USED METHODS IN CLUSTER ANALYSIS
by
Merribeth Bruntz
B.A., University of Colorado, 1983
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Master of Science
Department of Mathematics
1987


This thesis for the Master of Science degree by
Merribeth Bruntz
has been approved for the
Department of
Mathematics
by
Ramalingam Shanmugam
Date


iii
Bruntz, Merribeth (M.S., Applied Math)
Frequently Used Methods in Cluster Analysis
Thesis directed by Assistant Professor Kathyrn F. Jones
Cluster analysis is a statistical pattern recognition technique
that classifies experimental data into sets called clusters. Some
of the more frequently used techniques of classification are
discussed in this paper. Most of the paper is dedicated to
frequently used hierarchical methods: single linkage, complete
linkage, group average, centroid, and Ward's method.
Graph-theoretic techniques for the development of the single
linkage method and the complete linkage method are discussed.
Finally, methods of comparing the hierarchical techniques and
their results are reviewed, with the paper concluding with a look
at some future techniques that utilize parallel computing and
fuzzy set theory.


IV
ACKNOWLEDGEMENTS
I would like to thank K. Jones, H. Greenberg, and R.
Shanmugam for their much appreciated comments. I would also
like to thank Bev Clyncke and Ellen Sanchez for their helpful
support.


CONTENTS
CHAPTER
I.    INTRODUCTION ............................. 1
II.   MEASURES OF SIMILARITY ................... 3
III.  STATISTICAL TECHNIQUES ................... 16
IV.   HIERARCHICAL TECHNIQUES .................. 21
V.    SINGLE LINKAGE METHOD .................... 26
VI.   COMPLETE LINKAGE METHOD .................. 38
VII.  GROUP AVERAGE METHODS .................... 47
VIII. CENTROID METHODS ......................... 51
IX.   WARD'S METHOD ............................ 53
X.    CONCLUSIONS .............................. 57
BIBLIOGRAPHY ................................... 61


TABLES
Data: Percentage of Republican Votes in
seven states.................................. 12
Table of Results.............................. 14


FIGURES
Figure 1 ....................................... 21
Figure 2 ....................................... 22
Figure 3 ....................................... 27
Figure 4 ....................................... 29
Figure 5 ....................................... 36
Figure 6 ....................................... 42
Figure 6b ...................................... 43


1
CHAPTER I
INTRODUCTION
Cluster analysis is a statistical pattern recognition technique
that classifies experimental data into a number of sets called
clusters. The clusters generally contain elements that are similar
to each other, yet dissimilar to the elements of other clusters.
The elements, which are also commonly referred to as objects in
the literature, can represent anything that can be represented as
a point in a multidimensional measurement space. Thus, the
object is any vector x_i, where x_i = (x_{i1}, x_{i2}, ..., x_{in}), in the
n-dimensional space. The components are numerical values of
the measurable properties. It is important in this representation
that all objects are measurement vectors of the same dimension
and that any specific property be represented as the same
component.
In this paper, some of the more frequently used techniques
of cluster analysis are investigated. We begin by defining
measures of similarity between the elements to be classified, and
then discussing a popular statistical method of cluster
classification. Methods of utilizing cluster analysis as a diagnostic


2
aid to analysis of variance, factor analysis, and discriminant
analysis are then discussed. We continue on to an investigation
of five of the more frequently used hierarchical methods: single
linkage, complete linkage, group average, centroid, and Ward's
method. The paper concludes with a discussion of methods of
comparison between the various hierarchical techniques, and a
look at some future techniques that utilize parallel computing
and fuzzy set theory.


3
CHAPTER II
MEASURES OF SIMILARITY
Since the goal of cluster analysis is to identify groups that
contain objects that are similar, yet dissimilar from those of
other groups, a measure of similarity between the objects to be
classified must be defined. A
standard way of expressing similarity is through a set of distance
measurements between pairs of objects. Since so many
clustering algorithms assume that such distances are given in a
matrix format, it is important to carefully select the distance
function to be used [12].
Perhaps the most familiar measure of distance is the
Euclidean distance. This is simply the "straight-line" distance
between two objects, x_i and x_j:

D(x_i, x_j) = [ \sum_{k=1}^{n} (x_{ik} - x_{jk})^2 ]^{1/2},

where x_{ik} is the k-th component of the i-th object in the
n-dimensional space.
Some form of the generalized Minkowski distance can be


4
used for almost all problems:
D(x_i, x_j) = [ \sum_{k=1}^{n} |x_{ik} - x_{jk}|^m ]^{1/m},

where, if m = 2, the Euclidean distance is calculated.
In general, the distance measurement can be any function,
D(x_i, x_j), that satisfies the properties of a metric [31]:
1. D(x_i, x_j) ≥ 0, i.e., the function is nonnegative.
2. D(x_i, x_j) = 0 if and only if x_i = x_j.
3. D(x_i, x_j) = D(x_j, x_i), i.e., the distance is symmetric for
all pairs x_i and x_j.
4. D(x_i, x_j) ≤ D(x_i, x_k) + D(x_k, x_j), for any three objects x_i,
x_j, x_k in the measurement space (this is called
the Triangle Inequality).
This definition of the distance function makes sense only if
the variables are measured in the same units. Since this is rarely
the case, it is desirable to compute a weighted distance that will
place all variables on the same scale. In other words, if weights
{w_p} = {w_1, w_2, ..., w_m} are introduced for all the variables, then the
new points x_i = (w_1 x_{i1}, ..., w_m x_{im}) and the new Euclidean distances
shown in equation (1) are generated.


5
D(x_i, x_j) = [ \sum_{p=1}^{m} w_p^2 (x_{ip} - x_{jp})^2 ]^{1/2}.    (1)
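As an illustration of the distance functions above, here is a minimal Python sketch (the function names are mine, not from the thesis) of the Minkowski distance and of the weighted Euclidean distance of equation (1):

import math

def minkowski(x, y, m=2):
    # Generalized Minkowski distance; m = 2 gives the ordinary Euclidean distance.
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

def weighted_euclidean(x, y, w):
    # Equation (1): each variable is rescaled by its weight before the
    # squared differences are summed.
    return math.sqrt(sum((wp * (a - b)) ** 2 for wp, a, b in zip(w, x, y)))

# Two of the "percentage of Republican votes" objects used later in this chapter:
ga, la = (37, 54), (29, 57)
print(minkowski(ga, la))                       # Euclidean distance
print(weighted_euclidean(ga, la, (1.0, 1.0)))  # identical when all weights are 1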
If the weighting process is not performed with care, the
importance of some variables may be increased or decreased on
subjective or a priori grounds. In this regard, several weighting
methods are considered to be unsatisfactory [12]. The first
weighting method to be avoided is the subjective method. Care
must be taken to avoid weighting the variables according to
considerations outside the data. A second weighting method uses
weights that are chosen inversely proportional to the
measurement variance. Unfortunately, the use of the
measurement error, which is estimated from a subjective
preclustering of the cases, produces a problem. The preclustering
may cause a tendency to overweight the variables, thus making
variables which were probably measured with great precision
useless for clustering. Finally, the third method scales the
variables to equalize the variances within the cluster. This
causes the distances between clusters to become smaller, making
the separations between them less distinct. This method can be
improved if the within-cluster variances are scaled to become
approximately equal.


6
The relationships between variables should also be given
consideration when determining measures of similarity, for after
determining a method of weighting the variables, it may be
noticed that several variables are highly correlated, making it
desirable to down-weight each of them. Therefore, similarity between
variables affects the computation of similarity between cases, and conversely.
A more generalized Euclidean distance can be defined to
accommodate the relationships between variables. This distance
is called the Mahalanobis distance [28] and is defined as follows:
D(x_i, x_j) = [ (x_i - x_j)' W (x_i - x_j) ]^{1/2},

where the matrix W is the inverse of the covariance matrix of the
variables.
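A short sketch of the Mahalanobis computation, assuming numpy is available; the seven-state data set given later in this chapter is used only as a convenient example.

import numpy as np

def mahalanobis(x, y, W):
    # W is the inverse of the covariance matrix of the variables.
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ W @ d))

data = np.array([[37, 54], [29, 57], [49, 59], [47, 32],
                 [50, 36], [54, 36], [46, 35]], dtype=float)
W = np.linalg.inv(np.cov(data, rowvar=False))
print(mahalanobis(data[0], data[1], W))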
More recently, it has been suggested that the weighting
procedure could produce better results if included in the
clustering procedure instead of producing weights prior to and
independently of clustering, as demonstrated above. An
algorithm has been designed by Vladimir J. Lumelsky, which
performs the weighting of the variables within the clustering
procedure. A set of weights, {w_p}, is determined, which
maximizes a function used for the classification.
The following is a description of the algorithm. Given n
objects x_1,...,x_n, each of which is a vector x_i = (x_{i1},...,x_{im}) in an
m-dimensional space of variables x_1,...,x_m, the objects can be
classified into a given number k_0 < m of non-overlapping clusters
A_1, A_2, ..., A_{k_0} in such a way that the following function is
minimized:

F = \sum_{k=1}^{k_0} \frac{1}{n_k (n_k - 1)} \sum_{x_i, x_j \in A_k, i \neq j} D^2(x_i, x_j),    (2)
where D(x_i, x_j) is the weighted Euclidean distance defined
previously by (1), n_k is the number of objects in the
cluster A_k, and the term

\frac{1}{n_k (n_k - 1)} \sum_{x_i, x_j \in A_k, i \neq j} D^2(x_i, x_j)

is an average of the squared distances between the objects of the
cluster A_k. Therefore, the function F is an average, taken
over all the clusters, of the squared distances between the objects
of a cluster. Substituting equation (1) into (2) yields

F = \sum_{k=1}^{k_0} \frac{1}{n_k (n_k - 1)} \sum_{x_i, x_j \in A_k, i \neq j} \sum_{p=1}^{m} w_p^2 (x_{ip} - x_{jp})^2.    (3)
If the set of weights, {w_p}, is given, then Lumelsky's
clustering algorithm is as follows [19]:


8
1. Cluster all the objects x_1,...,x_n arbitrarily into k_0
non-overlapping clusters A_1,...,A_{k_0}.
2. Take the object x_1 and leave it in whichever cluster
A_q, q = 1,...,k_0, produces the minimum
value of the function F.
3. Repeat step 2 for all remaining x_i, i = 2,...,n.
4. Repeat steps 2 and 3 until the value of F stops
changing (a code sketch of this loop follows below).
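The loop above can be sketched as follows; this is my own illustration of the procedure, not Lumelsky's code, and the initial clustering is simply taken round-robin.

def weighted_sq_dist(x, y, w):
    return sum((wp * (a - b)) ** 2 for wp, a, b in zip(w, x, y))

def F(clusters, objects, w):
    # Equation (3): sum over clusters of the average weighted squared
    # distance between the objects of each cluster.
    total = 0.0
    for members in clusters:
        nk = len(members)
        if nk < 2:
            continue  # a singleton contributes nothing
        s = sum(weighted_sq_dist(objects[i], objects[j], w)
                for i in members for j in members if i != j)
        total += s / (nk * (nk - 1))
    return total

def cluster_with_fixed_weights(objects, w, k0):
    # Steps 1-4: arbitrary initial clustering, then repeated reassignment of
    # each object to the cluster giving the smallest F, until F is stable.
    clusters = [set(range(i, len(objects), k0)) for i in range(k0)]
    current = F(clusters, objects, w)
    while True:
        for i in range(len(objects)):
            home = next(c for c in clusters if i in c)
            best, best_val = home, None
            for c in clusters:
                home.discard(i); c.add(i)          # tentative placement
                val = F(clusters, objects, w)
                c.discard(i); home.add(i)          # undo
                if best_val is None or val < best_val:
                    best, best_val = c, val
            home.discard(i); best.add(i)
        new = F(clusters, objects, w)
        if new >= current:
            break
        current = new
    return clusters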
In equation (3), F is defined as a multi-extremal function of
the clusters. Therefore, the procedure above produces a local
minimum of F. The value of F obtained depends upon the initial
choice of clusters and the order of the objects in the procedure.
Lumelsky [19] extends the algorithm to the case where the
weights are not given, i.e., they must be found. All the terms in
equation (3) are positive, and the function F is quadratic in the
weights. Therefore, the function is uni-extremal with
respect to the set of weights, {w_p}. If the condition

\sum_{p=1}^{m} w_p = b,   b > 0,    (4)

is imposed to avoid the trivial solution w_p = 0, then the minimum
of F with respect to the weights can be found with the standard
Lagrange multiplier method [24].


9
The procedure continues with the introduction of a Lagrange
multiplier λ. If we let b = m, the condition in equation (4) yields

f = \sum_{p=1}^{m} w_p - m = 0.    (5)

The functional to be minimized can then be defined as Φ = F + λf, so

Φ = F + λ ( \sum_{p=1}^{m} w_p - m ).

If we take the partial derivatives of Φ with respect to the
weights and combine them with equation (5), then (m + 1)
linear algebraic equations in the (m + 1) parameters
w_1, w_2, ..., w_m, λ are produced:

∂Φ/∂w_q = 2 w_q \sum_{k=1}^{k_0} \frac{1}{n_k (n_k - 1)} \sum_{x_i, x_j \in A_k, i \neq j} (x_{iq} - x_{jq})^2 + λ = 0,   q = 1,...,m,

and

\sum_{p=1}^{m} w_p - m = 0.

Let the following denote

c_q = \sum_{k=1}^{k_0} \frac{1}{n_k (n_k - 1)} \sum_{x_i, x_j \in A_k, i \neq j} (x_{iq} - x_{jq})^2.    (6)

Then,

2 w_q c_q + λ = 0,   q = 1,...,m,    (7a)

and

\sum_{p=1}^{m} w_p - m = 0.    (7b)

From equations (7a) and (7b),

w_q = - \frac{λ}{2 c_q}   and   λ = - \frac{2m}{\sum_{p=1}^{m} (1/c_p)}.

Finally,

w_q = \frac{m / c_q}{\sum_{p=1}^{m} (1/c_p)},   q = 1,...,m.    (8)

Thus, the values w_1,...,w_m, found from equations (6) and (8),
produce a local minimum of the function F (given clusters A_1,...,A_{k_0}).


11
Lumelsky's algorithm, called the extremal weighting
procedure [19], which combines the clustering and weighting procedures,
is as follows:
1. Cluster all the objects x_1,...,x_n arbitrarily into k_0
non-overlapping clusters A_1,...,A_{k_0}. Assume
an arbitrary weighting set, {w_p},
taking into account the condition in
equation (4).
2. Take the object x_i and leave it in the cluster
A_q, q = 1,...,k_0, which corresponds to a
minimum value of F. Continue this
for all i, i = 1,...,n.
3. Using equations (6) and (8), find the optimal
weights {w_p} for the clusters built in
step 2.
4. Repeat steps 2 and 3 until the value of F stops
changing.
This procedure guarantees that for the resultant weights
the final clusters correspond to a local minimum of the average
within-cluster squared distance. Furthermore, the weights for
the clusters produce the global minimum (with respect to the
weights) of the same average within-cluster squared distance.
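A sketch of the weight update of equations (6) and (8), under the same assumptions as the previous sketch; alternating this update with the reassignment step gives the extremal weighting procedure.

def extremal_weights(clusters, objects, m):
    # c_q of equation (6): within-cluster average of squared differences
    # on variable q, summed over the clusters.
    c = [0.0] * m
    for members in clusters:
        nk = len(members)
        if nk < 2:
            continue
        for q in range(m):
            s = sum((objects[i][q] - objects[j][q]) ** 2
                    for i in members for j in members if i != j)
            c[q] += s / (nk * (nk - 1))
    # Equation (8): weights inversely proportional to c_q, scaled to sum to m.
    denom = sum(1.0 / cq for cq in c if cq > 0)
    return [m / (cq * denom) if cq > 0 else 0.0 for cq in c]

# Usage: alternate cluster_with_fixed_weights(...) and extremal_weights(...)
# until the value of F stops changing.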
The following simple example compares the various


12
measures of similarity along with their weighting procedures
[12,191.
Data: Percentage of Republican votes in seven states
Objects (No. and Name) Variables 1960 1964
1 GA 37 54
2 LA 29 57
3 SC 49 59
4 WV 47 32
5 MO 50 36
6 KY 54 36
7 MD 46 35


13
In general, the average distance between the objects within a
cluster is calculated using the formula

D_w = \frac{1}{k_0} \sum_{k=1}^{k_0} \frac{1}{n_k (n_k - 1)} \sum_{x_i, x_j \in A_k, i \neq j} D^2(x_i, x_j),

and the average distance between clusters is calculated using the
formula

D_b = \frac{1}{k_0 (k_0 - 1)} \sum_{k \neq l} \frac{1}{n_k n_l} \sum_{x_i \in A_k} \sum_{x_j \in A_l} D^2(x_i, x_j).

In the following table of results (located on the next page), these
calculations are performed using the above data set.


14
Table of Results

Method of weighting            Within-cluster distance            Between-cluster
the variables                  D_k (k=1)   D_k (k=2)   D_w        distance D_b
No weighting                   14.0        5.3         9.6        26.0
Equal variance weighting       19.3        7.2         13.2       29.2
Mahalanobis weighting          20.0        7.8         13.9       20.0
Extremal weighting             6.7         4.0         5.4        40.3
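As a rough illustration of the kind of computation summarized in the table, the sketch below evaluates average within- and between-cluster distances for the unweighted case; the split of the seven states into {GA, LA, SC} and {WV, MO, KY, MD} is my assumption, and plain (unsquared) Euclidean distances are averaged.

import math

states = {"GA": (37, 54), "LA": (29, 57), "SC": (49, 59), "WV": (47, 32),
          "MO": (50, 36), "KY": (54, 36), "MD": (46, 35)}
A1 = ["GA", "LA", "SC"]        # assumed cluster 1
A2 = ["WV", "MO", "KY", "MD"]  # assumed cluster 2

def d(a, b):
    return math.dist(states[a], states[b])

def within(cluster):
    pairs = [(a, b) for a in cluster for b in cluster if a < b]
    return sum(d(a, b) for a, b in pairs) / len(pairs)

def between(c1, c2):
    return sum(d(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

print(within(A1), within(A2))           # per-cluster averages D_1, D_2
print((within(A1) + within(A2)) / 2)    # D_w
print(between(A1, A2))                  # D_b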
The table of results indicates that the Mahalanobis distance
with its weighting procedure reduces the distance between
clusters, thus causing a lack of clarity. The extremal weighting
procedure increases the distance between clusters and decreases
the distance within the clusters. Therefore, the ability of the
latter procedure to minimize the within-cluster distance and
maximize the between-cluster distance lends itself to a much


15
more distinct measure of similarity in the classification of
clusters of data.


16
CHAPTER III
STATISTICAL TECHNIQUES
Probably the most popular of the statistical methods are
those that minimize a squared error criterion, in which the goal
is to define clusters that are hyperellipsoidal in shape [7]. The
procedure begins by letting the i-th object from the data set be
denoted as

x_i = (x_{i1}, x_{i2}, ..., x_{iN})',

where i = 1,...,n. The number of objects, n, is assumed to be
significantly larger than the number of features, N, where a
feature is defined as some identifiable trait which distinguishes
between objects. Then, a clustering is defined as a partition,
(C_1, C_2, ..., C_K), of the integers (1,...,n), so that each object is assigned a
single cluster label; the objects of the k-th cluster correspond to
the integers in C_k. The center of this cluster is defined as

c_k = \frac{1}{m_k} \sum_{i \in C_k} x_i,

where m_k denotes the number of objects in cluster k. The
sample mean of all objects in the cluster is the cluster center, or
centroid. The squared error of cluster k is

e_k^2 = \sum_{i \in C_k} \| x_i - c_k \|^2.

Furthermore, the squared error for the clustering is

E_K^2 = \sum_{k=1}^{K} e_k^2,

where K is the total number of clusters.
There are two major objectives of this method. The first
is to define a clustering, for a given K, that minimizes the
squared error E_K^2; the second is to determine a suitable
number of clusters, K, that is much smaller than the number of
objects, n.
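A minimal sketch of the squared error criterion just described; given a clustering as lists of points, each cluster's centroid and squared error are computed and summed.

def centroid(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def squared_error(clusters):
    # E_K^2: sum over clusters of the squared distances from each object
    # to the centroid of its cluster.
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(sum((x - ci) ** 2 for x, ci in zip(p, c)) for p in points)
    return total

# Example with a hypothetical two-cluster partition:
clusters = [[(37, 54), (29, 57), (49, 59)],
            [(47, 32), (50, 36), (54, 36), (46, 35)]]
print(squared_error(clusters))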
Occasionally, it is possible to combine the efforts of cluster
analysis with some of the methods of statistics. For example,
analysis of variance is a statistical technique which tests whether


18
variations in selected independent variables cause significant
variation in a dependent (response) variable. It is important to
insure that the only variation is caused by the independent
variables that characterize each group of observations. If there
are additional effects operating within groups which imply that a
group contains distinct subgroups, then the apparent error
variance is magnified and it becomes more difficult to detect a
genuine variation between groups. Furthermore, if there are
effects operating between groups (other important variables
exhibit group-to-group variation), then the genuine effect of the
independent variable under investigation may be either concealed
or amplified. When the analysis is concerned with systems
subject to only indirect control, such as when the responses of
human beings are being measured, then it is possible for a host
of factors to influence the observed responses. According to
Anderberg [1], cluster analysis might be used to an advantage in
the following ways:
1. Within-cluster analysis will either indicate that
the assumption of significant similarity
within-clusters is justified, or reveal a specific
way in which the assumption is violated.
2. A cluster analysis across the entire data set may
reveal group-to-group variations that are not
due to an independent variable.


19
Another statistical method that works quite well in
combination with cluster analysis is factor analysis, which is a
multivariate technique used in many of the behavioral and social
sciences. The objective is to find a set of factors with the
following properties:
1. The factors are linear combinations of the original
variables.
2. The factors are orthogonal to each other.
3. The factors are fewer in number than the original
variables, but they account for most of the
variance in the original data.
4. The factors are meaningful in their own right, or
are at least subject to some useful
interpretation.
According to Cattel [5], factor analysis should be applied to
relatively similar groups of data, and cluster analysis is a method
for finding such groups.
Discriminant analysis is another multivariate technique. It
is used in the construction of decision procedures by which data
are classified as members of one group or another. For example,
in the simplest case, samples of data are available from each of
two known categories. The objective is to utilize the sample
information to construct a linear discriminant function that
separates the groups in such a way as to minimize the number


20
of expected misclassifications. Subsequently, the discriminant
function may be used to classify new data into one of the two
classes [1].
Typically, the number of groups in the data set is thought
to be known, and the primary objective is to find an efficient
means of distinguishing between the groups as given. Sometimes
the analysis fails because the resulting discriminant function
separates the groups poorly. According to Anderberg [1], a lack
of success might be caused by the fact that the defined groups
are not separable within the measurement space defined by the
available variables. Thus, the data set contains fewer identifiable
groups than prior knowledge would suggest. It is also possible
that there are actually more identifiable groups in the data set
than the number being sought in the analysis. When problems of
this kind arise, cluster analysis can be utilized as a diagnostic tool
for discovering distinct subsets within a defined group. However,
if the discriminant analysis gives a satisfactory discriminant
function, then it is unnecessary to supplement the method with
cluster analysis [1].


21
CHAPTER IV
HIERARCHICAL TECHNIQUES
Given the definitions of distance, both within a cluster and
between clusters, it is possible to formulate a hierarchical model
of clustering for a set of objects. The result of this procedure is
the hierarchical tree. This tree is defined just as it is from the
graph-theoretical point of view, meaning it is a connected graph
without cycles. The hierarchical structure of the tree implies
that it is a graph with a unique path between any two nodes in
the graph [31]. Figure 1 represents a hierarchical tree of 31
objects (leaves presented as circles). Seven objects
(A, B, C, D, E, F, G) are each represented as a vector x.

Figure 1 (hierarchical tree: root and levels)


22
If lengths are assigned to the branches, then a
representation of the tree, called a dendrogram, can be produced
[31]. Figure 2 is the dendrogram for the hierarchical tree in
figure 1.

Figure 2 (dendrogram; distance and similarity scales)
The number of clusters is associated with a given distance or
similarity level. For example, at the dashed line at the distance
level of 40%, which is a similarity level of 60%, there are
nine clusters in the dendrogram. The lengths of the branches
may be relative or absolute with respect to some maximal distance. The
similarity scale in figure 2 is related to the relative distance as

23
follows:

S(x_i, x_j) = 1 - \frac{D(x_i, x_j)}{\max_{i,j} D(x_i, x_j)}.
It should be noted that in graph theory the term "distance" refers
to the number of edges between two nodes. This,
however, is not the sense in which "distance" is used here, where
distances are given by the metric defined on the measurement space,
both between clusters and within clusters.
When performing hierarchical clustering, the
distance (calculated with respect to some metric) between every
pair of objects is first calculated and placed into a distance matrix, D_{n×n}. The
minimal element of D is determined. If there is a tie, then
traditionally the first or last pair is selected. The selection of
the minimal element of the distance matrix causes the i-th and
the j-th objects, or clusters, to be combined into a single cluster.
Next, a new distance matrix, D_{(n-1)×(n-1)}, is calculated. This new
matrix contains both the elements of the old matrix that were
not joined and a new set of distances between the new cluster
and all other clusters. This iterative approach of creating a new
distance matrix continues until the last two clusters have been
combined into a single cluster. It should be noted that the
distance between two clusters is calculated from the elements of


24
the old matrix of the previous step. It is possible to do this from
the initial set of data (taking into account all individual objects
of both clusters), however, this method is quite time consuming,
computationally.
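The iterative scheme just described can be sketched as below; the rule used to form the new row of the distance matrix is the only thing that changes from method to method (min corresponds to single linkage, max to complete linkage).

def agglomerate(D, update=min):
    # D: full symmetric distance matrix (list of lists) between n objects.
    # Returns the merge history as (members of one cluster, members of the
    # other cluster, merging distance) triples.
    n = len(D)
    active = list(range(n))
    dist = {(i, j): D[i][j] for i in range(n) for j in range(n) if i < j}
    members = {i: [i] for i in range(n)}
    merges = []
    while len(active) > 1:
        i, j = min(((a, b) for a in active for b in active if a < b),
                   key=lambda p: dist[p])
        merges.append((members[i], members[j], dist[(i, j)]))
        new_id = max(members) + 1
        members[new_id] = members[i] + members[j]
        for k in active:
            if k not in (i, j):
                dik = dist[(min(i, k), max(i, k))]
                djk = dist[(min(j, k), max(j, k))]
                dist[(k, new_id)] = update(dik, djk)
        active = [k for k in active if k not in (i, j)] + [new_id]
    return merges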
Seven of the most frequently used hierarchical clustering
methods are the single linkage, complete linkage, group average,
weighted group average, centroid, median, and Ward. The
flowchart on the next page demonstrates the steps taken by these
hierarchical methods [31].




26
CHAPTER V
SINGLE LINKAGE METHOD
The single linkage method, which is also known as the
nearest neighbor technique, is one of the simplest of the
clustering methods. In this technique the distance
between two clusters is the smallest distance between two
points, one in each cluster. This is accomplished through the
utilization of a graph-theoretic technique known as the minimal
spanning tree [31].
Graph-theoretic techniques provide a means for finding
unconventional data structures [7]. As an introduction to some
of the more powerful graph-theoretic techniques, consider the
following: two well-separated clusters (figure 3a), each with
approximately the same homogeneous point density. If the
standard deviation, σ, is relatively small, then the point density is
defined as "homogeneous." The size of the intercluster distance
relative to the mean, μ, implies the term "well-separated" clusters.
If the distance from each point to its nearest neighbor (figure
3a) is computed, a mean value μ = 20.5 is attained. The standard
deviation is computed as σ = 1.66, and the distance between the
clusters equals 58. The range of values of the nearest-neighbor distance is 17.5
to 24. A graph can be formed from the points of figure 3a by
connecting any pair whose distance is smaller than a threshold
value depending upon μ and σ. If μ + 3σ is used, the graph in
figure 3b results.

Figure 3


28
The two clusters in figure 3c have non-homogeneous point
densities. Intuitively, these clusters are perceived as being
well-separated. The leftmost point of the upper cluster,
however, is further from its nearest neighbor than the distance
between the clusters, making it difficult to detect these clusters
using the method described above. Utilizing an adaptive
connecting algorithm, such as one that connects each point to its
k-nearest-neighbors, yields the results in figure 3d, where k = 3
[30].
Figure 3e illustrates two "touching clusters", which appear
as a single cluster if the above two methods are utilized. It is
true that it is a single cluster by the criterion of connectivity;
however, it is perceived as being two clusters connected by a
small neck. Figure 3f illustrates the construction of a
k-nearest-neighbor graph, with k = 4. In addition to the
construction of this graph, cut points (nodes whose removal disconnects
the graph) or bridges (edges whose removal disconnects the graph)
can be used to separate the cluster into two clusters.
The minimal spanning tree (MST) is a powerful "locally
adaptive interconnecting mechanism" for a point set that favors
nearest neighbors. In general, the MST is generated for the
complete graph in which the nodes are the patterns and the edge
weights are Euclidean distances in the pattern space. A threshold


29
is computed as the sample mean of the edge weights in the tree
plus the sample standard deviations. All edges having weights
greater than the computed threshold are cut, forming subtrees.
Each subtree represents a cluster.
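A sketch of the edge-cutting step: given the MST as a list of weighted edges, edges heavier than the sample mean plus a multiple of the sample standard deviation are removed, and the connected components of what remains are taken as clusters. The multiplier k is left as a parameter, since the text does not fix it.

import statistics

def cut_mst(n, mst_edges, k=1.0):
    # mst_edges: (u, v, weight) triples of a minimal spanning tree on nodes
    # 0..n-1 (at least two edges are assumed, so that stdev is defined).
    weights = [w for _, _, w in mst_edges]
    threshold = statistics.mean(weights) + k * statistics.stdev(weights)
    parent = list(range(n))                  # union-find over surviving edges
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v, w in mst_edges:
        if w <= threshold:
            parent[find(u)] = find(v)
    clusters = {}
    for node in range(n):
        clusters.setdefault(find(node), []).append(node)
    return list(clusters.values())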
More specifically, Zahn [30] describes the algorithm as
follows. A tree is defined as a connected graph with no circuits.
A spanning tree of a connected graph G is a tree in G
which contains all the nodes of G. So, given the graph G
(figure 4a), two spanning trees of G are represented in
figures 4b and 4c. Defining the weight of a tree to be the sum
of the weights of its constituent edges produces the following
definition of a minimal spanning tree: "A minimal spanning tree
(MST) of graph G is a spanning tree whose weight is minimum
among all spanning trees of G" [30].
Thus, the tree in figure 4c is the MST of the graph in figure 4a.
The nodes of graph G (figure 4a) can be partitioned into
two disjoint nonempty subsets, P and Q, where P = {A,B,C}
and Q = {D,E,F}.

Figure 4

The distance, D(P,Q), across this partition is
the smallest weight among all edges which have one end node in
P and another in Q. Therefore, D(P,Q) = 8, making edge AD
the bridge spanning from the subset P = {A,B,C} to Q = {D,E,F}.
The set of edges that span a partition is defined as the
cut-set and denoted C(P,Q). A link is any edge in C(P,Q)
whose weight is equal to the distance across the partition, and the
set of all links in the cut-set C(P,Q) is called the link-set,
denoted L(P,Q). For example, in figure 4a [30]:
C(P,Q) = {AD, CD, CF}, and L(P,Q) = {AD}.
Therefore, the following three theorems are true [30]:
Theorem 1. Any MST contains at least one edge from each
L(P,Q).
Proof. This proof shows that a spanning tree T' containing no
edge from L(P,Q) can be improved by switching one of its edges
for one in L(P,Q). Select any edge (xy) ∈ L(P,Q) and add it to
T' to produce a new graph with precisely one circuit. The
portion of this circuit which lies in T' must have at least one
edge (uv) ∈ C(P,Q), because x and y are in P and Q,
respectively. The edge (uv) is not in L(P,Q) by the definition
of T'. The spanning tree
T = {T' - (uv)} ∪ (xy)
has a smaller weight than T', because W(xy) < W(uv) by the


31
definition of L(P,Q). Thus, any MST must have at least one
edge from L(P,Q) .
Lemma 1. Each edge (xy) of a spanning tree T determines a
unique partition (P,Q) of the nodes of G in a natural way so
that C(P,Q) contains exactly one edge in T.
Proof. Let T' = T - (xy) be the graph obtained by deleting edge
(xy) from the spanning tree T. Since every edge of a tree is a
bridge, it follows that T' has two connected components, T_1' and
T_2', and the nodes x and y are in different components (say
x in T_1' and y in T_2'). Let P be the nodes of T_1' and Q the
nodes of T_2'. Since no nodes were deleted from T, which spans
G, the sets P and Q are certainly disjoint and they contain all nodes
of G. Also, the only edge in T joining P to Q is the
deleted edge (xy), and so the lemma is proved.
Theorem 2. All MST edges are links of some partition of G.
Proof. Let T be any MST for G and (xy) any edge in T.
Let (P,Q) be the unique partition assured by Lemma 1.
From Theorem 1 we see that T must contain at least one edge
from L(P,Q); however, since T contains only one edge from
C(P,Q), it certainly contains only one edge of L(P,Q).
Thus, edge (xy) must be the edge which belongs to L(P,Q), and
so (xy) is a link of G.


32
Theorem 3. If S denotes the nodes of G and C is a
nonempty subset of S with the property that D(P,Q) <
D(C, S-C) for all partitions (P,Q) of C, then the restriction of
any MST to the nodes of C forms a connected subtree of the
MST.
Proof. Select an arbitrary partition (P,Q) of C and let R =
S - C. We must show that any MST contains at least one edge in
the cut-set C(P,Q). To do this we need to show that D(P,Q) <
D(P,R), since then
L(P, S-P) ⊆ C(P,Q).
Notice that
D(C, S-C) = D(P ∪ Q, R) = min {D(P,R), D(Q,R)},
hence
D(P,R) ≥ D(C, S-C).
By the hypothesis of the theorem,
D(C, S-C) > D(P,Q).
Thus,
D(P,Q) < D(C, S-C) ≤ D(P,R).
This implies that the link-set, L(P, S-P), is a subset of the cut-set,
C(P,Q). By Theorem 1, we can conclude that any MST has


33
at least one edge from L(P,S-P) and, therefore, from C(P,Q)
whenever (P,Q) partitions C. Finally, this means that the
restriction of a MST to C cannot fall into two or more
components.
Theorem 3 is the most significant, as it reveals the
relationship between the MST and cluster structure. For
example, in figure 4d no partition (P_2,Q_2) of C_2 is such that
D(P_2,Q_2) > 22. Since D(C_2, S-C_2) = 78 > 22, theorem 3 holds.
A spanning tree is forced to span all the points, but a MST
(according to Theorem 2) jumps across the smaller gaps first.
Therefore, any MST edge is the smallest jump from some set to
the rest of the nodes. To delete edges from a MST so that the
resulting connected subtrees correspond to the observable
clusters, the MST must first be constructed; two of the more
popular algorithms for doing so are Kruskal's Algorithm and
Prim's Algorithm. Both are defined as follows:
Kruskal's Algorithm [17]. Arrange the edges of G in order from
smallest to largest weight and then select edges in order, making
sure to select only edges which do not form a circuit with those
already chosen. The algorithm stops when (n-1) edges have
been selected, where n is the number of nodes in G. The set of
edges is then a MST for G.
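A compact sketch of Kruskal's algorithm with a union-find structure, applied to the edge list of figure 4a as it appears in the worked example below:

def kruskal(nodes, edges):
    # edges: (weight, u, v) triples; returns the MST edges.
    parent = {v: v for v in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:                 # skip edges that would close a circuit
            parent[ru] = rv
            mst.append((w, u, v))
        if len(mst) == len(nodes) - 1:
            break
    return mst

edges = [(2, "B", "C"), (2, "D", "F"), (3, "D", "E"), (4, "E", "F"),
         (4, "A", "B"), (5, "A", "C"), (8, "A", "D"), (9, "C", "D"),
         (10, "C", "E")]
print(kruskal("ABCDEF", edges))   # BC, DF, DE, AB and AD are selected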
Prim's Algorithm [21]. Begin with an arbitrary node of G and
add the edge of smallest weight connected to this node. This
edge with its two end nodes is the fragment tree T_1. The k-th
fragment tree T_k is obtained by adding the shortest edge from T_{k-1}
to the nodes of G not in T_{k-1}. This step continues until T_{n-1} is
the desired MST.
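A corresponding sketch of Prim's algorithm, with the graph of figure 4a given as an adjacency dictionary:

def prim(adj, start):
    # adj: {node: {neighbor: weight}}; returns the MST as (weight, u, v) edges.
    in_tree = {start}
    mst = []
    while len(in_tree) < len(adj):
        w, u, v = min((adj[u][v], u, v)
                      for u in in_tree for v in adj[u] if v not in in_tree)
        in_tree.add(v)
        mst.append((w, u, v))
    return mst

adj = {"A": {"B": 4, "C": 5, "D": 8},
       "B": {"A": 4, "C": 2},
       "C": {"A": 5, "B": 2, "D": 9, "E": 10},
       "D": {"A": 8, "C": 9, "E": 3, "F": 2},
       "E": {"C": 10, "D": 3, "F": 4},
       "F": {"D": 2, "E": 4}}
print(prim(adj, "A"))   # fragment grows A, AB, ABC, ABCD, ABCDF, ABCDEF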
The following illustrates how Kruskal's Algorithm would work on
the graph in figure 4a [30]:

Edge   Weight   Circuit    MST edge
BC     2                   *
DF     2                   *
DE     3                   *
EF     4        (DEFD)
AB     4                   *
AC     5        (ABCA)
AD     8                   *  (5th edge)
CD     9
CE     10

Prim's Algorithm for the same graph proceeds as follows [30]:

Fragment Nodes    New Edge   Weight
A                 AB         4
A,B               BC         2
A,B,C             AD         8
A,B,C,D           DF         2
A,B,C,D,F         DE         3
Probably one of the key advantages of using the MST in
the detection of clusters involves a widely held principle of
perceptual organization which states that, "Our nervous systems
organize the perceived world in whatever way will keep changes
and differences to a minimum." [13] An analogy can be drawn


35
between the "minimum principle" organization produced by the
MST and that produced by human perceptual mechanisms. In
fact, Zahn [30] suggests that the MST represents a possible
model of certain perceptual mechanisms, one that should be tested in
psychological experiments.
Sneath [27] was the first to use the single linkage method.
He summarized taxonomic relationships in the form of
dendrograms, which represented taxonomic trees. The method
expressed the relationships between n samples, which was done
in terms of the taxonomic distances measured between every
pair of samples.
In general, the single linkage method consists of a sorting
scheme that determines clusters at a series of increasing distance
thresholds, (d_1, d_2, ...). The clusters at level d_i are constructed by
first grouping the samples into disjoint sets by joining all
segments of length d_i or less. Each set forms a cluster at level
d_i. Therefore, all segments joining two clusters defined at level
d_i will have lengths greater than d_i. Many of the links of length
≤ d_i will be redundant; however, all that is required is that a
chain of segments of length ≤ d_i joins all the members of a
cluster. If sorting is done at a greater distance threshold, d_{i+1},
then all clusters at level d_i remain but some may combine into
larger clusters. Two clusters will combine if at least one link of
length d, with d_i < d ≤ d_{i+1}, exists between them.
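The level-by-level construction can be sketched directly: at threshold d, every pair of samples at distance d or less is joined and the connected components are the clusters, so samples chained together through short links fall into one cluster even when some of their mutual distances exceed d.

def single_linkage_level(points, d_level, dist):
    # Clusters at threshold d_level: connected components of the graph whose
    # edges join all pairs at distance <= d_level.
    n = len(points)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if dist(points[i], points[j]) <= d_level:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())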


36
Given the minimal spanning tree in figure 5a, the
dendrogram in figure 5b can be formed using Kruskal's
algorithm. Briefly, E and F are clustered at distance level 1;
thus they are represented on the branches of the dendrogram
that join at level 1. B and C are clustered at level 2 and, in the
same fashion, H clusters with E and F at level 3. It is
important to note that when two points in the dendrogram are
linked at level d, then no segment joining the same two points
in the MST can be longer than d units. For example, in figure
5b, A and C are linked at level 4; A is joined to C in
figure 5a via B, where AB equals 4 units and BC equals 2
units [10].
Figure 5


37
In practice, the threshold levels in the single linkage method
are not increased continuously but at a constant increment, δ.
Obviously, several links may join between two threshold levels L
and L + δ; therefore some detail on the exact distances at which
clusters combine may be lost. Thus, a dendrogram obtained in
this way may not be exactly the same as one derived directly
from the MST. Figure 5c illustrates the distortion of figure 5b
when links are considered only at 0, 2, 4, 6, and 8 units (δ = 2).
Unlike some other clustering methods that define clusters
by maximizing some simple function of average inter-set
distance (giving fairly compact, roughly spherical clusters), the
single-linkage method produces long clusters in which the
individuals that belong to two different clusters may be closer
together than different members of the same cluster. For this
reason, the single-linkage method is often disliked; however, the
evidence of a continuous sequence of intermediate samples is
often informative [10].


38
CHAPTER VI
COMPLETE LINKAGE METHOD
The complete linkage method evaluates
the distance between two clusters as the longest distance that can
be found between two points, one in each cluster. Thus, this method is also known
as the furthest neighbor technique [31]. Like the single linkage
method, the complete linkage method has a graph-theoretic
interpretation. This interpretation is related to the problem of
coloring the nodes of a graph.
Given the basic problem approached in hierarchical
clustering, let S be a set of n objects, o_1,...,o_n, with a
symmetric proximity measure, s_ij, defined between each pair of
objects, o_i and o_j. Furthermore, let a smaller proximity value
correspond to pairs of objects that are similar. The
complete-linkage method produces a sequence or hierarchy of
partitions of the set S, denoted ℓ_0,...,ℓ_{n-1}, from the ordinal
information present in the matrix {s_ij}. In other words, the
hierarchy of partitions is produced from the rank order of all
object pairs in terms of increasing dissimilarity. For example,
the partition ℓ_0 contains all objects in separate classes, ℓ_{n-1}
consists of one all-inclusive object class, and ℓ_{k+1} is defined from
ℓ_k by uniting a single pair of sets in ℓ_k [2].


39
Several elementary concepts from graph theory are helpful
in characterizing which sets are chosen to unite in
defining ℓ_{k+1} from ℓ_k. Some of these concepts include the
following [11]:
1. A graph G is complete if and only if an edge exists
between each pair of distinct nodes in G.
2. An induced subgraph of G defined by the subset D,
D ⊆ S, is a graph consisting of the nodes in
D where an edge exists between o_i and o_j if and only
if they are joined in G.
3. The complementary graph Ḡ is a graph with the same
node set as G but in which two nodes are joined if
and only if they are not joined in G.
4. A graph G is said to be m-colorable if and only if
the nodes of G can be assigned m different colors
in such a way that no two distinct nodes with the
same color are joined.
5. A graph G has chromatic number X(G) if G is
X(G)-colorable but not (X(G) - 1)-colorable.
Thus, the complete-linkage clustering method can be
characterized as follows. Suppose G_c is a graph consisting of the
nodes o_1,...,o_n, where o_i and o_j are linked by a single edge if and
only if s_ij ≤ c, with i ≠ j and 0 ≤ c ≤ max {s_ij | 1 ≤ i, j ≤ n}.
If ℓ_k consists of the sets L_1,...,L_{n-k}, then L_u and L_v contained in
ℓ_k are united to form ℓ_{k+1} when a diameter measure Q(.,.) is a
minimum on the two sets L_u and L_v. This diameter measure is
defined as follows [16]:

Q(L_s, L_t) = min {s_ij | the induced subgraph of G_{s_ij} defined
by the node set L_s ∪ L_t is complete}.
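A naive sketch of this criterion: at each step the two sets whose union has the smallest diameter (largest within-union proximity) are united. The small proximity matrix is the one from Baker and Hubert's illustration later in this chapter.

def complete_linkage(s):
    # s: symmetric proximity matrix; returns the sequence of partitions.
    clusters = [[i] for i in range(len(s))]
    hierarchy = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        def diameter(a, b):
            # largest proximity value within the union of the two sets
            return max(s[i][j] for i in a + b for j in a + b if i != j)
        a, b = min(((x, y) for x in range(len(clusters))
                    for y in range(x + 1, len(clusters))),
                   key=lambda p: diameter(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        hierarchy.append([c[:] for c in clusters])
    return hierarchy

s = [[0, 1, 4, 5],
     [1, 0, 2, 6],
     [4, 2, 0, 3],
     [5, 6, 3, 0]]
print(complete_linkage(s))   # the level-2 partition is [[0, 1], [2, 3]]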
Just as is the case with the single linkage method, the
complete linkage method unites those two classes in ℓ_k that are
"closest" together to form ℓ_{k+1}. For the complete-linkage method,
the concept of "closeness" is defined as the maximum proximity
value attained for pairs of objects within the union of the two
sets. Therefore, from a graph-theoretic point of view, this
method must be interpreted in terms of complete subgraphs.
Consider for the moment a single graph G_c with proximity
value c. Assume that P is the set of all partitions of S. It
is necessary to choose certain elements p in P that are related
in some way to the partitions formed by the complete linkage
method. For example, suppose B is the set of partitions such
that, for all p in B, o_i and o_j are linked if o_i and o_j belong to
the same object class in p. This implies that no two dissimilar
objects can belong to the same object class in p. The elements
of B can be partially ordered with respect to partition
refinement, where one specific partition p is a refinement of a
second partition p' if and only if p' can be formed by uniting


41
certain sets contained within p. If it should be necessary to
select members of B for a more thorough analysis, then two
classes can be considered [2]:
1. The maximal elements of B are defined as
B_1 = {p ∈ B | there is no element p' ∈ B such that p
is a proper refinement of p'}.
2. The maximal elements of B with the smallest number
of object classes are defined as
B_2 = {p ∈ B_1 | p has f_1 object classes},
where f_1 = min {f | p ∈ B_1 and p has f object
classes}.
The set B_2 is the more useful alternative when the common
concern is to interpret a minimum number of object classes for
meaning.
Baker and Hubert [2] develop a simple illustration with S =
{o_1, o_2, o_3, o_4} that has a proximity matrix {s_ij} of the form
illustrated below.

       1   2   3   4
  1    0   1   4   5
  2    1   0   2   6
  3    4   2   0   3
  4    5   6   3   0

Let c = 4; then the graph G_4 can be obtained by drawing
undirected edges between those pairs whose proximity values are
less than or equal to 4, as in figure 6a.
The complementary graph of G_4 is obtained by connecting only those object pairs
whose proximity values are greater than 4, as in figure 6b.


43
Figure 6b
In this illustration, S contains four objects and P consists
of fifteen partitions. The set B, which is defined by all
partitions of S that decompose G_4 into node sets defining
complete subgraphs, contains the following seven elements:
{(o_1 o_2 o_3), (o_4)}
{(o_1 o_2), (o_3 o_4)}
{(o_1 o_2), (o_3), (o_4)}
{(o_1 o_3), (o_2), (o_4)}
{(o_2 o_3), (o_1), (o_4)}
{(o_3 o_4), (o_1), (o_2)}
{(o_1), (o_2), (o_3), (o_4)}
Both B_1 and B_2 happen to be the same, and each consists of the
two partitions {(o_1 o_2), (o_3 o_4)} and {(o_1 o_2 o_3), (o_4)}. Membership
in B_1 is satisfied by both of these partitions, as uniting any pair
of sets in either partition generates a new decomposition of the
set S that fails to be a member of B. Also, membership in B_2
is satisfied by both partitions, as f_1 = 2 is the minimum number
of object classes obtainable for any partition within B.
Some element of B_1 must appear within the sequence of
partitions generated by the complete linkage method. In the
above illustration, the partition {(o_1 o_2), (o_3 o_4)} would be obtained
at level 2 in the hierarchy. It is not generally true, however, that
an element of B_2 must appear in the complete linkage sequence.
The set B_2 serves another purpose in clarifying the nature
of complete linkage clustering; it has some interesting connections
to the node-coloring problem. In particular, X(G_c) = f_1, i.e., f_1 is
the chromatic number of the graph G_c. Furthermore, finding all
the different ways of coloring the nodes of G_c with f_1 colors is
equivalent to finding all the elements of B_2.
Baker and Hubert [2] describe this with the following
example. Given a graph G_c with proximity value c and t edges,
the first t distinct ranks are arbitrarily assigned to the object
pairs that are linked in G_c, with ranks larger than t assigned to
the remainder. If a complete-linkage clustering is performed until
the measure Q exceeds t, then the last element formed in the
hierarchy must be a member of B_1. If all t! possible
assignments of rank are made, then all the members of B_1 will
be constructed. The set B_2 can be obtained from B_1 by
choosing those partitions with f_1 object classes.
Furthermore, the complete linkage method can be viewed
as a generalization of a common heuristic used in graph theory.
This is known as sequential node coloring, in which any ordered


45
sequence of n nodes, o_{h1},...,o_{hn}, can be used to produce a coloring
of G_c through the following procedure [20]:
1. o_{h1} is assigned to color class 1.
2. If the nodes o_{h1},...,o_{h(i-1)} have been assigned
to j color classes, then, if possible, o_{hi} is
assigned to color class m, where m is the
minimum positive integer less than or equal to
j such that o_{hi} is not linked to any
of the objects in the m-th class. If no such
integer exists, then o_{hi} is used to define
the new (j + 1)st color class (a code sketch of this procedure follows).
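A sketch of this greedy procedure; the adjacency test is passed in as a function, since the text leaves the graph representation open.

def sequential_coloring(order, linked):
    # order: nodes in the chosen sequence; linked(u, v) is True when u and v
    # are joined by an edge.
    classes = []
    for node in order:
        for cls in classes:
            if not any(linked(node, other) for other in cls):
                cls.append(node)       # first admissible color class
                break
        else:
            classes.append([node])     # no admissible class: open a new one
    return classes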
The same partitioning of G_c can be performed via the
complete linkage method. In particular, the existing edges of
G_c, defined by the node pairs {o_{hi}, o_{hj}} with hi < hj, are ordered
lexicographically according to the index sequence h_1,...,h_n, and the
"nonexisting" edges are all given an arbitrary large rank, such as
n(n-1)/2. As long as the diameter measure Q is less than the
arbitrary rank n(n-1)/2, the last partition constructed is the
appropriate decomposition of G_c. For example, in figure 6a the
node sequence o_1, o_2, o_3, o_4 produces the partition {(o_1 o_2 o_3), (o_4)}
when sequential node coloring is applied. Equivalently, suppose
the node pairs are ordered lexicographically. This produces
{o_1,o_2}, {o_1,o_3}, {o_2,o_3}, {o_3,o_4}, {o_1,o_4}, {o_2,o_4}, with ranks of 1, 2, 3, 4,
6, 6, respectively, where the first four node pairs define existing
edges in G_4 and the last two pairs define nonexisting edges. The
complete linkage method produces exactly the same partition.
It should be noted that, at least theoretically, the complete
linkage method leads to a well-defined procedure for obtaining
B_2. In practice there is usually no guarantee that an element of
B_2 will be obtained. For this reason, the use of the
complete-linkage method with an arbitrary assignment of the first t
ranks is usually referred to as a "naive" coloring of G_c.
Furthermore, use of a random assignment of proximity
ranks for constructing B_1 and B_2 is computationally inefficient,
since the same partition will be obtained repeatedly for different
assignments of the first t ranks [2].


47
CHAPTER VII
GROUP AVERAGE METHODS
There are two types of group average methods: weighted
and unweighted. In both cases, the distance between two clusters
is calculated as the average distance between all pairs of objects
in the two clusters.
The two methods differ in that the weighted technique
employs a weight on the clusters, with the smaller cluster being
weighted more than the larger. This is done in an attempt to
enhance the influence of a small outlier cluster if it joins a larger
group of objects. Otherwise, it is possible that a small outlier
cluster may join a larger cluster with little significant effect.
The average distance between all pairs of objects in the
two groups I and J is calculated as

D(I, J) = \frac{1}{n_I n_J} \sum_{i \in I} \sum_{j \in J} d_{ij}.
In the following, the average distance between clusters D and E
can be calculated as

D_{E,D} = (1/12) [ d_{2,8} + d_{2,9} + d_{2,10} + d_{2,11}
                 + d_{3,8} + d_{3,9} + d_{3,10} + d_{3,11}
                 + d_{4,8} + d_{4,9} + d_{4,10} + d_{4,11} ]
        = 3.18

using the unweighted group average method [31].
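The unweighted rule is a one-line computation; a sketch, with the distance function d supplied by the caller:

def group_average(I, J, d):
    # Unweighted group average: mean of all pairwise distances between
    # objects of cluster I and objects of cluster J.
    return sum(d(i, j) for i in I for j in J) / (len(I) * len(J))

# e.g. group_average([2, 3, 4], [8, 9, 10, 11], lambda i, j: dmat[(i, j)])
# for some hypothetical table dmat of pairwise distances.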
Now, using the weighted group average method, the
average distance between clusters D and E can be calculated
as

D_{E,D} = (1/4)(1/4) [ d_{2,8} + d_{2,9} + d_{2,10} + d_{2,11}
                     + d_{3,8} + d_{3,9} + d_{3,10} + d_{3,11} ]
        + (1/2)(1/4) [ d_{4,8} + d_{4,9} + d_{4,10} + d_{4,11} ]
        = 3.23

The linking distances are somewhat different for this method.
Instead of using the number of objects in the calculation, the
number of underlying clusters to which each object belongs is
utilized. The distances are multiplied by the ratios 1/2 and
1/4 because the objects 4 and 11 are already members of one
and two clusters, respectively. The same is true for all distances
involving object 4 [31]. The following dendrograms illustrate
the difference between these two techniques.


50
Unweighted
Dendrogram
Weighted
Dendrogram


51
CHAPTER VIII
CENTROID METHODS
There are two types of centroid methods: centroid and
median centroid. In both, the point in the middle of the cluster
represents the centroid, with the distances between the clusters
being calculated as the distances between the centroids. Since
there exists a potential for difficulty in the linking of two
clusters of significantly different sizes, a weighting of both
clusters is performed, similar to that in the weighted average
technique. The weighting gives the smaller cluster a larger
influence in the shift of the centroid point. If the number of
objects in each cluster is taken to be the same, then a median
centroid method can be utilized.
In the following, using the same measurement of distance as that used by the
group average methods (with coefficients 1/2, 1/2, -1/4), the
median method produces half the number of clusters [31].

Median Method Dendrogram


53
CHAPTER IX
WARD'S METHOD
In each step of this method the central point is calculated
for any possible combination of two clusters. Then the total sum
of squared distances from this point to all objects in a particular
hypothetical cluster is evaluated. The two clusters yielding the
smallest sum of squares are then taken to be the new cluster.
This distance is a statistically evaluated parameter, not a
geometrical distance. Therefore, this method is based on a
statistical minimization of cluster expansion [31].
Wishart [29] has demonstrated how this method can be
implemented through the updating of a stored matrix of squared
cluster centroids. This is demonstrated as follows [1]:
x_{ijk} = score on the i-th of n variables for the j-th of m_k
data units in the k-th of h clusters.

\bar{x}_{ik} = \sum_j x_{ijk} / m_k = mean on the i-th variable for data
units in the k-th cluster.

E_k = \sum_i \sum_j (x_{ijk} - \bar{x}_{ik})^2
    = \sum_i \sum_j x_{ijk}^2 - m_k \sum_i \bar{x}_{ik}^2
    = error sum of squares for cluster k; the sum of squared
Euclidean distances from each data point in
cluster k to the mean vector of cluster k;
the within-group squared deviations about the mean
for cluster k.

E = \sum_k E_k
  = total within-group error sum of squares for
the collection of clusters.
Initially, there are m data units, each constituting its own cluster. Thus, the membership
and the mean of each cluster coincide, so that E_k = 0 for all
clusters.
The objective of this method is to find two clusters at each
stage whose merging gives the minimum increase in the total
within group error sum of squares, E .
Suppose clusters u and v are chosen to be merged and
the resulting cluster is denoted as k. Then the increase in E is

ΔE = \sum_i [ m_u \bar{x}_{iu}^2 + m_v \bar{x}_{iv}^2 - m_k \bar{x}_{ik}^2 ].    (1)

The mean on the i-th variable for the new cluster is found from

m_k \bar{x}_{ik} = m_u \bar{x}_{iu} + m_v \bar{x}_{iv}.

Squaring both sides of the equation yields

m_k^2 \bar{x}_{ik}^2 = m_u^2 \bar{x}_{iu}^2 + m_v^2 \bar{x}_{iv}^2 + 2 m_u m_v \bar{x}_{iu} \bar{x}_{iv}.

The product of the means can be rewritten as

2 \bar{x}_{iu} \bar{x}_{iv} = \bar{x}_{iu}^2 + \bar{x}_{iv}^2 - (\bar{x}_{iu} - \bar{x}_{iv})^2,

such that

m_k^2 \bar{x}_{ik}^2 = m_u (m_u + m_v) \bar{x}_{iu}^2 + m_v (m_u + m_v) \bar{x}_{iv}^2 - m_u m_v (\bar{x}_{iu} - \bar{x}_{iv})^2.

Noting that m_k = m_u + m_v, and dividing both sides of the previous
equation by m_k^2,

\bar{x}_{ik}^2 = \frac{m_u}{m_k} \bar{x}_{iu}^2 + \frac{m_v}{m_k} \bar{x}_{iv}^2 - \frac{m_u m_v}{m_k^2} (\bar{x}_{iu} - \bar{x}_{iv})^2.    (2)

Now, substituting (2) into (1) yields

ΔE = \frac{m_u m_v}{m_u + m_v} \sum_i (\bar{x}_{iu} - \bar{x}_{iv})^2.
Therefore, the minimum increase in the error sum of squares is
proportional to the squared Euclidean distance between the


centroids of the merged clusters [1].
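Since the increase reduces to the expression derived above, a Ward merge step needs only the cluster sizes and centroids; a sketch:

def ward_increase(size_u, mean_u, size_v, mean_v):
    # Delta E = m_u m_v / (m_u + m_v) times the squared Euclidean distance
    # between the two centroids.
    sq = sum((a - b) ** 2 for a, b in zip(mean_u, mean_v))
    return size_u * size_v / (size_u + size_v) * sq

def best_ward_merge(clusters):
    # clusters: list of (size, mean vector) pairs; returns the indices of the
    # pair whose merging gives the minimum increase in E.
    pairs = [(u, v) for u in range(len(clusters)) for v in range(u + 1, len(clusters))]
    return min(pairs, key=lambda p: ward_increase(*clusters[p[0]], *clusters[p[1]]))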


57
CHAPTER X
CONCLUSIONS
This paper has discussed only a small number of the
clustering methods currently available. In total, there are so
many different methods that a user is sometimes at a loss to
determine which method is the most appropriate given a
particular setting. Hubert [14], Rand [23], Fisher and Van Ness [9],
and Dubes and Jain [7] have studied the relative merits of
various techniques. There seems to be little agreement on the
superiority of one method over the others for a specific kind of
data set. In fact, in the opinion of Dubes and Jain [7], looking
for an optimum clustering method is contrary to the nature of
the problem. They maintain that, with the exception of
choosing the type of output, no guidelines exist for helping users
in their selection of a clustering method. The opposite view is
held by several authors, who maintain that, for specific data sets, methods
that do a good clustering job can be identified.
Kuiper and Fisher [18], Rand [23], and Hubert [14] evaluate
the intrinsic abilities of various clustering methods in terms of
extent of retrieval or misclassification where the classification


58
structure is already known. In all cases, for compact clusters of
nearly equal sample size, the complete linkage method and
Ward's method appear to be the most robust against the
dimension, number of clusters, and separation of clusters.
Since clustering methods have the "annoying habit" [8,26]
of discovering clusters on random data, Jain et al. [15] have
reversed the approach mentioned above by generating random
observations and then applying clustering methods. They
compare the negative performance in terms of falsely imposing
a structure on such random observations. In other words, a
method is bad if it discovers a "good structure" on random data.
Interestingly, the results equal those obtained in the known
structure tests (Jain et al. [15] do not include Ward's method in
the comparison); once again the complete linkage method is
superior. The single linkage method is poorer but improves as
the total number of clusters increases.
As VLSI technology has advanced, it has become more
practical (in terms of computer time) to place large clustering
applications (those involving large data sets) on parallel
computers. Much research has been done in recent years to
develop efficient parallel graph algorithms that may prove to be
useful in clustering methods [22]. One such algorithm, the
minimal spanning tree (MST), has proven useful in the single
linkage method. This algorithm has been restructured in parallel


59
for use on various machines. Bentley [4] has designed a version
of the Prim-Dijkstra MST for the tree-structured systolic array.
Recalling that Prim's algorithm keeps track of the nearest
non-fragment neighbor, the addition of Dijkstra's algorithm
changes the strategy by keeping track of the nearest fragment
neighbor. In terms of serial computational speed, Prim's
algorithm is O(n^3) while the Prim-Dijkstra algorithm is O(n^2).
Bentley's [4] algorithm assigns log n vertices to every
computation node of the tree-structured systolic array.
Therefore, the entire graph can be accommodated with n/log n
computation nodes. When a vertex (object for clustering
purposes) is added to the MST, the name of the vertex is
broadcast through the tree to the computation nodes in O(log n)
time. The computation nodes update the status of their assigned
vertices. Of the nontree vertices that remain, each computation
node issues the name of the vertex closest to the tree, the tree
vertex to which it is closest, and the distance between them.
This process takes O(log n) time. For n - 1 iterations, the
combining processors then choose the candidate with the smallest
distance, and determine the candidate nontree vertex closest to
the tree after O(log n) units of time. When this is complete the
minimal spanning tree has been determined. This algorithm has
complexity O(n log n) on a tree machine with n/log n
processors [4].


60
Finally, fuzzy clustering methods have entered the field in
the past decade due to the fuzzy nature of practical problems.
In general, the major difference between the traditional "hard"
clustering and "fuzzy" clustering is that an object belongs to only
one cluster in the "hard" clustering, while in "fuzzy" clustering
objects are allowed to belong to all clusters with different
degrees of belongingness. In other words, every object has a
membership function that contains all the degrees of membership
of this object to different clusters, thus allowing the membership
to be distributed [25].
Fuzzy clustering methods, as well as the restructuring of
some of the serial methods to accommodate parallelism, have
added new dimensions to cluster analysis in recent years.
Undoubtedly, future research in these and other areas will
change the complexion of the methods most frequently used in
cluster analysis today.


61
BIBLIOGRAPHY
1. Anderberg, Michael R. Cluster Analysis for
Applications. Academic Press, London, (1973).
2. Baker, Frank B., Hubert, Lawrence J. "A Graph-theoretic
Approach to Goodness-of-Fit in Complete-link Hierarchical
Clustering." Journal of the Amer. Stat. Assoc., 71, 356,
870-878, (1976).
3. Bayne, Charles K., Beauchamp, John J., Begovich, Connie L.,
Kane, Victor E. "Monte Carlo Comparisons of Selected
Clustering Procedures." Pattern Recognition, 12, 51-62,
(1980).
4. Bentley, Jon Louis. "A Parallel Algorithm for Constructing
Minimum Spanning Trees." Journal of Algorithms, 1, 51-59,
(1980).
5. Cattel, R. B. "Factor Analysis: An Introduction to
Essentials II. The Role of Factor Analysis in Research."
Biometrics, 21, 2, 405-435, (1965).
6. Chen, C. H. Statistical Pattern Recognition. Hayden
Book Co., Rochelle Park, N.J., (1973).
7. Dubes, Richard., Jain, Anil K. "Clustering Techniques:
The User's Dilemma." Pattern Recognition, 8, 247-260,
(1976).
8. Dubes, Richard., Jain, Anil K. "Validity Studies in
Clustering Methodologies." Pattern Recognition, 11, 235-254,
(1979).
9. Fisher, Lloyd., Van Ness, John W. "Admissible Clustering
Procedures." Biometrika, 58, 1, 91-104, (1971).
10. Gower, J. C., Ross, G. J. S. "Minimum Spanning Trees
and Single Linkage Cluster Analysis." Appl. Statistics,
18, 1, 54-64, (1969).
11. Harary, F. Graph Theory. Addison-Wesley Co., Reading,


62
Mass., (1969).
12. Hartigan, John A. Clustering Algorithms. John Wiley &
Sons, New York, (1975).
13. Hochberg, J. E. Perception. Prentice-Hall, Englewood
Cliffs, N.J., (1964).
14. Hubert, Lawrence. "Approximate Evaluation Techniques
for the Single-link and Complete-link Hierarchical
Clustering Procedures." Journal of the Amer. Stat. Assoc.,
69, 347, 698-704, (1974).
15. Jain, Naresh C., Indrayan, Abhaya., Goel, Lajpat R. "Monte
Carlo Comparison of Six Hierarchical Methods on Random
Data." Pattern Recognition, 19, 1, 95-99, (1986).
16. Johnson, Stephen C. "Hierarchical Clustering Schemes."
Psychometrika, 32, 3, 241-254, (1967).
17. Kruskal, J. B. "On the Shortest Spanning Subtree of a
Graph and the Traveling Salesman Problem." Proc.
Amer. Math. Soc., 7, 48-50, (1956).
18. Kuiper, Kent F., Fisher, Lloyd. "A Monte Carlo Comparison
of Six Clustering Procedures." Biometrics, 31, 777-783,
(1975).
19. Lumelsky, Vladimir J. "A Combined Algorithm for Weighting
the Variables and Clustering in the Clustering Problem."
Pattern Recognition, 15, 53-60, (1982).
20. Matula, D. W., Marble, G., Isaacson, J. D. "Graph Coloring
Algorithms." Graph Theory and Computing. Academic
Press, (1972).
21. Prim, R. C. "Shortest Connection Networks and Some
Generalizations." Bell Sys. Tech. Journal, 1389-1401,
(Nov 1957).
22. Quinn, Michael J., Deo, Narsingh. "Parallel Graph
Algorithms." Computing Surveys, 16, 3, 319-348, (1984).
23. Rand, William M. "Objective Criteria for the Evaluation
of Clustering Methods." Journal of the Amer. Stat. Assoc.,
66, 336, 846-850, (1971).


63
24. Sebestyen, G. S. Decision-Making Processes in Pattern
Recognition. Macmillan Co., New York, (1962).
25. Selim, Shokri Z., Ismail, M. A. "Soft Clustering of
Multidimensional Data: A Semi-fuzzy Approach." Pattern
Recognition, 17, 5, 559-568, (1984).
26. Smith, Stephen P., Dubes, Richard. "Stability of a
Hierarchical Clustering." Pattern Recognition, 12, 177-187,
(1980).
27. Sneath, P. H. "Computers in Taxonomy." Journal Gen.
Microbiol., 17, 201-226, (1957).
28. Srivastava, M. S., Carter, E. M. An Introduction to
Applied Multivariate Statistics. Elsevier Science
Publishing Co., New York, (1983).
29. Wishart, David. "An Algorithm for Hierarchical
Classification." Biometrics, 165-170, (1969).
30. Zahn, Charles T. "Graph-theoretical Methods for Detecting
and Describing Gestalt Clusters." IEEE Trans. on
Computers, C-20, 68-86, (1971).
31. Zupan, Jure. Clustering of Large Data Sets. John
Wiley & Sons, Great Britain, (1982).