CLUDAMI: A MODEL FOR MEASURING DATA MIGRATION BETWEEN
CLUSTERS IN DIFFERENT DATA STEPS
by
Todd Arthur Gibson
B.A., California State University at Chico, 1989
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science and Engineering
2004
This thesis for the Master of Science
degree by
Todd Arthur Gibson
has been approved
John Clark
Date
Karen Newell
Gibson, Todd Arthur (M.S., Computer Science)
CLUDAMI: A Model for Measuring Data Migration Between Clusters in Different
Data Steps
Thesis directed by Professor Krzysztof Cios
ABSTRACT
Independently clustering each step in time series data shows changes in
cluster patterns between time steps. Similarly, clustering each experiment in
dependently in experimentalseries data experiments exposes changes in cluster
patterns between experiments. Although cluster membership can easily be iden
tified and compared between steps in a time series or a experimentseries, under
standing the path the data has taken from one step to the next is more elusive.
CLUDAMI (CLUstered DAta Migration) is proposed as a model to quantify
data migration between independentlyclustered sets of data. The model is
evaluated on Flow Cytometry data used in a cancer study.
This abstract accurately represents the content of the candidates thesis. I
recommend its publication.
Signed
m
ACKNOWLEDGMENTS
I owe much gratitude to Dr. Krys Cios who has provided me with encouragement,
opportunities, and challenges for the last three years. Because of his guidance I
am now looking forward to a future I would never have envisioned three years ago.
I would also like to thank Dr. Karen Newell and Lisa VillalobosMenuey. Access
to both their expertise and their lab has been invaluable to the completion of this
thesis.
CONTENTS
Figures ............................................................. vii
Tables.............................................................. xi
Chapter
1. Introduction....................................................... 1
1.1 Cluster Analysis .................................................. 2
1.2 Subspace Clustering and Biclustering............................... 4
1.3 Clustering TimeSeries Data........................................ 6
2. Materials and Methods.............................................. 8
2.1 Biclustering....................................................... 8
2.1.1 Problem Definition.............................................. 8
2.1.2 Problem Complexity............................................. 10
2.1.3 Algorithmic Solutions ......................................... 12
2.2 FLOC.............................................................. 16
2.2.1 FLOC Complexity................................................ 17
2.3 The CLUDAMI Model................................................. 19
2.3.1 Semantics Underlying CLUDAMI................................... 21
2.4 Flow Cytometer Data .............................................. 22
2.5 The CLUDAMI Implementation........................................ 24
3. Experimental Preliminaries........................................ 26
3.1 Terms and Conventions .......................................... 26
v
3.2 Scatterplot Overview............................................... 26
3.3 Relevant Parameters................................................ 27
3.4 Migration Pattern Selection........................................ 28
3.4.1 Bicluster Discrimination......................................... 30
4. Experimental Results................................................. 34
4.1 Data Migration Features........................................... 34
4.1.1 Subcluster Localization ......................................... 35
4.2 Migration Reflections and Rank Order Mismatch ..................... 37
4.3 Varying k and m.................................................... 38
4.4 Comparing Similar Source Migrations................................ 41
4.5 Biological Interpretation.......................................... 42
5. Discussion........................................................... 45
5.1 Model Flexibility................................................. 45
5.2 Identifying Invalid Migrations..................................... 45
5.3 Effect of k on Migration Paths..................................... 46
5.4 Future Work ....................................................... 46
Appendix
A. Proof of the Minimum Score Matrix................................... 50
B. The Befuddling Proof of Cheng and Church ........................... 54
References.............................................................. 56
vi
FIGURES
Figure
1.1 The results of hierarchical clustering are typically visualized using
a dendrogram. Each branch represents a separation into additional
clusters. The length of each arm represents the distance (degree of
similarity) between clusters. The final number of clusters can be
selected by choosing a height across which to make a horizontal cut. 3
1.2 The Kmeans algorithm................................................. 4
2.1 The Naive greedy algorithm illustrates the computational complexity
in a nonheuristic approach to biclustering........................... 12
2.2 The complete biclustering algorithm which heuristically reduces the
residue score of the bicluster........................................ 15
2.3 The basic FLOC Algorithm. The algorithm shows ordered progres
sion through rows and columnsnot the enhanced weighted ordering. 18
2.4 A Venn diagram illustrates the simplicity underlying CLUDAMI.
Step j identifies the source of migration. (Step (j + 1) is the des
tination of migration.) Cluster i is some cluster in step j. Cluster
k is some cluster in step (j + 1). Bicluster By is some bicluster 1
computed between steps j and (j + 1). The intersections identify the
source and destination of the migrating data. Other combinations of
intersecting clusters and biclusters similarly inform additional mi
gratory patterns....................................................... 20
Vll
2.5 The correlation between Side Scatter (SS), Forward Scatter (FS),
and Fluorescence (FL). The sample mean is subtracted from each
instance value and then is divided by the standard deviation. The
Coefficient measures how well the fits a straight line.................... 25
3.1 A sample scatterplot typical of that generated for a flow cytometer
sample. The Xaxis is the FS and the Yaxis is the SS. Shown are
the approximate locations of live and dead cells.................... 27
3.2 Graph (a) shows the total cluster mismatches for all four biclusters
when intersections are ranked according to the number of instances
they contain. Graph (b) indicates the highest rank to be mismatched. 31
4.1 Plot (a) shows a sample which has not been exposed to radon. Plot (b)
shows a sample exposed to high levels of radon...................... 34
4.2 Four of the 8 valid migrations identified with k = 18, m = 4. Plots
(a) and (b) show the up and right movement of the topmost cloud.
Plots (c) and (d) show the down and right movement of cells due to
radon exposure. Closeups of plots (a), (b), and (c) can be found in
Figure 43............................................................. 36
4.3 Closeups of the centroid arrows for plots (a), (b), and (c) from Fig
ure 42................................................................ 37
4.4 An example of a misleading migration arrow. The arrow starts at the
source cluster centroid and ends at the destination cluster centroid.
Cluster/bicluster intersections include only a portion of the data in
the cluster. The figure shows that the actual migration is to the left
of the cluster centroid arrow.......................................... 38
vm
I
4.5 Plots (a) and (b) are invalid migrationsthey are reflections of one
another. These two plots also have mismatched ranks between their
source and destination clusters. Plot (a) ranks are 0/2 and plot (b)
ranks are 2/0. By contrast, plots (c) and (d) are valid migrations
they are not reflections of each other (nor of any other plot) and have
ranks of 0/0 and 2/2 respectively................................... 39
4.6 Migration (a) was discovered at k = 12, m = 2. The data migrations
for k = 18, 2 < m < 3 did not reveal the migration shown in (a). At
k = 18, m = 4 migrations (b) and (c) appeared as submigrations of
(a)................................................................. 40
4.7 Two samples were selected from the same pool of redundant DCFDA
no radon samples. Both plots (a) and (b) are the starting steps for
migrations from no radon to high radon. Plot (a) is the same plot
used previously. Plot (b) shows a much more compressed silhouette. 42
4.8 Plot (a) shows a migration from the no radon to high radon sample
generated previously. Plot (b) shows the equivalent migration for the
anomalous sample. CLUDAMI accounts for the compressed shape
of the anomalous sample by indicating a further distance that this
data must migrate. Results are similar for other migrations compared
between these two samples........................................... 43
IX
4.9 Plot (a) shows all valid, cluster migrations for no radon to high radon
DCFDA. There are a total of seven arrowheads (paths) present on
the plot. Plot (b) shows the increased DCFDA level present in the
longest cell migration. The Xaxis of the histogram is the DCFDA
level. The Yaxis of the histogram is a probability of cells in the
cluster (containing the DCFDA level indicated on the Xaxis). . 44
x
TABLES
Table
2.1 A table of the samples selected for examination by the CLUDAMI
model. Their organization into two experimental series is shown in
the last column....................................................... 23
3.1 The data migration model defined in Equation 2.12 identifies any
cluster which intersects with a bicluster as a migratory pattern. The
table shows for one bicluster, the number of data instances shared
with each of k = 18 clusters. The wide breadth of coverage of the
bicluster indicates the need to reduce the list of clusters under consid
eration if a reasonable number of migratory patterns is to be identified. 29
3.2 The biclustering algorithm was run on a single duplicated sample:
G0025954 Shown are the ordered rankings of k = 18 clusters based
on the number of cells which intersect between a bicluster and a clus
ter. An asterisk indicates where the ordering is different between the
two identical samples................................................. 32
1. Introduction
A data type that has recently received increased attention in the area of
data mining is series data, that is, data that is organized into either steps of
time or steps in an experimental process. Time series data can be defined as
a constant stream of information (e.g. a radio signal) or may be sampled at
discrete intervals (e.g. monthly marketing data). In an experimental series,
data is sampled at each stage of an experiment that subjects the data source
to various stimuli.
One of the objectives in collecting and analyzing series data is to under
stand the impact that the series has on the data source. Consider an experi
mental series where the effect of different pharmaceuticals on bacteria. Sup
pose the analyst wants to determine the effect various drugs have on bacte
ria. Do all bacteria die? Do some die? Do they grow? A comparative analy
sis between different steps in the series may provide insight into the changes
that have taken place in the data. Many techniques exist for evaluating the
changes that occur, and while standard analysis techniques are quite infor
mative at evaluating the characteristics of the data at each step, they are less
informative at uncovering the path the data takes while progressing from one
step to the next.
To visualize this, consider data with n features. For a data point or in
stance in a ndimensional space, the path each instance takes from one step
in a series to the next can be represented as a vector from the datas position
1
in the initial step of the series to the datas position in the following step. Dis
covering the path from one step to the next potentially offers rich information
on how data features interact as the data progresses through the series.
This work presents CLUstered DAta Migration (CLUDAMI), a novel
model of series data (Gibson, 2004). Clusters are identified in a step of a se
ries, and then data within each cluster is tracked as it migrates, separates, and
combines through feature space into the clusters at the next step of the series.
Prior to a embarking on a detailed discussion of the methods used to build
(and test) the model we provide a short survey of the clustering work upon
which the model is built.
1.1 Cluster Analysis
Cluster analysis is the process of grouping data together based on shared
similarities within each group and dissimilarities between groups. In this con
text, data can be defined as a set of individuals or instances, described by n
features. Individuals grouped together into a single cluster have features meet
ing some similarity criteria.
Broadly speaking, there are two categories of cluster analysis: hierarchi
cal organization of clusters and partitioning (Aldenderfer and Blashfield, 1984;
Cios et ah, 1998). Hierarchical clustering can be further divided into agglom
erative and divisive algorithms (Gordon, 1996). Agglomerative hierarchical
clustering begins with each individual in a separate cluster. The two most
similar clusters are then iteratively joined until all individuals are contained
in a single cluster. Divisive hierarchical clustering works in the opposite di
rection; begin with a single cluster and iteratively divide until no individuals
2
share a cluster. A dendrogram is commonly used to visualize the results of
hierarchical clustering (see Figure 1.1).
O
c
cd
2 3 4 5
Individuals
Figure 1.1: The results of hierarchical clustering are typically visualized
using a dendrogram. Each branch represents a separation into additional clus
ters. The length of each arm represents the distance (degree of similarity)
between clusters. The final number of clusters can be selected by choosing a
height across which to make a horizontal cut.
While hierarchical cluster analysis build clusters from either the bottom
up or the topdown, partitioning algorithms begin with data partitioned into
a predetermined number of clusters. The clusters are then iteratively refined
by moving individuals between clusters in order to optimize an objective func
tion (e.g. one that maximizes similarity between cluster members). One of the
most popular partitionbased clustering methods is Kmeans. Figure 1.2 illus
trates the Kmeans algorithm.
Despite the many and varied approaches to clustering, they all have
adopted the same heuristic: clustering occurs across all features. That is,
given a set of data instances, the similarity between them is calculated using
all features of each individual under consideration. In order to discriminate
between features (i.e. cluster similar features) all of the data instances must
3
Input: Set TV of data points, k number of desired clusters
Output: TV, with each member assigned to one of k clusters
Initialization: Randomly assign each n 6 TV to one of k clusters
repeat
Calculate the center (centroid) of each cluster
foreach data point do
c cluster with centroid closest to data point
Reassign i to cluster c if i ^ c
until No data points are reassigned
Return TV with cluster assignments Vn TV
Figure 1.2: The Kmeans algorithm
be evaluated. This is a limitation that prevents more subtle similarities in the
data from being uncovered. For example, a marketing database tracks pur
chasing trends of consumers by recording 10 features of each customer (age,
gender, income, profession, purchases, etc.) Certain trends shared by only a
(relatively) few customers which involve similarities involving a fraction of the
10 features would likely be missed by clustering algorithms which inherently
calculate similarity across all 10 features. Feature reduction methods used
prior to clustering such as Principle Component Analysis also offer no help
as they favor dominant discriminating features among all features rather than
subtle similarities among a feature subset.
Two additional families of clustering algorithms have developed indepen
dently to directly address the limitations found in the traditional clustering
algorithms: Subspace Clustering and Biclustering
1.2 Subspace Clustering and Biclustering
Subspace clustering was born out of a need to identify clusters in data
containing a large number of features, using a method that was both usable
4
and scalable to extremely large data sets (Agrawal et al., 1998). Subspace
clustering methods use either a bottomup or a topdown method for search
ing the cluster space (Parsons et al., 2004). In bottomup methods, a his
togram is created for each feature in order to identify features that contain
dense regions above some threshold. By considering only features containing
such regions, the number of features is significantly reduced. An iterative ap
proach is used to grow the dense regions into additional dimensions forming
clusters.
The topdown methods of subspace clustering seed initial clusters by us
ing traditional clustering methods across all features. Features are then as
signed weights for each cluster and the process repeats. Iteratively applying
weights and reclustering results in some features disappearing in some clusters
and being dominant in others. As with the bottomup methods, the result are
clusters which have grouped data instances across subsets of features.
Unlike the large data problem that spurred the development of sub
space clustering algorithms, biclustering algorithms developed out of a need to
uncover subtle biological patterns effectively hidden beneath the more promi
nent patterns identified by traditional clustering algorithms. Biclustering
addresses the problem of sensitivity rather than of usability and scalability.
Early biclustering algorithms organized features and instances into a matrix
and calculated residue scores as a measure of degradation of cluster coher
ence (Cheng and Church, 2000; Yang et al., 2002). Other methods include
converting the features and instances into a bipartite graph and looking for
heavy subgraphs (Tanay et al., 2002) and iteratively ordering data instances
5
across different subsets of features (Lazzeroni and Owen, 2002).
A distinguishing characteristic between subspace clustering algorithms
and biclustering algorithms is in the approach used to reduce the number of
features (and data instances) in the clusters. Subspace clustering algorithms
start with techniques for identifying features that can be removed from con
sideration in clustering. By contrast, biclustering algorithms simultaneously
match subsets of instances with subsets of features in the search for similarity.
Biclustering can also be heavily influenced by notions of similarity inspired by
biology. For example, in gene expression arrays a biologically significant pat
tern in the genes (data instances) includes shared variance across features. To
capture this, BenDor et al. (2002) apply a sorted rank to the features of each
data instance and group the instances together which share similar rank order
ings across subsets of features.
1.3 Clustering TimeSeries Data
There is a wide body of literature on the analysis of time series data
both streaming (e.g., a sound wave) and discrete. However, the general prob
lem addressed in the analysis of temporal data is the discovery of recurring
motifs and trends rather than tracking the migration of data instances across
separate time steps. One notable exception is analysis of ecommerce transac
tion data. Customers were clustered by purchasing trends and their migration
from one trend to another over time. (Gupta and Ghosh, 2001). However, the
actual migration path traveled by the customer is trivially determined as each
customer (the data instance) is uniquely identifiable across all time steps.
6
Perlman and Java (2003) cluster astronomical time series data in or
der to predict astronomical events. It typifies another type of time series
analysis which is the prediction of future events given patterns found in past
events (Gavrilov et al., 2000).
Finally, Alon, Sclaroff, and Kollios (2003) study data patterns present in
each step of time series data. In this study, the frames of a video are grouped
into clusters of common motion. That is, video frames sharing a common mo
tion are grouped into a single cluster.
It is worth emphasizing what distinguishes this study from previous
work. We seek to identify clusters in each step of a series, and then measure
how the data within each cluster migrate, separate, and combine into different
clusters as they progress to the next step of the series.
The remainder of this study focuses on the definition, implementation,
and evaluation of the CLUDAMI model. Section 2 details the algorithms that
inspire the model and those that are actually implemented in the model. A
formal definition of the model is also presented. Section 3 presents the exper
iments conducted to evaluate the model and Section 5 discusses significance
of the results. The section concludes with a look at future work that could be
performed to enhance the model.
7
2. Materials and Methods
As mentioned in the previous section, the CLUDAMI model incorporates
both traditional clustering and biclustering to achieve its aims. In particu
lar, biclustering is foundational to the definition of the model. Therefore the
biclustering algorithm that inspired the model is reviewed in detail in this sec
tion followed by a summary of a modified biclustering algorithm which is used
in CLUDAMI. The model itself is then presented, followed by a description of
data used and tools written to evaluate the model.
2.1 Biclustering
Cheng and Church (2000) introduced biclustering as a means of finding
clusters from DNA microarray data by simultaneously clustering subsets of
genes and subsets of samples. Although the authors note similar efforts in
other domains with terms such as box clustering and direct clustering, theirs
is the first application of the technique to gene expression data (Tanay et al.,
2002; Cheng and Church, 2000).
2.1.1 Problem Definition
The following definitions define the gene expression data:
Let A be the gene expression matrix.
Let M be the set of genes and N be the set of samples.
Let /CM be any subset of genes, J C N be any subset of samples, and
Au be any submatrix of A.
8
Let an element of the matrix be represented by where i E I and
j G J. For each submatrix Au, there is a score H(I, J) representing the in
verse similarity of the elements. That is, as the H(I,j) score increases, the
similarity decreases. A bicluster is a submatrix with maximal area with a
score
H(I, J) <5,5 >= 0 One technique for choosing 5 is to run a (hierarchical)
clustering algorithm on the data to generate a set of C clusters, and set
5 = min{H(I,J),C} (Shamir, 2002). A bicluster with H(I,J) <= 5 is re
ferred to as a dbicluster.
H(I,J) is defined in terms of residue. In general, residue is the difference
between predicted values and observed values (Wackerly et ah, 2002). In order
to compute the residue, several statistical means are first defined:
The mean of row i for all columns J is:
The mean of column j for all rows / is:
aii = TJi ^ a0
' il
The mean of the entire matrix A is:
(2.1)
(2.2)
(2.3)
Incidentally, note that the mean of the rowmeansums equals the mean
of the columnmeansums which in turn equals the matrix mean:
r,au
il
1
Ml
(2.4)
9
The residue for a single matrix element is computed by subtracting the
row and column means of the element from the element value and adding back
in the matrix mean:
A(i, j~) Q*ij &iJ aIj T &IJ
(2.5)
From R(i,j), H(I, J) is defined as a variance of the residue:
(2.6)
There are three conditions under which H(I, J) could score zero. The
first is if Ajj is a uniform matrix (a*, = c Vi G I, j G J where c is a constant).
The second is where = i + j\/iEl,jJ. The third is when / = J = 1.
The authors suggest using the row variance as a measure avoid misclassifying
one of these as a valid bicluster:
2.1.2 Problem Complexity
Although the complexity of finding a submatrix with maximal area is
unknown, the complexity of finding a maximal submatrix Ajj where / =  J\
is an NPhard problem as shown below:
Lemma 1. A K x K matrix of a single 1 all Os elsewhere has the score:
(2.7)
hK = fAK ~ 1)2[(K if + 2(K 1) + 1)
10
The original lemma which appeared in Cheng and Church (2000) was
offered without proof and was incorrectly stated as:
hK = (K 1 ){(K l)3 + 2(K l)2 2]
We have found a proof which resulted in the correct Lemma shown
above. The proof can be found in its entirety in Appendix A.
Proposition 1. Any submatrix (2ij) larger than a single element will have a
score H(I, J) no lower that .25.
The proof accompanying the following theorem includes description and
motivation not found in Cheng and Church (2000), but is otherwise presented
intact. For a discussion of this proof and presentation of an alternative proof,
refer to Appendix B.
Theorem 1. Finding the largest square 5bicluster is NPhard.
Proof. This proof is a reduction from the NPcomplete problem balanced com
plete bipartite subgraph (Garey and Johnson, 1979). Given a positive integer
m, this problem asks whether a complete subgraph Kmm exists in a bipartite
graph G = (U,V,E). From G, form \U\ x F matrix A with i E U, j G V:
If the largest square Jbicluster in A with 5 = 1 /hK has a size > m, there
exists a Kmm. Therefore finding the largest square Jbicluster is NPhard.
0 : (i, j)
2 ij : otherwise
11
2.1.3 Algorithmic Solutions
Because finding 6biclusters is NPhard, heuristic algorithms are used for
identifying biclusters. Figure 2.1 illustrates a naive approach to biclustering.
It is not used in practice due to its complexity, but it serves as a basis for un
derstanding the biclustering process.
Input: \M\ x \N\ Matrix A, 5 > 0
Output: (ibicluster Ajj
Initialization: Au = AMN, Calculate H(I,J)
while H(I, J) > 8 do
foreach row AMN, column Amn do
Calculate H(I, J) for addition to (or deletion from) Ajj
end
Add or delete row or column with greatest reduction in H(I, J)
end
return Ajj
Figure 2.1: The Naive greedy algorithm illustrates the computational com
plexity in a nonheuristic approach to biclustering.
Given m = \M\,n = iV, for each iteration m + m potential row/column
actions must be scored. Each score requires m x n calculations. Therefore a
single iteration requires mn(m + n). If 5 = 0, then m + n iterations would be
required, resulting in a complexity of 0(mn(m + n)2).
Though the runtime is polynomial and not exponential, it is still too
complex to be of practical use. For example, in the domain for which it was
developed (gene expression matrices), thousands of rows and tens to hundreds
of columns must be biclustered.
Rather than computing H(I, J) repeatedly for each potential row or col
umn, an alternative measure is used to identify columns that can be removed
12
(returning a column or row to the bicluster matrix isnt considered). The al
ternative measure is the average squared residue. Specifically, the row or col
umn with the largest average squared residue is removed. The maximum row
and column squared residue averages are defined as:
The column or row corresponding to max(f/r, Vc) > H(I, J) is removed
unless neither is greater than the matrix score H(I, J). This algorithm is re
ferred to as the Single Node Deletion algorithm (Cheng and Church, 2000).
The algorithm is now reduced to doing a maximum of (m + n) iterations, each
of which requires recalculating the H(I, J) which is 0(mn). Therefore, the
overall complexity has now been reduced to 0(mn(m + n)).
The Single Node Deletion algorithm is further accelerated by noting
that many of the max(Vrr, Vc) maximum averages are significantly larger than
H(I, J) and can be removed en masse without recalculating H(I, J) after
each removal. This algorithm is referred to as the Multiple Node Deletion al
gorithm. Formally, the set of rows and columns that can be removed en masse
is defined as:
(2.8)
Vc = i
(2.9)
(2.10)
13
(2.11)
Rc ^ Ji ^ ^ ^i&ij Q>ij 0,1 j T Ojj) > aH(I,
The a value (a > l)1 ensures that only values significantly higher than
H(I, J) are removed. The Multiple Node Deletion algorithm does perform a
single recalculation of H(I, J) between the removal of Rr and before the re
moval of Rc. Because there are no iterate recalculations of H(I, J), the run
time of the Multiple Node Deletion algorithm is 0(mn + m + n).
As a balance between speed and quality, both the Multiple Node and
Single Node deletion algorithms are used. The fastest and less accurate Mul
tiple Node Deletion algorithm is used first to pare the matrix down to a man
ageable size. Then the slower, more accurate Single Node Deletion algorithm
is used to further reduce the matrix to a bicluster.
There are two issues remaining in the identified bicluster. The first is
that the node addition portion of Figure 2.1 was abandoned in the node dele
tion algorithms. In order to compensate, a single iteration is done on the
deleted rows and columns to insert back into Ajj any rows or columns with
average squared residues that are less than H(I, J). The second issue is that
the mean squared residue scores do not allow for inversely correlated genes. A
final iteration reinserts rows (genes) into A\j with a negative mean squared
residue that is less than H(I, J). The complete algorithm is shown in Fig
ure 2.2.
One limitation of the biclustering algorithm is that it finds only a single
bicluster. Because the algorithm is deterministic, finding additional biclus
la = 1.2 was selected by the authors.
14
Input: \M\ x \N\ Matrix A, 6 > 0, a > 1
Output: (ibicluster Ajj
Initialization: Au = AMn, Calculate H(I,J)
/* Fast, less accurate multiple node deletion */
repeat
/* Remove all rows in the set */
Rr ^ jjj J2jeAav "F &u) > aH(I, .7)^
Recalculate H(I, J)
Remove all columns in the set
Rc ^7 ^ ^' 7f /j /j) aH(I,
Recalculate //(/, J)
until No rows or columns have been removed
/* Slower, more accurate single node deletion */
repeat
/* Remove the row or column with max(V/r, Vc) */
Vr = maxj6/ J2jej(atj a!i7 a7j + a77)2)
Rc = maxjeJ ^7 ^i/(ay ait7 a7j + a7i7)2)
Recalculate //(/, J)
until Neither a row nor column has been removed
/* Reinflate bicluster with rows & columns < H(I,J) V
repeat
/* Add back all rows and columns in the set: */
^7?r U I, jjj Q'iJ h O/Jf) < 7/(/) *^) ^
Recalculate //(/, J)
^7?c U J, jyj /j h ^ H(I> J') ^
Recalculate H(I,J)
/* Add back in rows which inversely correlate */
j/?r U /; jij Y,jeA~an + _ ab + a/^)2 < H(R J)}
until Neither a row nor a column has been added
return Au
Figure 2.2: The complete biclustering algorithm which heuristically reduces
the residue score of the bicluster.
15
ters is problematic. The method used in Cheng and Church (2000) to identify
additional biclusters is to mask previously identified biclusters by replac
ing their values with uniformly distributed random numbers. They argue that
the expected values produced by the uniformly distributed random numbers
will be higher than the values calculated from the remaining expression data.
In other words, rows and columns with uniformly distributed random num
bers will be removed earlier, allowing any patterns in the remaining expression
data to be identified by the biclustering algorithm. To implement this, the al
gorithm in Figure 2.2 contains an outer loop which replaces identified bicluster
values with random numbers. The loop terminates once the desired number of
biclusters has been found.
2.2 FLOC
Biclustering research that followed that of Cheng and Church (2000) im
proved on limitations found in the biclustering algorithm. Yang et al. (2002)
noted the that the algorithm cannot handle missing attributes, suffers reduced
accuracy after the first iteration due to masking the bicluster with random
values, and effectively erases any overlapping biclusters due to the random
number replacement. To correct these limitations, Yang et al. (2002) proposed
the FLexible Overlaped Clustering (FLOC) algorithm as an alternative ap
proach to biclustering.
Like the original biclustering algorithm, FLOC minimizes bicluster
residue scores to optimize biclusters. Beyond that, the algorithm takes quite
a different approach. As before, the data is organized as a matrix with data
instances as rows and data features as columns. First, k initial clusters are
16
seeded by assigning each row and column to each bicluster with a particular
probability. Note that FLOC supports overlapping biclusters so a row or a
column may be assigned to more than one bicluster.
The overall average residue score of the system is reduced by iteratively
changing cluster membership of the rows and columns. At each iteration, each
row and column is temporarily added or removed from each bicluster and the
resulting change in residue in noted. The row or column is added or removed
from the bicluster that results in the greatest reduction in residue. After an
action has been performed for all rows and columns, a new overall average
score is calculated and compared with the previous iterations score. Iterations
continue until one fails to reduce the average residue, at which point the last,
nonreducing, iteration is rolled back and the biclustering assignments are re
turned.
Enhancements to the basic algorithm include random ordering of the
rows and columns and weighted row/column ordering based on the residue
improvement made by the row or column. The basic algorithm is outlined in
Figure 2.3.
2.2.1 FLOC Complexity
The full complexity analysis of FLOC can be found in Yang et al. (2002)
and is listed as O ((N + M) xNxMxkxp) where M and N are the num
ber of rows and columns, k is the number of clusters, and p is the number of
iterations required by the algorithm.
Though somewhat onerous, the average runtime performs much better.
First, the published complexity there are no overlaps between the biclusters
17
Input: Set N of data points, k number of biclusters (6X ,fc)
Output: N, with each member assigned to zero or more biclusters
Initialization: Assign each n E N to each k with probability p
Calculate average residue of system r
s < r
while s < r do
foreach row and column do
for i < l..k do
if row or column E bi then
Remove row or column from bi
Calculate and store bicluster residue
Return row or column to bi
end
if row or column ^ bi then
Add row or column to bi
Calculate and store bicluster residue
Remove row or column from bi
end
end
Remove/add row/column with greatest residue decrease
end
r < s
s * average residue of system
end
Rollback bicluster assignments to previous iteration
return N with bicluster assignments Vn E N
Figure 2.3: The basic FLOC Algorithm. The algorithm shows ordered pro
gression through rows and columnsnot the enhanced weighted ordering.
18
which is rarely the case in practice. Second, many of the iterations embedded
within in the algorithm can be run independently and are therefore amenable
to segmentation and parallelization. Finally, Yang et al. (2002) cite an average
of 10 iterations (p) required of the algorithm before it terminates. The experi
ment described in Section 3 averaged 7.4 iterations.
2.3 The CLUDAMI Model
The CLUDAMI model is defined as a triple (Gibson, 2004):
CLUDAMI = {A, x, /?}
A The series data captured in s steps: D\, D2,..., Ds
X The set of ms clusters: m clusters calculated for each Di, 1 < i < s
{Cn, C21,..., Cm 1, C12, C22, , Cms}
(3 A set of n(s 1) biclusters:
n biclusters calculated for each Dt U Dj+i, 1 < i < (s 1)
{B11, B2i, , Bn 1, B12, B22, , Bn(s!)}
Data migration is defined as:
(2.12)
19
Figure 2.4: A Venn diagram illustrates the simplicity underlying CLUDAMI.
Step j identifies the source of migration. (Step (j + 1) is the destination of mi
gration.) Cluster i is some cluster in step j. Cluster k is some cluster in step
(j + 1). Bicluster By is some bicluster 1 computed between steps j and (j + 1).
The intersections identify the source and destination of the migrating data.
Other combinations of intersecting clusters and biclusters similarly inform
additional migratory patterns.
The Venn diagram in Figure 2.4 illustrates the data migration between
two steps.
The set of biclusters (3 as defined above is particular to time series data.
Specifically, it is assumed that time step i + 1 is dependent on time step i.
An alternative organization accommodates experimental series data where
each experimental step in the series has a stimulus applied to a control. Un
der those conditions step i + 1 is assumed to be dependent on step 1 (i.e., step
1 is the control with which all other steps are biclustered). Formally:
/3 A set of n(s 1) biclusters:
n biclusters calculated for each Dj U Di+1,1 < i < (s 1)
{fin, fi2l, ..., fini, Bn, B22, fin(sl)}
20
2.3.1 Semantics Underlying CLUDAMI
The premise upon which the CLUDAMI model relies is that some subset
of data in Dt which share some subset of similar features will still share that
same subset of features in A+i Because the biclustering algorithm is identify
ing these similarities across two steps, any identified data migrations will only
be valid if there exists some multivariate correlation in the data instances and
their selected feature subsets as they progress through the steps. However,
since each pair of steps is biclustered independently, these subsets of similarity
can change from step to step.
Regarding the data, each instance should contain more than a single fea
ture. Data defined by a single scalar value render the CLUDAMI moot be
cause there is no proper subset of features that can be selected by the biclus
tering algorithm. Another assumed quality of the data is that the series does
not permit data instances to be directly tracked across the seriesat each step
samples from the population are selected independent of previous and sub
sequent steps. If the instances in the data can be specifically tracked across
steps, data migration identification becomes a trivial exercise of following spe
cific instances from one step to the next. That said, applying CLUDAMI to
traceable data might still assist in migration trend and feature relationship
identification.
21
2.4 Flow Cytometer Data
As a test of the CLUDAMI model, data generated from a biological wet
lab experiment were used (Gibson et al., 2004). The experiment itself was
a study of the effects of radon on human lung cells. A population of lung
cells were divided into several different samples (test tubes). Samples were
grouped according to the level of radon exposure they were subjected to: none
(control), low, medium, and high. Each group was further subdivided into two
subgroups based on the data that was to be gathered.
In order to gather data on the cells after radon exposure, each sample
was passed through a flow cytometer. A flow cytometer operates by inserting
a syringe into the sample and drawing out the cells. Each cell is then indi
vidually passed in front of several lasers. Based on the refraction of the laser
light, three features are measured for each cell (Newell, 2002; Celis, 1998):
Forward Scatter The relative diameter of the cell. This is measured as a
linear value.
Side Scatter The relative complexity inside the cell. This is measured as a
linear value.
Fluorescence The relative intensity of fluorescence in the cell. This parame
ter is measured as value on a three or fourdecade logarithmic scale.
Every cell passed through the flow cytometer has its Forward Scatter
(FS) and Side Scatter (SS) measured. The context of the Fluorescence value
22
depends on the experiment. In order to measure the quantity of a particular
feature in a cell, the feature is tagged with a molecule that fluoresces when
exposed to laser light. Therefore, the amount of fluorescence detected in each
cell directly corresponds to the quantity of the tagged feature present in the
cell.
As mentioned previously, each group of samples was divided into two
subgroups. The subgroups were divided based on the cellular feature tagged
with the fluorescing molecule. One subgroup measured levels of reactive oxy
gen species (identified here as DCFDAthe stain fluorescent marker that iden
tifies reactive oxygen species). The other subgroup measured levels of B7,
which is a type of structure on the cell. Table 2.1 summarizes the types of
samples and their associated features.
Radon FS ss DCFDA B7 A
none X X X A
low X X X A
medium X X X A
high X X X A
none X X X A
low X X X A
medium X X X A
high X X X A
Table 2.1: A table of the samples selected for examination by the CLUDAMI
model. Their organization into two experimental series is shown in the last
column.
23
To remain consistent with standard flow cytometer data analysis meth
ods, the fluorescence level was analyzed on a 4decade log scale.
Fitting the experimental data to the CLUDAMI model simply requires
ordering the samples as two experimental series: one series for the DCFDA
runs and one series for the B7 runs. Within each series the no radon sam
ples are identified as the Di controls, and the other samples are organized as
D2, D3, and Â£>4 based on low, medium, and high radon dosage respectively.
Given the type of data used in this this study, the data are organized as an
experimental seriesbiclustering was run on D\, pairs of samples.
Flow cytometry data typifies the difficulties inherent in tracking the mi
gration patterns of data.
1. There is no single data instance (i.e., cell) to follow. Once a cell passes
through the flow cytometer it is discarded and is unavailable for further
study. This is consistent with the data criteria discussed in Section 2.3.1.
2. Different numbers of individuals are measured with each sample. The
flow cytometer has a general threshold for the number of cells to count,
but the exact number varies from sample to sample.
3. The data is noisyoutliers are common.
2.5 The CLUDAMI Implementation
For the clustering algorithm, Kmeans was selected from an existing
library (de Hoon et al., 2003). The similarity measure used with Kmeans is
the Pearson correlation coefficient (See Figure 2.5).
24
r
n
y' /SSjSs\ /F5jFs\ /FLjFL\
\ SS J \ I?FS ) \ &FL )
Figure 2.5: The correlation between Side Scatter (SS), Forward Scatter
(FS), and Fluorescence (FL). The sample mean is subtracted from each in
stance value and then is divided by the standard deviation. The Coefficient
measures how well the fits a straight line.
For biclustering, the FLOC algorithm (Yang et ah, 2002) was imple
mented in Perl. Likewise, Perl utilities were written to parse and process the
flow cytometry files. Twodimensional visualizations of results were generated
using Perl and Gnuplot (Williams and Kelley, 1986,2003). Development of
the model was facilitated by using a custombuilt application using the three
dimensional visualization suite OpenDX (IBM Research, 2004).
25
3. Experimental Preliminaries
3.1 Terms and Conventions
The CLUDAMI model defines a step in a series. In the context of the
flow cytometer data, a single step is referred to as a sample (one test tube of
cells). Step and sample will be used interchangeably throughout this section.
Within each step, cell, instance, and member will be used interchangeably.
Several scatterplots are presented throughout the results section. To ac
commodate additional space for the plots, the axis labels are not included. In
all cases, the scatter plots show the Side Scatter (SS) along the Xaxis and
Forward Scatter (FS) along the Yaxis. Histograms are used to quantify Fluo
rescence levels (LFL1). FS, SS, and LFL1 are described in Section 3.2.
Any form of the word cluster specifically refers the Kmeans clusters
and any form of the word bicluster specifically refers to biclusters generated
using the FLOC algorithm.
3.2 Scatterplot Overview
A staple of flow cytometer analysis is to visualize a sample as a two
dimensional scatterplot with the Side Scatter (SS) along the Xaxis and For
ward Scatter (FS) along the Yaxis. To evaluate the level of fluorescence
(LFL1), a histogram of cell counts over LFL1 levels is generated. The flow
cytometer discretizes LFL1 levels across 1024 values and there are an average
of 10,000 cells per sample, so a histogram divided on counts per value is fea
sible. In the data used for this study there are a significant number of cells
26
with overflow values of 1023 for the SS. To handle these outliers, the data was
preprocessed by removing all cells with SS values of 1023.
Central to many flow cytometer analysis tasks is distinguishing between
live cells and dead cells. Live cells tend to have larger FS and smaller SS.
Dead cells tend to shrink in size (smaller FS) and increase in complexity in
side the cell as it starts to break up (larger SS). Figure 3.1 Illustrates the de
lineation between live and dead cells that is typically made. A portion of the
CLUDAMI analysis includes an examination of cell migration from the live
region to the dead cell region.
G0025974.LMD
Figure 3.1: A sample scatterplot typical of that generated for a flow cytome
ter sample. The Xaxis is the FS and the Yaxis is the SS. Shown are the
approximate locations of live and dead cells.
3.3 Relevant Parameters
The CLUDAMI model was followed by independently clustering each
step. To evaluate the influence k (number of clusters) has on the model, each
27
step was clustered with 17 different k values: 2 < k < 18. The Kmeans al
gorithm generates better results if the best cluster assignments are selected
among many runs of the algorithm. To accommodate this, the algorithm was
run 1000 times for 2 < k < 10 and 2000 times for 11 < k < 18. The best
scoring Kmeans clustering among the 1000 (or 2000) runs was chosen for each
k clustering.
Bicluster pairings were made based on the experimental series (3 speci
fied by the CLUDAMI model. Two separate series were run; one series that
captured reactive oxygen species (DCFDA) and one series that captured cell
structure B7. A value of 0.05 was used as the probability of a row (cell) be
ing included as a member of a bicluster in the initialization of the algorithm.
Given the lowdimensionality of the feature space (SS, FS, and LFL1), a fixed
number of four biclusters was used in all biclustering runs. The initial column
memberships for the four biclusters were defined as:
{SS, FS}, {SS, LFL1}, {FS, LFLl}, and {SS, FS, LFL1}.
Of course after initialization, the FLOC algorithm was free to add or
remove columns from the biclusters during the main run of the algorithm.
Without exception, the bicluster initialized with all threecolumns reduced
itself to two columns during the main run of the algorithm.
3.4 Migration Pattern Selection
The data migration equation (Equation 2.12 on page 19) identifies
n(s 1 )m2 possible migration patterns for the entire series. This study focuses
on the migration between two steps, so (s 1) is just 1. Even with the reduced
complexity of nm2, the task of interpreting and discriminating between migra
28
tory patterns becomes difficult with large numbers of clusters. However, if the
biclustering algorithm intersects with a very few number of clusters then the
issue resolves itself. Recall that migrating data is defined in terms of intersec
tions between clusters and biclusters. If an intersection is the empty set, then
there is no data to migrate for that cluster/bicluster pair.
Bicluster 0
File: G0025954.LMD
Clust. ID Cells
0 10
1 89
2 34
3 13
4 29
5 593
6 289
7 12
8 437
Clust. ID Cells
9 40
10 355
11 57
12 68
13 4
14 82
15 189
16 17
17 139
Table 3.1: The data migration model defined in Equation 2.12 identifies
any cluster which intersects with a bicluster as a migratory pattern. The ta
ble shows for one bicluster, the number of data instances shared with each of
k = 18 clusters. The wide breadth of coverage of the bicluster indicates the
need to reduce the list of clusters under consideration if a reasonable number
of migratory patterns is to be identified.
Despite the potential reduction in complexity due to empty sets, the ac
tual results show that biclusters almost always spread across all clusters. Ta
ble 3.1 shows for a typical bicluster the number of cells which intersect with
29
!
each of k = 18 clusters.
One notable quality of bicluster/cluster intersections revealed in Ta
ble 3.1 is that the distribution of cell counts between clusters is not uniform.
This observation leads to the following method for culling the number of mi
gration patterns: for each bicluster in a step, order the clusters by the number
of instances intersecting with the bicluster; select the top m clusters for deter
mining data migration. To determine an appropriate value for ra, the discrimi
nating abilities of the bicluster algorithm were analyzed.
3.4.1 Bicluster Discrimination
Because the biclustering algorithm is heuristic, there is some uncertainty
in the ability of the algorithm to consistently select the most similar instances
for a given subset of features. This uncertainty extends to ordering clusters
based on intersection counts. The underlying assumption in selecting the m
clusters with the greatest intersections is that the most significant migration
trends will be identified. If the counts arent reliable, then the efficacy of the
selection method is in question.
A simple test of the accuracy of the biclustering algorithm was performed
by generating migration data between two identical steps. A single sample
and its k clusterings was duplicated and used as both the source and desti
nation steps for computing data migration. The migration intersections were
calculated independently for each set of clusters between k 2 and k = 18. If
the biclustering algorithm performs flawlessly, then both the source and desti
nation steps should show identical intersections for all k. Figure 3.2 shows the
inaccuracy of the biclustering algorithm as measured against the intersecting k
30
I
clusters.
(a) (b)
Figure 3.2: Graph (a) shows the total cluster mismatches for all four biclus
ters when intersections are ranked according to the number of instances they
contain. Graph (b) indicates the highest rank to be mismatched.
Although Figure 3.2 identifies the number of mismatches and the high
est mismatch, it is helpful to look more specifically at how cluster rankings
are being mismatched. A detailed view of the cluster intersection rankings for
k 18 are shown in Table 3.2. An interesting trend in the mismatches iden
tified in the table is that mismatches in the higherranked clusters tend to be
swaps that are within a few cells of each other. For example in bicluster 2, the
first mismatched pair shows a migration from cluster 5 in the source to cluster
12 in the destination. The very next mismatched pair shows just the oppo
site: a migration from cluster 12 to cluster 5. Had the biclustering algorithm
been errorfree, the source and destination clusters would have been identi
fied as from 12 to 12 and from 5 to 5. Although the discrepancy in cell counts
introduces an error in the numbers of cells migrating (albeit a small error),
the actual orderswapping does not. Recall that the data migration equation
denotes all of the following as data migration patterns: 5 to 5, 12 to 12, 5 to
31
Bicluster 0
Bicluster 1
File: G0025954&.LMD File: G0025954b.LMD
Clust. ID Cells Clust. ID Cells *
5 593 5 575
8 437 8 436
10 355 10 351
6 289 6 297
15 189 15 195
17 139 17 144
1 89 1 105
14 82 14 80
12 68 12 72
11 57 11 51
9 40 9 39
2 34 2 37
4 29 4 28
16 17 16 17
3 13 0 15 *
7 12 3 14 *
0 10 7 10 *
13 4 13 5
Bicluster 2
File: G0025954a.LMD File: G0025954b.LMD
Clust. ID Cells Clust. ID Cells *
6 345 6 357
10 241 10 242
7 177 7 174
5 138 12 139 *
12 129 5 125 *
17 88 17 88
13 71 13 81
0 56 0 65
16 45 16 46
3 42 4 36 *
4 34 3 32 *
9 24 9 24
11 9 11 11
8 4 1 4 *
1 1 8 1 *
2 0 2 0
14 0 14 0
15 0 15 0
File: G0025954a.LMD File: G0025954b.LMD
Clust. ID Cells Clust. ID Cells *
7 288 7 288
12 238 12 243
13 107 13 110
17 61 17 65
2 51 2 53
0 48 0 47
5 34 5 35
10 29 6 26 *
6 25 10 26 *
9 18 9 18
1 17 1 17
16 13 16 14
15 8 8 11 *
8 7 3 10 *
3 5 4 8 *
14 5 15 8 *
11 4 11 7
4 3 14 4 *
Bicluster 3
File: G0025954&.LMD File: G0025954b.LMD
Clust. ID Cells Clust. ID Cells *
5 403 5 397
8 384 8 377
15 234 15 238
12 175 17 166 
17 169 12 165 *
14 123 14 122
11 63 11 76
7 43 7 44
4 38 4 35
16 29 3 27 *
13 28 13 26
0 26 0 24
3 20 16 23 *
6 3 10 6 *
10 3 1 2 *
1 2 9 2 *
2 0 6 1 *
9 0 2 0 *
Table 3.2: The biclustering algorithm was run on a single duplicated sample
G0025954 Shown are the ordered rankings of k = 18 clusters based on the
number of cells which intersect between a bicluster and a cluster. An asterisk
indicates where the ordering is different between the two identical samples.
32
12, and 12 to 5. Therefore the swapping due to biclustering error does not af
fect the migrations that are identified. The tendency for ranking errors to be
swaps held for all 2 < k < 18 cluster combinations examined.
The number of clusters considered in the data migration equation was
2 < m < 4 because:
1. The results summarized in Figure 3.2 illustrate that only k = 14 and
A: = 15 had mismatches in the top two positions
2. Those mismatches (and all others in the highest mismatched positions)
are simple swaps in position.
An upper limit of 4 maintains a high degree of bicluster discrimination and
also generates a manageable number of migrations to analyze.
With n = 4 biclusters (see Section 3.3), there will be n(m2) data migra
tions (between 16 and 54) to be examined between a pair of steps. Further
refinements identified in Section 4.2 reduce the number of probable data mi
grations patterns to nm resulting in a range of 8 to 16 valid data migrations.
The actual number of migration patterns is usually slightly lower due to over
lapping biclusters which occasionally intersect the same source and destination
clusters.
33
4. Experimental Results
Migrations in both the B7 and DCFDA series were examined in de
tail. The migration between the no radon sample and high radon sample for
DCFDA will be reported on in the following sections. Refer to Section 2.4 for
more information on the samples.
4.1 Data Migration Features
Figure 4.1 shows both steps for which data migration was measured. A
sidebyside comparison of the plots leads us to some intuitions about possible
migration paths.
G0025952.LMD G0025974.LMD
Figure 4.1: Plot (a) shows a sample which has not been exposed to radon.
Plot (b) shows a sample exposed to high levels of radon.
34
Immediately noticeable is that the shape of the plots is roughly the same.
There are 3 clouds of data extending from the plot origin. Between the plots
is a subtle change in the location of the clouds. The middle and top clouds
appear to have separated somewhat, so migrations indicating movement of
the clouds is expected. Another feature that is not visually discernible is the
effect on the cells by exposing them to high radon. Some cells are expected to
migrate rightward and downward toward the cloud which denotes dead cells
(see Figure 3.1).
These data migration intuitions are affirmed in the results. Figures 4.2
and 4.3 show a representative sample of migration patterns.
4.1.1 Subcluster Localization
When the migration patterns are plotted a notable feature present in
some cluster/bicluster intersections is that the biclusters favor just a portion
of the clusters they intersect with. Figure 4.4 is an extreme but instructive
example. The destination cluster is a very broad cloud whose centroid is in
dicated by the head of the arrow. The destination migration data however is
localized below and to the left of the cluster, away from the cluster centroid.
This tendency toward subcluster localization is to be expected when
given the differences between clustering and biclustering. The biclustering
algorithm has identified a subset of cells which are similar across a subset of
features. The clustering algorithm is not as discriminating because it is identi
fying broad similarities across all features.
35
ID: plot_02_14, G0025952.LWD to G0025974.LMD, 18 Clusters
ID: plot_07_07, G0025952.LMD to G0025974.LMD, 18 Clusters
(a) (b)
ID: plotJ7J)3, G0025952.LMD to G0025974.LMD, 18 Clusters ID: plot_10_06, G0025952.LMD to G0025974.LMD, 18 Clusters
(c) (d)
Figure 4.2: Four of the 8 valid migrations identified with k = 18, m = 4.
Plots (a) and (b) show the up and right movement of the topmost cloud.
Plots (c) and (d) show the down and right movement of cells due to radon
exposure. Closeups of plots (a), (b), and (c) can be found in Figure f.3.
36
Figure 4.3: Closeups of the centroid arrows for plots (a), (b), and (c) from
Figure f.2.
4.2 Migration Reflections and Rank Order Mismatch
In addition to valid cluster data migrations, CLUDAMI also generates
many invalid migrations. These invalid migrations have the peculiar charac
teristic of being the mirror image, or reflection, of another (invalid) migration.
Visually identifying reflections between different migrations can be a time
consuming manual task. Fortunately there is a second indicator of which mi
grations have reflections: rank order. Nearly all migrations with reflections
also have the property that the rank of their source and destination clus
ters do not match. Correspondingly, nearly all migrations without reflections
(thereby indicating a valid migration) have the property that their source and
destination clusters are identically ranked. See Figure 4.5 for an example of
this.
Though these characteristics are easy to discern they can occasionally
mislead. There are a very few occurrences where a migration may have iden
tically ranked source/destination clusters, but is also a reflection of a different
migration, thereby invalidating the migration.
37
ID: ploM 0_11, G0025954.LMD to G0025978.LMD, 18 Clusters
Figure 4.4: An example of a misleading migration arrow. The arrow starts
at the source cluster centroid and ends at the destination cluster centroid.
Cluster/bicluster intersections include only a portion of the data in the cluster.
The figure shows that the actual migration is to the left of the cluster centroid
arrow.
4.3 Varying k and m
In general, a larger k provides greater resolution to the data migrations.
However, it doesnt necessarily follow that a smaller k represents a lessdefined
view of larger k. Figure 4.6 illustrates an example of a valid migration found
at
k = 12, m = 2 that didnt reemerge until k = 18, m = 4.
In addition to resolving migrations with greater clarity, another effect
of increasing k is that it can break down a single migration into smaller sub
migrations. Figure 4.6 also shows the submigrations that replaced the single
migration when k was increased.
38
ID: plot_12_03. G0025952.LMD to G0025974.LMD, 18 Clusters
ID: plot_17_01. G0025952.LMD to G0025974.LMD, 18 Clusters
1000
800
600
400
200
0
0 200 400 600 800 1000 0 200 400 600 800 1000
(a)
ID: plot_12_01, G0025952.LMD to G0025974.LMD. 18 Clusters
1000
800
600
400
200
0
0 200 400 600 800 1000 0 200 400 600 800 1000
(c) (d)
Figure 4.5: Plots (a) and (b) are invalid migrationsthey are reflections of
one another. These two plots also have mismatched ranks between their source
and destination clusters. Plot (a) ranks are 0/2 and plot (b) ranks are 2/0.
By contrast, plots (c) and (d) are valid migrationsthey are not reflections of
each other (nor of any other plot) and have ranks of 0/0 and 2/2 respectively.
Bicluster: 3; FSSS Src Clust 12 Dest Clust 1
V'..
i
(b)
ID: plot_17_03, G0025952.LMD to G0025974.LMD. 18 Clusters
Bicluster: 3; FSSS
Src Clust 17
Dest Clust 3
> j
_J_____________L_
Bicluster: 3; FSSS Src Clust 12 Dest Clust 3
. ;
 j.. *
39
ID: plot_05J)3, G0025952.LMD to G0025974.LMD, 12 Clusters
(a)
ID: plot_10_13, G0025952.LMD to G0025974.LMD, 18 Clusters ID: plot_10_05, G0025952.LMD to G0025974.LMD, 18 Clusters
1000
800
600
400
200
0
0 200 400 600 800 1000 0 200 400 600 800 1000
(b) (c)
Figure 4.6: Migration (a) was discovered at k = 12, m = 2. The data migra
tions for k = 18,2 < m < 3 did not reveal the migration shown in (a). At
k = 18, m = 4 migrations (b) and (c) appeared as submigrations of (a).
Bicluster: 2; SSLFL1
Src Clust 10
DestClust 13 ,
40
The effects found by increasing k are often augmented by increasing m
as well. Although increasing k improves data migration clarity, the improved
migrations may belong to clusters which are below the m threshold. By in
creasing m, these additional migrations become visible.
4.4 Comparing Similar Source Migrations
For each wet lab experiment, samples are generally run through the flow
cytometer in triplicate. The redundancy ensures the scientific rigor of the ex
periment. In the no radon samples, one of the three samples was anomalous.
Figure 4.7 shows the anomalous cell population compared to one of the other
two no radon samples. The migration path from the abnormal sample to the
high radon sample was modeled to test the ability of CLUDAMI to show simi
lar data migrations despite variance in the starting step.
After identifying migrations using the anomalous sample as the source
step, the migration were compared with those previously generated by the
normal sample.
In Figure 4.8, the first plot shows the relatively modest migration from
a normal sample. By contrast, the anomalous sample has to travel further to
reach the same destination. In comparing the plots in Figure 4.7, the com
pressed shape along the left edge indicates that because the anomalous sample
is compressed in the lower lefthand corner of the plot, migrations to the inte
rior of the plot travel a greater distance than the normal sample. Both plots
show a migration to the same destination which is accurate given that the des
tination samples are identical.
41
G0025952.LMD
G0025950.LMD
Figure 4.7: Two samples were selected from the same pool of redundant
DCFDA no radon samples. Both plots (a) and (b) are the starting steps for
migrations from no radon to high radon. Plot (a) is the same plot used previ
ously. Plot (b) shows a much more compressed silhouette.
4.5 Biological Interpretation
Independent of the CLUDAMI analysis, the following biological conclu
sions were drawn on the no radon to high radon DCFDA samples (Burkhart
et ah, 2002):
1. In the high radon dosage samples, approximately 39% of the cells were
hit by the radon (more accurately, that percentage of cells were hit by
alpha particles generated by the radon).
2. Not all of the cells exposed to radon die.
42
ID: plotj)7_07, G0025952.LMD to G0025974.LMD, 18 Clusters
ID: plot_07_07, G0025950.LMD to G0025974.LMD, 18 Clusters
(a) (b)
Figure 4.8: Plot (a) shows a migration from the no radon to high radon
sample generated previously. Plot (b) shows the equivalent migration for
the anomalous sample. CLUDAMI accounts for the compressed shape of the
anomalous sample by indicating a further distance that this data must migrate.
Results are similar for other migrations compared between these two samples.
3. Both surviving and killed cells showed an increase in DCFDA (reactive
oxygen) levels in response to the radon dosage.
The CLUDAMI analysis shows that most cells in the sample make small,
insignificant migrations. One migration does stand out: one cluster separates
and migrates down and to the right. This direction indicates a movement
from the live cell region toward the dead cell region (see Figure 3.1). The
cells in this migration are also the only cells which show increased levels of
DCFDA.
Cells were counted in the source cluster of the migration shown in Fig
ure 4.6. For k = 12 there are 343 cells and for k = 18 there are 394 cells.
There are a total of 11290 cells in the source population, which results in a
43
All Migrations: No Radon to High Radon, DCFDA
Figure 4.9: Plot (a) shows all valid duster migrations for no radon to high
radon DCFDA. There are a total of seven arrowheads (paths) present on the
plot. Plot (b) shows the increased DCFDA level present in the longest cell mi
gration. The Xaxis of the histogram is the DCFDA level. The Yaxis of the
histogram is a probability of cells in the cluster (containing the DCFDA level
indicated on the Xaxis).
migration of 3% and 3.5% of the population respectively. It should be noted
that CLUDAMI is identifying only migrations with the largest intersections,
not all migrations. Therefore it should be assumed that cells other than those
identified also migrated toward the dead cell region and also had an increase
in DCFDA levels.
In summary, the CLUDAMI migrations are consistent with the biologi
cal conclusions. Figure 4.9 shows all valid CLUDAMI migrations from the no
radon step to the high radon DCFDA step.
44
5. Discussion
In this study we proposed the CLUDAMI model for tracking data migra
tion across steps in series data. Its application to flow cytometry data shows
that the model is able to capture the migratory patterns of clustered data as
it moves from one step to the next. There are several aspects of the model
which merit further discussion.
5.1 Model Flexibility
A strength of the model is its simplicity. Clustering and biclustering each
occur independently prior to calculating migrations. If several different clus
tering and biclustering algorithms are run on the data, the model facilitates a
combinatorial comparison of the effect these clustering and biclustering algo
rithms have on the data. The time complexity of the model is dependent on
the clustering and biclustering algorithms selected. Calculating cluster migra
tions is simply a matter of computing intersections between the clusters and
biclusters.
5.2 Identifying Invalid Migrations
Distinguishing between valid and invalid migrations is key to the success
of the model. As noted in Section 4.2, migrations with mismatched rank or
derings can be used as an efficient method of identifying invalid migrations
(data with reflections). However, there are exceptions when the source and
destination ranks match, but the migration does have a reflection and is there
fore invalid. All exceptions that were encountered in the study were of migra
tions with source/destination cluster ranks of either 2 or 3 (third and fourth
45
highest ranks). This held for both k = 12 and k = 18. In all migrations exam
ined, matches of either the first or second rank were always valid. This can be
attributed to the discriminating ability of the biclustering algorithm decreas
ing as the cluster populations decrease (see Table 3.2 on page 32).
These exceptions exist for invalid migrations as well. There are rank mis
matches which clearly have no reflections, and therefore should be considered
valid migrations. Again, these exceptions occurred in mismatches involving
lowerranked clusters.
These observations indicate that a conservative choice of m = 2 succeeds
in finding only true positives, but also fails to evaluate (as valid or invalid)
other migrations which exist. Setting m = 4 identifies true positives, but
also identifies false positives and false negatives which require the additional
scrutiny of examining rank mismatches and reflections independently.
5.3 Effect of k on Migration Paths
Using a small number of clusters, k, results in larger clusters covering
the data, which in turn increases the likelihood that the highestranked in
tersections of each bicluster belong to the same cluster. This results in fewer,
nonredundant migrations. Although a higher k decreases the likelihood that
biclusters will share the same intersections, there is little change in the migra
tion vectors identified between cluster centroids.
5.4 Future Work
The results indicate that CLUDAMI is a valid model of migrating data.
However, there are many areas not addressed by the initial study that deserve
additional research in future work:
46
High dimensional data The data selected to test the CLUDAMI model had
only three dimensions: side scatter, forward scatter, and fluorescence.
The small number of dimensions eased interpretation of the results, but
leaves the model untested with more complex data. The model needs to
be evaluated on data with more than three dimensions.
Trend identification and prediction Much of time series data analysis re
search identifies trends and predicts future trends. A comparative anal
ysis of the model should be made against existing trend identification
algorithms to determine how migration path information can enhance
trend identification. Although CLUDAMI has been presented as a de
scriptive model, it would be interesting to evaluate its ability to enhance
trend prediction as well.
Multivariate correlation analysis Though interpretation of candidate data
migrations by a domain expert provides perhaps the best validation of
the model, a multivariate correlation analysis of the identified migrations
would help underscore their statistical significance.
Clustering and Biclustering Different clustering and biclustering algo
rithms should be used with the model. Due to the biological context
under which biclustering has developed, there exist several very different
approaches to biclustering. Many offer unique methods for measuring
similarity which might provide unique insights into nonbiological data
sets.
Another class of (bi)clustering algorithms to use in the model is sub
47
space clustering. Though not specifically defined as such, biclustering
has evolved almost exclusively within the biotechnology domain. By
contrast, subspace clustering has evolved outside of biotechnology and
would be an interesting alternative to try in the model.2
Multiple steps Because the data used in this study is an experimental se
ries, there are never more than two time steps to compare. The model
needs to be tested on time series data which spans several steps. The
model as defined provides for this by iteratively computing biclusters on
pairs of time steps (see definition of (3 in Section 2.3).
A variation on the model worth exploring is to increase the number of
time steps simultaneously biclustered.
Validating Migrations An alternative method to ranking clusters by the
number of members they contain is to only consider migrations where
the difference in number of cells between the source and destination in
tersections is beneath some stated threshold (e.g., the source and des
tination intersections between any two clusters must be within 50 cells
of each other). This eliminates migration patterns which are difficult to
justify (e.g., 500 cells in the source cluster migrating to 30 cells in the
destination cluster). This method could also be integrated as an addi
tional constraint used with the ordered ranking method.
2One of the motivations for selecting the FLOC algorithm for this study is that work
relevant to the development of the algorithm built on the strengths and weaknesses of both
biclustering and subspace clustering (Yang et ah, 2002)
48
Automating Migration Validation With larger m, the process of validat
ing clusters based on the existence of migration reflections is currently a
timeconsuming manual task. Integrating an algorithm which evaluates
the similarity of clusters in ndimensional space to automate reflection
identification would be of practical value to users of CLUDAMI.
Different k and s The model assumes that the number of clusters k and the
number of biclusters s used in each step are constant. Allowing k and s
to change in the series would be an interesting enhancement to explore.
Migration population counts The work done in this study focuses on the
general shape and direction of migrating clusters of data. Very little
evaluation was done the magnitude of the migration. That is, the num
ber of data instances contained in the source and destination clusters of
each migration wasnt thoroughly analyzed. In particular, these counts
are quite sensitive to the k chosen, since increasing k reduces the num
ber of instances in each cluster. Also, the number of instances between
the source and destination clusters is seldom identical. Further analysis
is required on the magnitude of clustered data migration.
In conclusion, CLUDAMI was proposed as a model for discerning the
migration paths of clustered data. The model was tested in a variety of ways
using flow cytometry data and techniques were developed for proper use and
interpretation of the model and its results. The experimental tests validate
CLUDAMI as a model capable of identifying clustered data migration, but it
is only a start. Additional work has been identified to refine the model.
49
Appendix A. Proof of the Minimum Score Matrix
Below is the Lemma and a derived proof which corrects the original
formula stated incorrectly and offered without proof in Cheng and Church
(2000).
Lemma 2. A K x K matrix of a single 1 all Os elsewhere has the score:
hK = fi(KinK\? + 2(Kl) + l}
Proof. Without loss of generality, a^k = L
an a\2 a\k ^
a21 a22 a2k
\akl ak2 CLkk
Equation 2.6 can be expanded into:
a summation of cell residues not in a row or column with the 1
a summation of cell residues in the row with the 1
a summation of cell residues in the column with the 1
the residue of the cell containing the 1
hx
\K\\K\
'fci fci
fci
fci
E ^ R(ij)2 + y, R(*> *)2 + E R^ + R^k)2
(A.l)
,i=i j=i
i=1
j=1
50
The matrix mean is simply:
okk
K2
(A.2)
Each squared residue in the first summation is:
R(hj)2 ~ (aij ~ aiK ~ &Kj + o,KkY
= (000 + ?
1
(A.3)
(A.4)
(A.5)
The first summation is a double summation from 1 to K 1 so the Tj
term is multiplied (K l)2 times. Substitute into Equation A.l:
1
K2
j k 1 k 1
j7i(K 1)2 + (*.kf + E Rik
i=l j=1
Each squared residue in the second and third summations is:
(A.6)
R(i, k)2 (ciik diK OKk + Ukk)2 + (&kj OKj &kK + aKK)2 (A.7)
2 / \ 2
1 1V
(ooi + ^j +{j{0+ K1J
(A.8)
(A.9)
These summations occur K 1 times:
*(a)2={^ + ^) UK1)
1
K2
(Al) 2(K1)
1
= f4(K l)2 2(K 1)
= 2^j(Al)
(A. 10)
(A.ll)
(A.12)
(A.13)
51
Combining them the formula becomes:
h = To
(K if + 2(K if + R(k,kf
(A.14)
The final squared residue is:
R(k, k)2 = (a^k okK ~ aKk + okkY
_Lx2
\ K K + K2
= l (K3 2K + 1)
= (^D2
(A.15)
(A.16)
(A.17)
(A.18)
(A.19)
(A.20)
52
CT>  Ci\ tO 
Substitute in the equation:
{Klf + 2;(,Klf + L(Kl)* (A.21)
{K l)2 [1 + 2{K 1) + {K l)2] (A.22)
(ai)2[(ai)2 + 2(a:i) + i] (A.23)
53
Appendix B. The Befuddling Proof of Cheng and Church
Biclustering of Expression Data (Cheng and Church, 2000) is founda
tional to biclustering in biotechnology. There are few examples of biclustering
prior to this paper, and it is the first to apply biclustering to the Biotechnol
ogy domain. Nearly the entire (relatively small) body of biclustering literature
published since has cited this article.
Unfortunately, the article contains errors and makes (sometimes false)
assertions without explanation. These confound understanding of their proof
that (5biclustering is NPhard. Two incorrectly presented assertions used to
support their proof are the score for hx, a matrix of all zeros and a single one,
and the score for a matrix (ap = 2ij)i,j > 0. I believe hx was presented be
cause it represents the lowest possible nonzero score for a matrix of integer
values > 0. Any submatrix S = (a;j = 2ij) assures a score: 0 < hx < S.
Although I havent rigorously sought to prove this, a few minutes with pencil
and paper seem to bear this out. The closed form for the score of S is a series
of summations based on EizUTl Given the constant denominator, the mini
mum score for S is greater than hx
The authors set up the reduction using a matrix comprised of values
from S and 0 if there is an edge. This means that the only 6 = hx, (5biclusters
that can exist are those that are made up entirely of 0s, which meets the cri
teria of a complete subgraph Kmm. This is where my divination of the authors
intentions end. Because they dont define the (5bicluster as 5 = hx, they de
54
fine it as 6 = 1/fi/c This value for 6 is so high that the largest bicluster in any
matrix created via the reduction the matrix itself, which is useless. I can only
assume that the inverse hk is yet another error.
Another issue is why use 6 = hx at all? This is a very clever way of
choosing a nonzero value for 6 that is guaranteed to be lower than any sub
matrix that could possible be formed that included values in S (a^ = 2ij).
However, the only possible motivation for choosing hx for A is if A > 0. But
the article states: 5 > 0!
Here is a revision of the biclustering theorem and proof that doesnt use
Theorem B. Finding the largest square 6bicluster is NPhard.
Proof. This proof is a reduction from the NPcomplete problem balanced com
plete bipartite subgraph (Garey and Johnson, 1979). Given a positive integer
m, this problem asks whether a complete subgraph Kmm exists in a bipartite
graph G = (U,V,E). From G, form \U\ x P matrix A with i G U,j G V:
If the largest square 5bicluster in A with <5 = 0 has a size > m, there
exists a complete submatrix Kmm. Therefore finding the largest square 6
0 : (i,j) G E
2 ij : otherwise
bicluster is NPhard.
55
REFERENCES
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic sub
space clustering of high dimensional data for data mining applications.
In Proceedings of the 1998 ACM SIGMOD International Conference on
Management of Data, 1998.
M. S. Aldenderfer and R. K. Blashfield. Cluster Analysis. Sage Publica
tions, Newbury Park, CA, 1984.
J. Alon, S. Sclaroff, and G. Kollios. Discovering clusters in motion time
series data. In Proc. IEEE CVPR, June 2003.
A. BenDor, B. Chor, R. Karp, and Z. Yakhini. Discovering local struc
ture in gene expression data: The orderpreserving submatrix problem. In
Proceedings of The Sixth Annual International Conference on Research in
Computational Molecular Biology (RECOMB), 2002.
J. F. Burkhart, B. A. Camley, R. E. Camley, E. VillalobosMenuey, and
M. K. Newell. Lung cell immune system response due to irradiation by
alpha particles. In Conference of Radiation Control Program Directors.
Inc., October 2002.
J. Celis, editor. Cell Biology: A Laboratory Handbook, pages 261274.
Academic Press, University of Aarhus, Denmark, second edition, 1998.
Y. Cheng and G. M. Church. Biclustering of expression data. In Pro
ceedings of the Eighth International Conference on Intelligent Systems for
Molecular Biology, pages 93103, 2000.
K. J. Cios, W. Pedrycz, and R. W. Swiniarski. Data Mining Methods
for Knowledge Discovery. Kluwer Academic Publishers, Norwell, Mas
sachusetts, 1998.
M. de Hoon, S. Imoto, and S. Miyano. The C Clustering Library for
cDNA Microarray data. University of Tokyo, Institute of Medical Sci
ences, Human Genome Center. World Wide Web, http://bonsai.ims.
utokyo.ac.jp/~mdehoon/software/cluster/, 2003.
56
M. Garey and D. Johnson. Computers and Intractability: A Guide to the
Theory of NPcompleteness. W. H. Freeman and Company, New York,
1979.
M. Gavrilov, D. Anguelov, P. Indyk, and R. Motwani. Mining the stock
market: Which measure is best? In Proceedings of the Sixth ACM
SIGKDD International Conference on Knowledge Discovery and Data
Mining, pages 487496, 2000.
T. Gibson. Identifying clustered data migration through series data. In
preparation, 2004.
T. Gibson, M. K. Newell, E. VillalobosMenuey, and K. Cios. Identifying
cell migration in flow cytometry data analysis. In preparation, 2004.
A. D. Gordon. Hierarchical classification. In P. Arabie, L. J. Hubert, and
G. D. Soete, editors, Clustering and Classification, pages 65121. World
Scientific, River Edge, NJ, 1996.
G. K. Gupta and J. Ghosh. Detecting seasonal trends and cluster motion
visualization for very high dimensional transactional data. In First SIAM
International Conference on Data Mining, 2001.
IBM Research. OpenDX: The Open Source Software Project Based on
IBMs Visualization Data Explorer. World Wide Web, http://www.
opendx.org/, 2004.
L. Lazzeroni and A. Owen. Plaid models for gene expression data. Statis
tica Sinica, 12(1):6186, January 2002.
K. Newell. Lab Protocol. University of Colorado at Colorado Springs,
April 2002.
L. Parsons, E. Haque, and H. Liu. Evaluating subspace clustering algo
rithms. To be published in SIAM SDM Of Workshop on Clustering
High Dimensional Data and its Applications, 2004.
E. Perlman and A. Java. Predictive mining of time series data in astron
omy. In H. E. Payne, R. I. Jedrzejewski, and R. N. Hook, editors, Pro
ceedings of Astronomical Data Analysis Software and Systems XII, 2003.
R. Shamir. Lecture 8: Biclustering I. In Analysis of Gene Expression
Data, DNA Chips and Gene Networks. Tel Aviv University, World Wide
Web, http://www.math.tau.ac.il/~rshamir/ge/02/ge02.html, 2002.
57
A. Tanay, R. Sharan, and R. Shamir. Discovering statistically signifi
cant biclusters in gene expression data. In Proceedings of the Tenth Inter
national Conference on Intelligent Systems for Molecular Biology, pages
136144, 2002.
D. D. Wackerly, W. M. Ill, and R. L. Scheaffer. Mathematical Statistics
mth Applications. Duxbury, Pacific Grove, California, sixth edition, 2002.
T. Williams and C. Kelley. Gnuplot. World Wide Web, http://www.
gnuplot. info/, 1986,2003.
J. Yang, W. Wang, H. Wang, and P. S. Yu. ^clusters: Capturing sub
space correlation in a large data set. In ICDE, 2002.
58
