Citation
The simcal family of algorithms for analysis of microarray data

Material Information

Title:
The simcal family of algorithms for analysis of microarray data
Creator:
Dvorkin, Daniel
Publication Date:
Language:
English
Physical Description:
xi, 82 leaves : ; 28 cm

Subjects

Subjects / Keywords:
Algorithms ( lcsh )
Bioinformatics ( lcsh )
Cluster analysis ( lcsh )
Protein microarrays ( lcsh )
Algorithms ( fast )
Bioinformatics ( fast )
Cluster analysis ( fast )
Protein microarrays ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 79-82).
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
Daniel Dvorkin.

Record Information

Source Institution:
|University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
62774802 ( OCLC )
ocm62774802
Classification:
LD1193.E52 2005m D86 ( lcc )

Full Text
THE SIMCAL FAMILY OF ALGORITHMS FOR ANALYSIS OF
MICROARRAY DATA
by
Daniel Dvorkin
B.S., Mathematics, Metropolitan State College of Denver, 2001
A thesis submitted to the
University of Colorado at Denver and Health Sciences Center
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science and Engineering
2005


This thesis for the Master of Science
degree by
Daniel Dvorkin
has been approved
by
Harvey Greenberg


Dvorkin, Daniel (M.S., Computer Science and Engineering)
The SiMCAL Family of Algorithms for Analysis of Microarray Data
Thesis directed by Professor Krzysztof Cios
ABSTRACT
SiMCAL 1, 2, and 3 (Sample Multilevel Clustering And Linking, versions 1,
2, and 3) are novel clustering algorithms for the analysis of microarray data. The
purpose of these algorithms is to present complete feature sets not found in either
Jarvis-Patrick clustering, from which the original SiMCAL concept is derived, or
in other popular clustering methods such as hierarchical and k-means. Although
each algorithm in the SiMCAL family has distinct features and methods, they all
share the following attributes: they are simple, in that they are computationally
inexpensive; they are multilevel, in that they provide a small number of clearly
defined hierarchical levels of clusters; and they offer linking between clusters at
the same level in each hierarchy. Presented here are the design, development, and
analysis of the algorithms; their applications to two types of microarray data, one
involving the phosphatidylserine receptor (PSR) and the other involving cystic
fibrosis (CF); a description of the Web-based interface for visualization of results;
and possible avenues for further development.
Code and data are available at http://www.dvorkin.com/daniel/Simcall23.zip
under an open-source license.
m


This abstract accurately represents the content of the candidates thesis. I rec-
ommend its publication.
Signed
IV


DEDICATION
For my friends and family, who have been there with me through the worst and
the best; this is at least as much your accomplishment as it is mine.


ACKNOWLEDGMENT
Dr. Valerie Fadok of National Jewish Medical and Research Center was the
originator of the project and supplied the data and biological background, and
her belief in the merits of the project has spurred its continuing development. Dr.
Krzysztof Cios of the University of Colorado at Denver and Health Sciences Center
has overseen the project since its inception. Drs. Peter Henson and Jenifer Monks,
both of NJMRC, have given generously of their time and expertise at critical
points in development and analysis. Thami Rachidi of UCD/HSC provided major
aassistance with data interpretation. All of these individuals have the authors
profound thanks and gratitude.


CONTENTS
Figures .................................................................. x
Tables................................................................. xi
Chapter
1. Introduction..................................................... . 1
1.1 Computational Background............................................ 1
1.1.1 Existing Clustering Methods......................................... 1
1.1.2 SiMCAL vs. Existing Methods......................................... 4
1.1.3 Project History..................................................... 7
1.2 Biological Background............................................... 8
1.2.1 The Phosphatidylserine Receptor (PSR)............................... 9
1.2.1.1 Significance of PS and PSR....................................... 9
1.2.1.2 The PS/PSR Mechanism ........................................... 10
1.2.1.3 Avenues of PS/PSR Research...................................... 12
1.2.2 Cystic Fibrosis (CF)............................................... 14
1.3 Notations and Definitions........................................ 15
2. SiMCAL 1............................................................. 19
2.1 Data and Preprocessing .......................................... . 19
2.1.1 Raw Measurements and Reliability................................... 20
2.1.2 Combination and Normalization...................................... 21
2.1.3 Gene Expression Calculation ....................................... 23
2.2 Clustering and Linking............................................. 24
vii


2.2.1 Distance Measures and Matrices
24
2.2.2 Ranks and Cluster Formation..................................... 30
2.2.3 Linking......................................................... 35
2.3 Reporting........................................................ 37
2.4 Biological Interpretation......................................... 39
2.4.1 Internal Analysis............................................... 40
2.4.2 External Analysis............................................... 42
3. SiMCAL 2.......................................................... 43
3.1 Data and Preprocessing ........................................... 43
3.1.1 Measurement Reliability and Combination......................... 45
3.1.2 Algorithm Outline.............................................. 46
3.1.3 Simulated Data................................................. 47
3.2 Clustering and Linking............................................ 47
3.2.1 Distance Measures and Matrices ................................. 47
3.2.2 Ranks and Cluster Formation..................................... 49
3.2.3 Linking......................................................... 52
3.3 Differentiation.................................................. 53
3.4 Reporting......................................................... 58
3.5 Performance....................................................... 59
3.6 Biological Interpretation......................................... 64
4. SiMCAL 3.......................................................... 68
4.1 Data and Preprocessing ......................................... . 68
4.1.1 Algorithm Outline.............................................. 69
4.1.2 Simulated Data................................................. 70
viii


4.2 Clustering and Linking................................................ 70
4.3 Interface and Performance Comparison.................................. 74
4.4 Biological Interpretation............................................. 74
5. Conclusions and Future Directions....................................... 77
References................................................................. 79
IX


FIGURES
Figure
1.1 Simple dendrogram showing cuts at each level.................... 3
2.1 Expression levels of x, y, and z............................... 26
2.2 Expression levels of x and y................................... 28
2.3 Expression plot for G4355 and its clusters.................. 35
2.4 Part of the introductory screen............................. . 37
2.5 Navigation through cluster relationships.................... 38
2.6 (^2,84 and its constituent genes............................ 40
3.1 SiMCAL 2 search interface................................... 59
3.2 All genes in DchipPM_MM..................................... 60
3.3 Clusters and genes in DchipPM_MM............................ 61
3.4 Execution times for SiMCAL 2 components........................ 62
3.5 Execution times for SiMCAL 2 overall........................... 64
4.1 A gene from simulated SiMCAL 3 data, m 8, with its clusters. . 75
x


TABLES
Table
2.1 1 p vs. A2............................................................ 27
3.1 SiMCAL 2 accuracy figures............................................... 58
3.2 SiMCAL 2 run times, in seconds.......................................... 63
3.3 Total genes under consideration, and differential genes discovered. . 65
3.4 Genes shared by datasets after preprocessing............................ 66
3.5 Differentially expressed genes shared by datasets....................... 66
xi


1. Introduction
This chapter covers the computational background, including an overview of
existing clustering methods as opposed to SiMCAL, as well as a project history;
and the biological background, with explanations of the significance of analyzing
both PSR and CF data. Also, a number of notations and definitions which will
be useful in later chapters are introduced.
1.1 Computational Background
Clustering has a rich history in the analysis of microarray data. It is an
undirected method of data mining which attempts to find clusters, or groups of
data elements, which represent some underlying structure in the data by means of
an affinity between the data elements in any one cluster[4, 375-430]. A variety of
clustering methods exist, all of which represent various tradeoffs between precision,
reproducibility, and temporal and spatial computational requirements. In many
cases, the choice of methods is fairly arbitrary.
This section looks at two of the most common existing clustering algorithms,
and examines their advantages and disadvantages. A comparison with the SiM-
CAL algorithm family is then offered, with an explanation of the advantages of
this approach. Finally, the evolution of the SiMCAL project is described.
1.1.1 Existing Clustering Methods
One of the most popular methods of clustering is k-means[4, 382-384]. This
method requires the user to choose an arbitrary number, k, of cluster centers.
Cluster centers are not data elements, but they closely resemble them, in that
1


they have values which place them within the data space. Usually, cluster center
locations are randomly chosen at the beginning of the algorithm. The algorithm
then uses an iterative process to fit the cluster centers as closely as possible to the
data.
K-means has the advantage of being computationally inexpensive it is
0(ikn), where i is the number of iterations (a maximum value also being specified
by the user in advance), k is the number of clusters, and n is the number of
data elements. Because generally, i,k n, this approaches 0(n) performance.
Its disadvantages are: first, it requires the user to specify the number of clusters
in advance, regardless of whether that number bears any relation to the actual
structure of the data; and second, it produces convex (elliptical or cubical) clusters
which, again, may not fit the data well.
Hierarchical clustering [4, 379-382] is another popular method. It is an ag-
glomerative technique in which initially, each data element is assigned to its own
cluster. Then links are built between the points to form new clusters, which are
linked with existing clusters (whether single-point clusters representing the orig-
inal data, or multi-point clusters built in previous steps) in a stepwise fashion.
Generally, at each step, the clusters which are linked are the two which are closest
to each other. The end result is a dendrogram, similar to a family tree, showing
the relationship between points and the clusters they form.
The main advantage of hierarchical clustering is that the dendrogram can be
cut at any level of resolution to determine the number and characteristics of
the clusters at that level. In Figure 1.1, by cut number, the clusters are:
1. Two clusters: {xl, x2} and {x3, x4, x5, x6}.
2


Figure 1.1: Simple dendrogram showing cuts at each level.
2. Three clusters: {xl, x2}, {x3}, and {x4, x5, x6}.
3. Four clusters: {xl, x2}, {x3}, {x4}, and {x5, x6}.
4. Six clusters: each data element has its own cluster.
Clearly, this offers an advantage over k-means, in that the number of clusters is
not fixed ahead of time, but grows naturally out of the data. However, the number
of clusters can also offer a disadvantage. For large data sets, many cut levels are
available, and choosing between them may be impossible. Interpretation of Figure
1.1 is simple enough; but how best to interpret a dendrogram with thousands of
data elements at the base, and hundreds or thousands of possible cut levels?
A related and perhaps more serious problem is that of computational ex-
pense. The computational complexity of hierarchical methods is generally given
as 0(n2), which is already significant for large data sets. However, this is an opti-
mistic figure, because each step of the algorithm involves considerable amounts of
calculation. In other words, n2 is a lower bound on complexity; thus hierarchical
3


clustering is actually fl(n2)[21, 32-33]. Depending on the data, complexity may
actually be much greater.
Consider that at every step of the algorithm, the matrix representing the
distance (or similarity, depending on the precise variant of the algorithm being
used) between clusters must be searched, a new cluster must be formed, and then
the distance matrix (or a portion of it) must be recalculated. In degenerate cases,
where each step adds an original data element to a single cluster (note that Figure
1.1 is close to this situation) complexity may actually be 0(n3). For most data
sets, 0(n2lgn) is probably a realistic estimate.
1.1.2 SiMCAL vs. Existing Methods
The origin of SiMCAL lies in the Jarvis-Patrick (JP) algorithm [16], which has
been widely used in fields ranging from astronomy to chemoinformatics, but has
so far not been common in bioinformatics applications. JP begins with two user-
defined parameters: k and kmin. For each data element, a list of that elements k
nearest neighbors is calculated. Then the algorithm states that two data elements
are in the same cluster if either of the following conditions applies:
They are within each others lists of nearest neighbors.
They have at least kmin nearest neighbors in common.
The advantages of JP are threefold. First, because it allows chains of relation-
ships between data elements e.g., a may be clustered with b, and b clustered
with c, so that a and c end up in the same cluster even if they are not obviously
related to each other it allows non-convex clusters which may more accurately
4


reflect patterns in the data. Second, it is largely non-parametric; it does not im-
pose an arbitrary threshold on similarity between elements in order for them to
be clustered together, but operates primarily by ranking, and is thus less sensi-
tive to outliers than parametric methods. Third, and perhaps most important,
it provides a method of determining the natural number of clusters within the
data, rather than requiring a pre-selected figure as with k-means or an arbitrary
cut as with hierarchical clustering.
However, it has disadvantages as well. The values of k and kmin are arbitrary
and must be pre-selected; also arbitary is the definition of what degree of simi-
larity constitutes neighborness. Also, depending on the size of these values, it
may be computationally expensive; because of the requirement to compare lists
of neighbors, the algorithm is 0(kn2), probably less complex than hierarchical
clustering but much more expensive than k-means. Finally, it provides only one
level of resolution the algorithm in its basic form provides no way of building
hierarchical levels of clusters which may contain each other.1
The idea of SiMCAL is to address these shortcomings, as well as to provide
additional functionality not found in any of the algorithms so far examined. To
meaningfully be labeled SiMCAL, an algorithm must possess the following at-
tributes:
1. It must be simple, in that it has low computational complexity, preferably
0(n2) or less, and/or relatively few and lightweight computations are per-
Tt is, of course, possible to provide multiple levels of resolution by iterating through pos-
sible values of kmin; SiMCAL 1 uses a similar but not identical method, optimized for low
computational expense.
5


formed at each step of the inner loop. Subjectively, the user should feel that
the algorithm is fast compared to other methods.
2. It must be multilevel, in that it provides -a hierarchical relationship between
clusters, but with fewer levels than simple hierarchical clustering provides.
Therefore, the user can, to use a laboratory metaphor, switch the objective
to examine the results at a few well-defined levels of resolution.
3. It must provide linking between clusters, in that it yields information on the
relationship between different clusters at the same level in the hierarchy.
SiMCAL 1, 2, and 3 each fulfill these requirements. With regards to compu-
tational complexity, they are strictly 0(n2), in contrast to hierarchical clustering.
Also, their inner loop calculations are much more lightweight than those in hier-
archical clustering. Unlike k-means, however, algorithms in the SiMCAL family
do establish a hierarchical relationship between clusters. Finally, they share the
advantage of JP in that they show the natural number of clusters in the data
at each level in the hierarchy, while being less expensive than JP and offering the
unique advantage of providing information on the relationship between different
clusters at the same level.
The nature of this relationship varies depending on the type of data being
analyzed. In SiMCAL 1, the relationship is temporal. The user can then trace
the relationship through multiple clusters by navigating a directed graph. In
SiMCAL 2, the relationship is spatial, based on genes which are shared between
clusters, and the user can navigate an undirected graph. SiMCAL 3 offers both
temporal and spatial linking between clusters.
6


1.1.3 Project History
The original Fall 2002 version of the project, which used k-means cluster-
ing, was written using PostgreSQL[31] as the database server, PL/pgSQL[30] as
the language for execution data import, preprocessing, and clustering and
database interaction, and PHP[29] for the visualization and reporting interface. A
Spring 2003 version, also based on k-means, was written using MySQL[20] as the
database server, Perl[28] and Perl Data Language[27] for execution and database
interaction.
Work on the SiMCAL 1 algorithm began in Fall 2003. The Summer 2004
version, which is identical in all but minor detail to the current version of SiM-
CAL 1, uses a command-line application written in Python[32] and Numerical
Python[24], or NumPy (a set of numerical extensions to Python which provide
similar functionality to dedicated numerical languages such as MATLAB[18] and
Octave[25]) and MySQL as the DBMS. SiMCAL 2, developed in Fall 2004 and
Spring 2005, and SiMCAL 3, developed in Summer 2005, also use NumPy and
MySQL. The Web-based visualization and reporting interface remains in PHP,
and Gnuplot[ll] is used to produce high-quality gene expression plots which can
then be displayed as part of this interface.
The use of these various open-source tools, all of which are free for non-
commercial (i.e., academic and research) use, has been a powerful factor in making
the project financially feasible, and in contributing to rapid development.
Each of the algorithms is designed for a slightly different type of data, which
will be covered in detail in later chapters. In brief, SiMCAL 1 is designed for
clustering time-series data, with the goal of finding clusters of genes with simi-
7


lar behavior patterns over time under experimental conditions and then finding
succession relationships between these genes and the clusters which contain them.
SiMCAL 2 is designed for analysis of single-point measurements of gene expres-
sion under both experimental and control conditions, in order to determine which
genes are differentially expressed under experimental conditions, and finding spa-
tial relationships between the clusters. SiMCAL 3 combines the two approaches,
offering a means to find differential expression of genes in time-series data and
then find succession relationships between them and their clusters.
Overall, although all the SiMCAL algorithms are considerably more complex
than k-means, careful selection of language and platform as well as steady tuning
of the algorithms has resulted in better than an order of magnitude performance
improvement over the original version from ten to fifteen hours for all operations
(data import, preprocessing, and analysis) in the original version, to under an
hour today, on a midrange laptop. Performance on a dedicated server should be
excellent.
1.2 Biological Background
Both the phosphatidylserine receptor (PSR) and cystic fibrosis (CF) are areas
of intensive research. PSR appears to play a vital role in mediation of inflamma-
tory response, and understanding the mechanism by which PSR fulfills this role
may be important to research on the mechanisms of many diseases, including CF.
CF is a serious and widespread disease which results in significant morbidity and
mortality, and is presumed to involve large portions of the human genome; thus
being able to determine which genes are differentially expressed in CF patients
may lead to important advances in diagnosis and treatment.
8


Both PSR and CF are covered in this section. PSR is covered in detail because
it is a newer topic and not as widely known. CF is covered in brief; the information
given here is summarized from [3, 2366-2371] unless otherwise cited.
1.2.1 The Phosphatidylserine Receptor (PSR)
Normal cell death, or apoptosis, requires regular housekeeping activity within
all multicellular organisms. In vertebrates, this activity is performed by macrophages,
which must be able to distinguish between three classes of cells: healthy cells,
which the macrophages must not consume; apoptotic cells, which must be con-
sumed without triggering an inflammatory response; and cells undergoing lysis
death from some external cause such as injury or disease, which must also be
consumed, but which should trigger an inflammatory response[19].
In invertebrates such as the ubiquitous C. elegans, less specialized cells per-
form housekeeping activities with respect to both apoptotic and lytic neighbors[8],
but the same challenges are presented. Since cells are constantly undergoing
apoptosis throughout the bodies of all multicellular organisms, the mechanism for
preventing inflammatory response must work perfectly or disease will result. In
humans, serious diseases such as CF, lupus, and some forms of diabetes may be
at least partly the result of a breakdown of this mechanism[37].
1.2.1.1 Significance of PS and PSR
Phosphatidylserine (PS) is a phospholipid generally found on the cytosolic
surface of the cell membrane in most animal cells. Apoptotic cells appear to
shift PS rapidly to the exterior of the membrane, at least partly as a signal for
phagocytosis. However, it is far from the only signal for this process. What makes
PS interesting?
9


The answer is the effect PS has on the macrophage. The PS receptor, or
PSR, is believed to be a crucial molecular switch in the control of immune
response[15]. Specifically, when PS is expressed in apoptotic cells, it appears to
reduce inflammation and other aspects of the immune response by engaging the
PSR in macrophages. Cells undergoing lysis do not show this behavior, allowing
the full range of immune response, including inflammation, to come into play.
The fact that PSR is a highly conserved protein, present and functional in
species from C. elegans to H. sapiens, indicates that it plays an important role.2
In some physiological systems, apoptosis is such a frequent event that any in-
flammatory response it might produce must be downregulated to the point of
elimination. For example, the entire pool of neutrophils (a type of white blood
cell) in the body turns over about 2.5 times per day[15]. Any significant inflam-
matory response to such a high rate of apoptosis would (and, in diseases such as
systemic vasculitis, does) have disastrous physiological consequences.
1.2.1.2 The PS/PSR Mechanism
Fadok, Henson et al. propose a tether and tickle mechanism for the action of
the PS-PSR complex. A variety of adhesion ligands on the surface of the apoptotic
cell induce the initial attachment of the macrophage the tether. But the
tickle provided by the apoptotic cells PS to the macrophages PSR is what
signals the macrophage to both engulf the cell and to mediate immune response.
2It is generally accepted that highly conserved genes, and therefore the proteins they ex-
press, are highly functional in the species across which they are conserved. The assumption is
that important sequences which undergo significant mutation will have dramatic, and generally
unfortunate, consequences for the evolutionary fitness of their carriers, while less important
sequences are more likely to change randomly over time without having such effects.
10


PSR antibodies may cause the macrophage to down-regulate inflammation by
inducing the secretion of TGF-/3[19].
Mature macrophages possess a wide variety of recognition molecules on their
surfaces to bind to apoptotic cells. Many of these molecules, such as the f3-
integrins, have very low specificity essentially, they act as signals for the
macrophage to bind to any of its possible targets, including lytic cells which
normally create an inflammatory response[8]. This, of course, is the tether. The
low specificity of these molecules allows the macrophage to bind to both apoptotic
and lytic targets with ease, but it also means that they cannot act as regulators
for the immune response. That is left for the tickle, a more specific molecular
signal: the interaction of apoptotic cell PS with macrophage PSR.
Phosphatidylserine is found on the surface of all apoptotic cells[15]. In healthy
cells, PS is confined to the cytosolic (inner) leaflet of the membrane bilayer, ex-
cept for transient expressions on the exoplasmic (outer) leaflet which are quickly
corrected. During apoptosis, phospholipid scramblases allow chaotic movement
of phospholipids in both directions across the membrane. (Low-level activity of
these scramblases in healthy cells probably accounts for the transient expression
of PS on the cell surface.) PS which approaches the cell surface in healthy cells is
relocated inside the membrane by aminophospholipid translocases (APLTs). In
contrast, APLTs seem to inactivate or be overwhelmed during apoptosis.
The observation that close homologues of PSR are to be found in C. elegans
and D. melanogaster, organisms which lack macrophages3 goes hand-in-hand with
3D. melanogaster possesses specialized immune cells called hemophages, which function sim-
ilarly to macrophages in vertebrates.
11


the fact that most tissue cells, not just dedicated phagocytes, can ingest apoptotic
cells[8]. This is an elegant mechanism for dealing with the immunological problem
presented by apoptosis.
Tethered cells (macrophages or general tissue cells) which are stimulated by
PSR increase their uptake not only of the PS expressing cell, but of other cells
to which they are tethered[8]. Furthermore, blocking the effect of PSR greatly
reduces the uptake of apoptotic cells in general. Both the tether and the tickle are
necessary for apoptotic cell uptake, but neither by itself is sufficient. Together,
they provide a powerful and strikingly efficient mechanism for disposal of apoptotic
cells.
PSR seems to have far-reaching effects on various aspects of possible inflam-
mation. PSR on immature dendritic cells appears to inhibit maturation of these
cells, which in turn prevents these cells from processing and presenting various
antigens which might cause inflammation. Since mature dendritic cells are re-
quired for processing of most antigens in order for the antigens to be recognized
by the rest of the immune system, this effect of PSR constitutes a powerful im-
munosuppressive and thus anti-inflammatory effect[15].
1.2.1.3 Avenues of PS/PSR Research
One major question about the PSR effect is why lytic cells, which also contain
PS which normally resides on the cytosolic leaflet of the cell membrane but can
be expressed on the exoplasmic face during cell death, do not engage PSR. Ob-
viously this would be a physiologically inappropriate response; the consequences
of suppressing inflammation in the case of lysis could be just as severe as those
of allowing inflammation in the case of apoptosis, because inflammation is a cru-
12


cial component of the immune response. (The major functions of inflammation
in immune response are to attract and activate immune system components such
as B and T lymphocytes, to begin tissue repair and remodeling, and to contain
pathogens at the inflamed site rather than allowing them to spread throughout the
body[36, 440-443].) Although the mechanism by which lytic cells avoid engaging
PSR is not yet known, there are at least two proposed mechanisms:
First, cytoplasmic proteins such as annexins may bind PS and prevent it from
engaging PSR. Lysis is a violent process compared to apoptosis, and interactions
between the cell membrane and the cytoplasm during lysis might provide an op-
portunity for such binding.
Second,proteases which are released only during lysis might cleave PSR in the
engulfing cells (macrophages in vertebrates, general tissue cells in invertebrates.)
PSR is highly susceptible to protease cleavage, and lytic cells express proteases to
a degree not found in cells in any other condition. Less dramatic protease cleavage
of PSR might also be involved in allowing immature dendrites to mature in cases
where this response is appropriate.
Clinically, high levels of serine proteases such as neutrophil elastase are often
found in patients suffering from inflammatory disorders, or disorders which have
a significant inflammatory component, such as CF and bronchiectasis[37]. It may
be that these proteases would make useful targets for drug therapy which, by
reducing the protease levels, would allow more PSR regulation of inflammatory
response.
Because of their cell walls, bacterial and fungal pathogens cannot express PS
on their surfaces, and therefore cannot engage PSR. On the other hand, tumor
13


cells can and do express PS one of the main reasons why the immune system
is much less effective against cancer than against infection.
1.2.2 Cystic Fibrosis (CF)
CF is a hereditary disease which occurs in about 1 of every 5,000 births. It is
one of the most common fatal genetic diseases, and the median age of death for CF
patients is only 31 years. Although primarily thought of as a respiratory disease,
it can affect many major body systems: symptoms include not only respiratory
insufficiency and increased sputum production resulting in coughing and diffi-
culty sleeping, but also growth deficiency, heart disease, increased vulnerability to
many infectious diseases, and pancreatic insufficiency leading to insulin-depended
diabetes. Most exocrine glands (those which secrete substances which are trans-
ported outside the body or to body cavities, via the skin or mucous membranes)
are also affected by the disease, often leading to digestive difficulties and infertility
among other problems.
The cause of CF is believed to be a mutation on a single gene on human
chromosome 7, which is carried asymptomatically as a recessive trait by 2-3% of
the population. However, the diseases wide-ranging effects on the body make
it reasonable to expect that many genes in CF patients will be either up- or
down-regulated, making patients with the disease (and, of course, healthy control
subjects) fruitful subjects for genetic analysis, including microarray experiments.
Complicating the genetic analysis of the disease is the fact that, although the
disease is probably caused by a mutation on a single gene, there are 600+ distinct
mutations to this gene which may be able to cause the disease to a greater or
lesser degree of severity.
14


The various mutations associated with CF all cause sequence and structural
changes in the cystic fibrosis transmembrane regulator (CFTR) protein, which
appears to be part of a chlorine (Cl) and sodium (Na) transport channel across
epithelial membranes (membranes lining the surface of the body and body cav-
ities.) This channel is regulated by cyclic adenosine monophosphate (cAMP), a
substance which plays a vital role in metabolism[38, 442], Thus genes which might
be expected to be up- or down-regulated in CF patients include those associated
with CFTR, with Cl and Na transport, with cAMP synthesis and breakdown, with
metabolism in general, and with the various organs and body systems affected by
the disease in short, a significant portion of the entire human genome.
1.3 Notations and Definitions
It is useful here to define some notations and terms used throughout the
following sections. Equations in this section which are not numbered may be
found in [6, 113-117, 216-218]. Numbered equations are original, unless otherwise
cited.
Given two vectors x = (x1: X2, , xn) and y = (yi, y2, , yn), and a scalar
z; and for arbitrary functions /(x) such as x2, g(x, y) such as xy, or /i(x, z) such
as x + z, unless otherwise specified:
/(x) = (f(xi), /(x2), , /M); (L1)
p(x,y) = {gixuyi), g(x2,y2),.. , g(xn, yn))\ (1-2)
h(x) = (h(xi, z), h(x2, z),..., h(xn, z)). (1.3)
15


The mean or expected value of x is:
1 n
-T,-.
n < J
Px / %i-
i= 1
The variance and standard deviation of x are:
^(X) P(x /ix)2 Px2 /^xl
(7X = \/V(x).
The covariance of x and y is:
Cov(x, y) = P(x-/ix)(y-/iy) = Pxy PxPy
The correlation coefficient [6, 218-220] of x and y is:
Cov(x, y)
yx,y
(JyiVy
(1.4)
If X is a matrix such that X = (xi, x2,..., xm) but y is a vector as above:
PX,y (Pxi,yj Px2,yi Pxm,y)-
(1.5)
If X and Y are both matrices such that X = (xl5 x2,..., xm) and Y =
(yi, y2,..., yn), then the correlation matrix is:
Px, y =
^ Px i -yi Pxi >ya Pxi ,yn ^
Px 2 ,yi Px2 ,Y2 Px2,yn
\Pxm,yi Pxm,y2 ' Pxm,yn /
(1.6)
16


The weighted mean of x and y is the mean of x weighted by the values of
y. In other words, if y = (1,2,1), then in the calculation of the mean, x2 has
a weight equal to the combined weights of X\ and x3, which have weights equal
to each other. The definition of the function depends on the dimension of its
arguments. For x = (aq, x2,..., xn) and y = (yu y2,.. , yn):
If X is a matrix such that X = (xi, x2,..., xm) but y is a vector as above:
If X and Y are both matrices such that X = (xi, x2,..., xm) and Y =
(yi,y2,---,ym), then:
n
(1.7)
1=1
W(X,y) = (lV(xuy),W(x2,y),...,W(xm,y)).
(1.8)
W(X,Y) = (W(xlt y1),H/(x2,y2),...,W,(xm,ym)).
(1.9)
Let max0(x) and maxo(X) be defined as:
max0(x) = (max(xi, 0), max(a:2, 0),..., max(rn, 0)). (1.10)
max0(X) = (maxo(x!), max0(x2),..., rnax0(xm).
(1.11)
If X is m x n, then the transpose of X [9, 17] is XT such that:
Xj^X^VielLmUetLn]
(1.12)
17


The identity matrix of size n [9, 83] is the n x n matrix In such that:
f 1 if % = j
Ii,j = \ Vi, jf E [1, n\. (1.13)
0 if i J- j
Minkowski distances are widely used distance measures of the family:
AP(x,y) =
The best-known Minkowski distance is probably the Euclidean[4, 377]:
^|xj -yi\P.
\ i=i
A2(x,y) =
A - Vi)2-
N<=l
(1.14)
Another very useful Minkowski distance is the simplest possible case, the city
block distance[4, 377]:
n
Ai(x,y) = ^2 \xi ~ Vi\ . (L15)
i=i
Given two sets A and B, and that for any set S = {si, S2,..., s}, the car-
dinality of S is given by \S\ = n, let the congruence of A and B be defined
as:
Con(A, B) =
0 if A = 0 or 5 = 0
|Anffl2
\A\\B\
otherwise
(1.16)
It can be seen that Con(A, B) is a dimensionless measure of the degree to
which A and B intersect: Con(A, B) E [0,1]VA, B, and A = B =$ Con(A, B) = 1,
while 4nB = 0=> Con(A, B) = 0.
18


2. SiMCAL 1
SiMCAL 1, the original SiMCAL algorithm, is designed primarily to cluster
genes based on their behavior under experimental conditions over time [7]. A
secondary goal is to determine which genes and clusters accurately predict the
behavior of others. This chapter covers SiMCAL 1 in detail, and introduces a
number of concepts which will also be used in subsequent chapters.
2.1 Data and Preprocessing
To ensure that the data were accurate and replicable, three separate and
identical experiments (replicants) were performed. In each, microarrays containing
12,488 mouse genes known to be active in mouse macrophages, with each gene
being represented by between 5 and 30 RNA probes, were exposed to a single
concentration of monoclonal PSR antibody over time. Fluorescent hybridization
measurements were taken for RNA at 30 minutes, 2 hours, and 8 hours 30 minutes.
Thus, using the standard notation by which n denotes the number of variables
and m denotes the number of observations, n = 12488 and m = 3.
Identical measurements were taken of control samples which were treated with
an isotype, i.e., another IgM antibody with better-known effects. Another control
sample, of which only one measurement was taken, was not treated. Therefore,
seven distinct data points were generated for each probe. These steps were per-
formed using Affymetrix (Affy) microarray equipment and the Affy chip MG-
U74Av2, a standard chip containing mouse genes.
Preprocessing is of fundamental importance in any data mining operation.
The raw data, although interesting, do not yet contain usable information. The
19


primary goals of preprocessing in this case are to eliminate or reduce unreliable
data, combine separate data points in a way that takes all measurements into
account, and to normalize the data in such a way that clustering and linking will
produce meaningful results.
2.1.1 Raw Measurements and Reliability
For each raw gene in each replicant, there exists a set of measurements,
based on one measurement for the unexposed control, and time series measure-
ments for the isotype-exposed genes, iso, and the PSR-exposed genes, psr. In the
time series measurements, psrt and isot denote the values measured at time point
t. For the current data, t = 1, t = 2, and t = 3 indicate exposure times of \ hour,
2 hours, and 8| hours respectively. Therefore, overall:
3
measurementsgeneerepiicant = {control} (J{zso£,psrt}. (2.1)
t=l
Each of these measurements has an associated p-value, and a lower p-value
indicates higher reliability placed on the measurement [1]. A reliability can be
calculated for any gene in a replicant based on the average p-values. First, let
e = log0 95 0.05 ~ 58.4, and then reliability is:
TdgeneZreplicant = min(l {j)ValueSgene^replicant)) (2-2)
In other words, for any gene in any replicant, that genes reliability is the
minimum of the reliability for any measurement associated with it. One bad
measurement in an otherwise good time series could badly throw off all subsequent
calculations; by this method, the algorithm is aware of such measurements, and
will either discard them or assign them less weight in calculations, as described
20


in following sections. Eqn. 2.2 also ensures rel E [0,1], providing a simple and
consistent basis for calculation and interpretation.
The definition of e deserves some discussion. Affys own evaluation of measure-
ments reliability marks measurements having a p-value significantly above 0.05 as
absent,, those having a p-value around 0.05 as marginal, and those having a
p-value significantly less than 0.05 as present. The definition of e = log0 95 0.05
means that a marginal gene having a maximum p-value of 0.95 will have a re-
liability of 0.05, or 5%. This is the minimum reliability used in any subsequent
calculation and leads to very low but not zero weighting for measurements
which have a p-value only slightly less than or equal to 0.05.
This choice is arbitrary; it would be possible to let e = 1, or, at the opposite
extreme, a very large value such as e = 1000. In any case, e > 1 serves the purpose
of concentrating the more reliable values[4, 81]: the higher its magnitude, the
more emphasis is given to those expressions with lower p-values.
2.1.2 Combination and Normalization
Before calculating normalized values, replicants with rel < 0.05 are discarded.
This ensures that the analysis will proceed only with reliable data. Now, a con-
sistent basis is required for comparing expression levels between genes for which
reliable measurements exist.
Unfortunately, there no meaningful way to compare the raw expression levels
to each other for different genes; it cannot be said that if one gene at one time
point expresses at 5000 according to the Affy reading, and another at the same
time point expresses at 10000, that the second gene is expressing twice as much
of the relevant protein as the first. However, it can be said that if a single gene
21


expresses at 5000 at one time point and at 10000 at the next, that its expression
level has doubled [1].
Therefore, a measurement of the relative expression levels of the PSR-exposed
data points relative to the isotype-exposed data points and the untreated control
data point is defined in Eqn. 2.3. Note that lgx = log2x, that is, the base-2
logarithm of x.
expt = lg
Psrt
-,te {1,2,3}.
control x isot
Now let the expression level for the untreated control be expo = 0, and:
(2.3)
CXP;,= (expo, expi,exp2, exp3). (2.4)
Each expression level has been calculated to take into account the change
between the control expression level and the PSR-exposed expression level, and
just as importantly, the changes between the PSR-exposed expression level and
the isotype-exposed expression level. In other words, the normalized expression
level should reflect the difference PSR specifically makes in expression, as well the
difference PSR as an IgM antibody makes in expression. Consider two possible
cases:
Psrt
9 if (control, isot,psrt) = (1000,1000, 3000);
3 if (control, isot,psrt) = (1000, 3000, 3000).
control x isot
In the first case, PSR specifically is clearly responsible for up-regulating the
gene under examination, while in the second case, it appears to be the presence of
22


IgM antibody in general which is responsible for the up-regulation. Appropriately,
therefore, the first case has a much higher normalized value.
A logarithmic scale for normalized expression levels is used for two reasons,
both relating to revealing small differences in the expression levels. First, the
values of (psrf/(control x isot)) cover an enormous range, from roughly 0.00003
to 3500, and clearly any reporting mechanism which can cover such a range would
lack the precision to distinguish between a subtle but important difference such as
that between 3 and 9. Second, such a reporting mechanism would tend to overem-
phasize up-regulation and underemphasize down-regulation, since the magnitude
of the values for up-regulated genes is so much greater.
Finally, 2 is used as the base of the logarithm, rather than some other common
choice such as e or 10, because it makes interpretation straightforward: a change
of 1 in the reported value means a twofold change in expression. The use of base-2
log ratios has become standard in processing raw microarray expression values;
some recent examples are to be found in [12], [13], and [23]. The particular ratio
used in Eqn. 2.3 is original.
2.1.3 Gene Expression Calculation
At this point, reliability and expression values exist for each gene in each
replicant. The reliability score is used as a weighting factor in the averaging. In
addition to the expression vector exp and the reliability rel, it is also useful to
calculate variability, var, which is an overall measure of how much the gene reacts
to the experimental conditions. For each gene:
exPgene ^replicants(e^Pgeneereplicanti ^^geneereplicant)]
(2.5)
23


'^'^'^genc I^Tel gene£replicant 1
(2.6)
= i V fihs(exp,.,J. (2.7)
6 t= 1
Genes are assigned Id numbers, ordered from least variable (Id 0) to most vari-
able (with the current data set, for all genes that made it through preprocessing,
Id 5285). They may then be considered as an array, and individual genes denoted
by their position in the array for example, the gene with the Id number 2201
is ^2201-
2.2 Clustering and Linking
Now clustering can begin. This section describes the distance measures used
and how clusters axe formed, associations are mapped, and genes and clusters are
linked from this information.
2.2.1 Distance Measures and Matrices
The similarity measure, for any two genes, used throughout the clustering and
linking calculations is the correlation coefficient p as defined in Eqn. 1.4. This is
a general measure of the degree of linear dependence between two vectors i.e.,
the degree to which they tend to rise or fall together[6, 218-220].
This measure, often referred to as the Pearson correlation coefficient, has
become a canonical measure of similarity (or, in the inverse, difference) between
gene expression patterns in a wide variety of applications; some of many recent
examples include [13], [17], [35], and [39]. More complex time-series measures,
such as Box-Jenkins, generally require a larger number of data points than those
generated in this experiment, and also often have a requirement that the data be
24


evenly spaced; in contrast, p is robust for multidimensional data of almost any
dimensionality and interval.
Compared to other dependence measures, p has the useful quality that it is
dimensionless: pxy G [1,1] for all vectors of equal length x and y. Interpretation
is straightforward:
Axy = 0 indicates that x and y are linearly independent.
Pxy ~ 1 indicates increasing positive linear dependence; as x rises or falls, so
does y. Similarly, pxy > 1 indicates increasing negative linear dependence;
as x rises, y falls, and vice versa.
|Pxy | > 0.8 indicates a strong linear dependence between x and y, 0.5 <
|Pxy | < 0.8 indicates moderate dependence, and 0 < |pxy| < 0.5 indicates
weak dependence.
Why use a dependence measure as opposed to the common Minkowski distance
measures (Ch. 1.3) such as the Euclidean? The answer is that clustering is
based here on similarity in the changes in expression patterns under experimental
conditions, and a dependence measure such as p provides a better measure of the
relationship bewteen different genes up- or down-regulation over time than does
a Minkowski distance. Consider the following three vectors (see Fig. 2.1) which
represent possible values of expgene:
x = (0,0.5,1,1.5);
. y = (0,2,4, 5);
z = (0,0,-0.5,-0.5).
25


Figure 2.1: Expression levels of x, y, and z.
Clearly y and z are closer to each other by any Minkowski measure than
either is to x; just as clearly, however, the responses of x and y under experi-
mental conditions are more similar, since both are steadily upregulated while z is
downregulated. Comparing 1 p to A2 (so as to cast both values as measures of
distance, rather than similarity) shows that the choice of p as a similarity measure
reflects this.
As Table 2.1 shows, clustering these genes based on A2 would create the
clusters {x, z} and {y}, but clustering based on p creates {x, y} and {z}. Given
the actual reactions of the genes to experimental conditions, p is a much more
26


1 p : A2 X y z
X 0.00:0.00 0.01:4.85 1.89:2.55
y 0.01:4.85 0.00:0.00 1.91:7.38
z 1.89:2.55 1.91:7.38 0.00:0.00
Table 2.1: 1 p vs. A%.
reasonable measure.
The use of p is the reason that expo = 0. Recall that each gene is starting from
the natural (untreated) state, and changes in expression level are measured from
this baseline. If the expression vector were simply expgene = (expi,exp2,exp3),
this could lead to false correlations.
Suppose x = (1, 3, 2) and y = (2,0,1). (See Fig. 2.2.) Without exp0 =
0, px,y = 1, which is clearly inaccurate, since in fact x is downregulated under
experimental conditions while y is upregulated. Setting x = (0, 1, 3, 2) and
y = (0, 2, 0,1) gives a much more accurate value of px>y = 0.13.
Therefore, the initial step in clustering is to form a correlation matrix for all
genes under consideration. This is the most complex step in the entire algorithm,
but is strictly 0(n2), and is still computationally inexpensive because calculating
p is quite fast. Thus SiMCAL 1 as a whole is 0(n2), and the first SiMCAL
condition is fulfilled.
For the n genes g G [1, n] and expg = (exp9i0, expg^exp9y2, expg^), let
expgarly = (expg^,expg^expg^) and exp^ate = (expg^,expg>2expg^). Thus there
are three expression matrices:
27


Express i on
Time Point
Figure 2.2: Expression levels of x and y.
Exp = (exp!, exp2,..., exp);
Expear/y = (expWy, exP2Wy,..., exp^); (2.8)
Exp'ate = (exp^, exp!2ate,..., exp(fe).
These expression matrices are used to obtain the similarity matrices which
will be used in clustering. First, the raw matrices are:
28


Neighbors = max0(pExP,Exp);
Leaders = pExp(ote Expeariy; (2-9)
Followers = Leadersr.
The derivation of Neighbors is obvious. Leaders and Followers are some-
what deceptively named, but once they are used to obtain ranks based on the
values of their elements (as described in the next section) they do indeed provide
a measurement of the degree to which Expear/y predicts Expiate. In other words,
the maximum value in the ith row of Leaders is the index of the gene which best
predicts gene i: its Leader. Similarly, the maximum value in the zth row of
Followers is the index of the gene which gene i best predicts: its Follower.
However, the matrices in their current form are not adequate for clustering
to-proceed, and must be adjusted. First, Neighbors, used in clustering, is sub-
tracted from both Leaders and Followers, used in linking:
(2.10)
Leaders = Leaders Neighbors;
Followers = Followers Neighbors.
This step serves to prevent false leader-follower (succession) associations
between genes whose expression levels track each other very closely under exper-
imental conditions. Such genes should be clustered together using Neighbors,
but they should not be linked using Leaders or Followers.
To illustrate this, suppose that for two genes x and y, and some constant c > 0,
y ~ cx. In other words, x and y approach complete positive linear dependence.
Then x leads y and y leads x would be equally accurate statements. Since
the purpose of the succession analysis is the elucidation of biological pathways,
29


based on the assumption that a genes leaders precede it in these pathways and
its followers succeed it, this would not be very helpful. Thus the subtraction step,
which ensures that the more likely two genes are to be clustered, the less likely
they are to be linked.
Finally, genes should not be clustered or linked with themselves; this is ensured
by by reducing their reported associations to below min(p). To accomplish this,
first let the matrix Mask = 2In (see Eqn. 1.13) where n is the number of genes.
Then Mask is subtracted from all three similarity matrices:
Neighbors = Neighbors Mask;
Leaders = Leaders Mask; (211)
Followers = Followers Mask.
2.2.2 Ranks and Cluster Formation
The correlation matrices are now used to obtain rankings of association. De-
fine the rank of a genes relationship to another gene as its ranking in the ap-
propriate correlation matrix that is, rank 1 corresponds to the index of the
largest argument in a given row of the correlation matrix, rank 2 corresponds to
the index of the second-largest argument, and so on.
Let j argmax(NeighborSj). Then gene Xj is the rank 1 neighbor of x,.
(Note that i ^ j because of the mask subtraction step.) The rank 2 neighbor of
Xj is x*; where k = argmax(Neighbors,) | k ^ j; the rank 3 neighbor of x, is x;
where l = argmax(NeighborSj) | l ^ j A l ^ k\ etc. The calculations of leader
and follower ranks proceed in similar fashion.
As implemented, this calculation is efficiently performed on all three similarity
matrices (Neighbors, Leaders, and Followers) in their entirety, in one step,
30






using the NumPy argsort() function, which produces a sorted list of indices of a
matrix or vector, as shown below in Python sample code:
a = array([l, 3, 2, 4])
b = argsort(a)
# b = array([0, 2, 1, 3])
In this example, do is the rank 1 member of a, <22 is the rank 2 member, a\ is
the rank 3 member, and 03 is the rank 4 member.
The fliplr() function simply flips a matrix or vector from left to right, which
makes it easier to extract rank information in the order the algorithm demands.
Thus the code for finding ranked associations is:
Neighbors = fliplr (argsort (Neighbors))
Leaders = fliplr(argsort(Leaders))
Followers = fliplr (argsort (Followers))
Now clustering can proceed. Clusters are ranked just as neighbors, leaders,
and followers are; a cluster of rank 1 contains all genes which have each other as
neighbors of rank 1, and so on through higher ranks until only one supercluster
exists. In other words, the condition for clustering is that for any rank R > 0, a
gene is in the same cluster of rank R as other genes which it has as neighbors of
rank r < R. Thus at rank 1, a gene is in the same cluster as its neighbors of rank
1 (and those genes neighbors of rank 1, and so on); at rank 2, it is in the same
cluster as its neighbors of rank 1 or 2, etc.
This goal is achieved by starting, at rank 1, with each gene in its own rank
0 cluster, then finding its neighbor of rank 1, and placing that gene, along with
31


any other genes which may be in its cluster, in the same cluster as the first gene.
As the algorithm proceeds, of course, the clusters grow rapidly. Note that there
is no requirement that genes be directly related to each other to be placed in the
same cluster this fulfills the requirement, inherited from JP, that the algorithm
should be able to produce non-convex clusters. Instead, the algorithm navigates
along chains of close association through the data. If x has y as its nearest
neighbor, and y has z as its nearest neighbor, then x and z will end up in the
same cluster of rank 1.
Clusters of higher rank are produced by reusing existing data structures. At
rank 2, the genes are already clustered with their neighbors of rank 1, so there
is no need to repeat the entire process with each genes entire list of neighbors.
Instead, the existing rank 1 clusters are agglomerated into rank 2 clusters, and so
on through higher ranks. This process allows a hierarchy of clusters of different
ranks, expressed as parent and child relationships. Let A and B be clusters of
rank r, and C be a cluster of rank r + 1. If x G A, y E B, and x, y £ C, then C
is the parent of A and B, and A and B are children of C.
The following Python code from the clustering control file shows the core of
the clustering procedure, at any rank:
# cycle through genes
for Geneld in xrange(NumOfGenes):
# get IDs into easy-to-use forms
i = Geneld
j = GeneNeighbors[rank-l, i]
# cluster assignment
32


if GeneClusterIndexes[j] != GeneClusterlndexes [i]:
# merge higher-ID cluster with lower-ID cluster
localGCI = array ([GeneClusterIndexes[i],
GeneCluster Indexes [j] ])
live = min(localGCI)
dead = max(localGCI)
# update GeneClusterlndexes to reflect merge
for id in Cluster Genes [dead]:
GeneClusterlndexes [id] = GeneClusterlndexes [live]
# merge old cluster into new
ClusterGenes[live] += Cluster Genes [dead]
# empty out old cluster
Cluster Genes [dead] = []
# else: already in the same cluster
Clustering ends when the algorithm reaches a rank such that only one cluster
exists. The algorithm could, of course, continue operations up to rank n 1 for
n genes. As a practical matter, doing so would serve no useful purpose. In fact,
to speed up overall operations and reduce the amount of data storage required,
the constant MAX.RANK is defined; no neighbor, leader, or follower assocations
beyond this rank are calculated or stored. With the current data set, clustering
terminates at rank 4, and MAX_RANK is (generously) set equal to 5. Determining
the proper value of MAX_RANK is a sort of preprocessing step, based on early
analysis of the data in question, and of course the value may be changed for new
data.
33


Cluster centers are calculated, although they exist only for reporting and
visualization purposes rather than any further algorithmic uses. Like a gene, a
cluster has the expression vector expciuster = (exp0, expi, exp2, exp3). For a given
cluster C with m genes as members, let G be the matrix of the expression vectors
of the genes in the cluster:
G (xi, X2, , xTO) | xi, X2, 1 xm E C.
Also let rel be the vector of the reliabilities of those genes:
rel = (relurel2,... ,relm).
Then the location of the cluster center is:
expc = W(G, rel). (2.12)
Like genes, clusters are given Id numbers, starting at 0 within each rank and
going up to the last cluster in the rank. They are not ordered by variability or any
other inherent feature, although practically speaking, because the clustering algo-
rithm proceeds along the gene array as ordered by Id, higher-numbered clusters
will generally have higher variability. The ranks, and the clusters within them,
may be considered as a matrix, and clusters may be uniquely identified by their
position within the matrix for example, the cluster of rank 2 with Id 47 is 62,47.
With regards to computational complexity, the creation of clusters at each
rank is of linear complexity, or 0(n). Because MAX.RANK is a constant, the
entire clustering process is 0(MAX_RANK n) = 0(n). If MAX_RANK is con-
sidered a variable, which in a sense it is because it depends on the structure of
34


Time Point
Figure 2.3: Expression plot for G4355 and its clusters.
the data, note that MAXJRANK performance. Calculation of cluster centers is also 0(n). Therefore, the most
complex step in the algorithm remains the calculation of the correlation matrices
as described above, and the SiMCAL 1 algorithm is 0(n2) overall.
2.2.3 Linking
Linking proceeds in a similar fashion to clustering. Like neighbors, leaders
and followers are ranked up to MAX_RANK. A gene x leads a gene y at rank r if
x is the rank r leader of y; similarly, a gene z follows y if z is the rank r follower
of y. This leads naturally to the concept of successions, the union of leaders and
followers in this example, y succeeds x and z succeeds y. Also, y is a successor
35


of x, and so on.
Now let A and B be two clusters of rank R, and x A and y E B. Then A
is a leader of B, and B is a follower of A, if y succeeds x at rank r < R. In other
words, for one cluster to lead another (and for the second cluster to be a follower
of the first cluster; the two cases are identical) it is sufficient either for a gene in
the first cluster to be a leader (at a rank equal to or less than that of the cluster)
of a gene in the second cluster, or for a gene in the second cluster to be a follower
of a gene in the first cluster.
Since clusters at higher ranks may be very large, containing numerous genes,
some have many leaders and followers. This is desirable for the possible elucida-
tion of biochemical pathways, since such pathways generally have many branch
points. Representing the set of clusters as a directed graph allows navigation
along many leader-follower paths through the graph, and comparison of those
paths with biochemical pathways of interest.
Analysis of the linking yields some unexpected results. For instance, at rank
3 (the highest rank with more than one cluster) there is a single very large cluster
(Cluster Rank 3, Id 0, or C^o) containing thousands of genes which showed little
change in expression under experimental conditions. This is not surprising, as
the majority of genes on the Affy mouse chip are presumably not involved in
inflammatory response or other PSR-mediated activity.
What is surprising is that the majority of the other, much smaller clusters
at rank 3 have C^o as their sole leader and follower. This may imply that the
rank 3 clusters are too large to be useful in analysis, and that the rank 1 and 2
clusters are more useful when trying to determine the biological meaning of the
36


clustering results. On the other hand, it may have some biological significance
perhaps implying that PSR-mediated reactions generally follow short pathways
which involve few genes at a time.
2.3 Reporting
The reporting interface is Web-based and is designed to present a user-friendly
means of analyzing the results. It provides an introductory page, general informa-
tion about clusters and genes, the ability to search for and view specific clusters
and genes, and a guide to interpretation (Fig. 2.4).
Menu Welcome to the SiMCALl gene expression p
Genes
Clusters
Gene Search
Cluster Search
Logarithms
"Genes" will show statistics about the genes i
"Gene Search* will provide an interface to se
"Clusters" will show statistics about the over
"Cluster Search" will provide an interface to
"Logarithms" will provide tables for con versi
relative to baseline, and vice versa. You may
A guide to usage:
Buttons found as table headers may be used I
search results, save user annotations, or displ
To open genes or clusters found in tables, cli
Figure 2.4: Part of the introductory screen.
Expression levels of both genes and clusters are shown as measured data and
interpolated values. The interpolation is via the cubic spline method [10, 238-248]
which is a fairly robust and stable method. One problem with cubic splines is that
they do not work well for unevenly spaced data; thus the values of expo, expi,
37


exp2, and exp% are shown as evenly spaced over time, which may be deceptive
if the user is not watching for this. On the other hand, the interpolated values,
while they should certainly not assumed to be as accurate as the measured values,
should be fairly representative of the actual activity of the gene over time.
Relationships between clusters, both hierarchical and leader-follower, may be
navigated from any cluster (Fig. 2.5). The genes within the clusters are also
available for view. Cluster, leader-follower, and neighbor relationships for genes
may be similarly navigated. Any of these relationships can be shown on the same
plot as the main entity under consideration, giving the user a quick means to
visualize the relationships (Fig. 2.3).
Related Cents and Clusters |
Leaders
I Id 1 Same |Rank |Exo0:30 |Exd 2:00 |Exa 8:30l
[256 ]jl8J284_r_al ir |-0.161 |0101 |0.108 j
[278 i|l 60621 _at ,|2 [-0.066 jo.180 |0-134 j
[431 j|98991_at |2 [-0.070 JO.225 |0.159 J
[489 i|96797_s_at |2 1-0.171 |0.173 [0.133 j
[749 [97204_s_at |2. [-0.135 |0.242_ |o.i 86 _J
[973 i|85488_at |2 1-0.197 |0.243 |0 188
[1320|[97403_at :|2 j-0.359 |0 163 jo.212 ;
[I370[94827_at [2. [-0.289 |0 227 [0.230
jl503j[97703_at |2 >0.318 [0.243 |0 228 j
jl554i[104025_at ;|2 1-0.374 [0.186 |0 244 j
jl620!|180353J_at |2 >0.243 |0.324 |0 255 j
[171 Bi[101035_at |2 1-0.287 jo.312 |0.250 j
|1737j|98911_at J2 1-0.316 jo.300 jo.237 j
|1777i|95648_at |2 [-0.399 jo 230 jo.235 !
|l947;|93318_at |2 I-0.325 [0.335 J0.255 [
|l950i|l01741_at ;|2 [-0.443 [0.225 [0247 [
|l956||l62500_f_at !|2 |-0.446 i|0.211 [0.260 ]
I212SH93228 at |2 I-0.229 I0.400 I0.337
| u iRank iMembenilExp 0:30;lExp 2^)0 IExd 8:30
iot F... _[0.411 j[q.308 >0.220 j
[8lF [28 [-0.056 >0.040 >0.651
EE >38 [0.353 jjo.493 [0.618 J
IIEF |19 '[0.188 Foi- >0429 j
W'W |148 [0.178 ;|0.765 >0.012 ;
l7Pj[2.. 140 _[0.318 ][q.37i _|-0 215 !
(w"F h [0.782 |0.740 [0 216 J
[89 |2 :|14 JO.161 jo.203 I-0.362 :
[1#8|2 |5 (1.406 :[1.148 J0.776 |
|181!|2 |9 :[1.039 l1 14-1 [0.355 !
[TSfF [0 259 [0.251 >0.270
Followers
I Id iRank lMembcn!lExpQ:30 Ied 2:00 lExn 8:30
[2 1195 >0.343 >0.104 [0.471
A |2 ;|12. |0.051 >0.162 [0.399 :
Figure 2.5: Navigation through cluster relationships.
38


Both genes and clusters may be annotated by the user. In addition, the Affy
annotation for genes is displayed when the gene is viewed. Unfortunately, the
Affy annotations provided with the data are often not especially informative; [22]
is recommended as a source for more complete annotations on genes of interest.
2.4 Biological Interpretation
A major challenge in biological analysis of the clustering results is in the
quality of the data. The data used in this phase of the project have proven
to have an unfortunately large number of measurements that were not deemed
reliable enough to make it through preprocessing. As a practical matter, this
means that many genes of interest were not included in the final data set.
The number of time-series measurements used here is small; although these
particular time points were chosen deliberately based on the current understanding
of the biological activity of PSR, a denser and more evenly spaced time series
perhaps one measurement at 30 minutes, followed by hourly measurements over
the following 8, 9, or 10 hours would be desirable for future study. It is
worth noting that other experiments have had good results with small numbers
of time series microarray measurements; an interesting example is [13], in which
Hashimoto et al. compared expression levels of genes involved in spinal cord
injury healing at 1, 3, and 7 days. However, more data can always provide more
information, and it is to be hoped that higher-dimensional data on PSR activity
will be available in the future.
Furthermore, the time course of 8 hours 30 minutes may have been too short
to detect substantial associations between changes in gene activity relating to
apoptosis and inflammatory response. Most studies in this area deal with time
39


courses of 24 hours or longer. Even with the existing limitations, however, some
promising congruences with evidence from other sources did emerge.
2.4.1 Internal Analysis
C2)84 contains two genes which both hold significant interest and have Affy
annotations which provide useful information about their function: G5285 and
^5264- Both genes are highly upregulated under experimental conditions, and
both show (as does 62,84) a pattern of sharp upregulation early in the experiment
followed by decline. (In Fig. 2.6, G5285 and G5264 are represented by the top and
second-to-top blue lines, respectively. C2,84 is surrounded by other genes toward
the middle of the plot.)
4
2
C
O
pH
W
2 0
i_
o.
x
Ld
-£
-4
Re 1ated
Measured
.erpo 1 ated
Time Point
Figure 2.6: C2,84 and its constituent genes.
40


G5285, which is known to be stimulated by PSR, is a gene for production of the
protein prostaglandin (PG) G/H synthase, or PGH synthase, which is also known
as COX2. This is a key enzyme in prostaglandin production. G5264 is a gene for
the PG receptor PGE2, and was not previously known to be stimulated by PSR.
The fact that these genes show the same expression pattern may be an indicator
that production of the PG receptor proceeds at the same time as the production
of PG precursors, rather than following such production at a later time.
Another cluster of interest is C2,29, which shows consistently high levels
of expression throughout the experiment, with steadily increasing upregulation.
Among its genes are G^ioo, described by Affy as a scavenger receptor, and G4974,
described as a house-keeping protein. The presence of these genes in the same
cluster, with consistently high expression on exposure to high concentrations of
PSR, may indicate that they play a role in PSR mediation of immune response to
apoptosis. Particularly interesting in this case is that 62,29 is a follower of 62,84-
62,29 and other followers of 62,84 also contain several genes related to B cell
production and activity.4 He et al. [14] found that in the presence of resistant T
cells, PGE2 increased B cell proliferation and differentiation. Resistant T cells
are defined here as cells which can increase immunoglobulin, or Ig, production;
PSR antibody, of course, is an IgM antibody. Almost all B cells have IgM recep-
tors. This is a good sign that the linking phase of the analysis correctly tracks
the progression from PGE2 production to B cell proliferation.
4Specifically, G1004 62,29) 64544 £ 62,176, and G3061 £ 62>229-
41


2.4.2 External Analysis
One useful way to measure the effectiveness of the clustering portion of the
algorithm in biological terms is to measure results against other, more fully de-
veloped experiments. Stein et al. [34] studied gene regulation in the immune
response to mammary involution, the process by which lactation ceases. This
process naturally involves the apoptosis of large numbers of cells involved in milk
production and secretion, and should not trigger a severe inflammatory response.
The same Affy chip, MG-U74Av2, as was used in this project, was subjected
to experimental conditions related to mammary involution over 16 hours, and the
results were analyzed by SOM (Self-Organizing Map) clustering using p as the
similarity measure. Genes, and the clusters they formed, which were not deemed
of interest to the experiment were discarded from the final analysis. Eventually
ten clusters of interest were found.
Although the experimental conditions were quite different from those used
for this project, the biological processes under examination are similar; some
correspondence of results should be apparent, and this is what was found. Genes
which were clustered together in the Stein clusters were also clustered together
by SiMCAL 1 at rank 2 in two of the ten Stein clusters, and at rank 3 in eight
of the ten Stein clusters. The rate of agreement would probably have been still
higher if many of the genes of interest had not been eliminated in SiMCAL 1
preprocessing.
42


3. SiMCAL 2
SiMCAL 2 has as its sole purpose the determination of differential gene ex-
pression under experimental conditions. Although a wide variety of statistical
tools exist for determining differentiation in gene expression measurements (and,
indeed, almost any set of data, of any kind) they all have limitations; see [26] and
[33] for discussion of some of these. Therefore, it is to be hoped that SiMCAL 2
will make a real contribution to the body of work on this subject. Also, SiMCAL
2 offers a method of determining the degree of differentiation, a feature not found
in most other such tests.
3.1 Data and Preprocessing
The SiMCAL 2 data are provided in a substantially different form from the
data in SiMCAL 1; they have already been preprocessed to a significant degree,
and furthemore, comprise four distinct data sets rather than one. Thus preprocess-
ing consists mainly of getting the data into a usable form rather than performing
calculations to normalize the measured values. Reliability information, although
not as complete as that in the SiMCAL 1 data, is available for three of the four
data sets. The data and the preprocessing methods used are thoroughly described
in [33]; an overview is given here.
Also, unlike SiMCAL 1, SiMCAL 2 was developed from the beginning using
simulated as well as real data. The simulated data, modeled as closely as possible
on real data but with well-known properties, allowed thorough testing of the
accuracy and performance of the algorithm. The nature of the simulated data
43


will be described in this section, and the testing performed will be referred to
throughout the chapter.
All the real SiMCAL 2 data sets derive from the same source: measurements
of gene expression activity in 12,225 genes (n = 12225) in 18 patients, 9 CF
patients and 9 healthy control subjects, at a single time point (m = 1). Expression
values are measured for perfect match (PM) oligonucleotides (oligos) which are
expected to hybridize with the genes under examination and are used to measure
actual gene expression, and for mismatch (MM) oligos which are not expected
to hybridize with the genes under examination and are used to measure scanner
noise and general chip activity not caused by experimental conditions. The data
are then normalized in four different ways to produce distinct data sets, described
in [33] as follows:
MAS5 MAS5 procedure uses the difference between PM and MM. The signal
values of any hybridization are multiplied by a factor so that all chips to be
analyzed have the same mean.
RMA RMA procedure ignores MM and removes global background. Normaliza-
tion is done using quantile normalization, which is a method in which probe
intensities are adjusted to produce identical distributions. RMA procedure
has been obtained from http://www.bioconductor.org.
Dchip Dchip procedure uses a baseline array to normalize arrays by selecting
invariant sets of genes(or probes) then using them to fit a non-linear re-
lationship between the treatment and baseline arrays. The non-linear
relationship is used to carry out the normalization. This procedure can be
44


based on PM intensity only or PM/MM intensity differences. Dchip method
is available at http://www.dchip.org.
Note that the MAS5 and RMA procedures produce only one data set apiece,
here referred to simply as MAS5 and RMA, while the Dchip procedure produces
two: DchipPM and DchipPMJMM.
3.1.1 Measurement Reliability and Combination
None of the four data sets have p-values provided; however, all but RMA do
have Affy evaluations of present, absent, or marginal for each gene in each
patient. Therefore, a simple standard is used for determining gene expression in
MAS5, DchipPM, and DchipPM_MM: to be included in the final analysis,
a gene must have at least one expression measurements in the healthy subjects
and one in the CF patients which is considered to be present, and expnormai is
set equal to the mean of the present healthy subject expressions for that gene,
while expabnormai is set equal to the mean of the present CF patient expressions
for that gene. The procedure for RMA is even simpler normal and abnormal
expressions for each gene are determined by finding the mean of all healthy and
CF expression values, respectively, for that gene.
This leads naturally to the definition of some data structures:
The Entity data structure has the attributes Expression, a floating-point
value which is calculated as described above, and IsNormal, a Boolean value
describing whether a given instance of Expression represents expression in
healthy subjects (IsNormal = True) or CF patients (IsNormal = False).
Entity also has the attribute Rank, an integer value whose use will be de-
45


scribed later. At this point, Entity.Rank = 0 for all Entities created so
far.
The Gene data structure has various attributes relating to annotation, and
of more interest at this point, the NormalEntity and Abnormal Entity at-
tributes, whose names should be self-explanatory.
Genes and entities (and later, clusters) are assigned Id numbers and are ref-
erenced as described in Ch. 2.1.3.
3.1.2 Algorithm Outline
At this point, it is appropriate to look at the overall structure of the SiMCAL
2 algorithm, presented here in pseudocode:
1. G = all genes in data set
2. E = all entities (normal and abnormal) attached to genes in G
3. while E is not empty:
4. find nearest neighbor for each entity in E
5. cluster based on nearest neighbors
6. find cluster intersections
7. G = differentially expressed genes based on clusters
8. E = entities attached to genes in G
9. if E = previous iterations E:
10. end while loop
Lines 1-4 have been covered in the preceding section; lines 5-10 are explained
in the following sections.
46


3.1.3 Simulated Data
Simulated data can consist of any number (n) of normal and abnormal ex-
pression values, which then generate data structures as described above. A wide
range of values was used during development and testing, n [100,10000].
One advantage of the non-parametric nature of SiMCAL analysis is that it
does not incorporate the assumption, common to many statistical tests of differ-
entiation, that the data are normally distributed[26]. The simulated data can,
of course, have any distribution desired, as long as the mean and variance are
reasonably close to those of real data. Most testing with simulated data was per-
formed using a mixture model in which noise has a normal distribution while up-
and down- regulation under experimental conditions have a uniform distribution.
Other distributions, including exponential and beta, for both noise and regulation
were used in testing at various points, and all produced good results.
3.2 Clustering and Linking
Clustering in SiMCAL 2 is based on the same general principle as in SiMCAL
1, but proceeds rather differently. The most important difference is that a number
of genes are eliminated at each iteration: since the purpose of SiMCAL 2 analysis
is to determine which genes are differentially expressed, those genes found not
to be differentially expressed at each rank are eliminated from consideration at
higher ranks, and are not included in the clustering. (Recall that in SiMCAL 1, all
genes in the data set are included in the clustering at all ranks.) Also, clustering
is generally faster and simpler, for reasons which are described below.
3.2.1 Distance Measures and Matrices
Because m = 1, a correlation measure between two expressions such as the
p used in SiMCAL 1 is clearly meaningless. Instead, given two entities Ei and
47


E2, with expressions e\ and e2, the distance between the entities is simply the
absolute value of the difference of their expressions:
A(£i, E2) |ei e2|. (3.1)
It can be seen that this is actually a special case of the city block distance Ai
(Eqn. 1.15) in which the expression vectors consist of a single value.
Now let Gr be the vector of genes under consideration at rank r, and let nr =
|Gr|; thus Gr = (Gr,i, Gr,2,..., Gr,r). The expressions of all entities attached to
these genes are considered as the vector expr of length 2nr, because each gene
has two expressions, one normal and one abnormal:
exprj = Gr^.N ormalEntity .Expression-,
exprti+1 = Gr^.AbnormalEntity.Expression-, (3.2)
expr = (expr,i, expr,2, . expr,2nr )
Then the distance vector for any entity Erj, i G [1, 2nr\, expressed as a vector-
scalar function which follows from Eqns. 1.3 and 3.1, is:
dists rti(Erj) = A(expr, expTti). (3.3)
Next, the ith element of distsr i is set equal to max(distsr,j) in order to avoid
assigning Er% as its own nearest neighbor, and then the nearest neighbor of Er^
is:
nearest (E^i) = argmin(distsr)j).
(3.4)
48


The neighbor-finding operation, because it must find distances between each
of 2nr entities and each other entity in the set under consideration, is 0(n2). Also,
because nr is an approximately linear function of n, the number of genes in the
original data set, neighbor-finding is ultimately 0(n2). However, there are two
important and non-obvious features to consider in this analysis:
1. Calculating A is lighter-weight than calculating p, and so generally SiMCAL
2 neighbor-finding proceeds a good bit faster than the equivalent operation
in SiMCAL 1.
2. For any r > 1, it is likely that nr -C n, so operations at higher ranks are
particularly fast.
Details of performance are given later in this chapter.
Note that leaders and followers are not found here as they are in SiMCAL
1. Again, this is simply because m = 1 without measured changes in the
behavior of the genes over time, there is no meaningful way to assign succession
relationships between them.
3.2.2 Ranks and Cluster Formation
SiMCAL 2 clustering is different in a number of ways from that in SiMCAL 1.
It is simpler and lighter-weight, because in the previous step, only a single nearest
neighbor was discovered for each entity. Also, the specific programming logic used
is somewhat different, because the set of entities under consideration changes at
each iteration (i.e., at each rank). Perhaps the most significant difference is that
entities are clustered, not genes-, a gene is a member of a cluster if either or both
49


of its attached entities are a member of that cluster. However, the underlying
concepts are the same.
The SiMCAL 2 clustering implementation depends heavily in the Python
set data type, which does just what the name implies: it is an unordered collec-
tion of data, to which may be applied any of the usual set-theoretical operations
such as intersection, union, symmetric difference, etc. Operations on sets are
generally much faster than operations on arrays, vectors, etc. in Python or in
other programming languages, because set members are stored in hash tables
which (theoretically) offer nearly 0(1) lookup times. Testing shows that given
two sets S and T, operations involving either S or T by itself are effectively 0(1),
and operations involving both sets (e.g., finding their intersection) are roughly
0(min(|Sj, |Tj)).
The first step is to put each the nearest neighbor of each entity into a single-
member set. Then, the following Python code from the clustering control file
shows the core of the clustering procedure, at any rank:
for Id in Clusters:
if Touched [Id] is None:
Candidates = Clusters [Id]
NewCluster = makeset(Id) # make sure the Entity under
# examination is in own cluster
while len(Candidates) > 0:
next = Candidates.pop()
if next not in NewCluster:
50


NewCluster.add(next)
if Touchedjnext] is None:
Candidates, update (Clusters [next])
Clusters [next]. clear ()
else:
Candidates, add (Touched [next])
Touched [next] = Id
Clusters[Id] .update(NewCluster)
At the beginning of execution, Touched is simply an empty array of length
2nT containing None (the Python equivalent of NULL in C) values. The make-
set () function takes a single data object as an argument and returns a single-
member set containing that object as its only element; S = makeset(s) = S = {s}.
The pop() method removes a randomly selected element from a set and returns
that element, while the add() method adds the element which is passed as an ar-
gument. The update() method performs a union; S.update(T) = S = SLiT. Note
that because these are true sets, duplicate values are eliminated e.g., if S =
set([l, 2, 3]) and T = set([2, 3, 4]), then S.update(T) will result in S = {1,2,3,4},
and add() produces similarly correct results.
The end result is that Clusters contains a large number of empty sets and
a few sets containing large numbers of entities, which have been clustered to-
gether with their nearest neighbors. Only these non-empty sets are saved to the
database as clusters. The Cluster data structure, in addition to Id and Rank,
also has the attribute Attached Entity this entity represents the cluster center,
and, not surprisingly, has its Expression attribute set equal to the mean of the
51


Expressions of the entities in the cluster.
3.2.3 Linking
Linking in SiMCAL 2 is based on the concept of cluster intersections. As seen
above, a cluster may be considered as a set of entities. It may also, however, be
considered as a set of genes. For any gene G and cluster C:
G. Normal Entity C V G. Abnormal Entity C => G E C. (3.5)
Then, considering any two clusters of equal rank C and D as sets of genes,
C and D are intersecting clusters if C fl D ^ 0. The congruence of C and D
is as given in Eqn. 1.16. (This comparison can only be done using clusters as
sets of genes; clusters as sets of entities are always disjoint.) Therefore, although
there is no temporal linking, relationships between clusters at the same rank are
established. SiMCAL 2 users can navigate between intersecting clusters in the
same way as SiMCAL 1 users can navigate cluster successions.
Parent and child relationships are also formed between clusters at different
ranks, just as in SiMCAL 1; however, the definition is slightly different. Consider
two clusters E and F, where E.Rank = r and F.Rank = r + 1, as sets of entities.
In SiMCAL 1, it was the case that F was the parent of E, and E was a child of F,
if E C F. However, since in SiMCAL 2, some genes (and thus their entities) are
eliminated at each iteration, there may be elements of E which are not present in
F or in any other cluster at rank r + 1. Therefore, here F is the parent of E, and
E is a child of F, if E Cl F ^ 0.
52


3.3 Differentiation
Differentiation that is, the determination of which genes are differentially
expressed under experimental, as opposed to control, conditions is the purpose
of SiMCAL 2 analysis. The condition used for differentiation has evolved some-
what during the history of this phase of the project, and provides the best way to
understand the current condition.
There exist two major measures of accuracy in any differentiation test: sensi-
tivity, the degree to which the test manages to identify those data elements which
are meaningfully different from the control set; and specificity, the degree to
which data elements identified as different in fact are. Colloquially, low sensitivity
means a high rate of false negatives while low specificity means a high rate of
false positives. Obviously, both sensitivity and specificity should be as high as
possible; depending on the precise application, one or the other may be of greater
importance, but neither should be sacrificed. In the current application, speci-
ficity may be slightly more important simply because of the size of the data set
it is more useful to identify a few genes which are almost certainly differentially
expressed, for further study, than to identify a great many genes which may be
differentially expressed but whose number still leaves the user drowning in data.
Fortunately, the use of simulated data allows precise testing of both accuracy
measures for this algorithm. (This is a major improvement over SiMCAL 1.)
The simulation code generates a set of genes which includes abnormal genes
those which are differentially expressed which here is denoted Ga Then the
algorithm itself, at each rank r E [l,rmax], finds a set of differentially expressed
genes: Go,r (Recall that rmax is not set by the user a priori, as in SiMCAL 1;
53


instead, it is simply, by definition, the maximum rank at which any differentially
expressed genes are found.) This provides a simple measure of both sensitivity
and specificity for any value of r:
sensr =
specr =
(3.6)
The ideal result for SiMCAL 2 processing would be to have perfect sensitiv-
ity and good specificity at lower ranks, with acceptable sensitivity and perfect
specificity at higher ranks; the idea is that as r increases, only the genes with
greater differentiation will be identified as differentially expressed. Therefore,
identification at higher ranks should serve as a guide to the degree of differential
expression or, in the context of the current data set, the severity of a genes
up- or down-regulation under disease conditions.
The original condition for differentiation was very simple. At any rank r,
any gene G under consideration (which is any G G Go,r-i) may be said to
belong to two clusters, N(G) and A(G), where G. Normal Entity G N(G) and
G. Abnormal Entity G A(G). Thus the condition was:
Unfortunately, this condition produces very high sensitivity (sensr ~ 1 at all
ranks) but very low sensitivity, and is therefore insufficient. The clear answer is
to provide some sort of filter to remove the large number of false positives found.
The next step, therefore, is to introduce the idea of a distance matrix between
the all clusters at rank r, which is denoted Cr = (CV,i, Cr,2,..., CVjcr|). Then the
N(G) A(G) = G G GD,r
54


distance matrix for all clusters at rank r is:
Cdistsri j. = A(Crti.AttachedEntity, Cr^.AttachedEntity). (3.7)
The next step is to scale this matrix so that all values are within the range
[0,1], thus producing the scaled distance matrix:
.. Cdists
bCdistsr =-----------
max(Cdists)
(3.8)
For notational convenience, given clusters Cr SCdistsTii. Then the condition for differentiation became:
AS(N{G), A{G)) > Con (N(G),A(G)) ^Ge GD,r.
Note that the original condition is subsumed by this condition:
N(G) = A(G) => AS(N(G), A(G)) = 0 A Con(N(G), A(G)) = 1
... and therefore ...
AS(N(G),A(G)) > Con(N(G),A(G)) => N(G) ? A(G).
However, it is clearly not the case that:
N(G) ? A(G) => AS(N(G),A(G)) > Con(N{G),A(G)).
Thus this condition is much more restrictive, as desired.
The justification for the new condition is that if two clusters have a high
distance between them, and low congruence, then any genes shared between them
are likely to be genuinely differentially expressed; on the other hand, clusters
55


which are close together and have a high congruence may well be in some sense
the same cluster. That is, it is quite likely that some entity in one cluster is a
very near neighbor of some entity in the other cluster (though not, of course, the
nearest neighbor; in that case the clusters would not be separate at all) and so
genes shared between the clusters are not actually differentially expressed. As a
practical matter, using this condition dramatically improves specificity (specr ~ 1
at all ranks) while leading to noticeably diminished but perhaps still acceptable
sensitivity (sensr > 0.5 at most ranks.)
It should still be possible to do better. The fact that specificity is now very
high while sensitivity is lower implies that perhaps the new condition is new
restrictive. The solution is to apply a scaling factor, k, to the congruence before
performing the test. After trying a variety of values, the best results were obtained
with the following condition:
k 3,
AS{N(G),A{G)) > k x Con(N(G), A(G)) => G GDyT.
Results here are nearly ideal for small (n < 1000) data sets. At lower ranks,
sensr 1 and spec,. > 0.75. As r rmax, specr > 1. Sensitivity decreases
(although rarely goes below 0.5) at higher ranks, but this is to be expected and is in
fact desirable, because only the more differentially expressed genes are identified as
such at these ranks. Using lg-scaled values for expression changes between normal
and abnormal entities attached to the same gene, such that a change of 1 indicates
a twofold change in expression, 2 indicates a fourfold change, etc., minimum values
for expression change in genes identified as differentially expressed climb from < 1
56


at r = 1 to > 5 at r = rmax, while mean values climb from <5atr = lto>8
at r = rmax. As the last value is about equal to the greatest observed change in
both real and simulated data, this is an excellent accuracy figure.
However, with large data sets (especially n > 3000) a problem emerges. A
desirable value for rmax is somewhere in the neighborhood of 3 to 5, and certainly
values higher than 7 or so are not desirable, but the value of k used above can
lead to 10 or more ranks being formed. Recall that the second SiMCAL condition
states that levels (ranks) should be few relative to hierarchical clustering, so that
the user does not have to decide on an arbitrary cut level to interpret the analysis.
Clearly 10 or even 20 ranks is preferable to the hundreds of levels which standard
hierarchical clustering can produce, but it is still too large a number to allow for
easy interpretation.
The solution is to make k a variable, not a constant, and dependent on rank.
The value of | used above is good for the first rank, but is inadequate for the
higher ranks which SiMCAL 2, in this form, will try to create. The condition
must become more restrictive as ranks increase. Therefore, the final form of the
condition which determines whether a gene is differentially expressed is:
nr1
k = -^5;
^ (3.9)
A S(N(G),A(G)) >kx Con (N(G),A(G)) => G Once Gr>,r is determined, the algorithm iterates again at r + 1 with the entities
attached to the genes in Go,r The entire algorithm halts when either \Gd,t\ = 0
or Gd,t = Go,r-1-
57


Rank Sensitivity Specificity Min. Differential Mean Differential
1 0.902 0.993 0.285647 5.37723
2 0.81 1 0.791236 5.85364
3 0.613 1 0.791236 6.62608
4 0.325 1 2.43204 7.56764
5 0.072 1 5.61399 8.42569
Table 3.1: SiMCAL 2 accuracy figures.
Xo particular justification is offered for the formula used to determine k in
Eqn. 3.9, except that it works. Sensitivity at r = 1 is excellent, and specificity
very good; at higher ranks, specificity rapidly reaches 1, and while sensitivity
drops, the genes which are identified as differentially expressed axe uniformly
those with greater differences between normal and abnormal expressions. Just as
importantly, the number of ranks formed is constrained even for the largest real
data set, RMA with n = 12625, operations terminated at rmax = 5. Accuracies
for a typical run with simulated data, n = 3000, are shown in Table 3.1.
3.4 Reporting
The reporting interface in SiMCAL 2 is substantially similar to that in SiM-
CAL 1, although with a few refinements to make it easier to use, such as the
simplified search interface show in Fig. 3.1.
The primary difference is in visualization of genes and clusters. Because m =
1, time-series plots would of course not be meaningful. Instead, the standard
adopted for display is:
58


<*ne*: IDS:
Name/Annotation
Expression:
Ousters: ID*: RanX: ~7^ (Any) i Annotation:
Di/f Rank (Any) ? Get Cengs ^
Expression: T) (Cet Clusters'^
Figure 3.1: SiMCAL 2 search interface.
1. Genes and clusters are both displayed as points on a 2-dimensional graph.
2. For genes, NormalEntity.Expression is displayed on the x-axis, while
Abnormal Entity. Expression is displayed on the y-axis. Fig 3.2 shows ex-
pressions for all genes in the DchipPM_MM data set, distinguishing be-
tween genes found to be differentially expressed and those not found to be
differentially expressed.
3. For clusters, since they have only one entity (and thus, one expression) which
represents the cluster center, Attached Entity .Expression is displayed on
both the x- and y-axes. Fig. 3.3 shows all clusters in DchipPM_MM at
all ranks, against a background of genes in the data set.
3.5 Performance
SiMCAL 2 is very fast compared to SiMCAL 1, in large part due to the relative
simplicity of its operations. It is worth analyzing each step and then seeing how
the algorithm performs in practice.
It is clear that neighbor-finding operations are inherently 0(n2), but in prac-
tice they tend to follow a polynomial of lower order, about 0(n1'7). This is
probably due to the efficiency of the NumPy routines used for vector calculation,
59


Figure 3.2: All genes in DchipPM_MM.
which are used to obtain the results in Eqns. 3.3 and 3.4. It seems likely that
with very large data sets, these operations would eventually become truly 0(n2),
but this would be with much larger data sets than any actually used here.
Clustering is roughly 0(n). This is because early in the clustering, clusters
grow rapidly, effectively removing entities from future clustering operations once
they are already in a cluster. There is a pathological case in which clustering is
0(n2), but this simply cannot occur with real (or realistic simulated) data it
happens only when every entity has the same expression value! In this case, each
entity reports Er^ as its own nearest neighbor, and thus entities are removed from
60


Figure 3.3: Clusters and genes in DchipPM_MM.
consideration during clustering operations only one at a time.
Since the number of clusters |Cr| is roughly a linear function of the number of
genes, and since differentiation at each rank requires the construction and analysis
of |Cr| x |Cr | matrices, it might also be expected that differentiation is 0(n2).
However, not only does this step take heavy advantage of NumPys vector routines
and Pythons set operations, both highly optimized, but it also is written to do
as much as possible of the work by iterating through genes instead of comparing
clusters directly. Thus, like neighbor-finding, in practice it tends to follow a lower-
order polynomial, in this case about 0(n12).
61


Table 3.2 shows operation times for n E [200,3000]. Total operation times
may not be precisely equal to the sum of neighbor-finding, clustering, and differ-
entiation times, especially at lower ranks, because all times are rounded to the
nearest integer value.
Fig. 3.4 shows the execution times for the various components of the algorithm
plotted against their approximate orders, while Fig. 3.5 shows total execution
time. Unsurprisingly, total execution time also has a polynomial order < 0(n2),
about 0(n13). Again, of course, it is reasonable to believe that this figure would
increase for very large data sets, but there are no data sets this large currently
under consideration.
Figure 3.4: Execution times for SiMCAL 2 components.
62


n Neighbors Clusters Differentials TOTAL
200 1 1 1 4
400 2 2 3 8
600 3 3 5 12
800 4 5 7 17
1000 7 6 10 23
1200 8 7 12 28
1400 11 9 15 35
1600 13 10 17 41
1800 15 11 19 46
2000 18 12 22 52
2200 21 13 25 60
2400 25 15 28 68
2600 28 17 32 78
2800 33 18 35 86
3000 35 19 37 92
Table 3.2: SiMCAL 2 run times, in seconds.
63


Figure 3.5: Execution times for SiMCAL 2 overall.
3.6 Biological Interpretation
It was found in [33] that [t]he choice of normalization seems to greatly affect
the differentially expressed genes across the conditions. In other words, for each
of the four data sets, greatly varying numbers of differentially expressed genes
were discovered. Similar results were found here. Table 3.3 shows what varying
results were obtained.
A further refinement is to see how many genes the datasets had in common
initially (that is, after preprocessing) as shown in Table 3.4, and how many genes
they had in common which were identified as differentially expressed, as shown in
64


MAS5 DchipPM DchipPM_MM RMA
Genes 6219 7658 7658 12625
Rank 1 Differentials 168 965 1681 450
Rank 2 Differentials 105 248 636 102
Rank 3 Differentials 53 66 202 28
Rank 4 Differentials 19 11 45 3
Rank 5 Differentials 2 0 1 0
Table 3.3: Total genes under consideration, and differential genes discovered.
Table 3.5.
Ultimately, only three genes were identified as differentially expressed in all
datasets. The Affy designations for these genes are 38095_i_at, 32794_g_at,
and AFFX-HSAC07/X00351_3_at. They are described at [22]:
38095_i_at A major histocompatibility complex (MHC) gene responsible for
pathogen and antigen detection. CF often involves an exaggerated (although
ultimately ineffective) immune response. Particularly interesting is that this
gene is involved in cross-membrane transport; recall that CF is largely a dis-
order of this process.
32794_g_at A T-cell (immune cell) receptor gene, also membrane-associated, lo-
cated on chromosome 7, where the mutations believed to be responsible for
CF occur.
AFFX-HSAC07/X00351_3_at Involved primarily in cellular motor activity.
65


Dataset MAS5 DchipPM DchipPM_MM RMA
MAS5 - 6076 6076 6186
DchipPM 6076 - 7658 7619
DchipPM_MM 6076 7658 - 7619
RMA 6186 7619 7619 -
Table 3.4: Genes shared by datasets after preprocessing.
Dataset MAS5 DchipPM DchipPM_MM RMA
MAS5 - 10 21 28
DchipPM 10 - 418 31
DchipPM_MM 21 418 - 65
RMA 28 31 65 -
Table 3.5: Differentially expressed genes shared by datasets.
66


It is associated with a cytoskeletal acetyltransferase complex; CF patients
metabolize acetyltransferase-associated drugs unusually fast. This gene is
also located on chromosome 7.
Therefore, although no definitive conclusions can be drawn from these results,
it is reasonable that all three of these genes should be strongly differentially ex-
pressed in CF patients. More generally, genes which are found to be differentially
expressed at high ranks in any of the data sets (particularly in MAS5, consid-
ered in [33] to use the best of the normalization methods) should be considered
for more intensive future research.
67


4. SiMCAL 3
SiMCAL 3 combines the major features of SiMCAL 1 and 2 into a single
application. Like SiMCAL 1, it works with time-series data (m > 1) by cluster-
ing genes over time and finds succession relationships between genes and clusters.
Like SiMCAL 2, it determines which genes in the data set are differentially ex-
advantages over both its predecessors: it is designed to work with any value of m
rather than a specific value for a specific data set (m = 3 for SiMCAL 1, m = 1
for SiMCAL 2) and it is to be hoped that the elimination of genes which are
not differentially expressed will lead to more meaningful clustering and linking.
SiMCAL 3 is capable of reproducing the features of both of its predecessors, but
offers analytical power beyond either of them.
4.1 Data and Preprocessing
As with SiMCAL 2, SiMCAL 3 was developed using simulated data, and
then used to analyze real data. The real data come from the SiMCAL 1 data
set, but are preprocessed in a slightly different way. After eliminating unreliable
measurements, there exists the same set of measurements given in Eqn. 2.1.
However, in SiMCAL 3, normalization proceeds as follows, generating both normal
and abnormal expression vectors for each gene in each replicant:
pressed and offers the feature of determining the degree of differentiation. It offers
expt
.normal __
(4.1)
tJ-K
.abnormal __
68


These vectors are then averaged across replicants to produce expression vec-
tors for the normal and abnormal entities attached to each gene. Reliability
measures are not calculated or used as weighting factors, in keeping with the
practice used in SiMCAL 2 it is reasonable to expect that any measurement
with p < 0.05, which is marked present by the Affy equipment, is reliable enough
to be used as in in future calculations.
The rather artificial expo data point is not used in most of SiMCAL 3,
although it does resurface in succession-finding, as explained in following sections.
Data structures are essentially identical to those used in SiMCAL 2, except
that Entity now has the attribute Expressions (expi, exp2,..., expm). This
is a vector (as calculated in Eqn. 4.1 in the case of entities attached to genes) of
floating-point values instead of the scalar Expression used in SiMCAL 2.
4.1.1 Algorithm Outline
Unsurprisingly, the large-scale logic of SiMCAL 3 is basically a combination
of that in its predecessors:
1. G = all genes in data set
2. E = all entities (normal and abnormal) attached to genes in G
3. while E is not empty:
4. find nearest neighbor and successions for each entity in E
5. cluster based on nearest neighbors
6. find cluster intersections and successions based
on succession relationships of member entities
7. G = differentially expressed genes based on clusters
8. E = entities attached to genes in G
69


9. if E = previous iterations E:
10. end while loop
4.1.2 Simulated Data
Simulated data were created here by drawing randomly selected normal ex-
pression values from the real data and then applying a variety of distributions
for up- and down-regulation. A uniform distribution seems to be closest to the
differential expressions in the real data, but others were tested with good results,
including normal, beta, and exponential.
4.2 Clustering and Linking
Clustering is almost identical to that in SiMCAL 2. However, SiMCAL 3 of
course has a succession-finding component which is wholly lacking in SiMCAL 2,
and the fact that the algorithm deals with multidimensional data requires the use
of more sophisticated distance measures. One of the driving forces in development
of SiMCAL 3 has been to incorporate the lessons learned in the development of
SiMCAL 2 which make that algorithm superior to SiMCAL 1, while retaining the
features of both.
One lesson in the development of SiMCAL 3, while testing against a variety
of simulated data, was that despite the arguments for the use of p as a similarity
measure for clustering which are detailed in Ch. 2.2.1, p is a poor choice for
determining differential expression. If the goal is simply to cluster genes together
based on behavior, p is still an excellent standard; but when vagaries of the
clustering process may lead to genes being eliminated from the data set, as is the
case in both SiMCAL 2 and SiMCAL 3, accidental correlation in noisy data can
70


lead to wildly inaccurate results. Specifically, when using p in early development
of SiMCAL 3, both sensitivity and specificity were observed in the range of 0.3 to
0.5 at all ranks, which is clearly unacceptable.
A variety of other distance and similarity measures were tried.5 Other correl-
ative measures, including covariance and Spearman correlation, failed to produce
significantly better results. A return to Minkowski distances (Ch. 1.3) was tried
next: first A2, the Euclidean, which produced good but not spectacular results
(both sensitivity and specificity in the 0.5-0.8 range) and finally, the simplest pos-
sible Minkowski distance for multidimensional data, Ai, the city block distance.
Perhaps unsurprisingly, since a special case of this measure was used in SiMCAL
2 with great success, it was A! that produced the best results. Genes with gen-
erally similar behavior over time were clustered together as in SiMCAL 1, and
determination of differential expression was as good as in SiMCAL 2.
However, Ai, and Minkowski measures generally, proved to be very poor at
succession-finding, and so after further experimentation, p was retained for this
purpose. This leads to the necessity of defining two separate measures to be used
in SiMCAL 3:
The standard distance measure, used for neighbor-finding, which is defined
in the current form of the algorithm as Ai; and
The standard similarity measure, used for succession-finding, which is de-
fined in the current form of the algorithm as p.
5Naturally, very slight alterations to the code are necessary for neighbor-finding operations
depending on whether a distance measure or a similarity measure is used, but the underlying
logic is the same.
i
71


For maximum flexibility, so that users can focus on specific facets of gene
behavior as desired, the application is written to allow using different standard
distance and/or similarity measures with very minor code changes. However,
using the definitions of these measures given above is strongly recommended for
the best results with a wide variety of data sets.
Neighbor-finding proceeds much as it does in SiMCAL 2 (Eqns. 3.2 3.4)
with the sole exception that expr is now a matrix rather than a vector, and expr^
is now a vector rather than a scalar. This implies a slight change in notation: let
expr i represent Eri. Expressions, and the matrix of expressions for all entities
under consideration is:
The nearest neighbor is then found as in Eqn. 3.4.
Succession-finding is very similar to that in SiMCAL 1, with some exceptions.
Expr = (expr l, expr 2, , expr,2nr)-
(4.2)
Then the distance vector for any entity Er u i G [1, 2nr] is:
dists-r^E^j) = Ai(Expr, expr>i).
(4.3)
(4.4)
It follows that:
(4.5)
72


This is the only case in which exp0 is introduced; it plays no part in
neighbor-finding or clustering, and is not displayed in the interface. Now leaders
and followers can be determined:
leader^ = argmax(pExpeariy expr.);
follower^ = argmax(pExpr expeariy).
Note that there are none of the calculations used in SiMCAL 1 involving
subtracting neighbor distances from succession distances. Again, this is due to
rigorous testing with simulated data it was found that these calculations added
nothing to the accuracy of the analysis, and by omitting them in SiMCAL 3, the
succession-finding process is considerably simplified. Of course, the succession-
finding step is only performed if m > 1.
Clustering now proceeds exactly as in SiMCAL 2. Linking is done both as in
SiMCAL 1, in which cluster succession relationships are determined (although in
SiMCAL 3, these are based on entity successions rather than gene successions; a
genes succession relationships are simply the union of its normal and abnormal
entities succession relationships) and as in SiMCAL 2, in which cluster intersec-
tions are determined. Finally, gene differential expression determination is done
as in SiMCAL 2.
Accuracies are comparable to those in SiMCAL 2 for gene differentiation,
regardless of the values of n and m. In addition, SiMCAL 3 finds succession
relationships in simulated data with both sensitivity and specificity ~ 0.5 at r = 1,
and specificity > 0.9 at higher ranks. The sensitivity of succession-finding at
r > 1 naturally has sensr_i as an upper bound; if a gene was found not to be
73


differentially expressed at r 1, it is eliminated from consideration at r. However,
genes involved in pathways marked out by succession relationships in the simulated
data are usually correctly identified by the algorithm as differentially expressed
at all but the highest ranks. Finally, rmax remains low, as desired rmax = 6 for
the current real data set.
4.3 Interface and Performance Comparison
The reporting interface in SiMCAL 3, as might be expected, combines the
features of those in SiMCAL 1 and 2. The look and feel is very much like that in
SiMCAL 2, while gene and cluster visualization is like that in SiMCAL 1, with the
added feature of showing both normal and abnormal expressions for genes (Fig.
4.1). Unlike either of its predecessors, the SiMCAL 3 interface is designed, as is
the algorithm, to deal with data of any dimensionality.
The order of performance for SiMCAL 3 is equivalent to that of SiMCAL 2
with respect to n, although it is slightly slower overall due to the extra operations
involved in succession-finding. SiMCAL 3 is also 0(m), which is unsurprising
since the computational expense of finding Ai scales linearly with the size of the
data.
4.4 Biological Interpretation
Among the more interesting differentially expressed genes in the current data
set are:
100013_at Differentially expressed up to r = 2: a poorly documented gene which
is, however, known to be involved in immune response.
100005_at Differentially expressed up to r = 3: a tumor necrosis factor (TNF)
which plays a role in apoptosis.
74


Figure 4.1: A gene from simulated SiMCAL 3 data, m = 8, with its clusters.
100014_at Differentially expressed up to r = 2: a kinase involved in response to
DNA damage.
100980_at Differentially expressed up to r = 5: an apoptosis signalling protein.
These genes, their leaders and followers, and other genes in their clusters may
all be worthy of further and more intensive investigation.
Another gene of interest is 102914_s_at, a gene with wide-ranging effects on
cell death, which is differentially expressed up to r = 5. Among its followers
at high rank is 98869_g_at, a leukemia- and lymphoma-associated B cell protein
involved in regulation of apoptosis; among its leaders is 99522_at, a cell cycle
75


signaller. These sorts of relations in the SiMCAL 3 analysis point toward the
ability to zero in on pathways of profound biological and medical importance
without the noise of genes which are not differentially expressed a problem
which plagued SiMCAL 1 and has now been corrected.
76


5. Conclusions and Future Directions
All three algorithms essentially meet their goals: they produce meaningful re-
sults while conforming to the SiMCAL specification. However, considerable future
work can both improve the algorithms themselves and broaden their usefulness in
bioinformatics research.
Future development on the family of algorithms presented here will almost
certainly use SiMCAL 3 as a starting point, since it subsumes the functionality
of SiMCAL 1 and 2. These earlier algorithms may now be viewed as prototypes
working prototypes to be sure, and capable of producing useful results on
their own, but not adaptable enough to handle the wide range of data which
is produced by modern microarray experiments. Future versions of SiMCAL 3
will be designated SiMCAL 3a, 3b, etc. as both the algorithm and the interface
improve. Some goals include:
Build a high-level interface which can not only show results, but also act as a
controller for all aspects of the computational portion of the project. Prefer-
ably, the entire analysis process, from data import, through preprocessing,
clustering, and linking, to reporting and visualization, should be available
through the Web interface, so that users have a one-stop application.
Perform statistical analysis on the operations of the algorithm to understand
not only what it is doing, but why. In particular, it would be desirable to
understand whether arbitrary calculations within the algorithm such as n
77


are the best possible solution, or if more rigorous development can produce
better solutions.
Compare the results obtained from the algorithm with results obtained on
identical data with different algorithms in more detail. Cross-validation
would be a desirable outcome of this process.
Obtain other data sets covering a variety of topics.
Solicit user feedback on both the results of the algorithm and the interface.
The end users of the application are intended to be biologists, not math-
ematicians and computer scientists; although a number of biologists have
been closely involved in the development process since the projects incep-
tion, much more work is needed to make the application suitable for general
use.
Looking farther ahead to SiMCAL 4, 5, etc., entirely different approaches to
clustering may prove fruitful. Evolutionary search through the data space is one
possibility; minimum spanning tree and other graph search algorithms present
another. Nearest neighbor clustering is an intuitive, lightweight method, and one
which can certainly produce good results, but it is far from the only avenue worth
exploring in the SiMCAL context.
As a design principle, SiMCAL is a useful guideline for development in the
ever-expanding field of microarray analysis. As a specific set of algorithms, it is a
proven concept which has produced gratifying results. Its future lies in following
the principles to produce a family of algorithms which build on the successes of
those presented here.
78


REFERENCES
[1] Statistical Algorithms Description Document, Affymetrix, Inc., 2002,
http://www.affymetrix.com/support/technical/whitepapers/
sadd_whitepaper.pdf.
[2] GeneChip Expression Analysis: Experimental Design, Statisti-
cal Analysis, and Biological Interpretation, Affymetrix, Inc., 2003,
http: //www.affymetrix.com/support / downloads/manuals/
data_analysis_2_manual.pdf.
[3] M. Beers and R. Berkow, Eds., The Merck Manual of Diagnosis and Therapy,
17th ed. Whitehouse Station, NJ: Merck Research Laboratories, 1999.
[4] K. Cios, W. Pedrycz, and R. Swiniarski, Data Mining Methods For Knowledge
Discovery. Norwell, MA: Kluwer Academic Publishers, 1998.
[5] M. de Hoon, S. Imoto, and S. Miyano, The C Clustering Library, 2002,
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/cluster.pdf.
[6] J. Devore, Probability and Statistics for Scientists and Engineers. Pacific
Grove, CA: Duxbury, 2000.
[7] D. Dvorkin, V. Fadok, and K. Cios, SiMCAL 1 algorithm for anal-
ysis of gene expression data related to the phosphatidylserine recep-
tor, Artificial Intelligence in Medicine, 2005, in press; preprint at:
http: //www.dvorkin.com/daniel/DvorkinFadokCios. preprint, pdf.
[8] V. Fadok, D. Bratton, M. Rose, A. Pearson, R. Ezekewitz, and P. Henson, A
receptor for phosphatidylserine specific clearance of apoptotic cells, Nature,
vol. 405, pp. 85-90, May 2000.
[9] S. Friedberg, A. Insel, and L. Spence, Linear Algebra, 3rd ed. Upper Saddle
River, NJ: Prentice Hall, 1997.
[10] C. Gerald and P. Wheatley, Applied Numerical Analysis. Reading, MA:
Addison Wesley Longman, 1997.
[11] Gnuplot homepage, http://www.gnuplot.info/.
79


[12] M. Gwinn, D. Whipkey, and A. Weston, "The effect of oxythioquinox expo-
sure on normal human mammary epithelial cell gene expression: A microarray
analysis study, Environmental Health, September 23 2004, (E-publication
ahead of print).
[13] M. Hashimoto, M. Koda, H. Ino, K. Yoshinaga, A. Murata, M. Yamazaki,
K. Kojima, K. Chiba, C. Mori, and H. Moriya, Gene expression profiling
of cathepsin D, metallothioneins-1 and -2, osteopontin, and tenascin-C in a
mouse spinal cord injury model by cDNA microarray analysis, Acta Neu-
ropathologica, December 9 2004, (E-publication ahead of print).
[14] X. He, C. Weyand, J. Goronzy, W. Zhong, and J. Stuart, Bi-directional
modulation of T cell-dependent antibody production by prostaglandin E2,
International Immunology, vol. 14, no. 1, pp. 69-77, January 2002.
[15] P. Henson, D. Bratton, and V. Fadok, The phosphatidylserine receptor: a
crucial molecular switch, Nature Reviews: Molecular Cell Biology, vol. 2,
pp. 627-633, August 2001.
[16] R. Jarvis and E. Patrick, Clustering using a similarity measure based on
shared near neighbors, IEEE Transactions on Computers, vol. C22, no. 11,
pp. 1025-1034, November 1973.
[17] H. Kim, G. Golub, and H. Park, Missing value estimation for DNA microar-
ray gene expression data: Local least squares imputation, Bioinformatics,
August 27 2004, (E-publication ahead of print).
[18] The MathWorks MATLAB and Simulink for technical computing,
http://www.mathworks.com/.
[19] P. McDonald, V. Fadok, D. Bratton, and P. Henson, Transcriptional and
translational regulation of inflammatory mediator production by endogenous
TGF-/3 in macrophages that have ingested apoptotic cells, The Journal of
Immunology, vol. 163, pp. 6164-6172, 1999.
[20] MySQL, http://www.mysql.com/.
[21] R. Neapolitan and K. Naimipour, Foundations of Algorithms, 2nd ed. Sud-
bury, MA: Jones and Bartlett Publishers, 1998.
[22] Netaffx, https://www.affymetrix.com/analysis/netaffx/index.affx.
80


[23] D. Nguyen, H. Toshida, D. Schurr, and R. Beuerman, Microarray analysis
of the rat lacrimal gland following the loss of parasympathetic control of
secretion, Physiological Genomics, vol. 18, no. 1, pp. 108-118, 2004.
[24] Numerical python, http://sourceforge.net/projects/numpy.
[25] Octave home page, http://www.octave.org/.
[26] W. Pan, A comparative review of statistical methods for discovering differen-
tially expressed genes in replicated microarray experiments, Bioinformatics,
vol. 18, no. 4, pp. 546-554, April 2002.
[27] PDL The Perl data language, http://pdl.perl.org/.
[28] The Perl directory, http://www.perl.org/.
[29] PHP: Hypertext preprocessor, http://www.php.net/.
[30] PL/pgSQL SQL, http://www.postgresql.Org/docs/7.4/interactive/
plpgsql.html.
[31] PostgreSQL, http://www.postgresql.org/.
[32] Python programming language, http://python.org/.
[33] T. Rachidi and K. Cios, Cystic fibrosis gene expression data analysis, Ar-
tificial Intelligence in Medicine, 2005, in press.
[34] T. Stein, J. Morris, C. Davies, S. Weber-Hall, M. Duffy, V. Heath, A. Bell,
R. Ferrier, G. Sandilands, and B. Gusterson, Involution of the mouse mam-
mary gland is associated with an immune cascade and an acute-phase re-
sponse, Breast Cancer Research, vol. 6, pp. R75-R91, 2004.
[35] Y. Taguchi and Y. Oono, Relational patterns of gene expression via non-
metric multidimensional scaling analysis, Bioinformatics, October 27 2004,
(E-publication ahead of print).
[36] K. Talaro and A. Talaro, Foundations in Microbiology. Dubuque, IA: Wm.
C. Brown, 1996.
[37] R. Vandivier, V. Fadok, P. Hoffman, D. Bratton, C. Penvari, K. Brown,
J. Brain, F. Accurso, and P. Henson, Elastase-mediated phosphatidylser-
ine receptor cleavage impairs apoptotic cell clearance in cystic fibrosis and
bronchiectasis, The Journal of Clinical Investigation, vol. 109, pp. 661-670,
2002.
81


[38] D. Voet, J. Voet, and C. Pratt, Fundamentals of Biochemistry. New York:
John Wiley and Sons, 1999.
[39] K. Zorn, A. Jazaeri, C. Awtrey, G. Gardner, S. Mok, J. Boyd, and M. Birrer,
Choice of normal ovarian control influences determination of differentially
expressed genes in ovarian cancer expression profiling studies, Clinical Can-
cer Research, vol. 9, pp. 4811-4818, October 15 2003.
82