
Citation 
 Permanent Link:
 http://digital.auraria.edu/AA00000073/00001
Material Information
 Title:
 Models for spectral clustering and their applications
 Creator:
 McCuan, Donald F
 Place of Publication:
 Denver, CO
 Publisher:
 University of Colorado Denver
 Publication Date:
 2012
 Language:
 English
 Physical Description:
 1 electronic file. : ;
Subjects
 Subjects / Keywords:
 Spectral theory (Mathematics) ( lcsh )
Spectral sequences (Mathematics) ( lcsh ) Eigenfunctions ( lcsh ) Eigenfunctions ( fast ) Spectral sequences (Mathematics) ( fast ) Spectral theory (Mathematics) ( fast )
 Genre:
 nonfiction ( marcgt )
Notes
 Subject:
 In this dissertation the concept of spectral clustering is examined, with the goal of better understanding of properties of eigenfunctions of the graph Laplacian and their use in clustering. We start by discussing bipartitioning of images via spectral clustering, and justify this technique using an analogy to vibration problems. Such a justification does not rely upon a traditional relaxation of a combinatorial optimization based definition of clustering. The importance of the Fiedler vector and theorems by Fiedler is emphasized. We extend Fiedler\rq{}s theorem to the case of the generalized eigenvalue problem. Courant\rq{}s Nodal Line Theorem (CNLT) can be viewed as an analog of Fiedler\rq{}s theorem for eigenvectors corresponding to higher vibration frequencies. We review the literature for discrete CNLTs. A new definition for rweak sign graphs is presented and used to prove a modified discrete CNLT theorem. Applications of Fiedler\rq{}s theorems and CNLTs to spectral clustering are discussed. Practical difficulties with image segmentation related to peculiarities in construction of graph edge weights are investigated. Another application of spectral clustering, covered in this dissertation, deals with recursive gene bipartitioning for DNA microarrays. We develop a new technique and prove a theorem for dealing with disconnected graph components that are common for DNA microarray data. Our ideas are incorporated in new MATLAB software for gene clustering in DNA microarrays. Results of clustering using practical microarray data are presented. The last section deals with the software package Block Locally Optimal Preconditioned Eigenvalue Xolvers (BLOPEX) which as part of the PH.D. candidate's graduate work has been updated to Version 1.1. BLOPEX implements the Locally Optimal BLock Preconditioned Conjugate Gradient (LOBPCG) method for solution of very large, sparse, symmetric (or Hermitian) generalized eigenvalue problems. Version 1.1 of BLOPEX adds (amongst other things) support for complex arithmetic and 64bit integers. BLOPEX includes interfaces to independently developed software packages such as PETSc and Hypre which provide high quality preconditioners and parallel MPIbased processing support. There is also support for execution of BLOPEX from MATLAB and stand alone C programs. The PH.D. candidate's personal contribution in the BLOPEX project is recoding of the abstract routines and PETSc and Hypre interfaces for the new functionality, development of new complex arithmetic test cases, and aid and assistance in testing, debugging and documentation of all BLOPEX interfaces. In this dissertation, we describe the BLOPEX software, design decisions, the LOBPCG algorithm as implemented, and all of the various interfaces. Numerical results are presented.
 Thesis:
 Thesis (Ph.D.)University of Colorado Denver. Applied mathematics
 Bibliography:
 Includes bibliographic references.
 General Note:
 Department of Mathematical and Statistical Sciences
 Statement of Responsibility:
 by Donald F. McCuan.
Record Information
 Source Institution:
 University of Colorado Denver
 Holding Location:
 Auraria Library
 Rights Management:
 All applicable rights reserved by the source institution and holding location.
 Resource Identifier:
 862742686 ( OCLC )
ocn862742686

Downloads 
This item has the following downloads:

Full Text 
MODELS FOR SPECTRAL CLUSTERING AND THEIR APPLICATIONS
by
Donald, F. McCuan
B.A., Austin College, 1972
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Applied Mathematics
2012
This thesis for the Doctor of Philosophy degree by
Donald, F. McCuan
has been approved
by
Andrew Knyazev, Advisor
Stephen Billups, Committee Chair
Tzu Phang
Julien Langou
Burt Simon
October 26, 2012
McCuan, Donald, F. (Ph.D., Applied Mathematics)
Models for Spectral Clustering and Their Applications
Thesis directed by Professor Andrew Knyazev
ABSTRACT
In this dissertation the concept of spectral clustering is examined, with the goal
of better understanding of properties of eigenfunctions of the graph Laplacian and
their use in clustering. We start by discussing bipartitioning of images via spectral
clustering, and justify this technique using an analogy to vibration problems. Such a
justification does not rely upon a traditional relaxation of a combinatorial optimiza
tion based definition of clustering. The importance of the Fiedler vector and theorems
by Fiedler is emphasized. We extend Fiedlers theorem to the case of the generalized
eigenvalue problem.
Courants Nodal Line Theorem (CNLT) can be viewed as an analog of Fiedlers
theorem for eigenvectors corresponding to higher vibration frequencies. We review the
literature for discrete CNLTs. A new definition for rweak sign graphs is presented
and used to prove a modified discrete CNLT theorem. Applications of Fiedlers
theorems and CNLTs to spectral clustering are discussed. Practical difficulties with
image segmentation related to peculiarities in construction of graph edge weights are
investigated.
Another application of spectral clustering, covered in this dissertation, deals with
recursive gene bipartitioning for DNA microarrays. We develop a new technique and
prove a theorem for dealing with disconnected graph components that are common
for DNA microarray data. Our ideas are incorporated in new MATLAB software for
gene clustering in DNA microarrays. Results of clustering using practical microarray
data are presented.
The last section deals with the software package Block Locally Optimal Pre
conditioned Eigenvalue Xolvers (BLOPEX) which as part of the PH.D. candidates
graduate work has been updated to Version 1.1. BLOPEX implements the Locally
Optimal BLock Preconditioned Conjugate Gradient (LOBPCG) method for solution
of very large, sparse, symmetric (or Hermitian) generalized eigenvalue problems.
Version 1.1 of BLOPEX adds (amongst other things) support for complex arith
metic and 64bit integers. BLOPEX includes interfaces to independently developed
software packages such as PETSc and Hypre which provide high quality precondition
ers and parallel MPIbased processing support. There is also support for execution
of BLOPEX from MATLAB and stand alone C programs.
The PH.D. candidates personal contribution in the BLOPEX project is recoding
of the abstract routines and PETSc and Hypre interfaces for the new functionality,
development of new complex arithmetic test cases, and aid and assistance in testing,
debugging and documentation of all BLOPEX interfaces. In this dissertation, we
describe the BLOPEX software, design decisions, the LOBPCG algorithm as imple
mented, and all of the various interfaces. Numerical results are presented.
The form and content of this abstract are approved. I recommend its publication.
Approved: Andrew Knyazev
DEDICATION
I dedicate this thesis to my wife Carolyn, and the volunteers and staff of the Denver
Museum of Natural History who have given me encouragement over the many years
of this endeavor.
ACKNOWLEDGMENT
Foremost, I would like to to thank my advisor, Andrew Knyazev, for his support
and motivation. Also, I thank my other Professors at UCD who renewed by interest
in Mathematics, and raised my knowledge and understanding of the subject to new
levels. My research and work on the thesis have also been supported by NSF DMS
grants 0612751 and 1115734, the PI Professor Andrew Knyazev.
TABLE OF CONTENTS
Figures .................................................................... x
Tables.................................................................... xii
Chapter
1. Introduction............................................................. 1
2. Spectral Clustering and Image Segmentation............................... 3
2.1 Introduction....................................................... 3
2.2 Graph Laplacian.................................................... 4
2.3 Combinatorial Model and Relaxation ................................ 5
2.4 Spectral Clustering Algorithm...................................... 9
2.5 Overview of Literature ........................................... 10
2.6 Vibrational Model................................................. 15
2.7 Fiedler Theorems ................................................. 17
2.8 An extension of Fiedlers theorem................................. 20
2.9 Effect of mass matrix on segmentation............................. 24
2.10 Image Segmentation............................................... 27
2.11 Edge Weights .................................................... 28
2.12 Tuning the Edge Weight Parameters................................ 29
2.13 Eigenvalue Solvers............................................... 33
2.14 Nodal Domain Theorems............................................ 35
2.15 Summary.......................................................... 46
3. DNA MicroArrays......................................................... 47
3.1 Introduction...................................................... 47
3.2 What is a DNA Microarray?......................................... 47
3.3 How is Microarray data analyzed?.................................. 51
3.4 Normalization of data............................................. 52
3.5 Disconnected Graphs............................................... 53
vii
3.6 Recursive bipartitioning........................................... 60
3.7 Weight construction................................................ 62
3.8 Toy experiments................................................. 63
3.9 Analysis of real microarray experiments............................ 64
3.10 The Software....................................................... 65
4. BLOPEX ................................................................ 68
4.1 Introduction....................................................... 68
4.2 The Problem........................................................ 69
4.3 Current Software................................................... 69
4.4 LOBPCG............................................................. 71
4.5 BLOPEX software.................................................... 72
4.5.1 Structure.................................................... 73
4.5.2 Abstract Code................................................ 74
4.5.2.1 Algorithms............................................. 75
4.5.2.2 Data Types............................................. 79
4.5.2.3 Parameters ............................................ 81
4.5.3 Drivers and Interfaces....................................... 83
4.5.3.1 PETSc.................................................. 84
4.5.3.2 HYPRE.................................................. 85
4.5.3.3 Matlab................................................. 85
4.5.3.4 Serial................................................. 85
4.6 The Google Source Site............................................. 86
4.7 Environments BLOPEX Tested On ..................................... 86
4.8 Numerical Results.................................................. 86
4.9 Summary............................................................ 88
5. Conclusions and Final Thoughts......................................... 90
viii
Appendix
A. SpectralCluster Function.................................................. 92
B. SpectralClusterTest Driver............................................... 104
C. Subroutines Used by SpectralCluster Funtion.............................. 106
D. List of Genes by Cluster................................................. 109
References................................................................... 115
ix
FIGURES
Figure
2.1 Partition of a graph by vertex cuts. All edge weights are equal........... 6
2.2 Ratiocut penalty.......................................................... 8
2.3 Fiedler vector for a tree................................................ 18
2.4 Fiedler vector for a lattice............................................. 19
2.5 Fiedler vector for a wheel............................................... 19
2.6 Observation 2: Separation of two highest masses......................... 25
2.7 Observation 3: Largest mass in a cluster by itself...................... 26
2.8 Observation 3: When largest mass cant be in a cluster by itself........ 26
2.9 Observation 4: Smallest mass never in a cluster by itself............... 26
2.10 Labeling of pixels in an image.......................................... 27
2.11 Base image for analysis ................................................ 29
2.12 Initial analysis ....................................................... 30
2.13 Effect of increasing ValScale........................................... 31
2.14 Effect of geomScale .................................................... 32
2.15 Effect of islands....................................................... 32
2.16 Effect of epsilon....................................................... 33
2.17 Poor quality Fiedler vector produced by too small a shift............... 34
2.18 Examples of strong and weak sign graphs ................................ 39
2.19 Strong sign graphs exceeding eigenvector number......................... 40
2.20 Sign graphs decreasing in number........................................ 41
2.21 An example of rweak sign graphs ....................................... 42
3.1 Gene expression.......................................................... 48
3.2 An Affymetrix GeneChip .................................................. 50
3.3 Genes with correlated expression levels.................................. 52
3.4 Effect of normalization on clusters ..................................... 53
x
3.5 A connected and unconnected graph and the effect of a connecting node
(dum) on their eigenpairs............................................. 57
3.6 An example of recursive bipartition.................................... 62
3.7 Microarray data clustering result...................................... 65
4.1 Structure of BLOPEX functions.......................................... 74
xi
TABLES
Table
4.1 Matrices analyzed ..................................................... 87
4.2 Single processor setup and solution time............................... 88
4.3 Multiple processor setup and solution time for finan512................ 89
xii
1. Introduction
In Chapter ?? of this thesis spectral clustering of data is discussed. A vibra
tional model is introduced that provides an interpretation of spectral clustering in
terms of vibration modes of a massspring system representing the data in a thought
experiment. Mathematical justification of the technique is based of some theorems
of Fiedler describing graph properties via the eigenvectors of the graph adjacency
and graph Laplacian matrices. This is a departure from traditional combinatorial
graph theoretical definitions of clustering, that seek to find a solution of combinato
rial minimization problems on graphs. While mentioned in the literature for spectral
clustering, Fiedlers theorems do not receive proper attention. We show how they are
fundamental for understanding the vibrational model of spectral clustering.
Fiedlers basic theorem is extended in Chapter ?? to encompass a wider range of
problems including the normalized graph Laplacians and some generalized eigenvalue
problems. We then experimentally investigate the effect on clustering of a mass
matrix in the generalized eigenvalue problem. With this background, we examine
an implementation of spectral clustering for image segmentation. Fiedlers theorem
also allows for a better understanding of the importance of graph edge weights. The
emphasis here is on the effect of variation of parameters in a particular approach to
edge weight generation that we thoroughly investigate in our numerical experiments.
We conclude Chapter ?? with a more theoretical examination of a different
technique, Nodal Domain theorems, also known as Courants Nodal Line Theorems
(CNLTs). The original CNLTs describe low frequency vibration modes of a mem
brane, which mathematically are the eigenfunctions of a corresponding partial dif
ferential operator. If a membrane is approximated by a massspring system, this
operator turns into its discrete analog, which in fact is the graph Laplacian, where
the graph nodes are associated with the masses and the graph edge weights are calcu
lated according to the stiffness of the springs connecting the masses. We demonstrate
1
how discrete CNLTs are related to Fiedlers theorems.
Chapter ?? builds on this initial discussion of spectral clustering by application
to DNA Microarray analysis. This is a completely different problem from image
segmentation, with challenges of its own. We introduce new techniques and construct
a full featured algorithm for gene clustering. This algorithm incorporates recursive
bipartitioning and handling of graphs with multiple components.
Chapter ?? is a departure from the previous two. It discusses our work done
on BLOPEX software, which is a parallel solver for large sparse Hermitian gener
alized eigenvalue problems. The relevance of BLOPEX for spectral clustering is its
potential applicability to eigenvalue problems involving very large matrices appear
ing in analysis of massive datasets. BLOPEX implements the LOBPCG algorithm
which is described in details. A overview of the BLOPEX software is given, which
includes a high level flowchart of the code. We also introduce our new Google site
for the BLOPEX code. A comparison to other software with similar functionality is
provided.
2
2. Spectral Clustering and Image Segmentation
2.1 Introduction
We are concerned with the problem of partitioning an image into two parts (a
bipartition) with the objective that the elements (image pixels) within each part are
more similar to each other than elements between parts tend to be. This statement,
while simple, has left undefined such concepts as similarity and measures of whether
any particular bipartition is better than another.
There is a large literature which takes the approach of defining this problem as a
partitioning of a weighted graph where the partitioning is performed by minimization
of some vertex cut function; for example see [? ], [? ]. These problems can be expressed
as a minimization under constraints of the RayleighRitz Ratio of the associated graph
Laplacian matrix.
This combinatorial problem is NP complete and to solve it, the constraints are
relaxed, leading to a problem of solving for an eigenvector of the second smallest
eigenvalue of the graph Laplacian, commonly referred to as the Fiedler Vector. Then
this eigenvector is used to separate the graphs vertices into two groups. This is the
technique of spectral clustering. We discuss the graph Laplacian in Section ??, which
is background for the rest of the chapter.
Bichot [? ] attributes the origin of spectral clustering to Donath and Hoffman [?
]. The concept is simple. Its complexity lies in understanding why it works.
Spectral clustering does not always give good solutions to the original combinato
rial problem. We examine some of these issues in Section ?? and present an alternative
justification for spectral clustering in Section ??. But, before this, we give a brief
overview of the literature in Section ?? which examines the held of combinatorial and
spectral clustering.
Spectral clustering involves using the Fiedler vector to create a bipartition of
the graph. Some theorems by Fiedler are needed to understand the character of the
3
Fiedler vector and how this relates to clustering. These theorems are reviewed in
Section ?? and we expand the scope of one of these theorems in Section ??.
Having expanded the scope of the theorems, we examine numerical results for
the generalized eigenvalue problem using mass matrices in Section ??. We then
examine the problem of image segmentation in Section ?? and discuss generation
of edge weights in Section ??. We define our algorithm for image bipartitioning in
Section ??. Then in Section ??, we give examples of how weight parameters can
effect clustering, connect this with the Fiedler theorems, and show how problems can
occur.
This is followed in Section ?? by a discussion of eigenvalue solvers and the need for
solvers for eigenpairs of large sparse matrices for the image segmentation problems.
Nodal sets and sign graph theory are reviewed in Section ?? where we derive an
extension of the discrete nodal domain theorem that includes Fiedlers theorem as a
special case. We summarize in the last section.
2.2 Graph Laplacian
Let G = (V,E) denote a graph where V is its vertex set, E is its edge set, and
the number of vertices is \V\ = n. We number the vertices of G and this index is
then used to represent the vertices V; that is V = {1 The edges of G are
undirected and have no loops.
The n x n adjacency matrix A of G has entries representing the edges E of
G. If vertices i and j have an edge between them then the element of A
represents a real positive weight assigned to that edge. If there is no edge between
vertices i and j then = 0. The degree matrix D for G is a diagonal matrix where
d/x y fkj n;r
Definition 2.1 The unnormalized graph Laplacian is L = D A.
The matrix L is a discrete analog, up to the sign, of the continuous Laplacian
A = A justification for this can be found in [? ].
4
The discrete unnormalized Laplacian has the following properties [? ].
Lemma 2.2 Let L be an unnormalized Laplacian, then
it is real symmetric,
has nonpositive off diagonal entries,
is positive semidefinite,
its smallest eigenvalue is 0,
I = (1,1,..., 1)T is an eigenvector for the eigenvalue 0, and
the multiplicity of 0 is the number of connected components of the graph.
We note here the existence of normalized graph Laplacians, and will have more
to say about them later.
Definition 2.3 The symmetric normalized graph Laplacian is Lsym = D~^LD~^.
Definition 2.4 The random walk graph Laplacian is Lrw = D~1L.
2.3 Combinatorial Model and Relaxation
One approach to clustering is to minimize some function that reflects how closely
the clusters are connected. For example in Figure ?? vertex cuts are used to partition
the graph into two subgraphs A and B where A + Â£> = 0. A cut is assigned a
value cut (A, B) = J2ieA jeB aij. This is unsatisfactory in many cases where subgraphs
that are highly unbalanced are produced. For example Cut 1 in Figure ?? has the
same cut value as that of Cut 2.
To overcome this, the following ratios are often used [? ],[? ].
cut(A.B) cut(A.B)
Ratiocut(A, B) = ;
1^1 \B\
5
and
Newt (A, B) = cut^B) + cut(A,B) where voi^ = ^ da
vol(A) vol(B) j^A
Figure 2.1: Partition of a graph by vertex cuts. All edge weights are equal.
Assuming the graph is connected (more on this in Section ??), it can be shown
[? ] that minimizing the Ratiocut is equivalent to
mm fTLf where h
AcV
/LI
y/\B\/\A\ifi e A
where we refer to / as the combinatorial solution. Note that the components of the
vector / have fixed values in A and B. Thus the solution is constrained to a subset
of E which is the set of all / Era such / is defined as above and A C V. In this
subset, by the definition of /, we have / 1 I and  / 2= y/n. Note also, that the
graph Laplacian L gets defined as part of the transition from the original Ratiocut
minimization problem to the minimization problem in the form expressed above.
6
This problem is NP complete (see Appendix of [? ]), but if the constraints are
relaxed it becomes
min
fe Rn
/LI
fLf
The term fTLf is the numerator of the Rayleigh quotient R(f) = fTLf / fT / and
the RayleighRitz characterization of eigenvalues [? ] gives the solution as the 2nd
smallest eigenvalue of L with / as its eigenvector. An eigenvector of the 2nd smallest
eigenvalue (which could have multiplicity greater than 1) is called the Fiedler vector.
A similar analysis can be performed for Ncut [? ],[? ] which yields relaxed
problems Lf = DXf and Lsymf = A/.
A spectral clustering analysis consists of finding the Fiedler vector and using it
to produce the clusters A and B. Commonly, these are chosen as A = {i\fi > 0} and
B = {i\fi < 0}. Other alternatives include A = {i\fi > 0} and B = {i\fi < 0}. This
method is derived by analogy to the combinatorial solution of /, which have strictly
positive and negative solutions for the components of /. Sometimes, a search around
zero is used to find partitions with better Ratiocuts [? ]. There, an assumption is
that the relaxed problem has a solution close to the ratiocut problem. In any of these
cases a strict partition (i.e. with non overlapping clusters) is always produced.
There are some questions that can be raised at this point.
How closely does the relaxed solution come to the actual combinatorial solution?
Graphs can be constructed where the difference in cut values is arbitrarily large
[? ]. This might be expected since the original problem is NP complete, the relaxed
problem is not, and we have placed no restrictions on the type of graphs involved.
Even if we have a solution that is close to combinatorially optimal, how balanced
is it?
7
We make the following observation. The balancing term in Ratiocut is p] + ]^
If we plot this for various values of \A\ and n (Figure ??) we see a large penalty near
the extremes which is desirable but a very small penalty for values of R between.
So if we want a balance closer to ^ = .5 this can be achieved even when \A\ \B\.
This is particularly true if there is a large number of vertices, which is usually the
case for practically important applications.
Figure 2.2: Ratiocut penalty
This means that there is a possibility that the combinatorial solution is not par
ticularly balanced. Whether this is good or bad depends on the type of clustering.
There are examples where the relaxed problem yields intuitively satisfying results
where the ratiocut does not. Let us consider a very long string of beads where the
edge weights are all the same, except for one edge near an end of the string where
the penalty is very small and another edge in the center of the beads. Let us suppose
that the edge near the end of the string has a weight less than that of the edge in the
center, and no edge weight is very much greater than any other. Under this scenario,
the relaxed problem has a solution which partitions the string around the center edge
8
but the ratiocut is minimized by a partition around the outer edge. A more balanced
partition produced by the relaxed problem is more intuitively satisfying.
What should we do with vertices where fi = 0? This situation is precluded in
the original combinatorial solution for /.
The choice of putting the vertex i where fi = 0 into cluster A along with /* > 0
is arbitrary. We could just as well have chosen to place it in cluster B.
Are clusters connected?
We see in Section ?? that one of the clusters will always be connected but the
other may not be. This upsets our intuition about what a good cluster should be. We
expect the nodes within a cluster to have some similarity, but if the cluster itself is
not connected how can this be. Also, this is a practical issue if we should do recursive
bipartitions to break an image into multiple clusters. We want to start our algorithm
with a connected graph and this might no longer be the case. Why this is the case is
explained in Section ??.
In an attempt to overcome these issues and force the spectral clustering results to
give clusters with better Ratiocut values, some algorithms resort to adhoc approaches
such as looking for a better Ratiocut around zero; for example A = {i\fi > r} and
B = {i\fi < r} where r > 0. The result may still not be optimal and one of the
clusters could still be not connected.
These kinds of problems have led us to adopt an alternative justification for
spectral clustering and an algorithm that reflects this.
2.4 Spectral Clustering Algorithm
We present the basic algorithm for Spectrial Clustering that will be used later.
This is done at this point to familiarize the reader with the approach. The details of
explaining the algorithm are presented in the subsequent sections.
9
1. Associate the problem of clustering n data points with a graph G = (V., E) with
n vertices.
2. Number the vertices of G from 1 to n.
3. Assign weights to each edge and construct a weighted adjacency matrix A for
G and the associated degree matrix D.
4. Construct the Laplacian matrix L = D A.
5. Compute the Fiedler vector Y = {yi,y2, , Vn} of the Laplacian L.
6. Partition the graph into subgraphs A and B using Y as follows:
A = {i E V : yi > 0}, B = {i eV :yi < 0}.
Notes:
The graph G must be connected.
There are many ways to assign weights. Section ?? only describes one method.
A (J B = V but it is possible that A P B yt 0 since there may be y* = 0.
With respect to the vibrational model (see later in Section ??) for spectral clus
tering, A and B are the optimal solutions. Their RatioCut values are irrelevant.
Since both A and B are connected, we can use recursive applications of the
algorithm to generate smaller clusters.
2.5 Overview of Literature
The clustering literature, with connections to spectral clustering, is primarily
concerned with the solution of a combinatorial problem based on minimization of
some cost function. Since this problem turns out to be NP hard, its solution is
attempted via relaxation to a spectral clustering problem.
10
The literature is generally concerned with one or more of the following issues:
alternatives to spectral clustering,
justification of spectral clustering,
edge weight selection,
how to use eigenvectors to obtain clusters,
how many clusters to construct, and
applications.
An excellent starting point to study spectral clustering is the tutorial by Ulrike
von Luxburg [? ]. After that, possibly the most referenced paper on the subject is
by Shi and Malik [? ].
There are alternatives to spectral clustering. An algorithm for finding bipartitions
with minimum cut weight is presented by Stoer and Wagner [? ]. However, this
algorithm does not address the problem of finding balanced partitions.
Some authors [? ],[? ],[? ],[? ],[? ],[? ],[? ] make use of several eigenvectors,
collect these into a matrix, and partition based on the row vectors, which are usually
normalized, and then have fcmeans clustering applied to them.
Alpert and Yao [? ] construct a mincut problem where cluster volumes must
fall within predefined ranges. They then define a maxmin kway vector partitioning
problem based on the eigenvector rows, and show that these two problems are equiv
alent when all eigenvectors are used. Of course in this case, since mincut is NP hard,
their vector partitioning must also be NP hard. No justification, except for numerical
experiments, is presented when not using all eigenvectors.
Alpert, Kahng, and Yao [? ] define their MELO (multiple eigenvector linear
orderings) algorithm, which uses multiple eigenvectors to produce a bipartition.
11
White and Smyth [? ] define a new cost function, referred to as the Q function,
which is a measure of the deviation between the probability that both ends of an edge
he in the same cluster, and the square of the probability that either end of an edge lies
in the cluster. Minimization of the Q function is a problem whose relaxation results in
(approximately) an eigenvalue problem for Lrw, the random walk Laplacian. fcmeans
clustering is then applied to the rows of some of the eigenvectors of Lrw. The best
clustering is computed by varying k and computing Q for each partition produced.
Another probabilistic approach is proposed by Meila and Shi [? ] [? ]. They
define a cost function via a random walk and connect this to the NCUT balanced cost
function. They then present a machine learning algorithm to determine edge weights,
trained by predefined image clusters.
While Meila and Shis random walk development is fairly intuitive, a more obscure
attempt at justifying spectral clustering via the random walk Laplacian, is suggested
by Nadler, Lafon, Coifman, and Kevrekidis [? ]. This is based on an assumption that
the graph vertices are distributed according to the probability distribution p(x) =
e~u(x) where U(x) is a potential.
A different discrete cost function is presented in the paper by Kanna, Vempala,
and Vetta [? ]. They define an (cqe) bicriteria, relax the minimization problem, and
then show that spectral clustering has a worst case approximation guarantee with
respect to the (cqe) measure. While interesting, it does not seem to be that practical.
Other attempts to place bounds on the cluster results with respect to a cost
function date back to Donath and Hoffman [? ]. They set a lower bound on the
sum of the edge cuts, where the number of clusters and the number of vertices in the
clusters are preselected.
Ng, Jordan, and Weiss [? ] use a typical approach to spectral clustering, i.e. the
symmetric Laplacian and fcmeans clustering applied to multiple eigenvectors. They
then use the Cheeger constant [? ] of the clusters to construct criteria assumptions.
12
If these assumptions are met, they show that there exists orthogonal vectors in k
space such that rows of the eigenvectors are close to them.
Sharon, Galun, Sharon, Basri, and Brandt [? ] apply spectral clustering to image
segmentation. They present a multigrid approach (see also Bichot [? ]) where spectral
clustering (Ncut cost function) plays the role of biclustering at each coarseness level.
Dhillon, Guan, and Kulis [?][?] give another multigrid approach using the
weighted Kernal fcmeans algorithm. This projects the data vectors onto a higher
dimension space where fcmeans does the clustering for the refinement steps. Spec
tral clustering is applied at the coarsest level. They apply this technique to image
segmentation and gene networks.
A technique that is not spectral based is presented by Felzenszwalb and Utten
locher [? ]. This is applied to image segmentation and, while not spectral in nature,
has an interesting approach for coarsening of the image. They define a concept of a
segmentation being too fine or too coarse. A lumping procedure is then applied to
lump vertices into segments that are neither too fine or too coarse.
Orponen and Schaeffer [? ],construct an approximate Fiedler vector by minimiz
ing the Rayleigh Quotient of a reduced Dirchlet matrix via the gradient descent.
The issue of round off error in the computation of the Fiedler vector via Krylov
subspaces is taken up by Matsekh, Skurikhin, and Rosten [? ]. They observe that
round off errors, inherent in numerical solutions of the eigenvalue problems, can make
the Fiedler vector unsuitable for image segmentation. We address this issue later in
this thesis in Section ??.
Clustering the Fiedler vector is analyzed by Chan, Ciarlet, and Szeto [? ]. Their
issue is, what is the best way to partition the Fiedler vector? They show that a median
cut (equal numbers of vector coordinates) on the second eigenvector is optimal, in the
sense that the partition vector induced by this is the closest partition vector to the
second eigenvector. What they actually prove, is that the median cut on any vector
13
is optimal to that vector. It does not show how close the partition vector based on
the second eigenvector is to the solution of the combinatorial problem.
Ding, He, and Simon [? ] show the equivalence of symmetric nonnegative Matrix
Factorization (NMF) and Kernal fcmeans, and the equivalence of Ncut and NMF.
The definition of Ncut they use is not the same as that defined by Shi and Malik.
An entirely different approach is presented by Grady and Schwartz [? ]. They
start with the RatioCut problem and convert it to an isoperimetric problem which
can be solved as a a linear system.
A very interesting paper comes from ZelnikManor and Perona [? ]. They use the
inverse Gausian to compute weights, but associate a different scaling factor with each
vertex. The scaling factor reflects the density of vertices around each vertex. This
allows for handling of multiscale data and background clutter. They also propose a
cost function for automatic determination of the number of clusters.
Similarly, Fischer and Poland [? ] also adjust the scaling factor for each vertex.
But their technique for doing so is different. They also propose use of klines clustering
of the eigenvector rows instead of fcmeans.
We deal with image segmentation and microarray analysis in this thesis. Another
application that is only starting to receive some attention is the application of spectral
clustering to Phylogenetics [? ] [? ] [? ]. Here spectral clustering has been proposed
as an alternative to maximum parsimony, maximum likelihood, and neighborjoining.
The data, so far, has been gene sequences. The techniques are mostly fairly standard:
symmetric Laplacian, multiple eigenvectors, and fcmeans.
Other papers, from the list of references, are mentioned elsewhere in the thesis
and are not repeated here.
14
2.6 Vibrational Model
We first consider a problem with an infinite number of points. Suppose we have
a planar membrane which is continuous. It is stretched over some geometric shape
Q which is connected but possibly not convex; such as a circle, square, L shape, etc.
Vibrations on the membrane are described by the wave equation pu + Au = 0 in
Q [? ],[? ],[? ] with natural (free) boundary conditions. Here u is the displacement
of the membrane, and p = ^ where v is interpreted to be the velocity of the wave
(see Davis [? ] page 200). This interpretation is based on an analysis of units in the
equation where u has dimension of length.
Suppose we only have transversal vibration without friction. We assume standing
wave solutions of the form u(x,y,t) = U(x,y)elu;t. The problem then becomes a
generalized eigenvalue problem of the form AU = uo2pU.
The smallest vibration frequency is zero, which corresponds to the first vibration
mode, which is a constant function. The second vibration mode corresponds to the
Fiedler vector. This solution divides the membrane into two parts. Each part is
connected and moves synchronously. One part will have positive displacement while
the other has negative displacement. These are the nodal domains induced by the
eigenfunction (more on this is Section ??).
Now, suppose instead of a membrane we have a mass/spring system with a finite
number of points (nodes) in a plane and let the masses be numbered from 1 to n. The
masses are free to move perpendicular to the plane, but cannot move in the plane.
Springs are attached between masses, but not necessarily between all masses. Assume
displacements of the masses are small. This is a system of undamped harmonic
oscillators [? ]. The equation describing this is Mz + Kz = 0 where M is a diagonal
mass matrix with non negative elements, K a stiffness matrix and z a vector whose
15
Zi entry describes the displacement of the zth point mass.
Assuming a standing wave solution z = ZeluJt, the equation becomes the gener
alized eigenvalue problem oj2MZ + KZ = 0. Here the values of any eigenvector Z
represent the magnitude and sign of vibration of each node for the corresponding fun
damental frequency uj. Again consider the Fiedler vector which defines the 2nd vibra
tional mode. Nodes with the same sign are connected and move synchronously. This
consideration of vibration nodes imposes a natural bipartition on the mass/spring
system.
Suppose we have masses of all unity so that M = /. Take the point masses to be
vertices in a graph and the spring constants uy, to be a weight assigned to the edge
connecting vertices i and j where i yt j. The stiffness matrix K is now the negative
of the graph Laplacian. To see this note that an element of the vector KZ is the net
force on node i for displacement vector Z, which for positive z is negative in sign.
This gives
(KZ)i =  Zj) = 
j
where
I
I +Wij if yf j
Thus we arrive at the problem expressed in terms of a graph Laplacian LZ = oj2Z
. Spectral clustering produces a bipartitioning corresponding to the second mode of
vibration. This clustering satisfies our intuitive notion of a clustering as each cluster
is connected, vibrates in phase (similarity) and vibrates tt degrees out of phase with
the other cluster (dissimilarity).
The statement that the clusters are connected is intuitively true for the membrane
with an infinity of points and can be proven (see Section ??) but is not as apparent
for the finite mass/spring system. We have also assumed that there are no boundary
(zj = 0) nodes. We use some theorems by Fiedler to resolve these issues.
E
w.
v
E
UJijZj
16
2.7 Fiedler Theorems
The 1975 paper by Miroslav Fiedler [? ] has two theorems on the Fiedler vector
which he referred to as the characteristic valuation of a graph.
While the work of Fiedler is mentioned by some papers [? ],[? ],[? ] it is not
central to the combinatorial justification for spectral clustering. We will see however
that it clarifies certain aspects and is important for understanding practical numerical
results of spectral clustering.
We repeat Fiedlers theorems here with some modification of terminology.
Theorem 2.5 (Fiedler /? ] Theorem 3.3 and Remark 34) Let G = (V,E) be a
finite connected graph with positive edge weights. Let L = D A be the unnormalized
Laplacian of G as defined in Section ?? and u the Fiedler vector of L. Then V r > 0
the subgraphs induced by A = {i G V : Ui > r} and B = {i E V : Ui < r} are
connected.
In this theorem the subgraph induced by A is the graph with vertices A and edges
Ea C E where e G Ea if and only if both edges of e are in A.
Corollary 2.6 If r = 0 then both A and B are connected.
Theorem 2.7 (Fiedler /? ] Theorem 3.8) Given any edge cut of a finite, connected
graph G that creates two connected components there exists a weighting (positive valu
ation) of the edges of G such that the Fiedler vector defines the same two components.
This partition is derived via positive and negative values of the Fiedler vector and there
are no zero values in the Fiedler vector.
In Figure ?? (a tree), Figure ?? (a lattice), and Figure ?? (a wheel) we display
Fiedler vector values for three graphs with constant edge weights. The values for the
Fiedler vector are listed against the corresponding vertex. These examples demon
strate how Theorem ?? would partition the graph. Lattices will be used for image
17
Figure 2.3: Fiedler vector for a tree
Tree with eval: 0 0.26795 0.65708 1
0.444
bipartitioning. The wheel in particular shows how using > 0 and < 0 as partitioning
criteria can lead to disconnected clusters.
The first thing to note is that there is no good reason to preferentially place nodes
with Ui = 0 in cluster A or B. But there is a very good reason to place these boundary
nodes in both clusters A and B. We see from Corollary ?? that both clusters A and
B will now be connected. The clusters now are not disjoint and thus do not satisfy
the usual objective of discrete partitioning problems.
Theorem ?? demonstrates the importance of the weights we place on edges.
Choosing any bipartition giving connected clusters there is a weighting of the edges
which will produce it. So we must be very careful to choose weights which impart
some meaning to the similarity of vertices.
We present a model for edge weights on images in Section ??.
18
Figure 2.4: Fiedler vector for a lattice
4x7 5pt lattice with eval: 0 0 19806 0.58579 0.75302
0.261 0 209 0 116 0 000 0 116 0 209 0 261
7 * 0.261 7 0.209 9 \ 0.116
0 261 * v 0 209 9 l 0.116
0.261 * t 0.209 * Â£ 0.116 * t
0.000
0.000
T 1 0.116 r r 0.209
0 116 0.209 C
0.116 < i t 9 <,7 0.209 C * t \
0.261
0 261
Figure 2.5: Fiedler vector for a wheel
5Spoke Wheel with eval: 0 111
19
2.8 An extension of Fiedlers theorem
The theorems of Fiedler as applied to graphs are proven for unnormalized graph
Laplacians. We propose and prove a more general version of Theorem ??, such that it
applies to the normalized graph Laplacians and the more general eigenvalue problem
Lx = AMx where M is any diagonal matrix with positive diagonal entries. This
problem corresponds to a mass/spring system where the mass matrix is not necessarily
the identity matrix. In particular, we want to show that an eigenvector corresponding
to the second smallest eigenvalue of this generalized eigenvalue problem can be used
to partition the associated graph for the problem into connected components.
To this end we state a lemma and a theorem from Fiedlers 1975 paper [? ] that
we will need in our proof. We also need the concept of an irreducible matrix [? ].
Definition 2.8 A square matrix is reducible if it is in the form
A
B C
0 D
or can be placed in this form via a permutation matrix P, as follows, PTAP.
A matrix is irreducible if it is not reducible.
20
Lemma 2.9 A graph G is connected if and only if its adjacency matrix A is irre
ducible.
Proof: Suppose a graph is not connected. Then by permutation of its adjacency
matrix it can be transformed to a matrix where the rows of any connected component
are contiguous. A is now in a block form similar to that in Definition ??. For a row
i and a column j where j is not in the same component as i, we must have Aij = 0,
since there is no edge between vertices i and j. So the off diagonal block must be all
zeros and A is reducible. So, A irreducible implies G is connected.
If G is connected and A is reducible, then we can permute A into the form in
Definition ??. Since A is symmetric, submatrix C must be all zeros. But then the
rows of matrices B and D define two disconnected components of G, a contradiction.
Note, that it is the off diagonal zero elements of a matrix that determine whether
it is irreducible. So, from Lemma ??, it follows that the unnormalized Laplacian
L = D A is irreducible if the graph is connected.
Theorem 2.10 (Fiedler /? ] Theorem 2.3) Let A be an nxn nonnegative irreducible
symmetric matrix with eigenvalues Ai > A2 > ... > Xn. Let u = (uf) > 0 be an
eigenvector corresponding to Ai and v = (vf) an eigenvector corresponding to A2.
Then for any a > 0, the submatrix A(Va) is irreducible where Va = {i : Vi + aui > 0}
and A(Va) is the matrix consisting of rows i E Va of A and columns j E Va of A.
Note: u = (uf) > 0 exists by the PerronFrobenius Theorem [?][?] since A is
nonnegative and irreducible, there exists a positive eigenvector u for Ai.
21
We now generalize Fiedlers Theorem ?? as follows.
Theorem 2.11 Let G = (V,E) be a finite connected graph with vertices V numbered
1,... ,n and edges e^ G E, between vertices i and j, assigned a positive number. Let
L G Rraxra be the unnormalized Laplacian of G. Let M G Rraxra be a diagonal matrix
where mu > 0. Let f be the Fiedler vector of the generalized eigenvalue problem
Lx = AMx. For r > 0, let Vr = {i : fi > r} and VTr = {*:/< r}. Then Gr the
subgraph induced on G by Vr is connected and G_r the subgraph induced on G by V_r
is connected.
Proof: We note that this proof follows closely that of Fiedler [? ] for the
unnormalized Laplacian. It proceeds by transformation of matrix L to one satisfying
the properties of Theorem ?? and then using its eigenvectors to form connected
subgraphs of G.
Since G is connected its adjacency matrix is irreducible by Lemma ??, and so is
L = D A. Let B = M~^LM~K B is nonnegative on the off diagonal elements,
since Bij = and is the product of terms with sign of Ih Here we
have used < 0 for off diagonal elements ( see Lemma ?? ). B is irreducible since
(1) L is irreducible and (2) an off diagonal element of B is zero if and only if it is zero
in L.
For a > maxj \Bu\ it holds that B + a I is nonnegative, symmetric, and has
positive diagonals. B + a I has the same off diagonal elements as B, and since B is
irreducible B + al is also. Therefore it satisfies the conditions of Theorem ??.
22
Furthermore, (A, x) is an eigenpair of Lx = AMx <=/ (A = A + a, M?x) is an
eigenpair of (B + al)x = Xx. To show this let Lx = A Mx, and y = M^x then
Lx = A Mx
M~\lM~\m\x = A M~\Mx
BM^x = A M^x
(B + al)y = (A + a)y
(B + al)y = Xy.
Since, all steps are reversible the if and only if relation follows.
Now, zero is an eigenvalue of Lx = A Mx with eigenvector I = (1,1,..., 1)T.
Since L is positive semidefinite and M is positive definite then zero is the smallest
eigenvalue of Lx = A Mx and A=(A + o;) = Q!is the largest eigenvalue of of B + al
with u = M^I the corresponding eigenvector. Let v be the eigenvector of the second
largest eigenvalue of B + al. This would correspond to the eigenvector of the second
smallest eigenvalue of Lx = XMx; i.e. the Fiedler vector / where v = M^f.
Let r > 0 and Vr = {i : Vi+rv,i > 0}. By Theorem ??, the submatrix (B+aI)(Vr)
is irreducible. By the definition of L and B, the off diagonal elements of B + al are
zero if and only if the off diagonal elements of L are zero. So, L(Vr) is irreducible, as
is A(Vr) where A is the adjacency matrix of graph G. Therefore, by Lemma ??, the
subgraph induced on G by Vr is connected.
Now Vi = = y/m~ifi and Ui = (Mil); = v/m^. So, ty + ny = y/m~ifi +
ry/rriu which implies Vr = {i : fi + r < 0} = {*:/ > r}. Now, if v is a Fiedler
vector then v is also. So, V_r = {i : fi > r} = {*:/< r} must also be
connected.
Theorem ?? covers the generalized eigenvalue problem Lx = XMx with M = D
provided the graph defining L is connected. This is the the random normalized
Laplacian problem Lrwx = D_1Lx = Xx.
23
For the symmetric normalized graph Laplacian, Lsym = D~^LD~^ The problem
Lsymx = \x, reduces to the problem Ly = XDy with x = D^y.
So, without recourse to combinatorial models we have shown that the normalized
graph Laplacians and the more general forms of these equations using a mass matrix
produce connected subgraphs via spectral clustering based on the Fiedler vector,
where the clusters produced might not be disjoint.
2.9 Effect of mass matrix on segmentation
Flaving extended one of Fiedlers theorems (Theorem ??, see Section ??) with
respect to the mass matrix Laplacian problem Lx = XMx, we now examine the effect
of the mass matrix on clustering. For this analysis, M is a diagonal matrix with
Mtl > 0.
Numerical experiments were performed. These have involved a variety of graphs,
including lattices, lines, and circles. The edge weights are equal, since we wanted
to isolate the effect of the masses. Masses are varied, but the effect of equal masses
except at one vertex was always examined. The unnormalized Laplacian has been
used in this analysis.
We present the following observations of numerical experiments, which apply to
bipartitions produced via the Fiedler vector.
1. Clusters defined by division of the Fiedler vector via > 0 and < 0 will be
connected. This is insured by our extension of Fiedlers Theorem (Theorem ??).
2. The two vertices of highest mass tend to be in separate clusters; see Figure ??
for two examples of this.
3. For the vertex of highest mass, if the mass is large enough, it will be in a cluster
by itself provided observation 1 is not violated; see Figure ?? and Figure ??.
4. The vertex of smallest mass is never isolated in a cluster by itself; see Figure ??.
24
Observation 3 can be understood (but not proven) by the following. Reformulate
l l
the problem Lx = AMx as \ / * /. ,\ / > // = By = Ay. Let to* be the mass of vertex
i. The matrix B has elements By = If we take B row by row we see
that we are adjusting the weight of the edges of row (vertex) i by Suppose
TOj is the largest weight and >> nij for every i ^ j. If the original edge weights
are large relative to ^==, then we now have the smallest weights around vertex i and
would expect it to be in a cluster by itself.
Observation 4 follows for similar reasons with mk rrii and the largest weights
are around vertex k. So it will be strongly associated with the surrounding vertices.
Numerical experiments for these results are reproduced in the following figures.
The mass of each vertex and its Fiedler vector value is listed. The Laplacian is setup
with constant edge weights and the Matlab eig function is used to solve for the Fiedler
vector.
Figure 2.6: Observation 2: Separation of two highest masses
0.371 0.213 0.495
0.619 0.597 0.369 0.013 0.352
m=0.10 m=1.00 m=1,00 m=2.00 m=3.00
25
Figure 2.7: Observation 3: Largest mass in a cluster by itself
0.084 0.082 0.517
0.729
Figure 2.8: Observation 3: When largest mass cant be in a cluster by itself
0.223 0.143 0.010 0.498 0.826
m=1.00 m=1.00 m=100.00 m=1.00 m=1.10
Figure 2.9: Observation 4: Smallest mass never in a cluster by itself
0.577 0.289 0.000
1=0.01
.289
i=1.00
.577
m=1.00 m=1.00 m=1.00
> n=1.00 < i n=1.00 o
0.289 0.000
n=1.00 n=1.00
D.000 o < 1289 o
26
2.10 Image Segmentation
Our problem is to bipartition an image according to the ideas discussed in Sec
tion ??. A real world image has an infinite (for all practical purposes) number of
pixels. When we take a digital picture we reduce the real world image to a finite
number of pixels. In a simple case, each pixel has red,green,blue (RGB) values from
0 to 255. We consider the discrete image as a graph with vertices corresponding to
the pixels which are labeled as consecutive positive integers from the lower left edge
of the image to the upper left edge column by column as shown in Figure ?? for an
image with m rows and l columns.
Figure 2.10: Labeling of pixels in an image
m o 2m o ... o ml
2 om + 2o... :
1 om+\o...o{l 1 )m + 1
Edges are defined based on a 5 point stencil. A 5 point stencil connects each pixel
to the 4 closest pixels, except for edge or corner pixels, where they connect to 3 or 2
points respectively. The result is a mx l lattice. If we had a 3 dimensional image we
would define similarly a pixel numbering with edges defined on a 7 point stencil.
Note that the graph Laplacian of a lattice has a structure imposed by the 5 point
stencil and the numbering choice for the vertices. The only non zero elements of the
Laplacian lie on the diagonal, adjacent to the diagonal, and m columns from the
diagonal.
27
2.11 Edge Weights
The final component we need to construct the Laplacian is a formula for the
edge weights. There are many ways to assign weights. The method we choose is
implemented in Leo Gradys software [? ]. It is based on a combination of RGB
values and geometric distance.
Say pixel 1 and pixel 2 are connected by an edge and they have RGB values
{R1,G1,B 1} and {R2,G2,B2}. Let 8R = R1 R2, 5G = Gl G2, and SB =
Bl B2.
Let valDist = ^SR2 + 5G2 + 5B2. Then normalize the values with respect to
the range [0,1]. Set geomDist = 1, which is the same for all edges. valDist is a
measure of dissimilarity between pixel values. geomDist is a measure of the physical
distance between pixels.
The distances are a measure of dissimilarity between pixels. For the Laplacian we
need measures of similarity. This is achieved by using the function e~dlstk*a where a
is a constant and k a positive integer. These control the rate that similarity declines
with distance. For images, we restrict our analysis to a simpler model where k = 1
and a = 1.
Definition 2.12 We define the weights assigned to edges as edgeWeight = e~dlst +
epsilon where dist = geomScale* geomDist+valScale*valDist. epsilon,geomScale,
and valScale are positive constants in the computation of edgeWeight vMch are used
to tune the model.
We have introduced 3 parameters: geomScale and valScale adjust the relative
importance of geometric vs RGB values. Geometric distances are always unity, since
the pixels are nodes of a uniformly spaced lattice, but value distance (valDist) can
vary from 0 to 1 and values are typically less than 0.05. geomScale is typically chosen
to be 1 and valScale 25 or larger, epsilon is the minimum allowable weight. A typical
value for epsilon is 105. We discuss the impact of these parameters in Section ??.
2.12 Tuning the Edge Weight Parameters
Some numerical experiments using Matlab with the image in Figure ?? were
performed to illustrate the effect of the parameters on weight assignment as described
in Section ??. The image is of a cats ear with dark fur against a light gray background
with some white whiskers and fur. This has an obvious clustering consisting of the
ear and the background.
The primary image is 200 x 320 pixels with RGB values from 0255. A graph
Laplacian is constructed as defined in the previous sections.
Figure 2.11: Base image for analysis
For the initial analysis the image is scaled by sampling of pixels into images with
smaller resolution. Weights are assigned with valScale = 25, geomScale = 1, and
epsilon = 105. The Matlab eigs sparse eigensolver has been used.
29
The results are shown in Figure ??, where a light line has been added to the
image to show the boundary separating the clusters generated. The graph in the
lower right hand corner shows solution time in seconds (yaxis) versus image size in
pixels (xaxis).
Figure 2.12: Initial analysis
200x320
100x160
67x107
50x80
40x64
34x54
29x46
The results are good with the exception of the 200 x 320 image where the boundary
deviates upwards to the top edge of the image. This can happen when the valDist
is too small with respect to the geomDist.
To see the effect of geometry, consider the lattice in Figure ??. In this graph
the edge weights are all one and the lattice could be considered as an image where
RGB values for all pixels are the same. Note how the boundary {i : iji = 0} bisects
the image from top to bottom (the smaller dimension) at the midpoint of the larger
dimension. This illustrates how the geometry of the image can affect the Fiedler
vector, and we might suspect that the distance term is overpowering the value term
in the 200 x 320 image.
30
Adjusting the valScale to 50 and repeating the experiment solves the problem,
see Figure ??.
Figure 2.13: Effect of increasing ValScale
200x320
100x160
67x107
50x80 40x64
34x54
29x46
25x40
eigs,val=50,geom=1
Why do not we see this problem at the reduced image scales? As the scale is
reduced by sampling, the computed mean valDist is increased from .018 for the
200 x 320 image to .05 for the 25 x 40 image. As a result the valScale valDist term
increases with respect to the geomScale geomDist term which is unchanged. This
has the same effect as adjusting valScale.
Similarly varying geomScale with respect to valScale can cause degradation of
the bipartition as shown in Figure ?? for the 40 x 64 image.
So what does epsilon, in edgeWeight = e~dist + epsilon, do? Very small edge
weights can occur when we have a small number of connected pixels with RGB values
very different from those surrounding them. These islands or specks within the
image can dramatically affect the clustering. To see this consider the lattice in figure
?? but with the edges of a single vertex made smaller than one (vertex number 6
using the numbering previously defined). The result is shown in Figure ??. Vertex 6
31
Figure 2.14: Effect of geomScale
geomScale=1 geomScale=2 geomScale=3
geomScale=4 geomScale=5 geomScale=6
geomScale=7 geomScale=8 geomScale=9
40x64 eigs valScale=50
is now in a cluster by itself.
Figure 2.15: Effect of islands
4x7 5pt lattice with eval: 0 0.04086 0.18894 0.46508
0.025 0.028 0.033 0.039 0.044 0.048 0.050
We can see this in our ear image if we set epsilon = 10 8. The result shown in
Figure ?? shows the effect of one such speck in the 100 x 160, 50 x 80, and 40 x 64
32
scaled images. In the 40x64 image we have magnified the image to show more clearly
how a single pixel has been isolated. Epsilon smooths out large dissimilarities in
RGB values.
Figure 2.16: Effect of epsilon
200x320 100x160 67x107
2.13 Eigenvalue Solvers
The Matlab eigs function has been used in the form eigs(L, n, sigma). If sigma
is not specified it defaults to zero, i.e. it solves for the smallest n eigenvalues. For
the positive semidefinite matrices we are dealing with, when solving for the smallest
eigenvalues, this can lead to instability in the eigs function which results in NAN
errors. To overcome this a small sigma is used. This technique is also utilized in Leo
Gradys software [? ]. If the sigma value is too large eigs may not return a good
Fiedler vector. Also, if we are solving to a tolerance that is too large, the Fiedler
vector may be of low quality.
In these cases disconnected clusters can be produced, see Figure ?? where a sigma
of 105 has been used. Note that for the 50 x 80, 40 x 64, and 29 x 46 scaled images
33
the cluster boundary gives disconnected clusters. We know from Corollary ?? that
this is not possible, so something must be wrong with the solution for the Fiedler
vector.
Figure 2.17: Poor quality Fiedler vector produced by too small a shift
200x320 100x160
67x107
50x80
40x64
34x54
The eigs function is usable in our experiments since the size of the image is small.
Larger images will produce Laplacians with dimensions in excess of 107. This size
problem requires the use of large sparse eigensolvers such as BLOPEX [? ],[? ].
When using BLOPEX a positive definite matrix is needed and for image analysis a
small shift (L + al) is used. BLOPEX implements the iterative solver LOBPCG [?
] and makes use of preconditioners [? ]. But, whatever method is used, if either
G V : fi > 0} or {i G V : / < 0} is not connected, then the Fiedler vector
produced is suspect, because we know by Theorem ?? that they must be connected.
Image segmentation can practically exploit these eigensolvers. The Laplacian is
symmetric positive semidehnite with smallest eigenvalue of zero. It can be shifted to
produce a positive definite matrix without changing the eigenvectors, since if Lx = Xx
then (L + aI)x = (A + a;):r. We know what the eigenvector is for eigenvalue zero, and
34
this can be used as an orthogonal constraint for solution of the next eigenpair. Both,
a shift and orthogonal constraints are a part of the LOBPCG software. Details of this
are in the chapter on BLOPEX in this thesis. These eigensolvers are designed to solve
for only a few of the smallest eigenvalues, and we only need the second eigenvalue
and associated eigenvector, so spectral clustering can be done efficiently.
2.14 Nodal Domain Theorems
Fiedlers theorem ?? deals with the bipartitioning of a graph based on the 2nd
eigenvector of the graph Laplacian (Fiedler Vector). We have provided an extension
of this to a generalized eigenvalue problem. A theorem similar to Fiedlers exists for
higher eigenvectors. We summarize the following work done on nodal domains and
its discrete counterparts.
In Courant and Hilberts book Methods of Mathematical Physics, Volume 1 ,Ch.
6, Sec. 6 [? ] they present the following theorem. This dates from 1924, and is
frequently referred to in papers on nodal domains. It is often referred to as CNLT
(Courants Nodal Line theorem).
Theorem 2.13 (CNLT /? ] pg f52) Given the selfadjoint second order differential
equation
L[u] + Apu = 0 (p > 0)
for a domain G vnth arbitrary homogeneous boundary conditions; if its eigenfunctions
are ordered according to increasing eigenvalues, then the nodes of the nth eigenfunc
tion un divide the domain into no more than n subdomains.
By nodes is meant the set {x E G : un(x) = 0}. Depending on the dimension of
G this might be a nodal point, curve, surface, etc. These nodes (also referred to as
nodal lines) divide the domain G into connected subdomains called nodal domains.
If G has dimension m then the nodes are hypersurfaces of dimension m 1. It
can be shown that the nodes are either closed or begin and end on the boundary [?
], and are of Lebesque measure zero in G [? ].
35
The proof of CNLT provided by Courant and Hilbert is rather cryptic; being more
of a discussion than a formal proof and is for the case of G C f?2. A more accessible
proof, for a more general case of G C Rm, is provided in a paper by Gladwell and
Zhu [? ].
While not giving the details of the proof we examine its basic features and see
how it acts as a template for a discrete CNLT. Ultimately, we derive a new version of
the discrete CNLT, that includes Fiedlers theorem as a special case. Fiedlers proof
depends on theorems of positive, irreducible matrices. It is instructive to see how an
entirely different approach than that used by Fiedler yields the same result.
The CNLT proof depends on:
1. the nature of the eigenfunctions and eigenvalues of the problem, i.e. the eigen
values can be ordered and the eigenfunctions form a complete orthonormal set,
2. the variational form of the problem,
3. the application of the minmax characterization of eigenvalues (CourantFischer
Theorem), and
4. the unique continuation theorem.
The eigen problem has infinitely many positive eigenvalues 0 < Ai < A2 < ...
whose corresponding eigenfunctions form a complete orthonormal set, see Griff el [? ]
Thm 9.16. Ai = 0 for the free membrane problem and Ai > 0 for the fixed membrane
problem.
Those of us primarily versed in matrix analysis are familiar with the Courant
Fischer theorem from Horn and Johnson [? ] Thm 4.2.If. The equivalent on a
domain D that is a subset of a Hilbert space is given in Griffel [? ] Thm 10.18.(b)
and is repeated here.
36
Theorem 2.14 (Griffel /? ] Maximum Principle pg 287) If A is a positive definite
symmetric differential operator, on a Hilbert space H, with domain D, a compact
inverse, and with eigenvalues then
\n+i = max {mln{R(u) : uPspan{ki,kn}}}
ki,...,knÂ£D
where R(u) = is the Rayleigh Quotient.
For example: if A = IrA then R(u) = Au.
Gladwells proof examines solutions of the variational form of the problem on the
space Hq(D). H)(D) is the space of L2(D) functions, whose derivatives are in L2(D)
and the functions vanish on the boundary of D [? ] [? ]. The domain D is assumed
to be bounded and connected.
The unique continuation theorem [? ] states that if any solution u G i?o(0) of
the Helmholtz equation vanishes on a nonempty open subset of O then u = 0 in D.
The complete proof of the CNLT is in three parts. First a theorem by Courant
and Hilbert that there are at most n + r 1 sign domains for the eigenfunction un of
Xn, then a theorem by Hermann and Pleijel that shows for a connected domain this
number (n + r 1) can be improved to n, and finally the extension of this limit to
unconnnected domains.
The continuous CNLT serves as motivation for a discrete version. The proof for
this discrete version was not completed until 2001 in a paper by Davies, Gladwell,
Leydold, and Stadler [? ]. The discrete CNLT deals with solutions of Aun = Anun
and requires the following terminology and definitions.
Definition 2.15 Let A G Mm be a real symmetric matrix with nonpositive off diag
onal elements. The graph G = (V., E) associated with A is the graph with vertices V
numbered 1, ...,m and edges E where an edge exists if aij 0.
No restriction is placed on the diagonal elements of A. Examples of such matrices
would be the graph Laplacian L, L + aD where D is a diagonal matrix and the
37
negative of the adjacency matrix of a graph. Solutions of Aun = Anun are of the form
Ai < A2 < ... < \m.
The nodal set of an eigenvector un does not lend itself to an exact correspondence
with the nodal set and nodal domains in the continuous case. Instead, strong and
weak sign graphs are defined as follows.
Definition 2.16 (Davies /? ]) Letu be any eigenvector of A. A strong positive (neg
ative) sign graph is a maximal, connected subgraph of G, with vertices i E V and
u(i) > 0 (u(i) < 0).
Definition 2.17 (Davies /? ]) Let u be any eigenvector of A. A weak posi
tive (negative) sign graph is a maximal, connected subgraph of G, with vertices i eV
and u(i) > 0 (u(i) < 0).
We speak of a vertex as having a sign of positive, negative, or zero. We note
two key components of these definitions. The subgraphs are connected and they are
maximal in the sense that no vertices of the same sign as the subgraph can be added
to the subgraph and have it still be connected.
A few examples in Figure ??, that we derived, illustrate these concepts. The
eigenvectors of the graphs were computed using the unnormalized Laplacian with
constant edge weights. Only the signs of vertices are shown.
Given these definitions the discrete CNLT is formulated as follows.
Figure 2.18: Examples of strong and weak sign graphs
4th eigenvector
4 strong sign graphs
2 weak sign graphs
3rd eigenvector
3 strong sign graphs
3 weak sign graphs
Theorem 2.18 (Davies /? ]) If A is a symmetric matrix with nonpositive off diag
onal elements where the associated graph G = (V,E) as defined in ?? is connected,
the nth eigenvector un of A divides the graph into no more than n weak sign graphs.
The proof of the discrete CNLT proceeds as for the continuous CNLT and depends
on the ordering of the eigenvalues, the orthonormality of the eigenvectors, and the
minmax characterization of the eigenvalues. Instead of using the variational form
used in the continuous CNLT, an identity from Duval and Reiner [? ] is used. Also,
we do not have the unique continuation theorem in the discrete case but there is an
analog (see Davies et al. Lemma 3 [? ]).
It should be noted that much of this proof was done in the previous paper by
Duval and Reiner. However, they made the claim that the CNLT was valid for strong
sign graphs. Friedman [? ] had previously given examples showing this is not true.
Figure ?? shows two examples of this. The more complicated one on the left being
one we have constructed. Also, we note that the graph G must be connected and the
theorem applies to weak rather than strong sign graphs.
39
Figure 2.19: Strong sign graphs exceeding eigenvector number
5th eigenvector
8 strong sign graphs
2 weak sign graphs
2nd eigenvector
3 strong sign graphs
2 weak sign graphs
A few conclusions can be draw from the theory and an examination of
examples.
The number of weak sign graphs is always < the number of strong sign graphs.
The number of both weak and strong sign graphs can be < n.
If there are no zero vertices the number of strong and weak sign graphs are the
same.
If there are zero vertices the number of strong and weak sign graphs can still
be the same, see Figure ??.
The number of weak or strong sign graphs can decrease with increasing n, that
is, if 7i\ < n2 the number of weak (strong) sign graphs defined by eigenvector n2
can be less than that defined by eigenvector n\. This seems counter intuitive,
but is demonstrated by an example we present in Figure ??.
Finally, in the concluding remarks of Davies paper [? ] he states that the theorem
can be extended to the generalized eigenvalue problem (K XM)u = 0 where K is
40
Figure 2.20: Sign graphs decreasing in number
(o)
3rd eigenvector 4th eigenvector
3 strong sign graphs 2 strong sign graphs
3 weak sign graphs 2 weak sign graphs
positive semidefinite and M is positive definite. An outline of the proof is given in
that paper.
Fiedlers Theorem ?? and our extended Theorem ?? is in part a special case
of the discrete CNLT for n = 2 where the cut is with respect to r = 0. However,
note that Fiedlers theorem covers subgraphs induced by {i G V : Ui > r} and
{i Â£ V : Ui < r} where r > 0, which the CNLT does not address.
We modify the definition of weak sign graphs and produce a new ver
sion of CNLT that completely covers Fiedlers theorem as a special case.
Definition 2.19 An rweak positive (negative) sign graph is a maximal, connected
subgraph of G, with r > 0 and vertices i e V and Ui > r (Ui < r). (For an example
see Figure ??.)
Theorem 2.20 (Modified Discrete CNLT) If A is a symmetric matrix with nonpos
itive off diagonal elements where the associated graph G = (V>E) as defined in ?? is
connected, and r > 0, then the nth eigenfunction un of A divides the graph into no
more than n rweak sign graphs.
41
Figure 2.21: An example of rweak sign graphs
We can only present a proof of this theorem which is dependent on a conjecture
about the nature of rweak sign graphs. We divide rweak sign graphs into two types.
Definition 2.21 Let r > 0. A type 1 rwsg is a rweak sign graph R where there
exists a weak sign graph W such that W C R. A type 2 rwsg is a rweak sign graph
that is not a type 1 rwsg.
Consider a type 2 rwsg. Weak sign graphs have vertices i E V where either
Ui < 0 or Ui > 0 and are maximally connected. If a rweak sign graph has (1) both
negative and positive components, or (2) negative and zero, or (3) positive and zero,
there must be a weak sign graph that is a subset of the rwsg and therefore it is a
type 1. This is easily seen by considering the cases. If the rwsg satisfies case (1) then
if it is generated by < r there is a weak sign graph generated by < 0 contained in it.
If it is generated by > r there is a weak sign graph generated by > 0 contained in
it. Similar arguments apply to case (2) and (3).
This says that the type 2 rwsg must have either all positive or all negative
elements. Let us assume that they are all positive. Now, if the rwsg is generated via
> r then it has to contain the weak sign graph generated by > 0 and containing
these points. So, it is a type 1 rwsg.
42
We are left with only one possibility, a type 2 rwsg (with positive elements)
is generated by < r and has no zero elements. Since, our graph by assumption is
connected, there must be a Ui > r in every path connecting the rwsg to the rest of
the graph. A type 2 rwsg is a kind of local minimum in a graph. For the case of
positive elements, this is a kind of divot that does not cross zero.
Such a rweak sign graph can be constructed by considering arbitrary vectors.
Consider a line with four points and values assigned to the vertices of U\ = 1, 2 =
2, u3 = 1, 4 = 2. For r = 1.5 this has 2 weak sign graphs,{1} and {2, 3, 4}; but three
rweak sign graphs, {1}, {1, 2, 3, 4}, and {3}.
So, is this type of behavior possible with an eigenvector of matrix A? Since, this
can be thought of as a vibration within a major mode of vibration, that does not seem
reasonable. Numerous numerical experiments support this. I present the following
conjecture, which I emphasize is not proven.
This is used in a proof of ??, which I have identified as incomplete, because of
this.
Conjecture 2.22 If G is a connected graph, A a symmetric matrix with nonpositive
off diagonal elements and r > 0, then the nth eigenfunction un of A cannot generate
any type 2 rwsgs.
Flowever, this is easily proven for the case of the unnormalized graph Laplacian
L = D A.
Lemma 2.23 For the unnormalized graph Laplacian L = D A of a connected graph,
its nth eigenfunction cannot generate any type 2 rwsgs.
Proof: Let L = D A be the unnormalized Laplacian of a connected graph
G. Let u be the nth eigenfunction of L and let Ui denote the component of u
corresponding to vertex i. Let r > 0 and assume 3 a type 2 rwsg S of G. Since
43
G is connected, u\ is a constant vector and has only a single type 1 sign graph. So,
assume n > 1.
Without loss of generality let us assume tq > 0 for vertices in S. Then there must
be some vertex % e S such that tq < Uj V j adjacent to i and tq < Uj for some j. Let
i ~ j denote that vertex i is adjacent to vertex j.
Since Lu = Xu and A^ = 0 if i is not adjacent to j, we have
LijUj A Ui
DaUi ^ij^j A Ui
Hi X^ijUj A Ui
Aij(ui Uj) = A Ui
Now Aij > 0, Ui Uj <= 0, Ui Uj < 0 for some j, and Ui > 0. So the left hand
side of this equation is always < 0. But, L is positive semidefinite and n > 1, so the
right hand size is > 0. This is a contradiction to the assumption that there is a type
2 rwsg.
For the generalized problem (K XM)u = 0 mentioned by Davies, we see the
CNLT applies to Lu = XDu. Since D~l applied to L only changes the magnitudes of
the left hand side of this equation but not the signs, we can see from the proof of ??
that D_1L also cannot have any type 2 rwsg. Similarly, D~^LD~^ cannot have any
type 2 rwsg.
We provide the following proof for the Modified Discrete CNLT. Note that by
Lemma ?? the theorem is complete for the unnormalized Laplacian, but for the more
general class of problems of matrices with nonpositive off diagonal elements it is
dependent on the conjecture ??, which is not proven.
44
Proof: (For Modified Discrete CNLT. Depends on Conjecture. ) Let r > 0
and R be an rweak sign graph (rwsg) of type 1. Let \wsg\ be the number of wsgs
and \rwsg\ be the number of rwsgs. Suppose R\ and R2 are type f rwsgs and W
is a wsg such that W C R\ and W C f?2 Since R\ and R2 are maximal we must
have R\ = f?2 So \wsg\ > \type 1 rwsg. By the conjecture ??, we conclude that
\wsg\ > \rwsg\ and by the discrete CNLT the nth eigenfunction divides the graph
into no more than n rweak sign graphs.
We make an observation that the definition of strong sign graph excludes zero
vertices from the partition of the graph and they are mutually exclusive. Weak sign
graphs include the zero vertices, if any, and may not be mutually exclusive. This
means that the set of all strong or weak sign graphs may not constitute a partition
of V in the sense as usually stated in the combinatorial graph partitioning problem
[? ]. This being the case, partitioning objective functions (ratio cut, Ncut) lose their
meaning without some redefinition.
From a spectral clustering perspective this is not a restriction and we can ask the
question, Should we be partitioning based on strong or weak sign graphs?. Both
have an association with modes of vibration. And if we use strong sign graphs we may
be excluding vertices (the vertices with value zero), which does not seem desirable.
But, in either case our clusters will be connected which we intuitively think of as
desirable.
The use of rweak sign graphs has a further advantage. We have said previously
there is no good reason to include/exclude zeros from multiple clusters. But, when
dealing with a numerical solution to our eigenvalue problem, in general we do not
know any vertex eigenvector value to exact precision. If the error is < r where r
is small we would more properly define clusters based on rweak sign graphs. In
this case the overlap between clusters may be enlarged but our clusters will still be
connected.
45
We conclude this section with an observation about Fiedlers Corollary 2,2 [? ].
This concerns the degree of reducibility of subgraphs N = {i : Vi > 0} where v is the
nth eigenvector of a nonnegative, irreducible, symmetric matrix A. Fiedler proved
the degree of reducibility cannot exceed n2. The degree of reducibility plus 1 is the
number of maximally connected subgraphs in N. These are the weak sign graphs
generated by Vi > 0. To get all of the weak sign graphs we also have to consider those
generated by Vi < 0. This gives at most 2((n 2) + 1)) weak sign graphs for the nth
eigenvector of A. For n = 2 this gives at most 2 weak sign graphs. We know that for
this case there are exactly 2. For n > 2 we get an upper bound > n.
While this corollary is for nonnegative matrices we can see in his proof of Theorem
?? how to convert the matrix A to one that is nonpositive, symmetric, and irreducible.
So, we see from this that Fiedler did prove an early CNLT but with an upper bound
which has been improved on.
2.15 Summary
In this chapter, we reviewed the concept of spectral clustering and proposed an
alternative explanation (the vibrational model) of why it is an effective technique
for clustering. We provide an extension to one of Fiedlers theorems using similar
techniques as those used by Fiedler, which complements the vibrational model. We
examine discrete nodal domain theory and, by expanding the definition of weak sign
graphs, produced a modified CNLT that includes as a special case the CNLT and
Fiedler theorem. Its use for image segmentation is explained. The importance of
edge weights and a particular method for their determination is examined. Finally,
some numerical results are given.
46
3. DNA Micro Arrays
3.1 Introduction
In the previous chapter we introduce the ideas involved in spectral segmentation
and apply these to image segmentation. In this chapter we extend these to the
problem of microarray analysis. We advocate for a normalization rule for the analysis
of microarray data where the objective is recognition of genes with similar expression
levels in time dependent experiments.
Microarray analysis involves some practical difficulties that are not relevant in
image segmentation. The primary one being that if we are going to have sparse
Laplacians then we will almost certainly end up with disconnected graphs. We pro
pose a new technique for dealing with this.
We also alluded to using recursive bipartition in the previous chapter. We develop
and implement a technique for doing this and as a part of this propose a rule for
partition selection.
3.2 What is a DNA Microarray?
A DNA microarray is a chip containing oligonucleotide sequences used to detect
the expression level of genes. We now give a brief review of the biology inherent in
this statement.
Genes are DNA sequences of nucleotides A,C,T, and G which code for proteins.
The process of protein formation in a cell involves: transcription of DNA to mRNA
(messenger RNA) in the cell nucleus and then translation of mRNA to a protein via
a ribosome, see Figure ??. When DNA sequences are transcribed to mRNA this is
referred to as gene expression.
Gene expression is not a constant process. It can depend on the various states
a cell is in, such as metabolic cycles, disease response, and stages of embryonic de
velopment. The level of gene expression will vary. This can either be an increase
or decrease from normal levels, referred to as up regulation and down regulation.
47
Figure 3.1: Gene expression
DNAsense strand
antisense strand
mRNA
Protein
... ATA CGT
... TAT GCA
... AUA CGU
transcription
v translation
. . He Arg . .
Knowing which genes are expressed and by how much can aid in the understanding
of cellular processes and in diagnosis of disease.
However, measurement of the concentration of proteins in a cell is difficult. One
solution is to measure mRNA as a proxy. This depends on the assumption that most
mRNA created is actually translated to a protein.
Various DNA microchip technologies have been designed to perform this function.
They have the advantage of being able to measure the expression levels of thousands of
genes at the same time. For purposes of discussion we use the Affymetrix GeneChips.
A GeneChip microarray is a quartz wafer less than 1/2 inch square, on which millions
of oligonucleotide sequences have been assembled using a photolithographic process.
GeneChips are specific to the genome of a particular organism (Ecoli, Aribidopsis,
Homo Sapiens, etc. The oligonucleotide sequences consist of nucleotide chains of
length 25 (25mers). These chains are chosen to correspond to selected parts of genes,
and these genes (ideally) cover the entire genome of the organism. These 25mers are
complementary to the sense strand of DNA and correspond to mRNA sequences.
Some 1120 of these 25mers target a specific gene and are referred to as perfect
matches (PM). In addition, a 25mer corresponding to each of the PMs is constructed
with a mismatch (MM) in the 13th base pair.
48
The 25mers are built on the GeneChip in a pattern of dots as small as 11 microns
in size. Each dot is referred to as a probe and the set of probes for a single gene are
called probe sets. Also, each chip contains calibration probes. Each probe contains
hundreds of the 25mers with the same nucleotide sequence. The information about
where these probes are on the chip is contained in an Affymetrix hie with an extension
of .chp. Meta information about the probe set such as gene name is contained in a
(.gin) hie.
Omitting most of the biochemical details, a sample of mRNA is extracted from
the cells of an organism and is converted to a biotin labeled strand of complementary
RNA (cRNA). The cRNA is complementary to the 25mers on the GeneChip. When
the GeneChip is exposed to this sample a process called hybridization occurs. During
hybridization, complementary nucleotides line up and bond together via weak hydro
gen bonds. The greater the concentration of a particular mRNA strand, the greater
the number of bonds formed within one or more of the probes in the corresponding
probe set, see Figure ?? for a depiction of a GeneChip.
The number of those hybridized probes have to be counted. A fluorescent stain
is applied to the GeneChip that bonds to the biotin and the GeneChip is processed
through a machine that paints each pixel of the chip with a laser (a pixel is the
minimum resolution of the laser and is a small part of a probe) causing that pixel to
fluoress, the magnitude of which is measured. The results are stored in a (.dat) hie
containing the raw information about the GeneChip experiment. The pixel locations
and intensities are mapped and normalized into probe locations and these results are
stored in a (.cel) hie.
49
Figure 3.2: An Affymetrix GeneChip
Affymetrix GeneChip DNA Microarrays
Click to add title
Affymetrix GeneChip DNA Microarrays
GeneChip* Probe Array
Hybridized Probe Fcafuro
Image ot Hybridized Probe Array
tie
E*cti prctw iNtut c*xian
rwapfn of cm>i d > p*o>c
otgonudicMtprete
OwH 200.000 C*H*r#nl probM
eefnptomnUry to gtnefee
mtormtbon of tntofU
j A ^ Prow Joci o o o o o o i
A
Irrege Courier/: j^rt/TTidrix
A measurement by a single microarray is just a sample size of one. It is multiple
sample sizes that tell us biologically meaningful information. In general, microarray
experiments are of two kinds. (1) Investigating a process over time. For example:
We want to measure the gene response to a drug or a metabolic process. (2) Looking
for differences between states. For example: Normal cells versus Cancer cells, which
would have utility in disease identification.
50
In this work, our analysis is concerned with an experiment of the first type. This
involves multiple chips measuring responses in one or more organisms. Taken over
time or during identifiable metabolic stages.
3.3 How is Microarray data analyzed?
The first part is referred to as low level analysis and involves a three step process,
see Bolstad [? ] for a more detailed discussion.
1. The adjustment of probe level information in each array for background noise.
This is due to variations in the cRNA preparation process.
2. The normalization of data across multiple arrays. Arrays cannot be created
perfectly identical. Data distributions and calibration probes are involved in
this.
3. Summarization of probe information into an intensity (expression level) for each
probeset.
This gives an expression level for each gene in each array. The results of this
analysis are stored in a (.chp) hie.
At this point we have the following hies.
.cdf probe to probe set (gene) mapping
.gin information about the probe set
.dat pixel level data
.cel probe level data
.chp gene level data
The next level of analysis (referred to as High level) involves looking for rela
tionships between genes via various forms of cluster analysis (hierarchical, Kmeans,
principle component, spectral, etc.).
51
3.4 Normalization of data
Various statistical normalizations have been applied to the raw data to produce
gene expression levels. Before performing spectral clustering we propose normal
izing the expression vectors to one. Here is why.
Suppose, our microarray experiment examines a metabolic process at 5 distinct
points. The processing of the raw data has produced an array of data where the
rows are the genes and the columns (5 in number) are the expression levels. We
are interested in clustering genes according to how similarly they are upgraded or
downgraded. Now suppose we have two genes with expression levels (vectors) of
(1, 3,4,2,1) and (2,6, 8, 4, 2), see Figure ??. The expression levels of these two genes
show a strong correlation with respect to how they change through the metabolic
process; the second just has a magnitude twice the first. We would like to see these
genes clustered together.
Figure 3.3: Genes with correlated expression levels
But, looking at these within our 5D expression space, they may be far enough
apart that when we generate edge weights they are not connected. Geometrically, we
are saying that vectors close to the same straight line segment starting at the origin
52
are correlated and should be clustered together. To accomplish this we normalize
the data to vectors of length 1. This is equivalent to projecting the vectors onto a
unit sphere around the origin. An example of this in the 2D case is illustrated in
Figure ??. Without normalization (squares in green) there are three, maybe four,
groups. With this normalization (circles in red) there are two.
Figure 3.4: Effect of normalization on clusters
3.5 Disconnected Graphs
When defining edges between graph nodes (i.e. expression level vectors) we are
typically dealing with thousands of nodes. To control the sparsity of the resulting
Laplacian we usually have to limit the number of edges in some way. For images, we
control the number of edges by defining our graph as a lattice with edges defined by
a 5 point stencil.
The data for microarrays does not fit into any preconceived graph architecture.
The edge weight between two nodes is computed via a function such as an inverse
53
Gaussian distance which we describe in more detail later. Then we apply a cutoff
weight value. If the computed edge weight is less than this value the edge weight is
set at zero. This effectively says there is no edge between those two nodes.
Selection of the limiting value can produce a graph that is almost complete or
one that is so sparse few of the nodes are connected. Ideally we want a graph whose
Laplacian is sparse (so it is easier to solve), has enough edges to retain the inherent
structure of the data, and is connected (so we can apply the nodal theory previously
developed). The first two conditions can be met by examination of the sparsity of the
resulting adjacency matrix and adjustment of the cutoff weight. The last condition
(connectivity) is not as easy.
As the edges are reduced we inevitably create multiple components in the resulting
graph. The discrete nodal domain theorem required our graph to be connected.
These disconnected components need to be identified. For components of size one
(no edges connect to it) this is simple and efficiently accomplished by examination of
the adjacency matrix; i.e. all values in a row are zero. For components of a larger
size this is not as easy.
Graph algorithms such as a Breadth First Search (BFS) can be utilized to do this
(see [? ] page 534) but have run time of 0(H + Â£j). Generally there are a lot
more edges than vectors and this can be quite large (potentially 0(H2) ).
An alternate method to find components using spectral methods was proposed
by Ding, He, and Zha [? ]. Their method is based on examination of the eigenvectors
of the Laplacian with eigenvalue zero. We know the multiplicity of eigenvalue zero is
the number of components of the graph. Careful examination of these eigenvectors
can identify the individual components.
Since the eigenvalue problem is in general also 0(H2), is performance a factor
in preferring one over the other? BFS has the advantage if the number of edges is
not too large. However, when using a spectral method we are only solving for a few
54
of the smallest eigenpairs and so expect to do better than 0(H2). So, there is no
clear cut answer.
We propose an alternate method to that of Ding, He, and Zha, which is also
spectral in nature. It is easy to integrate into the recursive bipartitioning algorithm
we introduce later. This method starts with the addition of a connecting node to the
graph and the definition of an edge of a small weight between the connecting node
and every other node. Our graph is now connected, and this small perturbation to
an unnormalized Laplacian is subject to an exact solution in terms of the original
Laplacian for the problem of Lv = Xv. We next give the solution to this problem in
the following theorem, and then apply it to an algorithm for extraction of components.
Theorem 3.1 Let L be the unnormalized Laplacian of a graph G with n vertices, not
necessarily connected. Let G be a graph formed by the addition of a vertex to G and
an edge between this vertex and every other vertex. Let all of these edges have vjeight
b > 0. Let L be the Laplacian of G and let In be the vector of all ones of size n. Then
the eigenpairs of Lw = Aw can be characterized as follows:
L (0,lri+l),
2. (6, [ o ]) where Lv = Ov and v yt 0 and Yhvi = 0,
3. ((n+l)b, [!"]),
4. (A + b, [o ]) where Lv = Xv and A > 0.
Proof:
By construction of G it is connected and so (1) follows.
We have L = [L^Tb ^b\ where B = bl,n and Db is the diagonal matrix with b on
all diagonals. Let v be a solution of Lv = Ov where v yt 0 and the sum of the entries
of v is zero. Then L[ o] = = [o'] = & [ o ] which proves (2). Note that this
case can only occur when G is disconnected.
55
We also have L[\] = [] = UK?n)] = (n + 1)6[which
proves (3).
Suppose Lv = Xv and A > 0, then by the orthogonality of n to 1U we have
L[l] = [+%'] = [At,oto] = (A + 6)[g] which proves (4). In addition, L has an
orthogonal set of eigenvectors OE, and we can choose v to be from this set. In this
case, all [o] are mutually orthogonal.
Now, we have only to show that all possible eigenvalues have been accounted for
and their eigenvectors are orthogonal.
If G is connected then by (1),(3) and (4) we have defined n + 1 eigenpairs. The
eigenvectors v in (4) are linearly independent and orthogonal to So, (1) and (4)
define n linearly independent eigenvectors. For the same reason, = 0 and
we have n + 1 linearly independent eigenvectors. Since G is connected, (2) does not
apply.
If G is not connected then the multiplicity of the eigenvalue zero is the number of
components of G. Let m be the number of components of G. For any two components,
say Gj and Gk of G, we can define a vector v such that Lv = 0 and = 0. We take
Vi = 1 if i E Gj and ry = a if i E Gk where a = Gj/Gfc. We can construct exactly
m1 such vectors that are linearly independent. Now, the eigenvectors v in (2) that
we have constructed are formed from eigenvectors of OE, that is v = av\ + [3v2 where
V\ G OE and n2 G OE. So, the eigenvectors of (2) are orthogonal to those of (4) and
are orthogonal to those of (1) and (3) since = 0.
So, by (1),(2), and (4) we have 1, m 1, and n m independent vectors for a
total of n and then (3) adds the last.
56
Corollary 3.2 If Lv = bv then = 0 and all values in v for any component of
G will be the same.
Proof: We note that our choice of eigenvectors for (2) in the proof of ?? span the
eigenspace of eigenvalue b. Let us denote these as wl. Any other eigenvector in the
eigenspace will be of the form v = 'YhiPiW%. We have Yljvj = YliPiYljwj = 0.
Finally, since the values of components in any w1 are the same by construction, this
must also be true in v.
The importance of this corollary is that an eigenvector of eigenvalue b must be
a Fiedler vector of L since 0 < b < A + b. Choosing any one of these vectors to use
to create a bisection we see that all values in v for any component of G must be the
same and this implies a component cannot be split into separate clusters by finding
an rweak sign graph. As such, a Fiedler vector of L will provide a way to separate
components of the graph G. Recursively bipartitioning naturally separate out all of
the components.
The examples in Figure ?? illustrate these statements.
Figure 3.5: A connected and unconnected graph and the effect of a connecting node
(dum) on their eigenpairs
1 2 4
1 2 3 1 N *
.3 s
\ i \ \ 4 9 s 5
v / \
s / , ^ 
7 / ^  6
dum i (zz:.. , >
dum
node eigenvectors
1 1 0 1 1 1 1 0 1
node e i genvec t ors 2 1 0 1 1 0 0 0 3
1: 1 1 1 1 3 1 0 L 1 1 0 0 1
2: 1 1 0 2 4 1 0 1 1 0 1 0 1
3: 1 1 1 1 5 1 1 0 1 0 0 1 0
dum: 1 3 0 0 6 1 1 0 1 0 0 0 0
7 1  2  4 1 0 0 0 0
eval: 0 4b 1+b 3+b dum 1 0 0 7 0 0 0 0
eval: 0 b b 8b 1+b 1+b 2+b 4+b
57
For purposes of clustering we want the option of using the normalized Laplacians:
LSym = D~^LD~^ and Lrw = D~lL. Theorem ?? does not apply to them and
deriving exact solutions to the perturbation of L by a connecting node has proven to
be elusive. Perturbation of the algebraic connectivity (2nd eigenvalue) is addressed
in [? ], but the issue of perturbation of the Fiedler vector is more difficult. But, from
numerical experiments we can make the following observations.
Assume b (the perturbation weight) is small relative to the minimum of the
weights in L. For the case of a connected graph, let v be the Fiedler vector of
the unperturbed problem Lrwv = Xv. The Fiedler vector of the perturbed problem,
Lrww = Apw or equivalently Lv = ApDv, is of the form w = where vector 8 is
small relative to the values of v and Xp ~ A. The perturbed Fiedler vector of G with
the connecting node removed should then give us the same partition as the original
vector.
For the problem of the symmetric Laplacian, Lsym = D~^LD~^, we have similar
results since eigenvalues of Lrw and of Lsym are the same, and if v is an eigenvector
of Lrw then w = D^v is an eigenvector of Lsym. Note that one implication of this is
that the eigenvector of eigenvalue zero is no longer a constant vector.
For the case of a graph with multiple components, the Fiedler vector of the
perturbed matrix has values distributed according to the components of the graph.
For the Lrw Laplacian, the values of any component will be nearly a constant. Each
component may or may not have the same constant, but the eigenvector will include
both positive and negative values and so a partition of the vector yields nonempty
partitions where a component cannot be split into separate partitions (but could be
in both).
58
We can now define the following algorithm for identification of components and
recursive bipartitioning. This is valid for the unnormalized Laplacian and we have
some intuition based on numerical experiments that it is valid for the normalized
Laplacians.
Step 1 Reduce the adjacency matrix of G by all singleton components.
Step 2 Select a value for b.
Step 3 Add a connecting node to G as described above to create G.
Step 4 Construct the Laplacian and solve for the first few eigenpairs.
Step 5 Excluding the zero eigenvalue identify the next smallest eigenvalue.
Step 6 Use the eigenvector of this eigenvalue to partition the graph.
Step 7 If the partition results in an empty partition then exclude that eigenvalue,
and using the next smallest eigenvalue, goto Step 6.
Step 8 Using these components, repeat the process starting with step 3 until a suf
ficient number of clusters has been identified.
Comments:
If we only want to identify all components then we can use the unnormalized
Laplacian and stop in step 5 when the eigenvalue is not b.
In Step 1, we choose to eliminate all singletons. This is not necessary, but is
done because it is easy and cheap (the diagonal value of a singleton is zero) and
the singletons do not carry much information about their relation to other vectors
(genes). In the program to follow we just put them all in a partition by themselves.
This can significantly reduce the size of the Laplacian and speed up the analysis.
We choose b to be .001 times the weight limit. Since, all edge weights must be
greater than this limit, this makes b << all other edge weights.
59
One can ask,Why bother with the connecting node?. Just examine the zero
eigenvectors of L. There are some practical reasons: If the graph G is connected we
do not know this and we still have to examine the zero eigenvector. Ideally this would
have an eigenvector with identical values or for the symmetric Laplacian one which
has all positive values. Then we know the graph is connected. Numerical solutions
give eigenvectors that are not exact. When the solutions are normalized and the
dimensionality of L is very large, this can result in values close to zero including both
positive and negative values. This results in a bad partition of the graph. Adding a
connecting node allows us to identify and ignore the zero eigenvector.
In step 6 we partition according to the rweak sign graphs. If r is too large
this can result in two identical partitions. This is a major type of error and should
terminate processing less recursive iterations endlessly replicate the same cluster.
Since we partition by rweak sign graphs, the resulting partitions may intersect. As
discussed earlier, we consider this an advantage of the technique.
3.6 Recursive bipartitioning
In the previous section we developed an algorithm for dealing with graphs with
multiple components. This algorithm not only deals with multiple components but
also handles partitioning of a connected graph. We want to incorporate this algorithm
into another for recursive bipartitioning applied to our original graph. We have two
problems to solve. (1) Which partition do we apply the algorithm to next?, and (2)
When do we stop?
To answer the first question we need a measure of the internal connectivity of a
graph. We say a graph is fully connected if the graph is complete and has the same
weight on all of the edges. In this case partitioning of it would not be meaningful. All
nonzero eigenvalues are the same and every partition creates another set of complete
graphs. On the other hand a graph with no edges has no internal connectivity.
60
We developed the following measure which reflects these two extremes.
Definition 3.3 The internal connectivity of a graph is n^li) > where G = sum of
weights of all edges ofG.
This is based on the observation that the number of edges in a complete graph
with n vertices is n(n l)/2. For convenience we have dropped the constant factor
2 in the definition of connectivity.
We select the next cluster to partition by choosing the one with the lowest internal
connectivity. Note that while this is motivated by mathematical considerations, it is
in essence an adhoc procedure, and if there are large variations in edge weights it can
produce counter intuitive results.
Figure ?? illustrates this process. The example is of a graph formed from 4 normal
distributions of data. These are centered at positions (2,3), (8,9), (2,9), and (5,6).
Weights are generated via a gaussian weight function with a minimum weight. This
results in some isolated vertices (singleton clusters, in dark blue) and one cluster of 2
vertices (in red). Recursive bipartitioning roughly identifies the original distributions.
Note that there may be vertices in more than one cluster but these are not identified
in the plots generated here.
The solution to the second question (about stoping criteria), is also adhoc in
nature. We could choose to proceed until all partitions achieved some minimum con
nectivity. We could also choose to proceed till a fixed number of clusters is produced,
or until we achieve a full hierarchical breakdown to one or two vertices per cluster.
In either case we have to make an apriori judgement about these values.
Some authors have suggested using eigenvalue jumps to determine the number
of clusters to find [? ]. This works well if we are looking for disconnected or nearly
disconnected clusters. For microarrays, however, we do not expect the resulting graph
to be that simple. We choose here to explore the data by looking for 16 clusters. This
choice is arbitrary, and would only be a starting point for analysis. After examination
61
Figure 3.6: An example of recursive bipartition
4 dusters :r 5 dusters
*
of the results, this choice would be modified for succeeding runs.
3.7 Weight construction
In the software, several techniques are implemented to compute edge weights.
Inverse Gausian distance is computed via the function e~sc*d2 where sc is a scaling
factor which defaults to 1 and d is the Euclidian distance between two gene expression
vectors; i.e. the norm of the difference between the two vectors. This is then limited
to zero by a radius value; i.e. if the weight is less than the radius then the edge weight
is zero. This is done to introduce sparsity to the Laplacian.
62
A fully connected (complete) graph can be produced. Here all of the edge weights
are one. Edges can be predefined and in this case the edge weights are one. These
techniques are implemented to analyze certain test graphs.
The technique we use for microarray data is a Euclidean distance limited by a
radius value. When the distance is greater than the radius value the weight is zero,
otherwise it is one. This method is chosen because we will be projecting the gene
expression vectors onto the unit sphere and we do not have to handle large variation
in distances. We note that angles could be used here as a measure of distance.
3.8 Toy experiments
In this thesis and in many of the papers referenced, so called Toy data is used to
analyze the algorithms presented there. The use of this data is done for two reasons.
1. validation of techniques
2. validation of software
Toy data provides test examples where the clustering results are well defined, or
at least roughly defined intuitively. For Real world data the clustering may not
be easily recognized. If it could be there would be no need to do the analysis. We
can also construct Toy data which should represent specific situations the software is
supposed to handle.
The Toy data gives us an objective way to test our techniques and their imple
mentation in software. This is not always as simple as it sounds. Debugging of the
software for recursive bipartition revealed several problems that required adjustments
to the algorithm and enhancements to the theory supporting them. Software test
ing has thus been an integral part of the overall process of doing the mathematical
research.
Once the Toy data has served its purpose, we can now use the software to examine
real world data. Some of the mathematics reviewed and developed here is fairly
63
advanced and some rather simple. The ultimate purpose of what we have developed is
not just an exercise in pure mathematics, but hopefully has applicability to real world
data and problems. In the next section we apply these techniques to real microarray
data.
3.9 Analysis of real microarray experiments
We analyzed yeastvalues taken in a time sequence over a metabolic shift from
fermentation to respiration. The source of this data is the Matlab demo Gene
Expression Profile Analysis from the Bioinformatics Toolbox. The data represents
a set of 7 microarrays where all of the low level analysis has occurred prior to ours
to produce a set of gene expression levels. Further, genes with very low expression
levels have been filtered out. This leaves 822 genes represented in an 822x7 matrix.
One of the problems we have to face for multi dimensional data is how to represent
the results. Toy experiments were all 2D so these had a natural way to graph clusters.
We could do something similar for 3D but with difficulty. Microarray data usually
has dimensions larger than this.
Listing the genes in groups is necessary because this is what a biological scientist
is going to need to evaluate the cluster. This is always possible by cross referencing a
genes matrix row number to a meta data list of gene descriptions. The listing of the
clusters for this experiment is given in Appendix ??.
We also desire a graphical way to represent the results. We want to visually
confirm how good the cluster is and identify interesting characteristics of the data.
The technique we choose is to graph a genes expression level (unnormalized) against
its microarray number(1 to 7 for our test data). All genes within the same cluster
are placed in the same graph. Multiple graphs are produced; one for each cluster,
see Figure ??. The set of genes with no edge connection to any other gene are not
represented. The number of entries in the cluster and the clusters connectivity (see
64
Figure 3.7: Microarray data clustering result
2 1.000000
3 1.000000
12 0.121212
31 0.150538
85 0.206162
2 4 6
23 0.189723
6 0.200000
T
22 0.129870
35 0.159664
2 4 6
8 0.250000
2 4 6
269 0.118127
2 4 6
143 0.140451
Definition ??) is printed above each graph.
This analysis is performed using the Matlab eigs function solving the Lv = Xv
problem, vectors normalized, Euclidian distance function with a radius cut oh of 0.2,
and a value of r = .0000001 used to compute rweak sign graphs.
3.10 The Software
The software is implemented in Matlab. The program to perform the analysis is
coded as a function and listed in Appendix ??. A program to invoke the function for
the examples in this thesis is listed in Appendix 2.
The function arguments consist of 4 fixed parameters and a variable argument list.
These are an array of the vertices to cluster, the number of clusters to produce, a string
defining the computational options, and a string defining the graphs to produce. The
variable arguments define Radius, Sigma, Scale, Mass, Edges, and Rweak the value
65
to use for determination of sign graphs. The first 2 arguments are required. The
remainder have defaults if not entered.
The function returns a cell array where the clusters are defined and a vector which
records a connectivity value for each cluster. The first cluster in the cell array is all
of the isolated vertices. A vertex can appear in more than one cluster.
Two Methods for determining clusters are provided in the MATLAB program;
recursive bipartition and kmeans. Kmeans clustering is described in Luxburg [?
] and involves applying the kmeans algorithm to the row vectors of a matrix of
the first k eigenvectors of the graph Laplacian. It works well provided clusters are
nearly disconnected and there are not lots of isolated vertices. It produces a classic
partitioning where a vertex can only be in a single cluster.
Parameters and their options are detailed in the program comments in Ap
pendix A.
The computation of eigenvalues is based on the Matlab eigs or eig function. The
eigs function should be used for large sparse matrices. The eigenvalue problem can
be rcut Lv = Xv, mass Lv = XMv, or ncut Lv = XDv.
The organization of the program is as follows:
Analyze parameters, set defaults,
Define weight matrix
Define connecting node edge weight
Do kmeans clustering if requested
let k be the number of clusters to solve for
solve the eigenvalue problem
identify the eigenvectors of the k lowest eigenvalues (not including zero)
perform kmeans on the row vectors of these eigenvectors
66
define these clusters in the cell array partition
Do recursive bipartitioning if requested
initialize the cell array partition with one entry containing all vertices
split out the unconnnected vertices into their own partition entry and
define its connnectivity as 999
select the other partition entry for processing
WHILE the number of partition entries is less than the number requested
do
bipartition the selected partition entry based on hedler vector
* solve the eigenvalue problem
* identify lowest nonzero eigenvalue and its associated eigenvector
* B: split the partition into two clusters based on the rweak value and
the eigenvector
* if both partition are identical to the original then error the program
* if either partition has zero entries then skip to next eigenpair and goto
B:
split the selected partition into two entries based on bipartition results
compute the connectivity of all partition entries
select the partition entry with smallest connectivity
end WHILE
Various plots are produced throughout the program as requested in the Plot
parameter.
67
4. BLOPEX
4.1 Introduction
The software package Block Locally Optimal Preconditioned Eigenvalue Xolver
(BLOPEX) was introduced in 2005. It has recently been upgraded to Version 1.1.
BLOPEX implements the Locally Optimal BLock Preconditioned Conjugate Gradient
(LOBPCG) method [? ] for solution of very large, sparse, symmetric (or Hermitian)
generalized eigenvalue problems. Version 1.1 adds support for complex matrices and
64 bit integers.
The generalized eigenvalue problem Ax = ABx for large, sparse symmetric and
Hermitian matrices occurs in a variety of traditional problems in science and engi
neering; as well as more recent applications such as image segmentation and DNA
microarray analysis via spectral clustering. These problems involve matrices that can
be very large (dimension > 105) but fortunately are sparse and frequently we only
need to solve for a few smallest or largest eigenpairs. This is the function of the
BLOPEX software.
BLOPEX has been previously described in [? ] and [? ] This chapter seeks to
give a more detailed description of the software than has previously been supplied.
This software includes not just the implementation of the LOBPCG method but
interfaces to independently developed software packages such as PETSc 1, Hypre 2,
and MATLAB 3.
The remainder of this chapter is organized as follows. We start in Section ?? by a
general discussion of the problem, review some of the other software available for the
problem in Section ??, and present the LOBPCG method in Section ??. Section ??
then covers the BLOPEX software. BLOPEX is available via a new Google Source
1PETSc (Portable Extensible Toolkit for Scientific Computation) is developed by Argonne Na
tional Laboratory
2Hypre (High Performance Preconditioners) is developed at the Center for Applied Scientific
Computing (CASC) at Lawrence Livermore National Laboratory
3MATLAB is a product of The MathWorks
68
site and this is covered in Section ??. We discuss the environments BLOPEX has
been tested on in Section ??, give some numerical results in Section ??, and wrap up
in Section ??.
4.2 The Problem
We seek solutions to the generalized eigenvalue problem Ax = ABx where A and
B are real symmetric or complex Hermitian matrices. B must be positive definite. A
and/or B may be defined as a matrix or be supplied in functional form.
Note that the requirement that B be positive definite implies all eigenvalues are
finite and the symmetry of A and B imply all eigenvalues are real.
We emphasize that A and B need not be supplied in matrix form, but can be
defined as functions.
Problems of this type arise from discretizations of continuous boundary value
problems with selfadjoint differential operators [? ]. We often only need the m
smallest eigenvalues or eigenpairs; where m is much less than the dimension of the
operator A. We dont usually need solutions to high accuracy since the discretization
of the problem is itself an approximation to the continuous problem.
The large dimensionality of the problem precludes solution by direct (factoriza
tion) methods. Thus the need for iterative methods. But iterative methods can have
slow convergence and so we require a preconditioner [? ]. The choice of precondi
tioner is separate from the choice of iterative method.
We use the LOBPCG iterative method (see Section ??). Preconditioners are
supplied to BLOPEX by the calling programs. This and the interfaces to PETSc and
Hypre make possible the use of high quality preconditioners.
4.3 Current Software
There are a number of existing software packages for solutions of large, sparse
eigenvalue problems. We discuss two of these that have been previously described in
ACM transactions.
69
Anasazi [? ] is a package within the Trilinos framework [? ], written in C++
which uses object oriented concepts, implements 3 block variants of iterative methods:
LOBPCG, Davidson, and KrylovSchur.
Anasazi solves for a partial set of eigenpairs of the generalized eigenvalue problem.
It uses a preconditioner which must be supplied by the user. Starting with Trilinos
9.0 there is interoperability with PETSc.
Preconditioned Iterative Multi Method Eigenvalue (PRIMME) [? ] was released
Oct 2006. It implements the JDQMR and JD+k methods to solve for a partial set
of eigenvalues of the problem Ax = Xx. It does not currently handle the Generalized
Eigenvalue problem. Written in C it has an emphasis on being user friendly, by
which is meant a minimal parameter set can be used to obtain solutions without
extensive tuning or knowledge on the part of the user. More sophisticated users can
utilize an extended set of parameters to tune the performance.
PRIMME can handle real and complex numbers and orthogonality constraints.
The preconditioner is supplied by the user. Interfaces to PETSc and Hypre are not
mentioned and presumably not available.
Neither PRIMME nor Anasazi mention interfaces to Matlab.
By contrast BLOPEX
handles both real and complex numbers
is written in C
has similar parameters as PRIMME
has interfaces to PETSc, Hypre, Matlab, and stand alone serial interfaces
allows for use of high quality preconditioners via PETSc and Hypre
70
4.4 LOBPCG
To solve for a single eigenpair of the problem Ax = ABx the LOBPCG iterative
method can be described as a 3 term recurrence formula as follows:
x{i+l)
w(t) +7Wx(t1},
(4.1)
where
wb) = 7yW)rW = Ax^ A ^Bx^\
\{i) = ^x^\Ax^)/(Bx^\x^) the Rayleigh quotient, and
T is a preconditioner for the matrix A.
The values rW and in (??) are chosen to minimize A(t+1) within the subspace
span{w}. This minimization is done via the RayleighRitz method as
described in Parlett [? ]. The preconditioner T should be linear, symmetric, and
positive definite.
Use of and as basis vectors for span{w} can lead to ill
conditioned Gram matrices [? ] in the RayleighRitz method, because can be
very close to x(t~1\
The effect of basis vectors is a nontrivial problem discussed in [? ]. An improve
ment on the basis used in (??) was proposed by Knyazev [? ]. This replaces
with pW as follows:
x(m) = w{t) +7(V,
(4.2)
and for the next iteration
p(i+i) = w(i) __ /yWpM and the other terms are as in (??).
71
In this case, it can be shown that span{w^, } = span{w^
So, the two iterative problems (??) and (??) are mathematically equivalent but (??)
is more numerically stable.
When more than one eigenpair is to be computed, a block version of LOBPCG is
used. To compute the m smallest eigenpairs we apply the RayleighRitz method to
the subspace spanned by {x^\w^\p^\ ,Xm ,Wm ,Pm} This gives m Ritz vectors
x^+1') as estimates for the m smallest eigenvectors with estimates for eigenvalues given
by their Rayleigh quotients.
We note that the choice of block size m is in part problem dependent and in part
a tuning consideration. This is discussed in some detail in [? ].
4.5 BLOPEX software
The BLOPEX software provides functions to the user for solution of eigenvalue
problems as described in Section ??. The software external to BLOPEX which the
user writes must do the following:
setup matrices or functions for A and B
setup the preconditioner T
provide functions for matrixvector operations
call LOBPCG solver in BLOPEX
BLOPEX software is written in C. C has been chosen since it provides for ease
and versatility of interfacing with a wide variety of languages including C, C++,
Fortran, and Matlab. This makes BLOPEX highly portable.
BLOPEX can be logically separated into two parts. The first part implements
the LOBPCG algorithm. We refer to this as the abstract code. It contains the
functions called by the user as well as a number of utility functions needed internally.
We discuss this in detail in Section ??.
72
The second part is code which provides functions for interfacing with software
packages such as PETSc, Hypre, and Matlab. One challenge for all eigensolver soft
ware is the necessity of supporting multiple diverse formats for sparse matrices and
vectors. Functions for accessing matrix (vector) elements and doing matrix vector op
erations are inherent in the calling software and BLOPEX has to access these routines.
This access occurs via the specific interface functions. We describe the interfaces in
Section ??.
4.5.1 Structure
Figure ?? shows a high level overview of BLOPEX and how it fits with the calling
software. The driver is software written by the user which calls the LOBPCG solver
in BLOPEX abstract. The driver can be written in numerous external environments
such as PETSc, Hypre, Matlab, etc. BLOPEX provides a number of sample drivers
which are described in Section ??.
The driver uses macros, commands, or functions from its environment to define
matrices, vectors, and matrix/vector operations. These are communicated to the
LOBPCG solver via parameters, (see Section ??). To access these external environ
ment matrix/vector routines, BLOPEX supplies interfaces.
These interfaces package data as multivectors (see Section ??) to pass to the
LOBPCG solver and provide functions to convert parameters in formats defined
within BLOPEX to parameters specific to the external environment functions.
BLOPEX requires LAPACK functions or equivalents to perform orthonormal
ization and solve the generalized eigenvalue problem for the Gram matrices in the
RayleighRitz method. These can be the standard LAPACK functions dsygv and
dpotrf for real, or zhegv and zpotrf for complex numbers. Equivalents from the
ACML, MKL, or ESSL libraries can be used. The addresses of the routines to use are
passed by the Driver to BLOPEX. If the parameters are not the same as the standard
73
Figure 4.1: Structure of BLOPEX functions
LAPACK funtions, then a function must be coded to do parameter conversions.
4.5.2 Abstract Code
This is the solver. It consists of three modules lobpcg.c, multivector. c, and
f ortran_matrix. c. Small matrices and vectors that arise as result of RayleighRitz
Method are kept internal to the abstract code in Fortran column major order and
processed via routines in fortran_matrix. c.
Two routines in lobpcg.c are callable by Drivers. lobpcg_solve_double and
lobpcg_solve_complex. These routines setup a function interpreter which is a list
of function addresses. The functions in multivector. c and f ortran_matrix. c
are specific to double (real) or complex numbers. These two functions then call
lobpcg_solve where the LOBPCG algorithm is implemented. As a result BLOPEX
does not have to be compiled specifically for complex or real numbers.
The functions in multivector, c provide for conversions and operations between
matrices in Fortran format and multivectors with pointers to matrices and vectors in
the external environment format. These functions in turn call interface functions to
access these matrix/vector operations.
74
4.5.2.1 Algorithms
The steps of the Block LOBPCG algorithm as implemented in BLOPEX follow
with detailed comments on various steps appearing afterwards. We list only the ma
jor parameters here but we give a complete list and more explanation in Section ??.
The operator denotes either matrix/matrix, matrix/vector, or scalar/matrix mul
tiplication, as in Matlab.
Input:
X m starting vectors (in block form)
A matrix or address of function to compute A* X
B matrix or address of function to compute B * X
T preconditioner (operator, and data)
Y constraint vectors (in block form)
Output:
X m computed eigenvectors
A m computed eigenvectors
Algorithm:
1. Apply constraints Y to X
2. Borthonormalize X
3. [C, A] = RR(A, B, X) Apply RaleighRitz Method
4. X = X C Compute Ritz vectors
75
5. J = [1, m\ Initialize the index set of active residuals
6. for k= l, Max Iterations
7. Rj = B Xj A A Xj Compute Residual vectors
8. Compute norms of residual vectors
9. Check residual norms for convergence
10. Exclude converged vectors from index J (soft locking)
11. if all vectors have converged then stop
12. Wj = operatorT(Rj, dataT) Apply preconditioner to residuals
13. Apply constraints Y to Wj
14. Borthonormalize Wj
15. if k > 1
16. Borthonormalize Pj
17. basis for RR is S = [X Wj Pj]
18. else
19. basis for RR is S = [X Wj]
20. end
21. [G, 0] = RR(A, B, S) Apply RaleighRitz Method
22. C = G(1 : m,:) Get columns of G corresponding to X
23. A = 0(1 : m) Get eigenvalues corresponding to X
24. if k > 1
76
25.
26.
27.
28.
29.
30.
Partition C
Cx
Cw
CP
P = Wj*Cw + Pj*Cp
according to columns of X, Wj, and Pj
else
Partition C
P = Wj*Cw
Cx
Cw
according to columns of X and Wj
end
31. X = X *CX + P
32. end
Comments:
(??) Constraints Y are previously computed or known eigenvectors. For example,
the vectors of all ones is an eigenvector of the smallest eigenvalue of a graph Laplacian.
So we can choose this as a constraint and then force all vectors of X to be Borthogonal
to Y. In this case LOBPCG solves for the next m smallest eigenvalues. We apply
constraint Y to X via replacing X with the difference of X and the Borthogonal
projection of X onto the subspace generated by Y; that is X = X Y (YT B *
Y)~l ((B Y)t X).
(??) Borthonormalize X using Cholesky factorization; that is R = chol(X' B *
X),X = X R~l. Block vector X must be composed of linearly independent vectors
to being with else orthonormalization will fail.
(??) We adopt the following notation: [G,A] = RR(A,B,S) to specify the
RayleighRitz method which finds eigenvectors G and eigenvalues A of the generalized
77
eigenvalue problem (STAS)G = A(STBS)X. STAS is referred to as the gramA ma
trix and STBS as the gramB matrix. S is given in block form such as [X W P] where
X, W, and P are block vectors. S forms a basis for the subspace to find estimated
eigenvectors in. Note that the eigenvectors G are chosen to be Borthogonal.
(??) Note that the Borthonormality of the Ritz vectors X is preserved.
(??),(??),(??) Initially all residuals of X are active. As we iterate, this changes as
their norms converge towards zero. When one of the vectors in X converges to within
a prescribed tolerance it is removed from the index. This we call soft locking. All
of X remains as part of the basis for the next iteration. However, only the residuals
(denoted by Rj and CG step vectors (denoted by Pj) are used to create the new
subspace. All of X is retained on the expectation that converged vectors in X will
continue to improve.
(??) To apply the preconditioner to Rj, a call to a routine T provided by the
Driver is done. This function is usually coded in the Driver and must have parameters
of the form void operatorT(void dataT, void R, void W) and must be
able to handle W and R as block vectors. The code for operatorT is highly dependent
on the Drivers environment.
(??),(??) We transform the preconditioned residual block vector Wj to be B
orthonormal to the constraint Y and the vectors of Wj to be Borthonormal to each
other.
(??) For k = 1 we do not have the first Pj yet. It is first computed in (??) and
in (??) there after.
(??),(??) For the 1st iteration the basis for RayleighRitz is S = [X Wj]. Note
that X and Wj are Borthonormal with respect to their own vectors, but X is not
necessarily Borthonormal to Wj. Consequently, the symmetric Gram matrices take
78
the form
and
gramA = STAS
A XTAWj
. WjAWj
gramB = ST BS
I XtBWj
I
where the dot notation indicates the symmetry of the matrix. Note that XTAX = A
since X is Borthonormal.
For subsequent iterations the X, Wj, and Pj blocks are Borthonormal within
their blocks but not between them. So the Gram matrices are
A XTAWj XTAPj
gramA = S1 AS
WjAWj WjAPj
Pj APj
and
gramB = ST BS
IX1 BWj XtBPj
I WjBPj
Finally, we note that computations for the components of the Gram matrices are
optimized in the code, so they are not computed directly from the basis vectors. For
clarity these details have been omitted.
(??), (??) Computation of the block vector P corresponds to pb+b jn equation
??. Note that P has the same number of rows as X.
(??) Finally, we compute a new X which is just the Ritz vectors X = S C.
4.5.2.2 Data Types
BLOPEX defines several data types which are described here.
To deal with block vectors in various diverse formats BLOPEX defines the struc
ture mv_MultiVector. This contains the following fields.
79
A pointer to another structure that defines the vectors. How these vectors
are formatted depends on the interface. For example in the PETSc interface
it points to another structure mv_TempMultivector which contains fields that
define the number of vectors, active mask and a pointer to an array of point
ers, each of which point to a Vec PETSc variable. Interfaces for Hypre, Se
rial, and Matlab have different but similar structures. Creation of variables of
mv_MultiVector type is done by routines in multivector. c.
An integer variable that is set to 1 when data for the pointer defined above is
allocated. This is to aid in deletion of the multivector when we are finished
with it.
A pointer to a structure mv_Interf acelnterpreter which is a list of function
addresses. These pointers are set to the appropriate interface functions.
An mv_Multivector variable such as parameter X then encapsulates the data
and interface functions to manipulate the data.
The second data type is to deal with matrices. This also depends on the interface.
Matrices do not have to be manipulated like block vectors. Typically they are in
volved in some matrix vector operation and just need to be passed to the appropriate
interface routine which is specified via the interpreter in the associated block vector.
So it is possible we only need to pass a pointer to the matrix as a parameter in
the form it appears in the external environment. This is what is done for the Matlab
interface. Other interfaces take a different approach. The PETSc interface includes in
a multivector like structure, variables for A, B, and KSP solver and then passes this
as the parameter for both A, B, and preconditioner data. The operatorA, operatorB,
and operatorT functions then use the appropriate variable.
Internally the BLOPEX abstract code uses data type utilities_FortranMatrix
to define Fortran style matrices. This includes variables for global column height,
80
column height, row width, pointer to position (1,1) of the matrix, and an ownsData
variable. The global height can be different from the current height because we overlay
in memory blocks of the Gram matrices to optimize their computation.
For BLOPEX version 1.1 we added a type for complex numbers which we call
komplex. Since there is no standard between C compilers for complex numbers we
chose to implement our own type along with routines for the basic math functions
of addition, subtraction, multiplication, and division. This maintains portability of
BLOPEX.
4.5.2.3 Parameters
A description of the parameters for the LOBPCG solver follows. The functional
definition can be found in include lobpcg.h available in the online site.
Some parameters are operators. For these a pointer to the operator is passed.
The operator must be defined as
void (*operatorA)(void*,void*,void*).
If a parameter is not needed then a NULL is passed.
Parameter Description
X Required. A block vector. This is the initial guess of eigenvectors. This can be
based on prior knowledge or just random guesses. The number of vectors defines
the number of eigenvalues to solve for. On output it contains the computed
eigenvectors.
A Optional. Use this if A is a matrix.
operator A Required. This implements a matrix vector multiplication. If A is NULL
then it must define A as an operator and do a matrix vector multiplication.
81
B Optional. Use if B is a matrix.
operatorB Optional. Only needed if solving a generalized eigenvalue problem. This
implements a matrix vector multiplication. If B is NULL then it must define B
as an operator and do a matrix vector multiplication.
T Optional. Use this if a preconditioner is supplied. This is data in matrix form
to be passed to the preconditioner operatorT. The data is dependent on the
preconditioner that operatorT implements. It could be NULL, a preconditioned
matrix based on A, or A.
operatorT Optional. But must be supplied if a preconditioner is used. This pre
forms the actual preconditioning on the residuals block vector.
Y Optional. This is block vector of constraints. Orthogonality of X to Y is enforced
in the LOBPCG solver.
blapfn Required. A structure which contains the addresses to lapack functions (or
equivalents) dsygv, dpotrf, zhegv, and zpotrf.
tolerance Required. A structure containing absolute and relative tolerances to apply
to the residual norms to test for convergence.
maxlterations Required. The maximum number of iterations to perform.
verbosity Lev el Required. The LOBPCG algorithm can print error messages and
messages to track progress of the solver. verbosityLevel values control this.
Value of 0 means print no messages. Value of 1 means print error messages,
max residual norm after each iteration, and eigenvalues after last iteration.
Value of 3 means print error messages and eigenvalues after every iteration.
iterations Required. Output. The number of iterations actually performed.
82
eigs Required. Output. An array containing the eigenvalues computed.
eigsHistory Optional. Output. An array containing eigenvalues produced after
each iteration.
eigsHistNum Optional. Input. Max number of eigenvalues. This should be > the
number of eigenvalues to compute. It is used to reformat the eigsHistory array
into a matrix format. Required if eigsHistory is not NULL.
residNorms Optional. Output. An array containing residual norms of eigenvalues
computed.
residHistory Optional. Output. An array containing residual norms of eigenvalues
produced after each iteration.
residHistNum Optional. Input. Max number of eigenvalues. This should be >
the number of eigenvalues to compute. It is used to reformat the residHistory
array into a matrix format. Required if residHistory is not NULL.
4.5.3 Drivers and Interfaces
Drivers are the programs that setup the eigenvalue problem and BLOPEX ab
stract is where they are solved. These encompass two separate environments. That
of the Driver (PETSc, Hypre, etc.) handle matrix and vector sparse formats and the
matrix vector operations on them including application of preconditioners. BLOPEX
abstract has all of the logic for the LOBPCG algorithm. The interface is where the
functionality of the two environments overlaps.
The interfaces and various multivector structures reviewed in Section ?? can be
intimidating to a user. To overcome this we supply various Drivers which serve both
as examples and in some cases generic problem solvers.
The next sections describe the Drivers and interfaces that are available. For
details of execution of tests with these drivers and configuration for PETSc and
83
Hypre, review the Wikis available on the Google source html site. See Section ?? for
more information. For execution of BLOPEX under PETSc and Plypre also see the
appendices of [? ].
4.5.3.1 PETSc
BLOPEX is included as part of the PETSc distribution which must be con
figured with the option downloadblopex=l. Scalar values in PETSc are ei
ther real or complex and this must be specified during configuration via the option
withscalartype=complex. PETSC provides parallel processing support on ma
trix vector operations.
There are 4 Drivers distributed with PETSc located in the PETSc subdirectory
. ./src/contrib/blopex.
driver. c builds and solves a 7pt Laplacian.
driver_f iedler. c accepts as input the matrix A in Petsc format. These can
be setup via some Matlab programs in the PETSc socket interface to Matlab;
PetscBinaryRead.m and PetscBinaryWrite .m. These programs read and write
Matlab matrices and vectors to hies formatted for Petsc. The version from Petsc
only supports double. We have modified these programs to also support complex
and 64 bit integers. Our versions are included in the Google source directory
. ./blopex_petsc along with PetscWriteReadExample .m to illustrate how to
use them.
driver_diag. c solves an eigenvalue problem for a diagonal matrix. This serves
as a test program for very large sparse matrices. It has been executed success
fully with over 8 million rows.
ex2f _blopex. F is an example of using BLOPEX with PETSc from Fortran.
84
4.5.3.2 HYPRE
Hypre does not support complex number, but like PETSc provides parallel
support via matrix vector multiplication and high quality preconditioners. The
BLOPEX LOBPCG solver is incorporated into Hypre programs struct. c and ij c
located in the Hypre directory ./src/test. These programs have broad func
tionality and can setup and solve 3D7pt Laplacians. They can also input ma
trix hies in Hypre formats to construct a generalized eigenvalue problem. These
hies can be created in Matlab using the Matlab matlab2hype package available on
http: / / www.mathworks. cn / matlabcentral /.
There is also a somewhat less intimidating example in . /src/examples/exll. c
which solves a 2D Laplacian eigenvalue problem with zero boundary conditions on
an n x n grid.
4.5.3.3 Matlab
The Matlab interface consists of m hies and c hies available on the Google source
site under directory ./blopex_matlab. The BLOPEX abstract hies must also be
acquired from the Google source site. All c hies are compiled under the Matlab Mex
compiler. Complex numbers are supported along with 64bit integers in the newer
version of Matlab. Preconditioners are implemented in this interface as m hies.
4.5.3.4 Serial
These are stand alone drivers and interfaces written in C. There are complex
and real versions. Matrices created by the drivers are in standard Fortran format.
They do not have any parallel support. They have been used primarily for BLOPEX
development testing.
85
4.6 The Google Source Site
BLOPEX source code as of Version 1.1 is maintained on the Google source code
site http://code.google.eom/p/blopex/ under the SVN version manager. All source
is downloadable.
This site also provides some Wiki documents that describe tests we have executed
for all interfaces and various systems. Between the source code for the Drivers and
the Wikis we hope users find BLOPEX accessable and usable.
4.7 Environments BLOPEX Tested On
BLOPEX has been tested in a wide variety of environments. The following is a
partial list covered by one or more of the Wikis described in Section ??
Machines: UCD XVIB, UCAR Frost, NCAR Bluehre, Lawrence Livermore
National Laboratory, IBM PC
Operating Systems: Linux, Fedora, IBM AIX, Cygwin under Windows 7
Compilers: gcc, IBM blrts_xlc, g++, pgee, SUN mpcc, AMD OPEN64
Lapack Libraries: Lapack, AMD ACML, Intel MKL, IBM ESSL
MPI: openmpi, mpich2
4.8 Numerical Results
Some numerical tests for 3D 7Point Laplacians of the BLOPEX implementation
of LOBPCG in Hypre have previously been reported in [? ] and in PETSc and Hypre
in [? ].
We report here on some results using Hypre and a few of the matrices that were
analyzed by PRIMME as reported in [? ]. These matrices are available from the
University of Florida Sparse Matrix Collection at
http://www.cise.ufl.edu/research/sparse/matrices/. A direct comparison to
PRIMMEs results is not possible since they are produced on a very different machine.
86
Table 4.1: Matrices analyzed
Matrix Rows nnz nnz (L+U) AMD
Andrews 60,000 760,154 234,019,880
hnan512 74,752 596,992 5,600,676
cdfl 70,656 1,825,580 71,684,224
cdf2 123,440 3,085,406 147,417,232
All of the matrices used are symmetric positive definite. We use matrix An
drews, which has a seemingly random sparsity pattern and not much structure,
finan512, which is a stochastic matrix used for financial portfolio optimization and
cfdl and cfd2, which are pressure matrices from structural engineering. Their char
acteristics are described in table ??. Note nnz(L + lJ) AMD is the number of nonzeros
in L+U of the LU factorization using AMD.
Analysis was performed on a Fedora 10 OS, 4 Quad Core Opteron 2.0 Ghz
CPUs, and 64 GB RAM. Hypre was configured using openmpi with gcc compiler
and BLAS/LAPACK libraries.
To setup the matrices for processing by Plypre we downloaded the matlab versions
and converted them using our matlab2hypreIJ.m program to Plypre formats. This
hie was then processed using the Plypre ij program. For example to find 5 eigenvalues
of finan512 to a tolerance of le 6 using the BoomerAMG preconditioner we would
execute:
./ij lobpcg vrand 5 tol le6 pcgitr 0 itr 200 seed 1 solver 0 fromhle hnan512
For the first experiment (Table ??), we process all of the hies in single processor
mode. Both the times to setup the preconditioner and to execute the LOBPCG
solver are reported. The matrix Andrews has an eigenvalue very close to zero, which
causes problems orthonormalizing the residual. To overcome this a shift of le 7 was
87
Table 4.2: Single processor setup and solution time
Matrix Setup Eigenvalues to solve for
1 2 3 4 5 7 10 15
Andrews 29 4 18 34 62 @ * * *
hnan512 1 5 10 22 37 43 66 90 203
cdfl 25 152 297 405 599 737 * * *
cdf2 36 335 644 1342 * * * * *
All times rounded to nearest second.
* Analysis not performed.
@ Failure of dsygv routine.
applied to the matrix. This was successful up to solution for 5 eigenvalues where
dsygv routine failed. Also, note the relatively large preconditioner setup times for
all matrices except finan512. This seems to reflected in the nnz(L+lJ) AMD values
shown in Table ??. The setup times are independent of the number of eigenvalues to
solve for.
The second experiment (Table ??) studies the effect of parallel processing on
solution time for matrix finan512. It solves for 5 eigenvalues using openmpi varying
the number of processors. For example to run with 2 processors, we would split
hnan512 into 2 Hypre hies using matlab2hypreIJ.m and process as follows:
mpirun np 2 ,/ij lobpcg vrand 5 tol le6 pcgitr 0 itr 200 seed 2 solver 0
fromhle hnan5122
4.9 Summary
Version 1.1 of BLOPEX has been implemented in PETSC version 3.1p3 and sub
mitted to Hypre for inclusion. Version 1.1 incorporates the new features of complex

Full Text 
PAGE 1
MODELSFORSPECTRALCLUSTERINGANDTHEIRAPPLICATIONS by Donald,F.McCuan B.A.,AustinCollege,1972 Athesissubmittedtothe FacultyoftheGraduateSchoolofthe UniversityofColoradoinpartialfulllment oftherequirementsforthedegreeof DoctorofPhilosophy AppliedMathematics 2012
PAGE 2
ThisthesisfortheDoctorofPhilosophydegreeby Donald,F.McCuan hasbeenapproved by AndrewKnyazev,Advisor StephenBillups,CommitteeChair TzuPhang JulienLangou BurtSimon October26,2012
PAGE 3
McCuan,Donald,F.Ph.D.,AppliedMathematics ModelsforSpectralClusteringandTheirApplications ThesisdirectedbyProfessorAndrewKnyazev ABSTRACT Inthisdissertationtheconceptofspectralclusteringisexamined,withthegoal ofbetterunderstandingofpropertiesofeigenfunctionsofthegraphLaplacianand theiruseinclustering.Westartbydiscussingbipartitioningofimagesviaspectral clustering,andjustifythistechniqueusingananalogytovibrationproblems.Sucha justicationdoesnotrelyuponatraditionalrelaxationofacombinatorialoptimizationbaseddenitionofclustering.TheimportanceoftheFiedlervectorandtheorems byFiedlerisemphasized.WeextendFiedler'stheoremtothecaseofthegeneralized eigenvalueproblem. Courant'sNodalLineTheoremCNLTcanbeviewedasananalogofFiedler's theoremforeigenvectorscorrespondingtohighervibrationfrequencies.Wereviewthe literaturefordiscreteCNLTs.Anewdenitionforrweaksigngraphsispresented andusedtoproveamodieddiscreteCNLTtheorem.ApplicationsofFiedler's theoremsandCNLTstospectralclusteringarediscussed.Practicaldicultieswith imagesegmentationrelatedtopeculiaritiesinconstructionofgraphedgeweightsare investigated. Anotherapplicationofspectralclustering,coveredinthisdissertation,dealswith recursivegenebipartitioningforDNAmicroarrays.Wedevelopanewtechniqueand proveatheoremfordealingwithdisconnectedgraphcomponentsthatarecommon forDNAmicroarraydata.OurideasareincorporatedinnewMATLABsoftwarefor geneclusteringinDNAmicroarrays.Resultsofclusteringusingpracticalmicroarray dataarepresented.
PAGE 4
ThelastsectiondealswiththesoftwarepackageBlockLocallyOptimalPreconditionedEigenvalueXolversBLOPEXwhichaspartofthePH.D.candidate's graduateworkhasbeenupdatedtoVersion1.1.BLOPEXimplementstheLocally OptimalBLockPreconditionedConjugateGradientLOBPCGmethodforsolution ofverylarge,sparse,symmetricorHermitiangeneralizedeigenvalueproblems. Version1.1ofBLOPEXaddsamongstotherthingssupportforcomplexarithmeticand64bitintegers.BLOPEXincludesinterfacestoindependentlydeveloped softwarepackagessuchasPETScandHyprewhichprovidehighqualitypreconditionersandparallelMPIbasedprocessingsupport.Thereisalsosupportforexecution ofBLOPEXfromMATLABandstandaloneCprograms. ThePH.D.candidate'spersonalcontributionintheBLOPEXprojectisrecoding oftheabstractroutinesandPETScandHypreinterfacesforthenewfunctionality, developmentofnewcomplexarithmetictestcases,andaidandassistanceintesting, debugginganddocumentationofallBLOPEXinterfaces.Inthisdissertation,we describetheBLOPEXsoftware,designdecisions,theLOBPCGalgorithmasimplemented,andallofthevariousinterfaces.Numericalresultsarepresented. Theformandcontentofthisabstractareapproved.Irecommenditspublication. Approved:AndrewKnyazev
PAGE 5
DEDICATION IdedicatethisthesistomywifeCarolyn,andthevolunteersandstaoftheDenver MuseumofNaturalHistorywhohavegivenmeencouragementoverthemanyyears ofthisendeavor.
PAGE 6
ACKNOWLEDGMENT Foremost,Iwouldliketotothankmyadvisor,AndrewKnyazev,forhissupport andmotivation.Also,IthankmyotherProfessorsatUCDwhorenewedbyinterest inMathematics,andraisedmyknowledgeandunderstandingofthesubjecttonew levels.MyresearchandworkonthethesishavealsobeensupportedbyNSFDMS grants0612751and1115734,thePIProfessorAndrewKnyazev.
PAGE 7
TABLEOFCONTENTS Figures.......................................x Tables........................................xii Chapter 1.Introduction...................................1 2.SpectralClusteringandImageSegmentation.................3 2.1Introduction...............................3 2.2GraphLaplacian.............................4 2.3CombinatorialModelandRelaxation.................5 2.4SpectralClusteringAlgorithm.....................9 2.5OverviewofLiterature.........................10 2.6VibrationalModel............................15 2.7FiedlerTheorems............................17 2.8AnextensionofFiedler'stheorem...................20 2.9Eectofmassmatrixonsegmentation.................24 2.10ImageSegmentation...........................27 2.11EdgeWeights..............................28 2.12TuningtheEdgeWeightParameters..................29 2.13EigenvalueSolvers............................33 2.14NodalDomainTheorems........................35 2.15Summary.................................46 3.DNAMicroArrays................................47 3.1Introduction...............................47 3.2WhatisaDNAMicroarray?......................47 3.3HowisMicroarraydataanalyzed?...................51 3.4Normalizationofdata..........................52 3.5DisconnectedGraphs..........................53 vii
PAGE 8
3.6Recursivebipartitioning.........................60 3.7Weightconstruction...........................62 3.8"Toy"experiments............................63 3.9Analysisofrealmicroarrayexperiments................64 3.10TheSoftware...............................65 4.BLOPEX....................................68 4.1Introduction...............................68 4.2TheProblem...............................69 4.3CurrentSoftware.............................69 4.4LOBPCG.................................71 4.5BLOPEXsoftware............................72 4.5.1Structure.............................73 4.5.2AbstractCode..........................74 4.5.2.1Algorithms.........................75 4.5.2.2DataTypes........................79 4.5.2.3Parameters........................81 4.5.3DriversandInterfaces......................83 4.5.3.1PETSc...........................84 4.5.3.2HYPRE..........................85 4.5.3.3Matlab...........................85 4.5.3.4Serial............................85 4.6TheGoogleSourceSite.........................86 4.7EnvironmentsBLOPEXTestedOn..................86 4.8NumericalResults............................86 4.9Summary.................................88 5.ConclusionsandFinalThoughts........................90 viii
PAGE 9
Appendix A.SpectralClusterFunction............................92 B.SpectralClusterTestDriver...........................104 C.SubroutinesUsedbySpectralClusterFuntion.................106 D.ListofGenesbyCluster............................109 References ......................................115 ix
PAGE 10
FIGURES Figure 2.1Partitionofagraphbyvertexcuts.Alledgeweightsareequal......6 2.2Ratiocutpenalty...............................8 2.3Fiedlervectorforatree...........................18 2.4Fiedlervectorforalattice..........................19 2.5Fiedlervectorforawheel..........................19 2.6Observation2:Separationoftwohighestmasses..............25 2.7Observation3:Largestmassinaclusterbyitself.............26 2.8Observation3:Whenlargestmasscan'tbeinaclusterbyitself.....26 2.9Observation4:Smallestmassneverinaclusterbyitself.........26 2.10Labelingofpixelsinanimage........................27 2.11Baseimageforanalysis...........................29 2.12Initialanalysis................................30 2.13EectofincreasingValScale.........................31 2.14Eectof geomScale .............................32 2.15Eectofislands................................32 2.16Eectofepsilon................................33 2.17PoorqualityFiedlervectorproducedbytoosmallashift.........34 2.18Examplesofstrongandweaksigngraphs.................39 2.19Strongsigngraphsexceedingeigenvectornumber.............40 2.20Signgraphsdecreasinginnumber......................41 2.21Anexampleof r weaksigngraphs.....................42 3.1Geneexpression................................48 3.2AnAymetrixGeneChip..........................50 3.3Geneswithcorrelatedexpressionlevels...................52 3.4Eectofnormalizationonclusters.....................53 x
PAGE 11
3.5Aconnectedandunconnectedgraphandtheeectofaconnectingnode dumontheireigenpairs..........................57 3.6Anexampleofrecursivebipartition.....................62 3.7Microarraydataclusteringresult......................65 4.1StructureofBLOPEXfunctions.......................74 xi
PAGE 12
TABLES Table 4.1Matricesanalyzed..............................87 4.2Singleprocessorsetupandsolutiontime..................88 4.3Multipleprocessorsetupandsolutiontimefornan512..........89 xii
PAGE 13
1.Introduction InChapter ?? ofthisthesisspectralclusteringofdataisdiscussed.Avibrationalmodelisintroducedthatprovidesaninterpretationofspectralclusteringin termsofvibrationmodesofamassspringsystemrepresentingthedatainathought experiment.Mathematicaljusticationofthetechniqueisbasedofsometheorems ofFiedlerdescribinggraphpropertiesviatheeigenvectorsofthegraphadjacency andgraphLaplacianmatrices.Thisisadeparturefromtraditionalcombinatorial graphtheoreticaldenitionsofclustering,thatseektondasolutionofcombinatorialminimizationproblemsongraphs.Whilementionedintheliteratureforspectral clustering,Fiedler'stheoremsdonotreceiveproperattention.Weshowhowtheyare fundamentalforunderstandingthevibrationalmodelofspectralclustering. FiedlersbasictheoremisextendedinChapter ?? toencompassawiderrangeof problemsincludingthenormalizedgraphLaplaciansandsomegeneralizedeigenvalue problems.Wethenexperimentallyinvestigatetheeectonclusteringofamass matrixinthegeneralizedeigenvalueproblem.Withthisbackground,weexamine animplementationofspectralclusteringforimagesegmentation.Fiedler'stheorem alsoallowsforabetterunderstandingoftheimportanceofgraphedgeweights.The emphasishereisontheeectofvariationofparametersinaparticularapproachto edgeweightgenerationthatwethoroughlyinvestigateinournumericalexperiments. WeconcludeChapter ?? withamoretheoreticalexaminationofadierent technique,NodalDomaintheorems,alsoknownasCourant'sNodalLineTheorems CNLTs.TheoriginalCNLTsdescribelowfrequencyvibrationmodesofamembrane,whichmathematicallyaretheeigenfunctionsofacorrespondingpartialdifferentialoperator.Ifamembraneisapproximatedbyamassspringsystem,this operatorturnsintoitsdiscreteanalog,whichinfactisthegraphLaplacian,where thegraphnodesareassociatedwiththemassesandthegraphedgeweightsarecalculatedaccordingtothestinessofthespringsconnectingthemasses.Wedemonstrate 1
PAGE 14
howdiscreteCNLTsarerelatedtoFiedler'stheorems. Chapter ?? buildsonthisinitialdiscussionofspectralclusteringbyapplication toDNAMicroarrayanalysis.Thisisacompletelydierentproblemfromimage segmentation,withchallengesofitsown.Weintroducenewtechniquesandconstruct afullfeaturedalgorithmforgeneclustering.Thisalgorithmincorporatesrecursive bipartitioningandhandlingofgraphswithmultiplecomponents. Chapter ?? isadeparturefromtheprevioustwo.Itdiscussesourworkdone onBLOPEXsoftware,whichisaparallelsolverforlargesparseHermitiangeneralizedeigenvalueproblems.TherelevanceofBLOPEXforspectralclusteringisits potentialapplicabilitytoeigenvalueproblemsinvolvingverylargematricesappearinginanalysisofmassivedatasets.BLOPEXimplementstheLOBPCGalgorithm whichisdescribedindetails.AoverviewoftheBLOPEXsoftwareisgiven,which includesahighlevelowchartofthecode.WealsointroduceournewGooglesite fortheBLOPEXcode.Acomparisontoothersoftwarewithsimilarfunctionalityis provided. 2
PAGE 15
2.SpectralClusteringandImageSegmentation 2.1Introduction Weareconcernedwiththeproblemofpartitioninganimageintotwopartsa bipartitionwiththeobjectivethattheelementsimagepixelswithineachpartare moresimilartoeachotherthanelementsbetweenpartstendtobe.Thisstatement, whilesimple,hasleftundenedsuchconceptsassimilarityandmeasuresofwhether anyparticularbipartitionisbetterthananother. Thereisalargeliteraturewhichtakestheapproachofdeningthisproblemasa partitioningofaweightedgraphwherethepartitioningisperformedbyminimization ofsomevertexcutfunction;forexamplesee[ ? ],[ ? ].Theseproblemscanbeexpressed asaminimizationunderconstraintsoftheRayleighRitzRatiooftheassociatedgraph Laplacianmatrix. ThiscombinatorialproblemisNPcompleteandtosolveit,theconstraintsare relaxed,leadingtoaproblemofsolvingforaneigenvectorofthesecondsmallest eigenvalueofthegraphLaplacian,commonlyreferredtoastheFiedlerVector.Then thiseigenvectorisusedtoseparatethegraphsverticesintotwogroups.Thisisthe techniqueofspectralclustering.WediscussthegraphLaplacianinSection ?? ,which isbackgroundfortherestofthechapter. Bichot[ ? ]attributestheoriginofspectralclusteringtoDonathandHoman[ ? ].Theconceptissimple.Itscomplexityliesinunderstandingwhyitworks. Spectralclusteringdoesnotalwaysgivegoodsolutionstotheoriginalcombinatorialproblem.WeexaminesomeoftheseissuesinSection ?? andpresentanalternative justicationforspectralclusteringinSection ?? .But,beforethis,wegiveabrief overviewoftheliteratureinSection ?? whichexaminestheeldofcombinatorialand spectralclustering. SpectralclusteringinvolvesusingtheFiedlervectortocreateabipartitionof thegraph.SometheoremsbyFiedlerareneededtounderstandthecharacterofthe 3
PAGE 16
Fiedlervectorandhowthisrelatestoclustering.Thesetheoremsarereviewedin Section ?? andweexpandthescopeofoneofthesetheoremsinSection ?? Havingexpandedthescopeofthetheorems,weexaminenumericalresultsfor thegeneralizedeigenvalueproblemusingmassmatricesinSection ?? .Wethen examinetheproblemofimagesegmentationinSection ?? anddiscussgeneration ofedgeweightsinSection ?? .Wedeneouralgorithmforimagebipartitioningin Section ?? .TheninSection ?? ,wegiveexamplesofhowweightparameterscan eectclustering,connectthiswiththeFiedlertheorems,andshowhowproblemscan occur. ThisisfollowedinSection ?? byadiscussionofeigenvaluesolversandtheneedfor solversforeigenpairsoflargesparsematricesfortheimagesegmentationproblems. NodalsetsandsigngraphtheoryarereviewedinSection ?? wherewederivean extensionofthediscretenodaldomaintheoremthatincludesFiedler'stheoremasa specialcase.Wesummarizeinthelastsection. 2.2GraphLaplacian Let G = V;E denoteagraphwhere V isitsvertexset, E isitsedgeset,and thenumberofverticesis j V j = n .Wenumbertheverticesof G andthisindexis thenusedtorepresentthevertices V ;thatis V = f 1 ;:::;n g .Theedgesof G are undirectedandhavenoloops. The n n adjacencymatrix A of G hasentriesrepresentingtheedges E of G .Ifvertices i and j haveanedgebetweenthemthentheelement a ij = a ji of A representsarealpositiveweightassignedtothatedge.Ifthereisnoedgebetween vertices i and j then a ij =0.Thedegreematrix D for G isadiagonalmatrixwhere d ii = P j 2 V a ij = a ji Denition2.1 TheunnormalizedgraphLaplacianis L = D )]TJ/F19 11.9552 Tf 11.955 0 Td [(A Thematrix L isadiscreteanalog,uptothesign,ofthecontinuousLaplacian 4 = P i @ @x i .Ajusticationforthiscanbefoundin[ ? ]. 4
PAGE 17
ThediscreteunnormalizedLaplacianhasthefollowingproperties[ ? ]. Lemma2.2 Let L beanunnormalizedLaplacian,then itisrealsymmetric, hasnonpositiveodiagonalentries, ispositivesemidenite, itssmallesteigenvalueis0, I = ; 1 ;:::; 1 T isaneigenvectorfortheeigenvalue0,and themultiplicityof0isthenumberofconnectedcomponentsofthegraph. WenoteheretheexistenceofnormalizedgraphLaplacians,andwillhavemore tosayaboutthemlater. Denition2.3 ThesymmetricnormalizedgraphLaplacianis L sym = D )]TJ/F18 5.9776 Tf 7.782 3.259 Td [(1 2 LD )]TJ/F18 5.9776 Tf 7.782 3.259 Td [(1 2 Denition2.4 TherandomwalkgraphLaplacianis L rw = D )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 L 2.3CombinatorialModelandRelaxation Oneapproachtoclusteringistominimizesomefunctionthatreectshowclosely theclustersareconnected.ForexampleinFigure ?? vertexcutsareusedtopartition thegraphintotwosubgraphsAandBwhere j A j + j B j = j V j .Acutisassigneda value cut A;B = P i 2 A;j 2 B a ij .Thisisunsatisfactoryinmanycaseswheresubgraphs thatarehighlyunbalancedareproduced.ForexampleCut1inFigure ?? hasthe samecutvalueasthatofCut2. Toovercomethis,thefollowingratiosareoftenused[ ? ],[ ? ]. Ratiocut A;B = cut A;B j A j + cut A;B j B j 5
PAGE 18
and Ncut A;B = cut A;B vol A + cut A;B vol B where vol A = X i 2 A d ii Figure2.1: Partitionofagraphbyvertexcuts.Alledgeweightsareequal. AssumingthegraphisconnectedmoreonthisinSection ?? ,itcanbeshown [ ? ]thatminimizingtheRatiocutisequivalentto min A V f ? I k f k = p n f T Lf where f i = 8 > < > : p j B j = j A j if i 2 A )]TJ/F25 11.9552 Tf 9.299 10.221 Td [(p j A j = j B j if i 2 B wherewereferto f asthecombinatorialsolution.Notethatthecomponentsofthe vector f havexedvaluesin A and B .Thusthesolutionisconstrainedtoasubset of R n whichisthesetofall f 2 R n such f isdenedasaboveand A V .Inthis subset,bythedenitionof f ,wehave f ? I and k f k 2 = p n .Notealso,thatthe graphLaplacian L getsdenedaspartofthetransitionfromtheoriginalRatiocut minimizationproblemtotheminimizationproblemintheformexpressedabove. 6
PAGE 19
ThisproblemisNPcompleteseeAppendixof[ ? ],butiftheconstraintsare relaxeditbecomes min f 2 R n f ? I k f k = p n f T Lf Theterm f T Lf isthenumeratoroftheRayleighquotient R f = f T Lf=f T f and theRayleighRitzcharacterizationofeigenvalues[ ? ]givesthesolutionasthe2nd smallesteigenvalueof L with f asitseigenvector.Aneigenvectorofthe2ndsmallest eigenvaluewhichcouldhavemultiplicitygreaterthan1iscalledtheFiedlervector. Asimilaranalysiscanbeperformedfor Ncut [ ? ],[ ? ]whichyieldsrelaxed problems Lf = Df and L sym f = f AspectralclusteringanalysisconsistsofndingtheFiedlervectorandusingit toproducetheclustersAandB.Commonly,thesearechosenas A = f i j f i 0 g and B = f i j f i < 0 g .Otheralternativesinclude A = f i j f i > 0 g and B = f i j f i 0 g .This methodisderivedbyanalogytothecombinatorialsolutionof f ,whichhavestrictly positiveandnegativesolutionsforthecomponentsof f .Sometimes,asearcharound zeroisusedtondpartitionswithbetterRatiocuts[ ? ].There,anassumptionis thattherelaxedproblemhasasolutionclosetotheratiocutproblem.Inanyofthese casesastrictpartitioni.e.withnonoverlappingclustersisalwaysproduced. Therearesomequestionsthatcanberaisedatthispoint. Howcloselydoestherelaxedsolutioncometotheactualcombinatorialsolution? Graphscanbeconstructedwherethedierenceincutvaluesisarbitrarilylarge [ ? ].ThismightbeexpectedsincetheoriginalproblemisNPcomplete,therelaxed problemisnot,andwehaveplacednorestrictionsonthetypeofgraphsinvolved. Evenifwehaveasolutionthatisclosetocombinatoriallyoptimal,howbalanced isit? 7
PAGE 20
Wemakethefollowingobservation.Thebalancingtermin Ratiocut is 1 j A j + 1 j B j Ifweplotthisforvariousvaluesof j A j and n Figure ?? weseealargepenaltynear theextremeswhichisdesirablebutaverysmallpenaltyforvaluesof j A j between. Soifwewantabalancecloserto j A j n = : 5thiscanbeachievedevenwhen j A j >> j B j Thisisparticularlytrueifthereisalargenumberofvertices,whichisusuallythe caseforpracticallyimportantapplications. Figure2.2: Ratiocutpenalty Thismeansthatthereisapossibilitythatthecombinatorialsolutionisnotparticularlybalanced.Whetherthisisgoodorbaddependsonthetypeofclustering. Thereareexampleswheretherelaxedproblemyieldsintuitivelysatisfyingresults wheretheratiocutdoesnot.Letusconsideraverylongstringofbeadswherethe edgeweightsareallthesame,exceptforoneedgenearanendofthestringwhere thepenaltyisverysmallandanotheredgeinthecenterofthebeads.Letussuppose thattheedgeneartheendofthestringhasaweightlessthanthatoftheedgeinthe center,andnoedgeweightisverymuchgreaterthananyother.Underthisscenario, therelaxedproblemhasasolutionwhichpartitionsthestringaroundthecenteredge 8
PAGE 21
buttheratiocutisminimizedbyapartitionaroundtheouteredge.Amorebalanced partitionproducedbytherelaxedproblemismoreintuitivelysatisfying. Whatshouldwedowithverticeswhere f i =0?Thissituationisprecludedin theoriginalcombinatorialsolutionfor f Thechoiceofputtingthevertexiwhere f i =0intocluster A alongwith f i > 0 isarbitrary.WecouldjustaswellhavechosentoplaceitinclusterB. Areclustersconnected? WeseeinSection ?? thatoneoftheclusterswillalwaysbeconnectedbutthe othermaynotbe.Thisupsetsourintuitionaboutwhatagoodclustershouldbe.We expectthenodeswithinaclustertohavesomesimilarity,butiftheclusteritselfis notconnectedhowcanthisbe.Also,thisisapracticalissueifweshoulddorecursive bipartitionstobreakanimageintomultipleclusters.Wewanttostartouralgorithm withaconnectedgraphandthismightnolongerbethecase.Whythisisthecaseis explainedinSection ?? Inanattempttoovercometheseissuesandforcethespectralclusteringresultsto giveclusterswithbetterRatiocutvalues,somealgorithmsresorttoadhocapproaches suchaslookingforabetterRatiocutaroundzero;forexample A = f i j f i r g and B = f i j f i
PAGE 22
1.Associatetheproblemofclustering n datapointswithagraph G = V;E with n vertices. 2.Numbertheverticesof G from1to n 3.Assignweightstoeachedgeandconstructaweightedadjacencymatrix A for G andtheassociateddegreematrix D 4.ConstructtheLaplacianmatrix L = D )]TJ/F19 11.9552 Tf 11.956 0 Td [(A 5.ComputetheFiedlervector Y = f y 1 ;y 2 ;:::;y n g oftheLaplacian L 6.Partitionthegraphintosubgraphs A and B using Y asfollows: A = f i 2 V : y i 0 g ;B = f i 2 V : y i 0 g Notes: Thegraph G mustbeconnected. Therearemanywaystoassignweights.Section ?? onlydescribesonemethod. A S B = V butitispossiblethat A T B 6 = ; sincetheremaybe y i =0. WithrespecttothevibrationalmodelseelaterinSection ?? forspectralclustering, A and B aretheoptimalsolutions.TheirRatioCutvaluesareirrelevant. Sinceboth A and B areconnected,wecanuserecursiveapplicationsofthe algorithmtogeneratesmallerclusters. 2.5OverviewofLiterature Theclusteringliterature,withconnectionstospectralclustering,isprimarily concernedwiththesolutionofacombinatorialproblembasedonminimizationof somecostfunction.SincethisproblemturnsouttobeNPhard,itssolutionis attemptedviarelaxationtoaspectralclusteringproblem. 10
PAGE 23
Theliteratureisgenerallyconcernedwithoneormoreofthefollowingissues: alternativestospectralclustering, justicationofspectralclustering, edgeweightselection, howtouseeigenvectorstoobtainclusters, howmanyclusterstoconstruct,and applications. AnexcellentstartingpointtostudyspectralclusteringisthetutorialbyUlrike vonLuxburg[ ? ].Afterthat,possiblythemostreferencedpaperonthesubjectis byShiandMalik[ ? ]. Therearealternativestospectralclustering.Analgorithmforndingbipartitions withminimumcutweightispresentedbyStoerandWagner[ ? ].However,this algorithmdoesnotaddresstheproblemofndingbalancedpartitions. Someauthors[ ? ],[ ? ],[ ? ],[ ? ],[ ? ],[ ? ],[ ? ]makeuseofseveraleigenvectors, collecttheseintoamatrix,andpartitionbasedontherowvectors,whichareusually normalized,andthenhave k meansclusteringappliedtothem. AlpertandYao[ ? ]constructamincutproblemwhereclustervolumesmust fallwithinpredenedranges.Theythendeneamaxmin k wayvectorpartitioning problembasedontheeigenvectorrows,andshowthatthesetwoproblemsareequivalentwhenalleigenvectorsareused.Ofcourseinthiscase,sincemincutisNPhard, theirvectorpartitioningmustalsobeNPhard.Nojustication,exceptfornumerical experiments,ispresentedwhennotusingalleigenvectors. Alpert,Kahng,andYao[ ? ]denetheirMELOmultipleeigenvectorlinear orderingsalgorithm,whichusesmultipleeigenvectorstoproduceabipartition. 11
PAGE 24
WhiteandSmyth[ ? ]deneanewcostfunction,referredtoastheQfunction, whichisameasureofthedeviationbetweentheprobabilitythatbothendsofanedge lieinthesamecluster,andthesquareoftheprobabilitythateitherendofanedgelies inthecluster.MinimizationoftheQfunctionisaproblemwhoserelaxationresultsin approximatelyaneigenvalueproblemfor L rw ,therandomwalkLaplacian. k means clusteringisthenappliedtotherowsofsomeoftheeigenvectorsof L rw .Thebest clusteringiscomputedbyvarying k andcomputingQforeachpartitionproduced. AnotherprobabilisticapproachisproposedbyMeilaandShi[ ? ][ ? ].They deneacostfunctionviaarandomwalkandconnectthistotheNCUTbalancedcost function.Theythenpresentamachinelearningalgorithmtodetermineedgeweights, trainedbypredenedimageclusters. WhileMeilaandShi'srandomwalkdevelopmentisfairlyintuitive,amoreobscure attemptatjustifyingspectralclusteringviatherandomwalkLaplacian,issuggested byNadler,Lafon,Coifman,andKevrekidis[ ? ].Thisisbasedonanassumptionthat thegraphverticesaredistributedaccordingtotheprobabilitydistribution p x = e )]TJ/F20 7.9701 Tf 6.587 0 Td [(U x where U x isapotential. AdierentdiscretecostfunctionispresentedinthepaperbyKanna,Vempala, andVetta[ ? ].Theydenean bicriteria,relaxtheminimizationproblem,and thenshowthatspectralclusteringhasaworstcaseapproximationguaranteewith respecttothe measure.Whileinteresting,itdoesnotseemtobethatpractical. Otherattemptstoplaceboundsontheclusterresultswithrespecttoacost functiondatebacktoDonathandHoman[ ? ].Theysetalowerboundonthe sumoftheedgecuts,wherethenumberofclustersandthenumberofverticesinthe clustersarepreselected. Ng,Jordan,andWeiss[ ? ]useatypicalapproachtospectralclustering,i.e.the symmetricLaplacianand k meansclusteringappliedtomultipleeigenvectors.They thenusetheCheegerconstant[ ? ]oftheclusterstoconstructcriteriaassumptions. 12
PAGE 25
Iftheseassumptionsaremet,theyshowthatthereexistsorthogonalvectorsin k spacesuchthatrowsoftheeigenvectorsare"close"tothem. Sharon,Galun,Sharon,Basri,andBrandt[ ? ]applyspectralclusteringtoimage segmentation.TheypresentamultigridapproachseealsoBichot[ ? ]wherespectral clusteringNcutcostfunctionplaystheroleofbiclusteringateachcoarsenesslevel. Dhillon,Guan,andKulis[ ? ][ ? ]giveanothermultigridapproachusingthe weightedKernal k meansalgorithm.Thisprojectsthedatavectorsontoahigher dimensionspacewhere k meansdoestheclusteringfortherenementsteps.Spectralclusteringisappliedatthecoarsestlevel.Theyapplythistechniquetoimage segmentationandgenenetworks. AtechniquethatisnotspectralbasedispresentedbyFelzenszwalbandUttenlocher[ ? ].Thisisappliedtoimagesegmentationand,whilenotspectralinnature, hasaninterestingapproachforcoarseningoftheimage.Theydeneaconceptofa segmentationbeingtooneortoocoarse.Alumpingprocedureisthenappliedto lumpverticesintosegmentsthatareneithertooneortoocoarse. OrponenandSchaeer[ ? ],constructanapproximateFiedlervectorbyminimizingtheRayleighQuotientofareducedDirchletmatrixviathegradientdescent. TheissueofroundoerrorinthecomputationoftheFiedlervectorviaKrylov subspacesistakenupbyMatsekh,Skurikhin,andRosten[ ? ].Theyobservethat roundoerrors,inherentinnumericalsolutionsoftheeigenvalueproblems,canmake theFiedlervectorunsuitableforimagesegmentation.Weaddressthisissuelaterin thisthesisinSection ?? ClusteringtheFiedlervectorisanalyzedbyChan,Ciarlet,andSzeto[ ? ].Their issueis,whatisthebestwaytopartitiontheFiedlervector?Theyshowthatamedian cutequalnumbersofvectorcoordinatesonthesecondeigenvectorisoptimal,inthe sensethatthepartitionvectorinducedbythisistheclosestpartitionvectortothe secondeigenvector.Whattheyactuallyprove,isthatthemediancutonanyvector 13
PAGE 26
isoptimaltothatvector.Itdoesnotshowhowclosethepartitionvectorbasedon thesecondeigenvectoristothesolutionofthecombinatorialproblem. Ding,He,andSimon[ ? ]showtheequivalenceofsymmetricnonnegativeMatrix FactorizationNMFandKernal k means,andtheequivalenceofNcutandNMF. ThedenitionofNcuttheyuseisnotthesameasthatdenedbyShiandMalik. AnentirelydierentapproachispresentedbyGradyandSchwartz[ ? ].They startwiththeRatioCutproblemandconvertittoanisoperimetricproblemwhich canbesolvedasaalinearsystem. AveryinterestingpapercomesfromZelnikManorandPerona[ ? ].Theyusethe inverseGausiantocomputeweights,butassociateadierentscalingfactorwitheach vertex.Thescalingfactorreectsthedensityofverticesaroundeachvertex.This allowsforhandlingofmultiscaledataandbackgroundclutter.Theyalsoproposea costfunctionforautomaticdeterminationofthenumberofclusters. Similarly,FischerandPoland[ ? ]alsoadjustthescalingfactorforeachvertex. Buttheirtechniquefordoingsoisdierent.Theyalsoproposeuseofklinesclustering oftheeigenvectorrowsinsteadof k means. Wedealwithimagesegmentationandmicroarrayanalysisinthisthesis.Another applicationthatisonlystartingtoreceivesomeattentionistheapplicationofspectral clusteringtoPhylogenetics[ ? ][ ? ][ ? ].Herespectralclusteringhasbeenproposed asanalternativetomaximumparsimony,maximumlikelihood,andneighborjoining. Thedata,sofar,hasbeengenesequences.Thetechniquesaremostlyfairlystandard: symmetricLaplacian,multipleeigenvectors,and k means. Otherpapers,fromthelistofreferences,arementionedelsewhereinthethesis andarenotrepeatedhere. 14
PAGE 27
2.6VibrationalModel Werstconsideraproblemwithaninnitenumberofpoints.Supposewehave aplanarmembranewhichiscontinuous.Itisstretchedoversomegeometricshape whichisconnectedbutpossiblynotconvex;suchasacircle,square,Lshape,etc. Vibrationsonthemembranearedescribedbythewaveequation )]TJ/F19 11.9552 Tf 9.299 0 Td [( u + 4 u =0in [ ? ],[ ? ],[ ? ]withnaturalfreeboundaryconditions.Here u isthedisplacement ofthemembrane,and = 1 2 where v isinterpretedtobethevelocityofthewave seeDavis[ ? ]page200.Thisinterpretationisbasedonananalysisofunitsinthe equationwhere u hasdimensionoflength. Supposeweonlyhavetransversalvibrationwithoutfriction.Weassumestanding wavesolutionsoftheform u x;y;t = U x;y e i!t .Theproblemthenbecomesa generalizedeigenvalueproblemoftheform U = 2 U Thesmallestvibrationfrequencyiszero,whichcorrespondstotherstvibration mode,whichisaconstantfunction.Thesecondvibrationmodecorrespondstothe Fiedlervector.Thissolutiondividesthemembraneintotwoparts.Eachpartis connectedandmovessynchronously.Onepartwillhavepositivedisplacementwhile theotherhasnegativedisplacement.Thesearethenodaldomainsinducedbythe eigenfunctionmoreonthisisSection ?? Now,supposeinsteadofamembranewehaveamass/springsystemwithanite numberofpointsnodesinaplaneandletthemassesbenumberedfrom1to n .The massesarefreetomoveperpendiculartotheplane,butcannotmoveintheplane. Springsareattachedbetweenmasses,butnotnecessarilybetweenallmasses.Assume displacementsofthemassesaresmall.Thisisasystemofundampedharmonic oscillators[ ? ].Theequationdescribingthisis M z + Kz =0where M isadiagonal massmatrixwithnonnegativeelements, K astinessmatrixand z avectorwhose 15
PAGE 28
z i entrydescribesthedisplacementofthe i thpointmass. Assumingastandingwavesolution z = Ze i!t ,theequationbecomesthegeneralizedeigenvalueproblem )]TJ/F19 11.9552 Tf 9.299 0 Td [(! 2 MZ + KZ =0.Herethevaluesofanyeigenvector Z representthemagnitudeandsignofvibrationofeachnodeforthecorrespondingfundamentalfrequency .AgainconsidertheFiedlervectorwhichdenesthe2ndvibrationalmode.Nodeswiththesamesignareconnectedandmovesynchronously.This considerationofvibrationnodesimposesanaturalbipartitiononthemass/spring system. Supposewehavemassesofallunitysothat M = I .Takethepointmassestobe verticesinagraphandthespringconstants w ij tobeaweightassignedtotheedge connectingverticesiandjwhere i 6 = j .Thestinessmatrix K isnowthenegative ofthegraphLaplacian.Toseethisnotethatanelementofthevector KZ isthenet forceonnode i fordisplacementvectorZ,whichforpositive z isnegativeinsign. Thisgives KZ i = )]TJ/F25 11.9552 Tf 11.291 11.357 Td [(X j w ij z i )]TJ/F19 11.9552 Tf 11.955 0 Td [(z j = )]TJ/F25 11.9552 Tf 11.291 20.443 Td [(" X j w ij # z i + X j w ij z j where K ij = 8 > < > : )]TJ/F25 11.9552 Tf 11.291 8.967 Td [(P j w ij if i = j + w ij if i 6 = j ThuswearriveattheproblemexpressedintermsofagraphLaplacian LZ = 2 Z .Spectralclusteringproducesabipartitioningcorrespondingtothesecondmodeof vibration.Thisclusteringsatisesourintuitivenotionofaclusteringaseachcluster isconnected,vibratesinphasesimilarityandvibrates degreesoutofphasewith theotherclusterdissimilarity. Thestatementthattheclustersareconnectedisintuitivelytrueforthemembrane withaninnityofpointsandcanbeprovenseeSection ?? butisnotasapparent forthenitemass/springsystem.Wehavealsoassumedthattherearenoboundary z i =0nodes.WeusesometheoremsbyFiedlertoresolvetheseissues. 16
PAGE 29
2.7FiedlerTheorems The1975paperbyMiroslavFiedler[ ? ]hastwotheoremsontheFiedlervector whichhereferredtoasthecharacteristicvaluationofagraph. WhiletheworkofFiedlerismentionedbysomepapers[ ? ],[ ? ],[ ? ]itisnot centraltothecombinatorialjusticationforspectralclustering.Wewillseehowever thatitclariescertainaspectsandisimportantforunderstandingpracticalnumerical resultsofspectralclustering. WerepeatFiedler'stheoremsherewithsomemodicationofterminology. Theorem2.5 Fiedler[ ? ]Theorem3.3andRemark3.4Let G = V;E bea niteconnectedgraphwithpositiveedgeweights.Let L = D )]TJ/F19 11.9552 Tf 11.176 0 Td [(A betheunnormalized Laplacianof G asdenedinSection ?? and u theFiedlervectorof L .Then 8 r 0 thesubgraphsinducedby A = f i 2 V : u i )]TJ/F19 11.9552 Tf 23.635 0 Td [(r g and B = f i 2 V : u i r g are connected. Inthistheoremthesubgraphinducedby A isthegraphwithvertices A andedges E A E where e 2 E A ifandonlyifbothedgesof e arein A Corollary2.6 If r =0 thenboth A and B areconnected. Theorem2.7 Fiedler[ ? ]Theorem3.8Givenanyedgecutofanite,connected graph G thatcreatestwoconnectedcomponentsthereexistsaweightingpositivevaluationoftheedgesof G suchthattheFiedlervectordenesthesametwocomponents. ThispartitionisderivedviapositiveandnegativevaluesoftheFiedlervectorandthere arenozerovaluesintheFiedlervector. InFigure ?? atree,Figure ?? alattice,andFigure ?? awheelwedisplay Fiedlervectorvaluesforthreegraphswithconstantedgeweights.Thevaluesforthe Fiedlervectorarelistedagainstthecorrespondingvertex.TheseexamplesdemonstratehowTheorem ?? wouldpartitionthegraph.Latticeswillbeusedforimage 17
PAGE 30
Figure2.3: Fiedlervectorforatree bipartitioning.Thewheelinparticularshowshowusing 0and < 0aspartitioning criteriacanleadtodisconnectedclusters. Therstthingtonoteisthatthereisnogoodreasontopreferentiallyplacenodes with u i =0inclusterAorB.Butthereisaverygoodreasontoplacetheseboundary nodesinbothclustersAandB.WeseefromCorollary ?? thatbothclustersAand Bwillnowbeconnected.Theclustersnowarenotdisjointandthusdonotsatisfy theusualobjectiveofdiscretepartitioningproblems. Theorem ?? demonstratestheimportanceoftheweightsweplaceonedges. Choosinganybipartitiongivingconnectedclustersthereisaweightingoftheedges whichwillproduceit.Sowemustbeverycarefultochooseweightswhichimpart somemeaningtothesimilarityofvertices. WepresentamodelforedgeweightsonimagesinSection ?? 18
PAGE 31
Figure2.4: Fiedlervectorforalattice Figure2.5: Fiedlervectorforawheel 19
PAGE 32
2.8AnextensionofFiedler'stheorem ThetheoremsofFiedlerasappliedtographsareprovenforunnormalizedgraph Laplacians.WeproposeandproveamoregeneralversionofTheorem ?? ,suchthatit appliestothenormalizedgraphLaplaciansandthemoregeneraleigenvalueproblem Lx = Mx where M isanydiagonalmatrixwithpositivediagonalentries.This problemcorrespondstoamass/springsystemwherethemassmatrixisnotnecessarily theidentitymatrix.Inparticular,wewanttoshowthataneigenvectorcorresponding tothesecondsmallesteigenvalueofthisgeneralizedeigenvalueproblemcanbeused topartitiontheassociatedgraphfortheproblemintoconnectedcomponents. TothisendwestatealemmaandatheoremfromFiedler's1975paper[ ? ]that wewillneedinourproof.Wealsoneedtheconceptofanirreduciblematrix[ ? ]. Denition2.8 Asquarematrixisreducibleifitisintheform A = BC 0 D orcanbeplacedinthisformviaapermutationmatrix P ,asfollows, P T AP Amatrixisirreducibleifitisnotreducible. 20
PAGE 33
Lemma2.9 Agraph G isconnectedifandonlyifitsadjacencymatrix A isirreducible. Proof: Supposeagraphisnotconnected.Thenbypermutationofitsadjacency matrixitcanbetransformedtoamatrixwheretherowsofanyconnectedcomponent arecontiguous. A isnowinablockformsimilartothatinDenition ?? .Forarow i andacolumn j where j isnotinthesamecomponentas i ,wemusthave A ij =0, sincethereisnoedgebetweenvertices i and j .Sotheodiagonalblockmustbeall zerosand A isreducible.So, A irreducibleimplies G isconnected. If G isconnectedand A isreducible,thenwecanpermute A intotheformin Denition ?? .Since A issymmetric,submatrix C mustbeallzeros.Butthenthe rowsofmatrices B and D denetwodisconnectedcomponentsof G ,acontradiction. Note,thatitistheodiagonalzeroelementsofamatrixthatdeterminewhether itisirreducible.So,fromLemma ?? ,itfollowsthattheunnormalizedLaplacian L = D )]TJ/F19 11.9552 Tf 11.955 0 Td [(A isirreducibleifthegraphisconnected. Theorem2.10 Fiedler[ ? ]Theorem2.3Let A bean n n nonnegativeirreducible symmetricmatrixwitheigenvalues 1 2 ::: n .Let u = u i > 0 bean eigenvectorcorrespondingto 1 and v = v i aneigenvectorcorrespondingto 2 Thenforany 0 ,thesubmatrix A V isirreduciblewhere V = f i : v i + u i 0 g and A V isthematrixconsistingofrows i 2 V of A andcolumns j 2 V of A Note: u = u i > 0existsbythePerronFrobeniusTheorem[ ? ][ ? ]since A is nonnegativeandirreducible,thereexistsapositiveeigenvector u for 1 21
PAGE 34
WenowgeneralizeFiedler'sTheorem??asfollows. Theorem2.11 Let G = V;E beaniteconnectedgraphwithverticesVnumbered 1 ;:::;n andedges e ij 2 E ,betweenvertices i and j ,assignedapositivenumber.Let L 2 R n n betheunnormalizedLaplacianofG.Let M 2 R n n beadiagonalmatrix where m ii > 0 .Let f betheFiedlervectorofthegeneralizedeigenvalueproblem Lx = Mx .For r 0 ,let V r = f i : f i )]TJ/F19 11.9552 Tf 22.368 0 Td [(r g and V )]TJ/F20 7.9701 Tf 6.587 0 Td [(r = f i : f i r g .Then G r the subgraphinducedon G by V r isconnectedand G )]TJ/F20 7.9701 Tf 6.586 0 Td [(r thesubgraphinducedon G by V )]TJ/F20 7.9701 Tf 6.586 0 Td [(r isconnected. Proof: WenotethatthisprooffollowscloselythatofFiedler[ ? ]forthe unnormalizedLaplacian.Itproceedsbytransformationofmatrix L toonesatisfying thepropertiesofTheorem ?? andthenusingitseigenvectorstoformconnected subgraphsof G Since G isconnecteditsadjacencymatrixisirreduciblebyLemma ?? ,andsois L = D )]TJ/F19 11.9552 Tf 12.196 0 Td [(A: Let B = )]TJ/F19 11.9552 Tf 9.298 0 Td [(M )]TJ/F18 5.9776 Tf 7.782 3.259 Td [(1 2 LM )]TJ/F18 5.9776 Tf 7.782 3.259 Td [(1 2 B isnonnegativeontheodiagonalelements, since B ij = )]TJ/F17 7.9701 Tf 18.564 4.707 Td [(1 p m ii L ij 1 p m ii andistheproductoftermswithsignof )]TJ/F15 11.9552 Tf 11.619 0 Td [(+ )]TJ/F15 11.9552 Tf 9.298 0 Td [(+.Herewe haveused L ij 0forodiagonalelementsseeLemma ?? B isirreduciblesince L isirreducibleandanodiagonalelementof B iszeroifandonlyifitiszero in L For > max i j B ii j itholdsthat B + I isnonnegative,symmetric,andhas positivediagonals. B + I hasthesameodiagonalelementsas B ,andsince B is irreducible B + I isalso.ThereforeitsatisestheconditionsofTheorem ?? 22
PAGE 35
Furthermore, ;x isaneigenpairof Lx = Mx ^ = )]TJ/F19 11.9552 Tf 9.298 0 Td [( + ;M 1 2 x isan eigenpairof B + I x = ^ x .Toshowthislet Lx = Mx ,and y = M 1 2 x then Lx = Mx )]TJ/F19 11.9552 Tf 9.299 0 Td [(M )]TJ/F18 5.9776 Tf 7.782 3.258 Td [(1 2 LM )]TJ/F18 5.9776 Tf 7.782 3.258 Td [(1 2 M 1 2 x = )]TJ/F19 11.9552 Tf 9.299 0 Td [(M )]TJ/F18 5.9776 Tf 7.782 3.258 Td [(1 2 Mx BM 1 2 x = )]TJ/F19 11.9552 Tf 9.299 0 Td [(M 1 2 x B + I y = )]TJ/F19 11.9552 Tf 9.299 0 Td [( + y B + I y = ^ y: Since,allstepsarereversibletheifandonlyif"relationfollows. Now,zeroisaneigenvalueof Lx = Mx witheigenvector I = ; 1 ;:::; 1 T Since L ispositivesemideniteand M ispositivedenitethenzeroisthesmallest eigenvalueof Lx = Mx and ^ = )]TJ/F19 11.9552 Tf 9.298 0 Td [( + = isthelargesteigenvalueofof B + I with u = M 1 2 I thecorrespondingeigenvector.Let v betheeigenvectorofthesecond largesteigenvalueof B + I .Thiswouldcorrespondtotheeigenvectorofthesecond smallesteigenvalueof Lx = Mx ;i.e.theFiedlervector f where v = M 1 2 f Let r 0and V r = f i : v i + ru i 0 g .ByTheorem ?? ,thesubmatrix B + I V r isirreducible.Bythedenitionof L and B ,theodiagonalelementsof B + I are zeroifandonlyiftheodiagonalelementsof L arezero.So, L V r isirreducible,as is A V r where A istheadjacencymatrixofgraph G .Therefore,byLemma ?? ,the subgraphinducedon G by V r isconnected. Now v i = M 1 2 f i = p m ii f i and u i = M 1 2 I i = p m ii .So, v i + ru i = p m ii f i + r p m ii whichimplies V r = f i : f i + r 0 g = f i : f i )]TJ/F19 11.9552 Tf 23.004 0 Td [(r g .Now,if v isaFiedler vectorthen )]TJ/F19 11.9552 Tf 9.298 0 Td [(v isalso.So, V )]TJ/F20 7.9701 Tf 6.587 0 Td [(r = f i : )]TJ/F19 11.9552 Tf 9.299 0 Td [(f i )]TJ/F19 11.9552 Tf 24.179 0 Td [(r g = f i : f i r g mustalsobe connected. Theorem ?? coversthegeneralizedeigenvalueproblem Lx = Mx with M = D providedthegraphdening L isconnected.Thisisthetherandomnormalized Laplacianproblem L rw x = D )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 Lx = x 23
PAGE 36
ForthesymmetricnormalizedgraphLaplacian, L sym = D )]TJ/F18 5.9776 Tf 7.783 3.259 Td [(1 2 LD )]TJ/F18 5.9776 Tf 7.782 3.259 Td [(1 2 Theproblem L sym x = x ,reducestotheproblem Ly = Dy with x = D 1 2 y So,withoutrecoursetocombinatorialmodelswehaveshownthatthenormalized graphLaplaciansandthemoregeneralformsoftheseequationsusingamassmatrix produceconnectedsubgraphsviaspectralclusteringbasedontheFiedlervector, wheretheclustersproducedmightnotbedisjoint. 2.9Eectofmassmatrixonsegmentation HavingextendedoneofFiedler'stheoremsTheorem ?? ,seeSection ?? with respecttothemassmatrixLaplacianproblem Lx = Mx ,wenowexaminetheeect ofthemassmatrixonclustering.Forthisanalysis, M isadiagonalmatrixwith M ii > 0. Numericalexperimentswereperformed.Thesehaveinvolvedavarietyofgraphs, includinglattices,lines,andcircles.Theedgeweightsareequal,sincewewanted toisolatetheeectofthemasses.Massesarevaried,buttheeectofequalmasses exceptatonevertexwasalwaysexamined.TheunnormalizedLaplacianhasbeen usedinthisanalysis. Wepresentthefollowingobservationsofnumericalexperiments,whichapplyto bipartitionsproducedviatheFiedlervector. 1.ClustersdenedbydivisionoftheFiedlervectorvia 0and 0willbe connected.ThisisinsuredbyourextensionofFiedler'sTheoremTheorem ?? 2.Thetwoverticesofhighestmasstendtobeinseparateclusters;seeFigure ?? fortwoexamplesofthis. 3.Forthevertexofhighestmass,ifthemassislargeenough,itwillbeinacluster byitselfprovidedobservation1isnotviolated;seeFigure ?? andFigure ?? 4.Thevertexofsmallestmassisneverisolatedinaclusterbyitself;seeFigure ?? 24
PAGE 37
Observation3canbeunderstoodbutnotprovenbythefollowing.Reformulate theproblem Lx = Mx as M )]TJ/F18 5.9776 Tf 7.782 3.259 Td [(1 2 LM )]TJ/F18 5.9776 Tf 7.782 3.259 Td [(1 2 y = By = y .Let m i bethemassofvertex i .Thematrix B haselements B ij = 1 p m i A ij 1 p m j .Ifwetake B rowbyrowwesee thatweareadjustingtheweightoftheedgesofrowvertexiby 1 p m i p m j .Suppose m i isthelargestweightand m i >>m j forevery i 6 = j .Iftheoriginaledgeweights arelargerelativeto 1 p m i ,thenwenowhavethesmallestweightsaroundvertex i and wouldexpectittobeinaclusterbyitself. Observation4followsforsimilarreasonswith m k <
PAGE 38
Figure2.7: Observation3:Largestmassinaclusterbyitself Figure2.8: Observation3:Whenlargestmasscan'tbeinaclusterbyitself Figure2.9: Observation4:Smallestmassneverinaclusterbyitself 26
PAGE 39
2.10ImageSegmentation OurproblemistobipartitionanimageaccordingtotheideasdiscussedinSection ?? .Arealworldimagehasaninniteforallpracticalpurposesnumberof pixels.Whenwetakeadigitalpicturewereducetherealworldimagetoanite numberofpixels.Inasimplecase,eachpixelhasred,green,blueRGBvaluesfrom 0to255.Weconsiderthediscreteimageasagraphwithverticescorrespondingto thepixelswhicharelabeledasconsecutivepositiveintegersfromthelowerleftedge oftheimagetotheupperleftedgecolumnbycolumnasshowninFigure ?? foran imagewith m rowsand l columns. Figure2.10: Labelingofpixelsinanimage m 2 m ::: ml . . . . 2 m +2 ::: . 1 m +1 ::: l )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 m +1 Edgesaredenedbasedona5pointstencil.A5pointstencilconnectseachpixel tothe4closestpixels,exceptforedgeorcornerpixels,wheretheyconnectto3or2 pointsrespectively.Theresultisa m l lattice.Ifwehada3dimensionalimagewe woulddenesimilarlyapixelnumberingwithedgesdenedona7pointstencil. NotethatthegraphLaplacianofalatticehasastructureimposedbythe5point stencilandthenumberingchoiceforthevertices.Theonlynonzeroelementsofthe Laplacianlieonthediagonal,adjacenttothediagonal,and m columnsfromthe diagonal. 27
PAGE 40
2.11EdgeWeights ThenalcomponentweneedtoconstructtheLaplacianisaformulaforthe edgeweights.Therearemanywaystoassignweights.Themethodwechooseis implementedinLeoGrady'ssoftware[ ? ].ItisbasedonacombinationofRGB valuesandgeometricdistance. Saypixel1andpixel2areconnectedbyanedgeandtheyhaveRGBvalues f R 1 ;G 1 ;B 1 g and f R 2 ;G 2 ;B 2 g .Let R = R 1 )]TJ/F19 11.9552 Tf 12.637 0 Td [(R 2, G = G 1 )]TJ/F19 11.9552 Tf 12.637 0 Td [(G 2,and B = B 1 )]TJ/F19 11.9552 Tf 11.955 0 Td [(B 2. Let valDist = p R 2 + G 2 + B 2 .Thennormalizethevalueswithrespectto therange[0 ; 1].Set geomDist =1,whichisthesameforalledges. valDist isa measureofdissimilaritybetweenpixelvalues. geomDist isameasureofthephysical distancebetweenpixels. Thedistancesareameasureofdissimilaritybetweenpixels.FortheLaplacianwe needmeasuresofsimilarity.Thisisachievedbyusingthefunction e )]TJ/F20 7.9701 Tf 6.587 0 Td [(dist k a where a isaconstantand k apositiveinteger.Thesecontroltheratethatsimilaritydeclines withdistance.Forimages,werestrictouranalysistoasimplermodelwhere k =1 and a =1. Denition2.12 Wedenetheweightsassignedtoedgesas edgeWeight = e )]TJ/F20 7.9701 Tf 6.587 0 Td [(dist + epsilon where dist = geomScale geomDist + valScale valDist epsilon geomScale and valScale arepositiveconstantsinthecomputationof edgeWeight whichareused totunethemodel. Wehaveintroduced3parameters: geomScale and valScale adjusttherelative importanceofgeometricvsRGBvalues.Geometricdistancesarealwaysunity,since thepixelsarenodesofauniformlyspacedlattice,butvaluedistance valDist can varyfrom0to1andvaluesaretypicallylessthan0 : 05. geomScale istypicallychosen 28
PAGE 41
tobe1and valScale 25orlarger. epsilon istheminimumallowableweight.Atypical valuefor epsilon is10 )]TJ/F17 7.9701 Tf 6.587 0 Td [(5 .WediscusstheimpactoftheseparametersinSection ?? 2.12TuningtheEdgeWeightParameters SomenumericalexperimentsusingMatlabwiththeimageinFigure ?? were performedtoillustratetheeectoftheparametersonweightassignmentasdescribed inSection ?? .Theimageisofacatsearwithdarkfuragainstalightgraybackground withsomewhitewhiskersandfur.Thishasanobviousclusteringconsistingofthe earandthebackground. Theprimaryimageis200 320pixelswithRGBvaluesfrom0255.Agraph Laplacianisconstructedasdenedintheprevioussections. Figure2.11: Baseimageforanalysis Fortheinitialanalysistheimageisscaledbysamplingofpixelsintoimageswith smallerresolution.Weightsareassignedwith valScale =25, geomScale =1,and epsilon =10 )]TJ/F17 7.9701 Tf 6.587 0 Td [(5 .TheMatlab eigs sparseeigensolverhasbeenused. 29
PAGE 42
TheresultsareshowninFigure ?? ,wherealightlinehasbeenaddedtothe imagetoshowtheboundaryseparatingtheclustersgenerated.Thegraphinthe lowerrighthandcornershowssolutiontimeinsecondsyaxisversusimagesizein pixelsxaxis. Figure2.12: Initialanalysis Theresultsaregoodwiththeexceptionofthe200 320imagewheretheboundary deviatesupwardstothetopedgeoftheimage.Thiscanhappenwhenthe valDist istoosmallwithrespecttothe geomDist Toseetheeectofgeometry,considerthelatticeinFigure ?? .Inthisgraph theedgeweightsarealloneandthelatticecouldbeconsideredasanimagewhere RGBvaluesforallpixelsarethesame.Notehowtheboundary f i : y i =0 g bisects theimagefromtoptobottomthesmallerdimensionatthemidpointofthelarger dimension.ThisillustrateshowthegeometryoftheimagecanaecttheFiedler vector,andwemightsuspectthatthedistancetermisoverpoweringthevalueterm inthe200 320image. 30
PAGE 43
Adjustingthe valScale to50andrepeatingtheexperimentsolvestheproblem, seeFigure ?? Figure2.13: EectofincreasingValScale Whydonotweseethisproblematthereducedimagescales?Asthescaleis reducedbysampling,thecomputedmean valDist isincreasedfrom : 018forthe 200 320imageto : 05forthe25 40image.Asaresultthe valScale valDist term increaseswithrespecttothe geomScale geomDist termwhichisunchanged.This hasthesameeectasadjusting valScale Similarlyvarying geomScale withrespectto valScale cancausedegradationof thebipartitionasshowninFigure ?? forthe40 64image. Sowhatdoes epsilon ,in edgeWeight = e )]TJ/F20 7.9701 Tf 6.587 0 Td [(dist + epsilon ,do?Verysmalledge weightscanoccurwhenwehaveasmallnumberofconnectedpixelswithRGBvalues verydierentfromthosesurroundingthem.Theseislands"orspecks"withinthe imagecandramaticallyaecttheclustering.Toseethisconsiderthelatticeingure ?? butwiththeedgesofasinglevertexmadesmallerthanonevertexnumber6 usingthenumberingpreviouslydened.TheresultisshowninFigure ?? .Vertex6 31
PAGE 44
Figure2.14: Eectof geomScale isnowinaclusterbyitself. Figure2.15: Eectofislands Wecanseethisinourearimageifweset epsilon =10 )]TJ/F17 7.9701 Tf 6.586 0 Td [(8 .Theresultshownin Figure ?? showstheeectofonesuchspeck"inthe100 160,50 80,and40 64 32
PAGE 45
scaledimages.Inthe40x64imagewehavemagniedtheimagetoshowmoreclearly howasinglepixelhasbeenisolated. Epsilon smoothsoutlargedissimilaritiesin RGBvalues. Figure2.16: Eectofepsilon 2.13EigenvalueSolvers TheMatlab eigs functionhasbeenusedintheform eigs L;n;sigma .If sigma isnotspecieditdefaultstozero,i.e.itsolvesforthesmallest n eigenvalues.For thepositivesemidenitematriceswearedealingwith,whensolvingforthesmallest eigenvalues,thiscanleadtoinstabilityinthe eigs functionwhichresultsin NAN errors.Toovercomethisasmall sigma isused.ThistechniqueisalsoutilizedinLeo Gradyssoftware[ ? ].Ifthe sigma valueistoolarge eigs maynotreturnagood Fiedlervector.Also,ifwearesolvingtoatolerancethatistoolarge,theFiedler vectormaybeoflowquality. Inthesecasesdisconnectedclusterscanbeproduced,seeFigure ?? wherea sigma of10 )]TJ/F17 7.9701 Tf 6.586 0 Td [(5 hasbeenused.Notethatforthe50 80,40 64,and29 46scaledimages 33
PAGE 46
theclusterboundarygivesdisconnectedclusters.WeknowfromCorollary ?? that thisisnotpossible,sosomethingmustbewrongwiththesolutionfortheFiedler vector. Figure2.17: PoorqualityFiedlervectorproducedbytoosmallashift The eigs functionisusableinourexperimentssincethesizeoftheimageissmall. LargerimageswillproduceLaplacianswithdimensionsinexcessof10 7 .Thissize problemrequirestheuseoflargesparseeigensolverssuchasBLOPEX[ ? ],[ ? ]. WhenusingBLOPEXapositivedenitematrixisneededandforimageanalysisa smallshift L + I isused.BLOPEXimplementstheiterativesolverLOBPCG[ ? ]andmakesuseofpreconditioners[ ? ].But,whatevermethodisused,ifeither f i 2 V : f i 0 g or f i 2 V : f i 0 g isnotconnected,thentheFiedlervector producedissuspect,becauseweknowbyTheorem ?? thattheymustbeconnected. Imagesegmentationcanpracticallyexploittheseeigensolvers.TheLaplacianis symmetricpositivesemidenitewithsmallesteigenvalueofzero.Itcanbeshiftedto produceapositivedenitematrixwithoutchangingtheeigenvectors,sinceif Lx = x then L + I x = + x .Weknowwhattheeigenvectorisforeigenvaluezero,and 34
PAGE 47
thiscanbeusedasanorthogonalconstraintforsolutionofthenexteigenpair.Both, ashiftandorthogonalconstraintsareapartoftheLOBPCGsoftware.Detailsofthis areinthechapteronBLOPEXinthisthesis.Theseeigensolversaredesignedtosolve foronlyafewofthesmallesteigenvalues,andweonlyneedthesecondeigenvalue andassociatedeigenvector,sospectralclusteringcanbedoneeciently. 2.14NodalDomainTheorems Fiedler'stheorem ?? dealswiththebipartitioningofagraphbasedonthe2nd eigenvectorofthegraphLaplacianFiedlerVector.Wehaveprovidedanextension ofthistoageneralizedeigenvalueproblem.AtheoremsimilartoFiedler'sexistsfor highereigenvectors.Wesummarizethefollowingworkdoneonnodaldomainsand itsdiscretecounterparts. InCourantandHilbert'sbookMethodsofMathematicalPhysics,Volume1",Ch. 6,Sec.6[ ? ]theypresentthefollowingtheorem.Thisdatesfrom1924,andis frequentlyreferredtoinpapersonnodaldomains.ItisoftenreferredtoasCNLT CourantsNodalLinetheorem. Theorem2.13 CNLT[ ? ]pg452Giventheselfadjointsecondorderdierential equation L [ u ]+ u =0 > 0 foradomainGwitharbitraryhomogeneousboundaryconditions;ifitseigenfunctions areorderedaccordingtoincreasingeigenvalues,thenthenodesofthentheigenfunction u n dividethedomainintonomorethan n subdomains. Bynodesismeanttheset f x 2 G : u n x =0 g .Dependingonthedimensionof G thismightbeanodalpoint,curve,surface,etc.Thesenodesalsoreferredtoas nodallinesdividethedomainGintoconnectedsubdomainscallednodaldomains. If G hasdimension m thenthenodesarehypersurfacesofdimension m )]TJ/F15 11.9552 Tf 12.263 0 Td [(1.It canbeshownthatthenodesareeitherclosedorbeginandendontheboundary[ ? ],andareofLebesquemeasurezeroin G [ ? ]. 35
PAGE 48
TheproofofCNLTprovidedbyCourantandHilbertisrathercryptic;beingmore ofadiscussionthanaformalproofandisforthecaseof G R 2 .Amoreaccessible proof,foramoregeneralcaseof G R m ,isprovidedinapaperbyGladwelland Zhu[ ? ]. Whilenotgivingthedetailsoftheproofweexamineitsbasicfeaturesandsee howitactsasatemplateforadiscreteCNLT.Ultimately,wederiveanewversionof thediscreteCNLT,thatincludesFiedler'stheoremasaspecialcase.Fiedler'sproof dependsontheoremsofpositive,irreduciblematrices.Itisinstructivetoseehowan entirelydierentapproachthanthatusedbyFiedleryieldsthesameresult. TheCNLTproofdependson: 1.thenatureoftheeigenfunctionsandeigenvaluesoftheproblem,i.e.theeigenvaluescanbeorderedandtheeigenfunctionsformacompleteorthonormalset, 2.thevariationalformoftheproblem, 3.theapplicationoftheminmaxcharacterizationofeigenvaluesCourantFischer Theorem,and 4.theuniquecontinuationtheorem. Theeigenproblemhasinnitelymanypositiveeigenvalues0 1 2 ::: whosecorrespondingeigenfunctionsformacompleteorthonormalset,seeGriel[ ? ] Thm9.16. 1 =0forthefreemembraneproblemand 1 > 0forthexedmembrane problem. ThoseofusprimarilyversedinmatrixanalysisarefamiliarwiththeCourantFischertheoremfromHornandJohnson[ ? ]Thm4.2.11.Theequivalentona domain D thatisasubsetofaHilbertspaceisgiveninGriel[ ? ]Thm10.18.b andisrepeatedhere. 36
PAGE 49
Theorem2.14 Griel[ ? ]MaximumPrinciplepg287IfAisapositivedenite symmetricdierentialoperator,onaHilbertspace H ,withdomain D ,acompact inverse,andwitheigenvalues 1 ; 2 ;::: ,then n +1 =max k 1 ;:::;k n 2 D f min f R u : u ? span f k 1 ;:::;k n ggg where R u = istheRayleighQuotient. Forexample:if A = )]TJ/F17 7.9701 Tf 16.222 4.708 Td [(1 x 4 then R u = R D 4 u u R D uu Gladwell'sproofexaminessolutionsofthevariationalformoftheproblemonthe space H 1 0 D H 1 0 D isthespaceof L 2 D functions,whosederivativesarein L 2 D andthefunctionsvanishontheboundaryof D [ ? ][ ? ].Thedomain D isassumed tobeboundedandconnected. Theuniquecontinuationtheorem[ ? ]statesthatifanysolution u 2 H 1 0 D of theHelmholtzequationvanishesonanonemptyopensubsetof D then u 0in D ThecompleteproofoftheCNLTisinthreeparts.FirstatheorembyCourant andHilbertthatthereareatmost n + r )]TJ/F15 11.9552 Tf 11.39 0 Td [(1signdomainsfortheeigenfunction u n of n ,thenatheorembyHermannandPleijelthatshowsforaconnecteddomainthis number n + r )]TJ/F15 11.9552 Tf 12.251 0 Td [(1canbeimprovedto n ,andnallytheextensionofthislimitto unconnnecteddomains. ThecontinuousCNLTservesasmotivationforadiscreteversion.Theprooffor thisdiscreteversionwasnotcompleteduntil2001inapaperbyDavies,Gladwell, Leydold,andStadler[ ? ].ThediscreteCNLTdealswithsolutionsof Au n = n u n andrequiresthefollowingterminologyanddenitions. Denition2.15 Let A 2 M m bearealsymmetricmatrixwithnonpositiveodiagonalelements.Thegraph G = V;E associatedwith A isthegraphwithvertices V numbered 1 ;:::;m andedges E whereanedgeexistsif a ij 6 =0 Norestrictionisplacedonthediagonalelementsof A .Examplesofsuchmatrices wouldbethegraphLaplacian L L + D where D isadiagonalmatrixandthe 37
PAGE 50
negativeoftheadjacencymatrixofagraph.Solutionsof Au n = n u n areoftheform 1 2 ::: m Thenodalsetofaneigenvector u n doesnotlenditselftoanexactcorrespondence withthenodalsetandnodaldomainsinthecontinuouscase.Instead,strongand weaksigngraphsaredenedasfollows. Denition2.16 Davies[ ? ]Let u beanyeigenvectorof A .Astrongpositivenegativesigngraphisamaximal,connectedsubgraphof G ,withvertices i 2 V and u i > 0 u i < 0 Denition2.17 Davies[ ? ]Let u beanyeigenvectorof A .Aweakpositivenegativesigngraphisamaximal,connectedsubgraphof G ,withvertices i 2 V and u i 0 u i 0 Wespeakofavertexashavingasignofpositive,negative,orzero.Wenote twokeycomponentsofthesedenitions.Thesubgraphsareconnectedandtheyare maximalinthesensethatnoverticesofthesamesignasthesubgraphcanbeadded tothesubgraphandhaveitstillbeconnected. AfewexamplesinFigure ?? ,thatwederived,illustratetheseconcepts.The eigenvectorsofthegraphswerecomputedusingtheunnormalizedLaplacianwith constantedgeweights.Onlythesignsofverticesareshown. GiventhesedenitionsthediscreteCNLTisformulatedasfollows. 38
PAGE 51
Figure2.18: Examplesofstrongandweaksigngraphs Theorem2.18 Davies[ ? ]If A isasymmetricmatrixwithnonpositiveodiagonalelementswheretheassociatedgraph G = V;E asdenedin ?? isconnected, thentheigenvector u n of A dividesthegraphintonomorethan n weaksigngraphs. TheproofofthediscreteCNLTproceedsasforthecontinuousCNLTanddepends ontheorderingoftheeigenvalues,theorthonormalityoftheeigenvectors,andthe minmaxcharacterizationoftheeigenvalues.Insteadofusingthevariationalform usedinthecontinuousCNLT,anidentityfromDuvalandReiner[ ? ]isused.Also, wedonothavetheuniquecontinuationtheoreminthediscretecasebutthereisan analogseeDaviesetal.Lemma3[ ? ]. Itshouldbenotedthatmuchofthisproofwasdoneinthepreviouspaperby DuvalandReiner.However,theymadetheclaimthattheCNLTwasvalidforstrong signgraphs.Friedman[ ? ]hadpreviouslygivenexamplesshowingthisisnottrue. Figure ?? showstwoexamplesofthis.Themorecomplicatedoneontheleftbeing onewehaveconstructed.Also,wenotethatthegraph G mustbeconnectedandthe theoremappliestoweakratherthanstrongsigngraphs. 39
PAGE 52
Figure2.19: Strongsigngraphsexceedingeigenvectornumber Afewconclusionscanbedrawfromthetheoryandanexaminationof examples. Thenumberofweaksigngraphsisalways thenumberofstrongsigngraphs. Thenumberofbothweakandstrongsigngraphscanbe
PAGE 53
Figure2.20: Signgraphsdecreasinginnumber positivesemideniteand M ispositivedenite.Anoutlineoftheproofisgivenin thatpaper. Fiedler'sTheorem ?? andourextendedTheorem ?? isinpartaspecialcase ofthediscreteCNLTfor n =2wherethecutiswithrespectto r =0.However, notethatFiedler'stheoremcoverssubgraphsinducedby f i 2 V : u i )]TJ/F19 11.9552 Tf 24.054 0 Td [(r g and f i 2 V : u i r g where r> 0,whichtheCNLTdoesnotaddress. WemodifythedenitionofweaksigngraphsandproduceanewversionofCNLTthatcompletelycoversFiedler'stheoremasaspecialcase. Denition2.19 An r weakpositivenegativesigngraphisamaximal,connected subgraphof G ,with r 0 andvertices i 2 V and u i )]TJ/F19 11.9552 Tf 21.918 0 Td [(r u i r .Foranexample seeFigure ?? Theorem2.20 ModiedDiscreteCNLTIf A isasymmetricmatrixwithnonpositiveodiagonalelementswheretheassociatedgraph G = V;E asdenedin ?? is connected,and r> 0 ,thenthentheigenfunction u n of A dividesthegraphintono morethan nr weaksigngraphs. 41
PAGE 54
Figure2.21: Anexampleof r weaksigngraphs Wecanonlypresentaproofofthistheoremwhichisdependentonaconjecture aboutthenatureof r weaksigngraphs.Wedivide r weaksigngraphsintotwotypes. Denition2.21 Let r 0 .Atype1 rwsg isa r weaksigngraph R wherethere existsaweaksigngraph W suchthat W R .Atype2 rwsg isa r weaksigngraph thatisnotatype1 rwsg Consideratype2 rwsg .Weaksigngraphshavevertices i 2 V whereeither u i 0or u i 0andaremaximallyconnected.Ifa r weaksigngraphhasboth negativeandpositivecomponents,ornegativeandzero,or3positiveandzero, theremustbeaweaksigngraphthatisasubsetofthe rwsg andthereforeitisa type1.Thisiseasilyseenbyconsideringthecases.Ifthe rwsg satisescasethen ifitisgeneratedby r thereisaweaksigngraphgeneratedby 0containedinit. Ifitisgeneratedby )]TJ/F19 11.9552 Tf 22.189 0 Td [(r thereisaweaksigngraphgeneratedby 0containedin it.Similarargumentsapplytocaseand. Thissaysthatthetype2 rwsg musthaveeitherallpositiveorallnegative elements.Letusassumethattheyareallpositive.Now,ifthe rwsg isgeneratedvia )]TJ/F19 11.9552 Tf 23.022 0 Td [(r thenithastocontaintheweaksigngraphgeneratedby 0andcontaining thesepoints.So,itisatype1 rwsg 42
PAGE 55
Weareleftwithonlyonepossibility,atype2 rwsg withpositiveelements isgeneratedby r andhasnozeroelements.Since,ourgraphbyassumptionis connected,theremustbea u i >r ineverypathconnectingthe rwsg totherestof thegraph.Atype2 rwsg isakindoflocalminimuminagraph.Forthecaseof positiveelements,thisisakindofdivotthatdoesnotcrosszero. Sucha r weaksigngraphcanbeconstructedbyconsideringarbitraryvectors. Consideralinewithfourpointsandvaluesassignedtotheverticesof u 1 = )]TJ/F15 11.9552 Tf 9.299 0 Td [(1 ;u 2 = 2 ;u 3 =1 ;u 4 =2.For r =1 : 5thishas2weaksigngraphs, f 1 g and f 2 ; 3 ; 4 g ;butthree r weaksigngraphs, f 1 g ; f 1 ; 2 ; 3 ; 4 g ,and f 3 g So,isthistypeofbehaviorpossiblewithaneigenvectorofmatrix A ?Since,this canbethoughtofasavibrationwithinamajormodeofvibration,thatdoesnotseem reasonable.Numerousnumericalexperimentssupportthis.Ipresentthefollowing conjecture,whichIemphasizeisnotproven. Thisisusedinaproofof ?? ,whichIhaveidentiedasincomplete,becauseof this. Conjecture2.22 If G isaconnectedgraph, A asymmetricmatrixwithnonpositive odiagonalelementsand r> 0 ,thenthe n theigenfunction u n of A cannotgenerate anytype2 rwsgs However,thisiseasilyprovenforthecaseoftheunnormalizedgraphLaplacian L = D )]TJ/F19 11.9552 Tf 11.955 0 Td [(A Lemma2.23 FortheunnormalizedgraphLaplacian L = D )]TJ/F19 11.9552 Tf 9.737 0 Td [(A ofaconnectedgraph, its n theigenfunctioncannotgenerateanytype2 rwsgs Proof: Let L = D )]TJ/F19 11.9552 Tf 12.463 0 Td [(A betheunnormalizedLaplacianofaconnectedgraph G .Let u bethe n theigenfunctionof L andlet u i denotethecomponentof u correspondingtovertex i .Let r> 0andassume 9 atype2 rwsgS of G .Since 43
PAGE 56
G isconnected, u 1 isaconstantvectorandhasonlyasingletype1signgraph.So, assume n> 1. Withoutlossofgeneralityletusassume u i > 0forverticesin S .Thentheremust besomevertex i 2 S suchthat u i u j 8 j adjacentto i and u i 0, u i )]TJ/F19 11.9552 Tf 11.96 0 Td [(u j < =0, u i )]TJ/F19 11.9552 Tf 11.959 0 Td [(u j < 0forsome j ,and u i > 0.Sothelefthand sideofthisequationisalways < 0.But, L ispositivesemideniteand n> 1,sothe righthandsizeis > 0.Thisisacontradictiontotheassumptionthatthereisatype 2 rwsg Forthegeneralizedproblem K )]TJ/F19 11.9552 Tf 12.486 0 Td [(M u =0mentionedbyDavies,weseethe CNLTappliesto Lu = Du .Since D )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 appliedto L onlychangesthemagnitudesof thelefthandsideofthisequationbutnotthesigns,wecanseefromtheproofof ?? that D )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 L alsocannothaveanytype2 rwsg .Similarly, D )]TJ/F18 5.9776 Tf 7.783 3.258 Td [(1 2 LD )]TJ/F18 5.9776 Tf 7.782 3.258 Td [(1 2 cannothaveany type2 rwsg WeprovidethefollowingprooffortheModiedDiscreteCNLT.Notethatby Lemma ?? thetheoremiscompletefortheunnormalizedLaplacian,butforthemore generalclassofproblemsofmatriceswithnonpositiveodiagonalelementsitis dependentontheconjecture ?? ,whichisnotproven. 44
PAGE 57
Proof: ForModiedDiscreteCNLT.DependsonConjecture.Let r 0 and R bean r weaksigngraphrwsgoftype1.Let j wsg j bethenumberofwsg's and j rwsg j bethenumberofrwsg's.Suppose R 1 and R 2 aretype1rwsg'sand W isawsgsuchthat W R 1 and W R 2 .Since R 1 and R 2 aremaximalwemust have R 1 = R 2 .So j wsg jj type 1 rwsg j .Bytheconjecture ?? ,weconcludethat j wsg jj rwsg j andbythediscreteCNLTthentheigenfunctiondividesthegraph intonomorethan nr weaksigngraphs. Wemakeanobservationthatthedenitionofstrongsigngraphexcludeszero verticesfromthepartitionofthegraphandtheyaremutuallyexclusive.Weaksign graphsincludethezerovertices,ifany,andmaynotbemutuallyexclusive.This meansthatthesetofallstrongorweaksigngraphsmaynotconstituteapartition of V inthesenseasusuallystatedinthecombinatorialgraphpartitioningproblem [ ? ].Thisbeingthecase,partitioningobjectivefunctionsratiocut,Ncutlosetheir meaningwithoutsomeredenition. Fromaspectralclusteringperspectivethisisnotarestrictionandwecanaskthe question,Shouldwebepartitioningbasedonstrongorweaksigngraphs?".Both haveanassociationwithmodesofvibration.Andifweusestrongsigngraphswemay beexcludingverticestheverticeswithvaluezero,whichdoesnotseemdesirable. But,ineithercaseourclusterswillbeconnectedwhichweintuitivelythinkofas desirable. Theuseof r weaksigngraphshasafurtheradvantage.Wehavesaidpreviously thereisnogoodreasontoinclude/excludezerosfrommultipleclusters.But,when dealingwithanumericalsolutiontooureigenvalueproblem,ingeneralwedonot knowanyvertexeigenvectorvaluetoexactprecision.Iftheerroris r where r issmallwewouldmoreproperlydeneclustersbasedon r weaksigngraphs.In thiscasetheoverlapbetweenclustersmaybeenlargedbutourclusterswillstillbe connected. 45
PAGE 58
WeconcludethissectionwithanobservationaboutFiedler'sCorollary2,2[ ? ]. Thisconcernsthedegreeofreducibilityofsubgraphs N = f i : v i 0 g where v isthe ntheigenvectorofanonnegative,irreducible,symmetricmatrix A .Fiedlerproved thedegreeofreducibilitycannotexceedn2.Thedegreeofreducibilityplus1isthe numberofmaximallyconnectedsubgraphsin N .Thesearetheweaksigngraphs generatedby v i 0.Togetalloftheweaksigngraphswealsohavetoconsiderthose generatedby v i 0.Thisgivesatmost2 n )]TJ/F15 11.9552 Tf 11.759 0 Td [(2+1weaksigngraphsforthenth eigenvectorof A .For n =2thisgivesatmost2weaksigngraphs.Weknowthatfor thiscasethereareexactly2.For n> 2wegetanupperbound >n WhilethiscorollaryisfornonnegativematriceswecanseeinhisproofofTheorem ?? howtoconvertthematrix A toonethatisnonpositive,symmetric,andirreducible. So,weseefromthisthatFiedlerdidproveanearlyCNLTbutwithanupperbound whichhasbeenimprovedon. 2.15Summary Inthischapter,wereviewedtheconceptofspectralclusteringandproposedan alternativeexplanationthevibrationalmodelofwhyitisaneectivetechnique forclustering.WeprovideanextensiontooneofFiedler'stheoremsusingsimilar techniquesasthoseusedbyFiedler,whichcomplementsthevibrationalmodel.We examinediscretenodaldomaintheoryand,byexpandingthedenitionofweaksign graphs,producedamodiedCNLTthatincludesasaspecialcasetheCNLTand Fiedlertheorem.Itsuseforimagesegmentationisexplained.Theimportanceof edgeweightsandaparticularmethodfortheirdeterminationisexamined.Finally, somenumericalresultsaregiven. 46
PAGE 59
3.DNAMicroArrays 3.1Introduction Inthepreviouschapterweintroducetheideasinvolvedinspectralsegmentation andapplythesetoimagesegmentation.Inthischapterweextendthesetothe problemofmicroarrayanalysis.Weadvocateforanormalizationrulefortheanalysis ofmicroarraydatawheretheobjectiveisrecognitionofgeneswithsimilarexpression levelsintimedependentexperiments. Microarrayanalysisinvolvessomepracticaldicultiesthatarenotrelevantin imagesegmentation.Theprimaryonebeingthatifwearegoingtohavesparse Laplaciansthenwewillalmostcertainlyendupwithdisconnectedgraphs.Weproposeanewtechniquefordealingwiththis. Wealsoalludedtousingrecursivebipartitioninthepreviouschapter.Wedevelop andimplementatechniquefordoingthisandasapartofthisproposearulefor partitionselection. 3.2WhatisaDNAMicroarray? ADNAmicroarrayisachipcontainingoligonucleotidesequencesusedtodetect theexpressionlevelofgenes.Wenowgiveabriefreviewofthebiologyinherentin thisstatement. GenesareDNAsequencesofnucleotidesA,C,T,andGwhichcodeforproteins. Theprocessofproteinformationinacellinvolves:transcriptionofDNAtomRNA messengerRNAinthecellnucleusandthentranslationofmRNAtoaproteinvia aribosome,seeFigure ?? .WhenDNAsequencesaretranscribedtomRNAthisis referredtoasgeneexpression. Geneexpressionisnotaconstantprocess.Itcandependonthevariousstates acellisin,suchasmetaboliccycles,diseaseresponse,andstagesofembryonicdevelopment.Thelevelofgeneexpressionwillvary.Thiscaneitherbeanincrease ordecreasefrom"normal"levels,referredtoasupregulationanddownregulation. 47
PAGE 60
Figure3.1: Geneexpression Knowingwhichgenesareexpressedandbyhowmuchcanaidintheunderstanding ofcellularprocessesandindiagnosisofdisease. However,measurementoftheconcentrationofproteinsinacellisdicult.One solutionistomeasuremRNAasaproxy.Thisdependsontheassumptionthatmost mRNAcreatedisactuallytranslatedtoaprotein. VariousDNAmicrochiptechnologieshavebeendesignedtoperformthisfunction. Theyhavetheadvantageofbeingabletomeasuretheexpressionlevelsofthousandsof genesatthesametime.ForpurposesofdiscussionweusetheAymetrixGeneChips. AGeneChipmicroarrayisaquartzwaferlessthan1/2inchsquare,onwhichmillions ofoligonucleotidesequenceshavebeenassembledusingaphotolithographicprocess. GeneChipsarespecictothegenomeofaparticularorganismEcoli,Aribidopsis, HomoSapiens,etc.Theoligonucleotidesequencesconsistofnucleotidechainsof length25mers.Thesechainsarechosentocorrespondtoselectedpartsofgenes, andthesegenesideallycovertheentiregenomeoftheorganism.These25mersare complementarytothesensestrandofDNAandcorrespondtomRNAsequences. Some1120ofthese25merstargetaspecicgeneandarereferredtoasperfect matchesPM.Inaddition,a25mercorrespondingtoeachofthePMsisconstructed withamismatchMMinthe13thbasepair. 48
PAGE 61
The25mersarebuiltontheGeneChipinapatternofdotsassmallas11microns insize.Eachdotisreferredtoasaprobeandthesetofprobesforasinglegeneare calledprobesets.Also,eachchipcontainscalibrationprobes.Eachprobecontains hundredsofthe25merswiththesamenucleotidesequence.Theinformationabout wheretheseprobesareonthechipiscontainedinanAymetrixlewithanextension of.chp.Metainformationabouttheprobesetsuchasgenenameiscontainedina .ginle. Omittingmostofthebiochemicaldetails,asampleofmRNAisextractedfrom thecellsofanorganismandisconvertedtoabiotinlabeledstrandofcomplementary RNAcRNA.ThecRNAiscomplementarytothe25mersontheGeneChip.When theGeneChipisexposedtothissampleaprocesscalledhybridizationoccurs.During hybridization,complementarynucleotideslineupandbondtogetherviaweakhydrogenbonds.ThegreatertheconcentrationofaparticularmRNAstrand,thegreater thenumberofbondsformedwithinoneormoreoftheprobesinthecorresponding probeset,seeFigure ?? foradepictionofaGeneChip. Thenumberofthosehybridizedprobeshavetobecounted.Auorescentstain isappliedtotheGeneChipthatbondstothebiotinandtheGeneChipisprocessed throughamachinethatpaintseachpixelofthechipwithalaserapixelisthe minimumresolutionofthelaserandisasmallpartofaprobecausingthatpixelto uoress,themagnitudeofwhichismeasured.Theresultsarestoredina.datle containingtherawinformationabouttheGeneChipexperiment.Thepixellocations andintensitiesaremappedandnormalizedintoprobelocationsandtheseresultsare storedina.celle. 49
PAGE 62
Figure3.2: AnAymetrixGeneChip Ameasurementbyasinglemicroarrayisjustasamplesizeofone.Itismultiple samplesizesthattellusbiologicallymeaningfulinformation.Ingeneral,microarray experimentsareoftwokinds.Investigatingaprocessovertime.Forexample: Wewanttomeasurethegeneresponsetoadrugorametabolicprocess.Looking fordierencesbetweenstates.Forexample:NormalcellsversusCancercells,which wouldhaveutilityindiseaseidentication. 50
PAGE 63
Inthiswork,ouranalysisisconcernedwithanexperimentofthersttype.This involvesmultiplechipsmeasuringresponsesinoneormoreorganisms.Takenover timeorduringidentiablemetabolicstages. 3.3HowisMicroarraydataanalyzed? Therstpartisreferredtoaslowlevelanalysisandinvolvesathreestepprocess, seeBolstad[ ? ]foramoredetaileddiscussion. 1.Theadjustmentofprobelevelinformationineacharrayforbackgroundnoise. ThisisduetovariationsinthecRNApreparationprocess. 2.Thenormalizationofdataacrossmultiplearrays.Arrayscannotbecreated perfectlyidentical.Datadistributionsandcalibrationprobesareinvolvedin this. 3.Summarizationofprobeinformationintoanintensityexpressionlevelforeach probeset. Thisgivesanexpressionlevelforeachgeneineacharray.Theresultsofthis analysisarestoredina.chple. Atthispointwehavethefollowingles. .cdfprobetoprobesetgenemapping .gininformationabouttheprobeset .datpixelleveldata .celprobeleveldata .chpgeneleveldata ThenextlevelofanalysisreferredtoasHighlevelinvolveslookingforrelationshipsbetweengenesviavariousformsofclusteranalysishierarchical,Kmeans, principlecomponent,spectral,etc.. 51
PAGE 64
3.4Normalizationofdata Variousstatisticalnormalizationshavebeenappliedtotherawdatatoproduce geneexpressionlevels.Beforeperformingspectralclustering weproposenormalizingtheexpressionvectorstoone .Hereiswhy. Suppose,ourmicroarrayexperimentexaminesametabolicprocessat5distinct points.Theprocessingoftherawdatahasproducedanarrayofdatawherethe rowsarethegenesandthecolumnsinnumberaretheexpressionlevels.We areinterestedinclusteringgenesaccordingtohowsimilarlytheyareupgradedor downgraded.Nowsupposewehavetwogeneswithexpressionlevelsvectorsof ; 3 ; 4 ; 2 ; 1and ; 6 ; 8 ; 4 ; 2,seeFigure ?? .Theexpressionlevelsofthesetwogenes showastrongcorrelationwithrespecttohowtheychangethroughthemetabolic process;thesecondjusthasamagnitudetwicetherst.Wewouldliketoseethese genesclusteredtogether. Figure3.3: Geneswithcorrelatedexpressionlevels But,lookingatthesewithinour5Dexpressionspace,theymaybefarenough apartthatwhenwegenerateedgeweightstheyarenotconnected.Geometrically,we aresayingthatvectorsclosetothesamestraightlinesegmentstartingattheorigin 52
PAGE 65
arecorrelatedandshouldbeclusteredtogether.Toaccomplishthiswenormalize thedatatovectorsoflength1.Thisisequivalenttoprojectingthevectorsontoa unitspherearoundtheorigin.Anexampleofthisinthe2Dcaseisillustratedin Figure ?? .Withoutnormalizationsquaresingreentherearethree,maybefour, groups.Withthisnormalizationcirclesinredtherearetwo. Figure3.4: Eectofnormalizationonclusters 3.5DisconnectedGraphs Whendeningedgesbetweengraphnodesi.e.expressionlevelvectorsweare typicallydealingwiththousandsofnodes.Tocontrolthesparsityoftheresulting Laplacianweusuallyhavetolimitthenumberofedgesinsomeway.Forimages,we controlthenumberofedgesbydeningourgraphasalatticewithedgesdenedby a5pointstencil. Thedataformicroarraysdoesnottintoanypreconceivedgrapharchitecture. Theedgeweightbetweentwonodesiscomputedviaafunctionsuchasaninverse 53
PAGE 66
Gaussiandistancewhichwedescribeinmoredetaillater.Thenweapplyacuto weightvalue.Ifthecomputededgeweightislessthanthisvaluetheedgeweightis setatzero.Thiseectivelysaysthereisnoedgebetweenthosetwonodes. Selectionofthelimitingvaluecanproduceagraphthatisalmostcompleteor onethatissosparsefewofthenodesareconnected.Ideallywewantagraphwhose Laplacianissparsesoitiseasiertosolve,hasenoughedgestoretaintheinherent structureofthedata,andisconnectedsowecanapplythenodaltheorypreviously developed.Thersttwoconditionscanbemetbyexaminationofthesparsityofthe resultingadjacencymatrixandadjustmentofthecutoweight.Thelastcondition connectivityisnotaseasy. Astheedgesarereducedweinevitablycreatemultiplecomponentsintheresulting graph.Thediscretenodaldomaintheoremrequiredourgraphtobeconnected. Thesedisconnectedcomponentsneedtobeidentied.Forcomponentsofsizeone noedgesconnecttoitthisissimpleandecientlyaccomplishedbyexaminationof theadjacencymatrix;i.e.allvaluesinarowarezero.Forcomponentsofalarger sizethisisnotaseasy. GraphalgorithmssuchasaBreadthFirstSearchBFScanbeutilizedtodothis see[ ? ]page534buthaveruntimeof O k V k + k E k .Generallytherearealot moreedgesthanvectorsandthiscanbequitelargepotentially O k V k 2 Analternatemethodtondcomponentsusingspectralmethodswasproposed byDing,He,andZha[ ? ].Theirmethodisbasedonexaminationoftheeigenvectors oftheLaplacianwitheigenvaluezero.Weknowthemultiplicityofeigenvaluezerois thenumberofcomponentsofthegraph.Carefulexaminationoftheseeigenvectors canidentifytheindividualcomponents. Sincetheeigenvalueproblemisingeneralalso O k V k 2 ,isperformanceafactor inpreferringoneovertheother?BFShastheadvantageifthenumberofedgesis nottoolarge.However,whenusingaspectralmethodweareonlysolvingforafew 54
PAGE 67
ofthesmallesteigenpairsandsoexpecttodobetterthan O k V k 2 .So,thereisno clearcutanswer. Weproposeanalternatemethod tothatofDing,He,andZha,whichisalso spectralinnature.Itiseasytointegrateintotherecursivebipartitioningalgorithm weintroducelater.Thismethodstartswiththeadditionofaconnectingnodetothe graphandthedenitionofanedgeofasmallweightbetweentheconnectingnode andeveryothernode.Ourgraphisnowconnected,andthissmallperturbationto anunnormalizedLaplacianissubjecttoanexactsolutionintermsoftheoriginal Laplacianfortheproblemof Lv = v .Wenextgivethesolutiontothisproblemin thefollowingtheorem,andthenapplyittoanalgorithmforextractionofcomponents. Theorem3.1 Let L betheunnormalizedLaplacianofagraph G with n vertices,not necessarilyconnected.Let ^ G beagraphformedbytheadditionofavertextoGand anedgebetweenthisvertexandeveryothervertex.Letalloftheseedgeshaveweight b> 0 .Let ^ L betheLaplacianof ^ G andlet I n bethevectorofallonesofsize n .Then theeigenpairsof ^ Lw = w canbecharacterizedasfollows: 1. ; I n +1 2. b; v 0 where Lv =0 v and v 6 =0 and P v i =0 3. n +1 b; I n )]TJ/F20 7.9701 Tf 6.586 0 Td [(n 4. + b; v 0 where Lv = v and > 0 Proof: Byconstructionof ^ G itisconnectedandsofollows. Wehave ^ L = L + D b B B T nb where B = )]TJ/F19 11.9552 Tf 9.299 0 Td [(b I n and D b isthediagonalmatrixwithbon alldiagonals.Let v beasolutionof Lv =0 v where v 6 =0andthesumoftheentries of v iszero.Then ^ L v 0 = Lv + D b v B T v = bv 0 = b v 0 whichproves.Notethatthis casecanonlyoccurwhen G isdisconnected. 55
PAGE 68
Wealsohave ^ L I n )]TJ/F20 7.9701 Tf 6.587 0 Td [(n = L I n + D b I n + B )]TJ/F20 7.9701 Tf 6.587 0 Td [(n B T I n + nb )]TJ/F20 7.9701 Tf 6.586 0 Td [(n = b I n + nb I n )]TJ/F20 7.9701 Tf 6.586 0 Td [(nb + nb )]TJ/F20 7.9701 Tf 6.586 0 Td [(n = n +1 b I n )]TJ/F20 7.9701 Tf 6.587 0 Td [(n which proves. Suppose Lv = v and > 0,thenbytheorthogonalityof v to I n wehave ^ L v 0 = Lv + D b v B T v = v + bv 0 = + b v 0 whichproves.Inaddition, L hasan orthogonalsetofeigenvectors OE ,andwecanchoose v tobefromthisset.Inthis case,all v 0 aremutuallyorthogonal. Now,wehaveonlytoshowthatallpossibleeigenvalueshavebeenaccountedfor andtheireigenvectorsareorthogonal. If G isconnectedthenby,andwehavedened n +1eigenpairs.The eigenvectors v inarelinearlyindependentandorthogonalto I n .So,and dene n linearlyindependenteigenvectors.Forthesamereason, I n )]TJ/F20 7.9701 Tf 6.587 0 Td [(n T v 0 =0and wehave n +1linearlyindependenteigenvectors.Since G isconnected,doesnot apply. If G isnotconnectedthenthemultiplicityoftheeigenvaluezeroisthenumberof componentsof G .Let m bethenumberofcomponentsof G .Foranytwocomponents, say G j and G k of G ,wecandeneavector v suchthat Lv =0and P v i =0.Wetake v i =1if i 2 G j and v i = if i 2 G k where = j G j j = j G k j .Wecanconstructexactly m )]TJ/F15 11.9552 Tf 11.211 0 Td [(1suchvectorsthatarelinearlyindependent.Now,theeigenvectors v inthat wehaveconstructedareformedfromeigenvectorsof OE ,thatis v = v 1 + v 2 where v 1 2 OE and v 2 2 OE .So,theeigenvectorsofareorthogonaltothoseofand areorthogonaltothoseof1andsince P v i =0. So,by,,andwehave1, m )]TJ/F15 11.9552 Tf 12.351 0 Td [(1,and n )]TJ/F19 11.9552 Tf 12.351 0 Td [(m independentvectorsfora totalof n andthenaddsthelast. 56
PAGE 69
Corollary3.2 If ^ Lv = bv then P v i =0 andallvaluesin v foranycomponentof G willbethesame. Proof: Wenotethatourchoiceofeigenvectorsforintheproofof ?? spanthe eigenspaceofeigenvalue b .Letusdenotetheseas w i .Anyothereigenvectorinthe eigenspacewillbeoftheform v = P i i w i .Wehave P j v j = P i i P j w i j =0. Finally,sincethevaluesofcomponentsinany w i arethesamebyconstruction,this mustalsobetruein v Theimportanceofthiscorollaryisthataneigenvectorofeigenvalue b mustbe aFiedlervectorof ^ L since0
PAGE 70
ForpurposesofclusteringwewanttheoptionofusingthenormalizedLaplacians: L sym = D )]TJ/F18 5.9776 Tf 7.782 3.259 Td [(1 2 LD )]TJ/F18 5.9776 Tf 7.782 3.259 Td [(1 2 and L rw = D )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 L .Theorem ?? doesnotapplytothemand derivingexactsolutionstotheperturbationof L byaconnectingnodehasprovento beelusive.Perturbationofthealgebraicconnectivityndeigenvalueisaddressed in[ ? ],buttheissueofperturbationoftheFiedlervectorismoredicult.But,from numericalexperimentswecanmakethefollowingobservations. Assume b theperturbationweightissmallrelativetotheminimumofthe weightsin L .Forthecaseofaconnectedgraph,let v betheFiedlervectorof theunperturbedproblem L rw v = v .TheFiedlervectoroftheperturbedproblem, ^ L rw w = p w orequivalently ^ Lv = p Dv ,isoftheform w = v + wherevector is smallrelativetothevaluesof v and p .TheperturbedFiedlervectorof G with theconnectingnoderemovedshouldthengiveusthesamepartitionastheoriginal vector. FortheproblemofthesymmetricLaplacian, L sym = D )]TJ/F18 5.9776 Tf 7.783 3.258 Td [(1 2 LD )]TJ/F18 5.9776 Tf 7.782 3.258 Td [(1 2 ,wehavesimilar resultssinceeigenvaluesof L rw andof L sym arethesame,andif v isaneigenvector of L rw then w = D 1 2 v isaneigenvectorof L sym .Notethatoneimplicationofthisis thattheeigenvectorofeigenvaluezeroisnolongeraconstantvector. Forthecaseofagraphwithmultiplecomponents,theFiedlervectorofthe perturbedmatrixhasvaluesdistributedaccordingtothecomponentsofthegraph. Forthe L rw Laplacian,thevaluesofanycomponentwillbenearlyaconstant.Each componentmayormaynothavethesameconstant,buttheeigenvectorwillinclude bothpositiveandnegativevaluesandsoapartitionofthevectoryieldsnonempty partitionswhereacomponentcannotbesplitintoseparatepartitionsbutcouldbe inboth. 58
PAGE 71
Wecannowdenethefollowingalgorithmforidenticationofcomponentsand recursivebipartitioning.ThisisvalidfortheunnormalizedLaplacianandwehave someintuitionbasedonnumericalexperimentsthatitisvalidforthenormalized Laplacians. Step1 Reducetheadjacencymatrixof G byallsingletoncomponents. Step2 Selectavaluefor b Step3 Addaconnectingnodeto G asdescribedabovetocreate ^ G Step4 ConstructtheLaplacianandsolvefortherstfeweigenpairs. Step5 Excludingthezeroeigenvalueidentifythenextsmallesteigenvalue. Step6 Usetheeigenvectorofthiseigenvaluetopartitionthegraph. Step7 Ifthepartitionresultsinanemptypartitionthenexcludethateigenvalue, andusingthenextsmallesteigenvalue,gotoStep6. Step8 Usingthesecomponents,repeattheprocessstartingwithstep3untilasufcientnumberofclustershasbeenidentied. Comments: Ifweonlywanttoidentifyallcomponentsthenwecanusetheunnormalized Laplacianandstopinstep5whentheeigenvalueisnot b InStep1,wechoosetoeliminateallsingletons.Thisisnotnecessary,butis donebecauseitiseasyandcheapthediagonalvalueofasingletoniszeroand thesingletonsdonotcarrymuchinformationabouttheirrelationtoothervectors genes.Intheprogramtofollowwejustputthemallinapartitionbythemselves. ThiscansignicantlyreducethesizeoftheLaplacianandspeeduptheanalysis. Wechoose b tobe.001timestheweightlimit.Since,alledgeweightsmustbe greaterthanthislimit,thismakes b<< allotheredgeweights. 59
PAGE 72
Onecanask,Whybotherwiththeconnectingnode?".Justexaminethezero eigenvectorsof L .Therearesomepracticalreasons:Ifthegraph G isconnectedwe donotknowthisandwestillhavetoexaminethezeroeigenvector.Ideallythiswould haveaneigenvectorwithidenticalvaluesorforthesymmetricLaplacianonewhich hasallpositivevalues.Thenweknowthegraphisconnected.Numericalsolutions giveeigenvectorsthatarenotexact.Whenthesolutionsarenormalizedandthe dimensionalityof L isverylarge,thiscanresultinvaluesclosetozeroincludingboth positiveandnegativevalues.Thisresultsinabadpartitionofthegraph.Addinga connectingnodeallowsustoidentifyandignorethezeroeigenvector. Instep6wepartitionaccordingtothe r weaksigngraphs.If r istoolarge thiscanresultintwoidenticalpartitions.Thisisamajortypeoferrorandshould terminateprocessinglessrecursiveiterationsendlesslyreplicatethesamecluster. Sincewepartitionby r weaksigngraphs,theresultingpartitionsmayintersect.As discussedearlier,weconsiderthisanadvantageofthetechnique. 3.6Recursivebipartitioning Intheprevioussectionwedevelopedanalgorithmfordealingwithgraphswith multiplecomponents.Thisalgorithmnotonlydealswithmultiplecomponentsbut alsohandlespartitioningofaconnectedgraph.Wewanttoincorporatethisalgorithm intoanotherforrecursivebipartitioningappliedtoouroriginalgraph.Wehavetwo problemstosolve.Whichpartitiondoweapplythealgorithmtonext?,and Whendowestop? Toanswertherstquestionweneedameasureoftheinternalconnectivityofa graph.Wesayagraphisfullyconnectedifthegraphiscompleteandhasthesame weightonalloftheedges.Inthiscasepartitioningofitwouldnotbemeaningful.All nonzeroeigenvaluesarethesameandeverypartitioncreatesanothersetofcomplete graphs.Ontheotherhandagraphwithnoedgeshasnointernalconnectivity. 60
PAGE 73
Wedevelopedthefollowingmeasurewhichreectsthesetwoextremes. Denition3.3 Theinternalconnectivityofagraphis j G j n n )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 ,where j G j = sumof weightsofalledgesof G Thisisbasedontheobservationthatthenumberofedgesinacompletegraph with n verticesis n n )]TJ/F15 11.9552 Tf 12.157 0 Td [(1 = 2.Forconveniencewehavedroppedtheconstantfactor 2inthedenitionofconnectivity. Weselectthenextclustertopartitionbychoosingtheonewiththelowestinternal connectivity.Notethatwhilethisismotivatedbymathematicalconsiderations,itis inessenceanadhocprocedure,andiftherearelargevariationsinedgeweightsitcan producecounterintuitiveresults. Figure ?? illustratesthisprocess.Theexampleisofagraphformedfrom4normal distributionsofdata.Thesearecenteredatpositions,3,,9,,9,and,6. Weightsaregeneratedviaagaussianweightfunctionwithaminimumweight.This resultsinsomeisolatedverticessingletonclusters,indarkblueandoneclusterof2 verticesinred.Recursivebipartitioningroughlyidentiestheoriginaldistributions. Notethattheremaybeverticesinmorethanoneclusterbutthesearenotidentied intheplotsgeneratedhere. Thesolutiontothesecondquestionaboutstopingcriteria,isalsoadhocin nature.Wecouldchoosetoproceeduntilallpartitionsachievedsomeminimumconnectivity.Wecouldalsochoosetoproceedtillaxednumberofclustersisproduced, oruntilweachieveafullhierarchicalbreakdowntooneortwoverticespercluster. Ineithercasewehavetomakeanapriorijudgementaboutthesevalues. Someauthorshavesuggestedusingeigenvaluejumpstodeterminethenumber ofclusterstond[ ? ].Thisworkswellifwearelookingfordisconnectedornearly disconnectedclusters.Formicroarrays,however,wedonotexpecttheresultinggraph tobethatsimple.Wechooseheretoexplorethedatabylookingfor16clusters.This choiceisarbitrary,andwouldonlybeastartingpointforanalysis.Afterexamination 61
PAGE 74
Figure3.6: Anexampleofrecursivebipartition oftheresults,thischoicewouldbemodiedforsucceedingruns. 3.7Weightconstruction Inthesoftware,severaltechniquesareimplementedtocomputeedgeweights. InverseGausiandistanceiscomputedviathefunction e )]TJ/F20 7.9701 Tf 6.587 0 Td [(sc d 2 where sc isascaling factorwhichdefaultsto1and d istheEuclidiandistancebetweentwogeneexpression vectors;i.e.thenormofthedierencebetweenthetwovectors.Thisisthenlimited tozerobyaradiusvalue;i.e.iftheweightislessthantheradiusthentheedgeweight iszero.ThisisdonetointroducesparsitytotheLaplacian. 62
PAGE 75
Afullyconnectedcompletegraphcanbeproduced.Herealloftheedgeweights areone.Edgescanbepredenedandinthiscasetheedgeweightsareone.These techniquesareimplementedtoanalyzecertaintestgraphs. ThetechniqueweuseformicroarraydataisaEuclideandistancelimitedbya radiusvalue.Whenthedistanceisgreaterthantheradiusvaluetheweightiszero, otherwiseitisone.Thismethodischosenbecausewewillbeprojectingthegene expressionvectorsontotheunitsphereandwedonothavetohandlelargevariation indistances.Wenotethatanglescouldbeusedhereasameasureofdistance. 3.8"Toy"experiments Inthisthesisandinmanyofthepapersreferenced,socalled"Toy"dataisusedto analyzethealgorithmspresentedthere.Theuseofthisdataisdonefortworeasons. 1.validationoftechniques 2.validationofsoftware Toydataprovidestestexampleswheretheclusteringresultsarewelldened,or atleastroughlydenedintuitively.For"Realworld"datatheclusteringmaynot beeasilyrecognized.Ifitcouldbetherewouldbenoneedtodotheanalysis.We canalsoconstructToydatawhichshouldrepresentspecicsituationsthesoftwareis supposedtohandle. TheToydatagivesusanobjectivewaytotestourtechniquesandtheirimplementationinsoftware.Thisisnotalwaysassimpleasitsounds.Debuggingofthe softwareforrecursivebipartitionrevealedseveralproblemsthatrequiredadjustments tothealgorithmandenhancementstothetheorysupportingthem.Softwaretestinghasthusbeenanintegralpartoftheoverallprocessofdoingthemathematical research. OncetheToydatahasserveditspurpose,wecannowusethesoftwaretoexamine "realworld"data.Someofthemathematicsreviewedanddevelopedhereisfairly 63
PAGE 76
advancedandsomerathersimple.Theultimatepurposeofwhatwehavedevelopedis notjustanexerciseinpuremathematics,buthopefullyhasapplicabilitytorealworld dataandproblems.Inthenextsectionweapplythesetechniquestorealmicroarray data. 3.9Analysisofrealmicroarrayexperiments Weanalyzedyeastvaluestakeninatimesequenceoverametabolicshiftfrom fermentationtorespiration.ThesourceofthisdataistheMatlabdemo"Gene ExpressionProleAnalysis"fromtheBioinformaticsToolbox.Thedatarepresents asetof7microarrayswhereallofthelowlevelanalysishasoccurredpriortoours toproduceasetofgeneexpressionlevels.Further,geneswithverylowexpression levelshavebeenlteredout.Thisleaves822genesrepresentedinan822x7matrix. Oneoftheproblemswehavetofaceformultidimensionaldataishowtorepresent theresults.Toyexperimentswereall2Dsothesehadanaturalwaytographclusters. Wecoulddosomethingsimilarfor3Dbutwithdiculty.Microarraydatausually hasdimensionslargerthanthis. Listingthegenesingroupsisnecessarybecausethisiswhatabiologicalscientist isgoingtoneedtoevaluatethecluster.Thisisalwayspossiblebycrossreferencinga genesmatrixrownumbertoametadatalistofgenedescriptions.Thelistingofthe clustersforthisexperimentisgiveninAppendix ?? Wealsodesireagraphicalwaytorepresenttheresults.Wewanttovisually conrmhow"good"theclusterisandidentifyinterestingcharacteristicsofthedata. Thetechniquewechooseistographagenesexpressionlevelunnormalizedagainst itsmicroarraynumberto7forourtestdata.Allgeneswithinthesamecluster areplacedinthesamegraph.Multiplegraphsareproduced;oneforeachcluster, seeFigure ?? .Thesetofgeneswithnoedgeconnectiontoanyothergenearenot represented.Thenumberofentriesintheclusterandtheclustersconnectivitysee 64
PAGE 77
Figure3.7: Microarraydataclusteringresult Denition ?? isprintedaboveeachgraph. ThisanalysisisperformedusingtheMatlabeigsfunctionsolvingthe Lv = v problem,vectorsnormalized,Euclidiandistancefunctionwitharadiuscutoof0 : 2, andavalueof r = : 0000001usedtocompute r weaksigngraphs. 3.10TheSoftware ThesoftwareisimplementedinMatlab.Theprogramtoperformtheanalysisis codedasafunctionandlistedinAppendix ?? .Aprogramtoinvokethefunctionfor theexamplesinthisthesisislistedinAppendix2. Thefunctionargumentsconsistof4xedparametersandavariableargumentlist. Theseareanarrayoftheverticestocluster,thenumberofclusterstoproduce,astring deningthecomputationaloptions,andastringdeningthegraphstoproduce.The variableargumentsdeneRadius,Sigma,Scale,Mass,Edges,andRweakthevalue 65
PAGE 78
tousefordeterminationofsigngraphs.Therst2argumentsarerequired.The remainderhavedefaultsifnotentered. Thefunctionreturnsacellarraywheretheclustersaredenedandavectorwhich recordsaconnectivityvalueforeachcluster.Therstclusterinthecellarrayisall oftheisolatedvertices.Avertexcanappearinmorethanonecluster. TwoMethodsfordeterminingclustersareprovidedintheMATLABprogram; recursivebipartitionandkmeans.KmeansclusteringisdescribedinLuxburg[ ? ]andinvolvesapplyingthekmeansalgorithmtotherowvectorsofamatrixof therstkeigenvectorsofthegraphLaplacian.Itworkswellprovidedclustersare nearlydisconnectedandtherearenotlotsofisolatedvertices.Itproducesaclassic partitioningwhereavertexcanonlybeinasinglecluster. ParametersandtheiroptionsaredetailedintheprogramcommentsinAppendixA. ThecomputationofeigenvaluesisbasedontheMatlabeigsoreigfunction.The eigsfunctionshouldbeusedforlargesparsematrices.Theeigenvalueproblemcan bercut Lv = v ,mass Lv = Mv ,orncut Lv = Dv Theorganizationoftheprogramisasfollows: Analyzeparameters,setdefaults, Deneweightmatrix Deneconnectingnodeedgeweight Dokmeansclusteringifrequested { letkbethenumberofclusterstosolvefor { solvetheeigenvalueproblem { identifytheeigenvectorsoftheklowesteigenvaluesnotincludingzero { performkmeansontherowvectorsoftheseeigenvectors 66
PAGE 79
{ denetheseclustersinthecellarraypartition Dorecursivebipartitioningifrequested { initializethecellarraypartitionwithoneentrycontainingallvertices { splitouttheunconnnectedverticesintotheirownpartitionentryand deneitsconnnectivityas999 { selecttheotherpartitionentryforprocessing { WHILEthenumberofpartitionentriesislessthanthenumberrequested do { bipartitiontheselectedpartitionentrybasedonedlervector solvetheeigenvalueproblem identifylowestnonzeroeigenvalueanditsassociatedeigenvector B:splitthepartitionintotwoclustersbasedonthe r weakvalueand theeigenvector ifbothpartitionareidenticaltotheoriginalthenerrortheprogram ifeitherpartitionhaszeroentriesthenskiptonexteigenpairandgoto B: { splittheselectedpartitionintotwoentriesbasedonbipartitionresults { computetheconnectivityofallpartitionentries { selectthepartitionentrywithsmallestconnectivity { endWHILE VariousplotsareproducedthroughouttheprogramasrequestedinthePlot parameter. 67
PAGE 80
4.BLOPEX 4.1Introduction ThesoftwarepackageBlockLocallyOptimalPreconditionedEigenvalueXolver BLOPEXwasintroducedin2005.IthasrecentlybeenupgradedtoVersion1.1. BLOPEXimplementstheLocallyOptimalBLockPreconditionedConjugateGradient LOBPCGmethod[ ? ]forsolutionofverylarge,sparse,symmetricorHermitian generalizedeigenvalueproblems.Version1.1addssupportforcomplexmatricesand 64bitintegers. Thegeneralizedeigenvalueproblem Ax = Bx forlarge,sparsesymmetricand Hermitianmatricesoccursinavarietyoftraditionalproblemsinscienceandengineering;aswellasmorerecentapplicationssuchasimagesegmentationandDNA microarrayanalysisviaspectralclustering.Theseproblemsinvolvematricesthatcan beverylargedimension > 10 5 butfortunatelyaresparseandfrequentlyweonly needtosolveforafewsmallestorlargesteigenpairs.Thisisthefunctionofthe BLOPEXsoftware. BLOPEXhasbeenpreviouslydescribedin[ ? ]and[ ? ].Thischapterseeksto giveamoredetaileddescriptionofthesoftwarethanhaspreviouslybeensupplied. ThissoftwareincludesnotjusttheimplementationoftheLOBPCGmethodbut interfacestoindependentlydevelopedsoftwarepackagessuchasPETSc 1 ,Hypre 2 andMATLAB 3 Theremainderofthischapterisorganizedasfollows.WestartinSection ?? bya generaldiscussionoftheproblem,reviewsomeoftheothersoftwareavailableforthe probleminSection ?? ,andpresenttheLOBPCGmethodinSection ?? .Section ?? thencoverstheBLOPEXsoftware.BLOPEXisavailableviaanewGoogleSource 1 PETScPortableExtensibleToolkitforScienticComputationisdevelopedbyArgonneNationalLaboratory 2 HypreHighPerformancePreconditionersisdevelopedattheCenterforAppliedScientic ComputingCASCatLawrenceLivermoreNationalLaboratory 3 MATLABisaproductofTheMathWorks TM 68
PAGE 81
siteandthisiscoveredinSection ?? .WediscusstheenvironmentsBLOPEXhas beentestedoninSection ?? ,givesomenumericalresultsinSection ?? ,andwrapup inSection ?? 4.2TheProblem Weseeksolutionstothegeneralizedeigenvalueproblem Ax = Bx whereAand BarerealsymmetricorcomplexHermitianmatrices.Bmustbepositivedenite.A and/orBmaybedenedasamatrixorbesuppliedinfunctionalform. NotethattherequirementthatBbepositivedeniteimpliesalleigenvaluesare niteandthesymmetryofAandBimplyalleigenvaluesarereal. WeemphasizethatAandBneednotbesuppliedinmatrixform,butcanbe denedasfunctions. Problemsofthistypearisefromdiscretizationsofcontinuousboundaryvalue problemswithselfadjointdierentialoperators[ ? ].Weoftenonlyneedthe m smallesteigenvaluesoreigenpairs;where m ismuchlessthanthedimensionofthe operatorA.Wedon'tusuallyneedsolutionstohighaccuracysincethediscretization oftheproblemisitselfanapproximationtothecontinuousproblem. Thelargedimensionalityoftheproblemprecludessolutionbydirectfactorizationmethods.Thustheneedforiterativemethods.Butiterativemethodscanhave slowconvergenceandsowerequireapreconditioner[ ? ].Thechoiceofpreconditionerisseparatefromthechoiceofiterativemethod. WeusetheLOBPCGiterativemethodseeSection ?? .Preconditionersare suppliedtoBLOPEXbythecallingprograms.ThisandtheinterfacestoPETScand Hypremakepossibletheuseofhighqualitypreconditioners. 4.3CurrentSoftware Thereareanumberofexistingsoftwarepackagesforsolutionsoflarge,sparse eigenvalueproblems.Wediscusstwoofthesethathavebeenpreviouslydescribedin ACMtransactions. 69
PAGE 82
Anasazi[ ? ]isapackagewithintheTrilinosframework[ ? ],writteninC++ whichusesobjectorientedconcepts.implements3blockvariantsofiterativemethods: LOBPCG,Davidson,andKrylovSchur. Anasazisolvesforapartialsetofeigenpairsofthegeneralizedeigenvalueproblem. Itusesapreconditionerwhichmustbesuppliedbytheuser.StartingwithTrilinos 9.0thereisinteroperabilitywithPETSc. PreconditionedIterativeMultiMethodEigenvaluePRIMME[ ? ]wasreleased Oct2006.ItimplementstheJDQMRandJD+kmethodstosolveforapartialset ofeigenvaluesoftheproblem Ax = x .ItdoesnotcurrentlyhandletheGeneralized Eigenvalueproblem.WritteninCithasanemphasisonbeinguserfriendly",by whichismeantaminimalparametersetcanbeusedtoobtainsolutionswithout extensivetuningorknowledgeonthepartoftheuser.Moresophisticateduserscan utilizeanextendedsetofparameterstotunetheperformance. PRIMMEcanhandlerealandcomplexnumbersandorthogonalityconstraints. Thepreconditionerissuppliedbytheuser.InterfacestoPETScandHyprearenot mentionedandpresumablynotavailable. NeitherPRIMMEnorAnasazimentioninterfacestoMatlab. BycontrastBLOPEX handlesbothrealandcomplexnumbers iswritteninC hassimilarparametersasPRIMME hasinterfacestoPETSc,Hypre,Matlab,andstandaloneserialinterfaces allowsforuseofhighqualitypreconditionersviaPETScandHypre 70
PAGE 83
4.4LOBPCG Tosolveforasingleeigenpairoftheproblem Ax = Bx theLOBPCGiterative methodcanbedescribedasa3termrecurrenceformulaasfollows: x i +1 = w i + i x i + i x i )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 ; .1 where w i = Tr i ;r i = Ax i )]TJ/F19 11.9552 Tf 11.955 0 Td [( i Bx i ; i = x i ;Ax i = Bx i ;x i theRayleighquotient,and T isapreconditionerforthematrixA : Thevalues i and i in ?? arechosentominimize i +1 withinthesubspace span f w i ;x i ;x i )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 g .ThisminimizationisdoneviatheRayleigh{Ritzmethodas describedinParlett[ ? ].Thepreconditioner T shouldbelinear,symmetric,and positivedenite. Useof x i and x i )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 asbasisvectorsfor span f w i ;x i ;x i )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 g canleadtoillconditionedGrammatrices[ ? ]intheRayleighRitzmethod,because x i canbe verycloseto x i )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 Theeectofbasisvectorsisanontrivialproblemdiscussedin[ ? ].Animprovementonthebasisusedin ?? wasproposedbyKnyazev[ ? ].Thisreplaces x i )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 with p i asfollows: x i +1 = w i + i x i + i p i ; .2 andforthenextiteration p i +1 = w i + i p i andtheothertermsareasin ?? : 71
PAGE 84
Inthiscase,itcanbeshownthat span f w i ;x i ;p i g = span f w i ;x i ;x i )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 g So,thetwoiterativeproblems ?? and ?? aremathematicallyequivalentbut ?? ismorenumericallystable. Whenmorethanoneeigenpairistobecomputed,ablockversionofLOBPCGis used.Tocomputethe m smallesteigenpairsweapplytheRayleighRitzmethodto thesubspacespannedby f x i 1 ;w i 1 ;p i 1 ; ;x i m ;w i m ;p i m g .Thisgives m Ritzvectors x i +1 j asestimatesforthe m smallesteigenvectorswithestimatesforeigenvaluesgiven bytheirRayleighquotients. Wenotethatthechoiceofblocksize m isinpartproblemdependentandinpart atuningconsideration.Thisisdiscussedinsomedetailin[ ? ]. 4.5BLOPEXsoftware TheBLOPEXsoftwareprovidesfunctionstotheuserforsolutionofeigenvalue problemsasdescribedinSection ?? .ThesoftwareexternaltoBLOPEXwhichthe userwritesmustdothefollowing: setupmatricesorfunctionsfor A and B setupthepreconditioner T providefunctionsformatrixvectoroperations callLOBPCGsolverinBLOPEX BLOPEXsoftwareiswritteninC.Chasbeenchosensinceitprovidesforease andversatilityofinterfacingwithawidevarietyoflanguagesincludingC,C++, Fortran,andMatlab.ThismakesBLOPEXhighlyportable. BLOPEXcanbelogicallyseparatedintotwoparts.Therstpartimplements theLOBPCGalgorithm.Werefertothisastheabstract"code.Itcontainsthe functionscalledbytheuseraswellasanumberofutilityfunctionsneededinternally. WediscussthisindetailinSection ?? 72
PAGE 85
Thesecondpartiscodewhichprovidesfunctionsforinterfacingwithsoftware packagessuchasPETSc,Hypre,andMatlab.Onechallengeforalleigensolversoftwareisthenecessityofsupportingmultiplediverseformatsforsparsematricesand vectors.FunctionsforaccessingmatrixvectorelementsanddoingmatrixvectoroperationsareinherentinthecallingsoftwareandBLOPEXhastoaccesstheseroutines. Thisaccessoccursviathespecicinterfacefunctions.Wedescribetheinterfacesin Section ?? 4.5.1Structure Figure ?? showsahighleveloverviewofBLOPEXandhowittswiththecalling software.ThedriverissoftwarewrittenbytheuserwhichcallstheLOBPCGsolver inBLOPEXabstract.Thedrivercanbewritteninnumerousexternalenvironments suchasPETSc,Hypre,Matlab,etc.BLOPEXprovidesanumberofsampledrivers whicharedescribedinSection ?? Thedriverusesmacros,commands,orfunctionsfromitsenvironmenttodene matrices,vectors,andmatrix/vectoroperations.Thesearecommunicatedtothe LOBPCGsolverviaparameters,seeSection ?? .Toaccesstheseexternalenvironmentmatrix/vectorroutines,BLOPEXsuppliesinterfaces. TheseinterfacespackagedataasmultivectorsseeSection ?? topasstothe LOBPCGsolverandprovidefunctionstoconvertparametersinformatsdened withinBLOPEXtoparametersspecictotheexternalenvironmentfunctions. BLOPEXrequiresLAPACKfunctionsorequivalentstoperformorthonormalizationandsolvethegeneralizedeigenvalueproblemfortheGrammatricesinthe RayleighRitzmethod.ThesecanbethestandardLAPACKfunctions dsygv and dpotrf forreal,or zhegv and zpotrf forcomplexnumbers.Equivalentsfromthe ACML,MKL,orESSLlibrariescanbeused.Theaddressesoftheroutinestouseare passedbytheDrivertoBLOPEX.Iftheparametersarenotthesameasthestandard 73
PAGE 86
Figure4.1: StructureofBLOPEXfunctions Driver Matrix/VectorRoutines BLOPEXAbstract BLOPEXInterface LAPACKorequivalent ? Z Z Z Z Z Z Z~ 7 6 LAPACKfuntions,thenafunctionmustbecodedtodoparameterconversions. 4.5.2AbstractCode Thisisthesolver.Itconsistsofthreemodules lobpcg.c multivector.c ,and fortran_matrix.c .SmallmatricesandvectorsthatariseasresultofRayleighRitz MethodarekeptinternaltotheabstractcodeinFortrancolumnmajororderand processedviaroutinesin fortran_matrix.c Tworoutinesin lobpcg.c arecallablebyDrivers. lobpcg_solve_double and lobpcg_solve_complex .Theseroutinessetupafunctioninterpreterwhichisalist offunctionaddresses.Thefunctionsin multivector.c and fortran_matrix.c arespecictodoublerealorcomplexnumbers.Thesetwofunctionsthencall lobpcg_solve wheretheLOBPCGalgorithmisimplemented.AsaresultBLOPEX doesnothavetobecompiledspecicallyforcomplexorrealnumbers. Thefunctionsin multivector.c provideforconversionsandoperationsbetween matricesinFortranformatandmultivectorswithpointerstomatricesandvectorsin theexternalenvironmentformat.Thesefunctionsinturncallinterfacefunctionsto accessthesematrix/vectoroperations. 74
PAGE 87
4.5.2.1Algorithms ThestepsoftheBlockLOBPCGalgorithmasimplementedinBLOPEXfollow withdetailedcommentsonvariousstepsappearingafterwards.WelistonlythemajorparametersherebutwegiveacompletelistandmoreexplanationinSection ?? The*"operatordenoteseithermatrix/matrix,matrix/vector,orscalar/matrixmultiplication,asinMatlab. Input: X m startingvectorsinblockform Amatrixoraddressoffunctiontocompute A X Bmatrixoraddressoffunctiontocompute B X Tpreconditioneroperator,anddata Yconstraintvectorsinblockform Output: X m computedeigenvectors m computedeigenvectors Algorithm: 1.Applyconstraints Y to X 2.Borthonormalize X 3.[ C; ]= RR A;B;X ApplyRaleighRitzMethod 4. X = X C ComputeRitzvectors 75
PAGE 88
5. J =[1 ; ;m ]Initializetheindexsetofactiveresiduals 6. fork =1 ; ;MaxIterations 7. R J = B X J )]TJ/F19 11.9552 Tf 11.955 0 Td [(A X J ComputeResidualvectors 8.Computenormsofresidualvectors 9.Checkresidualnormsforconvergence 10.Excludeconvergedvectorsfromindex J softlocking 11.ifallvectorshaveconvergedthenstop 12. W J = operatorT R J ;dataT Applypreconditionertoresiduals 13.Applyconstraints Y to W J 14.Borthonormalize W J 15.if k> 1 16.Borthonormalize P J 17.basisforRRis S =[ XW J P J ] 18.else 19.basisforRRis S =[ XW J ] 20.end 21.[ G; ]= RR A;B;S ApplyRaleighRitzMethod 22. C = G : m; :GetcolumnsofGcorrespondingtoX 23.=: m GeteigenvaluescorrespondingtoX 24.if k> 1 76
PAGE 89
25.Partition C = 2 6 6 6 6 4 C X C W C P 3 7 7 7 7 5 accordingtocolumnsof X W J ,and P J 26. P = W J C W + P J C P 27.else 28.Partition C = 2 6 4 C X C W 3 7 5 accordingtocolumnsof X and W J 29. P = W J C W 30.end 31. X = X C X + P 32.end Comments: ?? Constraints Y arepreviouslycomputedorknowneigenvectors.Forexample, thevectorsofallonesisaneigenvectorofthesmallesteigenvalueofagraphLaplacian. SowecanchoosethisasaconstraintandthenforceallvectorsofXtobeBorthogonal toY.InthiscaseLOBPCGsolvesforthenext m smallesteigenvalues.Weapply constraintYtoXviareplacingXwiththedierenceofXandtheBorthogonal projectionofXontothesubspacegeneratedbyY;thatis X = X )]TJ/F19 11.9552 Tf 12.31 0 Td [(Y Y T B Y )]TJ/F17 7.9701 Tf 6.587 0 Td [(1 B Y T X ?? BorthonormalizeXusingCholeskyfactorization;thatis R = chol X 0 B X ; X = X R )]TJ/F17 7.9701 Tf 6.586 0 Td [(1 .BlockvectorXmustbecomposedoflinearlyindependentvectors tobeingwithelseorthonormalizationwillfail. ?? Weadoptthefollowingnotation:[ G; ]= RR A;B;S tospecifythe RayleighRitzmethodwhichndseigenvectors G andeigenvaluesofthegeneralized 77
PAGE 90
eigenvalueproblem S T AS G = S T BS X S T AS isreferredtoasthegramAmatrixand S T BS asthegramBmatrix. S isgiveninblockformsuchas[ XWP ]where X W ,and P areblockvectors. S formsabasisforthesubspacetondestimated eigenvectorsin.NotethattheeigenvectorsGarechosentobeBorthogonal. ?? NotethattheBorthonormalityoftheRitzvectors X ispreserved. ?? ?? ?? Initiallyallresidualsof X areactive.Asweiterate,thischangesas theirnormsconvergetowardszero.Whenoneofthevectorsin X convergestowithin aprescribedtoleranceitisremovedfromtheindex.Thiswecallsoftlocking.All ofXremainsaspartofthebasisforthenextiteration.However,onlytheresiduals denotedby R J andCGstepvectorsdenotedby P J areusedtocreatethenew subspace.AllofXisretainedontheexpectationthatconvergedvectorsinXwill continuetoimprove. ?? Toapplythepreconditionerto R J ,acalltoaroutine T providedbythe Driverisdone.ThisfunctionisusuallycodedintheDriverandmusthaveparameters oftheform voidoperatorTvoid*dataT,void*R,void*W andmustbe abletohandle W and R asblockvectors.Thecodefor operatorT ishighlydependent ontheDriversenvironment. ?? ?? Wetransformthepreconditionedresidualblockvector W J tobeBorthonormaltotheconstraint Y andthevectorsof W J tobeBorthonormaltoeach other. ?? For k =1wedonothavetherst P J yet.Itisrstcomputedin ?? and in ?? thereafter. ?? ?? Forthe1stiterationthebasisforRayleighRitzis S =[ XW J ].Note that X and W J areBorthonormalwithrespecttotheirownvectors,but X isnot necessarilyBorthonormalto W J .Consequently,thesymmetricGrammatricestake 78
PAGE 91
theform gramA = S T AS = 2 6 4 X T AW J :W T J AW J 3 7 5 and gramB = S T BS = 2 6 4 IX T BW J :I 3 7 5 wherethedotnotationindicatesthesymmetryofthematrix.Notethat X T AX = sinceXisBorthonormal. Forsubsequentiterationsthe X W J ,and P J blocksareBorthonormalwithin theirblocksbutnotbetweenthem.SotheGrammatricesare gramA = S T AS = 2 6 6 6 6 4 X T AW J X T AP J :W T J AW J W T J AP J ::P T J AP J 3 7 7 7 7 5 and gramB = S T BS = 2 6 6 6 6 4 IX T BW J X T BP J :IW T J BP J ::I 3 7 7 7 7 5 Finally,wenotethatcomputationsforthecomponentsoftheGrammatricesare optimizedinthecode,sotheyarenotcomputeddirectlyfromthebasisvectors.For claritythesedetailshavebeenomitted. ?? ?? ComputationoftheblockvectorPcorrespondsto p i +1 inequation ?? .NotethatPhasthesamenumberofrowsasX. ?? Finally,wecomputeanew X whichisjusttheRitzvectors X = S C 4.5.2.2DataTypes BLOPEXdenesseveraldatatypeswhicharedescribedhere. TodealwithblockvectorsinvariousdiverseformatsBLOPEXdenesthestructure mv_MultiVector .Thiscontainsthefollowingelds. 79
PAGE 92
Apointertoanotherstructurethatdenesthevectors.Howthesevectors areformatteddependsontheinterface.ForexampleinthePETScinterface itpointstoanotherstructure mv_TempMultivector whichcontainseldsthat denethenumberofvectors,activemaskandapointertoanarrayofpointers,eachofwhichpointtoaVecPETScvariable.InterfacesforHypre,Serial,andMatlabhavedierentbutsimilarstructures.Creationofvariablesof mv_MultiVector typeisdonebyroutinesin multivector.c Anintegervariablethatissetto1whendataforthepointerdenedaboveis allocated.Thisistoaidindeletionofthemultivectorwhenwearenished withit. Apointertoastructure mv_InterfaceInterpreter whichisalistoffunction addresses.Thesepointersaresettotheappropriateinterfacefunctions. An mv_Multivector variablesuchasparameterXthenencapsulatesthedata andinterfacefunctionstomanipulatethedata. Theseconddatatypeistodealwithmatrices.Thisalsodependsontheinterface. Matricesdonothavetobemanipulatedlikeblockvectors.Typicallytheyareinvolvedinsomematrixvectoroperationandjustneedtobepassedtotheappropriate interfaceroutinewhichisspeciedviatheinterpreterintheassociatedblockvector. Soitispossibleweonlyneedtopassapointertothematrixasaparameterin theformitappearsintheexternalenvironment.ThisiswhatisdonefortheMatlab interface.Otherinterfacestakeadierentapproach.ThePETScinterfaceincludesin amultivectorlikestructure,variablesforA,B,andKSPsolverandthenpassesthis astheparameterforbothA,B,andpreconditionerdata.TheoperatorA,operatorB, andoperatorTfunctionsthenusetheappropriatevariable. InternallytheBLOPEXabstractcodeusesdatatype utilities_FortranMatrix todeneFortranstylematrices.Thisincludesvariablesforglobalcolumnheight, 80
PAGE 93
columnheight,rowwidth,pointertoposition ; 1ofthematrix,andanownsData variable.Theglobalheightcanbedierentfromthecurrentheightbecauseweoverlay inmemoryblocksoftheGrammatricestooptimizetheircomputation. ForBLOPEXversion1.1weaddedatypeforcomplexnumberswhichwecall komplex .SincethereisnostandardbetweenCcompilersforcomplexnumberswe chosetoimplementourowntypealongwithroutinesforthebasicmathfunctions ofaddition,subtraction,multiplication,anddivision.Thismaintainsportabilityof BLOPEX. 4.5.2.3Parameters AdescriptionoftheparametersfortheLOBPCGsolverfollows.Thefunctional denitioncanbefoundininclude lobpcg.h availableintheonlinesite. Someparametersareoperators.Fortheseapointertotheoperatorispassed. Theoperatormustbedenedas void*operatorAvoid*,void*,void*. IfaparameterisnotneededthenaNULLispassed. ParameterDescription X Required.Ablockvector.Thisistheinitialguessofeigenvectors.Thiscanbe basedonpriorknowledgeorjustrandomguesses.Thenumberofvectorsdenes thenumberofeigenvaluestosolvefor.Onoutputitcontainsthecomputed eigenvectors. A Optional.UsethisifAisamatrix. operatorA Required.Thisimplementsamatrixvectormultiplication.IfAisNULL thenitmustdeneAasanoperatoranddoamatrixvectormultiplication. 81
PAGE 94
B Optional.UseifBisamatrix. operatorB Optional.Onlyneededifsolvingageneralizedeigenvalueproblem.This implementsamatrixvectormultiplication.IfBisNULLthenitmustdeneB asanoperatoranddoamatrixvectormultiplication. T Optional.Usethisifapreconditionerissupplied.Thisisdatainmatrixform tobepassedtothepreconditioneroperatorT.Thedataisdependentonthe preconditionerthatoperatorTimplements.ItcouldbeNULL,apreconditioned matrixbasedonA,orA. operatorT Optional.Butmustbesuppliedifapreconditionerisused.Thispreformstheactualpreconditioningontheresidualsblockvector. Y Optional.Thisisblockvectorofconstraints.OrthogonalityofXtoYisenforced intheLOBPCGsolver. blapfn Required.Astructurewhichcontainstheaddressestolapackfunctionsor equivalentsdsygv,dpotrf,zhegv,andzpotrf. tolerance Required.Astructurecontainingabsoluteandrelativetolerancestoapply totheresidualnormstotestforconvergence. maxIterations Required.Themaximumnumberofiterationstoperform. verbosityLevel Required.TheLOBPCGalgorithmcanprinterrormessagesand messagestotrackprogressofthesolver.verbosityLevelvaluescontrolthis. Valueof0meansprintnomessages.Valueof1meansprinterrormessages, maxresidualnormaftereachiteration,andeigenvaluesafterlastiteration. Valueof3meansprinterrormessagesandeigenvaluesaftereveryiteration. iterations Required.Output.Thenumberofiterationsactuallyperformed. 82
PAGE 95
eigs Required.Output.Anarraycontainingtheeigenvaluescomputed. eigsHistory Optional.Output.Anarraycontainingeigenvaluesproducedafter eachiteration. eigsHistNum Optional.Input.Maxnumberofeigenvalues.Thisshouldbe the numberofeigenvaluestocompute.ItisusedtoreformattheeigsHistoryarray intoamatrixformat.RequiredifeigsHistoryisnotNULL. residNorms Optional.Output.Anarraycontainingresidualnormsofeigenvalues computed. residHistory Optional.Output.Anarraycontainingresidualnormsofeigenvalues producedaftereachiteration. residHistNum Optional.Input.Maxnumberofeigenvalues.Thisshouldbe thenumberofeigenvaluestocompute.ItisusedtoreformattheresidHistory arrayintoamatrixformat.RequiredifresidHistoryisnotNULL. 4.5.3DriversandInterfaces DriversaretheprogramsthatsetuptheeigenvalueproblemandBLOPEXabstractiswheretheyaresolved.Theseencompasstwoseparateenvironments.That oftheDriverPETSc,Hypre,etc.handlematrixandvectorsparseformatsandthe matrixvectoroperationsonthemincludingapplicationofpreconditioners.BLOPEX abstracthasallofthelogicfortheLOBPCGalgorithm.Theinterfaceiswherethe functionalityofthetwoenvironmentsoverlaps. TheinterfacesandvariousmultivectorstructuresreviewedinSection ?? canbe intimidatingtoauser.ToovercomethiswesupplyvariousDriverswhichserveboth asexamplesandinsomecasesgenericproblemsolvers. ThenextsectionsdescribetheDriversandinterfacesthatareavailable.For detailsofexecutionoftestswiththesedriversandcongurationforPETScand 83
PAGE 96
Hypre,reviewtheWiki'savailableontheGooglesourcehtmlsite.SeeSection ?? for moreinformation.ForexecutionofBLOPEXunderPETScandHyprealsoseethe appendicesof[ ? ]. 4.5.3.1PETSc BLOPEXisincludedaspartofthePETScdistributionwhichmustbeconguredwiththeoption downloadblopex=1 .ScalarvaluesinPETScareeitherrealorcomplexandthismustbespeciedduringcongurationviatheoption withscalartype=complex .PETSCprovidesparallelprocessingsupportonmatrixvectoroperations. Thereare4DriversdistributedwithPETSclocatedinthePETScsubdirectory ../src/contrib/blopex driver.c buildsandsolvesa7ptLaplacian. driver_fiedler.c acceptsasinputthematrixAinPetscformat.Thesecan besetupviasomeMatlabprogramsinthePETScsocketinterfacetoMatlab; PetscBinaryRead.m and PetscBinaryWrite.m .Theseprogramsreadandwrite MatlabmatricesandvectorstolesformattedforPetsc.TheversionfromPetsc onlysupportsdouble.Wehavemodiedtheseprogramstoalsosupportcomplex and64bitintegers.OurversionsareincludedintheGooglesourcedirectory ../blopex_petsc alongwith PetscWriteReadExample.m toillustratehowto usethem. driver_diag.c solvesaneigenvalueproblemforadiagonalmatrix.Thisserves asatestprogramforverylargesparsematrices.Ithasbeenexecutedsuccessfullywithover8millionrows. ex2f_blopex.F isanexampleofusingBLOPEXwithPETScfromFortran. 84
PAGE 97
4.5.3.2HYPRE Hypredoesnotsupportcomplexnumber,butlikePETScprovidesparallel supportviamatrixvectormultiplicationandhighqualitypreconditioners.The BLOPEXLOBPCGsolverisincorporatedintoHypreprograms struct.c and ij.c locatedintheHypredirectory ../src/test .Theseprogramshavebroadfunctionalityandcansetupandsolve3D7ptLaplacians.TheycanalsoinputmatrixlesinHypreformatstoconstructageneralizedeigenvalueproblem.These lescanbecreatedinMatlabusingtheMatlabmatlab2hypepackageavailableon http://www.mathworks.cn/matlabcentral/. Thereisalsoasomewhatlessintimidatingexamplein ../src/examples/ex11.c whichsolvesa2DLaplacianeigenvalueproblemwithzeroboundaryconditionson an n n grid. 4.5.3.3Matlab TheMatlabinterfaceconsistsofmlesandclesavailableontheGooglesource siteunderdirectory ../blopex_matlab .TheBLOPEXabstractlesmustalsobe acquiredfromtheGooglesourcesite.AllclesarecompiledundertheMatlabMex compiler.Complexnumbersaresupportedalongwith64bitintegersinthenewer versionofMatlab.Preconditionersareimplementedinthisinterfaceasmles. 4.5.3.4Serial ThesearestandalonedriversandinterfaceswritteninC.Therearecomplex andrealversions.MatricescreatedbythedriversareinstandardFortranformat. Theydonothaveanyparallelsupport.TheyhavebeenusedprimarilyforBLOPEX developmenttesting. 85
PAGE 98
4.6TheGoogleSourceSite BLOPEXsourcecodeasofVersion1.1ismaintainedontheGooglesourcecode sitehttp://code.google.com/p/blopex/undertheSVNversionmanager.Allsource isdownloadable. ThissitealsoprovidessomeWikidocumentsthatdescribetestswehaveexecuted forallinterfacesandvarioussystems.BetweenthesourcecodefortheDriversand theWiki'swehopeusersndBLOPEXaccessableandusable. 4.7EnvironmentsBLOPEXTestedOn BLOPEXhasbeentestedinawidevarietyofenvironments.Thefollowingisa partiallistcoveredbyoneormoreoftheWiki'sdescribedinSection ?? Machines :UCDXVIB,UCARFrost,NCARBluere,LawrenceLivermore NationalLaboratory,IBMPC OperatingSystems :Linux,Fedora,IBMAIX,CygwinunderWindows7 Compilers :gcc,IBM blrts_xlc ,g++,pgcc,SUNmpcc,AMDOPEN64 LapackLibraries :Lapack,AMDACML,IntelMKL,IBMESSL MPI :openmpi,mpich2 4.8NumericalResults Somenumericaltestsfor3D7PointLaplaciansoftheBLOPEXimplementation ofLOBPCGinHyprehavepreviouslybeenreportedin[ ? ]andinPETScandHypre in[ ? ]. WereporthereonsomeresultsusingHypreandafewofthematricesthatwere analyzedbyPRIMMEasreportedin[ ? ].Thesematricesareavailablefromthe UniversityofFloridaSparseMatrixCollectionat http://www.cise.ufl.edu/research/sparse/matrices/ .Adirectcomparisonto PRIMME'sresultsisnotpossiblesincetheyareproducedonaverydierentmachine. 86
PAGE 99
Table4.1: Matricesanalyzed Matrix Rows nnz nnzL+UAMD Andrews 60,000 760,154 234,019,880 nan512 74,752 596,992 5,600,676 cdf1 70,656 1,825,580 71,684,224 cdf2 123,440 3,085,406 147,417,232 Allofthematricesusedaresymmetricpositivedenite.Weusematrix Andrews ,whichhasaseeminglyrandom"sparsitypatternandnotmuchstructure", nan512 ,whichisastochasticmatrixusedfornancialportfoliooptimizationand cfd1 and cfd2 ,whicharepressurematricesfromstructuralengineering.Theircharacteristicsaredescribedintable ?? .Note nnzL+UAMD isthenumberofnonzeros inL+UoftheLUfactorizationusingAMD. AnalysiswasperformedonaFedora10OS,4QuadCoreOpteron2.0Ghz CPUs,and64GBRAM.Hyprewasconguredusingopenmpiwithgcccompiler andBLAS/LAPACKlibraries. TosetupthematricesforprocessingbyHyprewedownloadedthematlabversions andconvertedthemusingourmatlab2hypreIJ.mprogramtoHypreformats.This lewasthenprocessedusingtheHypreijprogram.Forexampletond5eigenvalues of nan512 toatoleranceof1 e )]TJ/F15 11.9552 Tf 11.251 0 Td [(6usingtheBoomerAMGpreconditionerwewould execute: ./ijlobpcgvrand5tol1e6pcgitr0itr200seed1solver0fromlenan512 FortherstexperimentTable ?? ,weprocessallofthelesinsingleprocessor mode.BoththetimestosetupthepreconditionerandtoexecutetheLOBPCG solverarereported.Thematrix Andrews hasaneigenvalueveryclosetozero,which causesproblemsorthonormalizingtheresidual.Toovercomethisashiftof1 e )]TJ/F15 11.9552 Tf 10.858 0 Td [(7was 87
PAGE 100
Table4.2: Singleprocessorsetupandsolutiontime Eigenvaluestosolvefor Matrix Setup 1 2 3 4 5 7 10 15 Andrews 29 4 18 34 62 @ * nan512 1 5 10 22 37 43 66 90 203 cdf1 25 152 297 405 599 737 * cdf2 36 335 644 1342 * * Alltimesroundedtonearestsecond. *Analysisnotperformed. @Failureofdsygvroutine. appliedtothematrix.Thiswassuccessfuluptosolutionfor5eigenvalueswhere dsygvroutinefailed.Also,notetherelativelylargepreconditionersetuptimesfor allmatricesexcept nan512 .Thisseemstoreectedinthe nnzL+UAMD values showninTable ?? .Thesetuptimesareindependentofthenumberofeigenvaluesto solvefor. ThesecondexperimentTable ?? studiestheeectofparallelprocessingon solutiontimeformatrix nan512 .Itsolvesfor5eigenvaluesusingopenmpivarying thenumberofprocessors.Forexampletorunwith2processors,wewouldsplit nan512into2Hyprelesusingmatlab2hypreIJ.mandprocessasfollows: mpirunnp2./ijlobpcgvrand5tol1e6pcgitr0itr200seed2solver0 fromlenan5122 4.9Summary Version1.1ofBLOPEXhasbeenimplementedinPETSCversion3.1p3andsubmittedtoHypreforinclusion.Version1.1incorporatesthenewfeaturesofcomplex 88
PAGE 101
Table4.3: Multipleprocessorsetupandsolutiontimefornan512 Processors SetupTime SolutionTime 1 .42 42.86 2 .96 25.17 3 .63 15.00 4 .40 10.82 5 .31 8.35 6 .24 6.44 7 .20 5.20 numbersand64bitintegers.ThenewGooglecodesiteforBLOPEXmakestesting documentationavailabletotheuser.BLOPEXhasinterfacestopopularsoftware packagesPETSc,Hypre,andMatlab. 89
PAGE 102
5.ConclusionsandFinalThoughts Weadvocatethatspectralclusteringgivesmeaningfulresultswithoutrecourseto relaxationofcombinatorialproblems.Apartitioningobtainedusingspectralclustering,whilepossiblyvaryingwidelyfromthecombinatorialminimizationsolution,still hasitsvalidity.Thevibrationalmodelisusedtoformallyjustifyspectralclustering andtoprovideuswithintuitivelyappealingnotionofclusters. Thepresentedmodelforspectralclusteringmaynaturallypartitionagraphinto nondisjointclusters.Thisisalsoatvariancefromthecombinatorialapproachwhere thedenitionofthepartitioningrequiresdisjointclusters. Theexaminationoftype2 r weaksigngraphsmayhelpwithunderstandinghow eigenvectorscorrespondingtoseverallowesteigenvaluescanbeusedforclustering. Theconjectureraisedinconnectionwiththemappearsplausiblebutneedsaproof. Thetechniqueweproposeforhandlingdisconnectedgraphsperformswellinour numericaltests,butthequestionofwhetherthisorabreathrstsearchalgorithm worksthebestisnotfullyanswered.Ifthediscretealgorithmcouldbeappliedjust once,beforethespectralclustering,suchacombinationmaybethebestapproach. Specialattentionisneededtomonitornumericalinaccuracyintheeigenvectors,since itmightintroducedisconnectedclustersduringtherecursivebipartitioning. RecursivebipartitioningusingjusttheFiedlervectorhasnotreceivedenough attentionintheliterature.Theuseofmultipleeigenvectorsismuchmorecommon. Iftheclustersareonlylooselyconnectedthelatterapproachgivesgoodqualityresults. However,asweobserveinourDNAmicroarraynumericalexperiments,realworld problemsarenotsoeasytopartition.Wethusanticipatethatourcompletealgorithm aspresentedinChapter ?? haswideapplicability. Finally,wecommentoneigensolvers.Ontheonehand,clusteringexamplesin thisthesisaresolvedinMatlab,withthecorrespondingeigenproblemssolvedusing Matlab'sEIGandEIGSfunctions.Thisispossiblebecausethesizeoftheproblems 90
PAGE 103
inournumericalclusteringexperimentshasbeendeliberatelykeeprelativelysmallfor simplicityofcodingandtesting.Ontheotherhand,wedevelopourBLOPEXeigensolversoftwarepackagethatcanecientlysolveinparallellargescaleeigenproblems, andthuscanbeusedforclusteringofmassivedata.ApplicationofBLOPEXforreal worldclusteringproblemsisleftforfurtherdevelopment. 91
PAGE 104
APPENDIXA.SpectralClusterFunction Thisistheprimaryfunctionusedformicroarrayanalysis. function[partition,con]=spectralclusterX,numclusters,options,plot,varargin %Spectralclustering %Input:Fixedarguments %Xpointsverticestocluster %numclustersnumberofclusterstofind %optionsoptionsforcomputationasastring %optionsforcomputingweights %gaussusegaussiantocomputeweights %euclidianuseeuclidiandistancedefault %fullmakefullyconnectedgraph %edgesedgespredefined,weights=1 %normnormalizepointvectors %whatsolvertouse %eigsuseeigstocomputeeval %eiguseeigdefault %whichproblemtosolve %ncutsolveL*x=lambda*D*x %masssolveLx=lambda*M*x %rcutsolveL*x=lambda*xRcutdefault %howtocomputemultipleclusters %kmeansclusterviakmeans %biclusterclusterviabiclusterdefault %plotplotstoproduceasastring %%nodesnodesincolorsspecifiedbycdx %edgesedgesincolorsspecifiedbycdx %eigvaleigenvalues %eigvecfiedlervector %clustersgraphwithclustersindifferentcolors %infolistmiscinfo %Input:Variablearguments 92
PAGE 105
%Radiusvaluetousetolimitweights %Sigmalocaldensityorpvaluetouseinweights %Scalemultiplierforweightcomputation %Massvectorofnodemasses %Edgespredefinededges %RweakRvalueforweaksigngraphclustering % %Oputput:partitioncellarrayofindexestonodesshowingpartitioning %convectorofconnectivityvaluesforclusters %%setupdefaults %ifnargin<3 error'Atleast2parametersareexpected'; end %setproblemsizeanddefaults [xsize,dim]=sizeX; radius=1; sigma=onesxsize,1;%uniformlocaldensity scale=1; Mass=onesxsize,1; edges=0; rweak=0; vi=sizevarargin,2; i=1; whilei
PAGE 106
i=i+2; elseifstrfindvarargin{i},'Mass' Mass=varargin{i+1}; i=i+2; elseifstrfindvarargin{i},'Rweak' rweak=varargin{i+1}; i=i+2; else i=i+1; end end globalOPTIONSNUMCLUSTERSDWPLOTRWEAK; OPTIONS=options; PLOT=plot; NUMCLUSTERS=numclusters; RWEAK=rweak; color=[001;%blue 100;%red 010;%green 000;%black 011;%cyan 101;%magenta 110];%yellow color=[color;color;color;color]; %%definesimilarityweightmatrixSi,j %sincediagiszeroandsymmetricweonlyneedupperpart %ifstrfindOPTIONS,'norm' %normalizeXvectors tic fori=1:xsize Xi,:=Xi,:/normXi,:; end sprintf'normalize%f',toc end 94
PAGE 107
ifstrfindOPTIONS,'gauss' %gaussianadjbysigma tic S=zerosxsize,xsize; fori=1:xsize forj=i+1:xsize %calculatealocaldensitysigma sc=minsigmai,sigmaj^2/sigmai*sigmaj; sc=sc*scale; %computeinversegausiandistance Si,j=expsc*normXi,:Xj,:^2; %excludewtsthataretoosmall ifSi,j
PAGE 108
S=fullS; else %wt1ifinsideradiusadjbysigma S=spallocxsize,xsize,200*xsize; fori=1:xsize forj=i+1:xsize a=normXi,:Xj,:; ifa0*.1; DW=minDW,.001; DW=DW/xsize; %%defineGraphi.e.adjacencymatrix %W=S+S'; clearS; %plotgraphwithoutedges ifstrfindPLOT,'nodes' cdx=onesxsize,1; figure scatterX:,1,X:,2,30,colorcdx,:,'filled' 96
PAGE 109
title'Graphnodes' end %plotgraphwithedges ifstrfindPLOT,'edges' cdx=onesxsize,1; figure i=xsize1; gplot2W,X,3,cdx; title['Graphnodes&edges:'OPTIONS] end %%dokmeansclusteringoncolumnvector %definedby1stk+1eigenvectors %ifstrfindOPTIONS,'kmeans' %solvetheevalproblem k1=NUMCLUSTERS+10; [V,e,eidx]=solveW,Mass,k1; %definerowvectorsof1stnumclusterrowsofeigenvectorsV %excludingfirst0eigenvalue k=numclusters; yidx=eidx:k+1; pidx=kmeansV:,yidx,k,'emptyaction','singleton','MaxIter',200; fori=1:k partition{i}=findpidx==i; end %ploteigenvalues ifstrfindPLOT,'eig' figure stairs:k1,e:k1; title['Eigenvalues:'OPTIONS]; %ploteigW,V:,idx,16; end end %97
PAGE 110
%findclustersviasuccessivebicluster %ifstrfindOPTIONS,'kmeans' else %putallunconnectedverticesinpartition1 %andrestrictanalysistoremainingvertices sumW=sumW; ifsumsumW==0~=0 partition{1}=findsumW==0; partition{2}=findsumW~=0; pi=2; psize=2; NUMCLUSTERS=NUMCLUSTERS+1; else partition{1}=1:xsize; pi=1; psize=1; end pnz=pi; whilepsize
PAGE 111
ifps==0 error'0partitionsizeincluster' end ps=ps*ps1; ifps==0%eliminatesinglenodesfromconsideration coni=999; else%computeconnectivityrelativetofullyconnectedgraph coni=sumsumWpartition{i},partition{i}; coni=coni/ps; end end %findpartitionleaststronglyconnected m=mincon; pi=findcon==m; %ifequalconnectivitythentake1stone ifsizepi,2>1 pi=pi; end ifstrfindPLOT,'debug' con partition pi end end end ifstrfindPLOT,'info' fprintf,'%sn','Partitionsizeandconnectivity' fori=1:sizepartition,2; fprintf,'%5d%8fn',sizepartition{i},2,coni end end %%plotresultsofclustering %ifstrfindPLOT,'clusters' figure 99
PAGE 112
ifdim==2%verticeslieinaplane numpart=sizepartition,2; fori=1:numpart ix=partition{i}; ifi<8 scatterXix,1,Xix,2,30,colori,:,'o','filled' elseifi<15 scatterXix,1,Xix,2,30,colori,:,'*' else scatterXix,1,Xix,2,20,colori,:,'s','filled' end ifi==1 holdon end end title['Clusters:'OPTIONS]; holdoff %figure %gplotW,X,'*'; else%verticesinmorethan2dimensions %problemwithindexesinplotstmt %don'tgetthisafterreturntocallingpgm %forc=1:NUMCLUSTERS %subplot,4,c; %Y=Xpartition{c},:; %plotY %axistight %title['Clusters:'OPTIONS]; %end end end function[part1,part2]=biclusterW,Mass globalOPTIONS %savetheinitialsize origsize=sizeW,1; %solvetheevalproblem 100
PAGE 113
k1=4; [V,e,eidx]=solveW,Mass,k1; %doasinglebicluster fori=1:k1 ifei>1e10 part1=findV:,eidxi>=RWEAK; part2=findV:,eidxi<=RWEAK; %iforiginalpartitionissameaspart1andpart2thenerror iforigsize==sizepart1,1&origsize==sizepart2,1 error'Biclustererror.Samesizepartitions.rvaluepossiblytoolarge.'; end %dummynodemayintroduceemptypartitionwhengraph %isfullyconnectedtobeginwithandafternodeis %removedfromevec,inthiscasekeeplooking. %1stevecwherethisdoesnotoccurisrealfiedlervec ifsizepart1,1>0&sizepart2,1>0 break end end end ifstrfindPLOT,'eigvec' V:,eidxi end %ploteigenvalues ifstrfindPLOT,'eigval' figure stairs:k1,e:k1; title['Eigenvalues:'OPTIONS]; %ploteigW,V:,idx,16; end end function[V,e,eidx]=solveW,Mass,k1 globalOPTIONSDW %adddummynodetoelimmulticomponents ifstrfindOPTIONS,'full' 101
PAGE 114
xs=sizeW,1; else xs=sizeW,1+1; fori=1:xs1 Wi,xs=DW; Wxs,i=DW; end Wxs,xs=0; end SD=sumW,2; D=sparse:xs,1:xs,SD; L=DW; %solveLfor1stlowtohighk1evalue,evectors %note:Matlabreturnsevaluesinadiagmatrix k1=mink1,xs; ifstrfindOPTIONS,'eigs' opts.issym=1; opts.disp=0; warningoffMATLAB:nearlySingularMatrix ifstrfindOPTIONS,'ncut' [V,E,flag]=eigsL,D,k1,'sm',opts; ifflag~=0;flag;end elseifstrfindOPTIONS,'mass' %addadummymass ifstrfindOPTIONS,'full' else Mass=[Mass;DW];%dummymass end M=sparsediagMass; [V,E,flag]=eigsL,M,k1,'sm',opts; ifflag~=0;flag;end else %warning:settingthesigmaineigstoolowcanresultin %'matrixissingulartoworkingprecision'andevalofNaN %.001istoolow [V,E,flag]=eigsL,k1,.01,opts; 102
PAGE 115
ifflag~=0;flag;end end warningonMATLAB:nearlySingularMatrix else L=fullL; D=fullD; ifstrfindOPTIONS,'ncut' [V,E]=eigL,D; elseifstrfindOPTIONS,'mass' ifstrfindOPTIONS,'full' else Mass=[Mass;DW];%dummymass end M=fulldiagMass; [V,E]=eigL,M; else [V,E]=eigL; end end %forceeigenvaluesintolowtohighorder E=sumE; [e,eidx]=sortE; e:k1 %don'tincludethedummynode ifstrfindOPTIONS,'full' else V=V:xs1,:; end end end 103
PAGE 116
APPENDIXB.SpectralClusterTestDriver ThisisaMatlabprogramusedtoproducesomeoftheexamplespresentedinthis thesis.Itdemonstrateshowtocallthespectralculsterfunctiontoperformrecursive biclustering. %spectralclustertests test=3; switchtest case1 %fourclustersnormallydistributedaround %foundpointsinR^2 randn'state',5 X=sample[2,3],1,50; Y=sample[8,9],2,80; X=[X;Y]; Y=sample[2,9],1,30; X=[X;Y]; Y=sample[5,6],.5,40; X=[X;Y]; n=sizeX,1; cdx=spectralclusterX,5,'eig,gauss','edges,clusters,info,debug','Radius',.02,'Rweak',.0001; case2 %yeastvaluesfromMatlabdemo"GeneExpressionProfileAnalysis %theseareafterfilteringtoeliminategeneswithlowexpression %thisisasetof7microarrays,takeninatimesequencefor %metabolicshiftfromfermentationtorespiration loadc:cnsdemoyeastvalues.mat X=yeastvalues; [cdx,con]=spectralclusterX,16,'eigs,norm','info','Radius',.2,'Rweak',.0000001; %[cdx,con]=spectralclusterX,16,'eigs','info','Radius',2,'Rweak',.0000001; [xsize,dim]=sizeX; %wedon'tplotthefirstpartitionsincethisisthecollectionof %isolatedverticesandcarriesnoobviousinformation figure'Color','white' 104
PAGE 117
forc=2:17 subplot,4,c1; plot:dim,Xcdx{c},: t=sprintf'%d%f',sizecdx{c},2,conc; titlet axistight end case3 %5ptcomplete+3ptcomplete+4ptcomplete [p1,e1]=complete_graph,0,0,2; [p2,e2]=complete_graph,5,0,2; [p3,e3]=complete_graph,0,6,2; points=[p1;p2;p3;2.53]; e2=e2+5; e3=e3+5+3; edges=[e1;e2;e3;113;713;913]; cdx=spectralclusterpoints,3,'edges,eig,ncut','clusters,edges','Edges',edges; otherwise end 105
PAGE 118
APPENDIXC.SubroutinesUsedbySpectralClusterFuntion ThesearethesubroutinescallbytheMatlabspectralclusterfunction. functiongplot2W,points,range,Idx,area %plotthepartitionofagraphwithdifferentedgeweights % %inputsWweightedadjmatrix %pointsx,ycoordofgraphnodes %rangemaxgraphedsizeofanedge %Idxpartitionofthegraph,values1,2,3,... %areasizeofnodes n=sizepoints,1; ifnargin<5 area=30; end ifnargin<4 Idx=ones,n; end ifnargin<3 range=3; end holdon %findrangeofvertexcoordandadjaxes xmin=minpoints:,1; xmax=maxpoints:,1; ymin=minpoints:,2; ymax=maxpoints:,2; axis[xmin1xmax+1ymin1ymax+1]; %plotnodes color=[001;%blue 100;%red 010;%green 106
PAGE 119
000;%black 011;%cyan 101;%magenta 110];%yellow color=[color;color;color;color]; scatterpoints:,1,points:,2,area,colorIdx,:,'filled' %fori=1:n %plotpointsi,1,pointsi,2,'*b' %end %findedges [x,y]=findtriuW; edges=[x,y]; n=sizeedges,1; %plotedges idx=triuW>0; xmin=minWidx; xmax=maxWidx; fori=1:n X=[pointsedgesi,1,1pointsedgesi,2,1]; Y=[pointsedgesi,1,2pointsedgesi,2,2]; %computelinewidthinpoints ifxmax==xmin width=.5; else width=Wedgesi,1,edgesi,2xmin/xmaxxmin; width=width*range+.3; end lineX,Y,'LineStyle','','Color','b','LineWidth',width; end holdoff functionW=adjacencyedges,weights 107
PAGE 120
%Produceadjacencymatrixofgraph %definedbyinputparametersedgesandweights %Graphnodesarenumberedfrom1toN. %Thehighestordernodeshouldhaveanedge. %Inputparamateredgeshasanentryforeachgraphedge %edges,1isnodewithconnectiontoedges,2. %weightsisweighttoassigntoedge1. %Getnumberofgraphnodes N=maxmaxedges; %Buildsparseadjacencymatrix %Notethatmatrixissymmetric r=[edges:,1;edges:,2]; c=[edges:,2;edges:,1]; v=[weightsweights]; %BuildNxNsparsematrix %Wri,ci=vi W=sparser,c,v,N,N; 108
PAGE 121
APPENDIXD.ListofGenesbyCluster GeneNCBIlocustagscorrespondingtotheClustersextractedfromtheMatlab demo"GeneExpressionProleAnalysis". ClusterSequence2,NumberofGenes2,Connectivity1.000000 YGL059WYOR177C ClusterSequence3,NumberofGenes3,Connectivity1.000000 YCR036WYMR104CYOR032C ClusterSequence4,NumberofGenes12,Connectivity0.121212 YBR050CYJL164CYJR008WYKL091CYPL256C YBR051WYPR002WYBR056WYDL234CYFR055W YHL039WYGR052W ClusterSequence5,NumberofGenes31,Connectivity0.150538 YAL034CYBL043WYBL049WYBR046CYBR285W YCR091WYDL204WYDL218WYDR330WYKL093W YLR164WYBL048WYDR043CYDR313CYGR236C YIL097WYIL101CYJL067WYJR155WYKL016C YNR007CYOR097CYPL185WYGR243WYNL093W YEL039CYGR146CYIL113WYKL217WYMR107W YPR150W ClusterSequence6,NumberofGenes85,Connectivity0.206162 YAL003WYAL012WYBR048WYCL054WYDL148C YDR144CYDR384CYEL026WYGR155WYGR159C YJR063WYLR196WYLL047WYMR131CYNL111C YNL175CYNL207WYNL303WYNR050CYNR054C YPL012WYPR137WYCL053CYCLX02CYDL083C 109
PAGE 122
YDR025WYGR092WYGR160WYHR128WYMR049C YMR229CYMR290CYNL060CYNL110CYNL132W YNL182CYNL256WYOR361CYPL043WYPL093W YPR144CYAL036CYBR247CYDL182WYDR206W YDR398WYEL040WYER036CYGL076CYGL078C YGR103WYIL053WYJL122WYJL148WYJR041C YJR071WYKL009WYKL081WYLR186WYLR056W YMR037CYMR217WYMR239CYNL075WYNL141W YNL313CYOR116CYPL126WYPL226WYAL025C YDL063CYDL213CYGL029WYKL078WYKL082C YLR009WYLR129WYLL008WYLR355CYLR449W YMR093WYNL002CYNL120CYNR067CYOL010W ClusterSequence7,NumberofGenes6,Connectivity0.200000 YCR019WYDR436WYJR006WYNR034WYBR069C YDR101C ClusterSequence8,NumberofGenes22,Connectivity0.129870 YAL054CYER024WYGR067CYLR142WYKR097W YDR505CYER065CYJL089WYCR005CYFL030W YIL057CYMR118CYNL117WYNL195CYPL054W YCR010CYDL215CYDR009WYKL171WYLR377C YPR030WYDL215C ClusterSequence9,NumberofGenes35,Connectivity0.159664 YBR116CYDR096WYDR216WYML042WYNL009W YNR002CYPL134CYPL262WYBR117CYDL199C YDL245CYEL012WYER096WYER098WYGL153W YGR110WYHL032CYHR096CYJL045WYKL107W YKL187CYLR267WYOR027WYPL109CYPL135W YER015WYML054CYOL084WYBR298CYDL233W YDR262WYGR224WYJR095WYOR019WYPL201C 110
PAGE 123
ClusterSequence10,NumberofGenes23,Connectivity0.189723 YBR241CYDR148CYDR306CYGR043CYGR201C YGR231CYJL144WYML131WYOR120WYBR280C YMR068WYBR203WYDR030CYDR494WYJL170C YLR254CYLR080WYMR030WYBL086CYDL169C YDR275WYNR071CYNR073C ClusterSequence11,NumberofGenes8,Connectivity0.250000 YNL174WYPL183CYMR108WYBR155WYJL109C YKL191WYNL216WYPR136C ClusterSequence12,NumberofGenes269,Connectivity0.118127 YBL015WYBR052CYBR072WYBR169CYBR183W YDL004WYDL124WYDR070CYDR074WYDR258C YDR272WYDR358WYEL011WYEL024WYER141W YFL014WYGR019WYGR111WYIL124WYIL162W YJL166WYJR104CYKL065CYKL067WYKL085W YLR168CYLR216CYLL041CYLL023CYLR270W YLR290CYLR356WYML100WYML120CYMR173W YMR181CYNL015WYNL037CYNL173CYOL126C YOL053CYOR052CYOR220WYOR244WYPL186C YPL230WNORF7YAL060WYAR028WYBL050W YBL064CYBR139WYBR147WYBR214WYBR256C YCL035CYCR021CYDL181WYDR001CYDR077W YDR125CYDR171WYDR272WYDR453CYDR529C YDR533CYER067WYGL037CYGL187CYGL259W YGR044CYGR088WYGR130CYGR132CYGR182C YGR250CYHR051WYHR104WYHR195WYIL169C YIR038CYJL137CYJL161WYJL185CYJR034W YJR080CYKL036CYKL141WYLR193CYLR217W YLR219WYLR093CYLR149CYKR058WYKR076W YLL026WYLR271WYLR295CYML128CYMR105C YMR311CYNL134CYNL160WYNL200CYNL252C YNL274CYOL117WYOR031WYOL032WYOL048C YOL071WYOR049CYOR215CYOR273CYOR289W 111
PAGE 124
YOR317WYPL087WYPL165CYPR020WYPR026W YPR098CNORF4NORF8NORF54YBL075C YBL099WYBL107CYBR054WYBR126CYBR269C YDL022WYDR032CYDR178WYDR342CYER053C YFL054CYFR033CYGL006WYGL198WYGR149W YHL021CYHR087WYJL102WYJR019CYJR073C YJR096WYKL148CYKL150WYLR178CYLR252W YLR258WYLR038CYKR067WYLR304CYMR090W YMR110CYMR133WYMR145CYMR196WYMR271C YNL045WYNL115CYNL305CYOR136WYOR178C YOR374WYPL004CYPL078CYPL154CYPL196W YPR149WYDR258CYAL017WYBL030CYBL038W YBL078CYBL100CYBL108WYBR101CYBR149W YBR222CYBR230CYCL025CYCR097WYDL021W YDL023CYDL067CYDL091CYDR031WYDR059C YDR085CYDR231CYDR277CYDR329CYDR343C YDR377WYDR513WYER035WYER079WYER150W YER158CYER182WYFR015CYGL045WYGL047W YGL121CYGL191WYGL199CYGR008CYGR028W YGR070WYGR142WYGR174CYGR194CYGR238C YGR244CYGR248WYHL024WYHR016CYHR092C YHR209WYIL087CYIL107CYJL079CYJL103C YJL151CYJL155CYJR048WYJR121WYKL026C YKL151CYKL193CYKR016WYKR046CYLR251W YLR081WYLR299WYLR327CYLR345WYLR395C YLR423CYML004CYMR031CYMR056CYMR081C YMR136WYMR170CYMR188CYMR195WYMR197C YMR250WYMR297WYNL052WYNL100WYNL144C YNL194CYNR001CYOL153CYOR035CYOR041C YOL083WYOR065WYOR089CYOR161CYOR347C YPL123CYPL223CYPR184WNORF46 ClusterSequence13,NumberofGenes143,Connectivity0.140451 YAR073WYBR092CYBR187WYBR189WYBR191W YDL082WYDL130WYDR382WYDR450WYEL054C YER070WYER074WYER117WYGR085CYHL001W YHL033CYHR141CYHR216WYIL069CYJL136C 112
PAGE 125
YJL190CYLR198CYLR212CYLL045CYLR044C YLR048WYLR076CYKR057WYKR059WYLR264W YLR340WYLR384CYLR432WYMR121CYNL013C YNL247WYNL301CYNL327WYOL120CYOR224C YOR310CYOR312CYPL142CYPL160WYPR145W YBL024WYDR165WYDR341CYDR365CYER002W YGL135WYGR034WYHR215WYIL018WYIL052C YJL189WYKL181WYLR175WYLL044WYLR029C YLR075WYLR339CYLR367WYLR409CYLR413W YML123CYNL178WYNL302CYNR053CYOL121C YOL127WYOL077CYOR153WYOR335CYOR369C YAL038WYBL027WYBR032WYBR106WYBR181C YBR249CYDL136WYDL208WYDR012WYDR060W YDR064WYDR418WYER131WYGL030WYGL102C YHL015WYHR089CYHR208WYLR180WYLR060W YLR062CYLR344WYLR372WYLR448WYML063W YMR318CYNL065WYNL069CYNL119WYOR234C YPL198WYPL220WNORF17YBL076CYDL167C YDR037WYDR321WYDR417CYDR447CYDR449C YDR471WYER110CYFR031BCYGL031CYGL103W YGL123WYGR148CYGR214WYGR264CYHR203C YIL133CYJL177WYJR123WYJR145CYKL006W YLR249WYLR325CYLR441CYMR242CYNL096C YOL040CYOR063WYOR309CYPL081WYPL131W YPR102CYPR132WNORF20 ClusterSequence14,NumberofGenes10,Connectivity0.466667 YBL045CYBR067CYIL125WYJL163CYKL109W YLR173WYOL053WYGL192WYMR191WYBR039W ClusterSequence15,NumberofGenes9,Connectivity0.138889 YCLX09WYPR043WYBR218CYLR341WYPR116W YNL067WYDR039CYDR491CYGR094W 113
PAGE 126
ClusterSequence16,NumberofGenes13,Connectivity0.269231 YAL026CYAR027WYKL035WYER044CYDR516C YFR053CYKL142WYMR278WYOL082WYIL111W YKL103CYLR257WYOR285W ClusterSequence17,NumberofGenes4,Connectivity0.333333 YGL158WYCR039CYMR232WYLR297W end 114
PAGE 127
REFERENCES []CharlesJ.AlpertandSoZenYao.Spectralpartitioning:Themoreeigenvectors, thebetter.In Proc.ACM/IEEEDesignAutomationConf ,pages195{200,1994. URL http://vlsicad.ucsd.edu/Publications/Journals/j37.pdf []CharlesJ.Alpert,AndrewB.Kahng,andSoZenYao.Spectralpartitioningwithmultipleeigenvectors. DiscreteAppliedMathematics ,90:3{26,1999. doi:10.1016/S0166218X00833. []ChristopherG.Baker,UlrichL.Hetmaniuk,RichardB.Lehoucq,andHeidiK. Thornquist.Anasazisoftwareforthenumericalsolutionoflargescaleeigenvalue problems. ACMTransactionsonMathematicalSoftware ,5no.N:1{22,2008. ISSN00983500.doi:10.1145/1527286.1527287. []RobertA.Becker. IntroductiontoTheoreticalMechanics ,chapter15.McGrawHill,NewYork,1954. []TurkerBiyikoglu,JosefLeydold,andPeterF.Stadler. LaplacianEigenvectors ofGraphs .SpringerVerlag,BerlinHeidelberg,2007. []BenjaminMiloBolstad. LowlevelAnalysisofHighdensityOligonucleotideArrayData .PhDthesis,UniversityofWaikato,2004.URL http://bmbolstad. com/Dissertation/Bolstad_2004_Dissertation.pdf []EditedbyCharlesEdmondBIchotandPatrickSiarry. GraphPartitioning .Wiley,NewYork,2011. []TonyF.Chan,PhilippeG.Ciarlet,andW.K.Szeto.Ontheoptimalityof themediancutspectralbisectiongraphpartitioningmethod. SIAMJournalon ScienticComputing ,18:943{948,1997.doi:10.1137/S1064827594262649. []DuhongChen,J.GordonBurleigh,andDavidFernandezBaca.Spectralpartitioningofphylogeneticdatasetsbasedoncompatibility. Syst.Biol. ,56: 623{632,2007.doi:10.1080/10635150701499571. []ShiuYuenCheng.Eigenfunctionsandnodalsets. Comment.Math.Helvetici 51:43{55,1976.doi:10.1007/BF02568142. []YunChi,XiaodanSong,KojiHino,andBelleL.Tseng.Evolutionaryspectralclusteringbyincorporatingtemporalsmoothness.In Proceedingsofthe 13thACMSIGKDDinternationalconferenceonKnowledgediscoveryanddata mining ,KDD'07,pages153{162,NewYork,NY,USA,2007.ACM.ISBN 9781595936097.doi:10.1145/1281192.1281212. 115
PAGE 128
[]FanR.K.Chung. SpectralGraphTheory ,chapter2.2.A.M.A.CBMS,Providence,RhodeIsland,1997.URL http://www.math.ucsd.edu/ ~ fan/research/ revised.html []ThomasCormen,CharlesLeiserson,RonaldRivest,andCliordStein. IntroductiontoAlgorithms,2ndEdition .MITPress,Cambridge,Massachusetts,2001. []RichardCourantandDavidHilbert. MethodsofMathematicalPhysics,Vol.1 Interscience,NewYork,1953. []E.BrianDavies,GrahamM.L.Gladwell,JosefLeydold,andPeterF.Stadler. Discretenodaldomaintheorems. LinearAlgebraanditsApplications ,336:51{60, 2001.doi:10.1016/S0024379515. []HarryF.Davis. FourierSeriesandOrthogonalFunctions ,chapter4.2.Dover Publications,Inc.,NewYork,1963. []InderjitS.Dhillon,YuqiangGuan,andBrianKulis.Auniedviewofkernal kmeans,spectralclusteringandgraphcuts. UTCSTechnicalReportTR0425 2005.URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1. 75.1701 []InderjitS.Dhillon,YuqiangGuan,andBrianKulis.Weightedgraphcutswithouteigenvectors:Amultilevelapproach. IEEETrans.PatternAnal.Mach. Intell ,29Issue11:1944{1957,2007.doi:10.1109/TPAMI.2007.1115. []ChrisDing,XiaofengHe,andHongyuanZha.Aspectralmethodtoseparate disconnectedandnearlydisconnectedwebgraphcomponents. Proc7thInt'l Conf.onKnowledgeDiscoveryandDataMining ,KDD2001:275{280,2001. doi:10.1145/502512.502551. []ChrisDing,XiaofengHe,andHorstD.Simon.Ontheequivalenceofnonnegativematrixfactorizationandspectralclustering.In Proc.SIAMDataMining Conf ,pages606{610,2005.URL http://citeseerx.ist.psu.edu/viewdoc/ summary?doi=10.1.1.215.1503 []W.DonathandA.Homan.Algorithmsforpartitioninggraphsandcomputer logicbasedoneigenvectorsofconnectionmatrices. IBMTechnicalDisclosure Bulletin ,15no.3:938{944,1972. []W.DonathandA.Homan.Lowerboundsforthepartitioningof graphs. IBMJournalofResearchandDevelopment ,pages420{425, 1973.URL http://domino.research.ibm.com/tchjr/journalindex.nsf/ c469af92ea9eceac85256bd50048567c/3e52839ce8b7e98885256bfa006841bd! OpenDocument []ArtM.DuvalandVictorReiner.Perronfrobeniustyperesultsanddiscrete versionsofnodaldomaintheorems. LinearAlgebraanditsApplications ,294: 259{268,1999.doi:10.1016/S002437957. 116
PAGE 129
[]StanleyJ.Farlow. PartialDierentialEquationsforScientistsandEngineers DoverPublications,Inc.,NewYork,1982. []PedroF.FelzenszwalbandDanielP.Huttenlocher.Ecientgraphbasedimage segmentation. InternationalJournalofComputerVision ,59no.2:167{181,2004. doi:10.1023/B:VISI.0000022288.19776.77. []MiroslavFiedler.Algebraicconnectivityofgraphs. CzechoslovakMathematical Journal ,23no.2:298{305,1973.URL http://dml.cz/dmlcz/101168 []MiroslavFiedler.Apropertyofeigenvectorsofnonnegativesymmetricmatrices anditsapplicationstographtheory. Czech.Math.J. ,25,no.4:619{633,1975. URL http://dml.cz/dmlcz/101357 []MiroslavFiedler. SpecialMatricesandTheirApplicationsinNumericalMathematics .Doveredition,Boston,2008. []IgorFischerandJanPoland.Newmethodsforspectralclustering.In TechnicalReportNo.IDSIA1204 .DalleMolleInstituteforArticialIntelligence, 2004.URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1. 70.1339 []JoelFriedman.Somegeometricaspectsofgraphsandtheireigenfunctions. Duke MathJ. ,69:487{525,1993.doi:10.1215/S0012709493069219. []GrahamM.L.GladwellandH.Zhu.Courant'snodallinetheorem anditsdiscretecounterparts. Q.JlMech.Appl.Math. ,55:1{15,2002. doi:10.1093/qjmam/55.1.1. []LeoGrady.Graphanalysistoolboxmatlabcode. http://eslab.bu.edu/ software/graphanalysis/ ,August2003. []LeoGradyandEricL.Schwartz.Isoperimetricgraphpartitioningforimage segmentation. IEEETrans.onPat.Anal.andMach.Int ,28:469{475,2006. doi:10.1109/TPAMI.2006.57. []DavidH.Griel. AppliedFunctionalAnalysis .HalstedPress,NewYork,1981. []JiMingGuo.Thealgebraicconnectivityofgraphsunderperturbation. LinearAlgebraanditsApplications ,433:1148{1153,2010.ISSN00243795. doi:10.1016/j.laa.2010.04.046. []UlrichHetmaniukandRichLehoucq.Basisselectioninlobpcg. J.Comput. Phys. ,218:324{332,2006.doi:10.1016/j.jcp.2006.02.007. []DesmondJ.Higham,GabrielaKalna,andMillaKibble.Spectralclusteringand itsuseinbioinformatics. JournalofComputationalandAppliedMathematics 204:25{37,2007.doi:10.1016/j.cam.2006.04.026. 117
PAGE 130
[]RogerA.HornandCharlesR.Johnson. MatrixAnalysis .CambridgeUniversity Press,NewYork,NY,2005. []DavidJerisonandCarlosKenig.Uniquecontinuationandabsenceofpositiveeigenvaluesforschrodingeroperators. Ann.Math. ,121:159{268,1999. doi:10.2307/1971205. []ClaesJohnson. NumericalSolutionsofPartialDierentialEquationsbythe FiniteElementMethod ,chapter24.CambridgeUniversityPress,Cambridge, 1987. []RaviKannan,SantoshVempala,andAdrianVetta.Onclusterings: Good,badandspectral. JournaloftheACM ,51no.3:497{515,2004. doi:10.1109/SFCS.2000.892125. []AndrewV.Knyazev.Preconditionedeigensolvers{anoxymoron? Electron. Trans.Numer.Anal. ,7:104{123,1998.URL http://etna.mcs.kent.edu/vol. 7.1998/pp104123.dir/pp104123.pdf []AndrewV.Knyazev.Towardtheoptimalpreconditionedeigensolver:Locally optimalblockpreconditionedconjugategradientmethod. SIAMJ.Sci.Comput 23:517{541,2001.doi:10.1137/S1064827500366124. []AndrewV.KnyazevandMericoE.Argentati.Implementationofapreconditionedeigensolverusinghypre.TechnicalReportUCDCCM220,Center forComputationalMathematics,UniversityofColoradoDenver,2005.URL http://math.ucdenver.edu/ccm/reports/rep220.pdf []AndrewV.Knyazev,M.E.Argentati,I.Lashuk,andE.E.Ovtchinnikov.Block locallyoptimalpreconditionedeigenvaluexolversblopexinhypreandpetsc. SIAMJ.Sci.Comput ,29:2224{2239,2007.doi:10.1137/060661624. []SandiaNationalLaboratories.Thetrilinosproject. http://trilinos.sandia. gov/ []AnnaMatsekh,AlexeiSkurikhin,LakshmanPrasad,andEdwardRosten.Numericalaspectsofspectralsegmentation.In AppliedParallelandScienticComputing ,volumeLNCS7133,pages193{203,2012.doi:10.1007/9783642281518 19. []MarinaMeilaandJianboShi.Learningsegmentationbyrandomwalks.In InAdvancesinNeuralInformationProcessing ,pages470{477.MITPress, 2000.URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1. 33.2235 []MarinaMeilaandJianboShi.Arandomwalksviewofspectralsegmentation. In AIandSTATISTICSAISTATS2001 ,2001.URL http://citeseerx.ist. psu.edu/viewdoc/summary?doi=10.1.1.33.1501 118
PAGE 131
[]BoazNadler,StphaneLafon,RonaldR.Coifman,andIoannisG.Kevrekidis. Diusionmaps,spectralclusteringandeigenfunctionsoffokkerplanckoperators. In inAdvancesinNeuralInformationProcessingSystems18 ,pages955{962. MITPress,2005.doi:10.1016/j.acha.2005.07.004. []AndrewY.Ng,MichaelI.Jordan,andYairWeiss.Onspectralclustering:Analysisandanalgorithm.In ADVANCESINNEURALINFORMATIONPROCESSINGSYSTEMS ,pages849{856.MITPress,2001.URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.8100 []PekkaOrponenandSatuElisaSchaeer.Localclusteringoflargegraphsby approximateedlervectors.In ProceedingsoftheFourthInternationalWorkshoponEcientandExperimentalAlgorithmsWEA05,volume3505ofLectureNotesinComputerScience ,pages524{533.SpringerVerlagGmbH,2005. doi:10.1007/11427186 45. []BeresfordN.Parlett. TheSymmetricEigenvalueProblem .Prentice{Hall,Inc., EnglewoodClis,N.J.,1980.doi:10.1137/1.9781611971163. []EitanSharon,MeiravGalun,DahliaSharon,RonenBasri,andAchiBrandt. Hierarchyandadaptivityinsegmentingvisualscenes. Nature ,442:810{813, 2006.doi:10.1038/nature04977. []JianboShiandJitendraMalik.Normalizedcutsandimagesegmentation. IEEE TransactionsonPatternAnalysisandMachineIntelligence ,22:888{905,August2000.doi:10.1109/ICIP.1998.723676. []AndreasStathopoulosandJamesR.McCombs.PRIMME:PReconditionedIterativeMultiMethodEigensolver:Methodsandsoftwaredescription. ACMTransactionsonMathematicalSoftware ,37:21:1{21:30,April2010. doi:10.1145/1731022.1731031. []MechthildStoerandFrankWagner.Asimplemincutalgorithm. J.ACM ,44 :585{591,July1997.ISSN00045411.doi:10.1145/263867.263872. []UlrikevonLuxburg.Atutorialonspectralclustering. StatisticsandComputing 17:395{416,2007.doi:10.1007/s112220079033z. []YairWeiss.Segmentationusingeigenvectors:Aunifyingview. InternationConferenceonComputerVision ,pages974{982,September1999. doi:10.1109/ICCV.1999.790354. []ScottWhiteandPadhraicSmyth.Aspectralclusteringapproachto ndingcommunitiesingraphs.In SiamConferenceonDataMining ,2005.URL http://wwwstat.stanford.edu/ ~ owen/courses/315c/ readings/Padraicsiam_graph_clustering.pdf 119
PAGE 132
[]LihiZelnikmanorandPietroPerona.Selftuningspectralclustering.In AdvancesinNeuralInformationProcessingSystems17 ,pages1601{1608.MIT Press,2004.URL http://www.vision.caltech.edu/lihi/Publications/ SelfTuningClustering.pdf []ShuBOZhang,SongYuZhou,JianGuOHe,andJianHuangLai.Phylogeny inferencebasedonspectralgraphclustering. JournalofComputationalBiology 18no.4:627{637,2011.doi:10.1089/cmb.2009.0028. 120

