On accelerating the nonsymmetric Eigenvalue problem in multicore architectures

Material Information

On accelerating the nonsymmetric Eigenvalue problem in multicore architectures
Nabity, Matthew W. ( author )
Place of Publication:
Denver, CO
University of Colorado Denver
Publication Date:
Physical Description:
1 electronic file.


Subjects / Keywords:
Eigenvalues -- Mathematical models ( lcsh )
Eigenvalues -- Data processing ( lcsh )
Nonsymmetric matrices ( lcsh )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


Thesis (Ph.D.)--University of Colorado Denver. Applied mathematics
Includes bibliographic references.
System Details:
System requirements: Adobe Reader.
General Note:
Department of Mathematical and Statistical Sciences
Statement of Responsibility:
by Matthew W. Nabity.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
880144765 ( OCLC )


Matthew W. Nabity
M.S., University of Colorado Boulder, 2003
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Applied Mathematics

This thesis for the Doctor of Philosophy degree by
Matthew W. Nabity
has been approved
Julien Langou, Advisor
Karen Braman
Lynn Bennethum
Jack Dongarra
Jan Mandel
July 12, 2013

Nabity, Matthew W. (Ph.D., Applied Mathematics)
On Accelerating the Nonsymmetric Eigenvalue Problem in Multicore Architectures
Thesis directed by Associate Professor Julien Langou
Scientific computing relies on state of the art algorithms in software libraries and
packages. Achieving high performance in today's computing landscape is a challenging
task as modern computer architectures in High-Performance Computing (HPC)
are increasingly incorporating multicore processors and special purpose hardware,
such as graphics processing units (GPUs). The emergence of this heterogeneous envi-
ronment necessitates the development of algorithms and software packages that can
fully exploit this structure. Maximizing the amount of parallelism within an algorithm
is an essential step towards high performance.
One of the fundamental computations in numerical linear algebra is the compu-
tation of eigenvalues and eigenvectors. Matrix eigenvalue problems can be found in
many applications. The algorithm of choice depends on the nature of the matrix.
For sparse matrices, there are several viable approaches, but, for dense nonsymmetric
matrices, one algorithm remains the method of choice. This is the QR algorithm.
Unfortunately, on modern computer platforms, there are limitations to this approach
in terms of parallelism and data movement. Current implementations do not scale
well or do not take full advantage of the heterogeneous computing environment.
The goal of this work is to examine nonsymmetric matrix eigenvalue problems
in the context of HPC. In Chapter 2, we examine tile algorithms and the imple-
mentation of block Arnoldi expansion in the context of multicore. Pseudocodes and
implementation details are provided along with performance results.
In Chapter 3, we examine various algorithms in the context of computing a partial
Schur factorization for nonsymmetric matrices. We examine several iterative
approaches and present implementations of specific methods. The methods studied
include a block version of explicitly restarted Arnoldi with deflation, a block extension
of Stewart's Krylov-Schur method, and a block version of Jacobi-Davidson. We
present a new formulation of block Krylov-Schur that is robust and achieves improved
performance for eigenvalue problems with sparse matrices. We experiment with dense
matrices as well.
We expand on our work and present a C code for our block Krylov-Schur approach
using LAPACK routines in Chapter 4. The code is robust and represents a first step
towards an optimized version. Our code can use any desired block size and compute
any number of desired eigenvalues. Detailed pseudocodes are provided along with a
theoretical analysis.
The form and content of this abstract are approved. I recommend its publication.
Approved: Julien Langou

To Blake and Breeann

First I want to thank my advisor Julien Langou for everything. His encouragement
and guidance over the years have made this work possible. I want to thank my
committee members for their patience and support: Professor Lynn Bennethum for
guiding this process by being chair, Professor Karen Braman for the discussions over
the years and for being a friendly face at conferences, Professor Jack Dongarra for
the summer research opportunities at the Innovative Computing Laboratory (ICL) at
the University of Tennessee Knoxville (UTK), and Professor Jan Mandel for pushing
me as a student and researcher.
The work in this thesis was partially supported by the National Science Founda-
tion Grants GK-12-0742434, NSF-CCF-1054864, and by the Bateman family in the
form of the Lynn Bateman Memorial Fellowship.
This research benefited greatly by access to computing resources at the Center
for Computational Mathematics (CCM) at the University of Colorado Denver and
the ICL at UTK. The colibri cluster, NSF-CNS-0958354, was used for some of the
experimental results presented in this work.
The journey was certainly enriched by the friendship of my fellow graduate stu-
dents both at the University of Colorado Boulder and here in the Mathematics De-
partment at the University of Colorado Denver.
Finally, I would not be here without the support of my family. I thank my parents
for supporting me as I followed my own path. I owe a great deal to my brother Paul
for constantly pushing his older brother to keep up with him. Lastly, finishing this
work would not have been possible without the constant support of my Breeann.

Figures
Tables
1. Introduction
1.1 The Computing Environment
1.2 The Standard Eigenvalue Problem
1.3 Algorithms
1.4 Contributions
2. Tiled Krylov Expansion on Multicore Architectures
2.1 The Arnoldi Method
2.2 Tiled Arnoldi with Householder
2.3 Numerical Results
2.4 Conclusion and Future Work
3. Alternatives to the QR Algorithm for NEP
3.1 Iterative Methods
3.1.1 Arnoldi for the nonsymmetric eigenvalue problem and IRAM
3.1.2 Block Arnoldi
3.1.3 Block Krylov-Schur
3.1.4 Block Jacobi-Davidson
3.2 Numerical Results
3.3 Conclusion and Future Work
4. Block Krylov-Schur with Householder
4.1 The Krylov-Schur Algorithm
4.2 The Block Krylov-Schur Algorithm
4.3 Numerical Results
4.4 Conclusion and Future Work
5. Conclusion
References

1.1 Data structure for tiled matrix
1.2 The Hessenberg structure disturbed
1.3 Chasing the bulge
1.4 Scalability of ZGEES
2.1 Compact storage of the Arnoldi decomposition
2.2 Tiled QR with nt = 5 using xTTQRT
2.3 Modification of QUARK_CORE_ZGEQRT for sub-tiles
2.4 Performance comparison for a dense (smaller) matrix
2.5 Performance comparison for a tiled dense (larger) matrix
2.6 Performance comparison for a tiled tridiagonal matrix
2.7 Performance comparison for a tiled diagonal matrix
3.1 Block Krylov-Schur Decomposition
3.2 Expanded Block Krylov Decomposition
3.3 Complete Spectrum of TOLS2000
3.4 MVPs versus dimension of search subspace
3.5 Iterations versus dimension of search subspace for block Krylov-Schur method
3.6 MVPs versus number of desired eigenvalues for block Krylov-Schur method
3.7 Iterations versus size of matrix for block Krylov-Schur with b = 5, ks = 40 and kf = 75
4.1 Typically, implementations of iterative eigensolvers are not able to compute n eigenvalues for an n × n matrix. Here is an example using Matlab's eigs, which is based on ARPACK
4.2 Structure of the Rayleigh quotient
4.3 Structure of block Krylov-Schur decomposition
4.4 Subspace initialization and block Krylov-Schur decomposition
4.5 Expanded block Krylov decomposition
4.6 Truncation of block Krylov-Schur decomposition
4.7 Scalability experiments for our block Krylov-Schur algorithm to compute the five largest eigenvalues of a 4,000 × 4,000 matrix
4.8 Scalability experiments for LAPACK's ZGEES algorithm to compute the full Schur decomposition of a 4,000 × 4,000 matrix

1.1 Available Iterative Methods for the NEP
3.1 Software for comparison of iterative methods
3.2 Ten eigenvalues of CK656 with largest real part
3.3 Computing 10 eigenvalues for CK656
3.4 Computing 10 eigenvalues for CK656, expanded search subspace
3.5 Computing 10 eigenvalues for CK656, ks = 36
3.6 Summary of results presented by Jia [38]
3.7 Computing three eigenvalues with largest imaginary part for TOLS2000
3.8 Summary of results for HOR131
3.9 Computing 8 eigenvalues with largest real part for HOR131, Krylov
3.10 Computing 8 eigenvalues with largest real part for HOR131, JD
3.11 Computing 5 eigenvalues with smallest real part for random matrix

1. Introduction
This work is about computing with matrices, primarily those with complex entries.
The field of complex numbers is denoted by C, so the set of n × 1 matrices,
usually called column vectors, is denoted by C^n, and the set of m × n matrices is
denoted by C^{m×n}. Our main focus is on properties of square matrices, that is,
matrices in C^{n×n}, and the eigenvalues and corresponding invariant subspaces, or
eigenvectors, of square matrices. There
are several types of matrices whose structure provides desirable properties one may
exploit when formulating algorithms. These properties may lead to attractive nu-
merical properties, such as stability, or attractive computational properties, such as
efficient storage. Much has been done to formulate algorithms for the computation of
eigenvalues and invariant subspaces for Hermitian matrices and symmetric matrices,
see [63] and [71] for details. Many other structures that induce particular eigenvalue
properties, such as block cyclic matrices, Hamiltonian matrices, orthogonal matrices,
symplectic matrices and others have been studied extensively in [43, 70]. Our com-
putational setting is problems involving nonsymmetric matrices which have no such
underlying structure. These matrices may range from extremely sparse, that is,
consisting primarily of zeros, to dense (i.e., not sparse). Eigenvalue problems with
nonsymmetric matrices show up in many branches of the sciences, and numerous
collections of such matrices from real applications are maintained in efforts such as
the Matrix Market [12] and the University of Florida Sparse Matrix Collection [22].
Example applications in the NEP Collection in Matrix Market include computational
fluid dynamics, several branches of engineering, quantum chemistry, and hydrodynamics.
We begin with an introduction to the computing environment, a review of the
theoretical foundations from linear algebra, and key algorithmic components. Most
of the information presented in this section is needed introductory material for the
nonsymmetric eigenvalue problem (NEP) and subsequent sections, but it also serves
as an overview of the state of the art with respect to numerical methods for the
standard eigenvalue problem.
1.1 The Computing Environment
State of the art algorithms are the foundation of software libraries and packages
that strive to achieve optimal performance in today's computing landscape. Modern
computer architectures in High-Performance Computing (HPC) are increasingly
incorporating multicore processors and special purpose hardware, such as graphics
processing units (GPUs). The change to this multicore environment necessitates the
development of algorithms and software packages that can take full advantage of this
computational setting. One major challenge is maximizing the amount of parallelism
within an algorithm, but there are many other issues related to designing effective
algorithms which are nicely surveyed in [5].
Our work will make use of several standards in computing. The BLAS (Basic
Linear Algebra Subprograms) library is used to perform essential operations [11].
Operations in numerical linear algebra are divided into three levels of functionality
based on the type of data involved and the cost of the associated operation. Level 1
BLAS operations involve only vectors, Level 2 BLAS operations involve both matrices
and vectors, and Level 3 BLAS operations involve only matrices. Level 3 BLAS
operations are the preferred type of operation for optimal performance. Optimized
BLAS libraries are often provided for specific architectures. The BLAS library can
be multithreaded to make use of the multicore environment.
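The three levels can be illustrated with hand-rolled loops (a sketch for exposition only; an actual application would call an optimized BLAS rather than code like this). Note how the flop-to-data ratio grows from Level 1 to Level 3, which is why Level 3 operations are preferred for performance:

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hand-rolled stand-ins for one routine from each BLAS level
 * (illustrative only; real code would call an optimized BLAS). */

/* Level 1 (vector-vector): y = alpha*x + y, O(n) data, O(n) flops. */
static void axpy(size_t n, double alpha, const double *x, double *y) {
    for (size_t i = 0; i < n; i++) y[i] += alpha * x[i];
}

/* Level 2 (matrix-vector): y = A*x, O(n^2) data, O(n^2) flops. */
static void gemv(size_t n, const double *A, const double *x, double *y) {
    for (size_t i = 0; i < n; i++) {
        y[i] = 0.0;
        for (size_t j = 0; j < n; j++) y[i] += A[i * n + j] * x[j];
    }
}

/* Level 3 (matrix-matrix): C = A*B, O(n^2) data but O(n^3) flops,
 * hence the high flop-to-memory ratio that makes Level 3 fast. */
static void gemm(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            C[i * n + j] = 0.0;
            for (size_t k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
        }
}
```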
LAPACK (Linear Algebra PACKage) is a library based on BLAS that performs
higher level operations common to numerical linear algebra [2]. Routines in LA-
PACK are coded with a specific naming structure where the first letter indicates the
matrix data type, the next two indicate the type of matrix, and the last three indi-
cate the computation performed by the program. We will refer to specific routines
generically using a lower case x in place of the matrix data type, so xGEQRT, which
computes a block QR factorization of a general matrix, may refer to any of the vari-
ants SGEQRT (real), DGEQRT (double), CGEQRT (complex), or ZGEQRT (double
complex). Algorithms in LAPACK are formulated to incorporate Level 3 BLAS op-
erations as much as possible. This is accomplished by organizing algorithms so that
operations are done with panels, either blocks of columns or rows of a matrix, rather
than single columns or rows. While this perspective provides algorithms rich in Level
3 BLAS operations, there are disadvantages in the context of multicore. Memory ar-
chitecture, synchronizations, and limited fine granularity can diminish performance.
ScaLAPACK (Scalable LAPACK) is a library that includes a subset of LAPACK rou-
tines redesigned for distributed memory MIMD (multiple instruction multiple data)
parallel computers [10]. The LAPACK and ScaLAPACK libraries are considered the
standard for high performance computations in dense linear algebra. Both libraries
implement sequential algorithms that rely on parallel building blocks. There are
considerable advantages to reformulating old algorithms and developing new ones to
increase performance on multicore platforms as demonstrated in [19, 18, 53].
PLASMA (Parallel Linear Algebra Software for Multicore Architectures) is a
recent development focusing on tile algorithms, see [19, 18, 53], with the goal of
addressing the performance issues of LAPACK and ScaLAPACK on multicore ma-
chines [1]. PLASMA uses a different data layout subdividing a matrix into square
tiles as demonstrated in Figure 1.1. Operations restricted to small tiles create fine
grained parallelism and provide enough work to keep multiple cores busy. The cur-
rent version of PLASMA, release 2.4.6, relies on runtime scheduling of parallel tasks.
Hybrid environments are developing as well that combine both multicore and other
special purpose architectures like GPUs. Projects such as MAGMA (Matrix Algebra
on GPU and Multicore Architectures) aim to build next generation libraries that fully
exploit heterogeneous architectures [65].

Figure 1.1: Data structure for tiled matrix
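As a rough illustration of the layout in Figure 1.1, the following C sketch copies a column-major matrix into contiguous b × b tiles. The function name and the assumption that b divides n are ours; PLASMA's actual layout translation is more general than this:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a tile layout translation (not PLASMA's actual code):
 * copy an n x n column-major matrix A into nt x nt tiles of size b x b,
 * each tile stored contiguously so tile kernels work on compact blocks.
 * Assumes b divides n for simplicity. */
static void lapack_to_tile(size_t n, size_t b, const double *A, double *T) {
    size_t nt = n / b;                     /* tiles per dimension */
    for (size_t pj = 0; pj < nt; pj++)     /* tile column */
        for (size_t pi = 0; pi < nt; pi++) /* tile row */
            for (size_t j = 0; j < b; j++)
                for (size_t i = 0; i < b; i++) {
                    /* destination tile (pi, pj), column-major inside tile */
                    size_t tile = pj * nt + pi;
                    T[tile * b * b + j * b + i] =
                        A[(pj * b + j) * n + (pi * b + i)];
                }
}
```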
The multicore perspective necessitates the development of algorithms that can
incorporate evolving computational approaches, such as tiled data structures that
facilitate high performance. The work presented in ensuing chapters is focused on
accelerating algorithms for the NEP in the context of multicore and hybrid architec-
tures. We now turn to some basic theory of eigenvalues and an essential component
of state-of-the-art algorithms.
1.2 The Standard Eigenvalue Problem
A nonzero vector v ∈ C^n is an eigenvector, or right eigenvector, of A if Av is a
constant multiple of v. That is, there is a scalar λ ∈ C such that Av = λv. The scalar
λ is an eigenvalue associated with the right eigenvector v. Equivalently, an eigenvalue
of a matrix A ∈ C^{n×n} is a root of the characteristic polynomial,

p_A(λ) = |A − λI| = 0,

where the vertical bars denote the determinant of the matrix A − λI. A nonzero
vector y ∈ C^n is a left eigenvector if it satisfies y^H A = λy^H, where y^H is the conjugate
transpose or Hermitian transpose of y. As our immediate focus is on the former, we
will now refer to all right eigenvectors as simply eigenvectors unless the context is
unclear. It is worth noting that some numerical methods utilize both left and right
eigenvectors. Often the pair (λ, v) is called an eigenpair, and the set of all eigenvalues
is the spectrum of A, denoted by σ(A). The eigenvalue associated with a given
eigenvector is unique, but each eigenvalue has many eigenvectors associated with it,
as any nonzero multiple of v is also an eigenvector associated with λ. Spaces spanned
by eigenvectors remain invariant under multiplication by A as

A span{v} = span{Av} ⊆ span{v}.

This idea can be generalized to higher dimensions. A subspace V ⊆ C^n such that
AV ⊆ V is called a right invariant subspace of A. Again, there is a similar
characterization of a left invariant subspace, but we will now work exclusively with the former
and refer to them simply as invariant subspaces. An eigenvalue decomposition is a
factorization of the form
A = XΛX^{-1} or AX = XΛ, (1.1)
where Λ is a diagonal matrix with eigenvalues λ_1, ..., λ_n on the diagonal and X is a
nonsingular matrix with columns consisting of associated eigenvectors v_1, ..., v_n. A
matrix is said to be nondefective if and only if it can be diagonalized, that is, it has
an eigenvalue decomposition. Often nondefective matrices are called diagonalizable
matrices. While a nice theoretical factorization, eigenvalue decompositions can be
unstable calculations due to the conditioning of the eigenvector basis X. Fortunately
there is a beautiful theoretical and computationally practical result, Schurs unitary
triangularization theorem, which is a staple in linear algebra texts such as Horn and
Johnson [35].
A central result to the computation of eigenvalues and eigenvectors is the Schur
form of a matrix. For any matrix A ∈ C^{n×n}, there exists a unitary matrix Q such that
Q^H A Q = T or AQ = QT, (1.2)

where T is upper triangular. This factorization exists for any square matrix and the
computation of this factorization is stable. More information on stability may be
found in [32, 66] and we will discuss the stability of specific algorithms later, but
for now it will suffice to hold stability as an attractive quality for algorithm design.
The decomposition to Schur form is an eigenvalue revealing factorization as A and
T are unitarily similar and the eigenvalues of A lie on the diagonal of T. The order
of the eigenvalues on the diagonal may be organized by the appropriate choice of
Q. It is often worthwhile to reorder the Schur factorization, for example to improve
stability. A Schur decomposition is not unique as it depends on the order of the
eigenvalues in T and does not account for eigenvalues with multiplicities greater than
one. The associated eigenvectors, or invariant subspaces, may be computed from
the Schur form. This is not an overly complicated computation, but there are some
fundamental issues that can render computations unstable. Our primary focus is at
the level of the Schur factorization and on algorithms designed to compute such a
factorization. The matrix T may be complex when A is real. In this case, it may
be advantageous to work with the real Schur form in which the matrix Q is now
orthogonal and the matrix T is block upper triangular where the diagonal blocks
are of order one or two depending on whether the eigenvalues are real or complex,
respectively. In the case that A is Hermitian, A^H = A, the matrix T is a real diagonal
matrix as the Schur form and the diagonalization of A are the same. As we will see,
many algorithms compute a partial Schur form given by
AQ_k = Q_k T_k, (1.3)
where Q_k ∈ C^{n×k} with k < n has orthonormal columns that span an invariant
subspace and T_k ∈ C^{k×k} is upper triangular or block upper triangular as described above.
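The partial Schur relation (1.3) can be checked directly on a toy example. In the C sketch below (our own illustration), A is already upper triangular, so the first k columns of the identity and the leading k × k block of A satisfy AQ_k = Q_k T_k exactly:

```c
#include <assert.h>
#include <math.h>

/* Illustration of a partial Schur form (1.3) with n = 3, k = 2.
 * A is already upper triangular (its own Schur form), so taking
 * Q_k = first k columns of the identity and T_k = leading k x k
 * block of A gives A*Q_k = Q_k*T_k exactly. */
enum { N = 3, K = 2 };

static double partial_schur_residual(void) {
    double A[N][N]  = {{2, 1, 0}, {0, 3, 1}, {0, 0, 4}};
    double Qk[N][K] = {{1, 0}, {0, 1}, {0, 0}};
    double Tk[K][K] = {{2, 1}, {0, 3}};
    double r = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < K; j++) {
            double aq = 0.0, qt = 0.0;
            for (int l = 0; l < N; l++) aq += A[i][l] * Qk[l][j];
            for (int l = 0; l < K; l++) qt += Qk[i][l] * Tk[l][j];
            r += fabs(aq - qt);        /* accumulate |A*Qk - Qk*Tk| */
        }
    return r;
}
```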

1.3 Algorithms
In any approach, the computation of eigenvalues is necessarily an iterative process.
This follows directly from the formulation of an eigenvalue problem as determining
the roots of the characteristic polynomial, p_A(λ), and Abel's well-known theorem that
there is no general algebraic solution to polynomials of degree five and higher. An
extensive discussion of classic and more recent algorithmic approaches can be found
in [28, 55, 63, 66, 70, 71]. In this section, we survey the major practical computational
approach for computing a full Schur decomposition and the relevant state-of-the-art
implementations.
The most essential algorithm to the NEP is the QR algorithm or QR iteration.
First introduced by Francis [25, 26] and Kublanovskaya [44], the QR algorithm gen-
erates a sequence of orthogonally similar matrices that, under suitable assumptions,
converges to a Schur form for a given matrix. The QR algorithm gets its name from
the repeated computation of a QR factorization during the iteration. In this case,
given an n × n matrix A, a QR factorization is a decomposition of the form A = QR
where Q is an n x n unitary matrix and R is an n x n upper triangular matrix. The
most basic form of the QR algorithm computes a QR factorization, reverses the order
of the factors and repeats until convergence. A more practical approach incorporates
a preprocessing phase and shifts using eigenvalue estimates.
The first phase of a practical version of the QR algorithm requires the reduction
of nonsymmetric A to Hessenberg form H. A matrix H is in upper Hessenberg form
if hij = 0 whenever i > j + 1. Every matrix can be reduced to Hessenberg form
by unitary similarity transformations. This reduction can be computed by using
Householder reflectors as in the Arnoldi procedure further discussed in Chapter 2.
An upper Hessenberg matrix H is in proper upper Hessenberg form, often called
irreducible, if h_{j+1,j} ≠ 0 for j = 1, ..., n − 1. If a matrix is not in proper upper
Hessenberg form, it may be divided into two independent subproblems in which the
matrices are proper upper Hessenberg. We will assume proper upper Hessenberg form
and simply use the term Hessenberg from now on. Reduction to Hessenberg form is
a cost saving measure aimed at reducing the number of flops in the iterative second
phase as working with the unreduced matrix A is prohibitively expensive. Currently,
reduction to Hessenberg form using block Householder reflectors is handled by calling
LAPACK's xGEHRD or ScaLAPACK's PxGEHRD. We will discuss performance
issues of this computation on multicore platforms in Chapter 3.
The second phase of a practical implementation works exclusively with the Hes-
senberg form. The modern implicit QR iteration takes on a much more complicated
structure than its original formulation. We outline the main computational pieces of
one step of the iteration.
Beginning with the Hessenberg matrix A, a select number of shifts or eigenvalue
estimates are generated. Using these k shifts, p_1, ..., p_k, a QR factorization of the
polynomial p(A) = (A − p_1 I) ··· (A − p_k I) is desired, as this spectral transformation will
speed up convergence. Explicit formation of p(A) is cost prohibitive, but computing
p(A)e_1, where e_1 is the first canonical basis vector, is relatively inexpensive. Next, a unitary
Q with first column q_1 = αp(A)e_1, where α ∈ C, is constructed. Performing the
similarity transformation
A → Q^H A Q
disturbs the Hessenberg form as in Figure 1.2. In the next step of the iteration, the
bulge is chased from the top left of the matrix to the bottom right corner returning
the matrix to Hessenberg form as in Figure 1.3. The bulge is eliminated by applying
similarity transformations, such as Householder reflectors, that introduce zeros in
desired locations moving the bulge. This process of disturbing the Hessenberg form
with eigenvalue estimates and chasing the bulge is continued until the desired Schur
form is computed. There have been some major improvements to this process that
form the foundation for state-of-the-art implementations of the implicit QR algorithm.

Figure 1.2: The Hessenberg structure disturbed
Figure 1.3: Chasing the bulge
After the introduction of the implicit shift approach in 1961, several implementations
used single or double shifts, creating multiple 1 × 1 or 2 × 2 bulges. These bulges
were chased one column at a time using mainly Level 1 BLAS operations. In 1987,
a multishift version of the QR algorithm was introduced by Bai and Demmel [8].
Here k simultaneous shifts were used, creating a k × k bulge that was then chased p
columns at a time. This was a significant step in the evolution of the QR algorithm as
the restructured algorithm could be cast in more efficient Level 2 and Level 3 BLAS
operations. Additionally, the k shifts were chosen to be the eigenvalues of the bottom
right k × k principal submatrix, extending what had long been the standard choice of shifts.
Though the multishift QR algorithm was able to be structured in efficient BLAS
operations, the performance for a large number of shifts was lacking. Accurate shifts
accelerate the convergence of the QR algorithm, but when a large number of shifts
were used rounding errors developed degrading the shifts and slowing convergence. In
1992, Watkins [69] detailed this phenomenon of shift blurring. His analysis suggested
that a small number of shifts be used maintaining a very small order bulge to avoid ill-
conditioning and shift blurring. This proved to be an important idea that motivated
a solution to using a large number of shifts and maintaining well focused shifts.
The seminal work by Braman, Byers and Mathias [15, 16] on multishift QR with
aggressive early deflation (AED) introduced two new components that contributed
greatly to the current success of the QR algorithm. Rather than k shifts and a single
bulge as depicted in Figure 1.2, a chain of several tightly coupled bulges, each consist-
ing of two shifts, is chased in the course of one iteration. This idea facilitated the use
of Level 3 BLAS operations for most of the computational work and allowed the use of
a larger number of shifts without the undesirable numerical degradation of shift blur-
ring. Additionally, the use of AED located converged eigenvalues much faster than
earlier deflation strategies which had changed little since the introduction of implicitly
shifted QR. Together, these improvements reduced the number of iterations required
by the QR algorithm greatly increasing overall performance. The LAPACK routine
xHSEQR is the state-of-the-art serial implementation that computes the Schur form
beginning with a Hessenberg matrix. The entire process, reduction to Hessenberg
form and then Schur form, is performed by xGEES.
Another major improvement concerns the nontrivial task of parallelizing the QR
algorithm. Parallel versions of the QR algorithm for distributed memory had been
previously proposed, by Stewart [62], and work had been done on parallelizing the
multishift QR algorithm. The issue of shift blurring forced most efforts to focus on
bulges of order 2 and scalability was still an issue. Issues pertaining to scalability
of the standard double implicit shift QR algorithm were explored by Henry and
van de Geijn [31]. In 1997, an approach using k shifts to form and chase | bulges
in parallel was presented by Henry, Watkins, and Dongarra [30] and subsequently
added to ScaLAPACK. A novel approach based off of the multishift QR with AED
was formulated by Granat, Kagstrom, and Kressner [29]. Here multi-window bulge
chain chasing was formulated along with a parallelized version of AED. The software
presented outperformed the existing ScaLAPACK implementation PxLAHQR.
Improving the QR algorithm continues to be a topic of interest. Recent work by
Braman [14] investigated adjusting the AED strategy to find deflations in the middle
of the matrix. Such a strategy could lead to a new divide and conquer formulation of
the QR algorithm. Even more recent work by Karlsson and Kressner [39] examined
optimal packing strategies for the chain of bulges in an effort to make effective use
of Level 3 BLAS operations. A modified LAPACK implementation was presented
and numerical results demonstrated the success of the approach. Optimally packed
chains of bulges should aid the performance of parallel implementations of the QR
algorithm as well.
Though the QR algorithm is the method of choice when computing a full Schur
decomposition, there are some limitations to this approach in terms of parallelism
and data movement. The current implementation of xGEES does not scale well. An
important aspect of performance analysis is the study of how algorithm performance
varies with problem size, the number of processors, and related parameters. Of particular
importance is the scalability of an algorithm, or how effectively it can use an
increased number of processors. One approach to quantifying scalability is to determine
how execution time varies with the number of available processors. To assess
ZGEES, we recorded the execution time for computing the Schur factorization of an

8,000 × 8,000 matrix for up to 8 cores. The results, along with a curve depicting
perfect scalability, can be seen in Figure 1.4a. We also illustrate the measured speed
up and perfect speed up as the number of cores is increased from 1 to 8 in Figure 1.4b.

Figure 1.4: Scalability of ZGEES. (a) Timing. (b) Speed up.
As depicted in Figure 1.4, ZGEES does not scale well in the context of multicore.
Performance aside, xGEES is the only algorithm available in LAPACK for computing
the Schur form for the NEP, and partial Schur forms for nonsymmetric matrices are
currently unattainable in LAPACK.
In Table 1.1 we list the currently available computational approaches for the NEP
available on various platforms. The methods listed in Table 1.1 are abbreviated as
follows: Arnoldi-based approaches are denoted by (A), implicitly restarted Arnoldi
by (IRAM), Krylov-Schur based approaches by (KS), and Jacobi-Davidson methods
by (JD). For packages based on Arnoldi, method (A), we include all formulations
of Arnoldi such as the use of Chebyshev acceleration, preconditioning with Cheby-
shev polynomials, explicit restarts, and deflation, but not implicit restarts. Of the
methods listed in Table 1.1, the only block Krylov-Schur implementation is part of
the Anasazi package which is part of the Trilinos library. This implementation

Table 1.1: Available Iterative Methods for the NEP

Software   Routine            Method  Blocked  Language    Architecture
ARPACK     Various            IRAM    No       Fortran     Shared, Dist
SLEPc      EPSARNOLDI         A       No       Fortran, C  Shared, Dist
SLEPc      EPSKRYLOVSCHUR     KS      No       Fortran, C  Shared, Dist
SLEPc      EPSJD              JD      No       Fortran, C  Shared, Dist
Anasazi    BlockArnoldi       A       Yes      Fortran, C  Shared, Dist
Anasazi    BlockKrylovSchur   KS      Yes      Fortran, C  Shared, Dist
HSL 2013   EB13               A       Both     Fortran, C  Shared, Dist
uses two orthogonalization schemes, one proposed by Daniel, Gragg, Kaufman and
Stewart (DGKS) [20] and a more recent offering by Stathopoulos and Wu [60] with
the latter as the default setting.
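The DGKS test can be sketched in a few lines. The following is a minimal NumPy illustration of the "twice is enough" reorthogonalization criterion, our own simplified sketch rather than the Anasazi code; the function name and the threshold eta = 1/sqrt(2) are illustrative choices.

```python
import numpy as np

def dgks_orthogonalize(V, w, eta=1.0 / np.sqrt(2.0), max_passes=2):
    """Orthogonalize w against the orthonormal columns of V.
    A second Gram-Schmidt pass is taken when the norm of w drops by more
    than the factor eta, i.e. when heavy cancellation occurred (DGKS test)."""
    h = np.zeros(V.shape[1])
    for _ in range(max_passes):
        norm_before = np.linalg.norm(w)
        c = V.T @ w            # coefficients of w along the basis
        w = w - V @ c
        h = h + c
        if np.linalg.norm(w) > eta * norm_before:
            break              # little cancellation: this pass was enough
    return w, h

# A vector lying almost entirely in span(V) triggers the second pass
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.standard_normal((30, 5)))
w = rng.standard_normal(30) + V @ (1e6 * np.ones(5))
w_perp, h = dgks_orthogonalize(V, w)
print(np.linalg.norm(V.T @ w_perp))   # orthogonal to working precision
```

A single Gram-Schmidt pass would leave a residual component in span(V) proportional to the size of the cancellation; the conditional second pass removes it.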
The goal of this work is to attack the NEP in the context of HPC from a different
angle. Our work concerns the computation of a partial Schur form via standard
iterative techniques. To this end, in Chapter 2 we examine tile algorithms and the
implementation of block Arnoldi expansion in the multicore context of PLASMA.
The process constructs an orthonormal basis for a block Krylov space, and we extend
existing algorithms with our tiled version. Pseudocodes and implementation details
are provided along with performance results.
In Chapter 3, we examine various algorithms in the context of computing a partial
Schur factorization for nonsymmetric matrices. We examine several iterative
approaches and present novel implementations of specific methods. The methods
studied include a block version of explicitly restarted Arnoldi with deflation, a block
extension of Stewart's Krylov-Schur method, and a block version of Jacobi-Davidson.

We compare our implementations to existing unblocked and blocked codes when available.
All work is done in Matlab and extensive numerical experiments are performed. This
work motivates our algorithmic design choices in Chapter 4.
Finally, in Chapter 4 we present a detailed implementation of our block Krylov-
Schur method using Householder reflectors. Our approach features a block algorithm,
the ability to compute a full Schur decomposition, and the novel use of Householder
reflectors consistent with the work in Chapter 3.
In this thesis we will use two distinct ways to parallelize our codes. In Section 2, we
use the task-based scheduler QUARK from the University of Tennessee to parallelize
our tile Arnoldi expansion. The tile Arnoldi expansion is written in terms of sequential
kernels; dependencies between tasks are declared by labelling the variables as
INPUT, INOUT, or OUTPUT; finally, the QUARK scheduler unravels our code, figures
out the task dependencies and exploits any parallelism present in our application. In
Section 4, we obtain parallelism by calling multithreaded BLAS. (This is a similar
parallelism model to the one in the LAPACK library.) In both cases, we relied on
a third party to perform the parallelization per se. Both mechanisms are fairly high
level and are indeed easy to use.
1.4 Contributions
Here we outline the specific contributions of this manuscript and associated work.
In Chapter 2, we present our novel tiled implementation of block Arnoldi with House-
holder reflectors. The Arnoldi computation is an important component of both al-
gorithms designed to solve linear systems and those used to compute eigenvalues. A
great deal of time is spent in the Arnoldi component when working in either com-
putational context and we present a marked improvement in performance with our
tile approach. We managed to merge the orthogonalization with the matrix-vector
product. This has the potential to increase the performance of methods using
block Arnoldi factorizations such as block GMRES and Morgan's GMRES-DR [51].

Additionally, any eigenvalue solver using block Arnoldi stands to benefit from this improvement.
The novel formulation of block Krylov-Schur with Householder in Chapter 3 also
improves upon the state of the art. We present a new algorithm based on Householder
reflectors rather than other orthogonalization schemes. Our robust formulation per-
forms very well in the sparse case when computing partial Schur decompositions. We
present numerical experiments in Matlab in Chapter 3 that suggest our approach
is worth implementing in a compilable programming language.
In Chapter 4 we implement our block Krylov-Schur approach in a C code using
LAPACK subroutines. The code is robust, supports any block size, and can
compute any number of desired eigenvalues. This code is the first step towards an
optimized version that could be released to the scientific computing community.

2. Tiled Krylov Expansion on Multicore Architectures
In this chapter, we present joint work with Henricus Bouwmeester.
Many algorithms that aim to compute eigenvalues and invariant subspaces rely on
Krylov sequences and Krylov subspaces. Additionally, many algorithms computing
solutions to linear systems do as well. Here we consider the computation of an
orthonormal basis for a Krylov subspace in the context of HPC. We are interested in an
algorithm rich in BLAS Level 3 operations that achieves a high level of parallelism. We
review some basic theory of Krylov subspaces and associated algorithms to motivate
our current work. A wealth of information on Krylov subspaces and their connection
to linear systems and eigenvalue problems may be found in [43, 55, 70].
If A ∈ C^{n×n} is a matrix and v ∈ C^n is a nonzero vector, then the sequence
v, Av, A^2 v, A^3 v, ... is called a Krylov sequence. The subspace

    𝒦_m(A, v) = span{v, Av, A^2 v, ..., A^{m-1} v}

is called the mth Krylov subspace associated with A and v, and the matrix

    K_m(A, v) = [v, Av, A^2 v, ..., A^{m-1} v]

is called the mth Krylov matrix associated with A and v. Our current computational
focus is on constructing a basis for the subspace 𝒦_m(A, v) as this Krylov subspace will
play a pivotal role in many linear algebra computations, especially certain numerical
methods for eigenvalue problems. Computing an explicit Krylov basis of the form
K_m(A, v) is not advisable. As m increases, under mild assumptions on the starting
vector, the vectors A^{m-1} v converge to the eigenvector associated with the largest
eigenvalue in magnitude of A (provided that eigenvalue is simple). As m gets larger, the
basis [v, Av, A^2 v, ..., A^{m-1} v] becomes extremely ill-conditioned and, consequently,
much of the information in this basis is corrupted by roundoff errors, as discussed by
Kressner [43]. An elegant algorithm due to W. E. Arnoldi is one way to compute a

basis for the Krylov space with better conditioning. The algorithm has a few variants
which we explore in the next section.
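The ill-conditioning of the explicit Krylov basis is easy to observe numerically. The following NumPy sketch forms K_m explicitly and watches its condition number explode; the diagonal test matrix and the sizes are hypothetical choices of ours, not from the thesis experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
A = np.diag(np.linspace(1.0, 2.0, n))    # hypothetical test matrix
v = rng.standard_normal(n)

conds = {}
for m in (5, 10, 20):
    # Explicit Krylov matrix K_m = [v, Av, ..., A^{m-1} v]
    K = np.column_stack([np.linalg.matrix_power(A, j) @ v for j in range(m)])
    conds[m] = np.linalg.cond(K)
    print(m, conds[m])
```

The columns all drift toward the dominant eigenvector, so the condition number grows rapidly with m, which is exactly why a well-conditioned (orthonormal) basis is computed instead.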
2.1 The Arnoldi Method
In 1951, Walter E. Arnoldi [4], expanding Lanczos's work on eigenvalues of symmetric
matrices, introduced an algorithm that reduced a general matrix to upper
Hessenberg form. Let v_1, v_2, ..., v_m be the result of sequentially orthonormalizing
the Krylov sequence v, Av, ..., A^{m-1} v and let V_m = [v_1, v_2, ..., v_m]. In matrix terms,
Arnoldi's procedure generates a decomposition of the form

    A V_m = V_m H_m + β_m v_{m+1} e_m^H    (2.1)

where H_m ∈ C^{m×m} is upper Hessenberg, β_m is a scalar, V_m has orthonormal columns,
and e_m ∈ C^m is the vector with one in the mth position and zeros everywhere else.
This factorization is called an Arnoldi decomposition of order m, or simply an Arnoldi
decomposition. We can represent this in matrix terms by

    A V_m = V_{m+1} H_{m+1}    (2.2)
Arnoldi suggested that the matrix H_m may contain accurate approximations to the
eigenvalues of A. We will revisit this idea in Chapter 3. The vectors v_1, v_2, ..., v_m
in the Arnoldi decomposition form an orthonormal basis of the subspace in question.
They are orthonormal by construction and a straightforward inductive argument
shows that 𝒦_m(A, v) = span{v_1, ..., v_m}.
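To make the decomposition concrete, here is a minimal NumPy sketch of Arnoldi with modified Gram-Schmidt that verifies relation (2.2) numerically; this is a textbook variant for illustration, not the tiled implementation developed later, and breakdown (a zero subdiagonal) is not handled.

```python
import numpy as np

def arnoldi_mgs(A, v, m):
    """Arnoldi with modified Gram-Schmidt. Returns V (n x (m+1)) with
    orthonormal columns and the (m+1) x m Hessenberg H so that
    A V[:, :m] = V @ H, i.e. relation (2.2)."""
    n = A.shape[0]
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):            # orthogonalize against v_1, ..., v_{j+1}
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)   # breakdown (== 0) not handled here
        V[:, j + 1] = w / H[j + 1, j]
    return V, H

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50))
V, H = arnoldi_mgs(A, rng.standard_normal(50), 10)
# Verify the Arnoldi relation (2.2): A V_m = V_{m+1} H_{m+1}
print(np.linalg.norm(A @ V[:, :10] - V @ H))
```
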
In exact arithmetic, there are several algorithmic variants of Arnoldi's method to
construct this orthonormal basis. We review some well established results on a few
variants in finite precision arithmetic to motivate our current work. One variant is to

use the Gram-Schmidt process to sequentially orthogonalize each new vector against
the previously orthogonalized vectors. The Arnoldi method using Classical Gram-
Schmidt (CGS) to compute an orthonormal basis of 𝒦_m(A, v) given A, v, and m is
given in Algorithm 2.1.1. The CGS variant is unstable. A mathematically equivalent
Algorithm 2.1.1: Arnoldi CGS

Input: A ∈ C^{n×n}, v ∈ C^n and m
Result: Construction of V_{m+1} and H_{m+1}
1  v_1 = v / ||v||_2;
2  for j = 1 : m do
3      h_j = V_j^H A v_j;
4      w = A v_j - V_j h_j;
5      h_{j+1,j} = ||w||_2;
6      if h_{j+1,j} = 0 then
7          stop;
8      v_{j+1} = w / h_{j+1,j};
9      H_{j+1} = [ H_j, h_j ; 0, h_{j+1,j} ];
but numerically more attractive version of CGS is the subtle rearrangement called
Modified Gram-Schmidt (MGS). Either approach requires 2mn2 flops for the matrix-
vector multiplications (we assume the matrix A to be dense) and 2m2n flops for the
vector operations (due to the Gram-Schmidt process). Though MGS is numerically
more attractive than the CGS variant, it still inherits the numerical instabilities of
the Gram-Schmidt process. The orthogonality of the columns of V_m can be severely
affected by roundoff errors. To remedy this, there are a few computationally
attractive alternatives. The vector v_{j+1} can be reorthogonalized against the columns
of V_j whenever one suspects loss of orthogonality may have occurred. This improves

stability, but the process has its limitations and this adds flops. Extensive details on
the Gram-Schmidt process may be found in [45]. Another option is to approach the
process of orthogonalization in an entirely different way.
The final variant under consideration changes the orthogonalization scheme and
uses one of the most reliable orthogonalization procedures, one based on Householder
transformations. As Trefethen and Bau explain it, while the Gram-Schmidt process
can be viewed as a sequence of elementary triangular matrices applied to generate a
matrix with orthonormal columns, the Householder formulation can be viewed as a
sequence of elementary unitary matrices whose product forms a triangular matrix.
The Householder variant has very appealing properties, specifically the use of orthog-
onal transformations guarantees numerical stability. This does come with an increase
in the number of flops. The use of Householder in the context of Arnoldi is backward
stable but requires 4m^2 n - (4/3)m^3 flops (this does not count the 2mn^2 flops for
the matrix-vector multiplications). The Arnoldi procedure with Householder makes
use of reflectors of the form P = I - 2uu^H/(u^H u), which introduce zeros in desired
locations. Walker [67] initially formulated the method in the context of solving large
nonsymmetric linear systems.
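A compact sketch in the spirit of Walker's Householder variant, in NumPy rather than the blocked LAPACK setting used later, illustrates why its orthogonality is so robust: every basis vector is a product of (numerically exact) reflectors applied to a coordinate vector. The helper `house` and the loop structure below are our own simplified choices, and breakdown is not handled.

```python
import numpy as np

def house(x):
    """Normalized Householder vector u with (I - 2 u u^T) x = alpha * e_1."""
    u = x.astype(float).copy()
    u[0] += np.copysign(np.linalg.norm(x), x[0])
    nrm = np.linalg.norm(u)
    return u / nrm if nrm > 0 else u     # breakdown (x = 0) not handled

def arnoldi_householder(A, v, m):
    """Householder Arnoldi sketch: V has orthonormal columns to machine
    precision and A V[:, :m] = V @ H with H (m+1) x m upper Hessenberg."""
    n = A.shape[0]
    U = []                               # reflector vectors u_0, ..., u_m
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    z = v.astype(float).copy()
    for j in range(m + 1):
        u = np.zeros(n)
        u[j:] = house(z[j:])             # P_j zeros z below entry j
        U.append(u)
        h = z - 2.0 * u * (u @ z)
        if j > 0:
            H[:, j - 1] = h[:m + 1]      # column j-1 of the Hessenberg factor
        e = np.zeros(n); e[j] = 1.0      # v_j = P_0 P_1 ... P_j e_j
        for uu in reversed(U):
            e = e - 2.0 * uu * (uu @ e)
        V[:, j] = e
        if j < m:
            w = A @ V[:, j]              # next Krylov vector, pre-reflected:
            for uu in U:                 # z = P_j ... P_0 A v_j
                w = w - 2.0 * uu * (uu @ w)
            z = w
    return V, H

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 40))
V, H = arnoldi_householder(A, rng.standard_normal(40), 8)
print(np.linalg.norm(V.T @ V - np.eye(9)))    # orthonormality
print(np.linalg.norm(A @ V[:, :8] - V @ H))   # Arnoldi relation
```

Unlike Gram-Schmidt, the orthogonality here does not deteriorate with the conditioning of the Krylov vectors, at the cost of the extra flops noted above.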
As we desire an approach rich in BLAS Level 3 operations, we turn our attention
to block methods that use blocks of vectors rather than single vectors. We note
that there are other benefits in using a block approach in the context of an iterative
eigensolver. The iterative eigensolver converges faster and is more robust in the
presence of clustered eigenvalues. This will be examined in Chapter 3. The extension
of the Arnoldi method to a block algorithm is straightforward. Rather than using an
initial starting vector, a block of vectors, V ∈ C^{n×b}, is used and the Arnoldi procedure
constructs an orthonormal basis for the block Krylov subspace

    𝒦_{mb}(A, V) = span{V, AV, A^2 V, ..., A^{m-1} V}.

An Arnoldi decomposition now takes the form

    A W_m = W_m H_m + V_{m+1} H_{m+1,m} E_m^H    (2.3)

where W_m = [V_1, V_2, ..., V_m] with each V_i ∈ C^{n×b}, H_m ∈ C^{mb×mb} is block
upper Hessenberg, and

    E_m = matrix of the last b columns of I_{mb}.

Here I_{mb} is the mb × mb identity matrix. In matrix terms,

    A W_m = W_{m+1} H_{m+1}    (2.4)

where H_{m+1} ∈ C^{(m+1)b×mb} is the block version of our simplified notation given by

    H_{m+1} = [ H_m ; H_{m+1,m} E_m^H ].
A block analog of Algorithm 2.1.1 follows immediately but some variants are
possible depending on concerns for parallelism as detailed in [55]. The block version
of Arnoldi with Householder fits nicely in the context of BLAS Level 3 operations
thanks to the compact WY representation presented in Schreiber and Van Loan [56].
In the compact WY form, a product of b Householder reflectors can be represented as

    Q = I_n - Y T Y^H    (2.5)

where Y ∈ C^{n×b} is a lower trapezoidal matrix and T ∈ C^{b×b} is an upper triangular
matrix. This enables the use of BLAS Level 3 operations. This performance
along with the aforementioned backward stability is why we opt to use Householder
orthogonalization in the Arnoldi method.
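The compact WY identity (2.5) can be checked directly. The sketch below builds a unit lower trapezoidal Y, accumulates T with the Schreiber and Van Loan recurrence, and compares Q = I - Y T Y^T against the explicit product of reflectors; real arithmetic and randomly chosen reflectors, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, b = 12, 4
# b Householder vectors: y_j has zeros above entry j and 1 on the diagonal,
# mimicking the unit lower trapezoidal Y produced by xGEQRT
Y = np.tril(rng.standard_normal((n, b)), -1)
for j in range(b):
    Y[j, j] = 1.0
tau = 2.0 / np.sum(Y * Y, axis=0)        # tau_j = 2 / ||y_j||^2 for a reflector

# Accumulate T column by column (Schreiber/Van Loan recurrence):
# T[:j, j] = -tau_j * T[:j, :j] @ (Y[:, :j]^T y_j)
T = np.zeros((b, b))
for j in range(b):
    T[j, j] = tau[j]
    if j > 0:
        T[:j, j] = -tau[j] * T[:j, :j] @ (Y[:, :j].T @ Y[:, j])

Q_wy = np.eye(n) - Y @ T @ Y.T

# Compare with the explicit product of reflectors P_1 P_2 ... P_b
Q_prod = np.eye(n)
for j in range(b):
    y = Y[:, j]
    Q_prod = Q_prod @ (np.eye(n) - tau[j] * np.outer(y, y))

print(np.linalg.norm(Q_wy - Q_prod))     # agreement to machine precision
```

The payoff is that applying Q or Q^H becomes two matrix-matrix products with Y and T, which is exactly what makes the Householder variant rich in Level 3 BLAS.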

2.2 Tiled Arnoldi with Householder
Block formulations of the Arnoldi method are not new. Morgan [51] developed
one in the context of solving linear systems with his introduction of GMRES-DR
that uses Ruhe's variant of block Arnoldi. Explicit formulation of Ruhe's variant can
be found in [55]. In the context of the NEP, Möller [49] and Baglama [7] offer different
implementations for each of their approaches. We present a new implementation with
the focus on performance in the context of multicore architectures that works on tiles.
This also sets the foundation for our ensuing work on algorithms for the NEP.
The algorithmic formulation of Arnoldi with Householder follows directly from
employing the compact WY representation. To simplify the presentation we adopt a
Matlab style notation. It will be convenient to refer to locations within a matrix
as follows: for an n × n matrix A, A(b+1 : n, 1 : b) denotes the submatrix consisting
of the first b columns and last n - b rows. As we are working with blocks, it will
simplify the notation to refer to the jth block of rows or columns within a matrix. For
example, for the jth column block of W_m = [V_1, V_2, ..., V_m] we may write W_m(:, {j})
as W_m(:, {j}) = W_m(1 : n, (j-1)b + 1 : jb).
Algorithm 2.2.1 outlines the essential steps to compute an orthonormal basis for
𝒦_{mb}(A, U) given A ∈ C^{n×n}, starting block U ∈ C^{n×b} and m. Algorithm 2.2.1
is formulated to employ specific existing computational kernels in LAPACK 3.4.1,
as denoted in parentheses next to each major computation. We will discuss the
LAPACK subroutines used as they form the basis for our new implementation. The
central computational kernel in Algorithm 2.2.1 is the QR factorization. The blocked
QR factorization with compact WY version of Q = I_n - Y T Y^H is accomplished
by the LAPACK xGEQRT subroutine. The call to xGEQRT constructs a compact
WY representation of b Householder reflections that introduce zeros in b consecutive

Algorithm 2.2.1: Block Arnoldi HH

Input: A, U and m
Result: Construction of Q_{m+1} and H_{m+1} such that A Q_m = Q_{m+1} H_{m+1}
1  U = AU (xGEMM); Q = I_{n×mb}, the first mb columns of I_n;
2  for j = 0 : m - 1 do
3      Compute the QR factorization of U(jb + 1 : n, :) (xGEQRT),
4          generating T and storing the reflectors in V(:, {j + 1});
5      Accumulate reflectors to explicitly build next block of Q (xGEMQRT):
6      for k = j + 1 : -1 : 1 do
7          Q(:, {j + 1}) = Q(:, {j + 1}) - V(:, {k}) T(:, {k}) V(:, {k})^H Q(:, {j + 1});
8      if j < m - 1 then
9          U = A Q(:, {j + 1}) (xGEMM);
10         Update U with reflectors from previous columns (xGEMQRT):
11         for k = 1 : j + 1 do
12             U = U - V(:, {k}) T(:, {k})^H V(:, {k})^H U;
columns. The Y factor in this case is unit lower triangular by construction, and to
save on storage, the reflectors that construct the Y factor, aside from the 1's on
the diagonal, are stored in the zeroed-out area of the matrix passed to xGEQRT.
We denote the essential pieces of the Y factor that we must store by V_{m+1}. This
allows for compact storage of the H_{m+1} factor and the essential pieces of Y in the
matrix V_{m+1}, as seen in Figure 2.1. For each block in the Krylov sequence, an upper
triangular block reflector is computed and the final T generated here has the form
T = [T_1, T_2, ..., T_m] where each T_j is a b × b upper triangular matrix. It is worth
noting that the T factor here has a slightly different form than a full mb × mb factor
in a compact WY representation as LAPACK is designed to efficiently exploit the
block structure. Applying the compact WY version of the reflectors efficiently when

Figure 2.1: Compact storage of the Arnoldi decomposition
multiplying by Q or Q^H is handled by xGEMQRT which uses the reflectors stored
in the lower part of V_{m+1} and the triangular factors in T. Multiplication of the new
block of vectors by matrix A requires the BLAS xGEMM subroutine.
LAPACK subroutines are designed to use blocks of columns, or blocks of rows,
to cast the operations as matrix multiplications rather than vector operations. This
facilitates Level 3 BLAS operations, but there are issues that limit performance. Of
note are the synchronizations performed at each step and the lack of fine grain tasks
for increased parallelism. Multithreaded BLAS can be utilized, but this is often not
enough to ensure that these algorithms perform optimally in the context of multicore.
To take full advantage of emerging computer architectures, we must reformulate
our block algorithm. Multicore architectures require algorithms that can take full
advantage of their structure. To this end, algorithms have been moving towards the
class of so-called tile algorithms [19, 18, 53] and are available as part of efforts like
the PLASMA library. The data layout in the context of PLASMA requires a matrix
to be reordered into smaller regions of memory, called tiles. A matrix A ∈ C^{n×n}
can be subdivided into tiles ranging in size from 1 × 1 tiles to n × n tiles, but once
set, the tile size is fixed for the duration of the algorithm. Finding the tile size that
achieves optimal performance requires a bit of tuning, so for now, we will assume A
is decomposed into n_t tiles of size n_b × n_b so that n = n_t n_b. We will add one more
notational convenience for our algorithms and let A_{i,j} denote the n_b × n_b tile in the
ith row and jth column of the tiling of A. It is important to note that currently
only square tiles are permitted in PLASMA and our application will require us to
make accommodations for various tile sizes.
tiles and the subdivision of computational tasks adds a new dynamic to our problem
in that these tasks must be organized. As detailed by Bouwmeester [13], working
with tiles forces us to consider several specific features. Tiles can be in three states,
namely zero, triangle or full, and introducing zeros in a matrix can be accomplished
by three different tasks. Tile algorithms allow the separation of factorization stages
and corresponding update stages whereas these are considered a single stage in coarse-
grain algorithms, such as those in LAPACK.
Organizing the factorization tasks and the decoupled associated updates in dif-
ferent ways leads to different algorithmic formulations and possibly different per-
formance. Computations are often expressed through a task graph, often called a
Directed Acyclic Graph (DAG), where nodes are elementary tasks that operate on
tiles and edges represent the dependencies. Different algorithmic formulations may
result in different DAGs. To compare various algorithmic formulations, we look at
the respective DAGs and compute the critical path. The critical path is the longest
necessary path from start to finish and represents a sequence of operations that must
be carried out sequentially. Analyzing DAGs and critical paths allows for the selection
of optimal parallelization strategies. Much work is currently being devoted to devel-
oping scheduling strategies that improve performance. After introducing the essential
computational kernels of our algorithm, we will revisit the question of performance.
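As a toy illustration of a critical-path computation, the sketch below takes a small hypothetical DAG (the task names echo the kernels discussed in this chapter, but the graph and the flop weights are invented) and finds the heaviest chain by a memoized longest-path recursion.

```python
from functools import lru_cache

# Hypothetical task DAG: edges map a task to its successors; the weights
# (in flops) are made up for illustration only
graph = {"GEQRT1": ["TTQRT", "UNMQR1"],
         "GEQRT2": ["TTQRT", "UNMQR2"],
         "TTQRT":  ["TTMQR"],
         "UNMQR1": ["TTMQR"],
         "UNMQR2": ["TTMQR"],
         "TTMQR":  []}
cost = {"GEQRT1": 4, "GEQRT2": 4, "TTQRT": 2,
        "UNMQR1": 6, "UNMQR2": 6, "TTMQR": 3}

@lru_cache(maxsize=None)
def weight_from(task):
    """Weight of the heaviest chain of dependent tasks starting at `task`."""
    succ = graph[task]
    return cost[task] + (max(weight_from(s) for s in succ) if succ else 0)

critical_path = max(weight_from(t) for t in graph)
print(critical_path)   # GEQRT1 -> UNMQR1 -> TTMQR gives 4 + 6 + 3 = 13
```

No schedule, however many processors it uses, can finish in less time than the critical path, which is why the weighted critical path appears in the performance bounds later in this chapter.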

Extending our algorithm from the LAPACK framework to that of PLASMA
requires replacing each major kernel with a tiled analog. The block QR factorization
with compact WY form performed by xGEQRT (line 3 of Algorithm 2.2.1) will be
replaced with a tiled QR factorization. Tiled QR factorizations for multicore were
introduced in [19, 18, 53] and recently improved and analyzed by Bouwmeester [13].
Line 3 of Algorithm 2.2.1 requires the QR factorization of a tall and skinny matrix,
one that has many more rows than columns. As we will see later in Chapter 4, the
number of columns b in a block Krylov subspace method will be set by practical
concerns and, in general, it will be small, much smaller than the recommended tile
size n_b.
The basic structure of our approach, Algorithm 2.2.2, remains the same but the
details of each step require a bit of discussion. From A ∈ C^{n×n} and a starting block
U ∈ C^{n×b}, we compute a block Arnoldi decomposition of size m. Our approach
begins with the n × mb identity matrix in Q. Next we compute the product AU
as depicted in Line 3 of Algorithm 2.2.2. Here we explicitly list the double loop to
give the flavor of tile operations. Continuing with a detailed tile perspective is not
feasible for a readable pseudocode, so we expand on each step with a more detailed
description.
As before, the primary computation in Algorithm 2.2.2 is the QR factorization. In
the previous case, the LAPACK routine xGEQRT zeroed out b columns at a time and
computed the corresponding block Householder reflector in compact WY form. In the
tiled version, the structure of the QR factorization changes with the opportunities
to increase parallelism and this leads to new computational tasks among the tiles.
For our application, we will consider the case where we wish to compute the QR
factorization of a matrix V ∈ C^{n_t n_b × b}, that is, V is comprised of one column of n_t

Algorithm 2.2.2: Tiled block Arnoldi with HH

Input: A, U, m, b, and n_t
Result: Construction of Q_{m+1} and H_{m+1} such that A Q_m = Q_{m+1} H_{m+1}
1  Q = I_{n×mb}, the first mb columns of I_n; Q(:, {1}) = U;
2  for k = 0 : m - 1 do
3      for i = 0 : n_t - 1 do
4          for j = 0 : n_t - 1 do
5              U_i = U_i + A_{i,j} Q_{j,k} (xGEMM);
6      Update U with reflectors from previous factorizations (xUNMQR, xTTMQR);
7      Compute the QR factorization of V(kb + 1 : n, {k + 1}) (xGEQRT, xTTQRT);
8      Accumulate reflectors to explicitly build next block of Q (xTTMQR and xUNMQR);
tiles each of size n_b × b as in the following:

    V = [ V_{1,1} ; V_{2,1} ; ... ; V_{n_t,1} ],  with V_{i,1} ∈ C^{n_b×b}, i = 1, ..., n_t.
Computing the QR factorization of a column of tiles allows for different algorithmic
formulations based on the ordering of computations on each tile. Using existing
PLASMA routines, we could first compute the QR factorization of V_{1,1} so that we
obtain

    V = [ T_{1,1} ; V_{2,1} ; ... ; V_{n_t,1} ],
where T_{1,1} is upper triangular with the Householder reflectors stored below the diagonal,
as is standard in LAPACK. Then we could use T_{1,1} to systematically zero out
the remaining tiles. Each of these elimination steps has the form of a triangle on
top of a square and can be accomplished by PLASMA's xTSQRT routine. Updates
would then require both xUNMQR and xTSMQR. The process just outlined can be
described by using a flat tree beginning at the top. This approach is sequential and
does not parallelize. Fortunately, the tiled environment allows for more choices.
For example, we could proceed by first computing the QR factorizations of each of
the n_t tiles, V_{i,1} with i = 1, ..., n_t, using n_t calls to PLASMA's xGEQRT. This results
in a column of n_t upper triangular factors with respective Householder reflectors stored
below the diagonal of each of the n_t tiles. Each individual tile now has the same
structure as the output of LAPACK's xGEQRT. Next, we could proceed by using the
triangular factors to eliminate the triangular factors directly below. This approach
gives rise to a computational kernel, PLASMA's xTTQRT detailed in [13], that zeroes
a triangle with a triangle on top. Repeating this step we could proceed left to right
as depicted in Figure 2.2 where the steps are organized using a binomial tree. The
QR factorization depicted in Figure 2.2 requires different routines to update using the
Householder reflectors. Each call to xGEQRT generates compact WY components
that may be applied by calling PLASMA's xUNMQR. In the case of xTTQRT, the
corresponding update is achieved by the PLASMA routine xTTMQR. We will explore
some variants of Algorithm 2.2.2 shortly, but comprehensive information on various
elimination lists and algorithmic formulations of tiled QR may be found in [13].
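The two-stage structure just described, independent tile factorizations followed by pairwise triangle eliminations, can be mimicked in NumPy. The sketch below is an R-only stand-in for the xGEQRT/xTTQRT kernels: it does not store reflectors or form Q, the per-tile QR is delegated to `np.linalg.qr`, and the pairing realizes a binomial elimination tree as in Figure 2.2.

```python
import numpy as np

def tiled_tsqr(V, nb):
    """Binomial-tree QR of a column of tiles (TSQR); returns only the b x b
    R factor. A NumPy stand-in for the xGEQRT/xTTQRT kernel pair."""
    # Stage 1: independent QR of each nb x b tile (the xGEQRT analog)
    Rs = [np.linalg.qr(V[i:i + nb, :], mode="r")
          for i in range(0, V.shape[0], nb)]
    # Stage 2: pairwise elimination of triangles (the xTTQRT analog),
    # halving the number of surviving triangles at each level of the tree
    while len(Rs) > 1:
        nxt = []
        for i in range(0, len(Rs) - 1, 2):
            nxt.append(np.linalg.qr(np.vstack([Rs[i], Rs[i + 1]]), mode="r"))
        if len(Rs) % 2:
            nxt.append(Rs[-1])          # odd triangle survives to next level
        Rs = nxt
    return Rs[0]

rng = np.random.default_rng(3)
V = rng.standard_normal((5 * 40, 8))    # n_t = 5 tiles of size 40 x 8
R = tiled_tsqr(V, 40)
R_ref = np.linalg.qr(V, mode="r")
# R is unique up to the sign of each row, so compare absolute values
print(np.linalg.norm(np.abs(R) - np.abs(R_ref)))
```

Every stage-1 factorization is independent, and the stage-2 eliminations at each tree level are independent of one another, which is precisely the parallelism the tiled kernels expose.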

Figure 2.2: Tiled QR with nt = 5 using xTTQRT
After computing the QR factorization of the column, we build the next column
block in the matrix Q. To do so, we must apply the reflectors, that is apply the
updates, in reverse order requiring the use of both xTTMQR and then xUNMQR.
By reverse order here we mean that the updates must use the same elimination tree
used in the forward sense. If we use a binomial tree going in the forward direction as
in Figure 2.2 then we must traverse that same tree in the reverse direction. We then
begin the loop again and multiply our newly computed column block of Q by A. This
must then be updated with the reflectors from previous columns using xUNMQR
and then xTTMQR. We are now ready to compute the QR factorization of the next
column of tiles and continue on with the Arnoldi process.
What is not evident in Figure 2.2 is the case when the single block of columns
does not fit nicely in the context of square tiles. As PLASMA requires square tiles,
we had to modify several existing routines so that they could operate on sub-tiles.
Routines that were modified include the QR factorization xGEQRT with associated
update xUNMQR and the zeroing out of a triangle with a triangle xTTQRT with

associated update xTTMQR. An example modification is presented in Figure 2.3 for
the routine QUARK_CORE_ZGEQRT. The last two lines, not including the 0, lock
the tile referenced by Ap and the corresponding Tp tile, since the references to A and T
point within the tile and do not lock the entire tile. We needed to point within
the tile, but that by itself does not indicate that a dependency is needed.
#include "common.h"

void my_QUARK_CORE_zgeqrt(Quark *quark, Quark_Task_Flags *task_flags,
                          int m, int n, int ib, int nb,
                          PLASMA_Complex64_t *A, int lda,
                          PLASMA_Complex64_t *T, int ldt,
                          PLASMA_Complex64_t *Ap, PLASMA_Complex64_t *Tp)
{
    QUARK_Insert_Task(quark, CORE_zgeqrt_quark, task_flags,
        sizeof(int),                      &m,   VALUE,
        sizeof(int),                      &n,   VALUE,
        sizeof(int),                      &ib,  VALUE,
        sizeof(PLASMA_Complex64_t)*nb*nb, A,    INOUT,
        sizeof(int),                      &lda, VALUE,
        sizeof(PLASMA_Complex64_t)*ib*nb, T,    OUTPUT,
        sizeof(int),                      &ldt, VALUE,
        sizeof(PLASMA_Complex64_t)*nb,    NULL, SCRATCH,
        sizeof(PLASMA_Complex64_t)*ib*nb, NULL, SCRATCH,
        sizeof(PLASMA_Complex64_t)*nb*nb, Ap,   INOUT,
        sizeof(PLASMA_Complex64_t)*ib*nb, Tp,   OUTPUT,
        0);
}
Figure 2.3: Modification of QUARK_CORE_ZGEQRT for sub-tiles
2.3 Numerical Results
Here we compare different formulations of our tiled Arnoldi method by adjusting
the underlying elimination list. We assume an unlimited number of processors in this
analysis and investigate a few algorithmic variations. As the number of resources
is unlimited any task may be executed as soon as the dependencies are satisfied.
Algorithm 2.2.2 requires a QR factorization of a single column of tiles but also uses

the updates to explicitly form the Q factor one block column at a time and update
new columns using previously computed reflectors. By using different elimination
trees for the QR decomposition, the tree used for the update changes as well. This
in turn changes the DAG and possibly allows for more interleaving of the various
steps, i.e., the matrix multiplication and factorization, and might also provide better
pipelining of update and factorization trees.
Here we present some numerical experiments comparing our tiled code with two
different trees (binomial and pipeline) to a reference implementation. We present
results for a diagonal matrix, a tridiagonal matrix and a dense matrix.

Figure 2.4: Performance comparison for a dense (smaller) matrix (Tiled Arnoldi
Iteration, Dense; comparison of Reference, Binomial and Pipeline trees; m=2400,
n_b=200, b=50, #iter=6)

The Rooftop Bound is an upper bound on performance on p processors. To calculate the Rooftop

Bound, we use

    Rooftop(p) = min( p, (total # of flops) / (# of flops on the critical path) ) × (performance of one processor).
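Plugging in the numbers quoted below for the dense experiment (4.35 Gflops of total work, a 0.39 Gflop critical path, and 1.757 Gflop/sec per core), a few lines suffice to evaluate the bound; we assume the usual min(p, work/critical-path) form of the Rooftop Bound.

```python
# Rooftop Bound sketch using the figures quoted in the text:
total_flops = 4.35          # Gflops of total work
critical_path = 0.39        # Gflops on the weighted critical path
per_core = 1.757            # Gflop/sec for one core

def rooftop(p):
    """Upper bound (Gflop/sec) on performance with p cores."""
    return min(p, total_flops / critical_path) * per_core

for p in (1, 4, 8, 16, 48):
    print(p, round(rooftop(p), 2))
```

Since total_flops / critical_path is about 11.2, the bound grows linearly up to roughly 11 cores and is flat beyond that, which matches where the measured curves level off.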
In Figure 2.4, A is a dense 2,400 × 2,400 matrix, the block size is b = 50, the initial
block V is 2,400 × 50, and we want to perform 20 Arnoldi iterations so that we obtain
a band Hessenberg matrix of size 1,000 × 1,000 with bandwidth 50. We present a
reference implementation which is a monolithic implementation calling LAPACK and
BLAS. For our tile implementation and this experiment, the tile size for A is n_b = 200,
so that n_t = 12; this choice makes V 12 × 1 in terms of tiles, with tiles of size 200 × 50. As
explained, the matrix V is made of rectangular tiles and PLASMA does not natively
handle rectangular tiles so we needed to hack rectangular matrix support into the
PLASMA library. To obtain square tiles, we could have tiled A with 50 × 50 tiles;
however, we judged that there were too many drawbacks in tiling A with a tile size
equal to the size of the block in the Krylov method: (i) we expect the block size
of block Arnoldi to vary with the iterative method, and we do not want to change the
data layout of A over and over; (ii) the block size of block Arnoldi is determined by
what behaves well for the eigensolver, while the tile size is determined by performance.
In general, we expect the block size of block Arnoldi to be smaller than the tile size.
Given this setup (m = 2,400, b = 50, and #iter = 20), the total work for the Arnoldi factorization is approximately 4.35 Gflops and the length of the weighted critical path for the binomial tree method is 0.39 Gflops. We use the ig multicore machine located in Tennessee for our experiments. The performance of one core is 1.757 Gflop/sec. In Figure 2.4, we plot the Rooftop Bound for the binomial tree. Note that the point at which the performance starts to level off is the same point at which the Rooftop Bound reaches its maximum (in terms of number of processors). We acknowledge that there is still a large gap between the bound and the curves. We would like to have more descriptive upper bounds.

We could not produce a Rooftop Bound for the case where k = 20 and m = 9,600 in Figure 2.5 since we have no closed-form formula for the weighted critical path length and the DAG was too large for the computer to calculate it.

Figure 2.5: Performance comparison for a tiled dense (larger) matrix (Tiled Arnoldi Iteration; comparison of reference, binomial, and pipeline trees; m=9600, mb=200, b=50, #iter=20)

The results for the tridiagonal case can be seen in Figure 2.6 and for the diagonal case in Figure 2.7. In all four cases we conclude that our Arnoldi implementation performs better than the reference implementations.
2.4 Conclusion and Future Work
We have proposed a new algorithm to compute a block Arnoldi expansion with
Householder reflectors on multicore architectures. An experimental study has shown
that our algorithm performs significantly better than our reference implementations.
We would like to obtain a closed-form formula for the critical path of our new algorithm,
and we would like to benchmark our code with sparse matrices.

Figure 2.6: Performance comparison for a tiled tridiagonal matrix (Tiled Arnoldi Iteration; comparison of reference, binomial, and pipeline trees; m=9600, mb=200, b=50, #iter=20)

Figure 2.7: Performance comparison for a tiled diagonal matrix (Tiled Arnoldi Iteration; comparison of reference, binomial, and pipeline trees; m=9600, mb=200, b=50, #iter=20)

3. Alternatives to the QR Algorithm for NEP
In Chapter 2, we studied an efficient tiled Arnoldi expansion for multicore systems. In this chapter, we turn our focus to the computation of eigenvalues and associated invariant subspaces. This chapter motivates our work in Chapter 4, where we focus solely on the block Krylov-Schur method. We are specifically interested in computing a partial Schur decomposition as in Equation 1.3 using an iterative algorithm. Here we detail our block extensions of various approaches and undertake an experimental numerical study of various algorithms in the context of computing any
number of eigenvalues of a nonsymmetric matrix. Our implementations of several
approaches are compared to existing unblocked implementations and blocked codes
when available. All algorithms are implemented in Matlab. When applicable, we
survey current state-of-the-art implementations and related software libraries.
Here we focus strictly on methods that aim to compute eigenvalues of an n x n matrix A by accessing the operator A only through matrix-vector products. In particular, none of the algorithms considered in this chapter reduces the matrix to Hessenberg form. There are several reasons for this design choice. Though the reduction to Hessenberg form is the first phase of practical implementations of the QR algorithm, it is a costly endeavor, in particular in terms of communication and parallelism. Using Householder reflectors and proceeding a column at a time, this computation requires approximately (10/3)n^3 flops and is based mainly on Level 2 BLAS operations.
The accumulation of Householder reflectors into compact WY form [56] can be
used to improve the situation by incorporating Level 3 BLAS operations when pos-
sible. This was described by Dongarra, Hammarling, and Sorensen [23], but perfor-
mance issues still remain in the context of multicore. Recently, Quintana-Ortf and
van de Geijn [54] cast more of the required computations in efficient matrix-matrix
operations achieving significant performance improvements. Yet, 20% of the flops

remain in Level 2 BLAS operations. Howell and Diaa [36] presented an algorithm, BHESS, that uses Gaussian similarity transformations to reduce a general real square matrix to a small-band Hessenberg form. Eigenvalues can then be computed using the bulge-chasing BR iteration or the Rayleigh quotient iteration. The overall cost of the BHESS-BR method was reported to be typically well below that of the standard approach, in which the QR algorithm requires (10/3)n^3 flops for the reduction and 10n^2 to 16n^2 flops per sweep of the ensuing iteration on the Hessenberg factor. The BHESS-BR approach is appropriate for computing nonextremal eigenvalues of mildly nonnormal matrices [36].
A two-staged approach, described by Ltaief, Kurzak, and Dongarra [48], showed
that an initial reduction to block Hessenberg form, also called band Hessenberg as
there is a block or band of subdiagonals rather than one subdiagonal, is efficiently
handled by a tile algorithm variant. Using the tiled approach, their algorithm with Householder reflectors achieves 72% (95 Gflop/s) of the DGEMM peak for a 12,000 x 12,000 matrix on 16 Intel Tigerton 2.4 GHz cores, and most of the operations
are in Level 3 BLAS. The second phase of the proposed method [48] reduces the block
Hessenberg matrix to Hessenberg form using a bulge chasing approach similar to some
extent to what is done in the QR algorithm. The algorithm used in the second phase
does not achieve any comparable performance, mainly due to the inefficiency of the
parallel bulge chasing procedure on multicore architectures. The bulge chasing in
the second phase may benefit from the optimal packing strategy [39], but we do not
investigate this further. Because of these challenges, we turn our focus to iterative
methods (Arnoldi, Jacobi-Davidson) which avoid complete reduction to Hessenberg
form. As we will see though, the implicit QR algorithm will remain an essential piece
of any approach to the NEP.
We note that, if we were to run the Block Arnoldi process presented in Chapter 2
to completion (n steps), we would perform a Block Hessenberg reduction as in [39].
However, the block Hessenberg matrix we obtain is associated with a Krylov space

expansion, and so leading submatrices of this matrix should contain relevant information
on invariant subspaces of the initial matrix.
3.1 Iterative Methods
3.1.1 Arnoldi for the nonsymmetric eigenvalue problem and IRAM
The Arnoldi procedure discussed in Chapter 2 not only produces an orthonormal basis for the Krylov subspace K_m(A, v), but it also generates information about the eigenvalues of A. Though originally developed as a method to reduce a general matrix to upper Hessenberg form, the Arnoldi method may be viewed as the computation of projections onto successive Krylov subspaces. In exact arithmetic, Algorithm 2.1.1 will terminate on line 6 if h_{j+1,j} = 0, as the columns of V_j span an invariant subspace of A. In this case, the eigenvalues of H_j are exact eigenvalues of the matrix A. If it does not terminate early, the algorithm constructs an Arnoldi decomposition of order m given by

A V_m = V_m H_m + β_m v_{m+1} e_m^T. (3.1)
Except for a rank-one perturbation, we have an approximate invariant subspace relationship, that is, A V_m ≈ V_m H_m. From Equation 3.1 and the orthogonality of the columns of V_{m+1}, we have that

V_m^H A V_m = H_m,

and the approximate eigenvalues provided by projection onto the Krylov subspace K_m(A, v) are simply the eigenvalues of H_m. These are often called Ritz values, as this projection can be seen as part of the Rayleigh-Ritz process. Ritz vectors, or approximate eigenvectors of A, are simply the associated eigenvectors of H_m premultiplied by V_m. To find the eigenvalues and eigenvectors of H_m, which is already in upper Hessenberg form, we simply apply a practical version of the implicitly shifted QR algorithm.
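To make the projection concrete, the following pure-Python sketch (illustrative only, not our Matlab implementation) runs two Arnoldi steps on the diagonal matrix A = diag(1, 2, 3) and extracts the Ritz values as the eigenvalues of the resulting 2 x 2 matrix H_2:

```python
import math

def arnoldi_2steps(diag_A, v):
    """Two Arnoldi steps for a diagonal matrix A; returns the 2x2 matrix H."""
    matvec = lambda x: [d * xi for d, xi in zip(diag_A, x)]
    dot = lambda x, y: sum(a * b for a, b in zip(x, y))
    nrm = lambda x: math.sqrt(dot(x, x))
    v1 = [vi / nrm(v) for vi in v]           # normalized starting vector
    w = matvec(v1)
    h11 = dot(v1, w)
    w = [wi - h11 * v1i for wi, v1i in zip(w, v1)]   # orthogonalize
    h21 = nrm(w)
    v2 = [wi / h21 for wi in w]
    w = matvec(v2)
    h12 = dot(v1, w)
    h22 = dot(v2, w)
    return [[h11, h12], [h21, h22]]

H = arnoldi_2steps([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
# Ritz values: eigenvalues of the 2x2 H via the quadratic formula.
tr = H[0][0] + H[1][1]
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
disc = math.sqrt(tr * tr - 4.0 * det)
ritz = sorted([(tr - disc) / 2.0, (tr + disc) / 2.0])
# ritz = [2 - sqrt(2/3), 2 + sqrt(2/3)], inside the spectrum [1, 3]
```

Even after two steps the Ritz values already lean toward the extreme eigenvalues of A, which is the behavior restarting strategies exploit.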

In practice, a suitable order m for desired convergence is not known a priori, and it may not be desirable to store the n x (m+1) matrix V_{m+1}, as m must usually be rather large for an acceptable approximation to be computed. To address this, several restarting strategies have been suggested for how to select a new starting vector v_1. Explicit restarting strategies compute an approximate eigenvector associated with an eigenvalue of interest, say the one with largest real part. If the approximation is satisfactory, the iteration stops; if not, the approximate eigenvector is used as the starting vector for a new mth order Arnoldi factorization. A similar strategy may be used when multiple eigenpairs are desired. One may restart with a linear combination of approximate eigenvectors, or one may use a deflation strategy. Morgan's analysis [50] suggested that restarting with a linear combination is ill-advised unless care is taken to avoid losing accuracy when forming the linear combination. An approach to combining Ritz vectors that prevents loss of accuracy is that of Sorensen, which we will discuss in detail momentarily. As there is no easy way to determine an appropriate linear combination, we opt for a strategy based on deflating eigenpairs. We will expand on this idea when we formulate our block variant of Arnoldi for the NEP.
One of the more successful approaches based on Krylov subspaces is that of Sorensen, the implicitly restarted Arnoldi method (IRAM) presented in [59]. This method uses the implicitly shifted QR algorithm to restart the Arnoldi process. From the decomposition 3.1, for fixed k, m − k shifts, μ_1, ..., μ_{m−k}, are selected and used to perform m − k steps of the implicitly shifted QR algorithm on the Rayleigh quotient H_m. This results in

A V_m^+ = V_m^+ H_m^+ + β_m v_{m+1} e_m^T Q, (3.2)

where V_m^+ = V_m Q, H_m^+ = Q^H H_m Q, and Q = Q_1 Q_2 ··· Q_{m−k}, where each Q_i is the orthogonal matrix associated with one of the m − k shifts. Sorensen observed that the first k − 1 components of e_m^T Q are zero and that equating the first k columns on each

side yields an updated kth order Arnoldi factorization. The updated decomposition is given by

A V_k^+ = V_k^+ H_k^+ + β_k v_{k+1} e_k^T, (3.3)

with updated residual β_k v_{k+1}, and is a restart of the Arnoldi process with a starting vector proportional to p(A)v_1, where p(A) = (A − μ_1 I) ··· (A − μ_{m−k} I). Using this as a starting point, Sorensen continued the Arnoldi process to return to the original mth order decomposition. Sorensen showed this process may be viewed as a truncation of the implicitly shifted QR iteration. Along with this formulation, Sorensen also suggested shift choices that help locate desired parts of the spectrum. Sorensen's approach is the foundation for the ARPACK library of routines that implement IRAM [47]. In Matlab, the function eigs provides the user interface to ARPACK. A parallel version, PARPACK, is available as well, but both offerings compute only a partial Schur decomposition of a general matrix A. As discussed earlier, one of our computational goals is the ability to compute partial and full Schur decompositions using the same approach.
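The effect of the implicit restart can be seen directly on a diagonal example: the filter polynomial p(A) = (A − μ_1 I) ··· (A − μ_{m−k} I) annihilates the eigenvector components associated with the chosen shifts. A small pure-Python sketch (illustrative only, not part of our implementation):

```python
def apply_filter(diag_A, v, shifts):
    """Apply p(A)v = (A - mu_1 I) ... (A - mu_p I) v for a diagonal A."""
    w = list(v)
    for mu in shifts:
        w = [(d - mu) * wi for d, wi in zip(diag_A, w)]
    return w

# A = diag(1, 2, 3, 4); using the unwanted eigenvalues 1 and 2 as exact
# shifts removes those components from the restart vector entirely.
w = apply_filter([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0], [1.0, 2.0])
# w = [0.0, 0.0, 2.0, 6.0]: only components for eigenvalues 3 and 4 survive
```

With inexact shifts the unwanted components are damped rather than eliminated, which is why shift selection matters for convergence.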
3.1.2 Block Arnoldi
As always, we desire to cast our computation in Level 3 BLAS operations as
much as possible for efficiency concerns, but there are other reasons. Block methods
are better suited for handling the computation of clustered or multiple eigenvalues,
and block methods are appropriate when more than one good initial vector is known.
We will examine this more closely in our numerical experiments. Various block approaches to eigenvalue problems may be found in Saad's book [55] or in the template book [71]. In the case of IRAM, a block extension, bIRAM, was presented by Lehoucq and Maschhoff [46]. The bIRAM implementation compared favorably to other block variants of Arnoldi studied by Scott [57]. Of specific interest to our endeavors, the implicitly restarted block scheme was superior to block approaches

using explicit restarting strategies and also outperformed approaches using precon-
ditioning. This includes strategies such as Chebyshev acceleration and Chebyshev
preconditioning. Also of note, all implementations studied in [57] computed only
a partial Schur decomposition. Currently, ARPACK does not include the bIRAM
approach. One reason for this may be the complexities of such an implementation. An example of one of the difficulties in implementing such an approach is the shift strategy, as discussed by Baglama [7]. The generalization to a block method creates possible convergence issues if the shifts are not chosen appropriately. Additionally, IRAM, as Stewart [64] points out, and consequently bIRAM, requires the structure of the Arnoldi decomposition to be preserved, which may make it difficult to deflate converged Ritz vectors. Due to these issues, we opt to examine the behavior of a basic block Arnoldi approach. This will serve as a starting point for our analysis of block methods.
Building off our work in Chapter 2, we formulate a block Arnoldi approach using
Householder reflectors. Our approach uses explicit restarts and deflation to lock con-
verged Schur vectors and is modeled after algorithm 7.43 in [71] which is reproduced
in Algorithm 3.1.1. The converged eigenvalues and Schur vectors are not touched in
subsequent steps of Algorithm 3.1.1; this is referred to as hard locking, as opposed to soft locking, in which the converged Schur vectors are continuously updated regardless of residual values. Soft locking was introduced by Knyazev [41] in the context of iterative methods for the symmetric eigenvalue problem. The upper triangular portion of the matrix H is also locked.
are the eigenvalues of the m x m matrix H whose k x k principal submatrix is up-
per triangular. By locking converged eigenvalues and computed Schur vectors we are
implicitly projecting out the invariant subspace already computed.
In the following, we outline the main components of one sweep of our block
Arnoldi method, Algorithm 3.1.2, and discuss the specifics of our implementation.

Algorithm 3.1.1: Explicitly Restarted Arnoldi Method with Deflation
1 Set k = 1;
2 for j = k : m do
3   w = A v_j;
4   Compute a set of j coefficients h_{ij} so that w = w − Σ_{i=1}^{j} h_{ij} v_i is orthogonal to all previous v_i, i = 1, 2, ..., j;
5   h_{j+1,j} = ||w||_2;
6   v_{j+1} = w / h_{j+1,j};
7 Compute the approximate eigenvector of A associated with the desired eigenvalue and its associated residual norm estimate ρ_k;
8 Orthonormalize this eigenvector against all previous v_j's to get the approximate Schur vector u_k and define v_k = u_k;
9 if ρ_k is small enough then
10   Accept the eigenvalue estimate: h_{i,k} = v_i^H A v_k, i = 1, ..., k; set k = k + 1;
11   If the desired number of eigenvalues has been reached, stop; otherwise go to 2;
12 else
13   Go to 2;
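Step 8 of Algorithm 3.1.1, orthonormalizing a candidate eigenvector against the locked Schur vectors, is plain Gram-Schmidt. A pure-Python sketch with a hypothetical locked vector (not our Matlab code):

```python
import math

def orthonormalize_against(locked, w):
    """Orthonormalize w against an orthonormal set of locked vectors."""
    dot = lambda x, y: sum(a * b for a, b in zip(x, y))
    for u in locked:
        c = dot(u, w)
        w = [wi - c * ui for wi, ui in zip(w, u)]
    nrm = math.sqrt(dot(w, w))
    return [wi / nrm for wi in w]

u1 = [1.0, 0.0, 0.0]                      # a locked (converged) Schur vector
vk = orthonormalize_against([u1], [1.0, 2.0, 2.0])
# vk is orthogonal to u1 and has unit norm
```

Because the locked vectors are never modified afterwards, subsequent Arnoldi steps implicitly work in the orthogonal complement of the converged invariant subspace.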
Here b denotes the block size, kf denotes the size of the search subspace, where kf is a multiple of b, kmax denotes the number of desired eigenvalues, and kcon is the number of eigenvalues that have converged. For ease of notation, the ensuing discussion will assume no eigenvalues have converged, that is, kcon = 0. The case where kcon > 0 is detailed in the following pseudocodes. A step of the iteration begins with a block of

Algorithm 3.1.2: Block Arnoldi Method with explicit restart and deflation
Input: A ∈ C^{n×n}, U ∈ C^{n×b}, dimension of search subspace kf, and number of desired eigenvalues kmax
Result: Partial Schur form given by Q_{kmax} ∈ C^{n×kmax} and H_{kmax} ∈ C^{kmax×kmax}
1 Set number of blocks: bl = kf / b;
2 Set kcon = 0;
3 while kcon < kmax do
4   Apply the block Arnoldi iteration, Algorithm 3.1.3, to get a size kf block Arnoldi decomposition:
    A Q(:, 1 : kcon + kf) = Q(:, 1 : kcon + kf + b) H(1 : kcon + kf + b, 1 : kcon + kf);
5   Compute the Schur form of the Rayleigh quotient:
    [Z, Ts] = schur(H(kcon + 1 : kcon + kf, kcon + 1 : kcon + kf), 'complex');
6   Use Algorithm 3.1.6 to compute Ur that reorders the Schur form:
    Ts ← Ur^H Ts Ur and Z ← Z Ur;
7   Update the corresponding columns of Q:
    Q(:, kcon + 1 : kcon + kf) ← Q(:, kcon + 1 : kcon + kf) Z;
8   Check convergence;
9   if converged then
10    kcon = kcon + 1;
11    Explicitly deflate H(kcon + 1 : kcon + b, kcon);
12    Collapse Q ← Q(:, 1 : kcon + b) and build a new compact WY representation using Algorithm 3.1.7;
13  else
14    Collapse Q ← Q(:, 1 : kcon + b) and build a new compact WY representation using Algorithm 3.1.7;

vectors U ∈ C^{n×b} that is used to generate a size k = kf / b block Arnoldi decomposition

A Q_k = Q_{k+1} H_{k+1},

where H_{k+1} ∈ C^{(kf+b)×kf} is block upper Hessenberg, Q_{k+1} ∈ C^{n×(kf+b)} has orthonormal columns and a compact WY representation as in Equation 2.5, and H_k ∈ C^{kf×kf} is the leading square part of H_{k+1}. Here Q_{k+1} refers to a matrix with k + 1 blocks of size n x b, and Q_{(k+1)} refers to the n x b matrix making up the (k+1)st block of Q_{k+1}. For the pseudocode, it will be convenient to use the Matlab-style notation introduced in Chapter 2, that is, Q_k = Q(:, 1 : kb) = Q(:, 1 : kf) and Q_{(k+1)} = Q(:, kb + 1 : kb + b) = Q(:, kf + 1 : kf + b), or, using the block notation, Q_{(k+1)} = Q(:, {k+1}).
Expansion using block Arnoldi is detailed in Algorithm 3.1.3. In Algorithm 3.1.3, we use Matlab's function qr to compute some of the components of the compact WY representation. Using the economy-size option, we compute the upper triangular factor with components of the reflectors stored below the diagonal, as in LAPACK. The Householder reflectors have the form H_{(i)} = I − τ_i v_i v_i^H, where the matrix of reflector vectors is unit lower triangular and thus its upper part does not need to be stored. Since Matlab's interface does not provide the scalars, τ_i, needed to construct the elementary reflectors, we opted to implement our own xLARFG in Matlab to generate the missing components so as to be able to compare to LAPACK when constructing the compact WY form. Details on the computation of the scalars may be found in Algorithm 3.1.4. The scalars we compute with our xLARFG are then used in our own Matlab implementation of xLARFT to construct the triangular factor in the compact WY representation.
Details may be found in Algorithm 3.1.5. We will revisit this in Chapter 4, as our algorithmic construction motivates a slight modification of xLARFT in LAPACK to

Algorithm 3.1.3: Block Arnoldi Iteration
Input: A, k, and possibly collapsed Q in compact WY form
Result: kth order Arnoldi decomposition with Q in compact WY form
1 if kcon = 0 then
2   for j = 0 : bl do
3     V = qr(U(kcon + jb + 1 : n, 1 : b), 0) and compute scalars τ(kcon + {j+1}) for the elementary reflectors;
4     if j > 0 then
5       H(1 : jb, {j}) = U(1 : jb, :);
6       H({j+1}, {j}) = triu(V(1 : b, 1 : b));
7     Y(jb + 1 : n, {j+1}) = tril(V, -1) + eye(size(V)); T = zlarft(Y, τ);
8     Q(:, {j+1}) = Q(:, {j+1}) − Y T Y^H Q(:, {j+1});
9     if j < bl then
10      U = A Q(:, {j+1}); U = U − Y T Y^H U;
11 else
12   for j = 1 : bl do
13     U = A Q(:, kcon + {j});
14     U = U − Y T Y^H U;
15     U(1 : kcon + b, :) = R(1 : kcon + b, 1 : kcon + b) U(1 : kcon + b, :);
16     V = qr(U(kcon + jb + 1 : n, :), 0) and compute scalars τ(kcon + {j+1}) for the elementary reflectors;
17     Y(kcon + jb + 1 : n, kcon + {j+1}) = tril(V, -1) + eye(size(V));
18     T = zlarft(Y, τ);
19     H(1 : jb + kcon, kcon + {j}) = U(1 : jb + kcon, :);
20     H(kcon + {j+1}, kcon + {j}) = triu(V(1 : b, 1 : b));
21     Q(:, kcon + {j+1}) = Q(:, kcon + {j+1}) − Y T Y^H Q(:, kcon + {j+1});

Algorithm 3.1.4: Matlab implementation of ZLARFG
Input: A vector x ∈ C^n
Result: Scalars β, τ and vector v
1 n = length(x);
2 alpha = x(1); xnorm = norm(x(2 : n), 2);
3 ar = real(alpha); ai = imag(alpha);
4 if xnorm == 0 then
5   tau = 0; beta = alpha; v = x(2 : n);
6 else
7   beta = -sign(ar)*sqrt(ar^2 + ai^2 + xnorm^2);
8   tau = (beta - ar)/beta - (ai/beta)*i; v = x(2 : n)/(alpha - beta);
Algorithm 3.1.5: Matlab implementation of ZLARFT
Input: reflectors Y and associated scalars τ
Result: triangular factor T for the compact WY form
1 [n, k] = size(Y);
2 T = zeros(k);
3 for i = 1 : k do
4   T(1 : i-1, i) = -τ(i) * T(1 : i-1, 1 : i-1) * (Y(:, 1 : i-1)^H * Y(:, i));
5   T(i, i) = τ(i);
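The recurrence in Algorithm 3.1.5 builds T so that the product of reflectors collapses into the compact WY form, H_1 H_2 ··· H_k = I − Y T Y^H. A pure-Python check on two small reflectors (a sketch with arbitrary τ values, not tied to our Matlab code):

```python
def larft(Y, tau):
    """Forward accumulation of the triangular factor T.

    Y is a list of rows (n x k, unit lower trapezoidal columns); tau holds
    the k reflector scalars.  Implements the recurrence of Algorithm 3.1.5.
    """
    n, k = len(Y), len(Y[0])
    T = [[0j] * k for _ in range(k)]
    for i in range(k):
        # z = Y(:, 1:i-1)^H * Y(:, i)
        z = [sum(Y[r][c].conjugate() * Y[r][i] for r in range(n)) for c in range(i)]
        # T(1:i-1, i) = -tau(i) * T(1:i-1, 1:i-1) * z
        for r in range(i):
            T[r][i] = -tau[i] * sum(T[r][c] * z[c] for c in range(i))
        T[i][i] = tau[i]
    return T

Y = [[1.0 + 0j, 0j], [0.5 + 0j, 1.0 + 0j], [0.25 + 0j, 0.75 + 0j]]
tau = [0.5 + 0j, 0.25 + 0j]
T = larft(Y, tau)
# T is upper triangular with tau on the diagonal
```

The off-diagonal entry here is exactly −τ_1 τ_2 (y_1^H y_2), so applying I − Y T Y^H agrees with applying the two reflectors in sequence.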

avoid unneeded computations. If no eigenvalues have converged, the matrix Q_{k+1} has the following form:

Q_{k+1} = I_{n,(k+1)b} − Y T Y^H I_{n,(k+1)b}, (3.6)

where I_{n,(k+1)b} is the first (k+1)b columns of the n x n identity matrix, Y ∈ C^{n×(k+1)b} with unit diagonal, and T ∈ C^{(k+1)b×(k+1)b}, as in LAPACK. If kcon eigenvalues have been deflated, Q_{k+1} has the form

Q_{k+1} = Ĩ_{n,kcon+(k+1)b} − Y T Y^H Ĩ_{n,kcon+(k+1)b}, (3.7)

where Ĩ_{n,kcon+(k+1)b} ∈ C^{n×(kcon+(k+1)b)} has the form

Ĩ_{n,kcon+(k+1)b} = [ R_{kcon+b}  0 ;
                      0           I_{kb} ;
                      0           0      ].

Here R_{kcon+b} ∈ C^{(kcon+b)×(kcon+b)} is a diagonal matrix with entries ±1 generated when we rebuild the compact WY form after deflating or restarting, and the other components are appropriately sized zero and identity matrices.
In the next step of our block Arnoldi approach, we compute the Schur factorization of the Rayleigh quotient matrix H_k, that is,

H_k V_k = V_k S_k, (3.8)

where V_k ∈ C^{kb×kb}, V_k^H V_k = I, and S_k ∈ C^{kb×kb} is upper triangular. In our implementation, S_k is upper triangular rather than upper block triangular, but we could adjust our approach to work in real arithmetic. The choice to work in complex arithmetic was made, in part, to compare to some of the available implementations of unblocked methods that use the same approach. Our implementation uses Matlab's function schur, which provides the interface to LAPACK's routines xGEHRD and xHSEQR.
The Schur form 3.8 is then reordered so that the desired eigenvalues appear in the top left of the matrix S_k. Reordering the Schur form will play an important

role in all of our approaches in this section. This can be accomplished in LAPACK
by the routine xTRSEN for both upper triangular and block upper triangular Schur
factorizations. As we are currently working in Matlab and since our applications
keep the order of the Schur factor Sk relatively manageable, we reorder the Schur form
using Givens rotations and a target to locate a specific part of the spectrum, such as
the eigenvalues with largest real components. Our approach was adapted from the
computations presented in [24]. Depending on the order of the Schur factor Sk, we may
need to adjust how we handle reordering the Schur form. Kressner discussed block
algorithms for the task of reordering Schur forms in [42] and specifically addressed
the applications of interest here, namely reordering Schur forms in the Krylov-Schur
and Jacobi-Davidson algorithms. To fully take advantage of level 3 BLAS operations,
we should adopt a similar approach. Details of our implementation can be found in
Algorithm 3.1.6 and we will postpone as future work any further analysis on efficiently
reordering the Schur form.
Once the Schur form is reordered, we check to see if the Ritz values are acceptable approximations of the eigenvalues of the matrix A. After reordering, our block Arnoldi decomposition has the form

A Q_k V_k = [Q_k V_k, Q_{(k+1)}] [ S_k ; H_({k+1},{k}) E_k^H V_k ],

so that

A Q_k V_k = Q_k V_k S_k + Q_{(k+1)} H_({k+1},{k}) E_k^H V_k.

From here we can see that the quality of the Ritz values is governed by

||Q_{(k+1)} H_({k+1},{k}) E_k^H V_k||_2 = ||H_({k+1},{k}) E_k^H V_k||_2,

Algorithm 3.1.6: Reorder Schur Form
Input: Unitary Z ∈ C^{k×k}, upper triangular Ts ∈ C^{k×k}, and a target
Result: Reordered Ts and Ur ∈ C^{k×k}
1 Ur = eye(k);
2 for i = 1 : k − 1 do
3   d = diag(Ts(i : k, i : k));
4   [a, jj] = min(abs(d − target));
5   jj = jj + i − 1;
6   for t = jj − 1 : −1 : i do
7     x = [Ts(t, t+1), Ts(t, t) − Ts(t+1, t+1)];
8     G([2, 1], [2, 1]) = planerot(x^H)^H;
9     Ts(1 : k, [t, t+1]) = Ts(1 : k, [t, t+1]) G;
10    Ts([t, t+1], :) = G^H Ts([t, t+1], :);
11    Ur(:, [t, t+1]) = Ur(:, [t, t+1]) G;
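The inner loop of Algorithm 3.1.6 swaps adjacent diagonal entries of an upper triangular matrix by a unitary 2 x 2 similarity. The Python sketch below builds the rotation from the eigenvector for the trailing eigenvalue (an equivalent alternative to the planerot construction, assuming distinct eigenvalues):

```python
import math

def swap_diag(t11, t12, t22):
    """Swap the eigenvalues of the upper triangular T = [[t11, t12], [0, t22]]
    by a unitary similarity; returns the transformed 2x2 matrix S = G^H T G."""
    v1, v2 = complex(t12), complex(t22 - t11)   # eigenvector of T for t22
    n = math.sqrt(abs(v1) ** 2 + abs(v2) ** 2)
    g11, g21 = v1 / n, v2 / n                   # first column of unitary G
    g12, g22 = -g21.conjugate(), g11.conjugate()
    T = [[complex(t11), v1], [0j, complex(t22)]]
    G = [[g11, g12], [g21, g22]]
    TG = [[sum(T[r][m] * G[m][c] for m in range(2)) for c in range(2)]
          for r in range(2)]
    S = [[sum(G[m][r].conjugate() * TG[m][c] for m in range(2))
          for c in range(2)] for r in range(2)]
    return S

S = swap_diag(1.0, 2.0, 3.0)
# S ~ [[3, 2], [0, 1]]: diagonal entries swapped, (2,1) entry annihilated
```

Chaining such swaps bubbles the eigenvalue nearest the target up to the (i, i) position, exactly as the two nested loops of Algorithm 3.1.6 do.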
and a natural stopping criterion is when ||H_({k+1},{k}) E_k^H V_k||_2 is sufficiently small. In ARPACK, given an Arnoldi decomposition of size k, that is,

A U_k = U_k H_k + h_{k+1,k} u_{k+1} e_k^T,

a Ritz value λ is regarded as converged when the associated Ritz vector U_k w, with ||w||_2 = 1, satisfies

||A(U_k w) − λ(U_k w)||_2 = |h_{k+1,k} e_k^H w| < max{u ||H_k||_F, tol · |λ|}, (3.11)

where u is machine epsilon and tol is a user-specified tolerance. Further details may be found in ARPACK's user guide [47]. This criterion guarantees a small backward error for λ, as the estimate is an exact eigenvalue of the slightly perturbed matrix A + E with

E = −(h_{k+1,k} e_k^H w) u_{k+1} (U_k w)^H,
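The backward-error statement can be checked mechanically: for any approximate eigenpair (λ, x) with ||x||_2 = 1 and residual r = Ax − λx, the rank-one perturbation E = −r x^H makes (λ, x) an exact eigenpair of A + E, and ||E||_2 = ||r||_2. A small Python sketch with hypothetical numbers:

```python
# A 2x2 example: A = [[2, 1], [0, 3]] with trial eigenpair (2.1, e1).
A = [[2.0, 1.0], [0.0, 3.0]]
lam, x = 2.1, [1.0, 0.0]
r = [sum(A[i][j] * x[j] for j in range(2)) - lam * x[i] for i in range(2)]
# E = -r x^H, so (A + E) x = A x - r (x^H x) = lam x exactly.
E = [[-r[i] * x[j] for j in range(2)] for i in range(2)]
Ax_plus_Ex = [sum((A[i][j] + E[i][j]) * x[j] for j in range(2))
              for i in range(2)]
# Ax_plus_Ex equals lam * x, and ||E||_2 = ||r||_2 (here 0.1)
```

A small residual therefore certifies that the Ritz value is exact for a nearby matrix, which is the sense in which the stopping criteria above are backward stable.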

as detailed by Kressner [43]. We can extend this idea to our approach and accept the Ritz value λ, in the upper left of S_k after reordering, when

||H_({k+1},{k}) E_k^H V_k(:, 1)||_2 < max{u ||H_k||_F, tol · |λ|}. (3.12)

The final step in a sweep of our block Arnoldi method is to check convergence using Equation 3.12. If an eigenvalue has converged, we explicitly deflate the converged eigenvalue in H_k and collapse Q_k to include the converged Schur vectors plus a block of size n x b to be used to restart the Arnoldi process. A compact WY representation of the truncated Q_k is computed, and we begin the sweep again. Here we generate the matrix R_{kcon+b} introduced in 3.7. If the eigenvalue approximation is not yet satisfactory, we collapse Q_k and build a compact WY representation of the previously converged Schur vectors and the additional n x b block to be used for restarting. Details of this final step can be found in Algorithm 3.1.7.

Algorithm 3.1.7: Explicit restart with possible deflation
Input: Q(:, 1 : kcon + kf + b) with compact WY representation
Result: Q(:, 1 : kcon + b) with compact WY representation
1 ρ = norm(H(kcon + {bl + 1}, kcon + {bl}) Z({bl}, 1));
2 if ρ < check then
3   kcon = kcon + 1;
4   Explicitly deflate:
5   H(kcon + 1 : kcon + b, kcon) = 0; H(1 : kcon + b, 1 : kcon) = R(1 : kcon + b, 1 : kcon + b) H(1 : kcon + b, 1 : kcon);
6   [Y, R] = qr(Q(:, 1 : kcon + b), 0) and compute scalars τ(1 : kcon + b) for the elementary reflectors;
7   R = diag(diag(R(1 : kcon + b, 1 : kcon + b)));
8   Y = tril(Y, -1) + eye(size(Y)); T = zlarft(Y, τ);
9   Q(kcon + b + 1 : n, kcon + b + 1 : kcon + b + kf) = eye(n − (kcon + b), kf);
10 else
11   if kcon = 0 then
12     U = A Q(:, {1}); Q = eye(n, kcon + b);
13   H(1 : kcon + b, 1 : kcon) = R(1 : kcon + b, 1 : kcon + b) H(1 : kcon + b, 1 : kcon);
14   [Y, R] = qr(Q(:, 1 : kcon + b), 0) and compute scalars τ(1 : kcon + b) for the elementary reflectors;
15   R = diag(diag(R(1 : kcon + b, 1 : kcon + b)));
16   Y = tril(Y, -1) + eye(size(Y)); T = zlarft(Y, τ);
17   Q(kcon + b + 1 : n, kcon + b + 1 : kcon + b + kf) = eye(n − (kcon + b), kf);

3.1.3 Block Krylov-Schur
A more numerically reliable procedure than IRAM is Stewarts Krylov-Schur
algorithm [64], This reliability, along with the ease with which converged Ritz pairs
are deflated and unwanted Ritz values are purged, makes the Krylov-Schur method
an attractive alternative to IRAM. A step of Stewarts Krylov-Schur method begins
and ends with a Krylov-Schur decomposition of the form
AVk = VkSk + vk+1b%+1 (3.13)
where Sk is a k x k upper triangular matrix and the columns of Vk+\ are orthonormal.
For convenience, we write
AVk = Vk+1Sk, (3.14)
_ U
The Krylov-Schur method uses subspace expansion and contraction phases much
like any approach based on Arnoldi decompositions. The expansion phase of Krylov-
Schur proceeds exactly as in the typical Arnoldi approach and requires approximately
the same amount of work. The real power of the Krylov-Schur approach is in the
contraction phase. As we will see in detail in Chapter 4, the key aspect of the
Krylov-Schur method is that a decomposition of the form 3.13 can be truncated at
any point in the process providing an easy way to purge unwanted Ritz values. As
the Krylov-Schur method works explicitly with the eigenvalues of Sk, the Rayleigh
quotient, this approach is an exact-shift method. This is in contrast to methods that
use other shifts such as linear combinations of the eigenvalues of Sk. We will revisit
the details of the Krylov-Schur method in Chapter 4.

A block version of this approach has been implemented for symmetric matrices by Saad and Zhou in [72]. They included detailed pseudocodes, including how to handle rank-deficient cases and adaptive block sizes. The numerical results provided by Saad and Zhou show that their implementation performs well against Matlab's eigs function based on ARPACK, irbleigs by Baglama et al., an implementation of an implicitly restarted block-Lanczos algorithm [6], and lobpcg, without preconditioning, presented by Knyazev [40]. Of particular interest to our pursuits is that, in this study, the Krylov-Schur based approach consistently outperformed the competitors as the number of desired eigenvalues was increased.
Though the Krylov-Schur method was presented as an attractive alternative to IRAM, there are few implementations in either unblocked or blocked form for nonsymmetric matrices. Baglama's Augmented Block Householder Arnoldi (ABHA) method [7] is an implicit version of a block Krylov-Schur approach to the NEP, as it employs Schur vectors for restarts. We will compare our approach to the publicly available program ahbeigs based on this ABHA method later. An implementation
of the block Krylov-Schur method is available in the Anasazi eigensolver package,
which is part of Trilinos [9].
As we did for our block Arnoldi approach, we now present a sweep of our block Krylov-Schur method, Algorithm 3.1.8, along with detailed computational kernels. We again formulate our approach based on Householder reflectors in compact WY form. A step of the iteration begins with A ∈ C^{n×n} and a block of vectors U ∈ C^{n×b}. For the remainder of this section, b denotes the block size, ks denotes the starting basis size and the size of the basis after contraction, kf denotes the final basis size, so that ks < kf where each is a multiple of b, and kcon denotes the number of eigenvalues that have converged. For the moment, we will assume kcon = 0. In our implementation, we begin by using Algorithm 3.1.3 to compute a block Arnoldi factorization

A Q(:, 1 : ks) = Q(:, 1 : ks + b) H(1 : ks + b, 1 : ks),

where Q ∈ C^{n×(ks+b)} has orthonormal columns and a compact WY representation as in Equation 2.5. We will discuss the importance of this first expansion using block Arnoldi in Chapter 4.

The Schur form of the Rayleigh quotient H(1 : ks, 1 : ks) ∈ C^{ks×ks} is computed next so that we have

H(1 : ks, 1 : ks) Us = Us Ts,

with Us ∈ C^{ks×ks} such that Us^H Us = I, and Ts ∈ C^{ks×ks} upper triangular. As before, we could instead compute the real Schur form, in which Ts is upper block triangular, and work completely in real arithmetic. We will discuss this further in Chapter 4. Updating our block Arnoldi decomposition, we have the block Krylov-Schur factorization

A Z(:, 1 : ks) = Z(:, 1 : ks + b) S(1 : ks + b, 1 : ks), (3.15)

where Z(:, 1 : ks) = Q(:, 1 : ks) Us, Z ∈ C^{n×(ks+b)} has orthonormal columns, and the Rayleigh quotient, S(1 : ks, 1 : ks), is upper triangular with a full b x ks block on the bottom, as depicted in Figure 3.1. This ends the initialization phase of our block
Figure 3.1: Block Krylov-Schur Decomposition
Krylov-Schur approach. Now we enter a cycle of expanding and contracting the
Krylov decomposition until an eigenvalue may be deflated. The initial Krylov-Schur
decomposition 3.15 is expanded in the same manner as in our block Arnoldi approach.
Details may be found in Algorithm 3.1.3. The expanded Krylov decomposition is now

of the form
A Q(:, 1 : kf) = Q(:, 1 : kf + b) S(1 : kf + b, 1 : kf), (3.16)

where Q ∈ C^{n×(kf+b)} has orthonormal columns and a compact WY representation, and the Rayleigh quotient has the structure depicted in Figure 3.2. If kcon > 0,
constructing the compact WY representation of our expanded Z may involve a form
as in Equation 3.7. We will address this in detail when we reach the end of a sweep.
Figure 3.2: Expanded Block Krylov Decomposition
The next step is to compute the Schur form of the kf x kf principal submatrix in
the upper left of S. Again, we use Matlabs function schur to compute the upper
triangular Schur factor. The Schur factor is then reordered as we did before in our
block Arnoldi method. Algorithm 3.1.6 accomplishes this task. We then update the
corresponding parts of S so that our decomposition has the form
A Q(:, 1 : kf) = Q(:, 1 : kf + b) S(1 : kf + b, 1 : kf), (3.17)
where S again has the structure depicted in Figure 3.1. Once the desired eigenvalue
approximations have been moved to the upper left of the matrix S, we check for
convergence. Kressner [43] details the connection between the convergence criterion of ARPACK and possible extensions to a Krylov decomposition. Given a Krylov-Schur decomposition of order m,

A U_m = U_m S_m + u_{m+1} b_{m+1}^H,

the direct extension of Equation 3.11 is given by

||A(U_m w) − λ(U_m w)||_2 = |b_{m+1}^H w| < max{u ||S_m||_F, tol · |λ|},

where ||w||_2 = 1, u is the machine precision, and tol is a user tolerance. Both of
these criteria, the one of ARPACK and the Krylov extension, guarantee a small
backward error. Kressner also discussed how the Krylov-Schur approach suggests a
more restrictive convergence criteria based on Schur vectors rather than Ritz vectors.
If no eigenvalues have been deflated, we may accept an approximation if
|b1| ≤ max{u ||Rm||F, tol × |λ|}, (3.18)
where b1 is the first component of b_{m+1}. We discuss this in depth in Chapter 4. An
extension of the convergence criterion based on Schur vectors to our block method is
given by
||S(kf + 1 : kf + b, 1)||2 ≤ max{u ||S(1 : kf, 1 : kf)||F, tol × |λ|} (3.19)
where λ = S(1, 1), u is machine precision and tol is a user specified tolerance. Our
block Krylov-Schur implementation can deflate converged eigenvalues one at a time,
or ncon at a time where 1 ≤ ncon ≤ b. If the eigenvalue satisfies Equation 3.19, we
explicitly deflate by zeroing out the bottom b × 1 block in the corresponding column
of S and then truncate our Krylov-Schur decomposition. If no eigenvalues have
converged, the Krylov-Schur decomposition is truncated. Truncation in either case
is handled by Algorithm 3.1.10 and a compact WY representation of Z is computed.
After truncation, the process begins again with expansion of the search subspace.
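For the single-vector case (b = 1), one expand, reorder, and truncate cycle can be sketched compactly. The following is an illustrative Python/NumPy sketch, not the Matlab implementation described here; the test matrix, the subspace sizes m and k, and the largest-real-part selection rule are assumptions made only for the demonstration.

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(0)
n, m, k = 200, 20, 5                       # problem size, max subspace, restart size
A = rng.standard_normal((n, n)) / np.sqrt(n) + np.diag(np.linspace(0.0, 1.0, n))

# Arnoldi expansion: A Q(:, :m) = Q(:, :m+1) H with H of size (m+1) x m
Q = np.zeros((n, m + 1))
H = np.zeros((m + 1, m))
Q[:, 0] = rng.standard_normal(n)
Q[:, 0] /= np.linalg.norm(Q[:, 0])
for j in range(m):
    w = A @ Q[:, j]
    H[:j + 1, j] = Q[:, :j + 1].T @ w      # orthogonalize against current basis
    w -= Q[:, :j + 1] @ H[:j + 1, j]
    H[j + 1, j] = np.linalg.norm(w)
    Q[:, j + 1] = w / H[j + 1, j]

# Schur form of the Rayleigh quotient, reordered so the k eigenvalues of
# largest real part sit in the upper left (the role of Algorithm 3.1.6)
thr = np.sort(np.linalg.eigvals(H[:m, :m]).real)[-k]
T, U, _ = schur(H[:m, :m], output='complex', sort=lambda z: z.real >= thr)

# Truncate: keep k Schur vectors plus the residual direction q_{m+1};
# the coupling row H[m, m-1] * U[m-1, :k] becomes the bottom row of S
Z = np.column_stack([Q[:, :m] @ U[:, :k], Q[:, m]])
S = np.vstack([T[:k, :k], H[m, m - 1] * U[m - 1, :k]])
assert np.linalg.norm(A @ Z[:, :k] - Z @ S) < 1e-8   # A Z(:, 1:k) = Z S holds
```

The final assertion checks the truncated Krylov decomposition AZ(:, 1 : k) = ZS; in the block setting, the single coupling row plays the role of the full bottom block depicted in Figure 3.1.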

Algorithm 3.1.8: Block Krylov-Schur
Input: A ∈ C^(n×n), U ∈ C^(n×b), ks, kf, and kmax
Result: Partial Schur form given by Zu and Su
1 Use Algorithm 3.1.3 to generate:
AQ(:, 1 : ks) = Q(:, 1 : ks + b)H(1 : ks + b, 1 : ks);
2 Compute Schur form of Rayleigh quotient H(1 : ks, 1 : ks);
3 Update columns of Q and H(ks + 1 : ks + b, :);
4 Now we have a block Krylov-Schur decomposition:
AZ(:, 1 : ks) = Z(:, 1 : ks + b)S(1 : ks + b, 1 : ks);
5 while kcon < kmax do
Use Algorithm 3.1.9 to expand as in Arnoldi:
AZ(:, 1 : kcon + kf) = Z(:, 1 : kcon + kf + b)S(1 : kcon + kf + b, 1 : kcon + kf);
Compute the Schur form of the active part of the Rayleigh quotient:
S(kcon + 1 : kcon + kf, kcon + 1 : kcon + kf)Us = Us Ts;
Reorder Schur form using Algorithm 3.1.6: Ts ← Ur^H Ts Ur;
Update corresponding pieces of Krylov decomposition:
Z(:, kcon + 1 : kcon + kf) ← Z(:, kcon + 1 : kcon + kf)Us Ur;
S(1 : kcon, kcon + 1 : kcon + kf) ← S(1 : kcon, kcon + 1 : kcon + kf)Us Ur;
S(kcon + 1 : kcon + kf, kcon + 1 : kcon + kf) ← Ts;
S(kcon + kf + 1 : kcon + kf + b, :) ← S(kcon + kf + 1 : kcon + kf + b, :)Us Ur;
if converged then
Explicitly deflate ncon converged eigenvalues:
S(kcon + kf + 1 : kcon + kf + b, kcon + 1 : kcon + ncon) ← zeros(b, ncon);
kcon ← kcon + ncon;
Truncate expansion to size ks using Algorithm 3.1.10;

Algorithm 3.1.9: Expansion of block Krylov decomposition
Input: A, Z ∈ C^(n×(kcon+ks+b)), S ∈ C^((kcon+ks+b)×(kcon+ks))
Result: Z ∈ C^(n×(kcon+kf+b)), S ∈ C^((kcon+kf+b)×(kcon+kf))
1 for j = bs : bf do
2 Compute components for compact WY form of QR factorization:
V = qr(U(kcon + jb + 1 : n, 1 : b), 0); and compute scalars τ(kcon + {j + 1})
for the elementary reflectors;
3 Update S: S(1 : jb + kcon, kcon + {j}) = U(1 : jb + kcon, :);
S(kcon + {j + 1}, kcon + {j}) = triu(U(1 : b, 1 : b));
Store reflectors in Y:
Y(kcon + jb + 1 : n, kcon + {j + 1}) = eye(size(U)) + tril(V, −1);
Build triangular factor for compact WY representation: T = zlarft(Y, τ);
Explicitly build next block of Z:
Z(:, kcon + {j + 1}) = Z(:, kcon + {j + 1}) − Y T Y^H Z(:, kcon + {j + 1});
if j < bf then U = AZ(:, kcon + {j + 1});
U = U − Y T^H Y^H U;
U(1 : kcon + ks + b, :) = R^H U(1 : kcon + ks + b, :);

Algorithm 3.1.10: Truncation of block Krylov decomposition
Input: S, Z and ncon
Result: truncated S and Z with compact WY form
1 if ncon > 0 then
Explicitly deflate;
for i = 1 : ncon do
S(kcon + kf + 1 : kcon + kf + b, kcon + i) = zeros(b, 1);
kcon = kcon + ncon;
Z(:, ks + kcon + 1 : ks + kcon + b) = Z(:, kf + kcon − ncon + 1 : kf + kcon − ncon + b);
Z(:, ks + kcon + b + 1 : kcon + kf + b) = zeros(n, kf − ks);
Z(kcon + ks + b + 1 : n, ks + kcon + b + 1 : kcon + kf + b) =
eye(n − ks − b − kcon, kf − ks);
S = [S(1 : ks + kcon, 1 : ks + kcon); S(kf + kcon − ncon + 1 : kcon + kf + b − ncon, 1 : ks + kcon)];
ncon = 0;
else
Z(:, ks + kcon + 1 : ks + kcon + b) = Z(:, kf + kcon + 1 : kf + kcon + b);
Z(:, ks + kcon + b + 1 : kcon + kf + b) = zeros(n, kf − ks);
Z(kcon + ks + b + 1 : n, ks + kcon + b + 1 : kcon + kf + b) =
eye(n − ks − b − kcon, kf − ks);
S = [S(1 : ks + kcon, 1 : ks + kcon); S(kcon + kf + 1 : kcon + kf + b, 1 : ks + kcon)];
16 Build new compact WY representation: Y = qr(Z(:, 1 : kcon + ks + b), 0); and
compute scalars τ(1 : kcon + ks + b);
17 Y = tril(Y, −1) + eye(n, kcon + ks + b);
18 T = zlarft(Y, τ); R = triu(Y(1 : kcon + ks + b, :));
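The compact WY machinery used by the truncation step can be illustrated on its own. The sketch below is in Python/NumPy rather than the thesis's Matlab/LAPACK setting; it accumulates Householder reflectors in Y with unit diagonal (the tril(Y, −1) + I pattern above), builds the triangular factor T with the same recurrence LAPACK's zlarft uses, and checks that Q = I − Y T Y^H reproduces the QR factorization. The sizes and test matrix are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 6
Z = rng.standard_normal((n, k))

# Householder QR: reflectors stored in Y with unit diagonal, scalars in tau
Y = np.zeros((n, k))
tau = np.zeros(k)
R = Z.copy()
for j in range(k):
    x = R[j:, j]
    v = x.copy()
    v[0] += np.copysign(np.linalg.norm(x), x[0])
    v /= v[0]                                   # scale so that v[0] = 1
    tau[j] = 2.0 / (v @ v)
    R[j:, j:] -= tau[j] * np.outer(v, v @ R[j:, j:])   # apply H_j = I - tau v v^T
    Y[j:, j] = v                                # the tril(Y, -1) + I pattern

# Triangular factor of the compact WY form (the recurrence behind zlarft):
# T(1:j-1, j) = -tau_j T(1:j-1, 1:j-1) Y^T y_j,  T(j, j) = tau_j
T = np.zeros((k, k))
for j in range(k):
    T[:j, j] = -tau[j] * (T[:j, :j] @ (Y[:, :j].T @ Y[:, j]))
    T[j, j] = tau[j]

Q = np.eye(n) - Y @ T @ Y.T                     # Q = H_1 H_2 ... H_k
assert np.linalg.norm(Q.T @ Q - np.eye(n)) < 1e-10   # Q is orthogonal
assert np.linalg.norm(Q.T @ Z - R) < 1e-10           # Q^T Z = R, upper triangular
```

Storing Y and T instead of Q itself is what lets the algorithms above apply orthogonal transformations with matrix-matrix products, which is the point of the compact WY representation.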

3.1.4 Block Jacobi-Davidson
We now turn our focus from methods that project onto Krylov subspaces to an
approach that uses a different projection technique. The Jacobi-Davidson approach
was first proposed in 1996 by Sleijpen and Van der Vorst [58]. This method combined
ideas from Davidson's work on large, sparse symmetric matrices from the 1970s [21]
and Jacobi's iterative approaches to computing eigenvalues of symmetric matrices
from the 1840s [37]. We present a brief discussion of Davidson's work and Jacobi's
work to see the relationship between Krylov based methods such as Arnoldi and
methods such as Jacobi-Davidson.
Jacobi introduced a combination of two iterative methods to compute the eigen-
values of a symmetric matrix. To see how Jacobi viewed eigenvalue problems, let A
be an n × n, diagonally dominant matrix with largest diagonal element A(1, 1) = α.
An approximation of the largest eigenvalue and associated eigenvector is the Ritz pair
(a, v) as in
A [1; z] = λ [1; z], or equivalently [α cᵀ; b F] [1; z] = λ [1; z],
where b, c ∈ R^(n−1) and F ∈ R^((n−1)×(n−1)). Jacobi proposed to solve eigenvalue problems
of this form using his Jacobi iteration. To see this, consider the alternative formulation
of this system
λ = α + cᵀz
(F − λI)z = −b.
Jacobi solved the linear system on the second line using his Jacobi iteration by beginning
with z1 = 0 and getting an updated approximation θk for λ using a variation

of the iteration
θk = α + cᵀzk
(D − θkI)zk+1 = (D − F)zk − b,
where D is the diagonal matrix with the same diagonal entries as F. The first step
was to make the matrix strongly diagonally dominant by applying rotations to pre-
condition the matrix. Jacobi then proceeded with his iterative method that searched
for the orthogonal complement to the initial approximation. Further details of
Jacobi's work may be found in [58]. The key idea from his work is that all corrections
came from the orthogonal complement of the initial approximation.
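Jacobi's coupled iteration can be reproduced in a few lines. The sketch below is an illustrative Python/NumPy version (the test matrix and the iteration count are assumptions): it alternates the eigenvalue update θk = α + cᵀzk with one diagonal solve for the next correction zk+1.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
A = rng.standard_normal((n, n))
A = (A + A.T) / 2
A += np.diag(10.0 * np.arange(n, 0, -1))   # strongly diagonally dominant;
alpha = A[0, 0]                            # A(1,1) is the largest diagonal entry
b, c, F = A[1:, 0], A[0, 1:], A[1:, 1:]
D = np.diag(np.diag(F))

# theta_k = alpha + c^T z_k;  (D - theta_k I) z_{k+1} = (D - F) z_k - b
z = np.zeros(n - 1)
for _ in range(50):
    theta = alpha + c @ z
    z = np.linalg.solve(D - theta * np.eye(n - 1), (D - F) @ z - b)
theta = alpha + c @ z

# The fixed point satisfies A [1; z] = theta [1; z]
v = np.concatenate([[1.0], z])
assert np.linalg.norm(A @ v - theta * v) < 1e-8 * np.linalg.norm(v)
```

At the fixed point, θ = α + cᵀz and (F − θI)z = −b hold simultaneously, so [1; z] is an eigenvector; the correction z lives entirely in the complement of the initial approximation e1, which is the key idea referred to above.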
Davidson was also working with real symmetric matrices. Suppose we have a
subspace V of dimension k, the projected matrix A has Ritz value 9k and Ritz vector
uk, and that an orthogonal basis for V is given by {v\,... ,vk}. A measure of the
quality of our approximation is given by the residual
rk = Auk − θkuk.
Davidson was concerned with how to expand the subspace V to improve the approx-
imation and update uk. His answer consisted of the following steps:
1. Compute t from the system (D − θkI)t = rk
2. Orthogonalize t against the basis for V: t ⊥ {v1, ..., vk}
3. Expand the subspace V by taking vk+1 = t
where D is a diagonal matrix with the same diagonal entries as A. The Jacobi-
Davidson method combines elements from both approaches. Given an approximation
uk, the correction to this approximation is found in the projection onto the orthogonal
complement of the current approximation. The projected matrix is given by
B = (I − ukuk^H)A(I − ukuk^H)

and rearranging terms yields
A = B + Aukuk^H + ukuk^H A − θkukuk^H, (3.20)
where θk = uk^H Auk. For a desired eigenvalue of A, say λ, that is close to θk, the
desired correction t is such that
A(uk + t) = λ(uk + t), and t ⊥ uk. (3.21)
Substituting Equation 3.20 into the desired correction Equation 3.21, and using some
orthogonality relations, we have the following equation for the correction
(B − λI)t = −rk (3.22)
where rk = Auk − θkuk is the residual. Different approaches to solving the correction
equation 3.22 result in different methods. If the solution is approximated by the
residual, that is t = rk, the correction is formally the same as that generated by
Arnoldi. In the symmetric case, if t = (D − θkI)^(−1)rk, then we recover the approach
proposed by Davidson. In the general case, combining the strategies of Jacobi and
Davidson, the correction equation has the form
(B − θkI)t = −rk with t ⊥ uk, (3.23)
where (B − λI) is replaced by (B − θkI). Approximating a solution to Equation 3.23
has been studied extensively since it was proposed. We will highlight recent develop-
ments later. The work by Sleijpen and Van der Vorst [58] set the foundation for the
formulation of a Jacobi-Davidson style QR algorithm, JDQR, presented by Fokkema
et al. [24], that iteratively constructs a partial Schur form. In [24], algorithms were
presented for both standard eigenvalue problems and generalized eigenvalue problems,
including the use of preconditioning for the correction equation and restart strategies.
We will restrict our discussion to standard eigenvalue problems.
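To make the role of the correction equation concrete, the sketch below performs one Jacobi-Davidson step with an exact inner solve on a small symmetric example. It is written in Python/NumPy and is not the JDQR implementation; the test matrix, the perturbation used to create the approximate eigenvector, and the bordered-system formulation of the projected solve are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
A = np.diag(np.arange(1.0, n + 1)) + 0.01 * rng.standard_normal((n, n))
A = (A + A.T) / 2                       # symmetric, well-separated eigenvalues

lam, V = np.linalg.eigh(A)
q = V[:, -1] + 1e-3 * rng.standard_normal(n)   # rough approximate eigenvector
q /= np.linalg.norm(q)
theta = q @ A @ q                       # Ritz value
r = A @ q - theta * q                   # residual, orthogonal to q

# Solve (I - q q^H)(A - theta I)(I - q q^H) t = -r with t perpendicular to q,
# exactly, via the equivalent bordered system [[A - theta I, q], [q^H, 0]]
M = np.block([[A - theta * np.eye(n), q[:, None]],
              [q[None, :], np.zeros((1, 1))]])
t = np.linalg.solve(M, np.concatenate([-r, [0.0]]))[:n]

# Rayleigh-Ritz on span{q, t}: one exact-solve Jacobi-Davidson step
W, _ = np.linalg.qr(np.column_stack([q, t]))
theta_new = np.linalg.eigh(W.T @ A @ W)[0][-1]
assert abs(theta_new - lam[-1]) < 0.1 * abs(theta - lam[-1])
```

Solving the bordered system enforces t ⊥ q exactly; with an exact inner solve, one such step behaves like Rayleigh quotient iteration, which is why the Ritz value improves so sharply. In practice the inner system is only solved approximately, which is the issue the later sections return to.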

For the standard eigenvalue problem, the Jacobi-Davidson method picks an ap-
proximate eigenvector from a search subspace that is expanded at each step. If the
search subspace is given by span{V}, then the projected problem is given by
(V^H AV − θ V^H V)u = 0, (3.24)
where V ∈ C^(n×j) and in exact arithmetic V^H V = I. Equation 3.24 is solved yielding
Ritz value θ, Ritz vector q = Vu and residual vector r = (A − θI)q where r ⊥ q. The
projected eigenproblem 3.24 is reduced to Schur form by the QR algorithm. Fokkema
et al. defined the j x j interaction matrix M = VHAV with Schur decomposition
MU = US where S is upper triangular and ordered such that
|S(1, 1) − τ| ≤ |S(2, 2) − τ| ≤ ... ≤ |S(j, j) − τ|,
where τ is some specified target. A Ritz approximation to the projected problem
with Ritz value closest to τ is given by
(q, θ) = (VU(:, 1), S(1, 1)).
Additionally, useful information for the i eigenvalues closest to τ may be found in the
span of the columns of VU(:, 1 : i) with i < j. This facilitates a restart strategy by
taking V = VU(:, 1 : jmin) for some jmin < j. Fokkema et al. [24] referred to this
as an implicit restart. The second component of this approach determines how to
expand the search subspace using iterative approaches due to Jacobi. The correction
equation given by
(I − qq^H)(A − θI)(I − qq^H)v = −r with q^H v = 0, (3.25)
is solved approximately and the expanded search subspace becomes span{V, v}. The
JDQR approach can compute several eigenpairs by constructing a partial Schur form.
Both deflation and a restarting strategy are used. An interesting feature of JDQR is
the option of preconditioning the correction equation 3.25. Due to the projections

involved, preconditioning is not straightforward. Detailed pseudocodes may be found
in [24] and a Matlab implementation jdqr is publicly available.
Since its introduction in 1996 by Sleijpen and Van der Vorst [58], much work
has been devoted to understanding and improving the Jacobi-Davidson approach,
especially in the case of symmetric and Hermitian matrices. Here we survey the
major highlights that pertain to our block variant in the context of the NEP. There
is a wealth of information available on the Jacobi-Davidson approach and a good
starting point is the Jacobi-Davidson Gateway [33] maintained by Hochstenbach.
As mentioned earlier, Fokkema et al. [24] discussed deflation and implicit restarts.
Deflation and the search for multiple eigenvalues of Hermitian matrices was also
studied by Stathopoulos and McCombs [61]. The connection between the inner and
outer iterations was studied by Hochstenbach et al. [34]. This is a critical component
of the Jacobi-Davidson approach as solving the correction equation too accurately at
the wrong time can lead to the search subspace being expanded in ineffective ways.
We will examine this issue when presenting our numerical results. Hochstenbach et
al. proved a relation between the residual norm of the inner linear system and the
residual of the eigenvalue problem. This analysis suggested new stopping criteria
that improved the overall performance of the method. We will employ some of their
heuristics in our block approach.
A block variant for sparse symmetric eigenvalue problems was formulated by
Geus [27]. For general matrices with inexpensive action, for example large and sparse
matrices, Brandts [17] suggested a variant of blocked Jacobi-Davidson based on his
Ritz-Galerkin Method with inexact Riccati expansion. This method has a Riccati
correction equation, that depending on the quality of the approximate solution, re-
duces to a block Arnoldi approach or a block Jacobi-Davidson approach. In fact,
the Riccati correction equation, when linearized, becomes the correction equation of
Jacobi-Davidson. Brandts's method solves the Riccati equation exactly and the extra
work is demonstrated to be negligible in the case of matrices with inexpensive
action. Further investigation into subspace expansion from the solutions of general-
ized algebraic Riccati equations is the subject of future work. Parallelization has been
investigated, but mainly in the context of generalized eigenvalue problems for large
Hermitian matrices [52, 3] and most recently for quadratic eigenvalue problems [68].
Our block version of Jacobi-Davidson, Algorithm 3.1.11, is a straightforward
extension of JDQR from [24]. In the publicly available Matlab implementation,
jdqr, Fokkema et al. used existing implementations of the QR algorithm such as
schur to compute the Schur form of the projected problem. MGS was used for
the construction of an orthogonal basis for the search subspace. Rather than using
MGS, we opt as before to base our algorithm on the use of Householder reflectors.
Several linear solvers are available to approximately solve the correction equation 3.25.
The implementation jdqr allows the user to specify various methods, but as our
stopping criteria depends on the method used, we choose to employ the generalized
minimal residual method (GMRES). We also formulate our approach to allow for
preconditioning of the correction equation, though we do not suggest strategies for
identifying effective preconditioners.
We now present one sweep of our block Jacobi-Davidson method detailed in Al-
gorithm 3.1.11. Here b denotes the block size, jmin is the minimum dimension of the
search subspace, jmax is the maximum dimension of the search subspace, kcon is the
number of converged eigenvalues and kmax is the number of desired eigenvalues. In
the ensuing steps, we use similar notation as Fokkema et al. introduced in [24] to help
with the discussion. We let Q ∈ C^(n×kcon) be the matrix of converged Schur vectors,
K ∈ C^(n×n) is the preconditioner for (A − ζI) for some fixed value of ζ and define the

Algorithm 3.1.11: Block Jacobi-Davidson
Input: A ∈ C^(n×n), U ∈ C^(n×b), kmax, jmin, and jmax
Result: AQ = QR with Q ∈ C^(n×kmax) and R ∈ C^(kmax×kmax)
while kcon < kmax do
if j = 0 then
Initialize search subspace V using Algorithm 3.1.12;
Solve b correction equations approximately using Algorithm 3.1.13;
Expand search subspace: V = [V, v];
if j > 0 then
[V, temp] = qr(V, 0) and construct compact WY form for V;
Expand interaction matrix: M = V^H AV;
Compute Schur form: MU = US;
Reorder the Schur form S using Algorithm 3.1.6;
j = j + b; found = 1;
while found do
Compute Ritz vectors: q = VU(:, 1 : b);
Precondition the Schur vectors: y = K^(−1)q;
Compute b residual vectors and associated norms:
for i = 1 : b do
r(:, i) = Aq(:, i) − S(i, i)q(:, i); nres(i) = ||r(:, i)||2;
if Converged then
Deflate and restart using Algorithm 3.1.14;
if j = jmax then
Implicit restart using Algorithm 3.1.14;

Q = [Q, q], the matrix Q expanded by approximate Schur vectors q,
Y = K^(−1)Q, the matrix of preconditioned Schur vectors,
H = Q^H Y, the preconditioned projector Q^H K^(−1)Q.
We begin with a block U ∈ C^(n×b), and in the initial sweep orthogonalize this using
Householder. That is, we have
[V, R] = qr(U, 0)
so that V ∈ C^(n×b) is such that V^H V = Ib. As in jdqr, we also allow for initializing
the search subspace using Arnoldi. Convergence may suffer in the single vector case
when using Jacobi-Davidson beginning from the initial vector. That is, the correction
equation may not immediately provide useful information for building a desirable
search subspace. To adjust for this, often some type of subspace expansion is used
to generate an initial search subspace that may have better approximations to work
with. The first phase of jdqr computes an Arnoldi factorization of size jmin + 1 using
the supplied initial vector. Then jmin of the Arnoldi vectors are used as the initial
subspace V in the initial sweep of the Jacobi-Davidson method. We experimentally
verified that this slightly improves the speed of convergence and incorporate this
approach into our block algorithm. We note that a nice analysis of the correction
equation is provided by Brandts [17]. If desired, we use Algorithm 3.1.3 and the
starting block U to build a size jmin block Arnoldi decomposition and let V ∈ C^(n×jmin)
be the first jmin basis vectors. If this is not the initial sweep, V ∈ C^(n×jmin) is our
restarted search subspace and has orthonormal columns. In either case a compact
WY representation of the search subspace V is constructed. The interaction matrix
is computed by
M = VhAV

and then the Schur form of M is computed using Matlab's function schur so that
MU = US. Next the Schur form is reordered using our earlier approach Algo-
rithm 3.1.6 so that the diagonal entries of S are arranged with those closest to some
target in the upper left and U is updated as well. The first b diagonal elements of S
are the Ritz values with associated Ritz vectors given by
q = VU(:,l:b),
and are used to compute the b residuals
r(:, i) = Aq(:, i) − S(i, i)q(:, i), i = 1, ..., b (3.26)
along with the corresponding norms of each residual ||r(:, i)||2. Next we check for
convergence. In jdqr, a Ritz approximation is accepted if the norm of the residual
is below a certain user specified tolerance with default value 1e-8. We use the same
convergence criterion for individual Ritz approximations, but check to see if any of the
b approximations has converged. If the Ritz pair satisfies our convergence criterion,
then we explicitly deflate the converged eigenvalue and lock the approximation as
detailed in Algorithm 3.1.14. If the approximation is not yet satisfactory, we move
Algorithm 3.1.12: Subspace Initialization
Input: U ∈ C^(n×b) and jmin
Result: Initial search subspace V with compact WY representation
1 if Starting with Arnoldi then
2 Construct size jmin block Arnoldi decomposition:
AU(:, 1 : jmin) = U(:, 1 : jmin + b)H(1 : jmin + b, 1 : jmin)
3 with compact WY form of U using Algorithm 3.1.3;
4 V = U(:, 1 : jmin);
5 else
6 [V, R] = qr(U, 0) along with compact WY form of V;

Algorithm 3.1.13: Approximate Solution to Correction Equations
1 Update residuals:
2 r = K^(−1)r; r = r − Y H^(−1)Q^H r;
3 for i = 1 : b do
4 Approximately solve the b correction equations of the form:
(I − Y H^(−1)Q^H)(K^(−1)A − S(i, i)K^(−1))(I − Y H^(−1)Q^H)z(:, i) = −r(:, i);
5 using GMRES and orthogonalize against Q;
to the inner iteration. Here b correction equations of the form 3.25 are solved to
expand the search subspace. As only an approximate solution to each system is
required, iterative methods for linear systems are the natural choice. We opted to
use GMRES to solve each of the b correction equations. Rather than use Matlab's
function gmres, we opted to implement our own version of GMRES as we needed
more control during the inner solve to compute the components required for our
convergence criteria. After computing the b approximate solutions, we find ourselves
back at the beginning of a sweep. We then continue the cycle of outer projections
and inner linear solves increasing the dimension of our search subspace a block at a
time. If the size of the search subspace has reached the maximum allowed dimension
jmax, we restart as detailed in Algorithm 3.1.14.
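The inner solve never needs the projected matrix explicitly: the projectors and the shift can be wrapped in a matrix-free operator and handed to GMRES. The sketch below is a simplified, unpreconditioned (K = I) Python/SciPy illustration with random placeholder Ritz data standing in for one outer sweep; it is not the thesis code, and the matrix, block size, and iteration limit are assumptions.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(4)
n, b = 200, 2
A = rng.standard_normal((n, n)) + np.diag(np.arange(n, dtype=float))

# Placeholder Ritz block q and Ritz values theta for one sweep (illustrative)
q, _ = np.linalg.qr(rng.standard_normal((n, b)))
theta = np.sort(np.linalg.eigvals(q.T @ A @ q).real)

def solve_correction(i, r):
    """Approximately solve (I - qq^H)(A - theta_i I)(I - qq^H) t = -r."""
    def mv(x):
        x = x - q @ (q.T @ x)            # project onto the complement of span{q}
        y = A @ x - theta[i] * x         # apply the shifted operator
        return y - q @ (q.T @ y)         # project the result as well
    op = LinearOperator((n, n), matvec=mv)
    rhs = r - q @ (q.T @ r)              # restrict the right-hand side too
    t, _ = gmres(op, -rhs, maxiter=20)   # modest accuracy suffices here
    return t - q @ (q.T @ t)             # enforce t perpendicular to span{q}

t = solve_correction(0, rng.standard_normal(n))
assert np.linalg.norm(q.T @ t) < 1e-10
```

Each GMRES iteration then costs one multiplication by the full matrix plus two cheap projections, which is why the balance between inner and outer work discussed next matters so much.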
One of the central issues in the Jacobi-Davidson approach is the connection
between the outer projection and the inner linear system. As mentioned earlier,
Hochstenbach and Notay [34] provide an in-depth analysis of how progress in the
outer iteration is connected to the residual norm of the correction equation. We
present their results for the standard eigenvalue problem to detail the stopping criteria we
used in our implementation. Our work uses a subset of the heuristics provided by
Hochstenbach and Notay. The correction equation in the standard eigenvalue prob-
lem has the form 3.25 with residual vector r = (A − θI)q and Ritz value θ = q^H Aq.

Algorithm 3.1.14: Deflation and Implicit Restart
1 if Converged then
2 if kcon = 0 then
3 R = S(1, 1);
4 else
5 R = [triu(R), Q(:, 1 : kcon)^H rsave(:, 1); zeros(1, kcon), S(1, 1)];
6 kcon = kcon + 1;
7 Y = Y(:, 1 : kcon);
8 H = H(1 : kcon, 1 : kcon);
9 V = VU(:, 2 : j);
10 S = S(2 : j, 2 : j);
11 M = S;
12 U = eye(j − 1);
13 j = j − 1;
14 else
15 Implicit restart;
16 j = jmin;
17 V = VU(:, 1 : jmin);
18 S = S(1 : jmin, 1 : jmin);
19 M = S;
20 U = eye(jmin);
Let the inner residual of the linear system be given by
r_in = −r − (I − qq^H)(A − θI)(I − qq^H)v.
The residual norm of the updated eigenvalue approximation,
r_eig = ||(A − θ̃I)(q + v)||2 / ||q + v||2,
then satisfies
(ρ − β)/√(1 + s²) ≤ r_eig ≤ (ρ + β)/√(1 + s²) if β < ρs,
where ρ = ||r_in||, s = ||v||, and β = |θ − τ + q^H(A − τI)v|. Hochstenbach and Notay
suggested exiting the inner iteration if at least one of the following holds:
ρk ≤ τ1||r|| and r_eig ≤ ε,
ρk ≤ τ1||r|| and r_eig > ε and ρk ≤ τ3 ε/√(1 + s²),
ρk ≤ τ1||r|| and r_eig > ε and k > 1 and the inner iteration has begun to stagnate,
where τ1 = 10^(−1/2), τ3 = 15, a norm-minimizing method such as GMRES is used, and
ε is the desired value of the residual norm. For the outer eigenvalue estimate, they
show that one may use the estimate
r_eig ≈ ρ/√(1 + s²).
As we use GMRES for the solution of the correction equations, we only report the
results pertaining to the use of GMRES in the context of the standard eigenvalue
problem. Extensive details on how to proceed when using alternatives to GMRES
may be found in [34]. Their approach computed these values twice during the solution
of the inner system, first when ||r_in|| ≤ τ1||r|| and then when ||r_in|| ≤ τ2||r|| where
τ2 = 10^(−1). Their heuristic choices of thresholds τ1, τ2, and τ3 were validated with a
selection of test problems but they suggest experimenting with other values and also
suggest the last criterion may be optional. For the standard eigenvalue problem with
GMRES used for the inner solve, r_eig is about ρk/√(1 + s²) until r_eig reaches its
asymptotic value β/√(1 + s²), and further reduction of ||r_in|| is useless.
The main numerical result reported for the standard NEP showed that the num-
ber of matrix-vector products remained about the same when using jdqr with the
new stopping criteria versus the default settings, but the revised stopping criteria

increased the number of inner iterations which are less costly than the outer itera-
tions. Hochstenbach and Notay concluded that jdqr would perform better with their
stopping criteria. We will explore the behavior of different stopping criteria in our nu-
merical experiments. Dedicating the appropriate amount of work to solving the inner
linear system is increasingly important in the case of multiple correction equations.
The stopping criteria in our block Jacobi-Davidson approach consists of the first two
suggestions with threshold parameters mentioned above and parameters consistent
with using GMRES for the linear solve. The situation becomes more complicated
when solving b correction equations and we do not claim to have the optimal criteria.
A detailed analysis is the subject of future work.
Before we present numerical results, we pause to summarize the methods and
associated software that will be used in our comparison. Table 3.1 lists all the ap-
proaches used in the ensuing section. Each of the approaches listed in Table 3.1 has
several parameters the user can control. Matlab's eigs allows one to set the initial
vector, the tolerance for the convergence criteria, the number of vectors in the search
subspace, the number of desired eigenvectors, and the targeted subset of the spec-
trum. Baglama's ahbeigs uses many of the same parameters, but has a few more
options. The user may set the block size and the number of blocks. One of the
parameters unique to ahbeigs is the adjust parameter that adds additional vectors
to the restart vectors after Schur vectors converge. This is intended to help speed-up
convergence. Sleijpen's jdqr has several input parameters the user may set to control
the calculation. One may set the tolerance for the stopping criteria, the minimum
and maximum dimensions of the search subspace, the initial vector, the type of linear
solver for the correction equation, the tolerance for the residual reduction of the linear
solver, the maximum number of iterations for the linear solver, and whether or not
to use a supplied preconditioner.

Table 3.1: Software for comparison of iterative methods
Name Source Description Blocked
eigs Matlab Built-in function based on ARPACK No
ahbeigs Baglama's website Matlab implementation of ABHA Yes
jdqr Sleijpen's website Matlab implementation of JDQR No
bA Our home-grown code Explicitly restarted Arnoldi Yes
bKS Our home-grown code Block Krylov-Schur Yes
bjdqr Our home-grown code Block extension of JDQR Yes
Jia Not available Block Arnoldi Yes
bIRAM Not available bIRAM by Lehoucq and Maschhoff Yes
Moller(L) Not available bIRAM by Moller Yes
Moller(S) Not available bIRAM by Moller Yes
Our implementation bA uses parameters such as the dimension of the search sub-
space, a target for the desired subset of the spectrum, a tolerance for the convergence
criteria and the number of desired eigenvalues. Our implementation bKS has these
same input parameters, but requires both the dimension of the contracted search
subspace and the dimension of the expanded search subspace. Our implementation
bjdqr is a block extension of jdqr but only uses GMRES as the linear solver. The
parameters include the number of desired eigenvalues, the starting block of vectors,
the minimum and maximum dimensions of the search subspace, and a tolerance for
the stopping criteria.
3.2 Numerical Results
In this section we present numerical experiments to assess the performance of our
block codes. We compare our blocked implementations to unblocked versions and to

Table 3.2: Ten eigenvalues of CK656 with largest real part
alternate approaches that are publicly available. The purpose of these experiments is
to get a good impression of how these methods actually perform in the context of the
NEP. We hope to explore why block methods may be an attractive option, whether
these block methods can handle difficult computations, and further understand the
performance of the chosen methods. We will study how block size affects convergence
and explore reasonable conditions for the underlying search subspaces. Each method
from Section 3.1 has specific parameters that may affect performance, and we en-
deavor to understand as much as possible. All the ensuing Matlab comparisons are
performed on a Mac Pro with dual 2.4 GHz Quad-Core Intel Xeon CPU and 8GB
RAM running Mac OS X Version 10.8.4 with Matlab R2013a.
We begin with assessing one of the theoretical advantages of block methods,
the computation of clustered or multiple eigenvalues. To explore this we selected a
suitable matrix from the NEP Collection in Matrix Market. We chose CK656 which
is a 656 x 656 real, nonsymmetric matrix with eigenvalues that occur in clusters of
order 4 with each cluster consisting of two pairs of very nearly multiple eigenvalues.
There is no information on the application of origin. For each approach, we attempt
to compute the ten eigenvalues with largest real part which are given in Table 3.2.
We fix the number of vectors in the search subspace and vary the block size b and the
number of blocks rib accordingly for the blocked versions. This makes comparisons

between our block Arnoldi method and our block Krylov-Schur method relatively
easy as both approaches solve the same size projected eigenvalue problem at each
iteration. Comparisons to Jacobi-Davidson based approaches are more difficult as
the size of the projected eigenvalue problem grows at each step in the outer iteration,
and there is the inner iteration to consider as well. It is important to note that an
outer iteration of Jacobi-Davidson requires more overall work than an inner iteration
and that both (inner iterations and outer iterations) involve multiplications by the full
matrix. To understand the performance of our Jacobi-Davidson based approaches, we
will conduct experiments with various dimensions of the search subspace. In Table 3.3,
we report the block size, number of blocks in the search subspaces, the number of
iterations, both inner and outer for our block Jacobi-Davidson, the total number of
matrix-vector products (MVPs), the total number of block matrix-vector products
(BMVPs), and the relative error of the resulting partial Schur decompositions. We
do not report the level of orthogonality as most methods are based on Householder
and loss of orthogonality was not observed to be an issue.
The initial 656 × 4 block of vectors was generated by Matlab's randn func-
tion with state 7 and the appropriate number of columns of vectors used in each
case presented in Table 3.3. The tolerance, the value of tol supplied by the user
for all methods, was 1e-12. This was to ensure that all methods use the same in-
put parameters, but we note that different stopping criteria are used for different
implementations. As detailed in Chapter 4, the stopping criteria used in bKS is
based on Schur vectors rather than Ritz vectors as is done in ARPACK. This is one
of the challenges of comparing algorithms using software. Details of the role of tol
may be found in each section discussing our implementations, in the user guide for
ARPACK [47], and in the documentation for ahbeigs and jdqr.
We attempted several experiments with higher tolerances, e.g. 1e-6 and 1e-9, but
eigs and unblocked bA occasionally missed a desired eigenvalue. It is worth noting

that the block methods did not experience the same difficulty. For all methods the
search subspace was fixed at 20 vectors with the exception of a few additional compu-
tations for Jacobi-Davidson based methods in which we extended this to 40 vectors.
We set the parameters in each method accordingly with a few exceptions. For ah-
beigs, we did not use the default value for the parameter adjust. This parameter
adjusts the number of vectors added when vectors start to converge to help speed up
convergence. The default value is adjust=3 and the sum of the number of desired
eigenvalues and the parameter adjust should be a multiple of the block size. We
attempted this experiment with adjust= 0, and then again with the default setting.
The performance was noticeably different and we will elaborate on this momentarily.
We also note that the documentation of ahbeigs recommends ten or more blocks
for the approach to converge and compute all desired eigenvalues. The size of the
truncated subspace in the block Krylov-Schur approaches was fixed at 8. We experi-
mented a bit with some of the options for jdqr, but ended up using the default values
of most parameters. These include the use of GMRES with a tolerance of 0.7^j for
the residual reduction and a maximum of five iterations for the linear solve. We did
not use a preconditioner for A − θI in the correction equation and we point out
that the Jacobi-Davidson based methods benefit greatly when a good preconditioner
is available.
As displayed in Table 3.3, every approach successfully found the desired eigenval-
ues in this experiment. The first three results from ahbeigs forced the method to use
no additional vectors. This did not seem to affect the performance for computations
using 10 or more blocks, but the performance was dramatically different for blocks
of size four. Over 6,000 MVPs were required compared to only 320 when additional
vectors were allowed, but we were using only half of the recommended number of
blocks. The worst performance in terms of total number of MVPs was observed by
bA. This was not entirely unexpected. In the unblocked case, comparing bA to

Table 3.3: Computing 10 eigenvalues for CK656
Method b nb Iterations MVPs (BMVPs) ||AZ - ZS||_2
eigs 1 20 111 9.716e-13
ahbeigs 1 20 117 9.262e-13
ahbeigs 2 10 144 (72) 6.959e-13
ahbeigs 4 5 6476 (1619) 1.005e-12
ahbeigs 1 20 107 9.262e-13
ahbeigs 2 10 130 (65) 6.959e-13
ahbeigs 4 5 320 (80) 1.005e-12
bA 1 20 21 430 4.372e-13
bA 2 10 29 600 (300) 1.212e-13
bA 4 5 40 840 (210) 1.774e-13
bKS 1 8, 20 11 140 5.606e-14
bKS 2 4, 10 12 152 (76) 1.858e-13
bKS 4 2,5 23 284 (71) 1.012e-13
bjdqr 1 8, 20 70, 275 354 1.888e-13
bjdqr 2 4, 10 39, 299 385 (39) 1.598e-13
bjdqr 4 2,5 25, 376 484 (25) 1.332e-13
bjdqr 1 16, 40 60, 231 307 1.614e-13
bjdqr 2 8, 20 31, 240 318 (31) 1.488e-13
bjdqr 4 4, 10 21, 317 417 (21) 1.339e-13
jdqr 1 8, 20 98, - 351 1.025e-13
jdqr 1 16, 40 82, - 292 1.371e-13

eigs is comparing Arnoldi with explicit restart to IRAM. It was expected that IRAM
would perform better as it employs a much better restart strategy. Both ahbeigs and
our bKS required a comparable total number of MVPs for the runs in which additional vectors were allowed for ahbeigs. In terms of iterations, bKS required nearly the same number for blocks of size one and two and performed nearly the same number of total MVPs.
The story for the Jacobi-Davidson based methods is harder to tell. In Table 3.3,
outer and inner iterations are reported for bjdqr but only outer iterations are avail-
able for jdqr. The default stopping criterion for the correction equation in jdqr was
used. Our unblocked bjdqr performed about the same number of matrix-vector prod-
ucts as jdqr, but invested more in the solution to the correction equation. This was
somewhat anticipated as Hochstenbach and Notay [34] experienced the same result
when using this stopping criterion and jdqr. This result also seems to suggest that the work invested in a more refined stopping criterion may be worthwhile, as fewer outer iterations will result in better overall performance while the inner iterations require less
work. We verified the number of MVPs for jdqr by tracking the number of times the
function providing the matrix was accessed as we did for eigs and ahbeigs. Overall,
the total number of MVPs remained relatively consistent for computations with our
bjdqr, though ahbeigs and bKS required fewer on average. The Jacobi-Davidson
based approaches seem to benefit more from a larger search subspace, specifically in
the case where more vectors are used for the implicit restart. We will further explore
the role of the dimension of the search subspace in additional numerical experiments.
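The MVP bookkeeping described above (counting how many times the function providing the matrix is accessed) can be reproduced with any ARPACK-style solver by wrapping the operator. A minimal sketch in Python with SciPy rather than the Matlab setup used in these experiments; `counting_operator` is a hypothetical helper name of ours:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigs

def counting_operator(A):
    """Wrap a matrix so every matrix-vector product is counted,
    mirroring how we tracked accesses to the matrix function."""
    count = {"mvps": 0}
    def matvec(x):
        count["mvps"] += 1
        return A @ x
    return LinearOperator(A.shape, matvec=matvec, dtype=A.dtype), count

rng = np.random.default_rng(7)
A = rng.standard_normal((100, 100))
op, count = counting_operator(A)

# eigs never sees the matrix itself, only the counted matvec.
vals, _ = eigs(op, k=5, which="LM", tol=1e-10)
print(count["mvps"], "matrix-vector products")
```

Because the solver only calls the operator, this count is exact regardless of the restart strategy the method uses internally.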
Next we repeat the same experiment but increase the dimension of the search
subspace. We set the number of vectors to be 48 with 20 used in the contracted
subspace for bKS. For the Jacobi-Davidson based approaches, we set jmin = 36
and jmax = 60. The results are reported in Table 3.4. Increasing the subspace
had some very interesting results. For our bA approach with block size b = 4,

Table 3.4: Computing 10 eigenvalues for CK656, expanded search subspace
Method b nb Iterations MVPs (BMVPs) ||AZ - ZS||_2
eigs 1 48 112 9.716e-13
ahbeigs 1 48 118 2.024e-14
ahbeigs 2 24 148 (74) 1.302e-14
ahbeigs 4 12 252 (63) 1.754e-14
ahbeigs 1 48 116 2.370e-14
ahbeigs 2 24 144 (72) 1.373e-14
ahbeigs 4 12 200 (50) 3.976e-13
bA 1 48 11 538 3.928e-15
bA 4 12 23 332 (166) 5.796e-14
bKS 1 20, 48 10 300 5.309e-15
bKS 2 10, 24 10 300 (150) 3.037e-15
bKS 4 5, 12 21 328 (82) 3.705e-15
bjdqr 1 36, 60 40, 156 232 1.560e-14
bjdqr 2 18, 30 23, 176 258 (23) 1.170e-13
bjdqr 4 9, 15 18, 272 380 (18) 1.586e-13
bjdqr 1 72, 120 18, 68 158 3.027e-14
bjdqr 2 36, 60 9, 64 154 (9) 2.968e-14
bjdqr 4 9, 15 12, 176 296 (12) 1.345e-13
jdqr 1 36, 60 53, - 232 2.570e-14
jdqr 1 72, 120 14,- 129 3.798e-14
the required MVPs decreased dramatically from 840 to 332. The performance of
ahbeigs also improved with the increased search subspace. As the number of blocks

was consistently larger than the recommended 10, the difference between the runs
with and without the additional vectors was less significant. The additional vectors
only seemed to be needed with block size b = 4, where the total number of MVPs
was reduced by about 20%. The benefit of blocks was most evident in bjdqr as with
the increased search subspace, slightly fewer total MVPs were required when comparing blocks of size one and blocks of size two. One of the most interesting results illustrated in these experiments is that bKS requires fewer total MVPs with a smaller search subspace. When restricting the search subspace to 20 vectors, bKS required half as many MVPs as it did for blocks of size 1 and 2. For blocks of size 4, bKS required about 12% fewer MVPs when working with a smaller search subspace. We note that
the difference between the size of the expanded search subspace and the contracted
search subspace increased from 12 to 28 vectors. The increased number of total MVPs
required for bKS here seems to be tied to this increase. To verify this, we repeated
this experiment just for bKS with the larger search subspace and only 12 additional
vectors in the expansion phase, that is we set ks = 36 rather than ks = 20. The
results are presented in Table 3.5. The results in Table 3.5 demonstrate that bKS
Table 3.5: Computing 10 eigenvalues for CK656, ks = 36
Method b nb Iterations MVPs (BMVPs) ||AZ - ZS||_2
bKS 1 36, 48 10 156 4.105e-15
bKS 2 18, 24 10 156 (78) 4.490e-15
bKS 4 9, 12 13 192 (48) 3.061e-12
performs similarly to the experiment with only 20 vectors in the search subspace.
This seems to indicate that the difference between the dimension of the expanded
search subspace and the dimension of the contracted search subspace is an important
part of the overall performance of bKS. If the difference is too large, additional and

Figure 3.3: Complete Spectrum of TOLS2000
seemingly unnecessary work is performed. This is in contrast to ahbeigs and the
Jacobi-Davidson based approaches which all perform better on average with a larger
search subspace.
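The counts in Tables 3.3 and 3.5 are consistent with a simple accounting model, which is our inference from the reported numbers rather than a formula stated by the method: the first cycle fills the expanded subspace with kf vectors (kf and ks counted in vectors, i.e., blocks times b), and every later cycle only replaces the kf - ks vectors discarded at contraction. A short sketch checking this against the tables:

```python
def bks_mvps(kf, ks, iterations):
    """Predicted total MVPs for block Krylov-Schur: the first cycle
    builds kf vectors; each subsequent cycle adds back the kf - ks
    vectors removed when the subspace is contracted."""
    return kf + (iterations - 1) * (kf - ks)

# Rows of Table 3.3 (ks, kf in vectors = blocks times b):
assert bks_mvps(kf=20, ks=8, iterations=11) == 140   # b = 1
assert bks_mvps(kf=20, ks=8, iterations=12) == 152   # b = 2
assert bks_mvps(kf=20, ks=8, iterations=23) == 284   # b = 4
# Rows of Table 3.5:
assert bks_mvps(kf=48, ks=36, iterations=10) == 156  # b = 1
assert bks_mvps(kf=48, ks=36, iterations=13) == 192  # b = 4
```

This makes the observation above concrete: the per-cycle cost is kf - ks, so widening the gap between the expanded and contracted subspaces raises the price of every restart cycle.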
In our next experiment, we examine a matrix whose eigenvalues are difficult to compute. Here we hope to assess whether our implementations live up to our hopes
for robustness. This example comes from the NEP Collection in Matrix Market as
well. The matrix TOLS2000 is a Tolosa matrix from aerodynamics, related to the
stability analysis of a model of a plane in flight. The eigenmodes of interest are
complex eigenvalues with imaginary part in a certain frequency range determined by
engineers. Figure 3.3 shows the complete spectrum. This computation aims to com-
pute eigenvalues with largest imaginary part and the associated eigenvectors. The
matrix is sparse and highly nonnormal making it potentially difficult to compute a
few eigenpairs. Jia [38] computed the three eigenvalues with largest imaginary part

to be
λ1 = -730.68859 + 2330.11977i,
λ2 = -726.98657 + 2324.99172i,
λ3 = -723.29395 + 2319.85901i,
using a block Arnoldi approach with refined approximate eigenvectors, and we will
make indirect comparisons to his results as there is no publicly available code. We
can make such a comparison thanks to efforts such as Matrix Market. Though the
codes may not be available, the matrices used are available and we can compare to
some extent to previous results. Jia observed the results in Table 3.6 with his proposed method. Jia's results show that his method benefited from a larger search subspace
Table 3.6: Summary of results presented by Jia [38]
b nb Iterations MVPs ||AZ - ZS||_2
2 25 67 3350 7.9e-7
2 30 33 1980 8.1e-7
2 35 26 1820 7.2e-7
2 40 32 2560 9.1e-7
2 50 11 1100 1.9e-7
3 20 88 5280 6.0e-7
3 30 20 1800 8.4e-7
3 40 8 960 6.3e-7
as the number of MVPs decreased when the number of blocks in the search subspace
increased. This is similar to the behavior of ahbeigs in previous experiments. Here,
we set the tolerance to 1e-9 to compare with the results by Jia and also by Baglama [7].
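Jia's counts follow the expected cost model for explicit restarts, where every cycle rebuilds the entire search subspace of b*nb vectors from scratch; a quick sketch checking this against Table 3.6 (the function name is ours):

```python
def explicit_restart_mvps(b, nb, iterations):
    """Each explicit restart rebuilds the full search subspace of
    b * nb vectors, so every iteration costs b * nb MVPs."""
    return b * nb * iterations

# Rows of Table 3.6:
assert explicit_restart_mvps(b=2, nb=25, iterations=67) == 3350
assert explicit_restart_mvps(b=3, nb=20, iterations=88) == 5280
assert explicit_restart_mvps(b=3, nb=40, iterations=8) == 960
```

This is exactly the cost structure that Krylov-Schur style restarts avoid, since they retain the contracted subspace between cycles instead of rebuilding it.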

Again we experiment with allowing ahbeigs no additional vectors and the default
setting. The initial block was generated using Matlab's randn with state 7, as before,
with the appropriate number of columns used in each computation. Jia examined
blocks of size two with 25 to 50 blocks and blocks of size three with 20 to 40 blocks
and we selected various combinations to allow for a meaningful comparison. As
was the case for the experiments performed by Jia, our block Arnoldi, bA, failed
to converge for various block sizes with several different dimensions of the search
subspace. Baglama found that jdqr failed to converge as well and we observed this
for both jdqr and our bjdqr version with refined stopping criteria when working with
blocks of size one. Even with blocks of various sizes, our bjdqr failed to converge.
This could be due to the difficult nature of the problem and partly due to not using
a preconditioner.
Both eigs and ahbeigs have options for computing the eigenvalues with largest
imaginary part by setting the input option SIGMA='LI'. When attempting to locate the desired eigenvalues using the appropriate input string, both eigs and ahbeigs returned the three eigenvalues with largest imaginary part in magnitude. That is, both routines returned some combination of the complex conjugates
λ1 = -730.68859 + 2330.11977i,
λ2 = -730.68859 - 2330.11977i,
λ3 = -726.98660 - 2324.99171i,
rather than the approximations offered by Jia. This could be an issue specific to Mat-
lab as the documentation on eigs does not include how the interface to ARPACK is
achieved. Matlab's eigs can take a real or complex value as input for SIGMA, but ahbeigs only has the option of a real number or the appropriate string to designate the location, and it works only in real arithmetic. We attempted to use various targets for eigs with no success. As Baglama [7] did not report the actual eigenvalues

computed, we report only the results we were able to generate using ahbeigs. In
Table 3.7 we report the same information as before, adding the number of successfully computed eigenvalues based on the ones reported by Jia, giving credit for complex conjugates.
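The behavior described above is easy to reproduce with ARPACK through SciPy rather than Matlab's eigs (an assumption on our part, but both interfaces sit on the same ARPACK library): for a real matrix the Arnoldi process runs in real arithmetic, so the 'LI' selection returns eigenvalues of largest imaginary part in magnitude, i.e., a conjugate pair. A small sketch with a matrix whose spectrum is known by construction:

```python
import numpy as np
from scipy.linalg import block_diag
from scipy.sparse.linalg import eigs

# Real 20x20 matrix built from 2x2 blocks [[a, c], [-c, a]],
# whose eigenvalues are a +/- c*i; here a = -0.1*j and c = j.
blocks = [np.array([[-0.1 * j, float(j)], [-float(j), -0.1 * j]])
          for j in range(1, 11)]
A = block_diag(*blocks)

# Ask for the two eigenvalues of "largest imaginary part".
vals, _ = eigs(A, k=2, which="LI")

# In real arithmetic the selection is by |imag|, so we get the
# conjugate pair -1 +/- 10i rather than two distinct pairs.
print(np.sort_complex(vals))
```

A complex-arithmetic solver applied to the same matrix could instead return the two eigenvalues with largest signed imaginary part, which is the distinction at issue in the TOLS2000 computation.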
The only approach to successfully compute all three desired eigenvalues was bKS.
When compared to the results in Table 3.6, we see that bKS performed significantly fewer MVPs while using fewer vectors in the search subspace. For example, when using 10 blocks of size three, bKS required 750 MVPs compared to Jia's approach, which
required 5280 MVPs when 20 blocks of size three were used. Increasing the number
of blocks to 40 brought the number of total MVPs required by Jia's approach
much closer (down to 960) but this required four times the number of vectors in the
search subspace.
Again we explored the effect of additional vectors for ahbeigs. In Table 3.7 there
are select duplicate runs for ahbeigs. The first used no additional vectors and the
second used the default value. Forcing the method to not use any additional vectors
resulted in rather erratic behavior. For blocks of size two, additional vectors helped
when the search subspace consisted of 15 blocks but performance suffered when using
30 blocks. Due to this, all ensuing runs were performed with the default setting. In
all the results by ahbeigs, the best performance in terms of MVPs occurred when
using 15 blocks of size two and additional vectors.
The results for bKS are mainly what one would expect. First, the configuration that required the fewest total MVPs was unblocked bKS. This was followed
closely by bKS using 15 blocks of size five and then 10 blocks of size three. Two
different configurations of bKS outperformed eigs and a third required approximately
the same number of total MVPs. Again, more MVPs were required for configurations
of bKS when the difference between the expanded and contracted subspace was
larger. The expanded search subspace needed to be large enough, but making it too

Table 3.7: Computing three eigenvalues with largest imaginary part for TOLS2000
Method b nb Iterations kconv/kd MVPs (BMVPs) ||AZ - ZS||_2
eigs 1 30 2/3 746 8.305e-7
ahbeigs 1 30 2/3 1806 6.291e-7
ahbeigs 1 30 2/3 1580 1.107e-6
ahbeigs 2 15 2/3 1424 (712) 1.563e-6
ahbeigs 2 15 2/3 942 (471) 1.990e-6
ahbeigs 2 30 2/3 1740 (870) 1.604e-6
ahbeigs 2 30 2/3 2166 (1083) 3.518e-8
ahbeigs 3 10 2/3 2022 (674) 1.138e-6
ahbeigs 3 30 2/3 1350 (450) 1.969e-6
ahbeigs 5 6 2/3 3605 (721) 2.345e-6
ahbeigs 5 6 2/3 4090 (818) 2.251e-6
ahbeigs 5 10 2/3 1810 (362) 2.188e-6
ahbeigs 5 15 2/3 1960 (392) 2.020e-6
bKS 1 12, 30 27 3/3 498 4.541e-9
bKS 2 6, 15 64 3/3 1164 (582) 9.136e-9
bKS 3 4, 10 41 3/3 750 (250) 1.497e-6
bKS 2 12, 30 32 3/3 1176 (588) 1.497e-6
bKS 3 8, 20 33 3/3 1212 (404) 7.266e-9
bKS 5 4, 10 36 3/3 1100 (220) 4.348e-9
bKS 5 5, 10 38 3/3 975 (195) 1.007e-6
bKS 5 10, 15 23 3/3 625 (125) 9.105e-7
bKS 5 5, 15 70 3/3 3525 (705) 4.768e-9
bKS 10 2, 7 41 3/3 2070 (207) 1.231e-7
bKS 10 5, 10 28 3/3 1450 (145) 2.113e-6

large relative to the contracted subspace was not necessarily beneficial. The best
performance was observed when a modest number of blocks were used to expand,
specifically six blocks of size three and five blocks of size five. Overall, bKS seems
flexible enough to work with a variety of configurations. We will further explore the
relationships between block size, number of blocks (or dimension of search subspace),
required iterations, and required matrix-vector products.
It is worth noting that for this numerical experiment we needed to deflate only
one eigenvalue at a time in our successful approaches. Initial runs showed detection
of more than one eigenvalue at a time, but the accuracy suffered. For most of the multiple deflations we observed representativity of the partial Schur form ||AZ - ZS|| = O(10^-7). Our bKS typically looks for multiple deflations, but with the conditioning of this problem we needed to be a bit less ambitious to preserve accuracy. This experiment shows that our bKS approach performs well even with a challenging problem.
Next we consider a matrix used in experiments by Moller [49], by Lehoucq and
Maschhoff [46], and by Baglama [7]. The purpose of this experiment is to make
indirect comparisons to versions of bIRAM presented separately by Moller and by
Lehoucq and Maschhoff as there are no publicly available codes. As Baglama [7] did,
we refer to the two methods presented by Moller [49] as Moller(S) and Moller(L)
and to the work by Lehoucq and Maschhoff as bIRAM. The matrix under considera-
tion is HOR131 from the Harwell-Boeing Collection available on Matrix Market. The
matrix is a 434 x 434 nonsymmetric matrix and comes from a flow network problem.
We desire to compute the 8 eigenvalues with largest real part. We set the number
of stored vectors to be 24, set the tolerance to be 1e-12, and generate the same initial starting block as we have in previous experiments. We opted to use the default
for adjust in ahbeigs. We also verified the accuracy of the computed eigenvalues
by comparing to those computed by Matlab's eig, but we do not report those

Table 3.8: Summary of results for HOR131
Method b nb MVPs
bIRAM 1 24 77
bIRAM 2 12 84
bIRAM 3 8 99
bIRAM 4 6 108
Moller(S) 1 24 88
Moller(S) 2 12 136
Moller(S) 4 6 264
Moller(L) 1 24 79
Moller(L) 2 12 93
Moller(L) 4 6 105
details as all eigenvalues were computed within the desired tolerance. In Table 3.8
we report the results for indirect comparison.
In Table 3.9 we report the results of our computations. In Table 3.10 we present
the results of our computations for Jacobi-Davidson based approaches. There are
several things to note among the results for this experiment. First, bA is again
the worst performer in terms of MVPs and again this was somewhat expected. Both
ahbeigs and bKS seem to be competitive with bIRAM, Moller(S) and Moller(L).
The total number of MVPs required is similar, and in our numerical experiments we have seen that different initial vectors can change the MVP count by enough to consider these approaches competitive. It is especially interesting to note that all the methods
represented in Table 3.8 require more MVPs as the block size increases. This seems to
be the case for ahbeigs, bA and bjdqr. The lone exception is bKS. As demonstrated

Table 3.9: Computing 8 eigenvalues with largest real part for HOR131, Krylov
Method b nb Iterations MVPs (BMVPs) ||AZ - ZS||_2
eigs 1 24 83 2.958e-15
ahbeigs 1 24 83 2.096e-15
ahbeigs 2 12 140 (70) 2.295e-14
ahbeigs 3 8 180 (60) 2.100e-13
ahbeigs 4 6 316 (79) 1.787e-13
bA 1 24 17 416 1.627e-13
bA 2 12 20 496 (248) 6.815e-13
bA 3 8 24 600 (200) 5.410e-13
bA 4 6 25 632 (158) 7.866e-13
bKS 1 16, 24 10 96 1.301e-14
bKS 1 18, 24 11 84 5.991e-14
bKS 2 6, 12 11 144 (72) 2.172e-14
bKS 2 8, 12 12 112 (56) 6.173e-13
bKS 2 10, 12 16 84 (42) 3.430e-13
bKS 3 4,8 13 168 (56) 1.396e-13
bKS 3 5, 8 14 141 (47) 1.396e-13
bKS 3 6, 8 13 114 (38) 1.396e-13
bKS 4 2,6 16 264 (66) 6.044e-13
bKS 4 3, 6 14 180 (45) 2.175e-13
bKS 4 4, 6 19 168 (42) 2.738e-13
in Table 3.9, as we increase the dimension of the contracted subspace, the number
of total MVPs decreases consistently for all block sizes reported in Table 3.9. This

Table 3.10: Computing 8 eigenvalues with largest real part for HOR131, JD
Method b nb Iterations MVPs (BMVPs) ||AZ - ZS||_2
bjdqr 1 18, 30 37, 144 199 1.030e-12
bjdqr 2 9, 15 24, 184 250 (24) 4.382e-13
bjdqr 3 6, 10 18, 204 276 (18) 4.052e-13
bjdqr 4 4, 32 15, 224 300 (15) 1.775e-12
bjdqr 1 24, 48 26, 100 150 1.002e-12
bjdqr 2 12, 24 19, 144 206 (19) 1.136e-12
bjdqr 3 8, 16 16, 180 252 (16) 1.051e-12
bjdqr 4 6, 12 12, 176 248 (12) 1.471e-12
jdqr 1 18, 30 65, - 182 8.546e-15
jdqr 1 36, 60 14, - 67 1.954e-13
suggests that by finding the optimal configuration for the search subspace, bKS can
be a very competitive approach.
The Jacobi-Davidson results in Table 3.10 tell much the same story as before
for these types of methods. They all seem to require a bit more MVPs in general,
but benefit from a larger search subspace. Increasing jmin and jmax makes bjdqr
competitive but requires more vectors in the search subspace which then requires
the solution of a larger eigenvalue problem than those methods not based on Jacobi-
Davidson. Additional storage is also required.
Thus far, the performance of our bKS approach has been the most consistent
among the iterative approaches discussed in this section. It has handled difficult com-
putations and done so in a robust and efficient manner. To get a better understanding
of how block size and dimension of the search subspace affect the performance, we

embark on some additional numerical experiments. The experiments performed to
this point have focused on sparse matrices, from real applications, that others have
used to assess performance of approaches to the NEP. We now consider dense random
matrices. We begin with a comparison similar to the previous numerical experiments.
We seek to compute the five eigenvalues with smallest real part for a 2,500 x 2,500 real matrix generated using Matlab's randn with initial state 4. This populates
the matrix with random numbers drawn from the standard normal distribution. We
used the same initial starting block as in previous computations and again set the
tolerance to 1e-12. The only other parameter we set is fixing 100 vectors in the search subspace. For ahbeigs we use default values for the remaining parameters. We set the contracted search subspace dimension to ks = 72 and adjusted the dimension of the expanded subspace kf as close to 100 as possible using multiples of the block sizes. The
results of our experiment are presented in Table 3.11. We observed that bA failed to
converge for various configurations and jdqr had difficulties as well. We had to set the tolerance to 1e-11 to generate the results in Table 3.11, as jdqr failed to converge with the tolerance set at 1e-12. The results in Table 3.11 show that bKS performs better than Matlab's eigs and Baglama's ahbeigs for blocks up to size three. Larger blocks
increase the total number of required MVPs, but not on the same scale as what is
required by ahbeigs. Our Jacobi-Davidson implementation was able to locate all five
desired eigenvalues where jdqr needed to relax the tolerance. This is most likely due
to the difference in stopping criteria. As we have seen in previous experiments, we
may be able to adjust the values of ks and kf to reduce the number of MVPs required
and increase overall performance.
So far, bKS has performed well in difficult eigenvalue computations involving
sparse matrices and demonstrated that it is an attractive option for computing a few
eigenvalues of dense random matrices. We now turn our focus on only bKS as we
attempt to understand how it behaves for dense random matrices. For the ensuing

Table 3.11: Computing 5 eigenvalues with smallest real part for random matrix
Method b nb Iterations MVPs (BMVPs) ||AZ - ZS||_2
eigs 1 100 1687 5.987e-13
ahbeigs 1 100 1588 3.632e-13
ahbeigs 2 50 3502 (1751) 5.400e-13
ahbeigs 3 33 4869 (1623) 3.572e-13
ahbeigs 4 25 11188 (2797) 4.899e-13
ahbeigs 5 20 9010 (1802) 4.385e-13
ahbeigs 6 17 12522 (2087) 5.451e-13
bKS 1 72, 100 22 688 4.488e-13
bKS 2 36, 50 33 996 (498) 4.125e-13
bKS 3 24, 33 48 1368 (456) 3.204e-13
bKS 4 18, 25 65 1892 (473) 4.488e-13
bKS 6 12, 17 92 2832 (472) 5.119e-13
bjdqr 1 84, 108 371, 1463 1918 1.047e-14
bjdqr 2 42, 54 238, 1835 2395 (238) 9.896e-15
bjdqr 3 28, 36 175, 1988 2597 (175) 1.045e-14
bjdqr 4 21, 27 156, 2350 3058 (156) 1.471e-12
jdqr 1 84, 108 266, - 1631 3.858e-12



ON ACCELERATING THE NONSYMMETRIC EIGENVALUE PROBLEM IN MULTICORE ARCHITECTURES
by
Matthew W. Nabity
M.S., University of Colorado Boulder, 2003
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Applied Mathematics, 2013


This thesis for the Doctor of Philosophy degree by Matthew W. Nabity has been approved by Julien Langou, Advisor; Karen Braman; Lynn Bennethum; Jack Dongarra; Jan Mandel. July 12, 2013


Nabity, Matthew W. (Ph.D., Applied Mathematics)
On Accelerating the Nonsymmetric Eigenvalue Problem in Multicore Architectures
Thesis directed by Associate Professor Julien Langou

ABSTRACT

Scientific computing relies on state of the art algorithms in software libraries and packages. Achieving high performance in today's computing landscape is a challenging task as modern computer architectures in High-Performance Computing (HPC) are increasingly incorporating multicore processors and special purpose hardware, such as graphics processing units (GPUs). The emergence of this heterogeneous environment necessitates the development of algorithms and software packages that can fully exploit this structure. Maximizing the amount of parallelism within an algorithm is an essential step towards high performance.

One of the fundamental computations in numerical linear algebra is the computation of eigenvalues and eigenvectors. Matrix eigenvalue problems can be found in many applications. The algorithm of choice depends on the nature of the matrix. For sparse matrices, there are several viable approaches, but, for dense nonsymmetric matrices, one algorithm remains the method of choice. This is the QR algorithm. Unfortunately, on modern computer platforms, there are limitations to this approach in terms of parallelism and data movement. Current implementations do not scale well or do not take full advantage of the heterogeneous computing environment.

The goal of this work is to examine nonsymmetric matrix eigenvalue problems in the context of HPC. In Chapter 2, we examine tile algorithms and the implementation of block Arnoldi expansion in the context of multicore. Pseudocodes and implementation details are provided along with performance results.

In Chapter 3, we examine various algorithms in the context of computing a partial Schur factorization for nonsymmetric matrices. We examine several iterative approaches and present implementations of specific methods. The methods studied include a block version of explicitly restarted Arnoldi with deflation, a block extension of Stewart's Krylov-Schur method, and a block version of Jacobi-Davidson. We present a new formulation of block Krylov-Schur that is robust and achieves improved performance for eigenvalue problems with sparse matrices. We experiment with dense matrices as well.

We expand on our work and present a C code for our block Krylov-Schur approach using LAPACK routines in Chapter 4. The code is robust and represents a first step towards an optimized version. Our code can use any desired block size and compute any number of desired eigenvalues. Detailed pseudocodes are provided along with a theoretical analysis.

The form and content of this abstract are approved. I recommend its publication.
Approved: Julien Langou


DEDICATION
To Blake and Breeann


ACKNOWLEDGMENT
First I want to thank my advisor Julien Langou for everything. His encouragement and guidance over the years has made this work possible. I want to thank my committee members for their patience and support; Professor Lynn Bennethum for guiding this process by being chair, Professor Karen Braman for the discussions over the years and for being a friendly face at conferences, Professor Jack Dongarra for the summer research opportunities at the Innovative Computing Laboratory (ICL) at the University of Tennessee Knoxville (UTK), and Professor Jan Mandel for pushing me as a student and researcher.
The work in this thesis was partially supported by the National Science Foundation Grants GK-12-0742434, NSF-CCF-1054864, and by the Bateman family in the form of the Lynn Bateman Memorial Fellowship.
This research benefited greatly by access to computing resources at the Center for Computational Mathematics (CCM) at the University of Colorado Denver and the ICL at UTK. The "colibri" cluster, NSF-CNS-0958354, was used for some of the experimental results presented in this work.
The journey was certainly enriched by the friendship of my fellow graduate students both at the University of Colorado Boulder and here in the Mathematics Department at the University of Colorado Denver.
Finally, I would not be here without the support of my family. I thank my parents for supporting me as I followed my own path. I owe a great deal to my brother Paul for constantly pushing his older brother to keep up with him. Lastly, finishing this work would not have been possible without the constant support of my Breeann.


TABLE OF CONTENTS
Figures
Tables
Chapter
1. Introduction
1.1 The Computing Environment
1.2 The Standard Eigenvalue Problem
1.3 Algorithms
1.4 Contributions
2. Tiled Krylov Expansion on Multicore Architectures
2.1 The Arnoldi Method
2.2 Tiled Arnoldi with Householder
2.3 Numerical Results
2.4 Conclusion and Future Work
3. Alternatives to the QR Algorithm for NEP
3.1 Iterative Methods
3.1.1 Arnoldi for the nonsymmetric eigenvalue problem and IRAM
3.1.2 Block Arnoldi
3.1.3 Block Krylov-Schur
3.1.4 Block Jacobi-Davidson
3.2 Numerical Results
3.3 Conclusion and Future Work
4. Block Krylov-Schur with Householder
4.1 The Krylov-Schur Algorithm
4.2 The Block Krylov-Schur Algorithm
4.3 Numerical Results
4.4 Conclusion and Future Work


5. Conclusion
References


FIGURES
Figure
1.1 Data structure for tiled matrix
1.2 The Hessenberg structure disturbed
1.3 Chasing the bulge
1.4 Scalability of ZGEES
2.1 Compact storage of the Arnoldi decomposition
2.2 Tiled QR with nt = 5 using xTTQRT
2.3 Modification of QUARK_CORE_ZGEQRT for sub-tiles
2.4 Performance comparison for a dense smaller matrix
2.5 Performance comparison for a tiled dense larger matrix
2.6 Performance comparison for a tiled tridiagonal matrix
2.7 Performance comparison for a tiled diagonal matrix
3.1 Block Krylov-Schur Decomposition
3.2 Expanded Block Krylov Decomposition
3.3 Complete Spectrum of TOLS2000
3.4 MVPs versus dimension of search subspace
3.5 Iterations versus dimension of search subspace for block Krylov-Schur method
3.6 MVPs versus number of desired eigenvalues for block Krylov-Schur method
3.7 Iterations versus size of matrix for block Krylov-Schur with b = 5, ks = 40 and kf = 75
4.1 Typically, implementations of iterative eigensolvers are not able to compute n eigenvalues for an n x n matrix. Here is an example using Matlab's eigs which is based on ARPACK
4.2 Structure of the Rayleigh quotient
4.3 Structure of block Krylov-Schur decomposition


4.4 Subspace initialization and block Krylov-Schur decomposition
4.5 Expanded block Krylov decomposition
4.6 Truncation of block Krylov-Schur decomposition
4.7 Scalability experiments for our block Krylov-Schur algorithm to compute the five largest eigenvalues of a 4,000 x 4,000 matrix
4.8 Scalability experiments for LAPACK's ZGEES algorithm to compute the full Schur decomposition of a 4,000 x 4,000 matrix


TABLES
Table
1.1 Available Iterative Methods for the NEP
3.1 Software for comparison of iterative methods
3.2 Ten eigenvalues of CK656 with largest real part
3.3 Computing 10 eigenvalues for CK656
3.4 Computing 10 eigenvalues for CK656, expanded search subspace
3.5 Computing 10 eigenvalues for CK656, ks = 36
3.6 Summary of results presented by Jia [38]
3.7 Computing three eigenvalues with largest imaginary part for TOLS2000
3.8 Summary of results for HOR131
3.9 Computing 8 eigenvalues with largest real part for HOR131, Krylov
3.10 Computing 8 eigenvalues with largest real part for HOR131, JD
3.11 Computing 5 eigenvalues with smallest real part for random matrix


1. Introduction
This work is about computing with matrices, primarily those with complex entries. The field of complex numbers is denoted by C, so the set of n x 1 matrices, usually called column vectors, is denoted by C^n and the set of m x n matrices is denoted by C^(m x n). Our main focus is on properties of square matrices, that is, matrices in C^(n x n). Specifically, we are concerned with algorithms to compute eigenvalues and corresponding invariant subspaces, or eigenvectors, of square matrices. There are several types of matrices whose structure provides desirable properties one may exploit when formulating algorithms. These properties may lead to attractive numerical properties, such as stability, or attractive computational properties, such as efficient storage. Much has been done to formulate algorithms for the computation of eigenvalues and invariant subspaces for Hermitian matrices and symmetric matrices, see [63] and [71] for details. Many other structures that induce particular eigenvalue properties, such as block cyclic matrices, Hamiltonian matrices, orthogonal matrices, symplectic matrices and others have been studied extensively in [43, 70]. Our computational setting is problems involving nonsymmetric matrices which have no such underlying structure. These matrices may range from being extremely sparse, that is, primarily consisting of zeros, to dense (i.e., not sparse). Eigenvalue problems with nonsymmetric matrices show up in many branches of the sciences and numerous collections of such matrices from real applications are maintained in efforts such as the Matrix Market [12] and the University of Florida Sparse Matrix Collection [22]. Example applications in the NEP Collection in Matrix Market include computational fluid dynamics, several branches of engineering, quantum chemistry, and hydrodynamics.
We begin with an introduction to the computing environment, a review of the theoretical foundations from linear algebra, and key algorithmic components. Most of the information presented in this section is needed introductory material for the


nonsymmetric eigenvalue problem (NEP) and subsequent sections, but it also serves as an overview of the state of the art with respect to numerical methods for the standard eigenvalue problem.
1.1 The Computing Environment
State of the art algorithms are the foundation of software libraries and packages that strive to achieve optimal performance in today's computing landscape. Modern computer architectures in High-Performance Computing (HPC) are increasingly incorporating multicore processors and special purpose hardware, such as graphics processing units (GPUs). The change to this multicore environment necessitates the development of algorithms and software packages that can take full advantage of this computational setting. One major challenge is maximizing the amount of parallelism within an algorithm, but there are many other issues related to designing effective algorithms which are nicely surveyed in [5].
Our work will make use of several standards in computing. The BLAS (Basic Linear Algebra Subprograms) library is used to perform essential operations [11]. Operations in numerical linear algebra are divided into three levels of functionality based on the type of data involved and the cost of the associated operation. Level 1 BLAS operations involve only vectors, Level 2 BLAS operations involve both matrices and vectors, and Level 3 BLAS operations involve only matrices. Level 3 BLAS operations are the preferred type of operation for optimal performance. Optimized BLAS libraries are often provided for specific architectures. The BLAS library can be multithreaded to make use of the multicore environment.
LAPACK (Linear Algebra PACKage) is a library based on BLAS that performs higher level operations common to numerical linear algebra [2]. Routines in LAPACK are coded with a specific naming structure where the first letter indicates the matrix data type, the next two indicate the type of matrix, and the last three indicate the computation performed by the program. We will refer to specific routines


generically using a lowercase x in place of the matrix data type, so xGEQRT, which computes a block QR factorization of a general matrix, may refer to any of the variants SGEQRT (real), DGEQRT (double), CGEQRT (complex), or ZGEQRT (double complex). Algorithms in LAPACK are formulated to incorporate Level 3 BLAS operations as much as possible. This is accomplished by organizing algorithms so that operations are done with panels, either blocks of columns or rows of a matrix, rather than single columns or rows. While this perspective provides algorithms rich in Level 3 BLAS operations, there are disadvantages in the context of multicore. Memory architecture, synchronizations, and limited fine granularity can diminish performance.

ScaLAPACK (Scalable LAPACK) is a library that includes a subset of LAPACK routines redesigned for distributed memory MIMD (multiple instruction, multiple data) parallel computers [10]. The LAPACK and ScaLAPACK libraries are considered the standard for high performance computations in dense linear algebra. Both libraries implement sequential algorithms that rely on parallel building blocks. There are considerable advantages to reformulating old algorithms and developing new ones to increase performance on multicore platforms as demonstrated in [19, 18, 53].

PLASMA (Parallel Linear Algebra Software for Multicore Architectures) is a recent development focusing on tile algorithms, see [19, 18, 53], with the goal of addressing the performance issues of LAPACK and ScaLAPACK on multicore machines [1]. PLASMA uses a different data layout, subdividing a matrix into square tiles as demonstrated in Figure 1.1. Operations restricted to small tiles create fine grained parallelism and provide enough work to keep multiple cores busy. The current version of PLASMA, release 2.4.6, relies on runtime scheduling of parallel tasks. Hybrid environments are developing as well that combine both multicore and other special purpose architectures like GPUs. Projects such as MAGMA (Matrix Algebra on GPU and Multicore Architectures) aim to build next generation libraries that fully exploit heterogeneous architectures [65].
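The three BLAS levels just described differ chiefly in the ratio of arithmetic to data touched. A minimal pure-Python sketch of one representative operation per level (illustrative only; the actual BLAS consists of optimized Fortran/C kernels, and the thesis's own codes call those directly):

```python
def axpy(alpha, x, y):
    # Level 1 (vector-vector): y <- alpha*x + y, O(n) flops on O(n) data
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def gemv(A, x):
    # Level 2 (matrix-vector): y <- A*x, O(n^2) flops on O(n^2) data
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def gemm(A, B):
    # Level 3 (matrix-matrix): C <- A*B, O(n^3) flops on only O(n^2) data,
    # hence the best compute-to-memory-traffic ratio and the preferred level
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]
```

The Level 3 ratio of O(n^3) work to O(n^2) data is what lets optimized libraries hide memory latency behind computation, which is why the algorithms in this work are organized around matrix-matrix kernels.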


Figure 1.1: Data structure for tiled matrix

The multicore perspective necessitates the development of algorithms that can incorporate evolving computational approaches, such as tiled data structures that facilitate high performance. The work presented in ensuing chapters is focused on accelerating algorithms for the NEP in the context of multicore and hybrid architectures. We now turn to some basic theory of eigenvalues and an essential component of state-of-the-art algorithms.

1.2 The Standard Eigenvalue Problem

A nonzero vector v ∈ C^n is an eigenvector, or right eigenvector, of A if Av is a constant multiple of v. That is, there is a scalar λ ∈ C such that Av = λv. The scalar λ is an eigenvalue associated with the right eigenvector v. Equivalently, an eigenvalue of a matrix A ∈ C^{n×n} is a root of the characteristic polynomial

p_A(λ) = |A − λI| = 0,

where the vertical bars denote the determinant of the matrix A − λI. A nonzero vector y ∈ C^n is a left eigenvector if it satisfies y^H A = λ y^H, where y^H is the conjugate transpose or Hermitian transpose of y. As our immediate focus is on the former, we will now refer to all right eigenvectors as simply eigenvectors unless the context is unclear. It is worth noting that some numerical methods utilize both left and right


eigenvectors. Often the pair (λ, v) is called an eigenpair, and the set of all eigenvalues is the spectrum of A, denoted by Λ(A). The eigenvalue associated with a given eigenvector is unique, but each eigenvalue has many eigenvectors associated with it, as any nonzero multiple of v is also an eigenvector associated with λ. Spaces spanned by eigenvectors remain invariant under multiplication by A, as

span{Av} = span{λv} ⊆ span{v}.

This idea can be generalized to higher dimensions. A subspace V ⊆ C^n such that AV ⊆ V is called a right invariant subspace of A. Again, there is a similar characterization of a left invariant subspace, but we will now work exclusively with the former and refer to them simply as invariant subspaces. An eigenvalue decomposition is a factorization of the form

A = X Λ X^{−1} or AX = XΛ, (1.1)

where Λ is a diagonal matrix with eigenvalues λ_1, ..., λ_n on the diagonal and X is a nonsingular matrix with columns consisting of associated eigenvectors v_1, ..., v_n. A matrix is said to be nondefective if and only if it can be diagonalized, that is, it has an eigenvalue decomposition. Often nondefective matrices are called diagonalizable matrices. While a nice theoretical factorization, eigenvalue decompositions can be unstable calculations due to the conditioning of the eigenvector basis X. Fortunately there is a beautiful theoretical and computationally practical result, Schur's unitary triangularization theorem, which is a staple in linear algebra texts such as Horn and Johnson [35].

A central result to the computation of eigenvalues and eigenvectors is the Schur form of a matrix. For any matrix A ∈ C^{n×n}, there exists a unitary matrix Q such that

Q^H A Q = T or AQ = QT, (1.2)


where T is upper triangular. This factorization exists for any square matrix and the computation of this factorization is stable. More information on stability may be found in [32, 66] and we will discuss the stability of specific algorithms later, but for now it will suffice to hold stability as an attractive quality for algorithm design. The decomposition to Schur form is an eigenvalue revealing factorization, as A and T are unitarily similar and the eigenvalues of A lie on the diagonal of T. The order of the eigenvalues on the diagonal may be organized by the appropriate choice of Q. It is often worthwhile to reorder the Schur factorization, for example to improve stability. A Schur decomposition is not unique, as it depends on the order of the eigenvalues in T and does not account for eigenvalues with multiplicities greater than one. The associated eigenvectors, or invariant subspaces, may be computed from the Schur form. This is not an overly complicated computation, but there are some fundamental issues that can render computations unstable. Our primary focus is at the level of the Schur factorization and on algorithms designed to compute such a factorization. The matrix T may be complex when A is real. In this case, it may be advantageous to work with the real Schur form, in which the matrix Q is now orthogonal and the matrix T is block upper triangular where the diagonal blocks are of order one or two depending on whether the eigenvalues are real or complex, respectively. In the case that A is Hermitian, A^H = A, the matrix T is a real diagonal matrix, as the Schur form and the diagonalization of A are the same. As we will see, many algorithms compute a partial Schur form given by

A Q_k = Q_k T_k, (1.3)

where Q_k ∈ C^{n×k} with k < n.

1.3 Algorithms

In any approach, the computation of eigenvalues is necessarily an iterative process. This follows directly from the formulation of an eigenvalue problem as determining the roots of the characteristic polynomial, p_A(λ), and Abel's well known theorem that there is no general algebraic solution to polynomials of degree five and higher. An extensive discussion of classic and more recent algorithmic approaches can be found in [28, 55, 63, 66, 70, 71]. In this section, we survey the major practical computational approach for computing a full Schur decomposition and the relevant state-of-the-art software.

The most essential algorithm for the NEP is the QR algorithm, or QR iteration. First introduced by Francis [25, 26] and Kublanovskaya [44], the QR algorithm generates a sequence of orthogonally similar matrices that, under suitable assumptions, converges to a Schur form for a given matrix. The QR algorithm gets its name from the repeated computation of a QR factorization during the iteration. In this case, given an n × n matrix A, a QR factorization is a decomposition of the form A = QR where Q is an n × n unitary matrix and R is an n × n upper triangular matrix. The most basic form of the QR algorithm computes a QR factorization, reverses the order of the factors, and repeats until convergence. A more practical approach incorporates a preprocessing phase and shifts using eigenvalue estimates.

The first phase of a practical version of the QR algorithm requires the reduction of nonsymmetric A to Hessenberg form H. A matrix H is in upper Hessenberg form if h_{ij} = 0 whenever i > j + 1. Every matrix can be reduced to Hessenberg form by unitary similarity transformations. This reduction can be computed by using Householder reflectors as in the Arnoldi procedure further discussed in Chapter 2. An upper Hessenberg matrix H is in proper upper Hessenberg form, often called


irreducible, if h_{j+1,j} ≠ 0 for j = 1, ..., n − 1. If a matrix is not in proper upper Hessenberg form, it may be divided into two independent subproblems in which the matrices are proper upper Hessenberg. We will assume proper upper Hessenberg form and simply use the term Hessenberg from now on. Reduction to Hessenberg form is a cost saving measure aimed at reducing the number of flops in the iterative second phase, as working with the unreduced matrix A is prohibitively expensive. Currently, reduction to Hessenberg form using block Householder reflectors is handled by calling LAPACK's xGEHRD or ScaLAPACK's PxGEHRD. We will discuss performance issues of this computation on multicore platforms in Chapter 3.

The second phase of a practical implementation works exclusively with the Hessenberg form. The modern implicit QR iteration takes on a much more complicated structure than its original formulation. We outline the main computational pieces of one step of the iteration.

Beginning with the Hessenberg matrix A, a select number of shifts, or eigenvalue estimates, are generated. Using these k shifts, σ_1, ..., σ_k, a QR factorization of the polynomial p(A) = (A − σ_1 I)···(A − σ_k I) is desired, as this spectral transformation will speed up convergence. Explicit formulation of p(A) is cost prohibitive, but computing p(A)e_1, where e_1 is the canonical basis vector, is relatively inexpensive. Next a unitary Q with first column q_1 = α p(A)e_1, where α ∈ C, is constructed. Performing the similarity transformation

A ← Q^H A Q

disturbs the Hessenberg form as in Figure 1.2. In the next step of the iteration, the bulge is chased from the top left of the matrix to the bottom right corner, returning the matrix to Hessenberg form as in Figure 1.3. The bulge is eliminated by applying similarity transformations, such as Householder reflectors, that introduce zeros in desired locations, moving the bulge. This process of disturbing the Hessenberg form with eigenvalue estimates and chasing the bulge is continued until the desired Schur


form is computed. There have been some major improvements to this process that form the foundation for state-of-the-art implementations of the implicit QR algorithm.

Figure 1.2: The Hessenberg structure disturbed

Figure 1.3: Chasing the bulge

After the introduction of the implicit shift approach in 1961, several implementations used single or double shifts creating multiple 1 × 1 or 2 × 2 bulges. These bulges were chased one column at a time using mainly Level 1 BLAS operations. In 1987, a multishift version of the QR algorithm was introduced by Bai and Demmel [8]. Here k simultaneous shifts were used, creating a k × k bulge that was then chased p columns at a time. This was a significant step in the evolution of the QR algorithm, as the restructured algorithm could be cast in more efficient Level 2 and Level 3 BLAS operations. Additionally, the k shifts were chosen to be the eigenvalues of the bottom


right k × k principal submatrix, extending what had long been standard choices for shifts.

Though the multishift QR algorithm was able to be structured in efficient BLAS operations, the performance for a large number of shifts was lacking. Accurate shifts accelerate the convergence of the QR algorithm, but when a large number of shifts were used, rounding errors developed, degrading the shifts and slowing convergence. In 1992, Watkins [69] detailed this phenomenon of shift blurring. His analysis suggested that a small number of shifts be used, maintaining a very small order bulge to avoid ill-conditioning and shift blurring. This proved to be an important idea that motivated a solution to using a large number of shifts while maintaining "well focused" shifts.

The seminal work by Braman, Byers and Mathias [15, 16] on multishift QR with aggressive early deflation (AED) introduced two new components that contributed greatly to the current success of the QR algorithm. Rather than k shifts and a single bulge as depicted in Figure 1.2, a chain of several tightly coupled bulges, each consisting of two shifts, is chased in the course of one iteration. This idea facilitated the use of Level 3 BLAS operations for most of the computational work and allowed the use of a larger number of shifts without the undesirable numerical degradation of shift blurring. Additionally, the use of AED located converged eigenvalues much faster than earlier deflation strategies, which had changed little since the introduction of implicitly shifted QR. Together, these improvements reduced the number of iterations required by the QR algorithm, greatly increasing overall performance. The LAPACK routine xHSEQR is the state-of-the-art serial implementation that computes the Schur form beginning with a Hessenberg matrix. The entire process, reduction to Hessenberg form and then Schur form, is performed by xGEES.

Another major improvement concerns the nontrivial task of parallelizing the QR algorithm. Parallel versions of the QR algorithm for distributed memory had been previously proposed, by Stewart [62], and work had been done on parallelizing the


multishift QR algorithm. The issue of shift blurring forced most efforts to focus on bulges of order 2, and scalability was still an issue. Issues pertaining to scalability of the standard double implicit shift QR algorithm were explored by Henry and van de Geijn [31]. In 1997, an approach using k shifts to form and chase k/2 bulges in parallel was presented by Henry, Watkins, and Dongarra [30] and subsequently added to ScaLAPACK. A novel approach based off of the multishift QR with AED was formulated by Granat, Kågström, and Kressner [29]. Here multi-window bulge chain chasing was formulated along with a parallelized version of AED. The software presented outperformed the existing ScaLAPACK implementation PxLAHQR.

Improving the QR algorithm continues to be a topic of interest. Recent work by Braman [14] investigated adjusting the AED strategy to find deflations in the middle of the matrix. Such a strategy could lead to a new divide and conquer formulation of the QR algorithm. Even more recent work by Karlsson and Kressner [39] examined optimal packing strategies for the chain of bulges in an effort to make effective use of Level 3 BLAS operations. A modified LAPACK implementation was presented and numerical results demonstrated the success of the approach. Optimally packed chains of bulges should aid the performance of parallel implementations of the QR algorithm as well.

Though the QR algorithm is the method of choice when computing a full Schur decomposition, there are some limitations to this approach in terms of parallelism and data movement. The current implementation of xGEES does not scale well. An important aspect of performance analysis is the study of how algorithm performance varies with problem size, the number of processors, and related parameters. Of particular importance is the scalability of an algorithm, or how effectively it can use an increased number of processors. One approach at quantifying scalability is to determine how execution time varies with the number of available processors. To assess ZGEES, we recorded the execution time for computing the Schur factorization of an


8,000 × 8,000 matrix for up to 8 cores. The results, along with a curve depicting perfect scalability, can be seen in Figure 1.4(a). We also illustrate the measured speed up and perfect speed up as the number of cores is increased from 1 to 8 in Figure 1.4(b).

Figure 1.4: Scalability of ZGEES — (a) Timing, (b) Speed Up

As depicted in Figure 1.4, ZGEES does not scale well in the context of multicore. In addition to its performance, for the NEP, the only algorithm available in LAPACK for computing the Schur form is xGEES. Currently partial Schur forms for nonsymmetric matrices are unattainable in LAPACK.

In Table 1.1 we list the currently available computational approaches for the NEP available on various platforms. The methods listed in Table 1.1 are abbreviated as follows: Arnoldi based approaches are denoted by A, implicitly restarted Arnoldi by IRAM, Krylov-Schur based approaches by KS, and Jacobi-Davidson methods by JD. For packages based on Arnoldi, method A, we include all formulations of Arnoldi such as the use of Chebyshev acceleration, preconditioning with Chebyshev polynomials, explicit restarts, and deflation, but not implicit restarts. Of the methods listed in Table 1.1, the only block Krylov-Schur implementation is part of the Anasazi package, which is part of the Trilinos library. This implementation


Table 1.1: Available Iterative Methods for the NEP

Software   Routine           Method  Blocked  Language    Architecture
ARPACK     Various           IRAM    No       Fortran     Shared, Dist
SLEPc      EPSARNOLDI        A       No       Fortran, C  Shared, Dist
SLEPc      EPSKRYLOVSCHUR    KS      No       Fortran, C  Shared, Dist
SLEPc      EPSJD             JD      No       Fortran, C  Shared, Dist
Anasazi    BlockArnoldi      A       Yes      Fortran, C  Shared, Dist
Anasazi    BlockKrylovSchur  KS      Yes      Fortran, C  Shared, Dist
HSL 2013   EB13              A       Both     Fortran, C  Shared, Dist

uses two orthogonalization schemes, one proposed by Daniel, Gragg, Kaufman and Stewart (DGKS) [20] and a more recent offering by Stathopoulos and Wu [60], with the latter as the default setting.

The goal of this work is to attack the NEP in the context of HPC from a different approach. Our work concerns the computation of a partial Schur form via standard iterative techniques. To this end, in Chapter 2 we examine tile algorithms and the implementation of block Arnoldi expansion in the multicore context of PLASMA. The process constructs an orthonormal basis for a block Krylov space, and we extend existing algorithms with our tiled version. Pseudocodes and implementation details are provided along with performance results.

In Chapter 3, we examine various algorithms in the context of computing a partial Schur factorization for nonsymmetric matrices. We examine several iterative approaches and present novel implementations of specific methods. The methods studied include a block version of explicitly restarted Arnoldi with deflation, a block extension of Stewart's Krylov-Schur method, and a block version of Jacobi-Davidson.
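The scalability assessment of ZGEES above (Figure 1.4) reduces to two standard quantities, speedup and parallel efficiency. A pure-Python sketch with made-up timings (illustrative only, not the measured ZGEES data):

```python
# Hypothetical wall-clock times in seconds for p = 1, 2, 4, 8 cores.
t = {1: 100.0, 2: 55.0, 4: 32.0, 8: 24.0}

# Speedup S_p = T_1 / T_p; perfect scalability would give S_p = p.
speedup = {p: t[1] / tp for p, tp in t.items()}

# Parallel efficiency E_p = S_p / p; values well below 1 on many cores
# signal the kind of poor multicore scaling observed for ZGEES.
efficiency = {p: s / p for p, s in speedup.items()}
```

With these illustrative numbers, 8 cores yield a speedup of only about 4.2, i.e., roughly 52% efficiency, which is the shape of curve Figure 1.4 depicts against the perfect-scalability line.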


We compare our implementations to existing unblocked and blocked codes when available. All work is done in Matlab and extensive numerical results are presented. This work motivates our algorithmic design choices in Chapter 4.

Finally, in Chapter 4 we present a detailed implementation of our block Krylov-Schur method using Householder reflectors. Our approach features a block algorithm, the ability to compute a full Schur decomposition, and the novel use of Householder reflectors consistent with the work in Chapter 3.

In this thesis we will use two distinct ways to parallelize our codes. In Chapter 2, we use the task-based scheduler QUARK from the University of Tennessee to parallelize our tile Arnoldi expansion. The tile Arnoldi expansion is written in terms of sequential kernels; then dependencies between tasks are declared by labelling the variables as INPUT, INOUT, or OUTPUT; finally, the QUARK scheduler unravels our code, figures out the task dependencies, and exploits any parallelism present in our application. In Chapter 4, we obtain parallelism by calling multithreaded BLAS. This is a similar parallelism model to the one in the LAPACK library. In both cases, we relied on a third party to perform the parallelization per se. Both mechanisms are fairly high level and are indeed easy to use.

1.4 Contributions

Here we outline the specific contributions of this manuscript and associated work. In Chapter 2, we present our novel tiled implementation of block Arnoldi with Householder reflectors. The Arnoldi computation is an important component of both algorithms designed to solve linear systems and those used to compute eigenvalues. A great deal of time is spent in the Arnoldi component when working in either computational context, and we present a marked improvement in performance with our tile approach. We managed to merge the orthogonalization with the matrix-vector product. This has the potential to increase the performance of methods using block Arnoldi factorizations such as block GMRES and Morgan's GMRES-DR [51].


Additionally, any eigenvalue solver using block Arnoldi stands to benefit from this improvement.

The novel formulation of block Krylov-Schur with Householder in Chapter 3 also improves upon the state of the art. We present a new algorithm based on Householder reflectors rather than other orthogonalization schemes. Our robust formulation performs very well in the sparse case when computing partial Schur decompositions. We present numerical experiments in Matlab in Chapter 3 that suggest our approach is worth implementing in a compilable programming language.

In Chapter 4 we implement our block Krylov-Schur approach in a C code using LAPACK subroutines. The code is robust, supports any block size, and can compute any number of desired eigenvalues. This code is the first step towards an optimized version that could be released to the scientific computing community.
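The dependency analysis that a runtime like QUARK derives from the INPUT/INOUT/OUTPUT labels can be mimicked in a few lines. The task names and variables below are hypothetical placeholders, not the actual PLASMA kernel signatures:

```python
# Each task: (name, variables read, variables written); INOUT appears in both.
tasks = [
    ("geqrt",  {"A11"},             {"V1", "T1"}),   # factor a panel
    ("gemqrt", {"V1", "T1", "A12"}, {"A12"}),        # apply its reflectors
    ("gemm",   {"A12", "Q1"},       {"U"}),          # consume the updated block
]

def dependencies(tasks):
    # Task i must wait for an earlier task j whenever i touches data that j
    # writes (read-after-write or write-after-write) -- the core of what a
    # task-based scheduler infers automatically from the data labels.
    deps = {name: set() for name, _, _ in tasks}
    for i, (ni, ri, wi) in enumerate(tasks):
        for nj, _, wj in tasks[:i]:
            if (ri | wi) & wj:
                deps[ni].add(nj)
    return deps

deps = dependencies(tasks)
```

Here the scheduler would discover the chain geqrt → gemqrt → gemm, while any tasks with disjoint data (e.g., factorizations of independent tile columns) carry no edge and may run concurrently.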


2. Tiled Krylov Expansion on Multicore Architectures

In this chapter, we present joint work with Henricus Bouwmeester.

Many algorithms that aim to compute eigenvalues and invariant subspaces rely on Krylov sequences and Krylov subspaces. Additionally, many algorithms computing solutions to linear systems do as well. Here we consider the computation of an orthonormal basis for a Krylov subspace in the context of HPC. We are interested in an algorithm rich in BLAS Level 3 operations that achieves a high level of parallelism. We review some basic theory of Krylov subspaces and associated algorithms to motivate our current work. A wealth of information on Krylov subspaces and their connection to linear systems and eigenvalue problems may be found in [43, 55, 70].

If A ∈ C^{n×n} is a matrix and v ∈ C^n is a nonzero vector, then the sequence

v, Av, A^2 v, A^3 v, ...

is called a Krylov sequence. The subspace

K_m(A, v) = span{v, Av, A^2 v, ..., A^{m−1} v}

is called the m-th Krylov subspace associated with A and v, and the matrix

K_m(A, v) = [v, Av, A^2 v, ..., A^{m−1} v]

is called the m-th Krylov matrix associated with A and v. Our current computational focus is on constructing a basis for the subspace K_m(A, v), as this Krylov subspace will play a pivotal role in many linear algebra computations, especially certain numerical methods for eigenvalue problems. Computing an explicit Krylov basis of the form K_m(A, v) is not advisable. As m increases, under mild assumptions on the starting vector, the vectors A^{m−1} v converge to the eigenvector associated with the largest eigenvalue in magnitude of A, provided the eigenvalue is simple. As m gets larger, the basis [v, Av, A^2 v, ..., A^{m−1} v] becomes extremely ill-conditioned and, consequently, much of the information in this basis is corrupted by roundoff errors, as discussed by Kressner [43]. An elegant algorithm due to W. E. Arnoldi is one way to compute a


basis for the Krylov space with better conditioning. The algorithm has a few variants which we explore in the next section.

2.1 The Arnoldi Method

In 1951, Walter E. Arnoldi [4], expanding Lanczos's work on eigenvalues of symmetric matrices, introduced an algorithm that reduced a general matrix to upper Hessenberg form. Let v_1, v_2, ..., v_m be the result of sequentially orthonormalizing the Krylov sequence v, Av, ..., A^{m−1} v, and let V_m = [v_1, v_2, ..., v_m]. In matrix terms, Arnoldi's procedure generates a decomposition of the form

A V_m = V_m H_m + β_m v_{m+1} e_m^T, (2.1)

where H_m ∈ C^{m×m} is upper Hessenberg, β_m is a scalar, V_m has orthonormal columns, and e_m ∈ C^m is the vector with one in the m-th position and zeros everywhere else. This factorization is called an Arnoldi decomposition of order m, or simply an Arnoldi decomposition. We can represent this in matrix terms by

A V_m = V_{m+1} H̃_{m+1}, (2.2)

where

H̃_{m+1} = [ H_m ; β_m e_m^T ].

Arnoldi suggested that the matrix H_m may contain accurate approximations to the eigenvalues of A. We will revisit this idea in Chapter 3. The vectors v_1, v_2, ..., v_m in the Arnoldi decomposition form an orthonormal basis of the subspace in question. They are orthonormal by construction, and a straightforward inductive argument shows that

K_m(A, v) = span{v_1, ..., v_m}.

In exact arithmetic, there are several algorithmic variants of Arnoldi's method to construct this orthonormal basis. We review some well established results on a few variants in finite precision arithmetic to motivate our current work. One variant is to


use the Gram-Schmidt process to sequentially orthogonalize each new vector against the previously orthogonalized vectors. The Arnoldi method using Classical Gram-Schmidt (CGS) to compute an orthonormal basis of K_m(A, v), given A, v, and m, is given in Algorithm 2.1.1. The CGS variant is unstable.

Algorithm 2.1.1: Arnoldi-CGS
  Input: A ∈ C^{n×n}, v ∈ C^n, and m
  Result: Construction of V_{m+1} and H̃_{m+1}
  1  v_1 = v/||v||_2;
  2  for j = 1:m do
  3      h_j = V_j^H A v_j;
  4      v = A v_j − V_j h_j;
  5      h_{j+1,j} = ||v||_2;
  6      if h_{j+1,j} = 0 then
  7          stop;
  8      v_{j+1} = v/h_{j+1,j};  V_{j+1} = [V_j, v_{j+1}];
  9      H̃_j = [ H̃_{j−1}, h_j ; 0, h_{j+1,j} ];

A mathematically equivalent but numerically more attractive version of CGS is the subtle rearrangement called Modified Gram-Schmidt (MGS). Either approach requires 2mn^2 flops for the matrix-vector multiplications (we assume the matrix A to be dense) and 2m^2 n flops for the vector operations due to the Gram-Schmidt process. Though MGS is numerically more attractive than the CGS variant, it still inherits the numerical instabilities of the Gram-Schmidt process. The orthogonality of the columns of V_m can be severely affected by roundoff errors. To remedy this, there are a few computationally attractive alternatives. The vector v_{j+1} can be reorthogonalized against the columns of V_j whenever one suspects loss of orthogonality may have occurred. This improves


stability, but the process has its limitations and this adds flops. Extensive details on the Gram-Schmidt process may be found in [45]. Another option is to approach the process of orthogonalization in an entirely different way.

The final variant under consideration changes the orthogonalization scheme and uses one of the most reliable orthogonalization procedures, one based on Householder transformations. As Trefethen and Bau explain it, "while the Gram-Schmidt process can be viewed as a sequence of elementary triangular matrices applied to generate a matrix with orthonormal columns, the Householder formulation can be viewed as a sequence of elementary unitary matrices whose product forms a triangular matrix." The Householder variant has very appealing properties; specifically, the use of orthogonal transformations guarantees numerical stability. This does come with an increase in the number of flops. The use of Householder in the context of Arnoldi is backward stable but requires 4m^2 n − (4/3)m^3 flops (this does not count the 2mn^2 flops for the matrix-vector multiplications). The Arnoldi procedure with Householder makes use of reflectors of the form

P = I − (2/(x^H x)) x x^H,

which introduce zeros in desired locations. Walker [67] initially formulated the method in the context of solving large nonsymmetric linear systems.

As we desire an approach rich in BLAS Level 3 operations, we turn our attention to block methods that use blocks of vectors rather than single vectors. We note that there are other benefits in using a block approach in the context of an iterative eigensolver. The iterative eigensolver converges faster and is more robust in the presence of clustered eigenvalues. This will be examined in Chapter 3. The extension of the Arnoldi method to a block algorithm is straightforward. Rather than using an initial starting vector, a block of vectors, V ∈ C^{n×b}, is used, and the Arnoldi procedure constructs an orthonormal basis for the block Krylov subspace

K_{mb}(A, V) = span{V, AV, A^2 V, ..., A^{m−1} V}.
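Before turning to blocks, the single-vector process is easy to make concrete. The following pure-Python sketch implements the MGS rearrangement of Algorithm 2.1.1 on a small hypothetical matrix and checks the first column of the Arnoldi relation A V_m = V_{m+1} H̃_{m+1} (breakdown, h_{j+1,j} = 0, is not handled here):

```python
import math

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def arnoldi_mgs(A, v, m):
    # Arnoldi with Modified Gram-Schmidt: orthogonalize A*v_j against the
    # previous basis vectors one at a time (the stabler rearrangement of CGS).
    nrm = math.sqrt(sum(x * x for x in v))
    V = [[x / nrm for x in v]]               # basis vectors v_1, v_2, ...
    H = [[0.0] * m for _ in range(m + 1)]    # (m+1) x m Hessenberg matrix
    for j in range(m):
        w = matvec(A, V[j])
        for i in range(j + 1):
            H[i][j] = sum(a * b for a, b in zip(V[i], w))
            w = [wi - H[i][j] * vi for wi, vi in zip(w, V[i])]
        H[j + 1][j] = math.sqrt(sum(x * x for x in w))
        V.append([x / H[j + 1][j] for x in w])
    return V, H

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]  # small test matrix
V, H = arnoldi_mgs(A, [1.0, 0.0, 0.0], 2)
# column 1 of the Arnoldi relation: A v_1 = h_11 v_1 + h_21 v_2
lhs = matvec(A, V[0])
rhs = [H[0][0] * V[0][i] + H[1][0] * V[1][i] for i in range(3)]
```

In serious use the loop body would be cast as the BLAS calls discussed in this chapter; the sketch only illustrates the recurrence the variants share.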


An Arnoldi decomposition now takes the form

A W_m = W_m H_m + V_{m+1} H_{m+1,m} E_m^H, (2.3)

where each V_i ∈ C^{n×b} and

W_m = [V_1, V_2, ..., V_m],  E_m = matrix of the last b columns of I_{mb}.

Here I_{mb} is the mb × mb identity matrix, W_m ∈ C^{n×mb} has orthonormal columns, and H_m ∈ C^{mb×mb} is band-Hessenberg, as there are b subdiagonals. For simplicity we write

A W_m = W_{m+1} H̃_{m+1}, (2.4)

where H̃_{m+1} ∈ C^{(m+1)b×mb} is the block version of our simplified notation given by

H̃_{m+1} = [ H_m ; H_{m+1,m} E_m^H ].

A block analog of Algorithm 2.1.1 follows immediately, but some variants are possible depending on concerns for parallelism as detailed in [55]. The block version of Arnoldi with Householder fits nicely in the context of BLAS Level 3 operations thanks to the compact WY representation presented in Schreiber and Van Loan [56]. In the compact WY form, a product of b Householder reflectors can be represented as

Q = I_n − Y T Y^T, (2.5)

where Y ∈ C^{n×b} is a lower trapezoidal matrix and T ∈ C^{b×b} is an upper triangular matrix. This enables the use of BLAS Level 3 operations. This performance, along with the aforementioned backward stability, is why we opt to use Householder orthogonalization in the Arnoldi method.
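The identity (2.5) can be verified directly for b = 2. The reflector data below are made-up numbers chosen so each P_i = I − τ_i y_i y_i^T is a genuine reflector (τ_i = 2/(y_i^T y_i)); the standard two-reflector recurrence gives the upper triangular T:

```python
def matmul(X, Z):
    return [[sum(X[i][k] * Z[k][j] for k in range(len(Z)))
             for j in range(len(Z[0]))] for i in range(len(X))]

def reflector(y, tau, n):
    # P = I - tau * y * y^T is orthogonal exactly when tau = 2 / (y^T y)
    return [[(1.0 if i == j else 0.0) - tau * y[i] * y[j]
             for j in range(n)] for i in range(n)]

n = 4
y1, tau1 = [1.0, 0.5, 0.5, 0.0], 2.0 / 1.5   # hypothetical reflector data
y2, tau2 = [0.0, 1.0, 1.0, 0.0], 2.0 / 2.0
P1P2 = matmul(reflector(y1, tau1, n), reflector(y2, tau2, n))

# Compact WY: P1*P2 = I - Y*T*Y^T with Y = [y1 y2] lower trapezoidal and,
# for b = 2, T = [[tau1, -tau1*(y1.y2)*tau2], [0, tau2]] upper triangular.
Y = [[y1[i], y2[i]] for i in range(n)]
T = [[tau1, -tau1 * sum(a * b for a, b in zip(y1, y2)) * tau2],
     [0.0, tau2]]
YT = matmul(Y, T)
Q = [[(1.0 if i == j else 0.0) - sum(YT[i][k] * Y[j][k] for k in range(2))
      for j in range(n)] for i in range(n)]
```

Applying I − Y T Y^T to a block of vectors costs three matrix-matrix products, which is precisely what lets the Householder variant run as Level 3 BLAS.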


2.2 Tiled Arnoldi with Householder

Block formulations of the Arnoldi method are not new. Morgan [51] developed one in the context of solving linear systems with his introduction of GMRES-DR, which uses Ruhe's variant of block Arnoldi. Explicit formulation of Ruhe's variant can be found in [55]. In the context of the NEP, Moller [49] and Baglama [7] offer different implementations for each of their approaches. We present a new implementation with the focus on performance in the context of multicore architectures that works on tiles. This also sets the foundation for our ensuing work on algorithms for the NEP.

The algorithmic formulation of Arnoldi with Householder follows directly from employing the compact WY representation. To simplify the presentation we adopt a Matlab style notation. It will be convenient to refer to locations within a matrix as follows: for an n × n matrix A, A(7:n, 1:b) denotes the submatrix consisting of the first b columns and last n − 6 rows. As we are working with blocks, it will simplify the notation to refer to the j-th block of rows or columns within a matrix. For example, for the j-th column block of W_m = [V_1, V_2, ..., V_m] we may write W_m(:, {j}) as

W_m(:, {j}) = W_m(1:n, (j−1)b+1 : jb).

Algorithm 2.2.1 outlines the essential steps to compute an orthonormal basis for K_{mb}(A, U), given A ∈ C^{n×n}, starting block U ∈ C^{n×b}, and m. Algorithm 2.2.1 is formulated to employ specific existing computational kernels in LAPACK 3.4.1, as denoted in parentheses next to each major computation. We will discuss the LAPACK subroutines used, as they form the basis for our new implementation. The central computational kernel in Algorithm 2.2.1 is the QR factorization. The blocked QR factorization with compact WY version of Q = I_n − Y T Y^T is accomplished by the LAPACK xGEQRT subroutine. The call to xGEQRT constructs a compact WY representation of b Householder reflections that introduce zeros in b consecutive columns.


Algorithm 2.2.1: Block Arnoldi-HH
  Input: A, U, and m
  Result: Construction of Q_{m+1} and H̃_{m+1} such that A Q_m = Q_{m+1} H̃_{m+1}
  1  U = AU (xGEMM);  Q = I_{n×mb} (first mb columns of I_n);
  2  for j = 0:m−1 do
  3      Compute the QR factorization of U(jb+1:n, :) (xGEQRT),
  4      generating T and storing the reflectors in V(:, {j+1});
  5      Accumulate reflectors to explicitly build next block of Q (xGEMQRT);
  6      for k = j:−1:1 do
  7          Q(:, {j+1}) = Q(:, {j+1}) − V(:, {k}) T(:, {k}) (V(:, {k}))^H Q(:, {j+1});
  8      if j < m − 1 then
  9          U = A Q(:, {j+1}) (xGEMM);

Figure 2.1: Compact storage of the Arnoldi decomposition (V_{m+1} and H̃_{m+1})

Multiplying by Q or Q^H is handled by xGEMQRT, which uses the reflectors stored in the lower part of V_{m+1} and the triangular factors in T. Multiplication of the new block of vectors by matrix A requires the BLAS xGEMM subroutine.

LAPACK subroutines are designed to use blocks of columns, or blocks of rows, to cast the operations as matrix multiplications rather than vector operations. This facilitates Level 3 BLAS operations, but there are issues that limit performance. Of note are the synchronizations performed at each step and the lack of fine grain tasks for increased parallelism. Multithreaded BLAS can be utilized, but this is often not enough to ensure that these algorithms perform optimally in the context of multicore.

To take full advantage of emerging computer architectures, we must reformulate our block algorithm. Multicore architectures require algorithms that can take full advantage of their structure. To this end, algorithms have been moving towards the class of so-called tile algorithms [19, 18, 53] and are available as part of efforts like the PLASMA library. The data layout in the context of PLASMA requires a matrix to be reordered into smaller regions of memory, called tiles. A matrix A ∈ C^{n×n}


can be subdivided into tiles ranging in size from 1 × 1 tiles to n × n tiles, but once set, the tile size is fixed for the duration of the algorithm. Finding the tile size that achieves optimal performance requires a bit of tuning, so for now, we will assume A is decomposed into n_t tiles of size n_b × n_b so that n = n_t n_b. We will add one more notational convenience for our algorithms and let A_{i,j} denote the n_b × n_b tile in the i-th row and j-th column of the tiling of A. It is important to note that currently only square tiles are permitted in PLASMA, and our application will require us to make accommodations for various tile sizes. The decomposition of a matrix into tiles and the subdivision of computational tasks adds a new dynamic to our problem in that these tasks must be organized. As detailed by Bouwmeester [13], working with tiles forces us to consider several specific features. Tiles can be in three states, namely zero, triangle, or full, and introducing zeros in a matrix can be accomplished by three different tasks. Tile algorithms allow the separation of factorization stages and corresponding update stages, whereas these are considered a single stage in coarse-grain algorithms, such as those in LAPACK.

Organizing the factorization tasks and the decoupled associated updates in different ways leads to different algorithmic formulations and possibly different performance. Computations are often expressed through a task graph, often called a Directed Acyclic Graph (DAG), where nodes are elementary tasks that operate on tiles and edges represent the dependencies. Different algorithmic formulations may result in different DAGs. To compare various algorithmic formulations, we look at the respective DAGs and compute the critical path. The critical path is the longest necessary path from start to finish and represents a sequence of operations that must be carried out sequentially. Analyzing DAGs and critical paths allows for the selection of optimal parallelization strategies. Much work is currently being devoted to developing scheduling strategies that improve performance. After introducing the essential computational kernels of our algorithm, we will revisit the question of performance.
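The tile indexing convention A_{i,j} amounts to a pair of row and column ranges. A quick sketch of extracting tiles from a matrix stored as a plain list of rows, assuming n = n_t * n_b exactly (0-based indices here, unlike the 1-based thesis notation):

```python
def tile(A, i, j, nb):
    # A_{i,j}: the nb x nb block in tile-row i, tile-column j
    return [row[j * nb:(j + 1) * nb] for row in A[i * nb:(i + 1) * nb]]

n, nb = 4, 2                       # so n_t = n / n_b = 2 tiles per dimension
A = [[float(4 * r + c) for c in range(n)] for r in range(n)]
A00 = tile(A, 0, 0, nb)            # top-left tile
A11 = tile(A, 1, 1, nb)            # bottom-right tile
```

In PLASMA each such tile is additionally stored contiguously in memory, which is the point of the layout: a task operating on one tile touches one compact region.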


Extending our algorithm from the LAPACK framework to that of PLASMA requires replacing each major kernel with a tiled analog. The block QR factorization with compact WY form performed by xGEQRT (line 3 of Algorithm 2.2.1) will be replaced with a tiled QR factorization. Tiled QR factorizations for multicore were introduced in [19, 18, 53] and recently improved and analyzed by Bouwmeester [13]. Line 3 of Algorithm 2.2.1 requires the QR factorization of a tall and skinny matrix, one that has many more rows than columns. As we will see later in Chapter 4, the number of columns b in a block Krylov subspace method will be set by practical concerns and, in general, it will be "small", much smaller than the recommended tile size n_b.

The basic structure of our approach, Algorithm 2.2.2, remains the same, but the details of each step require a bit of discussion. From A ∈ C^{n×n} and a starting block U ∈ C^{n×b}, we compute a block Arnoldi decomposition of size m. Our approach begins with the n × mb identity matrix in Q. Next we compute the product AU, as depicted in line 5 of Algorithm 2.2.2. Here we explicitly listed the double loop to give the flavor of tile operations. Continuing with a detailed tile perspective is not feasible for a readable pseudocode, so we expand on each step with a more detailed discussion.

As before, the primary computation in Algorithm 2.2.2 is the QR factorization. In the previous case, the LAPACK routine xGEQRT zeroed out b columns at a time and computed the corresponding block Householder reflector in compact WY form. In the tiled version, the structure of the QR factorization changes with the opportunities to increase parallelism, and this leads to new computational tasks among the tiles. For our application, we will consider the case where we wish to compute the QR factorization of a matrix V ∈ C^{n_t n_b × b}, that is, V is comprised of one column of n_t


Algorithm 2.2.2: Tiled block Arnoldi with HH
  Input: A, U, m, b, and n_t
  Result: Construction of Q_{m+1} and H̃_{m+1} such that A Q_m = Q_{m+1} H̃_{m+1}
  1  Q = I_{n,mb} (first mb columns of I_n); Q(:, {1}) = U;
  2  for k = 0 : m − 1 do
  3      for i = 0 : n_t − 1 do
  4          for j = 0 : n_t − 1 do
  5              V_{i,k} += A_{i,j} Q_{j,k−1}   (xGEMM);
  6      Update V with reflectors from previous factorizations (xUNMQR, xTTMQR);
  7      Compute the QR factorization of V(kb+1 : n, {k+1})   (xGEQRT, xTTQRT);
  8      Accumulate reflectors to explicitly build the next block of Q (xTTMQR and xUNMQR);

tiles each of size n_b × b, as in the following:

    V = [ V_{1,1}
          V_{2,1}
            ⋮
          V_{n_t,1} ],    with V_{i,1} ∈ C^{n_b×b}, i = 1, …, n_t.

Computing the QR factorization of a column of tiles allows for different algorithmic formulations based on the ordering of computations on each tile. Using existing PLASMA routines, we could first compute the QR factorization of V_{1,1} so that we


have

    V = [ T_{1,1}
          V_{2,1}
            ⋮
          V_{n_t,1} ],

where T_{1,1} is upper triangular with the Householder reflectors stored below the diagonal, as is standard in LAPACK. Then we could use T_{1,1} to systematically zero out the remaining tiles. Each of these elimination steps has the form of a triangle on top of a square and can be accomplished by PLASMA's xTSQRT routine. Updates would then require both xUNMQR and xTSMQR. The process just outlined can be described by using a flat tree beginning at the top. This approach is sequential and does not parallelize. Fortunately, the tiled environment allows for more choices.

For example, we could proceed by first computing the QR factorizations of each of the n_t tiles, V_{i,1} with i = 1, …, n_t, using n_t calls to PLASMA's xGEQRT. This results in a column of n_t upper triangular factors with respective Householder reflectors stored below the diagonal of each of the n_t tiles. Each individual tile now has the same structure as the output of LAPACK's xGEQRT. Next, we could proceed by using the triangular factors to eliminate the triangular factors directly below. This approach gives rise to a computational kernel, PLASMA's xTTQRT detailed in [13], that zeroes a triangle with a triangle on top. Repeating this step, we could proceed left to right as depicted in Figure 2.2, where the steps are organized using a binomial tree. The QR factorization depicted in Figure 2.2 requires different routines to update using the Householder reflectors. Each call to xGEQRT generates compact WY components that may be applied by calling PLASMA's xUNMQR. In the case of xTTQRT, the corresponding update is achieved by the PLASMA routine xTTMQR. We will explore some variants of Algorithm 2.2.2 shortly, but comprehensive information on various elimination lists and algorithmic formulations of tiled QR may be found in [13].
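The binomial-tree elimination order just described can be generated mechanically. Below is a small sketch (the function name is ours, not a PLASMA routine) that lists, round by round, which tile's triangular factor eliminates which other tile in a column of n_t tiles; all pairs within a round are independent xTTQRT tasks and can run in parallel.

```python
def binomial_rounds(nt):
    """Rounds of (eliminator, eliminated) tile pairs for a binomial tree
    over tiles 0..nt-1; pairs within one round have no mutual dependencies."""
    rounds, stride = [], 1
    while stride < nt:
        # at each round, survivors 2*stride apart absorb their neighbor at +stride
        rounds.append([(i, i + stride) for i in range(0, nt - stride, 2 * stride)])
        stride *= 2
    return rounds
```

For n_t = 5 this yields three rounds, (0,1) and (2,3), then (0,2), then (0,4): depth ⌈log₂ n_t⌉, the pattern shown in Figure 2.2, versus n_t − 1 strictly sequential eliminations for the flat tree.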


Figure 2.2: Tiled QR with n_t = 5 using xTTQRT

After computing the QR factorization of the column, we build the next column block in the matrix Q. To do so, we must apply the reflectors, that is, apply the updates, in reverse order, requiring the use of both xTTMQR and then xUNMQR. By reverse order here we mean that the updates must use the same elimination tree used in the forward sense. If we use a binomial tree going in the forward direction as in Figure 2.2, then we must traverse that same tree in the reverse direction. We then begin the loop again and multiply our newly computed column block of Q by A. This must then be updated with the reflectors from previous columns using xUNMQR and then xTTMQR. We are now ready to compute the QR factorization of the next column of tiles and continue on with the Arnoldi process.

What is not evident in Figure 2.2 is the case when the single block of columns does not fit nicely in the context of square tiles. As PLASMA requires square tiles, we had to modify several existing routines so that they could operate on sub-tiles. Routines that were modified include the QR factorization xGEQRT with associated update xUNMQR, and the zeroing out of a triangle with a triangle xTTQRT with


associated update xTTMQR. An example modification is presented in Figure 2.3 for the routine QUARK_CORE_ZGEQRT. The last two lines, not including the 0, "lock" the tile referenced by Ap and the corresponding Tp tile, since the references to A and T point within the tile and do not "lock" the entire tile. We needed to point to within the tile, but this does not indicate that a dependency is needed.

#include "doth.h"

void my_QUARK_CORE_zgeqrt(Quark *quark, Quark_Task_Flags *task_flags,
                          int m, int n, int ib, int nb,
                          PLASMA_Complex64_t *A, int lda,
                          PLASMA_Complex64_t *T, int ldt,
                          PLASMA_Complex64_t *Ap, PLASMA_Complex64_t *Tp)
{
    DAG_CORE_GEQRT;
    QUARK_Insert_Task(quark, CORE_zgeqrt_quark, task_flags,
        sizeof(int),                      &m,   VALUE,
        sizeof(int),                      &n,   VALUE,
        sizeof(int),                      &ib,  VALUE,
        sizeof(PLASMA_Complex64_t)*nb*nb, A,    INOUT,
        sizeof(int),                      &lda, VALUE,
        sizeof(PLASMA_Complex64_t)*ib*nb, T,    OUTPUT,
        sizeof(int),                      &ldt, VALUE,
        sizeof(PLASMA_Complex64_t)*nb,    NULL, SCRATCH,
        sizeof(PLASMA_Complex64_t)*ib*nb, NULL, SCRATCH,
        sizeof(PLASMA_Complex64_t)*nb*nb, Ap,   INOUT,
        sizeof(PLASMA_Complex64_t)*ib*nb, Tp,   OUTPUT,
        0);
}

Figure 2.3: Modification of QUARK_CORE_ZGEQRT for sub-tiles

2.3 Numerical Results

Here we compare different formulations of our tiled Arnoldi method by adjusting the underlying elimination list. We assume an unlimited number of processors in this analysis and investigate a few algorithmic variations. As the number of resources is unlimited, any task may be executed as soon as the dependencies are satisfied. Algorithm 2.2.2 requires a QR factorization of a single column of tiles but also uses


the updates to explicitly form the Q factor one block column at a time and update new columns using previously computed reflectors. By using different elimination trees for the QR decomposition, the tree used for the update changes as well. This in turn changes the DAG and possibly allows for more interleaving of the various steps, i.e., the matrix multiplication and factorization, and might also provide better pipelining of update and factorization trees.

Here we present some numerical experiments comparing our tiled code with two different trees (binomial and pipeline) to a reference implementation. We present results for a diagonal matrix, a tridiagonal matrix, and a dense matrix.

Figure 2.4: Performance comparison for a dense smaller matrix (tiled Arnoldi iteration, dense; comparison of reference, binomial, and pipeline trees; m = 2400, m_b = 200, b = 50, #iter = 6; GFLOP/s versus number of threads, with the Rooftop Bound).

The Rooftop Bound is an upper bound on performance on p processors. To calculate the Rooftop


Bound, we use

    Rooftop(p) = min( p, (total #flops) / (#flops on the critical path) ) × (performance of one processor).

In Figure 2.4, A is a dense, 2,400 × 2,400 matrix, the block size is b = 50, the initial vector V is 2,400 × 50, and we want to perform 20 Arnoldi iterations so that we obtain a band Hessenberg matrix of size 1,000 × 1,000 with bandwidth 50. We present a reference implementation, which is a monolithic implementation calling LAPACK and BLAS. For our tile implementation and this experiment, the tile size for A is m_b = 200, so that n_t = 12; this choice makes V 12 × 1 in terms of tiles, with 200 × 50 tiles. As explained, the matrix V is made of rectangular tiles, and PLASMA does not natively handle rectangular tiles, so we needed to hack rectangular matrix support into the PLASMA library. To obtain square tiles, we could have tiled A with 50 × 50 tiles; however, we judged that there were too many drawbacks in tiling A with a tile size equal to the size of the block in the Krylov method. (i) We expect the block size of block Arnoldi to vary given an iterative method, and we do not want to change the data layout of A over and over. (ii) The block size of block Arnoldi is determined so that the eigensolver behaves well, while the tile size is determined for performance. In general, we expect the block size of block Arnoldi to be smaller than the tile size. Given this setup (m = 2,400, b = 50, and #iter = 20), the total work for the Arnoldi factorization is approximately 4.35 Gflops, and the length of the weighted critical path for the binomial tree method is 0.39 Gflops. We use the `ig' multicore machine located in Tennessee for our experiments. The performance of one core is 1.757 Gflop/sec. In Figure 2.4, we plot the Rooftop Bound for the binomial tree. Note that the point at which the performance starts to level off is the same point at which the Rooftop Bound reaches its maximum in terms of number of processors. We acknowledge that there is still a large gap between the bound and the curves. We would like to have more descriptive upper bounds.
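With the numbers quoted above (total work 4.35 Gflops, weighted critical path 0.39 Gflops, 1.757 Gflop/s per core), the bound can be evaluated directly. This sketch reads the bound as the smaller of the perfect-speedup limit and the critical-path limit, the form consistent with the leveling-off behavior described for Figure 2.4; the function name and default arguments are ours.

```python
def rooftop(p, total_gflop=4.35, critical_path_gflop=0.39, core_gflops=1.757):
    """Rooftop upper bound in GFLOP/s on p cores: performance is limited both by
    the core count p and by the maximum speedup total work / critical path."""
    return min(p, total_gflop / critical_path_gflop) * core_gflops
```

The bound grows linearly until p ≈ 4.35/0.39 ≈ 11 cores and then saturates near 19.6 GFLOP/s; adding threads beyond that point cannot help, which is exactly where the measured curves level off.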


We could not produce a Rooftop Bound for the case where k = 20 and m = 9,600 in Figure 2.5, since we have no closed-form formula for the weighted critical path length and the DAG was too large for the computer to calculate it.

Figure 2.5: Performance comparison for a tiled dense larger matrix (tiled Arnoldi iteration, dense; comparison of reference, binomial, and pipeline trees; m = 9600, m_b = 200, b = 50, #iter = 20; GFLOP/s versus number of threads).

The results for the tridiagonal case can be seen in Figure 2.6 and for the diagonal case in Figure 2.7. In all four cases, we conclude that our Arnoldi implementation performs better than the reference implementations.

2.4 Conclusion and Future Work

We have proposed a new algorithm to compute a block Arnoldi expansion with Householder reflectors on multicore architectures. An experimental study has shown that our algorithm performs significantly better than our reference implementations. We would like to obtain a closed-form formula for the critical path of our new algorithm, and we would like to benchmark our code with sparse matrices.


Figure 2.6: Performance comparison for a tiled tridiagonal matrix (tiled Arnoldi iteration, tridiagonal; comparison of reference, binomial, and pipeline trees; m = 9600, m_b = 200, b = 50, #iter = 20; GFLOP/s versus number of threads).

Figure 2.7: Performance comparison for a tiled diagonal matrix (tiled Arnoldi iteration, diagonal; comparison of reference, binomial, and pipeline trees; m = 9600, m_b = 200, b = 50, #iter = 20; GFLOP/s versus number of threads).


3. Alternatives to the QR Algorithm for NEP

In Chapter 2, we studied an efficient tiled Arnoldi expansion for multicore systems. In this chapter, we turn our focus to the computation of eigenvalues and associated invariant subspaces. This chapter motivates our work in Chapter 4, where we focus solely on the block Krylov-Schur method. We are specifically interested in computing a partial Schur decomposition as in Equation 1.3 using an "iterative method" algorithm. Here we detail our block extensions of various approaches and undertake an experimental numerical study for various algorithms in the context of computing any number of eigenvalues of a nonsymmetric matrix. Our implementations of several approaches are compared to existing unblocked implementations and blocked codes when available. All algorithms are implemented in Matlab. When applicable, we survey current state-of-the-art implementations and related software libraries.

Here we focus strictly on methods aiming to compute eigenvalues of an n × n matrix A which work by accessing the operator A only through matrix-vector products. In particular, none of the algorithms considered in this chapter reduces the matrix to Hessenberg form. There are several reasons for this design choice. Though the reduction to Hessenberg form is the first phase of practical implementations of the QR algorithm, it is a costly endeavor, in particular in terms of communication and parallelism. Using Householder reflectors, proceeding a column at a time, this computation requires approximately (10/3)n³ flops and is based mainly on Level 2 BLAS operations.

The accumulation of Householder reflectors into compact WY form [56] can be used to improve the situation by incorporating Level 3 BLAS operations when possible. This was described by Dongarra, Hammarling, and Sorensen [23], but performance issues still remain in the context of multicore. Recently, Quintana-Ortí and van de Geijn [54] cast more of the required computations in efficient matrix-matrix operations, achieving significant performance improvements. Yet, 20% of the flops


remain in Level 2 BLAS operations. Howell and Diaa [36] presented an algorithm, BHESS, using Gaussian similarity transformations to reduce a general real square matrix to a small-band Hessenberg form. Eigenvalues can then be computed using the bulge-chasing BR iteration or the Rayleigh quotient iteration. The overall cost of the BHESS-BR method was reported to be typically (8/3)n³ flops, compared to the standard QR iteration, which requires (10/3)n³ for the reduction and 10 to 16 n² flops for the ensuing iteration on the Hessenberg factor. The BHESS-BR approach is appropriate for computing nonextremal eigenvalues of mildly nonnormal matrices [36].

A two-staged approach, described by Ltaief, Kurzak, and Dongarra [48], showed that an initial reduction to block Hessenberg form, also called band Hessenberg as there is a block or band of subdiagonals rather than one subdiagonal, is efficiently handled by a tile algorithm variant. Using the tiled approach, their algorithm with Householder reflectors achieves 72% of the DGEMM peak (in Gflop/s) on a 12,000 × 12,000 matrix with 16 Intel Tigerton 2.4 GHz cores, and most of the operations are in Level 3 BLAS. The second phase of the proposed method [48] reduces the block Hessenberg matrix to Hessenberg form using a bulge-chasing approach similar to some extent to what is done in the QR algorithm. The algorithm used in the second phase does not achieve any comparable performance, mainly due to the inefficiency of the parallel bulge-chasing procedure on multicore architectures. The bulge chasing in the second phase may benefit from the optimal packing strategy [39], but we do not investigate this further. Because of these challenges, we turn our focus to iterative methods (Arnoldi, Jacobi-Davidson) which avoid complete reduction to Hessenberg form. As we will see though, the implicit QR algorithm will remain an essential piece of any approach to the NEP.

We note that, if we were to run the block Arnoldi process presented in Chapter 2 to completion (n steps), we would perform a block Hessenberg reduction as in [39]. However, the block Hessenberg matrix we obtain is associated with a Krylov space


expansion, and so early submatrices of this matrix should contain relevant information on invariant subspaces of the initial matrix.

3.1 Iterative Methods

3.1.1 Arnoldi for the nonsymmetric eigenvalue problem and IRAM

The Arnoldi procedure discussed in Chapter 2 not only produces an orthonormal basis for the subspace K_m(A, v), but it also generates information about the eigenvalues of A. Though originally developed as a method to reduce a general matrix to upper Hessenberg form, the Arnoldi method may be viewed as the computation of projections onto successive Krylov subspaces. In exact arithmetic, Algorithm 2.1.1 will terminate on line 6 if h_{j+1,j} = 0, as the columns of V_j span an invariant subspace of A. In this case, the eigenvalues of H_j are exact eigenvalues of the matrix A. If it does not terminate early, the algorithm constructs an Arnoldi decomposition of order m given by

    A V_m = V_m H_m + β_m v_{m+1} e_m^T.    (3.1)

Except for a rank-one perturbation, we have an approximate invariant subspace relationship, that is, A V_m ≈ V_m H_m. From Equation 3.1 and the orthogonality of the columns of V_{m+1}, we have that

    V_m^H A V_m = H_m,

and the approximate eigenvalues provided by projection onto the Krylov subspace K_m(A, v) are simply the eigenvalues of H_m. These are often called Ritz values, as this projection can be seen as a part of the Rayleigh-Ritz process. Ritz vectors, or approximate eigenvectors of A, are simply the associated eigenvectors of H_m premultiplied by V_m. To find the eigenvalues and eigenvectors of H_m, which is already in upper Hessenberg form, we simply apply a practical version of the implicitly shifted QR algorithm.
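The Arnoldi relation (3.1) is easy to exercise on a toy problem. The sketch below is plain Python with modified Gram-Schmidt and block size one, not our tiled Householder implementation; it builds the basis V_{m+1} and the (m+1) × m Hessenberg matrix so that A V_m = V_{m+1} H̃_m can be verified column by column.

```python
def arnoldi(A, v, m):
    """m-step Arnoldi with modified Gram-Schmidt on a dense matrix A (list of
    rows): returns basis vectors V (list) and an (m+1) x m Hessenberg matrix H
    satisfying A v_j = sum_i H[i][j] v_i, i.e. A V_m = V_m H_m + h_{m+1,m} v_{m+1} e_m^T."""
    n = len(A)
    norm = lambda x: sum(xi * xi for xi in x) ** 0.5
    V = [[vi / norm(v) for vi in v]]
    H = [[0.0] * m for _ in range(m + 1)]
    for j in range(m):
        w = [sum(A[i][k] * V[j][k] for k in range(n)) for i in range(n)]  # w = A v_j
        for i in range(j + 1):  # orthogonalize against the current basis
            H[i][j] = sum(w[k] * V[i][k] for k in range(n))
            w = [w[k] - H[i][j] * V[i][k] for k in range(n)]
        H[j + 1][j] = norm(w)
        if H[j + 1][j] == 0.0:  # breakdown: V_j spans an invariant subspace
            break
        V.append([wk / H[j + 1][j] for wk in w])
    return V, H
```

In exact arithmetic each column of the relation holds exactly; in floating point the columnwise residuals stay at roundoff level, and the eigenvalues of the leading m × m block of H are the Ritz values.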


In practice, a suitable order m for desired convergence is not known a priori, and it may not be desirable to store the n × (m+1) matrix V_{m+1}, as m must usually be rather large for an acceptable approximation to be computed. To address this, several restarting strategies have been suggested for how to select a new starting vector v_1. Explicit restarting strategies compute an approximate eigenvector associated with an eigenvalue of interest, say the one with largest real part. If the approximation is satisfactory, the iteration stops, and if not, the approximate eigenvector is then used as the starting vector for a new mth-order Arnoldi factorization. A similar strategy may be used when multiple eigenpairs are desired. One may restart with a linear combination of approximate eigenvectors, or one may use a deflation strategy. Morgan's analysis [50] suggested that restarting with a linear combination is ill-advised unless care is taken to avoid losing accuracy when forming the linear combination. An approach to combine Ritz vectors that prevents loss of accuracy is that of Sorensen, which we will discuss in detail momentarily. As there is no easy way to determine an appropriate linear combination, we opt for a strategy based on deflating eigenpairs. We will expand on this idea when we formulate our block variant of Arnoldi for the NEP.

One of the more successful approaches based on Krylov subspaces is that of Sorensen, the implicitly restarted Arnoldi method (IRAM) presented in [59]. This method uses the implicitly shifted QR algorithm to restart the Arnoldi process. From the decomposition (3.1), for fixed k, m − k shifts, μ_1, …, μ_{m−k}, are selected and used to perform m − k steps of the implicitly shifted QR algorithm on the Rayleigh quotient H_m. This results in

    A V_m^+ = V_m^+ H_m^+ + β_m v_{m+1} e_m^T Q    (3.2)

where V_m^+ = V_m Q, H_m^+ = Q^H H_m Q, and Q = Q_1 Q_2 ⋯ Q_{m−k}, where each Q_i is the orthogonal matrix associated with one of the m − k shifts. Sorensen observed that the first k − 1 components of e_m^T Q are zero and that equating the first k columns on each


side yields an updated kth-order Arnoldi factorization. The updated decomposition, given by

    A V_k^+ = V_k^+ H_k^+ + β_k v_{k+1}^+ e_k^T    (3.3)

with updated residual β_k v_{k+1}^+, is a restart of the Arnoldi process with a starting vector proportional to p(A) v_1, where p(A) = (A − μ_1 I) ⋯ (A − μ_{m−k} I). Using this as a starting point, Sorensen continued the Arnoldi process to return to the original mth-order decomposition. Sorensen showed this process may be viewed as a truncation of the implicitly shifted QR iteration. Along with this formulation, Sorensen also suggested shift choices that help locate desired parts of the spectrum. Sorensen's approach is the foundation for the ARPACK library of routines that implement IRAM [47]. In Matlab, the function eigs provides the user interface to ARPACK. A parallel version, PARPACK, is available as well, but both offerings only compute a partial Schur decomposition of a general matrix A. As discussed earlier, one of our computational goals is the ability to compute partial and full Schur decompositions using the same approach.

3.1.2 Block Arnoldi

As always, we desire to cast our computation in Level 3 BLAS operations as much as possible for efficiency concerns, but there are other reasons. Block methods are better suited for handling the computation of clustered or multiple eigenvalues, and block methods are appropriate when more than one good initial vector is known. We will examine this more closely in our numerical experiments. Various block approaches to eigenvalue problems may be found in Saad's book [55] or in the "templates" book [71]. In the case of IRAM, a block extension, bIRAM, was presented by Lehoucq and Maschhoff [46]. The bIRAM implementation compared favorably to other block variants of Arnoldi studied by Scott [57]. Of specific interest to our endeavors, the implicitly restarted block scheme was superior to block approaches


using explicit restarting strategies and also outperformed approaches using preconditioning. This includes strategies such as Chebyshev acceleration and Chebyshev preconditioning. Also of note, all implementations studied in [57] computed only a partial Schur decomposition. Currently, ARPACK does not include the bIRAM approach. One reason for this may be the complexities of such an implementation. An example of one of the difficulties in implementing such an approach is the shift strategy, as discussed by Baglama [7]. The generalization to a block method creates possible convergence issues if the shifts are not chosen appropriately. Additionally, IRAM, as Stewart [64] points out, and subsequently bIRAM, requires the structure of the Arnoldi decomposition to be preserved, which may make it difficult to deflate converged Ritz vectors. Due to these issues, we opt to examine the behavior of a basic block Arnoldi approach. This will serve as a starting point for our analysis of block methods.

Building off our work in Chapter 2, we formulate a block Arnoldi approach using Householder reflectors. Our approach uses explicit restarts and deflation to lock converged Schur vectors and is modeled after Algorithm 7.43 in [71], which is reproduced in Algorithm 3.1.1. The converged eigenvalues and Schur vectors are not touched in subsequent steps of Algorithm 3.1.1. This is referred to as "hard locking", as opposed to "soft locking", in which the converged Schur vectors are continuously updated regardless of residual values. This was introduced by Knyazev [41] in the context of iterative methods for the symmetric eigenvalue problem. The upper triangular portion of the matrix H is also locked. In subsequent steps, the approximate eigenvalues are the eigenvalues of the m × m matrix H whose k × k principal submatrix is upper triangular. By locking converged eigenvalues and computed Schur vectors, we are implicitly projecting out the invariant subspace already computed.

In the following, we outline the main components of one sweep of our block Arnoldi method, Algorithm 3.1.2, and discuss the specifics of our implementation.


Algorithm 3.1.1: Explicitly Restarted Arnoldi Method with Deflation
  1  Set k = 1;
  2  for j = k : m do
  3      w = A v_j;
  4      Compute a set of j coefficients h_{ij} so that w = w − Σ_{i=1}^{j} h_{ij} v_i is orthogonal to all previous v_i, i = 1, 2, …, j;
  5      h_{j+1,j} = ‖w‖_2;
  6      v_{j+1} = w / h_{j+1,j};
  7  Compute the approximate eigenvector of A associated with the eigenvalue λ̃_k and its associated residual norm estimate ρ_k;
  8  Orthonormalize this eigenvector against all previous v_i's to get the approximate Schur vector ũ_k and define v_k = ũ_k;
  9  if ρ_k is small enough then
 10      Accept the eigenvalue estimate:
 11          h_{i,k} = v_i^H A v_k, i = 1, …, k; set k = k + 1;
 12      If the desired number of eigenvalues has been reached, stop;
 13      otherwise go to 2;
 14  else
 15      Go to 2;

Here b denotes the block size, k_f denotes the size of the search subspace, where k_f is a multiple of b, k_max denotes the number of desired eigenvalues, and k_con is the number of eigenvalues that have converged. For the sake of notation, the ensuing discussion will assume no eigenvalues have converged, that is, k_con = 0. The case where k_con > 0 is detailed in the following pseudocodes. A step of the iteration begins with a block of


Algorithm 3.1.2: Block Arnoldi Method with explicit restart and deflation
  Input: A ∈ C^{n×n}, U ∈ C^{n×b}, dimension of search subspace k_f, and number of desired eigenvalues k_max
  Result: Partial Schur form given by Q_{k_max} ∈ C^{n×k_max} and H_{k_max} ∈ C^{k_max×k_max}
  1  Set number of blocks: b̂ = k_f / b;
  2  Set k_con = 0;
  3  while k_con < k_max do

vectors U 2 C n b thatisusedtogenerateasize k = k f b blockArnoldidecomposition AQ k = Q k +1 e H k +1 .4 where e H k +1 2 C k f + b k f isgivenby e H k +1 = 2 6 4 H k H k +1 ;k E H k 3 7 5 ; .5 Q k +1 2 C n k f + b hasorthonormalcolumnsandacompactWYrepresentationasin Equation2.5, H k 2 C k f k f isbandHessenberg,and H k +1 ;k 2 C b b .Here Q k +1 refers toamatrixwith k +1blocksofsize n b and Q k +1 referstothe n b matrix makingupthe k +1stblockof Q k +1 .Forthepseudocode,itwillbeconvenientto usethe Matlab stylenotationintroducedinChapter2,thatis Q k = Q : ; 1: kb = Q : ; 1: k f and Q k +1 = Q : ;kb +1: kb + b = Q : ;k f +1: k f + b orusingtheblock notation Q k +1 = Q : ; f k +1 g ExpansionusingblockArnoldiisdetailedinAlgorithm3.1.3.InAlgorithm3.1.3, weuse Matlab 'sfunction qr tocomputesomeofthecomponentsofthecompact WYrepresentation.Usingtheeconomy-size"option,wecomputetheuppertriangularfactorwithcomponentsofthereectorsstoredbelowthediagonalasinLAPACK. TheHouseholderreectorshavetheform H i = I )]TJ/F19 11.9552 Tf 12.415 0 Td [( i v i v H i where v i isunitlower triangularandthus,itsupperpartdoesnotneedtobestored.Since Matlab 'sinterfacedoesnotprovidethescalars, i ,neededtoconstructtheelementaryreectors, weoptedtoimplementourownxLARFGin Matlab togeneratethemissingcomponentstobeabletocomparetoLAPACKwhenconstructingthecompactWYform. DetailsonthecomputationofthescalarsmaybefoundinAlgorithm3.1.4.The scalarswecomputewithourxLARFGarethenusedinourown Matlab implementationofxLARFTtoconstructthetriangularfactorinthecompactWYrepresentation. DetailsmaybefoundinAlgorithm3.1.5.WewillrevisitthisinChapter4asour algorithmicconstructionmotivatesaslightmodicationofxLARFTinLAPACKto 42


Algorithm3.1.3: BlockArnoldiIteration Input : A k ,andpossiblycollapsed Q incompactWYform Result : k thorderArnoldidecompositionwith Q incompactWYform 1 if k con =0 then 2 for j =0: b 1 do 3 V =qr U k con + jb +1: n; 1: b ; 0andcomputescalars k con + f j +1 g fortheelementaryreectors; 4 if j> 0 then 5 H : jb; f j g = U : jb; :; 6 H f j +1 g ; f j g =triu V : b; 1: b ; 7 Y jb +1: n; f j +1 g =tril V; )]TJ/F15 11.9552 Tf 9.298 0 Td [(1+eyesize V ; T =zlarft Y; ; Q : ; f j +1 g = Q : ; f j +1 g )]TJ/F19 11.9552 Tf 11.955 0 Td [(YTY H Q : ; f j +1 g ; 8 if j

Algorithm3.1.4: Matlab implementationofZLARFG Input :Avector x 2 C n Result :Scalars ; andvector v 1 n =length x ; 2 alpha= x ;xnorm=norm x : n ; 2; 3 ar=realalpha;ai=imagalpha; 4 if xnorm =0 then 5 tau=0;beta=alpha;v=x:n,:; 6 else 7 beta=-signar*sqrtar 2 +ai 2 +xnorm 2 ; 8 tau=beta-ar/beta-ai/beta*i;v=x:n/alpha-beta; Algorithm3.1.5: MatlabimplementationofZLARFT Input :reectors Y andassociatedscalars Result :triangularfactor T forcompactWYform 1 [ n;k ]=size V ; 2 T =zeros k ; 3 for i =1: k do 4 T : i )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 ;i = )]TJ/F19 11.9552 Tf 9.298 0 Td [( i T : i )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 ; 1: i )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 V : ; 1: i )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 H V : ;i ; 5 T i;i = i ; 44


avoidunneededcomputations.Ifnoeigenvalueshaveconverged,thematrix Q k +1 has thefollowingform Q k +1 = I n; k +1 b )]TJ/F19 11.9552 Tf 11.955 0 Td [(YTY H I n; k +1 b ; .6 where I n; k +1 b istherst k +1 b columnsofthe n n identitymatrix, Y 2 C n k +1 b withunitdiagonaland T 2 C k +1 b k +1 b asinLAPACK.If k con eigenvalueshave beendeated, Q k +1 hastheform Q k +1 = e I n;k con + k +1 b )]TJ/F19 11.9552 Tf 11.955 0 Td [(YTY H e I n;k con + k +1 b ; .7 where e I n;k con + k +1 b 2 C n k con + k +1 b hastheform e I n;k con + k +1 b = 2 6 4 R k con + b 0 k con + b;k )]TJ/F20 7.9701 Tf 6.586 0 Td [(k con )]TJ/F20 7.9701 Tf 6.587 0 Td [(b 0 n )]TJ/F20 7.9701 Tf 6.587 0 Td [(k con )]TJ/F20 7.9701 Tf 6.586 0 Td [(b;k con + b I k )]TJ/F20 7.9701 Tf 6.586 0 Td [(k con )]TJ/F20 7.9701 Tf 6.587 0 Td [(b 3 7 5 : Here R k con + b 2 C k con + b k con + b isadiagonalmatrixwithentries 1generatedwhen werebuildthecompactWYformafterdeatingorrestartingandtheothercomponentsareappropriatelysizedmatricesofzerosandtheidentitymatrix. InthenextstepofourblockArnoldiapproach,wecomputetheSchurfactorizationoftheRayleighquotientmatrix H k ,thatis H k V k = V k S k ; .8 where V k 2 C kb kb V H k V k = I ,and S k 2 C kb kb isuppertriangular.Inourcurrent implementation, S k isuppertriangularratherthanupperblocktriangular,butwe couldadjustourapproachtoworkinrealarithmetic.Thechoicetoworkincomplex arithmeticwasmade,inpart,tocomparetosomeoftheavailableimplementations ofunblockedmethodsthatusethesameapproach.Ourimplementationuses Matlab 'sfunction schur whichprovidestheinterfacetoLAPACK'sroutinesxGEHRD, xUNGHR,andxHSEQR. TheSchurform3.8isthenreorderedsothatthedesiredeigenvaluesappearin thetopleftofthematrix S k .ReorderingtheSchurformwillplayanimportant 45


roleinallofourapproachesinthissection.ThiscanbeaccomplishedinLAPACK bytheroutinexTRSENforbothuppertriangularandblockuppertriangularSchur factorizations.Aswearecurrentlyworkingin Matlab andsinceourapplications keeptheorderoftheSchurfactor S k relativelymanageable,wereordertheSchurform usingGivensrotationsandatargettolocateaspecicpartofthespectrum,suchas theeigenvalueswithlargestrealcomponents.Ourapproachwasadaptedfromthe computationspresentedin[24].DependingontheorderoftheSchurfactor S k ,wemay needtoadjusthowwehandlereorderingtheSchurform.Kressnerdiscussedblock algorithmsforthetaskofreorderingSchurformsin[42]andspecicallyaddressed theapplicationsofinteresthere,namelyreorderingSchurformsintheKrylov-Schur andJacobi-Davidsonalgorithms.Tofullytakeadvantageoflevel3BLASoperations, weshouldadoptasimilarapproach.Detailsofourimplementationcanbefoundin Algorithm3.1.6andwewillpostponeasfutureworkanyfurtheranalysisoneciently reorderingtheSchurform. OncetheSchurformisreordered,wechecktoseeiftheRitzvaluesareacceptable approximationsoftheeigenvaluesofthematrix A .Afterreordering,ourblockArnoldi decompositionhastheform AQ k V k = Q k V k ;Q k +1 2 6 4 S k H k +1 ;k E T k V k 3 7 5 : .9 sothat AQ k V k )]TJ/F19 11.9552 Tf 11.955 0 Td [(Q k V k S k = Q k +1 H k +1 ;k E T k V k ; .10 FromherewecanseethatthequalityoftheRitzvalueisgivenby k AQ k V k )]TJ/F19 11.9552 Tf 11.956 0 Td [(Q k V k S k k 2 = k Q k +1 H k +1 ;k E T k V k k 2 46


Algorithm3.1.6: ReorderSchurForm Input :Unitary Z 2 C k k ,uppertriangular T s 2 C k k andatarget Result :Reordered T s and U r 2 C k k 1 U r =eye k ; 2 for i =1: k )]TJ/F15 11.9552 Tf 11.956 0 Td [(1 do 3 d =diag T s i : k;i : k ; 4 [ a;jj ]=minabs d )]TJ/F19 11.9552 Tf 11.956 0 Td [(target ; 5 jj = jj + i )]TJ/F15 11.9552 Tf 11.956 0 Td [(1; 6 for t = jj )]TJ/F15 11.9552 Tf 11.955 0 Td [(1: )]TJ/F15 11.9552 Tf 9.299 0 Td [(1: i do 7 x =[ T s t;t +1 ;T s t;t )]TJ/F19 11.9552 Tf 11.955 0 Td [(T s t +1 ;t +1]; 8 G [2 ; 1] ; [2 ; 1]=planerot x H H ; 9 T s : k; [ t;t +1]= T s : k; [ t;t +1] G ; 10 T s [ t;t +1] ; := G H T s [ t;t +1] ; :; 11 U r : ; [ t;t +1]= U r : ; [ t;t +1] G ; andanaturalstoppingcriteriaiswhen k H k +1 ;k E T k V k k 2 issucientlysmall.In ARPACK,givenanArnoldidecompositionofsize k ,thatis AU k =[ U k ;u k +1 ] 2 6 4 H k h k +1 ;k e T 3 7 5 ; .11 aRitzvalue isregardedasconvergedwhentheassociatedRitzvector U k w ,with k w k 2 =1,satises k A U m w )]TJ/F19 11.9552 Tf 11.955 0 Td [( U m w k 2 = h k +1 ;k e T k w max f u k H k k F ; tol j jg where u ismachineepsilonand tol isauserspeciedtolerance.Furtherdetailsmay befoundinARPACK'suserguide[47].Thiscriteriaguaranteesasmallbackward errorfor astheestimateisaneigenvalueoftheslightlyperturbedmatrix A + E with E = )]TJ/F19 11.9552 Tf 9.299 0 Td [(h m +1 ;m e T m w u m +1 U m w T 47


asdetailedbyKressner[43].Wecanextendthisideatoourapproachandacceptthe Ritzvalue ,intheupperleftof S k afterreordering,when k H k +1 ;k E T k V k : ; 1 k 2 max f u k H k k F ;tol j jg : .12 ThenalstepinasweepofourblockArnoldimethodistocheckconvergenceusing Equation3.12.Ifaneigenvaluehasconverged,weexplicitlydeatetheconverged eigenvaluein H k andcollapse Q k toincludetheconvergedSchurvectorsplusablock ofsize n b tobeusedtorestarttheArnoldiprocess.AcompactWYrepresentation ofthetruncated Q k iscomputedandwebeginthesweepagain.Herewegenerate thematrix R k con + b introducedin3.7.Iftheeigenvalueapproximationisnotyet satisfactory,wecollapse Q k andbuildacompactWYrepresentationofthepreviously convergedSchurvectorsandtheadditional n b blocktobeusedforrestarting. DetailsofthisnalstepcanbefoundinAlgorithm3.1.7. 48


Algorithm3.1.7: Explicitrestartwithpossibledeation Input : Q : ; 1: k con + k + b withcompactWYrepresentation Result : Q : ; 1: k con + b withcompactWYrepresentation 1 p =norm H k con + f b +1 g ;k con + f b g Z f b g ; 1; 2 if p< check then 3 k con = k con +1; 4 Explicitlydeate: H k con ;k con = T s ; 1; H k con +1: k con + b;k con =zeros b; 1; 5 Q = Q : ; 1: k con + b ; H = H : k con + b; 1: k con ; 6 V =qr Q; 0;andcomputescalars t : k con + b forelementaryreectors; 7 R =diagdiag V : k con + b; 1: k con + b ;; 8 Y =tril V; )]TJ/F15 11.9552 Tf 9.298 0 Td [(1+eyesize V ; T =zlarft Y;t ;; 9 Q k con + b +1: n;k con + b +1: k con + b + k =eye n )]TJ/F15 11.9552 Tf 11.955 0 Td [( k con + b ;k ; 10 else 11 if k con =0 then 12 U = AQ : ; f 1 g ; Q =eye n;k + b ; 13 else 14 Q = Q : ; 1: k con + b ; H = H : k con + b; 1: k con ; 15 V =qr Q; 0;andcomputescalars t : k con + b forelementary reectors; 16 R =diagdiag V : k con + b; 1: k con + b ;; 17 Y =tril V; )]TJ/F15 11.9552 Tf 9.298 0 Td [(1+eyesize V ; T =zlarft Y;t ;; 18 Q k con + b +1: n;k con + b +1: k con + b + k =eye n )]TJ/F15 11.9552 Tf 11.955 0 Td [( k con + b ;k ; 49


3.1.3BlockKrylov-Schur AmorenumericallyreliableprocedurethanIRAMisStewart'sKrylov-Schur algorithm[64].Thisreliability,alongwiththeeasewithwhichconvergedRitzpairs aredeatedandunwantedRitzvaluesarepurged,makestheKrylov-Schurmethod anattractivealternativetoIRAM.AstepofStewart'sKrylov-Schurmethodbegins andendswithaKrylov-Schurdecompositionoftheform AV k = V k S k + v k +1 b H k +1 .13 where S k isa k k uppertriangularmatrixandthecolumnsof V k +1 areorthonormal. Forconvenience,wewrite AV k = V k +1 e S k ; .14 where e S k = 2 6 4 S k b H k +1 3 7 5 : TheKrylov-Schurmethodusessubspaceexpansionandcontractionphasesmuch likeanyapproachbasedonArnoldidecompositions.TheexpansionphaseofKrylovSchurproceedsexactlyasinthetypicalArnoldiapproachandrequiresapproximately thesameamountofwork.TherealpoweroftheKrylov-Schurapproachisinthe contractionphase.AswewillseeindetailinChapter4,thekeyaspectofthe Krylov-Schurmethodisthatadecompositionoftheform3.13canbetruncatedat anypointintheprocessprovidinganeasywaytopurgeunwantedRitzvalues.As theKrylov-Schurmethodworksexplicitlywiththeeigenvaluesof S k ,theRayleigh quotient,thisapproachisanexact-shiftmethod.Thisisincontrasttomethodsthat useothershiftssuchaslinearcombinationsoftheeigenvaluesof S k .Wewillrevisit thedetailsoftheKrylov-SchurmethodinChapter4. 50


A block version of this approach has been implemented for symmetric matrices by Saad and Zhou in [72]. They included detailed pseudocodes, including how to handle rank-deficient cases and adaptive block sizes. The numerical results provided by Saad and Zhou show that their implementation performs well against Matlab's eigs function based on ARPACK, irbleigs by Baglama et al., an implementation of an implicitly restarted block-Lanczos algorithm [6], and lobpcg, without preconditioning, presented by Knyazev [40]. Of particular interest to our pursuits is that, in this study, the Krylov-Schur based approach consistently outperformed the competitors as the number of desired eigenvalues was increased.

Though the Krylov-Schur method was presented as an attractive alternative to IRAM, there are few implementations in either unblocked or blocked form for nonsymmetric matrices. Baglama's Augmented Block Householder Arnoldi (ABHA) method [7] is an implicit version of a block Krylov-Schur approach to the NEP, as it employs Schur vectors for restarts. We will compare our approach to the publicly available program ahbeigs based on this ABHA method later. An implementation of the block Krylov-Schur method is available in the Anasazi eigensolver package, which is part of Trilinos [9].

As we did for our block Arnoldi approach, we now present a sweep of our block Krylov-Schur method, Algorithm 3.1.8, along with detailed computational kernels. We again formulate our approach based on Householder reflectors in compact WY form. A step of the iteration begins with $A \in \mathbb{C}^{n \times n}$ and a block of vectors $U \in \mathbb{C}^{n \times b}$. For the remainder of this section, $b$ denotes the block size, $k_s$ denotes the starting basis size and the size of the basis after contraction, and $k_f$ denotes the final basis size, so that $k_s < k_f$. The first expansion uses block Arnoldi to produce a decomposition of the form

$$A Q(:, 1{:}k_s) = Q(:, 1{:}k_s+b)\, \tilde H(1{:}k_s+b, 1{:}k_s),$$

where $Q \in \mathbb{C}^{n \times (k_s+b)}$ has orthonormal columns and a compact WY representation as in Equation 2.5. We will discuss the importance of this first expansion using block Arnoldi in Chapter 4.

The Schur form of the Rayleigh quotient $\tilde H \in \mathbb{C}^{(k_s+b) \times k_s}$ is computed next, so that we have

$$\tilde H(1{:}k_s, 1{:}k_s)\, U_s = U_s T_s$$

with $U_s \in \mathbb{C}^{k_s \times k_s}$ such that $U_s^H U_s = I$, and $T_s \in \mathbb{C}^{k_s \times k_s}$ is upper triangular. As before, we could instead compute the real Schur form, in which $T_s$ is upper block triangular, and work completely in real arithmetic. We will discuss this further in Chapter 4. Updating our block Arnoldi decomposition, we have the block Krylov-Schur factorization

$$A Z(:, 1{:}k_s) = Z(:, 1{:}k_s+b)\, S(1{:}k_s+b, 1{:}k_s), \qquad (3.15)$$

where $Z(:, 1{:}k_s) = Q(:, 1{:}k_s) U_s$, $Z \in \mathbb{C}^{n \times (k_s+b)}$ has orthonormal columns, and the Rayleigh quotient $S(1{:}k_s, 1{:}k_s)$ is upper triangular, with a full $r \times k_s$ block on the bottom as depicted in Figure 3.1. This ends the initialization phase of our block

Figure 3.1: Block Krylov-Schur Decomposition

Krylov-Schur approach. Now we enter a cycle of expanding and contracting the Krylov decomposition until an eigenvalue may be deflated. The initial Krylov-Schur decomposition (3.15) is expanded in the same manner as in our block Arnoldi approach. Details may be found in Algorithm 3.1.3. The expanded Krylov decomposition is now
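The compact WY representation used throughout (and assembled by zlarft in our codes) can be sketched in a few lines of NumPy. This is a real-arithmetic illustration with our own names, not the implementation from the text: Householder reflectors $H_j = I - 2 y_j y_j^T$ are accumulated as $H_1 \cdots H_m = I - Y T Y^T$ via the forward columnwise recurrence for $T$.

```python
import numpy as np

def householder_wy(Z):
    # QR of Z (n x m) by Householder reflectors H_j = I - 2 y_j y_j^T,
    # accumulated in compact WY form: H_1 H_2 ... H_m = I - Y T Y^T.
    n, m = Z.shape
    R = Z.astype(float).copy()
    Y = np.zeros((n, m))
    T = np.zeros((m, m))
    for j in range(m):
        x = R[j:, j].copy()
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])
        v /= np.linalg.norm(v)
        R[j:, :] -= 2.0 * np.outer(v, v @ R[j:, :])   # apply the reflector
        y = np.zeros(n)
        y[j:] = v
        # zlarft-style forward recurrence for the triangular factor T
        T[:j, j] = -2.0 * (T[:j, :j] @ (Y[:, :j].T @ y))
        T[j, j] = 2.0
        Y[:, j] = y
    return Y, T, np.triu(R[:m, :])

rng = np.random.default_rng(1)
n, m = 10, 4
Z = rng.standard_normal((n, m))
Y, T, R = householder_wy(Z)
Q = np.eye(n) - Y @ T @ Y.T    # orthogonal factor; never formed explicitly in practice
```

The point of the representation is that $Q$ is applied through the rank-structured product $I - Y T Y^T$, i.e., through BLAS-3 operations, rather than being formed.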


of the form

$$A Q(:, 1{:}k_f) = Q(:, 1{:}k_f+b)\, S(1{:}k_f+b, 1{:}k_f), \qquad (3.16)$$

where $Q \in \mathbb{C}^{n \times (k_f+b)}$ has orthonormal columns and a compact WY representation, and the Rayleigh quotient has the structure depicted in Figure 3.2. If $k_{con} > 0$, constructing the compact WY representation of our expanded $Z$ may involve a form as in Equation 3.7. We will address this in detail when we reach the end of a sweep.

Figure 3.2: Expanded Block Krylov Decomposition

The next step is to compute the Schur form of the $k_f \times k_f$ principal submatrix in the upper left of $S$. Again, we use Matlab's function schur to compute the upper triangular Schur factor. The Schur factor is then reordered as we did before in our block Arnoldi method; Algorithm 3.1.6 accomplishes this task. We then update the corresponding parts of $S$ so that our decomposition has the form

$$A Q(:, 1{:}k_f) = Q(:, 1{:}k_f+b)\, S(1{:}k_f+b, 1{:}k_f), \qquad (3.17)$$

where $S$ again has the structure depicted in Figure 3.1. Once the desired eigenvalue approximations have been moved to the upper left of the matrix $S$, we check for convergence. Kressner [43] details the connection between the convergence criteria of ARPACK and possible extensions to a Krylov decomposition. Given a Krylov-Schur decomposition of order $m$,

$$A U_m = U_{m+1} \begin{bmatrix} B_m \\ b_{m+1}^T \end{bmatrix},$$


the direct extension of Equation 3.11 is given by

$$\|A U_m w - \theta U_m w\|_2 = |b_{m+1}^T w| \le \max\{\epsilon_u \|B_m\|_F,\ tol \cdot |\theta|\},$$

where $\|w\|_2 = 1$, $\epsilon_u$ is the machine precision, and $tol$ is a user tolerance. Both of these criteria, the one of ARPACK and the Krylov extension, guarantee a small backward error. Kressner also discussed how the Krylov-Schur approach suggests a more restrictive convergence criterion based on Schur vectors rather than Ritz vectors. If no eigenvalues have been deflated, we may accept an approximation if

$$|b_1| \le \max\{\epsilon_u \|B_m\|_F,\ tol \cdot |\theta|\}, \qquad (3.18)$$

where $b_1$ is the first component of $b_{m+1}$. We discuss this in depth in Chapter 4. An extension of the convergence criterion based on Schur vectors to our block method is given by

$$\|S(k_f+1{:}k_f+b, 1)\|_2 \le \max\{\epsilon_u \|S(1{:}k_f, 1{:}k_f)\|_F,\ tol \cdot |\theta|\}, \qquad (3.19)$$

where $\theta = S(1,1)$, $\epsilon_u$ is machine precision, and $tol$ is a user-specified tolerance. Our block Krylov-Schur implementation can deflate converged eigenvalues one at a time, or $n_{con}$ at a time, where $1 \le n_{con} \le b$. If the eigenvalue satisfies Equation 3.19, we explicitly deflate by zeroing out the bottom $b \times 1$ block in the corresponding column of $S$ and then truncate our Krylov-Schur decomposition. If no eigenvalues have converged, the Krylov-Schur decomposition is truncated. Truncation in either case is handled by Algorithm 3.1.10, and a compact WY representation of $Z$ is computed. After truncation, the process begins again with expansion of the search subspace.
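As a concrete illustration, the test (3.19) amounts to a pair of norm comparisons on the extended Rayleigh quotient. The helper below is a hypothetical sketch (the function name and toy data are ours):

```python
import numpy as np

def schur_converged(S, k_f, b, tol):
    # Convergence test (3.19) for the leading Schur vector: compare the
    # bottom b x 1 block of column 1 against
    # max(eps_u * ||S(1:k_f, 1:k_f)||_F, tol * |theta|), with theta = S(1,1).
    eps_u = np.finfo(float).eps
    theta = S[0, 0]
    lhs = np.linalg.norm(S[k_f:k_f + b, 0])
    rhs = max(eps_u * np.linalg.norm(S[:k_f, :k_f], 'fro'), tol * abs(theta))
    return lhs <= rhs

# Toy extended Rayleigh quotient: 4x4 triangular part over a 2x4 bottom block.
S = np.vstack([np.triu(np.full((4, 4), 2.0)), np.zeros((2, 4))])
locked = schur_converged(S, 4, 2, 1e-12)   # bottom block is zero: converged
S[4:, 0] = 0.5
active = schur_converged(S, 4, 2, 1e-12)   # large bottom block: not converged
```

When the test passes, the bottom block of the leading column is explicitly zeroed and the decomposition is truncated, exactly as described above.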


Algorithm 3.1.8: Block Krylov-Schur
Input: $A \in \mathbb{C}^{n \times n}$, $U \in \mathbb{C}^{n \times b}$, and number of desired eigenvalues $k_{max}$
Result: Partial Schur form given by $Z_{k_{max}}$ and $S_{k_{max}}$

1:  Use Algorithm 3.1.3 to generate: A Z(:, 1:k_s) = Q(:, 1:k_s+b) H̃(1:k_s+b, 1:k_s);
2:  Compute Schur form of the Rayleigh quotient H̃(1:k_s, 1:k_s);
3:  Update columns of Q and H̃(k_s+1 : k_s+b, :);
4:  Now we have a block Krylov-Schur decomposition: A Z(:, 1:k_s) = Z(:, 1:k_s+b) S(1:k_s+b, 1:k_s);
5:  while k_con ≤ k_max do
6:      Use Algorithm 3.1.9 to expand as in Arnoldi:
            A Z(:, 1:k_con+k_f) = Z(:, 1:k_con+k_f+b) S(1:k_con+k_f+r, 1:k_con+k_f);
7:      Compute the Schur form of the active part of the Rayleigh quotient:
            S(k_con+1 : k_con+k_f, k_con+1 : k_con+k_f) U_s = U_s T_s;
8:      Reorder the Schur form using Algorithm 3.1.6, producing the reordered T_s and the reordering transformation U_r;
9:      Update corresponding pieces of the Krylov decomposition:
            Z(:, k_con+1 : k_con+k_f) ← Z(:, k_con+1 : k_con+k_f) U_s U_r;
            S(1:k_con, k_con+1 : k_con+k_f) ← S(1:k_con, k_con+1 : k_con+k_f) U_s U_r;
            S(k_con+1 : k_con+k_f, k_con+1 : k_con+k_f) ← T_s;
            S(k_con+k_f+1 : k_con+k_f+b, :) ← S(k_con+k_f+1 : k_con+k_f+b, :) U_s U_r;
10:     if converged then
11:         Explicitly deflate n_con converged eigenvalues:
12:         S(k_con+k_f+1 : k_con+k_f+b, k_con+1 : k_con+n_con) = zeros(b, n_con); k_con = k_con + n_con;
13:         Truncate expansion to size k_s using Algorithm 3.1.10;
14:     else
15:         Truncate expansion to size k_s using Algorithm 3.1.10;


Algorithm 3.1.9: Expansion of block Krylov decomposition
Input: A, Z ∈ C^{n×(k_con+k_s+b)}, S ∈ C^{(k_con+k_s+b)×(k_con+k_s)}
Result: Z ∈ C^{n×(k_con+k_f+b)}, S ∈ C^{(k_con+k_f+b)×(k_con+k_f)}

1:  for j = b_s : b_f do
2:      Compute components for compact WY form of QR factorization:
            V = qr(U(k_con+jb+1 : n, 1:b), 0), and compute scalars τ(k_con+{j+1}) for the elementary reflectors;
3:      Update S:
            S(1 : jb+k_con, k_con+{j}) = U(1 : jb+k_con, :);
            S(k_con+{j+1}, k_con+{j}) = triu(V(1:b, 1:b));
4:      Store reflectors in Y:
            Y(k_con+jb+1 : n, k_con+{j+1}) = eye(size(V)) + tril(V, -1);
5:      Build triangular factor for compact WY representation: T = zlarft(Y, τ);
6:      Explicitly build next block of Z:
            Z(:, k_con+{j+1}) = Z(:, k_con+{j+1}) − Y T Y^H Z(:, k_con+{j+1});
7:      If j < b_f, form the next block U = A Z(:, k_con+{j+1});

Algorithm 3.1.10: Truncation of block Krylov decomposition
Input: S, Z, and n_con
Result: truncated S and Z with compact WY form

1:  if n_con > 0 then
2:      Explicitly deflate:
3:      for i = 1 : n_con do
4:          S(k_con+k_f+1 : k_con+k_f+b, k_con+i) = zeros(b, 1);
5:      k_con = k_con + n_con;
6:      Z(:, k_s+k_con+1 : k_s+k_con+b) = Z(:, k_f+k_con−n_con+1 : k_f+k_con−n_con+b);
7:      Z(:, k_s+k_con+r+1 : k_con+k_f+b) = zeros(n, k_f−k_s);
8:      Z(k_con+k_s+b+1 : n, k_s+k_con+b+1 : k_con+k_f+b) = eye(n−k_s−b−k_con, k_f−k_s);
9:      S = [ S(1 : k_s+k_con, 1 : k_s+k_con); S(k_f+k_con−n_con+1 : k_con+k_f+b−n_con, 1 : k_s+k_con) ];
10:     n_con = 0;
11: else
12:     Z(:, k_s+k_con+1 : k_s+k_con+b) = Z(:, k_f+k_con+1 : k_f+k_con+b);
13:     Z(:, k_s+k_con+r+1 : k_con+k_f+b) = zeros(n, k_f−k_s);
14:     Z(k_con+k_s+b+1 : n, k_s+k_con+b+1 : k_con+k_f+b) = eye(n−k_s−b−k_con, k_f−k_s);
15:     S = [ S(1 : k_s+k_con, 1 : k_s+k_con); S(k_con+k_f+1 : k_con+k_f+b, 1 : k_s+k_con) ];
16: Build new compact WY representation: X = qr(Z(:, 1 : k_con+k_s+b), 0), and compute scalars τ(1 : k_con+k_s+b);
17: Y = tril(X, −1) + eye(n, k_con+k_s+b);
18: T = zlarft(Y, τ); R = triu(X(1 : k_con+k_s+b, :));


3.1.4 Block Jacobi-Davidson

We now turn our focus from methods that project onto Krylov subspaces to an approach that uses a different projection technique. The Jacobi-Davidson approach was first proposed in 1996 by Sleijpen and Van der Vorst [58]. This method combined ideas from Davidson's work on large, sparse symmetric matrices from the 1970s [21] and Jacobi's iterative approaches to computing eigenvalues of symmetric matrices from the 1840s [37]. We present a brief discussion of Davidson's work and Jacobi's work to see the relationship between Krylov based methods, such as Arnoldi, and methods such as Jacobi-Davidson.

Jacobi introduced a combination of two iterative methods to compute the eigenvalues of a symmetric matrix. To see how Jacobi viewed eigenvalue problems, let $A$ be an $n \times n$, diagonally dominant matrix with largest diagonal element $A(1,1) = \alpha$. An approximation of the largest eigenvalue and associated eigenvector is the Ritz pair $(\alpha, e_1)$, as in

$$A \begin{bmatrix} 1 \\ z \end{bmatrix} = \lambda \begin{bmatrix} 1 \\ z \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} \alpha & c^T \\ b & F \end{bmatrix} \begin{bmatrix} 1 \\ z \end{bmatrix} = \lambda \begin{bmatrix} 1 \\ z \end{bmatrix},$$

where $b, c \in \mathbb{R}^{n-1}$ and $F \in \mathbb{R}^{(n-1) \times (n-1)}$. Jacobi proposed to solve eigenvalue problems of this form using his Jacobi iteration. To see this, consider the alternative formulation of this system:

$$\lambda = \alpha + c^T z, \qquad (F - \lambda I) z = -b.$$

Jacobi solved the linear system on the second line using his Jacobi iteration, beginning with $z_1 = 0$ and getting an updated approximation $\theta_k$ for $\lambda$ using a variation


of the iteration

$$\theta_k = \alpha + c^T z_k, \qquad (D - \theta_k I) z_{k+1} = (D - F) z_k - b,$$

where $D$ is the diagonal matrix with the same diagonal entries as $F$. The first step was to make the matrix strongly diagonally dominant by applying rotations to precondition the matrix. Jacobi then proceeded with his iterative method, which searched for corrections in the orthogonal complement of the initial approximation. Further details of Jacobi's work may be found in [58]. The key idea from his work is that all corrections came from the orthogonal complement of the initial approximation.

Davidson was also working with real symmetric matrices. Suppose we have a subspace $V$ of dimension $k$, the projected matrix of $A$ has Ritz value $\theta_k$ and Ritz vector $u_k$, and an orthogonal basis for $V$ is given by $\{v_1, \ldots, v_k\}$. A measure of the quality of our approximation is given by the residual

$$r_k = A u_k - \theta_k u_k.$$

Davidson was concerned with how to expand the subspace $V$ to improve the approximation and update $u_k$. His answer consisted of the following steps:

1. Compute $t$ from the system $(D - \theta_k I)\, t = r_k$
2. Orthogonalize $t$ against the basis for $V$: $t \perp \{v_1, \ldots, v_k\}$
3. Expand the subspace $V$ by taking $v_{k+1} = t$

where $D$ is a diagonal matrix with the same diagonal entries as $A$. The Jacobi-Davidson method combines elements from both approaches. Given an approximation $u_k$, the correction to this approximation is found in the projection onto the orthogonal complement of the current approximation. The projected matrix is given by

$$B = (I - u_k u_k^T)\, A\, (I - u_k u_k^T) \qquad (3.20)$$


and rearranging terms yields

$$A = B + A u_k u_k^T + u_k u_k^T A - \theta_k u_k u_k^T,$$

where $\theta_k = u_k^T A u_k$. For a desired eigenvalue of $A$, say $\lambda$, that is close to $\theta_k$, the desired correction $t$ is such that

$$A (u_k + t) = \lambda (u_k + t), \quad \text{and} \quad t \perp u_k. \qquad (3.21)$$

Substituting Equation 3.20 into the desired correction Equation 3.21, and using some orthogonality relations, we have the following equation for the correction:

$$(B - \lambda I)\, t = -r_k, \qquad (3.22)$$

where $r_k = A u_k - \theta_k u_k$ is the residual. Different approaches to solving the correction equation 3.22 result in different methods. If the solution is approximated by the residual, that is, $t = r_k$, the correction is formally the same as that generated by Arnoldi. In the symmetric case, if $t = (D - \theta_k I)^{-1} r_k$, then we recover the approach proposed by Davidson. In the general case, combining the strategies of Jacobi and Davidson, the correction equation has the form

$$(B - \theta_k I)\, t = -r_k \quad \text{with} \quad t \perp u_k, \qquad (3.23)$$

where $B - \lambda I$ is replaced by $B - \theta_k I$. Approximating a solution to Equation 3.23 has been studied extensively since it was proposed. We will highlight recent developments later. The work by Sleijpen and Van der Vorst [58] set the foundation for the formulation of a Jacobi-Davidson style QR algorithm, jdqr, presented by Fokkema et al. [24], that iteratively constructs a partial Schur form. In [24], algorithms were presented for both standard eigenvalue problems and generalized eigenvalue problems, including the use of preconditioning for the correction equation and restart strategies. We will restrict our discussion to standard eigenvalue problems.
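To make (3.22)-(3.23) concrete, the sketch below solves the correction equation exactly via an equivalent bordered linear system (a standard trick for enforcing $t \perp u_k$; in practice Jacobi-Davidson only solves the equation approximately). All names are ours.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
A = rng.standard_normal((n, n))

# Current approximation: unit vector u with Rayleigh quotient theta.
u = rng.standard_normal(n)
u /= np.linalg.norm(u)
theta = u @ A @ u
r = A @ u - theta * u                    # residual r_k; note u @ r == 0

# The correction equation (B - theta I) t = -r_k with t ⟂ u is equivalent
# to the nonsingular bordered system [[A - theta I, u], [u^T, 0]] [t; z] = [-r; 0]:
# the last row enforces the orthogonality constraint.
M = np.block([[A - theta * np.eye(n), u[:, None]],
              [u[None, :], np.zeros((1, 1))]])
t = np.linalg.solve(M, np.concatenate([-r, [0.0]]))[:n]

P = np.eye(n) - np.outer(u, u)           # projector onto the complement of u
B = P @ A @ P                            # projected matrix of (3.20)
```

The bordered formulation also explains why exact solves are expensive: each one is a full linear system of size $n+1$, which motivates the approximate inner solves discussed next.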


For the standard eigenvalue problem, the Jacobi-Davidson method picks an approximate eigenvector from a search subspace that is expanded at each step. If the search subspace is given by $\operatorname{span}\{V\}$, then the projected problem is given by

$$(V^H A V - \theta V^H V)\, u = 0, \qquad (3.24)$$

where $V \in \mathbb{C}^{n \times j}$ and, in exact arithmetic, $V^H V = I$. Equation 3.24 is solved, yielding Ritz value $\theta$, Ritz vector $q = V u$, and residual vector $r = (A - \theta I) q$, where $r \perp q$. The projected eigenproblem 3.24 is reduced to Schur form by the QR algorithm. Fokkema et al. defined the $j \times j$ interaction matrix $M = V^H A V$ with Schur decomposition $M U = U S$, where $S$ is upper triangular and ordered such that

$$|S(1,1) - \tau| \le |S(2,2) - \tau| \le \cdots \le |S(j,j) - \tau|,$$

where $\tau$ is some specified target. A Ritz approximation to the projected problem with Ritz value closest to $\tau$ is given by

$$(q, \theta) = \big(V U(:,1),\ S(1,1)\big).$$

Additionally, useful information for the $i$ eigenvalues closest to $\tau$ may be found in the span of the columns of $V U(:, 1{:}i)$, with $i < j$.

When projections are involved, preconditioning is not straightforward. Detailed pseudocodes may be found in [24], and a Matlab implementation, jdqr, is publicly available.

Since its introduction in 1996 by Sleijpen and Van der Vorst [58], much work has been devoted to understanding and improving the Jacobi-Davidson approach, especially in the case of symmetric and Hermitian matrices. Here we survey the major highlights that pertain to our block variant in the context of the NEP. There is a wealth of information available on the Jacobi-Davidson approach, and a good starting point is the Jacobi-Davidson Gateway [33] maintained by Hochstenbach. As mentioned earlier, Fokkema et al. [24] discussed deflation and implicit restarts. Deflation and the search for multiple eigenvalues of Hermitian matrices was also studied by Stathopoulos and McCombs [61]. The connection between the inner and outer iterations was studied by Hochstenbach et al. [34]. This is a critical component of the Jacobi-Davidson approach, as solving the correction equation too accurately at the wrong time can lead to the search subspace being expanded in ineffective ways. We will examine this issue when presenting our numerical results. Hochstenbach et al. proved a relation between the residual norm of the inner linear system and the residual of the eigenvalue problem. This analysis suggested new stopping criteria that improved the overall performance of the method. We will employ some of their heuristics in our block approach.

A block variant for sparse symmetric eigenvalue problems was formulated by Geus [27]. For general matrices with inexpensive action, for example large and sparse matrices, Brandts [17] suggested a variant of blocked Jacobi-Davidson based on his Ritz-Galerkin method with inexact Riccati expansion. This method has a Riccati correction equation that, depending on the quality of the approximate solution, reduces to a block Arnoldi approach or a block Jacobi-Davidson approach. In fact, the Riccati correction equation, when linearized, becomes the correction equation of Jacobi-Davidson. Brandts's method solves the Riccati equation exactly, and the extra work is demonstrated to be negligible in the case of matrices with inexpensive action. Further investigation into subspace expansion from the solutions of generalized algebraic Riccati equations is the subject of future work. Parallelization has been investigated, but mainly in the context of generalized eigenvalue problems for large Hermitian matrices [52, 3] and, most recently, for quadratic eigenvalue problems [68].

Our block version of Jacobi-Davidson, Algorithm 3.1.11, is a straightforward extension of jdqr from [24]. In the publicly available Matlab implementation, jdqr, Fokkema et al. used existing implementations of the QR algorithm, such as schur, to compute the Schur form of the projected problem. MGS was used for the construction of an orthogonal basis for the search subspace. Rather than using MGS, we opt, as before, to base our algorithm on the use of Householder reflectors. Several linear solvers are available to approximately solve the correction equation 3.25. The implementation jdqr allows the user to specify various methods, but as our stopping criteria depend on the method used, we choose to employ the generalized minimal residual method (GMRES). We also formulate our approach to allow for preconditioning of the correction equation, though we do not suggest strategies for identifying effective preconditioners.

We now present one sweep of our block Jacobi-Davidson method, detailed in Algorithm 3.1.11. Here $b$ denotes the block size, $j_{min}$ is the minimum dimension of the search subspace, $j_{max}$ is the maximum dimension of the search subspace, $k_{con}$ is the number of converged eigenvalues, and $k_{max}$ is the number of desired eigenvalues. In the ensuing steps, we use similar notation as Fokkema et al. introduced in [24] to help with the discussion. We let $Q \in \mathbb{C}^{n \times k_{con}}$ be the matrix of converged Schur vectors, $K \in \mathbb{C}^{n \times n}$ is the preconditioner for $A - \tau I$ for some fixed value of $\tau$, and define the


Algorithm 3.1.11: Block Jacobi-Davidson
Input: $A \in \mathbb{C}^{n \times n}$, $U \in \mathbb{C}^{n \times b}$, $k_{max}$, $j_{min}$, and $j_{max}$
Result: $A Q = Q R$ with $Q \in \mathbb{C}^{n \times k_{max}}$ and $R \in \mathbb{C}^{k_{max} \times k_{max}}$

1:  j = 0;
2:  while k_con ≤ k_max do
3:      if j = 0 then
4:          Initialize search subspace v using Algorithm 3.1.12;
5:      else
6:          Solve b correction equations approximately using Algorithm 3.1.13;
7:      Expand search subspace: V = [V, v];
8:      if j > 0 then
9:          [V, temp] = qr(V, 0), and construct compact WY form for V;
10:     Expand interaction matrix: M = V^H A V;
11:     Compute Schur form: M U = U S;
12:     Reorder the Schur form S using Algorithm 3.1.6;
13:     j = j + b; found = 1;
14:     while found do
15:         Compute Ritz vectors: q = V U(:, 1:b);
16:         Precondition the Schur vectors: y = K^{-1} q;
17:         Compute b residual vectors and associated norms:
18:         for i = 1 : b do
19:             r(:, i) = A q(:, i) − S(i, i) q(:, i); nres(i) = ‖r(:, i)‖₂;
20:         if converged then
21:             Deflate and restart using Algorithm 3.1.14;
22:         else
23:             Implicit restart using Algorithm 3.1.14;


following:

$$\tilde Q \equiv [Q, q], \quad \text{the matrix } Q \text{ expanded by the approximate Schur vectors } q;$$
$$\tilde Y \equiv K^{-1} \tilde Q, \quad \text{the matrix of preconditioned Schur vectors};$$
$$\tilde H \equiv \tilde Q^H \tilde Y, \quad \text{the preconditioned projector } \tilde Q^H K^{-1} \tilde Q.$$

We begin with a block $U \in \mathbb{C}^{n \times b}$ and, in the initial sweep, orthogonalize this using Householder. That is, we have

$$[V, R] = \operatorname{qr}(U, 0),$$

so that $V \in \mathbb{C}^{n \times b}$ is such that $V^H V = I_b$. As in jdqr, we also allow for initializing the search subspace using Arnoldi. Convergence may suffer in the single vector case when using Jacobi-Davidson beginning from the initial vector. That is, the correction equation may not immediately provide useful information for building a desirable search subspace. To adjust for this, often some type of subspace expansion is used to generate an initial search subspace that may have better approximations to work with. The first phase of jdqr computes an Arnoldi factorization of size $j_{min}+1$ using the supplied initial vector. Then $j_{min}$ of the Arnoldi vectors are used as the initial subspace $V$ in the initial sweep of the Jacobi-Davidson method. We experimentally verified that this slightly improves the speed of convergence and incorporate this approach into our block algorithm. We note that a nice analysis of the correction equation is provided by Brandts [17]. If desired, we use Algorithm 3.1.3 and the starting block $U$ to build a size $j_{min}$ block Arnoldi decomposition and let $V \in \mathbb{C}^{n \times j_{min}}$ be the first $j_{min}$ basis vectors. If this is not the initial sweep, $V \in \mathbb{C}^{n \times j_{min}}$ is our restarted search subspace and has orthonormal columns. In either case, a compact WY representation of the search subspace $V$ is constructed. The interaction matrix is computed by

$$M = V^H A V,$$


and then the Schur form of $M$ is computed using Matlab's function schur, so that $M U = U S$. Next, the Schur form is reordered using our earlier approach (Algorithm 3.1.6), so that the diagonal entries of $S$ are arranged with those closest to some target in the upper left, and $U$ is updated as well. The first $b$ diagonal elements of $S$ are the Ritz values, with associated Ritz vectors given by

$$q = V U(:, 1{:}b),$$

and are used to compute the $b$ residuals

$$r(:, i) = A q(:, i) - S(i, i)\, q(:, i), \quad i = 1, \ldots, b, \qquad (3.26)$$

along with the corresponding norms of each residual, $\|r(:, i)\|_2$. Next we check for convergence. In jdqr, a Ritz approximation is accepted if the norm of the residual is below a certain user-specified tolerance, with default value 1e-8. We use the same convergence criterion for individual Ritz approximations, but check to see if any of the $b$ approximations has converged. If the Ritz pair satisfies our convergence criterion, then we explicitly deflate the converged eigenvalue and lock the approximation as detailed in Algorithm 3.1.14. If the approximation is not yet satisfactory, we move

Algorithm 3.1.12: Subspace Initialization
Input: $U \in \mathbb{C}^{n \times b}$ and $j_{min}$
Result: Initial search subspace $v$ with compact WY representation

1:  if starting with Arnoldi then
2:      Construct size $j_{min}$ block Arnoldi decomposition:
            A U(:, 1:j_min) = U(:, 1:j_min+b) H(1:j_min+b, 1:j_min)
3:      with compact WY form of U, using Algorithm 3.1.3;
4:      v = U(:, 1:j_min);
5:  else
6:      [v, R] = qr(U, 0), along with compact WY form of v;
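The outer projection steps just described (interaction matrix, target-ordered Ritz extraction, and the residuals (3.26)) can be sketched with NumPy. For illustration we order an eigendecomposition by distance to the target rather than reordering a Schur form as Algorithm 3.1.6 does, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(5)
n, j, tau = 50, 12, 1.0 + 0.0j
A = rng.standard_normal((n, n))
V, _ = np.linalg.qr(rng.standard_normal((n, j)))   # orthonormal search basis

M = V.T @ A @ V                    # j x j interaction matrix M = V^H A V (V real here)
w, U = np.linalg.eig(M)            # illustration: eigen- instead of Schur form
order = np.argsort(np.abs(w - tau))
w, U = w[order], U[:, order]       # Ritz values sorted by distance to the target

theta = w[0]                       # Ritz value closest to tau
q = V @ U[:, 0]                    # corresponding Ritz vector
q /= np.linalg.norm(q)
r = A @ q - theta * q              # residual of the Ritz pair, as in (3.26)
```

By the Galerkin condition, the residual is orthogonal to the search subspace, which is the property the convergence tests exploit.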


Algorithm 3.1.13: Approximate Solution to Correction Equations
1:  Update residuals:
2:      r = K^{-1} r;  r = r − Ỹ H̃^{-1} Q̃^H r;
3:  for i = 1 : b do
4:      Approximately solve the b correction equations of the form
            (I − Ỹ H̃^{-1} Q̃^H) K^{-1} (A − S(i,i) I) K^{-1} (I − Ỹ H̃^{-1} Q̃^H) z(:, i) = −r(:, i),
5:      using GMRES, and orthogonalize against Q̃;

to the inner iteration. Here, $b$ correction equations of the form 3.25 are solved to expand the search subspace. As only an approximate solution to each system is required, iterative methods for linear systems are the natural choice. We opted to use GMRES to solve each of the $b$ correction equations. Rather than use Matlab's function gmres, we opted to implement our own version of GMRES, as we needed more control during the inner solve to compute the components required for our convergence criteria. After computing the $b$ approximate solutions, we find ourselves back at the beginning of a sweep. We then continue the cycle of outer projections and inner linear solves, increasing the dimension of our search subspace a block at a time. If the size of the search subspace has reached the maximum allowed dimension $j_{max}$, we restart as detailed in Algorithm 3.1.14.

One of the central issues in the Jacobi-Davidson approach is the connection between the outer projection and the inner linear system. As mentioned earlier, Hochstenbach and Notay [34] provide an in-depth analysis of how progress in the outer iteration is connected to the residual norm of the correction equation. We present their results for the standard eigenvalue problem to detail the stopping criteria we used in our implementation. Our work uses a subset of the heuristics provided by Hochstenbach and Notay. The correction equation in the standard eigenvalue problem has the form 3.25, with residual vector $r = (A - \theta I) q$ and Ritz value $\theta = q^H A q$.
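Our inner solver is a hand-rolled GMRES for exactly this reason. A stripped-down sketch is below (our own names; no restarts, no stopping tests, no preconditioner, i.e. K = I, and a single right-hand side instead of a block), applied to the projected correction operator:

```python
import numpy as np

def gmres_simple(apply_A, b, m):
    # Plain GMRES: m Arnoldi steps on the operator, then one small
    # least-squares solve for the minimum-residual correction.
    n = b.shape[0]
    beta = np.linalg.norm(b)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = b / beta
    for j in range(m):
        w = apply_A(V[:, j])
        for i in range(j + 1):              # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        V[:, j + 1] = w / H[j + 1, j]
    e1 = np.zeros(m + 1)
    e1[0] = beta
    y = np.linalg.lstsq(H, e1, rcond=None)[0]
    return V[:, :m] @ y

rng = np.random.default_rng(3)
n = 25
A = rng.standard_normal((n, n))
q = rng.standard_normal(n)
q /= np.linalg.norm(q)
theta = q @ A @ q
r = A @ q - theta * q                        # residual, r ⟂ q

# Projected correction operator (no preconditioner):
P = np.eye(n) - np.outer(q, q)
op = lambda v: P @ ((A - theta * np.eye(n)) @ (P @ v))
t = gmres_simple(op, -r, n - 1)              # n-1 steps: exact on the complement of q
```

In the actual implementation, the Arnoldi loop is where the extra control is needed: the intermediate residual norms feed the Hochstenbach-Notay stopping tests discussed next.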


Algorithm 3.1.14: Deflation and Implicit Restart
1:  if converged then
2:      if k_con = 0 then
3:          R = S(1, 1);
4:      else
5:          R = [ triu(R), Q(:, 1:k_con)^H r_save(:, 1); zeros(1, k_con), S(1, 1) ];
6:      k_con = k_con + 1;
7:      Y = Ỹ(:, 1:k_con);
8:      H = H̃(1:k_con, 1:k_con);
9:      V = V U(:, 2:j);
10:     S = S(2:j, 2:j);
11:     M = S;
12:     U = eye(j − 1);
13:     j = j − 1;
14: else
15:     Implicit restart;
16:     j = j_min;
17:     V = V U(:, 1:j_min);
18:     S = S(1:j_min, 1:j_min);
19:     M = S;
20:     U = eye(j_min);

Let the inner residual of the linear system be given by

$$r_{in} = -r - (I - q q^H)(A - \theta I)\, v. \qquad (3.27)$$

Then

$$r_{eig} = \min_{\tilde\theta} \frac{\|(A - \tilde\theta I)(q + v)\|}{\|q + v\|} \qquad (3.28)$$


satisfies bounds, derived in [34], of the form

$$\frac{|g - s\,\epsilon|}{1+s^2} \;\lesssim\; r_{eig} \;\lesssim\; \frac{\sqrt{g^2+\epsilon^2}}{\sqrt{1+s^2}}, \qquad (3.29)$$

where $g_k = \|r_{in}\|$ at inner step $k$, the precise case distinctions depend on the thresholds $\delta_1 = 10^{-1/2}$, $\delta_2$, and $\delta_3 = 15$, on conditions such as $g_k \le \delta_1 \|r\|$, and on the stagnation ratios $g_k / g_{k-1}$; a norm-minimizing method such as GMRES is assumed, and $\epsilon$ is the desired value of the residual norm. For the outer eigenvalue estimate, they show that one may use the estimate

$$r_{eig} \approx \sqrt{ \left( \frac{\|r_{in}\|}{\sqrt{1+s^2}} \right)^2 + \left( \frac{s\,\epsilon}{1+s^2} \right)^2 }. \qquad (3.30)$$

As we use GMRES for the solution of the correction equations, we only report the results pertaining to the use of GMRES in the context of the standard eigenvalue problem. Extensive details on how to proceed when using alternatives to GMRES may be found in [34]. Their approach computes these values twice during the solution of the inner system, first when $\|r_{in}\| \le \delta_1 \|r\|$ and then when $\|r_{in}\| \le \delta_2 \|r\|$, where $\delta_2 = 10^{-1}$. Their heuristic choices of thresholds $\delta_1$, $\delta_2$, and $\delta_3$ were validated with a selection of test problems, but they suggest experimenting with other values and also suggest the last criterion may be optional. For the standard eigenvalue problem with GMRES used for the inner solve, $r_{eig}$ is about $\frac{\|r_{in}\|}{\sqrt{1+s^2}}$ until $r_{eig}$ reaches its asymptotic value $\frac{s\,\epsilon}{1+s^2}$, after which further reduction of $\|r_{in}\|$ is useless.


The main numerical result reported for the standard NEP showed that the number of matrix-vector products remained about the same when using jdqr with the new stopping criteria versus the default settings, but the revised stopping criteria increased the number of inner iterations, which are less costly than the outer iterations. Hochstenbach and Notay concluded that jdqr would perform better with their stopping criteria. We will explore the behavior of different stopping criteria in our numerical experiments. Dedicating the appropriate amount of work to solving the inner linear system is increasingly important in the case of multiple correction equations. The stopping criteria in our block Jacobi-Davidson approach consist of the first two suggestions, with the threshold parameters mentioned above and parameters consistent with using GMRES for the linear solve. The situation becomes more complicated when solving $b$ correction equations, and we do not claim to have the optimal criteria. A detailed analysis is the subject of future work.

Before we present numerical results, we pause to summarize the methods and associated software that will be used in our comparison. Table 3.1 lists all the approaches used in the ensuing section. Each of the approaches listed in Table 3.1 has several parameters the user can control. Matlab's eigs allows one to set the initial vector, the tolerance for the convergence criteria, the number of vectors in the search subspace, the number of desired eigenvectors, and the targeted subset of the spectrum. Baglama's ahbeigs uses many of the same parameters, but has a few more options. The user may set the block size and the number of blocks. One of the parameters unique to ahbeigs is the adjust parameter, which adds additional vectors to the restart vectors after Schur vectors converge. This is intended to help speed up convergence. Sleijpen's jdqr has several input parameters the user may set to control the calculation. One may set the tolerance for the stopping criteria, the minimum and maximum dimensions of the search subspace, the initial vector, the type of linear solver for the correction equation, the tolerance for the residual reduction of the linear solver, the maximum number of iterations for the linear solver, and whether or not to use a supplied preconditioner.


Table 3.1: Software for comparison of iterative methods

Name      Source               Description                          Blocked
eigs      Matlab               Built-in function based on ARPACK    No
ahbeigs   Baglama's website    Matlab implementation of ABHA        Yes
jdqr      Sleijpen's website   Matlab implementation of JDQR        No
bA        Our home-grown code  Explicitly restarted Arnoldi         Yes
bKS       Our home-grown code  Block Krylov-Schur                   Yes
bjdqr     Our home-grown code  Block extension of JDQR              Yes
Jia       Not available        Block Arnoldi                        Yes
bIRAM     Not available        bIRAM by Lehoucq and Maschhoff       Yes
MollerL   Not available        bIRAM by Moller                      Yes
MollerS   Not available        bIRAM by Moller                      Yes

Our implementation bA uses parameters such as the dimension of the search subspace, a target for the desired subset of the spectrum, a tolerance for the convergence criteria, and the number of desired eigenvalues. Our implementation bKS has these same input parameters, but requires both the dimension of the contracted search subspace and the dimension of the expanded search subspace. Our implementation bjdqr is a block extension of jdqr, but only uses GMRES as the linear solver. The parameters include the number of desired eigenvalues, the starting block of vectors, the minimum and maximum dimensions of the search subspace, and a tolerance for the stopping criteria.

3.2 Numerical Results

In this section we present numerical experiments to assess the performance of our block codes. We compare our blocked implementations to unblocked versions and to


Table 3.2: Ten eigenvalues of CK656 with largest real part

λ1, λ2   = 5.5024, 5.5024
λ3, λ4   = 1.5940, 1.5940
λ5, λ6   = 1.4190, 1.4190
λ7, λ8   = 1.4120, 1.4120
λ9, λ10  = 1.1980, 1.1980

alternate approaches that are publicly available. The purpose of these experiments is to get a good impression of how these methods actually perform in the context of the NEP. We hope to explore why block methods may be an attractive option, whether these block methods can handle difficult computations, and further understand the performance of the chosen methods. We will study how block size affects convergence and explore reasonable conditions for the underlying search subspaces. Each method from Section 3.1 has specific parameters that may affect performance, and we endeavor to understand as much as possible. All the ensuing Matlab comparisons are performed on a Mac Pro with dual 2.4 GHz Quad-Core Intel Xeon CPUs and 8 GB RAM, running Mac OS X Version 10.8.4 with Matlab R2013a.

We begin with assessing one of the theoretical advantages of block methods, the computation of clustered or multiple eigenvalues. To explore this, we selected a suitable matrix from the NEP Collection in Matrix Market. We chose CK656, which is a 656 × 656 real, nonsymmetric matrix with eigenvalues that occur in clusters of order 4, with each cluster consisting of two pairs of very nearly multiple eigenvalues. There is no information on the application of origin. For each approach, we attempt to compute the ten eigenvalues with largest real part, which are given in Table 3.2. We fix the number of vectors in the search subspace and vary the block size $b$ and the number of blocks $n_b$ accordingly for the blocked versions. This makes comparisons


between our block Arnoldi method and our block Krylov-Schur method relatively easy, as both approaches solve the same size projected eigenvalue problem at each iteration. Comparisons to Jacobi-Davidson based approaches are more difficult, as the size of the projected eigenvalue problem grows at each step in the outer iteration, and there is the inner iteration to consider as well. It is important to note that an outer iteration of Jacobi-Davidson requires more overall work than an inner iteration, and that both inner iterations and outer iterations involve multiplications by the full matrix. To understand the performance of our Jacobi-Davidson based approaches, we will conduct experiments with various dimensions of the search subspace. In Table 3.3, we report the block size, the number of blocks in the search subspaces, the number of iterations (both inner and outer for our block Jacobi-Davidson), the total number of matrix-vector products (MVPs), the total number of block matrix-vector products (BMVPs), and the relative error of the resulting partial Schur decompositions. We do not report the level of orthogonality, as most methods are based on Householder and loss of orthogonality was not observed to be an issue.

The initial 656 × 4 block of vectors was generated by Matlab's randn function with state 7, with the appropriate number of columns of vectors used in each case presented in Table 3.3. The tolerance, the value of tol supplied by the user for all methods, was 1e-12. This was to ensure that all methods use the same input parameters, but we note that different stopping criteria are used for different implementations. As detailed in Chapter 4, the stopping criterion used in bKS is based on Schur vectors rather than Ritz vectors, as is done in ARPACK. This is one of the challenges of comparing algorithms using software. Details of the role of tol may be found in each section discussing our implementations, in the user guide for ARPACK [47], and in the documentation for ahbeigs and jdqr.

We attempted several experiments with higher tolerances, e.g., 1e-6 and 1e-9, but eigs and unblocked bA occasionally missed a desired eigenvalue. It is worth noting


that the block methods did not experience the same difficulty. For all methods the search subspace was fixed at 20 vectors, with the exception of a few additional computations for Jacobi-Davidson based methods in which we extended this to 40 vectors. We set the parameters in each method accordingly with a few exceptions. For ahbeigs, we did not use the default value for the parameter adjust. This parameter adjusts the number of vectors added when vectors start to converge to help speed up convergence. The default value is adjust = 3, and the sum of the number of desired eigenvalues and the parameter adjust should be a multiple of the block size. We attempted this experiment with adjust = 0, and then again with the default setting. The performance was noticeably different and we will elaborate on this momentarily. We also note that the documentation of ahbeigs recommends ten or more blocks for the approach to converge and compute all desired eigenvalues. The size of the truncated subspace in the block Krylov-Schur approaches was fixed at 8. We experimented a bit with some of the options for jdqr, but ended up using the default values of most parameters. These include the use of GMRES with a tolerance of 0.7^j for the residual reduction and a maximum of five iterations for the linear solve. We did not use a preconditioner for A − θI in the correction equation, and we point out that the Jacobi-Davidson based methods benefit greatly when a good preconditioner is available.

As displayed in Table 3.3, every approach successfully found the desired eigenvalues in this experiment. The first three results from ahbeigs forced the method to use no additional vectors. This did not seem to affect the performance for computations using 10 or more blocks, but the performance was dramatically different for blocks of size four. Over 6,000 MVPs were required compared to only 320 when additional vectors were allowed, but we were using only half of the recommended number of blocks. The worst performance in terms of total number of MVPs was observed by bA. This was not entirely unexpected. In the unblocked case, comparing bA to


Table 3.3: Computing 10 eigenvalues for CK656

Method    b   n_b     Iterations   MVPs (BMVPs)   ||AZ − ZS||_2 / ||A||_2
eigs      1   20      -            111            9.716e-13
ahbeigs   1   20      -            117            9.262e-13
ahbeigs   2   10      -            144            6.959e-13
ahbeigs   4   5       -            6476           1.005e-12
ahbeigs   1   20      -            107            9.262e-13
ahbeigs   2   10      -            130            6.959e-13
ahbeigs   4   5       -            320            1.005e-12
bA        1   20      21           430            4.372e-13
bA        2   10      29           600            1.212e-13
bA        4   5       40           840            1.774e-13
bKS       1   8,20    11           140            5.606e-14
bKS       2   4,10    12           152            1.858e-13
bKS       4   2,5     23           284            1.012e-13
bjdqr     1   8,20    70,275       354            1.888e-13
bjdqr     2   4,10    39,299       385            1.598e-13
bjdqr     4   2,5     25,376       484            1.332e-13
bjdqr     1   16,40   60,231       307            1.614e-13
bjdqr     2   8,20    31,240       318            1.488e-13
bjdqr     4   4,10    21,317       417            1.339e-13
jdqr      1   8,20    98,-         351            1.025e-13
jdqr      1   16,40   82,-         292            1.371e-13


eigs is comparing Arnoldi with explicit restart to IRAM. It was expected that IRAM would perform better as it employs a much better restart strategy. Both ahbeigs and our bKS required comparable total numbers of MVPs for the runs in which additional vectors were allowed for ahbeigs. In terms of iterations, bKS required nearly the same for blocks of size one and two and performed nearly the same number of total MVPs.

The story for the Jacobi-Davidson based methods is harder to tell. In Table 3.3, outer and inner iterations are reported for bjdqr, but only outer iterations are available for jdqr. The default stopping criterion for the correction equation in jdqr was used. Our unblocked bjdqr performed about the same number of matrix-vector products as jdqr, but invested more in the solution of the correction equation. This was somewhat anticipated as Hochstenbach and Notay [34] experienced the same result when using this stopping criterion and jdqr. This result also seems to suggest that the work invested in a more refined stopping criterion may be worthwhile, as fewer outer iterations will result in better overall performance when the inner iterations require less work. We verified the number of MVPs for jdqr by tracking the number of times the function providing the matrix was accessed, as we did for eigs and ahbeigs. Overall, the total number of MVPs remained relatively consistent for computations with our bjdqr, though ahbeigs and bKS required fewer on average. The Jacobi-Davidson based approaches seem to benefit more from a larger search subspace, specifically in the case where more vectors are used for the implicit restart. We will further explore the role of the dimension of the search subspace in additional numerical experiments.

Next we repeat the same experiment but increase the dimension of the search subspace. We set the number of vectors to be 48 with 20 used in the contracted subspace for bKS. For the Jacobi-Davidson based approaches, we set j_min = 36 and j_max = 60. The results are reported in Table 3.4. Increasing the subspace had some very interesting results. For our bA approach with block size b = 4,


Table 3.4: Computing 10 eigenvalues for CK656, expanded search subspace

Method    b   n_b     Iterations   MVPs (BMVPs)   ||AZ − ZS||_2 / ||A||_2
eigs      1   48      -            112            9.716e-13
ahbeigs   1   48      -            118            2.024e-14
ahbeigs   2   24      -            148            1.302e-14
ahbeigs   4   12      -            252            1.754e-14
ahbeigs   1   48      -            116            2.370e-14
ahbeigs   2   24      -            144            1.373e-14
ahbeigs   4   12      -            200            3.976e-13
bA        1   48      11           538            3.928e-15
bA        4   12      23           332            5.796e-14
bKS       1   20,48   10           300            5.309e-15
bKS       2   10,24   10           300            3.037e-15
bKS       4   5,12    21           328            3.705e-15
bjdqr     1   36,60   40,156       232            1.560e-14
bjdqr     2   18,30   23,176       258            1.170e-13
bjdqr     4   9,15    18,272       380            1.586e-13
bjdqr     1   72,120  18,68        158            3.027e-14
bjdqr     2   36,60   9,64         154            2.968e-14
bjdqr     4   9,15    12,176       296            1.345e-13
jdqr      1   36,60   53,-         232            2.570e-14
jdqr      1   72,120  14,-         129            3.798e-14

the required MVPs decreased dramatically from 840 to 332. The performance of ahbeigs also improved with the increased search subspace. As the number of blocks


was consistently larger than the recommended 10, the difference between the runs with and without the additional vectors was less significant. The additional vectors only seemed to be needed with block size b = 4, where the total number of MVPs was reduced by about 20%. The benefit of blocks was most evident in bjdqr as, with the increased search subspace, slightly fewer total MVPs were required when comparing blocks of size one and blocks of size two. One of the most interesting results illustrated in these experiments is that bKS requires fewer total MVPs with a smaller search subspace. When restricting the search subspace to 20 vectors, bKS required half as many MVPs as it did for blocks of size 1 and 2. For blocks of size 4, bKS required about 12% fewer flops when working with a smaller search subspace. We note that the difference between the size of the expanded search subspace and the contracted search subspace increased from 12 to 28 vectors. The increased number of total MVPs required for bKS here seems to be tied to this increase. To verify this, we repeated this experiment just for bKS with the larger search subspace and only 12 additional vectors in the expansion phase, that is, we set k_s = 36 rather than k_s = 20. The results are presented in Table 3.5.

Table 3.5: Computing 10 eigenvalues for CK656, k_s = 36

Method   b   n_b     Iterations   MVPs (BMVPs)   ||AZ − ZS||_2 / ||A||_2
bKS      1   36,48   10           156            4.105e-15
bKS      2   18,24   10           156            4.490e-15
bKS      4   9,12    13           192            3.061e-12

The results in Table 3.5 demonstrate that bKS performs similarly to the experiment with only 20 vectors in the search subspace. This seems to indicate that the difference between the dimension of the expanded search subspace and the dimension of the contracted search subspace is an important part of the overall performance of bKS. If the difference is too large, additional and


Figure 3.3: Complete Spectrum of TOLS2000

seemingly unnecessary work is performed. This is in contrast to ahbeigs and the Jacobi-Davidson based approaches, which all perform better on average with a larger search subspace.

In our next experiment, we examine a matrix with difficult to compute eigenvalues. Here we hope to assess whether or not our implementations satisfy our hope for robustness. This example comes from the NEP Collection in Matrix Market as well. The matrix TOLS2000 is a Tolosa matrix from aerodynamics, related to the stability analysis of a model of a plane in flight. The eigenmodes of interest are complex eigenvalues with imaginary part in a certain frequency range determined by engineers. Figure 3.3 shows the complete spectrum. This computation aims to compute eigenvalues with largest imaginary part and the associated eigenvectors. The matrix is sparse and highly nonnormal, making it potentially difficult to compute a few eigenpairs. Jia [38] computed the three eigenvalues with largest imaginary part


to be

    λ1 = −730.68859 + 2330.11977i,
    λ2 = −726.98657 + 2324.99172i,
    λ3 = −723.29395 + 2319.85901i,

using a block Arnoldi approach with refined approximate eigenvectors, and we will make indirect comparisons to his results as there is no publicly available code. We can make such a comparison thanks to efforts such as Matrix Market. Though the codes may not be available, the matrices used are available and we can compare to some extent to previous results. Jia observed the results in Table 3.6 with his proposed method.

Table 3.6: Summary of results presented by Jia [38]

b   n_b   Iterations   MVPs   ||AZ − ZS||_2
2   25    67           3350   7.9e-7
2   30    33           1980   8.1e-7
2   35    26           1820   7.2e-7
2   40    32           2560   9.1e-7
2   50    11           1100   1.9e-7
3   20    88           5280   6.0e-7
3   30    20           1800   8.4e-7
3   40    8            960    6.3e-7

Jia's results show that his method benefited from a larger search subspace, as the number of MVPs decreased when the number of blocks in the search subspace increased. This is similar to the behavior of ahbeigs in previous experiments. Here, we set the tolerance to 1e-9 to compare with the results by Jia and also by Baglama [7].
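The accuracy figures reported in these tables are relative residuals of the computed partial Schur factorization AZ ≈ ZS. A minimal NumPy sketch of that measure, ||AZ − ZS||_2 / ||A||_2, follows; the 3 × 3 upper triangular matrix is a made-up example, not data from the thesis, chosen because the leading columns of the identity and the leading block of A then form an exact partial Schur pair:

```python
import numpy as np

# Hedged sketch: relative residual of a partial Schur factorization A Z ~ Z S.
# For an upper triangular A, the leading k columns of the identity are exact
# Schur vectors, so the residual is zero up to rounding.
A = np.array([[2.0, 1.0, 3.0],
              [0.0, 4.0, 1.0],
              [0.0, 0.0, 5.0]])
k = 2
Z = np.eye(3)[:, :k]      # orthonormal (partial) Schur basis
S = A[:k, :k]             # leading k-by-k block of the Schur form

rel_res = np.linalg.norm(A @ Z - Z @ S, 2) / np.linalg.norm(A, 2)
```

For an approximate pair produced by an iterative method, the same two lines give the quantity tabulated in the last column of the tables above.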


Again we experiment with allowing ahbeigs no additional vectors and the default setting. The initial block was generated using Matlab's randn with state 7 as before, with the appropriate number of columns used in each computation. Jia examined blocks of size two with 25 to 50 blocks and blocks of size three with 20 to 40 blocks, and we selected various combinations to allow for a meaningful comparison. As was the case for the experiments performed by Jia, our block Arnoldi, bA, failed to converge for various block sizes with several different dimensions of the search subspace. Baglama found that jdqr failed to converge as well, and we observed this for both jdqr and our bjdqr version with refined stopping criteria when working with blocks of size one. Even with blocks of various sizes, our bjdqr failed to converge. This could be due to the difficult nature of the problem and partly due to not using a preconditioner.

Both eigs and ahbeigs have options for computing the eigenvalues with largest imaginary part by setting the input option SIGMA='LI'. When attempting to locate the desired eigenvalues using the appropriate input string, both eigs and ahbeigs returned the three eigenvalues with largest imaginary part in magnitude. That is, both routines returned some combination of the complex conjugates

    λ1 = −730.68859 + 2330.11977i,
    λ2 = −730.68859 − 2330.11977i,
    λ3 = −726.98660 ± 2324.99171i,

rather than the approximations offered by Jia. This could be an issue specific to Matlab, as the documentation on eigs does not include how the interface to ARPACK is achieved. Matlab's eigs can take a real or complex value as input for SIGMA, but ahbeigs only has the option of a real number or the appropriate string to designate the location, and it works in only real arithmetic. We attempted to use various targets for eigs with no success. As Baglama [7] did not report the actual eigenvalues


computed, we report only the results we were able to generate using ahbeigs. In Table 3.7 we report the same information as before, adding the number of successfully computed eigenvalues based on the ones reported by Jia, giving credit for complex conjugates.

The only approach to successfully compute all three desired eigenvalues was bKS. When compared to the results in Table 3.6, we see that bKS performed significantly fewer MVPs while using fewer vectors in the search subspace. For example, when using 10 blocks of size three, bKS required 750 MVPs compared to Jia's approach that required 5280 MVPs when 20 blocks of size three were used. Increasing the number of blocks to 40 blocks brought the number of total MVPs required by Jia's approach much closer, down to 960, but this required four times the number of vectors in the search subspace.

Again we explored the effect of additional vectors for ahbeigs. In Table 3.7 there are select duplicate runs for ahbeigs. The first used no additional vectors and the second used the default value. Forcing the method to not use any additional vectors resulted in rather erratic behavior. For blocks of size two, additional vectors helped when the search subspace consisted of 15 blocks, but performance suffered when using 30 blocks. Due to this, all ensuing runs were performed with the default setting. In all the results by ahbeigs, the best performance in terms of MVPs occurred when using 15 blocks of size two and additional vectors.

The results for bKS are mainly what one would expect. First, the configuration that required the least amount of total MVPs was unblocked bKS. This was followed closely by bKS using 15 blocks of size five and then 10 blocks of size three. Two different configurations of bKS outperformed eigs and a third required approximately the same number of total MVPs. Again, more MVPs were required for configurations of bKS when the difference between the expanded and contracted subspace was larger. The expanded search subspace needed to be large enough, but making it too


Table 3.7: Computing three eigenvalues with largest imaginary part for TOLS2000

Method    b    n_b     Iterations   k_con/k_d   MVPs (BMVPs)   ||AZ − ZS||_2
eigs      1    30      -            2/3         746            8.305e-7
ahbeigs   1    30      -            2/3         1806           6.291e-7
ahbeigs   1    30      -            2/3         1580           1.107e-6
ahbeigs   2    15      -            2/3         1424           1.563e-6
ahbeigs   2    15      -            2/3         942            1.990e-6
ahbeigs   2    30      -            2/3         1740           1.604e-6
ahbeigs   2    30      -            2/3         2166           3.518e-8
ahbeigs   3    10      -            2/3         2022           1.138e-6
ahbeigs   3    30      -            2/3         1350           1.969e-6
ahbeigs   5    6       -            2/3         3605           2.345e-6
ahbeigs   5    6       -            2/3         4090           2.251e-6
ahbeigs   5    10      -            2/3         1810           2.188e-6
ahbeigs   5    15      -            2/3         1960           2.020e-6
bKS       1    12,30   27           3/3         498            4.541e-9
bKS       2    6,15    64           3/3         1164           9.136e-9
bKS       3    4,10    41           3/3         750            1.497e-6
bKS       2    12,30   32           3/3         1176           1.497e-6
bKS       3    8,20    33           3/3         1212           7.266e-9
bKS       5    4,10    36           3/3         1100           4.348e-9
bKS       5    5,10    38           3/3         975            1.007e-6
bKS       5    10,15   23           3/3         625            9.105e-7
bKS       5    5,15    70           3/3         3525           4.768e-9
bKS       10   2,7     41           3/3         2070           1.231e-7
bKS       10   5,10    28           3/3         1450           2.113e-6


large relative to the contracted subspace was not necessarily beneficial. The best performance was observed when a modest number of blocks were used to expand, specifically six blocks of size three and five blocks of size five. Overall, bKS seems flexible enough to work with a variety of configurations. We will further explore the relationships between block size, number of blocks or dimension of search subspace, required iterations, and required matrix-vector products.

It is worth noting that for this numerical experiment we needed to deflate only one eigenvalue at a time in our successful approaches. Initial runs showed detection of more than one eigenvalue at a time, but the accuracy suffered. For most of the multiple deflations we observed representativity of the partial Schur form ||AZ − ZS|| ≈ O(10^−7). Our bKS typically looks for multiple deflations, but with the conditioning of this problem we needed to be a bit less ambitious to preserve accuracy. This experiment shows that our bKS approach performs well even with a challenging computation.

Next we consider a matrix used in experiments by Moller [49], by Lehoucq and Maschhoff [46], and by Baglama [7]. The purpose of this experiment is to make indirect comparisons to versions of bIRAM presented separately by Moller and by Lehoucq and Maschhoff, as there are no publicly available codes. As Baglama [7] did, we refer to the two methods presented by Moller [49] as MollerS and MollerL, and to the work by Lehoucq and Maschhoff as bIRAM. The matrix under consideration is HOR131 from the Harwell-Boeing Collection available on Matrix Market. The matrix is a 434 × 434 nonsymmetric matrix and comes from a flow network problem. We desire to compute the 8 eigenvalues with largest real part. We set the number of stored vectors to be 24, set the tolerance to be 1e-12, and generate the same initial starting block as we have in previous experiments. We opted to use the default for adjust in ahbeigs. We also verified the accuracy of the computed eigenvalues by comparing to those computed by Matlab's eig, but we do not report those


details as all eigenvalues were computed within the desired tolerance. In Table 3.8 we report the results for indirect comparison.

Table 3.8: Summary of results for HOR131

Method    b   n_b   MVPs
bIRAM     1   24    77
bIRAM     2   12    84
bIRAM     3   8     99
bIRAM     4   6     108
MollerS   1   24    88
MollerS   2   12    136
MollerS   4   6     264
MollerL   1   24    79
MollerL   2   12    93
MollerL   4   6     105

In Table 3.9 we report the results of our computations. In Table 3.10 we present the results of our computations for Jacobi-Davidson based approaches. There are several things to note among the results for this experiment. First, bA is again the worst performer in terms of MVPs, and again this was somewhat expected. Both ahbeigs and bKS seem to be competitive with bIRAM, MollerS, and MollerL. The total number of MVPs required is similar, and in our numerical experiments we have seen that different initial vectors can account for enough MVPs to consider these approaches as competitive. It is especially interesting to note that all the methods represented in Table 3.8 require more MVPs as the block size increases. This seems to be the case for ahbeigs, bA, and bjdqr. The lone exception is bKS. As demonstrated


Table 3.9: Computing 8 eigenvalues with largest real part for HOR131, Krylov

Method    b   n_b     Iterations   MVPs (BMVPs)   ||AZ − ZS||_2
eigs      1   24      -            83             2.958e-15
ahbeigs   1   24      -            83             2.096e-15
ahbeigs   2   12      -            140            2.295e-14
ahbeigs   3   8       -            180            2.100e-13
ahbeigs   4   6       -            316            1.787e-13
bA        1   24      17           416            1.627e-13
bA        2   12      20           496            6.815e-13
bA        3   8       24           600            5.410e-13
bA        4   6       25           632            7.866e-13
bKS       1   16,24   10           96             1.301e-14
bKS       1   18,24   11           84             5.991e-14
bKS       2   6,12    11           144            2.172e-14
bKS       2   8,12    12           112            6.173e-13
bKS       2   10,12   16           84             3.430e-13
bKS       3   4,8     13           168            1.396e-13
bKS       3   5,8     14           141            1.396e-13
bKS       3   6,8     13           114            1.396e-13
bKS       4   2,6     16           264            6.044e-13
bKS       4   3,6     14           180            2.175e-13
bKS       4   4,6     19           168            2.738e-13

in Table 3.9, as we increase the dimension of the contracted subspace, the number of total MVPs decreases consistently for all block sizes reported in Table 3.9. This


Table 3.10: Computing 8 eigenvalues with largest real part for HOR131, JD

Method   b   n_b     Iterations   MVPs (BMVPs)   ||AZ − ZS||_2
bjdqr    1   18,30   37,144       199            1.030e-12
bjdqr    2   9,15    24,184       250            4.382e-13
bjdqr    3   6,10    18,204       276            4.052e-13
bjdqr    4   4,32    15,224       300            1.775e-12
bjdqr    1   24,48   26,100       150            1.002e-12
bjdqr    2   12,24   19,144       206            1.136e-12
bjdqr    3   8,16    16,180       252            1.051e-12
bjdqr    4   6,12    12,176       248            1.471e-12
jdqr     1   18,30   65,-         182            8.546e-15
jdqr     1   36,60   14,-         67             1.954e-13

suggests that by finding the optimal configuration for the search subspace, bKS can be a very competitive approach.

The Jacobi-Davidson results in Table 3.10 tell much the same story as before for these types of methods. They all seem to require a bit more MVPs in general, but benefit from a larger search subspace. Increasing j_min and j_max makes bjdqr competitive but requires more vectors in the search subspace, which then requires the solution of a larger eigenvalue problem than those methods not based on Jacobi-Davidson. Additional storage is also required.

Thus far, the performance of our bKS approach has been the most consistent among the iterative approaches discussed in this section. It has handled difficult computations and done so in a robust and efficient manner. To get a better understanding of how block size and dimension of the search subspace affect the performance, we


embark on some additional numerical experiments. The experiments performed to this point have focused on sparse matrices, from real applications, that others have used to assess performance of approaches to the NEP. We now consider dense random matrices. We begin with a comparison similar to the previous numerical experiments. We seek to compute the five eigenvalues with smallest real part for a 2,500 × 2,500 real matrix generated using Matlab's randn with initial state 4. This populates the matrix with random numbers drawn from the standard normal distribution. We used the same initial starting block as in previous computations and again set the tolerance to 1e-12. The only other parameter we set is fixing 100 vectors in the search subspaces. For ahbeigs we use default values for the remaining parameters. We set the contracted search subspace dimension to k_s = 72 and adjusted the dimension of the expanded, k_f, as close to 100 as possible using multiples of the block sizes. The results of our experiment are presented in Table 3.11. We observed that bA failed to converge for various configurations, and jdqr had difficulties as well. We had to set the tolerance to 1e-11 to generate the results in Table 3.11 as it failed to converge with the tolerance set at 1e-12. The results in Table 3.11 show that bKS performs better than Matlab's eigs and Baglama's ahbeigs for blocks up to size three. Larger blocks increase the total number of required MVPs, but not on the same scale as what is required by ahbeigs. Our Jacobi-Davidson implementation was able to locate all five desired eigenvalues, where jdqr needed to relax the tolerance. This is most likely due to the difference in stopping criteria. As we have seen in previous experiments, we may be able to adjust the values of k_s and k_f to reduce the number of MVPs required and increase overall performance.

So far, bKS has performed well in difficult eigenvalue computations involving sparse matrices and demonstrated that it is an attractive option for computing a few eigenvalues of dense random matrices. We now turn our focus to only bKS as we attempt to understand how it behaves for dense random matrices. For the ensuing


Table 3.11: Computing 5 eigenvalues with smallest real part for random matrix

Method    b   n_b      Iterations   MVPs (BMVPs)   ||AZ − ZS||_2
eigs      1   100      -            1687           5.987e-13
ahbeigs   1   100      -            1588           3.632e-13
ahbeigs   2   50       -            3502           5.400e-13
ahbeigs   3   33       -            4869           3.572e-13
ahbeigs   4   25       -            11188          4.899e-13
ahbeigs   5   20       -            9010           4.385e-13
ahbeigs   6   17       -            12522          5.451e-13
bKS       1   72,100   22           688            4.488e-13
bKS       2   36,50    33           996            4.125e-13
bKS       3   24,33    48           1368           3.204e-13
bKS       4   18,25    65           1892           4.488e-13
bKS       6   12,17    92           2832           5.119e-13
bjdqr     1   84,108   371,1463     1918           1.047e-14
bjdqr     2   42,54    238,1835     2395           9.896e-15
bjdqr     3   38,36    175,1988     2597           1.045e-14
bjdqr     4   21,27    156,2350     3058           1.471e-12
jdqr      1   84,108   266,-        1631           3.858e-12


Figure 3.4: MVPs versus dimension of search subspace

experiment, we explore the effect of varying the dimension of the search subspace. Figure 3.4 displays the total number of required MVPs versus the dimension of the search subspace. Here we increased the search subspace from k_s = 20 and k_f = 40 to k_s = 80 and k_f = 100, adding 12 vectors each time. Initially we observed a sharp decrease in the total number of MVPs that leveled out. This suggests that care must be taken to make the search subspace large enough but not necessarily too large. The situation for the required number of iterations is nearly identical and is depicted in Figure 3.5. In Figure 3.6 we present the number of MVPs required to compute an increasing number of desired eigenvalues. Here we use a random 2,500 × 2,500 complex matrix generated by Matlab, again using state 4. The initial block was constructed in the same manner as before and the block size was fixed at 5. We computed up to 100 eigenvalues and noted the number of required MVPs. Here we see that the required work increases but the curve is slightly concave down.

Finally, we examine the relationship between the number of iterations and the problem size. We fix the block size at b = 5, the search subspace at k_s = 40 and


Figure 3.5: Iterations versus dimension of search subspace for block Krylov-Schur method.

Figure 3.6: MVPs versus number of desired eigenvalues for block Krylov-Schur method.


Figure 3.7: Iterations versus size of matrix for block Krylov-Schur with b = 5, k_s = 40 and k_f = 75.

k_f = 75, and now vary the problem size from a random 250 × 250 complex matrix to a random 5,000 × 5,000 matrix, all constructed using Matlab's randn with state 4. The iterations seem to grow linearly with the size of the matrix, but we point out that this is for a fixed search subspace configuration. Previous numerical experiments suggest varying k_s, k_f, and the block size can dramatically affect performance.

3.3 Conclusion and Future Work

In this chapter, we have analyzed the performance of several iterative approaches to the NEP, including some novel formulations we implemented in Matlab. Our numerical experiments indicate that our novel block Krylov-Schur with Householder orthogonalization compares well with current standards among iterative methods in the case of sparse nonsymmetric matrices. We were able to find configurations of the parameters for bKS that showed it to be competitive with IRAM in the Matlab environment. Specifically, we found that we could adjust the block size, the dimension of the contracted search subspace, and the dimension of the expanded search subspace so that our block method performed approximately the same number of MVPs as the state-of-the-art serial implementation of IRAM provided by ARPACK. The


implementation bKS was shown to be robust and handled a variety of computations, including multiple and clustered eigenvalues, and challenging calculations involving highly nonnormal matrices. Additionally, our approach is able to compute any size partial Schur decomposition.


4. Block Krylov-Schur with Householder

In this chapter, we present our current work on accelerating algorithms designed for the NEP in the context of HPC. We build off the approach from Chapter 3, where we presented a Matlab implementation of our block version of Krylov-Schur, and extend our work to a LAPACK implementation. Here we consider the careful numerical implementation of an extension of Stewart's Krylov-Schur method [64]. We begin with a brief introduction of Krylov decompositions and the essential components of the Krylov-Schur process. Then we outline our block approach and provide a detailed pseudocode of our implementation. Finally, we compare our approach to existing LAPACK routines.

Our implementation is able to compute all n eigenvalues of an n × n matrix. Most implementations of iterative eigensolvers are not able to do so; they compute only partial Schur factorizations. Figure 4.1 shows what happens with Matlab, for example, when one attempts to compute all n eigenvalues using the function eigs. Indeed, some clean-up codes and some mathematical developments are needed in order to handle the last part of the computation. As far as we know, this difficulty is not handled by any iterative eigensolvers. Note that, in general, there is no need for such functionality for very large matrices, the realm of standard iterative eigensolvers, where it would be unreasonable to compute all the eigenvalues of the given matrix. It is however a desirable functionality for our method, so we worked out the details to enable this feature.

Figure 4.1: Typically, implementations of iterative eigensolvers are not able to compute n eigenvalues for an n × n matrix. Here is an example using Matlab's eigs, which is based on ARPACK.


4.1 The Krylov-Schur Algorithm

Stewart [64] first suggested the Krylov-Schur algorithm as a computationally attractive alternative to Sorensen's IRAM. The two main issues with the IRAM approach, as detailed in Chapter 3, are the need to preserve the structure of the Arnoldi decomposition given in (3.1) and the complexities associated with deflating converged Ritz vectors. Stewart introduced a general Krylov decomposition to address both of these issues. If A ∈ C^{n×n}, a Krylov decomposition of order m is given by

    A V_m = V_m B_m + v_{m+1} b^H_{m+1}    (4.1)

where B_m ∈ C^{m×m} and the columns of V_{m+1} = [V_m, v_{m+1}] ∈ C^{n×(m+1)} are independent. The columns of V_{m+1} are called the basis for the decomposition and span the associated space of the decomposition. If the basis is orthonormal, the decomposition is said to be orthonormal. The factor B_m is called the Rayleigh quotient of the Krylov decomposition as the Rayleigh-Ritz procedure extends to decompositions of this form. As Stewart describes, this definition removes all the restrictions imposed on an Arnoldi factorization. The matrix B_m and the vector b_{m+1} are allowed to be arbitrary, unlike in an Arnoldi factorization where the Rayleigh quotient must be an upper Hessenberg matrix. Additionally, the basis vectors of a Krylov decomposition are not required to be orthonormal, but we will shortly see that orthonormal vectors are the most desirable option in the context of computing. Stewart offered a connection between these two decompositions. A factorization of the form (4.1) is equivalent, that is, related by a sequence of translations and similarities, to an Arnoldi decomposition of the form (3.1). If the Hessenberg term is irreducible, then the Arnoldi decomposition is essentially unique. Stewart's work connecting these two factorizations shows one can work with a less restrictive form, that of the Krylov decomposition, and not lose the Krylov subspace property associated with Arnoldi factorizations.

In particular, any Krylov decomposition can be formulated using orthonormal basis vectors V_{m+1}, and the Rayleigh quotient may be reduced to Schur form. This


results in a Krylov-Schur decomposition given by

    A V_m = V_m S_m + v_{m+1} b^H_{m+1},    (4.2)

where the columns of V_{m+1} are orthonormal and S_m is the upper triangular factor from the complex Schur form. In matrix notation, we have

    A V_m = V_{m+1} S̃_{m+1},    (4.3)

where

    S̃_{m+1} = [ S_m ; b^H_{m+1} ]

is S_m stacked on the row b^H_{m+1}. The Krylov-Schur method using this decomposition proceeds in much the same manner as IRAM. The Krylov-Schur method uses an expansion phase to extend the underlying Krylov subspace and then employs a contraction phase to purge unwanted Ritz values. The ease with which the Krylov-Schur method accomplishes the latter is one of the reasons why this approach is so attractive. Before turning to a block variant, we outline the main steps in the basic Krylov-Schur iteration in Algorithm 4.1.1 and discuss relevant implementation details.

A practical implementation of the Krylov-Schur method begins with some kind of subspace initialization in the form of a Krylov decomposition. For our approach, based on the work in Chapter 2, we begin with a k_s order Arnoldi decomposition as in equation (2.2), but we note that any subspace initialization that fits the perspective of a Krylov decomposition may be used. Note that subspace initialization with Arnoldi was also used in Chapter 3 for our block extension of jdqr and in the original code provided by Sleijpen. Before beginning the Krylov-Schur iteration, the Arnoldi decomposition is reduced to a Krylov-Schur decomposition. The structure of the Rayleigh quotient when reduced to Schur form can be seen in Figure 4.2(b). Before the iteration begins, this is a k_s order decomposition. The Krylov-Schur iteration first expands the search subspace in the same manner as the Arnoldi process, creating


Algorithm 4.1.1: Krylov-Schur

Input: $A \in \mathbb{C}^{n \times n}$, $v \in \mathbb{C}^n$, and desired number of eigenvalues $k_{max}$
Result: a partial or full Schur form given by $Z_{k_{max}}$ and $\widetilde{S}_{k_{max}}$

1 Initialize subspace with a $k_s$ order Arnoldi decomposition, $A V_{k_s} = V_{k_s+1} \widetilde{H}_{k_s+1}$;
2 Compute Schur form of $H_{k_s+1}$;
3 Update first $k_s$ columns of $V_{k_s+1}$ and last row of $\widetilde{H}_{k_s+1}$;
4 The factorization now has the form $A Z_{k_s} = Z_{k_s+1} \widetilde{S}_{k_s+1}$;
5 while $k < k_{max}$ do
6   Expand the Krylov-Schur decomposition to $A Z_{k+k_f} = Z_{k+k_f+1} \widetilde{S}_{k+k_f+1}$;
7   Re-order the Schur form $S_{k+k_f+1}$;
8   Update the first $k + k_f$ columns of $Z_{k+k_f+1}$ and the last row of $\widetilde{S}_{k+k_f+1}$;
9   Check convergence;
10  if converged then
11    Deflate converged eigenvalue and Schur vector;
12    Track number of converged eigenvalues, $k = k + 1$;
13    Truncate decomposition and adjust active search subspace dimension:
14    $A Z_{k+k_s} = Z_{k+k_s+1} \widetilde{S}_{k+k_s+1}$;
15  if $k + k_f = n$ then
16    Compute Schur form of last projected problem;
17    Build $Z_n$ and $S_n$;
18  else
19    Truncate decomposition, $A Z_{k+k_s} = Z_{k+k_s+1} \widetilde{S}_{k+k_s+1}$;
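The initialization and reduction steps (lines 1-4) can be sketched in a few lines of Python. This is an illustrative stand-in, with numpy/scipy replacing the LAPACK kernels, not the thesis code:

```python
import numpy as np
from scipy.linalg import rsf2csf, schur

def arnoldi(A, v, m):
    """Order-m Arnoldi decomposition A V_m = V_{m+1} H~ (step 1),
    built with one pass of classical Gram-Schmidt."""
    n = A.shape[0]
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(m):
        w = A @ V[:, j]
        H[:j + 1, j] = V[:, :j + 1].T @ w
        w -= V[:, :j + 1] @ H[:j + 1, j]
        H[j + 1, j] = np.linalg.norm(w)
        V[:, j + 1] = w / H[j + 1, j]
    return V, H

rng = np.random.default_rng(0)
n, m = 50, 10
A = rng.standard_normal((n, n))
V, H = arnoldi(A, rng.standard_normal(n), m)

# Steps 2-3: Schur form of the Rayleigh quotient (rsf2csf gives the
# complex, triangular form), then rotate the basis and the last row.
S, U = rsf2csf(*schur(H[:m, :m]))
Z = V[:, :m] @ U          # updated basis
bH = H[m, :] @ U          # updated last row b^H

# Step 4: the Krylov-Schur relation A Z = Z S + v_{m+1} b^H now holds.
res = A @ Z - Z @ S - np.outer(V[:, m], bH)
print(np.linalg.norm(res))   # near machine precision
```

The residual identity holds regardless of how $U$ rotates the basis, which is exactly why the reduction to Schur form is cheap: only the projected $m \times m$ problem is touched.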


(a) Expanded Form   (b) Schur Form

Figure 4.2: Structure of the Rayleigh quotient

a $k_f$ order general Krylov decomposition. The structure of the Rayleigh quotient after expansion can be seen in Figure 4.2(a). Next the Schur form of the expanded Rayleigh quotient $S_{k_f}$ is computed. Updating the corresponding columns of $Z_{k_f}$ and the bottom row of $\widetilde{S}_{k_f+1}$ changes the structure of the Rayleigh quotient back to its original form, as illustrated in Figure 4.2(b).

In the next step, the expanded Schur form is re-ordered, moving the desired Ritz approximations to the top left of the matrix $\widetilde{S}_{k_f+1}$ and unwanted ones to the bottom right for purging. Different parts of the spectrum may be targeted in the re-ordering phase. The leading component of $b_{k_f+1}^H$, the bottom row of $\widetilde{S}_{k_f+1}$, is then checked against the deflation criteria. If the approximation is not yet acceptable, the expansion is truncated and the iteration continues, building off of the truncated Krylov-Schur expansion. Truncation of the Rayleigh quotient is accomplished by selecting the first $k_s$ rows and $k_s$ columns of $S_{k_f}$ and the first $k_s$ components of the bottom row of $\widetilde{S}_{k_f+1}$ to construct $\widetilde{S}_{k_s+1}$. The orthonormal vectors in the first $k_s$ columns of $Z_{k_f}$ and the last column make up the new search subspace $Z_{k_s+1}$. If the leading component satisfies the convergence criteria (3.18), the converged Ritz pair is deflated.
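The deflation test has a clean interpretation: the residual of the leading Ritz pair $(s, z_1)$ is exactly the magnitude of the leading coupling component, so zeroing a small component is a small backward perturbation. A toy numpy check with synthetic (hypothetical) data consistent with (4.2):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 30, 6
# Synthetic orthonormal basis and Krylov-Schur factors (illustrative):
Z, _ = np.linalg.qr(rng.standard_normal((n, m + 1)))
S = np.triu(rng.standard_normal((m, m)))
b = 1e-9 * rng.standard_normal(m)        # small coupling row b^H
# Any A consistent with A Z_m = Z_{m+1} [S; b^H]:
A = Z @ np.vstack([S, b]) @ Z[:, :m].T

# Leading Ritz pair (s, z_1): its residual norm equals |b_1| exactly,
# so setting b_1 = 0 perturbs A by only that much (backward error).
z1, s11 = Z[:, 0], S[0, 0]
print(np.linalg.norm(A @ z1 - s11 * z1))   # equals |b[0]|, about 1e-9
```

Because $S$ is upper triangular, the first column of the factor picks out only $s_{11}$ and $b_1$, which is why the residual reduces to a single scalar.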


Deflating converged vectors associated with a single Ritz value is fairly straightforward. After one step of the Krylov-Schur iteration, we have the form
$$A\,[z_1 | \cdots | z_{k_f}] = [z_1 | \cdots | z_{k_f+1}] \begin{bmatrix} s & S_{12} \\ 0 & S_{22} \\ b & b_2^H \end{bmatrix}, \qquad (4.4)$$
where $s$ is the component in the first row and first column of $S_{k_f}$ and $b$ is the first component of the vector $b_{k_f+1}^H$. If $b$ satisfies the convergence criterion given in (3.18), it may be set to zero, deflating the converged Ritz value $s$. Stewart suggested a convergence strategy based on having small backward error to ensure backward stability. A similar strategy is used in ARPACK for a converged Ritz value in an Arnoldi decomposition and can be easily extended to a general Krylov decomposition. As Kressner [43] discussed, a more restrictive convergence criterion based on Schur vectors rather than Ritz vectors is possible, and we use this strategy. Assuming $k_{con}$

where $u$ is the unit roundoff and $tol$ is a user-defined tolerance. If a Ritz value satisfies the criterion (4.7), it satisfies the convergence strategy based on Ritz vectors used by Stewart and ARPACK. If no deflation is possible with this Ritz value, the Schur form may be re-ordered and the same test applied to the resulting component of the bottom row. Kressner [43] points out that this deflation strategy can be considered a variant of AED used in the multishift QR algorithm currently implemented in LAPACK. We will come back to this analogy when we formulate our block approach.

Our implementation of the Krylov-Schur algorithm extends the work of Stewart to compute both partial Schur forms and a complete Schur decomposition. To our knowledge, no current implementation of Krylov-Schur is used to compute a full Schur decomposition. As detailed above, our approach builds an expansion of the form (4.5). In the case of a full Schur form decomposition, the active search subspace is always of size $k_s$ in the contraction phase and of size $k_f$ in the expansion phase. The $k_{con}$ converged Ritz values are locked in the triangular factor $S_{k_{con}}$ and the converged Schur vectors make up the first $k_{con}$ columns of the matrix $Z_{k_{con}+k_f+1}$. When $k_{con} + k_f + 1 = n$, the search subspace cannot be extended further and the final projected eigenvalue problem is used to complete the Schur decomposition. The full Schur form is of the form $A Z_n = Z_n S_n$, where $Z_n^H Z_n = I_n$ and $S_n$ is upper triangular.

4.2 The Block Krylov-Schur Algorithm

The extension of Algorithm 4.1.1 to a block method is the subject of this section. As we will see, block methods are more complicated than their non-block counterparts and require careful implementation. Block Krylov-Schur expansions are of the form
$$A Z_m = Z_{m+1} \widetilde{S}_{m+1}, \qquad (4.8)$$
where $Z_{m+1} = [V_1, \ldots, V_{m+1}]$ has orthonormal columns with each $V_i \in \mathbb{C}^{n \times r}$. The Rayleigh quotient is of the form
$$\widetilde{S}_{m+1} = \begin{bmatrix} S_m \\ b_{m+1}^H \end{bmatrix},$$


where $S_m \in \mathbb{C}^{mr \times mr}$ is upper triangular and $b_{m+1}^H \in \mathbb{C}^{r \times mr}$ is a general matrix. The structure of the Rayleigh quotient $\widetilde{S}_{m+1}$ can be seen in Figure 4.3.

Figure 4.3: Structure of block Krylov-Schur decomposition

Block variants of Krylov-Schur are not new. For symmetric matrices, a block Krylov-Schur approach was suggested by Zhou and Saad [72]. This work addressed several important implementation details. Of note are the handling of the rank-deficient case, the orthogonalization scheme, and the incorporation of adaptive block sizes. One complication of blocked methods is that vectors in a new block may become linearly dependent during the iteration. The expansion phase begins with a $k_s$ order block Krylov-Schur decomposition given by
$$A Z_{k_s} = Z_{k_s} S_{k_s} + F b_{k_s+1}^H, \qquad (4.9)$$
where $Z_{k_s} \in \mathbb{C}^{n \times k_s r}$ has orthonormal columns, $S_{k_s} \in \mathbb{C}^{k_s r \times k_s r}$ is upper triangular, $F \in \mathbb{C}^{n \times r}$ is such that $F \perp Z_{k_s}$, and $b_{k_s+1} \in \mathbb{C}^{k_s r \times r}$. This is expanded using a block Lanczos approach, as Zhou and Saad were working with symmetric matrices, to
$$A Z_{k_f} = Z_{k_f} S_{k_f} + F E_{k_f}^H, \qquad (4.10)$$
where $E_{k_f}$ is defined as in (2.3). The issue is ensuring that $F \perp Z_{k_f}$. When $F$ is of full rank, the correct augmentation vectors have been located and we may proceed with the block Krylov-Schur iteration. In the case where $F$ is rank deficient, care must be taken. The strategy suggested by Zhou and Saad [72] computes a rank-revealing pivoted QR factorization of $F$. For our discussion, let the rank of $F$ be $f$ and the


pivoted QR factorization be given by $FP = QR$, where $Q = [Q_f, Q_{r-f}]$. To ensure $Q_{r-f}$ is not in the range of $Z_{k_f}$, they propose a single vector version of Gram-Schmidt with reorthogonalization and note that this is the same strategy as the DGKS method used in ARPACK [47]. Vectors from $Q_{r-f}$ in range$(Z_{k_f})$ are counted, replaced with random vectors, and then orthonormalized against the rest. We mention this to point out that we opt for a superior orthogonalization scheme, at a slight cost. The method formulated in [72] also incorporated adaptive block sizes, which is the subject of future work in our endeavors.

A block approach for nonsymmetric matrices is also suggested by Baglama [7] to compute $k$

sweep of our approach and provide a detailed pseudocode. As before, we begin with a subspace initialization phase and, building off our work in Chapter 2, we start with a $k_s$ order block Arnoldi factorization as in (2.4), where $k_s$ is a multiple of the block size $r$. Here the number of blocks in the contraction phase is given by $b_s$, so that $k_s = b_s r$, and the number of blocks in the expansion phase is given by $b_f$, so that the size of the expanded search subspace is $k_f = b_f r$. As before, $k_{max}$ is the desired number of eigenvalues we wish to compute and $k_{con}$ is the number of converged eigenvalues.

As outlined earlier, we begin by initializing our search subspace using block Arnoldi. Our block Arnoldi routine uses a variation on the compact WY representation of the Householder reflectors. For our initial decomposition, we use Algorithm 4.2.1 to construct $W \in \mathbb{C}^{n \times (k_s+r)}$ and $\widetilde{H} \in \mathbb{C}^{(k_s+r) \times k_s}$ such that
$$A\,W(:, 1{:}k_s) = W(:, 1{:}k_s{+}r)\,\widetilde{H}(1{:}k_s{+}r,\ 1{:}k_s), \qquad (4.11)$$
where $W = [V_1, \ldots, V_{b_s}]$ has a compact WY representation and each $V_i \in \mathbb{C}^{n \times r}$. Here xGEQRT is used on blocks of size $r$, creating $T = [T_1, \ldots, T_{b_s}]$, where each $T_i \in \mathbb{C}^{r \times r}$, $i = 1, \ldots, b_s$, corresponds to a block $V_i$ in the search subspace.

Before beginning the Krylov-Schur iteration, we compute the Schur form of the Rayleigh quotient,
$$\widetilde{H}(1{:}k_s, 1{:}k_s)\,U = U S,$$
where $U, S \in \mathbb{C}^{k_s \times k_s}$, $U U^H = I$, and $S$ is upper triangular. Updating the first $k_s$ columns of $W$ and the last $r$ rows of $\widetilde{H}$, we have the initial Krylov-Schur decomposition $A Z = Z \widetilde{S}$, where $Z(:, 1{:}k_s) = W(:, 1{:}k_s)\,U$ and
$$\widetilde{S} = \begin{bmatrix} S \\ b^H U \end{bmatrix}.$$
Reduction to Schur form requires a couple of computations. First, xGEHRD is used to reduce the band Hessenberg Rayleigh quotient to Hessenberg form and xUNMQR is used to accumulate the corresponding reflectors. Next xHSEQR is used to compute


Algorithm 4.2.1: Block Arnoldi subspace initialization

Input: $A \in \mathbb{C}^{n \times n}$, $U \in \mathbb{C}^{n \times r}$, $b_s$
Result: $A Q = Q \widetilde{H}$ with $Q \in \mathbb{C}^{n \times b_s r}$ and $\widetilde{H} \in \mathbb{C}^{b_s r \times b_s r}$

1 for $j = 0 : b_s$ do
2   Compute the QR factorization of $U(jr{+}1{:}n, :)$:
3   xGEQRT($U[jr]$, $T$);
4   Store the result in $V$: xLACPY($U$, $V[jnr]$);
5   The reflectors for the compact WY form of $Q$ are in the lower triangular part of $V$ with associated triangular factors in $T$;
6   Explicitly build $Q$ by accumulating the reflectors in reverse order:
7   for $k = j : -1 : 0$ do
8     xGEMQRT($V[knr + kr]$, $T[krr]$, $Q[jrn + kr]$);
9   if $j$
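A simplified sketch of this initialization, using block classical Gram-Schmidt with one reorthogonalization pass in place of the Householder/compact-WY kernels of Algorithm 4.2.1 (a readability assumption, not the implemented routine):

```python
import numpy as np

def block_arnoldi(A, U, bs):
    """Build A Q(:, 1:bs*r) = Q H~ with orthonormal Q, in the spirit of
    Algorithm 4.2.1 but via block Gram-Schmidt plus reorthogonalization."""
    n, r = U.shape
    Q = np.zeros((n, (bs + 1) * r))
    H = np.zeros(((bs + 1) * r, bs * r))
    Q[:, :r] = np.linalg.qr(U)[0]
    for j in range(bs):
        W = A @ Q[:, j * r:(j + 1) * r]
        for _ in range(2):                 # CGS pass + reorthogonalization
            C = Q[:, :(j + 1) * r].T @ W
            W -= Q[:, :(j + 1) * r] @ C
            H[:(j + 1) * r, j * r:(j + 1) * r] += C
        Qn, Rn = np.linalg.qr(W)           # next orthonormal block
        Q[:, (j + 1) * r:(j + 2) * r] = Qn
        H[(j + 1) * r:(j + 2) * r, j * r:(j + 1) * r] = Rn
    return Q, H

rng = np.random.default_rng(7)
A = rng.standard_normal((60, 60))
r, bs = 4, 3
Q, H = block_arnoldi(A, rng.standard_normal((60, r)), bs)
print(np.linalg.norm(A @ Q[:, :bs * r] - Q @ H))        # near eps
print(np.linalg.norm(Q.T @ Q - np.eye((bs + 1) * r)))   # near eps
```

The resulting $\widetilde{H}$ is band Hessenberg with $r \times r$ triangular subdiagonal blocks, which is the structure the xGEHRD/xHSEQR reduction above then works on.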

Algorithm 4.2.3: Block Krylov-Schur

Input: $A \in \mathbb{C}^{n \times n}$, $r$, $k_s$, $k_f$, and $k_{max}$
Result: $A Z = Z S$ with $Z \in \mathbb{C}^{n \times k_{max}}$ and $S \in \mathbb{C}^{k_{max} \times k_{max}}$

1 Initialize subspace using Arnoldi as in Algorithm 4.2.1: $A Q(:, 1{:}k_s) = Q(:, 1{:}k_s{+}r)\,\widetilde{H}(1{:}k_s{+}r, 1{:}k_s)$;
2 Compute Schur form of $\widetilde{H}(1{:}k_s, 1{:}k_s)$ using Algorithm 4.2.2;
3 The factorization now is a Krylov-Schur decomposition $A Z(:, 1{:}k_s) = Z(:, 1{:}k_s{+}r)\,\widetilde{S}(1{:}k_s{+}r, 1{:}k_s)$;
4 $k_{con} = 0$;
5 while $k_{con} < k_{max}$ do
6   Expand the Krylov-Schur decomposition using Algorithm 4.2.4 to $A Z = Z \widetilde{S}$, where $Z \in \mathbb{C}^{n \times (k_{con}+k_f+r)}$ and $\widetilde{S} \in \mathbb{C}^{(k_{con}+k_f+r) \times (k_{con}+k_f)}$;
7   Re-order the $k_f \times k_f$ Schur form in the active part of $\widetilde{S}$;
8   Update active parts of $Z$ and $\widetilde{S}$;
9   Check convergence;
10  if converged then
11    Deflate $n_{con}$ converged eigenvalues and Schur vectors;
12    Adjust active region by locking converged approximations, $k_{con} = k_{con} + n_{con}$;
13    Truncate decomposition and adjust active search subspace dimension: $A Z(:, 1{:}k_{con}{+}k_s) = Z(:, 1{:}k_{con}{+}k_s{+}r)\,\widetilde{S}(1{:}k_{con}{+}k_s{+}r, 1{:}k_{con}{+}k_s)$;
14  else
15    Truncate decomposition;
16    $A Z(:, 1{:}k_{con}{+}k_s) = Z(:, 1{:}k_{con}{+}k_s{+}r)\,\widetilde{S}(1{:}k_{con}{+}k_s{+}r, 1{:}k_{con}{+}k_s)$;
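The expansion step (line 6) must guard against a rank-deficient new block, as discussed above for [72]. A hedged numpy/scipy sketch of that guard on synthetic data: pivoted QR to estimate the numerical rank, replacement of the deficient directions by random vectors, and DGKS-style reorthogonalization (this illustrates the cited strategy, not our compact-WY variant):

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(4)
n, r = 30, 4
Z = np.linalg.qr(rng.standard_normal((n, 8)))[0]   # current basis
# A new block F that is rank deficient AND partly in range(Z):
F = rng.standard_normal((n, r))
F[:, 2] = F[:, 0] - 2 * F[:, 1]                    # dependent column
F[:, 3] = Z @ rng.standard_normal(8)               # column in range(Z)

F -= Z @ (Z.T @ F)                                 # orthogonalize vs Z
F -= Z @ (Z.T @ F)                                 # DGKS second pass
Q, R, piv = qr(F, mode='economic', pivoting=True)  # rank-revealing QR
tol = n * np.finfo(float).eps * abs(R[0, 0])
f = int(np.sum(np.abs(np.diag(R)) > tol))          # numerical rank of F
# Replace deficient directions with random vectors, reorthogonalize:
Q[:, f:] = rng.standard_normal((n, r - f))
for _ in range(2):
    Q[:, f:] -= Z @ (Z.T @ Q[:, f:])
    Q[:, f:] -= Q[:, :f] @ (Q[:, :f].T @ Q[:, f:])
Q[:, f:] = np.linalg.qr(Q[:, f:])[0]

B = np.hstack([Z, Q])                              # augmented basis
print(np.linalg.norm(B.T @ B - np.eye(8 + r)))     # near eps
```

Here `f` comes out as 2: one column was an exact linear combination and one lay entirely in the existing search space.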


Figure 4.4: Subspace initialization and block Krylov-Schur decomposition

The Krylov-Schur approach enters a cycle of expansion and contraction of the search subspace until a Ritz approximation is ready to be deflated. The general outline of the approach is detailed in Algorithm 4.2.3. The expansion phase proceeds in the same manner as our block Arnoldi iteration and with nearly the same structure as Algorithm 4.2.1. Expanding an existing factorization could be accomplished by Algorithm 4.2.1 if eigenvalues converged $r$ at a time, but this does not seem to be the case based on our numerical experiments. To navigate this slight issue, we use xGEQRT to compute any necessary QR factorizations and store the diagonal of the $T$ factors returned by xGEQRT, as these scalars correspond to the elementary Householder reflectors if we were to proceed by eliminating a column at a time. We then use xLARFT to construct a "big" $T$ for the appropriately sized compact WY form using one vector of all the diagonal entries. If $k_{con}$ eigenvalues have converged, then after expansion the compact WY form used in our approach will have lower unit triangular $Y \in \mathbb{C}^{n \times (k_{con}+k_f+r)}$ and triangular factor $T \in \mathbb{C}^{(k_{con}+k_f+r) \times (k_{con}+k_f+r)}$. Computing this additional $T$ adds unnecessary flops to our computation. This motivates a possible addition to LAPACK that allows for flexibility in the structure of the $T$ factor.

In a more general context, our approach requires the computation of a QR factorization of an $m \times n$ matrix $B$. At the conclusion of the calculation, we want the Householder reflectors, $m \times n$ unit lower triangular $Y$, the $n \times n$ upper triangular $R$ factor, and a specific version of the $T$ factor. The standard output of xGEQRT is a


$b \times n$ matrix $T$ that is a sequence of $b \times b$ triangular factors, where $b$ is the block size. If our matrix $B$ is partitioned in two parts as in $B = [B_1, B_2]$, where $B_1$ is $m \times n_1$ and $B_2$ is $m \times n_2$ such that $n = n_1 + n_2$, we may desire to compute a compact WY representation in two steps. That is, first compute $R_{1,1} \in \mathbb{C}^{n_1 \times n_1}$, $Y_1 \in \mathbb{C}^{m \times n_1}$, and $T_1 \in \mathbb{C}^{n_1 \times n_1}$ so that $B_1 = (I - Y_1 T_1 Y_1^H) R_{1,1}$. Next we desire to compute the remaining components for the compact WY form of the entire matrix $B$. We desire to compute $R_{1,2} \in \mathbb{C}^{n_1 \times n_2}$, $R_{2,2} \in \mathbb{C}^{n_2 \times n_2}$, $Y_2 \in \mathbb{C}^{m \times n_2}$, and $T_2 \in \mathbb{C}^{n_2 \times n_2}$, so that the end result,
$$\begin{bmatrix} R_{1,1} & R_{1,2} \\ Y_1 & R_{2,2} \\ Y_1 & Y_2 \end{bmatrix} \quad \text{with} \quad T = [T_1, T_2],$$
is a more general version of the current output of xGEQRT. Here the $T$ factor is a sequence of smaller triangular factors, possibly of various sizes. This fits our application nicely and possibly others. Currently LAPACK does not have this functionality, but the computation of $T_2$ could be achieved by adjusting xLARFT accordingly. The other components may be computed using existing LAPACK kernels.

In addition to this slight adjustment, the compact WY form of our search subspace may have an additional component. As discussed in our Matlab implementation in Chapter 3, at the completion of one sweep we construct a new compact WY representation of the truncated search subspace. In doing so, we generate an additional diagonal matrix with diagonal elements $\pm 1$ as in (3.7), which we must take into consideration. The details of expanding the Krylov decomposition using existing LAPACK routines may be found in Algorithm 4.2.4. The expanded Rayleigh quotient in the block Krylov decomposition has the structure depicted in Figure 4.5.

Next we compute the Schur form of the active $k_f \times k_f$ part of $\widetilde{S}$. This is again accomplished by Algorithm 4.2.2. The next step is to reorder the Schur form, moving the desired Ritz values to the upper left of the active window and the unwanted ones to the bottom right of $\widetilde{S}$ to be purged in the truncation step. This crucial step


Figure 4.5: Expanded block Krylov decomposition

allows us to compute eigenvalues in desired regions of the spectrum. As mentioned in Chapter 3, we may want to incorporate Kressner's approach to maximize our use of BLAS 3 operations. Rather than calling xTRSEN, we employ the same approach used in our Matlab implementation and reorder the Schur form by hand. Optimizing the performance of this important step is the subject of future work.

For the remainder of this discussion, we will assume $k_{con}$ of the $k_{max}$ desired eigenvalues have converged, with $k_{con} < k_{max}$.
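For prototyping, the re-ordering step can be delegated to SciPy's sorting Schur factorization, which plays the role xTRSEN plays in LAPACK (this is a stand-in for illustration, not the hand-rolled reordering used in our code):

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(2)
B = rng.standard_normal((8, 8))     # stand-in for the active window

# Complex Schur form with the Ritz values of largest modulus moved to
# the top-left; sdim reports how many satisfied the selection predicate.
cut = np.median(np.abs(np.linalg.eigvals(B)))
S, Q, sdim = schur(B, output='complex', sort=lambda x: abs(x) > cut)

print(np.abs(np.diag(S)))           # the wanted (largest) values lead
print(np.linalg.norm(B - Q @ S @ Q.conj().T))   # similarity preserved
```

Targeting a different part of the spectrum only means changing the `sort` predicate, which mirrors how different regions can be targeted in the re-ordering phase.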

Algorithm 4.2.4: Expansion of Krylov decomposition with compact WY form

Input: $U \in \mathbb{C}^{n \times r}$ and Krylov decomposition of size $k_s = b_s r$
Result: Krylov decomposition of size $k_f = b_f r$

1 for $j = b_s + 1 : b_f$ do
2   Compute QR factorization of $U(k_{con} + jr + 1{:}n, :)$:
3   xGEQRT($U[k_{con} + jr]$, $T$);
4   Store reflectors in $V$ (with $H$ stored in the upper part of $V$):
5   xLACPY($U$, $V[k_{con} n + jnr]$);
6   Accumulate scalars for construction of large $T$ factor:
7   for $i = 0 : r - 1$ do
8     $\tau[k_{con} + i + jr] = T[i + ir]$;
9   Build full $T$: xLARFT($V$, $T$);
10  Update next part of Krylov sequence:
11  Explicitly build next block of $Z$: xGEMQRT($V$, $T$, $Z[k_{con} n + jnr]$);
12  if $j$
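The fused $T$ factor proposed above can be checked numerically: the block formula $T = \begin{bmatrix} T_1 & -T_1 (Y_1^H Y_2) T_2 \\ 0 & T_2 \end{bmatrix}$ reproduces what a single xLARFT-style forward recurrence would build. A real-arithmetic sketch (the helper names are ours, not LAPACK's):

```python
import numpy as np

def house_qr(B):
    """Householder QR returning R, unit-lower-triangular Y, and the
    scalar factors tau (one reflector per column)."""
    m, n = B.shape
    R = B.astype(float).copy()
    Y = np.zeros((m, n)); tau = np.zeros(n)
    for j in range(n):
        v = R[j:, j].copy()
        v[0] += np.copysign(np.linalg.norm(v), v[0])
        v /= v[0]                                  # unit first entry
        tau[j] = 2.0 / (v @ v)
        R[j:, :] -= tau[j] * np.outer(v, v @ R[j:, :])
        Y[j:, j] = v
    return np.triu(R[:n, :]), Y, tau

def larft(Y, tau):
    """Forward column-wise T factor, as xLARFT: Q = I - Y T Y^T."""
    n = Y.shape[1]
    T = np.zeros((n, n))
    for j in range(n):
        T[:j, j] = -tau[j] * (T[:j, :j] @ (Y[:, :j].T @ Y[:, j]))
        T[j, j] = tau[j]
    return T

rng = np.random.default_rng(6)
m, n1, n2 = 12, 3, 2
B = rng.standard_normal((m, n1 + n2))
R, Y, tau = house_qr(B)

# Panel factors T1, T2 fused by the block formula:
T1, T2 = larft(Y[:, :n1], tau[:n1]), larft(Y[:, n1:], tau[n1:])
T12 = -T1 @ (Y[:, :n1].T @ Y[:, n1:]) @ T2
T = np.block([[T1, T12], [np.zeros((n2, n1)), T2]])

print(np.linalg.norm(T - larft(Y, tau)))   # fused T matches full recurrence
Q = np.eye(m) - Y @ T @ Y.T
print(np.linalg.norm(Q @ np.vstack([R, np.zeros((m - n1 - n2, n1 + n2))]) - B))
```

Only the off-diagonal block $T_{12}$ is new work, which is the flexibility in the $T$ structure that the proposed xLARFT adjustment would expose.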

If our approximation is satisfactory, we lock the converged eigenvalue and contract the search subspace. Truncation after deflation is detailed in Algorithm 4.2.5. As illustrated in the numerical experiments of Chapter 3, occasionally more than one approximation is ready to be deflated. Our approach checks for multiple deflations each sweep but, as mentioned before, very rarely were $r$ eigenvalues ready to be deflated. If no approximations are acceptable, we purge the unwanted Ritz values by truncating the block Krylov-Schur decomposition. The simplicity of the contraction phase is certainly one of the attractive features of Krylov decompositions, especially compared to Arnoldi decompositions. Again, details may be found in Algorithm 4.2.5. Contraction of a block Krylov-Schur decomposition is illustrated in Figure 4.6, with the Ritz pairs to be purged highlighted in blue.

Figure 4.6: Truncation of block Krylov-Schur decomposition


Algorithm 4.2.5: Truncation of block Krylov-Schur decomposition

Input: $H \in \mathbb{C}^{(k_{con}+k_f+r) \times (k_{con}+k_f)}$ and $Z \in \mathbb{C}^{n \times (k_{con}+k_f+r)}$
Result: $H \in \mathbb{C}^{(k_{con}+k_s+r) \times (k_{con}+k_s)}$ and $Z \in \mathbb{C}^{n \times (k_{con}+k_s+r)}$

1 $n_{con} = 0$;
2 Check convergence, deflate $n_{con}$ if possible, and truncate;
3 if $n_{con} > 0$ then
4   $[k_{con}] = H[k_{con} + k_{con} k_d]$;
5   $k_{con} = k_{con} + 1$;
6 Restore identity in $Q$:
7 $Q(k_{con}+k_s+r+1{:}n,\ k_s+k_{con}+r+1{:}k_{con}+k_f+r) = \mathrm{eye}(n - k_s - r - k_{con},\ k_f - k_s)$;
8 Copy pieces from $Z$ to $Q$ with xLACPY: $Q(:, 1{:}k_{con}+k_s) = Z(:, 1{:}k_{con}+k_s)$;
9 $Q(:,\ k_{con}+k_s+1{:}k_{con}+k_s+r) = Z(:,\ k_{con}+k_f-n_{con}+1{:}k_{con}+k_f-n_{con}+r)$;
10 Then copy $Q$ back to $Z$;
11 Explicitly deflate and contract $H$ using xLACPY: $H(1{:}k_{con}+k_s,\ 1{:}k_{con}+k_s) = \mathrm{triu}(H(1{:}k_{con}+k_s,\ 1{:}k_{con}+k_s))$;
12 $H(k_{con}+k_s+1{:}k_{con}+k_s+r,\ 1{:}k_{con}+k_s) = S(k_{con}+k_f-n_{con}+1{:}k_{con}+k_f+r-n_{con},\ 1{:}k_s+k_{con})$;
13 Compute updated compact version of $Z = (I - Y T Y^H) R$ with xGEQRT($Q$, $T$);
14 Store reflectors in $V$;
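For the single-vector case ($r = 1$) the effect of this truncation is easy to verify with synthetic data: keeping the leading $k_s$ rows and columns of $S$, the first $k_s$ entries of $b$, the first $k_s$ basis vectors, and the residual vector $z_{k_f+1}$ leaves a valid, smaller Krylov-Schur decomposition:

```python
import numpy as np

rng = np.random.default_rng(8)
n, kf, ks = 30, 8, 3
# Synthetic order-k_f Krylov-Schur decomposition A Z = Z_{k_f+1} [S; b^H]
# (illustrative data, not from an actual iteration):
Z, _ = np.linalg.qr(rng.standard_normal((n, kf + 1)))
S = np.triu(rng.standard_normal((kf, kf)))
b = rng.standard_normal(kf)
A = Z @ np.vstack([S, b]) @ Z[:, :kf].T

# Truncate: the purged trailing Ritz values disappear, yet the relation
# A Z_{k_s} = Z_{k_s} S_{k_s} + z_{k_f+1} b_{k_s}^H still holds exactly.
res = A @ Z[:, :ks] - Z[:, :ks] @ S[:ks, :ks] - np.outer(Z[:, kf], b[:ks])
print(np.linalg.norm(res))   # near machine precision
```

The exactness follows from $S$ being upper triangular: the first $k_s$ columns of $[S; b^H]$ touch only the first $k_s$ basis vectors plus the residual vector, so no recomputation is needed in the contraction phase.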


4.3 Numerical Results

We present one sample run performed on one node of the colibri cluster at UC Denver. One node has 16 shared-memory cores. We take $m = 4{,}000$ for the matrix size, initialize a random dense matrix, and compute the 5 largest eigenvalues of $A$. We use a block size of 20 for the Krylov-Schur algorithm with all other parameters set to their defaults. The results are presented in Figure 4.7. The results for the LAPACK ZGEES reduction to Schur form are presented in Figure 4.8. The parallelism in this experiment is achieved by multithreaded BLAS.

Regarding block Krylov-Schur, the algorithm converges in 2,171 BMVPs, which corresponds to 65,070 MVPs. We can see nice scalability on the right and speedup on the left. The scalability is much better than for LAPACK's ZGEES. However, bKS only computes 5 eigenvalues, while LAPACK's ZGEES computes all eigenvalues (4,000), and, in sequential, bKS is 10x slower than LAPACK. So even though the scalability is much better, the interest is still limited.

The next step for this implementation is to off-load the matrix-vector product (blocked or not) to GPU acceleration. The idea is to set the matrix $A$ in the GPU memory once and for all and use the GPU only for the matrix-vector product. So the vectors would move back and forth from CPU to GPU, but not the coefficient matrix. Our profiling indicates that, for the experiment presented here, in the sequential case 89.20% of the computation time is spent in matrix-vector products. So speeding up the matrix-vector product with the GPU could potentially bring a speedup of 10x. We also profiled $m = 8{,}000$ and $b = 1$, and in this case 99.13% of the computation time is spent in matrix-vector products. So speeding up the matrix-vector product with the GPU could potentially bring a speedup of 100x.
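The 10x and 100x potentials quoted above are Amdahl-law ceilings implied by the profiled fractions; a quick arithmetic check (the helper function is ours, for illustration only):

```python
def max_speedup(fraction_accelerated, local_speedup=float('inf')):
    """Amdahl's law: overall speedup when a fraction of the runtime is
    accelerated by local_speedup (infinite = free matvecs)."""
    return 1.0 / ((1.0 - fraction_accelerated)
                  + fraction_accelerated / local_speedup)

# 89.20% of time in matvecs for m = 4,000 (sequential):
print(round(max_speedup(0.8920), 1))   # about 9.3x ceiling
# 99.13% of time in matvecs for m = 8,000, b = 1:
print(round(max_speedup(0.9913)))      # about 115x ceiling
```

So the quoted 10x and 100x figures are consistent with the profiles, under the optimistic assumption that GPU off-loading makes the matrix-vector products essentially free.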


Figure 4.7: Scalability experiments for our block Krylov-Schur algorithm to compute the five largest eigenvalues of a $4{,}000 \times 4{,}000$ matrix.

Figure 4.8: Scalability experiments for LAPACK's ZGEES algorithm to compute the full Schur decomposition of a $4{,}000 \times 4{,}000$ matrix.


4.4 Conclusion and Future Work

We have implemented block Krylov-Schur in a C code using LAPACK subroutines. The code is robust, supports any block size, and can compute any number of eigenvalues. The code follows the Matlab implementation of Chapter 3, but it has not been optimized yet. Future work includes optimization of the code; in particular, we would like to off-load the matrix-vector product to GPU units and to incorporate our fast Krylov expansion from Chapter 2. We are also interested in trying our method on parallel distributed (heterogeneous or not) architectures. Additionally, we would like to use sparse matrices when collecting timing results and compare to ARPACK. Regarding dense matrices, we do need to compare to ScaLAPACK as well, as of now we only compare to LAPACK with multithreaded BLAS. Finally, we hope to possibly release our code for the scientific computing community.


5. Conclusion

This work endeavored to examine the challenges of the NEP in the context of HPC. To that end, we explored several algorithmic options and available software. Chapter by chapter, the results are as follows.

In Chapter 2, we presented a discussion on block Arnoldi expansion. We introduced a tile algorithm using Householder reflectors. Our code was implemented within the PLASMA framework, which required the augmentation of several existing PLASMA routines to accommodate operations on sub-tiles. We compared two algorithmic variations of our approach achieved by using two different trees (binomial and pipeline). Preliminary results indicate that our Arnoldi implementation performs significantly better than a reference implementation. Future work includes getting more descriptive upper bounds for performance, obtaining a closed form for the critical path, and benchmarking our code with sparse matrices.

In Chapter 3, we turned our attention to "iterative" eigensolvers used to compute the partial Schur form of a matrix. We implemented explicitly restarted block Arnoldi with deflation, block Krylov-Schur, and block Jacobi-Davidson, all with Householder orthogonalization and in Matlab. We compared our implementations to publicly available codes for the NEP and results from the literature. Our numerical study showed that our block Krylov-Schur implementation performs comparably to current standards among iterative eigensolvers for sparse matrices.

In Chapter 4, we presented a C code of our block Krylov-Schur implementation using LAPACK subroutines. The code, built off our Matlab implementation of Chapter 3, is robust and can compute any number of desired eigenvalues. The implementation achieves good scalability, but a majority of the computation time is spent in the matrix-vector product. Future work includes optimization of the code and off-loading the matrix-vector products to GPUs.


REFERENCES

[1] E. Agullo, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, and Hatem Ltaief. PLASMA users' guide. Technical report, ICL UTK, 2009.

[2] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.

[3] Peter Arbenz, Martin Becka, Roman Geus, Ulrich Hetmaniuk, and Tiziano Mengotti. On a parallel multilevel preconditioned Maxwell eigensolver. Parallel Comput., 32:157-165, 2006.

[4] W. E. Arnoldi. The principle of minimized iteration in the solution of the matrix eigenvalue problem. Quart. Appl. Math., 9:17-29, 1951.

[5] Marc Baboulin, Jack Dongarra, and Stanimire Tomov. Some issues in dense linear algebra for multicore and special purpose architectures. Technical Report 200, LAPACK Working Note, 2004.

[6] J. Baglama, D. Calvetti, and L. Reichel. Algorithm 827: irbleigs: a MATLAB program for computing a few eigenpairs of a large sparse Hermitian matrix. ACM Trans. Math. Software, 29:337-348, 2003.

[7] James Baglama. Augmented block Householder Arnoldi method. Linear Algebra Appl., 429:2315-2334, 2008.

[8] Z. Bai and J. Demmel. On a block implementation of Hessenberg multishift QR iteration. Technical Report 8, LAPACK Working Note, 1989.

[9] C. G. Baker, U. L. Hetmaniuk, R. B. Lehoucq, and H. K. Thornquist. Anasazi software for the numerical solution of large-scale eigenvalue problems. ACM Trans. Math. Softw., 36:13:1-13:23, July 2009.

[10] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1997.

[11] BLAS. Basic linear algebra subprograms.


[12] Ronald F. Boisvert, Roldan Pozo, Karin Remington, Richard F. Barrett, and Jack J. Dongarra. Matrix Market: a web resource for test matrix collections. In Proceedings of the IFIP TC2/WG2.5 working conference on Quality of numerical software: assessment and enhancement, pages 125-137, London, UK, 1997. Chapman & Hall, Ltd.

[13] Henricus M. Bouwmeester. Tile Algorithms For Matrix Computations On Multicore Architectures. PhD thesis, University of Colorado Denver, 2012.

[14] Karen Braman. Middle deflations in the QR algorithm. Householder Symposium XVII, 2008.

[15] Karen Braman, Ralph Byers, and Roy Mathias. The multishift QR algorithm. I. Maintaining well-focused shifts and level 3 performance. SIAM J. Matrix Anal. Appl., 23:929-947 (electronic), 2002.

[16] Karen Braman, Ralph Byers, and Roy Mathias. The multishift QR algorithm. II. Aggressive early deflation. SIAM J. Matrix Anal. Appl., 23:948-973 (electronic), 2002.

[17] Jan Brandts. The Riccati algorithm for eigenvalues and invariant subspaces of matrices with inexpensive action. Linear Algebra Appl., 358:335-365, 2003. Special issue on accurate solution of eigenvalue problems (Hagen, 2000).

[18] Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. Parallel tiled QR factorization for multicore architectures. Concurrency Computat.: Pract. Exper., 20:1573-1590, September 2008.

[19] Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Computing, 35:38-53, 2009.

[20] J. W. Daniel, W. B. Gragg, L. Kaufman, and G. W. Stewart. Reorthogonalization and stable algorithms for updating the Gram-Schmidt QR factorization. Mathematics of Computation, 30:772-795, 1976.

[21] Ernest R. Davidson. The iterative calculation of a few of the lowest eigenvalues and corresponding eigenvectors of large real-symmetric matrices. J. Computational Phys., 17:87-94, 1975.

[22] Timothy A. Davis. The University of Florida sparse matrix collection. NA Digest, 92, 1994.

[23] J. Dongarra, S. Hammarling, and D. Sorensen. Block reduction of matrices to condensed forms for eigenvalue computations. Technical Report 2, LAPACK Working Note, 1987.


[24] Diederik R. Fokkema, Gerard L. G. Sleijpen, and Henk A. Van der Vorst. Jacobi-Davidson style QR and QZ algorithms for the reduction of matrix pencils. SIAM J. Sci. Comput., 20:94-125 (electronic), 1998.

[25] J. G. F. Francis. The QR transformation: a unitary analogue to the LR transformation. I. Comput. J., 4:265-271, 1961/1962.

[26] J. G. F. Francis. The QR transformation. II. Comput. J., 4:332-345, 1961/1962.

[27] R. Geus. The Jacobi-Davidson algorithm for solving large sparse symmetric eigenvalue problems with application to the design of accelerator cavities. PhD thesis, ETH Zurich, 2002.

[28] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.

[29] R. Granat, B. Kågström, and D. Kressner. A novel parallel QR algorithm for hybrid distributed memory HPC systems. Technical Report 216, LAPACK Working Note, 2004.

[30] G. Henry, D. Watkins, and J. Dongarra. A parallel implementation of the nonsymmetric QR algorithm for distributed memory architectures. LAPACK Working Note 121, UT-CS-97-352, March 1997.

[31] Greg Henry and Robert van de Geijn. Parallelizing the QR algorithm for the unsymmetric algebraic eigenvalue problem: Myths and reality. Technical Report 79, LAPACK Working Note, 1994.

[32] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2002.

[33] M. E. Hochstenbach. Jacobi-Davidson Gateway. casa/research/topics/jd/

[34] Michiel E. Hochstenbach and Yvan Notay. Controlling inner iterations in the Jacobi-Davidson method. SIAM J. Matrix Anal. Appl., 31:460-477, 2009.

[35] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985.

[36] Gary W. Howell and Nadia Diaa. Algorithm 841: BHESS: Gaussian reduction to a similar banded Hessenberg form. ACM Trans. Math. Softw., 31:166-185, March 2005.

[37] C. G. J. Jacobi. Über eine neue Auflösungsart der bei der Methode der kleinsten Quadrate vorkommenden linearen Gleichungen. Astronom. Nachr., pages 297-306, 1845.

[38] Zhongxiao Jia. A refined iterative algorithm based on the block Arnoldi process for large unsymmetric eigenproblems. Linear Algebra Appl., 270:171-189, 1998.


[39] Lars Karlsson and Daniel Kressner. Optimally packed chains of bulges in multishift QR algorithms. Technical Report 271, LAPACK Working Note, 2012.

[40] Andrew V. Knyazev. Toward the optimal preconditioned eigensolver: locally optimal block preconditioned conjugate gradient method. SIAM J. Sci. Comput., 23:517-541 (electronic), 2001. Copper Mountain Conference.

[41] Andrew V. Knyazev. Hard and soft locking in iterative methods for symmetric eigenvalue problems. Copper Mountain Conference on Iterative Methods, 2004.

[42] D. Kressner. Block algorithms for reordering standard and generalized Schur forms. LAPACK Working Note 171, UT-CS, 2006.

[43] Daniel Kressner. Numerical Methods for General and Structured Eigenvalue Problems. Springer, 2000.

[44] V. N. Kublanovskaya. On some algorithms for the solution of the complete eigenvalue problem. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 1:555-570, 1961.

[45] Julien Langou. Solving large linear systems with multiple right-hand sides. PhD thesis, L'Institut National des Sciences Appliquées de Toulouse, 2003.

[46] R. Lehoucq and K. Maschhoff. Implementation of an implicitly restarted block Arnoldi method. Preprint MCS-P649-0297, Argonne National Lab, 1997.

[47] R. B. Lehoucq, D. C. Sorensen, and C. Yang. ARPACK user's guide: Solution of large scale eigenvalue problems by implicitly restarted Arnoldi methods, 1997.

[48] Hatem Ltaief, Jakub Kurzak, and Jack Dongarra. Parallel two-stage Hessenberg reduction using tile algorithms for multicore architectures. Technical Report 208, LAPACK Working Note, 2004.

[49] J. Moller. Implementations of the implicitly restarted block Arnoldi method. Technical report, Royal Institute of Technology (KTH), Dept. of Numerical Analysis and Computer Science, 2004.

[50] Ronald B. Morgan. On restarting the Arnoldi method for large nonsymmetric eigenvalue problems. Math. Comp., 65:1213-1230, 1996.

[51] Ronald B. Morgan. Restarted block-GMRES with deflation of eigenvalues. Appl. Numer. Math., 54:222-236, 2005.

[52] Margreet Nool and Auke van der Ploeg. A parallel Jacobi-Davidson-type method for solving large generalized eigenvalue problems in magnetohydrodynamics. SIAM J. Sci. Comput., 22:95-112 (electronic), 2000.


[53] G. Quintana-Ortí, E. S. Quintana-Ortí, R. A. van de Geijn, F. G. Van Zee, and Ernie Chan. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software, 36, 2009.

[54] Gregorio Quintana-Ortí and Robert van de Geijn. Improving the performance of reduction to Hessenberg form. ACM Trans. Math. Software, 32:180-194, 2006.

[55] Yousef Saad. Numerical Methods for Large Eigenvalue Problems. SIAM, 2011.

[56] Robert Schreiber and Charles Van Loan. A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Statist. Comput., 10:53-57, 1989.

[57] Jennifer A. Scott. An Arnoldi code for computing selected eigenvalues of sparse, real, unsymmetric matrices. ACM Trans. Math. Software, 21:432-475, 1995.

[58] Gerard L. G. Sleijpen and Henk A. Van der Vorst. A Jacobi-Davidson iteration method for linear eigenvalue problems. SIAM J. Matrix Anal. Appl., 17:401-425, 1996.

[59] D. C. Sorensen. Implicit application of polynomial filters in a k-step Arnoldi method. SIAM J. Matrix Anal. Appl., 13:357-385, 1992.

[60] A. Stathopoulos and K. Wu. A block orthogonalization procedure with constant synchronization requirements. SIAM Journal on Scientific Computing, 23:2165-2182, 2002.

[61] Andreas Stathopoulos and James R. McCombs. Nearly optimal preconditioned methods for Hermitian eigenproblems under limited memory. II. Seeking many eigenvalues. SIAM J. Sci. Comput., 29:2162-2188 (electronic), 2007.

[62] G. W. Stewart. A parallel implementation of the QR-algorithm. In Proceedings of the international conference on vector and parallel computing: issues in applied research and development (Loen, 1986), volume 5, pages 187-196, 1987.

[63] G. W. Stewart. Matrix Algorithms Volume II: Eigensystems. SIAM, 2001.

[64] G. W. Stewart. A Krylov-Schur algorithm for large eigenproblems. SIAM J. Matrix Anal. Appl., 23:601-614 (electronic), 2001/02.

[65] S. Tomov, J. Dongarra, V. Volkov, and J. Demmel. Magma library, version 0.1, 08/2009.

[66] Lloyd N. Trefethen and David Bau III. Numerical Linear Algebra. SIAM, 1997.

[67] Homer F. Walker. Implementation of the GMRES method using Householder transformations. SIAM J. Sci. Statist. Comput., 9:152-163, 1988.


[68] Shunxu Wang. A parallel refined Jacobi-Davidson method for quadratic eigenvalue problems. In Parallel Architectures, Algorithms and Programming (PAAP), 2010 Third International Symposium on, pages 111-115, 2010.

[69] David S. Watkins. The transmission of shifts and shift blurring in the QR algorithm, 1992.

[70] David S. Watkins. The Matrix Eigenvalue Problem. SIAM, 2007.

[71] Zhaojun Bai, James Demmel, Jack Dongarra, Axel Ruhe, and Henk van der Vorst, editors. Templates for the Solution of Algebraic Eigenvalue Problems. Society for Industrial and Applied Mathematics, 2000.

[72] Yunkai Zhou and Yousef Saad. Block Krylov-Schur method for large symmetric eigenvalue problems. Numer. Algorithms, 47:341-359, 2008.