Citation
On preconditioning for linear equations and eigenvalue problems

Material Information

Title:
On preconditioning for linear equations and eigenvalue problems
Creator:
Lashuk, Ilya
Publication Date:
Language:
English
Physical Description:
xiv, 116 leaves : ; 28 cm

Subjects

Subjects / Keywords:
Eigenvalues ( lcsh )
Differential equations, Linear ( lcsh )
Conjugate gradient methods ( lcsh )
Conjugate gradient methods ( fast )
Differential equations, Linear ( fast )
Eigenvalues ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 110-116).
General Note:
Department of Mathematical and Statistical Sciences
Statement of Responsibility:
by Ilya Lashuk.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
230742709 ( OCLC )
ocn230742709
Classification:
LD1193.L622 2007d L37 ( lcc )

Full Text
ON PRECONDITIONING FOR LINEAR EQUATIONS AND EIGENVALUE
PROBLEMS
by
Ilya Lashuk
M.S., Moscow State University, 2002
A thesis submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Applied Mathematics
2007


This thesis for the Doctor of Philosophy
degree by
Ilya Lashuk
has been approved
by
Andrew Knyazev
Krzysztof Cios
Leo Franca
Tien
Date
Jan Mandel


Lashuk, Ilya (Ph.D., Applied Mathematics)
On Preconditioning for Linear Equations and Eigenvalue Problems
Thesis directed by Professor Andrew Knyazev
ABSTRACT
In this dissertation, we focus on three areas of research related to precondi-
tioning for linear systems and eigenvalue problems.
We analyze the behavior of the conjugate gradient method for linear sys-
tems with variable preconditioning. We consider the case where the precondi-
tioner is widely variable, i.e., preconditioners used on different iterations are
essentially unrelated. We assume that the preconditioner on every step is sym-
metric and positive definite and that the condition number of the preconditioned
system matrix is bounded above by a constant independent of the step number.
Our main result is of a negative nature: we show that the conjugate gradient
method with variable preconditioning may not yield any improvement relative
to the preconditioned steepest descent method.
We describe a Matlab implementation of the element-based algebraic multi-
grid (AMG) methods that target linear systems of equations coming from finite
element discretizations of elliptic partial differential equations. The individual
element information (element matrices and element topology) is the main input
to construct the AMG hierarchy. We analyze a number of variants of the spectral
(based on solving a large number of local eigenvalue problems) agglomerate
element based AMG method. The core of the algorithms relies on element ag-
glomeration utilizing the element topology built recursively from fine to coarse
levels. We investigate strategies for adaptive AMG as well as multigrid cycles
that are more expensive than the V-cycle utilizing simple interpolation matri-
ces and nested conjugate gradient based recursive calls between the levels. We
perform an extensive set of numerical experiments.
We describe our software package Block Locally Optimal Preconditioned
Eigenvalue Xolvers (BLOPEX), which includes the Locally Optimal Block Pre-
conditioned Conjugate Gradient method for symmetric eigenvalue problems.
We demonstrate numerical scalability of BLOPEX by testing it on a number of
distributed and shared memory parallel systems, including a Beowulf system, a
SUN Fire 880, an AMD dual-core Opteron workstation, and an IBM BlueGene/L
supercomputer, using PETSc (Portable, Extensible Toolkit for Scientific Com-
putation by Argonne National Laboratory) domain decomposition and hypre
(High Performance Preconditioners by Lawrence Livermore National Labora-
tory) multigrid preconditioning.
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.
Signed
Andrew Knyazev


ACKNOWLEDGMENT
I thank my advisor, Andrew Knyazev, for his support and guidance during all
stages of my research and dissertation preparation. I am grateful to Panayot
Vassilevski. Chapter 3 of this dissertation is based on the research that I have
performed under his supervision at the Lawrence Livermore National Laboratory
(LLNL). I am indebted to Merico Argentati, Julien Langou, and Jan Mandel
for advice and helpful discussions.
I thank the Department of Mathematical Sciences at the University of Col-
orado Denver (UCD) for providing an exciting learning environment. I acknowl-
edge financial support from the Department of Mathematical Sciences UCD
(one-year Ph.D. fellowship), the National Science Foundation (two years of research
assistantship), and the LLNL (two summer fellowships).


CONTENTS

Figures  ix
Tables  xii

Chapter
1. Introduction  1
2. Variable preconditioning  9
2.1 Introduction  9
2.2 Pointed circular cones represent sets of SPD matrices with varying condition numbers  14
2.3 Local optimality of the method with variable preconditioning  16
2.4 Convergence rate bounds for variable preconditioning  20
2.5 The convergence rate bound is sharp  21
2.6 Complex Hermitian case  23
2.7 Practical PCG algorithms  27
2.8 Numerical experiments  29
2.8.1 Numerical illustration of the main results  29
2.8.2 Inner-outer iterations as variable preconditioning  31
2.8.3 Random vs. fixed preconditioning  32
2.9 Conclusions  34
3. Algebraic multigrid via element agglomeration  36
3.1 Introduction  36
3.2 Basic building tools  38
3.2.1 Input data  38
3.2.2 Agglomerating elements  40
3.2.3 Building minimal intersection sets  40
3.2.4 Forming Schur complements, computing eigenvalues and building tentative prolongators  41
3.2.5 Proceeding by recursion  43
3.2.6 On the theoretical justification of the spectral AMGe approach  45
3.2.7 The nonlinear nested-CG AMLI-type multigrid cycles  47
3.3 Adaptivity in the element agglomeration AMGe framework  48
3.4 Numerical results for spectral agglomerate AMGe  54
3.5 Tests with tentative prolongators within nested CG cycles  64
3.6 Adaptive AMG tests  67
3.7 Conclusions  77
4. BLOPEX  79
4.1 Introduction  79
4.2 LOBPCG and its Implementations  80
4.2.1 LOBPCG Ideas and Description  81
4.2.1.1 The problem and the assumptions  81
4.2.1.2 Single-Vector LOBPCG  82
4.2.1.3 The LOBPCG Basics  84
4.2.2 The Detailed Description of the LOBPCG Algorithm  85
4.2.3 LOBPCG deflation by hard and soft locking  91
4.3 Abstract BLOPEX implementation for PETSc and hypre  94
4.4 BLOPEX LOBPCG Numerical Results in PETSc and hypre  96
4.4.1 The Basic Accuracy of Algorithm  96
4.4.2 Performance Versus the Number of Inner Iterations  97
4.4.3 LOBPCG Performance vs. Block Size  99
4.4.4 Scalability  99
4.5 Conclusions  101

Appendix
A. BLOPEX appendix  103
A.1 Description of all LOBPCG-BLOPEX variables  103
A.2 Installing and testing BLOPEX with PETSc and hypre  104
A.3 Computers used for testing  106
A.4 Acronyms and other names  107
A.4.1 Organizations  107
A.4.2 Software packages  107
A.4.3 hypre interfaces and preconditioners  108
A.4.4 Methods  109
References  110


LIST OF FIGURES

Figure
2.1 Algorithm 2.7.1 with (2.13) fails to provide the PSD convergence rate.  30
2.2 The PSD and PCG methods with preconditioning by inner CG with different stopping criteria (0.2, 0.4, 0.6 and 0.8, from the bottom to the top).  31
2.3 Two-grid preconditioning with fixed (left) and random (right) coarse grids.  33
3.1 Agglomerated elements.  41
3.2 Minimal intersection sets.  42
3.3 One of coarsest-grid basis functions. Laplacian on square grid, 2048 elements, 1089 dofs. "Fully-harmonic" prolongator is used.  58
3.4 Residual decrease history for some runs in tables 3.3, 3.4.  59
3.5 One of coarsest-grid basis functions. Laplacian on unstructured grid, 6400 elements, 3321 dofs. "Fully-harmonic" prolongator is used. Coarsest grid contains 8 elements.  59
3.6 One of coarsest-grid basis functions. Anisotropic diffusion on square grid, 2048 elements, 1089 dofs. "Fully-harmonic" prolongator is used. Note that this is one function, not a sum of two functions.  61
3.7 Residual decrease history for some runs in tables 3.8, 3.7. In these runs, additional spectral coarse dofs were used (r = 0.15).  62
3.8 Residual decrease history for some runs in tables 3.8, 3.7. In these runs, additional spectral coarse dofs were NOT used (r = 0).  63
3.9 One of coarsest-grid basis functions. Anisotropic diffusion on unstructured grid, 6400 elements, 3321 dofs. "Fully-harmonic" prolongator is used. Coarsest grid contains 2 elements. This function corresponds to additional spectral coarse dofs.  63
3.10 Residual decrease history for some runs in tables 3.9, 3.10.  65
3.11 One of coarsest-grid basis functions. 2d Laplacian on square grid, 2048 elements, 1089 dofs. "Tentative" prolongator is used. Coarsest grid contains 4 elements.  66
3.12 Residual decrease history for some runs in tables 3.11, 3.12.  67
3.13 One of coarsest-grid basis functions. 2d Laplacian on unstructured grid, 6400 elements, 3321 dofs. "Tentative" prolongator is used. Coarsest grid contains 4 elements.  68
3.14 Residual decrease history for some runs in tables 3.13, 3.14.  69
3.15 One of coarsest-grid basis functions. 2d anisotropic diffusion on unstructured grid, 25600 elements, 13041 dofs. "Tentative" prolongator is used. Coarsest grid contains 4 elements.  70
3.16 Residual decrease history for some runs in tables 3.15, 3.16.  72
3.17 Residual decrease history for some runs in tables 3.17, 3.18.  73
3.18 Residual decrease history for some runs in tables 3.19, 3.20.  74
3.19 One of coarsest-grid basis functions, u_x component. Adaptive AMG for FOSLS Helmholtz, k^2 = 200, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used.  75
3.20 One of coarsest-grid basis functions, u_y component. Adaptive AMG for FOSLS Helmholtz, k^2 = 200, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used.  75
3.21 One of coarsest-grid basis functions, p component. Adaptive AMG for FOSLS Helmholtz, k^2 = 200, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used.  76
3.22 Adaptivity: FOSLS Helmholtz, k^2 = 200, 6400 elements, 9963 dofs. "Partially-harmonic" prolongator is used inside "nested CG cycle". Complexity and convergence rate depending on the number of incorporated vectors. The convergence rate is for the stationary method.  76
4.1 Residual norms (left) and eigenvalue errors (right) for 10 eigenpairs in LOBPCG with soft locking. Soft locked (the bottom 6) eigenpairs continue to improve.  94
4.2 BLOPEX hypre and PETSc software modules.  95
4.3 LOBPCG Performance vs. Block Size. Preconditioner: hypre BoomerAMG.  100
4.4 LOBPCG scalability, 1M unknowns per node on Beowulf.  101


LIST OF TABLES

Table
3.1 2d Laplacian on square grid. 2048 elements, 1089 dofs. "Harmonic" prolongator is used inside V-cycle. No CG acceleration is used.  57
3.2 2d Laplacian on square grid. 32768 elements, 16641 dofs. "Harmonic" prolongator is used inside V-cycle. No CG acceleration is used.  57
3.3 2d Laplacian on unstructured grid, 6400 elements, 3321 dofs. "Harmonic" prolongator is used inside V-cycle. Conjugate gradient acceleration is used.  57
3.4 2d Laplacian on unstructured grid, 25600 elements, 13041 dofs. "Harmonic" prolongator is used inside V-cycle. Conjugate gradient acceleration is used.  58
3.5 2d anisotropic diffusion on square grid. Anisotropy is grid-aligned. 2048 elements, 1089 dofs. "Harmonic" prolongator is used inside V-cycle. No CG acceleration is used.  60
3.6 2d anisotropic diffusion on square grid. Anisotropy is grid-aligned. 32768 elements, 16641 dofs. "Harmonic" prolongator is used inside V-cycle. No CG acceleration is used.  60
3.7 2d anisotropic diffusion on unstructured grid, 25600 elements, 13041 dofs. "Harmonic" prolongator is used inside V-cycle. Conjugate gradient acceleration is used.  61
3.8 2d anisotropic diffusion on unstructured grid, 6400 elements, 3321 dofs. "Harmonic" prolongator is used inside V-cycle. Conjugate gradient acceleration is used.  62
3.9 2d Laplacian on square grid, 2048 elements, 1089 dofs. "Tentative" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.  64
3.10 2d Laplacian on square grid, 32768 elements, 16641 dofs. "Tentative" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.  65
3.11 2d Laplacian on unstructured grid, 6400 elements, 3321 dofs. "Tentative" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.  65
3.12 2d Laplacian on unstructured grid, 25600 elements, 13041 dofs. "Tentative" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.  66
3.13 2d anisotropic diffusion on unstructured grid, 6400 elements, 3321 dofs. "Tentative" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.  68
3.14 2d anisotropic diffusion on unstructured grid, 25600 elements, 13041 dofs. "Tentative" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.  69
3.15 Adaptivity: FOSLS Helmholtz, k^2 = 0, curl penalty = 1, 6400 elements, 9963 dofs. "Partially-harmonic" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.  71
3.16 Adaptivity: FOSLS Helmholtz, k^2 = 0, curl penalty = 1, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.  71
3.17 Adaptivity: FOSLS Helmholtz, k^2 = 100, curl penalty = 1, 6400 elements, 9963 dofs. "Partially-harmonic" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.  71
3.18 Adaptivity: FOSLS Helmholtz, k^2 = 100, curl penalty = 1, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.  72
3.19 Adaptivity: FOSLS Helmholtz, k^2 = 200, curl penalty = 1, 6400 elements, 9963 dofs. "Partially-harmonic" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.  73
3.20 Adaptivity: FOSLS Helmholtz, k^2 = 200, curl penalty = 1, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.  74
4.1 Scalability for 80 x 80 x 80 = 512,000 mesh per CPU, m 1/6.  100
4.2 Scalability for 50 x 50 x 50 = 125,000 mesh per CPU, m 50.  101


1. Introduction
The field of numerical linear algebra focuses primarily on two important
problems in applied mathematics: the numerical solution of linear systems of
equations and eigenvalue problems. Progress in the field and increasing capabili-
ties of computer hardware make problems of record sizes tractable. For problems
of modest sizes, robust direct methods are available. Therefore, in many cases
it is possible to obtain an automated black box solution.
Nevertheless, many linear systems and eigenvalue problems of large sizes
remain difficult to solve. One particular example of such a problem arises in the
numerical solution of partial differential equations (PDEs). Matrices emerging
from the discretization of PDEs are typically very large but sparse, i.e., the
number of non-zero elements is proportional to the matrix dimension. When
working with sparse matrices, it is possible to store only a list of non-zero ele-
ments. This makes assembling the system matrix a linear complexity task, i.e.,
requiring a storage and operation count proportional to the problem size. In
such a situation it is tempting to seek solution methods which have (nearly)
linear complexity.
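For instance, in the coordinate format a sparse matrix is kept as a list of (row, column, value) triples. The following small SciPy fragment, added here only as an illustration and not part of the thesis software, assembles a tridiagonal matrix this way:

    from scipy.sparse import coo_matrix

    # Tridiagonal matrix stored as a list of (row, column, value) triples:
    # storage and assembly cost both grow linearly with the dimension n.
    n = 5
    rows, cols, vals = [], [], []
    for i in range(n):
        rows.append(i); cols.append(i); vals.append(2.0)
        if i + 1 < n:
            rows.append(i); cols.append(i + 1); vals.append(-1.0)
            rows.append(i + 1); cols.append(i); vals.append(-1.0)
    A = coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
    print(A.nnz)                                # 13 stored nonzeros instead of n * n = 25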
Krylov subspace methods (e.g., Greenbaum [23]) are well-known both for
solution of linear systems and eigenvalue computations. For many such methods,
each iteration has linear complexity. Thus, the overall complexity of the method
is linear, as long as the number of iterations (required to achieve a desired
accuracy of the approximate solution) does not grow with the problem size.


However, without use of preconditioning (discussed below), the required number
of iterations typically becomes unbounded as the problem size goes to infinity.
In order to keep the number of iterations bounded, Krylov methods are
enhanced with preconditioning techniques. The preconditioned steepest descent
(PSD) method, described below, is a simple but illustrative example of the
concept of preconditioning. Given a linear system Ax = f with symmetric and
positive definite (SPD) matrix A, the method can be written as
x_{k+1} = x_k + α B^{-1} (f − A x_k).

Here vector x_k is the current approximation to the solution x = A^{-1} f. We aim
to compute the vector x_{k+1}, which is the new approximation to the solution. The
scalar α is chosen to minimize a particular norm of the error e_{k+1} = x − x_{k+1}.
With an appropriate choice of the norm this scalar is computable from the known
quantities A, f and x_k.
Matrix B, called the preconditioner, is in this case assumed to be SPD and
of the same size as A. The case when B is the identity matrix corresponds to the
absence of preconditioning, i.e., to the plain steepest descent (SD) method.
Intuitively, the role of B is to approximate A, for B^{-1} to approximate A^{-1}.
Indeed, in the case B = A the method will converge in one iteration, since
B^{-1} (f − A x_k) will reduce to x − x_k, and we shall obtain x_{k+1} = x by using
α = 1.
Another intuitive interpretation is that multiplying by B^{-1} rotates the resid-
ual r_k = f − A x_k in such a way that the angle (in a particular inner product)
between the so-called preconditioned residual s_k = B^{-1} r_k and the current error
e_k = x − x_k is reduced. For example, having s_k and e_k parallel, i.e., s_k = γ e_k,
yields immediate convergence with α = 1/γ. Actually, the convergence rate on
a given iteration depends exclusively on the angle between e_k and s_k, see, e.g.,
Lemma 2.9 on page 20 of this dissertation.
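As an illustration of the iteration just described, here is a minimal NumPy sketch of the PSD method; the 1D Laplacian test matrix and the Jacobi (diagonal) choice of B are assumptions made only for this added example:

    import numpy as np

    def psd(A, B_inv, f, x0, iters=500):
        """Preconditioned steepest descent: x_{k+1} = x_k + alpha B^{-1}(f - A x_k),
        with alpha chosen to minimize the A-norm of the error."""
        x = x0.copy()
        for _ in range(iters):
            r = f - A @ x                       # residual
            s = B_inv @ r                       # preconditioned residual
            alpha = (r @ s) / (s @ (A @ s))     # minimizes ||e_{k+1}||_A
            x = x + alpha * s
        return x

    n = 100
    A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
    B_inv = np.diag(1.0 / np.diag(A))           # Jacobi choice of B, for illustration
    f = np.random.default_rng(0).standard_normal(n)
    x = psd(A, B_inv, f, np.zeros(n))
    print(np.linalg.norm(f - A @ x))            # residual norm after the iterations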
It follows from the discussion above that it is desirable for the preconditioner
to satisfy the following requirements. First, its quality, i.e., how well B^{-1}
approximates A^{-1}, must not degrade as the problem size increases. Informally
speaking, this is what keeps the number of iterations bounded. Second, the
action T = B^{-1} of applying the preconditioner must itself have linear complexity.
It turns out that a PSD method can be viewed as an SD method applied
to the transformed system B^{-1} A x = B^{-1} f and using a different inner prod-
uct (with respect to which, in particular, the matrix B^{-1} A is symmetric). For the
PSD method itself, this property is of moderate importance. However, a similar
observation is crucial in establishing convergence properties of other Krylov sub-
space methods, such as the preconditioned conjugate gradient (PCG) method.
The interpretation of preconditioning as a transformation of the original
linear system may not be appropriate for all practical situations, e.g., where
the preconditioner is variable, i.e., changes from iteration to iteration. Such a
situation can arise, e.g., if the preconditioner involves inner iterations.
Analysis of the PCG method with variable preconditioning constitutes
Chapter 2 of this thesis. The chapter is based on the paper Knyazev and Lashuk
[40]. An extended version is published as a technical report Knyazev and Lashuk
[41]. We consider the case where the preconditioner is widely variable, that is,
preconditioners used on different iterations are essentially unrelated. We assume
that the preconditioner is SPD on each step, and that the condition number of
the preconditioned system matrix is bounded above by a constant independent
of the step number. We do not impose any other restrictions on the sequence of
preconditioners. Our main result in this chapter is of a negative nature: we show
that the PCG method with variable preconditioning under our assumptions may
not yield an improvement, relative to the PSD method.
In particular, in Section 2.2 we prove that a nonzero real vector multiplied by
all real SPD matrices with a condition number bounded by a constant generates
a pointed circular cone. This observation is used later in Section 2.5 to derive
the main result.
In Section 2.3 we discuss some basic properties of the PCG method with
variable preconditioning. These properties are later used in Section 2.4 to derive
the convergence rate bound for the method. Specifically, in Section 2.4 we prove
that the standard PSD convergence rate bound still holds for PCG with variable
preconditioning.
As was mentioned before, in Section 2.5 we present our main result: it turns
out, that the PSD convergence rate bound is sharp for the PCG method with
variable preconditioning (under our assumptions). That is, we can construct a
sequence of preconditioners such that PSD convergence rate bound is attained
on every iteration.
In Sections 2.7 and 2.8 we discuss PCG algorithms used in practice (that
is, their sensitivity to variable preconditioning) and provide some numerical
experiments.
In Chapter 3 we turn our attention to one particular type of preconditioner:
the multilevel methods. Multigrid (MG) (e.g., Trottenberg et al. [55]) is one of
the most efficient and natural methods for solving linear systems of equations
coming from partial differential equations discretized on a sequence of grids. In
algebraic multigrid (AMG), Brandt et al. [11, 12], Ruge and Stüben [48], Stüben
[53], the necessary MG components (coarse grids, coarse-grid operators, interpo-
lation operators) are built by the solver algorithm, as opposed to geometric multi-
grid, where these components are naturally given by the discretization. An ex-
treme case of an algebraic multigrid approach would lead to a black box solver,
i.e., an algorithm which would only use the linear system matrix as input data.
In practice, all AMG methods utilize some additional (often assumed) informa-
tion about the class of problems they are applied to. In this chapter we deal with
the class of problems that come from finite element discretizations of elliptic PDEs.
More specifically, we focus on variants of the element-based AMG (or AMGe)
methods developed in Brezina et al. [13], Chartier et al. [15, 16], Henson and
Vassilevski [24], Jones and Vassilevski [28].
We have implemented several algorithms in Matlab in the framework of
the element agglomeration spectral AMGe (Chartier et al. [16]) and illustrate
their performance. Our Matlab implementation allows for further extensions
and offers potential for illustrating other element based AMG algorithms.
The chapter, based on the paper Lashuk and Vassilevski [42], is structured
as follows. Section 3.2 describes the main building tools of the implemented
algorithms. In particular, in Section 3.2.1 we describe the required input, Section
3.2.2 reviews some details on element agglomeration, and in § 3.2.3 the so-
called minimal intersection sets are introduced. These sets are later used, in
Section 3.2.4, to define the coarse degrees of freedom by solving local eigenvalue
problems (associated with each minimal intersection set). The recursive nature
of the algorithm is summarized in § 3.2.5. The spectral way of selecting coarse
degrees of freedom naturally leads to the construction of tentative prolongators.
In § 3.2.6 the need to improve on the stability of the tentative prolongator is
outlined.
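To give a flavor of the spectral selection mentioned above, the following schematic NumPy fragment picks low-energy eigenvectors of a single local matrix as local coarse basis functions; the local matrix, the threshold, and the function name are hypothetical placeholders added for illustration and do not reproduce the actual AMGe construction of Chapter 3:

    import numpy as np

    def local_coarse_basis(A_loc, tol=0.1):
        """Keep eigenvectors of a local matrix whose eigenvalues fall below
        tol times the largest eigenvalue; they become the columns of the local
        block of a tentative prolongator."""
        w, v = np.linalg.eigh(A_loc)            # A_loc is symmetric
        keep = w <= tol * w[-1]                 # low-energy modes define coarse dofs
        return v[:, keep]

    # toy local matrix standing in for an agglomerate / intersection-set matrix
    A_loc = 2.0 * np.eye(6) - np.eye(6, k=1) - np.eye(6, k=-1)
    P_loc = local_coarse_basis(A_loc)
    print(P_loc.shape)                          # (6, number of local coarse dofs)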
Another direction of extending the presented method that we pursue utilizes
more sophisticated multigrid cycles that are based on inner (between the levels)
preconditioned conjugate gradient (CG) iterations. Such an idea goes back to
the non-linear AMLI methods in Axelsson and Vassilevski [5]. The respective
method, referred to as the nested-CG (AMLI) cycle, is described in Section 3.2.7.
Section 3.3 contains a description of a specific implementation of the adap-
tive AMG method (cf., Brezina et al. [14]) in the present AMGe setting. We
consider also an option which does not utilize the actual element matrices (but
does use the element topology relations).
The V-cycle and nested-CG cycle are illustrated in Sections 3.4 and 3.5,
whereas the results utilizing AMG adaptivity are found in the last section 3.6.
In Chapter 4 of this dissertation we describe a new software package,
Block Locally Optimal Preconditioned Eigenvalue Xolvers (BLOPEX) revision
1.0, which has recently been publicly released. The chapter is based on the papers Knyazev
et al. [39], Lashuk et al. [43]. BLOPEX is available as a stand-alone serial li-
brary, which can be downloaded from http://math.cudenver.edu/~aknyazev/
software/BLOPEX/, as an external package to the Portable, Extensible Toolkit
for Scientific Computation (PETSc) Balay et al. [6], and is built into the High
Performance Preconditioners (hypre) library Falgout et al. [19, 20]. Jose Ro-
man has also written a Scalable Library for Eigenvalue Problem Computations
(SLEPc), Hernandez et al. [25], interface to our hypre BLOPEX.
The native hypre BLOPEX version efficiently and directly takes advantage of
powerful hypre multigrid preconditioners, both algebraic (AMG, called Boomer-
AMG in hypre) and geometric or structured (SMG). BLOPEX in PETSc gives
the PETSc community easy access to the customizable code of a preconditioned
eigensolver.
At present, BLOPEX includes only one solver, the Locally Optimal Block
Preconditioned Conjugate Gradient (LOBPCG) Knyazev [32, 35] method for
eigenproblems Ax = λBx with large, possibly sparse, symmetric matrix A
and symmetric positive definite matrix B. Preconditioned eigenvalue solvers
in general, see Knyazev [34, 35], and in particular the LOBPCG method have
recently attracted attention as potential competitors to (block-) Lanczos meth-
ods. LOBPCG has been implemented using different computer languages in
a number of other software packages: C++ in Anasazi Trilinos Arbenz et al.
[1], Heroux et al. [26] and in NGSolve Zaglmayr [62], Sections 7.2-7.4; C in
PRIMME Stathopoulos [52]; FORTRAN 77 in PLTMG Bank [7]; FORTRAN
90 in ABINIT Bottin et al. [9] and in PESCAN Tomov et al. [54]; Python in
SciPy by Robert Cimrman and in PYFEMax by Peter Arbenz and Roman Geus.
LOBPCG has been independently tested in Yamada et al. [59, 60] for the Fermion-
Hubbard Model on Earth Simulator CDIR/MPI; in Arbenz et al. [1] and Borzi
and Borzi [8] using AMG preconditioning for vibration problems; and in To-
mov et al. [54], Yang et al. [61] for electronic structure calculations. Our hypre
BLOPEX version has been used in Bramble et al. [10] for the Maxwell problem.
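For reference, the SciPy implementation of LOBPCG mentioned above can be invoked as in the following generic sketch; the 1D Laplacian, the block size, and the Jacobi preconditioner are illustrative assumptions, not the BLOPEX setup used in Chapter 4:

    import numpy as np
    from scipy.sparse import diags
    from scipy.sparse.linalg import lobpcg, LinearOperator

    n = 1000
    A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")   # 1D Laplacian
    rng = np.random.default_rng(0)
    X = rng.standard_normal((n, 5))                                        # block of 5 initial vectors
    M = LinearOperator((n, n), matvec=lambda r: r / A.diagonal())          # Jacobi preconditioner
    eigenvalues, eigenvectors = lobpcg(A, X, M=M, largest=False, tol=1e-6, maxiter=200)
    print(eigenvalues)                                                     # the 5 smallest eigenvalues of A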


Section 4.2 contains a complete and detailed description of the LOBPCG
algorithm as implemented in BLOPEX. We discuss our specific abstract im-
plementation approach in section 4.3. Parallel performance on distributed and
shared memory systems, including a Beowulf system, a SUN Fire 880, an AMD
dual-core Opteron workstation, and an IBM BlueGene/L supercomputer, using do-
main decomposition and multigrid preconditioning, is discussed in section 4.4.
My personal contribution to the work described in the thesis is as follows.
Chapter 2: Formulation and the proof of the main result (Theorem 2.11) and
most of the other statements. A geometric interpretation (Theorem 2.1) and the idea
of local optimality (Lemma 2.6) are suggested by Andrew Knyazev. Numerical
experiments have been designed and performed jointly. Chapter 3: Matlab
implementation of different AMGe methods, all numerical tests and the idea of
using the so-called fully-harmonic interpolation for spectral AMGe, see Section
3.4. Chapter 4: Implementation in C of the LOBPCG eigensolver in hypre and
PETSc versions of BLOPEX and numerical testing on SUN, Beowulf and IBM
BlueGene/L parallel computers.


2. Variable preconditioning
2.1 Introduction
Preconditioning, a transformation, usually implicit, of the original linear
system aiming at accelerating the convergence of the approximations to the so-
lution, is typically a necessary part of an efficient iterative technique. Modern
preconditioning, e.g., based on so-called algebraic multilevel and domain de-
composition methods, attempts to become as close to a black box ideal of
direct solvers as possible. In this attempt, the mathematical structure of the
preconditioner, which in the classical case is regarded as some linear transfor-
mation, may become very complex, in particular, the linearity can be easily
lost, e.g., if the preconditioning itself involves inner iterative solvers. The
fact that the preconditioner may be nonlinear, or variable, i.e., changing from
iteration to iteration, may drastically affect the known theory as well as the
practical behavior of preconditioned iterative methods and therefore needs spe-
cial attention. Our main result is that the conjugate gradient (CG) method
with variable preconditioning in certain situations may not give improvement,
compared to the steepest descent (SD) method for solving a linear system with
a real symmetric positive definite (SPD) matrix of coefficients. We assume that
the preconditioner is SPD on each step, and that the condition number of the
preconditioned system matrix is bounded above by a constant.
This chapter is based on the paper Knyazev and Lashuk [40]. An extended
version is published as a technical report Knyazev and Lashuk [41].


Let us now introduce the notation, so that we can formulate the main result
mathematically. Let A be a real SPD matrix, (x, y) be the standard inner
product of real vectors x and y, so that (Ax, y) = (x, Ay), and let ||x|| = sqrt((x, x))
be the corresponding vector norm. We also use ||·|| to denote the operator
norm. The A-inner product and the A-norm are denoted by (x, y)_A = (x, Ay)
and ||x||_A = sqrt((x, x)_A).
We consider a family of iterative methods to obtain a sequence of approximate
solutions x_k of a linear system Ax = b and use the A-norm to measure the
error e_k = x − x_k. The SD and CG methods are well-known iterative procedures
that fit into our framework. To accelerate the convergence of the error e_k to zero
we introduce preconditioning, i.e., on every iteration k an operator B_k, called
the preconditioner, possibly different for each iteration k, is applied to the residual
r_k = b − A x_k. A general algorithm, which includes the preconditioned SD or
CG (PSD or PCG, respectively) methods as particular cases, can be presented
as follows, e.g., Axelsson [4], p. 540 and Axelsson and Vassilevski [2], Algorithm
5.3: given A, b, {B_k}, {m_k}, x_0, for k = 0, 1, ...: r_k = b − A x_k, s_k = B_k^{-1} r_k, and

p_k = s_k − \sum_{l=k−m_k}^{k−1} [(A s_k, p_l) / (A p_l, p_l)] p_l,
x_{k+1} = x_k + [(r_k, p_k) / (A p_k, p_k)] p_k,                        (2.1)

where

0 ≤ m_k ≤ k and m_{k+1} ≤ m_k + 1.                                      (2.2)
The latter condition is highlighted in Notay [45], p. 1447, line 1 and ensures
that the formula for p_k in (2.1) performs the standard Gram-Schmidt
A-orthogonalizations to previous search directions, which are already pairwise
A-orthogonal. The full orthogonalization that performs explicit A-orthogonalizations
to all previous search directions corresponds to m_k = k. Choosing
m_k = min{k, 1} gives the PCG method, e.g., described in Golub and Ye [22],
IPCG Algorithm. The connection of this PCG method to the commonly used
PCG algorithm is discussed in section 2.7 following Golub and Ye [22], Remark
2.3. The shortest recursion m_k = 0 leads to the standard PSD method.
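A compact NumPy transcription of method (2.1)-(2.2) is sketched below; the truncation parameter m plays the role of m_k = min{k, m}, so m = 0 gives PSD and m = 1 the PCG recursion. The test matrix and the Jacobi preconditioner are assumptions made for this added illustration only:

    import numpy as np

    def method_2_1(A, f, x0, precond, m=1, iters=300):
        """Method (2.1): s_k = B_k^{-1} r_k, A-orthogonalize against the last m
        directions (m_k = min{k, m}), then take the optimal step along p_k.
        m = 0 reproduces PSD, m = 1 the PCG recursion."""
        x = x0.copy()
        P = []                                          # stored search directions p_l
        for k in range(iters):
            r = f - A @ x
            s = precond(r, k)                           # B_k may change with k
            p = s.copy()
            for q in (P[-m:] if m > 0 else []):         # Gram-Schmidt in the A-inner product
                p = p - (s @ (A @ q)) / (q @ (A @ q)) * q
            x = x + (r @ p) / (p @ (A @ p)) * p
            P.append(p)
        return x

    n = 200
    A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
    f = np.ones(n)
    jacobi = lambda r, k: r / np.diag(A)                # a fixed Jacobi preconditioner, for illustration
    x = method_2_1(A, f, np.zeros(n), jacobi, m=1, iters=300)
    print(np.linalg.norm(f - A @ x))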
It is well-known, e.g., D'yakonov [18], p. 34 and Axelsson [4], Sec. 11.1.2,
p. 458, that if the preconditioner is SPD and fixed, B_k = B = B* > 0, a
preconditioned method, such as (2.1), using the preconditioner B can be viewed
as the corresponding unpreconditioned method applied to the preconditioned
system B^{-1} A x = B^{-1} b in the B-based inner product (x, y)_B = (x, B y). This
implies that the theory obtained for unpreconditioned methods remains valid
for preconditioned methods; in particular, the A-orthogonalization terms with
l < k − 1 in the sum in (2.1) vanish in exact arithmetic, e.g., Axelsson [4],
Sec. 11.2.6, Th. 11.5. The situation changes dramatically, however, if different
preconditioners B_k are used in the PCG method.
We study the behavior of method (2.1), where the preconditioner B_k varies
from step to step, but remains SPD on each step and the spectral condition
number κ(B_k^{-1} A) = λ_max(B_k^{-1} A) / λ_min(B_k^{-1} A) is bounded above by some
constant κ_max independent of the step number k. We note that the matrix B_k^{-1} A is
SPD with respect to, e.g., the B_k inner product, so its eigenvalues are real positive.
Let us highlight that our assumption κ(B_k^{-1} A) ≤ κ_max can be equivalently
written as ||I − B_k^{-1} A||_{B_k} ≤ γ with κ_max = (1 + γ)/(1 − γ), assuming without
loss of generality that B_k is scaled such that λ_max(B_k^{-1} A) + λ_min(B_k^{-1} A) = 2.
Here, we only deal with methods that are invariant with respect to scaling of
B_k.
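In detail (an added verification of the equivalence just stated, using only the scaling assumption above): write λ_max(B_k^{-1} A) = 1 + γ_k and λ_min(B_k^{-1} A) = 1 − γ_k, which is possible precisely because the two extreme eigenvalues sum to 2. Since B_k^{-1} A is self-adjoint in the B_k-inner product, so is I − B_k^{-1} A, and its B_k-norm equals its spectral radius:

    \|I - B_k^{-1}A\|_{B_k} = \max_{\lambda \in \Lambda(B_k^{-1}A)} |1 - \lambda| = \gamma_k,
    \qquad
    \kappa(B_k^{-1}A) = \frac{1 + \gamma_k}{1 - \gamma_k}.

The map t ↦ (1 + t)/(1 − t) is increasing on [0, 1), so κ(B_k^{-1} A) ≤ κ_max = (1 + γ)/(1 − γ) holds if and only if γ_k ≤ γ, i.e., if and only if ||I − B_k^{-1} A||_{B_k} ≤ γ.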
The main result of this chapter is that the preconditioned method (2.1)
with (2.2) turns into the PSD method with the worst possible convergence
rate on every iteration, if the preconditioners B_k satisfying our assumption
κ(B_k^{-1} A) ≤ κ_max are chosen in a special way. We explicitly construct a variable
preconditioner that slows down the CG method to the point that the worst linear
convergence rate of the SD method is recovered. Thus one can only guarantee
that the convergence rate for the method (2.1) with (2.2) is just the same as for
the PSD method, m_k = 0, obtained in Kantorovic [29] and reproduced, e.g., in
Kantorovich and Akilov [30], Ch. XV:

||e_{k+1}||_A / ||e_k||_A ≤ (κ_max − 1) / (κ_max + 1).                  (2.3)
Our proof is geometric and is based on the simple fact, proved in section 2.2,
that a nonzero vector multiplied by all SPD matrices with a condition number
bounded by a constant generates a pointed circular cone. We apply this fact on
every iteration to the current residual vector, which becomes the center of the
cone, so all points in the cone correspond to all possible preconditioned residuals.
In a somewhat similar way, Golub and Ye [22] use the angle between the exact
and the perturbed preconditioned residuals. In the CG method context, this
cone has a nontrivial intersection with the subspace A-orthogonal to all previous
search directions. So on each iteration we can choose a preconditioner with the a
priori chosen quality, determined by κ_max, that makes enforcing A-orthogonality
with respect to all previous search directions useless.
Basic properties of method (2.1), most importantly the local optimality, are
derived in section 2.3. In section 2.4 we apply our results from section 2.2 about
the cone to obtain a new proof of estimate (2.3). In section 2.5 we analyze
the convergence of the PCG method with variable preconditioning and prove
our main result. We assume real arithmetic everywhere in this chapter, except
for section 2.6, where we show that our main results also hold for complex
Hermitian positive definite matrices. In section 2.7 we consider two particular
PCG algorithms that are often used in practice and describe their behavior with
variable preconditioning.
Our numerical experiments in section 2.8 comparing the preconditioned SD
and CG methods support and illustrate our theoretical findings, and also reveal
some potentially practically important effects. In subsection 2.8.1, we test the
widely used modification of the CG method with a simplified formula for the
scalar /?*. from section 2.7 and demonstrate that variable preconditioning can
make this modification much slower than even the SD method. In subsection
2.8.2, we analyze inner-outer iterations as variable preconditioning. Finally, in
subsection 2.8.3, we demonstrate that variable preconditioning may surprisingly
accelerate the SD and the CG compared to the use of fixed preconditioning in
the same methods.
Different aspects of variable preconditioning are considered, e.g., in Axelsson
and Vassilevski [2, 3], Axelsson [4], where rather general nonlinear precondition-
ing is introduced, and in Golub and Ye [22], Notay [45] that mainly deal with
the case when the preconditioner on each iteration approximates a fixed opera-
tor. In Axelsson and Vassilevski [3], Axelsson [4], Golub and Ye [22], Notay [45],
convergence estimates for some iterative methods with variable preconditioning
are proved. For recent results and other aspects of variable preconditioning see
Simoncini and Szyld [49, 50, 51] and references there. No attempts are appar-
ently made in the literature to obtain a result similar to ours, even though it
should appear quite natural and somewhat expected to experts in the area.
2.2 Pointed circular cones represent sets of SPD matrices with
varying condition numbers
For a pair of real non-zero vectors x and y we define the angle between x
and y in the usual way as

∠(x, y) = arccos( (x, y) / (||x|| ||y||) ) ∈ [0, π].
The following theorem is inspired by Neymeyr [44], Lemma 2.3.
Theorem 2.1. The set {Cx}, where x is a fixed nonzero real vector and C runs
through all SPD matrices with condition number κ(C) bounded above by some
κ_max, is a pointed circular cone, specifically,

{Cx : C = C* > 0, κ(C) ≤ κ_max} = { y : sin ∠(x, y) ≤ (κ_max − 1) / (κ_max + 1) }.
Theorem 2.1 can be proved by constructing our cone as the smallest pointed
cone that includes the ball considered in Neymeyr [44], Lemma 2.3. Preparing
for section 2.6 that deals with the complex case, not covered in Neymeyr [44],
we provide a direct proof here based on the following two lemmas. The first
lemma is simple and states that the set in question cannot be larger than the
cone:
Lemma 2.2. Let x be a non-zero real vector, let C be an SPD matrix with
spectral condition number κ(C). Then sin ∠(x, Cx) ≤ (κ(C) − 1) / (κ(C) + 1).

Proof. Denote y = Cx. We have (x, Cx) = (y, C^{-1} y) > 0 since C is SPD, so
y ≠ 0 and ∠(x, y) < π/2. A positive scaling of C and thus of y is obviously
irrelevant, so let us choose y to be the orthogonal projection of x onto the
1-dimensional subspace spanned by the original y. Then from elementary 2D
geometry it follows that ||y − x|| = ||x|| sin ∠(x, y). The orthogonal projection
of a vector onto a subspace is the best approximation to the vector from the
subspace, thus

||x|| sin ∠(x, y) = ||y − x|| ≤ ||s Cx − x|| ≤ ||s C − I|| ||x||

for any scalar s, where I is the identity. Taking s = 2 / (λ_max(C) + λ_min(C)),
where λ_min(C) and λ_max(C) are the minimal and maximal eigenvalues of C,
respectively, we get ||s C − I|| = (κ(C) − 1) / (κ(C) + 1).
The second lemma implies that every point in the cone can be represented
as Cx for some SPD matrix C with κ(C) determined by the opening angle of
the cone.

Lemma 2.3. Let x and y be non-zero real vectors, such that ∠(x, y) ∈ [0, π/2).
Then there exists an SPD matrix C, such that Cx = y and

(κ(C) − 1) / (κ(C) + 1) = sin ∠(x, y).
Proof. Denote α = ∠(x, y). A positive scaling of the vector y is irrelevant, so as
in the previous proof we choose y to be the orthogonal projection of x onto the
1-dimensional subspace spanned by the original y; then ||y − x|| = (sin α) ||x||,
so the vectors y − x and (sin α) x are of the same length. This implies that
there exists a Householder reflection H such that H((sin α) x) = y − x, cf.
Neymeyr [44], Lemma 2.3, so (I + (sin α) H) x = y. We define C = I + (sin α) H
to get Cx = y. Any Householder reflection is symmetric and has only two
distinct eigenvalues ±1, so C is also symmetric and has only two distinct positive
eigenvalues 1 ± sin α, as α ∈ [0, π/2), and we conclude that C > 0 and κ(C) =
(1 + sin α)/(1 − sin α).
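A quick numerical check of the construction in this proof (the random vectors below are illustrative; the code only verifies the statement and is not used anywhere in the thesis):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal(5)
    y = rng.standard_normal(5)
    y = (y @ x) / (y @ y) * y                  # rescale y: orthogonal projection of x onto span{y}

    sin_a = np.linalg.norm(y - x) / np.linalg.norm(x)     # sin of the angle between x and y
    w = (y - x) - sin_a * x
    w = w / np.linalg.norm(w)
    H = np.eye(5) - 2.0 * np.outer(w, w)       # Householder reflection with H((sin a) x) = y - x
    C = np.eye(5) + sin_a * H                  # the SPD matrix of Lemma 2.3

    print(np.allclose(C @ x, y))               # C x = y
    kappa = np.linalg.cond(C)
    print(np.isclose((kappa - 1.0) / (kappa + 1.0), sin_a))   # (kappa-1)/(kappa+1) = sin(angle)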
2.3 Local optimality of the method with variable preconditioning
Here we discuss some basic properties of method (2.1) with (2.2). We de-
rive a simple, but very useful, error propagation identity in Lemma 2.4. We
prove in Lemma 2.5 that the method is well-defined and has a certain local
A-orthogonality property, formulated without a proof in Notay [45], formulas
(2.1)-(2.2) and in the important particular case m_k = min{k, 1} proved in Golub
and Ye [22], Lemma 2.1. Using the local A-orthogonality property of Lemma
2.5, we prove the local A-optimality property in Lemma 2.6 by generalizing the
result of Golub and Ye [22], Proposition 2.2. Finally, we derive a trivial Corol-
lary 2.7 from Lemma 2.6, which uses the idea from Golub and Ye [22], p. 1309
of comparison with the PSD method, m_k = 0.
The material of this section is inspired by Golub and Ye [22] and may be
known to experts in the field, e.g., some even more general facts can be found
in Axelsson [4], Sec. 12.3.2, Lemma 12.22. We provide straightforward and
complete proofs here suitable for a general audience.
Lemma 2.4. Let A and {B_k} be SPD matrices. Suppose p_k in method (2.1) is
well-defined and nonzero. Then

e_{k+1} = e_k − [(A e_k, p_k) / (A p_k, p_k)] p_k.                      (2.4)

Proof. Recall that e_k = A^{-1} b − x_k and thus r_k = A e_k. Then (2.4) follows
immediately from the last formula in (2.1).
Lemma 2.5. Let A and {B_k} be SPD matrices and {m_k} satisfy (2.2). Then
the error, the preconditioned residual, and the direction vectors generated by
method (2.1) before the exact solution is obtained are well-defined and satisfy

(p_i, p_j)_A = 0,  k − m_k ≤ i < j ≤ k,                                 (2.5)
(e_{k+1}, s_k)_A = (e_{k+1}, p_i)_A = 0,  k − m_k ≤ i ≤ k.              (2.6)
Proof. We first notice that (2.4) for any k obviously implies

(e_{k+1}, p_k)_A = 0.                                                   (2.7)

For the rest of the proof we use an induction in k. Let us take k = 0 and suppose
x_0 ≠ x; then r_0 ≠ 0 and so s_0 ≠ 0 since B_0 is SPD. By (2.2), m_0 = 0 and thus
p_0 = s_0 ≠ 0, so in the formula for x_{k+1} we do not divide by zero, i.e., x_1 is
well defined. There is nothing to prove in (2.5) for k = 0 since m_0 = 0. Formula
(2.7) implies (e_1, p_0)_A = (e_1, s_0)_A = 0, i.e., (2.6) holds for k = 0. This provides
the basis for the induction.

Suppose the statement of the lemma holds for k − 1, which is the induction
hypothesis, i.e., up to the index k − 1 all quantities are well defined and

(p_i, p_j)_A = 0,  k − 1 − m_{k−1} ≤ i < j ≤ k − 1,                     (2.8)
(e_k, s_{k−1})_A = (e_k, p_i)_A = 0,  k − 1 − m_{k−1} ≤ i ≤ k − 1.      (2.9)

We now show by contradiction that x_k ≠ x implies p_k ≠ 0. Indeed, if p_k = 0
then s_k is a linear combination of p_{k−m_k}, ..., p_{k−1}. However, since m_k ≤ m_{k−1} + 1,
it follows from (2.9) that

(e_k, p_i)_A = 0,  k − m_k ≤ i ≤ k − 1.                                 (2.10)

Then we have (s_k, e_k)_A = 0. At the same time, since the matrix B_k^{-1} A is A-SPD,
s_k = B_k^{-1} A e_k cannot be A-orthogonal to e_k unless s_k = e_k = 0, i.e., x_k = x.

Next, we prove (2.5) by showing that the formula for p_k in (2.1) is a valid
step of the Gram-Schmidt orthogonalization process with respect to the A-based
inner product. If m_k = 0, there is nothing to prove. If m_k = 1 then (2.5) gets
reduced to (p_k, p_{k−1})_A = 0, which follows from the formula for p_k in (2.1). If
m_k ≥ 2 then condition (2.2) implies that the vectors p_{k−m_k}, ..., p_{k−1} are among
the vectors p_{k−1−m_{k−1}}, ..., p_{k−1} and therefore are already A-orthogonal by the
induction assumption (2.8). Then the formula for p_k in (2.1) is indeed a valid
step of the Gram-Schmidt orthogonalization process with respect to the A-based
inner product, so (2.5) holds.

It remains to prove (2.6). We have already established (2.5), and (2.7)-(2.10).
Equalities (2.5) and (2.10) imply that p_k and e_k are A-orthogonal to
p_{k−m_k}, ..., p_{k−1}. Equality (2.4) implies that e_{k+1} is a linear combination of
e_k and p_k. Thus, we have (e_{k+1}, p_i)_A = 0, k − m_k ≤ i ≤ k − 1. Finally, it
is enough to notice that s_k is a linear combination of p_k, p_{k−1}, ..., p_{k−m_k}, so
(e_{k+1}, s_k)_A = 0.
We now use Lemma 2.5 to prove the local optimality of method (2.1) with
(2.2), which generalizes the statement of Golub and Ye [22], Proposition 2.2.
Lemma 2.6. Under the assumptions of Lemma 2.5,

||e_{k+1}||_A = min_{p ∈ span{s_k, p_{k−m_k}, ..., p_{k−1}}} ||e_k − p||_A.

Proof. We get e_{k+1} ∈ e_k + span{s_k, p_{k−m_k}, ..., p_{k−1}} from the formula for p_k in
(2.1) and (2.4). Putting this together with the A-orthogonality relations (2.6) of the
vector e_{k+1} with all vectors that span the subspace finishes the proof.
Two important corollaries follow immediately from Lemma 2.6 by analogy
with Golub and Ye [22], Proposition 2.2.
Corollary 2.7. The A-norm of the error ||e_{k+1}||_A in method (2.1) with (2.2)
is bounded above by the A-norm of the error of one step of the PSD method,
m_k = 0, using the same x_k as the initial guess and B_k as the preconditioner,
i.e., specifically, ||e_{k+1}||_A ≤ min_α ||e_k − α s_k||_A.

Corollary 2.8. Let m_k > 0; then the A-norm of the error ||e_{k+1}||_A in method
(2.1) with (2.2) for k > 0 satisfies ||e_{k+1}||_A ≤ min_{α,β} ||e_k − α s_k − β (e_k − e_{k−1})||_A.
Proof. Under the lemma assumptions, the formula for p_k in (2.1) and (2.4)
imply that e_{k+1} ∈ e_k + span{s_k, p_{k−1}} = e_k + span{s_k, e_k − e_{k−1}}, and the
A-orthogonality relations (2.6) turn into (e_{k+1}, s_k)_A = 0 and (e_{k+1}, p_{k−1})_A =
(e_{k+1}, e_k − e_{k−1})_A = 0, so the vector e_{k+1} is A-orthogonal to both vectors that
span the subspace. As in the proof of Lemma 2.6, the local A-orthogonality
implies the local A-optimality.
Corollary 2.7 allows us in section 2.4 to estimate the convergence rate of
method (2.1) with (2.2) by comparison with the PSD method, m_k = 0; this
idea is borrowed from Golub and Ye [22], p. 1309. The results of Lemma 2.6
and Corollary 2.8 seem to indicate that an improved convergence rate bound
of method (2.1) with (2.2) can be obtained, compared to the PSD method
convergence rate bound that follows from Corollary 2.7. Our original intent
has been to combine Corollary 2.8 with convergence rate bounds of the heavy
ball method, in order to attempt to prove such an improved convergence rate
bound. However, our results of section 2.5 demonstrate that this improvement is
impossible under our only assumption κ(B_k^{-1} A) ≤ κ_max, since one can construct
such preconditioners B_k that, e.g., the minimizing value of β in Corollary 2.8 is
zero, so Corollary 2.8 gives no improvement compared to Corollary 2.7.
2.4 Convergence rate bounds for variable preconditioning
The classical Kantorovich and Akilov [30], Ch. XV convergence rate bound
(2.3) for the PSD method is local in the sense that it relates the A-norm
of the error on two subsequent iterations and does not depend on previous
iterations. Thus, it remains valid when the preconditioner Bk changes from
iteration to iteration, while the condition number κ(B_k^{-1} A) is bounded above
by some constant κ_max independent of k. The goal of this section is to give an
apparently new simple proof of the estimate (2.3) for the PSD method, based
on our cone Theorem 2.1, and to extend this statement to cover the general
method (2.1) with (2.2), using Corollary 2.7.
We denote the angle between two real nonzero vectors with respect to the
A-based inner product by

∠_A(x, y) = arccos( (x, y)_A / (||x||_A ||y||_A) ),

and express the error reduction ratio for the PSD method in terms of the angle
with respect to the A-based inner product:
Lemma 2.9. On every step of the PSD algorithm, (2.1) with m_k = 0, the error
reduction factor takes the form ||e_{k+1}||_A / ||e_k||_A = sin(∠_A(e_k, B_k^{-1} A e_k)).

Proof. By (2.6), we have (e_{k+1}, p_k)_A = 0. Now, for m_k = 0, in addition,
p_k = s_k, so 0 = (e_{k+1}, p_k)_A = (e_{k+1}, s_k)_A = (e_{k+1}, x_{k+1} − x_k)_A, i.e., the triangle
with vertices x, x_k, x_{k+1} is right-angled in the A-inner product, where the hypotenuse
is e_k = x − x_k. Therefore, ||e_{k+1}||_A / ||e_k||_A = sin(∠_A(e_k, x_{k+1} − x_k)) =
sin(∠_A(e_k, s_k)), where s_k = B_k^{-1} (b − A x_k) = B_k^{-1} A e_k by (2.1).
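A small numerical confirmation of Lemma 2.9 (an added illustration; the random SPD matrices A and B below are arbitrary examples, not taken from the thesis experiments):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 8
    G = rng.standard_normal((n, n)); A = G @ G.T + n * np.eye(n)   # SPD system matrix
    G = rng.standard_normal((n, n)); B = G @ G.T + n * np.eye(n)   # SPD preconditioner
    x = rng.standard_normal(n); xk = rng.standard_normal(n)
    b = A @ x

    e = x - xk                                 # current error e_k
    r = b - A @ xk                             # residual r_k = A e_k
    s = np.linalg.solve(B, r)                  # preconditioned residual s_k = B^{-1} r_k
    alpha = (r @ s) / (s @ (A @ s))            # PSD step length
    e_next = e - alpha * s                     # error after one PSD step

    a_norm = lambda v: np.sqrt(v @ (A @ v))
    cos_A = (e @ (A @ s)) / (a_norm(e) * a_norm(s))
    sin_A = np.sqrt(max(0.0, 1.0 - cos_A ** 2))
    print(np.isclose(a_norm(e_next) / a_norm(e), sin_A))   # True: reduction = sin of A-angle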
Let us highlight that Lemma 2.9 provides an exact expression for the error
reduction factor, not just a bound; we need this in the proof of Theorem 2.11
in the next section. Combining the results of Lemmas 2.2 and 2.9 together
immediately leads to (2.3) for the PSD method, where m_k = 0. Finally, taking
into account Corollary 2.7, by analogy with the arguments of Golub and Ye [22],
p. 1309 and decrypting a hidden statement in Golub and Ye [22], Lemma 3.5,
we get
Theorem 2.10. Convergence rate bound (2.3) holds for method (2.1) with (2.2).
2.5 The convergence rate bound is sharp
Here we formulate and prove the main result of this chapter: if one assumes
κ(B_k^{-1} A) ≤ κ_max, then the worst convergence rate for method (2.1) with (2.2) is
given by (2.3).
Let us remind the reader that (2.3) also describes the convergence rate for
the PSD method, (2.1) with m_k = 0. We now show that adding more vectors
to the PSD iterative recurrence results in no improvement in convergence, if a
specially constructed set of variable preconditioners is used.


Theorem 2.11. Let an SPD matrix A, vectors b and x_0, and κ_max > 1 be given.
Assuming that the matrix size is larger than the number of iterations, one can
choose a sequence of SPD preconditioners B_k, satisfying κ(B_k^{-1} A) ≤ κ_max, such
that method (2.1) with (2.2) turns into the PSD method, (2.1) with m_k = 0, and
on every iteration

||e_{k+1}||_A / ||e_k||_A = (κ_max − 1) / (κ_max + 1).                  (2.11)
Proof. We construct the sequence B_k by induction. First, we choose any vector
q_0, such that sin ∠_A(q_0, e_0) = (κ_max − 1)/(κ_max + 1). According to Lemma 2.3
applied in the A-inner product, there exists an A-SPD matrix C_0 with condition
number κ(C_0) = κ_max, such that C_0 e_0 = q_0. We define the SPD B_0 = A C_0^{-1};
then κ(B_0^{-1} A) = κ(C_0) = κ_max. We have s_k = B_k^{-1} A e_k, so such a choice of B_0
implies s_0 = q_0. Also, we have p_0 = s_0, i.e., the first step is always a PSD step;
thus, by Lemma 2.9 we have proved (2.11) for k = 0. Note that (e_1, p_0)_A = 0
by (2.6).

Second, we make the induction assumption: let preconditioners B_l for l ≤
k − 1 be constructed, such that ||e_{l+1}||_A / ||e_l||_A = (κ_max − 1)/(κ_max + 1) and
(e_k, p_l)_A = 0 hold for all l ≤ k − 1. The dimension of the space is greater than
the total number of iterations by our assumption, so there exists a vector u_k,
such that (u_k, p_l)_A = 0 for l ≤ k − 1 and u_k and e_k are linearly independent.
Then the 2D subspace spanned by u_k and e_k is A-orthogonal to p_l for l ≤ k − 1.

Let us consider the boundary of the pointed circular cone made of vectors
q_k satisfying the condition sin ∠_A(q_k, e_k) = (κ_max − 1)/(κ_max + 1). This conical
surface has a nontrivial intersection with the 2D subspace spanned by u_k and
e_k, since e_k is the cone axis. Let us choose the vector q_k in the intersection. This
vector will be obviously A-orthogonal to p_l, l ≤ k − 1.

Applying the same reasoning as for constructing B_0, we deduce that there
exists an SPD B_k such that κ(B_k^{-1} A) ≤ κ_max and B_k^{-1} A e_k = q_k. With such a
choice of B_k we have s_k = q_k. Since q_k = s_k is A-orthogonal to p_l for all l ≤ k − 1,
it turns out that p_k = s_k, no matter how {m_k} are chosen. This means that
x_{k+1} is obtained from x_k by a steepest descent step. Then we apply Lemma
2.9 and conclude that (2.11) holds. We note that (e_{k+1}, p_l)_A = 0 for all l ≤ k.
Indeed, (e_{k+1}, p_l)_A = 0 for all l ≤ k − 1 since e_{k+1} is a linear combination of e_k
and p_k = s_k = q_k, both A-orthogonal to p_l for l ≤ k − 1. Finally, (e_{k+1}, p_k)_A = 0
by (2.6). This completes the construction of {B_k} by induction and thus the
proof.
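The construction in this proof can be imitated numerically. In the sketch below (an added illustration with an arbitrary SPD test matrix), the preconditioned residual s_k is chosen directly on the boundary of the cone of Theorem 2.1 taken in the A-inner product; by Lemma 2.3, applied in the A-inner product, such an s_k corresponds to an admissible SPD preconditioner B_k with κ(B_k^{-1} A) = κ_max, and every step of method (2.1) with full orthogonalization then reduces the A-norm of the error by exactly (κ_max − 1)/(κ_max + 1):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 30
    G = rng.standard_normal((n, n))
    A = G @ G.T + n * np.eye(n)                 # SPD test matrix (illustrative)
    x_true = rng.standard_normal(n)
    b = A @ x_true
    x = np.zeros(n)

    kappa_max = 4.0
    sin_t = (kappa_max - 1.0) / (kappa_max + 1.0)
    cos_t = np.sqrt(1.0 - sin_t ** 2)
    a_ip = lambda u, v: u @ (A @ v)             # A-inner product
    a_norm = lambda u: np.sqrt(a_ip(u, u))

    P = []                                      # previous search directions
    for k in range(10):
        e = x_true - x
        # u_k: A-orthogonal to all previous directions and to e_k
        u = rng.standard_normal(n)
        for q in P + [e]:
            u = u - a_ip(u, q) / a_ip(q, q) * q
        # s_k on the boundary of the cone around e_k: the A-angle has sine (kappa-1)/(kappa+1);
        # by Lemma 2.3 in the A-inner product, such an s_k equals B_k^{-1} r_k for an SPD B_k
        # with kappa(B_k^{-1} A) = kappa_max.
        s = cos_t * e / a_norm(e) + sin_t * u / a_norm(u)
        # method (2.1) with full A-orthogonalization; the sum vanishes because s_k is
        # already A-orthogonal to every previous direction, so the step is a PSD step.
        p = s.copy()
        for q in P:
            p = p - a_ip(s, q) / a_ip(q, q) * q
        x_new = x + ((b - A @ x) @ p) / a_ip(p, p) * p
        print(a_norm(x_true - x_new) / a_norm(e), "should equal", sin_t)
        P.append(p)
        x = x_new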
Let us highlight that the statement of Theorem 2.11 consists of two parts:
first, it is possible to have the PCG method with variable preconditioning that
converges not any faster than the PSD method with the same preconditioning;
and second, moreover, it is possible that the PCG method with variable precon-
ditioning converges not any faster than the worst possible theoretical conver-
gence rate for the PSD method described by (2.3). Numerical tests in section
2.8 show that the former possibility is more likely than the latter. Specifically,
we demonstrate numerically in subsection 2.8.3 that the PCG and PSD methods
with random preconditioning converge with the same speed, but both are much
faster than what bound (2.3) predicts.
2.6 Complex Hermitian case
In all other sections of this chapter we assume for simplicity that matrices
and vectors are real. However, our main results also hold when matrices A
and {Bk} are complex Hermitian positive definite. In this section we discuss
necessary modifications to statements and proofs in sections 2.2, 2.4 and 2.5 in
order to cover the complex Hermitian case.
In section 2.2, the first thing to be changed is the definition of the angle
between two non-zero vectors x, y ∈ C^n, where an absolute value is now taken,

∠(x, y) = arccos( |(x, y)| / (||x|| ||y||) ),

that makes the angle acute and invariant with respect to complex nonzero scaling
of the vectors. Lemma 2.2 remains valid in the complex case:
Lemma 2.12. Let x be a non-zero complex vector, and C be a complex Hermitian
positive definite matrix with the spectral condition number κ(C); then
sin ∠(x, Cx) ≤ (κ(C) − 1) / (κ(C) + 1).
Proof. Denote y = Cx and let γ = (y, x) / ||y||²; then γ y is the projection
of x onto span{y}, and ∠(x, y) = ∠(x, γ y). Moreover, (x, γ y) = (x, y) γ =
(x, y)(y, x) / ||y||² is real; we need this fact later in the proof of Lemma 2.13.
We redefine y to be γ y. The rest of the proof is exactly the same as that of Lemma
2.2, since the identity ||y − x|| = ||x|| sin ∠(x, y), where y is scaled by a complex
scalar to be the orthogonal projection of x onto span{y}, holds in the complex
case with the new definition of the angle.
Lemma 2.3 and, thus, Theorem 2.1 do not hold in the complex case after
the straightforward reformulation. A trivial counterexample is a pair of vectors
x ≠ 0 and y = ix: the angle between x and y is obviously zero, yet it is
impossible that y = Cx for any complex Hermitian matrix C, since the inner
product (x, y) = i ||x||² is not a real number. This counterexample also gives
an idea for a simple fix:
Lemma 2.13. Let x and y be non-zero complex vectors, such that ∠(x, y) ≠
π/2. Then there exists a complex Hermitian positive definite matrix C and a
complex scalar γ, such that Cx = γ y and (κ(C) − 1) / (κ(C) + 1) = sin ∠(x, y).
Proof. We first scale the complex vector y as in the proof of Lemma 2.12 to
make y the projection of x onto span{y}. The rest of the proof is similar
to that of Lemma 2.3, but we have to be careful working with the Householder
reflection in the complex case, so we provide the complete proof.

The redefined y is the projection of x onto span{y}; thus ||y − x|| =
(sin α) ||x||, so the vectors u = y − x and v = (sin α) x are of the same length.
Moreover, their inner product (u, v) is real, since (x, y) is real, see the proof of
Lemma 2.12. This implies that the Householder reflection Hz = z − 2(w, z) w,
where w = (u − v) / ||u − v||, acts on z = u such that Hu = v, i.e.,
H((sin α) x) = y − x, so (I + (sin α) H) x = y. We define C = I + (sin α) H to
get Cx = y.

The Householder reflection H is Hermitian and has only two distinct eigenvalues
±1, so C is also Hermitian and has only two distinct positive eigenvalues
1 ± sin α, as α ∈ [0, π/2), and we conclude that C > 0 and κ(C) =
(1 + sin α)/(1 − sin α).
The same change then makes Theorem 2.1 work in the complex case:
Theorem 2.14. The set {γCx}, where x is a fixed nonzero complex vector,
γ runs through all nonzero complex scalars, and C runs through all complex
Hermitian positive definite matrices with condition number κ(C) bounded above by some κ_max, is a pointed circular cone, specifically,
{γCx : γ ≠ 0, C = C* > 0, κ(C) ≤ κ_max} = {y ≠ 0 : sin ∠(x, y) ≤ (κ_max − 1)/(κ_max + 1)}.
Section 2.3 requires no changes other than replacing "SPD" with "Hermitian positive definite." In section 2.4 we just change the definition of the A-angle to
∠_A(x, y) = arccos ( |(x, y)_A| / (||x||_A ||y||_A) ),
and then Lemma 2.9 holds without any further changes.
Finally, the statement of Theorem 2.11 from section 2.5 allows for a straight-
forward generalization:
Theorem 2.15. Let a Hermitian positive definite matrix A, complex vectors b and x_0, and κ_max > 1 be given. Assuming that the matrix size is larger than the number of iterations, one can choose a sequence of Hermitian positive definite preconditioners B_k, satisfying κ(B_k^{-1}A) ≤ κ_max, such that method (2.1) with (2.2) turns into the PSD method, (2.1) with m_k = 0, and on every iteration
||e_{k+1}||_A / ||e_k||_A = (κ_max − 1)/(κ_max + 1).
Proof. Only a small change in the proof of Theorem 2.11 is needed. We first choose any vector q_0, satisfying sin ∠_A(q_0, e_0) = (κ_max − 1)/(κ_max + 1). Then by Lemma 2.13 we obtain the complex Hermitian positive definite matrix C_0 and the complex scalar γ such that C_0 e_0 = γ q_0, and we continue as in the proof of Theorem 2.11. The same modification is made in the choice of the vectors q_k for k > 1 later in the proof.
2.7 Practical PCG algorithms
In this section we briefly discuss two particular well-known PCG algorithms that are often used in practice. Our discussion here is motivated by and follows Golub and Ye [22], Remark 2.3. Suppose A, b, x_0, r_0 = b − Ax_0, {B_k} for
k = 0, 1, ... are given and consider Algorithm 2.7.1, where β_k on line 6 is defined either by the expression
β_k = (s_k, r_k) / (s_{k−1}, r_{k−1}),     (2.13)
or by the expression
β_k = (s_k, r_k − r_{k−1}) / (s_{k−1}, r_{k−1}).     (2.14)
Formula (2.13) is more often used in practice compared to (2.14), since it can be implemented in such a way that it does not require storing the extra vector r_{k−1}. If the preconditioner is SPD and fixed, it is well-known, e.g., Golub and Ye [22], Remark 2.3, that (s_k, r_{k−1}) = 0, so formula (2.14) coincides with (2.13) and Algorithm 2.7.1 is described by (2.1) with m_k = min(k, 1). Of course, in this case the choice m_k = min(k, 1) is enough to keep all search directions A-orthogonal in exact arithmetic.
Things become different when variable preconditioning is used. While we can derive formula (2.14) from the framework (2.1)-(2.2), there is no reason for s_k to be orthogonal to r_{k−1} and so there is no justification for using (2.13). Indeed, it is well-known, e.g., Golub and Ye [22], Remark 2.3 and Notay [45], Table 2, that using formula (2.13) for β_k can significantly slow down the convergence, and we provide our own numerical evidence of that in Section 2.8.
Algorithm 2.7.1
1: for k = 0, 1, ... do
2:    s_k = B_k^{-1} r_k
3:    if k = 0 then
4:       p_0 = s_0
5:    else
6:       p_k = s_k + β_k p_{k−1} (where β_k is defined by either (2.14) or (2.13) for all iterations)
7:    end if
8:    α_k = (s_k, r_k) / (p_k, A p_k)
9:    x_{k+1} = x_k + α_k p_k
10:   r_{k+1} = r_k − α_k A p_k
11: end for
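To make the two variants concrete, the following minimal MATLAB sketch implements Algorithm 2.7.1; it is our own illustration, not the thesis code. The function handle apply_prec, which is assumed to return B_k^{-1} r_k, and the flag use_2_14, which selects formula (2.14) over (2.13), are names we introduce here.

% A minimal sketch of Algorithm 2.7.1 with a variable preconditioner (ours).
% apply_prec(r, k) should return B_k^{-1} r; use_2_14 selects formula (2.14).
function [x, resnorm] = pcg_variable(A, b, x, apply_prec, use_2_14, maxit)
  r = b - A*x;
  s_old = []; r_old = []; p = [];
  resnorm = zeros(maxit+1, 1);  resnorm(1) = norm(r);
  for k = 0:maxit-1
    s = apply_prec(r, k);                               % s_k = B_k^{-1} r_k
    if k == 0
      p = s;                                            % p_0 = s_0
    else
      if use_2_14
        beta = (s' * (r - r_old)) / (s_old' * r_old);   % formula (2.14)
      else
        beta = (s' * r) / (s_old' * r_old);             % formula (2.13)
      end
      p = s + beta * p;                                 % p_k = s_k + beta_k p_{k-1}
    end
    Ap = A * p;
    alpha = (s' * r) / (p' * Ap);                       % alpha_k = (s_k, r_k)/(p_k, A p_k)
    x = x + alpha * p;
    r_old = r;  s_old = s;
    r = r - alpha * Ap;
    resnorm(k+2) = norm(r);
  end
end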
2.8 Numerical experiments
We first illustrate the main theoretical results of this chapter numerically
for a model problem. We numerically investigate the influence of the choice
for β_k between formulas (2.13) and (2.14) in Algorithm 2.7.1 and observe that
(2.14) leads to the theoretically predicted convergence rate, while (2.13) may
significantly slow down the convergence. Second, we test the convergence of
inner-outer iteration schemes, where the inner iterations play the role of the
variable preconditioning in the outer PCG iteration, and we illustrate our main
conclusion that variable preconditioning may effectively reduce the convergence
speed of the PCG method to the speed of the PSD method. Third, and last, we
test the PSD and PCG methods with preconditioners of the same quality chosen
randomly. We observe a surprising acceleration of the PCG method compared
to the use of only one fixed preconditioner; at the same time, we show that the
PSD method with random preconditioners works as well as the PCG method,
which explains the PCG acceleration and again supports our main conclusion.
2.8.1 Numerical illustration of the main results
Here, we use the standard 3-point approximation of the 1-D Laplacian of size 200 as the matrix A of the system. To simulate the application of the variable preconditioner, we essentially repeat the steps described in the proof of Theorem 2.11, i.e., we fix the condition number κ(B_k^{-1}A) = 2 and on each iteration we generate a pseudo-random vector s_k, which is A-orthogonal to
Figure 2.1: Iterative errors for PCG methods. Algorithm 2.7.1 with (2.13) fails to provide the PSD convergence rate.
previous search directions and such that the A-angle between s_k and e_k satisfies
sin(∠_A(s_k, e_k)) = (κ − 1)/(κ + 1).
We summarize the numerical results of this subsection on Figure 2.1, where
the horizontal axis represents the number of iterations and the vertical axis rep-
resents the A-norm of the error. The iteration count actually starts from 1, so the A-norm of the error on the 0-th iteration, ||e_0||_A, is just the A-norm of the initial error. The straight dotted (red in the colored print) line marked with squares on Figure 2.1 represents the PSD theoretical bound (2.3); at the same time, illustrating the statements of Theorem 2.11, it perfectly coincides with the change of the A-norm of the error in the case where the complete A-orthogonalization is performed, i.e., m_k = k in method (2.1), as well as in the case where Algorithm 2.7.1 with β_k defined by (2.14) is used. The curved solid (blue) line marked with diamonds represents the convergence of Algorithm 2.7.1 with β_k defined by (2.13), which visibly performs much worse in this test com-
pared to Algorithm 2.7.1 with (2.14). The paper Notay [45], Sec. 5.2 contains
Figure 2.2: The PSD and PCG methods with preconditioning by inner CG with different stopping criteria η = 0.2, 0.4, 0.6 and 0.8 (from the bottom to the top).
analogous results comparing the change in the convergence rate using formulas
(2.13) and (2.14), but it misses a comparison with the PSD method. To check
our results of section 2.6, we repeat the tests in the complex arithmetic. The
figure generated is similar to Figure 2.1, so we do not reproduce it here.
2.8.2 Inner-outer iterations as variable preconditioning
Inner-outer iterative schemes, where the inner iterations play the role of the variable preconditioner in the outer PCG iteration, are a traditional example of variable preconditioning; see, e.g., Golub and Ye [22], Notay [45]. Previously published tests analyze an approximation of some fixed preconditioner, B_k ≈ B,
different from A, by inner iterations, typically using the PCG method. The
quality of the approximation is determined by the stopping criteria of the inner
PCG method. A typical conclusion is that the performance of the outer PCG
method improves and it starts behaving like the PCG method with the fixed
preconditioner B when Bk approximates B more accurately by performing more
inner iterations.
The idea of our tests in this subsection is different: we approximate B_k ≈ B = A. The specific setup is the following. We take a diagonal matrix A with the integers from 1 to 2000 as its diagonal entries, with the right-hand side zero and a random normally distributed zero mean initial guess, and we do the same for the PSD and PCG methods. For preconditioning on the k-th step, applied to the residual r_k, we run the standard CG method without preconditioning as inner iterations, using the zero initial approximation, and for the stopping criteria we compute the norm of the true residual at every inner iteration and iterate until it gets smaller than η||r_k|| for a given constant η. On Figure 2.2, we demonstrate the performance of the PSD and PCG methods for four values of η = 0.2, 0.4, 0.6 and 0.8 (from the bottom to the top). We observe that the PSD method, displayed using dashed (red in the colored print) lines marked with circles, and the PCG method, shown as dash-dot (blue) lines with x-marks, both converge with a similar rate, for each tested value of η. We notice here that the PSD method is even a bit
faster than the PCG method. This does not contradict our Corollary 2.7, since
the preconditioners Bk here are evidently different in the PSD and PCG methods
even though they are constructed using the same principle.
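The inner preconditioner just described is easy to reproduce; below is a minimal MATLAB sketch (ours, not the thesis code) of unpreconditioned CG applied to A s = r_k, started from zero and stopped once the true residual drops below η||r_k||. The function name inner_cg_prec and the sample setup in the trailing comments are our own; the preconditioner can be plugged into the Algorithm 2.7.1 sketch given earlier.

% Inner CG used as the variable preconditioner (a sketch, ours).
function s = inner_cg_prec(A, r, eta)
  s = zeros(size(r));
  res = r;  p = res;                     % residual of A s = r at s = 0
  tol = eta * norm(r);
  while norm(r - A*s) > tol              % stopping test on the true residual
    Ap = A * p;
    alpha = (res' * res) / (p' * Ap);
    s = s + alpha * p;
    res_new = res - alpha * Ap;
    beta = (res_new' * res_new) / (res' * res);
    p = res_new + beta * p;
    res = res_new;
  end
end

% Example setup matching the text (hypothetical run): diagonal A with
% entries 1..2000 and eta = 0.4.
% A = spdiags((1:2000)', 0, 2000, 2000);
% prec = @(r, k) inner_cg_prec(A, r, 0.4);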
2.8.3 Random vs. fixed preconditioning
In this subsection, we numerically investigate a situation where random
preconditioners of a similar quality are used in the course of iterations. The
system matrix is the standard 3-point finite-difference approximation of the
one-dimensional Laplacian using 3000 uniform mesh points and the Dirichlet
boundary conditions. We test the simplest multigrid preconditioning using two
grids, where the number of coarse grid points is 600. The interpolation is linear,
Figure 2.3: Two-grid preconditioning with fixed (left) and random (right)
coarse grids.
the restriction is the transpose of the interpolation, and the coarse-grid operator
is defined by the Galerkin condition. The smoother is the Richardson iteration.
On Figure 2.3 left, we once choose (pseudo-)randomly 600 coarse mesh points
and build the fixed two-grid preconditioner, based on this choice. On Figure 2.3
right, we choose 600 new random coarse mesh points and rebuild the two-grid
preconditioner on each iteration.
In many applications, explicit information about the position of the coarse grid points is not available, so the random choice of the coarse grids may be an interesting alternative to traditional approaches.
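The following rough MATLAB sketch (ours) mimics the random two-grid preconditioner described above; the text gives only a verbal description, so the Richardson damping parameter omega, the single pre-/post-smoothing step, and the helper names random_two_grid_prec and linear_interp are our assumptions.

% Two-grid preconditioner rebuilt with a new random coarse grid on each call.
function s = random_two_grid_prec(A, r, ncoarse)
  n = size(A, 1);
  coarse = sort(randperm(n, ncoarse));           % new random coarse mesh points
  P = linear_interp(n, coarse);                  % piecewise linear interpolation
  Ac = P' * A * P;                               % Galerkin coarse-grid operator
  omega = 1 / normest(A);                        % Richardson damping (assumed)
  s = omega * r;                                 % pre-smoothing from the zero guess
  s = s + P * (Ac \ (P' * (r - A*s)));           % coarse-grid correction
  s = s + omega * (r - A*s);                     % post-smoothing
end

function P = linear_interp(n, coarse)
% Interpolation from the coarse points to all n fine points: linear between
% consecutive coarse points, constant beyond the first and last ones.
  nc = numel(coarse);
  P = sparse(n, nc);
  for i = 1:n
    j = find(coarse <= i, 1, 'last');
    if isempty(j)
      P(i, 1) = 1;
    elseif j == nc || coarse(j) == i
      P(i, j) = 1;
    else
      w = (i - coarse(j)) / (coarse(j+1) - coarse(j));
      P(i, j) = 1 - w;  P(i, j+1) = w;
    end
  end
end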
Figure 2.3 displays the convergence history for the PSD (top), PCG (mid-
dle), and PCG with the full orthogonalization (bottom) with the same random
initial guess using the fixed (left) and variable (right) two-grid preconditioners.
On Figure 2.3 left, for a fixed preconditioner, we observe the expected conver-
gence behavior, with the PSD being noticeably the slowest, and the PCG with
the full orthogonalization being slightly faster than the standard PCG. Figure
2.3 right demonstrates that all three methods with the variable random pre-
conditioner converge with essentially the same rate, which again illustrates the
main result of this chapter that the PCG method with variable preconditioning
may just converge with the same speed as the PSD method.
Figure 2.3 reveals a surprising fact that the methods with random precon-
ditioning converge twice as fast as the methods with fixed preconditioning! We
highlight that Figure 2.3 shows a typical case, not a random outlier, as we
confirm by repeating the fixed preconditioner test in the left panel for every
random preconditioner used in the right panel of Figure 2.3 and by running
the tests multiple times with different seeds. Our informal explanation for the
fast convergence of the PSD method with random preconditioning is based on Lemma 2.9, which provides the exact expression for the error reduction factor as sin(∠_A(e_k, B_k^{-1}Ae_k)). It takes its largest value only if e_k is one of specific linear combinations of the eigenvectors of B_k^{-1}A corresponding to the two extreme eigenvalues. If B_k is fixed, the error e_k in the PSD method after the first several iterations approaches these "magic" linear combinations, e.g., Forsythe [21], and the convergence rate reaches its upper bound. If B_k changes randomly, as in our test, the average effective angle is smaller, i.e., the convergence is faster.
We finally note that this experiment should not be viewed as a practical
multilevel approach. Standard uniform coarsening would of course have given
better convergence rates than the ones we have observed. Still, we believe this
experiment suggests that introducing randomization might be practical in some
numerical algorithms.
2.9 Conclusions
We use geometric arguments to investigate the behavior of the PCG methods
with variable preconditioning under a rather weak assumption that the quality
of the preconditioner is fixed. Our main result is negative in its nature: we show
that under this assumption the PCG method with variable preconditioning may converge as slowly as the PSD method, moreover, as slowly as the PSD method with the slowest rate guaranteed by the classical convergence rate bound. In particular, this gives the negative answer, under our assumptions, to the question asked in Golub and Ye [22], Sec. 6, whether better bounds for the steepest descent reduction factor may exist for Algorithm 2.7.1 with (2.14).
Stronger assumptions on variable preconditioning, e.g., such as those made in Golub and Ye [22], Notay [45], that the variable preconditioners are all small perturbations of some fixed preconditioner, are necessary in order to hope to
prove a convergence rate bound of the PCG method with variable precondi-
tioning resembling the standard convergence rate bound of the PCG method
with fixed preconditioning. Such stronger assumptions hold in many presently
known real life applications of the PCG methods with variable preconditioning,
but often require extra computational work.
3. Algebraic multigrid via element agglomeration
3.1 Introduction
Multigrid (MG) (e.g., Trottenberg et al. [55]) is one of the most efficient
and natural methods for solving linear systems of equations coming from partial
differential equations (PDEs) discretized on a sequence of grids. In algebraic
multigrid (AMG) Brandt et al. [11, 12], Ruge and Stüben [48], Stüben [53] the necessary MG components (coarse grids, coarse-grid operators, interpolation operators) are built by the solver algorithm, in contrast to geometric multigrid
where these components are naturally given by the discretization. An extreme
case of an algebraic multigrid approach would lead to a black box solver, i.e.,
an algorithm which would only use the linear system matrix as input data. In
practice, all AMG methods utilize some additional (often implicitly assumed) information
about the class of problems they are applied to. In this chapter, based on
the paper Lashuk and Vassilevski [42], we deal with the class of problems that
come from finite element discretization of elliptic PDEs. More specifically, we
focus on variants of the element-based AMG (or AMGe) methods developed in
Brezina et al. [13], Chartier et al. [15, 16], Henson and Vassilevski [24], Jones
and Vassilevski [28].
We have implemented several algorithms in Matlab in the framework of the element agglomeration spectral AMGe (Chartier et al. [16]) and illustrate their performance. Our Matlab implementation allows for further extensions
and offers potential for illustrating other element based AMG algorithms. More
specifically, the chapter is structured as follows. Section 3.2 describes the main
building tools of the implemented algorithms. In particular, in Section 3.2.1
we describe the required input, Section 3.2.2 reviews some details on element
agglomeration, and in § 3.2.3 the so-called "minimal intersection sets" are in-
troduced. These sets are later used, in Section 3.2.4, to define the coarse degrees
of freedom by solving local eigenvalue problems (associated with each minimal
intersection set). The recursive nature of the algorithm is summarized in § 3.2.5.
The spectral way of selecting coarse degrees of freedom naturally leads to the
construction of tentative prolongators. In § 3.2.6 the need to improve on the
stability of the tentative prolongator is outlined.
Another direction of extending the presented method that we pursue utilizes
more sophisticated multigrid cycles that are based on inner (between the levels)
preconditioned conjugate gradient (CG) iterations. Such an idea goes back to
the non-linear AMLI methods in Axelsson and Vassilevski [5]. The respective method, referred to as the nested-CG (AMLI) cycle, is described in Section 3.2.7.
Section 3.3 contains a description of a specific implementation of the adap-
tive AMG method (cf., Brezina et al. [14]) in the present AMGe setting. We
consider also an option which does not utilize the actual element matrices (but
does use the element topology relations). This results in a new AMGe algorithm
which we view as a main algorithmic contribution of this chapter.
The V-cycle and nested-CG cycle are illustrated in Sections 3.4 and 3.5,
whereas the results utilizing AMG adaptivity are found in the last section 3.6.
The main result of the present work is that it describes some performance
results of a number of AMGe algorithms; some are simple variations of previ-
ously implemented ones, in addition to some newly developed ones, such as the
nested-CG AMLI cycles, as well as the adaptive element agglomerate AMGe.
3.2 Basic building tools
3.2.1 Input data
We assume finite element setting of the problem in the framework of relation
tables as described in detail in Vassilevski [57].
More specifically, our matlab software requires the following input data:
The element_dof relation. It can be defined by treating each element as a list of degrees of freedom (dofs). If the dofs and elements are respectively numbered as 1, ..., N_D and 1, ..., N_E, then the incidence relation element_dof can be represented as the boolean sparse matrix M of size N_E x N_D with entries
M_ij = 1 if element i contains dof j, and M_ij = 0 otherwise.
The element_face relation. This relation describes the incidence "element i has a face j." For triangular elements, element faces are the triangles' sides. This relation can be used to define neighboring elements; namely, we say that two elements are neighbors if they share a common face. This defines the relation table element_element. Again, the relations element_face and element_element can be implemented as boolean sparse matrices.
List of boundary dofs. This is a list of dofs which are on the boundary where essential boundary conditions are to be imposed.
We also need access to the individual element matrices. We assume that all these matrices are symmetric positive semi-definite (or SPSD for short).
The actual matrix of the linear system of our main interest is built in 2 steps:
- First, we assemble a global (in general, singular, SPSD) matrix A from the individual element matrices in the usual way, i.e., according to the formula
v^T A v = Σ_τ v_τ^T A_τ v_τ   for all vectors v,
where A_τ denotes the local matrix corresponding to element τ. For a given v and a set of dofs τ, v_τ stands for the restriction of v to the subset of indices (dofs) corresponding to τ.
- Second, we impose the essential boundary conditions on the globally assembled matrix A. That is, for any dof d on the essential boundary we set to zero all off-diagonal entries in the d-th row and the d-th column of A. For the class of problems we consider, imposing essential boundary conditions in this way produces SPD (symmetric positive definite), hence non-singular, matrices.
We refer to the initial set of elements and respective dofs as fine-grid elements (or fine elements) and fine-grid dofs (or fine dofs).
We note that the right-hand side vector (required input in the solution
phase) is not needed as input in the construction of the actual AMGe solver.
3.2.2 Agglomerating elements
Agglomeration refers to a partition of the set of all (fine) elements into
non-intersecting subsets, called agglomerates or agglomerated elements (AEs).
More specifically, we treat the set of elements as vertices of the undirected
graph where two vertices (elements) are linked with an edge if and only if they
are neighbors, i.e., share a face. This graph defines (as described before) the relation element_element. If the relations element_face and face_element are implemented as (boolean) sparse matrices, then element_element equals the product element_face x face_element.
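As a small illustration (ours, with hypothetical variable names), the relation tables can be held as sparse 0/1 matrices and element_element obtained by one sparse matrix product. Here conn is assumed to be a standard N_E x 3 triangle connectivity array, elem_face the element_face incidence matrix, and Ndof the number of dofs.

% Relation tables as boolean sparse matrices (a sketch, ours).
[Ne, nv] = size(conn);                             % Ne elements, nv dofs per element
element_dof = sparse(repmat((1:Ne)', nv, 1), conn(:), 1, Ne, Ndof);  % element_dof
EF = double(elem_face);                            % element_face as a 0/1 matrix
element_element = (EF * EF') > 0;                  % elements sharing at least one face
element_element = element_element & ~logical(speye(Ne));             % drop self-links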
We use the graph partitioner METIS (Karypis and Kumar [31]) to partition
the element-element relation into a desired number of components (a user
specified parameter referred to as coarsening factor or Crs. f.). We make sure
that the produced components are connected. Each (connected) component
defines an agglomerate. Figure 3.1 gives an illustration of this process. We also
view each agglomerated element as a list of fine degrees of freedom (the union
of the degrees of freedom that come with the fine-grid elements forming the
agglomerate).
3.2.3 Building minimal intersection sets
Having created agglomerated elements, we can split the set of all (fine)
degrees of freedom into non-overlapping partitions, referred to as minimal in-
tersection sets (as in Vassilevski and Zikatanov [58]). These partitions represent
equivalence classes with respect to the equivalence relation that two dofs belong
Figure 3.1: Agglomerated elements
to a same minimal intersection set if and only if they belong to exactly the same
set of agglomerated elements. For example, all interior dofs of an AE constitute
a minimal intersection set. A boundary (its interior only) between two AEs
constitutes a minimal intersection set, etc. Figure 3.2 illustrates this process.
3.2.4 Forming Schur complements, computing eigenvalues and
building tentative prolongators
For each minimal intersection set I, we define its neighborhood N(I) to be all (fine) elements that intersect I (i.e., having a common dof with I). We assemble the local matrix A_{N(I)} from the element matrices corresponding to the elements in N(I) and then compute its Schur complement S_I by eliminating all dofs outside I. Note that both A_{N(I)} and S_I are SPSD.
Next, we calculate all the eigenpairs (λ_{I,k}, q_{I,k}), k = 1, ..., |I|, of S_I. Here, |I| stands for the cardinality of the set I. Let Q_I be the matrix whose columns are the (orthogonal and normalized) eigenvectors q_{I,k} of S_I.


Figure 3.2: Minimal intersection sets (4 agglomerates, 100 elements)
Due to our assumptions, all eigenvalues λ_{I,k} of S_I are non-negative.
Based on a user-specified tolerance τ ∈ [0, 1] (referred to as the spectral tolerance), we partition Q_I into two submatrices. The first one, Q_{I,c}, has columns that are the eigenvectors q_{I,k} corresponding to the lower part of the spectrum of S_I, i.e., all q_{I,k} for k such that λ_{I,k} ≤ τ ||A_{N(I)}||, where ||.|| stands for the l2-norm of the neighborhood matrix A_{N(I)} (or of the diagonal of A_{N(I)}). The second block Q_{I,f} contains all remaining columns of Q_I (i.e., eigenvectors corresponding to the upper part of the spectrum, λ_{I,k} > τ ||A_{N(I)}||).
The tentative prolongator Q_c is then formed from the blocks Q_{I,c} (extended by zero outside the set I), i.e., Q_c = [..., Q_{I,c}, ...]. Since all matrices Q_{I,c} are orthogonal and the sets I do not overlap, Q_c is also an orthogonal matrix.
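A compact MATLAB sketch (ours, not the thesis implementation) of this step for a single minimal intersection set I follows; the function name coarse_dofs_for_set is ours, dense eigensolves are used for simplicity, and a pseudo-inverse is used since the exterior block of A_{N(I)} may be singular.

% Spectral selection of coarse dofs for one minimal intersection set (sketch).
function Q_Ic = coarse_dofs_for_set(A_N, idx_I, tau)
  n = size(A_N, 1);
  idx_O = setdiff(1:n, idx_I);                   % dofs of N(I) outside I
  A_II = A_N(idx_I, idx_I);  A_IO = A_N(idx_I, idx_O);
  A_OI = A_N(idx_O, idx_I);  A_OO = A_N(idx_O, idx_O);
  S_I = A_II - A_IO * pinv(full(A_OO)) * A_OI;   % Schur complement of A_{N(I)} on I
  S_I = (S_I + S_I') / 2;                        % symmetrize against round-off
  [Q, D] = eig(full(S_I));                       % all eigenpairs of S_I
  keep = diag(D) <= tau * norm(full(A_N));       % lower part of the spectrum
  Q_Ic = Q(:, keep);                             % block Q_{I,c} of the tentative prolongator
end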
3.2.5 Proceeding by recursion
So far, we have constructed the components necessary for a 2-level method. More specifically, we assume a standard smoother M such as (block) Gauss-Seidel. The coarse-grid matrix A_c is obtained by the standard (Galerkin) procedure; namely, with P = Q_c (or some improved version of Q_c), we set A_c = P^T A P. A symmetric two-grid iteration for solving Ax = b, for a given current iterate x_0, takes the form:
Algorithm 3.1 (Symmetric two-grid algorithm).
Pre-smooth, i.e., compute
y = x_0 + M^{-1}(b − A x_0).
Compute the coarse-grid correction
x_c = (A_c)^{-1} P^T (b − A y).
Interpolate and update the result
z = y + P x_c.
Post-smooth, i.e., compute
x = z + M^{-T}(b − A z).
The mapping b ↦ x = B_{TG}^{-1} b resulting from Algorithm 3.1 (with x_0 = 0) defines a two-grid preconditioner B_{TG}. It is well-known (and readily checked) that its inverse admits the following explicit form
B_{TG}^{-1} = (I − M^{-T}A) P (A_c)^{-1} P^T (I − A M^{-1}) + \bar{M}^{-1},     (3.1)
43


where \bar{M} = M(M + M^T − A)^{-1} M^T is the so-called symmetrized smoother. For example, if M = D + L (D diagonal, L strictly lower triangular) represents the forward Gauss-Seidel method coming from A = D + L + L^T, then \bar{M} = (D + L) D^{-1} (D + L^T) gives rise exactly to the symmetric Gauss-Seidel method.
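A direct MATLAB transcription (ours) of Algorithm 3.1 follows; with x0 = 0 it returns the action b ↦ B_{TG}^{-1} b of (3.1). The choice M = tril(A) mentioned in the comment (forward Gauss-Seidel) is only one possible smoother.

% One symmetric two-grid cycle, Algorithm 3.1 (a sketch, ours).
function x = two_grid_cycle(A, b, x0, P, M)
  Ac = P' * A * P;                     % Galerkin coarse-grid matrix
  y  = x0 + M \ (b - A*x0);            % pre-smoothing
  xc = Ac \ (P' * (b - A*y));          % coarse-grid correction (exact solve)
  z  = y + P * xc;                     % interpolate and update
  x  = z + M' \ (b - A*z);             % post-smoothing with M^T
end
% Example smoother: M = tril(A) gives forward Gauss-Seidel pre-smoothing and
% backward Gauss-Seidel post-smoothing.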
To define a MG algorithm, the exact solve with A_c is replaced with a corresponding B_c^{-1} defined by recursion. At the coarsest level B_c typically equals the matrix at that level.
In our setting, to exploit recursion, we need to construct coarse elements
(the needed topology relations), coarse dofs and coarse element matrices.
The topology of the agglomerated elements (which serve as coarse elements)
is constructed based on the algorithms described in Vassilevski [57]. This part
of the setup is independent of the selection of the coarse dofs. The required
input here is only the fine-grid element_face topological relation.
The coarse dofs at a given level can be identified with the columns of the tentative prolongator Q_c that we construct based on the lower part of the spectrum of the Schur complements S_I associated with each minimal intersection set I.
To build coarse elements (as lists of coarse degrees of freedom), we use the
agglomerated elements. Each agglomerated element (as a set of fine dofs) can
be split into several minimal intersection sets and each minimal intersection set
has one or several coarse dofs (eigenmodes) associated with its respective Schur
complement. For each agglomerated element T, the list of coarse degrees of
freedom associated with all minimal intersection sets that form T defines the
relation AE_coarse_dof. This relation is the coarse counterpart of the (fine-
grid) relation element_dof.
To construct coarse element matrices, we proceed as follows. For each agglomerated element T, we first assemble the local matrix A_T from the (fine-grid) element matrices A_τ, τ ⊂ T. Also, based on T, we form the submatrix Q_{T,c} of the tentative prolongator Q_c that corresponds to the coarse dofs in T. In what follows, we refer to Q_{T,c} as the local tentative prolongators. Finally, we construct the coarse element matrix A_T^c, based on the local Galerkin relation A_T^c = Q_{T,c}^T A_T Q_{T,c}.
3.2.6 On the theoretical justification of the spectral AMGe approach
Consider the orthogonal (in the Euclidean inner product) splitting R^n = V_c ⊕ V_f, where V_c = Range(Q_c) and V_f = Range(Q_f). Let P_{V_f} denote the orthogonal projection onto V_f. It can be proved (see Chartier et al. [16]) that the restriction A_ff = P_{V_f} A|_{V_f} of A to the subspace V_f is well-conditioned if we choose a sufficiently large portion of the eigenmodes (in the lower part of the spectrum) of each Schur complement S_I to form the columns of the tentative prolongator Q_c. That is, the larger the spectral tolerance τ ∈ (0, 1), the better the condition number of the resulting A_ff.
We note that the fact that A_ff is well-conditioned is by itself not sufficient to conclude that a two-grid method has a good convergence factor; we also need some stability property of the interpolation matrix. Since our tentative prolongator is block-diagonal, its stability properties in the A-norm are not very good. That is, the tentative prolongator needs to be improved in general.
For example, we can smooth it out as in the smoothed aggregation AMG
(Vanek et al. [56]). This is a feasible option that can lead to better two-grid
convergence rates. In our setting though, some of the minimal intersection sets are thin aggregates, which led to higher operator complexities. That is why we did not pursue this option. Operator complexity (C) reflects the cost (in terms of arithmetic operations) of one V-cycle. It is defined based on the number of non-zero entries nnz(A_l) of the l-th level matrix A_l (l = 0 is the finest level, l = L is the coarsest level) by the formula
C = ( Σ_{l=0}^{L} nnz(A_l) ) / nnz(A_0).     (3.2)
One option that we chose to stabilize the tentative prolongator is based on
harmonic extension defined as follows. For each agglomerated element T, we
construct Qc only for minimal intersection sets I that are on the boundary of T
(i.e., shared also by other agglomerated elements). Then Qc defined only on the
boundary of T is extended into the interior of T by inverting the block of the
matrix A corresponding to the interior of T. More specifically, if A restricted
to T has rows [A_{T,i}, A_{T,b}] corresponding to the interior of T, then the actual interpolation mapping P has the block form
P = [ −A_{T,i}^{-1} A_{T,b} Q_c ]
    [          Q_c            ].
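A minimal MATLAB sketch (ours) of this harmonic extension for a single agglomerated element T follows; the function name harmonic_extension and the assumption that the local interior/boundary index sets are already available are ours.

% Harmonic extension of the tentative prolongator on one agglomerate (sketch).
% A_T: assembled matrix on T; int, bnd: local interior/boundary dof indices;
% Qc_b: tentative-prolongator columns restricted to the boundary dofs of T.
function P_T = harmonic_extension(A_T, int, bnd, Qc_b)
  A_ii = A_T(int, int);                 % interior block A_{T,i}
  A_ib = A_T(int, bnd);                 % interior-to-boundary coupling A_{T,b}
  P_T  = zeros(size(A_T, 1), size(Qc_b, 2));
  P_T(bnd, :) = Qc_b;                   % keep the tentative values on the boundary
  P_T(int, :) = -A_ii \ (A_ib * Qc_b);  % energy-minimizing interior values
end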
Another possible approach in the multilevel case is to keep the unstable
tentative prolongators but compensate for that with more expensive cycles (like
W-cycles, for example). We chose to use recursive calls to coarse levels based
on preconditioned CG (conjugate gradient) iterations with a preconditioner (de-
fined by recursion) that is a (mildly) nonlinear mapping. This is described in
more detail in the next subsection.
3.2.7 The nonlinear nested-CG AMLI-type multigrid cycles
The alternative that we chose to compensate for an unstable tentative prolongator is to use more expensive multigrid cycles based on CG (conjugate gradient) inner (between levels) iterations; this approach goes back to the nonlinear AMLI-cycle hierarchical basis methods proposed in Axelsson and Vassilevski [5]. Recent analysis of this type of cycles used in the MG setting is found in Notay and Vassilevski [46].
We implemented an algorithm to be referred to as a nested-CG (AMLI) cycle. It consists of CG-acceleration, on each level, of the currently defined (by recursion from coarse to fine levels) (nonlinear) preconditioner. More specifically, we run CG iterations for the linear system Ax = b (where A is the finest level matrix) with a preconditioner that is a non-linear mapping. To compute y = B^{-1}(b), we use the following recursive algorithm:
For a given smoother M, apply a pre-smoothing step using y = 0 as the initial guess, i.e., compute y := M^{-1} b.
Restrict the residual: r_c := P^T (b − A y).
At the coarsest level solve directly, i.e., compute y_c := (A_c)^{-1} r_c; otherwise apply several CG iterations to the linear system A_c y_c = r_c, using y_c = 0 as an initial guess and a preconditioner whose inverse actions B_c^{-1} are defined recursively (by the present algorithm); that is, compute y_c as an approximation to (A_c)^{-1} r_c.
Correct y based on y_c as y := y + P y_c.
Based on M^T, perform a post-smoothing step, i.e., compute y := y + M^{-T}(b − A y).
Note that the above algorithm defines an inverse of a preconditioner, B^{-1}, that is a nonlinear mapping.
In the implementation, we limit the number of inner CG iterations per level so that the complexity of the resulting cycle is kept under control. More specifically, the number of CG iterations at level l is chosen based on the coarsening factor defined as the ratio N_l/N_{l+1}.
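The following recursive MATLAB sketch (ours) illustrates one application of the nested-CG cycle. The cell arrays A, P, M over the levels, the per-level iteration counts nu, the use of formula (2.14) for the inner CG, and the assumption of at least two levels are ours; this is not the thesis implementation.

% One action y ~= A{1}^{-1} b of the nested-CG (AMLI) preconditioner (sketch).
% A{l}, M{l}: level matrices and smoothers; P{l}: interpolation from level l+1
% to level l; nu(l): number of inner CG iterations allowed at level l.
function y = amli_cycle(A, P, M, nu, l, b)
  L = numel(A);
  y  = M{l} \ b;                              % pre-smoothing from the zero guess
  rc = P{l}' * (b - A{l} * y);                % restrict the residual
  if l + 1 == L
    yc = A{L} \ rc;                           % coarsest level: direct solve
  else
    % a few CG iterations for A{l+1} yc = rc, preconditioned by recursion
    yc = zeros(size(rc));   r = rc;
    s_old = []; r_old = []; p = [];
    for it = 1:nu(l+1)
      s = amli_cycle(A, P, M, nu, l+1, r);    % nonlinear preconditioner action
      if it == 1
        p = s;
      else
        beta = (s' * (r - r_old)) / (s_old' * r_old);   % formula (2.14)
        p = s + beta * p;
      end
      Ap = A{l+1} * p;
      alpha = (s' * r) / (p' * Ap);
      yc = yc + alpha * p;
      r_old = r;  s_old = s;
      r = r - alpha * Ap;
    end
  end
  y = y + P{l} * yc;                          % coarse-grid correction
  y = y + M{l}' \ (b - A{l} * y);             % post-smoothing with M^T
end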
3.3 Adaptivity in the element agglomeration AMGe framework
In the present section we first describe the main steps of the adaptive AMG method from Brezina et al. [14]. Based on its observed performance, the method improves itself by augmenting its current hierarchy (if any) so that the refined coarse spaces contain more "algebraically smooth" vectors. These vectors represent approximations to the minimal eigenmode of the generalized eigenvalue problem Ax = λBx, where B^{-1} stands for the matrix representation of the V-cycle corresponding to the most current AMG hierarchy. If no hierarchy is available, then B stands for the current level symmetrized smoother \bar{M}. The choice of the smoothers is fixed, for example Gauss-Seidel.
We may assume that an initial AMG hierarchy has been built as described
in the preceding section. That is, based on an AMGe algorithm, we have created
a sequence of interpolation matrices, respective (Galerkin) coarse-grid matrices
and the smoothing matrices M associated with them (that choice we assume
fixed). In the adaptive AMG we begin with performing several dry runs based
on the current method to measure its performance. If inefficiency is detected
then we try to improve the method by adapting the AMG hierarchy. Some
preliminary results of an adaptive AMG method (in AMGe framework) were
reported in Vassilevski and Zikatanov [58]. We describe, in what follows, the
particular version of the approach that we have implemented.
For simplicity, we start with a two-level method. In our setting, it comes with a tentative prolongator Q_c, an improved interpolation mapping P, the coarse-grid matrix A_c = P^T A P and a smoother M associated with A. These components define a two-level preconditioner B (as in Algorithm 3.1 or formula (3.1)). Note that the case Q_c = 0 and B = \bar{M} is treated similarly (with obvious modifications). This case is referred to as adaptive AMG starting "from scratch."
Then we perform several stationary iterations to solve Ax = 0 with a random initial guess x, i.e.,
x := (I − B^{-1}A) x,
where I is the identity matrix. At every iteration step we monitor the convergence rate, i.e., we compute
ρ² = (x_new^T A x_new) / (x^T A x),   where x_new = (I − B^{-1}A) x.
If after a few (for example, five) iterations ρ is greater than a desired convergence factor, we stop the iteration. Our current (two-level) solver B cannot efficiently handle this vector, i.e., x is rich in eigenmode components in the lower part of the spectrum of B^{-1}A. We refer to x as an "algebraically smooth" vector. Note that x is algebraically smooth with respect to the current (two-grid)
preconditioner B (which initially may be simply the smoother M). The iteration
process above can be interpreted as calculating approximations to the highest eigenmode of the A-symmetric two-level iteration matrix E = I − B^{-1}A. Since we consider symmetric MG cycles and A-convergent smoothers (like Gauss-Seidel), this implies that AE is symmetric positive semi-definite.
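A small MATLAB sketch (ours) of this "dry run" follows; apply_B (one application of B^{-1} for the current hierarchy, or of the symmetrized smoother), nsteps, rho_target and n are hypothetical names.

% Dry run producing an algebraically smooth vector (a sketch, ours).
x = randn(n, 1);                       % random initial guess, solving A x = 0
for step = 1:nsteps
  x_new = x - apply_B(A * x);          % x := (I - B^{-1} A) x
  rho   = sqrt((x_new' * A * x_new) / (x' * A * x));   % A-norm reduction factor
  x = x_new;
  if rho <= rho_target, break; end     % the current method is already good enough
end
% if rho > rho_target, x is an algebraically smooth vector to be incorporated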
The next step in the adaptive AMG is to incorporate the algebraically smooth vector x in the two-level hierarchy by changing the tentative prolongator Q_c and, afterwards, the respective improved interpolation matrix P. After a new P has been computed, we re-compute the coarse-grid matrix A_c = P^T A P.
Algorithm 3.2 (Augmenting coarse spaces based on additional vectors).
We compute the interpolation error e = x − Q_c Q_c^T x. Since Q_c has mutually orthogonal columns, e is the projection of x onto the orthogonal complement of the span of the columns of Q_c.
For each minimal intersection set I, consider e_I (the restriction of e to I). Then, if ||e_I||/||x|| is greater than some given threshold, we add e_I (extended by zeros outside I) as an extra column to the tentative prolongator Q_c. If ||e_I||/||x|| is less than the threshold, we discard e_I.
At the end, compute P, the improved interpolation matrix, from Q_c using harmonic extension as described in section 3.2.6. Using this updated P, we re-compute A_c = P^T A P.
In the multilevel case, we have to perform a few additional steps at any given level. Note that below the current coarse level, we may already have a hierarchy
of coarse spaces. Since we have augmented Q_c with more columns and built a new P, we have generally changed the dimension of the coarse space at the given level. Hence, relating the next level coarse space to the current level coarse space based on the previously available interpolation matrix P_next is not possible, because the dimensions do not match. To fix this problem, we need to add some additional rows to P_next to match the dimensions. The new rows are simply set to zero. Thus the second coarse matrix does not have to be recomputed. This
is seen as follows. Let Q_c^old be the current level tentative prolongator. We add a few new columns to Q_c^old based on the vector e computed in Algorithm 3.2. That is, we have Q_c = [Q_c^new, Q_c^old]. Similarly, the old interpolation matrix P^old gets updated with the same number of columns; i.e., we have P = [P^new, P^old]. The new coarse matrix is A_c = P^T A P. The second coarse level interpolation matrix P_next gets modified with some extra zero rows; it becomes P_next = [0; P_next^old] in block form, with the zero rows corresponding to the newly added coarse dofs. Therefore, the next coarse level matrix equals
A_c,next = P_next^T A_c P_next
         = [0; P_next^old]^T [ (P^new)^T A P^new   (P^new)^T A P^old ; (P^old)^T A P^new   (P^old)^T A P^old ] [0; P_next^old]
         = (P_next^old)^T (P^old)^T A P^old P_next^old,
which is the expression defining the next level old coarse matrix.
A second problem that needs to be fixed is to augment the minimal inter-
section sets at the given coarse level since we have added some new coarse dofs
there. To accomplish this task, during the setup phase of the algorithm (more
specifically, when we agglomerate elements, build the minimal intersection sets,
etc.) we keep a hierarchy structure of the minimal intersection sets. We say
that a level-l minimal intersection set I_l is associated with a coarser level-(l+1) minimal intersection set I_{l+1} if there exists a coarse dof d_{l+1} from I_{l+1} associated with I_l. Recall that (initially) every coarse dof corresponds to an eigenvector of a local matrix associated with a minimal intersection set. It is easily seen that for each I_l there exists a unique coarse minimal intersection set I_{l+1} that is associated with I_l.
distribute the newly created coarse dofs among the coarser minimal intersection
sets. Then we proceed by recursion. The previously current coarse level be-
comes fine and we apply the method starting from that level. The initial vector
x is now not random; it equals Q^T x where x was the vector computed at the previous (fine) level and Q is the augmented tentative prolongator at that level.
The same scheme applies to all subsequent levels till we end up with two levels
only. Then, we apply the two-level scheme described above.
The following monotonicity result demonstrates the fact that augmenting
a current coarse space (in the sense described above) leads to a better two-grid
method.
Proposition 3.3. Let A, M, P^old and A_c^old = (P^old)^T A P^old define a current two-grid preconditioner B_old (as in (3.1)). Augment the coarse space so that P = [P^new, P^old] is the new interpolation matrix and A_c = P^T A P is the new coarse matrix. Then A, M, P and A_c define the new two-grid preconditioner B (also as in (3.1)). Then the following inequalities hold:
v^T A^{-1} v ≥ v^T B^{-1} v ≥ v^T B_old^{-1} v.
That is, B^{-1} provides a more accurate approximate inverse to A than the old
two-grid preconditioner B_old.
Proof. Let (u, v)_A = v^T A u denote the A-inner product (note that A is s.p.d.). It is equivalent to show that for all u
0 ≤ ((I − B^{-1}A)u, u)_A ≤ ((I − B_old^{-1}A)u, u)_A.     (3.3)
The symmetric two-level cycle leads to an error propagation matrix that admits the following product form
I − B^{-1}A = (I − M^{-T}A) (I − P (A_c)^{-1} P^T A) (I − M^{-1}A),
and similarly
I − B_old^{-1}A = (I − M^{-T}A) (I − P^old (A_c^old)^{-1} (P^old)^T A) (I − M^{-1}A).
We also note that (I − M^{-T}A) is A-adjoint to (I − M^{-1}A). Thus (3.3) with v = (I − M^{-1}A)u reduces to
0 ≤ ((I − P (A_c)^{-1} P^T A)v, v)_A ≤ ((I − P^old (A_c^old)^{-1} (P^old)^T A)v, v)_A.
These inequalities are readily seen from the fact that both P (A_c)^{-1} P^T A and P^old (A_c^old)^{-1} (P^old)^T A are A-orthogonal projectors providing the best approximation in the A-norm from the two nested spaces Range([P^new, P^old]) and Range(P^old), respectively.
3.4 Numerical results for spectral agglomerate AMGe
In this and subsequent sections we illustrate the performance of the methods
discussed in this chapter. The problems we consider come from finite element
(f.e.) discretization of 2D anisotropic diffusion as well as the 2D (indefinite)
Helmholtz equation rewritten as a first order (mixed) system cast in a least-
squares form (the so-called FOSLS formulation). The latter problem gives rise
to three degrees of freedom per node.
The specific PDEs, posed in Ω = (0, 1)², read:
anisotropic diffusion:
−div( (εI + b b^T) ∇p ) = f,   p = 0 on ∂Ω,
where ε is either 1 (which corresponds to no anisotropy, i.e. the operator is the Laplacian if b = 0) or 0.001 and b is a given constant vector in R². Here we use f.e. discretization based on the standard piecewise linear f.e. space S_h over a given triangular mesh T_h of Ω.
Helmholtz equation −Δp − k²p = f, p = 0 on ∂Ω for a given k². The FOSLS formulation reads: Compute
||curl u^h||² + ||u^h − ∇p^h||² + ||f + div u^h + k²p^h||² → min,
over u^h ∈ (S_h)² and p^h ∈ S_h, where S_h is a finite element space of piecewise linear functions over a given triangular mesh T_h on Ω. Here ||.|| stands for the L²(Ω) norm; p^h and the tangential component of u^h are set to zero on ∂Ω. The additional vector unknown u^h approximates ∇p.
In our tests, we varied the spectral tolerance τ ∈ [0, 1) (see Section 3.2.4) as well as the smoothers used: Gauss-Seidel (GS) and its block version. We used (overlapping) blocks referring to the agglomerated elements (viewed as sets of fine-grid dofs). This block smoother is denoted in the tables by AE-BlockGS. We used two types of multigrid cycles: the standard symmetric V(1,1)-cycle (forward (block) Gauss-Seidel in the fine-to-coarse direction and backward (block) Gauss-Seidel in the coarse-to-fine direction) and the nested CG cycle described
in Section 3.2.7. For each method and problem, we compare its two-level and
multilevel versions. To reduce the complexity of the methods we use different
coarsening factors (Crs. f.), one at the initial level (equal to sixteen or eight)
and another one (equal to four) at all remaining coarse levels. The number of
levels used is denoted by N_lev. We also list ρ, which is an estimate of the convergence factor. In most of the tables the resulting multigrid cycle is used as
a preconditioner in the CG (conjugate gradient) method. In addition, the con-
vergence history of the methods is illustrated graphically. Finally, some typical
graphs of the coarsest level basis functions illustrating the kind of interpolation
matrices (their columns viewed on the finest grid) that result from a particular
algorithm are shown. The tables also contain the operator complexity C defined
in (3.2) as well as a related quantity C_{A+P} defined as:
C_{A+P} = C + ( Σ_{l=0}^{L−1} nnz(P_l) ) / nnz(A_0).
That is, to the commonly used operator complexity C we add the total number of non-zero entries of all interpolation matrices P_l divided by the number of non-zeros of the finest level matrix.
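For reference, both complexities are one-liners once the level matrices and interpolation matrices are at hand; the cell-array names Alev and Plev in this small sketch (ours) are hypothetical.

% Operator complexities from the level matrices Alev{1..L+1} and the
% interpolation matrices Plev{1..L} (a sketch, ours).
C   = sum(cellfun(@nnz, Alev)) / nnz(Alev{1});        % operator complexity (3.2)
CAP = C + sum(cellfun(@nnz, Plev)) / nnz(Alev{1});    % C_{A+P}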
In our numerical experiments, we use two types of meshes: a structured
square mesh and unstructured triangular meshes. A given uniform square
mesh produces the actual structured mesh by subdividing each square cell using
its diagonal into two triangles. In the case of structured grid, we choose the
anisotropy vector b aligned with the grid. All unstructured meshes we consider
are obtained by refinement of the mesh shown in Figs. 3.1-3.2.
In this particular section, we use V-cycle iterations based on the so-called
(fully) harmonic interpolation matrices. These interpolation matrices are ob-
tained by first selecting coarse dofs based on a portion of the lower part of the spectrum of the Schur complements S_I associated with each minimal intersection set I (as described previously), leading to an orthogonal matrix Q_c. Based on the remaining part of the spectrum we can also construct the complementary (also orthogonal) matrix Q_f. Then, we first perform a change of variables that leads to a matrix
[Q_f, Q_c]^T A [Q_f, Q_c] = [ A_ff  A_fc ]
                            [ A_cf  A_cc ].
Note that a similar structure is obtained for all neighborhood matrices A_{N(I)} (discussed in Section 3.2.4).
Having the 2x2 block form of the transformed matrix with the coarse dofs
identified (denoted by c index) the actual interpolation matrix is constructed
as described in Jones and Vassilevski [28]. The resulting interpolation matrix
P can be viewed as a locally harmonic extension of the tentative prolongator
Qc. This was the construction used also in Chartier et al. [16].
Numerical results for this section are found in tables 3.1, 3.2, 3.3, 3.4, 3.5,
3.6, 3.7, 3.8. Figures 3.4, 3.7, 3.8 contain convergence plots for runs in the
tables listed above. Figures 3.3, 3.5, 3.6, 3.9 contain graphs of some coarsest basis functions.
Table 3.1: 2d Laplacian on square grid. 2048 elements, 1089 dofs. "Harmonic" prolongator is used inside V-cycle. No CG acceleration is used.
Crs. f. | τ | C_A | C_{A+P} | N_lev | Smoother | ρ
8, 4    | 0 | 1.4 | 1.7     | 2     | GS       | 0.22
8, 4    | 0 | 1.5 | 2.0     | 5     | GS       | 0.23
Table 3.2: 2d Laplacian on square grid. 32768 elements, 16641 dofs. "Harmonic" prolongator is used inside V-cycle. No CG acceleration is used.
Crs. f. | τ | C_A | C_{A+P} | N_lev | Smoother | ρ
8, 4    | 0 | 1.3 | 1.7     | 2     | GS       | 0.24
8, 4    | 0 | 1.4 | 1.9     | 7     | GS       | 0.24
Table 3.3: 2d Laplacian on unstructured grid, 6400 elements, 3321 dofs. "Harmonic" prolongator is used inside V-cycle. Conjugate gradient acceleration is used.
Crs. f. | τ | C_A | C_{A+P} | N_lev | Smoother | ρ    | Fig/Run
16, 4   | 0 | 1.5 | 1.9     | 2     | GS       | 0.15 | 3.4/1
16, 4   | 0 | 1.6 | 2.2     | 6     | GS       | 0.18 | 3.4/2
Figure 3.3: One of coarsest-grid basis functions. Laplacian on square grid, 2048 elements, 1089 dofs. Fully-harmonic prolongator is used.
Table 3.4: 2d Laplacian on unstructured grid, 25600 elements, 13041 dofs. "Harmonic" prolongator is used inside V-cycle. Conjugate gradient acceleration is used.
Crs. f. | τ | C_A | C_{A+P} | N_lev | Smoother | ρ    | Fig/Run
16, 4   | 0 | 1.6 | 2.2     | 7     | GS       | 0.21 | 3.4/3
16, 4   | 0 | 1.5 | 1.9     | 2     | GS       | 0.18 | 3.4/4
Figure 3.4: Residual decrease history for some runs in tables 3.3, 3.4.
Figure 3.5: One of coarsest-grid basis functions. Laplacian on unstructured
grid, 6400 elements, 3321 dofs. Fully-harmonic prolongator is used. Coarsest
grid contains 8 elements.
Table 3.5: 2d anisotropic diffusion on square grid. Anisotropy is grid-aligned. 2048 elements, 1089 dofs. "Harmonic" prolongator is used inside V-cycle. No CG acceleration is used.
Crs. f. | τ    | C_A | C_{A+P} | N_lev | Smoother   | ρ
8, 4    | 0.05 | 2.0 | 2.4     | 2     | GS         | 0.27
8, 4    | 0.05 | 3.9 | 6.0     | 5     | GS         | 0.55
8, 4    | 0.05 | 2.0 | 2.4     | 2     | AE-BlockGS | 0.11
8, 4    | 0.05 | 3.9 | 6.0     | 5     | AE-BlockGS | 0.11
Table 3.6: 2d anisotropic diffusion on square grid. Anisotropy is grid-aligned. 32768 elements, 16641 dofs. "Harmonic" prolongator is used inside V-cycle. No CG acceleration is used.
Crs. f. | τ    | C_A | C_{A+P} | N_lev | Smoother   | ρ
8, 4    | 0.05 | 1.9 | 2.3     | 2     | GS         | 0.27
8, 4    | 0.05 | 4.3 | 6.8     | 7     | GS         | 0.74
8, 4    | 0.05 | 1.9 | 2.3     | 2     | AE-BlockGS | 0.11
8, 4    | 0.05 | 4.3 | 6.8     | 7     | AE-BlockGS | 0.11
Figure 3.6: One of coarsest-grid basis functions. Anisotropic diffusion on square grid, 2048 elements, 1089 dofs. Fully-harmonic prolongator is used. Note that this is one function, not a sum of two functions.
Table 3.7: 2d anisotropic diffusion on unstructured grid, 25600 elements, 13041 dofs. "Harmonic" prolongator is used inside V-cycle. Conjugate gradient acceleration is used.
Crs. f. | τ    | C_A | C_{A+P} | N_lev | Smoother   | ρ    | Fig/Run
16, 4   | 0.15 | 2.0 | 2.5     | 2     | AE-BlockGS | 0.13 | 3.7/1
16, 4   | 0.15 | 4.0 | 5.9     | 7     | AE-BlockGS | 0.16 | 3.7/2
16, 4   | 0    | 1.5 | 1.9     | 2     | AE-BlockGS | 0.61 | 3.8/1
16, 4   | 0    | 1.6 | 2.2     | 7     | AE-BlockGS | 0.62 | 3.8/2
Table 3.8: 2d anisotropic diffusion on unstructured grid. 6400 elements, 3321 dofs. "Harmonic" prolongator is used inside V-cycle. Conjugate gradient acceleration is used.
Crs. f. | τ    | C_A | C_{A+P} | N_lev | Smoother   | ρ    | Fig/Run
16, 4   | 0.15 | 2.0 | 2.5     | 2     | AE-BlockGS | 0.12 | 3.7/3
16, 4   | 0.15 | 3.5 | 5.2     | 6     | AE-BlockGS | 0.14 | 3.7/4
16, 4   | 0    | 1.5 | 1.9     | 2     | AE-BlockGS | 0.44 | 3.8/3
16, 4   | 0    | 1.6 | 2.2     | 6     | AE-BlockGS | 0.44 | 3.8/4
Figure 3.7: Residual decrease history for some runs in tables 3.8, 3.7. In these runs, additional spectral coarse dofs were used (τ = 0.15).
Figure 3.8: Residual decrease history for some runs in tables 3.8, 3.7. In these runs, additional spectral coarse dofs were NOT used (τ = 0).
Figure 3.9: One of coarsest-grid basis functions. Anisotropic diffusion on
unstructured grid, 6400 elements, 3321 dofs. Fully-harmonic prolongator is
used. Coarsest grid contains 2 elements. This function corresponds to additional
spectral coarse dofs.
Table 3.9: 2d Laplacian on square grid, 2048 elements, 1089 dofs. "Tentative" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.
Crs. f. | τ | C_A | C_{A+P} | N_lev | Smoother | ρ    | Fig/Run
32, 4   | 0 | 1.2 | 1.4     | 2     | GS       | 0.21 | 3.10/1
32, 4   | 0 | 1.3 | 1.5     | 4     | GS       | 0.21 | 3.10/2
3.5 Tests with tentative prolongators within nested CG cycles
In this section we illustrate the performance of the method when tentative
(i.e. block-diagonal and orthogonal) prolongators (Qc) are used. In order to
compensate for prolongator instability, we use nested CG cycles described in
Section 3.2.7. We report in this section test results for Laplacian and anisotropic
diffusion. We use (pointwise) Gauss-Seidel as a smoother here, since our cycle
is now more robust (and expensive). Numerical results for this section are
found in tables 3.9, 3.10, 3.11, 3.12, 3.13, 3.14. Figures 3.10, 3.12, 3.14 contain
convergence plots for runs in the tables listed above. Figures 3.11, 3.13, 3.15
show graphs of some of the resulting coarsest basis functions.
Table 3.10: 2d Laplacian on square grid, 32768 elements, 16641 dofs. "Tentative" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.
Crs. f. | τ | C_A | C_{A+P} | N_lev | Smoother | ρ    | Fig/Run
32, 4   | 0 | 1.2 | 1.4     | 2     | GS       | 0.22 | 3.10/3
32, 4   | 0 | 1.3 | 1.5     | 6     | GS       | 0.22 | 3.10/4
Figure 3.10: Residual decrease history for some runs in tables 3.9, 3.10.
Table 3.11: 2d Laplacian on unstructured grid, 6400 elements, 3321 dofs. "Tentative" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.
Crs. f. | τ | C_A | C_{A+P} | N_lev | Smoother | ρ    | Fig/Run
32, 4   | 0 | 1.4 | 1.5     | 2     | GS       | 0.31 | 3.12/1
32, 4   | 0 | 1.5 | 1.7     | 5     | GS       | 0.32 | 3.12/2
Figure 3.11: One of coarsest-grid basis functions. 2d Laplacian on square grid, 2048 elements, 1089 dofs. "Tentative" prolongator is used. Coarsest grid contains 4 elements.
Table 3.12: 2d Laplacian on unstructured grid, 25600 elements, 13041 dofs. "Tentative" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.
Crs. f. | τ | C_A | C_{A+P} | N_lev | Smoother | ρ    | Fig/Run
32, 4   | 0 | 1.4 | 1.5     | 2     | GS       | 0.37 | 3.12/3
32, 4   | 0 | 1.5 | 1.7     | 6     | GS       | 0.42 | 3.12/4
Figure 3.12: Residual decrease history for some runs in tables 3.11, 3.12.
3.6 Adaptive AMG tests
Here we present numerical results for adaptive AMG discussed in Section
3.3. We apply this method to the Helmholtz problem described in Section 3.4,
with k2 varying from 0 (which is just the Laplace operator) to 200. In our
tests, we use the partially harmonic extension of the tentative prolongators (as
explained in Section 3.2.6) to define the actual interpolation matrices Pi used
within the adaptive V-cycle AMG. We also employ nested-CG cycles (described
in Section 3.2.7).
Everywhere adaptivity is done from scratch (i.e., no initial hierarchy is as-
sumed, hence adaptivity is based initially on the symmetrized Gauss-Seidel
Figure 3.13: One of coarsest-grid basis functions. 2d Laplacian on unstructured grid, 6400 elements, 3321 dofs. "Tentative" prolongator is used. Coarsest grid contains 4 elements.
Table 3.13: 2d anisotropic diffusion on unstructured grid, 6400 elements, 3321 dofs. "Tentative" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.
Crs. f. | τ    | C_A | C_{A+P} | N_lev | Smoother | ρ    | Fig/Run
32, 4   | 0.05 | 2.0 | 2.2     | 2     | GS       | 0.40 | 3.14/1
32, 4   | 0.05 | 2.7 | 3.5     | 5     | GS       | 0.44 | 3.14/2
Table 3.14: 2d anisotropic diffusion on unstructured grid, 25600 elements, 13041 dofs. "Tentative" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.
Crs. f. | τ    | C_A | C_{A+P} | N_lev | Smoother | ρ    | Fig/Run
32, 4   | 0.05 | 2.1 | 2.4     | 2     | GS       | 0.43 | 3.14/3
32, 4   | 0.05 | 3.3 | 4.4     | 6     | GS       | 0.46 | 3.14/4
Figure 3.14: Residual decrease history for some runs in tables 3.13, 3.14.
Figure 3.15: One of coarsest-grid basis functions. 2d anisotropic diffusion on unstructured grid, 25600 elements, 13041 dofs. "Tentative" prolongator is used. Coarsest grid contains 4 elements.
smoother). While obtaining smooth vectors for adaptivity, we do not use
CG at the finest level, i.e., we run the nested-CG cycle as a stationary iter-
ation. However, we do use CG in the final test runs. The number of smooth
vectors used to build the final AMG hierarchy is shown in the column N_vec of each table.
Numerical results for this section are found in tables 3.15, 3.16, 3.17, 3.18, 3.19, 3.20. Figures 3.16, 3.17, 3.18 contain convergence plots for runs shown in
the above tables. Figures 3.19, 3.20, 3.21 show graphs of some of the resulting
coarsest basis functions. Finally, in Figure 3.22, we show how the complexity and
convergence rate vary with the number of incorporated algebraically smooth
vectors.
Table 3.15: Adaptivity: FOSLS Helmholtz, k² = 0, curl penalty = 1, 6400 elements, 9963 dofs. "Partially-harmonic" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.
Crs. f. | C_A | C_{A+P} | N_lev | Smoother | N_vec | ρ    | Fig/Run
32, 4   | 2.2 | 3.1     | 2     | SymGS    | 6     | 0.10 | 3.16/1
32, 4   | 2.7 | 4.1     | 5     | SymGS    | 6     | 0.11 | 3.16/2
Table 3.16: Adaptivity: FOSLS Helmholtz, k² = 0, curl penalty = 1, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.
Crs. f. | C_A | C_{A+P} | N_lev | Smoother | N_vec | ρ    | Fig/Run
32, 4   | 2.4 | 3.3     | 2     | SymGS    | 7     | 0.11 | 3.16/3
32, 4   | 3.2 | 4.8     | 6     | SymGS    | 7     | 0.10 | 3.16/4

Table 3.17: Adaptivity: FOSLS Helmholtz, k² = 100, curl penalty = 1, 6400 elements, 9963 dofs. "Partially-harmonic" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.
Crs. f. | C_A | C_{A+P} | N_lev | Smoother | N_vec | ρ    | Fig/Run
32, 4   | 2.6 | 3.7     | 2     | SymGS    | 10    | 0.11 | 3.17/1
32, 4   | 3.6 | 5.5     | 5     | SymGS    | 10    | 0.11 | 3.17/2
Figure 3.16: Residual decrease history for some runs in tables 3.15, 3.16.
Table 3.18: Adaptivity: FOSLS Helmholtz, k² = 100, curl penalty = 1, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.
Crs. f. | C_A | C_{A+P} | N_lev | Smoother | N_vec | ρ    | Fig/Run
32, 4   | 2.6 | 3.7     | 2     | SymGS    | 10    | 0.10 | 3.17/3
32, 4   | 4.0 | 6.1     | 6     | SymGS    | 11    | 0.13 | 3.17/4
Figure 3.17: Residual decrease history for some runs in tables 3.17, 3.18.
Table 3.19: Adaptivity: FOSLS Helmholtz, k² = 200, curl penalty = 1, 6400 elements, 9963 dofs. "Partially-harmonic" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.
Crs. f. | C_A | C_{A+P} | N_lev | Smoother | N_vec | ρ    | Fig/Run
32, 4   | 2.5 | 3.6     | 2     | SymGS    | 9     | 0.13 | 3.18/1
32, 4   | 3.6 | 5.5     | 5     | SymGS    | 10    | 0.18 | 3.18/2
Table 3.20: Adaptivity: FOSLS Helmholtz, k² = 200, curl penalty = 1, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used inside "nested CG cycle". Conjugate gradient acceleration is used.
Crs. f. | C_A | C_{A+P} | N_lev | Smoother | N_vec | ρ    | Fig/Run
32, 4   | 2.7 | 3.7     | 2     | SymGS    | 10    | 0.10 | 3.18/3
32, 4   | 4.1 | 6.2     | 6     | SymGS    | 11    | 0.25 | 3.18/4
Figure 3.18: Residual decrease history for some runs in tables 3.19, 3.20.
Figure 3.19: One of coarsest-grid basis functions (coarsest function 21), u_x component. Adaptive AMG for FOSLS Helmholtz, k² = 200, 25600 elements, 39123 dofs. Partially-harmonic prolongator is used.
Figure 3.20: One of coarsest-grid basis functions (coarsest function 21), u_y component. Adaptive AMG for FOSLS Helmholtz, k² = 200, 25600 elements, 39123 dofs. Partially-harmonic prolongator is used.
Figure 3.21: One of coarsest-grid basis functions (coarsest function 21), p component. Adaptive AMG for FOSLS Helmholtz, k² = 200, 25600 elements, 39123 dofs. Partially-harmonic prolongator is used.
Figure 3.22: Adaptivity: FOSLS Helmholtz, k2 = 200, 6400 elements, 9963
dofs. Partially-harmonic prolongator is used inside nested CG cycle. Com-
plexity and convergence rate depending on the number of incorporated vectors.
The convergence rate is for the stationary method.
3.7 Conclusions
We have implemented a version of the spectral agglomerate AMGe method
that leads in a natural way to (orthogonal) tentative prolongation matrices.
These tentative prolongators can then be used to construct more stable inter-
polation matrices by harmonic extension (in the interior of the agglomerated
elements) that results in better convergence properties of standard V-cycle MG.
The spectral choice of coarse dofs by itself defines a scale of AMG methods that
can get more powerful (in terms of the convergence rate) at the expense of in-
creasingly higher operator complexity. Another alternative that we explored was
to keep the tentative prolongators in the computation and compensate for their
poor stability by using more expensive MG cycles; we considered the nested
CG cycle that gives rise to (mildly) non-linear MG mappings. Finally, we
note that a low tolerance spectral agglomerate AMGe can be used to initiate
an adaptive AMG cycle. Once the initial AMG hierarchy has been constructed
the individual element matrices are no longer needed for the adaptive AMG we
constructed. We chose, in the numerical tests, to construct the initial AMG
hierarchy "from scratch", which completely eliminated the need for the
fine-grid element matrices. This particular method though did utilize the fine-
grid element topology relations needed to construct the multilevel agglomerates
and respective minimal intersection sets. All approaches offer potential to gen-
erate better (in convergence properties) AMG methods. The approach based
on tentative prolongators (and more expensive MG cycles) has the fastest setup
among all; however, each cycle is more expensive (in setup and/or in cost per
cycle). In general, all approaches can get fairly expensive if we want to achieve
very small convergence factors. We have tried to demonstrate their performance
on a variety of test problems including scalar anisotropic diffusion as well as
Helmholtz problem in a SPD (FOSLS) formulation.
4. BLOPEX
4.1 Introduction
In this chapter we describe a new software package, Block Locally Optimal Preconditioned Eigenvalue Xolvers (BLOPEX), revision 1.0, publicly released recently. The exposition here is based on the papers Knyazev et al. [39], Lashuk et al. [43]. BLOPEX is available as a stand-alone serial library, which can be downloaded from http://math.cudenver.edu/~aknyazev/software/BLOPEX/, as an external package to the Portable, Extensible Toolkit for Scientific Computation (PETSc) Balay et al. [6], and is built into the High Performance Preconditioners (hypre) library Falgout et al. [19, 20]. Jose Roman has also written a Scalable Library for Eigenvalue Problem Computations (SLEPc) (Hernandez et al. [25]) interface to our hypre BLOPEX.
The native hypre BLOPEX version efficiently and directly takes advantage of
powerful hypre multigrid preconditioners, both algebraic (AMG, called Boomer-
AMG in hypre) and geometric or structured (SMG). BLOPEX in PETSc gives
the PETSc community easy access to the customizable code of a preconditioned
eigensolver.
At present, BLOPEX includes only one solver: the Locally Optimal Block
Preconditioned Conjugate Gradient (LOBPCG) Knyazev [32, 35] method for
eigenproblems Ax = λBx with large, possibly sparse, symmetric matrix A
and symmetric positive definite matrix B. Preconditioned eigenvalue solvers
in general, see Knyazev [34, 35], and in particular the LOBPCG method have
recently attracted attention as potential competitors to (block-) Lanczos meth-
ods. LOBPCG has been implemented using different computer languages in
a number of other software packages: C++ in Anasazi Trilinos Arbenz et al.
[1], Heroux et al. [26] and in NGSolve Zaglmayr [62], Sections 7.2-7.4; C in
PRIMME Stathopoulos [52]; FORTRAN 77 in PLTMG Bank [7]; FORTRAN
90 in ABINIT Bottin et al. [9] and in PESCAN Tomov et al. [54]; Python in
SciPy by Robert Cimrman and in PYFEMax by Peter Arbenz and Roman Geus.
LOBPCG has been independently tested in Yamada et al. [59, 60] for the Fermion-
Hubbard model on the Earth Simulator CDIR/MPI; in Arbenz et al. [1] and Borzi
and Borzi [8] using AMG preconditioning for vibration problems; and in To-
mov et al. [54], Yang et al. [61] for electronic structure calculations. Our hypre
BLOPEX version has been used in Bramble et al. [10] for the Maxwell problem.
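As a concrete illustration of how a preconditioned block eigensolver of this kind
is typically invoked, the following minimal sketch calls the SciPy LOBPCG imple-
mentation mentioned above; it is not BLOPEX itself, the 1-D Laplacian test matrix
and the Jacobi preconditioner are placeholders chosen for the example, and keyword
names may differ slightly between SciPy versions.

```python
# Illustrative only: a small standard eigenproblem A x = lambda x solved with
# scipy.sparse.linalg.lobpcg; in a real application an effective SPD
# preconditioner (e.g., a multigrid solve) would be passed via M=.
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import lobpcg

n, m = 100, 4                        # problem size and block size
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")  # 1-D Laplacian
M = diags(1.0 / A.diagonal())        # Jacobi (diagonal) preconditioner, SPD placeholder

rng = np.random.default_rng(0)
X0 = rng.standard_normal((n, m))     # block of m starting vectors

# Request the m smallest eigenpairs (largest=False); M plays the role of T.
vals, vecs = lobpcg(A, X0, M=M, tol=1e-6, maxiter=200, largest=False)
print(vals)                          # approximations to the m smallest eigenvalues
```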
Section 4.2 contains a complete and detailed description of the LOBPCG
algorithm as implemented in BLOPEX. We discuss our specific abstract im-
plementation approach in section 4.3. Parallel performance on distributed and
shared memory systems using domain decomposition and multigrid precondi-
tioning is discussed in section 4.4. We collect technical information, e.g., the list
of acronyms, in appendix A.
4.2 LOBPCG and its Implementations
In subsection 4.2.1, we briefly describe the ideas behind the LOBPCG
method mostly following Knyazev [32, 35]. In subsection 4.2.2, we present a
complete description of the LOBPCG algorithm as implemented in BLOPEX.
Deflation by hard and soft locking is discussed in subsection 4.2.3.
4.2.1 LOBPCG Ideas and Description
4.2.1.1 The problem and the assumptions
We consider the problem of computing the m smallest eigenvalues and the
corresponding eigenvectors of the generalized eigenvalue problem Ax = λBx,
with symmetric (Hermitian in the complex case) matrices A and B, the latter
assumed to be positive definite. For a given approximate eigenvector x, we use
the traditional Rayleigh quotient λ(x) = (x, Ax)/(x, Bx) in the standard scalar
product as an approximation to the corresponding eigenvalue. We emphasize
that only matrix B is assumed to be positive definite. There is no such re-
quirement on A; moreover, in all formulas in this section we can replace A with
A + αB and the formulas are invariant with respect to a real shift α.
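As a quick numerical illustration of this shift invariance, the following sketch
(with a randomly generated symmetric A and SPD B, purely for demonstration)
checks that replacing A by A + αB shifts the Rayleigh quotient of any vector by
exactly α:

```python
# Illustrative check: lambda_{A + alpha*B}(x) = lambda_{A}(x) + alpha.
import numpy as np

rng = np.random.default_rng(1)
n, alpha = 50, 3.7
M1 = rng.standard_normal((n, n)); A = (M1 + M1.T) / 2            # symmetric A
M2 = rng.standard_normal((n, n)); B = M2 @ M2.T + n * np.eye(n)  # SPD B

def rq(A, B, x):
    """Rayleigh quotient lambda(x) = (x, A x) / (x, B x)."""
    return (x @ (A @ x)) / (x @ (B @ x))

x = rng.standard_normal(n)
print(np.isclose(rq(A + alpha * B, B, x), rq(A, B, x) + alpha))  # True
```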
To accelerate the convergence, we introduce a preconditioner T, which is
typically a function that for a given vector x produces Tx. The existing the-
ory, e.g., Knyazev [32], Knyazev and Neymeyr [38], of the LOBPCG method is
based on several assumptions: T needs to be linear, symmetric, and positive
definite. Under these assumptions a specific convergence rate bound is proved
in Knyazev [32], Knyazev and Neymeyr [38]. Numerical tests suggest that all
these requirements on T are necessary in order to guarantee this convergence
rate. The use of preconditioning that does not satisfy the requirements above
not only can slow down the convergence, but may also lead to a breakdown of
the method and severe instabilities in the code resulting in incorrect results
that should not be mistakenly interpreted as bugs in the code. The method is
robust, however, even when T changes from iteration to iteration, as long as on
every step it satisfies the above assumptions.
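The sketch below illustrates the convention that T is supplied as a function
x ↦ Tx. The particular operator used here (a solve with the diagonal of a model
matrix) is only an assumption made for the example; any linear, symmetric,
positive definite T could be substituted.

```python
# A hedged sketch of a preconditioner supplied as "a function that for a given
# vector x produces T x"; T here is M^{-1} with M an SPD stand-in (the diagonal
# of a 1-D Laplacian), not a recommendation of a particular preconditioner.
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import splu, LinearOperator

n = 500
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
M = diags(A.diagonal(), 0, format="csc")   # SPD approximation of A (placeholder)
M_solve = splu(M).solve                    # factor once, apply many times

def T(x):
    """Apply the preconditioner: x -> M^{-1} x (linear, symmetric, positive definite)."""
    return M_solve(x)

T_op = LinearOperator((n, n), matvec=T)    # operator form, if a solver expects one

# Symmetry of T means (u, T v) = (T u, v); a quick spot check:
rng = np.random.default_rng(2)
u, v = rng.standard_normal(n), rng.standard_normal(n)
print(np.isclose(u @ T(v), T(u) @ v))      # True up to round-off
```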
4.2.1.2 Single-Vector LOBPCG
For computing only the smallest eigenpair, i.e., if m = 1, the LOBPCG
method takes the form of a 3-term recurrence:

x^{(i+1)} = x^{(i)} + τ^{(i)} w^{(i)} + γ^{(i)} x^{(i-1)},                    (4.1)

w^{(i)} = T(Ax^{(i)} - λ^{(i)} Bx^{(i)}),   λ^{(i)} = λ(x^{(i)}) = (x^{(i)}, Ax^{(i)})/(Bx^{(i)}, x^{(i)}),
with properly chosen scalar iteration parameters τ^{(i)} and γ^{(i)}. The easiest and
perhaps the most efficient choice of parameters is based on an idea of local
optimality Knyazev [32, 33, 35], namely, select τ^{(i)} and γ^{(i)} that minimize the
Rayleigh quotient λ(x^{(i+1)}) by using the Rayleigh-Ritz procedure in the three-
dimensional trial subspace spanned by x^{(i)}, w^{(i)} and x^{(i-1)}. We do not describe
the Rayleigh-Ritz procedure using an arbitrary basis of the trial subspace in this
chapter and assume that the reader is familiar with it.
If γ^{(i)} = 0, method (4.1) turns into the classical preconditioned locally opti-
mal steepest descent method, so one can think of method (4.1) as an attempt to
accelerate the steepest descent method by adding an extra term to the iterative
recurrence; e.g., Knyazev [33, 35]. The general idea of accelerating an iterative
sequence by involving the previous approximation is very natural and is well
known. Let us note here that if instead of the previous approximation x^{(i-1)}
we use the previous preconditioned residual w^{(i-1)} in formula (4.1), we get a
different method that in practical tests converges not as fast as method (4.1).
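A minimal sketch of one locally optimal step of recurrence (4.1) follows. It
assumes dense NumPy arrays and a callable preconditioner T, and it uses SciPy's
dense solver for the small generalized Gram eigenproblem; it deliberately ignores
the basis-conditioning issues discussed next and is not the BLOPEX code.

```python
import numpy as np
from scipy.linalg import eigh

def lobpcg_single_step(A, B, T, x, x_prev):
    """One locally optimal step of (4.1): Rayleigh-Ritz in span{x, w, x_prev}.
    Assumes the three basis vectors are linearly independent (no safeguards)."""
    lam = (x @ (A @ x)) / (x @ (B @ x))      # Rayleigh quotient lambda(x)
    w = T(A @ x - lam * (B @ x))             # preconditioned residual
    Z = np.column_stack([x, w, x_prev])      # basis of the 3-dimensional trial subspace
    # Small generalized eigenproblem (Z^T A Z) c = theta (Z^T B Z) c
    theta, C = eigh(Z.T @ A @ Z, Z.T @ B @ Z)
    x_new = Z @ C[:, 0]                      # Ritz vector for the smallest Ritz value
    return x_new / np.sqrt(x_new @ (B @ x_new))   # normalize in the B-inner product
```

On the very first step, when no previous iterate is available, the same procedure
is applied in the two-dimensional subspace spanned by x and w.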
When implementing the Rayleigh-Ritz procedure in the trial subspace spanned
by x^{(i)}, w^{(i)} and x^{(i-1)}, we need to be careful in choosing an appropriate basis
that leads to reasonably conditioned Gram matrices. The current eigenvector
approximation x^{(i)} and the previous eigenvector approximation x^{(i-1)} get
closer to each other in the process of iterations, so special measures need to be
taken in method (4.1) to overcome the potential numerical instability due to the
round-off errors. Using both vectors x^{(i)} and x^{(i-1)} as basis vectors of the trial
subspace leads to ill-conditioned Gram matrices, and the Rayleigh-Ritz method
can produce spurious eigenpairs.
A simple fix is suggested in Knyazev [32]: the three-term recurrence that
contains the current eigenvector approximation, the preconditioned residual,
and the implicitly computed difference between the current and the previous
eigenvector approximations:
x^{(i+1)} = x^{(i)} + τ^{(i)} w^{(i)} + γ^{(i)} p^{(i)},    w^{(i)} = T(Ax^{(i)} - λ^{(i)} Bx^{(i)}),
p^{(i+1)} = τ^{(i)} w^{(i)} + γ^{(i)} p^{(i)},    p^{(0)} = 0,    λ^{(i)} = λ(x^{(i)}).        (4.2)

Here x^{(i+1)} ∈ span{w^{(i)}, x^{(i)}, p^{(i)}} = span{w^{(i)}, x^{(i)}, x^{(i-1)}}, as
p^{(i+1)} = x^{(i+1)} - x^{(i)}; therefore, the new formula (4.2) is mathematically
equivalent to (4.1), in exact arithmetic. We choose in (4.2) the scalar iteration
parameters τ^{(i)} and γ^{(i)} as above, i.e., minimizing the Rayleigh quotient λ(x^{(i+1)}).

We note that formula (4.2) is not quite the ultimate fix, e.g., it is possible
in principle that at the initial stage of the iterations x^{(i)} is relatively small,
so that p^{(i+1)} is too close to x^{(i+1)}, and the Rayleigh-Ritz procedure may fail.
The proper choice of the basis in the trial subspace span{w^{(i)}, x^{(i)}, p^{(i)}} for the
Rayleigh-Ritz procedure in (4.2), having in mind that vectors w^{(i)} and p^{(i)}
converge to zero, is not a trivial issue, see Hetmaniuk and Lehoucq [27] for
possible causes of LOBPCG instability.
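The same step written with the implicitly computed difference p of (4.2) might
look as follows; this is again an illustrative sketch under the same assumptions
(dense arrays, callable T) and omits the safeguards discussed above.

```python
import numpy as np
from scipy.linalg import eigh

def lobpcg_single_step_p(A, B, T, x, p=None):
    """One locally optimal step in the form (4.2); p=None encodes p^(0) = 0."""
    lam = (x @ (A @ x)) / (x @ (B @ x))
    w = T(A @ x - lam * (B @ x))
    Z = np.column_stack([x, w] if p is None else [x, w, p])
    theta, C = eigh(Z.T @ A @ Z, Z.T @ B @ Z)
    c = C[:, 0]                               # coefficients of the best Ritz vector
    p_new = Z[:, 1:] @ c[1:]                  # the w- and p-parts: tau*w + gamma*p
    x_new = c[0] * x + p_new                  # the same Ritz vector, split as in (4.2)
    scale = 1.0 / np.sqrt(x_new @ (B @ x_new))
    return x_new * scale, p_new * scale

# Typical use: x, p = x0, None; then repeat
#   x, p = lobpcg_single_step_p(A, B, T, x, p)
# until the residual norm of (A x - lambda(x) B x) is small enough.
```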
The locally optimal choice of the step sizes in the LOBPCG allows an easy
generalization from the single-vector version (4.1) or (4.2) to the block version
described next in subsection 4.2.1.3, where a block of m vectors is iterated simul-
taneously. We return to the discussion of the choice of the basis in subsection
4.2.2, where it becomes even more vital with the increase of the block size m.
4.2.1.3 The LOBPCG Basics
A block version of LOBPCG for finding the m > 1 smallest eigenpairs is
suggested in Knyazev [34, 35]:
x_j^{(i+1)} ∈ span{ x_1^{(i)}, ..., x_m^{(i)},
                    T(Ax_1^{(i)} - λ_1^{(i)} Bx_1^{(i)}), ..., T(Ax_m^{(i)} - λ_m^{(i)} Bx_m^{(i)}),
                    x_1^{(i-1)}, ..., x_m^{(i-1)} },

where x_j^{(i+1)} is computed as the j-th Ritz vector, j = 1, ..., m, corresponding to
the j-th smallest Ritz value in the Rayleigh-Ritz procedure on a 3m-dimensional
trial subspace. The block method has the same problem of having close vectors
in the trial subspace as the single-vector version (4.1) discussed in the previous
subsection, for which Knyazev [32] suggested the same fix using the directions
p.
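An illustrative sketch of one step of this block iteration, under the same
assumptions as the single-vector sketches above (dense arrays, a callable
preconditioner T that accepts a block of vectors), is given below; a practical
code would also re-orthonormalize the blocks, as in the detailed algorithm of
subsection 4.2.2.

```python
import numpy as np
from scipy.linalg import eigh

def lobpcg_block_step(A, B, T, X, P=None):
    """One block step: Rayleigh-Ritz on the (up to) 3m-dimensional trial
    subspace spanned by the columns of X, W and P (no safeguards)."""
    lam = np.diag(X.T @ A @ X) / np.diag(X.T @ B @ X)  # per-column Rayleigh quotients
    W = T(A @ X - (B @ X) * lam)                       # block of preconditioned residuals
    Z = np.hstack([X, W] if P is None else [X, W, P])
    theta, C = eigh(Z.T @ A @ Z, Z.T @ B @ Z)          # Rayleigh-Ritz on the trial subspace
    m = X.shape[1]
    X_new = Z @ C[:, :m]                               # m Ritz vectors, m smallest Ritz values
    P_new = Z[:, m:] @ C[m:, :m]                       # the W- and P-components, as in (4.2)
    return X_new, P_new
```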
As in other block methods, the block size should be chosen, if possible, to
provide a large gap between the first m eigenvalues and the rest of the spectrum,
as this typically leads to a better convergence, see Knyazev [32], Knyazev and
Neymeyr [38], Ovtchinnikov [47]. Block methods generally handle clusters in the
spectrum and multiple eigenvalues quite well; and the LOBPCG is no exception.
An attempt should be made to include the whole cluster of eigenvalues into the
block, while for multiple eigenvalues this is not essential at all. The numerical
experience is that the smallest eigenvalues often converge faster, and that the few
eigenvalues with the noticeably slow convergence usually are the largest in the
block and typically correspond to a cluster of eigenvalues that is not completely
included in the block. A possible cure is to use a flexible block size a bit larger
than the number of eigenpairs actually wanted and to ignore poorly converging
eigenpairs.
The block size m should not be so big that the costs of the Rayleigh-Ritz
procedure in the 3m-dimensional trial subspace dominate the costs of iterations.
If a large number of eigenpairs is needed, deflation is necessary, see subsec-
tion 4.2.3. The optimal choice of the block size m is problem-, software- and
computer-dependent; e.g., it is affected by such details as the amount of the
CPU cache and the use of optimized BLAS library functions in the multivector
implementation, see section 4.3.
4.2.2 The Detailed Description of the LOBPCG Algorithm
The description of the LOBPCG algorithm as implemented in our BLOPEX-
1.0 code follows:
Input: m starting linearly independent vectors in X ∈ R^{n×m}, l linearly independent
constraint vectors in Y ∈ R^{n×l}, devices to compute A*X, B*X and T*X.

1. Allocate memory: W, P, Q, AX, AW, AP, BX, BW, BP ∈ R^{n×m}, BY ∈ R^{n×l}.
2. Apply the constraints to X:
   BY = B*Y;  X = X - Y*(Y^T*BY)^{-1}*((BY)^T*X).
3. B-orthonormalize X: BX = B*X; R = chol(X^T*BX); X = X*R^{-1};
   BX = BX*R^{-1}; AX = A*X. (Note: chol is the Cholesky decomposition.)
4. Compute the initial Ritz vectors: solve the eigenproblem
   (X^T*AX)*TMP = TMP*Λ;
   and compute X = X*TMP; AX = AX*TMP; BX = BX*TMP.
5. Define the index set J of active iterates to be {1, ..., m}.
6. for k = 0, ..., MaxIterations:
7.   Compute the residuals: W_J = AX_J - BX_J*Λ_J.
8.   Exclude from the index set J the indices that correspond to residual
     vectors for which the norm has become smaller than the tolerance.
     If J then becomes empty, exit loop.
9.   Apply the preconditioner T to the residuals: W_J = T*W_J.
10.  Apply the constraints to the preconditioned residuals W_J:
     W_J = W_J - Y*(Y^T*BY)^{-1}*((BY)^T*W_J).
11.  Compute BW_J and B-orthonormalize W_J: BW_J = B*W_J;
     R = chol(W_J^T*BW_J); W_J = W_J*R^{-1}; BW_J = BW_J*R^{-1}.
12.  Compute AW_J: AW_J = A*W_J.
13.  if k > 0
14.    B-orthonormalize P_J: R = chol(P_J^T*BP_J); P_J = P_J*R^{-1};
15.    Update AP_J = AP_J*R^{-1}; BP_J = BP_J*R^{-1}.
16.  end if
     Perform the Rayleigh-Ritz procedure:
     Compute symmetric Gram matrices:
17.  if k > 0