ON PRECONDITIONING FOR LINEAR EQUATIONS AND EIGENVALUE
PROBLEMS
by
Ilya Lashuk
M.S., Moscow State University, 2002
A thesis submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Applied Mathematics
2007
This thesis for the Doctor of Philosophy
degree by
Ilya Lashuk
has been approved
by
Andrew Knyazev
Krzysztof Cios
Leo Franca
Tien
ou
Date
Jan Mandel
Lashuk, Ilya (Ph.D., Applied Mathematics)
On Preconditioning for Linear Equations and Eigenvalue Problems
Thesis directed by Professor Andrew Knyazev
ABSTRACT
In this dissertation, we focus on three areas of research related to preconditioning for linear systems and eigenvalue problems.
We analyze the behaviour of the conjugate gradient method for linear systems with variable preconditioning. We consider the case where the preconditioner is widely variable, i.e., preconditioners used on different iterations are essentially unrelated. We assume that the preconditioner on every step is symmetric and positive definite and that the condition number of the preconditioned system matrix is bounded above by a constant independent of the step number.
Our main result is of a negative nature: we show that the conjugate gradient
method with variable preconditioning may not yield any improvement relative
to the preconditioned steepest descent method.
We describe a Matlab implementation of the element-based algebraic multigrid (AMG) methods that target linear systems of equations coming from finite element discretizations of elliptic partial differential equations. The individual element information (element matrices and element topology) is the main input to construct the AMG hierarchy. We analyze a number of variants of the spectral (based on solving a large number of local eigenvalue problems) agglomerate element-based AMG method. The core of the algorithms relies on element agglomeration utilizing the element topology, built recursively from fine to coarse levels. We investigate strategies for adaptive AMG as well as multigrid cycles that are more expensive than the V-cycle, utilizing simple interpolation matrices and nested conjugate gradient based recursive calls between the levels. We perform an extensive set of numerical experiments.
We describe our software package Block Locally Optimal Preconditioned Eigenvalue Xolvers (BLOPEX), which includes the Locally Optimal Block Preconditioned Conjugate Gradient method for symmetric eigenvalue problems. We demonstrate numerical scalability of BLOPEX by testing it on a number of distributed and shared memory parallel systems, including a Beowulf system, SUN Fire 880, an AMD dual-core Opteron workstation, and an IBM BlueGene/L supercomputer, using PETSc (Portable, Extensible Toolkit for Scientific Computation by Argonne National Laboratory) domain decomposition and hypre (High Performance Preconditioners by Lawrence Livermore National Laboratory) multigrid preconditioning.
This abstract accurately represents the content of the candidate's thesis. I recommend its publication.
Signed
Andrew Knyazev
ACKNOWLEDGMENT
I thank my advisor, Andrew Knyazev, for his support and guidance during all
stages of my research and dissertation preparation. I am grateful to Panayot
Vassilevski. Chapter 3 of this dissertation is based on the research that I have
performed under his supervision at the Lawrence Livermore National Laboratory
(LLNL). I am indebted to Merico Argentati, Julien Langou, and Jan Mandel
for advice and helpful discussions.
I thank the Department of Mathematical Sciences at the University of Colorado Denver (UCD) for providing an exciting learning environment. I acknowledge financial support from the Department of Mathematical Sciences UCD (one-year Ph.D. fellowship), the National Science Foundation (two years of research assistantship), and LLNL (two summer fellowships).
CONTENTS
Figures ix
Tables.................................................................. xii
Chapter
1. Introduction........................................................... 1
2. Variable preconditioning............................................... 9
2.1 Introduction......................................................... 9
2.2 Pointed circular cones represent sets of SPD matrices with varying
condition numbers .................................................. 14
2.3 Local optimality of the method with variable preconditioning .... 16
2.4 Convergence rate bounds for variable preconditioning................ 20
2.5 The convergence rate bound is sharp.......................... 21
2.6 Complex Hermitian case.............................................. 23
2.7 Practical PCG algorithms............................................ 27
2.8 Numerical experiments............................................... 29
2.8.1 Numerical illustration of the main results 29
2.8.2 Inner-outer iterations as variable preconditioning 31
2.8.3 Random vs. fixed preconditioning................................ 32
2.9 Conclusions......................................................... 34
3. Algebraic multigrid via element agglomeration......................... 36
3.1 Introduction....................................................... 36
3.2 Basic building tools............................................... 38
3.2.1 Input data 38
3.2.2 Agglomerating elements........................................... 40
3.2.3 Building minimal intersection sets............................... 40
3.2.4 Forming Schur complements, computing eigenvalues and building
tentative prolongators........................................... 41
3.2.5 Proceeding by recursion.......................................... 43
3.2.6 On the theoretical justification of the spectral AMGe approach 45
3.2.7 The nonlinear nested-CG AMLI-type multigrid cycles............. 47
3.3 Adaptivity in the element agglomeration AMGe framework............ 48
3.4 Numerical results for spectral agglomerate AMGe................... 54
3.5 Tests with tentative prolongators within nested CG cycles 64
3.6 Adaptive AMG tests................................................. 67
3.7 Conclusions........................................................ 77
4. BLOPEX .............................................................. 79
4.1 Introduction....................................................... 79
4.2 LOBPCG and its Implementations..................................... 80
4.2.1 LOBPCG Ideas and Description .................................... 81
4.2.1.1 The problem and the assumptions 81
4.2.1.2 Single-Vector LOBPCG........................................... 82
4.2.1.3 The LOBPCG Basics 84
4.2.2 The Detailed Description of the LOBPCG Algorithm................. 85
4.2.3 LOBPCG deflation by hard and soft locking........................ 91
4.3 Abstract BLOPEX implementation for PETSc and hypre 94
4.4 BLOPEX LOBPCG Numerical Results in PETSc and hypre .... 96
4.4.1 The Basic Accuracy of the Algorithm.............................. 96
4.4.2 Performance Versus the Number of Inner Iterations......... 97
4.4.3 LOBPCG Performance vs. Block Size 99
4.4.4 Scalability.................................................. 99
4.5 Conclusions.................................................... 101
Appendix
A. BLOPEX appendix................................................. 103
A.1 Description of all LOBPCG-BLOPEX variables..................... 103
A.2 Installing and testing BLOPEX with PETSc and hypre........... 104
A.3 Computers used for testing..................................... 106
A.4 Acronyms and other names ................................ 107
A.4.1 Organizations................................................ 107
A.4.2 Software packages 107
A.4.3 hypre interfaces and preconditioners ........................ 108
A.4.4 Methods...................................................... 109
References......................................................... 110
LIST OF FIGURES
Figure
2.1 Algorithm 2.7.1 with (2.13) fails to provide the PSD convergence rate. 30
2.2 The PSD and PCG methods with preconditioning by inner CG with different stopping criteria δ = 0.2, 0.4, 0.6 and 0.8 (from the bottom to the top).......................................................... 31
2.3 Two-grid preconditioning with fixed (left) and random (right) coarse grids................................................................ 33
3.1 Agglomerated elements................................................. 41
3.2 Minimal intersection sets............................................. 42
3.3 One of coarsest-grid basis functions. Laplacian on square grid, 2048 elements, 1089 dofs. "Fully-harmonic" prolongator is used........... 58
3.4 Residual decrease history for some runs in tables 3.3, 3.4.......... 59
3.5 One of coarsest-grid basis functions. Laplacian on unstructured grid, 6400 elements, 3321 dofs. "Fully-harmonic" prolongator is used. Coarsest grid contains 8 elements.............................. 59
3.6 One of coarsest-grid basis functions. Anisotropic diffusion on square grid, 2048 elements, 1089 dofs. "Fully-harmonic" prolongator is used. Note that this is one function, not a sum of two functions........ 61
3.7 Residual decrease history for some runs in tables 3.8, 3.7. In these runs, additional spectral coarse dofs were used (r = 0.15)........ 62
3.8 Residual decrease history for some runs in tables 3.8, 3.7. In these runs, additional spectral coarse dofs were NOT used (r = 0)........ 63
3.9 One of coarsest-grid basis functions. Anisotropic diffusion on unstructured grid, 6400 elements, 3321 dofs. "Fully-harmonic" prolongator is used. Coarsest grid contains 2 elements. This function corresponds to additional spectral coarse dofs........... 63
3.10 Residual decrease history for some runs in tables 3.9, 3.10........ 65
3.11 One of coarsest-grid basis functions. 2d Laplacian on square grid, 2048 elements, 1089 dofs. "Tentative" prolongator is used. Coarsest grid contains 4 elements......................................... 66
3.12 Residual decrease history for some runs in tables 3.11, 3.12....... 67
3.13 One of coarsest-grid basis functions. 2d Laplacian on unstructured grid, 6400 elements, 3321 dofs. "Tentative" prolongator is used. Coarsest grid contains 4 elements............................... 68
3.14 Residual decrease history for some runs in tables 3.13, 3.14....... 69
3.15 One of coarsest-grid basis functions. 2d anisotropic diffusion on unstructured grid, 25600 elements, 13041 dofs. "Tentative" prolongator is used. Coarsest grid contains 4 elements......................... 70
3.16 Residual decrease history for some runs in tables 3.15, 3.16......... 72
3.17 Residual decrease history for some runs in tables 3.17, 3.18......... 73
3.18 Residual decrease history for some runs in tables 3.19, 3.20......... 74
3.19 One of coarsest-grid basis functions, u_x component. Adaptive AMG for FOSLS Helmholtz, k^2 = 200, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used............................ 75
3.20 One of coarsest-grid basis functions, u_y component. Adaptive AMG for FOSLS Helmholtz, k^2 = 200, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used............................ 75
3.21 One of coarsest-grid basis functions, p component. Adaptive AMG for FOSLS Helmholtz, k^2 = 200, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used............................ 76
3.22 Adaptivity: FOSLS Helmholtz, k^2 = 200, 6400 elements, 9963 dofs. "Partially-harmonic" prolongator is used inside "nested-CG cycle". Complexity and convergence rate depending on the number of incorporated vectors. The convergence rate is for the stationary method. 76
4.1 Residual norms (left) and eigenvalue errors (right) for 10 eigenpairs in LOBPCG with soft locking. Soft-locked (the bottom 6) eigenpairs continue to improve............................................. 94
4.2 BLOPEX hypre and PETSc software modules............................ 95
4.3 LOBPCG Performance vs. Block Size. Preconditioner: hypre BoomerAMG........................................................... 100
4.4 LOBPCG scalability, 1M unknowns per node on Beowulf................ 101
LIST OF TABLES
Table
3.1 2d Laplacian on square grid. 2048 elements, 1089 dofs. "Harmonic" prolongator is used inside V-cycle. No CG acceleration is used....... 57
3.2 2d Laplacian on square grid. 32768 elements, 16641 dofs. "Harmonic" prolongator is used inside V-cycle. No CG acceleration is used....... 57
3.3 2d Laplacian on unstructured grid, 6400 elements, 3321 dofs. "Harmonic" prolongator is used inside V-cycle. Conjugate gradient acceleration is used............................................... 57
3.4 2d Laplacian on unstructured grid, 25600 elements, 13041 dofs. "Harmonic" prolongator is used inside V-cycle. Conjugate gradient acceleration is used............................................... 58
3.5 2d anisotropic diffusion on square grid. Anisotropy is grid-aligned. 2048 elements, 1089 dofs. "Harmonic" prolongator is used inside V-cycle. No CG acceleration is used................................ 60
3.6 2d anisotropic diffusion on square grid. Anisotropy is grid-aligned. 32768 elements, 16641 dofs. "Harmonic" prolongator is used inside V-cycle. No CG acceleration is used................................ 60
3.7 2d anisotropic diffusion on unstructured grid, 25600 elements, 13041 dofs. "Harmonic" prolongator is used inside V-cycle. Conjugate gradient acceleration is used....................................... 61
3.8 2d anisotropic diffusion on unstructured grid, 6400 elements, 3321 dofs. "Harmonic" prolongator is used inside V-cycle. Conjugate gradient acceleration is used....................................... 62
3.9 2d Laplacian on square grid, 2048 elements, 1089 dofs. "Tentative" prolongator is used inside "nested-CG cycle". Conjugate gradient acceleration is used............................................... 64
3.10 2d Laplacian on square grid, 32768 elements, 16641 dofs. "Tentative" prolongator is used inside "nested-CG cycle". Conjugate gradient acceleration is used............................................... 65
3.11 2d Laplacian on unstructured grid, 6400 elements, 3321 dofs. "Tentative" prolongator is used inside "nested-CG cycle". Conjugate gradient acceleration is used...................................... 65
3.12 2d Laplacian on unstructured grid, 25600 elements, 13041 dofs. "Tentative" prolongator is used inside "nested-CG cycle". Conjugate gradient acceleration is used...................................... 66
3.13 2d anisotropic diffusion on unstructured grid, 6400 elements, 3321 dofs. "Tentative" prolongator is used inside "nested-CG cycle". Conjugate gradient acceleration is used............................ 68
3.14 2d anisotropic diffusion on unstructured grid, 25600 elements, 13041 dofs. "Tentative" prolongator is used inside "nested-CG cycle". Conjugate gradient acceleration is used............................ 69
3.15 Adaptivity: FOSLS Helmholtz, k^2 = 0, curl penalty = 1, 6400 elements, 9963 dofs. "Partially-harmonic" prolongator is used inside "nested-CG cycle". Conjugate gradient acceleration is used......... 71
3.16 Adaptivity: FOSLS Helmholtz, k^2 = 0, curl penalty = 1, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used inside "nested-CG cycle". Conjugate gradient acceleration is used......... 71
3.17 Adaptivity: FOSLS Helmholtz, k^2 = 100, curl penalty = 1, 6400 elements, 9963 dofs. "Partially-harmonic" prolongator is used inside "nested-CG cycle". Conjugate gradient acceleration is used......... 71
3.18 Adaptivity: FOSLS Helmholtz, k^2 = 100, curl penalty = 1, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used inside "nested-CG cycle". Conjugate gradient acceleration is used......... 72
3.19 Adaptivity: FOSLS Helmholtz, k^2 = 200, curl penalty = 1, 6400 elements, 9963 dofs. "Partially-harmonic" prolongator is used inside "nested-CG cycle". Conjugate gradient acceleration is used......... 73
3.20 Adaptivity: FOSLS Helmholtz, k^2 = 200, curl penalty = 1, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used inside "nested-CG cycle". Conjugate gradient acceleration is used......... 74
4.1 Scalability for 80 x 80 x 80 = 512,000 mesh per CPU, m 1/6. . 100
4.2 Scalability for 50 x 50 x 50 = 125,000 mesh per CPU, m = 50. 101
1. Introduction
The field of numerical linear algebra focuses primarily on two important
problems in applied mathematics: the numerical solution of linear systems of
equations and eigenvalue problems. Progress in the field and increasing capabilities of computer hardware make problems of record sizes tractable. For problems
of modest sizes, robust direct methods are available. Therefore, in many cases
it is possible to obtain an automated black box solution.
Nevertheless, many linear systems and eigenvalue problems of large sizes
remain difficult to solve. One particular example of such a problem arises in the
numerical solution of partial differential equations (PDEs). Matrices emerging
from the discretization of PDEs are typically very large but sparse, i.e., the
number of nonzero elements is proportional to the matrix dimension. When working with sparse matrices, it is possible to store only a list of the nonzero elements. This makes assembling the system matrix a linear complexity task, i.e., one requiring storage and an operation count proportional to the problem size. In such a situation it is tempting to seek solution methods which have (nearly) linear complexity.
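As a concrete illustration (not from the thesis), the nonzeros of a one-dimensional Laplacian can be stored as a coordinate list, so that both the storage and the matrix-vector product cost O(n) rather than O(n^2):

```python
# Illustrative sketch (not from the thesis): the 1-D Laplacian stored as a
# coordinate list of nonzeros, so that storage and the matrix-vector
# product both cost O(n) instead of O(n^2) for the dense matrix.

def laplacian_1d(n):
    """Nonzeros of the n-by-n tridiagonal matrix tridiag(-1, 2, -1)."""
    entries = []  # (row, col, value) triples
    for i in range(n):
        entries.append((i, i, 2.0))
        if i > 0:
            entries.append((i, i - 1, -1.0))
        if i + 1 < n:
            entries.append((i, i + 1, -1.0))
    return entries

def matvec(entries, x):
    """y = A x using only the stored nonzeros: O(nnz) operations."""
    y = [0.0] * len(x)
    for i, j, v in entries:
        y[i] += v * x[j]
    return y

entries = laplacian_1d(5)
# 13 stored nonzeros instead of 25 dense entries; applying A to the
# all-ones vector leaves only the two boundary rows nonzero.
```

Production AMG and Krylov codes use compressed sparse formats rather than a raw coordinate list, but the complexity argument is the same.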
Krylov subspace methods (e.g., Greenbaum [23]) are well-known both for the solution of linear systems and for eigenvalue computations. For many such methods,
each iteration has linear complexity. Thus, the overall complexity of the method
is linear, as long as the number of iterations (required to achieve a desired
accuracy of the approximate solution) does not grow with the problem size.
However, without the use of preconditioning (discussed below), the required number of iterations typically becomes unbounded as the problem size goes to infinity.
In order to keep the number of iterations bounded, Krylov methods are
enhanced with preconditioning techniques. The preconditioned steepest descent
(PSD) method, described below, is a simple but illustrative example of the
concept of preconditioning. Given a linear system $Ax = f$ with a symmetric positive definite (SPD) matrix $A$, the method can be written as
$$x_{k+1} = x_k + \alpha B^{-1}(f - Ax_k).$$
Here the vector $x_k$ is the current approximation to the solution $x = A^{-1}f$, and $x_{k+1}$ is the new approximation to the solution. The scalar $\alpha$ is chosen to minimize a particular norm of the error $e_{k+1} = x - x_{k+1}$. With an appropriate choice of the norm, this scalar is computable from the known quantities $A$, $f$ and $x_k$.
The matrix $B$, called the preconditioner, is in this case assumed to be SPD and of the same size as $A$. The case where $B$ is the identity matrix corresponds to the absence of preconditioning, i.e., to the plain steepest descent (SD) method. Intuitively, the role of $B$ is to approximate $A$, so that $B^{-1}$ approximates $A^{-1}$. Indeed, in the case $B = A$ the method converges in one iteration, since $B^{-1}(f - Ax_k)$ reduces to $x - x_k$, and we obtain $x_{k+1} = x$ by taking $\alpha = 1$.
Another intuitive interpretation is that multiplying by $B^{-1}$ rotates the residual $r_k = f - Ax_k$ in such a way that the angle (in a particular inner product) between the so-called preconditioned residual $s_k = B^{-1}r_k$ and the current error $e_k = x - x_k$ is reduced. For example, having $s_k$ and $e_k$ parallel, i.e., $s_k = \gamma e_k$, yields immediate convergence with $\alpha = 1/\gamma$. In fact, the convergence rate on a given iteration depends exclusively on the angle between $e_k$ and $s_k$; see, e.g., Lemma 2.9 on page 20 of this dissertation.
It follows from the discussion above that it is desirable for the preconditioner to satisfy the following requirements. First, its quality, i.e., how well $B^{-1}$ approximates $A^{-1}$, must not degrade as the problem size increases. Informally speaking, this is what keeps the number of iterations bounded. Second, the action $T = B^{-1}$ of applying the preconditioner must itself have linear complexity.
It turns out that a PSD method can be viewed as an SD method applied to the transformed system $B^{-1}Ax = B^{-1}f$ and using a different inner product (with respect to which, in particular, the matrix $B^{-1}A$ is symmetric). For the PSD method itself, this property is of moderate importance. However, a similar observation is crucial in establishing convergence properties of other Krylov subspace methods, such as the preconditioned conjugate gradient (PCG) method.
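The symmetry claim can be verified in one line, using only the symmetry of $A$ and $B$ and the $B$-inner product $(x, y)_B = (x, By)$:

```latex
(B^{-1}Ax,\, y)_B = (B^{-1}Ax,\, By) = (Ax,\, y) = (x,\, Ay)
                  = (x,\, B\,B^{-1}Ay) = (x,\, B^{-1}Ay)_B .
```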
The interpretation of preconditioning as a transformation of the original
linear system may not be appropriate for all practical situations, e.g., where
the preconditioner is variable, i.e., changes from iteration to iteration. Such a
situation can arise, e.g., if the preconditioner involves inner iterations.
The analysis of the PCG method with variable preconditioning constitutes Chapter 2 of this thesis. The chapter is based on the paper Knyazev and Lashuk [40]; an extended version is published as the technical report Knyazev and Lashuk [41]. We consider the case where the preconditioner is widely variable, that is, preconditioners used on different iterations are essentially unrelated. We assume that the preconditioner is SPD on each step, and that the condition number of
the preconditioned system matrix is bounded above by a constant independent
of the step number. We do not impose any other restrictions on the sequence of
preconditioners. Our main result in this chapter is of a negative nature: we show that the PCG method with variable preconditioning under our assumptions may not yield any improvement relative to the PSD method.
In particular, in Section 2.2 we prove that a nonzero real vector multiplied by
all real SPD matrices with a condition number bounded by a constant generates
a pointed circular cone. This observation is used later in Section 2.5 to derive
the main result.
In Section 2.3 we discuss some basic properties of the PCG method with
variable preconditioning. These properties are later used in Section 2.4 to derive
the convergence rate bound for the method. Specifically, in Section 2.4 we prove
that the standard PSD convergence rate bound still holds for PCG with variable
preconditioning.
As was mentioned before, in Section 2.5 we present our main result: it turns out that the PSD convergence rate bound is sharp for the PCG method with variable preconditioning (under our assumptions). That is, we can construct a sequence of preconditioners such that the PSD convergence rate bound is attained on every iteration.
In Sections 2.7 and 2.8 we discuss PCG algorithms used in practice (that
is, their sensitivity to variable preconditioning) and provide some numerical
experiments.
In Chapter 3 we turn our attention to one particular type of preconditioner:
the multilevel methods. Multigrid (MG) (e.g., Trottenberg et al. [55]) is one of
the most efficient and natural methods for solving linear systems of equations coming from partial differential equations discretized on a sequence of grids. In algebraic multigrid (AMG), Brandt et al. [11, 12], Ruge and Stüben [48], Stüben [53], the necessary MG components (coarse grids, coarse-grid operators, interpolation operators) are built by the solver algorithm, as opposed to geometric multigrid, where these components are naturally given by the discretization. An extreme case of an algebraic multigrid approach would lead to a black box solver, i.e., an algorithm which would only use the linear system matrix as input data. In practice, all AMG methods utilize some additional (often assumed) information about the class of problems they are applied to. In this chapter we deal with the class of problems that come from finite element discretizations of elliptic PDEs. More specifically, we focus on variants of the element-based AMG (or AMGe) methods developed in Brezina et al. [13], Chartier et al. [15, 16], Henson and Vassilevski [24], Jones and Vassilevski [28].
We have implemented several algorithms in Matlab in the framework of
the element agglomeration spectral AMGe (Chartier et al. [16]) and illustrate
their performance. Our Matlab implementation allows for further extensions
and offers potential for illustrating other element-based AMG algorithms.
The chapter, based on the paper Lashuk and Vassilevski [42], is structured as follows. Section 3.2 describes the main building tools of the implemented algorithms. In particular, in Section 3.2.1 we describe the required input, Section 3.2.2 reviews some details on element agglomeration, and in § 3.2.3 the so-called minimal intersection sets are introduced. These sets are later used, in Section 3.2.4, to define the coarse degrees of freedom by solving local eigenvalue problems (associated with each minimal intersection set). The recursive nature of the algorithm is summarized in § 3.2.5. The spectral way of selecting coarse degrees of freedom naturally leads to the construction of tentative prolongators. In § 3.2.6 the need to improve on the stability of the tentative prolongator is outlined.
Another direction of extending the presented method that we pursue utilizes
more sophisticated multigrid cycles that are based on inner (between the levels)
preconditioned conjugate gradient (CG) iterations. Such an idea goes back to
the nonlinear AMLI methods in Axelsson and Vassilevski [5]. The respective method, referred to as the nested-CG (AMLI) cycle, is described in Section 3.2.7.
Section 3.3 contains a description of a specific implementation of the adaptive AMG method (cf. Brezina et al. [14]) in the present AMGe setting. We
consider also an option which does not utilize the actual element matrices (but
does use the element topology relations).
The V-cycle and nested-CG cycle are illustrated in Sections 3.4 and 3.5, whereas the results utilizing AMG adaptivity are found in Section 3.6.
In Chapter 4 of this dissertation we describe a new software package, Block Locally Optimal Preconditioned Eigenvalue Xolvers (BLOPEX) revision 1.0, publicly released recently. The chapter is based on the papers Knyazev et al. [39], Lashuk et al. [43]. BLOPEX is available as a stand-alone serial library, which can be downloaded from http://math.cudenver.edu/~aknyazev/software/BLOPEX/, as an external package to the Portable, Extensible Toolkit for Scientific Computation (PETSc), Balay et al. [6], and is built into the High Performance Preconditioners (hypre) library, Falgout et al. [19, 20]. Jose Roman has also written a Scalable Library for Eigenvalue Problem Computations (SLEPc), Hernandez et al. [25], interface to our hypre BLOPEX.
The native hypre BLOPEX version efficiently and directly takes advantage of powerful hypre multigrid preconditioners, both algebraic (AMG, called BoomerAMG in hypre) and geometric or structured (SMG). BLOPEX in PETSc gives
the PETSc community easy access to the customizable code of a preconditioned
eigensolver.
At present, BLOPEX includes only one solver: the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method, Knyazev [32, 35], for eigenproblems $Ax = \lambda Bx$ with a large, possibly sparse, symmetric matrix $A$ and a symmetric positive definite matrix $B$. Preconditioned eigenvalue solvers in general, see Knyazev [34, 35], and in particular the LOBPCG method have recently attracted attention as potential competitors to (block) Lanczos methods. LOBPCG has been implemented using different computer languages in a number of other software packages: C++ in Anasazi Trilinos, Arbenz et al. [1], Heroux et al. [26], and in NGSolve, Zaglmayr [62], Sections 7.2-7.4; C in PRIMME, Stathopoulos [52]; FORTRAN 77 in PLTMG, Bank [7]; FORTRAN 90 in ABINIT, Bottin et al. [9], and in PESCAN, Tomov et al. [54]; Python in SciPy by Robert Cimrman and in PYFEMax by Peter Arbenz and Roman Geus. LOBPCG has been independently tested in Yamada et al. [59, 60] for the Fermion-Hubbard model on the Earth Simulator CDIR/MPI; in Arbenz et al. [1] and Borzi and Borzi [8] using AMG preconditioning for vibration problems; and in Tomov et al. [54], Yang et al. [61] for electronic structure calculations. Our hypre BLOPEX version has been used in Bramble et al. [10] for the Maxwell problem.
Section 4.2 contains a complete and detailed description of the LOBPCG algorithm as implemented in BLOPEX. We discuss our specific abstract implementation approach in Section 4.3. Parallel performance on distributed and shared memory systems, including a Beowulf system, SUN Fire 880, an AMD dual-core Opteron workstation, and an IBM BlueGene/L supercomputer, using domain decomposition and multigrid preconditioning, is discussed in Section 4.4.
My personal contribution to the work described in the thesis is as follows. Chapter 2: formulation and proof of the main result (Theorem 2.11) and most of the other statements. The geometric interpretation (Theorem 2.1) and the idea of local optimality (Lemma 2.6) were suggested by Andrew Knyazev. Numerical experiments have been designed and performed jointly. Chapter 3: Matlab implementation of different AMGe methods, all numerical tests, and the idea of using the so-called fully-harmonic interpolation for spectral AMGe, see Section 3.4. Chapter 4: implementation in C of the LOBPCG eigensolver in the hypre and PETSc versions of BLOPEX and numerical testing on SUN, Beowulf and IBM BlueGene/L parallel computers.
2. Variable preconditioning
2.1 Introduction
Preconditioning, a transformation, usually implicit, of the original linear system aiming at accelerating the convergence of the approximations to the solution, is typically a necessary part of an efficient iterative technique. Modern preconditioning, e.g., based on so-called algebraic multilevel and domain decomposition methods, attempts to become as close to the black box ideal of direct solvers as possible. In this attempt, the mathematical structure of the preconditioner, which in the classical case is regarded as some linear transformation, may become very complex; in particular, the linearity can easily be lost, e.g., if the preconditioning itself involves inner iterative solvers. The fact that the preconditioner may be nonlinear, or variable, i.e., changing from iteration to iteration, may drastically affect the known theory as well as the practical behavior of preconditioned iterative methods, and therefore needs special attention. Our main result is that the conjugate gradient (CG) method with variable preconditioning in certain situations may not give any improvement, compared to the steepest descent (SD) method, for solving a linear system with a real symmetric positive definite (SPD) matrix of coefficients. We assume that the preconditioner is SPD on each step, and that the condition number of the preconditioned system matrix is bounded above by a constant.
This chapter is based on the paper Knyazev and Lashuk [40]. An extended
version is published as a technical report Knyazev and Lashuk [41].
Let us now introduce the notation, so that we can formulate the main result mathematically. Let $A$ be a real SPD matrix, $(x, y)$ be the standard inner product of real vectors $x$ and $y$, so that $(Ax, y) = (x, Ay)$, and let $\|x\| = \sqrt{(x, x)}$ be the corresponding vector norm. We also use $\|\cdot\|$ to denote the induced operator norm. The $A$-inner product and the $A$-norm are denoted by $(x, y)_A = (x, Ay)$ and $\|x\|_A = \sqrt{(x, x)_A}$.
We consider a family of iterative methods to obtain a sequence of approximate solutions $x_k$ of a linear system $Ax = b$ and use the $A$-norm to measure the error $e_k = x - x_k$. The SD and CG methods are well-known iterative procedures that fit into our framework. To accelerate the convergence of the error $e_k$ to zero we introduce preconditioning, i.e., on every iteration $k$ an operator $B_k$, called the preconditioner, possibly different for each iteration $k$, is applied to the residual $r_k = b - Ax_k$. A general algorithm, which includes the preconditioned SD or CG (PSD or PCG, respectively) methods as particular cases, can be presented as follows, e.g., Axelsson [4], p. 540, and Axelsson and Vassilevski [2], Algorithm 5.3: given $A$, $b$, $\{B_k\}$, $\{m_k\}$, $x_0$, for $k = 0, 1, \ldots$: $r_k = b - Ax_k$, $s_k = B_k^{-1} r_k$, and
$$p_k = s_k - \sum_{l=k-m_k}^{k-1} \frac{(As_k, p_l)}{(Ap_l, p_l)}\, p_l, \qquad x_{k+1} = x_k + \frac{(r_k, p_k)}{(Ap_k, p_k)}\, p_k, \tag{2.1}$$
where

0 ≤ m_k ≤ k and m_{k+1} ≤ m_k + 1.   (2.2)
The latter condition is highlighted in Notay [45], p. 1447, line 1 and ensures
that the formula for p_k in (2.1) performs the standard Gram-Schmidt
A-orthogonalizations to previous search directions, which are already pairwise
A-orthogonal. The full orthogonalization that performs explicit A-orthogonalizations
to all previous search directions corresponds to m_k = k. Choosing
m_k = min{k, 1} gives the PCG method, e.g., described in Golub and Ye [22],
IPCG Algorithm. The connection of this PCG method to the commonly used
PCG algorithm is discussed in section 2.7 following Golub and Ye [22], Remark
2.3. The shortest recursion m_k = 0 leads to the standard PSD method.
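As an illustration, the family (2.1) with (2.2) can be sketched in a few lines of code. The sketch below is our own minimal numerical illustration, not the thesis's implementation; the callback `precond(k, r)`, standing in for the application of B_k^{-1}, and `m_of_k(k)`, choosing the truncation parameter m_k, are hypothetical names introduced here.

```python
import numpy as np

def variable_pcg(A, b, x0, precond, m_of_k, n_iter):
    """Sketch of method (2.1): truncated Gram-Schmidt A-orthogonalization
    against the previous m_k search directions; precond(k, r) applies B_k^{-1},
    and m_of_k(k) returns m_k subject to (2.2)."""
    x = x0.copy()
    dirs = []                                       # previous search directions p_l
    for k in range(n_iter):
        r = b - A @ x
        s = precond(k, r)
        p = s.copy()
        for pl in dirs[len(dirs) - m_of_k(k):]:     # A-orthogonalize against m_k directions
            p -= (np.dot(A @ s, pl) / np.dot(A @ pl, pl)) * pl
        x = x + (np.dot(r, p) / np.dot(A @ p, p)) * p
        dirs.append(p)
    return x

rng = np.random.default_rng(2)
n = 40
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                         # SPD test matrix
b = rng.standard_normal(n)
B = np.diag(np.diag(A))                             # fixed SPD (Jacobi) preconditioner
precond = lambda k, r: np.linalg.solve(B, r)
x_cg = variable_pcg(A, b, np.zeros(n), precond, lambda k: min(k, 1), 15)  # PCG
x_sd = variable_pcg(A, b, np.zeros(n), precond, lambda k: 0, 15)          # PSD
x_true = np.linalg.solve(A, b)
a_err = lambda x: np.sqrt((x_true - x) @ (A @ (x_true - x)))
assert a_err(x_cg) <= a_err(x_sd) + 1e-12           # PCG no worse than PSD for fixed B
```

With a fixed SPD preconditioner, the choice m_k = min{k, 1} is never worse in the A-norm than m_k = 0, which the final assertion checks numerically.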
It is well known, e.g., D'yakonov [18], p. 34 and Axelsson [4], Sec. 11.1.2,
p. 458, that if the preconditioner is SPD and fixed, B_k = B = B* > 0, a
preconditioned method, such as (2.1), using the preconditioner B can be viewed
as the corresponding unpreconditioned method applied to the preconditioned
system B^{-1}Ax = B^{-1}b in the B-based inner product (x, y)_B = (x, By). This
implies that the theory obtained for unpreconditioned methods remains valid
for preconditioned methods; in particular, the A-orthogonalization terms with
l < k − 1 in the sum in (2.1) vanish in exact arithmetic, e.g., Axelsson [4],
Sec. 11.2.6, Th. 11.5. The situation changes dramatically, however, if different
preconditioners B_k are used in the PCG method.
We study the behavior of method (2.1), where the preconditioner B_k varies
from step to step, but remains SPD on each step, and the spectral condition
number κ(B_k^{-1}A) = λ_max(B_k^{-1}A)/λ_min(B_k^{-1}A) is bounded above by some constant
κ_max independent of the step number k. We note that the matrix B_k^{-1}A is
SPD with respect to, e.g., the B_k-inner product, so its eigenvalues are real and positive.
Let us highlight that our assumption κ(B_k^{-1}A) ≤ κ_max can be equivalently
written as ‖I − B_k^{-1}A‖_{B_k} ≤ γ with κ_max = (1 + γ)/(1 − γ), assuming without
loss of generality that B_k is scaled such that λ_max(B_k^{-1}A) + λ_min(B_k^{-1}A) = 2.
Here, we only deal with methods that are invariant with respect to scaling of
B_k.
The main result of this chapter is that the preconditioned method (2.1)
with (2.2) turns into the PSD method with the worst possible convergence
rate on every iteration, if the preconditioners B_k satisfying our assumption
κ(B_k^{-1}A) ≤ κ_max are chosen in a special way. We explicitly construct a variable
preconditioner that slows down the CG method to the point that the worst linear
convergence rate of the SD method is recovered. Thus one can only guarantee
that the convergence rate for the method (2.1) with (2.2) is just the same as for
the PSD method, m_k = 0, obtained in Kantorovic [29] and reproduced, e.g., in
Kantorovich and Akilov [30], Ch. XV:

‖e_{k+1}‖_A / ‖e_k‖_A ≤ (κ_max − 1)/(κ_max + 1).   (2.3)
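For readers who want to see bound (2.3) in action, the following sketch (our illustration, not part of the original exposition) runs the PSD method with a fixed Jacobi preconditioner, an arbitrary SPD choice, and checks the per-step error reduction ratio in the A-norm against (κ_max − 1)/(κ_max + 1):

```python
import numpy as np

def psd_step(A, B, x, b):
    """One preconditioned steepest descent step, the m_k = 0 case of (2.1)."""
    r = b - A @ x
    s = np.linalg.solve(B, r)                  # s_k = B^{-1} r_k
    alpha = (r @ s) / (s @ (A @ s))            # minimizes the A-norm of the error
    return x + alpha * s

rng = np.random.default_rng(0)
n = 50
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
A = Q @ np.diag(np.linspace(1.0, 100.0, n)) @ Q.T    # SPD system matrix
b = rng.standard_normal(n)
x_true = np.linalg.solve(A, b)
B = np.diag(np.diag(A))                        # Jacobi preconditioner (an arbitrary SPD choice)

lam = np.linalg.eigvals(np.linalg.solve(B, A)).real  # spectrum of B^{-1}A (real, positive)
kappa = lam.max() / lam.min()
bound = (kappa - 1.0) / (kappa + 1.0)          # right-hand side of (2.3)

x = np.zeros(n)
ratios = []
for _ in range(20):
    e_old = x_true - x
    x = psd_step(A, B, x, b)
    e_new = x_true - x
    ratios.append(np.sqrt(e_new @ (A @ e_new) / (e_old @ (A @ e_old))))
assert max(ratios) <= bound + 1e-12            # the Kantorovich bound (2.3) holds
```

In practice the observed ratios are usually well below the bound; the point of section 2.5 is that with specially chosen variable preconditioners the bound is attained exactly.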
Our proof is geometric and is based on the simple fact, proved in section 2.2,
that a nonzero vector multiplied by all SPD matrices with a condition number
bounded by a constant generates a pointed circular cone. We apply this fact on
every iteration to the current residual vector, which becomes the center of the
cone, so all points in the cone correspond to all possible preconditioned residuals.
In a somewhat similar way, Golub and Ye [22] use the angle between the exact
and the perturbed preconditioned residuals. In the CG method context, this
cone has a nontrivial intersection with the subspace A-orthogonal to all previous
search directions. So on each iteration we can choose a preconditioner of the a
priori chosen quality, determined by κ_max, that makes enforcing A-orthogonality
with respect to all previous search directions useless.
Basic properties of method (2.1), most importantly the local optimality, are
derived in section 2.3. In section 2.4 we apply our results from section 2.2 about
the cone to obtain a new proof of estimate (2.3). In section 2.5 we analyze
the convergence of the PCG method with variable preconditioning and prove
our main result. We assume real arithmetic everywhere in this chapter, except
for section 2.6, where we show that our main results also hold for complex
Hermitian positive definite matrices. In section 2.7 we consider two particular
PCG algorithms that are often used in practice and describe their behavior with
variable preconditioning.
Our numerical experiments in section 2.8 comparing the preconditioned SD
and CG methods support and illustrate our theoretical findings, and also reveal
some potentially practically important effects. In subsection 2.8.1, we test the
widely used modification of the CG method with a simplified formula for the
scalar β_k from section 2.7 and demonstrate that variable preconditioning can
make this modification much slower than even the SD method. In subsection
2.8.2, we analyze innerouter iterations as variable preconditioning. Finally, in
subsection 2.8.3, we demonstrate that variable preconditioning may surprisingly
accelerate the SD and the CG compared to the use of fixed preconditioning in
the same methods.
Different aspects of variable preconditioning are considered, e.g., in Axelsson
and Vassilevski [2, 3], Axelsson [4], where rather general nonlinear preconditioning
is introduced, and in Golub and Ye [22], Notay [45], which mainly deal with
the case when the preconditioner on each iteration approximates a fixed operator.
In Axelsson and Vassilevski [3], Axelsson [4], Golub and Ye [22], Notay [45],
convergence estimates for some iterative methods with variable preconditioning
are proved. For recent results and other aspects of variable preconditioning see
Simoncini and Szyld [49, 50, 51] and references there. No attempts are apparently
made in the literature to obtain a result similar to ours, even though it
should appear quite natural and somewhat expected to experts in the area.
2.2 Pointed circular cones represent sets of SPD matrices with varying condition numbers

For a pair of real nonzero vectors x and y we define the angle between x
and y in the usual way as

∠(x, y) = arccos( (x, y) / (‖x‖ ‖y‖) ) ∈ [0, π].
The following theorem is inspired by Neymeyr [44], Lemma 2.3.

Theorem 2.1. The set {Cx}, where x is a fixed nonzero real vector and C runs
through all SPD matrices with condition number κ(C) bounded above by some
κ_max, is a pointed circular cone, specifically,

{Cx : C = C* > 0, κ(C) ≤ κ_max} = { y : sin ∠(x, y) ≤ (κ_max − 1)/(κ_max + 1) }.

Theorem 2.1 can be proved by constructing our cone as the smallest pointed
cone that includes the ball considered in Neymeyr [44], Lemma 2.3. Preparing
for section 2.6 that deals with the complex case, not covered in Neymeyr [44],
we provide a direct proof here based on the following two lemmas. The first
lemma is simple and states that the set in question cannot be larger than the
cone:

Lemma 2.2. Let x be a nonzero real vector, and let C be an SPD matrix with
spectral condition number κ(C). Then sin ∠(x, Cx) ≤ (κ(C) − 1)/(κ(C) + 1).
Proof. Denote y = Cx. We have (x, Cx) = (y, C^{-1}y) > 0 since C is SPD, so
y ≠ 0 and ∠(x, y) < π/2. A positive scaling of C, and thus of y, is obviously
irrelevant, so let us choose y to be the orthogonal projection of x onto the
1-dimensional subspace spanned by the original y. Then from elementary 2D
geometry it follows that ‖y − x‖ = ‖x‖ sin ∠(x, y). The orthogonal projection
of a vector onto a subspace is the best approximation to the vector from the
subspace, thus

‖x‖ sin ∠(x, y) = ‖y − x‖ ≤ ‖sCx − x‖ ≤ ‖sC − I‖ ‖x‖

for any scalar s, where I is the identity. Taking s = 2/(λ_max(C) + λ_min(C)),
where λ_min(C) and λ_max(C) are the minimal and maximal eigenvalues of C,
respectively, we get ‖sC − I‖ = (κ(C) − 1)/(κ(C) + 1).
The second lemma implies that every point in the cone can be represented
as Cx for some SPD matrix C with κ(C) determined by the opening angle of
the cone.

Lemma 2.3. Let x and y be nonzero real vectors, such that ∠(x, y) ∈ [0, π/2).
Then there exists an SPD matrix C, such that Cx = y and

(κ(C) − 1)/(κ(C) + 1) = sin ∠(x, y).

Proof. Denote α = ∠(x, y). A positive scaling of the vector y is irrelevant, so as
in the previous proof we choose y to be the orthogonal projection of x onto the
1-dimensional subspace spanned by the original y; then ‖y − x‖ = (sin α)‖x‖,
so the vectors y − x and (sin α)x are of the same length. This implies that
there exists a Householder reflection H such that H((sin α)x) = y − x, cf.
Neymeyr [44], Lemma 2.3, so (I + (sin α)H)x = y. We define C = I + (sin α)H
to get Cx = y. Any Householder reflection is symmetric and has only two
distinct eigenvalues ±1, so C is also symmetric and has only two distinct positive
eigenvalues 1 ± sin α, as α ∈ [0, π/2), and we conclude that C > 0 and κ(C) =
(1 + sin α)/(1 − sin α).
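The construction in the proof of Lemma 2.3 is easy to verify numerically. The sketch below is ours, with illustrative function names: it rescales y to the projection of x onto span{y}, builds the Householder reflection H, forms C = I + (sin α)H, and checks that C is SPD, maps x onto the direction of y, and has exactly the predicted condition number (1 + sin α)/(1 − sin α).

```python
import numpy as np

def spd_cone_map(x, y):
    """Construct SPD C with Cx proportional to y and
    kappa(C) = (1 + sin a)/(1 - sin a), a = angle(x, y), as in Lemma 2.3."""
    n = len(x)
    y = (np.dot(x, y) / np.dot(y, y)) * y      # rescale y: projection of x onto span{y}
    sin_a = np.linalg.norm(y - x) / np.linalg.norm(x)
    u, v = y - x, sin_a * x                    # equal-length vectors
    if np.linalg.norm(u - v) < 1e-14:
        H = np.eye(n)                          # degenerate case u = v: H = I works
    else:
        w = (u - v) / np.linalg.norm(u - v)
        H = np.eye(n) - 2.0 * np.outer(w, w)   # Householder reflection with H v = u
    return np.eye(n) + sin_a * H, sin_a

rng = np.random.default_rng(1)
x, y = rng.standard_normal(5), rng.standard_normal(5)
if np.dot(x, y) < 0:
    y = -y                                     # make the angle acute, as the lemma requires
C, sin_a = spd_cone_map(x, y)
lam = np.linalg.eigvalsh(C)
assert np.allclose(C, C.T) and lam.min() > 0   # C is SPD
assert np.isclose(lam.max() / lam.min(), (1 + sin_a) / (1 - sin_a), rtol=1e-8)
Cx = C @ x
assert np.allclose(Cx / np.linalg.norm(Cx), y / np.linalg.norm(y))
```

Since C = I + (sin α)H has only the two eigenvalues 1 ± sin α, the computed condition number matches the lemma exactly, up to rounding.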
2.3 Local optimality of the method with variable preconditioning

Here we discuss some basic properties of method (2.1) with (2.2). We derive
a simple, but very useful, error propagation identity in Lemma 2.4. We
prove in Lemma 2.5 that the method is well-defined and has a certain local
A-orthogonality property, formulated without a proof in Notay [45], formulas
(2.1)-(2.2), and in the important particular case m_k = min{k, 1} proved in Golub
and Ye [22], Lemma 2.1. Using the local A-orthogonality property of Lemma
2.5, we prove the local A-optimality property in Lemma 2.6 by generalizing the
result of Golub and Ye [22], Proposition 2.2. Finally, we derive a trivial Corollary
2.7 from Lemma 2.6, which uses the idea from Golub and Ye [22], p. 1309
of comparison with the PSD method, m_k = 0.

The material of this section is inspired by Golub and Ye [22] and may be
known to experts in the field, e.g., some even more general facts can be found
in Axelsson [4], Sec. 12.3.2, Lemma 12.22. We provide straightforward and
complete proofs here suitable for a general audience.
Lemma 2.4. Let A and {B_k} be SPD matrices. Suppose p_k in method (2.1) is
well-defined and nonzero. Then

e_{k+1} = e_k − ((Ae_k, p_k)/(Ap_k, p_k)) p_k.   (2.4)

Proof. Recall that e_k = A^{-1}b − x_k and thus r_k = Ae_k. Then (2.4) follows
immediately from the last formula in (2.1).
Lemma 2.5. Let A and {B_k} be SPD matrices and let {m_k} satisfy (2.2). Then
the error, the preconditioned residual, and the direction vectors generated by
method (2.1) before the exact solution is obtained are well-defined and satisfy

(p_i, p_j)_A = 0, i ≠ j, k − m_k ≤ i, j ≤ k,   (2.5)
(e_{k+1}, s_k)_A = (e_{k+1}, p_i)_A = 0, k − m_k ≤ i ≤ k.   (2.6)

Proof. We first notice that (2.4) for any k obviously implies

(e_{k+1}, p_k)_A = 0.   (2.7)

For the rest of the proof we use induction in k. Let us take k = 0 and suppose
x_0 ≠ x; then r_0 ≠ 0, and so s_0 ≠ 0 since B_0 is SPD. By (2.2), m_0 = 0 and thus
p_0 = s_0 ≠ 0, so in the formula for x_{k+1} we do not divide by zero, i.e., x_1 is
well defined. There is nothing to prove in (2.5) for k = 0 since m_0 = 0. Formula
(2.7) implies (e_1, p_0)_A = (e_1, s_0)_A = 0, i.e., (2.6) holds for k = 0. This provides
the basis for the induction.
Suppose the statement of the lemma holds for k − 1, which is the induction
hypothesis, i.e., up to the index k − 1 all quantities are well defined and

(p_i, p_j)_A = 0, i ≠ j, k − 1 − m_{k−1} ≤ i, j ≤ k − 1,   (2.8)
(e_k, s_{k−1})_A = (e_k, p_i)_A = 0, k − 1 − m_{k−1} ≤ i ≤ k − 1.   (2.9)

We now show by contradiction that x_k ≠ x implies p_k ≠ 0. Indeed, if p_k = 0
then s_k is a linear combination of p_{k−m_k}, …, p_{k−1}. However, since m_k ≤ m_{k−1} + 1,
it follows from (2.9) that

(e_k, p_i)_A = 0, k − m_k ≤ i ≤ k − 1.   (2.10)

Then we have (s_k, e_k)_A = 0. At the same time, since the matrix B_k^{-1}A is A-SPD,
s_k = B_k^{-1}Ae_k cannot be A-orthogonal to e_k unless s_k = e_k = 0, i.e., x_k = x.
Next, we prove (2.5) by showing that the formula for p_k in (2.1) is a valid
step of the Gram-Schmidt orthogonalization process with respect to the A-based
inner product. If m_k = 0, there is nothing to prove. If m_k = 1 then (2.5) gets
reduced to (p_k, p_{k−1})_A = 0, which follows from the formula for p_k in (2.1). If
m_k ≥ 2 then condition (2.2) implies that the vectors p_{k−m_k}, …, p_{k−1} are among
the vectors p_{k−1−m_{k−1}}, …, p_{k−1} and therefore are already A-orthogonal by the
induction assumption (2.8). Then the formula for p_k in (2.1) is indeed a valid
step of the Gram-Schmidt orthogonalization process with respect to the A-based
inner product, so (2.5) holds.

It remains to prove (2.6). We have already established (2.5), (2.7), and
(2.10). Equalities (2.5) and (2.10) imply that p_k and e_k are A-orthogonal to
p_{k−m_k}, …, p_{k−1}. Equality (2.4) implies that e_{k+1} is a linear combination of
e_k and p_k. Thus, we have (e_{k+1}, p_i)_A = 0, k − m_k ≤ i ≤ k − 1. Finally, it
is enough to notice that s_k is a linear combination of p_k, p_{k−1}, …, p_{k−m_k}, so
(e_{k+1}, s_k)_A = 0.
We now use Lemma 2.5 to prove the local optimality of method (2.1) with
(2.2), which generalizes the statement of Golub and Ye [22], Proposition 2.2.

Lemma 2.6. Under the assumptions of Lemma 2.5,

‖e_{k+1}‖_A = min_{p ∈ span{s_k, p_{k−m_k}, …, p_{k−1}}} ‖e_k − p‖_A.

Proof. We get e_{k+1} ∈ e_k + span{s_k, p_{k−m_k}, …, p_{k−1}} from the formula for p_k in
(2.1) and (2.4). Putting this together with the A-orthogonality relations (2.6) of the
vector e_{k+1} with all vectors that span the subspace finishes the proof.
Two important corollaries follow immediately from Lemma 2.6 by analogy
with Golub and Ye [22], Proposition 2.2.

Corollary 2.7. The A-norm of the error ‖e_{k+1}‖_A in method (2.1) with (2.2)
is bounded above by the A-norm of the error of one step of the PSD method,
m_k = 0, using the same x_k as the initial guess and B_k as the preconditioner,
i.e., specifically, ‖e_{k+1}‖_A ≤ min_α ‖e_k − α s_k‖_A.

Corollary 2.8. Let m_k > 0; then the A-norm of the error ‖e_{k+1}‖_A in method
(2.1) with (2.2) for k > 0 satisfies ‖e_{k+1}‖_A ≤ min_{α,β} ‖e_k − α s_k − β(e_k − e_{k−1})‖_A.

Proof. Under the lemma assumptions, the formula for p_k in (2.1) and (2.4)
imply that e_{k+1} ∈ e_k + span{s_k, p_{k−1}} = e_k + span{s_k, e_k − e_{k−1}}, and the
A-orthogonality relations (2.6) turn into (e_{k+1}, s_k)_A = 0 and (e_{k+1}, p_{k−1})_A =
(e_{k+1}, e_k − e_{k−1})_A = 0, so the vector e_{k+1} is A-orthogonal to both vectors that
span the subspace. As in the proof of Lemma 2.6, the local A-orthogonality
implies the local A-optimality.
Corollary 2.7 allows us in section 2.4 to estimate the convergence rate of
method (2.1) with (2.2) by comparison with the PSD method, m_k = 0; this
idea is borrowed from Golub and Ye [22], p. 1309. The results of Lemma 2.6
and Corollary 2.8 seem to indicate that an improved convergence rate bound
of method (2.1) with (2.2) can be obtained, compared to the PSD method
convergence rate bound that follows from Corollary 2.7. Our original intent
has been to combine Corollary 2.8 with convergence rate bounds of the heavy
ball method, in order to attempt to prove such an improved convergence rate
bound. However, our results of section 2.5 demonstrate that this improvement is
impossible under our only assumption κ(B_k^{-1}A) ≤ κ_max, since one can construct
such preconditioners B_k that, e.g., the minimizing value of β in Corollary 2.8 is
zero, so Corollary 2.8 gives no improvement compared to Corollary 2.7.
2.4 Convergence rate bounds for variable preconditioning

The classical Kantorovich and Akilov [30], Ch. XV convergence rate bound
(2.3) for the PSD method is local in the sense that it relates the A-norm
of the error on two subsequent iterations and does not depend on previous
iterations. Thus, it remains valid when the preconditioner B_k changes from
iteration to iteration, while the condition number κ(B_k^{-1}A) is bounded above
by some constant κ_max independent of k. The goal of this section is to give an
apparently new simple proof of the estimate (2.3) for the PSD method, based
on our cone Theorem 2.1, and to extend this statement to cover the general
method (2.1) with (2.2), using Corollary 2.7.

We denote the angle between two real nonzero vectors with respect to the
A-based inner product by

∠_A(x, y) = arccos( (x, y)_A / (‖x‖_A ‖y‖_A) ),

and express the error reduction ratio for the PSD method in terms of the angle
with respect to the A-based inner product:
Lemma 2.9. On every step of the PSD algorithm, (2.1) with m_k = 0, the error
reduction factor takes the form ‖e_{k+1}‖_A / ‖e_k‖_A = sin(∠_A(e_k, B_k^{-1}Ae_k)).

Proof. By (2.6), we have (e_{k+1}, p_k)_A = 0. Now, for m_k = 0, in addition,
p_k = s_k, so 0 = (e_{k+1}, p_k)_A = (e_{k+1}, s_k)_A = (e_{k+1}, x_{k+1} − x_k)_A, i.e., the triangle
with vertices x, x_k, x_{k+1} is right-angled in the A-inner product, where the hypotenuse
is e_k = x − x_k. Therefore, ‖e_{k+1}‖_A / ‖e_k‖_A = sin(∠_A(e_k, x_{k+1} − x_k)) =
sin(∠_A(e_k, s_k)), where s_k = B_k^{-1}(b − Ax_k) = B_k^{-1}Ae_k by (2.1).

Let us highlight that Lemma 2.9 provides an exact expression for the error
reduction factor, not just a bound; we need this in the proof of Theorem 2.11
in the next section. Combining the results of Lemmas 2.2 and 2.9
immediately leads to (2.3) for the PSD method, where m_k = 0. Finally, taking
into account Corollary 2.7, by analogy with the arguments of Golub and Ye [22],
p. 1309 and decrypting a hidden statement in Golub and Ye [22], Lemma 3.5,
we get
Theorem 2.10. Convergence rate bound (2.3) holds for method (2.1) with (2.2).
2.5 The convergence rate bound is sharp

Here we formulate and prove the main result of this chapter: if one assumes
κ(B_k^{-1}A) ≤ κ_max, then the worst convergence rate for method (2.1) with (2.2) is
given by (2.3).

Let us remind the reader that (2.3) also describes the convergence rate for
the PSD method, (2.1) with m_k = 0. We now show that adding more vectors
to the PSD iterative recurrence results in no improvement in convergence, if a
specially constructed set of variable preconditioners is used.
Theorem 2.11. Let an SPD matrix A, vectors b and x_0, and κ_max > 1 be given.
Assuming that the matrix size is larger than the number of iterations, one can
choose a sequence of SPD preconditioners B_k, satisfying κ(B_k^{-1}A) ≤ κ_max, such
that method (2.1) with (2.2) turns into the PSD method, (2.1) with m_k = 0, and
on every iteration

‖e_{k+1}‖_A / ‖e_k‖_A = (κ_max − 1)/(κ_max + 1).   (2.11)

Proof. We construct the sequence B_k by induction. First, we choose any vector
q_0, such that sin ∠_A(q_0, e_0) = (κ_max − 1)/(κ_max + 1). According to Lemma 2.3
applied in the A-inner product, there exists an A-SPD matrix C_0 with condition
number κ(C_0) = κ_max, such that C_0 e_0 = q_0. We define the SPD B_0 = A C_0^{-1};
then κ(B_0^{-1}A) = κ(C_0) = κ_max. We have s_k = B_k^{-1}Ae_k, so such a choice of B_0
implies s_0 = q_0. Also, we have p_0 = s_0, i.e., the first step is always a PSD step;
thus, by Lemma 2.9 we have proved (2.11) for k = 0. Note that (e_1, p_0)_A = 0
by (2.6).
Second, we make the induction assumption: let preconditioners B_l for l ≤
k − 1 be constructed, such that ‖e_{l+1}‖_A / ‖e_l‖_A = (κ_max − 1)/(κ_max + 1) and
(e_k, p_l)_A = 0 hold for all l ≤ k − 1. The dimension of the space is greater than
the total number of iterations by our assumption, so there exists a vector u_k,
such that (u_k, p_l)_A = 0 for l ≤ k − 1 and u_k and e_k are linearly independent.
Then the 2D subspace spanned by u_k and e_k is A-orthogonal to p_l for l ≤ k − 1.

Let us consider the boundary of the pointed circular cone made of vectors
q_k satisfying the condition sin ∠_A(q_k, e_k) = (κ_max − 1)/(κ_max + 1). This conical
surface has a nontrivial intersection with the 2D subspace spanned by u_k and
e_k, since e_k is the cone axis. Let us choose the vector q_k in the intersection. This
vector will obviously be A-orthogonal to p_l, l ≤ k − 1.

Applying the same reasoning as for constructing B_0, we deduce that there
exists an SPD B_k such that κ(B_k^{-1}A) ≤ κ_max and B_k^{-1}Ae_k = q_k. With such a
choice of B_k we have s_k = q_k. Since q_k = s_k is A-orthogonal to p_l for all l ≤ k − 1,
it turns out that p_k = s_k, no matter how {m_k} are chosen. This means that
x_{k+1} is obtained from x_k by a steepest descent step. Then we apply Lemma
2.9 and conclude that (2.11) holds. We note that (e_{k+1}, p_l)_A = 0 for all l ≤ k.
Indeed, (e_{k+1}, p_l)_A = 0 for all l ≤ k − 1 since e_{k+1} is a linear combination of e_k
and p_k = s_k = q_k, both A-orthogonal to p_l for l ≤ k − 1. Finally, (e_{k+1}, p_k)_A = 0
by (2.6). This completes the construction of {B_k} by induction and thus the
proof.
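The adversarial choice in the proof above can be imitated numerically without forming B_k explicitly. Following the proof, the sketch below (our illustration, with κ_max = 2) picks on each step a preconditioned residual s_k on the boundary of the cone around e_k and A-orthogonal to all previous directions; Lemma 2.3, applied in the A-inner product, guarantees that some admissible SPD B_k realizes this choice. Even full A-orthogonalization (m_k = k) then reproduces the worst-case PSD ratio (2.11) on every step:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
A = np.diag(np.arange(1.0, n + 1))             # SPD model matrix
x_true = rng.standard_normal(n)
b = A @ x_true
kappa_max = 2.0
sin_a = (kappa_max - 1.0) / (kappa_max + 1.0)  # opening angle of the cone
cos_a = np.sqrt(1.0 - sin_a ** 2)

aip = lambda u, v: u @ (A @ v)                 # A-inner product
anorm = lambda u: np.sqrt(aip(u, u))

x, dirs = np.zeros(n), []
for k in range(12):
    e = x_true - x
    u = rng.standard_normal(n)                 # u_k: A-orthogonalize against e_k and all p_l
    for q in dirs + [e]:
        u -= (aip(u, q) / aip(q, q)) * q
    # s_k = B_k^{-1} r_k chosen on the cone boundary around e_k; Lemma 2.3 in the
    # A-inner product guarantees an admissible SPD B_k realizing this choice
    s = cos_a * e / anorm(e) + sin_a * u / anorm(u)
    p = s.copy()
    for pl in dirs:                            # full A-orthogonalization, m_k = k
        p -= (aip(s, pl) / aip(pl, pl)) * pl
    x = x + (aip(e, p) / aip(p, p)) * p        # step of (2.1); (r_k, p_k) = (e_k, p_k)_A
    dirs.append(p)
    ratio = anorm(x_true - x) / anorm(e)
    assert abs(ratio - sin_a) < 1e-8           # the worst-case PSD rate (2.11), every step
```

Because s_k is already A-orthogonal to all previous directions, the full orthogonalization changes nothing, and each step is exactly a steepest descent step with the worst possible error reduction factor.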
Let us highlight that the statement of Theorem 2.11 consists of two parts:
first, it is possible that the PCG method with variable preconditioning
converges no faster than the PSD method with the same preconditioning;
and second, moreover, it is possible that the PCG method with variable preconditioning
converges no faster than the worst possible theoretical convergence
rate for the PSD method described by (2.3). Numerical tests in section
2.8 show that the former possibility is more likely than the latter. Specifically,
we demonstrate numerically in subsection 2.8.3 that the PCG and PSD methods
with random preconditioning converge with the same speed, but both are much
faster than what bound (2.3) predicts.
2.6 Complex Hermitian case

In all other sections of this chapter we assume for simplicity that matrices
and vectors are real. However, our main results also hold when the matrices A
and {B_k} are complex Hermitian positive definite. In this section we discuss
the necessary modifications to statements and proofs in sections 2.2, 2.4 and 2.5 in
order to cover the complex Hermitian case.

In section 2.2, the first thing to be changed is the definition of the angle
between two nonzero vectors x, y ∈ C^n, where an absolute value is now taken,

∠(x, y) = arccos( |(x, y)| / (‖x‖ ‖y‖) ),

which makes the angle acute and invariant with respect to complex nonzero scaling
of the vectors. Lemma 2.2 remains valid in the complex case:
of the vectors. Lemma 2.2 remains valid in the complex case:
Lemma 2.12. Let x be a nonzero complex vector, and C be a complex Her
mitian positive definite matrix with the spectral condition number k(C), then
sin Z (x, Cx) < (k (C) l)/( (C) + 1).
Proof. Denote y = Cx and let 7 = (y, x) / \\y\\2 then 7y is the projection
of x onto span{y}, and Z(x,y) = Z(z, 7y). Moreover, (x,7y) = {x,y)7 =
(x,y)(y,x)/ \\y\\2 is realwe need this fact later in the proof of Lemma 2.13.
We redefine y to 7y. The rest of proof is exactly the same as that of Lemma
2.2, since the identity y x = x sin Z (x, y), where y is scaled by a complex
scalar to be the orthogonal projection of x onto span{j/}, holds in the complex
case with the new definition of the angle.
Lemma 2.3 and, thus, Theorem 2.1 do not hold in the complex case after
the straightforward reformulation. A trivial counterexample is a pair of vectors
x ≠ 0 and y = ix: the angle between x and y is obviously zero, yet it is
impossible that y = Cx for any complex Hermitian matrix C, since the inner
product (x, y) = i‖x‖² is not a real number. This counterexample also gives
an idea for a simple fix:

Lemma 2.13. Let x and y be nonzero complex vectors, such that ∠(x, y) ≠
π/2. Then there exists a complex Hermitian positive definite matrix C and a
complex scalar γ, such that Cx = γy and (κ(C) − 1)/(κ(C) + 1) = sin ∠(x, y).

Proof. We first scale the complex vector y as in the proof of Lemma 2.12 to
make y the projection of x onto span{y}. The rest of the proof is similar
to that of Lemma 2.3, but we have to be careful working with the Householder
reflection in the complex case, so we provide the complete proof.

The redefined y is the projection of x onto span{y}; thus, ‖y − x‖ =
(sin α)‖x‖, where α = ∠(x, y), so the vectors u = y − x and v = (sin α)x are of the same length.
Moreover, their inner product (u, v) is real, since (x, y) is real, see the proof of
Lemma 2.12. This implies that the Householder reflection Hz = z − 2(w, z)w,
where w = (u − v)/‖u − v‖, acts on z = v such that Hv = u, i.e.,
H((sin α)x) = y − x, so (I + (sin α)H)x = y. We define C = I + (sin α)H to
get Cx = y.

The Householder reflection H is Hermitian and has only two distinct eigenvalues
±1, so C is also Hermitian and has only two distinct positive eigenvalues
1 ± sin α, as α ∈ [0, π/2), and we conclude that C > 0 and κ(C) =
(1 + sin α)/(1 − sin α).
The same change then makes Theorem 2.1 work in the complex case:

Theorem 2.14. The set {γCx}, where x is a fixed nonzero complex vector,
γ runs through all nonzero complex scalars, and C runs through all complex
Hermitian positive definite matrices with condition number κ(C) bounded above
by some κ_max, is a pointed circular cone, specifically,

{γCx : γ ≠ 0, C = C* > 0, κ(C) ≤ κ_max} = { y : sin ∠(x, y) ≤ (κ_max − 1)/(κ_max + 1) }.

Section 2.3 requires no changes other than replacing "SPD" with "Hermitian
positive definite." In section 2.4 we just change the definition of the A-angle to

∠_A(x, y) = arccos( |(x, y)_A| / (‖x‖_A ‖y‖_A) ),

and then Lemma 2.9 holds without any further changes.
Finally, the statement of Theorem 2.11 from section 2.5 allows for a straightforward
generalization:

Theorem 2.15. Let a Hermitian positive definite matrix A, complex vectors b
and x_0, and κ_max > 1 be given. Assuming that the matrix size is larger than the
number of iterations, one can choose a sequence of Hermitian positive definite
preconditioners B_k, satisfying κ(B_k^{-1}A) ≤ κ_max, such that method (2.1) with
(2.2) turns into the PSD method, (2.1) with m_k = 0, and on every iteration

‖e_{k+1}‖_A / ‖e_k‖_A = (κ_max − 1)/(κ_max + 1).   (2.12)

Proof. Only a small change in the proof of Theorem 2.11 is needed. We first
choose any vector q'_0, satisfying sin ∠_A(q'_0, e_0) = (κ_max − 1)/(κ_max + 1). Then by
Lemma 2.13 we obtain the complex Hermitian positive definite matrix C_0 and
the complex scalar γ such that C_0 e_0 = γ q'_0, and we continue as in the proof of
Theorem 2.11. The same modification is made in the choice of the vectors q_k
for k ≥ 1 later in the proof.
2.7 Practical PCG algorithms

In this section we briefly discuss two particular well-known PCG algorithms
that are often used in practice. Our discussion here is motivated by and follows
Golub and Ye [22], Remark 2.3. Suppose A, b, x_0, r_0 = b − Ax_0, and {B_k} for
k = 0, 1, ... are given, and consider Algorithm 2.7.1, where β_k on line 6 is defined
either by the expression

β_k = (s_k, r_k) / (s_{k−1}, r_{k−1}),   (2.13)

or by the expression

β_k = (s_k, r_k − r_{k−1}) / (s_{k−1}, r_{k−1}).   (2.14)
Formula (2.13) is more often used in practice compared to (2.14), since it can be
implemented in such a way that does not require storing the extra vector r_{k−1}.
If the preconditioner is SPD and fixed, it is well known, e.g., Golub and Ye
[22], Remark 2.3, that (s_k, r_{k−1}) = 0, so formula (2.14) coincides with (2.13)
and Algorithm 2.7.1 is described by (2.1) with m_k = min(k, 1). Of course,
in this case the choice m_k = min(k, 1) is enough to keep all search directions
A-orthogonal in exact arithmetic.

Things become different when variable preconditioning is used. While we
can derive formula (2.14) from the framework (2.1)-(2.2), there is no reason for s_k
to be orthogonal to r_{k−1}, and so there is no justification for using (2.13). Indeed,
it is well known, e.g., Golub and Ye [22], Remark 2.3 and Notay [45], Table 2,
that using formula (2.13) for β_k can significantly slow down the convergence,
and we provide our own numerical evidence of that in section 2.8.
Algorithm 2.7.1
1: for k = 0, 1, ... do
2:   s_k = B_k^{-1} r_k
3:   if k = 0 then
4:     p_0 = s_0
5:   else
6:     p_k = s_k + β_k p_{k−1} (where β_k is defined by either (2.14) or (2.13) for all iterations)
7:   end if
8:   α_k = (s_k, r_k) / (p_k, Ap_k)
9:   x_{k+1} = x_k + α_k p_k
10:  r_{k+1} = r_k − α_k Ap_k
11: end for
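Algorithm 2.7.1 can be sketched directly in code. The sketch below is ours, not the thesis's implementation; `precond` is a hypothetical callback applying B_k^{-1}, and the `beta_formula` argument selects between (2.13) and (2.14).

```python
import numpy as np

def pcg_271(A, b, x0, precond, beta_formula, n_iter):
    """Sketch of Algorithm 2.7.1; beta_formula selects (2.13) or (2.14)."""
    x = x0.copy()
    r = b - A @ x0
    s_prev = r_prev = p = None
    for k in range(n_iter):
        s = precond(k, r)
        if k == 0:
            p = s.copy()
        else:
            if beta_formula == "2.13":
                beta = np.dot(s, r) / np.dot(s_prev, r_prev)
            else:                                   # formula (2.14)
                beta = np.dot(s, r - r_prev) / np.dot(s_prev, r_prev)
            p = s + beta * p
        alpha = np.dot(s, r) / np.dot(p, A @ p)     # line 8 of Algorithm 2.7.1
        x = x + alpha * p
        r_prev, s_prev = r, s
        r = r - alpha * (A @ p)
    return x

rng = np.random.default_rng(4)
n = 30
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                         # SPD test matrix
b = rng.standard_normal(n)
B = np.diag(np.diag(A))                             # fixed SPD preconditioner
pre = lambda k, r: np.linalg.solve(B, r)
x13 = pcg_271(A, b, np.zeros(n), pre, "2.13", 20)
x14 = pcg_271(A, b, np.zeros(n), pre, "2.14", 20)
x_true = np.linalg.solve(A, b)
# with a fixed preconditioner, (2.13) and (2.14) coincide in exact arithmetic
assert np.linalg.norm(x13 - x14) < 1e-6 * np.linalg.norm(x_true)
assert np.linalg.norm(x14 - x_true) < 1e-6 * np.linalg.norm(x_true)
```

With a fixed SPD preconditioner the two β_k formulas produce (numerically) the same iterates, as the assertions check; the difference between them only appears under variable preconditioning, as demonstrated in section 2.8.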
2.8 Numerical experiments

We first illustrate the main theoretical results of this chapter numerically
for a model problem. We numerically investigate the influence of the choice
of β_k between formulas (2.13) and (2.14) in Algorithm 2.7.1 and observe that
(2.14) leads to the theoretically predicted convergence rate, while (2.13) may
significantly slow down the convergence. Second, we test the convergence of
inner-outer iteration schemes, where the inner iterations play the role of the
variable preconditioning in the outer PCG iteration, and we illustrate our main
conclusion that variable preconditioning may effectively reduce the convergence
speed of the PCG method to the speed of the PSD method. Third, and last, we
test the PSD and PCG methods with preconditioners of the same quality chosen
randomly. We observe a surprising acceleration of the PCG method compared
to the use of only one fixed preconditioner; at the same time, we show that the
PSD method with random preconditioners works as well as the PCG method,
which explains the PCG acceleration and again supports our main conclusion.
2.8.1 Numerical illustration of the main results

Here, we use the standard 3-point approximation of the 1D Laplacian of
size 200 as the matrix A of the system. To simulate the application of
the variable preconditioner, we essentially repeat the steps described in the
proof of Theorem 2.11, i.e., we fix the condition number κ(B_k^{-1}A) = 2 and on
each iteration we generate a pseudo-random vector s_k, which is A-orthogonal to
the previous search directions and such that the A-angle between s_k and e_k satisfies
sin(∠_A(s_k, e_k)) = (κ − 1)/(κ + 1).

Figure 2.1: Algorithm 2.7.1 with (2.13) fails to provide the PSD convergence rate.
We summarize the numerical results of this subsection in Figure 2.1, where
the horizontal axis represents the number of iterations and the vertical axis
represents the A-norm of the error. The iteration count actually starts from 1, so
the A-norm of the error on the 0th iteration, ‖e_0‖_A, is just the A-norm of the
initial error. The straight dotted (red in the colored print) line marked with
squares in Figure 2.1 represents the PSD theoretical bound (2.3), and at the
same time it perfectly coincides, which illustrates the statements of Theorem
2.11, with the change of the A-norm of the error in the case where the complete
A-orthogonalization is performed, i.e., m_k = k in method (2.1), as well as in the
case where Algorithm 2.7.1 with β_k defined by (2.14) is used. The curved solid
(blue) line marked with diamonds represents the convergence of Algorithm 2.7.1
with β_k defined by (2.13), which visibly performs much worse in this test compared
to Algorithm 2.7.1 with (2.14). The paper Notay [45], Sec. 5.2 contains
Figure 2.2: The PSD and PCG methods with preconditioning by inner CG
with different stopping criteria η = 0.2, 0.4, 0.6 and 0.8 (from the bottom to the
top).
analogous results comparing the change in the convergence rate using formulas
(2.13) and (2.14), but it misses a comparison with the PSD method. To check
our results of section 2.6, we repeat the tests in complex arithmetic. The
figure generated is similar to Figure 2.1, so we do not reproduce it here.
2.8.2 Inner-outer iterations as variable preconditioning

Inner-outer iterative schemes, where the inner iterations play the role of the
variable preconditioner in the outer PCG iteration, are a traditional example of
variable preconditioning; see, e.g., Golub and Ye [22], Notay [45]. Previously
published tests analyze an approximation of some fixed preconditioner, B_k ≈ B,
different from A, by inner iterations, typically using the PCG method. The
quality of the approximation is determined by the stopping criteria of the inner
PCG method. A typical conclusion is that the performance of the outer PCG
method improves, and it starts behaving like the PCG method with the fixed
preconditioner B, when B_k approximates B more accurately by performing more
inner iterations.
The idea of our tests in this subsection is different: we approximate B_k ≈
B = A. The specific setup is the following. We take a diagonal matrix A with
all integer entries from 1 to 2000, with a zero right-hand side and a random
normally distributed zero-mean initial guess, the same for the PSD
and PCG methods. For preconditioning on the kth step, applied to the residual
r_k, we run the standard CG method without preconditioning as inner iterations,
using the zero initial approximation, and for the stopping criteria we compute
the norm of the true residual at every inner iteration and iterate until it gets
smaller than η‖r_k‖ for a given constant η. In Figure 2.2, we demonstrate the
performance of the PSD and PCG methods for four values of η = 0.2, 0.4, 0.6
and 0.8 (from the bottom to the top). We observe that the PSD method, displayed
using dashed (red in the colored print) lines marked with circles, and the PCG
method, shown as dash-dot (blue) lines with x-marks, both converge with a similar rate
for each tested value of η. We notice here that the PSD method is even a bit
faster than the PCG method. This does not contradict our Corollary 2.7, since
the preconditioners B_k here are evidently different in the PSD and PCG methods
even though they are constructed using the same principle.
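The inner-outer scheme of this subsection can be sketched as follows; this is our simplified illustration (a smaller matrix and a single value of η, not the exact experiment above): unpreconditioned CG on As = r, stopped at a relative residual η, plays the role of applying B_k^{-1} inside an outer PSD loop.

```python
import numpy as np

def inner_cg(A, r, eta):
    """Unpreconditioned CG on A s = r, zero initial guess, stopped once the
    residual norm drops below eta * ||r||; this plays the role of s = B_k^{-1} r."""
    s = np.zeros_like(r)
    res = r.copy()
    p = res.copy()
    while np.linalg.norm(res) > eta * np.linalg.norm(r):
        Ap = A @ p
        alpha = np.dot(res, res) / np.dot(p, Ap)
        s = s + alpha * p
        res_new = res - alpha * Ap
        beta = np.dot(res_new, res_new) / np.dot(res, res)
        res = res_new
        p = res + beta * p
    return s

n = 200
A = np.diag(np.arange(1.0, n + 1))                  # diagonal SPD test matrix
rng = np.random.default_rng(5)
x_true = rng.standard_normal(n)
b = A @ x_true
eta = 0.2
x = np.zeros(n)
for k in range(25):                                 # outer PSD iterations
    r = b - A @ x
    s = inner_cg(A, r, eta)                         # variable preconditioning B_k^{-1} r_k
    x = x + (np.dot(r, s) / np.dot(s, A @ s)) * s   # locally optimal PSD step
e = x_true - x
rel_err = np.sqrt(e @ (A @ e) / (x_true @ (A @ x_true)))
assert rel_err < 1e-2                               # the outer iteration converges
```

Since each inner solve is stopped early, the effective preconditioner changes from one outer step to the next, which is exactly the variable-preconditioning setting studied in this chapter.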
2.8.3 Random vs. fixed preconditioning
In this subsection, we numerically investigate a situation where random
preconditioners of a similar quality are used in the course of iterations. The
system matrix is the standard 3-point finite-difference approximation of the
one-dimensional Laplacian using 3000 uniform mesh points and the Dirichlet
boundary conditions. We test the simplest multigrid preconditioning using two
grids, where the number of coarse grid points is 600. The interpolation is linear,
Figure 2.3: Two-grid preconditioning with fixed (left) and random (right)
coarse grids.
the restriction is the transpose of the interpolation, and the coarse-grid operator
is defined by the Galerkin condition. The smoother is the Richardson iteration.
On Figure 2.3 left, we choose (pseudo)randomly 600 coarse mesh points once
and build the fixed two-grid preconditioner based on this choice. On Figure 2.3
right, we choose 600 new random coarse mesh points and rebuild the two-grid
preconditioner on each iteration. In some applications, information about a
good position of the coarse grid points is not available, so the random choice of
the coarse grids may be an interesting alternative to traditional approaches.
Figure 2.3 displays the convergence history for the PSD (top), PCG (middle),
and PCG with the full orthogonalization (bottom) with the same random
initial guess, using the fixed (left) and variable (right) two-grid preconditioners.
On Figure 2.3 left, for a fixed preconditioner, we observe the expected convergence
behavior, with the PSD being noticeably the slowest and the PCG with
the full orthogonalization being slightly faster than the standard PCG. Figure
2.3 right demonstrates that all three methods with the variable random
preconditioner converge with essentially the same rate, which again illustrates the
main result of this chapter that the PCG method with variable preconditioning
may just converge with the same speed as the PSD method.
Figure 2.3 reveals a surprising fact: the methods with random preconditioning
converge twice as fast as the methods with fixed preconditioning! We
highlight that Figure 2.3 shows a typical case, not a random outlier, as we
confirm by repeating the fixed-preconditioner test in the left panel for every
random preconditioner used in the right panel of Figure 2.3, and by running
the tests multiple times with different seeds. Our informal explanation for the
fast convergence of the PSD method with random preconditioning is based on
Lemma 2.9, which provides the exact expression for the error reduction factor as
sin(∠(e_k, B_k^{-1} A e_k)). It takes its largest value only if e_k is one of specific linear
combinations of the eigenvectors of B_k^{-1} A corresponding to the two extreme
eigenvalues. If B_k is fixed, the error e_k in the PSD method after the first several
iterations approaches these magic linear combinations, see, e.g., Forsythe [21], and
the convergence rate reaches its upper bound. If B_k changes randomly, as in our
test, the average effective angle is smaller, i.e., the convergence is faster.
We finally note that this experiment should not be viewed as a practical
multilevel approach. Standard uniform coarsening would, of course, have given
better convergence rates than the ones we have observed. Still, we believe this
experiment suggests that introducing randomization might be practical in some
numerical algorithms.
2.9 Conclusions
We use geometric arguments to investigate the behavior of the PCG methods
with variable preconditioning under a rather weak assumption that the quality
of the preconditioner is fixed. Our main result is negative in its nature: we show
that, under this assumption, the PCG method with variable preconditioning may
converge as slowly as the PSD method, moreover, as the PSD method with the
slowest rate guaranteed by the classical convergence rate bound. In particular,
this gives a negative answer, under our assumptions, to the question asked
in Golub and Ye [22], Sec. 6, of whether better bounds for the steepest descent
reduction factor may exist for Algorithm 2.7.1 with (2.14).
Stronger assumptions on variable preconditioning, e.g., those made in
Golub and Ye [22], Notay [45], that the variable preconditioners are all small
perturbations of some fixed preconditioner, are necessary in order to hope to
prove a convergence rate bound for the PCG method with variable preconditioning
resembling the standard convergence rate bound of the PCG method
with fixed preconditioning. Such stronger assumptions hold in many presently
known real-life applications of the PCG methods with variable preconditioning,
but often require extra computational work.
3. Algebraic multigrid via element agglomeration
3.1 Introduction
Multigrid (MG) (e.g., Trottenberg et al. [55]) is one of the most efficient
and natural methods for solving linear systems of equations coming from partial
differential equations (PDEs) discretized on a sequence of grids. In algebraic
multigrid (AMG), Brandt et al. [11, 12], Ruge and Stüben [48], Stüben [53], the
necessary MG components (coarse grids, coarse-grid operators, interpolation
operators) are built by the solver algorithm, as opposed to geometric multigrid,
where these components are naturally given by the discretization. An extreme
case of an algebraic multigrid approach would lead to a black-box solver, i.e.,
an algorithm which would only use the linear system matrix as input data. In
practice, all AMG methods utilize some (often implicitly assumed) additional
information about the class of problems they are applied to. In this chapter, based on
the paper Lashuk and Vassilevski [42], we deal with the class of problems that
come from finite element discretization of elliptic PDEs. More specifically, we
focus on variants of the element-based AMG (or AMGe) methods developed in
Brezina et al. [13], Chartier et al. [15, 16], Henson and Vassilevski [24], Jones
and Vassilevski [28].
We have implemented several algorithms in Matlab in the framework of
the element agglomeration spectral AMGe (Chartier et al. [16]) and illustrate
their performance. Our Matlab implementation allows for further extensions
and offers potential for illustrating other element-based AMG algorithms. More
specifically, the chapter is structured as follows. Section 3.2 describes the main
building tools of the implemented algorithms. In particular, in Section 3.2.1
we describe the required input, Section 3.2.2 reviews some details on element
agglomeration, and in § 3.2.3 the so-called minimal intersection sets are
introduced. These sets are later used, in Section 3.2.4, to define the coarse degrees
of freedom by solving local eigenvalue problems (associated with each minimal
intersection set). The recursive nature of the algorithm is summarized in § 3.2.5.
The spectral way of selecting coarse degrees of freedom naturally leads to the
construction of tentative prolongators. In § 3.2.6, the need to improve the
stability of the tentative prolongator is outlined.
Another direction of extending the presented method that we pursue utilizes
more sophisticated multigrid cycles that are based on inner (between the levels)
preconditioned conjugate gradient (CG) iterations. This idea goes back to
the nonlinear AMLI methods in Axelsson and Vassilevski [5]. The respective
method, referred to as the nested-CG (AMLI) cycle, is described in Section 3.2.7.
Section 3.3 contains a description of a specific implementation of the adaptive
AMG method (cf. Brezina et al. [14]) in the present AMGe setting. We
also consider an option which does not utilize the actual element matrices (but
does use the element topology relations). This results in a new AMGe algorithm,
which we view as the main algorithmic contribution of this chapter.
The V-cycle and nested-CG cycle are illustrated in Sections 3.4 and 3.5,
whereas the results utilizing AMG adaptivity are found in the last section, 3.6.
The main result of the present work is that it describes the performance
of a number of AMGe algorithms; some are simple variations of previously
implemented ones, in addition to some newly developed ones, such as the
nested-CG AMLI cycles, as well as the adaptive element agglomerate AMGe.
3.2 Basic building tools
3.2.1 Input data
We assume a finite element setting of the problem in the framework of relation
tables, as described in detail in Vassilevski [57].
More specifically, our Matlab software requires the following input data:
• The element_dof relation. It can be defined by treating each element
as a list of degrees of freedom (dofs). If the dofs and elements are
respectively numbered as 1, ..., N_D and 1, ..., N_E, then the incidence relation
element_dof can be represented as the boolean sparse matrix M of size
N_E × N_D with entries
M_ij = 1 if element i contains dof j, and M_ij = 0 otherwise.
• The element_face relation. This relation describes the incidence
"element i has a face j." For triangular elements, element faces are the
triangles' sides. This relation can be used to define neighboring elements;
namely, we say that two elements are neighbors if they share a common
face. This defines the relation table element_element. Again, the
relations element_face and element_element can be implemented as
boolean sparse matrices.
• List of boundary dofs. This is a list of dofs which are on the boundary
where essential boundary conditions are to be imposed.
We also need access to the individual element matrices. We assume that
all these matrices are symmetric positive semidefinite (or SPSD for short).
The actual matrix of the linear system of our main interest is built in two
steps:
• First, we assemble a global (in general, singular, SPSD) matrix from
the individual element matrices in the usual way, i.e., according to
the formula
v^T A v = Σ_τ v_τ^T A_τ v_τ for all v,
where A_τ denotes the local matrix corresponding to element τ. For
a given v and a set of dofs τ, v_τ stands for the restriction of v to the
subset of indices (dofs) corresponding to τ.
• Second, we impose the essential boundary conditions on the globally
assembled matrix A. That is, for any dof d on the essential boundary,
we set to zero all off-diagonal entries in the d-th row and the d-th
column of A. For the class of problems we consider, imposing
essential boundary conditions in this way produces SPD (symmetric
positive definite), hence nonsingular, matrices.
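The two steps above can be sketched in a few lines. The following is our own pure-Python illustration (the thesis implementation is in Matlab and uses sparse storage); the two 1D "elements" with local stiffness matrices [[1, -1], [-1, 1]] are made-up data:

```python
def assemble(n_dofs, elements, elem_mats):
    """elements[e]: global dof list of element e; elem_mats[e]: its local
    SPSD element matrix.  Returns the dense global matrix."""
    A = [[0.0] * n_dofs for _ in range(n_dofs)]
    for dofs, Ae in zip(elements, elem_mats):
        for i_loc, i in enumerate(dofs):
            for j_loc, j in enumerate(dofs):
                A[i][j] += Ae[i_loc][j_loc]
    return A

def impose_essential_bc(A, boundary_dofs):
    """Zero the off-diagonal entries of every boundary row and column,
    keeping the diagonal, so the assembled SPSD matrix becomes SPD."""
    for d in boundary_dofs:
        for j in range(len(A)):
            if j != d:
                A[d][j] = 0.0
                A[j][d] = 0.0
    return A

# Two linear 1D elements on dofs {0,1} and {1,2}
Ke = [[1.0, -1.0], [-1.0, 1.0]]
A = assemble(3, [[0, 1], [1, 2]], [Ke, Ke])   # singular SPSD tridiagonal
A = impose_essential_bc(A, [0, 2])            # now SPD (here diagonal)
```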
We refer to the initial set of elements and respective dofs as fine-grid elements
(or fine elements) and fine-grid dofs (or fine dofs).
We note that the right-hand side vector (a required input in the solution
phase) is not needed as input in the construction of the actual AMGe solver.
3.2.2 Agglomerating elements
Agglomeration refers to a partition of the set of all (fine) elements into
non-intersecting subsets, called agglomerates or agglomerated elements (AEs).
More specifically, we treat the set of elements as vertices of the undirected
graph where two vertices (elements) are linked with an edge if and only if they
are neighbors, i.e., share a face. This graph defines (as described before) the
relation element_element. If the relations element_face and face_element
are implemented as (boolean) sparse matrices, then element_element equals
the product element_face × face_element.
We use the graph partitioner METIS (Karypis and Kumar [31]) to partition
the element_element relation into a desired number of components (a
user-specified parameter referred to as the coarsening factor, or Crs. f.). We make sure
that the produced components are connected. Each (connected) component
defines an agglomerate. Figure 3.1 gives an illustration of this process. We also
view each agglomerated element as a list of fine degrees of freedom (the union
of the degrees of freedom that come with the fine-grid elements forming the
agglomerate).
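The boolean-product construction of element_element can be illustrated without sparse matrices, using plain sets; the METIS partitioning step itself is not reproduced in this sketch, and the strip of four triangles is made-up data:

```python
def element_element(element_face):
    """element_face[e] = set of faces of element e.  Two elements are
    neighbors iff they share a face (the boolean product with the
    transposed relation face_element)."""
    # face_element: transpose of the element_face relation
    face_element = {}
    for e, faces in enumerate(element_face):
        for f in faces:
            face_element.setdefault(f, set()).add(e)
    nbrs = [set() for _ in element_face]
    for e, faces in enumerate(element_face):
        for f in faces:
            nbrs[e] |= face_element[f]   # includes e itself (the diagonal)
    return nbrs

# Four triangles in a strip: consecutive elements share one face
efaces = [{0, 1, 2}, {2, 3, 4}, {4, 5, 6}, {6, 7, 8}]
nbrs = element_element(efaces)
# element 0 neighbors {0, 1}; element 1 neighbors {0, 1, 2}; etc.
```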
3.2.3 Building minimal intersection sets
Having created agglomerated elements, we can split the set of all (fine)
degrees of freedom into non-overlapping partitions, referred to as minimal
intersection sets (as in Vassilevski and Zikatanov [58]). These partitions represent
equivalence classes with respect to the equivalence relation that two dofs belong
Figure 3.1: Agglomerated elements
to the same minimal intersection set if and only if they belong to exactly the same
set of agglomerated elements. For example, all interior dofs of an AE constitute
a minimal intersection set. A boundary (its interior only) between two AEs
constitutes a minimal intersection set, etc. Figure 3.2 illustrates this process.
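The equivalence-class construction amounts to grouping dofs by the exact set of agglomerates that contain them. A hypothetical pure-Python sketch (the two-AE data is illustrative, not from the thesis experiments):

```python
def minimal_intersection_sets(ae_dof, n_dofs):
    """ae_dof[a] = set of fine dofs of agglomerate a.  Returns a dict
    mapping each 'signature' (frozenset of AEs containing a dof) to the
    set of dofs sharing that signature."""
    classes = {}
    for dof in range(n_dofs):
        sig = frozenset(a for a, dofs in enumerate(ae_dof) if dof in dofs)
        classes.setdefault(sig, set()).add(dof)
    return classes

# Two AEs sharing dofs 2 and 3 on their common boundary
ae_dof = [{0, 1, 2, 3}, {2, 3, 4, 5}]
sets = minimal_intersection_sets(ae_dof, 6)
# interior of AE 0: {0, 1}; interface: {2, 3}; interior of AE 1: {4, 5}
```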
3.2.4 Forming Schur complements, computing eigenvalues and
building tentative prolongators
For each minimal intersection set I, we define its neighborhood N(I) to
be all (fine) elements that intersect I (i.e., having a common dof with I). We
assemble the local matrix A_{N(I)} from the element matrices corresponding to the
elements in N(I) and then compute its Schur complement S_I by eliminating all
dofs outside I. Note that both A_{N(I)} and S_I are SPSD.
Next, we calculate all the eigenpairs (λ_{I,k}, q_{I,k}), k = 1, ..., |I|, of S_I. Here,
|I| stands for the cardinality of the set I. Let Q_I be the matrix whose columns
are the (orthogonal and normalized) eigenvectors q_{I,k} of S_I.
Figure 3.2: Minimal intersection sets (4 agglomerates, 100 elements)
Due to our assumptions, we have that all eigenvalues λ_{I,k} of S_I are
nonnegative.
Based on a user-specified tolerance τ ∈ [0, 1] (referred to as the spectral
tolerance), we partition Q_I into two submatrices. The first one, Q_{I,c}, has columns
that are the eigenvectors q_{I,k} corresponding to the lower part of the spectrum of
S_I, i.e., all q_{I,k} for k such that λ_{I,k} ≤ τ ‖A_{N(I)}‖, where ‖·‖ stands for the ℓ²-norm
of the neighborhood matrix A_{N(I)} (or of the diagonal of A_{N(I)}). The second block,
Q_{I,f}, contains all remaining columns of Q_I (i.e., eigenvectors corresponding to
the upper part of the spectrum, λ_{I,k} > τ ‖A_{N(I)}‖).
The tentative prolongator Q_c is then formed from the blocks Q_{I,c} (extended
by zero outside the set I), i.e., Q_c = [..., Q_{I,c}, ...]. Since all matrices Q_{I,c} are
orthogonal and the sets I do not overlap, Q_c is also an orthogonal matrix.
3.2.5 Proceeding by recursion
So far, we have constructed the components necessary for a two-level method.
More specifically, we assume a standard smoother M, such as (block)
Gauss-Seidel. The coarse-grid matrix A_c is obtained by the standard (Galerkin)
procedure; namely, with P = Q_c (or some improved version of Q_c), we set
A_c = P^T A P. A symmetric two-grid iteration for solving Ax = b, for a given
current iterate x_0, takes the form:
Algorithm 3.1 (Symmetric two-grid algorithm).
• pre-smooth, i.e., compute
y = x_0 + M^{-1}(b − A x_0).
• compute the coarse-grid correction
x_c = (A_c)^{-1} P^T (b − A y).
• interpolate and update the result
z = y + P x_c.
• post-smooth, i.e., compute
x = z + M^{-T}(b − A z).
The mapping b ↦ x = B_TG^{-1} b resulting from Algorithm 3.1 (with x_0 = 0)
defines a two-grid preconditioner B_TG. It is well known (and readily checked)
that its inverse admits the following explicit form
B_TG^{-1} = (I − M^{-T} A) P (A_c)^{-1} P^T (I − A M^{-1}) + M̄^{-1},   (3.1)
where M̄ = M (M + M^T − A)^{-1} M^T is the so-called symmetrized smoother.
For example, if M = D + L (D diagonal, L strictly lower triangular) represents
the forward Gauss-Seidel method coming from A = D + L + L^T, then M̄ =
(D + L) D^{-1} (D + L^T) gives rise exactly to the symmetric Gauss-Seidel method.
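A minimal sketch of Algorithm 3.1, assuming dense matrices stored as nested lists, a diagonal (Jacobi-type) smoother M (so M^{-T} = M^{-1}), and a one-dimensional coarse space spanned by the constant vector; all data is illustrative, not taken from the thesis experiments:

```python
def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def two_grid_step(A, M_diag, P, Ac, b, x0):
    # pre-smooth: y = x0 + M^{-1}(b - A x0)
    r = [bi - ri for bi, ri in zip(b, matvec(A, x0))]
    y = [x0i + ri / mi for x0i, ri, mi in zip(x0, r, M_diag)]
    # coarse-grid correction: xc = Ac^{-1} P^T (b - A y)   (Ac is 1x1 here)
    r = [bi - ri for bi, ri in zip(b, matvec(A, y))]
    rc = sum(P[i][0] * r[i] for i in range(len(r)))
    xc = rc / Ac[0][0]
    # interpolate and update: z = y + P xc
    z = [yi + P[i][0] * xc for i, yi in enumerate(y)]
    # post-smooth: x = z + M^{-T}(b - A z)  (M diagonal, so M^{-T} = M^{-1})
    r = [bi - ri for bi, ri in zip(b, matvec(A, z))]
    return [zi + ri / mi for zi, ri, mi in zip(z, r, M_diag)]

# 1D Laplacian on 3 interior points, coarse space = constant vector
A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
M_diag = [2.0, 2.0, 2.0]                  # Jacobi smoother
P = [[1.0], [1.0], [1.0]]                 # tentative prolongator, one coarse dof
Ac = [[sum(matvec(A, [1.0, 1.0, 1.0]))]]  # Galerkin: P^T A P = [[2.0]]
b = [1.0, 0.0, 1.0]
x = [0.0, 0.0, 0.0]
for _ in range(20):
    x = two_grid_step(A, M_diag, P, Ac, b, x)
```

For this small example the exact solution of Ax = b is x = (1, 1, 1), and the iteration converges to it since the Jacobi smoother is A-convergent here.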
To define a MG algorithm, the exact solve with A_c is replaced by a
corresponding B_c^{-1} defined by recursion. At the coarsest level, B_c typically equals
the matrix at that level.
In our setting, to exploit recursion, we need to construct coarse elements
(the needed topology relations), coarse dofs and coarse element matrices.
The topology of the agglomerated elements (which serve as coarse elements)
is constructed based on the algorithms described in Vassilevski [57]. This part
of the setup is independent of the selection of the coarse dofs. The required
input here is only the fine-grid element_face topological relation.
The coarse dofs at a given level can be identified with the columns of the
tentative prolongator Q_c that we construct based on the lower part of the
spectrum of the Schur complements S_I associated with each minimal intersection set
I.
To build coarse elements (as lists of coarse degrees of freedom), we use the
agglomerated elements. Each agglomerated element (as a set of fine dofs) can
be split into several minimal intersection sets, and each minimal intersection set
has one or several coarse dofs (eigenmodes) associated with its respective Schur
complement. For each agglomerated element T, the list of coarse degrees of
freedom associated with all minimal intersection sets that form T defines the
relation AE_coarse_dof. This relation is the coarse counterpart of the (fine-grid)
relation element_dof.
To construct coarse element matrices, we proceed as follows. For each
agglomerated element T, we first assemble the local matrix A_T from the (fine-grid)
element matrices A_τ, τ ⊂ T. Also, based on T, we form the submatrix
Q_{T,c} of the tentative prolongator Q_c that corresponds to the coarse dofs in T.
In what follows, we refer to Q_{T,c} as the local tentative prolongator. Finally,
we construct the coarse element matrix A_T^c based on the local Galerkin relation
A_T^c = Q_{T,c}^T A_T Q_{T,c}.
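The local Galerkin relation is a plain triple product. A hedged sketch with a made-up 3×3 local matrix A_T and a one-column local prolongator Q_{T,c}:

```python
def triple_product(Q, A):
    """Compute Q^T A Q for dense Q (n x m) and A (n x n) as nested lists."""
    n, m = len(Q), len(Q[0])
    AQ = [[sum(A[i][k] * Q[k][j] for k in range(n)) for j in range(m)]
          for i in range(n)]
    return [[sum(Q[k][i] * AQ[k][j] for k in range(n)) for j in range(m)]
            for i in range(m)]

A_T = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
Q_Tc = [[1.0], [1.0], [1.0]]
A_Tc = triple_product(Q_Tc, A_T)   # coarse element matrix, here 1x1
```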
3.2.6 On the theoretical justification of the spectral AMGe approach
Consider the orthogonal (in the Euclidean inner product) splitting R^n =
V_c ⊕ V_f, where V_c = Range(Q_c) and V_f is its orthogonal complement. Let P_{V_f}
denote the orthogonal projection onto V_f. It can be proved (see Chartier et al. [16]) that
the restriction A_ff = P_{V_f} A|_{V_f} of A to the subspace V_f is well-conditioned if we
choose a sufficiently large portion of the eigenmodes (in the lower part of the
spectrum) of each Schur complement S_I to form the columns of the tentative
prolongator Q_c. That is, the larger the spectral tolerance τ ∈ (0, 1), the better the
condition number of the resulting A_ff.
We note that the fact that A_ff is well-conditioned is by itself not sufficient
to conclude that a two-grid method has a good convergence factor; we also
need some stability property of the interpolation matrix. Since our tentative
prolongator is block-diagonal, its stability properties in the A-norm are not
very good. That is, the tentative prolongator needs to be improved in general.
For example, we can smooth it out as in the smoothed aggregation AMG
(Vanek et al. [56]). This is a feasible option that can lead to better two-grid
convergence rates. In our setting, though, some of the minimal intersection sets
are thin aggregates, which led to higher operator complexities. That is why we
did not pursue this option. Operator complexity (C) reflects the cost (in terms
of arithmetic operations) of one V-cycle. It is defined based on the number of
nonzero entries nnz(A_l) of the l-th level matrix A_l (l = 0 is the finest level, l = L is the
coarsest level) by the formula
C = (Σ_{l=0}^{L} nnz(A_l)) / nnz(A_0).   (3.2)
One option that we chose to stabilize the tentative prolongator is based on
harmonic extension, defined as follows. For each agglomerated element T, we
construct Q_c only for the minimal intersection sets I that are on the boundary of T
(i.e., shared also by other agglomerated elements). Then Q_c, defined only on the
boundary of T, is extended into the interior of T by inverting the block of the
matrix A corresponding to the interior of T. More specifically, if A restricted
to T has rows [A_{T,i}, A_{T,b}] corresponding to the interior of T, then the actual
interpolation mapping P has the block form
P = [ −A_{T,i}^{-1} A_{T,b} Q_c ;  Q_c ].
Another possible approach in the multilevel case is to keep the unstable
tentative prolongators but compensate for that with more expensive cycles (like
W-cycles, for example). We chose to use recursive calls to coarse levels based
on preconditioned CG (conjugate gradient) iterations with a preconditioner
(defined by recursion) that is a (mildly) nonlinear mapping. This is described in
more detail in the next subsection.
3.2.7 The nonlinear nested-CG AMLI-type multigrid cycles
The alternative that we chose to compensate for an unstable tentative
prolongator is to use more expensive multigrid cycles based on CG (conjugate
gradient) inner (between-levels) iterations; this idea goes back to the nonlinear
AMLI-cycle hierarchical basis methods proposed in Axelsson and Vassilevski [5].
A recent analysis of this type of cycles used in the MG setting is found in Notay
and Vassilevski [46].
We implemented an algorithm to be referred to as a nested-CG (AMLI)
cycle. It consists of CG-acceleration on each level of the currently defined (by
recursion from coarse to fine levels) (nonlinear) preconditioner. More specifically,
we run CG iterations for the linear system Ax = b (where A is the finest-level
matrix) with a preconditioner that is a nonlinear mapping. To compute
y = B^{-1}(b), we use the following recursive algorithm:
• For a given smoother M, apply a pre-smoothing step using y = 0 as the
initial guess, i.e., compute y := M^{-1} b.
• Restrict the residual: r_c := P^T (b − A y).
• At the coarsest level, solve directly, i.e., compute y_c := (A_c)^{-1} r_c; otherwise,
apply several CG iterations to the linear system A_c y_c = r_c, using y_c = 0
as an initial guess and a preconditioner whose inverse actions B_c^{-1} are
defined recursively (by the present algorithm); that is, compute y_c as an
approximate solution to (A_c)^{-1} r_c.
• Correct y based on y_c as y := y + P y_c.
• Based on M^T, perform a post-smoothing step, i.e., compute y := y +
M^{-T}(b − A y).
Note that the above algorithm defines the inverse of a preconditioner, B^{-1},
that is a nonlinear mapping.
In the implementation, we limit the number of inner CG iterations per level
so that the complexity of the resulting cycle is kept under control. More
specifically, the number of CG iterations at level l is chosen based on the coarsening
factor defined as the ratio N_l / N_{l+1}.
3.3 Adaptivity in the element agglomeration AMGe framework
In the present section, we first describe the main steps of the adaptive AMG
method from Brezina et al. [14]. Based on its observed performance, the method
improves itself by augmenting its current hierarchy (if any) so that the refined coarse
spaces contain more "algebraically smooth" vectors. These vectors represent
approximations to the minimal eigenmode of the generalized eigenvalue problem
Ax = λ B x, where B^{-1} stands for the matrix representation of the V-cycle
corresponding to the most current AMG hierarchy. If no hierarchy is available,
then B stands for the current-level symmetrized smoother M̄. The choice of the
smoothers is fixed, for example, Gauss-Seidel.
We may assume that an initial AMG hierarchy has been built as described
in the preceding section. That is, based on an AMGe algorithm, we have created
a sequence of interpolation matrices, respective (Galerkin) coarse-grid matrices
and the smoothing matrices M associated with them (that choice we assume
fixed). In the adaptive AMG we begin with performing several dry runs based
on the current method to measure its performance. If inefficiency is detected
then we try to improve the method by adapting the AMG hierarchy. Some
preliminary results of an adaptive AMG method (in AMGe framework) were
reported in Vassilevski and Zikatanov [58]. We describe, in what follows, the
particular version of the approach that we have implemented.
For simplicity, we start with a two-level method. In our setting, it comes
with a tentative prolongator Q_c, an improved interpolation mapping P, the
coarse-grid matrix A_c = P^T A P and a smoother M associated with A. These
components define a two-level preconditioner B (as in Algorithm 3.1 or formula
(3.1)). Note that the case Q_c = 0 and B = M̄ is treated similarly (with an obvious
modification). This case is referred to as adaptive AMG starting from scratch.
Then we perform several stationary iterations to solve Ax = 0 with a random
initial guess x, i.e.,
x := (I − B^{-1} A) x,
where I is the identity matrix. At every iteration step, we monitor the
convergence rate, i.e., with x_new = (I − B^{-1} A) x we compute
ρ² = (x_new^T A x_new) / (x^T A x).
If after a few (for example, five) iterations ρ is greater than a desired convergence
factor, we stop the iteration. Our current (two-level) solver B cannot efficiently
handle this vector, i.e., x is rich in eigenmode components in the lower part
of the spectrum of B^{-1} A. We refer to x as an algebraically smooth vector.
Note that x is algebraically smooth with respect to the current (two-grid)
preconditioner B (which initially may be simply the smoother M̄). The iteration
process above can be interpreted as calculating approximations to the highest
eigenmode of the A-symmetric two-level iteration matrix E = I − B^{-1} A. Since
we consider symmetric MG cycles and A-convergent smoothers (like Gauss-Seidel),
this implies that A E is symmetric positive semidefinite.
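Such a dry run is easy to sketch in the "from scratch" case. In the pure-Python illustration below (our own, not the thesis code), B is taken to be a plain Jacobi smoother and A is a made-up 1D Laplacian; the returned ρ is the last observed A-norm reduction factor:

```python
import random

def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def a_norm2(A, x):
    return sum(xi * yi for xi, yi in zip(x, matvec(A, x)))

def dry_run(A, n_iters=5):
    """Iterate x := (I - B^{-1}A)x for Ax = 0 from a random guess and
    return (x, rho), where rho is the last A-norm reduction factor;
    here B is simply the Jacobi smoother diag(A)."""
    random.seed(0)
    x = [random.gauss(0.0, 1.0) for _ in A]
    rho = 0.0
    for _ in range(n_iters):
        old = a_norm2(A, x)
        corr = [ri / A[i][i] for i, ri in enumerate(matvec(A, x))]  # B^{-1} A x
        x = [xi - ci for xi, ci in zip(x, corr)]
        rho = (a_norm2(A, x) / old) ** 0.5
    return x, rho

# 1D Laplacian on five interior points
A = [[2.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(5)]
     for i in range(5)]
x_smooth, rho = dry_run(A)
```

If ρ comes out close to 1, the final x is a vector that the current B handles poorly, i.e., an algebraically smooth vector in the sense above.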
The next step in the adaptive AMG is to incorporate the algebraically
smooth vector x into the two-level hierarchy by changing the tentative prolongator
Q_c and, afterwards, the respective improved interpolation matrix P. After
a new P has been computed, we recompute the coarse-grid matrix A_c = P^T A P.
The following algorithm implements the above steps:
Algorithm 3.2 (Augmenting coarse spaces based on additional vectors).
• We compute the interpolation error e = x − Q_c Q_c^T x. Since Q_c has
mutually orthogonal columns, e is the projection of x onto the orthogonal
complement of the span of the columns of Q_c.
• For each minimal intersection set I, consider e_I (the restriction of e to
I). Then, if ‖e_I‖/‖x‖ is greater than some given threshold, we add e_I
(extended by zeros outside I) as an extra column to the tentative prolongator
Q_c. If ‖e_I‖/‖x‖ is less than the threshold, we discard e_I.
• At the end, compute P, the improved interpolation matrix, from Q_c using
harmonic extension as described in Section 3.2.6. Using this updated P,
we recompute A_c = P^T A P.
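The first two steps of Algorithm 3.2 can be sketched as follows; this is our own illustration with made-up data (one orthonormal column in Q_c, two intersection sets), and the new columns are normalized, which the algorithm above leaves unspecified:

```python
def augment(Qc, sets, x, threshold):
    """Qc: list of orthonormal columns (each a list of length n);
    sets: list of dof-index lists (the minimal intersection sets).
    Returns Qc augmented with the large-enough local pieces of the
    projection of x onto the complement of span(Qc)."""
    n = len(x)
    # e = x - Qc Qc^T x
    e = x[:]
    for q in Qc:
        c = sum(qi * xi for qi, xi in zip(q, x))
        e = [ei - c * qi for ei, qi in zip(e, q)]
    xnorm = sum(xi * xi for xi in x) ** 0.5
    new_cols = []
    for I in sets:
        eI_norm = sum(e[i] ** 2 for i in I) ** 0.5
        if eI_norm > threshold * xnorm:
            col = [0.0] * n
            for i in I:
                col[i] = e[i] / eI_norm   # extend by zeros, normalize
            new_cols.append(col)
    return Qc + new_cols

Qc_new = augment([[0.5, 0.5, 0.5, 0.5]], [[0, 1], [2, 3]],
                 [1.0, 0.0, 0.0, 0.0], 0.5)
```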
In the multilevel case, we have to perform a few additional steps at any given
level. Note that below the current coarse level, we may already have a hierarchy
of coarse spaces. Since we have augmented Q_c with more columns and built a
new P, we have generally changed the dimension of the coarse space at the given
level. Hence, relating the next-level coarse space to the current-level coarse
space based on the previously available interpolation matrix P_next is not possible,
because the dimensions do not match. To fix this problem, we need to add some
additional rows to P_next to match the dimensions. The new rows are simply set
to zero. Thus the second coarse matrix does not have to be recomputed. This
is seen as follows. Let Q_c^old be the current-level tentative prolongator. We add a
few new columns to Q_c^old based on the vector e computed in Algorithm 3.2. That
is, we have Q_c = [Q_c^new, Q_c^old]. Similarly, the old interpolation matrix P^old gets
updated with the same number of columns; i.e., we have P = [P^new, P^old]. The
new coarse matrix is A_c = P^T A P. The second coarse-level interpolation matrix
P_next gets modified with some extra zero rows; it becomes
P_next = [ 0 ;  P_next^old ].
Therefore, the next coarse-level matrix equals
A_next^c = P_next^T A_c P_next
= [ 0 ;  P_next^old ]^T [ (P^new)^T A P^new   (P^new)^T A P^old ;  (P^old)^T A P^new   (P^old)^T A P^old ] [ 0 ;  P_next^old ]
= (P_next^old)^T (P^old)^T A P^old P_next^old,
which is the expression defining the next-level old coarse matrix.
A second problem that needs to be fixed is to augment the minimal
intersection sets at the given coarse level, since we have added some new coarse dofs
there. To accomplish this task, during the setup phase of the algorithm (more
specifically, when we agglomerate elements, build the minimal intersection sets,
etc.), we keep a hierarchy structure of the minimal intersection sets. We say
that a level-l minimal intersection set I_l is associated with a coarser, level-(l+1)
minimal intersection set I_{l+1} if there exists a coarse dof d_{l+1} from I_{l+1} associated
with I_l. Recall that (initially) every coarse dof corresponds to an eigenvector
of a local matrix associated with a minimal intersection set. It is easily seen
that for each I_l there exists a unique coarse minimal intersection set I_{l+1} that is
associated with I_l. Based on the hierarchy of the minimal intersection sets, we
distribute the newly created coarse dofs among the coarser minimal intersection
sets. Then we proceed by recursion. The previously current coarse level
becomes fine, and we apply the method starting from that level. The initial vector
is now not random; it equals Q^T x, where x is the vector computed at the
previous (fine) level and Q is the augmented tentative prolongator at that level.
The same scheme applies to all subsequent levels till we end up with two levels
only. Then, we apply the two-level scheme described above.
The following monotonicity result demonstrates the fact that augmenting
a current coarse space (in the sense described above) leads to a better two-grid
method.
Proposition 3.3. Let A, M, P^old and A_c^old = (P^old)^T A P^old define a current
two-grid preconditioner B_old (as in (3.1)). Augment the coarse space so that
P = [P^new, P^old] is the new interpolation matrix and A_c = P^T A P is the new
coarse matrix. Then A, M, P and A_c define the new two-grid preconditioner B
(also as in (3.1)). Then the following inequalities hold:
v^T A^{-1} v ≥ v^T B^{-1} v ≥ v^T B_old^{-1} v.
That is, B^{-1} provides a more accurate approximate inverse to A than the old
two-grid preconditioner B_old.
Proof. Let (u, v)_A = v^T A u denote the A-inner product (note that A is s.p.d.).
It is equivalent to show that, for all u,
0 ≤ ((I − B^{-1}A) u, u)_A ≤ ((I − B_old^{-1}A) u, u)_A.   (3.3)
The symmetric two-level cycle leads to an error propagation matrix that admits
the following product form
I − B^{-1}A = (I − M^{-T}A) (I − P (A_c)^{-1} P^T A) (I − M^{-1}A),
and similarly
I − B_old^{-1}A = (I − M^{-T}A) (I − P^old (A_c^old)^{-1} (P^old)^T A) (I − M^{-1}A).
We also note that (I − M^{-T}A) is A-adjoint to (I − M^{-1}A). Thus (3.3) with
v = (I − M^{-1}A) u reduces to
0 ≤ ((I − P (A_c)^{-1} P^T A) v, v)_A
≤ ((I − P^old (A_c^old)^{-1} (P^old)^T A) v, v)_A.
These inequalities are readily seen from the fact that both (I − P (A_c)^{-1} P^T A)
and (I − P^old (A_c^old)^{-1} (P^old)^T A) are A-orthogonal projectors providing the best
approximation in the A-norm from the two nested spaces Range([P^new, P^old]) and
Range(P^old).
3.4 Numerical results for spectral agglomerate AMGe
In this and subsequent sections, we illustrate the performance of the methods
discussed in this chapter. The problems we consider come from finite element
(f.e.) discretization of 2D anisotropic diffusion, as well as the 2D (indefinite)
Helmholtz equation rewritten as a first-order (mixed) system cast in a
least-squares form (the so-called FOSLS formulation). The latter problem gives rise
to three degrees of freedom per node.
The specific PDEs, posed in Ω = (0, 1)², read:
• anisotropic diffusion:
−div ((ε I + b b^T) ∇p) = f,  p = 0 on ∂Ω,
where ε is either 1 (which corresponds to no anisotropy, i.e., the operator
is the Laplacian if b = 0) or 0.001, and b is a given constant vector in R². Here
we use a f.e. discretization based on the standard piecewise linear f.e. space S_h
over a given triangular mesh T_h of Ω.
• Helmholtz equation −Δp − k² p = f, p = 0 on ∂Ω, for a given k². The
FOSLS formulation reads: Compute
‖curl u_h‖² + ‖u_h − ∇p_h‖² + ‖f + div u_h + k² p_h‖² → min,
over u_h ∈ (S_h)² and p_h ∈ S_h, where S_h is a finite element space of piecewise
linear functions over a given triangular mesh T_h on Ω. Here ‖·‖ stands for
the L²(Ω) norm, p_h and the tangential component of u_h are set to zero on
∂Ω. The additional vector unknown u_h approximates ∇p.
In our tests, we varied the spectral tolerance τ ∈ [0, 1) (see Section 3.2.4) as well as the smoothers used: Gauss-Seidel (GS) and its block version. We used (overlapping) blocks referring to the agglomerated elements (viewed as sets of fine-grid dofs). This block smoother is denoted in the tables by AE-Block-GS. We used two types of multigrid cycles: the standard symmetric V(1,1)-cycle (forward (block) Gauss-Seidel in the fine-to-coarse direction and backward (block) Gauss-Seidel in the coarse-to-fine direction) and the nested CG cycle described in Section 3.2.7. For each method and problem, we compare its two-level and multilevel versions. To reduce the complexity of the methods we use different coarsening factors (Crs. f.), one at the initial level (equal to sixteen or eight) and another one (equal to four) at all remaining coarse levels. The number of levels used is denoted by N_lev. We also list ρ, an estimate of the convergence factor. In most of the tables the resulting multigrid cycle is used as a preconditioner in the CG (conjugate gradient) method. In addition, the convergence history of the methods is illustrated graphically. Finally, some typical graphs of the coarsest-level basis functions are shown, illustrating the kind of interpolation matrices (their columns viewed on the finest grid) that result from a particular algorithm. The tables also contain the operator complexity C_A defined in (3.2) as well as a related quantity C_{A+P} defined as:

    C_{A+P} = C_A + ( Σ_{i=0}^{l-1} nnz(P_i) ) / nnz(A_0).

That is, to the commonly used operator complexity C_A we add the total number of nonzero entries of all interpolation matrices P_i divided by the number of nonzeros of the finest-level matrix A_0.
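Since C_{A+P} is pure bookkeeping over sparsity counts, it can be computed in a few lines. The following sketch is ours, with made-up nnz values purely for illustration:

```python
def operator_complexity(nnz_levels):
    """C_A: total nnz over all level matrices A_0..A_l divided by nnz(A_0)."""
    return sum(nnz_levels) / nnz_levels[0]

def complexity_a_plus_p(nnz_levels, nnz_interp):
    """C_{A+P} = C_A + (sum of nnz(P_i), i = 0..l-1) / nnz(A_0)."""
    return operator_complexity(nnz_levels) + sum(nnz_interp) / nnz_levels[0]

# Illustrative three-level hierarchy: A_0, A_1, A_2 and interpolations P_0, P_1.
nnz_a = [10000, 3000, 900]
nnz_p = [2500, 700]
ca = operator_complexity(nnz_a)          # 1.39
cap = complexity_a_plus_p(nnz_a, nnz_p)  # 1.39 + 0.32 = 1.71
```

The made-up counts above give C_A = 1.39 and C_{A+P} = 1.71, the same order of magnitude as the values reported in the tables below.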
In our numerical experiments, we use two types of meshes: a structured square mesh and unstructured triangular meshes. A given uniform square mesh produces the actual structured mesh by subdividing each square cell along its diagonal into two triangles. In the case of the structured grid, we choose the anisotropy vector b aligned with the grid. All unstructured meshes we consider are obtained by refinement of the mesh shown in Figs. 3.1-3.2.
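The structured mesh construction is easy to reproduce programmatically: each cell of an n × n uniform grid is split along its diagonal into two triangles, giving 2n^2 elements and (n+1)^2 vertices; with n = 32 this matches the 2048 elements and 1089 dofs of the smaller structured-grid tests. A sketch of ours:

```python
def structured_triangular_mesh(n):
    """Split each cell of an n-by-n uniform square grid into two triangles
    along its diagonal; vertices are indexed row by row."""
    verts = [(i, j) for j in range(n + 1) for i in range(n + 1)]

    def vid(i, j):
        return j * (n + 1) + i

    elems = []
    for j in range(n):
        for i in range(n):
            # Lower-right and upper-left triangles of cell (i, j).
            elems.append((vid(i, j), vid(i + 1, j), vid(i + 1, j + 1)))
            elems.append((vid(i, j), vid(i + 1, j + 1), vid(i, j + 1)))
    return verts, elems

verts, elems = structured_triangular_mesh(32)
# 2 * 32**2 = 2048 elements, 33**2 = 1089 vertices (dofs for P1 elements)
```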
In this particular section, we use V-cycle iterations based on the so-called (fully) harmonic interpolation matrices. These interpolation matrices are obtained by first selecting coarse dofs based on a portion of the lower part of the spectrum of the Schur complements S_I associated with each minimal intersection set I (as described previously), leading to an orthogonal matrix Q_c. Based on the remaining part of the spectrum we can also construct the complementary (also orthogonal) matrix Q_f. Then, we first perform a change of variables that leads to a matrix

    [Q_f, Q_c]^T A [Q_f, Q_c] = [ A_ff  A_fc ]
                                [ A_cf  A_cc ].

Note that a similar structure is obtained for all neighborhood matrices A_x(I) (discussed in Section 3.2.4). Having the 2x2 block form of the transformed matrix with the coarse dofs identified (denoted by the index c), the actual interpolation matrix is constructed as described in Jones and Vassilevski [28]. The resulting interpolation matrix P can be viewed as a locally harmonic extension of the tentative prolongator Q_c. This was also the construction used in Chartier et al. [16].
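A compact way to see the harmonic extension in the transformed variables is the block formula P = [-A_ff^{-1} A_fc; I], which makes each interpolated vector A-harmonic in the "f" (interior) dofs. The pure-Python sketch below is our own tiny dense illustration of this formula; the actual construction of Jones and Vassilevski [28] applies it locally, per agglomerate:

```python
def solve(a, rhs_cols):
    """Solve a * x = rhs for several right-hand-side columns by
    Gaussian elimination with partial pivoting (dense, small matrices)."""
    n = len(a)
    m = [row[:] + [col[i] for col in rhs_cols] for i, row in enumerate(a)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[p] = m[p], m[k]
        for r in range(k + 1, n):
            f = m[r][k] / m[k][k]
            for c in range(k, len(m[r])):
                m[r][c] -= f * m[k][c]
    xs = [[0.0] * n for _ in rhs_cols]
    for j in range(len(rhs_cols)):
        for i in reversed(range(n)):
            s = m[i][n + j] - sum(m[i][c] * xs[j][c] for c in range(i + 1, n))
            xs[j][i] = s / m[i][i]
    return xs  # list of solution columns

def harmonic_prolongator(a_ff, a_fc):
    """Rows of P = [-a_ff^{-1} a_fc; I]: each coarse unit vector is
    extended harmonically into the 'f' (interior) dofs."""
    nf, nc = len(a_ff), len(a_fc[0])
    cols = solve(a_ff, [[a_fc[i][j] for i in range(nf)] for j in range(nc)])
    p = [[-cols[j][i] for j in range(nc)] for i in range(nf)]
    p += [[1.0 if i == j else 0.0 for j in range(nc)] for i in range(nc)]
    return p
```

For the 1d Laplacian stencil (A_ff = [2], A_fc = [-1, -1]) this reproduces linear interpolation: the interior dof receives the average of its two coarse neighbors.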
Numerical results for this section are found in tables 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8. Figures 3.4, 3.7, 3.8 contain convergence plots for runs in the tables listed above. Figures 3.3, 3.5, 3.6, 3.9 contain graphs of some coarsest basis functions.

Table 3.1: 2d Laplacian on square grid, 2048 elements, 1089 dofs. "Harmonic" prolongator is used inside V-cycle. No CG acceleration is used.

Crs. f.  τ    C_A  C_{A+P}  N_lev  Smoother  ρ
8/4      0    1.4  1.7      2      GS        0.22
8/4      0    1.5  2.0      5      GS        0.23

Table 3.2: 2d Laplacian on square grid, 32768 elements, 16641 dofs. "Harmonic" prolongator is used inside V-cycle. No CG acceleration is used.

Crs. f.  τ    C_A  C_{A+P}  N_lev  Smoother  ρ
8/4      0    1.3  1.7      2      GS        0.24
8/4      0    1.4  1.9      7      GS        0.24
Table 3.3: 2d Laplacian on unstructured grid, 6400 elements, 3321 dofs. "Harmonic" prolongator is used inside V-cycle. Conjugate gradient acceleration is used.

Crs. f.  τ    C_A  C_{A+P}  N_lev  Smoother  ρ     Fig/Run
16/4     0    1.5  1.9      2      GS        0.15  3.4/1
16/4     0    1.6  2.2      6      GS        0.18  3.4/2
Figure 3.3: One of coarsest-grid basis functions. Laplacian on square grid, 2048 elements, 1089 dofs. Fully-harmonic prolongator is used.
Table 3.4: 2d Laplacian on unstructured grid, 25600 elements, 13041 dofs. "Harmonic" prolongator is used inside V-cycle. Conjugate gradient acceleration is used.

Crs. f.  τ    C_A  C_{A+P}  N_lev  Smoother  ρ     Fig/Run
16/4     0    1.6  2.2      7      GS        0.21  3.4/3
16/4     0    1.5  1.9      2      GS        0.18  3.4/4
Figure 3.4: Residual decrease history for some runs in tables 3.3, 3.4.
Figure 3.5: One of coarsest-grid basis functions. Laplacian on unstructured grid, 6400 elements, 3321 dofs. Fully-harmonic prolongator is used. Coarsest grid contains 8 elements.
Table 3.5: 2d anisotropic diffusion on square grid. Anisotropy is grid-aligned. 2048 elements, 1089 dofs. "Harmonic" prolongator is used inside V-cycle. No CG acceleration is used.

Crs. f.  τ     C_A  C_{A+P}  N_lev  Smoother     ρ
8/4      0.05  2.0  2.4      2      GS           0.27
8/4      0.05  3.9  6.0      5      GS           0.55
8/4      0.05  2.0  2.4      2      AE-Block-GS  0.11
8/4      0.05  3.9  6.0      5      AE-Block-GS  0.11
Table 3.6: 2d anisotropic diffusion on square grid. Anisotropy is grid-aligned. 32768 elements, 16641 dofs. "Harmonic" prolongator is used inside V-cycle. No CG acceleration is used.

Crs. f.  τ     C_A  C_{A+P}  N_lev  Smoother     ρ
8/4      0.05  1.9  2.3      2      GS           0.27
8/4      0.05  4.3  6.8      7      GS           0.74
8/4      0.05  1.9  2.3      2      AE-Block-GS  0.11
8/4      0.05  4.3  6.8      7      AE-Block-GS  0.11
Figure 3.6: One of coarsest-grid basis functions. Anisotropic diffusion on square grid, 2048 elements, 1089 dofs. Fully-harmonic prolongator is used. Note that this is one function, not a sum of two functions.
Table 3.7: 2d anisotropic diffusion on unstructured grid, 25600 elements, 13041 dofs. "Harmonic" prolongator is used inside V-cycle. Conjugate gradient acceleration is used.

Crs. f.  τ     C_A  C_{A+P}  N_lev  Smoother     ρ     Fig/Run
16/4     0.15  2.0  2.5      2      AE-Block-GS  0.13  3.7/1
16/4     0.15  4.0  5.9      7      AE-Block-GS  0.16  3.7/2
16/4     0     1.5  1.9      2      AE-Block-GS  0.61  3.8/1
16/4     0     1.6  2.2      7      AE-Block-GS  0.62  3.8/2
Table 3.8: 2d anisotropic diffusion on unstructured grid, 6400 elements, 3321 dofs. "Harmonic" prolongator is used inside V-cycle. Conjugate gradient acceleration is used.

Crs. f.  τ     C_A  C_{A+P}  N_lev  Smoother     ρ     Fig/Run
16/4     0.15  2.0  2.5      2      AE-Block-GS  0.12  3.7/3
16/4     0.15  3.5  5.2      6      AE-Block-GS  0.14  3.7/4
16/4     0     1.5  1.9      2      AE-Block-GS  0.44  3.8/3
16/4     0     1.6  2.2      6      AE-Block-GS  0.44  3.8/4
Figure 3.7: Residual decrease history for some runs in tables 3.7, 3.8. In these runs, additional spectral coarse dofs were used (τ = 0.15).
Figure 3.8: Residual decrease history for some runs in tables 3.7, 3.8. In these runs, additional spectral coarse dofs were NOT used (τ = 0).
Figure 3.9: One of coarsest-grid basis functions. Anisotropic diffusion on unstructured grid, 6400 elements, 3321 dofs. Fully-harmonic prolongator is used. Coarsest grid contains 2 elements. This function corresponds to additional spectral coarse dofs.
Table 3.9: 2d Laplacian on square grid, 2048 elements, 1089 dofs. "Tentative" prolongator is used inside nested CG cycle. Conjugate gradient acceleration is used.

Crs. f.  τ    C_A  C_{A+P}  N_lev  Smoother  ρ     Fig/Run
32/4     0    1.2  1.4      2      GS        0.21  3.10/1
32/4     0    1.3  1.5      4      GS        0.21  3.10/2
3.5 Tests with tentative prolongators within nested CG cycles
In this section we illustrate the performance of the method when "tentative" (i.e., block-diagonal and orthogonal) prolongators (Q_c) are used. In order to compensate for prolongator instability, we use the nested CG cycles described in Section 3.2.7. We report in this section test results for the Laplacian and anisotropic diffusion. We use (pointwise) Gauss-Seidel as a smoother here, since our cycle is now more robust (and expensive). Numerical results for this section are found in tables 3.9, 3.10, 3.11, 3.12, 3.13, 3.14. Figures 3.10, 3.12, 3.14 contain convergence plots for runs in the tables listed above. Figures 3.11, 3.13, 3.15 show graphs of some of the resulting coarsest basis functions.
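The block-diagonal orthogonal structure of a tentative prolongator is easy to sketch. Below is a minimal illustration of our own with one local vector per agglomerate (constants, for simplicity; the method in the text instead takes local eigenvectors, but the block-diagonal orthonormal structure is the same): disjoint blocks with normalized local vectors automatically give orthonormal columns.

```python
import math

def tentative_prolongator(n_fine, aggregates, local_vecs):
    """Block-diagonal prolongator: column j is local_vecs[j] placed on
    the dofs of aggregates[j] and normalized; because the aggregates are
    disjoint, the columns are orthonormal."""
    p = [[0.0] * len(aggregates) for _ in range(n_fine)]
    for j, (dofs, v) in enumerate(zip(aggregates, local_vecs)):
        nrm = math.sqrt(sum(x * x for x in v))
        for dof, x in zip(dofs, v):
            p[dof][j] = x / nrm
    return p

# Two aggregates of a 4-dof fine grid, each carrying the constant vector.
p = tentative_prolongator(4, [[0, 1], [2, 3]], [[1.0, 1.0], [1.0, 1.0]])
```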
Table 3.10: 2d Laplacian on square grid, 32768 elements, 16641 dofs. "Tentative" prolongator is used inside nested CG cycle. Conjugate gradient acceleration is used.

Crs. f.  τ    C_A  C_{A+P}  N_lev  Smoother  ρ     Fig/Run
32/4     0    1.2  1.4      2      GS        0.22  3.10/3
32/4     0    1.3  1.5      6      GS        0.22  3.10/4
Figure 3.10: Residual decrease history for some runs in tables 3.9, 3.10.
Table 3.11: 2d Laplacian on unstructured grid, 6400 elements, 3321 dofs. "Tentative" prolongator is used inside nested CG cycle. Conjugate gradient acceleration is used.

Crs. f.  τ    C_A  C_{A+P}  N_lev  Smoother  ρ     Fig/Run
32/4     0    1.4  1.5      2      GS        0.31  3.12/1
32/4     0    1.5  1.7      5      GS        0.32  3.12/2
Figure 3.11: One of coarsest-grid basis functions. 2d Laplacian on square grid, 2048 elements, 1089 dofs. "Tentative" prolongator is used. Coarsest grid contains 4 elements.
Table 3.12: 2d Laplacian on unstructured grid, 25600 elements, 13041 dofs. "Tentative" prolongator is used inside nested CG cycle. Conjugate gradient acceleration is used.

Crs. f.  τ    C_A  C_{A+P}  N_lev  Smoother  ρ     Fig/Run
32/4     0    1.4  1.5      2      GS        0.37  3.12/3
32/4     0    1.5  1.7      6      GS        0.42  3.12/4
Figure 3.12: Residual decrease history for some runs in tables 3.11, 3.12.
3.6 Adaptive AMG tests
Here we present numerical results for the adaptive AMG discussed in Section 3.3. We apply this method to the Helmholtz problem described in Section 3.4, with k^2 varying from 0 (which is just the Laplace operator) to 200. In our tests, we use the partially harmonic extension of the tentative prolongators (as explained in Section 3.2.6) to define the actual interpolation matrices P_i used within the adaptive V-cycle AMG. We also employ nested-CG cycles (described in Section 3.2.7).
Everywhere, adaptivity is done from scratch (i.e., no initial hierarchy is assumed; hence adaptivity is based initially on the symmetrized Gauss-Seidel
Figure 3.13: One of coarsest-grid basis functions. 2d Laplacian on unstructured grid, 6400 elements, 3321 dofs. "Tentative" prolongator is used. Coarsest grid contains 4 elements.
Table 3.13: 2d anisotropic diffusion on unstructured grid, 6400 elements, 3321 dofs. "Tentative" prolongator is used inside nested CG cycle. Conjugate gradient acceleration is used.

Crs. f.  τ     C_A  C_{A+P}  N_lev  Smoother  ρ     Fig/Run
32/4     0.05  2.0  2.2      2      GS        0.40  3.14/1
32/4     0.05  2.7  3.5      5      GS        0.44  3.14/2
Table 3.14: 2d anisotropic diffusion on unstructured grid, 25600 elements, 13041 dofs. "Tentative" prolongator is used inside nested CG cycle. Conjugate gradient acceleration is used.

Crs. f.  τ     C_A  C_{A+P}  N_lev  Smoother  ρ     Fig/Run
32/4     0.05  2.1  2.4      2      GS        0.43  3.14/3
32/4     0.05  3.3  4.4      6      GS        0.46  3.14/4
Figure 3.14: Residual decrease history for some runs in tables 3.13, 3.14.
Figure 3.15: One of coarsest-grid basis functions. 2d anisotropic diffusion on unstructured grid, 25600 elements, 13041 dofs. "Tentative" prolongator is used. Coarsest grid contains 4 elements.
smoother). While obtaining smooth vectors for adaptivity, we do not use CG at the finest level, i.e., we run the nested-CG cycle as a stationary iteration. However, we do use CG in the final test runs. The number of smooth vectors used to build the final AMG hierarchy is shown in the column N_vec of each table.
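The mechanism for obtaining algebraically smooth vectors can be sketched in a few lines: run a stationary iteration (here just a plain Gauss-Seidel smoother, for simplicity of our illustration, rather than the full nested-CG cycle) on A x = 0 from a random start; whatever error survives is, by definition, algebraically smooth. The 1d Laplacian below is our own toy example:

```python
import random

def gauss_seidel(a, x, sweeps):
    """Forward Gauss-Seidel sweeps on a*x = 0 (dense, for illustration).
    Each coordinate update exactly minimizes the energy x^T a x in that
    coordinate, so the energy of x never increases."""
    n = len(a)
    for _ in range(sweeps):
        for i in range(n):
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            x[i] = -s / a[i][i]
    return x

def laplacian_1d(n):
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]

random.seed(0)
n = 20
a = laplacian_1d(n)
x0 = [random.uniform(-1, 1) for _ in range(n)]
x = gauss_seidel(a, x0[:], sweeps=30)
# After smoothing, x is algebraically smooth: its energy x^T a x is tiny
# relative to the rough random start, and neighboring entries vary slowly.
```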
Numerical results for this section are found in tables 3.15, 3.16, 3.17, 3.18, 3.19, 3.20. Figures 3.16, 3.17, 3.18 contain convergence plots for runs shown in the above tables. Figures 3.19, 3.20, 3.21 show graphs of some of the resulting coarsest basis functions. Finally, in Figure 3.22, we show how the complexity and convergence rate vary with the number of incorporated algebraically smooth vectors.
Table 3.15: Adaptivity: FOSLS Helmholtz, k^2 = 0, curl penalty = 1, 6400 elements, 9963 dofs. "Partially-harmonic" prolongator is used inside nested CG cycle. Conjugate gradient acceleration is used.

Crs. f.  C_A  C_{A+P}  N_lev  Smoother  N_vec  ρ     Fig/Run
32/4     2.2  3.1      2      SymGS     6      0.10  3.16/1
32/4     2.7  4.1      5      SymGS     6      0.11  3.16/2
Table 3.16: Adaptivity: FOSLS Helmholtz, k^2 = 0, curl penalty = 1, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used inside nested CG cycle. Conjugate gradient acceleration is used.

Crs. f.  C_A  C_{A+P}  N_lev  Smoother  N_vec  ρ     Fig/Run
32/4     2.4  3.3      2      SymGS     7      0.11  3.16/3
32/4     3.2  4.8      6      SymGS     7      0.10  3.16/4

Table 3.17: Adaptivity: FOSLS Helmholtz, k^2 = 100, curl penalty = 1, 6400 elements, 9963 dofs. "Partially-harmonic" prolongator is used inside nested CG cycle. Conjugate gradient acceleration is used.

Crs. f.  C_A  C_{A+P}  N_lev  Smoother  N_vec  ρ     Fig/Run
32/4     2.6  3.7      2      SymGS     10     0.11  3.17/1
32/4     3.6  5.5      5      SymGS     10     0.11  3.17/2
Figure 3.16: Residual decrease history for some runs in tables 3.15, 3.16.
Table 3.18: Adaptivity: FOSLS Helmholtz, k^2 = 100, curl penalty = 1, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used inside nested CG cycle. Conjugate gradient acceleration is used.

Crs. f.  C_A  C_{A+P}  N_lev  Smoother  N_vec  ρ     Fig/Run
32/4     2.6  3.7      2      SymGS     10     0.10  3.17/3
32/4     4.0  6.1      6      SymGS     11     0.13  3.17/4
Figure 3.17: Residual decrease history for some runs in tables 3.17, 3.18.
Table 3.19: Adaptivity: FOSLS Helmholtz, k^2 = 200, curl penalty = 1, 6400 elements, 9963 dofs. "Partially-harmonic" prolongator is used inside nested CG cycle. Conjugate gradient acceleration is used.

Crs. f.  C_A  C_{A+P}  N_lev  Smoother  N_vec  ρ     Fig/Run
32/4     2.5  3.6      2      SymGS     9      0.13  3.18/1
32/4     3.6  5.5      5      SymGS     10     0.18  3.18/2
Table 3.20: Adaptivity: FOSLS Helmholtz, k^2 = 200, curl penalty = 1, 25600 elements, 39123 dofs. "Partially-harmonic" prolongator is used inside nested CG cycle. Conjugate gradient acceleration is used.

Crs. f.  C_A  C_{A+P}  N_lev  Smoother  N_vec  ρ     Fig/Run
32/4     2.7  3.7      2      SymGS     10     0.10  3.18/3
32/4     4.1  6.2      6      SymGS     11     0.25  3.18/4
Figure 3.18: Residual decrease history for some runs in tables 3.19, 3.20.
Figure 3.19: One of coarsest-grid basis functions, u_x component. Adaptive AMG for FOSLS Helmholtz, k^2 = 200, 25600 elements, 39123 dofs. Partially-harmonic prolongator is used.
Figure 3.20: One of coarsest-grid basis functions, u_y component. Adaptive AMG for FOSLS Helmholtz, k^2 = 200, 25600 elements, 39123 dofs. Partially-harmonic prolongator is used.
Figure 3.21: One of coarsest-grid basis functions, p component. Adaptive AMG for FOSLS Helmholtz, k^2 = 200, 25600 elements, 39123 dofs. Partially-harmonic prolongator is used.
Figure 3.22: Adaptivity: FOSLS Helmholtz, k^2 = 200, 6400 elements, 9963 dofs. Partially-harmonic prolongator is used inside nested CG cycle. Complexity and convergence rate as functions of the number of incorporated vectors. The convergence rate is for the stationary method.
3.7 Conclusions
We have implemented a version of the spectral agglomerate AMGe method that leads in a natural way to (orthogonal) tentative prolongation matrices. These tentative prolongators can then be used to construct more stable interpolation matrices by harmonic extension (in the interior of the agglomerated elements), which results in better convergence properties of the standard V-cycle MG. The spectral choice of coarse dofs by itself defines a scale of AMG methods that get more powerful (in terms of the convergence rate) at the expense of increasingly higher operator complexity. Another alternative that we explored was to keep the tentative prolongators in the computation and compensate for their poor stability by using more expensive MG cycles; we considered the nested CG cycle, which gives rise to (mildly) nonlinear MG mappings. Finally, we note that a low-tolerance spectral agglomerate AMGe can be used to initiate an adaptive AMG cycle. Once the initial AMG hierarchy has been constructed, the individual element matrices are no longer needed for the adaptive AMG we constructed. We chose, in the numerical tests, to construct the initial AMG hierarchy from scratch, a version that completely eliminated the need for the fine-grid element matrices. This particular method did, though, utilize the fine-grid element topology relations needed to construct the multilevel agglomerates and the respective minimal intersection sets. All approaches offer the potential to generate AMG methods with better convergence properties. The approach based on tentative prolongators (and more expensive MG cycles) has the fastest setup among all; however, each cycle is more expensive (in setup and/or in cost per cycle). In general, all approaches can get fairly expensive if we want to achieve very small convergence factors. We have tried to demonstrate their performance on a variety of test problems, including scalar anisotropic diffusion as well as a Helmholtz problem in an SPD (FOSLS) formulation.
4. BLOPEX
4.1 Introduction
In this chapter we describe a new software package, Block Locally Optimal Preconditioned Eigenvalue Xolvers (BLOPEX), revision 1.0, publicly released recently. The exposition here is based on the papers Knyazev et al. [39], Lashuk et al. [43]. BLOPEX is available as a stand-alone serial library, which can be downloaded from http://math.cudenver.edu/~aknyazev/software/BLOPEX/, as an external package to the Portable, Extensible Toolkit for Scientific Computation (PETSc) Balay et al. [6], and is built into the High Performance Preconditioners (hypre) library Falgout et al. [19, 20]. Jose Roman has also written a Scalable Library for Eigenvalue Problem Computations (SLEPc) Hernandez et al. [25] interface to our hypre BLOPEX.
The native hypre BLOPEX version efficiently and directly takes advantage of powerful hypre multigrid preconditioners, both algebraic (AMG, called BoomerAMG in hypre) and geometric or structured (SMG). BLOPEX in PETSc gives the PETSc community easy access to the customizable code of a preconditioned eigensolver.
At present, BLOPEX includes only one solver: the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method Knyazev [32, 35] for eigenproblems Ax = λBx with a large, possibly sparse, symmetric matrix A and a symmetric positive definite matrix B. Preconditioned eigenvalue solvers in general, see Knyazev [34, 35], and the LOBPCG method in particular have recently attracted attention as potential competitors to (block) Lanczos methods. LOBPCG has been implemented using different computer languages in a number of other software packages: C++ in Anasazi Trilinos Arbenz et al. [1], Heroux et al. [26] and in NGSolve Zaglmayr [62], Sections 7.2-7.4; C in PRIMME Stathopoulos [52]; FORTRAN 77 in PLTMG Bank [7]; FORTRAN 90 in ABINIT Bottin et al. [9] and in PESCAN Tomov et al. [54]; Python in SciPy by Robert Cimrman and in PYFEMax by Peter Arbenz and Roman Geus. LOBPCG has been independently tested in Yamada et al. [59, 60] for the fermion Hubbard model on the Earth Simulator CDIR/MPI; in Arbenz et al. [1] and Borzi and Borzi [8] using AMG preconditioning for vibration problems; and in Tomov et al. [54], Yang et al. [61] for electronic structure calculations. Our hypre BLOPEX version has been used in Bramble et al. [10] for the Maxwell problem.
Section 4.2 contains a complete and detailed description of the LOBPCG algorithm as implemented in BLOPEX. We discuss our specific abstract implementation approach in section 4.3. Parallel performance on distributed and shared memory systems using domain decomposition and multigrid preconditioning is discussed in section 4.4. We collect technical information, e.g., the list of acronyms, in appendix A.
4.2 LOBPCG and its Implementations
In subsection 4.2.1, we briefly describe the ideas behind the LOBPCG
method mostly following Knyazev [32, 35]. In subsection 4.2.2, we present a
complete description of the LOBPCG algorithm as implemented in BLOPEX.
Deflation by hard and soft locking is discussed in subsection 4.2.3.
4.2.1 LOBPCG Ideas and Description
4.2.1.1 The problem and the assumptions
We consider the problem of computing the m smallest eigenvalues and the corresponding eigenvectors of the generalized eigenvalue problem Ax = λBx, with symmetric (Hermitian in the complex case) matrices A and B, the latter assumed to be positive definite. For a given approximate eigenvector x, we use the traditional Rayleigh quotient λ(x) = (x, Ax)/(x, Bx) in the standard scalar product as an approximation to the corresponding eigenvalue. We emphasize that only the matrix B is assumed to be positive definite. There is no such requirement on A; moreover, in all formulas in this section we can replace A with A + αB, and the formulas are invariant with respect to a real shift α.
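The shift invariance is easy to check numerically: (x, (A + αB)x)/(x, Bx) = λ(x) + α for any fixed x. A quick sketch of ours, with small made-up matrices:

```python
def rayleigh_quotient(a, b, x):
    """lambda(x) = (x, A x) / (x, B x) for dense symmetric A and SPD B."""
    n = len(x)
    ax = [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
    bx = [sum(b[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(x[i] * ax[i] for i in range(n)) / sum(x[i] * bx[i] for i in range(n))

a = [[2.0, -1.0], [-1.0, 2.0]]
b = [[2.0, 0.0], [0.0, 1.0]]
x = [0.3, 1.1]
alpha = 5.0
shifted = [[a[i][j] + alpha * b[i][j] for j in range(2)] for i in range(2)]
# rayleigh_quotient(shifted, b, x) equals rayleigh_quotient(a, b, x) + alpha
```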
To accelerate the convergence, we introduce a preconditioner T, which is typically a function that for a given vector x produces Tx. The existing theory, e.g., Knyazev [32], Knyazev and Neymeyr [38], of the LOBPCG method is based on several assumptions: T needs to be linear, symmetric, and positive definite. Under these assumptions a specific convergence rate bound is proved in Knyazev [32], Knyazev and Neymeyr [38]. Numerical tests suggest that all these requirements on T are necessary in order to guarantee this convergence rate. The use of preconditioning that does not satisfy the requirements above not only can slow down the convergence, but may also lead to a breakdown of the method and severe instabilities in the code, resulting in incorrect results that should not be mistakenly interpreted as bugs in the code. The method is robust, however, if T changes from iteration to iteration, as long as on every step it satisfies the above assumptions.
4.2.1.2 Single-Vector LOBPCG
For computing only the smallest eigenpair, i.e., if m = 1, the LOBPCG method takes the form of a 3-term recurrence:

    x^(i+1) = x^(i) + τ^(i) w^(i) + γ^(i) x^(i-1),
    w^(i) = T(Ax^(i) - λ^(i) Bx^(i)),  λ^(i) = λ(x^(i)) = (x^(i), Ax^(i))/(Bx^(i), x^(i)),    (4.1)

with properly chosen scalar iteration parameters τ^(i) and γ^(i). The easiest and perhaps the most efficient choice of parameters is based on an idea of local optimality Knyazev [32, 33, 35], namely, select τ^(i) and γ^(i) that minimize the Rayleigh quotient λ(x^(i+1)) by using the Rayleigh-Ritz procedure in the three-dimensional trial subspace spanning x^(i), w^(i) and x^(i-1). We do not describe the Rayleigh-Ritz procedure using an arbitrary basis of the trial subspace in this chapter and assume that the reader is familiar with it.
If γ^(i) = 0, method (4.1) turns into the classical preconditioned locally optimal steepest descent method, so one can think of method (4.1) as an attempt to accelerate the steepest descent method by adding an extra term to the iterative recurrence; e.g., Knyazev [33, 35]. The general idea of accelerating an iterative sequence by involving the previous approximation is very natural and well known. Let us note here that if instead of the previous approximation x^(i-1) we use the previous preconditioned residual w^(i-1) in formula (4.1), we get a different method that in practical tests converges not as fast as method (4.1).
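The γ^(i) = 0 special case, the locally optimal preconditioned steepest descent, can be sketched compactly: on each step, minimize the Rayleigh quotient over span{x, w} by solving the 2x2 generalized eigenproblem for the Gram matrices. The pure-Python sketch below is our own simplification with B = I and T = I (LOBPCG would add the third basis vector x^(i-1)):

```python
import math

def matvec(a, x):
    return [sum(row[j] * x[j] for j in range(len(x))) for row in a]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def min_ritz_pair_2d(ga, gb):
    """Smallest eigenpair of the 2x2 pencil ga - lam * gb (gb SPD)."""
    a2 = gb[0][0] * gb[1][1] - gb[0][1] * gb[1][0]
    a1 = -(ga[0][0] * gb[1][1] + ga[1][1] * gb[0][0]
           - ga[0][1] * gb[1][0] - ga[1][0] * gb[0][1])
    a0 = ga[0][0] * ga[1][1] - ga[0][1] * ga[1][0]
    lam = (-a1 - math.sqrt(max(a1 * a1 - 4.0 * a2 * a0, 0.0))) / (2.0 * a2)
    r0, r1 = ga[0][0] - lam * gb[0][0], ga[0][1] - lam * gb[0][1]
    return lam, ((-r1, r0) if abs(r0) + abs(r1) > 0.0 else (1.0, 0.0))

def steepest_descent(a, x, iters):
    """Locally optimal preconditioned steepest descent for A x = lam x
    (the gamma = 0 case of (4.1), with B = I, T = I): on every step
    minimize the Rayleigh quotient over span{x, w} by Rayleigh-Ritz."""
    nrm = math.sqrt(dot(x, x))
    x = [xi / nrm for xi in x]
    lam = dot(x, matvec(a, x))
    for _ in range(iters):
        w = [ax - lam * xi for ax, xi in zip(matvec(a, x), x)]  # residual
        if dot(w, w) <= 1e-28:  # converged; avoid a degenerate pencil
            break
        ga = [[dot(u, matvec(a, v)) for v in (x, w)] for u in (x, w)]
        gb = [[dot(u, v) for v in (x, w)] for u in (x, w)]
        _, (c0, c1) = min_ritz_pair_2d(ga, gb)
        x = [c0 * xi + c1 * wi for xi, wi in zip(x, w)]
        nrm = math.sqrt(dot(x, x))
        x = [xi / nrm for xi in x]
        lam = dot(x, matvec(a, x))
    return lam, x

# Toy run on the 1d Laplacian; the smallest eigenvalue is 2 - sqrt(2).
a = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
lam, v = steepest_descent(a, [1.0, 0.2, -0.3], iters=60)
```

Note that w is automatically orthogonal to x (since λ is the Rayleigh quotient of x), so the 2x2 Gram matrix gb stays well conditioned as long as the residual is nonzero.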
When implementing the Rayleigh-Ritz procedure in the trial subspace spanning x^(i), w^(i) and x^(i-1), we need to be careful in choosing an appropriate basis that leads to reasonably conditioned Gram matrices. The current eigenvector approximation x^(i) and the previous eigenvector approximation x^(i-1) get closer to each other in the process of iterations, so special measures need to be taken in method (4.1) to overcome the potential numerical instability due to the round-off errors. Using both vectors x^(i) and x^(i-1) as basis vectors of the trial subspace leads to ill-conditioned Gram matrices, and the Rayleigh-Ritz method can produce spurious eigenpairs.
A simple fix is suggested in Knyazev [32]: use a three-term recurrence that contains the current eigenvector approximation, the preconditioned residual, and the implicitly computed difference between the current and the previous eigenvector approximations:

    x^(i+1) = x^(i) + τ^(i) w^(i) + γ^(i) p^(i),   w^(i) = T(Ax^(i) - λ^(i) Bx^(i)),
    p^(i+1) = τ^(i) w^(i) + γ^(i) p^(i),           λ^(i) = λ(x^(i)).    (4.2)

We have x^(i+1) ∈ span{w^(i), x^(i), p^(i)} = span{w^(i), x^(i), x^(i-1)}, as p^(i+1) = x^(i+1) - x^(i); therefore, the new formula (4.2) is mathematically equivalent to (4.1) in exact arithmetic. We choose in (4.2) the scalar iteration parameters τ^(i) and γ^(i) as above, i.e., minimizing the Rayleigh quotient λ(x^(i+1)).
We note that formula (4.2) is not quite the ultimate fix; e.g., it is possible in principle that at the initial stage of the iterations x^(i) is small, so that p^(i+1) is too close to x^(i+1), and the Rayleigh-Ritz procedure may fail. The proper choice of the basis in the trial subspace span{w^(i), x^(i), p^(i)} for the Rayleigh-Ritz procedure in (4.2), having in mind that the vectors w^(i) and p^(i) converge to zero, is not a trivial issue; see Hetmaniuk and Lehoucq [27] for possible causes of LOBPCG instability.
The locally optimal choice of the step sizes in the LOBPCG allows an easy generalization from the single-vector version (4.1) or (4.2) to the block version described next in subsection 4.2.1.3, where a block of m vectors is iterated simultaneously. We return to the discussion of the choice of the basis in subsection 4.2.2, where it becomes even more vital with the increase of the block size m.
4.2.1.3 The LOBPCG Basics
A block version of LOBPCG for finding the m > 1 smallest eigenpairs is suggested in Knyazev [34, 35]:

    x_j^(i+1) ∈ span{ x_1^(i), ..., x_m^(i),
                      T(Ax_1^(i) - λ_1^(i) Bx_1^(i)), ..., T(Ax_m^(i) - λ_m^(i) Bx_m^(i)),
                      x_1^(i-1), ..., x_m^(i-1) },  j = 1, ..., m,

where x_j^(i+1) is computed as the j-th Ritz vector corresponding to the j-th smallest Ritz value in the Rayleigh-Ritz procedure on this 3m-dimensional trial subspace. The block method has the same problem of having close vectors in the trial subspace as the single-vector version (4.1) discussed in the previous subsection, for which Knyazev [32] suggested the same fix using the directions p_j^(i).
As in other block methods, the block size should be chosen, if possible, to provide a large gap between the first m eigenvalues and the rest of the spectrum, as this typically leads to better convergence; see Knyazev [32], Knyazev and Neymeyr [38], Ovtchinnikov [47]. Block methods generally handle clusters in the spectrum and multiple eigenvalues quite well, and the LOBPCG is no exception. An attempt should be made to include the whole cluster of eigenvalues in the block, while for multiple eigenvalues this is not essential at all. The numerical experience is that the smallest eigenvalues often converge faster, and that the few eigenvalues with noticeably slow convergence usually are the largest in the block and typically correspond to a cluster of eigenvalues that is not completely included in the block. A possible cure is to use a flexible block size a bit larger than the number of eigenpairs actually wanted and to ignore poorly converging eigenpairs.
The block size m should not be so big that the costs of the Rayleigh-Ritz procedure in the 3m-dimensional trial subspace dominate the costs of iterations. If a large number of eigenpairs is needed, deflation is necessary; see subsection 4.2.3. The optimal choice of the block size m is problem-, software- and computer-dependent; e.g., it is affected by such details as the amount of CPU cache and the use of optimized BLAS library functions in the multivector implementation, see section 4.3.
4.2.2 The Detailed Description of the LOBPCG Algorithm
The description of the LOBPCG algorithm as implemented in our BLOPEX 1.0 code follows:

Input: m starting linearly independent vectors in X ∈ R^{n×m}, l linearly independent constraint vectors in Y ∈ R^{n×l}, devices to compute A*X, B*X and T*X.

1. Allocate memory: W, P, Q, AX, AW, AP, BX, BW, BP ∈ R^{n×m}, BY ∈ R^{n×l}.
2. Apply the constraints to X:
   BY = B*Y;  X = X - Y*(Y^T*BY)^{-1}*((BY)^T*X).
3. B-orthonormalize X: BX = B*X; R = chol(X^T*BX); X = X*R^{-1};
   BX = BX*R^{-1}; AX = A*X. (Note: chol is the Cholesky decomposition.)
4. Compute the initial Ritz vectors: solve the eigenproblem
   (X^T*AX)*TMP = TMP*Λ,
   and compute X = X*TMP; AX = AX*TMP; BX = BX*TMP.
5. Define the index set J of active iterates to be {1, ..., m}.
6. for k = 0, ..., MaxIterations:
7.    Compute the residuals: W_J = AX_J - BX_J*Λ_J.
8.    Exclude from the index set J the indices that correspond to residual
      vectors for which the norm has become smaller than the tolerance.
      If J then becomes empty, exit loop.
9.    Apply the preconditioner T to the residuals: W_J = T*W_J.
10.   Apply the constraints to the preconditioned residuals W_J:
      W_J = W_J - Y*(Y^T*BY)^{-1}*((BY)^T*W_J).
11.   Compute BW_J and B-orthonormalize W_J: BW_J = B*W_J;
      R = chol(W_J^T*BW_J); W_J = W_J*R^{-1}; BW_J = BW_J*R^{-1}.
12.   Compute AW_J: AW_J = A*W_J.
13.   if k > 0
14.      B-orthonormalize P_J: R = chol(P_J^T*BP_J); P_J = P_J*R^{-1};
15.      Update AP_J = AP_J*R^{-1}; BP_J = BP_J*R^{-1}.
16.   end if
      Perform the Rayleigh-Ritz procedure:
      Compute symmetric Gram matrices:
17.   if k > 0
