Citation
Black box multigrid for convection-diffusion equations on advanced computers

Material Information

Title:
Black box multigrid for convection-diffusion equations on advanced computers
Creator:
Bandy, Victor Alan
Place of Publication:
Denver, CO
Publisher:
University of Colorado Denver
Publication Date:
1996
Language:
English
Physical Description:
xxiv, 344 leaves : illustrations ; 29 cm

Subjects

Subjects / Keywords:
Differential equations, Partial ( lcsh )
Multi-grid methods (numerical analysis) ( lcsh )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 337-344).
Thesis:
Submitted in partial fulfillment of the requirements for the degree, Doctor of Philosophy, Department of Mathematical and Statistical Sciences
Statement of Responsibility:
by Victor Alan Bandy.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
37296643 ( OCLC )
ocm37296643
Classification:
LD1190.L622 1996d .B36 ( lcc )


Full Text
BLACK BOX MULTIGRID FOR
CONVECTION-DIFFUSION EQUATIONS
ON ADVANCED COMPUTERS
by
VICTOR ALAN BANDY
M.S., University of Colorado at Denver, 1988
B.S., Oregon State University, 1983
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Mathematics
1996


This thesis for the Doctor of Philosophy degree by
Victor Alan Bandy
has been approved for the
Department of
Mathematics
by
Gita Alaghband
Date


Bandy, Victor Alan (Ph. D., Applied Mathematics)
Black Box Multigrid for Convection-Diffusion Equations on Advanced Computers
Thesis directed by Dr. Joel E. Dendy, Jr.
ABSTRACT
In this thesis we present Black Box Multigrid methods for the solution of
convection-diffusion equations with anisotropic and discontinuous coefficients on ad-
vanced computers. The methods can be classified as using either standard coarsening or
semi-coarsening for the generation of the coarse grids. The domains are assumed to be
either two or three dimensional with a logically rectangular mesh structure being used
for the discretization.
New grid transfer operators are presented and compared to earlier grid transfer
operators. The new operators are found to be more robust for convection-diffusion
equations.
Local mode and model problem analysis are used to examine several choices
of iterative methods for the smoother and their relative effectiveness for the class of
problems under consideration. The red/black alternating line Gauss-Seidel method
and the incomplete line LU (ILLU) by lines-in-x methods were found to be the most
robust for two dimensional domains, and red/black alternating plane Gauss-Seidel,
using the 2D black box multigrid method for the plane solves, was found to be the
most robust and efficient smoother for 3D problems.
The Black Box Multigrid methods were developed to be portable, but opti-
mized for either vector computers, such as the Cray Y-MP, or for parallel computers,


such as the CM-5. While the computer architectures are very different, they represent
two of the main directions that supercomputer architectures are moving in today. Per-
formance measures for a variety of test problems are presented for the two computers.
The vectorized methods are suitable for another large class of common com-
puters that use superscalar pipelined processors, such as PCs and workstations. While
the codes have not been optimized for these computers, especially when considering
caching issues, they do perform quite well. Some timing results are presented for a Sun
Sparc-5 for comparison with the supercomputers.
This abstract accurately represents the contents of the candidate's thesis. I
recommend its publication.


To my Mom, Lee Buchanan, and everyone else who kept on asking
When are you going to finish?


CONTENTS
CHAPTER
1 INTRODUCTION ......................................................... 1
1.1 Summary............................................................... 1
1.1.1 Previous Results............................................... 2
1.1.2 New Contributions.............................................. 6
1.2 Class of Problems..................................................... 9
1.3 Discretization of the Problem.................................... 10
1.4 Multigrid Overview................................................... 13
1.4.1 Multigrid Cycling Strategies.................................. 19
1.5 Black Box Multigrid................................................. 24
2 DISCRETIZATIONS: FINITE DIFFERENCE AND FINITE VOLUME . 27
2.1 Finite Difference Discretization..................................... 28
2.2 Finite Volume Discretization ........................................ 31
2.3 Cell Centered Finite Volume Discretization; Evaluation at the Vertices 34
2.3.1 Interior Finite Volumes ...................................... 36
2.3.2 Dirichlet Boundary Condition.................................. 37
2.3.3 Neumann and Robin Boundary Conditions...................... 38
2.4 Cell Centered Finite Volume Discretization; Evaluation at the Cell
Centers ............................................................. 39
2.4.1 Interior Finite Volumes ...................................... 40
2.4.2 Dirichlet Boundary Condition.................................. 41


2.4.3 Neumann and Robin Boundary Conditions....................... 42
2.5 Vertex Centered Finite Volume Discretization; Evaluation at the Ver-
tices .................................................................... 42
2.5.1 Interior Finite Volumes ............................... 42
2.5.2 Edge Boundary Finite Volumes ............................. 43
2.5.3 Dirichlet Boundary Condition.................................. 43
2.5.4 Neumann and Robin Boundary Conditions..................... 43
2.5.5 Corner Boundary Finite Volumes............................ 44
2.5.6 Dirichlet Boundary Condition.................................. 45
2.5.7 Neumann and Robin Boundary Conditions..................... 45
2.6 Vertex Centered Finite Volume Discretization; Evaluation at the Cell
Vertices............................................................ 46
2.6.1 Interior Finite Volumes .................................. 46
2.6.2 Dirichlet Boundary Condition.................................. 47
2.6.3 Neumann and Robin Boundary Conditions..................... 47
2.6.4 Corner Boundary Finite Volumes................................ 48
2.6.5 Dirichlet Boundary Condition.................................. 48
2.6.6 Neumann and Robin Boundary Conditions..................... 49
3 PROLONGATION AND RESTRICTION OPERATORS.................................... 51
3.1 Prolongation ........................................................ 52
3.1.1 Prolongation Correction Near Boundaries....................... 55
3.2 Restriction ...................................................... 56
3.3 Overview ............................................................ 56
3.4 Symmetric Grid Operator Lh: Collapsing Methods....................... 59
3.5 Nonsymmetric Grid Operator Lh: Collapsing Methods ................... 65
3.5.1 Prolongation Based on symm(L^h) .............................. 65


3.5.2 Prolongation Based on L^h and symm(L^h)..................... 68
3.5.3 Grid Transfer Operators Based on a hybrid form of L^h and
symm(L^h)............................................... 68
3.6 Nonsymmetric Grid Operators: Extension of Schaffer's Idea....... 69
3.6.1 Extension of Schaffer's Idea to Standard Coarsening..... 71
3.7 Conclusions Regarding Grid Transfer Operators..................... 73
4 BASIC ITERATION METHODS FOR SMOOTHERS................................ 75
4.1 Overview of Basic Iteration Methods............................... 75
4.2 Gauss-Seidel Relaxation........................................... 79
4.2.1 Point Gauss-Seidel Iteration ............................. 80
4.2.2 Line Gauss-Seidel Iteration by Lines in X.................. 83
4.2.3 Line Gauss-Seidel Iteration by Lines in Y.................. 84
4.2.4 Alternating Line Gauss-Seidel Iteration..................... 86
4.3 Incomplete Line LU Iteration...................................... 86
5 FOURIER MODE ANALYSIS OF SMOOTHERS ................................. 91
5.1 Introduction...................................................... 91
5.2 Motivation ...................................................... 92
5.3 Overview of Smoothing Analysis.................................... 94
5.4 2D Model Problems ............................................... 101
5.5 Local Mode Analysis for Point Gauss-Seidel Relaxation............ 102
5.6 Local Mode Analysis for Line Gauss-Seidel Relaxation............. 111
5.7 Local Mode Analysis: Alternating Line Gauss-Seidel and ILLU Iteration 115
5.8 Local Mode Analysis Conclusions.................................. 120
5.9 Other Iterative Methods Considered for Smoothers................. 122
6 VECTOR ALGORITHM CONSIDERATIONS IN TWO DIMENSIONS . 125
6.1 Cray Hardware Overview........................................... 127


6.2 Memory Mapping and Data Structures.................................. 131
6.3 Scalar Temporaries................................................. 132
6.4 In-Code Compiler Directives......................................... 133
6.5 Inlining............................................................ 134
6.6 Loop Swapping....................................................... 135
6.7 Loop Unrolling...................................................... 135
6.8 Loops and Conditionals ........................................... 135
6.9 Scalar Operations................................................... 136
6.10 Compiler Options................................................... 136
6.11 Some Algorithmic Considerations for Smoothers ..................... 137
6.11.1 Point Gauss-Seidel Relaxation................................ 137
6.11.2 Line Gauss-Seidel Relaxation................................ 138
6.12 Coarsest Grid Direct Solver ..................................... 139
6.13 l2-Norm of the Residual......................................... 140
6.14 2D Standard Coarsening Vector Algorithm ....................... 144
6.14.1 Coarsening ............................................ 144
6.14.2 Data Structures........................................... 144
6.14.3 Smoothers................................................. 145
6.14.4 Coarsest Grid Solver........................................ 146
6.14.5 Grid Transfer Operators..................................... 146
6.14.6 Coarse Grid Operators....................................... 146
6.15 2D Semi-Coarsening Vector Algorithm................................ 146
6.15.1 Data Structures.............................................. 146
6.15.2 Coarsening................................................... 146
6.15.3 Smoothers................................................... 146
6.15.4 Coarsest Grid Solver........................................ 147


6.15.5 Grid Transfer Operators............................... 147
6.15.6 Coarse Grid Operators.................................. 147
7 2D NUMERICAL RESULTS............................................ 148
7.1 Storage Requirements.......................................... 148
7.2 Vectorization Speedup......................................... 151
7.3 2D Computational Work..........................................156
7.4 Timing Results for Test Problems.............................. 157
7.5 Numerical Results for Test Problem 8........................ 165
7.6 Numerical Results for Test Problem 9.......................... 174
7.7 Numerical Results for Test Problem 10......................... 181
7.8 Numerical Results for Test Problem 11......................... 187
7.9 Numerical Results for Test Problem 13......................... 191
7.10 Numerical Results for Test Problem 17..........................194
7.11 Comparison of 2D Black Box Multigrid Methods.................. 198
8 PARALLEL ALGORITHM CONSIDERATIONS IN TWO DIMENSIONS 203
8.1 CM-2 and CM-200 Parallel Algorithms........................... 203
8.1.1 Timing Comparisons...................................... 206
8.2 CM-5 Hardware Overview........................................ 207
8.3 CM-5 Memory Management........................................ 215
8.4 Dynamic Memory Management Utilities......................... 219
8.5 CM-5 Software Considerations ............................... 222
8.6 Coarsening and Data Structures in 2D.......................... 223
8.7 Coarse Grid Operators......................................... 227
8.8 Grid Transfer Operators....................................... 228
8.9 Smoothers.................................................... 229
8.9.1 Parallel Line Gauss-Seidel Relaxation................... 229


8.9.2 CM-5 Tridiagonal Line Solver Using Cyclic Reduction...... 230
8.10 Coarsest Grid Solver............................................. 233
8.11 Miscellaneous Software Issues.................................... 236
8.11.1 Using Scalapack .......................................... 236
8.11.2 Poly-Shift Communication................................. 237
8.12 2D Standard Coarsening Parallel Algorithm ..................... 237
8.12.1 Data Structures................ . ..................... 238
8.12.2 Coarsening................................................ 238
8.12.3 Smoothers................................................. 239
8.12.4 Coarsest Grid Solver....................................... 239
8.12.5 Grid Transfer Operators.................................... 239
8.12.6 Coarse Grid Operators..................................... 240
8.13 2D Semi-Coarsening Parallel Algorithm............................ 240
8.13.1 Data Structures ......................................... 240
8.13.2 Coarsening................................................. 240
8.13.3 Smoothers................................................ 241
8.13.4 Coarsest Grid Solver..................................... 241
8.13.5 Grid Transfer Operators................................... 241
8.13.6 Coarse Grid Operators ................................... 241
8.14 2D Parallel Timings ............................................. 241
9 BLACK BOX MULTIGRID IN THREE DIMENSIONS.......................... 250
9.1 Introduction. ................................................... 250
9.1.1 Semi-Coarsening........................................... 251
10 3D DISCRETIZATIONS.................................................... 253
10.1 Finite Difference Discretization................................ 254
10.2 Finite Volume Discretization ................................... 254


10.2.1 Interior Finite Volumes.............................. 255
10.2.2 Edge Boundary Finite Volumes......................... 256
10.2.3 Dirichlet Boundary Condition......................... 257
10.2.4 Neumann and Robin Boundary Conditions................ 257
11 3D NONSYMMETRIC GRID TRANSFER OPERATORS........................ 260
11.1 3D Grid Transfer Operations................................. 262
11.2 3D Nonsymmetric Grid Operator: Collapsing Methods........... 264
11.2.1 3D Grid Transfer Operator Variations.............. 268
11.3 3D Coarse Grid Operator . .............................. 268
12 3D SMOOTHERS.................................................... 270
12.1 Point Gauss-Seidel............................................ 270
12.2 Line Gauss-Seidel ............................................ 271
12.3 Plane Gauss-Seidel . ....................................... 272
13 LOCAL MODE ANALYSIS IN THREE DIMENSIONS......................... 274
13.1 Overview of 3D Local Mode Analysis ......................... 274
13.2 Three Dimensional Model Problems ........................... 278
13.3 Local Mode Analysis for Point Gauss-Seidel Relaxation....... 280
13.4 Local Mode Analysis for Line Gauss-Seidel Relaxation........ 285
13.5 Local Mode Analysis for Plane Gauss-Seidel Relaxation....... 293
14 3D VECTOR ALGORITHM CONSIDERATIONS.............................. 308
14.1 3D Smoother .................... ......................... 308
14.2 Data Structures and Memory.................................... 309
14.3 3D Standard Coarsening Vector Algorithm....................... 313
14.3.1 Coarsening............................................ 313
14.3.2 Data Structures........................................ 313
14.3.3 Smoothers.............................................. 314


14.3.4 Coarsest Grid Solver.................................. 314
14.3.5 Grid Transfer Operators............................... 314
14.3.6 Coarse Grid Operators................................. 314
14.4 3D Semi-Coarsening Vector Algorithm.......................... 314
14.4.1 Data Structures....................................... 315
14.4.2 Coarsening............................................ 315
14.4.3 Smoothers............................................. 315
14.4.4 Coarsest Grid Solver.................................. 315
14.4.5 Grid Transfer Operators............................... 315
14.4.6 Coarse Grid Operators................................. 315
14.5 Timing Results for 3D Test Problems ......................... 316
14.6 Numerical Results for 3D Test Problem 1...................... 320
14.7 Numerical Results for 3D Test Problem 2...................... 320
15 PARALLEL 3D BLACK BOX MULTIGRID................................. 324
15.1 3D Standard Coarsening Parallel Algorithm Modifications...... 324
15.2 3D Parallel Smoother ........................................ 324
15.3 3D Data Structures and Communication......................... 326
15.4 3D Parallel Timings.......................................... 326
APPENDIX
A. OBTAINING THE BLACK BOX MULTIGRID CODES........................ 331
B. COMPUTER SYSTEMS USED FOR NUMERICAL RESULTS ..................... 333
B.1 Cray Y-MP................................................ 333
B.2 CM-5......................................................... 335
BIBLIOGRAPHY....................................................... 337


FIGURES
FIGURE
1.1 Standard coarsening: superimposed fine grid Gh and coarse grid GH. . . 14
1.2 Semi-coarsening: superimposed fine grid Gh and coarse grid GH............ 15
1.3 One V-cycle iteration for five grid levels............................... 20
1.4 One S-cycle iteration for four grid levels............................... 22
1.5 One W-cycle iteration for four grid levels............................... 22
1.6 One F-cycle iteration for five grid levels................................ 23
2.1 Vertex centered finite volume grid....................................... 32
2.2 Cell centered finite volume grid..................................... 33
2.3 Cell centered finite volume Ω_{i,j}. ................................. 35
2.4 Vertex centered finite volume Ω_{i,j} at y = 0........................ 43
2.5 Southwest boundary corner finite volume.................................. 44
3.1 Standard coarsening interpolation 2D cases............................... 53
6.1 Cray Y-MP hardware diagram.............................................. 128
6.2 Cray CPU configuration.................................................. 128
6.3 2D Data Structures ..................................................... 145
7.1 Comparison of Setup time for BMGNS, SCBMG, and MGD9V ...... 154
7.2 Comparison of one V-cycle time for BMGNS, SCBMG, and MGD9V . . 155
7.3 Domain Ω for problem 8................................................. 166
7.4 Domain Ω for problem 9................................................. 174
7.5 Domain Ω for problem 10................................................ 181


7.6 Domain Ω for problem 11.............................................. 187
7.7 Domain Ω for problem 13.............................................. 191
7.8 Domain Ω for problem 17.............................................. 195
8.1 CM-5 system diagram.................................................. 210
8.2 CM-5 processor node diagram............................................ 212
8.3 CM-5 vector unit diagram............................................... 214
8.4 CM-5 processor node memory map......................................... 217
8.5 Grid Data Structure Layout............................................. 225
9.1 Grid operator stencil in three dimensions.............................. 252
11.1 Grid transfer operators stencil in three dimensions................... 261
14.1 3D FSS data structure.................................................. 311


TABLES
TABLE
5.1 Smoothing factor μ for point Gauss-Seidel relaxation for anisotropic dif-
fusion equations...................................................... 109
5.2 Smoothing factor μ for point Gauss-Seidel relaxation for convection-
diffusion equations................................................... 110
5.3 Smoothing factor μ for x- and y-line Gauss-Seidel relaxation for anisotropic
diffusion equations................................................... 114
5.4 Smoothing factor μ for x- and y-line Gauss-Seidel relaxation for convection-
diffusion equations................................................... 116
5.5 Smoothing factor μ for alternating line Gauss-Seidel relaxation and in-
complete line LU iteration for anisotropic diffusion equations..... 119
5.6 Smoothing factor μ for alternating line Gauss-Seidel relaxation and in-
complete line LU iteration for convection-diffusion equations ....... 121
6.1 Cray Y-MP Timings for the Naive, Kahan, and Doubling Summation
Algorithms.......................................................... 143
6.2 Sparc5 Timings for the Naive, Kahan, and Doubling Summation Algorithms. 144
7.1 Memory storage requirements for the Cray Y-MP......................... 149
7.2 Storage requirements for BMGNS, SCBMG, and MGD9V...................... 150
7.3 Vectorization speedup factors for standard coarsening................. 151
7.4 Vectorization speedup factors for semi-coarsening..................... 152
7.5 Operation count for standard coarsening setup......................... 156


7.6 Operation count for standard coarsening residual and grid transfers. . . 157
7.7 Operation count for standard coarsening smoothers......................... 158
7.8 Timing for standard coarsening on problem 8............................... 158
7.9 Grid transfer timing comparison for standard and semi-coarsening....... 160
7.10 Timing for various smoothers.......................................... 161
7.11 Smoothing versus grid transfer timing ratios.......................... 162
7.12 Setup times for the various grid transfers............................ 163
7.13 V-cycle time for various smoothers.................................... 164
7.14 Number of V-cycles for standard coarsening using the extension of Schaf-
fer's idea for problem 8.................................................. 166
7.15 Number of V-cycles for standard coarsening using the sL/L grid transfer
for problem 8............................................................. 167
7.16 Number of V-cycles for standard coarsening using the hybrid sL/L grid
transfer for problem 8................................................. 168
7.17 Number of V-cycles for standard coarsening using the symmetric grid
transfer for problem 8.................................................... 169
7.18 Number of V-cycles for standard coarsening using the operator, L/L, grid
transfer for problem 8.................................................... 169
7.19 Number of V-cycles for standard coarsening using the extension of Schaf-
fer's idea for problem 8 with ILLU........................................ 170
7.20 Number of V-cycles for standard coarsening using the sL/L grid transfer
for problem 8 with ILLU................................................... 170
7.21 Number of V-cycles for standard coarsening using the hybrid sL/L grid
transfer for problem 8 with ILLU....................................... 171
7.22 Number of V-cycles for standard coarsening using the symmetric grid
transfer for problem 8 with ILLU....................................... 171


7.23 Number of V-cycles for standard coarsening using the operator, L/L, grid
transfer for problem 8 with ILLU........................................ 172
7.24 Number of V-cycles for standard coarsening using the extension of Schaf-
fer's idea for problem 9.................................................. 175
7.25 Number of V-cycles for standard coarsening using the sL/L grid transfer
for problem 9. ........................................................ 175
7.26 Number of V-cycles for standard coarsening using the hybrid sL/L grid
transfer for problem 9.................................................. 176
7.27 Number of V-cycles for standard coarsening using the symmetric grid
transfer for problem 9. . ............................................ 176
7.28 Number of V-cycles for standard coarsening using the operator, L/L, grid
transfer for problem 9.................................................. 177
7.29 Number of V-cycles for standard coarsening using the extension of Schaf-
fer's idea for problem 9 with ILLU.......................................... 177
7.30 Number of V-cycles for standard coarsening using the sL/L grid transfer
for problem 9 with ILLU................................................. 178
7.31 Number of V-cycles for standard coarsening using the hybrid sL/L grid
transfer for problem 9 with ILLU. ...................................... 178
7.32 Number of V-cycles for standard coarsening using the symmetric grid
transfer for problem 9 with ILLU. ...................................... 179
7.33 Number of V-cycles for standard coarsening using the operator, L/L, grid
transfer for problem 9 with ILLU........................................ 179
7.34 Number of V-cycles for standard coarsening using the symmetric grid
transfer for problem 9 with 4-direction PGS............................. 180
7.35 Number of V-cycles for standard coarsening using the extension of Schaf-
fer's idea for problem 10................................................. 183


7.36 Number of V-cycles for standard coarsening using the sL/L grid transfer
for problem 10............................................................ 183
7.37 Number of V-cycles for standard coarsening using the hybrid sL/L grid
transfer for problem 10................................................. 184
7.38 Number of V-cycles for standard-coarsening using the symmetric grid
transfer for problem 10................................................. 184
7.39 Number of V-cycles for standard coarsening using the operator, L/L, grid
transfer for problem 10.................................................. 185
7.40 Number of V-cycles for standard coarsening using the extension of Schaf-
fer's idea for problem 10............................................. 185
7.41 Number of V-cycles for standard coarsening using the sL/L grid transfer
for problem 10................................;......................... 185
7.42 Number of V-cycles for standard coarsening using the symmetric grid
transfer for problem 10.................................................. 186
7.43 Number of V-cycles for MGD9V on problem 10............................. 186
7.44 Number of V-cycles for standard coarsening using the extension of Schaf-
fer's idea for problem 11................................................. 188
7.45 Number of V-cycles for standard coarsening using the sL/L grid transfer
for problem 11........................................................... 189
7.46 Number of V-cycles for standard coarsening using the hybrid sL/L grid
transfer for problem 11................................................. 189
7.47 Number of V-cycles for standard coarsening using the symmetric grid
transfer for problem 11. ............................................. 190
7.48 Number of V-cycles for standard coarsening using the operator, L/L, grid
transfer for problem 11................................................. 190


7.49 Number of V-cycles for standard coarsening using the extension of Schaf-
fer's idea for problem 13............................................... 192
7.50 Number of V-cycles for standard coarsening using the hybrid sL/L grid
transfer for problem 13.............................................. 193
7.51 Number of V-cycles for standard coarsening using the symmetric grid
transfer for problem 13. . .......................................... 193
7.52 Number of V-cycles for standard coarsening using the extension of Schaf-
fer's idea for problem 17............................................... 194
7.53 Number of V-cycles for standard coarsening using the original collapsing
method for problem 17............................................. 196
7.54 Number of V-cycles for standard coarsening using the extension of Schaf-
fer's idea for problem 17............................................... 197
7.55 Number of V-cycles for standard coarsening using the hybrid collapsing
method for problem 17............................................. 197
7.56 Number of V-cycles for semi-coarsening for problem 17............... 197
7.57 Comparison for problem 8 on Cray Y-MP................................ 199
7.58 Comparison for problem 9 on Cray Y-MP................................ 201
8.1 Timing Comparison per V-cycle for semi-coarsening on the Cray Y-MP,
CM-2, and CM-200...................................................... 206
8.2 Timing Comparison per V-cycle for Standard Coarsening on the Cray
Y-MP, CM-2, and CM-200............................................... 208
8.3 2D Standard coarsening 32 - 512 CM-5 nodes V-cycle timings............ 243
8.4 2D Standard coarsening 32 - 512 CM-5 nodes Setup timings ............. 244
8.5 2D Standard coarsening 32 - 512 CM-5 nodes parallel efficiency ..... 244
8.6 2D Semi-coarsening 32 - 512 CM-5 nodes V-cycle timings................ 245
8.7 2D Semi-coarsening 32 - 512 CM-5 nodes setup timings.................. 246


8.8 2D Semi-coarsening 32 - 512 CM-5 nodes parallel efficiency............ 246
8.9 2D Timing comparison between CM-5, Cray Y-MP, and Sparc-5............... 248
13.1 Smoothing factor for point Gauss-Seidel relaxation for anisotropic diffu-
sion equations in 3D ..................................................... 282
13.2 Smoothing factor for point Gauss-Seidel relaxation for convection-diffusion
equations in 3D........................................................... 282
13.3 Smoothing factor for point Gauss-Seidel relaxation for convection-diffusion
equations in 3D...................................................... 283
13.4 Smoothing factor for point Gauss-Seidel relaxation for convection-diffusion
equations in 3D......................................................... 284
13.5 Smoothing factors for line Gauss-Seidel relaxation for anisotropic diffusion
equations ............................................................ 289
13.6 Smoothing factors for line Gauss-Seidel relaxation for convection-diffusion
equations............................................................... 290
13.7 Smoothing factors for line Gauss-Seidel relaxation for convection-diffusion
equations.............................................................. 291
13.8 Smoothing factors for line Gauss-Seidel relaxation for convection-diffusion
equations................................................................. 292
13.9 Smoothing factors for zebra line Gauss-Seidel relaxation for anisotropic
diffusion equations....................................................... 293
13.10 Smoothing factors: zebra line Gauss-Seidel relaxation for convection-
diffusion equations....................................................... 294
13.11 Smoothing factors: zebra line Gauss-Seidel relaxation for convection-
diffusion equations ..................................................... 295
13.12 Smoothing factors: zebra line Gauss-Seidel relaxation for convection-
diffusion equations ............................ .................... 296


13.13 Smoothing factor μ for xy-, xz-, and yz-plane Gauss-Seidel relaxation
for anisotropic diffusion equations ................................. 300
13.14 Smoothing factor μ for xy-, xz-, and yz-plane Gauss-Seidel relaxation for
convection-diffusion equations ........................................ 301
13.15 Smoothing factor μ for plane Gauss-Seidel (continued)............... 302
13.16 Smoothing factor μ for plane Gauss-Seidel (continued).............. 303
13.17 Smoothing factors: zebra xy-, xz-, yz-, and alternating plane Gauss-
Seidel relaxation for anisotropic diffusion equations ............... 304
13.18 Smoothing factors: zebra xy-, xz-, yz-, and alternating plane Gauss-Seidel
relaxation for convection-diffusion equations.......................... 305
13.19 Smoothing factor for zebra plane Gauss-Seidel (continued)............ 306
13.20 Smoothing factor for zebra plane Gauss-Seidel (continued)............ 307
14.1 3D Multigrid Component Timing........................................ 317
14.2 Grid transfer timing comparison for standard and semi-coarsening.... 318
14.3 Timing for various smoothers.......................................... 319
14.4 Smoothing versus grid transfer timing ratios.......................... 320
14.5 Numerical results for problem 1 in 3D............................... 321
14.6 Numerical results for problem 1 in 3D................................ 321
14.7 Numerical results for problem 1 in 3D................................ 322
14.8 Numerical results for problem 1 in 3D................................ 323
14.9 Numerical results for problem 1 in 3D............................... 323
14.10 Numerical results for problem 1 in 3D.............................. 323
15.1 3D Standard coarsening 32, 64, 128 CM-5 nodes V-cycle timings....... 327
15.2 3D Standard coarsening 32, 64, 128 CM-5 nodes Setup timings........... 327
15.3 3D Standard coarsening 32, 64, 128 CM-5 nodes parallel efficiency .... 328
15.4 3D Semi-coarsening 32, 64, 128 CM-5 nodes V-cycle timings ............ 329


15.5 3D Semi-coarsening 32, 64, 128 CM-5 nodes setup timings............... 329
15.6 3D Semi-coarsening 32, 64, 128 CM-5 nodes parallel efficiency......... 329
15.7 3D Timing comparison between CM-5 and Cray Y-MP ..................... 330


ACKNOWLEDGMENTS
I would first like to thank my advisor Joel E. Dendy, Jr., Los Alamos National
Laboratory, because without him none of this would have been possible; Thanks! In
addition, at Los Alamos National Laboratory, I would like to thank Mac Hyman of
Group T-7 and the Center for Nonlinear Studies for their support, and the Advanced
Computing Laboratory and the CIC Division for the use of their computing facilities.
This work was partially supported by the Center for Research on Parallel Computation
through NSF Cooperative Agreement No. CCR-8809615.
I would like to thank my PhD committee members, Joel Dendy, Jan Mandel,
Leo Franca, Gita Alaghband, and Steve McCormick. A special thanks to Professors
Bill Briggs, Stan Payne, and Roland Sweet. In addition, I would like to give a big
thanks to Dr. Suely B. Oliveira for getting me back on track.
Finally, I would like to thank my mom, Lee Buchanan, my twin brother,
Fred Bandy, my wife, Darlene Bandy, and all my friends for all their support and
encouragement. Last but not least, a special thanks to Mark and Flavia Kuta, some
very good friends, for letting me stay with them while I was in Denver.


CHAPTER 1
INTRODUCTION
1.1 Summary
The subject of this dissertation is the investigation of Black Box multigrid
solvers for the numerical solution of second order elliptic partial differential equations
in two or three dimensional domains. We place particular emphasis on efficiency on both
vector and parallel computers, represented here by the Cray Y-MP and the Thinking
Machines CM-5.
Black Box multigrid methods are sometimes referred to as geometric multi-
grid methods or, more recently, as automatic multigrid methods, in the literature. The
methods can be considered to be a subclass of algebraic multigrid methods with sev-
eral algorithmic restrictions. Geometric multigrid methods make a priori assumptions
about the domain and the class of problems that are to be solved, and in addition, they
use intergrid operators and coarse grid points based on the geometry and the order of
the grid equation operator. Algebraic multigrid, on the other hand, chooses both the coarse
grid and intergrid operator based only on the coefficient matrix. Black box multigrid
is in between these two, with the grids chosen geometrically, on logically rectangular
grids, and the intergrid operators are chosen algebraically. There are other hybrid
multigrid methods such as the unstructured grid method by Chan [22], which chooses
the coarse grid based on graph theoretical considerations and the intergrid operator
from the nodal coordinates (geometry), and the algebraic multigrid method of Vaněk


[81], which uses kernels of the associated quadratic form in lieu of geometric infor-
mation. The algebraic multigrid method of Stüben and Ruge [66] [67] uses almost the
same construction of the intergrid operator as Dendy [26] once the coarse grid has been chosen,
while Vaněk's work is based on a different idea. The assumptions and the components
that make up the black box multigrid methods are spelled out in more detail in the
following sections of this chapter.
We will examine the development of robust black box multigrid solvers us-
ing both standard and semi-coarsening. The methods are aimed at the solution of
convection-diffusion equations with anisotropic and discontinuous coefficients (inter-
face problems), such that the discrete system of equations need only be specified on a
logically rectangular grid. A guiding principle in the design is that if the discrete sys-
tem of equations is symmetric, then the multigrid coarse grid problems should preserve
that symmetry.
1.1.1 Previous Results. The black box multigrid method was first in-
troduced by Dendy [26]. The method is a practical implementation of a multigrid
method for symmetric diffusion problems with anisotropic and discontinuous coeffi-
cients, represented by
−∇·(D ∇U) + cU = f on Ω ⊂ ℝ². (1.1)
The domain Ω is assumed to be embedded in a logically rectangular mesh and then
discretized in such a manner as to yield a stencil which is no larger than a compact
9-point stencil. The method employs the Galerkin coarse grid approximation, L^H =
I_h^H L^h I_H^h, to form the coarse grid operators, using the robust choice of grid transfer
operators from Alcouffe et al. [1]. The robust choice of grid transfer operators is
an operator induced formulation that, when c = 0, preserves the flux μ·(D ∇U)
across interfaces. In [1] lexicographic point Gauss-Seidel relaxation and alternating


lexicographic line Gauss-Seidel relaxation were the choices available for smoothers. In
subsequent extensions for vector machines, the choices available were red/black (or
four color for nine point operators) point Gauss-Seidel and alternating red/black line
Gauss-Seidel relaxation.
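
To make the Galerkin construction concrete, the following is a minimal sketch in
Python (assuming SciPy sparse matrices; the names A for L^h and P for the
prolongation I_H^h are illustrative, not taken from the thesis codes). For the
symmetric method the restriction is the transpose of the prolongation, which
preserves symmetry on the coarse grid.

    import scipy.sparse as sp

    def galerkin_coarse_operator(A, P):
        # A : sparse fine grid operator L^h (n_f x n_f), e.g. sp.csr_matrix
        # P : sparse prolongation I_H^h (n_f x n_c)
        # Restriction taken as I_h^H = (I_H^h)^T, as in the symmetric method.
        R = P.T.tocsr()
        return (R @ A @ P).tocsr()   # coarse grid operator L^H = I_h^H L^h I_H^h

For the nonsymmetric methods discussed next, the restriction is instead induced by
(L^h)^*, so R above would be built from a second interpolation operator rather than
taken as the transpose of P.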
The black box multigrid method was extended to elliptic convection-diffusion
problems [27], for which the model problem is
−εΔU + U_x + U_y = f on Ω ⊂ ℝ², (1.2)
where ε > 0. The mesh is the same as before and the discretization is of the form
L^h U_{i,j} = −β_h Δ_h U_{i,j} + D_x^h U_{i,j} + D_y^h U_{i,j} = F_{i,j}, (1.3)
where
Δ_h U_{i,j} = (1/h²)(U_{i,j−1} + U_{i−1,j} − 4U_{i,j} + U_{i+1,j} + U_{i,j+1}),
D_x^h U_{i,j} = (1/2h)(U_{i+1,j} − U_{i−1,j}),
and where β_h = ε + βh, with β = 1/2 yielding upstream differencing. A generalization of Galerkin coarse grid
approximation is used to form the coarse grid operators. The prolongation operators
are formed in the same way as they were for the symmetric method, but instead of
being induced by L^h, they are induced by the symmetric part of the grid operator,
symm(L^h) = ½((L^h)^* + L^h). It was found that instead of using I_h^H = (I_H^h)^* to induce
the restriction operator, a more robust choice is to form a new interpolation operator
J_H^h based on (L^h)^* and then to define the restriction operator to be I_h^H = (J_H^h)^*. These
choices were made to generalize the work of [26]. The choice of smoothers was also
changed to include lexicographic point, line, and alternating line Kaczmarz relaxation.


The method performed well for the problems tested as long as β > 0.25, but since
nonphysical oscillations begin to dominate for β < 0.25, this restriction is no difficulty.
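
To see why β = 1/2 corresponds to upstream differencing, consider the one
dimensional analogue (a standard identity, sketched here for illustration rather
than taken from [27]): adding artificial diffusion βh to a centrally differenced
convection term gives

    % central differencing plus artificial diffusion \beta h in 1D;
    % at \beta = 1/2 the convection term collapses to a one-sided
    % (upstream) difference:
    -(\varepsilon + \beta h)\,\frac{U_{i+1} - 2U_i + U_{i-1}}{h^2}
      + \frac{U_{i+1} - U_{i-1}}{2h}
      \;=\;
    -\varepsilon\,\frac{U_{i+1} - 2U_i + U_{i-1}}{h^2}
      + \frac{U_i - U_{i-1}}{h}
      \qquad (\beta = \tfrac{1}{2}).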
The next development was the creation of a 3D black box multigrid solver for
symmetric problems [29]. This method uses the same type of grid transfer operators
as the earlier 2D symmetric method. Two different methods of forming the coarse grid
operators were examined with nearly identical convergence results. The first method
uses Galerkin coarse grid approximation with standard coarsening. The second method
also uses Galerkin coarse grid approximation, but it does so by using auxiliary interme-
diate grids obtained by semi-coarsening successively in each of the three independent
variables. For robustness, alternating red/black plane Gauss-Seidel relaxation was used
for the smoother. The plane solves of the smoother were performed by using the 2D
symmetric black box multigrid solver.
The 2D symmetric black box multigrid solver was then extended to solve
singular and periodic diffusion problems [30]. The existence of a solution, in case
c = 0, is assured by requiring that the equation be consistent, Σ F = 0. The periodic
boundary conditions only impact the multigrid method by requiring the identification
of the auxiliary grid point equations at setup, the identification of the auxiliary grid
point unknowns after interpolation, and the identification of the auxiliary grid point
residuals before restriction. The coarsest grid problem, if c = 0, is singular and cannot
be solved by Gaussian elimination, but since the solution is determined only up to a
constant, the arbitrary addition of the linearly independent condition that U_{i,j} = 0 for
some coarse grid point (i, j) allows solution by Gaussian elimination.
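
A minimal sketch of that pinning device (assuming a dense coarsest grid matrix and
NumPy; the function name is hypothetical):

    import numpy as np

    def solve_pinned(A, f, k=0):
        # A is singular (pure Neumann / periodic with c = 0); replace
        # equation k by the linearly independent condition u_k = 0 so
        # that Gaussian elimination can be applied.  The result is one
        # representative of the one-parameter family of solutions.
        A = A.copy(); f = f.copy()
        A[k, :] = 0.0
        A[k, k] = 1.0
        f[k] = 0.0
        return np.linalg.solve(A, f)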
The first semi-coarsening black box multigrid solver was introduced for the so-
lution of three dimensional petroleum reservoir simulations [33]. This method employs
semi-coarsening in the z-direction and xy-plane relaxation for the smoother. Galerkin
coarse grid approximation is used to form the coarse grid operators. Operator induced


grid transfer operators were used, but only after Schaffer's paper [70] was it realized
how to compute these in a robust manner; see section 3.6.
A two dimensional black box multigrid solver called MGD9V was developed
by de Zeeuw [24]. This method was designed to solve the general elliptic convection-
diffusion equation. The method used standard coarsening, an ILLU smoother, a
V(0, l)-cycle (sawtooth), and a new set of operator induced grid transfer operators
that were designed specifically for convection dominated problems. The method was
found to be more robust than previous methods but was still divergent for problems
with closed convection characteristics on large grids. The method of MGD9V was
developed only for two dimensions and is not parallelizable.
The 2D symmetric black box multigrid solvers [26] [30] were updated by Bandy [9] to be
portable, given consistent user interfaces adhering to the SLATEC software guidelines
[38], and provided with three new user interfaces. One of the interfaces in-
cluded an automatic discretization routine, requiring the user to provide only a function
which can evaluate the coefficients at the fine grid points. The interfaces all included
extensive input parameter validation and memory management for workspace.
A parallel version of the semi-coarsening method for two dimensional scalar
problems for the CM-2 was presented in [32]. A parallel version of semi-coarsening for
two- and three- dimensional problems was presented in [75]. Both papers essentially
relied on the algorithm from [33] and borrowed from Schaffer [69] [70] for the robust
determination of grid transfer operators.
Fourier mode analysis has been used by many multigrid practitioners to find
good smoothers for use in multigrid methods. The results of many of these analyses
have been presented in the literature. Stüben and Trottenberg [78] present several
fundamental results of Fourier mode analysis for a few selected 2D problems. Kettler
[50] reports results for a range of 2D test problems and several lexicographic ordered


Gauss-Seidel methods along with several variations of ILU methods. Wesseling [84]
reports a summary of smoothing analysis results for the 2D rotated anisotropic diffusion
equation and the convection-diffusion equation; however, the results are for only a
limited number of worst case problems. Smoothing analysis results for the red/black
ordered methods appear in many places in the literature, but they are only for a few
selected problems. There are some results in the literature for 3D problems [79], but
just like the 2D results, the analysis is not complete enough for our purposes.
1.1.2 New Contributions. In this thesis we have developed and ex-
tended several black box multigrid methods for both two and three dimensional non-
symmetric problems on sequential, vector, and parallel computing platforms. The new
methods are based on a new implementation of the two dimensional nonsymmetric
black box multigrid method [27] for vector computers. The new implementation was
designed to take better advantage of developments in vector computing, while increas-
ing portability and compatibility with sequential computers. The new implementation
performs with a speedup factor of six over the earlier methods on vector computers,
while providing identical functionality, and it also incorporates many of the ideas and
software features from [9].
The new methods include the development of a three dimensional method,
both vector and parallel versions, and a two dimensional parallel method for nonsym-
metric problems. The new methods were also extended to handle periodic and singular
problems using the modifications from [30].
In [27] a two dimensional nonsymmetric black box multigrid method was
examined for a convection dominated problem with constant convection characteristics.
In this work we investigate the new methods for a general convection-diffusion equation
−∇·(D(x) ∇U(x)) + b(x)·∇U(x) + c(x)U(x) = f(x), x ∈ Ω. (1.4)


When the earlier method of [27] was applied to equation 1.4, but with more vectorizable
smoothers than those in [27], it was found to perform poorly, and even fail, for some
non-constant convection characteristic problems. This poor performance was caused
by both the new smoothers and by poor coarse grid correction. Several new grid
transfer operators are introduced to address these problems, of which two were found
to be robust; see chapter 3. The search for a more robust smoother was facilitated
by using local mode analysis, and led to the implementation of an incomplete line
LU factorization method (ILLU) for the smoother. The ILLU smoother made the
new methods more robust for convection dominated problems. A four-direction point
Gauss-Seidel method was also briefly considered for use as a smoother but was discarded
because it was neither parallelizable nor suitable for anisotropic problems, even though it
was fairly robust for convection dominated problems.
A nonsymmetric black box multigrid method, using standard coarsening, was
created for three dimensional problems; previously only a semi-coarsening version ex-
isted. The new method is the three dimensional analogue of the new two dimensional
black box multigrid method, and it uses alternating red/black plane Gauss-Seidel as
a smoother for robustness. The 3D smoother uses one V(1,1)-cycle of the 2D non-
symmetric black box multigrid method to perform the required plane solves. The new
method was developed to use either the new grid transfer operators from the new 2D
nonsymmetric method or those from the 3D extension of Dendy's 2D nonsymmetric
black box multigrid method. The coarse grid operators are formed using the second
method from [29], which uses auxiliary intermediate grids obtained by successively
applying semi-coarsening in each of the independent variables. In addition, the new
method is designed to handle periodic and singular problems. Another use of local
mode analysis was in the design of robust three dimensional smoothers. Although


there are hints in the literature for how to perform local mode analysis for color relax-
ation in three dimensions, we are unaware of the appearance elsewhere of the detailed
analysis presented in chapter 13.
The new methods are compared to a new implementation of the semi-coarsening
method, which achieves a speedup factor of over 5 for the two dimensional method and a speedup
factor of 2 for the three dimensional method on vector computers.
operators are based on Schaffer's idea; see chapter 3. The 2D semi-coarsening method
uses coarsening in the y-direction coupled with red/black x-line Gauss-Seidel relaxation
for the smoother. The 3D semi-coarsening method uses coarsening in the z-direction
coupled with red/black xy-plane Gauss-Seidel relaxation for the smoother. The new
implementation also includes the ILLU smoother, not present in the original version.
Another aspect of this work was to compare de Zeeuw's MGD9V with the
black box multigrid methods. The idea was to mix and match components of the two
approaches to investigate the strengths and weaknesses and to ascertain if a combi-
nation existed which was better than either. The result obtained from studying the
algorithm components is that MGD9V obtains its robustness from the ILLU smoother
and not from its grid transfer operators. If MGD9V uses alternating red/black line
Gauss-Seidel for its smoother then performance similar to the black box multigrid
methods is observed. Likewise, if ILLU is used as the smoother in the black box
multigrid methods, then the performance is similar to that of MGD9V.
Parallel versions of the standard coarsening nonsymmetric black box multigrid
methods are developed in this thesis and compared with the existing parallel version of
semi-coarsening black box method. The 3D parallel version smoother uses a modified
2D nonsymmetric black box multigrid method to perform the simultaneous solution of
all the planes of a single color.


A hybrid parallel black box multigrid method was developed that uses stan-
dard coarsening for grid levels with a VP (virtual processor) ratio, i.e. number of
grid points per processor, greater than one, and semi-coarsening when the VP ratio
is less than one. When the VP ratio is greater than one, standard coarsening reduces
the number of grid points per processor, reducing the amount of serial work, faster
than in semi-coarsening case. When the VP ratio is less than one, the semi-coarsening
method is more efficient than standard coarsening because it keeps more processors
busy that would otherwise be idle; in addition, tri-diagonal library routines, which are
more efficient than we can write, are available for the data structures. The hybrid
parallel method is the most efficient method on the CM-5 because it uses the most
computationally efficient method for a given VP ratio.
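
The switching rule can be stated in a few lines; the sketch below is illustrative
(the threshold and names are assumptions, not taken from the CM-5 codes):

    def choose_coarsening(grid_points, processors):
        # VP ratio = grid points per processor.  Standard coarsening
        # shrinks the serial work fastest while every processor still
        # owns more than one point; semi-coarsening keeps processors
        # busy once the grid is smaller than the machine.
        vp_ratio = grid_points / processors
        return "standard" if vp_ratio > 1.0 else "semi"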
1.2 Class of Problems
The class of problems that is being addressed is convection-diffusion equations
with anisotropic and discontinuous coefficients on a two- or three- dimensional domain.
These types of problems can be represented by the following equation and boundary
conditions,
LU(x) = −∇·(D(x) ∇U(x)) + b(x)·∇U(x) + c(x)U(x) = f(x), x ∈ Ω, (1.5)
ν(x)·D(x)∇U(x) + γ(x)U(x) = 0, x ∈ ∂Ω, (1.6)
on a bounded domain Ω ⊂ ℝ^d with boundary ∂Ω, where d is either 2 or 3, x = (x, y)
or (x, y, z), and D(x) = (D¹, D²) or (D¹, D², D³), respectively. The term ν(x) is the
outward normal vector. It is assumed that D(x) > 0, c(x) ≥ 0, and γ(x) ≥ 0 to
ensure that upon discretization we end up with a positive definite system of equations.
Anisotropies are also allowed, e.g. if Ω ⊂ ℝ² we have D = (D¹, D²) where it is possible
that D¹ ≪ D² in some subregion(s) while D¹ ≫ D² in others. The coefficients
D(x), c(x), and f(x) are allowed to be discontinuous across internal boundaries Γ ⊂ Ω.
Moreover, let μ(x) be a normal vector at x ∈ Γ; then it is natural to assume also that
U and μ·(D ∇U) are continuous at x for almost every x ∈ Γ. (1.7)
The "almost every" is necessary to exclude juncture points of Γ, that is, points where
two pieces of Γ intersect and the continuity of μ·(D ∇U) does not make any sense.
The boundary conditions permitted in (1.6) can be of three types: Dirichlet,
Neumann, and mixed. The periodic boundary condition is not considered, but can be
handled by making a few adjustments and modifications to the black box multigrid
codes. It should be noted that, for a problem with pure Neumann boundary conditions,
a finite difference (volume or element) discretization may lead to a singular system of
equations: the singularity can be propagated to the coarsest grid level and cause trouble
for the direct solver, but a minor modification to the code circumvents this difficulty,
allowing solution of the coarsest grid level problem.
1.3 Discretization of the Problem
Let the continuous problem represented by equation (1.5) be written in oper-
ator notation as
Lu = f in Ω. (1.8)
The following discussion is valid for both two and three dimensions, but only
the two dimensional case is presented. Suppose that, for all x = (x, y) ∈ Ω, a_x ≤ x ≤ b_x
and a_y ≤ y ≤ b_y. Let G^h define a rectangular grid on [a_x, b_x] × [a_y, b_y], partitioned with
a_x = x_1 < x_2 < ⋯ < x_{n_x} = b_x,  a_y = y_1 < y_2 < ⋯ < y_{n_y} = b_y, (1.9)
and let the grid spacings be defined as
h_{x_i} = x_{i+1} − x_i,  h_{y_j} = y_{j+1} − y_j. (1.10)


Then the rectangular grid G^h is defined as
G^h = {(x_i, y_j) : 1 ≤ i ≤ n_x, 1 ≤ j ≤ n_y}, (1.11)
with the domain grid, Ω^h, being defined as
Ω^h = Ω ∩ G^h. (1.12)
Before the discrete grid problem is defined we should first address the issue of
domains with irregular boundaries. The black box multigrid solvers in two dimensions
are intended to solve the equation (1.8) on logically rectangular grids, but for simplicity,
we consider only rectangular grids. An irregularly shaped domain can be embedded in
the smallest rectangular grid, G^h, possible, Ω^h ⊂ G^h. The problem is then discretized
on Ω^h, avoiding any coupling to the grid points not in Ω^h. For grid points outside of Ω^h,
x^h ∈ G^h \ Ω^h, considered to be fictitious points, an arbitrary equation is introduced,
such as c_{ij} u_{ij} = f_{ij}, where c_{ij} ≠ 0 and f_{ij} are arbitrary. The problem is now
rectangular and the solution to the discrete equations can be obtained at the points in
the domain, while the solution u_{ij} = f_{ij}/c_{ij} is obtained for the other points. Problems
with irregular domains in three dimensions can be handled in a similar fashion for a
cuboid box grid.
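
As an illustration of this embedding, the sketch below (assuming NumPy arrays of
5-point stencil coefficients and a boolean mask marking Ω^h; the names are
hypothetical) installs the trivial equation c_{ij} u_{ij} = f_{ij} at the fictitious
points:

    def decouple_fictitious_points(C, S, W, E, N, F, inside):
        # C, S, W, E, N : NumPy stencil coefficient arrays on the full
        #                 rectangular grid G^h;  F : right-hand side;
        # inside        : boolean mask, True on Omega^h.
        # The interior discretization is assumed to already avoid any
        # coupling into G^h \ Omega^h.
        out = ~inside
        for coef in (S, W, E, N):
            coef[out] = 0.0      # no coupling to neighbors
        C[out] = 1.0             # arbitrary c_ij != 0
        F[out] = 0.0             # arbitrary f_ij; yields u_ij = 0 there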
Now the discrete grid problem approximating the continuous problem, (1.8)
can be written as
L^h u^h = f^h in G^h, (1.13)
where the superscript h refers to discretization with grid spacing h. Note that, for
irregular domains, the discrete solution u^h(x) makes sense only for x ∈ Ω^h; u^h(x), for
x ∈ G^h \ Ω^h, is arbitrary.
We consider only discrete operators L^h on rectangular grids that can be de-
scribed by 5-point or 9-point box stencils. Suppose we discretize the equation (1.5)


using five points at the grid point (x_i, y_j),
−S_{ij} U_{i,j−1} − W_{ij} U_{i−1,j} + C_{ij} U_{i,j} − E_{ij} U_{i+1,j} − N_{ij} U_{i,j+1} = F_{ij}. (1.14)
We use stencil notation to represent the 5 and 9 point cases, respectively:

    L^h = [     N     ]h           [ NW  N  NE ]h
          [  W  C  E  ]    and     [  W  C  E  ]     (1.15)
          [     S     ]            [ SW  S  SE ]


where the stencil represents the coefficients for the discrete equation at the grid point
(x_i, y_j) on grid G^h. The subscripts i, j can be dropped and it will be understood that
the stencil is centered at the grid point (x_i, y_j). The superscript h can also be dropped
when the mesh spacing is clear from the context. The stencils are valid over the entire
grid including the boundary points because the coefficients are allowed to be zero.
Hence, any coefficients that reach out of the domain can be set to zero. Clearly, the
5-point stencil is a special case of the 9-point stencil, where the NW, NE, SW, and
SE coefficients are set to zero.
We illustrate the stencil notation for Poissons equation on a square domain
in two dimensions,
Lu(x, y) = −u_{xx}(x, y) − u_{yy}(x, y) = f(x, y), (x, y) ∈ Ω = (0, 1)², (1.16)
using 5- and 9-point finite difference discretizations. The 5-point stencil for the opera-
tor L, using a central finite difference discretization on a uniform grid with grid spacing
h = 1/N for N = n_x = n_y, is

    L^h = −Δ_h = (1/h²) [      −1      ]h
                        [  −1   4  −1  ]     (1.17)
                        [      −1      ]



One 9-point discretization for L in (1.16) has the stencil

    L^h = (1/(6h²)) [ −1  −4  −1 ]h
                    [ −4  20  −4 ]     (1.18)
                    [ −1  −4  −1 ]

Many types of discretization can be considered: central finite differences, up-
stream finite differences, finite volumes, finite elements, etc.
The black box multigrid solvers actually allow for more general meshes than
just the rectangular grids shown so far. The only requirement is that the mesh be
logically rectangular. In two dimensions the logically rectangular grid G can be defined
as
G = {x{i,j),y{i,j) : 1 < % where the grid cell formed by
(x(i,j + 1), y(i,j + 1)), + y(i + l,j + l))
1/(*,J')) (as(* + 1,j), y(* + l,j))
has positive area, 1 < i < nx, 1 < j < ny.
The black box multigrid solvers which we consider require the discretization
to be represented by a 9-point box stencil. However, just because the problem has a
9-point box stencil does not mean that it can be solved by the black box multigrid
methods presented in this thesis. Such solutions are dependent on a number of factors
which are problem dependent. We attempt to investigate these factors in this thesis.
1.4 Multigrid Overview
A two level multigrid method is presented first to illustrate the basic compo-
nents and underlying ideas that will be expanded into the classical multigrid method.
Th
L h2
-1 -4
-4 20
-1 -4
13


Standard Coarsening
Figure 1.1. Standard coarsening. Superimposed fine grid Gh and coarse grid GH, where
the indicates the coarse grid points in relation to the fine grid Gh.
Suppose that we have a continuous problem of the form
Lu(x,y) = f(x,y), (x,y) Q C SR2, . (1.20)
where L is a linear positive definite operator defined on an appropriate set of functions
in (0, l)2 = C 5ft2. Let Gh and GH be two uniform grids for the discretization of Q;
then
Gh = {(*, y) n : (x, y) = (ih, jh), i, j = Q,..., n} (1.21)
and
GH = {(^>2/) ft : (x,y) = (iH,jH) = (i2hJ2h), i, j = 0,..., ||, (1.22)
where the number of grid cells n on Gh is even with grid spacing h= 1/n, and where
grid Gh has n/2 grid cells with grid spacing H = 2h.
The coarse grid Gh is often referred to as a standard coarsening of Gh; see
figure 1.1. However, this choice is not the only one possible. Another popular choice is
semi-coarsening, which coarsens in only one dimension; see figure 1.2. For the overview,
only standard coarsening will be used.
14


Semi-coarsening
Figure 1.2. Semi-coarsening. Superimposed fine grid Gh and coarse grid GH, where
the indicates the coarse grid points in relation to the fine grid Gh.
15


The discrete problems now take the form
Lhuh = fh on Gn
(1.23)
and
Lhuh = fH on GH. (1-24)
We refer to Lh and LH as the fine and coarse grid operators respectively. The grid
operators are positive definite, linear operators
Lh :Gh - Gh,
(1.25)
and
Lh : Gh Gh.
(1.26)
Let Uh be an approximation to uh from equation (1.23). Denote the error eh
by
eh = Uh- uh; (1.27)
thus eh can also be regarded as a correction to Uh. The residual (defect) of equation
(1.23) is given by
rh = fh LhUh. (1.28)
The defect equation (error-residual equation) on grid Gh
Lheh = rh
(1.29)
is equivalent to the original fine grid equation (1.23). The defect equation and its
approximation play a central role in the development of a multigrid method.
The fine grid equation (1.23) can be approximately solved using an iterative
method such as Gauss-Seidel. The first few iterations reduce the error quickly, but then
the reduction in the error slows down for subsequent iterations. The slowing down in
16


the reduction of the error after the initial quick reduction is a property of most regular
splitting methods and of most basic iterative methods. These methods reduce the error
associated with high frequency (rough) components of the error quickly, but the low
frequency (smooth) components are reduced very little. Hence, the methods seem to
converge quickly for the first few iterations, as the high frequency error components
are eliminated, but then the convergence rate slows down towards its asymptotic value
as the low frequency components are slowly reduced. The idea behind the multigrid
method is to take advantage of this behavior in the reduction of the error components.
The point is that a few iterations of the relaxation method on Gh effectively eliminate
the high frequency components of the error.
Further relaxation on the fine grid results in little gain towards approximating
the solution. However, the smooth components of the error on the fine grid are high
frequency components with respect to the coarse grid. So, let us project the defect
equation, since it is the error that we are interested in resolving, onto the coarse grid
from the fine grid. This projection is done by using a restriction operator to project
the residual. rh, onto the coarse grid, where we can form a new defect equation
Lhvh = I%rh = fH, (1.30)
where iff is the restriction operator. We can now solve this equation for vH. Having
done so. we can project the solution back up to the fine grid with a prolongation
(interpolation) operator, I#, and correct the solution on the fine grid, Gh,
Uh _ Uh + l\vH (1.31)
We call this process (of projecting the error from the coarse grid to the fine grid and
correcting the solution there) the coarse grid correction step. The process of projecting
the error from a coarse grid to a fine grid introduces high frequency errors. The high
17


frequencies introduced by prolongation can be eliminated by applying a few iterations
of a relaxation scheme. The relaxation scheme can be applied to the projection of the
error, or to the approximation to the solution, Uh, after the correction. It is
desirable to apply the relaxation to Uh instead of I^vh since then additional reduction
of the smooth components of the error in the solution may be obtained.
The projection operator from the fine grid to the coarse grid is called the
restriction operator, while the projection operator from the coarse grid to the fine
grid is called the prolongation operator or, interchangeably, the interpolation operator.
These two operators are referred to as the grid transfer operators.
In the two level scheme just described, it can be seen that the coarse grid
problem is the same, in form, as the fine grid problem with uh and fh being replaced
by vH and fH = lffrh respectively. We can now formulate the classical multigrid
method by applying the above two level scheme recursively. In doing so, we no longer
solve the coarse grid defect equation exactly. Instead, we use the relaxation scheme on
the coarse grid problem, where now, the smooth (low) frequencies from the fine grid
appear to be higher frequencies with respect to the coarse grid. The relaxation scheme
now effectively reduces the error components of these, now, higher frequencies. The
coarse grid problem now looks like the fine grid problem, and we can project the coarse
grid residual to an even coarser grid where a new defect equation is formed to solve
for the error. The grid spacing in this yet coarser grid is 2H. After sufficiently many
recursions of the two level method, the resulting grid will have too few grid points
to be reduced any further. We call this grid level the coarsest grid. We can either
use relaxation or a direct solver to solve the coarsest grid problem. The approximate
solution is then propagated back up to the fine grid, using the coarse grid correction
step recursively.
What we have described informally is one multigrid V-cycle. More formally,
18


let us number the grid levels from 1 to M, where grid level 1 is the coarsest and grid
level M is the finest.
Algorithm 1.4.1 ( MGV(&,z/i,z/2,h) )
1. relax u\ times on LkUk Fk
2. compute the residual, rk = Fk LkUk
3. restrict the residual I^~lrk to Gk~l, Fk~l = I%~lrk and form the coarse grid
problem (defect equation) Lk~1Uk~1 = T1*-1, where vk = Ik_xUk~l andhk~l =
2 hk.
4 IF (k 1) 7^ 1 THEN call Algorithm MGV(k 1, v\, v%, H)
5. solve Lk~1Uk~l = Fk~1 to get the solution uk~l
6. interpolate the defect (coarse grid solution) to the fine grid, and correct the fine
grid solution, Uk < Uk + lj(_luk~1
7. relax 1/2 times on LkUk = Fk
8. IF (finest grid) THEN Stop
This algorithm describes the basic steps in the multigrid method for one iteration of
a V-cycle. If the algorithm uses bi-linear (tri-linear in 3D) interpolation, it is called
the classical multigrid method. This algorithm assumes that the coarsening is done by-
doubling the fine grid spacing, which can be seen in step 3 of the algorithm. However,
the algorithm is valid for any choice of coarsening, hk~l = mhk, where m is any integer
greater than one.
1.4.1 Multigrid Cycling Strategies There are many different types of
cycling strategies that are used in multigrid methods besides the V-cycle. We illustrate
the different cycling types with the use of a few pictures and brief descriptions.
19


5
V-cycle
Figure 1.3. One V-cycle iteration for five grid levels, where the represent a visit to a
grid level.
20


The V-cycle is illustrated graphically in figure 1.3. The represents a visit
to a particular grid level.. A slanting line connection between two grid levels indicates
that smoothing work is to be performed. A vertical line connection between grid levels
means that no smoothing is to take place between grid level visits. The grid levels are
indicated by a numerical value listed on the left side of the figure, where grid level 1 is
the coarsest grid level and is always placed at the bottom of the diagram.
The mechanics of the V-cycle were described in the multigrid algorithm in the
last section. The V-cycle is one of the most widely used multigrid cycling strategies. Its
best performance can be realized when there is an initial guess of the solution available.
When a guess is not available a common choice is to use a zero initial guess or to use
an F-cycle (see below).
The S-cycle is illustrated in figure 1.4. The S stands for sawtooth, because
that is what it resembles; it is clearly a V(0, l)-cycle and thus a special case of a V-
cycle. The S-cycle is what de Zeeuws MGD9V [24] black box multigrid code uses for its
cycling strategy. The S-cycle usually requires a smoother with a very good smoothing
factor in order to be efficient and competitive with other cycling strategies.
The W-cycle is illustrated in figure 1.5. The W-cycle is sometimes called a
2-cycle; similarly, a V-cycle can be called a 1-cycle. From the figure 1.5, one can see
the W type structure. It is called a 2-cycle because there must be two visits to the
coarsest grid level before ascending to the next finer intermediate fine grid level. An
intermediate fine grid level is one that is not the finest nor coarsest grid level and where
the algorithm switches from ascending to descending based on the number times the
grid level has been visited since the residual was restricted to it from a finer grid.
The F-cycle is illustrated in figure 1.6 and is called a full multigrid cycle. The
figure shows a full multigrid V-cycle, that is, each sub-cycle that visits the coarsest
grid level is a V-cycle. An F-cycle can also be created using a W-cycle, or any other
21


5
4
3
2
1
S-cycle
Figure 1.4. One S-cycle iteration for four grid levels, where the represent a visit to a
grid level.
4
3
2
1
W-cycle
Figure 1.5. One W-cycle iteration for four grid levels, where the represent a visit to
a grid level.
22


5
F-cycle
Figure 1.6. One F-cycle iteration for fiye grid levels, where the represent a visit to a
grid level.
23


type of cycling, for its sub-cycle. The F-cycle is very good when an initial guess for the
multigrid iteration is not available, since it constructs its own initial guess. The F-cycle
first projects the fine grid problem down to the coarsest grid level and then proceeds
to construct a solution by using sub-cycles. After-the completion of each sub-cycle the
solution on an intermediate fine grid level is interpolated up to the next finer grid level
where a new sub-cycle begins. This process is continued until the finest grid level is
reached and its own V-cycle completed. At this point if more multigrid iterations are
needed then the V-cycling is continued at the finest grid level.
1.5 Black Box Multigrid
Black box multigrid is also called geometric multigrid by some and is a member
of the algebraic multigrid method (AMG) family. The distinguishing feature of black
box multigrid is that the black box approach makes several assumptions about the
type of problem to be solved and the structure of the system of equations. The black
box multigrid methods also have a predetermined coarsening scheme where the coarse
grid has roughly half as many grid points as the fine grid does in one or more of the
coordinate directions. For a uniform grid, this means that H = 2h. Both methods
automatically generate the grid transfer operators, prolongation Ik_x and restriction
Ik"1 for 2 < k < M, and the coarse grid operators Lk for 1 < k < M 1. The coarse
grid operators are formed using the Galerkin coarse grid approximation,
Lk-1 = I%-lLkl£_1, (1.32)
where k = 1.... ,M 1. The algebraic multigrid methods deal with the system of
equations in a purely algebraic way. The coarsening strategy for general AMG is not
fixed nor is the formation of the grid transfer operators, resulting in methods that can
be highly adaptable. However, the more adaptable a method is, the more complex its
24


implementation is likely to be, and it may also be less efficient due to its complexity.
Another disadvantage of general AMG. methods is that the coarse grid problems are
usually not structured even when the fine grid problem is; moreover, the unstructured
matrices on coarser levels tend to become less and less sparse, the coarser the grid level.
To define the black box multigrid method we need to define several of the
multigrid components, such as the grid transfer operators, the coarse grid operators,
the type of smoother employed, and the coarsest grid solver. We can also mention the
type of cycling strategies that are available and other options.
There are several different grid transfer operators that we have developed and
used in our codes. They are of two basic types. The first type collapses the stencil of
the operator in a given grid coordinate direction to form three point relations, and the
second is based on ideas from S. Schaffer [69]. The details of the grid transfer operators
will be presented in chapter 3.
The coarse grid operators are formed by using the Galerkin coarse grid ap-
proximation given in equation (1.32).
There are several choices for the smoothing operator available in our codes.
The smoothers that we have chosen are all of the multi-color type, except for incom-
plete line LU. For standard coarsening versions, the choices are point Gauss-Seidel,
line Gauss-Seidel, alternating line Gauss-Seidel, and incomplete line LU. The semi-
coarsening version uses either line Gauss-Seidel by lines in the x-direction or incomplete
line LU. The smoothers will be presented in more detail in chapter 4.
In the standard coarsening codes, the coarsest grid solver is a direct solver
using LU factorization. The semi-coarsening version allows the option of using line
Gauss-Seidel relaxation.
There are several cycling strategies that are allowed, and they are chosen
by input parameters. The most important choice is whether to choose full multigrid
25


cycling or not. There is also a choice for N-cycling, where N = 1 is the standard
V-cycle and N = 2 is the W-cycle, etc... For more details, see section (1.4.1) above.
26


CHAPTER 2
DISCRETIZATIONS: FINITE DIFFERENCE AND
FINITE VOLUME
This chapter presents some of the discretizations that can be used on the
convection-diffusion equation. We present only some of the more common finite dif-
ference and finite volume discretizations. Although this section may be considered
elementary, it was thought to be important for two reasons. First, it shows some of
the range of discrete problems that can be solved by the black box multigrid methods.
Secondly, it gives sufficient detail for others to be able to duplicate the results presented
in this thesis. The sections on the finite volume method present more than is needed,
but because there is very little on this topic in the current literature and because of its
importance for maintaining 0(h2) accurate discretizations for interface problems, we
have decided to include it. For references on the finite volume discretization see [85]
and [52].
The continuous two dimensional problem is given by
V (D Vu) + b Vu + c u = / in Q = (0, Mx) x (0, My) (2.1)
where D is a 2 x 2 tensor,
^ Dx DXy
Dyx Dy
and det D > 0, c > 0. In general, Dxy Dyx, but we only consider either Dxy = Dyx
or Dxy = DyX = 0. In addition, D, c, and / are allowed to be discontinuous across
(2.2)
27


internal interfaces in the domain Cl. The boundary conditions are given by
3vl
h o u = g, on (2.3)
on
where o and g are functions, and n is the outward unit normal vector. This allows us
to represent Dirichlet, Neumann, and Robin boundary conditions.
The domain is assumed to be rectangular, Cl = (0, Mx) x (0, My), and is then
divided into uniform cells of length hx = Mx/Nx by hy = My/Ny, where Nx and Ny
are the number of cells in the x- and y-directions respectively. A uniform grid is not
required, but we will use it to simplify our discussions.
It should be noted that finite elements on a regular triangulation can also be
used to derive the discrete system of equations to be solved by the black box multigrid
methods. However, we will not present any details on how to derive these equations.
2.1 Finite Difference Discretization
The finite difference approach to discretization is well known. Finite difference
approximation is based on Taylors series expansion. In one dimension, if a function
u and its derivatives are single valued, finite, and continuous functions of x, then we
have the Taylors series expansions,
u(x + h) = u(x) + hu'(x) + \h2u"{x) + \hzu'"(x) + ... (2.4)
2 6
and
u(x h) = u(x) hu'{x) + ^-h2u"(x) ^h?u'"(x) H ... (2.5)
^ o
If we add equations (2.4) and (2.5) together we get an approximation to the second
derivative of u, given by,
u"{x) i (u(x + h) 2u(x) + u(x h)) (2.6)
28


where the leading error term is 0(h2): Subtracting equation (2.5) from (2.4) gives
u'(x) 7- (u(x + h) u(x h)), (2.7)
h
with an error of 0(h2). Both equations (2.6) and (2.7) are. said to be central difference
approximations. We also derive a forward and backward difference approximation to
the first derivative from equations (2.4) and (2.5):
u'(x) (u(x + h) u(x)) (2.8)
lb
and
u'(x) ^ (u(x) u(x h)) (2.9)
lb
respectively, with an error of 0(h).
The above approximations can be extended to higher dimensions easily and
form the basis for finite difference approximation. We illustrate the finite difference
discretization, using stencil notation, by way of examples for some of the types of
problems that we are interested in. There are many references on finite differences if
one is interested in more details; see for instance [74] [39].
The first example is for the anisotropic Poissons equation on a square domain,
Lu = euxx uyy ='/ in Q = (0, l)2, (2-10)
where u and / are functions of (x,y) fh Using central finite differences and dis-
cretization on a uniform grid with grid spacing h = 1/N for N = nx = ny, gives the
5-point stencil,
-1
Lh
£ 2(1 + c) £
(2.11)
-1
29


The second example is for the convection-diffusion equation on a square do-
main,
Lu =eAu + bxux + byUy = f (x, y) e fi = (0, l)2 (2-12)
where u, bx, by, and / are functions of x and y. Using a mix of upstream and central
finite differences and discretizing on a uniform grid with grid spacing h = l/N for
N = nx = ny, gives the 5-point stencil,
£ "I" byflfly
Lh
-e + bxh(fix 1)
E
£ h bxhfix
(2.13)
£ + byh{Hy ~ 1)
where
= 4e + bxh(2fjLx 1) + byh(2fj,y 1) (2-14)
and
/ £ 2 bxh bxh > s £ 2byh byh > £
Mx = < 1 + 2 bxh fly < 1 + 2byh byh< £ (2.15)
1 2 \bxh\ < £ 1 2 \byh\ < £ .
The third example is the rotated anisotropic diffusion equation on a square
domain. It has this name because it is obtained from the second example by rotating
the axes through an angle of 6. The equation is given by
Lu
d2u
s-(£c2+s2)i?-2(£-i)
d2u (
CS dxdy V
£S2 + c2^ = 0
J dy2
(2.16)
(x,y) e fi = (0,1) x (0,1)
where c = cos0, s = sin#, and e > 0. There are two parameters, e and 9, that can be
varied. There are two popular discretizations of this equation which are seen in real
30


world applications. They differ only in the discretization of the cross derivative term.
Let
a=(ec2 + s2) j3=(e l)cs 7 =(es2 + c2); (2-17)
then if the grid spacing is h = 1/N for N = nx = ny, the first, a 7-point finite difference
stencil, is
P ~P ~ 7
Th. _____
L h?
-P- 7
The second, a 9-point finite difference stencil, is,
a (3 2 (a + P + 7) a ft
P
(2.18)
Lh

\P -7 ~\P
a 2 (a + 7) a
-IP -7 ¥
(2.19)
The fourth example is the convection-diffusion equation on a square domain,
Lu = eAu + cux + suy = 0 (x, y) fi = (0, l)2 (2.20)
where c = cos 6, s = sin0, and £ > 0. Upstream finite differences and discretization on
a uniform grid with grid spacing h = 1/N for IV = nx = ny, yields
L ~ h?
-£+|(s-|s|)
-e-|(c+|c|) 4e + /i(|c| + |s|) -£ + |(c-|c|)
(2.21)
£ |(S+|S|)
2.2 Finite Volume Discretization
There are two types of computational grids that will be considered. The first
type is the vertex centered grid Gv, defined as
31


9 j < r i i i T 1 W 1 9 1 1 1 9 1 i i
i i 1 1 i i i 1 i
i 1 1 _L - I _! 1 J i
1 1 k 4 * i \ k ! 1 1 1 i 1 1 ft M
Figure 2.1. Vertex centered finite volume grid, where the indicates where the dis-
cretization is centered and the dashed lines delineate the finite volumes.
32



0
-

Figure 2.2. Cell centered finite volume grid, where the indicates where the discretiza-
tion is centered and the solid lines delineate the finite volumes.
Gv = <
(xi, Vj) :
&i % ^ 0)... Nx,

(2.22)
Gc =
(*. Vj) :
(2.23)
Uj 3 hyi 3 0) > Ny
where Nx and Ny are the number of cells in the x and y directions respectively, see
figure 2.1. The second type is the cell centered grid Gc which is defined by
Xi = (i 2) hxi i = !}>
yj = (3~h)hy, j = l,...,Ny
where Nx and Ny are the number of cells in the x and y directions respectively, see
figure 2.2.
There are two other somewhat common finite volume grids that will not be
discussed here, but can be used to derive the discrete system of equations to be solved
by the black box multigrid methods. These grids are defined by placing the finite
volume cell centers on the grid lines in one of the coordinate directions and centered
between the grid lines in the other coordinate direction. For instance, align the cell
centers with the y grid lines and centered between x grid lines. The cell edges will then
correspond with x grid lines and centered between y grid lines.
We will present finite volume discretization for both vertex and cell centered
finite volumes where the coefficients are evaluated at either the vertices or cell centers.
33


The coefficients could be evaluated at other points, such as cell edges, but we will
not show the development of such discretizations because they follow easily from the
descriptions given below.
2.3 Cell Centered Finite Volume Discretization; Evalua-
tion at the Vertices
For the cell centered finite volume discretization the cell has its center at the
point ((i \)hx, (j \)hyj and the cell is called the finite volume, fijj, for the point
(i.j) on the computational grid Gc, where i = l,...,Nx and j = 1,..., Ny\ see equation
(2.23). A finite volume is shown in figure 2.3. The approximation of u in the center of
the cell is called tty. The coefficients are approximated by constant values in the finite
volume Clij. This discretization is useful when the discontinuities are not aligned with
the finite volume cell boundaries.
Assume that Dxy = Dyx = 0 and that b = 0 for now. If we integrate equation
(2.1) over the finite volume fiij and use Greens theorem we get
f Dx^-nx + Dy^-ny dT + [ cudQ= f f dfl, (2.24)
JdSkj dx ydy Ja-j Jni}
where nx and ny are the components of the outward normal vector to the boundary
dtiij.
We proceed by developing the equations for the interior points Uij, and then
for the boundary points, where we present the modifications that axe needed for the
three types of boundary conditions that we consider. We refer to figure 2.3 to aid in
the development of the finite volume discretization.
34


tOl*-*
Figure 2.3. Cell centered finite volume £l;j, where P has the coordinates
)hx> (j ^)hy)-
35


2.3.1 Interior Finite Volumes Referring to figure 2.3, we write the
line integral from equation (2.24) as
du ^ du
f ^ du ^ du m rse du fne ,
/ At "o H Z/y 7\ dx / Dx "o ^2/
Jan,,, ax ay Au, <9y Ae ax
r*w du , /*w du ,
+ / Dy-dx- / Dx-?dy.
J ne 9y J nw 9x
The integral from (sw) to (se) can be approximated by
fs . ,9u, fse du ,
/ Dv(sw)dx + / £Use) dx
J sw 9y J s dy
s
hx
2hx
h '
(Dy(sw) + Dy(se)) (
ui,j Uhj~ l)
h 1
(2.25)
(2.26)
where afj ^ (Dy>ij + Dy>i-ij), and Dy^j is the value of Dy at the point (i,j).
The other line integrals of flij, (se) to (ne), (ne) to (nw), and (nw) to (sw), can be
approximated in a similar fashion.
The surface integrals in equation (2.24) can be approximated by:
1 C U rffl Q,j ,j (2.27)
and
1 f dCl hxhyfi^j-, (2.28)
where Cjj and fij are approximations of c and-/, respectively, at the grid point
((* 5)^x> (j f)^y)i given by
1 / Qj ~ ^ fad "h cil,j + cilj-1 "b Cijl) (2.29)
and
fi,j = 7 (/ij + filj + /i-lj-1 + fij-l) (2.30)
36


respectively. The resulting stencil for interior points is
where
. IzSL/y* .
E+hxhyCij
hx j
-hi.ay. ,
(2.31)
ai,j 9 (Dx,i,j-l + Dxjj)
(2.32)
ai,j ~ 9 + Dy,i,j)
(2.33)
and
53 hy + a^) + /il + a^) '
(2.34)
At an interface, the diffusivity is given as an arithmetic mean of the diffusion
coefficients of adjacent finite volumes. The arithmetic makes sense because the inter-
face passes through the finite volume. This discretization is most accurate when the
interface passes directly through the cell of the finite volume.
When the finite volume flij has an edge on the boundary, the line integral in
equation (2.24) for that edge has to be treated differently. We examine what needs to
be done for each of the three different types of boundary conditions. We examine the
changes that are needed only on one boundary edge, and the other changes needed for
the other boundary edges follow in a similar fashion.
2.3.2 Dirichlet Boundary Condition Let us examine the south bound-
ary, (sw) (se), where we have
() = 9(s)- (2-35)
37


The line integral from (sw) to (se) is approximated by
Jsw ^v~dy^X ~ ~2hy \Uid ~ u(s) j
This gives the stencil
(2.36)
_ hx-fy* .
ky
htai-i,j S +hxhyCi,j
(2.37)
where X) is defined in equation (2.34) and a is defined by equation (2.32) and (2.33).
2.3.3 Neumann and Robin Boundary Conditions We examine the
south boundary, (sw)-(se), where
du
We then make the approximation
u{s) u{v)
1, du
2 yfrl
(s)
~ 2 hy (9(3) a
()
Solving for gives
1L
o hy9(s) + (p)
u(s) = ~
1 2^y(s)
The line integral is then approximated as
n duj
Dydx
dy
rs*>
hxa
y
hj
du I
_15yl(s)
hxa
y
ij-l
(2.38)
(2.39)
(2.40)
(2.41)
38


Now we substitute equation (2.40) to obtain
I,
Senduj~. 2hX
Dy dec ,
SW &y 2 h h'yQ'^s}
ai,j-1 (a(s)u*J 5(s))
The resulting stencil for the south boundary is
. Lz3L/yt' .
hy
ihLryX V4-h h r- 1 ^ hya0,j x ^IbLfy
hx ai-1 j 2-, +rixtiyClj + 2 + hxaoj A, a
*?.
where a is defined in equations (2.32) and (2.33), and J2 is now given by
^ = + t + aij)
(2.42)
(2.43)
(2.44)
The other boundaries can be handled in the same way. We have now defined
the cell centered finite volume discretization where the coefficients are evaluated at the
grid vertices.
2.4 Cell Centered Finite Volume Discretization; Evalua-
tion at the Cell Centers
This discretization is better suited to problems when the interfaces align with
the boundaries of the finite volumes. The discretization is very similar to what was
done in section 2.3, except that now the coefficients are evaluated at the cell centers,
((i \)hXl (j ^)hy), of the finite volume The coefficients are approximated by
constant values in the finite volume f\j. We need to approximate the integrals in
equation (2.24).
39


2.4.1 Interior Finite Volumes We have the line integral, as in equation
(2.25), and the integral from (sw) to (se) can be approximated by
J ~ (u*d u(s)) (2-45)
where Dyjj is the value of Dy at the point (i, j). We still need to approximate ri(s),
and to do this we will use the continuity of u and Dy
Dy,i,j (iHj ~ u(s)^j = Dyjj-i ^(s) > (2-46)
yielding
u(s) =
_ Dy,i,jui,j + DyjjiUjji
Dy,i,j +
We can now substitute equation (2.46) into equation (2.45) to get
fse du hxy
J Qy^X ~ h Ui
where of is now given by
_ 2 Dy,i,jDy,i,j-1
<£.* 1
(2.47)
(2.48)
(2.49)
n . _l n . '
^ uy,%o-1
The other line integrals of fijj, (se) to (ne), (ne) to (mu), and (mu) to (sw),
can be approximated in a similar fashion.
The surface integrals are approximated in the same way as before,
c u d hxfoyCj_____i iUi
and
/ y dfl - h'xh'Tjfi 12)
Aii,* 2J 2.
but instead of q , and /; , we have c- i and i ,_i.
1 22 2,J 2
(2.50)
(2.51)
40


The resulting stencil for interior points is
where
. 3L ry? ,
hy iyj
12+hxhyCij T^ai,j
ks.ay. ,
hy l J-l
(2.52).
and
a:
,x 2 Dx,i,j
'11J Dx,i,j 4" Dx,ilJ
,y - ^ Dy,i,jDy!itj-1
l 1 v~>
^~K + + fe-1 + a^')
(2.53)
(2.54)
(2.55)
At an interface, the diffusivity is given as a harmonic average of the diffusion
coefficients of the adjacent finite volumes.
2.4.2 Dirichlet Boundary Condition For the south boundary, (sw) to
(se), the Dirichlet boundary condition, u(s) = g(sy The line integral is approximated
by
Q'lJj 2 / \
^y~dy^X ~ ~h \i,:> ~ 9(s)j ' (2.56)
The stencil is then given by
hr
h3
__hy
h x
a
X
i-lj
£ +hxhyCij + ^
D.

.IhLo,? .
hx Ui,3
(2.57)
0
where X! is given in equation (2.55) and a is given by equation (2.53) and (2.54).
41


2.4.3 Neumann and Robin Boundary Conditions The Neumann
and Robin boundary conditions can be handled in the same way as in section 2.3.3.
The line integral for the south boundary is
(2.58)
The resulting stencil is now
(2.59)
0
where J2 is given in equation (2.55) and a is given by equation (2.53) and (2.54).
2.5 Vertex Centered Finite Volume Discretization Eval-
uation at the Vertices
In this discretization D, c, and / are approximated by constant values in finite
volume, Qij, whose centers are at the vertices. This discretization is useful when the
discontinuities align with the boundaries of the finite volumes.
2.5.1 Interior Finite Volumes The development is done the same as
before for the cell centered cases; see section (2.3.1). The stencil, when Dxy = Dyx 0
and b = 0, is given by
af- ij E -\-hxhyCij
(2.60)
42


2 hy
! nw
I w
___i__
sw
-*
ne
s
h,
se
Figure 2.4: Vertex centered finite volume Ojj at the southern, y = 0, edge boundary.
where
x _ ^ Dx,i,jDx,i+l,j
id
Dx,i,j + -Dx,i+l,j
(2.61)
and
=
2 Dy,i,jDy,i,j+1
A/.*d + A/,id+1
(I'd-1 + I'd) + 7T (I-id + ?d)
where c and / are evaluated at the grid point (i hx,jhy).
(2.62)
(2.63)
2.5.2 Edge Boundary Finite Volumes Let the finite volume Qij have
its southern edge, (sw)-(se) at the southern boundary (y = 0) of the domain; see figure
2.4.
2.5.3 Dirichlet Boundary Condition For the Dirichlet boundary con-
dition we have and we can just eliminate the unknown U(s) and move it to
the right-hand side of the equation.
2.5.4 Neumann and Robin Boundary Conditions The line integral
along the boundary is approximated by
n duj
Dydx
dy
hxD
y,hj
du
Qy
43


Figure 2.5. Southwest corner finite volume, where the indicates where the discretiza-
tion is centered.
hxDyjj (9(s) a(s)ui,j'j
(2.64)
and now we need to look at the surface integrals
f C U d&l W hxhyCijUij
JCiij 2
and similarly for /. The stencil for the edge boundary is given by
hy
S + \hXhyCij + hxQ>(s)DytiJ f^OC,
iZLn? .
1,3
where
^ hv + hi + a^)
and a is defined by equations (2.61) and (2.62)..
(2.65)
(2.66)
(2.67)
2.5.5 Corner Boundary Finite Volumes The comer finite volume
discretization will be shown for the southwest corner of the computational grid; see
figure (2.5).
44


2.5.6 Dirichlet Boundary Condition In the Dirichlet boundary con-
dition case, the unknown U(sw) is eliminated by the boundary condition equation,
u{sw) = 9(sw)- (2.68)
The term 9(sw) is incorporated into the right hand side of the discrete system of equa-
tions. The stencil for the southwest corner is
-As./vV.
2 hy ai,j
o 52+\hxhyCij 2ai,j
where J2 is defined as
and a is defined by equations (2.61) and (2.62).
0
h$ y Oc- 2 hy '* A-a?- 2 hxa^
(2.69)
(2.70)
2.5.7 Neumann and Robin Boundary Conditions In the Neumann
and Robin boundary condition cases, we have
du
dx
du
~ q---f* O'sU
l dy
(sw)
J (sw)
9w
= 9si
(2.71)
(2.72)
where the subscripts (sw) means evaluation at the sru-point; see figure 2.5. The line
integrals around the finite volume are approximated by
rse Qu
rse n du ,
/ Dydx
J sw dy
11. 7-i du(sw)
2 ^ dy
\hxDy^ (as(sw)uij gs(sw))
(2.73)
fnw du , 1-
J ^x~Qx^ ~ 2hyDxhj'
du(sw)
dy
^hyDx^i^j (o>w(sw)uij 9w(siij))
(2.74)
45


(2.75)
fne du lhy x . ,
J QX^ ^ 2 h UiJrlj)
rne fa, i fo
- Dy^dx -faVj (uitj uiJ+1). (2.76)
The stencil for the southwest corner is
2hy ai,j
0 E+ihxhydj+BC
0
(2.77)
where X) is defined in equation (2.70), a is defined by equations (2.61) and (2.62), and
JBCJ ^ (hxT^t/,i,jns(sty) (su7)) .
(2.78)
2.6 Vertex Centered Finite Volume Discretization Eval-
uation at the Cell Vertices
In this discretization D, c, and f are approximated by constant values in
finite volume, Slij, whose centers are at the vertices. This discretization is useful when
the discontinuities pass through the interior of the finite volumes, and best when the
interface passes through the cell center.
2.6.1 Interior Finite Volumes The development is the same as for the
previous section on vertex centered finite volumes; see section 2.5. The stencil, when
Dxy = Dyx = 0 and b = 0, is given by
__hx.rr! .
hy
JrhxhyCij
hx ai,j
hyat,J-l
(2.79)
46


where
ai,j ~ 2 + Ar,i+l,j+l)
ai,j = 2 + -^j/,i+i,i+i)
and
£ T (ah-1 + aL') + IT (^-hi + fj)
and where c and / are evaluated at the grid point (i hx,j hy).
Cij ~ ^ (ci-lJ-l.+ Q+lJ-l + Ci-lJ+l + Cj+lj+l)
(2.80)
(2.81)
(2.82)
(2.83)
and
fid = ^ (/i-lj-l + /i+lj-l + /i-lj+l + /i+lj+l) (2.84)
Let the finite volume Clij have its southern edge, (sw)-(se) at the southern
boundary (y = 0) of the domain; see figure 2.4.
2.6.2 Dirichlet Boundary Condition For the Dirichlet boundary con-
dition we have and we can just eliminate the unknown and move it to
the right-hand side of the equation.
2.6.3 Neumann and Robin Boundary Conditions The line integral
along the boundary is approximated by
I,
se du 8u
D dx h ay
ydy x ^ dn
M
hx(X- j (j)(s) a{s)ui,j^j i
(2.85)
L
ftc h
^X~8x^ ~ ~ni!~Dy,i+l,j (ui+l,j ~ ui,j)
2 hx
47


and similarly for the line integral from (sw)-(nw), and the line integral from (nw)-(ne)
is done as before for the interior.
The surface integrals are now given by
f cu d£l ~ hxhyC:LjUij
JQij L
(2.86)
where
Qj 2 d" Ci+lJ+l)
and similarly for /. The stencil for the edge boundary is given by
__hz-rvV .
hy ^1,3
~2h^Dx,i-\,j ^ 2 hxhyCij + hxa^a\j 2h^DXtij
where
H ~^ah + oh~ (Dx,i-i,j + Dx,ij),
2 hx
(2.87)
(2.88)
(2.89)
and a is defined by equations (2.80) and (2.81).
2.6.4 Corner Boundary Finite Volumes The corner finite volume
discretization will be shown for the southwest corner of the computational grid; see
figure (2.5).
2.6.5 Dirichlet Boundary Condition In the Dirichlet boundary con-
dition case, the unknown U(sw) is eliminated by the boundary condition equation,
u(sw) = 9(sw)- The term g^sw) is incorporated into the right hand side of the discrete
48


system of equations. The stencil for the southwest corner is
hx
'2 hy
D.

0 4" 4 hx hyCi,j
JhLr>
2hx
0
where Y1 is defined as
E-
hx
__ n ._________
- J^y,i,j
2 h.
y n .
2hxx
and a is defined by equations (2.80) and (2.81).
(2.90)
(2.91)
2.6.6 Neumann and Robin Boundary Conditions In the Neumann
and Robin boundary condition cases, we have
du
" 7^ "b UwU
OX
(sw)
du
"5--h OisU
9w
9 s)
(2.92)
(2.93)
where the subscripts (sw) means evaluation at the siu-point; see figure 2.5. The line
integrals around the finite volume are approximated by
_ du ,
Dydx
dy
1 l 7-i du(sw)
-hxDy,i+id+i (o5(sw)u{j gs(sw))
1L du(sw)
2 UyUx,i+l,j-\-l ^
hyDx,i+i,j-t-i (&w(sw)uij 9/w($w))
n duA
D*dzdy
D,/~dx
dy
2h.
-Dx,i+W (uij ui+ij)
2^ A/.i+lJ+l (ui,j ^ij+l)
(2.94)
(2.95)
(2.96)
(2.97)
49


The surface integrals are approximated by
J cu cm ~ ~ hi /iy Cj+1 j _). i Uj j
and similarly for /. The stencil for the southwest corner is
hx 7~)
2hv uy^i+1 J+1
0 YLJf\hxhyCi+i,j+\ +BC &-DXti+ij+i
(2.98)
(2.99)
where is defined in equation (2.91), a is defined by equations (2.80) and (2181), and
JBC ^ (hxZ?yit+ij+iOs(stw) + hyDx,i+lj+iQwi.sw')')
(2.100)
50


CHAPTER 3
PROLONGATION AND RESTRICTION OPERATORS
Suppose that we have an elliptic linear operator L on a two dimensional
rectangular domain Cl:
Lu = f in Cl c 3?2. (3.1)
This problem can be discretized using finite differences (or other discretization) on a
rectangular grid Gh with grid spacing h, given by
Lhuh = jh -mGh, (3.2)
Gh = {(a:*, yj) : Xi = xq + i h, yj =yo+j h} (3.3)
We assume that the discretization is represented in stencil notation as
NW N NE
WCE (3.4)
SW S SE
J (*J)
where NW, N, NE,... are the coefficients of the discretization stencil centered at
(*,%) ''
The size of the fine grid operators stencil is important to remember because
we require that the coarser grid operators stencil not be any larger than the largest
allowable fine grid operator stencil. By keeping the grid operator stencil fixed at a
maximum of 9-points, we ensure that the implementation will be easier and more
efficient by maintaining the sparsity of the operators. This consideration is important
51


when discussing the formation of the grid transfer operators since we use the Galerkin
coarse grid approximation approach to form the coarse grid operators. The formulation
of the coarse grid operators involves the multiplication of three matrices, and if their
stencils are at most 9-point, then the coarse grid operator will also be at most 9-point.
If we use grid transfer operators with larger stencils, the size of the coarse grid operator
stencil can grow without bound, as the grids levels became coarser, until the stencils
either become the size of the full matrix or we rim out of grid levels.
Another guiding principal that we follow is that if we are given a symmetric
fine grid operator we would like all the coarser grid operators to be symmetric also. In
order to follow this principal the interpolation and restriction operators must be chosen
with care.
Before getting started it would be best to show where and how the operators
are used to transfer components between grid levels. We assume the layout of coarse
and fine grids shown in figure 1.1. We refer to coarse grid points with indices (ic,jc)
and fine grid points with indices (if,jf )
3.1 Prolongation
We interpolate the defect correction (error) from the coarse grid level to the
fine grid level, where it is added as a correction to the approximation of the fine
grid solution. There are four possible interpolation cases for standard coarsening in
two dimensions. The four cases are illustrated in figure 3.1, where the thick lines
represent coarse grid lines, thin lines represent the fine grid lines, circles represent
coarse grid points, X represents the fine grid interpolation point, and the subscripts f
and c distinguish the fine and coarse grid indices respectively. Figure 3.1(a) represents
interpolation to fine grid points that coincide with coarse grid points. Figure 3.1(b)
represents interpolation to fine grid points that do not coincide with coarse grid points,
52


\e h
(a)
i k \
'p / \ r
(b)
'/-1
'*-1
i
i
/
c
ie i
j
i*1
( \ \ /
\
(C)
i e b t
i/-1
J*-i
\ /
\
i/-1 */
(d)

Figure 3.1. The four 2D standard coarsening interpolation cases, where represents
the coarse grid points used to interpolate to the fine grid point represented by x. The
thick lines represent coarse grid lines.
53


but lie on coarse grid lines in the x-direction. Figure 3.1(c) represents interpolation to
fine grid points that do not coinciding with coarse grid points, but lie on coarse grid
lines in the y-direction. Figure 3.1(d) represents interpolation to fine grid points that
do not align with any coarse grid lines either horizontally or vertically.
The fine grid points that are also coarse grid points, case (a), use the identity
as the interpolation operator. The coarse grid correction is then given by
u.

Uic,jc
(3.5)
where (Xif,yjf) = (xic,yjc) on the grid; here the interpolation coefficient is 1.
The fine grid points that are between two coarse grid points that share the
same yj coordinate, case (b), use a two point relation for the interpolation. The coarse
grid correction is given by
u-
-l,jf u*f-idf + K-^jc uic-i,jc + !t
e -u?
C)3c lC)3c
(3.6)
where yJc = yjf and Xic-1 < X{f-\ .< X{c on the grid, and the interpolation coefficients
are If _x and If .
LC J-j JC lCiJC
The fine grid points that are between two coarse grid points that share the
same Xi coordinate, case (c), use a similar two point relation for the interpolation. The
coarse grid correction is then given by
^ic,jc Uic,jc J-icdc-l Uicdc-l'
(3.7)
where Xic = Xif and yjc~i < yjf-i < yjc on the grid, and the interpolation coefficients
are If and If _x.
*CjJC tCfJC *
The last set of fine grid points are those that do not share either a or
a yj coordinate with the coarse grid, case (d). We use a four point relation for the
interpolation in this case, and the coarse grid correction is given by
u.

u.
V-ij/-1
54


(3.8)
, TSW . H , TTIW H
^ 1ic-l,Jc-l U*c-ljc-l ^ Xlc-ljc ic-ljc
H I jse ,.// .
^cjjc 1 lci3c 1
+ U, ,
*cijc iciJc
where Xic < Xif < Xic+i and yjc < yjf < yjc+\, and the interpolation coefficients are
ItT-ljc-l an<^ The interpolation operators coefficients can also
be represented in stencil notation, just like the grid operator, as
r n h
jnw r jne
JW i ie
JSW is jse
(3.9)
L -I H
3.1.1 Prolongation Correction Near Boundaries In the black box
multigrid solvers, the right hand side of the grid equation next to the boundary can
contain boundary data, in wliich case the above interpolation formulas can lead to 0(1)
interpolation errors. To improve this error we can use a correction term that contains
the residual to bring the interpolation errors back to 0(h2); [26]. The correction term
is 0(h2) for the interior grid points, and in general will not improve the error on the
interior, but near the boundary the correction term can be of 0(1). The correction term
takes the form of the residual divided by the diagonal of the grid equation coefficient
matrix; the correction term is equal to where the residual was computed for
the grid before restriction. The correction term is added to equations 3.6, 3.7, and
3.8, which are for interpolating to fine grid points that are not coarse grid points.
Applying the correction is similar to performing an additional relaxation sweep along
the boundary, and it does not affect the size of the prolongation stencil.
55


3.2 Restriction
The restriction operator restricts the residual from the fine grid level to the
coarse grid level, where it becomes the right-hand-side of the defect equation (error-
residual equation). The restriction equation is
= JW Ji '' v+lj/ + Jtic 7 .h
+ J? . Wc rh rV4/+1 + JZj c r .h vb/-1
+ JSW rh Uf+l,jf+1 + JZSc . rh rif+l,
+ jne Jic,jc ' %-hjf -1 rh Tif-1,
+ r(* Vrf/
(3.10)
where the restriction coefficients are Jw, Je, Js, Jn, Jsw, Jnw, Jne, Jse, and 1. The
restriction coefficients can also be represented in stencil notation as
r H
jnw Jn jne
JW 1 Je
JSW Js jse
where the restriction is centered at the fine grid point = (xic,yjc).
(3.11)
3.3 Overview
In the following sections we present several different interpolation operators
by exhibiting the coefficients needed to represent the operators stencil. In most cases,
we omit the indices of the operators, it being be understood that the grid operator is
given at the fine grid point (Xif,yjf). The grid transfer operators can be split into two
groups based upon how the operators are computed.
The first class of grid transfer operators is based on using a collapse (lumping)
in one of the coordinate directions, yielding a simple three point relation that can be
56


solved. The second class of grid transfer operators is based on an idea from Schaffers
semi-coarsening multigrid [69]. Both these methods for operator induced grid transfer
operators are an approximation to the Schur complement, that is, they try to approxi-
mate the block Gaussian elimination of the unknowns that are on the fine grid but not
on the coarse grid. The collapsing methods are a local process while Schaffers idea is
to apply the procedure to a block (line) of unknowns.
We start by presenting the grid transfer operators used in the symmetric
versions of the black box multigrid solvers. Then we present several different grid
transfer operators that are used in the nonsymmetric black box multigrid solvers.
In classic multigrid methods, the grid transfer operators are often taken to be
bilinear interpolation and full weighting; injection is also popular. To see why we do
not use these choices, we need to look at the type of problems that we are hoping to
solve. These problems are represented by the convection-diffusion equation,
V (D Vu) + b Vu + c it = /, (3.12)
where D, c, and / are allowed to be discontinuous across internal boundaries. The
black box multigrid solvers are aimed at solving these problems when D is strongly
discontinuous. The classical multigrid grid transfer operators perform quite well when
D jumps by an order of magnitude or less, but when D jumps by several orders of
magnitude, the classical methods can exhibit extremely poor convergence, since these
methods are based on the continuity of Vu and the smoothing of the error in Vu.
However, it is D Vit that is continuous, not Vu. Hence, if D has jumps of more
than an order of magnitude across internal boundaries, then it is more appropriate to
use grid transfer operators that approximate the continuity of D Vu instead of the
continuity of Vu. It is important to remember that we are using the Galerkin coarse
grid approximation approach to form the coarse grid operators. We want the coarse
57


grid operators to approximate the continuity of D Vu. This goal is accomplished by
basing the grid transfer operators on the grid operator Lh.
Before proceeding with the definitions of the first class of grid transfer oper-
ators, we need to define a few terms and make a few explanations.
Definition 3.3.1 Using the grid, operators stencil notation, define Ra, row sum, at a
given grid point, (Xi,yj), to be
Rx = C + NW + N + NE + W + E + SW + S + SE, ' (3.13)
where the subscript (i,j) has been suppressed.
The row sum is used to determine when to switch between two different ways of com-
puting the grid transfer coefficients at a given point. The switch happens when the
grid operator is marginally diagonally dominant, or in others words, when the row sum
is small in some sense.
We recall what is meant by the symmetric part of the operator.
Definition 3.3.2 Define the symmetric part of the operator, L, as
cL = symm(.L) = ^(L + L*) (3.14)
where L* is the adjoint of the grid operator L.
The notation applies equally to the grid operators coefficients, for example:
crNij = 5 (Nij + Sij+i)
and (3.15)
trSWij^USWij + NEi-u-i.)
In addition, we can give some examples of the adjoint (transpose) of the grid
58


operators coefficients are:
(Wy)*
(crSEij)*
(vCij)*
3.4 Symmetric Grid Operator Lh: Collapsing Methods
The interpolation operator is based upon the discrete grid operator Lh, while
the restriction operator is based bn (Lh)*.
We want to preserve the flux fi (D VC/) across interfaces, which can be done
by using the grid operator Lh. Assume that Lh has a 5-point stencil, then
W(Uij Ui-ij) = E(Ui+itj Uij) , (3.17)
which gives the interpolation formula
W E
Uij = W + EUi~1,j + W + EUi+1J ' ^3'18^
When Lh has a 9-point stencil, the idea is to integrate the contributions from the other
coefficients ( NW, NE, SW, and SE), which can be done by summing (collapsing) the
coefficients to get the three point relation,
A-Vi-ij + A0Vitj + A+Vj+ij = 0 (3.19)
where A_ = (NW + W + SW), A0 = (N + C + 5), and A+ = (NE + E + SE).
The computation of the Iw and Ie coefficients axe done by collapsing the grid
operator in the y-direction to get a three point relation on the x-grid lines. Let the
interpolation formula be given by
Ei-lj,
= crNWi+ij-i,
and
'*,3
(3.16)
Ai-\Vi-\ + AiVi + Aj+iUj+i = 0
(3.20)
59


where Vk is written for Vk,j, and Ai-i = (NW + W + SW\j, A{ = (N + C + S)ij, and
Ai+1 = (NE + E + SE)ij. We now solve the equation for Vi to get the interpolation
formula in an explicit form.
Vi = -A{ 1Ai-ivi-i Ai 1Ai+ivi+i.
The interpolation coefficients Iw and Ie are then given by
Iw = -A-xAi-1 and Ie = -Ar1Ai+x
Writing out the coefficients explicitly gives
rw NW + W + SW
N + C + S
(3.21)
(3.22)
(3.23)
NE + E + SE
N+C+S
(3.24)
where Iw and Ie are evaluated at (ic 1 ,jc) and (ic,jc) respectively, and the other
coefficients on the right hand side are evaluated at (if 1, J/). If however, the row
sum number,(see 3.13), is small (see 3.28) then instead of (N + C + S)i for Ai we
use (NW + W + SW + NE + E + SE)i. These two formulas give the same result
when the row sum is zero, which is the case for an operator with only second order
terms away from the boundary. This idea is observed to lead to better convergence,
and it is due to Dendy [30]. The coefficients are then defined by
r =
NW + W + SW
NW + W + SW + NE + E + SE
(3.25)
and
NE + E + SE
NW + W + SW + NE + E + SE
(3.26)
where Iw and Ie are evaluated at (ic 1 ,jc) and (ic,jc) respectively, and the other
coefficients on the right hand side are evaluated at (if 1, jf).
60


Let
.'r = mm{\NW + W-+SW\, \NE + E + SE\, 1.}. (3.27)
Then by small we mean that
Re < ~'r{NW + W + SW + N + S + NE + E + SE), (3.28)
where is the row sum defined above.
The computation of the Is and In coefficients is done by collapsing the grid
operator in the x-direction to get the three point relation on the y-grid line. Let the
interpolation formula be given by
Aj-iVj-i + AjVj + Aj+iVj+i = 0
(3.29)
where Vj-i = {vij-1 : i = 1 ,...,nx}, Vi = {vij : i = 1,.. .\nx}, Vj+\ = {vij+1 : i =
1,... ,nx}, and Aj+1 = (iVW + N + NE)ij+\, Aj = (W +,C + E\j, and Aj-\ =
(SW + 5 + SE)ij-1. We now solve the equation for Vj to get the interpolation formula
in an explicit form:
Vj = ~Aj ^ Aj ^A.jj-\Vj-)_i (3.30)
The interpolation coefficients Is and In are given by
Is = -A~lAj-\ and P = -A~1Aj+x (3.31)
Writing out the coefficients explicitly gives
SW + S + SE
W+C+E
(3.32)
NE+N+NE
W + C + E
(3.33)
61


If however, the row sum,i?s, is small, then instead of (W + C + E)j for Af we use
(NW + N + NE + SW + S + SE)j. The coefficients are then defined by
SW + S + SE
NW + N + NE + SW + S + SE
(3.34)

NW + N + NE
NW + N + NE +SW + S + SE'
(3.35)
where Is and In are evaluated at (zc,jc 1) and (ic,jc) respectively, and the other
coefficients on the right hand side are evaluated at !) Let
7 = min{|!VW + N + NE\, |SW + S + SJ3|, 1.}. (3.36)
Then by small we mean that
Re < -7 -{NW+N + NE +SW + S + SE), (3.37)
where Re is the row sum.
The computation of the interpolation coefficients Isw, Inw, Ine, and Ise is sim-
ilar to that of the coefficients that have already been computed. Let the interpolation
formula be given by
A-il,j+lvi-l,j+l + AiJ+lVij+i + Aj+ij+iUj+ij+i
+ AiijViij + AijVij + -Ai+ijUj+ij (3.38)
+ -A-i-ij-iVi-ij-i + Aij-iVij-i + Ai+ij-iVi+ij-i = 0 .
where the A*,* are just the corresponding grid operator coefficients. We can now solve
for Vitj to get the interpolation formula.
i,j = -A-iJ ( Ai-ij+iUj-ij+i + Ajj+ifij+i + Aj+ij+iUi+ij+i
AiijUji j + A^v^ + Aj+ijUj-i-ij (3.39)
+Ai-ij-iVi-ij-i + Aij-iVij-i + Ai+ij-iVi+ij-i )
62


Notice that Vij-i, Vi-ij, Vi+i,j, and are unknowns. However, we can use their
interpolated values that we computed above, being careful to note that their stencils are
all centered at different grid points. After performing the substitutions and collecting
the terms for v%ii, vm1: 1,^+1, and 1 we gst
Vij = IswVi- 1J-1 + InwVi-u+1 + Fvi+u-i + PeVi+i,j+i , (3.40)
where instead of having to compute everything all over again, it can be seen that Isw,
Inw, Ine, and Ise can be expressed in terms of the previous four coefficients, Iw, Ie, Is,
and In. However, we must now explicitly write the subscripts for the coefficients Iw,
Ie, Is, and In to indicate where their stencils are centered relative to the interpolated
points stencil, which is centered at {i,j). The formulas for the four coefficients are
I
SW
SW + 5 7%-x + W IUj
C
(3.41)
where Isw is evaluated at (xic-i, yjc-\),
jnw __
NW + N I%+1 + W Itij
C
(3.42)
where Inw is evaluated at (a'ic~i,yjc),
NE + N Ifj+i + E Ij+ij
C
where Ine is evaluated at (Zic,yjc),
I$e __
SE + S-I^ + E-IUu
C
(3.43)
(3.44)
where Ise is evaluated at (xic,yjc-\), and the the other stencil coefficients are evaluated
at (Xif,yjf). If, however, ife is small, then
rsw _ ___OVV TO ijj-i T W ij-lj
NW + N + NE + w + E + sw + S + ,
jnw NW + N-If^+W-IU,
NW + N + NE + W + E + SW + S + S
(3.45)
(3.46)
63


(3.47)
rne __
1ic Jc
jse
NE + N -Ilj+l + E-I?+lj
NW + N + NE + W + E + SW + S + SE'
SE + S Ijj-i + E If+ij
NW + N + NE + W + E + SW + S + SE'
(3.48)
and where NW, N, NE, W, C, E, SW, S, and SE are evaluated at (Xif,yjf). Let
\SW + W + NW\, \NW + N + NE\,
7 = min
\NE + E + SE\, \SE + S + SW\, 1.
Then by small we mean that
.
(3.49)
i?s < -7- (NW + N + NE + W + E + SW + S + SE). (3.50)
The interpolation correction terms are A~lrH, A~JlrH, or A~jrH for the cor-
responding interpolation formulas above, where rH is the residual on the coarse grid.
Note that the ^4s change depending on whether ife is small or not.
The computation of the interpolation coefficients in this way was used in the
BOXMG, BOXMGP, BBMG, and BBMGP codes for symmetric problems [1], [26],
[30], [10]. Similar computations have also been used for most black box, geometric,
and algebraic multigrid solvers for symmetric problems arising from finite difference
and finite volume discretizations using either a 5-point or a 9-point standard stencil
[7], [23], [29], [31], [52], [54], [53], [55], [63], [85], [24],
The computation of the restriction operators coefficients is closely related
to that of the interpolation coefficients. In fact, in the symmetric case, the restric-
tion coefficients for the symmetric grid operator Lh can be taken to be equal to the
interpolation coefficients,
/? = (#) (3.51)
64


3.5 Nonsymmetric Grid Operator Lh\ Collapsing Meth-
ods
The interpolation coefficients can be computed in the same way as in the
symmetric case except that we replace all of the grid operators coefficients with their
equivalent symmetric stencil coefficients, denoted by a(-). However, the row sum
definition remains unchanged.
3.5.1 Prolongation Based on symm(Lft) The computation of the Iw
and Ie coefficients is given by
_ aNW + crW + aSW .
aN + aC + aS
. e aNE-\-aE-^-aSE
aN + aC + aS '
If, however, is small, then
_ aNW + aW + aSW
aNW + aW + aSW + aNE + aE + aSE
(3.53)
(3.54)
aNE + aE + aSE
aNW + aW + aSW + aNE + aE + aSE
(3.55)
In (3.52)-(3.55) Iw and Ie are evaluated at {xic-i,yjc) and (Xic,yjc) respectively, and
the other coefficients on the right hand side are evaluated at (Xif-\,yjf) for the Lh
components. Let
7 = min{|<7iVW + aW + aSW\, \aNE + aE + aSE\, 1.}. (3.56)
Then by small we mean that
< 7 (aNW + aW + aSW + aN + aS + aNE + aE + aSE) (3.57)
65


The formulas for the In and Is coefficients are
aNW + crN + aNE
aW + aC + aE
aSW + aS + aSE
aW + aC + crE
If, however, is small, then
aNW + aN + aNE
aNW + aN + aNE + aSW + aS + aSE
(3.58)
(3.59)
(3.60)
aSW + aS + aSE
aNW + aN + aNE + aSW + aS + aSE
(3.61)
where In and Is are evaluated at (Xic,yjc) and (xic,yjc-1) respectively, and the other
coefficients on the right hand side are evaluated at (xif,yjf-1) for the Lh components.
Let
7 = min {\aNW + aN + aNE\, |aSW + aS + aSE\, 1.} . (3.62)
Then by small we mean that
Rz < -~r(aNW + aN + aNE + aW + aE + aSW + aS + aSE). (3.63)
The computation of the interpolation coefficients $I^{sw}$, $I^{nw}$, $I^{ne}$, and $I^{se}$ can be expressed in terms of the other four coefficients:

$$I^{sw} = -\frac{\sigma SW + \sigma S \cdot I^w_{i,j-1} + \sigma W \cdot I^s_{i-1,j}}{\sigma C}, \tag{3.64}$$

$$I^{nw} = -\frac{\sigma NW + \sigma N \cdot I^w_{i,j+1} + \sigma W \cdot I^n_{i-1,j}}{\sigma C}, \tag{3.65}$$

$$I^{ne} = -\frac{\sigma NE + \sigma N \cdot I^e_{i,j+1} + \sigma E \cdot I^n_{i+1,j}}{\sigma C}, \tag{3.66}$$

$$I^{se} = -\frac{\sigma SE + \sigma S \cdot I^e_{i,j-1} + \sigma E \cdot I^s_{i+1,j}}{\sigma C}. \tag{3.67}$$

If, however, $R_\Sigma$ is small, then

$$I^{sw} = \frac{\sigma SW + \sigma S \cdot I^w_{i,j-1} + \sigma W \cdot I^s_{i-1,j}}{\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE}, \tag{3.68}$$

$$I^{nw} = \frac{\sigma NW + \sigma N \cdot I^w_{i,j+1} + \sigma W \cdot I^n_{i-1,j}}{\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE}, \tag{3.69}$$

$$I^{ne} = \frac{\sigma NE + \sigma N \cdot I^e_{i,j+1} + \sigma E \cdot I^n_{i+1,j}}{\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE}, \tag{3.70}$$

$$I^{se} = \frac{\sigma SE + \sigma S \cdot I^e_{i,j-1} + \sigma E \cdot I^s_{i+1,j}}{\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE}, \tag{3.71}$$

where $\sigma NW$, $\sigma N$, $\sigma NE$, $\sigma W$, $\sigma C$, $\sigma E$, $\sigma SW$, $\sigma S$, and $\sigma SE$ are evaluated at $(x_{i_f}, y_{j_f})$ for the $L^h$ components. Let

$$\gamma = \min\left\{\,|\sigma SW + \sigma W + \sigma NW|,\ |\sigma NW + \sigma N + \sigma NE|,\ |\sigma NE + \sigma E + \sigma SE|,\ |\sigma SE + \sigma S + \sigma SW|,\ 1\,\right\}. \tag{3.72}$$

Then by small we mean that

$$R_\Sigma < \gamma\,(\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE). \tag{3.73}$$
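Continuing the sketch above (again hypothetical), the corner weights reuse the already computed edge weights; the southwest case of (3.64)/(3.68) is shown, and the other three corners are analogous.

```python
# Hypothetical continuation: the corner weight I^{sw} of (3.64)/(3.68),
# given the edge weights Iw (at (i, j-1)) and Is (at (i-1, j)).
def southwest_weight(s, Iw_south, Is_west, row_sum_small):
    num = s['SW'] + s['S'] * Iw_south + s['W'] * Is_west
    if row_sum_small:                                  # switch of (3.72)-(3.73)
        denom = (s['NW'] + s['N'] + s['NE'] + s['W'] + s['E']
                 + s['SW'] + s['S'] + s['SE'])
        return num / denom                             # (3.68)
    return -num / s['C']                               # (3.64)

sigma = {'NW': 0, 'N': -1, 'NE': 0, 'W': -1, 'C': 4, 'E': -1,
         'SW': 0, 'S': -1, 'SE': 0}
# With Iw = Is = 1/2 for the 5-point Laplacian, I^{sw} = 1/4.
print(southwest_weight(sigma, 0.5, 0.5, False))        # 0.25
```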
It has been found in practice that the restriction operator $I_h^H$ need not be based on the same operator as the interpolation operator, so we change its symbol to $J_h^H$ to reflect this. The restriction operator's coefficients are based on $(L^h)^T$ instead of $\sigma L^h$. The restriction coefficients are computed in exactly the same way as the interpolation coefficients except that all of the grid operator's coefficients in the computations are replaced by their transposes. The computations for the restriction coefficients are then straightforward and will not be written out.
The grid transfer operators have been computed in this fashion for the black box multigrid solver for nonsymmetric problems [27]. It should be noted that when the grid operator $L^h$ is symmetric, the computations given here for the symmetric case and for the nonsymmetric case yield the same grid transfer coefficients.
3.5.2 Prolongation Based on $L^h$ and symm($L^h$)

The third possibility for computing the grid transfer operators uses the same form of computation as above; see section 3.5.1. This prolongation is a point collapse approximation to Schaffer's idea; see section 3.6. The only difference from the above computations for the nonsymmetric case is that for the denominators, the $A^{-1}$'s, we use the coefficients based on $L^h$ instead of $\sigma L^h$. The test for small is still of the same form as before except that $L^h$ is used, but $\gamma$ is still based on $\sigma L^h$.

The restriction operator coefficients are computed as before, but the denominator is now based on $L^h$ instead of on $(L^h)^T$.
3.5.3 Grid Transfer Operators Based on a Hybrid Form of $L^h$ and symm($L^h$)

The prolongation operator coefficients are computed the same as in the last section, 3.5.2. However, the computation of the restriction operator coefficients has been modified into a hybrid form that uses both $L^T$ and $L$.

The difference in the computation of the restriction coefficients comes into play when the switch is made in the denominator, the $A^{-1}$'s, because the row sum is small. When the row sum is small we modify the denominator by adding in two coefficients from the grid operator $L$. We can illustrate this modification by computing the restriction coefficients $J^w$ and $J^e$:

$$J^w = -\frac{(NW)^T + (W)^T + (SW)^T}{N + C + S}, \tag{3.74}$$

$$J^e = -\frac{(NE)^T + (E)^T + (SE)^T}{N + C + S}. \tag{3.75}$$

If, however, $R_\Sigma$ is small, then

$$J^w = \frac{(NW)^T + (W)^T + (SW)^T}{(NW)^T + (W)^T + (SW)^T + N + S + (NE)^T + (E)^T + (SE)^T}, \tag{3.76}$$

$$J^e = \frac{(NE)^T + (E)^T + (SE)^T}{(NW)^T + (W)^T + (SW)^T + N + S + (NE)^T + (E)^T + (SE)^T}. \tag{3.77}$$
In (3.74)-(3.77), $J^w$ and $J^e$ are evaluated at $(x_{i_c-1}, y_{j_c})$ and $(x_{i_c}, y_{j_c})$ respectively, and the other coefficients on the right hand side are evaluated at $(x_{i_f-1}, y_{j_f})$ for the $L^h$ components. Let

$$\gamma = \min\left\{\,|(NW)^T + (W)^T + (SW)^T|,\ |(NE)^T + (E)^T + (SE)^T|,\ 1\,\right\}. \tag{3.78}$$

Then by small we mean that

$$R_\Sigma < \gamma\left( (NW)^T + (W)^T + (SW)^T + (N)^T + (S)^T + N + S + (NE)^T + (E)^T + (SE)^T \right). \tag{3.79}$$
The restriction coefficients $J^n$ and $J^s$ are computed in a similar way.
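A hypothetical sketch of the hybrid computation for $J^w$ and $J^e$: the collapsed columns come from the transposed stencil, while the $N$, $C$, $S$ terms in the denominators come from $L$ itself, as in (3.74)-(3.77).

```python
# Hypothetical sketch of the hybrid restriction weights of (3.74)-(3.77);
# t holds the transposed stencil (L^h)^T and l the stencil L^h at the point.
def hybrid_west_east(t, l, row_sum_small):
    west = t['NW'] + t['W'] + t['SW']              # collapsed columns of (L^h)^T
    east = t['NE'] + t['E'] + t['SE']
    if row_sum_small:                              # switch governed by (3.78)-(3.79)
        denom = west + east + l['N'] + l['S']      # hybrid denominator, (3.76)-(3.77)
        return west / denom, east / denom
    denom = l['N'] + l['C'] + l['S']               # denominator from L^h, (3.74)-(3.75)
    return -west / denom, -east / denom

stencil = {'NW': 0, 'N': -1, 'NE': 0, 'W': -1, 'C': 4, 'E': -1,
           'SW': 0, 'S': -1, 'SE': 0}              # symmetric model stencil
print(hybrid_west_east(stencil, stencil, False))   # (0.5, 0.5)
```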
The motivation behind these modifications was to get the coarse grid operator to approximate the one obtained when using the extension of Schaffer's idea; see section 3.6. The grid transfer operators from section 3.5.2 above were computed to approximate the grid transfer coefficients based on an extension of Schaffer's idea; the method in this section attempts to do the same thing, but it also makes some modifications so that the coarse grid operator more closely approximates the one obtained in section 3.6.1.
3.6 Nonsymmetric Grid Operators: Extension of Schaffer's Idea

The second class of grid transfer routines is based on Schaffer's idea for grid transfer operators in his semi-coarsening multigrid method [70]. Schaffer's idea is to approximate a full matrix by a diagonal matrix in order to compute the grid transfer operators.
Schaffer's idea was used in the development of the semi-coarsening black box multigrid method [32]. We took Schaffer's idea and extended it to apply to the standard coarsening grid transfer operators.

The ideas used in the semi-coarsening method are as follows. Suppose that coarsening takes place only in the y-direction. Then the interpolation to points on the fine grid can be represented by

$$A_{j-1}v_{j-1} + A_j v_j + A_{j+1}v_{j+1} = 0, \tag{3.80}$$

where $v_k = \{v_{i,k} : i = 1, \ldots, n_x\}$ for $k = j-1, j, j+1$. The tridiagonal matrices $A_{j-1}$, $A_j$, and $A_{j+1}$ represent the nine point grid operator on the $j-1$, $j$, and $j+1$ grid lines respectively:

$$A_{j+1} = \mathrm{tridiag}\,[NW, N, NE]_{j+1},$$
$$A_j = \mathrm{tridiag}\,[W, C, E]_j,$$
$$A_{j-1} = \mathrm{tridiag}\,[SW, S, SE]_{j-1}.$$

As before, we solve this equation for $v_j$ to get

$$v_j = -A_j^{-1}A_{j-1}v_{j-1} - A_j^{-1}A_{j+1}v_{j+1}, \tag{3.81}$$

where we have assumed that $A_j^{-1}$ exists and that $A_j$ can be stably inverted. This assumption cannot always be guaranteed, but both Schaffer's method and ours employ line relaxation as a smoother, where these assumptions are already necessary. The methods would fail if the assumptions did not hold, so in that sense we can say that the assumptions hold.
From equation (3.81), we form the quantities $A_j^{-1}A_{j-1}$ and $A_j^{-1}A_{j+1}$, leading to a non-sparse interpolation operator. If the interpolation operator is not sparse, that is, if it involves more than just $v_{i,j-1}$ and $v_{i,j+1}$ for interpolation at the point $(i,j)$, then the coarse grid operators formed by the Galerkin coarse grid approximation approach will grow
beyond a 9-point stencil. This is a property that we would very much like to avoid, since it would lead to full operators on the coarser grid levels. Schaffer's idea, also arrived at independently by Dendy, is to approximate these quantities with diagonal matrices $B_{j-1}$ and $B_{j+1}$. This is accomplished by solving the following relations

$$-A_j^{-1}A_{j-1}\,e = B_{j-1}\,e, \qquad -A_j^{-1}A_{j+1}\,e = B_{j+1}\,e, \tag{3.82}$$

where $e = (1, 1, \ldots, 1)^T$. They can be solved quickly because they are tridiagonal systems. After solving, the entries (diagonals) in $B_{j-1}$ and $B_{j+1}$ are just the interpolation coefficients $I^s$ and $I^n$ respectively.

In the semi-coarsening case the restriction operator is still based on the transpose of the nonsymmetric grid operator $L^h$. This is done by replacing $A_{j-1}$, $A_j$, and $A_{j+1}$ by their transposes to get $(A_{j-1})^T$, $(A_j)^T$, and $(A_{j+1})^T$ respectively.
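The computation in (3.82) costs one tridiagonal solve per neighboring line. A hypothetical NumPy sketch (dense solves standing in for the tridiagonal solver used in practice) of how the diagonals of $B_{j\pm1}$ are obtained:

```python
import numpy as np

# Hypothetical sketch of (3.82): approximate -A_j^{-1} A_{j-1} and
# -A_j^{-1} A_{j+1} by diagonal matrices that act the same on e = (1,...,1)^T.
def schaffer_diagonals(A_jm1, A_j, A_jp1):
    e = np.ones(A_j.shape[0])
    b_s = np.linalg.solve(A_j, -A_jm1 @ e)   # diagonal of B_{j-1}: the I^s weights
    b_n = np.linalg.solve(A_j, -A_jp1 @ e)   # diagonal of B_{j+1}: the I^n weights
    return b_s, b_n

# Example lines from the 5-point Laplacian, n_x = 5.
nx = 5
A_j = 4.0 * np.eye(nx) - np.eye(nx, k=1) - np.eye(nx, k=-1)  # tridiag[W, C, E]
A_jm1 = -np.eye(nx)                                          # tridiag[SW, S, SE]
A_jp1 = -np.eye(nx)                                          # tridiag[NW, N, NE]
b_s, b_n = schaffer_diagonals(A_jm1, A_j, A_jp1)
print(b_s)   # entries in (0, 1/2]; b_s equals b_n by symmetry here
```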
3.6.1 Extension of Schaffer's Idea to Standard Coarsening

The above was presented in a manner suitable for the symmetric case. It can be modified for the nonsymmetric case, as we did for the collapsing methods, by using the symmetric part of the operator. We can do this by replacing $A_j$ with $\mathrm{symm}(A_j)$ to get

$$-(\mathrm{symm}(A_j))^{-1}\,\mathrm{symm}(A_{j-1})\,e = B_{j-1}\,e, \qquad -(\mathrm{symm}(A_j))^{-1}\,\mathrm{symm}(A_{j+1})\,e = B_{j+1}\,e. \tag{3.83}$$

Schaffer constructs his grid transfer operators in a different manner, and his construction for variable coefficient problems can yield a nonsymmetric coarse grid operator $L^H$ even if $L^h$ is symmetric. We would like the coarse grid operators to be symmetric whenever the fine grid operator is symmetric. We can do this in several
ways, but a more efficient construction is to replace equation (3.83) with
$$-A_j^{-1}\,\mathrm{symm}(A_{j-1})\,e = B_{j-1}\,e, \qquad -A_j^{-1}\,\mathrm{symm}(A_{j+1})\,e = B_{j+1}\,e. \tag{3.84}$$
The advantage of this form is that it can use the same tridiagonal system solver that we are already using for the line solves in the multigrid smoother. Equation (3.83) would require an additional tridiagonal solve for $\mathrm{symm}(A_j)$ and additional storage if the $LU$ factors are to be saved.
Extending these ideas to the standard coarsening case is quite easy. We first compute the grid transfer coefficients for semi-coarsening in the y-direction, define $v_k = \{v_{i,k} : i = 1, \ldots, n_x\}$ for $k = j-1, j, j+1$, and form the tridiagonal matrices

$$A_{j+1} = \mathrm{tridiag}\,[\sigma NW, \sigma N, \sigma NE]_{j+1},$$
$$A_j = \mathrm{tridiag}\,[W, C, E]_j,$$
$$A_{j-1} = \mathrm{tridiag}\,[\sigma SW, \sigma S, \sigma SE]_{j-1}.$$

We save the diagonals of $B_{j-1}$ and $B_{j+1}$ associated with coarse grid lines in the x-direction as the $I^s$ and $I^n$ interpolation coefficients respectively.
To obtain the coefficients for the y-direction, we compute the grid transfer coefficients for semi-coarsening in the x-direction, define

$$v_k = \{v_{k,j} : j = 1, \ldots, n_y\} \quad \text{for } k = i-1, i, i+1,$$

and form the tridiagonal matrices

$$A_{i+1} = \mathrm{tridiag}\,[\sigma SE, \sigma E, \sigma NE]_{i+1},$$
$$A_i = \mathrm{tridiag}\,[S, C, N]_i,$$
$$A_{i-1} = \mathrm{tridiag}\,[\sigma SW, \sigma W, \sigma NW]_{i-1}.$$

We save the diagonals of $B_{i-1}$ and $B_{i+1}$ associated with coarse grid lines in the y-direction as the $I^w$ and $I^e$ interpolation coefficients respectively.
Finally, we can combine the semi-coarsening coefficients from the x and y lines to obtain the $I^{sw}$, $I^{nw}$, $I^{ne}$, and $I^{se}$ interpolation coefficients. They can be computed as the product of the coefficients that have already been computed,

$$I^{nw} = I^n \cdot I^w, \quad I^{ne} = I^n \cdot I^e, \quad I^{sw} = I^s \cdot I^w, \quad I^{se} = I^s \cdot I^e, \tag{3.85}$$

or elimination can be used as before.
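A trivial sketch of the product form (3.85) with hypothetical model values:

```python
# Hypothetical sketch of (3.85): corner weights as products of the line
# weights produced by the two semi-coarsening passes.
IN, IS, IW, IE = 0.5, 0.5, 0.5, 0.5       # e.g. the 5-point Laplacian values
INW, INE = IN * IW, IN * IE
ISW, ISE = IS * IW, IS * IE
print(INW, INE, ISW, ISE)                 # all 0.25 in this model case
```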
The restriction operator for the extension to the standard coarsening case is computed as above, but the transpose of the grid operator is used instead of the symmetric part of the operator. This is done by replacing $A_{j-1}$ and $A_{j+1}$ by their transposes to get $(A_{j-1})^T$ and $(A_{j+1})^T$ respectively.
3.7 Conclusions Regarding Grid Transfer Operators
Many other grid transfer operators were tried in the standard coarsening black box multigrid method in addition to those presented above. However, only three were deemed robust and efficient enough to include in a release version of the solver. The three choices for grid transfer operators are the original nonsymmetric collapsing method described in section 3.5.1, the nonsymmetric hybrid collapsing method described in section 3.5.3, and the nonsymmetric extension of Schaffer's idea described in section 3.6.1. While all three of these choices are good, better results were obtained with the latter two for all test and application problems run to date.

Most of the other grid transfer operators that were tried had good performance on some of the test problems but failed on others. Taken together, there do appear to be enough good results to cover all the test problems, with the exception of re-entrant flows. However, unifying these into one set of grid transfer operators would be much
more expensive to compute and may also introduce trouble when combining the various
types of grid transfer operators.
The grid transfer operators from section 3.5.2, which use a collapsing method to approximate the extension of Schaffer's idea for nonsymmetric problems, were a disappointment. While they seemed to be a good idea, they turned out not to be very robust and in several cases actually caused divergence of the multigrid method. This bad behavior prompted an examination of the coarse grid operators and grid transfer operators. After comparing the operators with those obtained from Schaffer's idea, it was noticed that several things were wrong, but with the modifications described in section 3.5.3 these problems were overcome. The resulting grid transfer operators extended Schaffer's idea to standard coarsening very well.
CHAPTER 4
BASIC ITERATION METHODS FOR SMOOTHERS
In this chapter we examine several basic iteration schemes for use as smoothers
in the Black Box Multigrid solvers. Fourier mode analysis is used to identify which
scheme makes the best smoother for a given type of model problem in two dimensions.
In this chapter we will be using parentheses around a superscript to denote an iteration index; for example, $u^{(n)}$ means the $n$th iterate.
4.1 Overview of Basic Iteration Methods
All of the methods in this section can be characterized in the following way.
The algebraic system of equations to be solved is given by the matrix equation

$$Lu = f. \tag{4.1}$$

The matrix $L$ is an $N_{xy} \times N_{xy}$ matrix, where $N_{xy} = n_x n_y$. The computational grid is two dimensional with $n_x$ and $n_y$ grid points in the x- and y-directions respectively. The matrix $L$ can be split as

$$L = M - N, \tag{4.2}$$

where $M$ is non-singular and assumed easy to invert. Then a basic iteration method for the solution of equation (4.1) is given by

$$Mu^{(n+1)} = Nu^{(n)} + f \tag{4.3}$$
or as

$$u^{(n+1)} = Su^{(n)} + M^{-1}f, \tag{4.4}$$

where $S = M^{-1}N$ is called the iteration matrix. The basic iteration method can also be damped, and if the damping parameter is $\omega$, then the damped method is given by

$$u^{(n+1)} = \omega\left(Su^{(n)} + M^{-1}f\right) + (1 - \omega)\,u^{(n)} \tag{4.5}$$

or by

$$u^{(n+1)} = \bar{S}u^{(n)} + \omega M^{-1}f, \tag{4.6}$$

where $\bar{S}$ is now given by

$$\bar{S} = \omega M^{-1}N + (1 - \omega)I, \tag{4.7}$$

and $I$ is the identity matrix. When $\omega = 1$ we recover the undamped basic iterative method.

The eigenvalues of the damped basic iteration matrix $\bar{S}$ can be given in terms of the eigenvalues of the undamped basic iteration matrix $S$. They are related by

$$\lambda(\bar{S}) = \omega\lambda(S) + 1 - \omega, \tag{4.8}$$

where $\omega$ is the damping parameter and $\lambda(S)$ on the right hand side of the equation is an eigenvalue of $S$, the undamped iteration matrix.
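As a numerical check of (4.5)-(4.8), the following hypothetical NumPy sketch uses a damped Jacobi splitting, $M = \mathrm{diag}(L)$, on a small 1D Poisson matrix; the splitting is chosen only for illustration, not prescribed here.

```python
import numpy as np

# Hypothetical check of (4.5)-(4.8) with a damped Jacobi splitting:
# M = diag(L), N = M - L, on a small 1D Poisson matrix.
n, omega = 8, 0.8
L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
M = np.diag(np.diag(L))
N = M - L

S = np.linalg.solve(M, N)                          # undamped iteration matrix
S_bar = omega * S + (1.0 - omega) * np.eye(n)      # eq. (4.7)

lam = np.sort(np.linalg.eigvals(S).real)
lam_bar = np.sort(np.linalg.eigvals(S_bar).real)
print(np.allclose(lam_bar, omega * lam + 1.0 - omega))   # True, eq. (4.8)

# One damped sweep per step, eq. (4.5): converges to the solution of L u = f.
f = np.ones(n)
u = np.zeros(n)
Minv_f = np.linalg.solve(M, f)
for _ in range(2000):
    u = omega * (S @ u + Minv_f) + (1.0 - omega) * u
print(np.allclose(L @ u, f))                             # True
```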
The error after the $n$th iteration is

$$e^{(n)} = u^{(n)} - u, \tag{4.9}$$

where $u$ is a solution (unique if $L$ is non-singular) to equation (4.1). The error at the $(n+1)$st iteration is related to the error at the $n$th iteration by

$$e^{(n+1)} = Se^{(n)}. \tag{4.10}$$



PAGE 40

Semi-coarsening Figure 1.2. Semi-coarsening. Superimposed fine grid Gh and coarse grid GH, where the indicates the coarse grid points in relation to the fine grid Gh. 15

PAGE 41

The discrete problems now take the form (1.23) and onGH. (1.24) We refer to Lh and LH as the fine and coarse grid operators respectively. The grid operators are positive definite, linear operators (1.25) and LH: cH--+ cH. (1.26) Let Uh be an approximation to uh from equation (1.23). Denote the error eh by (1.27) thus eh can also be regarded as a correction to Uh. The residual (defect) of equation (1.23) is given by (1.28) The defect equation (error-residual equation) on grid Gh (1.29) is equivalent to the original fine grid equation (1.23). The defect equation and its approximation play a central role in the development of a multigrid method. The fine grid equation (1.23) can be approximately solved using an iterative method such as Gauss-Seidel. The first few iterations reduce the error quickly, but then the reduction in the error slows down for subsequent iterations. The slowing down in 16

PAGE 42

the reduction of the error after the initial quick reduction is a property of most regular splitting methods and of most basic iterative methods. These methods reduce the error associated with high frequency (rough) components of the error quickly, but the low frequency (smooth) components are reduced very little. Hence, the methods seem to converge quickly for the first few iterations, as the high frequency error components are eliminated, but then the convergence rate slows down towards its asymptotic value as the low frequency components are slowly reduced. The idea behind the multigrid method is to take advantage of this behavior in the reduction of the error components. The point is that a few iterations of the relaxation method on Gh effectively eliminate the high frequency components of the error. Further relaxation on the fine grid results in little gain towards approximating the solution. However, the smooth components of the error on the fine grid are high frequency components with respect to the coarse grid. So, let us project the defect equation, since it is the error that we are interested in resolving, onto the coarse grid from the fine grid. This projection is done by using a restriction operator to project the residual, rh, onto the coarse grid, where we can form a new defect equation (1.30) where If! is the restriction operator. We can now solve this equation for vH. Having done so, we can project the solution back up to the fine grid with a prolongation (interpolation) operator, I'H, and correct the solution on the fine grid, Gh, (1.31) We call this process (of projecting the error from the coarse grid to the fine grid and correcting the solution there) the coarse grid correction step. The process of projecting the error from a coarse grid to a fine grid introduces high frequency errors. The high 17

PAGE 43

frequencies introduced by prolongation can be eliminated by applying a few iterations of a relaxation scheme. The relaxation scheme can be applied to the projection of the error, I'lfvH, or to the approximation to the solution, Uh, after the correction. It is desirable to apply the relaxation to Uh instead of I'lfvh since then additional reduction of the smooth components of the error in the solution may be obtained. The projection operator from the fine grid to the coarse grid is called the restriction operator, while the projection operator from the coarse grid to the fine grid is called the prolongation operator or, interchangeably, the interpolation operator. These two operators are referred to as the grid transfer operators. In the two level scheme just described, it can be seen that the coarse grid problem is the same, in form, as the fine grid problem with uh and fh being replaced by vH and JH = If!rh respectively. We can now formulate the classical multigrid method by applying the above two level scheme recursively. In doing so, we no longer solve the coarse grid defect equation exactly. Instead, we use the relaxation scheme on the coarse grid problem, where now, the smooth (low) frequencies from the fine grid appear to be higher frequencies with respect to the coarse grid. The relaxation scheme now effectively reduces the error components of these, now, higher frequencies. The coarse grid problem now looks like the fine grid problem, and we can project the coarse grid residual to an even coarser grid where a new defect equation is formed to solve for the error. The grid spacing in this yet coarser grid is 2H. After sufficiently many recursions of the two level method, the resulting grid will have too few grid points to be reduced any further. We call this grid level the coarsest grid. We can either use relaxation or a direct solver to solve the coarsest grid problem. The approximate solution is then propagated back up to the fine grid, using the coarse grid correction step recursively. What we have described informally is one multigrid V-cycle. More formally, 18

PAGE 44

let us number the grid levels from 1 to M, where grid level 1 is the coarsest and grid level M is the finest. Algorithm 1.4.1 ( MGV(k, v1, v2, h) ) 1. relax v1 times on LkUk = pk 2. compute the residual, rk = Fk-LkUk 3. restrict the residual It-1rk to ck-1 Fk-1 = It-1rk and form the coarse grid problem {defect equation) Lk-1uk-l = pk-l' where vk =ILl uk-l and hk-l = 2hk. 4. IF (k-1) i1 THEN call Algorithm MGV(k-1, v1, v2, H) 5. solve Lk-luk-l = pk-l to get the solution uk-l 6. interpolate the defect (coarse grid solution) to the fine grid, and correct the fine grid solution, uk +--uk +ILl uk-1 8. IF {finest grid) THEN Stop This algorithm describes the basic steps in the multigrid method for one iteration of a V-cycle. If the algorithm uses bi-linear (tri-linear in 3D) interpolation, it is called the classical multigrid method. This algorithm assumes that the coarsening is done by doubling the fine grid spacing, which can be seen in step 3 of the algorithm. However, the algorithm is valid for any choice of coarsening, hk-l = mhk, where m is any integer greater than one. 1.4.1 Multigrid Cycling Strategies There are many different types of cycling strategies that are used in multigrid methods besides the V -cycle. We illustrate the different cycling types with the use of a few pictures and brief descriptions. 19

PAGE 45

5 4 3 2 1 V-cycle Figure 1.3. One V-cycle iteration for five grid levels, where the represent a visit to a grid level. 20

PAGE 46

The V -cycle is illustrated graphically in figure 1.3. The represents a visit to a particular grid level. A slanting line connection between two grid levels indicates that smoothing work is to be performed. A vertical line connection between grid levels means that no smoothing is to take place between grid level visits. The grid levels are indicated by a numerical value listed on the left side of the figure, where grid level 1 is the coarsest grid level and is always placed at the bottom of the diagram. The mechanics of the V-cycle were described in the multigrid algorithm in the last section. The V-cycle is one of the most widely used multigrid cycling strategies. Its best performance can be realized when there is an initial guess of the solution available. When a guess is not available a common choice is to use a zero initial guess or to use an F -cycle (see below). The S-cycle is illustrated in figure 1.4. The "S" stands for "sawtooth", because that is what it resembles; it is clearly a V(O, 1)-cycle and thus a special case of a V cycle. The S-cycle is what DeZeeuw's MGD9V [24] black box multigrid code uses for its cycling strategy. The S-cycle usually requires a smoother with a very good smoothing factor in order to be efficient and competitive with other cycling strategies. The W-cycle is illustrated in figure 1.5. The W-cycle is sometimes called a 2-cycle; similarly, a V-cycle can be called a 1-cycle. From the figure 1.5, one can see the W type structure. It is called a 2-cycle because there must be two visits to the coarsest grid level before ascending to the next finer intermediate fine grid level. An intermediate fine grid level is one that is not the finest nor coarsest grid level and where the algorithm switches from ascending to descending based on the number times the grid level has been visited since the residual was restricted to it from a finer grid. The F-cycle is illustrated in figure 1.6 and is called a full multigrid cycle. The figure shows a full multigrid V-cycle, that is, each sub-cycle that visits the coarsest grid level is a V-cycle. An F-cycle can also be created using a W-cycle, or any other 21

PAGE 47

5 4 3 2 1 S-cycle Figure 1.4. One S-cycle iteration for four grid levels, where the represent a visit to a grid level. 4 3 2 1 W-cycle Figure 1.5. One W-cycle iteration for four grid levels, where the represent a visit to a grid level. 22

PAGE 48

5 4 3 2 F-cycle Figure 1.6. One F-cycle iteration for five grid levels, where the represent a visit to a grid level. 23

PAGE 49

type of cycling, for its sub-cycle. The F -cycle is very good when an initial guess for the multigrid iteration is not available, since it constructs its own initial guess. The F -cycle first projects the fine grid problem down to the coarsest grid level and then proceeds to construct a solution by using sub-cycles. After the completion of each sub-cycle the solution on an intermediate fine grid level is interpolated up to the next finer grid level where a new sub-cycle begins. This process is continued until the finest grid level is reached and its own V-cycle completed. At this point if more multigrid iterations are needed then the V -cycling is continued at the finest grid level. 1.5 Black Box Multigrid Black box multigrid is also called geometric multigrid by some and is a member of the algebraic multigrid method (AMG) family. The distinguishing feature of black box multigrid is that the black box approach makes several assumptions about the type of problem to be solved and the structure of the system of equations. The black box multigrid methods also have a predetermined coarsening scheme where the coarse grid has roughly half as many grid points as the fine grid does in one or more of the coordinate directions. For a uniform grid, this means that H = 2h. Both methods automatically generate the grid transfer operators, prolongation IL1 and restriction for 2 :::; k :::; M, and the coarse grid operators Lk for 1 :::; k < M-1. The coarse grid operators are formed using the Galerkin coarse grid approximation, Lk-1 1k-1Lklk k k-ll (1.32) where k = 1, ... M-1. The algebraic multigrid methods deal with the system of equations in a purely algebraic way. The coarsening strategy for general AMG is not fixed nor is the formation of the grid transfer operators, resulting in methods that can be highly adaptable. However, the more adaptable a method is, the more complex its 24

PAGE 50

implementation is likely to be, and it may also be less efficient due to its complexity. Another disadvantage of general AMG methods is that the coarse grid problems are usually not structured even when the fine grid problem is; moreover, the unstructured matrices on coarser levels tend to become less and less sparse, the coarser the grid level. To define the black box multigrid method we need to define several of the multigrid components, such as the grid transfer operators, the coarse grid operators, the type of smoother employed, and the coarsest grid solver. We can also mention the type of cycling strategies that are available and other options. There are several different grid transfer operators that we have developed and used in our codes. They are of two basic types. The first type collapses the stencil of the operator in a given grid coordinate direction to form three point relations, and the second is based on ideas from S. Schaffer [69]. The details of the grid transfer operators will be presented in chapter 3. The coarse grid operators are formed by using the Galerkin coarse grid approximation given in equation (1.32). There are several choices for the smoothing operator available in our codes. The smoothers that we have chosen are all of the multi-color type, except for incom plete line LU. For standard coarsening versions, the choices are point Gauss-Seidel, line Gauss-Seidel, alternating line Gauss-Seidel, and incomplete line L U. The semicoarsening version uses either line Gauss-Seidel by lines in the x-direction or incomplete line L U. The smoot hers will be presented in more detail in chapter 4. In the standard coarsening codes, the coarsest grid solver is a direct solver using LU factorization. The semi-coarsening version allows the option of using line Gauss-Seidel relaxation. There are several cycling strategies that are allowed, and they are chosen by input parameters. The most important choice is whether to choose full multigrid 25

PAGE 51

cycling or not. There is also a choice for N-cycling, where N = 1 is the standard V-cycle and N = 2 is theW-cycle, etc ... For more details, see section (1.4.1) above. 26

PAGE 52

CHAPTER 2 DISCRETIZATIONS: FINITE DIFFERENCE AND FINITE VOLUME This chapter presents some of the discretizations that can be used on the convection-diffusion equation. We present only some of the more common finite dif ference and finite volume discretizations. Although this section may be considered elementary, it was thought to be important for two reasons. First, it shows some of the range of discrete problems that can be solved by the black box multigrid methods. Secondly, it gives sufficient detail for others to be able to duplicate the results presented in this thesis. The sections on the finite volume method present more than is needed, but because there is very little on this topic in the current literature and because of its importance for maintaining O(h2 ) accurate discretizations for interface problems, we have decided to include it. For references on the finite volume discretization see [85] and [52]. The continuous two dimensional problem is given by -\7 (D V'u) + b V'u + cu = f, in 0 = (0, Mx) x (0, My) (2.1) where D is a 2 x 2 tensor, D= (2.2) Dyx Dy and det D > 0, c 2: 0. In general, Dxy -=f. Dyx, but we only consider either Dxy = Dyx or Dxy = Dyx = 0. In addition, D, c, and f are allowed to be discontinuous across 27

PAGE 53

internal interfaces in the domain n. The boundary conditions are given by au on +au= g, on an (2.3) where a and g are functions, and n is the outward unit normal vector. This allows us to represent Dirichlet, Neumann, and Robin boundary conditions. The domain is assumed to be rectangular, n = (0, Mx) x (0, My), and is then divided into uniform cells of length hx = Mx/Nx by hy = My/Ny, where Nx and Ny are the number of cells in the x-and y-directions respectively. A uniform grid is not required, but we will use it to simplify our discussions. It should be noted that finite elements on a regular triangulation can also be used to derive the discrete system of equations to be solved by the black box multigrid methods. However, we will not present any details on how to derive these equations. 2.1 Finite Difference Discretization The finite difference approach to discretization is well known. Finite difference approximation is based on Taylor's series expansion. In one dimension, if a function u and its derivatives are single valued, finite, and continuous functions of x, then we have the Taylor's series expansions, 1 1 u(x +h)= u(x) + hu'(x) + 2h2u"(x) + 6h3u"'(x) + ... (2.4) and u(x-h)= u(x)-hu'(x) + +-... (2.5) If we add equations (2.4) and (2.5) together we get an approximation to the second derivative of u, given by, 1 u"(x) h2 (u(x +h)-2u(x) + u(x-h)) (2.6) 28

PAGE 54

where the leading error term is O(h2). Subtracting equation (2.5) from (2.4) gives u'(x) l (u(x +h)-u(x-h)), (2.7) with an error of O(h2). Both equations (2.6) and (2.7) are said to be central difference approximations. We also derive a forward and backward difference approximation to the first derivative from equations (2.4) and (2.5): u'(x) l (u(x +h)-u(x)) (2.8) and I 1 u (x) h (u(x)-u(x-h)) (2.9) respectively, with an error of O(h). The above approximations can be extended to higher dimensions easily and form the basis for finite difference approximation. We illustrate the finite difference discretization, using stencil notation, by way of examples for some of the types of problems that we are interested in. There are many references on finite differences if one is interested in more details; see for instance [74] [39]. The first example is for the anisotropic Poisson's equation on a square domain, Lu = -E:Uxx -Uyy = f (2.10) where u and f are functions of (x, y) E n. Using central finite differences and discretization on a uniform grid with grid spacing h = 1/N for N = nx = ny, gives the 5-point stencil, -1 Lh __!_ h2 -E 2(1 +c) -E (2.11) -1 29

PAGE 55

The second example is for the convection-diffusion equation on a square do-main, (x,y) ED= (0, 1)2 (2.12) where u, bx, by, and f are functions of x andy. Using a mix of upstream and central finite differences and discretizing on a uniform grid with grid spacing h = 1/N for N = nx = ny, gives the 5-point stencil, (2.13) where (2.14) and E E byh > E 2bxh bxh > E 2byh E bxh < -E E byh < -E 1-Lx = 1 + 2bxh /-Ly = 1 + 2b h y (2.15) 1 lbxhl :S E 1 -lbyhl :S E. 2 2 The third example is the rotated anisotropic diffusion equation on a square domain. It has this name because it is obtained from the second example by rotating the axes through an angle of e. The equation is given by o2u o2u o2u Lu = r::c2 + s2 --2 (r:; -1) cs--r::s2 + c2 0 ox2 oxoy oy2 -(2.16) (x,y) ED= (0,1) X (0,1) where c = cos e, s = sine, and E > 0. There are two parameters, E and e, that can be varied. There are two popular discretizations of this equation which are seen in real 30

PAGE 56

world applications. They differ only in the discretization of the cross derivative term. Let ,8=(c:-1)cs (2.17) then if the grid spacing ish= 1/N for N = nx = ny, the first, a 7-point finite difference stencil, is -a-,8 2(a+,8+'Y) -a-,8 (2.18) -,8-')' ,8 The second, a 9-point finite difference stencil, is, (2.19) The fourth example is the convection-diffusion equation on a square domain, Lu = -c:6u + CUx + suy = 0 (x,y) E 0 = (0, 1)2 (2.20) where c = cos(}, s = sin(}, and c: > 0. Upstream finite differences and discretization on a uniform grid with grid spacing h = 1/N for N = nx = ny, yields -c:(s +lsi) 2.2 Finite Volume Discretization (2.21) There are two types of computational grids that will be considered. The first type is the vertex centered grid Gv, defined as 31

PAGE 57

I I I 4t 4 0 0 4 I -I I Figure 2.1. Vertex centered finite volume grid, where the indicates where the discretization is centered and the dashed lines delineate the finite volumes. 32

PAGE 58

Figure 2.2. Cell centered finite volume grid, where the indicates where the discretization is centered and the solid lines delineate the finite volumes. (2.22) Yj = j hy, j = 0, ... Ny where Nx and Ny are the number of cells in the x and y directions respectively, see figure 2.1. The second type is the cell centered grid Gc which is defined by i=l, ... ,Nx, (2.23) Yj = (jhy, j = 1, ... Ny where Nx and Ny are the number of cells in the x and y directions respectively, see figure 2.2. There are two other somewhat common finite volume grids that will not be discussed here, but can be used to derive the discrete system of equations to be solved by the black box multigrid methods. These grids are defined by placing the finite volume cell centers on the grid lines in one of the coordinate directions and centered between the grid lines in the other coordinate direction. For instance, align the cell centers with the y grid lines and centered between x grid lines. The cell edges will then correspond with x grid lines and centered between y grid lines. We will present finite volume discretization for both vertex and cell centered finite volumes where the coefficients are evaluated at either the vertices or cell centers. 33

PAGE 59

The coefficients could be evaluated at other points, such as cell edges, but we will not show the development of such discretizations because they follow easily from the descriptions given below. 2.3 Cell Centered Finite Volume Discretization; Evalua-tion at the Vertices For the cell centered finite volume discretization the cell has its center at the point ( i ) hx, (j ) hy and the cell is called the finite volume, ni,j, for the point ( i, j) on the computational grid Gc, where i = 1, ... Nx and j = 1, ... Ny; see equation (2.23). A finite volume is shown in figure 2.3. The approximation of u in the center of the cell is called Ui,j. The coefficients are approximated by constant values in the finite volume ni,j. This discretization is useful when the discontinuities are not aligned with the finite volume cell boundaries. Assume that Dxy = Dyx = 0 and that b = 0 for now. If we integrate equation (2.1) over the finite volume ni,j and use Green's theorem we get fdO, (2.24) where nx and ny are the components of the outward normal vector to the boundary We proceed by developing the equations for the interior points Ui,j, and then for the boundary points, where we present the modifications that are needed for the three types of boundary conditions that we consider. We refer to figure 2.3 to aid in the development of the finite volume discretization. 34

PAGE 60

' nw 1 n ne I I I I I w 1 P e --------------1 I I I I I sw :s se Figure 2.3. Cell centered finite volume Oi,j, where P has the coordinates (i(j. 35

PAGE 61

2.3.1 Interior Finite Volumes Referring to figure 2.3, we write the line integral from equation (2.24) as au au se au Dx-a nx + Dy-a ny df = Dy-dx8f!;,j X Y sw ay ne au Dx-a dy se X nw au + D -dx-ne yay sw au Dx-a dy. nw X The integral from ( sw) to ( se) can be approximated by 8 au Dy(sw)-a dx + sw y se au Dy(se)-a dx s y hx ;:::; 2 h (Dy(sw) + Dy(se)) (ui,j-Ui,j-1 ) y hx x ( ) hai,j-1 Ui,j -Ui,j-1 y (2.25) (2.26) where ai,j = (Dy,i,j + Dy,i-1,j), and Dy,i,j is the value of Dy at the point (i,j). The other line integrals of ni,j, (se) to (ne), (ne) to (nw), and (nw) to (sw), can be approximated in a similar fashion. The surface integrals in equation (2.24) can be approximated by: (2.27) and (2.28) where and h,j are approximations of c and f, respectively, at the grid point (i(j, given by 1 Ci,j = 4 (Ci,j + Ci-1,j + Ci-1,j-1 + Ci,j-d (2.29) and 1 !i,j = 4 (fi,j + fi-1,j + fi-1,j-1 + fi,j-1) (2.30) 36

PAGE 62

respectively. The resulting stencil for interior points is (2.31) where 1 -(D 1 + D ) 2 x,t,J-x,t,J (2.32) 1 -(D 1 + D ) 2 y,t,] y,t,J (2.33) and (2.34) At an interface, the diffusivity is given as an arithmetic mean of the diffusion coefficients of adjacent finite volumes. The arithmetic makes sense because the interface passes through the finite volume. This discretization is most accurate when the interface passes directly through the cell of the finite volume. When the finite volume ni,j has an edge on the boundary, the line integral in equation (2.24) for that edge has to be treated differently. We examine what needs to be done for each of the three different types of boundary conditions. We examine the changes that are needed only on one boundary edge, and the other changes needed for the other boundary edges follow in a similar fashion. 2.3.2 Dirichlet Boundary Condition Let us examine the south boundary, (sw) -(se), where we have U(s) = 9(s) (2.35) 37

PAGE 63

The line integral from ( sw) to ( se) is approximated by se au 1 hx sw Dy ay dx 2 hy (Dy,i,j-1 + Dy,i-1,j-1) Ui,j U(s) (2.36) This gives the stencil (2.37) 0 where is defined in equation (2.34) and a is defined by equation (2.32) and (2.33). 2.3.3 Neumann and Robin Boundary Conditions We examine the south boundary, ( sw )-( se), where We then make the approximation Solving for U(s) gives au -+au an = 9(s) (s) au 2 y an (s) 2 y 9(s) -a(s) U(s) 1 2hy9(s) + U(p) U(s) = 1 1 + 2hya(s) The line integral is then approximated as se au Dy-a dx sw y 38 (2.38) (2.39) (2.40) (2.41)

PAGE 64

Now we substitute equation (2.40) to obtain se au ,....., 2 hx y Dy-dx,....., a. 1 a(s)Ui,j -9(s) sw ay 2 + hya(s) (2.42) The resulting stencil for the south boundary is (2.43) 0 where a is defined in equations (2.32) and (2.33), and is now given by (2.44) The other boundaries can be handled in the same way. We have now defined the cell centered finite volume discretization where the coefficients are evaluated at the grid vertices. 2.4 Cell Centered Finite Volume Discretization; Evalua-tion at the Cell Centers This discretization is better suited to problems when the interfaces align with the boundaries of the finite volumes. The discretization is very similar to what was done in section 2.3, except that now the coefficients are evaluated at the cell centers, (i(j, of the finite volume Oi,j The coefficients are approximated by constant values in the finite volume Oi,j. We need to approximate the integrals in equation (2.24). 39

PAGE 65

2.4.1 Interior Finite Volumes We have the line integral, as in equation (2.25), and the integral from (sw) to (se) can be approximated by se au 2 hx Dy-8 dx -h Dy,i,j ui,j-u(s) sw y y (2.45) where Dy,i,j is the value of Dy at the point (i,j). We still need to approximate U(s)' and to do this we will use the continuity of u and Dy Dy,i,j Ui,j U(s) = Dy,i,j-1 U(s) -Ui,j-1 (2.46) yielding Dy i J'Ui J. + Dy i J'-1 Ui J'-1 u -'' '' (s)-D + D 1 y,t,J y,t,J(2.47) We can now substitute equation (2.46) into equation (2.45) to get (2.48) where af.j-1 is now given by aY 2 Dy,i,jDy,i,j-1 i,j-1-D + D y,i,j y,i,j-1 (2.49) The other line integrals of Oi,j, (se) to (ne), (ne) to (nw), and (nw) to (sw), can be approximated in a similar fashion. The surface integrals are approximated in the same way as before, (2.50) and (2.51) but instead of Ci,j and fi,j we have ci-.! 3 _.! and fi-.! 3 _.!. 2' 2 2' 2 40

PAGE 66

where and The resulting stencil for interior points is 2 Dx,i,jDx,i-1,j Dx,i,j + Dx,i-1,j 2 Dy,i,jDy,i,j-1 Dy,i,j + Dy,i,j-1 hy X X hx y y C 1 + C +C 1 + C hx ,] hy (2.52) (2.53) (2.54) (2.55) At an interface, the diffusivity is given as a harmonic average of the diffusion coefficients of the adjacent finite volumes. 2.4.2 Dirichlet Boundary Condition For the south boundary, (sw) to (se), the Dirichlet boundary condition, U(s) = 9(s) The line integral is approximated by se au 2 hx Dy-8 dx ;::::j -h Dy,i,j ui,j-9(s) sw y y (2.56) The stencil is then given by (2.57) 0 where is given in equation (2.55) and a is given by equation (2.53) and (2.54). 41

PAGE 67

2.4.3 Neumann and Robin Boundary Conditions The Neumann and Robin boundary conditions can be handled in the same way as in section 2.3.3. The line integral for the south boundary is se au 2 hx Dy-dx Dy i j a(s)Ui,j 9(s) sw 8y 2 + hya(s) ' (2.58) The resulting stencil is now (2.59) 0 where is given in equation (2.55) and a is given by equation (2.53) and (2.54). 2.5 Vertex Centered Finite Volume Discretization-Eval-uation at the Vertices In this discretization D, c, and f are approximated by constant values in finite volume, ni,j, whose centers are at the vertices. This discretization is useful when the discontinuities align with the boundaries of the finite volumes. 2.5.1 Interior Finite Volumes The development is done the same as before for the cell centered cases; see section (2.3.1). The stencil, when Dxy = Dyx = 0 and b = 0, is given by (2.60) 42

PAGE 68

nw n ne w p e sw se Figure 2.4: Vertex centered finite volume Oi,j at the southern, y = 0, edge boundary. where a?-t,J and 2 Dx,i,jDx,i+l,j Dx,i,j + Dx,i+l,j 2 Dy,i,jDy,i,J+l Dy,i,j + Dy,i,J+l hy X X +a 1 +a hx t,) t,J where c and f are evaluated at the grid point ( i hx, j hy). (2.61) (2.62) (2.63) 2.5.2 Edge Boundary Finite Volumes Let the finite volume Oi,j have its southern edge, ( sw )-( se) at the southern boundary (y = 0) of the domain; see figure 2.4. 2.5.3 Dirichlet Boundary Condition For the Dirichlet boundary conclition we have U(s) = 9(s), and we can just eliminate the unknown U(s) and move it to the right-hand side of the equation. 2.5.4 Neumann and Robin Boundary Conditions The line integral along the boundary is approximated by se au Dy-a dx sw y au ;::::j -hxDy,i,j ay 43 (s)

PAGE 69

nw ne hy r w r s sw se hx Figure 2.5. Southwest corner finite volume, where the indicates where the discretization is centered. and now we need to look at the surface integrals and similarly for f. The stencil for the edge boundary is given by where h, X --"-a 1 hx t,J 0 hx y hy -hy ai,j + hx and a is defined by equations (2.61) and (2.62). (2.64) (2.65) (2.66) (2.67) 2.5.5 Corner Boundary Finite Volumes The corner finite volume discretization will be shown for the southwest corner of the computational grid; see figure (2.5). 44

PAGE 70

2.5.6 Dirichlet Boundary Condition In the Dirichlet boundary con-dition case, the unknown U(sw) is eliminated by the boundary condition equation, U(sw) = 9(sw) (2.68) The term 9(sw) is incorporated into the right hand side of the discrete system of equa tions. The stencil for the southwest corner is 0 (2.69) 0 where is defined as -hx y hy x -2hy ai,j -2hx ai,j (2.70) and a is defined by equations (2.61) and (2.62). 2.5.7 Neumann and Robin Boundary Conditions In the Neumann and Robin boundary condition cases, we have au -ax+ awU 9w (sw) (2.71) au ay + asU = 9s, (sw) (2.72) where the subscripts ( sw) means evaluation at the sw-point; see figure 2.5. The line integrals around the finite volume are approximated by se au Dy-a dx sw y nw au Dx-a dy sw X D .. au(sw) 2 X ay 1 2hxDy,i,j (as(sw)ui,j-g8(sw)) D .. au(sw) 2 y ay 1 2hyDx,i,j (aw(sw)ui,j9w(sw)) 45 (2.73) (2.74)

PAGE 71

ne au 1 h nw Dy ay dx "2 h: af,j (ui,j-Ui,j+I). The stencil for the southwest corner is 0 1 h +-4hxhyci,1 + BC __ Y ax 2hx t,J 0 (2.75) (2.76) (2.77) where is defined in equation (2.70), a is defined by equations (2.61) and (2.62), and (2.78) 2.6 Vertex Centered Finite Volume Discretization-Eval-uation at the Cell Vertices In this discretization D, c, and f are approximated by constant values in finite volume, ni,j, whose centers are at the vertices. This discretization is useful when the discontinuities pass through the interior of the finite volumes, and best when the interface passes through the cell center. 2.6.1 Interior Finite Volumes The development is the same as for the previous section on vertex centered finite volumes; see section 2.5. The stencil, when Dxy = Dyx = 0 and b = 0, is given by h, X --"-(X 1. hx t,J (2.79) 46

PAGE 72

where 1 2 (Dx,i+l,j + Dx,i+l,j+l) (2.80) 1 2 (Dy,i,j+l + Dy,i+l,j+l) (2.81) and (2.82) and where c and fare evaluated at the grid point (i hx,j hy) 1 Ci,j = 4 (ci-1,j-1 + Ci+l,j-1 + Ci-1,j+l + Ci+l,j+l) (2.83) and 1 fi,j = 4 (!i-1,j-1 + fi+1,j-1 + fi-1,j+1 + fi+1,j+1). (2.84) Let the finite volume ni,j have its southern edge, ( sw )-( se) at the southern boundary (y = 0) of the domain; see figure 2.4. 2.6.2 Dirichlet Boundary Condition For the Dirichlet boundary conclition we have U(s) = 9(s), and we can just eliminate the unknown U(s) and move it to the right-hand side of the equation. 2.6.3 Neumann and Robin Boundary Conditions The line integral along the boundary is approximated by se au Dy""""f.)dx sw uy hy D 1 (u 1 u ) 2hx y,t+ ,J t+ ,J t,J 47 (2.85)

PAGE 73

and similarly for the line integral from (sw)-(nw), and the line integral from (nw)-(ne) is done as before for the interior. The surface integrals are now given by (2.86) where 1 c = (c '+1 + c+1 '+1) t,J 2 t,J t ,J (2.87) and similarly for f. The stencil for the edge boundary is given by h 2hx Dx,i-1,j _!!:JJ_ D .. 2hx x,t,J (2.88) 0 where -hx a.Y + hy (D 1 + D . ) hy i,j 2 hx x,t,J x,t,J (2.89) and a. is defined by equations (2.80) and (2.81). 2.6.4 Corner Boundary Finite Volumes The corner finite volume discretization will be shown for the southwest corner of the computational grid; see figure (2.5). 2.6.5 Dirichlet Boundary Condition In the Dirichlet boundary con-dition case, the unknown U(sw) is eliminated by the boundary condition equation, U(sw) = g(sw) The term g(sw) is incorporated into the right hand side of the discrete 48

PAGE 74

system of equations. The stencil for the southwest corner is 0 _!!J!_D .. 2hx (2.90) 0 where is defined as hx hy ---D --D 2hy 2hx (2.91) and a is defined by equations (2.80) and (2.81). 2.6.6 Neumann and Robin Boundary Conditions In the Neumann and Robin boundary condition cases, we have au -ax+ awU 9w (sw) (2.92) au -ay + asU = 9s, (sw) (2.93) where the subscripts (sw) means evaluation at the sw-point; see figure 2.5. The line integrals around the finite volume are approximated by se au Dy-a dx sw y 1 au(sw) 2,hxDy,i+l,j+l ay 1 2hxDy,i+l,j+l (as(sw)ui,j-9s(sw)) (2.94) 1 au(sw) 2,hyDx,i+l,j+l ay 1 2hyDx,i+l,j+l (aw(sw)ui,j-9w(sw)) (2.95) (2.96) (2.97) 49

PAGE 75

The surface integrals are approximated by 1 0 cudn 4hxhyci+l,j+Iui,j (2.98) and similarly for f. The stencil for the southwest corner is 2hhy Dy,i+l,j+l 0 (2.99) 0 where is defined in equation (2.91), a is defined by equations (2.80) and (2.81), and (2.100) 50

PAGE 76

CHAPTER 3 PROLONGATION AND RESTRICTION OPERATORS Suppose that we have an elliptic linear operator L on a two dimensional rectangular domain n: Lu=f (3.1) This problem can be discretized using finite differences (or other discretization) on a rectangular grid Gh with grid spacing h, given by We assume that the discretization is represented in stencil notation as ( i,j) NW NNE W C E SW S SE ( i,j) (3.2) (3.3) (3.4) where NW, N, N E, ... are the coefficients of the discretization stencil centered at (xi,Yj) The size of the fine grid operator's stencil is important to remember because we require that the coarser grid operator's stencil not be any larger than the largest allowable fine grid operator stencil. By keeping the grid operator stencil fixed at a maximum of 9-points, we ensure that the implementation will be easier and more efficient by maintaining the sparsity of the operators. This consideration is important 51

PAGE 77

when discussing the formation of the grid transfer operators since we use the Galerkin coarse grid approximation approach to form the coarse grid operators. The formulation of the coarse grid operators involves the multiplication of three matrices, and if their stencils are at most 9-point, then the coarse grid operator will also be at most 9-point. If we use grid transfer operators with larger stencils, the size of the coarse grid operator stencil can grow without bound, as the grids levels became coarser, until the stencils either become the size of the full matrix or we run out of grid levels. Another guiding principal that we follow is that if we are given a symmetric fine grid operator we would like all the coarser grid operators to be symmetric also. In order to follow this principal the interpolation and restriction operators must be chosen with care. Before getting started it would be best to show where and how the operators are used to transfer components between grid levels. We assume the layout of coarse and fine grids shown in figure 1.1. We refer to coarse grid points with indices ( ic, ic) and fine grid points with indices ( i 1, j 1). 3.1 Prolongation We interpolate the defect correction (error) from the coarse grid level to the fine grid level, where it is added as a correction to the approximation of the fine grid solution. There are four possible interpolation cases for standard coarsening in two dimensions. The four cases are illustrated in figure 3.1, where the thick lines represent coarse grid lines, thin lines represent the fine grid lines, circles represent coarse grid points, X represents the fine grid interpolation point, and the subscripts f and c distinguish the fine and coarse grid indices respectively. Figure 3.1(a) represents interpolation to fine grid points that coincide with coarse grid points. Figure 3.1 (b) represents interpolation to fine grid points that do not coincide with coarse grid points, 52

PAGE 78

. (a) (b) i -1 c jc j, jc j, j ,-1 .I/ j ,-1 /I j c-1 j c-1 i, i ,-1 i, ic i -1 c ic (c) (d) Figure 3.1. The four 2D standard coarsening interpolation cases, where represents the coarse grid points used to interpolate to the fine grid point represented by x. The thick lines represent coarse grid lines. 53

PAGE 79

but lie on coarse grid lines in the x-direction. Figure 3.1 (c) represents interpolation to fine grid points that do not coinciding with coarse grid points, but lie on coarse grid lines in they-direction. Figure 3.1(d) represents interpolation to fine grid points that do not align with any coarse grid lines either horizontally or vertically. The fine grid points that are also coarse grid points, case (a), use the identity as the interpolation operator. The coarse grid correction is then given by (3.5) where (Xi f, YiJ) = ( Xic, YjJ on the grid; here the interpolation coefficient is 1. The fine grid points that are between two coarse grid points that share the same Yj coordinate, case (b), use a two point relation for the interpolation. The coarse grid correction is given by (3.6) where Yjc = YiJ and Xic-1 < Xit-1 < Xic on the grid, and the interpolation coefficients are IJV _1 1 and Le 1 l>C C liC' C The fine grid points that are between two coarse grid points that share the same Xi coordinate, case (c), use a similar two point relation for the interpolation. The coarse grid correction is then given by (3.7) where Xic = XiJ and Yjc-1 < Yjt-1 < Yjc on the grid, and the interpolation coefficients are If 1 and I! 1 _1 c, c c, c The last set of fine grid points are those that do not share either a Xi or a Yj coordinate with the coarse grid, case (d). We use a four point relation for the interpolation in this case, and the coarse grid correction is given by 54

PAGE 80

+ (3.8) + where Xic < Xif < Xic+l and Y)c < Y)j < Y)c+l, and the interpolation coefficients are lf'W_1 J. _1 I/'"!!..1 J. TfeJ. and IrJ. _1 The interpolation operator's coefficients can also (lc c c..c c c..c, c c-c, c be represented in stencil notation, just like the grid operator, as h 1nw In 1ne I'H = JW 1 Je (3.9) JSW fS JSe H 3.1.1 Prolongation Correction Near Boundaries In the black box multigrid solvers, the right hand side of the grid equation next to the boundary can contain boundary data, in which case the above interpolation formulas can lead to 0(1) interpolation errors. To improve this error we can use a correction term that contains the residual to bring the interpolation errors back to O(h2); [26]. The correction term is O(h2 ) for the interior grid points, and in general will not improve the error on the interior, but near the boundary the correction term can be of 0(1). The correction term takes the form of the residual divided by the diagonal of the grid equation coefficient matrix; the correction term is equal to ri,j/Ci,j, where the residual was computed for the grid before restriction. The correction term is added to equations 3.6, 3.7, and 3.8, which are for interpolating to fine grid points that are not coarse grid points. Applying the correction is similar to performing an additional relaxation sweep along the boundary, and it does not affect the size of the prolongation stencil. 55

PAGE 81

3.2 Restriction The restriction operator restricts the residual from the fine grid level to the coarse grid level, where it becomes the right-hand-side of the defect equation (error-residual equation). The restriction equation is (3.10) restriction coefficients can also be represented in stencil notation as H If!= (3.11) h where the restriction is centered at the fine grid point (Xi 1 Y]f) = ( Xic, YjJ. 3.3 Overview In the following sections we present several different interpolation operators by exhibiting the coefficients needed to represent the operator's stencil. In most cases, we omit the indices of the operators, it being be understood that the grid operator is given at the fine grid point (Xi 1 YiJ). The grid transfer operators can be split into two groups based upon how the operators are computed. The first class of grid transfer operators is based on using a collapse (lumping) in one of the coordinate directions, yielding a simple three point relation that can be 56

PAGE 82

solved. The second class of grid transfer operators is based on an idea from Schaffer's semi-coarsening multigrid [69]. Both these methods for operator induced grid transfer operators are an approximation to the Schur complement, that is, they try to approximate the block Gaussian elimination of the unknowns that are on the fine grid but not on the coarse grid. The collapsing methods are a local process while Schaffer's idea is to apply the procedure to a block (line) of unknowns. We start by presenting the grid transfer operators used in the symmetric versions of the black box multigrid solvers. Then we present several different grid transfer operators that are used in the nonsymmetric black box multigrid solvers. In classic multigrid methods, the grid transfer operators are often taken to be bilinear interpolation and full weighting; injection is also popular. To see why we do not use these choices, we need to look at the type of problems that we are hoping to solve. These problems are represented by the convection-diffusion equation, -\7 (D \lu) + b \lu + c u = J, (3.12) where D, c, and f are allowed to be discontinuous across internal boundaries. The black box multigrid solvers are aimed at solving these problems when D is strongly discontinuous. The classical multigrid grid transfer operators perform quite well when D jumps by an order of magnitude or less, but when D jumps by several orders of magnitude, the classical methods can exhibit extremely poor convergence, since these methods are based on the continuity of \lu and the smoothing of the error in \lu. However, it is D \lu that is continuous, not \lu. Hence, if D has jumps of more than an order of magnitude across internal boundaries, then it is more appropriate to use grid transfer operators that approximate the continuity of D \lu instead of the continuity of \lu. It is important to remember that we are using the Galerkin coarse grid approximation approach to form the coarse grid operators. We want the coarse 57

PAGE 83

grid operators to approximate the continuity of D \lu. This goal is accomplished by basing the grid transfer operators on the grid operator Lh. Before proceeding with the definitions of the first class of grid transfer operators, we need to define a few terms and make a few explanations. Definition 3.3.1 Using the grid operator's stencil notation, define Ra, row sum, at a given grid point, (Xi, Yj), to be R"2:, = C+NW +N +NE+ W +E+SW +S+SE, (3.13) where the subscript ( i, j) has been suppressed. The row sum is used to determine when to switch between two different ways of computing the grid transfer coefficients at a given point. The switch happens when the grid operator is marginally diagonally dominant, or in others words, when the row sum is small in some sense. We recall what is meant by the symmetric part of the operator. Definition 3.3.2 Define the symmetric part of the operator, L, as aL = symm(L) = (L + L*) where L* is the adjoint of the grid operator L. The notation applies equally to the grid operator's coefficients, for example: a Ni,j = ( Ni,j + Si,j+ 1) and aSWi,j = (SWi,j + N Ei-l,j-l) (3.14) (3.15) In addition, we can give some examples of the adjoint (transpose) of the grid 58

PAGE 84

operators coefficients are: (w;. )* 2,) (aBE-)* 2,J (3.16) and ( aG-)* 2,) 3.4 Symmetric Grid Operator Lh: Collapsing Methods The interpolation operator is based upon the discrete grid operator Lh, while the restriction operator is based on (Lh)*. We want to preserve the flux J..L (D 'VU) across interfaces, which can be done by using the grid operator Lh. Assume that Lh has a 5-point stencil, then (3.17) which gives the interpolation formula (3.18) When Lh has a 9-point stencil, the idea is to integrate the contributions from the other coefficients ( NW, NE, SW, and SE), which can be done by summing (collapsing) the coefficients to get the three point relation, (3.19) where A_= (NW + W + SW), Ao = (N + C + S), and A+= (NE + E + SE). The computation of the Iw and Je coefficients are done by collapsing the grid operator in the y-direction to get a three point relation on the x-grid lines. Let the interpolation formula be given by (3.20) 59

PAGE 85

where vk is written for vk,j, and Ai-l= (NW + W +SW)i,j, Ai = (N +C+S)i,j, and Ai+l = ( N E + E + S E)i,j. We now solve the equation for Vi to get the interpolation formula in an explicit form. The interpolation coefficients Iw and r are then given by and Writing out the coefficients explicitly gives Iw = _NW+W+SW N+C+S NE+E+SE N+C+S (3.21) (3.22) (3.23) (3.24) where JW and Je are evaluated at (ic-1,jc) and (ic,jc) respectively, and the other coefficients on the right hand side are evaluated at (if -1, iJ). If however, the row sum number,RI: (see 3.13), is small (see 3.28) then instead of (N + C + S)i for Ai we use -(NW + W + SW + NE + E + SE)i. These two formulas give the same result when the row sum is zero, which is the case for an operator with only second order terms away from the boundary. This idea is observed to lead to better convergence, and it is due to Dendy [30]. The coefficients are then defined by Iw = NW+W+SW NW + W +SW +NE+E+SE (3.25) and r = NE+E+SE NW + W +SW +NE+E+SE' (3.26) where Iw and Je are evaluated at (ic-1,jc) and (ic,jc) respectively, and the other coefficients on the right hand side are evaluated at (if -1, j f). 60

PAGE 86

Let ')'=min{INW+W+SWI, INE+E+SEI, 1.}. (3.27) Then by small we mean that R'E < -'Y (NW + W + SW + N + S + N E + E + SE) (3.28) where R'E is the row sum defined above. The computation of the JS and In coefficients is done by collapsing the grid operator in the x-direction to get the three point relation on the y-grid line. Let the interpolation formula be given by (3.29) where Vj-1 = { Vi,j-1 : i = 1, ... nx }, Vi = { Vi,j : i = 1, ... nx }, Vj+1 = { Vi,j+1 : i = 1, ... ,nx}, and Aj+l = (NW + N + NE)i,j+l, Aj = (W + C + E)i,j, and Aj-1 = (SW + S + SE)i,j-1 We now solve the equation for Vj to get the interpolation formula in an explicit form: The interpolation coefficients Is and In are given by and Writing out the coefficients explicitly gives JB = _SW+S+SE W+C+E' NE+N+NE W+C+E 61 (3.30) (3.31) (3.32) (3.33)

PAGE 87

If, however, the row is small, then instead of (W + C + E)j for Aj we use -( NW + N + N E + SW + S + S E) j. The coefficients are then defined by SW+S+SE (3.34) NW +N +NE+SW +S+SE' NW+N+NE (3.35) NW +N +NE+SW +S+SE' where JS and In are evaluated at ( ic, jc -1) and ( ic, jc) respectively, and the other coefficients on the right hand side are evaluated at (if, j f -1). Let 'Y = min { INW + N + N El ISW + S + SEI 1.} (3.36) Then by small we mean that < -'Y (NW + N + N E + SW + S + SE) (3.37) where is the row sum. The computation of the interpolation coefficients pw, 1nw, 1ne, and pe is similar to that of the coefficients that have already been computed. Let the interpolation formula be given by (3.38) + Ai-1,j-1Vi-1,j-1 + Ai,j-1Vi,j-1 + Ai+1,j-1Vi+I,j-1 = 0 where the A*,* are just the corresponding grid operator coefficients. We can now solve for Vi,j to get the interpolation formula. A-1 ViJ. =' 2,) +Ai-1,jVi-1,j + A,jVi,j + Ai+1,jVi+1,j 62 (3.39)

PAGE 88

Notice that Vi,j-1, Vi-1,j, Vi+1,j, and Vi,j+1 are unknowns. However, we can use their interpolated values that we computed above, being careful to note that their stencils are all centered at different grid points. After performing the substitutions and collecting the terms for Vi-1,j-1, Vi+1,j-1, Vi-1,j+1, and Vi+1,j+1 we get (3.40) where instead of having to compute everything all over again, it can be seen that pw, Inw, Ine, and pe can be expressed in terms of the previous four coefficients, Iw, Ie, I8 and In. However, we must now explicitly write the subscripts for the coefficients Iw, Ie, fS, and In to indicate where their stencils are centered relative to the interpolated point's stencil, which is centered at (i,j). The formulas for the four coefficients are sw + s IW. 1 + w J'! 1 ISW = 2,)-2,) c (3.41) where Isw is evaluated at (xic-1, Y]c1), NW + N Iw.+1 + W In 1 Inw = 2,] 2,J c (3.42) where Inw is evaluated at (xic1, Yjc), N E + N +1 + E I:t+1 Ine = 2,] 2 ,] c (3.43) where Ine is evaluated at (xic, YjJ, SE + S Ie 1 + E !'!+1 Ise = 2,]-2 ,J c (3.44) where Ise is evaluated at ( Xic, Y]c-1), and the the other stencil coefficients are evaluated at (Xi 1 Yj 1). If, however, RL. is small, then SW + S IYJ. 1 + W !'! 1 If!W = 2,)-2,) 2c1,Jc-1 NW +N +NE+ W +E+ SW + S + SE' (3.45) NW + N + W In 1 Jnw = 2,] 2.J 2c1,Jc NW +N +NE+ W +E+SW +S+SE' (3.46) 63

PAGE 89

N E + N Ie +1 + E J!t+1 J!te. = t,J t ,J tc,Jc NW +N +NE+ W +E+SW +S+SE' (3.47) SE + S Ie 1 + E Jl!+1 = t,J-t ,J tc,Jc1 NW +N +NE+ W +E+SW +S+SE' (3.48) and where NW, N, NE, W, C, E, SW, S, and SE are evaluated at (xit,YiJ) Let "(=min ISW+W+NWI, INW+N+NEI, (3.49) IN E + E + SEI ISE + s + SWI 1. Then by small we mean that R'E. < -"( (NW + N + N E + W + E + SW + S + SE) (3.50) The interpolation correction terms are Ai1rH, Aj1rH, or A:;:}rH for the cor responding interpolation formulas above, where rH is the residual on the coarse grid. Note that the A's change depending on whether R'E. is small or not. The computation of the interpolation coefficients in this way was used in the BOXMG, BOXMGP, BBMG, and BBMGP codes for symmetric problems [1], [26], [30], [10]. Similar computations have also been used for most black box, geometric, and algebraic multigrid solvers for symmetric problems arising from finite difference and finite volume discretizations using either a 5-point or a 9-point standard stencil [7]' [23], [29]' [31], [52]' [54], [53]' [55], [63]' [85], [24]. The computation of the restriction operator's coefficients is closely related to that of the interpolation coefficients. In fact, in the symmetric case, the restric-tion coefficients for the symmetric grid operator Lh can be taken to be equal to the interpolation coefficients, "'E R. (3.51) 64

PAGE 90

3.5 Nonsymmetric Grid Operator Lh: Collapsing Meth-ods The interpolation coefficients can be computed in the same way as in the symmetric case except that we replace all of the grid operator's coefficients with their equivalent symmetric stencil coefficients, denoted by O"(). However, the row sum R'f', definition remains unchanged. 3.5.1 Prolongation Based on symm(Lh) The computation of the Iw and Je coefficients is given by If, however, R'f', is small, then O"NW + O"W + O"SW O"N + O"C + O"S O"NE + O"E + O"SE O"N+O"C+O"S Iw = O"NW + O"W + O"SW O"NW + O"W + O"SW + O"NE + O"E + O"SE' (3.52) (3.53) (3.54) (3.55) In (3.52)-(3.55) JW and Je are evaluated at (xic-1, Yjc) and (xic' YjJ respectively, and the other coefficients on the right hand side are evaluated at (Xi 1-1, Yj 1 ) for the Lh components. Let (3.56) Then by small we mean that (3.57) 65

PAGE 91

The formulas for the In and I8 coefficients are If, however, Ry:, is small, then O"NW + O"N + O"NE O"W + O"C + O"E O"SW + O"S + O"SE O"W + O"C + O"E r O"NW + O"N + O"NE O"NW + O"N + O"NE + O"SW + O"S + O"SE' O"SW + O"S + O"SE O"NW + O"N + O"NE + O"SW + O"S + O"SE' (3.58) (3.59) (3.60) (3.61) where In and JS are evaluated at (xic, Yjc) and (xic, Yjc1) respectively, and the other coefficients on the right hand side are evaluated at (xi 1 Y]f-1) for the Lh components. Let '"'( = min{IO"NW + O"N + O"NEI, IO"SW + O"S + O"SEI, 1.}. (3.62) Then by small we mean that Ry:, < -'"'( (O"NW + O"N + O"NE + O"W + O"E + O"SW + O"S + O"SE). (3.63) The computation of the interpolation coefficients pw, Inw, Ine, and Ise can be expressed in terms of the other four coefficients: O"SW + O"S I'!ll. 1 + O"W fS 1 II!W -,] C (3.64) O"NW + O"N Iw-+1 + O"W ITt 1 Inw = ,J C (3.65) O"NE + O"N .+1 + O"E In+1 Ine. = ,J C (3.66) O"SE+O"SIe. 1+0"EJS+1 Ise. = ,] C (3.67) 66

PAGE 92

If, however, Ry:, is small, then (3.68) O"NW + O"N I'!D.+1 + O"W In 1 I'f}W = ,J O"NW + O"N + O"NE + O"W + O"E + O"SW + O"S + O"SE' (3.69) O"NE + O"N I<:-+1 + O"E I"!-+1 I"!'e. = ,J O"NW + O"N + O"NE + O"W + O"E + O"SW + O"S + O"SE' (3.70) O"SE + O"S I<:1 + O"E P+1 Il?e. = ,J O"NW + O"N + O"NE + O"W + O"E + O"SW + O"S + O"SE' (3.71) where O"NW, O"N, O"NE, O"W, O"C, O"E, O"SW, O"S, and O"SE are evaluated at (xit,Y]f) for the Lh components. Let 'Y =min (3.72) Then by small we mean that (3.73) It has been found in practice that the restriction operator If! need not be based on the same operator as the interpolation operator, so we change its symbol to be Jf! to reflect this change. The restriction operator's coefficients are based on (Lhf instead of O" Lh. The restriction coefficients are computed in exactly the same way as the interpolation coefficients except that all of the grid operator's coefficients in the computations are replaced by their transposes. The computations for the restriction coefficients are now straightforward and will not be written out. The grid transfer operators have been computed in this fashion for the black box multigrid solver for nonsymmetric problems [27]. It should be noted that when the grid operator Lh is symmetric, then the computations given here for both the symmetric case and nonsymmetric case yield the same grid transfer coefficients. 67

PAGE 93

3.5.2 Prolongation Based on Lh and symm(Lh) The third possibility for computing the grid transfer operators is one that uses the same form of the computations as above, see section 3.5.1. This prolongation is a point collapse approximation to Schaffer's ideas; see section 3.6. The only difference in the above computations for the nonsymmetric case is that for the denominators, Ai1 and Aj\ we use the coefficients based on Lh instead of a Lh. The test for small is still in the same form as before except that Lh is used, but 'Y is still based on a Lh. The restriction operator coefficients are computed as before, but the denomi-nator is now based on Lh instead of on (Lhf. 3.5.3 Grid Transfer Operators Based on a hybrid form of Lh and symm(Lh) The prolongation operator coefficients are computed the same as in the last section 3.5.2. However, the computation of the restriction operator coefficients has been modified into a hybrid form that uses both LT and L. The difference in the computation of the restriction coefficients comes into play when the switch is made in the denominator, Ai1 and Aj\ because the row sum is small. When the row sum is large we modify the denominator by adding in two coefficients from the grid operator L. We can illustrate this modification by computing the restriction coefficients Jw and Je. If, however, R'f', is small, then (NWf + (W)T + (SWf N+C+S (NEf + (Ef + (SEf N+C+S w (NWf + (W)T + (SW)T J = (NW)T + (W)T + (SW)T + N + S + (N E)T + (E)T + (SE)T' 68 (3.74) (3.75) (3.76)

PAGE 94

e (NEf + (Ef + (SEf J = (NW)T + (W)T + (SW)T + N + S + (N E)T + (E)T + (SE)T. (3.77) In (3.74)-(3.76) JW and Je are evaluated at (xicl, YjJ and (xic, YjJ respectively, and the other coefficients on the right hand side are evaluated at (Xi 1-1, Y]f) for the Lh components. Let 'Y =min (NWf + (Wf + (SWf (NEf + (Ef + (SEf 1. (3.78) Then by small we mean that < -T ( (NW)T + (Wf + (SWf + (Nf +(Sf+ N + S (3.79) +(N Ef + (Ef + (SEf ) The restriction coefficients Jn and JS are computed in a similar way. The motivation behind these modifications was to try to get the coarse grid operator to approximate the one obtained when using the extension of Schaffer's idea; see section 3.6. The grid operators from section 3.5.2 above were computed to approx-imate the grid transfer coefficients based on an extension of Schaffer's idea; while the method in this section attempts to do the same thing, it also makes some modifications so that the coarse grid operator more closely approximates the one obtained in section 3.6.1. 3.6 Nonsymmetric Grid Operators: Extension of Schaffer's Idea The second class of grid transfer routines is based on Schaffer's idea for grid transfer operators in his semi-coarsening multigrid method [70]. Schaffer's idea is to approximate a full matrix by a diagonal matrix to compute the grid transfer operators. 69

PAGE 95

Schaffer's idea was used in the development of the semi-coarsening black box multi grid method [32]. We took Schaffer's idea and extended it to apply to the standard coarsening grid transfer operators. The ideas used in the semi-coarsening method are as follows. Suppose that coarsening takes place only in the y-direction. Then the interpolation to points on the fine grid can be represented by (3.80) wherevk = {vi,k: i = 1, ... ,nx, j = j -1,j,j + 1}, ThetridiagonalmatricesAj-1, Aj, and Aj+1 represent the nine point grid operator on the j-1, j, and j + 1 grid lines respectively; Aj+l tridiag [NW, N, N E]i+l Aj tridiag [W, C, E]j Aj-1 tridiag [SW, S, SE]j_1 As before, we solve this equation for Vj to get, (3.81) where we have assumed that Aj1 exists and can be stably inverted. This assumption can not always be guaranteed, but Schaffer's and our methods allow line relaxation as a smoother, where these assumptions are necessary. The methods would fail if the assumptions did not hold, so in that sense we can say that the assumptions hold. From equation (3.81), we form the quantities -Aj1 Aj-1 and -Aj1 Aj-1, lead ing to a non-sparse interpolation operator. If the interpolation operator is not sparse, that is, involves only Vi,j-1 and Vi,j+l for interpolation at the point ( i, j), then the coarse grid operators formed by the Galerkin coarse grid approximation approach will grow 70

PAGE 96

beyond a 9-point stencil. This is a property that we would very much like to avoid, since it would lead to full operators on the coarser grid levels. Schaffer's idea, also arrived at independently by Dendy, is to approximate these quantities with diagonal matrices Bj-1 and Bj+l This is accomplished by solving the following relations -Aj1 Aj-1e = Bj-1e (3.82) -Aj1 AJ+1e = Bj+le, where e = (1, 1, ... 1f. They can be solved quickly because they are tridiagonal equa tions. After solving, the entries (diagonals) in Bj-1 and Bj+l are just the interpolation coefficients JS and In respectively. In the semi-coarsening case the restriction operator is still based on the transpose of the nonsymmetric grid operator Lh. This is done by replacing Aj-1 Aj, and Aj+l by their transpose to get (Aj-1)*, (Aj )*, and (Aj+l)* respectively. 3.6.1 Extension of Schaffer's Idea to Standard Coarsening The above was presented in a manner suitable for the symmetric case. It can be modified for the nonsymmetric case, as we did for the collapsing methods, by using the symmetric part of the operator. We can do this by replacing A* with O"A* in equation (3.82) to get, -(symm(Aj))-1 symm(Aj-1) e = Bj-1e (3.83) -(symm(Aj))-1 symm(Aj+l) e = Bj+le. Schaffer constructs his grid transfer operators in a different manner and his construction for variable coefficient problems can yield a nonsymmetric coarse grid operator LH even if Lh is symmetric. We would like the coarse grid operators to be symmetric whenever the fine grid operator is symmetric. We can do this is several 71

PAGE 97

ways, but a more efficient construction is to replace equation (3.83) with -Aj1 symm(Aj-1) e = Bj-le (3.84) -Aj1 symm(Aj+I) e = Bj+le. The advantage of this form is that it can use the same tridiagonal system solver that we are already using for the line solves for the multigrid smoother. Equation (3.83) will require an additional tridiagonal solve for symm(Aj) and additional storage if the LU factors are to be saved. To extend these ideas to the standard coarsening case is quite easy. We first compute the grid transfer coefficients for semi-coarsening in the y-direction, and define vk = { vi,k : i = 1, ... nx, k = j -1, j, j + 1} and the tridiagonal matrices AJ+l tridiag[aNW,aN,aNE]J+1 A1 tridiag [W, C, E]1 Aj-1 tridiag [aSW, aS, aSE]1 1 We save the diagonals of Bj-1 and Bj+l associated with coarse grid lines in the xdirection as the JS and In interpolation coefficients respectively. To obtain the coefficients for the y-direction, we compute the grid transfer coefficients for semi-coarsening in the x-direction and define Vk = { Vk,j : k = i-1, i, i + 1, j = 1, ... ny} and the tridiagonal matrices Ai+l tridiag [aSW, aW, aNW]1+1 Ai tridiag [S, C, N]1 Ai-l tridiag [aSE, a E, aN E]1 _1 72

PAGE 98

We save the diagonals of Bi-1 and Bi+l associated with coarse grid lines in the xdirection as the Iw and Ie interpolation coefficients respectively. Finally, we can then combine the semi-coarsening coefficients from the X and Y lines to obtain the pw, Inw, Ine, and pe interpolation coefficients. They can be computed as the product of the coefficients that have already been computed, Inw =In. Iw ISW =IS. IW or elimination can be used as before. Ine =In. Ie Ise =Is. Ie, (3.85) The restriction operator for the extension to the standard coarsening case is computed as above, but the transpose of the grid operator is used instead of the symmetric part of the operator. This is done by replacing Aj-1 and AJ+1 by their transpose to get (Aj_l)* and (Aj+l)* respectively. 3. 7 Conclusions Regarding Grid Transfer Operators Many other grid transfer operators were tried in the standard coarsening black box multigrid method in addition to the those presented above. However, only three were deemed to be robust and efficient enough to include in a release version of the solver. The three choices for grid transfer operators are the original nonsymmetric col lapsing method described in section 3.5.1, the nonsymmetric hybrid collapsing method described in section 3.5.3, and the nonsymmetric extension to Schaffer's ideas described in section 3.6.1. While all three of these choices are good, better results were obtained for the later two for all test and application problems run to date. Most of the other grid transfer operators, that were tried had good performance on some of the test problems but failed on others. There does appear to be enough good results to cover all the test problems, with the exception of re-entrant flows. However, to unify these into one set of grid transfer operators would be much 73

PAGE 99

more expensive to compute and may also introduce trouble when combining the various types of grid transfer operators. The grid transfer operators from section 3.5.2, which use a collapsing method to try to approximate the extension of Schaffer's ideas for nonsymmetric problems, were a disappointment. While they seemed to be a good idea, they turned out to not be very robust and in several cases actually caused divergence of the multigrid method. This bad behavior prompted examination of the coarse grid operators and grid transfer operators. After comparing the operators with those obtained from Schaffer's ideas, it was noticed that several things were wrong, but with the modifications described in section 3.5.3, these problems were overcome. These new grid transfer operators extended Schaffer's ideas to standard coarsening very well. 74

PAGE 100

CHAPTER4 BASIC ITERATION METHODS FOR SMOOTHERS In this chapter we examine several basic iteration schemes for use as smoothers in the Black Box Multigrid solvers. Fourier mode analysis is used to identify which scheme makes the best smoother for a given type of model problem in two dimensions. In this chapter we will be using parentheses around a superscript to denote an iteration index. For example: u(n) means the nth iterate. 4.1 Overview of Basic Iteration Methods All of the methods in this section can be characterized in the following way. The algebraic system of equations to be solved is given by the matrix equation Lu=f (4.1) The matrix L is an Nxy X Nxy matrix, where Nxy = nxny. The computational grid is two dimensional with nx and ny grid points in the x-and y-directions respectively. The matrix L can be split as L=M-N, (4.2) where M is non-singular and assumed easy to invert. Then a basic iteration method for the solution of equation ( 4.1) is given by (4.3) 75

PAGE 101

or as (4.4) where S = M-1 N is called the iteration matrix. The basic iteration method can also be damped, and if the damping parameter is w, then the damped method is given by (4.5) or by (4.6) where Sis now given by (4.7) and I is the identity matrix. When w = 1 we recover the undamped basic iterative method. The eigenvalues of the damped basic iteration matrix S can be given in terms of the eigenvalues of the undamped basic iteration matrix S. They are related by .X( B)= w.X(S) + 1-w, (4.8) where w is the damping parameter and .X(S) on the right hand side of the equation is an eigenvalue of S, the undamped iteration matrix. The error after the nth iteration is (4.9) where u is a solution (unique if L is non-singular) to equation ( 4.1). The error at the (n + 1)st iteration is related to the error at the nth iteration by (4.10) 76

PAGE 102

where S is the iteration matrix defined above (S can also replace S in the equation). From equation (4.10), it follows by induction, that e(n) can be written in terms of the original error, e(0), as (4.11) where the superscript, n on S is now an exponent and n 2:: 0. In terms of vector norms, we have lle(n) II IISne(O)II < IISnlllle(O) II (4.12) where II II is any vector norm with induced matrix norm for IISnll The term IISII is called the contraction number of the basic iterative method. The spectral radius of S is defined as (4.13) The basic iterative method is said to be convergent if p(S) < 1, (4.14) in which case, we have that lim IISnll = 0. n--too (4.15) If the method is convergent, then this also implies that (4.16) for any initial choice of e(O) if and only if p(S) = max I.A(S)I < 1. Some other useful results, that are similar to those found in Varga [82], are: 77

PAGE 103

Theorem 4.1.1 If p(S) is the spectral radius of S, then if p(S) :S IISII (4.17) then (4.18) Theorem 4.1.2 If p(S) < 1 then 1 lim (IISnll):;;: = p(S) n---+oo (4.19) for any suitable induced matrix norm. We can also define the reduction factor and the rate of convergence for basic iterative methods. Definition 4.1.1 The reduction factor is defined as lle(n+l) II T = lle(n) II :S IISII (4.20) where T is the reduction factor per iteration. Definition 4.1.2 The average reduction factor, i, is given by 1 T= II (n)ll :;;: _e_ < lle(o)ll -(4.21) after n iterations. Definition 4.1.3 The average rate of convergence can be defined as (4.22) where n > 0 and IISnll < 1. Definition 4.1.4 Then the asymptotic rate of convergence is defined to be Roo(S) = lim R(Sn) = -log p(S). n---+oo (4.23) 78

PAGE 104

Theorem 4.1.3 If IISnll < 1 for n > 0, then (4.24) Since Lis a linear operator, we can write the defect equation for (4.1) above as (4.25) where r(n) = f-Lu(n) is the residual after the nth iteration. We can also relate r(n+l) to r(n) by using equation (4.10) to get (4.26) where S = N M-1 is the iteration matrix for the residual. Since the residual and the error are related by the defect equation (4.25), then as the residual, r(n), goes to the zero vector the error, e(n), will also go to zero as long as L is non-singular. If L is singular then the error, e(n), will tend to a vector in the null space of L. This can and does happen for pure Neumann boundary conditions, and any solution obtained is only unique up to a constant. In this case, special care, usually through normalization, needs to be taken to ensure that a solution is obtained. 4.2 Gauss-Seidel Relaxation In this section we define several relaxation methods which we have considered for inclusion in the black box multigrid methods. We will also discuss there appropri ateness for inclusion in both the vector and parallel black box multigrid solvers. We need to define some notation concerning the grid point equations. The grid that was used to generate the system of equations is the one that was defined in 79

PAGE 105

section 1.3, and is the rectangular grid, Xi = ax + hxk, i = 1, ... nx, k=1 j Yj = ay + hyk, j = 1, ... ny k=1 ( 4.27) where Gh = nh. The grid point equation on grid G, suppressing the superscript h, is then s Ui,j-1 + w Ui-1,j + c Ui,j + E Ui+l,j + N Ui,j+l = Fi,j (4.28) where the subscript ( i, j) on the coefficients has been suppressed. Likewise, the 9-point stencil is given by SW Ui-1,j-1 +S Ui,j-1 +SE Ui+1,j-1 W Ui-1,j +C Ui,j +E Ui+l,j (4.29) NW Ui-1,j+l +N Ui,j+l + N E Ui+l,j+l = Fi,j 4.2.1 Point Gauss-Seidel Iteration We are using a multi-color point Gauss-Seidel relaxation method. If the stencil for the grid operator Lh is a 5-point stencil, then a red/black version is used, and if the stencil is a 9-point stencil, then a 4-color version is used. The point Gauss-Seidel relaxation method without multi-coloring for the 5-point stencil is given by S (n+1) W (n+1) C (n+l) (n) N (n) U 1 + U 1 + U + E U+1 + U '+1 = Fi ]. t,]-t,] t,] t ,] t,] ( 4.30) and for the 9-point stencil is given by SW (n+l) +S u(n+i) (n+l) ui-1,j-1 t,]-+SE ui+l,j-1 W u(ni1) +C (n) +Eu+1 t,] t,] t ,] (4.31) (n) (n) (n) = Fi,j NWui-1,j+l +Nu . +1 +NE ui+1,j+l t,] 80

PAGE 106

This method does not vectorize or parallelize if only one color is employed regardless of the ordering of the equations. Vectorization fails because there will always be a vector feedback dependency (e.g. u(i, j) depends on u(i-1, j) fori= 1, 2, ). It is possible to write down single color orderings that appear to allow vector and/ or parallel processing, but one finds upon closer examination that they are equivalent to multi-color orderings. For our purposes, when we say one color ordering we mean that the equations have a lexicographic ordering. Parallelization is not possible for lexicographic ordering because of the data dependencies that exist by sweeping through the equations. The multi-color point Gauss-Seidel relaxation method for the 5-point stencil has two colors: red and black. The coloring can be defined in terms of the grid point indices, Red: i + j even Black : i + j odd ( 4.32) This method proceeds in two half steps, one for each color of points being used. The grid point equation for the red grid points (first half step) is S (n) W (n) C E (n) N (n) D U 1 + U 1 + U + U+1 + U "+1 = .ri 1 t,J-t,] t,J t ,] t,J ' ( 4.33) where i + j is even, and for the black grid points (second half step) (n+1) Su .. 1 +Wu. 1 -+Cu .. +Eu-+1 +Nu. +1 =Fi1 t,J-t,] t,J t ,] t,J ' (4.34) where i + j is odd. The multi-color point Gauss-Seidel relaxation method for the 9-point stencil has four colors: red, black, green, and yellow. The coloring can be defined in terms of 81

PAGE 107

the grid point indices, Red: i odd, j odd Black: i even, J odd ( 4.35) Green: i odd, j even Yellow: z even, J even. This method proceeds in four quarter steps, one for each color of points being used. The grid point equation for the red grid points (first quarter step) is SW (n) U 1 1 t,]S (n) + U. 1 t,]-SE (n) + ui+l,j-1 W u(n)1 +C (n) ( 4.36) +Eu+1 t,J t,] t ,J (n) (n) (n) -R NWui-1,j+l +Nu-+ 1 +NE ui+1,j+l t,] t,] where i is odd and j is odd. The grid point equation for the black grid points (second quarter step) is SW (n) U 1 1 t,]S (n) + U. 1 t,]-SE (n) + ui+l,j-1 W +C ( 4.37) ui-1,j +Eu-+1. t,] t ,] (n) (n) (n) -R NWui-1,j+l +Nu-+ 1 +NE ui+1,j+l t,] t,] where i is even and j is odd. The grid point equation for the green grid points (third quarter step) is SW ui-1,j-1 W u(n)1 t,] NW ui-1,j+l +s ui,j-1 +C u(n+) t,] +Nu . +1 t,] +SE ui+l,j-1 E (n) + U+1. t ,] +NE ui+1,j+l ( 4.38) -R t,] where i is odd and j is even. And finally, the grid point equation for the yellow grid points (fourth quarter step) is SW ui-1,j-1 W (n+) ui-1,j NW ui-1,j+l +s ui,j-1 +C u(n_+1) t,] +Nu.+1 t,] 82 +BE ui+l,i-1 (n+) +Eu-+1 t ,J +NE ui+1,j+l (4.39) -R t,]

PAGE 108

where i is even and j is even. The addition of multi-coloring for the ordering of the sweeps through the grid equations leads to highly vectorizable and parallelizable solvers. Vectorization is ob-tained because sweeping through either red or black equations no longer contains vector feedback dependencies. Parallelization is obtained because the red and black points are now decoupled and all the equations of a single color can be computed independently of the other color. In other words, there are no longer any data dependencies among equations of the same color. However, the computations are performed not in one parallel operation, but in a number of parallel operations equal to the number of inde-pendent colors. As will be seen in the next chapter, in addition to the gains in vector and parallel performance, we also obtain, generally, better convergence and smoothing properties with multiple color ordering. 4.2.2 Line Gauss-Seidel Iteration by Lines in X The line Gauss-Seidel relaxation by lines in the x-direction is used in our code with red/black zebra coloring. This relaxation method is good when there are strong connections (locally) on the grid in the x-direction. This method requires the use of a tridiagonal solver to solve for each line of unknowns. We present the details for only the 9-point stencil because the 5-point stencil is just a special case where the NW, NE, SW, and SE grid operator coefficients are zero. The line Gauss-Seidel relaxation by x-lines method without multi-coloring for the 9-point stencil is given by SW (n+l) +S u(n+i) (n+1) ui-1,j-1 +SE ui+l,i-1 Wu(ni1) ,] +C u(n.+1) +E (n+1) ui+1,j (4.40) (n) (n) (n) = Fi,j. NW ui-1,j+1 +Nu. +1 +NE ui+l,j+l 83

PAGE 109

The method is vectorizable and parallelizable only in the multiple line solves. By this we mean that we can loop over all the lines to obtain vectorization or we can solve all the lines simultaneously for parallelization. The parallel line solves are done on the CM-5 using odd-even cyclic reduction. When the red/black zebra coloring is used, the colors are defined to be Red: J odd (4.41) Black : j even. The line relaxation is done in two half steps, first for the red points, and second for the black points. The line relaxation for the 9-point stencil is W ui-1,j (n) NWui-1,j+l S (n) + U. 1 t,J-+C t,J (n) +Nu . +1 t,J SE (n) + ui+l,j-1 +Eu-+1 t ,J (n) +NE ui+1,j+l -R t,J where j is odd for the first half step, and for the second half step, where j is even. SW ui-1,j-1 W u(n+11) t,) NW ui-1,]+1 +s ui,j-1 +C u(n_+1) t,J +Nu . +1 t,J +BE ui+l,j-1 +E (n+l) ui+1,j +NE ui+1,j+l -R t,J ( 4.42) (4.43) The zebra coloring method allows both vectorization and parallelization at not only for the lines solves, as before, but also by decoupling the red and black lines. For parallelization this decoupling means that all the lines of one color can be solved simultaneously. As will be seen, the convergence factors for zebra coloring are usually better than those for lexicographic ordering. 4.2.3 Line Gauss-Seidel Iteration by Lines in Y The line Gauss-Seidel relaxation by lines in the y-direction is used in our code with red/black zebra 84

PAGE 110

coloring. The method requires the use of a tridiagonal solver to solve for each line of unknowns. The line relaxation method is good when there are strong connections (locally) on the grid in the y-direction. We will present the details for only the 9-point stencil, since the 5-point stencil is just a special case with the NW, N E, SW, and S E coefficients are zero. The line Gauss-Seidel relaxation by y-lines method without multi-coloring for the 9-point stencil is given by SW (n+l) ui-1,j-1 +S u(n+i) S (n) + E ui+l,j-1 W u(ni1) ,J +C u(n.+1) E (n) + u+1. ,J (4.44) (n+1) NWui-1,i+1 +N (n+l) ui,j+1 NE (n) + ui+l,j+l = Fi,j. When the red/black zebra coloring is used, the colors are defined to be Red: odd ( 4.45) Black : i even. The line relaxation is done in two half steps, first for the red points and second for the black points. The first half step for the line relaxation for the 9-point stencil is SW (n) U 1 1 ,J+S S (n) + E ui+1,j-1 (n) +C (n) Wu. 1 +Eu.+l ,J ,J NW (n) NE (n) ui-1,j+1 +Nu. + 1 + ui+1,j+l where i is odd and for the second half step, it is given by where i is even. +N (n+l) ui,j+l -R -R ( 4.46) (4.47) The same comments about vectorization and parallelization that were made about x-line Gauss-Seidel apply toy-line Gauss-Seidel. 85

PAGE 111

4.2.4 Alternating Line Gauss-Seidel Iteration The alternating line Gauss-Seidel method performs zebra colored line Gauss-Seidel by lines in the xdirection followed by zebra colored line Gauss-Seidel by lines in the y-direction. For the details see the two previous sections 4.2.2 and 4.2.3 above. 4.3 Incomplete Line L U Iteration The incomplete line L U iteration (ILL U) is also referred to in the literature as incomplete block L U iteration. We consider ILL U by lines in the x-direction. The method presented is for the 9-point stencil with the grid points ordered lexicographi cally. There are two parts to using this smoother, the first being the factorization and the second the actual iterations. In this section we change the notation we have been using for our problem from Lu = f to Au = f. This is because we want to use the symbol L to represent the lower triangular part of a matrix. The ILL U factorization assumes that the matrix A of the system to be solved is in tridiagonal block matrix form, A= (4.48) where Lj, Bj, and Uj are nx x nx tridiagonal matrices. Then there exists a matrix D, derived below, such that A = (L +D) n-1 (D + U) (4.49) 86

PAGE 112

where L= U= (4.50) 0 D= Dny and D is block diagonal consisting of nx x nx tridiagonal blocks D j. The factorization of A in equation (4.49) is called the line LU factorization of A. The blocks L, D, and U correspond to horizontal grid lines in the x-direction on the computational grid G. The equation ( 4.49) can also be written as (4.51) and the last term is the block-diagonal matrix, 0 (4.52) 87

PAGE 113

From the last two equations a recursion formula to computeD can be obtained Dl = Bl, (4.53) Dj = BjLjDj_!_1 Uj, j = 2,3, ... ,ny assuming that Dj.!.1 exists. This shows that the splitting in equation (4.49) holds when D is computed by equation ( 4.53). The problem with this splitting is that the Dj 's are full. There are many variations for the incomplete iteration that have been proposed in the literature. For more on the description and theory of ILLU iteration see [4], [3], [2], [76], [34], [44], [84], [85]. The variation that we are using is obtained by replacing the term LjDj.!.1 Uj in equation ( 4.53) by its tridiagonal part, to get Dl = Bl, (4.54) Dj = Bj -tridiag LjDj_!_1 Uj where j = 2, 3, ... ny. The ILLU factorization is then defined as A=M-N (4.55) where M = (L + D)D-1(D + U), and Dis computed using equation (4.54). The iteration for the system of equations Au=f (4.56) is given by the splitting in equation (4.55). The iteration is then given by Mu(n+l) = Nu(n) + f (4.57) or as (4.58) 88

PAGE 114

where S = M-1 N is called the iteration matrix. This iteration then becomes r = f-Au(n) (L + D)D-1(D + U)u(n+l) = r (4.59) for computing purposes. The center equation in (4.59) above is solved in the following way: solve (L + D)u(n+l) = r r = Du(n+l) solve (D + U)u(n+l) = r, and the first equation in ( 4.60) above can be solved by j = 2,3, ... ,ny (4.60) (4.61) where Uj and Tj are nx dimensional vectors corresponding to block j. The last equation in (4.60) above is computed similarly to the first equation in (4.61). For completeness, we could have looked at ILL U by lines in the y-direction, or even alternating ILLU. This has not been found to be necessary because the smoothing properties of the ILL U by lines in either direction are so good that we can get away with only using the ILLU by lines in the x-direction. However, this is only true in two dimensions, and this smoother is not robust in three dimensions. We still need to comment about the vector and parallel aspects for ILLU. The ILLU method does not easily lend itself to vectorization or parallelization, but we have 89

PAGE 115

been able to get some reasonable performance on the Cray Y-MP. The vector method is the same as the one used by De Zeeuw [24], but the implementation is different. However, this method does not parallelize. Many people are working on parallel ILLU methods, but so far we are not aware of any that are efficient enough on the CM-5 to compete with zebra alternating line Gauss-Seidel relaxation as a multigrid smoother. Here efficiency is in the sense of convergence factor per unit of execution time. As a final note, it was found that De Zeeuw's MGD9V obtains its robustness from the use of the ILL U smoother and not from his new grid transfer operators, although the do contribute to the robustness of the method. The ILLU smoother was replaced with an alternating red/black line Gauss-Seidel smoother in MGD9V and experiments showed it to perform only marginally better than Dendy's nonsymmetric black box multigrid method. Likewise, the ILL U smoother was placed in Dendy's nonsymmetric black box multigrid method, and experiments showed that it performed about the same as MGD9V. 90

PAGE 116

CHAPTER 5 FOURIER MODE ANALYSIS OF SMOOTHERS 5.1 Introduction To understand which relaxation schemes make good smoothers for our Black Box Multigrid code, we can use Fourier mode analysis, which is also known as local mode analysis or local Fourier analysis. We will use Fourier mode analysis to help guide us in finding robust and efficient relaxation methods for use as the multigrid smoother. We want to find methods that will reduce the high frequency error components for a range of test problems that include anisotropic and convection dominated operators. Results of local mode analysis have been reported in the literature. However, most of the reports have been for only a few selected problems and smoothing methods. Since the literature lacks adequate coverage of smoothing analysis for our range of test problems we have presented many of the results from our own smoothing analysis investigation. The smoother plays an important part of the multigrid process by reducing the high frequency error components. The coarse grid correction complements this process by eliminating the low frequency error components. It is hoped that our choice of coarse grid and intergrid operators will meet the requirement, for coarse grid correction, that the range of prolongation contains low frequency components; see 5.3. When it does, the smoothing factor will give a reasonable approximation of the multigrid convergence 91

PAGE 117

factor for definite elliptic problems. However, the smoothing factors do not generally predict the exact performance of the two-level algorithm, since the intergrid operators are neglected, as well as the differences between the fine and coarse grid operators. A two-level analysis can often give more information than smoothing analy sis; both are performed by using Fourier mode analysis. Two-level analysis attempts to approximates the spectral radius of (II'Ji(LH)-1 I{! Lh)Sh for the two-level algorithm, while smoothing analysis computes the spectral radius of Sh, the convergence factor of high frequencies for the multigrid smoother. We have used Galerkin coarse grid approximation, which can produce a different coarse grid operator on each grid level, causing the two-level analysis to be valid only for those levels. For constant coefficient zero row-sum problems, the collapsing method intergrid operators become bi-linear interpolation and full weighting restriction, in which case the two-level analy sis is straightforward [78]. Variable coefficient problems are handled by performing the analysis for the extreme cases with frozen coefficients and using a continuity argument. For highly variable or discontinuous coefficient problems, it is not clear how to perform two-level analysis, especially when Galerkin coarse grid approximation is being used. Because of the above mentioned difficulties for performing a two-level analysis we have chosen to use local mode analysis to analyze only the smoothing factors. 5.2 Motivation When working on parallel and vector computers, it can often pay to consider methods that one would not usually think of using on a sequential computer. Likewise, there are tried and true algorithms that work well on a sequential computer but that do not vectorize or parallelize to any great extent. We have used the CM-5, which is a massively parallel SIMD computer. It can also be used in a MIMD mode, but we will not be using it in this way. The CM-5 92

PAGE 118

uses a small vector length of sixteen (until late 1995 it was eight), which even though it is short, favors algorithms which vectorize. We have also used the Cray Y -MP, which is a vector computer with a vector length of 64. For a five point stencil, the relaxation scheme on the CM-5 one might consider using is a Gauss-Seidel point relaxation implemented by using multiple colors. For instance, one might use red/black Gauss-Seidel. This scheme takes two sweeps over the data, one for the red points and the other for the black points. If one implements Jacobi relaxation, one finds that it looks exactly like a Gauss-Seidel relaxation on a sequential computer (this is because of the way the synchronous SIMD operations work on the CM computers). It requires only one sweep across the data, and hence one can do two iterations of Jacobi for the price of one red/black Gauss-Seidel relaxation. The Jacobi method on the CM-5 does not need the extra storage space or index switching that is needed on a sequential computer. However, a nine point stencil requires a four color Gauss-Seidel relaxation scheme, which takes four sweeps through the data, per iteration, as much work as four sweeps of Jacobi relaxation. It is known that the Jacobi relaxation is not very good, but we can use a damping factor to make it better. In this case, we may find that the Jacobi relaxation becomes more competitive with the Gauss-Seidel relaxation on parallel computers. So, when considering smoothers, it is important to take into account the amount of work that must be done in parallel and to remember to consider those methods which may seem to be outdated. In order for the Fourier mode analysis to be technically valid, the operator L must be constant coefficient with periodic boundary conditions. However, it can still provide useful bounds for other types of boundary conditions. If the problem has variable coefficients, then it should be analyzed for a fixed set of coefficients that are sampled across the domain, and it should include the extreme cases in amplitude of 93

PAGE 119

the coefficients. The behavior of the problem should then be bounded by the extreme bounds found in the analysis. Fourier (local) mode analysis has been the topic of debate for many years as to how robust and rigorous it can be, and under what circumstances it can be successfully applied. However, it has proven itself to be quite useful and is used by most practitioners. Achi Brandt and several of his students have proposed that it can be rigorous in its construction and use, if proper care is taken; see Brandt's paper on Rigorous Local Mode Analysis [ 1 7]. 5.3 Overview of Smoothing Analysis A good introduction to Fourier smoothing analysis can be found in the 1982 paper by Stiiben and Trottenberg [78] and in the book by P. Wesseling [85]. The presentation in this section uses similar notation and is patterned after Wesseling's book chapter 7. Let the grid G be defined by G= Xi = i hx, i = 1, ... nx Yj = j hy, j = 1, ... ny hy = n y The continuous problem is discretized into a system of algebraic equations where L is represented by the stencil [L] .. = Lu=f NW NNE W C E SW S SE i,j (5.1) (5.2) (5.3) For the most part, smoothing methods are basic iterative methods. That is, they are 94

PAGE 120

splittings (usually a regular splitting) of the form L=M-N. (5.4) The details for these methods are given in chapter 4. For the basic type of iterative methods the error amplification matrix is given by (5.5) without damping, and with damping it becomes (5.6) If the continuous problem has constant coefficients and periodic boundary conditions, then the stencils of [L], [M], and [N] are independent of the grid point ( i, j). We will assume that S has a complete set of eigenfunctions (local modes). The error before, e(o), and after, e(l) smoothing is given by (5.7) which gives us the relation S <1>(0) = A(O)(O) (5.8) where A(B) is an eigenvalue associated with the eigenfunction <1>(0). The eigenfunctions of S are e E 8 (5.9) J=I, e =(Ox, By), and 8 is defined as e 27r kx k nx 1 nx nx x -nx x -2 -' -2' 2 e 21rky k -ny 1 ny ny Yny Y--2-'-2, '2 (5.10) 95

PAGE 121

If nx and ny are assumed to be even, then the corresponding eigenvalues of S are BEe, (5.11) where K, = (kx, ky) is a vector. The eigenvalue, >.(B), is called the amplification factor of the Fourier mode ( B). If under-relaxation or over-relaxation is used then >.(B)= w>.(B) + (1-w) (5.12) where w is the relaxation parameter, and the >.(B) on the right hand side is the eigenvalue from the undamped amplification matrix. The Fourier representation of a periodic grid function ui,j is ui,j = cei,j(B), (5.13) 0E8 and the error is cO' ( B) a= 0,1 (5.14) which then gives =>.(B) (5.15) Next, we define the sets of rough and smooth frequencies, that is, the high and low frequencies respectively relative to the grid G. We have assumed that the ratio between the fine and coarse grid spacings is two. The smooth frequencies are defined as (5.16) and the rough frequencies are defined as (5.17) 96

PAGE 122

where the \ means "the set minus". The Fourier smoothing factor is now defined to be p, =max {1>.(0)1}. 0E8r (5.18) Now we need to consider the effect of the boundary conditions, in particular, the Dirichlet boundary condition. For problems with Dirichlet boundary conditions, we know that the error at the boundary is always zero, and hence, we can ignore the wave numbers where Ox = 0 and/or Oy = 0. Then the set of rough wave numbers, in the Dirichlet boundary condition case, is defined to be and then the corresponding smoothing factor, p,D, is given by p,D = max {1>.(0)1}. 0E8f (5.19) (5.20) The above definitions of the smoothing factor are grid size dependent because they depend on nx and ny (the number of grid points in the x and y directions re spectively). The definitions can be changed to be grid-independent if we change the definition of the discrete set e to be 8 = {0 : Ox E [-1r, 1r], Oy E [-1r, 1r]}. (5.21) This grid-independent definition is much harder to compute numerically, and when the boundary conditions have a big influence on the solution, the results are not very realistic. The grid dependent definitions given above are best when the choice of nx and ny are in the same range as those expected when using the multigrid method. We would always like to have p, < 1 uniformly in nx and ny, at least when the boundary conditions do not have a strong influence. If that is not the case, then the coarse grid correction part of the multigrid method must be very good to overcome the smoother's 97

PAGE 123

divergent influence. We can also numerically investigate the behavior of f-l as nx and ny --+ 0 to see what its asymptotic behavior is like. Up to this point we have not addressed a very important class of relaxation methods for the analysis, and those which use multi-coloring schemes. For these meth-ads we must modify the above definitions. For the case of multi-color relaxation, the
PAGE 124

where S(B) is a 4 x 4 matrix, which is called the amplification matrix, and co is a vector of dimension 4. If under-relaxation or over-relaxation is used, then S(B) = wS(B) + (1-w)I (5.27) where w is the relaxation damping parameter and I is the identity matrix. The amplification matrix S(B) can be found by: 1. Write the equations for the error for one step (color) of one complete iteration of the smoother. 2. Combine the error equations for that step into one expression. 3. Evaluate the combined expression for each of the invariant sub-spaces. 4. Write the equation that expresses the nth_step Fourier coefficient c(J in terms of the initial Fourier coefficient which are related by the step amplification matrix; 5. Do the above for each step of one complete smoothing iteration. 6. Multiply all the step amplification matrices together to get the amplification matrix for the smoother, which will express the Fourier coefficients in terms of = This algorithm will be illustrated for the smoothing analysis of the point Gauss-Seidel method in section 5.5. For multi-color relaxations, the definition of the Fourier smoothing factor, J-l, has to be modified in the following ways. The rough Fourier modes are now given by and the smooth Fourier modes are now represented by ei ..L X I 2' 99 (5.28) (5.29)

PAGE 125

All of these values must be added to 8r, in order for the Fourier mode analysis to have any meaning for several cases that can arise. We can now define a projection operator, Q(B), for
PAGE 126

where 0 Bx = 0 and/or By= 0 Pl(B) = 1 otherwise 0 Bx = 0 P3(B) = (5.34) 1 otherwise 0 By= 0 P4(B) = 1 otherwise It can also be seen that p3(B) = 0 implies that = 0, and that P4(B) = 0 implies that = 0. The definition of the smoothing factor now becomes p,D = max {p [P(B)Q(B)S(B)]}, IIE8f where p denotes the spectral radius. (5.35) The smoothing factors for Dirichlet boundary conditions are better than those for other boundary conditions because they exclude points in 8r and 8.s. This fact means that if the maximum occurs on these excluded points, then the smoothing factor for Dirichlet boundary conditions will be smaller. 5.4 2D Model Problems In the subsequent sections and chapters, various model problems are examined. The model problems are used for comparing the performance of the black box multigrid components (smoothers and grid transfer operators) on a finite set of model problems that represent various characteristics of more realistic problems. The domain 0 is the unit square for the two dimensional model problems: 1. -!:l.u = f 2. -UxxEUyy = f 3. -EUxxUyy = f 101

PAGE 127

4. Ux = f 5. + Ux = f 6. -EfluUy = j 7. -Eflu + Uy = f 8. -Eflu -Ux -Uy = f 9. -Eflu + Ux + Uy = f 10. -Eflu -Ux + Uy = f 11. -EflU + Ux -Uy = f where flu = Uxx + Uyy, E = lO-P for p = 0, 1, ... 5. The model problems will be discretized using central differences for the second order terms and upstream differencing for the first order terms. 5.5 Local Mode Analysis for Point Gauss-Seidel Relax-at ion Local mode analysis results are presented for lexicographical and red/black ordering for point Gauss-Seidel relaxations. Point Gauss-Seidel relaxation with lexicographic ordering gives the splitting 0 [M] = W C 0 s The amplification factor .A(O) is given by -N [N] = 0 0 -E 0 (5.36) (5.37) The red/black point Gauss-Seidel relaxation local mode amplification matrix 102

PAGE 128

is computed below. The stencil is assumed to be 5-point because a four color scheme would be needed for a 9-point stencil. The details for the computation of the amplifica-tion matrix will only be given for the 5-point red/black point Gauss-Seidel relaxation. The amplification matrices for all other multi-color Gauss-Seidel type smoothers can be computed in a similar manner. One iteration of the red/black point Gauss-Seidel relaxation is performed in two half steps. The first half step is computed for the red points, and the second half step is computed on the black points. The red points can be identified by those points ( i, j) where i + j is even and the black points when i + j is odd. Let the error before smoothing be e0 ; then the error after the first half step is S e9 1 + W e9 1 + E e0+1 + N e9 .+1 ,J ,J and after the second half step it is 1 2 ei,j' c 1 1 1 1 Se'l-. 1+We2 1+Ee2+1+Ne'l-.+1 ,J ,J c The Fourier representation of ei,j, n = 0, 1 is given by ei,j = ( c(Jf i,j (e) 0E8;s where c(J and i,j (e) are vectors of dimension 4. i + j even (5.38) i + j odd i + j even (5.39) i + j odd (5.40) Examining the first half step in the relaxation process, let the initial error be k = 1,2,3,4. (5.41) Substitution into equation (5.38) gives s i,j-1(ek) + w i-1,j(ek) + E i+1,j(ek) + N i,j+I(ek) c i + j even !_ k e2 .(e ) = i + j odd 103

PAGE 129

Recall that
PAGE 130

If k is odd, then [cos(kO) ( -1) + sin(kO) 0] + i [sin(kO) ( -1) cos(kO) 0] cos(kO) i sin(kO) We want to combine the expressions for the red and black points of equation (5.42) into a single expression. We already have i,j(01), but we need to add one or more other sub-spaces to to create a linear combination that will yield a single expression. If we take e 7r for all the angles whose indices are involved in the coloring pattern designation, i and j in this case, we get the additional sub-space that we need for the linear combination. The single expression linear combination is (5.46) Using theorem 5.5.1 we can find the values of A and B. A + B = a for i + j even A B = 1 for i + j odd, which gives A B 1), and therefore (5.47) Let a be defined by (5.48) 105

PAGE 131

and define f3 to be (5.49) We now evaluate the first half step error for each of the four invariant subspaces. 1 + &(01) + &(01) 1 2 2 1 1 1 2 2 (1 +a)
PAGE 132

We now proceed in a similar way for the second half step (black points) of the relaxation process. The error after the second half step was given in equation (5.39), and it can be written, as in the first half step, as
PAGE 133

1 + &((P) + 1 &((P) 2 2 1 2 1 1 2 (1a) if!i,j(O ) + 2 (1 +a) if!i,j(O ), (5.59) 1 The Fourier coefficient in terms of cJ is 1+a 1+a 0 0 1 1 1a 1-a 0 0 1 2 e E e.s. (5.62) co=-co' 2 0 0 1+,8 1+,8 0 0 1-,8 1-,8 1 Finally, we can express in terms of by substitution of cJ into equation (5.62) to get the red/black point Gauss-Seidel amplification matrix S(O) that gives the relation a(1 +a) -a(1 +a) 0 1 1 a(1-a) a(a1) 0 co=2 0 0 ,8(1 + ,8) 0 0 ,8(1 ,8) The eigenvalues of Q(O)S(O) are A1(0) = 0 .X2(0) = 0 108 0 0 0 co, e E e.s. (5.63) -,8(1 + ,8) ,8(,81)

PAGE 134

Table 5.1. Smoothing factor J.L for point Gauss-Seidel relaxation in lexicographical (pGS-lex) and red/black (r/b-pGS) ordering for the indicated anisotropic diffusion problems ( see section 5.4); where c = 10-P and (D) indicates Dirichlet boundary conditions. problem p pGS-lex r/b-pGS r/b-pGS (D) 1 .50000 .25000 .24992 1 .83220 .82645 .82619 2 3 .99797 .99800 .99770 5 .99998 .99999 .99975 1 .83220 .82645 .82619 3 3 .99797 .99800 .99770 5 .99998 .99999 .99975 1 .\3(B) = 2 (J(B) a-b) .\4(B) = f3, and for Dirichlet boundary condition case, the eigenvalues of P(B)Q(B)S(B) are .\1 (B) = 0 .\2(B) = 0 1 2 (PI(B)J(B) a-b) 1 2/3 (p3(B)-P4(B) + f3(p3(B) + P4(B))). The results of local mode analysis for the model problems from section 5.4 are shown in table 5.1 and table 5.2. The smoothing factors were computed numerically with the grid spacing hx = hy = 1 and the angles Bx and By were sampled at one degree increments. Table 5.1 shows the results of the smoothing analysis for pure diffusion type problems. The point Gauss-Seidel relaxations are good smoothers for Poisson's equation, but not for anisotropic problems. The table also shows that red/black ordering is better than lexicographic ordering. 109

PAGE 135

Table 5.2. Smoothing factor JL for point Gauss-Seidel relaxation in lexicographical (pGS-lex) and red/black (r/b-pGS) ordering for the indicated convection-diffusion problems (see section 5.4); where c: = w-p and (D) indicates Dirichlet boundary con ditions. problem p pGS-lex r/b-pGS r/b-pGS (D) 0 .60176 .36000 .35990 4 1 .87313 .73469 .73463 3 .99839 .99602 .99602 5 .99998 .99996 .99996 0 .45834 .36000 .35990 5 1 .44608 .73469 .73463 3 .44099 .99602 .99602 5 .44099 .99996 .99996 0 .60176 .36000 .35990 6 1 .87313 .73469 .73463 3 .99839 .99602 .99602 5 .99998 .99996 .99996 0 .45834 .36000 .35990 7 1 .44608 .73469 .73463 3 .44099 .99602 .99602 5 .44099 .99996 .99996 0 .66281 .28125 .28125 8 1 .91533 .69441 .69441 3 .99898 .99594 .99594 5 .99999 .99988 .99988 0 .32950 .28125 .28125 9 1 .08202 .69441 .69441 3 .00098 .99594 .99594 5 9.8E-6 .99988 .99988 0 .56192 .28125 .28125 10 1 .84486 .69441 .69441 3 .99797 .99594 .99594 5 .99998 .99988 .99988 0 .56192 .28125 .28125 11 1 .84486 .69441 .69441 3 .99797 .99594 .99594 5 .99998 .99988 .99988 110

PAGE 136

Table 5.2 shows the results of the smoothing analysis for convection-diffusion problems. The red/black ordering for point Gauss-Seidel has, in general, better smooth-ing properties than those for lexicographic ordering except for problems 5, 7, and 9. The reason that lexicographic ordering is better for those problems is because the order in which the unknowns are updated is in the same direction as the convection characteristics. The smoothing factors approach one as the convection terms become more dominant, which implies that point Gauss-Seidel is not a robust smoother for these types of problems. 5.6 Local Mode Analysis for Line Gauss-Seidel Relax-at ion Local mode analysis results are presented for lexicographic and zebra (reb/black) ordering for x-and y-line Gauss-Seidel relaxations. X-line Gauss-Seidel relaxation with lexicographic ordering gives the splitting -N 0 [M]= W C E s [N] = 0 0 0 0 The amplification factor .A(O) is given by Zebra x-line Gauss-Seidel relaxation has the amplification matrix S(O) = a 0 -a 0 0 c 0 -c b 0 -b 0 0 d 0 -d 111 (5.64) (5.65) (5.66)

PAGE 137

where a= a(1 +a), b = a(1-a), c = ,8(1 + ,8), and d = ,8(1,8) and ,8 The eigenvalues of S(O) are W e-dJx + C + E edJx -S e-dJy + N .\1 (0) = 0 .X2(0) = 0 .X3(0) = 1 2 (6(0) a-b) .X4(0) = 1 2 (c-d) and for Dirichlet boundary conditions we have .\1(()) = 0 .X2(0) = 0 1 2 (Pl(O)c5(0) aP3(0) b) 1 2 (cP4(0) d). (5.67) (5.68) Y -line Gauss-Seidel relaxation with lexicographic ordering gives the splitting N [M] = W C 0 s The amplification factor .X(O) is given by 112 0 [N] = 0 0 -E (5.69) 0 (5.70)

PAGE 138

Zebra y-line Gauss-Seidel relaxation has the amplification matrix S(O) = a 0 0 a 0 c c 0 0 d d 0 b 0 0 b where a= a( a+ 1), b =a( a1), c = /3(/3 + 1), and d = /3(/31) and {3 The eigenvalues of S(O) are W e-dJx + E edJx S e-LIJy + C + N w + E -S + C + N .\1 (0) = 0 .X2(0) = 0 .X3(0) = 1 2 (6(0) a+ b) .X4(0) = 1 2 (c +d), and for Dirichlet boundary conditions we have AI(())= 0 .X2(0) = 0 1 2 (PI(O)c5(0) + P4(0)b) 1 2 (c + P3(0)d). (5.71) (5.72) (5.73) The results of local mode analysis for the model problems from section 5.4 are shown in tables 5.3 and 5.4. The smoothing factors were computed numerically with the grid spacing hx = hy = 1 and the angles Ox and Oy were sampled at 1 degree increments. 113

PAGE 139

Table 5.3. Smoothing factor 11 for x-andy-line Gauss-Seidel relaxation in lexicograph ical (xlGS and ylGS respectively) and zebra (ZxlGS and ZylGS respectively) ordering for the indicated anisotropic diffusion problems (see section 5.4); where c = 10-P. problem p xlGS ZxlGS ZxlGS (D) ylGS ZylGS ZylGS (D) 1 .44412 .25000 .24992 .44412 .25000 .24992 1 .44412 .12500 .12500 .82644 .82645 .82619 2 3 .44412 .12500 .12500 .99800 .99800 .99770 5 .44412 .12500 .02891 .99998 .99998 .99968 1 .83092 .82645 .82619 .44412 .12500 .12500 3 3 .99797 .99800 .99770 .44412 .12500 .12500 5 .99998 .99998 .99968 .44412 .12500 .02891 114

PAGE 140

Table 5.3 shows the smoothing factors for line Gauss-Seidel relaxation for anisotropic diffusion model problems. It is seen that line relaxation is only a good smoother if the lines are taken in the direction of the strong coupling of the diffu sion coefficients. Again, it is seen that the zebra ordering of the lines gives a better smoothing factor than lexicographic ordering. Table 5.4 shows the smoothing factors for the convection-diffusion model problems for line Gauss-Seidel relaxation. The smoothing factors for line relaxation are good when the convection term characteristics are in the same direction as the lines. The smoothing factor becomes better (smaller) the more the convection terms dominate if the characteristics are in the direction of the lines. If the characteristics are not in the direction of the lines, then the smoothing factor degenerates quickly approaching one, the more the convection terms dominate the diffusion term. We see again that for lexicographic ordering the smoothing factor is better when the characteristics have at least one component in the direction of the lexicographic ordering of the lines. 5. 7 Local Mode Analysis for Alternating Line Gauss-Seidel and ILLU Iteration Local mode analysis results are presented for lexicographic and zebra ordering for alternating line Gauss-Seidel relaxation (x-line Gauss-Seidel followed by y-line Gauss-Seidel) and incomplete line LU by lines in x. The alternating line Gauss-Seidel relaxation with lexicographic ordering am plification factor .A(O) is given by .A(O) = Axlgs(O) .Aytgs(O) (5.74) where Axlgs ( 0) and Aylgs ( 0) are the x-and y-line Gauss-Seidel amplification factors 115

PAGE 141

Table 5.4. Smoothing factor p, for x-andy-line Gauss-Seidel relaxation in lexicograph ical (xlGS and ylGS respectively) and zebra (ZxlGS and ZylGS respectively) ordering for the indicated convection-diffusion problems (see section 5.4); where r:; = w-P. problem p xlGS ZxlGS ZxlGS (D) ylGS ZylGS ZylGS (D) 0 .45040 .15385 .15380 .62917 .36000 .35990 4 1 .48377 .22449 .22449 .91218 .73469 .73463 3 .44412 .12500 .05644 .99898 .99602 .99602 5 .44412 .12500 .00057 .99999 .99996 .99996 0 .45040 .15385 .15380 .32950 .36000 .35990 5 1 .48377 .22449 .22449 .32950 .73469 .73463 3 .44412 .12500 .05644 .32950 .99602 .99602 5 .44412 .12500 .00057 .32950 .99996 .99996 0 .62917 .36000 .35990 .45040 .15385 .15380 6 1 .91218 .73469 .73463 .48377 .22449 .22449 3 .99898 .99602 .99602 .44412 .12500 .05644 5 .99999 .99996 .99996 .44412 .12500 .00057 0 .32950 .36000 .35990 .45040 .15385 .15380 7 1 .32950 .73469 .73463 .48377 .22449 .22449 3 .32950 .99602 .99602 .44412 .12500 .05644 5 .32950 .99996 .99996 .44412 .12500 .00057 0 .63226 .24324 .24318 .63226 .24324 .24318 8 1 .91344 .69444 .69409 .91344 .69444 .69409 3 .99898 .99601 .99541 .99898 .99601 .99541 5 .99999 .99996 .99935 .99999 .99996 .99935 0 .27929 .24324 .24318 .27929 .24324 .24318 9 1 .06696 .69444 .69409 .06696 .69444 .69409 3 .00080 .99602 .99541 .00080 .99601 .99541 5 8.0E-6 .99996 .99935 8.0E-6 .99996 .99935 0 .27929 .24324 .24318 .63226 .24324 .24318 10 1 .06696 .69444 .69409 .91344 .69444 .69409 3 .00080 .99601 .99541 .99898 .99601 .99541 5 8.0E-6 .99996 .99935 .99999 .99996 .99935 0 .63226 .24324 .24318 .27929 .24324 .24318 11 1 .91344 .69444 .69409 .06696 .69444 .69409 3 .99898 .99601 .99541 .00080 .99601 .99541 5 .99999 .99996 .99935 8.0E-6 .99996 .99935 116

PAGE 142

respectively. Thus, The zebra alternating line Gauss-Seidel relaxation amplification matrix S(O) is given by S(O) = Bxtgs(O) Sytgs(O) (5.76) where Sxlgs ( 0) and Sylgs ( 0) are the x-and y-line Gauss-Seidel amplification matrices respectively. We represent the matrix S(O) by where S(O) = l e = P3(0)bxdy g = P4(0)axby a c e g b -a -b d -c -d f -e -f h -g -h d = CxCy f = P3(0)cxdy h = P4(())dxby, (5.77) and the subscripts x and y indicate that the coefficients came from Sxlgs ( 0) and Sylgs ( 0) respectively; see equations (5.66) and (5.71). The eigenvalues of S(O) are D -\3,4(0) = D2-4 [(a-e)(d-h)-(b-f)(c-g)] 8 where D = a+ d-e-h is the diagonal of S ( 0). For nonDirichlet boundary conditions we set Pl(O) = P3(0) = P4(0) = 1. 117

PAGE 143

The incomplete x-line L U iteration (ILL U) amplification factor is not hard to compute, but it is a little more complicated than the other relaxation methods. We need to compute M and N for the ILLU splitting; see section 4.3. Incomplete factorization methods have the property that the stencils for M and N are dependent upon their location on the grid, even when the stencil of Lis not. However, the stencils of M and N usually tend rapidly to constant stencils away from the boundaries. It is these constant stencils for M and N that will be used for the local mode analysis. It can be seen that the smoothing factor increases towards one as the block (x-line) size increases. For this reason, we will assume the worst case, nx = oo, for the computation of the local smoothing factors. The component Dj from equation ( 4.54) is computed without the j subscript until the stencil for D becomes stable. By stable, we mean that the values do not change, or that the change is taking place only in the digits after a specified decimal place. We used six decimal places in our computations. When a stable D has been computed, then the stencils for M and N can be constructed and the smoothing factor computed using equation (5.11). Due to the nature of these computations, it is not possible to write down a general formula for the amplification factor as was done for the other relaxation methods. The results of local mode analysis for the model problems from section 5.4 are shown in table 5.5 and table 5.6. The smoothing factors were computed numerically with the grid spacing hx = hy = 1, and the angles Bx and By were sampled at 1 degree increments. Table 5.5 shows the smoothing factors for alternating line Gauss-Seidel relaxation and incomplete line LU iteration for the anisotropic diffusion model problems. Lexicographic ordering for alternating line relaxation provides a fair smoothing factor, but zebra ordering provides much better smoothing factors. The smoothing factors for 118

PAGE 144

Table 5.5. Smoothing factor J.L for alternating line Gauss-Seidel relaxation in lexico graphical (ALGS) and zebra (AZLGS) ordering, and incomplete line LU iteration by lines in x (ILLU) for the indicated anisotropic diffusion problems (see section 5.4); where c = 10-P problem p ALGS AZLGS AZLGS (D) ILLU 1 .14634 .02547 .02546 .05788 1 .36903 .10107 .10104 .13272 2 3 .44322 .12472 .12467 .19209 5 .44411 .12500 .02890 .19920 1 .36903 .10107 .10104 .10769 3 3 .44322 .12472 .12467 .16422 5 .44411 .12500 .02890 .14136 119

PAGE 145

incomplete line LU iteration are good as well, but alternating zebra line relaxation is slightly better. Table 5.6 shows the smoothing factors for the convection-diffusion model problems for alternating line Gauss-Seidel relaxation and incomplete line LU iteration. The smoothing factors for lexicographic ordering for the alternating line relaxation are good but get even better when the characteristics are in the same direction (lexicographic) in which the lines are solved. The alternating zebra line relaxation gives good smoothing factors when the characteristics are in the same direction as the lines; indeed these smoothing factors are better than those for lexicographic ordering. The zebra ordering gives fair smoothing factors when the characteristics are not aligned with the lines. The incomplete line LU iteration is done by lines in the x-direction and is nearly a direct solver when the convection terms are dominant. The smoothing factors are good for all of the model problems, and they are about equal to or much better than those for alternating line relaxation. They are especially superior when the convection term characteristics are not aligned with the grid lines. 5.8 Local Mode Analysis Conclusions We have looked at local mode analysis for several iterative methods for use as a smoother in our black box multigrid method. The test problems can be classified in many ways, but we will break them down into four types and refer to them via their model problem number from section 5.4. The first type is the isotropic diffusion equation represented by model problem (1). The second type is the anisotropic diffusion equations represented by model problems (2) and (3). The third type is the convectiondiffusion equations represented by model problems (4)-(11) with c: = 1. The fourth type is the convection dominated equations represented by model problems (4)-(11) with c: 1. 120


Table 5.6. Smoothing factor μ for alternating line Gauss-Seidel relaxation in lexicographical (ALGS) and zebra (AZLGS) ordering, and incomplete line LU iteration by lines in x (ILLU) for the indicated convection-diffusion problems (see section 5.4); where ε = 10^(-p).

  problem    p     ALGS      AZLGS     AZLGS (D)    ILLU
     4       0    .22269    .04706     .04704     .07977
     4       1    .40812    .14253     .14251     .16150
     4       3    .44322    .12444     .05611     .19952
     4       5    .44411    .12499     .00057     .20000
     5       0    .14750    .04706     .04704     .07977
     5       1    .15423    .14253     .14251     .16150
     5       3    .14634    .12444     .05611     .19952
     5       5    .14634    .12499     .00057     .20000
     6       0    .22269    .04706     .04704     .03759
     6       1    .40812    .14253     .14251     .00567
     6       3    .44322    .12444     .05611     .00009
     6       5    .44411    .12499     .00057     4.4E-9
     7       0    .14750    .04706     .04704     .03759
     7       1    .15423    .14253     .14251     .00567
     7       3    .14634    .12444     .05611     .00009
     7       5    .14634    .12499     .00057     4.4E-9
     8       0    .24619    .06349     .06346     .04940
     8       1    .39787    .31498     .31483     .01489
     8       3    .44358    .40579     .40559     .00019
     8       5    .44412    .40684     .40665     1.9E-6
     9       0    .06754    .06349     .06346     .04940
     9       1    .00441    .31498     .31483     .01489
     9       3    6.3E-7    .40579     .40559     .00019
     9       5    6.4E-11   .40684     .40665     1.9E-6
    10       0    .15074    .06349     .06346     .04940
    10       1    .05636    .31498     .31483     .01489
    10       3    .00073    .40579     .40559     .00019
    10       5    7.3E-6    .40684     .40665     1.9E-6
    11       0    .15074    .06349     .06346     .04940
    11       1    .05636    .31498     .31483     .01489
    11       3    .00073    .40579     .40559     .00019
    11       5    7.3E-6    .40684     .40665     1.9E-6


We see that red/black point Gauss-Seidel relaxation is a good smoother only for the isotropic diffusion and convection-diffusion (ε = 1) equations; for variable coefficients, however, it is a good smoother only for isotropic diffusion equations. Zebra line Gauss-Seidel relaxation is a good smoother for all four types of problems, provided that the anisotropies and convection characteristics are aligned with the proper grid directions. The only two robust choices for smoothers are the alternating zebra line Gauss-Seidel relaxation and the ILLU methods. They both exhibit good smoothing factors for all the types of problems. However, ILLU is better for all types of convection-diffusion equations and just slightly worse for the two types of diffusion equations. The suitability of either choice for the smoother will depend on the efficiency of the implementation, and under these circumstances it would appear that alternating zebra line Gauss-Seidel relaxation has the advantage.

5.9 Other Iterative Methods Considered for Smoothers

If one looks at the local mode analysis, it will be noticed that lexicographic point Gauss-Seidel relaxation has a good smoothing property for convection problems when the sweep direction is aligned with that of the convection. This suggests another smoother, namely, 4-direction point Gauss-Seidel relaxation. The 4-direction point Gauss-Seidel method performs four sweeps over the grid, starting each time from a different corner of the grid. The first two sweeps are the same as symmetric point Gauss-Seidel, with the dominant sweep direction being in x, and the third and fourth sweeps are again a symmetric point Gauss-Seidel relaxation, but with the dominant sweep direction in y this time.


The sweeping strategy for symmetric point Gauss-Seidel relaxation is the same as performing one iteration of lexicographic point Gauss-Seidel relaxation followed by another sweep, but the second sweep starts at the (Nx, Ny) point of the grid with the index in the x-direction decreasing the quickest. To form the 4-direction point Gauss-Seidel method we perform one iteration of symmetric point Gauss-Seidel followed by another iteration of symmetric point Gauss-Seidel, but this time rotated 90 degrees so that the roles of x and y are reversed. The 4-direction point Gauss-Seidel method exhibits good smoothing properties for all the model problems except the anisotropic ones. The 4-direction method is partially vectorizable, but not parallelizable. Let us take a look at one of the four sweeping directions, namely lexicographic, to illustrate how one can obtain some vectorization. For a 5-point stencil with coefficients n, s, e, w, and center c, the non-vectorizing lexicographic sweep is computed as follows.

      DO j = 1, Ny
         DO i = 1, Nx
            u(i,j) = ( n(i,j)*u(i,j+1) + e(i,j)*u(i+1,j)
     &               + s(i,j)*u(i,j-1) + w(i,j)*u(i-1,j)
     &               + f(i,j) ) / c(i,j)
         END DO
      END DO

Vectorization is prevented by the reference to u(i-1,j). We can minimize the impact of the vector dependency by re-organizing the calculation and creating a new temporary array of length Nx. The new code with vectorization is computed in the following way.

      DO j = 1, Ny
         DO i = 1, Nx
            tmp(i) = n(i,j)*u(i,j+1) + e(i,j)*u(i+1,j)
     &             + s(i,j)*u(i,j-1) + f(i,j)
         END DO
         DO i = 1, Nx


            u(i,j) = ( tmp(i) + w(i,j)*u(i-1,j) ) / c(i,j)
         END DO
      END DO

The first loop over i performs vector operations, and the second loop over i performs scalar operations. The algorithm is presented for a 5-point stencil; the only difference for a 9-point stencil is the addition of the other computational elements to the calculation in the vector (first) loop. On the Cray Y-MP, the second algorithm is roughly equivalent, timewise, to alternating zebra line Gauss-Seidel for small grids (< 32 x 32) and is faster for larger grids. The reason that the 4-direction method outperforms the alternating line method for larger grids is that the line solves are sequential in nature. Several experiments were run on the Cray Y-MP using the 4-direction point Gauss-Seidel relaxation for the smoother. The results were mixed, but generally favorable. The performance for isotropic diffusion or linear convection characteristic problems was good, as expected from the Fourier mode analysis. Anisotropic problems, however, performed quite poorly, also as expected from the analysis. For convection-diffusion problems with variable convection characteristics the results were dependent on the form of the characteristics and the choice of grid transfer operators. The 4-direction point Gauss-Seidel smoother worked best with the nonsymmetric collapsing method involving aL from section 3.5.1 for the grid transfer operators. However, for re-entrant flows we were still unable to obtain a method which would give any kind of reasonable convergence rate. The results from one of the numerical experiments are given in table 7.34 in section 7.6.
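To make the sweep pattern concrete, the following is a minimal sketch of the four sweeps, under the same 5-point stencil assumptions as above; the names and the statement function UPD, which abbreviates the Gauss-Seidel update, are illustrative only.

C     4-direction point Gauss-Seidel: forward and backward x-dominant
C     sweeps, then forward and backward y-dominant sweeps
      UPD(i,j) = ( n(i,j)*u(i,j+1) + e(i,j)*u(i+1,j)
     &           + s(i,j)*u(i,j-1) + w(i,j)*u(i-1,j)
     &           + f(i,j) ) / c(i,j)
C     sweep 1: i increasing quickest, starting from (1,1)
      DO j = 1, Ny
         DO i = 1, Nx
            u(i,j) = UPD(i,j)
         END DO
      END DO
C     sweep 2: i decreasing quickest, starting from (Nx,Ny)
      DO j = Ny, 1, -1
         DO i = Nx, 1, -1
            u(i,j) = UPD(i,j)
         END DO
      END DO
C     sweeps 3 and 4: the roles of i and j reversed
      DO i = 1, Nx
         DO j = 1, Ny
            u(i,j) = UPD(i,j)
         END DO
      END DO
      DO i = Nx, 1, -1
         DO j = Ny, 1, -1
            u(i,j) = UPD(i,j)
         END DO
      END DO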


CHAPTER 6

VECTOR ALGORITHM CONSIDERATIONS IN TWO DIMENSIONS

The computers that we used fall into three categories: sequential, vector, and parallel. Each of these types of computers has its own characteristics that can affect the development and implementation of algorithms that are to execute on them. Sequential computers come in a lot of varieties, but they all execute basically "one" instruction at a time. By "one" instruction we mean to lump all of the pipelining architectures in with the classic sequential computer architecture. This lumping may not be entirely fair, but we believe that it is nearly impossible to find a computer today that does not use some form of pipelining. If one looks at the scalar processors on vector computers, one sees a great deal of pipelining. As far as the choice of algorithms is concerned, they can indeed be lumped together. However, it does pay to remember that pipelining of instructions is taking place, and therefore the implementation of the algorithm should take advantage of it whenever possible. For the most part, compilers today are very good at keeping the pipeline full, but they can still benefit from a careful implementation that aligns the instructions to avoid possible bubbles (null operations) in the pipeline, e.g. excessive branching. We have used a SUN Sparc workstation to represent the class of sequential computers, but a standard PC would have done just as well.


The vector computers are those with vector units that can process a vector (array) of data with one instruction, with only the indices varying in a fixed relationship to each other. The Cray Y-MP, which we have used, is a prime example of such a computer. The CM-5, which we have also used, has vector units, but its vector units are not very fast when compared to the Y-MP's, and they have a vector length of 16 (changed from 8 to 16 in late 1995), which is very short when compared to the Y-MP's vector length of 64. Vector computers also have scalar processors for handling the non-vectorizable instructions, and these processors can be considered to be the same as those of the sequential computers. The Cray Y-MP can have several processors that can be run in a parallel mode configuration. However, we have chosen to use only one processor on the Cray Y-MP so that we can concentrate on the vectorization issues. There are several types of parallel computers and parallel computing models. The type of parallel computers that we considered are single instruction multiple data (SIMD) computers; the CM-5 is such a computer, but it might be more meaningful to classify it as a single program multiple data (SPMD) computer for our purposes. The SPMD programming model is one of the most widely used parallel programming models for almost all parallel computers. The CM-5 can be run under two different execution models: the data parallel model, which we used, and the message passing model, which we chose not to address in this thesis. Probably the one issue that has the greatest effect on algorithm performance, regardless of the type of computer, is that of memory references. This issue can manifest itself in many ways at both the hardware and software levels of the computer architecture. For sequential and vector computers it usually revolves around a memory cache, but memory bank architecture can also play a role. It should be noted that the Cray Y-MP does not have a memory cache. On parallel computers the memory cache and banks are usually subordinate to the data communications network.


Each of these three types of computers has its own influences on the choice and implementation of the various components of multigrid algorithms. However, we will restrict our choices for the vector computers in such a way as to avoid degrading the code's execution on a sequential computer in any meaningful way. If, however, a particular choice would cause only minor degradation on the sequential computer but greatly improve its performance on vector computers, then it should be allowed. For the above reasons, and because it is not too interesting, we will not examine our multigrid codes on any sequential computers. The performance of closely related black box multigrid codes for various problems has already appeared in the literature [27] [8]. However, for timing comparisons only, we will include some data for a Sparc5 workstation.

6.1 Cray Hardware Overview

The Cray Y-MP is our baseline vector computer for the design of the black box multigrid vector algorithm. The hardware model that we will present for the software design is equally valid for the Cray Y-MP, X-MP, M90, and C90 computers because we are concerned only with the single processor vector algorithm. The Cray Y-MP computers can have a number of CPUs (central processing units), typically 4, 8, or 16. The CPUs are each connected to shared memory and an I/O system; see figure 6.1. Each CPU has four ports, a memory path selector, instruction buffers, registers, and several functional units; see figure 6.2. We will start by describing the CPU registers. The Cray computer's word length is 64 bits. The vector registers are set up as 8 vectors of 64 elements each (the Cray C90 has 128 elements), where each element is a 64 bit word. There are 8 scalar registers and 64 intermediate result scalar registers, each with 64 bits. The address registers can also have an impact on the software design.


[Figure 6.1: Cray Y-MP hardware diagram for n CPUs — each of CPUs 0 through n is connected to the shared memory and the I/O system.]

[Figure 6.2: Cray CPU configuration — four ports and a memory path selector connect the registers, instruction buffers, and functional units to memory and I/O.]


There are 8 address registers consisting of 32 bits each and 64 additional 32 bit intermediate address registers. The intermediate address registers are primarily used for processing the address register data. In addition to the above mentioned registers there are a variety of others that will not be discussed here, because they vary somewhat between Cray's different computer models and because they do not really have an effect on the design of the vector algorithms. For completeness we will mention the major categories; they are the vector mask, vector length, hardware performance monitor, programmable clock, control, mode, status, memory error information, exchange address, and exchange package information registers, along with a number of flag bits. There are also some additional registers for parallel processing on the Cray computers, which fall into either the shared resources registers or the cluster registers. The main memory consists of either 2 or 4 sections, each containing 8 subsections of 64 banks. The memory bank size depends on the model of Cray computer and the memory size configuration chosen for that model. The two Cray computers that we used were a Y-MP and an M90, which have memory banks of 256K words and 8M words respectively; see Appendix B for more details. The memory is interleaved throughout all the banks, subsections, and sections. Consecutively stored data items are placed into different memory banks, and no bank is reused until all the banks have been used. The 8 vector and 8 scalar registers coincide with the 8 subsections of a memory section. The 64 (128 on the C90) vector elements per vector register coincide with the 64 memory banks of each subsection. Each CPU has its own path to each section of memory. A single CPU cannot make simultaneous accesses to the same section of memory. Each CPU has four ports: two for reading, one for writing, and one for either an instruction buffer or an I/O request; see figure 6.2. In order for a CPU to access memory, it must have both an available port and a memory path.


A CPU memory reference makes an entire memory subsection unavailable to all ports of the CPU until the reference has completed (five clock cycles). In addition, the memory reference also makes that bank unavailable to all other ports of all the other CPUs in the system until it has completed. There are 5 basic types of functional units in each CPU. The first one is vector bit-oriented, and it consists of integer add, logical, shift, pop/parity, and secondary logical operations. The second functional unit is the floating point vector operations unit, which is also used to perform scalar floating point operations. The third functional unit is the scalar bit-oriented unit and includes integer add, logical, shift, and pop/parity operations. The fourth functional unit is the address computational unit. The fifth functional unit is the instruction decode and fetch unit. The CPU's functional units all operate independently of each other. In addition, the functional units are fully segmented (pipelined). That is, the intermediate steps required to complete an operation are broken down into one clock period segments. Once the pipeline is full, results are produced at one per clock period. The number of segments that a functional unit has depends on the complexity of the functions it must perform; hence, the functional units are mostly of different segment lengths. The functional units can also be chained together, and because they operate independently, it is possible to perform several operations concurrently. For instance, let a, b, c be vectors and d a scalar; then the vector-scalar operation a(i) = b(i)*c(i) + d can be performed with one result (i) per clock period for all i. The concurrent operations taking place are two vector loads (b and c), a vector multiply, a scalar add, and a vector store (a). The Cray Y-MP can be forced to perform, essentially, as a sequential computer by compiling the code with the NOVEC compiler directive and by setting the compiler optimization flag for no vectorization, e.g. cf77 -O vector0 or cft77 -o novector.


This can be very useful for determining the actual speedup associated with vectorizing the code.

Definition 6.1.1 The speedup factor for a code or code fragment is defined as

    S = Told / Tnew,    (6.1)

where Told and Tnew are the execution times for the old and new codes respectively.

The speedup factor as we have defined it is sometimes called the relative speedup. The speedup factor can be used to measure the vectorization speedup by setting Told to the non-vectorized execution time and Tnew to the vectorized execution time. To get the best performance for a given algorithm and still use only the higher level language (FORTRAN in our case), there are several things that can be done. First, recall that the code must also be able to execute on a sequential computer and remain as portable as possible. This consideration means that we can not use any machine specific routines or instructions that will make the code non-portable. This decision limits the options available but does not impose too many difficulties or reduce the performance gains (by very much) that can be achieved. The things that we can do are to control the data structure designs, the implementation and choice of the algorithm, and the use of compiler options and directives.

6.2 Memory Mapping and Data Structures

Leading array indices should always be declared to be of odd length, e.g. 129. This is because most vector (and non-vector) computers have their memory set up in banks, and the data are distributed across the memory banks in various ways depending on the particular computer. The number of banks is almost always set up to be a power of two.


So, if an array is declared to be of even length, then there is a strong possibility of memory bank conflicts when fetching and storing consecutive array elements. The memory bank conflicts can significantly slow down the performance of a code. Typically, on many vector computers, using an odd length for an array declaration gives a speedup factor of 2 to 4 over using an even length. Vectorization takes place only for inner loops, but in special circumstances nested loops may be combined into a single loop that is vectorized by the compiler. The most frequently used inner looping index should have the longest length possible. For example, if a double DO loop, with no vector dependencies, is indexed over i = 1, ..., N and j = 1, ..., M where N > M, then the loop over i should be the inner one. The data structures of the arrays should also be set up in such a way as to allow the most frequently used looping index to be placed as near the beginning (leftmost) index position as is possible. By doing these simple restructurings of the arrays, it has been found that a speedup factor of anywhere from 2 to 8 can be obtained for the various components of the black box multigrid codes.
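The following fragment illustrates both points; the array names and sizes are hypothetical. The leading dimension is padded from 128 to the odd length 129, and the most frequently used index i runs over the leading dimension in the inner loop.

      INTEGER NX, NY, i, j
      PARAMETER (NX = 128, NY = 128)
C     leading dimension padded to an odd length to avoid bank conflicts
      REAL u(129, NY), f(129, NY)
      DO j = 1, NY
C        inner loop runs over the leading (leftmost) index with
C        stride one, giving the longest vectors
         DO i = 1, NX
            u(i,j) = u(i,j) + f(i,j)
         END DO
      END DO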


6.3 Scalar Temporaries

A scalar temporary is a scalar that is set equal to a vector expression. Scalar temporaries are most often used to express common subexpressions in a loop. The use of scalar temporaries is a very delicate issue, and the extent of their use varies with the complexity of the computation and the compiler that is used. It can often be the case that a speedup factor of 1.6 to 3 can be observed in code with the proper use of scalar temporaries over code that has either over-used or under-used scalar temporaries. For the black box multigrid codes, the generation of the grid transfer operator coefficients and the formation of the coarse grid operators are highly susceptible to the use of scalar temporaries, due to the size and complexity of these computations. It is not possible to obtain an optimum implementation using scalar temporaries and have the code remain optimum and portable at the same time. However, we did manage to find a reasonable compromise which should be fairly efficient over a wide range of computers and compilers.

6.4 In-Code Compiler Directives

The use of compiler directives in the code can also greatly enhance the compiler's ability to optimize the code. When compiler directives are not used, the compiler may do several things that can slow down the performance. The compiler may fail to vectorize a loop because it suspects that there may be data dependencies present. The compiler may add run-time code to determine the vector length, the amount of loop unrolling that can be performed, or whether or not a loop can be vectorized. Using compiler directives can eliminate many of these problems and in addition can speed up both the execution and compilation time by eliminating the need for run-time checking. Compiler directives vary from computer to computer, but almost all have the advantage that they are interpreted as comment lines by other compilers. This means, however, that one will have to change the compiler directives when one moves the code from one type of computer to another, or that one will have to add all the compiler directives to the code for as many different computers as are likely to be used. We have chosen to place only the compiler directives for the Cray Y-MP in the vector versions of the black box multigrid codes. The most commonly used compiler directives are (a short example follows the list):

1. Ignore Vector Dependencies. (CDIR$ IVDEP)

2. Scalar Operations. (CDIR$ SCA)

3. Inlining. (CDIR$ INLINE)
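The following hedged sketch combines a scalar temporary with the first directive; the names and the offset k are illustrative. The programmer is assumed to know that k >= 0, so the stores into a never feed back into the loads a(i+k), which is exactly the assertion that CDIR$ IVDEP makes to the compiler.

CDIR$ IVDEP
      DO i = 1, N
C        scalar temporary for the common subexpression
         t = w(i)*b(i)
C        the unknown offset k defeats the compiler's dependence
C        analysis; IVDEP asserts that vectorization is safe
         a(i) = a(i+k) + t
      END DO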


6.5 Inlining

The use of subroutines to make a code modular has been in fashion for a couple of decades now. It has been found that codes are more readable and that the software can often be reused when the codes are split up modularly depending on their functionality. However, this modularity has the unfortunate effect of slowing down the codes' performance on many computers. This is due mostly to the overhead involved with executing a subroutine. Many compilers have a directive for inlining. Inlining consists of the compiler taking the subroutine and rewriting it at the same code level as the calling program. This eliminates the overhead of the subroutine call from the code's performance. One may ask why all compilers do not do inlining automatically. There are several answers to this question. Not all compilers can perform inlining. It takes much longer to compile a code that requests inlining. The executable is usually larger when inlining is requested because a subroutine may be called many times from different places in the code, and for each instance the code is copied into the calling routine, creating many copies of the same piece of code for that subroutine. The black box multigrid codes use subroutines mostly to separate out the different multigrid components by functionality. Thus, the use of inlining does not cause the executable to be really huge, even though there are several instances of subroutines being duplicated. The duplications are mostly for the smoother and for computing the residual and its l2 norm.


6.6 Loop Swapping

Many algorithms, when implemented, contain nested DO loops. For the best performance, the inner loop should have the longest vector length possible while maintaining the loop's vectorization. The speedup gain is dependent on the length of the loops and the amount of computation in the inner loop, but speedup factors from 1 to 6 are not uncommon. The black box multigrid codes have been implemented with the longest vector length in the inner loop. This decision has meant that compiler directives were needed to inform the compiler that the inner loops contained no vector dependencies.

6.7 Loop Unrolling

Short loops can often generate more overhead instructions than computational instructions. Because of this fact, many compilers will automatically unroll loops that have a small fixed length. However, loops with a parameter as their indexing limit and very little computational work may sometimes be unrolled partially to leave more computational work per loop iteration, but not all compilers are capable of performing this kind of loop unrolling. There are several short loops in the smoothing subroutines of the black box multigrid codes that can benefit from loop unrolling. The performance speedup factors for these loops range from 1.3 to 3.

6.8 Loops and Conditionals

Loops with a conditional statement in them will not vectorize. By conditional statements we mean IF statements and computed GOTO statements. In the subroutine that computes the grid transfer coefficients in the black box multigrid codes, a test needs to be performed inside several of the loops to determine which form the computation is to take; see chapter 3 on grid transfer operators. The IF statement for the test is converted into a computation involving the maximum intrinsic function, which is used to combine both computational branches of the test's outcome. This device makes the loop vectorizable, giving a speedup factor of nearly 100 over the non-vectorizable version involving the IF statement.
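The following is a minimal sketch of this device with hypothetical names: the branch on the sign of t(i) is replaced by a selector s built from the SIGN and MAX intrinsics, so the loop body is branch-free and vectorizes.

C     original, non-vectorizable form:
C        IF (t(i) .GE. 0.0) THEN
C           cc(i) = x(i)
C        ELSE
C           cc(i) = y(i)
C        END IF
      DO i = 1, N
C        s = 1.0 when t(i) >= 0.0 and s = 0.0 otherwise
         s = AMAX1(SIGN(1.0, t(i)), 0.0)
         cc(i) = s*x(i) + (1.0 - s)*y(i)
      END DO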


6.9 Scalar Operations

Loops and blocks of computation that can not be vectorized should be written in a form that the compiler can recognize as common scalar operations. Some of the common forms are simple vector dependencies and recursions. If the compiler can recognize these common scalar operations, it is possible for it to use specialized scalar libraries that make the best use of the hardware to obtain the best performance. The performance gain is usually modest and ranges up to a speedup factor of 1.5.

6.10 Compiler Options

The performance of a code is greatly dependent on the ability of the compiler to recognize code fragments and optimize them. In order for the compiler to generate fast and efficient code, it is important not only to write efficient code and to use in-code compiler directives, but also to use the appropriate compiler options in the compile commands. The choice of these options can dramatically speed up or slow down the performance. The compiler options that were used on the Cray Y-MP for all the black box multigrid codes were

    cf77 -Zv -Wf"-dz -o inline3,aggress -A fast" files.f

where -Zv means to vectorize for a single processor and use the dependence analyzer fpp;


-Wf indicates that the options in the quotes are for the FORTRAN compiler cft77; -dz says to disable debugging and other utilities; -o inline3,aggress is two commands, where inline3 means to inline subroutines up to three levels deep, and aggress means to raise the limits on stacks, internal tables, and searches to allow some loops to be optimized that might not otherwise vectorize; -A fast means to use the fast addressing mode and not the longer addresses that use indirect addressing under the default full addressing mode. The use of these compiler options can easily double the performance of the black box multigrid codes by allowing the compiler to vectorize the codes more fully and by cutting out some of the overhead that is generated for software analysis tools.

6.11 Some Algorithmic Considerations for Smoothers

We are interested in the multicolor point and line Gauss-Seidel methods and in the ILLU x-line method for smoothers. All the Gauss-Seidel methods vectorize quite easily, while the ILLU method does so only marginally.

6.11.1 Point Gauss-Seidel Relaxation

While we have said that the vector and sequential algorithms are the same, when it comes to the point Gauss-Seidel smoother's implementation there is a difference. The red/black point Gauss-Seidel method, performed in the normal fashion (first all red points are updated, followed by updating all the black points), is very inefficient on cache based systems, requiring all the data to pass through the cache twice. An easy modification that allows the data to pass through the cache only once can be found in [35]; the algorithm is

Algorithm 6.11.1 Let the grid be of size Nx x Ny; then the cache based red/black Gauss-Seidel algorithm is given by:


1. update all red points in row j = 1
2. Do j = 2, Ny
3.    update all red points in row j
4.    update all black points in row j-1
5. End Do
6. update all black points in row j = Ny

This cache based algorithm can give a speedup factor of two orders of magnitude on many cache based RISC processors. The 4-color Gauss-Seidel method can also be modified in a similar way to achieve the same type of speedup. The cache algorithm is not as useful on vector computers because the vector length is now only over a single line rather than over the entire grid. The loss in vector performance is not much, provided that the rows are at least as long as the computer's vector length. The shorter the vectors become, compared to the machine's vector length, the faster the performance drops off. The Cray Y-MP vector computers do not have a cache and should not use the cache based Gauss-Seidel algorithm, but the RISC processor computers should use this algorithm because they all employ caches. A sketch of the sweep is given below.
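This is a hedged sketch of Algorithm 6.11.1 for a 5-point stencil. The array names, the convention that the boundary rows and columns hold fixed values, and the coloring (red points are those with i + j even) are assumptions made for illustration.

      SUBROUTINE CRBGS(u, f, ac, an, as, ae, aw, nx, ny)
      INTEGER nx, ny, i, j
      REAL u(nx,ny), f(nx,ny), ac(nx,ny), an(nx,ny), as(nx,ny)
      REAL ae(nx,ny), aw(nx,ny)
C     statement functions: first red (black) interior column of row j,
C     and the 5-point Gauss-Seidel update at (i,j)
      IRED(j) = 2 + MOD(j, 2)
      IBLK(j) = 3 - MOD(j, 2)
      UPD(i,j) = ( f(i,j) + an(i,j)*u(i,j+1) + as(i,j)*u(i,j-1)
     &           + ae(i,j)*u(i+1,j) + aw(i,j)*u(i-1,j) ) / ac(i,j)
C     red points of the first interior row
      DO i = IRED(2), nx-1, 2
         u(i,2) = UPD(i,2)
      END DO
C     interleave red points of row j with black points of row j-1,
C     so each row passes through the cache only once
      DO j = 3, ny-1
         DO i = IRED(j), nx-1, 2
            u(i,j) = UPD(i,j)
         END DO
         DO i = IBLK(j-1), nx-1, 2
            u(i,j-1) = UPD(i,j-1)
         END DO
      END DO
C     black points of the last interior row
      DO i = IBLK(ny-1), nx-1, 2
         u(i,ny-1) = UPD(i,ny-1)
      END DO
      RETURN
      END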


6.11.2 Line Gauss-Seidel Relaxation

Zebra line Gauss-Seidel relaxation requires the solution of tridiagonal systems for the line equations. The tridiagonal systems are solved by Gaussian elimination. There are two approaches to solving the equations: either we solve the lines every time from scratch, or we factor the tridiagonal systems, saving the LU decompositions, and solve the factored systems. The first approach will use less memory by not having to save the LU decompositions, but the total solution time will take longer. The second approach is more favorable, if enough memory is available, because the LU decompositions account for 40% of the operation count for the solution of the tridiagonal system of equations. The LU factorization and solution phases for a tridiagonal system are both inherently sequential, with essentially no vectorization. However, we are not solving just a single line, but a zebra ordering of all the lines on the grid. Vectorization can be obtained by performing each step of the decomposition and solution on all of the lines of the same color simultaneously; a sketch is given below. The benefits of vectorization diminish as the grids become coarser because there are fewer lines to vectorize across, but for standard coarsening the lines are also becoming shorter, reducing the amount of sequential work that must be performed. In this respect the standard coarsening algorithm is more efficient than the semi-coarsening algorithm. On a RISC processor cache based system it is best to perform the LU solution on each tridiagonal x-line separately, because this requires the data to pass through the cache only once. However, the zebra y-line Gauss-Seidel relaxation should use the vector algorithm, looping over all the lines in the x direction, which only requires the data to pass through the cache once. We could have used the Linpack (Lapack) tridiagonal factorization and LU solvers, SGTTRF and SGTTRS respectively, but they can not perform the factorization or back substitution for more than one system at a time, and they do not use the cache in the same way. The routines are fine for a stand-alone system, but they can not take advantage of the vectorization or cache potential that can be obtained by writing our own routines that have knowledge of the entire smoothing process. For this reason we have implemented our own solvers.
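The following is a minimal sketch of the vectorized approach with hypothetical array names: a, b, and c hold the sub-, main, and super-diagonals and r the right hand sides of the x-line systems, stored with the line index j first; j0 selects the zebra color. The recursion over the points i is sequential, but the inner loop over the lines of one color vectorizes.

C     forward elimination for all lines of one color at once
      DO i = 2, nx
CDIR$ IVDEP
         DO j = j0, ny, 2
            xm = a(j,i)/b(j,i-1)
            b(j,i) = b(j,i) - xm*c(j,i-1)
            r(j,i) = r(j,i) - xm*r(j,i-1)
         END DO
      END DO
C     back substitution
      DO j = j0, ny, 2
         r(j,nx) = r(j,nx)/b(j,nx)
      END DO
      DO i = nx-1, 1, -1
CDIR$ IVDEP
         DO j = j0, ny, 2
            r(j,i) = (r(j,i) - c(j,i)*r(j,i+1))/b(j,i)
         END DO
      END DO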


6.12 Coarsest Grid Direct Solver

The coarsest grid problem is either a single grid line (in the case of a rectangular fine grid with standard coarsening, or in the case of the semi-coarsening algorithm) or a small system of equations. The single grid line equation is a tridiagonal system which is easily solved, and we have chosen to save the LU factorization. The small system of equations is a banded system. We chose to use the Linpack general banded routines, SGBFA and SGBSL, because they existed on all the machines we used. Lapack would be an even better choice, but not all the machines to which we had access had it installed. In addition, the implementations of the Linpack routines were optimized at our institution for our computers.

6.13 l2-Norm of the Residual

The l2-norm of the residual is used in the black box multigrid solver as one of the determinations for when to stop the multigrid iterations. It has also been used in the test codes to determine the reduction factors for an iteration and for various components. The computation of the norm is straightforward, but since the computation involves the sum of a large number of floating point values, it might be wise to ask if the result has any meaning. The question is quite valid because it is well known that summing floating point numbers can lead to errors in the resulting summation, depending on how the actual sum is computed. Originally the norm was computed by adding the squares of the residuals to the running sum total. We will call this type of summation the naive summation algorithm and define it to be the following.

Algorithm 6.13.1 Let r be a floating point array of N values, r(i), i = 1, ..., N. The naive summation algorithm for the sum of the r(i) is given by the following:

      sum = 0.
      Do i = 1, N
         sum = sum + r(i)
      EndDo


The computed sum is equal to Σ ri(1 + δi), where |δi| < (n − i)ε; here ε is the machine epsilon (which is the increment between representable floating point numbers) and n is the number of summands, so the perturbations can grow as large as n units in the last place. The naive summation algorithm vectorizes, and will be unrolled several iterations by good optimizing compilers. It also performs very well on cache based RISC computers because the data is usually stored contiguously. At first glance this may all seem to be of little importance, since the norm of the residual is only being used as a stopping criterion. However, after looking at the experimental and testing data, there were many cases in which the convergence criterion was just barely missed and an additional multigrid iteration was performed. After a more extensive look at the norms, it was determined that several of the cases did not actually need to perform the additional iteration. To make matters worse, a couple of cases were found that had stopped prematurely because the norm of the residual was incorrectly computed. It should be noted that these cases showed up quite often on the workstations and only rarely on the Cray computers. The trouble has to do with floating point arithmetic and the loss of accuracy. The Cray computers, which have a longer word length, perform arithmetic at a much higher precision than the workstations and hence very rarely encounter such trouble. One approach to fixing this problem is to use higher precision for the summation, which is usually accomplished by doubling the current precision. The doubling algorithm uses the naive summation algorithm but doubles the precision of all arithmetic operations, which gives a computed sum equal to Σ ri(1 + Σj δj), where each |δj| is bounded by the epsilon of the doubled precision. This would appear to be the answer that we are looking for, except that it can execute very slowly.


Doubling the precision on 32 bit RISC workstations and the Cray Y-MP means that the higher precision arithmetic is handled in software and not in the hardware. We will discuss this point in more detail later. The loss of accuracy in the summation process can easily be fixed with very little extra cost on sequential machines by using the Kahan summation algorithm [40], [56]. The Kahan summation algorithm can be described as follows.

Algorithm 6.13.2 Let r be a floating point array of N values, r(i), i = 1, ..., N. The Kahan summation algorithm for the sum of the r(i) is given by the following:

      sum = r(1)
      correction = 0.
      Do i = 2, N
         next_correction = r(i) - correction
         new_sum = sum + next_correction
         correction = (new_sum - sum) - next_correction
         sum = new_sum
      EndDo

The computed sum is equal to Σ ri(1 + δi) with |δi| ≤ 2ε, where ε is the machine epsilon. The difference between the two algorithms is now much clearer, since each summand in Kahan summation is perturbed by only 2ε instead of by perturbations as large as nε in the naive algorithm. The Kahan summation algorithm is not vectorizable, and even though the loop can be unrolled by the compiler, it is still miserably slow on the Cray Y-MP. However, on the sequential workstations it is only about twice as slow as the naive summation algorithm.
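A runnable form of Algorithm 6.13.2 is given below. One caveat, which is an assumption about optimizing compilers rather than a claim from the text: the correction step is algebraically zero, so the routine must not be compiled with optimizations that rearrange floating point expressions, or the compiler may delete it.

      REAL FUNCTION KSUM(r, N)
      INTEGER N, i
      REAL r(N), c, t, y
      KSUM = r(1)
      c = 0.
      DO i = 2, N
         y = r(i) - c
         t = KSUM + y
C        recover the low-order bits lost in forming t
         c = (t - KSUM) - y
         KSUM = t
      END DO
      RETURN
      END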


Table 6.1. Cray Y-MP timings for the naive, Kahan, and doubling summation algorithms in seconds. The numbers in parentheses are the timing ratios relative to the naive algorithm's times.

      N        Naive       Kahan                Double
  elements     time        time (tk/tn)         time (td/tn)
    10^2     6.702E-6    2.161E-5 (3.22)     6.185E-5 (9.23)
    10^3     1.208E-5    1.788E-4 (14.8)     3.026E-4 (25.1)
    10^4     6.964E-5    1.751E-3 (25.1)     2.708E-3 (38.9)
    10^5     6.433E-4    1.748E-2 (27.2)     2.677E-2 (41.6)
    10^6     6.382E-3    1.746E-1 (27.4)     2.675E-1 (41.9)

We can now compare the three summation algorithms on both the vector and RISC (Sparc-5) computers. The summation timings on the Cray Y-MP are given in table 6.1. On the Cray Y-MP double precision arithmetic is performed using software, and hence it is quite slow. Recall that the Cray Y-MP does not need to use double precision because it already uses 64 bits for single precision, which was found to be very adequate for the summation process. As we have already said, the Kahan summation algorithm does not vectorize and is quite slow. It is obvious that the only practical implementation is to use the naive summation algorithm on the Cray Y-MP. The Sparc-5 timings, in table 6.2, show that the Kahan algorithm is about twice as slow as the naive algorithm, and that the double precision algorithm is only about 10% slower than the naive algorithm. Computing in double precision is very adequate for our needs. We are now faced with three summation algorithms. Our choices now appear to be either to settle for three versions or just to use the naive summation algorithm, as before, and not worry about missing or adding an additional multigrid iteration. For the timing data presented in the later sections of this thesis we have chosen just to use the naive summation algorithm, but we believe that it is better to make two or even three versions of the code.


Table 6.2. Sparc5 timings for the naive (tn), Kahan (tk), and doubling (td) summation algorithms in seconds. The numbers in parentheses are the timing ratios relative to the naive algorithm's times.

      N        Naive      Kahan              Double
  elements     tn         tk (tk/tn)         td (td/tn)
    10^2     2.70E-5    6.00E-5 (2.22)     3.20E-5 (1.19)
    10^3     2.67E-4    6.04E-4 (2.27)     3.27E-4 (1.23)
    10^4     2.75E-3    6.39E-3 (2.32)     3.31E-3 (1.21)
    10^5     2.86E-2    6.59E-2 (2.30)     3.44E-2 (1.20)
    10^6     2.88E-1    6.61E-1 (2.29)     3.56E-1 (1.24)

For vector computers either the naive or the double summation algorithm should be used, depending on whether double precision arithmetic is implemented in hardware at 64 or 32 bits respectively. On sequential machines either the Kahan or the double summation algorithm should be used, depending on the implementation of double precision arithmetic.

6.14 2D Standard Coarsening Vector Algorithm

To this point we have discussed several issues concerning the black box multigrid components, vectorization, and programming on the Cray Y-MP, but we have not explicitly mentioned what our choices were for the code. We will do so now. We have implemented the code being aware of all the vectorization issues and using the most efficient choices that have been discussed above.

6.14.1 Coarsening

We used standard coarsening, taking every other fine grid point in both coordinate directions to form the coarse grid.

6.14.2 Data Structures

The data structures for the grid equations are grid point stencil oriented. The mesh of unknowns has been augmented with a border of fictitious zero equations. The border is used to avoid having to write special code to handle the boundary of the grid.


[Figure 6.3. Data structure layout for m grid levels (grid m down to grid 1): per level, the grid equations L = (Nx, Ny, 9), u = (Nx, Ny), f = (Nx, Ny), and Work_Space = (Nx, Ny, 3); between levels, the grid transfer operators Prolongation = (Nx, Ny, 8) and Restriction = (Nx, Ny, 8).]

This arrangement makes the code easier to write and more efficient for vector operations. There are several arrays to hold the grid equations: the discrete coefficient array, the array of unknowns, and the right hand side array. There are also a few extra auxiliary arrays to hold the grid transfer operator coefficients, the residual, and the LU decompositions of the line solves and of the coarsest grid problem. Each grid level has its own data structure of the appropriate size that has been allocated, via pointers, as part of a larger linear array for each data type structure; see figure 6.3. This arrangement makes memory management for the number of grid levels easier.

6.14.3 Smoothers

We have implemented the multicolor ordering point, line, and alternating line Gauss-Seidel methods and the ILLU x-line method. The cache based Gauss-Seidel algorithm is not used.
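The pointer scheme can be sketched as follows; the names, the maximum of 20 levels, and the explicit size of the linear array cf are illustrative assumptions. Each level's 9-point stencil storage is carved out of cf at offset ip(k), so a level can be passed to a routine that declares it as a (nxl(k), nyl(k), 9) array.

      INTEGER MAXLEV, m, k, next
      PARAMETER (MAXLEV = 20)
      INTEGER ip(MAXLEV), nxl(MAXLEV), nyl(MAXLEV)
      REAL cf(500000)
C     assume m and the finest grid sizes nxl(m), nyl(m) have been set;
C     coarse grid sizes follow from standard coarsening
      DO k = m-1, 1, -1
         nxl(k) = (nxl(k+1) + 1)/2
         nyl(k) = (nyl(k+1) + 1)/2
      END DO
C     offsets of each level's stencil inside the linear array
      next = 1
      DO k = m, 1, -1
         ip(k) = next
         next = next + 9*nxl(k)*nyl(k)
      END DO
C     level k's stencil occupies cf(ip(k)) through
C     cf(ip(k) + 9*nxl(k)*nyl(k) - 1)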


6.14.4 Coarsest Grid Solver

The coarsest grid solver is a direct solver using LU factorization.

6.14.5 Grid Transfer Operators

There are three choices for the grid transfer operators, discussed in chapter 3, that were implemented. They are the ones discussed in sections 3.5.1, 3.5.3, and 3.6.1.

6.14.6 Coarse Grid Operators

The coarse grid operators are formed using the Galerkin coarse grid approximation, which uses the grid transfer operators and the fine grid operator.

6.15 2D Semi-Coarsening Vector Algorithm

The semi-coarsening code was originally implemented by Joel E. Dendy, Jr. We have re-implemented it in a more efficient form to gain a speedup of about 5 over the previous vectorized version while maintaining and improving the portability of the code. The new implementation has kept all the functionality of the previous version.

6.15.1 Data Structures

The data structures for the grid equations are the same as those for the standard coarsening code, including the fictitious border equations. However, now the work space array is reduced to (Nx, Ny, 2), and the prolongation and restriction coefficient arrays are of length 2 instead of 8.

6.15.2 Coarsening

Semi-coarsening in the y-direction was used, taking every other fine grid point in the y-direction to form the coarse grid.

6.15.3 Smoothers

Red/black x-line Gauss-Seidel relaxation is used for the smoother. As an experiment the x-line ILLU method was also implemented.


6.15.4 Coarsest Grid Solver

The coarsest grid solver is either the direct LU factorization solver or a tridiagonal solver, in the case that coarsening is continued until only one x-line remains.

6.15.5 Grid Transfer Operators

The grid transfer operator is the one used in section 3.6.1, applied in only the y-direction.

6.15.6 Coarse Grid Operators

The coarse grid operators are formed using the Galerkin coarse grid approximation, using the grid transfer and fine grid operators.
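In operator form (standard multigrid notation, assumed here rather than quoted from the text), the Galerkin coarse grid approximation used in both the standard and semi-coarsening codes is the triple product of restriction, fine grid operator, and prolongation:

\[ L^{2h} = I_h^{2h} \, L^h \, I_{2h}^h , \]

where \(I_h^{2h}\) denotes the restriction operator and \(I_{2h}^h\) the prolongation operator.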


CHAPTER 7

2D NUMERICAL RESULTS

The numerical results in this section are for the two dimensional domain versions of the black box multigrid solvers.

7.1 Storage Requirements

The black box multigrid solvers present some tradeoff issues for speed versus storage. The algorithms require that we perform a number of tasks involving grid operators and grid transfer operators. We can choose to save storage and sacrifice speed by computing these when we need them. However, since we are using a geometric multigrid method, these computations are not cheap. The most expensive is the formation of the coarse grid operators. For the 2D grid levels we need storage for the grid equations (unknowns, coefficients, and right hand side), the grid transfer operators, and temporary work space. Let Nx and Ny be the number of grid points in the x- and y-directions respectively. We can compute how much storage will be needed by adding up the amount per grid point. We need 9 locations for the coefficient matrix and 1 each for the unknowns and right hand side. For the standard coarsening method we need 16 locations for the grid transfer operator coefficients (prolongation and restriction together), and another 3 for temporary work. For the semi-coarsening method we need 4 locations for the grid transfer operator coefficients, and another 2 for temporary work. We can ignore the amount of storage for the coarsest grid direct solver for now because it will remain constant and small when compared to the rest.


Table 7.1. Storage requirements for various grid sizes for the grid unknowns and grid operators on the Cray Y-MP.

                      Unknowns                 Coefficients (9-point)
     N x N        N^2      all levels         9N^2       all levels
     9 x 9          81          115              729         1035
    17 x 17        289          404             2601         3636
    33 x 33       1089         1493             9801        13437
    65 x 65       4225         5718            38025        51462
   129 x 129     16641        22359           149769       201231
   257 x 257     66049        88408           594441       795672
   513 x 513    263169       351577          2368521      3164193
  1025 x 1025  1050625      1402202          9455625     12619818

This means that we need 30 locations per point for the standard coarsening and 17 for the semi-coarsening. However, we do not store grid transfer coefficients on the finest grid, so we can subtract 16 and 4 locations from the total for the standard and semi-coarsening methods respectively. The amount of storage required for the 2D data structures, excluding storage for the line solves, is

    (30 (1 + 1/4 + 1/16 + ...) - 16) NxNy ≈ 24 NxNy    (7.1)

    (17 (1 + 1/2 + 1/4 + ...) - 4) NxNy ≈ 30 NxNy    (7.2)

for the standard and semi-coarsening methods respectively. If we only have a 5-point operator on the finest grid, we do not need to store the other 4 coefficients, and then the storage requirements become 20NxNy and 26NxNy for the standard and semi-coarsening methods respectively. The growth of the storage requirements versus the grid size can be seen in table 7.1. The first column is the number of grid points and is also the amount of storage needed for the grid unknowns or right hand sides for that grid level. The second column is the number of grid points for the indicated fine grid down to the coarsest grid level (3 x 3). The last two columns are the amount of storage needed for the grid operator (9-point) on that grid level and on all the grid levels respectively.


Table 7.2. The actual storage requirements for various grid sizes for the codes BMGNS, SCBMG, and MGD9V, given in terms of the number of real variables on the Cray Y-MP.

     N x N         SCBMG        BMGNS        MGD9V
     9 x 9          4721         4359         2063
    17 x 17        12938        10309         7072
    33 x 33        39635        29235        25777
    65 x 65       132908        96001        97986
   129 x 129      476165       345999       381651
   257 x 257     1783454      1312669      1506020
   513 x 513     6868535      5113515      5982965
  1025 x 1025   26895056     20860411     23849734

The price of memory has been falling steadily for years, and as a consequence computers are being built with more and more memory. Due to this phenomenon, and in the interest of speed, we have chosen to compute these operators once and store them. We could have chosen just to store the grid operators and compute the grid transfer operators when they are needed, and this procedure might even be practical for the three dimensional problem codes, where memory is used up at an alarming rate. The actual storage requirements for the various codes versus the grid size are given in table 7.2. The BMGNS code, which uses zebra alternating line Gauss-Seidel, requires the least amount of storage for grids larger than 65 x 65. The MGD9V code uses less storage for small grid levels because its grid transfer operators require only a fourth of the storage of BMGNS, due to symmetries. However, MGD9V must store the ILLU smoother decomposition for each grid level, and this additional storage becomes more significant for larger grids. The SCBMG code always requires more storage because it uses semi-coarsening.


Table 7.3. Speedup of the new vectorized version, vbmg, and Dendy's vectorized version, bmg, over the scalar version of bmg.

  grid size N    BMGNS    VBMGNS    Speedup
   (N x N)                          (vbmgns over bmgns)
       9          2.3      13.8        5.8
      17          3.2      21.5        6.7
      33          5.5      33.6        6.1
      65          7.2      43.3        6.0
     129          9.5      54.2        5.7
     257         11.4      68.4        6.0
     513         14.1      81.8        5.8
    1025         16.2      92.4        5.7

7.2 Vectorization Speedup

The original functionality of the Black Box Multigrid codes [29] was kept and used as a baseline. There were several restrictions that we imposed on what could be done with the codes. The desire behind developing and optimizing the black box multigrid codes was to maintain as much of the original functionality of the black box multigrid method as possible. The original method was implemented as a research prototype. Even though care was taken to keep it somewhat portable and to adhere to most of the SLATEC guidelines [38], at least for the documentation, it still had a long way to go before being ready for release to the scientific and engineering community. We wanted to maintain all of the cycling strategies, except for those controlled by the MCYCL parameter, which is only valid when performing the initial stages of an F-cycle. We also decided to remove the truncation error estimates, ITAU, because they were no longer meaningful. The speedup of the new vectorized version of the standard coarsening code is given in table 7.3. The table shows that while Dendy's original code had some vectorization in it, the new version has much more. The speedup is due to a variety of factors, such as better organization of the computations and better use of the compiler to achieve the vectorization that is present.


Table 7.4. Comparison and speedup of the semi-coarsening versions: scbmg is Dendy's version, and vscbmg is the new vectorized version. The "-" entries mean that there is no data.

  grid size N    CM-2      Cray Y-MP             Speedup
   (N x N)       SCBMG     SCBMG    VSCBMG       (scbmg/vscbmg)
      32           -        .011     .0019           5.8
      64          .65       .04      .0059           6.8
     128          .99       .09      .015            6.0
     256         1.84       .27      .045            6.0
     512         4.55       .95      .16             5.9
    1024           -       3.69      .64             5.8

The last column of the table shows that the new version of the code runs about six times faster than Dendy's original code. The speedup is not consistent over the range of grid sizes because of the effect of vector length and memory cache misses. There may also be some effects due to the fact that the timings were done in a time sharing environment, but they should be quite small when compared to the other issues. While table 7.3 is only for the standard coarsening code, it also reflects the speedup seen in the semi-coarsening code. The data in table 7.4 compares the execution time in seconds on the Cray Y-MP and the CM-2 of Dendy's semi-coarsening version to the new vectorized semi-coarsening version on the Cray Y-MP. The table shows that the new semi-coarsening version is roughly six times faster than Dendy's on the Cray Y-MP. The poor performance of the CM-2 version of scbmg is due to the much slower computing units and the communications penalty on the CM-2. The CM-2 version was run on one quarter of 1024 nodes (256 processors) under the slicewise model. The timing comparisons are perhaps clearer when presented in graphical form. We compare the three main types of algorithms represented by BMGNS, SCBMG, and MGD9V on the Cray Y-MP using a single processor.


While the setup is done only once, the solution may require many V-cycle iterations. Hence, the time for the setup and the time for one V-cycle are presented separately. The BMGNS code uses zebra alternating line Gauss-Seidel relaxation for the smoother (which is the most expensive option), and the non-symmetric grid operator collapsing method involving aL for the grid transfer operators (which is the most expensive collapsing method). The setup time includes the computation of the coarse grid operators, the grid transfer operators (prolongation and restriction), the factorization for the coarsest grid solver, and any setup required by the smoother. Figure 7.1 represents the setup time in seconds for the three codes. It is not surprising that SCBMG is the fastest, since it only needs to compute grid transfer coefficients in the y-direction and factorizations for lines in the x-direction, while BMGNS has to compute the six additional grid transfer coefficients for each prolongation and restriction operator and an additional line factorization for the smoother. The factorization for the coarsest grid direct solver must also be computed for BMGNS. The MGD9V algorithm saves time in computing the grid transfer coefficients over BMGNS, and like SCBMG it does not need to compute a factorization for the direct solver on the coarsest grid. However, the factorizations and setup for the ILLU smoother are very expensive and are not really vectorizable. The times for one complete V-cycle are given in figure 7.2, excluding the setup time. The V-cycle is a V(1,1)-cycle for BMGNS and SCBMG, and a V(0,1)-cycle for MGD9V. The V-cycle timing graph shows the same relationship as the setup timing graph, except that the time for one V-cycle is much less than the setup time.


[Figure 7.1: Comparison of setup time (seconds) versus grid size for BMGNS, SCBMG, and MGD9V.]


[Figure 7.2: Comparison of the time for one V-cycle (seconds) versus grid size for BMGNS, SCBMG, and MGD9V.]


Table 7.5. Operation counts (actual) for standard coarsening black box multigrid: setup phase.

  L^H                369NxNy
                     256NxNy
  CTL/aL             174NxNy - 135Nx - 135Ny + 96
                     122NxNy -  91Nx -  91Ny + 60
  CTL/L Hybrid       166NxNy - 131Nx - 131Ny + 96
                     114NxNy -  87Nx -  87Ny + 60
  Schaffer           168NxNy - 112Nx - 112Ny + 56
                     120NxNy -  88Nx -  88Ny + 56
  ZxLGS decomp         4NxNy -   3Ny
  ZyLGS decomp         4NxNy -   3Nx
  ZALGS decomp         8NxNy -   3Nx -   3Ny
  xILLU decomp        58NxNy -  52Nx - 106Ny + 101

7.3 2D Computational Work

The amount of work performed by the various multigrid components as implemented on the Cray Y-MP is given below. The operation counts lump multiplication, division, addition, and subtraction together. The setup phase is broken down into three parts: grid transfer operators, coarse grid operator, and smoother decomposition. The grid parameters, Nx and Ny, are the number of coarse grid points in the respective coordinate directions. The operation counts are given in table 7.5. The first line in each block with multiple lines refers to when the grid operator has a 9-point stencil, and the second line to when it has a 5-point stencil. The operation counts say that the CTL/L-Hybrid method is the fastest for computing the grid transfer operator coefficients. We can also see that the zebra alternating line Gauss-Seidel smoother is very cheap to decompose when compared to the ILLU smoother. The number of operations performed to compute the residual and perform the grid transfers are given in table 7.6.


Table 7.6. Operation counts (actual) for standard coarsening black box multigrid: residual and grid transfer components.

  Residual        18NxNy
                  10NxNy
  Prolongation    20NxNy - 14Nx - 14Ny + 9
  Restriction     16NxNy

The amount of work needed to perform the various smoothers is given in table 7.7. Again, the first line in each block refers to when the grid operator has a 9-point stencil, and the second line to when it has a 5-point stencil. It is important to remember that the operation counts give only a rough idea of how the actual components will perform, for several reasons. The first is that we have lumped all of the different arithmetic operations together into one count. The amount of time to perform the various arithmetic operations can vary by up to several clock cycles, with division usually taking the longest. In addition, the operation counts say nothing about how the code will perform on the hardware. Some of the issues that can affect the performance are whether or not the operations vectorize, fill the pipelines, and require cache paging. These issues can drastically change the amount of execution time needed to perform the operations. Even though the multigrid components listed here have all been optimized in FORTRAN, they still all vary with regard to these hardware issues, and hence it is not easy to predict, without running the codes, which will actually execute the fastest.

7.4 Timing Results for Test Problems

In this section we present some timing results of the various codes for comparing the performance of the codes and their components.


Table 7.7. Operation counts (actual) for standard coarsening black box multigrid: smoother component.

  4C-PGS     17NxNy
  2C-PGS      9NxNy
  ZxLGS      18NxNy
             10NxNy
  ZyLGS      18NxNy
             10NxNy
  ZALGS      36NxNy
             20NxNy
  xILLU      40NxNy - 17Nx - 13Ny + 9
             32NxNy - 17Nx - 13Ny + 9

Table 7.8. Problem 8: Helmholtz equation, standard coarsening with grid transfer operators based on the extension of Schaffer's idea. Various grid sizes versus the number of V(1,1) cycles, using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother.

  Grid Size   it.    Total       Total        Direct      Average
      n              Setup       Smoothing    Solver      per Cycle
      9        2    9.001E-4    6.492E-4     2.022E-3    5.609E-4
     17        4    1.856E-3    2.541E-3     5.969E-3    1.028E-3
     33        4    3.811E-3    5.219E-3     1.189E-2    2.019E-3
     65        4    8.778E-3    1.261E-2     2.783E-2    4.762E-3
    129        4    2.630E-2    3.899E-2     8.392E-2    1.441E-2
    257        4    8.707E-2    1.290E-1     2.807E-1    4.841E-2


To illustrate how fast these codes perform in solving a problem, we examine the timing results for solving the discontinuous Helmholtz problem from section 7.5; see table 7.8. Table 7.8 gives the timing results, in seconds, for various stages of the program execution for various grid sizes. The grid is square, so in the first column where n = 9 we mean a grid of size 9 x 9, and so forth for the rest of the column entries. The second column gives the number of multigrid V(1,1)-cycle iterations that were needed to reduce the l2 norm of the initial residual by six orders of magnitude. The third column gives the total setup time, which involves the time it takes to form all of the grid transfer operators, generate all the coarse grid operators, and perform any decompositions needed for the smoother. The fourth column gives the total time for smoothing. The fifth column gives the total time for the direct solver. The last column contains the average time it took to complete one V(1,1)-cycle. We observe that the code runs fairly quickly, and that it appears to scale very well with respect to the grid size. We also note that the total setup time is about 1.7 times the average cycle time, and that in addition this time is about 2.7 times the total smoothing time for one iteration. A more detailed examination of these relationships between the various multigrid components is given below.

The rest of the tables in this section are the results for one multigrid V(1,1)-cycle. The results are separated by multigrid components for easier comparison between the types of multigrid algorithms. All times are given in seconds of CPU time on the Cray Y-MP in single processor mode. The time to perform the LU decomposition of the coarsest grid (3 x 3) problem for the direct solver is 7.176E-5 seconds. The direct solver on the coarsest grid level (3 x 3, standard coarsening) takes 2.609E-5 seconds. These times are constant for all of the standard coarsening algorithms that use the direct solver. It should be noted that these times are based on a coarsest grid size of 3 x 3, and that if another coarsest grid size is chosen, then the times will also change.
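All of the iteration counts reported in this chapter use the same stopping test. A minimal sketch of the driver loop is given below; vcycle and resid_norm stand in for the actual library routines, and their names and interfaces are assumptions, not the thesis code.

  subroutine solve_to_tol(u, f, n, niter)
    implicit none
    integer, intent(in)    :: n
    real(8), intent(inout) :: u(n,n)
    real(8), intent(in)    :: f(n,n)
    integer, intent(out)   :: niter
    real(8), external      :: resid_norm   ! l2 norm of the residual f - Lu
    external vcycle                        ! one V(1,1) multigrid cycle
    real(8) :: r0
    integer :: it
    r0 = resid_norm(u, f, n)
    niter = 0
    do it = 1, 50                          ! declare failure after 50 cycles
       call vcycle(u, f, n)
       niter = it
       if (resid_norm(u, f, n) <= 1.0d-6*r0) return   ! six orders of magnitude
    end do
  end subroutine solve_to_tol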


Table 7.9. Timings in seconds for multigrid grid transfer components for one V(1,1)-cycle for various grid sizes; comparing standard and semi-coarsening methods.

  Grid Size   Standard Coarsening          Semi-Coarsening
  n           Prolongation   Restriction   Prolongation   Restriction
    9         5.151E-5       4.525E-5      1.603E-5       2.261E-5
   17         1.029E-4       8.675E-5      2.898E-5       3.798E-5
   33         2.042E-4       1.693E-4      4.985E-5       6.383E-5
   65         4.430E-4       4.063E-4      1.114E-4       1.427E-4
  129         1.119E-3       1.264E-3      2.797E-4       3.611E-4
  257         3.675E-3       4.306E-3      8.869E-4       1.164E-3

The amount of work to perform the grid transfers depends on the grid size and on the type of coarsening used. A comparison between standard and semi-coarsening is given in table 7.9. As one would expect, semi-coarsening grid transfers are faster than standard coarsening grid transfers. In operation counts, standard coarsening restriction requires 4 times the work that semi-coarsening does, and prolongation requires 10 times more work. However, due to the number of grid points that they operate on and the way in which they are computed by the hardware, the measured factor is only about 3.2 to 4.0 for prolongation and 2.0 to 3.6 for restriction, depending on the grid size.

Table 7.10 gives the timing results for four standard coarsening smoothers and the semi-coarsening smoother. Note that for red/black point Gauss-Seidel relaxation, the ratios are not uniform or monotone in nature. This situation seems to be due to several factors involving the vector length, memory stride, and cache issues. The point relaxation methods seem to be much more sensitive to memory access delays. The zebra line Gauss-Seidel relaxation also shows some of these types of variation, but after taking into account multiple runs of both x- and y-line relaxation, they can be averaged out. The time variations for the line relaxations again point to memory access delays as the main cause. The physical layout of the data in memory means that x- and y-line relaxation require different access strides for the grid operator coefficients.


Table 7.10. Timings for the total smoothing time in seconds for one multigrid V(1,1)-cycle for various grid sizes and smoothers.

  Grid Size   Total Smoothing Time (seconds)
  n           R/BPGS     ZLGS       ZALGS      ILLU       SCBMG
    9         1.572E-4   1.673E-4   3.246E-4   7.316E-4   1.846E-4
   17         3.318E-4   3.354E-4   6.352E-4   1.670E-3   4.318E-4
   33         6.763E-4   6.690E-4   1.305E-3   4.207E-3   1.014E-3
   65         1.473E-3   1.555E-3   3.153E-3   1.174E-2   2.887E-3
  129         3.912E-3   4.802E-3   9.747E-3   3.732E-2   8.473E-3
  257         1.241E-2   1.584E-2   3.226E-2   1.293E-1   2.821E-2

A closer look at the issue shows that in almost all cases that were measured, y-line relaxation is slightly faster than x-line relaxation on the Cray Y-MP. While table 7.10 shows that point and line Gauss-Seidel relaxation are both quite fast, we have seen from local mode analysis that they are not robust. The alternating line, ILLU, and semi-coarsening methods are robust. We observe that standard coarsening with zebra alternating line relaxation has approximately the same performance time as the semi-coarsening method, which is not surprising since the semi-coarsening method performs half the number of line relaxations, only x-lines, and since its lines are at least twice as long as in the standard coarsening method. The differences between these two can then be attributed to their performance on the hardware. Both of these smoothers are much faster than the ILLU smoother, ranging from about 2.3 to 4 times faster. However, do not forget that we saw in the local mode analysis that the ILLU method was a much better smoother.

The ratio of time spent smoothing versus the time spent doing grid transfers is given in table 7.11. See the comments above under smoothers concerning the behavior of the point relaxation. The ratio of smoothing to grid transfers shows that the smoother is the dominant computation in the multigrid cycling algorithm. It also shows that for ILLU or semi-coarsening the smoothing is even more dominant for large grids.
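For reference, a minimal sketch of one color of a zebra x-line Gauss-Seidel sweep is given below. It is not the thesis code (the array layout and stencil ordering are assumptions), but it shows the structure being timed: every other x-line is solved with the Thomas algorithm, with all off-line stencil connections moved to the right-hand side, and the choice of line direction fixes the memory stride of the coefficient accesses.

  subroutine zebra_xline(nx, ny, color, s, u, f)
    implicit none
    integer, intent(in)    :: nx, ny, color     ! color = 1 (odd lines) or 2 (even lines)
    ! s(1:9,i,j): stencil at (i,j), ordered C, N, S, E, W, NE, NW, SE, SW
    real(8), intent(in)    :: s(9,nx,ny), f(nx,ny)
    real(8), intent(inout) :: u(0:nx+1,0:ny+1)  ! ghost layer holds boundary data
    real(8) :: cp(nx), dp(nx), rhs(nx), den
    integer :: i, j
    do j = color, ny, 2
       do i = 1, nx        ! move the off-line couplings to the right-hand side
          rhs(i) = f(i,j) - s(2,i,j)*u(i,j+1)   - s(3,i,j)*u(i,j-1)     &
                          - s(6,i,j)*u(i+1,j+1) - s(7,i,j)*u(i-1,j+1)   &
                          - s(8,i,j)*u(i+1,j-1) - s(9,i,j)*u(i-1,j-1)
       end do
       rhs(1)  = rhs(1)  - s(5,1,j)*u(0,j)      ! fold in the ghost values
       rhs(nx) = rhs(nx) - s(4,nx,j)*u(nx+1,j)
       ! Thomas algorithm for the tridiagonal system (W, C, E) along the line
       cp(1) = s(4,1,j)/s(1,1,j)
       dp(1) = rhs(1)/s(1,1,j)
       do i = 2, nx
          den   = s(1,i,j) - s(5,i,j)*cp(i-1)
          cp(i) = s(4,i,j)/den
          dp(i) = (rhs(i) - s(5,i,j)*dp(i-1))/den
       end do
       u(nx,j) = dp(nx)
       do i = nx-1, 1, -1
          u(i,j) = dp(i) - cp(i)*u(i+1,j)
       end do
    end do
  end subroutine zebra_xline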


Table 7.11. Standard coarsening with grid transfer operators based on the extension of Schaffer's idea. Timing ratios (smoothing/grid transfer) for one V(1,1)-cycle for various grid sizes.

  Grid Size   (Smoothing)/(Grid Transfers)
  n           R/BPGS   ZxLGS   ZALGS   ILLU    SCBMG
    9         1.68     1.73    3.35     7.53    4.78
   17         1.75     1.77    3.49     8.91    6.45
   33         1.81     1.79    3.50    11.32    8.92
   65         1.73     1.83    3.71    13.89   11.36
  129         1.66     2.01    4.01    15.66   13.22
  257         1.56     1.98    4.04    16.20   13.75


Table 7.12. Timings for the multigrid setup (generate all grid transfer and grid operators and perform decompositions for the smoother) for one V(1,1)-cycle for various grid sizes.

  Grid Size   ZALGS                     Schaffer's Idea
  n           σL/σL      σL/L-Hybrid    ZALGS      ILLU       SCBMG
    9         8.174E-4   7.611E-4       9.001E-4   1.369E-3   5.176E-4
   17         1.780E-3   1.623E-3       1.856E-3   2.968E-3   1.238E-3
   33         3.914E-3   3.583E-3       3.811E-3   6.657E-3   3.393E-3
   65         9.209E-3   8.518E-3       8.778E-3   1.679E-2   1.349E-2
  129         2.746E-2   2.560E-2       2.630E-2   5.165E-2   6.459E-2
  257         9.277E-2   8.673E-2       8.707E-2   1.761E-1   3.761E-1

The times to perform the setup for the various algorithms are given in table 7.12. The column headings are σL/σL for the grid transfer operator from section 3.5.1; σL/L-Hybrid for the grid transfer operator from section 3.5.3; and Schaffer's idea for the grid transfer operator from section 3.6. We present the setup timing results for the codes using the zebra alternating line Gauss-Seidel relaxation, since it requires the decompositions of both the x- and y-line solves. The number of operations that it takes to form the coarse grid operators, 369 nx ny per grid level, dominates the number of operations that it takes to form the grid transfer operators. The number of operations it takes for the decompositions for the ILLU method is even greater, as seen in the fifth column of table 7.12. The collapsing methods for the grid transfer operators (σL/σL and σL/L-Hybrid) are similar, but the hybrid version requires fewer computations. It is also a little surprising that the extension of Schaffer's idea is about as fast as the collapsing methods, since it has to perform line solves to get the grid transfer coefficients.


For a 9-point fine grid stencil, the actual number of operations per grid level of size nx x ny is 174 nx ny - 135 nx - 135 ny + 96 for σL/σL, 166 nx ny - 131 nx - 131 ny + 96 for σL/L-Hybrid, and 168 nx ny - 112 nx - 112 ny + 56 for the extension of Schaffer's idea. This comparison shows that the collapsing methods require more computations (3.5%) per grid level but are not significantly slower (7%) because they vectorize, while the tri-diagonal solves for Schaffer's idea do not. This example shows again that operation counts don't tell the complete story of how well a method will perform.

Table 7.13. Timings for one multigrid V(1,1)-cycle for various grid sizes, excluding setup time.

  Grid Size   Cycle Time (seconds)
  n           R/BPGS     ZLGS       ZALGS      ILLU       SCBMG
    9         3.856E-4   3.959E-4   5.609E-4   1.014E-3   5.108E-4
   17         7.236E-4   7.189E-4   1.028E-3   2.120E-3   9.456E-4
   33         1.395E-3   1.372E-3   2.019E-3   5.023E-3   1.906E-3
   65         3.115E-3   3.173E-3   4.762E-3   1.356E-2   4.828E-3
  129         8.548E-3   9.329E-3   1.441E-2   4.247E-2   1.419E-2
  257         2.881E-2   3.181E-2   4.841E-2   1.465E-1   4.627E-2

The times for one complete V(1,1)-cycle, excluding the setup and overhead time, are given for various smoothers in table 7.13. As expected, the point and line Gauss-Seidel methods are the fastest, even though we again see the strange behavior of point relaxation. It is interesting to note that the cycle time for the standard coarsening code using alternating line relaxation is virtually identical to that of the semi-coarsening code. This fact is primarily due to the fact that the smoothers, which are essentially equivalent in computation time, dominate the cycle time. The ILLU version is again the slowest method.


7.5 Numerical Results for Test Problem 8

Problem 8 is a discontinuous diffusion four-corner junction problem. This problem has appeared many times in the literature; see [1], [26], [24]. It is defined by

  -\nabla \cdot D \nabla u = f  \quad \text{on } \Omega = (0, N_x) \times (0, N_y)   (7.3)
  D \, \partial u / \partial n + \tfrac{1}{2} u = 0  \quad \text{on } \partial\Omega   (7.4)

where

  D = 1,    f = 1   for (x, y) in [0, x*] x [0, y*]
  D = 1000, f = 0   for (x, y) in [x*, Nx] x [0, y*]
  D = 1000, f = 0   for (x, y) in [0, x*] x [y*, Ny]
  D = 1,    f = 1   for (x, y) in [x*, Nx] x [y*, Ny]

and x = x* and y = y* are the interface lines for the discontinuities; see figure 7.3.

We compare the five different choices of prolongation operators in the standard coarsening black box multigrid method using zebra alternating line Gauss-Seidel relaxation or incomplete x-line LU iteration for the smoother. The comparison is done for a variety of grid sizes ranging from 9 x 9 to 257 x 257. The data in the tables list the number of V(1,1) cycles needed for the l2 norm of the residual to be reduced by 6 orders of magnitude with an initial guess of zero. The next three entries are the first V-cycle, last V-cycle, and average convergence factors. If convergence was not reached in 50 V-cycles, then an asterisk appears, and the convergence factor is given based on only 50 V-cycles.

Results for the method using an extension of Schaffer's idea for the grid transfer coefficients using alternating zebra Gauss-Seidel relaxation are given in table 7.14. The method exhibits good convergence factors for all grid sizes, and in addition the convergence factors grow very slowly as the grid size increases.

Results for the method using the grid transfer operators based on the form σL/L, from section 3.5.2, using alternating zebra Gauss-Seidel relaxation are given in table 7.15.
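A hedged sketch of how the piecewise-constant data of problem 8 can be laid down on the mesh is shown below; the cell-centered indexing and all names are assumptions for illustration, not the thesis discretization.

  subroutine prob8_coeff(nx, ny, hx, hy, xstar, ystar, d, f)
    implicit none
    integer, intent(in)  :: nx, ny
    real(8), intent(in)  :: hx, hy, xstar, ystar
    real(8), intent(out) :: d(nx,ny), f(nx,ny)
    integer :: i, j
    real(8) :: x, y
    do j = 1, ny
       y = (dble(j) - 0.5d0)*hy
       do i = 1, nx
          x = (dble(i) - 0.5d0)*hx
          if ((x <= xstar) .eqv. (y <= ystar)) then
             d(i,j) = 1.0d0      ! lower-left and upper-right quadrants: D = 1, f = 1
             f(i,j) = 1.0d0
          else
             d(i,j) = 1.0d3      ! the other two quadrants: D = 1000, f = 0
             f(i,j) = 0.0d0
          end if
       end do
    end do
  end subroutine prob8_coeff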


Figure 7.3. Domain Ω for problem 8, a 24 x 24 square with interfaces at x = 12 and y = 12 separating the two coefficient regions; N and M stand for Neumann and Mixed boundary conditions respectively.

Table 7.14. Problem 8: Helmholtz equation, standard coarsening with grid transfer operators based on the extension of Schaffer's idea. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          2           6.93E-5   5.82E-4   2.01E-4
   17          4           7.89E-3   1.42E-2   1.23E-2
   33          4           3.03E-2   2.65E-2   2.74E-2
   65          4           4.01E-2   2.43E-2   2.76E-2
  129          4           4.34E-2   2.12E-2   2.49E-2
  257          4           5.36E-2   2.94E-2   2.98E-2


Table 7.15. Problem 8: Helmholtz equation, standard coarsening with grid transfer operators of the form σL/L. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          2           2.50E-4   1.78E-3   6.67E-4
   17          4           1.17E-2   2.03E-2   1.77E-2
   33          7           1.03E-1   1.29E-1   1.21E-2
   65         16           3.85E-1   4.03E-1   3.99E-1
  129         33           7.54E-1   6.52E-1   6.53E-1
  257          *           8.22E-1   8.15E-1   8.15E-1


Table 7.16. Problem 8: Helmholtz equation, standard coarsening with grid transfer operators from section 3.5.3. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          2           2.56E-4   1.82E-3   6.84E-4
   17          4           1.43E-2   2.30E-2   2.04E-2
   33          5           4.96E-2   4.31E-2   4.43E-2
   65          5           7.85E-2   5.69E-2   6.07E-2
  129          6           9.45E-2   6.55E-2   6.96E-2
  257          6           1.01E-1   6.96E-2   7.41E-2

This method does not seem to be very attractive for this type of problem. The convergence factors grow quickly and approach one as the grid size increases.

The results for the method using the grid transfer operators based on the form σL/L-Hybrid, from section 3.5.3, using alternating zebra Gauss-Seidel relaxation are given in table 7.16. This method seems to be attractive for this type of problem. The convergence factor for the first V-cycle grows as a function of problem size, but the convergence factor for subsequent V-cycles settles down to about 0.07.

The results for the method using the grid transfer operators based on the form σL/σL, from section 3.5.1, using alternating zebra Gauss-Seidel relaxation are given in table 7.17. This method is almost identical to the last method, σL/L-Hybrid, as it should be for diffusion problems. The two methods differ only slightly in the grid transfer operators when the switch in the denominator is used.

The method using the grid transfer operator of the L/L form, from section 3.4, is given in table 7.18. This method does not perform very well at all. The method does not employ the use of the denominator switch in the grid transfer operators.

Tables 7.19 through 7.23 are the same as the previous tables except that now


Table 7.17. Problem 8: Helmholtz equation, standard coarsening with grid transfer operators from section 3.5.1. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          2           2.56E-4   1.83E-3   6.84E-4
   17          4           1.43E-2   2.30E-2   2.04E-2
   33          5           4.96E-2   4.31E-2   4.43E-2
   65          5           7.85E-2   5.69E-2   6.07E-2
  129          6           9.45E-2   6.55E-2   6.96E-2
  257          6           1.01E-1   6.96E-2   7.41E-2

Table 7.18. Problem 8: Helmholtz equation, standard coarsening with grid transfer operators from section 3.4. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          2           2.57E-4   1.83E-3   6.85E-4
   17          4           1.43E-2   2.30E-2   2.04E-2
   33          4           1.46E-2   1.72E-2   1.44E-2
   65         14           4.90E-1   3.58E-1   3.66E-1
  129          *           1.11E+0   7.75E-1   7.81E-1
  257          *           1.36E+0   9.38E-1   9.45E-1


Table 7.19. Problem 8: Helmholtz equation, standard coarsening with grid transfer operators based on the extension of Schaffer's idea. Various grid sizes versus the number of V(1,1) cycles using incomplete x-line LU iteration by lines in x for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          1           3.40E-9   3.40E-9   3.40E-9
   17          2           2.04E-5   3.05E-4   7.90E-5
   33          3           2.12E-3   4.70E-3   3.57E-3
   65          4           1.30E-2   9.70E-3   1.04E-2
  129          4           2.36E-2   1.13E-2   1.34E-2
  257          4           3.11E-2   2.01E-2   1.68E-2

Table 7.20. Problem 8: Helmholtz equation, standard coarsening with grid transfer operators from section 3.5.2. Various grid sizes versus the number of V(1,1) cycles using incomplete x-line LU iteration by lines in x for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          1           5.52E-9   5.52E-9   5.52E-9
   17          2           1.62E-5   2.07E-4   5.80E-5
   33          3           3.45E-3   1.34E-2   8.22E-3
   65          7           7.44E-2   1.54E-1   1.38E-1
  129         17           2.75E-1   4.42E-1   4.29E-1
  257         36           3.65E-1   6.91E-1   6.78E-1


Table 7.21. Problem 8: Helmholtz equation, standard coarsening with grid transfer operators from section 3.5.3. Various grid sizes versus the number of V(1,1) cycles using incomplete x-line LU iteration by lines in x for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          1           6.35E-9   6.35E-9   6.35E-9
   17          2           1.90E-5   3.12E-4   7.70E-5
   33          3           1.35E-3   3.65E-3   2.60E-3
   65          3           7.34E-3   8.43E-3   8.00E-3
  129          4           1.36E-2   1.26E-2   1.28E-2
  257          4           1.69E-2   2.04E-2   1.69E-2

Table 7.22. Problem 8: Helmholtz equation, standard coarsening with grid transfer operators from section 3.5.1. Various grid sizes versus the number of V(1,1) cycles using incomplete x-line LU iteration by lines in x for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          1           6.33E-9   6.33E-9   6.33E-9
   17          2           1.90E-5   3.12E-4   7.70E-5
   33          3           1.35E-3   3.65E-3   2.60E-3
   65          3           7.34E-3   8.43E-3   8.00E-3
  129          4           1.36E-2   1.26E-2   1.28E-2
  257          4           1.69E-2   2.04E-2   1.69E-2


Table 7.23. Problem 8: Helmholtz equation, standard coarsening with grid transfer operators from section 3.4. Various grid sizes versus the number of V(1,1) cycles using incomplete x-line LU iteration by lines in x for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          1           6.33E-9   6.33E-9   6.33E-9
   17          2           1.90E-5   3.13E-4   7.71E-5
   33          3           1.21E-3   3.14E-3   2.25E-3
   65          5           4.55E-2   6.36E-2   5.94E-2
  129         18           4.24E-1   4.62E-1   4.59E-1
  257          *           8.02E-1   8.18E-1   8.18E-1


the smoother is an incomplete x-line LU iteration instead of the alternating zebra line Gauss-Seidel relaxation that was used before. The same observations hold as before, except that now the convergence factors are a bit smaller, due to the fact that ILLU makes a better smoother than alternating line Gauss-Seidel.


Figure 7.4. Domain Ω for problem 9, the unit square; D stands for Dirichlet boundary condition.

7.6 Numerical Results for Test Problem 9

Problem 9 is a convection-diffusion problem, which can be found in [24], [66], [77]. The problem is defined by

  -\varepsilon \Delta u + a(x,y) \frac{\partial u}{\partial x} + b(x,y) \frac{\partial u}{\partial y} = 0  \quad \text{on } \Omega = (0,1) \times (0,1)   (7.5)
  u(x,y) = \sin(\pi x) + \sin(\pi y) + \sin(13\pi x) + \sin(13\pi y)  \quad \text{on } \partial\Omega   (7.6)

where

  a(x,y) = (2x - 1)(1 - x^2),
  b(x,y) = 2xy(y - 1),

and \varepsilon = 10^{-5}; see figure 7.4.

Five choices of prolongation operators for the standard coarsening black box multigrid method using zebra alternating line Gauss-Seidel relaxation or incomplete x-line LU iteration for the smoother are presented. The comparison is done for a variety of grid sizes ranging from 9 x 9 to 257 x 257. The results for the convection-diffusion equation using zebra alternating line Gauss-Seidel relaxation for the smoother are given in tables 7.24, 7.25, 7.26, 7.27, and 7.28.
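The convection field above is easy to tabulate; a small helper evaluating a and b at a point (an illustrative sketch with assumed names, not the thesis code) is:

  subroutine prob9_velocity(x, y, a, b)
    implicit none
    real(8), intent(in)  :: x, y
    real(8), intent(out) :: a, b
    a = (2.0d0*x - 1.0d0)*(1.0d0 - x*x)   ! a vanishes on the lines x = 1/2 and x = 1
    b = 2.0d0*x*y*(y - 1.0d0)             ! b vanishes on x = 0, y = 0, and y = 1
  end subroutine prob9_velocity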


Table 7.24. Problem 9: Convection-diffusion equation, standard coarsening with grid transfer operators based on the extension of Schaffer's idea. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          2           8.78E-3   8.58E-6   2.75E-4
   17          4           1.61E-2   1.57E-5   3.05E-3
   33          5           2.78E-2   1.57E-2   3.04E-2
   65          5           3.98E-2   6.03E-2   6.06E-2
  129          6           5.67E-2   1.09E-1   9.08E-2
  257          7           6.88E-2   1.19E-1   1.13E-1

Table 7.25. Problem 9: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.5.2. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          3           2.11E-2   5.63E-3   7.89E-3
   17          4           3.18E-2   4.34E-3   2.48E-3
   33          6           6.51E-2   2.38E-2   5.38E-2
   65          9           1.49E+0   1.28E-1   1.88E-1
  129          div         4.15E+1   1.13E+1   7.01E+0
  257          div         2.91E+3   1.24E+3   1.24E+3


Table 7.26. Problem 9: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.5.3. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          2           5.74E-3   1.86E-5   3.27E-4
   17          4           1.54E-2   1.77E-5   3.24E-3
   33          5           2.65E-2   1.42E-2   2.99E-2
   65          6           4.88E-2   6.12E-2   6.12E-2
  129          6           7.40E-2   9.31E-2   9.23E-2
  257          8           1.29E+0   8.34E-2   1.35E-1

Table 7.27. Problem 9: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.5.1. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          3           2.24E-2   8.12E-3   8.96E-3
   17          4           2.63E-2   2.84E-3   2.18E-2
   33          6           4.90E-1   4.40E-2   6.68E-2
   65          div         1.86E+1   1.53E+1   1.53E+1
  129          div         1.67E+4   2.88E+3   2.88E+3
  257          div         1.51E+9   9.02E+8   9.02E+8


Table 7.28. Problem 9: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.4. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          4           3.89E-2   2.39E-5   3.16E-3
   17          5           5.35E-2   1.11E-2   5.59E-2
   33          7           7.29E-2   6.07E-2   1.24E-1
   65         10           1.65E-1   1.54E-1   2.10E-1
  129         12           1.69E-1   1.92E-1   2.95E-1
  257         18           1.90E-1   2.27E-1   4.31E-1

Table 7.29. Problem 9: Convection-diffusion equation, standard coarsening with grid transfer operators based on the extension of Schaffer's idea. Various grid sizes versus the number of V(1,1) cycles using incomplete line LU iteration by lines in x for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First      Last       Average
  n (n x n)   V(1,1)       CF         CF         CF
    9          1           8.41E-14   8.41E-14   8.41E-14
   17          1           6.91E-12   6.91E-12   6.91E-12
   33          1           6.29E-10   6.29E-10   6.29E-10
   65          1           4.63E-8    4.63E-8    4.63E-8
  129          2           1.95E-6    5.21E-5    1.01E-5
  257          2           4.88E-5    1.09E-3    2.31E-4


Table 7.30. Problem 9: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.5.2. Various grid sizes versus the number of V(1,1) cycles using incomplete line LU iteration by lines in x for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First      Last       Average
  n (n x n)   V(1,1)       CF         CF         CF
    9          1           1.19E-13   1.19E-13   1.19E-13
   17          1           9.36E-12   9.36E-12   9.36E-12
   33          1           8.28E-10   8.28E-10   8.28E-10
   65          1           6.60E-8    6.60E-8    6.60E-8
  129          2           3.33E-6    1.09E-4    1.91E-5
  257          2           2.33E-3    1.96E-4    6.74E-4

Table 7.31. Problem 9: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.5.3. Various grid sizes versus the number of V(1,1) cycles using incomplete line LU iteration by lines in x for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First      Last       Average
  n (n x n)   V(1,1)       CF         CF         CF
    9          1           8.45E-14   8.45E-14   8.45E-14
   17          1           6.62E-12   6.62E-12   6.62E-12
   33          1           5.74E-10   5.74E-10   5.74E-10
   65          1           4.11E-8    4.11E-8    4.11E-8
  129          2           1.71E-6    3.65E-5    7.90E-6
  257          2           4.21E-5    8.20E-4    1.86E-4


Table 7.32. Problem 9: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.5.1. Various grid sizes versus the number of V(1,1) cycles using incomplete line LU iteration by lines in x for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First      Last       Average
  n (n x n)   V(1,1)       CF         CF         CF
    9          1           1.45E-13   1.45E-13   1.45E-13
   17          1           8.56E-12   8.56E-12   8.56E-12
   33          1           5.11E-10   5.11E-10   5.11E-10
   65          1           2.54E-8    2.54E-8    2.54E-8
  129          2           1.06E-6    1.81E-5    4.38E-6
  257          *           6.84E+0    8.25E-1    8.60E-1

Table 7.33. Problem 9: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.4. Various grid sizes versus the number of V(1,1) cycles using incomplete line LU iteration by lines in x for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First      Last       Average
  n (n x n)   V(1,1)       CF         CF         CF
    9          1           2.08E-13   2.08E-13   2.08E-13
   17          1           1.23E-11   1.23E-11   1.23E-11
   33          1           5.90E-10   5.90E-10   5.90E-10
   65          1           3.22E-8    3.22E-8    3.22E-8
  129          2           1.75E-6    1.02E-4    1.34E-5
  257          2           6.76E-5    1.14E-3    2.78E-4


Table 7.34. Problem 9: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.5.1. Various grid sizes versus the number of V(1,1) cycles using 4-direction point Gauss-Seidel relaxation for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          2           3.09E-6   1.91E-6   2.43E-6
   17          2           2.34E-6   4.04E-6   3.08E-6
   33          2           1.74E-6   5.02E-6   2.95E-6
   65          2           5.05E-6   1.38E-5   8.35E-6
  129          2           6.15E-5   3.05E-4   1.37E-4
  257          4           5.35E+1   7.19E-5   1.00E-2

From the tables we see that the methods using the extension of Schaffer's idea perform the best, closely followed by the hybrid method from section 3.5.3. Notice that changing the smoother from the alternating line Gauss-Seidel to the incomplete line LU iteration greatly improves the convergence factors. When alternating line Gauss-Seidel is used, we see that both the grid transfer operator methods from sections 3.5.2 and 3.5.1, respectively, diverge for large grid sizes, but that they are both convergent when the incomplete line LU iteration is used for the smoother. We also see that using 4-direction point Gauss-Seidel relaxation for the smoother with the grid transfer operator method from section 3.5.1 (σL/σL) gives results that are comparable to those using the ILLU smoother. However, as can be seen from the initial convergence factor for the 257 x 257 grid, problems have crept in for large grid sizes. The convergence factor is greater than one only for the initial iteration, and for subsequent iterations the convergence rate drops off very quickly. However, for larger fine grids the convergence factor oscillates back and forth from around 50 to around 0.37, and the methods are divergent.


Figure 7.5. Domain Ω for problem 10, the unit square, with the closed streamlines of the recirculating flow sketched; D stands for Dirichlet boundary condition.

7.7 Numerical Results for Test Problem 10

Problem 10 is a convection-diffusion problem, which can be found in the literature; see [24], [66], [77]. The problem is defined by

  -\varepsilon \Delta u + a(x,y) \frac{\partial u}{\partial x} + b(x,y) \frac{\partial u}{\partial y} = 0  \quad \text{on } \Omega = (0,1) \times (0,1)   (7.7)
  u(x,y) = \sin(\pi x) + \sin(\pi y) + \sin(13\pi x) + \sin(13\pi y)  \quad \text{on } \partial\Omega   (7.8)

where

  a(x,y) = 4x(x - 1)(1 - 2y),
  b(x,y) = -4y(y - 1)(1 - 2x),

and \varepsilon = 10^{-5}; see figure 7.5. This problem is a re-entrant flow problem; such problems are among the most difficult convection-diffusion problems. None of our methods are adequate for solving these types of problems, and several are not even convergent except for small grid sizes. However, using ILLU for the smoother does help many of the methods become convergent, even if the convergence factor is rather poor.


There are several reasons why our methods do not work properly on these types of problems. One is that the smoothers are just not adequate. Another is that all of the grid transfer operators that we have considered are close to violating the order of interpolation rule [15], [41], [45], [85]. The rule states that

  m_r + m_p \geq m_\ell   (7.9)

where m_r and m_p are the orders of interpolation for the restriction and prolongation operators, respectively, and m_\ell is the order of the grid equation operator. In our case we have m_r = 1, m_p = 2, and m_\ell = 3, so that (7.9) holds only with equality. Numerically, the rule is violated for some of the grid equations due to the effects of computer arithmetic. Another way to look at the trouble is that the grid transfer operators fail to map all the high frequency errors into the range of the smoother. De Zeeuw's MGD9V [24] code was designed for these types of convection-diffusion problems, and his interpolation operator does map the error into the range of the ILLU smoother; see table 7.43. Although MGD9V does become divergent for large grids (> 160 x 160), it does perform much better than any of the other methods for the smaller grid sizes.


Table 7.35. Problem 10: Convection-diffusion equation, standard coarsening with grid transfer operators based on the extension of Schaffer's idea. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9         14           1.06E-1   3.96E-1   3.58E-1
   17         25           1.38E-1   6.06E-1   5.68E-1
   33          *           1.56E-1   7.96E-1   7.65E-1
   65          *           1.67E-1   9.11E-1   8.70E-1
  129          *           2.05E-1   9.63E-1   9.14E-1
  257          *           2.66E-1   9.83E-1   9.31E-1

Table 7.36. Problem 10: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.5.2. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9         15           1.20E-1   4.22E-1   3.89E-1
   17         36           2.19E-1   7.03E-1   6.79E-1
   33          *           2.36E-1   8.42E-1   8.36E-1
   65          div         4.35E-1   1.12E+0   1.05E+0
  129          div         1.30E+1   4.60E+1   4.60E+1
  257          div         1.49E+2   9.23E+2   9.23E+2


Table 7.37. Problem 10: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.5.3. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          9           5.79E-2   2.29E-1   1.95E-1
   17         22           1.30E-1   5.65E-1   5.25E-1
   33          *           1.71E-1   8.20E-1   7.88E-1
   65          *           1.89E-1   9.34E-1   8.92E-1
  129          *           2.57E-1   9.76E-1   9.27E-1
  257          div         4.00E-1   1.98E+0   1.98E+0

Table 7.38. Problem 10: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.5.1. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last       Average
  n (n x n)   V(1,1)       CF        CF         CF
    9          5           4.64E-2   6.76E-2    6.19E-2
   17          7           2.70E-1   8.72E-2    1.20E-1
   33          div         5.66E+0   9.93E+1    9.93E+1
   65          div         2.66E+2   2.15E+3    2.15E+3
  129          div         1.04E+5   1.77E+8    1.77E+8
  257          div         2.04E+6   1.08E+12   1.08E+12


Table 7.39. Problem 10: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.4. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          9           8.88E-2   2.07E-1   1.89E-1
   17         20           1.32E-1   5.31E-1   4.90E-1
   33          *           1.76E-1   8.67E-1   8.31E-1
   65          *           2.46E-1   9.73E-1   9.25E-1
  129          *           3.58E-1   9.92E-1   9.36E-1
  257          *           6.20E-1   9.90E-1   9.41E-1

Table 7.40. Problem 10: Convection-diffusion equation, standard coarsening with grid transfer operators based on the extension of Schaffer's idea. Various grid sizes versus the number of V(1,1) cycles using ILLU for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          9           5.58E-2   2.26E-1   1.93E-1
   17         15           9.22E-2   4.22E-1   3.81E-1
   33         26           1.10E-1   6.28E-1   5.86E-1
   65         49           1.13E-1   7.85E-1   7.53E-1
  129          *           1.09E-1   8.77E-1   8.38E-1
  257          *           8.80E-2   9.27E-1   8.75E-1

Table 7.41. Problem 10: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.5.3. Various grid sizes versus the number of V(1,1) cycles using ILLU for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          7           3.98E-2   1.56E-1   1.28E-1
   17         13           8.19E-2   3.80E-1   3.38E-1
   33         24           1.04E-1   6.07E-1   5.62E-1
   65         47           1.10E-1   7.76E-1   7.42E-1
  129          *           1.09E-1   8.81E-1   8.40E-1
  257          *           8.53E-1   9.46E-1   8.92E-1


Table 7.42. Problem 10: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.5.1. Various grid sizes versus the number of V(1,1) cycles using ILLU for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          4           9.42E-3   3.87E-2   2.70E-2
   17          5           1.31E-2   6.51E-2   4.69E-2
   33          5           1.51E-2   7.80E-2   5.56E-2
   65          9           3.74E-2   2.51E-1   2.01E-1
  129          div         6.88E+4   2.04E+5   1.90E+5
  257          div         1.05E+5   2.88E+5   2.70E+5

Table 7.43. Problem 10: Convection-diffusion equation, for De Zeeuw's MGD9V. Various grid sizes versus the number of V(0,1) cycles using ILLU for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(0,1)       CF        CF        CF
    9          5           7.53E-2   1.05E-1   8.21E-2
   17          5           1.54E-2   1.94E-1   9.31E-2
   33          6           9.45E-3   2.84E-1   1.52E-1
   65          8           1.36E-2   4.42E-1   2.83E-1
  129         11           1.75E-2   5.84E-1   4.27E-1
  257          div         1.27E+0   2.90E+0   2.49E+0


Figure 7.6. Domain Ω for problem 11, the unit square; D stands for Dirichlet boundary condition.

7.8 Numerical Results for Test Problem 11

Problem 11 is a convection-diffusion problem, which can be found in the literature; see [24], [66], [77]. The problem is defined by

  -\varepsilon \Delta u + a(x,y) \frac{\partial u}{\partial x} + b(x,y) \frac{\partial u}{\partial y} = 0  \quad \text{on } \Omega = (0,1) \times (0,1)   (7.10)
  u(x,y) = \sin(\pi x) + \sin(\pi y) + \sin(13\pi x) + \sin(13\pi y)  \quad \text{on } \partial\Omega   (7.11)

and

  a(x,y) = (2y - 1)(1 - X^2)  if X > 0,    a(x,y) = 2y - 1  if X <= 0,
  b(x,y) = 2Xy(y - 1)         if X > 0,    b(x,y) = 0       if X <= 0,

where X = 1.2x - 0.2 and \varepsilon = 10^{-5}; see figure 7.6.

The difference between problem 11 and problem 9 is that we now have a stagnation line rather than a stagnation point emanating from the boundary. The results of the numerical experiments for this problem are similar to those for problem 9.


Table 7.44. Problem 11: Convection-diffusion equation, standard coarsening with grid transfer operators based on the extension of Schaffer's idea. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          2           8.55E-3   6.31E-6   2.32E-4
   17          4           1.46E-2   9.62E-6   2.34E-3
   33          5           2.47E-2   1.29E-2   2.81E-2
   65          5           3.70E-2   5.46E-2   5.64E-2
  129          6           5.57E-2   9.67E-2   8.61E-2
  257          7           7.27E-2   1.06E-1   1.08E-1

The method using Schaffer's idea for the grid transfer operators is the best, followed closely by the hybrid method from section 3.5.3. We also see that the grid transfer operators generated using the method in section 3.4 are quite adequate.


Table 7.45. Problem 11: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.5.2. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          3           1.95E-2   4.11E-3   6.62E-3
   17          4           3.16E-2   2.38E-3   1.95E-2
   33          5           4.44E-2   5.39E-2   5.62E-2
   65         11           8.02E-1   2.20E-1   2.63E-1
  129          div         8.56E+0   7.14E+0   7.14E+0
  257          div         1.26E+3   3.41E+2   3.41E+2

Table 7.46. Problem 11: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.5.3. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          2           5.81E-3   9.65E-5   7.49E-4
   17          4           1.49E-2   1.20E-5   2.50E-3
   33          5           2.30E-2   1.12E-2   2.67E-2
   65          5           4.61E-2   5.38E-2   5.65E-2
  129          6           6.87E-2   8.64E-2   9.10E-2
  257         10           3.56E-1   7.31E-2   2.18E-1


Table 7.47. Problem 11: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.5.1. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          3           1.97E-2   6.09E-3   8.07E-3
   17          4           2.47E-2   1.26E-3   1.49E-2
   33          7           1.11E+0   6.38E-2   1.16E-1
   65          div         3.57E+0   7.94E+0   7.94E+0
  129          div         7.85E+5   9.99E+4   9.99E+4
  257          div         9.96E+9   7.00E+9   7.00E+9

Table 7.48. Problem 11: Convection-diffusion equation, standard coarsening with grid transfer operators based on section 3.4. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          4           3.73E-2   9.56E-5   3.78E-3
   17          5           5.23E-2   6.26E-3   4.22E-2
   33          7           6.07E-2   3.65E-2   1.05E-1
   65          9           1.41E-1   1.69E-1   1.99E-1
  129         11           1.53E-1   1.73E-1   2.75E-1
  257         17           1.79E-1   2.13E-1   4.06E-1


Figure 7.7. Domain Ω for problem 13, a 64 x 64 square subdivided by the lines x = 31, x = 33, y = 31, and y = 33; N and M stand for Neumann and Mixed boundary conditions respectively.

7.9 Numerical Results for Test Problem 13

Problem 13 is an anisotropic and discontinuous problem defined as

  -\nabla \cdot D \nabla u + c u = f  \quad \text{on } \Omega = (0, N_x) \times (0, N_y)   (7.12)
  D \, \partial u / \partial n + \tfrac{1}{2} u = 0  \quad \text{on } \partial\Omega \text{ at } x = N_x \text{ or } y = N_y   (7.13)
  \partial u / \partial n = 0  \quad \text{on } \partial\Omega \text{ at } x = 0 \text{ or } y = 0   (7.14)

where the coefficients D = diag(D1, D2), c, and f are piecewise constant on the nine subregions generated by the intervals [0, 31], (31, 33], and (33, 64] in each coordinate direction, with c = 1 throughout, D1 = 1000 and D2 = 1000 or D = 1 region by region, and f = 1 or f = 0 region by region,


and the domain is illustrated in figure 7.7. For this problem we have chosen to report only the three most valuable grid transfer operators, from sections 3.6, 3.5.3, and 3.5.1. All three give very good performance over the range of grid sizes tested.

Table 7.49. Problem 13: Diffusion equation, standard coarsening with grid transfer operators based on the extension of Schaffer's idea. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          4           1.84E-2   2.32E-2   2.13E-2
   17          4           1.67E-2   2.22E-2   1.95E-2
   33          5           3.10E-2   6.57E-2   5.79E-2
   65          6           1.29E-2   1.14E-1   7.52E-2
  129          5           1.51E-2   1.19E-1   6.18E-2
  257          6           2.47E-2   1.94E-1   8.72E-2


Table 7.50. Problem 13: Diffusion equation, standard coarsening with grid transfer operators based on section 3.5.3. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          4           2.17E-2   2.71E-2   2.56E-2
   17          4           1.09E-2   2.33E-2   1.78E-2
   33          4           7.35E-3   5.73E-2   2.38E-2
   65          7           1.44E-2   1.80E-1   1.14E-1
  129          8           2.63E-2   1.96E-1   1.47E-1
  257          8           4.55E-2   2.40E-1   1.71E-1

Table 7.51. Problem 13: Diffusion equation, standard coarsening with grid transfer operators based on section 3.5.1. Various grid sizes versus the number of V(1,1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9          4           2.17E-2   2.71E-2   2.56E-2
   17          4           1.09E-2   2.33E-2   1.78E-2
   33          4           7.35E-3   5.73E-2   2.38E-2
   65          7           1.44E-2   1.80E-1   1.14E-1
  129          8           2.63E-2   1.96E-1   1.47E-1
  257          8           4.55E-2   2.40E-1   1.71E-1


7.10 Numerical Results for Test Problem 17

Problem 17 is a discontinuous staircase problem, which can be found in the literature; see [26], [77], [24]. The problem is defined by

  -\nabla \cdot D \nabla u = f  \quad \text{on } \Omega = (0, 16) \times (0, 16)   (7.15)
  D \, \partial u / \partial n + \tfrac{1}{2} u = 0  \quad \text{on } \partial\Omega   (7.16)

where

  D = 1,    f = 0   for (x, y) outside the shaded area,
  D = 1000, f = 1   for (x, y) inside the shaded area;

see figure 7.8.

In many real world applications boundaries are often curved, making the discretization hard to perform accurately on rectangular meshes. After discretization the curved boundary will look something like a staircase. Problems with staircase interfaces in the domain are not handled well by classical multigrid methods. In particular, multigrid methods which employ five point stencils on coarser grids are doomed to failure for staircase problems, since for sufficiently coarse grids, the five point stencil cannot resolve the staircase. The black box multigrid methods, however, can handle staircase interfaces rather well because they use operator-induced grid transfer operators and the Galerkin coarse grid approximation to form the coarse grid operators; the nine point operators created in this way can resolve the staircase on coarser grids.

Table 7.52. Problem 17: standard coarsening with grid transfer operators based on the extension of Schaffer's idea. Various grid sizes versus the number of V(1,1) cycles using alternating zebra line Gauss-Seidel for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9         18           6.35E+0   1.72E-1   4.38E-1
   17         12           1.43E-1   3.09E-1   2.91E-1
   33         12           1.18E-1   3.34E-1   3.07E-1
   65         13           9.22E-2   3.54E-1   3.21E-1
  129         13           7.81E-2   3.58E-1   3.22E-1
  257         13           7.56E-2   3.59E-1   3.22E-1


Figure 7.8. Domain Ω for problem 17, with the shaded staircase region (axis marks at 1, 3, 5, ..., 15 in each direction); N and M stand for Neumann and Mixed boundary conditions respectively.


Table 7.53. Problem 17: standard coarsening with grid transfer operators based on the original collapsing method. Various grid sizes versus the number of V(1,1) cycles using alternating zebra line Gauss-Seidel for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9         13           4.54E-1   2.97E-1   3.17E-1
   17          8           6.34E-2   1.64E-1   1.44E-1
   33          8           5.35E-2   1.62E-1   1.48E-1
   65          8           4.92E-2   1.73E-1   1.54E-1
  129          8           5.10E-2   1.85E-1   1.65E-1
  257          8           5.64E-2   1.89E-1   1.70E-1


Table 7.54. Problem 17: standard coarsening with grid transfer operators based on the extension of Schaffer's idea. Various grid sizes versus the number of V(0,1) cycles using x-line ILLU for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(0,1)       CF        CF        CF
    9          9           1.96E+0   1.08E-1   1.81E-1
   17          8           1.85E-1   1.71E-1   1.73E-1
   33          6           2.93E-2   8.96E-1   8.06E-2
   65          5           2.03E-2   1.15E-1   6.13E-2
  129          5           2.27E-2   5.07E-2   5.38E-2
  257          7           2.89E-2   1.35E-1   1.19E-1

Table 7.55. Problem 17: standard coarsening with grid transfer operators based on the hybrid collapsing method. Various grid sizes versus the number of V(0,1) cycles using x-line ILLU for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(0,1)       CF        CF        CF
    9         11           4.49E-1   2.68E-1   2.82E-1
   17          6           2.91E-2   8.17E-2   6.85E-2
   33          6           2.33E-2   8.21E-2   6.69E-2
   65          6           1.90E-2   8.91E-2   6.96E-2
  129          6           1.90E-2   1.09E-1   8.17E-2
  257          6           1.98E-2   1.11E-1   8.44E-2

Table 7.56. Problem 17: semi-coarsening code. Various grid sizes versus the number of V(1,1) cycles using zebra x-line Gauss-Seidel for the smoother; first, last and average convergence factor.

  Grid Size   Iterations   First     Last      Average
  n (n x n)   V(1,1)       CF        CF        CF
    9         12           9.13E-1   2.66E-1   3.07E-1
   17          9           2.14E-1   1.76E-1   1.83E-1
   33          8           1.47E-1   1.61E-1   1.59E-1
   65          8           1.13E-1   1.54E-1   1.48E-1
  129          8           1.05E-1   1.51E-1   1.44E-1
  257          8           1.14E-1   1.50E-1   1.44E-1


7.11 Comparison of 2D Black Box Multigrid Methods

We have looked at several example problems and how the various methods perform for those examples, but now we would like to determine which methods really are the best. The most obvious criterion for judging the best method would be the execution time to solve a given problem to a given tolerance. The trouble with this metric, although it is quite practical, is that it does not take into account implementation and algorithm variations, nor does it say anything about whether one method is more efficient than another. We need a criterion that will measure both the execution time and the final accuracy of the solution. We propose to use a normalized execution time and the average convergence factor for our metric. We define the metric to be

  P = T x CF_ave   (7.17)

where P is the performance metric, T is the normalized execution time, and CF_ave is the average convergence factor. By normalized execution time we mean that we have taken the time for five V-cycles plus the setup time and divided by five, giving the average time for a V-cycle plus one-fifth of the setup time. This allows us to take into account the variation in setup time for the different methods. The execution cycle time and average convergence factor are measured for how long it takes a method to reduce the initial l2 norm of the residual by a given amount; the results shown here use a reduction of six orders of magnitude. As with the convergence factor, the smaller the performance metric P, the better the method.

The first comparison is for test problem 8, the four-corner junction problem (section 7.5); see table 7.57. The methods are listed in the far left hand column of the table, where the character strings represent the various methods. Each method string can be decoded into four fields: the 1st character, the 2nd character, the 3rd character, and the 4th through 6th characters.
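A minimal sketch of the metric (7.17), with all measured inputs assumed to be available from the timing runs, is:

  function perf_metric(t_setup, t_five_cycles, cf_ave) result(p)
    implicit none
    real(8), intent(in) :: t_setup        ! total setup time (seconds)
    real(8), intent(in) :: t_five_cycles  ! time for five V-cycles (seconds)
    real(8), intent(in) :: cf_ave         ! average convergence factor
    real(8) :: p, t
    t = (t_five_cycles + t_setup)/5.0d0   ! normalized execution time
    p = t*cf_ave                          ! smaller is better
  end function perf_metric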


Table 7.57. Comparison of various Black Box Multigrid methods on the Cray Y-MP for the 2D diffusion equation given in problem 8.

  method    9x9         17x17       33x33       65x65       129x129     257x257
  stav11    1.488E-7    1.720E-5    7.627E-5    1.796E-4    4.898E-4    1.962E-3
  htav11    4.871E-7    2.762E-5    1.203E-4    3.917E-4    1.338E-3    4.812E-3
  otav11    5.004E-7    2.884E-5    1.256E-4    4.088E-4    1.404E-3    5.056E-3
  stiv01    2.475E-8    6.171E-6    8.076E-5    3.201E-4    1.011E-3    9.497E-3
  htiv01    3.839E-8    6.503E-6    5.739E-5    2.723E-4    1.110E-3    4.282E-3
  otiv01    3.949E-8    5.053E-6    5.907E-5    2.805E-4    1.131E-3    4.405E-3
  stiv11    4.376E-12   2.143E-7    2.266E-5    1.758E-4    5.820E-4    3.050E-3
  htiv11    7.749E-12   2.039E-7    1.626E-5    1.364E-4    6.700E-4    3.035E-3
  otiv11    7.844E-12   2.081E-7    1.655E-5    1.393E-4    6.798E-4    3.106E-3
  sclv11    2.883E-7    8.950E-6    6.051E-5    5.337E-4    7.668E-3    1.315E-2
  sciv01    1.183E-7    3.660E-5    8.650E-4    6.755E-3    3.715E-2    1.859E-1
  sciv11    2.648E-11   1.504E-6    4.366E-4    8.317E-3    5.438E-2    2.679E-1
  htpv11    2.083E-5    5.550E-5    2.144E-4    5.802E-4    1.714E-3    5.786E-3
  otpv11    2.155E-5    5.871E-5    2.273E-4    6.187E-4    1.838E-3    6.193E-3
  Ziv01     5.295E-6    1.884E-5    7.367E-5    2.125E-4    6.020E-4    1.992E-1


The first field represents the grid transfer operator used: s = Schaffer, h = hybrid, o = original. The second field represents the coarsening: t = standard, c = semi-coarsening. The third field represents the type of smoother employed: a = alternating line, i = ILLU, l = line, p = point. The fourth field represents the type of multigrid cycling: v11 = V(1,1)-cycling, v01 = V(0,1)-cycling.

Table 7.57 shows that the standard coarsening method with alternating line relaxation using Schaffer's idea for the grid transfer coefficients is the best method for larger grids, while the same method with ILLU for the smoother is the best for smaller grids. Most of the methods perform about the same, but semi-coarsening with an ILLU smoother is the worst. It is a little surprising to see that De Zeeuw's MGD9V is beaten by the standard coarsening alternating line method when it was seen before that MGD9V had a faster execution time.

Recall from the numerical examples that the ILLU smoother was essential for obtaining good convergence for convection-diffusion equations, except for the semi-coarsening method. The examination of a convection-diffusion equation with our new performance metric should prove enlightening. We choose to look at the convection-diffusion problem 9 [24]. Table 7.58 shows the performance metric for the convection-diffusion equation given in problem 9 for the various methods. The two clear winners for this problem are the standard coarsening methods using ILLU for the smoother with the grid transfer operators computed by either Schaffer's idea or the hybrid collapsing method. While stiv11 and htiv11 are the best, De Zeeuw's method, MGD9V, shows nice consistency in performance for the range of grid sizes, and it also has the advantage of being more robust for more complex convection-diffusion equations, especially those with re-entrant flows. While our methods are still convergent for most of the more complex convection-diffusion equations, they are not useful as solvers because the convergence factor is often above 0.9.
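The decoding is mechanical; a hedged sketch (the routine and its output strings are illustrative, not from the thesis codes) is:

  subroutine decode_method(name, transfer, coarsen, smoother, cycling)
    implicit none
    character(len=*), intent(in)   :: name    ! e.g. 'stav11'
    character(len=16), intent(out) :: transfer, coarsen, smoother, cycling
    select case (name(1:1))                   ! grid transfer operator
    case ('s'); transfer = 'Schaffer'
    case ('h'); transfer = 'hybrid'
    case ('o'); transfer = 'original'
    case default; transfer = 'unknown'
    end select
    select case (name(2:2))                   ! coarsening
    case ('t'); coarsen = 'standard'
    case ('c'); coarsen = 'semi-coarsening'
    case default; coarsen = 'unknown'
    end select
    select case (name(3:3))                   ! smoother
    case ('a'); smoother = 'alternating line'
    case ('i'); smoother = 'ILLU'
    case ('l'); smoother = 'line'
    case ('p'); smoother = 'point'
    case default; smoother = 'unknown'
    end select
    select case (name(4:6))                   ! multigrid cycling
    case ('v11'); cycling = 'V(1,1)'
    case ('v01'); cycling = 'V(0,1)'
    case default; cycling = 'unknown'
    end select
  end subroutine decode_method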


Table 7.58. Comparison of various Black Box Multigrid methods on the Cray Y-MP for the 2D convection-diffusion equation given in problem 9.

  method    9x9         17x17       33x33       65x65       129x129     257x257
  stav11    1.835E-6    6.188E-5    1.263E-3    5.060E-3    2.181E-2    9.300E-2
  htav11    2.149E-6    6.473E-5    1.248E-3    5.064E-3    2.208E-2    1.285E-1
  otav11    8.287E-5    4.422E-4    3.281E-3    *
  stiv01    2.286E-12   1.069E-10   4.757E-9    5.709E-6    1.241E-3    5.667E-2
  htiv01    2.144E-12   8.683E-11   3.526E-9    2.996E-6    5.472E-5    2.389E-3
  otiv01    4.249E-12   9.064E-11   2.146E-9    1.506E-6    6.271E-5
  stiv11    2.124E-16   4.079E-14   8.658E-12   1.682E-9    1.537E-6    1.208E-4
  htiv11    2.022E-16   3.651E-14   7.519E-12   1.432E-9    1.163E-6    9.424E-5
  otiv11    3.652E-16   5.015E-14   7.086E-12   9.420E-10   6.740E-7    ***
  sclv11    1.097E-6    4.410E-5    6.635E-4    4.112E-3    2.738E-2    1.708E-1
  sciv01    1.158E-11   5.618E-10   7.717E-7    2.182E-5    1.409E-3    5.587E-2
  sciv11    6.317E-16   2.011E-13   4.620E-11   1.185E-8    2.210E-5    4.303E-3
  Ziv01     3.353E-8    1.20E-7     4.883E-7    2.630E-6    1.477E-5    1.293E-4


We are not going to present the performance metric for the more complex convection problems, because it is difficult to get convergence for large grids even using MGD9V.


CHAPTER 8

PARALLEL ALGORITHM CONSIDERATIONS IN TWO DIMENSIONS

The parallel algorithm has undergone a lot of changes as the hardware and software support have changed. Originally, there were several codes developed for the CM-2 and CM-200, but when the CM-5 came along, those codes were abandoned. We will present only the CM-5 code, after a brief outline of the previous work.

8.1 CM-2 and CM-200 Parallel Algorithms

We used the CM-Fortran SPMD model on the CM-2 and CM-200, which contained 65K integer processors and 4K floating point processors (Weitek). The CM-200 computer had two modes of operation, called the paris (parallel instruction) model and the slicewise model. The slicewise model was the preferred one because it treated the machine as if it were constructed only of the 4K floating point processors. The CM-2 and CM-200 have a front end (host) computer, a Sparc workstation, that is connected to the hypercube data network, which connects all the processors. The integer processors are bit serial, and the floating point processors are 32-bit based. On the CM-200 the data is stored bitwise and passed through a transpose device, which packs it into 16- and 32-bit words that can be passed to the floating point processors.

The only real differences between the CM-2 and CM-200 are that the transposer and slicewise models were made standard on the CM-200 and that the overall hardware for both the processors and the networks is faster on the CM-200.


Codes were developed for the standard coarsening algorithm using all three of the data structures shown in figure 8.5. The CMSSL (CM Scientific Software Library) was very limited at that time, and we had to write our own tridiagonal line and direct solvers. The direct solver was so slow that we found it easier and faster to use the front end to perform and store the LU decomposition and then just pass the right hand side to the front end and the solution back to the CM-2. On the CM-200, which was faster than the CM-2, the direct solver ran about as fast as or a little faster than using the front end and passing the two vector arrays.

All of the approaches taken to develop an efficient black box multigrid code ran into trouble with communications, and their performance was dismal when compared to the Cray Y-MP. The communications bottleneck appeared in both the transpose device and the processor-to-processor communication. There was no way around the transpose device, so efforts had to be concentrated on communications between processors. Much of this effort was hampered by the lack of control over the data's automatic layout across the processors. Dendy, Ida, and Rutledge [32] partially solved the layout problem by writing code to access the CM Run Time System (CMRTS). The code they wrote is essentially at the assembly level. While this approach met with success, it also was not portable, and when either the hardware or the system software was updated (modified), the code could not be guaranteed to work properly. The layout problem was not really resolved until the codes were implemented on the CM-5 computer.

There were several attempts to reduce the communication overhead by using various communication packages. The two we tried were the CM FastGraph package and the CMSSL poly-shift routines.

The CM FastGraph (CMFG) package has two major components, the communication compiler and the mapping facility.


The CMFG can be used to speed up general communications between multiple processors. The speedup is achieved by determining and storing the routing paths once at the beginning of the program's execution. When the communications are performed, there is thus no need to determine the routing paths dynamically, reducing the time it takes to complete the communications. The CMFG components were used for the grid transfer operations between fine and coarse grid levels. First the communication map is defined for passing data between the different grid level data structures. The CMFG compiler then creates the maps to be used by the CMFG communication routines. While the communications were much quicker, the overhead involved in defining the communications was prohibitive, often adding minutes to the execution time, depending on the size and number of data communications to be performed. This approach, while interesting, just was not practical for our software library approach. However, the CMFG approach has proven to be useful for applications that run the same size problem many times (e.g. time-dependent problems), by storing all the routing paths for all the grid levels.

Throughout the multigrid algorithm there are many instances where communication is needed with neighboring grid points; in such a case, multiple calls are made to communication routines. These communications are either circular (CSHIFT) or end-off (EOSHIFT). The CSHIFT routine shifts the data in a circular fashion, with wrap around occurring at the array boundaries. The EOSHIFT routine shifts data with data dropping off the end of an array and a predefined value being shifted into the array. The poly-shift communication routines allow the user to define and use communication stencils to combine multiple calls to other communication routines, CSHIFT or EOSHIFT, into only one communication call. The poly-shift stencil communication allows the various communications that make up the stencil to be overlapped when possible, and hence reduces the amount of time spent on the communications. The poly-shift communications consist of three routines: the allocation and stencil setup routine pshift_setup, the communication routine pshift, and the deallocation routine deallocate_pshift_setup.
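CSHIFT and EOSHIFT have the same semantics as the standard Fortran 90 array intrinsics of the same names, so the difference between the two shift styles is easy to illustrate with a small self-contained fragment (illustrative only, not the CM code):

  program shift_demo
    implicit none
    real :: u(8,8), north(8,8), east(8,8)
    call random_number(u)
    ! circular shift: the neighbor one position over in dimension 2,
    ! wrapping around at the array boundary
    north = cshift(u, shift=1, dim=2)
    ! end-off shift: the neighbor one position over in dimension 1,
    ! with the predefined value 0.0 shifted in at the edge
    east = eoshift(u, shift=1, boundary=0.0, dim=1)
    print *, sum(north), sum(east)
  end program shift_demo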


Table 8.1. Timing comparison per V-cycle for the semi-coarsening code on the Cray Y-MP, CM-2, and CM-200. Times are given in seconds, and the CM-2 and CM-200 times are the elapsed time for 256 and 512 floating point processors. "*" means the problem was too big to run, and "-" means there was no data.

  Size n     Cray Y-MP   Cray Y-MP   CM-2    CM-2    CM-200
  n x n      (old)       (new)       256     512     512
     8       0.004       0.0005      -       -       -
    16       0.006       0.0009      0.51    -       -
    32       0.009       0.0019      0.59    0.75    -
    64       0.041       0.0044      0.65    0.77    0.14
   128       0.093       0.013       0.99    0.80    0.23
   256       0.267       0.043       1.84    1.79    0.40
   512       0.948       0.158       4.55    3.04    1.00
  1024       3.69        0.656       -       8.11    3.01
  2048       -           2.714       -       25.39   -

The use of the poly-shift communications in the two dimensional codes gave a very slight performance gain for the standard coarsening code and was actually slightly detrimental for the semi-coarsening code. Only the three dimensional standard coarsening code showed any real benefits from using the poly-shift communication routines. The reason for the lack of performance gains was overhead. There was a set of poly-shift routines that were not as general as the CMSSL version but performed nearly three times faster because of much less overhead. Use of the specialized poly-shift routines would have caused the multigrid codes to run considerably faster, but they still would not have been competitive with the Cray Y-MP versions.

8.1.1 Timing Comparisons

For historical reasons it is interesting to compare the timing results for both the standard and semi-coarsening black box multigrid codes on both the Cray Y-MP and CM-2.


Table 8.1 gives a timing comparison for the semi-coarsening code on both the Cray Y-MP and CM-200. The Cray Y-MP columns give times for the "old" original vector code by Dendy and the "new" optimized vector code that we developed. The CM-200 semi-coarsening code timings are the same as those in [32], page 1466. The CM times are the elapsed time, which is the sum of the busy and idle times. Busy time is defined to be the parallel execution time, and idle time includes the sequential and communication time. Note that the elapsed time is not the same as wallclock time, which includes time-sharing and system overhead time. The CM-200 code is at least an order of magnitude slower than the new Cray Y-MP code. However, if we examine the trend, we can see that the CM-200 is slowly gaining on the Cray Y-MP code; unfortunately, we will run out of processors and memory long before it can catch up.

Table 8.2 gives comparison timings for the standard coarsening code using alternating zebra line Gauss-Seidel smoothing. The Cray Y-MP timings again refer to the "old" original standard coarsening code by Dendy [27], which was ported to the Cray Y-MP, and the "new" code that we developed. The same observations can be made about the standard coarsening codes as were made about the semi-coarsening codes. However, the standard coarsening CM-200 code is more than twice as slow as the semi-coarsening code. The reason for this fact can be understood by noting that the standard coarsening version requires more communication and uses the inefficient data layouts that the system provides. The standard coarsening parallel code did not make use of poly-shift communications.

8.2 CM-5 Hardware Overview

The CM-5 computer, as we have already said, is an SPMD computer with up to 1024 processors. We have considered only the data parallel model of execution in our studies.


Table 8.2. Timing comparison per V-cycle for the standard coarsening code on the Cray Y-MP, CM-2, and CM-200 using AZLGS smoothing. Times are given in seconds, and the CM-2 and CM-200 times are the elapsed times for 256 and 512 floating-point processors. "*" means the problem was too big to run, and "-" means there was no data.

    Size n   Cray Y-MP   Cray Y-MP   CM-2    CM-2    CM-200
    n x n    (old)       (new)       256     512     512
    8        0.0031      0.0005      -       -       -
    16       0.0068      0.001       1.49    -       -
    32       0.0123      0.002       1.58    2.02    0.697
    64       0.0275      0.0046      1.81    2.11    0.914
    128      0.072       0.0128      2.71    2.21    1.52
    256      0.266       0.0443      5.04    4.79    2.45
    512      0.957       0.165       12.34   7.89    4.21
    1024     3.84        0.673       *       20.41   12.55
    2048     15.42       2.792       *       *       62.89


A program, under this model, is copied into each processor's memory, and then every processor executes the same instruction at the same time on its own data. When data are needed from other processors, they are passed through a data communication network that connects all of the processors to each other.

To design parallel programs it is essential to understand the underlying hardware, the high level parallel programming language, and the behavior of the parallel run time system. While much effort is being made to hide these issues and make them transparent to the average user of the computing community, there is much room for improvement. No parallel computer and high level language, to date, has had any real success in making it possible to ignore these three issues. This situation has mostly been due to a lack of software support, both in the languages and in the software libraries. Some have said that object oriented design and algorithms are the answer, but first the underlying framework and code have to be developed, by no means a trivial task, and the computing community is not even close to the general prototype stage at this time.

The description of the CM-5 given so far is not sufficient to understand and appreciate the issues and complexities of designing an efficient parallel algorithm. Hence, we provide a more detailed description. The CM-5 is an SPMD computer with 1024 processing nodes, 4096 vector units, a partition manager, and I/O processor(s) that are all connected by a communication network; see figure 8.1. The partition manager is the user's gateway to the CM-5. It is essentially a SPARC workstation that manages the various user partitions (time-sharing) of processing nodes, networking, and I/O communications with external devices, such as remote terminals, disk and tape drives, and printers. It also executes and stores all of a program's scalar data and scalar instructions along with the CM Run-Time-System (CMRTS). There are two communication networks, one for control and the other for data.


Figure 8.1: CM-5 system diagram for n processing nodes (PN).


The control network is used by operations that require all the processing nodes to work together, such as partitioning of the processing nodes, global synchronization, and general broadcasting of messages. The control network is implemented as a complete binary tree with all the system components at the leaves. The data communication network is set up in the form of a fat-tree connection network that allows data to flow in both directions simultaneously. A fat-tree is basically a tree in which the connections become denser as one progresses toward the root, allowing for wider communication channels to cope with the increased message traffic. The fat-tree is a 4-ary tree with either 2 or 4 parent connections. It is used for point to point communication between individual processing nodes. The fat-tree network allows a variety of data structures to be mapped onto it easily, such as vectors, matrices, meshes (1, 2, and 3 dimensional grids), hypercubes, and of course hierarchical tree-like structures, while maintaining a high bandwidth.

The organization of the individual processors of the CM-5 is as follows: each consists of four memory banks of 8 to 32 MBytes each, four vector processing units, a RISC (SPARC) processor, and a network interface controller, all connected together by an internal bus; see figure 8.2. Each memory bank is connected to a vector unit, which in turn connects to the internal bus. The RISC processor manages the issuing of vector instructions to the vector units, address computations, loop control, and other housekeeping activities. The network interface controller transmits and receives messages from the fat-tree network, communicating with other processors and I/O units.

The vector units act both as a memory manager for the node's RISC processor and as a vector arithmetic accelerator. The RISC processor can read or write to any of the four vector unit memory banks. The vector instructions are given by the RISC processor as memory address instructions with special bits set to indicate the type of vector operation to be performed.


Figure 8.2: CM-5 processor node diagram with four vector units.


It is important to note that the vector units are not independent processors. They do not load or execute instructions, but merely perform memory management and vector arithmetic functions. However, because they do perform all the arithmetic functions from the user's program, it is convenient to think of the vector units as acting like processors. The vector units consist of an internal bus interface, a vector instruction decoder, a memory controller, a bank of 64 64-bit registers, and a pipelined arithmetic logic unit (ALU); see figure 8.3. The bank of registers can also be addressed as 128 32-bit registers. In addition there are a couple of control registers: a 16-bit vector mask register and a 4-bit vector length register. The vector mask register controls certain conditional operations and receives single bit status results. The vector length register indicates the vector length that is being used, which ranges from 1 to 16; a vector length of one is used for scalar operations. It should be noted that the original vector length on the CM-5 was 4; in 1993 it was increased to 8, and finally it was increased to 16 in late 1995. The increase in the vector length has improved the performance of the black box multigrid codes over the years, and in addition, it has caused modifications in the implementation of several of the algorithmic components.

A special note concerning communication as it relates to the vector units: they are implemented two to a chip, with units 0 and 1 on one chip and units 2 and 3 on the other. This arrangement means that communication between two vector units on the same chip is faster than communication between vector units of the same processing node but different chips. Communication between vector units of differing processing nodes will involve the data communication network and be much slower.


Figure 8.3: Diagram of one CM-5 vector unit.


There are 1024 processors on the CM-5, but since each processor contains four vector units (processors), it is better to think of the CM-5 as having 4096 vector processors. This viewpoint is justified because the vector units perform virtually all the computations. When we refer to a processor in our discussion, we will in general be referring to the vector processing units.

8.3 CM-5 Memory Management

Memory management is usually less important than either parallel computation or communication in determining the speed of program execution, but it is still one of the major considerations in obtaining good performance in CM-Fortran. It is possible to run out of parallel memory even when it would appear that the data should fit. To understand why this can happen we need to examine the memory more closely. Scalar memory is any memory region not dedicated to storing parallel data, including the scalar memory of the partition manager and any portion of the processing node's memory not being used by the parallel stack or heap. Parallel memory is any region of memory located on all four vector unit memory banks on all the processing nodes. Parallel memory is the same size and starts at the same address on all the memory banks. There are two types of parallel memory: stack and heap. Stack memory is temporary memory. Heap memory is relatively permanent; it is allocated and deallocated arbitrarily and never gets compacted. Because heap memory does not get compacted, it can become fragmented and leave areas of memory unusable. The partition manager runs a complete UNIX operating system, but each processing node runs only a subset, which is called the PN kernel. The PN kernel occupies about 0.5 MBytes of memory in each processing node's memory. It is enlightening to see how a processing node's memory is partitioned to hold the PN kernel, parallel stack memory, parallel heap memory, scalar memory, and user code; see figure 8.4.


The CM operating system is responsible for assigning the memory pages. The parallel memory pages must be aligned across all four memory banks on a processing node, but the scalar memory pages can be assigned arbitrarily. The node's memory is organized into high and low memory, with the PN kernel stored in vector unit 0's low memory. The PN kernel takes up only half a megabyte of memory, but it effectively takes up 2 MBytes because its memory shadow on the other three memory banks is unusable for parallel data. To make things worse, the 1.5 MBytes of memory in the PN kernel shadow is not always assigned for scalar memory, so that other memory locations used for the scalar memory also cause a memory shadow that blocks parallel memory assignment. The user code (variables and instructions) is stored in high memory in scalar memory pages. This arrangement leaves the parallel stack and heap stored in between the PN kernel and the user code, with the stack always being stored toward low memory relative to the parallel heap. Both the parallel stack and heap grow from their starting locations toward high memory.

Parallel arrays come in many forms and are stored either on the stack or the heap. There are four types of user defined arrays and several types of compiler generated temporary arrays. The user defined array types will be defined now; in the discussion, "local" means declared in a routine and not defined outside of that routine. Ordinary local arrays are those declared in a routine without the SAVE or DATA attribute. These arrays are allocated on the stack on entry into the routine and deallocated upon exiting the routine. Permanent local arrays are declared with the SAVE or DATA attribute. They are allocated on the heap when entering the routine for the first time and are never deallocated. Dynamically allocated arrays are explicitly allocated and deallocated by function calls and are stored on the heap. Common block arrays are allocated on the heap when the array is first used and are never deallocated.


Figure 8.4. CM-5 processor node memory map for the vector unit configuration. Area 1 is the scalar stack, area 2 is the scalar heap, and area 3 is the user's code (scalar variables and instructions). White space is unclaimed memory and is neither scalar nor parallel.


The compiler generates three kinds of internal temporary arrays, which are all stored on the heap. The first kind are communication temporaries. They are arrays that temporarily hold results from communication operations, which are the result of either explicit or implicit communication taking place in an expression. The second kind are the computation temporaries. They are the result of either computations being performed inside a communication function (e.g. CSHIFT) or a selection type statement being executed (e.g. FORALL, WHERE). Communication and computation temporaries are allocated at the time the expression is evaluated and are deallocated when the calculation of the expression has completed. The third kind are common subexpression temporaries. They are arrays that hold values of common subexpressions between the first time they are used and the last time that they are needed. A common subexpression temporary is generated to store every WHERE statement's mask, and sometimes the mask for FORALL statements.

We will need to define what we mean by basic code block and PE code block in order to simplify our discussion. A basic code block is a segment of statements bounded by control flow statements. A PE code block is a region of pure parallel computation involving no control flow statements. The compiler can collapse and reuse communication and computation temporaries in basic code blocks if the temporaries have the same shape. Common subexpression temporaries can be stored in registers, with increased speed and efficiency, if they occur within the same loop, PE code block, and basic code block.

When the amount of memory available on the processing nodes becomes an issue, there are several rules of thumb that should be followed.


1. Use more complex array expressions and fewer arrays.

2. Rewrite code fragments to reduce the number of temporary arrays generated by the compiler.

3. Try to reuse arrays or parts of arrays.

4. Split up program units so that fewer arrays and temporaries are allocated at one time.

5. Use the aliasing functions to use the same memory for arrays of different layouts.

6. Consider what the effects will be of garbage element array padding and vector-length padding, which can invisibly increase an array's storage.

7. Use the scratch space on the parallel I/O devices, e.g. the Data Vault.

Temporary array compiler allocation can be reduced in several ways. Write expressions so that common subexpressions are easily recognized; if one cannot easily see them, then the compiler may not be able to see them either. Avoid writing complicated expressions that involve many array functions or changes of array layouts, which will cause the generation of communication and computation temporaries. Be aware that most communication functions assume that their source and destination arrays are distinct. If the source and destination are the same array, then a communication temporary will be generated.
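As a small sketch of the last point (the array names here are our own illustration, not code from the solvers), giving the shift a distinct destination avoids the hidden temporary:

    real :: v(1024), w(1024)
    ! ...
    v = cshift(v, 1)   ! source = destination: the compiler must insert a
                       ! communication temporary behind the scenes
    w = cshift(v, 1)   ! distinct destination: no hidden temporary; w can be
                       ! a work array that is reused on every sweep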


8.4 Dynamic Memory Management Utilities

The Dynamic Memory Management Utilities (DMMU) were developed by W. Spangenberg at Los Alamos National Laboratory [8]. The DMMU were designed to address the problem of controlling the data structure layout on the CM-5 processors. The CM-FORTRAN compiler and Run-Time-System (RTS) aggressively monitor the array layouts to assure that arrays are distributed uniformly across the processors. While the monitoring and re-distribution of array layouts is good for many applications, it can be disastrous for multigrid performance. As we have indicated before, when the VP ratio is less than or equal to one, the most efficient communication was through the use of compatible arrays and masks. When the VP ratio is greater than one, we used incompatible arrays, because compatible arrays led to inefficient computations and storage. When the aspect ratio of the parallel grid axes becomes large or processors become unused, the CM RTS re-distributes the arrays to minimize the number of idle processors. The re-distribution occurs when the grid dimensions are not equal and also when coarsening takes place, especially for the semi-coarsening multigrid method. The re-distribution problems can persist even when using the PROCS and BLOCS attributes for the array layout.

The fine grid coefficient matrix (array), L, defines the physical layout of the data on the processors. The layout is specified by the LAYOUT command, which is part of CM FORTRAN and High Performance FORTRAN (HPF). The command specifies the serial and parallel dimensions of the array (SERIAL and NEWS attributes, respectively); in addition, it can also specify the physical extents of the array onto the processors (PROCS attribute) and the subgrid size on the processors (BLOCS attribute). The PROCS attribute defines the physical extents of the array across the processors, and the BLOCS attribute defines the array axis subgrid length, which is given as a ratio of the axis extent to the physical processor extent. Every array has a geometry descriptor that contains these attributes. The DMMU use the fine grid coefficient array's geometry descriptor as a template to control the layout of the other arrays. For each grid dimension, all arrays have the same physical extents over the processors by using a common PROCS directive, from the geometry descriptor template, for each parallel axis.


All arrays on each grid level, in each grid dimension, have the same subgrid extents, which are specified by a common BLOCS directive, from the geometry descriptor template, for each parallel axis. The DMMU use the geometry descriptor template to dynamically allocate arrays with the same layout on each grid level. The compatible and incompatible arrays for different grid levels can also be aligned using the DMMU to obtain more efficient communications between grid levels. The intergrid communications require the use of temporary arrays for efficiency, and these are also allocated dynamically by the DMMU. The use of the DMMU ensures that the FORTRAN 90, CM FORTRAN, and HPF array and array section operations perform correctly while also allowing for the most efficient inter-processor communication.

The black box multigrid codes use the DMMU to dynamically allocate all of the internal arrays. The process works in the following way. First we determine the number of grid levels and their sizes. Next we determine the number of processors and the grid level at which the VP ratio will become less than or equal to one. We then obtain the geometry descriptor template for the fine grid coefficient array. We determine the physical extents across the processors of the congruent geometry template's parallel axes, assign the appropriate axis' physical extent, and check that we have consistent physical extents across the processors. We then determine the appropriate PROCS and BLOCS directives for each grid level. We then create a congruent array alias for each array to conveniently reference the different grid levels via an index. Finally, we dynamically allocate all the coarse grid arrays, using the geometry descriptor template to enforce the desired data layout.
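The bookkeeping behind this process can be made concrete with a small, self-contained sketch; the grid sizes and processor extents below are made-up example values (standard coarsening assumed), and the output simply lists the per-level grid extents and the subgrid (BLOCS-style) lengths that the allocation step would request:

    program layout_plan
      implicit none
      integer :: nx, ny, px, py, k
      nx = 1024; ny = 1024      ! fine grid extents (assumed)
      px = 32;   py = 16        ! physical processor extents, PROCS-style (assumed)
      do k = 1, 8               ! grid levels, finest first
         print '(a,i2,a,i5,a,i5,a,i4,a,i4)', 'level ', k, ': grid ', nx, &
               ' x ', ny, ', subgrid ', max(nx/px, 1), ' x ', max(ny/py, 1)
         nx = nx/2; ny = ny/2   ! standard coarsening halves both axes
      end do
    end program layout_plan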


8.5 CM-5 Software Considerations

Given the state of compilers today, when designing a parallel program one should keep the computer's architecture in mind to obtain reasonably fast performance. Dropping down out of a higher level language to the underlying support language, sometimes even to assembly language, will always yield the best performance, but it will inevitably lead to non-portable, long, and confusing code. However, the underlying structure of a computer often changes with updates to both the hardware and the system software, as has certainly been the case with the CM-5. These changes can greatly affect the life of any computer program which uses specific hardware or low-level software features. The high level computer languages are much more stable and are usually not affected by these changes. Hence, we have chosen to keep the code as portable and readable as possible by exclusively using the higher level languages.

The CM-5 supports several higher level languages that support the data parallel model, the two most common being C and FORTRAN. We have chosen to use FORTRAN because it is more stable, and its behavior is better understood for performing numerical computations than is the behavior of C. The flavor of FORTRAN that the CM-5 employs, called CM-FORTRAN, is a subset of FORTRAN 90, but it also includes the entire F77 ANSI standard and a few CM extensions. The parallelism is achieved by the way that FORTRAN 90 expresses looping indices, data structures, and data dependencies in the program. This simplification in notation can lead to more compact and easier to read codes for most algorithms. However, care must be taken to note whether a variable is actually an array or a scalar and whether it resides on the front-end (sequential) or on the CM (parallel) side of the computer.

It is important to note some relationships in computing and communication on the CM-5. Computations are very fast if they are performed entirely on one processor.


Communications are very expensive when compared to computations, but they can vary a lot among themselves. There are several types of communications that we are interested in: circular, end-off, and irregular. The circular and end-off shift communications are about the same speed, with circular getting the edge, but they are both much faster than the irregular pattern communications. Another aspect that relates to speed is the distance of the communication path between processors. If the distance is a power of 2, then good performance results; the fastest is nearest neighbor communication, which is only slightly faster.

8.6 Coarsening and Data Structures in 2D

It is important to understand how the data are laid out in memory across the processing nodes. When a parallel array has more elements than there are processors, the array is decomposed into contiguous subsets and spread across the processing nodes. The subsets are called subgrids, and they are uniform in size, shape, and starting memory address on all the memory banks on all the processing nodes. If the data will not fill all the subgrids, then some of the subgrids will be padded with null data until they are all full. Computations taking place on subgrid data are all done sequentially on a processing node by the vector units. The best efficiency is obtained when there is no padding, so that full vector operations can be performed. When padding is present in the subgrids, additional overhead is incurred; first no-op instructions are sent to the vector units associated with the padded elements, and then a mask is created and used to prevent storing the results.

We have considered two types of coarsening to generate the coarser grids: standard coarsening and semi-coarsening. The degree of parallelization is quite different for these two choices.


For a given grid, the next coarser grid has one fourth the number of grid points for the standard coarsening method versus one half for the semi-coarsening method. For the standard coarsening method, the number of points in both dimensions is reduced by half, taking every other grid point from the fine grid to form the coarser grid. The semi-coarsening method reduces the number of points in only one dimension by half. The rest of the discussion will concentrate on the standard coarsening method but will apply equally to the semi-coarsening method, with the obvious difference that coarsening is only in one direction. Any comments that do not apply to both methods will be pointed out as they arise.

There are several ways in which we can set up the grid data structures. The fine grid, in two dimensions, is laid out and partitioned into subgrids across the processors; see figure 8.5(a). Ideally this layout is thought of as one grid point per processor. However, it is important to remember that for large grids, each processor contains a contiguous subgrid of grid points and that each grid point is treated as if it were on an individual processor; such imaginary processors are often referred to as virtual processors, but we will just refer to them as processors.

In order to discuss the data structures and their relationship to communications, we need to define the grid spacing relationship between the different grid levels. Let us assume a uniform fine grid for now, but our comments will equally apply to the data structures for non-uniform grids. Recall the notation used in Algorithm MGV(k, v1, v2, h), from section 1.4, where k referred to the grid level and ranged from 1 (coarsest) to M (finest). The coarse grid spacing, for our solvers, is determined by doubling the fine grid spacing, h_{k-1} = 2h_k, which leads to the equation

    d = 2^(M-k),     (8.1)

where d is the grid communication distance between neighboring grid points on grid level k and M is the total number of grid levels. The actual grid spacing, for a uniform grid, on grid level k is h_k = d h_M.
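For concreteness, a tiny self-contained loop (M = 6 levels is our own example value) prints the communication distance of equation (8.1) on each level:

    program comm_distance
      implicit none
      integer :: k, m
      m = 6                     ! number of grid levels (assumed)
      do k = m, 1, -1           ! finest (k = m) down to coarsest (k = 1)
         print '(a,i2,a,i3)', 'level ', k, ': distance d = ', 2**(m - k)
      end do
    end program comm_distance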


Figure 8.5. Standard coarsening grid data structure layout for the finest (left), coarse (center), and coarsest (right) grid levels, where a dot represents an active processor (grid point), and figures (a), (b), and (c) represent three different data structures.


Most of the computations in the multigrid algorithm are performed using neighboring grid points, which requires communications between the active grid points. The distance of the communications between nearest neighbor grid points is given by equation (8.1). This formula is valid in the direction of the coarsening for both standard and semi-coarsening in two or three dimensions. A possible disadvantage of all the grid levels sharing the same data structure is that the distance of the communication increases with the coarseness of the grid level. The increase in communication distance may cause slower data transfer rates. A further disadvantage for the standard coarsening method is that most of the CMSSL routines cannot be used on the coarser grids because the communication distance is greater than one. This disadvantage can be overcome if we introduce data structure transformation routines to convert data from our data structures to ones on which the CMSSL can operate. The CMSSL routines are needed for the LU decomposition and to perform tridiagonal line solves, which are used in the direct solver and the smoother respectively. The conversion of the data structures will of course increase the execution time. If we keep the same data structure and do not want to use conversion routines, we are then forced to abandon the CMSSL and write our own routines. Writing our own LU decomposition solver and tridiagonal solvers for the CM-5 is not a trivial task, and it would be impossible to obtain efficiency anywhere near that of the CMSSL, which is written in assembly language, by writing in CM-Fortran alone.

Another choice for the grid data structures is to keep the grid communication distance between nearest neighbor grid points at one. There are two data structures that can be used to accomplish this goal. The first way is to use the same data structure for all grid levels but to use for computation only a compact subgrid of the fine grid level corresponding to the coarse grid. The other points in the data structure are unused; see figure 8.5(b). The second way is to have each coarse grid have its own compact data structure of just the right size; see figure 8.5(c).


In each case, communication for the computations on a grid level is all nearest neighbor communication, which is the fastest possible communication. However, there are disadvantages with these two data structuring schemes. The grid transfer operations are now complicated and require the use of general communication routines, which are the slowest type of communication between processors. Clearly the choice of data structures will have a different impact on each of the various multigrid components. For this reason, we examine the multigrid components separately and then discuss the choice of tradeoffs.

8.7 Coarse Grid Operators

The discussion above about data structures covers most of the options for which data structures are reasonable for the coarse grid operators. However, there are a few more pitfalls that should be considered. Besides communications, the choice of data structures on the CM computers can have a large effect on performance when temporary variables are created to hold intermediate data from computations and when data are passed between routines.

Temporary variables can affect the performance in two ways: size and alignment. The storage for a temporary variable is the same as that of the largest data structure involved in the computation. Complex computations often require several temporary variables. If there are any communications in a computation, then it is almost always the case that a temporary variable is created to hold the data from the communication to be used in the computation. The alignment of a temporary variable is governed by the data structures involved in the computation. The choices that the runtime system makes can sometimes cause a slight misalignment of the data and slow down performance by introducing communication.


It sounds much worse than it really is, because these communications have always been found to be between virtual processors or between the vector units on a single processor node, and the overhead is usually negligible if the subgrid size is relatively small. When data are passed between routines, temporary variables into which the data are copied may be created. This creation happens when the called routine uses only a subset of the data structure from the calling routine. However, the creation of temporary variables has also occasionally been observed on the CM-2 and CM-200 when the entire data structure has been passed from the calling routine to the called routine; it has not yet been observed on the CM-5. There is an overhead cost in time associated with the creation and use of temporary variables by the runtime system. These costs can sometimes be cut if the implementation already uses temporary variables, but it is best if the implementation can minimize the need for temporary variables altogether.

8.8 Grid Transfer Operators

The grid transfer operations involve mostly communication of data between two grid levels. It is therefore important to minimize the amount of data being transferred and to use the most efficient type of communication that we can. However, the time spent in one V(1,1)-cycle on the vector computers (Cray Y-MP) performing grid transfers was only about 25 percent of that spent on smoothing when alternating line relaxation was used. We expect to see the same kind of relationship between grid transfers and smoothing on the CM-5, and the percentage may even drop, because the smoother will usually involve more communication than the grid transfers. However, the smoother and grid transfer routines may use different types of communication, affecting the percentage of time spent in each routine.


8.9 Smoothers

There are several relaxation methods available for smoothing: point, x-line, y-line, and alternating line Gauss-Seidel relaxation using multicolor ordering. By multicolor ordering we mean that either red/black or 4-color ordering is used. The active grid points are a power of two distance apart when multi-coloring is used, which means that it is still possible to use efficient communication between processors. Regardless of which data structure layout is used, the multicolor ordering in the smoother will always use communications that are either a power of two apart or nearest neighbor.

Recall the comments from section 8.6 about the CMSSL. If we choose to use the grid data structure layout in figure 8.5(a), we cannot use the CMSSL tridiagonal solvers for the line solves without using data structure conversion routines. However, the semi-coarsening method can use the CMSSL tridiagonal solver because the line solves are not in the direction of the coarsening. There is also the incomplete line LU (ILLU) iterative method, used in the vector code, to consider. However, the ILLU method is not parallelizable in its current form, and for this reason we chose not to implement it on the CM-5. Many researchers are working on developing a parallel ILU solver, but we are not aware of any algorithms or efforts to develop the more robust ILLU method on parallel computers.


8.9.1 Parallel Line Gauss-Seidel Relaxation

The CMSSL provides a tridiagonal solver that comes in two forms: one that performs the entire solution process, and one that splits the process up into a call to the LU factorization routine and another call to the LU solution routine. We can use the factorization routine and save the LU decompositions between smoothing steps, but that would mean saving the factors for every grid level. Saving all the LU decompositions would be costly because the CMSSL also allocates its own temporary work space for each decomposition to be used in the solution phase. We only save about 30% of the execution time for a V-cycle by saving the LU decompositions, but it takes about six times the storage required when not saving them. The CMSSL gen_tridiag_solve routine can be used to solve both x and y lines by just changing the vector-axis parameter to point to the array axis that the diagonal elements lie on. All of the x (y) lines of a single color can be solved in parallel. The zebra line Gauss-Seidel relaxation will take two tridiagonal line solve times, one for each color. The alternating zebra line Gauss-Seidel relaxation will take a total of four tridiagonal line solve times. Since the CMSSL tridiagonal solver overwrites the coefficient and right hand side arrays with the LU decomposition and the solution, respectively, we need to copy the data into temporary work space arrays before calling the solver. Once again extra temporary storage is needed, but it is only allocated for the duration of the smoothing step. This temporary storage space is not saved for each grid, which could easily fill up memory and reduce the size of problems that can be solved, but is reused on each grid level.
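To make the structure of a zebra sweep concrete, here is a minimal serial sketch of one zebra x-line Gauss-Seidel sweep for a 5-point operator; the stencil array names (cw, cc, ce, cn, cs) are our own, the black box codes actually use 9-point stencils, and in the parallel code all lines of one color are handed to the tridiagonal solver simultaneously rather than looped over:

    subroutine zebra_x_sweep(cw, cc, ce, cn, cs, f, u, nx, ny)
      implicit none
      integer, intent(in) :: nx, ny
      real, dimension(nx, ny), intent(in) :: cw, cc, ce, cn, cs, f
      real, dimension(nx, ny), intent(inout) :: u
      real :: p(nx), q(nx), rhs(nx), denom
      integer :: i, j, color
      do color = 0, 1
         do j = 1 + color, ny, 2            ! lines of one color are independent
            do i = 1, nx                    ! move the y-coupling to the rhs;
               rhs(i) = f(i, j)             ! off-grid neighbors count as zero
               if (j > 1)  rhs(i) = rhs(i) - cs(i, j) * u(i, j - 1)
               if (j < ny) rhs(i) = rhs(i) - cn(i, j) * u(i, j + 1)
            end do
            ! Thomas algorithm along the line: u(i) = p(i)*u(i+1) + q(i)
            denom = cc(1, j)
            p(1) = -ce(1, j) / denom
            q(1) = rhs(1) / denom
            do i = 2, nx
               denom = cc(i, j) + cw(i, j) * p(i - 1)
               p(i) = -ce(i, j) / denom
               q(i) = (rhs(i) - cw(i, j) * q(i - 1)) / denom
            end do
            u(nx, j) = q(nx)                ! ce(nx,j) = 0, so p(nx) = 0
            do i = nx - 1, 1, -1
               u(i, j) = p(i) * u(i + 1, j) + q(i)
            end do
         end do
      end do
    end subroutine zebra_x_sweep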


8.9.2 CM-5 Tridiagonal Line Solver Using Cyclic Reduction

The tridiagonal systems from the line relaxation on the vector computers were solved by Gaussian elimination, with vectorization taking place by solving all the lines of one color simultaneously. We could also use this approach to obtain a parallel tridiagonal line solver by solving all the lines of a single color in parallel. However, this approach still leaves each line to be solved sequentially, and we can do better than that by using cyclic reduction. The cyclic reduction algorithm is an example of a divide and conquer method.

A tridiagonal system of irreducible linear equations LU = F, where L is of dimension N = 2^n - 1, can be represented as the matrix equation

    LU = | b_1  c_1                           | | u_1     |   | f_1     |
         | a_2  b_2  c_2                      | | u_2     |   | f_2     |
         |      .    .    .                   | | .       | = | .       |  = F.   (8.2)
         |      a_{N-1}  b_{N-1}  c_{N-1}     | | u_{N-1} |   | f_{N-1} |
         |               a_N      b_N         | | u_N     |   | f_N     |

The basic idea is to solve for u_i in terms of u_{i-1} and u_{i+1}, provided that b_i /= 0. We do this for all odd-numbered equations and substitute the expressions for u_i into the remaining equations. The result is a tridiagonal system of equations in floor(N/2) variables. The procedure is applied recursively until only one equation remains. The single equation is then solved, and the other variables are obtained through back substitution.

To simplify a more detailed description, let u_0 = u_{N+1} = 0, let the subscripts represent the equation numbers, and let the superscripts denote the reduction and back substitution steps. Let a_i^1 = a_i, b_i^1 = b_i, c_i^1 = c_i, and f_i^1 = f_i; then the reduction step is given by

    α = 2^{k-1},                                                  (8.3)
    β_i = -a_i^k / b_{i-α}^k,                                     (8.4)
    γ_i = -c_i^k / b_{i+α}^k,                                     (8.5)
    a_i^{k+1} = β_i a_{i-α}^k,                                    (8.6)
    c_i^{k+1} = γ_i c_{i+α}^k,                                    (8.7)
    b_i^{k+1} = b_i^k + β_i c_{i-α}^k + γ_i a_{i+α}^k,            (8.8)
    f_i^{k+1} = f_i^k + β_i f_{i-α}^k + γ_i f_{i+α}^k,            (8.9)

where i = 2α, 4α, 6α, ..., 2^n - 2α, for the reduction steps k = 1, 2, ..., n-1.


After the n-1 reduction steps we are left with one equation, which when solved gives

    u_{2^{n-1}} = f_{2^{n-1}}^n / b_{2^{n-1}}^n.                  (8.10)

The back substitution is given by

    u_i = ( f_i^{k+1} - a_i^{k+1} u_{i-α} - c_i^{k+1} u_{i+α} ) / b_i^{k+1},   (8.11)

where α = 2^k, i = α, 3α, 5α, ..., (2^{n-k} - 1)α, and k = n-2, n-3, ..., 0.

The cyclic reduction algorithm just described derives its parallelism by performing all the i-indexed equations simultaneously for each reduction step k. However, the number of processors needed at each reduction step is half that of the previous step. Notice that the reduction process can be written to yield any of the unknowns. If we write a set of reduction steps to yield each unknown, then after performing the reductions the set of "single" equations can be solved simultaneously, using an equation similar to (8.10), to give the solution. This method no longer requires the back substitution step, and it also keeps all the processors busy at all the steps, giving a method that is twice as fast as the original cyclic reduction algorithm. This version of cyclic reduction can be found in [46], where it is referred to as the PARACR algorithm, and it is one of the tridiagonal solvers implemented in the CMSSL.

If the VP ratio is much greater than one, then the subgrids on each processor are large and the PARACR algorithm becomes inefficient. A better algorithm is block cyclic reduction [59] [49], which is also in the CMSSL. Block cyclic reduction performs the LU decomposition sequentially on each processor and cyclic reduction over the processors.
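A minimal Fortran 90 sketch of the PARACR variant is given below; every equation is updated at every doubling step, so no back substitution is needed, and n doubling steps fully decouple all N = 2^n - 1 equations. EOSHIFT supplies zeros past the array ends, which matches the convention u_0 = u_{N+1} = 0, and the boundary value 1.0 in the denominators only avoids division by zero where the corresponding numerator is already zero. This is an illustration of the algorithm, not the CMSSL implementation; the caller must set a(1) = 0 and c(N) = 0.

    subroutine paracr(a, b, c, f, u, n)
      implicit none
      integer, intent(in) :: n                       ! system size N = 2**n - 1
      real, dimension(2**n - 1), intent(inout) :: a, b, c, f
      real, dimension(2**n - 1), intent(out) :: u
      real, dimension(2**n - 1) :: bet, gam, an, bn, cn, fn
      integer :: k, d
      do k = 1, n                                    ! n doubling steps
         d = 2**(k - 1)
         bet = -a / eoshift(b, -d, 1.0)              ! -a_i / b_{i-d}
         gam = -c / eoshift(b,  d, 1.0)              ! -c_i / b_{i+d}
         an = bet * eoshift(a, -d)
         cn = gam * eoshift(c,  d)
         bn = b + bet * eoshift(c, -d) + gam * eoshift(a, d)
         fn = f + bet * eoshift(f, -d) + gam * eoshift(f, d)
         a = an; b = bn; c = cn; f = fn
      end do
      u = f / b                                      ! equations are now decoupled
    end subroutine paracr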


8.10 Coarsest Grid Solver

One of the advantages of the standard coarsening multigrid algorithm is that each coarser grid level takes only one fourth the amount of work of the previous grid level, until the number of coarse grid points becomes smaller than one grid point per processor; thereafter the time remains constant for the computations on a grid level. The coarsest grid solver is a direct solver that uses an LU decomposition. The direct solver is slow on the CM-5 due to its sequential nature, and its cost depends on the number of grid points on the coarsest grid level. It is therefore important to make the coarsest grid level small, so that the direct solver time is approximately equal to the smoother time on the grid level with one grid point per processor. A banded sparse matrix direct solver does not exist in the CMSSL, even though the documentation states that CMSSL routines exist which are equivalent to those of Linpack and Lapack for solving general banded systems. In fact the CMSSL provides routines that solve only tridiagonal and pentadiagonal banded systems.

Another common choice for the coarsest grid solver is to use several iterations of one of the relaxation methods. This choice is not very practical on the CM-5 because of the constant time per iteration per grid level when there are fewer grid points than there are processors. Thus the solution time on the coarsest grid is proportional to the number of smoothing steps. The constant time for the coarsest grid level can only be reduced if a cheaper smoother can be found or if the number of coarse grids which have fewer grid points than processors is kept to a minimum. The problem with this approach is that the larger the system to be solved, the worse the reduction factor for the smoother. To add insult to injury, we are already using some of the cheapest and most effective smoothers. Using a relaxation method on the coarsest grid does not seem to help the situation and indicates that the use of a direct solver is probably better.


A third approach is to use standard coarsening for large grids and then switch to semi-coarsening for the coarser and coarsest grids. This approach solves a few of our problems because the coarsest grid can now be just one line, which can be solved directly with the CMSSL's tridiagonal solver. However, the new question is when to switch from standard to semi-coarsening. It was decided that the user would specify when the switch took place by setting the coarsest grid input parameter. The optimum value for the switch will depend on the finest grid size and the number of processors. Recall that, when the switch to semi-coarsening takes place, coarsening will only happen in the y-direction and the grid point distribution in the x-direction across the processors will remain fixed. The tridiagonal solve on the coarsest grid level solves one x-line of the size that was fixed when the switch took place, and the tridiagonal solution time depends on that size. The smaller the x-dimension is when the switch takes place, the faster the tridiagonal solution. It therefore becomes necessary to balance the constant time spent performing standard coarsening multigrid, the constant time spent performing semi-coarsening multigrid, and the tridiagonal solution time. A parametric study has not been done, and is not planned, to determine the optimum values. However, from experience a reasonable choice for the switch is when the number of x-direction points is about one quarter the number of processors or less.

Another practical choice, from the programmer's perspective, is to switch from standard to semi-coarsening when the VP ratio becomes less than one after coarsening. This choice is convenient and keeps the maximum number of processors active. This choice would seem to be very good, but the semi-coarsening performance compared to the standard coarsening performance when VP ≤ 1 is highly dependent on the efficiency of the tridiagonal line solver used by the smoother. However, we now have the advantage of being able to use the CMSSL to perform the tridiagonal line solves.
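Both switching rules amount to simple tests; as a sketch of the experience-based rule above (the function name is our own), the check is:

    logical function switch_to_semi(nx, nprocs)
      implicit none
      integer, intent(in) :: nx      ! x-direction points on the current level
      integer, intent(in) :: nprocs  ! number of (vector unit) processors
      switch_to_semi = (4 * nx <= nprocs)   ! nx at most one quarter of nprocs
    end function switch_to_semi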


If the coarsening stops before we reach a single grid line or equation, we are left with a sparse banded system of equations to solve. The CMSSL does not provide a solver to handle this case, except for the dense linear system solvers. Using the dense system solvers means that we would have to allocate more storage and copy the banded system's data into it at the cost of general communication, the most expensive kind. The dense system solver will also perform many unneeded computations and communications that involve matrix entries outside the banded area. All things said, the use of the dense system solvers is very inefficient.

The only solution left for us is to write our own solver for banded systems. While this may sound attractive, it is not. Once again we are met with the challenge of trying to write a solver that is both efficient and competitive. It is very difficult to write codes in high level languages that can compete with the CMSSL. To make matters worse, the parallelism in Gaussian elimination is at best modest for a banded system, depending on the length of the band. However, if we are willing to sacrifice some memory, we can store the LU factors to save some execution time for the solution of the coarsest grid when several multigrid iterations are performed.

For a dense system of equations the best performance is obtained by using block cyclic ordering of the equations; see [59]. However, the performance gain assumes that VP ≥ 1, and a sparse banded system with VP < 1 will actually perform much slower than most of the other methods. To obtain an efficient banded system solver requires transferring the data from a grid point oriented data structure to a matrix oriented one, requiring general communication. For efficiency, if we have N unknowns, we will need N^2 processors. The computations can then be done in order N operations, but communications will add significantly to the execution time. All things considered, the best we can hope to do for the solver is order N times a constant plus the communication times, which include the data structure transfers.


It should be noted that the constant can be on the order of N when N is small. The performance of an LU direct solver is not very attractive when VP < 1. The most efficient solution, so far, is to switch to the semi-coarsening code at some point after VP < 1.

8.11 Miscellaneous Software Issues

An interesting compiler deficiency is that a parameter passed into a subroutine manifests poor performance if it is used as a loop control parameter. The way to avoid this deficiency is to copy the parameter's value into a local variable and then use that variable as the loop control parameter. The poor performance might have to do with the fact that the passed variable is usually scalar and is stored in the partition manager's scalar memory, requiring a broadcast communication every time the variable is needed.
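The workaround is a one-line change; a sketch (the routine and variable names are our own illustration):

    subroutine smooth(nsweeps)
      implicit none
      integer, intent(in) :: nsweeps
      integer :: i, n
      n = nsweeps        ! local copy avoids a broadcast on every loop test
      do i = 1, n
         ! ... one relaxation sweep ...
      end do
    end subroutine smooth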


8.11.1 Using Scalapack

Scalapack is the same as Lapack but designed for distributed memory parallel computers, using the parallel basic linear algebra subprograms (PBLAS) and the basic linear algebra communication subprograms (BLACS). Scalapack is available on the CM-5 using CM-PVM (parallel virtual machine) under the CMMD message passing model. The SPMD data model is not compatible with the CM-PVM CMMD model, and for this reason we cannot use Scalapack. Even if we assume that we had compatible programming models, another problem with using Scalapack is that its data structures are all matrix oriented, instead of the grid oriented data structure that we use. Scalapack also assumes that the matrices are distributed to a grid of processors in a 2D block cyclic decomposition. This distribution of data would require that we use costly general communications to copy the data into the block cyclic format. To top it all off, Scalapack is still under development, and the two routines that we would need to perform the LU factorization and LU solution, PSGBTRF and PSGBTRS respectively, have not been implemented yet.

8.11.2 Poly-Shift Communication

The PSHIFT communication routines in the CMSSL are also available on the CM-5. The PSHIFT routine achieved the overlapping of communications on the CM-2 and CM-200 because those computers used a hypercube data communication network. The PSHIFT routine on the CM-5 is not very effective because the fat-tree data communication network will not allow as many communications to be overlapped. The PSHIFT setup routine dynamically allocates memory, and when a particular poly-shift stencil is no longer needed, the memory should be deallocated.

The PSHIFT routine can perform a maximum of two shifts per array dimension, one in each direction. The number of communications that can be overlapped is limited to at most four, but the size and shape of the stencil are not restricted. If an array has padding in the dimension in which communication is to take place, then PSHIFT will perform approximately the same as the equivalent calls to CSHIFT and/or EOSHIFT. The best performance is obtained when the subgrid lengths in the communication dimensions are all roughly equivalent. The performance improvement for 2D 9-point stencils is barely noticeable. The PSHIFT routine should be able to do better than it does on the CM-5, but first it will have to be optimized for the fat-tree network.

8.12 2D Standard Coarsening Parallel Algorithm

Many parallel algorithms have been tried over the years in an effort to create an efficient parallel black box multigrid code. The code was first developed on the CM-2


and then ported and modified for the CM-200, and finally a version was created for the CM-5. The algorithmic choices presented here are those that were made for the CM-5.

8.12.1 Data Structures

The data structures for the grid equations are grid point stencil oriented. There are no fictitious grid equations needed for the boundary as there were in the vector code. The references to neighboring grid points are made through the communication routine EOSHIFT, which gives a zero value for off-grid references.
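For example, a residual can be written in this style as follows (a sketch with our own array names, shown for a 5-point operator; the black box codes use 9-point stencils, but the pattern is the same). The zeros supplied by EOSHIFT at the grid edges mean that boundary rows and columns need no special treatment:

    subroutine residual(cc, cn, cs, ce, cw, f, u, r, nx, ny)
      implicit none
      integer, intent(in) :: nx, ny
      real, dimension(nx, ny), intent(in) :: cc, cn, cs, ce, cw, f, u
      real, dimension(nx, ny), intent(out) :: r
      r = f - ( cc * u                        &
              + ce * eoshift(u,  1, 0.0, 1)   &   ! east:  u(i+1, j)
              + cw * eoshift(u, -1, 0.0, 1)   &   ! west:  u(i-1, j)
              + cn * eoshift(u,  1, 0.0, 2)   &   ! north: u(i, j+1)
              + cs * eoshift(u, -1, 0.0, 2) )     ! south: u(i, j-1)
    end subroutine residual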


The previous trouble with the data structure layout is solved by the use of dynamic allocation using the Dynamic Memory Management Utilities (DMMU) developed by Bill Spangenberg of Los Alamos National Laboratory in conjunction with Thinking Machines Inc. The DMMU provide a way to allocate arrays dynamically with a given fixed geometry and to use array aliasing to create an array of grid level arrays.

8.12.2 Coarsening

We used standard coarsening, taking every other fine grid point in both coordinate directions to form the coarse grid. The data structures when VP > 1 are of the non-compatible compact type; see figure 8.5(c). When VP ≤ 1, we used a natural grid layout that uses compatible grids, as illustrated in figure 8.5(a). The natural grid layout leaves more and more idle processors with every coarser grid level. However, since we wanted to use the CMSSL for the line solves, we switched to semi-coarsening a few coarse grid levels below VP ≤ 1. As a special note, the easiest way to implement the black box multigrid solver for VP ≤ 1 is to use the semi-coarsening algorithm. This choice keeps the maximum number of processors busy and allows the direct use of the CMSSL tridiagonal solver.

8.12.3 Smoothers

We have implemented the multicolor ordering point, line, and alternating line Gauss-Seidel methods. The ILLU method was not implemented, for the reasons given in section 8.9. However, as mentioned before, we can use the CMSSL tridiagonal solver when VP > 1, and also when VP < 1 if the semi-coarsening algorithm is used on all grid levels. When VP < 1 and the semi-coarsening algorithm is not used, we will end up with non-contiguous active data in the lines to be solved, preventing us from using the CMSSL. We tried implementing data structure transformation routines, but they were found, as should be expected, to be clumsy and inefficient. We also implemented our own parallel tridiagonal solvers, but they were not very competitive, being about twice as slow as the CMSSL routine.

8.12.4 Coarsest Grid Solver

We tried a direct solver using LU factorization, but it turned out to be hard to implement and slow in its general form, unless the coarsest grid was always of a given fixed size. Instead, we chose to use the semi-coarsening algorithm, in which case only a tridiagonal solver was needed. So now the coarsening continues until only one line is left to solve, and that can be done by using the same tridiagonal solver that was used for the line solves of the smoother.

8.12.5 Grid Transfer Operators

Three choices for the grid transfer operators, discussed in chapter 3, were implemented. They are the ones discussed in sections 3.5.1, 3.5.3, and 3.6.1. The two collapsing type methods were readily parallelizable and easily implemented. The computation of the grid transfer operator coefficients created a lot of temporary variables. It was difficult to find a good implementation that did not use too many temporaries and that could avoid having the compiler generate too many and fill up the available memory.


The grid transfer operators based on extensions to Schaffer's ideas were also parallelizable, but they depended on the availability of tridiagonal line solvers. The time to compute all the operators is also longer than for the collapsing method because of the line solves.

8.12.6 Coarse Grid Operators

The coarse grid operators are formed using the Galerkin coarse grid approximation with the grid transfer operators.

8.13 2D Semi-Coarsening Parallel Algorithm

The semi-coarsening code was originally implemented by Joel E. Dendy Jr., Michael Ida, and Jeff Rutledge on the CM-200. A better implementation was done on the CM-5 by Bill Spangenberg, who wrote the Dynamic Memory Management Utilities (DMMU). It is still possible to obtain an even better implementation of the semi-coarsening code on the CM-5, but this improvement has not been done because the code cannot be placed (at least not for the near future) in the public domain, since it uses the proprietary DMMU.

8.13.1 Data Structures

The data structures are grid point stencil oriented, with a different array data structure for the coefficients, the unknowns, and the right hand side.

8.13.2 Coarsening

Semi-coarsening in the y-direction was used, taking every other fine grid point in the y-direction to form the coarse grid. Non-compatible grid data structures were used when VP > 1, and a compatible grid data structure otherwise, as was the case for the standard coarsening parallel code.


8.13.3 Smoothers

Red/black x-line Gauss-Seidel relaxation is used for the smoother. The CMSSL tridiagonal solver using cyclic reduction was used to solve the lines. A better implementation for the line solves exists if block cyclic reduction is used when the finest grid level has VP > 1, since the subgrid size per processor will be large, and it makes more sense to use the vector units more efficiently by using sequential cyclic reduction on the subgrids of each processor.

8.13.4 Coarsest Grid Solver

The coarsening takes place until only one x grid line remains, and then the CMSSL tridiagonal solver is called to solve it exactly.

8.13.5 Grid Transfer Operators

The grid transfer operator is the one from section 3.6.1, applied in only the y-direction. The CMSSL tridiagonal solver was also used here.

8.13.6 Coarse Grid Operators

They are formed using the Galerkin coarse grid approximation with the grid transfer operators.

8.14 2D Parallel Timings

In the following tables we have reported both busy (B) and idle (I) times. Busy time is the execution time for the parallel processing nodes, while idle time is the sequential execution time plus the time to perform all communications. We report times for various time shared partitions of the CM-5. The partitions are identified by the number of processing nodes (PN), namely 32, 64, 128, 256, and 512 processing nodes. The CM-5 has a full configuration of 1024 processing nodes, but the full partition is not available under our time sharing system. The tables report timings, in seconds, for the average time of five runs for either the setup time or the average of five V-cycles.


The standard coarsening timings are given in tables 8.3 and 8.4 for one V(1,1)-cycle and for the setup, respectively. We see the effects of the parallel overhead in the tables for small grid sizes and large partitions. For a given partition we do not see the almost perfect scaling that is seen with the Cray Y-MP; a problem that has four times the number of unknowns takes far less than four times the time. Nor do we see perfect scaleup with the number of processors; for the 1024 x 1024 case, the 128 processor partition takes about half the time of the 32 processor partition, and the 512 processor partition takes two-thirds the time of the 128 processor partition.

We can also look at the parallel efficiency by examining the data for busy and idle times. The parallel efficiency for the standard coarsening algorithm is given in table 8.5. Note that the highest parallel efficiency is obtained for the largest grid size on the smallest number of processors. This should be expected, since that combination produces the largest subgrid size per processor, which will be processed serially on each processor, keeping all the processors busy until the calculation is completed. We still see that the parallel efficiency ranges from 63 to 88 percent, where the higher efficiencies are obtained for the larger grid sizes.

Tables 8.6 and 8.7 give timings for the semi-coarsening algorithm for the setup and one V(1,1)-cycle, respectively, for a range of grid sizes and processing node partitions. The parallel timings for the semi-coarsening algorithm show that we, again, do not have perfect scaling with the problem size, nor do we have perfect scaleup with the number of processors. For the 1024 x 1024 case, the 128 processor partition takes about half the time of the 32 processor partition, and the 512 processor partition takes about two-thirds the time of the 128 processor partition. The parallel efficiency for the semi-coarsening algorithm is given in table 8.8.

PAGE 268

Table 8.3. Timings, in seconds, for the standard coarsening code performing one V(l, I)-cycle with zebra alternating line Gauss-Seidel on 32, 64, 128, 256, and 512 processing nodes of the CM-5, where the size N means anN x N fine grid. Busy and Idle refer to the parallel and communication/sequential time respectively. Size 32 PN 64 PN 128 PN 256 PN 512 PN 8 I 9.060E-2 1.113E-1 9.098E-2 1.039E-1 1.071E-1 B 1.844E-1 1.881E-1 1.892E-1 1.923E-1 1.951E-1 16 I 1.050E-l 1.316E-l 1.232E-l 1.206E-l 1.218E-l B 2.378E-1 2.778E-1 2.816E-1 2.844E-1 2.898E-1 32 I 1.508E-1 1.348E-1 1.406E-1 1.752E-1 1.690E-1 B 2.902E-l 3.314E-l 3.458E-l 3.934E-l 4.004E-l 64 I 1.756E-l 2.090E-l 1.636E-l 1.828E-l 1.862E-l B 3.558E-1 3.948E-1 4.076E-1 4.558E-1 4.814E-1 128 I 1.794E-1 1.962E-1 1.828E-1 2.118E-1 2.180E-1 B 4.520E-l 4.828E-l 4.774E-l 5.276E-l 5.510E-l 256 I 2.346E-1 2.068E-1 2.060E-1 2.420E-1 2.442E-1 B 6.374E-1 6.202E-1 5.858E-1 6.286E-1 6.262E-1 512 I 2.152E-1 2.300E-1 2.542E-1 2.604E-1 2.830E-1 B 1.092E+O 9.188E-1 7.912E-1 7.822E-1 7.536E-1 1024 I 3.240E-1 2.574E-1 2.676E-1 2.950E-1 2.860E-1 B 2.474E+O 1.731E+O 1.265E+O 1.132E+O 9.722E-1 243

PAGE 269

Table 8.4. Timings, in seconds, for the setup phase of the standard coarsening code with zebra alternating line Gauss-Seidel on 32, 64, 128, 256, and 512 processing nodes of the CM-5, where the size N means anN x N fine grid. Busy and Idle refer to the parallel and communication/sequential time respectively. Size 32 PN 64 PN 128 PN 256 PN 512 PN 8 I 2.581E-1 4.332E-1 2.776E-1 4.141E-1 3.877E-1 B 3.454E-1 3.519E-1 3.530E-1 3.581E-1 3.627E-1 16 I 2.620E-1 4.833E-1 3.756E-1 3.067E-1 3.051E-1 B 4.880E-1 5.507E-1 5.556E-1 5.604E-1 5.710E-1 32 I 5.020E-1 3.419E-1 3.583E-1 5.578E-1 6.335E-1 B 6.110E-1 6.851E-1 7.167E-1 8.052E-1 8.195E-1 64 I 5.616E-1 6.272E-1 4.143E-1 4.806E-1 4.620E-1 B 7.704E-1 8.378E-1 8.677E-1 9.574E-1 1.016E+O 128 I 5.480E-1 6.920E-1 5.000E-1 6.440E-1 5.590E-1 B 1.010E+O 1.044E+O 1.038E+O 1.135E+O 1.188E+O 256 I 7.080E-1 5.750E-1 5.620E-1 6.490E-1 6.340E-1 B 1.450E+O 1.371E+O 1.293E+O 1.363E+O 1.372E+O 512 I 6.420E-1 6.140E-1 6.500E-1 6.930E-1 7.070E-1 B 2.584E+O 2.096E+O 1.786E+O 1.728E+O 1.671E+O 1024 I 1.198E+O 8.730E-1 8.420E-1 9.520E-1 8.090E-1 B 6.046E+O 4.113E+O 2.957E+O 2.528E+O 2.176E+O Table 8.5. Parallel efficiency for standard coarsening V(1, 1)-cycle using zebra alternating line Gauss-Seidel for the CM-5 with 32, 64, 128, 256, and 512 nodes. The results are given in percentages and N means an N x N grid. Size CM-5 N 32 PN 64 PN 128 PN 256 PN 512 PN 8 67 63 68 65 65 16 69 68 70 70 65 32 66 71 71 69 70 64 56 58 62 71 72 128 52 58 67 71 72 256 73 75 82 72 72 512 84 80 76 75 73 1024 88 87 83 79 77 244

PAGE 270

Table 8.6. Timings, in seconds, for the semi-coarsening code performing one V(1, 1) cycle on 32, 64, 128, 256, and 512 processing nodes of the CM-5, where the size N means anN x N fine grid. Busy and Idle refer to the parallel and communication/sequential time respectively. Size 32 PN 64 PN 128 PN 256 PN 512 PN 8 I 4.042E-2 5.116E-2 6.814E-2 5.444E-2 4.465E-2 B 7.463E-2 7.462E-2 7.686E-2 7.768E-2 6.363E-2 16 I 5.036E-2 5.986E-2 5.830E-2 5.378E-2 5.758E-2 B 1.104E-1 1.125E-1 1.145E-1 1.175E-1 1.191E-1 32 I 6.722E-2 6.990E-2 6.714E-2 7.658E-2 8.664E-2 B 1.408E-1 1.525E-1 1.587E-1 1.631E-1 1.667E-1 64 I 7.418E-2 1.197E-1 6.684E-2 8.356E-2 9.090E-2 B 1.912E-1 1.912E-1 1.940E-1 2.090E-1 2.182E-1 128 I 8.604E-2 9.196E-2 9.264E-2 9.846E-2 1.075E-1 B 2.839E-1 2.582E-1 2.537E-1 2.580E-1 2.633E-1 256 I 9.992E-2 1.083E-1 1.022E-1 1.095E-1 1.203E-1 B 4.776E-1 3.793E-1 3.601E-1 3.321E-1 3.309E-1 512 I 1.058E-1 1.362E-1 1.337E-1 1.203E-1 1.330E-1 B 9.796E-1 6.700E-1 5.779E-1 4.662E-1 4.465E-1 1024 I 1.332E-1 1.510E-1 1.244E-1 1.343E-1 1.490E-1 B 2.653E+O 1.405E+O 1.102E+O 7.685E-1 6.932E-1 245

PAGE 271

Table 8. 7. Timings, in seconds, for the setup phase of the semi-coarsening code on 32, 64, 128, 256, and 512 processing nodes of the CM-5, where the size N means an N x N fine grid. Busy and Idle refer to the parallel and communication/sequential time respectively. Size 32 PN 64 PN 128 PN 256 PN 512 PN 8 I 3.899E-2 2.441E-1 1.271E-1 2.317E-1 1.976E-1 B 7.561E-2 7.713E-2 7.705E-2 7.773E-2 7.874E-2 16 I 5.260E-2 2.254E-1 1.156E-1 5.630E-2 5.710E-1 B 1.183E-1 1.173E-1 1.170E-1 1.204E-1 1.210E-1 32 I 6.940E-1 1.187E-1 7.480E-2 2.457E-1 3.146E-1 B 1.552E-1 1.646E-1 1.691E-1 1.679E-1 1.709E-1 64 I 7.550E-2 2.485E-1 1.427E-1 8.800E-2 9.340E-2 B 2.066E-1 2.060E-1 2.113E-1 2.255E-1 2.321E-1 128 I 1.612E-1 1.393E-1 1.441E-1 1.013E-1 1.125E-1 B 3.004E-1 2.717E-1 2.693E-1 2.733E-1 2.817E-1 256 I 1.530E-1 1.560E-1 1.557E-1 1.154E-1 1.276E-1 B 5.006E-1 3.989E-1 3.811E-1 3.515E-1 3.502E-1 512 I 1.480E-1 1.623E-1 1.640E-1 1.274E-1 1.394E-1 B 1.025E+O 6.860E-1 6.029E-1 4.856E-1 4.650E-1 1024 I 1.610E-1 1.920E-1 1.710E-1 1.436E-1 1.570E-1 B 2.463E+O 1.476E+O 1.155E+O 8.036E-1 7.199E-1 Table 8.8. Parallel efficiency for semi-coarsening V(1, 1)-cycle for the CM-5 with 32, 64, 128, 256, and 512 processing nodes. The results are given in percentages and N means anN x N grid. Size CM-5 N 32 PN 64 PN 128 PN 256 PN 512 PN 8 64 59 53 59 59 16 67 65 66 69 67 32 68 69 70 68 66 64 72 62 74 71 71 128 77 74 73 72 71 256 83 78 78 75 73 512 90 83 81 79 77 1024 95 90 90 85 82 246

PAGE 272

We again see that the highest parallel efficiency is obtained for the largest grid size problem on the least number of processors, and thus providing the largest subgrid for each processor to work on. The range of parallel efficiency values is now from 53 to 95. The increase in the values for the semi-coarsening algorithm over the standard coarsening algorithm is to be expected because the semi-coarsening algorithm keeps more processors busy during the smoother's line solves, especially on the coarser grid levels. We give a comparison for both the standard and semi-coarsening algorithms on the CM-5, Cray Y-MP, and a Sparc-5 workstation in table 8.9. The CM-5 timings are given for the fastest time for a given grid size and the processing partition that the time was obtained on are given in parentheses. The times in the table are the average time to complete one V(1, 1)-cycle for five V-cycles averaged over five separate runs. The fastest times for a given grid size for the standard coarsening algorithm on the CM-5 are on the 32 processor partition for grid sizes up to 128 x 128, the 128 processor partition for 256 x 256 grid, and the 512 processor partition for grid sizes greater than or equal to 512 x 512. The semi-coarsening algorithm exhibits similar behavior where the 32 processor partition is the fastest for grid sizes up to 32 x 32, 128 processor partition for grid sizes between 64 x 64 and 128 x 128, the 256 processor partition for the 256 x 256 grid, and the 512 processor partition for grid sizes lager than 512 x 512. Comparing the results from table 8.9 shows that vectorization plays an important role on the Cray Y-MP, even for the smallest problems. The Sparc-5 times show that the scaling argument holds, but that it is affected by caching issues. We see from the data that the Cray Y-MP time is still twice as fast as the 128 processor partition and 30% faster than the 512 processor partition. The Cray Y-MP codes are the fastest, but the CM-5 codes are catching up for large problems when enough processors are 247

PAGE 273

Table 8.9. Timing comparison between the CM-5, Cray Y-MP, and Sparc-5 workstation for one V(1, 1)-cycle in seconds, where N means an N x N grid. The top entries are for the standard coarsening codes and the bottom entries are for the semi-coarsening codes, and means that the problem was to big to fit into the available memory. Size CM-5 Cray Y-MP Sparc-5 8 2. 750E-1 (32) 5.270E-4 3.000E-4 1.083E-1 (512) 5.156E-4 5.000E-4 16 3.428E-1 (32) 1.016E-3 1.200E-2 1.640E-1 (32) 9.365E-4 l.OOOE-2 32 4.410E-1 (32) 2.019E-3 5.000E-2 2.080E-1 (32) 1.896E-3 3.200E-2 64 5.314E-1 (32) 4.579E-3 1.910E-1 2.608E-1 (128) 4.435E-3 1.383E-1 128 6.314E-1 (32) 1.285E-2 7.980E-1 3.463E-1 (128) 1.325E-2 5.988E-1 256 7.918E-1 (128) 4.429E-2 3.514E+O 4.416E-1 (256) 4.320E-2 2.666E+O 512 1.037E+O (512) 1.654E-1 1.503E+1 5.795E-1 (512) 1.576E-1 1.209E+1 1.427E+O (512) 6.732E-1 1024 9.028E-1 (512) 6.563E-1 248

PAGE 274

available. When scaling is applied to take into account the difference in clock speeds and instructions per clock cycle between the Cray Y-MP and the CM-5, we see that the two are nearly identical for the 1024 x 1024 problem, but that the Cray Y-MP still has a very slight edge. This shows that the CM-5 codes not only suffer from the overhead associated with parallelization, but that the communication issues are the main bottleneck to beating the vector codes. 249

PAGE 275

CHAPTER 9 BLACK BOX MULTIGRID IN THREE DIMENSIONS 9.1 Introduction The development of a three dimensional black box multigrid solver essentially involves just extending the two dimensional version. The three dimensional methods provide the same functionality as the two dimensional black box multigrid methods. The basic multigrid algorithm and the multigrid components are essentially the same, except that the standard coarsening methods need alternating red/black plane relaxation to obtain a robust smoother. In addition, there are several changes in the implementation, especially for the parallel code. The 3D parallel methods use (alternating) red/black plane Gauss-Seidel relaxation, where the required plane solves are performed using a 2D multigrid method, which have been modified to solve all the planes of a single color simultaneously. We will examine both the standard and semi-coarsening black box multigrid algorithms for problems in three dimensions. The examination will include the 3D algorithm implementations on vector (Cray Y-MP) and parallel (CM-5) computers. The grid operator stencil in three dimensions is now assumed to fit into the 27-point cubic stencil. The 27-point stencil is illustrated in figure 9.1. Notice that for each fixed z (xy-plane) we use the same compass coefficient notation that were used for two dimensions with a prefix to indicate the z level index of the stencil. For the stencil at grid point (i,j, k), the three prefixes are t for(*,*, k + 1), p for(*,*, k), and 250

PAGE 276

b for ( *, *, k -1) 9.1.1 Semi-Coarsening The semi-coarsening algorithm can be done in several ways. Recall that the semi-coarsening method used a smoother working orthogonal to the direction of the coarsening. The coarsening can be done in one of the coordinate directions, leaving the smoother to work on planes, or the coarsening can be done in two of the coordinate directions with the smoother working on lines. We have chosen to examine only semi-coarsening in the z coordinate direction. Either of the other two coordinate direction would have been equally valid, but since we plan on using the 2D semi-coarsening algorithm to perform the plane solves, which is already written for xy-planes, we can avoid writing additional versions for the other planes. 251

PAGE 277

tn , tne tnw, , I I I I I I 1/ 1/ 1/ tw/ I 1/ I te I I 1/ ts 1/ I I tsw tse z ( pnw pn / y I I pne I 1/ 1/ I p I 1/ pw I I I X 1/ pe I psw ps I pse bnw , I. I I 1/ bn I bne bw // I I I I I be I I b I I bsw 1/ 1/ I bs bse Figure 9.1: Grid operator stencil in three dimensions. 252

PAGE 278

CHAPTER 10 3D DISCRETIZATIONS This chapter presents some of the discretizations that can be used on the convection-diffusion equation in three dimensions. The finite difference and finite vol-ume discretizations in three dimensions are straightforward extensions of the two dimensional discretizations presented in chapter 2. We will present only a few examples in three dimensions. The continuous three dimensional problem is given by -\7 (D \lu) +b \lu+cu = f, (x, y) En= (0, Mx) x (0, My) x (0, Mz) (10.1) where D is a 3 x 3 tensor, det D > 0, and c 2: 0. We will only be considering problems where D is diagonal in this chapter. In addition, D, c, and f are allowed to be discontinuous across internal interfaces in the domain n. The boundary conditions are given by au on +au= g, on on (10.2) where a and g are functions, and n is the outward unit normal vector. This allows us to represent Dirichlet, Neumann, and Robin boundary conditions. The domain is assumed to be a rectangular parallelpiped, n = (0, Mx) x (0, My) X (0, Mz), which is divided into uniform cells of length hx = Mx/Nx by hy = My/Ny, by hz = Mz/Nz, where Nx, Ny, and Nz are the number of cells in the x-, y-, and z-directions respectively. The mesh need not be uniform, but such an assumption will simplify our discussions. 253

PAGE 279

A finite element discretization on a regular tetrahedral mesh can also be used to derive the discrete system of equations which can be used for input to the black box multigrid methods. 10.1 Finite Difference Discretization The anisotropic Poisson's equation on a cube domain, in0=(0,1)3 (10.3) where u and fare functions of x, y, and z, can be discretized by central finite differences with a uniform grid spacing, h = 1/N for N = nx = ny = nz, to get the 7-point stencil at grid point (i,j, k): 0 -Ey 0 0 -Ez 0 0 -Ez 0 (10.4) 0 b -Ey 0 p where the stencil subscripts b, p, and t are short for the k-1, k, and k + 1 stencil planes respectively. 10.2 Finite Volume Discretization There are several finite volume grids that can be used for discretization, but the two most common are the vertex and cell centered grid, just as in two dimensions. We will present only the finite volume discretization for the vertex centered finite volumes, (X;, YJ" Zk): yJ. h J.0 N J -y, -' ... y, (10.5) Zk = khz, k = 0, ... Nz 254

PAGE 280

with evaluation at the vertices. The discretization is best when the discontinuous interfaces align with the finite volume boundaries (surfaces). In this discretization D, c, and f are approximated by constant values in finite volume, ni,j, whose centers are at the vertices. 10.2.1 Interior Finite Volumes The development is done the same as in chapter 2. However, instead of having four line integrals, we now have six surface integrals to evaluate over the finite volume. We will refer to the six surfaces as ni-l, ni+l, nj-1, nj+l, nk-1, and nk+l, where the subscripts indicate the fixed grid index. The surface integral for ni-l is D ou d dz hyhz 2Dx,i,j,kDx,i-l,j,k (u u ) X ox y h D . + D . t,],k t-l,],k !1,_1 x x,t,],k x,t-l,],k (10.6) The other five surface integrals are approximated similarly. The volume integrals are (10.7) and (10.8) 255

PAGE 281

The stencil for grid point ( i, j, k) is given by where and hyhz x hx ai-1,j,k a'!.!. 0 0 hxhzaY hy i,j,k 0 0 2 Dx,i,j,kDx,i-1,j,k D k+D 1k' x,'l,J, x;z,], 2 Dy,i,j,kDy,i,j-1,k Dy,i,j,k + Dy,i,j-1,k' 2 Dz,i,j,kDz,i,j,k-1 Dz,i,j,k + Dz,i,j,k-1 10.2.2 Edge Boundary Finite Volumes (10.9) p (10.10) (10.11) (10.12) (10.13) Let the finite volume Oi,j,k have its southern edge at the southern boundary (y = 0) of the domain. 256

PAGE 282

10.2.3 Dirichlet Boundary Condition For the Dirichlet boundary condition we have U(s) = 9(s), so that the surface integral over ni,j-1,k is au 2hxhz Dy8dxdz = -h-Dy,i,j,k ui,j,k-u(s) nj-1 Y y (10.14) where u(s) means to evaluate u at the grid point (i,j-1, k). We now get the stencil 0 0 hxhzaY hy i,j,k hyhz x hx ai-1,j,k (10.15) 0 p 0 0 where hyhz x x hxhz y hxhy z z -a 1 k +a k + -h-a;,3,k + -ha k 1 +a k hx ,], y z (10.16) and the a's are given as before; see equations (10.10), (10.11), and (10.12). 10.2.4 Neumann and Robin Boundary Conditions The boundary condition along the southern boundary is au -+au = 9(s) on (s) We can make the approximation au on (s) 257 (10.17) (10.18)

PAGE 283

2 where U(s)-U(p) = hi (9(s)-a(s)U(s)), which gives The surface integral along the boundary is approximated by 2hxhz D 2 + h a y,i,j,k a(s)Ui,j,k 9(s) y (s) We now get the stencil 0 0 _hxhzaY hy i,j,k hyhz x -hx ai-l,j,k 0 p 0 0 258 (10.19) (10.20) (10.21) (10.22) (10.23)

PAGE 284

where where is defined in equation (10.16), the a's are defined by equations (10.10), (10.11), and (10.12), and 2hxhza(s) BC= D k 2 + hya(s) y,2,J' (10.24) The other boundary finite volume cases (faces, edges, and corners) can be easily deduced from the previous boundary conditions cases above. 259

PAGE 285

CHAPTER 11 3D NONSYMMETRIC: GRID TRANSFER OPERATORS The three dimensional grid transfer operators are the same as those used for the two dimensional grid transfer operators, except that the grid operators Lh and LH now have 27-point stencils. The three dimensional grid transfer coefficients are computed using the same type of grid decomposition method as were used in the second method of [29]. The computational method involves the formation of the grid transfer coefficients and the coarse grid operator by operator induced interpolation and Galerkin coarse grid approximation by performing consecutive semi-coarsening in each of the coordinate directions. The grid transfer coefficients are computed by an extension of the same methods that were used for the two dimensional grids, that is, the collapsing methods and the extension of Schaffer's ideas; see sections 3.5.1 through 3.6. The only difference is that instead of Ai-l, Ai+l, etc. representing points and lines respectively, they now represent points and planes. The three dimensional grid transfer operator stencil is a little more complex than the two dimensional ones; see figure 11.1. The computations of the grid transfer coefficients become quite clear if one draws several pictures; then the symmetry of the computations really stands out. The pictures are not presented here because they are hard to represent in a static 260

PAGE 286

yzne tne tnw/ ,I I I xzn I I xznw/ I / xzne I I I yznw 1 I tsw' I tse xyn z xynw l ,I I / y I xyne I I I I I xyw X I I xye I I xys I I xysw xyse bnw I 1 yzse bne I I I xzsw I I xzs I xzse bsw I I I bse yzsw Figure 11.1: Grid transfer operator's stencil in three dimensions. 261

PAGE 287

monochrome mode. 11.1 3D Grid Transfer Operations The fine grid points that are also coarse grid points use the identity as the interpolation operator. The coarse grid correction is then given by (11.1) where (Xi 1 Yi 1 Zk 1 ) = ( Xic, Yic, Zkc) on the grid; here the interpolation coefficient is 1. The fine grid points that are between two coarse grid points that share the same Yj and Zk coordinates use a two point relation for the interpolation. The coarse grid correction is given by (11.2) where Xic-1 < XiJ-1 < Xic, Yic = YiJ, and Zkc = ZkJ on the grid, and the interpolation coefficients are J. k and k c c, c c, c, c The fine grid points that are between two coarse grid points that share the same Xi and Zk coordinates use a similar two point relation for the interpolation. The coarse grid correction is then given by (11.3) where Xic = Xi1 Yic-1 < YiJ-1 < Yic' and Zkc = Zkf on the grid, and the interpolation ffi t Jxys d Jxyn coe c1en s are J. _1 k an J. k l>Cl C l C l>Cl Cl C The fine grid points that are between two coarse grid points that share the same Xi and Yj coordinates use a similar two point relation for the interpolation. The coarse grid correction is then given by (11.4) 262

PAGE 288

where Xic = Xi 1 Y]c = Y]f, and Zkc-l < Zk 1 -l < Zkc on the grid, and the interpolation coefficients are r:z]s k -l and r:z]n k (lc, c, c (lc, c, c For the fine grid points that share Zk coordinates, but do not share either a Xi or a Yj coordinate with the coarse grid, we use a four point relation for the interpolation, and the coarse grid correction is given by + (11.5) + where Xic < Xi 1 < Xic+l, Y]c < Y]j < Y]c+l, and Zkc = Zk 1 with interpolation coefficients Ixysw. Ixyne and Ixyse 2c-l,]c-l,kc' 2c-l,]c,kc' 2c,]c,kc' 2c,]c-l,kc For the fine grid points that share Xi coordinates, but do not share either a Yj or a Zk coordinate with the coarse grid, we use a four point relation for the interpolation, and the coarse grid correction is given by + (11.6) + where Xic = Xi1 Y]c < Y]j < Y]c+l, and Zkc < ZkJ < Zkc+l' with interpolation coeffiFor the fine grid points that share Yj coordinates, but do not share either a Xi or a Zk coordinate with the coarse grid, we use a four point relation for the interpolation, and the coarse grid correction is given by (11.7) 263

PAGE 289

where Xic < Xit < Xic+l, Yic = Yit, and Zkc < Zkt < Zkc+l, with interpolation coeffiLastly, for the fine grid points that do not share either a Xi, Yj, or Zk coordi-nates with the coarse grid, we use an eight point relation for the interpolation, and the coarse grid correction formula is given by + (11.8) + + + where Xic < Xit < Xic+l, Yic < Yit < YiJ+l, and Zkc < Zkt < Zkc+l, with interpola-tion coefficients J. -1 k Itnw Itse_ Jbsw Jbnw c c c tc-l,]c,kc' tc,)c,kc' tc,)c-l,kc' tc-l,]c-l,kc-1' tc-l,]c,kc-1> The prolongation operators also have a correction term, containing the residual, added to them to obtain an O(h2 ) error at the boundaries. The correction is similar to the one employed in the two dimensional case; see 3.1.1. 11.2 3D Nonsymmetric Grid Operator Lh: Collapsing Meth-ods To illustrate the grid transfer operators, we will present the nonsymmetric collapsing method in three dimensions. From this discussion it should be clear how to extend the other grid transfer operators from two to three dimensions. The I xyw coefficient is computed by (11.9) 264

PAGE 290

If, however, Ry:, is small, then I O"[West] xyw = -----=-------=------=-----=-O"[W est] + O"[East] (11.10) where O"[West] (11.11) O"[East] O"NEt + O"Et + O"SEt (11.12) are the west and east planes of the grid operator stencil. In (11.9)-(11.12), Iw is eval-uated at (xic-1, Yic' zkJ, and the other coefficients on the right hand side are evaluated at (xirl,YiJ,Zk1 ) for the Lh components. Let "( = min{IO"[West]l, IO"[East]l, 1.}. (11.13) Then by small we mean that (11.14) where (11.15) 265

PAGE 291

Prolongation coefficients which are computed in a similar way are I xye, I xys, I xyn, Ixzn, and Ixzs. The prolongation coefficients Ixynw, Ixyne, Ixysw, Ixyse, Ixznw, Ixzne, Ixzsw, Ixzse, Iyznw, Iyzne, Iyzsw, and Iyzse can all be computed in a similar fashion. The computation of these coefficients is analogous to the computation of Inw, Ine, Isw, and Ise from section 3.5.1. We now illustrate these computations by computing the prolongation coefficient Ixynw. I cr[NWzLine] + cr[NzLine] Ixyw + cr[WzLine] Ixyn xynw = C p where cr[NzLine] cr[WzLine] cr[NW zLine] (11.16) (11.17) (11.18) (11.19) and where the notation means to take the line in the z-direction that contains the given grid operator coefficients. If, however, RL. is small, then I cr[NWzLine] + cr[NzLine] Ixyw + cr[WzLine] Ixyn xynw cr [RL. Cp] (11.20) Let "( min{lcr[NWzLine]l, lcr[NzLine]l, lcr[NEzLine]l, lcr[WzLine]l, lcr[EzLine]l (11.21) lcr[SWzLine]l, lcr[SzLine]l, lcr[SEzLinet]l, 1.}. Then by small we mean that (11.22) 266

PAGE 292

where Ry:, is defined in equation (11.15). Ixynw, Ixyw and Ixyn are evaluated at (xic1, Yjc, zkJ and O"[NzLine], O"[WzLine], and O"[NWzLine] are evaluated at (xit-1' YiJ-1, Zkt ). Finally, the last eight prolongation coefficients, Itnw, Itne, Itsw, Itse, Ibnw, Ibne, Ibsw, and Ibse, are used to interpolate to fine grid points which do not align with any of the coarse grid lines. They can all be computed in a similar fashion, which will be illustrated for Itnw. Itnw O"NWt + O"Wp Iyzne + O"NWp Ixzn + O"Np Ixznw + O"Wt Ixyn + O"Nt Ixyw + O"Ct Ixynw If, however, Ry:, is small, then Let Itnw = O"NWt + O"Wp Iyzne + O"NWp Ixzn + O"Np Ixznw + O"Wt Ixyn + O"Nt Ixyw + O"Ct Ixynw 'Y min {IO"[West]l, IO"[N orth]l, IO"[East]l, Then by small we mean that (11.23) (11.24) (11.25) (11.26) where Ry:, is defined in equation (11.15). Iyzne, Ixzn, Ixznw, Ixynw, Ixyw, and Ixyn are evaluated at (xic-1, Yjc-1, Zkc1) and O"[West], O"[N orth], O"[East], O"[South], O"[Top], and O"[Bottom] are evaluated at (xit-1' YiJ-1, Zkr1) Note that, O"[Top] and O"[Bottom] are just the sum of the grid operator coefficients on the top (k + 1) and bottom (k-1) planes of the grid operator stencil respectively. 267

PAGE 293

The restriction operator coefficients are computed in the same way as above, but instead of using the symmetric part of the grid operator, O" L, we use the transpose of the grid operator, LT. 11.2.1 3D Grid Transfer Operator Variations With the information above on how to compute the basic grid transfer operators, it is easy to see how to extend all of the grid transfer operator variations, that we discussed in chapter 3, from 2D to 3D. 11.3 3D Coarse Grid Operator The three dimensional coarse grid operator is computed in the same way as the second method in [29]. The computational method involves the formation of the grid transfer coefficients and the coarse grid operator using auxiliary grids. The coarse grid operator is formed in a series of steps using a series of semi-coarsening auxiliary grids. Define an auxiliary grid Glj = x x Glj, which is just the grid Gh coarsened in the z-direction only. Now we define the grid transfer operator, ( Jj}) z : Glj --+ Gh. The grid transfer operator, (JJ}) z, can be constructed using any of the methods discussed, as can the other two grid transfer operators discussed. We now define the partial coarse grid operator to be (11.27) In a similar fashion we define Gf/z = x Gfj x Glj, and the grid transfer operator h -H -H ( J H )yz : G yz --+ G z The associated coarse grid operator is defined by (11.28) Finally, in a similar fashion, we define the coarse grid, GH = G{! x Gfj x Glj, and the 268

PAGE 294

grid transfer operator J'lf : GH --+ GfJz. The coarse grid operator is finally obtained by (11.29) The formation of the coarse grid operator in this way saves 31% and 50% of the operations for the seven and twenty-seven point grid operators respectively. As an added bonus the coding is much less complex and easier to debug. 269

PAGE 295

CHAPTER 12 3D SMOOTHERS There are several choices of relaxation methods that can be used for the 3D smoother. We have chosen to look at point, line, and plane Gauss-Seidel relaxation methods using either lexicographic or multi-color ordering. 12.1 Point Gauss-Seidel The point Gauss-Seidel method in three dimensions is the same as it is in two, but now there are more choices for the sweeping direction. We have chosen to only look at the lexicographic ordering, the red/black ordering for 7-point operators, and an 8-color ordering for 27-point operators. The red/black ordering is given by Red: i + j + k even (12.1) Black : i + j + k odd 270

PAGE 296

and the 8-color ordering is given by Black: i odd, J odd k odd Red: i odd, J odd k even orange: i odd, J even k odd yellow: i odd, j even k even (12.2) Green: z even, J odd k odd Blue: i even, J odd k even violet : i even, J even k odd white: i even, j even k even. 12.2 Line Gauss-Seidel We have three choices for the direction of the lines, for line Gauss-Seidel relaxation, either x-lines, y-lines, or z-lines. We can also look at alternating line relaxation, as we did in two dimensions, except that now we have four possibilities; x-andy-lines, y-and z-lines, x-and z-lines, or x-, y-, and z-lines. As before we can look at different orderings of the the lines. Lexicographic is a common choice, but it can not be parallelized. We can get better convergence and obtain parallelism and vectorization, across lines as in the two dimensional case, by using a zebra (red-black) ordering of the lines. For standard coarsening, the only choice of smoother, which might be robust, is alternating zebra line Gauss-Seidel relaxation. It will be a good smoother for some of the convection problems, but not others because each coordinate line direction sweep can handle anisotropies and convections with components in its coordinate direction. However, anisotropies in a plane not being sweeped by the lines will exhibit poor smoothing. 271

PAGE 297

12.3 Plane Gauss-Seidel In three dimensions we can now perform plane relaxation, which is analogous to line relaxation in two dimensions. Plane relaxation can be performed in several ways; xy-plane, yz-plane, xz-plane, or alternating plane relaxation. These methods can also be done in a lexicographic or red/black ordering of the planes. We need a robust method for our smoother and these can be found among those that perform plane relaxations [11]. In general, red/black ordering will give better results than lexicographic because it removes the directional dependencies that are associated with a sweeping direction. However, plane relaxation can not reduce the error orthogonal to the plane, and hence we must use alternating plane relaxation to obtain a robust smoother. Alternating red/black plane Gauss-Seidel relaxation is the most robust be cause it takes into account the three coordinate directions for anisotropies and con vection. One iteration of the method is performed by performing red/black xy-plane Gauss-Seidel relaxation followed by red/black yz-plane Gauss-Seidel relaxation, and finally followed by red/black xz-plane Gauss-Seidel relaxation. The question now arises as to how to efficiently perform the plane solves needed by the smoother. In 2D we used a cyclic reduction tri-diagonal solver to perform the line solves, but in 3D we are stuck with having to solve a sparse banded system. To perform LU factorization and solve for each plane would be very time consuming. We could save some time by saving the L U decompositions, but at the expense of memory. However, there is a better solution to our problem: we can use a 2D multigrid method to perform the plane solves. We have chosen to use the 2D black box multigrid method for the planes solves because it was designed for just such a mission. By using the 2D multigrid method we still need extra memory, but not as much as the L U method, and 272

PAGE 298

we can also perform multigrid much quicker than L U. One possible drawback to using the 2D multigrid method is that it is not an exact solver. This should not be much of a problem since the relaxation method gives only an improved approximation to the solution for each iteration. However, we do not want to degrade the convergence of the relaxation by providing poor approximations for the plane solves. We have found that it is usually sufficient to use a single V(1, 1) cycle in the 2D black box multigrid method with alternating zebra line Gauss-Seidel relaxation to obtain essentially the same results for the red/black plane Gauss-Seidel relaxation as when LU factorization is used. Depending on the convection charac teristics it is sometimes better to use either a V(2, 1)-cycle, W(1, 1)-cycle, or several V(1, 1)-cycles; however, this improvement in the plane solve accuracy is moot since the relaxation method can fail even when exact plane solves are used. 273

PAGE 299

CHAPTER 13 LOCAL MODE ANALYSIS IN THREE DIMENSIONS Local mode analysis of 3D smoothers is somewhat sparse in the literature and does not have adequate coverage for the range of problems that we wish to solve. In addition, there are only hints in the literature for how to perform local mode analysis for color relaxation in three dimensions, and we are unaware of the appearance elsewhere of the detailed analysis that we have presented in this chapter. The local (Fourier) mode analysis was described in section 5.3 for two dimensions, and we now extend it to three dimensions. 13.1 Overview of 3D Local Mode Analysis The continuous problem is discretized into a system of algebraic equations Lu=f where the grid G is defined by i = 1, ... nx G= 274 h -1 x---1 nx-h -1 Y --1 ny-(13.1) (13.2)

PAGE 300

The grid operator L can be represented in stencil notation as NWt Nt NEt NWp Np NEp Wt Ct Et Wp Cp Ep SWt St SEt SWp Sp SEp p (13.3) NWb Nb NEb wb cb Eb swb sb SEb b where the subscripts b, p, and t stand for the bottom ( k-1), plane ( k), and top ( k + 1) levels of the stencil. If the continuous problem has constant coefficients and periodic boundary conditions, then the stencils of [L], [M], and [N] are independent of the grid points (i,j, k). The eigenfunctions of the smoothing amplification matrix S are () E 8, (13.4) (13.5) If nx, ny and nz are assumed to be even, then the corresponding eigenvalues of S are y;, N ( K,) y;, M(K,) where K, = (lx, ly, lz) is a vector. () E 8, (13.6) We now define the sets of rough and smooth frequencies for the grid G, when the ratio between the fine and coarse grid spacings is two. The smooth frequencies are defined as 7f 7f 3 8s = 8 n --2' 2 (13.7) 275

PAGE 301

and the rough frequencies as (13.8) The Fourier smoothing factor is then defined to be 11 =max {1>.(0)1}, 0E8r (13.9) just as it was in two dimensions. The smoothing factor can be made grid size indepen-dent by changing the definition of 8 to be (13.10) For the case of multi-color relaxation, the if>i,j,k(O) are again not eigenfunc-tions any more, but certain subspaces spanned by their linear combinations are still invariant. Instead of four invariant subspaces, as in two dimensions, we now have eight invariant subspaces, which are defined as n1 8 n _:!!: I 3 u s 2' 2 02 = o;-sign(0;)1r, O!-sign(0!)1f 03 = o;-sign(o;)1f, 0 04 = o; sign(0;)1r, 0, O! sign(0!)1f 05 = 0, O! sign(0!)1f 06 = 01 0, 0, O! sign(0!)1f 07 = 01 -0, 0 08 = 01 o; sign(o;)1f, 0, 0 and 4>(0) is now written as The error before smoothing is now 276 (13.11) (13.12) (13.13)

PAGE 302

and after smoothing it is (13.14) where S(O) is the 8 x 8 amplification matrix, and co is a vector of dimension 8. The amplification matrix is computed in the same way as in the two dimensional case. For multi-colored relaxations, the definition of the Fourier smoothing factor, J-l, has to be modified. The rough Fourier modes are now given by and the smooth Fourier modes are now represented by 01 .../.. or z r 2. (13.15) (13.16) All of these values must be added to 8r. We now define a projection operator, Q(O), for ( 0) onto the Fourier modes, which is represented by the diagonal 8 x 8 matrix 8(0) 1 1 1 Q(O) = (13.17) 1 1 1 1 where 1 oi = for 't = x,y,z 8(0) = (13.18) 0 otherwise Define 8.s = 01 and the multi-color definition for the Fourier smoothing factor is given by J-l =max {p [Q(O)S(O)]} 0E8;; (13.19) 277

PAGE 303

where p denotes the spectral radius. The definitions for the smoothing factor can be modified, as in the two di mensional case, to take into account the Dirichlet boundary conditions. 13.2 Three Dimensional Model Problems The domain n is the unit cube for the three dimensional model problems: 1. -IJ.u = f 2. -El Uxx E2Uyy E3Uzz = f a) 0 < El E2 E3 b) 0 < El E2 E3 c) 0 < El E2 E3 d) 0 < El E2 E3 3. -EIJ.u -Ux = f 4. -EIJ.u + Ux = f 5. -E!J.U -Uy = f 6. -E!J.U + Uy = f 7. -E!J.U -Uz = f 8. -E!J.U + Uz = f 9. -E!J.U + Ux + Uy = f 10. -E!J.U + Ux -Uy = f 11. -E!J.U -Ux + Uy = f 12. -E!J.U -Ux -Uy = f 278

PAGE 304

13. + Ux + Uz = f 14. + Ux -Uz = f 15. -Ellu -Ux + Uz = f 16. -Ellu -Ux -Uz = f 17. -Eflu + Uy + Uz = f 18. -Eflu + Uy -Uz = f 19. -Eflu -Uy + Uz = f 20. -EflU -Uy -Uz = f 21. -EflU + Ux + Uy + Uz = f 22. -EflU + Ux + Uy -Uz = j 23. -EflU + Ux -Uy + Uz = j 24. -EflU + Ux -Uy -Uz = j 25. -EflU -Ux + Uy + Uz = j 26. -EflU -Ux + Uy -Uz = j 27. -Eflu -Ux -Uy + Uz = j 28. -Eflu -Ux -Uy -Uz = j where flu = Uxx + Uyy, E = 10-P for p = 0, 1, ... 5, and the are to be taken in all possible combinations. 279

PAGE 305

13.3 Local Mode Analysis for Point Gauss-Seidel Relax-at ion Local mode analysis results are presented for lexicographical and red/black ordering for point Gauss-Seidel relaxations. Point Gauss-Seidel relaxation with lexicographic ordering gives the splitting 0 [MJ = o cb o 0 0 [N] = 0 0 0 0 b b 0 0 0 The amplification factor ..\(0) is given by 0 0 0 0 0 0 p 0 0 -Ct 0 0 p Red/black point Gauss-Seidel relaxation has the amplification matrix a a 0 0 0 0 0 0 b b 0 0 0 0 0 0 0 0 c 0 0 0 0 c S(O) = 0 0 0 e 0 0 e 0 0 0 0 0 g g 0 0 0 0 0 0 h h 0 0 0 0 0 f 0 0 f 0 0 0 d 0 0 0 0 d 280 (13.20) (13.21) (13.22) (13.23)

PAGE 306

where a= 1+a, b = 1-a, c = ,8(1+,8), d = 1-,8, e = ')'(1+')'), f = 1-')', g = ry(1+ry), h = 17], and c Cb e-dJz + Sp e-dJy -Wp e-dJx -Ep + NP + Ct c ,8 'Y Cb Sp + Wp + Ep -NP + Ct c -Cb + Sp + Wp + Ep + Np Ct c 7] = The eigenvalues of Q(B)S(B) are >.1(8) = 0 >.2(8) = 0 >.3(8) = 0 >.4(0) = 0 >.5(8) = 1 -2 >.6(0) = 1 -2 1 + ,82 1 + 'Y2 >.7(8) = 1 "2 (1a+ J(B)(1 +a)) >.s(B) = 1 1 + 7]2 -2 (13.24) (13.25) (13.26) (13.27) The results of local mode analysis for the model problems from section 13.2 are shown in tables 13.1 through 13.4. The smoothing factors were computed numerically with the grid spacing h = 1 and the angles Bx, By, and Bz were sampled at two degree increments. Table 13.1 shows the results of the smoothing analysis for pure diffusion type problems. The point Gauss-Seidel relaxation is a reasonable smoothers for Poisson's equation, but not for anisotropic problems. 281

PAGE 307

Table 13.1. Smoothing factor J-l for point Gauss-Seidel relaxation in lexicographical (pGS-lex) and red/black (r/b-pGS) ordering for the indicated anisotropic diffusion problems ( see section 13.2) using central finite differences; where E = 10-P. problem Px Py Pz pGS-lex r/b-pGS 1 0.5669 0.7182 1 0 0 0.9093 0.9526 2b 300 0.9990 0.9993 50 0 0.9999 0.9998 1 1 0 0.8472 0.9187 2c 330 0.9980 0.9988 55 0 0.9999 0.9998 2 1 0 0.9821 0.9907 2d 53 0 0.9999 0.9998 Table 13.2. Smoothing factor J-l for point Gauss-Seidel relaxation in lexicographi cal (pGS-lex) and red/black (r/b-pGS) ordering for the indicated convection-diffusion problems (see section 13.2) using central and upstream finite differences; where E = 10-P. problem p pGS-lex r/b-pGS 0 0.6459 0.7515 3 1 0.8742 0.8808 3 0.9983 0.9978 0 0.5550 0.7515 4 1 0.5617 0.8808 3 0.5524 0.9978 0 0.6459 0.7515 5 1 0.8742 0.8808 3 0.9983 0.9978 0 0.5550 0.7515 6 1 0.5617 0.8808 3 0.5524 0.9978 0 0.6459 0.7515 7 1 0.8742 0.8808 3 0.9983 0.9978 0 0.5550 0.7515 8 1 0.5617 0.8808 3 0.5524 0.9978 282

PAGE 308

Table 13.3. Smoothing factors for point Gauss-Seidel relaxation with lexicographic and red/black ordering for the indicated convection-diffusion problems (see section 13.2); c: = w-P. problem p pGS-lex r/b-pGS 0 0.5460 0.7779 9 1 0.5482 0.9247 3 0.5059 0.9988 0 0.6385 0.7779 10 1 0.8715 0.9247 3 0.9980 0.9988 0 0.6385 0.7779 11 1 0.8715 0.9247 3 0.9980 0.9988 0 0.6999 0.7779 12 1 0.9251 0.9247 3 0.9991 0.9988 0 0.5460 0.7779 13 1 0.5489 0.9247 3 0.1312 0.9988 0 0.6384 0.7779 14 1 0.8715 0.9247 3 0.9980 0.9988 0 0.6385 0.7779 15 1 0.8715 0.9247 3 0.9980 0.9988 0 0.6998 0.7779 16 1 0.9251 0.9247 3 0.9991 0.9988 0 0.5459 0.7779 17 1 0.5484 0.9247 3 0.5423 0.9988 0 0.6385 0.7779 18 1 0.8715 0.9247 3 0.9980 0.9988 0 0.6385 0.7779 19 1 0.8715 0.9247 3 0.9980 0.9988 0 0.6998 0.7779 20 1 0.9251 0.9247 3 0.9991 0.9988 283

PAGE 309

Table 13.4. Smoothing factor f-L for point Gauss-Seidel relaxation in lexicographi cal (pGS-lex) and red/black (r/b-pGS) ordering for the indicated convection-diffusion problems (see section 13.2) using central and upstream finite differences; where c = lQ-P. problem p pGS-lex r/b-pGS 0 0.4176 0.7161 21 1 0.1295 0.7067 3 0.0017 0.7031 0 0.5749 0.7161 22 1 0.7970 0.7067 3 0.9967 0.7031 0 0.5749 0.7161 23 1 0.7969 0.7067 3 0.9967 0.7031 0 0.6513 0.7161 24 1 0.8851 0.7067 3 0.9984 0.7031 0 0.5749 0.7161 25 1 0.7969 0.7067 3 0.9968 0.7031 0 0.6513 0.7161 26 1 0.8851 0.7067 3 0.9985 0.7031 0 0.6513 0.7161 27 1 0.8851 0.7067 3 0.9984 0.7031 0 0.7048 0.7161 28 1 0.9199 0.7067 3 0.9990 0.7031 284

PAGE 310

The tables 13.2 through 13.4 show the results of the smoothing analysis for convection-diffusion problems. Most of the smoothing factors approach one as the convection terms become more dominant, which implies that point Gauss-Seidel is not a good smoother for these types of problems. However, lexicographic point GaussSeidel relaxation exhibits good smoothing properties when the convection characteristic coincides with that of the sweeping direction. 13.4 Local Mode Analysis for Line Gauss-Seidel Relax-at ion The line Gauss-Seidel relaxation can be implemented in many ways for three dimensional problems. It can be done by lines in any of the three axis directions. The ordering of the lines of unknowns can be done in many ways. Local mode analysis results are presented for lexicographical and zebra (red/black) ordering for x-line Gauss-Seidel relaxations and alternating line Gauss-Seidel relaxation. X-line Gauss-Seidel relaxation with lexicographic ordering gives the splitting 0 0 0 [M]= 0 cb 0 Wp Cp Ep 0 0 0 (13.28) 0 Sp 0 b p 0 -Np 0 [N]= 0 0 0 0 0 0 0 -Ct 0 (13.29) 0 0 0 b p The amplification factor .A(O) is given by (13.30) 285

PAGE 311

Zebra x-line Gauss-Seidel relaxation has the amplification matrix a 0 -a 0 0 0 0 0 0 c 0 0 0 0 0 -c b 0 -b 0 0 0 0 0 S(O) = 0 0 0 e -e 0 0 0 (13.31) 0 0 0 f -f 0 0 0 0 0 0 0 0 g -g 0 0 0 0 0 0 h -h 0 0 d 0 0 0 0 0 -d where a= a(1+a), b = a(l-a), c = ,8(1+,8), d = ,8(1-,8), e = 'Y(1+'Y), f = 1'(1-')'), g = 77 ( 1 + 77), h = 77 ( 1 77), and a ,B 'T] = Cb e-dJz + Sp e-dJy + Np + Ct Wp + Cp + Ep Cb + Sp + Np + Ct Wp + Cp Ep Cb Sp -Np + Ct Wp + Cp Ep Cb Sp -Np + Ct Wp + Cp + Ep The eigenvalues of Q(O)S(O) are >.1(0) = 0 >.2(0) = 0 >.3(0) = 0 >.4(0) = 0 >.s ( 0) = ,82 >.6 ( 0) = 'Y2 286 (13.32) (13.33) (13.34) (13.35)

PAGE 312

>.7(8) = r? 1 >.s(B) = 2a (a1 + J(B)(1 + ry)). The alternating line Gauss-Seidel relaxation with lexicographic ordering amplification factor >.(B) is given by (13.36) where Axlgs(B), Aylgs(B), and Azlgs(B) are the x-, y-, and z-line Gauss-Seidel amplifi-cation factors respectively, given by Azlgs(B) = ICb e-d}z + Wp e-dJx + Ep edJx + Cp + Sp e-d}y I IEp + Ct ICb + Wp + Cp + Sp + Np I I Ep + Np I (13.37) (13.38) (13.39) The zebra alternating line Gauss-Seidel relaxation amplification matrix S(B) is given by S(B) = Bxz9s(B) Sytgs(B) Bzz9s(B) (13.40) where Sxlgs(B), Sylgs(B), and Sylgs(B) are the zebra x-, y-, and z-line Gauss-Seidel amplification matrices respectively. The zebra x-line Gauss-Seidel amplification matrix Sxlgs(B) is given in equation (13.31), and the amplification matrices for Sylgs(B) and 287

PAGE 313

Bzlgs(B) are given by a 0 0 -a 0 0 0 0 0 c 0 0 0 0 -c 0 0 0 e 0 -e 0 0 0 1 b 0 0 -b 0 0 0 0 Syt9s(B) = 2 (13.41) 0 0 f 0 -f 0 0 0 0 0 0 0 0 g 0 -g 0 d 0 0 0 0 -d 0 0 0 0 0 0 h 0 -h where a= a(1+a), b = a(l-a), c = ,8(1+,8), d = ,8(1-,8), e = 'Y(1+'Y), f = 1'(1-')'), g = 77 ( 1 + 77), h = 77 ( 1 77), and a Cb e-dlz + Wp e-dlx + Ep + Ct (13.42) S + C + N p p p ,B Cb + Wp + Ep + Ct (13.43) -S + C N p p p 'Y Cb Wp -Ep + Ct (13.44) -S + C N p p p 'T] Cb Wp -Ep + Ct (13.45) S + C + N p p p and a 0 0 0 -a 0 0 0 0 c 0 0 0 -c 0 0 0 0 e -e 0 0 0 0 S(B) = 0 0 f -f 0 0 0 0 (13.46) b 0 0 0 -b 0 0 0 0 d 0 0 0 -d 0 0 0 0 0 0 0 0 g -g 0 0 0 0 0 0 h -h 288

PAGE 314

Table 13.5. Smoothing factor f-L for x-, y-, and z-line and alternating line GaussSeidel relaxation in lexicographic ordering, xlGS, ylGS, zlGS, and alGS respectively, for the indicated anisotropic diffusion problems (see section 13.2) using central finite differences; where c = 10-P. problem p xlGS ylGS zlGS alGS 1 0.5000 0.5000 0.5000 0.1096 1,0 0.9091 0.8347 0.8347 0.6332 2b 3,0 0.9990 0.9980 0.9980 0.9950 5,0 0.9999 0.9999 0.9999 0.9999 1,0 0.8462 0.8461 0.5000 0.3396 2c 3,0 0.9980 0.9980 0.5000 0.4976 5,0 0.9999 0.9999 0.5000 0.5000 2,1,0 0.9821 0.9804 0.8347 0.8036 2d 5,3,0 0.9999 0.9999 0.9804 0.9804 where a= a(1+a), b = a(l-a), c = ,8(1+,8), d = ,8(1-,8), e = 1'(1+')'), f = 1'(1-')'), g = ry(1 + ry), h = ry(1-ry), and a Cb e-dJz + Cp + Ct (13.47) S + lV; + E + N p p p p (13.48) ,8 -Cb + Cp Ct s lV; -E + N p p p p (13.49) -Cb + Cp Ct S lV; -E + N p p p p (13.50) cb + Cp + Ct 7] = The zebra alternating line Gauss-Seidel amplification matrix S(B) can be computed numerically and then its eigenvalues can be found and evaluated on 8.s. The results of local mode analysis for the model problems from section 13.2 are shown in tables 13.5 through 13.8. The smoothing factors were computed numerically with the grid spacing h = 1 and the angles Bx, By, and Bz were sampled at 2 degree increments. Tables 13.6 through 13.8 show the smoothing factors for the convectiondiffusion model problems for lexicographic line Gauss-Seidel relaxation. The smoothing 289

PAGE 315

Table 13.6. Smoothing factor 1-1 for x-, y-, and z-line and alternating line Gauss-Seidel relaxation in lexicographic ordering, xlGS, ylGS, zlGS, and alGS respectively, for the indicated convection-diffusion problems (see section 13.2) using central and upwind finite differences; where c = 10-P. problem p xlGS ylGS zlGS alGS 0 0.5047 0.6059 0.6059 0.1626 3 1 0.5411 0.8751 0.8751 0.3782 3 0.1250 0.9984 0.9984 0.1241 0 0.5047 0.5000 0.4999 0.1049 4 1 0.5411 0.4992 0.4992 0.1089 3 0.1250 0.1119 0.1119 0.0010 0 0.6059 0.5047 0.6059 0.1625 5 1 0.8751 0.5411 0.8751 0.3783 3 0.9984 0.5000 0.9984 0.4976 0 0.5000 0.5047 0.4999 0.1049 6 1 0.5000 0.5411 0.4992 0.1089 3 0.5000 0.5000 0.4472 0.1041 0 0.6059 0.6059 0.5047 0.1625 7 1 0.8751 0.8751 0.5411 0.3783 3 0.9984 0.9984 0.5000 0.4976 0 0.5000 0.5000 0.5047 0.1049 8 1 0.5000 0.5000 0.5411 0.1085 3 0.5000 0.5000 0.5000 0.1041 290

PAGE 316

Table 13.7. Smoothing factor for x-, y-, and z-line and alternating line Gauss-Seidel with lexicographic ordering, xlGS, ylGS, zlGS, and alGS respectively, for convection diffusion problems (see section 13.2); where E = 10-P. problem p xlGS ylGS zlGS alGS 0 0.4680 0.4681 0.4998 0.1011 9 1 0.4655 0.4654 0.4995 0.1037 3 0.4125 0.4125 0.4645 0.0779 0 0.6089 0.4680 0.5999 0.1551 10 1 0.8778 0.4655 0.8569 0.3397 3 0.9981 0.4125 0.9977 0.4106 0 0.4680 0.6089 0.5999 0.1551 11 1 0.4655 0.8778 0.8569 0.3397 3 0.4125 0.9984 0.9980 0.4106 0 0.6089 0.6087 0.6667 0.2190 12 1 0.8778 0.8778 0.9167 0.6666 3 0.9981 0.9984 0.9990 0.9945 0 0.4681 0.4999 0.4680 0.1011 13 1 0.4648 0.4992 0.4645 0.1039 3 0.0844 0.1114 0.0846 7.9E-4 0 0.6089 0.5999 0.4680 0.1551 14 1 0.8778 0.8569 0.4645 0.3393 3 0.9981 0.9977 0.0846 0.0842 0 0.4681 0.6000 0.6088 0.1551 15 1 0.4648 0.8571 0.8779 0.3390 3 0.0844 0.9980 0.9984 0.0841 0 0.4681 0.6666 0.6088 0.2190 16 1 0.4648 0.9167 0.8779 0.6666 3 0.0844 0.9990 0.9984 0.9947 0 0.5000 0.4681 0.4681 0.1010 17 1 0.5000 0.4654 0.4654 0.1034 3 0.5000 0.4472 0.4472 0.1000 0 0.6000 0.6089 0.4681 0.1551 18 1 0.8571 0.8778 0.4654 0.3391 3 0.9980 0.9984 0.4472 0.4454 0 0.6000 0.4681 0.6088 0.1551 19 1 0.8571 0.4654 0.8779 0.3391 3 0.9980 0.4472 0.9984 0.4454 0 0.6667 0.6089 0.6088 0.2190 20 1 0.9167 0.8778 0.8779 0.6665 3 0.9990 0.9984 0.9984 0.9950 291

PAGE 317

Table 13.8. Smoothing factor JL for x-, y-, and z-line and alternating line Gauss-Seidel relaxation in lexicographic ordering, xlGS, ylGS, zlGS, and alGS respectively, for the indicated convection-diffusion problems (see section 13.2) using central and upwind finite differences; where c = 10-P. problem p xlGS ylGS zlGS alGS 0 0.3956 0.3956 0.3956 0.0395 21 1 0.1302 0.1302 0.1302 9.1E-4 3 0.0017 0.0017 0.0017 1.8E-9 0 0.5699 0.5699 0.3956 0.1195 22 1 0.8488 0.8488 0.1302 0.0852 3 0.9977 0.9977 0.0017 0.0015 0 0.5699 0.3956 0.5699 0.1195 23 1 0.8489 0.1302 0.8488 0.0853 3 0.9977 0.0017 0.9977 0.0015 0 0.6686 0.5699 0.5699 0.1662 24 1 0.9175 0.8488 0.8488 0.2927 3 0.9989 0.9977 0.9977 0.3422 0 0.3956 0.5699 0.5699 0.1194 25 1 0.1302 0.8488 0.8488 0.0853 3 0.0017 0.9980 0.9980 0.0015 0 0.5699 0.6686 0.5699 0.1662 26 1 0.8488 0.9175 0.8488 0.2928 3 0.9977 0.9990 0.9980 0.3424 0 0.5699 0.5699 0.6686 0.1663 27 1 0.8489 0.8489 0.9174 0.2928 3 0.9977 0.9980 0.9990 0.3424 0 0.6686 0.6686 0.6686 0.2122 28 1 0.9175 0.9175 0.9174 0.3612 3 0.9989 0.9990 0.9990 0.3949 292

PAGE 318

Table 13.9. Smoothing factor f-L for zebra x-, y-, and z-line and alternating line Gauss-Seidel relaxation, ZxlGS, ZylGS, ZzlGS, and ZalGS respectively, for the indicated anisotropic diffusion problems (see section 13.2) using central finite differences; where c = 10-P. problem p ZxlGS ZylGS ZzlGS ZalGS 1 0.4444 0.4444 0.4444 0.0278 1,0 0.9070 0.8264 0.8264 0.6195 2b 3,0 0.9990 0.9980 0.9980 0.9950 5,0 0.9999 0.9999 0.9999 0.9999 1,0 0.8403 0.8403 0.2500 0.1736 2c 3,0 0.9980 0.9980 0.2500 0.2490 5,0 0.9999 0.9999 0.2500 0.2500 2,1,0 0.9821 0.9803 0.8264 0.7956 2d 5,3,0 0.9999 0.9999 0.9803 0.9803 factors for line relaxation are good when the convection term characteristics are in the same direction as the lines. The smoothing factor becomes better (smaller) the more the convection terms dominate if the characteristics are in the direction of the lines. If the characteristics are not in the direction of the lines, then the smoothing factor degenerates, quickly approaching one the more the convection terms dominate. 13.5 Local Mode Analysis for Plane Gauss-Seidel Relax-at ion We analyze plane Gauss-Seidel relaxation for xy-plane and alternating plane using lexicographic and zebra ordering. XY -plane Gauss-Seidel relaxation with lexicographic ordering gives the split-ting 0 Np 0 [M]= 0 cb 0 Wp Cp Ep 0 0 0 (13.51) 0 Sp 0 b p 293

PAGE 319

Table 13.10. Smoothing factor J-l for zebra x-, y-, and z-line and alternating line Gauss-Seidel relaxation, ZxlGS, ZylGS, ZzlGS, and ZalGS respectively, for the indicated convection-diffusion problems (see section 13.2) using central and upwind finite differences; where c = 10-P. problem p ZxlGS ZylGS ZzlGS ZalGS 0 0.3200 0.5102 0.5102 0.0459 3 1 0.2500 0.7656 0.7656 0.1406 3 0.2500 0.9960 0.9960 0.2480 0 0.3200 0.5102 0.5102 0.0459 4 1 0.2500 0.7656 0.7656 0.1406 3 0.2500 0.9960 0.9960 0.2480 0 0.5102 0.3200 0.5102 0.0459 5 1 0.7656 0.2500 0.7656 0.1406 3 0.9960 0.2500 0.9960 0.2480 0 0.5102 0.3200 0.5102 0.0459 6 1 0.7656 0.2500 0.7656 0.1406 3 0.9960 0.2500 0.9960 0.2480 0 0.5102 0.5102 0.3200 0.0459 7 1 0.7656 0.7656 0.2500 0.1406 3 0.9960 0.9960 0.2500 0.2480 0 0.5102 0.5102 0.3200 0.0459 8 1 0.7656 0.7656 0.2500 0.1406 3 0.9960 0.9960 0.2500 0.2480 294

PAGE 320

Table 13.11. Smoothing factor for zebra x-, y-, and z-line and alternating line Gauss-Seidel relaxation, ZxlGS, ZylGS, ZzlGS, and ZalGS respectively, for convection diffusion problems (see section 13.2); where E = 10-P. problem p ZxlGS ZylGS ZzlGS ZalGS 0 0.3846 0.3846 0.5625 0.0729 9 1 0.7347 0.7347 0.8521 0.4599 3 0.9960 0.9960 0.9980 0.9901 0 0.3846 0.3846 0.5625 0.0729 10 1 0.7347 0.7347 0.8521 0.4599 3 0.9960 0.9960 0.9980 0.9901 0 0.3846 0.3846 0.5625 0.0729 11 1 0.7347 0.7347 0.8521 0.4599 3 0.9960 0.9960 0.9980 0.9901 0 0.3846 0.3846 0.5625 0.0729 12 1 0.7347 0.7347 0.8521 0.4599 3 0.9960 0.9960 0.9980 0.9901 0 0.3846 0.5625 0.3846 0.0729 13 1 0.7347 0.8521 0.7347 0.4599 3 0.9960 0.9980 0.9960 0.9901 0 0.3846 0.5625 0.3846 0.0729 14 1 0.7347 0.8521 0.7347 0.4599 3 0.9960 0.9980 0.9960 0.9901 0 0.3846 0.5625 0.3846 0.0729 15 1 0.7347 0.8521 0.7347 0.4599 3 0.9960 0.9980 0.9960 0.9901 0 0.3846 0.5625 0.3846 0.0729 16 1 0.7347 0.9127 0.7347 0.4599 3 0.9960 0.9980 0.9960 0.9901 0 0.5625 0.3846 0.3846 0.0729 17 1 0.8521 0.7347 0.7347 0.4599 3 0.9980 0.9960 0.9960 0.9901 0 0.5625 0.3846 0.3846 0.0729 18 1 0.8521 0.7347 0.7347 0.4599 3 0.9980 0.9960 0.9960 0.9901 0 0.5625 0.3846 0.3846 0.0729 19 1 0.8521 0.7347 0.7347 0.4599 3 0.9980 0.9960 0.9980 0.9901 0 0.5625 0.3846 0.3846 0.0729 20 1 0.8521 0.7347 0.7347 0.4599 3 0.9980 0.9960 0.9980 0.9901 295

PAGE 321

Table 13.12. Smoothing factor J.l for zebra x-, y-, and z-line and alternating line Gauss-Seidel relaxation, ZxlGS, ZylGS, ZzlGS, and ZalGS respectively, for the indicated convection-diffusion problems (see section 13.2) using central and upwind finite differences; where c = 10-P. problem p ZxlGS ZylGS ZzlGS ZalGS 0 0.4390 0.4390 0.4390 0.0342 21 1 0.6944 0.6944 0.6944 0.0968 3 0.9960 0.9960 0.9960 0.1553 0 0.4390 0.4390 0.4390 0.0342 22 1 0.6944 0.6944 0.6944 0.0968 3 0.9960 0.9960 0.9960 0.1553 0 0.4390 0.4390 0.4390 0.0342 23 1 0.6944 0.6944 0.6944 0.0968 3 0.9960 0.9960 0.9960 0.1553 0 0.4390 0.4390 0.4390 0.0342 24 1 0.6944 0.6944 0.6944 0.0968 3 0.9960 0.9960 0.9960 0.1553 0 0.4390 0.4390 0.4390 0.0342 25 1 0.6900 0.6944 0.6944 0.0968 3 0.9960 0.9960 0.9960 0.1553 0 0.4390 0.4390 0.4390 0.0342 26 1 0.6944 0.6944 0.6944 0.0968 3 0.9960 0.9960 0.9960 0.1553 0 0.4390 0.4390 0.4390 0.0342 27 1 0.6944 0.6944 0.6944 0.0968 3 0.9960 0.9960 0.9960 0.1553 0 0.4390 0.4390 0.4390 0.0342 28 1 0.6944 0.6944 0.6944 0.0968 3 0.9960 0.9960 0.9960 0.1553 296

PAGE 322

0 [N] = 0 0 0 0 0 0 0 0 0 b The amplification factor ..\(0) is given by p 0 0 -Ct 0 0 Zebra xy-plane Gauss-Seidel relaxation has the amplification matrix a 0 0 0 0 -a 0 0 0 c 0 0 -c 0 0 0 0 0 e 0 0 0 -e 0 S(O) = 0 0 0 g 0 0 0 -g 0 d 0 0 -d 0 0 0 b 0 0 0 0 -b 0 0 0 0 f 0 0 0 -f 0 0 0 0 h 0 0 0 -h (13.52) (13.53) (13.54) where a= a(1+a), b = a(l-a), c = ,8(1+,8), d = ,8(1-,8), e = ')'(1+')'), f = ')'(1-')'), g = TJ(1 + TJ), h = TJ(1 -TJ), and ,8 1] = Sp e-d}y + Wp e-dJx Cp + Ep + Np -Cb Ct The eigenvalues of Q(O)S(O) are 297 (13.55) (13.56) (13.57) (13.58)

PAGE 323

.A2(B) = 0 .A3(B) = 0 .A4(B) = 0 .A5(B) = {32 .A6(B) = 'Y2 .A7(B) = 'f/2 -As(B) = 1 2a (a1 + J(B)(1 + ry)). The alternating plane Gauss-Seidel relaxation with lexicographic ordering has the amplification factor .A( B) given by (13.59) where Axypgs(B), Ayzpgs(B), and Axzpgs(B) are the xy-, yz-, and xz-plane Gauss-Seidel amplification factors respectively. The xy-plane Gauss-Seidel amplification factor is given by equation (13.53), and the others by Ayzpgs(B) Axzpgs(B) = The zebra alternating plane Gauss-Seidel relaxation amplification matrix S(B) is given by S(B) = Bxypgs(B) Byzpgs(B) Bxzpgs(B) (13.62) where Bxypgs(B), Syzpgs(B), and Bxzpgs(B) are the zebra xy-, yz-, and xz-plane Gauss-Seidel amplification matrices respectively. The zebra xy-plane Gauss-Seidel amplifica-tion matrix Bxypgs(B) was given in equation (13.54), and the amplification matrices for 298

PAGE 324

Syzpgs ( 0) and Bxzpgs ( 0) are given by a 0 0 0 0 0 0 -a 0 c -c 0 0 0 0 0 0 d -d 0 0 0 0 0 1 0 0 0 e 0 -e 0 0 Syzpgs(O) = 2 (13.63) 0 0 0 0 g 0 -g 0 0 0 0 f 0 -f 0 0 0 0 0 0 h 0 -h 0 b 0 0 0 0 0 0 -b where a= a(1+a), b = a(l-a), c = ,8(1+,8), d = ,8(1-,8), e = ')'(1+')'), f = ')'(1-')'), g = TJ(1 + TJ), h = TJ(1 -TJ), and a IWp e-dlx -Ep I (13.64) ICb + Sp + Cp + Np + Ct I' ,8 IWp -Ep I (13.65) ICb + Sp Cp + Np + Ct I' 'Y IWp -Ep I (13.66) ICb Sp CpNp + Ct I' 1] IWp + Ep I (13.67) ICb Sp + CpNp + Ct eOz I and a 0 0 0 0 0 -a 0 0 c 0 -c 0 0 0 0 0 0 e 0 0 -e 0 0 1 0 d 0 -d 0 0 0 0 Bxzpgs(O) = 2 (13.68) 0 0 0 0 g 0 0 -g 0 0 f 0 0 -f 0 0 b 0 0 0 0 0 -b 0 0 0 0 0 h 0 0 -h 299

PAGE 325

Table 13.13. Smoothing factor 11 for xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation in lexicographic ordering, xyplGS, xzplGS, yzplGS, and AplGS respectively, for the indicated anisotropic diffusion problems (see section 13.2) using central finite differences; where c = 10-P. problem p xyplGS xzplGS yzplGS AplGS 1 0.4472 0.4472 0.4472 0.0497 1,0 0.8333 0.8333 0.4472 0.3106 2b 3,0 0.9980 0.9980 0.4472 0.4454 5,0 0.9999 0.9999 0.4472 0.4472 1,0 0.8333 0.4472 0.4472 0.1242 2c 3,0 0.9980 0.4472 0.4472 0.1488 5,0 0.9999 0.4472 0.4472 0.1491 2,1,0 0.9980 0.8333 0.4472 0.3654 2d 5,3,0 0.9999 0.9804 0.4472 0.4384 where a= a(1+a), b = a(l-a), c = ,8(1+,8), d = ,8(1-,8), e = 1'(1+')'), f = 1'(1-')'), g = ry(1 + ry), h = ry(1-ry), and ,8 7] = ICb e-d}z + Wp e-dJx + Cp + Ep edJx + Ct I' I Sp e-dJy -NP I ICb + Wp Cp + Ep + Ct I' I Sp -Np I ICb -Wp CpEp + Ct I' ISp + Np I ICb -Wp + CpEp + Ct I' (13.69) (13.70) (13.71) (13.72) The zebra alternating plane Gauss-Seidel amplification matrix S(O) can be computed and the eigenvalues can then be found and evaluated on B.s. The results of local mode analysis for the model problems from section 13.2 are shown in tables 13.13 through 13.20. The smoothing factors were computed numerically with the grid spacing h = 1 and the angles Ox, Oy, and (}z were sampled at 2 degree increments. Table 13.13 shows the results of the local mode analysis for diffusion problems from section 13.2. We can see that single plane relaxation does not yield good 300

PAGE 326

Table 13.14. Smoothing factor J-L for xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation in lexicographic ordering, xyplGS, xzplGS, yzplGS, and AplGS respectively, for the indicated convection-diffusion problems (see section 13.2) using central and upstream finite differences; where c = 10-P. problem p xyplGS xzplGS yzplGS AplGS 0 0.4535 0.4535 0.6323 0.0750 3 1 0.4878 0.4878 0.9135 0.1335 3 0.5000 0.5000 0.9990 0.0035 0 0.4535 0.4535 0.3333 0.0500 4 1 0.4878 0.4878 0.3333 0.0514 3 0.5000 0.5000 0.3333 2.0E-4 0 0.4535 0.6323 0.4535 0.0751 5 1 0.4878 0.9135 0.4878 0.1335 3 0.5000 0.9990 0.5000 0.1488 0 0.4535 0.3333 0.4535 0.0500 6 1 0.4878 0.3333 0.4878 0.0516 3 0.5000 0.3333 0.5000 0.0497 0 0.6323 0.4535 0.4535 0.0751 7 1 0.9135 0.4878 0.4878 0.1329 3 0.9990 0.5000 0.5000 0.1488 0 0.3333 0.4535 0.4535 0.0500 8 1 0.3333 0.4878 0.4878 0.0517 3 0.3333 0.5000 0.5000 0.0497 301


Table 13.15. Smoothing factor μ for xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation with lexicographic ordering (xyplGS, xzplGS, yzplGS, and AplGS respectively) for convection-diffusion problems (see section 13.2); ε = 10^{-p}.

  problem   p   xyplGS   xzplGS   yzplGS   AplGS
     9      0   0.4587   0.3333   0.3333   0.0502
            1   0.4918   0.3333   0.3333   0.0518
            3   0.5000   0.3333   0.3333   0.0364
    10      0   0.4587   0.6355   0.3333   0.0754
            1   0.4918   0.9147   0.3333   0.1336
            3   0.5000   0.9987   0.3333   0.1201
    11      0   0.4587   0.3333   0.6355   0.0754
            1   0.4918   0.3333   0.9147   0.1336
            3   0.5000   0.3333   0.9987   0.1201
    12      0   0.4587   0.6355   0.6355   0.1136
            1   0.4918   0.9147   0.9147   0.3507
            3   0.5000   0.9987   0.9987   0.3961
    13      0   0.3333   0.4587   0.3333   0.0502
            1   0.3333   0.4918   0.3333   0.0518
            3   0.3333   0.5000   0.3333   2.0E-4
    14      0   0.6355   0.4587   0.3333   0.0754
            1   0.9147   0.4918   0.3333   0.1340
            3   0.9987   0.5000   0.3333   0.0034
    15      0   0.3333   0.4587   0.6355   0.0754
            1   0.3333   0.4918   0.9147   0.1338
            3   0.3333   0.5000   0.9987   0.0034
    16      0   0.6355   0.4587   0.6355   0.1136
            1   0.9147   0.4918   0.9147   0.3520
            3   0.9987   0.5000   0.9987   0.0060
    17      0   0.3333   0.3333   0.4587   0.0501
            1   0.3333   0.3333   0.4918   0.0516
            3   0.3333   0.3333   0.5000   0.0497
    18      0   0.6355   0.3333   0.4587   0.0754
            1   0.9147   0.3333   0.4918   0.1338
            3   0.9987   0.3333   0.5000   0.1488
    19      0   0.3333   0.6355   0.4587   0.0754
            1   0.3333   0.9147   0.4918   0.1338
            3   0.3333   0.9987   0.5000   0.1488
    20      0   0.6355   0.6355   0.4587   0.1136
            1   0.9147   0.9147   0.4918   0.3508
            3   0.9987   0.9987   0.5000   0.4454


Table 13.16. Smoothing factor μ for xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation in lexicographic ordering (xyplGS, xzplGS, yzplGS, and AplGS respectively) for the indicated convection-diffusion problems (see section 13.2) using central and upstream finite differences; ε = 10^{-p}.

  problem   p   xyplGS   xzplGS   yzplGS   AplGS
    21      0   0.2852   0.2852   0.2852   0.0170
            1   0.0713   0.0713   0.0713   3.4E-4
            3   8.6E-4   8.6E-4   8.6E-4   6.3E-10
    22      0   0.6382   0.2852   0.2852   0.0373
            1   0.9150   0.0713   0.0713   0.0038
            3   0.9987   8.6E-4   8.6E-4   6.2E-7
    23      0   0.2852   0.6382   0.2852   0.0373
            1   0.0713   0.9150   0.0713   0.0038
            3   8.6E-4   0.9987   8.6E-4   6.2E-7
    24      0   0.6382   0.6382   0.2852   0.0603
            1   0.9150   0.9150   0.0713   0.0256
            3   0.9987   0.9987   8.6E-4   3.5E-4
    25      0   0.2852   0.2852   0.6382   0.0373
            1   0.0713   0.0713   0.9150   0.0038
            3   8.6E-4   8.6E-4   0.9987   6.2E-7
    26      0   0.6382   0.2852   0.6382   0.0603
            1   0.9150   0.0713   0.9150   0.0256
            3   0.9987   8.6E-4   0.9987   3.5E-4
    27      0   0.2852   0.6382   0.6382   0.0603
            1   0.0713   0.9150   0.9150   0.0256
            3   8.6E-4   0.9987   0.9987   3.5E-4
    28      0   0.6382   0.6382   0.6382   0.0978
            1   0.9150   0.9150   0.9150   0.1760
            3   0.9987   0.9987   0.9987   0.2020


Table 13.17. Smoothing factor μ for zebra xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation (ZxyplGS, ZxzplGS, ZyzplGS, and AZplGS respectively) for the indicated anisotropic diffusion problems (see section 13.2) using central finite differences; ε = 10^{-p}.

  problem     p      ZxyplGS   ZxzplGS   ZyzplGS   AZplGS
     1               0.2500    0.2500    0.2500    0.0037
     2b       1      0.8264    0.8264    0.1250    0.0719
              3      0.9980    0.9980    0.1249    0.1240
              5      0.9999    0.9999    0.0154    0.0154
     2c       1      0.8264    0.2500    0.2500    0.0160
              3      0.9980    0.2500    0.2500    0.0175
              5      0.9999    0.2500    0.2500    8.4E-4
     2d     2,1,0    0.9803    0.8264    0.1250    0.0915
            5,3,0    0.9999    0.9803    0.0289    0.0210

We can see that single plane relaxation does not yield good smoothing factors when the dominant plane is not the plane of the relaxation method. Tables 13.14 through 13.16 show local mode smoothing factors for convection-diffusion problems from section 13.2. Good smoothing factors are obtained in most cases, except when the sweeping direction is against the convection flow. Table 13.17 shows the results of the local mode analysis for diffusion problems from section 13.2. Relaxation in a single plane with zebra ordering does not give a good smoothing factor unless the problem is either isotropic or the anisotropies lie in the plane of relaxation. Tables 13.18 through 13.20 show local mode smoothing factors for convection-diffusion problems from section 13.2. Good smoothing factors are obtained if the convection lies in the plane of relaxation; otherwise the smoothing factor degrades quickly with the increased dominance of the convection terms over the diffusion terms.
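The numerical sampling procedure used for these smoothing factors can be illustrated with a small program. The following Fortran sketch computes a smoothing factor by sampling an amplification symbol at 2 degree increments over the high-frequency harmonics. For brevity it uses lexicographic point Gauss-Seidel on the 3D 7-point Laplacian, whose symbol is well known, rather than the plane smoothers tabulated above, so it illustrates the procedure and not these tables' entries.

      program smoothing_factor
      ! Numerically estimate the local mode smoothing factor mu by
      ! sampling the amplification symbol on the high-frequency range
      ! max(|tx|,|ty|,|tz|) >= pi/2, at 2 degree increments as in the
      ! text.  The symbol below is that of lexicographic point
      ! Gauss-Seidel for the 3D 7-point Laplacian.
      implicit none
      integer :: ix, iy, iz
      real(8), parameter :: pi = 3.141592653589793d0
      real(8), parameter :: step = pi/90.0d0   ! 2 degrees in radians
      real(8) :: tx, ty, tz, mu
      complex(8) :: num, den
      complex(8), parameter :: i1 = (0.0d0, 1.0d0)
      mu = 0.0d0
      do ix = -90, 89
         tx = ix*step
         do iy = -90, 89
            ty = iy*step
            do iz = -90, 89
               tz = iz*step
               ! skip the smooth (low-frequency) modes
               if (max(abs(tx), abs(ty), abs(tz)) < pi/2.0d0) cycle
               num = exp(i1*tx) + exp(i1*ty) + exp(i1*tz)
               den = 6.0d0 - exp(-i1*tx) - exp(-i1*ty) - exp(-i1*tz)
               mu = max(mu, abs(num/den))
            end do
         end do
      end do
      print *, 'smoothing factor mu =', mu
      end program smoothing_factor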


Table 13.18. Smoothing factor μ for zebra xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation (ZxyplGS, ZxzplGS, ZyzplGS, and AZplGS respectively) for the indicated convection-diffusion problems (see section 13.2) using central and upstream finite differences; ε = 10^{-p}.

  problem   p   ZxyplGS   ZxzplGS   ZyzplGS   AZplGS
     3      0   0.2500    0.2500    0.3600    0.0057
            1   0.2500    0.2500    0.7347    0.0279
            3   0.2500    0.2500    0.9960    0.0032
     4      0   0.2500    0.2500    0.3600    0.0057
            1   0.2500    0.2500    0.7347    0.0279
            3   0.2500    0.2500    0.9960    0.0032
     5      0   0.2500    0.3600    0.2500    0.0057
            1   0.2500    0.7347    0.2500    0.0279
            3   0.2500    0.9960    0.2500    0.0032
     6      0   0.2500    0.3600    0.2500    0.0057
            1   0.2500    0.7347    0.2500    0.0279
            3   0.2500    0.9960    0.2500    0.0032
     7      0   0.3600    0.2500    0.2500    0.0057
            1   0.7347    0.2500    0.2500    0.0279
            3   0.9960    0.2500    0.2500    0.0032
     8      0   0.3600    0.2500    0.2500    0.0057
            1   0.7347    0.2500    0.2500    0.0279
            3   0.9960    0.2500    0.2500    0.0032


Table 13.19. Smoothing factor μ for zebra xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation (ZxyplGS, ZxzplGS, ZyzplGS, and AZplGS respectively) for convection-diffusion problems (see section 13.2); ε = 10^{-p}.

  problem   p   ZxyplGS   ZxzplGS   ZyzplGS   AZplGS
     9      0   0.1542    0.3600    0.3600    0.0087
            1   0.2355    0.7347    0.7347    0.0729
            3   0.1250    0.9960    0.9960    0.1233
    10      0   0.1542    0.3600    0.3600    0.0087
            1   0.2355    0.7347    0.7347    0.0729
            3   0.1250    0.9960    0.9960    0.1233
    11      0   0.1542    0.3600    0.3600    0.0087
            1   0.2355    0.7347    0.7347    0.0729
            3   0.1250    0.9960    0.9960    0.1233
    12      0   0.1542    0.3600    0.3600    0.0087
            1   0.2355    0.7347    0.7347    0.0729
            3   0.1250    0.9960    0.9960    0.1233
    13      0   0.3600    0.1542    0.3600    0.0087
            1   0.7347    0.2355    0.7347    0.0729
            3   0.9960    0.1250    0.9960    0.1233
    14      0   0.3600    0.1542    0.3600    0.0087
            1   0.7347    0.2355    0.7347    0.0729
            3   0.9960    0.1250    0.9960    0.1233
    15      0   0.3600    0.1542    0.3600    0.0087
            1   0.7347    0.2355    0.7347    0.0729
            3   0.9960    0.1250    0.9960    0.1233
    16      0   0.3600    0.1542    0.3600    0.0087
            1   0.7347    0.2355    0.7347    0.0729
            3   0.9960    0.1250    0.9960    0.1233
    17      0   0.3600    0.3600    0.1542    0.0087
            1   0.7347    0.7347    0.2358    0.0729
            3   0.9960    0.9960    0.1250    0.1233
    18      0   0.3600    0.3600    0.1542    0.0087
            1   0.7347    0.7347    0.2355    0.0729
            3   0.9960    0.9960    0.1250    0.1233
    19      0   0.3600    0.3600    0.1542    0.0087
            1   0.7347    0.7347    0.2355    0.0729
            3   0.9960    0.9960    0.1250    0.1233
    20      0   0.3600    0.3600    0.1542    0.0087
            1   0.7347    0.7347    0.2355    0.0729
            3   0.9960    0.9960    0.1250    0.1233


Table 13.20. Smoothing factor μ for zebra xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation (ZxyplGS, ZxzplGS, ZyzplGS, and AZplGS respectively) for the indicated convection-diffusion problems (see section 13.2) using central and upstream finite differences; ε = 10^{-p}.

  problem   p   ZxyplGS   ZxzplGS   ZyzplGS   AZplGS
    21      0   0.2436    0.2436    0.2436    0.0100
            1   0.6944    0.6944    0.6944    0.0323
            3   0.9960    0.9960    0.9960    0.0420
    22      0   0.2436    0.2436    0.2436    0.0100
            1   0.6944    0.6944    0.6944    0.0323
            3   0.9960    0.9960    0.9960    0.0420
    23      0   0.2436    0.2436    0.2436    0.0100
            1   0.6944    0.6944    0.6944    0.0323
            3   0.9960    0.9960    0.9960    0.0420
    24      0   0.2436    0.2436    0.2436    0.0100
            1   0.6944    0.6944    0.6944    0.0323
            3   0.9960    0.9960    0.9960    0.0420
    25      0   0.2436    0.2436    0.2436    0.0100
            1   0.6944    0.6944    0.6944    0.0323
            3   0.9960    0.9960    0.9960    0.0420
    26      0   0.2436    0.2436    0.2436    0.0100
            1   0.6944    0.6944    0.6944    0.0323
            3   0.9960    0.9960    0.9960    0.0420
    27      0   0.2436    0.2436    0.2436    0.0100
            1   0.6944    0.6944    0.6944    0.0323
            3   0.9960    0.9960    0.9960    0.0420
    28      0   0.2436    0.2436    0.2436    0.0100
            1   0.6944    0.6944    0.6944    0.0323
            3   0.9960    0.9960    0.9960    0.0420


CHAPTER 14
3D VECTOR ALGORITHM CONSIDERATIONS

Most of the vectorization issues for the black box multigrid algorithms were already covered in chapter 6. We mention here only those issues that are either new or different from those for the 2D algorithms.

14.1 3D Smoother

There are several choices for the smoother, but only alternating plane relaxation gives a robust smoother. Point relaxation is possible, but it is only useful for isotropic diffusion equations. Line relaxation is also possible, but only alternating line relaxation is useful; however, it can perform poorly if the convection characteristics do not align with the grid, by either being skewed or varying. The robust 3D smoother of choice is alternating red/black plane Gauss-Seidel relaxation, where the plane solves are computed using the 2D black box multigrid method. On the vector (and sequential) computers, a loop was set up to cycle through all of the red planes and then another loop for all of the black planes; the 2D solver was called each time a plane needed to be solved. We had a choice of either writing three 2D black box multigrid solvers, one for each of the three plane data structures (xy-, yz-, and xz-planes respectively), or using only one 2D black box multigrid solver and transferring the data back and forth between the plane data structures and that of the 2D solver. We chose the latter approach, using the 2D vector codes that were developed earlier in this thesis for xy-plane oriented data.
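The red/black plane sweep in this smoother has a simple structure. The following Fortran sketch shows one such sweep over the xy-planes; the plane solve here is replaced by a single Jacobi sweep on the plane so that the example is self-contained, whereas the actual code calls the 2D black box multigrid solver at that point. All names and the grid size are illustrative only.

      program rb_plane_smoother
      implicit none
      integer, parameter :: n = 17
      real(8) :: u(n, n, n), rhs(n, n, n)
      integer :: k, color
      u = 0.0d0
      rhs = 1.0d0
      ! one red/black xy-plane Gauss-Seidel relaxation: all planes of
      ! one color are solved, then all planes of the other color
      do color = 0, 1
         do k = 2 + color, n - 1, 2
            call solve_plane(k)
         end do
      end do
      print *, 'plane sweep done, u(9,9,9) =', u(9, 9, 9)
      contains
      subroutine solve_plane(k)
      ! stand-in for the 2D black box multigrid plane solve: one Jacobi
      ! sweep on plane k of the 7-point Poisson operator; the planes
      ! k-1 and k+1 contribute to the plane's right hand side
      integer, intent(in) :: k
      integer :: i, j
      real(8) :: unew(n, n)
      unew = u(:, :, k)
      do j = 2, n - 1
         do i = 2, n - 1
            unew(i, j) = (rhs(i, j, k) &
                 + u(i-1, j, k) + u(i+1, j, k) &
                 + u(i, j-1, k) + u(i, j+1, k) &
                 + u(i, j, k-1) + u(i, j, k+1))/6.0d0
         end do
      end do
      u(:, :, k) = unew
      end subroutine solve_plane
      end program rb_plane_smoother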


14.2 Data Structures and Memory

For the 3D grid levels we need storage for the grid equations (unknowns, coefficients, and right hand side), the grid transfer operators, and temporary work space. Let Nx, Ny, and Nz be the number of grid points in the x-, y-, and z-directions respectively. We can compute how much storage will be needed by adding up the amount used for each grid point. We need 27 locations for the coefficient matrix and 1 each for the unknown and right hand side. For the standard coarsening method we need 52 locations for the grid transfer operator coefficients and another 5 for temporary work. For the semi-coarsening method we need 4 locations for the grid transfer operator coefficients and another 2 for temporary work. We ignore the storage for the coarsest grid direct solver because it remains constant and small compared to the rest. This means that we need 86 locations per grid point for the standard coarsening and 35 for the semi-coarsening. However, we do not store grid transfer coefficients on the finest grid, so we can subtract 52 and 4 locations from the totals for the standard and semi-coarsening methods respectively. The amount of storage required for the 3D data structures, excluding storage for the plane solves, is

    \left( 86 \left( 1 + \frac{1}{8} + \frac{1}{8^2} + \cdots \right) - 52 \right) N_x N_y N_z \approx 47\, N_x N_y N_z    (14.1)

and

    \left( 35 \left( 1 + \frac{1}{2} + \frac{1}{4} + \cdots \right) - 4 \right) N_x N_y N_z = 66\, N_x N_y N_z    (14.2)

for the standard and semi-coarsening methods respectively. If we have only a 7-point operator on the finest grid, we do not need to store the other 20 coefficients, and the storage requirements become 27 NxNyNz and 46 NxNyNz for the standard and semi-coarsening methods respectively. (A small program checking these sums is sketched later in this section.)

We now need to address how we are going to handle the storage for the plane solves.


Recall that the storage for the 2D black box multigrid method (xy-plane) is 24NxNy and 30NxNy for the standard and semi-coarsening methods respectively for 9-point fine grid operators; for a 5-point fine grid operator both are reduced by 4NxNy. We now have to decide how we are going to handle the plane solves for the smoother, leading to two obvious choices. The first choice we will call the small storage scheme (SSS), and the other will be called the full storage scheme (FSS). The SSS code will only require storage for the 3D data plus enough additional storage for the largest plane, on the 3D finest grid, to be solved using 2D black box multigrid. The FSS will require storage for the 3D data plus storage for all the planes solved by 2D black box multigrid on all the 3D grid levels. To understand how much storage is required by the FSS it is useful to refer to figure 14.1. The shading with arrows pointing from a higher level down to a lower level indicates the sub-partitioning of the higher level. The top line of figure 14.1 refers to the data partitioning of the 3D grid levels, where grid level m is the finest grid. This is analogous to the 2D data structures used earlier: we have a large array that is partitioned for the storage of each array needed for each of the 3D grid levels. The second line of figure 14.1 refers to the storage for the three groups of plane solves needed by the smoother to perform alternating red/black plane Gauss-Seidel relaxation. The third line of figure 14.1 indicates the number of plane data structures needed for one red/black plane Gauss-Seidel relaxation. The fourth line of figure 14.1 is the data partitioning of the 2D grid levels for a single plane solve by black box multigrid. The total storage for the FSS is the sum of these contributions over all the 3D grid levels.


[Figure 14.1: 3D full storage scheme memory data structure. The diagram partitions memory by 3D grid levels m, m-1, ..., 1; each level into XY-, YZ-, and XZ-plane groups; each group into xy-planes k = 1, 2, ..., nz; and each plane into 2D grid levels m2, m2-1, ..., 1.]


Note that the FSS scheme includes storage on a 3D grid level for all the grid operator coefficients used in the plane solves, but this data is just a copy of the 3D grid operator coefficients on a plane. We can eliminate this duplicate storage for identical data. However, this means that we will have to add back in enough storage for the largest plane to hold the 2D multigrid data structures. In addition, we will also have to add routines for copying the 3D grid operator coefficients and all the saved 2D coarse grid operator coefficients for a given plane into the newly added storage each time a plane needs to be solved for the smoother. We can call this method the nearly full storage scheme (NFSS); it requires the 3D storage plus the saved 2D coarse grid data plus 24 MAX{NxNy, NyNz, NxNz} locations for the largest-plane buffer, while the SSS requires only the 3D storage plus the largest-plane buffer, where MAX{.} stands for the maximum function. The NFSS only takes about 74% more storage than the SSS does, and 60% less than the FSS. The NFSS is even more attractive when we consider the computing time as compared to the SSS. The SSS does not save the grid transfer or coarse grid operators, and hence it has to perform the setup of these operators every time. The 2D setup time is between one and two times that of the execution of one V(1,1)-cycle. The 3D smoother, performing plane solves, for the SSS will double its execution time compared to the NFSS. Because of vectorization issues the NFSS smoother does not quite perform twice as fast as the SSS smoother, and since the smoother is only one of the multigrid components we only see a speedup factor of about 1.5 for the NFSS over the SSS. The NFSS will not be quite as fast as the FSS because of the additional time needed to copy the grid operator coefficients, but this time should be fairly small if not negligible. The NFSS approach is clearly the winning strategy because it combines nearly minimal storage requirements with nearly maximal speed of execution.
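The storage estimates above are simple geometric series and can be checked with a few lines of code. The sketch below evaluates (14.1) and (14.2); the per-point counts 86, 52, 35, 4, and 20 are the ones derived in section 14.2, and the printed values are per fine grid point.

      program storage_totals
      implicit none
      real(8) :: std, semi
      ! standard coarsening: levels shrink by 1/8 per level; the finest
      ! grid carries no transfer coefficients (subtract 52)
      std = 86.0d0*(8.0d0/7.0d0) - 52.0d0   ! about 46.3, quoted as 47 in (14.1)
      ! semi-coarsening: levels shrink by 1/2 per level; subtract the 4
      ! transfer coefficients on the finest grid
      semi = 35.0d0*2.0d0 - 4.0d0           ! exactly 66, eq. (14.2)
      print *, 'standard coarsening:', std, 'locations per point'
      print *, 'semi-coarsening    :', semi, 'locations per point'
      ! with a 7-point fine grid operator, 20 coefficients need not be
      ! stored on the finest grid
      print *, '7-point fine grid  :', std - 20.0d0, semi - 20.0d0
      end program storage_totals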


We also have the possibility of FSS, NFSS, and SSS implementations for the semi-coarsening algorithm, with analogous storage expressions for the SSS, NFSS, and FSS.

14.3 3D Standard Coarsening Vector Algorithm

We have discussed several issues concerning the black box multigrid components, vectorization, and programming on the Cray Y-MP. We now state explicitly what choices we made for the vector algorithm, as we did for the 2D vector algorithm. We have implemented the code with all the vectorization issues in mind, using the most practical and efficient choices that have been discussed so far.

14.3.1 Coarsening. We used standard coarsening, taking every other fine grid point in each of the coordinate directions to form the coarse grid.

14.3.2 Data Structures. The data structures for the grid equations are grid point stencil oriented. The mesh of unknowns has been augmented with a border of fictitious zero equations in the same way as in the 2D code. The border is used to avoid having to write special code to handle the boundary of the grid. This arrangement makes the code easier to write and more efficient for vector operations. There are several arrays to hold the grid equations: the discrete coefficient array, the array of unknowns, and the right hand side array.


There are also several extra auxiliary arrays to hold the grid transfer operator coefficients, the residual, and the 2D plane problems for the smoother. The 2D plane problems contain the storage space for all the 3D coarse grids to be solved by the 2D black box multigrid solver for the alternating zebra plane relaxation. Each grid level has its own data structure of the appropriate size, allocated via pointers as part of a larger linear array for each data type structure; a sketch of this allocation is given at the end of section 14.4. This arrangement makes memory management for the number of grid levels easier.

14.3.3 Smoothers. We have implemented the multicolor ordering point and alternating zebra plane Gauss-Seidel methods. The plane solves are performed using the 2D black box multigrid method, where each plane's data is copied into the 2D solver's data structures.

14.3.4 Coarsest Grid Solver. The coarsest grid solver is a direct solver using LU factorization, which is performed by the Linpack routine SGBSL.

14.3.5 Grid Transfer Operators. There are three choices for the grid transfer operators, discussed in chapter 11. They are analogous to the three that were implemented for the 2D standard coarsening method in sections 3.5.1, 3.5.3, and 3.6.1.

14.3.6 Coarse Grid Operators. The coarse grid operators are formed using the Galerkin coarse grid approximation, which uses the grid transfer operators and the fine grid operator.

14.4 3D Semi-Coarsening Vector Algorithm

The semi-coarsening code was originally implemented by Joel E. Dendy, Jr.


We have re-implemented it in a slightly more efficient form to gain a speedup of about 2 over the previous vectorized version while maintaining and improving the portability of the code. The new implementation has kept all the functionality of the previous version.

14.4.1 Data Structures. The data structures for the grid equations are the same as those for the standard coarsening code, including the fictitious border equations. However, we only need storage for the xy-planes used by the smoother.

14.4.2 Coarsening. Semi-coarsening in the z-direction was used, taking every other fine grid point in the z-direction to form the coarse grid.

14.4.3 Smoothers. Zebra xy-plane Gauss-Seidel relaxation is used for the smoother. The plane solves are performed using the 2D semi-coarsening black box multigrid method.

14.4.4 Coarsest Grid Solver. The coarsest grid solver is either the direct LU factorization Linpack solver or a single call to the 2D semi-coarsening black box multigrid method when the coarsening has continued until only one plane is left to solve.

14.4.5 Grid Transfer Operators. The grid transfer operator is analogous to the 2D one used in section 3.6.1, but extended to 3D and applied in the z-direction.

14.4.6 Coarse Grid Operators. The coarse grid operators are formed using the Galerkin coarse grid approximation, using the grid transfer and fine grid operators.
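The pointer-based allocation of section 14.3.2, where each grid level's arrays are carved out of one large linear array, amounts to computing a table of starting offsets from the level dimensions. The Fortran sketch below does this for one representative array per level under standard coarsening; the fine grid size is hypothetical, and level 1 here denotes the finest grid.

      program level_offsets
      implicit none
      integer, parameter :: maxlev = 10
      integer :: nx, ny, nz, lev
      integer :: ptr(maxlev)
      nx = 65; ny = 65; nz = 65      ! hypothetical finest grid
      ptr(1) = 1
      do lev = 1, maxlev
         print '(a,i2,a,3i4,a,i9)', ' level ', lev, ' : ', &
              nx, ny, nz, '   offset ', ptr(lev)
         if (min(nx, ny, nz) <= 3) exit            ! coarsest grid reached
         if (lev < maxlev) ptr(lev+1) = ptr(lev) + nx*ny*nz
         ! standard coarsening: every other point in each direction
         nx = (nx + 1)/2; ny = (ny + 1)/2; nz = (nz + 1)/2
      end do
      end program level_offsets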


14.5 Timing Results for 3D Test Problems

In this section we present some timing results for comparing the performance of the various codes and their components. To illustrate how fast these codes perform in solving a problem, we examine the timing results for solving Poisson's equation for five V-cycles; see table 14.1. Table 14.1 gives the timing results, in seconds, for various stages of the program execution for various grid sizes. The grid is cubic, so in the first column where n = 9 we mean a grid of size 9 x 9 x 9, and so forth for the rest of the column entries. The second column gives the total setup time, which involves the time it takes to form all of the grid transfer operators, generate all the coarse grid operators, and perform any decompositions needed for the smoother. The third column gives the total time for smoothing. The fourth column gives the total time for the direct solver. The last column contains the average time it took to complete one V(1,1)-cycle. We observe that the code runs fairly quickly, and that it appears to scale with respect to the grid size. However, the scaling is not what we would expect, namely a factor of 8 for the standard coarsening and a factor of 2 for the semi-coarsening. We do not see these speedups for several reasons, but the main one has to do with the fact that we are performing five V(1,1)-cycles. Each of these 3D V-cycles is using alternating plane solves for the smoother, and each plane is solved using the 2D black box multigrid method. Another big reason for the difference in the scaling has to do with vectorization issues. We also note that the total setup time is about 0.55 times that of the average cycle time, and in addition it is about 0.12 times the total smoothing time for one iteration.
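The scaling claim can be read off table 14.1 directly. Doubling n would ideally multiply the work per cycle by 8 for standard coarsening, but the measured average cycle times grow by only about a factor of 4, as the following check of the tabulated values shows.

      program cycle_scaling
      implicit none
      ! average V(1,1)-cycle times from table 14.1, standard coarsening
      real(8), parameter :: t(4) = (/ 8.466d-2, 3.324d-1, 1.321d0, 5.281d0 /)
      integer :: i
      do i = 2, 4
         print *, 'n doubled: cycle time ratio =', t(i)/t(i-1), ' (ideal 8)'
      end do
      end program cycle_scaling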


Table 14.1. Multigrid component timings for standard (top four lines) and semi-coarsening (bottom four lines) in seconds for various grid sizes versus the time for five V(1,1)-cycles. The standard coarsening method uses zebra alternating plane Gauss-Seidel for the smoother, which uses a 2D multigrid method, and the grid transfer operator is the nonsymmetric hybrid collapsing method.

  Grid Size   Total      Total       Direct     Average
      n       Setup      Smoothing   Solver     per Cycle
      9       4.671E-2   3.791E-1    5.571E-4   8.466E-2
     17       1.753E-1   1.502E+0    5.610E-4   3.324E-1
     33       7.051E-1   5.916E+0    5.579E-4   1.321E+0
     65       3.031E+0   2.351E+1    5.583E-4   5.281E+0
      9       5.854E-2   2.915E-1    9.234E-3   7.554E-2
     17       2.330E-1   1.495E+0    1.855E-2   2.991E-1
     33       1.073E+0   4.956E+0    3.918E-2   1.317E+0
     65       5.503E+0   2.536E+1    1.006E-1   6.952E+0


In the 2D codes we saw just the opposite, where the setup time was greater than the average cycling time; but because of the overhead involved in using the 2D multigrid method in the smoother and the simplified tensor composition for the grid operator setup, the setup is now faster in 3D. A more detailed examination of these relationships between the various multigrid components is given below.

The rest of the tables in this section give the results for one multigrid V(1,1)-cycle. The results are separated by multigrid components for easier comparison, and each table is further broken down by the type of multigrid algorithm. All times are given in seconds of CPU time on the Cray Y-MP in single processor mode. The time to perform the LU decomposition of the coarsest grid (3 x 3 x 3) problem for the direct solver is 7.148E-4 seconds. The direct solver on the coarsest grid level (3 x 3 x 3, standard coarsening) takes 1.116E-4 seconds. These times are constant for all of the standard coarsening algorithms that use the direct solver.

The amount of work to perform the grid transfers depends on the grid size and on the type of coarsening used. A comparison between standard and semi-coarsening is given in table 14.2. As one would expect, semi-coarsening grid transfers are faster than standard coarsening grid transfers.

Table 14.2. Timings in seconds for multigrid grid transfer components for one V(1,1)-cycle for various grid sizes, comparing standard and semi-coarsening methods.

  Grid Size   Standard Coarsening         Semi-Coarsening
      n       Prolongation  Restriction   Prolongation  Restriction
      9       5.427E-4      5.714E-4      2.129E-4      3.600E-4
     17       2.513E-3      1.981E-3      7.142E-4      1.292E-3
     33       7.518E-3      7.560E-3      2.980E-3      5.268E-3
     65       2.681E-2      2.934E-2      2.060E-2      3.699E-2

Table 14.3 gives the timing results for two standard coarsening smoothers and the semi-coarsening smoother. Note that the times for the alternating zebra plane and semi-coarsening smoothers are fairly close, and that the point Gauss-Seidel method is roughly 17 times faster.


Table 14.3. Timings for the total smoothing time in seconds for one multigrid V(1,1)-cycle for various grid sizes and smoothers.

  Grid Size   Total Smoothing Time (seconds)
      n       R/BPGS     AZplGS     SCBMG3
      9       4.515E-3   7.582E-2   5.832E-2
     17       1.747E-2   3.004E-1   2.994E-1
     33       6.879E-2   1.183E+0   9.912E-1
     65       2.688E-1   4.702E+0   5.072E+0

The ratio of time spent smoothing versus the time spent doing grid transfers is given in table 14.4. The ratio of smoothing to grid transfers shows that the smoother is the dominant computation in the multigrid cycling algorithm. It also shows that the use of plane smoothing via 2D multigrid dominates the computations completely.

Recall from section 5.9 that 4-direction point Gauss-Seidel relaxation is a good smoother for isotropic 2D problems using standard coarsening. We can extend this method to 3D problems by using an 8-direction point Gauss-Seidel relaxation that will give good smoothing and be robust for isotropic problems; the sweep orderings are sketched below. The method should also be attractive for both execution time and memory usage. We can see from table 14.3 that the execution time for the red/black point method is about one sixteenth of the alternating plane method, and since lexicographic ordering performs nearly identically to red/black ordering on the Cray Y-MP, we should get the 8-direction point Gauss-Seidel relaxation method to perform in half the time that the alternating plane relaxation takes. As an additional bonus, the 8-direction point Gauss-Seidel method does not require any extra storage over that required to store the 3D grid equation data structures and the 3D grid transfer operator coefficients.
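The eight sweep directions are simply the sign choices for the x-, y-, and z-sweeps, as the following sketch enumerates; a full 8-direction relaxation performs one lexicographic Gauss-Seidel pass for each direction in turn.

      program sweep_directions
      implicit none
      integer :: sx, sy, sz
      do sx = 1, -1, -2
         do sy = 1, -1, -2
            do sz = 1, -1, -2
               print '(a,3i3)', ' sweep direction (dx,dy,dz) =', sx, sy, sz
               ! a pass in this direction would run
               !   do k = k0, k1, sz ; do j = j0, j1, sy ; do i = i0, i1, sx
               ! updating u(i,j,k) in place
            end do
         end do
      end do
      end program sweep_directions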


Table 14.4. Standard coarsening with grid transfer operators based on the hybrid collapsing method. Timing ratios (smoothing/grid transfers) for one V(1,1)-cycle for various grid sizes.

  Grid Size   (Smoothing)/(Grid Transfers)
      n       R/BPGS   AZplGS   SCBMG3
      9        4.05     68.05    101.8
     17        3.89     66.84    149.2
     33        4.56     78.45    120.2
     65        4.79     83.74    88.07

14.6 Numerical Results for 3D Test Problem 1

Problem 1 is a Poisson problem defined by

    \Delta u = 0                                          on \Omega = (0, 32) \times (0, 32) \times (0, 32),
    \frac{\partial u}{\partial n} = 0                     on x = 0 or y = 0 or z = 0,             (14.3)
    \frac{\partial u}{\partial n} - \frac{1}{2} u = 0     on x = 32 or y = 32 or z = 32.

We discretize the equation using finite volumes with central differencing. The numerical results are given in tables 14.5 and 14.6.

14.7 Numerical Results for 3D Test Problem 2

Problem 2 is a discontinuous four-cube junction problem defined by

    -\nabla \cdot (D \nabla u) + c\, u = f                on \Omega = (0, 32) \times (0, 32) \times (0, 32),
    \frac{\partial u}{\partial n} = 0                     on x = 0 or y = 0 or z = 0,             (14.4)
    D \frac{\partial u}{\partial n} - \frac{1}{2} u = 0   on x = 32 or y = 32 or z = 32,

where the domain is split into the regions

Table 14.5. Number of V(1,1)-cycles using standard coarsening and zebra alternating plane Gauss-Seidel for the smoother with the hybrid collapsing grid transfer operator, and the first, last, and average convergence factor (CF).

  Grid Size n   Iterations   First      Last       Average
  (n x n x n)     V(1,1)       CF         CF         CF
       9           11        1.223E-1   1.767E-1   2.311E-1
      17           13        8.981E-2   2.821E-1   2.423E-1
      33           12        4.211E-2   2.645E-1   2.724E-1
      65           15        4.801E-2   3.943E-1   2.876E-1

Table 14.6. Number of V(1,1)-cycles using semi-coarsening and zebra plane Gauss-Seidel for the smoother with the hybrid collapsing grid transfer operator, and the first, last, and average convergence factor (CF).

  Grid Size n   Iterations   First      Last       Average
  (n x n x n)     V(1,1)       CF         CF         CF
       9           12        1.703E-1   2.798E-1   2.556E-1
      17           15        1.886E-1   2.922E-1   2.723E-1
      33           16        2.222E-1   2.955E-1   2.795E-1
      65           18        3.231E-1   2.989E-1   2.876E-1


Table 14.7. Number of V(1,1)-cycles when c = 0, using standard coarsening and zebra alternating plane Gauss-Seidel for the smoother with the nonsymmetric collapsing grid transfer operator, and the first, last, and average convergence factor (CF).

  Grid Size n   Iterations   First      Last       Average
  (n x n x n)     V(1,1)       CF         CF         CF
       9           15        2.523E-1   4.333E-1   3.667E-1
      17           20        3.041E-1   5.324E-1   4.441E-1
      33           27        3.331E-1   6.144E-1   5.329E-1
      65           36        4.801E-1   7.464E-1   6.267E-1

    R1 = {(x, y, z) : 0 < x < 16,   0 < y < 16,   0 < z < 16},
    R2 = {(x, y, z) : 16 <= x < 32, 0 < y < 16,   0 < z < 16},
    R3 = {(x, y, z) : 0 < x < 16,   16 <= y < 32, 0 < z < 16},
    R4 = {(x, y, z) : 16 <= x < 32, 16 <= y < 32, 0 < z < 16},
    R5 = {(x, y, z) : 0 < x < 16,   0 < y < 16,   16 <= z < 32},
    R6 = {(x, y, z) : 16 <= x < 32, 0 < y < 16,   16 <= z < 32},
    R7 = {(x, y, z) : 0 < x < 16,   16 <= y < 32, 16 <= z < 32},
    R8 = {(x, y, z) : 16 <= x < 32, 16 <= y < 32, 16 <= z < 32};

then let

    D = \begin{cases} 1 & \text{for regions } 2, 3, 5, 8 \\ 1000 & \text{for regions } 1, 4, 6, 7 \end{cases}    (14.5)

and

    f = \begin{cases} 1 & \text{for regions } 2, 3, 5, 8 \\ 0 & \text{for regions } 1, 4, 6, 7. \end{cases}    (14.6)

We discretize the equation using finite volumes with central differencing. The numerical results are given in tables 14.7 through 14.10. The methods all perform roughly the same, with the semi-coarsening method coming in last.
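The piecewise coefficients (14.5) and (14.6) have a convenient closed form: the eight regions tile the cube as a 3D checkerboard, with D = 1000 and f = 0 in the regions where an even number of the coordinates exceeds 16 (regions 1, 4, 6, 7), and D = 1 and f = 1 otherwise (regions 2, 3, 5, 8). The following sketch evaluates them at a point; the sample point is arbitrary.

      program junction_coefficients
      implicit none
      real(8) :: x, y, z, d, f
      integer :: flags
      x = 8.0d0; y = 24.0d0; z = 8.0d0     ! a point in region 3
      flags = 0                            ! count coordinates past 16
      if (x >= 16.0d0) flags = flags + 1
      if (y >= 16.0d0) flags = flags + 1
      if (z >= 16.0d0) flags = flags + 1
      if (mod(flags, 2) == 0) then
         d = 1000.0d0; f = 0.0d0           ! regions 1, 4, 6, 7
      else
         d = 1.0d0; f = 1.0d0              ! regions 2, 3, 5, 8
      end if
      print *, 'D =', d, '  f =', f
      end program junction_coefficients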


Table 14.8. Number of V(1,1)-cycles when c = 0, using standard coarsening and zebra alternating plane Gauss-Seidel for the smoother with the hybrid collapsing grid transfer operator, and the first, last, and average convergence factor (CF).

  Grid Size n   Iterations   First      Last       Average
  (n x n x n)     V(1,1)       CF         CF         CF
       9           15        2.544E-1   4.339E-1   3.679E-1
      17           19        3.038E-1   5.281E-1   4.276E-1
      33           27        3.348E-1   6.159E-1   5.336E-1
      65           36        4.833E-1   7.878E-1   6.308E-1

Table 14.9. Number of V(1,1)-cycles when c = 3D, using standard coarsening and zebra alternating plane Gauss-Seidel for the smoother with the nonsymmetric collapsing grid transfer operator, and the first, last, and average convergence factor (CF).

  Grid Size n   Iterations   First      Last       Average
  (n x n x n)     V(1,1)       CF         CF         CF
       9           12        9.112E-1   4.210E-1   3.677E-1
      17           14        1.123E-1   5.223E-1   4.121E-1
      33           21        2.388E-1   6.100E-1   5.629E-1
      65           26        3.522E-1   6.363E-1   6.230E-1

Table 14.10. Number of V(1,1)-cycles when c = 3D, using standard coarsening and zebra alternating plane Gauss-Seidel for the smoother with the hybrid collapsing grid transfer operator, and the first, last, and average convergence factor (CF).

  Grid Size n   Iterations   First      Last       Average
  (n x n x n)     V(1,1)       CF         CF         CF
       9           12        9.104E-1   4.209E-1   3.679E-1
      17           15        1.334E-1   5.231E-1   4.276E-1
      33           22        2.241E-1   6.129E-1   5.336E-1
      65           26        3.480E-1   6.376E-1   6.308E-1


CHAPTER 15
PARALLEL 3D BLACK BOX MULTIGRID

The parallel 3D black box multigrid methods are similar to the parallel 2D methods. However, unlike the 3D vector versions, the smoother does not use the parallel 2D black box multigrid method in its original form.

15.1 3D Standard Coarsening Parallel Algorithm Modifications

Just like the 2D parallel standard coarsening method, we run into the same problems with performing the tridiagonal line solves. We have chosen to modify the coarsening to use standard coarsening until the VP ratio is less than or equal to one and then switch to the semi-coarsening algorithm. This approach has several benefits, including easier coding and faster execution. As will be seen, the semi-coarsening algorithm is the fastest for VP ratios less than or equal to one, but the standard coarsening method is actually faster when the VP ratio is greater than one.

15.2 3D Parallel Smoother

On a parallel computer we could use the same approach as we did on the vector computers by calling the 2D solver, but this would not be very efficient since we know that all the red (black) planes can be solved simultaneously.


Instead, we can modify the 2D solver to solve all of the red (black) planes simultaneously by introducing the third coordinate axis to the data structure. This is good news because it cuts down on the overhead associated with calling the 2D solver over and over again. It does mean that we will use more memory for the 2D solver because of the third coordinate axis, but the performance gain justifies this decision. However, we are again faced with the choice between creating three 2D plane solvers, one for each of the 2D planes needed for alternating plane relaxation, or creating one 2D plane solver and transferring the information from the 3D solver data structures to the 2D plane solver data structures with a copying routine. The first choice of creating three 2D plane solvers would be the fastest implementation, but it is rather prohibitive for two reasons. First, it requires much more storage for all the data structures needed by the three 2D plane solvers. Secondly, the size of the final executable code, which is already quite large, more than doubles, meaning that there is that much less storage available for solving large grid size problems. We decided to use only one 2D plane solver, sacrificing some performance gains in order to save memory for solving larger problems. We may have saved some space by not writing three 2D multigrid solvers, but we now have to transfer data between the 3D and 2D data structures. These data transfers require the use of the less efficient general communications. In either case, we have to transfer the unknowns and the right hand sides of the grid equations, followed by transferring the solution back, all of which uses inefficient general communications for the yz- and xz-plane solves. By having only one 2D (xy-plane) version we also have to transfer the plane's coefficient matrix to the 2D fine grid coefficient matrix. If we did not transfer the coefficients we would need nearly double the storage; recall section 14.2.


We have also decided not to save the LU decompositions in the 2D plane solver because of concerns about the amount of memory they would require with the additional (third) coordinate axis; hence performance is again reduced. However, experimentation has shown that saving the decompositions only saves from 25% to 40% of the execution time required to perform one V-cycle, while costing four times the storage required for a tridiagonal solve. In actuality, the storage cost is closer to six times because of the temporary work space allocated by the CMSSL tridiagonal solver.

15.3 3D Data Structures and Communication

The data structure allocation and layout are again handled by the use of the DMMU. The data structures have the same storage requirements as the 3D vector versions for both the standard and semi-coarsening methods. However, instead of pointers to the various grid level data, we use congruent array aliases, allowing for indexing the desired grid level's data directly.

15.4 3D Parallel Timings

In the following tables we report both busy (B) and idle (I) times. Busy time is the execution time for the parallel processing nodes, while idle time is the sequential execution time plus the time to perform all communications. We report times for various time-shared partitions of the CM-5. The partitions are identified by the number of processing nodes (PN), namely 32, 64, and 128 processing nodes. The tables report timings, in seconds, as the average of five runs for either the setup time or the average of five V-cycles. The standard coarsening timings are given in tables 15.1 and 15.2 for one V(1,1)-cycle and the setup respectively. We see the effects of the parallel overhead in the tables for small grid sizes and large partitions. The "**" entries mean that no data was obtained because of a failure in the codes. We believe that the "no data" runs failed because the standard coarsening case has data alignment trouble on coarser grids.


Table 15.1. Timings, in seconds, for the standard coarsening code performing one V(1,1)-cycle with zebra alternating plane Gauss-Seidel on 32, 64, and 128 processing nodes of the CM-5, where size N means an N x N x N fine grid. Busy and Idle refer to the parallel and communication/sequential time respectively.

  Size   32 PN                  64 PN            128 PN
   N     Idle       Busy        Idle    Busy     Idle       Busy
   8     5.003E-1   8.793E-1    **      **       5.376E-1   1.057E+0
  16     8.586E-1   1.565E+0    **      **       8.918E-1   1.746E+0
  32     1.236E+0   2.982E+0    **      **       1.379E+0   2.906E+0
  64     1.687E+0   7.920E+0    **      **       1.808E+0   5.514E+0

Table 15.2. Timings, in seconds, for the setup phase of the standard coarsening code with zebra alternating plane Gauss-Seidel on 32, 64, and 128 processing nodes of the CM-5, where size N means an N x N x N fine grid. Busy and Idle refer to the parallel and communication/sequential time respectively.

  Size   32 PN                  64 PN            128 PN
   N     Idle       Busy        Idle    Busy     Idle       Busy
   8     1.684E+0   2.403E+0    **      **       1.578E+0   2.839E+0
  16     2.932E+0   4.464E+0    **      **       3.063E+0   4.827E+0
  32     4.389E+0   8.911E+0    **      **       4.748E+0   8.452E+0
  64     5.650E+0   2.457E+1    **      **       5.970E+0   1.652E+0


Table 15.3. Parallel efficiency for the standard coarsening V(1,1)-cycle using zebra alternating plane Gauss-Seidel on the CM-5 with 32, 64, and 128 nodes. The results are given in percentages, and N means an N x N x N grid.

  Size        CM-5
   N     32 PN   64 PN   128 PN
   8      64      **      66
  16      65      **      66
  32      71      **      68
  64      82      **      75

The alignment trouble only happens when using the DMMU on the 64 processor or 256 processor partitions. The DMMU was designed and tested for the semi-coarsening algorithm. For the standard coarsening algorithm, the DMMU appears to have trouble aligning the coarse grid data points with their closely related finer grid points on a subgrid. For the 64 and 256 processor partitions the DMMU cannot keep the coarsening confined to the subgrid, as is done in the semi-coarsening code. As in the 2D parallel timings, we again do not see perfect scaling with respect to the grid size, nor scaleup with the processing partition size. The parallel efficiencies are given in table 15.3, and they exhibit the same behavior that was seen in the 2D timings.

Tables 15.4 through 15.6 present the timing data for the semi-coarsening algorithm. Now, however, we do not see any problem with the DMMU. Finally, we give a timing comparison between the CM-5 and the Cray Y-MP for one V(1,1)-cycle and a variety of grid sizes; the CM-5 timings are given for the 32, 64, and 128 processing partitions.
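The tabulated efficiencies appear to be the busy fraction of the total time: the ratio busy/(busy + idle) computed from tables 15.1 and 15.4 reproduces the corresponding entries of tables 15.3 and 15.6. This interpretation is inferred from the data rather than stated explicitly here; the following check uses the 32-node, N = 64 entries.

      program parallel_efficiency
      implicit none
      real(8) :: busy, idle
      busy = 7.920d0                 ! table 15.1, N = 64, 32 PN
      idle = 1.687d0
      print *, 'standard coarsening:', 100.0d0*busy/(busy + idle), '%'  ! about 82
      busy = 1.234d1                 ! table 15.4, N = 64, 32 PN
      idle = 2.412d0
      print *, 'semi-coarsening    :', 100.0d0*busy/(busy + idle), '%'  ! about 84
      end program parallel_efficiency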


Table 15.4. Timings, in seconds, for the semi-coarsening code performing one V(1,1)-cycle on 32, 64, and 128 processing nodes of the CM-5, where size N means an N x N x N fine grid. Busy and Idle refer to the parallel and communication/sequential time respectively.

  Size   32 PN                  64 PN                  128 PN
   N     Idle       Busy        Idle       Busy        Idle       Busy
   8     7.256E-1   1.235E+0    7.650E-1   1.287E+0    7.602E-1   1.257E+0
  16     1.264E+0   2.271E+0    1.190E+0   2.242E+0    1.239E+0   2.348E+0
  32     1.793E+0   4.630E+0    1.823E+0   4.182E+0    1.849E+0   3.995E+0
  64     2.412E+0   1.234E+1    2.480E+0   9.874E+0    2.435E+0   8.127E+0

Table 15.5. Timings, in seconds, for the setup phase of the semi-coarsening code on 32, 64, and 128 processing nodes of the CM-5, where size N means an N x N x N fine grid. Busy and Idle refer to the parallel and communication/sequential time respectively.

  Size   32 PN                  64 PN                  128 PN
   N     Idle       Busy        Idle       Busy        Idle       Busy
   8     1.103E+0   1.481E+0    1.415E+0   1.539E+0    1.482E+0   1.506E+0
  16     2.180E+0   2.739E+0    1.909E+0   2.734E+0    2.036E+0   2.848E+0
  32     2.248E+0   5.626E+0    2.872E+0   5.083E+0    2.419E+0   4.825E+0
  64     3.100E+0   1.514E+1    3.340E+0   1.210E+1    3.372E+0   9.868E+0

Table 15.6. Parallel efficiency for the semi-coarsening V(1,1)-cycle on the CM-5 with 32, 64, and 128 nodes. The results are given in percentages, and N means an N x N x N grid.

  Size        CM-5
   N     32 PN   64 PN   128 PN
   8      63      63      62
  16      64      65      65
  32      72      69      68
  64      84      80      77


Table 15.7. Timing comparison between the CM-5 and Cray Y-MP computers for one V(1,1)-cycle in seconds, where N means an N x N x N grid. The top entry of each pair is for the standard coarsening codes and the bottom entry is for the semi-coarsening codes; ** means that the problem failed to execute.

  Size   CM-5                          Cray Y-MP
   N     32 PN    64 PN    128 PN
   8     1.3796   **       1.5956      2.332E-2
         1.9606   2.0520   2.0172      3.742E-2
  16     2.4236   **       2.6378      1.398E-1
         3.5350   3.4320   3.5870      1.825E-1
  32     4.2180   **       4.2850      8.795E-1
         6.4230   6.0050   5.8440      9.107E-1
  64     9.6070   **       7.3220      4.663E+0
         14.800   12.354   10.562      4.696E+0


APPENDIX A
OBTAINING THE BLACK BOX MULTIGRID CODES

The black box multigrid codes and a User's Guide can be obtained via anonymous FTP through MGNet, MGNet's web site, or by contacting the author. MGNet stands for the MultiGrid Network.

MGNet's FTP site:

    casper.cs.yale.edu

Use "anonymous" for the username and your e-mail address for the password.

MGNet has a World Wide Web page that can be accessed via the URL:

    http://na.cs.yale.edu/mgnet/www/mgnet.html

The author's e-mail address is:

    na.bandy@na-net.ornl.gov

Additional copies of this thesis may be obtained by contacting the author or by downloading a copy from the University of Colorado at Denver Mathematics Department's Web page via the URL:


    http://www-math.cudenver.edu/

Any comments, insights, and suggestions would be greatly appreciated. Thank you.


APPENDIX B
COMPUTER SYSTEMS USED FOR NUMERICAL RESULTS

B.1 Cray Y-MP

Manufacturer: Cray Research Inc.

Hardware Specifics: Machine RHO at Los Alamos National Laboratory
  computer model          = Cray Y-MP 8/64 832
  serial number           = 1054
  number CPUs             = 8
  clock cycle             = 6.0 nanoseconds (166666666 cycles/second)
  word length             = 64 bits
  memory size             = 67108864 words (67 MWords)
  memory speed            = 102.0 nanoseconds (17 clock cycles)
  memory banks            = 256
  memory bank busy        = 30.0 nanoseconds (5 clock cycles)
  instruction buffer size = 32
  number of clusters      = 9


Operating System: UNICOS version 7.0.6.1

FORTRAN Programming Environment:
  CF77 version 6.0.4.1
  GPP version 6.0.4.1
  FPP version 6.0
  FMP version 6.0.4.0
  CFT77 version 6.0.4.12 (some are done with 6.0.4.10)
  segldr version 7.0i

Cray Library: Craylib version 1.2

Manufacturer: Cray Research Inc.

Hardware Specifics: Machine GAMMA at Los Alamos National Laboratory
  computer model = Cray Y-MP 8/2048 (M90)
  serial number  = 2806
  number CPUs    = 8
  clock cycle    = 6.0 nanoseconds (166666666 cycles/second)
  word length    = 64 bits
  memory size    = 2147483648 words (2.147 GWords)
  memory speed   = 162.0 nanoseconds (27 clock cycles)
  memory banks   = 256


  memory bank busy        = 120.0 nanoseconds (20 clock cycles)
  instruction buffer size = 32
  number of clusters      = 9

Operating System: UNICOS version 8.0.3

FORTRAN Programming Environment:
  CF77 version 6.0.4.1
  GPP version 6.0.4.1
  FPP version 6.0
  FMP version 6.0.4.0
  CFT77 version 6.0.4.10
  segldr version 8.0i

Cray Library: Craylib version 1.2

B.2 CM-5

Manufacturer: Thinking Machines Inc.

Hardware Specifics:
  computer    = CM-5 at the Advanced Computing Laboratory (ACL), Los Alamos National Laboratory
  number CPUs = 1024 Sparc-2 node CPUs, 4 vector units per node (4096 vector units, vector length of 16)
  memory      = 32 MBytes per CPU node


  Four HIPPI interfaces; 120 GBytes rotating storage

Operating System: CMOST version 7.4.0 (based on SunOS 4.1.3U1b)

CM Run-Time-System: CMRTS 8.1

CM-FORTRAN Programming Environment:
  CM Fortran Driver Release 2.2.11
  Connection Machine Fortran Version 2.2 (CMF)
  Compiler runtime library Version: 2.2 (LIBCMFCOMPILER)

CM Scientific Software Library: CMSSL version 4.0

Slicewise runtime library version CMRTS CM5 8 1 6 (LIBCMRTS)

