BLACK BOX MULTIGRID FOR
CONVECTION-DIFFUSION EQUATIONS
ON ADVANCED COMPUTERS
by
VICTOR ALAN BANDY
M.S., University of Colorado at Denver, 1988
B.S., Oregon State University, 1983
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Mathematics
1996
This thesis for the Doctor of Philosophy degree by
Victor Alan Bandy
has been approved for the
Department of
Mathematics
by
Gita Alaghband
Date
Bandy, Victor Alan (Ph.D., Applied Mathematics)
Black Box Multigrid for Convection-Diffusion Equations on Advanced Computers
Thesis directed by Dr. Joel E. Dendy, Jr.
ABSTRACT
In this thesis we present Black Box Multigrid methods for the solution of
convection-diffusion equations with anisotropic and discontinuous coefficients on
advanced computers. The methods can be classified as using either standard or
semi-coarsening for the generation of the coarse grids. The domains are assumed to be
either two or three dimensional, with a logically rectangular mesh structure being used
for the discretization.
New grid transfer operators are presented and compared to earlier grid transfer
operators. The new operators are found to be more robust for convection-diffusion
equations.
Local mode and model problem analysis are used to examine several choices
of iterative methods for the smoother and their relative effectiveness for the class of
problems under consideration. The red/black alternating line Gauss-Seidel method
and the incomplete line LU (ILLU) by lines-in-x method were found to be the most
robust for two dimensional domains, and red/black alternating plane Gauss-Seidel,
using the 2D black box multigrid method for the plane solves, was found to be the
most robust and efficient smoother for 3D problems.
The Black Box Multigrid methods were developed to be portable, but optimized
for either vector computers, such as the Cray Y-MP, or for parallel computers,
such as the CM-5. While the computer architectures are very different, they represent
two of the main directions that supercomputer architectures are moving in today.
Performance measures for a variety of test problems are presented for the two computers.
The vectorized methods are suitable for another large class of common computers
that use superscalar pipelined processors, such as PCs and workstations. While
the codes have not been optimized for these computers, especially when considering
caching issues, they do perform quite well. Some timing results are presented for a Sun
Sparc5 for comparison with the supercomputers.
This abstract accurately represents the contents of the candidate's thesis. I
recommend its publication.
To my Mom, Lee Buchanan, and everyone else who kept on asking
"When are you going to finish?"
CONTENTS
CHAPTER
1 INTRODUCTION
1.1 Summary
1.1.1 Previous Results
1.1.2 New Contributions
1.2 Class of Problems
1.3 Discretization of the Problem
1.4 Multigrid Overview
1.4.1 Multigrid Cycling Strategies
1.5 Black Box Multigrid
2 DISCRETIZATIONS: FINITE DIFFERENCE AND FINITE VOLUME
2.1 Finite Difference Discretization
2.2 Finite Volume Discretization
2.3 Cell Centered Finite Volume Discretization; Evaluation at the Vertices
2.3.1 Interior Finite Volumes
2.3.2 Dirichlet Boundary Condition
2.3.3 Neumann and Robin Boundary Conditions
2.4 Cell Centered Finite Volume Discretization; Evaluation at the Cell Centers
2.4.1 Interior Finite Volumes
2.4.2 Dirichlet Boundary Condition
2.4.3 Neumann and Robin Boundary Conditions
2.5 Vertex Centered Finite Volume Discretization; Evaluation at the Vertices
2.5.1 Interior Finite Volumes
2.5.2 Edge Boundary Finite Volumes
2.5.3 Dirichlet Boundary Condition
2.5.4 Neumann and Robin Boundary Conditions
2.5.5 Corner Boundary Finite Volumes
2.5.6 Dirichlet Boundary Condition
2.5.7 Neumann and Robin Boundary Conditions
2.6 Vertex Centered Finite Volume Discretization; Evaluation at the Cell Vertices
2.6.1 Interior Finite Volumes
2.6.2 Dirichlet Boundary Condition
2.6.3 Neumann and Robin Boundary Conditions
2.6.4 Corner Boundary Finite Volumes
2.6.5 Dirichlet Boundary Condition
2.6.6 Neumann and Robin Boundary Conditions
3 PROLONGATION AND RESTRICTION OPERATORS
3.1 Prolongation
3.1.1 Prolongation Correction Near Boundaries
3.2 Restriction
3.3 Overview
3.4 Symmetric Grid Operator L^h: Collapsing Methods
3.5 Nonsymmetric Grid Operator L^h: Collapsing Methods
3.5.1 Prolongation Based on symm(L^h)
3.5.2 Prolongation Based on L^h and symm(L^h)
3.5.3 Grid Transfer Operators Based on a Hybrid Form of L^h and symm(L^h)
3.6 Nonsymmetric Grid Operators: Extension of Schaffer's Idea
3.6.1 Extension of Schaffer's Idea to Standard Coarsening
3.7 Conclusions Regarding Grid Transfer Operators
4 BASIC ITERATION METHODS FOR SMOOTHERS
4.1 Overview of Basic Iteration Methods
4.2 Gauss-Seidel Relaxation
4.2.1 Point Gauss-Seidel Iteration
4.2.2 Line Gauss-Seidel Iteration by Lines in X
4.2.3 Line Gauss-Seidel Iteration by Lines in Y
4.2.4 Alternating Line Gauss-Seidel Iteration
4.3 Incomplete Line LU Iteration
5 FOURIER MODE ANALYSIS OF SMOOTHERS
5.1 Introduction
5.2 Motivation
5.3 Overview of Smoothing Analysis
5.4 2D Model Problems
5.5 Local Mode Analysis for Point Gauss-Seidel Relaxation
5.6 Local Mode Analysis for Line Gauss-Seidel Relaxation
5.7 Local Mode Analysis: Alternating Line Gauss-Seidel and ILLU Iteration
5.8 Local Mode Analysis Conclusions
5.9 Other Iterative Methods Considered for Smoothers
6 VECTOR ALGORITHM CONSIDERATIONS IN TWO DIMENSIONS
6.1 Cray Hardware Overview
6.2 Memory Mapping and Data Structures
6.3 Scalar Temporaries
6.4 In-Code Compiler Directives
6.5 Inlining
6.6 Loop Swapping
6.7 Loop Unrolling
6.8 Loops and Conditionals
6.9 Scalar Operations
6.10 Compiler Options
6.11 Some Algorithmic Considerations for Smoothers
6.11.1 Point Gauss-Seidel Relaxation
6.11.2 Line Gauss-Seidel Relaxation
6.12 Coarsest Grid Direct Solver
6.13 l2-Norm of the Residual
6.14 2D Standard Coarsening Vector Algorithm
6.14.1 Coarsening
6.14.2 Data Structures
6.14.3 Smoothers
6.14.4 Coarsest Grid Solver
6.14.5 Grid Transfer Operators
6.14.6 Coarse Grid Operators
6.15 2D Semi-Coarsening Vector Algorithm
6.15.1 Data Structures
6.15.2 Coarsening
6.15.3 Smoothers
6.15.4 Coarsest Grid Solver
6.15.5 Grid Transfer Operators
6.15.6 Coarse Grid Operators
7 2D NUMERICAL RESULTS
7.1 Storage Requirements
7.2 Vectorization Speedup
7.3 2D Computational Work
7.4 Timing Results for Test Problems
7.5 Numerical Results for Test Problem 8
7.6 Numerical Results for Test Problem 9
7.7 Numerical Results for Test Problem 10
7.8 Numerical Results for Test Problem 11
7.9 Numerical Results for Test Problem 13
7.10 Numerical Results for Test Problem 17
7.11 Comparison of 2D Black Box Multigrid Methods
8 PARALLEL ALGORITHM CONSIDERATIONS IN TWO DIMENSIONS
8.1 CM-2 and CM-200 Parallel Algorithms
8.1.1 Timing Comparisons
8.2 CM-5 Hardware Overview
8.3 CM-5 Memory Management
8.4 Dynamic Memory Management Utilities
8.5 CM-5 Software Considerations
8.6 Coarsening and Data Structures in 2D
8.7 Coarse Grid Operators
8.8 Grid Transfer Operators
8.9 Smoothers
8.9.1 Parallel Line Gauss-Seidel Relaxation
8.9.2 CM-5 Tridiagonal Line Solver Using Cyclic Reduction
8.10 Coarsest Grid Solver
8.11 Miscellaneous Software Issues
8.11.1 Using Scalapack
8.11.2 PolyShift Communication
8.12 2D Standard Coarsening Parallel Algorithm
8.12.1 Data Structures
8.12.2 Coarsening
8.12.3 Smoothers
8.12.4 Coarsest Grid Solver
8.12.5 Grid Transfer Operators
8.12.6 Coarse Grid Operators
8.13 2D Semi-Coarsening Parallel Algorithm
8.13.1 Data Structures
8.13.2 Coarsening
8.13.3 Smoothers
8.13.4 Coarsest Grid Solver
8.13.5 Grid Transfer Operators
8.13.6 Coarse Grid Operators
8.14 2D Parallel Timings
9 BLACK BOX MULTIGRID IN THREE DIMENSIONS
9.1 Introduction
9.1.1 Semi-Coarsening
10 3D DISCRETIZATIONS
10.1 Finite Difference Discretization
10.2 Finite Volume Discretization
10.2.1 Interior Finite Volumes
10.2.2 Edge Boundary Finite Volumes
10.2.3 Dirichlet Boundary Condition
10.2.4 Neumann and Robin Boundary Conditions
11 3D NONSYMMETRIC: GRID TRANSFER OPERATORS
11.1 3D Grid Transfer Operations
11.2 3D Nonsymmetric Grid Operator: Collapsing Methods
11.2.1 3D Grid Transfer Operator Variations
11.3 3D Coarse Grid Operator
12 3D SMOOTHERS
12.1 Point Gauss-Seidel
12.2 Line Gauss-Seidel
12.3 Plane Gauss-Seidel
13 LOCAL MODE ANALYSIS IN THREE DIMENSIONS
13.1 Overview of 3D Local Mode Analysis
13.2 Three Dimensional Model Problems
13.3 Local Mode Analysis for Point Gauss-Seidel Relaxation
13.4 Local Mode Analysis for Line Gauss-Seidel Relaxation
13.5 Local Mode Analysis for Plane Gauss-Seidel Relaxation
14 3D VECTOR ALGORITHM CONSIDERATIONS
14.1 3D Smoother
14.2 Data Structures and Memory
14.3 3D Standard Coarsening Vector Algorithm
14.3.1 Coarsening
14.3.2 Data Structures
14.3.3 Smoothers
14.3.4 Coarsest Grid Solver
14.3.5 Grid Transfer Operators
14.3.6 Coarse Grid Operators
14.4 3D Semi-Coarsening Vector Algorithm
14.4.1 Data Structures
14.4.2 Coarsening
14.4.3 Smoothers
14.4.4 Coarsest Grid Solver
14.4.5 Grid Transfer Operators
14.4.6 Coarse Grid Operators
14.5 Timing Results for 3D Test Problems
14.6 Numerical Results for 3D Test Problem 1
14.7 Numerical Results for 3D Test Problem 2
15 PARALLEL 3D BLACK BOX MULTIGRID
15.1 3D Standard Coarsening Parallel Algorithm Modifications
15.2 3D Parallel Smoother
15.3 3D Data Structures and Communication
15.4 3D Parallel Timings
APPENDIX
A. OBTAINING THE BLACK BOX MULTIGRID CODES
B. COMPUTER SYSTEMS USED FOR NUMERICAL RESULTS
B.1 Cray Y-MP
B.2 CM-5
BIBLIOGRAPHY
FIGURES
FIGURE
1.1 Standard coarsening: superimposed fine grid G^h and coarse grid G^H
1.2 Semi-coarsening: superimposed fine grid G^h and coarse grid G^H
1.3 One V-cycle iteration for five grid levels
1.4 One S-cycle iteration for four grid levels
1.5 One W-cycle iteration for four grid levels
1.6 One F-cycle iteration for five grid levels
2.1 Vertex centered finite volume grid
2.2 Cell centered finite volume grid
2.3 Cell centered finite volume Ω_{i,j}
2.4 Vertex centered finite volume Ω_{i,j} at y = 0
2.5 Southwest boundary corner finite volume
3.1 Standard coarsening interpolation 2D cases
6.1 Cray Y-MP hardware diagram
6.2 Cray CPU configuration
6.3 2D Data Structures
7.1 Comparison of Setup time for BMGNS, SCBMG, and MGD9V
7.2 Comparison of one V-cycle time for BMGNS, SCBMG, and MGD9V
7.3 Domain Ω for problem 8
7.4 Domain Ω for problem 9
7.5 Domain Ω for problem 10
7.6 Domain Ω for problem 11
7.7 Domain Ω for problem 13
7.8 Domain Ω for problem 17
8.1 CM-5 system diagram
8.2 CM-5 processor node diagram
8.3 CM-5 vector unit diagram
8.4 CM-5 processor node memory map
8.5 Grid Data Structure Layout
9.1 Grid operator stencil in three dimensions
11.1 Grid transfer operators stencil in three dimensions
14.1 3D FSS data structure
TABLES
TABLE
5.1 Smoothing factor μ for point Gauss-Seidel relaxation for anisotropic diffusion equations
5.2 Smoothing factor μ for point Gauss-Seidel relaxation for convection-diffusion equations
5.3 Smoothing factor μ for x- and y-line Gauss-Seidel relaxation for anisotropic diffusion equations
5.4 Smoothing factor μ for x- and y-line Gauss-Seidel relaxation for convection-diffusion equations
5.5 Smoothing factor μ for alternating line Gauss-Seidel relaxation and incomplete line LU iteration for anisotropic diffusion equations
5.6 Smoothing factor μ for alternating line Gauss-Seidel relaxation and incomplete line LU iteration for convection-diffusion equations
6.1 Cray Y-MP Timings for the Naive, Kahan, and Doubling Summation Algorithms
6.2 Sparc5 Timings for the Naive, Kahan, and Doubling Summation Algorithms
7.1 Memory storage requirements for the Cray Y-MP
7.2 Storage requirements for BMGNS, SCBMG, and MGD9V
7.3 Vectorization speedup factors for standard coarsening
7.4 Vectorization speedup factors for semi-coarsening
7.5 Operation count for standard coarsening setup
7.6 Operation count for standard coarsening residual and grid transfers
7.7 Operation count for standard coarsening smoothers
7.8 Timing for standard coarsening on problem 8
7.9 Grid transfer timing comparison for standard and semi-coarsening
7.10 Timing for various smoothers
7.11 Smoothing versus grid transfer timing ratios
7.12 Setup times for the various grid transfers
7.13 V-cycle time for various smoothers
7.14 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 8
7.15 Number of V-cycles for standard coarsening using the sL/L grid transfer for problem 8
7.16 Number of V-cycles for standard coarsening using the hybrid sL/L grid transfer for problem 8
7.17 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 8
7.18 Number of V-cycles for standard coarsening using the operator, L/L, grid transfer for problem 8
7.19 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 8 with ILLU
7.20 Number of V-cycles for standard coarsening using the sL/L grid transfer for problem 8 with ILLU
7.21 Number of V-cycles for standard coarsening using the hybrid sL/L grid transfer for problem 8 with ILLU
7.22 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 8 with ILLU
7.23 Number of V-cycles for standard coarsening using the operator, L/L, grid transfer for problem 8 with ILLU
7.24 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 9
7.25 Number of V-cycles for standard coarsening using the sL/L grid transfer for problem 9
7.26 Number of V-cycles for standard coarsening using the hybrid sL/L grid transfer for problem 9
7.27 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 9
7.28 Number of V-cycles for standard coarsening using the operator, L/L, grid transfer for problem 9
7.29 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 9 with ILLU
7.30 Number of V-cycles for standard coarsening using the sL/L grid transfer for problem 9 with ILLU
7.31 Number of V-cycles for standard coarsening using the hybrid sL/L grid transfer for problem 9 with ILLU
7.32 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 9 with ILLU
7.33 Number of V-cycles for standard coarsening using the operator, L/L, grid transfer for problem 9 with ILLU
7.34 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 9 with 4-direction PGS
7.35 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 10
7.36 Number of V-cycles for standard coarsening using the sL/L grid transfer for problem 10
7.37 Number of V-cycles for standard coarsening using the hybrid sL/L grid transfer for problem 10
7.38 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 10
7.39 Number of V-cycles for standard coarsening using the operator, L/L, grid transfer for problem 10
7.40 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 10
7.41 Number of V-cycles for standard coarsening using the sL/L grid transfer for problem 10
7.42 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 10
7.43 Number of V-cycles for MGD9V on problem 10
7.44 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 11
7.45 Number of V-cycles for standard coarsening using the sL/L grid transfer for problem 11
7.46 Number of V-cycles for standard coarsening using the hybrid sL/L grid transfer for problem 11
7.47 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 11
7.48 Number of V-cycles for standard coarsening using the operator, L/L, grid transfer for problem 11
7.49 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 13
7.50 Number of V-cycles for standard coarsening using the hybrid sL/L grid transfer for problem 13
7.51 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 13
7.52 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 17
7.53 Number of V-cycles for standard coarsening using the original collapsing method for problem 17
7.54 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 17
7.55 Number of V-cycles for standard coarsening using the hybrid collapsing method for problem 17
7.56 Number of V-cycles for semi-coarsening for problem 17
7.57 Comparison for problem 8 on Cray Y-MP
7.58 Comparison for problem 9 on Cray Y-MP
8.1 Timing Comparison per V-cycle for semi-coarsening on the Cray Y-MP, CM-2, and CM-200
8.2 Timing Comparison per V-cycle for Standard Coarsening on the Cray Y-MP, CM-2, and CM-200
8.3 2D Standard coarsening 32-512 CM-5 nodes V-cycle timings
8.4 2D Standard coarsening 32-512 CM-5 nodes Setup timings
8.5 2D Standard coarsening 32-512 CM-5 nodes parallel efficiency
8.6 2D Semi-coarsening 32-512 CM-5 nodes V-cycle timings
8.7 2D Semi-coarsening 32-512 CM-5 nodes setup timings
8.8 2D Semi-coarsening 32-512 CM-5 nodes parallel efficiency
8.9 2D Timing comparison between CM-5, Cray Y-MP, and Sparc5
13.1 Smoothing factor for point Gauss-Seidel relaxation for anisotropic diffusion equations in 3D
13.2 Smoothing factor for point Gauss-Seidel relaxation for convection-diffusion equations in 3D
13.3 Smoothing factor for point Gauss-Seidel relaxation for convection-diffusion equations in 3D
13.4 Smoothing factor for point Gauss-Seidel relaxation for convection-diffusion equations in 3D
13.5 Smoothing factors for line Gauss-Seidel relaxation for anisotropic diffusion equations
13.6 Smoothing factors for line Gauss-Seidel relaxation for convection-diffusion equations
13.7 Smoothing factors for line Gauss-Seidel relaxation for convection-diffusion equations
13.8 Smoothing factors for line Gauss-Seidel relaxation for convection-diffusion equations
13.9 Smoothing factors for zebra line Gauss-Seidel relaxation for anisotropic diffusion equations
13.10 Smoothing factors: zebra line Gauss-Seidel relaxation for convection-diffusion equations
13.11 Smoothing factors: zebra line Gauss-Seidel relaxation for convection-diffusion equations
13.12 Smoothing factors: zebra line Gauss-Seidel relaxation for convection-diffusion equations
13.13 Smoothing factor μ for xy-, xz-, and yz-plane Gauss-Seidel relaxation for anisotropic diffusion equations
13.14 Smoothing factor μ for xy-, xz-, and yz-plane Gauss-Seidel relaxation for convection-diffusion equations
13.15 Smoothing factor μ for plane Gauss-Seidel (continued)
13.16 Smoothing factor μ for plane Gauss-Seidel (continued)
13.17 Smoothing factors: zebra xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation for anisotropic diffusion equations
13.18 Smoothing factors: zebra xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation for convection-diffusion equations
13.19 Smoothing factor for zebra plane Gauss-Seidel (continued)
13.20 Smoothing factor for zebra plane Gauss-Seidel (continued)
14.1 3D Multigrid Component Timing
14.2 Grid transfer timing comparison for standard and semi-coarsening
14.3 Timing for various smoothers
14.4 Smoothing versus grid transfer timing ratios
14.5 Numerical results for problem 1 in 3D
14.6 Numerical results for problem 1 in 3D
14.7 Numerical results for problem 1 in 3D
14.8 Numerical results for problem 1 in 3D
14.9 Numerical results for problem 1 in 3D
14.10 Numerical results for problem 1 in 3D
15.1 3D Standard coarsening 32, 64, 128 CM-5 nodes V-cycle timings
15.2 3D Standard coarsening 32, 64, 128 CM-5 nodes Setup timings
15.3 3D Standard coarsening 32, 64, 128 CM-5 nodes parallel efficiency
15.4 3D Semi-coarsening 32, 64, 128 CM-5 nodes V-cycle timings
15.5 3D Semi-coarsening 32, 64, 128 CM-5 nodes setup timings
15.6 3D Semi-coarsening 32, 64, 128 CM-5 nodes parallel efficiency
15.7 3D Timing comparison between CM-5 and Cray Y-MP
ACKNOWLEDGMENTS
I would first like to thank my advisor Joel E. Dendy, Jr., Los Alamos National
Laboratory, because without him none of this would have been possible; thanks! In
addition, at Los Alamos National Laboratory, I would like to thank Mac Hyman of
Group T-7 and the Center for Nonlinear Studies for their support, and the Advanced
Computing Laboratory and the CIC Division for the use of their computing facilities.
This work was partially supported by the Center for Research on Parallel Computation
through NSF Cooperative Agreement No. CCR-8809615.
I would like to thank my PhD committee members, Joel Dendy, Jan Mandel,
Leo Franca, Gita Alaghband, and Steve McCormick. A special thanks to professors
Bill Briggs, Stan Payne, and Roland Sweet. In addition, I would like to give a big
thanks to Dr. Suely B. Oliveira for getting me back on track.
Finally, I would like to thank my mom, Lee Buchanan, my twin brother,
Fred Bandy, my wife, Darlene Bandy, and all my friends for all their support and
encouragement. Last but not least, a special thanks to Mark and Flavia Kuta, some
very good friends, for letting me stay with them while I was in Denver.
CHAPTER 1
INTRODUCTION
1.1 Summary
The subject of this dissertation is the investigation of Black Box multigrid
solvers for the numerical solution of second order elliptic partial differential equations
in two or three dimensional domains. We place particular emphasis on efficiency on both
vector and parallel computers, represented here by the Cray Y-MP and the Thinking
Machines CM-5.
Black box multigrid methods are sometimes referred to in the literature as geometric
multigrid methods or, more recently, as automatic multigrid methods. The methods can
be considered a subclass of algebraic multigrid methods with several algorithmic
restrictions. Geometric multigrid methods make a priori assumptions about the domain
and the class of problems to be solved; in addition, they choose the intergrid
operators and coarse grid points based on the geometry and the order of the grid
equation operator. Algebraic multigrid, on the other hand, chooses both the coarse
grid and the intergrid operators based only on the coefficient matrix. Black box
multigrid lies between these two: the grids are chosen geometrically, on logically
rectangular grids, while the intergrid operators are chosen algebraically. There are
other hybrid multigrid methods, such as the unstructured grid method of Chan [22],
which chooses the coarse grid based on graph theoretical considerations and the
intergrid operator from the nodal coordinates (geometry), and the algebraic multigrid
method of Vaněk [81], which uses kernels of the associated quadratic form in lieu of
geometrical information. The algebraic multigrid method of Stüben and Ruge [66] [67]
uses almost the same construction of the intergrid operator as Dendy [26] once the
coarse grid has been chosen, while Vaněk's work is based on a different idea. The
assumptions and the components that make up the black box multigrid methods are
spelled out in more detail in the following sections of this chapter.
We will examine the development of robust black box multigrid solvers using both
standard and semicoarsening. The methods are aimed at the solution of
convection-diffusion equations with anisotropic and discontinuous coefficients
(interface problems), such that the discrete system of equations need only be
specified on a logically rectangular grid. A guiding principle in the design is that
if the discrete system of equations is symmetric, then the multigrid coarse grid
problems should preserve that symmetry.
1.1.1 Previous Results. The black box multigrid method was first introduced by
Dendy [26]. The method is a practical implementation of a multigrid method for
symmetric diffusion problems with anisotropic and discontinuous coefficients,
represented by
$$-\nabla \cdot (D \nabla U) + cU = f \quad \text{on } \Omega \subset \mathbb{R}^2. \tag{1.1}$$
The domain $\Omega$ is assumed to be embedded in a logically rectangular mesh and then
discretized in such a manner as to yield a stencil which is no larger than a compact
9-point stencil. The method employs the Galerkin coarse grid approximation,
$L^H = I_h^H L^h I_H^h$, to form the coarse grid operators, using the robust choice of
grid transfer operators from Alcouffe et al. [1]. The robust choice of grid transfer
operators is an operator induced formulation that, when $c = 0$, preserves the flux
$\mu \cdot (D \nabla U)$ across interfaces. In [1] lexicographic point Gauss-Seidel
relaxation and alternating lexicographic line Gauss-Seidel relaxation were the choices
available for smoothers. In subsequent extensions for vector machines, the choices
available were red/black (or four color for nine point operators) point Gauss-Seidel
and alternating red/black line Gauss-Seidel relaxation.
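The Galerkin construction can be illustrated on a small model problem. The following is a minimal sketch, not the operator-induced transfer operators of [1]: it assumes a 1D Poisson operator, plain linear interpolation for the prolongation, and its transpose for the restriction. All function names are hypothetical and chosen only for this illustration.

```python
def matmul(A, B):
    """Dense matrix product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def poisson_1d(n):
    """tridiag(-1, 2, -1) / h^2 on the n-1 interior points of (0, 1), h = 1/n."""
    h2 = (1.0 / n) ** 2
    m = n - 1
    A = [[0.0] * m for _ in range(m)]
    for i in range(m):
        A[i][i] = 2.0 / h2
        if i > 0:
            A[i][i - 1] = -1.0 / h2
        if i < m - 1:
            A[i][i + 1] = -1.0 / h2
    return A


def linear_prolongation(n):
    """Linear interpolation from the n/2-1 coarse to the n-1 fine interior points."""
    m, mc = n - 1, n // 2 - 1
    P = [[0.0] * mc for _ in range(m)]
    for j in range(mc):
        f = 2 * j + 1          # fine index (0-based) that coincides with coarse point j
        P[f - 1][j] = 0.5
        P[f][j] = 1.0
        P[f + 1][j] = 0.5
    return P


n = 8
Lh = poisson_1d(n)
P = linear_prolongation(n)     # plays the role of I_H^h (assumption: linear interpolation)
R = transpose(P)               # plays the role of I_h^H (assumption: R = P^T)
LH = matmul(matmul(R, Lh), P)  # Galerkin coarse grid operator L^H = I_h^H L^h I_H^h
```

In this sketch the coarse operator inherits the symmetry of the fine operator because the restriction is the transpose of the prolongation. With the unscaled transpose it comes out as twice the rediscretized coarse Poisson operator; scaling the restriction by 1/2 (full weighting) removes that factor.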
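The black box multigrid method was extended to elliptic convection-diffusion
problems [27], for which the model problem is
$$-\varepsilon \Delta U + U_x + U_y = f \quad \text{on } \Omega \subset \mathbb{R}^2, \tag{1.2}$$
where $\varepsilon > 0$. The mesh is the same as before and the discretization is of the form
$$L^h U_{i,j} = -\beta h \Delta_h U_{i,j} + D^h_{x,0} U_{i,j} + D^h_{y,0} U_{i,j} = F_{i,j}, \tag{1.3}$$
where
$$\Delta_h U_{i,j} = \frac{1}{h^2} \left( U_{i,j-1} + U_{i-1,j} - 4U_{i,j} + U_{i+1,j} + U_{i,j+1} \right),$$
$$D^h_{x,0} U_{i,j} = \frac{1}{2h} \left( U_{i+1,j} - U_{i-1,j} \right), \qquad D^h_{y,0} U_{i,j} = \frac{1}{2h} \left( U_{i,j+1} - U_{i,j-1} \right),$$
and where $\beta = 1/2$ yields upstream differencing. A generalization of Galerkin coarse grid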
approximation is used to form the coarse grid operators. The prolongation operators
are formed in the same way as they were for the symmetric method, but instead of
being induced by $L^h$, they are induced by the symmetric part of the grid operator,
$\mathrm{symm}(L^h) = \frac{1}{2}((L^h)^* + L^h)$. It was found that instead of using
$I_h^H = (I_H^h)^*$ to induce the restriction operator, a more robust choice is to form
a new interpolation operator $J_H^h$ based on $(L^h)^*$ and then to define the
restriction operator to be $I_h^H = (J_H^h)^*$. These choices were made to generalize
the work of [26]. The choice of smoothers was also changed to include lexicographic
point, line, and alternating line Kaczmarz relaxation. The method performed well for
the problems tested as long as $\beta > 0.25$, but since nonphysical oscillations begin
to dominate for $\beta < 0.25$, this restriction poses no difficulty.
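The role of the parameter $\beta$ can be checked in one dimension. The sketch below assumes an artificial-viscosity discretization of the form $-\beta h \Delta_h + D_0$, with $\Delta_h$ the standard second difference and $D_0$ the central first difference. This form, the 1D reduction, and the function name `stencil_1d` are assumptions made for illustration, not a transcription of the discretization in [27].

```python
def stencil_1d(beta, h):
    """West, center, east coefficients of -beta*h*Delta_h + D_0 in one dimension."""
    laplacian = [1.0 / h**2, -2.0 / h**2, 1.0 / h**2]    # second difference
    central = [-1.0 / (2.0 * h), 0.0, 1.0 / (2.0 * h)]   # central first difference
    return [-beta * h * l + d for l, d in zip(laplacian, central)]


h = 0.25
upstream = stencil_1d(0.5, h)      # west -1/h, center 1/h, east 0
small_beta = stencil_1d(0.25, h)   # east coefficient becomes positive
```

Under these assumptions, $\beta = 1/2$ reproduces the one-sided difference $(U_i - U_{i-1})/h$, while a smaller $\beta$ leaves a positive off-diagonal coefficient, which is the usual algebraic symptom associated with nonphysical oscillations.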
The next development was the creation of a 3D black box multigrid solver for
symmetric problems [29]. This method uses the same type of grid transfer operators
as the earlier 2D symmetric method. Two different methods of forming the coarse grid
operators were examined, with nearly identical convergence results. The first method
uses Galerkin coarse grid approximation with standard coarsening. The second method
also uses Galerkin coarse grid approximation, but it does so by using auxiliary
intermediate grids obtained by semicoarsening successively in each of the three
independent variables. For robustness, alternating red/black plane Gauss-Seidel
relaxation was used for the smoother. The plane solves of the smoother were performed
by using the 2D symmetric black box multigrid solver.
The 2D symmetric black box multigrid solver was then extended to solve singular and
periodic diffusion problems [30]. The existence of a solution, in case $c = 0$, is
assured by requiring that the equation be consistent: $\sum_{i,j} F_{i,j} = 0$. The
periodic boundary conditions only impact the multigrid method by requiring the
identification of the auxiliary grid point equations at setup, the identification of
the auxiliary grid point unknowns after interpolation, and the identification of the
auxiliary grid point residuals before restriction. The coarsest grid problem, if
$c = 0$, is singular and cannot be solved by Gaussian elimination, but since the
solution is determined only up to a constant, the arbitrary addition of the linearly
independent condition that $U_{i,j} = 0$ for some coarse grid point $(i, j)$ allows
solution by Gaussian elimination.
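The device of pinning one unknown can be tried on a small singular system. The sketch below is an illustration under stated assumptions: it uses a 1D pure Neumann second-difference matrix rather than a 2D coarsest grid operator, and the helper names (`gauss_solve`, `neumann_1d`, `pin_point`) are hypothetical.

```python
def gauss_solve(A, b):
    """Gaussian elimination with partial pivoting; raises on a (numerically) zero pivot."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            raise ZeroDivisionError("matrix is singular to working precision")
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def neumann_1d(n):
    """Second-difference matrix with pure Neumann ends; constants lie in its null space."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 2.0 if 0 < i < n - 1 else 1.0
        if i > 0:
            A[i][i - 1] = -1.0
        if i < n - 1:
            A[i][i + 1] = -1.0
    return A


def pin_point(A, b, p):
    """Replace equation p by the condition U_p = 0."""
    A2 = [row[:] for row in A]
    b2 = b[:]
    A2[p] = [1.0 if c == p else 0.0 for c in range(len(b))]
    b2[p] = 0.0
    return A2, b2


A = neumann_1d(5)
F = [1.0, 0.0, 0.0, 0.0, -1.0]            # consistent: the entries sum to zero
A_pinned, F_pinned = pin_point(A, F, 0)
U = gauss_solve(A_pinned, F_pinned)
residual = [sum(A[i][j] * U[j] for j in range(5)) - F[i] for i in range(5)]
```

Because the right-hand side is consistent, the equation that was replaced is still satisfied by the computed solution, so the residual of the original singular system vanishes.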
The first semicoarsening black box multigrid solver was introduced for the solution
of three dimensional petroleum reservoir simulations [33]. This method employs
semicoarsening in the z-direction and xy-plane relaxation for the smoother. Galerkin
coarse grid approximation is used to form the coarse grid operators. Operator induced
grid transfer operators were used, but only after Schaffer's paper [70] was it realized
how to compute these in a robust manner; see section 3.6.
A two dimensional black box multigrid solver called MGD9V was developed by
de Zeeuw [24]. This method was designed to solve the general elliptic
convection-diffusion equation. The method used standard coarsening, an ILLU smoother,
a V(0,1)-cycle (sawtooth), and a new set of operator induced grid transfer operators
that were designed specifically for convection dominated problems. The method was
found to be more robust than previous methods but was still divergent for problems
with closed convection characteristics on large grids. The method of MGD9V was
developed only for two dimensions and is not parallelizable.
The 2D symmetric black box multigrid solvers [26] [30] were updated by Bandy [9] to
be portable, to have consistent user interfaces that adhere to the SLATEC software
guidelines [38], and to provide three new user interfaces. One of the interfaces
included an automatic discretization routine, requiring the user to provide only a
function which can evaluate the coefficients at the fine grid points. The interfaces
all included extensive input parameter validation and memory management for workspace.
A parallel version of the semicoarsening method for two dimensional scalar
problems for the CM-2 was presented in [32]. A parallel version of semicoarsening for
two and three dimensional problems was presented in [75]. Both papers essentially
relied on the algorithm from [33] and borrowed from Schaffer [69] [70] for the robust
determination of grid transfer operators.
Fourier mode analysis has been used by many multigrid practitioners to find
good smoothers for use in multigrid methods. The results of many of these analyses
have been presented in the literature. Stüben and Trottenberg [78] present several
fundamental results of Fourier mode analysis for a few selected 2D problems. Kettler
[50] reports results for a range of 2D test problems and several lexicographically
ordered Gauss-Seidel methods along with several variations of ILU methods. Wesseling
[84] reports a summary of smoothing analysis results for the 2D rotated anisotropic
diffusion equation and the convection-diffusion equation; however, the results are for
only a limited number of worst case problems. Smoothing analysis results for the
red/black ordered methods appear in many places in the literature, but they are only
for a few selected problems. There are some results in the literature for 3D problems
[79], but just like the 2D results, the analysis is not complete enough for our purposes.
1.1.2 New Contributions. In this thesis we have developed and extended several
black box multigrid methods for both two and three dimensional nonsymmetric problems
on sequential, vector, and parallel computing platforms. The new methods are based on
a new implementation of the two dimensional nonsymmetric black box multigrid method
[27] for vector computers. The new implementation was designed to take better
advantage of developments in vector computing, while increasing portability and
compatibility with sequential computers. The new implementation performs with a
speedup factor of six over the earlier methods on vector computers, while providing
identical functionality, and it also incorporates many of the ideas and software
features from [9].
The new methods include the development of a three dimensional method, in both
vector and parallel versions, and a two dimensional parallel method for nonsymmetric
problems. The new methods were also extended to handle periodic and singular
problems using the modifications from [30].
In [27] a two dimensional nonsymmetric black box multigrid method was
examined for a convection dominated problem with constant convection characteristics.
In this work we investigate the new methods for a general convection-diffusion equation
$$-\nabla \cdot (D(x) \nabla U(x)) + b(x) \cdot \nabla U(x) + c(x) U(x) = f(x), \quad x \in \Omega. \tag{1.4}$$
When the earlier method of [27] was applied to equation (1.4), but with more
vectorizable smoothers than those in [27], it was found to perform poorly, and even
fail, for some nonconstant convection characteristic problems. This poor performance
was caused both by the new smoothers and by poor coarse grid correction. Several new
grid transfer operators are introduced to address these problems, of which two were
found to be robust; see chapter 3. The search for a more robust smoother was
facilitated by using local mode analysis, and led to the implementation of an
incomplete line LU factorization method (ILLU) for the smoother. The ILLU smoother
made the new methods more robust for convection dominated problems. A four-direction
point Gauss-Seidel method was also briefly considered for use as a smoother but was
discarded because it was neither parallelizable nor suitable for anisotropic problems,
even though it was fairly robust for convection dominated problems.
A nonsymmetric black box multigrid method, using standard coarsening, was
created for three dimensional problems; previously only a semicoarsening version
existed. The new method is the three dimensional analogue of the new two dimensional
black box multigrid method, and it uses alternating red/black plane Gauss-Seidel as
a smoother for robustness. The 3D smoother uses one V(1,1)-cycle of the 2D
nonsymmetric black box multigrid method to perform the required plane solves. The new
method was developed to use either the new grid transfer operators from the new 2D
nonsymmetric method or those from the 3D extension of Dendy's 2D nonsymmetric
black box multigrid method. The coarse grid operators are formed using the second
method from [29], which uses auxiliary intermediate grids obtained by successively
applying semicoarsening in each of the independent variables. In addition, the new
method is designed to handle periodic and singular problems. Another use of local
mode analysis was in the design of robust three dimensional smoothers. Although
there are hints in the literature for how to perform local mode analysis for color
relaxation in three dimensions, we are unaware of the appearance elsewhere of the
detailed analysis presented in chapter 13.
The new methods are compared to a new implementation of the semicoarsening
method, which achieves a speedup factor of over 5 for the two dimensional method and
a speedup factor of 2 for the three dimensional method on vector computers. The grid
transfer operators are based on Schaffer's idea; see chapter 3. The 2D semicoarsening
method uses coarsening in the y-direction coupled with red/black x-line Gauss-Seidel
relaxation for the smoother. The 3D semicoarsening method uses coarsening in the
z-direction coupled with red/black xy-plane Gauss-Seidel relaxation for the smoother.
The new implementation also includes the ILLU smoother, not present in the original
version.
Another aspect of this work was to compare de Zeeuw's MGD9V with the
black box multigrid methods. The idea was to mix and match components of the two
approaches to investigate their strengths and weaknesses and to ascertain whether a
combination existed which was better than either. The conclusion drawn from studying
the algorithm components is that MGD9V obtains its robustness from the ILLU smoother
and not from its grid transfer operators. If MGD9V uses alternating red/black line
Gauss-Seidel for its smoother, then performance similar to the black box multigrid
methods is observed. Likewise, if ILLU is used as the smoother in the black box
multigrid methods, then the performance is similar to that of MGD9V.
Parallel versions of the standard coarsening nonsymmetric black box multigrid
methods are developed in this thesis and compared with the existing parallel version
of the semicoarsening black box method. The smoother of the 3D parallel version uses
a modified 2D nonsymmetric black box multigrid method to perform the simultaneous
solution of all the planes of a single color.
A hybrid parallel black box multigrid method was developed that uses standard
coarsening for grid levels with a VP (virtual processor) ratio, i.e. the number of
grid points per processor, greater than one, and semicoarsening when the VP ratio
is less than one. When the VP ratio is greater than one, standard coarsening reduces
the number of grid points per processor, and hence the amount of serial work, faster
than semicoarsening does. When the VP ratio is less than one, the semicoarsening
method is more efficient than standard coarsening because it keeps more processors
busy that would otherwise be idle; in addition, tridiagonal library routines, which are
more efficient than we can write, are available for the data structures. The hybrid
parallel method is the most efficient method on the CM-5 because it uses the most
computationally efficient method for a given VP ratio.
1.2 Class of Problems
The class of problems that is being addressed is convection-diffusion equations
with anisotropic and discontinuous coefficients on a two or three dimensional domain.
These types of problems can be represented by the following equation and boundary
conditions,
$$L(x) = -\nabla \cdot (D(x) \nabla U(x)) + b(x) \cdot \nabla U(x) + c(x) U(x) = f(x), \quad x \in \Omega, \tag{1.5}$$
$$\nu(x) \cdot D(x) \nabla U(x) + \gamma(x) U(x) = 0, \quad x \in \partial\Omega, \tag{1.6}$$
on a bounded domain $\Omega \subset \mathbb{R}^d$ with boundary $\partial\Omega$, where $d$ is
either 2 or 3, $x = (x, y)$ or $(x, y, z)$, and $D(x) = (D_1, D_2)$ or $(D_1, D_2, D_3)$,
respectively. The term $\nu(x)$ is the outward normal vector. It is assumed that
$D(x) > 0$, $c(x) \ge 0$, and $\gamma(x) \ge 0$ to ensure that upon discretization we end up
with a positive definite system of equations. Anisotropies are also allowed, e.g. if
$\Omega \subset \mathbb{R}^2$ we have $D = (D_1, D_2)$, where it is possible that
$D_1 \gg D_2$ in some subregion(s) while $D_1 \ll D_2$ in others.
$D(x)$, $c(x)$, and $f(x)$ are allowed to be discontinuous across internal boundaries
$\Gamma \subset \Omega$. Moreover, let $\mu(x)$ be a normal vector at $x \in \Gamma$; then it is
natural to assume also that
$$U \text{ and } \mu \cdot (D \nabla U) \text{ are continuous at } x \text{ for almost every } x \in \Gamma. \tag{1.7}$$
The "almost every" is necessary to exclude juncture points of $\Gamma$, that is, points
where two pieces of $\Gamma$ intersect and the continuity of $\mu \cdot (D \nabla U)$ does not
make any sense.
The boundary conditions permitted in (1.6) can be of three types: Dirichlet,
Neumann, and mixed. The periodic boundary condition is not considered, but can be
handled by making a few adjustments and modifications to the black box multigrid
codes. It should be noted that, for a problem with pure Neumann boundary conditions,
a finite difference (volume or element) discretization may lead to a singular system of
equations: the singularity can be propagated to the coarsest grid level and cause trouble
for the direct solver, but a minor modification to the code circumvents this difficulty,
allowing solution of the coarsest grid level problem.
1.3 Discretization of the Problem
Let the continuous problem represented by equation (1.5) be written in operator
notation as
$$Lu = f \quad \text{in } \Omega. \tag{1.8}$$
The following discussion is valid for both two and three dimensions, but only
the two dimensional case is presented. Suppose that, for all $x = (x, y) \in \Omega$,
$a_x \le x \le b_x$ and $a_y \le y \le b_y$. Let $G^h$ define a rectangular grid on
$[a_x, b_x] \times [a_y, b_y]$, partitioned with
$$a_x = x_1 < x_2 < \cdots < x_{n_x} = b_x, \qquad a_y = y_1 < y_2 < \cdots < y_{n_y} = b_y, \tag{1.9}$$
and let the grid spacings be defined as
$$h_{x_i} = x_{i+1} - x_i, \qquad h_{y_j} = y_{j+1} - y_j. \tag{1.10}$$
Then the rectangular grid, $G^h$, is defined as
$$G^h = \{ (x_i, y_j) : x_i \in [a_x, b_x], \ y_j \in [a_y, b_y] \}, \tag{1.11}$$
with the domain grid, $\Omega^h$, being defined as
$$\Omega^h = \Omega \cap G^h. \tag{1.12}$$
Before the discrete grid problem is defined we should first address the issue of
domains with irregular boundaries. The black box multigrid solvers in two dimensions
are intended to solve the equation (1.8) on logically rectangular grids, but for
simplicity, we consider only rectangular grids. An irregularly shaped domain can be
embedded in the smallest rectangular grid, $G^h$, possible, $\Omega^h \subset G^h$. The
problem is then discretized on $\Omega^h$, avoiding any coupling to the grid points not
in $\Omega^h$. For grid points outside of $\Omega^h$, $x^h \in G^h \setminus \Omega^h$,
considered to be fictitious points, an arbitrary equation is introduced, such as
$c_{i,j} u_{i,j} = f_{i,j}$, where $c_{i,j} \ne 0$ and $f_{i,j}$ are arbitrary. The problem
is now rectangular and the solution to the discrete equations can be obtained at the
points in the domain, while the solution $u_{i,j} = f_{i,j} / c_{i,j}$ is obtained for the
other points. Problems with irregular domains in three dimensions can be handled in a
similar fashion for a cuboid box grid.
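The embedding of an irregular domain can be sketched as follows. This is an illustration only: the L-shaped domain, the 5-point Laplacian coefficients, and the names `in_domain` and `assemble` are assumptions made for the example and are not taken from the black box multigrid codes.

```python
def in_domain(i, j, n):
    """Hypothetical L-shaped domain: the square grid minus its upper-right quadrant."""
    return not (i > n // 2 and j > n // 2)


def assemble(n):
    """Return {(i, j): {(i2, j2): coefficient}} on the full (n+1) x (n+1) rectangular grid."""
    rows = {}
    for j in range(n + 1):
        for i in range(n + 1):
            if not in_domain(i, j, n):
                # Fictitious point: arbitrary equation c*u = f, here with c = 1.
                rows[(i, j)] = {(i, j): 1.0}
                continue
            row = {(i, j): 4.0}
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                p = (i + di, j + dj)
                on_grid = 0 <= p[0] <= n and 0 <= p[1] <= n
                if on_grid and in_domain(p[0], p[1], n):
                    row[p] = -1.0      # couple only to points of the domain grid
            rows[(i, j)] = row
    return rows


n = 4
rows = assemble(n)
fictitious = [p for p in rows if not in_domain(p[0], p[1], n)]
```

Every grid point of the rectangle now carries an equation, so the system is logically rectangular, while no equation at a domain point references a fictitious point.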
Now the discrete grid problem approximating the continuous problem (1.8)
can be written as
$$L^h u^h = f^h \quad \text{in } G^h, \tag{1.13}$$
where the superscript $h$ refers to discretization with grid spacing $h$. Note that, for
irregular domains, the discrete solution $u^h(x)$ makes sense only for $x \in \Omega^h$;
$u^h(x)$, for $x \in G^h \setminus \Omega^h$, is arbitrary.
We consider only discrete operators $L^h$ on rectangular grids that can be
described by 5-point or 9-point box stencils. Suppose we discretize the equation (1.5)
using five points at the grid point $(x_i, y_j)$,
$$-S_{i,j} U_{i,j-1} - W_{i,j} U_{i-1,j} + C_{i,j} U_{i,j} - E_{i,j} U_{i+1,j} - N_{i,j} U_{i,j+1} = F_{i,j}. \tag{1.14}$$
We use stencil notation to represent the 5- and 9-point cases, respectively:
$$\begin{bmatrix} & -N & \\ -W & C & -E \\ & -S & \end{bmatrix}^h_{i,j}, \qquad \begin{bmatrix} -NW & -N & -NE \\ -W & C & -E \\ -SW & -S & -SE \end{bmatrix}^h_{i,j}, \tag{1.15}$$
where the stencil represents the coefficients for the discrete equation at the grid point
$(x_i, y_j)$ on grid $G^h$. The subscripts $i, j$ can be dropped and it will be understood
that the stencil is centered at the grid point $(x_i, y_j)$. The superscript $h$ can also
be dropped when the mesh spacing is clear from the context. The stencils are valid over
the entire grid including the boundary points because the coefficients are allowed to
be zero. Hence, any coefficients that reach out of the domain can be set to zero.
Clearly, the 5-point stencil is a special case of the 9-point stencil, where the NW,
NE, SW, and SE coefficients are set to zero.
We illustrate the stencil notation for Poisson's equation on a square domain
in two dimensions,
$$Lu(x, y) = -u_{xx}(x, y) - u_{yy}(x, y) = f(x, y), \quad (x, y) \in \Omega = (0, 1)^2, \tag{1.16}$$
using 5- and 9-point finite difference discretizations. The 5-point stencil for the
operator $L$, using a central finite difference discretization on a uniform grid with
grid spacing $h = 1/N$ for $N = n_x = n_y$, is
$$L^h = -\Delta_h = \frac{1}{h^2} \begin{bmatrix} & -1 & \\ -1 & 4 & -1 \\ & -1 & \end{bmatrix}^h. \tag{1.17}$$
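As a small check of the stencil notation, the sketch below stores the 5-point Poisson stencil in 9-point form, with zero corner coefficients, and applies it to the quadratic $u = x^2 + y^2$, for which $-u_{xx} - u_{yy} = -4$. The storage layout and the function name `apply_stencil` are assumptions made for this illustration.

```python
def apply_stencil(stencil, u, i, j):
    """Apply a 3x3 stencil (rows ordered north to south) to u at grid point (i, j)."""
    total = 0.0
    for row, dj in zip(stencil, (1, 0, -1)):        # north, center, south rows
        for coeff, di in zip(row, (-1, 0, 1)):      # west, center, east columns
            if coeff != 0.0:                        # zero coefficients are never referenced
                total += coeff * u[j + dj][i + di]
    return total


n = 8
h = 1.0 / n
# 5-point stencil written as a 9-point stencil with zero NW, NE, SW, SE entries.
five_point = [[0.0, -1.0 / h**2, 0.0],
              [-1.0 / h**2, 4.0 / h**2, -1.0 / h**2],
              [0.0, -1.0 / h**2, 0.0]]
# u[j][i] holds u(x_i, y_j) with x_i = i*h and y_j = j*h.
u = [[(i * h) ** 2 + (j * h) ** 2 for i in range(n + 1)] for j in range(n + 1)]
values = [apply_stencil(five_point, u, i, j)
          for j in range(1, n) for i in range(1, n)]
```

Central differences are exact for quadratics, so every interior value reproduces $-4$ up to rounding.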
One 9-point discretization for $L$ in (1.16) has the stencil
$$L^h = \frac{1}{6h^2} \begin{bmatrix} -1 & -4 & -1 \\ -4 & 20 & -4 \\ -1 & -4 & -1 \end{bmatrix}^h. \tag{1.18}$$
Many types of discretization can be considered: central finite differences,
upstream finite differences, finite volumes, finite elements, etc.
The black box multigrid solvers actually allow for more general meshes than
just the rectangular grids shown so far. The only requirement is that the mesh be
logically rectangular. In two dimensions the logically rectangular grid $G$ can be
defined as
$$G = \{ (x(i,j), y(i,j)) : 1 \le i \le n_x, \ 1 \le j \le n_y \}, \tag{1.19}$$
where the grid cell formed by
$$(x(i,j+1), y(i,j+1)), \quad (x(i+1,j+1), y(i+1,j+1)),$$
$$(x(i,j), y(i,j)), \quad (x(i+1,j), y(i+1,j))$$
has positive area, $1 \le i < n_x$, $1 \le j < n_y$.
The black box multigrid solvers which we consider require the discretization
to be represented by a 9-point box stencil. However, just because the problem has a
9-point box stencil does not mean that it can be solved by the black box multigrid
methods presented in this thesis. Such solutions are dependent on a number of factors
which are problem dependent. We attempt to investigate these factors in this thesis.
1.4 Multigrid Overview
A two level multigrid method is presented first to illustrate the basic components
and underlying ideas that will be expanded into the classical multigrid method.
Figure 1.1. Standard coarsening. Superimposed fine grid $G^h$ and coarse grid $G^H$,
where the marked points indicate the coarse grid points in relation to the fine grid $G^h$.
Suppose that we have a continuous problem of the form
$$Lu(x, y) = f(x, y), \quad (x, y) \in \Omega \subset \mathbb{R}^2, \tag{1.20}$$
where $L$ is a linear positive definite operator defined on an appropriate set of
functions in $\Omega = (0, 1)^2 \subset \mathbb{R}^2$. Let $G^h$ and $G^H$ be two uniform grids
for the discretization of $\Omega$; then
$$G^h = \{ (x, y) \in \Omega : (x, y) = (ih, jh), \ i, j = 0, \ldots, n \} \tag{1.21}$$
and
$$G^H = \{ (x, y) \in \Omega : (x, y) = (iH, jH) = (i2h, j2h), \ i, j = 0, \ldots, n/2 \}, \tag{1.22}$$
where the number of grid cells $n$ on $G^h$ is even with grid spacing $h = 1/n$, and where
grid $G^H$ has $n/2$ grid cells with grid spacing $H = 2h$.
The coarse grid $G^H$ is often referred to as a standard coarsening of $G^h$; see
figure 1.1. However, this choice is not the only one possible. Another popular choice is
semicoarsening, which coarsens in only one dimension; see figure 1.2. For the overview,
only standard coarsening will be used.
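The two coarsening strategies can be compared by listing which fine grid indices survive on the coarse grid. The sketch below assumes a uniform grid with $n$ cells per direction and, for semicoarsening, coarsening in the y-direction only; the function names are hypothetical.

```python
def standard_coarsening(n):
    """Fine-grid indices (i, j) of the coarse points when both directions are coarsened."""
    return [(i, j) for j in range(0, n + 1, 2) for i in range(0, n + 1, 2)]


def semicoarsening_y(n):
    """Fine-grid indices of the coarse points when only the y-direction is coarsened."""
    return [(i, j) for j in range(0, n + 1, 2) for i in range(0, n + 1)]


n = 8
fine_points = (n + 1) ** 2
standard = standard_coarsening(n)
semi = semicoarsening_y(n)
```

For $n = 8$ the fine grid has 81 points, standard coarsening keeps 25 of them, and semicoarsening in y keeps 45, which reflects the roughly fourfold versus twofold reduction per level in two dimensions.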
Figure 1.2. Semicoarsening. Superimposed fine grid $G^h$ and coarse grid $G^H$, where
the marked points indicate the coarse grid points in relation to the fine grid $G^h$.
The discrete problems now take the form
$$L^h u^h = f^h \quad \text{on } G^h \tag{1.23}$$
and
$$L^H u^H = f^H \quad \text{on } G^H. \tag{1.24}$$
We refer to $L^h$ and $L^H$ as the fine and coarse grid operators respectively. The grid
operators are positive definite, linear operators
$$L^h : G^h \to G^h \tag{1.25}$$
and
$$L^H : G^H \to G^H. \tag{1.26}$$
Let $U^h$ be an approximation to $u^h$ from equation (1.23). Denote the error $e^h$
by
$$e^h = u^h - U^h; \tag{1.27}$$
thus $e^h$ can also be regarded as a correction to $U^h$. The residual (defect) of equation
(1.23) is given by
$$r^h = f^h - L^h U^h. \tag{1.28}$$
The defect equation (error-residual equation) on grid $G^h$
$$L^h e^h = r^h \tag{1.29}$$
is equivalent to the original fine grid equation (1.23). The defect equation and its
approximation play a central role in the development of a multigrid method.
The fine grid equation (1.23) can be approximately solved using an iterative
method such as Gauss-Seidel. The first few iterations reduce the error quickly, but then
the reduction in the error slows down for subsequent iterations. The slowing down in
the reduction of the error after the initial quick reduction is a property of most regular
splitting methods and of most basic iterative methods. These methods reduce the error
associated with high frequency (rough) components of the error quickly, but the low
frequency (smooth) components are reduced very little. Hence, the methods seem to
converge quickly for the first few iterations, as the high frequency error components
are eliminated, but then the convergence rate slows down towards its asymptotic value
as the low frequency components are slowly reduced. The idea behind the multigrid
method is to take advantage of this behavior in the reduction of the error components.
The point is that a few iterations of the relaxation method on Gh effectively eliminate
the high frequency components of the error.
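The smoothing behavior described above can be observed numerically. The sketch below assumes the 1D model operator tridiag(-1, 2, -1) with homogeneous Dirichlet values and a zero right-hand side, so that the iterate itself is the error; the mode numbers, the grid size, and the function names are illustrative choices.

```python
import math


def gauss_seidel_sweep(u):
    """One lexicographic Gauss-Seidel sweep for tridiag(-1, 2, -1) u = 0."""
    for i in range(1, len(u) - 1):
        u[i] = 0.5 * (u[i - 1] + u[i + 1])


def norm(u):
    return math.sqrt(sum(x * x for x in u))


def reduction(k, n=32, sweeps=5):
    """Error norm after `sweeps` sweeps divided by the initial error norm for mode k."""
    u = [math.sin(k * math.pi * i / n) for i in range(n + 1)]
    u[0] = u[n] = 0.0          # enforce the Dirichlet values exactly
    start = norm(u)
    for _ in range(sweeps):
        gauss_seidel_sweep(u)
    return norm(u) / start


smooth = reduction(1)    # low frequency error: barely reduced
rough = reduction(24)    # high frequency error: reduced strongly
```

In this setting a handful of sweeps removes most of the oscillatory error while leaving the smooth error almost untouched, which is the behavior that the coarse grid correction is designed to complement.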
Further relaxation on the fine grid results in little gain towards approximating
the solution. However, the smooth components of the error on the fine grid are high
frequency components with respect to the coarse grid. So, let us project the defect
equation, since it is the error that we are interested in resolving, onto the coarse grid
from the fine grid. This projection is done by using a restriction operator to project
the residual, $r^h$, onto the coarse grid, where we can form a new defect equation
$$L^H v^H = I_h^H r^h = f^H, \tag{1.30}$$
where $I_h^H$ is the restriction operator. We can now solve this equation for $v^H$. Having
done so, we can project the solution back up to the fine grid with a prolongation
(interpolation) operator, $I_H^h$, and correct the solution on the fine grid, $G^h$,
$$U^h \leftarrow U^h + I_H^h v^H. \tag{1.31}$$
We call this process (of projecting the error from the coarse grid to the fine grid and
correcting the solution there) the coarse grid correction step. The process of projecting
the error from a coarse grid to a fine grid introduces high frequency errors. The high
frequencies introduced by prolongation can be eliminated by applying a few iterations
of a relaxation scheme. The relaxation scheme can be applied to the projection of the
error, $I_H^h v^H$, or to the approximation to the solution, $U^h$, after the correction.
It is desirable to apply the relaxation to $U^h$ instead of $I_H^h v^H$ since then
additional reduction of the smooth components of the error in the solution may be obtained.
The projection operator from the fine grid to the coarse grid is called the
restriction operator, while the projection operator from the coarse grid to the fine
grid is called the prolongation operator or, interchangeably, the interpolation operator.
These two operators are referred to as the grid transfer operators.
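The two level scheme can be written out for a 1D model problem. The following is a minimal sketch under stated assumptions: the problem is $-u'' = f$ on $(0,1)$ with homogeneous Dirichlet conditions, the smoother is lexicographic Gauss-Seidel, restriction is full weighting, prolongation is linear interpolation, and the coarse operator is obtained by rediscretization (which for this model problem coincides with the Galerkin operator built from these two transfer operators). None of the function names come from the black box multigrid codes.

```python
import math


def smooth(u, f, h, sweeps):
    """Lexicographic Gauss-Seidel for tridiag(-1, 2, -1)/h^2 u = f."""
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])


def residual(u, f, h):
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def restrict(r):
    """Full weighting onto the grid with half as many cells."""
    nc = (len(r) - 1) // 2
    inner = [0.25 * r[2 * j - 1] + 0.5 * r[2 * j] + 0.25 * r[2 * j + 1]
             for j in range(1, nc)]
    return [0.0] + inner + [0.0]


def prolong(v):
    """Linear interpolation onto the grid with twice as many cells."""
    fine = [0.0] * (2 * (len(v) - 1) + 1)
    for j in range(len(v)):
        fine[2 * j] = v[j]
    for j in range(len(v) - 1):
        fine[2 * j + 1] = 0.5 * (v[j] + v[j + 1])
    return fine


def solve_exact(f, h):
    """Thomas algorithm for tridiag(-1, 2, -1)/h^2 v = f with zero boundary values."""
    n = len(f) - 1
    v = [0.0] * (n + 1)
    cp = [0.0] * (n + 1)
    dp = [0.0] * (n + 1)
    cp[1] = -0.5
    dp[1] = 0.5 * h * h * f[1]
    for i in range(2, n):
        denom = 2.0 + cp[i - 1]
        cp[i] = -1.0 / denom
        dp[i] = (h * h * f[i] + dp[i - 1]) / denom
    v[n - 1] = dp[n - 1]
    for i in range(n - 2, 0, -1):
        v[i] = dp[i] - cp[i] * v[i + 1]
    return v


def two_grid(u, f, h, nu1=1, nu2=1):
    smooth(u, f, h, nu1)                      # pre-smoothing on the fine grid
    rH = restrict(residual(u, f, h))          # f^H = I_h^H r^h
    vH = solve_exact(rH, 2.0 * h)             # solve L^H v^H = f^H
    correction = prolong(vH)                  # I_H^h v^H
    for i in range(1, len(u) - 1):
        u[i] += correction[i]                 # U^h <- U^h + I_H^h v^H
    smooth(u, f, h, nu2)                      # post-smoothing after the correction
    return u


n = 32
h = 1.0 / n
f = [math.pi ** 2 * math.sin(math.pi * i * h) for i in range(n + 1)]
u = [0.0] * (n + 1)
r0 = max(abs(x) for x in residual(u, f, h))
for _ in range(5):
    two_grid(u, f, h)
r5 = max(abs(x) for x in residual(u, f, h))
```

The post-smoothing is applied to the corrected approximation rather than to the interpolated correction alone, in line with the preference stated above.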
In the two level scheme just described, it can be seen that the coarse grid
problem is the same, in form, as the fine grid problem with $u^h$ and $f^h$ being replaced
by $v^H$ and $f^H = I_h^H r^h$ respectively. We can now formulate the classical multigrid
method by applying the above two level scheme recursively. In doing so, we no longer
solve the coarse grid defect equation exactly. Instead, we use the relaxation scheme on
the coarse grid problem, where now, the smooth (low) frequencies from the fine grid
appear to be higher frequencies with respect to the coarse grid. The relaxation scheme
now effectively reduces the error components of these, now, higher frequencies. The
coarse grid problem now looks like the fine grid problem, and we can project the coarse
grid residual to an even coarser grid where a new defect equation is formed to solve
for the error. The grid spacing in this yet coarser grid is 2H. After sufficiently many
recursions of the two level method, the resulting grid will have too few grid points
to be reduced any further. We call this grid level the coarsest grid. We can either
use relaxation or a direct solver to solve the coarsest grid problem. The approximate
solution is then propagated back up to the fine grid, using the coarse grid correction
step recursively.
What we have described informally is one multigrid V-cycle. More formally,
let us number the grid levels from 1 to $M$, where grid level 1 is the coarsest and grid
level $M$ is the finest.
Algorithm 1.4.1 ( MGV($k$, $\nu_1$, $\nu_2$, $h$) )
1. relax $\nu_1$ times on $L^k U^k = F^k$
2. compute the residual, $r^k = F^k - L^k U^k$
3. restrict the residual $I_k^{k-1} r^k$ to $G^{k-1}$, $F^{k-1} = I_k^{k-1} r^k$, and form the
coarse grid problem (defect equation) $L^{k-1} U^{k-1} = F^{k-1}$, where
$v^k = I_{k-1}^k U^{k-1}$ and $h^{k-1} = 2h^k$.
4. IF $(k - 1) \ne 1$ THEN call Algorithm MGV($k - 1$, $\nu_1$, $\nu_2$, $H$)
5. solve $L^{k-1} U^{k-1} = F^{k-1}$ to get the solution $u^{k-1}$
6. interpolate the defect (coarse grid solution) to the fine grid, and correct the fine
grid solution, $U^k \leftarrow U^k + I_{k-1}^k u^{k-1}$
7. relax $\nu_2$ times on $L^k U^k = F^k$
8. IF (finest grid) THEN Stop
This algorithm describes the basic steps in the multigrid method for one iteration of
a V-cycle. If the algorithm uses bilinear (trilinear in 3D) interpolation, it is called
the classical multigrid method. This algorithm assumes that the coarsening is done by
doubling the fine grid spacing, which can be seen in step 3 of the algorithm. However,
the algorithm is valid for any choice of coarsening, $h^{k-1} = m h^k$, where $m$ is any
integer greater than one.
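The recursive structure of the V-cycle can be sketched for the 1D model problem $-u'' = f$ on $(0,1)$ with homogeneous Dirichlet conditions. The sketch makes several assumptions that are not part of Algorithm 1.4.1: lexicographic Gauss-Seidel as the smoother, full weighting and linear interpolation as grid transfer operators, rediscretized coarse operators, and recursion on the number of cells rather than on an explicit level index. The function names are hypothetical.

```python
import math


def gauss_seidel(u, f, h, sweeps):
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])


def residual(u, f, h):
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def restrict(r):
    nc = (len(r) - 1) // 2
    rc = [0.0] * (nc + 1)
    for j in range(1, nc):
        rc[j] = 0.25 * r[2 * j - 1] + 0.5 * r[2 * j] + 0.25 * r[2 * j + 1]
    return rc


def prolong(vc):
    nc = len(vc) - 1
    v = [0.0] * (2 * nc + 1)
    for j in range(nc + 1):
        v[2 * j] = vc[j]
    for j in range(nc):
        v[2 * j + 1] = 0.5 * (vc[j] + vc[j + 1])
    return v


def v_cycle(u, f, h, nu1=2, nu2=1):
    n = len(u) - 1
    if n == 2:
        u[1] = 0.5 * h * h * f[1]              # coarsest grid: one unknown, solved directly
        return u
    gauss_seidel(u, f, h, nu1)                 # step 1: pre-smoothing
    rc = restrict(residual(u, f, h))           # steps 2-3: residual and restriction
    vc = v_cycle([0.0] * len(rc), rc, 2.0 * h, nu1, nu2)   # steps 4-5: coarse problem
    v = prolong(vc)                            # step 6: interpolate and correct
    for i in range(1, n):
        u[i] += v[i]
    gauss_seidel(u, f, h, nu2)                 # step 7: post-smoothing
    return u


n = 64
h = 1.0 / n
f = [math.pi ** 2 * math.sin(math.pi * i * h) for i in range(n + 1)]
u = [0.0] * (n + 1)
r0 = max(abs(x) for x in residual(u, f, h))
for _ in range(10):
    v_cycle(u, f, h)
r10 = max(abs(x) for x in residual(u, f, h))
error = max(abs(u[i] - math.sin(math.pi * i * h)) for i in range(n + 1))
```

After a few cycles the remaining error is dominated by the discretization error of the model problem rather than by the algebraic error of the iteration.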
1.4.1 Multigrid Cycling Strategies. There are many different types of
cycling strategies that are used in multigrid methods besides the V-cycle. We illustrate
the different cycling types with the use of a few pictures and brief descriptions.
Figure 1.3. One V-cycle iteration for five grid levels, where each marker represents a
visit to a grid level.
The Vcycle is illustrated graphically in figure 1.3. The represents a visit
to a particular grid level.. A slanting line connection between two grid levels indicates
that smoothing work is to be performed. A vertical line connection between grid levels
means that no smoothing is to take place between grid level visits. The grid levels are
indicated by a numerical value listed on the left side of the figure, where grid level 1 is
the coarsest grid level and is always placed at the bottom of the diagram.
The mechanics of the Vcycle were described in the multigrid algorithm in the
last section. The Vcycle is one of the most widely used multigrid cycling strategies. Its
best performance can be realized when there is an initial guess of the solution available.
When a guess is not available a common choice is to use a zero initial guess or to use
an Fcycle (see below).
The S-cycle is illustrated in figure 1.4. The S stands for sawtooth, because
that is what it resembles; it is clearly a V(0,1)-cycle and thus a special case of a
V-cycle. The S-cycle is what de Zeeuw's MGD9V [24] black box multigrid code uses for its
cycling strategy. The S-cycle usually requires a smoother with a very good smoothing
factor in order to be efficient and competitive with other cycling strategies.
Figure 1.4. One S-cycle iteration for four grid levels, where each marker represents a visit to a
grid level.
The W-cycle is illustrated in figure 1.5. The W-cycle is sometimes called a
2-cycle; similarly, a V-cycle can be called a 1-cycle. From figure 1.5, one can see
the W type structure. It is called a 2-cycle because there must be two visits to the
coarsest grid level before ascending to the next finer intermediate fine grid level. An
intermediate fine grid level is one that is neither the finest nor the coarsest grid level and where
the algorithm switches from ascending to descending based on the number of times the
grid level has been visited since the residual was restricted to it from a finer grid.
Figure 1.5. One W-cycle iteration for four grid levels, where each marker represents a visit to
a grid level.
The F-cycle is illustrated in figure 1.6 and is called a full multigrid cycle. The
figure shows a full multigrid V-cycle, that is, each subcycle that visits the coarsest
grid level is a V-cycle. An F-cycle can also be created using a W-cycle, or any other
type of cycling, for its subcycle. The F-cycle is very good when an initial guess for the
multigrid iteration is not available, since it constructs its own initial guess. The F-cycle
first projects the fine grid problem down to the coarsest grid level and then proceeds
to construct a solution by using subcycles. After the completion of each subcycle the
solution on an intermediate fine grid level is interpolated up to the next finer grid level
where a new subcycle begins. This process is continued until the finest grid level is
reached and its own V-cycle completed. At this point, if more multigrid iterations are
needed, then the V-cycling is continued at the finest grid level.
Figure 1.6. One F-cycle iteration for five grid levels, where each marker represents a visit to a
grid level.
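The cycling strategies can be compared by listing the order in which grid levels are visited. The sketch below is only one reading of the descriptions above: `cycle_levels` and `fmg_levels` are hypothetical helper names, `gamma` is the cycle index (1 for a V-cycle, 2 for a W-cycle), and smoothing counts are ignored, so a V-cycle and an S-cycle produce the same visit order.

```python
def cycle_levels(k, gamma=1):
    """Visit order of one gamma-cycle started on level k (level 1 is coarsest)."""
    if k == 1:
        return [1]
    sub = cycle_levels(k - 1, gamma)
    seq = [k] + sub
    # Repeat the subcycle on level k-1; consecutive visits to level k-1 merge,
    # and repeated exact solves on the coarsest level collapse to one visit.
    for _ in range(gamma - 1):
        seq += sub[1:]
    return seq + [k]

def fmg_levels(m, gamma=1):
    """Visit order of a full multigrid (F) cycle with gamma-cycles as subcycles."""
    seq = list(range(m, 0, -1))      # project the problem down to the coarsest level
    for k in range(2, m + 1):        # interpolate up one level, then run a subcycle
        seq += cycle_levels(k, gamma)
    return seq

if __name__ == "__main__":
    print("V:", cycle_levels(5, 1))
    print("W:", cycle_levels(4, 2))
    print("F:", fmg_levels(3, 1))
```

For the W-cycle on four levels the list shows two visits to level 1 before each ascent to level 3, which matches the description of the 2-cycle.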
1.5 Black Box Multigrid
Black box multigrid is also called geometric multigrid by some and is a member
of the algebraic multigrid method (AMG) family. The distinguishing feature of black
box multigrid is that the black box approach makes several assumptions about the
type of problem to be solved and the structure of the system of equations. The black
box multigrid methods also have a predetermined coarsening scheme where the coarse
grid has roughly half as many grid points as the fine grid does in one or more of the
coordinate directions. For a uniform grid, this means that $H = 2h$. Both black box and
algebraic multigrid methods automatically generate the grid transfer operators, prolongation
$I_{k-1}^{k}$ and restriction $I_{k}^{k-1}$ for $2 \le k \le M$, and the coarse grid operators $L^k$ for
$1 \le k \le M-1$. The coarse grid operators are formed using the Galerkin coarse grid approximation,
$$L^{k-1} = I_{k}^{k-1} L^{k} I_{k-1}^{k}, \qquad (1.32)$$
where $k = 2, \ldots, M$. The algebraic multigrid methods deal with the system of
equations in a purely algebraic way. The coarsening strategy for general AMG is not
fixed nor is the formation of the grid transfer operators, resulting in methods that can
be highly adaptable. However, the more adaptable a method is, the more complex its
implementation is likely to be, and it may also be less efficient due to its complexity.
Another disadvantage of general AMG methods is that the coarse grid problems are
usually not structured even when the fine grid problem is; moreover, the unstructured
matrices on coarser levels tend to become less and less sparse, the coarser the grid level.
To define the black box multigrid method we need to define several of the
multigrid components, such as the grid transfer operators, the coarse grid operators,
the type of smoother employed, and the coarsest grid solver. We can also mention the
type of cycling strategies that are available and other options.
There are several different grid transfer operators that we have developed and
used in our codes. They are of two basic types. The first type collapses the stencil of
the operator in a given grid coordinate direction to form three point relations, and the
second is based on ideas from S. Schaffer [69]. The details of the grid transfer operators
will be presented in chapter 3.
The coarse grid operators are formed by using the Galerkin coarse grid
approximation given in equation (1.32).
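The Galerkin product can be illustrated on a small dense example. The sketch below assumes a 1D Poisson matrix, linear interpolation, and a restriction equal to one half of the transpose of the prolongation; these choices and the function names are illustrative only, since the black box codes build operator-induced transfer operators instead.

```python
import numpy as np

def poisson_1d(n, h):
    """Dense 1D Poisson matrix tridiag(-1, 2, -1) / h^2 with n interior points."""
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def linear_prolongation(nc):
    """Linear interpolation from nc coarse to 2*nc + 1 fine interior points."""
    P = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        i = 2 * j + 1              # fine index that coincides with coarse point j
        P[i, j] = 1.0
        P[i - 1, j] = 0.5
        P[i + 1, j] = 0.5
    return P

def galerkin(A, P):
    """Coarse operator R A P with R = P^T / 2 (full weighting in 1D)."""
    R = 0.5 * P.T
    return R @ A @ P

if __name__ == "__main__":
    nc, h = 7, 1.0 / 16.0
    Ac = galerkin(poisson_1d(2 * nc + 1, h), linear_prolongation(nc))
    print(np.allclose(Ac, poisson_1d(nc, 2.0 * h)))
```

In this constant coefficient case the Galerkin operator coincides with the direct discretization on the grid with spacing $2h$, and it stays tridiagonal.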
There are several choices for the smoothing operator available in our codes.
The smoothers that we have chosen are all of the multicolor type, except for
incomplete line LU. For standard coarsening versions, the choices are point Gauss-Seidel,
line Gauss-Seidel, alternating line Gauss-Seidel, and incomplete line LU. The
semicoarsening version uses either line Gauss-Seidel by lines in the x-direction or incomplete
line LU. The smoothers will be presented in more detail in chapter 4.
In the standard coarsening codes, the coarsest grid solver is a direct solver
using LU factorization. The semicoarsening version allows the option of using line
Gauss-Seidel relaxation.
There are several cycling strategies that are allowed, and they are chosen
by input parameters. The most important choice is whether to choose full multigrid
cycling or not. There is also a choice for N-cycling, where N = 1 is the standard
V-cycle, N = 2 is the W-cycle, and so on. For more details, see section 1.4.1 above.
CHAPTER 2
DISCRETIZATIONS: FINITE DIFFERENCE AND
FINITE VOLUME
This chapter presents some of the discretizations that can be used on the
convection-diffusion equation. We present only some of the more common finite
difference and finite volume discretizations. Although this material may be considered
elementary, it was thought to be important for two reasons. First, it shows some of
the range of discrete problems that can be solved by the black box multigrid methods.
Secondly, it gives sufficient detail for others to be able to duplicate the results presented
in this thesis. The sections on the finite volume method present more than is needed,
but because there is very little on this topic in the current literature and because of its
importance for maintaining $O(h^2)$ accurate discretizations for interface problems, we
have decided to include it. For references on the finite volume discretization see [85]
and [52].
The continuous two dimensional problem is given by
$$-\nabla \cdot (D \nabla u) + b \cdot \nabla u + c\,u = f \quad \text{in } \Omega = (0, M_x) \times (0, M_y), \qquad (2.1)$$
where $D$ is a 2 x 2 tensor,
$$D = \begin{pmatrix} D_x & D_{xy} \\ D_{yx} & D_y \end{pmatrix}, \qquad (2.2)$$
and $\det D > 0$, $c \ge 0$. In general, $D_{xy} \ne D_{yx}$, but we only consider either $D_{xy} = D_{yx}$
or $D_{xy} = D_{yx} = 0$. In addition, $D$, $c$, and $f$ are allowed to be discontinuous across
internal interfaces in the domain $\Omega$. The boundary conditions are given by
$$\frac{\partial u}{\partial n} + a\,u = g \quad \text{on } \partial\Omega, \qquad (2.3)$$
where $a$ and $g$ are functions, and $n$ is the outward unit normal vector. This allows us
to represent Dirichlet, Neumann, and Robin boundary conditions.
The domain is assumed to be rectangular, $\Omega = (0, M_x) \times (0, M_y)$, and is then
divided into uniform cells of length $h_x = M_x/N_x$ by $h_y = M_y/N_y$, where $N_x$ and $N_y$
are the number of cells in the x- and y-directions respectively. A uniform grid is not
required, but we will use it to simplify our discussions.
It should be noted that finite elements on a regular triangulation can also be
used to derive the discrete system of equations to be solved by the black box multigrid
methods. However, we will not present any details on how to derive these equations.
2.1 Finite Difference Discretization
The finite difference approach to discretization is well known. Finite difference
approximation is based on Taylor series expansion. In one dimension, if a function
$u$ and its derivatives are single valued, finite, and continuous functions of $x$, then we
have the Taylor series expansions,
$$u(x + h) = u(x) + h u'(x) + \tfrac{1}{2} h^2 u''(x) + \tfrac{1}{6} h^3 u'''(x) + \ldots \qquad (2.4)$$
and
$$u(x - h) = u(x) - h u'(x) + \tfrac{1}{2} h^2 u''(x) - \tfrac{1}{6} h^3 u'''(x) + \ldots \qquad (2.5)$$
If we add equations (2.4) and (2.5) together we get an approximation to the second
derivative of $u$, given by
$$u''(x) \approx \frac{1}{h^2} \left( u(x + h) - 2u(x) + u(x - h) \right), \qquad (2.6)$$
where the leading error term is $O(h^2)$. Subtracting equation (2.5) from (2.4) gives
$$u'(x) \approx \frac{1}{2h} \left( u(x + h) - u(x - h) \right), \qquad (2.7)$$
with an error of $O(h^2)$. Both equations (2.6) and (2.7) are said to be central difference
approximations. We also derive a forward and backward difference approximation to
the first derivative from equations (2.4) and (2.5):
$$u'(x) \approx \frac{1}{h} \left( u(x + h) - u(x) \right) \qquad (2.8)$$
and
$$u'(x) \approx \frac{1}{h} \left( u(x) - u(x - h) \right) \qquad (2.9)$$
respectively, with an error of $O(h)$.
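The stated orders of accuracy can be checked numerically by halving $h$ and measuring how fast the error falls. The sketch below does this for equations (2.6) to (2.9); the test function $\sin x$, the evaluation point, and the helper names are assumptions made for the illustration.

```python
import math

def central2(u, x, h):
    """Second derivative, central difference (2.6)."""
    return (u(x + h) - 2.0 * u(x) + u(x - h)) / h**2

def central1(u, x, h):
    """First derivative, central difference (2.7)."""
    return (u(x + h) - u(x - h)) / (2.0 * h)

def forward1(u, x, h):
    """First derivative, forward difference (2.8)."""
    return (u(x + h) - u(x)) / h

def backward1(u, x, h):
    """First derivative, backward difference (2.9)."""
    return (u(x) - u(x - h)) / h

def observed_order(approx, exact, x=1.0, h=0.1):
    """Estimate the order p from the errors at spacing h and h/2."""
    e1 = abs(approx(math.sin, x, h) - exact)
    e2 = abs(approx(math.sin, x, h / 2.0) - exact)
    return math.log2(e1 / e2)

if __name__ == "__main__":
    d1, d2 = math.cos(1.0), -math.sin(1.0)   # exact u'(1) and u''(1) for u = sin
    print("central 2nd derivative:", observed_order(central2, d2))
    print("central 1st derivative:", observed_order(central1, d1))
    print("forward 1st derivative:", observed_order(forward1, d1))
    print("backward 1st derivative:", observed_order(backward1, d1))
```

The central formulas should report an order close to two and the one-sided formulas an order close to one.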
The above approximations can be extended to higher dimensions easily and
form the basis for finite difference approximation. We illustrate the finite difference
discretization, using stencil notation, by way of examples for some of the types of
problems that we are interested in. There are many references on finite differences if
one is interested in more details; see for instance [74] [39].
The first example is for the anisotropic Poisson's equation on a square domain,
$$Lu = -\varepsilon u_{xx} - u_{yy} = f \quad \text{in } \Omega = (0, 1)^2, \qquad (2.10)$$
where $u$ and $f$ are functions of $(x, y) \in \Omega$. Using central finite differences and
discretization on a uniform grid with grid spacing $h = 1/N$ for $N = n_x = n_y$ gives the
5-point stencil,
$$L^h = \frac{1}{h^2} \begin{bmatrix} & -1 & \\ -\varepsilon & 2(1 + \varepsilon) & -\varepsilon \\ & -1 & \end{bmatrix}. \qquad (2.11)$$
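The second example is for the convection-diffusion equation on a square
domain,
$$Lu = -\varepsilon \Delta u + b_x u_x + b_y u_y = f, \quad (x, y) \in \Omega = (0, 1)^2, \qquad (2.12)$$
where $u$, $b_x$, $b_y$, and $f$ are functions of $x$ and $y$. Using a mix of upstream and central
finite differences and discretizing on a uniform grid with grid spacing $h = 1/N$ for
$N = n_x = n_y$ gives the 5-point stencil,
$$L^h = \frac{1}{h^2} \begin{bmatrix} & -\varepsilon + b_y h \mu_y & \\ -\varepsilon + b_x h (\mu_x - 1) & \Sigma & -\varepsilon + b_x h \mu_x \\ & -\varepsilon + b_y h (\mu_y - 1) & \end{bmatrix}, \qquad (2.13)$$
where
$$\Sigma = 4\varepsilon - b_x h (2\mu_x - 1) - b_y h (2\mu_y - 1) \qquad (2.14)$$
and
$$\mu_x = \begin{cases} \dfrac{\varepsilon}{2 b_x h}, & b_x h > \varepsilon \\ 1 + \dfrac{\varepsilon}{2 b_x h}, & b_x h < -\varepsilon \\ \dfrac{1}{2}, & |b_x h| \le \varepsilon \end{cases} \qquad \mu_y = \begin{cases} \dfrac{\varepsilon}{2 b_y h}, & b_y h > \varepsilon \\ 1 + \dfrac{\varepsilon}{2 b_y h}, & b_y h < -\varepsilon \\ \dfrac{1}{2}, & |b_y h| \le \varepsilon. \end{cases} \qquad (2.15)$$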
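The switching rule for $\mu_x$ and $\mu_y$ can be made concrete with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, the entries are returned without the $1/h^2$ factor, and the center entry is taken as minus the sum of the four neighbours, so that the stencil annihilates constants, as it must for an operator with no zeroth order term.

```python
def mu(bh, eps):
    """Blending weight: central differencing when |b h| <= eps, upstream otherwise."""
    if bh > eps:
        return eps / (2.0 * bh)
    if bh < -eps:
        return 1.0 + eps / (2.0 * bh)
    return 0.5

def hybrid_stencil(eps, bx, by, h):
    """5-point stencil entries for -eps*Laplace(u) + bx*u_x + by*u_y, times h^2."""
    mx, my = mu(bx * h, eps), mu(by * h, eps)
    west = -eps + bx * h * (mx - 1.0)
    east = -eps + bx * h * mx
    south = -eps + by * h * (my - 1.0)
    north = -eps + by * h * my
    center = -(west + east + south + north)   # zero row sum
    return {"N": north, "W": west, "C": center, "E": east, "S": south}

if __name__ == "__main__":
    # Convection dominated: all neighbours stay non-positive, the center positive.
    print(hybrid_stencil(eps=1e-3, bx=1.0, by=-2.0, h=0.1))
    # Diffusion dominated: reduces to central differencing.
    print(hybrid_stencil(eps=1.0, bx=1.0, by=-2.0, h=0.1))
```

The first call shows that the off-diagonal entries remain non-positive even when $|b h|$ is much larger than $\varepsilon$, which is the purpose of blending in the upstream differences.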
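The third example is the rotated anisotropic diffusion equation on a square
domain. It has this name because it is obtained from the first example by rotating
the axes through an angle of $\theta$. The equation is given by
$$Lu = -(\varepsilon c^2 + s^2) \frac{\partial^2 u}{\partial x^2} - 2(\varepsilon - 1) c s \frac{\partial^2 u}{\partial x \partial y} - (\varepsilon s^2 + c^2) \frac{\partial^2 u}{\partial y^2} = 0, \quad (x, y) \in \Omega = (0, 1) \times (0, 1), \qquad (2.16)$$
where $c = \cos\theta$, $s = \sin\theta$, and $\varepsilon > 0$. There are two parameters, $\varepsilon$ and $\theta$, that can be
varied. There are two popular discretizations of this equation which are seen in real
world applications. They differ only in the discretization of the cross derivative term.
Let
$$\alpha = \varepsilon c^2 + s^2, \qquad \beta = (\varepsilon - 1) c s, \qquad \gamma = \varepsilon s^2 + c^2; \qquad (2.17)$$
then if the grid spacing is $h = 1/N$ for $N = n_x = n_y$, the first, a 7-point finite difference
stencil, is
$$L^h = \frac{1}{h^2} \begin{bmatrix} \beta & -\beta - \gamma & \\ -\alpha - \beta & 2(\alpha + \beta + \gamma) & -\alpha - \beta \\ & -\beta - \gamma & \beta \end{bmatrix}. \qquad (2.18)$$
The second, a 9-point finite difference stencil, is
$$L^h = \frac{1}{h^2} \begin{bmatrix} \tfrac{1}{2}\beta & -\gamma & -\tfrac{1}{2}\beta \\ -\alpha & 2(\alpha + \gamma) & -\alpha \\ -\tfrac{1}{2}\beta & -\gamma & \tfrac{1}{2}\beta \end{bmatrix}. \qquad (2.19)$$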
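Both stencils should reproduce $Lu = -\alpha u_{xx} - 2\beta u_{xy} - \gamma u_{yy}$ exactly on quadratic polynomials, which gives a quick consistency check. The sketch below assumes the stencil rows are listed from north to south and the columns from west to east; the function names are illustrative.

```python
import math

def rotated_stencils(eps, theta):
    """Return (alpha, beta, gamma) and the 7-point and 9-point stencils (times h^2)."""
    c, s = math.cos(theta), math.sin(theta)
    a = eps * c * c + s * s
    b = (eps - 1.0) * c * s
    g = eps * s * s + c * c
    seven = [[b, -b - g, 0.0],
             [-a - b, 2.0 * (a + b + g), -a - b],
             [0.0, -b - g, b]]
    nine = [[0.5 * b, -g, -0.5 * b],
            [-a, 2.0 * (a + g), -a],
            [-0.5 * b, -g, 0.5 * b]]
    return (a, b, g), seven, nine

def apply_stencil(stencil, u, x, y, h):
    """Apply a 3x3 stencil (rows north to south) to the function u at (x, y)."""
    total = 0.0
    for row, dy in zip(stencil, (1, 0, -1)):
        for coef, dx in zip(row, (-1, 0, 1)):
            total += coef * u(x + dx * h, y + dy * h)
    return total / h**2

if __name__ == "__main__":
    (a, b, g), seven, nine = rotated_stencils(eps=0.01, theta=math.pi / 6.0)
    xy = lambda x, y: x * y                  # L(xy) = -2*beta
    print(apply_stencil(seven, xy, 0.3, 0.7, 0.25), -2.0 * b)
    print(apply_stencil(nine, xy, 0.3, 0.7, 0.25), -2.0 * b)
```

The two discretizations agree on quadratics and differ only in how the higher order terms of the cross derivative are approximated.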
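The fourth example is the convection-diffusion equation on a square domain,
$$Lu = -\varepsilon \Delta u + c\,u_x + s\,u_y = 0, \quad (x, y) \in \Omega = (0, 1)^2, \qquad (2.20)$$
where $c = \cos\theta$, $s = \sin\theta$, and $\varepsilon > 0$. Upstream finite differences and discretization on
a uniform grid with grid spacing $h = 1/N$ for $N = n_x = n_y$ yields
$$L^h = \frac{1}{h^2} \begin{bmatrix} & -\varepsilon + \tfrac{h}{2}(s - |s|) & \\ -\varepsilon - \tfrac{h}{2}(c + |c|) & 4\varepsilon + h(|c| + |s|) & -\varepsilon + \tfrac{h}{2}(c - |c|) \\ & -\varepsilon - \tfrac{h}{2}(s + |s|) & \end{bmatrix}. \qquad (2.21)$$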
2.2 Finite Volume Discretization
There are two types of computational grids that will be considered. The first
type is the vertex centered grid $G_v$, defined as
$$G_v = \left\{ (x_i, y_j) : x_i = i\,h_x,\ i = 0, \ldots, N_x;\ \ y_j = j\,h_y,\ j = 0, \ldots, N_y \right\}, \qquad (2.22)$$
where $N_x$ and $N_y$ are the number of cells in the x and y directions respectively; see
figure 2.1. The second type is the cell centered grid $G_c$, which is defined by
$$G_c = \left\{ (x_i, y_j) : x_i = (i - \tfrac{1}{2})\,h_x,\ i = 1, \ldots, N_x;\ \ y_j = (j - \tfrac{1}{2})\,h_y,\ j = 1, \ldots, N_y \right\}, \qquad (2.23)$$
where $N_x$ and $N_y$ are the number of cells in the x and y directions respectively; see
figure 2.2.
Figure 2.1. Vertex centered finite volume grid, where the marker indicates where the
discretization is centered and the dashed lines delineate the finite volumes.
Figure 2.2. Cell centered finite volume grid, where the marker indicates where the
discretization is centered and the solid lines delineate the finite volumes.
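The two grids differ only by a half-cell shift and by the number of points per direction. A minimal NumPy sketch of definitions (2.22) and (2.23) follows; the function names and the choice to return separate coordinate arrays for each direction are assumptions made for the illustration.

```python
import numpy as np

def vertex_centered_grid(Mx, My, Nx, Ny):
    """Coordinates of G_v: vertices i*hx, j*hy for i = 0..Nx, j = 0..Ny."""
    hx, hy = Mx / Nx, My / Ny
    return hx * np.arange(0, Nx + 1), hy * np.arange(0, Ny + 1)

def cell_centered_grid(Mx, My, Nx, Ny):
    """Coordinates of G_c: centers (i - 1/2)*hx, (j - 1/2)*hy for i = 1..Nx, j = 1..Ny."""
    hx, hy = Mx / Nx, My / Ny
    return hx * (np.arange(1, Nx + 1) - 0.5), hy * (np.arange(1, Ny + 1) - 0.5)

if __name__ == "__main__":
    xv, yv = vertex_centered_grid(1.0, 2.0, 4, 4)
    xc, yc = cell_centered_grid(1.0, 2.0, 4, 4)
    print(xv, xc)
```

The vertex centered grid has $N_x + 1$ points per row and includes the boundary, while the cell centered grid has $N_x$ points per row, all strictly inside the domain.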
There are two other somewhat common finite volume grids that will not be
discussed here, but can be used to derive the discrete system of equations to be solved
by the black box multigrid methods. These grids are defined by placing the finite
volume cell centers on the grid lines in one of the coordinate directions and centered
between the grid lines in the other coordinate direction. For instance, align the cell
centers with the y grid lines and centered between x grid lines. The cell edges will then
correspond with x grid lines and centered between y grid lines.
We will present finite volume discretization for both vertex and cell centered
finite volumes where the coefficients are evaluated at either the vertices or cell centers.
The coefficients could be evaluated at other points, such as cell edges, but we will
not show the development of such discretizations because they follow easily from the
descriptions given below.
2.3 Cell Centered Finite Volume Discretization; Evaluation at the Vertices
For the cell centered finite volume discretization the cell has its center at the
point $((i - \tfrac{1}{2})h_x, (j - \tfrac{1}{2})h_y)$, and the cell is called the finite volume, $\Omega_{i,j}$, for the point
$(i, j)$ on the computational grid $G_c$, where $i = 1, \ldots, N_x$ and $j = 1, \ldots, N_y$; see equation
(2.23). A finite volume is shown in figure 2.3. The approximation of $u$ in the center of
the cell is called $u_{i,j}$. The coefficients are approximated by constant values in the finite
volume $\Omega_{i,j}$. This discretization is useful when the discontinuities are not aligned with
the finite volume cell boundaries.
Assume that $D_{xy} = D_{yx} = 0$ and that $b = 0$ for now. If we integrate equation
(2.1) over the finite volume $\Omega_{i,j}$ and use Green's theorem we get
$$-\int_{\partial\Omega_{i,j}} \left( D_x \frac{\partial u}{\partial x} n_x + D_y \frac{\partial u}{\partial y} n_y \right) d\Gamma + \int_{\Omega_{i,j}} c\,u \, d\Omega = \int_{\Omega_{i,j}} f \, d\Omega, \qquad (2.24)$$
where $n_x$ and $n_y$ are the components of the outward normal vector to the boundary
$\partial\Omega_{i,j}$.
We proceed by developing the equations for the interior points $u_{i,j}$, and then
for the boundary points, where we present the modifications that are needed for the
three types of boundary conditions that we consider. We refer to figure 2.3 to aid in
the development of the finite volume discretization.
Figure 2.3. Cell centered finite volume $\Omega_{i,j}$, where P has the coordinates
$((i - \tfrac{1}{2})h_x, (j - \tfrac{1}{2})h_y)$.
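2.3.1 Interior Finite Volumes
Referring to figure 2.3, we write the
line integral from equation (2.24) as
$$-\int_{\partial\Omega_{i,j}} \left( D_x \frac{\partial u}{\partial x} n_x + D_y \frac{\partial u}{\partial y} n_y \right) d\Gamma = \int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx - \int_{se}^{ne} D_x \frac{\partial u}{\partial x}\,dy + \int_{ne}^{nw} D_y \frac{\partial u}{\partial y}\,dx - \int_{nw}^{sw} D_x \frac{\partial u}{\partial x}\,dy. \qquad (2.25)$$
The integral from (sw) to (se) can be approximated by
$$\int_{sw}^{s} D_y(sw) \frac{\partial u}{\partial y}\,dx + \int_{s}^{se} D_y(se) \frac{\partial u}{\partial y}\,dx \approx \frac{h_x}{2 h_y} \left( D_y(sw) + D_y(se) \right) \left( u_{i,j} - u_{i,j-1} \right) = \frac{h_x}{h_y} \alpha^y_{i,j-1} \left( u_{i,j} - u_{i,j-1} \right), \qquad (2.26)$$
where $\alpha^y_{i,j-1} = \tfrac{1}{2} (D_{y,i-1,j-1} + D_{y,i,j-1})$, and $D_{y,i,j}$ is the value of $D_y$ at the point $(i, j)$.
The other line integrals of $\Omega_{i,j}$, (se) to (ne), (ne) to (nw), and (nw) to (sw), can be
approximated in a similar fashion.
The surface integrals in equation (2.24) can be approximated by
$$\int_{\Omega_{i,j}} c\,u \, d\Omega \approx h_x h_y c_{i,j} u_{i,j} \qquad (2.27)$$
and
$$\int_{\Omega_{i,j}} f \, d\Omega \approx h_x h_y f_{i,j}, \qquad (2.28)$$
where $c_{i,j}$ and $f_{i,j}$ are approximations of $c$ and $f$, respectively, at the grid point
$((i - \tfrac{1}{2})h_x, (j - \tfrac{1}{2})h_y)$, given in terms of the vertex values by
$$c_{i,j} = \tfrac{1}{4} \left( c_{i,j} + c_{i-1,j} + c_{i-1,j-1} + c_{i,j-1} \right) \qquad (2.29)$$
and
$$f_{i,j} = \tfrac{1}{4} \left( f_{i,j} + f_{i-1,j} + f_{i-1,j-1} + f_{i,j-1} \right) \qquad (2.30)$$
respectively. The resulting stencil for interior points is
$$\begin{bmatrix} & -\frac{h_x}{h_y} \alpha^y_{i,j} & \\ -\frac{h_y}{h_x} \alpha^x_{i-1,j} & \Sigma + h_x h_y c_{i,j} & -\frac{h_y}{h_x} \alpha^x_{i,j} \\ & -\frac{h_x}{h_y} \alpha^y_{i,j-1} & \end{bmatrix}, \qquad (2.31)$$
where
$$\alpha^x_{i,j} = \tfrac{1}{2} \left( D_{x,i,j-1} + D_{x,i,j} \right), \qquad (2.32)$$
$$\alpha^y_{i,j} = \tfrac{1}{2} \left( D_{y,i-1,j} + D_{y,i,j} \right), \qquad (2.33)$$
and
$$\Sigma = \frac{h_x}{h_y} \left( \alpha^y_{i,j} + \alpha^y_{i,j-1} \right) + \frac{h_y}{h_x} \left( \alpha^x_{i,j} + \alpha^x_{i-1,j} \right). \qquad (2.34)$$
At an interface, the diffusivity is given as an arithmetic mean of the diffusion
coefficients of adjacent finite volumes. The arithmetic mean makes sense because the
interface passes through the finite volume. This discretization is most accurate when the
interface passes directly through the cell of the finite volume.
When the finite volume $\Omega_{i,j}$ has an edge on the boundary, the line integral in
equation (2.24) for that edge has to be treated differently. We examine what needs to
be done for each of the three different types of boundary conditions. We examine the
changes that are needed only on one boundary edge; the changes needed for
the other boundary edges follow in a similar fashion.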
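2.3.2 Dirichlet Boundary Condition
Let us examine the south boundary, (sw) to (se), where we have
$$u_{(s)} = g_{(s)}. \qquad (2.35)$$
The line integral from (sw) to (se) is approximated by
$$\int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx \approx \frac{2 h_x}{h_y} \alpha^y_{i,j-1} \left( u_{i,j} - u_{(s)} \right). \qquad (2.36)$$
This gives the stencil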
_ hxfy* .
ky
htaii,j S +hxhyCi,j
(2.37)
where $\Sigma$ is defined in equation (2.34) and $\alpha$ is defined by equations (2.32) and (2.33).
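2.3.3 Neumann and Robin Boundary Conditions
We examine the south boundary, (sw) to (se), where
$$\left. \frac{\partial u}{\partial n} \right|_{(s)} + a_{(s)} u_{(s)} = g_{(s)}. \qquad (2.38)$$
We then make the approximation
$$u_{(s)} - u_{(p)} \approx \frac{1}{2} h_y \left. \frac{\partial u}{\partial n} \right|_{(s)} = \frac{1}{2} h_y \left( g_{(s)} - a_{(s)} u_{(s)} \right). \qquad (2.39)$$
Solving for $u_{(s)}$ gives
$$u_{(s)} = \frac{\tfrac{1}{2} h_y g_{(s)} + u_{(p)}}{1 + \tfrac{1}{2} h_y a_{(s)}}. \qquad (2.40)$$
The line integral is then approximated as
$$\int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx \approx h_x \alpha^y_{i,j-1} \left. \frac{\partial u}{\partial y} \right|_{(s)} = -h_x \alpha^y_{i,j-1} \left. \frac{\partial u}{\partial n} \right|_{(s)}. \qquad (2.41)$$
Now we substitute equation (2.40) to obtain
$$\int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx \approx \frac{2 h_x}{2 + h_y a_{(s)}} \alpha^y_{i,j-1} \left( a_{(s)} u_{i,j} - g_{(s)} \right). \qquad (2.42)$$
The resulting stencil for the south boundary is
$$\begin{bmatrix} & -\frac{h_x}{h_y} \alpha^y_{i,j} & \\ -\frac{h_y}{h_x} \alpha^x_{i-1,j} & \Sigma + h_x h_y c_{i,j} + \frac{2 h_x a_{(s)}}{2 + h_y a_{(s)}} \alpha^y_{i,j-1} & -\frac{h_y}{h_x} \alpha^x_{i,j} \\ & 0 & \end{bmatrix}, \qquad (2.43)$$
where $\alpha$ is defined in equations (2.32) and (2.33), and $\Sigma$ is now given by
$$\Sigma = \frac{h_x}{h_y} \alpha^y_{i,j} + \frac{h_y}{h_x} \left( \alpha^x_{i-1,j} + \alpha^x_{i,j} \right). \qquad (2.44)$$
The other boundaries can be handled in the same way. We have now defined
the cell centered finite volume discretization where the coefficients are evaluated at the
grid vertices.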
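The elimination of the boundary value can be checked with a few lines of arithmetic. The sketch below assumes the boundary condition $\partial u/\partial n + a u = g$ on the south edge, where $\partial u/\partial n = -\partial u/\partial y$; the function names and the sample numbers are hypothetical.

```python
def robin_boundary_value(u_p, g_s, a_s, hy):
    """Boundary value u_(s) from the one-sided approximation of the Robin condition."""
    return (0.5 * hy * g_s + u_p) / (1.0 + 0.5 * hy * a_s)

def south_edge_integral(u_p, g_s, a_s, hx, hy, alpha):
    """Approximate integral of D_y du/dy along the south edge with u_(s) eliminated."""
    return 2.0 * hx / (2.0 + hy * a_s) * alpha * (a_s * u_p - g_s)

if __name__ == "__main__":
    u_p, g_s, a_s, hx, hy, alpha = 1.3, 0.4, 2.0, 0.1, 0.05, 3.0
    u_s = robin_boundary_value(u_p, g_s, a_s, hy)
    # Direct evaluation: hx * alpha * du/dy at (s), with du/dy = a*u_s - g.
    print(hx * alpha * (a_s * u_s - g_s), south_edge_integral(u_p, g_s, a_s, hx, hy, alpha))
```

Setting `a_s = 0` recovers the pure Neumann case, in which the edge contributes only to the right hand side and not to the center coefficient.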
2.4 Cell Centered Finite Volume Discretization; Evaluation at the Cell Centers
This discretization is better suited to problems when the interfaces align with
the boundaries of the finite volumes. The discretization is very similar to what was
done in section 2.3, except that now the coefficients are evaluated at the cell centers,
$((i - \tfrac{1}{2})h_x, (j - \tfrac{1}{2})h_y)$, of the finite volume $\Omega_{i,j}$. The coefficients are approximated by
constant values in the finite volume $\Omega_{i,j}$. We need to approximate the integrals in
equation (2.24).
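2.4.1 Interior Finite Volumes
We have the line integral, as in equation
(2.25), and the integral from (sw) to (se) can be approximated by
$$\int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx \approx \frac{2 h_x}{h_y} D_{y,i,j} \left( u_{i,j} - u_{(s)} \right), \qquad (2.45)$$
where $D_{y,i,j}$ is the value of $D_y$ at the point $(i, j)$. We still need to approximate $u_{(s)}$,
and to do this we will use the continuity of $u$ and $D_y \, \partial u / \partial y$,
$$D_{y,i,j} \left( u_{i,j} - u_{(s)} \right) = D_{y,i,j-1} \left( u_{(s)} - u_{i,j-1} \right), \qquad (2.46)$$
yielding
$$u_{(s)} = \frac{D_{y,i,j} u_{i,j} + D_{y,i,j-1} u_{i,j-1}}{D_{y,i,j} + D_{y,i,j-1}}. \qquad (2.47)$$
We can now substitute equation (2.47) into equation (2.45) to get
$$\int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx \approx \frac{h_x}{h_y} \alpha^y_{i,j-1} \left( u_{i,j} - u_{i,j-1} \right), \qquad (2.48)$$
where $\alpha^y_{i,j-1}$ is now given by
$$\alpha^y_{i,j-1} = \frac{2 D_{y,i,j} D_{y,i,j-1}}{D_{y,i,j} + D_{y,i,j-1}}. \qquad (2.49)$$
The other line integrals of $\Omega_{i,j}$, (se) to (ne), (ne) to (nw), and (nw) to (sw),
can be approximated in a similar fashion.
The surface integrals are approximated in the same way as before,
$$\int_{\Omega_{i,j}} c\,u \, d\Omega \approx h_x h_y c_{i-\frac{1}{2},j-\frac{1}{2}} u_{i,j} \qquad (2.50)$$
and
$$\int_{\Omega_{i,j}} f \, d\Omega \approx h_x h_y f_{i-\frac{1}{2},j-\frac{1}{2}}, \qquad (2.51)$$
but instead of $c_{i,j}$ and $f_{i,j}$, we have $c_{i-\frac{1}{2},j-\frac{1}{2}}$ and $f_{i-\frac{1}{2},j-\frac{1}{2}}$.
The resulting stencil for interior points is
$$\begin{bmatrix} & -\frac{h_x}{h_y} \alpha^y_{i,j} & \\ -\frac{h_y}{h_x} \alpha^x_{i-1,j} & \Sigma + h_x h_y c_{i-\frac{1}{2},j-\frac{1}{2}} & -\frac{h_y}{h_x} \alpha^x_{i,j} \\ & -\frac{h_x}{h_y} \alpha^y_{i,j-1} & \end{bmatrix}, \qquad (2.52)$$
where
$$\alpha^x_{i-1,j} = \frac{2 D_{x,i,j} D_{x,i-1,j}}{D_{x,i,j} + D_{x,i-1,j}}, \qquad (2.53)$$
$$\alpha^y_{i,j-1} = \frac{2 D_{y,i,j} D_{y,i,j-1}}{D_{y,i,j} + D_{y,i,j-1}}, \qquad (2.54)$$
and
$$\Sigma = \frac{h_x}{h_y} \left( \alpha^y_{i,j} + \alpha^y_{i,j-1} \right) + \frac{h_y}{h_x} \left( \alpha^x_{i,j} + \alpha^x_{i-1,j} \right). \qquad (2.55)$$
At an interface, the diffusivity is given as a harmonic average of the diffusion
coefficients of the adjacent finite volumes.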
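The harmonic average follows directly from matching the fluxes on the two sides of the interface. The short sketch below checks this for one pair of cells and contrasts it with the arithmetic mean of section 2.3; the function names and the sample coefficients are illustrative assumptions.

```python
def arithmetic_mean(d1, d2):
    """Interface diffusivity when the interface cuts through the finite volume."""
    return 0.5 * (d1 + d2)

def harmonic_mean(d1, d2):
    """Interface diffusivity when the interface lies on the finite volume edge."""
    return 2.0 * d1 * d2 / (d1 + d2)

def interface_value(d1, u1, d2, u2):
    """Value of u on the shared edge obtained from continuity of the flux D du/dy."""
    return (d1 * u1 + d2 * u2) / (d1 + d2)

if __name__ == "__main__":
    d1, d2, u1, u2, h = 1.0, 100.0, 1.0, 0.0, 0.1
    us = interface_value(d1, u1, d2, u2)
    flux_side_1 = d1 * (u1 - us) / (0.5 * h)     # one-sided flux in cell 1
    flux_side_2 = d2 * (us - u2) / (0.5 * h)     # one-sided flux in cell 2
    print(flux_side_1, flux_side_2, harmonic_mean(d1, d2) * (u1 - u2) / h)
    print(arithmetic_mean(d1, d2), harmonic_mean(d1, d2))
```

For strongly discontinuous coefficients the two averages differ greatly: the harmonic mean is dominated by the smaller coefficient and the arithmetic mean by the larger one.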
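2.4.2 Dirichlet Boundary Condition
For the south boundary, (sw) to (se), the Dirichlet boundary condition is $u_{(s)} = g_{(s)}$.
The line integral is approximated by
$$\int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx \approx \frac{2 h_x}{h_y} D_{y,i,j} \left( u_{i,j} - g_{(s)} \right). \qquad (2.56)$$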
The stencil is then given by
hr
h3
__hy
h x
a
X
ilj
Â£ +hxhyCij + ^
D.
.IhLo,? .
hx Ui,3
(2.57)
0
where $\Sigma$ is given in equation (2.55) and $\alpha$ is given by equations (2.53) and (2.54).
2.4.3 Neumann and Robin Boundary Conditions The Neumann
and Robin boundary conditions can be handled in the same way as in section 2.3.3.
The line integral for the south boundary is
(2.58)
The resulting stencil is now
(2.59)
0
where $\Sigma$ is given in equation (2.55) and $\alpha$ is given by equations (2.53) and (2.54).
2.5 Vertex Centered Finite Volume Discretization; Evaluation at the Vertices
In this discretization $D$, $c$, and $f$ are approximated by constant values in the finite
volume $\Omega_{i,j}$, whose center is at a vertex. This discretization is useful when the
discontinuities align with the boundaries of the finite volumes.
2.5.1 Interior Finite Volumes
The development is done the same as
before for the cell centered cases; see section 2.3.1. The stencil, when $D_{xy} = D_{yx} = 0$
and $b = 0$, is given by
$$\begin{bmatrix} & -\frac{h_x}{h_y} \alpha^y_{i,j} & \\ -\frac{h_y}{h_x} \alpha^x_{i-1,j} & \Sigma + h_x h_y c_{i,j} & -\frac{h_y}{h_x} \alpha^x_{i,j} \\ & -\frac{h_x}{h_y} \alpha^y_{i,j-1} & \end{bmatrix}, \qquad (2.60)$$
where
$$\alpha^x_{i,j} = \frac{2 D_{x,i,j} D_{x,i+1,j}}{D_{x,i,j} + D_{x,i+1,j}}, \qquad (2.61)$$
$$\alpha^y_{i,j} = \frac{2 D_{y,i,j} D_{y,i,j+1}}{D_{y,i,j} + D_{y,i,j+1}}, \qquad (2.62)$$
and
$$\Sigma = \frac{h_x}{h_y} \left( \alpha^y_{i,j-1} + \alpha^y_{i,j} \right) + \frac{h_y}{h_x} \left( \alpha^x_{i-1,j} + \alpha^x_{i,j} \right), \qquad (2.63)$$
where $c$ and $f$ are evaluated at the grid point $(i h_x, j h_y)$.
2.5.2 Edge Boundary Finite Volumes
Let the finite volume $\Omega_{i,j}$ have
its southern edge, (sw) to (se), at the southern boundary ($y = 0$) of the domain; see figure
2.4.
Figure 2.4. Vertex centered finite volume $\Omega_{i,j}$ at the southern, $y = 0$, edge boundary.
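2.5.3 Dirichlet Boundary Condition
For the Dirichlet boundary condition we have $u_{(s)} = g_{(s)}$, and we can just eliminate
the unknown $u_{(s)}$ and move it to the right hand side of the equation.
2.5.4 Neumann and Robin Boundary Conditions
The line integral
along the boundary is approximated by
$$\int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx \approx h_x D_{y,i,j} \left. \frac{\partial u}{\partial y} \right|_{(s)} = -h_x D_{y,i,j} \left( g_{(s)} - a_{(s)} u_{i,j} \right), \qquad (2.64)$$
and now we need to look at the surface integrals
$$\int_{\Omega_{i,j}} c\,u \, d\Omega \approx \frac{1}{2} h_x h_y c_{i,j} u_{i,j}, \qquad (2.65)$$
and similarly for $f$. The stencil for the edge boundary is given by
$$\begin{bmatrix} & -\frac{h_x}{h_y} \alpha^y_{i,j} & \\ -\frac{h_y}{2 h_x} \alpha^x_{i-1,j} & \Sigma + \frac{1}{2} h_x h_y c_{i,j} + h_x a_{(s)} D_{y,i,j} & -\frac{h_y}{2 h_x} \alpha^x_{i,j} \\ & 0 & \end{bmatrix}, \qquad (2.66)$$
where
$$\Sigma = \frac{h_x}{h_y} \alpha^y_{i,j} + \frac{h_y}{2 h_x} \left( \alpha^x_{i-1,j} + \alpha^x_{i,j} \right) \qquad (2.67)$$
and $\alpha$ is defined by equations (2.61) and (2.62).
Figure 2.5. Southwest corner finite volume, where the marker indicates where the
discretization is centered.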
2.5.5 Corner Boundary Finite Volumes
The corner finite volume
discretization will be shown for the southwest corner of the computational grid; see
figure 2.5.
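2.5.6 Dirichlet Boundary Condition
In the Dirichlet boundary condition case, the unknown $u_{(sw)}$ is eliminated by the
boundary condition equation,
$$u_{(sw)} = g_{(sw)}. \qquad (2.68)$$
The term $g_{(sw)}$ is incorporated into the right hand side of the discrete system of
equations. The stencil for the southwest corner is
$$\begin{bmatrix} & -\frac{h_x}{2 h_y} \alpha^y_{i,j} & \\ 0 & \Sigma + \frac{1}{4} h_x h_y c_{i,j} & -\frac{h_y}{2 h_x} \alpha^x_{i,j} \\ & 0 & \end{bmatrix}, \qquad (2.69)$$
where $\Sigma$ is defined as
$$\Sigma = \frac{h_x}{2 h_y} \alpha^y_{i,j} + \frac{h_y}{2 h_x} \alpha^x_{i,j} \qquad (2.70)$$
and $\alpha$ is defined by equations (2.61) and (2.62).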
2.5.7 Neumann and Robin Boundary Conditions In the Neumann
and Robin boundary condition cases, we have
du
dx
du
~ qf* O'sU
l dy
(sw)
J (sw)
9w
= 9si
(2.71)
(2.72)
where the subscript (sw) means evaluation at the (sw) point; see figure 2.5. The line
integrals around the finite volume are approximated by
rse Qu
rse n du ,
/ Dydx
J sw dy
11. 7i du(sw)
2 ^ dy
\hxDy^ (as(sw)uij gs(sw))
(2.73)
fnw du , 1
J ^x~Qx^ ~ 2hyDxhj'
du(sw)
dy
^hyDx^i^j (o>w(sw)uij 9w(siij))
(2.74)
(2.75)
fne du lhy x . ,
J QX^ ^ 2 h UiJrlj)
rne fa, i fo
 Dy^dx faVj (uitj uiJ+1). (2.76)
The stencil for the southwest corner is
2hy ai,j
0 E+ihxhydj+BC
0
(2.77)
where $\Sigma$ is defined in equation (2.70), $\alpha$ is defined by equations (2.61) and (2.62), and
JBCJ ^ (hxT^t/,i,jns(sty) (su7)) .
(2.78)
2.6 Vertex Centered Finite Volume Discretization; Evaluation at the Cell Vertices
In this discretization $D$, $c$, and $f$ are approximated by constant values in the
finite volume $\Omega_{i,j}$, whose center is at a vertex. This discretization is useful when
the discontinuities pass through the interior of the finite volumes, and best when the
interface passes through the cell center.
2.6.1 Interior Finite Volumes
The development is the same as for the
previous section on vertex centered finite volumes; see section 2.5. The stencil, when
$D_{xy} = D_{yx} = 0$ and $b = 0$, is given by
__hx.rr! .
hy
JrhxhyCij
hx ai,j
hyat,Jl
(2.79)
where
ai,j ~ 2 + Ar,i+l,j+l)
ai,j = 2 + ^j/,i+i,i+i)
and
Â£ T (ah1 + aL') + IT (^hi + fj)
and where c and / are evaluated at the grid point (i hx,j hy).
Cij ~ ^ (cilJl.+ Q+lJl + CilJ+l + Cj+lj+l)
(2.80)
(2.81)
(2.82)
(2.83)
and
fid = ^ (/iljl + /i+ljl + /ilj+l + /i+lj+l) (2.84)
Let the finite volume $\Omega_{i,j}$ have its southern edge, (sw) to (se), at the southern
boundary ($y = 0$) of the domain; see figure 2.4.
2.6.2 Dirichlet Boundary Condition
For the Dirichlet boundary condition we have $u_{(s)} = g_{(s)}$, and we can just eliminate
the unknown $u_{(s)}$ and move it to the right hand side of the equation.
2.6.3 Neumann and Robin Boundary Conditions The line integral
along the boundary is approximated by
I,
se du 8u
D dx h ay
ydy x ^ dn
M
hx(X j (j)(s) a{s)ui,j^j i
(2.85)
L
ftc h
^X~8x^ ~ ~ni!~Dy,i+l,j (ui+l,j ~ ui,j)
2 hx
and similarly for the line integral from (sw) to (nw); the line integral from (nw) to (ne)
is done as before for the interior.
The surface integrals are now given by
f cu dÂ£l ~ hxhyC:LjUij
JQij L
(2.86)
where
Qj 2 d" Ci+lJ+l)
and similarly for /. The stencil for the edge boundary is given by
__hzrvV .
hy ^1,3
~2h^Dx,i\,j ^ 2 hxhyCij + hxa^a\j 2h^DXtij
where
H ~^ah + oh~ (Dx,ii,j + Dx,ij),
2 hx
(2.87)
(2.88)
(2.89)
and a is defined by equations (2.80) and (2.81).
2.6.4 Corner Boundary Finite Volumes
The corner finite volume
discretization will be shown for the southwest corner of the computational grid; see
figure 2.5.
2.6.5 Dirichlet Boundary Condition
In the Dirichlet boundary condition case, the unknown $u_{(sw)}$ is eliminated by the
boundary condition equation, $u_{(sw)} = g_{(sw)}$. The term $g_{(sw)}$ is incorporated into the
right hand side of the discrete system of equations. The stencil for the southwest corner is
hx
'2 hy
D.
0 4" 4 hx hyCi,j
JhLr>
2hx
0
where Y1 is defined as
E
hx
__ n ._________
 J^y,i,j
2 h.
y n .
2hxx
and a is defined by equations (2.80) and (2.81).
(2.90)
(2.91)
2.6.6 Neumann and Robin Boundary Conditions In the Neumann
and Robin boundary condition cases, we have
du
" 7^ "b UwU
OX
(sw)
du
"5h OisU
9w
9 s)
(2.92)
(2.93)
where the subscript (sw) means evaluation at the (sw) point; see figure 2.5. The line
integrals around the finite volume are approximated by
_ du ,
Dydx
dy
1 l 7i du(sw)
hxDy,i+id+i (o5(sw)u{j gs(sw))
1L du(sw)
2 UyUx,i+l,j\l ^
hyDx,i+i,jti (&w(sw)uij 9/w($w))
n duA
D*dzdy
D,/~dx
dy
2h.
Dx,i+W (uij ui+ij)
2^ A/.i+lJ+l (ui,j ^ij+l)
(2.94)
(2.95)
(2.96)
(2.97)
The surface integrals are approximated by
J cu cm ~ ~ hi /iy Cj+1 j _). i Uj j
and similarly for /. The stencil for the southwest corner is
hx 7~)
2hv uy^i+1 J+1
0 YLJf\hxhyCi+i,j+\ +BC &DXti+ij+i
(2.98)
(2.99)
where $\Sigma$ is defined in equation (2.91), $\alpha$ is defined by equations (2.80) and (2.81), and
JBC ^ (hxZ?yit+ij+iOs(stw) + hyDx,i+lj+iQwi.sw')')
(2.100)
CHAPTER 3
PROLONGATION AND RESTRICTION OPERATORS
Suppose that we have an elliptic linear operator $L$ on a two dimensional
rectangular domain $\Omega$:
$$Lu = f \quad \text{in } \Omega \subset \mathbb{R}^2. \qquad (3.1)$$
This problem can be discretized using finite differences (or another discretization) on a
rectangular grid $G^h$ with grid spacing $h$, given by
$$L^h u^h = f^h \quad \text{in } G^h, \qquad (3.2)$$
$$G^h = \{ (x_i, y_j) : x_i = x_0 + i\,h,\ y_j = y_0 + j\,h \}. \qquad (3.3)$$
We assume that the discretization is represented in stencil notation as
$$L^h = \begin{bmatrix} NW & N & NE \\ W & C & E \\ SW & S & SE \end{bmatrix}^h_{(i,j)}, \qquad (3.4)$$
where $NW$, $N$, $NE$, ... are the coefficients of the discretization stencil centered at
$(x_i, y_j)$.
The size of the fine grid operator's stencil is important to remember because
we require that the coarser grid operator's stencil not be any larger than the largest
allowable fine grid operator stencil. By keeping the grid operator stencil fixed at a
maximum of 9 points, we ensure that the implementation will be easier and more
efficient by maintaining the sparsity of the operators. This consideration is important
when discussing the formation of the grid transfer operators since we use the Galerkin
coarse grid approximation approach to form the coarse grid operators. The formulation
of the coarse grid operators involves the multiplication of three matrices, and if their
stencils are at most 9-point, then the coarse grid operator will also be at most 9-point.
If we use grid transfer operators with larger stencils, the size of the coarse grid operator
stencil can grow without bound, as the grid levels become coarser, until the stencils
either become the size of the full matrix or we run out of grid levels.
Another guiding principle that we follow is that if we are given a symmetric
fine grid operator we would like all the coarser grid operators to be symmetric also. In
order to follow this principle the interpolation and restriction operators must be chosen
with care.
Before getting started it would be best to show where and how the operators
are used to transfer components between grid levels. We assume the layout of coarse
and fine grids shown in figure 1.1. We refer to coarse grid points with indices $(i_c, j_c)$
and fine grid points with indices $(i_f, j_f)$.
3.1 Prolongation
We interpolate the defect correction (error) from the coarse grid level to the
fine grid level, where it is added as a correction to the approximation of the fine
grid solution. There are four possible interpolation cases for standard coarsening in
two dimensions. The four cases are illustrated in figure 3.1, where the thick lines
represent coarse grid lines, thin lines represent the fine grid lines, circles represent
coarse grid points, X represents the fine grid interpolation point, and the subscripts f
and c distinguish the fine and coarse grid indices respectively. Figure 3.1(a) represents
interpolation to fine grid points that coincide with coarse grid points. Figure 3.1(b)
represents interpolation to fine grid points that do not coincide with coarse grid points,
but lie on coarse grid lines in the x-direction. Figure 3.1(c) represents interpolation to
fine grid points that do not coincide with coarse grid points, but lie on coarse grid
lines in the y-direction. Figure 3.1(d) represents interpolation to fine grid points that
do not align with any coarse grid lines either horizontally or vertically.
Figure 3.1. The four 2D standard coarsening interpolation cases, (a) through (d), where circles
represent the coarse grid points used to interpolate to the fine grid point represented by X. The
thick lines represent coarse grid lines.
The fine grid points that are also coarse grid points, case (a), use the identity
as the interpolation operator. The coarse grid correction is then given by
u.
Uic,jc
(3.5)
where (Xif,yjf) = (xic,yjc) on the grid; here the interpolation coefficient is 1.
The fine grid points that are between two coarse grid points that share the
same yj coordinate, case (b), use a two point relation for the interpolation. The coarse
grid correction is given by
u
l,jf u*fidf + K^jc uici,jc + !t
e u?
C)3c lC)3c
(3.6)
where yJc = yjf and Xic1 < X{f\ .< X{c on the grid, and the interpolation coefficients
are If _x and If .
LC Jj JC lCiJC
The fine grid points that are between two coarse grid points that share the
same $x_i$ coordinate, case (c), use a similar two point relation for the interpolation. The
coarse grid correction is then given by
$$ u^h_{i_f,j_f-1} \leftarrow u^h_{i_f,j_f-1} + I^n_{i_c,j_c}\, u^H_{i_c,j_c} + I^s_{i_c,j_c-1}\, u^H_{i_c,j_c-1} \tag{3.7} $$
where $x_{i_c} = x_{i_f}$ and $y_{j_c-1} < y_{j_f-1} < y_{j_c}$ on the grid, and the interpolation coefficients
are $I^n_{i_c,j_c}$ and $I^s_{i_c,j_c-1}$.
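The last set of fine grid points are those that do not share either an $x_i$ or
a $y_j$ coordinate with the coarse grid, case (d). We use a four point relation for the
interpolation in this case, and the coarse grid correction is given by
$$ \begin{aligned} u^h_{i_f-1,j_f-1} \leftarrow u^h_{i_f-1,j_f-1} & + I^{sw}_{i_c-1,j_c-1}\, u^H_{i_c-1,j_c-1} + I^{nw}_{i_c-1,j_c}\, u^H_{i_c-1,j_c} \\ & + I^{ne}_{i_c,j_c}\, u^H_{i_c,j_c} + I^{se}_{i_c,j_c-1}\, u^H_{i_c,j_c-1} \end{aligned} \tag{3.8} $$
where $x_{i_c-1} < x_{i_f-1} < x_{i_c}$ and $y_{j_c-1} < y_{j_f-1} < y_{j_c}$, and the interpolation coefficients are
$I^{sw}_{i_c-1,j_c-1}$, $I^{nw}_{i_c-1,j_c}$, $I^{ne}_{i_c,j_c}$, and $I^{se}_{i_c,j_c-1}$. The interpolation operator's coefficients can also
be represented in stencil notation, just like the grid operator, as
$$ I^h_H = \begin{bmatrix} I^{nw} & I^{n} & I^{ne} \\ I^{w} & 1 & I^{e} \\ I^{sw} & I^{s} & I^{se} \end{bmatrix}^h_H \tag{3.9} $$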
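The four interpolation cases above can be illustrated with a short Python sketch. It is only a minimal sketch under stated assumptions, not the black box multigrid implementation: the function name `prolong_correct`, the 0-based indexing in which coarse point (ic, jc) coincides with fine point (2 ic, 2 jc), and the storage of the weights in a dictionary attached to each interpolated fine grid point are all hypothetical choices made for readability.

```python
def prolong_correct(u_f, u_c, weights):
    """Add the interpolated coarse grid correction u_c to the fine grid iterate u_f.

    Assumed (hypothetical) layout: arrays are lists of rows indexed [j][i],
    coarse point (ic, jc) coincides with fine point (2*ic, 2*jc), and
    weights[j][i] is a dict holding the interpolation coefficients used at
    the fine grid point (i, j).
    """
    n_cy, n_cx = len(u_c), len(u_c[0])
    for jc in range(n_cy):
        for ic in range(n_cx):
            i, j = 2 * ic, 2 * jc
            # Case (a): fine point coincides with a coarse point, weight 1.
            u_f[j][i] += u_c[jc][ic]
            if ic > 0:
                # Case (b): fine point between two coarse points on the same y line.
                w = weights[j][i - 1]
                u_f[j][i - 1] += w["w"] * u_c[jc][ic - 1] + w["e"] * u_c[jc][ic]
            if jc > 0:
                # Case (c): fine point between two coarse points on the same x line.
                w = weights[j - 1][i]
                u_f[j - 1][i] += w["s"] * u_c[jc - 1][ic] + w["n"] * u_c[jc][ic]
            if ic > 0 and jc > 0:
                # Case (d): fine point at the center of four coarse points.
                w = weights[j - 1][i - 1]
                u_f[j - 1][i - 1] += (w["sw"] * u_c[jc - 1][ic - 1]
                                      + w["nw"] * u_c[jc][ic - 1]
                                      + w["ne"] * u_c[jc][ic]
                                      + w["se"] * u_c[jc - 1][ic])
    return u_f


# Usage: constant weights 1/2 and 1/4 reduce the scheme to bilinear
# interpolation, which reproduces a linear coarse grid function exactly.
bilinear = {"w": 0.5, "e": 0.5, "s": 0.5, "n": 0.5,
            "sw": 0.25, "nw": 0.25, "ne": 0.25, "se": 0.25}
u_c = [[0.0, 2.0], [4.0, 6.0]]                 # u = 2*ic + 4*jc
u_f = [[0.0] * 3 for _ in range(3)]
weights = [[bilinear] * 3 for _ in range(3)]
prolong_correct(u_f, u_c, weights)             # u_f[j][i] becomes i + 2*j
```

With operator induced weights in place of the constants, the same loop structure applies; only the contents of the weight dictionaries change.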
3.1.1 Prolongation Correction Near Boundaries In the black box
multigrid solvers, the right hand side of the grid equation next to the boundary can
contain boundary data, in which case the above interpolation formulas can lead to $O(1)$
interpolation errors. To reduce this error we can use a correction term that contains
the residual to bring the interpolation errors back to $O(h^2)$ [26]. The correction term
is $O(h^2)$ for the interior grid points, and in general will not improve the error on the
interior, but near the boundary the correction term can be $O(1)$. The correction term
takes the form of the residual divided by the diagonal of the grid equation coefficient
matrix; the correction term is equal to $r/C$, where the residual $r$ was computed for
the grid before restriction. The correction term is added to equations 3.6, 3.7, and
3.8, which are for interpolating to fine grid points that are not coarse grid points.
Applying the correction is similar to performing an additional relaxation sweep along
the boundary, and it does not affect the size of the prolongation stencil.
3.2 Restriction
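The restriction operator restricts the residual from the fine grid level to the
coarse grid level, where it becomes the right hand side of the defect equation (error
residual equation). The restriction equation is
$$ \begin{aligned} r^H_{i_c,j_c} = {} & J^w_{i_c,j_c}\, r^h_{i_f+1,j_f} + J^e_{i_c,j_c}\, r^h_{i_f-1,j_f} \\ & + J^s_{i_c,j_c}\, r^h_{i_f,j_f+1} + J^n_{i_c,j_c}\, r^h_{i_f,j_f-1} \\ & + J^{sw}_{i_c,j_c}\, r^h_{i_f+1,j_f+1} + J^{nw}_{i_c,j_c}\, r^h_{i_f+1,j_f-1} \\ & + J^{ne}_{i_c,j_c}\, r^h_{i_f-1,j_f-1} + J^{se}_{i_c,j_c}\, r^h_{i_f-1,j_f+1} \\ & + r^h_{i_f,j_f} \end{aligned} \tag{3.10} $$
where the restriction coefficients are $J^w$, $J^e$, $J^s$, $J^n$, $J^{sw}$, $J^{nw}$, $J^{ne}$, $J^{se}$, and 1. The
restriction coefficients can also be represented in stencil notation as
$$ J^H_h = \begin{bmatrix} J^{nw} & J^{n} & J^{ne} \\ J^{w} & 1 & J^{e} \\ J^{sw} & J^{s} & J^{se} \end{bmatrix}^H_h \tag{3.11} $$
where the restriction is centered at the fine grid point $(x_{i_f}, y_{j_f}) = (x_{i_c}, y_{j_c})$.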
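The restriction at a coarse grid point can be sketched in Python as a weighted sum over the nine surrounding fine grid residuals. This is a hedged illustration rather than the solver's code: the function name `restrict_residual`, the dictionary of weights per coarse point, the 0-based indexing with coarse point (ic, jc) at fine point (2 ic, 2 jc), and the pairing of each weight with its fine grid neighbor (chosen so that the restriction is the transpose of the interpolation of section 3.1) are assumptions of this sketch. Boundary coarse points are simply left at zero here.

```python
def restrict_residual(r_f, J):
    """Restrict the fine grid residual r_f to the interior coarse grid points.

    Assumed (hypothetical) layout: r_f is a list of rows indexed [j][i] on a
    (2*n_cy - 1) x (2*n_cx - 1) fine grid, and J[jc][ic] is a dict with the
    eight restriction weights at coarse point (ic, jc); the center weight is 1.
    """
    n_cy = (len(r_f) + 1) // 2
    n_cx = (len(r_f[0]) + 1) // 2
    r_c = [[0.0] * n_cx for _ in range(n_cy)]
    for jc in range(1, n_cy - 1):
        for ic in range(1, n_cx - 1):
            i, j = 2 * ic, 2 * jc
            w = J[jc][ic]
            r_c[jc][ic] = (r_f[j][i]
                           + w["w"] * r_f[j][i + 1] + w["e"] * r_f[j][i - 1]
                           + w["s"] * r_f[j + 1][i] + w["n"] * r_f[j - 1][i]
                           + w["sw"] * r_f[j + 1][i + 1] + w["nw"] * r_f[j - 1][i + 1]
                           + w["ne"] * r_f[j - 1][i - 1] + w["se"] * r_f[j + 1][i - 1])
    return r_c


# Usage: with weights 1/2 and 1/4 a constant residual of 1 restricts to
# 1 + 4*(1/2) + 4*(1/4) = 4 at the single interior coarse point.
fw = {"w": 0.5, "e": 0.5, "s": 0.5, "n": 0.5,
      "sw": 0.25, "nw": 0.25, "ne": 0.25, "se": 0.25}
J = [[fw] * 3 for _ in range(3)]
r_f = [[1.0] * 5 for _ in range(5)]
r_c = restrict_residual(r_f, J)
```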
3.3 Overview
In the following sections we present several different interpolation operators
by exhibiting the coefficients needed to represent the operator's stencil. In most cases,
we omit the indices of the operators, it being understood that the grid operator is
given at the fine grid point $(x_{i_f}, y_{j_f})$. The grid transfer operators can be split into two
groups based upon how the operators are computed.
The first class of grid transfer operators is based on using a collapse (lumping)
in one of the coordinate directions, yielding a simple three point relation that can be
solved. The second class of grid transfer operators is based on an idea from Schaffer's
semicoarsening multigrid [69]. Both of these methods for operator induced grid transfer
operators are approximations to the Schur complement; that is, they try to approximate
the block Gaussian elimination of the unknowns that are on the fine grid but not
on the coarse grid. The collapsing methods are a local process, while Schaffer's idea is
to apply the procedure to a block (line) of unknowns.
We start by presenting the grid transfer operators used in the symmetric
versions of the black box multigrid solvers. Then we present several different grid
transfer operators that are used in the nonsymmetric black box multigrid solvers.
In classic multigrid methods, the grid transfer operators are often taken to be
bilinear interpolation and full weighting; injection is also popular. To see why we do
not use these choices, we need to look at the type of problems that we are hoping to
solve. These problems are represented by the convection-diffusion equation,
$$ -\nabla \cdot (D \nabla u) + b \cdot \nabla u + c\,u = f, \tag{3.12} $$
where $D$, $c$, and $f$ are allowed to be discontinuous across internal boundaries. The
black box multigrid solvers are aimed at solving these problems when $D$ is strongly
discontinuous. The classical multigrid grid transfer operators perform quite well when
$D$ jumps by an order of magnitude or less, but when $D$ jumps by several orders of
magnitude, the classical methods can exhibit extremely poor convergence, since these
methods are based on the continuity of $\nabla u$ and the smoothing of the error in $\nabla u$.
However, it is $D \nabla u$ that is continuous, not $\nabla u$. Hence, if $D$ has jumps of more
than an order of magnitude across internal boundaries, then it is more appropriate to
use grid transfer operators that approximate the continuity of $D \nabla u$ instead of the
continuity of $\nabla u$. It is important to remember that we are using the Galerkin coarse
grid approximation approach to form the coarse grid operators. We want the coarse
grid operators to approximate the continuity of $D \nabla u$. This goal is accomplished by
basing the grid transfer operators on the grid operator $L^h$.
Before proceeding with the definitions of the first class of grid transfer operators,
we need to define a few terms and make a few explanations.
Definition 3.3.1 Using the grid operator's stencil notation, define $R_\Sigma$, the row sum, at a
given grid point $(x_i, y_j)$ to be
$$ R_\Sigma = C + NW + N + NE + W + E + SW + S + SE, \tag{3.13} $$
where the subscript $(i,j)$ has been suppressed.
The row sum is used to determine when to switch between two different ways of computing
the grid transfer coefficients at a given point. The switch happens when the
grid operator is marginally diagonally dominant, or in other words, when the row sum
is small in some sense.
We recall what is meant by the symmetric part of the operator.
Definition 3.3.2 Define the symmetric part of the operator $L$ as
$$ \sigma L = \mathrm{symm}(L) = \tfrac{1}{2}(L + L^*) \tag{3.14} $$
where $L^*$ is the adjoint of the grid operator $L$.
The notation applies equally to the grid operator's coefficients, for example:
$$ \sigma N_{i,j} = \tfrac{1}{2}\,(N_{i,j} + S_{i,j+1}) \qquad \text{and} \qquad \sigma SW_{i,j} = \tfrac{1}{2}\,(SW_{i,j} + NE_{i-1,j-1}). \tag{3.15} $$
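As an illustration of Definition 3.3.2, the symmetric stencil coefficients can be formed by averaging each coefficient with the opposite coefficient at the neighboring point it connects to. The following Python sketch is only illustrative: the function name `symmetric_part`, the dictionary-of-arrays storage, and the choice to leave connections that point outside the grid unchanged are assumptions, not part of the black box multigrid codes.

```python
def symmetric_part(st):
    """Return the symmetric stencil coefficients (sigma L) of a 9-point operator.

    Assumed (hypothetical) layout: st maps each of the names
    'C','N','S','E','W','NE','NW','SE','SW' to a list of rows indexed [j][i].
    """
    # For each connection: offset (di, dj) of the neighbor and the name of the
    # coefficient at that neighbor which points back, e.g. N_{i,j} pairs
    # with S_{i,j+1}.
    pairs = {"N": (0, 1, "S"), "S": (0, -1, "N"),
             "E": (1, 0, "W"), "W": (-1, 0, "E"),
             "NE": (1, 1, "SW"), "SW": (-1, -1, "NE"),
             "NW": (-1, 1, "SE"), "SE": (1, -1, "NW")}
    ny, nx = len(st["C"]), len(st["C"][0])
    sym = {name: [row[:] for row in arr] for name, arr in st.items()}
    for name, (di, dj, back) in pairs.items():
        for j in range(ny):
            for i in range(nx):
                ii, jj = i + di, j + dj
                if 0 <= ii < nx and 0 <= jj < ny:
                    sym[name][j][i] = 0.5 * (st[name][j][i] + st[back][jj][ii])
    return sym


# Usage: a convection-like operator with W = -2 and E = -1 everywhere.
def const(value):
    return [[value] * 3 for _ in range(3)]

st = {name: const(0.0) for name in ("N", "S", "NE", "NW", "SE", "SW")}
st.update({"C": const(3.0), "W": const(-2.0), "E": const(-1.0)})
sym = symmetric_part(st)   # interior sigma W and sigma E both equal -1.5
```

The center coefficient is unchanged, since the diagonal of $\frac{1}{2}(L + L^*)$ equals the diagonal of $L$.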
In addition, some examples of the adjoint (transpose) of the grid
operator's coefficients are:
$$ (W_{i,j})^* = E_{i-1,j}, \qquad (\sigma SE_{i,j})^* = \sigma NW_{i+1,j-1}, \qquad \text{and} \qquad (\sigma C_{i,j})^* = \sigma C_{i,j}. \tag{3.16} $$
3.4 Symmetric Grid Operator Lh: Collapsing Methods
The interpolation operator is based upon the discrete grid operator $L^h$, while
the restriction operator is based on $(L^h)^*$.
We want to preserve the flux $\mu \cdot (D \nabla U)$ across interfaces, which can be done
by using the grid operator $L^h$. Assume that $L^h$ has a 5-point stencil; then
$$ W\,(U_{i,j} - U_{i-1,j}) = E\,(U_{i+1,j} - U_{i,j}), \tag{3.17} $$
which gives the interpolation formula
$$ U_{i,j} = \frac{W}{W + E}\, U_{i-1,j} + \frac{E}{W + E}\, U_{i+1,j}. \tag{3.18} $$
When $L^h$ has a 9-point stencil, the idea is to integrate the contributions from the other
coefficients ($NW$, $NE$, $SW$, and $SE$), which can be done by summing (collapsing) the
coefficients to get the three point relation,
$$ A_- v_{i-1,j} + A_0 v_{i,j} + A_+ v_{i+1,j} = 0 \tag{3.19} $$
where $A_- = (NW + W + SW)$, $A_0 = (N + C + S)$, and $A_+ = (NE + E + SE)$.
The computation of the $I^w$ and $I^e$ coefficients is done by collapsing the grid
operator in the y-direction to get a three point relation on the x-grid lines. Let the
interpolation formula be given by
$$ A_{i-1} v_{i-1} + A_i v_i + A_{i+1} v_{i+1} = 0 \tag{3.20} $$
where $v_k$ is written for $v_{k,j}$, and $A_{i-1} = (NW + W + SW)_{i,j}$, $A_i = (N + C + S)_{i,j}$, and
$A_{i+1} = (NE + E + SE)_{i,j}$. We now solve the equation for $v_i$ to get the interpolation
formula in an explicit form:
$$ v_i = -A_i^{-1} A_{i-1} v_{i-1} - A_i^{-1} A_{i+1} v_{i+1}. \tag{3.21} $$
The interpolation coefficients $I^w$ and $I^e$ are then given by
$$ I^w = -A_i^{-1} A_{i-1} \qquad \text{and} \qquad I^e = -A_i^{-1} A_{i+1}. \tag{3.22} $$
Writing out the coefficients explicitly gives
$$ I^w = -\frac{NW + W + SW}{N + C + S} \tag{3.23} $$
and
$$ I^e = -\frac{NE + E + SE}{N + C + S} \tag{3.24} $$
where $I^w$ and $I^e$ are evaluated at $(i_c - 1, j_c)$ and $(i_c, j_c)$ respectively, and the other
coefficients on the right hand side are evaluated at $(i_f - 1, j_f)$. If, however, the row
sum $R_\Sigma$ (see 3.13) is small (see 3.28), then instead of $(N + C + S)_i$ for $A_i$ we
use $-(NW + W + SW + NE + E + SE)_i$. These two formulas give the same result
when the row sum is zero, which is the case for an operator with only second order
terms away from the boundary. This idea is observed to lead to better convergence,
and it is due to Dendy [30]. The coefficients are then defined by
$$ I^w = \frac{NW + W + SW}{NW + W + SW + NE + E + SE} \tag{3.25} $$
and
$$ I^e = \frac{NE + E + SE}{NW + W + SW + NE + E + SE} \tag{3.26} $$
where $I^w$ and $I^e$ are evaluated at $(i_c - 1, j_c)$ and $(i_c, j_c)$ respectively, and the other
coefficients on the right hand side are evaluated at $(i_f - 1, j_f)$.
Let
$$ \gamma = \min\{\,|NW + W + SW|,\ |NE + E + SE|,\ 1\,\}. \tag{3.27} $$
Then by small we mean that
$$ R_\Sigma < -\gamma\,(NW + W + SW + N + S + NE + E + SE), \tag{3.28} $$
where $R_\Sigma$ is the row sum defined above.
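The collapse in the y-direction, together with the switch on a small row sum, can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the BOXMG code: the function name `collapse_in_y` and the dictionary stencil are hypothetical, and the sketch assumes the sign convention used in this section, in which the off-diagonal stencil coefficients are stored with their (typically negative) signs so that the row sum is $C$ plus the eight neighbors.

```python
def collapse_in_y(st):
    """Return (I_w, I_e) from a 9-point stencil collapsed in the y-direction.

    st is a dict with keys 'NW','N','NE','W','C','E','SW','S','SE' holding the
    stencil of the fine grid point being interpolated (hypothetical layout).
    """
    a_west = st["NW"] + st["W"] + st["SW"]
    a_east = st["NE"] + st["E"] + st["SE"]
    a_center = st["N"] + st["C"] + st["S"]
    off_diagonal = a_west + a_east + st["N"] + st["S"]
    row_sum = st["C"] + off_diagonal
    gamma = min(abs(a_west), abs(a_east), 1.0)
    if row_sum < -gamma * off_diagonal:
        # Small row sum: normalize by the collapsed west and east sums so the
        # two weights add up to one.
        denominator = a_west + a_east
        return a_west / denominator, a_east / denominator
    # Otherwise solve the three point relation for the center unknown.
    return -a_west / a_center, -a_east / a_center


# Usage: a diffusion stencil with a jump of 1000 in the east coefficient
# and zero row sum; nearly all of the weight goes to the east neighbor.
jump = {"NW": 0.0, "N": -1.0, "NE": 0.0, "W": -1.0, "C": 1003.0,
        "E": -1000.0, "SW": 0.0, "S": -1.0, "SE": 0.0}
I_w, I_e = collapse_in_y(jump)

# A stencil with a large zeroth order term (row sum 10) uses the first form.
reaction = {"NW": 0.0, "N": -1.0, "NE": 0.0, "W": -1.0, "C": 14.0,
            "E": -1.0, "SW": 0.0, "S": -1.0, "SE": 0.0}
R_w, R_e = collapse_in_y(reaction)
```

Bilinear interpolation would assign weights 1/2 and 1/2 in the first example, which is exactly the behavior that degrades convergence when $D$ jumps strongly.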
The computation of the $I^s$ and $I^n$ coefficients is done by collapsing the grid
operator in the x-direction to get the three point relation on the y-grid line. Let the
interpolation formula be given by
$$ A_{j-1} v_{j-1} + A_j v_j + A_{j+1} v_{j+1} = 0 \tag{3.29} $$
where $v_{j-1} = \{v_{i,j-1} : i = 1, \ldots, n_x\}$, $v_j = \{v_{i,j} : i = 1, \ldots, n_x\}$, $v_{j+1} = \{v_{i,j+1} : i =
1, \ldots, n_x\}$, and $A_{j+1} = (NW + N + NE)_{i,j+1}$, $A_j = (W + C + E)_{i,j}$, and $A_{j-1} =
(SW + S + SE)_{i,j-1}$. We now solve the equation for $v_j$ to get the interpolation formula
in an explicit form:
$$ v_j = -A_j^{-1} A_{j-1} v_{j-1} - A_j^{-1} A_{j+1} v_{j+1}. \tag{3.30} $$
The interpolation coefficients $I^s$ and $I^n$ are given by
$$ I^s = -A_j^{-1} A_{j-1} \qquad \text{and} \qquad I^n = -A_j^{-1} A_{j+1}. \tag{3.31} $$
Writing out the coefficients explicitly gives
$$ I^s = -\frac{SW + S + SE}{W + C + E} \tag{3.32} $$
and
$$ I^n = -\frac{NW + N + NE}{W + C + E}. \tag{3.33} $$
If, however, the row sum $R_\Sigma$ is small, then instead of $(W + C + E)_j$ for $A_j$ we use
$-(NW + N + NE + SW + S + SE)_j$. The coefficients are then defined by
$$ I^s = \frac{SW + S + SE}{NW + N + NE + SW + S + SE} \tag{3.34} $$
and
$$ I^n = \frac{NW + N + NE}{NW + N + NE + SW + S + SE}, \tag{3.35} $$
where $I^s$ and $I^n$ are evaluated at $(i_c, j_c - 1)$ and $(i_c, j_c)$ respectively, and the other
coefficients on the right hand side are evaluated at $(i_f, j_f - 1)$. Let
$$ \gamma = \min\{\,|NW + N + NE|,\ |SW + S + SE|,\ 1\,\}. \tag{3.36} $$
Then by small we mean that
$$ R_\Sigma < -\gamma\,(NW + N + NE + SW + S + SE), \tag{3.37} $$
where $R_\Sigma$ is the row sum.
The computation of the interpolation coefficients $I^{sw}$, $I^{nw}$, $I^{ne}$, and $I^{se}$ is similar
to that of the coefficients that have already been computed. Let the interpolation
formula be given by
$$ \begin{aligned} & A_{i-1,j+1} v_{i-1,j+1} + A_{i,j+1} v_{i,j+1} + A_{i+1,j+1} v_{i+1,j+1} \\ & + A_{i-1,j} v_{i-1,j} + A_{i,j} v_{i,j} + A_{i+1,j} v_{i+1,j} \\ & + A_{i-1,j-1} v_{i-1,j-1} + A_{i,j-1} v_{i,j-1} + A_{i+1,j-1} v_{i+1,j-1} = 0, \end{aligned} \tag{3.38} $$
where the $A_{*,*}$ are just the corresponding grid operator coefficients. We can now solve
for $v_{i,j}$ to get the interpolation formula:
$$ \begin{aligned} v_{i,j} = -A_{i,j}^{-1} \big( & A_{i-1,j+1} v_{i-1,j+1} + A_{i,j+1} v_{i,j+1} + A_{i+1,j+1} v_{i+1,j+1} \\ & + A_{i-1,j} v_{i-1,j} + A_{i+1,j} v_{i+1,j} \\ & + A_{i-1,j-1} v_{i-1,j-1} + A_{i,j-1} v_{i,j-1} + A_{i+1,j-1} v_{i+1,j-1} \big). \end{aligned} \tag{3.39} $$
Notice that $v_{i,j-1}$, $v_{i-1,j}$, $v_{i+1,j}$, and $v_{i,j+1}$ are unknowns. However, we can use their
interpolated values that we computed above, being careful to note that their stencils are
all centered at different grid points. After performing the substitutions and collecting
the terms for $v_{i-1,j-1}$, $v_{i-1,j+1}$, $v_{i+1,j+1}$, and $v_{i+1,j-1}$ we get
$$ v_{i,j} = I^{sw} v_{i-1,j-1} + I^{nw} v_{i-1,j+1} + I^{ne} v_{i+1,j+1} + I^{se} v_{i+1,j-1}, \tag{3.40} $$
where instead of having to compute everything all over again, it can be seen that $I^{sw}$,
$I^{nw}$, $I^{ne}$, and $I^{se}$ can be expressed in terms of the previous four coefficients, $I^w$, $I^e$, $I^s$,
and $I^n$. However, we must now explicitly write the subscripts for the coefficients $I^w$,
$I^e$, $I^s$, and $I^n$ to indicate where their stencils are centered relative to the interpolated
point's stencil, which is centered at $(i,j)$. The formulas for the four coefficients are
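$$ I^{sw} = -\frac{SW + S\, I^w_{i,j-1} + W\, I^s_{i-1,j}}{C} \tag{3.41} $$
where $I^{sw}$ is evaluated at $(x_{i_c-1}, y_{j_c-1})$,
$$ I^{nw} = -\frac{NW + N\, I^w_{i,j+1} + W\, I^n_{i-1,j}}{C} \tag{3.42} $$
where $I^{nw}$ is evaluated at $(x_{i_c-1}, y_{j_c})$,
$$ I^{ne} = -\frac{NE + N\, I^e_{i,j+1} + E\, I^n_{i+1,j}}{C} \tag{3.43} $$
where $I^{ne}$ is evaluated at $(x_{i_c}, y_{j_c})$,
$$ I^{se} = -\frac{SE + S\, I^e_{i,j-1} + E\, I^s_{i+1,j}}{C} \tag{3.44} $$
where $I^{se}$ is evaluated at $(x_{i_c}, y_{j_c-1})$, and the other stencil coefficients are evaluated
at $(x_{i_f}, y_{j_f})$. If, however, $R_\Sigma$ is small, then
$$ I^{sw} = \frac{SW + S\, I^w_{i,j-1} + W\, I^s_{i-1,j}}{NW + N + NE + W + E + SW + S + SE}, \tag{3.45} $$
$$ I^{nw} = \frac{NW + N\, I^w_{i,j+1} + W\, I^n_{i-1,j}}{NW + N + NE + W + E + SW + S + SE}, \tag{3.46} $$
$$ I^{ne} = \frac{NE + N\, I^e_{i,j+1} + E\, I^n_{i+1,j}}{NW + N + NE + W + E + SW + S + SE}, \tag{3.47} $$
$$ I^{se} = \frac{SE + S\, I^e_{i,j-1} + E\, I^s_{i+1,j}}{NW + N + NE + W + E + SW + S + SE}, \tag{3.48} $$
and where $NW$, $N$, $NE$, $W$, $C$, $E$, $SW$, $S$, and $SE$ are evaluated at $(x_{i_f}, y_{j_f})$. Let
$$ \gamma = \min\{\,|SW + W + NW|,\ |NW + N + NE|,\ |NE + E + SE|,\ |SE + S + SW|,\ 1\,\}. \tag{3.49} $$
Then by small we mean that
$$ R_\Sigma < -\gamma\,(NW + N + NE + W + E + SW + S + SE). \tag{3.50} $$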
The interpolation correction terms are $A_i^{-1} r^H$, $A_j^{-1} r^H$, or $A_{i,j}^{-1} r^H$ for the corresponding
interpolation formulas above, where $r^H$ is the residual on the coarse grid.
Note that the $A$'s change depending on whether $R_\Sigma$ is small or not.
The computation of the interpolation coefficients in this way was used in the
BOXMG, BOXMGP, BBMG, and BBMGP codes for symmetric problems [1], [26],
[30], [10]. Similar computations have also been used for most black box, geometric,
and algebraic multigrid solvers for symmetric problems arising from finite difference
and finite volume discretizations using either a 5-point or a 9-point standard stencil
[7], [23], [29], [31], [52], [54], [53], [55], [63], [85], [24].
The computation of the restriction operator's coefficients is closely related
to that of the interpolation coefficients. In fact, in the symmetric case, the restriction
coefficients for the symmetric grid operator $L^h$ can be taken to be equal to the
interpolation coefficients,
$$ I^H_h = (I^h_H)^*. \tag{3.51} $$
3.5 Nonsymmetric Grid Operator Lh: Collapsing Methods
The interpolation coefficients can be computed in the same way as in the
symmetric case except that we replace all of the grid operator's coefficients with their
equivalent symmetric stencil coefficients, denoted by $\sigma(\cdot)$. However, the row sum
definition remains unchanged.
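3.5.1 Prolongation Based on symm(Lh) The computation of the $I^w$
and $I^e$ coefficients is given by
$$ I^w = -\frac{\sigma NW + \sigma W + \sigma SW}{\sigma N + \sigma C + \sigma S} \tag{3.52} $$
and
$$ I^e = -\frac{\sigma NE + \sigma E + \sigma SE}{\sigma N + \sigma C + \sigma S}. \tag{3.53} $$
If, however, $R_\Sigma$ is small, then
$$ I^w = \frac{\sigma NW + \sigma W + \sigma SW}{\sigma NW + \sigma W + \sigma SW + \sigma NE + \sigma E + \sigma SE} \tag{3.54} $$
and
$$ I^e = \frac{\sigma NE + \sigma E + \sigma SE}{\sigma NW + \sigma W + \sigma SW + \sigma NE + \sigma E + \sigma SE}. \tag{3.55} $$
In (3.52)-(3.55) $I^w$ and $I^e$ are evaluated at $(x_{i_c-1}, y_{j_c})$ and $(x_{i_c}, y_{j_c})$ respectively, and
the other coefficients on the right hand side are evaluated at $(x_{i_f-1}, y_{j_f})$ for the $L^h$
components. Let
$$ \gamma = \min\{\,|\sigma NW + \sigma W + \sigma SW|,\ |\sigma NE + \sigma E + \sigma SE|,\ 1\,\}. \tag{3.56} $$
Then by small we mean that
$$ R_\Sigma < -\gamma\,(\sigma NW + \sigma W + \sigma SW + \sigma N + \sigma S + \sigma NE + \sigma E + \sigma SE). \tag{3.57} $$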
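The formulas for the $I^n$ and $I^s$ coefficients are
$$ I^n = -\frac{\sigma NW + \sigma N + \sigma NE}{\sigma W + \sigma C + \sigma E} \tag{3.58} $$
and
$$ I^s = -\frac{\sigma SW + \sigma S + \sigma SE}{\sigma W + \sigma C + \sigma E}. \tag{3.59} $$
If, however, $R_\Sigma$ is small, then
$$ I^n = \frac{\sigma NW + \sigma N + \sigma NE}{\sigma NW + \sigma N + \sigma NE + \sigma SW + \sigma S + \sigma SE} \tag{3.60} $$
and
$$ I^s = \frac{\sigma SW + \sigma S + \sigma SE}{\sigma NW + \sigma N + \sigma NE + \sigma SW + \sigma S + \sigma SE}, \tag{3.61} $$
where $I^n$ and $I^s$ are evaluated at $(x_{i_c}, y_{j_c})$ and $(x_{i_c}, y_{j_c-1})$ respectively, and the other
coefficients on the right hand side are evaluated at $(x_{i_f}, y_{j_f-1})$ for the $L^h$ components.
Let
$$ \gamma = \min\{\,|\sigma NW + \sigma N + \sigma NE|,\ |\sigma SW + \sigma S + \sigma SE|,\ 1\,\}. \tag{3.62} $$
Then by small we mean that
$$ R_\Sigma < -\gamma\,(\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE). \tag{3.63} $$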
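The computation of the interpolation coefficients $I^{sw}$, $I^{nw}$, $I^{ne}$, and $I^{se}$ can be expressed
in terms of the other four coefficients:
$$ I^{sw}_{i_c-1,j_c-1} = -\frac{\sigma SW + \sigma S\, I^w_{i,j-1} + \sigma W\, I^s_{i-1,j}}{C}, \tag{3.64} $$
$$ I^{nw}_{i_c-1,j_c} = -\frac{\sigma NW + \sigma N\, I^w_{i,j+1} + \sigma W\, I^n_{i-1,j}}{C}, \tag{3.65} $$
$$ I^{ne}_{i_c,j_c} = -\frac{\sigma NE + \sigma N\, I^e_{i,j+1} + \sigma E\, I^n_{i+1,j}}{C}, \tag{3.66} $$
$$ I^{se}_{i_c,j_c-1} = -\frac{\sigma SE + \sigma S\, I^e_{i,j-1} + \sigma E\, I^s_{i+1,j}}{C}. \tag{3.67} $$
If, however, $R_\Sigma$ is small, then
$$ I^{sw}_{i_c-1,j_c-1} = \frac{\sigma SW + \sigma S\, I^w_{i,j-1} + \sigma W\, I^s_{i-1,j}}{\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE}, \tag{3.68} $$
$$ I^{nw}_{i_c-1,j_c} = \frac{\sigma NW + \sigma N\, I^w_{i,j+1} + \sigma W\, I^n_{i-1,j}}{\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE}, \tag{3.69} $$
$$ I^{ne}_{i_c,j_c} = \frac{\sigma NE + \sigma N\, I^e_{i,j+1} + \sigma E\, I^n_{i+1,j}}{\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE}, \tag{3.70} $$
$$ I^{se}_{i_c,j_c-1} = \frac{\sigma SE + \sigma S\, I^e_{i,j-1} + \sigma E\, I^s_{i+1,j}}{\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE}, \tag{3.71} $$
where $\sigma NW$, $\sigma N$, $\sigma NE$, $\sigma W$, $\sigma C$, $\sigma E$, $\sigma SW$, $\sigma S$, and $\sigma SE$ are evaluated at $(x_{i_f}, y_{j_f})$
for the $L^h$ components. Let
$$ \gamma = \min\{\,|\sigma SW + \sigma W + \sigma NW|,\ |\sigma NW + \sigma N + \sigma NE|,\ |\sigma NE + \sigma E + \sigma SE|,\ |\sigma SE + \sigma S + \sigma SW|,\ 1\,\}. \tag{3.72} $$
Then by small we mean that
$$ R_\Sigma < -\gamma\,(\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE). \tag{3.73} $$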
It has been found in practice that the restriction operator $I^H_h$ need not be
based on the same operator as the interpolation operator, so we change its symbol to
$J^H_h$ to reflect this change. The restriction operator's coefficients are based on $(L^h)^T$
instead of $\sigma L^h$. The restriction coefficients are computed in exactly the same way as
the interpolation coefficients except that all of the grid operator's coefficients in the
computations are replaced by their transposes. The computations for the restriction
coefficients are now straightforward and will not be written out.
The grid transfer operators have been computed in this fashion for the black
box multigrid solver for nonsymmetric problems [27]. It should be noted that when
the grid operator Lh is symmetric, then the computations given here for both the
symmetric case and nonsymmetric case yield the same grid transfer coefficients.
3.5.2 Prolongation Based on Lh and symm(Lh) The third possibility
for computing the grid transfer operators is one that uses the same form of
the computations as above; see section 3.5.1. This prolongation is a point collapse
approximation to Schaffer's ideas; see section 3.6. The only difference in the above
computations for the nonsymmetric case is that for the denominators, $A_i^{-1}$ and $A_j^{-1}$,
we use the coefficients based on $L^h$ instead of $\sigma L^h$. The test for small is still in the
same form as before except that $L^h$ is used, but $\gamma$ is still based on $\sigma L^h$.
The restriction operator coefficients are computed as before, but the denominator
is now based on $L^h$ instead of on $(L^h)^T$.
3.5.3 Grid Transfer Operators Based on a hybrid form of Lh and
symm(Lh) The prolongation operator coefficients are computed the same as in the
last section, 3.5.2. However, the computation of the restriction operator coefficients has
been modified into a hybrid form that uses both $L^T$ and $L$.
The difference in the computation of the restriction coefficients comes into
play when the switch is made in the denominator, $A_i^{-1}$ and $A_j^{-1}$, because the row sum
is small. When the row sum is large we modify the denominator by adding in two
coefficients from the grid operator $L$. We can illustrate this modification by computing
the restriction coefficients $J^w$ and $J^e$:
$$ J^w = -\frac{(NW)^T + (W)^T + (SW)^T}{N + C + S} \tag{3.74} $$
and
$$ J^e = -\frac{(NE)^T + (E)^T + (SE)^T}{N + C + S}. \tag{3.75} $$
If, however, $R_\Sigma$ is small, then
$$ J^w = \frac{(NW)^T + (W)^T + (SW)^T}{(NW)^T + (W)^T + (SW)^T + N + S + (NE)^T + (E)^T + (SE)^T} \tag{3.76} $$
and
$$ J^e = \frac{(NE)^T + (E)^T + (SE)^T}{(NW)^T + (W)^T + (SW)^T + N + S + (NE)^T + (E)^T + (SE)^T}. \tag{3.77} $$
In (3.74)-(3.77) $J^w$ and $J^e$ are evaluated at $(x_{i_c-1}, y_{j_c})$ and $(x_{i_c}, y_{j_c})$ respectively, and
the other coefficients on the right hand side are evaluated at $(x_{i_f-1}, y_{j_f})$ for the $L^h$
components. Let
$$ \gamma = \min\{\,|(NW)^T + (W)^T + (SW)^T|,\ |(NE)^T + (E)^T + (SE)^T|,\ 1\,\}. \tag{3.78} $$
Then by small we mean that
$$ \begin{aligned} R_\Sigma < -\gamma\,\big( & (NW)^T + (W)^T + (SW)^T \\ & + (N)^T + (S)^T + N + S \\ & + (NE)^T + (E)^T + (SE)^T \big). \end{aligned} \tag{3.79} $$
The restriction coefficients $J^n$ and $J^s$ are computed in a similar way.
The motivation behind these modifications was to try to get the coarse grid
operator to approximate the one obtained when using the extension of Schaffer's idea;
see section 3.6. The grid operators from section 3.5.2 above were computed to approximate
the grid transfer coefficients based on an extension of Schaffer's idea; while the
method in this section attempts to do the same thing, it also makes some modifications
so that the coarse grid operator more closely approximates the one obtained in section
3.6.1.
3.6 Nonsymmetric Grid Operators: Extension of Schaffer's Idea
The second class of grid transfer routines is based on Schaffer's idea for grid
transfer operators in his semicoarsening multigrid method [70]. Schaffer's idea is to
approximate a full matrix by a diagonal matrix to compute the grid transfer operators.
Schaffer's idea was used in the development of the semicoarsening black box multigrid
method [32]. We took Schaffer's idea and extended it to apply to the standard
coarsening grid transfer operators.
The ideas used in the semicoarsening method are as follows. Suppose that
coarsening takes place only in the y-direction. Then the interpolation to points on the
fine grid can be represented by
$$ A_{j-1} v_{j-1} + A_j v_j + A_{j+1} v_{j+1} = 0 \tag{3.80} $$
where $v_k = \{v_{i,k} : i = 1, \ldots, n_x\}$ for $k = j-1, j, j+1$. The tridiagonal matrices $A_{j-1}$, $A_j$,
and $A_{j+1}$ represent the nine point grid operator on the $j-1$, $j$, and $j+1$ grid lines
respectively:
$$ \begin{aligned} A_{j+1} &= \mathrm{tridiag}\,[NW, N, NE]_{j+1}, \\ A_j &= \mathrm{tridiag}\,[W, C, E]_j, \\ A_{j-1} &= \mathrm{tridiag}\,[SW, S, SE]_{j-1}. \end{aligned} $$
As before, we solve this equation for $v_j$ to get
$$ v_j = -A_j^{-1} A_{j-1} v_{j-1} - A_j^{-1} A_{j+1} v_{j+1}, \tag{3.81} $$
where we have assumed that $A_j^{-1}$ exists and can be stably inverted. This assumption
cannot always be guaranteed, but Schaffer's and our methods allow line relaxation
as a smoother, where these assumptions are necessary. The methods would fail if the
assumptions did not hold, so in that sense we can say that the assumptions hold.
From equation (3.81), we form the quantities $-A_j^{-1} A_{j-1}$ and $-A_j^{-1} A_{j+1}$, leading
to a nonsparse interpolation operator. If the interpolation operator is not sparse
(sparse meaning that interpolation at the point $(i,j)$ involves only $v_{i,j-1}$ and $v_{i,j+1}$), then the coarse
grid operators formed by the Galerkin coarse grid approximation approach will grow
beyond a 9-point stencil. This is a property that we would very much like to avoid,
since it would lead to full operators on the coarser grid levels. Schaffer's idea, also
arrived at independently by Dendy, is to approximate these quantities with diagonal
matrices $B_{j-1}$ and $B_{j+1}$. This is accomplished by solving the following relations:
$$ \begin{aligned} -A_j^{-1} A_{j-1}\, e &= B_{j-1}\, e, \\ -A_j^{-1} A_{j+1}\, e &= B_{j+1}\, e, \end{aligned} \tag{3.82} $$
where $e = (1, 1, \ldots, 1)^T$. They can be solved quickly because they are tridiagonal equations.
After solving, the entries (diagonals) in $B_{j-1}$ and $B_{j+1}$ are just the interpolation
coefficients $I^s$ and $I^n$ respectively.
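The diagonal approximation amounts to one tridiagonal solve per neighboring line: the right hand side is minus the vector of row sums of the neighboring line matrix. The Python sketch below illustrates this with the Thomas algorithm. It is a sketch under stated assumptions, not Schaffer's or our implementation: the function names `solve_tridiagonal` and `lumped_coefficients`, the storage of a tridiagonal matrix as three lists, and the sign convention (off-diagonal coefficients stored with their signs, so a minus appears on the right hand side) are choices made for this example, and no pivoting is performed.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm (no pivoting).

    lower[k], diag[k], upper[k] are the entries of row k; lower[0] and
    upper[-1] are not used.
    """
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for k in range(1, n):
        pivot = diag[k] - lower[k] * c[k - 1]
        c[k] = upper[k] / pivot
        d[k] = (rhs[k] - lower[k] * d[k - 1]) / pivot
    x = [0.0] * n
    x[-1] = d[-1]
    for k in range(n - 2, -1, -1):
        x[k] = d[k] - c[k] * x[k + 1]
    return x


def lumped_coefficients(A_center, A_neighbor):
    """Return the diagonal of B, where A_center * (B e) = -A_neighbor * e.

    Each matrix is a tuple (lower, diag, upper) of lists; unused corner
    entries must be stored as zero so that the row sums are correct.
    """
    rhs = [-(lo + di + up) for lo, di, up in zip(*A_neighbor)]
    return solve_tridiagonal(*A_center, rhs)


# Usage: a constant coefficient 9-point operator (center 8, neighbors -1)
# on a line of n points; the south weights stay between 0 and 1/2.
n = 5
A_center = ([0.0] + [-1.0] * (n - 1), [8.0] * n, [-1.0] * (n - 1) + [0.0])
A_south = ([0.0] + [-1.0] * (n - 1), [-1.0] * n, [-1.0] * (n - 1) + [0.0])
I_s = lumped_coefficients(A_center, A_south)
```

Because the line matrix is the same one factored for line relaxation, a production code would presumably reuse that factorization instead of solving from scratch as done here.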
In the semicoarsening case the restriction operator is still based on the transpose
of the nonsymmetric grid operator $L^h$. This is done by replacing $A_{j-1}$, $A_j$, and
$A_{j+1}$ by their transposes to get $(A_{j-1})^*$, $(A_j)^*$, and $(A_{j+1})^*$ respectively.
3.6.1 Extension of Schaffer's Idea to Standard Coarsening The
above was presented in a manner suitable for the symmetric case. It can be modified for
the nonsymmetric case, as we did for the collapsing methods, by using the symmetric
part of the operator. We can do this by replacing $A_*$ with $\mathrm{symm}(A_*)$ to
get
$$ \begin{aligned} -(\mathrm{symm}(A_j))^{-1}\, \mathrm{symm}(A_{j-1})\, e &= B_{j-1}\, e, \\ -(\mathrm{symm}(A_j))^{-1}\, \mathrm{symm}(A_{j+1})\, e &= B_{j+1}\, e. \end{aligned} \tag{3.83} $$
Schaffer constructs his grid transfer operators in a different manner, and his
construction for variable coefficient problems can yield a nonsymmetric coarse grid
operator $L^H$ even if $L^h$ is symmetric. We would like the coarse grid operators to be
symmetric whenever the fine grid operator is symmetric. We can do this in several
ways, but a more efficient construction is to replace equation (3.83) with
$$ \begin{aligned} -A_j^{-1}\, \mathrm{symm}(A_{j-1})\, e &= B_{j-1}\, e, \\ -A_j^{-1}\, \mathrm{symm}(A_{j+1})\, e &= B_{j+1}\, e. \end{aligned} \tag{3.84} $$
The advantage of this form is that it can use the same tridiagonal system solver that
we are already using for the line solves for the multigrid smoother. Equation (3.83)
will require an additional tridiagonal solve for $\mathrm{symm}(A_j)$ and additional storage if the
LU factors are to be saved.
To extend these ideas to the standard coarsening case is quite easy. We first
compute the grid transfer coefficients for semicoarsening in the y-direction, and define
$v_k = \{v_{i,k} : i = 1, \ldots, n_x\}$ for $k = j-1, j, j+1$ and the tridiagonal matrices
$$ \begin{aligned} A_{j+1} &= \mathrm{tridiag}\,[\sigma NW, \sigma N, \sigma NE]_{j+1}, \\ A_j &= \mathrm{tridiag}\,[W, C, E]_j, \\ A_{j-1} &= \mathrm{tridiag}\,[\sigma SW, \sigma S, \sigma SE]_{j-1}. \end{aligned} $$
We save the diagonals of $B_{j-1}$ and $B_{j+1}$ associated with coarse grid lines in the x-direction
as the $I^s$ and $I^n$ interpolation coefficients respectively.
To obtain the coefficients for the y-direction, we compute the grid transfer
coefficients for semicoarsening in the x-direction and define
$v_k = \{v_{k,j} : j = 1, \ldots, n_y\}$ for $k = i-1, i, i+1$
and the tridiagonal matrices
$$ \begin{aligned} A_{i-1} &= \mathrm{tridiag}\,[\sigma SW, \sigma W, \sigma NW]_{i-1}, \\ A_i &= \mathrm{tridiag}\,[S, C, N]_i, \\ A_{i+1} &= \mathrm{tridiag}\,[\sigma SE, \sigma E, \sigma NE]_{i+1}. \end{aligned} $$
We save the diagonals of $B_{i-1}$ and $B_{i+1}$ associated with coarse grid lines in the x-direction
as the $I^w$ and $I^e$ interpolation coefficients respectively.
Finally, we can then combine the semicoarsening coefficients from the X and
Y lines to obtain the $I^{sw}$, $I^{nw}$, $I^{ne}$, and $I^{se}$ interpolation coefficients. They can be
computed as the product of the coefficients that have already been computed,
$$ I^{nw} = I^n \cdot I^w, \qquad I^{ne} = I^n \cdot I^e, \qquad I^{sw} = I^s \cdot I^w, \qquad I^{se} = I^s \cdot I^e, \tag{3.85} $$
or elimination can be used as before.
The restriction operator for the extension to the standard coarsening case
is computed as above, but the transpose of the grid operator is used instead of the
symmetric part of the operator. This is done by replacing $A_{j-1}$ and $A_{j+1}$ by their
transposes to get $(A_{j-1})^*$ and $(A_{j+1})^*$ respectively.
3.7 Conclusions Regarding Grid Transfer Operators
Many other grid transfer operators were tried in the standard coarsening black
box multigrid method in addition to those presented above. However, only three
were deemed to be robust and efficient enough to include in a release version of the
solver. The three choices for grid transfer operators are the original nonsymmetric collapsing
method described in section 3.5.1, the nonsymmetric hybrid collapsing method
described in section 3.5.3, and the nonsymmetric extension to Schaffer's ideas described
in section 3.6.1. While all three of these choices are good, better results were obtained
for the latter two for all test and application problems run to date.
Most of the other grid transfer operators that were tried had good performance
on some of the test problems but failed on others. There do appear to be
enough good results to cover all the test problems, with the exception of reentrant
flows. However, to unify these into one set of grid transfer operators would be much
more expensive to compute and may also introduce trouble when combining the various
types of grid transfer operators.
The grid transfer operators from section 3.5.2, which use a collapsing method
to try to approximate the extension of Schaffer's ideas for nonsymmetric problems,
were a disappointment. While they seemed to be a good idea, they turned out to not
be very robust and in several cases actually caused divergence of the multigrid method.
This bad behavior prompted examination of the coarse grid operators and grid transfer
operators. After comparing the operators with those obtained from Schaffer's ideas,
it was noticed that several things were wrong, but with the modifications described
in section 3.5.3, these problems were overcome. These new grid transfer operators
extended Schaffer's ideas to standard coarsening very well.
CHAPTER 4
BASIC ITERATION METHODS FOR SMOOTHERS
In this chapter we examine several basic iteration schemes for use as smoothers
in the Black Box Multigrid solvers. Fourier mode analysis is used to identify which
scheme makes the best smoother for a given type of model problem in two dimensions.
In this chapter we will be using parentheses around a superscript to denote
an iteration index. For example, $u^{(n)}$ means the nth iterate.
4.1 Overview of Basic Iteration Methods
All of the methods in this section can be characterized in the following way.
The algebraic system of equations to be solved is given by the matrix equation
Lu = f (4.1)
The matrix $L$ is an $N_{xy} \times N_{xy}$ matrix, where $N_{xy} = n_x n_y$. The computational grid
is two dimensional with $n_x$ and $n_y$ grid points in the x- and y-directions respectively.
The matrix $L$ can be split as
$$ L = M - N, \tag{4.2} $$
where $M$ is nonsingular and assumed easy to invert. Then a basic iteration method
for the solution of equation (4.1) is given by
$$ M u^{(n+1)} = N u^{(n)} + f \tag{4.3} $$
or as
$$ u^{(n+1)} = S u^{(n)} + M^{-1} f, \tag{4.4} $$
where $S = M^{-1} N$ is called the iteration matrix. The basic iteration method can also
be damped, and if the damping parameter is $\omega$, then the damped method is given by
$$ u^{(n+1)} = \omega \left( M^{-1} N u^{(n)} + M^{-1} f \right) + (1 - \omega)\, u^{(n)} \tag{4.5} $$
or by
$$ u^{(n+1)} = \tilde{S} u^{(n)} + \omega M^{-1} f, \tag{4.6} $$
where $\tilde{S}$ is now given by
$$ \tilde{S} = \omega M^{-1} N + (1 - \omega)\, I, \tag{4.7} $$
and $I$ is the identity matrix. When $\omega = 1$ we recover the undamped basic iterative
method.
The eigenvalues of the damped basic iteration matrix $\tilde{S}$ can be given in terms
of the eigenvalues of the undamped basic iteration matrix $S$. They are related by
$$ \lambda(\tilde{S}) = \omega \lambda(S) + 1 - \omega, \tag{4.8} $$
where $\omega$ is the damping parameter and $\lambda(S)$ on the right hand side of the equation is
an eigenvalue of $S$, the undamped iteration matrix.
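The eigenvalue relation can be checked numerically on a small example. The Python sketch below uses damped point Jacobi for the 1D model operator tridiag[-1, 2, -1], for which the undamped iteration matrix has the known eigenvalues $\cos(k\pi/(n+1))$ with eigenvectors $\sin(k\pi i/(n+1))$. The model operator, the choice of Jacobi (so that $M$ is the diagonal of $L$), and the function names are assumptions made only for this illustration; they are not the smoothers analyzed later in this chapter.

```python
import math


def damped_jacobi_matrix(n, omega):
    """Build omega*M^{-1}N + (1 - omega)*I for L = tridiag[-1, 2, -1].

    With M = diag(L) = 2I and N = M - L, the undamped matrix M^{-1}N is
    tridiag[1/2, 0, 1/2].
    """
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        S[i][i] = 1.0 - omega
        if i > 0:
            S[i][i - 1] = 0.5 * omega
        if i < n - 1:
            S[i][i + 1] = 0.5 * omega
    return S


def matvec(A, v):
    """Multiply the dense matrix A (list of rows) by the vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


n, omega = 6, 2.0 / 3.0
S_damped = damped_jacobi_matrix(n, omega)
max_defect = 0.0
for k in range(1, n + 1):
    v = [math.sin(k * math.pi * (i + 1) / (n + 1)) for i in range(n)]
    lam = math.cos(k * math.pi / (n + 1))       # eigenvalue of undamped S
    lam_damped = omega * lam + 1.0 - omega      # predicted damped eigenvalue
    Sv = matvec(S_damped, v)
    defect = max(abs(a - lam_damped * b) for a, b in zip(Sv, v))
    max_defect = max(max_defect, defect)
```

The quantity `max_defect` measures how far each eigenvector of $S$ is from being an eigenvector of the damped matrix with the predicted eigenvalue; it should be at the level of rounding error.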
The error after the nth iteration is
$$ e^{(n)} = u^{(n)} - u \tag{4.9} $$
where $u$ is a solution (unique if $L$ is nonsingular) to equation (4.1). The error at the
$(n+1)$st iteration is related to the error at the nth iteration by
e
PAGE 1
BLACK BOX MULTIGRID FOR CONVECTIONDIFFUSION EQUATIONS ON ADVANCED COMPUTERS by VICTOR ALAN BANDY M.S., University of Colorado at Denver, 1988 B.S., Oregon State University, 1983 A thesis submitted to the University of Colorado at Denver in partial fulfillment of the requirements for the degree of Doctor of Philosophy Department of Mathematics 1996
PAGE 2
This thesis for the Doctor of Philosophy degree by Victor Alan Bandy has been approved for the Department of Mathematics by Jan Mandel Joel E. Dendy, Jr. Stephen McCormick Leopolda Franca Gita Alaghband Date ______________________
PAGE 3
Bandy, Victor Alan (Ph. D., Applied Mathematics) Black Box Multigrid for ConvectionDiffusion Equations on Advanced Computers Thesis directed by Dr. Joel E. Dendy, Jr. ABSTRACT In this thesis we present Black Box Multigrid methods for the solution of convectiondiffusion equations with anisotropic and discontinuous coefficients on ad vanced computers. The methods can be classified as either using standard or semi coarsening for the generation of the coarse grids. The domains are assumed to be either two or three dimensional with a logically rectangular mesh structure being used for the discretization. New grid transfer operators are presented and compared to earlier grid transfer operators. The new operators are found to be more robust for convectiondiffusion equations. Local mode and model problem analysis are used to examine several choices of iterative methods for the smoother and their relative effectiveness for the class of problems under consideration. The red/black alternating line GaussSeidel method and the incomplete line L U (ILL U) by linesinx methods were found to be the most robust for two dimensional domains, and red/black alternating plane GaussSeidel, using the the 2D black box multigrid method for the plane solves, was found to be the most robust and efficient smoother for 3D problems. The Black Box Multigrid methods were developed to be portable, but opti mized for either vector computers, such as the Cray YMP, or for parallel computers, iii
PAGE 4
such as the CM5. While the computer architectures are very different, they represent two of the main directions that supercomputer architectures are moving in today. Per formance measures for a variety of test problems are presented for the two computers. The vectorized methods are suitable for another large class of common computers that use superscalar pipelined processors, such as PCs and workstations. While the codes have not been optimized for these computers, especially when considering caching issues, they do perform quite well. Some timing results are presented for a Sun Sparc5 for comparison with the supercomputers. This abstract accurately represents the contents of the candidate's thesis. I recommend its publication. Signed Joel E. Dendy, Jr. iv
PAGE 5
To my Mom, Lee Buchanan, and everyone else who kept on asking "When are you going to finish?"
PAGE 6
CHAPTER 1 INTRODUCTION 1.1 Summary .. CONTENTS 1.1.1 Previous Results. 1.1.2 New Contributions 1.2 Class of Problems . . 1.3 Discretization of the Problem 1 1 2 6 9 10 1.4 Multigrid Overview . . . 13 1.4.1 Multigrid Cycling Strategies 19 1.5 Black Box Multigrid . . . . 24 2 DISCRETIZATIONS: FINITE DIFFERENCE AND FINITE VOLUME. 27 2.1 Finite Difference Discretization 28 2.2 Finite Volume Discretization 31 2.3 Cell Centered Finite Volume Discretization; Evaluation at the Vertices 34 2.3.1 Interior Finite Volumes .... 2.3.2 Dirichlet Boundary Condition 36 37 2.3.3 Neumann and Robin Boundary Conditions 38 2.4 Cell Centered Finite Volume Discretization; Evaluation at the Cell Centers ............ 2.4.1 Interior Finite Volumes 2.4.2 Dirichlet Boundary Condition vi 39 40 41
PAGE 7
2.4.3 Neumann and Robin Boundary Conditions . . . . . 42 2.5 2.6 Vertex Centered Finite Volume DiscretizationEvaluation at the Vertices ............... 2.5.1 Interior Finite Volumes 2.5.2 Edge Boundary Finite Volumes 2.5.3 Dirichlet Boundary Condition 2.5.4 Neumann and Robin Boundary Conditions 2.5.5 Corner Boundary Finite Volumes 2.5.6 Dirichlet Boundary Condition . 2.5.7 Neumann and Robin Boundary Conditions Vertex Centered Finite Volume DiscretizationEvaluation at the Cell Vertices . . . . . . 2.6.1 Interior Finite Volumes 2.6.2 Dirichlet Boundary Condition 2.6.3 Neumann and Robin Boundary Conditions 2.6.4 Corner Boundary Finite Volumes 2.6.5 Dirichlet Boundary Condition .. 2.6.6 Neumann and Robin Boundary Conditions 3 PROLONGATION AND RESTRICTION OPERATORS 3.1 Prolongation . . . . . . . . . . 3.1.1 Prolongation Correction Near Boundaries 3.2 Restriction 3.3 Overview 3.4 Symmetric Grid Operator Lh: Collapsing Methods 3.5 Nonsymmetric Grid Operator Lh: Collapsing Methods 3.5.1 Prolongation Based on symm(Lh) vii 42 42 43 43 43 44 45 45 46 46 47 47 48 48 49 51 52 55 56 56 59 65 65
    3.5.2 Prolongation Based on $L^h$ and symm($L^h$)
    3.5.3 Grid Transfer Operators Based on a Hybrid Form of $L^h$ and symm($L^h$)
  3.6 Nonsymmetric Grid Operators: Extension of Schaffer's Idea
    3.6.1 Extension of Schaffer's Idea to Standard Coarsening
  3.7 Conclusions Regarding Grid Transfer Operators
4 BASIC ITERATION METHODS FOR SMOOTHERS
  4.1 Overview of Basic Iteration Methods
  4.2 Gauss-Seidel Relaxation
    4.2.1 Point Gauss-Seidel Iteration
    4.2.2 Line Gauss-Seidel Iteration by Lines in X
    4.2.3 Line Gauss-Seidel Iteration by Lines in Y
    4.2.4 Alternating Line Gauss-Seidel Iteration
  4.3 Incomplete Line LU Iteration
5 FOURIER MODE ANALYSIS OF SMOOTHERS
  5.1 Introduction
  5.2 Motivation
  5.3 Overview of Smoothing Analysis
  5.4 2D Model Problems
  5.5 Local Mode Analysis for Point Gauss-Seidel Relaxation
  5.6 Local Mode Analysis for Line Gauss-Seidel Relaxation
  5.7 Local Mode Analysis for Alternating Line Gauss-Seidel and ILLU Iteration
  5.8 Local Mode Analysis Conclusions
  5.9 Other Iterative Methods Considered for Smoothers
6 VECTOR ALGORITHM CONSIDERATIONS IN TWO DIMENSIONS
  6.1 Cray Hardware Overview
  6.2 Memory Mapping and Data Structures
  6.3 Scalar Temporaries
  6.4 In-Code Compiler Directives
  6.5 Inlining
  6.6 Loop Swapping
  6.7 Loop Unrolling
  6.8 Loops and Conditionals
  6.9 Scalar Operations
  6.10 Compiler Options
  6.11 Some Algorithmic Considerations for Smoothers
    6.11.1 Point Gauss-Seidel Relaxation
    6.11.2 Line Gauss-Seidel Relaxation
  6.12 Coarsest Grid Direct Solver
  6.13 $l_2$-Norm of the Residual
  6.14 2D Standard Coarsening Vector Algorithm
    6.14.1 Coarsening
    6.14.2 Data Structures
    6.14.3 Smoothers
    6.14.4 Coarsest Grid Solver
    6.14.5 Grid Transfer Operators
    6.14.6 Coarse Grid Operators
  6.15 2D Semi-Coarsening Vector Algorithm
    6.15.1 Data Structures
    6.15.2 Coarsening
    6.15.3 Smoothers
    6.15.4 Coarsest Grid Solver
    6.15.5 Grid Transfer Operators
    6.15.6 Coarse Grid Operators
7 2D NUMERICAL RESULTS
  7.1 Storage Requirements
  7.2 Vectorization Speedup
  7.3 2D Computational Work
  7.4 Timing Results for Test Problems
  7.5 Numerical Results for Test Problem 8
  7.6 Numerical Results for Test Problem 9
  7.7 Numerical Results for Test Problem 10
  7.8 Numerical Results for Test Problem 11
  7.9 Numerical Results for Test Problem 13
  7.10 Numerical Results for Test Problem 17
  7.11 Comparison of 2D Black Box Multigrid Methods
8 PARALLEL ALGORITHM CONSIDERATIONS IN TWO DIMENSIONS
  8.1 CM-2 and CM-200 Parallel Algorithms
    8.1.1 Timing Comparisons
  8.2 CM-5 Hardware Overview
  8.3 CM-5 Memory Management
  8.4 Dynamic Memory Management Utilities
  8.5 CM-5 Software Considerations
  8.6 Coarsening and Data Structures in 2D
  8.7 Coarse Grid Operators
  8.8 Grid Transfer Operators
  8.9 Smoothers
    8.9.1 Parallel Line Gauss-Seidel Relaxation
    8.9.2 CM-5 Tridiagonal Line Solver Using Cyclic Reduction
  8.10 Coarsest Grid Solver
  8.11 Miscellaneous Software Issues
    8.11.1 Using Scalapack
    8.11.2 Poly-Shift Communication
  8.12 2D Standard Coarsening Parallel Algorithm
    8.12.1 Data Structures
    8.12.2 Coarsening
    8.12.3 Smoothers
    8.12.4 Coarsest Grid Solver
    8.12.5 Grid Transfer Operators
    8.12.6 Coarse Grid Operators
  8.13 2D Semi-Coarsening Parallel Algorithm
    8.13.1 Data Structures
    8.13.2 Coarsening
    8.13.3 Smoothers
    8.13.4 Coarsest Grid Solver
    8.13.5 Grid Transfer Operators
    8.13.6 Coarse Grid Operators
  8.14 2D Parallel Timings
9 BLACK BOX MULTIGRID IN THREE DIMENSIONS
  9.1 Introduction
    9.1.1 Semi-Coarsening
10 3D DISCRETIZATIONS
  10.1 Finite Difference Discretization
  10.2 Finite Volume Discretization
    10.2.1 Interior Finite Volumes
    10.2.2 Edge Boundary Finite Volumes
    10.2.3 Dirichlet Boundary Condition
    10.2.4 Neumann and Robin Boundary Conditions
11 3D NONSYMMETRIC: GRID TRANSFER OPERATORS
  11.1 3D Grid Transfer Operations
  11.2 3D Nonsymmetric Grid Operator $L^h$: Collapsing Methods
    11.2.1 3D Grid Transfer Operator Variations
  11.3 3D Coarse Grid Operator
12 3D SMOOTHERS
  12.1 Point Gauss-Seidel
  12.2 Line Gauss-Seidel
  12.3 Plane Gauss-Seidel
13 LOCAL MODE ANALYSIS IN THREE DIMENSIONS
  13.1 Overview of 3D Local Mode Analysis
  13.2 Three Dimensional Model Problems
  13.3 Local Mode Analysis for Point Gauss-Seidel Relaxation
  13.4 Local Mode Analysis for Line Gauss-Seidel Relaxation
  13.5 Local Mode Analysis for Plane Gauss-Seidel Relaxation
14 3D VECTOR ALGORITHM CONSIDERATIONS
  14.1 3D Smoother
  14.2 Data Structures and Memory
  14.3 3D Standard Coarsening Vector Algorithm
    14.3.1 Coarsening
    14.3.2 Data Structures
    14.3.3 Smoothers
    14.3.4 Coarsest Grid Solver
    14.3.5 Grid Transfer Operators
    14.3.6 Coarse Grid Operators
  14.4 3D Semi-Coarsening Vector Algorithm
    14.4.1 Data Structures
    14.4.2 Coarsening
    14.4.3 Smoothers
    14.4.4 Coarsest Grid Solver
    14.4.5 Grid Transfer Operators
    14.4.6 Coarse Grid Operators
  14.5 Timing Results for 3D Test Problems
  14.6 Numerical Results for 3D Test Problem 1
  14.7 Numerical Results for 3D Test Problem 2
15 PARALLEL 3D BLACK BOX MULTIGRID
  15.1 3D Standard Coarsening Parallel Algorithm Modifications
  15.2 3D Parallel Smoother
  15.3 3D Data Structures and Communication
  15.4 3D Parallel Timings
APPENDIX
A OBTAINING THE BLACK BOX MULTIGRID CODES
B COMPUTER SYSTEMS USED FOR NUMERICAL RESULTS
  B.1 Cray Y-MP
  B.2 CM-5
BIBLIOGRAPHY
FIGURES
FIGURE
1.1 Standard coarsening: superimposed fine grid $G^h$ and coarse grid $G^H$
1.2 Semi-coarsening: superimposed fine grid $G^h$ and coarse grid $G^H$
1.3 One V-cycle iteration for five grid levels
1.4 One S-cycle iteration for four grid levels
1.5 One W-cycle iteration for four grid levels
1.6 One F-cycle iteration for five grid levels
2.1 Vertex centered finite volume grid
2.2 Cell centered finite volume grid
2.3 Cell centered finite volume $\Omega_{i,j}$
2.4 Vertex centered finite volume $\Omega_{i,j}$ at y = 0
2.5 Southwest boundary corner finite volume
3.1 Standard coarsening interpolation 2D cases
6.1 Cray Y-MP hardware diagram
6.2 Cray CPU configuration
6.3 2D Data Structures
7.1 Comparison of Setup time for BMGNS, SCBMG, and MGD9V
7.2 Comparison of one V-cycle time for BMGNS, SCBMG, and MGD9V
7.3 Domain $\Omega$ for problem 8
7.4 Domain $\Omega$ for problem 9
7.5 Domain $\Omega$ for problem 10
7.6 Domain $\Omega$ for problem 11
7.7 Domain $\Omega$ for problem 13
7.8 Domain $\Omega$ for problem 17
8.1 CM-5 system diagram
8.2 CM-5 processor node diagram
8.3 CM-5 vector unit diagram
8.4 CM-5 processor node memory map
8.5 Grid Data Structure Layout
9.1 Grid operator stencil in three dimensions
11.1 Grid transfer operator's stencil in three dimensions
14.1 3D FSS data structure
TABLES
TABLE
5.1 Smoothing factor $\mu$ for point Gauss-Seidel relaxation for anisotropic diffusion equations
5.2 Smoothing factor $\mu$ for point Gauss-Seidel relaxation for convection-diffusion equations
5.3 Smoothing factor $\mu$ for x- and y-line Gauss-Seidel relaxation for anisotropic diffusion equations
5.4 Smoothing factor $\mu$ for x- and y-line Gauss-Seidel relaxation for convection-diffusion equations
5.5 Smoothing factor $\mu$ for alternating line Gauss-Seidel relaxation and incomplete line LU iteration for anisotropic diffusion equations
5.6 Smoothing factor $\mu$ for alternating line Gauss-Seidel relaxation and incomplete line LU iteration for convection-diffusion equations
6.1 Cray Y-MP Timings for the Naive, Kahan, and Doubling Summation Algorithms
6.2 Sparc5 Timings for the Naive, Kahan, and Doubling Summation Algorithms
7.1 Memory storage requirements for the Cray Y-MP
7.2 Storage requirements for BMGNS, SCBMG, and MGD9V
7.3 Vectorization speedup factors for standard coarsening
7.4 Vectorization speedup factors for semi-coarsening
7.5 Operation count for standard coarsening setup
7.6 Operation count for standard coarsening residual and grid transfers
7.7 Operation count for standard coarsening smoothers
7.8 Timing for standard coarsening on problem 8
7.9 Grid transfer timing comparison for standard and semi-coarsening
7.10 Timing for various smoothers
7.11 Smoothing versus grid transfer timing ratios
7.12 Setup times for the various grid transfers
7.13 V-cycle time for various smoothers
7.14 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 8
7.15 Number of V-cycles for standard coarsening using the sL/L grid transfer for problem 8
7.16 Number of V-cycles for standard coarsening using the hybrid sL/L grid transfer for problem 8
7.17 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 8
7.18 Number of V-cycles for standard coarsening using the operator, L/L, grid transfer for problem 8
7.19 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 8 with ILLU
7.20 Number of V-cycles for standard coarsening using the sL/L grid transfer for problem 8 with ILLU
7.21 Number of V-cycles for standard coarsening using the hybrid sL/L grid transfer for problem 8 with ILLU
7.22 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 8 with ILLU
7.23 Number of V-cycles for standard coarsening using the operator, L/L, grid transfer for problem 8 with ILLU
7.24 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 9
7.25 Number of V-cycles for standard coarsening using the sL/L grid transfer for problem 9
7.26 Number of V-cycles for standard coarsening using the hybrid sL/L grid transfer for problem 9
7.27 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 9
7.28 Number of V-cycles for standard coarsening using the operator, L/L, grid transfer for problem 9
7.29 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 9 with ILLU
7.30 Number of V-cycles for standard coarsening using the sL/L grid transfer for problem 9 with ILLU
7.31 Number of V-cycles for standard coarsening using the hybrid sL/L grid transfer for problem 9 with ILLU
7.32 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 9 with ILLU
7.33 Number of V-cycles for standard coarsening using the operator, L/L, grid transfer for problem 9 with ILLU
7.34 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 9 with 4-direction PGS
7.35 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 10
7.36 Number of V-cycles for standard coarsening using the sL/L grid transfer for problem 10
7.37 Number of V-cycles for standard coarsening using the hybrid sL/L grid transfer for problem 10
7.38 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 10
7.39 Number of V-cycles for standard coarsening using the operator, L/L, grid transfer for problem 10
7.40 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 10
7.41 Number of V-cycles for standard coarsening using the sL/L grid transfer for problem 10
7.42 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 10
7.43 Number of V-cycles for MGD9V on problem 10
7.44 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 11
7.45 Number of V-cycles for standard coarsening using the sL/L grid transfer for problem 11
7.46 Number of V-cycles for standard coarsening using the hybrid sL/L grid transfer for problem 11
7.47 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 11
7.48 Number of V-cycles for standard coarsening using the operator, L/L, grid transfer for problem 11
7.49 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 13
7.50 Number of V-cycles for standard coarsening using the hybrid sL/L grid transfer for problem 13
7.51 Number of V-cycles for standard coarsening using the symmetric grid transfer for problem 13
7.52 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 17
7.53 Number of V-cycles for standard coarsening using the original collapsing method for problem 17
7.54 Number of V-cycles for standard coarsening using the extension of Schaffer's idea for problem 17
7.55 Number of V-cycles for standard coarsening using the hybrid collapsing method for problem 17
7.56 Number of V-cycles for semi-coarsening for problem 17
7.57 Comparison for problem 8 on Cray Y-MP
7.58 Comparison for problem 9 on Cray Y-MP
8.1 Timing Comparison per V-cycle for semi-coarsening on the Cray Y-MP, CM-2, and CM-200
8.2 Timing Comparison per V-cycle for Standard Coarsening on the Cray Y-MP, CM-2, and CM-200
8.3 2D Standard coarsening 32-512 CM-5 nodes V-cycle timings
8.4 2D Standard coarsening 32-512 CM-5 nodes Setup timings
8.5 2D Standard coarsening 32-512 CM-5 nodes parallel efficiency
8.6 2D Semi-coarsening 32-512 CM-5 nodes V-cycle timings
8.7 2D Semi-coarsening 32-512 CM-5 nodes setup timings
8.8 2D Semi-coarsening 32-512 CM-5 nodes parallel efficiency
8.9 2D Timing comparison between CM-5, Cray Y-MP, and Sparc5
13.1 Smoothing factor for point Gauss-Seidel relaxation for anisotropic diffusion equations in 3D
13.2 Smoothing factor for point Gauss-Seidel relaxation for convection-diffusion equations in 3D
13.3 Smoothing factor for point Gauss-Seidel relaxation for convection-diffusion equations in 3D
13.4 Smoothing factor for point Gauss-Seidel relaxation for convection-diffusion equations in 3D
13.5 Smoothing factors for line Gauss-Seidel relaxation for anisotropic diffusion equations
13.6 Smoothing factors for line Gauss-Seidel relaxation for convection-diffusion equations
13.7 Smoothing factors for line Gauss-Seidel relaxation for convection-diffusion equations
13.8 Smoothing factors for line Gauss-Seidel relaxation for convection-diffusion equations
13.9 Smoothing factors for zebra line Gauss-Seidel relaxation for anisotropic diffusion equations
13.10 Smoothing factors for zebra line Gauss-Seidel relaxation for convection-diffusion equations
13.11 Smoothing factors for zebra line Gauss-Seidel relaxation for convection-diffusion equations
13.12 Smoothing factors for zebra line Gauss-Seidel relaxation for convection-diffusion equations
13.13 Smoothing factor $\mu$ for xy-, xz-, and yz-plane Gauss-Seidel relaxation for anisotropic diffusion equations
13.14 Smoothing factor $\mu$ for xy-, xz-, and yz-plane Gauss-Seidel relaxation for convection-diffusion equations
13.15 Smoothing factor $\mu$ for plane Gauss-Seidel (continued)
13.16 Smoothing factor $\mu$ for plane Gauss-Seidel (continued)
13.17 Smoothing factor for zebra xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation for anisotropic diffusion equations
13.18 Smoothing factor for zebra xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation for convection-diffusion equations
13.19 Smoothing factor for zebra plane Gauss-Seidel (continued)
13.20 Smoothing factor for zebra plane Gauss-Seidel (continued)
14.1 3D Multigrid Component Timing
14.2 Grid transfer timing comparison for standard and semi-coarsening
14.3 Timing for various smoothers
14.4 Smoothing versus grid transfer timing ratios
14.5 Numerical results for problem 1 in 3D
14.6 Numerical results for problem 1 in 3D
14.7 Numerical results for problem 1 in 3D
14.8 Numerical results for problem 1 in 3D
14.9 Numerical results for problem 1 in 3D
14.10 Numerical results for problem 1 in 3D
15.1 3D Standard coarsening 32, 64, 128 CM-5 nodes V-cycle timings
15.2 3D Standard coarsening 32, 64, 128 CM-5 nodes Setup timings
15.3 3D Standard coarsening 32, 64, 128 CM-5 nodes parallel efficiency
15.4 3D Semi-coarsening 32, 64, 128 CM-5 nodes V-cycle timings
15.5 3D Semi-coarsening 32, 64, 128 CM-5 nodes setup timings
15.6 3D Semi-coarsening 32, 64, 128 CM-5 nodes parallel efficiency
15.7 3D Timing comparison between CM-5 and Cray Y-MP
ACKNOWLEDGMENTS
I would first like to thank my advisor Joel E. Dendy, Jr., Los Alamos National Laboratory, because without him none of this would have been possible; Thanks! In addition, at Los Alamos National Laboratory, I would like to thank Mac Hyman of Group T-7 and the Center for Nonlinear Studies for their support, and the Advanced Computing Laboratory and the CIC Division for the use of their computing facilities. This work was partially supported by the Center for Research on Parallel Computation through NSF Cooperative Agreement No. CCR-8809615.
I would also like to thank my PhD committee members at UCD, with special thanks to professors Bill Briggs, Stan Payne, Roland Sweet, and Jan Mandel. In addition, I would like to give a big thanks to Dr. Suely B. Oliveira for getting me back on track.
Finally, I would like to thank my mom, Lee Buchanan, my twin brother, Fred Bandy, my wife, Darlene Bandy, and all my friends for all their support and encouragement. Last but not least, a special thanks to Mark and Flavia Kuta, some very good friends, for letting me stay with them while I was in Denver.
On a sadder note, I would like to express the great debt that is owed to Seymour Cray, who died Oct. 5, 1996. This thesis would never have been possible without Seymour Cray's creativity, intelligence, and drive to conceive and develop supercomputers. His contributions to computing have changed the face of science and engineering, and we will all miss him dearly.
CHAPTER 1
INTRODUCTION
1.1 Summary
The subject of this dissertation is the investigation of Black Box multigrid solvers for the numerical solution of second order elliptic partial differential equations in two or three dimensional domains. We place particular emphasis on efficiency on both vector and parallel computers, represented here by the Cray Y-MP and the Thinking Machines CM-5.
Black Box multigrid methods are sometimes referred to in the literature as geometric multigrid methods or, more recently, as automatic multigrid methods. The methods can be considered a subclass of algebraic multigrid methods with several algorithmic restrictions. Geometric multigrid methods make a priori assumptions about the domain and the class of problems to be solved, and in addition they choose the intergrid operators and coarse grid points based on the geometry and the order of the grid equation operator. Algebraic multigrid, on the other hand, chooses both the coarse grid and the intergrid operators based only on the coefficient matrix. Black box multigrid lies between these two: the grids are chosen geometrically, on logically rectangular grids, and the intergrid operators are chosen algebraically. There are other hybrid multigrid methods, such as the unstructured grid method of Chan [22], which chooses the coarse grid based on graph theoretical considerations and the intergrid operator from the nodal coordinates (geometry), and the algebraic multigrid method of Vaněk
[81], which uses kernels of the associated quadratic form in lieu of geometrical information. The algebraic multigrid method of Stüben and Ruge [66] [67] uses almost the same construction of the intergrid operators as Dendy [26] once the coarse grid has been chosen, while Vaněk's work is based on a different idea. The assumptions and the components that make up the black box multigrid methods are spelled out in more detail in the following sections of this chapter.
We will examine the development of robust black box multigrid solvers using both standard and semi-coarsening. The methods are aimed at the solution of convection-diffusion equations with anisotropic and discontinuous coefficients (interface problems), such that the discrete system of equations need only be specified on a logically rectangular grid. A guiding principle in the design is that if the discrete system of equations is symmetric, then the multigrid coarse grid problems should preserve that symmetry.
1.1.1 Previous Results
The black box multigrid method was first introduced by Dendy [26]. The method is a practical implementation of a multigrid method for symmetric diffusion problems with anisotropic and discontinuous coefficients, represented by
$$-\nabla \cdot (D \nabla U) + cU = f \quad \text{on } \Omega \subset \mathbb{R}^2. \tag{1.1}$$
The domain $\Omega$ is assumed to be embedded in a logically rectangular mesh and then discretized in such a manner as to yield a stencil which is no larger than a compact 9-point stencil. The method employs the Galerkin coarse grid approximation, $L^H = I_h^H L^h I_H^h$, to form the coarse grid operators, using the robust choice of grid transfer operators from Alcouffe et al. [1]. The robust choice of grid transfer operators is an operator induced formulation that, when $c = 0$, preserves the flux $\mu \cdot (D \nabla U)$ across interfaces. In [1], lexicographic point Gauss-Seidel relaxation and alternating
lexicographic line Gauss-Seidel relaxation were the choices available for smoothers. In subsequent extensions for vector machines, the choices available were red/black (or four color for nine point operators) point Gauss-Seidel and alternating red/black line Gauss-Seidel relaxation.
The black box multigrid method was extended to nonsymmetric elliptic convection-diffusion problems [27], for which the model problem is
$$-\epsilon \Delta U + U_x + U_y = f \quad \text{on } \Omega \subset \mathbb{R}^2, \tag{1.2}$$
where $\epsilon > 0$. The mesh is the same as before, and the discretization is of the form
$$L^h U_{i,j} = -\beta h \Delta^h U_{i,j} + D_0^{x,h} U_{i,j} + D_0^{y,h} U_{i,j} = F_{i,j}, \tag{1.3}$$
where
$$\Delta^h U_{i,j} = \frac{1}{h^2}\left(U_{i,j-1} + U_{i-1,j} - 4U_{i,j} + U_{i+1,j} + U_{i,j+1}\right),$$
$$D_0^{x,h} U_{i,j} = \frac{1}{2h}\left(U_{i+1,j} - U_{i-1,j}\right), \qquad D_0^{y,h} U_{i,j} = \frac{1}{2h}\left(U_{i,j+1} - U_{i,j-1}\right),$$
and where $\beta = \tfrac{1}{2}$ yields upstream differencing. A generalization of Galerkin coarse grid approximation is used to form the coarse grid operators. The prolongation operators are formed in the same way as they were for the symmetric method, but instead of being induced by $L^h$, they are induced by the symmetric part of the grid operator, $\mathrm{symm}(L^h) = \tfrac{1}{2}\left((L^h)^* + L^h\right)$. It was found that instead of using $I_h^H = (I_H^h)^*$ to induce the restriction operator, a more robust choice is to form a new interpolation operator $J_H^h$ based on $(L^h)^*$ and then to define the restriction operator to be $I_h^H = (J_H^h)^*$. These choices were made to generalize the work of [26]. The choice of smoothers was also changed to include lexicographic point, line, and alternating line Kaczmarz relaxation.
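The role of $\beta$ in the discretization (1.3) can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from the thesis: the function name `convection_diffusion_stencil` and the dictionary layout are assumptions made for illustration. It assembles the five-point stencil of $-\beta h \Delta^h + D_0^{x,h} + D_0^{y,h}$ on a uniform mesh of width $h$ and shows that for $\beta = 1/2$ the east and north weights vanish, which leaves first order upstream differences in both directions.

```python
def convection_diffusion_stencil(beta, h):
    """Five-point stencil weights of -beta*h*Delta^h + D0^{x,h} + D0^{y,h}.

    Hypothetical helper for illustration; assumes a uniform mesh width h
    and unit convection in both x and y, as in model problem (1.2).
    """
    diffusion = beta / h            # beta*h times the 1/h^2 of the Laplacian
    convection = 1.0 / (2.0 * h)    # weight of the central difference
    return {
        "center": 4.0 * diffusion,
        "west": -diffusion - convection,
        "east": -diffusion + convection,
        "south": -diffusion - convection,
        "north": -diffusion + convection,
    }


if __name__ == "__main__":
    # With beta = 1/2 the downstream (east, north) weights are zero, so the
    # stencil is (U_ij - U_{i-1,j})/h + (U_ij - U_{i,j-1})/h.
    print(convection_diffusion_stencil(0.5, 0.25))
```

Comparing (1.2) with (1.3) suggests that $\beta h$ plays the role of $\epsilon$, so a small $\beta$ corresponds to a convection dominated discretization.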
The method performed well for the problems tested as long as $\beta \ge 0.25$, but since nonphysical oscillations begin to dominate for $\beta < 0.25$, this restriction poses no difficulty.
The next development was the creation of a 3D black box multigrid solver for symmetric problems [29]. This method uses the same type of grid transfer operators as the earlier 2D symmetric method. Two different methods of forming the coarse grid operators were examined, with nearly identical convergence results. The first method uses Galerkin coarse grid approximation with standard coarsening. The second method also uses Galerkin coarse grid approximation, but it does so by using auxiliary intermediate grids obtained by semi-coarsening successively in each of the three independent variables. For robustness, alternating red/black plane Gauss-Seidel relaxation was used for the smoother. The plane solves of the smoother were performed by using the 2D symmetric black box multigrid solver.
The 2D symmetric black box multigrid solver was then extended to solve singular and periodic diffusion problems [30]. The existence of a solution, in case $c = 0$, is assured by requiring that the equation be consistent: $\int_\Omega F = 0$. The periodic boundary conditions only impact the multigrid method by requiring the identification of the auxiliary grid point equations at setup, the identification of the auxiliary grid point unknowns after interpolation, and the identification of the auxiliary grid point residuals before restriction. The coarsest grid problem, if $c = 0$, is singular and cannot be solved by Gaussian elimination, but since the solution is determined only up to a constant, the arbitrary addition of the linearly independent condition that $U_{i,j} = 0$ for some coarse grid point $(i, j)$ allows solution by Gaussian elimination.
The first semi-coarsening black box multigrid solver was introduced for the solution of three dimensional petroleum reservoir simulations [33].
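The treatment of the singular coarsest grid problem described above for [30] can be illustrated on a small example. The sketch below is an assumption-laden illustration rather than the thesis code: it uses a one dimensional periodic Laplacian in place of the 2D coarsest grid operator, and the helper names `periodic_laplacian`, `pin_unknown`, and `gauss_solve` are hypothetical. One equation is replaced by the condition that a chosen unknown is zero, after which ordinary Gaussian elimination applies; because the right-hand side is consistent (it sums to zero), the discarded equation is still satisfied.

```python
def periodic_laplacian(n):
    """Singular n x n matrix: 1D periodic Laplacian (rows sum to zero)."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 2.0
        a[i][(i - 1) % n] -= 1.0
        a[i][(i + 1) % n] -= 1.0
    return a


def pin_unknown(a, b, k=0):
    """Return copies of (a, b) with equation k replaced by u[k] = 0."""
    a2 = [row[:] for row in a]
    b2 = b[:]
    a2[k] = [1.0 if j == k else 0.0 for j in range(len(b))]
    b2[k] = 0.0
    return a2, b2


def gauss_solve(a, b):
    """Gaussian elimination with partial pivoting on a dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


if __name__ == "__main__":
    a = periodic_laplacian(6)
    f = [1.0, -1.0, 2.0, -2.0, 0.5, -0.5]   # consistent: entries sum to zero
    u = gauss_solve(*pin_unknown(a, f))
    print(u)
```

The pinned solution differs from any other solution only by a constant, which is consistent with the statement that the solution is determined only up to a constant.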
The solver of [33] employs semi-coarsening in the z-direction and xy-plane relaxation for the smoother. Galerkin coarse grid approximation is used to form the coarse grid operators. Operator induced
grid transfer operators were used, but only after Schaffer's paper [70] was it realized how to compute these in a robust manner; see section 3.6.
A two dimensional black box multigrid solver called MGD9V was developed by de Zeeuw [24]. This method was designed to solve the general elliptic convection-diffusion equation. The method used standard coarsening, an ILLU smoother, a V(0,1)-cycle (sawtooth), and a new set of operator induced grid transfer operators that were designed specifically for convection dominated problems. The method was found to be more robust than previous methods but was still divergent for problems with closed convection characteristics on large grids. MGD9V was developed only for two dimensions and is not parallelizable.
The 2D symmetric black box multigrid solvers [26] [30] were updated by Bandy [9] to be portable, to have consistent user interfaces, and to adhere to the SLATEC software guidelines [38]; three new user interfaces were also provided. One of the interfaces included an automatic discretization routine, requiring the user to provide only a function which can evaluate the coefficients at the fine grid points. The interfaces all included extensive input parameter validation and memory management for workspace.
A parallel version of the semi-coarsening method for two dimensional scalar problems for the CM-2 was presented in [32]. A parallel version of semi-coarsening for two- and three-dimensional problems was presented in [75]. Both papers essentially relied on the algorithm from [33] and borrowed from Schaffer [69] [70] for the robust determination of grid transfer operators.
Fourier mode analysis has been used by many multigrid practitioners to find good smoothers for use in multigrid methods. The results of many of these analyses have been presented in the literature. Stüben and Trottenberg [78] present several fundamental results of Fourier mode analysis for a few selected 2D problems.
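To make the idea of a Fourier (local mode) smoothing analysis concrete, the following is a minimal sketch for one textbook case; it is an illustration under stated assumptions and not one of the analyses reported in this thesis. The model operator $-\epsilon U_{xx} - U_{yy}$ with the standard five-point stencil, the anisotropy parameter `eps`, and the function names are all assumptions. For lexicographic point Gauss-Seidel, the amplification factor of the mode $e^{i(\theta_1 j + \theta_2 k)}$ is $(\epsilon e^{i\theta_1} + e^{i\theta_2}) / (2\epsilon + 2 - \epsilon e^{-i\theta_1} - e^{-i\theta_2})$, and the smoothing factor is its maximum modulus over the high frequencies for standard coarsening, $\max(|\theta_1|, |\theta_2|) \ge \pi/2$.

```python
import cmath
import math


def gs_amplification(theta1, theta2, eps=1.0):
    """Modulus of the lexicographic point Gauss-Seidel symbol for the
    five-point stencil of -eps*U_xx - U_yy (assumed model problem)."""
    num = eps * cmath.exp(1j * theta1) + cmath.exp(1j * theta2)
    den = (2.0 * eps + 2.0
           - eps * cmath.exp(-1j * theta1) - cmath.exp(-1j * theta2))
    return abs(num / den)


def smoothing_factor(eps=1.0, n=64):
    """Sample the symbol on an n x n grid of frequencies in [-pi, pi)^2 and
    return the largest modulus over the high frequency modes."""
    mu = 0.0
    for a in range(n):
        t1 = -math.pi + 2.0 * math.pi * a / n
        for b in range(n):
            t2 = -math.pi + 2.0 * math.pi * b / n
            # high frequencies with respect to standard coarsening
            if max(abs(t1), abs(t2)) >= math.pi / 2 - 1e-12:
                mu = max(mu, gs_amplification(t1, t2, eps))
    return mu


if __name__ == "__main__":
    print(smoothing_factor(1.0))    # isotropic case, close to the known 0.5
    print(smoothing_factor(0.01))   # strong anisotropy, close to 1
```

The sampled value for the isotropic case approaches the well known smoothing factor of 0.5 from below, while the anisotropic case illustrates why point relaxation alone is not robust for anisotropic problems.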
Kettler [50] reports results for a range of 2D test problems and several lexicographically ordered
Gauss-Seidel methods along with several variations of ILU methods. Wesseling [84] reports a summary of smoothing analysis results for the 2D rotated anisotropic diffusion equation and the convection-diffusion equation; however, the results cover only a limited number of worst case problems. Smoothing analysis results for the red/black ordered methods appear in many places in the literature, but only for a few selected problems. There are some results in the literature for 3D problems [79], but, as with the 2D results, the analysis is not complete enough for our purposes.
1.1.2 New Contributions
In this thesis we have developed and extended several black box multigrid methods for both two and three dimensional nonsymmetric problems on sequential, vector, and parallel computing platforms. The new methods are based on a new implementation of the two dimensional nonsymmetric black box multigrid method [27] for vector computers. The new implementation was designed to take better advantage of developments in vector computing, while increasing portability and compatibility with sequential computers. The new implementation achieves a speedup factor of six over the earlier methods on vector computers, while providing identical functionality, and it also incorporates many of the ideas and software features from [9].
The new methods include the development of a three dimensional method, in both vector and parallel versions, and a two dimensional parallel method for nonsymmetric problems. The new methods were also extended to handle periodic and singular problems using the modifications from [30].
In [27] a two dimensional nonsymmetric black box multigrid method was examined for a convection dominated problem with constant convection characteristics. In this work we investigate the new methods for a general convection-diffusion equation
$$-\nabla \cdot (D(x) \nabla U(x)) + b(x) \cdot \nabla U(x) + c(x) U(x) = f(x), \quad x \in \Omega. \tag{1.4}$$
When the earlier method of [27] was applied to equation (1.4), but with more vectorizable smoothers than those in [27], it was found to perform poorly, and even to fail, for some problems with nonconstant convection characteristics. This poor performance was caused both by the new smoothers and by poor coarse grid correction. Several new grid transfer operators are introduced to address these problems, of which two were found to be robust; see chapter 3. The search for a more robust smoother was facilitated by local mode analysis and led to the implementation of an incomplete line LU factorization method (ILLU) for the smoother. The ILLU smoother made the new methods more robust for convection dominated problems. A four-direction point Gauss-Seidel method was also briefly considered for use as a smoother but was discarded because it was neither parallelizable nor suitable for anisotropic problems, even though it was fairly robust for convection dominated problems.
A nonsymmetric black box multigrid method, using standard coarsening, was created for three dimensional problems; previously only a semi-coarsening version existed. The new method is the three dimensional analogue of the new two dimensional black box multigrid method, and it uses alternating red/black plane Gauss-Seidel as a smoother for robustness. The 3D smoother uses one V(1,1)-cycle of the 2D nonsymmetric black box multigrid method to perform the required plane solves. The new method was developed to use either the new grid transfer operators from the new 2D nonsymmetric method or those from the 3D extension of Dendy's 2D nonsymmetric black box multigrid method. The coarse grid operators are formed using the second method from [29], which uses auxiliary intermediate grids obtained by successively applying semi-coarsening in each of the independent variables. In addition, the new method is designed to handle periodic and singular problems.
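The Galerkin coarse grid approximation $L^H = I_h^H L^h I_H^h$ that underlies the coarse grid operators can be sketched in one dimension. This is a simplified illustration, not the thesis algorithm: plain linear interpolation stands in for the operator induced prolongation, the restriction is taken to be the transpose of the prolongation, dense lists replace stencil storage, and all function names are hypothetical. The example also checks the guiding principle stated earlier, that a symmetric fine grid operator produces a symmetric coarse grid operator.

```python
def diffusion_matrix(d):
    """Symmetric tridiagonal matrix for -(d u')' on len(d)-1 interior points;
    d[i] is the (assumed given) diffusion coefficient on cell face i."""
    n = len(d) - 1
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = d[i] + d[i + 1]
        if i + 1 < n:
            a[i][i + 1] = -d[i + 1]
            a[i + 1][i] = -d[i + 1]
    return a


def linear_prolongation(nc):
    """(2*nc+1) x nc linear interpolation; coarse point j coincides with
    fine point 2*j+1. A stand-in for operator induced interpolation."""
    p = [[0.0] * nc for _ in range(2 * nc + 1)]
    for j in range(nc):
        i = 2 * j + 1
        p[i][j] = 1.0
        p[i - 1][j] += 0.5
        p[i + 1][j] += 0.5
    return p


def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def galerkin(a, p):
    """Coarse operator R*A*P with restriction R taken as the transpose of P."""
    r = [list(col) for col in zip(*p)]
    return matmul(r, matmul(a, p))


def is_symmetric(m, tol=1e-12):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol
               for i in range(n) for j in range(n))


if __name__ == "__main__":
    fine = diffusion_matrix([1.0] * 4 + [1000.0] * 4)  # jump in coefficient
    coarse = galerkin(fine, linear_prolongation(3))
    print(coarse, is_symmetric(coarse))
```

Linear interpolation ignores the jump in the coefficient; the operator induced grid transfer operators discussed in chapter 3 are meant to address exactly that weakness.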
Another use of local mode analysis was in the design of robust three dimensional smoothers. Although
there are hints in the literature for how to perform local mode analysis for color relaxation in three dimensions, we are unaware of the appearance elsewhere of the detailed analysis presented in chapter 13.

The new methods are compared to a new implementation of the semicoarsening method, which has a speedup factor of over 5 for the two dimensional method and a speedup factor of 2 for the three dimensional method on vector computers. The grid transfer operators are based on Schaffer's idea; see chapter 3. The 2D semicoarsening method uses coarsening in the y-direction coupled with red/black x-line Gauss-Seidel relaxation for the smoother. The 3D semicoarsening method uses coarsening in the z-direction coupled with red/black xy-plane Gauss-Seidel relaxation for the smoother. The new implementation also includes the ILLU smoother, not present in the original version.

Another aspect of this work was to compare de Zeeuw's MGD9V with the black box multigrid methods. The idea was to mix and match components of the two approaches to investigate their strengths and weaknesses and to ascertain whether a combination existed that was better than either. The result obtained from studying the algorithm components is that MGD9V obtains its robustness from the ILLU smoother and not from its grid transfer operators. If MGD9V uses alternating red/black line Gauss-Seidel for its smoother, then performance similar to that of the black box multigrid methods is observed. Likewise, if ILLU is used as the smoother in the black box multigrid methods, then the performance is similar to that of MGD9V.

Parallel versions of the standard coarsening nonsymmetric black box multigrid methods are developed in this thesis and compared with the existing parallel version of the semicoarsening black box method. The smoother of the 3D parallel version uses a modified 2D nonsymmetric black box multigrid method to perform the simultaneous solution of all the planes of a single color.
A hybrid parallel black box multigrid method was developed that uses standard coarsening for grid levels with a VP (virtual processor) ratio, i.e. number of grid points per processor, greater than one, and semicoarsening when the VP ratio is less than one. When the VP ratio is greater than one, standard coarsening reduces the number of grid points per processor, and hence the amount of serial work, faster than semicoarsening does. When the VP ratio is less than one, the semicoarsening method is more efficient than standard coarsening because it keeps busy more processors that would otherwise be idle; in addition, tridiagonal library routines, which are more efficient than we can write, are available for the data structures. The hybrid parallel method is the most efficient method on the CM-5 because it uses the most computationally efficient method for a given VP ratio.

1.2 Class of Problems

The class of problems being addressed is convection-diffusion equations with anisotropic and discontinuous coefficients on a two- or three-dimensional domain. These types of problems can be represented by the following equation and boundary conditions,

\[ L(\mathbf{x}) = -\nabla\cdot(D(\mathbf{x})\nabla U(\mathbf{x})) + b(\mathbf{x})\cdot\nabla U(\mathbf{x}) + c(\mathbf{x})U(\mathbf{x}) = f(\mathbf{x}), \quad \mathbf{x}\in\Omega, \tag{1.5} \]

\[ \nu(\mathbf{x})\cdot D(\mathbf{x})\nabla U(\mathbf{x}) + \gamma(\mathbf{x})U(\mathbf{x}) = 0, \quad \mathbf{x}\in\partial\Omega, \tag{1.6} \]

on a bounded domain $\Omega \subset \mathbb{R}^d$ with boundary $\partial\Omega$, where $d$ is either 2 or 3, $\mathbf{x} = (x, y)$ or $(x, y, z)$, and $D(\mathbf{x}) = (D_1, D_2)$ or $(D_1, D_2, D_3)$, respectively. The term $\nu(\mathbf{x})$ is the outward normal vector. It is assumed that $D(\mathbf{x}) > 0$, $c(\mathbf{x}) \ge 0$, and $\gamma(\mathbf{x}) \ge 0$ to ensure that upon discretization we end up with a positive definite system of equations. Anisotropies are also allowed, e.g. if $\Omega \subset \mathbb{R}^2$ we have $D = (D_1, D_2)$, where it is possible that $D_1 \gg D_2$ in some subregion(s) while $D_1 \ll D_2$ in other subregion(s). In addition,
$D(\mathbf{x})$, $c(\mathbf{x})$, and $f(\mathbf{x})$ are allowed to be discontinuous across internal boundaries $\Gamma \subset \Omega$. Moreover, let $\mu(\mathbf{x})$ be a normal vector at $\mathbf{x} \in \Gamma$; then it is natural to assume also that

\[ U \text{ and } \mu\cdot(D\nabla U) \text{ are continuous at } \mathbf{x} \text{ for almost every } \mathbf{x}\in\Gamma. \tag{1.7} \]

The "almost every" is necessary to exclude juncture points of $\Gamma$, that is, points where two pieces of $\Gamma$ intersect and the continuity of $\mu\cdot(D\nabla U)$ does not make any sense.

The boundary conditions permitted in (1.6) can be of three types: Dirichlet, Neumann, and mixed. The periodic boundary condition is not considered, but can be handled by making a few adjustments and modifications to the black box multigrid codes. It should be noted that, for a problem with pure Neumann boundary conditions, a finite difference (volume or element) discretization may lead to a singular system of equations; the singularity can be propagated to the coarsest grid level and cause trouble for the direct solver, but a minor modification to the code circumvents this difficulty, allowing solution of the coarsest grid level problem.

1.3 Discretization of the Problem

Let the continuous problem represented by equation (1.5) be written in operator notation as

\[ Lu = f \quad \text{in } \Omega. \tag{1.8} \]

The following discussion is valid for both two and three dimensions, but only the two dimensional case is presented. Suppose that, for all $\mathbf{x} = (x, y) \in \Omega$, $a_x \le x \le b_x$ and $a_y \le y \le b_y$. Let $G^h$ define a rectangular grid on $[a_x, b_x] \times [a_y, b_y]$, partitioned with

\[ a_x = x_1 < x_2 < \cdots < x_{n_x} = b_x, \qquad a_y = y_1 < y_2 < \cdots < y_{n_y} = b_y, \tag{1.9} \]

and let the grid spacings be defined as (1.10)
Then the rectangular grid, $G^h$, is defined as (1.11) with the domain grid, $\Omega^h$, being defined as (1.12)

Before the discrete grid problem is defined we should first address the issue of domains with irregular boundaries. The black box multigrid solvers in two dimensions are intended to solve equation (1.8) on logically rectangular grids, but for simplicity, we consider only rectangular grids. An irregularly shaped domain can be embedded in the smallest possible rectangular grid $G^h$, so that $\Omega^h \subset G^h$. The problem is then discretized on $\Omega^h$, avoiding any coupling to the grid points not in $\Omega^h$. For grid points outside of $\Omega^h$, $\mathbf{x}^h \in G^h \setminus \Omega^h$, which are considered to be fictitious points, an arbitrary equation is introduced, such as $c_{i,j} u_{i,j} = f_{i,j}$, where $c_{i,j} \ne 0$ and $f_{i,j}$ are arbitrary. The problem is now rectangular and the solution to the discrete equations can be obtained at the points in the domain, while the solution $u_{i,j} = f_{i,j}/c_{i,j}$ is obtained for the other points. Problems with irregular domains in three dimensions can be handled in a similar fashion for a cuboid box grid.

Now the discrete grid problem approximating the continuous problem (1.8) can be written as

\[ L^h u^h = f^h, \tag{1.13} \]

where the superscript $h$ refers to discretization with grid spacing $h$. Note that, for irregular domains, the discrete solution $u^h(\mathbf{x})$ makes sense only for $\mathbf{x} \in \Omega^h$; $u^h(\mathbf{x})$, for $\mathbf{x} \in G^h \setminus \Omega^h$, is arbitrary.

We consider only discrete operators $L^h$ on rectangular grids that can be described by 5-point or 9-point box stencils. Suppose we discretize the equation (1.5)
using five points at the grid point $(x_i, y_j)$, (1.14)

We use stencil notation to represent the 5- and 9-point cases, respectively:

\[ \begin{bmatrix} & N & \\ W & C & E \\ & S & \end{bmatrix}^h, \qquad \begin{bmatrix} NW & N & NE \\ W & C & E \\ SW & S & SE \end{bmatrix}^h \tag{1.15} \]

where the stencil represents the coefficients for the discrete equation at the grid point $(x_i, y_j)$ on grid $G^h$. The subscripts $i, j$ can be dropped and it will be understood that the stencil is centered at the grid point $(x_i, y_j)$. The superscript $h$ can also be dropped when the mesh spacing is clear from the context. The stencils are valid over the entire grid, including the boundary points, because the coefficients are allowed to be zero. Hence, any coefficients that reach out of the domain can be set to zero. Clearly, the 5-point stencil is a special case of the 9-point stencil, where the NW, NE, SW, and SE coefficients are set to zero.

We illustrate the stencil notation for Poisson's equation on a square domain in two dimensions,

\[ Lu(x, y) = -u_{xx}(x, y) - u_{yy}(x, y) = f(x, y), \quad (x, y) \in \Omega = (0, 1)^2, \tag{1.16} \]

using 5- and 9-point finite difference discretizations. The 5-point stencil for the operator $L$, using a central finite difference discretization on a uniform grid with grid spacing $h = 1/N$ for $N = n_x = n_y$, is

\[ L^h = \frac{1}{h^2} \begin{bmatrix} & -1 & \\ -1 & 4 & -1 \\ & -1 & \end{bmatrix}^h. \tag{1.17} \]
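As a quick check of the 5-point stencil (1.17), the sketch below applies it to samples of $u = x^2 + y^2$, for which $-u_{xx} - u_{yy} = -4$ holds exactly and central differences reproduce the value without truncation error. The function name `apply_five_point` and the list-of-lists grid layout are illustrative assumptions, not part of the black box multigrid codes.

```python
def apply_five_point(u, h):
    """Apply (1/h^2) [ -1 ; -1 4 -1 ; -1 ] at the interior points of a square grid.

    u is a list of lists holding grid values u[i][j] at (i*h, j*h);
    boundary entries of the result are left at zero.
    """
    n = len(u) - 1  # grid points are indexed 0..n in each direction
    out = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n):
        for j in range(1, n):
            out[i][j] = (4.0 * u[i][j]
                         - u[i - 1][j] - u[i + 1][j]
                         - u[i][j - 1] - u[i][j + 1]) / (h * h)
    return out


n = 8
h = 1.0 / n
# Sample u(x, y) = x^2 + y^2, so that -u_xx - u_yy = -4 everywhere.
u = [[(i * h) ** 2 + (j * h) ** 2 for j in range(n + 1)] for i in range(n + 1)]
Lu = apply_five_point(u, h)
```

Every interior entry of `Lu` should equal -4 up to rounding, which confirms the sign convention of the stencil: a positive center coefficient and negative neighbors for the operator $-\Delta$.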
One 9-point discretization for $L$ in (1.16) has the stencil

\[ L^h = \frac{1}{6h^2} \begin{bmatrix} -1 & -4 & -1 \\ -4 & 20 & -4 \\ -1 & -4 & -1 \end{bmatrix}^h. \tag{1.18} \]

Many types of discretization can be considered: central finite differences, upstream finite differences, finite volumes, finite elements, etc.

The black box multigrid solvers actually allow for more general meshes than just the rectangular grids shown so far. The only requirement is that the mesh be logically rectangular. In two dimensions the logically rectangular grid $G$ can be defined as

\[ G = \{ (x(i,j), y(i,j)) : 1 \le i \le n_x,\ 1 \le j \le n_y \}, \tag{1.19} \]

where the grid cell formed by $(x(i,j+1), y(i,j+1))$, $(x(i+1,j+1), y(i+1,j+1))$, $(x(i,j), y(i,j))$, and $(x(i+1,j), y(i+1,j))$ has positive area for $1 \le i < n_x$, $1 \le j < n_y$.

The black box multigrid solvers which we consider require the discretization to be represented by a 9-point box stencil. However, the fact that a problem has a 9-point box stencil does not mean that it can be solved by the black box multigrid methods presented in this thesis. Whether it can depends on a number of problem dependent factors, which we attempt to investigate in this thesis.

1.4 Multigrid Overview

A two level multigrid method is presented first to illustrate the basic components and underlying ideas that will be expanded into the classical multigrid method.
Figure 1.1. Standard coarsening. Superimposed fine grid $G^h$ and coarse grid $G^H$, where the marked points indicate the coarse grid points in relation to the fine grid $G^h$.

Suppose that we have a continuous problem of the form

\[ Lu(x, y) = f(x, y), \tag{1.20} \]

where $L$ is a linear positive definite operator defined on an appropriate set of functions in $(0, 1)^2 = \Omega \subset \mathbb{R}^2$. Let $G^h$ and $G^H$ be two uniform grids for the discretization of $\Omega$; then

\[ G^h = \{ (x, y) \in \Omega : (x, y) = (ih, jh),\ i, j = 0, \ldots, n \} \tag{1.21} \]

and

\[ G^H = \{ (x, y) \in \Omega : (x, y) = (iH, jH) = (i\,2h, j\,2h),\ i, j = 0, \ldots, \tfrac{n}{2} \}, \tag{1.22} \]

where the number of grid cells $n$ on $G^h$ is even with grid spacing $h = 1/n$, and where grid $G^H$ has $n/2$ grid cells with grid spacing $H = 2h$. The coarse grid $G^H$ is often referred to as a standard coarsening of $G^h$; see figure 1.1. However, this choice is not the only one possible. Another popular choice is semicoarsening, which coarsens in only one dimension; see figure 1.2. For the overview, only standard coarsening will be used.
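The difference between the two coarsening choices can be made concrete by listing which fine grid indices survive on the coarse grid. The following is a minimal sketch under the assumption of a uniform grid with points indexed $0, \ldots, n$ in each direction; the function names are hypothetical, and the semicoarsening variant is arbitrarily chosen to coarsen in $y$.

```python
def standard_coarsening(nx, ny):
    """Fine grid indices (i, j) kept when both directions are coarsened (H = 2h)."""
    return [(i, j) for i in range(0, nx + 1, 2) for j in range(0, ny + 1, 2)]


def semicoarsening_y(nx, ny):
    """Fine grid indices (i, j) kept when only the y-direction is coarsened."""
    return [(i, j) for i in range(0, nx + 1) for j in range(0, ny + 1, 2)]


n = 8  # number of fine grid cells per direction (even)
fine_points = (n + 1) ** 2
standard = standard_coarsening(n, n)
semi = semicoarsening_y(n, n)
```

For `n = 8` the fine grid has 81 points, standard coarsening keeps 25 of them, and semicoarsening in $y$ keeps 45, so standard coarsening reduces the problem size roughly twice as fast per level in two dimensions.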
Figure 1.2. Semicoarsening. Superimposed fine grid $G^h$ and coarse grid $G^H$, where the marked points indicate the coarse grid points in relation to the fine grid $G^h$.
The discrete problems now take the form

\[ L^h u^h = f^h \quad \text{on } G^h \tag{1.23} \]

and

\[ L^H u^H = f^H \quad \text{on } G^H. \tag{1.24} \]

We refer to $L^h$ and $L^H$ as the fine and coarse grid operators respectively. The grid operators are positive definite, linear operators

\[ L^h : G^h \to G^h \tag{1.25} \]

and

\[ L^H : G^H \to G^H. \tag{1.26} \]

Let $U^h$ be an approximation to $u^h$ from equation (1.23). Denote the error $e^h$ by

\[ e^h = u^h - U^h; \tag{1.27} \]

thus $e^h$ can also be regarded as a correction to $U^h$. The residual (defect) of equation (1.23) is given by

\[ r^h = f^h - L^h U^h. \tag{1.28} \]

The defect equation (error-residual equation) on grid $G^h$,

\[ L^h e^h = r^h, \tag{1.29} \]

is equivalent to the original fine grid equation (1.23). The defect equation and its approximation play a central role in the development of a multigrid method.

The fine grid equation (1.23) can be approximately solved using an iterative method such as Gauss-Seidel. The first few iterations reduce the error quickly, but then the reduction in the error slows down for subsequent iterations. The slowing down in
the reduction of the error after the initial quick reduction is a property of most regular splitting methods and of most basic iterative methods. These methods reduce the error associated with high frequency (rough) components of the error quickly, but the low frequency (smooth) components are reduced very little. Hence, the methods seem to converge quickly for the first few iterations, as the high frequency error components are eliminated, but then the convergence rate slows down towards its asymptotic value as the low frequency components are slowly reduced. The idea behind the multigrid method is to take advantage of this behavior in the reduction of the error components.

The point is that a few iterations of the relaxation method on $G^h$ effectively eliminate the high frequency components of the error. Further relaxation on the fine grid results in little gain towards approximating the solution. However, the smooth components of the error on the fine grid are high frequency components with respect to the coarse grid. So, let us project the defect equation, since it is the error that we are interested in resolving, onto the coarse grid from the fine grid. This projection is done by using a restriction operator to project the residual, $r^h$, onto the coarse grid, where we can form a new defect equation

\[ L^H v^H = I_h^H r^h, \tag{1.30} \]

where $I_h^H$ is the restriction operator. We can now solve this equation for $v^H$. Having done so, we can project the solution back up to the fine grid with a prolongation (interpolation) operator, $I_H^h$, and correct the solution on the fine grid, $G^h$,

\[ U^h \leftarrow U^h + I_H^h v^H. \tag{1.31} \]

We call this process (of projecting the error from the coarse grid to the fine grid and correcting the solution there) the coarse grid correction step. The process of projecting the error from a coarse grid to a fine grid introduces high frequency errors. The high
frequencies introduced by prolongation can be eliminated by applying a few iterations of a relaxation scheme. The relaxation scheme can be applied to the projection of the error, $I_H^h v^H$, or to the approximation to the solution, $U^h$, after the correction. It is desirable to apply the relaxation to $U^h$ instead of $I_H^h v^H$ since then additional reduction of the smooth components of the error in the solution may be obtained.

The projection operator from the fine grid to the coarse grid is called the restriction operator, while the projection operator from the coarse grid to the fine grid is called the prolongation operator or, interchangeably, the interpolation operator. These two operators are referred to as the grid transfer operators.

In the two level scheme just described, it can be seen that the coarse grid problem is the same, in form, as the fine grid problem, with $u^h$ and $f^h$ being replaced by $v^H$ and $f^H = I_h^H r^h$ respectively. We can now formulate the classical multigrid method by applying the above two level scheme recursively. In doing so, we no longer solve the coarse grid defect equation exactly. Instead, we use the relaxation scheme on the coarse grid problem, where the smooth (low) frequencies from the fine grid now appear as higher frequencies with respect to the coarse grid. The relaxation scheme then effectively reduces these error components. The coarse grid problem now looks like the fine grid problem, and we can project the coarse grid residual to an even coarser grid where a new defect equation is formed to solve for the error. The grid spacing in this yet coarser grid is $2H$. After sufficiently many recursions of the two level method, the resulting grid will have too few grid points to be reduced any further. We call this grid level the coarsest grid. We can either use relaxation or a direct solver to solve the coarsest grid problem.
The approximate solution is then propagated back up to the fine grid, using the coarse grid correction step recursively. What we have described informally is one multigrid V-cycle. More formally,
let us number the grid levels from 1 to $M$, where grid level 1 is the coarsest and grid level $M$ is the finest.

Algorithm 1.4.1 ( MGV($k$, $\nu_1$, $\nu_2$, $h$) )

1. relax $\nu_1$ times on $L^k U^k = F^k$
2. compute the residual, $r^k = F^k - L^k U^k$
3. restrict the residual $I_k^{k-1} r^k$ to $G^{k-1}$, $F^{k-1} = I_k^{k-1} r^k$, and form the coarse grid problem (defect equation) $L^{k-1} U^{k-1} = F^{k-1}$, where $v^k = I_{k-1}^k U^{k-1}$ and $h_{k-1} = 2 h_k$
4. IF $(k-1) \ne 1$ THEN call Algorithm MGV($k-1$, $\nu_1$, $\nu_2$, $H$)
5. solve $L^{k-1} U^{k-1} = F^{k-1}$ to get the solution $U^{k-1}$
6. interpolate the defect (coarse grid solution) to the fine grid, and correct the fine grid solution, $U^k \leftarrow U^k + I_{k-1}^k U^{k-1}$
7. relax $\nu_2$ times on $L^k U^k = F^k$
8. IF (finest grid) THEN Stop

This algorithm describes the basic steps in the multigrid method for one iteration of a V-cycle. If the algorithm uses bilinear (trilinear in 3D) interpolation, it is called the classical multigrid method. This algorithm assumes that the coarsening is done by doubling the fine grid spacing, which can be seen in step 3 of the algorithm. However, the algorithm is valid for any choice of coarsening, $h_{k-1} = m h_k$, where $m$ is any integer greater than one.

1.4.1 Multigrid Cycling Strategies

There are many different types of cycling strategies that are used in multigrid methods besides the V-cycle. We illustrate the different cycling types with the use of a few pictures and brief descriptions.
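To make the recursion in Algorithm 1.4.1 concrete, the following is a minimal sketch of a V($\nu_1$, $\nu_2$)-cycle for the one dimensional model problem $-u'' = f$ on $(0,1)$ with zero Dirichlet boundary values. It is not the black box multigrid code: it assumes lexicographic Gauss-Seidel relaxation, full weighting restriction, linear interpolation, and a coarse grid operator obtained by rediscretization rather than by the Galerkin approach, and all function names are illustrative.

```python
import math


def relax(u, f, h, sweeps):
    """Lexicographic Gauss-Seidel sweeps for (1/h^2)[-1 2 -1] u = f."""
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])


def residual(u, f, h):
    """Residual r = f - L u at interior points (zero at the two end points)."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def restrict(r):
    """Full weighting onto the grid with twice the spacing."""
    nc = (len(r) - 1) // 2
    rc = [0.0] * (nc + 1)
    for i in range(1, nc):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc


def interpolate(vc):
    """Linear interpolation onto the grid with half the spacing."""
    v = [0.0] * (2 * (len(vc) - 1) + 1)
    for i in range(len(vc)):
        v[2 * i] = vc[i]
    for i in range(len(vc) - 1):
        v[2 * i + 1] = 0.5 * (vc[i] + vc[i + 1])
    return v


def mgv(k, u, f, h, nu1, nu2):
    """One V(nu1, nu2)-cycle on level k; level 1 (two cells) is the coarsest."""
    if k == 1:
        # Coarsest grid: a single unknown, solved exactly.
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])
        return
    relax(u, f, h, nu1)                    # pre-smoothing
    fc = restrict(residual(u, f, h))       # coarse grid right hand side
    vc = [0.0] * len(fc)                   # zero initial guess for the defect
    mgv(k - 1, vc, fc, 2.0 * h, nu1, nu2)  # recursive coarse grid solve
    v = interpolate(vc)
    for i in range(1, len(u) - 1):         # coarse grid correction
        u[i] += v[i]
    relax(u, f, h, nu2)                    # post-smoothing


M = 6                 # finest level, with 2**M cells
n = 2 ** M
h = 1.0 / n
f = [math.pi ** 2 * math.sin(math.pi * i * h) for i in range(n + 1)]
u = [0.0] * (n + 1)
r0 = max(abs(x) for x in residual(u, f, h))
for _ in range(10):
    mgv(M, u, f, h, 1, 1)
r10 = max(abs(x) for x in residual(u, f, h))
err = max(abs(u[i] - math.sin(math.pi * i * h)) for i in range(n + 1))
```

After ten cycles the residual ratio `r10 / r0` should be many orders of magnitude below one, and `err` should be at the level of the $O(h^2)$ discretization error, since the exact solution of the model problem is $\sin(\pi x)$.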
Figure 1.3. One V-cycle iteration for five grid levels (labeled 5 down to 1), where the markers represent a visit to a grid level.
The V-cycle is illustrated graphically in figure 1.3. Each marker represents a visit to a particular grid level. A slanting line connection between two grid levels indicates that smoothing work is to be performed. A vertical line connection between grid levels means that no smoothing is to take place between grid level visits. The grid levels are indicated by a numerical value listed on the left side of the figure, where grid level 1 is the coarsest grid level and is always placed at the bottom of the diagram.

The mechanics of the V-cycle were described in the multigrid algorithm in the last section. The V-cycle is one of the most widely used multigrid cycling strategies. Its best performance can be realized when there is an initial guess of the solution available. When a guess is not available, a common choice is to use a zero initial guess or to use an F-cycle (see below).

The S-cycle is illustrated in figure 1.4. The "S" stands for "sawtooth", because that is what it resembles; it is clearly a V(0,1)-cycle and thus a special case of a V-cycle. The S-cycle is what de Zeeuw's MGD9V [24] black box multigrid code uses for its cycling strategy. The S-cycle usually requires a smoother with a very good smoothing factor in order to be efficient and competitive with other cycling strategies.

The W-cycle is illustrated in figure 1.5. The W-cycle is sometimes called a 2-cycle; similarly, a V-cycle can be called a 1-cycle. From figure 1.5, one can see the W-type structure. It is called a 2-cycle because there must be two visits to the coarsest grid level before ascending to the next finer intermediate fine grid level. An intermediate fine grid level is one that is neither the finest nor the coarsest grid level and where the algorithm switches from ascending to descending based on the number of times the grid level has been visited since the residual was restricted to it from a finer grid.

The F-cycle is illustrated in figure 1.6 and is called a full multigrid cycle.
The figure shows a full multigrid V-cycle, that is, each subcycle that visits the coarsest grid level is a V-cycle. An F-cycle can also be created using a W-cycle, or any other
Figure 1.4. One S-cycle iteration for four grid levels, where the markers represent a visit to a grid level.

Figure 1.5. One W-cycle iteration for four grid levels, where the markers represent a visit to a grid level.
Figure 1.6. One F-cycle iteration for five grid levels, where the markers represent a visit to a grid level.
type of cycling, for its subcycle. The F-cycle is very good when an initial guess for the multigrid iteration is not available, since it constructs its own initial guess. The F-cycle first projects the fine grid problem down to the coarsest grid level and then proceeds to construct a solution by using subcycles. After the completion of each subcycle the solution on an intermediate fine grid level is interpolated up to the next finer grid level, where a new subcycle begins. This process is continued until the finest grid level is reached and its own V-cycle completed. At this point, if more multigrid iterations are needed, then the V-cycling is continued at the finest grid level.

1.5 Black Box Multigrid

Black box multigrid is also called geometric multigrid by some and is a member of the algebraic multigrid method (AMG) family. The distinguishing feature of black box multigrid is that the black box approach makes several assumptions about the type of problem to be solved and the structure of the system of equations. The black box multigrid methods also have a predetermined coarsening scheme where the coarse grid has roughly half as many grid points as the fine grid does in one or more of the coordinate directions. For a uniform grid, this means that $H = 2h$. Both methods automatically generate the grid transfer operators, prolongation $I_{k-1}^k$ and restriction $I_k^{k-1}$ for $2 \le k \le M$, and the coarse grid operators $L^k$ for $1 \le k \le M-1$. The coarse grid operators are formed using the Galerkin coarse grid approximation,

\[ L^{k-1} = I_k^{k-1} L^k I_{k-1}^k, \tag{1.32} \]

where $k = 2, \ldots, M$.

The algebraic multigrid methods deal with the system of equations in a purely algebraic way. The coarsening strategy for general AMG is not fixed, nor is the formation of the grid transfer operators, resulting in methods that can be highly adaptable. However, the more adaptable a method is, the more complex its
implementation is likely to be, and it may also be less efficient due to its complexity. Another disadvantage of general AMG methods is that the coarse grid problems are usually not structured even when the fine grid problem is; moreover, the unstructured matrices on coarser levels tend to become less and less sparse as the grid level becomes coarser.

To define the black box multigrid method we need to define several of the multigrid components, such as the grid transfer operators, the coarse grid operators, the type of smoother employed, and the coarsest grid solver. We can also mention the type of cycling strategies that are available and other options.

There are several different grid transfer operators that we have developed and used in our codes. They are of two basic types. The first type collapses the stencil of the operator in a given grid coordinate direction to form three point relations, and the second is based on ideas from S. Schaffer [69]. The details of the grid transfer operators will be presented in chapter 3.

The coarse grid operators are formed by using the Galerkin coarse grid approximation given in equation (1.32).

There are several choices for the smoothing operator available in our codes. The smoothers that we have chosen are all of the multicolor type, except for incomplete line LU. For the standard coarsening versions, the choices are point Gauss-Seidel, line Gauss-Seidel, alternating line Gauss-Seidel, and incomplete line LU. The semicoarsening version uses either line Gauss-Seidel by lines in the x-direction or incomplete line LU. The smoothers will be presented in more detail in chapter 4.

In the standard coarsening codes, the coarsest grid solver is a direct solver using LU factorization. The semicoarsening version allows the option of using line Gauss-Seidel relaxation.

There are several cycling strategies that are allowed, and they are chosen by input parameters. The most important choice is whether to choose full multigrid
cycling or not. There is also a choice for N-cycling, where N = 1 is the standard V-cycle, N = 2 is the W-cycle, and so on. For more details, see section 1.4.1 above.
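The effect of the N-cycling parameter can be seen by listing the order in which grid levels are visited. The sketch below is an illustration only: the function `cycle_visits` is hypothetical, level 1 is the coarsest as in the text, and it assumes that the finest level is visited only at the start and end of a cycle while every intermediate level descends N times before ascending.

```python
def cycle_visits(finest, n_cycle):
    """Return the sequence of grid levels visited in one N-cycle.

    finest  -- number of the finest grid level (level 1 is the coarsest)
    n_cycle -- cycle index N: 1 gives a V-cycle, 2 gives a W-cycle
    """
    def visit(k):
        if k == 1:
            return [1]                      # coarsest grid: solve and return
        repeats = 1 if k == finest else n_cycle
        seq = [k]                           # arrive at level k
        for _ in range(repeats):
            seq += visit(k - 1)             # descend to the coarser levels
            seq.append(k)                   # return to level k
        return seq

    return visit(finest)


v_cycle = cycle_visits(3, 1)
w_cycle = cycle_visits(3, 2)
w_cycle_4 = cycle_visits(4, 2)
```

For three levels this gives `[3, 2, 1, 2, 3]` for the V-cycle and `[3, 2, 1, 2, 1, 2, 3]` for the W-cycle; with four levels the W-cycle visits the coarsest level four times, which illustrates why larger cycle indices shift work toward the coarse grids.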
CHAPTER 2

DISCRETIZATIONS: FINITE DIFFERENCE AND FINITE VOLUME

This chapter presents some of the discretizations that can be used on the convection-diffusion equation. We present only some of the more common finite difference and finite volume discretizations. Although this chapter may be considered elementary, it was thought to be important for two reasons. First, it shows some of the range of discrete problems that can be solved by the black box multigrid methods. Secondly, it gives sufficient detail for others to be able to duplicate the results presented in this thesis. The sections on the finite volume method present more than is needed, but because there is very little on this topic in the current literature and because of its importance for maintaining $O(h^2)$ accurate discretizations for interface problems, we have decided to include it. For references on the finite volume discretization see [85] and [52].

The continuous two dimensional problem is given by

\[ -\nabla\cdot(D\,\nabla u) + b\cdot\nabla u + cu = f \quad \text{in } \Omega = (0, M_x) \times (0, M_y), \tag{2.1} \]

where $D$ is a $2 \times 2$ tensor,

\[ D = \begin{pmatrix} D_x & D_{xy} \\ D_{yx} & D_y \end{pmatrix}, \tag{2.2} \]

and $\det D > 0$, $c \ge 0$. In general, $D_{xy} \ne D_{yx}$, but we only consider either $D_{xy} = D_{yx}$ or $D_{xy} = D_{yx} = 0$. In addition, $D$, $c$, and $f$ are allowed to be discontinuous across
internal interfaces in the domain $\Omega$. The boundary conditions are given by

\[ \frac{\partial u}{\partial n} + a u = g \quad \text{on } \partial\Omega, \tag{2.3} \]

where $a$ and $g$ are functions, and $n$ is the outward unit normal vector. This allows us to represent Dirichlet, Neumann, and Robin boundary conditions.

The domain is assumed to be rectangular, $\Omega = (0, M_x) \times (0, M_y)$, and is then divided into uniform cells of length $h_x = M_x/N_x$ by $h_y = M_y/N_y$, where $N_x$ and $N_y$ are the number of cells in the x- and y-directions respectively. A uniform grid is not required, but we will use it to simplify our discussions.

It should be noted that finite elements on a regular triangulation can also be used to derive the discrete system of equations to be solved by the black box multigrid methods. However, we will not present any details on how to derive these equations.

2.1 Finite Difference Discretization

The finite difference approach to discretization is well known. Finite difference approximation is based on Taylor's series expansion. In one dimension, if a function $u$ and its derivatives are single valued, finite, and continuous functions of $x$, then we have the Taylor's series expansions,

\[ u(x + h) = u(x) + h u'(x) + \tfrac{1}{2} h^2 u''(x) + \tfrac{1}{6} h^3 u'''(x) + \cdots \tag{2.4} \]

and

\[ u(x - h) = u(x) - h u'(x) + \tfrac{1}{2} h^2 u''(x) - \tfrac{1}{6} h^3 u'''(x) + \cdots \tag{2.5} \]

If we add equations (2.4) and (2.5) together we get an approximation to the second derivative of $u$, given by

\[ u''(x) \approx \frac{1}{h^2} \left( u(x + h) - 2u(x) + u(x - h) \right), \tag{2.6} \]
where the leading error term is $O(h^2)$. Subtracting equation (2.5) from (2.4) gives

\[ u'(x) \approx \frac{1}{2h} \left( u(x + h) - u(x - h) \right), \tag{2.7} \]

with an error of $O(h^2)$. Both equations (2.6) and (2.7) are said to be central difference approximations. We also derive a forward and a backward difference approximation to the first derivative from equations (2.4) and (2.5):

\[ u'(x) \approx \frac{1}{h} \left( u(x + h) - u(x) \right) \tag{2.8} \]

and

\[ u'(x) \approx \frac{1}{h} \left( u(x) - u(x - h) \right) \tag{2.9} \]

respectively, with an error of $O(h)$.

The above approximations can be extended to higher dimensions easily and form the basis for finite difference approximation. We illustrate the finite difference discretization, using stencil notation, by way of examples for some of the types of problems that we are interested in. There are many references on finite differences if one is interested in more details; see for instance [74] [39].

The first example is for the anisotropic Poisson's equation on a square domain,

\[ Lu = -\epsilon u_{xx} - u_{yy} = f, \tag{2.10} \]

where $u$ and $f$ are functions of $(x, y) \in \Omega$. Using central finite differences and discretization on a uniform grid with grid spacing $h = 1/N$ for $N = n_x = n_y$ gives the 5-point stencil

\[ L^h = \frac{1}{h^2} \begin{bmatrix} & -1 & \\ -\epsilon & 2(1 + \epsilon) & -\epsilon \\ & -1 & \end{bmatrix}. \tag{2.11} \]
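The orders of accuracy quoted for the difference approximations (2.6) through (2.9) can be checked numerically by halving $h$ and comparing errors. The sketch below does this for $u = \sin x$ at $x = 1$; the helper names and the choice of test function are assumptions made for illustration.

```python
import math


def central_second(u, x, h):
    """Central approximation (2.6) to u''(x)."""
    return (u(x + h) - 2.0 * u(x) + u(x - h)) / (h * h)


def central_first(u, x, h):
    """Central approximation (2.7) to u'(x)."""
    return (u(x + h) - u(x - h)) / (2.0 * h)


def forward_first(u, x, h):
    """Forward approximation (2.8) to u'(x)."""
    return (u(x + h) - u(x)) / h


def backward_first(u, x, h):
    """Backward approximation (2.9) to u'(x)."""
    return (u(x) - u(x - h)) / h


def observed_order(approx, exact, x, h):
    """Estimate p in error ~ C h^p from the errors at h and h/2, for u = sin."""
    e1 = abs(approx(math.sin, x, h) - exact)
    e2 = abs(approx(math.sin, x, h / 2.0) - exact)
    return math.log(e1 / e2, 2)


x0, h0 = 1.0, 0.1
p_second = observed_order(central_second, -math.sin(x0), x0, h0)
p_central = observed_order(central_first, math.cos(x0), x0, h0)
p_forward = observed_order(forward_first, math.cos(x0), x0, h0)
p_backward = observed_order(backward_first, math.cos(x0), x0, h0)
```

The two central approximations should give observed orders close to 2, while the one-sided approximations should give values close to 1, in agreement with the leading error terms of the Taylor expansions.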
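The second example is for the convection-diffusion equation on a square domain, $(x, y) \in \Omega = (0, 1)^2$, (2.12) where $u$, $b_x$, $b_y$, and $f$ are functions of $x$ and $y$. Using a mix of upstream and central finite differences and discretizing on a uniform grid with grid spacing $h = 1/N$ for $N = n_x = n_y$ gives the 5-point stencil, (2.13) where (2.14) and

\[ \mu_x = \begin{cases} \dfrac{\epsilon}{2 b_x h}, & b_x h > \epsilon, \\[1ex] 1 + \dfrac{\epsilon}{2 b_x h}, & b_x h < -\epsilon, \\[1ex] \dfrac{1}{2}, & |b_x h| \le \epsilon, \end{cases} \qquad \mu_y = \begin{cases} \dfrac{\epsilon}{2 b_y h}, & b_y h > \epsilon, \\[1ex] 1 + \dfrac{\epsilon}{2 b_y h}, & b_y h < -\epsilon, \\[1ex] \dfrac{1}{2}, & |b_y h| \le \epsilon. \end{cases} \tag{2.15} \]

The third example is the rotated anisotropic diffusion equation on a square domain. It has this name because it is obtained from the first example by rotating the axes through an angle of $\theta$. The equation is given by

\[ Lu = -(\epsilon c^2 + s^2) \frac{\partial^2 u}{\partial x^2} - 2(\epsilon - 1) c s \frac{\partial^2 u}{\partial x \partial y} - (\epsilon s^2 + c^2) \frac{\partial^2 u}{\partial y^2} = 0, \quad (x, y) \in \Omega = (0, 1) \times (0, 1), \tag{2.16} \]

where $c = \cos\theta$, $s = \sin\theta$, and $\epsilon > 0$. There are two parameters, $\epsilon$ and $\theta$, that can be varied. There are two popular discretizations of this equation which are seen in real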
world applications. They differ only in the discretization of the cross derivative term. Let

\[ \alpha = \epsilon c^2 + s^2, \qquad \beta = (\epsilon - 1) c s, \qquad \gamma = \epsilon s^2 + c^2; \tag{2.17} \]

then, if the grid spacing is $h = 1/N$ for $N = n_x = n_y$, the first, a 7-point finite difference stencil, is

\[ L^h = \frac{1}{h^2} \begin{bmatrix} \beta & -\beta - \gamma & \\ -\alpha - \beta & 2(\alpha + \beta + \gamma) & -\alpha - \beta \\ & -\beta - \gamma & \beta \end{bmatrix}. \tag{2.18} \]

The second, a 9-point finite difference stencil, is (2.19)

The fourth example is the convection-diffusion equation on a square domain,

\[ Lu = -\epsilon \Delta u + c u_x + s u_y = 0, \quad (x, y) \in \Omega = (0, 1)^2, \tag{2.20} \]

where $c = \cos\theta$, $s = \sin\theta$, and $\epsilon > 0$. Upstream finite differences and discretization on a uniform grid with grid spacing $h = 1/N$ for $N = n_x = n_y$ yields (2.21)

2.2 Finite Volume Discretization

There are two types of computational grids that will be considered. The first type is the vertex centered grid $G_v$, defined as
Figure 2.1. Vertex centered finite volume grid, where the markers indicate where the discretization is centered and the dashed lines delineate the finite volumes.
Figure 2.2. Cell centered finite volume grid, where the markers indicate where the discretization is centered and the solid lines delineate the finite volumes.

\[ G_v = \{ (x_i, y_j) : x_i = i h_x,\ i = 0, \ldots, N_x;\ y_j = j h_y,\ j = 0, \ldots, N_y \}, \tag{2.22} \]

where $N_x$ and $N_y$ are the number of cells in the x and y directions respectively; see figure 2.1. The second type is the cell centered grid $G_c$, which is defined by

\[ G_c = \{ (x_i, y_j) : x_i = (i - \tfrac{1}{2}) h_x,\ i = 1, \ldots, N_x;\ y_j = (j - \tfrac{1}{2}) h_y,\ j = 1, \ldots, N_y \}, \tag{2.23} \]

where $N_x$ and $N_y$ are the number of cells in the x and y directions respectively; see figure 2.2.

There are two other somewhat common finite volume grids that will not be discussed here, but can be used to derive the discrete system of equations to be solved by the black box multigrid methods. These grids are defined by placing the finite volume cell centers on the grid lines in one of the coordinate directions and centered between the grid lines in the other coordinate direction. For instance, align the cell centers with the y grid lines and center them between the x grid lines. The cell edges will then correspond with x grid lines and be centered between y grid lines.

We will present finite volume discretization for both vertex and cell centered finite volumes where the coefficients are evaluated at either the vertices or cell centers.
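The two grid definitions differ only in where the points sit relative to the cells, which the following sketch illustrates in one coordinate direction. It assumes a uniform grid on $(0, M_x)$ with $N_x$ cells; the function names are hypothetical.

```python
def vertex_centered(m, n_cells):
    """Coordinates x_i = i*h, i = 0..n_cells, of a vertex centered grid on (0, m)."""
    h = m / n_cells
    return [i * h for i in range(n_cells + 1)]


def cell_centered(m, n_cells):
    """Coordinates x_i = (i - 1/2)*h, i = 1..n_cells, of a cell centered grid on (0, m)."""
    h = m / n_cells
    return [(i - 0.5) * h for i in range(1, n_cells + 1)]


xv = vertex_centered(1.0, 4)
xc = cell_centered(1.0, 4)
```

With four cells on the unit interval, `xv` holds the five vertices 0, 0.25, 0.5, 0.75, 1 and `xc` holds the four cell centers 0.125, 0.375, 0.625, 0.875. The vertex centered grid places unknowns on the boundary, whereas the cell centered grid keeps every unknown half a cell away from it, which is why the two discretizations treat boundary conditions differently.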
The coefficients could be evaluated at other points, such as cell edges, but we will not show the development of such discretizations because they follow easily from the descriptions given below.

2.3 Cell Centered Finite Volume Discretization; Evaluation at the Vertices

For the cell centered finite volume discretization the cell has its center at the point $((i - \tfrac{1}{2}) h_x, (j - \tfrac{1}{2}) h_y)$ and the cell is called the finite volume, $\Omega_{i,j}$, for the point $(i, j)$ on the computational grid $G_c$, where $i = 1, \ldots, N_x$ and $j = 1, \ldots, N_y$; see equation (2.23). A finite volume is shown in figure 2.3. The approximation of $u$ in the center of the cell is called $u_{i,j}$. The coefficients are approximated by constant values in the finite volume $\Omega_{i,j}$. This discretization is useful when the discontinuities are not aligned with the finite volume cell boundaries.

Assume that $D_{xy} = D_{yx} = 0$ and that $b = 0$ for now. If we integrate equation (2.1) over the finite volume $\Omega_{i,j}$ and use Green's theorem we get

\[ -\oint_{\partial\Omega_{i,j}} \left( D_x \frac{\partial u}{\partial x} n_x + D_y \frac{\partial u}{\partial y} n_y \right) d\Gamma + \int_{\Omega_{i,j}} c u \, d\Omega = \int_{\Omega_{i,j}} f \, d\Omega, \tag{2.24} \]

where $n_x$ and $n_y$ are the components of the outward normal vector to the boundary $\partial\Omega_{i,j}$.

We proceed by developing the equations for the interior points $u_{i,j}$, and then for the boundary points, where we present the modifications that are needed for the three types of boundary conditions that we consider. We refer to figure 2.3 to aid in the development of the finite volume discretization.
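Figure 2.3. Cell centered finite volume $\Omega_{i,j}$, where P has the coordinates $((i - \tfrac{1}{2}) h_x, (j - \tfrac{1}{2}) h_y)$. The labels nw, n, ne, w, e, sw, s, and se mark the corners and edge midpoints of the cell around P.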
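2.3.1 Interior Finite Volumes

Referring to figure 2.3, we write the line integral from equation (2.24) as

$$-\oint_{\partial\Omega_{i,j}} \left( D_x \frac{\partial u}{\partial x} n_x + D_y \frac{\partial u}{\partial y} n_y \right) d\Gamma = \int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx - \int_{se}^{ne} D_x \frac{\partial u}{\partial x}\,dy + \int_{ne}^{nw} D_y \frac{\partial u}{\partial y}\,dx - \int_{nw}^{sw} D_x \frac{\partial u}{\partial x}\,dy. \tag{2.25}$$

The integral from (sw) to (se) can be approximated by

$$\int_{sw}^{s} D_y(sw) \frac{\partial u}{\partial y}\,dx + \int_{s}^{se} D_y(se) \frac{\partial u}{\partial y}\,dx \approx \frac{h_x}{2 h_y} \left( D_y(sw) + D_y(se) \right) \left( u_{i,j} - u_{i,j-1} \right) = \frac{h_x}{h_y}\, \alpha^y_{i,j-1} \left( u_{i,j} - u_{i,j-1} \right), \tag{2.26}$$

where $\alpha^y_{i,j} = \tfrac12 (D_{y,i,j} + D_{y,i-1,j})$, and $D_{y,i,j}$ is the value of $D_y$ at the point $(i,j)$. The other line integrals of $\Omega_{i,j}$, (se) to (ne), (ne) to (nw), and (nw) to (sw), can be approximated in a similar fashion.

The surface integrals in equation (2.24) can be approximated by

$$\int_{\Omega_{i,j}} c\,u\,d\Omega \approx h_x h_y\, \bar{c}_{i,j}\, u_{i,j} \tag{2.27}$$

and

$$\int_{\Omega_{i,j}} f\,d\Omega \approx h_x h_y\, \bar{f}_{i,j}, \tag{2.28}$$

where $\bar{c}_{i,j}$ and $\bar{f}_{i,j}$ are approximations of c and f, respectively, at the grid point $((i-\tfrac12)h_x, (j-\tfrac12)h_y)$, given by

$$\bar{c}_{i,j} = \tfrac14 \left( c_{i,j} + c_{i-1,j} + c_{i-1,j-1} + c_{i,j-1} \right) \tag{2.29}$$

and

$$\bar{f}_{i,j} = \tfrac14 \left( f_{i,j} + f_{i-1,j} + f_{i-1,j-1} + f_{i,j-1} \right) \tag{2.30}$$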
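respectively. The resulting stencil for interior points is

$$\begin{bmatrix} & -\frac{h_x}{h_y}\alpha^y_{i,j} & \\ -\frac{h_y}{h_x}\alpha^x_{i-1,j} & \Sigma & -\frac{h_y}{h_x}\alpha^x_{i,j} \\ & -\frac{h_x}{h_y}\alpha^y_{i,j-1} & \end{bmatrix} \tag{2.31}$$

where

$$\alpha^x_{i,j} = \tfrac12 \left( D_{x,i,j-1} + D_{x,i,j} \right), \tag{2.32}$$

$$\alpha^y_{i,j} = \tfrac12 \left( D_{y,i-1,j} + D_{y,i,j} \right), \tag{2.33}$$

and the center coefficient $\Sigma$ is

$$\Sigma = \frac{h_y}{h_x} \left( \alpha^x_{i-1,j} + \alpha^x_{i,j} \right) + \frac{h_x}{h_y} \left( \alpha^y_{i,j-1} + \alpha^y_{i,j} \right) + h_x h_y\, \bar{c}_{i,j}. \tag{2.34}$$

At an interface, the diffusivity is given as an arithmetic mean of the diffusion coefficients of adjacent finite volumes. The arithmetic mean makes sense because the interface passes through the finite volume. This discretization is most accurate when the interface passes directly through the cell center of the finite volume.

When the finite volume $\Omega_{i,j}$ has an edge on the boundary, the line integral in equation (2.24) for that edge has to be treated differently. We examine what needs to be done for each of the three different types of boundary conditions. We examine the changes that are needed on only one boundary edge; the changes needed for the other boundary edges follow in a similar fashion.

2.3.2 Dirichlet Boundary Condition

Let us examine the south boundary, (sw) to (se), where we have

$$u_{(s)} = g_{(s)}. \tag{2.35}$$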
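The line integral from (sw) to (se) is approximated by

$$\int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx \approx 2\,\frac{h_x}{h_y}\, \tfrac12 \left( D_{y,i,j-1} + D_{y,i-1,j-1} \right) \left( u_{i,j} - u_{(s)} \right). \tag{2.36}$$

This gives the stencil (2.37), in which the south coefficient is zero, the center coefficient $\Sigma$ is defined in equation (2.34), and $\alpha$ is defined by equations (2.32) and (2.33).

2.3.3 Neumann and Robin Boundary Conditions

We examine the south boundary, (sw) to (se), where

$$\left. \frac{\partial u}{\partial n} \right|_{(s)} + a_{(s)}\, u_{(s)} = g_{(s)}. \tag{2.38}$$

We then make the approximation

$$\left. \frac{\partial u}{\partial n} \right|_{(s)} \approx \frac{2}{h_y} \left( u_{(s)} - u_{(p)} \right) = g_{(s)} - a_{(s)}\, u_{(s)}. \tag{2.39}$$

Solving for $u_{(s)}$ gives

$$u_{(s)} = \frac{\tfrac12 h_y\, g_{(s)} + u_{(p)}}{1 + \tfrac12 h_y\, a_{(s)}}. \tag{2.40}$$

The line integral is then approximated as

$$\int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx \approx 2\,\frac{h_x}{h_y}\, \alpha^y_{i,j-1} \left( u_{i,j} - u_{(s)} \right). \tag{2.41}$$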
Now we substitute equation (2.40) to obtain

$$\int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx \approx \frac{2 h_x}{2 + h_y\, a_{(s)}}\, \alpha^y_{i,j-1} \left( a_{(s)}\, u_{i,j} - g_{(s)} \right). \tag{2.42}$$

The resulting stencil for the south boundary is given in equation (2.43); its south coefficient is zero, $\alpha$ is defined in equations (2.32) and (2.33), and the center coefficient is now given by equation (2.44). The other boundaries can be handled in the same way.

We have now defined the cell centered finite volume discretization where the coefficients are evaluated at the grid vertices.

2.4 Cell Centered Finite Volume Discretization; Evaluation at the Cell Centers

This discretization is better suited to problems in which the interfaces align with the boundaries of the finite volumes. The discretization is very similar to what was done in section 2.3, except that now the coefficients are evaluated at the cell centers, $((i-\tfrac12)h_x, (j-\tfrac12)h_y)$, of the finite volume $\Omega_{i,j}$. The coefficients are approximated by constant values in the finite volume $\Omega_{i,j}$. We need to approximate the integrals in equation (2.24).
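2.4.1 Interior Finite Volumes

We have the line integral, as in equation (2.25), and the integral from (sw) to (se) can be approximated by

$$\int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx \approx 2\,\frac{h_x}{h_y}\, D_{y,i,j} \left( u_{i,j} - u_{(s)} \right), \tag{2.45}$$

where $D_{y,i,j}$ is the value of $D_y$ at the point $(i,j)$. We still need to approximate $u_{(s)}$, and to do this we will use the continuity of u and of $D_y\, \partial u / \partial y$,

$$D_{y,i,j} \left( u_{i,j} - u_{(s)} \right) = D_{y,i,j-1} \left( u_{(s)} - u_{i,j-1} \right), \tag{2.46}$$

yielding

$$u_{(s)} = \frac{D_{y,i,j}\, u_{i,j} + D_{y,i,j-1}\, u_{i,j-1}}{D_{y,i,j} + D_{y,i,j-1}}. \tag{2.47}$$

We can now substitute equation (2.47) into equation (2.45) to get

$$\int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx \approx \frac{h_x}{h_y}\, \alpha^y_{i,j-1} \left( u_{i,j} - u_{i,j-1} \right), \tag{2.48}$$

where $\alpha^y_{i,j-1}$ is now given by

$$\alpha^y_{i,j-1} = \frac{2\, D_{y,i,j}\, D_{y,i,j-1}}{D_{y,i,j} + D_{y,i,j-1}}. \tag{2.49}$$

The other line integrals of $\Omega_{i,j}$, (se) to (ne), (ne) to (nw), and (nw) to (sw), can be approximated in a similar fashion.

The surface integrals are approximated in the same way as before,

$$\int_{\Omega_{i,j}} c\,u\,d\Omega \approx h_x h_y\, c_{i-\frac12,j-\frac12}\, u_{i,j} \tag{2.50}$$

and

$$\int_{\Omega_{i,j}} f\,d\Omega \approx h_x h_y\, f_{i-\frac12,j-\frac12}, \tag{2.51}$$

but instead of the averages (2.29) and (2.30) we have $c_{i-\frac12,j-\frac12}$ and $f_{i-\frac12,j-\frac12}$.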
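The resulting stencil for interior points is

$$\begin{bmatrix} & -\frac{h_x}{h_y}\alpha^y_{i,j} & \\ -\frac{h_y}{h_x}\alpha^x_{i-1,j} & \Sigma & -\frac{h_y}{h_x}\alpha^x_{i,j} \\ & -\frac{h_x}{h_y}\alpha^y_{i,j-1} & \end{bmatrix} \tag{2.52}$$

where

$$\alpha^x_{i-1,j} = \frac{2\, D_{x,i,j}\, D_{x,i-1,j}}{D_{x,i,j} + D_{x,i-1,j}}, \tag{2.53}$$

$$\alpha^y_{i,j-1} = \frac{2\, D_{y,i,j}\, D_{y,i,j-1}}{D_{y,i,j} + D_{y,i,j-1}}, \tag{2.54}$$

and the center coefficient $\Sigma$ is

$$\Sigma = \frac{h_y}{h_x} \left( \alpha^x_{i-1,j} + \alpha^x_{i,j} \right) + \frac{h_x}{h_y} \left( \alpha^y_{i,j-1} + \alpha^y_{i,j} \right) + h_x h_y\, c_{i-\frac12,j-\frac12}. \tag{2.55}$$

At an interface, the diffusivity is given as a harmonic average of the diffusion coefficients of the adjacent finite volumes.

2.4.2 Dirichlet Boundary Condition

For the south boundary, (sw) to (se), we have the Dirichlet boundary condition $u_{(s)} = g_{(s)}$. The line integral is approximated by

$$\int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx \approx 2\,\frac{h_x}{h_y}\, D_{y,i,j} \left( u_{i,j} - g_{(s)} \right). \tag{2.56}$$

The stencil is then given by equation (2.57), in which the south coefficient is zero, the center coefficient $\Sigma$ is given in equation (2.55), and $\alpha$ is given by equations (2.53) and (2.54).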
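The difference between the arithmetic mean used in section 2.3 and the harmonic average used in section 2.4 is easiest to see with numbers. The following Python sketch is only an illustration: the function names are assumptions introduced here, and the coefficient values 1 and 1000 are made up.

```python
def arithmetic_mean(d1, d2):
    """Interface diffusivity as in section 2.3: 0.5 * (d1 + d2)."""
    return 0.5 * (d1 + d2)


def harmonic_mean(d1, d2):
    """Interface diffusivity as in section 2.4: 2 * d1 * d2 / (d1 + d2)."""
    return 2.0 * d1 * d2 / (d1 + d2)


if __name__ == "__main__":
    d_left, d_right = 1.0, 1000.0   # hypothetical jump of three orders of magnitude
    print(arithmetic_mean(d_left, d_right))  # dominated by the larger coefficient
    print(harmonic_mean(d_left, d_right))    # close to twice the smaller coefficient
```

When the interface coincides with the edge between two finite volumes, the flux through that edge is limited by the less diffusive side, which is the behavior the harmonic average reproduces.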
2.4.3 Neumann and Robin Boundary Conditions

The Neumann and Robin boundary conditions can be handled in the same way as in section 2.3.3. The line integral for the south boundary is

$$\int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx \approx \frac{2 h_x}{2 + h_y\, a_{(s)}}\, D_{y,i,j} \left( a_{(s)}\, u_{i,j} - g_{(s)} \right). \tag{2.58}$$

The resulting stencil is now given by equation (2.59), in which the south coefficient is zero, the center coefficient $\Sigma$ is given in equation (2.55), and $\alpha$ is given by equations (2.53) and (2.54).

2.5 Vertex Centered Finite Volume Discretization - Evaluation at the Vertices

In this discretization D, c, and f are approximated by constant values in the finite volume, $\Omega_{i,j}$, whose centers are at the vertices. This discretization is useful when the discontinuities align with the boundaries of the finite volumes.

2.5.1 Interior Finite Volumes

The development is done the same as before for the cell centered cases; see section 2.3.1. The stencil, when $D_{xy} = D_{yx} = 0$ and $b = 0$, is given by

$$\begin{bmatrix} & -\frac{h_x}{h_y}\alpha^y_{i,j} & \\ -\frac{h_y}{h_x}\alpha^x_{i-1,j} & \Sigma & -\frac{h_y}{h_x}\alpha^x_{i,j} \\ & -\frac{h_x}{h_y}\alpha^y_{i,j-1} & \end{bmatrix} \tag{2.60}$$
where

$$\alpha^x_{i,j} = \frac{2\, D_{x,i,j}\, D_{x,i+1,j}}{D_{x,i,j} + D_{x,i+1,j}}, \tag{2.61}$$

$$\alpha^y_{i,j} = \frac{2\, D_{y,i,j}\, D_{y,i,j+1}}{D_{y,i,j} + D_{y,i,j+1}}, \tag{2.62}$$

and the center coefficient $\Sigma$ is

$$\Sigma = \frac{h_y}{h_x} \left( \alpha^x_{i-1,j} + \alpha^x_{i,j} \right) + \frac{h_x}{h_y} \left( \alpha^y_{i,j-1} + \alpha^y_{i,j} \right) + h_x h_y\, c_{i,j}, \tag{2.63}$$

where c and f are evaluated at the grid point $(i\,h_x, j\,h_y)$.

2.5.2 Edge Boundary Finite Volumes

Let the finite volume $\Omega_{i,j}$ have its southern edge, (sw) to (se), at the southern boundary ($y = 0$) of the domain; see figure 2.4.

Figure 2.4. Vertex centered finite volume $\Omega_{i,j}$ at the southern, $y = 0$, edge boundary.

2.5.3 Dirichlet Boundary Condition

For the Dirichlet boundary condition we have $u_{(s)} = g_{(s)}$, and we can just eliminate the unknown $u_{(s)}$ and move it to the right hand side of the equation.

2.5.4 Neumann and Robin Boundary Conditions

The line integral along the boundary is approximated by

$$\int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx \approx h_x\, D_{y,i,j} \left. \frac{\partial u}{\partial y} \right|_{(s)}$$
and now we need to look at the surface integrals, and similarly for f. The stencil for the edge boundary and its center coefficient are given in equations (2.64) through (2.67); the south coefficient of the stencil is zero, and $\alpha$ is defined by equations (2.61) and (2.62).

2.5.5 Corner Boundary Finite Volumes

The corner finite volume discretization will be shown for the southwest corner of the computational grid; see figure 2.5.

Figure 2.5. Southwest corner finite volume, with sides $h_x$ and $h_y$, where the marked point indicates where the discretization is centered.
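2.5.6 Dirichlet Boundary Condition

In the Dirichlet boundary condition case, the unknown $u_{(sw)}$ is eliminated by the boundary condition equation,

$$u_{(sw)} = g_{(sw)}. \tag{2.68}$$

The term $g_{(sw)}$ is incorporated into the right hand side of the discrete system of equations. The stencil for the southwest corner is given in equation (2.69); its west and south coefficients are zero, and its center coefficient $\Sigma$ is defined as

$$\Sigma = \frac{h_x}{2 h_y}\, \alpha^y_{i,j} + \frac{h_y}{2 h_x}\, \alpha^x_{i,j}, \tag{2.70}$$

where $\alpha$ is defined by equations (2.61) and (2.62).

2.5.7 Neumann and Robin Boundary Conditions

In the Neumann and Robin boundary condition cases, we have

$$\left. -\frac{\partial u}{\partial x} \right|_{(sw)} + a_w\, u = g_w, \tag{2.71}$$

$$\left. -\frac{\partial u}{\partial y} \right|_{(sw)} + a_s\, u = g_s, \tag{2.72}$$

where the subscript (sw) means evaluation at the sw point; see figure 2.5. The line integrals around the finite volume are approximated by

$$\int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx \approx \tfrac12 h_x\, D_{y,i,j}\, \frac{\partial u(sw)}{\partial y} = \tfrac12 h_x\, D_{y,i,j} \left( a_s(sw)\, u_{i,j} - g_s(sw) \right), \tag{2.73}$$

$$\int_{sw}^{nw} D_x \frac{\partial u}{\partial x}\,dy \approx \tfrac12 h_y\, D_{x,i,j}\, \frac{\partial u(sw)}{\partial x} = \tfrac12 h_y\, D_{x,i,j} \left( a_w(sw)\, u_{i,j} - g_w(sw) \right), \tag{2.74}$$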
$$\int_{ne}^{nw} D_y \frac{\partial u}{\partial y}\,dx \approx \frac12\, \frac{h_x}{h_y}\, \alpha^y_{i,j} \left( u_{i,j} - u_{i,j+1} \right). \tag{2.75}$$

The stencil for the southwest corner is given in equation (2.77). Its west and south coefficients are zero, its east coefficient is $-\frac{h_y}{2 h_x}\alpha^x_{i,j}$, and its center coefficient is $\Sigma + \tfrac14 h_x h_y\, c_{i,j} + BC$, where $\Sigma$ is defined in equation (2.70), $\alpha$ is defined by equations (2.61) and (2.62), and $BC$ is defined in equation (2.78).

2.6 Vertex Centered Finite Volume Discretization - Evaluation at the Cell Vertices

In this discretization D, c, and f are approximated by constant values in the finite volume, $\Omega_{i,j}$, whose centers are at the vertices. This discretization is useful when the discontinuities pass through the interior of the finite volumes, and it is best when the interface passes through the cell center.

2.6.1 Interior Finite Volumes

The development is the same as for the previous section on vertex centered finite volumes; see section 2.5. The stencil, when $D_{xy} = D_{yx} = 0$ and $b = 0$, is given by equation (2.79), whose west coefficient is $-\frac{h_y}{h_x}\alpha^x_{i-1,j}$,
where

$$\alpha^x_{i,j} = \tfrac12 \left( D_{x,i+1,j} + D_{x,i+1,j+1} \right), \tag{2.80}$$

$$\alpha^y_{i,j} = \tfrac12 \left( D_{y,i,j+1} + D_{y,i+1,j+1} \right), \tag{2.81}$$

the center coefficient is given by equation (2.82), and where c and f are evaluated at the grid point $(i\,h_x, j\,h_y)$,

$$c_{i,j} = \tfrac14 \left( c_{i-1,j-1} + c_{i+1,j-1} + c_{i-1,j+1} + c_{i+1,j+1} \right) \tag{2.83}$$

and

$$f_{i,j} = \tfrac14 \left( f_{i-1,j-1} + f_{i+1,j-1} + f_{i-1,j+1} + f_{i+1,j+1} \right). \tag{2.84}$$

Let the finite volume $\Omega_{i,j}$ have its southern edge, (sw) to (se), at the southern boundary ($y = 0$) of the domain; see figure 2.4.

2.6.2 Dirichlet Boundary Condition

For the Dirichlet boundary condition we have $u_{(s)} = g_{(s)}$, and we can just eliminate the unknown $u_{(s)}$ and move it to the right hand side of the equation.

2.6.3 Neumann and Robin Boundary Conditions

The line integral along the boundary is approximated as in equation (2.85),
and similarly for the line integral from (sw) to (nw); the line integral from (nw) to (ne) is done as before for the interior. The surface integrals are now given by equation (2.86), where

$$c_{i,j} = \tfrac12 \left( c_{i,j+1} + c_{i+1,j+1} \right), \tag{2.87}$$

and similarly for f. The stencil for the edge boundary is given in equation (2.88); its west and east coefficients are $-\frac{h_y}{2 h_x} D_{x,i-1,j}$ and $-\frac{h_y}{2 h_x} D_{x,i,j}$, its south coefficient is zero, and its center coefficient is

$$\Sigma = \frac{h_x}{h_y}\, \alpha^y_{i,j} + \frac{h_y}{2 h_x} \left( D_{x,i-1,j} + D_{x,i,j} \right), \tag{2.89}$$

where $\alpha$ is defined by equations (2.80) and (2.81).

2.6.4 Corner Boundary Finite Volumes

The corner finite volume discretization will be shown for the southwest corner of the computational grid; see figure 2.5.

2.6.5 Dirichlet Boundary Condition

In the Dirichlet boundary condition case, the unknown $u_{(sw)}$ is eliminated by the boundary condition equation, $u_{(sw)} = g_{(sw)}$. The term $g_{(sw)}$ is incorporated into the right hand side of the discrete
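system of equations. The stencil for the southwest corner is given in equation (2.90); its west and south coefficients are zero, its center coefficient is defined in equation (2.91), and $\alpha$ is defined by equations (2.80) and (2.81).

2.6.6 Neumann and Robin Boundary Conditions

In the Neumann and Robin boundary condition cases, we have

$$\left. -\frac{\partial u}{\partial x} \right|_{(sw)} + a_w\, u = g_w, \tag{2.92}$$

$$\left. -\frac{\partial u}{\partial y} \right|_{(sw)} + a_s\, u = g_s, \tag{2.93}$$

where the subscript (sw) means evaluation at the sw point; see figure 2.5. The line integrals around the finite volume are approximated by

$$\int_{sw}^{se} D_y \frac{\partial u}{\partial y}\,dx \approx \tfrac12 h_x\, D_{y,i+1,j+1}\, \frac{\partial u(sw)}{\partial y} = \tfrac12 h_x\, D_{y,i+1,j+1} \left( a_s(sw)\, u_{i,j} - g_s(sw) \right), \tag{2.94}$$

$$\int_{sw}^{nw} D_x \frac{\partial u}{\partial x}\,dy \approx \tfrac12 h_y\, D_{x,i+1,j+1}\, \frac{\partial u(sw)}{\partial x} = \tfrac12 h_y\, D_{x,i+1,j+1} \left( a_w(sw)\, u_{i,j} - g_w(sw) \right), \tag{2.95}$$

together with the approximations (2.96) and (2.97) for the two remaining edges.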
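The surface integrals are approximated by

$$\int_{\Omega_{i,j}} c\,u\,d\Omega \approx \tfrac14 h_x h_y\, c_{i+1,j+1}\, u_{i,j} \tag{2.98}$$

and similarly for f. The stencil for the southwest corner is given in equation (2.99); its west and south coefficients are zero, its center coefficient involves the quantity defined in equation (2.91), $\alpha$ is defined by equations (2.80) and (2.81), and the boundary condition contribution is given in equation (2.100).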
CHAPTER 3

PROLONGATION AND RESTRICTION OPERATORS

Suppose that we have an elliptic linear operator L on a two dimensional rectangular domain $\Omega$:

$$L u = f. \tag{3.1}$$

This problem can be discretized using finite differences (or another discretization) on a rectangular grid $G^h$ with grid spacing h, given by equation (3.2). We assume that the discretization is represented in stencil notation as

$$L^h_{(i,j)} = \begin{bmatrix} NW & N & NE \\ W & C & E \\ SW & S & SE \end{bmatrix}_{(i,j)} \tag{3.4}$$

where NW, N, NE, ... are the coefficients of the discretization stencil centered at $(x_i, y_j)$.

The size of the fine grid operator's stencil is important to remember because we require that the coarser grid operator's stencil not be any larger than the largest allowable fine grid operator stencil. By keeping the grid operator stencil fixed at a maximum of 9 points, we ensure that the implementation will be easier and more efficient by maintaining the sparsity of the operators. This consideration is important
when discussing the formation of the grid transfer operators since we use the Galerkin coarse grid approximation approach to form the coarse grid operators. The formulation of the coarse grid operators involves the multiplication of three matrices, and if their stencils are at most 9-point, then the coarse grid operator will also be at most 9-point. If we use grid transfer operators with larger stencils, the size of the coarse grid operator stencil can grow without bound as the grid levels become coarser, until the stencils either become the size of the full matrix or we run out of grid levels.

Another guiding principle that we follow is that if we are given a symmetric fine grid operator, we would like all the coarser grid operators to be symmetric also. In order to follow this principle, the interpolation and restriction operators must be chosen with care.

Before getting started it is best to show where and how the operators are used to transfer components between grid levels. We assume the layout of coarse and fine grids shown in figure 1.1. We refer to coarse grid points with indices $(i_c, j_c)$ and fine grid points with indices $(i_f, j_f)$.

3.1 Prolongation

We interpolate the defect correction (error) from the coarse grid level to the fine grid level, where it is added as a correction to the approximation of the fine grid solution. There are four possible interpolation cases for standard coarsening in two dimensions. The four cases are illustrated in figure 3.1, where the thick lines represent coarse grid lines, thin lines represent the fine grid lines, circles represent coarse grid points, X represents the fine grid interpolation point, and the subscripts f and c distinguish the fine and coarse grid indices respectively. Figure 3.1(a) represents interpolation to fine grid points that coincide with coarse grid points. Figure 3.1(b) represents interpolation to fine grid points that do not coincide with coarse grid points,
Figure 3.1. The four 2D standard coarsening interpolation cases, panels (a) through (d), where circles represent the coarse grid points used to interpolate to the fine grid point represented by X. The thick lines represent coarse grid lines.
but lie on coarse grid lines in the x-direction. Figure 3.1(c) represents interpolation to fine grid points that do not coincide with coarse grid points, but lie on coarse grid lines in the y-direction. Figure 3.1(d) represents interpolation to fine grid points that do not align with any coarse grid lines either horizontally or vertically.

The fine grid points that are also coarse grid points, case (a), use the identity as the interpolation operator. The coarse grid correction is then given by equation (3.5), where $(x_{i_f}, y_{j_f}) = (x_{i_c}, y_{j_c})$ on the grid; here the interpolation coefficient is 1.

The fine grid points that are between two coarse grid points that share the same $y_j$ coordinate, case (b), use a two point relation for the interpolation. The coarse grid correction is given by equation (3.6), where $y_{j_c} = y_{j_f}$ and $x_{i_c-1} < x_{i_f-1} < x_{i_c}$ on the grid, and the interpolation coefficients are $I^w_{i_c-1,j_c}$ and $I^e_{i_c,j_c}$.

The fine grid points that are between two coarse grid points that share the same $x_i$ coordinate, case (c), use a similar two point relation for the interpolation. The coarse grid correction is then given by equation (3.7), where $x_{i_c} = x_{i_f}$ and $y_{j_c-1} < y_{j_f-1} < y_{j_c}$ on the grid, and the interpolation coefficients are $I^s_{i_c,j_c-1}$ and $I^n_{i_c,j_c}$.

The last set of fine grid points are those that do not share either an $x_i$ or a $y_j$ coordinate with the coarse grid, case (d). We use a four point relation for the interpolation in this case, and the coarse grid correction is given by
equation (3.8), where $x_{i_c} < x_{i_f} < x_{i_c+1}$ and $y_{j_c} < y_{j_f} < y_{j_c+1}$, and the interpolation coefficients are $I^{sw}_{i_c-1,j_c-1}$, $I^{nw}_{i_c-1,j_c}$, $I^{ne}_{i_c,j_c}$, and $I^{se}_{i_c,j_c-1}$.

The interpolation operator's coefficients can also be represented in stencil notation, just like the grid operator, as

$$I_H^h = \begin{bmatrix} I^{nw} & I^n & I^{ne} \\ I^w & 1 & I^e \\ I^{sw} & I^s & I^{se} \end{bmatrix}_H^h. \tag{3.9}$$

3.1.1 Prolongation Correction Near Boundaries

In the black box multigrid solvers, the right hand side of the grid equation next to the boundary can contain boundary data, in which case the above interpolation formulas can lead to $O(1)$ interpolation errors. To reduce this error we can use a correction term that contains the residual to bring the interpolation errors back to $O(h^2)$; see [26]. The correction term is $O(h^2)$ for the interior grid points, and in general will not improve the error on the interior, but near the boundary the correction term can be of $O(1)$.

The correction term takes the form of the residual divided by the diagonal of the grid equation coefficient matrix; that is, the correction term is equal to $r_{i,j}/C_{i,j}$, where the residual was computed for the grid before restriction. The correction term is added to equations (3.6), (3.7), and (3.8), which are for interpolating to fine grid points that are not coarse grid points. Applying the correction is similar to performing an additional relaxation sweep along the boundary, and it does not affect the size of the prolongation stencil.
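The four interpolation cases of section 3.1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the black box multigrid implementation: it uses zero-based indices with fine grid point $(2 i_c, 2 j_c)$ coinciding with coarse grid point $(i_c, j_c)$, it replaces the operator induced coefficients $I^w$, $I^e$, $I^s$, $I^n$, $I^{sw}$, $I^{nw}$, $I^{ne}$, and $I^{se}$ with the constant bilinear weights 1/2 and 1/4, and it omits the boundary correction of section 3.1.1. The function name `prolong` is hypothetical.

```python
def prolong(vc):
    """Interpolate a coarse grid correction vc[ic][jc] to the fine grid.

    Assumes fine point (2*ic, 2*jc) coincides with coarse point (ic, jc) and
    uses bilinear weights as stand-ins for the operator induced coefficients.
    """
    nc, mc = len(vc), len(vc[0])
    nf, mf = 2 * nc - 1, 2 * mc - 1
    vf = [[0.0] * mf for _ in range(nf)]
    for i in range(nf):
        for j in range(mf):
            ic, jc = i // 2, j // 2
            if i % 2 == 0 and j % 2 == 0:
                # case (a): fine point is also a coarse point, coefficient 1
                vf[i][j] = vc[ic][jc]
            elif i % 2 == 1 and j % 2 == 0:
                # case (b): between two coarse points on a line of constant y
                vf[i][j] = 0.5 * (vc[ic][jc] + vc[ic + 1][jc])
            elif i % 2 == 0 and j % 2 == 1:
                # case (c): between two coarse points on a line of constant x
                vf[i][j] = 0.5 * (vc[ic][jc] + vc[ic][jc + 1])
            else:
                # case (d): center of a coarse cell, four point relation
                vf[i][j] = 0.25 * (vc[ic][jc] + vc[ic + 1][jc]
                                   + vc[ic][jc + 1] + vc[ic + 1][jc + 1])
    return vf


if __name__ == "__main__":
    coarse = [[0.0, 2.0], [4.0, 6.0]]
    for row in prolong(coarse):
        print(row)
```

In the solvers discussed in this chapter the weights vary from point to point, so the constants above would be replaced by lookups into stored coefficient arrays; the case structure stays the same.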
3.2 Restriction

The restriction operator restricts the residual from the fine grid level to the coarse grid level, where it becomes the right hand side of the defect equation (error-residual equation). The restriction equation is given in (3.10). The restriction coefficients can also be represented in stencil notation as the stencil $I_h^H$ of equation (3.11), where the restriction is centered at the fine grid point $(x_{i_f}, y_{j_f}) = (x_{i_c}, y_{j_c})$.

3.3 Overview

In the following sections we present several different interpolation operators by exhibiting the coefficients needed to represent the operator's stencil. In most cases, we omit the indices of the operators, it being understood that the grid operator is given at the fine grid point $(x_{i_f}, y_{j_f})$. The grid transfer operators can be split into two groups based upon how the operators are computed.

The first class of grid transfer operators is based on using a collapse (lumping) in one of the coordinate directions, yielding a simple three point relation that can be
solved. The second class of grid transfer operators is based on an idea from Schaffer's semicoarsening multigrid [69]. Both of these methods for operator induced grid transfer operators are approximations to the Schur complement; that is, they try to approximate the block Gaussian elimination of the unknowns that are on the fine grid but not on the coarse grid. The collapsing methods are a local process, while Schaffer's idea is to apply the procedure to a block (line) of unknowns.

We start by presenting the grid transfer operators used in the symmetric versions of the black box multigrid solvers. Then we present several different grid transfer operators that are used in the nonsymmetric black box multigrid solvers.

In classic multigrid methods, the grid transfer operators are often taken to be bilinear interpolation and full weighting; injection is also popular. To see why we do not use these choices, we need to look at the type of problems that we are hoping to solve. These problems are represented by the convection-diffusion equation,

$$-\nabla \cdot (D\, \nabla u) + b \cdot \nabla u + c\, u = f, \tag{3.12}$$

where D, c, and f are allowed to be discontinuous across internal boundaries. The black box multigrid solvers are aimed at solving these problems when D is strongly discontinuous. The classical multigrid grid transfer operators perform quite well when D jumps by an order of magnitude or less, but when D jumps by several orders of magnitude, the classical methods can exhibit extremely poor convergence, since these methods are based on the continuity of $\nabla u$ and the smoothing of the error in $\nabla u$. However, it is $D\, \nabla u$ that is continuous, not $\nabla u$. Hence, if D has jumps of more than an order of magnitude across internal boundaries, then it is more appropriate to use grid transfer operators that approximate the continuity of $D\, \nabla u$ instead of the continuity of $\nabla u$.

It is important to remember that we are using the Galerkin coarse grid approximation approach to form the coarse grid operators.
We want the coarse
grid operators to approximate the continuity of $D\, \nabla u$. This goal is accomplished by basing the grid transfer operators on the grid operator $L^h$.

Before proceeding with the definitions of the first class of grid transfer operators, we need to define a few terms and make a few explanations.

Definition 3.3.1 Using the grid operator's stencil notation, define $R_\Sigma$, the row sum, at a given grid point $(x_i, y_j)$ to be

$$R_\Sigma = C + NW + N + NE + W + E + SW + S + SE, \tag{3.13}$$

where the subscript $(i,j)$ has been suppressed.

The row sum is used to determine when to switch between two different ways of computing the grid transfer coefficients at a given point. The switch happens when the grid operator is marginally diagonally dominant, or in other words, when the row sum is small in some sense.

We recall what is meant by the symmetric part of the operator.

Definition 3.3.2 Define the symmetric part of the operator L as

$$\sigma L = \mathrm{symm}(L) = \tfrac12 \left( L + L^* \right), \tag{3.14}$$

where $L^*$ is the adjoint of the grid operator L.

The notation applies equally to the grid operator's coefficients, for example:

$$\sigma N_{i,j} = \tfrac12 \left( N_{i,j} + S_{i,j+1} \right) \quad \text{and} \quad \sigma SW_{i,j} = \tfrac12 \left( SW_{i,j} + NE_{i-1,j-1} \right). \tag{3.15}$$

In addition, we can give some examples of the adjoint (transpose) of the grid
operator's coefficients; see equation (3.16).

3.4 Symmetric Grid Operator $L^h$: Collapsing Methods

The interpolation operator is based upon the discrete grid operator $L^h$, while the restriction operator is based on $(L^h)^*$. We want to preserve the flux $\mu \cdot (D\, \nabla u)$ across interfaces, which can be done by using the grid operator $L^h$. If $L^h$ has a 5-point stencil, then the three point relation (3.17) holds, which gives the interpolation formula (3.18). When $L^h$ has a 9-point stencil, the idea is to integrate the contributions from the other coefficients (NW, NE, SW, and SE), which can be done by summing (collapsing) the coefficients to get the three point relation

$$A_-\, v_{i-1} + A_0\, v_i + A_+\, v_{i+1} = 0, \tag{3.19}$$

where $A_- = (NW + W + SW)$, $A_0 = (N + C + S)$, and $A_+ = (NE + E + SE)$.

The computation of the $I^w$ and $I^e$ coefficients is done by collapsing the grid operator in the y-direction to get a three point relation on the x-grid lines. Let the interpolation formula be given by

$$A_{i-1}\, v_{i-1} + A_i\, v_i + A_{i+1}\, v_{i+1} = 0, \tag{3.20}$$
where $v_k$ is written for $v_{k,j}$, and $A_{i-1} = (NW + W + SW)_{i,j}$, $A_i = (N + C + S)_{i,j}$, and $A_{i+1} = (NE + E + SE)_{i,j}$. We now solve the equation for $v_i$ to get the interpolation formula in an explicit form, $v_i = -A_i^{-1} \left( A_{i-1}\, v_{i-1} + A_{i+1}\, v_{i+1} \right)$. The interpolation coefficients $I^w$ and $I^e$ are then given by

$$I^w = -A_i^{-1} A_{i-1} \tag{3.21}$$

and

$$I^e = -A_i^{-1} A_{i+1}. \tag{3.22}$$

Writing out the coefficients explicitly gives

$$I^w = -\frac{NW + W + SW}{N + C + S}, \tag{3.23}$$

$$I^e = -\frac{NE + E + SE}{N + C + S}, \tag{3.24}$$

where $I^w$ and $I^e$ are evaluated at $(i_c-1, j_c)$ and $(i_c, j_c)$ respectively, and the other coefficients on the right hand side are evaluated at $(i_f-1, j_f)$.

If, however, the row sum $R_\Sigma$ (see (3.13)) is small (see (3.28)), then instead of $(N + C + S)_i$ for $A_i$ we use $-(NW + W + SW + NE + E + SE)_i$. These two formulas give the same result when the row sum is zero, which is the case for an operator with only second order terms away from the boundary. This idea is observed to lead to better convergence, and it is due to Dendy [30]. The coefficients are then defined by

$$I^w = \frac{NW + W + SW}{NW + W + SW + NE + E + SE} \tag{3.25}$$

and

$$I^e = \frac{NE + E + SE}{NW + W + SW + NE + E + SE}, \tag{3.26}$$

where $I^w$ and $I^e$ are evaluated at $(i_c-1, j_c)$ and $(i_c, j_c)$ respectively, and the other coefficients on the right hand side are evaluated at $(i_f-1, j_f)$.
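The two ways of computing $I^w$ and $I^e$, equations (3.23) and (3.24) versus (3.25) and (3.26), can be compared with a short Python sketch. This is only an illustration under stated assumptions: the stencil is passed as a dictionary of signed coefficients, the function name `collapse_in_y` is hypothetical, and the decision of whether the row sum is small is left to the caller through the `row_sum_small` flag rather than implemented here.

```python
def collapse_in_y(st, row_sum_small):
    """Return (Iw, Ie) for a 9-point stencil collapsed in the y-direction.

    st maps the names NW, N, NE, W, C, E, SW, S, SE to signed coefficients
    of the grid operator at the fine grid point being interpolated.
    """
    a_west = st["NW"] + st["W"] + st["SW"]
    a_east = st["NE"] + st["E"] + st["SE"]
    if row_sum_small:
        # equations (3.25) and (3.26): normalize by the two collapsed sides
        denom = a_west + a_east
        return a_west / denom, a_east / denom
    # equations (3.23) and (3.24): divide by the collapsed center column
    a_center = st["N"] + st["C"] + st["S"]
    return -a_west / a_center, -a_east / a_center


if __name__ == "__main__":
    # Hypothetical 5-point stencil with a large jump in the diffusion
    # coefficient between the west and east neighbors; its row sum is zero.
    st = {"NW": 0.0, "N": -1.0, "NE": 0.0,
          "W": -1000.0, "C": 1003.0, "E": -1.0,
          "SW": 0.0, "S": -1.0, "SE": 0.0}
    print(collapse_in_y(st, row_sum_small=False))
    print(collapse_in_y(st, row_sum_small=True))
```

For this zero row sum stencil both branches return the same pair of weights, as stated in the text, and the weight toward the strongly coupled west neighbor is close to one rather than the value 1/2 that linear interpolation would use.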
Let

$$\gamma = \min \{ |NW + W + SW|,\ |NE + E + SE|,\ 1 \}. \tag{3.27}$$

Then by small we mean that

$$R_\Sigma < \gamma\, (NW + W + SW + N + S + NE + E + SE), \tag{3.28}$$

where $R_\Sigma$ is the row sum defined above.

The computation of the $I^s$ and $I^n$ coefficients is done by collapsing the grid operator in the x-direction to get the three point relation on the y-grid line. Let the interpolation formula be given by

$$A_{j-1}\, v_{j-1} + A_j\, v_j + A_{j+1}\, v_{j+1} = 0, \tag{3.29}$$

where $v_{j-1} = \{ v_{i,j-1} : i = 1, \ldots, n_x \}$, $v_j = \{ v_{i,j} : i = 1, \ldots, n_x \}$, $v_{j+1} = \{ v_{i,j+1} : i = 1, \ldots, n_x \}$, and $A_{j+1} = (NW + N + NE)_{i,j+1}$, $A_j = (W + C + E)_{i,j}$, and $A_{j-1} = (SW + S + SE)_{i,j-1}$. We now solve the equation for $v_j$ to get the interpolation formula in an explicit form, $v_j = -A_j^{-1} \left( A_{j-1}\, v_{j-1} + A_{j+1}\, v_{j+1} \right)$. The interpolation coefficients $I^s$ and $I^n$ are given by

$$I^s = -A_j^{-1} A_{j-1} \tag{3.30}$$

and

$$I^n = -A_j^{-1} A_{j+1}. \tag{3.31}$$

Writing out the coefficients explicitly gives

$$I^s = -\frac{SW + S + SE}{W + C + E}, \tag{3.32}$$

$$I^n = -\frac{NW + N + NE}{W + C + E}. \tag{3.33}$$
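If, however, the row sum is small, then instead of $(W + C + E)_j$ for $A_j$ we use $-(NW + N + NE + SW + S + SE)_j$. The coefficients are then defined by

$$I^s = \frac{SW + S + SE}{NW + N + NE + SW + S + SE}, \tag{3.34}$$

$$I^n = \frac{NW + N + NE}{NW + N + NE + SW + S + SE}, \tag{3.35}$$

where $I^s$ and $I^n$ are evaluated at $(i_c, j_c-1)$ and $(i_c, j_c)$ respectively, and the other coefficients on the right hand side are evaluated at $(i_f, j_f-1)$.

Let

$$\gamma = \min \{ |NW + N + NE|,\ |SW + S + SE|,\ 1 \}. \tag{3.36}$$

Then by small we mean that

$$R_\Sigma < \gamma\, (NW + N + NE + SW + S + SE), \tag{3.37}$$

where $R_\Sigma$ is the row sum.

The computation of the interpolation coefficients $I^{sw}$, $I^{nw}$, $I^{ne}$, and $I^{se}$ is similar to that of the coefficients that have already been computed. Let the interpolation formula be given by

$$\begin{aligned} & A_{i-1,j+1} v_{i-1,j+1} + A_{i,j+1} v_{i,j+1} + A_{i+1,j+1} v_{i+1,j+1} \\ +\ & A_{i-1,j} v_{i-1,j} + A_{i,j} v_{i,j} + A_{i+1,j} v_{i+1,j} \\ +\ & A_{i-1,j-1} v_{i-1,j-1} + A_{i,j-1} v_{i,j-1} + A_{i+1,j-1} v_{i+1,j-1} = 0, \end{aligned} \tag{3.38}$$

where the $A_{*,*}$ are just the corresponding grid operator coefficients. We can now solve for $v_{i,j}$ to get the interpolation formula,

$$\begin{aligned} v_{i,j} = -A_{i,j}^{-1} \big( & A_{i-1,j+1} v_{i-1,j+1} + A_{i,j+1} v_{i,j+1} + A_{i+1,j+1} v_{i+1,j+1} + A_{i-1,j} v_{i-1,j} + A_{i+1,j} v_{i+1,j} \\ & + A_{i-1,j-1} v_{i-1,j-1} + A_{i,j-1} v_{i,j-1} + A_{i+1,j-1} v_{i+1,j-1} \big). \end{aligned} \tag{3.39}$$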
Notice that $v_{i,j-1}$, $v_{i-1,j}$, $v_{i+1,j}$, and $v_{i,j+1}$ are unknowns. However, we can use their interpolated values that we computed above, being careful to note that their stencils are all centered at different grid points. After performing the substitutions and collecting the terms for $v_{i-1,j-1}$, $v_{i+1,j-1}$, $v_{i-1,j+1}$, and $v_{i+1,j+1}$ we get

$$v_{i,j} = I^{sw}\, v_{i-1,j-1} + I^{nw}\, v_{i-1,j+1} + I^{ne}\, v_{i+1,j+1} + I^{se}\, v_{i+1,j-1}, \tag{3.40}$$

where instead of having to compute everything all over again, it can be seen that $I^{sw}$, $I^{nw}$, $I^{ne}$, and $I^{se}$ can be expressed in terms of the previous four coefficients, $I^w$, $I^e$, $I^s$, and $I^n$. However, we must now explicitly write the subscripts for the coefficients $I^w$, $I^e$, $I^s$, and $I^n$ to indicate where their stencils are centered relative to the interpolated point's stencil, which is centered at $(i,j)$. The formulas for the four coefficients are

$$I^{sw} = -\frac{SW + S\, I^w_{i,j-1} + W\, I^s_{i-1,j}}{C}, \tag{3.41}$$

where $I^{sw}$ is evaluated at $(x_{i_c-1}, y_{j_c-1})$,

$$I^{nw} = -\frac{NW + N\, I^w_{i,j+1} + W\, I^n_{i-1,j}}{C}, \tag{3.42}$$

where $I^{nw}$ is evaluated at $(x_{i_c-1}, y_{j_c})$,

$$I^{ne} = -\frac{NE + N\, I^e_{i,j+1} + E\, I^n_{i+1,j}}{C}, \tag{3.43}$$

where $I^{ne}$ is evaluated at $(x_{i_c}, y_{j_c})$,

$$I^{se} = -\frac{SE + S\, I^e_{i,j-1} + E\, I^s_{i+1,j}}{C}, \tag{3.44}$$

where $I^{se}$ is evaluated at $(x_{i_c}, y_{j_c-1})$, and the other stencil coefficients are evaluated at $(x_{i_f}, y_{j_f})$.

If, however, $R_\Sigma$ is small, then

$$I^{sw}_{i_c-1,j_c-1} = \frac{SW + S\, I^w_{i,j-1} + W\, I^s_{i-1,j}}{NW + N + NE + W + E + SW + S + SE}, \tag{3.45}$$

$$I^{nw}_{i_c-1,j_c} = \frac{NW + N\, I^w_{i,j+1} + W\, I^n_{i-1,j}}{NW + N + NE + W + E + SW + S + SE}, \tag{3.46}$$
$$I^{ne}_{i_c,j_c} = \frac{NE + N\, I^e_{i,j+1} + E\, I^n_{i+1,j}}{NW + N + NE + W + E + SW + S + SE}, \tag{3.47}$$

$$I^{se}_{i_c,j_c-1} = \frac{SE + S\, I^e_{i,j-1} + E\, I^s_{i+1,j}}{NW + N + NE + W + E + SW + S + SE}, \tag{3.48}$$

where NW, N, NE, W, C, E, SW, S, and SE are evaluated at $(x_{i_f}, y_{j_f})$. Let

$$\gamma = \min \{ |SW + W + NW|,\ |NW + N + NE|,\ |NE + E + SE|,\ |SE + S + SW|,\ 1 \}. \tag{3.49}$$

Then by small we mean that

$$R_\Sigma < \gamma\, (NW + N + NE + W + E + SW + S + SE). \tag{3.50}$$

The interpolation correction terms are $A_i^{-1} r^H$, $A_j^{-1} r^H$, or $A_{i,j}^{-1} r^H$ for the corresponding interpolation formulas above, where $r^H$ is the residual on the coarse grid. Note that the A's change depending on whether $R_\Sigma$ is small or not.

The computation of the interpolation coefficients in this way was used in the BOXMG, BOXMGP, BBMG, and BBMGP codes for symmetric problems [1], [26], [30], [10]. Similar computations have also been used for most black box, geometric, and algebraic multigrid solvers for symmetric problems arising from finite difference and finite volume discretizations using either a 5-point or a 9-point standard stencil [7], [23], [29], [31], [52], [54], [53], [55], [63], [85], [24].

The computation of the restriction operator's coefficients is closely related to that of the interpolation coefficients. In fact, in the symmetric case, the restriction coefficients for the symmetric grid operator $L^h$ can be taken to be equal to the interpolation coefficients; see equation (3.51).
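The corner coefficient formulas can be checked on a simple case. The following Python sketch evaluates $I^{sw}$ from equation (3.41) and from its small row sum variant (3.45); the other three corners follow the same pattern. It is an illustration under stated assumptions: the stencil is a dictionary of signed coefficients, the function name `corner_sw` and its argument names are hypothetical, and the caller supplies both the neighboring weights and the decision of whether the row sum is small.

```python
def corner_sw(st, iw_south, is_west, row_sum_small):
    """Return I^sw for the fine grid point (i, j).

    st           : signed 9-point stencil coefficients at (i, j)
    iw_south     : I^w of the stencil centered at (i, j-1)
    is_west      : I^s of the stencil centered at (i-1, j)
    row_sum_small: True selects equation (3.45), False selects (3.41)
    """
    numer = st["SW"] + st["S"] * iw_south + st["W"] * is_west
    if row_sum_small:
        # equation (3.45): divide by the sum of the off-diagonal coefficients
        off_diag = sum(value for name, value in st.items() if name != "C")
        return numer / off_diag
    # equation (3.41): divide by the center coefficient and change sign
    return -numer / st["C"]


if __name__ == "__main__":
    # 5-point Laplacian stencil, whose row sum is zero
    lap = {"NW": 0.0, "N": -1.0, "NE": 0.0,
           "W": -1.0, "C": 4.0, "E": -1.0,
           "SW": 0.0, "S": -1.0, "SE": 0.0}
    print(corner_sw(lap, iw_south=0.5, is_west=0.5, row_sum_small=False))
    print(corner_sw(lap, iw_south=0.5, is_west=0.5, row_sum_small=True))
```

For the Laplacian both branches reduce to the bilinear weight 1/4, which is a useful sanity check on the signs in the two formulas.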
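3.5 Nonsymmetric Grid Operator $L^h$: Collapsing Methods

The interpolation coefficients can be computed in the same way as in the symmetric case, except that we replace all of the grid operator's coefficients with their equivalent symmetric stencil coefficients, denoted by $\sigma(\cdot)$. However, the row sum definition, $R_\Sigma$, remains unchanged.

3.5.1 Prolongation Based on symm($L^h$)

The computation of the $I^w$ and $I^e$ coefficients is given by

$$I^w = -\frac{\sigma NW + \sigma W + \sigma SW}{\sigma N + \sigma C + \sigma S}, \tag{3.52}$$

$$I^e = -\frac{\sigma NE + \sigma E + \sigma SE}{\sigma N + \sigma C + \sigma S}. \tag{3.53}$$

If, however, $R_\Sigma$ is small, then

$$I^w = \frac{\sigma NW + \sigma W + \sigma SW}{\sigma NW + \sigma W + \sigma SW + \sigma NE + \sigma E + \sigma SE}, \tag{3.54}$$

$$I^e = \frac{\sigma NE + \sigma E + \sigma SE}{\sigma NW + \sigma W + \sigma SW + \sigma NE + \sigma E + \sigma SE}. \tag{3.55}$$

In (3.52)-(3.55) $I^w$ and $I^e$ are evaluated at $(x_{i_c-1}, y_{j_c})$ and $(x_{i_c}, y_{j_c})$ respectively, and the other coefficients on the right hand side are evaluated at $(x_{i_f-1}, y_{j_f})$ for the $L^h$ components. Let

$$\gamma = \min \{ |\sigma NW + \sigma W + \sigma SW|,\ |\sigma NE + \sigma E + \sigma SE|,\ 1 \}. \tag{3.56}$$

Then by small we mean that

$$R_\Sigma < \gamma\, (\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE). \tag{3.57}$$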
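The formulas for the $I^n$ and $I^s$ coefficients are

$$I^n = -\frac{\sigma NW + \sigma N + \sigma NE}{\sigma W + \sigma C + \sigma E}, \tag{3.58}$$

$$I^s = -\frac{\sigma SW + \sigma S + \sigma SE}{\sigma W + \sigma C + \sigma E}. \tag{3.59}$$

If, however, $R_\Sigma$ is small, then

$$I^n = \frac{\sigma NW + \sigma N + \sigma NE}{\sigma NW + \sigma N + \sigma NE + \sigma SW + \sigma S + \sigma SE}, \tag{3.60}$$

$$I^s = \frac{\sigma SW + \sigma S + \sigma SE}{\sigma NW + \sigma N + \sigma NE + \sigma SW + \sigma S + \sigma SE}, \tag{3.61}$$

where $I^n$ and $I^s$ are evaluated at $(x_{i_c}, y_{j_c})$ and $(x_{i_c}, y_{j_c-1})$ respectively, and the other coefficients on the right hand side are evaluated at $(x_{i_f}, y_{j_f-1})$ for the $L^h$ components. Let

$$\gamma = \min \{ |\sigma NW + \sigma N + \sigma NE|,\ |\sigma SW + \sigma S + \sigma SE|,\ 1 \}. \tag{3.62}$$

Then by small we mean that

$$R_\Sigma < \gamma\, (\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE). \tag{3.63}$$

The interpolation coefficients $I^{sw}$, $I^{nw}$, $I^{ne}$, and $I^{se}$ can be expressed in terms of the other four coefficients:

$$I^{sw} = -\frac{\sigma SW + \sigma S\, I^w_{i,j-1} + \sigma W\, I^s_{i-1,j}}{\sigma C}, \tag{3.64}$$

$$I^{nw} = -\frac{\sigma NW + \sigma N\, I^w_{i,j+1} + \sigma W\, I^n_{i-1,j}}{\sigma C}, \tag{3.65}$$

$$I^{ne} = -\frac{\sigma NE + \sigma N\, I^e_{i,j+1} + \sigma E\, I^n_{i+1,j}}{\sigma C}, \tag{3.66}$$

$$I^{se} = -\frac{\sigma SE + \sigma S\, I^e_{i,j-1} + \sigma E\, I^s_{i+1,j}}{\sigma C}. \tag{3.67}$$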
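If, however, $R_\Sigma$ is small, then

$$I^{sw} = \frac{\sigma SW + \sigma S\, I^w_{i,j-1} + \sigma W\, I^s_{i-1,j}}{\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE}, \tag{3.68}$$

$$I^{nw} = \frac{\sigma NW + \sigma N\, I^w_{i,j+1} + \sigma W\, I^n_{i-1,j}}{\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE}, \tag{3.69}$$

$$I^{ne} = \frac{\sigma NE + \sigma N\, I^e_{i,j+1} + \sigma E\, I^n_{i+1,j}}{\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE}, \tag{3.70}$$

$$I^{se} = \frac{\sigma SE + \sigma S\, I^e_{i,j-1} + \sigma E\, I^s_{i+1,j}}{\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE}, \tag{3.71}$$

where $\sigma NW$, $\sigma N$, $\sigma NE$, $\sigma W$, $\sigma C$, $\sigma E$, $\sigma SW$, $\sigma S$, and $\sigma SE$ are evaluated at $(x_{i_f}, y_{j_f})$ for the $L^h$ components. Let

$$\gamma = \min \{ |\sigma SW + \sigma W + \sigma NW|,\ |\sigma NW + \sigma N + \sigma NE|,\ |\sigma NE + \sigma E + \sigma SE|,\ |\sigma SE + \sigma S + \sigma SW|,\ 1 \}. \tag{3.72}$$

Then by small we mean that

$$R_\Sigma < \gamma\, (\sigma NW + \sigma N + \sigma NE + \sigma W + \sigma E + \sigma SW + \sigma S + \sigma SE). \tag{3.73}$$

It has been found in practice that the restriction operator $I_h^H$ need not be based on the same operator as the interpolation operator, so we change its symbol to $J_h^H$ to reflect this change. The restriction operator's coefficients are based on $(L^h)^T$ instead of $\sigma L^h$. The restriction coefficients are computed in exactly the same way as the interpolation coefficients, except that all of the grid operator's coefficients in the computations are replaced by their transposes. The computations for the restriction coefficients are now straightforward and will not be written out.

The grid transfer operators have been computed in this fashion for the black box multigrid solver for nonsymmetric problems [27]. It should be noted that when the grid operator $L^h$ is symmetric, the computations given here for both the symmetric case and the nonsymmetric case yield the same grid transfer coefficients.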
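The symmetric stencil coefficients $\sigma(\cdot)$ used throughout this section follow the pattern of equation (3.15): each coefficient is averaged with the opposite coefficient of the neighboring stencil it points to. The following Python sketch is a minimal illustration under stated assumptions. The storage layout (a dictionary of two dimensional lists indexed `[i][j]`), the helper names `zero_stencil` and `symmetric_part`, and the boundary treatment (a coefficient pointing outside the grid is left unchanged) are all choices made here and are not taken from the thesis codes.

```python
# For each coefficient: the opposite coefficient and the offset (di, dj) of the
# neighboring grid point whose stencil holds the adjoint entry, as in (3.15).
ADJOINT = {
    "C": ("C", 0, 0),
    "N": ("S", 0, 1), "S": ("N", 0, -1),
    "E": ("W", 1, 0), "W": ("E", -1, 0),
    "NE": ("SW", 1, 1), "SW": ("NE", -1, -1),
    "NW": ("SE", -1, 1), "SE": ("NW", 1, -1),
}


def zero_stencil(nx, ny):
    """Allocate a 9-point stencil with all coefficients set to zero."""
    return {name: [[0.0] * ny for _ in range(nx)] for name in ADJOINT}


def symmetric_part(stencil, nx, ny):
    """Return sigma(L) = (L + L*)/2 coefficient by coefficient."""
    sym = zero_stencil(nx, ny)
    for name, (opposite, di, dj) in ADJOINT.items():
        for i in range(nx):
            for j in range(ny):
                k, l = i + di, j + dj
                if 0 <= k < nx and 0 <= l < ny:
                    adjoint = stencil[opposite][k][l]
                else:
                    # assumption: no neighbor exists, keep the coefficient
                    adjoint = stencil[name][i][j]
                sym[name][i][j] = 0.5 * (stencil[name][i][j] + adjoint)
    return sym


if __name__ == "__main__":
    # Hypothetical one dimensional upwind-like operator on three points
    op = zero_stencil(3, 1)
    for i in range(3):
        op["W"][i][0], op["C"][i][0], op["E"][i][0] = -1.0, 3.0, -2.0
    sym = symmetric_part(op, 3, 1)
    print(sym["E"][0][0], sym["W"][1][0])  # the two entries now agree
```

After symmetrization the east coefficient at a point equals the west coefficient at its eastern neighbor, which is exactly the property a symmetric grid operator has.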
3.5.2 Prolongation Based on $L^h$ and symm($L^h$)

The third possibility for computing the grid transfer operators is one that uses the same form of the computations as above; see section 3.5.1. This prolongation is a point collapse approximation to Schaffer's ideas; see section 3.6. The only difference from the above computations for the nonsymmetric case is that for the denominators, $A_i^{-1}$ and $A_j^{-1}$, we use the coefficients based on $L^h$ instead of $\sigma L^h$. The test for small is still in the same form as before except that $L^h$ is used, but $\gamma$ is still based on $\sigma L^h$. The restriction operator coefficients are computed as before, but the denominator is now based on $L^h$ instead of on $(L^h)^T$.

3.5.3 Grid Transfer Operators Based on a Hybrid Form of $L^h$ and symm($L^h$)

The prolongation operator coefficients are computed in the same way as in section 3.5.2. However, the computation of the restriction operator coefficients has been modified into a hybrid form that uses both $L^T$ and $L$. The difference in the computation of the restriction coefficients comes into play when the switch is made in the denominator, $A_i^{-1}$ and $A_j^{-1}$, because the row sum is small. When the row sum is large we modify the denominator by adding in two coefficients from the grid operator L. We can illustrate this modification by computing the restriction coefficients $J^w$ and $J^e$:

$$J^w = -\frac{(NW)^T + (W)^T + (SW)^T}{N + C + S}, \tag{3.74}$$

$$J^e = -\frac{(NE)^T + (E)^T + (SE)^T}{N + C + S}. \tag{3.75}$$

If, however, $R_\Sigma$ is small, then

$$J^w = \frac{(NW)^T + (W)^T + (SW)^T}{(NW)^T + (W)^T + (SW)^T + N + S + (NE)^T + (E)^T + (SE)^T}, \tag{3.76}$$
$$J^{e} = \frac{(NE)^T + (E)^T + (SE)^T}{(NW)^T + (W)^T + (SW)^T + N + S + (NE)^T + (E)^T + (SE)^T}. \tag{3.77}$$

In (3.74) through (3.77), $J^w$ and $J^e$ are evaluated at $(x_{i_c-1}, y_{j_c})$ and $(x_{i_c}, y_{j_c})$ respectively, and the other coefficients on the right hand side are evaluated at $(x_{i_f-1}, y_{j_f})$ for the $L^h$ components. Let

$$\gamma = \min\left\{ \left|(NW)^T + (W)^T + (SW)^T\right|,\ \left|(NE)^T + (E)^T + (SE)^T\right|,\ 1 \right\}. \tag{3.78}$$

Then by small we mean that

$$R_\Sigma < \gamma \left( (NW)^T + (W)^T + (SW)^T + (N)^T + (S)^T + N + S + (NE)^T + (E)^T + (SE)^T \right). \tag{3.79}$$

The restriction coefficients $J^n$ and $J^s$ are computed in a similar way.

The motivation behind these modifications was to get the coarse grid operator to approximate the one obtained when using the extension of Schaffer's idea; see section 3.6. The grid operators from section 3.5.2 were computed to approximate the grid transfer coefficients based on an extension of Schaffer's idea. The method in this section attempts to do the same thing, but it also makes some modifications so that the coarse grid operator more closely approximates the one obtained in section 3.6.1.

3.6 Nonsymmetric Grid Operators: Extension of Schaffer's Idea

The second class of grid transfer routines is based on Schaffer's idea for grid transfer operators in his semicoarsening multigrid method [70]. Schaffer's idea is to approximate a full matrix by a diagonal matrix to compute the grid transfer operators.
Schaffer's idea was used in the development of the semicoarsening black box multigrid method [32]. We took Schaffer's idea and extended it to apply to the standard coarsening grid transfer operators.

The ideas used in the semicoarsening method are as follows. Suppose that coarsening takes place only in the y-direction. Then the interpolation to points on the fine grid can be represented by

$$A_{j-1} v_{j-1} + A_j v_j + A_{j+1} v_{j+1} = 0, \tag{3.80}$$

where $v_k = \{ v_{i,k} : i = 1, \ldots, n_x \}$ for $k = j-1, j, j+1$. The tridiagonal matrices $A_{j-1}$, $A_j$, and $A_{j+1}$ represent the nine point grid operator on the $j-1$, $j$, and $j+1$ grid lines respectively:

$$A_{j+1} = \operatorname{tridiag}[NW, N, NE]_{j+1}, \qquad A_{j} = \operatorname{tridiag}[W, C, E]_{j}, \qquad A_{j-1} = \operatorname{tridiag}[SW, S, SE]_{j-1}.$$

As before, we solve this equation for $v_j$ to get

$$v_j = -A_j^{-1} \left( A_{j-1} v_{j-1} + A_{j+1} v_{j+1} \right), \tag{3.81}$$

where we have assumed that $A_j^{-1}$ exists and that $A_j$ can be stably inverted. This assumption cannot always be guaranteed, but Schaffer's and our methods allow line relaxation as a smoother, where these assumptions are necessary. The methods would fail if the assumptions did not hold, so in that sense we can say that the assumptions hold.

From equation (3.81), we form the quantities $-A_j^{-1} A_{j-1}$ and $-A_j^{-1} A_{j+1}$, leading to a nonsparse interpolation operator. Here a sparse interpolation operator is one that involves only $v_{i,j-1}$ and $v_{i,j+1}$ for interpolation at the point $(i, j)$. If the interpolation operator is not sparse, then the coarse grid operators formed by the Galerkin coarse grid approximation approach will grow
beyond a 9-point stencil. This is a property that we would very much like to avoid, since it would lead to full operators on the coarser grid levels. Schaffer's idea, also arrived at independently by Dendy, is to approximate these quantities with diagonal matrices $B_{j-1}$ and $B_{j+1}$. This is accomplished by solving the following relations:

$$-A_j^{-1} A_{j-1}\, e = B_{j-1}\, e, \qquad -A_j^{-1} A_{j+1}\, e = B_{j+1}\, e, \tag{3.82}$$

where $e = (1, 1, \ldots, 1)^T$. They can be solved quickly because they are tridiagonal equations. After solving, the diagonal entries of $B_{j-1}$ and $B_{j+1}$ are just the interpolation coefficients $I^s$ and $I^n$ respectively.

In the semicoarsening case the restriction operator is still based on the transpose of the nonsymmetric grid operator $L^h$. This is done by replacing $A_{j-1}$, $A_j$, and $A_{j+1}$ by their transposes to get $(A_{j-1})^*$, $(A_j)^*$, and $(A_{j+1})^*$ respectively.

3.6.1 Extension of Schaffer's Idea to Standard Coarsening

The above was presented in a manner suitable for the symmetric case. It can be modified for the nonsymmetric case, as we did for the collapsing methods, by using the symmetric part of the operator. We can do this by replacing $A_*$ with $\sigma A_*$ in equation (3.82) to get

$$-(\operatorname{symm}(A_j))^{-1} \operatorname{symm}(A_{j-1})\, e = B_{j-1}\, e, \qquad -(\operatorname{symm}(A_j))^{-1} \operatorname{symm}(A_{j+1})\, e = B_{j+1}\, e. \tag{3.83}$$

Schaffer constructs his grid transfer operators in a different manner, and his construction for variable coefficient problems can yield a nonsymmetric coarse grid operator $L^H$ even if $L^h$ is symmetric. We would like the coarse grid operators to be symmetric whenever the fine grid operator is symmetric. We can do this in several
ways, but a more efficient construction is to replace equation (3.83) with

$$-A_j^{-1} \operatorname{symm}(A_{j-1})\, e = B_{j-1}\, e, \qquad -A_j^{-1} \operatorname{symm}(A_{j+1})\, e = B_{j+1}\, e. \tag{3.84}$$

The advantage of this form is that it can use the same tridiagonal system solver that we are already using for the line solves in the multigrid smoother. Equation (3.83) would require an additional tridiagonal solve for symm($A_j$) and additional storage if the LU factors are to be saved.

Extending these ideas to the standard coarsening case is quite easy. We first compute the grid transfer coefficients for semicoarsening in the y-direction, and define $v_k = \{ v_{i,k} : i = 1, \ldots, n_x \}$ for $k = j-1, j, j+1$ and the tridiagonal matrices

$$A_{j+1} = \operatorname{tridiag}[\sigma NW, \sigma N, \sigma NE]_{j+1}, \qquad A_{j} = \operatorname{tridiag}[W, C, E]_{j}, \qquad A_{j-1} = \operatorname{tridiag}[\sigma SW, \sigma S, \sigma SE]_{j-1}.$$

We save the diagonals of $B_{j-1}$ and $B_{j+1}$ associated with coarse grid lines in the x-direction as the $I^s$ and $I^n$ interpolation coefficients respectively. To obtain the coefficients for the y-direction, we compute the grid transfer coefficients for semicoarsening in the x-direction and define $v_k = \{ v_{k,j} : j = 1, \ldots, n_y \}$ for $k = i-1, i, i+1$ and the tridiagonal matrices

$$A_{i+1} = \operatorname{tridiag}[\sigma SW, \sigma W, \sigma NW]_{i+1}, \qquad A_{i} = \operatorname{tridiag}[S, C, N]_{i}, \qquad A_{i-1} = \operatorname{tridiag}[\sigma SE, \sigma E, \sigma NE]_{i-1}.$$
We save the diagonals of $B_{i-1}$ and $B_{i+1}$ associated with coarse grid lines in the x-direction as the $I^w$ and $I^e$ interpolation coefficients respectively. Finally, we combine the semicoarsening coefficients from the x and y lines to obtain the $I^{sw}$, $I^{nw}$, $I^{ne}$, and $I^{se}$ interpolation coefficients. They can be computed as the product of the coefficients that have already been computed,

$$I^{nw} = I^n \cdot I^w, \qquad I^{ne} = I^n \cdot I^e, \qquad I^{sw} = I^s \cdot I^w, \qquad I^{se} = I^s \cdot I^e, \tag{3.85}$$

or elimination can be used as before.

The restriction operator for the extension to the standard coarsening case is computed as above, but the transpose of the grid operator is used instead of the symmetric part of the operator. This is done by replacing $A_{j-1}$ and $A_{j+1}$ by their transposes to get $(A_{j-1})^*$ and $(A_{j+1})^*$ respectively.

3.7 Conclusions Regarding Grid Transfer Operators

Many other grid transfer operators were tried in the standard coarsening black box multigrid method in addition to those presented above. However, only three were deemed robust and efficient enough to include in a release version of the solver. The three choices for grid transfer operators are the original nonsymmetric collapsing method described in section 3.5.1, the nonsymmetric hybrid collapsing method described in section 3.5.3, and the nonsymmetric extension of Schaffer's ideas described in section 3.6.1. While all three of these choices are good, better results were obtained with the latter two for all test and application problems run to date.

Most of the other grid transfer operators that were tried performed well on some of the test problems but failed on others. Taken together, there do appear to be enough good results to cover all the test problems, with the exception of reentrant flows. However, to unify these into one set of grid transfer operators would be much
more expensive to compute and may also introduce trouble when combining the various types of grid transfer operators.

The grid transfer operators from section 3.5.2, which use a collapsing method to try to approximate the extension of Schaffer's ideas for nonsymmetric problems, were a disappointment. While they seemed to be a good idea, they turned out not to be very robust, and in several cases they actually caused divergence of the multigrid method. This bad behavior prompted an examination of the coarse grid operators and grid transfer operators. After comparing the operators with those obtained from Schaffer's ideas, it was noticed that several things were wrong, but with the modifications described in section 3.5.3 these problems were overcome. These new grid transfer operators extended Schaffer's ideas to standard coarsening very well.
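To make the diagonal approximation of section 3.6 concrete, the following is a minimal Python sketch, not the thesis code. It assumes the sign convention $v_j = -A_j^{-1}(A_{j-1} v_{j-1} + A_{j+1} v_{j+1})$, so that the diagonal of $B_{j-1}$ comes from a single tridiagonal solve $A_j b = -A_{j-1} e$. The function names `thomas`, `row_sums`, and `schaffer_diagonal`, and the storage of each tridiagonal matrix as three lists (sub-, main, and super-diagonal), are illustrative assumptions.

```python
def thomas(sub, diag, sup, rhs):
    """Solve a tridiagonal system; sub[0] and sup[-1] are ignored."""
    n = len(rhs)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = sup[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        pivot = diag[i] - sub[i] * c[i - 1]
        c[i] = sup[i] / pivot
        d[i] = (rhs[i] - sub[i] * d[i - 1]) / pivot
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def row_sums(sub, diag, sup):
    """Return A e for a tridiagonal A and e = (1, ..., 1)^T."""
    n = len(diag)
    return [(sub[i] if i > 0 else 0.0) + diag[i] + (sup[i] if i < n - 1 else 0.0)
            for i in range(n)]


def schaffer_diagonal(a_center, a_neighbor):
    """Diagonal of B with -A_center^{-1} A_neighbor e = B e (assumed sign convention)."""
    rhs = [-s for s in row_sums(*a_neighbor)]
    return thomas(*a_center, rhs)


# Illustration: 5-point Laplacian lines, A_j = tridiag[-1, 4, -1], A_{j-1} = diag(-1).
n = 41
a_j = ([-1.0] * n, [4.0] * n, [-1.0] * n)
a_jm1 = ([0.0] * n, [-1.0] * n, [0.0] * n)
i_south = schaffer_diagonal(a_j, a_jm1)
```

For this constant coefficient example the weights away from the ends of the line approach 1/2, which is the linear interpolation weight one would expect for semicoarsening in y. A nonsymmetric variant along the lines of (3.84) would pass the symmetric part of the neighboring line as `a_neighbor`.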
CHAPTER 4
BASIC ITERATION METHODS FOR SMOOTHERS

In this chapter we examine several basic iteration schemes for use as smoothers in the Black Box Multigrid solvers. Fourier mode analysis is used to identify which scheme makes the best smoother for a given type of model problem in two dimensions.

In this chapter we use parentheses around a superscript to denote an iteration index. For example, $u^{(n)}$ means the $n$th iterate.

4.1 Overview of Basic Iteration Methods

All of the methods in this section can be characterized in the following way. The algebraic system of equations to be solved is given by the matrix equation

$$L u = f. \tag{4.1}$$

The matrix $L$ is an $N_{xy} \times N_{xy}$ matrix, where $N_{xy} = n_x n_y$. The computational grid is two dimensional with $n_x$ and $n_y$ grid points in the x- and y-directions respectively. The matrix $L$ can be split as

$$L = M - N, \tag{4.2}$$

where $M$ is nonsingular and assumed easy to invert. Then a basic iteration method for the solution of equation (4.1) is given by

$$M u^{(n+1)} = N u^{(n)} + f \tag{4.3}$$
or as

$$u^{(n+1)} = S u^{(n)} + M^{-1} f, \tag{4.4}$$

where $S = M^{-1} N$ is called the iteration matrix. The basic iteration method can also be damped, and if the damping parameter is $\omega$, then the damped method is given by

$$M u^{(n+1)} = \left( (1 - \omega) M + \omega N \right) u^{(n)} + \omega f \tag{4.5}$$

or by

$$u^{(n+1)} = \tilde{S} u^{(n)} + \omega M^{-1} f, \tag{4.6}$$

where $\tilde{S}$ is now given by

$$\tilde{S} = \omega S + (1 - \omega) I \tag{4.7}$$

and $I$ is the identity matrix. When $\omega = 1$ we recover the undamped basic iterative method. The eigenvalues of the damped basic iteration matrix $\tilde{S}$ can be given in terms of the eigenvalues of the undamped basic iteration matrix $S$. They are related by

$$\lambda(\tilde{S}) = \omega \lambda(S) + 1 - \omega, \tag{4.8}$$

where $\omega$ is the damping parameter and $\lambda(S)$ on the right hand side of the equation is an eigenvalue of $S$, the undamped iteration matrix.

The error after the $n$th iteration is

$$e^{(n)} = u - u^{(n)}, \tag{4.9}$$

where $u$ is a solution (unique if $L$ is nonsingular) to equation (4.1). The error at the $(n+1)$st iteration is related to the error at the $n$th iteration by

$$e^{(n+1)} = S e^{(n)}, \tag{4.10}$$
where $S$ is the iteration matrix defined above ($\tilde{S}$ can also replace $S$ in the equation). From equation (4.10) it follows by induction that $e^{(n)}$ can be written in terms of the original error, $e^{(0)}$, as

$$e^{(n)} = S^n e^{(0)}, \tag{4.11}$$

where the superscript $n$ on $S$ is now an exponent and $n \ge 0$. In terms of vector norms, we have

$$\| e^{(n)} \| = \| S^n e^{(0)} \| \le \| S^n \| \, \| e^{(0)} \|, \tag{4.12}$$

where $\| \cdot \|$ is any vector norm, with the induced matrix norm used for $\| S^n \|$. The term $\| S \|$ is called the contraction number of the basic iterative method.

The spectral radius of $S$ is defined as

$$\rho(S) = \max |\lambda(S)|. \tag{4.13}$$

The basic iterative method is said to be convergent if

$$\rho(S) < 1, \tag{4.14}$$

in which case we have that

$$\lim_{n \to \infty} \| S^n \| = 0. \tag{4.15}$$

If the method is convergent, then this also implies that

$$\lim_{n \to \infty} \| e^{(n)} \| = 0 \tag{4.16}$$

for any initial choice of $e^{(0)}$ if and only if $\rho(S) = \max |\lambda(S)| < 1$.

Some other useful results, which are similar to those found in Varga [82], are:
Theorem 4.1.1 If $\rho(S)$ is the spectral radius of $S$, then if

$$\rho(S) \le \| S \| \tag{4.17}$$

then (4.18)

Theorem 4.1.2 If $\rho(S) < 1$ then

$$\lim_{n \to \infty} \left( \| S^n \| \right)^{1/n} = \rho(S) \tag{4.19}$$

for any suitable induced matrix norm.

We can also define the reduction factor and the rate of convergence for basic iterative methods.

Definition 4.1.1 The reduction factor is defined as

$$\tau = \frac{\| e^{(n+1)} \|}{\| e^{(n)} \|} \le \| S \|, \tag{4.20}$$

where $\tau$ is the reduction factor per iteration.

Definition 4.1.2 The average reduction factor, $\bar{\tau}$, is given by

$$\bar{\tau} = \left( \frac{\| e^{(n)} \|}{\| e^{(0)} \|} \right)^{1/n} \le \| S^n \|^{1/n} \tag{4.21}$$

after $n$ iterations.

Definition 4.1.3 The average rate of convergence can be defined as

$$R(S^n) = -\frac{1}{n} \log \| S^n \|, \tag{4.22}$$

where $n > 0$ and $\| S^n \| < 1$.

Definition 4.1.4 The asymptotic rate of convergence is then defined to be

$$R_\infty(S) = \lim_{n \to \infty} R(S^n) = -\log \rho(S). \tag{4.23}$$
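The quantities in Definitions 4.1.2 and 4.1.4 can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: it uses damped point Jacobi for the one dimensional operator tridiag$[-1, 2, -1]$ with Dirichlet boundaries (a model problem chosen here, not one taken from the thesis), for which the undamped eigenvalues are $\cos(k\pi/(n+1))$, and the function names are hypothetical.

```python
import math


def damped_jacobi_error_step(e, omega):
    """One error propagation step e <- S~ e, with S~ = omega * S + (1 - omega) * I."""
    n = len(e)
    new = []
    for i in range(n):
        left = e[i - 1] if i > 0 else 0.0
        right = e[i + 1] if i < n - 1 else 0.0
        undamped = 0.5 * (left + right)  # (S e)_i for tridiag[-1, 2, -1]
        new.append(omega * undamped + (1.0 - omega) * e[i])
    return new


def spectral_radius_damped_jacobi(n, omega):
    """rho(S~) from the relation lambda(S~) = omega * lambda(S) + 1 - omega."""
    return max(abs(omega * math.cos(k * math.pi / (n + 1)) + 1.0 - omega)
               for k in range(1, n + 1))


def average_reduction_factor(n, omega, iterations):
    """(||e^(n)|| / ||e^(0)||)^(1/n) in the 2-norm, starting from e^(0) = (1, ..., 1)."""
    e = [1.0] * n
    norm0 = math.sqrt(sum(v * v for v in e))
    for _ in range(iterations):
        e = damped_jacobi_error_step(e, omega)
    norm = math.sqrt(sum(v * v for v in e))
    return (norm / norm0) ** (1.0 / iterations)


rho = spectral_radius_damped_jacobi(7, 2.0 / 3.0)
tau_bar = average_reduction_factor(7, 2.0 / 3.0, 400)
asymptotic_rate = -math.log(rho)  # R_infinity(S~)
```

In this symmetric example the average reduction factor stays below the spectral radius and approaches it as the number of iterations grows, which is the behavior Theorem 4.1.2 describes.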
Theorem 4.1.3 If $\| S^n \| < 1$ for $n > 0$, then

$$R_\infty(S) \ge R(S^n). \tag{4.24}$$

Since $L$ is a linear operator, we can write the defect equation for (4.1) above as

$$L e^{(n)} = r^{(n)}, \tag{4.25}$$

where $r^{(n)} = f - L u^{(n)}$ is the residual after the $n$th iteration. We can also relate $r^{(n+1)}$ to $r^{(n)}$ by using equation (4.10) to get

$$r^{(n+1)} = \hat{S} r^{(n)}, \tag{4.26}$$

where $\hat{S} = N M^{-1}$ is the iteration matrix for the residual.

Since the residual and the error are related by the defect equation (4.25), as the residual $r^{(n)}$ goes to the zero vector the error $e^{(n)}$ will also go to zero, as long as $L$ is nonsingular. If $L$ is singular, then the error $e^{(n)}$ will tend to a vector in the null space of $L$. This can and does happen for pure Neumann boundary conditions, where any solution obtained is only unique up to a constant. In this case special care, usually through normalization, needs to be taken to ensure that a solution is obtained.

4.2 Gauss-Seidel Relaxation

In this section we define several relaxation methods which we have considered for inclusion in the black box multigrid methods. We will also discuss their appropriateness for inclusion in both the vector and parallel black box multigrid solvers.

We need to define some notation concerning the grid point equations. The grid that was used to generate the system of equations is the one that was defined in
section 1.3, and is the rectangular grid

$$x_i = a_x + \sum_{k=1}^{i} h_{x_k}, \quad i = 1, \ldots, n_x, \qquad y_j = a_y + \sum_{k=1}^{j} h_{y_k}, \quad j = 1, \ldots, n_y, \tag{4.27}$$

where $G^h = \Omega^h$. The grid point equation on grid $G$, suppressing the superscript $h$, is then

$$S\, u_{i,j-1} + W\, u_{i-1,j} + C\, u_{i,j} + E\, u_{i+1,j} + N\, u_{i,j+1} = F_{i,j}, \tag{4.28}$$

where the subscript $(i, j)$ on the coefficients has been suppressed. Likewise, the 9-point stencil is given by

$$\begin{aligned} &SW\, u_{i-1,j-1} + S\, u_{i,j-1} + SE\, u_{i+1,j-1} \\ +\, &W\, u_{i-1,j} + C\, u_{i,j} + E\, u_{i+1,j} \\ +\, &NW\, u_{i-1,j+1} + N\, u_{i,j+1} + NE\, u_{i+1,j+1} = F_{i,j}. \end{aligned} \tag{4.29}$$

4.2.1 Point Gauss-Seidel Iteration

We are using a multicolor point Gauss-Seidel relaxation method. If the stencil for the grid operator $L^h$ is a 5-point stencil, then a red/black version is used, and if the stencil is a 9-point stencil, then a 4-color version is used.

The point Gauss-Seidel relaxation method without multicoloring for the 5-point stencil is given by

$$S\, u^{(n+1)}_{i,j-1} + W\, u^{(n+1)}_{i-1,j} + C\, u^{(n+1)}_{i,j} + E\, u^{(n)}_{i+1,j} + N\, u^{(n)}_{i,j+1} = F_{i,j} \tag{4.30}$$

and for the 9-point stencil is given by

$$\begin{aligned} &SW\, u^{(n+1)}_{i-1,j-1} + S\, u^{(n+1)}_{i,j-1} + SE\, u^{(n+1)}_{i+1,j-1} \\ +\, &W\, u^{(n+1)}_{i-1,j} + C\, u^{(n+1)}_{i,j} + E\, u^{(n)}_{i+1,j} \\ +\, &NW\, u^{(n)}_{i-1,j+1} + N\, u^{(n)}_{i,j+1} + NE\, u^{(n)}_{i+1,j+1} = F_{i,j}. \end{aligned} \tag{4.31}$$
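The data dependence in (4.30), and the way a two-color ordering removes it, can be illustrated with a short Python sketch. It assumes the constant coefficient 5-point Laplacian ($C = 4$, $S = W = E = N = -1$) with homogeneous Dirichlet boundaries; the function names are hypothetical, and the loops are written for clarity rather than speed.

```python
def relax_point(u, f, i, j):
    """Solve the 5-point equation at (i, j) for u[j][i], using current neighbor values."""
    ny, nx = len(u), len(u[0])
    s = u[j - 1][i] if j > 0 else 0.0
    n = u[j + 1][i] if j < ny - 1 else 0.0
    w = u[j][i - 1] if i > 0 else 0.0
    e = u[j][i + 1] if i < nx - 1 else 0.0
    u[j][i] = (f[j][i] + s + n + w + e) / 4.0


def lexicographic_sweep(u, f):
    """One Gauss-Seidel sweep as in (4.30): each point uses its already updated W and S neighbors."""
    for j in range(len(u)):
        for i in range(len(u[0])):
            relax_point(u, f, i, j)


def red_black_sweep(u, f, reverse=False):
    """Red half step (i + j even), then black half step (i + j odd)."""
    points = [(i, j) for j in range(len(u)) for i in range(len(u[0]))]
    if reverse:
        points.reverse()  # visiting order within a color should not matter
    for color in (0, 1):
        for i, j in points:
            if (i + j) % 2 == color:
                relax_point(u, f, i, j)
```

Because every neighbor of a red point in the 5-point stencil is black, the points of one color can be visited in any order, or all at once, without changing the result. The lexicographic sweep has no such freedom, which is the vector feedback dependency discussed in this section.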
This method does not vectorize or parallelize if only one color is employed, regardless of the ordering of the equations. Vectorization fails because there will always be a vector feedback dependency (e.g. $u(i, j)$ depends on $u(i-1, j)$ for $i = 1, 2, \ldots$). It is possible to write down single color orderings that appear to allow vector and/or parallel processing, but one finds upon closer examination that they are equivalent to multicolor orderings. For our purposes, when we say one color ordering we mean that the equations have a lexicographic ordering. Parallelization is not possible for lexicographic ordering because of the data dependencies that exist when sweeping through the equations.

The multicolor point Gauss-Seidel relaxation method for the 5-point stencil has two colors: red and black. The coloring can be defined in terms of the grid point indices,

$$\text{Red: } i + j \text{ even}, \qquad \text{Black: } i + j \text{ odd}. \tag{4.32}$$

This method proceeds in two half steps, one for each color of points being used. The grid point equation for the red grid points (first half step) is

$$S\, u^{(n)}_{i,j-1} + W\, u^{(n)}_{i-1,j} + C\, u^{(n+\frac{1}{2})}_{i,j} + E\, u^{(n)}_{i+1,j} + N\, u^{(n)}_{i,j+1} = F_{i,j}, \tag{4.33}$$

where $i + j$ is even, and for the black grid points (second half step)

$$S\, u^{(n+\frac{1}{2})}_{i,j-1} + W\, u^{(n+\frac{1}{2})}_{i-1,j} + C\, u^{(n+1)}_{i,j} + E\, u^{(n+\frac{1}{2})}_{i+1,j} + N\, u^{(n+\frac{1}{2})}_{i,j+1} = F_{i,j}, \tag{4.34}$$

where $i + j$ is odd.

The multicolor point Gauss-Seidel relaxation method for the 9-point stencil has four colors: red, black, green, and yellow. The coloring can be defined in terms of
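the grid point indices,

$$\begin{aligned} &\text{Red: } i \text{ odd},\ j \text{ odd}, &\qquad &\text{Black: } i \text{ even},\ j \text{ odd}, \\ &\text{Green: } i \text{ odd},\ j \text{ even}, &\qquad &\text{Yellow: } i \text{ even},\ j \text{ even}. \end{aligned} \tag{4.35}$$

This method proceeds in four quarter steps, one for each color of points being used. The grid point equation for the red grid points (first quarter step) is

$$\begin{aligned} &SW\, u^{(n)}_{i-1,j-1} + S\, u^{(n)}_{i,j-1} + SE\, u^{(n)}_{i+1,j-1} \\ +\, &W\, u^{(n)}_{i-1,j} + C\, u^{(n+\frac{1}{4})}_{i,j} + E\, u^{(n)}_{i+1,j} \\ +\, &NW\, u^{(n)}_{i-1,j+1} + N\, u^{(n)}_{i,j+1} + NE\, u^{(n)}_{i+1,j+1} = F_{i,j}, \end{aligned} \tag{4.36}$$

where $i$ is odd and $j$ is odd. The grid point equation for the black grid points (second quarter step) is

$$\begin{aligned} &SW\, u^{(n)}_{i-1,j-1} + S\, u^{(n)}_{i,j-1} + SE\, u^{(n)}_{i+1,j-1} \\ +\, &W\, u^{(n+\frac{1}{4})}_{i-1,j} + C\, u^{(n+\frac{1}{2})}_{i,j} + E\, u^{(n+\frac{1}{4})}_{i+1,j} \\ +\, &NW\, u^{(n)}_{i-1,j+1} + N\, u^{(n)}_{i,j+1} + NE\, u^{(n)}_{i+1,j+1} = F_{i,j}, \end{aligned} \tag{4.37}$$

where $i$ is even and $j$ is odd. The grid point equation for the green grid points (third quarter step) is

$$\begin{aligned} &SW\, u^{(n+\frac{1}{2})}_{i-1,j-1} + S\, u^{(n+\frac{1}{4})}_{i,j-1} + SE\, u^{(n+\frac{1}{2})}_{i+1,j-1} \\ +\, &W\, u^{(n)}_{i-1,j} + C\, u^{(n+\frac{3}{4})}_{i,j} + E\, u^{(n)}_{i+1,j} \\ +\, &NW\, u^{(n+\frac{1}{2})}_{i-1,j+1} + N\, u^{(n+\frac{1}{4})}_{i,j+1} + NE\, u^{(n+\frac{1}{2})}_{i+1,j+1} = F_{i,j}, \end{aligned} \tag{4.38}$$

where $i$ is odd and $j$ is even. And finally, the grid point equation for the yellow grid points (fourth quarter step) is

$$\begin{aligned} &SW\, u^{(n+\frac{1}{4})}_{i-1,j-1} + S\, u^{(n+\frac{1}{2})}_{i,j-1} + SE\, u^{(n+\frac{1}{4})}_{i+1,j-1} \\ +\, &W\, u^{(n+\frac{3}{4})}_{i-1,j} + C\, u^{(n+1)}_{i,j} + E\, u^{(n+\frac{3}{4})}_{i+1,j} \\ +\, &NW\, u^{(n+\frac{1}{4})}_{i-1,j+1} + N\, u^{(n+\frac{1}{2})}_{i,j+1} + NE\, u^{(n+\frac{1}{4})}_{i+1,j+1} = F_{i,j}, \end{aligned} \tag{4.39}$$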
where $i$ is even and $j$ is even.

The addition of multicoloring for the ordering of the sweeps through the grid equations leads to highly vectorizable and parallelizable solvers. Vectorization is obtained because sweeping through either the red or the black equations no longer involves vector feedback dependencies. Parallelization is obtained because the red and black points are now decoupled, and all the equations of a single color can be computed independently of the other color. In other words, there are no longer any data dependencies among equations of the same color. However, the computations are performed not in one parallel operation, but in a number of parallel operations equal to the number of independent colors.

As will be seen in the next chapter, in addition to the gains in vector and parallel performance, we generally also obtain better convergence and smoothing properties with multiple color ordering.

4.2.2 Line Gauss-Seidel Iteration by Lines in X

The line Gauss-Seidel relaxation by lines in the x-direction is used in our code with red/black zebra coloring. This relaxation method is good when there are strong connections (locally) on the grid in the x-direction. This method requires the use of a tridiagonal solver to solve for each line of unknowns. We present the details for only the 9-point stencil because the 5-point stencil is just a special case where the NW, NE, SW, and SE grid operator coefficients are zero.

The line Gauss-Seidel relaxation by x-lines method without multicoloring for the 9-point stencil is given by

$$\begin{aligned} &SW\, u^{(n+1)}_{i-1,j-1} + S\, u^{(n+1)}_{i,j-1} + SE\, u^{(n+1)}_{i+1,j-1} \\ +\, &W\, u^{(n+1)}_{i-1,j} + C\, u^{(n+1)}_{i,j} + E\, u^{(n+1)}_{i+1,j} \\ +\, &NW\, u^{(n)}_{i-1,j+1} + N\, u^{(n)}_{i,j+1} + NE\, u^{(n)}_{i+1,j+1} = F_{i,j}. \end{aligned} \tag{4.40}$$
The method is vectorizable and parallelizable only in the multiple line solves. By this we mean that we can loop over all the lines to obtain vectorization, or we can solve all the lines simultaneously for parallelization. The parallel line solves are done on the CM-5 using odd-even cyclic reduction.

When the red/black zebra coloring is used, the colors are defined to be

$$\text{Red: } j \text{ odd}, \qquad \text{Black: } j \text{ even}. \tag{4.41}$$

The line relaxation is done in two half steps, first for the red points and second for the black points. The line relaxation for the 9-point stencil is

$$\begin{aligned} &SW\, u^{(n)}_{i-1,j-1} + S\, u^{(n)}_{i,j-1} + SE\, u^{(n)}_{i+1,j-1} \\ +\, &W\, u^{(n+\frac{1}{2})}_{i-1,j} + C\, u^{(n+\frac{1}{2})}_{i,j} + E\, u^{(n+\frac{1}{2})}_{i+1,j} \\ +\, &NW\, u^{(n)}_{i-1,j+1} + N\, u^{(n)}_{i,j+1} + NE\, u^{(n)}_{i+1,j+1} = F_{i,j}, \end{aligned} \tag{4.42}$$

where $j$ is odd, for the first half step, and

$$\begin{aligned} &SW\, u^{(n+\frac{1}{2})}_{i-1,j-1} + S\, u^{(n+\frac{1}{2})}_{i,j-1} + SE\, u^{(n+\frac{1}{2})}_{i+1,j-1} \\ +\, &W\, u^{(n+1)}_{i-1,j} + C\, u^{(n+1)}_{i,j} + E\, u^{(n+1)}_{i+1,j} \\ +\, &NW\, u^{(n+\frac{1}{2})}_{i-1,j+1} + N\, u^{(n+\frac{1}{2})}_{i,j+1} + NE\, u^{(n+\frac{1}{2})}_{i+1,j+1} = F_{i,j}, \end{aligned} \tag{4.43}$$

where $j$ is even, for the second half step.

The zebra coloring method allows both vectorization and parallelization, not only for the line solves, as before, but also through the decoupling of the red and black lines. For parallelization this decoupling means that all the lines of one color can be solved simultaneously. As will be seen, the convergence factors for zebra coloring are usually better than those for lexicographic ordering.

4.2.3 Line Gauss-Seidel Iteration by Lines in Y

The line Gauss-Seidel relaxation by lines in the y-direction is used in our code with red/black zebra
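coloring. The method requires the use of a tridiagonal solver to solve for each line of unknowns. The line relaxation method is good when there are strong connections (locally) on the grid in the y-direction. We will present the details for only the 9-point stencil, since the 5-point stencil is just a special case in which the NW, NE, SW, and SE coefficients are zero.

The line Gauss-Seidel relaxation by y-lines method without multicoloring for the 9-point stencil is given by

$$\begin{aligned} &SW\, u^{(n+1)}_{i-1,j-1} + S\, u^{(n+1)}_{i,j-1} + SE\, u^{(n)}_{i+1,j-1} \\ +\, &W\, u^{(n+1)}_{i-1,j} + C\, u^{(n+1)}_{i,j} + E\, u^{(n)}_{i+1,j} \\ +\, &NW\, u^{(n+1)}_{i-1,j+1} + N\, u^{(n+1)}_{i,j+1} + NE\, u^{(n)}_{i+1,j+1} = F_{i,j}. \end{aligned} \tag{4.44}$$

When the red/black zebra coloring is used, the colors are defined to be

$$\text{Red: } i \text{ odd}, \qquad \text{Black: } i \text{ even}. \tag{4.45}$$

The line relaxation is done in two half steps, first for the red points and second for the black points. The first half step of the line relaxation for the 9-point stencil is

$$\begin{aligned} &SW\, u^{(n)}_{i-1,j-1} + S\, u^{(n+\frac{1}{2})}_{i,j-1} + SE\, u^{(n)}_{i+1,j-1} \\ +\, &W\, u^{(n)}_{i-1,j} + C\, u^{(n+\frac{1}{2})}_{i,j} + E\, u^{(n)}_{i+1,j} \\ +\, &NW\, u^{(n)}_{i-1,j+1} + N\, u^{(n+\frac{1}{2})}_{i,j+1} + NE\, u^{(n)}_{i+1,j+1} = F_{i,j}, \end{aligned} \tag{4.46}$$

where $i$ is odd, and the second half step is given by

$$\begin{aligned} &SW\, u^{(n+\frac{1}{2})}_{i-1,j-1} + S\, u^{(n+1)}_{i,j-1} + SE\, u^{(n+\frac{1}{2})}_{i+1,j-1} \\ +\, &W\, u^{(n+\frac{1}{2})}_{i-1,j} + C\, u^{(n+1)}_{i,j} + E\, u^{(n+\frac{1}{2})}_{i+1,j} \\ +\, &NW\, u^{(n+\frac{1}{2})}_{i-1,j+1} + N\, u^{(n+1)}_{i,j+1} + NE\, u^{(n+\frac{1}{2})}_{i+1,j+1} = F_{i,j}, \end{aligned} \tag{4.47}$$

where $i$ is even.

The same comments about vectorization and parallelization that were made about x-line Gauss-Seidel apply to y-line Gauss-Seidel.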
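The zebra line relaxation of sections 4.2.2 and 4.2.3 can be sketched in a few lines of Python. The version below sweeps by x-lines and is only an illustration under stated assumptions: it uses the constant coefficient 5-point Laplacian ($C = 4$, $S = W = E = N = -1$) with homogeneous Dirichlet boundaries and a serial Thomas solver, whereas the thesis code handles the general 9-point stencil and uses cyclic reduction on the CM-5. All function names are hypothetical.

```python
def thomas(sub, diag, sup, rhs):
    """Solve a tridiagonal system; sub[0] and sup[-1] are ignored."""
    n = len(rhs)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = sup[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        pivot = diag[i] - sub[i] * c[i - 1]
        c[i] = sup[i] / pivot
        d[i] = (rhs[i] - sub[i] * d[i - 1]) / pivot
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def zebra_xline_sweep(u, f):
    """One zebra sweep by x-lines; u[j] holds grid line j (0-based)."""
    ny, nx = len(u), len(u[0])
    for first in (0, 1):  # 0-based lines 0, 2, ... are red (j odd in 1-based numbering)
        for j in range(first, ny, 2):  # lines of one color are independent of each other
            below = u[j - 1] if j > 0 else [0.0] * nx
            above = u[j + 1] if j < ny - 1 else [0.0] * nx
            rhs = [f[j][i] + below[i] + above[i] for i in range(nx)]
            u[j] = thomas([-1.0] * nx, [4.0] * nx, [-1.0] * nx, rhs)


def residual_norm(u, f):
    """Max norm of f - L u for the same 5-point operator."""
    ny, nx = len(u), len(u[0])
    worst = 0.0
    for j in range(ny):
        for i in range(nx):
            lu = 4.0 * u[j][i]
            lu -= u[j][i - 1] if i > 0 else 0.0
            lu -= u[j][i + 1] if i < nx - 1 else 0.0
            lu -= u[j - 1][i] if j > 0 else 0.0
            lu -= u[j + 1][i] if j < ny - 1 else 0.0
            worst = max(worst, abs(f[j][i] - lu))
    return worst
```

A y-line variant would exchange the roles of the two indices, and the alternating method would apply one sweep of each. Used alone on this model problem the sweep converges, but in the multigrid setting its role is to damp the rough error components, not to solve the system.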
4.2.4 Alternating Line Gauss-Seidel Iteration

The alternating line Gauss-Seidel method performs zebra colored line Gauss-Seidel by lines in the x-direction followed by zebra colored line Gauss-Seidel by lines in the y-direction. For the details see the two previous sections, 4.2.2 and 4.2.3.

4.3 Incomplete Line LU Iteration

The incomplete line LU iteration (ILLU) is also referred to in the literature as incomplete block LU iteration. We consider ILLU by lines in the x-direction. The method presented is for the 9-point stencil with the grid points ordered lexicographically. There are two parts to using this smoother, the first being the factorization and the second the actual iterations.

In this section we change the notation we have been using for our problem from $Lu = f$ to $Au = f$. This is because we want to use the symbol $L$ to represent the lower triangular part of a matrix.

The ILLU factorization assumes that the matrix $A$ of the system to be solved is in block tridiagonal form,

$$A = \begin{pmatrix} B_1 & U_1 & & & \\ L_2 & B_2 & U_2 & & \\ & \ddots & \ddots & \ddots & \\ & & L_{n_y-1} & B_{n_y-1} & U_{n_y-1} \\ & & & L_{n_y} & B_{n_y} \end{pmatrix}, \tag{4.48}$$

where $L_j$, $B_j$, and $U_j$ are $n_x \times n_x$ tridiagonal matrices. Then there exists a matrix $D$, derived below, such that

$$A = (L + D)\, D^{-1} (D + U), \tag{4.49}$$
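where

$$L = \begin{pmatrix} 0 & & & \\ L_2 & 0 & & \\ & \ddots & \ddots & \\ & & L_{n_y} & 0 \end{pmatrix}, \qquad U = \begin{pmatrix} 0 & U_1 & & \\ & 0 & \ddots & \\ & & \ddots & U_{n_y-1} \\ & & & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} D_1 & & & \\ & D_2 & & \\ & & \ddots & \\ & & & D_{n_y} \end{pmatrix}, \tag{4.50}$$

and $D$ is block diagonal, consisting of $n_x \times n_x$ blocks $D_j$. The factorization of $A$ in equation (4.49) is called the line LU factorization of $A$. The blocks of $L$, $D$, and $U$ correspond to horizontal grid lines in the x-direction on the computational grid $G$. Equation (4.49) can also be written as

$$A = L + D + U + L D^{-1} U, \tag{4.51}$$

and the last term is the block diagonal matrix

$$L D^{-1} U = \begin{pmatrix} 0 & & & \\ & L_2 D_1^{-1} U_1 & & \\ & & \ddots & \\ & & & L_{n_y} D_{n_y-1}^{-1} U_{n_y-1} \end{pmatrix}. \tag{4.52}$$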
From the last two equations a recursion formula to compute $D$ can be obtained:

$$D_1 = B_1, \qquad D_j = B_j - L_j D_{j-1}^{-1} U_{j-1}, \quad j = 2, 3, \ldots, n_y, \tag{4.53}$$

assuming that $D_{j-1}^{-1}$ exists. This shows that the splitting in equation (4.49) holds when $D$ is computed by equation (4.53). The problem with this splitting is that the $D_j$'s are full.

Many variations of the incomplete iteration have been proposed in the literature. For more on the description and theory of ILLU iteration see [4], [3], [2], [76], [34], [44], [84], [85].

The variation that we are using is obtained by replacing the term $L_j D_{j-1}^{-1} U_{j-1}$ in equation (4.53) by its tridiagonal part, to get

$$D_1 = B_1, \qquad D_j = B_j - \operatorname{tridiag}\left( L_j D_{j-1}^{-1} U_{j-1} \right), \tag{4.54}$$

where $j = 2, 3, \ldots, n_y$. The ILLU factorization is then defined as

$$A = M - N, \tag{4.55}$$

where $M = (L + D) D^{-1} (D + U)$, and $D$ is computed using equation (4.54).

The iteration for the system of equations

$$A u = f \tag{4.56}$$

is given by the splitting in equation (4.55). The iteration is then given by

$$M u^{(n+1)} = N u^{(n)} + f \tag{4.57}$$

or as

$$u^{(n+1)} = S u^{(n)} + M^{-1} f, \tag{4.58}$$
where $S = M^{-1} N$ is called the iteration matrix. For computing purposes this iteration becomes

$$\begin{aligned} r &= f - A u^{(n)}, \\ (L + D)\, D^{-1} (D + U)\, u^{(n+1)} &= r, \\ u^{(n+1)} &:= u^{(n+1)} + u^{(n)}. \end{aligned} \tag{4.59}$$

The center equation in (4.59) is solved in the following way:

$$\begin{aligned} &\text{solve } (L + D)\, u^{(n+1)} = r, \\ &r := D\, u^{(n+1)}, \\ &\text{solve } (D + U)\, u^{(n+1)} = r, \end{aligned} \tag{4.60}$$

and the first equation in (4.60) can be solved by

$$u_1 = D_1^{-1} r_1, \qquad u_j = D_j^{-1} \left( r_j - L_j u_{j-1} \right), \quad j = 2, 3, \ldots, n_y, \tag{4.61}$$

where $u_j$ and $r_j$ are $n_x$-dimensional vectors corresponding to block $j$. The last equation in (4.60) is computed similarly to the first equation, as in (4.61).

For completeness, we could have looked at ILLU by lines in the y-direction, or even alternating ILLU. This has not been found to be necessary, because the smoothing properties of ILLU by lines in either direction are so good that we can get away with using only ILLU by lines in the x-direction. However, this is only true in two dimensions, and this smoother is not robust in three dimensions.

We still need to comment on the vector and parallel aspects of ILLU. The ILLU method does not easily lend itself to vectorization or parallelization, but we have
been able to get some reasonable performance on the Cray Y-MP. The vector method is the same as the one used by De Zeeuw [24], but the implementation is different. However, this method does not parallelize. Many people are working on parallel ILLU methods, but so far we are not aware of any that are efficient enough on the CM-5 to compete with zebra alternating line Gauss-Seidel relaxation as a multigrid smoother. Here efficiency is meant in the sense of convergence factor per unit of execution time.

As a final note, it was found that De Zeeuw's MGD9V obtains its robustness from the use of the ILLU smoother and not from his new grid transfer operators, although they do contribute to the robustness of the method. The ILLU smoother was replaced with an alternating red/black line Gauss-Seidel smoother in MGD9V, and experiments showed it to perform only marginally better than Dendy's nonsymmetric black box multigrid method. Likewise, the ILLU smoother was placed in Dendy's nonsymmetric black box multigrid method, and experiments showed that it performed about the same as MGD9V.
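The ILLU factorization and solve of section 4.3 can be sketched compactly for a model problem. The Python code below is an illustration under several assumptions, not the vectorized thesis implementation: it uses the 5-point Laplacian, for which $B_j = \operatorname{tridiag}[-1, 4, -1]$ and $L_j = U_j = -I$, so that $L_j D_{j-1}^{-1} U_{j-1} = D_{j-1}^{-1}$. For brevity it stores the blocks as small dense matrices, and it forms the tridiagonal part of $D_{j-1}^{-1}$ from the full inverse, which a production code would avoid. All function names are hypothetical.

```python
def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [a[r][:] + [b[r]] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def illu_blocks(nx, ny):
    """D_1 = B, D_j = B - tridiag(D_{j-1}^{-1}); valid because L_j = U_j = -I here."""
    b = [[4.0 if r == c else -1.0 if abs(r - c) == 1 else 0.0 for c in range(nx)]
         for r in range(nx)]
    d = [b]
    for _ in range(1, ny):
        dj = [row[:] for row in b]
        for c in range(nx):
            unit = [1.0 if k == c else 0.0 for k in range(nx)]
            column = solve_dense(d[-1], unit)  # column c of D_{j-1}^{-1}
            for r in range(max(0, c - 1), min(nx, c + 2)):
                dj[r][c] -= column[r]  # keep only the tridiagonal part
        d.append(dj)
    return d


def apply_a(u):
    """5-point Laplacian with homogeneous Dirichlet boundaries; u[j] is grid line j."""
    ny, nx = len(u), len(u[0])
    return [[4.0 * u[j][i]
             - (u[j][i - 1] if i > 0 else 0.0) - (u[j][i + 1] if i < nx - 1 else 0.0)
             - (u[j - 1][i] if j > 0 else 0.0) - (u[j + 1][i] if j < ny - 1 else 0.0)
             for i in range(nx)] for j in range(ny)]


def illu_sweep(d, u, f):
    """One ILLU iteration: r = f - A u, solve (L + D) D^{-1} (D + U) y = r, return u + y."""
    ny, nx = len(u), len(u[0])
    au = apply_a(u)
    r = [[f[j][i] - au[j][i] for i in range(nx)] for j in range(ny)]
    y = [None] * ny
    for j in range(ny):  # forward solve with L + D, using L_j = -I
        rhs = r[j] if j == 0 else [r[j][i] + y[j - 1][i] for i in range(nx)]
        y[j] = solve_dense(d[j], rhs)
    r = [[sum(d[j][i][k] * y[j][k] for k in range(nx)) for i in range(nx)]
         for j in range(ny)]  # r := D y
    for j in range(ny - 1, -1, -1):  # backward solve with D + U, using U_j = -I
        rhs = r[j] if j == ny - 1 else [r[j][i] + y[j + 1][i] for i in range(nx)]
        y[j] = solve_dense(d[j], rhs)
    return [[u[j][i] + y[j][i] for i in range(nx)] for j in range(ny)]
```

The forward and backward block recursions are inherently sequential in the line index, which is consistent with the remark above that ILLU does not easily lend itself to parallelization.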
CHAPTER 5
FOURIER MODE ANALYSIS OF SMOOTHERS

5.1 Introduction

To understand which relaxation schemes make good smoothers for our Black Box Multigrid code, we can use Fourier mode analysis, which is also known as local mode analysis or local Fourier analysis. We will use Fourier mode analysis to help guide us in finding robust and efficient relaxation methods for use as the multigrid smoother. We want to find methods that will reduce the high frequency error components for a range of test problems that include anisotropic and convection dominated operators.

Results of local mode analysis have been reported in the literature. However, most of the reports have been for only a few selected problems and smoothing methods. Since the literature lacks adequate coverage of smoothing analysis for our range of test problems, we present many of the results from our own smoothing analysis investigation.

The smoother plays an important part in the multigrid process by reducing the high frequency error components. The coarse grid correction complements this process by eliminating the low frequency error components. It is hoped that our choice of coarse grid and intergrid operators will meet the requirement, for coarse grid correction, that the range of prolongation contains low frequency components; see section 5.3. When it does, the smoothing factor will give a reasonable approximation of the multigrid convergence
factor for definite elliptic problems. However, the smoothing factors do not generally predict the exact performance of the two-level algorithm, since the intergrid operators are neglected, as well as the differences between the fine and coarse grid operators.

A two-level analysis can often give more information than smoothing analysis; both are performed by using Fourier mode analysis. Two-level analysis attempts to approximate the spectral radius of $(I - I_H^h (L^H)^{-1} I_h^H L^h)\, S^h$ for the two-level algorithm, while smoothing analysis computes the spectral radius of $S^h$ restricted to the high frequencies, that is, the convergence factor of high frequencies for the multigrid smoother. We have used Galerkin coarse grid approximation, which can produce a different coarse grid operator on each grid level, causing the two-level analysis to be valid only for those levels. For constant coefficient zero row-sum problems, the collapsing method intergrid operators become bilinear interpolation and full weighting restriction, in which case the two-level analysis is straightforward [78]. Variable coefficient problems are handled by performing the analysis for the extreme cases with frozen coefficients and using a continuity argument. For highly variable or discontinuous coefficient problems, it is not clear how to perform two-level analysis, especially when Galerkin coarse grid approximation is being used. Because of these difficulties in performing a two-level analysis, we have chosen to use local mode analysis to analyze only the smoothing factors.

5.2 Motivation

When working on parallel and vector computers, it can often pay to consider methods that one would not usually think of using on a sequential computer. Likewise, there are tried and true algorithms that work well on a sequential computer but that do not vectorize or parallelize to any great extent.

We have used the CM-5, which is a massively parallel SIMD computer. It can also be used in a MIMD mode, but we will not be using it in this way. The CM-5
uses a small vector length of sixteen (until late 1995 it was eight), which, even though it is short, favors algorithms that vectorize. We have also used the Cray Y-MP, which is a vector computer with a vector length of 64.

For a five point stencil, the relaxation scheme one might consider using on the CM-5 is a Gauss-Seidel point relaxation implemented with multiple colors. For instance, one might use red/black Gauss-Seidel. This scheme takes two sweeps over the data, one for the red points and the other for the black points. If one implements Jacobi relaxation, one finds that it looks exactly like a Gauss-Seidel relaxation on a sequential computer (this is because of the way the synchronous SIMD operations work on the CM computers). It requires only one sweep across the data, and hence one can do two iterations of Jacobi for the price of one red/black Gauss-Seidel relaxation. The Jacobi method on the CM-5 does not need the extra storage space or index switching that is needed on a sequential computer. Similarly, a nine point stencil requires a four color Gauss-Seidel relaxation scheme, which takes four sweeps through the data per iteration, as much work as four sweeps of Jacobi relaxation.

It is known that Jacobi relaxation is not a very good smoother, but we can use a damping factor to make it better. In this case, we may find that Jacobi relaxation becomes more competitive with Gauss-Seidel relaxation on parallel computers. So, when considering smoothers, it is important to take into account the amount of work that must be done in parallel and to remember to consider those methods which may seem to be outdated.

In order for the Fourier mode analysis to be technically valid, the operator $L$ must have constant coefficients and periodic boundary conditions. However, the analysis can still provide useful bounds for other types of boundary conditions.
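The effect of damping on Jacobi relaxation can be quantified with a short calculation. The Python sketch below anticipates the smoothing factor of section 5.3 and is an illustration under stated assumptions: it uses the constant coefficient 5-point Laplacian with periodic boundary conditions, for which the undamped point Jacobi amplification factor is $(\cos\theta_x + \cos\theta_y)/2$, and it takes the rough frequencies to be those with $|\theta_x| \ge \pi/2$ or $|\theta_y| \ge \pi/2$. The function names and the grid size $n = 32$ are choices made here.

```python
import math


def jacobi_symbol(theta_x, theta_y, omega=1.0):
    """Amplification factor of damped point Jacobi for the 5-point Laplacian."""
    undamped = 0.5 * (math.cos(theta_x) + math.cos(theta_y))
    return omega * undamped + (1.0 - omega)


def smoothing_factor(omega, n=32):
    """Max |lambda(theta)| over the rough frequencies of an n x n periodic grid (n even)."""
    mu = 0.0
    for kx in range(-n // 2 + 1, n // 2 + 1):
        for ky in range(-n // 2 + 1, n // 2 + 1):
            tx = 2.0 * math.pi * kx / n
            ty = 2.0 * math.pi * ky / n
            smooth = abs(tx) < math.pi / 2 and abs(ty) < math.pi / 2
            if not smooth:
                mu = max(mu, abs(jacobi_symbol(tx, ty, omega)))
    return mu


mu_undamped = smoothing_factor(1.0)
mu_damped = smoothing_factor(0.8)
```

For this model problem the undamped method does not reduce the mode $\theta = (\pi, \pi)$ at all, while $\omega = 0.8$ balances the two extreme rough modes. Whether two damped Jacobi sweeps then beat one red/black Gauss-Seidel sweep depends on the problem and on the machine, which is the comparison this chapter sets out to make.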
If the problem has variable coefficients, then it should be analyzed for a fixed set of coefficients that are sampled across the domain, and it should include the extreme cases in amplitude of
the coefficients. The behavior of the problem should then be bounded by the extreme bounds found in the analysis.

Fourier (local) mode analysis has been the topic of debate for many years as to how robust and rigorous it can be, and under what circumstances it can be successfully applied. However, it has proven itself to be quite useful and is used by most practitioners. Achi Brandt and several of his students have proposed that it can be rigorous in its construction and use, if proper care is taken; see Brandt's paper on Rigorous Local Mode Analysis [17].

5.3 Overview of Smoothing Analysis

A good introduction to Fourier smoothing analysis can be found in the 1982 paper by Stüben and Trottenberg [78] and in the book by P. Wesseling [85]. The presentation in this section uses similar notation and is patterned after chapter 7 of Wesseling's book.

Let the grid $G$ be defined by

$$G = \left\{ (x_i, y_j) :\ x_i = i\, h_x,\ i = 1, \ldots, n_x,\ h_x = \frac{1}{n_x};\quad y_j = j\, h_y,\ j = 1, \ldots, n_y,\ h_y = \frac{1}{n_y} \right\}. \tag{5.1}$$

The continuous problem is discretized into a system of algebraic equations

$$L u = f, \tag{5.2}$$

where $L$ is represented by the stencil

$$[L]_{i,j} = \begin{bmatrix} NW & N & NE \\ W & C & E \\ SW & S & SE \end{bmatrix}_{i,j}. \tag{5.3}$$

For the most part, smoothing methods are basic iterative methods. That is, they are
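splittings (usually a regular splitting) of the form

$$ L = M - N. \tag{5.4} $$

The details for these methods are given in chapter 4. For the basic type of iterative methods the error amplification matrix is given by

$$ S = M^{-1} N \tag{5.5} $$

without damping, and with damping it becomes

$$ S = \omega M^{-1} N + (1 - \omega) I. \tag{5.6} $$

If the continuous problem has constant coefficients and periodic boundary conditions, then the stencils [L], [M], and [N] are independent of the grid point (i, j). We will assume that S has a complete set of eigenfunctions (local modes). The errors before smoothing, $e^{(0)}$, and after smoothing, $e^{(1)}$, are related by

$$ e^{(1)} = S e^{(0)}, \tag{5.7} $$

which gives us the relation

$$ S \Phi(\theta) = \lambda(\theta) \Phi(\theta), \tag{5.8} $$

where $\lambda(\theta)$ is an eigenvalue associated with the eigenfunction $\Phi(\theta)$. The eigenfunctions of S are

$$ \Phi_{i,j}(\theta) = \exp\bigl(\iota (i \theta_x + j \theta_y)\bigr), \qquad \theta \in \Theta, \tag{5.9} $$

where $\iota = \sqrt{-1}$, $\theta = (\theta_x, \theta_y)$, and $\Theta$ is defined as

$$ \Theta = \Bigl\{ (\theta_x, \theta_y) : \theta_x = \frac{2\pi k_x}{n_x},\ k_x = -\frac{n_x}{2} + 1, \ldots, \frac{n_x}{2};\ \ \theta_y = \frac{2\pi k_y}{n_y},\ k_y = -\frac{n_y}{2} + 1, \ldots, \frac{n_y}{2} \Bigr\}. \tag{5.10} $$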
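If $n_x$ and $n_y$ are assumed to be even, then the corresponding eigenvalues of S are

$$ \lambda(\theta) = \frac{\sum_{\kappa} [N]_{\kappa}\, e^{\iota \kappa \cdot \theta}}{\sum_{\kappa} [M]_{\kappa}\, e^{\iota \kappa \cdot \theta}}, \qquad \theta \in \Theta, \tag{5.11} $$

where $\kappa = (k_x, k_y)$ is a vector. The eigenvalue, $\lambda(\theta)$, is called the amplification factor of the Fourier mode $\Phi(\theta)$. If underrelaxation or overrelaxation is used then

$$ \lambda(\theta) = \omega \lambda(\theta) + (1 - \omega), \tag{5.12} $$

where $\omega$ is the relaxation parameter, and the $\lambda(\theta)$ on the right hand side is the eigenvalue from the undamped amplification matrix. The Fourier representation of a periodic grid function $u_{i,j}$ is

$$ u_{i,j} = \sum_{\theta \in \Theta} c_\theta\, \Phi_{i,j}(\theta), \tag{5.13} $$

and the error is

$$ e^{(\alpha)}_{i,j} = \sum_{\theta \in \Theta} c^{\alpha}_\theta\, \Phi_{i,j}(\theta), \qquad \alpha = 0, 1, \tag{5.14} $$

which then gives

$$ \frac{c^{1}_\theta}{c^{0}_\theta} = \lambda(\theta). \tag{5.15} $$

Next, we define the sets of rough and smooth frequencies, that is, the high and low frequencies respectively relative to the grid G. We have assumed that the ratio between the fine and coarse grid spacings is two. The smooth frequencies are defined as

$$ \Theta_s = \Theta \cap \Bigl( -\frac{\pi}{2}, \frac{\pi}{2} \Bigr)^2, \tag{5.16} $$

and the rough frequencies are defined as

$$ \Theta_r = \Theta \setminus \Theta_s, \tag{5.17} $$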
where the \ means "the set minus". The Fourier smoothing factor is now defined to be

$$ \mu = \max_{\theta \in \Theta_r} \{ |\lambda(\theta)| \}. \tag{5.18} $$

Now we need to consider the effect of the boundary conditions, in particular, the Dirichlet boundary condition. For problems with Dirichlet boundary conditions, we know that the error at the boundary is always zero, and hence we can ignore the wave numbers where $\theta_x = 0$ and/or $\theta_y = 0$. Then the set of rough wave numbers, in the Dirichlet boundary condition case, is defined to be

$$ \Theta_r^D = \{ \theta \in \Theta_r : \theta_x \neq 0 \text{ and } \theta_y \neq 0 \}, \tag{5.19} $$

and then the corresponding smoothing factor, $\mu^D$, is given by

$$ \mu^D = \max_{\theta \in \Theta_r^D} \{ |\lambda(\theta)| \}. \tag{5.20} $$

The above definitions of the smoothing factor are grid size dependent because they depend on $n_x$ and $n_y$ (the number of grid points in the x and y directions respectively). The definitions can be changed to be grid-independent if we change the definition of the discrete set $\Theta$ to be

$$ \Theta = \{ \theta : \theta_x \in [-\pi, \pi],\ \theta_y \in [-\pi, \pi] \}. \tag{5.21} $$

This grid-independent definition is much harder to compute numerically, and when the boundary conditions have a big influence on the solution, the results are not very realistic. The grid dependent definitions given above are best when the choices of $n_x$ and $n_y$ are in the same range as those expected when using the multigrid method. We would always like to have $\mu < 1$ uniformly in $n_x$ and $n_y$, at least when the boundary conditions do not have a strong influence. If that is not the case, then the coarse grid correction part of the multigrid method must be very good to overcome the smoother's
divergent influence. We can also numerically investigate the behavior of $\mu$ as $n_x, n_y \to \infty$ to see what its asymptotic behavior is like.

Up to this point we have not addressed a very important class of relaxation methods for the analysis, namely those which use multicoloring schemes. For these methods we must modify the above definitions. For the case of multicolor relaxation, the
where $S(\theta)$ is a 4 x 4 matrix, which is called the amplification matrix, and $c_\theta$ is a vector of dimension 4. If underrelaxation or overrelaxation is used, then

$$ S(\theta) = \omega S(\theta) + (1 - \omega) I, \tag{5.27} $$

where $\omega$ is the relaxation damping parameter and I is the identity matrix. The amplification matrix $S(\theta)$ can be found by:

1. Write the equations for the error for one step (color) of one complete iteration of the smoother.
2. Combine the error equations for that step into one expression.
3. Evaluate the combined expression for each of the invariant subspaces.
4. Write the equation that expresses the nth-step Fourier coefficient $c^n_\theta$ in terms of the initial Fourier coefficient $c^0_\theta$; the two are related by the step amplification matrix.
5. Do the above for each step of one complete smoothing iteration.
6. Multiply all the step amplification matrices together to get the amplification matrix for the smoother, which will express the Fourier coefficients $c^1_\theta$ in terms of $c^0_\theta$: $c^1_\theta = S(\theta)\, c^0_\theta$.

This algorithm will be illustrated for the smoothing analysis of the point Gauss-Seidel method in section 5.5.

For multicolor relaxations, the definition of the Fourier smoothing factor, $\mu$, has to be modified in the following ways. The rough Fourier modes are now given by (5.28) and the smooth Fourier modes are now represented by (5.29).
All of these values must be added to $\Theta_r$ in order for the Fourier mode analysis to have any meaning for several cases that can arise. We can now define a projection operator, $Q(\theta)$, for
where

$$ p_1(\theta) = \begin{cases} 0 & \theta_x = 0 \text{ and/or } \theta_y = 0 \\ 1 & \text{otherwise} \end{cases}, \qquad p_3(\theta) = \begin{cases} 0 & \theta_x = 0 \\ 1 & \text{otherwise} \end{cases}, \qquad p_4(\theta) = \begin{cases} 0 & \theta_y = 0 \\ 1 & \text{otherwise} \end{cases}. \tag{5.34} $$

It can also be seen that $p_3(\theta) = 0$ implies that $\theta^3_x = 0$, and that $p_4(\theta) = 0$ implies that $\theta^4_y = 0$. The definition of the smoothing factor now becomes

$$ \mu^D = \max_{\theta \in \Theta_r^D} \{ \rho [ P(\theta) Q(\theta) S(\theta) ] \}, \tag{5.35} $$

where $\rho$ denotes the spectral radius.

The smoothing factors for Dirichlet boundary conditions are better than those for other boundary conditions because they exclude points in $\Theta_r$ and $\Theta_{\bar s}$. This fact means that if the maximum occurs on these excluded points, then the smoothing factor for Dirichlet boundary conditions will be smaller.

5.4 2D Model Problems

In the subsequent sections and chapters, various model problems are examined. They are used to compare the performance of the black box multigrid components (smoothers and grid transfer operators) on a finite set of problems that represent various characteristics of more realistic problems. The domain $\Omega$ is the unit square for the two dimensional model problems:

1. $-\Delta u = f$
2. $-u_{xx} - \varepsilon u_{yy} = f$
3. $-\varepsilon u_{xx} - u_{yy} = f$
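4. $-\varepsilon \Delta u - u_x = f$
5. $-\varepsilon \Delta u + u_x = f$
6. $-\varepsilon \Delta u - u_y = f$
7. $-\varepsilon \Delta u + u_y = f$
8. $-\varepsilon \Delta u - u_x - u_y = f$
9. $-\varepsilon \Delta u + u_x + u_y = f$
10. $-\varepsilon \Delta u - u_x + u_y = f$
11. $-\varepsilon \Delta u + u_x - u_y = f$

where $\Delta u = u_{xx} + u_{yy}$ and $\varepsilon = 10^{-p}$ for $p = 0, 1, \ldots, 5$. The model problems will be discretized using central differences for the second order terms and upstream differencing for the first order terms.

5.5 Local Mode Analysis for Point Gauss-Seidel Relaxation

Local mode analysis results are presented for lexicographical and red/black ordering for point Gauss-Seidel relaxations. Point Gauss-Seidel relaxation with lexicographic ordering gives the splitting

$$ [M] = \begin{bmatrix} & 0 & \\ W & C & 0 \\ & S & \end{bmatrix}, \qquad [N] = \begin{bmatrix} & -N & \\ 0 & 0 & -E \\ & 0 & \end{bmatrix}. \tag{5.36} $$

The amplification factor $\lambda(\theta)$ is given by

$$ \lambda(\theta) = -\frac{E e^{\iota \theta_x} + N e^{\iota \theta_y}}{W e^{-\iota \theta_x} + C + S e^{-\iota \theta_y}}. \tag{5.37} $$

The red/black point Gauss-Seidel relaxation local mode amplification matrix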
is computed below. The stencil is assumed to be 5-point because a four color scheme would be needed for a 9-point stencil. The details for the computation of the amplification matrix will only be given for the 5-point red/black point Gauss-Seidel relaxation. The amplification matrices for all other multicolor Gauss-Seidel type smoothers can be computed in a similar manner.

One iteration of the red/black point Gauss-Seidel relaxation is performed in two half steps. The first half step is computed for the red points, and the second half step is computed on the black points. The red points are those points (i, j) where i + j is even, and the black points are those where i + j is odd. Let the error before smoothing be $e^0$; then the error after the first half step is

$$ e^{1/2}_{i,j} = \begin{cases} -\dfrac{S e^0_{i,j-1} + W e^0_{i-1,j} + E e^0_{i+1,j} + N e^0_{i,j+1}}{C} & i + j \text{ even} \\[2ex] e^0_{i,j} & i + j \text{ odd} \end{cases} \tag{5.38} $$

and after the second half step it is

$$ e^{1}_{i,j} = \begin{cases} e^{1/2}_{i,j} & i + j \text{ even} \\[1ex] -\dfrac{S e^{1/2}_{i,j-1} + W e^{1/2}_{i-1,j} + E e^{1/2}_{i+1,j} + N e^{1/2}_{i,j+1}}{C} & i + j \text{ odd} \end{cases} \tag{5.39} $$

The Fourier representation of $e^n_{i,j}$, n = 0, 1, is given by

$$ e^n_{i,j} = \sum_{\theta \in \Theta_{\bar s}} (c^n_\theta)^T \Phi_{i,j}(\theta), \tag{5.40} $$

where $c^n_\theta$ and $\Phi_{i,j}(\theta)$ are vectors of dimension 4. Examining the first half step in the relaxation process, let the initial error be

$$ e^0_{i,j} = \Phi_{i,j}(\theta^k), \qquad k = 1, 2, 3, 4. \tag{5.41} $$

Substitution into equation (5.38) gives

$$ e^{1/2}_{i,j}(\theta^k) = \begin{cases} -\dfrac{S \Phi_{i,j-1}(\theta^k) + W \Phi_{i-1,j}(\theta^k) + E \Phi_{i+1,j}(\theta^k) + N \Phi_{i,j+1}(\theta^k)}{C} & i + j \text{ even} \\[2ex] \Phi_{i,j}(\theta^k) & i + j \text{ odd} \end{cases} \tag{5.42} $$
Recall that
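If k is odd, then

$$ [\cos(k\theta)(-1) + \sin(k\theta) \cdot 0] + \iota\, [\sin(k\theta)(-1) - \cos(k\theta) \cdot 0] = -\cos(k\theta) - \iota \sin(k\theta). $$

We want to combine the expressions for the red and black points of equation (5.42) into a single expression. We already have $\Phi_{i,j}(\theta^1)$, but we need to add one or more other subspaces to create a linear combination that will yield a single expression. If we take $\theta - \pi$ for all the angles whose indices are involved in the coloring pattern designation, i and j in this case, we get the additional subspace that we need for the linear combination. The single expression linear combination is

$$ e^{1/2}_{i,j}(\theta^1) = A\, \Phi_{i,j}(\theta^1) + B\, \Phi_{i,j}(\theta^2). \tag{5.46} $$

Using theorem 5.5.1 we can find the values of A and B:

$$ A + B = \alpha \ \text{ for } i + j \text{ even}, \qquad A - B = 1 \ \text{ for } i + j \text{ odd}, $$

which gives $A = \tfrac12(\alpha + 1)$ and $B = \tfrac12(\alpha - 1)$, and therefore

$$ e^{1/2}_{i,j}(\theta^1) = \tfrac12(\alpha + 1)\, \Phi_{i,j}(\theta^1) + \tfrac12(\alpha - 1)\, \Phi_{i,j}(\theta^2). \tag{5.47} $$

Let $\alpha$ be defined by

$$ \alpha = -\frac{S e^{-\iota \theta_y} + W e^{-\iota \theta_x} + E e^{\iota \theta_x} + N e^{\iota \theta_y}}{C} \tag{5.48} $$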
and define f3 to be (5.49) We now evaluate the first half step error for each of the four invariant subspaces. 1 + &(01) + &(01) 1 2 2 1 1 1 2 2 (1 +a)
We now proceed in a similar way for the second half step (black points) of the relaxation process. The error after the second half step was given in equation (5.39), and it can be written, as in the first half step, as
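$$ \frac{1 + \alpha(\theta^2)}{2}\, \Phi_{i,j}(\theta^2) + \frac{1 - \alpha(\theta^2)}{2}\, \Phi_{i,j}(\theta^1) = \tfrac12 (1 - \alpha)\, \Phi_{i,j}(\theta^2) + \tfrac12 (1 + \alpha)\, \Phi_{i,j}(\theta^1). \tag{5.59} $$

The Fourier coefficient $c^1_\theta$ in terms of $c^{1/2}_\theta$ is

$$ c^1_\theta = \frac12 \begin{bmatrix} 1+\alpha & 1+\alpha & 0 & 0 \\ 1-\alpha & 1-\alpha & 0 & 0 \\ 0 & 0 & 1+\beta & 1+\beta \\ 0 & 0 & 1-\beta & 1-\beta \end{bmatrix} c^{1/2}_\theta, \qquad \theta \in \Theta_{\bar s}. \tag{5.62} $$

Finally, we can express $c^1_\theta$ in terms of $c^0_\theta$ by substitution of $c^{1/2}_\theta$ into equation (5.62) to get the red/black point Gauss-Seidel amplification matrix $S(\theta)$ that gives the relation

$$ c^1_\theta = \frac12 \begin{bmatrix} \alpha(1+\alpha) & -\alpha(1+\alpha) & 0 & 0 \\ \alpha(1-\alpha) & \alpha(\alpha-1) & 0 & 0 \\ 0 & 0 & \beta(1+\beta) & -\beta(1+\beta) \\ 0 & 0 & \beta(1-\beta) & \beta(\beta-1) \end{bmatrix} c^0_\theta, \qquad \theta \in \Theta_{\bar s}. \tag{5.63} $$

The eigenvalues of $Q(\theta) S(\theta)$ are

$$ \lambda_1(\theta) = 0, \qquad \lambda_2(\theta) = 0, $$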
$$ \lambda_3(\theta) = \tfrac12 \bigl( \delta(\theta)\, a - b \bigr), \qquad \lambda_4(\theta) = \beta^2, $$

where $a = \alpha(1+\alpha)$ and $b = \alpha(1-\alpha)$, and for the Dirichlet boundary condition case, the eigenvalues of $P(\theta) Q(\theta) S(\theta)$ are

$$ \lambda_1(\theta) = 0, \qquad \lambda_2(\theta) = 0, \qquad \lambda_3(\theta) = \tfrac12 \bigl( p_1(\theta)\, \delta(\theta)\, a - b \bigr), \qquad \lambda_4(\theta) = \tfrac12 \beta \bigl( p_3(\theta) - p_4(\theta) + \beta ( p_3(\theta) + p_4(\theta) ) \bigr). $$

The results of local mode analysis for the model problems from section 5.4 are shown in table 5.1 and table 5.2. The smoothing factors were computed numerically with the grid spacing $h_x = h_y = 1$, and the angles $\theta_x$ and $\theta_y$ were sampled at one degree increments.

Table 5.1. Smoothing factor $\mu$ for point Gauss-Seidel relaxation in lexicographical (pGSlex) and red/black (r/bpGS) ordering for the indicated anisotropic diffusion problems (see section 5.4), where $\varepsilon = 10^{-p}$ and (D) indicates Dirichlet boundary conditions.

problem   p   pGSlex   r/bpGS   r/bpGS (D)
   1          .50000   .25000   .24992
   2      1   .83220   .82645   .82619
          3   .99797   .99800   .99770
          5   .99998   .99999   .99975
   3      1   .83220   .82645   .82619
          3   .99797   .99800   .99770
          5   .99998   .99999   .99975

Table 5.1 shows the results of the smoothing analysis for pure diffusion type problems. The point Gauss-Seidel relaxations are good smoothers for Poisson's equation, but not for anisotropic problems. The table also shows that red/black ordering is better than lexicographic ordering.
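As a rough illustration of how smoothing factors of this kind can be computed, the following Python sketch evaluates the modulus of the amplification factor of lexicographic point Gauss-Seidel for a constant 5-point stencil and takes the maximum over sampled rough frequencies. It is a minimal sketch under stated assumptions: the stencil tuple holds the signed discretization coefficients, every angle with $|\theta_x| \ge \pi/2$ or $|\theta_y| \ge \pi/2$ is treated as rough, and the sampling is at one degree increments. The function names are hypothetical, and since the treatment of the angles $\pm\pi/2$ is an assumption, the trailing digits need not agree with table 5.1.

```python
import cmath
import math


def lex_point_gs_factor(stencil, theta_x, theta_y):
    """|lambda(theta)| for lexicographic point Gauss-Seidel, 5-point stencil.

    stencil = (W, E, S, N, C) holds signed coefficients (assumed convention).
    W, S and C are already updated in a lexicographic sweep; E and N lag.
    """
    W, E, S, N, C = stencil
    lagged = E * cmath.exp(1j * theta_x) + N * cmath.exp(1j * theta_y)
    updated = W * cmath.exp(-1j * theta_x) + C + S * cmath.exp(-1j * theta_y)
    return abs(lagged / updated)


def smoothing_factor(stencil, step_deg=1):
    """Maximum of |lambda| over the sampled rough frequencies."""
    mu = 0.0
    for dx in range(-180, 180, step_deg):
        for dy in range(-180, 180, step_deg):
            if abs(dx) < 90 and abs(dy) < 90:
                continue  # smooth mode, handled by the coarse grid correction
            lam = lex_point_gs_factor(stencil, math.radians(dx), math.radians(dy))
            mu = max(mu, lam)
    return mu


def anisotropic_stencil(eps):
    """(W, E, S, N, C) for -u_xx - eps*u_yy with central differences, h = 1."""
    return (-1.0, -1.0, -eps, -eps, 2.0 + 2.0 * eps)


mu_poisson = smoothing_factor(anisotropic_stencil(1.0))
mu_aniso = smoothing_factor(anisotropic_stencil(0.1))
print(f"isotropic: {mu_poisson:.5f}   eps = 0.1: {mu_aniso:.5f}")
```

For the isotropic stencil the sampled maximum is close to the classical value of one half, and the value for eps = 0.1 is noticeably larger, in line with the trend in the table.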
Table 5.2. Smoothing factor $\mu$ for point Gauss-Seidel relaxation in lexicographical (pGSlex) and red/black (r/bpGS) ordering for the indicated convection-diffusion problems (see section 5.4), where $\varepsilon = 10^{-p}$ and (D) indicates Dirichlet boundary conditions.

problem   p   pGSlex   r/bpGS   r/bpGS (D)
   4      0   .60176   .36000   .35990
          1   .87313   .73469   .73463
          3   .99839   .99602   .99602
          5   .99998   .99996   .99996
   5      0   .45834   .36000   .35990
          1   .44608   .73469   .73463
          3   .44099   .99602   .99602
          5   .44099   .99996   .99996
   6      0   .60176   .36000   .35990
          1   .87313   .73469   .73463
          3   .99839   .99602   .99602
          5   .99998   .99996   .99996
   7      0   .45834   .36000   .35990
          1   .44608   .73469   .73463
          3   .44099   .99602   .99602
          5   .44099   .99996   .99996
   8      0   .66281   .28125   .28125
          1   .91533   .69441   .69441
          3   .99898   .99594   .99594
          5   .99999   .99988   .99988
   9      0   .32950   .28125   .28125
          1   .08202   .69441   .69441
          3   .00098   .99594   .99594
          5   9.8E-6   .99988   .99988
  10      0   .56192   .28125   .28125
          1   .84486   .69441   .69441
          3   .99797   .99594   .99594
          5   .99998   .99988   .99988
  11      0   .56192   .28125   .28125
          1   .84486   .69441   .69441
          3   .99797   .99594   .99594
          5   .99998   .99988   .99988
Table 5.2 shows the results of the smoothing analysis for convection-diffusion problems. The red/black ordering for point Gauss-Seidel has, in general, better smoothing properties than lexicographic ordering, except for problems 5, 7, and 9. Lexicographic ordering is better for those problems because the order in which the unknowns are updated is in the same direction as the convection characteristics. The smoothing factors approach one as the convection terms become more dominant, which implies that point Gauss-Seidel is not a robust smoother for these types of problems.

5.6 Local Mode Analysis for Line Gauss-Seidel Relaxation

Local mode analysis results are presented for lexicographic and zebra (red/black) ordering for x- and y-line Gauss-Seidel relaxations. X-line Gauss-Seidel relaxation with lexicographic ordering gives the splitting

$$ [M] = \begin{bmatrix} & 0 & \\ W & C & E \\ & S & \end{bmatrix}, \qquad [N] = \begin{bmatrix} & -N & \\ 0 & 0 & 0 \\ & 0 & \end{bmatrix}. \tag{5.64} $$

The amplification factor $\lambda(\theta)$ is given by

$$ \lambda(\theta) = -\frac{N e^{\iota \theta_y}}{W e^{-\iota \theta_x} + C + E e^{\iota \theta_x} + S e^{-\iota \theta_y}}. \tag{5.65} $$

Zebra x-line Gauss-Seidel relaxation has the amplification matrix

$$ S(\theta) = \frac12 \begin{bmatrix} a & 0 & -a & 0 \\ 0 & c & 0 & -c \\ b & 0 & -b & 0 \\ 0 & d & 0 & -d \end{bmatrix} \tag{5.66} $$
where a= a(1 +a), b = a(1a), c = ,8(1 + ,8), and d = ,8(1,8) and ,8 The eigenvalues of S(O) are W edJx + C + E edJx S edJy + N .\1 (0) = 0 .X2(0) = 0 .X3(0) = 1 2 (6(0) ab) .X4(0) = 1 2 (cd) and for Dirichlet boundary conditions we have .\1(()) = 0 .X2(0) = 0 1 2 (Pl(O)c5(0) aP3(0) b) 1 2 (cP4(0) d). (5.67) (5.68) Y line GaussSeidel relaxation with lexicographic ordering gives the splitting N [M] = W C 0 s The amplification factor .X(O) is given by 112 0 [N] = 0 0 E (5.69) 0 (5.70)
Zebra yline GaussSeidel relaxation has the amplification matrix S(O) = a 0 0 a 0 c c 0 0 d d 0 b 0 0 b where a= a( a+ 1), b =a( a1), c = /3(/3 + 1), and d = /3(/31) and {3 The eigenvalues of S(O) are W edJx + E edJx S eLIJy + C + N w + E S + C + N .\1 (0) = 0 .X2(0) = 0 .X3(0) = 1 2 (6(0) a+ b) .X4(0) = 1 2 (c +d), and for Dirichlet boundary conditions we have AI(())= 0 .X2(0) = 0 1 2 (PI(O)c5(0) + P4(0)b) 1 2 (c + P3(0)d). (5.71) (5.72) (5.73) The results of local mode analysis for the model problems from section 5.4 are shown in tables 5.3 and 5.4. The smoothing factors were computed numerically with the grid spacing hx = hy = 1 and the angles Ox and Oy were sampled at 1 degree increments. 113
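The lexicographic line splittings of this section lend themselves to a short numerical check. The Python sketch below evaluates the modulus of the amplification factor for x-line and y-line Gauss-Seidel on a constant 5-point stencil and maximizes it over sampled rough frequencies. It is only a sketch under assumptions: the stencil tuple holds signed coefficients, the rough set contains every angle with $|\theta_x| \ge \pi/2$ or $|\theta_y| \ge \pi/2$, and all names are hypothetical; because the convention at $\pm\pi/2$ is assumed, the digits can differ slightly from the tabulated values.

```python
import cmath
import math


def line_gs_factor(stencil, theta_x, theta_y, lines):
    """|lambda(theta)| for lexicographic line Gauss-Seidel, 5-point stencil.

    stencil = (W, E, S, N, C) holds signed coefficients (assumed convention).
    lines = "x": x-lines relaxed in increasing j (only N is lagged).
    lines = "y": y-lines relaxed in increasing i (only E is lagged).
    """
    W, E, S, N, C = stencil
    w = W * cmath.exp(-1j * theta_x)
    e = E * cmath.exp(1j * theta_x)
    s = S * cmath.exp(-1j * theta_y)
    n = N * cmath.exp(1j * theta_y)
    if lines == "x":
        lagged, updated = n, w + C + e + s
    else:
        lagged, updated = e, w + C + s + n
    return abs(lagged / updated)


def smoothing_factor(stencil, lines, step_deg=1):
    """Maximum of |lambda| over the sampled rough frequencies."""
    mu = 0.0
    for dx in range(-180, 180, step_deg):
        for dy in range(-180, 180, step_deg):
            if abs(dx) < 90 and abs(dy) < 90:
                continue  # smooth mode
            lam = line_gs_factor(stencil, math.radians(dx), math.radians(dy), lines)
            mu = max(mu, lam)
    return mu


eps = 0.1
problem2 = (-1.0, -1.0, -eps, -eps, 2.0 + 2.0 * eps)  # -u_xx - eps*u_yy, h = 1
mu_x = smoothing_factor(problem2, "x")
mu_y = smoothing_factor(problem2, "y")
print(f"x-line: {mu_x:.5f}   y-line: {mu_y:.5f}")
```

With the strong coupling in x, the x-line factor stays near $1/\sqrt{5}$ while the y-line factor is close to $1/(1 + 2\varepsilon)$, which illustrates why the lines should follow the direction of strong coupling.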
Table 5.3. Smoothing factor $\mu$ for x- and y-line Gauss-Seidel relaxation in lexicographical (xlGS and ylGS respectively) and zebra (ZxlGS and ZylGS respectively) ordering for the indicated anisotropic diffusion problems (see section 5.4), where $\varepsilon = 10^{-p}$.

problem   p   xlGS     ZxlGS    ZxlGS (D)   ylGS     ZylGS    ZylGS (D)
   1          .44412   .25000   .24992      .44412   .25000   .24992
   2      1   .44412   .12500   .12500      .82644   .82645   .82619
          3   .44412   .12500   .12500      .99800   .99800   .99770
          5   .44412   .12500   .02891      .99998   .99998   .99968
   3      1   .83092   .82645   .82619      .44412   .12500   .12500
          3   .99797   .99800   .99770      .44412   .12500   .12500
          5   .99998   .99998   .99968      .44412   .12500   .02891
Table 5.3 shows the smoothing factors for line Gauss-Seidel relaxation for anisotropic diffusion model problems. It is seen that line relaxation is only a good smoother if the lines are taken in the direction of the strong coupling of the diffusion coefficients. Again, it is seen that the zebra ordering of the lines gives a better smoothing factor than lexicographic ordering.

Table 5.4 shows the smoothing factors for the convection-diffusion model problems for line Gauss-Seidel relaxation. The smoothing factors for line relaxation are good when the convection term characteristics are in the same direction as the lines. The smoothing factor becomes better (smaller) the more the convection terms dominate, if the characteristics are in the direction of the lines. If the characteristics are not in the direction of the lines, then the smoothing factor degenerates quickly, approaching one, the more the convection terms dominate the diffusion term. We see again that for lexicographic ordering the smoothing factor is better when the characteristics have at least one component in the direction of the lexicographic ordering of the lines.

5.7 Local Mode Analysis for Alternating Line Gauss-Seidel and ILLU Iteration

Local mode analysis results are presented for lexicographic and zebra ordering for alternating line Gauss-Seidel relaxation (x-line Gauss-Seidel followed by y-line Gauss-Seidel) and incomplete line LU by lines in x.

The amplification factor $\lambda(\theta)$ for alternating line Gauss-Seidel relaxation with lexicographic ordering is given by

$$ \lambda(\theta) = \lambda_{xlgs}(\theta)\, \lambda_{ylgs}(\theta), \tag{5.74} $$

where $\lambda_{xlgs}(\theta)$ and $\lambda_{ylgs}(\theta)$ are the x- and y-line Gauss-Seidel amplification factors
Table 5.4. Smoothing factor $\mu$ for x- and y-line Gauss-Seidel relaxation in lexicographical (xlGS and ylGS respectively) and zebra (ZxlGS and ZylGS respectively) ordering for the indicated convection-diffusion problems (see section 5.4), where $\varepsilon = 10^{-p}$.

problem   p   xlGS     ZxlGS    ZxlGS (D)   ylGS     ZylGS    ZylGS (D)
   4      0   .45040   .15385   .15380      .62917   .36000   .35990
          1   .48377   .22449   .22449      .91218   .73469   .73463
          3   .44412   .12500   .05644      .99898   .99602   .99602
          5   .44412   .12500   .00057      .99999   .99996   .99996
   5      0   .45040   .15385   .15380      .32950   .36000   .35990
          1   .48377   .22449   .22449      .32950   .73469   .73463
          3   .44412   .12500   .05644      .32950   .99602   .99602
          5   .44412   .12500   .00057      .32950   .99996   .99996
   6      0   .62917   .36000   .35990      .45040   .15385   .15380
          1   .91218   .73469   .73463      .48377   .22449   .22449
          3   .99898   .99602   .99602      .44412   .12500   .05644
          5   .99999   .99996   .99996      .44412   .12500   .00057
   7      0   .32950   .36000   .35990      .45040   .15385   .15380
          1   .32950   .73469   .73463      .48377   .22449   .22449
          3   .32950   .99602   .99602      .44412   .12500   .05644
          5   .32950   .99996   .99996      .44412   .12500   .00057
   8      0   .63226   .24324   .24318      .63226   .24324   .24318
          1   .91344   .69444   .69409      .91344   .69444   .69409
          3   .99898   .99601   .99541      .99898   .99601   .99541
          5   .99999   .99996   .99935      .99999   .99996   .99935
   9      0   .27929   .24324   .24318      .27929   .24324   .24318
          1   .06696   .69444   .69409      .06696   .69444   .69409
          3   .00080   .99602   .99541      .00080   .99601   .99541
          5   8.0E-6   .99996   .99935      8.0E-6   .99996   .99935
  10      0   .27929   .24324   .24318      .63226   .24324   .24318
          1   .06696   .69444   .69409      .91344   .69444   .69409
          3   .00080   .99601   .99541      .99898   .99601   .99541
          5   8.0E-6   .99996   .99935      .99999   .99996   .99935
  11      0   .63226   .24324   .24318      .27929   .24324   .24318
          1   .91344   .69444   .69409      .06696   .69444   .69409
          3   .99898   .99601   .99541      .00080   .99601   .99541
          5   .99999   .99996   .99935      8.0E-6   .99996   .99935
respectively. Thus, The zebra alternating line GaussSeidel relaxation amplification matrix S(O) is given by S(O) = Bxtgs(O) Sytgs(O) (5.76) where Sxlgs ( 0) and Sylgs ( 0) are the xand yline GaussSeidel amplification matrices respectively. We represent the matrix S(O) by where S(O) = l e = P3(0)bxdy g = P4(0)axby a c e g b a b d c d f e f h g h d = CxCy f = P3(0)cxdy h = P4(())dxby, (5.77) and the subscripts x and y indicate that the coefficients came from Sxlgs ( 0) and Sylgs ( 0) respectively; see equations (5.66) and (5.71). The eigenvalues of S(O) are D \3,4(0) = D24 [(ae)(dh)(bf)(cg)] 8 where D = a+ deh is the diagonal of S ( 0). For nonDirichlet boundary conditions we set Pl(O) = P3(0) = P4(0) = 1. 117
The incomplete x-line LU iteration (ILLU) amplification factor is not hard to compute, but it is a little more complicated than for the other relaxation methods. We need to compute M and N for the ILLU splitting; see section 4.3. Incomplete factorization methods have the property that the stencils for M and N are dependent upon their location on the grid, even when the stencil of L is not. However, the stencils of M and N usually tend rapidly to constant stencils away from the boundaries. It is these constant stencils for M and N that will be used for the local mode analysis. It can be seen that the smoothing factor increases towards one as the block (x-line) size increases. For this reason, we will assume the worst case, $n_x = \infty$, for the computation of the local smoothing factors. The component $D_j$ from equation (4.54) is computed without the j subscript until the stencil for D becomes stable. By stable, we mean that the values do not change, or that the change is taking place only in the digits after a specified decimal place. We used six decimal places in our computations. When a stable D has been computed, the stencils for M and N can be constructed and the smoothing factor computed using equation (5.11). Due to the nature of these computations, it is not possible to write down a general formula for the amplification factor as was done for the other relaxation methods.

The results of local mode analysis for the model problems from section 5.4 are shown in table 5.5 and table 5.6. The smoothing factors were computed numerically with the grid spacing $h_x = h_y = 1$, and the angles $\theta_x$ and $\theta_y$ were sampled at 1 degree increments.

Table 5.5 shows the smoothing factors for alternating line Gauss-Seidel relaxation and incomplete line LU iteration for the anisotropic diffusion model problems. Lexicographic ordering for alternating line relaxation provides a fair smoothing factor, but zebra ordering provides much better smoothing factors. The smoothing factors for
Table 5.5. Smoothing factor $\mu$ for alternating line Gauss-Seidel relaxation in lexicographical (ALGS) and zebra (AZLGS) ordering, and incomplete line LU iteration by lines in x (ILLU) for the indicated anisotropic diffusion problems (see section 5.4), where $\varepsilon = 10^{-p}$.

problem   p   ALGS     AZLGS    AZLGS (D)   ILLU
   1          .14634   .02547   .02546      .05788
   2      1   .36903   .10107   .10104      .13272
          3   .44322   .12472   .12467      .19209
          5   .44411   .12500   .02890      .19920
   3      1   .36903   .10107   .10104      .10769
          3   .44322   .12472   .12467      .16422
          5   .44411   .12500   .02890      .14136
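The product rule (5.74) for lexicographic alternating line relaxation can be checked numerically. The Python sketch below does this for the 5-point Laplacian only; it is a sketch under assumptions rather than the procedure used for the table: the rough set contains every angle with $|\theta_x| \ge \pi/2$ or $|\theta_y| \ge \pi/2$, the sampling is at one degree increments, and the helper names are hypothetical.

```python
import cmath
import math


def lam_x(a, b):
    """x-line lexicographic factor for the 5-point Laplacian (north lagged)."""
    return cmath.exp(1j * b) / (4.0 - 2.0 * math.cos(a) - cmath.exp(-1j * b))


def lam_y(a, b):
    """y-line lexicographic factor for the 5-point Laplacian (east lagged)."""
    return cmath.exp(1j * a) / (4.0 - 2.0 * math.cos(b) - cmath.exp(-1j * a))


def rough_max(modulus, step_deg=1):
    """Maximum of modulus(theta_x, theta_y) over the sampled rough angles."""
    best = 0.0
    for dx in range(-180, 180, step_deg):
        for dy in range(-180, 180, step_deg):
            if abs(dx) < 90 and abs(dy) < 90:
                continue  # smooth mode
            best = max(best, modulus(math.radians(dx), math.radians(dy)))
    return best


mu_xline = rough_max(lambda a, b: abs(lam_x(a, b)))
mu_alternating = rough_max(lambda a, b: abs(lam_x(a, b) * lam_y(a, b)))
print(f"x-line: {mu_xline:.5f}   alternating: {mu_alternating:.5f}")
```

Under these assumptions the alternating factor comes out near $1/(3\sqrt{5})$, roughly a third of the single x-line value, because the mode that limits one line direction is damped strongly by the other.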
incomplete line LU iteration are good as well, but alternating zebra line relaxation is slightly better.

Table 5.6 shows the smoothing factors for the convection-diffusion model problems for alternating line Gauss-Seidel relaxation and incomplete line LU iteration. The smoothing factors for lexicographic ordering for the alternating line relaxation are good, but get even better when the characteristics are in the same direction (lexicographic) in which the lines are solved. The alternating zebra line relaxation gives good smoothing factors when the characteristics are in the same direction as the lines; indeed these smoothing factors are better than those for lexicographic ordering. The zebra ordering gives fair smoothing factors when the characteristics are not aligned with the lines. The incomplete line LU iteration is done by lines in the x-direction and is nearly a direct solver when the convection terms are dominant. The smoothing factors are good for all of the model problems, and they are about equal to or much better than those for alternating line relaxation. They are especially superior when the convection term characteristics are not aligned with the grid lines.

5.8 Local Mode Analysis Conclusions

We have looked at local mode analysis for several iterative methods for use as a smoother in our black box multigrid method. The test problems can be classified in many ways, but we will break them down into four types and refer to them via their model problem number from section 5.4. The first type is the isotropic diffusion equation represented by model problem (1). The second type is the anisotropic diffusion equations represented by model problems (2) and (3). The third type is the convection-diffusion equations represented by model problems (4)-(11) with $\varepsilon = 1$. The fourth type is the convection dominated equations represented by model problems (4)-(11) with $\varepsilon \ll 1$.
Table 5.6. Smoothing factor $\mu$ for alternating line Gauss-Seidel relaxation in lexicographical (ALGS) and zebra (AZLGS) ordering, and incomplete line LU iteration by lines in x (ILLU) for the indicated convection-diffusion problems (see section 5.4), where $\varepsilon = 10^{-p}$.

problem   p   ALGS      AZLGS    AZLGS (D)   ILLU
   4      0   .22269    .04706   .04704      .07977
          1   .40812    .14253   .14251      .16150
          3   .44322    .12444   .05611      .19952
          5   .44411    .12499   .00057      .20000
   5      0   .14750    .04706   .04704      .07977
          1   .15423    .14253   .14251      .16150
          3   .14634    .12444   .05611      .19952
          5   .14634    .12499   .00057      .20000
   6      0   .22269    .04706   .04704      .03759
          1   .40812    .14253   .14251      .00567
          3   .44322    .12444   .05611      .00009
          5   .44411    .12499   .00057      4.4E-9
   7      0   .14750    .04706   .04704      .03759
          1   .15423    .14253   .14251      .00567
          3   .14634    .12444   .05611      .00009
          5   .14634    .12499   .00057      4.4E-9
   8      0   .24619    .06349   .06346      .04940
          1   .39787    .31498   .31483      .01489
          3   .44358    .40579   .40559      .00019
          5   .44412    .40684   .40665      1.9E-6
   9      0   .06754    .06349   .06346      .04940
          1   .00441    .31498   .31483      .01489
          3   6.3E-7    .40579   .40559      .00019
          5   6.4E-11   .40684   .40665      1.9E-6
  10      0   .15074    .06349   .06346      .04940
          1   .05636    .31498   .31483      .01489
          3   .00073    .40579   .40559      .00019
          5   7.3E-6    .40684   .40665      1.9E-6
  11      0   .15074    .06349   .06346      .04940
          1   .05636    .31498   .31483      .01489
          3   .00073    .40579   .40559      .00019
          5   7.3E-6    .40684   .40665      1.9E-6
We see that red/black point Gauss-Seidel relaxation is a good smoother for only the isotropic diffusion and convection-diffusion ($\varepsilon = 1$) equations; for variable coefficients, however, it is a good smoother for only isotropic diffusion equations.

Zebra line Gauss-Seidel relaxation is a good smoother for all four types of problems, provided that the anisotropies and convection characteristics are aligned with the proper grid directions.

The only two robust choices for smoothers are the alternating zebra line Gauss-Seidel relaxation and the ILLU methods. They both exhibit good smoothing factors for all the types of problems. However, ILLU is better for all types of convection-diffusion equations and just slightly worse for the two types of diffusion equations. The suitability of either choice for the smoother will depend on the efficiency of the implementation, and under these circumstances it would appear that alternating zebra line Gauss-Seidel relaxation has the advantage.

5.9 Other Iterative Methods Considered for Smoothers

If one looks at the local mode analysis, it will be noticed that lexicographic point Gauss-Seidel relaxation has a good smoothing property for convection problems when the sweep direction is aligned with that of the convection. This suggests another smoother, namely, 4-direction point Gauss-Seidel relaxation. The 4-direction point Gauss-Seidel method performs four sweeps over the grid, starting each time from a different corner of the grid. The first two sweeps are the same as symmetric point Gauss-Seidel, with the dominant sweep direction being in x, and the third and fourth sweeps are again a symmetric point Gauss-Seidel relaxation, but with the dominant sweep direction in y this time.

The sweeping strategy for symmetric point Gauss-Seidel relaxation is the
same as performing one iteration of lexicographic point Gauss-Seidel relaxation followed by another sweep, but the second sweep starts at the (Nx, Ny) point of the grid with the index in the x-direction decreasing the quickest. To form the 4-direction point Gauss-Seidel method we perform one iteration of symmetric point Gauss-Seidel followed by another iteration of symmetric point Gauss-Seidel, but this time rotated 90 degrees so that the roles of x and y are reversed. The 4-direction point Gauss-Seidel method exhibits good smoothing properties for all the model problems except the anisotropic ones.

The 4-direction method is partially vectorizable, but not parallelizable. Let us take a look at one of the four sweeping directions, namely lexicographic, to illustrate how one can obtain some vectorization. The nonvectorizing lexicographic sweep is computed as follows.

      DO j = 1, Ny
        DO i = 1, Nx
          u(i,j) = (N(i,j)*u(i,j+1) + E(i,j)*u(i+1,j) + S(i,j)*u(i,j-1) + W(i,j)*u(i-1,j) + f(i,j)) / C(i,j)
        END DO
      END DO

Vectorization is prevented by the reference to u(i-1,j). We can minimize the impact of the vector dependency by reorganizing the calculation and creating a new temporary array of length Nx. The new code with vectorization is computed in the following way.

      DO j = 1, Ny
        ! vector loop: no dependence between iterations
        DO i = 1, Nx
          tmp(i) = N(i,j)*u(i,j+1) + E(i,j)*u(i+1,j) + S(i,j)*u(i,j-1) + f(i,j)
        END DO
        ! scalar loop: first-order recurrence in i
        DO i = 1, Nx
          u(i,j) = (tmp(i) + W(i,j)*u(i-1,j)) / C(i,j)
        END DO
      END DO

The first loop over i performs vector operations and the second loop over i performs scalar operations. The algorithm is presented for a 5-point stencil; the only difference for a 9-point stencil is the addition of the other computational elements to the calculation in the vector (first) loop.

On the Cray Y-MP, the second algorithm is roughly equivalent, timewise, to alternating zebra line Gauss-Seidel for small grids (< 32 x 32) and is faster for larger grids. The 4-direction method outperforms the alternating line method for larger grids because the line solves are sequential in nature.

Several experiments were run on the Cray Y-MP using the 4-direction point Gauss-Seidel relaxation for the smoother. The results were mixed, but generally favorable. The performance for isotropic diffusion or linear convection characteristic problems was good, as expected from the Fourier mode analysis. Anisotropic problems, however, performed quite poorly, as expected from the analysis. For convection-diffusion problems with variable convection characteristics the results were dependent on the form of the characteristics and the choice of grid transfer operators. The 4-direction point Gauss-Seidel smoother worked best with the nonsymmetric collapsing method (aL/rJL) from section 3.5.1 for the grid transfer operators. However, for reentrant flows we were still unable to obtain a method which would give any kind of reasonable convergence rate. The results from one of the numerical experiments are given in table 7.34 in section 7.6.
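The loop splitting described above can be tried out in a few lines of Python. The sketch below is illustrative only and makes several assumptions: the stencil coefficients are constants stored in a dictionary with the same sign convention as the loops (off-diagonal terms are added to the right hand side), the outermost rows and columns hold fixed boundary values, and the function names are hypothetical. It checks that the split sweep, with a dependence-free first loop and a scalar recurrence in the second, reproduces the plain lexicographic sweep.

```python
def sweep_naive(u, f, c):
    """One in-place lexicographic sweep over the interior points."""
    ny, nx = len(u), len(u[0])
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            u[j][i] = (c["N"] * u[j + 1][i] + c["E"] * u[j][i + 1]
                       + c["S"] * u[j - 1][i] + c["W"] * u[j][i - 1]
                       + f[j][i]) / c["C"]


def sweep_split(u, f, c):
    """Same sweep, split into a vectorizable loop and a scalar recurrence."""
    ny, nx = len(u), len(u[0])
    for j in range(1, ny - 1):
        # Vectorizable part: each tmp entry uses only values that are fixed
        # while row j is processed (old row j+1, old u[j][i+1], new row j-1).
        tmp = [c["N"] * u[j + 1][i] + c["E"] * u[j][i + 1]
               + c["S"] * u[j - 1][i] + f[j][i]
               for i in range(1, nx - 1)]
        # Scalar part: u[j][i] needs the freshly updated u[j][i - 1].
        for i in range(1, nx - 1):
            u[j][i] = (tmp[i - 1] + c["W"] * u[j][i - 1]) / c["C"]


def make_grid(ny, nx):
    """Deterministic test data, including the boundary values."""
    return [[((3 * i + 7 * j) % 11) / 11.0 for i in range(nx)] for j in range(ny)]


coef = {"N": 1.0, "E": 1.0, "S": 1.0, "W": 1.0, "C": 4.0}
rhs = [[0.1 * ((i + j) % 3) for i in range(8)] for j in range(6)]
u_naive, u_split = make_grid(6, 8), make_grid(6, 8)
sweep_naive(u_naive, rhs, coef)
sweep_split(u_split, rhs, coef)
max_diff = max(abs(a - b) for ra, rb in zip(u_naive, u_split) for a, b in zip(ra, rb))
print(max_diff)
```

The two versions agree up to rounding because the only value that changes while a row is being relaxed is the west neighbour, which is exactly the term left in the scalar loop.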
CHAPTER 6

VECTOR ALGORITHM CONSIDERATIONS IN TWO DIMENSIONS

The computers that we used fall into three categories: sequential, vector, and parallel. Each of these types of computers has its own characteristics that can affect the development and implementation of algorithms that are to execute on them.

Sequential computers come in many varieties, but they all execute basically "one" instruction at a time. By "one" instruction we mean to lump all of the pipelining architectures with the classic sequential computer architecture. This lumping may not be entirely fair, but we believe that it is nearly impossible to find a computer today that does not use some form of pipelining. If one looks at the scalar processors on vector computers, one sees a great deal of pipelining. As far as the choice of algorithms is concerned, they can indeed be lumped together. However, it does pay to remember that pipelining of instructions is taking place, and therefore the implementation of the algorithm should take advantage of it when at all possible. For the most part, compilers today are very good at keeping the pipeline full, but they can still benefit from a careful implementation that arranges the instructions to avoid possible bubbles (null operations) in the pipeline, caused for example by excessive branching. We have used a Sun SPARC workstation to represent the class of sequential computers, but a standard PC would have done just as well.

The vector computers are those with vector units that can process a vector
(array) of data with one instruction with only the indices varying in a fixed relationship to each other. The Cray Y-MP, which we have used, is a prime example of such a computer. The CM-5, which we have also used, has vector units, but the vector units are not very fast when compared to the Y-MP's, and they have a vector length of 16 (changed from 8 to 16 in late 1995), which is very short when compared to the Y-MP's vector length of 64. Vector computers also have scalar processors for handling the nonvectorizable instructions, and these processors can be considered to be the same as those of the sequential computers. The Cray Y-MP can have several processors that can be run in a parallel mode configuration. However, we have chosen to use only one processor on the Cray Y-MP so that we can concentrate on the vectorization issues.

There are several types of parallel computers and parallel computing models. The type of parallel computer that we considered is the single instruction multiple data (SIMD) computer; the CM-5 is such a computer, but it might be more meaningful to classify it as a single program multiple data (SPMD) computer for our purposes. The SPMD programming model is one of the most widely used parallel programming models for almost all parallel computers. The CM-5 can be run under two different execution models: the data parallel model, which we used, and the message passing model, which we choose not to address in this thesis.

Probably the one issue that has the greatest effect on algorithm performance, regardless of the type of computer, is that of memory references. This issue can manifest itself in many ways at both the hardware and software level of the computer architecture. For sequential and vector computers it usually revolves around a memory cache, but memory bank architecture can also play a role. It should be noted that the Cray Y-MP does not have a memory cache.
On parallel computers the memory cache and banks are usually subordinate to the data communications network. 126
Each of these three types of computers has its own influences on the choice and implementation of the various components of multigrid algorithms. However, we will restrict our choices for the vector computers in such a way as to avoid degrading the code's execution on a sequential computer in any meaningful way. If, however, a particular choice would cause only minor degradation on the sequential computer but greatly improve performance on vector computers, then it should be allowed.

For the above reasons, and because it is not too interesting, we will not examine our multigrid codes on any sequential computers. The performance of closely related black box multigrid codes for various problems has already appeared in the literature [27] [8]. However, for timing comparisons only, we will include some data for a Sparc5 workstation.

6.1 Cray Hardware Overview

The Cray Y-MP is our baseline vector computer for the design of the black box multigrid vector algorithm. The hardware model that we will present for the software design is equally valid for the Cray Y-MP, X-MP, M90, and C90 computers because we are concerned only with the single processor vector algorithm.

The Cray Y-MP computers can have a number of CPUs (central processing units), typically 4, 8, or 16. The CPUs are each connected to shared memory and an I/O system; see figure 6.1. Each CPU has four ports, a memory path selector, instruction buffers, registers, and several functional units; see figure 6.2.

We will start by describing the CPU registers. The Cray computer's word length is 64 bits. The vector registers are set up as 8 vectors of 64 elements each (the Cray C90 has 128 elements), where each element is a 64 bit word. There are 8 scalar registers and 64 intermediate result scalar registers, each with 64 bits. The address registers can also have an impact on the software design. There
[Figure 6.1: Cray Y-MP hardware diagram for n CPUs (CPU 0 through CPU n, each connected to the shared memory and to the I/O system).]
[Figure 6.2: Cray CPU configuration (ports 0 to 3, memory path selector, instruction buffers, registers, and functional units, connected to memory and I/O).]
are 8 address registers consisting of 32 bits each and 64 additional 32 bit intermediate address registers. The intermediate address registers are primarily used for processing the address register data.

In addition to the above mentioned registers there are a variety of others that will not be discussed here because they vary somewhat between Cray's different computer models and because they do not really have an effect on the design of the vector algorithms. For completeness we will mention the major categories; they are the vector mask, vector length, hardware performance monitor, programmable clock, control, mode, status, memory error information, exchange address, and exchange package information registers, along with a number of flag bits. There are also some additional registers for parallel processing on the Cray computers, which fall into either the shared resources registers or the cluster registers.

The main memory consists of either 2 or 4 sections, each containing 8 subsections of 64 banks. The memory bank size depends on the model of Cray computer and the memory size configuration chosen for that model. The two Cray computers that we used were a Y-MP and an M90, which have memory banks of 256K words and 8M words respectively; see Appendix B for more details. The memory is interleaved throughout all the banks, subsections, and sections. Consecutively stored data items are placed into different memory banks, and no bank is reused until all the banks have been used. The 8 vector and 8 scalar registers coincide with the 8 subsections of a memory section. The 64 (128 on the C90) vector elements per vector register coincide with the 64 memory banks of each subsection.

Each CPU has its own path to each section of memory. A single CPU cannot make simultaneous accesses to the same section of memory. Each CPU has four ports: two for reading, one for writing, and one for either an instruction buffer or an I/O request; see figure 6.2. In order for a CPU to access memory, it must have both an
available port and memory path. A CPU memory reference makes an entire memory subsection unavailable to all ports of the CPU until the reference has completed (five clock cycles). In addition, the memory reference also makes that bank unavailable to all other ports of all the other CPUs in the system until it has completed.

There are 5 basic types of functional units in each CPU. The first one is vector bit-oriented, and it consists of integer add, logical, shift, pop/parity, and secondary logical operations. The second functional unit is the floating point vector operations unit, which is also used to perform scalar floating point operations. The third functional unit is the scalar bit-oriented unit and includes integer add, logical, shift, and pop/parity operations. The fourth functional unit is the address computational unit. The fifth functional unit is the instruction decode and fetch unit.

The CPU's functional units all operate independently of each other. In addition, the functional units are fully segmented (pipelined). That is, the intermediate steps required to complete an operation are broken down into one clock period segments. Once the pipeline is full, results are produced at one per clock period. The number of segments that a functional unit has depends on the complexity of the functions it must perform; hence, the functional units are mostly of different segment lengths.

The functional units can also be chained together, and because they operate independently, it is possible to perform several operations concurrently. For instance, let a, b, c be vectors and d a scalar; then the vector-scalar operation a(i) = b(i) * c(i) + d can be performed with one result a(i) per clock period for all i. The concurrent operations taking place are two vector loads (b and c), a vector multiply, a scalar add, and a vector store (a).
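The benefit of a fully segmented functional unit can be estimated with a simple count of clock periods. The sketch below is only an idealized model, not a description of the actual Cray hardware: it assumes a unit with a fixed number of one-clock segments, ignores memory bank and port conflicts, and uses a hypothetical segment count of 7. The function names are illustrative.

```python
def pipelined_clocks(n_elements, n_segments):
    """Clock periods for a fully segmented unit to produce n_elements results."""
    if n_elements <= 0:
        return 0
    # The first result emerges after n_segments clocks; every later result
    # follows one clock period after the previous one.
    return n_segments + n_elements - 1


def unpipelined_clocks(n_elements, n_segments):
    """Clock periods if each operation had to finish before the next began."""
    return n_elements * n_segments


if __name__ == "__main__":
    segments = 7  # hypothetical segment count, not a Cray Y-MP figure
    for n in (1, 8, 64):
        fast = pipelined_clocks(n, segments)
        slow = unpipelined_clocks(n, segments)
        print(n, fast, slow, round(slow / fast, 2))
```

Under these assumptions the startup cost of filling the pipeline is paid once per vector operation, so the advantage grows with the vector length and approaches the segment count for long vectors.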
The Cray Y-MP can be forced to perform, essentially, as a sequential computer by compiling the code with the NOVEC compiler directive and by setting the compiler
optimization flag for no vectorization, e.g. cf77 -O vector0 or cft77 -o novector. This can be very useful for determining the actual speedup associated with vectorizing the code.

Definition 6.1.1 The speedup factor for a code or code fragment is defined as

$$ S = \frac{T_{old}}{T_{new}} \qquad (6.1) $$

where $T_{old}$ and $T_{new}$ are the execution times for the old and new codes respectively.

The speedup factor as we have defined it is sometimes called the relative speedup. The speedup factor can be used to measure the vectorization speedup by setting $T_{old}$ to the execution time without vectorization and $T_{new}$ to the execution time with vectorization.

To get the best performance for a given algorithm and still use only the higher level language (FORTRAN in our case), there are several things that can be done. First recall that the code must also be able to execute on a sequential computer and remain as portable as possible. This consideration means that we cannot use any machine specific routines or instructions that would make the code nonportable. This decision limits the options available, but it does not impose too many difficulties and does not reduce the achievable performance gains by very much. The things that we can control are the data structure design, the choice and implementation of the algorithm, and the use of compiler options and directives.

6.2 Memory Mapping and Data Structures

Leading array dimensions should always be declared to be of odd length, e.g. 129. This is because most vector (and nonvector) computers have their memory set up in banks, and the data are distributed across the memory banks in various ways depending on the particular computer. The number of banks is almost always set up to be a power
of two. So, if an array is declared to be of even length, then there is a strong possibility of memory bank conflicts when fetching and storing consecutive array elements. Memory bank conflicts can significantly slow down the performance of a code. On many vector computers, declaring an array with an odd length typically gives a speedup factor of 2 to 4 over declaring it with an even length.

Vectorization takes place only for inner loops, but in special circumstances nested loops may be combined into a single loop that is vectorized by the compiler. The most frequently used inner looping index should have the longest length possible. For example, if a double DO loop, with no vector dependencies, is indexed over i = 1, ..., N and j = 1, ..., M where N > M, then the loop over i should be the inner one. The data structures of the arrays should also be set up in such a way as to allow the most frequently used looping index to be placed as near the beginning (leftmost) index position as possible. By doing these simple restructurings of the arrays, it has been found that a speedup factor of anywhere from 2 to 8 can be obtained for the various components of the black box multigrid codes.

6.3 Scalar Temporaries

A scalar temporary is a scalar that is set equal to a vector expression. Scalar temporaries are most often used to express common subexpressions in a loop. The use of scalar temporaries is a very delicate issue, and the extent of their use varies with the complexity of the computation and the compiler that is used. It is often the case that a speedup factor of 1.6 to 3 can be observed in code with the proper use of scalar temporaries over code that has either overused or underused them.

For the black box multigrid codes, the generation of the grid transfer operator coefficients and the formation of the coarse grid operators are highly susceptible to the
use of scalar temporaries due to the size and complexity of these computations. It is not possible to obtain an optimum implementation using scalar temporaries and have the code remain optimum and portable at the same time. However, we did manage to find a reasonable compromise which should be fairly efficient over a wide range of computers and compilers.

6.4 In-Code Compiler Directives

The use of compiler directives in the code can also greatly enhance the compiler's ability to optimize the code. When compiler directives are not used, the compiler may do several things that can slow down the performance. The compiler may fail to vectorize a loop because it suspects that there may be data dependencies present. The compiler may add runtime code to determine the vector length, the amount of loop unrolling that can be performed, or whether a loop can be vectorized at all. Using compiler directives can eliminate many of these problems and, in addition, can speed up both the execution and compilation time by eliminating the need for runtime checking.

Compiler directives vary from computer to computer, but almost all have the advantage that they are interpreted as comment lines by other compilers. This means, however, that one will have to change the compiler directives when one moves the code from one type of computer to another, or that one will have to add the compiler directives to the code for as many different computers as are likely to be used. We have chosen to place only the compiler directives for the Cray Y-MP in the vector versions of the black box multigrid codes. The most commonly used compiler directives are:

1. Ignore Vector Dependencies. (CDIR$ IVDEP)
2. Scalar Operations. (CDIR$ SCA)
3. Inlining. (CDIR$ INLINE)

6.5 Inlining

The use of subroutines to make a code modular has been in fashion for a couple of decades now. It has been found that codes are more readable and that the software can often be reused when the codes are split up modularly according to their functionality. However, this modularity has the unfortunate effect of slowing down the codes' performance on many computers. This is due mostly to the overhead involved with executing a subroutine call.

Many compilers have a directive for inlining. Inlining consists of the compiler taking the subroutine and rewriting it at the same code level as the calling program. This eliminates the overhead of the subroutine call from the code's performance. One may ask why all compilers do not do inlining automatically. There are several answers to this question. Not all compilers can perform inlining. It takes much longer to compile a code that requests inlining. The executable is usually larger when inlining is requested, because a subroutine may be called many times from different places in the code, and for each instance the code is copied into the calling routine, creating many copies of the same piece of code for that subroutine.

The black box multigrid codes use subroutines mostly to separate out the different multigrid components by functionality. Thus, the use of inlining does not make the executable very large, even though there are several instances of subroutines being duplicated. The duplications are mostly for the smoother and for computing the residual and its l2 norm.
6.6 Loop Swapping

Many algorithms, when implemented, contain nested DO loops. For the best performance, the inner loop should have the longest vector length possible while maintaining the loop's vectorization. The speedup gain is dependent on the length of the loops and the amount of computation in the inner loop, but speedup factors from 1 to 6 are not uncommon.

The black box multigrid codes have been implemented with the longest vector length in the inner loop. This decision has meant that compiler directives were needed to inform the compiler that the inner loops contained no vector dependencies.

6.7 Loop Unrolling

Short loops can often generate more overhead instructions than computational instructions. Because of this fact, many compilers will automatically unroll loops that have a small fixed length. Loops with a parameter as their indexing limit and very little computational work can sometimes be unrolled partially to leave more computational work per loop iteration, but not all compilers are capable of performing this kind of loop unrolling.

There are several short loops in the smoothing subroutines of the black box multigrid codes that can benefit from loop unrolling. The performance speedup factors for these loops range from 1.3 to 3.

6.8 Loops and Conditionals

Loops with a conditional statement in them will not vectorize. By conditional statements we mean IF statements and computed GOTO statements.

In the subroutine that computes the grid transfer coefficients in the black box multigrid codes, a test needs to be performed inside several of the loops to determine
which form the computation is to take; see chapter 3 on grid transfer operators. The IF statement for the test is converted into a computation involving the maximum intrinsic function, which is used to combine both computational branches of the test's outcome. This device makes the loop vectorizable, giving a speedup factor of nearly 100 over the nonvectorizable version involving the IF statement.

6.9 Scalar Operations

Loops and blocks of computation that cannot be vectorized should be written in a form that the compiler can recognize as common scalar operations. Some of the common forms are simple vector dependencies and recursions. If the compiler can recognize these common scalar operations, it is possible for it to use specialized scalar libraries that make the best use of the hardware to obtain the best performance. The performance gain is usually modest and ranges up to a speedup factor of 1.5.

6.10 Compiler Options

The performance of a code is greatly dependent on the ability of the compiler to recognize code fragments and optimize them. In order for the compiler to generate fast and efficient code, it is important not only to write efficient code and to use in-code compiler directives, but also to use the appropriate compiler options in the compile commands. The choice of these options can dramatically speed up or slow down the performance.

The compiler options that were used on the Cray Y-MP for all the black box multigrid codes were

   cf77 -Zv -Wf"-dz -o inline3,aggress -A fast" files.f

where -Zv means to vectorize for a single processor and use the dependence analyzer
fpp; -Wf indicates that the options in the quotes are for the FORTRAN compiler cft77; -dz says to disable debugging and other utilities; -o inline3,aggress is two commands, where inline3 means to inline subroutines up to three levels deep, and aggress means to raise the limits on stacks, internal tables, and searches to allow some loops to be optimized that might not otherwise vectorize; -A fast means to use the fast addressing mode and not the longer addresses that use indirect addressing under the default full addressing mode.

The use of these compiler options can easily double the performance of the black box multigrid codes by allowing the compiler to vectorize the codes more fully and by cutting out some of the overhead that is generated for software analysis tools.

6.11 Some Algorithmic Considerations for Smoothers

We are interested in the multicolor point and line Gauss-Seidel methods and in the ILLU x-line method for smoothers. All the Gauss-Seidel methods vectorize quite easily, while the ILLU method does so only marginally.

6.11.1 Point Gauss-Seidel Relaxation

While we have said that the vector and sequential algorithms are the same, when it comes to the implementation of the point Gauss-Seidel smoother there is a difference. The red/black point Gauss-Seidel method, performed in the normal fashion (first all red points are updated, followed by all the black points), is very inefficient on cache based systems, because it requires all the data to pass through the cache twice. An easy modification that allows the data to pass through the cache only once can be found in [35]; the algorithm is

Algorithm 6.11.1 Let the grid be of size Nx × Ny; then the cache based red/black Gauss-Seidel algorithm is given by:
1. update all red points in row j = 1
2. Do j = 2, Ny
3.    update all red points in row j
4.    update all black points in row j - 1
5. End Do
6. update all black points in row j = Ny

This cache based algorithm can give a speedup factor of two orders of magnitude on many cache based RISC processors. The 4-color Gauss-Seidel method can also be modified in a similar way to achieve the same type of speedup.

The cache algorithm is not as useful on vector computers because the vector length is now only over a single line rather than over the entire grid. The loss in vector performance is small provided that the rows are at least as long as the computer's vector length. The shorter the vectors become, compared to the machine's vector length, the faster the performance drops off. The Cray Y-MP vector computers do not have a cache and should not use the cache based Gauss-Seidel algorithm, but the RISC processor computers should use this algorithm because they all employ caches.

6.11.2 Line Gauss-Seidel Relaxation

Zebra line Gauss-Seidel relaxation requires the solution of tridiagonal systems for the line equations. The tridiagonal systems are solved by Gaussian elimination. There are two approaches to solving the equations: either we solve the lines from scratch every time, or we factor the tridiagonal systems, save the LU decompositions, and solve the factored systems. The first approach uses less memory because the LU decompositions are not saved, but the total solution time is longer. The second approach is more favorable, if enough memory is available, because the LU decompositions account for 40% of the operation
count for the solution of the tridiagonal system of equations.

The LU factorization and solution phases for a tridiagonal system are both inherently sequential with essentially no vectorization. However, we are not solving just a single line, but a zebra ordering of all the lines on the grid. Vectorization can be obtained by performing each step of the decomposition and solution on all of the lines of the same color simultaneously. The benefits of vectorization diminish as the grids become coarser because there are fewer lines to vectorize across, but for standard coarsening the lines are also becoming shorter, reducing the amount of sequential work that must be performed. In this respect the standard coarsening algorithm is more efficient than the semicoarsening algorithm.

On a RISC processor cache based system it is best to perform the LU solution on each tridiagonal x-line separately because it requires the data to pass through the cache only once. However, the zebra y-line Gauss-Seidel relaxation should use the vector algorithm, looping over all the lines in the x-direction, which also requires the data to pass through the cache only once.

We could have used the Linpack (Lapack) tridiagonal factorization and LU solvers, SGTTRF and SGTTRS respectively, but they cannot perform the factorization or back substitution for more than one system at a time, and they do not use the cache in the same way. The routines are fine for a stand alone system, but they cannot take advantage of the vectorization or cache potential that can be obtained by writing our own routines that have knowledge of the entire smoothing process. For this reason we have implemented our own solvers.

6.12 Coarsest Grid Direct Solver

The coarsest grid problem is either a single grid line (in the case of a rectangular fine grid with standard coarsening or in the case of the semicoarsening algorithm)
or a small system of equations. The single grid line equation is a tridiagonal system which is easily solved, and we have chosen to save the LU factorization. The small system of equations is a banded system. We chose to use the Linpack general banded routines, SGBFA and SGBSL, because they existed on all the machines we used. Lapack would be an even better choice, but not all the machines to which we had access had it installed. In addition, the implementations of the Linpack routines were optimized at our institution for our computers.

6.13 l2-Norm of the Residual

The l2-norm of the residual is used in the black box multigrid solver as one of the criteria for deciding when to stop the multigrid iterations. It has also been used in the test codes to determine the reduction factors for an iteration and for various components. The computation of the norm is straightforward, but since the computation involves the sum of a large number of floating point values, it might be wise to ask if the result has any meaning. The question is quite valid because it is well known that summing floating point numbers can lead to errors in the resulting summation, depending on how the actual sum is computed.

Originally the norm was computed by adding the squares of the residuals to the running sum total. We will call this type of summation the naive summation algorithm, and define it to be the following.

Algorithm 6.13.1 Let r be a floating point array of N values, $r_i$, i = 1, ..., N. The naive summation algorithm for $\sum_{i=1}^{N} r_i$ is given by the following:

   sum = 0.
   Do i = 1, N
      sum = sum + r(i)
   EndDo

The computed sum is equal to $\sum_{i=1}^{N} r_i (1 + \delta_i)$, where $|\delta_i| < (n - i)\epsilon$, $\epsilon$ is the machine epsilon (which is the increment between representable floating point numbers), and n is the number of units (digits) in the last place that are in error.

The naive summation algorithm vectorizes, and it will be unrolled several iterations by good optimizing compilers. It also performs very well on cache based RISC computers because the data is usually stored contiguously.

At first glance this may all seem to be of little importance, since the norm of the residual is only being used as a stopping criterion. However, after looking at the experimental and testing data, we found many cases in which the convergence criterion was just barely missed and an additional multigrid iteration was performed. After a more extensive look at the norms, it was determined that several of the cases did not actually need to perform the additional iteration. To make matters worse, a couple of cases were found that had stopped prematurely because the norm of the residual was incorrectly computed. It should be noted that these cases showed up quite often on the workstations and only rarely on the Cray computers.

The trouble has to do with floating point arithmetic and the loss of accuracy. The Cray computers, which have a longer word length, perform arithmetic at a much higher precision than the workstations and hence very rarely encounter such trouble.

One approach to fixing this problem is to use higher precision for the summation, which is usually accomplished by doubling the current precision. The doubling algorithm uses the naive summation algorithm, but doubles the precision of all arithmetic operations, which gives the sum equal to $\sum_{i=1}^{N} r_i \left(1 + \sum_{j=i}^{N} \delta_j\right)$, where $|\delta_j| \le \epsilon$. This would appear to be the answer that we are looking for except that it can execute
very slowly. Doubling the precision on 32 bit RISC workstations and on the Cray Y-MP means that the higher precision arithmetic is handled in software and not in hardware. We will discuss this point in more detail later.

The loss of accuracy in the summation process can easily be fixed with very little extra cost on sequential machines by using the Kahan summation algorithm [40], [56]. The Kahan summation algorithm can be described as follows.

Algorithm 6.13.2 Let r be a floating point array of N values, $r_i$, i = 1, ..., N. The Kahan summation algorithm for $\sum_{i=1}^{N} r_i$ is given by the following:

   sum = r(1)
   correction = 0.
   Do i = 2, N
      next_correction = r(i) - correction
      new_sum = sum + next_correction
      correction = (new_sum - sum) - next_correction
      sum = new_sum
   EndDo

The computed sum is equal to $\sum_{i=1}^{N} r_i (1 + \delta_i) + O(N\epsilon^2) \sum_{i=1}^{N} |r_i|$, where $|\delta_i| \le 2\epsilon$ and $\epsilon$ is the machine epsilon. The difference between the two algorithms is now much clearer, since each summand in Kahan summation is perturbed by only $2\epsilon$ instead of by perturbations as large as $n\epsilon$ in the naive algorithm.

The Kahan summation algorithm is not vectorizable, and even though the loop can be unrolled by the compiler, it is still miserably slow on the Cray Y-MP. However, on the sequential workstations it is only about twice as slow as the naive
summation algorithm.

We can now compare the three summation algorithms on both the vector and RISC (Sparc5) computers. The summation timings on the Cray Y-MP are given in table 6.1.

Table 6.1. Cray Y-MP timings for the naive, Kahan, and doubling summation algorithms in seconds. The numbers in parentheses are the timing ratios relative to the naive algorithm's times.

   N            Naive       Kahan                 Double
   (elements)   time t_n    time t_k (t_k/t_n)    time t_d (t_d/t_n)
   10^2         6.702E-6    2.161E-5 (3.22)       6.185E-5 (9.23)
   10^3         1.208E-5    1.788E-4 (14.8)       3.026E-4 (25.1)
   10^4         6.964E-5    1.751E-3 (25.1)       2.708E-3 (38.9)
   10^5         6.433E-4    1.748E-2 (27.2)       2.677E-2 (41.6)
   10^6         6.382E-3    1.746E-1 (27.4)       2.675E-1 (41.9)

On the Cray Y-MP double precision arithmetic is performed in software, and hence it is quite slow. Recall that the Cray Y-MP does not need to use double precision because it already uses 64 bits for single precision, which was found to be very adequate for the summation process. As we have already said, the Kahan summation algorithm does not vectorize and is quite slow. It is obvious that the only practical implementation on the Cray Y-MP is the naive summation algorithm.

The Sparc5 timings, in table 6.2, show that the Kahan algorithm is about twice as slow as the naive algorithm, and that the double precision algorithm is only about 20% slower than the naive algorithm (timing ratios of 1.19 to 1.24). Computing in double precision is very adequate for our needs.

We are now faced with three summation algorithms. Our choices appear to be either to settle for three versions or just to use the naive summation algorithm, as before, and not worry about missing or adding an additional multigrid iteration. For the timing data presented in the later sections of this thesis we have chosen just to use the naive summation algorithm, but we believe that it is better to make two
or even three versions of the code. For vector computers either the naive or the double summation algorithm should be used, depending on whether double precision arithmetic is implemented in hardware at either 64 or 32 bits respectively. On sequential machines either the Kahan or the double summation algorithm should be used, depending on the implementation of double precision arithmetic.

Table 6.2. Sparc5 timings for the naive (t_n), Kahan (t_k), and doubling (t_d) summation algorithms in seconds. The numbers in parentheses are the timing ratios relative to the naive algorithm's times.

   N            Naive      Kahan               Double
   (elements)   t_n        t_k (t_k/t_n)       t_d (t_d/t_n)
   10^2         2.70E-5    6.00E-5 (2.22)      3.20E-5 (1.19)
   10^3         2.67E-4    6.04E-4 (2.27)      3.27E-4 (1.23)
   10^4         2.75E-3    6.39E-3 (2.32)      3.31E-3 (1.21)
   10^5         2.86E-2    6.59E-2 (2.30)      3.44E-2 (1.20)
   10^6         2.88E-1    6.61E-1 (2.29)      3.56E-1 (1.24)

6.14 2D Standard Coarsening Vector Algorithm

To this point we have discussed several issues concerning the black box multigrid components, vectorization, and programming on the Cray Y-MP, but we have not explicitly mentioned what our choices were for the code. We will do so now. We have implemented the code being aware of all the vectorization issues and using the most efficient choices that have been discussed above.

6.14.1 Coarsening

We used standard coarsening, taking every other fine grid point in both coordinate directions to form the coarse grid.

6.14.2 Data Structures

The data structures for the grid equations are grid point stencil oriented. The mesh of unknowns has been augmented with a border of fictitious zero equations. The border is used to avoid having to write special code
to handle the boundary of the grid. This arrangement makes the code easier to write and more efficient for vector operations.

There are several arrays to hold the grid equations: the discrete coefficient array, the array of unknowns, and the right hand side array. There are also a few extra auxiliary arrays to hold the grid transfer operator coefficients, the residual, and the LU decompositions of the line solves and of the coarsest grid problem.

Each grid level has its own data structure of the appropriate size that has been allocated, via pointers, as part of a larger linear array for each data type structure; see figure 6.3. This arrangement makes memory management for the number of grid levels easier.

[Figure 6.3. Data structure layout for m grid levels; grid equations, work space, and grid transfer operators. Per-level array shapes: L = (Nx, Ny, 9), u = (Nx, Ny), f = (Nx, Ny), Work_Space = (Nx, Ny, 3), Prolongation = (Nx, Ny, 8), Restriction = (Nx, Ny, 8). The levels are stored one after another, from grid m down to grid 1 for the grid equations and work space, and from grid m-1 down to grid 1 for the grid transfer operators.]

6.14.3 Smoothers

We have implemented the multicolor ordering point, line, and alternating line Gauss-Seidel methods and the ILLU x-line method. The cache based Gauss-Seidel algorithm is not used.
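The level-by-level storage described in section 6.14.2 can be illustrated by computing where each grid level would start inside one linear array. This is a minimal sketch under stated assumptions, not the thesis code: the function name level_layout and its parameters are hypothetical, the grid sizes are assumed odd, the finest grid is listed first, the fictitious border is ignored, and coarsening simply stops when either dimension reaches 3.

```python
def level_layout(nx, ny, words_per_point=1, coarsest=3):
    """Return (levels, total) for standard coarsening.

    Each entry of levels is (nx, ny, offset), finest grid first, where
    offset is the 0-based start of that level in one linear array and
    total is the length of the linear array.
    """
    levels = []
    offset = 0
    while True:
        levels.append((nx, ny, offset))
        offset += nx * ny * words_per_point
        if nx <= coarsest or ny <= coarsest:
            break
        # Keep every other point in each direction (odd sizes assumed).
        nx = (nx + 1) // 2
        ny = (ny + 1) // 2
    return levels, offset


if __name__ == "__main__":
    levels, total = level_layout(9, 9)
    for level in levels:
        print(level)
    print("total words:", total)
```

With words_per_point set to 9 the same routine gives the offsets for a 9-point coefficient array, so one small table of offsets per data type is enough to address every level.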
6.14.4 Coarsest Grid Solver

The coarsest grid solver is a direct solver using LU factorization.

6.14.5 Grid Transfer Operators

There are three choices for the grid transfer operators, discussed in chapter 3, that were implemented. They are the ones discussed in sections 3.5.1, 3.5.3, and 3.6.1.

6.14.6 Coarse Grid Operators

The coarse grid operators are formed using the Galerkin coarse grid approximation, which uses the grid transfer operators and the fine grid operator.

6.15 2D Semi-Coarsening Vector Algorithm

The semicoarsening code was originally implemented by Joel E. Dendy, Jr. We have reimplemented it in a more efficient form to gain a speedup factor of about 5 over the previous vectorized version while maintaining and improving the portability of the code. The new implementation has kept all the functionality of the previous version.

6.15.1 Data Structures

The data structures for the grid equations are the same as those for the standard coarsening code, including the fictitious border equations. However, now the work space array is reduced to (Nx, Ny, 2), and the prolongation and restriction coefficient arrays are of length 2 instead of 8.

6.15.2 Coarsening

Semicoarsening in the y-direction was used, taking every other fine grid point in the y-direction to form the coarse grid.

6.15.3 Smoothers

Red/black x-line Gauss-Seidel relaxation is used for the smoother. As an experiment the x-line ILLU method was also implemented.
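The idea of section 6.11.2, vectorizing a line smoother across all lines of one color rather than along a single line, can be sketched as follows. This is an illustration only and makes several assumptions: the function name solve_lines and the [i][k] data layout are hypothetical, the systems are factored and solved in one pass (the first of the two approaches in section 6.11.2, so no LU decomposition is saved), and no pivoting is done. The inner loop over k is the one a vectorizing compiler could turn into vector operations.

```python
def solve_lines(lower, diag, upper, rhs):
    """Solve independent tridiagonal systems, one per line.

    Each argument is a list indexed [i][k]: position i along a line,
    line number k. lower[0][k] and upper[n-1][k] are ignored.
    No pivoting is done, so diagonally dominant systems are assumed.
    """
    n = len(diag)        # points per line (sequential direction)
    m = len(diag[0])     # number of lines (vectorizable direction)
    piv = [row[:] for row in diag]
    x = [row[:] for row in rhs]

    # Forward elimination: sequential in i, independent across lines k.
    for i in range(1, n):
        for k in range(m):
            factor = lower[i][k] / piv[i - 1][k]
            piv[i][k] -= factor * upper[i - 1][k]
            x[i][k] -= factor * x[i - 1][k]

    # Back substitution: again sequential in i, independent across k.
    for k in range(m):
        x[n - 1][k] /= piv[n - 1][k]
    for i in range(n - 2, -1, -1):
        for k in range(m):
            x[i][k] = (x[i][k] - upper[i][k] * x[i + 1][k]) / piv[i][k]
    return x
```

Every statement in an inner loop touches only line k, so there is no dependence between lines; the recurrence is confined to the outer index i.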
6.15.4 Coarsest Grid Solver

The coarsest grid solver is either the direct LU factorization solver or, when coarsening is continued until only one x-line remains, a tridiagonal solver.

6.15.5 Grid Transfer Operators

The grid transfer operator is the one from section 3.6.1, applied in the y-direction only.

6.15.6 Coarse Grid Operators

The coarse grid operators are formed using the Galerkin coarse grid approximation, using the grid transfer and fine grid operators.
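The Galerkin coarse grid approximation forms the coarse operator as the product restriction x fine operator x prolongation. The sketch below shows this in the simplest setting we could pick for illustration: a 1D Poisson operator with linear interpolation and full weighting. These transfer operators are an assumption of the example; the thesis codes use operator-induced transfer operators on 2D grids, and the helper names here are hypothetical.

```python
def matmul(a, b):
    """Dense product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def poisson_1d(n, h):
    """(1 / h^2) * tridiag(-1, 2, -1) on n interior points."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 2.0 / h**2
        if i > 0:
            a[i][i - 1] = -1.0 / h**2
        if i < n - 1:
            a[i][i + 1] = -1.0 / h**2
    return a


def linear_prolongation(n_fine):
    """Linear interpolation from (n_fine - 1) // 2 coarse points to n_fine fine points."""
    n_coarse = (n_fine - 1) // 2
    p = [[0.0] * n_coarse for _ in range(n_fine)]
    for j in range(n_coarse):
        p[2 * j][j] = 0.5          # fine point to the left of coarse point j
        p[2 * j + 1][j] = 1.0      # fine point that coincides with coarse point j
        p[2 * j + 2][j] = 0.5      # fine point to the right of coarse point j
    return p


n_fine, h = 7, 1.0 / 8.0
A = poisson_1d(n_fine, h)
P = linear_prolongation(n_fine)
R = [[0.5 * v for v in row] for row in transpose(P)]   # full weighting
A_coarse = matmul(R, matmul(A, P))                      # Galerkin product R A P
for row in A_coarse:
    print(row)
```

In this model case the Galerkin product reproduces the Poisson operator on the grid with spacing 2h, so nothing is gained over rediscretization; the construction matters when the coefficients are discontinuous and no coarse-grid discretization is available.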
CHAPTER 7

2D NUMERICAL RESULTS

The numerical results in this chapter are for the two-dimensional domain versions of the black box multigrid solvers.

7.1 Storage Requirements

The black box multigrid solvers present a tradeoff between speed and storage. The algorithms require a number of tasks involving grid operators and grid transfer operators. We can choose to save storage and sacrifice speed by computing these when we need them. However, since we are using a geometric multigrid method, these computations are not cheap. The most expensive is the formation of the coarse grid operators.

For the 2D grid levels we need storage for the grid equations (unknowns, coefficients, and right-hand side), the grid transfer operators, and temporary work space. Let Nx and Ny be the number of grid points in the x- and y-directions respectively. We can compute how much storage will be needed by adding up the amount per grid point. We need 9 locations for the coefficient matrix and 1 each for the unknowns and right-hand side. For the standard coarsening method we need 16 locations for the two grid transfer operators (8 coefficients each) and another 3 for temporary work. For the semicoarsening method we need 4 locations for the two grid transfer operators (2 coefficients each) and another 2 for temporary work. We can ignore the amount of storage for the coarsest grid direct solver for now because it will remain constant and small when compared
to the rest. This means that we need 30 locations per grid point for the standard coarsening and 17 for the semicoarsening. However, we do not have grid transfer coefficients stored on the finest grid, so we can subtract 16 and 4 locations per finest-grid point from the total for the standard and semicoarsening methods respectively. Before that subtraction, the amount of storage required for the 2D data structures, excluding storage for the line solves, is approximately

\[ 30 \left( 1 + \frac{1}{4} + \frac{1}{16} + \cdots \right) N_x N_y \tag{7.1} \]

\[ 17 \left( 1 + \frac{1}{2} + \frac{1}{4} + \cdots \right) N_x N_y \tag{7.2} \]

for the standard and semicoarsening methods respectively. The series sum to 4/3 and 2, giving about 40NxNy and 34NxNy; subtracting the finest-grid transfer storage leaves about 24NxNy and 30NxNy. If we only have a 5-point operator on the finest grid, we do not need to store the other 4 coefficients, and the storage requirements become 20NxNy and 26NxNy for the standard and semicoarsening methods respectively.

Table 7.1. Storage requirements for various grid sizes for the grid unknowns and grid operators on the Cray Y-MP.

                        Unknowns                    Coefficients
    N x N           N^2      all levels         9 N^2      all levels
    9 x 9            81           115             729           1035
   17 x 17          289           404            2601           3636
   33 x 33         1089          1493            9801          13437
   65 x 65         4225          5718           38025          51462
  129 x 129       16641         22359          149769         201231
  257 x 257       66049         88408          594441         795672
  513 x 513      263169        351577         2368521        3164193
 1025 x 1025    1050625       1402202         9455625       12619818

The growth of the storage requirements versus the grid size can be seen in table 7.1. The first column is the number of grid points and is also the amount of storage needed for the grid unknowns or right-hand sides for that grid level. The second column is the number of grid points for the indicated fine grid down to the coarsest grid level (3 x 3). The last two columns are the amount of storage needed for
the grid operator (9-point) on that grid level and all the grid levels respectively.

The price of memory has been falling steadily for years, and as a consequence computers are being built with more and more memory. For this reason, and in the interest of speed, we have chosen to compute these operators once and store them. We could have chosen to store only the grid operators and compute the grid transfer operators when they are needed, and this procedure might even be practical for the three-dimensional problem codes, where memory is used up at an alarming rate.

The actual storage requirements for the various codes versus the grid size are given in table 7.2. The BMGNS code, which uses zebra alternating line Gauss-Seidel, requires the least amount of storage for grids of size 65 x 65 and larger. The MGD9V code uses less storage for small grids because its grid transfer operators require only a fourth of the storage of those in BMGNS, due to symmetries. However, MGD9V must store the ILLU smoother decomposition for each grid level, and this additional storage becomes more significant for larger grids. The SCBMG code always requires more storage because it uses semicoarsening.

Table 7.2. The actual storage requirements for various grid sizes for the codes BMGNS, SCBMG, and MGD9V, given in terms of number of real variables on the Cray Y-MP.

    N x N            SCBMG         BMGNS         MGD9V
    9 x 9             4721          4359          2063
   17 x 17           12938         10309          7072
   33 x 33           39635         29235         25777
   65 x 65          132908         96001         97986
  129 x 129         476165        345999        381651
  257 x 257        1783454       1312669       1506020
  513 x 513        6868535       5113515       5982965
 1025 x 1025      26895056      20860411      23849734
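The per-point bookkeeping of section 7.1 can be checked with a few lines of arithmetic. The sketch below sums 30 (standard coarsening) or 17 (semicoarsening) locations per grid point over all levels, removes the grid transfer storage on the finest grid, and compares the result with the estimates 24NxNy and 30NxNy. The coarsening rule N -> (N + 1) / 2, the choice to stop at 3 points per coarsened direction, and the function names are assumptions of the example; line-solve storage is excluded, as in the text.

```python
def standard_levels(nx, ny, coarsest=3):
    """Grid sizes under standard coarsening, finest first."""
    levels = [(nx, ny)]
    while min(nx, ny) > coarsest:
        nx, ny = (nx + 1) // 2, (ny + 1) // 2
        levels.append((nx, ny))
    return levels


def semi_levels(nx, ny, coarsest=3):
    """Grid sizes under semicoarsening in y only, finest first."""
    levels = [(nx, ny)]
    while ny > coarsest:
        ny = (ny + 1) // 2
        levels.append((nx, ny))
    return levels


def storage(levels, per_point, transfer_per_point):
    """Total locations over all levels, with no transfer operators on the finest grid."""
    points = sum(nx * ny for nx, ny in levels)
    fine_nx, fine_ny = levels[0]
    return per_point * points - transfer_per_point * fine_nx * fine_ny


for n in (9, 33, 257):
    std = storage(standard_levels(n, n), 30, 16)
    semi = storage(semi_levels(n, n), 17, 4)
    print(n, std, 24 * n * n, semi, 30 * n * n)
```

For small grids the exact sums sit noticeably above the estimates because the finite number of levels and the odd grid sizes matter; for 257 x 257 the two agree to within about one percent.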
7.2 Vectorization Speedup

The original functionality of the Black Box Multigrid codes [29] was kept and used as a baseline. We imposed several restrictions on what could be done with the codes. The goal in developing and optimizing the black box multigrid codes was to maintain as much of the original functionality of the black box multigrid method as possible. The original method was implemented as a research prototype. Even though care was taken to keep it somewhat portable and to adhere to most of the SLATEC guidelines [38], at least for the documentation, it still had a long way to go before being ready for release to the scientific and engineering community. We wanted to maintain all of the cycling strategies, except for those controlled by the MCYCL parameter, which is only valid when performing the initial stages of an F-cycle. We also decided to remove the truncation error estimates, ITAU, because they were no longer meaningful.

Table 7.3. Speedup of the new vectorized version, vbmg, and Dendy's vectorized version, bmg, over the scalar version of bmg.

  grid size N                               Speedup
   (N x N)        BMGNS      VBMGNS     (bmgns/vbmgns)
       9            2.3        13.8          5.8
      17            3.2        21.5          6.7
      33            5.5        33.6          6.1
      65            7.2        43.3          6.0
     129            9.5        54.2          5.7
     257           11.4        68.4          6.0
     513           14.1        81.8          5.8
    1025           16.2        92.4          5.7

The speedup of the new vectorized version of the standard coarsening code is given in table 7.3. The table shows that while Dendy's original code had some vectorization in it, the new version has much more. The speedup is due to a variety of
factors, such as better organization of the computations and better use of the compiler to achieve the vectorization that is present. The last column of the table shows that the new version of the code runs about six times faster than Dendy's original code. The speedup is not consistent over the range of grid sizes because of the effect of vector length and memory cache misses. There may also be some effects due to the fact that the timings were done in a time sharing environment, but they should be quite small when compared to the other issues.

While table 7.3 is only for the standard coarsening code, it also reflects the speedup seen in the semicoarsening code. Table 7.4 compares the execution time in seconds of Dendy's semicoarsening version on the Cray Y-MP and the CM-2 with that of the new vectorized semicoarsening version on the Cray Y-MP. The table shows that the new semicoarsening version is roughly six times faster than Dendy's on the Cray Y-MP. The poor performance of the CM-2 version of scbmg is due to the much slower computing units and the communications penalty on the CM-2. The CM-2 version was run on one quarter of 1024 nodes (256 processors) under the slicewise model.

Table 7.4. Comparison and speedup of the semicoarsening versions. scbmg is Dendy's version, and vscbmg is the new vectorized version. Entries marked "--" mean that there is no data.

  grid size N      CM-2          Cray Y-MP            Speedup
   (N x N)        SCBMG       SCBMG     VSCBMG     (scbmg/vscbmg)
      32            --         .011      .0019          5.8
      64           .65         .04       .0059          6.8
     128           .99         .09       .015           6.0
     256          1.84         .27       .045           6.0
     512          4.55         .95       .16            5.9
    1024            --        3.69       .64            5.8

The timing comparisons are perhaps clearer when presented in graphical form. We compare the three main types of algorithms represented by BMGNS, SCBMG, and
MGD9V on the Cray Y-MP using a single processor. While the setup is done only once, the solution may require many V-cycle iterations. Hence, the time for the setup and the time for one V-cycle are presented separately. The BMGNS code uses zebra alternating line Gauss-Seidel relaxation for the smoother (which is the most expensive option), and the nonsymmetric grid operator collapsing method involving σL for the grid transfer operators (which is the most expensive collapsing method). The setup time includes the computation of the coarse grid operators, the grid transfer operators (prolongation and restriction), the factorization for the coarsest grid solver, and any setup required by the smoother.

Figure 7.1 shows the setup time in seconds for the three codes. It is not surprising that SCBMG is the fastest, since it only needs to compute grid transfer coefficients in the y-direction and factorizations for lines in the x-direction, while BMGNS has to compute the six additional grid transfer coefficients for each prolongation and restriction operator and an additional line factorization for the smoother. The factorization for the coarsest grid direct solver must also be computed for BMGNS. The MGD9V algorithm saves time in computing the grid transfer coefficients over BMGNS, and like SCBMG it does not need to compute a factorization for the direct solver on the coarsest grid. However, the factorizations and setup for the ILLU smoother are very expensive and are not really vectorizable.

The times for one complete V-cycle, excluding the setup time, are given in figure 7.2. The V-cycle is a V(1, 1)-cycle for BMGNS and SCBMG, and a V(0, 1)-cycle for MGD9V. The V-cycle timing graph shows the same relationship as the setup timing graph, except that the time for one V-cycle is much less than the setup time.
[Plot "Setup Time Comparison": time (seconds) versus grid size for SCBMG, BMGNS, and MGD9V.]

Figure 7.1: Comparison of setup time for BMGNS, SCBMG, and MGD9V
[Plot "V-Cycle Time Comparison": time (seconds) versus grid size for SCBMG, BMGNS, and MGD9V.]

Figure 7.2: Comparison of one V-cycle time for BMGNS, SCBMG, and MGD9V
7.3 2D Computational Work

The amount of work performed by the various multigrid components as implemented on the Cray Y-MP is given below. The operation counts lump multiplication, division, addition, and subtraction together. The setup phase is broken down into three parts: grid transfer operators, coarse grid operator, and smoother decomposition. The grid parameters, Nx and Ny, are the number of coarse grid points in the respective coordinate directions. The operation counts are given in table 7.5. The first line in each block with multiple lines refers to a grid operator with a 9-point stencil, and the second line to one with a 5-point stencil.

Table 7.5. Operation counts (actual) for the standard coarsening black box multigrid setup phase.

  L^H               369 NxNy
                    256 NxNy
  σL/σL             174 NxNy - 135 Nx - 135 Ny + 96
                    122 NxNy -  91 Nx -  91 Ny + 60
  σL/L hybrid       166 NxNy - 131 Nx - 131 Ny + 96
                    114 NxNy -  87 Nx -  87 Ny + 60
  Schaffer          168 NxNy - 112 Nx - 112 Ny + 56
                    120 NxNy -  88 Nx -  88 Ny + 56
  ZxLGS decomp        4 NxNy - 3 Ny
  ZyLGS decomp        4 NxNy - 3 Nx
  ZALGS decomp        8 NxNy - 3 Nx - 3 Ny
  x-ILLU decomp      58 NxNy - 52 Nx - 106 Ny + 101

The operation counts say that the σL/L hybrid method is the fastest for computing the grid transfer operator coefficients. We can also see that the zebra alternating line Gauss-Seidel smoother is very cheap to decompose when compared to the ILLU smoother.

The number of operations performed to compute the residual and perform the
grid transfers are given in table 7.6. The amount of work needed to perform the various smoothers is given in table 7.7. Again, the first line in each block refers to a grid operator with a 9-point stencil, and the second line to one with a 5-point stencil.

Table 7.6. Operation counts (actual) for the standard coarsening black box multigrid residual and grid transfer components.

  Residual         18 NxNy
                   10 NxNy
  Prolongation     20 NxNy - 14 Nx - 14 Ny + 9
  Restriction      16 NxNy

It is important to remember that the operation counts give only a rough idea of how the actual components will perform, for several reasons. The first is that we have lumped all of the different arithmetic operations together into one count. The time to perform the various arithmetic operations can vary by up to several clock cycles, with division usually taking the longest. In addition, the operation counts say nothing about how the code will perform on the hardware. Some of the issues that can affect the performance are whether or not the operations vectorize, fill the pipelines, and require cache paging. These issues can drastically change the execution time needed to perform the operations. Even though the multigrid components listed here have all been optimized in FORTRAN, they still vary with regard to these hardware issues, and hence it is not easy to predict, without running the codes, which will actually execute the fastest.

7.4 Timing Results for Test Problems

In this section we present timing results for the various codes in order to compare the performance of the codes and their components.
Table 7.7. Operation counts (actual) for the standard coarsening black box multigrid smoother component.

  4CPGS      17 NxNy
  2CPGS       9 NxNy
  ZxLGS      18 NxNy
             10 NxNy
  ZyLGS      18 NxNy
             10 NxNy
  ZALGS      36 NxNy
             20 NxNy
  x-ILLU     40 NxNy - 17 Nx - 13 Ny + 9
             32 NxNy - 17 Nx - 13 Ny + 9

Table 7.8. Problem 8: Helmholtz Equation, Standard coarsening with grid transfer operators based on extension of Schaffer's idea. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother.

  Grid Size             Total        Total         Direct       Average
      n        it.      Setup      Smoothing       Solver      per Cycle
      9         2     9.001E-4     6.492E-4      2.022E-3      5.609E-4
     17         4     1.856E-3     2.541E-3      5.969E-3      1.028E-3
     33         4     3.811E-3     5.219E-3      1.189E-2      2.019E-3
     65         4     8.778E-3     1.261E-2      2.783E-2      4.762E-3
    129         4     2.630E-2     3.899E-2      8.392E-2      1.441E-2
    257         4     8.707E-2     1.290E-1      2.807E-1      4.841E-2
To illustrate how fast these codes solve a problem, we examine the timing results for the discontinuous Helmholtz problem from section 7.5; see table 7.8. Table 7.8 gives the timing results, in seconds, for various stages of the program execution for various grid sizes. The grid is square, so n = 9 in the first column means a grid of size 9 x 9, and so forth for the rest of the column entries. The second column gives the number of multigrid V(1, 1)-cycle iterations that were needed to reduce the l2 norm of the initial residual by six orders of magnitude. The third column gives the total setup time, which is the time it takes to form all of the grid transfer operators, generate all the coarse grid operators, and perform any decompositions needed for the smoother. The fourth column gives the total time for smoothing. The fifth column gives the total time for the direct solver. The last column contains the average time it took to complete one V(1, 1)-cycle.

We observe that the code runs fairly quickly and that it appears to scale very well with respect to the grid size. We also note that the total setup time is about 1.7 times the average cycle time and about 2.7 times the smoothing time for one iteration. A more detailed examination of these relationships between the various multigrid components is given below.

The rest of the tables in this section are the results for one multigrid V(1, 1)-cycle. The results are separated by multigrid component for easier comparison between the types of multigrid algorithms. All times are given in seconds of CPU time on the Cray Y-MP in single processor mode.

The time to perform the LU decomposition of the coarsest grid (3 x 3) problem for the direct solver is 7.176E-5 seconds. The direct solver on the coarsest grid level (3 x 3, standard coarsening) takes 2.609E-5 seconds. These times are constant for all of the standard coarsening algorithms that use the direct solver. It should be noted that these times are based on a coarsest grid size of 3 x 3 and that if another coarsest
grid size is chosen, then the times will also change.

The amount of work to perform the grid transfers depends on the grid size and on the type of coarsening used. A comparison between standard and semicoarsening is given in table 7.9. As one would expect, semicoarsening grid transfers are faster than standard coarsening grid transfers. The standard coarsening restriction requires 4 times the work that semicoarsening does, and prolongation requires 10 times more work. However, due to the number of grid points that they operate on and the way in which they are computed by the hardware, the measured difference is only a factor of about 3.2 to 4.0 for prolongation and 2.0 to 3.6 for restriction, depending on the grid size.

Table 7.9. Timings in seconds for multigrid grid transfer components for one V(1, 1)-cycle for various grid sizes; comparing standard and semicoarsening methods.

  Grid Size        Standard Coarsening             Semicoarsening
      n        Prolongation   Restriction    Prolongation   Restriction
      9          5.151E-5      4.525E-5        1.603E-5      2.261E-5
     17          1.029E-4      8.675E-5        2.898E-5      3.798E-5
     33          2.042E-4      1.693E-4        4.985E-5      6.383E-5
     65          4.430E-4      4.063E-4        1.114E-4      1.427E-4
    129          1.119E-3      1.264E-3        2.797E-4      3.611E-4
    257          3.675E-3      4.306E-3        8.869E-4      1.164E-3

Table 7.10 gives the timing results for four standard coarsening smoothers and the semicoarsening smoother. Note that for red/black point Gauss-Seidel relaxation, the ratios are not uniform or monotone. This seems to be due to several factors involving the vector length, memory stride, and cache issues. The point relaxation methods seem to be much more sensitive to memory access delays. The zebra line Gauss-Seidel relaxation also shows some of this variation, but after taking into account multiple runs of both x- and y-line relaxation, it can be averaged out. The time variations for the line relaxations again point to memory access delays as the main cause. The physical layout of the data in memory means that
x- and y-line relaxation require different access strides for the grid operator coefficients. A closer look shows that in almost all cases that were measured, y-line relaxation is slightly faster than x-line relaxation on the Cray Y-MP.

Table 7.10. Timings for the total smoothing time in seconds for one multigrid V(1, 1)-cycle for various grid sizes and smoothers.

  Grid Size                    Total Smoothing Time (seconds)
      n         R/BPGS        ZLGS         ZALGS         ILLU         SCBMG
      9        1.572E-4     1.673E-4     3.246E-4     7.316E-4     1.846E-4
     17        3.318E-4     3.354E-4     6.352E-4     1.670E-3     4.318E-4
     33        6.763E-4     6.690E-4     1.305E-3     4.207E-3     1.014E-3
     65        1.473E-3     1.555E-3     3.153E-3     1.174E-2     2.887E-3
    129        3.912E-3     4.802E-3     9.747E-3     3.732E-2     8.473E-3
    257        1.241E-2     1.584E-2     3.226E-2     1.293E-1     2.821E-2

While table 7.10 shows that point and line Gauss-Seidel relaxation are both quite fast, we have seen from local mode analysis that they are not robust. The alternating line, ILLU, and semicoarsening methods are robust. We observe that standard coarsening with zebra alternating line relaxation takes approximately the same time as the semicoarsening method. This is not surprising, since the semicoarsening method performs half the number of line relaxations (only x-lines), while its lines are at least twice as long as in the standard coarsening method. The differences between these two can then be attributed to their performance on the hardware. Both of these smoothers are much faster than the ILLU smoother, by a factor of about 2.3 to 4. However, recall that the local mode analysis showed the ILLU method to be a much better smoother.

The ratio of time spent smoothing versus the time spent doing grid transfers is given in table 7.11. See the comments above under smoothers concerning the behavior of the point relaxation. The ratio of smoothing to grid transfers shows that the smoother is the dominant computation in the multigrid cycling algorithm. It also
Table 7.11. Standard coarsening with grid transfer operators based on extension of Schaffer's idea. Timing ratios (smoothing/grid transfer) for one V(1, 1)-cycle for various grid sizes.

  Grid Size              (Smoothing) / (Grid Transfers)
      n        R/BPGS     ZxLGS     ZALGS      ILLU      SCBMG
      9         1.68      1.73      3.35       7.53       4.78
     17         1.75      1.77      3.49       8.91       6.45
     33         1.81      1.79      3.50      11.32       8.92
     65         1.73      1.83      3.71      13.89      11.36
    129         1.66      2.01      4.01      15.66      13.22
    257         1.56      1.98      4.04      16.20      13.75
shows that for ILLU or semicoarsening the smoothing is even more dominant for large grids.

The times to perform the setup for various algorithms are given in table 7.12. The column headings are σL/σL for the grid transfer operator from section 3.5.1; σL/L hybrid for the grid transfer operator from section 3.5.3; and Schaffer's idea for the grid transfer operator from section 3.6. We present the setup timing results for the codes using the zebra alternating line Gauss-Seidel relaxation, since it requires the decompositions of both the x- and y-line solves.

Table 7.12. Timings for the multigrid setup (generate all grid transfer and grid operators and perform decompositions for smoother) for one V(1, 1)-cycle for various grid sizes.

  Grid Size             ZALGS                 Schaffer's Idea
      n         σL/σL     σL/L hybrid       ZALGS        ILLU         SCBMG
      9        8.174E-4     7.611E-4       9.001E-4     1.369E-3     5.176E-4
     17        1.780E-3     1.623E-3       1.856E-3     2.968E-3     1.238E-3
     33        3.914E-3     3.583E-3       3.811E-3     6.657E-3     3.393E-3
     65        9.209E-3     8.518E-3       8.778E-3     1.679E-2     1.349E-2
    129        2.746E-2     2.560E-2       2.630E-2     5.165E-2     6.459E-2
    257        9.277E-2     8.673E-2       8.707E-2     1.761E-1     3.761E-1

The number of operations that it takes to form the coarse grid operators, 369 nx ny per grid level, dominates the number of operations that it takes to form the grid transfer operators. The number of operations it takes for the decompositions for the ILLU method is even greater, as seen in the fifth column of table 7.12. The collapsing methods for the grid transfer operators (σL/σL and σL/L hybrid) are similar, but the hybrid version requires fewer computations. It is also a little surprising that the extension of Schaffer's ideas is about as fast as the collapsing methods, since it has to perform line solves to get the grid transfer coefficients. The actual number of operations, for a 9-point fine grid stencil, for σL/σL
is 174 nx ny - 135 nx - 135 ny + 96, for hybrid σL/L is 166 nx ny - 131 nx - 131 ny + 96, and for the extension of Schaffer's ideas is 168 nx ny - 112 nx - 112 ny + 56 per grid level of size nx x ny. This comparison shows that the collapsing methods require more computations (35%) per grid level but are not significantly slower (7%) because they vectorize, while the tridiagonal solves for Schaffer's ideas do not. This example shows again that operation counts do not tell the complete story of how well a method will perform.

The times for one complete V(1, 1)-cycle, excluding the setup and overhead time, are given for various smoothers in table 7.13. As expected, the point and line Gauss-Seidel methods are the fastest, even though we again see the strange behavior of point relaxation. It is interesting to note that the cycle time for the standard coarsening code using alternating line relaxation is virtually identical to that of the semicoarsening code. This is primarily because the smoothers, which are essentially equivalent in computation time, dominate the cycle time. The ILLU version is again the slowest method.

Table 7.13. Timings for one multigrid V(1, 1)-cycle for various grid sizes, excluding setup time.

  Grid Size                        Cycle Time (seconds)
      n         R/BPGS        ZLGS         ZALGS         ILLU         SCBMG
      9        3.856E-4     3.959E-4     5.609E-4     1.014E-3     5.108E-4
     17        7.236E-4     7.189E-4     1.028E-3     2.120E-3     9.456E-4
     33        1.395E-3     1.372E-3     2.019E-3     5.023E-3     1.906E-3
     65        3.115E-3     3.173E-3     4.762E-3     1.356E-2     4.828E-3
    129        8.548E-3     9.329E-3     1.441E-2     4.247E-2     1.419E-2
    257        2.881E-2     3.181E-2     4.841E-2     1.465E-1     4.627E-2
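The statement that forming the coarse grid operator dominates the setup operation count can be illustrated by evaluating the 9-point counts of table 7.5 for one grid level. The sketch below is only a worked piece of arithmetic: the minus signs on the lower-order terms are our reading of the table, the combination chosen (σL/σL transfers with the zebra alternating line decomposition) is one example, and the function names are hypothetical.

```python
def transfer_ops(nx, ny):
    """Grid transfer operator setup, sigma L / sigma L collapsing, 9-point stencil."""
    return 174 * nx * ny - 135 * nx - 135 * ny + 96


def coarse_operator_ops(nx, ny):
    """Galerkin coarse grid operator, 9-point stencil."""
    return 369 * nx * ny


def zalgs_decomp_ops(nx, ny):
    """Factorizations for the x- and y-line solves of the zebra alternating smoother."""
    return 8 * nx * ny - 3 * nx - 3 * ny


def setup_shares(nx, ny):
    """Fraction of the setup operation count spent in each part, for one level."""
    parts = {
        "grid_transfer": transfer_ops(nx, ny),
        "coarse_operator": coarse_operator_ops(nx, ny),
        "smoother_decomp": zalgs_decomp_ops(nx, ny),
    }
    total = sum(parts.values())
    return {name: count / total for name, count in parts.items()}


shares = setup_shares(129, 129)
for name, share in shares.items():
    print(f"{name:16s} {share:6.1%}")
```

By these counts roughly two thirds of the setup operations go into the coarse grid operator. As the text stresses, the measured times need not follow the same proportions, since the parts vectorize differently.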
7.5 Numerical Results for Test Problem 8

Problem 8 is a discontinuous diffusion four-corner junction problem. This problem has appeared many times in the literature; see [1], [26], [24]. It is defined by

\[ -\nabla \cdot (D \nabla u) = f \quad \text{on } \Omega = (0, N_x) \times (0, N_y) \tag{7.3} \]

\[ D \frac{\partial u}{\partial n} + \frac{1}{2} u = 0 \quad \text{on } \partial\Omega \tag{7.4} \]

with

\[ (D, f) = \begin{cases} (1, 1), & (x, y) \in [0, x^*] \times [0, y^*] \\ (1000, 0), & (x, y) \in [x^*, N_x] \times [0, y^*] \\ (1000, 0), & (x, y) \in [0, x^*] \times [y^*, N_y] \\ (1, 1), & (x, y) \in [x^*, N_x] \times [y^*, N_y] \end{cases} \]

where x = x* and y = y* are the interface lines for the discontinuities; see figure 7.3.

We compare the five different choices of prolongation operators in the standard coarsening black box multigrid method using zebra alternating line Gauss-Seidel relaxation or incomplete x-line LU iteration for the smoother. The comparison is done for a variety of grid sizes ranging from 9 x 9 to 257 x 257.

The tables list the number of V(1, 1) cycles needed for the l2 norm of the residual to be reduced by 6 orders of magnitude with an initial guess of zero. The next three entries are the first V-cycle, last V-cycle, and average convergence factors. If convergence was not reached in 50 V-cycles, then an asterisk (*) appears in place of the cycle count, and the convergence factor is based on only 50 V-cycles.

Results for the method using an extension of Schaffer's idea for the grid transfer coefficients, with alternating zebra Gauss-Seidel relaxation, are given in table 7.14. The method exhibits good convergence factors for all grid sizes, and in addition the convergence factors grow very slowly as the grid size increases.

Results for the method using the grid transfer operators based on the form σL/L, from section 3.5.2, with alternating zebra Gauss-Seidel relaxation are given in
[Figure: the square (0, 24) x (0, 24), divided at x = 12 and y = 12 into four quadrants. The lower-left and upper-right quadrants are labeled 1, the upper-left and lower-right quadrants are labeled 2. The left and bottom sides are labeled N, the top and right sides are labeled M.]

Figure 7.3. Domain Ω for problem 8; N and M stand for Neumann and Mixed boundary conditions respectively.

Table 7.14. Problem 8: Helmholtz Equation, Standard coarsening with grid transfer operators based on extension of Schaffer's idea. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel for the smoother, first, last and average convergence factor.

  Grid Size n    Iterations     First       Last       Average
    (n x n)       V(1, 1)        CF          CF          CF
       9             2         6.93E-5     5.82E-4     2.01E-4
      17             4         7.89E-3     1.42E-2     1.23E-2
      33             4         3.03E-2     2.65E-2     2.74E-2
      65             4         4.01E-2     2.43E-2     2.76E-2
     129             4         4.34E-2     2.12E-2     2.49E-2
     257             4         5.36E-2     2.94E-2     2.98E-2
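The three convergence factors reported in these tables can be computed from a history of residual norms. The sketch below assumes that the average factor is the geometric mean of the per-cycle factors, which is consistent with the 9 x 9 row of table 7.14 (two cycles with factors 6.93E-5 and 5.82E-4, average 2.01E-4); the function names are hypothetical. It also estimates how many cycles a given average factor needs for a reduction of six orders of magnitude.

```python
import math


def convergence_factors(residual_norms):
    """Per-cycle factors r_k / r_(k-1) and their geometric mean."""
    factors = [b / a for a, b in zip(residual_norms, residual_norms[1:])]
    average = (residual_norms[-1] / residual_norms[0]) ** (1.0 / len(factors))
    return factors, average


def cycles_needed(factor, reduction=1e-6):
    """Estimated number of cycles to reach the reduction, for 0 < factor < 1."""
    return math.ceil(math.log(reduction) / math.log(factor))


# Residual history rebuilt from the 9 x 9 row of table 7.14 (initial norm scaled to 1).
norms = [1.0, 6.93e-5, 6.93e-5 * 5.82e-4]
factors, average = convergence_factors(norms)
print(f"first {factors[0]:.2e}, last {factors[-1]:.2e}, average {average:.2e}")
print(cycles_needed(0.5))
```

Because the tabulated averages are rounded and the first cycle often behaves differently from later ones, the cycle estimate is only a rough guide, not a reproduction of the tabulated iteration counts.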
Table 7.15. Problem 8: Helmholtz Equation, Standard coarsening with grid transfer operators of the form σL/L. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel for the smoother, first, last and average convergence factor. An asterisk (*) means that convergence was not reached in 50 V-cycles.

  Grid Size n    Iterations     First       Last       Average
    (n x n)       V(1, 1)        CF          CF          CF
       9             2         2.50E-4     1.78E-3     6.67E-4
      17             4         1.17E-2     2.03E-2     1.77E-2
      33             7         1.03E-1     1.29E-1     1.21E-2
      65            16         3.85E-1     4.03E-1     3.99E-1
     129            33         7.54E-1     6.52E-1     6.53E-1
     257             *         8.22E-1     8.15E-1     8.15E-1
table 7.15. This method does not seem to be very attractive for this type of problem. The convergence factors grow quickly and approach one as the grid size increases.

Results for the method using the grid transfer operators based on the σL/L hybrid form, from section 3.5.3, with alternating zebra Gauss-Seidel relaxation are given in table 7.16. The method seems to be attractive for this type of problem. The convergence factor for the first V-cycle grows as a function of problem size, but the convergence factor for subsequent V-cycles settles down to about 0.07.

Table 7.16. Problem 8: Helmholtz Equation, Standard coarsening with grid transfer operators from section 3.5.3. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel for the smoother, first, last and average convergence factor.

  Grid Size n    Iterations     First       Last       Average
    (n x n)       V(1, 1)        CF          CF          CF
       9             2         2.56E-4     1.82E-3     6.84E-4
      17             4         1.43E-2     2.30E-2     2.04E-2
      33             5         4.96E-2     4.31E-2     4.43E-2
      65             5         7.85E-2     5.69E-2     6.07E-2
     129             6         9.45E-2     6.55E-2     6.96E-2
     257             6         1.01E-1     6.96E-2     7.41E-2

Results for the method using the grid transfer operators based on the σL/σL form, from section 3.5.1, with alternating zebra Gauss-Seidel relaxation are given in table 7.17. This method is almost identical to the previous method, σL/L hybrid, as it should be for diffusion problems. The two methods differ only slightly in the grid transfer operators when the switch in the denominator is used.

Results for the method using the grid transfer operator of the L/L form, from section 3.4, are given in table 7.18. This method does not perform well at all. It does not use the denominator switch in the grid transfer operators.

Tables 7.19 through 7.23 are the same as the previous tables except that now
Table 7.17. Problem 8: Helmholtz Equation, Standard coarsening with grid transfer operators from section 3.5.1. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel for the smoother, first, last and average convergence factor.

  Grid Size n    Iterations     First       Last       Average
    (n x n)       V(1, 1)        CF          CF          CF
       9             2         2.56E-4     1.83E-3     6.84E-4
      17             4         1.43E-2     2.30E-2     2.04E-2
      33             5         4.96E-2     4.31E-2     4.43E-2
      65             5         7.85E-2     5.69E-2     6.07E-2
     129             6         9.45E-2     6.55E-2     6.96E-2
     257             6         1.01E-1     6.96E-2     7.41E-2

Table 7.18. Problem 8: Helmholtz Equation, Standard coarsening with grid transfer operators from section 3.4. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel for the smoother, first, last and average convergence factor. An asterisk (*) means that convergence was not reached in 50 V-cycles.

  Grid Size n    Iterations     First       Last       Average
    (n x n)       V(1, 1)        CF          CF          CF
       9             2         2.57E-4     1.83E-3     6.85E-4
      17             4         1.43E-2     2.30E-2     2.04E-2
      33             4         1.46E-2     1.72E-2     1.44E-2
      65            14         4.90E-1     3.58E-1     3.66E-1
     129             *         1.11E+0     7.75E-1     7.81E-1
     257             *         1.36E+0     9.38E-1     9.45E-1
Table 7.19. Problem 8: Helmholtz Equation, Standard coarsening with grid transfer operators based on extension of Schaffer's idea. Various grid sizes versus the number of V(1, 1) cycles using incomplete x-line LU iteration for the smoother, first, last and average convergence factor.

  Grid Size n    Iterations     First       Last       Average
    (n x n)       V(1, 1)        CF          CF          CF
       9             1         3.40E-9     3.40E-9     3.40E-9
      17             2         2.04E-5     3.05E-4     7.90E-5
      33             3         2.12E-3     4.70E-3     3.57E-3
      65             4         1.30E-2     9.70E-3     1.04E-2
     129             4         2.36E-2     1.13E-2     1.34E-2
     257             4         3.11E-2     2.01E-2     1.68E-2

Table 7.20. Problem 8: Helmholtz Equation, Standard coarsening with grid transfer operators from section 3.5.2. Various grid sizes versus the number of V(1, 1) cycles using incomplete x-line LU iteration for the smoother, first, last and average convergence factor.

  Grid Size n    Iterations     First       Last       Average
    (n x n)       V(1, 1)        CF          CF          CF
       9             1         5.52E-9     5.52E-9     5.52E-9
      17             2         1.62E-5     2.07E-4     5.80E-5
      33             3         3.45E-3     1.34E-2     8.22E-3
      65             7         7.44E-2     1.54E-1     1.38E-1
     129            17         2.75E-1     4.42E-1     4.29E-1
     257            36         3.65E-1     6.91E-1     6.78E-1
Table 7.21. Problem 8: Helmholtz Equation, Standard coarsening with grid transfer operators from section 3.5.3. Various grid sizes versus the number of V(1, 1) cycles using incomplete x-line LU iteration for the smoother, first, last and average convergence factor.

  Grid Size n    Iterations     First       Last       Average
    (n x n)       V(1, 1)        CF          CF          CF
       9             1         6.35E-9     6.35E-9     6.35E-9
      17             2         1.90E-5     3.12E-4     7.70E-5
      33             3         1.35E-3     3.65E-3     2.60E-3
      65             3         7.34E-3     8.43E-3     8.00E-3
     129             4         1.36E-2     1.26E-2     1.28E-2
     257             4         1.69E-2     2.04E-2     1.69E-2

Table 7.22. Problem 8: Helmholtz Equation, Standard coarsening with grid transfer operators from section 3.5.1. Various grid sizes versus the number of V(1, 1) cycles using incomplete x-line LU iteration for the smoother, first, last and average convergence factor.

  Grid Size n    Iterations     First       Last       Average
    (n x n)       V(1, 1)        CF          CF          CF
       9             1         6.33E-9     6.33E-9     6.33E-9
      17             2         1.90E-5     3.12E-4     7.70E-5
      33             3         1.35E-3     3.65E-3     2.60E-3
      65             3         7.34E-3     8.43E-3     8.00E-3
     129             4         1.36E-2     1.26E-2     1.28E-2
     257             4         1.69E-2     2.04E-2     1.69E-2
Table 7.23. Problem 8: Helmholtz Equation, Standard coarsening with grid transfer operators from section 3.4. Various grid sizes versus the number of V(1, 1) cycles using incomplete x-line LU iteration by lines in x for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             1            6.33E-9     6.33E-9     6.33E-9
17            2            1.90E-5     3.13E-4     7.71E-5
33            3            1.21E-3     3.14E-3     2.25E-3
65            5            4.55E-2     6.36E-2     5.94E-2
129           18           4.24E-1     4.62E-1     4.59E-1
257                        8.02E-1     8.18E-1     8.18E-1
the smoother is an incomplete x-line LU iteration instead of the alternating zebra line Gauss-Seidel relaxation that was used before. The same observations hold as before, except that the convergence factors are now somewhat smaller because ILLU is a better smoother than alternating line Gauss-Seidel.
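The first, last, and average convergence factors reported in the tables can be related to the residual norms of successive V-cycles. The following Python sketch assumes that the average factor is the geometric mean of the per-cycle factors; this assumption is consistent with the tabulated values (for example the 17 x 17 row of table 7.19) but is not stated in this section. The function name `convergence_factors` and the sample residual norms are hypothetical.

```python
def convergence_factors(residual_norms):
    """Return (first, last, average) convergence factors.

    residual_norms[0] is the initial residual norm and residual_norms[k]
    is the norm after k cycles. The average is taken as the geometric
    mean of the per-cycle factors (an assumption, see the text above).
    """
    if len(residual_norms) < 2:
        raise ValueError("need the initial norm and at least one cycle")
    factors = [
        residual_norms[k + 1] / residual_norms[k]
        for k in range(len(residual_norms) - 1)
    ]
    first = factors[0]
    last = factors[-1]
    average = (residual_norms[-1] / residual_norms[0]) ** (1.0 / len(factors))
    return first, last, average


# Hypothetical residual history built from the per-cycle factors
# 2.04E-5 and 3.05E-4 listed for the 17 x 17 grid in table 7.19.
norms = [1.0, 2.04e-5, 2.04e-5 * 3.05e-4]
first, last, average = convergence_factors(norms)
print(f"first={first:.2E} last={last:.2E} average={average:.2E}")
```

With these two factors the geometric mean is about 7.9E-5, which is close to the tabulated average for that row.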
Figure 7.4: Domain \Omega for problem 9; D stands for Dirichlet boundary condition.

7.6 Numerical Results for Test Problem 9

Problem 9 is a convection-diffusion problem, which can be found in [24], [66], [77]. The problem is defined as

\[ -\varepsilon \Delta u + a(x, y) \frac{\partial u}{\partial x} + b(x, y) \frac{\partial u}{\partial y} = 0 \quad \text{on } \Omega = (0, 1) \times (0, 1), \tag{7.5} \]

\[ u(x, y) = \sin(\pi x) + \sin(\pi y) + \sin(13 \pi y) + \sin(13 \pi y) \quad \text{on } \partial\Omega, \tag{7.6} \]

where

\[ a(x, y) = (2x - 1)(1 - x^2), \qquad b(x, y) = 2xy(y - 1), \]

and \varepsilon = 10^{-5}; see figure 7.4.

Five choices of prolongation operators for the standard coarsening black box multigrid method using zebra alternating line Gauss-Seidel relaxation or incomplete x-line LU iteration for the smoother are presented. The comparison is done for a variety of grid sizes ranging from 9 x 9 to 257 x 257.

The results for the convection-diffusion equation using zebra alternating line Gauss-Seidel relaxation for the smoother are given in tables 7.24, 7.25, 7.26, 7.27, and 7.28.
Table 7.24. Problem 9: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on extension of Schaffer's idea. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             2            8.78E-3     8.58E-6     2.75E-4
17            4            1.61E-2     1.57E-5     3.05E-3
33            5            2.78E-2     1.57E-2     3.04E-2
65            5            3.98E-2     6.03E-2     6.06E-2
129           6            5.67E-2     1.09E-1     9.08E-2
257           7            6.88E-2     1.19E-1     1.13E-1

Table 7.25. Problem 9: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.5.2. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             3            2.11E-2     5.63E-3     7.89E-3
17            4            3.18E-2     4.34E-3     2.48E-3
33            6            6.51E-2     2.38E-2     5.38E-2
65            9            1.49E+0     1.28E-1     1.88E-1
129           div          4.15E+1     1.13E+1     7.01E+0
257           div          2.91E+3     1.24E+3     1.24E+3
Table 7.26. Problem 9: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.5.3. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             2            5.74E-3     1.86E-5     3.27E-4
17            4            1.54E-2     1.77E-5     3.24E-3
33            5            2.65E-2     1.42E-2     2.99E-2
65            6            4.88E-2     6.12E-2     6.12E-2
129           6            7.40E-2     9.31E-2     9.23E-2
257           8            1.29E+0     8.34E-2     1.35E-1

Table 7.27. Problem 9: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.5.1. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             3            2.24E-2     8.12E-3     8.96E-3
17            4            2.63E-2     2.84E-3     2.18E-2
33            6            4.90E-1     4.40E-2     6.68E-2
65            div          1.86E+1     1.53E+1     1.53E+1
129           div          1.67E+4     2.88E+3     2.88E+3
257           div          1.51E+9     9.02E+8     9.02E+8
Table 7.28. Problem 9: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.4. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             4            3.89E-2     2.39E-5     3.16E-3
17            5            5.35E-2     1.11E-2     5.59E-2
33            7            7.29E-2     6.07E-2     1.24E-1
65            10           1.65E-1     1.54E-1     2.10E-1
129           12           1.69E-1     1.92E-1     2.95E-1
257           18           1.90E-1     2.27E-1     4.31E-1

Table 7.29. Problem 9: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on extension of Schaffer's idea. Various grid sizes versus the number of V(1, 1) cycles using incomplete line LU iteration by lines in x for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             1            8.41E-14    8.41E-14    8.41E-14
17            1            6.91E-12    6.91E-12    6.91E-12
33            1            6.29E-10    6.29E-10    6.29E-10
65            1            4.63E-8     4.63E-8     4.63E-8
129           2            1.95E-6     5.21E-5     1.01E-5
257           2            4.88E-5     1.09E-3     2.31E-4
Table 7.30. Problem 9: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.5.2. Various grid sizes versus the number of V(1, 1) cycles using incomplete line LU iteration by lines in x for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             1            1.19E-13    1.19E-13    1.19E-13
17            1            9.36E-12    9.36E-12    9.36E-12
33            1            8.28E-10    8.28E-10    8.28E-10
65            1            6.60E-8     6.60E-8     6.60E-8
129           2            3.33E-6     1.09E-4     1.91E-5
257           2            2.33E-3     1.96E-4     6.74E-4

Table 7.31. Problem 9: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.5.3. Various grid sizes versus the number of V(1, 1) cycles using incomplete line LU iteration by lines in x for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             1            8.45E-14    8.45E-14    8.45E-14
17            1            6.62E-12    6.62E-12    6.62E-12
33            1            5.74E-10    5.74E-10    5.74E-10
65            1            4.11E-8     4.11E-8     4.11E-8
129           2            1.71E-6     3.65E-5     7.90E-6
257           2            4.21E-5     8.20E-4     1.86E-4
Table 7.32. Problem 9: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.5.1. Various grid sizes versus the number of V(1, 1) cycles using incomplete line LU iteration by lines in x for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             1            1.45E-13    1.45E-13    1.45E-13
17            1            8.56E-12    8.56E-12    8.56E-12
33            1            5.11E-10    5.11E-10    5.11E-10
65            1            2.54E-8     2.54E-8     2.54E-8
129           2            1.06E-6     1.81E-5     4.38E-6
257                        6.84E+0     8.25E-1     8.60E-1

Table 7.33. Problem 9: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.4. Various grid sizes versus the number of V(1, 1) cycles using incomplete line LU iteration by lines in x for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             1            2.08E-13    2.08E-13    2.08E-13
17            1            1.23E-11    1.23E-11    1.23E-11
33            1            5.90E-10    5.90E-10    5.90E-10
65            1            3.22E-8     3.22E-8     3.22E-8
129           2            1.75E-6     1.02E-4     1.34E-5
257           2            6.76E-5     1.14E-3     2.78E-4
Table 7.34. Problem 9: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.5.1. Various grid sizes versus the number of V(1, 1) cycles using 4-direction point Gauss-Seidel relaxation for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             2            3.09E-6     1.91E-6     2.43E-6
17            2            2.34E-6     4.04E-6     3.08E-6
33            2            1.74E-6     5.02E-6     2.95E-6
65            2            5.05E-6     1.38E-5     8.35E-6
129           2            6.15E-5     3.05E-4     1.37E-4
257           4            5.35E+1     7.19E-5     1.00E-2

From the tables we see that the methods that use the extension of Schaffer's idea perform best, closely followed by the hybrid method from section 3.5.3. Notice that changing the smoother from alternating line Gauss-Seidel to the incomplete line LU iteration greatly improves the convergence factors. When alternating line Gauss-Seidel is used, the grid transfer operators from sections 3.5.2 and 3.5.1 both diverge for large grid sizes, but both are convergent when the incomplete line LU iteration is used for the smoother.

We also see that using 4-direction point Gauss-Seidel relaxation for the smoother with the grid transfer operator method from section 3.5.1 ((TLjaL) gives results that are comparable to those using the ILLU smoother. However, the initial convergence factor for the 257 x 257 grid shows that problems have crept in for large grid sizes. On that grid the convergence factor is greater than one only for the initial iteration; for subsequent iterations it drops off very quickly. For larger fine grids, however, the convergence factor oscillates back and forth between about 50 and about 0.37, and the methods are divergent.
Figure 7.5: Domain \Omega for problem 10; D stands for Dirichlet boundary condition.

7.7 Numerical Results for Test Problem 10

Problem 10 is a convection-diffusion problem, which can be found in the literature; see [24], [66], [77]. The problem is defined by

\[ -\varepsilon \Delta u + a(x, y) \frac{\partial u}{\partial x} + b(x, y) \frac{\partial u}{\partial y} = 0 \quad \text{on } \Omega = (0, 1) \times (0, 1), \tag{7.7} \]

\[ u(x, y) = \sin(\pi x) + \sin(\pi y) + \sin(13 \pi y) + \sin(13 \pi y) \quad \text{on } \partial\Omega, \tag{7.8} \]

where

\[ a(x, y) = 4x(x - 1)(1 - 2y), \qquad b(x, y) = -4y(y - 1)(1 - 2x), \]

and \varepsilon = 10^{-5}; see figure 7.5.

This problem is a reentrant flow problem; such problems are among the most difficult convection-diffusion problems. None of our methods is adequate for solving these types of problems, and several are not even convergent except for small grid sizes. However, using ILLU for the smoother does help many of the methods become convergent, even if the convergence factor is rather poor.
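The reentrant character of the flow in problem 10 can be checked numerically. The Python sketch below assumes the signs a(x, y) = 4x(x - 1)(1 - 2y) and b(x, y) = -4y(y - 1)(1 - 2x), since the minus signs are not legible in the extracted text; with that choice the velocity field is divergence free, has no flow through the boundary of the unit square, and vanishes at the center, so the streamlines circulate inside the domain. The function names `velocity` and `divergence` are hypothetical.

```python
def velocity(x, y):
    """Convection coefficients (a, b) of problem 10 under the assumed signs."""
    a = 4.0 * x * (x - 1.0) * (1.0 - 2.0 * y)
    b = -4.0 * y * (y - 1.0) * (1.0 - 2.0 * x)
    return a, b


def divergence(x, y, h=1.0e-5):
    """Central-difference estimate of da/dx + db/dy at the point (x, y)."""
    da_dx = (velocity(x + h, y)[0] - velocity(x - h, y)[0]) / (2.0 * h)
    db_dy = (velocity(x, y + h)[1] - velocity(x, y - h)[1]) / (2.0 * h)
    return da_dx + db_dy


samples = [0.1 * k for k in range(11)]

# Largest velocity component normal to the four sides of the unit square.
max_normal_flow = max(
    max(
        abs(velocity(0.0, s)[0]),
        abs(velocity(1.0, s)[0]),
        abs(velocity(s, 0.0)[1]),
        abs(velocity(s, 1.0)[1]),
    )
    for s in samples
)

# Largest divergence estimate over a grid of interior sample points.
max_divergence = max(
    abs(divergence(0.1 * i, 0.1 * j))
    for i in range(1, 10)
    for j in range(1, 10)
)

center_velocity = velocity(0.5, 0.5)
print(max_normal_flow, max_divergence, center_velocity)
```

Because nothing enters or leaves through the boundary, boundary data cannot be carried into the interior along characteristics, which is one way to see why such problems are hard for the smoothers considered here.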
There are several reasons why our methods do not work properly on these types of problems. One is that the smoothers are just not adequate. Another is that all of the grid transfer operators that we have considered are close to violating the order of interpolation rule [15], [41], [45], [85]. The rule states that (7.9) where m_r and m_p are the orders of interpolation for the restriction and prolongation operators, respectively, and m_z is the order of the grid equation operator. In our case we have m_r = 1, m_p = 2, and m_z = 3 for equality. Numerically the rule is violated for some of the grid equations due to the effects of computer arithmetic. Another way to look at the trouble is that the grid transfer operators fail to map all the high frequency errors into the range of the smoother.

De Zeeuw's MGD9V [24] code was designed for these types of convection-diffusion problems, and his interpolation operator does map the error into the range of the ILLU smoother; see table 7.43. Although MGD9V does become divergent for large grids (> 160 x 160), it performs much better than any of the other methods for the smaller grid sizes.
Table 7.35. Problem 10: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on extension of Schaffer's idea. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             14           1.06E-1     3.96E-1     3.58E-1
17            25           1.38E-1     6.06E-1     5.68E-1
33                         1.56E-1     7.96E-1     7.65E-1
65                         1.67E-1     9.11E-1     8.70E-1
129                        2.05E-1     9.63E-1     9.14E-1
257                        2.66E-1     9.83E-1     9.31E-1

Table 7.36. Problem 10: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.5.2. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             15           1.20E-1     4.22E-1     3.89E-1
17            36           2.19E-1     7.03E-1     6.79E-1
33                         2.36E-1     8.42E-1     8.36E-1
65            div          4.35E-1     1.12E+0     1.05E+0
129           div          1.30E+1     4.60E+1     4.60E+1
257           div          1.49E+2     9.23E+2     9.23E+2
Table 7.37. Problem 10: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.5.3. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             9            5.79E-2     2.29E-1     1.95E-1
17            22           1.30E-1     5.65E-1     5.25E-1
33                         1.71E-1     8.20E-1     7.88E-1
65                         1.89E-1     9.34E-1     8.92E-1
129                        2.57E-1     9.76E-1     9.27E-1
257           div          4.00E-1     1.98E+0     1.98E+0

Table 7.38. Problem 10: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.5.1. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             5            4.64E-2     6.76E-2     6.19E-2
17            7            2.70E-1     8.72E-2     1.20E-1
33            div          5.66E+0     9.93E+1     9.93E+1
65            div          2.66E+2     2.15E+3     2.15E+3
129           div          1.04E+5     1.77E+8     1.77E+8
257           div          2.04E+6     1.08E+12    1.08E+12
Table 7.39. Problem 10: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.4. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             9            8.88E-2     2.07E-1     1.89E-1
17            20           1.32E-1     5.31E-1     4.90E-1
33                         1.76E-1     8.67E-1     8.31E-1
65                         2.46E-1     9.73E-1     9.25E-1
129                        3.58E-1     9.92E-1     9.36E-1
257                        6.20E-1     9.90E-1     9.41E-1

Table 7.40. Problem 10: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on extension of Schaffer's idea. Various grid sizes versus the number of V(1, 1) cycles using ILLU for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             9            5.58E-2     2.26E-1     1.93E-1
17            15           9.22E-2     4.22E-1     3.81E-1
33            26           1.10E-1     6.28E-1     5.86E-1
65            49           1.13E-1     7.85E-1     7.53E-1
129                        1.09E-1     8.77E-1     8.38E-1
257                        8.80E-2     9.27E-1     8.75E-1

Table 7.41. Problem 10: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.5.3. Various grid sizes versus the number of V(1, 1) cycles using ILLU for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             7            3.98E-2     1.56E-1     1.28E-1
17            13           8.19E-2     3.80E-1     3.38E-1
33            24           1.04E-1     6.07E-1     5.62E-1
65            47           1.10E-1     7.76E-1     7.42E-1
129                        1.09E-1     8.81E-1     8.40E-1
257                        8.53E-1     9.46E-1     8.92E-1
Table 7.42. Problem 10: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.5.1. Various grid sizes versus the number of V(1, 1) cycles using ILLU for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             4            9.42E-3     3.87E-2     2.70E-2
17            5            1.31E-2     6.51E-2     4.69E-2
33            5            1.51E-2     7.80E-2     5.56E-2
65            9            3.74E-2     2.51E-1     2.01E-1
129           div          6.88E+4     2.04E+5     1.90E+5
257           div          1.05E+5     2.88E+5     2.70E+5

Table 7.43. Problem 10: Convection-Diffusion Equation, for De Zeeuw's MGD9V. Various grid sizes versus the number of V(0, 1) cycles using ILLU for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             5            7.53E-2     1.05E-1     8.21E-2
17            5            1.54E-2     1.94E-1     9.31E-2
33            6            9.45E-3     2.84E-1     1.52E-1
65            8            1.36E-2     4.42E-1     2.83E-1
129           11           1.75E-2     5.84E-1     4.27E-1
257           div          1.27E+0     2.90E+0     2.49E+0
Figure 7.6: Domain \Omega for problem 11; D stands for Dirichlet boundary condition.

7.8 Numerical Results for Test Problem 11

Problem 11 is a convection-diffusion problem, which can be found in the literature; see [24], [66], [77]. The problem is defined by

\[ -\varepsilon \Delta u + a(x, y) \frac{\partial u}{\partial x} + b(x, y) \frac{\partial u}{\partial y} = 0 \quad \text{on } \Omega = (0, 1) \times (0, 1), \tag{7.10} \]

\[ u(x, y) = \sin(\pi x) + \sin(\pi y) + \sin(13 \pi y) + \sin(13 \pi y) \quad \text{on } \partial\Omega, \tag{7.11} \]

and

\[ a(x, y) = \begin{cases} (2y - 1)(1 - \bar{x}^2) & \text{if } \bar{x} > 0, \\ (2y - 1) & \text{if } \bar{x} \le 0, \end{cases} \qquad b(x, y) = \begin{cases} 2 \bar{x} y (y - 1) & \text{if } \bar{x} > 0, \\ 0 & \text{if } \bar{x} \le 0, \end{cases} \]

where \bar{x} = 1.2x - 0.2 and \varepsilon = 10^{-5}; see figure 7.6.

The difference between problem 11 and problem 9 is that we now have a stagnation line rather than a stagnation point emanating from the boundary.

The results of the numerical experiments for this problem are similar to those for problem 9. The method using Schaffer's ideas for the grid transfer operators is the
best, followed closely by the hybrid method from section 3.5.3. We also see that grid transfer operators generated using the method in section 3.4 are quite adequate.

Table 7.44. Problem 11: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on extension of Schaffer's idea. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             2            8.55E-3     6.31E-6     2.32E-4
17            4            1.46E-2     9.62E-6     2.34E-3
33            5            2.47E-2     1.29E-2     2.81E-2
65            5            3.70E-2     5.46E-2     5.64E-2
129           6            5.57E-2     9.67E-2     8.61E-2
257           7            7.27E-2     1.06E-1     1.08E-1
Table 7.45. Problem 11: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.5.2. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             3            1.95E-2     4.11E-3     6.62E-3
17            4            3.16E-2     2.38E-3     1.95E-2
33            5            4.44E-2     5.39E-2     5.62E-2
65            11           8.02E-1     2.20E-1     2.63E-1
129           div          8.56E+0     7.14E+0     7.14E+0
257           div          1.26E+3     3.41E+2     3.41E+2

Table 7.46. Problem 11: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.5.3. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             2            5.81E-3     9.65E-5     7.49E-4
17            4            1.49E-2     1.20E-5     2.50E-3
33            5            2.30E-2     1.12E-2     2.67E-2
65            5            4.61E-2     5.38E-2     5.65E-2
129           6            6.87E-2     8.64E-2     9.10E-2
257           10           3.56E-1     7.31E-2     2.18E-1
Table 7.47. Problem 11: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.5.1. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             3            1.97E-2     6.09E-3     8.07E-3
17            4            2.47E-2     1.26E-3     1.49E-2
33            7            1.11E+0     6.38E-2     1.16E-1
65            div          3.57E+0     7.94E+0     7.94E+0
129           div          7.85E+5     9.99E+4     9.99E+4
257           div          9.96E+9     7.00E+9     7.00E+9

Table 7.48. Problem 11: Convection-Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.4. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             4            3.73E-2     9.56E-5     3.78E-3
17            5            5.23E-2     6.26E-3     4.22E-2
33            7            6.07E-2     3.65E-2     1.05E-1
65            9            1.41E-1     1.69E-1     1.99E-1
129           11           1.53E-1     1.73E-1     2.75E-1
257           17           1.79E-1     2.13E-1     4.06E-1
Figure 7.7. Domain \Omega for problem 13; N and M stand for Neumann and Mixed boundary conditions respectively.

7.9 Numerical Results for Test Problem 13

Problem 13 is an anisotropic and discontinuous problem defined as

\[ -\nabla \cdot D \nabla u + c u = f \quad \text{on } \Omega = (0, N_x) \times (0, N_y), \tag{7.12} \]

\[ D \frac{\partial u}{\partial n} + \frac{1}{2} u = 0 \quad \text{on } \partial\Omega \text{ at } x = N_x \text{ or } y = N_y, \tag{7.13} \]

\[ \frac{\partial u}{\partial n} = 0 \quad \text{on } \partial\Omega \text{ at } x = 0 \text{ or } y = 0, \tag{7.14} \]

where

c = 1, f = 1   if x \in [0, 31], y \in [0, 31]; x \in (33, 64], y \in (33, 64]

D_1 = 1000, D_2 = 1000, c = 1, f = 0   if x \in (31, 33], y \in [0, 31]; x \in [0, 31], y \in (31, 33]; x \in (33, 64], y \in (31, 33]; x \in (31, 33], y \in (33, 64]
c = 1, f = 1   if x \in (33, 64], y \in [0, 31]; x \in [0, 31], y \in (33, 64]

c = 1, f = 0   if x \in (31, 33], y \in [0, 31];

and the domain is illustrated in figure 7.7.

For this problem we have chosen to report only the three most valuable grid transfer operators from sections 3.6, 3.5.3, and 3.5.1. All three give very good performance over the range of grid sizes tested.

Table 7.49. Problem 13: Diffusion Equation, Standard coarsening with grid transfer operators based on extension of Schaffer's idea. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             4            1.84E-2     2.32E-2     2.13E-2
17            4            1.67E-2     2.22E-2     1.95E-2
33            5            3.10E-2     6.57E-2     5.79E-2
65            6            1.29E-2     1.14E-1     7.52E-2
129           5            1.51E-2     1.19E-1     6.18E-2
257           6            2.47E-2     1.94E-1     8.72E-2
Table 7.50. Problem 13: Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.5.3. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             4            2.17E-2     2.71E-2     2.56E-2
17            4            1.09E-2     2.33E-2     1.78E-2
33            4            7.35E-3     5.73E-2     2.38E-2
65            7            1.44E-2     1.80E-1     1.14E-1
129           8            2.63E-2     1.96E-1     1.47E-1
257           8            4.55E-2     2.40E-1     1.71E-1

Table 7.51. Problem 13: Diffusion Equation, Standard coarsening with grid transfer operators based on section 3.5.1. Various grid sizes versus the number of V(1, 1) cycles using zebra alternating line Gauss-Seidel (lines in x followed by lines in y) for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             4            2.17E-2     2.71E-2     2.56E-2
17            4            1.09E-2     2.33E-2     1.78E-2
33            4            7.35E-3     5.73E-2     2.38E-2
65            7            1.44E-2     1.80E-1     1.14E-1
129           8            2.63E-2     1.96E-1     1.47E-1
257           8            4.55E-2     2.40E-1     1.71E-1
Table 7.52. Problem 17: Standard coarsening with grid transfer operators based on extension of Schaffer's idea. Various grid sizes versus the number of V(1, 1) cycles using alternating zebra line Gauss-Seidel for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             18           6.35E+0     1.72E-1     4.38E-1
17            12           1.43E-1     3.09E-1     2.91E-1
33            12           1.18E-1     3.34E-1     3.07E-1
65            13           9.22E-2     3.54E-1     3.21E-1
129           13           7.81E-2     3.58E-1     3.22E-1
257           13           7.56E-2     3.59E-1     3.22E-1

7.10 Numerical Results for Test Problem 17

Problem 17 is a discontinuous staircase problem, which can be found in the literature; see [26], [77], [24]. The problem is defined by

\[ -\nabla \cdot D \nabla u = f \quad \text{on } \Omega = (0, 16) \times (0, 16), \tag{7.15} \]

\[ D \frac{\partial u}{\partial n} + \frac{1}{2} u = 0 \quad \text{on } \partial\Omega, \tag{7.16} \]

where

D = 1, f = 0 for (x, y) outside the shaded area,
D = 1000, f = 1 for (x, y) inside the shaded area;

see figure 7.8.

In many real world applications boundaries are often curved, making the discretization hard to perform accurately on rectangular meshes. After discretization the curved boundary will look something like a staircase. Problems with staircase interfaces in the domain are not handled well by classical multigrid methods. In particular, multigrid methods which employ five-point stencils on coarser grids are doomed to failure for staircase problems, since for sufficiently coarse grids, the five-point stencil cannot resolve the staircase. The black box multigrid methods, however, can handle
Figure 7.8. Domain \Omega for problem 17; N and M stand for Neumann and Mixed boundary conditions respectively.
staircase interfaces rather well because they use operator-induced grid transfer operators and the Galerkin coarse grid approximation to form the coarse grid operators; the nine-point operators created in this way can resolve the staircase on coarser grids.

Table 7.53. Problem 17: Standard coarsening with grid transfer operators based on original collapsing method. Various grid sizes versus the number of V(1, 1) cycles using alternating zebra line Gauss-Seidel for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             13           4.54E-1     2.97E-1     3.17E-1
17            8            6.34E-2     1.64E-1     1.44E-1
33            8            5.35E-2     1.62E-1     1.48E-1
65            8            4.92E-2     1.73E-1     1.54E-1
129           8            5.10E-2     1.85E-1     1.65E-1
257           8            5.64E-2     1.89E-1     1.70E-1
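The remark that the Galerkin coarse grid approximation turns a five-point fine grid operator into a nine-point coarse grid operator can be checked on a small model problem. The Python sketch below is only an illustration under simplifying assumptions: it uses the constant coefficient five-point Laplacian and plain bilinear interpolation in place of the operator-induced grid transfer operators of the black box method, and it takes the restriction to be the transpose of the prolongation. The helper names are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def laplacian_1d(n):
    """Tridiagonal matrix with stencil [-1, 2, -1] on n interior points."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)


def linear_interpolation_1d(n_coarse):
    """Linear interpolation from n_coarse to 2 * n_coarse + 1 fine points."""
    n_fine = 2 * n_coarse + 1
    p = np.zeros((n_fine, n_coarse))
    for j in range(n_coarse):
        i = 2 * j + 1  # fine grid index of coarse point j
        p[i - 1, j] = 0.5
        p[i, j] = 1.0
        p[i + 1, j] = 0.5
    return p


def stencil_size(matrix, row):
    """Number of nonzero couplings in one row of the operator."""
    return int(np.count_nonzero(np.abs(matrix[row]) > 1.0e-12))


n_coarse = 3
n_fine = 2 * n_coarse + 1

# Five-point operator on the fine grid, built from 1D pieces.
a1 = laplacian_1d(n_fine)
eye = np.eye(n_fine)
a_fine = np.kron(a1, eye) + np.kron(eye, a1)

# Bilinear prolongation and Galerkin coarse grid operator P^T A P.
p1 = linear_interpolation_1d(n_coarse)
p = np.kron(p1, p1)
a_coarse = p.T @ a_fine @ p

# Compare the stencils at the center point of each grid.
fine_center = (n_fine // 2) * n_fine + n_fine // 2
coarse_center = (n_coarse // 2) * n_coarse + n_coarse // 2
fine_points = stencil_size(a_fine, fine_center)
coarse_points = stencil_size(a_coarse, coarse_center)
print(fine_points, coarse_points)
```

The diagonal couplings that appear in the coarse grid operator are what allow it to represent an interface that is not aligned with the coordinate directions.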
Table 7.54. Problem 17: Standard coarsening with grid transfer operators based on extension of Schaffer's idea. Various grid sizes versus the number of V(0, 1) cycles using x-line ILLU for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             9            1.96E+0     1.08E-1     1.81E-1
17            8            1.85E-1     1.71E-1     1.73E-1
33            6            2.93E-2     8.96E-1     8.06E-2
65            5            2.03E-2     1.15E-1     6.13E-2
129           5            2.27E-2     5.07E-2     5.38E-2
257           7            2.89E-2     1.35E-1     1.19E-1

Table 7.55. Problem 17: Standard coarsening with grid transfer operators based on the hybrid collapsing method. Various grid sizes versus the number of V(0, 1) cycles using x-line ILLU for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             11           4.49E-1     2.68E-1     2.82E-1
17            6            2.91E-2     8.17E-2     6.85E-2
33            6            2.33E-2     8.21E-2     6.69E-2
65            6            1.90E-2     8.91E-2     6.96E-2
129           6            1.90E-2     1.09E-1     8.17E-2
257           6            1.98E-2     1.11E-1     8.44E-2

Table 7.56. Problem 17: Semicoarsening code. Various grid sizes versus the number of V(1, 1) cycles using zebra x-line Gauss-Seidel for the smoother, first, last and average convergence factor.

Grid Size n   Iterations   First       Last        Average
(n x n)       V(1, 1)      CF          CF          CF
9             12           9.13E-1     2.66E-1     3.07E-1
17            9            2.14E-1     1.76E-1     1.83E-1
33            8            1.47E-1     1.61E-1     1.59E-1
65            8            1.13E-1     1.54E-1     1.48E-1
129           8            1.05E-1     1.51E-1     1.44E-1
257           8            1.14E-1     1.50E-1     1.44E-1
7.11 Comparison of 2D Black Box Multigrid Methods

We have looked at several example problems and how the various methods perform for those examples, but now we would like to determine which methods really are the best. The most obvious criterion for judging the best method would be the execution time needed to solve a given problem to a given tolerance. The trouble with this metric, although it is quite practical, is that it does not take into account implementation and algorithm variations, nor does it say anything about whether one method is more efficient than another. We need a criterion that measures both the execution time and the final accuracy of the solution.

We propose to use a normalized execution time and the average convergence factor for our metric. We define the metric to be

\[ P = T \times CF_{ave} \tag{7.17} \]

where P is the performance metric, T is the normalized execution time, and CF_{ave} is the average convergence factor. By normalized execution time we mean that we have taken the time for five V-cycles plus the setup time and divided by five, which gives the average time for a V-cycle plus one-fifth of the setup time. This allows us to take into account the variation in setup time for the different methods. The execution cycle time and average convergence factor are measured over the time it takes a method to reduce the initial l2 norm of the residual by a given amount; the results shown here use a reduction of six orders of magnitude. As with the convergence factor, the smaller the performance metric P, the better the method.

The first comparison is for test problem 8, the four-corner junction problem (section 7.5); see table 7.57. The methods are listed in the far left hand column of the table, where the characters represent the various methods. The method character string can be decoded into four fields; 1st character, 2nd character, 3rd character, and
Table 7.57. Comparison of various Black Box Multigrid methods on the Cray Y-MP for a 2D diffusion equation given in problem 8

method    9x9         17x17       33x33       65x65       129x129     257x257
stav11    1.488E-7    1.720E-5    7.627E-5    1.796E-4    4.898E-4    1.962E-3
htav11    4.871E-7    2.762E-5    1.203E-4    3.917E-4    1.338E-3    4.812E-3
otav11    5.004E-7    2.884E-5    1.256E-4    4.088E-4    1.404E-3    5.056E-3
stiv01    2.475E-8    6.171E-6    8.076E-5    3.201E-4    1.011E-3    9.497E-3
htiv01    3.839E-8    6.503E-6    5.739E-5    2.723E-4    1.110E-3    4.282E-3
otiv01    3.949E-8    5.053E-6    5.907E-5    2.805E-4    1.131E-3    4.405E-3
stiv11    4.376E-12   2.143E-7    2.266E-5    1.758E-4    5.820E-4    3.050E-3
htiv11    7.749E-12   2.039E-7    1.626E-5    1.364E-4    6.700E-4    3.035E-3
otiv11    7.844E-12   2.081E-7    1.655E-5    1.393E-4    6.798E-4    3.106E-3
sclv11    2.883E-7    8.950E-6    6.051E-5    5.337E-4    7.668E-3    1.315E-2
sciv01    1.183E-7    3.660E-5    8.650E-4    6.755E-3    3.715E-2    1.859E-1
sciv11    2.648E-11   1.504E-6    4.366E-4    8.317E-3    5.438E-2    2.679E-1
htpv11    2.083E-5    5.550E-5    2.144E-4    5.802E-4    1.714E-3    5.786E-3
otpv11    2.155E-5    5.871E-5    2.273E-4    6.187E-4    1.838E-3    6.193E-3
Ziv01     5.295E-6    1.884E-5    7.367E-5    2.125E-4    6.020E-4    1.992E-1
4th through 6th characters. The first field represents the grid transfer operator used; s = Schaffer, h = hybrid, o = original. The second field represents the coarsening; t = standard, c = semicoarsening. The third field represents the type of smoother employed; a = alternating line, i = ILLU, l = line, p = point. The fourth field represents the type of multigrid cycling; v11 = V(1, 1)-cycling, v01 = V(0, 1)-cycling.

Table 7.57 shows that the standard coarsening method with alternating line relaxation using Schaffer's idea for the grid transfer coefficients is the best method for larger grids, while the same method with ILLU for the smoother is the best for smaller grids. Most of the methods perform about the same, but semicoarsening with an ILLU smoother is the worst. It is a little surprising to see that De Zeeuw's MGD9V is beaten by the standard coarsening alternating line method when it was seen before that MGD9V had a faster execution time.

Recall from the numerical examples that the ILLU smoother was essential for obtaining good convergence for convection-diffusion equations, except for the semicoarsening method. The examination of a convection-diffusion equation with our new performance metric should prove enlightening. We choose to look at the convection-diffusion problem 9 [24].

Table 7.58 shows the performance metric for the convection-diffusion equation given in problem 9 for the various methods. The two clear winners for this problem are the standard coarsening methods using ILLU for the smoother with the grid transfer operators computed by either Schaffer's idea or the hybrid collapsing method. While stiv11 and htiv11 are the best, De Zeeuw's method, MGD9V, shows nice consistency in performance over the range of grid sizes, and it also has the advantage of being more robust for more complex convection-diffusion equations, especially those with reentrant flows. While our methods are still convergent for most of the more
Table 7.58. Comparison of various Black Box Multigrid methods on the Cray Y-MP for a 2D convection-diffusion equation given in problem 9

method    9x9         17x17       33x33       65x65       129x129     257x257
stav11    1.835E-6    6.188E-5    1.263E-3    5.060E-3    2.181E-2    9.300E-2
htav11    2.149E-6    6.473E-5    1.248E-3    5.064E-3    2.208E-2    1.285E-1
otav11    8.287E-5    4.422E-4    3.281E-3    *
stiv01    2.286E-12   1.069E-10   4.757E-9    5.709E-6    1.241E-3    5.667E-2
htiv01    2.144E-12   8.683E-11   3.526E-9    2.996E-6    5.472E-5    2.389E-3
otiv01    4.249E-12   9.064E-11   2.146E-9    1.506E-6    6.271E-5
stiv11    2.124E-16   4.079E-14   8.658E-12   1.682E-9    1.537E-6    1.208E-4
htiv11    2.022E-16   3.651E-14   7.519E-12   1.432E-9    1.163E-6    9.424E-5
otiv11    3.652E-16   5.015E-14   7.086E-12   9.420E-10   6.740E-7    ***
sclv11    1.097E-6    4.410E-5    6.635E-4    4.112E-3    2.738E-2    1.708E-1
sciv01    1.158E-11   5.618E-10   7.717E-7    2.182E-5    1.409E-3    5.587E-2
sciv11    6.317E-16   2.011E-13   4.620E-11   1.185E-8    2.210E-5    4.303E-3
Ziv01     3.353E-8    1.20E-7     4.883E-7    2.630E-6    1.477E-5    1.293E-4
complex convection-diffusion equations, they are not useful as solvers because the convergence factor is often above 0.9. We do not present the performance metric for the more complex convection problems because it is difficult to get convergence for large grids even using MGD9V.
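The performance metric of section 7.11 and the method codes used in tables 7.57 and 7.58 can be summarized in a few lines of Python. This is a sketch under stated assumptions: the function and dictionary names are hypothetical, the timings in the example are made up for illustration and are not measurements from this work, and the cycle field is assumed to read v11 for V(1, 1) and v01 for V(0, 1). The entry for De Zeeuw's MGD9V does not follow the four-field scheme and is not handled.

```python
# Field values taken from the description of the method character string.
TRANSFER = {"s": "Schaffer", "h": "hybrid", "o": "original"}
COARSENING = {"t": "standard", "c": "semicoarsening"}
SMOOTHER = {"a": "alternating line", "i": "ILLU", "l": "line", "p": "point"}
CYCLE = {"v11": "V(1, 1)", "v01": "V(0, 1)"}


def decode_method(code):
    """Split a six-character method code into its four fields."""
    if len(code) != 6:
        raise ValueError("expected a six-character method code")
    return {
        "transfer": TRANSFER[code[0]],
        "coarsening": COARSENING[code[1]],
        "smoother": SMOOTHER[code[2]],
        "cycle": CYCLE[code[3:]],
    }


def normalized_time(setup_seconds, five_cycle_seconds):
    """Average time per V-cycle plus one-fifth of the setup time."""
    return (five_cycle_seconds + setup_seconds) / 5.0


def performance_metric(setup_seconds, five_cycle_seconds, average_cf):
    """P = T x CF_ave as in equation (7.17); smaller is better."""
    return normalized_time(setup_seconds, five_cycle_seconds) * average_cf


# Hypothetical timings, chosen only to show the arithmetic.
fields = decode_method("stiv11")
p_value = performance_metric(
    setup_seconds=1.0, five_cycle_seconds=4.0, average_cf=0.1
)
print(fields, p_value)
```

Because P multiplies a time by a convergence factor, a method can score well either by cycling quickly or by converging in few cycles, which is why it is used here alongside the raw convergence tables rather than in place of them.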
CHAPTER 8

PARALLEL ALGORITHM CONSIDERATIONS IN TWO DIMENSIONS

The parallel algorithm has undergone many changes as the hardware and software support have changed. Originally, several codes were developed for the CM-2 and CM-200, but when the CM-5 came along, those codes were abandoned. We will present only the CM-5 code, after a brief outline of the previous work.

8.1 CM-2 and CM-200 Parallel Algorithms

We used the CMFortran SPMD model on the CM-2 and CM-200, which contained 65K integer processors and 4K floating point processors (Weitek). The CM-200 computer had two modes of operation, called the Paris (parallel instruction) model and the slicewise model. The slicewise model was the preferred one because it treated the machine as if it were constructed only of the 4K floating point processors. The CM-2 and CM-200 have a front end (host) computer, a SPARC workstation, that is connected to the hypercube data network, which connects all the processors. The integer processors are bit-serial, and the floating point processors are 32-bit based. On the CM-200 the data is stored bitwise and passed through a transpose device, which packs it into 16 and 32 bit words that can be passed to the floating point processors. The only real differences between the CM-2 and CM-200 are that the transposer and slicewise models were made standard on the CM-200 and that the overall
hardware for both the processors and networks is faster on the CM-200.

Codes were developed for the standard coarsening algorithm using all three of the data structures shown in figure 8.5. The CMSSL (CM Scientific Software Library) was very limited at that time, and we had to write our own tridiagonal line and direct solvers. The direct solver was so slow that we found it easier and faster to use the front end to perform and store the LU decomposition, and then just pass the right hand side to the front end and the solution back to the CM-2. On the CM-200, which was faster than the CM-2, the direct solver ran about as fast as, or a little faster than, using the front end and passing the two vector arrays.

All of the approaches taken to develop an efficient black box multigrid code ran into trouble with communications, and their performance was dismal when compared to the Cray Y-MP. The communications bottleneck appeared in both the transpose device and the processor-to-processor communication. There was no way around the transpose device, so efforts had to be concentrated on communications between processors. Much of this effort was hampered by the lack of control over the data's automatic layout across the processors.

Dendy, Ide, and Rutledge [32] partially solved the layout problem by writing code to access the CM Run-Time System (CMRTS). The code they wrote is essentially at the assembly level. While this approach met with success, it was not portable, and when either the hardware or system software was updated (modified), the code could not be guaranteed to work properly. The layout problem was not really resolved until the codes were implemented on the CM-5 computer.

There were several attempts to reduce the communication overhead by using various communication packages. The two we tried were the CM FastGraph package and the CMSSL polyshift routines.

The CM FastGraph (CMFG) package has two major components consisting
of the communication compiler and the mapping facility. The CMFG can be used to speed up general communications between multiple processors. The speedup is achieved by determining and storing the routing paths once at the beginning of the program's execution. When the communications are performed, there is no need to determine the routing paths dynamically, which reduces the time it takes to complete the communications.

The CMFG components were used for the grid transfer operations between fine and coarse grid levels. First the communication map is defined for passing data between the different grid level data structures. The CMFG compiler then creates the maps to be used by the CMFG communication routines. While the communications were much quicker, the overhead involved in defining the communications was prohibitive, often adding minutes to the execution time, depending on the size and number of data communications to be performed. This approach, while interesting, was not practical for our software library approach. However, the CMFG approach has proven useful for applications that solve the same size problem many times (e.g. time-dependent problems), since the routing paths for all the grid levels can be stored once and reused.

Throughout the multigrid algorithm there are many instances where communication is needed with neighboring grid points; in such a case, multiple calls are made to communication routines. These communications are either circular (CSHIFT) or end-off (EOSHIFT). The CSHIFT routine shifts the data in a circular fashion, with wraparound occurring at the array boundaries. The EOSHIFT routine shifts the data so that elements drop off one end of the array while a predefined value is shifted in at the other end. The polyshift communication routines allow the user to define and use communication stencils to combine multiple calls to other communication routines, CSHIFT or EOSHIFT, into only one communication call.
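As a minimal sketch of the two shift semantics just described, the following Python functions mimic the Fortran 90 CSHIFT and EOSHIFT intrinsics on a one-dimensional list. They are sequential stand-ins written for illustration, not the CM Fortran routines. The helper `neighbor_sum` is a hypothetical example of a stencil built from two separate shifts, the kind of pattern that a combined stencil call is meant to collapse into one communication.

```python
def cshift(a, shift):
    """Circular shift: a positive shift moves elements toward lower indices,
    and elements leaving one end wrap around to the other (as in CSHIFT)."""
    n = len(a)
    if n == 0:
        return []
    s = shift % n
    return a[s:] + a[:s]


def eoshift(a, shift, boundary=0):
    """End-off shift: elements shifted past the end are dropped and the
    vacated positions are filled with `boundary` (as in EOSHIFT)."""
    n = len(a)
    if shift >= 0:
        k = min(shift, n)
        return a[k:] + [boundary] * k
    k = min(-shift, n)
    return [boundary] * k + a[:n - k]


def neighbor_sum(u):
    """Hypothetical two-neighbor stencil: left plus right neighbor of each
    point, with zero supplied outside the array. Uses two end-off shifts."""
    left = eoshift(u, -1)   # left[i] holds u[i-1]
    right = eoshift(u, 1)   # right[i] holds u[i+1]
    return [l + r for l, r in zip(left, right)]


if __name__ == "__main__":
    u = [1, 2, 3, 4, 5]
    print(cshift(u, 1))      # [2, 3, 4, 5, 1]
    print(eoshift(u, 1))     # [2, 3, 4, 5, 0]
    print(eoshift(u, -1))    # [0, 1, 2, 3, 4]
    print(neighbor_sum(u))   # [2, 4, 6, 8, 4]
```

Each call to `eoshift` here stands for one communication step on the parallel machine, which is why stencils with several neighbors lead to multiple communication calls.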
The polyshift stencil communication allows the various communications that make up the stencil to be overlapped when possible, and hence to reduce the amount of time spent on the communications. The
Table 8.1. Timing comparison per V-cycle for the semicoarsening code on the Cray Y-MP, CM-2, and CM-200. Times are given in seconds, and the CM-2 and CM-200 times are the elapsed time for 256 and 512 floating point processors. "*" means the problem was too big to run, and "" means there was no data.

Columns, in order: size n (n x n grid); Cray Y-MP (old); Cray Y-MP (new); CM-2, 256; CM-2, 512; CM-200, 512.

8  0.004  0.0005
16  0.006  0.0009  0.51
32  0.009  0.0019  0.59  0.75
64  0.041  0.0044  0.65  0.77  0.14
128  0.093  0.013  0.99  0.80  0.23
256  0.267  0.043  1.84  1.79  0.40
512  0.948  0.158  4.55  3.04  1.00
1024  3.69  0.656  8.11  3.01
2048  2.714  25.39

polyshift communications consist of three routines: the allocation and stencil setup routine pshift_setup, the communication routine pshift, and the deallocation routine deallocate_pshift_setup. The use of the polyshift communications in the two dimensional codes gave a very slight performance gain for the standard coarsening code and was actually slightly detrimental for the semicoarsening code. Only the three dimensional standard coarsening code showed any real benefit from using the polyshift communication routines. The reason for the lack of performance gains was overhead. There was a set of polyshift routines that were not as general as the CMSSL version, but performed nearly three times faster because of much less overhead. Use of the specialized polyshift routines would have caused the multigrid codes to run considerably faster, but they still would not have been competitive with the Cray Y-MP versions.

8.1.1 Timing Comparisons

For historical reasons it is interesting to compare the timing results for both the standard and semicoarsening black box multigrid codes on both the Cray Y-MP and CM-2.
Table 8.1 gives a timing comparison for the semicoarsening code on both the Cray Y-MP and CM-200. The Cray Y-MP gives times for the "old" original vector code by Dendy and the "new" optimized vector code that we developed. The CM-200 semicoarsening code timings are the same as those in [32], page 1466. The CM times are for the elapsed time, which is the sum of the busy and idle times. Busy time is defined to be the parallel execution time, and idle time includes the sequential and communication time. Note that the elapsed time is not the same as wall-clock time, which includes timesharing and system overhead time.

The CM-200 code is at least an order of magnitude slower than the new Cray Y-MP code. However, if we examine the trend, we can see that the CM-200 is slowly gaining on the Cray Y-MP code; unfortunately, we will run out of processors and memory long before it can catch up.

Table 8.2 gives comparison timings for the standard coarsening code using alternating zebra line Gauss-Seidel smoothing. The Cray Y-MP timings also refer to the "old" original standard coarsening code by Dendy [27], which was ported to the Cray Y-MP, and the "new" code that we developed. The same observations can be made about the standard coarsening codes as were made about the semicoarsening codes. However, the standard coarsening CM-200 code is more than twice as slow as the semicoarsening code. The reason is that the standard coarsening version requires more communication and uses the inefficient data layouts that the system provides. The standard coarsening parallel code did not make use of polyshift communications.

8.2 CM-5 Hardware Overview

The CM-5 computer, as we have already said, is an SPMD computer with up to 1024 processors. We have considered only the data parallel model of execution in our studies. A program, under this model, is copied into each processor's memory,
Table 8.2. Timing comparison per V-cycle for the standard coarsening code on the Cray Y-MP, CM-2, and CM-200 using AZLGS smoothing. Times are given in seconds, and the CM-2 and CM-200 times are the elapsed time for 256 and 512 floating point processors. "*" means the problem was too big to run, and "" means there was no data.

Columns, in order: size n (n x n grid); Cray Y-MP (old); Cray Y-MP (new); CM-2, 256; CM-2, 512; CM-200, 512.

8  0.0031  0.0005
16  0.0068  0.001  1.49
32  0.0123  0.002  1.58  2.02  0.697
64  0.0275  0.0046  1.81  2.11  0.914
128  0.072  0.0128  2.71  2.21  1.52
256  0.266  0.0443  5.04  4.79  2.45
512  0.957  0.165  12.34  7.89  4.21
1024  3.84  0.673  20.41  12.55
2048  15.42  2.792  62.89
and then every processor executes the same instruction at the same time on its own data. When data is needed from the other processors, it is passed through a data communication network that connects all of the processors to each other.

To design parallel programs it is essential to understand the underlying hardware, the high level parallel programming language, and the behavior of the parallel run time system. While much effort is being made to hide these issues and make them transparent to the average user, there is much room for improvement. To date, no parallel computer or high level language has had any real success in making it possible to ignore these three issues. This situation has mostly been due to a lack of software support both in the languages and software libraries. Some have said that object oriented design and algorithms are the answer, but first the underlying framework and code have to be developed, by no means a trivial task, and the computing community is not even close to the general prototype stage at this time.

The description of the CM-5 given so far is not sufficient to understand and appreciate the issues and complexities of designing an efficient parallel algorithm. Hence, we provide a more detailed description.

The CM-5 is an SPMD computer with 1024 processing nodes, 4096 vector units, a partition manager, and I/O processor(s) that are all connected by a communication network; see figure 8.1. The partition manager is the user's gateway to the CM-5. It is essentially a SPARC workstation that manages the various user partitions (timesharing) of processing nodes, networking, and I/O communications with external devices, such as remote terminals, disk and tape drives, and printers. It also stores a program's scalar data and executes its scalar instructions, along with the CM Run-Time System (CMRTS). There are two communication networks, one for control and the other for data.
Figure 8.1: CM-5 system diagram for n processing nodes (PN).
The control network is used by operations that require all the processing nodes to work together, such as partitioning of the processing nodes, global synchronization, and general broadcasting of messages. The control network is implemented as a complete binary tree with all the system components at the leaves.

The data communication network is set up in the form of a fat-tree connection network that allows data to flow in both directions simultaneously. A fat-tree is basically a tree in which the connections become denser as one progresses toward the root, allowing for wider communication channels to cope with the increased message traffic. The fat-tree is a 4-ary tree with either 2 or 4 parent connections. It is used for point-to-point communication between individual processing nodes. The fat-tree network allows a variety of data structures to be mapped onto it easily, such as vectors, matrices, meshes (1, 2, and 3 dimensional grids), hypercubes, and of course hierarchical tree-like structures, while maintaining a high bandwidth.

The organization of the individual processors of the CM-5 is as follows: each consists of four memory banks of 8 to 32 MBytes each, four vector processing units, a RISC (SPARC) processor, and a network interface controller, all connected together by an internal bus; see figure 8.2. Each memory bank is connected to a vector unit, which in turn connects to the internal bus. The RISC processor manages the issuing of vector instructions to the vector units, address computations, loop control, and other housekeeping activities. The network interface controller transmits and receives messages from the fat-tree network, communicating with other processors and I/O units. The vector units act both as a memory manager for the node's RISC processor and as a vector arithmetic accelerator. The RISC processor can read or write to any of the four vector unit memory banks.
The vector instructions are given by the RISC processor as memory address instructions with special bits set to indicate the type of
Figure 8.2: CM-5 processor node diagram with four vector units.
vector operation to be performed. It is important to note that the vector units are not independent processors. They do not load or execute instructions, but merely perform memory management and vector arithmetic functions. However, because they do perform all the arithmetic functions from the user's program, it is convenient to think of the vector units as acting like processors.

The vector units consist of an internal bus interface, a vector instruction decoder, a memory controller, a bank of 64 64-bit registers, and a pipelined arithmetic logic unit (ALU); see figure 8.3. The bank of registers can also be addressed as 128 32-bit registers. In addition there are a couple of control registers: a 16-bit vector mask register and a 4-bit vector length register. The vector mask register controls certain conditional operations and receives single bit status results. The vector length register indicates the vector length that is being used and ranges from a vector length of 1 to 16; a vector length of one is used for scalar operations. It should be noted that the original vector length on the CM-5 was 4; it was increased to 8 in 1993 and finally to 16 in late 1995. The increase in the vector length has improved the performance of the black box multigrid codes over the years, and in addition, it has caused modifications in the implementation of several of the algorithmic components.

A special note is needed concerning communication as it relates to the vector units. The vector units are implemented two to a chip, with units 0 and 1 on one chip and units 2 and 3 on the other. This arrangement means that communication between two vector units on the same chip is faster than communication between vector units of the same processing node but on different chips. Communication between vector units of differing processing nodes will involve the data communication network and be much slower.

There are 1024 processors on the CM-5, but since each processor contains
Figure 8.3: Diagram of one CM-5 vector unit.
four vector units (processors), it is better to think of the CM-5 as having 4096 vector processors. This viewpoint is justified because the vector units perform virtually all the computations. When we refer to a processor in our discussion, we will in general be referring to the vector processing units.

8.3 CM-5 Memory Management

Memory management is usually less important than either parallel computation or communication in determining the speed of program execution, but it is still one of the major considerations in obtaining good performance in CMFortran. It is possible to run out of parallel memory even when it would appear that the data should fit. To understand why this can happen we need to examine the memory more closely.

Scalar memory is any memory region not dedicated to storing parallel data, including the scalar memory of the partition manager and any portion of the processing node's memory not being used by the parallel stack or heap. Parallel memory is any region of memory located on all four vector unit memory banks on all the processing nodes. Parallel memory is the same size and starts at the same address on all the memory banks. There are two types of parallel memory: stack and heap. Stack memory is temporary memory. Heap memory is relatively permanent; it is allocated and deallocated arbitrarily and never gets compacted. Because heap memory does not get compacted, it can become fragmented and leave areas of memory unusable.

The partition manager runs a complete UNIX operating system, but each processing node runs only a subset, which is called the PN kernel. The PN kernel occupies about 0.5 MBytes of memory in each processing node's memory. It is enlightening to see how a processing node's memory is partitioned to hold the PN kernel, parallel stack memory, parallel heap memory, scalar memory, and user
code; see figure 8.4. The CM operating system is responsible for assigning the memory pages. The parallel memory pages must be aligned across all four memory banks on a processing node, but the scalar memory pages can be assigned arbitrarily. The node's memory is organized into high and low memory, with the PN kernel stored in vector unit 0's low memory. The PN kernel takes up only half a megabyte of memory, but it effectively takes up 2 MBytes because its memory shadow on the other three memory banks is unusable for parallel data. To make things worse, the 1.5 MBytes of memory in the PN kernel shadow is not always assigned for scalar memory, so that other memory locations used for the scalar memory also cause a memory shadow that blocks parallel memory assignment. The user code (variables and instructions) is stored in high memory in scalar memory pages. This arrangement leaves the parallel stack and heap stored in between the PN kernel and user code, with the stack always being stored towards the low end of memory relative to the parallel heap. Both the parallel stack and heap grow from their starting locations towards high memory.

Parallel arrays come in many forms and are stored either on the stack or heap. There are four types of user defined arrays and three types of compiler generated arrays. The user defined array types are defined first; in the discussion, "local" means declared in a routine and not defined outside of that routine. Ordinary local arrays are those declared in a routine without the SAVE or DATA attribute. These arrays are allocated on the stack on entry into the routine and deallocated upon exiting the routine. Permanent local arrays are declared with the SAVE or DATA attribute. They are allocated on the heap when entering the routine for the first time and are never deallocated. Dynamically allocated arrays are explicitly allocated and deallocated by function calls and are stored on the heap.
Common block arrays are allocated on the heap when the array is first used and are never deallocated.
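The memory shadow arithmetic described above can be made concrete with a small calculation. This Python sketch assumes, as a worst case, that every scalar region occupies a distinct address range on a single memory bank, so that its shadow blocks the same range on the other three banks. The function name and the 8 MByte bank size in the usage lines are illustrative choices, not measurements from the CM-5.

```python
BANKS_PER_NODE = 4  # one memory bank per vector unit


def parallel_memory_mb(bank_mb, scalar_regions_mb):
    """Return (effective_loss_mb, usable_parallel_mb) for one processing node.

    Assumption: scalar regions do not share address ranges, so each region
    of size s removes s MBytes from the parallel address range of all four
    banks, an effective loss of 4 * s.
    """
    blocked_per_bank = sum(scalar_regions_mb)
    if blocked_per_bank > bank_mb:
        raise ValueError("scalar regions exceed the size of a memory bank")
    effective_loss = BANKS_PER_NODE * blocked_per_bank
    usable = BANKS_PER_NODE * (bank_mb - blocked_per_bank)
    return effective_loss, usable


if __name__ == "__main__":
    # PN kernel only: 0.5 MBytes on vector unit 0 plus its shadow.
    loss, usable = parallel_memory_mb(8.0, [0.5])
    print(loss, usable)  # 2.0 30.0
```

With only the PN kernel present, the half megabyte kernel costs 2 MBytes of the node's 32 MBytes, which matches the figure quoted in the text; any additional scalar region placed outside the kernel's shadow is charged in the same way.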
Figure 8.4. CM-5 processor node memory map for vector unit configuration. Area 1 is the scalar stack, area 2 is the scalar heap, and area 3 is the user's code (scalar variables and instructions). White space is unclaimed memory and is neither scalar nor parallel.
The compiler generates three kinds of internal temporary arrays, which are all stored on the heap. The first kind are communication temporaries. They are arrays that temporarily hold results from communication operations, which are the result of either explicit or implicit communication taking place in an expression. The second kind are computation temporaries. They are the result of either computations being performed inside a communication function (e.g. CSHIFT) or the execution of a selection type statement (e.g. FORALL, WHERE). Communication and computation temporaries are allocated at the time the expression is evaluated and are deallocated when the calculation of the expression has completed. The third kind are common subexpression temporaries. They are arrays that hold values of common subexpressions between the first time they are used and the last time that they are needed. A common subexpression temporary is generated to store every WHERE statement's mask, and sometimes the mask for FORALL statements.

We will need to define what we mean by basic code block and PE code block in order to simplify our discussion. A basic code block is a segment of statements bounded by control flow statements. A PE code block is a region of pure parallel computation involving no control flow statements.

The compiler can collapse and reuse communication and computation temporaries in basic code blocks, instructions that do not involve control flow, if the temporaries have the same shape. Common subexpression temporaries can be stored in registers with increased speed and efficiency if they occur within the same loop, PE code block, and basic code block.

When the amount of memory available on the processing nodes becomes an issue, there are several rules of thumb that should be followed.
1. Use more complex array expressions and fewer arrays.
2. Rewrite code fragments to reduce the number of temporary arrays generated by the compiler.
3. Try to reuse arrays or parts of arrays.
4. Split up program units so that fewer arrays and temporaries are allocated at one time.
5. Use the aliasing functions to use the same memory for arrays of different layouts.
6. Consider the effects of garbage element array padding and vector-length padding, which can invisibly increase an array's storage.
7. Use the scratch space on the parallel I/O devices, e.g. Data Vault.

Temporary array allocation by the compiler can be reduced in several ways. Write expressions so that common subexpressions are easily recognized; if one cannot easily see them, then the compiler may not be able to see them either. Avoid writing complicated expressions that involve many array functions or changes of array layouts, which will cause the generation of communication and computation temporaries. Be aware that most communication functions assume that their source and destination arrays are distinct. If the source and destination are the same array, then a communication temporary will be generated.

8.4 Dynamic Memory Management Utilities

The Dynamic Memory Management Utilities (DMMU) were developed by W. Spangenberg at Los Alamos National Laboratory [8]. The DMMU were designed to address the problem of control of the data structure layout on the CM-5 processors. The
CMFORTRAN compiler and Run-Time System (RTS) aggressively monitor the array layouts to assure that arrays are distributed uniformly across the processors. While the monitoring and redistribution of array layouts is good for many applications, it can be disastrous for multigrid method performance.

As we have indicated before, when the VP ratio is less than or equal to one, the most efficient communication was through the use of compatible arrays and masks. When the VP ratio is greater than one we used incompatible arrays, because compatible arrays led to inefficient computations and storage. When the aspect ratio of parallel grid axes becomes large or processors become unused, the CM RTS redistributes the arrays to minimize the number of idle processors. The redistribution occurs when the grid dimensions are not equal and also when coarsening takes place, especially for the semicoarsening multigrid method. The redistribution problems can persist even when using the PROCS and BLOCS attributes for the array layout.

The fine grid coefficient matrix (array), L, defines the physical layout of the data on the processors. The layout is specified by the LAYOUT command, which is part of CM FORTRAN and High Performance FORTRAN (HPF). The command specifies the serial and parallel dimensions of the array (SERIAL and NEWS attributes respectively); in addition, it can specify the physical extents of the array across the processors (PROCS attribute) and the subgrid size on the processors (BLOCS attribute). The PROCS attribute defines the physical extents of the array across the processors, and the BLOCS attribute defines the array axis subgrid length, which is given as a ratio of the axis extent to the physical processor extent. Every array has a geometry descriptor that contains these attributes.

The DMMU use the fine grid coefficient array's geometry descriptor as a template to control the layout of the other arrays.
For each grid dimension, all arrays have the same physical extents over the processors by using a common PROCS directive, from the geometry
descriptor template, for each parallel axis. All arrays on each grid level, in each grid dimension, have the same subgrid extents, which are specified by a common BLOCS directive, from the geometry descriptor template, for each parallel axis. The DMMU use the geometry descriptor template to dynamically allocate arrays with the same layout on each grid level. The compatible and incompatible arrays for different grid levels can also be aligned using the DMMU to obtain more efficient communications between grid levels. The intergrid communications require the use of temporary arrays for efficiency, and these are also allocated dynamically by the DMMU. The use of the DMMU ensures that the FORTRAN 90, CM FORTRAN, and HPF array and array section operations perform correctly while also allowing for the most efficient interprocessor communication.

The black box multigrid codes use the DMMU to dynamically allocate all of the internal arrays. The process works in the following way. First we need to determine the number of grid levels and their size. Next we need to determine the number of processors and at which grid level the VP ratio will be less than or equal to one. We then obtain the geometry descriptor template for the fine grid coefficient array. We determine the physical extents across the processors of the congruent geometry template's parallel axes, assign the appropriate axis' physical extent, and check to see that we have consistent physical extents across the processors. We then determine the appropriate PROCS and BLOCS directives for each grid level. We then create a congruent array alias for each array to conveniently reference the different grid levels via an index. Finally, we dynamically allocate all the coarse grid arrays, using the geometry descriptor template to enforce the desired data layout.
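The first two steps of this process (counting the grid levels and finding where the VP ratio drops to one or below) can be sketched as follows. This is an illustration in Python rather than the CM Fortran used by the codes, and it rests on stated assumptions: square grids with n = 2^m + 1 points per side, standard coarsening n -> (n + 1)/2 down to a 3 x 3 grid, and the VP ratio taken as grid points per vector unit. None of the function names come from the DMMU.

```python
def level_sizes(n_fine, n_coarsest=3):
    """Points per side on each level, index 0 being level k = 1 (coarsest).

    Assumes standard coarsening of an n x n grid with n = 2**m + 1.
    """
    if n_fine < n_coarsest:
        raise ValueError("fine grid is smaller than the coarsest grid")
    sizes = [n_fine]
    while sizes[-1] > n_coarsest:
        sizes.append((sizes[-1] + 1) // 2)
    sizes.reverse()
    return sizes


def vp_ratio(n, processors):
    """Assumed definition: grid points per (vector) processor."""
    return n * n / processors


def finest_level_with_vp_at_most_one(sizes, processors):
    """Largest level number k whose VP ratio is <= 1, or None if there is none."""
    level = None
    for k, n in enumerate(sizes, start=1):
        if vp_ratio(n, processors) <= 1.0:
            level = k
    return level


if __name__ == "__main__":
    sizes = level_sizes(257)
    print(sizes)                                          # [3, 5, 9, 17, 33, 65, 129, 257]
    print(finest_level_with_vp_at_most_one(sizes, 4096))  # 5
```

Under these assumptions a 257 x 257 fine grid on 4096 vector units has eight levels, and the 33 x 33 grid (level 5) is the finest one on which each processor holds at most one grid point.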
8.5 CM-5 Software Considerations

Due to the state of compilers today, when designing a parallel program, one should keep in mind the computer's architecture to obtain reasonably fast performance. Dropping down out of a higher level language to the underlying support language, sometimes even to assembly language, will always yield the best performance, but will inevitably lead to nonportable, long, and confusing code. However, the underlying structure of a computer often changes with updates to both the hardware and system software, as was certainly the case with the CM-5. These changes can greatly affect the life of any computer program which uses specific hardware or low-level software features. The high level computer languages are much more stable and are usually not affected by these changes. Hence, we have chosen to keep the code as portable and readable as possible by exclusively using the higher level languages.

The CM-5 supports several higher level languages that support the data parallel model, the two most common being C and FORTRAN. We have chosen to use FORTRAN because it is more stable, and its behavior is better understood for performing numerical computations than is the behavior of C. The flavor of FORTRAN that the CM-5 employs, called CMFORTRAN, is a subset of FORTRAN 90, but it also includes the entire F77 ANSI standard and a few CM extensions. The parallelism is achieved by the way that FORTRAN 90 expresses looping indices, data structures, and data dependencies in the program. This simplification in notation can lead to more compact and easier to read codes for most algorithms. However, care must be taken to note whether a variable is an array or a scalar and whether it resides on the front end (sequential) or on the CM (parallel) side of the computer.

It is important to note some relationships in computing and communication on the CM-5. Computations are very fast if they are performed entirely on one processor.
Communications are very expensive when compared to computations but can vary a lot among themselves. There are several types of communications that we are interested in: circular, end-off, and irregular. The circular and end-off shift communications are about the same speed, with circular getting the edge, but they are both much faster than the irregular pattern communications. Another aspect that relates to speed is the distance of the communication path between processors. Performance is best when the distance is a power of 2; nearest neighbor communication is the fastest case, but only slightly faster than the others.

8.6 Coarsening and Data Structures in 2D

It is important to understand how the data are laid out in the memory across the processing nodes. When a parallel array has more elements than there are processors, the array is decomposed into contiguous subsets and spread across the processing nodes. The subsets are called subgrids, and they are uniform in size, shape, and starting memory address on all the memory banks on all the processing nodes. If the data will not fill all the subgrids, then some of the subgrids will be padded with null data until they are all full. Computations taking place on subgrid data are all done sequentially on a processing node by the vector units. The best efficiency is obtained when there is no padding, and thus full vector operations can be performed. When padding is present in the subgrids, additional overhead is incurred: first no-op instructions are sent to the vector units associated with the padded elements, and then a mask is created and used to prevent storing the results.

We have considered two types of coarsening to generate the coarser grids: standard coarsening and semicoarsening. The degree of parallelization is quite different for these two choices. For a given grid, the next coarser grid has one fourth the number of grid points for the standard coarsening method versus one half in the
semicoarsening method. For the standard coarsening method, the number of points in both dimensions is reduced by half, taking every other grid point from the fine grid to form the coarser grid. The semicoarsening method reduces the number of points in only one dimension by half. The rest of the discussion will concentrate on the standard coarsening method but will equally apply to the semicoarsening method, with the obvious difference that coarsening is only in one direction. Any comments that do not apply to both methods will be pointed out as they arise.

There are several ways in which we can set up the grid data structures. The fine grid, in two dimensions, is laid out and partitioned into subgrids across the processors; see figure 8.5(a). Ideally this layout is thought of as one grid point per processor. However, it is important to remember that for large grids, each processor contains a contiguous subgrid of grid points and that each grid point is treated as if it were on an individual processor; such imaginary processors are often referred to as virtual processors, but we will just refer to them as processors.

In order to discuss the data structures and their relationship to communications, we need to define the grid spacing relationship between the different grid levels. Let us assume a uniform fine grid for now, but our comments will equally apply to the data structures for nonuniform grids. Recall the notation used in Algorithm MGV(k, v1, v2, h), from section 1.4, where k referred to the grid level and ranged from 1 (coarsest) to M (finest). The coarse grid spacing, for our solvers, is determined by doubling the fine grid spacing, $h_{k-1} = 2h_k$, which leads to the equation

\[ d = 2^{M-k}, \tag{8.1} \]

where $d$ is the grid communication distance between neighboring grid points on grid level $k$ and $M$ is the total number of grid levels. The actual grid spacing, for a uniform grid, on grid level $k$ is $h_k = d\,h_M$.
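As a small illustration of this relationship, the following Python sketch tabulates the communication distance and grid spacing per level. It assumes only what is stated above (doubling of the spacing and $h_k = d\,h_M$, hence $d = 2^{M-k}$); the function names are hypothetical and are not taken from the thesis code.

```python
def communication_distance(k, num_levels):
    """Distance d = 2**(M - k) between neighboring active points on level k.

    Levels are numbered from 1 (coarsest) to M = num_levels (finest).
    """
    if not 1 <= k <= num_levels:
        raise ValueError("level k must satisfy 1 <= k <= M")
    return 2 ** (num_levels - k)


def grid_spacing(k, num_levels, h_fine):
    """Uniform-grid spacing h_k = d * h_M on level k."""
    return communication_distance(k, num_levels) * h_fine


if __name__ == "__main__":
    M = 5
    for k in range(M, 0, -1):
        print(k, communication_distance(k, M), grid_spacing(k, M, 1.0 / 64))
```

The distance doubles with every coarser level, which is why a data structure shared by all levels leads to longer communication paths on the coarse grids.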
Figure 8.5. Standard coarsening grid data structure layout for the finest (left), coarse (center), and coarsest (right) grid levels, where each plotted symbol represents an active processor (grid point), and panels (a), (b), and (c) represent three different data structures.
Most of the computations in the multigrid algorithm are performed using neighboring grid points, which requires communications between the active grid points. The distance of the communications between nearest neighbor grid points is given by equation 8.1. This formula is valid in the direction of the coarsening for both standard and semicoarsening in two or three dimensions.

A possible disadvantage of all the grid levels sharing the same data structure is that the distance of the communication increases with the coarseness of the grid level. The increase in communication distance may cause slower data transfer rates. A further disadvantage for the standard coarsening method is that most of the CMSSL routines cannot be used on coarser grids because the communication distance is greater than one. This disadvantage can be overcome if we introduce data structure transformation routines to convert data from our data structures to ones on which the CMSSL can operate. The CMSSL routines are needed for the LU decomposition and to perform tridiagonal line solves, which are used in the smoother and the direct solver respectively. The conversion of the data structures will of course increase the execution time. If we keep the same data structure and do not want to use conversion routines, we are then forced to abandon the CMSSL and write our own routines. Writing our own LU decomposition solver and tridiagonal solvers for the CM5 is not a trivial task, and it will be impossible to obtain efficiency anywhere near that of the CMSSL, which is written in assembly language, by writing in CMFortran alone.

Another choice for the grid data structures is to keep the grid communication distance between nearest neighbor grid points equal to one. There are two data structures that can be used to accomplish this goal. The first way is to use the same data structure for all grid levels but to use for computation only a compact subgrid of the fine grid level corresponding to the coarse grid. The other points in the data structure are unused; see figure 8.5(b). The second way is to have each coarse grid
have its own compact data structure of just the right size; see figure 8.5(c). In each case, the communications for the computations on a grid level are all nearest neighbor communications, which are the fastest possible communications. However, there are disadvantages with these two data structuring schemes. The grid transfer operations are now complicated and require the use of general communications routines, which are the slowest type of communication between processors.

Clearly the choice of data structures will have a different impact on each of the various multigrid components. For this reason, we examine the multigrid components separately and then discuss the choice of tradeoffs.

8.7 Coarse Grid Operators

The discussion above about data structures covers most of the options for which data structures are reasonable for the coarse grid operators. However, there are a few more pitfalls that should be considered. Besides communications, the choice of data structures on the CM computers can have a large effect on performance when temporary variables are created to hold intermediate data from computations and when data is passed between routines.

Temporary variables can affect the performance in two ways: size and alignment. The storage for a temporary variable is the same as that of the largest data structure involved in the computation. Complex computations often require several temporary variables. If there are any communications in a computation, then it is almost always the case that a temporary variable is created to hold the data from the communication to be used in the computation. The alignment of a temporary variable is governed by the data structures involved in the computation. The choices that the runtime system makes can sometimes cause a slight misalignment of the data and slow down performance by introducing communication. It sounds much worse than it
really is because these communications have always been found to be between virtual processors or between the vector units on a single processor node, and the overhead is usually negligible if the subgrid size is relatively small.

When data are passed between routines, temporary variables may be created into which the data are copied. This creation happens when the called routine is only using a subset of the data structure from the calling routine. However, the creation of temporary variables has also occasionally been observed on the CM2 and CM200 when the entire data structure has been passed from the calling routine to the called routine; it has not yet been observed on the CM5.

There is an overhead cost in time that is associated with the creation and use of temporary variables by the runtime system. These costs can sometimes be cut if the implementation already uses temporary variables, but it is best if the implementation can minimize the need for temporary variables altogether.

8.8 Grid Transfer Operators

The grid transfer operations involve mostly communication of data between two grid levels. It is therefore important to minimize the amount of data being transferred and to use the most efficient type of communication that we can. However, the time spent in one V(1,1)-cycle on the vector computers (Cray YMP) performing grid transfers was only about 25 percent of that spent on smoothing when alternating line relaxation was used. We expect to see the same kind of relationship between grid transfers and smoothing on the CM5, and the percentage may even drop because the smoother will usually involve more communication than the grid transfers. However, the smoother and grid transfer routines may use different types of communication, affecting the percentage of time spent in each routine.
8.9 Smoothers

There are several relaxation methods available for smoothing: point, x-line, y-line, and alternating line Gauss-Seidel relaxation using multicolor ordering. By multicolor ordering we mean that either red/black or 4-color ordering is used. The active grid points are a power of two distance apart when multicoloring is used, which means that it is still possible to use efficient communication between processors. Regardless of which data structure layout is used, the multicolor ordering in the smoother will always use communications that are either a power of two apart or nearest neighbor.

Recall the comments from section 8.6 about the CMSSL. If we choose to use the grid data structure layout in figure 8.5(a), we cannot use the CMSSL tridiagonal solvers for the line solves without using data structure conversion routines. However, the semicoarsening method can use the CMSSL tridiagonal solver because the line solves are not in the direction of the coarsening.

There is also the incomplete line LU (ILLU) iterative method, used in the vector code, to consider. However, the ILLU method is not parallelizable in its current form, and for this reason we chose not to implement it on the CM5. Many researchers are working on developing a parallel ILU solver, but we are not aware of any algorithms or efforts to develop the more robust ILLU method on parallel computers.

8.9.1 Parallel Line Gauss-Seidel Relaxation

The CMSSL provides a tridiagonal solver that comes in two forms: one that performs the entire solution process and the other that splits the process up into a call to the LU factorization routine and another call to the LU solution routine. We can use the factorization routine and save the LU decompositions between smoothing steps, but that would
mean saving the factors for every grid level. Saving all the LU decompositions would be costly because the CMSSL also allocates its own temporary work space for each decomposition to be used in the solution phase. We only save about 30% on the execution time for a V-cycle by saving the LU decomposition, but it takes about six times the storage required when not saving the LU decompositions.

The CMSSL gentridiagsolve routine can be used to solve both X and Y lines by just changing the vectoraxis parameter to point to the array axis that the diagonal elements lie on. All of the X (Y) lines of a single color can be solved in parallel. The zebra line Gauss-Seidel relaxation will take two tridiagonal line solve times, one for each color. The alternating zebra line Gauss-Seidel relaxation will take a total of four tridiagonal line solve times.

Since the CMSSL tridiagonal solver overwrites both the coefficient and right hand side arrays with the LU decomposition and solution, respectively, we need to copy the data into temporary work space arrays before calling the solver. Once again extra temporary storage is needed, but it is only allocated for the duration of the smoothing step. This temporary storage space is not saved for each grid, which could easily fill up memory and reduce the size of problems that can be solved, but is instead reused on each grid level.

8.9.2 CM5 Tridiagonal Line Solver Using Cyclic Reduction

The tridiagonal systems from the line relaxation on the vector computers were solved by Gaussian elimination, with vectorization taking place by solving all the lines of one color simultaneously. We could also use this approach to obtain a parallel tridiagonal line solver by solving all the lines of a single color in parallel. However, this approach still leaves each line to be solved sequentially, and we can do better than that by using cyclic reduction. The cyclic reduction algorithm is an example of a divide and conquer method.
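For contrast with cyclic reduction, the line-by-line elimination approach mentioned above can be sketched as follows. This is a minimal Python illustration, not the thesis code: it assumes each line is a diagonally dominant tridiagonal system stored as plain lists, and the name `solve_lines_thomas` is hypothetical. The outer loops run sequentially along a line, while the inner loop over lines carries no dependencies and is the part that vectorizes or parallelizes.

```python
def solve_lines_thomas(a, b, c, f):
    """Solve many independent tridiagonal systems by Gaussian elimination.

    a, b, c, f are lists of lines; line l has sub-diagonal a[l], diagonal b[l],
    super-diagonal c[l], and right hand side f[l], each of length n.
    a[l][0] and c[l][n-1] are not used.
    """
    nlines, n = len(b), len(b[0])
    cp = [[0.0] * n for _ in range(nlines)]
    fp = [[0.0] * n for _ in range(nlines)]
    for l in range(nlines):
        cp[l][0] = c[l][0] / b[l][0]
        fp[l][0] = f[l][0] / b[l][0]
    # Forward elimination: sequential in i, independent across lines l.
    for i in range(1, n):
        for l in range(nlines):
            denom = b[l][i] - a[l][i] * cp[l][i - 1]
            cp[l][i] = c[l][i] / denom
            fp[l][i] = (f[l][i] - a[l][i] * fp[l][i - 1]) / denom
    # Back substitution: again sequential in i, independent across lines.
    u = [[0.0] * n for _ in range(nlines)]
    for l in range(nlines):
        u[l][n - 1] = fp[l][n - 1]
    for i in range(n - 2, -1, -1):
        for l in range(nlines):
            u[l][i] = fp[l][i] - cp[l][i] * u[l][i + 1]
    return u
```

Each line still costs a number of sequential steps proportional to its length, which is the limitation that cyclic reduction addresses.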
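A tridiagonal system of irreducible linear equations $LU = F$, where $L$ is of dimension $N = 2^n - 1$, can be represented as the matrix equation

\[
LU =
\begin{pmatrix}
b_1 & c_1 & & & \\
a_2 & b_2 & c_2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & a_{N-1} & b_{N-1} & c_{N-1} \\
 & & & a_N & b_N
\end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{N-1} \\ u_N \end{pmatrix}
=
\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_{N-1} \\ f_N \end{pmatrix}
= F. \tag{8.2}
\]

The basic idea is to solve for $u_i$ in terms of $u_{i-1}$ and $u_{i+1}$, provided that $b_i \neq 0$. We do this for all odd $i$ equations and substitute the expressions for $u_i$ into the remaining equations. The result is a tridiagonal system of equations in $\lfloor N/2 \rfloor$ variables. The procedure is applied recursively until only one equation remains. The single equation is then solved, and the other variables are obtained through back substitution.

To simplify a more detailed description, let $u_0 = u_{N+1} = 0$ and let the subscripts represent the equation numbers and the superscripts denote the reduction and back substitution steps. Let $a_i^1 = a_i$, $b_i^1 = b_i$, $c_i^1 = c_i$, and $f_i^1 = f_i$; then the reduction step, which combines equation $i$ with its current neighbors at $i \pm \alpha/2$, is given by

\begin{align}
\alpha &= 2^k \tag{8.3} \\
\beta_i &= -\frac{a_i^k}{b_{i-\alpha/2}^k} \tag{8.4} \\
\gamma_i &= -\frac{c_i^k}{b_{i+\alpha/2}^k} \tag{8.5} \\
a_i^{k+1} &= \beta_i\, a_{i-\alpha/2}^k \tag{8.6} \\
c_i^{k+1} &= \gamma_i\, c_{i+\alpha/2}^k \tag{8.7} \\
b_i^{k+1} &= b_i^k + \beta_i\, c_{i-\alpha/2}^k + \gamma_i\, a_{i+\alpha/2}^k \tag{8.8} \\
f_i^{k+1} &= f_i^k + \beta_i\, f_{i-\alpha/2}^k + \gamma_i\, f_{i+\alpha/2}^k \tag{8.9}
\end{align}

where $i = \alpha, 2\alpha, 3\alpha, \ldots, (2^{n-k}-1)\alpha$, for the reduction steps $k = 1, 2, \ldots, n-1$. After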
the $n-1$ reduction steps we are left with one equation, which when solved is

\[ u_{2^{n-1}} = \frac{f_{2^{n-1}}^{\,n}}{b_{2^{n-1}}^{\,n}}. \tag{8.10} \]

The back substitution is given by

\[ u_i = \frac{f_i^{k+1} - a_i^{k+1} u_{i-\alpha} - c_i^{k+1} u_{i+\alpha}}{b_i^{k+1}}, \tag{8.11} \]

where $\alpha = 2^k$, $i = \alpha, 3\alpha, 5\alpha, \ldots, (2^{n-k}-1)\alpha$, and $k = n-2, n-3, \ldots, 0$.

The cyclic reduction algorithm just described derives its parallelism by performing all the $i$ indexed equations simultaneously for each reduction step $k$. However, the number of processors needed at each reduction step is half that of the previous step.

Notice that the reduction process can be written to yield any of the unknowns. If we write a set of reduction steps to yield each unknown, then after performing $n-1$ reductions, the set of "single" equations can be solved simultaneously using an equation similar to (8.10) to give the solution. This method no longer requires the back substitution step, and it also keeps all the processors busy at all the steps, giving a method that is twice as fast as the original cyclic reduction algorithm. This version of cyclic reduction can be found in [46], where it is referred to as the PARACR algorithm, and it is one of the tridiagonal solvers implemented in the CMSSL.

If the VP ratio is much greater than one, then the subgrids on each processor are large and the PARACR algorithm becomes inefficient. A better algorithm is to use block cyclic reduction [59] [49], which is also in the CMSSL. The block cyclic reduction performs the LU decomposition sequentially on each processor and cyclic reduction over the processors.
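The two variants described above can be sketched serially in Python. This is an illustrative sketch under stated assumptions, not the CMSSL implementation: the function names are hypothetical, the systems are assumed diagonally dominant so that no pivoting is needed, and `cyclic_reduction` requires $N = 2^n - 1$ and combines each retained equation with its current neighbors at half the stride of the retained equations. The PARACR-style routine here simply keeps reducing until every equation is decoupled, rather than stopping after $n-1$ steps.

```python
def cyclic_reduction(a, b, c, f):
    """Cyclic reduction with back substitution for N = 2**n - 1 unknowns.

    Solves a[i]*u[i-1] + b[i]*u[i] + c[i]*u[i+1] = f[i] with u outside the
    range taken as zero. Inputs are 0-based lists of length N.
    """
    N = len(b)
    n = (N + 1).bit_length() - 1
    if 2 ** n - 1 != N:
        raise ValueError("N must be of the form 2**n - 1")
    # 1-based working copies padded at 0 and N + 1 (u_0 = u_{N+1} = 0).
    A = [0.0] + list(a) + [0.0]
    B = [1.0] + list(b) + [1.0]
    C = [0.0] + list(c) + [0.0]
    F = [0.0] + list(f) + [0.0]
    U = [0.0] * (N + 2)
    for k in range(1, n):                      # reduction steps
        alpha = 2 ** k
        h = alpha // 2                         # distance to current neighbors
        for i in range(alpha, N + 1, alpha):   # independent for each i
            beta = -A[i] / B[i - h]
            gamma = -C[i] / B[i + h]
            B[i] += beta * C[i - h] + gamma * A[i + h]
            F[i] += beta * F[i - h] + gamma * F[i + h]
            A[i], C[i] = beta * A[i - h], gamma * C[i + h]
    mid = 2 ** (n - 1)
    U[mid] = F[mid] / B[mid]                   # the single remaining equation
    for k in range(n - 2, -1, -1):             # back substitution
        alpha = 2 ** k
        for i in range(alpha, N + 1, 2 * alpha):
            U[i] = (F[i] - A[i] * U[i - alpha] - C[i] * U[i + alpha]) / B[i]
    return U[1:N + 1]


def paracr(a, b, c, f):
    """PARACR-style reduction: every equation is reduced at every step."""
    N = len(b)
    a, b, c, f = list(a), list(b), list(c), list(f)
    a[0] = 0.0
    c[N - 1] = 0.0
    h = 1
    while h < N:
        na, nb, nc, nf = [0.0] * N, list(b), [0.0] * N, list(f)
        for i in range(N):                     # all i could run in parallel
            lo, hi = i - h, i + h
            if lo >= 0:
                beta = -a[i] / b[lo]
                na[i] = beta * a[lo]
                nb[i] += beta * c[lo]
                nf[i] += beta * f[lo]
            if hi < N:
                gamma = -c[i] / b[hi]
                nc[i] = gamma * c[hi]
                nb[i] += gamma * a[hi]
                nf[i] += gamma * f[hi]
        a, b, c, f = na, nb, nc, nf
        h *= 2
    return [f[i] / b[i] for i in range(N)]
```

In the first routine the number of active equations halves at every step, whereas the second keeps all $N$ equations active and needs no back substitution, which mirrors the trade-off discussed in the text.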
8.10 Coarsest Grid Solver

One of the advantages of the standard coarsening multigrid algorithm is that each coarser grid level takes only one fourth the amount of work of the previous grid level until the number of coarse grid points becomes smaller than one grid point per processor; thereafter the time remains constant for the computations on a grid level.

The coarsest grid solver is a direct solver that uses an LU decomposition. The direct solver is slow on the CM5 due to its sequential nature, and its cost depends on the number of grid points on the coarsest grid level. It is therefore important to make the coarsest grid level size small, so that the direct solver time is approximately equal to the smoother time on the grid level with one grid point per processor.

A banded sparse matrix direct solver does not exist in the CMSSL, even though the documentation states that CMSSL routines exist which are equivalent to those of Linpack and Lapack for solving general banded systems. In fact, the CMSSL provides routines that solve only tridiagonal and pentadiagonal banded systems.

Another common choice for the coarsest grid solver is to use several iterations of one of the relaxation methods. This choice is not very practical on the CM5 because of the constant time per iteration per grid level when there are fewer grid points than there are processors. Thus the solution time on the coarsest grid is proportional to the number of smoothing steps. The constant time for the coarsest grid level can only be reduced if a cheaper smoother can be found or if the number of coarse grids which have fewer grid points than processors is kept to a minimum. The problem with this approach is that the larger the system to be solved, the worse the reduction factor for the smoother. To add insult to injury, we are already using some of the cheapest and most effective smoothers. Using a relaxation method on the coarsest grid does not seem to help the situation and indicates that the use of a direct solver is probably better.
A third approach is to use standard coarsening for large grids and then switch to semicoarsening for the coarser and coarsest grids. This approach solves a few of our problems because the coarsest grid can now be just one line, which can be solved directly with the CMSSL's tridiagonal solver. However, the new question is when to switch from standard coarsening to semicoarsening. It was decided that the user would specify when the switch took place by setting the coarsest grid input parameter. The optimum value for the switch will be dependent on the finest grid size and the number of processors.

Recall that, when the switch to semicoarsening takes place, coarsening will only happen in the y-direction and that the grid point distribution in the x-direction across the processors will remain fixed. The tridiagonal solve on the coarsest grid level solves one x-line of the size that was fixed when the switch took place, and the tridiagonal solution time is dependent on that size. The smaller the x-dimension size is when the switch takes place, the faster the tridiagonal solution. It therefore becomes necessary to balance the constant time spent performing standard coarsening multigrid, the constant time spent performing semicoarsening multigrid, and the tridiagonal solution time. A parametric study to determine the optimum values has not been done and is not planned. However, from experience a reasonable choice for the switch is when the number of x-direction points is about one quarter the number of processors or less.

Another practical choice, from the programmer's perspective, is to switch from standard coarsening to semicoarsening when the VP ratio becomes less than one after coarsening. This choice is convenient and keeps the maximum number of processors active. This choice would seem to be very good, but the semicoarsening performance compared to standard coarsening performance when VP ≫ 1 is highly dependent on the efficiency of the tridiagonal line solver used by the smoother. However, we now have the advantage of being able to use the CMSSL to perform the tridiagonal line solves.
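The two switching rules of thumb above can be made concrete with a small Python sketch. Everything in it is an assumption made for illustration: the grid is taken to be square, a coarse grid is assumed to keep every other point starting from the first, the VP ratio is taken as grid points divided by a plain processor count, and the function names are hypothetical.

```python
def vp_ratios(n, num_procs):
    """(grid size, VP ratio) for each level of an n x n grid under
    standard coarsening, from the finest level down to a single point."""
    levels = []
    while True:
        levels.append((n, n * n / num_procs))
        if n == 1:
            break
        n = (n + 1) // 2          # keep every other point in each direction
    return levels


def switch_by_vp(n, num_procs):
    """Finest grid size whose VP ratio has dropped below one."""
    for size, vp in vp_ratios(n, num_procs):
        if vp < 1.0:
            return size
    return None


def switch_by_x_points(n, num_procs):
    """Finest grid size with at most num_procs / 4 points in x."""
    for size, _ in vp_ratios(n, num_procs):
        if 4 * size <= num_procs:
            return size
    return None


if __name__ == "__main__":
    for size, vp in vp_ratios(1024, 512):
        print(size, vp)
    print(switch_by_vp(1024, 512), switch_by_x_points(1024, 512))
```

Under these assumptions the two rules can select quite different levels for the same grid and partition, which is consistent with the remark that the optimum depends on both the finest grid size and the number of processors.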
If the coarsening stops before we reach a single grid line or equation, we are left with a sparse banded system of equations to solve. The CMSSL does not provide a solver to handle this case except for the dense linear system solvers. Using the dense system solvers means that we would have to allocate more storage and copy the banded system's data into it at the cost of general communication, the most expensive kind. The dense system solver will also perform many unneeded computations and communications that involve matrix entries outside the banded area. All things considered, the use of the dense system solvers is very inefficient.

The only solution left for us is to write our own system solver for banded systems. While this may sound attractive, it is not. Once again we are met with the challenge of trying to write a solver that is both efficient and competitive. It is very difficult to write codes in high level languages that can compete with the CMSSL. To make matters worse, the parallelism in Gaussian elimination is at best modest for a banded system, depending on the length of the band. However, if we are willing to sacrifice some memory, we can store the LU factors to save some execution time for the solution of the coarsest grid when several multigrid iterations are performed.

For a dense system of equations the best performance is obtained by using block cyclic ordering of the equations; see [59]. However, the performance gain assumes that VP ≫ 1, and a sparse banded system with VP < 1 will actually perform much slower than most of the other methods.

Obtaining an efficient banded system solver requires transferring the data from a grid point oriented data structure to a matrix oriented one, which requires general communication. For efficiency, if we have $N$ unknowns, we will need $N^2$ processors. The computations can then be done in order $N$ operations, but communications will add significantly to the execution time. All things considered, the best we can hope to do for the solver is order $N$ times a constant plus the communications times, which
include the data structure transfers. It should be noted that the constant can be on the order of $N$ when $N$ is small. The performance of an LU direct solver is not very attractive when VP < 1. The most efficient solution, so far, is to switch to the semicoarsening code at some point after VP < 1.

8.11 Miscellaneous Software Issues

An interesting compiler deficiency is that a parameter passed into a subroutine gives poor performance if it is used as a loop control parameter. The way to avoid this deficiency is to copy the parameter's value into a local variable and then use that variable as the loop control parameter. The poor performance might have to do with the fact that the passed variable is usually scalar and is stored in the partition manager's scalar memory, requiring a broadcast communication every time the variable is needed.

8.11.1 Using Scalapack

Scalapack is the same as Lapack but designed for distributed memory parallel computers using the parallel basic linear algebra subprograms (PBLAS) and basic linear algebra communication subprograms (BLACS). Scalapack is available on the CM5 using CMPVM (parallel virtual machine) under the CMMD message passing model. The SPMD data model is not compatible with the CMPVM CMMD model, and for this reason we cannot use Scalapack.

Even if we had compatible programming models, another problem with using Scalapack is that the data structures are all matrix oriented, instead of the grid oriented data structure that we use. Scalapack also assumes that the matrices are distributed to a grid of processors in a 2D block cyclic decomposition. This distribution of data would require that we use costly general communications to copy the data into the block cyclic format.
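To indicate why this copy needs general communication, the following Python sketch shows the usual index map of a block cyclic distribution with zero-based indices and the first block on process 0. The function names, block size, and process grid shape are hypothetical choices for illustration only; they are not taken from Scalapack or from the thesis code.

```python
def block_cyclic_owner(g, block, nprocs):
    """Map a global index g to (process, local index) in one dimension.

    Blocks of `block` consecutive indices are dealt out to the processes
    in round-robin order.
    """
    blk = g // block                      # which block the index falls in
    proc = blk % nprocs                   # blocks are dealt out cyclically
    local = (blk // nprocs) * block + g % block
    return proc, local


def owner_2d(i, j, block, prow, pcol):
    """Owner of matrix entry (i, j) on a prow x pcol process grid:
    the one-dimensional map applied independently to rows and columns."""
    pr, li = block_cyclic_owner(i, block, prow)
    pc, lj = block_cyclic_owner(j, block, pcol)
    return (pr, pc), (li, lj)


if __name__ == "__main__":
    # Neighboring matrix entries can land on different processes.
    for j in range(8):
        print(j, owner_2d(0, j, 2, 2, 3))
```

Entries that are adjacent in a grid oriented layout are scattered over the process grid by this map, so moving the data between the two layouts is an irregular communication pattern rather than a shift.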
To top it all off, Scalapack is still under development, and the two routines that we would need to perform the LU factorization and LU solution, PSGBTRF and PSGBTRS respectively, have not been implemented yet.

8.11.2 PolyShift Communication

The PSHIFT communication routines in the CMSSL are also available on the CM5. The PSHIFT routine achieved the overlapping of communications on the CM2 and CM200 because those computers used a hypercube data communication network. The PSHIFT routine on the CM5 is not very effective because the fat-tree data communication network will not allow as many communications to be overlapped. The PSHIFT setup routine dynamically allocates memory, and when a particular polyshift stencil is no longer needed, the memory should be deallocated.

The PSHIFT routine can perform a maximum of two shifts per array dimension, one in each direction. The number of communications that can be overlapped is limited to at most four, but the size and shape of the stencil are not restricted. If an array has padding in the dimension in which communication is to take place, then the PSHIFT will perform approximately the same as the equivalent calls to CSHIFT and/or EOSHIFT. The best performance is obtained when the subgrid lengths in the communication dimensions are all roughly equivalent. The performance improvement for 2D 9-point stencils is barely noticeable. The PSHIFT routine should be able to do better than it does on the CM5, but first it will have to be optimized for the fat-tree network.

8.12 2D Standard Coarsening Parallel Algorithm

Many parallel algorithms have been tried over the years in an effort to create an efficient parallel black box multigrid code. The code was first developed on the CM2
and then ported and modified for the CM200, and finally a version was created for the CM5. The algorithmic choices presented here are those that were made for the CM5.

8.12.1 Data Structures

The data structures for the grid equations are grid point stencil oriented. There are no fictitious grid equations needed for the boundary as there were in the vector code. The references to neighboring grid points are made through the communications routine EOSHIFT, which gives a zero value for off grid references.

The previous trouble with the data structure layout is solved by the use of dynamic allocation using the Dynamic Memory Management Utilities (DMMU) developed by Bill Spangenberg of Los Alamos National Laboratory in conjunction with Thinking Machines Inc. The DMMU provide a way to allocate arrays dynamically with a given fixed geometry and to use array aliasing to create an array of grid level arrays.

8.12.2 Coarsening

We used standard coarsening, taking every other fine grid point in both coordinate directions to form the coarse grid. The data structures when VP > 1 are of the noncompatible compact type; see figure 8.5(c). When VP ≤ 1, we used a natural grid layout that uses compatible grids as illustrated in figure 8.5(a). The natural grid layout leaves more and more idle processors with every coarser grid level. However, since we wanted to use the CMSSL for the line solves, we switched to semicoarsening a few coarse grid levels after reaching VP ≤ 1.

As a special note, the easiest way to implement the black box multigrid solver for VP ≤ 1 is to use the semicoarsening algorithm. This choice keeps the maximum number of processors busy and allows the direct use of the CMSSL tridiagonal solver.
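The stencil oriented storage and the role of the end-off shift can be sketched in Python as follows. This is only an illustration: the helper `eoshift` imitates the zero-fill behavior of the Fortran 90 EOSHIFT intrinsic on nested lists, a 5-point stencil is used for brevity, and all names and the one-array-per-direction coefficient layout are assumptions rather than the thesis data structures.

```python
def eoshift(grid, shift, axis):
    """End-off shift of a 2D list.

    result[i][j] = grid[i + shift][j] for axis 0 (grid[i][j + shift] for
    axis 1); references that fall off the grid yield 0.0.
    """
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            si, sj = (i + shift, j) if axis == 0 else (i, j + shift)
            if 0 <= si < rows and 0 <= sj < cols:
                out[i][j] = grid[si][sj]
    return out


def apply_5pt(center, west, east, south, north, u):
    """Apply a 5-point stencil stored as one coefficient grid per direction.

    Because off-grid neighbors come back as zero, no fictitious boundary
    equations are needed.
    """
    uw, ue = eoshift(u, -1, 1), eoshift(u, +1, 1)
    us, un = eoshift(u, -1, 0), eoshift(u, +1, 0)
    rows, cols = len(u), len(u[0])
    return [[center[i][j] * u[i][j]
             + west[i][j] * uw[i][j] + east[i][j] * ue[i][j]
             + south[i][j] * us[i][j] + north[i][j] * un[i][j]
             for j in range(cols)] for i in range(rows)]


def standard_coarsen(grid):
    """Keep every other point in both coordinate directions."""
    return [row[::2] for row in grid[::2]]


if __name__ == "__main__":
    n = 5
    ones = [[1.0] * n for _ in range(n)]
    four = [[4.0] * n for _ in range(n)]
    minus = [[-1.0] * n for _ in range(n)]
    r = apply_5pt(four, minus, minus, minus, minus, ones)
    print(r[2][2], r[0][0], len(standard_coarsen(r)))
```

Every shifted array in `apply_5pt` is a full-size temporary, which also illustrates why stencil computations of this form create many temporary variables.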
8.12.3 Smoothers

We have implemented the multicolor ordering point, line, and alternating line Gauss-Seidel methods. The ILLU method was not implemented for the reasons given in section 8.9.

However, as mentioned before, we can use the CMSSL tridiagonal solver when VP > 1 and also if the semicoarsening algorithm is used on all grid levels when VP < 1. When VP < 1 and the semicoarsening algorithm is not used, we will end up with noncontiguous active data in the lines to be solved, preventing us from using the CMSSL. We tried implementing data structure transformation routines, but they were found, as should be expected, to be clumsy and inefficient. We also implemented our own parallel tridiagonal solvers, but they were not very competitive, being about twice as slow as the CMSSL routine.

8.12.4 Coarsest Grid Solver

We tried a direct solver using LU factorization, but it turned out to be hard to implement and slow in its general form, unless the coarsest grid was always of a given fixed size. Instead, we chose to use the semicoarsening algorithm, in which case only a tridiagonal solver was needed. So now, the coarsening continues until only one line is left to solve, and that can be done by using the same tridiagonal solver that was used for the line solves of the smoother.

8.12.5 Grid Transfer Operators

Three of the grid transfer operators discussed in chapter 3 were implemented. They are the ones discussed in sections 3.5.1, 3.5.3, and 3.6.1.

The two collapsing type methods were readily parallelizable and easily implemented. The computation of the grid transfer operator coefficients created a lot of temporary variables. It was difficult to find a good implementation that did not use too many temporaries and that could avoid having the compiler generate too many and fill up the available memory.
The grid transfer operators based on extensions to Schaffer's ideas were also parallelizable, but they depended on the availability of tridiagonal line solvers. The time to compute all the operators is also longer than for the collapsing methods because of the line solves.

8.12.6 Coarse Grid Operators

The coarse grid operators are formed using the Galerkin coarse grid approximation with the grid transfer operators.

8.13 2D SemiCoarsening Parallel Algorithm

The semicoarsening code was originally implemented by Joel E. Dendy Jr., Michael Ida, and Jeff Rutledge on the CM200. A better implementation was done on the CM5 by Bill Spangenberg, who wrote the Dynamic Memory Management Utilities (DMMU). It is still possible to obtain an even better implementation of the semicoarsening code on the CM5, but this improvement has not been done because the code cannot be placed (at least not in the near future) in the public domain since it uses the proprietary DMMU.

8.13.1 Data Structures

The data structures are grid point stencil oriented, with a different array data structure for the coefficients, the unknowns, and the right hand side.

8.13.2 Coarsening

Semicoarsening in the y-direction was used, taking every other fine grid point in the y-direction to form the coarse grid. Noncompatible grid data structures were used when VP > 1, and a compatible grid data structure otherwise, as was the case for the standard coarsening parallel code.
8.13.3 Smoothers

Red/black x-line Gauss-Seidel relaxation is used for the smoother. The CMSSL tridiagonal solver using cyclic reduction was used to solve the lines. A better implementation for the line solves exists if block cyclic reduction is used when the finest grid level has VP ≫ 1, since the subgrid size per processor will be large, and it makes more sense to use the vector units more efficiently by using sequential cyclic reduction on the subgrids of each processor.

8.13.4 Coarsest Grid Solver

The coarsening takes place until only one x grid line remains, and then the CMSSL tridiagonal solver is called to solve it exactly.

8.13.5 Grid Transfer Operators

The grid transfer operator is the one used in section 3.6.1, applied in only the y-direction. The CMSSL tridiagonal solver was also used.

8.13.6 Coarse Grid Operators

The coarse grid operators are formed using the Galerkin coarse grid approximation with the grid transfer operators.

8.14 2D Parallel Timings

In the following tables we have reported both busy (B) and idle (I) times. Busy time is the execution time for the parallel processing nodes, while idle time is the sequential execution time plus the time to perform all communications. We are reporting times for various time shared partitions of the CM5. The partitions are identified by the number of processing nodes (PN), namely 32, 64, 128, 256, and 512 processing nodes. The CM5 has a full configuration of 1024 processing nodes, but the full partition is not available under our time sharing system. The tables report timings, in seconds, for the average time of five runs for either the setup time or the average of
five V-cycles.

The standard coarsening timings are given in tables 8.3 and 8.4 for one V(1,1)-cycle and the setup, respectively. We see the effects of the parallel overhead in the tables for small grid sizes and large partitions. For a given partition we do not see the almost perfect scaling that is seen with the Cray YMP; a problem that has four times the number of unknowns takes far less than four times the time. Nor do we see perfect scaleup with the number of processors; for the 1024 x 1024 case, the 128 processor partition takes about half the time of the 32 processor partition, and the 512 processor partition takes two-thirds the time of the 128 processor partition.

We can also look at the parallel efficiency by examining the data for busy and idle times. The parallel efficiency for the standard coarsening algorithm is given in table 8.5. Note that the highest parallel efficiency is given for the largest grid size problem on the smallest number of processors. This should be expected since that combination produces the largest subgrid size per processor, which will be processed serially on each processor, keeping all the processors busy until the calculation is completed. We still see that the parallel efficiency ranges from 63 to 88, where the higher efficiencies are given for the larger grid sizes.

Tables 8.6 and 8.7 give timings for the semicoarsening algorithm for the setup and one V(1,1)-cycle, respectively, for a range of grid sizes and processing node partitions. The parallel timings for the semicoarsening algorithm show that we, again, do not have perfect scaling with the problem size, nor do we have perfect scaleup with the number of processors. For the 1024 x 1024 case, the 128 processor partition takes about half the time of the 32 processor partition, and the 512 processor partition takes about two-thirds the time of the 128 processor partition. The parallel efficiency for the semicoarsening algorithm is given in table 8.8.
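The text does not state how the parallel efficiency is computed from the busy and idle times. The Python sketch below assumes the ratio busy / (busy + idle), expressed as a percentage; this is an assumption, although it does reproduce the three table 8.5 entries checked here from the corresponding table 8.3 timings. The function name is hypothetical.

```python
def parallel_efficiency(busy, idle):
    """Assumed definition: percentage of the total time that is busy time."""
    return 100.0 * busy / (busy + idle)


# (grid size N, processing nodes): (busy, idle) in seconds, from table 8.3.
samples = {
    (8, 32): (1.844e-1, 9.060e-2),
    (1024, 32): (2.474e+0, 3.240e-1),
    (1024, 512): (9.722e-1, 2.860e-1),
}

if __name__ == "__main__":
    for (n, pn), (busy, idle) in sorted(samples.items()):
        print(n, pn, round(parallel_efficiency(busy, idle)))
```

For these three samples the rounded values are 67, 88, and 77 percent, matching the entries of table 8.5 for the same grid sizes and partitions.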
Table 8.3. Timings, in seconds, for the standard coarsening code performing one V(1,1)-cycle with zebra alternating line Gauss-Seidel on 32, 64, 128, 256, and 512 processing nodes of the CM-5, where the size N means an N x N fine grid. Busy (B) and Idle (I) refer to the parallel and communication/sequential time respectively.

Size         32 PN      64 PN      128 PN     256 PN     512 PN
   8   I   9.060E-2   1.113E-1   9.098E-2   1.039E-1   1.071E-1
       B   1.844E-1   1.881E-1   1.892E-1   1.923E-1   1.951E-1
  16   I   1.050E-1   1.316E-1   1.232E-1   1.206E-1   1.218E-1
       B   2.378E-1   2.778E-1   2.816E-1   2.844E-1   2.898E-1
  32   I   1.508E-1   1.348E-1   1.406E-1   1.752E-1   1.690E-1
       B   2.902E-1   3.314E-1   3.458E-1   3.934E-1   4.004E-1
  64   I   1.756E-1   2.090E-1   1.636E-1   1.828E-1   1.862E-1
       B   3.558E-1   3.948E-1   4.076E-1   4.558E-1   4.814E-1
 128   I   1.794E-1   1.962E-1   1.828E-1   2.118E-1   2.180E-1
       B   4.520E-1   4.828E-1   4.774E-1   5.276E-1   5.510E-1
 256   I   2.346E-1   2.068E-1   2.060E-1   2.420E-1   2.442E-1
       B   6.374E-1   6.202E-1   5.858E-1   6.286E-1   6.262E-1
 512   I   2.152E-1   2.300E-1   2.542E-1   2.604E-1   2.830E-1
       B   1.092E+0   9.188E-1   7.912E-1   7.822E-1   7.536E-1
1024   I   3.240E-1   2.574E-1   2.676E-1   2.950E-1   2.860E-1
       B   2.474E+0   1.731E+0   1.265E+0   1.132E+0   9.722E-1
Table 8.4. Timings, in seconds, for the setup phase of the standard coarsening code with zebra alternating line Gauss-Seidel on 32, 64, 128, 256, and 512 processing nodes of the CM-5, where the size N means an N x N fine grid. Busy (B) and Idle (I) refer to the parallel and communication/sequential time respectively.

Size         32 PN      64 PN      128 PN     256 PN     512 PN
   8   I   2.581E-1   4.332E-1   2.776E-1   4.141E-1   3.877E-1
       B   3.454E-1   3.519E-1   3.530E-1   3.581E-1   3.627E-1
  16   I   2.620E-1   4.833E-1   3.756E-1   3.067E-1   3.051E-1
       B   4.880E-1   5.507E-1   5.556E-1   5.604E-1   5.710E-1
  32   I   5.020E-1   3.419E-1   3.583E-1   5.578E-1   6.335E-1
       B   6.110E-1   6.851E-1   7.167E-1   8.052E-1   8.195E-1
  64   I   5.616E-1   6.272E-1   4.143E-1   4.806E-1   4.620E-1
       B   7.704E-1   8.378E-1   8.677E-1   9.574E-1   1.016E+0
 128   I   5.480E-1   6.920E-1   5.000E-1   6.440E-1   5.590E-1
       B   1.010E+0   1.044E+0   1.038E+0   1.135E+0   1.188E+0
 256   I   7.080E-1   5.750E-1   5.620E-1   6.490E-1   6.340E-1
       B   1.450E+0   1.371E+0   1.293E+0   1.363E+0   1.372E+0
 512   I   6.420E-1   6.140E-1   6.500E-1   6.930E-1   7.070E-1
       B   2.584E+0   2.096E+0   1.786E+0   1.728E+0   1.671E+0
1024   I   1.198E+0   8.730E-1   8.420E-1   9.520E-1   8.090E-1
       B   6.046E+0   4.113E+0   2.957E+0   2.528E+0   2.176E+0

Table 8.5. Parallel efficiency for the standard coarsening V(1,1)-cycle using zebra alternating line Gauss-Seidel for the CM-5 with 32, 64, 128, 256, and 512 nodes. The results are given in percentages and N means an N x N grid.

Size N   32 PN   64 PN   128 PN   256 PN   512 PN
     8     67      63      68       65       65
    16     69      68      70       70       65
    32     66      71      71       69       70
    64     56      58      62       71       72
   128     52      58      67       71       72
   256     73      75      82       72       72
   512     84      80      76       75       73
  1024     88      87      83       79       77
Table 8.6. Timings, in seconds, for the semi-coarsening code performing one V(1,1)-cycle on 32, 64, 128, 256, and 512 processing nodes of the CM-5, where the size N means an N x N fine grid. Busy (B) and Idle (I) refer to the parallel and communication/sequential time respectively.

Size         32 PN      64 PN      128 PN     256 PN     512 PN
   8   I   4.042E-2   5.116E-2   6.814E-2   5.444E-2   4.465E-2
       B   7.463E-2   7.462E-2   7.686E-2   7.768E-2   6.363E-2
  16   I   5.036E-2   5.986E-2   5.830E-2   5.378E-2   5.758E-2
       B   1.104E-1   1.125E-1   1.145E-1   1.175E-1   1.191E-1
  32   I   6.722E-2   6.990E-2   6.714E-2   7.658E-2   8.664E-2
       B   1.408E-1   1.525E-1   1.587E-1   1.631E-1   1.667E-1
  64   I   7.418E-2   1.197E-1   6.684E-2   8.356E-2   9.090E-2
       B   1.912E-1   1.912E-1   1.940E-1   2.090E-1   2.182E-1
 128   I   8.604E-2   9.196E-2   9.264E-2   9.846E-2   1.075E-1
       B   2.839E-1   2.582E-1   2.537E-1   2.580E-1   2.633E-1
 256   I   9.992E-2   1.083E-1   1.022E-1   1.095E-1   1.203E-1
       B   4.776E-1   3.793E-1   3.601E-1   3.321E-1   3.309E-1
 512   I   1.058E-1   1.362E-1   1.337E-1   1.203E-1   1.330E-1
       B   9.796E-1   6.700E-1   5.779E-1   4.662E-1   4.465E-1
1024   I   1.332E-1   1.510E-1   1.244E-1   1.343E-1   1.490E-1
       B   2.653E+0   1.405E+0   1.102E+0   7.685E-1   6.932E-1
Table 8.7. Timings, in seconds, for the setup phase of the semi-coarsening code on 32, 64, 128, 256, and 512 processing nodes of the CM-5, where the size N means an N x N fine grid. Busy (B) and Idle (I) refer to the parallel and communication/sequential time respectively.

Size         32 PN      64 PN      128 PN     256 PN     512 PN
   8   I   3.899E-2   2.441E-1   1.271E-1   2.317E-1   1.976E-1
       B   7.561E-2   7.713E-2   7.705E-2   7.773E-2   7.874E-2
  16   I   5.260E-2   2.254E-1   1.156E-1   5.630E-2   5.710E-1
       B   1.183E-1   1.173E-1   1.170E-1   1.204E-1   1.210E-1
  32   I   6.940E-1   1.187E-1   7.480E-2   2.457E-1   3.146E-1
       B   1.552E-1   1.646E-1   1.691E-1   1.679E-1   1.709E-1
  64   I   7.550E-2   2.485E-1   1.427E-1   8.800E-2   9.340E-2
       B   2.066E-1   2.060E-1   2.113E-1   2.255E-1   2.321E-1
 128   I   1.612E-1   1.393E-1   1.441E-1   1.013E-1   1.125E-1
       B   3.004E-1   2.717E-1   2.693E-1   2.733E-1   2.817E-1
 256   I   1.530E-1   1.560E-1   1.557E-1   1.154E-1   1.276E-1
       B   5.006E-1   3.989E-1   3.811E-1   3.515E-1   3.502E-1
 512   I   1.480E-1   1.623E-1   1.640E-1   1.274E-1   1.394E-1
       B   1.025E+0   6.860E-1   6.029E-1   4.856E-1   4.650E-1
1024   I   1.610E-1   1.920E-1   1.710E-1   1.436E-1   1.570E-1
       B   2.463E+0   1.476E+0   1.155E+0   8.036E-1   7.199E-1

Table 8.8. Parallel efficiency for the semi-coarsening V(1,1)-cycle for the CM-5 with 32, 64, 128, 256, and 512 processing nodes. The results are given in percentages and N means an N x N grid.

Size N   32 PN   64 PN   128 PN   256 PN   512 PN
     8     64      59      53       59       59
    16     67      65      66       69       67
    32     68      69      70       68       66
    64     72      62      74       71       71
   128     77      74      73       72       71
   256     83      78      78       75       73
   512     90      83      81       79       77
  1024     95      90      90       85       82
We again see that the highest parallel efficiency is obtained for the largest grid size on the smallest number of processors, which provides the largest subgrid for each processor to work on. The range of parallel efficiency values is now from 53 to 95 percent. The increase in the values for the semi-coarsening algorithm over the standard coarsening algorithm is to be expected, because the semi-coarsening algorithm keeps more processors busy during the smoother's line solves, especially on the coarser grid levels.

We give a comparison of the standard and semi-coarsening algorithms on the CM-5, Cray Y-MP, and a Sparc5 workstation in table 8.9. The CM-5 timings are the fastest times for a given grid size, and the processing partition on which each time was obtained is given in parentheses. The times in the table are the average time to complete one V(1,1)-cycle, taken over five V-cycles and averaged over five separate runs. The fastest times for a given grid size for the standard coarsening algorithm on the CM-5 are on the 32 processor partition for grid sizes up to 128 x 128, the 128 processor partition for the 256 x 256 grid, and the 512 processor partition for grid sizes greater than or equal to 512 x 512. The semi-coarsening algorithm exhibits similar behavior, where the 32 processor partition is the fastest for grid sizes up to 32 x 32, the 128 processor partition for grid sizes between 64 x 64 and 128 x 128, the 256 processor partition for the 256 x 256 grid, and the 512 processor partition for grid sizes of 512 x 512 and larger.

Comparing the results from table 8.9 shows that vectorization plays an important role on the Cray Y-MP, even for the smallest problems. The Sparc5 times show that the scaling argument holds, but that it is affected by caching issues. We see from the data that the Cray Y-MP is still twice as fast as the 128 processor partition and 30% faster than the 512 processor partition. The Cray Y-MP codes are the fastest, but the CM-5 codes are catching up for large problems when enough processors are
Table 8.9. Timing comparison between the CM-5, Cray Y-MP, and Sparc5 workstation for one V(1,1)-cycle in seconds, where N means an N x N grid. The top entries are for the standard coarsening codes and the bottom entries are for the semi-coarsening codes, and * means that the problem was too big to fit into the available memory. The CM-5 partition size is given in parentheses.

Size   CM-5               Cray Y-MP    Sparc5
   8   2.750E-1 (32)      5.270E-4     3.000E-4
       1.083E-1 (512)     5.156E-4     5.000E-4
  16   3.428E-1 (32)      1.016E-3     1.200E-2
       1.640E-1 (32)      9.365E-4     1.000E-2
  32   4.410E-1 (32)      2.019E-3     5.000E-2
       2.080E-1 (32)      1.896E-3     3.200E-2
  64   5.314E-1 (32)      4.579E-3     1.910E-1
       2.608E-1 (128)     4.435E-3     1.383E-1
 128   6.314E-1 (32)      1.285E-2     7.980E-1
       3.463E-1 (128)     1.325E-2     5.988E-1
 256   7.918E-1 (128)     4.429E-2     3.514E+0
       4.416E-1 (256)     4.320E-2     2.666E+0
 512   1.037E+0 (512)     1.654E-1     1.503E+1
       5.795E-1 (512)     1.576E-1     1.209E+1
1024   1.427E+0 (512)     6.732E-1     *
       9.028E-1 (512)     6.563E-1     *
available. When scaling is applied to take into account the difference in clock speeds and instructions per clock cycle between the Cray Y-MP and the CM-5, the two are nearly identical for the 1024 x 1024 problem, with the Cray Y-MP still holding a very slight edge. This shows that the CM-5 codes not only suffer from the overhead associated with parallelization, but that communication is the main bottleneck to beating the vector codes.
CHAPTER 9
BLACK BOX MULTIGRID IN THREE DIMENSIONS

9.1 Introduction

The development of a three dimensional black box multigrid solver essentially involves just extending the two dimensional version. The three dimensional methods provide the same functionality as the two dimensional black box multigrid methods. The basic multigrid algorithm and the multigrid components are essentially the same, except that the standard coarsening methods need alternating red/black plane relaxation to obtain a robust smoother. In addition, there are several changes in the implementation, especially for the parallel code. The 3D parallel methods use (alternating) red/black plane Gauss-Seidel relaxation, where the required plane solves are performed using a 2D multigrid method that has been modified to solve all the planes of a single color simultaneously.

We will examine both the standard and semi-coarsening black box multigrid algorithms for problems in three dimensions. The examination will include the 3D algorithm implementations on vector (Cray Y-MP) and parallel (CM-5) computers.

The grid operator stencil in three dimensions is now assumed to fit into the 27-point cubic stencil, which is illustrated in figure 9.1. Notice that for each fixed z (xy-plane) we use the same compass coefficient notation that was used for two dimensions, with a prefix to indicate the z level index of the stencil. For the stencil at grid point (i,j,k), the three prefixes are t for (*,*,k+1), p for (*,*,k), and
b for (*,*,k-1).

9.1.1 Semi-Coarsening

The semi-coarsening algorithm can be done in several ways. Recall that the semi-coarsening method used a smoother working orthogonal to the direction of the coarsening. The coarsening can be done in one of the coordinate directions, leaving the smoother to work on planes, or it can be done in two of the coordinate directions, with the smoother working on lines. We have chosen to examine only semi-coarsening in the z coordinate direction. Either of the other two coordinate directions would have been equally valid, but since we plan to perform the plane solves with the 2D semi-coarsening algorithm, which is already written for xy-planes, we can avoid writing additional versions for the other planes.
[Figure: 27-point cubic stencil drawn on x, y, z axes, with top-plane entries tnw, tn, tne, tw, te, tsw, ts, tse; middle-plane entries pnw, pn, pne, pw, p, pe, psw, ps, pse; and bottom-plane entries bnw, bn, bne, bw, b, be, bsw, bs, bse.]
Figure 9.1: Grid operator stencil in three dimensions.
CHAPTER 10
3D DISCRETIZATIONS

This chapter presents some of the discretizations that can be used on the convection-diffusion equation in three dimensions. The finite difference and finite volume discretizations in three dimensions are straightforward extensions of the two dimensional discretizations presented in chapter 2. We will present only a few examples in three dimensions.

The continuous three dimensional problem is given by

$$ -\nabla \cdot (D\,\nabla u) + \mathbf{b} \cdot \nabla u + c\,u = f, \qquad (x,y,z) \in \Omega = (0,M_x) \times (0,M_y) \times (0,M_z) \tag{10.1} $$

where $D$ is a 3 x 3 tensor, $\det D > 0$, and $c \ge 0$. We will only be considering problems where $D$ is diagonal in this chapter. In addition, $D$, $c$, and $f$ are allowed to be discontinuous across internal interfaces in the domain $\Omega$. The boundary conditions are given by

$$ \frac{\partial u}{\partial n} + a\,u = g \quad \text{on } \partial\Omega \tag{10.2} $$

where $a$ and $g$ are functions, and $n$ is the outward unit normal vector. This allows us to represent Dirichlet, Neumann, and Robin boundary conditions.

The domain is assumed to be a rectangular parallelepiped, $\Omega = (0,M_x) \times (0,M_y) \times (0,M_z)$, which is divided into uniform cells of size $h_x = M_x/N_x$ by $h_y = M_y/N_y$ by $h_z = M_z/N_z$, where $N_x$, $N_y$, and $N_z$ are the number of cells in the x, y, and z-directions respectively. The mesh need not be uniform, but such an assumption will simplify our discussions.
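A finite element discretization on a regular tetrahedral mesh can also be used to derive the discrete system of equations, which can then be used as input to the black box multigrid methods.

10.1 Finite Difference Discretization

The anisotropic Poisson's equation on a cube domain,

$$ -\varepsilon_x u_{xx} - \varepsilon_y u_{yy} - \varepsilon_z u_{zz} = f \quad \text{in } \Omega = (0,1)^3, \tag{10.3} $$

where $u$ and $f$ are functions of x, y, and z, can be discretized by central finite differences with a uniform grid spacing, $h = 1/N$ for $N = n_x = n_y = n_z$, to get the 7-point stencil at grid point (i,j,k):

$$ \frac{1}{h^2} \begin{bmatrix} 0 & 0 & 0 \\ 0 & -\varepsilon_z & 0 \\ 0 & 0 & 0 \end{bmatrix}_b \quad \frac{1}{h^2} \begin{bmatrix} 0 & -\varepsilon_y & 0 \\ -\varepsilon_x & 2(\varepsilon_x + \varepsilon_y + \varepsilon_z) & -\varepsilon_x \\ 0 & -\varepsilon_y & 0 \end{bmatrix}_p \quad \frac{1}{h^2} \begin{bmatrix} 0 & 0 & 0 \\ 0 & -\varepsilon_z & 0 \\ 0 & 0 & 0 \end{bmatrix}_t \tag{10.4} $$

where the stencil subscripts b, p, and t are short for the k-1, k, and k+1 stencil planes respectively.

10.2 Finite Volume Discretization

There are several finite volume grids that can be used for discretization, but the two most common are the vertex and cell centered grids, just as in two dimensions. We will present only the finite volume discretization for the vertex centered finite volumes, with grid points $(x_i, y_j, z_k)$:

$$ x_i = i\,h_x,\ i = 0,\dots,N_x; \qquad y_j = j\,h_y,\ j = 0,\dots,N_y; \qquad z_k = k\,h_z,\ k = 0,\dots,N_z \tag{10.5} $$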
with evaluation at the vertices. The discretization is best when the discontinuous interfaces align with the finite volume boundaries (surfaces). In this discretization $D$, $c$, and $f$ are approximated by constant values in each finite volume, $\Omega_{i,j,k}$, whose centers are at the vertices.

10.2.1 Interior Finite Volumes

The development is done the same as in chapter 2. However, instead of having four line integrals, we now have six surface integrals to evaluate over the finite volume. We will refer to the six surfaces as $\Omega_{i-1}$, $\Omega_{i+1}$, $\Omega_{j-1}$, $\Omega_{j+1}$, $\Omega_{k-1}$, and $\Omega_{k+1}$, where the subscripts indicate the fixed grid index. The surface integral for $\Omega_{i-1}$ is

$$ \int_{\Omega_{i-1}} D_x \frac{\partial u}{\partial x}\,dy\,dz \approx \frac{h_y h_z}{h_x}\, \frac{2\,D_{x,i,j,k}\,D_{x,i-1,j,k}}{D_{x,i,j,k} + D_{x,i-1,j,k}}\, (u_{i,j,k} - u_{i-1,j,k}) \tag{10.6} $$

The other five surface integrals are approximated similarly. The volume integrals are

$$ \int_{\Omega_{i,j,k}} c\,u\,d\Omega \approx h_x h_y h_z\, c_{i,j,k}\, u_{i,j,k} \tag{10.7} $$

and

$$ \int_{\Omega_{i,j,k}} f\,d\Omega \approx h_x h_y h_z\, f_{i,j,k} \tag{10.8} $$
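The face coefficient in the surface integral approximation (10.6) is a harmonic average of the two diffusion values on either side of the face, scaled by the face area over the grid spacing. The following is a minimal sketch of that computation; the function names `harmonic_mean` and `west_face_coefficient` are illustrative and do not come from the black box multigrid codes.

```python
def harmonic_mean(d_a, d_b):
    """Harmonic average 2*a*b / (a + b) of two positive diffusion values."""
    return 2.0 * d_a * d_b / (d_a + d_b)


def west_face_coefficient(dx_here, dx_west, hx, hy, hz):
    """Coefficient multiplying (u[i,j,k] - u[i-1,j,k]) in the flux through
    the face between volumes (i-1,j,k) and (i,j,k)."""
    return (hy * hz / hx) * harmonic_mean(dx_here, dx_west)


# With a large jump in D_x across the face, the harmonic average stays close
# to twice the smaller value rather than following the larger one.
print(harmonic_mean(1.0, 1.0e6))
print(west_face_coefficient(2.0, 2.0, hx=0.5, hy=0.5, hz=0.5))
```

The harmonic average is what keeps the discrete flux continuous across a material interface that coincides with the finite volume surface, which is why the discretization is best when interfaces align with the volume boundaries.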
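The stencil for grid point (i,j,k) is given by

$$ \begin{bmatrix} 0 & 0 & 0 \\ 0 & -\frac{h_x h_y}{h_z}\alpha^z_{i,j,k-1} & 0 \\ 0 & 0 & 0 \end{bmatrix}_b \quad \begin{bmatrix} 0 & -\frac{h_x h_z}{h_y}\alpha^y_{i,j,k} & 0 \\ -\frac{h_y h_z}{h_x}\alpha^x_{i-1,j,k} & \Sigma & -\frac{h_y h_z}{h_x}\alpha^x_{i,j,k} \\ 0 & -\frac{h_x h_z}{h_y}\alpha^y_{i,j-1,k} & 0 \end{bmatrix}_p \quad \begin{bmatrix} 0 & 0 & 0 \\ 0 & -\frac{h_x h_y}{h_z}\alpha^z_{i,j,k} & 0 \\ 0 & 0 & 0 \end{bmatrix}_t \tag{10.9} $$

where

$$ \alpha^x_{i,j,k} = \frac{2\,D_{x,i,j,k}\,D_{x,i+1,j,k}}{D_{x,i,j,k} + D_{x,i+1,j,k}}, \tag{10.10} $$

$$ \alpha^y_{i,j,k} = \frac{2\,D_{y,i,j,k}\,D_{y,i,j+1,k}}{D_{y,i,j,k} + D_{y,i,j+1,k}}, \tag{10.11} $$

$$ \alpha^z_{i,j,k} = \frac{2\,D_{z,i,j,k}\,D_{z,i,j,k+1}}{D_{z,i,j,k} + D_{z,i,j,k+1}}, \tag{10.12} $$

and

$$ \Sigma = \frac{h_y h_z}{h_x}\left(\alpha^x_{i-1,j,k} + \alpha^x_{i,j,k}\right) + \frac{h_x h_z}{h_y}\left(\alpha^y_{i,j-1,k} + \alpha^y_{i,j,k}\right) + \frac{h_x h_y}{h_z}\left(\alpha^z_{i,j,k-1} + \alpha^z_{i,j,k}\right) + h_x h_y h_z\, c_{i,j,k} \tag{10.13} $$

10.2.2 Edge Boundary Finite Volumes

Let the finite volume $\Omega_{i,j,k}$ have its southern edge at the southern boundary (y = 0) of the domain.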
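10.2.3 Dirichlet Boundary Condition

For the Dirichlet boundary condition we have $u_{(s)} = g_{(s)}$, so that the surface integral over $\Omega_{j-1}$ is

$$ \int_{\Omega_{j-1}} D_y \frac{\partial u}{\partial y}\,dx\,dz \approx \frac{2 h_x h_z}{h_y}\, D_{y,i,j,k}\, (u_{i,j,k} - u_{(s)}) \tag{10.14} $$

where $u_{(s)}$ means to evaluate $u$ at the grid point (i,j-1,k). We now get the stencil

$$ \begin{bmatrix} 0 & 0 & 0 \\ 0 & -\frac{h_x h_y}{h_z}\alpha^z_{i,j,k-1} & 0 \\ 0 & 0 & 0 \end{bmatrix}_b \quad \begin{bmatrix} 0 & -\frac{h_x h_z}{h_y}\alpha^y_{i,j,k} & 0 \\ -\frac{h_y h_z}{h_x}\alpha^x_{i-1,j,k} & \Sigma + \frac{2 h_x h_z}{h_y} D_{y,i,j,k} & -\frac{h_y h_z}{h_x}\alpha^x_{i,j,k} \\ 0 & 0 & 0 \end{bmatrix}_p \quad \begin{bmatrix} 0 & 0 & 0 \\ 0 & -\frac{h_x h_y}{h_z}\alpha^z_{i,j,k} & 0 \\ 0 & 0 & 0 \end{bmatrix}_t \tag{10.15} $$

where

$$ \Sigma = \frac{h_y h_z}{h_x}\left(\alpha^x_{i-1,j,k} + \alpha^x_{i,j,k}\right) + \frac{h_x h_z}{h_y}\,\alpha^y_{i,j,k} + \frac{h_x h_y}{h_z}\left(\alpha^z_{i,j,k-1} + \alpha^z_{i,j,k}\right) + h_x h_y h_z\, c_{i,j,k} \tag{10.16} $$

and the $\alpha$'s are given as before; see equations (10.10), (10.11), and (10.12).

10.2.4 Neumann and Robin Boundary Conditions

The boundary condition along the southern boundary is

$$ \frac{\partial u}{\partial n} + a\,u = g_{(s)} \quad \text{on } (s) \tag{10.17} $$

We can make the approximation

$$ \left.\frac{\partial u}{\partial n}\right|_{(s)} \approx \frac{2}{h_y}\left(u_{(s)} - u_{(p)}\right) \tag{10.18} $$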
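where

$$ u_{(s)} - u_{(p)} = \frac{h_y}{2}\left(g_{(s)} - a_{(s)}\,u_{(s)}\right), \tag{10.19} $$

which gives

$$ u_{(s)} = \frac{2\,u_{(p)} + h_y\, g_{(s)}}{2 + h_y\, a_{(s)}} \tag{10.20} $$

The surface integral along the boundary is approximated by

$$ \frac{2 h_x h_z}{2 + h_y a_{(s)}}\, D_{y,i,j,k} \left(a_{(s)}\,u_{i,j,k} - g_{(s)}\right) \tag{10.21} $$

We now get the stencil

$$ \begin{bmatrix} 0 & 0 & 0 \\ 0 & -\frac{h_x h_y}{h_z}\alpha^z_{i,j,k-1} & 0 \\ 0 & 0 & 0 \end{bmatrix}_b \quad \begin{bmatrix} 0 & -\frac{h_x h_z}{h_y}\alpha^y_{i,j,k} & 0 \\ -\frac{h_y h_z}{h_x}\alpha^x_{i-1,j,k} & \Sigma + BC & -\frac{h_y h_z}{h_x}\alpha^x_{i,j,k} \\ 0 & 0 & 0 \end{bmatrix}_p \quad \begin{bmatrix} 0 & 0 & 0 \\ 0 & -\frac{h_x h_y}{h_z}\alpha^z_{i,j,k} & 0 \\ 0 & 0 & 0 \end{bmatrix}_t \tag{10.22} $$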
where $\Sigma$ is defined in equation (10.16), the $\alpha$'s are defined by equations (10.10), (10.11), and (10.12), and

$$ BC = \frac{2 h_x h_z\, a_{(s)}}{2 + h_y\, a_{(s)}}\, D_{y,i,j,k} \tag{10.24} $$

The other boundary finite volume cases (faces, edges, and corners) can be easily deduced from the boundary condition cases above.
CHAPTER 11
3D NONSYMMETRIC GRID TRANSFER OPERATORS

The three dimensional grid transfer operators are the same as those used in two dimensions, except that the grid operators $L^h$ and $L^H$ now have 27-point stencils. The three dimensional grid transfer coefficients are computed using the same type of grid decomposition method as was used in the second method of [29]. The computational method involves the formation of the grid transfer coefficients and the coarse grid operator by operator-induced interpolation and Galerkin coarse grid approximation, performing consecutive semi-coarsening in each of the coordinate directions.

The grid transfer coefficients are computed by an extension of the same methods that were used for the two dimensional grids, that is, the collapsing methods and the extension of Schaffer's ideas; see sections 3.5.1 through 3.6. The only difference is that instead of $A_{i-1}$, $A_{i+1}$, etc. representing points and lines respectively, they now represent points and planes.

The three dimensional grid transfer operator stencil is a little more complex than the two dimensional ones; see figure 11.1. The computations of the grid transfer coefficients become quite clear if one draws several pictures; then the symmetry of the computations really stands out. The pictures are not presented here because they are hard to represent in a static
[Figure: cubic stencil drawn on x, y, z axes, with corner entries tnw, tne, tsw, tse, bnw, bne, bsw, bse; xy-plane entries xynw, xyn, xyne, xyw, xye, xysw, xys, xyse; xz-plane entries xznw, xzn, xzne, xzsw, xzs, xzse; and yz-plane entries yznw, yzne, yzsw, yzse.]
Figure 11.1: Grid transfer operator's stencil in three dimensions.
monochrome mode.

11.1 3D Grid Transfer Operations

The fine grid points that are also coarse grid points use the identity as the interpolation operator. The coarse grid correction is then given by (11.1), where $(x_{i_f}, y_{j_f}, z_{k_f}) = (x_{i_c}, y_{j_c}, z_{k_c})$ on the grid; here the interpolation coefficient is 1.

The fine grid points that are between two coarse grid points that share the same $y_j$ and $z_k$ coordinates use a two point relation for the interpolation. The coarse grid correction is given by (11.2), where $x_{i_c-1} < x_{i_f-1} < x_{i_c}$, $y_{j_c} = y_{j_f}$, and $z_{k_c} = z_{k_f}$ on the grid, and the interpolation coefficients are $I^{xyw}$ and $I^{xye}$.

The fine grid points that are between two coarse grid points that share the same $x_i$ and $z_k$ coordinates use a similar two point relation for the interpolation. The coarse grid correction is then given by (11.3), where $x_{i_c} = x_{i_f}$, $y_{j_c-1} < y_{j_f-1} < y_{j_c}$, and $z_{k_c} = z_{k_f}$ on the grid, and the interpolation coefficients are $I^{xys}$ and $I^{xyn}$.

The fine grid points that are between two coarse grid points that share the same $x_i$ and $y_j$ coordinates use a similar two point relation for the interpolation. The coarse grid correction is then given by (11.4),
where $x_{i_c} = x_{i_f}$, $y_{j_c} = y_{j_f}$, and $z_{k_c-1} < z_{k_f-1} < z_{k_c}$ on the grid, and the interpolation coefficients are $I^{xzs}$ and $I^{xzn}$.

For the fine grid points that share $z_k$ coordinates, but do not share either an $x_i$ or a $y_j$ coordinate with the coarse grid, we use a four point relation for the interpolation, and the coarse grid correction is given by (11.5), where $x_{i_c} < x_{i_f} < x_{i_c+1}$, $y_{j_c} < y_{j_f} < y_{j_c+1}$, and $z_{k_c} = z_{k_f}$, with interpolation coefficients $I^{xysw}$, $I^{xynw}$, $I^{xyne}$, and $I^{xyse}$.

For the fine grid points that share $x_i$ coordinates, but do not share either a $y_j$ or a $z_k$ coordinate with the coarse grid, we use a four point relation for the interpolation, and the coarse grid correction is given by (11.6), where $x_{i_c} = x_{i_f}$, $y_{j_c} < y_{j_f} < y_{j_c+1}$, and $z_{k_c} < z_{k_f} < z_{k_c+1}$, with interpolation coefficients $I^{yzsw}$, $I^{yznw}$, $I^{yzne}$, and $I^{yzse}$.

For the fine grid points that share $y_j$ coordinates, but do not share either an $x_i$ or a $z_k$ coordinate with the coarse grid, we use a four point relation for the interpolation, and the coarse grid correction is given by (11.7),
where $x_{i_c} < x_{i_f} < x_{i_c+1}$, $y_{j_c} = y_{j_f}$, and $z_{k_c} < z_{k_f} < z_{k_c+1}$, with interpolation coefficients $I^{xzsw}$, $I^{xznw}$, $I^{xzne}$, and $I^{xzse}$.

Lastly, for the fine grid points that do not share an $x_i$, $y_j$, or $z_k$ coordinate with the coarse grid, we use an eight point relation for the interpolation, and the coarse grid correction formula is given by (11.8), where $x_{i_c} < x_{i_f} < x_{i_c+1}$, $y_{j_c} < y_{j_f} < y_{j_c+1}$, and $z_{k_c} < z_{k_f} < z_{k_c+1}$, with interpolation coefficients $I^{tnw}$, $I^{tne}$, $I^{tsw}$, $I^{tse}$, $I^{bnw}$, $I^{bne}$, $I^{bsw}$, and $I^{bse}$.

The prolongation operators also have a correction term, containing the residual, added to them to obtain an $O(h^2)$ error at the boundaries. The correction is similar to the one employed in the two dimensional case; see section 3.1.1.

11.2 3D Nonsymmetric Grid Operator $L^h$: Collapsing Methods

To illustrate the grid transfer operators, we will present the nonsymmetric collapsing method in three dimensions. From this discussion it should be clear how to extend the other grid transfer operators from two to three dimensions.

The $I^{xyw}$ coefficient is computed by (11.9).
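If, however, $R_\Sigma$ is small, then

$$ I^{xyw} = \frac{\sigma[West]}{\sigma[West] + \sigma[East]} \tag{11.10} $$

where

$$ \sigma[West] = \sum_{m \in \{t,p,b\}} \left(\sigma NW_m + \sigma W_m + \sigma SW_m\right) \tag{11.11} $$

$$ \sigma[East] = \sum_{m \in \{t,p,b\}} \left(\sigma NE_m + \sigma E_m + \sigma SE_m\right) \tag{11.12} $$

are the sums over the west and east planes of the grid operator stencil. In (11.9)-(11.12), $I^{xyw}$ is evaluated at $(x_{i_c-1}, y_{j_c}, z_{k_c})$, and the other coefficients on the right hand side are evaluated at $(x_{i_f-1}, y_{j_f}, z_{k_f})$ for the $L^h$ components. Let

$$ \gamma = \min\{|\sigma[West]|,\ |\sigma[East]|,\ 1\}. \tag{11.13} $$

Then by small we mean that (11.14) holds, where $R_\Sigma$ is given by (11.15).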
Prolongation coefficients which are computed in a similar way are $I^{xye}$, $I^{xys}$, $I^{xyn}$, $I^{xzn}$, and $I^{xzs}$.

The prolongation coefficients $I^{xynw}$, $I^{xyne}$, $I^{xysw}$, $I^{xyse}$, $I^{xznw}$, $I^{xzne}$, $I^{xzsw}$, $I^{xzse}$, $I^{yznw}$, $I^{yzne}$, $I^{yzsw}$, and $I^{yzse}$ can all be computed in a similar fashion. The computation of these coefficients is analogous to the computation of $I^{nw}$, $I^{ne}$, $I^{sw}$, and $I^{se}$ from section 3.5.1. We now illustrate these computations by computing the prolongation coefficient $I^{xynw}$:

$$ I^{xynw} = \frac{\sigma[NWzLine] + \sigma[NzLine]\, I^{xyw} + \sigma[WzLine]\, I^{xyn}}{C_p} \tag{11.16} $$

where $\sigma[NzLine]$, $\sigma[WzLine]$, and $\sigma[NWzLine]$ are given by (11.17), (11.18), and (11.19), and where the notation means to take the line in the z-direction that contains the given grid operator coefficients. If, however, $R_\Sigma$ is small, then $I^{xynw}$ is computed by (11.20), which has the same numerator as (11.16). Let

$$ \gamma = \min\{|\sigma[NWzLine]|,\ |\sigma[NzLine]|,\ |\sigma[NEzLine]|,\ |\sigma[WzLine]|,\ |\sigma[EzLine]|,\ |\sigma[SWzLine]|,\ |\sigma[SzLine]|,\ |\sigma[SEzLine]|,\ 1\}. \tag{11.21} $$

Then by small we mean that (11.22) holds,
where $R_\Sigma$ is defined in equation (11.15). $I^{xynw}$, $I^{xyw}$, and $I^{xyn}$ are evaluated at $(x_{i_c-1}, y_{j_c}, z_{k_c})$, and $\sigma[NzLine]$, $\sigma[WzLine]$, and $\sigma[NWzLine]$ are evaluated at $(x_{i_f-1}, y_{j_f-1}, z_{k_f})$.

Finally, the last eight prolongation coefficients, $I^{tnw}$, $I^{tne}$, $I^{tsw}$, $I^{tse}$, $I^{bnw}$, $I^{bne}$, $I^{bsw}$, and $I^{bse}$, are used to interpolate to fine grid points which do not align with any of the coarse grid lines. They can all be computed in a similar fashion, which will be illustrated for $I^{tnw}$. The coefficient $I^{tnw}$ is computed by (11.23), whose numerator is

$$ \sigma NW_t + \sigma W_p\, I^{yzne} + \sigma NW_p\, I^{xzn} + \sigma N_p\, I^{xznw} + \sigma W_t\, I^{xyn} + \sigma N_t\, I^{xyw} + \sigma C_t\, I^{xynw}. $$

If, however, $R_\Sigma$ is small, then $I^{tnw}$ is computed by (11.24), which has the same numerator. Let

$$ \gamma = \min\{|\sigma[West]|,\ |\sigma[North]|,\ |\sigma[East]|,\ |\sigma[South]|,\ |\sigma[Top]|,\ |\sigma[Bottom]|,\ 1\}. \tag{11.25} $$

Then by small we mean that (11.26) holds, where $R_\Sigma$ is defined in equation (11.15). $I^{yzne}$, $I^{xzn}$, $I^{xznw}$, $I^{xynw}$, $I^{xyw}$, and $I^{xyn}$ are evaluated at $(x_{i_c-1}, y_{j_c-1}, z_{k_c-1})$, and $\sigma[West]$, $\sigma[North]$, $\sigma[East]$, $\sigma[South]$, $\sigma[Top]$, and $\sigma[Bottom]$ are evaluated at $(x_{i_f-1}, y_{j_f-1}, z_{k_f-1})$. Note that $\sigma[Top]$ and $\sigma[Bottom]$ are just the sums of the grid operator coefficients on the top (k+1) and bottom (k-1) planes of the grid operator stencil respectively.
The restriction operator coefficients are computed in the same way as above, but instead of using the symmetric part of the grid operator, $\sigma L$, we use the transpose of the grid operator, $L^T$.

11.2.1 3D Grid Transfer Operator Variations

With the information above on how to compute the basic grid transfer operators, it is easy to see how to extend all of the grid transfer operator variations that we discussed in chapter 3 from 2D to 3D.

11.3 3D Coarse Grid Operator

The three dimensional coarse grid operator is computed in the same way as the second method in [29]. The computational method involves the formation of the grid transfer coefficients and the coarse grid operator using auxiliary grids. The coarse grid operator is formed in a series of steps using a series of semi-coarsening auxiliary grids.

Define an auxiliary grid $G^H_z = G^h_x \times G^h_y \times G^H_z$, which is just the grid $G^h$ coarsened in the z-direction only. Now we define the grid transfer operator, $(J^h_H)_z : G^H_z \to G^h$. The grid transfer operator, $(J^h_H)_z$, can be constructed using any of the methods discussed, as can the other two grid transfer operators discussed. We now define the partial coarse grid operator to be

$$ L^H_z = (J^H_h)_z\, L^h\, (J^h_H)_z \tag{11.27} $$

In a similar fashion we define $G^H_{yz} = G^h_x \times G^H_y \times G^H_z$, and the grid transfer operator $(J^h_H)_{yz} : G^H_{yz} \to G^H_z$. The associated coarse grid operator is defined by

$$ L^H_{yz} = (J^H_h)_{yz}\, L^H_z\, (J^h_H)_{yz} \tag{11.28} $$

Finally, in a similar fashion, we define the coarse grid, $G^H = G^H_x \times G^H_y \times G^H_z$, and the
grid transfer operator $J^h_H : G^H \to G^H_{yz}$. The coarse grid operator is finally obtained by

$$ L^H = J^H_h\, L^H_{yz}\, J^h_H \tag{11.29} $$

The formation of the coarse grid operator in this way saves 31% and 50% of the operations for the seven and twenty-seven point grid operators respectively. As an added bonus the coding is much less complex and easier to debug.
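The three-step construction above can be illustrated with small dense matrices. The sketch below is not the operator-induced interpolation of this chapter: it assumes plain linear interpolation in each direction, uses the transpose of the prolongation as the restriction, and takes the 7-point Laplacian as the fine grid operator, purely so that the successive semi-coarsening products can be written in a few lines. All names are illustrative.

```python
import numpy as np


def laplace_1d(n):
    """Tridiagonal (-1, 2, -1) matrix for n interior points, unit spacing."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)


def linear_prolongation(nc):
    """Linear interpolation from nc coarse points to 2*nc + 1 fine points."""
    nf = 2 * nc + 1
    p = np.zeros((nf, nc))
    for jc in range(nc):
        jf = 2 * jc + 1              # fine index of coarse point jc
        p[jf, jc] = 1.0
        p[jf - 1, jc] = 0.5
        p[jf + 1, jc] = 0.5
    return p


def kron3(az, ay, ax):
    """Tensor product operator with z as the slowest index, x the fastest."""
    return np.kron(az, np.kron(ay, ax))


nc = 2
nf = 2 * nc + 1
a_f, i_f, i_c = laplace_1d(nf), np.eye(nf), np.eye(nc)

# Fine grid 7-point operator on an nf x nf x nf grid.
l_h = kron3(a_f, i_f, i_f) + kron3(i_f, a_f, i_f) + kron3(i_f, i_f, a_f)

p1 = linear_prolongation(nc)
p_z = kron3(p1, i_f, i_f)    # z-coarsened grid  -> fine grid
p_yz = kron3(i_c, p1, i_f)   # yz-coarsened grid -> z-coarsened grid
p_x = kron3(i_c, i_c, p1)    # coarse grid       -> yz-coarsened grid

# Three successive Galerkin products, one per coordinate direction.
l_z = p_z.T @ l_h @ p_z
l_yz = p_yz.T @ l_z @ p_yz
l_coarse = p_x.T @ l_yz @ p_x

# Single-step Galerkin product with the full tensor product interpolation.
p_full = kron3(p1, p1, p1)
l_direct = p_full.T @ l_h @ p_full
print(l_coarse.shape, np.allclose(l_coarse, l_direct))
```

For this tensor product interpolation the stepwise and single-step products agree, so the benefit of the stepwise form lies in the smaller intermediate stencils that have to be multiplied at each step.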
CHAPTER 12
3D SMOOTHERS

There are several choices of relaxation methods that can be used for the 3D smoother. We have chosen to look at point, line, and plane Gauss-Seidel relaxation methods using either lexicographic or multicolor ordering.

12.1 Point Gauss-Seidel

The point Gauss-Seidel method in three dimensions is the same as it is in two, but now there are more choices for the sweeping direction. We have chosen to look only at the lexicographic ordering, the red/black ordering for 7-point operators, and an 8-color ordering for 27-point operators. The red/black ordering is given by

Red:    i + j + k even
Black:  i + j + k odd                                   (12.1)
and the 8-color ordering is given by

Black:   i odd,  j odd,  k odd
Red:     i odd,  j odd,  k even
Orange:  i odd,  j even, k odd
Yellow:  i odd,  j even, k even
Green:   i even, j odd,  k odd                          (12.2)
Blue:    i even, j odd,  k even
Violet:  i even, j even, k odd
White:   i even, j even, k even.

12.2 Line Gauss-Seidel

We have three choices for the direction of the lines for line Gauss-Seidel relaxation: x-lines, y-lines, or z-lines. We can also look at alternating line relaxation, as we did in two dimensions, except that now we have four possibilities: x- and y-lines, y- and z-lines, x- and z-lines, or x-, y-, and z-lines.

As before, we can look at different orderings of the lines. Lexicographic is a common choice, but it cannot be parallelized. We can get better convergence and obtain parallelism and vectorization across lines, as in the two dimensional case, by using a zebra (red/black) ordering of the lines.

For standard coarsening, the only choice of smoother which might be robust is alternating zebra line Gauss-Seidel relaxation. It will be a good smoother for some of the convection problems, but not others, because each coordinate line direction sweep can handle only anisotropies and convections with components in its coordinate direction. Anisotropies in a plane not being swept by the lines will exhibit poor smoothing.
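The point orderings of section 12.1 are chosen so that no two points of the same color are coupled by the stencil, which is what allows all points of one color to be updated simultaneously. The sketch below checks this property on a small grid; the function names are illustrative, and the colors are represented by parities rather than by the color names used in (12.1) and (12.2).

```python
from itertools import product


def red_black_color(i, j, k):
    """0 for red (i + j + k even), 1 for black (i + j + k odd)."""
    return (i + j + k) % 2


def eight_color(i, j, k):
    """One of eight colors, identified by the parities of i, j, and k."""
    return (i % 2, j % 2, k % 2)


def same_color_couplings(color, offsets, n=4):
    """Count stencil couplings between equally colored points on an n^3 grid."""
    count = 0
    for i, j, k in product(range(n), repeat=3):
        for di, dj, dk in offsets:
            ni, nj, nk = i + di, j + dj, k + dk
            inside = 0 <= ni < n and 0 <= nj < n and 0 <= nk < n
            if inside and color(i, j, k) == color(ni, nj, nk):
                count += 1
    return count


all_offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
seven_point = [o for o in all_offsets if sum(abs(c) for c in o) == 1]
twenty_seven_point = all_offsets

print(same_color_couplings(red_black_color, seven_point))
print(same_color_couplings(eight_color, twenty_seven_point))
print(same_color_couplings(red_black_color, twenty_seven_point) > 0)
```

The first two counts are zero, while the last line shows that two colors are not enough for a 27-point operator, since diagonal neighbors such as (i+1, j+1, k) have the same parity sum as (i, j, k).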
12.3 Plane Gauss-Seidel

In three dimensions we can now perform plane relaxation, which is analogous to line relaxation in two dimensions. Plane relaxation can be performed in several ways: xy-plane, yz-plane, xz-plane, or alternating plane relaxation. These methods can also be done in a lexicographic or red/black ordering of the planes.

We need a robust method for our smoother, and such methods can be found among those that perform plane relaxations [11]. In general, red/black ordering will give better results than lexicographic because it removes the directional dependencies that are associated with a sweeping direction. However, plane relaxation cannot reduce the error orthogonal to the plane, and hence we must use alternating plane relaxation to obtain a robust smoother.

Alternating red/black plane Gauss-Seidel relaxation is the most robust because it takes into account the three coordinate directions for anisotropies and convection. One iteration of the method consists of red/black xy-plane Gauss-Seidel relaxation, followed by red/black yz-plane Gauss-Seidel relaxation, and finally red/black xz-plane Gauss-Seidel relaxation.

The question now arises as to how to efficiently perform the plane solves needed by the smoother. In 2D we used a cyclic reduction tridiagonal solver to perform the line solves, but in 3D we are stuck with having to solve a sparse banded system. To perform LU factorization and solve for each plane would be very time consuming. We could save some time by saving the LU decompositions, but at the expense of memory. However, there is a better solution to our problem: we can use a 2D multigrid method to perform the plane solves. We have chosen to use the 2D black box multigrid method for the plane solves because it was designed for just such a mission. By using the 2D multigrid method we still need extra memory, but not as much as the LU method, and
we can also perform multigrid much more quickly than LU.

One possible drawback to using the 2D multigrid method is that it is not an exact solver. This should not be much of a problem, since the relaxation method gives only an improved approximation to the solution for each iteration. However, we do not want to degrade the convergence of the relaxation by providing poor approximations for the plane solves. We have found that it is usually sufficient to use a single V(1,1)-cycle in the 2D black box multigrid method with alternating zebra line Gauss-Seidel relaxation to obtain essentially the same results for the red/black plane Gauss-Seidel relaxation as when LU factorization is used. Depending on the convection characteristics it is sometimes better to use either a V(2,1)-cycle, a W(1,1)-cycle, or several V(1,1)-cycles; however, this improvement in the plane solve accuracy is moot, since the relaxation method can fail even when exact plane solves are used.
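One third of an alternating sweep, the red/black xy-plane relaxation, can be sketched as follows. This is only an illustration under simplifying assumptions: the operator is the 7-point Laplacian on the unit cube with zero Dirichlet data, and each plane is solved exactly with a dense solver as a stand-in for the 2D multigrid plane solve discussed above. The array layout `u[k, j, i]` and all function names are choices made for this sketch.

```python
import numpy as np


def laplace_1d(n, h):
    """1D second-difference matrix (-1, 2, -1) / h^2 with Dirichlet ends."""
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2


def residual_norm(u, f, h):
    """Max-norm residual of the 7-point Laplacian with zero boundary values."""
    up = np.pad(u, 1)                      # pad every axis with zeros
    lap = (6.0 * u
           - up[:-2, 1:-1, 1:-1] - up[2:, 1:-1, 1:-1]
           - up[1:-1, :-2, 1:-1] - up[1:-1, 2:, 1:-1]
           - up[1:-1, 1:-1, :-2] - up[1:-1, 1:-1, 2:]) / h**2
    return np.abs(f - lap).max()


def red_black_xy_plane_sweep(u, f, h):
    """Relax all even-k planes, then all odd-k planes, in place."""
    n = u.shape[0]
    t, eye = laplace_1d(n, h), np.eye(n)
    # In-plane 5-point operator plus the z contribution to the diagonal.
    plane_op = np.kron(eye, t) + np.kron(t, eye) + (2.0 / h**2) * np.eye(n * n)
    for parity in (0, 1):
        for k in range(parity, n, 2):
            rhs = f[k].copy()
            if k > 0:
                rhs += u[k - 1] / h**2     # coupling to the plane below
            if k < n - 1:
                rhs += u[k + 1] / h**2     # coupling to the plane above
            u[k] = np.linalg.solve(plane_op, rhs.ravel()).reshape(n, n)
    return u


n = 7
h = 1.0 / (n + 1)
f = np.ones((n, n, n))
u = np.zeros((n, n, n))
r_start = residual_norm(u, f, h)
for _ in range(30):
    red_black_xy_plane_sweep(u, f, h)
r_end = residual_norm(u, f, h)
print(r_start, r_end)
```

Planes of one color are independent of each other, so in a parallel setting all of them could be handed to the plane solver at once, which is the reason for modifying the 2D method to solve all planes of a single color simultaneously.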
CHAPTER 13
LOCAL MODE ANALYSIS IN THREE DIMENSIONS

Local mode analysis of 3D smoothers is somewhat sparse in the literature and does not adequately cover the range of problems that we wish to solve. In addition, there are only hints in the literature on how to perform local mode analysis for color relaxation in three dimensions, and we are unaware of the appearance elsewhere of the detailed analysis that we present in this chapter.

The local (Fourier) mode analysis was described in section 5.3 for two dimensions, and we now extend it to three dimensions.

13.1 Overview of 3D Local Mode Analysis

The continuous problem is discretized into a system of algebraic equations

$$ L u = f \tag{13.1} $$

where the grid G is defined by

$$ G = \{ (x_i, y_j, z_k) : x_i = i\,h_x,\ i = 1,\dots,n_x,\ h_x = 1/n_x;\ \ y_j = j\,h_y,\ j = 1,\dots,n_y,\ h_y = 1/n_y;\ \ z_k = k\,h_z,\ k = 1,\dots,n_z,\ h_z = 1/n_z \} \tag{13.2} $$
The grid operator L can be represented in stencil notation as

$$ [L] = \begin{bmatrix} NW_b & N_b & NE_b \\ W_b & C_b & E_b \\ SW_b & S_b & SE_b \end{bmatrix}_b \quad \begin{bmatrix} NW_p & N_p & NE_p \\ W_p & C_p & E_p \\ SW_p & S_p & SE_p \end{bmatrix}_p \quad \begin{bmatrix} NW_t & N_t & NE_t \\ W_t & C_t & E_t \\ SW_t & S_t & SE_t \end{bmatrix}_t \tag{13.3} $$

where the subscripts b, p, and t stand for the bottom (k-1), plane (k), and top (k+1) levels of the stencil.

If the continuous problem has constant coefficients and periodic boundary conditions, then the stencils of [L], [M], and [N] are independent of the grid points (i,j,k). The eigenfunctions of the smoothing amplification matrix S are given by (13.4) and (13.5), with $\theta \in \Theta$. If $n_x$, $n_y$, and $n_z$ are assumed to be even, then the corresponding eigenvalues of S are

$$ \lambda(\theta) = \frac{\sum_\kappa N(\kappa)\, e^{i \kappa \cdot \theta}}{\sum_\kappa M(\kappa)\, e^{i \kappa \cdot \theta}}, \qquad \theta \in \Theta, \tag{13.6} $$

where $\kappa = (\kappa_x, \kappa_y, \kappa_z)$ is a vector.

We now define the sets of rough and smooth frequencies for the grid G, when the ratio between the fine and coarse grid spacings is two. The smooth frequencies are defined as

$$ \Theta_s = \Theta \cap \left(-\frac{\pi}{2}, \frac{\pi}{2}\right)^3 \tag{13.7} $$
and the rough frequencies as

$$ \Theta_r = \Theta \setminus \Theta_s \tag{13.8} $$

The Fourier smoothing factor is then defined to be

$$ \mu = \max_{\theta \in \Theta_r} \{ |\lambda(\theta)| \}, \tag{13.9} $$

just as it was in two dimensions. The smoothing factor can be made grid size independent by changing the definition of $\Theta$ to be that given in (13.10).

For the case of multicolor relaxation, the $\psi_{i,j,k}(\theta)$ are again no longer eigenfunctions, but certain subspaces spanned by their linear combinations are still invariant. Instead of four invariant subspaces, as in two dimensions, we now have eight invariant subspaces, which are defined in (13.11): for $\theta^1$ in the smooth frequency range, the frequencies $\theta^2, \dots, \theta^8$ are obtained from $\theta^1$ by replacing a subset of its components $\theta^1_\alpha$ by $\theta^1_\alpha - \mathrm{sign}(\theta^1_\alpha)\,\pi$, $\alpha = x, y, z$. The function $\psi(\theta)$ is now written as in (13.12). The error before smoothing is now given by (13.13),
and after smoothing it is (13.14) where S(O) is the 8 x 8 amplification matrix, and co is a vector of dimension 8. The amplification matrix is computed in the same way as in the two dimensional case. For multicolored relaxations, the definition of the Fourier smoothing factor, Jl, has to be modified. The rough Fourier modes are now given by and the smooth Fourier modes are now represented by 01 .../.. or z r 2. (13.15) (13.16) All of these values must be added to 8r. We now define a projection operator, Q(O), for ( 0) onto the Fourier modes, which is represented by the diagonal 8 x 8 matrix 8(0) 1 1 1 Q(O) = (13.17) 1 1 1 1 where 1 oi = for 't = x,y,z 8(0) = (13.18) 0 otherwise Define 8.s = 01 and the multicolor definition for the Fourier smoothing factor is given by Jl =max {p [Q(O)S(O)]} 0E8;; (13.19) 277
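where $\rho$ denotes the spectral radius.

The definitions for the smoothing factor can be modified, as in the two dimensional case, to take into account Dirichlet boundary conditions.

13.2 Three Dimensional Model Problems

The domain $\Omega$ is the unit cube for the three dimensional model problems:

1. $-\Delta u = f$
2. $-\varepsilon_1 u_{xx} - \varepsilon_2 u_{yy} - \varepsilon_3 u_{zz} = f$
   a) $0 < \varepsilon_1 \approx \varepsilon_2 \approx \varepsilon_3$
   b) $0 < \varepsilon_1 \ll \varepsilon_2 \approx \varepsilon_3$
   c) $0 < \varepsilon_1 \approx \varepsilon_2 \ll \varepsilon_3$
   d) $0 < \varepsilon_1 \ll \varepsilon_2 \ll \varepsilon_3$
3. $-\varepsilon \Delta u - u_x = f$
4. $-\varepsilon \Delta u + u_x = f$
5. $-\varepsilon \Delta u - u_y = f$
6. $-\varepsilon \Delta u + u_y = f$
7. $-\varepsilon \Delta u - u_z = f$
8. $-\varepsilon \Delta u + u_z = f$
9. $-\varepsilon \Delta u + u_x + u_y = f$
10. $-\varepsilon \Delta u + u_x - u_y = f$
11. $-\varepsilon \Delta u - u_x + u_y = f$
12. $-\varepsilon \Delta u - u_x - u_y = f$
13. $-\varepsilon \Delta u + u_x + u_z = f$
14. $-\varepsilon \Delta u + u_x - u_z = f$
15. $-\varepsilon \Delta u - u_x + u_z = f$
16. $-\varepsilon \Delta u - u_x - u_z = f$
17. $-\varepsilon \Delta u + u_y + u_z = f$
18. $-\varepsilon \Delta u + u_y - u_z = f$
19. $-\varepsilon \Delta u - u_y + u_z = f$
20. $-\varepsilon \Delta u - u_y - u_z = f$
21. $-\varepsilon \Delta u + u_x + u_y + u_z = f$
22. $-\varepsilon \Delta u + u_x + u_y - u_z = f$
23. $-\varepsilon \Delta u + u_x - u_y + u_z = f$
24. $-\varepsilon \Delta u + u_x - u_y - u_z = f$
25. $-\varepsilon \Delta u - u_x + u_y + u_z = f$
26. $-\varepsilon \Delta u - u_x + u_y - u_z = f$
27. $-\varepsilon \Delta u - u_x - u_y + u_z = f$
28. $-\varepsilon \Delta u - u_x - u_y - u_z = f$

where $\Delta u = u_{xx} + u_{yy} + u_{zz}$, $\varepsilon = 10^{-p}$ for $p = 0, 1, \ldots, 5$, and the $\varepsilon_i$ are to be taken in all possible combinations.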
13.3 Local Mode Analysis for Point GaussSeidel Relaxat ion Local mode analysis results are presented for lexicographical and red/black ordering for point GaussSeidel relaxations. Point GaussSeidel relaxation with lexicographic ordering gives the splitting 0 [MJ = o cb o 0 0 [N] = 0 0 0 0 b b 0 0 0 The amplification factor ..\(0) is given by 0 0 0 0 0 0 p 0 0 Ct 0 0 p Red/black point GaussSeidel relaxation has the amplification matrix a a 0 0 0 0 0 0 b b 0 0 0 0 0 0 0 0 c 0 0 0 0 c S(O) = 0 0 0 e 0 0 e 0 0 0 0 0 g g 0 0 0 0 0 0 h h 0 0 0 0 0 f 0 0 f 0 0 0 d 0 0 0 0 d 280 (13.20) (13.21) (13.22) (13.23)
where a= 1+a, b = 1a, c = ,8(1+,8), d = 1,8, e = ')'(1+')'), f = 1')', g = ry(1+ry), h = 17], and c Cb edJz + Sp edJy Wp edJx Ep + NP + Ct c ,8 'Y Cb Sp + Wp + Ep NP + Ct c Cb + Sp + Wp + Ep + Np Ct c 7] = The eigenvalues of Q(B)S(B) are >.1(8) = 0 >.2(8) = 0 >.3(8) = 0 >.4(0) = 0 >.5(8) = 1 2 >.6(0) = 1 2 1 + ,82 1 + 'Y2 >.7(8) = 1 "2 (1a+ J(B)(1 +a)) >.s(B) = 1 1 + 7]2 2 (13.24) (13.25) (13.26) (13.27) The results of local mode analysis for the model problems from section 13.2 are shown in tables 13.1 through 13.4. The smoothing factors were computed numerically with the grid spacing h = 1 and the angles Bx, By, and Bz were sampled at two degree increments. Table 13.1 shows the results of the smoothing analysis for pure diffusion type problems. The point GaussSeidel relaxation is a reasonable smoothers for Poisson's equation, but not for anisotropic problems. 281
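The numerical evaluation described above can be illustrated for the simplest case. The following is a minimal sketch, not the code used for the tables: it assumes the standard 7-point central difference stencil with $h = 1$, a lexicographic sweep in which the west, south, and bottom neighbours are already updated, and rough modes taken as those with at least one angle of magnitude 90 degrees or more. The function name `pgs_lex_smoothing_factor` and its parameters are hypothetical.

```python
import numpy as np

def pgs_lex_smoothing_factor(eps=(1.0, 1.0, 1.0), step_deg=2):
    """Smoothing factor of lexicographic point Gauss-Seidel for the 7-point
    discretization of -(eps_x u_xx + eps_y u_yy + eps_z u_zz) with h = 1."""
    ex, ey, ez = eps
    deg = np.arange(-180, 181, step_deg)  # sampled angles in degrees
    dx, dy, dz = np.meshgrid(deg, deg, deg, indexing="ij", sparse=True)
    tx, ty, tz = np.deg2rad(dx), np.deg2rad(dy), np.deg2rad(dz)
    # Symbol of M: centre plus the already updated west, south, bottom neighbours.
    m_hat = (2.0 * (ex + ey + ez)
             - ex * np.exp(-1j * tx)
             - ey * np.exp(-1j * ty)
             - ez * np.exp(-1j * tz))
    # Symbol of N: the not yet updated east, north, top neighbours.
    n_hat = ex * np.exp(1j * tx) + ey * np.exp(1j * ty) + ez * np.exp(1j * tz)
    lam = np.abs(n_hat / m_hat)
    # Rough modes: at least one angle outside the open interval (-90, 90) degrees.
    rough = np.maximum(np.maximum(np.abs(dx), np.abs(dy)), np.abs(dz)) >= 90
    return float(lam[rough].max())
```

Calling `pgs_lex_smoothing_factor()` treats the Poisson problem, and `pgs_lex_smoothing_factor(eps=(0.1, 1.0, 1.0))` treats an anisotropic case of type 2b. A coarser `step_deg` reduces memory use at the cost of a slightly smaller sampled maximum.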
Table 13.1. Smoothing factor $\mu$ for point Gauss-Seidel relaxation in lexicographical (pGSlex) and red/black (r/bpGS) ordering for the indicated anisotropic diffusion problems (see section 13.2) using central finite differences; where $\varepsilon = 10^{-p}$.

problem   p_x  p_y  p_z   pGSlex   r/bpGS
1                         0.5669   0.7182
2b        1    0    0     0.9093   0.9526
          3    0    0     0.9990   0.9993
          5    0    0     0.9999   0.9998
2c        1    1    0     0.8472   0.9187
          3    3    0     0.9980   0.9988
          5    5    0     0.9999   0.9998
2d        2    1    0     0.9821   0.9907
          5    3    0     0.9999   0.9998

Table 13.2. Smoothing factor $\mu$ for point Gauss-Seidel relaxation in lexicographical (pGSlex) and red/black (r/bpGS) ordering for the indicated convection-diffusion problems (see section 13.2) using central and upstream finite differences; where $\varepsilon = 10^{-p}$.

problem   p   pGSlex   r/bpGS
3         0   0.6459   0.7515
          1   0.8742   0.8808
          3   0.9983   0.9978
4         0   0.5550   0.7515
          1   0.5617   0.8808
          3   0.5524   0.9978
5         0   0.6459   0.7515
          1   0.8742   0.8808
          3   0.9983   0.9978
6         0   0.5550   0.7515
          1   0.5617   0.8808
          3   0.5524   0.9978
7         0   0.6459   0.7515
          1   0.8742   0.8808
          3   0.9983   0.9978
8         0   0.5550   0.7515
          1   0.5617   0.8808
          3   0.5524   0.9978
Table 13.3. Smoothing factors for point Gauss-Seidel relaxation with lexicographic and red/black ordering for the indicated convection-diffusion problems (see section 13.2); $\varepsilon = 10^{-p}$.

problem   p   pGSlex   r/bpGS
9         0   0.5460   0.7779
          1   0.5482   0.9247
          3   0.5059   0.9988
10        0   0.6385   0.7779
          1   0.8715   0.9247
          3   0.9980   0.9988
11        0   0.6385   0.7779
          1   0.8715   0.9247
          3   0.9980   0.9988
12        0   0.6999   0.7779
          1   0.9251   0.9247
          3   0.9991   0.9988
13        0   0.5460   0.7779
          1   0.5489   0.9247
          3   0.1312   0.9988
14        0   0.6384   0.7779
          1   0.8715   0.9247
          3   0.9980   0.9988
15        0   0.6385   0.7779
          1   0.8715   0.9247
          3   0.9980   0.9988
16        0   0.6998   0.7779
          1   0.9251   0.9247
          3   0.9991   0.9988
17        0   0.5459   0.7779
          1   0.5484   0.9247
          3   0.5423   0.9988
18        0   0.6385   0.7779
          1   0.8715   0.9247
          3   0.9980   0.9988
19        0   0.6385   0.7779
          1   0.8715   0.9247
          3   0.9980   0.9988
20        0   0.6998   0.7779
          1   0.9251   0.9247
          3   0.9991   0.9988
Table 13.4. Smoothing factor $\mu$ for point Gauss-Seidel relaxation in lexicographical (pGSlex) and red/black (r/bpGS) ordering for the indicated convection-diffusion problems (see section 13.2) using central and upstream finite differences; where $\varepsilon = 10^{-p}$.

problem   p   pGSlex   r/bpGS
21        0   0.4176   0.7161
          1   0.1295   0.7067
          3   0.0017   0.7031
22        0   0.5749   0.7161
          1   0.7970   0.7067
          3   0.9967   0.7031
23        0   0.5749   0.7161
          1   0.7969   0.7067
          3   0.9967   0.7031
24        0   0.6513   0.7161
          1   0.8851   0.7067
          3   0.9984   0.7031
25        0   0.5749   0.7161
          1   0.7969   0.7067
          3   0.9968   0.7031
26        0   0.6513   0.7161
          1   0.8851   0.7067
          3   0.9985   0.7031
27        0   0.6513   0.7161
          1   0.8851   0.7067
          3   0.9984   0.7031
28        0   0.7048   0.7161
          1   0.9199   0.7067
          3   0.9990   0.7031
Tables 13.2 through 13.4 show the results of the smoothing analysis for convection-diffusion problems. Most of the smoothing factors approach one as the convection terms become more dominant, which implies that point Gauss-Seidel is not a good smoother for these types of problems. However, lexicographic point Gauss-Seidel relaxation exhibits good smoothing properties when the convection characteristics coincide with the sweeping direction.

13.4 Local Mode Analysis for Line Gauss-Seidel Relaxation

Line Gauss-Seidel relaxation can be implemented in many ways for three dimensional problems. The lines can run in any of the three axis directions, and the lines of unknowns can be ordered in many ways. Local mode analysis results are presented for lexicographical and zebra (red/black) ordering for x-line Gauss-Seidel relaxation and for alternating line Gauss-Seidel relaxation.

X-line Gauss-Seidel relaxation with lexicographic ordering gives the splitting

$$[M] = \begin{bmatrix} 0 & 0 & 0 \\ 0 & C_b & 0 \\ 0 & 0 & 0 \end{bmatrix}_b \begin{bmatrix} 0 & 0 & 0 \\ W_p & C_p & E_p \\ 0 & S_p & 0 \end{bmatrix}_p \qquad (13.28)$$

$$[N] = \begin{bmatrix} 0 & N_p & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}_p \begin{bmatrix} 0 & 0 & 0 \\ 0 & C_t & 0 \\ 0 & 0 & 0 \end{bmatrix}_t \qquad (13.29)$$

The amplification factor $\lambda(\theta)$ is given by

$$\lambda(\theta) = \frac{\left| N_p e^{i\theta_y} + C_t e^{i\theta_z} \right|}{\left| C_b e^{-i\theta_z} + S_p e^{-i\theta_y} + W_p e^{-i\theta_x} + C_p + E_p e^{i\theta_x} \right|} \qquad (13.30)$$
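The lexicographic x-line splitting can be evaluated numerically for any constant coefficient 7-point stencil. The sketch below is an illustration under stated assumptions rather than the thesis code: the stencil is passed as a dictionary keyed by the stencil names used in this chapter, the helper `diffusion_stencil` (a hypothetical name) builds the central difference stencil for anisotropic diffusion with $h = 1$, and rough modes are those with at least one angle of magnitude 90 degrees or more.

```python
import numpy as np

def diffusion_stencil(eps_x, eps_y, eps_z):
    """7-point central difference stencil for
    -(eps_x u_xx + eps_y u_yy + eps_z u_zz) with h = 1."""
    return {"Cp": 2.0 * (eps_x + eps_y + eps_z),
            "Wp": -eps_x, "Ep": -eps_x,
            "Sp": -eps_y, "Np": -eps_y,
            "Cb": -eps_z, "Ct": -eps_z}

def xline_gs_lex_factor(st, step_deg=2):
    """Smoothing factor of lexicographic x-line Gauss-Seidel for stencil st."""
    deg = np.arange(-180, 181, step_deg)
    dx, dy, dz = np.meshgrid(deg, deg, deg, indexing="ij", sparse=True)
    ex, ey, ez = (np.exp(1j * np.deg2rad(d)) for d in (dx, dy, dz))
    # M: the whole x-line (W, C, E) plus the already updated south and bottom lines.
    m_hat = st["Cb"] / ez + st["Sp"] / ey + st["Wp"] / ex + st["Cp"] + st["Ep"] * ex
    # N: the not yet updated north and top lines.
    n_hat = st["Np"] * ey + st["Ct"] * ez
    lam = np.abs(n_hat / m_hat)
    rough = np.maximum(np.maximum(np.abs(dx), np.abs(dy)), np.abs(dz)) >= 90
    return float(lam[rough].max())
```

For example, `xline_gs_lex_factor(diffusion_stencil(1.0, 1.0, 1.0))` treats the Poisson problem, while `xline_gs_lex_factor(diffusion_stencil(0.1, 1.0, 1.0))` treats a case where the strong coupling lies in the y and z directions, so x-line relaxation is of little help.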
Zebra xline GaussSeidel relaxation has the amplification matrix a 0 a 0 0 0 0 0 0 c 0 0 0 0 0 c b 0 b 0 0 0 0 0 S(O) = 0 0 0 e e 0 0 0 (13.31) 0 0 0 f f 0 0 0 0 0 0 0 0 g g 0 0 0 0 0 0 h h 0 0 d 0 0 0 0 0 d where a= a(1+a), b = a(la), c = ,8(1+,8), d = ,8(1,8), e = 'Y(1+'Y), f = 1'(1')'), g = 77 ( 1 + 77), h = 77 ( 1 77), and a ,B 'T] = Cb edJz + Sp edJy + Np + Ct Wp + Cp + Ep Cb + Sp + Np + Ct Wp + Cp Ep Cb Sp Np + Ct Wp + Cp Ep Cb Sp Np + Ct Wp + Cp + Ep The eigenvalues of Q(O)S(O) are >.1(0) = 0 >.2(0) = 0 >.3(0) = 0 >.4(0) = 0 >.s ( 0) = ,82 >.6 ( 0) = 'Y2 286 (13.32) (13.33) (13.34) (13.35)
>.7(8) = r? 1 >.s(B) = 2a (a1 + J(B)(1 + ry)). The alternating line GaussSeidel relaxation with lexicographic ordering amplification factor >.(B) is given by (13.36) where Axlgs(B), Aylgs(B), and Azlgs(B) are the x, y, and zline GaussSeidel amplification factors respectively, given by Azlgs(B) = ICb ed}z + Wp edJx + Ep edJx + Cp + Sp ed}y I IEp + Ct ICb + Wp + Cp + Sp + Np I I Ep + Np I (13.37) (13.38) (13.39) The zebra alternating line GaussSeidel relaxation amplification matrix S(B) is given by S(B) = Bxz9s(B) Sytgs(B) Bzz9s(B) (13.40) where Sxlgs(B), Sylgs(B), and Sylgs(B) are the zebra x, y, and zline GaussSeidel amplification matrices respectively. The zebra xline GaussSeidel amplification matrix Sxlgs(B) is given in equation (13.31), and the amplification matrices for Sylgs(B) and 287
Bzlgs(B) are given by a 0 0 a 0 0 0 0 0 c 0 0 0 0 c 0 0 0 e 0 e 0 0 0 1 b 0 0 b 0 0 0 0 Syt9s(B) = 2 (13.41) 0 0 f 0 f 0 0 0 0 0 0 0 0 g 0 g 0 d 0 0 0 0 d 0 0 0 0 0 0 h 0 h where a= a(1+a), b = a(la), c = ,8(1+,8), d = ,8(1,8), e = 'Y(1+'Y), f = 1'(1')'), g = 77 ( 1 + 77), h = 77 ( 1 77), and a Cb edlz + Wp edlx + Ep + Ct (13.42) S + C + N p p p ,B Cb + Wp + Ep + Ct (13.43) S + C N p p p 'Y Cb Wp Ep + Ct (13.44) S + C N p p p 'T] Cb Wp Ep + Ct (13.45) S + C + N p p p and a 0 0 0 a 0 0 0 0 c 0 0 0 c 0 0 0 0 e e 0 0 0 0 S(B) = 0 0 f f 0 0 0 0 (13.46) b 0 0 0 b 0 0 0 0 d 0 0 0 d 0 0 0 0 0 0 0 0 g g 0 0 0 0 0 0 h h 288
Table 13.5. Smoothing factor fL for x, y, and zline and alternating line GaussSeidel relaxation in lexicographic ordering, xlGS, ylGS, zlGS, and alGS respectively, for the indicated anisotropic diffusion problems (see section 13.2) using central finite differences; where c = 10P. problem p xlGS ylGS zlGS alGS 1 0.5000 0.5000 0.5000 0.1096 1,0 0.9091 0.8347 0.8347 0.6332 2b 3,0 0.9990 0.9980 0.9980 0.9950 5,0 0.9999 0.9999 0.9999 0.9999 1,0 0.8462 0.8461 0.5000 0.3396 2c 3,0 0.9980 0.9980 0.5000 0.4976 5,0 0.9999 0.9999 0.5000 0.5000 2,1,0 0.9821 0.9804 0.8347 0.8036 2d 5,3,0 0.9999 0.9999 0.9804 0.9804 where a= a(1+a), b = a(la), c = ,8(1+,8), d = ,8(1,8), e = 1'(1+')'), f = 1'(1')'), g = ry(1 + ry), h = ry(1ry), and a Cb edJz + Cp + Ct (13.47) S + lV; + E + N p p p p (13.48) ,8 Cb + Cp Ct s lV; E + N p p p p (13.49) Cb + Cp Ct S lV; E + N p p p p (13.50) cb + Cp + Ct 7] = The zebra alternating line GaussSeidel amplification matrix S(B) can be computed numerically and then its eigenvalues can be found and evaluated on 8.s. The results of local mode analysis for the model problems from section 13.2 are shown in tables 13.5 through 13.8. The smoothing factors were computed numerically with the grid spacing h = 1 and the angles Bx, By, and Bz were sampled at 2 degree increments. Tables 13.6 through 13.8 show the smoothing factors for the convectiondiffusion model problems for lexicographic line GaussSeidel relaxation. The smoothing 289
Table 13.6. Smoothing factor $\mu$ for x-, y-, and z-line and alternating line Gauss-Seidel relaxation in lexicographic ordering, xlGS, ylGS, zlGS, and alGS respectively, for the indicated convection-diffusion problems (see section 13.2) using central and upwind finite differences; where $\varepsilon = 10^{-p}$.

problem   p   xlGS     ylGS     zlGS     alGS
3         0   0.5047   0.6059   0.6059   0.1626
          1   0.5411   0.8751   0.8751   0.3782
          3   0.1250   0.9984   0.9984   0.1241
4         0   0.5047   0.5000   0.4999   0.1049
          1   0.5411   0.4992   0.4992   0.1089
          3   0.1250   0.1119   0.1119   0.0010
5         0   0.6059   0.5047   0.6059   0.1625
          1   0.8751   0.5411   0.8751   0.3783
          3   0.9984   0.5000   0.9984   0.4976
6         0   0.5000   0.5047   0.4999   0.1049
          1   0.5000   0.5411   0.4992   0.1089
          3   0.5000   0.5000   0.4472   0.1041
7         0   0.6059   0.6059   0.5047   0.1625
          1   0.8751   0.8751   0.5411   0.3783
          3   0.9984   0.9984   0.5000   0.4976
8         0   0.5000   0.5000   0.5047   0.1049
          1   0.5000   0.5000   0.5411   0.1085
          3   0.5000   0.5000   0.5000   0.1041
Table 13.7. Smoothing factor for x-, y-, and z-line and alternating line Gauss-Seidel with lexicographic ordering, xlGS, ylGS, zlGS, and alGS respectively, for convection-diffusion problems (see section 13.2); where $\varepsilon = 10^{-p}$.

problem   p   xlGS     ylGS     zlGS     alGS
9         0   0.4680   0.4681   0.4998   0.1011
          1   0.4655   0.4654   0.4995   0.1037
          3   0.4125   0.4125   0.4645   0.0779
10        0   0.6089   0.4680   0.5999   0.1551
          1   0.8778   0.4655   0.8569   0.3397
          3   0.9981   0.4125   0.9977   0.4106
11        0   0.4680   0.6089   0.5999   0.1551
          1   0.4655   0.8778   0.8569   0.3397
          3   0.4125   0.9984   0.9980   0.4106
12        0   0.6089   0.6087   0.6667   0.2190
          1   0.8778   0.8778   0.9167   0.6666
          3   0.9981   0.9984   0.9990   0.9945
13        0   0.4681   0.4999   0.4680   0.1011
          1   0.4648   0.4992   0.4645   0.1039
          3   0.0844   0.1114   0.0846   7.9E-4
14        0   0.6089   0.5999   0.4680   0.1551
          1   0.8778   0.8569   0.4645   0.3393
          3   0.9981   0.9977   0.0846   0.0842
15        0   0.4681   0.6000   0.6088   0.1551
          1   0.4648   0.8571   0.8779   0.3390
          3   0.0844   0.9980   0.9984   0.0841
16        0   0.4681   0.6666   0.6088   0.2190
          1   0.4648   0.9167   0.8779   0.6666
          3   0.0844   0.9990   0.9984   0.9947
17        0   0.5000   0.4681   0.4681   0.1010
          1   0.5000   0.4654   0.4654   0.1034
          3   0.5000   0.4472   0.4472   0.1000
18        0   0.6000   0.6089   0.4681   0.1551
          1   0.8571   0.8778   0.4654   0.3391
          3   0.9980   0.9984   0.4472   0.4454
19        0   0.6000   0.4681   0.6088   0.1551
          1   0.8571   0.4654   0.8779   0.3391
          3   0.9980   0.4472   0.9984   0.4454
20        0   0.6667   0.6089   0.6088   0.2190
          1   0.9167   0.8778   0.8779   0.6665
          3   0.9990   0.9984   0.9984   0.9950
Table 13.8. Smoothing factor $\mu$ for x-, y-, and z-line and alternating line Gauss-Seidel relaxation in lexicographic ordering, xlGS, ylGS, zlGS, and alGS respectively, for the indicated convection-diffusion problems (see section 13.2) using central and upwind finite differences; where $\varepsilon = 10^{-p}$.

problem   p   xlGS     ylGS     zlGS     alGS
21        0   0.3956   0.3956   0.3956   0.0395
          1   0.1302   0.1302   0.1302   9.1E-4
          3   0.0017   0.0017   0.0017   1.8E-9
22        0   0.5699   0.5699   0.3956   0.1195
          1   0.8488   0.8488   0.1302   0.0852
          3   0.9977   0.9977   0.0017   0.0015
23        0   0.5699   0.3956   0.5699   0.1195
          1   0.8489   0.1302   0.8488   0.0853
          3   0.9977   0.0017   0.9977   0.0015
24        0   0.6686   0.5699   0.5699   0.1662
          1   0.9175   0.8488   0.8488   0.2927
          3   0.9989   0.9977   0.9977   0.3422
25        0   0.3956   0.5699   0.5699   0.1194
          1   0.1302   0.8488   0.8488   0.0853
          3   0.0017   0.9980   0.9980   0.0015
26        0   0.5699   0.6686   0.5699   0.1662
          1   0.8488   0.9175   0.8488   0.2928
          3   0.9977   0.9990   0.9980   0.3424
27        0   0.5699   0.5699   0.6686   0.1663
          1   0.8489   0.8489   0.9174   0.2928
          3   0.9977   0.9980   0.9990   0.3424
28        0   0.6686   0.6686   0.6686   0.2122
          1   0.9175   0.9175   0.9174   0.3612
          3   0.9989   0.9990   0.9990   0.3949
Table 13.9. Smoothing factor $\mu$ for zebra x-, y-, and z-line and alternating line Gauss-Seidel relaxation, ZxlGS, ZylGS, ZzlGS, and ZalGS respectively, for the indicated anisotropic diffusion problems (see section 13.2) using central finite differences; where $\varepsilon = 10^{-p}$.

problem   p       ZxlGS    ZylGS    ZzlGS    ZalGS
1                 0.4444   0.4444   0.4444   0.0278
2b        1,0     0.9070   0.8264   0.8264   0.6195
          3,0     0.9990   0.9980   0.9980   0.9950
          5,0     0.9999   0.9999   0.9999   0.9999
2c        1,0     0.8403   0.8403   0.2500   0.1736
          3,0     0.9980   0.9980   0.2500   0.2490
          5,0     0.9999   0.9999   0.2500   0.2500
2d        2,1,0   0.9821   0.9803   0.8264   0.7956
          5,3,0   0.9999   0.9999   0.9803   0.9803

factors for line relaxation are good when the convection term characteristics are in the same direction as the lines. The smoothing factor becomes better (smaller) the more the convection terms dominate if the characteristics are in the direction of the lines. If the characteristics are not in the direction of the lines, then the smoothing factor degenerates, quickly approaching one the more the convection terms dominate.

13.5 Local Mode Analysis for Plane Gauss-Seidel Relaxation

We analyze plane Gauss-Seidel relaxation for xy-plane and alternating plane relaxation using lexicographic and zebra ordering.

XY-plane Gauss-Seidel relaxation with lexicographic ordering gives the splitting

$$[M] = \begin{bmatrix} 0 & 0 & 0 \\ 0 & C_b & 0 \\ 0 & 0 & 0 \end{bmatrix}_b \begin{bmatrix} 0 & N_p & 0 \\ W_p & C_p & E_p \\ 0 & S_p & 0 \end{bmatrix}_p \qquad (13.51)$$
Table 13.10. Smoothing factor $\mu$ for zebra x-, y-, and z-line and alternating line Gauss-Seidel relaxation, ZxlGS, ZylGS, ZzlGS, and ZalGS respectively, for the indicated convection-diffusion problems (see section 13.2) using central and upwind finite differences; where $\varepsilon = 10^{-p}$.

problem   p   ZxlGS    ZylGS    ZzlGS    ZalGS
3         0   0.3200   0.5102   0.5102   0.0459
          1   0.2500   0.7656   0.7656   0.1406
          3   0.2500   0.9960   0.9960   0.2480
4         0   0.3200   0.5102   0.5102   0.0459
          1   0.2500   0.7656   0.7656   0.1406
          3   0.2500   0.9960   0.9960   0.2480
5         0   0.5102   0.3200   0.5102   0.0459
          1   0.7656   0.2500   0.7656   0.1406
          3   0.9960   0.2500   0.9960   0.2480
6         0   0.5102   0.3200   0.5102   0.0459
          1   0.7656   0.2500   0.7656   0.1406
          3   0.9960   0.2500   0.9960   0.2480
7         0   0.5102   0.5102   0.3200   0.0459
          1   0.7656   0.7656   0.2500   0.1406
          3   0.9960   0.9960   0.2500   0.2480
8         0   0.5102   0.5102   0.3200   0.0459
          1   0.7656   0.7656   0.2500   0.1406
          3   0.9960   0.9960   0.2500   0.2480
Table 13.11. Smoothing factor for zebra x-, y-, and z-line and alternating line Gauss-Seidel relaxation, ZxlGS, ZylGS, ZzlGS, and ZalGS respectively, for convection-diffusion problems (see section 13.2); where $\varepsilon = 10^{-p}$.

problem   p   ZxlGS    ZylGS    ZzlGS    ZalGS
9         0   0.3846   0.3846   0.5625   0.0729
          1   0.7347   0.7347   0.8521   0.4599
          3   0.9960   0.9960   0.9980   0.9901
10        0   0.3846   0.3846   0.5625   0.0729
          1   0.7347   0.7347   0.8521   0.4599
          3   0.9960   0.9960   0.9980   0.9901
11        0   0.3846   0.3846   0.5625   0.0729
          1   0.7347   0.7347   0.8521   0.4599
          3   0.9960   0.9960   0.9980   0.9901
12        0   0.3846   0.3846   0.5625   0.0729
          1   0.7347   0.7347   0.8521   0.4599
          3   0.9960   0.9960   0.9980   0.9901
13        0   0.3846   0.5625   0.3846   0.0729
          1   0.7347   0.8521   0.7347   0.4599
          3   0.9960   0.9980   0.9960   0.9901
14        0   0.3846   0.5625   0.3846   0.0729
          1   0.7347   0.8521   0.7347   0.4599
          3   0.9960   0.9980   0.9960   0.9901
15        0   0.3846   0.5625   0.3846   0.0729
          1   0.7347   0.8521   0.7347   0.4599
          3   0.9960   0.9980   0.9960   0.9901
16        0   0.3846   0.5625   0.3846   0.0729
          1   0.7347   0.9127   0.7347   0.4599
          3   0.9960   0.9980   0.9960   0.9901
17        0   0.5625   0.3846   0.3846   0.0729
          1   0.8521   0.7347   0.7347   0.4599
          3   0.9980   0.9960   0.9960   0.9901
18        0   0.5625   0.3846   0.3846   0.0729
          1   0.8521   0.7347   0.7347   0.4599
          3   0.9980   0.9960   0.9960   0.9901
19        0   0.5625   0.3846   0.3846   0.0729
          1   0.8521   0.7347   0.7347   0.4599
          3   0.9980   0.9960   0.9980   0.9901
20        0   0.5625   0.3846   0.3846   0.0729
          1   0.8521   0.7347   0.7347   0.4599
          3   0.9980   0.9960   0.9980   0.9901
Table 13.12. Smoothing factor $\mu$ for zebra x-, y-, and z-line and alternating line Gauss-Seidel relaxation, ZxlGS, ZylGS, ZzlGS, and ZalGS respectively, for the indicated convection-diffusion problems (see section 13.2) using central and upwind finite differences; where $\varepsilon = 10^{-p}$.

problem   p   ZxlGS    ZylGS    ZzlGS    ZalGS
21        0   0.4390   0.4390   0.4390   0.0342
          1   0.6944   0.6944   0.6944   0.0968
          3   0.9960   0.9960   0.9960   0.1553
22        0   0.4390   0.4390   0.4390   0.0342
          1   0.6944   0.6944   0.6944   0.0968
          3   0.9960   0.9960   0.9960   0.1553
23        0   0.4390   0.4390   0.4390   0.0342
          1   0.6944   0.6944   0.6944   0.0968
          3   0.9960   0.9960   0.9960   0.1553
24        0   0.4390   0.4390   0.4390   0.0342
          1   0.6944   0.6944   0.6944   0.0968
          3   0.9960   0.9960   0.9960   0.1553
25        0   0.4390   0.4390   0.4390   0.0342
          1   0.6900   0.6944   0.6944   0.0968
          3   0.9960   0.9960   0.9960   0.1553
26        0   0.4390   0.4390   0.4390   0.0342
          1   0.6944   0.6944   0.6944   0.0968
          3   0.9960   0.9960   0.9960   0.1553
27        0   0.4390   0.4390   0.4390   0.0342
          1   0.6944   0.6944   0.6944   0.0968
          3   0.9960   0.9960   0.9960   0.1553
28        0   0.4390   0.4390   0.4390   0.0342
          1   0.6944   0.6944   0.6944   0.0968
          3   0.9960   0.9960   0.9960   0.1553
0 [N] = 0 0 0 0 0 0 0 0 0 b The amplification factor ..\(0) is given by p 0 0 Ct 0 0 Zebra xyplane GaussSeidel relaxation has the amplification matrix a 0 0 0 0 a 0 0 0 c 0 0 c 0 0 0 0 0 e 0 0 0 e 0 S(O) = 0 0 0 g 0 0 0 g 0 d 0 0 d 0 0 0 b 0 0 0 0 b 0 0 0 0 f 0 0 0 f 0 0 0 0 h 0 0 0 h (13.52) (13.53) (13.54) where a= a(1+a), b = a(la), c = ,8(1+,8), d = ,8(1,8), e = ')'(1+')'), f = ')'(1')'), g = TJ(1 + TJ), h = TJ(1 TJ), and ,8 1] = Sp ed}y + Wp edJx Cp + Ep + Np Cb Ct The eigenvalues of Q(O)S(O) are 297 (13.55) (13.56) (13.57) (13.58)
.A2(B) = 0 .A3(B) = 0 .A4(B) = 0 .A5(B) = {32 .A6(B) = 'Y2 .A7(B) = 'f/2 As(B) = 1 2a (a1 + J(B)(1 + ry)). The alternating plane GaussSeidel relaxation with lexicographic ordering has the amplification factor .A( B) given by (13.59) where Axypgs(B), Ayzpgs(B), and Axzpgs(B) are the xy, yz, and xzplane GaussSeidel amplification factors respectively. The xyplane GaussSeidel amplification factor is given by equation (13.53), and the others by Ayzpgs(B) Axzpgs(B) = The zebra alternating plane GaussSeidel relaxation amplification matrix S(B) is given by S(B) = Bxypgs(B) Byzpgs(B) Bxzpgs(B) (13.62) where Bxypgs(B), Syzpgs(B), and Bxzpgs(B) are the zebra xy, yz, and xzplane GaussSeidel amplification matrices respectively. The zebra xyplane GaussSeidel amplification matrix Bxypgs(B) was given in equation (13.54), and the amplification matrices for 298
Syzpgs ( 0) and Bxzpgs ( 0) are given by a 0 0 0 0 0 0 a 0 c c 0 0 0 0 0 0 d d 0 0 0 0 0 1 0 0 0 e 0 e 0 0 Syzpgs(O) = 2 (13.63) 0 0 0 0 g 0 g 0 0 0 0 f 0 f 0 0 0 0 0 0 h 0 h 0 b 0 0 0 0 0 0 b where a= a(1+a), b = a(la), c = ,8(1+,8), d = ,8(1,8), e = ')'(1+')'), f = ')'(1')'), g = TJ(1 + TJ), h = TJ(1 TJ), and a IWp edlx Ep I (13.64) ICb + Sp + Cp + Np + Ct I' ,8 IWp Ep I (13.65) ICb + Sp Cp + Np + Ct I' 'Y IWp Ep I (13.66) ICb Sp CpNp + Ct I' 1] IWp + Ep I (13.67) ICb Sp + CpNp + Ct eOz I and a 0 0 0 0 0 a 0 0 c 0 c 0 0 0 0 0 0 e 0 0 e 0 0 1 0 d 0 d 0 0 0 0 Bxzpgs(O) = 2 (13.68) 0 0 0 0 g 0 0 g 0 0 f 0 0 f 0 0 b 0 0 0 0 0 b 0 0 0 0 0 h 0 0 h 299
Table 13.13. Smoothing factor 11 for xy, xz, yz, and alternating plane GaussSeidel relaxation in lexicographic ordering, xyplGS, xzplGS, yzplGS, and AplGS respectively, for the indicated anisotropic diffusion problems (see section 13.2) using central finite differences; where c = 10P. problem p xyplGS xzplGS yzplGS AplGS 1 0.4472 0.4472 0.4472 0.0497 1,0 0.8333 0.8333 0.4472 0.3106 2b 3,0 0.9980 0.9980 0.4472 0.4454 5,0 0.9999 0.9999 0.4472 0.4472 1,0 0.8333 0.4472 0.4472 0.1242 2c 3,0 0.9980 0.4472 0.4472 0.1488 5,0 0.9999 0.4472 0.4472 0.1491 2,1,0 0.9980 0.8333 0.4472 0.3654 2d 5,3,0 0.9999 0.9804 0.4472 0.4384 where a= a(1+a), b = a(la), c = ,8(1+,8), d = ,8(1,8), e = 1'(1+')'), f = 1'(1')'), g = ry(1 + ry), h = ry(1ry), and ,8 7] = ICb ed}z + Wp edJx + Cp + Ep edJx + Ct I' I Sp edJy NP I ICb + Wp Cp + Ep + Ct I' I Sp Np I ICb Wp CpEp + Ct I' ISp + Np I ICb Wp + CpEp + Ct I' (13.69) (13.70) (13.71) (13.72) The zebra alternating plane GaussSeidel amplification matrix S(O) can be computed and the eigenvalues can then be found and evaluated on B.s. The results of local mode analysis for the model problems from section 13.2 are shown in tables 13.13 through 13.20. The smoothing factors were computed numerically with the grid spacing h = 1 and the angles Ox, Oy, and (}z were sampled at 2 degree increments. Table 13.13 shows the results of the local mode analysis for diffusion problems from section 13.2. We can see that single plane relaxation does not yield good 300
Table 13.14. Smoothing factor $\mu$ for xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation in lexicographic ordering, xyplGS, xzplGS, yzplGS, and AplGS respectively, for the indicated convection-diffusion problems (see section 13.2) using central and upstream finite differences; where $\varepsilon = 10^{-p}$.

problem   p   xyplGS   xzplGS   yzplGS   AplGS
3         0   0.4535   0.4535   0.6323   0.0750
          1   0.4878   0.4878   0.9135   0.1335
          3   0.5000   0.5000   0.9990   0.0035
4         0   0.4535   0.4535   0.3333   0.0500
          1   0.4878   0.4878   0.3333   0.0514
          3   0.5000   0.5000   0.3333   2.0E-4
5         0   0.4535   0.6323   0.4535   0.0751
          1   0.4878   0.9135   0.4878   0.1335
          3   0.5000   0.9990   0.5000   0.1488
6         0   0.4535   0.3333   0.4535   0.0500
          1   0.4878   0.3333   0.4878   0.0516
          3   0.5000   0.3333   0.5000   0.0497
7         0   0.6323   0.4535   0.4535   0.0751
          1   0.9135   0.4878   0.4878   0.1329
          3   0.9990   0.5000   0.5000   0.1488
8         0   0.3333   0.4535   0.4535   0.0500
          1   0.3333   0.4878   0.4878   0.0517
          3   0.3333   0.5000   0.5000   0.0497
Table 13.15. Smoothing factor for xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation with lexicographic ordering, xyplGS, xzplGS, yzplGS, and AplGS respectively, for convection-diffusion problems (see section 13.2); where $\varepsilon = 10^{-p}$.

problem   p   xyplGS   xzplGS   yzplGS   AplGS
9         0   0.4587   0.3333   0.3333   0.0502
          1   0.4918   0.3333   0.3333   0.0518
          3   0.5000   0.3333   0.3333   0.0364
10        0   0.4587   0.6355   0.3333   0.0754
          1   0.4918   0.9147   0.3333   0.1336
          3   0.5000   0.9987   0.3333   0.1201
11        0   0.4587   0.3333   0.6355   0.0754
          1   0.4918   0.3333   0.9147   0.1336
          3   0.5000   0.3333   0.9987   0.1201
12        0   0.4587   0.6355   0.6355   0.1136
          1   0.4918   0.9147   0.9147   0.3507
          3   0.5000   0.9987   0.9987   0.3961
13        0   0.3333   0.4587   0.3333   0.0502
          1   0.3333   0.4918   0.3333   0.0518
          3   0.3333   0.5000   0.3333   2.0E-4
14        0   0.6355   0.4587   0.3333   0.0754
          1   0.9147   0.4918   0.3333   0.1340
          3   0.9987   0.5000   0.3333   0.0034
15        0   0.3333   0.4587   0.6355   0.0754
          1   0.3333   0.4918   0.9147   0.1338
          3   0.3333   0.5000   0.9987   0.0034
16        0   0.6355   0.4587   0.6355   0.1136
          1   0.9147   0.4918   0.9147   0.3520
          3   0.9987   0.5000   0.9987   0.0060
17        0   0.3333   0.3333   0.4587   0.0501
          1   0.3333   0.3333   0.4918   0.0516
          3   0.3333   0.3333   0.5000   0.0497
18        0   0.6355   0.3333   0.4587   0.0754
          1   0.9147   0.3333   0.4918   0.1338
          3   0.9987   0.3333   0.5000   0.1488
19        0   0.3333   0.6355   0.4587   0.0754
          1   0.3333   0.9147   0.4918   0.1338
          3   0.3333   0.9987   0.5000   0.1488
20        0   0.6355   0.6355   0.4587   0.1136
          1   0.9147   0.9147   0.4918   0.3508
          3   0.9987   0.9987   0.5000   0.4454
Table 13.16. Smoothing factor $\mu$ for xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation in lexicographic ordering, xyplGS, xzplGS, yzplGS, and AplGS respectively, for the indicated convection-diffusion problems (see section 13.2) using central and upstream finite differences; where $\varepsilon = 10^{-p}$.

problem   p   xyplGS   xzplGS   yzplGS   AplGS
21        0   0.2852   0.2852   0.2852   0.0170
          1   0.0713   0.0713   0.0713   3.4E-4
          3   8.6E-4   8.6E-4   8.6E-4   6.3E-10
22        0   0.6382   0.2852   0.2852   0.0373
          1   0.9150   0.0713   0.0713   0.0038
          3   0.9987   8.6E-4   8.6E-4   6.2E-7
23        0   0.2852   0.6382   0.2852   0.0373
          1   0.0713   0.9150   0.0713   0.0038
          3   8.6E-4   0.9987   8.6E-4   6.2E-7
24        0   0.6382   0.6382   0.2852   0.0603
          1   0.9150   0.9150   0.0713   0.0256
          3   0.9987   0.9987   8.6E-4   3.5E-4
25        0   0.2852   0.2852   0.6382   0.0373
          1   0.0713   0.0713   0.9150   0.0038
          3   8.6E-4   8.6E-4   0.9987   6.2E-7
26        0   0.6382   0.2852   0.6382   0.0603
          1   0.9150   0.0713   0.9150   0.0256
          3   0.9987   8.6E-4   0.9987   3.5E-4
27        0   0.2852   0.6382   0.6382   0.0603
          1   0.0713   0.9150   0.9150   0.0256
          3   8.6E-4   0.9987   0.9987   3.5E-4
28        0   0.6382   0.6382   0.6382   0.0978
          1   0.9150   0.9150   0.9150   0.1760
          3   0.9987   0.9987   0.9987   0.2020
Table 13.17. Smoothing factor $\mu$ for zebra xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation, ZxyplGS, ZxzplGS, ZyzplGS, and AZplGS respectively, for the indicated anisotropic diffusion problems (see section 13.2) using central finite differences; where $\varepsilon = 10^{-p}$.

problem   p       ZxyplGS   ZxzplGS   ZyzplGS   AZplGS
1                 0.2500    0.2500    0.2500    0.0037
2b        1       0.8264    0.8264    0.1250    0.0719
          3       0.9980    0.9980    0.1249    0.1240
          5       0.9999    0.9999    0.0154    0.0154
2c        1       0.8264    0.2500    0.2500    0.0160
          3       0.9980    0.2500    0.2500    0.0175
          5       0.9999    0.2500    0.2500    8.4E-4
2d        2,1,0   0.9803    0.8264    0.1250    0.0915
          5,3,0   0.9999    0.9803    0.0289    0.0210

smoothing factors when the dominant plane is not the plane of the relaxation method.

Tables 13.14 through 13.16 show local mode smoothing factors for convection-diffusion problems from section 13.2. Good smoothing factors are obtained in most cases, except when the sweeping direction is against the convection flow.

Table 13.17 shows the results of the local mode analysis for diffusion problems from section 13.2. Relaxation in a single plane with zebra ordering does not give a good smoothing factor unless the problem is either isotropic or the anisotropies lie in the plane of relaxation.

Tables 13.18 through 13.20 show local mode smoothing factors for convection-diffusion problems from section 13.2. Good smoothing factors are obtained if the convection lies in the plane of relaxation; otherwise the smoothing factor degrades quickly as the convection terms increasingly dominate the diffusion term.
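The lexicographic plane results for the diffusion problems can be sketched numerically as follows. This is an illustration under stated assumptions, not the thesis code: it uses the 7-point central difference stencil with $h = 1$, takes the in-plane neighbours and the already updated neighbouring plane into $M$ and the not yet updated neighbouring plane into $N$, and assumes that the alternating plane factor is the product of the three single plane factors at each frequency. The function name `plane_gs_lex_factors` is hypothetical.

```python
import numpy as np

def plane_gs_lex_factors(eps=(1.0, 1.0, 1.0), step_deg=2):
    """Smoothing factors of lexicographic xy-, yz-, xz-plane and alternating
    plane Gauss-Seidel for -(eps_x u_xx + eps_y u_yy + eps_z u_zz), h = 1."""
    deg = np.arange(-180, 181, step_deg)
    d = np.meshgrid(deg, deg, deg, indexing="ij", sparse=True)
    t = [np.deg2rad(a) for a in d]
    centre = 2.0 * sum(eps)
    # Symbol of the two neighbours along each axis: -eps * (e^{it} + e^{-it}).
    pair = [2.0 * e * np.cos(a) for e, a in zip(eps, t)]
    lam = {}
    # Each plane is identified by the axis normal to it (0 = x, 1 = y, 2 = z).
    for name, normal in (("xy", 2), ("yz", 0), ("xz", 1)):
        in_plane = sum(pair[a] for a in range(3) if a != normal)
        m_hat = centre - in_plane - eps[normal] * np.exp(-1j * t[normal])
        n_hat = eps[normal] * np.exp(1j * t[normal])
        lam[name] = np.abs(n_hat / m_hat)
    # Assumption: alternating relaxation multiplies the three factors.
    lam["alt"] = lam["xy"] * lam["yz"] * lam["xz"]
    rough = np.maximum(np.maximum(np.abs(d[0]), np.abs(d[1])), np.abs(d[2])) >= 90
    return {name: float(v[rough].max()) for name, v in lam.items()}
```

For the Poisson problem, `plane_gs_lex_factors()` returns one factor per plane orientation together with the alternating factor under the key `"alt"`. Passing, for example, `eps=(0.1, 1.0, 1.0)` shows how a single plane orientation degrades when the strong coupling leaves its plane.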
Table 13.18. Smoothing factor $\mu$ for zebra xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation, ZxyplGS, ZxzplGS, ZyzplGS, and AZplGS respectively, for the indicated convection-diffusion problems (see section 13.2) using central and upstream finite differences; where $\varepsilon = 10^{-p}$.

problem   p   ZxyplGS   ZxzplGS   ZyzplGS   AZplGS
3         0   0.2500    0.2500    0.3600    0.0057
          1   0.2500    0.2500    0.7347    0.0279
          3   0.2500    0.2500    0.9960    0.0032
4         0   0.2500    0.2500    0.3600    0.0057
          1   0.2500    0.2500    0.7347    0.0279
          3   0.2500    0.2500    0.9960    0.0032
5         0   0.2500    0.3600    0.2500    0.0057
          1   0.2500    0.7347    0.2500    0.0279
          3   0.2500    0.9960    0.2500    0.0032
6         0   0.2500    0.3600    0.2500    0.0057
          1   0.2500    0.7347    0.2500    0.0279
          3   0.2500    0.9960    0.2500    0.0032
7         0   0.3600    0.2500    0.2500    0.0057
          1   0.7347    0.2500    0.2500    0.0279
          3   0.9960    0.2500    0.2500    0.0032
8         0   0.3600    0.2500    0.2500    0.0057
          1   0.7347    0.2500    0.2500    0.0279
          3   0.9960    0.2500    0.2500    0.0032
Table 13.19. Smoothing factor for zebra xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation, ZxyplGS, ZxzplGS, ZyzplGS, and AZplGS respectively, for convection-diffusion problems (see section 13.2); where $\varepsilon = 10^{-p}$.

problem   p   ZxyplGS   ZxzplGS   ZyzplGS   AZplGS
9         0   0.1542    0.3600    0.3600    0.0087
          1   0.2355    0.7347    0.7347    0.0729
          3   0.1250    0.9960    0.9960    0.1233
10        0   0.1542    0.3600    0.3600    0.0087
          1   0.2355    0.7347    0.7347    0.0729
          3   0.1250    0.9960    0.9960    0.1233
11        0   0.1542    0.3600    0.3600    0.0087
          1   0.2355    0.7347    0.7347    0.0729
          3   0.1250    0.9960    0.9960    0.1233
12        0   0.1542    0.3600    0.3600    0.0087
          1   0.2355    0.7347    0.7347    0.0729
          3   0.1250    0.9960    0.9960    0.1233
13        0   0.3600    0.1542    0.3600    0.0087
          1   0.7347    0.2355    0.7347    0.0729
          3   0.9960    0.1250    0.9960    0.1233
14        0   0.3600    0.1542    0.3600    0.0087
          1   0.7347    0.2355    0.7347    0.0729
          3   0.9960    0.1250    0.9960    0.1233
15        0   0.3600    0.1542    0.3600    0.0087
          1   0.7347    0.2355    0.7347    0.0729
          3   0.9960    0.1250    0.9960    0.1233
16        0   0.3600    0.1542    0.3600    0.0087
          1   0.7347    0.2355    0.7347    0.0729
          3   0.9960    0.1250    0.9960    0.1233
17        0   0.3600    0.3600    0.1542    0.0087
          1   0.7347    0.7347    0.2358    0.0729
          3   0.9960    0.9960    0.1250    0.1233
18        0   0.3600    0.3600    0.1542    0.0087
          1   0.7347    0.7347    0.2355    0.0729
          3   0.9960    0.9960    0.1250    0.1233
19        0   0.3600    0.3600    0.1542    0.0087
          1   0.7347    0.7347    0.2355    0.0729
          3   0.9960    0.9960    0.1250    0.1233
20        0   0.3600    0.3600    0.1542    0.0087
          1   0.7347    0.7347    0.2355    0.0729
          3   0.9960    0.9960    0.1250    0.1233
Table 13.20. Smoothing factor $\mu$ for zebra xy-, xz-, yz-, and alternating plane Gauss-Seidel relaxation, ZxyplGS, ZxzplGS, ZyzplGS, and AZplGS respectively, for the indicated convection-diffusion problems (see section 13.2) using central and upstream finite differences; where $\varepsilon = 10^{-p}$.

problem   p   ZxyplGS   ZxzplGS   ZyzplGS   AZplGS
21        0   0.2436    0.2436    0.2436    0.0100
          1   0.6944    0.6944    0.6944    0.0323
          3   0.9960    0.9960    0.9960    0.0420
22        0   0.2436    0.2436    0.2436    0.0100
          1   0.6944    0.6944    0.6944    0.0323
          3   0.9960    0.9960    0.9960    0.0420
23        0   0.2436    0.2436    0.2436    0.0100
          1   0.6944    0.6944    0.6944    0.0323
          3   0.9960    0.9960    0.9960    0.0420
24        0   0.2436    0.2436    0.2436    0.0100
          1   0.6944    0.6944    0.6944    0.0323
          3   0.9960    0.9960    0.9960    0.0420
25        0   0.2436    0.2436    0.2436    0.0100
          1   0.6944    0.6944    0.6944    0.0323
          3   0.9960    0.9960    0.9960    0.0420
26        0   0.2436    0.2436    0.2436    0.0100
          1   0.6944    0.6944    0.6944    0.0323
          3   0.9960    0.9960    0.9960    0.0420
27        0   0.2436    0.2436    0.2436    0.0100
          1   0.6944    0.6944    0.6944    0.0323
          3   0.9960    0.9960    0.9960    0.0420
28        0   0.2436    0.2436    0.2436    0.0100
          1   0.6944    0.6944    0.6944    0.0323
          3   0.9960    0.9960    0.9960    0.0420
CHAPTER 14
3D VECTOR ALGORITHM CONSIDERATIONS

Most of the vectorization issues for the black box multigrid algorithms were covered in chapter 6. We mention here only those issues that are either new or different from those for the 2D algorithms.

14.1 3D Smoother

There are several choices for the smoother, but only alternating plane relaxation gives a robust smoother. Point relaxation is possible, but it is useful only for isotropic diffusion equations. Line relaxation is also possible, but only alternating line relaxation is useful, and even it can perform poorly if the convection characteristics do not align with the grid, either because they are skewed or because they vary.

The robust 3D smoother of choice is alternating red/black plane Gauss-Seidel relaxation, where the plane solves are computed using the 2D black box multigrid method.

On the vector (and sequential) computers a loop was set up to cycle through all of the red planes, followed by another loop for all of the black planes. The 2D solver was called each time a plane needed to be solved. We had a choice of either writing three 2D black box multigrid solvers, one for each of the three plane data structures (xy-, yz-, and xz-planes respectively), or using only one 2D black box multigrid solver and transferring the data back and forth between the plane data structures and that of the 2D solver. We chose the latter approach, using the 2D vector codes that were developed earlier in this thesis for xy-plane oriented data.
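The loop structure just described can be sketched as follows. This is a schematic illustration only, not the thesis code: the callback `solve_plane` is a hypothetical stand-in for the 2D black box multigrid solver, the array layout `u[i, j, k]`, the choice of even-indexed planes as red, and the order xy, yz, xz are all assumptions. The copy into a contiguous 2D array mirrors the transfer of data between the plane data structures and the single 2D solver.

```python
import numpy as np

def red_black_plane_sweep(u, solve_plane, axis):
    """One red/black plane Gauss-Seidel sweep over the planes normal to axis."""
    n = u.shape[axis]
    for start in (0, 1):  # all red planes first, then all black planes
        for k in range(start, n, 2):
            # Copy the plane into the layout expected by the single 2D solver.
            plane = np.ascontiguousarray(np.take(u, k, axis=axis))
            new_plane = solve_plane(plane, k, axis)
            # Copy the result back into the 3D data structure.
            index = [slice(None)] * 3
            index[axis] = k
            u[tuple(index)] = new_plane
    return u

def alternating_plane_relaxation(u, solve_plane):
    """One alternating sweep: xy-planes, then yz-planes, then xz-planes."""
    for axis in (2, 0, 1):  # axis normal to the xy-, yz-, and xz-planes
        red_black_plane_sweep(u, solve_plane, axis)
    return u
```

In an actual smoother the callback would assemble the plane's right hand side from the current values in the two neighbouring planes before calling the 2D solver; the plane index and axis are passed to it for that purpose.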
14.2 Data Structures and Memory

For the 3D grid levels we need storage for the grid equations (unknowns, coefficients, and right hand side), the grid transfer operators, and temporary work space. Let $N_x$, $N_y$, and $N_z$ be the number of grid points in the x-, y-, and z-directions respectively. We can compute how much storage will be needed by adding up the amount used for each grid point. We need 27 locations for the coefficient matrix and 1 each for the unknown and the right hand side. For the standard coarsening method we need 52 locations for the coefficients of both grid transfer operators, and another 5 for temporary work. For the semicoarsening method we need 4 locations for the coefficients of both grid transfer operators, and another 2 for temporary work. We ignore the storage for the coarsest grid direct solver because it remains constant and is small compared to the rest. This means that we need 86 locations per grid point for standard coarsening and 35 for semicoarsening. However, grid transfer coefficients are not stored on the finest grid, so we can subtract 52 and 4 locations per fine grid point from the totals for the standard and semicoarsening methods respectively. The amount of storage required for the 3D data structures, excluding storage for the plane solves, is

$$\left[ 86 \left( 1 + \frac{1}{8} + \frac{1}{64} + \cdots \right) - 52 \right] N_x N_y N_z \le \frac{324}{7} N_x N_y N_z, \qquad (14.1)$$

$$\left[ 35 \left( 1 + \frac{1}{2} + \frac{1}{4} + \cdots \right) - 4 \right] N_x N_y N_z \le 66 N_x N_y N_z \qquad (14.2)$$

for the standard and semicoarsening methods respectively. If we have only a 7-point operator on the finest grid, we do not need to store the other 20 coefficients, and the storage requirements become $\frac{184}{7} N_x N_y N_z$ and $46 N_x N_y N_z$ for the standard and semicoarsening methods respectively.

We now need to address how to handle the storage for the plane solves. Recall that the storage for the 2D black box multigrid method (xy-plane) is
24 NxNy and 30 NxNy for the standard and semi-coarsening methods respectively for 9-point fine grid operators; for a 5-point fine grid operator both are reduced by 4 NxNy.

There are two obvious ways to handle the plane solves for the smoother. We call the first the small storage scheme (SSS) and the second the full storage scheme (FSS). The SSS code requires storage only for the 3D data plus enough additional storage for the largest plane on the finest 3D grid to be solved using 2D black box multigrid. The FSS requires storage for the 3D data plus storage for all the planes solved by 2D black box multigrid on all the 3D grid levels.

To understand how much storage is required by the FSS it is useful to refer to figure 14.1. The shading with arrows pointing from a higher level down to a lower level indicates the subpartitioning of the higher level. The top line of figure 14.1 refers to the data partitioning of the 3D grid levels, where grid level m is the finest grid. This is analogous to the 2D data structures used earlier: a large array is partitioned to store each array needed for each of the 3D grid levels. The second line of figure 14.1 refers to the storage for the three groups of plane solves needed by the smoother to perform alternating red/black plane Gauss-Seidel relaxation. The third line of figure 14.1 indicates the number of plane data structures needed for one red/black plane Gauss-Seidel relaxation. The fourth line of figure 14.1 is the data partitioning of the 2D grid levels for a single plane solve by black box multigrid. The total storage for the FSS is

[...]

Note that the FSS has included storage on a 3D grid level for all the grid operator coefficients used in the plane solves, but this data is just a copy of the
[Figure 14.1: 3D full storage scheme memory data structure. Rows from top to bottom: 3D grid levels (3D grid m, 3D grid m-1, ..., 3D grid 1); plane groups (XY planes, YZ planes, XZ planes); xy-planes (k = 1, k = 2, ..., k = nz); 2D grid levels (2D grid m2, 2D grid m2-1, ..., 2D grid 1).]
3D grid operator coefficients on a plane. We can eliminate this duplicate storage of identical data. However, this means that we have to add back enough storage for the largest plane to hold the 2D multigrid data structures. In addition, we have to add routines that copy the 3D grid operator coefficients and all the saved 2D coarse grid operator coefficients for a given plane into the newly added storage each time a plane needs to be solved for the smoother. We call this method the nearly full storage scheme (NFSS), which requires

[...] NxNyNz + 24 MAX{NxNy, NyNz, NxNz}

storage, while the SSS requires

[...]

storage, where MAX{ } stands for the maximum function. The NFSS takes only about 74% more storage than the SSS, and 60% less than the FSS.

The NFSS is even more attractive when we compare its computing time with that of the SSS. The SSS does not save the 2D grid transfer or coarse grid operators, so it must repeat the setup of these operators every time a plane is solved. The 2D setup time is between one and two times the execution time of one V(1,1)-cycle, so the SSS smoother, which performs the plane solves, should take about twice as long as the NFSS smoother. Because of vectorization issues the NFSS smoother does not quite perform twice as fast as the SSS smoother, and since the smoother is only one of the multigrid components we see a speedup factor of only about 1.5 for NFSS over SSS. NFSS will not be quite as fast as FSS because of the additional time needed to copy the grid operator coefficients, but this time should be fairly small if not negligible. The NFSS approach is clearly the winning strategy because it offers the best balance between storage and speed: it needs far less storage than the FSS while running nearly as fast, and it runs considerably faster than the SSS.
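The per-point storage counts in equations (14.1) and (14.2) follow from summing a geometric series over the grid levels and then removing the grid transfer coefficients that are not stored on the finest grid. The short Python sketch below checks that arithmetic with exact fractions; the function names are hypothetical, and the infinite series is used as an upper estimate of the finite sum over levels.

```python
from fractions import Fraction


def levels_factor(ratio):
    """Sum of the geometric series 1 + r + r^2 + ... for 0 < r < 1."""
    return 1 / (1 - Fraction(ratio))


def storage_per_fine_point(per_point, transfer, ratio, seven_point_fine=False):
    """Storage locations per fine grid point for the 3D data structures.

    per_point        : locations per grid point on every level (86 or 35)
    transfer         : grid transfer locations not needed on the finest grid (52 or 4)
    ratio            : points on a coarse grid relative to the next finer one
    seven_point_fine : drop the 20 unused coefficients of a 7-point fine operator
    """
    total = per_point * levels_factor(ratio) - transfer
    if seven_point_fine:
        total -= 20
    return total


if __name__ == "__main__":
    standard = storage_per_fine_point(86, 52, Fraction(1, 8))
    semi = storage_per_fine_point(35, 4, Fraction(1, 2))
    print(standard, semi)  # 324/7 66
    print(storage_per_fine_point(86, 52, Fraction(1, 8), True),
          storage_per_fine_point(35, 4, Fraction(1, 2), True))  # 184/7 46
```

Standard coarsening reduces the number of points by a factor of 8 per level and semi-coarsening by a factor of 2, which is why the semi-coarsening total is larger even though it needs fewer locations per point.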
We also have the possibility of FSS, NFSS, and SSS implementations for the semi-coarsening algorithm. The semi-coarsening storage required for the SSS is

[...]

while the NFSS is

[...]

and the FSS is

[...]

14.3 3D Standard Coarsening Vector Algorithm

We have discussed several issues concerning the black box multigrid components, vectorization, and programming on the Cray Y-MP. We now state explicitly what choices we made for the vector algorithm, as we did for the 2D vector algorithm. We implemented the code with all of the vectorization issues in mind, using the most practical and efficient choices that have been discussed so far.

14.3.1 Coarsening

We used standard coarsening, taking every other fine grid point in each of the coordinate directions to form the coarse grid.

14.3.2 Data Structures

The data structures for the grid equations are grid point stencil oriented. The mesh of unknowns has been augmented with a border of fictitious zero equations in the same way as in the 2D code. The border is used to avoid having to write special code to handle the boundary of the grid. This arrangement makes the code easier to write and more efficient for vector operations.

There are several arrays to hold the grid equations: the discrete coefficient array, the array of unknowns, and the right hand side array. There are also several
extra auxiliary arrays to hold the grid transfer operator coefficients, the residual, and the 2D plane problems for the smoother. The 2D plane problems contain the storage space for all the 3D coarse grids to be solved by the 2D black box multigrid solver for the alternating zebra plane relaxation.

Each grid level has its own data structure of the appropriate size that has been allocated, via pointers, as part of a larger linear array for each data type structure. This arrangement makes memory management for the number of grid levels easier.

14.3.3 Smoothers

We have implemented the multicolor ordering point and alternating zebra plane Gauss-Seidel methods. The plane solves are performed using the 2D black box multigrid method, where each plane's data is copied into the 2D solver's data structures.

14.3.4 Coarsest Grid Solver

The coarsest grid solver is a direct solver based on LU factorization, with the solve performed by the Linpack routine SGBSL.

14.3.5 Grid Transfer Operators

There are three choices for the grid transfer operators, discussed in chapter 11. They are analogous to the three that were implemented for the 2D standard coarsening method in sections 3.5.1, 3.5.3, and 3.6.1.

14.3.6 Coarse Grid Operators

The coarse grid operators are formed using the Galerkin coarse grid approximation, which uses the grid transfer operators and the fine grid operator.

14.4 3D Semi-Coarsening Vector Algorithm

The semi-coarsening code was originally implemented by Joel E. Dendy, Jr. We have reimplemented it in a slightly more efficient form to gain a speedup of about 2 over the previous vectorized version while maintaining and improving the portability
of the code. The new implementation has kept all the functionality of the previous version.

14.4.1 Data Structures

The data structures for the grid equations are the same as those for the standard coarsening code, including the fictitious border equations. However, we only need storage for the xy-planes used by the smoother.

14.4.2 Coarsening

Semi-coarsening in the z-direction was used, taking every other fine grid point in the z-direction to form the coarse grid.

14.4.3 Smoothers

Zebra xy-plane Gauss-Seidel relaxation is used for the smoother. The plane solves are performed using the 2D semi-coarsening black box multigrid method.

14.4.4 Coarsest Grid Solver

The coarsest grid solver is either the direct LU factorization Linpack solver or a single call to the 2D semi-coarsening black box multigrid method when the coarsening has continued until only one plane is left to solve.

14.4.5 Grid Transfer Operators

The grid transfer operator is analogous to the 2D one used in section 3.6.1, but extended to 3D and applied in the z-direction.

14.4.6 Coarse Grid Operators

The coarse grid operators are formed using the Galerkin coarse grid approximation, using the grid transfer and fine grid operators.
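The two coarsening strategies of sections 14.3.1 and 14.4.2 generate quite different grid hierarchies. The following Python sketch lists the grid sizes on each level; it is illustrative only. The rule `(n + 1) // 2` for "every other point", the stopping criteria (a 3 x 3 x 3 coarsest grid for standard coarsening, a single plane for semi-coarsening), and all function names are assumptions for the example, and the fictitious border points are not counted.

```python
def coarsen(n):
    """Number of points kept when every other point of n points is taken."""
    return (n + 1) // 2


def standard_levels(nx, ny, nz, coarsest=3):
    """Grid sizes, finest first, when all three directions are coarsened."""
    levels = [(nx, ny, nz)]
    while min(levels[-1]) > coarsest:
        x, y, z = levels[-1]
        levels.append((coarsen(x), coarsen(y), coarsen(z)))
    return levels


def semi_levels(nx, ny, nz):
    """Grid sizes, finest first, when only the z-direction is coarsened."""
    levels = [(nx, ny, nz)]
    while levels[-1][2] > 1:
        x, y, z = levels[-1]
        levels.append((x, y, coarsen(z)))
    return levels


def total_points(levels):
    """Total number of grid points summed over all levels."""
    return sum(x * y * z for x, y, z in levels)


if __name__ == "__main__":
    for name, levels in (("standard", standard_levels(65, 65, 65)),
                         ("semi", semi_levels(65, 65, 65))):
        finest = levels[0][0] * levels[0][1] * levels[0][2]
        print(name, len(levels), "levels,",
              "points relative to finest grid:", total_points(levels) / finest)
```

The printed ratios can be compared with the geometric series factors 8/7 and 2 used in the storage estimates of section 14.2.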
14.5 Timing Results for 3D Test Problems

In this section we present timing results for the various codes in order to compare the performance of the codes and of their components.

To illustrate how fast these codes solve a problem, we examine the timing results for solving Poisson's equation with five V-cycles; see table 14.1. Table 14.1 gives the timing results, in seconds, for various stages of the program execution and for various grid sizes. The grid is cubic, so in the first column n = 9 means a grid of size 9 x 9 x 9, and so forth for the rest of the column entries. The second column gives the total setup time, which is the time it takes to form all of the grid transfer operators, generate all of the coarse grid operators, and perform any decompositions needed for the smoother. The third column gives the total time for smoothing. The fourth column gives the total time for the direct solver. The last column contains the average time it took to complete one V(1,1)-cycle.

We observe that the code runs fairly quickly, and that it appears to scale with respect to the grid size. However, the scaling is not what we expect, namely a factor of 8 for standard coarsening and a factor of 2 for semi-coarsening. We do not see these speedups for several reasons, but the main one is that we are performing five V(1,1)-cycles. Each of these 3D V-cycles uses alternating plane solves for the smoother, and each plane is solved using the 2D black box multigrid method. Another major reason for the difference in the scaling has to do with vectorization issues.

We also note that the total setup time is about 0.55 times the average cycle time, and in addition it is about 0.12 times the total smoothing time for one iteration. In the 2D codes we saw just the opposite, where the setup time was greater than the average cycling time, but because of the overhead involved in using the 2D multigrid
Table 14.1. Multigrid component timings for standard (top four lines) and semi-coarsening (bottom four lines) in seconds for various grid sizes versus the time for five V(1,1)-cycles. The standard coarsening method uses zebra alternating plane Gauss-Seidel for the smoother, which uses a 2D multigrid method, and the grid transfer operator is the nonsymmetric hybrid collapsing method.

Grid Size   Total       Total       Direct      Average
n           Setup       Smoothing   Solver      per Cycle
 9          4.671E-2    3.791E-1    5.571E-4    8.466E-2
17          1.753E-1    1.502E+0    5.610E-4    3.324E-1
33          7.051E-1    5.916E+0    5.579E-4    1.321E+0
65          3.031E+0    2.351E+1    5.583E-4    5.281E+0

 9          5.854E-2    2.915E-1    9.234E-3    7.554E-2
17          2.330E-1    1.495E+0    1.855E-2    2.991E-1
33          1.073E+0    4.956E+0    3.918E-2    1.317E+0
65          5.503E+0    2.536E+1    1.006E-1    6.952E+0
method in the smoother and the simplified tensor composition for the grid operator setup, the setup is now faster in 3D. A more detailed examination of these relationships between the various multigrid components is given below.

The rest of the tables in this section give the results for one multigrid V(1,1)-cycle. The results are separated by multigrid component for easier comparison, and each table is further broken down by type of multigrid algorithm. All times are given in seconds of CPU time on the Cray Y-MP in single processor mode.

The time to perform the LU decomposition of the coarsest grid (3 x 3 x 3) problem for the direct solver is 7.148E-4 seconds. The direct solver on the coarsest grid level (3 x 3 x 3, standard coarsening) takes 1.116E-4 seconds. These times are constant for all of the standard coarsening algorithms that use the direct solver.

The amount of work to perform the grid transfers depends on the grid size and on the type of coarsening used. A comparison between standard and semi-coarsening is given in table 14.2. As one would expect, semi-coarsening grid transfers are faster than standard coarsening grid transfers, except on the largest grid (n = 65), where the semi-coarsening restriction is slower and the combined transfer times are comparable.

Table 14.2. Timings in seconds for multigrid grid transfer components for one V(1,1)-cycle for various grid sizes; comparing standard and semi-coarsening methods.

Grid Size   Standard Coarsening         Semi-Coarsening
n           Prolongation  Restriction   Prolongation  Restriction
 9          5.427E-4      5.714E-4      2.129E-4      3.600E-4
17          2.513E-3      1.981E-3      7.142E-4      1.292E-3
33          7.518E-3      7.560E-3      2.980E-3      5.268E-3
65          2.681E-2      2.934E-2      2.060E-2      3.699E-2

Table 14.3 gives the timing results for two standard coarsening smoothers and the semi-coarsening smoother. Note that the times for the alternating zebra plane and semi-coarsening smoothers are fairly close, and that the point Gauss-Seidel method is roughly 17 times faster.
Table 14.3. Timings for the total smoothing time in seconds for one multigrid V(1,1)-cycle for various grid sizes and smoothers.

Grid Size   Total Smoothing Time (seconds)
n           R/BPGS      AZplGS      SCBMG3
 9          4.515E-3    7.582E-2    5.832E-2
17          1.747E-2    3.004E-1    2.994E-1
33          6.879E-2    1.183E+0    9.912E-1
65          2.688E-1    4.702E+0    5.072E+0

The ratio of the time spent smoothing to the time spent doing grid transfers is given in table 14.4. The ratio shows that the smoother is the dominant computation in the multigrid cycling algorithm. It also shows that plane smoothing via 2D multigrid dominates the computation completely.

Recall from section 5.9 that 4-direction point Gauss-Seidel relaxation is a good smoother for isotropic 2D problems using standard coarsening. We can extend this method to 3D problems by using an 8-direction point Gauss-Seidel relaxation, which will give good smoothing and be robust for isotropic problems. The method should also be attractive in both execution time and memory usage. We can see from table 14.3 that the execution time for the red/black point method is about one sixteenth of that for the alternating plane method, and since lexicographic ordering performs nearly identically to red/black ordering on the Cray Y-MP, we should expect the 8-direction point Gauss-Seidel relaxation to take about half the time that the alternating plane relaxation takes. As an additional bonus, the 8-direction point Gauss-Seidel method does not require any storage beyond that needed for the 3D grid equation data structures and the 3D grid transfer operator coefficients.
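The ratios in table 14.4 can be reproduced from the n = 9 entries of tables 14.2 and 14.3 by dividing the smoothing time by the sum of the prolongation and restriction times. The Python sketch below does this, and also evaluates the estimate that eight point Gauss-Seidel sweeps cost about half of one alternating plane relaxation. The function name is hypothetical, the timings are read with negative exponents (the only reading consistent with the text), and the assumption that an 8-direction sweep costs eight red/black point sweeps follows the argument above rather than a measurement.

```python
def smoothing_to_transfer_ratio(smoothing, prolongation, restriction):
    """Ratio of smoothing time to total grid transfer time for one V(1,1)-cycle."""
    return smoothing / (prolongation + restriction)


# n = 9 timings in seconds from tables 14.2 and 14.3.
STANDARD_TRANSFERS = (5.427e-4, 5.714e-4)   # prolongation, restriction
SEMI_TRANSFERS = (2.129e-4, 3.600e-4)

rbpgs = smoothing_to_transfer_ratio(4.515e-3, *STANDARD_TRANSFERS)
azplgs = smoothing_to_transfer_ratio(7.582e-2, *STANDARD_TRANSFERS)
scbmg3 = smoothing_to_transfer_ratio(5.832e-2, *SEMI_TRANSFERS)

# Estimated cost of an 8-direction point sweep relative to alternating plane relaxation.
eight_direction_fraction = 8 * 4.515e-3 / 7.582e-2

if __name__ == "__main__":
    print(round(rbpgs, 2), round(azplgs, 2), round(scbmg3, 1))
    print(round(eight_direction_fraction, 2))
```

The three ratios agree with the first row of table 14.4 to the digits shown there.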
Table 14.4. Standard coarsening with grid transfer operators based on the hybrid collapsing method. Timing ratios (smoothing/grid transfer) for one V(1,1)-cycle for various grid sizes.

Grid Size   (Smoothing)/(Grid Transfers)
n           R/BPGS    AZplGS    SCBMG3
 9          4.05      68.05     101.8
17          3.89      66.84     149.2
33          4.56      78.45     120.2
65          4.79      83.74     88.07

14.6 Numerical Results for 3D Test Problem 1

Problem 1 is a Poisson problem defined by

\[
\begin{aligned}
[\ldots] \quad & \text{on } \Omega = (0, 32) \times (0, 32) \times (0, 32), \\
\frac{\partial u}{\partial n} = 0 \quad & \text{on } x = 0 \text{ or } y = 0 \text{ or } z = 0, \\
\frac{\partial u}{\partial n} + \frac{1}{2} u = 0 \quad & \text{on } x = 32 \text{ or } y = 32 \text{ or } z = 32.
\end{aligned}
\qquad (14.3)
\]

We discretize the equation using finite volumes with central differencing. The numerical results are given in tables 14.5 and 14.6.

14.7 Numerical Results for 3D Test Problem 2

Problem 2 is a discontinuous four-cube junction problem defined by

\[
\begin{aligned}
-\nabla \cdot (D \nabla u) + c\, u = f \quad & \text{on } \Omega = (0, 32) \times (0, 32) \times (0, 32), \\
\frac{\partial u}{\partial n} = 0 \quad & \text{on } x = 0 \text{ or } y = 0 \text{ or } z = 0, \\
D \frac{\partial u}{\partial n} + \frac{1}{2} u = 0 \quad & \text{on } x = 32 \text{ or } y = 32 \text{ or } z = 32,
\end{aligned}
\]

where the domain is split into the regions

R1 = {(x, y, z) : 0 < x < 16, 0 < y < 16, 0 < z < 16},
R2 = {(x, y, z) : 0 [...]
Table 14.5. Number of V(1,1)-cycles using standard coarsening and zebra alternating plane Gauss-Seidel for the smoother for the hybrid collapsing grid transfer operator, and the first, last and average convergence factor (CF).

Grid Size n     Iterations   First       Last        Average
(n x n x n)     V(1,1)       CF          CF          CF
 9              11           1.223E-1    1.767E-1    2.311E-1
17              13           8.981E-2    2.821E-1    2.423E-1
33              12           4.211E-2    2.645E-1    2.724E-1
65              15           4.801E-2    3.943E-1    2.876E-1

Table 14.6. Number of V(1,1)-cycles using semi-coarsening zebra alternating plane Gauss-Seidel for the smoother for the hybrid collapsing grid transfer operator, and the first, last and average convergence factor (CF).

Grid Size n     Iterations   First       Last        Average
(n x n x n)     V(1,1)       CF          CF          CF
 9              12           1.703E-1    2.798E-1    2.556E-1
17              15           1.886E-1    2.922E-1    2.723E-1
33              16           2.222E-1    2.955E-1    2.795E-1
65              18           3.231E-1    2.989E-1    2.876E-1
R3 = {(x, y, z) : 0 < x < 16, 16 < y ≤ 32, 0 < z < 16},
R4 = {(x, y, z) : 16 ≤ x < 32, 16 ≤ y < 32, 0 < z < 16},
R5 = {(x, y, z) : 0 < x < 16, 0 < y < 16, 16 ≤ z < 32},
R6 = {(x, y, z) : 16 ≤ x < 32, 0 < y < 16, 16 ≤ z < 32},
R7 = {(x, y, z) : 0 < x < 16, 16 ≤ y < 32, 16 ≤ z < 32},
R8 = {(x, y, z) : 16 ≤ x < 32, 16 ≤ y < 32, 16 ≤ z < 32};

then let

\[ D = \begin{cases} 1 & \text{for regions } 2, 3, 5, 8, \\ 1000 & \text{for regions } 1, 4, 6, 7, \end{cases} \qquad (14.5) \]

and

\[ f = \begin{cases} 1 & \text{for regions } 2, 3, 5, 8, \\ 0 & \text{for regions } 1, 4, 6, 7. \end{cases} \qquad (14.6) \]

We discretize the equation using finite volumes with central differencing. The numerical results are given in tables 14.7 through 14.10. The methods all perform roughly the same, with the semi-coarsening method coming in last.

Table 14.7. Number of V(1,1)-cycles when c = 0 and using standard coarsening and zebra alternating plane Gauss-Seidel for the smoother for the nonsymmetric collapsing grid transfer operator, and the first, last and average convergence factor (CF).

Grid Size n     Iterations   First       Last        Average
(n x n x n)     V(1,1)       CF          CF          CF
 9              15           2.523E-1    4.333E-1    3.667E-1
17              20           3.041E-1    5.324E-1    4.441E-1
33              27           3.331E-1    6.144E-1    5.329E-1
65              36           4.801E-1    7.464E-1    6.267E-1
Table 14.8. Number of V(1,1)-cycles when c = 0 and using standard coarsening and zebra alternating plane Gauss-Seidel for the smoother for the hybrid collapsing grid transfer operator, and the first, last and average convergence factor (CF).

Grid Size n     Iterations   First       Last        Average
(n x n x n)     V(1,1)       CF          CF          CF
 9              15           2.544E-1    4.339E-1    3.679E-1
17              19           3.038E-1    5.281E-1    4.276E-1
33              27           3.348E-1    6.159E-1    5.336E-1
65              36           4.833E-1    7.878E-1    6.308E-1

Table 14.9. Number of V(1,1)-cycles when c = 3b and using standard coarsening and zebra alternating plane Gauss-Seidel for the smoother for the nonsymmetric collapsing grid transfer operator, and the first, last and average convergence factor (CF).

Grid Size n     Iterations   First       Last        Average
(n x n x n)     V(1,1)       CF          CF          CF
 9              12           9.112E-1    4.210E-1    3.677E-1
17              14           1.123E-1    5.223E-1    4.121E-1
33              21           2.388E-1    6.100E-1    5.629E-1
65              26           3.522E-1    6.363E-1    6.230E-1

Table 14.10. Number of V(1,1)-cycles when c = 3b and using standard coarsening and zebra alternating plane Gauss-Seidel for the smoother for the hybrid collapsing grid transfer operator, and the first, last and average convergence factor (CF).

Grid Size n     Iterations   First       Last        Average
(n x n x n)     V(1,1)       CF          CF          CF
 9              12           9.104E-1    4.209E-1    3.679E-1
17              15           1.334E-1    5.231E-1    4.276E-1
33              22           2.241E-1    6.129E-1    5.336E-1
65              26           3.480E-1    6.376E-1    6.308E-1
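The tables above report first, last, and average convergence factors without restating how they are computed. A common convention, assumed here and not confirmed by this chapter, takes the convergence factor of a cycle to be the ratio of successive residual norms and the average to be the geometric mean over all cycles. The Python sketch below implements that convention; the function name and the sample residual history are hypothetical.

```python
def convergence_factors(residual_norms):
    """Return (first, last, average) convergence factors.

    residual_norms[0] is the initial residual norm and residual_norms[k]
    the norm after k cycles. The average is the geometric mean of the
    per-cycle factors, which is an assumed definition.
    """
    if len(residual_norms) < 2:
        raise ValueError("need the initial residual and at least one cycle")
    cycles = len(residual_norms) - 1
    first = residual_norms[1] / residual_norms[0]
    last = residual_norms[-1] / residual_norms[-2]
    average = (residual_norms[-1] / residual_norms[0]) ** (1.0 / cycles)
    return first, last, average


if __name__ == "__main__":
    history = [1.0, 0.5, 0.125]   # made-up residual norms, for illustration only
    print(convergence_factors(history))
```

Under this definition the average can exceed both the first and the last factor when the intermediate cycles converge more slowly.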
CHAPTER 15

PARALLEL 3D BLACK BOX MULTIGRID

The parallel 3D black box multigrid methods are similar to the parallel 2D methods. However, unlike the 3D vector versions, the smoother does not use the parallel 2D black box multigrid method in its original form.

15.1 3D Standard Coarsening Parallel Algorithm Modifications

As with the 2D parallel standard coarsening method, we run into problems with performing the tridiagonal line solves. We have chosen to modify the coarsening to use standard coarsening until the VP ratio is less than or equal to one and then to switch to the semi-coarsening algorithm. This approach has several benefits, which include easier coding and faster execution. As will be seen, the semi-coarsening algorithm is the fastest for a VP ratio less than or equal to one, but the standard coarsening method is actually faster when the VP ratio is greater than one.

15.2 3D Parallel Smoother

On a parallel computer we could use the same approach as on the vector computers and call the 2D solver for each plane, but this would not be very efficient since we know that all the red (black) planes can be solved simultaneously. Instead,
we can modify the 2D solver to solve all of the red (black) planes simultaneously by introducing the third coordinate axis into the data structure. This is good news because it cuts down on the overhead associated with calling the 2D solver over and over again. It does mean that we use more memory for the 2D solver because of the third coordinate axis, but the performance gain justifies this decision.

However, we are again faced with the choice between creating three 2D plane solvers, one for each of the 2D planes needed for alternating plane relaxation, and creating one 2D plane solver and transferring the information from the 3D solver data structures to the 2D plane solver data structures with a copying routine. Creating three 2D plane solvers would give the fastest implementation, but it is rather prohibitive for two reasons. First, it requires much more storage for all the data structures needed by the three 2D plane solvers. Second, the size of the final executable code, which is already quite large, more than doubles, meaning that there is that much less storage available for solving problems with large grid sizes. We decided to use only one 2D plane solver, sacrificing some performance in order to save memory for solving larger problems.

We may have saved some space by not writing three 2D multigrid solvers, but we now have to transfer data between the 3D and 2D data structures. These data transfers require the use of the less efficient communications. In either case, we have to transfer the unknowns and the right hand sides of the grid equations, followed by transferring the solution back, all of which uses inefficient general communications for the yz- and xz-plane solves. By having only one 2D (xy-plane) version we also have to transfer the plane's coefficient matrix to the 2D fine grid coefficient matrix. If we did not transfer the coefficients we would need nearly double the storage; recall section 14.2.
We have also decided not to save the LU decompositions in the 2D plane solver because of concerns about the amount of memory it would require with the additional
coordinate (third) axis; hence performance is again reduced. However, experimentation has shown that saving the decompositions would save only 25% to 40% of the execution time required to perform one V-cycle, at a cost of four times the storage required for a tridiagonal solve. In actuality, the storage cost is closer to six times because of the temporary work space allocated by the CMSSL tridiagonal solver.

15.3 3D Data Structures and Communication

The data structure allocation and layout are again handled by the use of DMMU. The data structures have the same storage requirements as the 3D vector versions for both the standard and semi-coarsening methods. However, instead of pointers to the various grid level data, we use congruent array aliases, which allow the desired grid level's data to be indexed directly.

15.4 3D Parallel Timings

In the following tables we report both busy (B) and idle (I) times. Busy time is the execution time for the parallel processing nodes, while idle time is the sequential execution time plus the time to perform all communications. We report times for various time shared partitions of the CM-5. The partitions are identified by their number of processing nodes (PN), namely 32, 64, and 128 processing nodes. The tables report timings, in seconds, averaged over five runs, for either the setup time or the average of five V-cycles.

The standard coarsening timings are given in tables 15.1 and 15.2 for one V(1,1)-cycle and for the setup respectively. We see the effects of the parallel overhead in the tables for small grid sizes and large partitions. The entries "**" mean that no data was obtained because of a failure in the codes. We believe that the "no data" runs failed because the standard coarsening case has data alignment trouble on coarser grids. The
Table 15.1. Timings, in seconds, for the standard coarsening code performing one V(1,1)-cycle with zebra alternating plane Gauss-Seidel on 32, 64, 128 processing nodes of the CM-5, where the size N means an N x N x N fine grid. Busy and Idle refer to the parallel and communication/sequential time respectively.

Size   32 PN                   64 PN         128 PN
N      Idle       Busy         Idle   Busy   Idle       Busy
 8     5.003E-1   8.793E-1     **     **     5.376E-1   1.057E+0
16     8.586E-1   1.565E+0     **     **     8.918E-1   1.746E+0
32     1.236E+0   2.982E+0     **     **     1.379E+0   2.906E+0
64     1.687E+0   7.920E+0     **     **     1.808E+0   5.514E+0

Table 15.2. Timings, in seconds, for the setup phase of the standard coarsening code with zebra alternating plane Gauss-Seidel on 32, 64, 128 processing nodes of the CM-5, where the size N means an N x N x N fine grid. Busy and Idle refer to the parallel and communication/sequential time respectively.

Size   32 PN                   64 PN         128 PN
N      Idle       Busy         Idle   Busy   Idle       Busy
 8     1.684E+0   2.403E+0     **     **     1.578E+0   2.839E+0
16     2.932E+0   4.464E+0     **     **     3.063E+0   4.827E+0
32     4.389E+0   8.911E+0     **     **     4.748E+0   8.452E+0
64     5.650E+0   2.457E+1     **     **     5.970E+0   1.652E+0
alignment trouble only happens when using the DMMU together with the 64 processor or 256 processor partitions. The DMMU was designed and tested for the semi-coarsening algorithm. For the standard coarsening algorithm, the DMMU appears to have trouble aligning the coarse grid data points with their closely related finer grid points on a subgrid. For the 64 and 256 processor partitions the DMMU cannot keep the coarsening confined to the subgrid, as is done in the semi-coarsening code.

As in the 2D parallel timings, we again do not see perfect scaling with respect to the grid size, nor scaleup with the size of the processing partition. The parallel efficiencies are given in table 15.3, and they exhibit the same behavior that was seen in the 2D timings.

Table 15.3. Parallel efficiency for standard coarsening V(1,1)-cycle using zebra alternating plane Gauss-Seidel for the CM-5 with 32, 64, and 128 nodes. The results are given in percentages and N means an N x N x N grid.

Size   CM-5
N      32 PN   64 PN   128 PN
 8     64      **      66
16     65      **      66
32     71      **      68
64     82      **      75

Tables 15.4 through 15.6 present the timing data for the semi-coarsening algorithm. Here we do not see any problem with the DMMU.

Finally, we give a comparison on three different computers for one V(1,1)-cycle and a variety of grid sizes. The CM-5 timings are given for the 32, 64, and 128 processing node partitions.
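The efficiencies in table 15.3 are consistent with taking the parallel efficiency to be the busy time divided by the total (busy plus idle) time. Under that assumption, the short Python sketch below reproduces the 32 PN column of table 15.3 from the 32 PN timings of table 15.1; the function and variable names are hypothetical.

```python
def parallel_efficiency(idle, busy):
    """Percentage of the total time spent busy on the processing nodes."""
    return 100.0 * busy / (idle + busy)


# (idle, busy) times in seconds for one V(1,1)-cycle on 32 processing nodes,
# standard coarsening, keyed by grid size N.
STANDARD_32PN = {
    8: (5.003e-1, 8.793e-1),
    16: (8.586e-1, 1.565e+0),
    32: (1.236e+0, 2.982e+0),
    64: (1.687e+0, 7.920e+0),
}

efficiencies = {n: round(parallel_efficiency(idle, busy))
                for n, (idle, busy) in STANDARD_32PN.items()}

if __name__ == "__main__":
    for n, eff in efficiencies.items():
        print(n, eff)
```

The same function applied to the semi-coarsening timings of table 15.4 can be compared with table 15.6.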
Table 15.4. Timings, in seconds, for the semi-coarsening code performing one V(1,1)-cycle on 32, 64, 128 processing nodes of the CM-5, where the size N means an N x N x N fine grid. Busy and Idle refer to the parallel and communication/sequential time respectively.

Size   32 PN                   64 PN                   128 PN
N      Idle       Busy         Idle       Busy         Idle       Busy
 8     7.256E-1   1.235E+0     7.650E-1   1.287E+0     7.602E-1   1.257E+0
16     1.264E+0   2.271E+0     1.190E+0   2.242E+0     1.239E+0   2.348E+0
32     1.793E+0   4.630E+0     1.823E+0   4.182E+0     1.849E+0   3.995E+0
64     2.412E+0   1.234E+1     2.480E+0   9.874E+0     2.435E+0   8.127E+0

Table 15.5. Timings, in seconds, for the setup phase of the semi-coarsening code on 32, 64, 128 processing nodes of the CM-5, where the size N means an N x N x N fine grid. Busy and Idle refer to the parallel and communication/sequential time respectively.

Size   32 PN                   64 PN                   128 PN
N      Idle       Busy         Idle       Busy         Idle       Busy
 8     1.103E+0   1.481E+0     1.415E+0   1.539E+0     1.482E+0   1.506E+0
16     2.180E+0   2.739E+0     1.909E+0   2.734E+0     2.036E+0   2.848E+0
32     2.248E+0   5.626E+0     2.872E+0   5.083E+0     2.419E+0   4.825E+0
64     3.100E+0   1.514E+1     3.340E+0   1.210E+1     3.372E+0   9.868E+0

Table 15.6. Parallel efficiency for semi-coarsening V(1,1)-cycle for the CM-5 with 32, 64, and 128 nodes. The results are given in percentages and N means an N x N x N grid.

Size   CM-5
N      32 PN   64 PN   128 PN
 8     63      63      62
16     64      65      65
32     72      69      68
64     84      80      77
Table 15.7. Timing comparison between the CM-5 and Cray Y-MP computers for one V(1,1)-cycle in seconds, where N means an N x N x N grid. For each size, the first row is for the standard coarsening codes and the second row is for the semi-coarsening codes, and ** means that the problem failed to execute.

Size   Coarsening   CM-5                            Cray Y-MP
N                   32 PN     64 PN     128 PN
 8     standard     1.3796    **        1.5956      2.332E-2
       semi         1.9606    2.0520    2.0172      3.742E-2
16     standard     2.4236    **        2.6378      1.398E-1
       semi         3.5350    3.4320    3.5870      1.825E-1
32     standard     4.2180    **        4.2850      8.795E-1
       semi         6.4230    6.0050    5.8440      9.107E-1
64     standard     9.6070    **        7.3220      4.663E+0
       semi         14.800    12.354    10.562      4.696E+0
APPENDIX A

OBTAINING THE BLACK BOX MULTIGRID CODES

The black box multigrid codes and a User's Guide can be obtained via anonymous FTP through MGNet, through MGNet's web site, or by contacting the author. MGNet stands for the MultiGrid Network.

MGNet's FTP site: casper.cs.yale.edu
Use "anonymous" for the username and your email address for the password.

MGNet has a World Wide Web page that can be accessed via the URL:
http://na.cs.yale.edu/mgnet/www/mgnet.html

The author's email address is: na.bandy@nanet.ornl.gov

Additional copies of this thesis may be obtained by contacting the author or by downloading a copy from the University of Colorado at Denver Mathematics Department's Web page via the URL:
http://wwwmath.cudenver.edu/

Any comments, insights, and suggestions would be greatly appreciated. Thank you.
APPENDIX B

COMPUTER SYSTEMS USED FOR NUMERICAL RESULTS

B.1 Cray Y-MP

Manufacturer: Cray Research Inc.

Hardware Specifics: Machine RHO at Los Alamos National Laboratory
    computer model          = Cray Y-MP 8/64 832
    serial number           = 1054
    number CPUs             = 8
    clock cycle             = 6.0 nanoseconds (166666666 cycles/second)
    word length             = 64 bits
    memory size             = 67108864 words (67 MWords)
    memory speed            = 102.0 nanoseconds (17 clock cycles)
    memory banks            = 256
    memory bank busy        = 30.0 nanoseconds (5 clock cycles)
    instruction buffer size = 32
    number of clusters      = 9
Operating System: UNICOS version 7.0.6.1

FORTRAN Programming Environment:
    CF77 version 6.0.4.1
    GPP version 6.0.4.1
    FPP version 6.0
    FMP version 6.0.4.0
    CFT77 version 6.0.4.12 (some are done with 6.0.4.10)
    segldr version 7.0i

Cray Library: Craylib version 1.2

Manufacturer: Cray Research Inc.

Hardware Specifics: Machine GAMMA at Los Alamos National Laboratory
    computer model          = Cray Y-MP 8/2048 (M90)
    serial number           = 2806
    number CPUs             = 8
    clock cycle             = 6.0 nanoseconds (166666666 cycles/second)
    word length             = 64 bits
    memory size             = 2147483648 words (2.147 GWords)
    memory speed            = 162.0 nanoseconds (27 clock cycles)
    memory banks            = 256
    memory bank busy        = 120.0 nanoseconds (20 clock cycles)
    instruction buffer size = 32
    number of clusters      = 9

Operating System: UNICOS version 8.0.3

FORTRAN Programming Environment:
    CF77 version 6.0.4.1
    GPP version 6.0.4.1
    FPP version 6.0
    FMP version 6.0.4.0
    CFT77 version 6.0.4.10
    segldr version 8.0i

Cray Library: Craylib version 1.2

B.2 CM-5

Manufacturer: Thinking Machines Inc.

Hardware Specifics:
    computer    = CM-5 at Advanced Computing Laboratory (ACL), Los Alamos National Laboratory
    number CPUs = 1024 Sparc2 node CPUs, 4 vector units per node (4096 vector units) (vector length of 16)
    memory      = 32 MBytes per CPU node
    Four HIPPI interfaces; 120 GBytes rotating storage

Operating System: CMOST version 7.4.0 (based on SunOS 4.1.3U1b)

CM Run-Time System: CMRTS 8.1

CM FORTRAN Programming Environment:
    CM Fortran Driver Release 2.2.11
    Connection Machine Fortran Version 2.2 (CMF)
    Compiler runtime library Version: 2.2 (LIBCMFCOMPILER)

CM Scientific Software Library: CMSSL version 4.0

Slicewise runtime library version CMRTS CM5 8 1 6 (LIBCMRTS)
PAGE 362
BIBLIOGRAPHY [1] R. E. ALCOUFFE, A. BRANDT, J. E. DENDY, JR., AND J. W. PAINTER, The multigrid methods for the diffusion equation with strongly discontinuous coeffi cients, SIAM J. Sci. Stat. Comput., 2 (1981), pp. 430454. [2] 0. AXELSSON, Analysis of incomplete matrix factorizations as multigrid smoothers for vector and parallel computers, Appl. Math. Comput., 19 (1986), pp. 322. [3] ,A general incomplete blockmatrix factorization method, J. Lin. Alg. Applic., 74 (1986), pp. 179190. [4] 0. AXELSSON, S. BRINKKEMPER, AND V. P. IL'IN, On some versions of incom plete blockmatrix factorization iterative methods, J. Lin. Alg. Applic., 58 (1984), pp. 315. [5] 0. AXELSSON AND B. POLMAN, On the factorization methods for block matrices suitable for vector and parallel processors., J. Lin. Alg. Applic., 77 (1986), pp. 326. [ 6] 0. AXELSSON AND P. S. VASSILEVSKI, Algebraic multilevel preconditioning meth ods, I, Numer. Math., 56 (1989), pp. 157177. [7] V. A. BANDY, A comparison of 2d black box multigrid for convectiondiffusion problems with discontinuous and anisotropic coefficients. Presented at the Sixth Copper Mountain Multigrid Conference, Apr. 1993. [8] V. A. BANDY, J. E. DENDY, JR., AND W. H. SPANGENBERG, Some multigrid algorithms for elliptic problems on data parallel machines. To Appear, SIAM J. Sci. Comput., May 1996. [9] V. A. BANDY AND R. SWEET, A set of three drivers for boxmg: a black box multi grid solver, in Preliminary Proceedings of the Fifth Copper Mountain Conference on Multigrid Methods, T. A. Manteuffel and S. F. McCormick, eds., val. 1, Denver, 1991, University of Colorado, pp. 4755. [10] , A set of three drivers for BOXMG: A blackbox multigrid solver, Comm. Appl. Num. Methods, 8 (1992), pp. 563571. 337
PAGE 363
[11] A. BEHlE AND P. A. FORSYTH, Multigrid solution of threedimensional problems with discountinuous coefficients, Appl. Math. Comput., 13 (1983), pp. 229240. [12] D. P. BERTSEKAS AND J. N. TSITSIKLIS, Parallel and Distributed Computation: Numerical Methods, Prentice Hall, 1989. [13] J. H. BRAMBLE AND J. E. PASCIAK, The analysis of smoothers for multigrid algorithms, Math. Comp., 58 (1992), pp. 467488. [14] A. BRANDT, Multilevel adaptive solutions to boundaryvalue problems, Math. Comp., 31 (1977), pp. 333390. [15] , Multilevel adaptive techniques (MLAT) for partial differential equations: ideas and software, in Mathematical Software III, J. R. Rice, ed., Academic Press, New York, 1977, pp. 277318. [16] , Algebraic multigrid theory: The symmetric case, Appl. Math. Comput., 19 (1986), pp. 2356. [17] , Rigorous local mode analysis of multigrid, in Preliminary Proc. of the 4th Copper Mountain Conference on Multigrid Methods, J. Mandel and S. F. Mc Cormick, eds., vol. 1, Denver, 1989, Computational Mathematics Group, Univ. of Colorado, pp. 55133. [18] A. M. BRUASET, A. TVEITO, AND R. WINTHER, On the stability of relaxed incomplete lu factorizations, Math. Comp., 54 (1990), pp. 701719. [19] M. CALVO, T. GRANDE, AND R. D. GRIGORIEFF, On the zero stability of the variable order variable stepsize bdfformulas, Numer. Math., 57 (1990), pp. 3950. [20] Z. H. CAO, Convergence of multigrid methods for nonsymmetric, indefinite prob lems, Appl. Math. Comput., 28 (1988), pp. 269288. [21] T. F. CHAN AND H. C. ELMAN, Fourier analysis of iterative methods for elliptic problems, SIAM Review, 31 (1989), pp. 2049. [22] T. F. CHAN AND B. F. SMITH, Domain decomposition and multigrid algorithms for elliptic problems on unstructured meshes, in Domain Decomposition Methods in Scientific and Engineering Computing: Proceedings of the Seventh International Conference on Domain Decomposition, vol. 180 of Contemporary Mathematics, Providence, Rhode Island, 1994, American Mathematical Society, pp. 
175189. [23] Q. S. CHANG, Y. S. WONG, AND Z. F. LI, New interpolation formulas of using geometric assumptions in the algebraic multigrid method, Appl. Math. Comput., 50 (1992), pp. 223254. 338
PAGE 364
[24] P. M. DE ZEEUW, Matrixdependent prolongations and restrictions in a blackbox multigrid solver, J. Comput. Appl. Math., 33 (1990), pp. 127. [25] P. M. DE ZEEUW AND E. J. VAN ASSELT, The convergence rate of multilevel algorithms applied to convectiondiffusion equations, SIAM J. Sci. Stat. Comput., 6 (1985), pp. 492503. [26] J. E. DENDY, JR., Black box multigrid, J. Comput. Phys., 48 (1982), pp. 366386. [27] J. E. DENDY JR., Black box multigrid for nonsymmetric problems, Appl. Math. Comput., 13 (1983), pp. 261284. [28] , A priori local grid refinement in the multigrid method, in Elliptic Problem Solvers II, G. Birkhoff and A. Schoenstadt, eds., Academic Press, New York, 1984, pp. 439451. [29] , Two multigrid methods for threedimensional equations with highly discon tinuous coefficients, SIAM J. Sci. Stat. Comput., 8 (1987), pp. 673685. [30] , Black box multigrid for periodic and singular problems, Appl. Math. Com put., 25 (1988), pp. 110. [31] , Multigrid methods for diffusion equations with highly discontinuous coeffi cients, Trans. A.N.S., 56 (1988), p. 290. [32] J. E. DENDY JR., M.P. IDA, AND J. M. RUTLEDGE, A semicoarsening multigrid algorithm for SIMD machines, SIAM J. Sci. Stat. Comput., 13 (1992), pp. 14601469. [33] J. E. DENDY JR., S. F. McCoRMICK, J. W. RuGE, T. F. RussELL, AND S. SCHAFFER, Multi grid methods for threedimensional petroleum reservoir simu lation, in Proceedings of the Tenth Symposium on Reservoir Simulation, Houston, 1989, pp. 68. [34] S. Dor, On parallelism and convergence of incomplete lu factorizations, Appl. Numer. Math., 7 (1991), pp. 417436. [35] C. C. DouGLAS, A review of numerous parallel multigrid methods, SIAM News, 25 (1992). [36] C. C. DOUGLAS AND B. F. SMITH, Using symmetries and antisymmetries to analyze a parallel multigrid algorithm, SIAM J. Numer. Anal., 26 (1989), pp. 14391461. 339
PAGE 365
[37] H. C. ELMAN, A stability analysis of incomplete lu factorizations, Math. Comp., 47 (1986), pp. 191217. [38] K. W. FONG, T. H. JEFFERSON, T. SUYEHIRO, AND L. WAL TON, Guide to the SLATEC Common Mathematical Library, Netlib, http:/ /www.netlib.org/slatecjguide, 1993. [39] G. E. FORSYTHE, FiniteDifference Methods for Partial Differential Equations, Wiley, New York, 1960. [40] D. GOLDBERG, What every computer scientist should know about floatingpoint arithmetic, ACM Comput. Surveys, 23 (1991), pp. 548. [41] W. HACKBUSCH, Multigrid Methods and Applications, vol. 4 of Computational Mathematics, SpringerVerlag, Berlin, 1985. [42] , Iterative Solution of Large Sparse Systems of Equations, SpringerVerlag, Berlin, 1993. [43] L.A. HAGEMAN AND D. M. YOUNG, Applied Iterative Methods, Academic Press, 1981. [44] M. HEGGLAND, On the parallel solution of tridiagonal systems by wraparound partitioning and incomplete lu factorization, Numer. Math., 59 (1991), pp. 453472. [45] P. W. HEMKER, On the order of prolongations and restrictions in multi grid pro cedures, J. Comput. Appl. Math., 32 (1990), pp. 423429. [46] R. W. HOCKNEY AND C. R. JESSHOPE, Parallel Computers 2, Adam Rigler, Philadelphia, 1988. [47] W. H. HOLTER, A vectorized multigrid solver for the threedimensional poisson equation, Appl. Math. Comput., 19 (1986), pp. 127144. [48] W. Z. HUANG, Convergence of algebraic multigrid methods for symmetric positive definite matrices with weak diagonal dominance, Appl. Math. Comput., 46 (1991), pp. 145164. [49] S. L. JOHNSON, Solving tridiagonal systems on ensemble architectures, SIAM J. Sci. Stat. Comput., 8 (1987), pp. 354392. 340
PAGE 366
[50] R. KETTLER, Analysis and comparison of relaxation schemes in robust multigrid and preconditioned conjugate gradient methods, in Multigrid Methods, W. Hack busch and U. Trottenberg, eds., no. 960 in Lect. Notes in Math., SpringerVerlag, 1982, pp. 502534. [51] R. KETTLER AND P. WESSELING, Aspects of multigrid methods for problems in three dimensions, Appl. Math. Comput., 19 (1986), pp. 159168. [52] M. KHALIL, Analysis of Linear Multigrid Methods for Elliptic Differential equa tions with Discontinuous and Anisotropic Coefficients, PhD thesis, Delft Univer sity of Technology, Delft, Netherlands, 1989. [53] M. KHALIL AND P. WESSELING, A cellcentered multigrid method for three dimensional anisotropicdiffusion and interface problems, in Preliminary Proc. of the 4th Copper Mountain Conference on Multigrid Methods, J. Mandel and S. F. McCormick, eds., vol. 3, Denver, 1989, Computational Mathematics Group, Univ. of Colorado, pp. 99117. [54] , Vertexcentered and cellcentered multigrid for interface problems, in Pre liminary Proc. of the 4th Copper Mountain Conference on Multigrid Methods, J. Mandel and S. F. McCormick, eds., vol. 3, Denver, 1989, Computational Mathematics Group, Univ. of Colorado, pp. 6197. [55] , Vertexcentered and cellcentered multigrid for interface problems, J. Com put. Phys., 98 (1992), pp. 120. [56] D. E. KNUTH, The Art of Computer Programming, vol. II, AddisonWesley, Read ing, Mass., 2nd ed. ed., 1981. [57] C.C. J. Kuo AND B. C. LEVY, Twocolor fourier analysis of the multigrid method with redblack gaussseidel smoothing, Appl. Math. Comput., 29 (1989), pp. 6987. [58] J. M. LEVESQUE AND J. W. WILLIAMSON, A Guidebook to FORTRAN on Su percomputers, Academic Press, 1988. [59] W. LICHTENSTIEN AND S. L. JOHNSON, Block cyclic dense linear algebra, SIAM J. Sci. Stat. Comput., 14 (1993), pp. 12571286. [60] W. M. LIOEN, Parallelizing a highly vectorized multigrid code with zebra relax ation. 
Obtained a copy of the paper at The Copper Mountain Conference on Multigrid Methods, Apr. 1993. 341
PAGE 367
[61] W. A. MULDER, A new multigrid approach to convection problems, J. Comput. Phys., 83 (1989), pp. 303323. [62] S. V. PARTER, Estimates for multigrid methods based on redblack gaussseidel smoothings, Numer. Math., 52 (1988), pp. 701723. [63] A. REUSKEN, Multigrid with matrix dependent transfer operators for a singular perturbation problem, Comput., 50 (1993), pp. 199211. [64] J. RUGE, AMG for problems of elasticity, App. Math. Comput., 19 (1986), pp. 293309. [65] J. W. RuGE, Algebraic multigrid ( AMG) for geodetic survey problems, in Prelimary Proc. Internat. Multigrid Conference, Fort Collins, CO, 1983, Institute for Computational Studies at Colorado State University. [66] J. W. RuGE AND K. STUBEN, Efficient solution of finite difference and finite element equations by algebraic multigrid {AMG), in Multigrid Methods for Integral and Differential Equations, D. J. Paddon and H. Holstein, eds., The Institute of Mathematics and its Applications Conference Series, Clarendon Press, Oxford, 1985, pp. 169212. [67] , Algebraic multigrid {AMG), in Multigrid Methods, S. F. McCormick, ed., vol. 3 of Frontiers in Applied Mathematics, SIAM, Philadelphia, PA, 1987, pp. 73130. [68] S. SCHAFFER, Higher order multigrid methods, Math. Comp., 43 (1984), pp. 89115. [69] , New ideas for semicoarsening multigrid methods. Talk given at CNLS, Los Alamos National Laboratory, 1992. [70] , A semicoarsening multigrid method for elliptic partial differential equations with highly discontinuous and anisotropic coefficients, SIAM J. Sci. Comput., To Appear (1995). [71] Y. SHAPIRA, Twolevel analysis of automatic multigrid for spd, nonnormal and indefinite problems, Technical Report 824, Computer Science Department, TechnionIsrael Institute of Technology, July 1994. submitted to Numer. Math. [72] , Twolevel analysis based on spectral analysis, Sept. 1995. Private commu nications. [73] S. SIVALOGANATHAN, The use of local mode analysis in the design and comparison of multigrid methods, Comput. 
Phys. Commun., 65 (1991), pp. 246252. 342
PAGE 368
[74] G. D. SMITH, Numerical Solution of Partial Differential Equations: Finite Dif ference Methods, Clarendon Press, Oxford, 1978. [75] R. A. SMITH AND A. WEISER, Semicoarsening multigrid on a hypercube, SIAM J. Sci. Stat. Comput., 13 (1992), pp. 13141329. [76] P. SONNEVELD, P. WESSELING, AND P. M. DEZEEUW, Multigrid and conjugate gradient methods as convergence acceleration techniques., in Multigrid Methods for Integral and Differential Equations, D. J. Paddon and H. Holstein, eds., The Institute of Mathematics and its Applications Conference Series, Clarendon Press, Oxford, 1985, pp. 117168. [77] K. STUBEN, Algebraic multigrid (AMG): expenences and compansons, Appl. Math. Comput., 13 (1983), pp. 419452. [78] K. STUBEN AND U. TROTTENBERG, Multigrid methods: Fundamental algorithms, model problem analysis and applications, in Multigrid Methods, W. Hackbusch and U. Trottenberg, eds., vol. 960 of Lecture Notes in Mathematics, Berlin, 1982, SpringerVerlag, pp. 1176. [79] C. THOLE AND U. TROTTENBERG, Basic smoothing procedures for the multigrid treatment of elliptic 3Doperators, Appl. Math. Comput., 19 (1986), pp. 333345. [80] , A short note on standard parallel multigrid algorithms for 3Dproblems, Appl. Math. Comput., 27 (1988), pp. 101115. [81] P. VANEK, J. MANDEL, AND M. BREZINA, Algebraic multigrid based on smoothed aggregation for second and fourth order problems, Computing, 56 (1996), pp. 179196. [82] R. S. VARGA, Matrix Iterative Analysis, PrenticeHall, 1962. [83] P. WESSELING, Theoretical and practical aspects of a multigrid method, SIAM J. Sci. Stat. Comput., 3 (1982), pp. 387407. [84] , A survey of fourier smoothing analysis results, in Multigrid Methods III, vol. 98 of International Series of Numerical Mathematics, Birkhauser, Basel, 1991, pp. 105127. [85] ,An Introduction to Multigrid Methods, John Wiley & Sons, Chichester, 1992. [86] G. WITTUM, On the robustness of ilu smoothing, SIAM J. Sci. Stat. Comput., 10 (1989), pp. 699717. 343
PAGE 369
[87] I. YAVNEH, Multigrid and sor revisited, in Preliminary Proceedings of the Colorado Conference on Iterative Methods, T. A. Manteuffel, ed., Denver, 1994, University of Colorado. [88] , Smoothing factors of twocolor gaussseidel relaxtion for a class of elliptic operators. personnel correspondence, May 1994. 344
