ADAPTIVE MESH REFINEMENT FOR DISTRIBUTED
PARALLEL ARCHITECTURES
by
Daniel James Quinlan
B. A., University of Colorado at Denver, 1987
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Applied Mathematics
1993
This thesis for the Doctor of Philosophy degree by
Daniel James Quinlan
has been approved for the
Department of
Mathematics
by
Quinlan, Daniel James (Ph.D., Applied Mathematics)
Adaptive Mesh Refinement for Distributed Parallel Architectures
Thesis directed by Professor Stephen F. McCormick
The purpose of adaptive mesh refinement is to match the computational demands
to an application's activity. In a fluid flow problem, this means that only regions of high local
activity (shocks, boundary layers, etc.) demand increased computational effort, while
regions of little flow activity (or interest) are solved with relatively little
computational effort. A thorough exploitation of these techniques is crucial to the efficient
solution of more general problems arising in large scale computation.
The fast adaptive composite grid method (FAC) is an algorithm that uses uniform
grids, both global and local, to solve principally elliptic partial differential equations. However,
FAC suffers in the parallel environment from the way in which the levels of refinement
are treated sequentially. The asynchronous fast adaptive composite method, AFAC, and a
new method, AFACx, eliminate this parallel bottleneck. In both AFAC and AFACx,
individual refinement levels are processed in parallel. AFACx both generalizes AFAC and
permits the use of the more complex block structured mesh refinement required for self-adaptive
mesh refinement. Although each level's processing may be parallelized in FAC, AFAC and
AFACx may be much more efficiently parallelized. It is shown that, under most circumstances,
AFAC is superior to FAC in a parallel environment. The theory for AFACx and an
evaluation of its performance, including comparisons with FAC and AFAC, form a significant
part of this thesis; the remainder of the thesis details the object-oriented development of
parallel adaptive mesh refinement software.
The development of parallel adaptive mesh refinement software is divided into three
parts: 1) the abstraction of parallelism using the C++ parallel array class library P++; 2) the
abstraction of adaptive mesh refinement using the C++ serial adaptive mesh refinement class
library AMR++; 3) the serial application, specifically the single grid application, defined by
the user. Thus, we present a greatly simplified environment for the development of adaptive
mesh refinement software in both the serial and parallel environments. More specifically, such
work provides an architecture independent environment to support the development of more
general complex software, which might be targeted for a parallel environment.
This abstract accurately represents the content of the candidate's thesis. I recommend
its publication.
Signed
Stephen F. McCormick
DEDICATION
To my wife Kirsten and son Carsten, without whose patient support the decade of
undergraduate and graduate school would not have been possible.
ACKNOWLEDGEMENTS
Many people have helped me throughout my student years at CU Denver, and
I want to thank a select few. I owe my most sincere thanks to Steve McCormick for his
advice, support, and collaboration throughout my undergraduate and graduate education.
Additional thanks to: Max Lemke for his collaboration on the software development, his
thorough testing, and his insight into the singular perturbation example; Kristian Witsch,
Max's advisor, for advice on details of the singular perturbation example problem and for
supporting Max in our joint work; and Dinshaw Balsara for forcing me into the block
structured problems, which eventually led to the AFACx algorithm. Additional thanks go
to James Peery and Allen Robinson at Sandia National Laboratory for the loan of a SUN
Sparc Station, which allowed for significant extension of the software development of P++
and AMR++, and my mother Pat Quinlan, a graphics artist, who helped prepare the figures.
My deep appreciation goes to AFOSR for its early support of my work by way of a graduate
research assistantship and a later summer fellowship.
Finally, I sincerely thank Tom Zang at NASA Langley for his support, which led
to a three-year NASA fellowship that allowed me to pursue my doctoral research work. The
motivation for much of this work was the problem that he proposed for my fellowship, which
originally came from an SBIR grant that was funded by NASA Langley and for which I was
the Principal Investigator.
CONTENTS
CHAPTER
1 INTRODUCTION.......................................................... 1
1.1 Algorithm Development............................................ 1
1.2 Software Development ............................................ 5
1.3 Problems and Future Work........................................ 10
2 PARALLEL ALGORITHM DESIGN (THEORY) .................................. 12
2.1 Overview of the FAC and AFAC Composite Grid Algorithms ..... 12
2.2 Notation and Definition of FAC, AFAC, and AFACx Algorithms ... 13
2.2.1 FAC Algorithm .......................................... 15
2.2.2 AFAC Algorithm .......................................... 16
2.2.3 AFACx Algorithm.......................................... 18
2.2.3.1 AFACx Motivation........................................ 18
2.2.3.2 AFACx Definition........................................ 21
2.3 AFACx Convergence Theory ....................................... 22
3 PARALLEL SOFTWARE DESIGN............................................. 33
3.1 Introduction.................................................... 33
3.2 C Language Implementation of Parallel AFAC/FAC.................. 34
3.2.1 Program Structure ....................................... 34
3.2.2 Data Structures ......................................... 35
3.2.3 Multigrid Solver......................................... 35
3.2.4 AFAC/FAC Scheduler....................................... 36
3.2.5 Grid Manager ............................................ 36
3.2.6 Multilevel Load Balancer (MLB)........................... 36
3.2.7 Data Flow ............................................... 37
3.3 Problems with Computer Languages ................................. 38
3.3.1 Problems with FORTRAN ................................... 38
3.3.2 Problems with Procedural Languages ...................... 41
3.4 Motivation for Object-Oriented Design ............................ 43
3.4.1 Problems with the Object-Oriented C++ Languages ......... 49
3.5 P++, a Parallel Array Class Library .............................. 49
3.5.1 Introduction and Motivation ............................. 49
3.5.2 Goals of the P++ Development ............................ 53
3.5.3 The P++ Applications Class .............................. 54
3.5.4 P++ Implementation in Standard C++ ...................... 54
3.5.5 The Programming Model of P++ ............................ 55
3.5.5.1 Single Program Multiple Data (SPMD) Model ............. 56
3.5.5.2 Virtual Shared Grids (VSG) Model ...................... 56
3.5.6 The Object-Oriented Design of P++ ....................... 62
3.5.7 The P++ User Interface .................................. 64
3.5.7.1 The M++ array class library ........................... 65
3.5.7.2 The P++ Optimization Manager .......................... 67
3.5.8 Portability and Target Architectures of P++ ............. 68
3.5.9 Performance of P++ ...................................... 69
3.5.10 P++ Support for Specific Examples ...................... 72
3.5.10.1 Standard Multigrid Algorithms on Rectangular Domains . 72
3.5.10.2 Multilevel local refinement algorithms on block structured grids 73
3.5.11 Related Research ....................................... 76
3.6 AMR++, an Adaptive Mesh Refinement Class Library ................. 78
3.6.1 Introduction ............................................ 78
3.6.2 Block Structured Grids Features and Restrictions............... 78
3.6.3 Some Implementation Issues.................................. 81
3.6.4 Object-Oriented Design and User Interface........................ 83
3.6.5 Static and Dynamic Adaptivity, Grid Generation................... 85
3.6.6 Current State and Related Work .................................. 87
3.7 Object-Oriented Design for Parallel Adaptive Refinement................. 88
4 PARALLEL ADAPTIVE MESH REFINEMENT: RESULTS............................. 90
4.1 Introduction............................................................ 90
4.2 Comparison of Convergence Factors....................................... 93
4.2.1 Composite Grid Convergence Factors for Poisson Equation . 93
4.2.2 Convergence Factors for Singular Perturbation Problem .... 94
4.3 Performance of AFAC............................................... 96
4.3.1 Simple Example on iPSC/2:........................................ 97
4.3.2 Simple Example on iPSC/1:........................................ 99
4.3.3 Complex Example on iPSC/1:....................................... 99
4.4 Performance Comparison of FAC and AFAC Algorithms ................ 101
4.4.1 Parallelization for Distributed Architectures................... 101
4.4.1.1 Parallel Multigrid Solver Standard Grid Partitioning......... 101
4.4.1.2 Parallelization of FAC Linear Single Level Grid Partitioning 103
4.4.1.3 Parallelization of AFAC Linear Multilevel Grid Partitioning . 103
4.4.2 Comparison of Interprocessor Communication in FAC and AFAC 104
4.4.3 Relative Performance Results ................................... 107
4.4.4 Conclusions of FAC versus AFAC.................................. 108
4.5 Dynamic Adaptivity in Parallel Environment........................ 110
4.5.1 Multilevel Load Balancing ................................. 111
4.5.2 Dynamic Movement of Local Refinement (Grid Tracking) . 111
4.5.3 Dynamic Addition of Local Refinement ................ 112
4.5.4 Relative Costs of Dynamic Adaptivity................. 112
4.5.5 Conclusions about Dynamic Adaptivity................. 112
4.6 Self-Adaptive Refinement Using P++/AMR++ ..................... 113
5 CONCLUSIONS...................................................... 126
6 FUTURE WORK..................................................... 128
BIBLIOGRAPHY ...................................................... 135
FIGURES
FIGURE
1 Example composite grid with five levels................................... 11
2 Effect of overlapping communication with computation...................... 39
3 FORTRAN and C of C++ example code fragments............................... 41
4 C Language example for block by block cache based execution............... 42
5 Distributed memory example code........................................... 44
6 Equivalent P++, object-oriented example code............................. 45
7 Equivalent P++, object-oriented example code using explicit indexing. . 45
8 Equivalent P++, object-oriented example code............................. 47
9 An example for VSG Update based on the Owner computes rule: A = B
+ C on 3 processors.................................................... 59
10 The standard method of grid partitioning with overlap..................... 59
11 The object-oriented design of the P++ environment......................... 63
12 C++ / M++ / P++ example code: MacCormack (Hyperbolic) Scheme. . 66
13 The interaction of the Overlap Update and VSG Update concepts for standard
multigrid partitioning with coarse level agglomeration................. 73
14 The interaction of the Overlap Update, VSG Update, and BSG Interface
Update concepts for FAC and AFAC partitioning of a block structured
locally refined grid ........................................... 74
15 Example of a composite grid, its composite grid tree, and a cutout of two
blocks with their extended boundaries and their interface.............. 89
16 Simple composite grid .................................................... 98
17 Regular single level strip partitioning of a 3level composite grid structure
onto 4 processors (FAC)..................................................... 121
18 Irregular multilevel strip partitioning of a 7level composite grid structure
onto 16 processors (AFAC)................................................... 121
19 Complex composite grid problem with 40 patches............................ 122
20 Timings in milliseconds with respect to patch and parallel processor
system size (AFAC: black bars, FAC: white bars)........................ 123
21 Timings for Load Balancing using MLB ..................................... 124
22 Results for a singular perturbation problem: plots of the solution, the error,
and the composite grid for two different choices of the accuracy η in the
self-adaptive refinement process........................................... 125
TABLES
TABLE
1 Convergence factors for Poisson's equation on simple composite grid. ... 94
2 Convergence factors for Poisson's equation on block structured composite
grid.................................................................... 94
3 Convergence factors for a singular perturbation problem and, for comparison,
for Poisson's equation................................................... 96
4 Timings for simple AFAC on iPSC/2................................ 100
5 Timings for Complex AFAC on iPSC/1............................... 102
6 Communication structure analysis of FAC and AFAC for the simple 7 patch
model problem........................................................ 105
7 Communication structure analysis of FAC and AFAC for the complex 30
patch model problem.................................................. 106
8 Timings for repositioning grids.................................. 114
9 Timings for addition of new refinement........................... 115
10 Relative timings of AFAC for Moving, Adding, and Load Balancing (MLB). 116
11 Accuracy (L1-norm e) vs. number of grid points (n) and number of blocks
(b) for MGV on a uniform grid and FAC on self-adaptively refined composite
grids......................................................... 119
CHAPTER 1
INTRODUCTION
One of the aims in the development of efficient algorithms to solve partial differential
equations (PDEs) is to allow the computational intensity to be proportional to the
activity that the solution resolves. What this means practically is that solutions are then
obtained with minimum computation, which in turn allows for investigation of larger, even
more complicated problems, which in turn we want solved with minimum computational effort,
and so on. It is the finite resources of even today's most modern computers that ensure
termination of this recursion. Since, in a realistic application, this activity is nonuniformly
distributed and localized in the problem space, the use of local refinement reduces computational
complexity away from these localized regions and so reduces the global computational
work. More specifically, this use of local refinement allows for greater accuracy in the computation
of large scale problems, so the solution is obtained more efficiently. The resulting
computational mesh is called the composite grid (see figure 1). Since the need for
such refinement is often seen only at run time as the solution evolves, such refinement often
must be added self-adaptively. The use of self-adaptive refinement is most important when
more than just a few local refinement regions are used, since then the error of not adding
refinement where required jeopardizes the effectiveness of additional levels of refinement.
1.1 Algorithm Development
A complicating feature in the development of these local refinement techniques is
their introduction onto parallel computers. The nonuniform nature of the computational
workload for a composite grid is in direct conflict with the general goal of efficient processor
utilization. Classical methods of computing with local refinement grids, aside from exhibiting
slow convergence to the solution, require substantial synchronous intergrid transfers and
processing of the solution between the composite grid levels. This means that, with a composite
grid load balanced onto multiple processors, there is substantial inefficiency due to this
synchronous processing of grid levels and the nonuniform use of local refinement throughout
the problem domain. The alternative of partitioning each level across all processors is
problematic since such partitionings greatly reduce each composite grid level's representation
(size) in each processor on massively parallel architectures. Indeed, new algorithms for
solving equations posed on composite grids have an opportunity to improve on the performance
of the parallelized serial algorithms. Thus, meaningful work on parallel adaptive
mesh refinement is not just a computer science issue, but one that combines the design of
parallel algorithms with the development of better mathematical algorithms, clearly crossing
both disciplines almost equally.
The existing fast adaptive composite grid method, FAC (see [40] and [39]), is a discretization
and solution method for partial differential equations designed to achieve efficient
local resolution by systematically constructing the discretization based on various regular
grids and using them as a basis for fast solution. Using multigrid as the individual grid
solver, FAC has been applied to a variety of fluid flow problems, including the incompressible
Navier-Stokes equations [36] in both two and three dimensions. A more recently developed
variation of FAC designed for parallel computers, the asynchronous fast adaptive composite
grid method (AFAC), allows for independent processing of the individual regular grids that
constitute the composite grid. This means that the cost of solving the independent grid
equations may be fully shared by independent processing units.
AFAC is a method for adaptive grid discretization and solution of partial differential
equations. Where FAC is forced to process the composite grid levels synchronously, AFAC
eliminates this parallel bottleneck. Through a simple mechanism used to reduce inter-level
dependence, individual refinement levels can be processed by AFAC in parallel. Coupled
with multigrid, MG (see [10], [22], and [41]), for processing each constituent grid level, AFAC
is usually able to solve the composite grid problem in a time proportional to what it would
take to solve the global grid alone, with no added refinement. See Hart and McCormick [1]
for further details.
Because of the way local grids add computational load irregularly to processors,
an efficient load balancer is an essential ingredient for implementing adaptive methods on
distributed memory machines. The complexity of this process, as well as of the overall algorithm
itself, depends intimately on how the processors are assigned to the computational tasks. The
independence of the various refinement levels in the AFAC process allows these assignments
to be made by level (in contrast to the usual domain decomposition approach), which greatly
simplifies the associated load balancing of the composite grid. To balance loads between these
levels, in this thesis we develop a new load balancing algorithm, Multilevel Load Balancing
(MLB), which borrows heavily from the multilevel philosophy that guides almost all of the
work presented. Specifically, MLB is a load balancing strategy that addresses the different
levels of disparity in the loads that are spread spatially across the multiprocessor system.
The algorithm is intended for use with applications that change dynamically, as is the case
in self-adaptive mesh refinement and time dependent applications. Other load balancers
have been developed previously, but most exhibit slow performance that limits their
usefulness on anything but predominantly static applications.
But even with the capability to solve composite grid problems efficiently in a parallel
environment, there is still a large class of practical problems that are not sufficiently well
addressed. The problems in this class are dynamic in nature: the movement, time
evolution, and decay of regions of activity force continual and substantial manipulation of
the resulting partitioned composite grid. Local refinement around, or upstream of, shocks,
for example, often must be allowed to move (track the shock) so that the time dependent
problem may be solved properly. Using conventional methods for assignment of work to
processors, the result is an inefficient handling of this large class of problems because of
how and where the data is localized in distributed memory. However, analysis of these
inefficiencies leads to the sort of unconventional methods for assignment of work that can
be processed efficiently in a parallel environment using AFAC and the partitioning strategy
used in MLB. The movement of regions of local refinement that occurs in
dynamic problems is not handled efficiently using conventional techniques (e.g., partitioning
based on the division of the problem domain). In contrast, the new methods for
assignment of work allow for more efficient handling of this important class of dynamic
problems in parallel environments.
Our experiments with a variety of local refinement algorithms for the solution of the
simple potential flow equation on parallel distributed memory architectures demonstrate
that, with the correct choice of solvers, the performance of local refinement codes shows no
significant sign of degradation as more processors are used. Contrary to conventional wisdom,
the fundamental techniques used in our adaptive mesh refinement methods do not oppose
the requirements for efficient vectorization and parallelization. In fact, this research has
shown that algorithms that are expensive on serial and vector architectures, but are highly
parallelizable, can be superior on parallel architectures. Thus, parallelization capabilities
play an important role in the choice of the best suited algorithm for a specific application.
Chapter 2 details the development of AFAC and AFACx, including computational
results of their use in a parallel environment (more results are found in sections 4.2 and
4.3). The results comparing FAC and AFAC appear in chapter 4, while results on the use
of dynamic adaptivity for time dependent equations are presented in section 4.5. These
results compare the relative computational costs of the two iterative algorithms, including
the computational costs of adding and removing refinement regions adaptively and of load
balancing the resulting composite grid using MLB. Section 3.2.6 describes MLB, a new
multilevel load balancing algorithm for parallel adaptive refinement algorithms; its use is
central to good performance on parallel architectures. Section 4.4 contains a comparison
of the parallel performances of FAC and AFAC for two sample composite grids. Section 2.2.3
introduces a new algorithm, AFACx, which improves on the AFAC algorithm by expanding
applicability to more complex block structured local refinement grids. AFACx
both simplifies the use of parallel adaptive mesh refinement and is more efficient than AFAC.
There we present the motivating theory for AFACx as well as a convergence proof. The next
subsection illustrates the use of self-adaptivity by presenting an example problem where it
is used. The final subsection addresses the complexity analyses of FAC, AFAC, and AFACx.
In particular, we answer some problematic questions about how composite grids can be
optimally partitioned for these three algorithms.
1.2 Software Development
The work introduced in the algorithm design section was the product of two separate
parallel adaptive refinement implementations. The purpose of this section is to introduce the
chapters that detail some of the more practical aspects of these separate implementations.
The implementation of parallel adaptive refinement, required for meaningful al
gorithm development, is a nontrivial issue that has historically limited further research
in this important field. For the work in this thesis to be accomplished, the problem of
practical development of parallel adaptive refinement software was addressed. The results
form a significant part of this thesis because they so successfully resolve the requirements
of selfadaptive mesh refinement for complex applications, even though the problems that
are presented are simplistic in nature. Specifically, this thesis presents an objectoriented
set of libraries in C++ (class libraries) that abstract the details of parallel adaptive refine
ment by separating the abstractions of parallelism and adaptive refinement from the users
application code, which is permitted to define the application on only a uniform domain. It
6
is hoped that additional collaborative work will make the set of C++ class libraries more
generally useful and available in the near future.
Parallel adaptive mesh refinement represents a truly sophisticated level of software development,
which is important since it strains the limits of what can be accomplished in software
without massive support. Its use requires the development of many interrelated data
structures to permit the flow of control between the solvers that are required for the implementation
of the composite grid algorithms. Additionally, adaptive refinement is necessarily
dynamic, since the composite grid is most often required to evolve at run time.1 Because
of the dynamic requirements of adaptive mesh refinement, FORTRAN, a static language,
was not considered to be an option in the development of the final codes. Note that even
static refinement under explicit user control would require dynamic memory management,
since in the parallel environment data would have to be shuffled between processors in any
load balancing scheme. The experience in this project was that FORTRAN is too outdated
for use in the development of sophisticated parallel adaptive refinement software because
of its inability to support the abstractions required to define algorithms independent of the
location of the data that is manipulated. The concept of abstractions, and a presentation
of how and why an algorithm should be expressed independent of the organization of its
data, forms a basis for the P++ and AMR++ work that this thesis presents. We recognize
that special FORTRAN memory management could have been developed to address
some or most of these problems, but such work would have layered already complex code
on top of yet another layer of software with its own inherent restrictions.
The first implementation was done in the language C, which permitted dynamic
management of memory, a requirement of both static and adaptive refinement. The experience
with the working C version of the local refinement code is detailed in section 3.2.
Although written modularly and using the best techniques in software design known at the
time, the complexity of the parallelism in the adaptive refinement code could not be hidden,
thus greatly complicating the resulting code. The principal complications were in the
partitioning of the composite grid, the management of memory for data shuffled between
processors in the load balancing of the composite grid, the complex data structures that
required access from all levels, etc.; all of these features were necessarily handled explicitly
within the parallel adaptive refinement implementation. The C language version was
completed with two separate implementations, one for serial computers and a second for
parallel computers. Though acceptable for a research code, the requirement of separate serial
and parallel codes limits the ability of parallel computing to address even more complex
applications and, just as important, their maintenance in a commercial setting.
1 In our experience, the ratio of code supporting the details of adaptive refinement to code defining the
single grid application is approximately 20:1 for both serial and parallel environments (due to the proportionally
increased complexity of the parallel environment).
The C language version of parallel adaptive mesh refinement was run on the iPSC/1,
iPSC/2, iPSC/860, SUPRENUM, and nCUBE parallel machines using precisely the same
code, so a degree of parallel machine independence was actually achieved. Due to the conventional
way the code was developed, with a procedural language (FORTRAN is also a
procedural language), these complexities combined, and in effect multiplied, to limit the
ability to develop the sort of sophisticated application codes we need for solving complicated
flow problems efficiently. A more sophisticated self-adaptive mesh refinement code would
have made work even more difficult. Later work in C++, however, better allowed for the
management of these difficult issues and permitted even greater sophistication. The consistent
goal throughout was to present a greater degree of architecture independence in order
to simplify the implementation further.
The experiences with parallel adaptive mesh refinement in the original implementation
in C, and those with a much more complex local refinement hypersonic application
developed (with Dinshaw Balsara) specifically for the serial environment, have motivated additional
work to simplify the development of such numerical software for parallel distributed
memory computers. The second generation of the parallel adaptive refinement implementation
has been done much differently, because the problems of combined parallelism, adaptive
refinement, and application specific code are too interrelated to permit significant advancement
to more complex forms of adaptivity and applications.
The solution to this software difficulty presents abstractions as a means of handling
the combined complexities of adaptivity, mesh refinement, the application specific algorithm,
and parallelism. The abstraction of adaptive refinement is represented by the definition of
adaptive refinement in an object-oriented way that is independent of both explicit parallel
details and specific application details. The abstraction of parallelism is to represent the lower
level array operations in a way that is independent of parallelism (and adaptive refinement).
These abstractions greatly simplify the development of algorithms and codes for complex
applications. As an example, the abstraction of parallelism permits the development of
application codes (necessarily based on parallel algorithms, as opposed to serial algorithms
whose data and computation structures do not allow parallelization) in the simplified serial
environment, and the same code can be executed in a massively parallel distributed memory
environment. Since the codes require only a serial C++ compiler, we avoid the machine-dependent
restrictions of research projects involving parallel compilers.2
We attack the details of parallel adaptive mesh refinement software development
by dividing the problem into three large parts: 1) the abstraction of parallelism using the
C++ parallel array class library P++; 2) the abstraction of adaptive mesh refinement using
the C++ serial adaptive mesh refinement class library AMR++; 3) the serial application,
specifically the single grid application, defined by the user. The division into these parts serves to
make the project smaller than it would otherwise be, since the development of large codes
is inherently nonlinear. Additionally, each part is sufficiently separate to form the basis of
other large software, so the pieces are substantially reusable. This sort of code reuse is a
common feature of the C++ object-oriented language. Thus, we present a greatly simplified
environment for the development of adaptive mesh refinement software in both the serial
and parallel environments. More specifically, such work provides an architecture independent
environment to support the development of more general complex software, which might be
targeted for the parallel environment. We now summarize the individual parts.
2 However, such work toward parallel C++ compilers (notably CC++ and pC++) is important.
P++ is a C++ parallel array class library that simplifies the development of efficient parallel programs for large scale scientific applications, while providing portability across the widest variety of computer architectures. The interface for P++ matches that of M++3, a commercially available array class library for serial machines, so numerical applications developed in the serial environment may be recompiled, unchanged, to run in the parallel distributed memory environment. Although general in scope, P++ supports current research in parallel self-adaptive mesh refinement methods by providing parallel support for AMR++, a serial class library specific to self-adaptive mesh refinement. The P++ environment supports parallelism using a standard language, C++, with absolutely no modification of the compiler. For parallel communication, it employs existing widely portable communications libraries. Such an environment allows existing C++ language compilers to be used to develop software in the preferred serial environment, and such software to be efficiently run, unchanged, in all target environments.
AMR++ is a C++ adaptive mesh refinement class library that abstracts details of adaptive mesh refinement independent of the user's specific application code, which is used much like an input parameter to AMR++. AMR++ is written using the M++ serial array class library interface, and thus can be recompiled using P++ to execute in either the serial or parallel distributed memory environment. Thus, AMR++ is a serial application written at a higher level than P++, and specific to self-adaptive refinement. Forming a serial adaptive mesh refinement code requires only a uniform grid solver specific to the user's application. If the user's application code uses the M++/P++ array interface, then the user's AMR++/application code can be recompiled to execute in the parallel environment.
3M++ is a product of Dyad Software.
Thus, such abstractions as AMR++ and P++ greatly simplify the development of complex
serial, and especially parallel, adaptive mesh refinement software.
Chapter 3 presents the details of the first generation of the parallel adaptive refinement code and the problems that were solved and introduced by the combined complexity of adaptive refinement in the parallel environment. The work in this chapter motivated further work that simplified the object-oriented development of the later, more complex parallel codes. Additionally, it motivates the requirement for a superior language for development of the adaptive mesh refinement application, whether targeted for the serial or parallel environments. Chapter 3 also presents the object-oriented design of the second generation parallel adaptive mesh refinement implementation. Section 3.5 presents the P++ parallel array class library and section 3.6 presents the AMR++ class library, including details of the self-adaptive mesh refinement strategy.
1.3 Problems and Future Work
Problems with the current work are discussed in chapter 5, since it is these problems
that the future work will attempt to address. Finally, chapter 6 presents some of the future
work that might be done to expand the usefulness of adaptive refinement methods for parallel
architectures. These include improvements and ideas for the composite grid algorithms and
the object-oriented strategies that guide their practical implementation. Additional detail focuses on possible improvements to the parallel array class library P++ and the adaptive mesh refinement class library AMR++.
[Figure: composite grid with five levels]
CHAPTER 2
PARALLEL ALGORITHM DESIGN (THEORY)
2.1 Overview of the FAC and AFAC Composite Grid Algorithms
Reasons for basing the solution process on a composite grid include: 1) uniform
solvers and their discretizations are more easily defined for complex equations; 2) multigrid
solvers, which are appropriate to use, are simpler and more efficient on uniform grids; and 3)
iterative processes are most efficiently implemented for the case of structured uniform grids.
In section 2.2, we introduce the existing fast adaptive composite grid method, FAC, and the asynchronous fast adaptive composite grid method, AFAC, and use them to motivate and define the details of AFACx, the new algorithm that this thesis presents. Convergence for
the AFACx algorithm is proved in section 2.3. A different and more restricted analysis is
contained in [43].
Both FAC and AFAC are multilevel methods for adaptive grid discretization and
solution of partial differential equations. FAC has a sequential limitation to the processing of
the individual levels of refinement, whereas AFAC has much better complexity in a parallel
computing environment because it allows for simultaneous processing of all levels in the
computationally dominant solution phase. Coupled with multigrid (MG) processing of each
level and nested iteration on the composite grids, AFAC is usually able to solve the composite
grid equations in a time proportional to what it would take to solve the global grid alone.
See [22] and [41] for further details.
Both FAC and AFAC consist of two basic steps that are described loosely as follows.
Step 1. Given the solution approximation and composite grid residuals on each level, use
MG to compute a correction local to that level (solving the error equation). Step 2. Combine
the local corrections with the global solution approximation, compute the global composite
grid residual, and transfer the local components of the approximation and residual to each
level. The difference between FAC and AFAC is in the order in which the levels are processed
and in the details of how they are combined.
Convergence theory in [40] shows that the FAC and AFAC asymptotic convergence factors satisfy the relation $|||AFAC||| = |||FAC|||^{\frac12}$, where $AFAC$ and $FAC$ are the error propagation operators for AFAC and FAC, respectively, and $|||\cdot|||$ is the composite grid energy norm ($|||u^c||| = (L^c u^c, u^c)^{\frac12}$, where $(\cdot,\cdot)$ denotes the $L^2$ inner product and $L^c$ the composite grid operator). Although the theory is restricted to the two-level composite grid, experimental results have verified this relation even on a very large number of levels (a specific test verified this property on a 50-level composite grid1). Though the algorithmic components in our code are chosen slightly differently than for the convergence analysis, experience shows that very similar behavior is obtained. This implies that two cycles of AFAC are roughly equivalent to one cycle of FAC.
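A small worked example makes the arithmetic of this equivalence explicit (the convergence factor 0.01 is hypothetical, chosen only for illustration):

```latex
|||AFAC||| = |||FAC|||^{\frac12} = (0.01)^{\frac12} = 0.1,
\qquad\text{so}\qquad
|||AFAC|||^{2} = 0.1^{2} = 0.01 = |||FAC|||,
```

that is, two AFAC cycles achieve the same asymptotic error reduction as one FAC cycle, while each AFAC cycle processes all refinement levels simultaneously.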
2.2 Notation and Definition of FAC, AFAC, and AFACx Algorithms
To define these algorithms, we present the following problem and notation. We begin by posing the weak form of the continuous equation and its discretization. Let $H^1_0(\Omega)$ be the usual Sobolev space of real-valued functions on the problem domain $\Omega$ that have zero values (in the usual sense) on the boundary, $\partial\Omega$, of $\Omega$. Assume that $a(\cdot,\cdot)$ is a generic, real-valued, bilinear, symmetric, bounded, and coercive form on $H^1_0(\Omega) \times H^1_0(\Omega)$, and that $f(\cdot)$ is a generic real-valued functional on $H^1_0(\Omega)$. The weak form is then to find $u \in H^1_0(\Omega)$ such that
$$a(u,v) = f(v), \qquad \forall v \in H^1_0(\Omega).$$
1The actual solution of such a 50-level composite grid problem implies greater precision than double precision grid values permit, but this is an issue of machine accuracy.
To discretize this equation, let $T^0$ be a regular finite element partition of $\Omega^0$ (e.g., triangles or cells) and let $V^0 \subset H^1_0(\Omega^0)$ be the associated finite element space (e.g., piecewise linear functions on triangles or piecewise bilinear functions for cells). Let $\ell \ge 1$ denote the number of subregions of $\Omega$ (in the discrete problem, which we will develop, $\ell$ will define the number of composite grid levels), which we assume are nested:
$$\Omega^{\ell} \subset \Omega^{\ell-1} \subset \cdots \subset \Omega^{1} \subset \Omega^{0}.$$
For each $k \in \{1,2,\ldots,\ell\}$, let $T^k$ be a regular finite element partition of $\Omega^k$ constructed by appropriately refining the elements of $T^{k-1}$, and define $V^k \subset H^1_0(\Omega^k)$ as the finite element space associated with $T^k$. Note that functions in $V^k$ are zero on $\partial\Omega^k$. Define $W^k = V^{k-1} \cap V^k$, which is a coarse grid subspace of $V^k$. Note that $W^k$ is similar to $V^{k-1}$ except that it is local (if $\Omega^k$ is a proper subregion of $\Omega^{k-1}$). For convenience, let $W^0 = \{0\}$. Note that the refinement has been constructed so that it is conforming in the sense that $W^k = V^{k-1} \cap H^1_0(\Omega^k)$. We will refer to $W^k$ as the restricted local refinement space for the local refinement region $\Omega^k$. Define $\bar I^k : W^k \to V^k$ as the natural embedding (interpolation) operator.
Now define the composite grid by
$$\Omega^c = \bigcup_{k=0}^{\ell}\Omega^k,$$
and its associated space by
$$V^c = \sum_{k=0}^{\ell} V^k.$$
Then the discrete problem we solve is the composite grid Galerkin discretization, using the composite grid space $V^c$: find $u^c \in V^c$ such that
$$a(u^c, v) = f(v), \qquad \forall v \in V^c.$$
Let $L^k : V^k \to V^k$ be the discrete operator on grid space $V^k$ determined by $a$: $a(u^k, v^k) = (L^k u^k, v^k)$, $\forall u^k, v^k \in V^k$, where $(\cdot,\cdot)$ denotes the $L^2(\Omega)$ inner product. Note that $L^0$ is an approximation to the differential operator on the global region $\Omega$. Let $I^c_k : V^k \to V^c$ and $I^k_c : V^c \to V^k$ be given interlevel transfer operators (interpolation and restriction, respectively, defined by the finite element formulation). Note that $I^c_k$ is the natural embedding operator and $I^k_c$ is its adjoint. Finally, consider the restricted grids $\bar\Omega^k$ and their associated spaces $\bar V^k = W^k$ and operators $\bar L^k$, $\bar I^c_k$, and $\bar I^k_c$, and let $\bar u^k$ denote the restricted grid solution, $1 \le k \le \ell$. (For AFAC and AFACx, these restricted (intermediate) grids and their operators are designed to remove error components that are common to both fine and coarser levels and that would otherwise prevent simultaneous processing of the levels.)
2.2.1 FAC Algorithm FAC (see [40] and [39]) is a discretization and solution
method for partial differential equations designed to achieve efficient local resolution by
systematically constructing the discretization based on various regular grids and using them
as a basis for fast solution. Using multigrid as the individual grid solver, FAC has been
applied to a variety of fluid flow problems, including the incompressible Navier-Stokes equations
[36] in both two and three dimensions.
Loosely speaking, one FAC iteration consists of the following basic steps:
Step 1. For all $k \in \{0,1,\ldots,\ell\}$, compute $f^k$ by transferring the composite grid residual to $\Omega^k$.
Step 2. Set $k = 0$ (so that we start the computation on $\Omega^0$) and the initial guess on $\Omega^k$ to zero.
Step 3. Given the initial guess and composite grid residuals on level $k$, use multigrid (or, alternatively, any direct or iterative solver) to compute a correction local to that level, that is, solve the error equation that results from the use of the residual $f^k$ assigned to $\Omega^k$.
Step 4. If $k < \ell$, then: interpolate the solution (resulting from step 3) at the interface of levels $\Omega^k$ and $\Omega^{k+1}$ to supply $\Omega^{k+1}$ with complete boundary conditions, so that its correction equation is properly posed; interpolate it also to (the interior of) $\Omega^{k+1}$ to act as the initial guess; set $k \leftarrow k+1$; and go to step 3. If $k = \ell$, interpolate all corrections (i.e., solutions of each level's projected composite grid residual equations) from the finest level in each region (i.e., $\Omega^k \setminus \Omega^{k+1}$) to the composite grid.
To be more specific, in addition to the above notation, let $\tilde I^{k+1}_k : V^k \to V^{k+1}$ denote the mapping that interpolates values from level $k$ on the interface (i.e., the boundary of $\Omega^{k+1}$ that does not coincide with the boundary of $\Omega^0$). Note that the computation of the composite grid residual equations at the interface is covered in detail in [39]. Given the composite grid right-hand side $f^c$ and initial approximation $u^c$, then one iteration of FAC (see McCormick [39] for motivation and further detail) is defined more concretely as follows (we show here the direct solver version for simplicity, which requires no initial guess on $\Omega^k$):
Step 1. For all $k \in \{0,1,\ldots,\ell\}$, set $f^k = I^k_c(f^c - L^c u^c)$.
Step 2. Set $k = 0$.
Step 3. Compute $u^k = (L^k)^{-1} f^k$.
Step 4. If $k < \ell$, set $u^{k+1} = \tilde I^{k+1}_k u^k$ on the interface and go to step 3. If $k = \ell$, form $u^c = I^c_k u^k$ on $\Omega^k \setminus \Omega^{k+1}$ for each $k$ (with $\Omega^{\ell+1} = \emptyset$).
2.2.2 AFAC Algorithm AFAC ([22] and [41]) is a multilevel method for
adaptive grid discretization and solution of partial differential equations. AFAC appears
to have near optimal complexity in a parallel computing environment because it allows for
simultaneous processing of all levels of refinement. This is important because the solution
process on each grid, even with the best solvers, dominates the computational intensity.
This is especially true for systems of equations, where the solution process is far more computationally intensive than the evaluation of the residuals. Coupled with multigrid
processing of each level and nested iteration [39] on the composite grids, AFAC is usually
able to solve the composite grid equations in a time proportional to what it would take to
solve the global grid alone. See Hart and McCormick [22] and McCormick [40] for further
details.
The principal step is the computation of an approximation to the oscillatory component of the solution on each composite grid level. To simplify the explanation, we define $\hat u^k = I^c_k u^k - \bar I^c_k \bar u^k$ to be the oscillatory component of the solution $u^k$.
Loosely speaking, one AFAC iteration consists of the following basic steps:
Step 1. Compute $f^k$ for all $k \in \{0,1,\ldots,\ell\}$ by transferring the composite grid residual to $\Omega^k$, and similarly for $\bar f^k$ for all $k \in \{1,2,\ldots,\ell\}$.
Step 2. Set the initial guess to zero on $\Omega^k$ for all $k \in \{0,\ldots,\ell\}$, and similarly on $\bar\Omega^k$ for all $k \in \{1,2,\ldots,\ell\}$.
Step 3. For all grid levels $\Omega^k$ ($k \in \{0,1,\ldots,\ell\}$):
Substep 3a. Use multigrid (or, alternatively, any direct or fast iterative solver) to compute a correction local to that level, that is, solve the equation that results from the use of $f^k$ on $\Omega^k$ and $\bar f^k$ on $\bar\Omega^k$ ($k > 0$).
Substep 3b. Subtract the restricted grid solution from the local grid solution. This forms the oscillatory components.
Step 4. Interpolate and add the oscillatory components on all of $\Omega^k$ for all $k \in \{0,1,\ldots,\ell\}$ to all finer composite levels.
To be more specific, given the composite grid right-hand side $f^c$ and initial approximation $u^c$, then one iteration of AFAC (see McCormick [39] for motivation and further detail) is defined more concretely as follows (we show here the direct solver version for simplicity, which again needs no initial guesses on $\Omega^k$ or $\bar\Omega^k$):
Step 1. For all $k \in \{0,1,\ldots,\ell\}$, set $f^k = I^k_c(f^c - L^c u^c)$ and (for $k > 0$) $\bar f^k = \bar I^k_c(f^c - L^c u^c)$.
Step 2. For all $k \in \{0,1,\ldots,\ell\}$, set $u^k = 0$ and (for $k > 0$) $\bar u^k = 0$.
Step 3. For all $k \in \{0,1,\ldots,\ell\}$:
Substep 3a. Compute $u^k = (L^k)^{-1} f^k$ and (for $k > 0$) $\bar u^k = (\bar L^k)^{-1}\bar f^k$.
Substep 3b. Set $\hat u^k = I^c_k u^k - \bar I^c_k \bar u^k$ for $k \in \{1,2,\ldots,\ell\}$.
Step 4. Set $u^c = u^c + I^c_0 u^0 + \sum_{k=1}^{\ell}\hat u^k$.
The processing of each step of AFAC is fully parallelizable. For example, the levels in step 3 can be processed simultaneously by a MG solver, which is itself parallelizable [10]. The present version of the code uses synchronization between steps 3 and 4, although asynchronous processing would be allowed here. (With an efficient load balancing scheme, asynchronous processing of the levels provides little real advantage.) A more complete derivation of AFAC, along with a convergence proof and related theory, can be found in McCormick [40].
2.2.3 AFACx Algorithm AFACx is a new algorithm, which this thesis presents and analyzes as its principal mathematical development. The motivation for AFACx is the use of adaptive refinement for problems with complex internal regions demanding additional, but local, resolution. Such complex internal regions are found around, and upstream of, shocks and complex shock structures.2
In this section, since the AFACx algorithm is new, we present an expanded description, which includes the principal motivation for its development and use.
2.2.3.1 AFACx Motivation The use of adaptive refinement for problems with geometrically complex regions of activity requires more than simple rectangular refinement. In such problems, local refinement strategies must cover the target regions with nonregular meshes, or collections of regular rectangular grids that combine to conform to the nonregular regions. In the latter approach, using collections of regular rectangular grids, the efficiency and simplified implementation of complex application codes can be restricted to the more conventional setting3, where good efficiency on the structured rectangular grids is assured.
For explicit problems, the details of handling the resulting block structured local refinement regions are an issue only at the interfaces; the properties of the algorithm (stability,
2A motivating example problem for this work has been the simulation of enhanced mixing (such as flows passing through oscillating shock structures) in support of the National Aerospace Plane Project (NASP). However, the extension of this work to that problem is not complete.
3In section 3.6, we show that by using the AMR++ class library, only the uniform grid solver need be defined by the user.
convergence, etc.) are not as much an issue as they are for implicit problems.4 With implicit
equations, the handling of block structured grids is more problematic since the solution of
the individual blocks is not sufficient for the solution of the block structured grid. More
complex methods are typically required, but rarely applied, to also resolve the smoother
components that span the collection of block structured grids, since if only the blocks are
processed with iterative solvers, the smooth errors (across blocks) are poorly damped and
result in poor convergence factors for the block structured solution.
It would be sufficient to solve the block structured local refinement grid problems directly, but this is prohibitively expensive. Alternatively, we could define a multigrid set of coarsenings (of the blocks and the structure of blocks), which would permit fast multigrid solution of such block structured refinement regions. But the automated coarsening of the block structure (beyond that of simply the coarsened blocks themselves) is difficult to implement, and the solvers abstracted to work on such coarsenings are inefficient.5
For implicit solvers, the block structured solution is required, on the block structured grid, if we intend to use AFAC, since the formulation requires an approximate solution on the refinement patch. FAC could be simplified to use relaxation on the block structured regions starting from the solution interpolated from the coarser level (the global grid in the two-level composite grid case)6. The use of relaxation avoids the complication of constructing the block structured coarsenings that a multigrid solver would require for the block structured refinement region. We seek a similar efficiency, but with AFAC, so that the composite grid levels can be processed asynchronously. AFACx is just such an algorithm, since it requires no definition of the coarsened block structured local refinement regions and uses only relaxation on the predefined block structured refinement region.
4However, the details of these explicit algorithms at each grid point can be more complex, e.g., for PPM and ENO methods for the Euler equations, the Riemann solvers are more sophisticated than most relaxation methods used in the implicit MG solvers.
5A substantial amount of work was done on this approach, and its failure motivated the object-oriented approach that was taken and that led to AMR++ (see section 3.6).
6This version (variation) of FAC is introduced as FACx in section 4.2.
AFACx uses only the predefined block structured region and requires no construction of coarsening, even for the individual blocks, but still preserves the multilevel efficiency of AFAC7 and processes all levels asynchronously. The predefined block structured grid includes the finest level of the local refinement region and the grid built consistent with the flagged points on the coarser composite grid level, which were used to generate the block structured local refinement region. In practice it is easier, and equivalent, to let each block define a single coarsening. This coarsening of each block is guaranteed to exist because the local refinement block was derived from flagging points on the coarser composite grid level (the coarse grid points of the block structured refinement patch).
Thus, AFACx uses the finest level of the block structured refinement grid (the
block structured refinement grid itself), and a single coarser level (which we have shown is
predefined since it corresponds to the flagged points that were used initially to build the finer
local refinement level). Because AFACx avoids processing the coarser levels on each block,
it is cheaper than AFAC, though the difference is only in the processing of the coarser levels
and so it is not very significant in a serial environment.8 However, in a parallel environment,
the avoidance of processing coarser levels means substantially less message traffic in the
multiprocessor network, and a higher parallel efficiency for the overall method. Since, in
the context of adaptive mesh refinement, the local regions are sized according to demand, it
is likely that such refinement blocks would not be sufficiently large to adequately hide the
communication overhead of processing the coarsest levels. Thus, by avoiding the coarsening
altogether, we avoid a potentially significant overhead in the parallel environment, as well
as the complicated construction of the block structured coarsening that would be required
of a completely general block structured grid.
Finally, the use of relaxation on the block structured grids is what makes AFACx,
and the analogous variant of FAC, attractive. This is because the user defined relaxation
7The convergence factors are observed to be within 3% of those of AFAC.
8The relatively inefficient processing of short vectors is also avoided in the vector environment.
(which is assumed to be parallelizable) is easily supported on the block structured grids.
Then the process of exploiting the parallelism across blocks, on the block structured grid, is
equivalent to that of exploiting the parallelism across multiple processors, on the partitioned
grid. Thus, in the support for the block structured grid, interface boundaries are copied in
much the same way that messages are passed between processors. These details are hidden
from the user in the AMR++ class library in the same way that the message passing is hidden
from the user in the P++ class library; see chapter 3, sections 3.6 and 3.5, respectively.
2.2.3.2 AFACx Definition In the case of AFAC, we can attempt to understand
it by considering the use of exact solvers on the composite, local refinement, and restricted
grids. An important step in AFAC is the elimination of the common components between
the coarser composite grid level and the finer local refinement grid patch. This step allows
us to avoid the amplification of these components (inherently smooth components, since
only they are represented on, or shared by, both the local refinement and coarser levels) in
the interpolation and addition of the solution from the coarser levels up through the finer
composite grid levels, finally forming the composite grid approximation. The result in AFAC,
after this important step, on each level, is an approximation to the oscillatory contribution to
the composite grid solution, which is unique to that level. AFAC and AFACx differ only in
the way that this oscillatory contribution is computed: AFAC uses exact or fast approximate
solvers, and AFACx uses only relaxation (typically one or two sweeps).
In order to differentiate between the individual relaxation iterates, we will use subscripts: $u^k_n$ is the $n$th iterate approximating the solution $u^k$ on the $k$th grid level $\Omega^k$. Actually, we use only one iteration on $\Omega^k$, but allow for more on $\bar\Omega^k$. Let $\rho(L^k)$ be the spectral radius of the discrete differential operator $L^k$. Then Richardson's iteration, which we will use throughout our analysis, is given by
$$u^k_{n+1} = u^k_n - \frac{1}{\rho(L^k)}\left(L^k u^k_n - f^k\right) \equiv R^k(u^k_n; f^k),$$
and similarly for $\bar u^k_{n+1}$ and $\bar R^k(\bar u^k_n; \bar f^k)$. Denote $R^k_2(u^k; f^k) = R^k(R^k(u^k; f^k); f^k)$ and so on for $R^k_n$, so that $u^k_n = R^k_n(u^k_0; f^k)$. To simplify the explanation below, we define $\hat u^k_1 = I^c_k u^k_1 - \bar I^c_k \bar u^k_n$ to be the oscillatory component of the first iterate $u^k_1$. An important aspect of AFACx is that the relaxation iterate $\bar u^k_n$ is computed first and the initial guess for $\Omega^k$ is interpolated from the iterate on $\bar\Omega^k$ (i.e., $u^k_0 = \bar I^k \bar u^k_n$).
We can now define AFACx more precisely. Given the composite grid right-hand side $f^c$ and initial approximation $u^c$, then one cycle of AFACx based on one relaxation sweep per level $\Omega^k$ and $n$ sweeps per level $\bar\Omega^k$ is given by the following:
Step 1. For all $k \in \{0,1,\ldots,\ell\}$, set $f^k = I^k_c(f^c - L^c u^c)$ and ($k > 0$) $\bar f^k = \bar I^k_c(f^c - L^c u^c)$.
Step 2. For all $k \in \{1,2,\ldots,\ell\}$, set $\bar u^k_0 = 0$.
Step 3. For all $k \in \{0,1,\ldots,\ell\}$:
Substep 3a. Set $\bar u^k_n = \bar R^k_n(\bar u^k_0; \bar f^k)$ for $k > 0$ and $\bar u^k_n = 0$ for $k = 0$, then $u^k_0 = \bar I^k \bar u^k_n$, and $u^k_1 = R^k(u^k_0; f^k)$ for $k > 0$ and $u^0 = (L^0)^{-1} f^0$ for $k = 0$.
Substep 3b. Set $\hat u^k_1 = I^c_k u^k_1 - \bar I^c_k \bar u^k_n$.
Step 4. $u^c \leftarrow u^c + I^c_0 u^0 + \sum_{k=1}^{\ell}\hat u^k_1$.
Notice that step 3 uses only relaxation, and that the fine grid initial guess is the restricted grid approximation interpolated to $\Omega^k$, namely, $\bar I^k \bar u^k_n$. All steps except step 3 are the same as in AFAC.
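The Richardson sweeps that drive step 3 are easy to exercise on a model problem. The sketch below is illustrative only (it is not code from this thesis): it applies $R(u;f) = u - \frac{1}{\rho(L)}(Lu - f)$ to the 1-D discrete Laplacian $L = \mathrm{tridiag}(-1,2,-1)$, whose spectral radius is known in closed form, so that one can check that each sweep reduces the energy norm of the error:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Vec = std::vector<double>;

// Apply the 1-D discrete Laplacian L = tridiag(-1, 2, -1) (Dirichlet ends).
Vec apply_L(const Vec& u) {
    std::size_t n = u.size();
    Vec r(n);
    for (std::size_t i = 0; i < n; ++i) {
        double left  = (i > 0)     ? u[i - 1] : 0.0;
        double right = (i + 1 < n) ? u[i + 1] : 0.0;
        r[i] = 2.0 * u[i] - left - right;
    }
    return r;
}

// Spectral radius of tridiag(-1,2,-1) of order n: rho = 2 - 2 cos(n pi/(n+1)).
double rho_L(std::size_t n) {
    const double pi = 3.14159265358979323846;
    return 2.0 - 2.0 * std::cos(n * pi / (n + 1));
}

// One Richardson sweep: R(u; f) = u - (1/rho(L)) (L u - f).
Vec richardson(const Vec& u, const Vec& f) {
    Vec Lu = apply_L(u);
    Vec v(u.size());
    double inv = 1.0 / rho_L(u.size());
    for (std::size_t i = 0; i < u.size(); ++i)
        v[i] = u[i] - inv * (Lu[i] - f[i]);
    return v;
}

// Energy norm |||e||| = (L e, e)^{1/2}.
double energy_norm(const Vec& e) {
    Vec Le = apply_L(e);
    double s = 0.0;
    for (std::size_t i = 0; i < e.size(); ++i) s += Le[i] * e[i];
    return std::sqrt(s);
}
```

With $f = 0$ the error equals the iterate itself, and since every eigenvalue of $I - L/\rho(L)$ lies in $[0,1)$, the energy norm decreases strictly from any nonzero start.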
In the next section, the connections between AFACx and AFAC are further clarified
since the convergence proof of AFACx relies on convergence theory developed for AFAC.
2.3 AFACx Convergence Theory
We prove the convergence of AFACx in several steps. The basic idea for the development is to use existing AFAC theory [40] and establish that the AFACx convergence factor is at most only slightly larger than that of AFAC. As with the AFAC theory (McCormick [40]), we develop this in the restricted setting of a two-level composite grid problem.
To simplify notation in the case of a two-level composite grid, we introduce the
fine grid patch $\Omega_h = \Omega^1$ and its restricted grid $\Omega_{2h} = \bar\Omega^1$. In a similar way, we define the following:
the exact discrete solution on $\Omega_h$ is denoted by $u^h = u^1$, and on $\Omega_{2h}$ by $u^{2h} = \bar u^1$. Further, we use the subscript $n$ to denote the $n$th iterate of a relaxation step. Thus, $u^{2h}_n$ is the $n$th iterate of the relaxation operator on $\Omega_{2h}$ starting from $u^{2h}_0$.
the interpolation operator from $\Omega_{2h}$ to $\Omega_c$ is denoted by $I^c_{2h} = \bar I^c_1$.
the restriction operator from $\Omega_c$ to $\Omega_{2h}$ is denoted by $I^{2h}_c = \bar I^1_c$.
similarly, interpolation and restriction between $\Omega_{2h}$ and $\Omega_h$ are denoted by $I^h_{2h}$ and $I^{2h}_h$, respectively.
the fine grid discrete differential operator on $\Omega_h$ is denoted by $L^h = L^1$.
the restricted grid discrete differential operator on $\Omega_{2h}$ is denoted by $L^{2h} = \bar L^1$.
To support the comparison of AFACx and AFAC, we require a specific version of AFAC that uses a combination of exact and iterative solvers. To this end, we define AFAC$'$ as the usual AFAC method defined in section 2.2.2, except that the exact or multigrid solver on $\Omega_h$ is replaced by one relaxation sweep starting from the initial guess $u^h_0 = I^h_{2h}u^{2h}$. Note that $u^{2h}$ is the exact solution of the discrete problem on the restricted grid $\Omega_{2h}$. We will first relate AFAC$'$ to the approximate-solver version of AFAC introduced in [40], which is denoted by AFAC$_\varepsilon$, where $\varepsilon = \begin{pmatrix}\varepsilon^h\\ \varepsilon^{2h}\end{pmatrix}$. It is defined by replacing the exact solvers on $\Omega_{2h}$ and $\Omega_h$ by approximate solvers based on operators $M^{2h} \approx (L^{2h})^{-1}$ and $M^h \approx (L^h)^{-1}$ that satisfy
$$(1-\varepsilon^h)\,(L^h)^{-1} \le M^h \le (L^h)^{-1}$$
and
$$(1-\varepsilon^{2h})\,(L^{2h})^{-1} \le M^{2h} \le (L^{2h})^{-1},$$
respectively. That is, the exact solutions on $\Omega_{2h}$ and on $\Omega_h$ are replaced by the iterates $u^{2h} = M^{2h}I^{2h}_c(f^c - L^c u^c)$ and $u^h = I^h_{2h}u^{2h} - M^h\left(L^h I^h_{2h}u^{2h} - I^h_c(f^c - L^c u^c)\right)$, respectively.
Note that, in the case of Richardson's iteration, $M^{2h} = \frac{1}{\rho(L^{2h})}I$ and $M^h = \frac{1}{\rho(L^h)}I$.
Finally, let $\delta > 0$ and $b > 0$ be the quantities defined on pages 110 and 118 of [40] that are typically bounded uniformly in $h$, depending on the application. In this section, we assume that $L^c$ is symmetric and positive definite and that the following variational conditions hold:
$$L^{2h} = I^{2h}_h L^h I^h_{2h}, \qquad I^{2h}_h = c_{2h}\,(I^h_{2h})^T, \qquad I^h_c = c_h\,(I^c_h)^T,$$
where $c_{2h}$ and $c_h$ are positive constants. Assume further that $I^h_{2h}$ and $I^c_h$ are full rank, so that $L^{2h}$ and $L^h$ are (symmetric) positive definite.
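The Galerkin construction behind these variational conditions is easy to verify on a small example. The sketch below is illustrative only (not thesis code): it forms $L^{2h} = (I^h_{2h})^T L^h I^h_{2h}$ for the 1-D discrete Laplacian with standard linear interpolation and confirms that the coarse operator is again symmetric positive definite:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// 1-D discrete Laplacian tridiag(-1, 2, -1) of order n.
Mat laplacian(std::size_t n) {
    Mat A(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        A[i][i] = 2.0;
        if (i > 0) A[i][i - 1] = -1.0;
        if (i + 1 < n) A[i][i + 1] = -1.0;
    }
    return A;
}

// Linear interpolation from m coarse points to n = 2m+1 fine points:
// coarse point j sits at fine index 2j+1; even fine points average neighbors.
Mat interpolation(std::size_t m) {
    Mat P(2 * m + 1, std::vector<double>(m, 0.0));
    for (std::size_t j = 0; j < m; ++j) {
        P[2 * j][j] = 0.5;
        P[2 * j + 1][j] = 1.0;
        P[2 * j + 2][j] = 0.5;
    }
    return P;
}

Mat multiply(const Mat& A, const Mat& B) {
    Mat C(A.size(), std::vector<double>(B[0].size(), 0.0));
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t p = 0; p < B.size(); ++p)
            for (std::size_t j = 0; j < B[0].size(); ++j)
                C[i][j] += A[i][p] * B[p][j];
    return C;
}

Mat transpose(const Mat& A) {
    Mat T(A[0].size(), std::vector<double>(A.size(), 0.0));
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < A[0].size(); ++j) T[j][i] = A[i][j];
    return T;
}

// Galerkin coarse operator L2h = P^T L^h P.
Mat galerkin(const Mat& Lh, const Mat& P) {
    return multiply(transpose(P), multiply(Lh, P));
}
```

For the 7-point fine grid and 3-point coarse grid, $P^T L^h P$ works out to $\tfrac12\,\mathrm{tridiag}(-1,2,-1)$, so symmetry and positive definiteness are inherited exactly as the theory requires.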
Theorem 1 The spectral radii of the two-level exact solver versions of AFAC and FAC satisfy the relation
$$\rho(AFAC) = \rho(FAC)^{\frac12}.$$
Thus, with $|||\cdot|||$ denoting the composite grid energy norm ($|||u^c||| = (L^c u^c, u^c)^{\frac12}$), we have
$$|||AFAC||| = |||FAC|||^{\frac12}.$$
The convergence factor for the approximate solver version, AFAC$_\varepsilon$, satisfies
$$|||AFAC_\varepsilon||| \le (1-\varepsilon)\,|||AFAC||| + \varepsilon,$$
where $\varepsilon = \max(\varepsilon^{2h}, \varepsilon^h)$.
Proof: See McCormick [40], page 144.
Lemma 1 AFAC$'$ converges with factor bounded according to
$$|||AFAC'||| \le (1-\varepsilon^h)\,|||AFAC||| + \varepsilon^h.$$
Proof: First consider the case $n = 1$. The inequality on page 118 of [40] shows that
$$\frac{|||\left(I - M^h L^h\right)e^h|||^2}{|||e^h|||^2} \le \varepsilon^h \qquad(1)$$
for the initial error $e^h$ on $\Omega_h$. Since $M^h = \frac{1}{\rho(L^h)}I > 0$ for Richardson iteration, we clearly have
$$\left(r^h, (1-\varepsilon^h)(L^h)^{-1}r^h\right) \le \left(r^h, M^h r^h\right) \le \left(r^h, (L^h)^{-1}r^h\right)$$
for $r^h = L^h e^h$. The lemma now follows from the estimate for AFAC$_\varepsilon$ in Theorem 1 with $\varepsilon = \begin{pmatrix}\varepsilon^h\\ 0\end{pmatrix}$. The lemma now also follows for general $n \ge 1$ because additional relaxations on level $h$ in AFAC$'$ cannot increase the energy norm of the error.
Q.E.D.
AFAC$'$ and AFACx differ by a perturbation term that can be expressed in terms of the operator
$$P^h_n = \frac{1}{\rho(L^h)}\,L^h I^h_{2h}\left(I - \frac{1}{\rho(L^{2h})}L^{2h}\right)^n. \qquad(2)$$
Lemma 2 The error propagation operators AFACx and AFAC$'$ are related according to
$$AFACx\,e^c = AFAC'\,e^c + I^c_h P^h_n e^{2h}_0,$$
where $e^{2h}_0$ is the initial error on level $2h$.
Proof: First consider the case $n = 1$. Following the definition of AFACx, the iterate $u^{2h}_1$ on $\Omega_{2h}$ using the initial guess $u^{2h}_0$ is given by
$$u^{2h}_1 = u^{2h}_0 - \frac{1}{\rho(L^{2h})}\left(L^{2h}u^{2h}_0 - f^{2h}\right).$$
Then, as defined, the initial guess for the subsequent relaxation step on $\Omega_h$ is just $u^h_0 = I^h_{2h}u^{2h}_1$. So the iterate on $\Omega_h$ is computed as
$$u^h_1 = u^h_0 - \frac{1}{\rho(L^h)}\left(L^h u^h_0 - f^h\right).$$
Then, by substitution,
$$u^h_1 = I^h_{2h}\left[u^{2h}_0 - \frac{1}{\rho(L^{2h})}\left(L^{2h}u^{2h}_0 - f^{2h}\right)\right] - \frac{1}{\rho(L^h)}\left(L^h I^h_{2h}\left[u^{2h}_0 - \frac{1}{\rho(L^{2h})}\left(L^{2h}u^{2h}_0 - f^{2h}\right)\right] - f^h\right).$$
Consider the splitting of $u^h$ into its energy orthogonal components: $u^h = I^h_{2h}u^{2h} + t^h$, where $u^{2h}$ is a level $2h$ component and $I^{2h}_h L^h t^h = 0$. Then, with $e^{2h}_0 = u^{2h}_0 - u^{2h}$ and $f^h = L^h\left(I^h_{2h}u^{2h} + t^h\right)$, we thus have
$$u^h_1 = I^h_{2h}\left[u^{2h}_0 - \frac{1}{\rho(L^{2h})}L^{2h}e^{2h}_0\right] - \frac{1}{\rho(L^h)}L^h I^h_{2h}\left[e^{2h}_0 - \frac{1}{\rho(L^{2h})}L^{2h}e^{2h}_0\right] + \frac{1}{\rho(L^h)}L^h t^h. \qquad(3)$$
Now, following the definition of AFAC$'$, its iterate $\tilde u^h_1$, which uses as its initial guess the exact solution from grid $\Omega_{2h}$ (namely, $I^h_{2h}u^{2h}$), is given by9
$$\tilde u^h_1 = I^h_{2h}u^{2h} + \frac{1}{\rho(L^h)}L^h t^h. \qquad(4)$$
The final processing step on $\Omega_h$ subtracts from the final iterate the interpolated $2h$ approximation, which is just
$$I^h_{2h}\left[u^{2h}_0 - \frac{1}{\rho(L^{2h})}L^{2h}e^{2h}_0\right]$$
for the case of AFACx and $I^h_{2h}u^{2h}$ for the case of AFAC$'$. We represent the solution after the subtraction of the restricted grid solution for AFACx and AFAC$'$ by $\hat u^h$ and $\tilde u^h$, respectively:
$$\hat u^h = -\frac{1}{\rho(L^h)}L^h I^h_{2h}\left[I - \frac{1}{\rho(L^{2h})}L^{2h}\right]e^{2h}_0 + \frac{1}{\rho(L^h)}L^h t^h \qquad(5)$$
and
$$\tilde u^h = \tilde u^h_1 - I^h_{2h}u^{2h} = \frac{1}{\rho(L^h)}L^h t^h. \qquad(6)$$
The definition of $P^h_n$ for $n = 1$ then shows that
$$P^h_1 e^{2h}_0 = \frac{1}{\rho(L^h)}L^h I^h_{2h}\left[I - \frac{1}{\rho(L^{2h})}L^{2h}\right]e^{2h}_0 = \tilde u^h - \hat u^h, \qquad(7)$$
from which the desired equality follows.
9Here, the initial guess on $\Omega_h$ is given by $u^h_0 = I^h_{2h}u^{2h}$ on the fine grid patch.
Assuming now the more general case of $n \ge 1$ relaxation steps, it is easy to verify that the approximations after the final processing step satisfy
$$\hat u^h = -\frac{1}{\rho(L^h)}L^h I^h_{2h}\left(I - \frac{1}{\rho(L^{2h})}L^{2h}\right)^n e^{2h}_0 + \frac{1}{\rho(L^h)}L^h t^h \qquad(8)$$
and
$$\tilde u^h = \frac{1}{\rho(L^h)}L^h t^h. \qquad(9)$$
Again, the desired equality follows from the definition of $P^h_n$.
Q.E.D.
$|||\cdot|||$ was defined as the composite grid energy norm. But for functions $v^h \in V^h$, we note that $|||I^c_h v^h||| = (L^c I^c_h v^h, I^c_h v^h)^{\frac12} = (L^h v^h, v^h)^{\frac12}$, which we write simply as the fine grid energy norm $|||v^h||| = (L^h v^h, v^h)^{\frac12}$. Similarly, for $v^{2h} \in V^{2h}$ we can define $|||v^{2h}||| = (L^{2h}v^{2h}, v^{2h})^{\frac12} = |||I^h_{2h}v^{2h}|||$.
Lemma 3 The perturbation term $P^h_n e^{2h}_0$ is bounded according to
$$|||P^h_n e^{2h}_0||| \le \rho\left((L^{2h})^{-\frac12}F^{2h}\left(I^{2h}_h\frac{(L^h)^3}{\rho(L^h)^2}I^h_{2h}\right)F^{2h}(L^{2h})^{-\frac12}\right)^{\frac12}|||e^{2h}_0|||, \quad\text{where } F^{2h} = \left(I - \frac{1}{\rho(L^{2h})}L^{2h}\right)^n.$$
Proof: We can simplify the evaluation of the energy norm $|||\cdot|||$ since it is related to the $L^2$ norm $\|\cdot\|$:
$$|||P^h_n e^{2h}_0||| = \left(L^h P^h_n e^{2h}_0, P^h_n e^{2h}_0\right)^{\frac12} = \left((L^h)^{\frac12}P^h_n e^{2h}_0, (L^h)^{\frac12}P^h_n e^{2h}_0\right)^{\frac12} = \left\|(L^h)^{\frac12}P^h_n e^{2h}_0\right\|.$$
First note that
$$|||e^{2h}_0||| = |||I^h_{2h}e^{2h}_0||| = \left(L^h I^h_{2h}e^{2h}_0, I^h_{2h}e^{2h}_0\right)^{\frac12} = \left(e^{2h}_0, (I^h_{2h})^T L^h I^h_{2h}e^{2h}_0\right)^{\frac12} = \left(e^{2h}_0, L^{2h}e^{2h}_0\right)^{\frac12} = \left\|(L^{2h})^{\frac12}e^{2h}_0\right\|. \qquad(10)$$
By defining $w^{2h} = (L^{2h})^{\frac12}e^{2h}_0$, $G^h = \frac{1}{\rho(L^h)}L^h$, and $F^{2h} = \left(I - \frac{1}{\rho(L^{2h})}L^{2h}\right)^n$, we thus have
$$\frac{|||P^h_n e^{2h}_0|||}{|||e^{2h}_0|||} = \frac{\left\|(L^h)^{\frac12}G^h I^h_{2h}F^{2h}(L^{2h})^{-\frac12}w^{2h}\right\|}{\left\|w^{2h}\right\|} \le \sup_{\|w^{2h}\|=1}\left\|(L^h)^{\frac12}G^h I^h_{2h}F^{2h}(L^{2h})^{-\frac12}w^{2h}\right\| \qquad(11)$$
$$= \left\|(L^h)^{\frac12}G^h I^h_{2h}F^{2h}(L^{2h})^{-\frac12}\right\| = \rho\left((L^{2h})^{-\frac12}F^{2h}\left(I^{2h}_h G^h L^h G^h I^h_{2h}\right)F^{2h}(L^{2h})^{-\frac12}\right)^{\frac12} = \rho\left((L^{2h})^{-\frac12}F^{2h}\left(I^{2h}_h\frac{(L^h)^3}{\rho(L^h)^2}I^h_{2h}\right)F^{2h}(L^{2h})^{-\frac12}\right)^{\frac12}.$$
The last lines follow because, for any $n \times n$ matrix $A$, $\|A\|^2 = \rho(A^T A)$.
Q.E.D.
Lemma 4  The fine and restricted coarse grid operators satisfy the inverse relation
$$(L^h)^{-1} \ge I_{2h}^h (L^{2h})^{-1} I_h^{2h}.$$
Proof: For any matrix $A$, we have $A^T A \le I \iff A A^T \le I$. Hence,
$$I = (L^{2h})^{-\frac12} L^{2h} (L^{2h})^{-\frac12} = (L^{2h})^{-\frac12} I_h^{2h} L^h I_{2h}^h (L^{2h})^{-\frac12}$$
$$\Rightarrow I \ge (L^h)^{\frac12} I_{2h}^h (L^{2h})^{-\frac12} (L^{2h})^{-\frac12} I_h^{2h} (L^h)^{\frac12} = (L^h)^{\frac12} I_{2h}^h (L^{2h})^{-1} I_h^{2h} (L^h)^{\frac12}$$
$$\Rightarrow (L^h)^{-1} \ge I_{2h}^h (L^{2h})^{-1} I_h^{2h}. \tag{12}$$
Q.E.D.
Lemma 5  There exists a constant $c \in \Re^+$ such that
$$I_h^{2h} \frac{(L^h)^3}{\rho(L^h)^2} I_{2h}^h \le \frac{c}{\rho(L^{2h})} (L^{2h})^2.$$
Proof: First we observe that $\rho(L^h)$ scales $L^h$ so that $\frac{1}{\rho(L^h)} L^h \le I$. Thus,
$$I_h^{2h} \frac{(L^h)^3}{\rho(L^h)^2} I_{2h}^h \le I_h^{2h} \frac{(L^h)^2}{\rho(L^h)} I_{2h}^h.$$
Then, to establish the bound we seek, we need only prove that
$$\frac{1}{\rho(L^h)} I_h^{2h} (L^h)^2 I_{2h}^h \le \frac{c}{\rho(L^{2h})} (L^{2h})^2$$
for some constant $c \in \Re^+$. First notice that, using Lemma 4, we can choose a constant $b \in \Re^+$ such that
$$I \le b\, I_{2h}^{h\,T} I_{2h}^h$$
$$\Rightarrow (L^{2h})^{-2} \le b\, (L^{2h})^{-1} I_{2h}^{h\,T} I_{2h}^h (L^{2h})^{-1}$$
$$\Rightarrow I_{2h}^h (L^{2h})^{-2} I_{2h}^{h\,T} \le b \left( I_{2h}^h (L^{2h})^{-1} I_{2h}^{h\,T} \right) \left( I_{2h}^h (L^{2h})^{-1} I_{2h}^{h\,T} \right) \le b\, (L^h)^{-2}$$
$$\Leftrightarrow L^h I_{2h}^h (L^{2h})^{-2} I_{2h}^{h\,T} L^h \le b\, L^h (L^h)^{-2} L^h = b\, I$$
$$\Leftrightarrow \left( L^h I_{2h}^h (L^{2h})^{-1} \right) \left( L^h I_{2h}^h (L^{2h})^{-1} \right)^T \le b\, I$$
$$\Leftrightarrow \left( L^h I_{2h}^h (L^{2h})^{-1} \right)^T \left( L^h I_{2h}^h (L^{2h})^{-1} \right) \le b\, I$$
$$\Leftrightarrow (L^{2h})^{-1} I_{2h}^{h\,T} (L^h)^2 I_{2h}^h (L^{2h})^{-1} \le b\, I$$
$$\Leftrightarrow I_{2h}^{h\,T} (L^h)^2 I_{2h}^h \le b\, (L^{2h})^2. \tag{13}$$
Dividing by $\rho(L^h)$, we have shown that
$$\frac{1}{\rho(L^h)} I_h^{2h} (L^h)^2 I_{2h}^h \le \frac{b}{\rho(L^h)} (L^{2h})^2.$$
We want the smallest value for $b$ such that $I \le b\, I_{2h}^{h\,T} I_{2h}^h$, so we let
$$b = \frac{1}{\lambda_{\min}\left( I_{2h}^{h\,T} I_{2h}^h \right)}.$$
Now we need to find a minimum value for $c$ so that
$$\frac{b}{\rho(L^h)} \le \frac{c}{\rho(L^{2h})}. \tag{14}$$
Letting $\lambda_{\min}(A)$ denote the smallest positive eigenvalue of a symmetric positive definite matrix $A$, then the minimum value for $c$ is evidently
$$c = \frac{1}{\lambda_{\min}\left( I_{2h}^{h\,T} I_{2h}^h \right)} \frac{\rho(L^{2h})}{\rho(L^h)} \le \frac{1}{\lambda_{\min}\left( I_{2h}^{h\,T} I_{2h}^h \right)} \frac{\rho(L^h)\, \lambda_{\max}\left( I_{2h}^{h\,T} I_{2h}^h \right)}{\rho(L^h)} = \frac{\lambda_{\max}\left( I_{2h}^{h\,T} I_{2h}^h \right)}{\lambda_{\min}\left( I_{2h}^{h\,T} I_{2h}^h \right)} = \sigma\left( I_{2h}^{h\,T} I_{2h}^h \right),$$
where $\sigma$ denotes the condition number (using $\rho(L^{2h}) = \rho(I_h^{2h} L^h I_{2h}^h) \le \rho(L^h)\, \lambda_{\max}(I_{2h}^{h\,T} I_{2h}^h)$). Hence,
$$\frac{b}{\rho(L^h)} \le \frac{\sigma\left( I_{2h}^{h\,T} I_{2h}^h \right)}{\rho(L^{2h})},$$
so that
$$I_h^{2h} \frac{(L^h)^3}{\rho(L^h)^2} I_{2h}^h \le \frac{\sigma\left( I_{2h}^{h\,T} I_{2h}^h \right)}{\rho(L^{2h})} (L^{2h})^2 \tag{15}$$
and the lemma is proved.
Q.E.D.
Lemma 6  If $I_{2h}^h$ is based on piecewise bilinear elements on a uniform rectangular grid, then
$$\sigma\left( I_{2h}^{h\,T} I_{2h}^h \right) \le 40.$$
Proof: The stencil for the bilinear interpolation operator $I_{2h}^h$ is given by
$$\begin{bmatrix} \frac{1}{16} & \frac{1}{8} & \frac{1}{16} \\[4pt] \frac{1}{8} & \frac{1}{4} & \frac{1}{8} \\[4pt] \frac{1}{16} & \frac{1}{8} & \frac{1}{16} \end{bmatrix}.$$
The stencil for the product $I_{2h}^{h\,T} I_{2h}^h$ is therefore
$$\begin{bmatrix} \frac{1}{16} & \frac{3}{8} & \frac{1}{16} \\[4pt] \frac{3}{8} & \frac{9}{4} & \frac{3}{8} \\[4pt] \frac{1}{16} & \frac{3}{8} & \frac{1}{16} \end{bmatrix};$$
the lemma now follows from a straightforward mode analysis to estimate the eigenvalues of this stencil operator.
Q.E.D.
The following theorem establishes that AFACx converges in the usual optimal multilevel sense whenever AFAC does.

Theorem 1  The spectral radius of AFACx is bounded below one uniformly in $h$, assuming this is true of AFAC and that $n$ is sufficiently large.
Proof: Lemmas 3 and 5 combine to prove
$$|||P_n^h e_0^{2h}||| \le \sqrt{c}\, \rho\left( (L^{2h})^{-\frac12} F^{2h} \frac{(L^{2h})^2}{\rho(L^{2h})} F^{2h} (L^{2h})^{-\frac12} \right)^{\frac12} |||e_0^{2h}||| = \sqrt{c} \max_{0 \le \beta \le 1} \beta^{\frac12} (1-\beta)^n\, |||e_0^{2h}||| \le \sqrt{\frac{c}{2n+1}}\, |||e_0^{2h}|||, \tag{16}$$
where $\beta$ ranges over the eigenvalues of $\frac{1}{\rho(L^{2h})} L^{2h}$. From Lemma 2, we thus have
$$|||AFACx\, e^c||| = |||AFAC\, e^c + I_h^c P_n^h e_0^{2h}||| \tag{17}$$
$$\le |||AFAC\, e^c||| + |||I_h^c P_n^h e_0^{2h}||| \tag{18}$$
$$\le |||AFAC|||\, |||e^c||| + |||P_n^h e_0^{2h}||| \tag{19}$$
$$\le \left( |||AFAC||| + \sqrt{\frac{c}{2n+1}} \right) |||e^c|||. \tag{20}$$
Q.E.D.
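The elementary maximization invoked in (16) can be verified in one line. Writing $g(\beta) = \beta(1-\beta)^{2n}$,

```latex
\max_{0 \le \beta \le 1} \beta^{\frac{1}{2}} (1-\beta)^{n}
   = \Big[ \max_{0 \le \beta \le 1} g(\beta) \Big]^{\frac{1}{2}},
\qquad
g'(\beta) = (1-\beta)^{2n-1}\big( 1 - (2n+1)\beta \big),
```

so the maximum occurs at $\beta^* = \frac{1}{2n+1}$, where $g(\beta^*) = \frac{1}{2n+1}\left(1 - \frac{1}{2n+1}\right)^{2n} \le \frac{1}{2n+1}$; this is the source of the factor $\sqrt{c/(2n+1)}$.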
CHAPTER 3
PARALLEL SOFTWARE DESIGN
3.1 Introduction
The difficulty of designing and implementing parallel software is a hindrance to obtaining feedback on the design of parallel numerical algorithms. An important part of this design involves implementation of the proposed algorithms on the complex target architectures used on today's computers and the feedback of these results into the design of the algorithm.
An understanding of the numerical properties of the parallel algorithm can be obtained
from a serial implementation, but issues of parallel performance show up only in the much
more complex parallel environment, though often some details of performance on a proposed
parallel architecture can be estimated with a good understanding of the algorithm.
Based on the design and development of much parallel software, a recurring set
of problems was recognized as fundamental to the expanded development of software as
complex as the parallel adaptive mesh refinement (AMR) codes. The initial development
of a parallel adaptive refinement code was complex enough to prevent its expanded use on
much more realistic fluid applications. As a result, a portion of the thesis research effort was
spent in the analysis and resolution of these roadblocks to efficient and productive software
design.
This thesis implements two separate parallel adaptive refinement codes, one in C and one in C++ (using the object-oriented design features of C++). The C language version, which was completed first and served as motivation for the C++ version, together with earlier FORTRAN work, showed the substantial difficulties involved in the use of FORTRAN or any other procedural language for parallel adaptive refinement. Motivated by these observations, a new way to develop general parallel software, and specifically parallel adaptive mesh refinement software, was designed and is presented in what follows.
3.2 C Language Implementation of Parallel AFAC/FAC
The initial implementation of the general problem design was mostly carried into the second object-oriented AMR++/P++ version, though the AMR++/P++ implementation was substantially more robust and feature laden (see section 3.6).
3.2.1 Program Structure A decomposition of the composite grid problem
domain is commonly used to partition work across multiple processors. However, since
AFAC requires minimal intergrid transfers, additional solver efficiency (the dominant cost)
is obtained by partitioning the composite grid by level. A partition of the problem domain
might cut across many grids and add substantially to the total communication cost involved
in updating internal boundary values, but a partition by level means that the grids will
be shared across a minimal number of processors. This reduces and simplifies necessary
communication between processors sharing a given grid, which is especially effective since
most message traffic occurs during the MG solves. In addition, level partitioning allows for
a more simplified load balancing strategy. An even greater advantage is that it allows for
the movement of grids, as required by shock tracking, with no movement or rebalancing of
the distributed composite grid. Further, level partitioning allows for a greater amount of
each grid to be stored in the processors that share it. This results in longer vectors to be
formed by the solvers, which is expected to better utilize the vector and pipeline hardware
on machines with these features.
With this reduction in total communication requirements, the existing communication costs can be more easily hidden by the computational work. This sort of message
latency hiding would be expected to appear best when there is special message passing
hardware designed to relieve the main CPU of the message passing overhead. Due to the
unbalanced communication-to-computation costs associated with the iPSC/1 and its lack
of special communication hardware, this message latency hiding was, however, difficult to
measure.
3.2.2 Data Structures The relationship between the levels of the composite
grid is expressed in a tree of arbitrary branching factor at each vertex. Each grid exists as a
data structure at one node of this tree. This composite grid tree is replicated in each node.
In the case of a change to the composite grid (adding, deleting, or moving a grid), this change
is communicated globally to all processors so that the representation of the composite grid
in each processor is consistent. The partitioning of all grids is also recorded and is consistent
in all processors.
Storage of the matrix values uses a one-dimensional array of pointers to row values.
These rows are allocated and deallocated dynamically, allowing the partitioned matrices to
be repartitioned without recopying the entire matrix. This is important to the efficiency
of the dynamic load balancer, MLB. The effect of noncontiguous matrices is not felt when
using this data structure since subscript computation is done by pointer indirection for all
but the last subscript and, in this case, subscript computation is done by the addition of an
offset. This organization is particularly effective for partitioning along one dimension.
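The row-pointer storage just described can be sketched as follows. This is a minimal illustration in the spirit of the C implementation, with invented names (RowMatrix, release_row, adopt_row); it is not the thesis code:

```cpp
// Minimal sketch of the row-pointer matrix storage described above.
// Each row is allocated separately, so a partition boundary can move
// by transferring single rows, without recopying the whole matrix.
struct RowMatrix
{
    double** row;   // one pointer per row
    int rows, cols;

    RowMatrix(int r, int c) : rows(r), cols(c)
    {
        row = new double*[rows];
        for (int i = 0; i < rows; i++)
            row[i] = new double[cols]();   // zero-initialized row
    }
    ~RowMatrix()
    {
        for (int i = 0; i < rows; i++) delete [] row[i];
        delete [] row;
    }

    // A(i,j): pointer indirection for the first subscript and an
    // offset for the last, exactly as described in the text.
    double& operator()(int i, int j) { return row[i][j]; }

    // Repartitioning: hand row i to another owner and drop it locally.
    // Only that row's data moves; the remaining row pointers are untouched.
    double* release_row(int i) { double* p = row[i]; row[i] = 0; return p; }
    void    adopt_row  (int i, double* p) { delete [] row[i]; row[i] = p; }
};
```

Shifting a strip boundary then costs only the transfer of the affected rows, which is the property the MLB load balancer relies on.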
In the case of a 3D problem space, 2D planes would be dynamically allocated.
Multiple rows or planes could also be allocated, allowing choices of vector lengths to optimize
the implementation on vector machines.
The newer implementation using P++ does not have to address this level of detail
since such issues as storage and data layout are a part of P++ and the abstraction that it
represents.
3.2.3 Multigrid Solver A significant change to the code described in Briggs
et al. [10] is the restriction to one-dimensional decomposition (strips) and the allowance of
irregular partitioning. The experiments documented here use (2,1) Vcycles that recurse to
the coarsest grid, which consists of one interior point. The AFAC scheduler is responsible for
ordering the execution of the necessary multigrid and intergrid transfer subroutines based
on the receipt of messages. For example, multigrid is executed for all grids asynchronously
and is driven by the order in which the messages are received.
3.2.4 AFAC/FAC Scheduler The AFAC scheduler handles the asynchronous
scheduling of the steps needed to perform an AFAC iteration. An AFAC iteration is divided
into operations between communications. The scheduler orders these operations on each of
the grids contained in each processor. Thus, it is intended that much communication would
be hidden by the computation that is scheduled while waiting for messages (message latency
hiding).
3.2.5 Grid Manager The grid manager is responsible for the creation of
the grid data structures and the update of the composite grid tree in all processors. Calls
to the grid manager allow for the passing of internal boundary values between processors
and the adjustment of the partitions as called for by MLB. Additional services provided are
modification of the composite grid by the addition of a new refinement, as required of an
adaptive method, and movement of any grid or stack of grids, as required in shock tracking.
3.2.6 Multilevel Load Balancer (MLB) MLB is responsible for the dynamic readjustment of the evolving composite grid. As new grids are built, the composite
grid is modified, its tree is updated in all processors, and the partitions are adjusted to
balance the loads. Given the data structures used for the storage of the matrices (outlined
previously), MLB can adjust a partition at a cost commensurate with the amount of data
transferred between processors. Additionally, MLB assures that the data transferred between processors during partitioning follows a minimum length path. Further, since the cost
of determining if a multiprocessor system requires balancing (a global communication) is
approximately the cost of MLB when no partitions are changed, there is a negligible penalty
for the frequent load balancing required in dynamic refinement.
3.2.7 Data Flow The design of the solver allows for progress on a given grid
to be divided into 27 computationally equal parts. After each part is finished, all shared
grids are checked for the receipt of boundary values (messages) from coowning processors.
All shared grids are serviced in a round robin fashion, but are checked for receipt of boundary
values before any wholly owned grid is processed. This gives the shared grids a higher priority
than the wholly owned grids.
Using the solver in this way allows for good processor utilization. When used with
the load balancer, the total solve times vary only a few percent between processors. Thus,
processor utilization during the most costly part of AFAC is quite high. Further, the order
of execution is both dynamic and nearly optimal, since the work done on each processor is
driven by the availability (receipt) of messages.
A significant improvement in this context would be to increase the fineness of grain
in the parallelism available. The current graininess in the parallelism of the solver in each
processor depends on the size of the grids owned. With a very coarse grain of parallelism,
the receipt of messages during the solve of a large grid does not trigger the servicing of the
grid whose boundary values were just received. Thus, the execution (servicing) of the shared
grids is not handled optimally. The remedy is to partition these large grids into smaller
pieces and thus reduce the time a shared grid waits for service while processors finish the
larger grids.
A commonly suggested optimization for the organization of message passing in the
parallel environment allows relaxation on the overlap (ghost) boundary and then triggers
message passing on that overlap while relaxation is done on the interior. The motivation is
to trigger the message passing as soon as possible and then use the interior relaxation to hide
the latency associated with the communications. Contrary to common understanding, the
effect of this optimization was only a few percent improvement on the larger problems run
and about 10% additional overhead on the smaller problems. However, using the iPSC/2
asynchronous communication calls means that the results are inconclusive since the iPSC/2
hardware only supports very limited overlap in computation and communication. The results
are detailed in figure 2. Note that this sort of message latency hiding could be handled
transparently to the user in the objectoriented P++, though it is not done currently.
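Schematically, the boundary-prepass ordering discussed above is the following; the communication calls are stand-ins for the iPSC primitives, not actual iPSC calls, and the trace exists only to make the ordering explicit:

```cpp
#include <string>
#include <vector>

// Schematic of the boundary-prepass optimization: relax the rows the
// neighbors need first, start the sends, then relax the interior so
// that message latency is (ideally) hidden behind the interior work.
static std::vector<std::string> trace;   // records the operation order

static void relax_rows(const char* which) { trace.push_back(std::string("relax ") + which); }
static void post_send (const char* which) { trace.push_back(std::string("send ")  + which); }
static void wait_for_neighbor_rows()      { trace.push_back("wait"); }

void jacobi_sweep_with_prepass()
{
    relax_rows("boundary");    // 1. update only the partition-boundary rows
    post_send("boundary");     // 2. trigger the messages as soon as possible
    relax_rows("interior");    // 3. interior work overlaps the messages in flight
    wait_for_neighbor_rows();  // 4. block only if the ghost rows are still missing
}
```

As noted above, the benefit of this ordering depends entirely on hardware that can actually progress the messages while the interior relaxation runs.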
3.3 Problems with Computer Languages
This thesis explores an object-oriented design for complex parallel numerical software. This line of research was pursued after several attempts at the design of parallel adaptive mesh refinement codes for realistic applications using complicated nonrectangular refinement grids. The experience was useful in discovering just a few of the very wrong ways to implement adaptive refinement software. Initial work on block structured grids was unsuccessful mostly because of the C language's lack of support for encapsulation.
3.3.1 Problems with FORTRAN FORTRAN is a static procedural language, and as a result has limited flexibility to handle the necessarily dynamic requirements
of adaptive mesh refinement. In a parallel environment, even static refinement would be
problematic since, without special support, the shuffling of data between processors as new
refinement is added would require recopying large amounts of data. Memory management in
general becomes a practical limitation of FORTRAN for such complex software. Additionally, the details of adaptive mesh refinement, application requirements, and parallelism can
become unnecessarily mixed because of the lack of the data hiding (the ability to hide the
organization and access of internal data) and encapsulation (the ability to hide the internal
manipulation). FORTRAN 90 addresses many of these issues, but is mostly unavailable.
High Performance FORTRAN (HPF) ignores most of the features of FORTRAN 90 that
might simplify adaptive mesh refinement, specifically the object-oriented-type features of FORTRAN 90 (e.g., operator overloading).
Additional problems with FORTRAN:
[Bar chart comparing solve times with ("Prepass Boundaries") and without ("Don't Prepass") the boundary prepass.]
Figure 2: Effect of overlapping communication with computation.
Type checking is an important requirement in the development of large complex
software. The use of user defined types in C and C++ significantly reduces the
debugging time required for adaptive refinement codes because the complexity of
the different data structures can be isolated, separated, and represented as different
types (by the use of user defined structures). In the resulting implementation, type
checking verifies a consistent expression of the algorithm implementation. This type
checking is stronger in C++, and provides more advanced capabilities in the object-oriented design, since there is greater expressiveness in the definition of objects that combine the organization of data, as in C language structures, with the method functions that manipulate the object's data. For a more complete definition of the C++ language, see [47].
Dynamic Memory Management is another important requirement for parallel adap
tive mesh refinement. Alternatively, the use of a memory management layer between
the parallel adaptive mesh refinement and the FORTRAN language can provide the
required support. The advantage of using existing FORTRAN code and the efficiency
that FORTRAN presents at the lowest levels can make FORTRAN deceptively attractive. More common approaches have mixed C++ and FORTRAN so that the advantages leveraged from the use of the object-oriented design can be exploited. This thesis has not taken a mixed language approach since the object-oriented design
is required at a high level to implement the AMR code and at a low level (the level
of the looping constructs) to provide the architecture independence.
The use of common blocks in FORTRAN does not adequately isolate the data, or
its internal structure (partitioned or contiguous), away from the statements that
manipulate the data. The effect complicates the development and maintenance of
complex codes and especially complicates the porting of codes designed for the serial
environment to the parallel environment. Many codes that solve explicit equations
are sufficiently simple that they do not have such problems. Similarly, the definition
of standards for libraries is sometimes reduced to the awkward standardization of
common block variable orderings, unnecessary in more advanced languages.
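The isolation that common blocks fail to provide is exactly what a user defined type gives; a minimal sketch (the Grid class here is invented for illustration, not taken from the thesis code):

```cpp
// A user-defined type hides whether its storage is contiguous or
// partitioned; callers use only the interface, so the layout can change
// (e.g., for a parallel port) without touching any calling code.
class Grid
{
  private:
    double* data;   // layout detail: hidden, could be row pointers instead
    int     nx, ny;
  public:
    Grid(int x, int y) : nx(x), ny(y) { data = new double[x * y](); }
    ~Grid() { delete [] data; }
    double& operator()(int i, int j) { return data[i * ny + j]; }
    int size_x() const { return nx; }
    int size_y() const { return ny; }
};
```

Contrast this with a common block, where every subroutine that touches the data also hard-codes its ordering.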
3.3.2 Problems with Procedural Languages The fundamental problem
that was experienced in the implementation of the C language parallel adaptive refinement
codes was the overwhelming complexity of combining the application specific problem with
the adaptive mesh refinement and the explicit parallelism for the distributed memory environment. Each would have been tractable individually, but these software engineering problems
combine nonlinearly. This is nothing more than the statement that the time requirements
of software development in general are nonlinear in the number of lines of code.
The solution to this problem starts with the recognition that an algorithm can be
expressed independent of the organization of its data. For example, the addition of the
arrays could be expressed in either FORTRAN or C (or C++) as in figure 3.
FORTRAN
      DO i = 1, Size
         DO j = 1, Size
            A(i,j) = B(i-1,j) + B(i+1,j) + B(i,j-1) + B(i,j+1)
         ENDDO
      ENDDO
C or C++
for (int i = 1; i <= Size; i++)
     for (int j = 1; j <= Size; j++)
          A[i][j] = B[i-1][j] + B[i+1][j] + B[i][j-1] + B[i][j+1];
Figure 3: FORTRAN and C or C++ example code fragments.
Both the FORTRAN and C or C++ versions of this statement implicitly rely on the contiguous ordering of the data in the arrays A and B. This is due to the definition of the indexing operators, ( ) in FORTRAN and [ ] in C and C++. The reliance on the contiguous ordering of the data means that the algorithm's expression is NOT independent of the organization of the data.
A result of this dependence of the algorithm's implementation on the organization of its data is that a change in the layout of the data affects the expression and implementation
of the algorithm. In the case of a vector architecture, the data should be organized into
contiguous vectors of constant stride so that the vector hardware can efficiently evaluate
the algorithm. This is the trivial case of the example implementation, and the traditional
style of implementation maps well to the low-level efficient vector processing. In the case of a cache-based RISC microprocessor architecture, the use of the consecutively ordered multidimensional arrays A and B and the sequential looping force continued flushing of the microprocessor cache. In this case, the implementation gets the correct result but efficiency is sacrificed because the sequential loop processing flushes the cache's record of the element B[i][j-1] (among others). The solution that enables efficient processing is a block by block processing of the two-dimensional grid, but we clearly see how this modification affects the
implementation in figure 4. Further, this block by block processing of the 2D grid is in conflict
C or C++
for (int Block_i = 0; Block_i < Size / Block_Size; Block_i++)
     for (int Block_j = 0; Block_j < Size / Block_Size; Block_j++)
          for (int Element_i = 1; Element_i <= Block_Size; Element_i++)
               for (int Element_j = 1; Element_j <= Block_Size; Element_j++)
                  {
                    int i = Element_i + (Block_i * Block_Size);
                    int j = Element_j + (Block_j * Block_Size);
                    A[i][j] = B[i-1][j] + B[i+1][j] + B[i][j-1] + B[i][j+1];
                  }
Figure 4: C language example for block-by-block cache-based execution.
with the efficient vector processing since utilization of the cache requires many very short
vectors to be processed, and efficient vector processing requires longer vectors. Attempts to
have such issues be addressed at compile time have been relatively unsuccessful.
The case of the equivalent distributed memory code changes the processing even
more drastically since the data is partitioned across multiple processors, and so explicit message passing must be introduced. The resulting implementation is greatly expanded (in lines of code) and the implementation is far from clear because of the required additional parallel logic. Figure 5 shows an example code (showing a simpler case where global addressing is
used at the expense of the whole array's storage on each processor1) with the equivalent
parallel implementation of the previous code fragment.
In all three examples, the implementation of the algorithm is affected by the details
of the target architecture. The effect of more sophisticated parallel architectures is most ex
treme. The use of complex algorithms on such architectures greatly complicates the software
development. More specifically, the more complex parallel implementation hides the details of the algorithm's definition and thus precludes the normal development evolution of the software as increasingly complex applications are attempted (e.g., more physics). The effect on software development is to force dual implementations: one serial, and simple to modify and extend; the second parallel, and difficult to extend. The practical effect of multiple implementations makes the development of realistic parallel applications expensive and slow, because the algorithm cannot economically be modified for each of several, or perhaps many, different architectures. The fundamental reason for this problem is the dependence of the implementation of the algorithm on the organization of the data. The ability to express algorithms independent of the architecture is thus a principal feature of the software for parallel adaptive refinement work.
3.4 Motivation for Object-Oriented Design
Having seen, by example of the Jacobi relaxation code fragment, that the implementation is affected by the target architecture, we want representations of the algorithms that
sufficiently abstract the details of each possible architecture. The work on array languages
during the late seventies provides just such an abstraction. Here the arrays are manipulated
using array operators that internally understand the details of the target architecture, but
which by their use permit the implementation of the algorithm independent of the organization of the data. Specifically, we do not know how the arrays are represented internally, but
this detail is unimportant to the definition of the algorithm. Thus, the algorithm may be
1This simplification is done only for clarity since the use of non-global indexing would be less clear.
#define SIZE 64   /* array dimension (value assumed; lost in reproduction) */

int Node_Number_Table [16] = { 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 };
double Solution     [SIZE][SIZE];
double Old_Solution [SIZE][SIZE];

void main ()
   {
     int i, j;
     int Iterate;
     double Sum = 0.0;
     int Max_Number_Of_Iterations = 10;
     int Right_Message_Type   = 100;          /* Arbitrary values */
     int Left_Message_Type    = 101;          /* Arbitrary values */
     int Process_ID           = mypid ();     /* Current Process ID */
     long Processor_Number    = mynode ();    /* The Node ID of this Processor */
     int Number_Of_Processors = numnodes ();  /* Total number of Processors */

  /* Compute our partition of the distributed Solution array! */
     int Start = (SIZE / Number_Of_Processors) * Processor_Number;
     int End   = (SIZE / Number_Of_Processors) * (Processor_Number+1) - 1;

  /* Modify ends of the partition! */
     if (Processor_Number == Node_Number_Table [Number_Of_Processors-1])
          End--;
     if (Processor_Number == Node_Number_Table [0])
          Start++;

  /* Initialize whole array to zero! */
     for (i=0; i < SIZE; i++)
          for (j=0; j < SIZE; j++)
             {
               Solution     [i][j] = 0.0;
               Old_Solution [i][j] = 0.0;
             }

  /* Initialize Interior! */
     for (i=1; i <= SIZE-2; i++)
          for (j=1; j <= SIZE-2; j++)
               Solution [i][j] = 1.0;

     for (Iterate=1; Iterate <= Max_Number_Of_Iterations; Iterate++)
        {
       /* Assign current processor's partition of Solution to Old_Solution.
          We do this to setup the Jacobi iteration! */
          for (i=Start-1; i <= End+1; i++)
               for (j=0; j < SIZE; j++)
                    Old_Solution [i][j] = Solution [i][j];

       /* Do Jacobi relaxation for our own partition! */
          for (i=Start; i <= End; i++)
               for (j=1; j <= SIZE-2; j++)
                    Solution [i][j] = ( Old_Solution [i-1][j] + Old_Solution [i+1][j] +
                                        Old_Solution [i][j-1] + Old_Solution [i][j+1] ) / 4.0;

       /* Send the new solution on our boundary to the Right and Left Processors! */
          if (Processor_Number < Node_Number_Table [Number_Of_Processors-1])
               csend ( Right_Message_Type, &(Solution [End][0]),
                       SIZE * sizeof (double),
                       Node_Number_Table [Processor_Number+1], Process_ID );
          if (Processor_Number > Node_Number_Table [0])
               csend ( Left_Message_Type, &(Solution [Start][0]),
                       SIZE * sizeof (double),
                       Node_Number_Table [Processor_Number-1], Process_ID );

       /* Receive the new solution on our boundary from the Right and Left Processors! */
          if (Processor_Number < Node_Number_Table [Number_Of_Processors-1])
               crecv ( Left_Message_Type,  &(Solution [End+1][0]),   SIZE * sizeof (double) );
          if (Processor_Number > Node_Number_Table [0])
               crecv ( Right_Message_Type, &(Solution [Start-1][0]), SIZE * sizeof (double) );
        }

     printf ("Program Terminated Normally!\n");
   }

Figure 5: Distributed memory example code.
defined using simple array operations as in figure 6, regardless of the target architecture. In
the case of a vector architecture, the array operators process the vectors one at a time2. In
RISC architectures, the internal representation can be processed by block type operations
where each block fits into the cache3.
Index I (1,Size,1);
Index J (1,Size,1);
A(I,J) = B(I-1,J) + B(I+1,J) + B(I,J-1) + B(I,J+1);
Figure 6: Equivalent P++, object-oriented example code.
Many algorithms are not well suited to the array language implementation, for example, the usual tridiagonal solve, which would require explicit scalar indexing4. However, such algorithms are not parallel or vectorizable, so the array language only encourages the
use of algorithms that are better suited to the more complex architectures available today,
without limiting other algorithms. Clearly, the use of a serial algorithm in a parallel environment only results in poor performance, not in an incorrect result. Figure 7 shows an
example code fragment of the objectoriented code fragment using the explicit looping and
indexing, which is also easily provided.
for (int i = 1; i <= Size; i++)
     for (int j = 1; j <= Size; j++)
          A(i,j) = B(i-1,j) + B(i+1,j) + B(i,j-1) + B(i,j+1);
Figure 7: Equivalent P++, object-oriented example code using explicit indexing.
The use of the Index type specifies a FORTRAN 90-like triplet of integers specifying the initial indexed position, the number of elements to be indexed, and the associated index stride. Then the arrays A and B are manipulated using the array operators + and -, which treat only those portions of A and B that are indexed using the Index variables I and J. In this case, the Index variables are themselves manipulated using the + and - operators, so that B is indexed using combinations of I+1, I-1, J+1, and J-1.
2In complex array expressions, this has special sorts of inefficiencies that will be discussed later.
3This has been worked out by Roldan Pozo at the University of Tennessee as part of the distributed
LAPACK++ class library.
4Such algorithms must use explicit indexing (using looping and scalar index array operators), which can
be easily provided within the definition of the array language, but are not considered array operations.
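The Index triplet semantics can be made concrete with a toy sketch; this is not the M++/P++ source, only a minimal model of the interface described above:

```cpp
// Toy model of the FORTRAN 90-like triplet: base position, count, stride.
// I + 1 shifts the indexed window without changing count or stride, which
// is what lets B(I-1,J) address the west neighbors of A(I,J).
struct Index
{
    int Position, Count, Stride;
    Index(int p, int c, int s = 1) : Position(p), Count(c), Stride(s) {}
    Index operator + (int i) const { return Index(Position + i, Count, Stride); }
    Index operator - (int i) const { return Index(Position - i, Count, Stride); }
    // k-th array subscript actually addressed by this Index
    int operator [] (int k) const { return Position + k * Stride; }
};
```

An array class would then loop internally over the Count positions of each Index operand, which is exactly the detail the interface hides from the user.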
But the use of an array language would require the construction of a specialized compiler. Such compilers are not commonly available. Though FORTRAN 90 array extensions have been added to many implementations of FORTRAN 77, only a few compilers use
them to simplify development of parallel software. The use of a specialized compiler would
limit the use of the resulting codes to the few architectures where such compilers might be
available. This effectively contradicts the goal of architecture independence. An alternative
that would permit the use of array constructs is to build such an array language into an
existing language.
A procedural language would not permit the extension of the language without
modifying the compiler, since the language defines a static set of types that may not be
extended. But an objectoriented language permits the definition of user defined types
(objects), called an extensible type system, and such user defined types (objects) behave similarly to the static set of types defined in procedural languages. The practical effect is
to permit extension of the language to include new user defined types and to have them
function as if they were an original part of the language definition.
The C++ language is a relatively new object-oriented language that is widely available for nearly every serial and parallel computer. It is increasingly being used in the development of numerical software across a broad range of computer architectures.5 Although developed originally in 1984, C++ has stabilized sufficiently to support large implementation projects and is now used for the development of sophisticated numerical software6.
Figure 8 shows an example of an array and index object, illustrating how the data and the method functions which manipulate the object's data are combined. The example definition of the Index and intarray C++ classes defines new types to the compiler and code that uses them. The details of the implementation can be specific to a given architecture,
5A substantial portion of the engineering codes developed at Sandia National Labs are using C++.
6Several of the largest production codes at Sandia National Laboratory are developed using C++.
class Index
   {
     private:
          int Position;
          int Count;
          int Stride;
     public:
          Index & operator + ( const Index & X , int i );
          Index & operator - ( const Index & X , int i );
          Index & operator = ( const Index & X );
   };

class intarray
   {
     private:
          Array_Descriptor_Type *Array_Descriptor;
          int *Data;
     public:
          Array & operator () ( const int & i );
          Array & operator () ( const int & i , const int & j );
          Array & operator () ( const Index & I );
          Array & operator () ( const Index & I , const Index & J );
          Array & operator + ( const Array & Lhs , const Array & Rhs );
          Array & operator - ( const Array & Lhs , const Array & Rhs );
          Array & operator * ( const Array & Lhs , const Array & Rhs );
          Array & operator / ( const Array & Lhs , const Array & Rhs );
          Array & operator = ( const Array & Rhs );
   };

Figure 8: Equivalent P++, object-oriented example code.
but the interface can be constant and is architecture independent. This is the principal way
this thesis proposes architecture independence for general numerical software.
This thesis uses C++ to define an array class library. Specifically, an array object
is defined in C++ and all possible operators between arrays are provided, so that the array
objects appear as built-in types (though user defined). The actual interface is copied from
the commercial M++ 7 array language. The motivation for this work is to simplify the use
of software implemented using the array interface for distributed memory architectures. In
the serial environment, it is sufficient to use the M++ class library. The purpose of the
P++ class library is to extend the identical interface for use on distributed memory parallel
computers with absolutely no change to the original serial code. The parallel array class
library P++ uses the serial array class library internally. The result is a separation between
the details of the parallel array object and the serial array object. Such separation divides
the complexity of the combined design of the single processor array object and its extended
use in the parallel environment. Since the interface is identical, the code using the interface
(the original serial source code) does not change. The computational model is to run the
same code on each processor, which is called the Single Program Multiple Data (SPMD)
model.
The P++ array class library, which provides general support for data parallelism,
is part of the strategy to support parallel adaptive refinement. Complete support additionally
requires handling the details of adaptive mesh refinement itself, so that different
applications can reuse the complex data structures that adaptive mesh refinement codes
contain. The common AMR code, independent of the application and of parallel issues, is
presented in the AMR++ class libraries. In this way, we expect substantial reuse of code
across many adaptive refinement applications. More details are contained in [5] and in section
3.6.
7M++ is a product of Dyad Software (1-800-366-1573).
3.4.1 Problems with the Object-Oriented C++ Language The use of
an object-oriented language is no panacea. There are often dozens of ways to implement
a given application, which is a strength of the language. Yet, as studies at AT&T have
shown, the learning time for C++ is approximately 12-18 months. It is important to note,
however, that most people are more productive in C++ after the first few months than they
were previously8.
The principal problems experienced with C++ have to do with the compiler's ability to
optimize the resulting code. The use of simple arrays of primitive types (such as int, float, and
double) and the compiler's knowledge of these primitive types permit common optimizations.
For numerical software, such optimization centers around loop optimization, including
removal of constant subexpressions and use of registers for temporary storage and accumulation.
Such optimization greatly improves the efficiency of the compiled code. The use
of nonprimitive types, such as C++ objects (effectively user defined types), greatly limits
optimization. Within the current line of C++ compilers, optimizers are typically turned off
internally if nonprimitive types are discovered. The result is a mass of function calls and
the loss of register storage for accumulation within inner loops. Such code is inefficient at
runtime. The emphasis for obtaining performance is thus on the class library programmer (who is
forced to work within C++ at a deeper level than typical users, who generally use such
class libraries, in conjunction with C++, in simpler ways). Details of these problems
have been discussed in [1], [45], and [46].
3.5 P++, a Parallel Array Class Library
3.5.1 Introduction and Motivation The current trend in the development
of parallel architectures is clearly toward distributed memory designs. This is evident from
current product announcements from Intel, Cray, and Thinking Machines, even though the
latter two originally had successful machines of competing design. Experience has shown
8This was the conclusion of discussions within the C++ conference on BIX (Byte Information Exchange).
that shared memory machines are easier to program than distributed memory ones and
also have autoparallelizing tools that are not available for distributed memory architectures.
This is due to the fact that the shared memory programming model permits parallelization
independent of the position of data between processors. However, memory access in
shared memory machines becomes increasingly problematic as more processors are added;
this non-scalability of the shared memory design limits its potential to support the greater
computational power demanded of future applications. Within the last few years, approaches
of adding local caches to the global memory of shared memory architectures have slightly
improved memory access. The problem of efficiently optimizing cache usage, however, is
very similar to the data partitioning problem on distributed memory machines, and is not
yet satisfactorily resolved. Distributed memory machines require the explicit control of data,
for example, data partitioning, in order to obtain the same or better parallel performance
as shared memory machines. This explicit control of parallelism, through Message Passing,
is difficult to achieve, and the result is a dramatically more complicated programming
environment.
The lack of a shared memory programming model on distributed memory machines
is the fundamental disadvantage of programming them. The availability of portable
communication interfaces, and similar programming tools, implemented on many distributed memory
machines significantly simplifies portability among distributed architectures, but does
not address the difficulties of the explicit control of parallelism through the use of
message passing. Although even a shared memory parallel programming model in a
distributed memory environment, called Virtual Shared Memory, would be advantageous from
the point of view of software development, such general attempts have resulted in poor
performance, except for an overly restricted set of problems. Newer advances in this area
have not yet been evaluated for feasibility and efficiency [15]. Because the general data
distribution problem remains unsolved, such Virtual Shared Memory environments attempt to
support parallelism without explicit knowledge of the algorithm's structure and its requirements
for off-processor data. The resulting voluminous amount of accesses to off-processor memory
degrades performance. This is mostly due to unnecessary and often replicated communication
startup times and/or data transfer of irrelevant data in proximity to relevant data. It
seems clear that the support of parallelism requires at least an interpretation of algorithm
requirements, so that accesses to off-processor data can be managed efficiently [7].
P++ is a user environment that simplifies the development of efficient parallel
programs for large-scale scientific applications. It permits portability across the widest variety
of computer architectures. The target machines are distributed memory computers with
different types of node architectures (scalar, vector, or superscalar), but the requirements
of shared memory computers are also addressed. Such a simplifying environment for the
development of software is sorely needed to take advantage of the current and future
developments in advanced computational hardware. The P++ environment does this using
a standard language, C++, with absolutely no modification of the compiler. For parallel
communication, it employs existing widely portable communications libraries. Such an
environment allows existing C++ language compilers to be used to develop software in the
preferred serial environment, and such software to be efficiently run unchanged in all target
environments.
The explicit goal of P++ is the support of advanced computational methods for the
solution of large-scale computational problems. For efficiency reasons and simplification, the
use of P++ is restricted to the large class of structured grid methods for partial differential
equations. Applications using P++ can selectively exploit the added degree of freedom
presented by parallel processing by use of an optimization module within the array language
interface. A collection of defaults ensures deterministic behavior. In this way, complicated
optimization issues of parallel execution may be easily introduced by setting switches within
the user interface, which are then interpreted at runtime. Through the use of Virtual Shared
Memory, restricted to array variables (Virtual Shared Grids), issues such as partitioning
become matters of optimization, and not criteria for correct execution. Due to the restriction
and optimization for structured grids, we expect the same parallel performance as for codes
based on the traditionally used explicit Message Passing model. To speed the development of
the P++ environment, we use a reasonably efficient, commercially available array language
library, M++, developed in C++; the M++ array interface is also used as the P++ array
interface.
The internal implementation of P++ is based on the Single Program Multiple
Data stream (SPMD) programming model. This model is coupled with the Virtual Shared
Grids (VSG) programming model (section 3.5.5), which is a restriction of general operating
system controlled Virtual Shared Memory to all types of structured grids, controlling
communication at runtime. The user interface and programming model guarantee that serial
codes developed in the P++ environment can be run on parallel distributed memory
machines without modification. Because the array syntax is more compact than explicit
looping in the description of the algorithm, the resulting code is easier to implement and
debug, even for strictly serial codes, in the serial environment. Moreover, since the SPMD
and VSG programming models permit the serial code to be used in a parallel environment,
the development of parallel codes using P++ is simpler than the development of serial codes
when the explicit looping model is used. This is primarily due to the fact that the VSG
model allows the specification of data partitioning to be an optimization, in contrast to most
other developments in this area, where appropriate data partitioning is essential for the
correct execution of the code. P++ employs a default grid partitioning strategy, which can be
overridden in several ways specific to the user's application.
Recognizing that the acceptance of advanced parallel computers depends on their
use by scientists, and not parallel researchers, it is hoped that P++, as a technological
advance, will simplify access to these parallel machines. While significantly simplifying our
own work on parallel selfadaptive local refinement methods, it is hoped that P++ will
more generally simplify the introduction of structured grid methods for large scale scientific
computing onto advanced parallel machines.
3.5.2 Goals of the P++ Development The general goal of the P++
development is to provide a simplified parallel programming environment. In this section, some
ideal requirements for a user interface and programming model for distributed memory
architectures are stated. These are fulfilled with the P++ environment for a large, but restricted,
class of problems (detailed in subsection 3.5.3):
Algorithm and code development should take place in a serial environment.
Serial source codes should be able to be compiled and recompiled to run in parallel
on distributed architectures without modification.
Codes should be portable between different serial and parallel architectures (shared
and distributed memory machines).
Vectorization, parallelization, and data partitioning should be hidden from the user,
except for optimization switches to which the user has full access and that have
meaning only on vector or parallel environments.
The debugging of parallel software should be greatly simplified. First, this is done if
the software is debugged in the serial environment and data parallelism is exploited
by recompilation. Second, the object-oriented treatment of the array operations
avoids the respecification of index bounds, so one of the most common errors in
the implementation of numerical software is eliminated because Index objects can
be computed once and reused. Third, the object-oriented design used to build
application codes from the P++ data parallel arrays further abstracts low level
implementation details of the solvers and separates them from the remaining application
code.
3.5.3 The P++ Applications Class The restriction to a simplified class
of applications for P++ has allowed focus on a general solution and evaluation of the
functionality and performance of the P++ environment by using realistic application codes.
Additionally, the restriction to a large but reasonable class of problems assures that P++ is
not exaggerated to overly general or particularly pathological cases. In addition, extending
the generality more toward general Virtual Shared Memory would cause the same performance
limiting factors to apply also to P++. The use of P++ is restricted to all different kinds of
structured grid oriented problems. Domain decomposition and applications using adaptive
block structured local refinement belong to this target class, and the problems introduced
by their use in parallel environments have motivated this work. Grids can be constructed
with all levels of freedom (e.g., overlapping blocks, adaptively added, deleted, resized, ...), as
long as they fulfil the above restrictions. In particular, P++ is dimensionally independent,
which allows for its use in one- to four-dimensional problems. Although not yet applied to
many applications, the target applications are large-scale computational problems in fluid
dynamics and combustion. Specific algorithms with which this work is intended to be tested
include:
Explicit time stepping algorithms (Piecewise Parabolic Method (PPM) for hypersonic
flow [4], [5]);
Standard multigrid algorithms on rectangular domains ([10], [26], [34], [38]);
Multilevel local refinement algorithms on simple grids ([22], [28], [29], [40]);
Multilevel adaptive local refinement algorithms on simple grids ([41], [42]);
Multilevel adaptive local refinement algorithms on block structured grids.
3.5.4 P++ Implementation in Standard C++ C++, an object-oriented
language, was chosen as a basis for P++ because of the language's ability to support
abstractions, which is fundamental in the development of the user interface that is expected to
abstract issues of data parallelism. Some important features of C++ follow (for a definition
and description the reader is referred to [47]); each is obtained from C++ and carries over
directly to the P++ environment, including any combination of C++/P++ use:
Easy design, development and maintenance of large codes;
ANSI standard under development;
Object-oriented features like inheritance, data hiding, and encapsulation;
Dynamic memory management;
Strong typing of variables;
Guaranteed user defined initialization through user defined constructors;
Operator and function overloading;
Templates;
Same efficiency allowed as for C (currently a research area).
Most of these features are not unique to C++, so we do not preclude the use of any other
objectoriented language, such as Smalltalk. But C++ is currently available on a wider
number of machines than any other objectoriented language that suits numerical needs.
Additionally, the C++ language is a superset of the C language, and so all C software is
also available for use with C++. Specifically, this allows for use of distributed memory
communication libraries, like PVM, to be used as easily with C++ as with C.
P++, as currently developed, uses the AT&T C++ Cfront compiler and the Intel
iPSC communications library. In the near future, however, it is planned to make the code
portable through use of EXPRESS or PVM and later also PARMACS (see section 3.5.8).
Since C++ is a superset of C and the communication library is designed for use with C,
these libraries can be easily used from within C++.
3.5.5 The Programming Model of P++ Use of the Single Program
Multiple Data (SPMD) programming model combined with the VSG programming model is
important, since, without the combined programming models, the simplified derivation
of the parallel program from the serial program would not be practical. Their combined
effect is the principal means by which P++ allows serially developed codes to be efficiently
run on distributed memory machines.
3.5.5.1 Single Program Multiple Data (SPMD) Model In contrast to the
explicit host/node programming model, which requires both a host and one or more node
programs, the SPMD programming model consists of executing one single program source on
all nodes of the parallel system. For example, the suggested programming model of the Intel
Delta machine is SPMD. This is becoming increasingly common in new generation parallel
machines from Intel, Cray, and TMC.
Implementation of the SPMD model requires that commonly available functionality
in the serial environment be provided in the parallel environment in such a way that the
serial source code can be used on the distributed memory machine. One of the most
important functionalities provided in the parallel programming model to support basic
functionality of the serial code is a parallel I/O system. This can then be used in place of the
serial I/O system, to support the required functionality of the parallel SPMD programming
environment.
Currently, only basic functionality of the SPMD programming model (I/O system:
printf, scanf; initialization and termination of processes) is available. Implementation details
are abstracted from the user. The SPMD programming model replicates the functionality of
the traditional parallel host/node programming model. For example, the standard function
scanf for reading from standard input is implemented in such a way that an arbitrarily chosen
master node reads the data from standard input and distributes it to all other nodes (slave
nodes). This master/slave relationship is only present within the Parallel I/O System and
not used, or intended, elsewhere in P++.
3.5.5.2 Virtual Shared Grids (VSG) Model The concept of Virtual Shared
Grids gives the appearance of Virtual Shared Memory restricted to array variables.
Computations are based on global indexing. Communication patterns are derived at runtime,
and the appropriate send and receive messages are automatically generated by P++. In
contrast to Virtual Shared Memory, where the operating system does the communication
without having information about the algorithm's data and structure, the array syntax of
P++ provides the means for the user to express details of the algorithm and data structure
to the compiler and runtime system. This guarantees that the number of communications
and the amount of communicated data are minimized. Through the restriction to structured
grids, the same kind and amount of communication as with the explicit Message Passing
model is sent/received and, therefore, also approximately the same efficiency is achieved.
This is a tremendous advantage over the more general Virtual Shared Memory model.
The amount and size of communication are further minimized by the ability of
the user to override the default partitioning. Specifically, Virtual Shared Grids allow the
treatment of partitioning as a parameter in the optimization. This is an important feature
of the VSG model since it permits serial applications to be run correctly and to exploit data
parallelism inherent in their array expressions, and so still permits the auxiliary description
of data organization after the code is running. This greatly simplifies decisions about data
partitioning, which is the singular additional degree of freedom in the designing of the data
parallel implementation. Note that the data parallel implementation might not be
sufficient, but is a common component in the design of numerical software. For example, data
parallelism is the principal part of the model for single grid iterative solvers (including
multigrid solvers), but is not sufficient for optimal parallel performance using composite grid
solvers (including FAC, AFAC, and AFACx) because additional functional parallelism can
be exploited in these more sophisticated solvers.
There are two basic communication models that are currently implemented in P++
(how these models interact is described in more detail in the examples in section 3.5.10):
VSG Update:
The Owner Computes Rule is a common rule for the evaluation of expressions in
the parallel environment. It actually has several conflicting definitions, but generally
means that an owner is defined (usually by the processor owning the Lhs of a Lhs =
Rhs expression) and the required Rhs arguments are sent to the owning processor.
Finally, the relevant part of the expression is evaluated on the owning processor
(no communication is required for the evaluation step, but there is potentially more
communication required to send the relevant parts of the Rhs to the owner). In the
implementation of the communication model for the general Virtual Shared Grids
concept, this classical Owner Computes rule is restricted. Instead, what might be
applied to the whole expression is applied instead to the binary subexpression, where
the Owner is defined arbitrarily to be the left operand. This simple rule handles the
communication required in a parallel environment; specifically, the designated owner
of the left operand receives all parts of the distributed array necessary to perform
the given binary operation (without further communication). Thus, the temporary
result and the left operand are partitioned similarly (see figure 9).
Overlap Update:
In order to provide high performance for a broad range of iterative methods that
would use the VSG programming model, nearest neighbor access to array elements is
implemented as a specific optimization through the widely used technique
of grid partitioning with overlap (currently restricted to width one; see figure 10).
In this way, the most common type of array accesses, even in complicated
expressions, can be handled with communication in the parallel environment limited to one
overlap update that occurs in the definition of the = sign defined (overloaded) for the
arrays. The necessity of updating the overlapping boundaries, based on whether the
overlap has been modified after the preceding update, is detected at runtime. Thus,
communication in the parallel environment is minimized.
Virtual Shared Grids are constructed in a distributed fashion across the processors
* P++ user level:
A = B + C
* P++ internal execution:
1. T = B + C
P1: T11 = B11 + C11
    receive C21 from P2
    T12 = B12 + C21
P2: T2 = B2 + C22
    send C21 to P1
P3: idle
2. A = T
P1: send T1 to P3
P2: send T2 to P3
P3: receive T1 from P1
    receive T2 from P2
    A = T
Figure 9: An example of VSG Update based on the Owner Computes rule: A = B + C on
3 processors.
Figure 10: The standard method of grid partitioning with overlap.
of the parallel system. Partitioning information and processor mapping are stored in a
partitioning table (part of the Data_Manager object). This partitioning table basically contains
the processor numbers that own the Virtual Shared Grids, and the local and global positions
and sizes of the partitions. Functions are made available that abstract the details of access
queries of the table's data.
All information required for evaluating expressions using VSG is expressed through
the array syntax and the use of the partitioning table. The number of entries in the table is
reduced by grouping associated grids and providing applications support for storage of only
the required data on a processor by processor basis. This is necessary due to the large sizes
that these tables can reach in massively parallel systems. The table is efficiently implemented
through a combination of arrays and linked lists. Thus, all necessary global information can
be made locally available (even on massively parallel systems of 1000+ processors). As a result,
access to global information about the partitioned arrays requires no communication and
contributes insignificantly to memory overhead.
A simple mechanism is provided to interrogate the communication pattern between
pairs of VSGs at runtime. This mechanism looks at the availability of data in the adjacent
processors that is required to satisfy the specific instance of the distributed binary
operation. If necessary, it triggers communication to send or receive the required pieces of the
array operands (VSG Update) on the basis of the Owner Computes rule. In this way, whole
buffers of data are known at runtime to be required, and no sophisticated loop analysis is
needed to recover this information. Thus, there is no need for costly element by element
communication. Such loop analysis would have to be done by a compiler, since it would be
prohibitively expensive at runtime.
In addition to the two basic communication models of VSG Update and Overlap
Update, possible enhancements include:
Communication Pattern Caching (CPC): This would permit the communication
patterns, once computed for an array statement, to be cached and recovered for
further use. Thus, the determination of the array partition relationships (communication
pattern) specific to a particular array statement's indexing could be handled
with minimal overhead. Note that CPC could be used across multiple array statements
since, within an application, we expect that many different array statements
would require, and thus could reuse, the same communication patterns.
Deferred Evaluation: This more complicated evaluation rule allows for
significant optimization of the given expression so that communication associated with
correct evaluation in the parallel environment can be minimized. In a vector
environment, the given operations that form the array expression are optionally collapsed
to form aggregate operators, and the aggregate operators' implementation is provided
in FORTRAN, so that the expression can fully exploit the vector hardware (e.g.,
through chaining). The use of deferred evaluation of the array statements (also called
lazy evaluation) permits the full expression to be known at runtime before evaluation.
The evaluation can even be deferred across many array statements (problems are
encountered across conditional statements that have dependencies on the results of
the deferred array statements, though this might be solved through some compiler
support for deferred evaluation). Currently, this principle has been implemented but
not yet evaluated for single nodes of vector machines (e.g., a Cray YMP in collaboration
with Sandia National Laboratories). In particular, the efficient use of chaining
and the optimization of memory access are addressed. It is planned to further pursue
this approach and fully exploit it in the VSG programming model of P++.
While there are other complicated reasons for the use of Deferred Evaluation, those
mentioned above are only some uses specific to the vector environment. Other uses
include the deferred evaluation over large numbers of statements and the grouping
of blocks of these statements, based on runtime dependency analysis, so that each
block can be executed independently. Such a system would permit the recognition
of functional levels of parallelism at runtime. Note that runtime dependency
analysis would be based on the use of hash tables and collision detection within these
separate hash tables for the Lhs and Rhs operands in each array statement.
3.5.6 The Object-Oriented Design of P++ The basic structure of the
P++ code consists of multiple interconnected classes. The overall structure and the
interconnections are illustrated in figure 11. The following types of objects are common to the
M++ array interface (see also section 3.5.7) and within P++ form a similar user interface:
<type>_VSG_Array:
specifies the type of array elements (currently restricted to float (64 bit) and
integer). Dimensional independence up to four dimensions is realized. 1D default
partitioning information is stored in the Data_Manager.
VSG_Index:
Only simple index objects are provided. Each stores: Position, Count, and Stride.
Member functions for index objects include set and arithmetic operations. Some
typical examples for the use of indexes are addressing grid boundaries or interiors.
The <type>_VSG_Array uses internally the M++ array object <type>Array
and copies most of the interface member functions of the M++ array object <type>Array.
In this way, the <type>_VSG_Array uses the same member functions as <type>Array.
So the interface is the same, and the numerical software developed in the serial environment
executes in a data parallel mode in the multiprocessor environment.
The following object is specific to the P++ interface (see also section 3.5.7) and is
also seen by the user:
OptimizationManager:
User control for details of parallel execution.
Figure 11: The object-oriented design of the P++ environment.
The following objects are hidden from the user and represent the notable features of the
underlying organization:
Data_Manager:
All partitioning data is stored in tables. Member functions allow interrogation of
the Data_Manager to find the processor associated with any part of any object of
type < type >_VSG_Array, etc.
CommunicationManager:
These member functions are the only functions that allow access to send/receive and
other primitives of the communications library and diagnostics environment (Intel,
EXPRESS, PARMACS; see section 3.5.8). Access to constants relative to parallel
execution, e.g., number of processors, is also available.
DiagnosticManager:
This class has restricted flow trace capabilities (number of constructor/destructor
calls) and also gathers statistical properties of the background communication (e.g.,
number of messages sent/received).
Parallel_I/O_Manager:
Standard I/O functions overloaded with versions for use in the parallel environment
(e.g., printf, scanf). Currently, all I/O is processed through a single processor in a
master/slave relationship. File I/O is not handled, though it is a critical requirement
of the large scale computational target applications. We hope to use existing file
I/O packages for simplification.
3.5.7 The P++ User Interface The P++ interface consists of a combination
of the commercially available M++ array language interface and a set of functions
for parallel optimization, the Optimization Manager. The P++ user interface provides for
switching between M++ in the serial environment and P++ in the serial or parallel
environment.
3.5.7.1 The M++ array class library The commercially available M++
array class library (from Dyad Software Corp.) is used to simplify the software development.
The M++ interface is modified only slightly; we consider the modifications to be bug
fixes. The array interface provides a natural way to express the problem and data structure
for structured grid applications; additionally, the syntax of the interface permits a natural
exploitation of the parallelism represented within expressions used for structured grid
applications (because no single execution ordering is assumed). By using M++ within P++, the
details of serial vs. parallel interpretation of the array statements are separated. It is hoped
that, since the internal restrictions to the structured grid work are mostly contained in M++,
the move to support unstructured grids, in the future, will be separable and simplified.
The serial M++ interface allows dimensionally independent arrays to be dynamically created and manipulated with standard operators, and subarrays to be defined by indexing. In addition, it has optional array bounds checking. At runtime, an explicit loop hides the data's organization and operation structure, whereas the equivalent array expression may have its data's organization and operation structure interpreted. Importantly, within an array expression there is no data dependency, by definition. In fact, the array language represents a simplification in the design, implementation, and maintenance of structured grid codes. The functionality of the M++ interface is similar to the array features of FORTRAN 90 (see figure 12). In the current version of P++, only a restricted set of data types (integer and float arrays) is implemented. However, complete sets of operators are provided.
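The contrast between an explicit loop and an array expression can be sketched in plain C++, using std::valarray and std::slice as rough stand-ins for the M++ doubleArray and Index classes (the M++ names and semantics are only approximated here):

```cpp
#include <valarray>
#include <cstddef>

// Explicit loop version: a fixed execution ordering is implied, and the data's
// organization is hidden inside the loop body.
void average_loop(double* u, const double* v, std::size_t n) {
    for (std::size_t i = 1; i + 1 < n; ++i)
        u[i] = 0.5 * (v[i - 1] + v[i + 1]);
}

// Array-expression version in the spirit of M++ / FORTRAN 90, with
// std::valarray and std::slice as rough stand-ins for the M++ doubleArray and
// Index classes: no ordering between elements is implied, so the statement is
// free to be interpreted serially or in parallel.
void average_array(std::valarray<double>& u, const std::valarray<double>& v) {
    const std::size_t n = v.size();
    std::valarray<double> left   = v[std::slice(0, n - 2, 1)];  // like v(I-1)
    std::valarray<double> right  = v[std::slice(2, n - 2, 1)];  // like v(I+1)
    std::valarray<double> result = 0.5 * (left + right);
    u[std::slice(1, n - 2, 1)] = result;                        // like u(I) = ...
}
```

Both functions compute the same averages, but only the loop version commits to an element-by-element execution order.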
We feel that the choice of C++ as the programming language and M++ as an array
interface, made to provide as much information about the problem and structure of the data
as possible, is strategic in providing a solid base for the parallel array language support for
the target numerical problem class. It is especially strategic for support of parallel adaptive
mesh refinement, since the adaptive nature of the application means insufficient information
#include "header.h"
#ifdef PPP
#define doubleArray double_VSG_Array
#define Index       VSG_Index
#endif

void MacCormack (Index I, double Time_Step, doubleArray &F,
                 doubleArray &U_NEW, doubleArray &U_OLD)
   { // array expression:
     F = (U_OLD * U_OLD) / 2;
     // scalar expression:
     U_NEW (0) = U_OLD (0) - Time_Step * (F (1) - F (0));
     // indexed array expression:
     U_NEW (I) = U_OLD (I) - Time_Step * (F (I+1) - F (I));
     // array expression:
     F = (U_NEW * U_NEW) / 2;
     // indexed array expression:
     U_NEW (I) = 0.5 * (U_OLD (I) + U_NEW (I)) - 0.5 * Time_Step * (F (I) - F (I-1));
   }

void main ()
   { int N;
     double Time_Step;
     scanf ("%d %lf", &N, &Time_Step);
     doubleArray U_OLD (N), U_NEW (N), F (N);
     // Setup data
     int Interior_Start_Position = 1;
     int Interior_Count          = N - 2;
     int Interior_Stride         = 1;
     Index Interior (Interior_Start_Position, Interior_Count, Interior_Stride);
     MacCormack (Interior, Time_Step, F, U_NEW, U_OLD);
   }
Figure 12: C++ / M++ / P++ example code: MacCormack (Hyperbolic) Scheme.
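For reference, the scheme of figure 12 can also be written as a minimal serial sketch in plain C++, with std::vector standing in for the doubleArray class. It assumes the flux F = u²/2 of the inviscid Burgers equation, as the figure's expressions suggest, and folds dt/dx into the time step, as the figure does:

```cpp
#include <vector>
#include <cstddef>

// Serial sketch of the MacCormack predictor-corrector from figure 12 for the
// inviscid Burgers equation u_t + (u^2/2)_x = 0, with std::vector standing in
// for the M++/P++ doubleArray.  The ratio dt/dx is folded into time_step.
std::vector<double> maccormack_step(const std::vector<double>& u_old,
                                    double time_step) {
    const std::size_t n = u_old.size();
    std::vector<double> f(n), u_new(u_old);

    for (std::size_t i = 0; i < n; ++i)        // flux F = u^2 / 2
        f[i] = u_old[i] * u_old[i] / 2.0;

    for (std::size_t i = 0; i + 1 < n; ++i)    // predictor (forward difference)
        u_new[i] = u_old[i] - time_step * (f[i + 1] - f[i]);

    for (std::size_t i = 0; i < n; ++i)        // flux from predicted values
        f[i] = u_new[i] * u_new[i] / 2.0;

    for (std::size_t i = 1; i < n; ++i)        // corrector (backward difference)
        u_new[i] = 0.5 * (u_old[i] + u_new[i])
                 - 0.5 * time_step * (f[i] - f[i - 1]);
    return u_new;
}
```

A constant state is an exact steady solution, so one step applied to a constant vector should return it unchanged; that property makes a convenient sanity check.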
at compile time about the partitioning of the data (the composite grid). Such partitioning
is defined only at runtime, so the communication patterns must be interpreted then. Similar
runtime support is an accepted and required part of any attempt to provide parallel compiler
support for complex applications, though, with compiler support, additional efficiency might
be possible even for the runtime support (such issues are under investigation; see [2], [7],
[11], [12], [24], [25]).
3.5.7.2 The P++ Optimization Manager The Optimization Manager allows defaults to be overridden, giving the user control of partitioning, communication, array-to-processor mappings, communication models of Virtual Shared Grids, the parallel I/O system, etc. Optimizations of this kind have significance only in a parallel environment. The Optimization Manager is the only means by which the user can affect the parallel behavior of the code, and it provides a consistent means of tailoring parallel execution and performance. It provides the user with four types of partitioning for the array (grid variable) data:
Default partitioning: This involves even partitioning of each grid variable across all nodes, based on 1D partitioning of the last array dimension (see figure 2).
Associated partitioning: Grid variables are partitioned consistently with others. This strategy provides for same size or coarser grid construction in multigrid algorithms, but also has general use.
User defined partitioning: A mapping structure is used to construct user defined partitioning.
Application based partitioning: This allows for the introduction of user specified load balancing algorithms to handle the partitioning of one or more specified grid variables.
Currently, the functionality of the Optimization Manager is restricted in its support for the above partitioning strategies, as required for the examples in section 3.5.10.
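As an illustration of the default strategy, the following sketch computes the 1D block of the last array dimension owned by a given node under even partitioning. The struct and function names here are ours, for illustration only, and are not part of the actual Optimization Manager interface:

```cpp
#include <cstddef>

// Illustrative only: the contiguous 1D block of the last array dimension owned
// by a given node under even (default) partitioning across all nodes.  Earlier
// nodes absorb the remainder, so block sizes differ by at most one.
struct Block { std::size_t start, count; };

Block default_partition(std::size_t dim_size, std::size_t nodes, std::size_t node) {
    std::size_t base  = dim_size / nodes;
    std::size_t rem   = dim_size % nodes;
    std::size_t extra = (node < rem) ? node : rem;   // remainder cells before this node
    return { node * base + extra, base + (node < rem ? 1 : 0) };
}
```

For a dimension of size 10 on 4 nodes, this yields blocks of sizes 3, 3, 2, 2 starting at 0, 3, 6, 8.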
3.5.8 Portability and Target Architectures of P++ The target architectures of P++ are all existing and evolving distributed memory multiprocessor architectures, including the Intel PARAGON and iPSC/860, the Connection Machine 5, the nCUBE 2, the coming distributed memory Cray machine, and all kinds of networks of workstations (such as of Suns or IBM RS6000s). P++ requires only a C++ compiler or C++-to-C translator, which have begun to become widely available (e.g., the AT&T Cfront compiler), and a portable parallel communications library, such as PVM or Express. The current P++ implementation uses the Intel iPSC communications library. For the near future, however, it is planned to base P++ independently on the EXPRESS and PARMACS parallel communications environments, guaranteeing portability of P++ across all of the architectures for which these environments are available. Experience has shown that one or the other will be implemented on all machines of this type shortly after they become available. Because PARMACS and EXPRESS are implemented for several shared memory architectures, P++ will also be available for this class of machine, which significantly simplifies support for shared memory machines.
Since C++ is a superset of C, and PVM, EXPRESS, and PARMACS support C, each can be used within P++. PVM, PARMACS, and EXPRESS are described in more detail:
PVM
PVM is a public domain programming environment for the development of parallel
applications and provides a vendor independent layer for communication support. Its
latest release is similar to the proposed Message Passing Interface standard (MPI).
EXPRESS
EXPRESS from ParaSoft Corp. is a programming environment for the development
of parallel FORTRAN and C programs. EXPRESS is available for a variety of
different distributed memory architectures (Intel iPSC, nCUBE, ...) and for networks
of workstations (Sun, IBM RS 6000, Silicon Graphics, ...). In addition to allowing
distributed memory codes to also run in a shared memory environment, it is also
available for some shared memory multiprocessor architectures (Cray YMP, IBM
3090/AIX). Besides a communications library, it contains modules for parallel I/O,
a graphics system, performance analysis tools, etc.
PARMACS
PARMACS (ANL/GMD Macros), a joint development of the German National Laboratory for Computer Science and Applied Mathematics (GMD) and Argonne National Laboratory [8], is marketed through PALLAS GmbH. PARMACS is a process programming model for FORTRAN, based on macros (expanded by a standard Unix macro expander). A C version is planned for the near future. PARMACS basically contains macros for process initialization, communication, etc., and is available for the Intel iPSC, nCUBE, Meiko, SUPRENUM, Parsytec Megaframe, and Sun and IBM RS6000 networks of workstations. In addition, implementations for the shared memory architectures Alliant FX 28, Sequent Balance 2000, and Encore Multimax exist. As the PALLAS Performance Analysis tools (PATools) are based on PARMACS, they also become available for use within P++.
Additional work must be done to support the new distributed memory machines with vector or superscalar nodes based on a vector processing model. This work requires incorporating the P++ design with the recent work on vectorization of the C++ array language done in collaboration with Sandia National Laboratories. Further optimization is planned to eventually support the shared memory class of machine through a shared-memory-specific version of P++.
3.5.9 Performance of P++ To date, the only running versions of P++ are
on the iPSC/860, the iPSC parallel simulator (running on SUN workstations), and, in serial
mode, the SUN and IBM PC. The performance of P++ on the actual hardware is dominated
by the single node performance, because no additional communication is added over the serial implementation. For specific applications, though, communication in the explicit Message Passing programming model for distributed memory architectures could be better optimized than that which P++ provides automatically. Such optimizations would involve restricting message passing using knowledge of how several array expressions access memory, or timing (scheduling) messages using knowledge of dependencies across several array expressions.
The current implementation could be optimized by analysis over multiple loops, though such multiple array statement analysis is not presently a part of P++; the use of deferred evaluation is required before such work can be done. Then we could expect performance similar to that obtained with the explicit Message Passing programming model for distributed memory architectures, even in the case of highly optimized hand coded communication. Current message passing comparisons are relative to a more naive, not highly optimized, explicit message passing model that does not consider optimization across multiple array expressions. Additional multiple array expression optimization is possible using the P++ Optimization Manager's message passing control, but it is not automated within P++.
The M++ array library serves as a base to provide FORTRAN-like performance to P++, but its current performance is about 20%-100% of FORTRAN performance, which degrades the single node performance of P++. It is the VSG principle, in combination with the Optimization Manager's capability of allowing the user to define efficient partitioning, that guarantees amounts of communication nearly identical to the explicit Message Passing programming model. For many applications, better performance may be available, since the simplified P++ development environment allows greater implementation effort to be spent on the selection of more advanced and faster computational algorithms. It is hoped that this will additionally offset the current performance gap until future P++ work can compete directly
FORTRAN, though this is a current area of research.
Performance is important, since without efficient execution of the application source
code, the effects of the parallel or vector architecture are lessened or lost:
Single node performance: Steps have been taken to optimize the single node performance so that the P++ environment can be accurately assessed. For vector nodes, and nodes that are most efficiently programmed through a vector programming model, this work has included application of the C++ array language class libraries on the Cray (through collaboration with Sandia National Laboratories). First results by Sandia National Laboratories concerning performance comparisons with FORTRAN are very promising. With an optimized C++ array class library, about 50%-90% of the FORTRAN vector performance was achieved for complete codes. The comparison is difficult because the realistic codes could not be readily implemented in FORTRAN to include the richer physics available in the C++ versions. In some cases, the C++ compiler optimization switches had to be turned off. Such problems are examples of incomplete and often immature C++ compilers, which hamper the comparison of FORTRAN and C++ on large realistic software. Similar performance has been demonstrated by P++ on the SUN Sparc, but only on those select P++ array statements chosen for initial evaluation and testing of P++, not complete codes.
Parallel system performance: Secondary to single node performance, parallel performance is mostly affected by the amount of communication introduced. The P++ VSG model optimizes this and introduces no more communication than the explicit Message Passing model. However, currently no runtime analysis is done across multiple array statements, which might better schedule communication and possibly further minimize the required communication; such further work would require deferred evaluation (lazy evaluation). Additional performance could be obtained by caching communication patterns and reusing them in multiple array expressions. Such work has not been included in the P++ implementation, but is a part of the P++ research and has been a part of several codes built to test these ideas. Work specific to optimized parallel evaluation of array expressions has been carried out for a number of relevant architectures in [14].
3.5.10 P++ Support for Specific Examples Although P++ is dimensionally independent, most example applications have been 2D; however, a 3D multigrid code has been demonstrated. In figure 13, P++ is demonstrated with a partitioning developed to support multigrid.
3.5.10.1 Standard Multigrid Algorithms on Rectangular Domains Multigrid is a commonly used computational algorithm especially suited to the solution of elliptic partial differential equations. In contrast to single grid methods, multigrid uses several grids of different scale to accelerate the solution process.
The usual way to implement regular multigrid algorithms on a distributed memory system is based on the method of grid partitioning ([10], [34], [38]). The computational domain (grid) is divided into several subgrids that are assigned to parallel processors (see figure 2). The subgrids of the fine grids and the associated subgrids of the coarse grid are assigned to the same processor. Each multigrid component (e.g., relaxation, restriction, and interpolation) can be performed on a subset of the interior points of the subgrid independently (in parallel). Calculation of values at interior boundary points, however, needs values from neighboring subgrids. Since distributed memory machines have no global address space, these values must somehow be made available. Instead of transferring the values individually at the time they are needed, it is more efficient to keep copies of neighboring grid points in the local memory of each processor. Hence, each process contains an overlap area, which has to be updated after each algorithmic step. Because the details of the algorithm on a small
number of points per processor are problematic, agglomeration is one of the strategies that
can be used to consolidate the distributed application to a smaller number of processors.
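The overlap-update idea can be sketched in a few lines for a 1D partitioning. In P++ the copies below would be messages between processors; in this serial sketch the subgrids live in one process, so the update reduces to assignments:

```cpp
#include <vector>
#include <cstddef>

// Sketch of the overlap update for a 1D partitioning: each processor's local
// array carries one extra overlap cell at each end holding a copy of its
// neighbour's boundary value.  After each algorithmic step, the overlap cells
// are refreshed from the neighbouring subgrids.  In P++ each copy would be a
// message; here all subgrids live in one process.
void overlap_update(std::vector<std::vector<double>>& subgrids) {
    for (std::size_t p = 0; p + 1 < subgrids.size(); ++p) {
        std::vector<double>& left  = subgrids[p];
        std::vector<double>& right = subgrids[p + 1];
        left.back()   = right[1];                // right neighbour's first interior value
        right.front() = left[left.size() - 2];   // left neighbour's last interior value
    }
}
```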
Figure 13. The interaction of the Overlap Update and VSG Update concepts for standard multigrid partitioning with coarse level agglomeration.
Figure 14 shows the runtime support from P++ for the interpreted communication patterns of the solver and for the agglomeration strategy used in the parallelization of multigrid. Several variant strategies are possible, but the use of VSG reduces the details of their implementation to defining the fine and coarse grid partitioning. The particular VSG communication model, VSG Update (based on the Owner Computes rule) or Overlap Update, is chosen on the basis of data availability within the partitioned grid variables.
3.5.10.2 Multilevel local refinement algorithms on block structured grids As a more complicated example of the flexibility of P++, we demonstrate some of the support within P++ for block structured local refinement ([22], [28], [29], [41], [42]). During the solution of partial differential equations on structured grids, local refinement allows the solution complexity to depend directly on the complexity of the evolving activity. Specifically, regions local to problem activity are refined. The effect is to provide a discretization specifically tailored to an application's requirements.
Local refinement composite grid methods typically use uniform grids, both global
and local, to solve partial differential equations. These methods are known to be highly
Figure 14. The interaction of the Overlap Update, VSG Update, and BSG Interface Update concepts for FAC and AFAC partitioning of a block structured locally refined grid.
efficient on scalar or single processor vector computers, due to their effective use of uniform grids and multiple levels of resolution of the solution. On distributed memory multiprocessors, such methods benefit from their tendency to create multiple isolated refinement regions, which may be effectively treated in parallel. However, they suffer from the way in which the levels of refinement are treated sequentially in each region. Specifically, in FAC, the finer levels must wait to be processed until the coarse level approximations have been computed and passed to them; conversely, the coarser levels must wait until the finer level approximations have been computed and used to correct their equations. In contrast, AFAC eliminates this bottleneck of parallelism. Through a simple mechanism used to reduce interlevel dependence, individual refinement levels can be processed by AFAC in parallel. The result is that the convergence factor bound for AFAC is the square root of that for FAC. Therefore, since both AFAC and FAC have roughly the same number of floating point operations, AFAC requires twice the serial computational time of FAC, but AFAC may be much more efficiently parallelized.
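In symbols, with rho denoting the convergence factor bound of each method, the relationship just stated reads:

```latex
\rho_{\mathrm{AFAC}} \;=\; \sqrt{\rho_{\mathrm{FAC}}}
\qquad\Longrightarrow\qquad
\rho_{\mathrm{AFAC}}^{2} \;=\; \rho_{\mathrm{FAC}} .
```

That is, two AFAC cycles achieve the same error reduction bound as one FAC cycle; since one cycle of either method costs roughly the same number of floating point operations, the serial work doubles, but each AFAC cycle processes all refinement levels concurrently.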
Specifically, the local refinement of geometries within regions of activity is not easily done using a single rectangular local refinement patch. In order to better represent geometries within local activity, we introduce block structured local refinement. The flexibility introduced by block structured local refinement applies equally to block structured global grids. Though elliptic solvers for block structured grids (beyond relaxation methods) are neither provided for in the current work nor considered in the thesis, this is an important and interesting area of current work by the GMD and others.
For example, in figure 6, the composite grid shows a rectangular domain with a
curved front embedded within. Such problems could represent oil reservoir simulation models
with simple oil/water fronts or more complicated fluid flow problems with shock fronts. In
this specific example, the front is refined with two levels; the first level is represented by
grids 2 and 3, the second by grids 4, 5, and 6.
For the parallel environment using FAC, because of the sequential processing of the
local refinement levels, the composite grid is partitioned as shown in figure 6. Note that
solvers used on the individual grids make use of Overlap Updates provided automatically
by P++. The intergrid transfers between local refinement levels rely on VSG Updates, also
provided automatically by the P++ environment. Note that P++ support of the block
structured local refinement is limited and does not include the block structured grid (BSG)
Interface Update, which must be handled within the block structured grid code or library.
Underlying support in the parallel environment for the BSG Interface Update is provided by
either Overlap Update or VSG Update, or by a combination of the two.
Support from P++ for a partitioning specific to AFAC is similarly provided. The
different application specific partitioning (shown in figure 14) naturally invokes automated
support from the P++ environment in a slightly different manner. The use of an environment
such as P++, which permits the implementation of the algorithms (in this case FAC and
AFAC) in the parallel environment independent of the organization of their data, greatly
simplifies the software development process since it may be developed in a serial workstation
environment where productivity is high and since each algorithm can reuse similar code. This
is important because 99.9% of the two algorithms' implementations are similar, even though
they are optimized using distinctly different organizations (partitionings) of the composite
grids.
3.5.11 Related Research To our knowledge, there is currently no study of architecture-independent programming environments that permit software developed specifically for serial machines to be run on parallel distributed memory architectures, and that use existing, commonly available compilers. The most important work done in related areas is as follows (apologies to anyone we might have missed):
Los Alamos National Laboratory's work on C++ for Scientific Computation [18]: Initial work was done on the use of C++ for large scale scientific computation on vector computers. The work on WAVE++, a CFD code, details the advantages and disadvantages of the use of C++ for scientific computation in general, and for vector environments in particular. More recent work combines C++ object-oriented design with adaptive refinement hypersonic applications.
Sandia National Laboratories' work on C++ vectorization [45]: This effort on array languages for C++ shows that 50-90% of the equivalent FORTRAN performance can be attained. Some of the largest laboratory applications codes there are being developed using C++, e.g., RALE++.
Paragon [13]: This is a parallel programming environment for scientific applications
(mostly image processing) using Communication Structures. It is also based on C++
and contains concepts similar to P++, but is much more restrictive (though it has
been demonstrated on a larger number of computers than P++). This is primarily
due to the fact that the concept of communication structures is not as general and
powerful as the concept of Virtual Shared Grids in allowing and expressing the view
of the distributed memory as a global address space, restricted to specific objects.
Additionally, indexing is cumbersome and generally restrictive compared to that of
P++, which is borrowed from M++ (whose origins are in Gauss).
FORTRAN D [19]: This work develops an extensive list of different array partitionings for use with FORTRAN 77 (FORTRAN D) and FORTRAN 90 (FORTRAN 90D). It does not yet employ any concept similar to Virtual Shared Grids. Use of PARTI (see below) within FORTRAN 90D, however, is expected; this is currently a point of research.
PARTI (Parallel Runtime Tools at ICASE [7]): This provides runtime support for use with FORTRAN and contains clever means by which FORTRAN loops can be interrogated and existing data parallelism discovered and exploited at runtime. PARTI is primarily focused on unstructured grids. As opposed to P++, the seamless introduction of the PARTI primitives requires compiler modifications.
SUPERB [48]: This is a semiautomatic parallelization tool for distributed architectures at the FORTRAN compiler level. The prototype developed within SUPRENUM is restricted to a very specific class of applications.
3.6 AMR++, an Adaptive Mesh Refinement Class Library
3.6.1 Introduction AMR++ is a C++ class library that simplifies the details of building selfadaptive mesh refinement applications. The use of this class library significantly simplifies the construction of local refinement codes for both serial and parallel architectures. AMR++ has been developed in a serial environment using C++ and the M++ array class interface. It runs in a parallel environment because M++ and P++ share the same array interface. Therefore, AMR++ inherits the machine targets of P++ and, thus, has a broad base of architectures on which to execute. The efficiency and performance of AMR++ depend mostly on the efficiency of M++ and P++ in the serial and parallel environments, respectively. Together, the P++ and AMR++ class libraries separate the abstractions of local refinement and parallelism to significantly ease the development of parallel adaptive mesh refinement applications in an architecture independent manner.
The AMR++ class library represents work that combines complex numerical, computer science, and engineering application requirements. The work therefore naturally involves compromises in its initial development. In the following sections, the features and current restrictions of the AMR++ class library are summarized.
3.6.2 Block Structured Grids: Features and Restrictions The target grid types of AMR++ are 2D and 3D block structured grids with rectangular or logically rectangular grid blocks. On the one hand, they allow for a very good representation of complex internal geometries that are introduced through local refinement in regions with increased local activity. This flexibility of local refinement block structured grids applies equally to global block structured grids, which allow matching complex external geometries. On the other hand, the restriction to structures of rectangular blocks, as opposed to fully unstructured grids, allows for the application of the VSG programming model of P++ and, therefore, is the foundation for good efficiency and performance in distributed environments, which is one of the major goals of the P++/AMR++ development. Thus, we believe that block structured grids are the best compromise between full generality of the grid structure and efficiency in a distributed parallel environment. The application class forms a broad cross section of important scientific applications.
In figure 15, the global grid is the finest uniformly discretized grid that covers the whole physical domain. Local refinement grids (level i + 1) are formed from the global grid (level i = 0), or recursively from refinement grids (discretization level i), by standard refinement with h_{i+1} = h_i / 2 (a refinement factor of two) in each coordinate direction. Thus, boundary lines of block structured refinement grids always match grid lines on the underlying discretization level.
The construction of block structured grids in AMR++ has some practical limitations that simplify the design and use of the class libraries. Specifically, grid blocks at the same level of discretization cannot overlap. Block structures are formed by distinct or connected rectangular blocks that share their boundary points (block interfaces) where they adjoin. Thus, a connected region of blocks forms a block structured refinement grid. It is possible that one refinement level consists of more than one disjoint block structured refinement grid. In the dynamic adaptive refinement procedure, refinement grids that adjoin each other can be automatically merged.
In figure 15(a), an example of a composite grid is illustrated: The composite grid shows a rectangular domain within which we center a curved front and a corner singularity. The grid blocks are ordered lexicographically; the first digit represents the level, the second
digit the connected block structured refinement grid, and the third digit the grid block. Such
problems could represent the structure of shock fronts or multifluid interfaces in fluid flow
applications: In oil reservoir simulations, for example, the front could be an oil-water contact front moving with time, and the corner singularity could be a production well. In this specific
example, the front is refined with two block structured refinement grids; the first grid on
refinement level two is represented by grid blocks 2.1.1 and 2.1.2, and the second grid on
level three by grid blocks 3.1.1, 3.1.2, and 3.1.3. In the corner on each of the levels, a single
refinement block is introduced.
For ease of implementation, in the AMR++ prototype the global grid must be uniform. This simplification of the global geometry was necessary in order to concentrate on the major issues of this work, namely, implementing local refinement and selfadaptivity in an object-oriented environment. The restriction is not a general constraint and can be lifted fairly easily in a future version of the prototype. Aside from implementation issues, some additional functionality has to be made available:
For implicit solvers, the resulting domain decomposition of the global grid may
require special capabilities within the single grid solvers (e.g., multigrid solvers for
block structured grids with adequate smoothers, such as inter block line or plane
relaxation methods).
The block structures in the current AMR++ prototype are defined only by the needs of local refinement of a uniform global grid. This restriction allows them to be Cartesian. More complicated structures, as they result from difficult non-Cartesian external geometries (e.g., holes or spliss points; see [37]), are currently not taken into consideration. An extension of AMR++, however, is possible in principle. The wide experience with general 2D block structured grids that has been gained at the GMD [37] could form a basis for these extensions. Whereas our work is comparably simple in 2D, because no explicit communication is required, extending the GMD work to
3D problems is very complex, if not intractable.
3.6.3 Some Implementation Issues In the following, some implementation issues are detailed. They also demonstrate the complexity of a proper and efficient treatment of block structured grids and adaptive refinement. AMR++ takes care of all of these issues, which would otherwise have to be handled explicitly at the application level.
Dimensional Independence and Dimensional Independence Indexing (DII): The implementation of most features of AMR++ and its user interface is dimensionally independent. Derived from user requirements, the AMR++ prototype is restricted at the lowest level to 2D and 3D applications. This restriction can, however, be easily removed.
One important means by which dimensional independence is reached is dimensionally independent indices (DII), which contain index information for each coordinate direction. On top of these DII indices, index variants are defined for each type of subblock region (interior, interior and boundary, boundary only, ...). Convex regions require only a single DII, but nonconvex regions require multiple DII. For example, for addressing the boundary of a 3D block (nonconvex), one DII index is needed for each of the six planes. In order to avoid special treatment of physical boundaries, all index variants are defined twice, including and excluding the physical boundary, respectively. All index variants, several of them also including extended boundaries (see below), are precomputed at the time a grid block is allocated. A possible enhancement, for efficiency, would permit them to be shared (cached), since they are independent of the array data. In the AMR++ user interface and in the top level classes, only index variants or indicators are used, thereby allowing a dimensionally independent formulation, except for the very low level implementation. Many low level operations, such as interpolation, are necessarily dependent on the problem dimension.
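A dimensionally independent index of this kind might be sketched as follows; the class name, layout, and helper function are illustrative, not the AMR++ implementation:

```cpp
#include <array>
#include <cstddef>

// Sketch of a dimensionally independent index (DII): one (start, count,
// stride) triple per coordinate direction, with the dimension fixed by a
// template parameter.  Names and layout are ours, for illustration only.
template <std::size_t Dim>
struct DII {
    std::array<std::size_t, Dim> start, count, stride;

    // Total number of points addressed by this index.
    std::size_t size() const {
        std::size_t n = 1;
        for (std::size_t d = 0; d < Dim; ++d) n *= count[d];
        return n;
    }
};

// One of the precomputed index variants: the interior of an n^Dim grid block.
template <std::size_t Dim>
DII<Dim> interior(std::size_t n) {
    DII<Dim> I;
    for (std::size_t d = 0; d < Dim; ++d) {
        I.start[d]  = 1;
        I.count[d]  = n - 2;
        I.stride[d] = 1;
    }
    return I;
}
```

The same source then addresses the interior of a 2D or 3D block simply by changing the template argument, which is the essence of the dimensionally independent formulation.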
Implementation of block structured grids: The AMR++ grid block objects consist of the interior, the boundary, and an extended boundary of a grid block, as well as interface objects (links) that are formed between adjacent pairs of grid block objects. The links contain P++ array objects that do not hold actual data but serve as views (subarrays) of the overlapping parts of the extended boundary between adjacent grid block objects. The actual boundaries that are shared between different blocks (block interfaces) are complex structures that are represented in a list within each grid block object. Block interface objects are formally derived from the grid block objects, so that interfaces of interfaces are possible (and required for corners where grid blocks meet in two-dimensional problems), and interfaces of interfaces of interfaces are possible (where three-dimensional blocks meet at corners). For example, in 3D, interfaces between blocks are 2D planes, those between plane interfaces are 1D line interfaces, and, one step further, those between line interfaces are points (zero-dimensional). In figure 15(b), grid blocks 2.1.1 and 2.1.2 of the composite grid in figure 15(a) are depicted, including their block interface and their extended boundary. The regular lines denote the outermost line of grid points of each block. Thus, with an extended boundary of width two, there is one line of points between the block boundary line and the dashed line for the extended boundary. In its extended boundary, each grid block has views of the values of the original grid points of its adjoining neighboring block. This way it is possible to evaluate stencils on the interface and, with an extended boundary width of two, to also define a coarse level of the block structured refinement grid in the multigrid sense.
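The grid-block and extended-boundary structure can be sketched in 1D. The class and member names below are illustrative, not AMR++'s, and where AMR++ uses P++ array views, this serial sketch simply refreshes the extended boundary by copying from the neighbour's interior:

```cpp
#include <vector>
#include <cstddef>

// 1D sketch of a grid block with an extended boundary of width w:
// layout is [ w ghost cells | interior | w ghost cells ].
struct GridBlock1D {
    std::vector<double> data;
    std::size_t w;                 // extended-boundary width
    GridBlock1D(std::size_t interior, std::size_t width)
        : data(interior + 2 * width, 0.0), w(width) {}
    double* first_interior() { return data.data() + w; }
    double* last_interior()  { return data.data() + data.size() - 2 * w; }
};

// Refresh each block's extended boundary from its neighbour's interior, so
// that stencils can be evaluated on the block interface.
void update_interface(GridBlock1D& left, GridBlock1D& right) {
    for (std::size_t k = 0; k < left.w; ++k) {
        left.data[left.data.size() - left.w + k] = right.first_interior()[k];
        right.data[k] = left.last_interior()[k];
    }
}
```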
Data structures and iterators: In AMR+(, the composite grid is stored as a tree
of all refinement grids, with the global grid being the root. Block structured grids
are stored as lists of blocks (for ease of implementation; collections of blocks would
be sufficient in most cases). In figure 15(c), the composite grid tree for the example
composite grid in figure 15(a) is illustrated. The user interface for doing operations
on these data structures is a set of iterators. For example, for an operation on the
composite grid (e.g., zeroing each level or interpolating a grid function to a finer
level), an iterator is called that traverses the tree in the correct order (pre-order,
post-order, or no order). This iterator takes as arguments the function to be executed
and two indicators that specify the physical boundary treatment and the type of
subgrid to be treated. The iteration starts at the root and recursively traverses the tree.
For doing an operation (e.g., Jacobi relaxation) on a block structured grid, iterators
are available that process the list of blocks and all block interface lists on each grid,
calling the object member function that is passed as a parameter to the iterator. Iterators
are provided for all the relevant AMR++ objects, and so allow simplified internal
processing as required for the composite grid solvers. The use of the iterators is
not specific to the algorithms currently implemented in AMR++ (FAC, AFAC, and
AFACx), and is intended to be a general interface for other algorithms as well.
3.6.4 Object-Oriented Design and User Interface The AMR++ class
libraries are customizable by using the object-oriented features of C++. For example, in
order to obtain efficiency in a parallel environment, it may be necessary to introduce alternate
iterators that traverse the composite grid tree or the blocks of a refinement region in a special
order. However, the use of alternate iterators does not change the serial code that uses them,
but allows the P++ operations on different composite grid levels to run concurrently. In this
way, the data parallel model of P++ is mixed with the tasking parallel model, which can be
either supported through C++ tasking libraries9 or compiler supported extensions10. The
same is true for alternate composite grid cycling strategies, as for example needed in AFAC
as opposed to FAC algorithms (section 2.2).
9Dirk Grunwald at the University of Colorado at Boulder has developed such parallel tasking class libraries.
10Carl Kesselman at Caltech has developed CC++, which provides tasking support as part of a simple
C++ language extension.
Application specific parts of AMR++, such as the single grid solvers or criteria
for adaptivity, which have to be supplied by the user, are also simply specified through
substitution of alternate base classes: A preexisting application (e.g., problem setup and
uniform grid solver) uses AMR++ to extend its functionality and to build an adaptive mesh
refinement application. Thus, the user supplies a solver class and some additional required
functionality (refinement criteria, ...) and uses the functionality of the highest level AMR++
((Self_)Adaptive_)Composite_Grid class to formulate their own special algorithm or to use one of
the supplied PDE solvers. In the current prototype of AMR++, FAC and AFAC based
solvers (section 2.2) are supplied. If the single grid application is written using P++, then
the resulting adaptive mesh refinement application is architecture independent, and so can
be run efficiently in a parallel environment.
The design of AMR++ is object-oriented and the implementation of our prototype
extensively uses features like encapsulation and inheritance: the abstraction of self-adaptive
local refinement, which involves the handling of many issues, including memory management,
an interface for application specific control, dynamic adaptivity, and efficiency, is achieved
through grouping these different functionalities into several interconnected classes. For
example, memory management is greatly simplified by the object-oriented organization of the
AMR++ library: issues such as the lifetime of variables are handled automatically by the scoping
rules of C++, so memory management is automatic and predictable.
As the AMR++ interface is object-oriented, the control over construction of the
composite grid is intuitive and natural: The creation of composite grid objects is similar to
the declaration of floating point or integer variables in procedural languages like FORTRAN
and C. Users basically formulate their solvers by allocating one of the predefined composite
grid solver objects or by formulating their own solvers on the basis of the composite grid
objects and the associate iterators, and by supplying the single grid solver class (object).
Although not a part of the current implementation of AMR++, C++ introduces a
template mechanism in the latest standardization of the language (AT&T version 3.0), which
is only now starting to appear in commercial products. The general purpose of this template
language feature is to permit class libraries to use user specified base types. For AMR++,
for example, the template feature could be used to allow the specification of the base solver
and adaptive criteria for the parallel adaptive local refinement implementation. In this way,
the construction of an adaptive local refinement code from the single grid application on the
basis of the AMR++ class library can become even simpler and more cleanly implemented.
There is no shortage of obscure details associated with adaptive mesh refinement
and its implementation, but these will not be discussed further here. The interested reader is
referred to [27] and [31].
3.6.5 Static and Dynamic Adaptivity, Grid Generation In the current
AMR++ prototype, static adaptivity is fully implemented. Users can specify their composite
grid either interactively or by some input file: for each grid block, AMR++ needs
its global coordinates and the parent grid block. Block structured local refinement regions
are formed automatically by investigating neighboring relationships. In addition, the
functionality for adding and deleting grid blocks under user control is available within the
Adaptive_Composite_Grid object of AMR++.
Recently, dynamic adaptivity has been a subject of intensive research. First results
are very promising and some basic functionality has been included in the AMR++
prototype: given a global grid, a flagging criterion function, and some stopping criteria, the
Self_Adaptive_Composite_Grid object contains the functionality for iteratively solving on the
actual composite grid and generating a new discretization level on top of the respective finest
level. Building a new composite grid level works as follows:
(1) The flagging criterion delivers an unstructured collection of flagged points in each grid
block.
(2) For representing grid block boundaries, all neighboring points of flagged points are
also flagged.
(3) The new set of grid blocks that contribute to the refinement level (gridding) is built by
applying a smart recursive bisection algorithm similar to the one developed in [6]: if
the rectangle enclosing all flagged points of the given grid block is too inefficient,
it is bisected in the longer coordinate direction and new enclosing rectangles are
computed. The efficiency of each rectangle is measured by the ratio of
flagged points to all points of the new grid block; in the subsequent tests, a
threshold of 75% is used. This procedure is repeated recursively if any of the new
rectangles is also inefficient. With the goal of building the rectangles as large as possible within the
given efficiency constraint, the choice of the bisection point (splitting in halves is too
inefficient because it results in very many small rectangles) is done by a combination
of signatures and edge detection. The reader is referred to [6] or [27] for more details.
(4) Finally, the new grid blocks are added to the composite grid to form the new refinement
level. Grouping these blocks into connected block structured grids is done the
same way as it is done in the static case.
This flagging and gridding algorithm has the potential for further optimization: The bisection
method can be further improved, and a clustering and merging algorithm could be applied.
This is especially true for refinement blocks of different parent blocks that could form one
single block with more than one parent. Internal to AMR++, this kind of parent/child
relationship is supported. The results in section 4.6, however, show that the gridding is
already quite good. The number of blocks that are constructed automatically is only slightly larger
(< 10%) than a manual construction could deliver.
A next step in self-adaptive refinement would be to support time dependent problems
whose composite grid structure changes dynamically with time (e.g., moving fronts).
In this case, in addition to adding and deleting blocks, enlarging and diminishing blocks
must be supported. Though some basic functionality and the implementation of the general
concept are already available, this problem has not yet been pursued further.
3.6.6 Current State and Related Work The AMR++ prototype is implemented
using M++ and the AT&T Standard Components II class library to provide
standardized classes (e.g., linked list classes). Through the shared interface of M++ and
P++, AMR++ inherits all target architectures of P++. AMR++ has been successfully
tested on Sun workstations and on the Intel iPSC/860.
Taking into account the large functionality of AMR++, there are still several
insufficient aspects and restrictions, and a large potential for optimization in the current
prototype (as already pointed out in the preceding description). Until now, AMR++ has been
successfully used as a research tool for the algorithms and model problems described in the
next two sections. However, AMR++ provides the functionality to implement much more
complicated application problems.
Concerning parallelization, running AMR++ under P++ on the Intel iPSC/860
has proven its full functionality. Intensive optimization, however, has only been done within
P++. AMR++ itself offers a large potential for optimization. For example, efficiently
implementing self-adaptivity, including load (re)balancing in a parallel environment, requires
further research. In addition, the iterators that are currently available in AMR++, though
working in a parallel environment, are best suited for serial environments. Special parallel
iterators that, for example, support functional parallelism on the internal AMR++ level
would have to be provided.
To our knowledge, the combined P++/AMR++ approach is unique. There are
several other developments in this area (e.g., [37]), but they either address a more restricted
class of problems or they are still restricted to serial environments. However, important
work at Los Alamos National Laboratory and Lawrence Livermore National Laboratory has
addressed adaptive mesh refinement for explicit equation solvers on SIMD and MIMD
architectures, respectively.

PAGE 1
ADAPTIVE MESH REFINEMENT FOR DISTRIBUTED PARALLEL ARCIIITECTURES by Daniel James Quinlan B. A., University of Colorado at Denver, 1987 A thesis submitted to the Faculty of \he Graduate School of the University of Colorado at Denver in partial fulftl!ment of the requiren1ents for the degree of Doctor of Philosophy Applied Mathematics 1993
PAGE 2
l This thesis for the Doctor of Philosophy degree by Daniel James Quinlan has been approved for the Department of Mathematics by Stephen F. McCormick John Ruge Date lffJ
PAGE 3
Quinlan, Daniel James (Ph.D., Applied Mathematics) Adaptive Mesh Refinement for Distributed Parallel Architectures Thesis directed by Professor Stephen F. McCormick iii The purpose of adaptive rnesh refinement is to match the computational demands to an application's activity. In a fluid How problem, this n1eans that only regions of high local activity (shocks, boundary layers, etc.) can demand increased computational effort, while regions of little flow activity (or interest) are more easily solved using on1y relatively little computational effort. A thorough exploitation of these techniques is crucial to the efficient solution of rnore general problems arising in large scale computation. The fast adaptive composite grid method (FAC) is an algorithm that uses uniform grids, both global and local, to solve principally elliptic partial differential equations. How ever, :t"'A C suffers in the parallel environment from the way in which the levels of refinement are treated sequentially. The asynchronous fast adaptive cOinposite method, AFAC, and a new method, AFACx, eliminate this bottleneck of parallelism. In both AFAC and AFACx, individual refmement levels are processed in parallel. Ali'A.Cx both generalizes AFAC and permits the use of more complex block structured mesh refinement required for selfadaptive mesh refinement. Although each level's processing may be parallelized in FAC, AFAC and AFACx may be much more efficiently parallelized. It is shown that, under most circum stances, AFAC is superior to FAC in a parallel environment. The theory for AFACx and an evaluation of its performance, including FAC and AFAC, is a significant part of this thesis; the remainder of the thesis details the objectoriented development of parallel adaptive n1esh refinement software. 
The development of parallel adaptive mesh refinement software is divided into three parts: 1) the abstraction of parallelism using the C++ parallel array class library P++; 2) the abstraction of adaptive mesh refinement using the C++ serial adaptive mesh refinement class library AMR++; 3) the serial application, specifically the single grid application, defined by
PAGE 4
1 the user. Thus, we present a greatly simplified environment for the development of adaptive mesh refinement software in both the serial and parallel environment. More specifically, such work provides an architecture independent environment to support the development of rnore general complex software, which might be targeted for a parallel environment. This abstract accurately represents the content of the candidate's thesis. I recornmend its publication. Signed Stephen F. McCormick
PAGE 5
DEDICATION To my wife Kirsten and son Carsten, without whose patient support the decade of undergraduate and graduate school would not have been possible.
PAGE 6
ACKNOWLEDGEMENTS Many people have helped me throughout my student years at CU Denver, and I want to thank a select few. I owe my most sincere thanks to Steve McCormick for his advice, support, and collaboration throughout my undergraduate and graduate education. Additional thanks to: Max Lemke for his collaboration on the software development, his thorough testing, and his insight into the singular perturbation example; Kristian Witsch, Max's advisor, for advice on details of the singular perturbation example problem and for supporting Max in our the joint work; and Dinshaw Balsara for forcing me into the block structured problems, which eventually led to the AFACx algorithm. Additional thanks go to James Peery and Allen Robinson at Sandia National Laboratory for the loan of a SUN Spare Station, which allowed for significant extension of the software development of P++ and AMR++, and rny mother Pat Quinlan, a graphics artist, who helped prepare the figures. My deep appriciation goes to AFOSR for its early support of my work by way of a graduate research assistantship and a later sum1ner fellowship. Finally, I sincerely thank Tom Zang at NASA Langley for his support, which led to a threeyear NASA fellowship that allowed me to pursue my doctoral research work. The Inotivation for much of this work was the problem that he proposed for my fellowship, which originally came from an SBIR grant that was funded by NASA Langley and for which I was the Principal Investigator.
PAGE 7
CHAPTER INTRODUCTION 1.1 Algorithm Development 1.2 Software Development CONTENTS 1..3 Problems and 'Future VVork 2 PARALLEL ALGORITHM DESIGN (THEORY) 1 5 10 12 2.1 Overview of the FAC and AFAC Composite Grid Algorithms 12 2.2 Notation and Definition of FAC, A FAC, and AFACx Algorithms 13 2.2.1 FAC Algorithm 15 2.2.2 AFAC Algorithm 2.2.3 AFACx Algorithm 2.2.3.1 AFACx Motivation 2.2.3.2 AFACx Definition 2.3 AFACx Convergence Theory 3 PARALLEL SOFTWARE DESIGN 3.1 Introduction 3.2 C Language Implementation of Parallel AFAC/FAC 3.2.1 Program Structure 3.2.2 Data Structures 3.2.3 Multigrid Solver 3.2.4 AFAC/FAC Scheduler 3.2.5 Grid Manager 3.2.6 Multilevel Load Balancer (MLB) 16 18 18 21 22 33 33 34 34 35 35 36 36 36
PAGE 8
3.2. 7 Data Flow 3.3 Problems with Computer Languages 3.3.1 Problems with FORTRAN .. 3.3.2 Problems with Procedural Languages 3.4 Motivation for ObjectOriented Design .. 3.4.1 Problems with the ObjectOriented C++ Languages 3.5 P++, a Parallel Array Class Library 3.5.1 Introduction and Motivation 3.5.2 Goals of the P++ Development 3.5.3 The P++ Applications Class 3.5.4 P++ Implementation in Standard C++ 3.5.5 The Programming Model of P++ 3.5.5.1 Single Program Multiple Data (SPMD) Model 3.5.5.2 Virtual Shared Grids (VSG) Model 3.5.6 The ObjectOriented Design of P++ 3.5.7 The P++ User Interface 3.5.7.1 TheM++ array class library. 3.5.7.2 The P++ Optimization Manager 3.5.8 Portability and Target Architectures of P++ 3.5.9 Performance of P++ Vlll 37 38 38 41 43 49 49 49 53 54 54 55 56 56 62 54 65 67 68 69 3.5.10 P++ Support for Specific Examples. 72 3.5.10.1 Standard Multigrid Algorithms on Rectangular Domains 72 3.5.10.2 Multilevel local refinement algorithms on blockstructured grids 73 3.5.11 Related Research . 76 3.6 AMR++, an Adaptive Mesh Refinement Class Library 3.6.1 Introduction 78 78
PAGE 9
3.6.2 Block Structured Grids Features and Restrictions 3.6.3 Some Implementation Issues ... 3.6.4 ObjectOriented Design and User Interface 3.6.5 Static and Dynamic Adaptivity, Grid Generation 3.6.6 Current State and Related Work 3.7 ObjectOriented Design for Parallel Adaptive Refinement 4 PARALLEL ADAPTIVE MESH REFINEMENT: RESULTS 4.1 Introduction 4.2 Comparison of Convergence Factors lX 78 81 83 85 87 88 90 90 93 4.2.1 Composite Grid Convergence Factors for Poisson Equation 93 4.2.2 Convergence Factors for Singular Perturbation Problem 94 4.3 Performance of AFAC 4.3.1 Simple Example on iPSC/2: 4.3.2 Simple Example on iPSC/1: 4.3.3 Complex Example on il'SC/1: 4..4 Performance Comparison of l'AC and AFAC Algorithms 4.4.1 Parallelization for Distributed Architectures 96 97 99 99 101 101 4.4.1.1 Parallel Multigrid SolverStandard Grid Partitioning. 101 4.4.1.2 Parallelization of FAC Single Level Grid Partitioning 103 4.4.1.3 Paralleliza\ion of AFACLinear Multilevel Grid Partitioning. 103 4.4.2 Comparison of Interprocessor Communication in FAC and AFAC 104 4.4.3 Relative Performance Results 107 4.4.4 Conclusions of l'AC versus Al'AC 108 4.5 Dynamic Adaptivi\y in Parallel Environment 110 4.5.1 Multilevel Load Balancing 111 4.5.2 Dynamic Movement of Local Refinement (Grid Tracking) 111
PAGE 10
4.5.3 Dynamic Addition of Local Refinement 4.5.4 Relative Costs of Dynamic Adaptivity 4.5.5 Conclusions about Dynamic Adaptivity 4.6 SelfAdaptive Refinement Using P++/AMR++ 5 CONCLUSIONS 6 FUTURE WORK BIBLIOGRAPHY X 112 112 112 113 126 128 135
PAGE 11
FIGURE 1 2 3 4 5 6 7 8 FIGURES Example composite grid with five levels. Effect of overlapping comrnunication with cmnputation. FORTRAN and C of C++ example code fragments ... C Language example for block by block cache based execution. Distributed memory example code. Equivalent P++) objectoriented example code. Equivalent P++1 objectoriented example code using explicit indexing. Equivalent P++) objectoriented example code. 9 An example for VSG Update based on the Owner computes rule: A = B + C on 3 processors. 10 The standard method of grid partitioning with overlap. 11 The objectoriented design of the P++ environment. Xl 11 39 41 42 44 45 45 47 59 59 63 12 C++ I M++ I P++ example code: McCormack (Hyperbolic) Scheme. 66 13 The interaction the Overlap Update and VSG Update concepts for standard multigrid partitioning with coarse level agglorneration. 14 The interaction the Overlap Update, VSG Update, and BSG Interface Update concepts for FAC and AFAC partitioning of a block structured locally refmed grid 15 Exarr1ple of a composite grid, its composite grid tree) and a cutout of two blocks with their extended boundaries and their interface. 16 Simple composite grid 73 74 89 98
PAGE 12
17 Regular single level strip partitioning of a 3level composite grid structure onto 4 processors (FAC) ...... 18 Irregular multilevel strip partitioning of a 7level composite grid structure onto 16 processors (AFAC) ..... 19 Complex composite grid problem with 40 patches. 20 Timings in rnilliseconds regarded with respect to patch and parallel pro cessor system size (AFAC: black bars, FAC: white bars). 21 Timings for Load Balancing using MLB 22 Results for a singular perturbation problem: Plots of the solution, the error and the composite grid with two different choices of the accuracy 17 in the selfadaptive refinement process. xu 121 121 122 123 124 125
PAGE 13
TABLE 1 TABLES Convergence factors for Poisson's equation on simple composite grid, 2 Convergence factors for Poisson's equation on block structured composite grid. 3 Convergence factors for a singular perturbation problem and, for compari son, for Poisson's equation .. 4 5 Timings for simple AFAC on iPSC/2. Timings for Complex AFAC on iPSC/1. 6 Communication structure analysis of:I:
PAGE 14
CHAPTER 1 INTRODUCTION One of the aims in the development of efficient algorithms to solve partial differ ential equations (PDEs) is to allow the computational intensity to be proportional to the activity that the solution resolves. VVhat this means practically is that solutions are then obtained with minimum computation) which in turn allows for investigation of larger even more complicated problems) which in turn we want solved with minimum computational effort, and so on. It is the Hnite resources of even today's most modern computers that ensure termination of this recHrsion. Since, in a realistic application, this activity is nonuniformly distributed and localized in the problem space, the use of local refinement reduces computational complexity away frorn these localized regions and reduces the global computational work. More specifically, this use of local refinement allows for greater accuracy in the computation of large scale and so the solution is obtained more efficiently. The resulting computational mesh is called the composite grid (see figure 1). Since the requirement for such refinement is often only seen at runtirne as the solution evolves, such refinement often must be added selfadaptively. The use of selfadaptive refinement is most important with the use of rnore than just a few local refinement regions, since then the error of not adding refmement where required jeopardizes the effectiveness of additional levels of refinernent. 1.1 Algorithm Development A complicating feature in the development of these local refinement techniques is their introduction onto parallel computers. The nonuniform nature of the computational workload for a composite grid is in direct conflict with the general goal of efficient processor
PAGE 15
2 utilization. Classical methods of computing with local refinement grids, aside from exhibit ing slow convergence to the solution, require substantial synchronous intergrid transfers and processing of the solution between the composite grid levels. This means that, with a com posite grid load balanced onto multiple processors, there is substantial inefficiency due to this synchronous processing of grid levels and the nonuniform use of local refinen1ent throughout the problem domain. The alternative of partitioning each level across all processors is problematic since such partitionings greatly reduce the composite grid level's representation (size) in each processor on massively parallel architectures. Indeed) new algorithms for solving equations posed on composite grids have an opportunity to improve the existing perfonnance of the parallelized serial algorithms. Thus, meaningful work on parallel adaptive mesh refineme11t is not just a computer science issue, but one which con1bines the design of parallel algorithms with the development of better mathematical algorithrrlS, clearly crossing both disciplines almost equally. The existing fast adaptive composite grid method, FAC (see [40] and [39]) is a discretization and solution method for partial differential equations designed to achieve efficient local resolution by systematically constructing the discretization based on various regular grids and using them as a basis for fast solution. Using multigrid as the individual grid solver, FAC has been applied to a variety of fluid flow problems, including incompressible NavierStokes equations [36] in both two and three dimensions. A more recently developed variation of FAC designed for parallel c01nputers, the asynchronous fast adaptive composite grid method (AFAC), allows for independent processing of the individual regular grids that constitute the composite grid. 
This means that the cost of solving the independent grid equations may be fully shared by independent processing units. Af"AC is a method for adaptive grid discretization and solution of partial differential equations. Where FAC is forced to process the composite gr.id level synchronously) AFAC
PAGE 16
3 eliminates this bottleneck of parallelism. Through a simple mechanism used to reduce inter level dependence, individual refinement levels can be processed by AFAC in parallel. Coupled with rnultigrid, MG (see [10], [22], and [41]) for processing each constituent grid level, AFAC is usually able to solve the composite grid problem in a time proportional to what it would take to solve the global grid alone, with no added refinement. See Hart and McCormick [1 J for further details. Because of the way local grids add computational load irregularly to processors1 an efficient load balancer is an essential ingredient for implementing adaptive methods on distributed memory machines. The complexity of this process1 as well as the overall algorithm itself, depends intimately on how the processors are assigned to the computational tasks. The independence of the various refinement levels in the AFAC process allows these assignrnents to be made by level (in contrast to the usual domain decomposition approach), which greatly simplifies the associated load balancing of the composite grid. To balance loads between these levels, in this thesis we develop a new load balancing algorithm, Multilevel Load Balancing (MLB), which borrows heavily on the multilevel philosophy that guides most all of the work presented. Specifically, MLB is a load balancing strategy that addresses the different levels of disparity in the loads that are spread spatially across the multiprocessor system. The algorithm is intended for use with applications that change dynamically, as is the case in selfadaptive mesh refinement and time dependent applications. Other load balancers have been previously developed, but most often exhibit slow performance that limits their usefulness on anything but predominantly static applications. 
But even with the capability to solve composite grid problems efficiently in a parallel environment, there is still a large class of practical problems that are not sufficiently well addressed. The problems in this class are dynamic in nature: the movernent and time evolution and decay of regions of activity force continual and substantial manipulation of the resulting partitioned composite grid. Local refinement around, or upstream of, shocks1
PAGE 17
4 for example, often must be allowed to move (track the shock) so that the time dependent problem may be solved properly. Using conventional methods for assignment of work to processors, the result is an inefficient handling of this large class of problems because of how and where the data is localized in distributed memory. However, analysis of these inefficiencies leads to the sort of unconventional methods for assignment of work that can be processed efficiently in a parallel environment using AFAC and the partitioning strategy used in MLB. Clearly, the resulting moverr1ent of regions of local refinement that occurs in dynamic problems is not handled efficiently using conventional techniques (e.g., partitioning based on the division of the problem domain). In contrast, the resulting new methods for assignment of work allow for rnore efficient handling of this important class of dynamic problems in parallel environments. Our experiments with a variety of local refinement algorithms for the solution of the simple potential flow equation on parallel distributed memory architectures demonstrates that, with the conect choice of solvers, performance of local refinement codes shows no significant sign of degradation as more processors are used. Contrary to conventional wisdom, the fundamental techniques used in our adaptive mesh refinement methods do not oppose the requirements for efficient vectorization and paralleli:tation. ln fact, this research has shown that algorithms that are expensive on serial and vector architectures, but are highly parallelizable, can be superior on parallel architectures. Thus, parallelization capabilities play an important role in the choice of the best suited algorithm for a specific application. Chapter 2 details the development of AFAC and AFACx, including computational results of their use in a parallel environment (more results are found in sections 4.2 and 4.3). 
The results comparing FA() and AFAC appear in chapter 4, while results on the use of dynamic adaptivity for time dependent equations are presented in section 4.5. These results compare the relative computational costs of the two iterative algorithms, including the computational costs of adding and removing refinement regions ada.ptively and load
PAGE 18
I rl I ll ) J II II !. I I ! II \! 5 balancing the resulting composite grid using MLB. Section 3.2.6 describes MLB! a new multilevel load balancing algorithm for parallel adaptive refinerrtent algorithms. Its use is a central part of good performance on parallel architectures. Section 4.4 contains a cornparison of the parallel performances of FAC and AFAC for two sample composite grids. Section 2.2.3 introduces a new algorithm, AFACx, which improves on the AFAC algorithrn by expanding applicability to more complex block structured local refinement grids. The use of AFACx both simplifies the use of parallel adaptive mesh refinement and is more efficient than AFAC. There we present, the motivating theory for AFACx as well as a convergence proof. The next subsection illustrates the use of selfadaptivity by presenting an example problem where it is used. The fmal subsection addresses the complexity analyses of FAC, AFAC, and AFACx. In particular, we answer some problematic questions about how cornposite grids can be optimally partitioned for these three algorithms. 1.2 Software Develop1nent The work introduced in the algorithm design section was the product of two separate parallel adaptive refinement implementations. The purpose of this section is to introduce the chapters that detail some of the more practical aspects of these separate implementations. The implementation of parallel adaptive refinement, required for meaningful algorithm cleveloprnent, is a nontrivial issue that has historically limited further research in this important field. For the work in this thesis to be accomplished, the problem of practical development of parallel adaptive refinement software was addressed. The results forrn a significant part of this thesis because they so successfully resolve the requirements of selfadaptive mesh refinement for complex applications) even though the problems that are presented are simplistic in nature. 
Specifically, this thesis presents an objectoriented set of libraries in C++ (class libraries) that abstract the details of parallel adaptive refinement by separating the abstractions of parallelism and adaptive refinement from the user)s application code) which is permitted to define the application on only a uniform domain. It
is hoped that additional collaborative work will make the set of C++ class libraries more generally useful and available in the near future. Parallel adaptive mesh refinement demands a truly sophisticated level of software development, which is important since it strains the limits of what can be accomplished in software without massive support. Its use requires the development of many interrelated data structures to permit the flow of control between the solvers that are required for the implementation of the composite grid algorithms. Additionally, adaptive refinement is necessarily dynamic, since the composite grid is most often required to evolve at runtime. Because of the dynamic requirements of adaptive mesh refinement, FORTRAN, a static language, was not considered to be an option in the development of the final codes. Note that even static refinement under explicit user control would require dynamic memory management, since in the parallel environment data would have to be shuffled between processors in any load balancing scheme. The experience in this project was that FORTRAN is too outdated for use in the development of sophisticated parallel adaptive refinement software because of its inability to support the abstractions required to define algorithms independent of the location of the data that is manipulated. The concept of abstractions, and a presentation of how and why an algorithm should be expressed independent of the organization of its data, forms a basis for the P++ and AMR++ work that this thesis presents. We recognize that special FORTRAN memory management could have been developed to work around some or most of these problems, but such work would have layered already complex code on top of yet another layer of software with its own inherent restrictions. The first implementation was done in the language C, which permitted dynamic management of memory, a requirement of both static and adaptive refinement.
The experience with the working C version of the local refinement code is detailed in section 3.2. Although written modularly and using the best techniques in software design known at the time, the complexity of the parallelism in the adaptive refinement code could not be hidden, thus greatly complicating the resulting code. In our experience, the ratio of code supporting the details of adaptive refinement to code defining the single grid application is approximately 20:1 for serial and parallel environments (due to the proportionally increased complexity of the parallel environment). The principal complications were in the partitioning of the composite grid, the management of memory for data shuffled between processors in the load balancing of the composite grid, and the complex data structures that required access from all levels; all of these features were necessarily handled explicitly within the parallel adaptive refinement implementation. The C language version was completed with two separate implementations, one for serial computers and a second for parallel computers. Though acceptable for a research code, the requirement of separate serial and parallel codes limits the ability of parallel computing to address even more complex applications and, just as important, their maintenance in a commercial setting. The C language version of parallel adaptive mesh refinement was run on the iPSC/1, iPSC/2, iPSC/860, SUPRENUM, and nCUBE parallel machines using precisely the same code, so a degree of parallel machine independence was actually achieved. Due to the conventional way the code was developed, with a procedural language (FORTRAN is also a procedural language), these complexities combined, and in effect multiplied, to limit the ability to develop the sort of sophisticated application codes we need for solving complicated flow problems efficiently. A more sophisticated selfadaptive mesh refinement code would have made work even more difficult. Later work in C++, however, better allowed for the management of these difficult issues and permitted even greater sophistication. The consistent goal throughout was to present a greater degree of architecture independence in order to simplify the implementation further.
The experiences with parallel adaptive mesh refinement in the original implementation in C, and those with a much more complex local refinement hypersonic application developed (with Dinshaw Balsara) specifically for the serial environment, have motivated additional work to simplify the development of such numerical software for parallel distributed
memory computers. The second generation of the parallel adaptive refinement implementation has been done much differently. The problems of combined parallelism, adaptive refinement, and application specific code are too interrelated to permit significant advancement to more complex forms of adaptivity and applications. The solution to this software difficulty presents abstractions as a means of handling the combined complexities of adaptivity, mesh refinement, the application specific algorithm, and parallelism. The abstraction for adaptive refinement is represented by the definition of adaptive refinement in an object-oriented way that is independent of both explicit parallel details and specific application details. The abstraction of parallelism is to represent the lower level array operations in a way that is independent of parallelism (and adaptive refinement). These abstractions greatly simplify the development of algorithms and codes for complex applications. As an example, the abstraction of parallelism permits the development of application codes (necessarily based on parallel algorithms, as opposed to serial algorithms whose data and computation structures do not allow parallelization) in the simplified serial environment, and the same code can be executed in a massively parallel distributed memory environment. Since the codes require only a serial C++ compiler, we avoid the machine dependent restrictions of research projects involving parallel compilers.2 We attack the details of parallel adaptive mesh refinement software development by dividing the problem into three large parts: 1) the abstraction of parallelism using the C++ parallel array class library P++; 2) the abstraction of adaptive mesh refinement using the C++ serial adaptive mesh refinement class library AMR++; 3) the serial application specific single grid application, defined by the user.
The division into these parts serves to make the project smaller than it would otherwise be, since the development of large codes is inherently nonlinear. Additionally, each part is sufficiently separate to form the basis of other large software, so the pieces are substantially reusable. This sort of code reuse is a common feature of the C++ object-oriented language. Thus, we present a greatly simplified
2 However, such work toward parallel C++ compilers (notably CC++ and pC++) is important.
environment for the development of adaptive mesh refinement software in both the serial and parallel environments. More specifically, such work provides an architecture independent environment to support the development of more general complex software, which might be targeted for the parallel environment. We now summarize the individual parts.

P++ is a C++ parallel array class library that simplifies the development of efficient parallel programs for large scale scientific applications, while providing portability across the widest variety of computer architectures. The interface for P++ matches that of M++,3 a commercially available array class library for serial machines, so numerical applications developed in the serial environment may be recompiled, unchanged, to run in the parallel distributed memory environment. Although general in scope, P++ supports current research in parallel selfadaptive mesh refinement methods by providing parallel support for AMR++, a serial class library specific to selfadaptive mesh refinement. The P++ environment supports parallelism using a standard language, C++, with absolutely no modification of the compiler. For parallel communication, it employs existing widely portable communications libraries. Such an environment allows existing C++ language compilers to be used to develop software in the preferred serial environment, and such software to be efficiently run, unchanged, in all target environments.

AMR++ is a C++ adaptive mesh refinement class library that abstracts details of adaptive mesh refinement independent of the user's specific application code, which is used much like an input parameter to AMR++. AMR++ is written using the M++ serial array class library interface, and thus can be recompiled using P++ to execute in either the serial or parallel distributed memory environment. Thus, AMR++ is a serial application written at a higher level than P++, and specific to selfadaptive refinement.
Forming a serial adaptive mesh refinement code requires only a uniform grid solver specific to the user's application. If the user's application code uses the M++/P++ array interface, then the user's AMR++/application code can be recompiled to execute in the parallel environment.
3 M++ is a product of Dyad Software.
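To make the array-interface idea concrete, the following minimal sketch mimics the style of whole-array statements that the M++/P++ interface is described as supporting, using std::valarray as a stand-in. The class alias Array, the residual helper, and the 1-D model problem are illustrative assumptions, not the actual M++/P++ API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <valarray>

// Hypothetical stand-in for an M++/P++-style array class: whole-array
// expressions carry the data-parallel semantics.  In P++, the same
// source would be recompiled so that each expression operates on
// distributed data, with communication inserted by the library.
using Array = std::valarray<double>;

// Residual of the 1-D model problem -u'' = f at interior points,
// written as one array statement rather than an explicit loop.
Array residual(const Array& u, const Array& f, double h) {
    std::size_t n = u.size();
    Array r(0.0, n);  // boundary entries remain zero
    std::slice mid(1, n - 2, 1), left(0, n - 2, 1), right(2, n - 2, 1);
    r[mid] = Array(f[mid]) -
             (2.0 * Array(u[mid]) - Array(u[left]) - Array(u[right])) / (h * h);
    return r;
}
```

Because the computation is expressed as array statements rather than explicit loops, a parallel array class could, in the manner described for P++, redistribute the data and insert message passing without any change to this source.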
Thus, such abstractions as AMR++ and P++ greatly simplify the development of complex serial, and especially parallel, adaptive mesh refinement software. Chapter 3 presents the details of the first generation of the parallel adaptive refinement code and the problems that were solved and introduced by the combined complexity of adaptive refinement in the parallel environment. The work in this chapter motivated further work that simplified the object-oriented development of the later, more complex parallel codes. Additionally, it motivates the requirement for a superior language for development of the adaptive mesh refinement application, whether targeted for the serial or parallel environments. Chapter 3 also presents the object-oriented design of the second generation parallel adaptive mesh refinement implementation. Section 3.5 presents the P++ parallel array class library and section 3.6 presents the AMR++ class library, including details of the selfadaptive mesh refinement strategy.

1.3 Problems and Future Work

Problems with the current work are discussed in chapter 5, since it is these problems that the future work will attempt to address. Finally, chapter 6 presents some of the future work that might be done to expand the usefulness of adaptive refinement methods for parallel architectures. These include improvements and ideas for the composite grid algorithms and the object-oriented strategies that guide their practical implementation. Additional detail focuses on possible improvements to the parallel array class library P++ and the adaptive mesh refinement class library AMR++.
Figure 1: Example composite grid with five levels.
CHAPTER 2
PARALLEL ALGORITHM DESIGN (THEORY)

2.1 Overview of the FAC and AFAC Composite Grid Algorithms

Reasons for basing the solution process on a composite grid include: 1) uniform solvers and their discretizations are more easily defined for complex equations; 2) multigrid solvers, which are appropriate to use, are simpler and more efficient on uniform grids; and 3) iterative processes are most efficiently implemented for the case of structured uniform grids. In section 2.2, we introduce the existing fast adaptive composite grid method, FAC, and the asynchronous fast adaptive composite grid method, AFAC, and use them to motivate and define the details of AFACx, the new algorithm that this thesis presents. Convergence for the AFACx algorithm is proved in section 2.3. A different and more restricted analysis is contained in [43]. Both FAC and AFAC are multilevel methods for adaptive grid discretization and solution of partial differential equations. FAC has a sequential limitation in the processing of the individual levels of refinement, whereas AFAC has much better complexity in a parallel computing environment because it allows for simultaneous processing of all levels in the computationally dominant solution phase. Coupled with multigrid (MG) processing of each level and nested iteration on the composite grids, AFAC is usually able to solve the composite grid equations in a time proportional to what it would take to solve the global grid alone. See [22] and [41] for further details. Both FAC and AFAC consist of two basic steps that are described loosely as follows. Step 1. Given the solution approximation and composite grid residuals on each level, use MG to compute a correction local to that level (solving the error equation). Step 2. Combine
the local corrections with the global solution approximation, compute the global composite grid residual, and transfer the local components of the approximation and residual to each level. The difference between FAC and AFAC is in the order in which the levels are processed and in the details of how they are combined. Convergence theory in [40] shows that the FAC and AFAC asymptotic convergence factors satisfy the relation $|||AFAC||| = |||FAC|||^{1/2}$, where $AFAC$ and $FAC$ are the error propagation operators for AFAC and FAC, respectively, and $|||\cdot|||$ is the composite grid energy norm ($|||u^c||| = \langle L^c u^c, u^c \rangle^{1/2}$, where $\langle\cdot,\cdot\rangle$ denotes the $L^2$ inner product and $L^c$ the composite grid operator). Although the theory is restricted to the two-level composite grid, experimental results have verified this relation even on a very large number of levels (a specific test verified this property on a 50-level composite grid1). Though the algorithmic components in our code are chosen slightly differently than for the convergence analysis, experience shows that very similar behavior is obtained. This implies that two cycles of AFAC are roughly equivalent to one cycle of FAC, since $|||AFAC^2 e^c||| \le |||AFAC|||^2\, |||e^c||| = |||FAC|||\, |||e^c|||$ for any error $e^c$.

2.2 Notation and Definition of FAC, AFAC, and AFACx Algorithms

To define these algorithms, we present the following problem and notation. We begin by posing the weak form of the continuous equation and its discretization. Let $H_0^1(\Omega^0)$ be the usual Sobolev space of real-valued functions on the problem domain $\Omega^0$ that have zero values (in the usual sense) on the boundary, $\partial\Omega^0$, of $\Omega^0$. Assume that $a(\cdot,\cdot)$ is a generic, real-valued, bilinear, symmetric, bounded, and coercive form on $H_0^1(\Omega^0) \times H_0^1(\Omega^0)$, and that $f(\cdot)$ is a generic real-valued functional on $H_0^1(\Omega^0)$. The weak form is then to find $u \in H_0^1(\Omega^0)$ such that $a(u,v) = f(v)$ for all $v \in H_0^1(\Omega^0)$.
1 The actual solution of such a 50-level composite grid problem implies greater precision than double precision grid values permit, but this is an issue of machine accuracy.
To discretize this equation, let $\mathcal{T}^0$ be a regular finite element partition of $\Omega^0$ (e.g., triangles or cells) and let $V^0 \subset H_0^1(\Omega^0)$ be the associated finite element space (e.g., piecewise linear functions on triangles or piecewise bilinear functions for cells). Let $\ell \ge 1$ denote the number of subregions of $\Omega^0$ (in the discrete problem, which we will develop, $\ell$ will define the number of composite grid levels), which we assume are nested: $\Omega^0 \supset \Omega^1 \supset \cdots \supset \Omega^\ell$. For each $k \in \{1,2,\ldots,\ell\}$, let $\mathcal{T}^k$ be a regular finite element partition of $\Omega^k$ constructed by appropriately refining the elements of $\mathcal{T}^{k-1}$, and define $V^k \subset H_0^1(\Omega^k)$ as the finite element space associated with $\mathcal{T}^k$. Note that functions in $V^k$ are zero on $\partial\Omega^k$. Define $W^k = V^{k-1} \cap V^k$, which is a coarse grid subspace of $V^k$. Note that $W^k$ is similar to $V^{k-1}$ except that it is local (if $\Omega^k$ is a proper subregion of $\Omega^{k-1}$). For convenience, let $W^0 = \{0\}$. Note that the refinement has been constructed so that it is conforming in the sense that $W^k = V^{k-1} \cap H_0^1(\Omega^k)$. We will refer to $W^k$ as the restricted local refinement space for the local refinement region $\Omega^k$. Define $\hat I^k : W^k \to V^k$ as the natural embedding (interpolation) operator. Now define the composite grid by
$$\Omega^c = \bigcup_{k=0}^{\ell} \Omega^k$$
and its associated space by
$$V^c = \sum_{k=0}^{\ell} V^k.$$
Then the discrete problem we solve is the composite grid Galerkin discretization, using the composite grid space $V^c$: find $u^c \in V^c$ such that
$$a(u^c, v) = f(v), \quad \forall v \in V^c.$$
Let $L^k : V^k \to V^k$ be the discrete operator on grid space $V^k$ determined by $a$: $a(u^k, v^k) = \langle L^k u^k, v^k \rangle$, $\forall v^k \in V^k$, where $\langle\cdot,\cdot\rangle$ denotes the $L^2(\Omega^0)$ inner product. Note
that $L^0$ is an approximation to the differential operator on the global region $\Omega^0$. Let $I_k : V^k \to V^c$ and $I^k : V^c \to V^k$ be given interlevel transfer operators (interpolation and restriction, respectively, defined by the finite element formulation). Note that $I_k$ is the natural embedding operator and $I^k$ is its adjoint. Finally, consider the restricted grids $\bar\Omega^k = \Omega^{k-1} \cap \Omega^k$ and their associated spaces $\bar V^k = W^k$ and operators $\bar L^k$, $\bar I_k$, and $\bar I^k$, and let $\bar u^k$ denote the restricted grid solution, $1 \le k \le \ell$. (For AFAC and AFACx, these restricted (intermediate) grids and their operators are designed to remove error components that are common to both fine and coarser levels and that would otherwise prevent simultaneous processing of the levels.)

2.2.1 FAC Algorithm

FAC (see [40] and [39]) is a discretization and solution method for partial differential equations designed to achieve efficient local resolution by systematically constructing the discretization based on various regular grids and using them as a basis for fast solution. Using multigrid as the individual grid solver, FAC has been applied to a variety of fluid flow problems, including the incompressible Navier-Stokes equations [36] in both two and three dimensions. Loosely speaking, one FAC iteration consists of the following basic steps: Step 1. For all $k \in \{0,1,\ldots,\ell\}$, compute $f^k$ by transferring the composite grid residual to $\Omega^k$. Step 2. Set $k = 0$ (so that we start the computation on $\Omega^0$) and the initial guess on $\Omega^k$ to zero. Step 3. Given the initial guess and composite grid residuals on level $k$, use multigrid (or, alternatively, any direct or iterative solver) to compute a correction local to that level; that is, "solve" the error equation that results from the use of the residual assigned to $f^k$ on $\Omega^k$. Step 4.
If $k < \ell$, then: interpolate the "solution" (resulting from step 3) at the interface of levels $\Omega^k$ and $\Omega^{k+1}$ to supply $\Omega^{k+1}$ with complete boundary conditions, so that its correction equation is properly posed; interpolate it also to (the interior of)
$\Omega^{k+1}$ to act as the initial guess; set $k \leftarrow k+1$; and go to step 3. If $k = \ell$, interpolate all corrections (i.e., "solutions" of each level's projected composite grid residual equations) from the finest level in each region (i.e., $\Omega^k \setminus \Omega^{k+1}$) to the composite grid. To be more specific, in addition to the above notation, let $I_k^{k+1} : V^k \to V^{k+1}$ denote the mapping that interpolates values from level $k$ on the interface (i.e., the boundary of $\Omega^{k+1}$ that does not coincide with the boundary of $\Omega^0$). Note that the computation of the composite grid residual equations at the interface is covered in detail in [39]. Given the composite grid right-hand side $f^c$ and initial approximation $u^c$, one iteration of FAC (see McCormick [39] for motivation and further detail) is defined more concretely as follows (we show here the direct solver version for simplicity, which requires no initial guess on $\Omega^k$): Step 1. For all $k \in \{0,1,\ldots,\ell\}$, set $f^k = I^k(f^c - L^c u^c)$. Step 2. Set $k = 0$. Step 3. Compute $u^k = (L^k)^{-1} f^k$. Step 4. If $k < \ell$, set $u^{k+1} = I_k^{k+1} u^k$ on $\Omega^{k+1}$, set $k \leftarrow k+1$, and go to step 3. If $k = \ell$, form $u^c = I_k u^k$ on $\Omega^k \setminus \Omega^{k+1}$ for each $k$ (where $\Omega^{\ell+1} = \emptyset$).

2.2.2 AFAC Algorithm

AFAC ([22] and [41]) is a multilevel method for adaptive grid discretization and solution of partial differential equations. AFAC appears to have near optimal complexity in a parallel computing environment because it allows for simultaneous processing of all levels of refinement. This is important because the solution process on each grid, even with the best solvers, dominates the computational intensity. This is especially true for systems of equations, where the solution process is even much more computationally intensive than the evaluation of the residuals. Coupled with multigrid processing of each level and nested iteration [39] on the composite grids, AFAC is usually able to solve the composite grid equations in a time proportional to what it would take to solve the global grid alone.
See Hart and McCormick [22] and McCormick [40] for further details.
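The scheduling difference between the two methods can be sketched as follows. The Level struct and the stubbed-out solves are illustrative assumptions; the point is only the dependence structure: FAC's interface interpolation forces the levels into sequential order, while AFAC imposes no ordering on the level solves.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal sketch of level scheduling.  "Solving" a level is stubbed
// out; what matters is the dependence structure between levels.
struct Level {
    int k;
    bool solved = false;
};

// FAC: level k+1 needs boundary conditions interpolated from the
// solution on level k, so level k must be finished first.
std::vector<int> facSchedule(std::vector<Level>& levels) {
    std::vector<int> order;
    for (std::size_t k = 0; k < levels.size(); ++k) {
        if (k > 0) assert(levels[k - 1].solved);  // sequential bottleneck
        levels[k].solved = true;                  // stub for the MG solve
        order.push_back(levels[k].k);
    }
    return order;
}

// AFAC: every level (with its restricted grid) is independent, so any
// processing order -- in particular, simultaneous -- is legal.
std::vector<int> afacSchedule(std::vector<Level>& levels,
                              const std::vector<int>& anyOrder) {
    std::vector<int> order;
    for (int k : anyOrder) {          // no inter-level dependence asserted
        levels[k].solved = true;      // stub for the relaxation/MG solve
        order.push_back(k);
    }
    return order;
}
```

In a parallel implementation, the absence of the assertion in the AFAC case is what allows the level solves, which dominate the computation, to be distributed across processors and run concurrently.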
The principal step is the computation of an approximation to the oscillatory component of the solution on each composite grid level. To simplify the explanation, we define $\underline{u}^k = I_k u^k - \bar I_k \bar u^k$ to be the "oscillatory component" of the solution $u^k$. Loosely speaking, one AFAC iteration consists of the following basic steps: Step 1. Compute $f^k$ for all $k \in \{0,1,\ldots,\ell\}$ by transferring the composite grid residual to $\Omega^k$, and similarly $\bar f^k$ for all $k \in \{1,2,\ldots,\ell\}$. Step 2. Set the initial guess to zero on $\Omega^k$ for all $k \in \{0,1,\ldots,\ell\}$, and similarly on $\bar\Omega^k$ for all $k \in \{1,2,\ldots,\ell\}$. Step 3. For all grid levels $\Omega^k$ ($k \in \{0,1,\ldots,\ell\}$): Substep 3a. Use multigrid (or, alternatively, any direct or fast iterative solver) to compute a correction local to that level, that is, "solve" the equation that results from the use of $f^k$ on $\Omega^k$ and $\bar f^k$ on $\bar\Omega^k$ ($k > 0$). Substep 3b. Subtract the restricted grid "solution" from the local grid "solution." This forms the "oscillatory components." Step 4. Interpolate and add the "oscillatory components" on $\Omega^k$ for all $k \in \{0,1,\ldots,\ell\}$ to all finer composite levels. To be more specific, given the composite grid right-hand side $f^c$ and initial approximation $u^c$, one iteration of AFAC (see McCormick [39] for motivation and further detail) is defined more concretely as follows (we show here the direct solver version for simplicity, which again needs no initial guesses on $\Omega^k$ or $\bar\Omega^k$): Step 1. For all $k \in \{0,1,\ldots,\ell\}$, set $f^k = I^k(f^c - L^c u^c)$ and (for $k > 0$) $\bar f^k = \bar I^k(f^c - L^c u^c)$. Step 2. For all $k \in \{0,1,\ldots,\ell\}$, set $u^k = 0$ and (for $k > 0$) $\bar u^k = 0$. Step 3. For all $k \in \{0,1,\ldots,\ell\}$: Substep 3a. Compute $u^k = (L^k)^{-1} f^k$ and (for $k > 0$) $\bar u^k = (\bar L^k)^{-1} \bar f^k$. Substep 3b. Set $\underline{u}^k = I_k u^k - \bar I_k \bar u^k$ for $k \in \{1,2,\ldots,\ell\}$.
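The effect of substep 3b can be illustrated on a 1-D patch. Linear interpolation and the sample data below are illustrative assumptions: any component that is representable on the restricted (coarser) grid cancels in the subtraction, leaving only the oscillatory contribution unique to the level.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Linear interpolation from a coarse vector (spacing 2h) to a fine
// vector (spacing h); coarse point j sits at fine point 2j.
std::vector<double> interpolate(const std::vector<double>& coarse) {
    std::vector<double> fine(2 * coarse.size() - 1);
    for (std::size_t j = 0; j < coarse.size(); ++j) fine[2 * j] = coarse[j];
    for (std::size_t i = 1; i < fine.size(); i += 2)
        fine[i] = 0.5 * (fine[i - 1] + fine[i + 1]);
    return fine;
}

// Oscillatory component: fine-grid values minus the interpolated
// restricted-grid values (the analogue of I_k u^k - \bar I_k \bar u^k).
std::vector<double> oscillatoryPart(const std::vector<double>& fine,
                                    const std::vector<double>& coarse) {
    std::vector<double> interp = interpolate(coarse), osc(fine.size());
    for (std::size_t i = 0; i < fine.size(); ++i)
        osc[i] = fine[i] - interp[i];
    return osc;
}
```

In the test below, the fine-grid data is a linear (smooth) profile plus a bump of 0.3 at the odd points; the restricted grid sees only the smooth part, so the subtraction recovers exactly the bump, which is the component that can safely be interpolated and added to finer levels.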
The processing of each step of AFAC is fully parallelizable. For example, the levels in step 3 can be processed simultaneously by a MG solver, which is itself parallelizable [10]. The present version of the code uses synchronization between steps 3 and 4, although asynchronous processing would be allowed here. (With an efficient load balancing scheme, asynchronous processing of the levels provides little real advantage.) A more complete derivation of AFAC, along with a convergence proof and related theory, can be found in McCormick [40].

2.2.3 AFACx Algorithm

AFACx is a new algorithm, which this thesis presents and analyzes as its principal mathematical development. The motivation for AFACx is in the use of adaptive refinement for problems with complex internal regions demanding additional, but local, resolution. Such complex internal regions are found around, and upstream of, shocks and complex shock structures.2 In this section, since the AFACx algorithm is new, we present an expanded description, which includes the principal motivation for its development and use.

2.2.3.1 AFACx Motivation

The use of adaptive refinement for problems with geometrically complex regions of activity requires more than simple rectangular refinement. In such problems, local refinement strategies must cover the target regions with nonregular meshes, or collections of regular rectangular grids that combine to conform to the nonregular regions. In the latter approach, using collections of regular rectangular grids, the efficiency and simplified implementation of complex application codes can be restricted to the more conventional setting,3 where good efficiency on the structured rectangular grids is assured.
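A minimal sketch of such a collection of regular rectangular blocks follows; the struct names and the L-shaped example are hypothetical illustrations, not structures from the thesis code.

```cpp
#include <cassert>
#include <vector>

// A block structured refinement region: a union of axis-aligned
// rectangular blocks, each of which is itself a simple structured grid.
struct Block {  // covers index range [i0,i1) x [j0,j1)
    int i0, i1, j0, j1;
    bool contains(int i, int j) const {
        return i0 <= i && i < i1 && j0 <= j && j < j1;
    }
};

struct BlockStructuredRegion {
    std::vector<Block> blocks;
    // A point is refined if any regular block covers it, so the union
    // can conform to a nonregular region of activity.
    bool contains(int i, int j) const {
        for (const Block& b : blocks)
            if (b.contains(i, j)) return true;
        return false;
    }
};
```

For example, an L-shaped region around a reentrant corner is covered exactly by two blocks; each block keeps the efficiency of a structured rectangular grid even though their union is nonregular.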
For explicit problems, the details of handling the resulting block structured local refinement regions are an issue only at the interfaces; the properties of the algorithm (stability,
2 The motivating example problem for this work has been the simulation of enhanced mixing (such as flows passing through oscillating shock structures) in support of the National Aerospace Plane Project (NASP). However, the extension of this work to that problem is not complete.
3 In section 3.6, we show that by using the AMR++ class library, only the uniform grid solver need be defined by the user.
convergence, etc.) are not as much an issue as they are for implicit problems.4 With implicit equations, the handling of block structured grids is more problematic, since the solution of the individual blocks is not sufficient for the solution of the block structured grid. More complex methods are typically required, but rarely applied, to also resolve the smoother components that span the collection of block structured grids; if only the blocks are processed with iterative solvers, the smooth errors (across blocks) are poorly damped and result in poor convergence factors for the block structured solution. It would be sufficient to solve the block structured local refinement grid problems directly, but this is prohibitively expensive. Alternatively, we could define a multigrid set of coarsenings (of the blocks and the structure of blocks), which would permit fast multigrid solution of such block structured refinement regions. But the automated coarsening of the block structure (beyond that of simply the coarsened blocks themselves) is difficult to implement, and the solvers abstracted to work on such coarsenings are inefficient.5 For implicit solvers, the block structured solution is required, on the block structured grid, if we intend to use AFAC, since the formulation requires an approximate solution on the refinement patch. FAC could be simplified to use relaxation on the block structured regions starting from the solution interpolated from the coarser level (the global grid in the two-level composite grid case).6 The use of relaxation avoids the complication of constructing the block structured coarsenings that a multigrid solver would require for the block structured refinement region. We seek a similar efficiency, but with AFAC, so that the composite grid levels can be processed asynchronously.
AFACx is just such an algorithm, since it requires no definition of the coarsened block structured local refinement regions and uses only relaxation on the predefined block structured refinement region.
4 However, the details of these explicit algorithms at each grid point can be more complex; e.g., for PPM and ENO methods for Euler equations, the Riemann solvers are more sophisticated than most relaxation methods used in the implicit MG solvers.
5 A substantial amount of work was done on this approach, and its failure motivated the approach that was taken and that led to AMR++ (see section 3.6).
6 This version (variation) of FAC is introduced as FACx in section 4.2.
AFACx uses only the predefined block structured region and requires no construction of coarsening, even for the individual blocks, but still preserves the multilevel efficiency of AFAC7 and processes all levels asynchronously. The predefined block structured grid includes the finest level of the local refinement region and the grid built consistent with the flagged points on the coarser composite grid level, which were used to generate the block structured local refinement region. In practice it is easier, and equivalent, to let each block define a single coarsening. This coarsening of each block is guaranteed to exist because the local refinement block was derived from flagging points on the coarser composite grid level (the coarse grid points of the block structured refinement patch). Thus, AFACx uses the finest level of the block structured refinement grid (the block structured refinement grid itself), and a single coarser level (which we have shown is predefined, since it corresponds to the flagged points that were used initially to build the finer local refinement level). Because AFACx avoids processing the coarser levels on each block, it is cheaper than AFAC, though the difference is only in the processing of the coarser levels, and so it is not very significant in a serial environment.8 However, in a parallel environment, the avoidance of processing coarser levels means substantially less message traffic in the multiprocessor network, and a higher parallel efficiency for the overall method. Since, in the context of adaptive mesh refinement, the local regions are sized according to demand, it is likely that such refinement blocks would not be sufficiently large to adequately hide the communication overhead of processing the coarsest levels.
Thus, by avoiding the coarsening altogether, we avoid a potentially significant overhead in the parallel environment, as well as the complicated construction of the block structured coarsening that would be required of a completely general block structured grid. Finally, the use of relaxation on the block structured grids is what makes AFACx, and the analogous variant of FAC, attractive. This is because the user defined relaxation
7 The convergence factors are observed to be within 3% of that of AFAC.
8 The relatively inefficient processing of short vectors is also avoided in the vector environment.
(which is assumed to be parallelizable) is easily supported on the block structured grids. Then the process of exploiting the parallelism across blocks, on the block structured grid, is equivalent to that of exploiting the parallelism across multiple processors, on the partitioned grid. Thus, in the support for the block structured grid, interface boundaries are copied in much the same way that messages are passed between processors. These details are hidden from the user in the AMR++ class library in the same way that the message passing is hidden from the user in the P++ class library; see chapter 3, sections 3.6 and 3.5, respectively.

2.2.3.2 AFACx Definition

In the case of AFAC, we can attempt to understand it by considering the use of exact solvers on the composite, local refinement, and restricted grids. An important step in AFAC is the elimination of the common components between the coarser composite grid level and the finer local refinement grid patch. This step allows us to avoid the amplification of these components (inherently smooth components, since only they are represented on, or shared by, both the local refinement and coarser levels) in the interpolation and addition of the solution from the coarser levels up through the finer composite grid levels, finally forming the composite grid approximation. The result in AFAC, after this important step, on each level, is an approximation to the oscillatory contribution to the composite grid solution, which is unique to that level. AFAC and AFACx differ only in the way that this oscillatory contribution is computed: AFAC uses exact or fast approximate solvers, and AFACx uses only relaxation (typically one or two sweeps). In order to differentiate between the individual relaxation iterates, we will use subscripts: $u_n^k$ is the $n$th iterate approximating the solution $u^k$ on the $k$th grid level $\Omega^k$. Actually, we use only one iteration on $\Omega^k$, but allow for more on $\bar\Omega^k$.
Let $\rho(L^k)$ be the spectral radius of the discrete differential operator $L^k$. Then Richardson's iteration, which we will use throughout our analysis, is given by
$$u_{n+1}^k = u_n^k + \frac{1}{\rho(L^k)}\left(f^k - L^k u_n^k\right).$$
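A numerical sketch of Richardson's iteration for the 1-D model operator follows. Using the upper bound $4/h^2$ in place of the exact spectral radius is an assumption of this sketch; the iteration still contracts the residual because the step size stays below $2/\rho(L)$.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// 1-D model operator L = (1/h^2) tridiag(-1, 2, -1) with zero
// Dirichlet values at the endpoints of the vector.
std::vector<double> applyL(const std::vector<double>& u, double h) {
    std::vector<double> Lu(u.size(), 0.0);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        Lu[i] = (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h);
    return Lu;
}

double residualNorm(const std::vector<double>& u,
                    const std::vector<double>& f, double h) {
    std::vector<double> Lu = applyL(u, h);
    double s = 0.0;
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        s += (f[i] - Lu[i]) * (f[i] - Lu[i]);
    return std::sqrt(s);
}

// One Richardson sweep: u <- u + (1/rho)(f - L u), where rho(L) is
// replaced by its upper bound 4/h^2, so 1/rho becomes h^2/4.
void richardsonSweep(std::vector<double>& u,
                     const std::vector<double>& f, double h) {
    double invRho = (h * h) / 4.0;
    std::vector<double> Lu = applyL(u, h);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        u[i] += invRho * (f[i] - Lu[i]);
}
```

Each sweep multiplies the residual by $I - \tau L$ with $\tau = h^2/4$, a symmetric matrix of norm less than one for this operator, so the residual norm decreases monotonically, although slowly without the multilevel acceleration that FAC, AFAC, and AFACx provide.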
The same iteration is applied on $\bar\Omega^k$ (with $\bar L^k$ and $\bar f^k$). To simplify the explanation below, we define $\underline{u}_1^k = I_k u_1^k - \bar I_k \bar u_n^k$ to be the "oscillatory component" of the first iterate $u_1^k$. An important aspect of AFACx is that the restricted grid relaxation iterate is computed first and the initial guess for $\Omega^k$ is interpolated from the iterate $\bar u_n^k$ on $\bar\Omega^k$ (i.e., $u_0^k = \hat I^k \bar u_n^k$). We can now define AFACx more precisely. Given the composite grid right-hand side $f^c$ and initial approximation $u^c$, one cycle of AFACx based on one relaxation sweep per level $\Omega^k$ and $n$ sweeps per level $\bar\Omega^k$ is given by the following: Step 1. For all $k \in \{0,1,\ldots,\ell\}$, set $f^k = I^k(f^c - L^c u^c)$ and (for $k > 0$) $\bar f^k = \bar I^k(f^c - L^c u^c)$. Step 2. For all $k \in \{1,2,\ldots,\ell\}$, set $\bar u_0^k = 0$. Step 3. For all $k \in \{0,1,\ldots,\ell\}$: Substep 3a. Set $\bar u_n^k$ equal to the result of $n$ relaxation sweeps on $\bar\Omega^k$, starting from $\bar u_0^k$ with right-hand side $\bar f^k$, for $k > 0$, and $\bar u_n^k = 0$ for $k = 0$; then set $u_0^k = \hat I^k \bar u_n^k$. Substep 3b. Compute the single relaxation sweep $u_1^k$ on $\Omega^k$ from the initial guess $u_0^k$ and set $\underline{u}_1^k = I_k u_1^k - \bar I_k \bar u_n^k$. Notice that step 3 uses only relaxation, and that the fine grid initial guess is the restricted grid approximation interpolated to $\Omega^k$, namely, $u_0^k = \hat I^k \bar u_n^k$. All steps except step 3 are the same as in AFAC. In the next section, the connections between AFACx and AFAC are further clarified, since the convergence proof of AFACx relies on convergence theory developed for AFAC.

2.3 AFACx Convergence Theory

We prove the convergence of AFACx in several steps. The basic idea for the development is to use existing AFAC theory [40] and establish that the AFACx convergence factor is at most only slightly larger than that of AFAC. As with the AFAC theory (McCormick [40]), we develop this in the restricted setting of a two-level composite grid problem. To simplify notation in the case of a two-level composite grid, we introduce the
fine grid patch $\Omega^h = \Omega_1$ and its restricted grid $\Omega^{2h} = \bar{\Omega}_1$. In a similar way, we define the following. The exact discrete solution on $\Omega^h$ is denoted by $u^h = u_1$ and on $\Omega^{2h}$ by $u^{2h} = \bar{u}_1$. Further, we use the subscript $n$ to denote the $n$th iterate of a relaxation step; thus, $u_n^{2h}$ is the $n$th iterate of the relaxation operator on $\Omega^{2h}$, starting from $u_0^{2h}$. The interpolation operator from $\Omega^{2h}$ to $\Omega^c$ is denoted by $I_{2h}^c$ and the restriction operator from $\Omega^c$ to $\Omega^{2h}$ by $I_c^{2h}$; similarly, interpolation and restriction between $\Omega^{2h}$ and $\Omega^h$ are denoted by $I_{2h}^h$ and $I_h^{2h}$, respectively. The fine grid discrete differential operator on $\Omega^h$ is denoted by $L^h := L_1$, and the restricted grid discrete differential operator on $\Omega^{2h}$ by $L^{2h} := \bar{L}_1$.

To support the comparison of AFACx and AFAC, we require a specific version of AFAC that uses a combination of exact and iterative solvers. To this end, we define $\widehat{AFAC}$ as the usual AFAC method defined in section 2.2.2, except that the exact or multigrid solver on $\Omega^h$ is replaced by one relaxation sweep starting from the initial guess $u_0^h = I_{2h}^h u^{2h}$. Note that $u^{2h}$ is the exact solution of the discrete problem on the restricted grid $\Omega^{2h}$. We will also need the approximate solver version $AFAC_\epsilon$, in which the exact solvers on $\Omega^{2h}$ and $\Omega^h$ are replaced by approximate solvers based on operators $M^{2h} \approx (L^{2h})^{-1}$ and $M^h \approx (L^h)^{-1}$ that satisfy the accuracy conditions of [40]. That is, the exact solutions on $\Omega^{2h}$ and on $\Omega^h$ are replaced by the iterates
$$u^{2h} = M^{2h} I_c^{2h}\left(f^c - L^c u^c\right)$$
and
$$u^h = I_{2h}^h u^{2h} + M^h\left(I_c^h\left(f^c - L^c u^c\right) - L^h I_{2h}^h u^{2h}\right),$$
respectively.
Note that, in the case of Richardson's iteration, $M^{2h} = \rho(L^{2h})^{-1} I$ and $M^h = \rho(L^h)^{-1} I$. Finally, let $\beta > 0$ and $\delta > 0$ be the quantities defined on pages 110 and 118 of [40]; they are typically bounded uniformly in $h$, depending on the application. In this section, we assume that $L^c$ is symmetric and positive definite and that the following variational conditions hold:
$$L^{2h} = I_c^{2h} L^c I_{2h}^c, \qquad L^h = I_c^h L^c I_h^c,$$
and
$$I_c^{2h} = c_{2h}\,(I_{2h}^c)^T, \qquad I_c^h = c_h\,(I_h^c)^T,$$
where $c_{2h}$ and $c_h$ are positive constants. Assume further that $I_{2h}^c$ and $I_h^c$ are full rank, so that $L^{2h}$ and $L^h$ are (symmetric) positive definite.

Theorem 1. The spectral radii of the two-level exact solver versions of AFAC and FAC satisfy the relation $\rho(AFAC) = \rho(FAC)^{1/2}$. Thus, with $|||\cdot|||$ denoting the composite grid energy norm ($|||u^c||| = (L^c u^c, u^c)^{1/2}$), we have $|||AFAC||| \le \delta^{1/2}$. The convergence factor for the approximate solver version, $AFAC_\epsilon$, satisfies
$$|||AFAC_\epsilon||| \le (1 - \epsilon)\,|||AFAC||| + \epsilon.$$

Proof: See McCormick [40], page 144.

Lemma 1. $\widehat{AFAC}$ converges with factor bounded according to
$$|||\widehat{AFAC}||| \le (1 - \epsilon)\,|||AFAC||| + \epsilon < 1,$$
where $\epsilon < 1$ depends only on the uniformly bounded quantities $\beta$ and $\delta$.
Proof: First consider the case $n = 1$. The inequality on page 118 of [40] shows that one relaxation sweep with $M^h = \rho(L^h)^{-1} I > 0$ (Richardson iteration) acts as an approximate solver, in the sense of Theorem 1, for the initial error $e^h$ on $\Omega^h$. The lemma then follows from the estimate for $AFAC_\epsilon$ in Theorem 1. The general case follows by noting that additional relaxations on level $h$ in $\widehat{AFAC}$ cannot increase the energy norm of the error. Q.E.D.

$\widehat{AFAC}$ and AFACx differ by a perturbation term that can be expressed in terms of the operator
$$P_n^h = -\frac{1}{\rho(L^h)}\, L^h I_{2h}^h \left(I - \frac{1}{\rho(L^{2h})} L^{2h}\right)^n. \tag{2}$$

Lemma 2. The error propagation operators of AFACx and $\widehat{AFAC}$ are related according to
$$AFACx\, e^c = \widehat{AFAC}\, e^c + I_h^c P_n^h e_0^{2h},$$
where $e_0^{2h}$ is the initial error on level $2h$.

Proof: First consider the case $n = 1$. Following the definition of AFACx, the iterate $u_1^{2h}$ on $\Omega^{2h}$ using the initial guess $u_0^{2h}$ is given by
$$u_1^{2h} = u_0^{2h} - \frac{1}{\rho(L^{2h})}\left(L^{2h} u_0^{2h} - f^{2h}\right).$$
Then, as defined, the initial guess for the subsequent relaxation step on $\Omega^h$ is just $u_0^h = I_{2h}^h u_1^{2h}$. So the iterate on $\Omega^h$ is computed as
$$u_1^h = u_0^h - \frac{1}{\rho(L^h)}\left(L^h u_0^h - f^h\right).$$
Then, by substitution,
$$u_1^h = I_{2h}^h\left[u_0^{2h} - \frac{1}{\rho(L^{2h})}\left(L^{2h} u_0^{2h} - f^{2h}\right)\right] - \frac{1}{\rho(L^h)}\left(L^h I_{2h}^h\left[u_0^{2h} - \frac{1}{\rho(L^{2h})}\left(L^{2h} u_0^{2h} - f^{2h}\right)\right] - f^h\right).$$

Consider the splitting of $u^h$ into its energy-orthogonal components: $u^h = I_{2h}^h u^{2h} + t^h$, where $I_{2h}^h u^{2h}$ is a level $2h$ component and $I_h^{2h} L^h t^h = 0$. Then,
$$u_1^h = I_{2h}^h\left[u_0^{2h} - \frac{1}{\rho(L^{2h})} L^{2h} e_0^{2h}\right] - \frac{1}{\rho(L^h)}\left(L^h I_{2h}^h\left[u_0^{2h} - \frac{1}{\rho(L^{2h})} L^{2h} e_0^{2h}\right] - L^h\left(I_{2h}^h u^{2h} + t^h\right)\right),$$
where $e_0^{2h} = u_0^{2h} - u^{2h}$. We thus have
$$u_1^h = I_{2h}^h\left[u_0^{2h} - \frac{1}{\rho(L^{2h})} L^{2h} e_0^{2h}\right] - \frac{1}{\rho(L^h)} L^h I_{2h}^h\left[e_0^{2h} - \frac{1}{\rho(L^{2h})} L^{2h} e_0^{2h}\right] + \frac{1}{\rho(L^h)} L^h t^h. \tag{3}$$

Now, following the definition of $\widehat{AFAC}$, its iterate, which uses as its initial guess the exact solution from grid $\Omega^{2h}$ (namely, $u_0^h = I_{2h}^h u^{2h}$ on the fine grid patch), is given by
$$u_1^h = I_{2h}^h u^{2h} + \frac{1}{\rho(L^h)} L^h t^h. \tag{4}$$

The final processing step on $\Omega^h$ subtracts from the final iterate the interpolated $2h$ approximation, which is just $I_{2h}^h\left[u_0^{2h} - \frac{1}{\rho(L^{2h})} L^{2h} e_0^{2h}\right]$ for the case of AFACx and $I_{2h}^h u^{2h}$ for the case of $\widehat{AFAC}$. We represent the solution after the subtraction of the restricted grid solution for AFACx and $\widehat{AFAC}$ by $\check{u}^h$ and $\hat{u}^h$, respectively:
$$\check{u}^h = -\frac{1}{\rho(L^h)} L^h I_{2h}^h\left[I - \frac{1}{\rho(L^{2h})} L^{2h}\right] e_0^{2h} + \frac{1}{\rho(L^h)} L^h t^h \tag{5}$$
and
$$\hat{u}^h = \frac{1}{\rho(L^h)} L^h t^h. \tag{6}$$
The definition of $P_n^h$ for $n = 1$ then shows that
$$\check{u}^h - \hat{u}^h = P_1^h e_0^{2h}, \tag{7}$$
from which the desired equality follows.

Assuming now the more general case of $n \ge 1$ relaxation steps, it is easy to verify that the approximations after the final processing step satisfy
$$\check{u}^h = -\frac{1}{\rho(L^h)} L^h I_{2h}^h\left[I - \frac{1}{\rho(L^{2h})} L^{2h}\right]^n e_0^{2h} + \frac{1}{\rho(L^h)} L^h t^h \tag{8}$$
and
$$\hat{u}^h = \frac{1}{\rho(L^h)} L^h t^h. \tag{9}$$
Again, the desired equality follows from the definition of $P_n^h$. Q.E.D.

$|||\cdot|||$ was defined as the composite grid energy norm. But for functions $v^h$ on $\Omega^h$, we note that $|||I_h^c v^h||| = (L^c I_h^c v^h, I_h^c v^h)^{1/2} = (L^h v^h, v^h)^{1/2}$, which we write simply as the fine grid energy norm $|||v^h||| := (L^h v^h, v^h)^{1/2}$. Similarly, for $v^{2h}$ on $\Omega^{2h}$ we can define $|||v^{2h}||| := (L^{2h} v^{2h}, v^{2h})^{1/2}$.

Lemma 3. The perturbation term is bounded according to
$$|||P_n^h||| \le \rho\left((L^{2h})^{-\frac{1}{2}}\left(I - \frac{1}{\rho(L^{2h})} L^{2h}\right)^n\left(I_h^{2h} \frac{1}{\rho(L^h)^2}(L^h)^3 I_{2h}^h\right)\left(I - \frac{1}{\rho(L^{2h})} L^{2h}\right)^n (L^{2h})^{-\frac{1}{2}}\right)^{\frac{1}{2}}.$$
Proof: We can simplify the evaluation of the energy norm $|||\cdot|||$ since it is related to the Euclidean norm $\|\cdot\|$. Write $P_n^h = G^h I_{2h}^h \bar{P}^{2h}$, where $G^h = -\frac{1}{\rho(L^h)} L^h$ and $\bar{P}^{2h} = \left(I - \frac{1}{\rho(L^{2h})} L^{2h}\right)^n$. First note that
$$|||P_n^h e_0^{2h}||| = \left\|(L^h)^{\frac{1}{2}} G^h I_{2h}^h \bar{P}^{2h} e_0^{2h}\right\|.$$
Setting $e_0^{2h} = (L^{2h})^{-\frac{1}{2}} w^{2h}$, so that $|||e_0^{2h}||| = \|w^{2h}\|$, we obtain
$$|||P_n^h||| = \sup_{\|w^{2h}\| = 1}\left\|(L^h)^{\frac{1}{2}} G^h I_{2h}^h \bar{P}^{2h} (L^{2h})^{-\frac{1}{2}} w^{2h}\right\| = \rho\left((L^{2h})^{-\frac{1}{2}} \bar{P}^{2h} I_h^{2h} (G^h)^T L^h G^h I_{2h}^h \bar{P}^{2h} (L^{2h})^{-\frac{1}{2}}\right)^{\frac{1}{2}}. \tag{10}$$
The last equality follows because, for any $n \times n$ matrix $A$, $\|A\| = \sqrt{\rho(A^T A)}$. Since $(G^h)^T L^h G^h = \frac{1}{\rho(L^h)^2}(L^h)^3$, the lemma follows. Q.E.D.

Lemma 4. The fine and restricted coarse grid operators satisfy the inverse relation
$$(L^h)^{-1} \ge I_{2h}^h (L^{2h})^{-1} I_h^{2h}.$$

Proof: For any matrix $A$, we have $A^T A \le I \iff A A^T \le I$. Hence, with $A = (L^h)^{\frac{1}{2}} I_{2h}^h (L^{2h})^{-\frac{1}{2}}$, the variational conditions (and the natural composition $I_{2h}^c = I_h^c I_{2h}^h$) give $A^T A = (L^{2h})^{-\frac{1}{2}} I_h^{2h} L^h I_{2h}^h (L^{2h})^{-\frac{1}{2}} = I$, so that
$$A A^T = (L^h)^{\frac{1}{2}} I_{2h}^h (L^{2h})^{-1} I_h^{2h} (L^h)^{\frac{1}{2}} \le I.$$
Multiplying this inequality on both sides by $(L^h)^{-\frac{1}{2}}$ yields the lemma. Q.E.D.

Lemma 5. There exists a constant $c \in \mathbb{R}^+$ such that
$$I_h^{2h} \frac{1}{\rho(L^h)^2}(L^h)^3 I_{2h}^h \le c\, \frac{1}{\rho(L^{2h})}(L^{2h})^2.$$

Proof: First we observe that $\rho(L^h)$ scales $L^h$ so that $\frac{1}{\rho(L^h)} L^h \le I$. Thus,
$$I_h^{2h} \frac{1}{\rho(L^h)^2}(L^h)^3 I_{2h}^h \le I_h^{2h} \frac{1}{\rho(L^h)}(L^h)^2 I_{2h}^h.$$
Then, to establish the bound we seek, we need only prove that
$$I_h^{2h} \frac{1}{\rho(L^h)}(L^h)^2 I_{2h}^h \le c\, \frac{1}{\rho(L^{2h})}(L^{2h})^2 \tag{12}$$
for some constant $c \in \mathbb{R}^+$. First notice that, using Lemma 4 and the full rank of $I_{2h}^h$, we can choose a constant $b \in \mathbb{R}^+$ such that
$$I_{2h}^h (L^{2h})^{-1}(L^{2h})^{-1} I_h^{2h} \le b\,(L^h)^{-1}(L^h)^{-1},$$
which is equivalent to
$$L^h I_{2h}^h (L^{2h})^{-1}(L^{2h})^{-1} I_h^{2h} L^h \le b I \iff \left(L^h I_{2h}^h (L^{2h})^{-1}\right)\left(L^h I_{2h}^h (L^{2h})^{-1}\right)^T \le b I.$$
Hence,
$$\left(L^h I_{2h}^h (L^{2h})^{-1}\right)^T\left(L^h I_{2h}^h (L^{2h})^{-1}\right) \le b I \iff (L^{2h})^{-1} I_h^{2h} (L^h)^2 I_{2h}^h (L^{2h})^{-1} \le b I \iff I_h^{2h} (L^h)^2 I_{2h}^h \le b\,(L^{2h})^2. \tag{13}$$
Dividing by $\rho(L^h)$, we have shown that
$$I_h^{2h} \frac{1}{\rho(L^h)}(L^h)^2 I_{2h}^h \le b\, \frac{1}{\rho(L^h)}(L^{2h})^2. \tag{14}$$
We take the smallest value of $b$ for which (13) holds, and we then need a minimum value for $c$ so that $b\,\frac{1}{\rho(L^h)} \le c\,\frac{1}{\rho(L^{2h})}$. Letting $\lambda_{min}(A)$ denote the smallest positive eigenvalue of a symmetric positive definite matrix $A$, the minimum value for $c$ is evidently bounded by
$$c \le \sigma\left((I_{2h}^h)^T I_{2h}^h\right), \tag{15}$$
where $\sigma$ denotes the condition number $\rho(A)/\lambda_{min}(A)$. Hence,
$$I_h^{2h} \frac{1}{\rho(L^h)^2}(L^h)^3 I_{2h}^h \le \sigma\left((I_{2h}^h)^T I_{2h}^h\right) \frac{1}{\rho(L^{2h})}(L^{2h})^2,$$
and the lemma is proved. Q.E.D.

Lemma 6. If $I_{2h}^h$ is based on piecewise bilinear elements on a uniform rectangular grid, then
$$\sigma\left((I_{2h}^h)^T I_{2h}^h\right) \le 4.0.$$

Proof: The stencil for the bilinear interpolation operator $I_{2h}^h$ is given by

      1/16   1/8   1/16
       1/8   1/4    1/8
      1/16   1/8   1/16

The stencil for the product $(I_{2h}^h)^T I_{2h}^h$ is therefore

      1/16   3/8   1/16
       3/8   9/4    3/8
      1/16   3/8   1/16

The lemma now follows from a straightforward mode analysis to estimate the eigenvalues of this stencil operator. Q.E.D.

The following theorem establishes that AFACx converges in the usual "optimal" multilevel sense whenever AFAC does.

Theorem 2. The spectral radius of AFACx is bounded below one uniformly in $h$, assuming this is true of AFAC and that $n$ is sufficiently large.

Proof: Lemmas 3 and 5 combine to prove
$$|||P_n^h||| \le \rho\left((L^{2h})^{-\frac{1}{2}}\left(I - \frac{L^{2h}}{\rho(L^{2h})}\right)^n\left(I_h^{2h} \frac{(L^h)^3}{\rho(L^h)^2} I_{2h}^h\right)\left(I - \frac{L^{2h}}{\rho(L^{2h})}\right)^n (L^{2h})^{-\frac{1}{2}}\right)^{\frac{1}{2}} \le \rho\left(\left(I - \frac{L^{2h}}{\rho(L^{2h})}\right)^{2n} \frac{c}{\rho(L^{2h})} L^{2h}\right)^{\frac{1}{2}}$$
$$= \sqrt{c}\,\max_{0 \le \beta \le 1}\,(1 - \beta)^n\,\beta^{\frac{1}{2}} \le \sqrt{\frac{c}{2n + 1}},$$
where $\beta$ ranges over the eigenvalues of $\frac{1}{\rho(L^{2h})} L^{2h}$, which lie in $[0, 1]$. From Lemma 2, we thus have
$$|||AFACx\, e^c||| = |||\widehat{AFAC}\, e^c + I_h^c P_n^h e_0^{2h}||| \le |||\widehat{AFAC}\, e^c||| + |||I_h^c P_n^h e_0^{2h}||| \le \left(|||\widehat{AFAC}||| + \sqrt{\frac{c}{2n + 1}}\right) |||e^c|||.$$
Since $|||\widehat{AFAC}||| < 1$ uniformly in $h$ by Lemma 1, the spectral radius of AFACx is bounded below one for $n$ sufficiently large. Q.E.D.
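Two ingredients of this bound are easy to check numerically. The sketch below (an illustrative check, not part of the proof) scans the symbol of the 3x3 product stencil from Lemma 6 (assuming, as reconstructed from the text, corners 1/16, edges 3/8, and center 9/4), confirming that the ratio of its largest to smallest value is 4.0, and verifies the elementary estimate $\max_{0\le\beta\le1}\beta(1-\beta)^{2n} \le \frac{1}{2n+1}$ used in the final step:

```cpp
#include <cmath>

// Symbol of the product stencil (corners 1/16, edges 3/8, center 9/4)
// at frequency (t1, t2); it is real by symmetry.
double stencil_symbol(double t1, double t2) {
    double c1 = std::cos(t1), c2 = std::cos(t2);
    return 9.0 / 4.0 + (3.0 / 4.0) * (c1 + c2) + (1.0 / 4.0) * c1 * c2;
}

// Ratio of the largest to the smallest symbol value over a frequency
// grid: an estimate of the condition number bounded in Lemma 6.
double condition_estimate(int samples) {
    const double pi = 3.14159265358979323846;
    double lo = 1e300, hi = 0.0;
    for (int a = 0; a <= samples; a++)
        for (int b = 0; b <= samples; b++) {
            double v = stencil_symbol(pi * a / samples, pi * b / samples);
            if (v < lo) lo = v;
            if (v > hi) hi = v;
        }
    return hi / lo;
}

// Grid search for the maximum over [0,1] of beta * (1 - beta)^(2n),
// which is attained at beta = 1/(2n+1) and is bounded by 1/(2n+1).
double max_beta_term(int n, int samples) {
    double best = 0.0;
    for (int k = 0; k <= samples; k++) {
        double beta = (double)k / samples;
        double p = beta;
        for (int j = 0; j < 2 * n; j++) p *= (1.0 - beta);
        if (p > best) best = p;
    }
    return best;
}
```

The second check makes visible why the AFACx perturbation vanishes as the number of coarse relaxation sweeps $n$ grows.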
CHAPTER 3
PARALLEL SOFTWARE DESIGN

3.1 Introduction

The difficulty of designing and implementing parallel software is a hindrance to obtaining feedback on the design of parallel numerical algorithms. An important part of this design involves implementation of the proposed algorithms on the complex target architectures used in today's computers and the feedback of these results into the design of the algorithm. An understanding of the numerical properties of a parallel algorithm can be obtained from a serial implementation, but issues of parallel performance show up only in the much more complex parallel environment, though some details of performance on a proposed parallel architecture can often be estimated with a good understanding of the algorithm.

Based on the design and development of much parallel software, a recurring set of problems was recognized as fundamental to the expanded development of software as complex as the parallel adaptive mesh refinement (AMR) codes. The initial development of a parallel adaptive refinement code was complex enough to prevent its expanded use on much more realistic fluid applications. As a result, a portion of the thesis research effort was spent in the analysis and resolution of these roadblocks to efficient and productive software design.

This thesis implements two separate parallel adaptive refinement codes, one in C and one in C++ (using the object-oriented design features of C++). The C language version, which was completed first and served as motivation for the C++ version, and earlier FORTRAN work showed the substantial difficulties involved in the use of FORTRAN or
any other procedural language for parallel adaptive refinement. Motivated by these observations, a new way to develop general parallel software, and specifically parallel adaptive mesh refinement software, was designed and is presented in what follows.

3.2 C Language Implementation of Parallel AFAC/FAC

The initial implementation of the general problem design was mostly carried into the second, object-oriented AMR++/P++ version, though the AMR++/P++ implementation was substantially more robust and feature laden (see section 3.6).

3.2.1 Program Structure

A decomposition of the composite grid problem domain is commonly used to partition work across multiple processors. However, since AFAC requires minimal intergrid transfers, additional solver efficiency (the dominant cost) is obtained by partitioning the composite grid by level. A partition of the problem domain might cut across many grids and add substantially to the total communication cost involved in updating internal boundary values, but a partition by level means that the grids will be shared across a minimal number of processors. This reduces and simplifies the necessary communication between processors sharing a given grid, which is especially effective since most message traffic occurs during the MG solves. In addition, level partitioning allows for a more simplified load balancing strategy. An even greater advantage is that it allows for the movement of grids, as required by shock tracking, with no movement or rebalancing of the distributed composite grid. Further, level partitioning allows for a greater amount of each grid to be stored in the processors that share it. This results in longer vectors to be formed by the solvers, which is expected to better utilize the vector and pipeline hardware on machines with these features. With this reduction in total communication requirements, the remaining communication costs can be more easily hidden by the computational work.
This sort of message latency hiding would be expected to work best when there is special message passing hardware designed to relieve the main CPU of the message passing overhead. Due to the
unbalanced communication-to-computation costs associated with the iPSC/1 and its lack of special communication hardware, this message latency hiding was, however, difficult to measure.

3.2.2 Data Structures

The relationship between the levels of the composite grid is expressed in a tree of arbitrary branching factor at each vertex. Each grid exists as a data structure at one node of this tree. This composite grid tree is replicated in each processor. In the case of a change to the composite grid (adding, deleting, or moving a grid), this change is communicated globally to all processors so that the representation of the composite grid in each processor is consistent. The partitioning of all grids is also recorded and is consistent in all processors.

Storage of the matrix values uses a one-dimensional array of pointers to row values. These rows are allocated and deallocated dynamically, allowing the partitioned matrices to be repartitioned without recopying the entire matrix. This is important to the efficiency of the dynamic load balancer, MLB. The effect of noncontiguous matrices is not felt when using this data structure, since subscript computation is done by pointer indirection for all but the last subscript and, in that case, subscript computation is done by the addition of an offset. This organization is particularly effective for partitioning along one dimension. In the case of a 3D problem space, 2D planes would be dynamically allocated. Multiple rows or planes could also be allocated, allowing choices of vector lengths to optimize the implementation on vector machines. The newer implementation using P++ does not have to address this level of detail, since such issues as storage and data layout are a part of P++ and the abstraction that it represents.

3.2.3 Multigrid Solver

A significant change from the code described in Briggs et al. [10] is the restriction to one-dimensional decomposition ("strips") and the allowance of irregular partitioning.
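The row-pointer matrix storage described in section 3.2.2 above might be sketched as follows (illustrative names, not the thesis code): each row is allocated independently, so a repartitioning step can release or acquire individual rows without recopying the whole matrix, and all but the last subscript resolve through pointer indirection:

```cpp
#include <cstddef>

// A matrix stored as a one-dimensional array of pointers to
// independently allocated rows (hypothetical sketch).
struct RowMatrix {
    int rows, cols;
    double** row;   // row[i] points at a cols-long block, or nullptr
};

RowMatrix* rm_create(int rows, int cols) {
    RowMatrix* m = new RowMatrix;
    m->rows = rows;
    m->cols = cols;
    m->row  = new double*[rows];
    for (int i = 0; i < rows; i++)
        m->row[i] = new double[cols]();   // zero-initialized row
    return m;
}

// Release one row locally, e.g. after its data has been shipped to
// the processor that now owns it; the rest of the matrix is untouched.
void rm_release_row(RowMatrix* m, int i) {
    delete[] m->row[i];
    m->row[i] = nullptr;
}
```

Element access is m->row[i][j]: pointer indirection for the first subscript and an offset for the last, so noncontiguous rows cost nothing extra.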
The experiments documented here use (2,1) V-cycles that recurse to
the coarsest grid, which consists of one interior point. The AFAC scheduler is responsible for ordering the execution of the necessary multigrid and intergrid transfer subroutines based on the receipt of messages. For example, multigrid is executed for all grids asynchronously and is driven by the order in which the messages are received.

3.2.4 AFAC/FAC Scheduler

The AFAC scheduler handles the asynchronous scheduling of the steps needed to perform an AFAC iteration. An AFAC iteration is divided into operations between communications. The scheduler orders these operations on each of the grids contained in each processor. Thus, it is intended that much communication would be hidden by the computation that is scheduled while waiting for messages (message latency hiding).

3.2.5 Grid Manager

The grid manager is responsible for the creation of the grid data structures and the update of the composite grid tree in all processors. Calls to the grid manager allow for the passing of internal boundary values between processors and the adjustment of the partitions as called for by MLB. Additional services provided are modification of the composite grid by the addition of a new refinement, as required of an adaptive method, and movement of any grid or stack of grids, as required in shock tracking.

3.2.6 Multilevel Load Balancer (MLB)

MLB is responsible for the dynamic readjustment of the evolving composite grid. As new grids are built, the composite grid is modified, its tree is updated in all processors, and the partitions are adjusted to balance the loads. Given the data structures used for the storage of the matrices (outlined previously), MLB can adjust a partition at a cost commensurate with the amount of data transferred between processors. Additionally, MLB ensures that the data transferred between processors during repartitioning follows a minimum length path.
Further, since the cost of determining if a multiprocessor system requires balancing (a global communication) is approximately the cost of MLB when no partitions are changed, there is a negligible penalty for the frequent load balancing required in dynamic refinement.
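A strip repartitioning step in the spirit of MLB can be reduced to choosing new strip boundaries from per-row work estimates. The sketch below is illustrative only (the actual MLB also handles the tree of levels and the minimum-length data paths described above):

```cpp
#include <vector>

// Hypothetical strip rebalancing: given per-row work estimates, choose
// new strip boundaries so each of P processors receives (nearly) equal
// total work.  Processor p owns rows boundary[p] .. boundary[p+1]-1.
std::vector<int> balance_strips(const std::vector<double>& row_work, int P) {
    int n = (int)row_work.size();
    double total = 0.0;
    for (double w : row_work) total += w;

    std::vector<int> boundary(P + 1, n);
    boundary[0] = 0;
    double target = total / P, acc = 0.0;
    int p = 1;
    for (int i = 0; i < n && p < P; i++) {
        acc += row_work[i];
        if (acc >= target * p) boundary[p++] = i + 1;  // close strip p-1
    }
    return boundary;
}
```

Because rows are stored behind independent pointers, applying the new boundaries only moves the rows whose owner changed, which is why the cost of balancing is commensurate with the data actually transferred.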
3.2.7 Data Flow

The design of the solver allows for progress on a given grid to be divided into 27 computationally equal parts. After each part is finished, all shared grids are checked for the receipt of boundary values (messages) from co-owning processors. All shared grids are serviced in a round robin fashion, but are checked for receipt of boundary values before any wholly owned grid is processed. This gives the shared grids a higher priority than the wholly owned grids. Using the solver in this way allows for good processor utilization. When used with the load balancer, the total solve times vary only a few percent between processors. Thus, processor utilization during the most costly part of AFAC is quite high. Further, the order of execution is both dynamic and nearly optimal, since the work done on each processor is driven by the availability (receipt) of messages.

A significant improvement in this context would be to increase the fineness of grain in the available parallelism. The current graininess in the parallelism of the solver in each processor depends on the size of the grids owned. With a very coarse grain of parallelism, the receipt of messages during the solve of a large grid does not trigger the servicing of the grid whose boundary values were just received. Thus, the execution (servicing) of the shared grids is not handled optimally. The remedy is to partition these large grids into smaller pieces and thus reduce the time a shared grid waits for service while processors finish the larger grids.

A commonly suggested optimization for the organization of message passing in the parallel environment performs relaxation on the overlap (ghost) boundary first and then triggers message passing on that overlap while relaxation is done on the interior. The motivation is to trigger the message passing as soon as possible and then use the interior relaxation to hide the latency associated with the communications.
Contrary to common understanding, the effect of this optimization was only a few percent improvement on the larger problems run and about 10% additional overhead on the smaller problems. However, using the iPSC/2
asynchronous communication calls means that the results are inconclusive, since the iPSC/2 hardware supports only very limited overlap of computation and communication. The results are detailed in figure 2. Note that this sort of message latency hiding could be handled transparently to the user in the object-oriented P++, though it is not done currently.

3.3 Problems with Computer Languages

This thesis explores an object-oriented design for complex parallel numerical software. This line of research was settled on after several attempts at the design of parallel adaptive mesh refinement codes for realistic applications using complicated nonrectangular refinement grids. The experience was useful in discovering just a few of the very wrong ways to implement adaptive refinement software. Initial work on block structured grids was unsuccessful mostly because of the C language's lack of support for encapsulation.

3.3.1 Problems with FORTRAN

FORTRAN is a static procedural language and, as a result, has limited flexibility to handle the necessarily dynamic requirements of adaptive mesh refinement. In a parallel environment, even static refinement would be problematic since, without special support, the shuffling of data between processors as new refinement is added would require recopying large amounts of data. Memory management in general becomes a practical limitation of FORTRAN for such complex software. Additionally, the details of adaptive mesh refinement, application requirements, and parallelism can become unnecessarily mixed because of the lack of data hiding (the ability to hide the organization and access of internal data) and encapsulation (the ability to hide the internal manipulation). FORTRAN 90 addresses many of these issues, but is mostly unavailable.
High Performance FORTRAN (HPF) ignores most of the features of FORTRAN 90 that might simplify adaptive mesh refinement, specifically the object-oriented-type features of FORTRAN 90 (e.g., operator overloading). Additional problems with FORTRAN:
[Bar chart of solve times for processors 0-5 across several runs, comparing boundary handling with and without prepassing; legend: "Prepass Boundaries" / "Don't Prepass".]

Figure 2: Effect of overlapping communication with computation.
Type checking is an important requirement in the development of large, complex software. The use of user defined types in C and C++ significantly reduces the debugging time required for adaptive refinement codes because the complexity of the different data structures can be isolated, separated, and represented as different types (by the use of user defined structures). In the resulting implementation, type checking verifies a consistent expression of the algorithm implementation. This type checking is stronger in C++, and provides more advanced capabilities in the object-oriented design, since there is greater flexibility in the definition of objects that combine the organization of data, as in the C language "structures," with the method functions that manipulate the object's data. For a more complete definition of the C++ language, see [47].

Dynamic memory management is another important requirement for parallel adaptive mesh refinement. Alternatively, the use of a memory management layer between the parallel adaptive mesh refinement code and the FORTRAN language can provide the required support. The advantage of using existing FORTRAN code, and the efficiency that FORTRAN provides at the lowest levels, can make FORTRAN deceptively attractive. More common approaches have mixed C++ and FORTRAN so that the advantages leveraged from the use of the object-oriented design can be exploited. This thesis has not taken a mixed language approach, since the object-oriented design is required at a high level to implement the AMR code and at a low level (the level of the looping constructs) to provide the architecture independence.

The use of common blocks in FORTRAN does not adequately isolate the data, or its internal structure (partitioned or contiguous), from the statements that manipulate the data. The effect complicates the development and maintenance of complex codes and especially complicates the porting of codes designed for the serial environment to the parallel environment.
Many codes that solve explicit equations
are sufficiently simple that they do not have such problems. Similarly, the definition of standards for libraries is sometimes reduced to the awkward standardization of common block variable orderings, unnecessary in more advanced languages.

3.3.2 Problems with Procedural Languages

The fundamental problem that was experienced in the implementation of the C language parallel adaptive refinement codes was the overwhelming complexity of combining the application specific problem with the adaptive mesh refinement and the explicit parallelism for the distributed memory environment. Each would have been tractable individually, but these software engineering problems combine nonlinearly. This is nothing more than the statement that the time requirements of software development in general are nonlinear in the number of lines of code.

The solution to this problem starts with the recognition that an algorithm can be expressed independent of the organization of its data. For example, the sum of neighboring array values could be expressed in either FORTRAN or C (or C++) as in figure 3.

FORTRAN

      DO i = 1, Size
         DO j = 1, Size
            A(i,j) = B(i-1,j) + B(i+1,j) + B(i,j-1) + B(i,j+1)
         END DO
      END DO

C or C++

      for (int i = 0; i < Size; i++)
           for (int j = 0; j < Size; j++)
                A[i][j] = B[i-1][j] + B[i+1][j] + B[i][j-1] + B[i][j+1];

Figure 3: FORTRAN and C or C++ example code fragments.

Both the FORTRAN and C or C++ versions of this statement implicitly rely on the contiguous ordering of the data in the arrays A and B. This is due to the definition of the indexing operators, ( ) in FORTRAN and [ ] in C and C++. The reliance on the contiguous ordering of the data means that the algorithm's expression is NOT independent of the organization of the data. A result of this dependence of the algorithm's implementation on the organization of its data is that a change in the layout of the data affects the expression and implementation
of the algorithm. In the case of a vector architecture, the data should be organized into contiguous vectors of constant stride so that the vector hardware can efficiently evaluate the algorithm. This is the trivial case of the example implementation, and the traditional style of implementation maps well to the low-level, efficient vector processing. In the case of a cache-based RISC microprocessor architecture, the use of the consecutively ordered multidimensional arrays A and B and the sequential looping force continued flushing of the microprocessor cache. In this case, the implementation gets the correct result, but efficiency is sacrificed because the sequential loop processing flushes the cache's record of the element B[i][j-1] (among others). The solution that enables efficient processing is a block by block processing of the two-dimensional grid, but we clearly see in figure 4 how this modification affects the implementation. Further, this block by block processing of the 2D grid is in conflict with the efficient vector processing, since utilization of the cache requires many very short vectors to be processed, while efficient vector processing requires longer vectors. Attempts to have such issues addressed at compile time have been relatively unsuccessful. The case of the equivalent distributed memory code changes the processing even more drastically, since the data is partitioned across multiple processors, and so explicit message passing must be introduced.

C or C++

      for (int Block_i = 0; Block_i < Size / Block_Size; Block_i++)
           for (int Block_j = 0; Block_j < Size / Block_Size; Block_j++)
                for (int Element_i = 0; Element_i < Block_Size; Element_i++)
                     for (int Element_j = 0; Element_j < Block_Size; Element_j++)
                        {
                          int i = Element_i + (Block_i * Block_Size);
                          int j = Element_j + (Block_j * Block_Size);
                          A[i][j] = B[i-1][j] + B[i+1][j] + B[i][j-1] + B[i][j+1];
                        }

Figure 4: C language example of block by block cache based execution.
The resulting implementation is greatly expanded (in lines of code) and is far from clear because of the required additional parallel logic. Figure 5 shows an example code (showing a simpler case where global addressing is
used at the expense of the whole array's storage on each processor¹) with the equivalent parallel implementation of the previous code fragment. In all three examples, the implementation of the algorithm is affected by the details of the target architecture. The effect of more sophisticated parallel architectures is the most extreme. The use of complex algorithms on such architectures greatly complicates the software development. More specifically, the more complex parallel implementation hides the details of the algorithm's definition and thus precludes the normal development evolution of the software as increasingly complex applications are attempted (e.g., more physics). The effect on software development is to force dual implementations: one serial and simple to modify and extend, and a second parallel and difficult to extend. The practical effect of multiple implementations makes the development of realistic parallel applications expensive and slow, because the algorithm cannot economically be modified for each of several, or perhaps many, different architectures. The fundamental reason for this problem is the dependence of the implementation of the algorithm on the organization of the data. The ability to express algorithms independent of the architecture is thus a principal feature of the software for parallel adaptive refinement work.

3.4 Motivation for Object-Oriented Design

Having seen, by example of the Jacobi relaxation code fragment, that the implementation is affected by the target architecture, we want representations of the algorithms that sufficiently abstract the details of each possible architecture. The work on array languages during the late seventies provides just such an abstraction. Here the arrays are manipulated using array operators that internally understand the details of the target architecture, but which by their use permit the implementation of the algorithm independent of the organization of the data.
Specifically, we do not know how the arrays are represented internally, but this detail is unimportant to the definition of the algorithm. Thus, the algorithm may be

¹This simplification is done only for clarity, since the use of nonglobal indexing would be less clear.
      #define SIZE 100

      int    Node_Number_Table [16] = { 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 };
      double Solution     [SIZE][SIZE];
      double Old_Solution [SIZE][SIZE];

      int main ()
         {
           int  i, j;
           int  Start, End;
           int  Right_Message_Type    = 100;          /* Arbitrary values */
           int  Left_Message_Type     = 101;          /* Arbitrary values */
           int  Process_ID            = mypid ();     /* Current Process ID */
           long Processor_Number      = mynode ();    /* This node's number */
           long Number_Of_Processors  = numnodes ();  /* Total number of Processors */

           /* ... compute the Start and End rows of this processor's partition ... */

           /* Compute our portion of the distribution for the Laplace equation */
           for (i = Start; i <= End; i++)
                for (j = 1; j < SIZE-1; j++)
                     Solution [i][j] = ( Old_Solution [i-1][j] + Old_Solution [i+1][j] +
                                         Old_Solution [i][j-1] + Old_Solution [i][j+1] ) / 4.0;

           /* Send the new solution on our boundary to the Right and Left Processors */
           if (Processor_Number < Node_Number_Table [Number_Of_Processors-1])
                csend ( Right_Message_Type, &(Solution [End][0]),   SIZE * sizeof (double),
                        Node_Number_Table [Processor_Number+1], Process_ID );
           if (Processor_Number > Node_Number_Table [0])
                csend ( Left_Message_Type,  &(Solution [Start][0]), SIZE * sizeof (double),
                        Node_Number_Table [Processor_Number-1], Process_ID );

           /* Receive the new solution on our boundary from the Left and Right Processors */
           if (Processor_Number < Node_Number_Table [Number_Of_Processors-1])
                crecv ( Left_Message_Type,  &(Solution [End+1][0]),   SIZE * sizeof (double) );
           if (Processor_Number > Node_Number_Table [0])
                crecv ( Right_Message_Type, &(Solution [Start-1][0]), SIZE * sizeof (double) );

           printf ("Program Terminated Normally!\n");
           return 0;
         }

Figure 5: Distributed memory example code.
defined using simple array operations as in figure 6, regardless of the target architecture. In the case of a vector architecture, the array operators process the vectors one at a time². In RISC architectures, the internal representation can be processed by block type operations where each block fits into the cache³.

      Index I (1,Size,1);
      Index J (1,Size,1);

      A(I,J) = B(I-1,J) + B(I+1,J) + B(I,J-1) + B(I,J+1);

Figure 6: Equivalent P++, object-oriented example code.

Many algorithms are not well suited to the array language implementation, for example, the usual tridiagonal solve, which would require explicit scalar indexing⁴. However, such algorithms are not parallel or vectorizable, so the array language only encourages the use of algorithms that are better suited to the more complex architectures available today, without limiting other algorithms. Clearly, the use of a serial algorithm in a parallel environment only results in poor performance, not in an incorrect result. Figure 7 shows an example of the object-oriented code fragment using explicit looping and indexing, which is also easily provided.

      for (int i = 0; i < Size; i++)
           for (int j = 0; j < Size; j++)
                A(i,j) = B(i-1,j) + B(i+1,j) + B(i,j-1) + B(i,j+1);

Figure 7: Equivalent P++, object-oriented example code using explicit indexing.

The use of the Index type specifies a FORTRAN 90 like triplet of integers specifying the initial indexed position, the number of elements to be indexed, and the associated index stride. Then the arrays A and B are manipulated using the array operators + and =, which treat only those portions of A and B that are indexed using the Index variables I and J. In this case, the Index variables are themselves manipulated using the + and - operators, so that B is indexed using combinations of I+1, I-1, J+1, and J-1.

²In complex array expressions, this has special sorts of inefficiencies that will be discussed later.
3 This has been worked out by Roldan Pozo at the University of Tennessee as part of the distributed LAPACK++ class library.
4 Such algorithms must use explicit indexing (using looping and scalar index array operators), which can be easily provided within the definition of the array language, but are not considered array operations.
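The triplet semantics described above, an Index holding (position, count, stride) with shifted copies produced by + and -, can be sketched in a few lines of C++. This is a hypothetical miniature for illustration only, not the M++/P++ implementation; the 1-D Array type and its gather/assign helpers are assumptions:

```cpp
#include <vector>

// Minimal sketch of a FORTRAN 90-like index triplet: (position, count, stride).
struct Index {
    int position, count, stride;
    Index(int p, int c, int s = 1) : position(p), count(c), stride(s) {}
    Index operator+(int k) const { return Index(position + k, count, stride); }
    Index operator-(int k) const { return Index(position - k, count, stride); }
};

// A toy 1-D array type: operator()(Index) selects the indexed portion.
struct Array {
    std::vector<double> data;
    explicit Array(int n) : data(n, 0.0) {}
    // Gather the elements selected by an Index into a plain vector.
    std::vector<double> operator()(const Index& I) const {
        std::vector<double> out;
        for (int k = 0; k < I.count; ++k)
            out.push_back(data[I.position + k * I.stride]);
        return out;
    }
    // Assign a vector back into the portion selected by an Index.
    void assign(const Index& I, const std::vector<double>& v) {
        for (int k = 0; k < I.count; ++k)
            data[I.position + k * I.stride] = v[k];
    }
};
```

Because I - 1 and I + 1 are themselves Index objects, an expression such as B(I-1) + B(I+1) selects shifted portions of B without any respecification of loop bounds.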
But the use of an array language would require the construction of a specialized compiler. Such compilers are not commonly available. Though FORTRAN 90 array extensions have been added to many implementations of FORTRAN 77, only a few compilers use them to simplify development of parallel software. The use of a specialized compiler would limit the use of the resulting codes to the few architectures where such compilers might be available. This effectively contradicts the goal of architecture independence. An alternative that would permit the use of array constructs is to build such an array language into an existing language. A procedural language would not permit the extension of the language without modifying the compiler, since the language defines a static set of types that may not be extended. But an object-oriented language permits the definition of user defined types (objects), called an extensible type system, and such user defined types (objects) behave similarly to the static set of types defined in procedural languages. The practical effect is to permit extension of the language to include new user defined types and to have them function as if they were an original part of the language definition. The C++ language is a relatively new object-oriented language that is widely available for nearly every serial and parallel computer. It is increasingly being used in the development of numerical software across a broad range of computer architectures.5 Although developed originally in 1984, C++ has stabilized sufficiently to support large implementation projects and is increasingly being used for the development of sophisticated numerical software.6 Figure 8 shows an example of an array and index object, illustrating how the data and the method functions which manipulate the object's data are combined. The example definitions of the Index and intarray C++ classes define new types to the compiler and to the code that uses them.
The details of the implementation can be specific to a given architecture,

5 A substantial portion of the engineering codes developed at Sandia National Labs are using C++.
6 Several of the largest production codes at Sandia National Laboratory are developed using C++.
class Index
   {
     private:
          int Position;
          int Count;
          int Stride;
     public:
          Index & operator + ( const Index & X, int i );
          Index & operator - ( const Index & X, int i );
          Index & operator - ( const Index & X );
   };

class intarray
   {
     private:
          Array_Descriptor_Type *Array_Descriptor;
          int *Data;
     public:
          Array & operator () ( const int & i );
          Array & operator () ( const int & i, const int & j );
          Array & operator () ( const Index & I );
          Array & operator () ( const Index & I, const Index & J );
          Array & operator + ( const Array & Lhs, const Array & Rhs );
          Array & operator - ( const Array & Lhs, const Array & Rhs );
          Array & operator * ( const Array & Lhs, const Array & Rhs );
          Array & operator / ( const Array & Lhs, const Array & Rhs );
          Array & operator = ( const Array & Rhs );
   };

Figure 8: Equivalent P++, object-oriented example code.
but the interface can be constant and is architecture independent. This is the principal way this thesis proposes architecture independence for general numerical software. This thesis uses C++ to define an array class library. Specifically, an array object is defined in C++ and all possible operators between arrays are provided, so that the array objects appear as built-in types (though user defined). The actual interface is copied from the commercial M++7 array language. The motivation for this work is to simplify the use of software implemented using the array interface for distributed memory architectures. In the serial environment, it is sufficient to use the M++ class library. The purpose of the P++ class library is to extend the identical interface for use on distributed memory parallel computers with absolutely no change to the original serial code. The parallel array class library P++ uses the serial array class library internally. The result is a separation between the details of the parallel array object and the serial array object. Such separation divides the complexity of the combined design of the single processor array object and its extended use in the parallel environment. Since the interface is identical, the code using the interface (the original serial source code) does not change. The computational model is to run the same code on each processor, which is called the Single Program Multiple Data model (SPMD model). The P++ array class library, which provides general support for data parallelism, is part of the strategy to support parallel adaptive refinement. Its complete support requires support for the details of adaptive mesh refinement so that different applications can reuse the complex data structures that adaptive mesh refinement codes contain. The common AMR code, independent of the application and parallel issues, is presented in the AMR++ class libraries.
In this way, we expect substantial reuse of code across many adaptive refinement applications. More details are contained in [5] and in section 3.6.

7 M++ is a product of Dyad Software (1-800-366-1573).
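The layering described above, in which the parallel array object internally uses a serial array object holding only the local partition, might be caricatured as follows. This is a hypothetical miniature, not the actual P++/M++ design; the class names and the simple block partitioning are assumptions:

```cpp
#include <vector>

// Stand-in for an M++ serial array: knows nothing about parallelism.
struct SerialArray {
    std::vector<double> data;
    explicit SerialArray(int n) : data(n, 0.0) {}
};

// Stand-in for a P++ distributed array: owns a SerialArray for its local
// partition and decides which global indices are local to this processor.
struct ParallelArray {
    int global_size;
    int local_start;     // first global index on this processor
    SerialArray local;   // the serial class library, reused unchanged

    // Assumes global_size is divisible by num_nodes, for simplicity.
    ParallelArray(int n, int my_node, int num_nodes)
        : global_size(n),
          local_start(my_node * (n / num_nodes)),
          local(n / num_nodes) {}

    bool is_local(int global_i) const {
        return global_i >= local_start &&
               global_i < local_start + (int)local.data.size();
    }
};
```

The point of the separation is visible even in this sketch: all element storage and (in the real libraries) all node-level arithmetic live in the serial class, while the parallel class only adds partitioning bookkeeping.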
3.4.1 Problems with the Object-Oriented C++ Language

The use of an object-oriented language is no panacea. There are often dozens of ways to implement a given application, which is the strength of the language. Yet, as studies at AT&T have shown, the learning time for C++ is approximately 12-18 months. It is important to note, however, that most people are more productive in C++ after the first few months than they were previously.8 The principal problems experienced with C++ have to do with its ability to optimize the resulting code. The use of simple arrays of primitive types (such as int, float, and double) and the compiler's knowledge of these primitive types permit common optimization. For numerical software, such optimization centers around loop optimization, including removal of constant subexpressions and use of registers for temporary storage and accumulation. Such optimization greatly improves the efficiency of the compiled code. The use of non-primitive types, such as C++ objects (effectively user defined types), greatly limits optimization. Within the current line of C++ compilers, optimizers are typically turned off internally if non-primitive types are discovered. The result is a mass of function calls and the loss of register storage for accumulation within inner loops. Such code is inefficient at runtime. The emphasis, to obtain performance, is on the class library programmer (who is forced to work within C++ at a deeper level than typical users, who generally use such class libraries, in conjunction with C++, in simpler ways). Details of these problems have been discussed in [1], [45], and [46].

3.5 P++, a Parallel Array Class Library

3.5.1 Introduction and Motivation

The current trend in the development of parallel architectures is clearly toward distributed memory designs.
This is evident from current product announcements from Intel, Cray, and Thinking Machines, even though the latter two originally had successful machines of competing design. Experience has shown

8 This was the conclusion of discussions within the C++ conference on BIX (Byte Information Exchange).
that shared memory machines are easier to program than distributed memory ones and also have auto-parallelizing tools that are not available for distributed memory architectures. This is due to the fact that the shared memory programming model permits parallelization independent of the position of data between processors. However, memory access in shared memory machines becomes increasingly problematic as more processors are added; this non-scalability of the shared memory design limits its potential to support the greater computational power demanded of future applications. Within the last few years, approaches of adding local caches to the global memory of shared memory architectures have slightly improved memory access. The problem of efficiently optimizing cache usage, however, is very similar to the data partitioning problem on distributed memory machines, and is not yet satisfactorily resolved. Distributed memory machines require the explicit control of data, for example, data partitioning, in order to obtain the same or better parallel performance as shared memory machines. This explicit control of parallelism, through Message Passing, is difficult to achieve, and the result is a dramatically more complicated programming environment. The lack of a shared memory programming model on distributed memory machines is the fundamental disadvantage of their programming. The availability of portable communication interfaces, and similar programming tools, implemented on many distributed memory machines significantly simplifies portability among distributed architectures, but does not address the difficulties of the explicit control of parallelism done through the use of message passing.
Although even a shared memory parallel programming model in a distributed memory environment, called Virtual Shared Memory, would be advantageous from the point of view of software development, such general attempts have resulted in poor performance, except for an overly restricted set of problems. Newer advances in this area have not yet been evaluated for feasibility and efficiency [15]. Because the general data distribution problem remains unsolved, such Virtual Shared Memory environments attempt to
support parallelism without explicit knowledge of the algorithm's structure and requirements for off-processor data. The resulting voluminous accesses to off-processor memory degrade performance. This is mostly due to unnecessary and often replicated communication startup times and/or data transfer of irrelevant data in proximity to relevant data. It seems clear that the support of parallelism requires at least an interpretation of algorithm requirements, so that accesses to off-processor data can be managed efficiently [7]. P++ is a user environment that simplifies the development of efficient parallel programs for large-scale scientific applications. It permits portability across the widest variety of computer architectures. The target machines are distributed memory computers with different types of node architectures (scalar, vector, or superscalar), but the requirements of shared memory computers are also addressed. Such a simplifying environment for the development of software is sorely needed to take advantage of the current and future developments in advanced computational hardware. The P++ environment does this using a standard language, C++, with absolutely no modification of the compiler. For parallel communication, it employs existing widely portable communications libraries. Such an environment allows existing C++ language compilers to be used to develop software in the preferred serial environment, and such software to be efficiently run unchanged in all target environments. The explicit goal of P++ is the support of advanced computational methods for the solution of large-scale computational problems. For efficiency reasons and simplification, the use of P++ is restricted to the large class of structured grid methods for partial differential equations. Applications using P++ can selectively exploit the added degree of freedom presented by parallel processing by use of an optimization module within the array language interface.
A collection of defaults ensures deterministic behavior. In this way, complicated optimization issues of parallel execution may be easily introduced by setting switches within the user interface, which are then interpreted at runtime. Through the use of Virtual Shared
Memory, restricted to array variables (Virtual Shared Grids), issues such as partitioning become matters of optimization, and not criteria for correct execution. Due to the restriction to and optimization for structured grids, we expect the same parallel performance as for codes based on the traditionally used explicit Message Passing model. To speed the development of the P++ environment, we use a reasonably efficient, commercially available array language library, M++, developed in C++; the M++ array interface is also used as the P++ array interface. The internal implementation of P++ is based on the "Single Program Multiple Data stream" (SPMD) programming model. This model is coupled with the "Virtual Shared Grids" (VSG) programming model (section 3.5.5), which is a restriction of general operating-system-controlled Virtual Shared Memory to all types of structured grids, controlling communication at runtime. The user interface and programming model guarantee that serial codes developed in the P++ environment can be run on parallel distributed memory machines without modification. Because the array syntax is more compact than explicit looping in the description of the algorithm, the resulting code is easier to implement and debug, even for strictly serial codes in the serial environment. Moreover, since the SPMD and VSG programming models permit the serial code to be used in a parallel environment, the development of parallel codes using P++ is simpler than the development of serial codes when the explicit looping model is used. This is primarily due to the fact that the VSG model allows the specification of data partitioning to be an optimization, in contrast to most other developments in this area, where appropriate data partitioning is essential for the correct execution of the code. P++ employs a default grid partitioning strategy, which can be overridden in several ways specific to the user's application.
Recognizing that the acceptance of advanced parallel computers depends on their use by scientists, and not parallel researchers, it is hoped that P++, as a technological advance, will simplify access to these parallel machines. While significantly simplifying our
own work on parallel self-adaptive local refinement methods, it is hoped that P++ will more generally simplify the introduction of structured grid methods for large scale scientific computing onto advanced parallel machines.

3.5.2 Goals of the P++ Development

The general goal of the P++ development is to provide a simplified parallel programming environment. In this section, some ideal requirements for a user interface and programming model for distributed memory architectures are stated. These are fulfilled with the P++ environment for a large, but restricted, class of problems (detailed in subsection 3.5.3):

- Algorithm and code development should take place in a serial environment.
- Serial source codes should be able to be compiled and recompiled to run in parallel on distributed architectures without modification.
- Codes should be portable between different serial and parallel architectures (shared and distributed memory machines).
- Vectorization, parallelization, and data partitioning should be hidden from the user, except for optimization switches to which the user has full access and that have meaning only in vector or parallel environments.
- The debugging of parallel software should be greatly simplified. First, this is done if the software is debugged in the serial environment and data parallelism is exploited by recompilation. Second, the object-oriented treatment of the array operations avoids the respecification of index bounds, so one of the most common errors in the implementation of numerical software is eliminated because Index objects can be computed once and reused. Third, the object-oriented design used to build application codes from the P++ data parallel arrays further abstracts low level implementation details of the solvers and separates them from the remaining application code.
3.5.3 The P++ Applications Class

The restriction to a simplified class of applications for P++ has allowed focus on a general solution and evaluation of the functionality and performance of the P++ environment by using realistic application codes. Additionally, the restriction to a large but reasonable class of problems assures that P++ is not stretched to overly general or particularly pathological cases. In addition, extending the generality more toward general Virtual Shared Memory would cause the performance limiting factors to apply also to P++. The use of P++ is restricted to all different kinds of structured grid oriented problems. Domain decomposition and applications using adaptive block structured local refinement belong to this target class, and the problems introduced by their use in parallel environments have motivated this work. Grids can be constructed with all degrees of freedom (e.g., overlapping blocks, adaptively added, deleted, resized, ...), as long as they fulfill the above restrictions. In particular, P++ is dimensionally independent, which allows for its use in one- to four-dimensional problems. Although not yet applied to many applications, the target applications are large-scale computational problems in fluid dynamics and combustion. Specific algorithms with which this work is intended to be tested include:

- Explicit time stepping algorithms (Piecewise Parabolic Method (PPM) for hypersonic flow [4], [5]);
- Standard multigrid algorithms on rectangular domains ([10], [26], [34], [38]);
- Multilevel local refinement algorithms on simple grids ([22], [28], [29], [40]);
- Multilevel adaptive local refinement algorithms on simple grids ([41], [42]);
- Multilevel adaptive local refinement algorithms on block structured grids.
3.5.4 P++ Implementation in Standard C++

C++, an object-oriented language, was chosen as a basis for P++ because of the language's ability to support abstractions, which is fundamental in the development of the user interface that is expected to abstract issues of data parallelism. Some important features of C++ follow (for a definition
and description the reader is referred to [47]); each is obtained from C++ and carries over directly to the P++ environment, including any combination of C++/P++ use:

- Easy design, development and maintenance of large codes;
- ANSI standard under development;
- Object-oriented features like inheritance, data hiding, and encapsulation;
- Dynamic memory management;
- Strong typing of variables;
- Guaranteed user defined initialization through user defined constructors;
- Operator and function overloading;
- Templates;
- Same efficiency allowed as for C (currently a research area).

Most of these features are not unique to C++, so we do not preclude the use of any other object-oriented language, such as Smalltalk. But C++ is currently available on a wider number of machines than any other object-oriented language that suits numerical needs. Additionally, the C++ language is a superset of the C language, and so all C software is also available for use with C++. Specifically, this allows distributed memory communication libraries, like PVM, to be used as easily with C++ as with C. P++, as currently developed, uses the AT&T C++ Cfront compiler and the Intel iPSC communications library. In the near future, however, it is planned to make the code portable through use of EXPRESS or PVM and later also PARMACS (see section 3.5.8). Since C++ is a superset of C and the communication library is designed for use with C, these libraries can be easily used from within C++.

3.5.5 The Programming Model of P++

Use of the Single Program Multiple Data (SPMD) programming model combined with the VSG programming model is important, since without the combined programming models, the simplified representation of the parallel program from the serial program would not be practical. Their combined
effect is the principal means by which P++ allows serially developed codes to be efficiently run on distributed memory machines.

3.5.5.1 Single Program Multiple Data (SPMD) Model

In contrast to the explicit host/node programming model, which requires both a host and one or more node programs, the SPMD programming model consists of executing one single program source on all nodes of the parallel system. For example, the suggested programming model of the Intel Delta machine is SPMD. This is becoming increasingly common in new generation parallel machines from Intel, Cray, and TMC. Implementation of the SPMD model requires that commonly available functionality in the serial environment be provided in the parallel environment in such a way that the serial source code can be used on the distributed memory machine. One of the most important functionalities that is provided in the parallel programming model to support basic functionality of the serial code is a parallel I/O system. This can then be used in place of the serial I/O system, to support the required functionality of the parallel SPMD programming environment. Currently, only basic functionality of the SPMD programming model (I/O system: printf, scanf; initialization and termination of processes) is available. Implementation details are abstracted from the user. The SPMD programming model replicates the functionality of the traditional parallel host/node programming model. For example, the standard function scanf for reading from standard input is implemented in such a way that an arbitrarily chosen master node reads the data from standard input and distributes it to all other nodes (slave nodes). This master/slave relationship is only present within the Parallel I/O System and is not used, or intended, elsewhere in P++.

3.5.5.2 Virtual Shared Grids (VSG) Model

The concept of Virtual Shared Grids gives the appearance of Virtual Shared Memory restricted to array variables.
Computations are based on global indexing. Communication patterns are derived at runtime,
and the appropriate send and receive messages are automatically generated by P++. In contrast to Virtual Shared Memory, where the operating system does the communication without having information about the algorithm's data and structure, the array syntax of P++ provides the means for the user to express details of the algorithm and data structure to the compiler and runtime system. This guarantees that the number of communications and the amount of communicated data are minimized. Through the restriction to structured grids, the same kind and amount of communication as with the explicit Message Passing model is sent/received and, therefore, approximately the same efficiency is achieved. This is a tremendous advantage over the more general Virtual Shared Memory model. The amount and size of communication are further minimized by the ability of the user to override the default partitioning. Specifically, Virtual Shared Grids allow the treatment of partitioning as a parameter in the optimization. This is an important feature of the VSG model, since it permits serial applications to be run correctly and to exploit data parallelism inherent in their array expressions, and so still permits the auxiliary description of data organization after the code is running. This greatly simplifies decisions about data partitioning, which is the singular additional degree of freedom in the design of the data parallel implementation. Note that the data parallel implementation might not be sufficient, but it is a common component in the design of numerical software. For example, data parallelism is the principal part of the model for single grid iterative solvers (including multigrid solvers), but is not sufficient for optimal parallel performance using composite grid solvers (including FAC, AFAC, and AFACx), because additional functional parallelism can be exploited in these more sophisticated solvers.
There are two basic communication models that are currently implemented in P++ (how these models interact is described in more detail in the examples in section 3.5.10):

VSG Update: The Owner Computes Rule is a common rule for the evaluation of expressions in
the parallel environment. It actually has several conflicting definitions, but generally means that an owner is defined (usually by the processor owning the Lhs of a Lhs = Rhs expression) and the required Rhs arguments are sent to the "owning" processor. Finally, the relevant part of the expression is evaluated on the owning processor (no communication is required for the evaluation step, but there is potentially more communication required to send the relevant parts of the Rhs to the owner). In the implementation of the communication model for the general Virtual Shared Grids concept, this classical Owner Computes rule is restricted. Instead, what might be applied to the whole expression is applied instead to the binary subexpression, where the Owner is defined arbitrarily to be the left operand. This simple rule handles the communication required in a parallel environment; specifically, the designated owner of the left operand receives all parts of the distributed array necessary to perform the given binary operation (without further communication). Thus, the temporary result and the left operand are partitioned similarly (see figure 9).

Overlap Update: In order to provide high performance for the broad range of iterative methods that would use the VSG programming model, as a specific optimization there is an implemented nearest neighbor access to array elements through the widely used technique of grid partitioning with overlap (currently restricted to width one; see figure 10). In this way, the most common type of array accesses can be handled by complicated expressions where communication in the parallel environment is limited to one overlap update that occurs in the definition of the "=" sign defined (overloaded) for the arrays. The necessity of updating the overlapping boundaries, based on whether the overlap has been modified after the preceding update, is detected at runtime. Thus, communication in the parallel environment is minimized.
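The restricted Owner Computes rule for a binary subexpression T = B + C can be sketched as follows. This is an illustrative miniature, not P++ code: processors are simulated as an ownership map over a single global vector, and a counter stands in for the send/receive traffic a real VSG Update would generate.

```cpp
#include <cstddef>
#include <vector>

// A "distributed" 1-D array, simulated in one address space:
// owner[i] names the processor holding element i.
struct DistArray {
    std::vector<double> data;
    std::vector<int> owner;
};

static int elements_communicated = 0;  // simulated VSG Update traffic

// Evaluate t = b + c under the restricted rule: the left operand owns
// the result, so owner(t[i]) = owner(b[i]), and any c[i] living on a
// different processor would first have to be shipped to b's owner.
DistArray vsg_add(const DistArray& b, const DistArray& c) {
    DistArray t;
    t.owner = b.owner;                      // temporary partitioned like Lhs operand
    t.data.resize(b.data.size());
    for (std::size_t i = 0; i < b.data.size(); ++i) {
        if (c.owner[i] != b.owner[i])       // c[i] is off-processor:
            ++elements_communicated;        // would trigger a send/receive
        t.data[i] = b.data[i] + c.data[i];  // evaluated on b's owner
    }
    return t;
}
```

Note how the sketch mirrors the text: the result inherits the left operand's partitioning, and only the mismatched pieces of the right operand generate communication.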
Virtual Shared Grids are constructed in a distributed fashion across the processors
P++ user level:   A = B + C      (A stored on P3; B and C distributed over P1 and P2)

P++ internal execution:
1. T = B + C
   P1: T11 = B11 + C1; receive C21 from P2; T12 = B12 + C21
   P2: T2 = B2 + C22; send C21 to P1
   P3: idle
2. A = T
   P1: send T1 to P3
   P2: send T2 to P3
   P3: receive T1 from P1; receive T2 from P2; A = T

Figure 9: An example of VSG Update based on the Owner Computes rule: B + C on 3 processors.

Figure 10: The standard method of grid partitioning with overlap.
of the parallel system. Partitioning information and processor mapping are stored in a partitioning table (part of the Data_Manager object). This partitioning table basically contains the processor numbers that own the Virtual Shared Grids, and the local and global positions and sizes of the partitions. Functions are made available that abstract the details of access queries of the table's data. All information required for evaluating expressions using VSG is expressed through the array syntax and the use of the partitioning table. The number of entries in the table is reduced by grouping associated grids and providing applications support for storage of only the required data on a processor by processor basis. This is necessary due to the large sizes that these tables can reach in massively parallel systems. The table is efficiently implemented through a combination of arrays and linked lists. Thus, all necessary global information can be made locally available (even on massively parallel systems of 1000+ processors). Thus, access to global information about the partitioned arrays requires no communication and contributes insignificantly to memory overhead. A simple mechanism is provided to interrogate the communication pattern between pairs of VSGs at runtime. This mechanism looks at the availability of data in the adjacent processors that are required to satisfy the specific instance of the distributed binary operation. If necessary, it triggers communication to send or receive the required pieces of the array operands (VSG Update) on the basis of the Owner Computes rule. In this way, whole buffers of data are known at runtime to be required, and no sophisticated loop analysis is needed to recover this information. Thus, there is no need for costly element by element communication. Compiler analysis would be required to do such analysis, since it would be prohibitively expensive at runtime.
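A partitioning table of the kind described above might look like the following sketch; the structure and field names are assumptions for illustration (the real table also groups grids and mixes arrays with linked lists), but it shows why ownership queries need no communication: every processor can hold the whole table locally.

```cpp
#include <vector>

// One partition of one distributed grid: the owning processor and the
// global index range it holds (1-D for simplicity).
struct Partition {
    int processor;
    int global_start;   // first global index held
    int global_end;     // last global index held (inclusive)
};

struct PartitionTable {
    std::vector<Partition> parts;   // one entry per partition

    // Which processor owns global index i? (-1 if unowned)
    int owner_of(int i) const {
        for (const Partition& p : parts)
            if (p.global_start <= i && i <= p.global_end)
                return p.processor;
        return -1;
    }

    // Translate a global index to a local offset on its owner.
    int local_offset(int i) const {
        for (const Partition& p : parts)
            if (p.global_start <= i && i <= p.global_end)
                return i - p.global_start;
        return -1;
    }
};
```

Given two such tables, the runtime can compute, for a binary operation, exactly which global ranges must be shipped between which processors, which is the buffer-level communication pattern the text describes.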
In addition to the two basic communication models of VSG Update and Overlap Update, possible enhancements include:
Communication Pattern Caching (CPC): This would permit the communication patterns, once computed for an array statement, to be cached and recovered for further use. Thus, the determination of the array partition relationships (communication pattern) specific to a particular array statement's indexing could be handled with minimal overhead. Note that CPC could be used across multiple array statements since, within an application, we expect that many different array statements would require, and thus could reuse, the same communication patterns.

Deferred Evaluation: This more complicated evaluation rule allows for significant optimization of the given expression, so that communication associated with correct evaluation in the parallel environment can be minimized. In a vector environment, the given operations that form the array expression are optionally collapsed to form aggregate operators, and the aggregate operators' implementation is provided in FORTRAN, so that the expression can fully exploit the vector hardware (e.g., through chaining). The use of deferred evaluation of the array statements (also called lazy evaluation) permits the full expression to be known at runtime before evaluation. The evaluation can even be deferred across many array statements (problems are encountered across conditional statements that have dependencies on the results of the deferred array statements, though this might be solved through some compiler support for deferred evaluation). Currently, this principle has been implemented but not yet evaluated for single nodes of vector machines (e.g., a Cray YMP, in collaboration with Sandia National Laboratories). In particular, the efficient use of chaining and the optimization of memory access are addressed. It is planned to further pursue this approach and fully exploit it within the VSG programming model of P++.
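Deferred evaluation of this kind can be sketched with the expression-template idiom in C++. This is an illustrative miniature under assumed names, not the P++ implementation: operator+ builds a lightweight expression node instead of computing anything, and the whole expression is evaluated in a single fused loop only when it is assigned.

```cpp
#include <vector>

struct Vec;

// A deferred node representing a + b; no arithmetic happens at build time.
struct AddExpr {
    const Vec& a;
    const Vec& b;
    double eval(int i) const;   // element i of a + b, computed on demand
};

struct Vec {
    std::vector<double> data;
    explicit Vec(int n, double v = 0.0) : data(n, v) {}
    // Assignment triggers evaluation: one loop, no temporary array.
    Vec& operator=(const AddExpr& e) {
        for (int i = 0; i < (int)data.size(); ++i)
            data[i] = e.eval(i);
        return *this;
    }
};

inline double AddExpr::eval(int i) const { return a.data[i] + b.data[i]; }

// Building the expression does no work yet.
inline AddExpr operator+(const Vec& a, const Vec& b) { return AddExpr{a, b}; }
```

Because the full right-hand side is known before any element is computed, the assignment point is where a runtime system could choose an aggregate FORTRAN kernel, schedule communication once, or (with CPC) reuse a cached pattern.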
While there are other complicated reasons for the use of Deferred Evaluation, those mentioned above are only some uses specific to the vector environment. Other uses include the deferred evaluation over large numbers of statements and the grouping
of blocks of these statements, based on runtime dependency analysis, so that each block can be executed independently. Such a system would permit the recognition of functional levels of parallelism at runtime. Note that runtime dependency analysis would be based on the use of hash tables and collision detection within these separate hash tables for the Lhs and Rhs operands in each array statement.

3.5.6 The Object-Oriented Design of P++

The basic structure of the P++ code consists of multiple interconnected classes. The overall structure and the interconnections are illustrated in figure 11. The following types of objects are common to the M++ array interface (see also section 3.5.7) and within P++ form a similar user interface:

<type>_VSG_Array: specifies the type of array elements (currently restricted to float (64 bit) and integer). Dimensional independence up to four dimensions is realized. 1D default partitioning information is stored in the Data_Manager.

Index: Only simple index objects are provided. Each stores: Position, Count, and Stride. Member functions for index objects include set and arithmetic operations. Some typical examples for the use of indexes are addressing grid boundaries or interiors.

The <type>_VSG_Array internally uses the M++ array object <type>Array and copies most of the interface member functions of the M++ array object <type>Array. In this way, the <type>_VSG_Array uses the same member functions as <type>Array. So the interface is the same, and the numerical software developed in the serial environment executes in a data parallel mode in the multiprocessor environment. The following object is specific to the P++ interface (see also section 3.5.7) and is also seen by the user:

Optimization_Manager: User control for details of parallel execution.
[Figure omitted: diagram of interconnected classes — M++ classes (Array, Index, ...), P++ classes (ParArray, Index, ...), Kernel, library-dependent function interface, Communication & Tools Library, Optimization Manager, Data Manager, and Diagnostics, arranged in user, machine-independent, and machine-dependent levels.]
Figure 11: The object-oriented design of the P++ environment.
The following objects are hidden from the user and represent the notable features of the underlying organization:

Data_Manager: All partitioning data is stored in tables. Member functions allow interrogation of the Data_Manager to find the processor associated with any part of any object of type <type>_VSG_Array, etc.

Communication_Manager: These member functions are the only functions that allow access to send/receive and other primitives of the communications library and diagnostics environment (Intel, EXPRESS, PARMACS; see section 3.5.8). Access to constants relevant to parallel execution, e.g., the number of processors, is also available.

Diagnostic_Manager: This class has restricted flow trace capabilities (number of constructor/destructor calls) and also gathers statistical properties of the background communication (e.g., number of messages sent/received).

Parallel_I/O_Manager: Standard I/O functions overloaded with versions for use in the parallel environment (e.g., printf, scanf). Currently, all I/O is processed through a single processor in a master-slave relationship. File I/O is not handled, though it is a critical requirement of the large scale computational target applications. We hope to use existing file I/O packages for simplification.

3.5.7 The P++ User Interface

The P++ interface consists of a combination of the commercially available M++ array language interface and a set of functions for parallel optimization, the Optimization Manager. The P++ user interface provides for switching between M++ in the serial environment and P++ in the serial or parallel environment.
3.5.7.1 The M++ array class library

The commercially available M++ array class library (from Dyad Software Corp.) is used to simplify the software development. The M++ interface is modified only slightly; we consider the modifications to be bug fixes. The array interface provides a natural way to express the problem and data structure for structured grid applications; additionally, the syntax of the interface permits a natural exploitation of the parallelism represented within expressions used for structured grid applications (because no single execution ordering is assumed). By using M++ within P++, the details of serial vs. parallel interpretation of the array statements are separated. It is hoped that, since the internal restrictions to the structured grid work are mostly contained in M++, the move to support unstructured grids in the future will be separable and simplified.

The serial M++ interface allows dimensionally independent arrays to be dynamically created and manipulated with standard operators, and subarrays to be defined by indexing. In addition, it has optional array bounds checking. At runtime, an explicit loop hides the data's organization and operation structure, whereas the equivalent array expression may have its data's organization and operation structure interpreted. Importantly, within an array expression, there is no data dependency by definition. In fact, the array language represents a simplification in the design, implementation, and maintenance of structured grid codes. The functionality of the M++ interface is similar to the array features of FORTRAN 90 (see figure 12). In the current version of P++, only a restricted set of data types (integer and float arrays) is implemented. However, complete sets of operators are provided.
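The Index triple (Position, Count, Stride) and the no-data-dependency rule for array expressions can be illustrated in a few lines. This is a toy sketch, not the M++ implementation: the right-hand side is evaluated into a temporary before the left-hand side is written, which is what makes "U(I) = U(I) + U(I+1)" well defined:

```cpp
#include <cassert>
#include <vector>

// Illustrative Index object: Position, Count, Stride, as described above.
struct Index {
  int position, count, stride;
  Index(int p, int c, int s = 1) : position(p), count(c), stride(s) {}
};

// Evaluate "B(I) + B(I shifted by `shift`)" with array-expression
// semantics: the whole Rhs is computed into a temporary first, so no
// execution ordering over the index range is assumed.
std::vector<double> indexed_sum(const std::vector<double>& b, Index i,
                                int shift) {
  std::vector<double> tmp(i.count);
  for (int k = 0; k < i.count; ++k)
    tmp[k] = b[i.position + k * i.stride] +
             b[i.position + k * i.stride + shift];
  return tmp;
}
```

For example, `indexed_sum({1,2,3,4}, Index(0,3), 1)` yields the three sums of adjacent elements.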
We feel that the choice of C++ as the programming language and M++ as an array interface, made to provide as much information about the problem and structure of the data as possible, is strategic in providing a solid base for the parallel array language support for the target numerical problem class. It is especially strategic for support of parallel adaptive mesh refinement, since the adaptive nature of the application means insufficient information
#include "header.h"
#ifdef PPP
#define doubleArray double_VSG_Array
#define Index VSG_Index
#endif

void MacCormack (Index I, double Time_Step, doubleArray &F,
                 doubleArray &U_NEW, doubleArray &U_OLD)
{
    // array expression:
    F = (U_OLD * U_OLD) / 2;
    // scalar expression:
    U_NEW (0) = U_OLD (0) - Time_Step * (F (1) - F (0));
    // indexed array expression:
    U_NEW (I) = U_OLD (I) - Time_Step * (F (I+1) - F (I));
    // array expression:
    F = (U_NEW * U_NEW) / 2;
    // indexed array expression:
    U_NEW (I) = 0.5 * (U_OLD (I) + U_NEW (I))
              - 0.5 * Time_Step * (F (I) - F (I-1));
}

void main ()
{
    int N;
    double Time_Step;
    scanf (&N, &Time_Step);
    doubleArray U_OLD (N), U_NEW (N), F (N);
    // Setup data ...
    int Interior_Start_Position = 1;
    int Interior_Count = N - 2;
    int Interior_Stride = 1;
    Index Interior (Interior_Start_Position, Interior_Count, Interior_Stride);
    MacCormack (Interior, Time_Step, F, U_NEW, U_OLD);
}

Figure 12: C++ / M++ / P++ example code: MacCormack (hyperbolic) scheme.
at compile time about the partitioning of the data (the composite grid). Such partitioning is defined only at runtime, so the communication patterns must be interpreted then. Similar runtime support is an accepted and required part of any attempt to provide parallel compiler support for complex applications, though, with compiler support, additional efficiency might be possible even for the runtime support (such issues are under investigation; see [2], [7], [11], [12], [24], [25]).

3.5.7.2 The P++ Optimization Manager

The Optimization Manager allows for override of defaults for user control of partitioning and communication: array-to-processor mappings, communication models of Virtual Shared Grids, the parallel I/O system, etc. Optimizations of this kind have significance only in a parallel environment. The Optimization Manager is the only means by which the user can affect the parallel behavior of the code. The Optimization Manager provides a consistent means of tailoring the parallel execution and performance. It provides the user with four types of partitioning for the array (grid variable) data:

Default partitioning: This involves even partitioning of each grid variable across all nodes, based on 1D partitioning of the last array dimension (see figure 2).

Associated partitioning: Grid variables are partitioned consistently with others. This strategy provides for same size or coarser grid construction in multigrid algorithms, but also has general use.

User defined partitioning: A mapping structure is used to construct user defined partitioning.

Application based partitioning: This allows for the introduction of user specified load balancing algorithms to handle the partitioning of one or more specified grid variables.

Currently, the functionality of the Optimization Manager is restricted in its support for the above partitioning strategies as required for the examples in section 3.5.10.
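As an illustration of the default strategy, the node that owns a given global index under even 1D partitioning can be computed as below. This is a hypothetical helper for exposition; P++ itself answers such queries from the Data_Manager's partitioning tables:

```cpp
#include <cassert>

// Even 1D partitioning of n_elements across n_procs nodes: each node gets
// floor(n/p) elements, and the first (n mod p) nodes get one extra.
// Returns the rank owning global index `global_index`.
int owner_of(int global_index, int n_elements, int n_procs) {
  int base = n_elements / n_procs;
  int extra = n_elements % n_procs;
  int cutoff = (base + 1) * extra;  // elements held by the "big" nodes
  if (global_index < cutoff) return global_index / (base + 1);
  return extra + (global_index - cutoff) / base;
}
```

For 10 elements on 4 nodes, nodes 0 and 1 hold 3 elements each and nodes 2 and 3 hold 2 each.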
3.5.8 Portability and Target Architectures of P++

The target architectures of P++ are all existing and evolving distributed memory multiprocessor architectures, including the Intel PARAGON and iPSC/860, the Connection Machine 5, the nCUBE 2, the coming distributed memory Cray machine, and all kinds of networks of workstations (such as of Suns or IBM RS6000s).

P++ requires only a C++ compiler or C++-to-C translator, which have begun to become widely available (e.g., the AT&T Cfront compiler), and a portable parallel communications library, such as PVM or EXPRESS. The current P++ implementation uses the Intel iPSC communications library. For the near future, however, it is planned to base P++ independently on the EXPRESS and PARMACS parallel communications environments, guaranteeing portability of P++ across all of the architectures for which these environments are available. Experience has shown that one or the other will be implemented on all machines of this type shortly after they become available. Due to implementations of PARMACS and EXPRESS for several shared memory architectures, P++ will also be available for this class of machine, which significantly simplifies support for shared memory machines. Since C++ is a superset of C and PVM, EXPRESS, and PARMACS support C, each can be used within P++. PVM, PARMACS, and EXPRESS are described in more detail:

PVM: PVM is a public domain programming environment for the development of parallel applications and provides a vendor independent layer for communication support. Its latest release is similar to the proposed Message Passing Interface standard (MPI).

EXPRESS: EXPRESS from ParaSoft Corp. is a programming environment for the development of parallel FORTRAN and C programs. EXPRESS is available for a variety of different distributed memory architectures (Intel iPSC, nCUBE, ...) and for networks
of workstations (Sun, IBM RS6000, Silicon Graphics, ...). In addition to allowing distributed memory codes to also run in a shared memory environment, it is also available for some shared memory multiprocessor architectures (Cray YMP, IBM 3090/AIX). Besides a communications library, it contains modules for parallel I/O, a graphics system, performance analysis tools, etc.

PARMACS: PARMACS (ANL/GMD Macros), which is a joint development of the German National Laboratory for Computer Science and Applied Mathematics (GMD) and Argonne National Laboratory [8], is marketed through PALLAS GmbH. PARMACS is a process programming model for FORTRAN, based on macros (expanded by a standard Unix macro expander). A C version is planned for the near future. PARMACS basically contains macros for process initialization, communication, etc., and is available for the Intel iPSC, nCUBE, Meiko, SUPRENUM, Parsytec Megaframe, and Sun and IBM RS6000 networks of workstations. In addition, implementations for the shared memory architectures Alliant FX 28, Sequent Balance 2000, and Encore Multimax exist. As the PALLAS Performance Analysis tools (PATools) are based on PARMACS, they also become available for use within P++.

Additional work must be done to support the new distributed memory machines with vector or superscalar nodes based on a vector processing model. This work requires incorporation of the P++ design with the recent work on vectorization of the C++ array language done in collaboration with Sandia National Laboratories. Additional optimization could be done and is planned, eventually, to support the shared memory class of machine by development of a shared memory specific version of P++.

3.5.9 Performance of P++

To date, the only running versions of P++ are on the iPSC/860, the iPSC parallel simulator (running on SUN workstations), and, in serial mode, the SUN and IBM PC. The performance of P++ on the actual hardware is dominated
by the single node performance, because no additional communication is added over the serial implementation, though for specific applications communication in the explicit Message Passing programming model for distributed memory architectures could be better optimized than that which P++ provides automatically. Such optimizations would involve the restriction of message passing, using knowledge about how several array expressions might access memory, or the timing (scheduling) of messages using knowledge of dependencies across several array expressions. The current implementation could be optimized by analysis over multiple loops, though such multiple array statement analysis is not presently a part of P++. The use of deferred evaluation is required before such work can be done. Then we could expect performance similar to that obtained with the explicit Message Passing programming model for distributed memory architectures, even in the case of highly optimized hand coded communication. Current message passing comparisons are relative to a more naive, not highly optimized, explicit message passing model that does not consider optimization across multiple array expressions. Additional multiple array expression optimization is possible using the P++ Optimization Manager's message passing control, but it is not automated within P++.

The M++ array library serves as a base to provide FORTRAN-like performance to P++, but its current performance is about 20%-100% of FORTRAN performance and degrades the single node performance of P++. It is the VSG principle, in combination with the Optimization Manager's capability of allowing the user to define efficient partitioning, that guarantees amounts of communication nearly identical to the explicit Message Passing programming model.
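The multiple-array-statement analysis discussed here could, for example, group deferred statements into independently executable blocks by hashing the Lhs and Rhs operand identities and detecting collisions, as sketched earlier in the deferred evaluation discussion. A minimal illustration, with a hypothetical statement representation that is not P++ code:

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// A deferred array statement, reduced to the ids of its written (Lhs)
// and read (Rhs) operands.
struct Statement {
  int lhs;
  std::vector<int> rhs;
};

// Greedily group consecutive statements into blocks whose members have no
// dependency on one another; a hash-table "collision" (write-after-write,
// write-after-read, or read-after-write) starts a new block.
std::vector<std::vector<Statement>> group_independent_blocks(
    const std::vector<Statement>& stmts) {
  std::vector<std::vector<Statement>> blocks;
  std::unordered_set<int> written, read;  // per-block hash tables
  for (const Statement& s : stmts) {
    bool collision = written.count(s.lhs) || read.count(s.lhs);
    for (int r : s.rhs)
      if (written.count(r)) collision = true;  // read-after-write
    if (collision || blocks.empty()) {
      blocks.push_back({});
      written.clear();
      read.clear();
    }
    blocks.back().push_back(s);
    written.insert(s.lhs);
    for (int r : s.rhs) read.insert(r);
  }
  return blocks;
}
```

Statements inside one block touch disjoint data and so could be scheduled together, with their communication batched.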
For many applications, better performance may be available, since the simplified P++ development environment allows for greater implementation effort to be spent on the selection of more advanced and faster computational algorithms. It is hoped that this will additionally help future P++ work to compete directly with
FORTRAN, though this is a current area of research. Performance is important, since without efficient execution of the application source code, the effects of the parallel or vector architecture are lessened or lost:

Single node performance: Steps have been taken to optimize the single node performance so that the P++ environment can be accurately assessed. For vector nodes and nodes that are most efficiently programmed through a vector programming model, this work has included application of the C++ array language class libraries on the Cray (through collaboration with Sandia National Laboratories). First results by Sandia National Laboratory concerning performance comparisons with FORTRAN are very promising. With an optimized C++ array class library, about 50%-90% of the FORTRAN vector performance was achieved for complete codes. The comparison is difficult because the realistic codes could not be readily implemented in FORTRAN to include the richer physics available in the C++ versions. In some cases, the C++ compiler optimization switches had to be turned off. Such problems are examples of incomplete and often immature C++ compilers, thereby hampering the comparison of FORTRAN and C++ on large realistic software. Similar performance has been demonstrated by P++ on the SUN Sparc, but only on those select P++ array statements chosen for initial evaluation and testing of P++, not complete codes.

Parallel system performance: Secondary to single node performance, parallel performance is mostly affected by the amount of communication introduced. The P++ VSG model optimizes this and introduces no more communication than the explicit Message Passing model. However, currently, no runtime analysis is done across multiple array statements, which might better schedule communication and possibly further minimize the required communication. Such further work would
require deferred evaluation (lazy evaluation). Additional performance could be obtained by caching of the communication patterns and their reuse in multiple array expressions. Such work has not been included in the P++ implementation, but it is a part of the P++ research and has been a part of several codes to test these ideas. Work done specific to optimized parallel evaluation of array expressions has been carried out for a number of relevant architectures in [14].

3.5.10 P++ Support for Specific Examples

Although P++ is dimensionally independent, most example applications have been 2D; however, a 3D multigrid code has been demonstrated. In figure 13, P++ is demonstrated with a partitioning developed to support multigrid.

3.5.10.1 Standard multigrid algorithms on rectangular domains

Multigrid is a commonly used computational algorithm especially suited to the solution of elliptic partial differential equations. In contrast to single grid methods, multigrid uses several grids of different scale to accelerate the solution process. The usual way to implement regular multigrid algorithms on a distributed memory system is based on the method of grid partitioning ([10], [34], [38]). The computational domain (grid) is divided into several subgrids that are assigned to parallel processors (see figure 2). The subgrids of the fine grids and the associated subgrids of the coarse grid are assigned to the same processor. Each multigrid component (e.g., relaxation, restriction, and interpolation) can be performed on a subset of the interior points of the subgrid independently (in parallel). Calculation of values at interior boundary points, however, needs the values from neighboring subgrids. Since distributed memory machines have no global address space, these values somehow must be made available.
Instead of transferring the values individually at the time they are needed, it is more efficient to have copies of neighboring grid points in the local memory of each processor. Each process contains an overlap area, which has to be updated after each algorithmic step. Because applying the algorithm to a small number of points per processor is inefficient, agglomeration is one of the strategies that can be used to consolidate the distributed application onto a smaller number of processors.

[Figure omitted: grid points partitioned across four processors P1-P4, with Overlap Update and VSG Update transfers indicated.]
Figure 13: The interaction of the Overlap Update and VSG Update concepts for standard multigrid partitioning with coarse level agglomeration.

Figure 13 shows the runtime support from P++ for the interpreted communication patterns of the solver and for the agglomeration strategy used in the parallelization of multigrid. Several variant strategies are possible, but the use of VSG reduces the details of their implementation to defining the fine and coarse grid partitioning. The particular VSG communication model, VSG Update (based on the Owner Computes rule) or Overlap Update, is chosen on the basis of data availability within the partitioned grid variables.

3.5.10.2 Multilevel local refinement algorithms on block structured grids

As a more complicated example of the flexibility of P++, we demonstrate some of the support within P++ for block structured local refinement ([22], [28], [29], [41], [42]). During the solution of partial differential equations on structured grids, local refinement allows for the solution complexity to depend directly on the complexity of the evolving activity. Specifically, regions local to problem activity are refined. The effect is to provide a discretization specifically tailored to an application's requirements.

Local refinement composite grid methods typically use uniform grids, both global and local, to solve partial differential equations. These methods are known to be highly
[Figure omitted: a composite grid over a rectangular domain with a curved front refined by grids 2 and 3 (first level) and 4, 5, and 6 (second level); Overlap Update, VSG Update, and BSG Interface Update transfers are indicated, along with the AFAC partitioning across processors.]
Figure 14: The interaction of the Overlap Update, VSG Update, and BSG Interface Update concepts for FAC and AFAC partitioning of a block structured locally refined grid.
efficient on scalar or single processor vector computers, due to their effective use of uniform grids and multiple levels of resolution of the solution. On distributed memory multiprocessors, such methods benefit from their tendency to create multiple isolated refinement regions, which may be effectively treated in parallel. However, they suffer from the way in which the levels of refinement are treated sequentially in each region. Specifically, in FAC, the finer levels must wait to be processed until the coarse-level approximations have been computed and passed to them; conversely, the coarser levels must wait until the finer level approximations have been computed and used to correct their equations.

In contrast, AFAC eliminates this bottleneck of parallelism. Through a simple mechanism used to reduce interlevel dependence, individual refinement levels can be processed by AFAC in parallel. The result is that the convergence factor bound for AFAC is the square root of that for FAC. Therefore, since both AFAC and FAC have roughly the same number of floating point operations, AFAC requires twice the serial computational time of FAC, but AFAC may be much more efficiently parallelized.

Specifically, the local refinement of geometries within regions of activity is not easily done using a single rectangular local refinement patch. In order to better represent geometries within local activity, we introduce block structured local refinement. The flexibility introduced by block structured local refinement equally applies to block structured global grids. Though elliptic solvers for block structured grids (beyond that of relaxation methods) are not provided for in the current work nor considered in the thesis, it is an important and interesting area of current work by the GMD and others. For example, in figure 6, the composite grid shows a rectangular domain with a curved front embedded within.
Such problems could represent oil reservoir simulation models with simple oil/water fronts or more complicated fluid flow problems with shock fronts. In this specific example, the front is refined with two levels; the first level is represented by grids 2 and 3, the second by grids 4, 5, and 6.
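The convergence tradeoff stated above, that AFAC's per-cycle convergence factor bound is the square root of FAC's and therefore about twice as many cycles are needed for the same error reduction, can be checked numerically with a small helper (illustrative only, not from the thesis codes):

```cpp
#include <cassert>
#include <cmath>

// Number of cycles needed to reduce an initial error of 1.0 below `tol`
// when each cycle multiplies the error by the factor `rho`.
int cycles_needed(double rho, double tol) {
  int n = 0;
  double err = 1.0;
  while (err > tol) {
    err *= rho;
    ++n;
  }
  return n;
}
```

With a FAC factor of 0.25, the corresponding AFAC bound is sqrt(0.25) = 0.5, and reaching the same tolerance takes exactly twice as many cycles; since the per-cycle work is roughly equal, this is the factor-of-two serial cost that AFAC's level-parallelism must recover.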
For the parallel environment using FAC, because of the sequential processing of the local refinement levels, the composite grid is partitioned as shown in figure 6. Note that solvers used on the individual grids make use of Overlap Updates provided automatically by P++. The intergrid transfers between local refinement levels rely on VSG Updates, also provided automatically by the P++ environment. Note that P++ support of the block structured local refinement is limited and does not include the block structured grid (BSG) Interface Update, which must be handled within the block structured grid code or library. Underlying support in the parallel environment for the BSG Interface Update is provided by either Overlap Update or VSG Update, or by a combination of the two. Support from P++ for a partitioning specific to AFAC is similarly provided. The different application specific partitioning (shown in figure 14) naturally invokes automated support from the P++ environment in a slightly different manner.

The use of an environment such as P++, which permits the implementation of the algorithms (in this case FAC and AFAC) in the parallel environment independent of the organization of their data, greatly simplifies the software development process, since it may be developed in a serial workstation environment where productivity is high and since each algorithm can reuse similar code. This is important because 99.9% of the two algorithms' implementations are similar, even though they are optimized using distinctly different organizations (partitionings) of the composite grids.

3.5.11 Related Research

To our knowledge, there is currently no study of architecture-independent programming environments that permit software developed specifically for serial machines to be run on parallel distributed memory architectures, and that use existing commonly available compilers.
The most important work done in related areas is as follows (apologies to anyone we might have missed):

Los Alamos National Laboratories' work on C++ for Scientific Computation [18]: Initial work was done relative to the use of C++ for large scale scientific
computation on vector computers. The work on WAVE++, a CFD code, details the advantages and disadvantages of the use of C++ for scientific computation in general and for vector environments in particular. More recent work has been done combining C++ object-oriented design with adaptive refinement for hypersonic applications.

Sandia National Laboratories' work on C++ vectorization [45]: This effort on array languages for C++ shows that 50-90% of the equivalent FORTRAN performance can be attained. Some of the largest laboratory applications codes there are being developed using C++, e.g., HALE++.

Paragon [13]: This is a parallel programming environment for scientific applications (mostly image processing) using Communication Structures. It is also based on C++ and contains concepts similar to P++, but is much more restrictive (though it has been demonstrated on a larger number of computers than P++). This is primarily due to the fact that the concept of communication structures is not as general and powerful as the concept of Virtual Shared Grids in allowing and expressing the view of the distributed memory as a global address space, restricted to specific objects. Additionally, indexing is cumbersome and generally restrictive compared to that of P++, which is borrowed from M++ (whose origins are in Gauss).

FORTRAN D [19]: This work develops the extensive list of different array partitionings done in FORTRAN D (based on FORTRAN 77) for use with FORTRAN 77 (and FORTRAN 90 (FORTRAN 90D)). It does not yet employ any concept similar to Virtual Shared Grids. Use of PARTI (see below) within FORTRAN 90D, however, is expected. This currently is a point of research.

PARTI (Parallel Runtime Tools at ICASE [7]): It provides runtime support for use with FORTRAN and contains clever means by which FORTRAN loops can be interrogated and existing data parallelism discovered and exploited at runtime.
PARTI is primarily focused on unstructured grids. As opposed to P++, the seamless introduction of the PARTI primitives requires compiler modifications.

SUPERB [48]: This is a semiautomatic parallelization tool for distributed architectures at the FORTRAN compiler level. The prototype developed within SUPRENUM is restricted to a very specific class of applications.

3.6 AMR++, an Adaptive Mesh Refinement Class Library

3.6.1 Introduction

AMR++ is a C++ class library that simplifies the details of building self-adaptive mesh refinement applications. The use of this class library significantly simplifies the construction of local refinement codes for both serial and parallel architectures. AMR++ has been developed in a serial environment using C++ and the M++ array class interface. It runs in a parallel environment because M++ and P++ share the same array interface. Therefore, AMR++ inherits the machine targets of P++ and, thus, has a broad base of architectures on which to execute. The efficiency and performance of AMR++ is mostly dependent on the efficiency of P++ and M++ in the parallel and serial environments, respectively. Together, the P++ and AMR++ class libraries separate the abstractions of local refinement and parallelism to significantly ease the development of parallel adaptive mesh refinement applications in an architecture independent manner.

The AMR++ class library represents work that combines complex numerical, computer science, and engineering application requirements. Therefore, the work naturally involves compromises in its initial development. In the following sections, the features and current restrictions of the AMR++ class library are summarized.

3.6.2 Block Structured Grids: Features and Restrictions

The target grid types of AMR++ are 2D and 3D block structured grids with rectangular or logically rectangular grid blocks.
On the one hand, they allow for a very good representation of complex internal geometries that are introduced through local refinement in regions with
increased local activity. This flexibility of local refinement block structured grids equally applies to global block structured grids that allow matching complex external geometries. On the other hand, the restriction to structures of rectangular blocks, as opposed to fully unstructured grids, allows for the application of the VSG programming model of P++ and, therefore, is the foundation for good efficiency and performance in distributed environments, which is one of the major goals of the P++ / AMR++ development. Thus, we believe that block structured grids are the best compromise between full generality of the grid structure and efficiency in a distributed parallel environment. The application class forms a broad cross section of important scientific applications.

In figure 15, the global grid is the finest uniformly discretized grid that covers the whole physical domain. Local refinement grids (level i + 1) are formed from the global grid (level i = 0), or recursively from refinement grids (discretization level i), by standard refinement with h_{i+1} = h_i / 2 (a refinement factor of two) in each coordinate direction. Thus, boundary lines of block structured refinement grids always match grid lines on the underlying discretization level.

The construction of block structured grids in AMR++ has some practical limitations that simplify the design and use of the class libraries. Specifically, grid blocks at the same level of discretization cannot overlap. Block structures are formed by distinct or connected rectangular blocks that share their boundary points (block interfaces) at those places where they adjoin. Thus, a connected region of blocks forms a block structured refinement grid. It is possible that one refinement level consists of more than one disjoint block structured refinement grid. In the dynamic adaptive refinement procedure, refinement grids can be automatically merged if they adjoin each other.
In figure 15(a), an example of a composite grid is illustrated: The composite grid shows a rectangular domain within which we center a curved front and a corner singularity. The grid blocks are ordered lexicographically; the first digit represents the level, the second
digit the connected block structured refinement grid, and the third digit the grid block. Such problems could represent the structure of shock fronts or multifluid interfaces in fluid flow applications: In oil reservoir simulations, for example, the front could be an oil-water contact front moving with time and the corner singularity could be a production well. In this specific example, the front is refined with two block structured refinement grids; the first grid on refinement level two is represented by grid blocks 2.1.1 and 2.1.2, and the second grid on level three by grid blocks 3.1.1, 3.1.2, and 3.1.3. In the corner on each of the levels, a single refinement block is introduced.

For ease of implementation, in the AMR++ prototype, the global grid must be uniform. This simplification of the global geometry was necessary in order to be able to concentrate on the major issues of this work, namely, to implement local refinement and self-adaptivity in an object-oriented environment. The restriction is no general constraint and can be more or less easily raised in a future version of the prototype. Aside from implementation issues, some additional functionality has to be made available: For implicit solvers, the resulting domain decomposition of the global grid may require special capabilities within the single grid solvers (e.g., multigrid solvers for block structured grids with adequate smoothers, such as inter-block line or plane relaxation methods).

The block structures in the current AMR++ prototype are defined only by the needs of local refinement of a uniform global grid. This restriction allows them to be cartesian. More complicated structures as they result from difficult noncartesian external geometries (e.g., holes or spliss points; see [37]) currently are not taken into consideration. An extension of AMR++, however, is principally possible.
The wide experience with general 2D block structured grids that has been gained at the GMD [37] could form a basis for these extensions. Whereas our work is comparably simple in 2D because no explicit communication is required, extending the GMD work to
PAGE 94
I I II l I I I I 81 30 problems is very complex, if not intractable. 3.6.3 Sonw Irnplernentation Issues In the following, some implementation issues are detailed. They also demonstrate the complexity of a proper and efficient treatment of block structured grids and adaptive refinement. AMR++ takes care of all of these issues, which would have to be handled explicitly if everything had to be handled at the application level. Di1nensional Independence and Dimensional Independence Indexing (DII): 'The irnplernentation of most features of AMR++ and its user interface is dirnensiona.lly independent. Being derived from user requirements, on the lowest level, the AMH++ prototype is restricted to 2D and 3D applications. This restriction can, however, be easily removed. One important means by which dimensional independence is reached is dimensionally independent indices (DII), which contain index information for each coordinate direction. On Lop of these Dli indices are index variants defined for each type of subblock region (interior, interior and boundary, boundary only, ... ). Convex regions only require iJ single DII, but nonconvex regions require multiple DII. For example, for addressing the boundary of a 3D block (nonconvex), one DII index is needed for each of the six planes. In order to avoid special treatment of physical boundaries, all index variants are deftned twice, including and excluding the physical boundary, respectively. All index variants, several of them also including extended boundaries (see below), are precomputed at the time when a grid block is allocated. A possible enhallcement, for efficiency, would permit them to be shared (cached), since they are independent of the array data. In the AMR++ user interface and in the top level classes, only index variants or indicators are used, thereby allowing a dimensionally independent forrnulation, except for the very low level implementation. 
Many low level operations, such as interpolation, are necessarily dependent on the problem dimension.

Implementation of block structured grids: The AMR++ grid block objects consist of the interior, the boundary, and an extended boundary of a grid block, as well as interface objects (links) that are formed between adjacent pairs of grid block objects. The links contain P++ array objects that do not hold actual data but serve as views (subarrays) of the overlapping parts of the extended boundary between adjacent grid block objects. The actual boundaries that are shared between different blocks (block interfaces) are complex structures that are represented in a list within each grid block object. Block interface objects are formally derived from the grid block objects, so that interfaces of interfaces are possible (and required for corners where grid blocks meet in two-dimensional problems), and interfaces of interfaces of interfaces are possible (where three-dimensional blocks meet at corners). For example, in 3D, interfaces between blocks are 2D planes, those between plane interfaces are 1D line interfaces, and, one step further, those between line interfaces are points (zero-dimensional). In figure 15(b), grid blocks 2.1.1 and 2.1.2 of the composite grid in figure 15(a) are depicted, including their block interface and their extended boundary. The regular lines denote the outermost line of grid points of each block. Thus, with an extended boundary of two, there is one line of points between the block boundary line and the dashed line for the extended boundary. In its extended boundary, each grid block has views of the values of the original grid points of its adjoining neighboring block. This way it is possible to evaluate stencils on the interface and, with an extended boundary width of two, to also define a coarse level of the block structured refinement grid in the multigrid sense.

Data structures and iterators: In AMR++, the composite grid is stored as a tree of all refinement grids, with the global grid being the root.
Block structured grids are stored as lists of blocks (for ease of implementation; collections of blocks would be sufficient in most cases). In figure 15(c), the composite grid tree for the example composite grid in figure 15(a) is illustrated. The user interface for doing operations on these data structures is a set of iterators. For example, for an operation on the composite grid (e.g., zeroing each level or interpolating a grid function to a finer level), an iterator is called that traverses the tree in the correct order (preorder, postorder, no order). This iterator takes as arguments the function to be executed and two indicators that specify the physical boundary treatment and the type of subgrid to be treated. The iteration starts at the root and recursively traverses the tree. For doing an operation (e.g., Jacobi relaxation) on a block structured grid, iterators are available that process the list of blocks and all block interface lists on each grid, calling the object member function passed as a parameter to the iterator. Iterators are provided for all the relevant AMR++ objects, and so allow simplified internal processing as required for the composite grid solvers. The use of the iterators is not specific to the algorithms currently implemented in AMR++ (FAC, AFAC, and AFACx), and is intended to be a general interface for other algorithms as well.

3.6.4 Object-Oriented Design and User Interface

The AMR++ class libraries are customizable by using the object-oriented features of C++. For example, in order to obtain efficiency in a parallel environment, it may be necessary to introduce alternate iterators that traverse the composite grid tree or the blocks of a refinement region in a special order. However, the use of alternate iterators does not change the serial code that uses them, but allows the P++ operations on different composite grid levels to run concurrently. In this way, the data parallel model of P++ is mixed with the tasking parallel model, which can be supported either through C++ tasking libraries or compiler supported extensions.
The same is true for alternate composite grid cycling as, for example, needed in AFAC as opposed to FAC algorithms (section 2.2). (Grunwald at the University of Colorado at Boulder has developed such parallel tasking class libraries; Carl Kesselman at Caltech has developed CC++, which provides tasking support as part of a simple C++ language extension.)
Application specific parts of AMR++, such as the single grid solvers or criteria for adaptivity, which have to be supplied by the user, are also simply specified through substitution of alternate base classes: a preexisting application (e.g., problem setup and uniform grid solver) uses AMR++ to extend its functionality and to build an adaptive mesh refinement application. Thus, the user supplies a solver class and some additional required functionality (refinement criteria, ...) and uses the functionality of the highest level AMR++ ((Self_)Adaptive_)Composite_Grid class to formulate his special algorithm or to use one of the supplied PDE solvers. In the current prototype of AMR++, FAC and AFAC based solvers (section 2.2) are supplied. If the single grid application is written using P++, then the resulting adaptive mesh refinement application is architecture independent, and so can be run efficiently in a parallel environment.

The design of AMR++ is object-oriented and the implementation of our prototype extensively uses features like encapsulation and inheritance: the abstraction of self-adaptive local refinement, which involves the handling of many issues, including memory management, interfaces for application specific control, dynamic adaptivity, and efficiency, is reached by grouping these different functionalities into several interconnected classes. For example, memory management is greatly simplified by the object-oriented organization of the AMR++ library: issues such as lifetime of variables are handled automatically by the scoping rules of C++, so memory management is automatic and predictable. As the AMR++ interface is object-oriented, control over construction of the composite grid is intuitive and natural: the creation of composite grid objects is similar to the declaration of floating point or integer variables in procedural languages like FORTRAN and C.
Users basically formulate their solvers by allocating one of the predefined composite grid solver objects, or by formulating their own solvers on the basis of the composite grid objects and the associated iterators and by supplying the single grid solver class (object).

Although not part of the current implementation of AMR++, C++ introduces a template mechanism in the latest standardization of the language (AT&T version 3.0), which is only now starting to appear in commercial products. The general purpose of this template language feature is to permit class libraries to use user specified base types. For AMR++, for example, the template feature could be used to allow the specification of the base solver and adaptive criteria for the parallel adaptive local refinement implementation. In this way, the construction of an adaptive local refinement code from the single grid application on the basis of the AMR++ class library can become even simpler and more cleanly implemented. There is no shortage of obscure details associated with adaptive mesh refinement and its implementation, but these will not be discussed further; the interested reader is referred to [27] and [31].

3.6.5 Static and Dynamic Adaptivity, Grid Generation

In the current AMR++ prototype, static adaptivity is fully implemented. Users can specify their composite grid either interactively or by some input file: for each grid block, AMR++ needs its global coordinates and the parent grid block. Block structured local refinement regions are formed automatically by investigating neighboring relationships. In addition, the functionality for adding and deleting grid blocks under user control is available within the Adaptive_Composite_Grid object of AMR++.

Recently, dynamic adaptivity has been a subject of intensive research. First results are very promising and some basic functionality has been included in the AMR++ prototype: given a global grid, a flagging criteria function, and some stopping criteria, the Self_Adaptive_Composite_Grid object contains the functionality for iteratively solving on the actual composite grid and generating a new discretization level on top of the respective finest level.
Building a new composite grid level works as follows:
(1) The flagging criteria delivers an unstructured collection of flagged points in each grid block.
(2) For representing grid block boundaries, all neighboring points of flagged points are also flagged.
(3) The new set of grid blocks to contribute to the refinement level (gridding) is built by applying a smart recursive bisection algorithm similar to the one developed in [6]: if building a rectangle around all flagged points of a given grid block is too inefficient, it is bisected in the longer coordinate direction and new enclosing rectangles are computed. The efficiency of the respective fraction is measured by the ratio of flagged points to all points of the new grid block; in the subsequent tests, 75% is used. This procedure is repeated recursively if any of the new rectangles is also inefficient. With the goal of building the rectangles as large as possible within the given efficiency constraint, the choice of the bisection point (splitting in halves is too inefficient because it results in very many small rectangles) is done by a combination of signatures and edge detection. The reader is referred to [6] or [27] for more details.
(4) Finally, the new grid blocks are added to the composite grid to form the new refinement level. Grouping these blocks into connected block structured grids is done the same way as in the static case.

This flagging and gridding algorithm has the potential for further optimization: the bisection method can be further improved, and a clustering and merging algorithm could be applied. This is especially true for refinement blocks of different parent blocks that could form one single block with more than one parent; internal to AMR++, this kind of parent/child relationship is supported. The results in section 4.6, however, show that the gridding already is quite good. The number of blocks that are constructed automatically is only slightly larger (< 10%) than a manual construction could deliver.

A next step in self-adaptive refinement would be to support time dependent problems whose composite grid structure changes dynamically with time (e.g., moving fronts).
In this case, in addition to adding and deleting blocks, enlarging and shrinking blocks must be supported. Though some basic functionality and the implementation of the general concept are already available, this problem has not yet been tackled further.

3.6.6 Current State and Related Work

The AMR++ prototype is implemented using M++ and the AT&T Standard Components II class library to provide standardized classes (e.g., linked list classes). Through the shared interface of M++ and P++, AMR++ inherits all target architectures of P++. AMR++ has been successfully tested on Sun workstations and on the Intel iPSC/860. Taking into account the large functionality of AMR++, there are still several insufficient aspects and restrictions, and a large potential for optimization in the current prototype (as already pointed out in the preceding description). Until now, AMR++ has been successfully used as a research tool for the algorithms and model problems described in the next two sections. However, AMR++ provides the functionality to implement much more complicated application problems.

Concerning parallelization, running AMR++ under P++ on the Intel iPSC/860 has proven its full functionality. Intensive optimization, however, has only been done within P++; AMR++ itself offers a large potential for optimization. For example, efficiently implementing self-adaptivity, including load (re)balancing in a parallel environment, requires further research. In addition, the iterators that are currently available in AMR++, though working in a parallel environment, are best suited for serial environments. Special parallel iterators that, for example, support functional parallelism on the internal AMR++ level would have to be provided.

To our knowledge, the combined P++/AMR++ approach is unique. There are several other developments in this area (e.g., [34]), but they either address a more restricted class of problems or they are still restricted to serial environments. However, important work at Los Alamos National Laboratory and Lawrence Livermore National Laboratory has addressed adaptive mesh refinement for SIMD and MIMD architectures, respectively, with explicit equation solvers.
3.7 Object-Oriented Design for Parallel Adaptive Refinement

Based on experiences with the first C language version of parallel adaptive refinement, the combined complexities of the application, adaptive mesh refinement, and parallelism were seen to limit additional features of the implementation. The requirements of block structured local refinement, self-adaptivity, and more complex applications were considered out of range. The relatively simplistic adaptive refinement code was roughly 16,000 lines and solved only the simple potential flow (principally, the Poisson equation) problem.
[Figure] Figure 15. Example of a composite grid, its composite grid tree, and a cutout of two blocks with their extended boundaries and their interface: (a) 3-level composite grid; (b) adjoining grid blocks 2.1.1 and 2.1.2, with grid block, extended boundary, and block interface marked; (c) composite grid tree.
CHAPTER 4

PARALLEL ADAPTIVE MESH REFINEMENT: RESULTS

4.1 Introduction

This chapter reports on comparative results for several facets of this thesis work. Specifically, we compare the two composite grid algorithms FAC and AFAC. These comparisons are a little incomplete since they exclude the newer AFACx, which would correspond to AFAC formulated using only relaxation on the refinement levels. A comparison of FAC and AFAC is nevertheless representative of the general case, since both FAC and AFAC can be formulated using relaxation of the composite grid refinement levels; we refer below to such schemes as FACx and AFACx, respectively.

We introduce the details of parallel performance of FAC and AFAC on several parallel computers. Some of these machines are no longer available, and the codes that were used to obtain these results can, in most cases, no longer be run. This awkward level of software reliance on specific hardware, even from a common computer vendor, was one dominant motivation for the development of P++ and AMR++. Such issues as architecture independence, even among the restricted class of parallel computers of the same basic architecture, were not common when these codes were developed, but they will become increasingly important in larger software development projects, and magnified on projects involving machines of increasingly different architectures. Such code might be required to have a lifetime spanning that of several parallel computers (which seem to have lifetimes limited to about 5 years).

Several different parallel machines are used for the accumulated results. The dominant factor in predetermination of the results is the computation/communication ratio,
since it factors out the relative computational advantage of some of the newer machines that have increased computational speed (MFLOPs) but retain the software layer within the communication network that connects processors.

We expect good vectorization capabilities of algorithms to be important now and in the future since, on many of the current and coming generation of distributed memory multiprocessor architectures, high performance can only be reached if the programming model is based on vector operations. We have in mind that the i860 is not a vector processor, yet one only achieves high performance on it if the compiler is based on a vector model, which most of the currently available compilers are not. With this in mind, we note that some parallel algorithms perform differently across machines because of the different computation/communication cost ratios. Thus, to fully test the AFAC algorithm, it is important to test FAC and AFAC in a controlled way, under a variety of computation/communication cost ratios. Therefore, we test and compare AFAC and FAC on the SUPRENUM in both scalar and vector modes and on three earlier generations of Intel machines (iPSC/1, iPSC/2, and iPSC/860). It is interesting, and important, that in vector mode we observe distinctly better efficiency for AFAC than for FAC.

It is likely that many algorithms are sensitive to the computation/communication cost ratio. The results show that, for large problems, FAC is sensitive to this ratio and that, for these same problems, AFAC is not. This is a distinct advantage of AFAC, which should make it appropriate across a wide class of parallel architectures, especially on parallel machines with fast vector floating point performance or correspondingly slow communication (message startup and transfer rate).
The latter architectures with relatively slow communication rates are increasingly common among more recently introduced parallel computers, because message passing has consistently involved a software layer in its implementation. This software layer is an especially inherent property of portable communication libraries, which are required for advanced parallel software (in order to amortize the software development across multiple architectures).
To simplify the code that implements the adaptive AFAC and FAC algorithms in both scalar and vector modes, we solve a relatively simple Poisson problem. This permits the addition of complexity in the evaluation of the code and the exploration of alternate parallel partitioning strategies. This thesis details these alternate parallelization strategies and compares AFAC and FAC under a variety of situations. We choose two composite grids of different structural complexity to show the effects of the composite problem domain on these two algorithms when run on different numbers of processors and grid sizes. For simplicity and focus, we restrict each level of refinement to the same size, although this is not a restriction in the code. Under these varying parameters (composite grid complexity, number of processors, and size of the uniform composite grid levels), we compare the effects of different computation/communication cost ratios on FAC and AFAC on different distributed memory multiprocessor architectures. Section 2.2 describes the FAC and AFAC algorithms, section 3.2 details the FAC and AFAC codes, section 4.4 reports on the comparison of FAC and AFAC on the target architectures, and section 4.4.7 summarizes the conclusions of these comparisons.

Additional results indicate the effects of dynamic adaptivity of the composite grid in the parallel environment. These results are compared to the relative computational time of a single AFAC iteration. They are specific to the Intel iPSC/1, but present performance relative to an AFAC iteration. Thus, the results are representative of a machine with the iPSC/1's communication/computation ratio, which is mostly unchanged even in more recent computer designs that use both faster processors and communication hardware.¹ The performance of manipulating the composite grid is investigated.
This manipulation includes the addition and deletion of local refinement regions, the required load balancing after addition or deletion of local refinement, and the repositioning (moving) of local refinement within the composite grid.

¹Communication is still buried in a software layer, and this greatly affects its performance, except notably on the Transputer.
4.2 Comparison of Convergence Factors

The principal result shows the difference between the convergence factor bounds of FAC and AFAC. Convergence theory in [40] shows that the convergence rates of FAC and AFAC are related by

    |||M_AFAC||| = |||M_FAC|||^(1/2),

where M_FAC and M_AFAC denote the respective iteration operators. Although the theory is restricted to a 2-level composite grid, these results have been experimentally verified to hold even on very large numbers of levels (a specific test verified this property of the convergence rate on a 50-level composite grid). Though the algorithmic components in our code are chosen slightly differently than for the convergence analysis, experience shows that very similar behavior is obtained. This implies that two cycles of AFAC reduce the error by roughly the same factor as one cycle of FAC.
                          5-point stencil        9-point stencil
    Convergence factor ρ  h=1/64    h=1/512      h=1/64    h=1/512
    Poisson   MGV         0.08      0.08         0.02      0.03
              FAC         0.11      0.11         0.10      0.10
              FACx        0.13      0.13         0.11      0.11
              AFAC        0.35      0.35         0.33      0.33
              AFACx       0.35      0.35         0.33      0.33

Table 1: Convergence factors for Poisson's equation on the simple composite grid.²

²Since in this case we do not use adaptive criteria based on the solution, the intermediate composite grid solve is extraneous.

Table 2 shows the convergence rates of AFACx and FACx (compared to MGV on the global grid) using the more complex block structured local grids (figure 22). Note that the block structured grids inhibit the use of FAC and AFAC, since the block structured coarsenings required for their definition are not easily constructed.

                          5-point stencil        9-point stencil
    Convergence factor ρ  h=1/64    h=1/512      h=1/64    h=1/512
    Poisson   MGV         0.08      0.08         0.02      0.03
              FACx        0.17      0.18         0.11      0.11
              AFACx       0.40      0.41         0.33      0.33

Table 2: Convergence factors for Poisson's equation on the block structured composite grid.

4.2.2 Convergence Factors for a Singular Perturbation Problem

Numerical results (details are found in [33]) have been obtained for the model problem

    -ε Δu + a u_x + b u_y = f   on Ω = (0,1)²

with homogeneous Dirichlet boundary conditions on ∂Ω (except on the boundary where we force a boundary layer) and ε = 0.00001. This problem serves as a good model for complex fluid flow applications, because several of the properties that are related to self-adaptive mesh refinement are already present in this simple problem. The equation is discretized using isotropic artificial viscosity (diffusion):

    ε_h = max{ε, β h max{|a|, |b|}/2}.
For comparison, some results with a nine point stencil discretization corresponding to bilinear finite elements for the Laplace operator have also been obtained. The discrete system is solved by multilevel methods: MGV on the finest global grid, and FAC or AFAC on composite grids with refinement. For standard multigrid methods, it is known that, with artificial viscosity, the two-grid convergence factor (spectral radius of the corresponding iteration matrix) is bounded below by 0.5 (for h → 0). This leads to multilevel convergence factors that tend to 1.0 with increasing number of levels (e.g., [20]). For many more details, see [27]. In [16], a multigrid variant that shows surprisingly good convergence behavior has been developed: convergence factors stay far below 0.5 (with three relaxations on each level). Here, essentially this method is used, with the following components: discretization with isotropic artificial viscosity using β_m = 3 on the finest grid m and β_l = (β_{l+1} + 1/β_{l+1})/2 for the coarser grids l = m-1, m-2, ..., together with standard MG components (odd-even relaxation, full weighting, and bilinear interpolation). Anisotropic artificial viscosity may also be used, but generally requires (parallel) zebra line relaxation, which has not yet been fully implemented. For FAC and AFAC, the above MG method with V(2,1) cycling is used as a global grid solver. On the refinement levels, three relaxations are performed. In table 3, several convergence rates for FACx, AFACx, and MGV are shown for the example equation. The finest grids have mesh sizes of h = 1/64 and h = 1/512. For FAC and AFAC, the global grid has the mesh size h = 1/32, and the (predetermined) fine block always covers 1/4 of the parent coarse block along the boundary layer.
                                   5-point stencil        9-point stencil
    Convergence factor ρ           h=1/64    h=1/512      h=1/64    h=1/512
    Poisson            MGV         0.08      0.08         0.02      0.03
    a=b=0              FACx        0.33      0.33         0.10      0.10
    β=1, ε=1           AFACx       0.41      0.41         0.31      0.32
    SPP                MGV         0.14      0.30         0.19      0.48
    a=b=1              FACx        0.65      0.66         0.60      0.70
    β=3, ε=0.00001     AFACx       0.67      0.67         0.60      0.75
    SPP                MGV         0.38      1.0          0.20      0.52
    a=b=1              FACx        0.65      0.66         0.53      0.53
    β=1.1, ε=0.00001   AFACx       0.65      0.70         0.55      0.70

Table 3: Convergence factors for a singular perturbation problem (SPP) and, for comparison, for Poisson's equation.

For MGV and the 5 point stencil, the results are as expected, while the 9 point stencil gives better but also deteriorating results. V cycles are used; W or F cycles would yield better convergence rates but worse parallel efficiency. The 9 point stencil for the Laplacian fulfills the Galerkin condition with respect to the level transfers used and shows better convergence rates than the 5 point stencil. Results for the composite grid problem are not as good as expected, but only the MGV scheme is specially treated for the singular perturbation equation.

4.3 Performance of AFAC

To study various properties of the AFAC code, two composite grid examples were used. The two examples were chosen to show different aspects of the implementation on both the 32-node iPSC/1 and the 16-node iPSC/2. One example is a simple seven-level composite grid with all constituent uniform grids of the same size (as in figure 16). The second example is a much more complicated composite grid with 40 constituent uniform grids, all of the same size (figure 19). Although this composite grid is nonphysical, its purpose is to show the flexibility of the algorithm and prototype code and to simulate in 2D the potential complexity and number of grids that could be present in 3D applications. This example is not actually a limiting case, since the composite grid for a problem posed in
3D could have even more grids and more levels of refinement. Also, the simulation of the 3D problem in 2D is not accurate, since the computational loads of the grids posed in 3D are much larger than those posed in 2D. However, the larger loads would be expected to parallelize much better than the smaller loads, since overhead could be more easily hidden.

Although two examples were chosen, because of the limited availability of the iPSC/2 and the test machine's limited memory, only the much smaller composite grid example was run on the iPSC/2. The tests on the iPSC/2, however, show its much better computation-to-communication balance over that of the iPSC/1. Both examples were run on the 32-node iPSC/1. Each level was an n×n grid, counting boundary grid lines, with n varying from 17 to 129. Results of timing these runs on the iPSC/2 for the simple example are depicted in table 4. Similarly, results for the complex example run on the iPSC/1 are shown in table 5. Additionally, we provide results, on the iPSC/1, for the grid manager functions necessary for adaptive work in parallel environments. Specifically, we detail the results of MLB (table 7), moving grids (table 8), and adding further refinement (table 9). Finally, we show the relative costs of these functions to that of solving the complex composite grid (table 10).

4.3.1 Simple Example on iPSC/2: In table 4, the displayed times for the iPSC/2 are the worst-case times for one AFAC cycle (including the individual times for the MGV solve, step 1, and the interlevel transfers, step 2) and for a D-dimensional cube with D ranging from zero to four. While the data for n = 65 and D = 4 is missing because of a hardware anomaly, for the case n = 129 only D = 4 data is depicted, because the iPSC/2's memory could not support the cases D < 4. Note the reasonably good speedup exhibited by AFAC, especially for the larger problems.
In particular, from a single node to an eight-node cube, speedups for n = 17, 33, and 65 were about five, six, and seven, respectively. A related observation is that cases with the same ratio of grid points to processors have nearly the same total times. Note also that
[Figure] Figure 16: Simple Composite Grid.
the sum of the solve and transfer times is often greater than the total time. This is because the times documented here are each worst-case over all processors.

4.3.2 Simple Example on iPSC/1: In table 5, the displayed times for the iPSC/1 are the worst-case times for one AFAC cycle and for a D-dimensional cube with D ranging from zero to five. No data is available for the case n = 129, since messages passed during intergrid transfers can exceed 16K in length, which is the maximum message size allowed on the iPSC/1. Here we see the effect of the more expensive communication costs on the iPSC/1. The result is that speedup is not as good as it is on the more balanced iPSC/2, though it is still quite good. While the transfer times show less improvement when parallelized than the solve times, the solvers are much more expensive computationally.

4.3.3 Complex Example on iPSC/1:
[Bar chart] Table 4: Timings for simple AFAC on iPSC/2. Total and transfer times in seconds for one AFAC cycle on D0 to D4 cubes, for grid sizes 17×17, 33×33, 65×65, and 129×129 (not enough memory on the smaller cubes for 129×129).
On a D5 cube, the speedup for the n = 17 grid is 15, while the speedup on the n = 65 grid is seven. This is due to the larger granularity of the parallelism present on each node. Since the shared grids are serviced only after the completion of access of the entire grid, the larger grids result in longer delays in processing the more expensive shared grids (more expensive because of the required communication of boundary values). The remedy is straightforward. As discussed in the section on data flow, a reduction of the grain size, presently tied to the grid size, would force the servicing of the shared grids more frequently. Thus, further partitioning of the large grids within a processor would make the parallelism finer and eliminate the waiting done by shared grids. With a finer granularity of parallelism in each processor, speedup should improve significantly through the larger cubes, particularly on larger problems.

4.4 Performance Comparison of FAC and AFAC Algorithms

The asynchronous processing of the composite grid in the parallel environment is only advantageous if doing so offsets the poorer numerical convergence factor of AFAC compared to FAC. In this section, we study this issue by examining and comparing the performance of AFAC and FAC in the parallel environment.

4.4.1 Parallelization for Distributed Architectures

A basic AFAC code for distributed memory multiprocessor architectures was developed to solve Poisson's equation on the unit square, using the finite volume element technique ([40]) for composite grid discretization and MGV for solution of each grid level problem. The code was later modified in a straightforward manner to permit FAC as well as AFAC implementations; the only change was the introduction of sequential processing of the composite grid levels through a reordering of the basic operations that originally formed the AFAC algorithm.
Both algorithms are so closely related that this in fact resulted in an efficient implementation of FAC.

4.4.1.1 Parallel Multigrid Solver: Standard Grid Partitioning

In the AFAC/FAC code, the Dirichlet problems on all patches and the global grid are solved by a
[Bar chart] Table 5: Timings in seconds for complex AFAC on iPSC/1 (grid sizes 17x17, 33x33, 65x65 on subcubes D0-D5; total and transfer times).
parallel full multigrid (FMG) solver that is truncated to the two finest levels. For simplicity, we call this truncated FMG solver a 2-step FMG cycle. Although the grid sequence for the standard multigrid solver must be processed sequentially, processing on a given grid can be done simultaneously across its points. This naturally leads us to consider standard grid partitioning [34], where each processor works on the fine grid and the associated parts of the coarse grid related to a specific subdomain. Interprocessor communication, as necessary during residual computation or relaxation on a given grid, is done by exchange of data in so-called process overlap areas. Intergrid data movement, as necessary within the grid transfer operations such as interpolation, can be done totally inside the processors, and therefore does not require any interprocessor communication.

4.4.1.2 Parallelization of FAC: Linear Single-Level Grid Partitioning

Due to the necessity of sequential treatment of the sequence of levels, the parallelization strategy that we chose for FAC involves regular linear strip partitioning of each level independently over all processors. All processors share an equal-sized subpatch (strip) of each patch [23], which automatically leads to very good load balancing. If n is the number of processors and p the number of patches, then this partitioning leads to n·p subpatches and (n-1)·p process overlap areas (where internal boundaries are shared between processors). Interprocessor communication is done in the solver part as described in the previous section. Interlevel communication has to be done nearly exclusively over process boundaries. Concerning adaptivity, it is possible to add, move, or delete one or more patches at any time without redistribution or rebalancing of the workload with respect to other patches.
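The strip partitioning and the resulting subpatch/overlap counts can be sketched in a few lines (function names are illustrative, not from the thesis code):

```python
def strip_partition(num_rows: int, num_procs: int):
    """Split a patch's rows into num_procs contiguous strips of nearly
    equal size; adjacent strips share an internal boundary row (the
    process overlap area).  Returns (first_row, last_row) per processor."""
    base, extra = divmod(num_rows, num_procs)
    strips, start = [], 0
    for p in range(num_procs):
        size = base + (1 if p < extra else 0)
        strips.append((start, start + size - 1))
        start += size
    return strips

def fac_counts(num_patches: int, num_procs: int):
    """With every one of the p patches partitioned over all n processors,
    FAC produces n*p subpatches and (n-1)*p overlap areas."""
    return num_procs * num_patches, (num_procs - 1) * num_patches
```

For the 7-patch model problem on 4 processors this gives the 28 subpatches listed under FAC in Table 6.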
4.4.1.3 Parallelization of AFAC: Linear Multilevel Grid Partitioning

Because all levels can be processed simultaneously, global irregular multilevel strip partitioning of all patches across all processors based on a workload function is used. Each processor
shares an equal part of the rows contained in the sum of all patches. A contiguous number of rows belonging to the same patch and the same processor denotes a subpatch. The maximum number of subpatches is n + p - 1; the maximum number of overlap areas is n - 1. The implementation of AFAC allows asynchronous scheduling of multiple levels. Priority is given to shared grids that naturally require more communication. Additionally, boundary processing can be done before processing interior data so that exchange of boundary data can be overlapped with computation as much as the architecture allows (message latency hiding is not supported on SUPRENUM, but is on the Intel machines). As in FAC, interprocessor communication is done within the solver; but interlevel communication is done without respect to process boundaries. Initial load balancing can be done in a nearly optimal way based on a work function for the initial composite grid. Adaptively adding or deleting a patch, however, requires load rebalancing. This is done by the multilevel load balancer (MLB). Experience shows that this load balancer is highly efficient and requires only a small fraction of the time needed for the arithmetic phases of the algorithm even in the case of Poisson's equation [41]. For a detailed discussion of adaptive aspects, see [42].

4.4.2 Comparison of Interprocessor Communication in FAC and AFAC

In addition to the rough factor of two in convergence factor in favor of FAC, the influential parallel performance factors are the vectorization capabilities and the communication structure. Both algorithms are very vectorizable because standard vectorization and parallelization do not interact: vectorization is done along the rows of the levels (programming language C), while parallelization is done along the columns (thus partitioning along rows).
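The subpatch bounds quoted above for AFAC's multilevel partitioning follow from a simple counting argument, sketched here (the helper names are illustrative): cutting the concatenated row list at n - 1 processor boundaries and p - 1 patch boundaries yields at most n + p - 1 pieces.

```python
def afac_partition(patch_rows, num_procs):
    """Concatenate the rows of all patches into one global list, cut it
    into num_procs equal shares, and count the resulting subpatches: a
    maximal run of rows from one patch on one processor."""
    total = sum(patch_rows)
    cuts = {round(k * total / num_procs) for k in range(num_procs + 1)}
    patch_edges, acc = set(), 0
    for rows in patch_rows:          # cumulative patch boundaries
        acc += rows
        patch_edges.add(acc)
    boundaries = sorted(cuts | patch_edges)
    return len(boundaries) - 1       # number of subpatches

def afac_max_subpatches(num_patches, num_procs):
    # At most n - 1 interior processor cuts plus p - 1 patch boundaries
    # split the row list into n + p - 1 pieces.
    return num_procs + num_patches - 1
```

The bound reproduces the AFAC subpatch totals in Tables 6 and 7: 10 for p = 7, n = 4; 22 for p = 7, n = 16; 45 for p = 30, n = 16.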
Whereas the amount of arithmetic and degree of vectorization for FAC and AFAC are very similar, the number and size of the subpatches, i.e., the distribution of the patches on a parallel computer, differ significantly: for FAC, n·p subpatches are needed, while for AFAC, n + p - 1 subpatches are needed. As a result, the communication structure, the
number of messages, and the message lengths of the algorithms differ significantly. Tables 6 and 7 give an overview of the communication structure of both algorithms for three examples each. p denotes the number of patches, n the number of processors, nx the number of grid points in each direction on each patch, and Ps the number of subpatches (max Ps its per-processor maximum). Distinguishing between the solution and level transfer phases, the number of synchronization steps (communication steps, # steps), the maximum number of communications (# comms.), and the maximum amount of communicated data (in Kwords, # Kwords) are listed.

Simple composite grid        p = 7, n = 4     p = 7, n = 16    p = 7, n = 16
                             nx = 127         nx = 127         nx = 255
                             FAC     AFAC     FAC     AFAC     FAC     AFAC
Ps                           28      10       112     22       112     22
max Ps                       4       2        16      3        16      3
Solution:  # steps           728     104      728     104      840     120
           # comms.          4,368   624      21,840  3,120    25,200  3,600
           # Kwords          61      9        306     44       605     86
Transfer:  # steps           12      12       12      12       12      12
           # comms.          48      16       192     64       192     64
           # Kwords          48      48       48      48       195     195

Table 6. Communication structure analysis of FAC and AFAC for the simple 7-patch model problem.

In the solver phase, the number of communication steps, the number of communications, and the amount of communicated data are much larger for FAC than for AFAC. This is because these numbers depend linearly on the number of subpatches, which are more numerous due to each patch being partitioned over all processors in FAC. The number and amount of communications per step, per processor, and per processor and step, and the average message length can be easily computed from the given numbers. In the grid transfer phase (between levels), the amount of communicated data in
Complex composite grid       p = 30, n = 4    p = 30, n = 16   p = 30, n = 16
                             nx = 63          nx = 63          nx = 127
                             FAC     AFAC     FAC     AFAC     FAC     AFAC
Ps                           120     33       480     45       480     45
max Ps                       4       2        16      2        16      2
Solution:  # steps           2,640   88       2,640   88       3,120   104
           # comms.          15,840  528      79,200  2,640    93,600  3,120
           # Kwords          127     4        634     21       1,310   44
Transfer:  # steps           58      58       58      58       58      58
           # comms.          232     16       928     64       928     64
           # Kwords          58      58       58      58       234     234

Table 7. Communication structure analysis of FAC and AFAC for the complex 30-patch model problem.

both algorithms is roughly equal, but the number of messages to be exchanged is higher in FAC. In FAC, at each step, all processors are involved and exchange messages in parallel (since every patch is partitioned across all processors). In AFAC, only those processors that contain subpatches linked by a father/son relationship in the composite grid tree are involved. Therefore, the number of communications has a different impact on performance, because the degree of parallelism in the transfer between levels is higher in FAC. Under the assumptions that communication between distinct pairs of processors does not have mutual influence and that communication and computation cannot be overlapped, the communication structure of this transfer between patch levels in FAC is superior to that of AFAC. But since the solver phase dominates the time spent in an FAC/AFAC iteration (solve/transfer), the superiority of FAC in the transfer phase is not significant in most cases. Specifically, the number of communications per processor per step is much smaller in FAC. This is the significant number if one considers the parallelism in the execution of communication between distinct pairs in the transfer phase. As a result, the amount of communicated data per message executed in parallel is smaller in FAC. Also, all messages
have about the same length. In AFAC, however, the message passing associated with the transfer phase is less parallel, although alternative mappings of the composite grid to the processors could alleviate this at an increased cost to the solver phase. As a result, the messages are longer and often of different lengths. Thus, in AFAC, the time required for one communication step is as large as the time required to exchange the longest message. The impact of this disadvantage of AFAC is significantly lessened if the transfer rate of each pair of processors is dependent on the number of communicating pairs. This is the case on hierarchical bus-coupled architectures such as SUPRENUM, where the total transfer rate is restricted by the bus bandwidth. On hypercubes, this is only the case for non-nearest-neighbor communication of processor pairs that use the same links. Also, the transfer rate depends on whether both processors of each pair are within one cluster / nearest neighbors (faster) or in distinct clusters / non-nearest neighbors (slower). Also, the possibility of overlapping communication and computation on the Intel machines (currently not possible on SUPRENUM) makes the influence of larger messages in AFAC significantly smaller. The latter is due to the greater parallelism of the AFAC algorithm. Since, for the message lengths associated with FAC and AFAC, message startup time in most cases is the dominating factor in overall communication time, the smaller number of messages, even longer messages, associated with AFAC over FAC is a perceived advantage.

4.4.3 Relative Performance Results

In this section, results of numerical experiments that compare FAC and AFAC on SUPRENUM and the Intel iPSC/2 and iPSC/860 are presented.
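Before turning to the measurements, the startup-dominance argument of the previous section can be made concrete with a standard linear cost model; the latency and bandwidth parameters below are illustrative assumptions, not measurements from these machines.

```python
# Linear communication-cost model: each message pays a fixed startup
# (latency) cost plus a per-word transfer cost.

def comm_time(num_messages, total_words, startup=500e-6, per_word=1e-6):
    """Total time if the messages are sent one after another
    (startup and per_word in seconds; assumed values, not measured)."""
    return num_messages * startup + total_words * per_word

# Solver-phase figures from Table 6 (p = 7, n = 16, nx = 127):
# FAC sends 21,840 messages / 306 Kwords; AFAC sends 3,120 / 44 Kwords.
fac_time = comm_time(21_840, 306_000)
afac_time = comm_time(3_120, 44_000)
# Startup dominates both totals, so AFAC's fewer (if longer) messages win.
```

With these parameters, over 97 percent of the FAC total is startup cost, which is why reducing the message count matters far more than reducing the data volume.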
The code has been run on SUPRENUM in scalar and vector mode in order to provide for very different computation/communication cost ratios; although the communication time stays nearly constant (same startup costs, same transfer rate, slightly faster buffering in vector mode), the computation time differs significantly. For a more detailed discussion of the distributed architectures, see [34]. Factors of up to
20 in favor of vector computations for pure arithmetic processing are typical. Thus, the impact of communication on overall performance differs radically in both modes. Though it is obvious that on SUPRENUM the vector unit should be used, the scalar results give some idea of how the percentage of time spent in communication changes if basic architectural parameters are changed. All results were obtained on one cluster of a 4-cluster (64 node) SUPRENUM machine at GMD in Bonn. For results on the 64 node SUPRENUM machine, the reader is referred to [30]. The code has been run in scalar mode on a 4-D subcube of the 32 node Intel iPSC/2-VX of the GMD. The vector floating point processor has not been used due to vector memory restrictions (1 MB/node, cf. section 3.2). Additional results on the 32 node iPSC/2 can be found in [41]. Results specific to the use of adaptive AFAC on the iPSC are in [42]. Finally, the code has been run on a 4-D subcube of the 32 node Intel iPSC/860 of GSF based on the Portland Group C compiler. (The iPSC/860 results are not complete due to system problems and restricted access time to the system.) Special vector libraries have not yet been available, so the possibility of overlapping functional units of the i860 has not been optimally used. The performance is therefore restricted by the "scalar" speed (cf. section 3.2). For future compiler releases, significant speedup in pure computation time is to be expected, whereas communication time will stay rather constant. This leads to a further decrease of the computation/communication cost ratio. Timings are in milliseconds for 1 cycle. D0, D2, and D4 denote subcubes consisting of 2^0, 2^2, and 2^4 processors, respectively.

4.4.4 Conclusions of FAC versus AFAC

To compare AFAC and FAC, we restrict ourselves to two test problems. The first is a simple 7-level composite grid with all refinement at a corner, similar to figure 17. The second is a complex composite grid with
a total of 30 local refinement patches, as in figure 19. In order to give a fair comparison of both methods, in the following bar charts (see figure 20), two cycles of AFAC (black bars) are always compared to one cycle of FAC (white bars), which takes the convergence factors of both methods into account. Referring to figure 20, note that while FAC outperforms AFAC on a single processor, in most cases AFAC outperforms FAC even on a few processors (16). This is amplified by the inclusion of vectorization. The case where FAC is better than AFAC is with small numbers of large patches in scalar mode on a few processors. In this case, the effect of communication is minimized and computation dominates the AFAC/FAC algorithm; that is, the reasons that FAC is superior on a single processor extend to this restrictive class of problems on small numbers of processors. Specific to the 7-level composite grid (formed as in figure 17, but with 7 levels instead of the 3 pictured), we see that FAC is better in the case of the scalar results, at least for up to 16 processors. But for the vectorized code, AFAC is superior. This is due to the increased relative communication costs brought about by the much faster vectorized computation. Thus, since AFAC (as shown in tables 6 and 7) has reduced communication in the parallel environment, it has better performance. In the case of the complex composite grid (with 30 patches, similar to figure 19), AFAC performs better than FAC in both scalar and vector modes. This is due to the poor communication properties (shown in tables 6 and 7) of FAC on composite grids with greater complexity. As a result of these poor communication properties, AFAC outperforms FAC even on relatively small (16 processor) parallel computers. The effect of the number of processors in a parallel system is not so clear on the small numbers of processors that we have presented, so in figure 20 we include some results on 32 and 64 processors.
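The two-cycles-versus-one normalization used in these charts is easy to check numerically; the convergence factors below are hypothetical values chosen only to illustrate the "rough factor of two" relation quoted earlier.

```python
def error_after(rho, cycles, e0=1.0):
    """Error norm after repeated cycles with per-cycle convergence
    factor rho, starting from error e0."""
    return e0 * rho ** cycles

# Illustrative (assumed) factors: if AFAC's convergence factor is
# roughly twice FAC's, then two AFAC cycles beat one FAC cycle whenever
# (2 * rho_fac)**2 <= rho_fac, i.e. rho_fac <= 0.25.
rho_fac = 0.1
rho_afac = 2 * rho_fac
```

So charging AFAC two cycles per FAC cycle is, if anything, conservative toward FAC for the convergence factors typical of these multigrid solvers.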
What these results show (on both the 7-level composite grid and the 30-patch composite grid in vector mode) is that the increase in the number of processors
has a dramatic effect on the comparison of AFAC and FAC, with AFAC performing very efficiently, especially compared to FAC. Thus, we see that the superiority of AFAC over FAC is dependent on both the number of processors and the complexity of the grid. The effect of vectorization just amplifies the poor communication properties of FAC on even small numbers of processors. If we make the assumption that these results will extrapolate to larger multiprocessor computers, then we expect the marked superiority of AFAC over FAC and, likely, all other local refinement algorithms because of their synchronous treatment of the composite grid levels. For the 30-patch model problem, one should be aware that the current FAC implementation is not optimal, because the problem actually only contains 6 levels. Patches of different refinement areas are treated in parallel, but could be partitioned more efficiently. The alternative partitioning would force all patches at the same level to be put into a list, which would then be partitioned across all processors, similar to the way that AFAC treats all the patches. As a result, the subpatches in each processor would be larger and there would be fewer total messages associated with FAC. However, this partitioning would have the disadvantage of added complexity, especially for use in an evolving, dynamic composite grid.

4.5 Dynamic Adaptivity in a Parallel Environment

An important and potentially disabling requirement of parallel adaptive mesh refinement for time-dependent problems is the dynamic changes required of the evolving composite grid distributed across the multiprocessor environment. The dynamic changes take the form of addition and deletion of refinement regions, load rebalancing after modifications of the composite grid, and repositioning of existing refinement as a part of tracking moving regions where additional refinement is required.
An important issue in the parallel environment is how the composite grid can be partitioned. Depending on this partitioning, the work associated with adding/deleting new refinement or moving existing refinement can
be prohibitive. The partitioning available with AFAC permits any ordering of the composite grid, so moving grids move only logically in relative position but need not be reshuffled in the multiprocessor system. Alternatively, a partitioning of the physical domain forces significant reshuffling when a grid is moved within the physical domain. Also, adding/deleting of grids is more expensive since the partitioned domain limits the ability to cache the added and deleted grids (deleted grids need not be deallocated and could be added trivially).

4.5.1 Multilevel Load Balancing

Figure 21 shows the times required to load balance the composite grid after addition of local refinement grids of increasing size. Notice that these times depend only slightly on cube size since most work is spent in the transfer of grid data between nodes. The cost of computing the target processors of the unbalanced data is very small. In this test of MLB, a grid is added to the existing complex composite grid and the resulting unbalanced composite grid is then rebalanced. Times are unavailable for the case n = 65 since message sizes are greater than 16K for the grid data transfer between nodes, which is the maximum message size on the iPSC/1. To compensate for this omission, the case n = 9 is presented. Note that these times are small compared to the costs of solving the composite grid problem.

4.5.2 Dynamic Movement of Local Refinement (Grid Tracking)

To test the potential performance of AFAC for handling moving grids, we computed the times required to make the movement of the largest stack of grids shown in figure 19. In the complex grid example solved and timed previously, the 30 grids stacked on the coarsest grid are moved, and the time required for this move and the update of the movement in all processors is recorded. Specifically, both the relative and global positions of all 30 grids are communicated and updated in all processors.
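The reason such a move is cheap can be sketched with a toy patch descriptor (a hypothetical data structure, not the thesis code): under AFAC's level-wise partitioning, a patch's grid data stays on its processors and only the small descriptor recording its position changes.

```python
# Toy sketch: moving a patch under AFAC's partitioning is a metadata
# update.  The grid data stays put; only the descriptor that every
# processor holds for the composite-grid tree is changed.

class Patch:
    def __init__(self, name, offset):
        self.name = name
        self.offset = offset      # global (i, j) position of the patch
        self.rows_moved = 0       # grid rows actually shipped between nodes

def move_patch(patch, new_offset):
    """Logical move: update the descriptor only.  No grid data is
    reshuffled, which is why the measured move times are small."""
    patch.offset = new_offset
    return patch.rows_moved       # remains 0

# A domain-partitioned scheme would instead have to ship every row of
# the patch that crosses a processor boundary.
```

This is the contrast drawn above: logical repositioning versus physical reshuffling.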
The time required is identical for all processors and is shown in table 8. Note that these times are small compared to the costs of solving the composite grid problem.
4.5.3 Dynamic Addition of Local Refinement

To estimate the cost associated with building further refinement on the existing composite grid, we computed the times for one node to both build a grid and communicate the composite grid modification to the composite grid tree present in all other processors; see table 9. The times are different between the node that owns the new grid and those that merely update their record of the composite grid. This is because the owning node must initialize the new grid, update the right-hand side, and perform other functions, so the times depend on the size of the new grid. Nodes not owning the new grid must update their copy of the composite grid, which is inexpensive, so their times reflect the cost of a global communication and are therefore dependent on the cube size. If necessary, the cost of building the new grid could be parallelized by letting several processors each build a piece of the new grid. Thus, the new grid would be shared rather than wholly owned after construction. In any case, the costs of adding a grid are still small compared to the costs of solving the composite grid problem.

4.5.4 Relative Costs of Dynamic Adaptivity

In table 10, we compare (on a logarithmic scale) the times required of AFAC on the complicated example to move a stack of 30 grids, add further refinement, and load balance the additional refinement. We compare these times to show that these grid manager functions have little impact on the overall cost of solving the composite grid problems, even in a parallel environment.

4.5.5 Conclusions about Dynamic Adaptivity

This section has detailed the use of dynamic adaptivity combined with the composite grid solver. The results make it clear that the use of time-dependent problems with parallel adaptive mesh refinement does not degrade the performance of these adaptive composite grid solvers.
This is an important result because time-dependent applications are critical and computationally expensive. A motivating use for massively parallel machines is this problem class: that such problems can be solved using parallel adaptive mesh refinement, and that the solution methods do not degrade by use of adaptive techniques, are significant results. Such results are in some respects
counterintuitive due to the nonuniform nature of the composite grid, but this is countered by the design of the composite grid solver AFAC, since AFAC allows a decoupling of the composite grid levels. It is the design of AFAC (and we expect the even better properties of AFACx) that allows the decoupling and, as a result, the much greater freedom to partition the composite grid more efficiently in the multiprocessor system.

4.6 Adaptive Refinement Using P++/AMR++

By example, we demonstrate some of the features of AMR++ and examples of the support of P++ for the design of parallel self-adaptive block structured local refinement applications on the basis of the FACx and AFACx algorithms. The singular perturbation equation is an interesting example problem since it is not as simple as the Poisson equation and provides a mechanism to introduce more realistic singularities for which we can better justify the use of self-adaptive mesh refinement. In a parallel environment, partitioning the composite grid levels becomes a central issue in the performance of composite grid solvers. In figure 14, two different partitioning strategies that are supported within P++/AMR++ are illustrated for the composite grid. For ease of illustration, grid blocks 2.2 and 2.3 are not included. The so-called FAC partitioning in figure 14(b) is typical for implicit and explicit algorithms, where the local refinement levels have to be treated in a hierarchical manner (FAC, MLAT, ...). The so-called AFAC partitioning in figure 14(a) can be optimal for implicit algorithms that allow an independent and asynchronous treatment of the refinement levels. In the case of AFAC, however, we must consider that this partitioning is only optimal for the solution phase, which dominates the computational work of the algorithm.
The efficiency of the level transition phase, which is based on the same hierarchical structure as FAC and which can eventually dominate the aggregate communication work of the algorithm, highly depends on the architecture and the application (communication/computation ratio, single node (vector) performance, message latency, transfer rate, congestion, ...). For this reason, additional work should be done to
[Bar chart] Table 8: Timings in milliseconds for repositioning grids (grid sizes 17x17, 33x33, 65x65 on subcubes D0-D5).
[Bar chart] Table 9: Timings in milliseconds (thousands) for addition of new refinement (grid sizes 17x17, 33x33, 65x65 on subcubes D0-D5; separate bars for nodes owning and not owning the new grid).
[Bar chart, logarithmic scale] Table 10: Relative timings in milliseconds of AFAC for moving, adding, and load balancing (MLB) (grid sizes 17x17, 33x33, 65x65 on subcubes D0-D5; bars show total solve, transfer, move, not-owning and owning addition, and MLB times).
eliminate this final coupling between the composite grid levels (see chapter 5). For determining whether AFAC is better than FAC in a parallel environment, the aggregate efficiency and performance of both phases and the relation of the convergence factors must be properly evaluated. For more details, see [31] and [27]. Both types of partitioning are supported in the P++/AMR++ environment. Solvers used on the individually partitioned composite grid levels make use of overlap updates within P++ array expressions that automatically provide communication, as needed. The intergrid transfers between local refinement levels, typically located on different processors, rely on VSG updates. The VSG updates are also provided automatically by the P++ environment. Thus, the underlying support of parallelism is isolated in P++ through either overlap update or VSG update, or a combination of both, and the details of parallelism are isolated away from the AMR++ application. The block structured interface update is handled in AMR++. However, communication is hidden in P++ (mostly VSG update). See section 3.6 for more detail on AMR++, and section 3.5 for more detail on P++. The use of the tools described above is now demonstrated with preliminary examples. The adaptivity provided by AMR++ is necessary in case of large gradients or singularities in the solution of the PDE. They may be due to rapid changes in the right-hand side or the coefficients of the PDE, corners in the domain, or boundary layers, for example. The details of the self-adaptive regridding are contained in section 3.6.5. Here, the first and last cases will be examined on the basis of model problems. This work is not intended to represent current research specific to the singular perturbation problem. The reader is referred to [33] for more detail. Singularly perturbed PDEs represent the modelling of physical processes with relatively small diffusion (viscosity) and dominating convection.
They may occur as a single equation or within systems of complex equations, e.g., as the momentum equations within
the Navier-Stokes equations or as supplementary transport equations in the Boussinesq system of equations. Here, we merely treat a single equation, but we only use methods that generalize directly to the more complex situations. Therefore, we do not rely on the direct solution methods provided by downstream or ILU relaxations for simple problems with pure upstream discretization. The latter are not direct solution methods for systems of equations; cf. [20]. Moreover, these types of flow-direction-dependent relaxations are not efficiently parallelizable in the case of only a few relaxations, which is what is usually used in multilevel methods. This, in particular, holds on massively parallel systems. As opposed to the Poisson equation, the convergence factors do not only depend on the PDE, but also on the particular solution. The results in this thesis have been obtained for the exact solution

    u(x, y) = (e^((x-1)/ε) - e^(-1/ε)) / (1 - e^(-1/ε)) + e^(-100(x² + (y-1)²)).

This solution exhibits a boundary layer at x = 1, 0 ≤ y ≤ 1 and a steep hill around x = 0, y = 1 (see figure 14(c)). In order to measure the error of the approximate solution, a discrete approximation to the L1 error norm is used. This is appropriate for this kind of problem: for solutions with discontinuities of the above type, one can observe first order convergence only with respect to a norm of this type (no convergence in the L∞ norm, order 0.5 in the L2 norm). The results have been obtained with the flagging criterion for a given value of η. For ε < h, the second factor is an approximation to the lowest order error term of the discretization. Based on experiments, f = 1 is a good choice. Starting with the global grid, the composite grid is built on the basis of the flagging and gridding algorithm described in section 3.6.5. In table 11, the results for MGV and FACx are presented for three values
of η, with two of the corresponding block structured grids displayed.

         MGV, uniform    FACx, η = 0.02       FACx, η = 0.01       FACx, η = 0.001
h        e       n       e       n      b     e       n      b     e       n      b
1/32     0.0293  961     0.0293  961    1     0.0293  961    1     0.0293  961    1
1/64     0.0159  3969    0.0160  1806   4     0.0160  1967   4     0.0159  2757   3
1/128    0.0083  16129   0.0089  3430   10    0.0087  3971   10    0.0083  6212   7
1/256    0.0043  65025   0.0056  6378   19    0.0051  7943   16    0.0043  13473  12
1/512    0.0023  261121  0.0073  12306  34    0.0044  15909  30    0.0023  27410  22

Table 11. Accuracy (L1 norm e) vs. number of grid points (n) and number of blocks (b) for MGV on a uniform grid and FACx on self-adaptively refined composite grids.

The corresponding error plots give an impression of the error distribution restricted from the composite grid to the global uniform grid. Thus, larger errors near the boundary layer are not visible. These results allow the following conclusions: In spite of the well known difficulties in error control of convection dominated problems, the grids that are constructed self-adaptively are reasonably well suited to the numerical problem. As long as the accuracy of the finest level is not reached, the error norm is approximately proportional to η. As usual in error control by residuals, with the norm of the inverse operator being unknown, the constant factor is not known. If the refinement grid does not properly match the local activity, convergence factors significantly degrade and the error norm may even increase. Additional tests have shown that, if the boundary layer is fully resolved with an increased number of refinement levels, the discretization order, as expected, changes from one to two. The gridding algorithm is able to treat very complicated refinement structures efficiently: the number of blocks that are created is nearly minimal (compared to hand coding). Though this example needs relatively large refinement regions, the overall gain by using adaptive grids is more than 3.5 (taking into account the different number of
points and the different convergence factors). For pure boundary layer problems, gain factors larger than 10 have been observed. These results have been obtained in a serial environment. AMR++, however, has been successfully tested in parallel. For performance and efficiency considerations, see sections 3.5 and 3.6.
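The point-count side of the quoted gain can be read directly off Table 11; the discount toward the "more than 3.5" figure then comes from convergence-factor and overhead differences, which are not modeled here.

```python
# Point-count arithmetic behind the quoted gain: at h = 1/512 the uniform
# grid has 261,121 points while the eta = 0.001 composite grid reaches
# the same L1 error (0.0023) with 27,410 points.  The remaining discount
# from ~9.5 down to the quoted "more than 3.5" reflects convergence-
# factor and per-point overhead differences not modeled in this sketch.

uniform_points = 261_121
adaptive_points = 27_410

point_ratio = uniform_points / adaptive_points   # roughly 9.5
```

Even a conservative discount of more than a factor of two on this ratio is consistent with the reported overall gain.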
[Diagrams] Figure 17. Regular single level strip partitioning of a 3-level composite grid structure onto 4 processors (FAC).

Figure 18. Irregular multilevel strip partitioning of a 7-level composite grid structure onto 16 processors (AFAC).
Figure 19: Complex composite grid problem with 40 patches.
[Bar charts] Figure 20. Timings in milliseconds regarded with respect to patch and parallel processor system size (AFAC: black bars, 2 cycles; FAC: white bars, 1 cycle). Panels: the 7-patch simple composite grid and the 30-patch complex composite grid, each on iPSC/2 and SUPRENUM (scalar) and on iPSC/860 and SUPRENUM (vector).
[Bar chart] Figure 21: Timings in seconds for load balancing using MLB (grid sizes 9x9, 17x17, 33x33 on subcubes D1-D5).
[Surface plots] (a) Error for 5 levels: η = 0.02, b = 34, n = 12306. (b) Error for 5 levels: η = 0.001, b = 22, n = 27410.
CHAPTER 5
CONCLUSIONS
Adaptive mesh refinement and its use on parallel architectures is both practical and worthwhile. The parallelization of traditionally serial algorithms in the parallel environment, while commonly accepted and often reasonably efficient, is shown to be satisfactory in many specific circumstances on small numbers of processors. However, the advantages, more specifically in the design of new mathematical algorithms, are demonstrated through the development of AFAC and, in particular within this thesis, AFACx. AFACx addresses the requirements of more general self-adaptive mesh refinement and, at the same time, uncovers a generalization of the original AFAC algorithm and improves on the efficiency of solving composite grid problems in parallel. This thesis combines the development of a new numerical algorithm for the adaptive solution of PDEs with developments in the software engineering of such complex parallel codes. The more computer science oriented developments have resulted in a greatly simplified environment for the construction of parallel numerical software in general. The resulting P++ environment, a C++ parallel array class library, is demonstrated on the target application area of this thesis: parallel adaptive mesh refinement software. Initial results detailing performance with AFAC and
parallelizable, can be much superior on parallel architectures in many cases. A factor of roughly two in favor of FAC in serial mode gives way in favor of AFAC in parallel mode, especially on larger numbers of processors. An additional and unexpected result of the difficult implementation work that was done to investigate the parallel properties of FAC, AFAC, and AFACx was the development of a superior architecture-independent environment for programming. The use of C++ has permitted an expanded scope of the work that was originally attempted, much more so than would have been possible in FORTRAN. The work motivated by the requirement to build such complex parallel adaptive mesh refinement codes has led to the development of a runtime interpretation of parallelism and a radically new programming environment. The array language environment protects the user from the difficulties of developing even complex software. Additionally, it eliminates the requirement of parallel debugging, a particularly difficult task that has historically limited the complexity of numerical software on distributed memory environments. More than anything else, this thesis concludes that adaptive mesh refinement for parallel architectures requires support from both improved mathematical algorithms and attention to advances in computer science.
CHAPTER 6
FUTURE WORK
We expect that the development of this work will continue into the foreseeable future. More specifically, the design of AFAC that has evolved into AFACx will continue to be improved. Similarly, the software that is used to develop the parallel adaptive mesh refinement work presented in this thesis will continue to evolve (P++ and AMR++ research). There are several problems that will likely be addressed through continued research, and they separate into future work on the algorithmic design of AFAC and future work on the design of P++ and AMR++.

The development of improved algorithms for the parallel processing of the composite grid, derived from the self-adaptive mesh refinement process, will continue to be an important research area. It is hoped that the software developments in this thesis will improve the accessibility of this research area. The development of AFACx greatly improves the ability of the fundamental concepts developed in AFAC to address more realistic problems with even greater efficiency. We expect further research specific to the adaptive algorithms to be done, which might include:

Decoupling of Intergrid Transfers. Current work on AFACx simplifies the solvers on each composite grid level, but nothing has been done to simplify the final interpolation and projection processes that occur at each iteration. Additional work to decouple these processes, partially or completely, will be done so that this step can be more fully parallelized.

Improved theoretical understanding of AFAC and AFACx, particularly the relationship of AFACx to BPX [43].
Development of the programming environment that was used to support the research into AFAC and AFACx is the second area of research. Additional work on P++ and AMR++ might include:

Efficiency for P++ Comparable to That of FORTRAN. Current work is being done to make the P++ array class library as efficient as possible. Current serial C++ array classes in use at Sandia National Laboratories perform at 50-80% the level of FORTRAN on the Cray. Ongoing collaboration with James Peery and Allen Robinson at Sandia will allow for some improvements to be made to P++ in order to obtain similar results on the Cray 2 and the single node architectures of distributed memory machines (scalar, vector, or superscalar). With respect to parallel efficiency and the amount and cost of communication, P++ already performs nearly as well as codes based on explicit message passing, though more analysis will be required to account for optimizations across multiple array statements. Such multiple expression optimizations are possible by hand, but are often tedious.

Interpretation of Parallel Indexed Array Operations. An important core of P++ is the interpreted parallel array operations, specifically the message passing that each operation generates. Currently, the interpretation of the message passing between processors owning the array data is based on the use of a fixed overlap along each edge of the partition. The use of a dynamically sized overlap is required to handle more complex array statements efficiently, since wider stencils (e.g., 4th-order operators) require greater overlap to avoid the more expensive VSG update. Future work should isolate this required functionality and parameterize it by the overlap width, so that it can be made more broadly useful. Such work could predict the optimal overlap size at runtime and so dynamically provide increased optimization to the runtime interpretation of parallelism for other runtime environments, more than just P++.
Mixing of Data Parallel Model with Task Parallel Model. Currently, P++ provides
a simplified access to data parallel efficiency. However, for efficiency in parallel adaptive mesh refinement applications, we require additional task parallel support. Thus, it is important to mix the two types of parallel support together. Language support at Caltech, CC++, provides access to task parallelism and so is a target for additional work to combine with P++ for simplified support of numerical software in the parallel environment. However, other work at the University of Colorado at Boulder has focused on library support of task parallelism. Additional work should be done to evaluate both approaches and discover which is best suited for use with parallel adaptive mesh refinement.

Use with Portable Communication Libraries. P++ is currently implemented using the Intel NX communication library and an EXPRESS™-like portable communication library from Caltech. Current versions are running on the Intel iPSC/860 Hypercube, the Intel Simulator, Sun workstations, the IBM PC, and the Cray 2. Future work might use the standardized Message Passing Interface (MPI) within P++ or, alternatively, P++ might use PVM internally.

Use of Portable Class Libraries. AMR++ is implemented using M++ and P++, but uses some standard class libraries that limit its use on some of these machines. Current work uses the AT&T Standard Components II class library to provide standardized helper classes (e.g., linked list classes), which would then be available on all the target machines. Further investigation will be required to see if such standard class libraries are portable and widely available enough to allow P++ and AMR++ to be used on even obscure machines. The alternative would be to directly support a similar interface for the linked list, serial array, and other class libraries within P++.

Replacement of M++. It is not clear that further work should continue to use the M++ class library, since it is currently a bottleneck to the performance of P++ and AMR++.
M++ was originally used because it simplified the development of P++ (EXPRESS is a trademark of ParaSoft Corp.)
and allowed only parallel issues to be researched. These issues were the original focus of our investigation. The success of P++ will strongly depend on its ability to provide near-FORTRAN performance. Support of a similar, or improved, interface, but restricted to the requirements of P++ (and optimized for use with P++), could easily be substituted. Such an intermediate serial array class library could more easily be made portable, and heavily optimized, across the proposed target architectures.

Optimized Message Passing Subsystem. The scheduling of messages in P++ is currently handled in a simple statement by statement processing. A much more highly optimized message passing subsystem of P++ could dynamically treat collections of statements and, in doing so, schedule the message passing with overlapping computation. Additional combining of messages could greatly reduce the number of messages sent within the multiprocessor system. This is especially efficient in modern parallel computers that have improved communication hardware allowing up to 160+ Megabyte/sec transfer rates, but interject a software based communication layer, which universally leads to high message startup costs.

Improved Control of Partitioning. This is a critical issue in the development of parallel software, since within P++ the algorithm definition is separated from the partitioning, and so any partitioning of the data is considered valid. However, the control of partitioning of data is not currently easy to define or manipulate (unless the default partitionings are used). For most applications, the manipulation that algorithms require is restricted to the load balancer, which must then have explicit control. Such explicit control is available and sufficient in P++. However, the user manipulation is awkward since explicit positions must be computed and defined within the P++ interface to the partitioning.
The object-oriented language provides a much simpler solution through the use of mapping objects. The mapping object would abstractly define a partitioning independent of the number of processors (thus
defining a portable mechanism for the definition of a partitioning). Each P++ array would naturally define a partitioning and thus would contain a stack of mapping objects that could be pushed onto or popped off the stack. These mapping objects would define a partitioning and could be defined with scope so that, after an array's partitioning was manipulated within a local scope, the previous partitioning would be reset after the new mapping object that changed the array went out of scope. In this way, scoping rules can be used to simplify the partitioning of the distributed arrays. Alternate manipulation of the mapping objects could permanently change the partitioning. Maintenance of very large parallel software projects could even require a system of permissions within the definition of the mapping objects. This is an extension of work done on scoping of Vienna FORTRAN partitions. However, it could be easily implemented in C++ for the P++ objects without manipulation of the compiler. P++ could be a substantial advantage in productively prototyping even such a FORTRAN partitioning control subsystem.

Performance Tool Interface for P++. Since performance is an important goal of the P++ work, additional work should combine performance evaluation tools in a way that permits the user to interpret relative performance and so allow the user to optimize the P++ implementation. Such optimizations might be specific to a given machine architecture and would involve tuning of the partitioning and other controls available from the P++ Optimization Manager. There are several parallel performance monitoring packages (some are available with each of the communication libraries, such as EXPRESS and PARMACS); it is not currently clear which should be used.

Time Dependent Explicit Hyperbolic Solvers with AMR++. As a research tool,
AMR++ is currently being used to support study of the AFAC algorithm and variants of this algorithm designed for specific support of block-structured local refinement for potential flow problems. Additionally, one of the PPM codes has been ported to the P++ environment and is running on one processor of the Cray Y-MP, but more work specific to efficient use of P++ on the Cray is required to make it competitive with the FORTRAN versions. This work is mostly dependent on M++ working well on the Cray. The P++ PPM application has been designed to work with the AMR++ class library, using the PPM application as the single grid solver to build a P++/AMR++ application for the parallel environment. However, tests of these combined applications will require enhancements of AMR++ in order to handle nonuniform grids. This work has yet to be finished.

Unstructured Grids. One of our goals is to provide abstractions for parallelism, local refinement, and adaptivity for the development of unstructured grid codes. However, direct work on unstructured grids seems ill-advised without substantial experience on the simpler case of structured grids. Thus, we have focused our initial work on block-structured, logically rectangular grids and will build on the abstractions that P++ and AMR++ provide for these more modest types of applications. This will provide the foundation for the unstructured grid cases to be supported later.

Flexibility of AMR++ for Larger Classes of Applications. As already stated above, there is still work to be done to make AMR++ sufficiently general to run the test and other complex fluid flow applications. Currently, AMR++ is used to simplify research on adaptive local refinement for elliptic problems using the AFAC algorithm for solution of the composite grid problem (see the example in the previous section). To date, only simpler model problems have been tested.
Nevertheless, the resulting work is already more sophisticated and much simpler to use than previous work done on local refinement methods for parallel environments (see [28], [29]).
Simplify the Use of AMR++. Although not a part of the current implementation of AMR++, C++ introduces a template mechanism in the latest standardization of the compiler. The general purpose of this template language feature is to permit class libraries to use user-specified base types. Thus, a class library for linked lists could be supplied by defining the link data type. For AMR++, the template feature will be used to allow specification of the base solver and adaptive criteria for the parallel adaptive local refinement implementation. In this way, we hope that the AMR++ class library will provide a simple tool for the construction of adaptive mesh refinement codes from single grid applications. Similarly, we hope to show that the combination of AMR++ and P++ allows the construction of parallel adaptive mesh refinement codes from single grid (serial) applications.

Disk Storage of AMR++ Grid Blocks. AMR++ has no provision for its grid blocks to be stored off-line on disk. Such a capability would be relatively simple to implement and would be required for large systems of equations such as are found in combustion codes modeling multiple species. Its use would help control the virtual paging that might otherwise be required to support the large memory requirements of such codes. This feature would copy a similar feature in CMPGRD (a grid generation/discretization package for the use of structured grids with complex boundaries). Current work on C++ persistent objects might greatly simplify this proposed feature.
BIBLIOGRAPHY
[1] Angus, I. G.; Thompkins, W. T.: Data Storage, Concurrency, and Portability: An Object Oriented Approach to Fluid Dynamics; Fourth Conference on Hypercubes, Concurrent Computers, and Applications, 1989.
[2] Baden, S. B.; Kohn, S. R.: Lattice Parallelism: A Parallel Programming Model for Non-Uniform, Structured Scientific Computations; Technical Report of the University of California, San Diego, Vol. CS92-261, September 1992.
[3] Bai, D.; Brandt, A.: Local Mesh Refinement Multilevel Techniques; Report, Department of Applied Mathematics, Weizmann Institute, Rehovot, Israel, 1983.
[4] Balsara, D.: PhD Thesis, University of Illinois, Urbana-Champaign, 1990.
[5] Balsara, D.; Lemke, M.; Quinlan, D.: AMR++, a C++ Object Oriented Class Library for Parallel Adaptive Mesh Refinement Fluid Dynamics Applications; Proceedings of the American Society of Mechanical Engineers, Winter Annual Meeting, Anaheim, CA, Symposium on Ad
[14] Chatterjee, S.; Gilbert, J.; Schreiber, R.; Teng, S.: Optimal Evaluation of Array Expressions on Massively Parallel Machines; Tech Report of Xerox Palo Alto Research Center, Vol. CSL-92-11, Dec. 1992.
[15] Costlow, T.: Kendall Readies Parallel System, a 1000-processor Machine Uses Main Memory as Cache; Electronic Engineering Times, December 9, 1991.
[16] Dörfer, J.: Mehrgitterverfahren bei singulären Störungen (Multigrid Methods for Singular Perturbations); Dissertation, Heinrich-Heine-Universität, Düsseldorf, 1990. (Details related by Max Lemke.)
[17] Dowell, B.; Govett, M.; McCormick, S.; Quinlan, D.: Parallel Multilevel Adaptive Methods; Proceedings of the 11th International Conference on Computational Fluid Dynamics, Williamsburg, Virginia, 1988.
[18] Forslund, D.; Wingate, C.; Ford, P.; Junkins, S.; Jackson, J.; Pope, S.: Experiences in Writing a Distributed Particle Simulation Code in C++; USENIX C++ Conference Proceedings, San Francisco, CA, 1990.
[19] Fox, G.; Hiranandani, S.; Kennedy, K.; Koelbel, C.; Kremer, U.; Tseng, C.-W.; Wu, M.-Y.: FORTRAN D Language Specification; NPAC Technical Report, Syracuse University; also: Technical Report, Rice University, 1991.
[20] Frohn-Schauf: Flux-Splitting-Methoden und Mehrgitterverfahren für hyperbolische Systeme mit Beispielen aus der Strömungsmechanik (Flux-Splitting Methods and Multigrid Methods for Hyperbolic Systems with Examples from Fluid Mechanics); Dissertation, Heinrich-Heine-Universität, Düsseldorf, 1992. (Details related by Max Lemke.)
[21] Gropp, W.; Keyes, D.: Domain Decomposition on Parallel Computers; Proceedings of the 2nd International Symposium on Domain Decomposition Methods, SIAM, 1989.
[22] Hart, L.; McCormick, S. F.: Asynchronous Multilevel Adaptive Methods for Solving Partial Differential Equations on Multiprocessors: Basic Ideas; Parallel Computing 12, 1989, pg. 131-144.
[23] Hempel, R.; Lemke, M.: Parallel Black Box Multigrid; Proceedings of the Fourth Copper Mountain Conference on Multigrid Methods, SIAM, Philadelphia, 1989.
[24] High Performance Fortran Forum: Draft High Performance Fortran Language Specification, Version 0.4, Nov. 1992. Available from titan.cs.rice.edu by anonymous ftp.
[25] Lee, J. K.; Gannon, D.: Object-Oriented Parallel Programming: Experiments and Results; Proceedings of Supercomputing '91 (Albuquerque, Nov.), IEEE Computer Society and ACM SIGARCH, 1991, pg. 273-282.
[26] Lemke, M.: Experiments with a Vectorized Multigrid Poisson Solver on the CDC Cyber 205, Cray X-MP and Fujitsu VP 200; Arbeitspapiere der GMD, No. 179, Bonn, West Germany, 1985.
[27] Lemke, M.: Multilevel-Verfahren mit selbst-adaptiven Gitterverfeinerungen für Parallelrechner mit verteiltem Speicher (Multilevel Methods with Self-Adaptive Mesh Refinement for Distributed Memory Parallel Computers); Dissertation, Heinrich-Heine-Universität, Düsseldorf, to appear in 1993. (Details related by Max Lemke.)
[28] Lemke, M.; Quinlan, D.: Local Refinement Based Fast Adaptive Composite Grid Methods on SUPRENUM; GMD-Studien Nr. 189, Bonn, Germany, 1991.
[29] Lemke, M.; Quinlan, D.: Fast Adaptive Composite Grid Methods on Distributed Parallel Architectures; Proceedings of the Fifth Copper Mountain Conference on Multigrid Methods, Copper Mountain, April 1991. Also in Communications in Applied Numerical Methods, Wiley, Vol. 8, No. 9, Sept. 1992.
[30] Lemke, M.; Quinlan, D.: Local Refinement Based Fast Adaptive Composite Grid Methods on SUPRENUM; Proceedings of the Third European Multigrid Conference in Bonn, October 1990, GMD-Studie, to appear.
[31] Lemke, M.; Quinlan, D.: P++, a C++ Virtual Shared Grids Based Programming Environment for Architecture-Independent Development of Structured Grid Applications; Arbeitspapiere der GMD, No. 611, 20 pages, Gesellschaft für Mathematik und Datenverarbeitung, St. Augustin, Germany (West), February 1992.
[32] Lemke, M.; Quinlan, D.: P++, a C++ Virtual Shared Grids Based Programming Environment for Architecture-Independent Development of Structured Grid Applications; accepted for CONPAR/VAPP V, September 1992, Lyon, France; to be published in Lecture Notes in Computer Science, Springer Verlag, September 1992.
[33] Lemke, M.; Quinlan, D.; Witsch, K.: An Object Oriented Approach for Parallel Self-Adaptive Mesh Refinement on Block Structured Grids; Proceedings of the 9th GAMM Seminar Kiel, Notes on Numerical Fluid Mechanics, Vieweg, Germany, 1993.
[34] Lemke, M.; Schüller, A.; Solchenbach, K.; Trottenberg, U.: Parallel Processing on Distributed Memory Multiprocessors; Proceedings, GI-20. Annual Meeting 1990, A. Reuter (Ed.), Informatik-Fachberichte Nr. 257, Springer, October 1990.
[35] Lemke, M.; Solchenbach, K.: The Performance of the SUPRENUM Node Computer; SUPRENUM Report, SUPRENUM GmbH, Bonn, December 1990.
[36] Liu, C.; McCormick, S.: Multigrid, the Rotated Hybrid Scheme, and the Fast Adaptive Composite Grid Method for Planar Cavity Flow; 12th IMACS World Congress on Sci. Comp., Paris, July 18-22, 1988.
[37] Lonsdale, G.; Schüller, A.: Multigrid Efficiency for Complex Flow Simulations on Distributed Memory Machines; Parallel Computing 19, pg. 23-32, 1993.
[38] McBryan, O. A.; Frederickson, P. O.; Linden, J.; Schüller, A.; Solchenbach, K.; Stüben, K.; Thole, C. A.; Trottenberg, U.: Multigrid Methods on Parallel Computers: A Survey of Recent Developments; Imp. Comp. Sc. Eng., Vol. 3, pg. 1-75, 1991.
[39] McCormick, S.: Fast Adaptive Composite Grid (FAC) Methods: Theory for the Variational Case; in Defect Correction Methods: Theory and Applications, K. Böhmer and H. J. Stetter, eds., Computing Supplementum 5, Springer-Verlag, Wien, pg. 115-122, 1984.
[40] McCormick, S.: Multilevel Adaptive Methods for Partial Differential Equations; SIAM, Frontiers in Applied Mathematics, Vol. 6, Philadelphia, 1989.
[41] McCormick, S.; Quinlan, D.: Asynchronous Multilevel Adaptive Methods for Solving Partial Differential Equations on Multiprocessors: Performance Results; Parallel Computing 12, 1989, pg. 145-156.
[42] McCormick, S.; Quinlan, D.: Dynamic Grid Refinement for Partial Differential Equations on Parallel Computers; Proceedings of the Seventh International Conference on Finite Element Methods in Flow Problems, 1989, pg. 1225-1230.
[43] McCormick, S.; Quinlan, D.: Idealized Analysis of Asynchronous Multilevel Methods; Proceedings of the American Society of Mechanical Engineers, Winter Annual Meeting, Anaheim, CA, November 8-13, Adaptive, Multilevel and Hierarchical Computational Strategies, AMD-Vol. 157, pg. 413-433, 1992.
[44] McCormick, S.; Quinlan, D.: Multilevel Load Balancing; Internal Report, Computational Mathematics Group, University of Colorado, Denver, 1987.
[45] Peery, J.; Budge, K.; Robinson, A.; Whitney, D.: Using C++ as a Scientific Programming Language; Report, Sandia National Laboratories, Albuquerque, NM, 1991.
[46] Schoenberg, R.: M++, an Array Language Extension to C++; Dyad Software Corp., Renton, WA, 1991.
[47] Stroustrup, B.: The C++ Programming Language, 2nd Edition; Addison-Wesley, 1991.
[48] Zima, H. P.; Bast, H.-J.; Gerndt, H. M.: SUPERB: A Tool for Semi-Automatic MIMD/SIMD Parallelization; Parallel Computing 6, pg. 1-18.