Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

PARALLEL MULTILEVEL METHODS FOR TRANSPORT EQUATIONS

by

SUELY OLIVEIRA

B.S.C.E., UFPE, 1983
M.S., University of Colorado, 1988

A thesis submitted to the Faculty of the Graduate School of the University of Colorado at Denver in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Mathematics, 1993
This thesis for the Doctor of Philosophy degree by Suely Oliveira has been approved for the Department of Mathematics by Thomas F. Russell.
Oliveira, Suely (Ph.D., Mathematics)
Parallel Multilevel Methods for Transport Equations
Thesis directed by Professor Thomas A. Manteuffel

Abstract

Parallel multilevel algorithms are developed for solving the transport equations on SIMD computers for one-dimensional isotropic and anisotropic scattering. The spatial discretization scheme used for both models is the modified linear discontinuous finite element method, which represents a lumped version of the standard linear discontinuous scheme. The angular discretization is accomplished by expanding the angular dependence in Legendre polynomials and is known as the S_N approximation when the first N Legendre polynomials are used. Parallel algorithms for the isotropic model are developed using a multigrid scheme in space with block Jacobi and red-black two-cell μ-line relaxation. Both the uniform and nonuniform mesh cases are considered. Algorithm implementation and performance on the Connection Machine 2 (CM2) are discussed for all cases. Proofs of convergence for both kinds of relaxation within the multigrid scheme are developed through Fourier analysis and shown to be in close agreement with the CM2 results. Assuming the spatial domain is divided into m cells (2m + 2 points), the timings for the isotropic parallel algorithm are O(log2 m). A parallel algorithm for solving the anisotropic model is developed by using a multigrid-in-angle scheme that is known to attenuate both rapidly and slowly varying errors in angle. When coarsening in angle, we use an S_N approximation for the fine level and an S_{N/2} approximation for the coarse level. The angular grid points for each angular multigrid level are the N Gauss quadrature points for that level. Sequential angular multigrid involves a shifted source iteration, using a shifted transport sweep at each
angular level. The shifted transport sweep requires 2m steps. The parallel algorithm uses cyclic reduction, which entails 2 log2 m steps at each angular level. At diverse stages of the algorithm, we need to perform Legendre transforms. An algorithm for these transforms is developed that takes O(N) steps. Applying appropriate shifts to the cross-sectional moments at each angular level results in an isotropic scattering equation at the coarsest level. For this problem the CM2 isotropic scattering parallel multigrid solver is used. The parallel anisotropic method was also implemented on the CM200, an upgraded version of the CM2, and the timings and convergence rates for this implementation are presented. The complexity of the anisotropic parallel algorithm is O(N log2 m log2 N).

This abstract accurately represents the content of the candidate's thesis. I recommend its publication.

Signed
ACKNOWLEDGMENTS

I wish to express my appreciation to the following individuals: my advisor, Professor Tom Manteuffel, for his financial and academic support during the last years of my Ph.D. studies; Professors Roland Sweet and Bill Briggs, who brought me into the program, gave me initial financial support, and provided me with my first opportunity to use parallel computers; and Jennifer Ryan for her financial support of a year and a half and for giving me the opportunity to solve real-world problems for US West. I am also appreciative of the support given to me by the remainder of the committee: Professors Steve McCormick, Tom Russell, John Ruge and Gita Alaghband. Part of this work was performed at Los Alamos National Laboratory during my stay there of 10 months spread over the last two years. From there, I would like to thank Dr. Jim Morel for his experience and ideas and Dr. James M. Hyman for his support and friendship, as well as the Center for Nonlinear Studies and the Advanced Computing Laboratory for use of their facilities. Last, but not least, I would like to thank Victor Bandy, my friends in Denver and Los Alamos for their friendship and support, the rest of my family back in Brazil, and especially my father, who always considered learning important.
CONTENTS

CHAPTER
1 INTRODUCTION 1
1.1 History 1
1.2 Model Problem 2
1.3 Notation 7
1.4 Summary of Results 7
1.5 The CM2 and CM200 9
2 PARALLEL MULTIGRID FOR THE ISOTROPIC MODEL 11
2.1 Introduction 11
2.2 The Modified Linear Discontinuous Scheme 13
2.3 Two-cell Inversion 18
2.4 Multigrid 21
2.5 Storage and Data Structure 24
2.6 Relaxation in Parallel 27
2.7 Multigrid in Parallel 31
2.8 Discretization in Edge-Edge Notation 32
2.9 Convergence Analysis 35
2.10 Numerical Results 42
2.11 Timings 49
2.12 Conclusions 56
3 PARALLEL MULTIGRID FOR ANISOTROPIC SCATTERING 58
3.1 Introduction 58
3.2 Angular Multigrid 63
3.2.1 General description 63
3.2.2 The V(1,0) cycle 64
3.2.3 The V(1,1) cycle 69
3.3 Discretization 70
3.4 Source Iteration in Parallel 71
3.5 Angular Multigrid on the CM200 84
3.5.1 Storage 84
3.5.2 Algorithm 85
3.6 Cyclic Reduction on the CM200 90
3.6.1 Storage and Initialization 90
3.6.2 Algorithm for Cyclic Reduction on a SIMD Computer 92
3.7 Stability of the Cyclic Reduction 95
3.8 Experimental Results 97
4 CONCLUSIONS 106
APPENDIX 109
BIBLIOGRAPHY 111
FIGURES

FIGURE
2.1 Computational grid 10
2.2 Red-black relaxation for 4 cells 11
2.3 Fine and coarse grid 15
2.4 Computational mesh for four cells 18
2.5 Group of cells for Fourier analysis 24
2.6 Timings for a V(1,1) cycle for small m and n = 1 36
2.7 Timings for a V(1,1) cycle for large m and n = 1 36
2.8 Timings for a V(1,1) cycle varying n and m = 512 38
3.1 Computational grids for the first and second level of angular multigrid 45
3.2 Coarsening of cells during cyclic reduction for the positive variables 55
3.3 Coarsening of cells during cyclic reduction for the negative variables 60
3.4 Data structure for a particular spatial grid point and for different angular levels, showing the fluxes and the Legendre moments 64
3.5 Data structure for Legendre polynomials for a particular spatial grid point and for two angular levels starting with N = 4 65
3.6 Timings for a V(1,1) cycle varying n and m = 4 77
3.7 Timings for a V(1,1) cycle varying n and m = 64 78
TABLES

TABLE
2.1 Convergence factors from the Fourier analysis (FA) prediction and results for Jacobi relaxation for various values of σ_t h (S_4 case) 43
2.2 Convergence factors from the Fourier analysis (FA) prediction and results for red-black Gauss-Seidel relaxation for various values of σ_t h (S_4 case) 44
2.3 Worst-case convergence factor for multigrid with Jacobi V(1,1) relaxation, Fourier analysis (FA), and CM2 results (for m = 512) 45
2.4 Worst-case convergence factor for multigrid with red-black Gauss-Seidel relaxation, Fourier analysis, and CM2 results (for m = 512) 46
2.5 Convergence factors for nonuniform cells (Jacobi and red-black Gauss-Seidel relaxation on the CM2), 1 ≤ h ≤ 100, and σ_t = 1 46
2.6 Convergence factors for Jacobi (V(1,1) and V(2,2)) and red-black Gauss-Seidel (V(1,1)) relaxation, for various values of σ_t h; S_4 case and m = 512 47
2.7 Convergence rates for Jacobi (V(1,1) and V(2,2) cycle) and red-black (V(1,1)) on the CM2; S_4 case and σ_t h = 0.1 48
2.8 Convergence factors for Jacobi (V(1,1) and V(2,2)) and red-black Gauss-Seidel (V(1,1)) relaxation, for various values of σ_t h; S_4 case and m = 512 (nonuniform code) 48
2.9 Timings for Jacobi (V(1,1) and V(2,2)) and red-black Gauss-Seidel (V(1,1)) on the CM2 and Cray YMP; S_4 case, σ_t h = 0.1 54
2.10 Timings for Jacobi (V(1,1) and V(2,2)) and red-black Gauss-Seidel (V(1,1)) on the CM2 and Cray YMP; S_4 case, σ_t h = 0.1 (nonuniform code) 54
2.11 Comparison between Jacobi V(1,1), Jacobi V(2,2), and red-black convergence factors (ρ_j1, ρ_j2 and ρ_rb) for various values of σ_t h (S_4 case) 55
3.1 Convergence rates for the V(1,1) and V(1,0) angular multigrid cycles for constant m (m = 4) and various values of N (S_N cases) 98
3.2 Timings in seconds for the V(1,1) and V(1,0) angular multigrid cycles for constant m (m = 4) and various values of n (S_N cases, N = 2n) on the CM200 and Cray YMP 99
3.3 Timings in seconds for the V(1,1) and V(1,0) angular multigrid cycles for constant m (m = 64) and various values of n (S_N cases, N = 2n) on the CM200 and Cray YMP 100
3.4 Comparison between V(1,1) and V(1,0) anisotropic codes on the CM200 for various values of n (S_N cases, N = 2n) 101
3.5 Convergence rates for the V(1,0) and V(1,1) angular multigrid cycles for N = 8 and various values of m 105
3.6 Timings in seconds for the V(1,0) and V(1,1) angular multigrid cycles for various values of m and N = 8 on the CM200 105
CHAPTER 1
INTRODUCTION

1.1 History.

The Boltzmann equation has formed the basis for the kinetic theory of gases since it was established by Ludwig Boltzmann in 1872, and has proved to be useful in the study of electron transport in solids and plasmas, neutron transport in nuclear reactors, photon transport in superfluids, and radiative transfer in planetary and stellar atmospheres. This thesis treats numerical methods for the solution of the neutron transport equations in the form of the linear Boltzmann equation. Modern transport codes are generally based on either Monte Carlo or discrete ordinate methods. Monte Carlo methods are stochastic in nature and are based upon the direct simulation of the particle transport process. As the number of particles in the simulation is increased, the statistical accuracy of the resulting solution is improved. This method is well suited to three-dimensional problems with complex geometries, but it is often expensive if differential rather than integral quantities are desired. Discrete ordinate methods are based upon the use of finite difference and quadrature integration techniques to reduce the analytic transport equation to a matrix equation. For one-dimensional problems the discrete ordinate method can be orders of magnitude faster than Monte Carlo. In two dimensions neither method has clear superiority. In three dimensions Monte Carlo has thus far seen more use. Due to the randomness of the error magnitude in Monte Carlo methods, discrete ordinate
methods are generally preferred. The latter, however, may be difficult to generalize for complex multidimensional geometries. By utilizing the properties of massively parallel computers it may be possible to develop discrete ordinate equations for these geometries. The focus of this dissertation is the development of efficient parallel algorithms for the transport equations on SIMD computers using discrete ordinate methods. The use of massively parallel computers represents a step towards the development of algorithms that will generate solutions for the transport equations more accurately and faster.

1.2 Model Problem.

The transport of neutrally charged particles can be modeled by the linear Boltzmann transport equation, which is derived in this section. Consider a beam of particles colliding with a target. Let N(y, Ω, E, t) represent the density of particles at a position y, moving in direction Ω (a unit vector), with energy E, at time t. Suppose the volume analyzed has cross-sectional area ΔA and length Δy and that the origin is at y. Requiring particle balance, we can write the equation

IP = −LS − LC + PE,

where

IP = increase in N,
LS = losses from streaming,
LC = losses from collisions,
PE = particle sources.

For the infinitesimal element of volume, and particles with energy and direction in the intervals [E, E + δE] and [Ω, Ω + δΩ], respectively, we have
[N(y, Ω, E, t + Δt) − N(y, Ω, E, t)] Δy ΔA δE δΩ =
    −[N(y + Δy, Ω, E, t) − N(y, Ω, E, t)] ΔA v Δt δE δΩ
    − σ_t(y, E) N(y, Ω, E, t) Δy ΔA v Δt δE δΩ
    + (Q(y, Ω, E, t) + S(y, Ω, E, t)) Δy ΔA v Δt δE δΩ,

where v is the particle speed (a function of the energy). Dividing by Δy ΔA δE δΩ Δt and taking the limits as Δt and Δy tend to zero, we may replace the first and the second terms by derivatives with respect to t and y, respectively, yielding

∂N/∂t = −∂(vN)/∂y − σ_t vN + S(y, Ω, E, t) + Q(y, Ω, E, t),   (1.2.1)

where σ_t dz represents the expected number of interactions (absorptive or scattering) that a particle will have in traveling a distance dz. Assuming that Δy is in the direction of Ω, we can express the directional derivative ∂/∂y as Ω·∇. Thus, the balance equation can be written as

∂N/∂t = −Ω·∇(vN) − σ_t vN + S(y, Ω, E, t) + Q(y, Ω, E, t).   (1.2.2)

The first term on the right-hand side represents particle loss due to streaming, the second term represents loss due to scattering or absorption, and the third term represents sources. For steady-state problems, (1.2.2) simplifies to

Ω·∇ψ + σ_t ψ = S(y, Ω, E) + Q(y, Ω, E),   (1.2.3)

where ψ(y, Ω, E) = vN(y, Ω, E) is the particle flux. The source S(y, Ω, E) contains the particles that are scattered from every other direction Ω′ to direction Ω and energy E′ to energy E. This can be modeled as

S(y, Ω, E) = σ_s ∫ dE′ ∫ dΩ′ Σ_s(y, Ω′·Ω, E′ → E) ψ(y, Ω′, E′),
where Σ_s is the probability that a particle will be scattered from Ω′ → Ω and E′ → E. In most applications, it is appropriate to discretize in energy to form the multigroup equations [9]. Let ψ^k represent the flux with energy E_k. Then we have

Ω·∇ψ^k + σ_t ψ^k = σ_s ∫ dΩ′ Σ_s(y, Ω′·Ω) ψ^k(y, Ω′) + Q^k(y, Ω),   (1.2.4)

where σ_s dz is the expected number of collisions that result in a scattering with energy E_k along a path of length dz. The source term Q^k includes particles scattered with energy E_k from other energy groups. The solution of the single-group equation

(1/σ_t) Ω·∇ψ + ψ = c ∫ dΩ′ Σ_s(y, Ω·Ω′) ψ(y, Ω′) + (1/σ_t) Q(y, Ω, E),   (1.2.5)

where c = σ_s/σ_t, is an important step in the solution of the multigroup equations [9]. When σ_t → ∞ and c → 1, equation (1.2.5) will be singularly perturbed. In fact, it is often the case that this equation is nearly singular in all spatial frequencies of interest. This fact causes difficulties for numerical solution techniques, and multigrid in particular. Consider equation (1.2.5) in a semi-infinite slab of width b in the x-dimension. Assuming that the solution is constant in y and z, then ∂ψ/∂y = ∂ψ/∂z = 0. Equation (1.2.5) becomes

μ ∂ψ/∂x + σ_t ψ = (σ_s/2) ∫_{−1}^{1} dμ′ Σ_s(x, μ′ → μ) ψ(x, μ′) + q(x, μ),   (1.2.6)

where μ = cos(θ) and θ is the angle between Ω and the x-axis. For the isotropic scattering model, Σ_s is constant in angle; that is, a particle is equally likely to scatter in any direction. Isotropic scattering yields the equation

μ ∂ψ/∂x + σ_t ψ = (σ_s/2) ∫_{−1}^{1} ψ(x, μ′) dμ′ + q(x, μ),   (1.2.7)
for x ∈ (0, b) and μ ∈ [−1, 1]. Here, ψ(x, μ) represents the flux of particles at position x traveling at an angle θ = arccos(μ) from the x-axis. The quantities σ_s and σ_t were already defined, and the quantity σ_a = σ_t − σ_s represents the expected number of absorptive interactions. The boundary conditions prescribing particles entering the slab are

ψ(0, μ) = g₀(μ),   ψ(b, −μ) = g₁(μ),   μ ∈ (0, 1).   (1.2.8)

Now we consider anisotropic scattering within a single energy group. Assume the physical domain to be a semi-infinite slab. Anisotropic scattering is modeled under the assumption that the probability of scattering from direction Ω′ into direction Ω is a function of the angle between these directions (Ω·Ω′). This leads to scattering source terms of the form

S(x, Ω) = σ_s ∫ dΩ′ Σ_s(Ω·Ω′) ψ(x, Ω′),

which are integral operators with spherical harmonics as their singular vectors. We assume that the variation in angle can be described by a finite collection of singular vectors, resulting in an S_N discretization. In one-dimensional problems, the singular vectors for the integral operators are Legendre polynomials. Let P₀(μ), …, P_{N−1}(μ) be the Legendre polynomials and σ₀, …, σ_{N−1} be the associated singular values. If we express the flux ψ(x, μ) as a finite summation of the first N Legendre polynomials, we have

ψ(x, μ) = Σ_{l=0}^{N−1} (2l + 1) φ_l(x) P_l(μ),   (1.2.9)

where (2l + 1) is a normalization factor and the moments φ_l are

φ_l(x) = (1/2) ∫_{−1}^{1} P_l(μ) ψ(x, μ) dμ,   (1.2.10)
which can be written exactly as

φ_l(x) = (1/2) Σ_{k=1}^{N} w_k P_l(μ_k) ψ(x, μ_k),   (1.2.11)

for l = 0, …, N − 1, where μ₁, μ₂, …, μ_N are the Gauss quadrature points and w₁, w₂, …, w_N are the Gauss quadrature weights of degree N. The scattering term in one dimension then becomes

S_s(x, μ) = σ_s ∫_{−1}^{1} dμ′ Σ_s(x, μ′ → μ) ψ(x, μ′) = Σ_{l=0}^{N−1} (2l + 1) σ_l φ_l(x) P_l(μ).   (1.2.12)

Notice that equations (1.2.9), (1.2.11), and (1.2.12) can be written, for all the quadrature points, in the matrix notation ψ(x) = T⁻¹ φ(x), φ(x) = T ψ(x), and S(x) = T⁻¹ D φ(x), respectively. The vector ψ(x) is equal to (ψ(x, μ₁), ψ(x, μ₂), …, ψ(x, μ_N))ᵀ, and the vector φ(x) = (φ₀(x), φ₁(x), …, φ_{N−1}(x))ᵀ represents the Legendre moments. The N × N matrices T and T⁻¹ are defined by their respective elements T_{l,k} = w_k P_{l−1}(μ_k)/2 and (T⁻¹)_{k,l} = (2l − 1) P_{l−1}(μ_k), and the N × N diagonal matrix D has diagonal elements D_{l,l} = σ_{l−1}.
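To make the transform pair concrete, the following sketch (in NumPy; the weights-over-two normalization of T is the one assumed here, and all variable names are illustrative, not the thesis's) builds T and T⁻¹ from an S_4 quadrature rule and checks that the moments reproduce the point values:

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, Legendre

N = 4                                   # S_4: four quadrature points
mu, w = leggauss(N)                     # Gauss points and weights on [-1, 1]
P = np.array([Legendre.basis(l)(mu) for l in range(N)])   # P[l, k] = P_l(mu_k)

T = 0.5 * w * P                         # analysis: T[l, k] = w_k P_l(mu_k) / 2
Tinv = ((2 * np.arange(N) + 1)[:, None] * P).T  # synthesis: (2l+1) P_l(mu_k)

psi = 1.0 + mu + mu**2                  # a sample flux resolvable by S_4
phi = T @ psi                           # Legendre moments phi_l, as in (1.2.11)
assert np.allclose(Tinv @ phi, psi)     # (1.2.9) recovers the point values
assert np.allclose(T @ Tinv, np.eye(N)) # T and Tinv are mutual inverses
```

The round trip is exact because Gauss quadrature with N points integrates the polynomial products involved without error.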
1.3 Notation.

For the most part, the following notational conventions are used in this thesis. Particular usage and exceptions should be clear from context.

A, L, X, ...            Matrices or differential operators.
B_i^k, F_i^k, D_i^k, ...   Matrices at level k on row i of a bigger matrix (or corresponding to spatial cell i).
N                       Number of Gauss quadrature points.
N^k                     Number of Gauss quadrature points at level k.
n                       Number of positive or negative Gauss quadrature points (n = N/2).
P_i                     Legendre polynomial of degree i (i = 0, ..., N − 1).
p_i                     Processor i (i = 1, ..., N).
σ_i                     Singular values associated with the Legendre polynomial P_i (degree i) (i = 0, ..., N − 1).
X_{ij}                  The ij-th entry of the matrix X.
q, r, ...               Vectors.
a, e, α, ...            Scalars or functions.
A^k                     Operator or matrix at (grid) level k.
det                     The determinant of a matrix.
SIMD                    Single Instruction Multiple Data computer.
MIMD                    Multiple Instruction Multiple Data computer.

1.4 Summary of Results.

In this dissertation, we develop parallel algorithms for solving the neutron transport equations for the isotropic and anisotropic models. These algorithms are designed for SIMD architectures like the CM2 and take advantage of its architecture
at all steps. In Chapter 2, we develop an efficient parallel algorithm for solving the isotropic transport equation (1.2.7). This algorithm uses the modified linear discontinuous (MLD) spatial discretization method and an S_N angular discretization for the scattering term, in which only the first singular value is nonzero. Within a multigrid algorithm, we consider different kinds of relaxation schemes and choose the most appropriate for the SIMD architecture. Proofs of convergence for the algorithms are developed through Fourier analysis. We also compare the CM2 timings to the Cray YMP timings for a sequential version. In Chapter 3, we develop an efficient algorithm for solving the anisotropic transport equation model. We use an angular multigrid algorithm that requires a shifted source iteration at each level. Performing this iteration with a shifted transport sweep would take 2m steps for a grid with m cells or (2m + 2) spatial grid points. We develop a block cyclic reduction scheme for the CM200, which is a version of the CM2 with upgraded processors, for this step of the angular multigrid method; it takes 2 log2 m steps. One issue dealt with is the stability of the parallel block cyclic reduction. We can obtain a stable algorithm if we choose to write the system of equations in such a way that the right-hand side remains bounded at all levels of cyclic reduction. The stability of this method is proved. The coarsest level of angular multigrid involves an isotropic S_2 equation. We use our parallel multigrid algorithm for the isotropic model to solve this equation. There are other aspects of the angular multigrid that need attention in the development of a parallel angular multigrid, such as calculation of the scattering term. This step is completely parallelized in the spatial direction and partially parallelized in the angular direction.
In addition, we attempted to minimize the number of communications in the angular direction.
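The thesis's cyclic reduction operates on blocks of transport coefficients, but its flavor can be seen on the scalar recurrence x_i = a_i x_{i−1} + b_i, a stand-in for the transport sweep (all names here are illustrative). Each pass composes coefficient pairs, so all m values follow from the boundary value after about log2 m passes rather than m sequential steps:

```python
def sweep(a, b, x0):
    """Sequential transport-sweep analogue: x[i] = a[i]*x[i-1] + b[i]."""
    x, out = x0, []
    for ai, bi in zip(a, b):
        x = ai * x + bi
        out.append(x)
    return out

def cyclic_reduction(a, b, x0):
    """Parallel-style solve of the same recurrence: after k passes each pair
    (a[i], b[i]) maps x[i - 2^k] (or x0, for small i) directly to x[i]."""
    a, b, m = a[:], b[:], len(a)
    span = 1
    while span < m:
        # on a SIMD machine this inner loop is one simultaneous vector step
        for i in range(m - 1, span - 1, -1):
            a[i], b[i] = a[i] * a[i - span], a[i] * b[i - span] + b[i]
        span *= 2
    # every coefficient pair now maps the boundary value x0 to x[i]
    return [ai * x0 + bi for ai, bi in zip(a, b)]

a = [0.5, 2.0, 1.0, 0.25]
b = [1.0, -1.0, 3.0, 0.5]
assert cyclic_reduction(a, b, 2.0) == sweep(a, b, 2.0)  # [2.0, 3.0, 6.0, 2.0]
```

The descending inner loop reads only coefficients not yet updated in the current pass, which is what a simultaneous (data-parallel) update would do.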
We compare two kinds of angular multigrid schemes, a V(1,1) cycle and a V(1,0) cycle, and conclude that V(1,1) is superior on the CM200. The algorithms are also compared to sequential implementations on the Cray YMP.

1.5 The CM2 and CM200.

The architecture of the CM2 or CM200 contains 2048 floating point processors, each with up to 1 MB of memory, a front-end computer, and an interprocessor communications network. Each processor is like a serial computer: it can execute arithmetic and logical instructions, calculate memory addresses, and perform interprocessor communication or I/O. The difference is that the data processors do not fetch instructions from their respective memories. Instead, they are collectively under the control of a single microcoded sequencer. The task of the sequencer is to decode commands from the front end and broadcast them to the data processors, which then execute the same instruction simultaneously. Each processor has its own memory. Each floating point accelerator consists of two special-purpose VLSI chips for each pair of CM processor chips, a memory interface unit, and a floating-point execution unit. If the size of the parallel variables requires more physical processors than are available, fictitious, or so-called virtual, processors are created. Interprocessor communication is very important in data parallel processing. The CM2 supports three forms of communication among the processors: routing, NEWS, and scanning. These tasks are accomplished in the CM2 by special-purpose hardware. Message passing occurs in parallel. If the grid is assigned a NEWS format, each processor has a north, east, west, and south neighbor. This kind of grid is used to represent our computational grids and is most efficient when the interprocessor communication is between nearest neighbors within a Cartesian grid.
In our implementations, we use intrinsic functions that are combinations of communication
and computational operations, such as sum and spread. The sum operation performs an addition for array elements distributed across the processors. The spread intrinsic function distributes data belonging to one processor across other processors in a given direction.
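In NumPy terms (used here only as a stand-in for the CM intrinsics; the array contents and weights are made up for illustration), sum is a reduction along one axis of a distributed array, and spread is a broadcast of one slice back across that axis:

```python
import numpy as np

# psi: a (cells x angles) array of fluxes, one column per quadrature angle,
# laid out the way a NEWS grid would distribute it across processors
psi = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])
w = np.array([0.5, 0.25, 0.25])     # hypothetical quadrature weights

# "sum": reduce across the angular direction, forming a weighted scalar
# flux at every spatial cell simultaneously
phi = (psi * w).sum(axis=1)

# "spread": replicate one slice of data back across the angular direction
phi_spread = np.repeat(phi[:, None], psi.shape[1], axis=1)

assert phi.tolist() == [1.75, 4.75]
assert phi_spread.shape == psi.shape
```

Both operations touch every spatial point at once, which is why they map well onto the SIMD execution model described above.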
CHAPTER 2
PARALLEL MULTIGRID FOR THE ISOTROPIC MODEL

2.1 Introduction.

The focus of this chapter is on a parallel multigrid algorithm for solving the isotropic transport equations in a slab geometry. The spatial discretization scheme used is the MLD finite element method, which represents a lumped version of the standard linear discontinuous (LD) scheme [7]. The parallel algorithm was implemented on the CM2. Convergence rates and timings for the algorithm on this machine and the Cray YMP are shown. For steady-state problems within the same energy group in the isotropic case (by isotropic we mean that the probability of scattering for the particles is the same for all directions), the transport equation in a slab geometry of slab width b becomes

μ ∂ψ/∂x + σ_t ψ = (σ_s/2) ∫_{−1}^{1} ψ(x, μ′) dμ′ + q(x, μ),   (2.1.1)

for x ∈ (0, b) and μ ∈ [−1, 1]. Here, ψ(x, μ) represents the flux of particles at position x traveling at an angle θ = arccos(μ) from the x-axis, σ_s dx the expected number of scattering interactions, σ_a = σ_t − σ_s the expected number of absorptive interactions, and q(x, μ) the particle source. The boundary conditions, which describe particles
entering the slab, are

ψ(0, μ) = g₀(μ),   ψ(b, −μ) = g₁(μ),   μ ∈ (0, 1).   (2.1.2)

This problem is difficult for conventional methods to solve in two cases of physical interest:

1. c = σ_s/σ_t = 1 (pure scattering, no absorption).
2. σ_t b ≫ 1 (optically dense).

In fact, as σ_t → ∞ and c → 1, the problem becomes singularly perturbed [10]. Standard discrete approximations to (2.1.1) and (2.1.2) will have operators with condition numbers on the order of at least σ_t², regardless of the mesh size [5]. This phenomenon presents problems for numerical solution techniques in general and multigrid in particular. In this chapter, the discretization used in the spatial dimension is the special finite element method MLD, which we describe in detail in the next section. This scheme behaves well in the thick limit and is very accurate [8]. To solve the linear system of equations, we use a suitable relaxation process, called μ-line relaxation, within a multigrid algorithm that gives convergence rates on the order of O(σ_t h) when σ_t h ≪ 1 and O((σ_t h)⁻¹) when σ_t h ≫ 1. For the two-angle case, it behaves like an exact solver. These convergence rates were proved for special cases in [10] and are substantiated here by Fourier analysis. The Fourier analysis for multigrid based on red-black μ-line relaxation (with numerical results for both Jacobi and red-black relaxation) is shown in Section 2.9. The results show close agreement between the Fourier analysis and the computational results presented here. The main benefit of the SIMD architecture of the CM2 is attained at the relaxation step of the multigrid algorithm, where we take advantage of the structure of the matrix, matching the nonzero entries of very sparse matrices with the relevant
processors.

An outline of the remainder of this chapter is as follows. In Section 2.2 we describe the discretization scheme used. In Section 2.3 we show how to accomplish the inversion of the relaxation matrix in a manner that takes advantage of the SIMD architecture of the CM2. In Section 2.4 we describe the multigrid scheme and the interpolation and restriction operators used. In Section 2.5 we discuss the storage and data structures. In Section 2.6 we develop the algorithm for the parallel relaxation and a few implementation details. In Section 2.7 we consider a few implementation details of our CM2 multigrid code. In Section 2.8 we develop the discretization in edge-edge notation. In Section 2.9 we predict convergence properties of the multigrid algorithm through Fourier analysis. In Section 2.10 we report on convergence rates. In Section 2.11 we show the timings obtained with our CM2 parallel codes and compare them with a sequential version of our algorithm on a Cray YMP. Finally, in Section 2.12 we draw a few conclusions.

2.2 The Modified Linear Discontinuous Scheme.

As described in the introduction, angular discretization is accomplished by expanding the angular dependence in Legendre polynomials. This scheme, known as the S_N approximation when the first N Legendre polynomials are used, results in a semidiscrete set of equations that resemble collocation at N Gauss quadrature points μ_j, j = 1, …, N, with weights w_j, j = 1, …, N. Since the quadrature points and weights are symmetric about zero, we reformulate the problem in terms of the positive values μ_j, j = 1, …, n, where n = N/2. We define ψ_j⁺ = ψ(x, μ_j) and ψ_j⁻ = ψ(x, −μ_j) for j = 1, …, n. The spatial discretization is accomplished by the MLD scheme, which uses elements that are linear across each cell and discontinuous in the upwind direction.
In our grid representation, the variables ψ⁺_{i,j} and ψ⁻_{i,j} denote
the flux of particles at position x_i in the directions μ_j and −μ_j, respectively. Assuming c = 1, the nodal equations are

μ_j (ψ⁺_{i+1/2,j} − ψ⁺_{i−1/2,j}) / (σ_t h_i) + ψ⁺_{i,j} = Σ_{k=1}^{n} (w_k/2)(ψ⁺_{i,k} + ψ⁻_{i,k}) + q⁺_{i,j},   (2.2.1)

2μ_j (ψ⁺_{i+1/2,j} − ψ⁺_{i,j}) / (σ_t h_i) + ψ⁺_{i+1/2,j} = Σ_{k=1}^{n} (w_k/2)(ψ⁺_{i+1/2,k} + 2ψ⁻_{i,k} − ψ⁻_{i−1/2,k}) + q⁺_{i+1/2,j},   (2.2.2)

μ_j (ψ⁻_{i−1/2,j} − ψ⁻_{i+1/2,j}) / (σ_t h_i) + ψ⁻_{i,j} = Σ_{k=1}^{n} (w_k/2)(ψ⁺_{i,k} + ψ⁻_{i,k}) + q⁻_{i,j},   (2.2.3)

and

2μ_j (ψ⁻_{i−1/2,j} − ψ⁻_{i,j}) / (σ_t h_i) + ψ⁻_{i−1/2,j} = Σ_{k=1}^{n} (w_k/2)(ψ⁻_{i−1/2,k} + 2ψ⁺_{i,k} − ψ⁺_{i+1/2,k}) + q⁻_{i−1/2,j},   (2.2.4)

j = 1, …, n, i = 1, …, m, with boundary conditions

ψ⁺_{1/2,j} = g_{0,j},   ψ⁻_{m+1/2,j} = g_{1,j},   (2.2.5)

j = 1, …, n. In our model, x_{i+1/2} and x_{i−1/2} are cell edges, x_i = (x_{i+1/2} + x_{i−1/2})/2 is the cell center, and h_i = x_{i+1/2} − x_{i−1/2} is the cell width, i = 1, …, m. Equations (2.2.1) and (2.2.3) are called balance equations, and (2.2.2) and (2.2.4) are called edge equations. In block matrix form, equations (2.2.1)-(2.2.4) can be written respectively as

B_i (ψ⁺_{i+1/2} − ψ⁺_{i−1/2}) + ψ⁺_i = R (ψ⁺_i + ψ⁻_i) + q⁺_i,   (2.2.6)

2B_i (ψ⁺_{i+1/2} − ψ⁺_i) + ψ⁺_{i+1/2} = R (ψ⁺_{i+1/2} + 2ψ⁻_i − ψ⁻_{i−1/2}) + q⁺_{i+1/2},   (2.2.7)

B_i (ψ⁻_{i−1/2} − ψ⁻_{i+1/2}) + ψ⁻_i = R (ψ⁺_i + ψ⁻_i) + q⁻_i,   (2.2.8)

2B_i (ψ⁻_{i−1/2} − ψ⁻_i) + ψ⁻_{i−1/2} = R (ψ⁻_{i−1/2} + 2ψ⁺_i − ψ⁺_{i+1/2}) + q⁻_{i−1/2},   (2.2.9)
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. n 21 1 I I I T I I I I I I I I I I I I I I I I I I I I I I I I I I I I I ITI I I I I I I I I I I __ 1 ___ 1 ___ 1 __ J __ l __ L __ 1 ___ 1 ___ I __ J __ l __ I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I 2 3 2i 1 2i 2i + 1 2i + 2 2i + 3 2m+ 1 cell i cell i+ 1 Figure 2.1: Computational grid !.+ + .!.fl.t. 2 2 (2.2.10) for i = 1, ... m. Here, 0 1 and R= (2.2.11) 0 1 P.t1 P.21 1 /ln are the positive Gauss quadrature points, Wt1 w2, ... Wn are the Gauss quadrature Weights, and p_.+() is an nvector: p_.+() = ( 1 ... t/Jt())T. In the computational grid, the inflow for positive angles is on the left of each cell and on the right for the negative angles. For p.line relaxation, the inflows of each cell are assumed known. This is the same as using the values of these variables from the previous iteration. Figure 2.1 shows the computational domain with 2m+ 1 spatial points and n angles. Consider cell i. In p.line relaxation, the fluxes at the cell centers, t/J7' and _, 't,(grid points 2i), together with the outflow flux variables, '!f!.t._!. and 't,++!. (grid 2 2 points 2i1 and 2i + 1), will be updated by the solution of the following matrix equation: 15
PAGE 26
[ I+2B_i−R   −2R     −2B_i   R         ] [ ψ⁻_{i−1/2} ]   [ 0               ]   [ q⁻_{i−1/2} ]
[ 0          I−R     −R      B_i       ] [ ψ⁺_i       ] = [ B_i ψ⁺_{i−1/2}  ] + [ q⁺_i       ]
[ B_i        −R      I−R     0         ] [ ψ⁻_i       ]   [ B_i ψ⁻_{i+1/2}  ]   [ q⁻_i       ]
[ R          −2B_i   −2R     I+2B_i−R  ] [ ψ⁺_{i+1/2} ]   [ 0               ]   [ q⁺_{i+1/2} ]
  (2.2.12)

Solving this matrix equation corresponds to performing a μ-line relaxation. In our implementations, we perform two-cell relaxations for the whole domain. In this kind of relaxation, we consider pairs of cells coupled together. This relaxation yields an error that is linear, independent of angle, and continuous across two cells up to O(1/(σ_t h)) accuracy when σ_t h ≫ 1 [10]. Instead of updating four variables with a relaxation scheme, we update eight. For example, in Figure 2.1, the variables ψ⁻_{i−1/2}, ψ⁺_i, ψ⁻_i, ψ⁺_{i+1/2}, ψ⁻_{i+1/2}, ψ⁺_{i+1}, ψ⁻_{i+1}, and ψ⁺_{i+3/2} will be updated. Notice that the inflow variables ψ⁺_{i−1/2} and ψ⁻_{i+3/2} will be used from the previous iteration. Two-cell μ-line relaxation involves inversion of the 8n × 8n matrix

[ I+2B_i−R  −2R    −2B_i  R         0             0          0          0           ]
[ 0         I−R    −R     B_i       0             0          0          0           ]
[ B_i       −R     I−R    0         −B_i          0          0          0           ]
[ R         −2B_i  −2R    I+2B_i−R  0             0          0          0           ]
[ 0         0      0      0         I+2B_{i+1}−R  −2R        −2B_{i+1}  R           ]
[ 0         0      0      −B_{i+1}  0             I−R        −R         B_{i+1}     ]
[ 0         0      0      0         B_{i+1}       −R         I−R        0           ]
[ 0         0      0      0         R             −2B_{i+1}  −2R        I+2B_{i+1}−R ]
  (2.2.13)

The order of variables for this matrix is given by the representation

ψ = (ψ⁻_{i−1/2}, ψ⁺_i, ψ⁻_i, ψ⁺_{i+1/2}, ψ⁻_{i+1/2}, ψ⁺_{i+1}, ψ⁻_{i+1}, ψ⁺_{i+3/2})ᵀ.   (2.2.14)

The right-hand side for two-cell μ-line relaxation is given by
[Figure 2.2: Red-black relaxation for 4 cells. Spatial points 1 through 9; points 1-5 form the first cell pair and points 5-9 the second; alternate pairs are colored red and black.]

(q⁻_{i−1/2}, B_i ψ⁺_{i−1/2} + q⁺_i, q⁻_i, q⁺_{i+1/2}, q⁻_{i+1/2}, q⁺_{i+1}, B_{i+1} ψ⁻_{i+3/2} + q⁻_{i+1}, q⁺_{i+3/2})ᵀ,   (2.2.15)

where ψ⁺_{i−1/2} and ψ⁻_{i+3/2} are taken from the previous iteration.

For our implementations, we developed a multigrid scheme that uses either a parallel block Jacobi relaxation or a parallel block red-black Gauss-Seidel relaxation. For Jacobi relaxation, all of the two-cell relaxations are performed simultaneously for the whole domain. For red-black relaxation, we update half of the total number of cell pairs during the first (red) stage of the algorithm and the other half during the second (black) stage, each time skipping neighboring cell pairs. To illustrate, Figure 2.2 shows 4 cells. The even numbers represent the cell centers and the odd numbers the cell edges. For simplicity, we omitted the vertical lines (passing through each spatial grid point) that contain the Gauss quadrature points (angles). For a Jacobi type of relaxation process, variables at spatial points 5 through 9 will be updated at the same time as points 1 through 5. Each pair of cells will use the relaxation matrix just described. On a parallel machine like the CM2, red-black relaxation takes approximately twice as much CPU time as a Jacobi relaxation. This is partially offset by a better convergence rate (refer to the Fourier analysis in Section 2.9 and numerical results in Section 2.10).
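The two sweep schedules can be sketched as follows. This is an illustrative Python stand-in (the thesis implementation is CM Fortran on the CM2); the cell-pair bookkeeping, not the relaxation itself, is what is shown.

```python
# Sketch of the two sweep schedules for two-cell relaxation on m cells.
m = 8                                              # number of cells, assumed divisible by 4
pairs = [(i, i + 1) for i in range(1, m + 1, 2)]   # cell pairs (1,2), (3,4), ...

# Jacobi: every pair is relaxed simultaneously, in a single stage.
jacobi_stages = [pairs]

# Red-black Gauss-Seidel: two stages, each skipping neighboring cell pairs.
red_stage = pairs[0::2]     # pairs (1,2), (5,6), ...
black_stage = pairs[1::2]   # pairs (3,4), (7,8), ...
rb_stages = [red_stage, black_stage]
```

On a SIMD machine every pair inside one stage is processed at the same time, which is why red-black costs roughly two Jacobi stages of CPU time.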
2.3 Two-cell Inversion.

Direct solution of the two-cell relaxation matrix equation can be done in a very efficient way on SIMD computers like the CM2. First notice that the matrix can be written as the difference of an easily inverted matrix, which we call A, and a rank-four matrix, which we write as VWᵀ. The matrix A has a special structure that allows its inverse to be calculated in place:

A =
[ I+2B_i   0       −2B_i   0         0           0          0           0          ]
[ 0        I       0       B_i       0           0          0           0          ]
[ B_i      0       I       0         −B_i        0          0           0          ]
[ 0        −2B_i   0       I+2B_i    0           0          0           0          ]
[ 0        0       0       0         I+2B_{i+1}  0          −2B_{i+1}   0          ]
[ 0        0       0       −B_{i+1}  0           I          0           B_{i+1}    ]
[ 0        0       0       0         B_{i+1}     0          I           0          ]
[ 0        0       0       0         0           −2B_{i+1}  0           I+2B_{i+1} ]
  (2.3.1)

This is an 8×8 block matrix with n×n diagonal blocks. The rank-four matrix VWᵀ (8n×8n) is defined by
V =
[ 2   0   0   0 ]
[ 1   1   0   0 ]
[ 1   1   0   0 ]
[ 0   2   0   0 ]
[ 0   0   2   0 ]
[ 0   0   1   1 ]
[ 0   0   1   1 ]
[ 0   0   0   2 ]
  (2.3.2)   (8n×4)

and

Wᵀ =
[  ½wᵀ   wᵀ   0    −½wᵀ    0      0    0     0   ]
[ −½wᵀ   0    wᵀ    ½wᵀ    0      0    0     0   ]
[  0     0    0     0      ½wᵀ   wᵀ   0    −½wᵀ ]
[  0     0    0     0     −½wᵀ   0    wᵀ    ½wᵀ ]
  (2.3.3)   (4×8n)

where the entries of V are the n-vectors 0 = (0, 0, …, 0)ᵀ, 1 = (1, 1, …, 1)ᵀ, and 2 = (2, 2, …, 2)ᵀ, and wᵀ = (w_1, w_2, …, w_n).

We can use the Sherman-Morrison [6] formula to calculate the inverse of the two-cell μ-line relaxation matrix M = A − VWᵀ:

M⁻¹ = (A − VWᵀ)⁻¹ = A⁻¹ + A⁻¹ V E⁻¹ Wᵀ A⁻¹,   (2.3.4)

where E = I − Wᵀ A⁻¹ V. If y is an 8n-vector, we have M⁻¹ y = (I + A⁻¹ V E⁻¹ Wᵀ) A⁻¹ y. The connectivity of A is two interleaved 4×4 block matrices. Thus, inverting A amounts to solving 2n tridiagonal systems, each of size 4×4, which can be done in parallel. Since all the blocks are diagonal matrices, and thus commute, we use the quotient notation M_i/M_j = M_i M_j⁻¹. We have A⁻¹ =
[ I/D_i     0                                      2B_i/D_i       0                                  2B_i²/(D_i D_{i+1})        0                     4B_i²B_{i+1}/(D_i D_{i+1})          0                 ]
[ 0         (I+2B_i)/D_i                           0              −B_i/D_i                           0                          0                     0                                   0                 ]
[ −B_i/D_i  0                                      (I+2B_i)/D_i   0                                  (B_i+2B_i²)/(D_i D_{i+1})  0                     2B_{i+1}(B_i+2B_i²)/(D_i D_{i+1})   0                 ]
[ 0         2B_i/D_i                               0              I/D_i                              0                          0                     0                                   0                 ]
[ 0         0                                      0              0                                  I/D_{i+1}                  0                     2B_{i+1}/D_{i+1}                    0                 ]
[ 0         2B_i(B_{i+1}+2B_{i+1}²)/(D_i D_{i+1})  0              (B_{i+1}+2B_{i+1}²)/(D_i D_{i+1})  0                          (I+2B_{i+1})/D_{i+1}  0                                   −B_{i+1}/D_{i+1}  ]
[ 0         0                                      0              0                                  −B_{i+1}/D_{i+1}           0                     (I+2B_{i+1})/D_{i+1}                0                 ]
[ 0         4B_iB_{i+1}²/(D_i D_{i+1})             0              2B_{i+1}²/(D_i D_{i+1})            0                          2B_{i+1}/D_{i+1}      0                                   I/D_{i+1}         ]
  (2.3.5)

where B_i = diag(…, b_ij, …), D_i = diag(…, 1 + 2b_ij + 2b_ij², …), b_ij = μ_j/(σ_t h_i), and we write d_ij = 1 + 2b_ij + 2b_ij² for the diagonal entries of D_i. The matrix E = I − Wᵀ A⁻¹ V is the 4×4 matrix

E = I − Wᵀ A⁻¹ V =
[ d_1   0      a_1   c_1  ]
[ 0     d_1    b_1   a_11 ]
[ a_2   c_2    d_2   0    ]
[ b_2   a_22   0     d_2  ]
  (2.3.6)

where

d_1 = 1 − 2 Σ_{j=1}^{n} (w_j + w_j b_ij)/d_ij,   (2.3.7)

d_2 = 1 − 2 Σ_{j=1}^{n} (w_j + w_j b_{i+1,j})/d_{i+1,j},   (2.3.8)

c_1 = −2 Σ_{j=1}^{n} w_j b_ij² b_{i+1,j}/(d_ij d_{i+1,j}),   (2.3.9)
b_1 = −2 Σ_{j=1}^{n} w_j b_ij (1 + b_ij)(1 + b_{i+1,j})/(d_ij d_{i+1,j}),   (2.3.10)

a_11 = −2 Σ_{j=1}^{n} (w_j b_ij b_{i+1,j} + w_j b_ij² b_{i+1,j})/(d_ij d_{i+1,j}),   (2.3.11)

a_1 = −2 Σ_{j=1}^{n} (w_j b_ij² + w_j b_ij² b_{i+1,j})/(d_ij d_{i+1,j}),   (2.3.12)

a_22 = −2 Σ_{j=1}^{n} (w_j b_{i+1,j}² + w_j b_ij b_{i+1,j}²)/(d_ij d_{i+1,j}),   (2.3.13)

c_2 = −2 Σ_{j=1}^{n} w_j b_{i+1,j} (1 + b_ij)(1 + b_{i+1,j})/(d_ij d_{i+1,j}),   (2.3.14)

b_2 = −2 Σ_{j=1}^{n} w_j b_ij b_{i+1,j}²/(d_ij d_{i+1,j}),   (2.3.15)

and

a_2 = −2 Σ_{j=1}^{n} (w_j b_ij b_{i+1,j} + w_j b_ij b_{i+1,j}²)/(d_ij d_{i+1,j}).   (2.3.16)

E⁻¹ is also a 4×4 matrix and can be calculated analytically. Its entries are shown in Appendix A.

2.4 Multigrid.

To illustrate the multigrid scheme, we develop it here in two-level form. Let L_h denote the fine-grid operator, L_2h the coarse-grid operator, and I^h_{2h} and I^{2h}_h the interpolation and restriction operators, respectively. Let ν_1 and ν_2 be small integers (e.g., ν_1 = ν_2 = 1), which determine the number of relaxation sweeps performed respectively before and after the coarse-grid correction. One multigrid V(ν_1, ν_2) cycle is then represented (in two-level form) by the following:
1. Relax ν_1 times on L_h u_h = f_h.
2. Calculate the residual r_h = f_h − L_h u_h.
3. Restrict the residual to the coarse grid: r_2h = I^{2h}_h r_h.
4. Solve the coarse-grid equation L_2h e_2h = r_2h.
5. Correct the fine-grid solution, u_h ← u_h + I^h_{2h} e_2h, and relax ν_2 times on L_h u_h = f_h.

[Figure 2.3: Fine and coarse grid. Each coarse cell is the union of two neighboring fine cells.]

Our multigrid scheme is applied with regard to the spatial variable only. (For a multilevel scheme in angle, see Chapter 3.) Figure 2.3 illustrates grid points on the fine grid and on the coarse grid. The interpolation and restriction operators used here are based on the same finite element principle as in the derivation of the MLD scheme [10]. They are defined as follows:

I^{2h}_h =
[ S_{1,1}  S_{1,2}                                  ]
[                  S_{3,1}  S_{3,2}                 ]
[                                  ⋱                ]
[                          S_{m−1,1}  S_{m−1,2}     ]
  (2.4.1)
where, for the cell pair (i, i+1),

S_{i,1} =
[ I   0                       0                       0 ]
[ 0   (h_i/(h_i+h_{i+1})) I   0                       0 ]
[ 0   0                       (h_i/(h_i+h_{i+1})) I   0 ]
[ 0   0                       0                       0 ]

and

S_{i,2} =
[ 0   0                           0                           0 ]
[ 0   (h_{i+1}/(h_i+h_{i+1})) I   0                           0 ]
[ 0   0                           (h_{i+1}/(h_i+h_{i+1})) I   0 ]
[ 0   0                           0                           I ]
  (2.4.2)

and

I^h_{2h} =
[ T_{1,1}   ]
[ T_{2,1}   ]
[ T_{1,3}   ]
[ T_{2,3}   ]
[ ⋮         ]
[ T_{1,m−1} ]
[ T_{2,m−1} ]
  (2.4.3)

where the blocks copy the coarse outflow edge values directly and evaluate the linear coarse-cell representation at the fine cell centers and interior edges:

T_{1,i} =
[ I                               0                                   0                       0                                   ]
[ 0                               ((h_i+2h_{i+1})/(h_i+h_{i+1})) I    0                       (−h_{i+1}/(h_i+h_{i+1})) I          ]
[ (h_{i+1}/(h_i+h_{i+1})) I       0                                   (h_i/(h_i+h_{i+1})) I   0                                   ]
[ 0                               (2h_{i+1}/(h_i+h_{i+1})) I          0                       ((h_i−h_{i+1})/(h_i+h_{i+1})) I    ]

and

T_{2,i} =
[ ((h_{i+1}−h_i)/(h_i+h_{i+1})) I   0                              (2h_i/(h_i+h_{i+1})) I              0                       ]
[ 0                                 (h_{i+1}/(h_i+h_{i+1})) I      0                                   (h_i/(h_i+h_{i+1})) I  ]
[ (−h_i/(h_i+h_{i+1})) I            0                              ((2h_i+h_{i+1})/(h_i+h_{i+1})) I    0                       ]
[ 0                                 0                              0                                   I                       ]
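The five steps of the two-level cycle in Section 2.4 can be sketched in code. The sketch below is a hypothetical illustration, not the thesis code: it uses a 1D Poisson operator with damped Jacobi smoothing, full weighting, and linear interpolation as stand-ins for the MLD transport operator, the two-cell relaxation, and the operators (2.4.1)-(2.4.3).

```python
import numpy as np

def poisson(m, h):
    """m-point 1D operator for -u'' with Dirichlet ends; a stand-in for L_h."""
    return (2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)) / h**2

def two_level_vcycle(u, f, h, nu1=1, nu2=1):
    m = len(u)                        # assume m odd so coarsening is clean
    A = poisson(m, h)
    smooth = lambda v: v + (2.0 / 3.0) * (h**2 / 2.0) * (f - A @ v)  # damped Jacobi
    for _ in range(nu1):              # step 1: pre-relaxation
        u = smooth(u)
    r = f - A @ u                     # step 2: residual
    mc = (m - 1) // 2
    R = np.zeros((mc, m))             # step 3: restriction (full weighting)
    for i in range(mc):
        R[i, 2 * i:2 * i + 3] = [0.25, 0.5, 0.25]
    e2h = np.linalg.solve(poisson(mc, 2 * h), R @ r)   # step 4: coarse solve
    u = u + (2 * R.T) @ e2h           # step 5: correct (linear interpolation)
    for _ in range(nu2):              #         and post-relaxation
        u = smooth(u)
    return u
```

A single V(1,1) cycle of this kind already reduces the residual norm noticeably, which is the behavior the convergence factors of Sections 2.9-2.10 quantify for the transport scheme.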
The order of variables for these operators is the same as in (2.2.14). The coarse-grid operator is defined as

L_2h = I^{2h}_h L_h I^h_{2h},

where L_h is given by (2.2.6)-(2.2.9). Note that L_2h has the same form as L_h, but on a coarser grid. We use the notation 2h to indicate a coarse grid, although our grids are not really assumed to be uniform. In the general nonuniform case, the mesh size at cell i on the coarse grid is given by h_i = h_{2i−1} + h_{2i}.

2.5 Storage and Data Structure.

The implementation of this scheme on the CM2 involves the following considerations: We assign one processor to each point (i.e., to each (x_i, μ_j) pair) in the computational mesh, assuming there are at least (2m+1)n processors. If we are using fewer than (2m+1)n physical processors, the CM2 will reuse the physical processors, creating virtual processors to store the additional grid points. Variables that are defined at different grid levels have an index that represents the grid level. We store these variables such that if they are evaluated at the same (spatial and angular) coordinate pair they will be assigned to the same processor, even if their grid levels are distinct. This avoids communication when data is accessed from different grid levels for a given spatial and angular coordinate pair. For example, the variable ψ appears as psi(:serial, :news, :news) in the code. Here the first index represents the grid level and the remaining indices represent the spatial and angular coordinates.
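The layout convention of Section 2.5 can be illustrated with a small array sketch. This is a Python stand-in for the CM Fortran declaration (the names and sizes are illustrative); the point is that the grid-level axis is serial while the space and angle axes are the parallel ones, so values at the same (space, angle) pair share a processor across levels.

```python
import numpy as np

levels, npoints, nangles = 3, 9, 4          # 4 cells -> 2m+1 = 9 spatial points
psi = np.zeros((levels, npoints, nangles))  # psi(:serial, :news, :news)

# Fine-grid and coarse-grid values at the same (space, angle) slot live in the
# same trailing-index position, i.e. on the same (virtual) processor.
psi[0, 5, 2] = 1.0   # fine-grid value at spatial point 5, angle 2
psi[1, 5, 2] = 2.0   # coarse-grid value at the same slot, no communication needed
```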
[Figure 2.4: Computational mesh for four cells. Spatial points 1 through 9 on the horizontal axis and four angles (Gauss quadrature points) on the vertical axis.]

In relaxation for the nonuniform grid, some processors will require products of variables that belong to different cells, as can be seen, for example, in any row of (2.3.5). In particular, consider the first row of this matrix. Even though this row will have its elements stored in a processor in the first cell of the two-cell pair, it uses data from both cells (i.e., B_i² B_{i+1}). To avoid extra complication in the parallel algorithm, we store these variables in the proper processors at the outset. For example, we have an array b that stores the values of matrix B for the first cell, and an array bp1 that stores the values of matrix B for the second cell of the pair. In this way, all processors representing grid points of a two-cell pair will contain data from both cells of the pair. Figure 2.4 illustrates the computational mesh for four cells (horizontal axis) and four angles (vertical axis). Vertical lines 1-5 represent the variables associated with cell_1 and cell_2, while vertical lines 5-9 represent the variables associated with cell_3 and cell_4. Processors represented by vertical lines 1-4 will contain all the information associated with cell_1 and cell_2, while processors represented by vertical lines 5-8 will
contain all the information associated with cell_3 and cell_4. Consider cell_1 and cell_2. Array b contains the values b(i,j) = μ(j)/(σ_t(cell_1) h(cell_1)). Thus, b(1,3) = b(2,3) = b(3,3) = b(4,3). Array bp1 contains the values bp1(i,j) = μ(j)/(σ_t(cell_2) h(cell_2)). Thus, bp1(1,3) = bp1(2,3) = bp1(3,3) = bp1(4,3). However, b(1,1) ≠ b(1,2), bp1(1,1) ≠ bp1(1,2), and so on. The arrays b and bp1 are constant on the same subsets of the computational mesh. We also use some replication of data in a slightly different way for the mesh size variable h. The elements of array h contain the cell size h_i, the array hm1 contains h_{i−1}, and the array hp1 contains h_{i+1}. These three arrays are constant on vertical lines in Figure 2.4. This data structure is the same for all the grid levels, but with a different number of cells for each grid level.

Some communication, of course, is necessary between processors. In fact, between the boundaries of every pair of cells we will have some shifting of data. For example, consider the variables in (2.2.14), updated with the appropriate two-cell relaxation matrix (cell_i and cell_{i+1}). Notice that the pair's own variables will be updated by this matrix, but the variables just beyond the pair will be updated by the next two-cell pair relaxation matrix (cell_{i+2} and cell_{i+3}). Again consider Figure 2.4. Vertical line 5 represents the variables at the interface between the two pairs. The two-cell relaxation matrix for cell_1 and cell_2 will be stored on the processors associated with vertical lines 1-4. So to update the variables on line 5, the processors on line 5 will need to access data from line 4.

On the CM2, conformable arrays (arrays with the same shape and size) are always stored in the same set of processors in the same order. To compute expressions like the elements of the matrix E in (2.3.6) without the need for communication, we store the variable w_j (a one-dimensional array) as a
two-dimensional array that is conformable with the array b (Figure 2.4). Consequently, w is constant over horizontal lines in Figure 2.4. Note that the elements of E⁻¹ in (2.3.6) will also be conformable with the variables w and b.

2.6 Relaxation in Parallel.

A two-cell relaxation step consists of applying M⁻¹ (M is given by (2.2.13)) to a right-hand-side vector, which we call r. What follows is a description of the process used to perform a two-cell relaxation step. If the current grid has k cells, then k/2 two-cell relaxations will be performed simultaneously for Jacobi, while k/4 two-cell relaxation steps will be performed in each of two stages for red-black Gauss-Seidel. Each two-cell relaxation can be performed through the following steps:

1. Form A⁻¹ r: This step is done at the same time for all cells, using the matrix A⁻¹ as shown in (2.3.5). In order to update any variable, the associated processor needs only one row of this matrix. For example, with the order of variables shown in (2.2.14), to update the variable ψ⁺_i we need only the second row of A⁻¹, so the processor that updates this variable needs only to store this row's entries. Note, though, that this row's entries will multiply data that are allocated in other processors. Through the use of the CM2 intrinsic function eoshift we can perform a move of data in the x direction (index i), and then multiply by the appropriate entry of the matrix. To update all the variables, we thus have the following combination of shifts:
each variable is updated as a linear combination of eoshifted copies of the data, multiplied by the matching row entries, for j = 1, …, n simultaneously. Here, A_1, A_2, A_3, A_4, and A_5 are elements of the appropriate row. Note that processor (i,j) does not have variables other than the ones with indices i,j. It will therefore use the CM2 intrinsic function eoshift (the first argument of eoshift is the array name, the second is the array index that is going to be shifted, and the last is the stride of the shift) in the following way: temp = ψ_{i+1} = eoshift(ψ_i, 1, 1), ψ_{i+2} = eoshift(temp, 1, 1), and so on. Each new shift takes advantage of the previous one. The big advantage of this step is that the updating occurs at the same time for all processors, allowing this matrix multiplication task to be performed in only 8 communications.

2. Form V E⁻¹ Wᵀ (A⁻¹ r): Note that, for each cell pair, the vector has length 8n. In this step we take advantage of the fact that the 8n × 8n matrix V E⁻¹ Wᵀ can be written as the tensor product F ⊗ (1 wᵀ). Remember that 1 and w are n-vectors. The 8×8 matrix F is derived from a combination of rows and columns of the matrix E⁻¹ and is given by
F = V_s E⁻¹ W_sᵀ,   (2.6.1)

where

V_s =
[ 2 0 0 0 ]
[ 1 1 0 0 ]
[ 1 1 0 0 ]
[ 0 2 0 0 ]
[ 0 0 2 0 ]
[ 0 0 1 1 ]
[ 0 0 1 1 ]
[ 0 0 0 2 ]

and

W_sᵀ =
[  ½   1   0  −½    0   0   0    0 ]
[ −½   0   1   ½    0   0   0    0 ]
[  0   0   0   0    ½   1   0   −½ ]
[  0   0   0   0   −½   0   1    ½ ]

are the matrices of scalar coefficients of V and Wᵀ in (2.3.2) and (2.3.3). Note that if the grids are nonuniform, then the matrices E⁻¹ and F will have their elements stored as arrays and will be different for each distinct pair of cells. Multiplication by the matrix F ⊗ (1 wᵀ) can thus be done in three steps:

2.1 Form g_i = wᵀ z_i (i = 1, …, 8): This step consists of a dot product of w with each of the n-vectors z_i that make up the 8n-vector z = A⁻¹ r obtained in Step 1. First we multiply each entry of w by the corresponding entry of z, w_j × z_ij, which is performed for all processors at the same time because w is conformable with z. Recall that, through the replication of data, the vector w was stored for all the spatial grid points. The index i of the variable z_ij varies from 1 to 8 for each cell pair. The second stage of the dot product is to perform a summation in the angle index direction. This is done with the use of the CM2 intrinsic function sum. The resulting vector g has length 8.

2.2 Form y = F g:
This step is a multiplication of a vector by a matrix, and can be performed on the CM2 similarly to the way Step 1 of this relaxation was performed. Depending on which point the processor represents, it will have stored different rows of the matrix F. Its variables will be updated through the use of the CM2 intrinsic function eoshift only (in this case, 16 of them). The vector y has length 8.

2.3 Form ζ = y ⊗ 1: This spreads the result of Step 2.2 across all the angles (for all spatial points at the same time). This is done with the use of the CM2 intrinsic function spread. The vector ζ has length 8n.

3. Form A⁻¹ ζ as in Step 1 and add the result to A⁻¹ r.

A special remark should be made here. It is not hard to see that Steps 2.2 and 2.3 can have their order interchanged. In fact, since all the CM2 processors execute the same instruction simultaneously, unless instructed otherwise by the use of masks, we make use of this property in the following way. We spread the vector g throughout all the angles represented by different processors. Since all the processors had the appropriate row elements of the matrix F from the outset, we perform Step 2.2 last. This makes the use of masks necessary only to distinguish processors by the kind of spatial point they represent, not the kind of angle. It is easy to show that, if we take advantage of the structure of the various matrices and perform the matrix multiplication steps as described, the total operation count and the number of communications per processor for the two-cell μ-line relaxation are both O(n), more precisely 2(n−1).
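The data movement of Step 1 and the three-step application of Steps 2.1-2.3 can be sketched as follows, with NumPy as a stand-in for the CM Fortran intrinsics (eoshift, sum, spread). The matrices F, the weights w, and the vector z carry illustrative random values; the point is that F ⊗ (1wᵀ) is applied without ever forming the 8n×8n matrix.

```python
import numpy as np

def eoshift(a, shift, fill=0.0):
    """1D emulation of the eoshift intrinsic: out[i] = a[i+shift], end-off fill."""
    out = np.full_like(a, fill)
    if shift > 0:
        out[:-shift] = a[shift:]
    elif shift < 0:
        out[-shift:] = a[:shift]
    else:
        out[:] = a
    return out

n = 4
rng = np.random.default_rng(0)
F = rng.standard_normal((8, 8))   # stand-in for the 8x8 matrix F of (2.6.1)
w = rng.standard_normal(n)        # stand-in for the quadrature weights
z = rng.standard_normal(8 * n)    # z = A^{-1} r from Step 1

g = np.array([w @ z[k * n:(k + 1) * n] for k in range(8)])  # 2.1: dot each block with w
y = F @ g                                                   # 2.2: multiply by F
zeta = np.kron(y, np.ones(n))                               # 2.3: spread over the angles
```

The assertion-style check below confirms that the three steps reproduce (F ⊗ (1wᵀ)) z exactly, and that the eoshift emulation moves data the way the text describes.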
2.7 Multigrid in Parallel.

The parallel relaxation process of the multigrid algorithm was explained in Section 2.6. For calculation of the residual on the finer grid, we again use the CM2 intrinsic function eoshift. For example, to calculate the residual at a cell-center grid point for a negative angle, we use equation (2.2.3). Note that the processor that contains the variable ψ⁻_i will not contain the other variables necessary for the calculation of the residual at this point. Therefore, we perform a move of data:

ψ⁻_{i+1/2} = eoshift(ψ⁻_i, 1, 1),
ψ⁻_{i−1/2} = eoshift(ψ⁻_i, 1, −1).

For calculation of the right-hand side of this equation, we perform a one-step multiplication for the conformable arrays w and ψ⁺ (the same for ψ⁻), and then, using the CM2 intrinsic function sum, we perform a summation in the angle direction for the resultant products. After the residual is calculated, in order to solve for the error on the coarser grid, we have to transfer the residual to the coarser grid. This is done by multiplying r_h, the residual on the fine grid, by the restriction operator shown in (2.4.1). In some processors there will be no operation or communication at all because, as in Section 2.5, the grid-level index is serial and some processors will contain all the information they need. For example, for a negative angle at the left-hand side of a two-cell pair (negative-angle outflow), the restriction consists of r^{2h}_{i−1/2} = r^h_{i−1/2} (see the first row of (2.4.2)). This requires no communication. For processors that represent cell centers on the coarse grid (second and third rows of (2.4.2)), we use the CM2 intrinsic function eoshift in a way that is
similar to what was done in Step 1 of relaxation:

temp1 = r_{i+1} = eoshift(r_i, 1, 1),
temp2 = r_{i−1} = eoshift(r_i, 1, −1).

We then calculate the residual on the coarse grid:

r_i^{2h} = (h_{i−1}/(h_{i−1}+h_i)) temp1 + (h_i/(h_i+h_{i−1})) temp2.

After forming r^{2h}, we perform parallel relaxation again, this time to estimate the error on the coarser grid (u^{2h}). In order to use the coarse-grid quantities to correct the solution on a finer grid, we multiply u^{2h} by the matrix I^h_{2h} shown in (2.4.3). This multiplication is performed similarly to the multiplication of r_h by I^{2h}_h.

2.8 Discretization in Edge-Edge Notation.

In the following section we develop a Fourier analysis of the convergence properties of the multigrid scheme. In this analysis we use a different notation for the MLD scheme, called edge-edge notation. Consider the discretized spatial domain with m cells as before, where x_i is the center of cell i and x_{i−1/2} and x_{i+1/2} are the left- and right-hand sides, respectively. The MLD equations using variables at the centers and edges were shown in (2.2.1)-(2.2.4). In edge-edge notation the indices l and r represent the left and right side of each cell, respectively. In order to write these equations in edge-edge notation, we notice that

ψ⁺_{i,l} = 2ψ⁺_i − ψ⁺_{i+1/2},   ψ⁺_{i,r} = ψ⁺_{i+1/2},   ψ⁻_{i,l} = ψ⁻_{i−1/2},   ψ⁻_{i,r} = 2ψ⁻_i − ψ⁻_{i−1/2}.

After we substitute these relations into (2.2.1)-(2.2.4), we have
μ_j/(σ_t h_i) (ψ⁺_{i,r,j} − ψ⁺_{i−1,r,j}) + ½(ψ⁺_{i,l,j} + ψ⁺_{i,r,j}) = ½ Σ_{k=1}^{N} w_k (ψ_{i,l,k} + ψ_{i,r,k}),   (2.8.1)

μ_j/(σ_t h_i) (ψ⁺_{i,r,j} − ψ⁺_{i,l,j}) + ψ⁺_{i,r,j} = Σ_{k=1}^{N} w_k ψ_{i,r,k},   (2.8.2)

i = 1, …, m, j = 1, …, n, and the negative-angle equations are

μ_j/(σ_t h_i) (ψ⁻_{i,l,j} − ψ⁻_{i+1,l,j}) + ½(ψ⁻_{i,l,j} + ψ⁻_{i,r,j}) = ½ Σ_{k=1}^{N} w_k (ψ_{i,l,k} + ψ_{i,r,k}),   (2.8.3)

−μ_j/(σ_t h_i) (ψ⁻_{i,r,j} − ψ⁻_{i,l,j}) + ψ⁻_{i,l,j} = Σ_{k=1}^{N} w_k ψ_{i,l,k},   (2.8.4)

i = 1, …, m, j = 1, …, n. In this section we omit the external source term q for simplicity. The omission of a superscript on sum terms in this section indicates that the summation is performed over all N = 2n angles (n positive and n negative). If we multiply (2.8.1) and (2.8.3) by 2, subtract from them (2.8.2) and (2.8.4), and rewrite equations (2.8.2) and (2.8.4), we have

μ_j/(σ_t h_i) (ψ⁺_{i,l,j} + ψ⁺_{i,r,j}) − (2μ_j/(σ_t h_i)) ψ⁺_{i−1,r,j} + ψ⁺_{i,l,j} = Σ_{k=1}^{N} w_k ψ_{i,l,k},   (2.8.5)

μ_j/(σ_t h_i) (ψ⁺_{i,r,j} − ψ⁺_{i,l,j}) + ψ⁺_{i,r,j} = Σ_{k=1}^{N} w_k ψ_{i,r,k},   (2.8.6)

i = 1, …, m, j = 1, …, n, and the negative-angle equations are

μ_j/(σ_t h_i) (ψ⁻_{i,l,j} + ψ⁻_{i,r,j}) − (2μ_j/(σ_t h_i)) ψ⁻_{i+1,l,j} + ψ⁻_{i,r,j} = Σ_{k=1}^{N} w_k ψ_{i,r,k},   (2.8.7)

μ_j/(σ_t h_i) (ψ⁻_{i,l,j} − ψ⁻_{i,r,j}) + ψ⁻_{i,l,j} = Σ_{k=1}^{N} w_k ψ_{i,l,k},   (2.8.8)

i = 1, …, m, j = 1, …, n. Formulating the problem in terms of the n positive angles only, using vector notation such that ψ±_{i,·}(x) is an n-vector (ψ±_{i,·,1}(x), …, ψ±_{i,·,n}(x))ᵀ, we can rewrite these equations in matrix form as

B_i (ψ⁺_{i,l} + ψ⁺_{i,r}) − 2B_i ψ⁺_{i−1,r} + ψ⁺_{i,l} = R(ψ⁺_{i,l} + ψ⁻_{i,l}),   (2.8.9)
B_i (ψ⁺_{i,r} − ψ⁺_{i,l}) + ψ⁺_{i,r} = R(ψ⁺_{i,r} + ψ⁻_{i,r}),   (2.8.10)

i = 1, …, m, and the negative-angle equations are

B_i (ψ⁻_{i,l} + ψ⁻_{i,r}) − 2B_i ψ⁻_{i+1,l} + ψ⁻_{i,r} = R(ψ⁺_{i,r} + ψ⁻_{i,r}),   (2.8.11)

B_i (ψ⁻_{i,l} − ψ⁻_{i,r}) + ψ⁻_{i,l} = R(ψ⁺_{i,l} + ψ⁻_{i,l}),   (2.8.12)

i = 1, …, m, where B_i = (1/(σ_t h_i)) diag(μ_1, …, μ_n) and R = 1wᵀ. After we multiply this system by σ_t h we notice that two-cell μ-line relaxation involves the inversion of

[ I+B_i−R   −R         −B_i       0          0             0             0             0          ]
[ −R        I+B_i−R    0          B_i        0             0             0             0          ]
[ B_i       0          I+B_i−R    −R         −2B_i         0             0             0          ]
[ 0         −B_i       −R         I+B_i−R    0             0             0             0          ]
[ 0         0          0          0          I+B_{i+1}−R   −R            −B_{i+1}      0          ]
[ 0         0          0          −2B_{i+1}  −R            I+B_{i+1}−R   0             B_{i+1}    ]
[ 0         0          0          0          B_{i+1}       0             I+B_{i+1}−R   −R         ]
[ 0         0          0          0          0             −B_{i+1}      −R            I+B_{i+1}−R ]
  (2.8.13)

with the variables ordered as

ψ = (ψ⁻_{i,l}, ψ⁺_{i,l}, ψ⁻_{i,r}, ψ⁺_{i,r}, ψ⁻_{i+1,l}, ψ⁺_{i+1,l}, ψ⁻_{i+1,r}, ψ⁺_{i+1,r})ᵀ.   (2.8.14)
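The per-cell change of variables between the center-edge and the edge-edge unknowns, built directly from the identities ψ⁺_{i,l} = 2ψ⁺_i − ψ⁺_{i+1/2} and ψ⁻_{i,r} = 2ψ⁻_i − ψ⁻_{i−1/2} stated in Section 2.8, can be checked numerically. The sketch below is an illustration in Python (the 4×4 scalar blocks stand for the n×n block entries):

```python
import numpy as np

# Maps (psi-_{i-1/2}, psi+_i, psi-_i, psi+_{i+1/2}) -> (psi-_l, psi+_l, psi-_r, psi+_r).
T = np.array([[ 1., 0., 0.,  0.],     # psi-_l = psi-_{i-1/2}
              [ 0., 2., 0., -1.],     # psi+_l = 2 psi+_i - psi+_{i+1/2}
              [-1., 0., 2.,  0.],     # psi-_r = 2 psi-_i - psi-_{i-1/2}
              [ 0., 0., 0.,  1.]])    # psi+_r = psi+_{i+1/2}

# Inverse map: the cell-center values are the averages of the two traces.
Tinv = np.array([[1., 0., 0., 0.],
                 [0., .5, 0., .5],
                 [.5, 0., .5, 0.],
                 [0., 0., 0., 1.]])
```

The two blocks are exact inverses, so passing between the notations loses no information.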
Notice that we can write the transformation between the center-edge and the edge-edge variables in matrix notation as

ψ^(e) = Q ψ^(c),   Q = diag(Q_c, …, Q_c),   Q_c =
[ I    0    0    0  ]
[ 0    2I   0    −I ]
[ −I   0    2I   0  ]
[ 0    0    0    I  ]
  (2.8.18)

and

ψ^(c) = Q⁻¹ ψ^(e),   Q_c⁻¹ =
[ I    0    0    0  ]
[ 0    ½I   0    ½I ]
[ ½I   0    ½I   0  ]
[ 0    0    0    I  ]
  (2.8.19)

with the center-edge order (ψ⁻_{i−1/2}, ψ⁺_i, ψ⁻_i, ψ⁺_{i+1/2}) and the edge-edge order (ψ⁻_{i,l}, ψ⁺_{i,l}, ψ⁻_{i,r}, ψ⁺_{i,r}) in each cell.

2.9 Fourier Analysis.

In this section we analyze the convergence properties of the two-grid algorithm described in Section 2.4. We performed Fourier analysis for the two kinds of relaxation used in our parallel multigrid schemes: Jacobi and red-black Gauss-Seidel relaxation. Convergence factors for both kinds of schemes are compared to the numerical results. We will assume a special kind of Fourier spatial dependence for the errors in an infinite spatial domain with uniform cells of size h each.

[Figure 2.5: Group of cells for Fourier analysis. Four consecutive cells i, i+1, i+2, i+3 labeled RL, RR, BL, BR.]

Considering a
group of four cells, we define two black cells (BL, black left, and BR, black right) and two red cells (RL, red left, and RR, red right) as indicated in Figure 2.5. For a given μ we denote by ψ(x, μ) the vector that contains elements representing fluxes at the edges of the red and black cells (see (2.9.1)). The vector ψ_f(λ, μ) contains elements representing coefficients at the edges of the red and black cells in the group of four cells considered above. This is the same as

ψ(x,μ) = (ψ_{i,l}, ψ_{i,r}, ψ_{i+1,l}, ψ_{i+1,r}, ψ_{i+2,l}, ψ_{i+2,r}, ψ_{i+3,l}, ψ_{i+3,r})ᵀ(x,μ)

and

ψ_f(λ,μ) = (ψ^{RL}_l, ψ^{RL}_r, ψ^{RR}_l, ψ^{RR}_r, ψ^{BL}_l, ψ^{BL}_r, ψ^{BR}_l, ψ^{BR}_r)ᵀ(λ,μ),   (2.9.1)

where the subscripts l and r represent the left and right sides of each cell, respectively. For each particular frequency λ we will assume ψ(x,μ) = e^{iλx} ψ_f(λ,μ) for the remaining domain, where x is the distance between points of the computational domain and points belonging to the group of four cells considered. These points are multiples of h away from each other. Considering 4P cells with periodic boundaries, we can write ψ(x,μ) as a combination of the modes e^{iλ_k x} ψ_f(λ_k,μ), where λ_k = 2πk/(PH), k = 0, 1, …, P−1, and H = 4h.

The steps for the multigrid scheme Fourier analysis follow. Representing the relaxation step by A ψ^{l+1} = B ψ^l, we write A_h ψ^{l+1} = B_h ψ^l for the fine grid and A_2h ψ^{l+1} = B_2h ψ^l for the coarse grid. The matrices A and B are derived by substituting ψ(x,μ) = e^{iλx} ψ_f(λ,μ) into (2.8.9)-(2.8.12) for the μ-line two-cell relaxation (Jacobi or red-black Gauss-Seidel). Thus, the matrices A and B depend on λ. In the following
description, I^{h(e)}_{2h} and I^{2h(e)}_h are the interpolation and restriction operators for the edge-edge notation, respectively. Steps for the Fourier analysis:

1. Relaxation on the fine grid: ψ^{l+1/3} = A_h⁻¹ B_h ψ^l.
2. Calculation of the residual on the fine grid: r_h = B_h (A_h⁻¹ B_h − I) ψ^l.
3. Transfer of the residual to the coarse grid: r_2h = I^{2h(e)}_h r_h = I^{2h(e)}_h B_h (A_h⁻¹ B_h − I) ψ^l.
4. Calculation of the error on the coarse grid: e_2h = (A_2h − B_2h)⁻¹ r_2h.
5. Correction to the solution: ψ^{l+2/3} = ψ^{l+1/3} + I^{h(e)}_{2h} e_2h.

The convergence rate for the two-level multigrid V(1,1) cycle is given by the spectral radius of the following matrix:

Z = A_h⁻¹ B_h [ A_h⁻¹ B_h + I^{h(e)}_{2h} (A_2h − B_2h)⁻¹ I^{2h(e)}_h B_h (A_h⁻¹ B_h − I) ],   (2.9.2)

where the leading factor accounts for the post-relaxation sweep. The elements of the matrix Z depend on λ, and the convergence factor for the Fourier analysis will be the maximum spectral radius of Z taken over the set λ_k = 2πk/(PH), k = 0, …, P−1. To write the matrix Z, first we need to calculate the matrices A and B on both the fine and coarse grids. The MLD scheme in edge-edge notation was derived in the previous section. Later in this section we show the equivalence between our analysis, done in edge-edge notation, and our edge-center algorithm notation.

In order to perform the Fourier analysis, we need to write these equations for all the angles and the four cells (RL, RR, BL, and BR) necessary in the red-black Gauss-Seidel two-cell relaxation analysis. The MLD relaxation is always represented as two equations per cell per direction, independent of the
kind of notation we are using. Consequently, to represent red-black Gauss-Seidel relaxation, the number of equations is 8 × (number of directions). For red-black Gauss-Seidel relaxation, all the red cells are updated before any of the black cells. Consequently, when we write the MLD equations for the red cells in the Fourier analysis, whenever there is a reference to a variable that belongs to a black cell, that variable is assumed to be at iterate level l instead of l+1. When we write the equations for the black cells, there are only terms at level l+1 (no terms on the right-hand side of the equations). The Fourier analysis for the Jacobi relaxation is done similarly; its analysis is simpler than the red-black Gauss-Seidel case. The main difference is that we need to consider only one kind of cell pair, and when referencing cell pairs other than the one we are writing equations for, we assume them to be at level l.

1. Equations on the fine grid. These equations generate the matrices A_h and B_h used in Step 1. When the Fourier modes are substituted into equations (2.8.9)-(2.8.12) for the four cells, red-black relaxation can be written as

[ F           U   0   0 ]             [ 0   0   0   L e^{−iλH} ]
[ L           F   0   0 ]   ψ^{l+1} = [ 0   0   U   0          ]   ψ^l,   (2.9.3)
[ 0           L   F   U ]             [ 0   0   0   0          ]
[ U e^{iλH}   0   L   F ]             [ 0   0   0   0          ]

where H = 4h,

F = σ_t h
[ I+B_i−R   −R        −B_i      0        ]
[ −R        I+B_i−R   0         B_i      ]
[ B_i       0         I+B_i−R   −R       ]
[ 0         −B_i      −R        I+B_i−R  ]
  (2.9.4)
L = σ_t h
[ 0   0   0   0      ]
[ 0   0   0   −2B_i  ]
[ 0   0   0   0      ]
[ 0   0   0   0      ]
  (2.9.5)

and

U = σ_t h
[ 0       0   0   0 ]
[ 0       0   0   0 ]
[ −2B_i   0   0   0 ]
[ 0       0   0   0 ]
  (2.9.6)

The order of variables for this matrix is

ψ = (ψ^{RL}, ψ^{RR}, ψ^{BL}, ψ^{BR})ᵀ,   (2.9.7)

where ψ^{RL} = (ψ⁻_l, ψ⁺_l, ψ⁻_r, ψ⁺_r)^{RL}, and similarly for ψ^{RR}, ψ^{BL}, and ψ^{BR}.

For Jacobi relaxation we have

[ F   U ]   ψ^{l+1} = [ 0           L e^{−iλH} ]   ψ^l,   (2.9.8)
[ L   F ]             [ U e^{iλH}  0           ]

where H = 2h.

2. Equations on the coarse grid. These equations will generate the matrix A_2h − B_2h for Step 4. The equations on this level are assumed to be solved exactly. Since
coarsening the grid halves the number of cells, we will have two cells in the Fourier analysis for red-black Gauss-Seidel relaxation and one cell for Jacobi relaxation. The matrix equation (A_2h − B_2h) e_2h = r_2h for the red-black Gauss-Seidel relaxation case is given by

[ F                U + L e^{−iλH} ]   e_2h = r_2h,   (2.9.9)
[ L + U e^{iλH}    F              ]

where F, U, and L are the same as before with h replaced by 2h. The order of variables is also the same, but for two cells instead of four cells. For Jacobi relaxation we have

F' e_2h = r_2h,   (2.9.10)

where F' = F + U e^{iλH} + L e^{−iλH}.

3. Interpolation and restriction operators. In order to define the interpolation and restriction operators for the equations in edge-edge notation, we first notice that

A_h^(e) = Q A_h^(c) Q⁻¹   (2.9.11)

and

A_2h^(e) = Q A_2h^(c) Q⁻¹,   (2.9.12)

where the matrix Q⁻¹ is given by (2.8.19), A_h^(c) and A_2h^(c) are the operators of (2.2.13) in the edge-center notation on the fine and coarse grids, respectively, and A_h^(e) and A_2h^(e) are the operators in the edge-edge notation on the fine and coarse grids, respectively. The interpolation and restriction operators used in the Fourier analysis are given by

I^{2h(e)}_h =
[ S_1   S_2   0     0   ]
[ 0     0     S_1   S_2 ]
  (2.9.13)
where

S_1 =
[ I     0    0    0  ]
[ 0     ½I   0    ½I ]
[ −½I   0    ½I   0  ]
[ 0     0    0    0  ]

and

S_2 =
[ 0    0    0    0   ]
[ 0    ½I   0    −½I ]
[ ½I   0    ½I   0   ]
[ 0    0    0    I   ]

and

I^{h(e)}_{2h} =
[ T_1 ]
[ T_2 ]
[ T_1 ]
[ T_2 ]
  (2.9.14)

Here,

T_1 =
[ I    0    0    0  ]
[ 0    I    0    0  ]
[ ½I   0    ½I   0  ]
[ 0    ½I   0    ½I ]

and

T_2 =
[ ½I   0    ½I   0  ]
[ 0    ½I   0    ½I ]
[ 0    0    I    0  ]
[ 0    0    0    I  ]

We can observe the relationship between the restriction and interpolation operators for the edge-edge and edge-center notations:

I^{2h(e)}_h = Q I^{2h(c)}_h Q⁻¹   (2.9.15)

and

I^{h(e)}_{2h} = Q I^{h(c)}_{2h} Q⁻¹.   (2.9.16)

From Section 2.4 we have A^{2h(c)} = I^{2h(c)}_h A^{h(c)} I^{h(c)}_{2h}. Substituting for the edge notation, we have
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. I _____ L which is equal to the fine grid operator (Ah(e)) applied to grid 2h. 4.Numerical experiments and results. For each ..\ we have a different matrix Z. The convergence factors for the Fourier analysis are obtained as the maximum over = 2'tp, k = 0, ... P 1) of the spectral radius of Z in (2.9.2). For each Uth, we evaluate the spectral radius of Z for P different frequencies ( k = 1, P 1) and the convergence factor from the Fourier analysis is the maximum. We compared the convergence factors between the Fourier analysis prediction and the CM2 numerical results, and found a good agreement (Tables 2.1 and 2.2). In fact, for this problem the Fourier analysis prediction was more accurate than the alternative analysis of convergence shown in [10]. Note that the convergence factor appears to be O((.,.!h)2 ) when CTth 1 and O((uth)3 ) when CTth : 1 for both kinds of relaxation (Tables 2.1 and 2.2). This is exactly what happens with our numerical results. The analytical proof of convergence shown in [10] showed the convergence factor to be bounded by O((.,.!h)) when Uth 1 and O((uth)2 ) when CTth 1 for both kinds of relaxation. Our results suggest that these bounds are not sharp. 2.10 Numerical Results. The CM2 codes were run for different values of CTth on the fine grid. In Tables 2.3 and 2.4 we chose these values to be the worst cases predicted by the Fourier analysis shown. All of the convergence factors (i.e., the ratio of Euclidean norms of the residuals before and after a multigrid V cycle) shown here are for the case with no absorption ('y = 1). The results are shown for the two kinds of 42
Table 2.1. Convergence factors from the Fourier analysis (Fa) prediction and CM-2 results for Jacobi relaxation for various values of σ_t h (S4 case).

    σ_t h     Fa             CM-2
    10^-6     5.3 x 10^-13   9.4 x 10^-10
    10^-4     2.4 x 10^-10   8.7 x 10^-7
    10^-3     2.3 x 10^-7    4.1 x 10^-4
    10^-2     1.3 x 10^-4    1.0 x 10^-2
    10^-1     7.2 x 10^-3    1.5 x 10^-2
    1.        5.1 x 10^-3    4.2 x 10^-3
    10^1      3.8 x 10^-6    3.9 x 10^-6
    10^2      3.5 x 10^-7    3.4 x 10^-7
Table 2.2. Convergence factors from the Fourier analysis (Fa) prediction and CM-2 results for red-black Gauss-Seidel relaxation for various values of σ_t h (S4 case).

    σ_t h     Fa             CM-2
    10^-6     3.5 x 10^-13   3.3 x 10^-10
    10^-4     1.2 x 10^-10   2.8 x 10^-7
    10^-3     1.2 x 10^-7    7.2 x 10^-6
    10^-2     8.7 x 10^-6    4.3 x 10^-3
    10^-1     3.8 x 10^-3    8.0 x 10^-3
    1.        8.2 x 10^-4    1.1 x 10^-3
    10^1      3.2 x 10^-6    3.17 x 10^-6
    10^2      2.7 x 10^-7    2.5 x 10^-7
Table 2.3. Worst case convergence factor for multigrid with Jacobi V(1,1) relaxation, Fourier analysis (Fa), and CM-2 results (for m = 512).

    angles    S4      S8      S16     S32
    σ_t h     .230    .100    .150    .400
    Fa        .012    .012    .013    .013
    CM-2      .016    .015    .016    .014

relaxation (Jacobi and red-black Gauss-Seidel) used in our multigrid scheme. The convergence factors shown for the CM-2 codes were obtained for one V-cycle after five V-cycles. The convergence factors were close to the ones predicted by the Fourier analysis, as Tables 2.3 and 2.4 show. Tables 2.3 and 2.4 show results for uniform grids for different numbers of angles; e.g., S4 means 4 scattering angles (n = 2), and so on. Table 2.5 shows convergence factors obtained for nonuniform grids for various numbers of angles (N = 4, 8, 16); the grid was generated randomly with h varying between 1 and 100. Table 2.6 shows the behavior of the convergence factor for three relaxation methods, two-cell Jacobi (V(1,1) and V(2,2)) and two-cell red-black Gauss-Seidel (V(1,1)), for varying σ_t h. Note that the convergence factor appears to be O((σ_t h)^2) when σ_t h ≪ 1 and O((σ_t h)^{-3}) when σ_t h ≫ 1 for all three relaxation methods. To make a fair comparison between the different relaxations, we have to consider the difference in time spent to attain the above convergence factors. This will be analyzed in the next section. In Table 2.7 we set σ_t h equal to .1 and vary the number of cells (m). For a fixed mesh size, the convergence factor is basically constant. Table 2.8 shows the
Table 2.4. Worst case convergence factor for multigrid with red-black Gauss-Seidel relaxation, Fourier analysis, and CM-2 results (for m = 512).

    angles    S4      S8      S16     S32
    σ_t h     .15     .24     .1      .2
    Fa        .0042   .0045   .0043   .0045
    CM-2      .0063   .0065   .0086   .0045

Table 2.5. Convergence factors for nonuniform cells (Jacobi and red-black Gauss-Seidel relaxation on the CM-2), 1 ≤ h ≤ 100, and σ_t = 1.

    angles      S4            S8            S16
    Jacobi      1.6 x 10^-4   1.3 x 10^-4   5.6 x 10^-4
    Red-black   1.3 x 10^-4   1.0 x 10^-4   9.7 x 10^-5
Table 2.6. Convergence factors for Jacobi (V(1,1) and V(2,2)) and red-black Gauss-Seidel (V(1,1)) relaxation, for various values of σ_t h. S4 case and m = 512.

    σ_t h     Jacobi V(1,1)   red-black V(1,1)   Jacobi V(2,2)
    10^-5     9.4 x 10^-10    3.3 x 10^-10       1.47 x 10^-10
    10^-4     8.7 x 10^-7     2.8 x 10^-7        1.38 x 10^-7
    10^-3     4.1 x 10^-4     7.2 x 10^-5        7.40 x 10^-5
    10^-2     1.0 x 10^-2     4.3 x 10^-3        2.41 x 10^-3
    10^-1     1.5 x 10^-2     8.0 x 10^-3        2.71 x 10^-3
    1.        4.2 x 10^-3     1.1 x 10^-3        1.03 x 10^-3
    10^1      3.9 x 10^-5     3.2 x 10^-5        1.86 x 10^-5
    10^2      3.4 x 10^-7     2.5 x 10^-7        4.16 x 10^-7
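The Fourier-analysis entries in Tables 2.1 through 2.6 come from one recipe: maximize the spectral radius of an iteration symbol over sampled frequencies. The sketch below illustrates only that recipe; the symbol used here is the scalar symbol of damped Jacobi on a one-dimensional model problem restricted to oscillatory modes, a stand-in of my own choosing, not the two-cell transport symbol Z of Section 2.9.

```python
import numpy as np

def fourier_factor(symbol, thetas):
    """Max spectral radius of an iteration symbol over sampled frequencies."""
    return max(
        float(np.max(np.abs(np.linalg.eigvals(np.atleast_2d(symbol(t))))))
        for t in thetas
    )

# Stand-in symbol: damped Jacobi (omega = 2/3) on the 1-D model problem,
# written as a 1x1 matrix so the routine mirrors the matrix-valued case.
omega = 2.0 / 3.0
symbol = lambda t: np.array([[1.0 - omega * (1.0 - np.cos(t))]])

# Sample the oscillatory frequencies, the modes a coarse grid cannot represent.
P = 64
thetas = [np.pi / 2 + np.pi * k / P for k in range(P)]
rho = fourier_factor(symbol, thetas)   # classical smoothing factor, 1/3
```

For the transport problem the same loop runs over the matrix-valued symbol of the two-cell relaxation, and the reported factor is the maximum over both the frequencies and the values of σ_t h considered.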
Table 2.7. Convergence rates for Jacobi (V(1,1) and V(2,2) cycles) and red-black (V(1,1)) on the CM-2. S4 case and σ_t h = 0.1.

    m        Jacobi V(1,1)   red-black V(1,1)   Jacobi V(2,2)
    512      1.5 x 10^-2     9.5 x 10^-3        2.1 x 10^-3
    1024     1.5 x 10^-2     9.7 x 10^-3        2.8 x 10^-3
    2048     1.5 x 10^-2     9.8 x 10^-3        2.9 x 10^-3
    4096     1.5 x 10^-2     9.9 x 10^-3        3.0 x 10^-3
    32768    1.6 x 10^-2     9.9 x 10^-3        3.2 x 10^-3
    65536    1.6 x 10^-2     9.9 x 10^-3        3.1 x 10^-3

Table 2.8. Convergence factors for Jacobi (V(1,1) and V(2,2)) and red-black Gauss-Seidel (V(1,1)), for various values of σ_t h. S4 case and m = 512 (nonuniform code).

    grid    Jacobi V(1,1)   red-black V(1,1)   Jacobi V(2,2)
    h1      1.5 x 10^-4     1.26 x 10^-4       1.26 x 10^-4
    h2      2.84 x 10^-4    5.4 x 10^-6        9.2 x 10^-6
    p1      1.5 x 10^-2     8.0 x 10^-3        3.7 x 10^-3

    h1 = randomly generated grid.
    h2 = randomly generated grid.
    p1 = uniform grid (h = 0.1).
convergence factors for N = 4 and three distinct meshes (h1, h2, and p1). Grid p1 is uniform with mesh size h equal to 10^-1 for the whole spatial domain, while grids h1 and h2 are nonuniform and randomly generated (this means that some of the cell sizes are smaller or bigger than 0.1). Since the Fourier analysis showed that the worst convergence factors occur around σ_t h = 10^-1, we expected the convergence factors to be smaller for grids h1 and h2 than for the uniform grid p1. In Tables 2.5 and 2.8, the convergence rates were measured using the nonuniform version of the algorithm.

2.11 Timings. First we show the cost behavior when we keep n constant (equal to one) and vary the spatial dimension m (Figures 2.6 and 2.7 and Table 2.9). Second, we analyze the cost when n varies and the spatial dimension m is kept constant (Figure 2.8). There are two codes, one for uniform and one for nonuniform grids. The results in this section are similar for both codes, although the timings obtained with the nonuniform grid code are roughly double those of the uniform grid code. All the timings shown in this section were measured for a V(1,1) or V(2,2) cycle, most of them using the uniform code. The sequential timings were measured using one processor of a Cray Y-MP, which has a vector architecture. When we increase the number of grid points of the computational grid, there should be no increase in the time spent on our relaxation step on computers like the CM-2, unless we reach the point where all the processors are being used. However, in our multigrid scheme, the number of levels should be defined such that the coarsest grid consists of two cells. Hence, the number of grid levels will always be log2(m/2), so every time we double the number of cells (m) in our finest grid, we have new computations at the newly added grid level, since the grid level index is serial.
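The level count above can be sketched directly; the function name is mine, and the only assumption is the one stated in the text, that coarsening halves the cell count until the coarsest grid has two cells.

```python
def coarse_levels(m):
    """Coarsening steps from an m-cell fine grid down to the 2-cell coarsest grid."""
    steps = 0
    while m > 2:
        m //= 2
        steps += 1
    return steps   # equals log2(m/2) for m a power of two

# Doubling m adds exactly one (serial) grid level, which is why the V-cycle
# depth, and hence the parallel time before saturation, grows like log2 m.
levels_512 = coarse_levels(512)
levels_1024 = coarse_levels(1024)
```

For m = 512 this gives 8 coarse levels below the finest grid, and each doubling of m adds one more.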
Figure 2.6: Timings (seconds) for a V(1,1) cycle for small m and n = 1. (Curves: CM-2 Jacobi; Cray.)

Figure 2.7: Timings (seconds) for a V(1,1) cycle for large m and n = 1. (Curves: CM-2 Jacobi; CM-2 red-black; Cray. Horizontal axis: number of cells m, 512 to 65536.)
This increase in time corresponds only to the new calculations. This can be observed in Figure 2.6 for m between 2^6 and 2^9 specifically, where we notice that the increase in time when we double m is linearly proportional to log2 m, while after m = 2^9 there is a bigger increase in the slope of the graph every time we double m, because the processors are saturated. For our example, when m reaches the value of 2^10, we have begun to saturate the CM-2. Here the number of grid points (2m+1) is 2049. When these timings were measured, we used one sequencer, which consists of 2^14 processors with 512 floating point units (FPUs). Notice that 2048 is four times the number of FPUs, where four is the vector length at each FPU. This explains the beginning of the increase in the slopes of the piecewise linear graph in Figure 2.6, since we will be reusing these FPUs. Also, when we vary m between 2^10 and 2^11, we are changing the number of grid points by 2048, which causes an even bigger increase in the slopes of Figure 2.6. In fact, the timings will double (disregarding the overhead) every time we further increase the number of grid points (2m+1) by multiples of 2048 (the number of FPUs x 4). If we look at the timings and dimensions for larger m, the parallel code starts increasing its time proportionally to the increase in m, similar to the way a sequential code does. Consider the points m = 2^15 (32768) and m = 2^16 (65536) in Figure 2.7. This is explained by the fact that, for this range of m, every time we double m we will be reusing FPUs. Figure 2.8 shows the variation of cost when n changes. The multigrid method in this chapter was applied to the spatial variable m, so none of the increase in cost is due to the multigrid scheme used. The increase in cost here is due to the use of the CM-2 intrinsic functions sum and spread in the angular variable direction.
This happens, for example, in step 2 of the parallel relaxation. Of course, when we
Figure 2.8: Timings (seconds) for a V(1,1) cycle varying n and m = 512. (Curves: CM Jacobi; CM red-black; Cray. Horizontal axis: number of angles n, 2 to 64.)
increase n to the point of reusing the FPUs, the time cost will increase similarly to the way it did when m was increased. Figure 2.7 also shows the additional time spent if we use a red-black Gauss-Seidel relaxation instead of a Jacobi. The timings for red-black Gauss-Seidel (V(1,1)) and Jacobi (V(1,1) and V(2,2)) two-cell relaxation are compared to the Cray timings in Table 2.9. The Cray timings were measured for a V(1,1) cycle. Notice that since the code on the Cray is sequential, it is irrelevant for timing purposes whether we use the Jacobi or red-black Gauss-Seidel relaxation, since both relaxations take approximately the same number of sequential steps. On the CM-2, for the same number of V-cycles, the Jacobi relaxation in a V(1,1) cycle is much faster than the red-black Gauss-Seidel relaxation in a V(1,1) cycle, but its convergence factor is, in general, worse than the red-black V(1,1) convergence rate (Table 2.6). If we consider Jacobi relaxation for the V(2,2) cycle, the timings will be slightly better than the red-black Gauss-Seidel V(1,1) cycle, and the convergence factors in this case will be comparable. We show here that if the additional time used in doing a V-cycle with red-black Gauss-Seidel relaxation were used to perform more Jacobi V-cycles, the convergence factor (for both V(1,1) and V(2,2)) would in general be better. For a fixed number of V-cycles, Table 2.11 compares Jacobi V(1,1) and V(2,2) with red-black Gauss-Seidel V(1,1). In this table we compare the convergence factors for each relaxation using the time for a red-black V(1,1) as the standard unit. For example, if red-black V(1,1) took twice as long as Jacobi V(1,1), then two Jacobi V(1,1) cycles could be performed in the same time as one red-black V(1,1). Using the time for red-black V(1,1) as the standard unit of time, the proper convergence
Table 2.9. Timings for Jacobi (V(1,1) and V(2,2)) and red-black Gauss-Seidel (V(1,1)) on the CM-2 and Cray Y-MP. S4 case, σ_t h = 0.1.

    m        Jacobi V(1,1)   red-black V(1,1)   Jacobi V(2,2)   Cray V(1,1)
    512      1.19            1.78               1.69            2.9
    1024     1.93            3.00               2.82            5.85
    2048     2.78            4.54               4.35            11.71
    4096     4.60            7.57               7.41            23.43
    32768    31.00           53.53              52.69           187.5
    65536    63.14           109.54             108.6           376.0

Table 2.10. Timings for Jacobi (V(1,1) and V(2,2)) and red-black Gauss-Seidel (V(1,1)) on the CM-2 and Cray Y-MP. S4 case, σ_t h = .1 (nonuniform code).

    m        Jacobi V(1,1)   red-black V(1,1)   Jacobi V(2,2)   Cray V(1,1)
    512      2.81            3.90               2.80            2.9
    1024     4.27            5.65               4.31            5.8
    2048     7.20            9.80               9.73            11.7
    4096     12.78           17.78              13.10           23.4
    32768    106.73          146.24             145.25          187.5
    65536    222.47          304.17             302.13          376.0
Table 2.11. Comparison between Jacobi V(1,1), Jacobi V(2,2), and red-black convergence factors (ρ_j1, ρ_j2, and ρ_rb) for various values of σ_t h (S4 case).

    σ_t h     (ρ_j1)^(t_rb/t_j1)   ρ_rb           (ρ_j2)^(t_rb/t_j2)
    10^-5     2.8 x 10^-14         3.3 x 10^-10   4.4 x 10^-11
    10^-4     8.1 x 10^-10         2.8 x 10^-7    5.9 x 10^-10
    10^-3     8.6 x 10^-6          7.2 x 10^-5    4.5 x 10^-5
    10^-2     1.0 x 10^-3          4.3 x 10^-3    1.7 x 10^-3
    10^-1     1.8 x 10^-3          8.0 x 10^-3    2.0 x 10^-3
    1.        2.7 x 10^-4          1.1 x 10^-3    7.1 x 10^-4
    10^1      2.4 x 10^-7          3.2 x 10^-5    1.0 x 10^-5
    10^2      2.0 x 10^-10         2.5 x 10^-7    1.9 x 10^-7
factor for Jacobi V(1,1) would be (ρ_j1)^2. We define

    t_rb = time for 1 V(1,1) cycle with red-black Gauss-Seidel relaxation,
    t_j1 = time for 1 V(1,1) cycle with Jacobi relaxation,
    t_j2 = time for 1 V(2,2) cycle with Jacobi relaxation.

Likewise, we define the convergence factors ρ_rb, ρ_j1, and ρ_j2. The correct comparisons are

    ρ_rb vs. (ρ_j1)^(t_rb/t_j1)    and    ρ_rb vs. (ρ_j2)^(t_rb/t_j2).

These comparisons are shown in Table 2.11. It is clear that both Jacobi V(1,1) and V(2,2) are superior to red-black Gauss-Seidel. If we compare the Jacobi V(1,1) and V(2,2) relaxations, we conclude that Jacobi V(1,1) gives, in general, a better convergence factor in this parallel environment for the same CPU time. Table 2.10 shows some timings obtained with the nonuniform implementation of the algorithm. We notice that the nonuniform timings are roughly doubled in all cases, so the same analysis that was applied to the uniform version applies here without changing the conclusions.

2.12 Conclusions. In this chapter we have studied a parallel algorithm for a multigrid scheme for solving the transport equations in slab geometry. This algorithm is suitable for SIMD computers and takes advantage of this kind of architecture during the different stages of the multigrid scheme. It was implemented on the CM-2 and was much faster than a sequential version of the same algorithm on the Cray Y-MP (using only one processor), especially for large grid sizes. For the relaxation step of the multigrid algorithm, we used the Sherman-Morrison formula and developed an efficient relaxation with the use of a few intrinsic functions on
the CM-2. The interpolation and restriction operations, for some grid points, were performed without communication. The increase in the CPU time spent in a V-cycle, when either one of the grid dimensions was increased, was small before processor saturation. For the spatial dimension, this increase is due to the addition of a new grid level, and for the angular dimension, it is due to the use of the CM-2 intrinsic functions sum and spread. After saturation of the CM-2 is reached, we have shown that the sharper increases in timings are caused by reuse of FPUs. In fact, when the CM-2 becomes saturated, the parallel timings start doubling when one of the grid dimensions is doubled. This, of course, always happens in the sequential timings for any doubling of m or n. The convergence rates attained per V-cycle were extremely good and matched theoretical results. We implemented three different kinds of relaxation, Jacobi V(1,1), Jacobi V(2,2), and red-black V(1,1), each in two kinds of implementation, for uniform and nonuniform meshes. We have shown that, in this context, the Jacobi V(1,1) cycle was superior to the Jacobi V(2,2) cycle and the red-black Gauss-Seidel V(1,1) cycle. The results and timings from this chapter show the good performance of our parallel algorithms for solving the isotropic form of the transport equation on SIMD architectures.
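As a closing illustration of the time-normalized comparison of Section 2.11, the sketch below recomputes two entries of Table 2.11 from the m = 512, σ_t h = 0.1 rows of Tables 2.6 and 2.9; the variable names are mine, and the numbers are copied from those tables.

```python
# m = 512, sigma_t*h = 0.1 entries: factors from Table 2.6, seconds from Table 2.9.
p_j1, t_j1 = 1.5e-2, 1.19    # Jacobi V(1,1)
p_j2, t_j2 = 2.71e-3, 1.69   # Jacobi V(2,2)
p_rb, t_rb = 8.0e-3, 1.78    # red-black Gauss-Seidel V(1,1)

# Normalize to the time of one red-black V(1,1): in t_rb seconds one can run
# t_rb/t_j1 Jacobi V(1,1) cycles, so the comparable factor is p_j1**(t_rb/t_j1).
eff_j1 = p_j1 ** (t_rb / t_j1)
eff_j2 = p_j2 ** (t_rb / t_j2)
# Both effective factors come out close to the 1.8e-3 and 2.0e-3 entries of
# Table 2.11 and beat p_rb = 8.0e-3, as claimed.
```

The same computation applied row by row reproduces the first and third columns of Table 2.11 to the printed precision.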
CHAPTER 3

PARALLEL MULTIGRID FOR ANISOTROPIC SCATTERING

3.1 Introduction. In this chapter we develop a parallel algorithm for the anisotropic transport equation and discuss its implementation on the CM-200. Timings and convergence rates are shown and compared to a sequential implementation on the Cray Y-MP. As shown in Chapter 1, the transport equation for a semi-infinite slab of width 2 can be written as

\[
\mu \frac{\partial \psi}{\partial z}(z,\mu) + \sigma_t\, \psi(z,\mu) = S(z,\mu) + q(z,\mu). \tag{3.1.1}
\]

For the anisotropic scattering case, the scattering term S in (3.1.1) can be written as

\[
S(z,\mu) = \sigma_s \int_{-1}^{1} d\mu'\, \Sigma_s(z,\mu' \cdot \mu)\, \psi(z,\mu') = \sum_{l=0}^{N-1} (2l+1)\, \sigma_l\, \phi_l(z)\, P_l(\mu), \tag{3.1.2}
\]

where

\[
\psi(z,\mu) = \sum_{l=0}^{N-1} (2l+1)\, \phi_l(z)\, P_l(\mu), \tag{3.1.3}
\]

and

\[
\phi_l(z) = \frac{1}{2} \sum_{j=1}^{N} w_j\, P_l(\mu_j)\, \psi(z,\mu_j) \tag{3.1.4}
\]
for l = 0, 1, ..., N−1 are the moments of the N-degree approximation of the flux, μ_1, μ_2, ..., μ_N are the quadrature points, and w_1, w_2, ..., w_N are the quadrature weights. Writing ψ(z) = (ψ(z,μ_1), ψ(z,μ_2), ..., ψ(z,μ_N))^T to represent the flux and φ(z) = (φ_0(z), φ_1(z), ..., φ_{N−1}(z))^T to represent the Legendre moments at the quadrature points, equations (3.1.3) and (3.1.4) can be written as

\[
\phi(z) = T\, \psi(z) \quad\text{and}\quad \psi(z) = T^{-1} \phi(z).
\]

The scattering term (3.1.2) can be written as

\[
S(z) = T^{-1} D\, T\, \psi(z),
\]

where T and T^{-1} are N x N matrices with elements

\[
T_{l,j} = \tfrac{1}{2} w_j\, P_l(\mu_j) \quad\text{and}\quad (T^{-1})_{j,l} = (2l+1)\, P_l(\mu_j),
\]

and D is an N x N diagonal matrix with diagonal elements D_{i,i} = σ_{i−1}. The requirement that (3.1.1) hold at each quadrature point yields a coupled system of ODEs:

\[
M \psi'(z) + \sigma_t\, \psi(z) = T^{-1} D\, T\, \psi(z) + Q, \tag{3.1.5}
\]

where M = diag(μ_j) and D = diag(σ_{j−1}). This is the flux representation of the S_N equations. If the right-hand side of (3.1.5) is known, it becomes an
uncoupled system of ODEs. This leads to the iteration

\[
M \psi^{l+1\,\prime}(z) + (\sigma_t - \alpha)\, \psi^{l+1}(z) = T^{-1}(D - \alpha I)\, T\, \psi^{l}(z) + Q, \tag{3.1.6}
\]

which is called a shifted source iteration. The variable α is a shift parameter. The system of ODEs will be stable for α < σ_t. Suppose σ_t = σ_0 ≥ σ_1 ≥ ... ≥ σ_{N−1} ≥ 0 and σ_t = 1. If α is chosen to be (σ_{N/2} + σ_{N−1})/2, a shifted source iteration will attenuate all the errors in the upper half of the flux moments (φ_{N/2} through φ_{N−1}). In other words, it will attenuate the errors that vary rapidly in angle, but poorly attenuate those that vary slowly in angle. Consider the error given by ε^l(z) = ψ^l(z) − ψ^{l−1}(z), whose moments are given by E^l(z) = T ε^l(z) = (e_0(z), e_1(z), ..., e_{N−1}(z))^T. Each one of these error moments will be decreased approximately by the factor (σ_i − α)/(σ_t − α), i = 0, ..., N−1, which shows that the first error moment (e_0(z)) will not be decreased and that the error moments e_{N/2}(z) and e_{N−1}(z) (corresponding to σ_{N/2} and σ_{N−1}) will be minimized. This suggests that multigrid in angle be used together with a shifted source iteration at each grid level, as Morel and Manteuffel developed in [14]. In the angular multigrid algorithm, the source iteration is performed at each level with the appropriate shift applied to the cross-sectional moments for that angular level. We will refer to the different levels of the angular multigrid by using k. The number of angles (N) is halved (Figure 3.1) when going from a fine to a coarse grid, but the angles on the coarse grid are not a subset of the angles on the fine grid. The angles at grid level k are the N_k Gauss quadrature points at that level (μ_1, ..., μ_{N_k}). When coarsening the angular grid, we disregard the second half of the error moments (e_{N/2}(z), ..., e_{N−1}(z)) at the fine level, since those error moments were already attenuated by the shifted source iteration at that level.
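The transform pair of (3.1.3) and (3.1.4) and the attenuation factors (σ_i − α)/(σ_t − α) can both be checked numerically. In the sketch below, the 1/2 normalization of the moment transform is the one assumed in my reading of (3.1.4), and the cross-section moments σ_l are made-up forward-peaked values, not data from this thesis.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, legvander

N = 8
mu, w = leggauss(N)                    # S_N Gauss quadrature points and weights
P = legvander(mu, N - 1).T             # P[l, j] = P_l(mu_j)

T = 0.5 * w * P                        # moments:  phi = T psi   (cf. 3.1.4)
Tinv = (2 * np.arange(N) + 1) * P.T    # flux:     psi = Tinv phi (cf. 3.1.3)
assert np.allclose(Tinv @ T, np.eye(N))  # exact discrete Legendre transform pair

# Made-up forward-peaked cross-section moments with sigma_t = sigma_0 = 1.
sig = 0.95 ** np.arange(N)
alpha = 0.5 * (sig[N // 2] + sig[N - 1])     # the shift described above
f = (sig - alpha) / (sig[0] - alpha)         # per-moment error reduction factors
# f[0] equals 1 (e_0 untouched); the upper-half factors are small in magnitude.
```

The inverse check works because the N-point Gauss rule integrates products P_l P_{l'} exactly for l + l' ≤ 2N − 1, which covers every pair with l, l' < N.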
Applying the appropriate shift at each angular level results in isotropic scattering at the coarsest angular grid (k = log2 N), which can be solved using the CM-2 parallel
Figure 3.1: Computational grids for the first and second levels of angular multigrid.
multigrid solver developed in Chapter 2. The parallel isotropic algorithm uses a multigrid scheme in the spatial domain. In order to solve (3.1.6) at level k, we use the same spatial discretization used for the isotropic model in Chapter 2: the MLD scheme [13]. When this discretization is used we obtain

\[
H_k\, \psi^{l+1} = S_k\, \psi^{l} + q_k,
\]

where H_k and S_k both contain the shift α_k at angular level k. Matrix H_k is described in Section 3.4, and

\[
S_k = T_k^{-1}(D_k - \alpha_k I)\, T_k. \tag{3.1.7}
\]

The transport sweep has been used before to perform the shifted source iteration [9]. Letting m be the number of cells (2m+2 spatial points), the transport sweep is a sequential algorithm that takes 2m steps, corresponding to solving a block lower bidiagonal system by block forward substitution and a block upper bidiagonal system by block backward substitution. To make the source iteration economical for the CM-200, we perform it at each angular level by using cyclic reduction, which takes log2 m parallel steps. Throughout the angular multigrid cycle, multiplications by T and T^{-1}, which represent Legendre transforms, will be performed many times. Our storage design and algorithm for these multiplications allow Legendre transforms to be performed in O(N) parallel steps. An outline of the remainder of this chapter is as follows. In Section 3.2 we describe the angular multigrid algorithm in the flux notation. In Section 3.3 we describe the spatial discretization used (matrix H). In Section 3.4 we show how to perform the shifted source iteration as a cyclic reduction for the MLD discretization. In Section 3.5 we develop the parallel angular multigrid algorithm. In Section 3.6
we develop the parallel cyclic reduction algorithm. In Section 3.7 we analyze the stability of our cyclic reduction. In Section 3.8 we report on convergence rates and timing tests.

3.2 Angular Multigrid.

3.2.1 General description. The angular multigrid method was designed to accelerate problems with highly forward-peaked scattering [14], where the probability that a neutron is scattered through a small angle is higher than for a large angle. Before we describe the implementation details of this algorithm, we give a general description of both cases treated in this section: the V(1,0) and the V(1,1) angular multigrid cycles. Consider that the discretization of (3.1.6) can be written as the matrix equation L ψ = q. A standard iteration scheme for solving this equation is to use the splitting L = H − S, where S is the scattering matrix defined in (3.1.7), and rewrite the equation as

\[
H\, \psi^{l+1} = S\, \psi^{l} + q.
\]

This iterative process is called source iteration. The angular multigrid method proceeds as follows. Set k = 1 and N_k = N for the first angular grid level.

1. Perform a shifted source iteration on the fine grid:

\[
H_k\, \psi^{l+1} = S_k\, \psi^{l} + q_k.
\]

Compute the residual r_k = S_k(ψ^{l+1} − ψ^l). Calculate the moments of this residual: (φ^{l+1} − φ^l)_k = T_k r_k. Disregard the high moments: (φ^{l+1} − φ^l)_{k+1} = [I, 0] (φ^{l+1} − φ^l)_k.
2. Set N_{k+1} = N_k/2 and k = k+1. Perform a source iteration on the coarse grid with q_k equal to T_k^{-1}(φ^{l+1} − φ^l):

\[
H_k\, \psi^{l+1} = S_k\, \psi^{l} + q_k.
\]

Repeat this coarsening until N_k is equal to 2.

3. On the coarsest grid (N_k = 2), the problem reduces to the isotropic model. Solve it exactly using the spatial multigrid algorithm.

4. Add the corrections from the coarse grid to the flux solution on the fine grid. In the case of a V(1,1) cycle, perform the source iteration using the newest calculated solution for level k: H_k ψ^{l+1} = S_k ψ^l + q_k. In the case of a V(1,0) cycle, do not perform a source iteration. Set N_{k−1} = 2N_k and k = k−1. Repeat this step until k = 1 and N_k = N.

In the next two subsections we detail our V(1,0) and V(1,1) implementations. The V(1,0) cycle, in general, does not need a special description, but in order to develop an algorithm that takes advantage of the greater simplicity of the V(1,0) cycle over the V(1,1) cycle and to avoid redundant calculations, we write the algorithms in a slightly different way for the interpolation steps (Steps 5-7 in the next section).

3.2.2 The V(1,0) cycle. Consider writing the transport equation in its discretized form using MLD for the spatial discretization, and let N be the number of angles and m the number of spatial cells:

\[
H\, \psi = S\, \psi + q. \tag{3.2.1}
\]
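The bookkeeping of the cycle described above, attenuate the upper half of the error moments, drop them, recurse on an S_{N/2} problem, and solve the N = 2 level exactly, can be caricatured entirely in moment space. Everything below is a toy of my own construction: the cross-section moments are made up, the coarsest level is modeled as an exact solve, and the cumulative shifting of the coarse-level cross sections is ignored, so this is an illustration of the data flow, not of the CM-200 algorithm.

```python
import numpy as np

def v10_moments(e, sig, sigt):
    """Toy V(1,0) cycle acting on error moments e (length a power of 2)."""
    N = len(e)
    if N == 2:
        return np.zeros(2)                           # coarsest level: exact solve
    alpha = 0.5 * (sig[N // 2] + sig[N - 1])
    e = e * (sig[:N] - alpha) / (sigt - alpha)       # shifted sweep damps upper half
    low = v10_moments(e[: N // 2], sig, sigt)        # truncate, recurse on S_{N/2}
    return np.concatenate([low, e[N // 2:]])         # lower moments get corrected

N = 8
sig = 0.95 ** np.arange(N)                 # made-up forward-peaked moments
e_out = v10_moments(np.ones(N), sig, sig[0])
# e_out[0] is driven to zero by the coarsest exact solve, and every other
# moment is reduced even though the sweep alone leaves e_0 untouched.
```

The point of the toy is visible in the output: the sweep at each level only damps the upper half of that level's moments, yet after the full cycle every moment, including e_0, has been reduced.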
A submatrix of the shifted matrix H is shown in (3.4.1). The matrix S = T^{-1}(D − αI)T.

Step 0. Initialization. Set k = 1.

Step 1. Shifted source iteration in parallel. In the source iteration, the right-hand side variables are assumed to be known and the left-hand side variables are calculated using the following expression:

\[
H_k\, \psi^{l+1} = f_k, \tag{3.2.2}
\]

where

\[
f_k = S_k\, \psi^{l} + q_k. \tag{3.2.3}
\]

At level k, the number of angles is half the number on the previous grid. This means that the Gauss quadrature points are different at each angular grid level and, consequently, the matrices T_k and M_k depend on the grid level. The matrix H_k represents the MLD spatial discretization for the shifted source iteration (Section 3.4). In the spatial discretization, the variable ε_k = σ_t − α_k multiplies the identity matrix. This step is performed using cyclic reduction, which, for given boundary conditions, will calculate the fluxes for all spatial and angular points in 2 log2 m steps. The cyclic reduction algorithm for the matrix equation (3.2.2) is described in Section 3.4. The initial guess for the flux on the right-hand side of the matrix equation is ψ^0 = 0. This source iteration can be repeated with the new estimated flux (ψ^{l+1}). Set q_1 = q.

Step 2. Calculating the residual. The residual is given by

\[
r_k = S_k(\psi^{l+1} - \psi^{l}). \tag{3.2.4}
\]
Multiplication by the matrix S_k is applied in conjunction with the restriction operation, as described in the next step.

Step 3. Restriction of the residual. The external source term on the next grid level (q_{k+1}) is equal to the restricted residual. The restriction operator is given by

\[
R_k = T_{k+1}^{-1}\, [I, 0]\, T_k. \tag{3.2.5}
\]

Putting (3.2.4) and (3.2.5) together, we have

\[
q_{k+1} = T_{k+1}^{-1}\, [I, 0]\, (D_k - \alpha_k I)\, T_k\, (\psi^{l+1} - \psi^{l}). \tag{3.2.6}
\]

To compute this efficiently, we proceed as follows.

• Calculate Δψ = ψ^{l+1} − ψ^l.

• Multiply Δψ by T_k. This represents a change from flux representation to moment representation. The matrix T_k is given by

\[
T_k = \frac{1}{2}
\begin{bmatrix}
w_1 P_0(\mu_1) & w_2 P_0(\mu_2) & \cdots & w_N P_0(\mu_N) \\
w_1 P_1(\mu_1) & w_2 P_1(\mu_2) & \cdots & w_N P_1(\mu_N) \\
\vdots & & & \vdots \\
w_1 P_{N-1}(\mu_1) & w_2 P_{N-1}(\mu_2) & \cdots & w_N P_{N-1}(\mu_N)
\end{bmatrix}_{N \times N}. \tag{3.2.7}
\]
• Multiplications by [I_{k+1}, 0] and by (D_k − α_k I) are interchangeable. We multiply by [I_{k+1}, 0] first, thus disregarding the second half of the moments (higher moments) and considering only the first half of the moments (lower moments) on the coarser grid (k+1). This can be written as

\[
[\,I \;\; 0\,]_{\frac{N}{2} \times N}\, (\phi^{l+1} - \phi^{l})_k.
\]

• At angular level k we multiply by the scattering matrix truncated to N/2 rows and N/2 columns. Notice that this is a truncation of the scattering matrix used with the P_{N−1} expansion on the fine grid (k) and is given by

\[
\bar{D}_k - \alpha_k I = \mathrm{diag}(\sigma_0 - \alpha_k,\; \ldots,\; \sigma_{N/2-1} - \alpha_k),
\]

where α_k is the shift on the fine grid (k) and is equal to (σ_{N/2} + σ_{N−1})/2.

• Finally, multiply by T_{k+1}^{-1} (N_{k+1} = N_k/2). T_{k+1}^{-1} is calculated using the Gauss quadrature points at level k+1. Recall that the N_{k+1} Gauss quadrature points are not a subset of the N_k points. The matrix T_{k+1}^{-1} at level (k+1) is given by
\[
T_{k+1}^{-1} =
\begin{bmatrix}
P_0(\mu_1) & 3P_1(\mu_1) & 5P_2(\mu_1) & \cdots & (2N-1)P_{N-1}(\mu_1) \\
P_0(\mu_2) & 3P_1(\mu_2) & 5P_2(\mu_2) & \cdots & (2N-1)P_{N-1}(\mu_2) \\
\vdots & & & & \vdots \\
P_0(\mu_{N_{k+1}}) & 3P_1(\mu_{N_{k+1}}) & 5P_2(\mu_{N_{k+1}}) & \cdots & (2N-1)P_{N-1}(\mu_{N_{k+1}})
\end{bmatrix}. \tag{3.2.8}
\]

Using q_{k+1} as the new source term, repeat Steps 1 through 3 until we have a P_1 expansion on the right-hand side (when N_k = 2).

Step 4. Calculation of the error at the coarsest level. With the appropriate shift (α = σ_1) at this level (k = log2 n), we have an isotropic transport equation, since

\[
D_k - \sigma_1 I = \mathrm{diag}(\sigma_0 - \sigma_1,\; 0)
\]

leaves only the zeroth scattering moment.
At this level, instead of performing the source iteration through cyclic reduction as was done on finer angular grid levels, we use the parallel multigrid-in-space algorithm for solving the isotropic equations. For N = 2, multigrid in space solves this system of equations exactly in one V-cycle.

Step 5. Interpolation of the correction. Starting with k = log2 n, transform each correction from the flux domain to the moment domain:

\[
\delta\phi_k = \begin{bmatrix} T_k\, \delta\psi_k \\ 0 \end{bmatrix}. \tag{3.2.9}
\]

The elongation in the CM codes is not performed, since the entries (or the processors) where δφ_k is not calculated were previously initialized to zero. Set k = k−1 and calculate a new δφ_k. Repeat this process until k = 2, summing the moments for levels l = 2, ..., log2 n as

\[
\delta\phi = \sum_{l=2}^{\log_2 n} \begin{bmatrix} \delta\phi_l \\ 0 \end{bmatrix}. \tag{3.2.10}
\]

Step 6. Addition of the corrections. At the finest angular grid, the summation of the errors at all angular levels is transformed from moment to flux representation, and this flux correction is then added to the latest flux solution at this level:

\[
\psi^{new} = \psi^{l+1} + T_1^{-1}\, \delta\phi. \tag{3.2.11}
\]

Recall that, at this level, T^{-1} will be an N x N matrix generated by using the Gauss quadrature points for all n scattering angles.

3.2.3 The V(1,1) cycle. For the V(1,1) cycle, we modify the interpolation steps (Steps 5 through 6). The general interpolation operation at
level k can be described as

\[
\delta\psi_k = \delta\psi_k + T_k^{-1} \begin{bmatrix} T_{k+1}\, \delta\psi_{k+1} \\ 0 \end{bmatrix}. \tag{3.2.12}
\]

After we calculate the correction δψ_k, we use it to evaluate the right-hand side of a new source iteration using the following expression:

\[
f_k = S_k\, \delta\psi_k + r_k, \tag{3.2.13}
\]

where r_k is the restricted residual used at level k when going down the cycle. Set k = k−1 and repeat this process until k = 1.

3.3 Discretization. The MLD equations using variables at the edges were derived before in (2.8.5-2.8.8). For the anisotropic model, we use the MLD discretization in the form of equations (2.8.5-2.8.8), but with the scattering term

\[
S(x,\mu) = \sigma_s \int_{-1}^{1} d\mu'\, \Sigma_s(x,\mu' \cdot \mu)\, \psi(x,\mu') = \sum_{l=0}^{N-1} (2l+1)\, \sigma_l\, \phi_l(x)\, P_l(\mu). \tag{3.3.1}
\]

Using this for the scattering term and multiplying (2.8.5-2.8.8) by h, we have

\[
\mu_j(\psi^+_{i,l} + \psi^+_{i,r}) - 2\mu_j\, \psi^+_{i-1,r} + \sigma_t h\, \psi^+_{i,l} = \sum_{l=0}^{N-1} (2l+1)\, \sigma_l\, \phi_l(x)\, P_l(\mu_j), \tag{3.3.2}
\]

\[
\mu_j(\psi^+_{i,r} - \psi^+_{i,l}) + \sigma_t h\, \psi^+_{i,r} = \sum_{l=0}^{N-1} (2l+1)\, \sigma_l\, \phi_l(x)\, P_l(\mu_j), \tag{3.3.3}
\]

for i = 1, ..., m, j = 1, ..., n, and the negative angle equations are

\[
\mu_j(\psi^-_{i,l} + \psi^-_{i,r}) - 2\mu_j\, \psi^-_{i+1,l} + \sigma_t h\, \psi^-_{i,r} = \sum_{l=0}^{N-1} (2l+1)\, \sigma_l\, \phi_l(x)\, P_l(\mu_j), \tag{3.3.4}
\]

\[
\mu_j(\psi^-_{i,l} - \psi^-_{i,r}) + \sigma_t h\, \psi^-_{i,l} = \sum_{l=0}^{N-1} (2l+1)\, \sigma_l\, \phi_l(x)\, P_l(\mu_j), \tag{3.3.5}
\]
for i = 1, ..., m, j = 1, ..., n. Using vector notation such that ψ^±_i(·) is an n-vector (ψ^−_i(·) = (ψ^−_{i,1}(·), ..., ψ^−_{i,n}(·))^T), we can rewrite these equations in matrix form as

\[
M_i(\psi^+_{i,l} + \psi^+_{i,r}) - 2M_i\, \psi^+_{i-1,r} + \sigma_t\, \psi^+_{i,l} = S^+ \psi_{i,l}, \tag{3.3.6}
\]

\[
M_i(\psi^+_{i,r} - \psi^+_{i,l}) + \sigma_t\, \psi^+_{i,r} = S^+ \psi_{i,r}, \tag{3.3.7}
\]

for i = 1, ..., m, and the negative angle equations are

\[
M_i(\psi^-_{i,l} + \psi^-_{i,r}) - 2M_i\, \psi^-_{i+1,l} + \sigma_t\, \psi^-_{i,r} = S^- \psi_{i,r}, \tag{3.3.8}
\]

\[
M_i(\psi^-_{i,l} - \psi^-_{i,r}) + \sigma_t\, \psi^-_{i,l} = S^- \psi_{i,l}, \tag{3.3.9}
\]

for i = 1, ..., m, where the matrices S^+ and S^- represent the upper and lower parts of the matrix S = T^{-1}DT, respectively, and the matrix M_i = diag(μ_j / h_i). If we use shifted source iteration to solve these matrix equations, the right-hand sides are assumed known, the matrix S = T^{-1}(D − αI)T, and σ_t is replaced by ε = σ_t − α in the above equations.

3.4 Iteration in Parallel. The source iteration is performed using cyclic reduction [15]. Using equations (3.3.6-3.3.9), the source iteration for the entire domain can be written as H ψ^{l+1} = f^l, where the matrix H and the vector f already contain the shifts (α). The matrix H is of order 4mn. A submatrix of H, corresponding to 2 cells of a spatial domain with m cells (an 8n x 8n matrix), is
PAGE 82
with E_i = \hat\sigma_t h_i I,

    [ E_i+M_i     0       M_i      0          0           0            0           0      ]
    [    0     E_i+M_i     0     -M_i         0           0            0           0      ]
    [  -M_i       0     E_i+M_i    0          0           0            0           0      ]
    [    0       M_i       0    E_i+M_i       0        -2M_i           0           0      ]
    [    0        0    -2M_{i+1}   0   E_{i+1}+M_{i+1}    0         M_{i+1}        0      ]
    [    0        0        0       0          0    E_{i+1}+M_{i+1}     0       -M_{i+1}   ]
    [    0        0        0       0      -M_{i+1}        0    E_{i+1}+M_{i+1}     0      ]
    [    0        0        0       0          0        M_{i+1}         0   E_{i+1}+M_{i+1}]   (3.4.1)
                                                                                  (8n x 8n)

The variables for this matrix are

    \underline\psi = (\underline\psi^+_{i,l}, \underline\psi^-_{i,l}, \underline\psi^+_{i,r}, \underline\psi^-_{i,r}, \underline\psi^+_{i+1,l}, \underline\psi^-_{i+1,l}, \underline\psi^+_{i+1,r}, \underline\psi^-_{i+1,r})^T.    (3.4.2)

The right-hand side for this two-cell system can be represented as

    \underline f = (\underline f^+_{i,l}, \underline f^-_{i,l}, \underline f^+_{i,r}, \underline f^-_{i,r}, \underline f^+_{i+1,l}, \underline f^-_{i+1,l}, \underline f^+_{i+1,r}, \underline f^-_{i+1,r})^T.    (3.4.3)

Notice that this relaxation can be performed separately for the negative and positive variables, since the above system of equations can be decoupled into two systems, one for the positive and another for the negative variables.

Positive variable cyclic reduction.

Writing the above system for the positive variables over the entire domain generates a block lower bidiagonal (BLB) system with block matrices of size 2n x 2n. The associated matrix is
    [ D_1                              ]
    [ F_2  D_2                         ]
    [      F_3  D_3                    ]
    [           .    .                 ]
    [            F_i  D_i              ]
    [                 .      .         ]
    [                  F_m  D_m        ]    (3.4.4)
                            (2mn x 2mn)

The variables for this matrix are

    \underline\psi^+ = (\underline\psi^+_1, \underline\psi^+_2, ..., \underline\psi^+_i, ..., \underline\psi^+_m)^T,    (3.4.5)

where \underline\psi^+_i = (\underline\psi^+_{i,l}, \underline\psi^+_{i,r}). The right-hand side for this smaller system is

    \underline f^+ = (\underline f^+_1, \underline f^+_2, ..., \underline f^+_i, ..., \underline f^+_m)^T,    (3.4.6)

where D_i is the 2n x 2n diagonal block of (3.4.1) associated with cell i and

    F_i = [ 0  -2M_i ]
          [ 0    0   ].    (3.4.7)

Notice that the index i in this notation represents the variables evaluated for cell i: (\underline\psi^+_{i,l}, \underline\psi^+_{i,r}). If we perform the first step of cyclic reduction (eliminating every other
row, starting with the first row or cell), we obtain the following system:

    [ D_2                                ]
    [ F^2_4  D_4                         ]
    [        F^2_6  D_6                  ]
    [              .      .              ]
    [               F^2_i  D_i           ]
    [                    .      .        ]
    [                     F^2_m  D_m     ]    (3.4.8)
                               (mn x mn)

where

    F^2_i = [ 0  -4 M_i M_{i-1}^2 \Delta_{i-1}^{-1} ]
            [ 0                  0                  ].
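The elimination and back-substitution just described can be prototyped for a generic block lower bidiagonal system. The recursive helper below is our own sketch (hypothetical names, explicit inverses kept only for clarity, m a power of two), not the CM-200 implementation:

```python
import numpy as np

def cyclic_reduction_blb(D, F, f):
    """Solve D[0] x[0] = f[0] and F[i] x[i-1] + D[i] x[i] = f[i] (i >= 1)
    by cyclic reduction: eliminate every other block row, recurse on the
    half-size system, then back-substitute the eliminated rows."""
    m = len(D)
    if m == 1:
        return [np.linalg.solve(D[0], f[0])]
    D2, F2, f2 = [], [], []
    for j in range(1, m, 2):                     # keep every second row
        G = F[j] @ np.linalg.inv(D[j - 1])
        D2.append(D[j])
        F2.append(-G @ F[j - 1] if j > 1 else None)
        f2.append(f[j] - G @ f[j - 1])
    xk = cyclic_reduction_blb(D2, F2, f2)        # solve the coarse system
    x = [None] * m
    for idx, j in enumerate(range(1, m, 2)):
        x[j] = xk[idx]
    x[0] = np.linalg.solve(D[0], f[0])           # back-substitute eliminated rows
    for j in range(2, m, 2):
        x[j] = np.linalg.solve(D[j], f[j] - F[j] @ x[j - 1])
    return x
```

Each reduction level halves the number of block rows, so log2 m levels suffice; on the CM-200 each level is performed in one parallel step across all cells.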
In general, the system at level k couples the surviving cells \delta, 2\delta, ..., m:

    [ D_\delta                                      ]
    [ F^k_{2\delta}  D_{2\delta}                    ]
    [               F^k_{3\delta}  D_{3\delta}      ]
    [                            .        .         ]
    [                             F^k_m  D_m        ]    (3.4.9)

where \delta = 2^{k-1} and

    F^k_i = [ 0  -2^{2^{k-1}} M_i M_{i-1}^2 \Delta_{i-1}^{-1} M_{i-2}^2 \Delta_{i-2}^{-1} \cdots M_{i-(2^{k-1}-1)}^2 \Delta_{i-(2^{k-1}-1)}^{-1} ]
            [ 0                                                     0                                                                            ].    (3.4.10)

Coarsening of the cells in cyclic reduction for the positive variables is illustrated for m = 8 in Figure 3.2. If we keep coarsening the grids in this fashion, we end up with one cell with a matrix equation given by

    D_m \underline\psi_m = \underline f^k_m,    (3.4.11)

where k = log_2 m. The 2n-vector \underline f^k_m is obtained through the same recursive elimination process used on the left-hand side of the matrix equations, where we eliminate every other row starting with the first row. After solving this equation for \underline\psi_m, the variables that were eliminated previously can be calculated. For example, at level k the variables \underline\psi_i can be calculated by the following expressions:

    \underline\psi_i = D_i^{-1} (\underline f^k_i - F^k_i \underline\psi_{i-\delta})    (3.4.12)

for i = m-\delta, m-3\delta, ..., 3\delta, and

    \underline\psi_\delta = D_\delta^{-1} \underline f^k_\delta,    (3.4.13)

where

    D_i^{-1} = \Delta_i^{-1} [ E_i+M_i   -M_i  ]      with  \Delta_i = (E_i+M_i)^2 + M_i^2.    (3.4.14)
                             [   M_i    E_i+M_i]

We can describe the above algorithm in matrix notation. For example, to
Figure 3.2: Coarsening of cells during cyclic reduction for the positive variables. (For m = 8 the diagram lists the edge variables \psi_1, ..., \psi_{18} surviving at each level k = 1, 2, ..., log_2 m; only the last cell's variables, \psi_{16} and \psi_{17}, remain at the coarsest level.)
go from (3.4.4) to (3.4.8) we multiply the matrix at level 1 by the matrix

    [ -F_2 D_1^{-1}  I                                          ]
    [                  -F_4 D_3^{-1}  I                         ]
    [                                    .    .                 ]
    [                       -F_{m-2} D_{m-3}^{-1}  I            ]
    [                                    -F_m D_{m-1}^{-1}  I   ]

This matrix, called the restriction matrix, has half the number of rows of the original, since it eliminates every other cell in the original grid. The restriction matrix to go from level k-1 to level k is

    [ -F^{k-1}_{2\delta} D_\delta^{-1}  I                                                  ]
    [                     -F^{k-1}_{4\delta} D_{3\delta}^{-1}  I                           ]
    [                                            .    .                                    ]
    [                       -F^{k-1}_{m-2\delta} D_{m-3\delta}^{-1}  I                     ]
    [                                       -F^{k-1}_m D_{m-\delta}^{-1}  I                ]    (3.4.15)

where \delta is the stride at level k-1 (\delta = 2^{k-2}). To obtain \underline f^k, the restriction operator is applied repeatedly to the right-hand side of the original matrix equation until level k is reached. At k = log_2 m, we solve the final matrix equation and start interpolating back to finer levels. The
interpolated variables at level k are calculated by the matrix equation (3.4.16), which expresses (3.4.12)-(3.4.13) over the whole grid: the cells that survive at level k+1 pass through identity blocks,

    \underline\psi^k_i = \underline\psi^{k+1}_i,

while each cell eliminated at level k is recovered from its right-hand side and its left neighbor,

    \underline\psi^k_i = D_i^{-1} \underline f^k_i - D_i^{-1} F^k_i \underline\psi^k_{i-\delta}.    (3.4.16)

Negative variable cyclic reduction.

The same process can be applied to the negative variable relaxation matrix, in which case we have a block upper bidiagonal (BUB) system with block matrices of size 2n x 2n:
    [ D_1  F_1                               ]
    [      D_2  F_2                          ]
    [           .     .                      ]
    [            D_i  F_i                    ]
    [                 D_{i+1}  F_{i+1}       ]
    [                        .       .       ]
    [                   D_{m-1}  F_{m-1}     ]
    [                             D_m        ]    (3.4.17)
                                  (2mn x 2mn)

The variables for this matrix are

    \underline\psi^- = (\underline\psi^-_1, \underline\psi^-_2, ..., \underline\psi^-_m)^T,    (3.4.18)

where \underline\psi^-_i = (\underline\psi^-_{i,l}, \underline\psi^-_{i,r}). The right-hand side for this system is

    \underline f^- = (\underline f^-_1, \underline f^-_2, ..., \underline f^-_m)^T,    (3.4.19)

where \underline f^-_i = (\underline f^-_{i,l}, \underline f^-_{i,r}), and

    F_i = [   0    0 ]
          [ -2M_i  0 ].    (3.4.20)

Again the index i in this notation represents the variables of cell i: (\underline\psi^-_{i,l}, \underline\psi^-_{i,r}). If we perform the first step of the cyclic reduction (eliminating every other row, starting with the last row or cell), we obtain the following system:
    [ D_1  F^2_1                             ]
    [      D_3  F^2_3                        ]
    [           .       .                    ]
    [           D_{m-3}  F^2_{m-3}           ]
    [                     D_{m-1}            ]    (3.4.21)
                                 (mn x mn)

In general, the system at level k couples the surviving cells 1, 1+\delta, ..., m-\delta+1:

    [ D_1  F^k_1                                           ]
    [      D_{1+\delta}  F^k_{1+\delta}                    ]
    [                   .            .                     ]
    [              D_{m-2\delta+1}  F^k_{m-2\delta+1}      ]
    [                                D_{m-\delta+1}        ]    (3.4.22)
where \delta = 2^{k-1} and

    F^k_i = [                                              0                                                                            0 ]
            [ -2^{2^{k-1}} M_i M_{i+1}^2 \Delta_{i+1}^{-1} M_{i+2}^2 \Delta_{i+2}^{-1} \cdots M_{i+(2^{k-1}-1)}^2 \Delta_{i+(2^{k-1}-1)}^{-1}  0 ].    (3.4.23)

Coarsening of the cells in the cyclic reduction is illustrated for the negative variables, starting with m = 8, in Figure 3.3. If we keep coarsening the grids in this fashion, we end up with one cell with a matrix equation given by

    D_1 \underline\psi_1 = \underline f^k_1,    (3.4.24)

where k = log_2 m. After solving this equation for the vector \underline\psi_1, the variables that were eliminated previously can be calculated. For example, at level k the variables \underline\psi_i can be calculated by the following matrix expressions:

    \underline\psi_i = D_i^{-1} (\underline f^k_i - F^k_i \underline\psi_{i+\delta})    (3.4.25)

for i = 1+\delta, 1+3\delta, ..., m-3\delta+1, and

    \underline\psi_{m-\delta+1} = D_{m-\delta+1}^{-1} \underline f^k_{m-\delta+1},    (3.4.26)

where

    D_i^{-1} = \Delta_i^{-1} [ E_i+M_i    M_i  ]      with  \Delta_i = (E_i+M_i)^2 + M_i^2.    (3.4.27)
                             [  -M_i    E_i+M_i]

We can describe the above algorithm in matrix notation. For example, to go from (3.4.17) to (3.4.21), we multiply the matrix at level 1 by the matrix

    [ I  -F_1 D_2^{-1}                                         ]
    [                   I  -F_3 D_4^{-1}                       ]
    [                               .    .                     ]
    [                      I  -F_{m-3} D_{m-2}^{-1}            ]
    [                                   I  -F_{m-1} D_m^{-1}   ]
Figure 3.3: Coarsening of cells during cyclic reduction for the negative variables. (For m = 8 the diagram lists the edge variables surviving at each level; only the first cell's variables, \psi_2 and \psi_3, remain at the coarsest level.)
This matrix, called the restriction matrix, has half the number of rows of the original matrix, since it eliminates every other cell in the original grid. The restriction matrix to go from level k-1 to level k is

    [ I  -F^{k-1}_1 D_{1+\delta}^{-1}                                                  ]
    [                 I  -F^{k-1}_{1+2\delta} D_{1+3\delta}^{-1}                       ]
    [                          I  -F^{k-1}_{1+4\delta} D_{1+5\delta}^{-1}              ]
    [                                        .    .                                    ]
    [                            I  -F^{k-1}_{m-2\delta+1} D_{m-\delta+1}^{-1}         ]    (3.4.28)

where \delta is the stride at step k-1 (\delta = 2^{k-2}). To obtain \underline f^k, we just have to keep applying the restriction operator to the right-hand side of the original matrix equation until level k is reached. At k = log_2 m, we solve the final matrix equation and start interpolating back to finer levels. The interpolated variables at level k are calculated by a matrix equation analogous to (3.4.16): the cells surviving at level k+1 pass through identity blocks,

    \underline\psi^k_i = \underline\psi^{k+1}_i,

while each cell eliminated at level k is recovered from its right-hand side and its right neighbor,

    \underline\psi^k_i = D_i^{-1} \underline f^k_i - D_i^{-1} F^k_i \underline\psi^k_{i+\delta}.
3.5 Angular Multigrid on the CM-200.

3.5.1 Storage.

We assign one processor to each point (i.e., each (x_i, \mu_j) pair). At different angular levels we have different Gauss quadrature points. Consider a set of processors that contain the same spatial grid point and different angular grid points. Each one of these processors will contain one angle from each angular level. For example, processor P_1 will store \mu_1 for level 1 and all the variables calculated at this point. It will also contain \mu_1 for level 2. Figure 3.4 illustrates the distribution of data for N = 8 (n = 4) on the CM-200. In our implementation we always have two variables associated with the same quantity (one for the positive angles, another for the negative angles), for example, \psi^+(x_i, \mu_j) and \psi^-(x_i, \mu_j), q^+(x_i, \mu_j) and q^-(x_i, \mu_j), and so on. As in Chapter 2, we use some duplication of data to avoid communication. For example, the variables w_j (a one-dimensional array) are stored as a two-dimensional array (w_{i,j}). The diagonal matrices M_i are stored as full matrices, where each column is a replication of the diagonal. At each processor we store the Legendre polynomials of all degrees evaluated at the angle associated
with that processor. Figure 3.5 illustrates this for the case where the starting number of scattering angles is eight.

3.5.2 Algorithm.

Now we describe the steps to perform angular multigrid on the CM-2.

1. Initialization. This is performed completely with in-place calculations.

2. Calculation of the right-hand side for the shifted source iteration. The right-hand side variables for the shifted source iteration step are assumed known. They are calculated through the expression \underline f = S \underline\psi + \underline q, where S = T^{-1} D T (the matrix D contains the shift \alpha). The initial guess for the flux on the right-hand side of the matrix equation may be the zero vector (\underline\psi = 0) or may be a nonzero vector. For a nonzero vector we notice that multiplication of the vector \underline\psi by the matrix S can be performed in three stages:

    Multiplication by the matrix T.
    Multiplication by the diagonal matrix D.
    Multiplication by the matrix T^{-1}.

The above three steps will be performed at different points of the algorithm and will be described next. After the multiplication by the matrix S is performed, we have a resultant vector whose entries are stored conformably to the entries of the vector \underline q. Consequently, the addition of this vector to \underline q takes only one parallel step. Details about the implementation of the parallel cyclic reduction are discussed in the following section.

Multiplication of a vector by the N x N matrix T_k: \underline z = T_k \underline y.
Figure 3.4. Data structure for a particular spatial grid point and for different angular levels, showing the fluxes and the Legendre moments.
Figure 3.5. Data structure for Legendre polynomials for a particular spatial grid point and for two angular levels, starting with N = 4.
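The multiplications by T (fluxes to Legendre moments) and T^{-1} (moments back to fluxes) used throughout this algorithm can be prototyped with Gauss-Legendre quadrature. The normalization below is our own convention (the thesis' T may scale rows differently); the round trip is exact because an N-point Gauss rule integrates products of two polynomials of degree at most N-1 exactly:

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, legvander

def flux_to_moments(psi, mu, w):
    """z_l = sum_j w_j P_l(mu_j) psi_j  (multiplication by T)."""
    P = legvander(mu, len(mu) - 1)            # P[j, l] = P_l(mu_j)
    return P.T @ (w * psi)

def moments_to_flux(z, mu):
    """psi_j = sum_l (2l+1)/2 * z_l * P_l(mu_j)  (multiplication by T^{-1})."""
    ell = np.arange(len(z))
    return legvander(mu, len(z) - 1) @ ((2 * ell + 1) / 2 * z)
```

Each of the N output entries is an O(N) sum, so a sequential implementation costs O(N^2), matching the operation counts discussed in this section.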
Every row of the matrix T, given in (3.2.7), will update one entry of the vector \underline z for all the spatial grid points. For example, processor P_{i,j}, representing angle \mu_j and spatial point x_i, will use row j of matrix T to update its entry (z_j) of the vector \underline z. In particular, the processors that contain the entries of angle \mu_1 will use the first row of matrix T for updating the variables at \mu_1 (z(\mu_1)). Looking back at the data structure, we can observe that the multiplications w_j P_l(\mu_j) y_j will all be performed in place and are done simultaneously for all the points in the spatial and angular domain. The next step is to perform a summation across the angles. This summation is done with the CM-200 intrinsic function sum, and is also done for all the spatial points simultaneously. The only sequential part of this matrix multiplication is due to the single-instruction mode of the CM-200. The different processors use different polynomials (because they are updated with different rows of matrix T) to update the entries of \underline z, and this has to be done sequentially for this kind of data structure on an SIMD computer. This step will be executed in N different stages for the entire angular and spatial domain. We can express this as

    do k = 1, N
       t_j = w_j P_{k-1}(\mu_j) y_j    forall procs_j (j = 1, ..., N)
       z_k = sum(t_j, 2)
    continue

for all i = 1, ..., 2m+2 simultaneously. In MIMD computers, e.g. the CM-5, different processors can perform different kinds of instructions simultaneously. Thus an algorithm like this takes one parallel step on MIMD computers (each processor would be evaluating a Legendre polynomial of different degree).

Multiplication of a vector by the N x N matrix T_k^{-1}: \underline z = T_k^{-1} \underline y. Multiplication by T^{-1}, given in (3.2.8), is performed in a manner similar to
that for T. To illustrate, consider updating the processors that store the variables at \mu_1 and any of the 2m+2 spatial grid points. These updates are done using the first row of matrix T^{-1}. Here we notice that the entries of this row are all stored at the processor that contains \mu_1 (as shown in Figure 3.5), but they will be multiplied by elements of \underline y distributed across different processors throughout the CM-200. Furthermore, it is not convenient to take advantage of the CM-200 shifts, since the shifts for different rows will be different for the same elements of \underline y: element y_3, for example, needs a shift of 2 for row 1 and a shift of 1 for row 2. If we create a new vector that stores the partial summations for all spatial and angular points, the multiplication will be completed in N steps, where N-1 is the maximum degree of the polynomials at each processor. The partial summations are calculated as

    do d = 1, N
       sp_i = sp_i + (2d-1) P_{d-1}(\mu_i) y_d    for procs_i (i = 1, ..., n)
    continue

for i = 1, ..., 2m+2 simultaneously.

Multiplication of a vector by the N x N matrix D_k: \underline z = D_k \underline y. This multiplication is done in place, since each processor has the appropriate entry of the matrix and only multiplies its own vector entry:

    z_j = d_j y_j

for i = 1, ..., 2m+2, j = 1, ..., n simultaneously.

3. Calculation of the residual. This step is performed completely in parallel and takes only one step for
the whole spatial and angular domain.

4. Restriction of the residual (coarsening). First, we change from the flux to the moment domain. This is accomplished through multiplication by the matrix T_k, as described above for a generic vector. Second, we perform the truncation of moments. This is obtained automatically by simply disregarding the processors that store the high moments at that level. Finally, we transform the moments back into fluxes through multiplication by T_{k+1}^{-1}, as described above for a generic vector.

5. Calculation of the error on the coarsest grid. This is performed by using the parallel isotropic scattering solver developed in Chapter 2.

6. Interpolation of the error. Multiplication by the matrix T is performed as described previously. The additions of the error vectors \underline e_2, \underline e_3, ..., \underline e_n are performed with in-place calculations in one parallel step.

7. Correction of the flux at the first angular level. Multiplication by the matrix T^{-1} is as previously described. The addition of the resultant vector to the flux at the finest angular level is done in place.

3.6 Cyclic Reduction on the CM-200.

3.6.1 Storage and Initialization.

Each column of the matrix in (3.4.1) will multiply data belonging to a unique cell (for each level of the cyclic reduction algorithm). Each F_i^k in (3.4.10) has a single nonzero submatrix, which we call fe_i^k, and which depends on the grid level k and the spatial cell i. For nonuniform grids, the matrices F_i^k are different for distinct cells. The level index for this variable is stored sequentially at each processor, as is the case for all the variables that contain a level
index. The storage is done this way because the different calculation levels of the cyclic reduction have to be performed sequentially (it takes log_2 m steps, and each level uses results from the preceding level). Throughout the transfer of data between grids (coarsening and interpolation), this storage minimizes communication. In this section we omit the angular index, since all the steps of the parallel cyclic reduction are performed for all angles simultaneously. Most of the steps for the spatial grid points will also be performed in parallel, even in the case where we have variable mesh sizes. For example, all processors have their own definition of fe_i^k and perform their calculation simultaneously. Notice that fe_i^k has more product terms on coarser grid levels (larger k). The algorithm for the calculation of fe on the CM-200 is shown here:

    do 2 k = 1, ml
       t  = M
       t1 = M
       t2 = \Delta
       do 1 ii = 1, 2**(k-1) - 1
          t1 = shift(t1, 2, 1)
          t2 = shift(t2, 2, 1)
          t  = t1**2 * t / t2
  1    continue
       fe(k) = t * (2**(2**(k-1)))
  2 continue

for all i = 1, ..., 2m+2, j = 1, ..., n simultaneously; ml is the number of levels in the cyclic reduction (ml = log_2 m).
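The shift loop above forms, for every cell at once, the magnitude of the nonzero block of F_i^k from (3.4.10). An equivalent NumPy sketch (our own names; np.roll plays the role of the CM shift, so boundary cells wrap around, which the actual code masks off):

```python
import numpy as np

def fe_levels(M, Delta, ml):
    """|fe^k_i| = 2**(2**(k-1)) * M_i * prod_{s=1}^{2**(k-1)-1} M_{i-s}**2 / Delta_{i-s}
    for k = 1..ml, with the diagonal blocks stored as 1-D arrays of their entries."""
    fe = []
    for k in range(1, ml + 1):
        t = np.asarray(M, dtype=float).copy()
        for s in range(1, 2**(k - 1)):
            # pull the factor from the cell s strides to the left (wraps at the edge)
            t = t * np.roll(M, s)**2 / np.roll(Delta, s)
        fe.append(t * 2.0**(2**(k - 1)))
    return fe
```

For example, at level k = 2 each cell holds 4 * M_i * M_{i-1}^2 / Delta_{i-1}, the block appearing in (3.4.8).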
3.6.2 Algorithm for Cyclic Reduction on an SIMD Computer.

Here we describe the algorithm for the positive variable cyclic reduction. The algorithm for the negative variable cyclic reduction is performed in a similar way.

1. Coarsening. Coarsening (restriction) for cyclic reduction at each level k is performed through the multiplication of the matrix in (3.4.9) by the matrix in (3.4.15). Note that the only nonzero elements of the latter matrix belong to the identity blocks and to block matrices like F_i^k D_{i-\delta}^{-1}. The matrix D_i^{-1} is calculated analytically in (3.4.14) and can be represented here as

    D_i^{-1} = [ d11  d12 ]
               [ d21  d22 ].    (3.6.1)

Consequently the matrix F_i^k D_{i-\delta}^{-1} is represented by

    F_i^k D_{i-\delta}^{-1} = [ fe_i^k d21  fe_i^k d22 ]
                              [     0            0     ].    (3.6.2)

Notice that this matrix has only two nonzero n x n submatrices, fe_i^k d21 and fe_i^k d22, which are henceforth denoted r1 and r2, respectively. These submatrices multiply q_{2i} and q_{2i+1} (the source terms at the left and right sides of cell i, respectively). The first step of the coarsening algorithm on the CM-200 forms the products

    t1 = r1 * q
    t2 = r2 * q

At cell i, these products are required from its left neighbor (cell i - \delta). Thus, we need to shift the two arrays t1 and t2 from the left by \delta and add them into t3:

    t1 = shift(t1, 2, \delta)
    t2 = shift(t2, 2, \delta)
    t3 = t1 + t2

where \delta is the stride at step k-1 (\delta = 2^{k-1}).
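The masked updates used in the remainder of the coarsening step (the where(maske)/where(masko) blocks below) can be mimicked with boolean index arrays. A sketch under our own 0-based layout, with hypothetical names:

```python
import numpy as np

def masked_coarsen_update(q, t3):
    """One masked update: rows alternate left/right edge variables per cell.
    Left-side rows receive the shifted contribution t3; right-side rows
    are copied unchanged (no interprocessor communication)."""
    rows = np.arange(q.shape[0])
    maske = rows % 2 == 0                  # left-side rows (0-based assumption)
    masko = ~maske                         # right-side rows
    q2 = q.copy()
    q2[maske] += t3[maske]                 # where (maske) q2 = q + t3
    q2[masko] = q[masko]                   # where (masko) q2 = q
    return q2
```

Both guarded assignments touch disjoint rows, so on an SIMD machine they are two fully parallel steps.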
Finally, we can update the right- and left-side variables for each cell through the use of two masks that differentiate the right and left sides of each spatial cell (in the matrix, the left and right sides of a cell correspond to the first n rows and the second n rows of each 2n block row, respectively). maske represents all the left-side variables for all the cells; masko represents all the right-side variables for all the cells. The algorithm to update the two kinds of variables on the CM-200 is

    forall (i = 2 : 2m+2 : 2) maske(i, :) = .true.
    forall (i = 3 : 2m+2 : 2) masko(i, :) = .true.

    where (maske)
       q2 = q + t3
    endwhere
    where (masko)
       q2 = q
    endwhere

Notice that during the update of the variables for the right side of each cell, there is no communication between the processors.

2. Solving the block diagonal system at the coarsest level. When we get to the coarsest level, we end up with a block diagonal system that involves only the last cell, as shown in (3.4.11). The inverse of the matrix D_m is calculated analytically in (3.4.14) and is represented in the following algorithm as in (3.6.1). To update the variables at this level, we can use maske and masko (defined previously). Since the first n rows of the system \underline\psi_m = D_m^{-1} \underline f^k_m represent the variables at the left side of cell m, we use maske to update these variables. Recall that the vector \underline f^k_m represents the right-hand side of the matrix system at level k = log_2 m. The algorithm for the left-side variable update of each cell is
    t = eoshift(f, 1, 1)
    where (maske)
       \psi = d11 f + d12 t
    endwhere

The algorithm for the right-side variable update of each cell is

    t = eoshift(f, 1, -1)
    where (masko)
       \psi = d21 t + d22 f
    endwhere

Notice that the shifts on the coarsest grid are both unit shifts.

3. Interpolation. Interpolation for level k is accomplished through the application of (3.4.12) and (3.4.13) to the variables of level k+1, or through the use of the matrix equation shown in (3.4.16). The block matrices necessary for the interpolation step are D_i^{-1} and D_i^{-1} F_i^k. The matrix D_i^{-1} is represented as in (3.6.1); the matrix D_i^{-1} F_i^k is then given by

    D_i^{-1} F_i^k = [ 0  d11 fe_i^k ]
                     [ 0  d21 fe_i^k ]

where fe_i^k is the only nonzero n x n block submatrix of F_i^k. Thus, on the CM-200 the interpolation algorithm for the variables at the left side of each cell is

    t0 = d11 * f
    t1 = d12 * f
    t2 = d11 * fe
    \delta2 = 2\delta + 1
    where (maske)
       \psi = t0 + eoshift(t1, 1, 1) + eoshift(t2, 1, 1) * eoshift(\psi, 1, -\delta2)
    endwhere

For all right-side cell variables we have

    t0 = d21 * f
    t1 = d22 * f
    t2 = d21 * fe
    where (masko)
       \psi = eoshift(t0, 1, -1) + t1 + t2 * eoshift(\psi, 1, -\delta2)
    endwhere

3.7 Stability of the Cyclic Reduction.

The transport sweep is performed by cyclic reduction, as was shown in Section 3.4. In proceeding from level k to level k+1, subvectors of the right-hand side vector are multiplied by F_i^k D_{i-\delta}^{-1}. To prove stability, it is necessary to prove that the elements of the matrices F_i^k at all levels k are bounded. For the positive variables, we have shown that

    F_i^k = [ 0  -2^{2^{k-1}} M_i M_{i-1}^2 \Delta_{i-1}^{-1} \cdots M_{i-(2^{k-1}-1)}^2 \Delta_{i-(2^{k-1}-1)}^{-1} ]
            [ 0                                        0                                                            ],    (3.7.3)

where

    \Delta_i = (E_i + M_i)^2 + M_i^2.    (3.7.4)

We want to prove that

    ||F_i^k|| <= ||F_i^1||    for k = 1, ..., log_2 m.    (3.7.5)
LEMMA 3.7.1. Let M_i and \Delta_i be diagonal matrices, where \Delta_i is given by (3.7.4). Then

    ||M_i^2 \Delta_i^{-1}|| <= 1/2.

PROOF. From the definition of the matrix norm we have

    ||M_i^2 \Delta_i^{-1}|| = max_j \mu_j^2 / ((\hat\sigma_t h_i + \mu_j)^2 + \mu_j^2) <= max_j \mu_j^2 / (2\mu_j^2) = 1/2.

THEOREM 3.7.2 (Stability of the positive variable cyclic reduction). Let F_i^k and F_i^1 be given by (3.4.10) and (3.4.7), respectively. Then ||F_i^k|| <= ||F_i^1|| for k = 1, ..., log_2 m.

PROOF. From the definition of F_i^k, we can see that its norm is equal to the norm of its nonzero n x n block submatrix:

    ||F_i^k|| = 2^{2^{k-1}} || M_i M_{i-1}^2 \Delta_{i-1}^{-1} M_{i-2}^2 \Delta_{i-2}^{-1} \cdots M_{i-(2^{k-1}-1)}^2 \Delta_{i-(2^{k-1}-1)}^{-1} ||.

We thus have

    ||F_i^k|| <= 2^{2^{k-1}} ||M_i|| (1/2)^{2^{k-1}-1} = 2 ||M_i|| = ||F_i^1||.

THEOREM 3.7.3 (Stability of the negative variable cyclic reduction). Let F_i^k and F_i^1 be given by (3.4.23) and (3.4.20), respectively. Then ||F_i^k|| <= ||F_i^1|| for k = 1, ..., log_2 m.
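Theorem 3.7.2 can be spot-checked numerically in the scalar (n = 1), uniform-grid case, taking \Delta = (\hat\sigma_t h + \mu)^2 + \mu^2 as in (3.7.4) so that \mu^2/\Delta <= 1/2 (Lemma 3.7.1); all names here are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.uniform(0.1, 1.0, size=8)       # entries of the diagonal matrix M
sh = rng.uniform(0.0, 2.0, size=8)       # shifted sigma_t * h, nonnegative
Delta = (sh + mu)**2 + mu**2             # (3.7.4); guarantees mu**2 / Delta <= 1/2
for k in range(1, 5):
    p = 2**(k - 1) - 1                   # number of M^2 Delta^{-1} factors in F^k
    Fk = 2.0**(2**(k - 1)) * mu * (mu**2 / Delta)**p
    # entrywise version of ||F^k|| <= ||F^1|| = 2 ||M||
    assert np.all(Fk <= 2.0 * mu + 1e-12)
```

Each extra level doubles the leading power of 2 but contributes at least as many factors bounded by 1/2, which is exactly the cancellation the proof exploits.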
PROOF. From the definition of F_i^k, we can see that its norm is equal to the norm of its nonzero n x n block submatrix:

    ||F_i^k|| = 2^{2^{k-1}} || M_i M_{i+1}^2 \Delta_{i+1}^{-1} \cdots M_{i+(2^{k-1}-1)}^2 \Delta_{i+(2^{k-1}-1)}^{-1} ||.

We thus have

    ||F_i^k|| <= 2^{2^{k-1}} ||M_i|| (1/2)^{2^{k-1}-1} = 2 ||M_i|| = ||F_i^1||.

3.8 Experimental Results.

In this section we first compare the convergence rates for the two kinds of angular multigrid cycles used in this chapter: V(1,0) and V(1,1). Then we compare the timings of one V(1,0) cycle and one V(1,1) cycle to show which gives the better convergence rate when the difference in time is taken into account. We also show how these algorithms behave when m is increased and when n is increased. The convergence factors are measured for the forward-peaked Fokker-Planck scattering model. Table 3.1 shows the convergence rates for m = 4 and various values of n. The convergence rates for V(1,0) (\rho_{1,0}) and V(1,1) (\rho_{1,1}) increase as n increases, but appear to reach an upper bound for n >= 32 (\rho_{1,0} = .57 and \rho_{1,1} = .34). This agrees with the Fourier analysis in [14], which predicts \rho <= .36 for the V(1,1) cycle. As we expected, the convergence rates for V(1,1) are approximately the square of the convergence rates of V(1,0) (i.e., \rho_{1,1} \approx \rho_{1,0}^2) for the various values of N. The two
Table 3.1. Convergence rates for the V(1,1) and V(1,0) angular multigrid cycles for constant m (m = 4) and various values of N (S_N cases).

      n    V(1,0)   V(1,1)
      2     0.32     0.12
      4     0.47     0.22
      8     0.54     0.29
     16     0.56     0.33
     32     0.57     0.34
     64     0.57     0.34
    128     0.57     0.34
Table 3.2. Timings in seconds for the V(1,1) and V(1,0) angular multigrid cycles for constant m (m = 4) and various values of n (S_N cases, N = 2n) on the CM-200 and Cray Y-MP.

      n    V(1,0) Cray   V(1,1) Cray   V(1,0) CM   V(1,1) CM
      2       0.023         0.024         0.34        0.23
      4       0.062         0.065         0.37        0.36
      8       0.180         0.188         0.56        0.62
     16       0.665         0.694         0.74        1.01
     32       2.872         2.973         1.35        2.02
     64      14.780        15.102         2.68        3.90
    128      87.920        88.593         5.07        8.73
Table 3.3. Timings in seconds for the V(1,1) and V(1,0) angular multigrid cycles for constant m (m = 64) and various values of n (S_N cases, N = 2n) on the CM-200 and Cray Y-MP.

      n    V(1,0) Cray   V(1,1) Cray   V(1,0) CM-2   V(1,1) CM-2
      2       0.335         0.363         0.51          0.72
      4       0.590         0.671         0.69          1.09
      8       0.922         1.093         0.94          1.71
     16       1.943         2.320         1.61          2.76
     32       5.768         6.790         2.83          5.74
     64      22.993        26.010         5.17         10.32
    128     114.980       125.210        12.77         25.42
Table 3.4. Comparison between the V(1,1) and V(1,0) anisotropic codes on the CM-200 for various values of n (S_N cases, N = 2n); \rho_{1,0} is raised to the power t_{11}/t_{10} so that both rates refer to equal computing time.

      n    \rho_{1,0}^{t_{11}/t_{10}}   \rho_{1,1}
      2              0.19                  0.12
      4              0.46                  0.22
      8              0.50                  0.29
     16              0.45                  0.33
     32              0.43                  0.34
     64              0.44                  0.34
    128              0.43                  0.34
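The first column of Table 3.4 can be reproduced from Tables 3.1 and 3.2: raising \rho_{1,0} to the power t_{11}/t_{10} expresses both cycles' error reduction per equal amount of CM-200 time. For the n = 32 row:

```python
rho10, rho11 = 0.57, 0.34    # n = 32 convergence rates (Table 3.1)
t10, t11 = 1.35, 2.02        # n = 32 CM-200 timings in seconds (Table 3.2)

rho10_equiv = rho10 ** (t11 / t10)
print(round(rho10_equiv, 2))             # 0.43, the Table 3.4 entry
assert rho11 < rho10_equiv               # V(1,1) reduces error faster per unit time
```

Since 0.34 < 0.43, the V(1,1) cycle wins even after paying for its extra relaxation sweep.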
Figure 3.6: Timings in seconds for a V(1,1) cycle, varying the number of angles n (2 to 128) with m = 4, for the CM-200 and the Cray Y-MP.
Figure 3.7: Timings in seconds for a V(1,1) cycle, varying the number of angles n (2 to 128) with m = 64, for the CM-200 and the Cray Y-MP.
methods would be equivalent if a V(1,1) cycle took twice as long as a V(1,0) cycle. Tables 3.2 and 3.3 show the timings on the Cray Y-MP and CM-200 for m = 4 and m = 64, respectively. If we consider the parallel timings on the CM-200 for V(1,1) (t_{11}) and V(1,0) (t_{10}), we can use the time for V(1,1) as the standard unit and compare the convergence rates of the two algorithms (Table 3.4). It is obvious from this table that V(1,1) is superior to V(1,0) on the CM-200. The same conclusions hold for our sequential codes on the Cray Y-MP. This computer has vector registers of length 64 (64-bit words); for our timings we used only one of the 8 processors available on the Cray Y-MP. Figures 3.6 and 3.7 show how much faster the CM-200 V(1,1) codes are than the Cray Y-MP sequential version. In general, our algorithms behaved as expected. When N increases, we notice that, even though the number of levels of the angular multigrid scheme is log_2 n, the timings for the whole algorithm are proportional to N. This is caused mostly by the T^{-1} and T multiplications, which were shown to take N steps in Section 3.5. This means that whenever N is doubled, the timings on the CM-200 at least double. This is exactly what is shown in Tables 3.2 and 3.3. Of course, if we consider the same algorithm sequentially, we should expect the time spent to be at least proportional to N^2. This can be observed in the same tables for the sequential timings on the Cray Y-MP. For the same number of angles, as m increases the convergence rates do not change considerably (Table 3.5), while the timings increase, due mainly to the cyclic reduction step. These increases are proportional to 2 log_2 m for the V(1,0) cycle and 4 log_2 m for the V(1,1) cycle. The same happens at the coarsest level when the parallel multigrid isotropic code is used, as shown earlier in Chapter 2.
The behavior when we increase m is shown in Table 3.6.
Table 3.5. Convergence rates for the V(1,0) and V(1,1) angular multigrid cycles for N = 8 and various values of m.

      m    V(1,0)   V(1,1)
      4     0.53     0.22
     16     0.53     0.28
     32     0.53     0.28
     64     0.53     0.28

Table 3.6. Timings in seconds for the V(1,0) and V(1,1) angular multigrid cycles for various values of m and N = 8 on the CM-200.

      m    V(1,1)   V(1,0)
      4     0.54     0.53
      8     0.54     0.45
     16     0.63     0.56
     32     0.76     0.72
     64     1.14     1.46
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. I I CHAPTER4 CONCLUSIONS In this section, we summarize our results for the parallel multilevel methods developed in this thesis. Efficient parallel algoritluns for the neutron transport equations were developed. In Chapter 2 the isotropic case was considered. A multigrid scheme was used in which various types of relaxation were considered. Jacobi V(1, 1), Jacobi V(2, 2), and redblack GaussSeidel V{1,1) pline twocell relaxations were implemented and analyzed. Mter an evaluation of CM2 timings, we concluded that Jacobi V(1, 1) is the best method for this type of architecture. We also showed, through comparisons with a sequential implementation on the Cray YMP, that the parallel algorithms are much faster than the sequential ones and consequently represent an important step towards the development of parallel algorithms for these equations. Some steps of the algoritlun could be made even more efficient with the use of MIMD architectures like the CM5. In Chapter 3 we developed an algorithm for solving the anisotropic transport equation. The sequential angular multigrid algorithm was developed before by Morel and Manteuffel (14] and contains a step that is traditionally sequential. In this thesis this step was parallelized by using cyclic reduction. Depending on the way we write the matrix equations, this process can be stable or not. We wrote these equations in a convenient way, and proved their stability. Cyclic reduction
is a well-known algorithm for solving linear systems of equations and is available in some software libraries. These libraries, however, are written for generic matrix equations and do not exploit the specific matrix structure that we have. This makes our development of parallel cyclic reduction important as an example of approaches for parallel computers that pay special attention to the structure of the problem at hand. The steps we used to create an efficient parallel cyclic reduction algorithm can also be applied when cyclic reduction appears in other contexts. We improved our sequential versions, with a few compiler directives and small changes, towards more vectorizable codes, but we do not claim to have used a completely vectorized implementation in our measurements. From our experimental results we conclude that the V(1,1) cycle is, in general, preferable to the V(1,0) cycle on the CM200. Notice that both of these algorithms are faster than their sequential versions; our parallel timings behaved like O(N), while our sequential codes behaved like O(N^2). This behavior was dictated by the transformations from the flux domain to the moment domain and vice versa, which are accomplished through multiplications by N x N matrices. These matrix multiplications represent the numerical tools used to obtain a Legendre expansion of a function f (multiplication by T) and the evaluation of a function at N nodes, given an N-term Legendre expansion (multiplication by T^-1). The sequential steps for these transforms are known to be O(N^2); recently a fast algorithm was developed in [1] which takes O(N log N) sequential steps. However, this algorithm requires evaluation at the Chebyshev points in [-1, 1] rather than the Gauss quadrature points, which would imply that the transformation between the moment and flux representations is not exact.
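The cyclic reduction idea can be illustrated on a scalar tridiagonal system. This is only a sketch for illustration: the thesis uses a block version tailored to the transport matrices, which is not reproduced here. Even-indexed unknowns are eliminated level by level, halving the active system until one unknown remains, after which the solution is filled back in; with one processor per unknown, each level is one parallel step, giving logarithmic depth.

```python
import numpy as np

def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system of size n = 2**p - 1 by cyclic reduction.

    a: sub-diagonal (a[0] unused, set to 0), b: diagonal,
    c: super-diagonal (c[-1] unused, set to 0), d: right-hand side.
    """
    a, b, c, d = (np.asarray(v, dtype=float).copy() for v in (a, b, c, d))
    n = len(d)
    p = int(np.log2(n + 1))
    # Forward phase: each active row eliminates its neighbors at distance h.
    for lvl in range(p - 1):
        h, s = 2 ** lvl, 2 ** (lvl + 1)
        for i in range(s - 1, n, s):       # these rows run in parallel on a SIMD machine
            alpha = -a[i] / b[i - h]
            beta = -c[i] / b[i + h]
            b[i] += alpha * c[i - h] + beta * a[i + h]
            d[i] += alpha * d[i - h] + beta * d[i + h]
            a[i] = alpha * a[i - h]
            c[i] = beta * c[i + h]
    # Back-substitution phase: solve from the coarsest level down.
    x = np.zeros(n)
    for lvl in range(p - 1, -1, -1):
        h, s = 2 ** lvl, 2 ** (lvl + 1)
        for i in range(h - 1, n, s):
            left = x[i - h] if i - h >= 0 else 0.0
            right = x[i + h] if i + h < n else 0.0
            x[i] = (d[i] - a[i] * left - c[i] * right) / b[i]
    return x
```

In a SIMD setting the inner loops over i run concurrently, one cell per processor, so the depth is O(log2 n) levels rather than O(n) sequential sweeps; the stability question for the block variant is the subject of Section 3.7.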
Our CM2 algorithm for the Legendre transforms is O(N), and could be O(1) on a MIMD computer. It represents an improvement that can be used in many applications, including algorithms involving quadratures, approximation theory, and the solution of PDEs.

In this thesis we dealt with pure scattering, non-absorbing cases (γ = σ_s/σ_t = 1). Recent work has been done for isotropic scattering with various levels of absorption [11] in slab geometry. Currently we are working on a parallel version of the anisotropic model with absorption. Our final remark is that all the work developed here was for the one-dimensional transport equations; the same kind of algorithms can be applied to two-dimensional and three-dimensional transport equations, and they should maintain their superiority over sequential algorithms. In addition to choosing very accurate methods when using parallel computers, we should search for algorithms that take advantage of the kind of architecture we are using. The best algorithms thus will represent a compromise between accuracy and time efficiency, will be architecture dependent, and will evolve as new architectures are designed.
APPENDIX

The inverse of the matrix E in (3.21) is given by

(A.0.1)

where

(A.0.2)

(A.0.3)

(A.0.4)

(A.0.5)

(A.0.6)

(A.0.7)

(A.0.8)
(A.0.9)

(A.0.10)

e32 = {[d1 d2 - (c2 b1 + a1 a2)](-b2) + (a22 b1 + b2 a1)(-a2)} / det2     (A.0.11)

e33 = [d1 d2 - (c2 b1 + a1 a2)](d1) / det2     (A.0.12)

e34 = (a22 b1 + a1 b2)(d1) / det2     (A.0.13)

(A.0.14)

e42 = {(c2 a11 + a2 c1)(-b2) + [d1 d2 - (a22 a11 + b2 c1)](-a2)} / det2     (A.0.15)

e43 = (c2 a11 + a2 c1)(d1) / det2     (A.0.16)

e44 = [d1 d2 - (a11 a22 + b2 c1)](d1) / det2     (A.0.17)

det1 = [d1 d2 - (c1 b2 + a1 a2)][d1 d2 - (a11 a22 + b1 c2)] - (c1 a22 + a1 c2)(a11 b2 + b1 a2),     (A.0.18)

and

(A.0.19)
BIBLIOGRAPHY

[1] Bradley K. Alpert and Vladimir Rokhlin, "A Fast Algorithm for the Evaluation of Legendre Expansions", SIAM Journal on Scientific and Statistical Computing, 12, No. 1, pp. 158-179, January 1991.

[2] Kenneth M. Case and P. F. Zweifel, "Linear Transport Theory", Reading, Mass.: Addison-Wesley, 1967.

[3] Carlo Cercignani, "The Boltzmann Equation and Its Applications", New York, NY: Springer-Verlag, 1988.

[4] J. J. Duderstadt and W. R. Martin, "Transport Theory", New York, NY: John Wiley and Sons, 1979.

[5] V. Faber and T. A. Manteuffel, "Neutron Transport from the Viewpoint of Linear Algebra", Invariant Imbedding and Equations (Nelson, Faber, Manteuffel, Seth, and White, eds.), Lecture Notes in Pure and Applied Mathematics, 115, pp. 37-61, Marcel Dekker, April 1989.

[6] Gene H. Golub and C. F. Van Loan, "Matrix Computations", Baltimore and London: The Johns Hopkins University Press, 1989.

[7] M. S. Lazo and J. E. Morel, "A Linear Discontinuous Galerkin Approximation for the Continuous Slowing Down Operator", Nucl. Sci. Eng., 92, p. 98, 1986.

[8] E. W. Larsen and J. E. Morel, "Asymptotic Solutions of Numerical Transport Problems in Optically Thick Diffusive Regimes II", J. Comp. Phys., 83, p. 212, 1989.

[9] E. E. Lewis and W. F. Miller, "Computational Methods of Neutron Transport", New York, NY: John Wiley and Sons, 1984.

[10] T. Manteuffel, S. McCormick, J. Morel, S. Oliveira, and G. Yang, "Fast Multigrid Solver for Transport Problems", submitted to Nucl. Sci. Eng.

[11] T. Manteuffel, S. McCormick, J. Morel, and G. Yang, "Fast Multigrid Solver for Transport Problems with Absorptions", submitted to Nucl. Sci. Eng.

[12] T. Manteuffel, S. McCormick, J. Morel, S. Oliveira, and G. Yang, "Parallel Multigrid Methods for Transport Equations", submitted to SIAM Journal on Scientific and Statistical Computing.

[13] J. E. Morel and E. W. Larsen, "A New Class of SN Spatial Differencing Schemes", Nucl. Sci. Eng., 1988.

[14] J. E. Morel and T. A. Manteuffel, "An Angular Multigrid Acceleration Technique for the SN Equations with Highly Forward-Peaked Scattering", Nucl. Sci. Eng., 107, pp. 330-342, 1991.

[15] R. Sweet, "A Generalized Cyclic-Reduction Algorithm", SIAM Journal on Numerical Analysis, Vol. 11, pp. 506-520, 1974.

[16] P. Wesseling, "An Introduction to Multigrid Methods", Chichester, England: John Wiley and Sons, 1992.

[17] G. Milton Wing, "An Introduction to Transport Theory", New York, NY: John Wiley and Sons, 1962.
angular level. The shifted transport sweep requires 2m steps. The parallel algorithm uses cyclic reduction, which entails 2 log2 m steps at each angular level. At several stages of the algorithm, we need to perform Legendre transforms. An algorithm for these transforms is developed that takes O(N) steps. Applying appropriate shifts to the cross-sectional moments at each angular level results in an isotropic scattering equation at the coarsest level. For this problem the CM2 isotropic scattering parallel multigrid solver is used. The parallel anisotropic method was also implemented on the CM200, an upgraded version of the CM2, and the timings and convergence rates for this implementation are presented. The complexity of the anisotropic parallel algorithm is O(N log2 m log2 N).

This abstract accurately represents the content of the candidate's thesis. I recommend its publication.

Signed
ACKNOWLEDGMENTS

I wish to express my appreciation to the following individuals: my advisor, Professor Tom Manteuffel, for his financial and academic support during the last years of my Ph.D. studies; Professors Roland Sweet and Bill Briggs, who brought me into the program, gave me initial financial support, and provided me with my first opportunity to use parallel computers; and Jennifer Ryan for her financial support of a year and a half and for giving me the opportunity to solve real-world problems for US West. I am also appreciative of the support given to me by the remainder of the committee: Professors Steve McCormick, Tom Russell, John Ruge and Gita Alaghband. Part of this work was performed at Los Alamos National Laboratory during my stay there of 10 months spread over the last two years. From there, I would like to thank Dr. Jim Morel for his experience and ideas and Dr. James M. Hyman for his support and friendship, as well as the Center for Nonlinear Studies and the Advanced Computing Laboratory for use of their facilities. Last, but not least, I would like to thank Victor Bandy, my friends in Denver and Los Alamos for their friendship and support, the rest of my family back in Brazil, and especially my father, who always considered learning important.
CONTENTS

CHAPTER
1 INTRODUCTION 1
  1.1 History. 1
  1.2 Model Problem. 2
  1.3 Notation. 7
  1.4 Summary of Results. 7
  1.5 The CM2 and CM200. 9
2 PARALLEL MULTIGRID FOR THE ISOTROPIC MODEL 11
  2.1 Introduction. 11
  2.2 The Modified Linear Discontinuous Scheme. 13
  2.3 Two-cell Inversion. 18
  2.4 Multigrid. 21
  2.5 Storage and Data Structure. 24
  2.6 Relaxation in Parallel. 27
  2.7 Multigrid in Parallel. 31
  2.8 Discretization in Edge-Edge Notation. 32
  2.9 Convergence Analysis. 35
  2.10 Numerical Results. 42
  2.11 Timings. 49
  2.12 Conclusions. 56
3 PARALLEL MULTIGRID FOR ANISOTROPIC SCATTERING 58
  3.1 Introduction. 58
  3.2 Angular Multigrid. 63
    3.2.1 General description. 63
    3.2.2 The V(1,0) cycle. 64
    3.2.3 The V(1,1) cycle. 69
  3.3 Discretization. 70
  3.4 Source Iteration in Parallel. 71
  3.5 Angular Multigrid on the CM200. 84
    3.5.1 Storage. 84
    3.5.2 Algorithm. 85
  3.6 Cyclic Reduction on the CM200. 90
    3.6.1 Storage and Initialization. 90
    3.6.2 Algorithm for Cyclic Reduction on a SIMD Computer. 92
  3.7 Stability of the Cyclic Reduction. 95
  3.8 Experimental Results. 97
4 CONCLUSIONS 106
APPENDIX 109
BIBLIOGRAPHY 111
FIGURES

FIGURE
2.1 Computational grid. 10
2.2 Red-black relaxation for 4 cells. 11
2.3 Fine and coarse grid. 15
2.4 Computational mesh for four cells. 18
2.5 Group of cells for Fourier analysis. 24
2.6 Timings for a V(1,1) cycle for small m and n = 1. 36
2.7 Timings for a V(1,1) cycle for large m and n = 1. 36
2.8 Timings for a V(1,1) cycle varying n and m = 512. 38
3.1 Computational grids for the first and second level of angular multigrid. 45
3.2 Coarsening of cells during cyclic reduction for the positive variables. 55
3.3 Coarsening of cells during cyclic reduction for the negative variables. 60
3.4 Data structure for a particular spatial grid point and for different angular levels, showing the fluxes and the Legendre moments. 64
3.5 Data structure for Legendre polynomials for a particular spatial grid point and for two angular levels starting with N = 4. 65
3.6 Timings for a V(1,1) cycle varying n and m = 4. 77
3.7 Timings for a V(1,1) cycle varying n and m = 64. 78
TABLES

TABLE
2.1 Convergence factors from the Fourier analysis (FA) prediction and results for Jacobi relaxation for various values of σ_t h (S4 case). 43
2.2 Convergence factors from the Fourier analysis (FA) prediction and results for red-black Gauss-Seidel relaxation for various values of σ_t h (S4 case). 44
2.3 Worst-case convergence factor for multigrid with Jacobi V(1,1) relaxation, Fourier analysis (FA), and CM2 results (for m = 512). 45
2.4 Worst-case convergence factor for multigrid with red-black Gauss-Seidel relaxation, Fourier analysis, and CM2 results (for m = 512). 46
2.5 Convergence factors for nonuniform cells (Jacobi and red-black Gauss-Seidel relaxation on the CM2), 1 ≤ h ≤ 100, and σ_t = 1. 46
2.6 Convergence factors for the Jacobi (V(1,1) and V(2,2)) and red-black Gauss-Seidel (V(1,1)) relaxations, for various values of σ_t h. S4 case and m = 512. 47
2.7 Convergence rates for Jacobi (V(1,1) and V(2,2) cycles) and red-black (V(1,1)) on the CM2. S4 case and σ_t h = 0.1. 48
2.8 Convergence factors for the Jacobi (V(1,1) and V(2,2)) and red-black Gauss-Seidel (V(1,1)) relaxations, for various values of σ_t h. S4 case and m = 512 (nonuniform code). 48
2.9 Timings for Jacobi (V(1,1) and V(2,2)) and red-black Gauss-Seidel (V(1,1)) on the CM2 and Cray YMP. S4 case, σ_t h = 0.1. 54
2.10 Timings for Jacobi (V(1,1) and V(2,2)) and red-black Gauss-Seidel (V(1,1)) on the CM2 and Cray YMP. S4 case, σ_t h = 0.1 (nonuniform code). 54
2.11 Comparison between Jacobi V(1,1), Jacobi V(2,2), and red-black convergence factors (ρ_j1, ρ_j2, and ρ_rb) for various values of σ_t h (S4 case). 55
3.1 Convergence rates for the V(1,1) and V(1,0) angular multigrid cycles for constant m (m = 4) and various values of N (SN cases). 98
3.2 Timings in seconds for the V(1,1) and V(1,0) angular multigrid cycles for constant m (m = 4) and various values of n (SN cases, N = 2n) on the CM200 and Cray YMP. 99
3.3 Timings in seconds for the V(1,1) and V(1,0) angular multigrid cycles for constant m (m = 64) and various values of n (SN cases, N = 2n) on the CM200 and Cray YMP. 100
3.4 Comparison between V(1,1) and V(1,0) anisotropic codes on the CM200 for various values of n (SN cases, N = 2n). 101
3.5 Convergence rates for the V(1,0) and V(1,1) angular multigrid cycles for N = 8 and various values of m. 105
3.6 Timings in seconds for the V(1,0) and V(1,1) angular multigrid cycles for various values of m and N = 8 on the CM200. 105
CHAPTER 1
INTRODUCTION

1.1 History.

The Boltzmann equation has formed the basis for the kinetic theory of gases since it was established by Ludwig Boltzmann in 1872, and has proved to be useful in the study of electron transport in solids and plasmas, neutron transport in nuclear reactors, photon transport in superfluids, and radiative transfer in planetary and stellar atmospheres. This thesis treats numerical methods for the solution of the neutron transport equations in the form of the linear Boltzmann equation. Modern transport codes are generally based on either Monte Carlo or discrete ordinate methods. Monte Carlo methods are stochastic in nature and are based upon the direct simulation of the particle transport process. As the number of particles in the simulation is increased, the statistical accuracy of the resulting solution is improved. This method is well-suited to three-dimensional problems with complex geometries, but it is often expensive if differential rather than integral quantities are desired. Discrete ordinate methods are based upon the use of finite difference and quadrature integration techniques to reduce the analytic transport equation to a matrix equation. For one-dimensional problems the discrete ordinate method can be orders of magnitude faster than Monte Carlo. In two dimensions neither method has clear superiority. In three dimensions Monte Carlo has thus far seen more use. Due to the randomness of the error magnitude in Monte Carlo methods, discrete ordinate methods are generally preferred. The latter, however, may be difficult to generalize for complex multidimensional geometries. By utilizing the properties of massively parallel computers it may be possible to develop discrete ordinate equations for these geometries. The focus of this dissertation is the development of efficient parallel algorithms for the transport equations on SIMD computers using discrete ordinate methods. The use of massively parallel computers represents a step towards the development of algorithms that will generate solutions for the transport equations more accurately and faster.

1.2 Model Problem.

The transport of neutrally charged particles can be modeled by the linear Boltzmann transport equation, which is derived in this section. Consider a beam of particles colliding with a target. Let N(y, Ω, E, t) represent the density of particles at a position y, moving in direction Ω (a unit vector), with energy E, at time t. Suppose the volume analyzed has cross-sectional area ΔA and length Δy, and that the origin is at y. Requiring particle balance, we can write the following equation,

IP = -LS - LC + PE,

where IP = increase in N, LS = losses from streaming, LC = losses from collisions, and PE = particle sources. For the infinitesimal element of volume and particles with energy and direction in the intervals [E, E + δE] and [Ω, Ω + δΩ], respectively, we have
[N(y, Ω, E, t + Δt) - N(y, Ω, E, t)] Δy ΔA δE δΩ =
    -[N(y + Δy, Ω, E, t) - N(y, Ω, E, t)] ΔA v Δt δE δΩ
    - σ_t(y, E) N(y, Ω, E, t) Δy ΔA v Δt δE δΩ
    + (Q(y, Ω, E, t) + S(y, Ω, E, t)) Δy ΔA v Δt δE δΩ,

where v is the particle speed (a function of the energy). Dividing by Δy ΔA δE δΩ Δt and taking the limits as Δt and Δy tend to zero, we may replace the first and the second terms by derivatives with respect to t and y, respectively, yielding

∂N/∂t = -∂(vN)/∂y - σ_t v N + S(y, Ω, E, t) + Q(y, Ω, E, t),   (1.2.1)

where σ_t dz represents the expected number of interactions (absorptive or scattering) that a particle will have in traveling a distance dz. Assuming that Δy is in the direction of Ω, we can express the directional derivative ∂/∂y as Ω·∇. Thus, the balance equation can be written as

∂N/∂t = -Ω · v∇N - σ_t v N + S(y, Ω, E, t) + Q(y, Ω, E, t).   (1.2.2)

The first term on the right-hand side represents particle loss due to streaming, the second term represents loss due to scattering or absorption, and the third term represents sources. For steady-state problems, (1.2.2) simplifies to

Ω · ∇ψ + σ_t ψ = S(y, Ω, E) + Q(y, Ω, E),   (1.2.3)

where ψ(y, Ω, E) = v N(y, Ω, E) is the particle flux. The source S(y, Ω, E) contains the particles that are scattered from every other direction Ω' to direction Ω and from energy E' to energy E. This can be modeled as

S(y, Ω, E) = σ_s ∫ dE' ∫ dΩ' Σ_s(y, Ω' · Ω, E' → E) ψ(y, Ω', E'),
where Σ_s is the probability that a particle will be scattered from Ω' → Ω and E' → E. In most applications, it is appropriate to discretize in energy to form the multigroup equations [9]. Let ψ^k represent the flux with energy E_k. Then we have

(1.2.4)

where σ_s dx is the expected number of collisions that result in a scattering with energy E_k along a path of length dx. The source term Q includes particles scattered with energy E_k from other energy groups. The solution of the single-group equation

(1/σ_t) Ω · ∇ψ + ψ = γ ∫ dΩ' Σ_s(y, Ω · Ω') ψ(y, Ω', E) + (1/σ_t) Q(y, Ω, E),   (1.2.5)

where γ = σ_s/σ_t, is an important step in the solution of the multigroup equations [9]. When σ_t → ∞ and γ → 1, equation (1.2.5) becomes singularly perturbed. In fact, it is often the case that this equation is nearly singular in all spatial frequencies of interest. This fact causes difficulties for numerical solution techniques, and multigrid in particular. Consider equation (1.2.5) in a semi-infinite slab of width b in the x-dimension. Assuming that the solution is constant in y and z, then ∂ψ/∂y = ∂ψ/∂z = 0. Equation (1.2.5) becomes

μ ∂ψ/∂x + σ_t ψ = (σ_s/2) ∫_{-1}^{1} dμ' Σ_s(x, μ' → μ) ψ(x, μ') + q(x, μ),   (1.2.6)

where μ = cos(θ) and θ is the angle between Ω and the x-axis. For the isotropic scattering model, Σ_s is constant in angle; that is, a particle is equally likely to scatter into any direction. Isotropic scattering yields the equation

μ ∂ψ/∂x + σ_t ψ = (σ_s/2) ∫_{-1}^{1} ψ(x, μ') dμ' + q(x, μ),   (1.2.7)
for x ∈ (0, b) and μ ∈ [-1, 1]. Here, ψ(x, μ) represents the flux of particles at position x traveling at an angle θ = arccos(μ) from the x-axis. The quantities σ_s and σ_t were already defined, and the quantity σ_a = σ_t - σ_s represents the expected number of absorptive interactions. The boundary conditions prescribing particles entering the slab are

ψ(0, μ) = g_0(μ),   ψ(b, -μ) = g_1(μ),   μ ∈ (0, 1).   (1.2.8)

Now we consider anisotropic scattering within a single energy group. Assume the physical domain to be a semi-infinite slab. Anisotropic scattering is modeled under the assumption that the probability of scattering from direction Ω' into direction Ω is a function of the angle between these directions (Ω · Ω'). This leads to scattering source terms of the form

S(x, Ω) = σ_s ∫ dΩ' Σ_s(Ω' · Ω) ψ(x, Ω'),

which are integral operators with spherical harmonics as their singular vectors. We assume that the variation in angle can be described by a finite collection of singular vectors, resulting in an SN discretization. In one-dimensional problems, the singular vectors of these integral operators are Legendre polynomials. Let P_0(μ), ..., P_{N-1}(μ) be the Legendre polynomials and σ_0, ..., σ_{N-1} be the associated singular values. If we express the flux ψ(x, μ) as a finite summation of the first N Legendre polynomials, we have

ψ(x, μ) = Σ_{l=0}^{N-1} (2l + 1) φ_l(x) P_l(μ),   (1.2.9)

where (2l + 1) is a normalization factor and the moments φ_l are

φ_l(x) = (1/2) ∫_{-1}^{1} P_l(μ) ψ(x, μ) dμ,   (1.2.10)
which can be written exactly as

φ_l(x) = (1/2) Σ_{k=1}^{N} w_k P_l(μ_k) ψ(x, μ_k),   (1.2.11)

for l = 0, ..., N-1, where μ_1, μ_2, ..., μ_N are the Gauss quadrature points and w_1, w_2, ..., w_N are the Gauss quadrature weights of degree N. The scattering term in one dimension then becomes

S_s(x, μ) = σ_s ∫_{-1}^{1} dμ' Σ_s(x, μ' → μ) ψ(x, μ') = Σ_{l=0}^{N-1} (2l + 1) σ_l φ_l(x) P_l(μ).   (1.2.12)

Notice that equations (1.2.9), (1.2.11), and (1.2.12) can be written, for all the quadrature points, in the matrix notation

ψ(x) = T^{-1} φ(x),   φ(x) = T ψ(x),   and   S(x) = T^{-1} D φ(x),

respectively. The vector ψ(x) is equal to (ψ(x, μ_1), ψ(x, μ_2), ..., ψ(x, μ_N))^T, and the vector φ(x) = (φ_0(x), φ_1(x), ..., φ_{N-1}(x))^T represents the Legendre moments. The N x N matrices T and T^{-1} are defined elementwise by (1.2.11) and (1.2.9), respectively, and the N x N diagonal matrix D has diagonal elements D_{i,i} = σ_{i-1}.
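As an illustration (a sketch, not the thesis code; the NumPy routines leggauss and legvander and the 1/2-weighted normalization of (1.2.10)-(1.2.11) are assumed), the matrices T and T^{-1} can be built from the Gauss quadrature data and checked to be exact inverses of each other:

```python
import numpy as np

N = 8
# Gauss-Legendre quadrature points and weights of degree N.
mu, w = np.polynomial.legendre.leggauss(N)
# V[k, l] = P_l(mu_k), the Legendre polynomials evaluated at the nodes.
V = np.polynomial.legendre.legvander(mu, N - 1)

# T maps nodal flux values to Legendre moments, as in (1.2.11):
#   phi_l = (1/2) * sum_k w_k P_l(mu_k) psi_k.
T = 0.5 * V.T * w
# T^{-1} maps moments back to nodal values, as in (1.2.9):
#   psi_k = sum_l (2l + 1) phi_l P_l(mu_k).
T_inv = V * (2.0 * np.arange(N) + 1.0)

# With N Gauss points the quadrature in (1.2.11) is exact for these
# polynomial degrees, so T_inv really is the inverse of T.
print(np.allclose(T_inv @ T, np.eye(N)))  # True
```

The scattering operator of (1.2.12) then acts on a flux vector as ψ ↦ T^{-1} D T ψ, with D = diag(σ_0, ..., σ_{N-1}); the two N x N matrix multiplications per application are the O(N^2) sequential cost discussed in Chapter 4.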
1.3 Notation.

For the most part, the following notational conventions are used in this thesis. Particular usage and exceptions should be clear from context.

A, L, X, ...            Matrices or differential operators.
B_i^k, F_i^k, D_i^k, ... Matrices at level k on row i of a larger matrix (or corresponding to spatial cell i).
N                       Number of Gauss quadrature points.
N_k                     Number of Gauss quadrature points at level k.
n                       Number of positive (or negative) Gauss quadrature points (n = N/2).
P_i                     Legendre polynomial of degree i (i = 0, 1, ..., N-1).
p_i                     Processor i (i = 1, ..., N).
σ_i                     Singular value associated with the Legendre polynomial P_i (i = 0, 1, ..., N-1).
X_{ij}                  The ij-th entry of the matrix X.
q, r, ...               Vectors.
a, e, α, ...            Scalars or functions.
A^k                     Operator or matrix at (grid) level k.
det                     The determinant of a matrix.
SIMD                    Single Instruction Multiple Data computer.
MIMD                    Multiple Instruction Multiple Data computer.

1.4 Summary of Results.

In this dissertation, we develop parallel algorithms for solving the neutron transport equations for the isotropic and anisotropic models. These algorithms are designed for SIMD architectures like the CM2 and take advantage of its architecture
at all steps. In Chapter 2, we develop an efficient parallel algorithm for solving the isotropic transport equation (1.2.7). This algorithm uses the modified linear discontinuous (MLD) spatial discretization method and an SN angular discretization for the scattering term, in which only the first singular value is nonzero. Within a multigrid algorithm, we consider different kinds of relaxation schemes and choose the most appropriate for the SIMD architecture. Proofs of convergence for the algorithms are developed through Fourier analysis. We also compare the CM2 timings to the Cray YMP timings for a sequential version.

In Chapter 3, we develop an efficient algorithm for solving the anisotropic transport equation model. We use an angular multigrid algorithm that requires a shifted source iteration at each level. Performing this iteration with a shifted transport sweep would take 2m steps for a grid with m cells, or (2m + 2) spatial grid points. We develop a block cyclic reduction scheme for the CM200, which is a version of the CM2 with upgraded processors, for this step of the angular multigrid method; it takes 2 log2 m steps. One issue dealt with is the stability of the parallel block cyclic reduction. We can obtain a stable algorithm if we choose to write the system of equations in such a way that the right-hand side remains bounded at all levels of cyclic reduction. The stability of this method is proved. The coarsest level of angular multigrid involves an isotropic S2 equation. We use our parallel multigrid algorithm for the isotropic model to solve this equation. There are other aspects of the angular multigrid that need attention in the development of a parallel angular multigrid, such as calculation of the scattering term. This step is completely parallelized in the spatial direction and partially parallelized in the angular direction. In addition, we attempted to minimize the number of communications in the angular direction.
We compare two kinds of angular multigrid schemes, a V(1,1) cycle and a V(1,0) cycle, and conclude that V(1,1) is superior on the CM200. The algorithms are also compared to sequential implementations on the Cray YMP.

1.5 The CM2 and CM200.

The architecture of the CM2 or CM200 contains 2048 floating-point processors, each with up to 1 MB of memory, a front-end computer, and an interprocessor communications network. Each processor is like a serial computer: it can execute arithmetic and logical instructions, calculate memory addresses, and perform interprocessor communication or I/O. The difference is that the data processors do not fetch instructions from their respective memories. Instead, they are collectively under the control of a single microcoded sequencer. The task of the sequencer is to decode commands from the front end and broadcast them to the data processors, which then execute the same instruction simultaneously. Each processor has its own memory. Each floating-point accelerator consists of two special-purpose VLSI chips for each pair of CM processor chips, a memory interface unit, and a floating-point execution unit. If the size of the parallel variables requires more physical processors than are available, fictitious, or so-called virtual, processors are created. Interprocessor communication is very important in data parallel processing. The CM2 supports three forms of communication among the processors: routing, NEWS, and scanning. These tasks are accomplished in the CM2 by special-purpose hardware. Message passing occurs in parallel. If the grid is assigned a NEWS format, each processor has a north, east, west, and south neighbor. This kind of grid is used to represent our computational grids and is most efficient when the interprocessor communication is between nearest neighbors within a Cartesian grid.
In our implementations, we use intrinsic functions that are combinations of communication
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. and computational operations, such as sum and spread. The sum operation performs an addition for array elements distributed across the processors. The spread intrinsic function distributes data belonging to one processor across other processors in a given direction. 10
CHAPTER 2
PARALLEL MULTIGRID FOR THE ISOTROPIC MODEL

2.1 Introduction.

The focus of this chapter is on a parallel multigrid algorithm for solving the isotropic transport equations in a slab geometry. The spatial discretization scheme used is the MLD finite element method, which represents a lumped version of the standard linear discontinuous (LD) scheme [7]. The parallel algorithm was implemented on the CM2. Convergence rates and timings for the algorithm on this machine and on the Cray YMP are shown. For steady-state problems within the same energy group in the isotropic case (by isotropic we mean that the probability of scattering for the particles is the same for all directions), the transport equation in a slab geometry of slab width b becomes

μ ∂ψ/∂x + σ_t ψ = (σ_s/2) ∫_{-1}^{1} ψ(x, μ') dμ' + q(x, μ),   (2.1.1)

for x ∈ (0, b) and μ ∈ [-1, 1]. Here, ψ(x, μ) represents the flux of particles at position x traveling at an angle θ = arccos(μ) from the x-axis, σ_s dx the expected number of scattering interactions, σ_a = σ_t - σ_s the expected number of absorptive interactions, and q(x, μ) the particle source. The boundary conditions, which describe particles
PAGE 22
entering the slab, are

    ψ(0,μ) = g_0(μ),    ψ(b,−μ) = g_1(μ),    μ ∈ (0,1].    (2.1.2)

This problem is difficult for conventional methods to solve in two cases of physical interest:
1. γ = σ_s/σ_t = 1 (pure scattering, no absorption).
2. σ_t b ≫ 1 (optically dense).
In fact, as σ_t → ∞ and γ → 1, the problem becomes singularly perturbed [10]. Standard discrete approximations to (2.1.1) and (2.1.2) will have operators with condition numbers on the order of at least σ_t² regardless of the mesh size [5]. This phenomenon presents problems for numerical solution techniques in general and multigrid in particular. In this chapter, the discretization used in the spatial dimension is the special finite element method MLD, which we describe in detail in the next section. This scheme behaves well in the thick limit and is very accurate [8]. To solve the linear system of equations, we use a suitable relaxation process, called μ-line relaxation, within a multigrid algorithm that gives convergence rates on the order of σ_t h when σ_t h ≪ 1 and O((σ_t h)^{-1}) when σ_t h ≫ 1. For the two-angle case, it behaves like an exact solver. These convergence rates were proved for special cases in [10] and are substantiated here by Fourier analysis. The Fourier analysis for multigrid based on red-black μ-line relaxation (with numerical results for both Jacobi and red-black relaxation) is shown in Section 2.9. The results show close agreement between the Fourier analysis and the computational results presented here. The main benefit of the SIMD architecture of the CM-2 is attained at the relaxation step of the multigrid algorithm, where we take advantage of the structure of the matrix, matching the nonzero entries of very sparse matrices with the relevant
processors. An outline of the remainder of this chapter is as follows. In Section 2.2 we describe the discretization scheme used. In Section 2.3 we show how to accomplish the inversion of the relaxation matrix in a manner that takes advantage of the SIMD architecture of the CM-2. In Section 2.4 we describe the multigrid scheme and the interpolation and restriction operators used. In Section 2.5 we discuss the storage and data structures. In Section 2.6 we develop the algorithm for the parallel relaxation and a few implementation details. In Section 2.7 we consider a few implementation details of our CM-2 multigrid code. In Section 2.8 we develop the discretization in edge-edge notation. In Section 2.9 we predict convergence properties of the multigrid algorithm through Fourier analysis. In Section 2.10 we report on convergence rates. In Section 2.11 we show the timings obtained with our CM-2 parallel codes and compare them with a sequential version of our algorithm on a Cray Y-MP. Finally, in Section 2.12 we make a few conclusions.

2.2 The Modified Linear Discontinuous Scheme. As described in the introduction, angular discretization is accomplished by expanding the angular dependence in Legendre polynomials. This scheme, known as the S_N approximation when the first N Legendre polynomials are used, results in a semidiscrete set of equations that resemble collocation at N Gauss quadrature points μ_j, j = 1,…,N, with weights w_j, j = 1,…,N. Since the quadrature points and weights are symmetric about zero, we reformulate the problem in terms of the positive values μ_j, j = 1,…,n, where n = N/2. We define ψ_j^+ = ψ(z,μ_j) and ψ_j^− = ψ(z,−μ_j) for j = 1,…,n. The spatial discretization is accomplished by the MLD scheme, which uses elements that are linear across each cell and discontinuous in the upwind direction. In our grid representation, the variables ψ_{i,j}^+ and ψ_{i,j}^− denote
the flux of particles at position z_i in the directions μ_j and −μ_j, respectively. Assuming γ = 1, the nodal equations are

    μ_j (ψ^+_{i+1/2,j} − ψ^+_{i−1/2,j})/(σ_t h_i) + ψ^+_{i,j} = ½ Σ_{k=1}^{n} w_k (ψ^+_{i,k} + ψ^−_{i,k}) + q^+_{i,j},    (2.2.1)

    2μ_j (ψ^+_{i+1/2,j} − ψ^+_{i,j})/(σ_t h_i) + ψ^+_{i+1/2,j} = ½ Σ_{k=1}^{n} w_k (ψ^+_{i+1/2,k} + 2ψ^−_{i,k} − ψ^−_{i−1/2,k}) + q^+_{i+1/2,j},    (2.2.2)

    μ_j (ψ^−_{i−1/2,j} − ψ^−_{i+1/2,j})/(σ_t h_i) + ψ^−_{i,j} = ½ Σ_{k=1}^{n} w_k (ψ^+_{i,k} + ψ^−_{i,k}) + q^−_{i,j},    (2.2.3)

and

    2μ_j (ψ^−_{i−1/2,j} − ψ^−_{i,j})/(σ_t h_i) + ψ^−_{i−1/2,j} = ½ Σ_{k=1}^{n} w_k (ψ^−_{i−1/2,k} + 2ψ^+_{i,k} − ψ^+_{i+1/2,k}) + q^−_{i−1/2,j},    (2.2.4)

j = 1,…,n, i = 1,…,m, with boundary conditions

    ψ^+_{1/2,j} = g_{0,j},    ψ^−_{m+1/2,j} = g_{1,j},    (2.2.5)

j = 1,…,n. In our model, z_{i+1/2} and z_{i−1/2} are cell edges, z_i = ½(z_{i+1/2} + z_{i−1/2}) is the cell center, and h_i = z_{i+1/2} − z_{i−1/2} is the cell width, i = 1,…,m. Equations (2.2.1) and (2.2.3) are called balance equations, and (2.2.2) and (2.2.4) are called edge equations. In block matrix form, equations (2.2.1)–(2.2.4) can be written respectively as

    B_i(ψ^+_{i+1/2} − ψ^+_{i−1/2}) + ψ^+_i = R(ψ^+_i + ψ^−_i) + q^+_i,    (2.2.6)

    2B_i(ψ^+_{i+1/2} − ψ^+_i) + ψ^+_{i+1/2} = R(ψ^+_{i+1/2} + 2ψ^−_i − ψ^−_{i−1/2}) + q^+_{i+1/2},    (2.2.7)

    B_i(ψ^−_{i−1/2} − ψ^−_{i+1/2}) + ψ^−_i = R(ψ^+_i + ψ^−_i) + q^−_i,    (2.2.8)

    2B_i(ψ^−_{i−1/2} − ψ^−_i) + ψ^−_{i−1/2} = R(ψ^−_{i−1/2} + 2ψ^+_i − ψ^+_{i+1/2}) + q^−_{i−1/2},    (2.2.9)
[Figure 2.1: Computational grid — the 2m+1 spatial points 1, 2, 3, …, 2i−1, 2i, 2i+1, …, 2m+1, with cells i and i+1 marked.]

and

    ψ^+_{1/2} = g_0,    ψ^−_{m+1/2} = g_1,    (2.2.10)

for i = 1,…,m. Here,

    B_i = (1/(σ_t h_i)) diag(μ_1, μ_2, …, μ_n)  and  R = ½ 1 w^T,    (2.2.11)

where μ_1, μ_2, …, μ_n are the positive Gauss quadrature points, w_1, w_2, …, w_n are the Gauss quadrature weights, and ψ^±(·) is an n-vector: ψ^+(·) = (ψ_1^+(·), …, ψ_n^+(·))^T. In the computational grid, the inflow for positive angles is on the left of each cell and on the right for the negative angles. For μ-line relaxation, the inflows of each cell are assumed known. This is the same as using the values of these variables from the previous iteration. Figure 2.1 shows the computational domain with 2m+1 spatial points and n angles. Consider cell i. In μ-line relaxation, the fluxes at the cell centers, ψ^+_i and ψ^−_i (grid point 2i), together with the outflow flux variables ψ^−_{i−1/2} and ψ^+_{i+1/2} (grid points 2i−1 and 2i+1), will be updated by the solution of the following matrix equation:
    [ I+2B_i−R   −2R       −2B_i     R        ] [ ψ^−_{i−1/2} ]   [ q^−_{i−1/2}           ]
    [ 0          I−R       −R        B_i      ] [ ψ^+_i       ] = [ q^+_i + B_i ψ^+_{i−1/2} ]
    [ B_i        −R        I−R       0        ] [ ψ^−_i       ]   [ q^−_i + B_i ψ^−_{i+1/2} ]
    [ R          −2B_i     −2R       I+2B_i−R ] [ ψ^+_{i+1/2} ]   [ q^+_{i+1/2}           ]    (2.2.12)

Solving this matrix equation corresponds to performing a μ-line relaxation. In our implementations, we perform two-cell relaxations over the whole domain. In this kind of relaxation, we consider pairs of cells coupled together. This relaxation yields an error that is linear, independent of angle, and continuous across two cells up to O((σ_t h)^{-1}) accuracy when σ_t h ≫ 1 [10]. Instead of updating four variables with a relaxation scheme, we update eight. For example, in Figure 2.1, the variables ψ^−_{i−1/2}, ψ^+_i, ψ^−_i, ψ^+_{i+1/2}, ψ^−_{i+1/2}, ψ^+_{i+1}, ψ^−_{i+1}, and ψ^+_{i+3/2} will be updated. Notice that the inflow variables ψ^+_{i−1/2} and ψ^−_{i+3/2} will be used from the previous iteration. Two-cell μ-line relaxation involves inversion of the 8n × 8n matrix

    [ I+2B_i−R  −2R    −2B_i  R         0           0         0         0          ]
    [ 0         I−R    −R     B_i       0           0         0         0          ]
    [ B_i       −R     I−R    0         −B_i        0         0         0          ]
    [ R         −2B_i  −2R    I+2B_i−R  0           0         0         0          ]
    [ 0         0      0      0         I+2B_{i+1}−R  −2R     −2B_{i+1} R          ]
    [ 0         0      0      −B_{i+1}  0           I−R      −R        B_{i+1}    ]
    [ 0         0      0      0         B_{i+1}     −R       I−R       0          ]
    [ 0         0      0      0         R           −2B_{i+1} −2R      I+2B_{i+1}−R ]    (2.2.13)

The order of variables for this matrix is given by the representation

    ψ = (ψ^−_{i−1/2}, ψ^+_i, ψ^−_i, ψ^+_{i+1/2}, ψ^−_{i+1/2}, ψ^+_{i+1}, ψ^−_{i+1}, ψ^+_{i+3/2})^T.    (2.2.14)

The right-hand side for two-cell μ-line relaxation is given by
    q = (q^−_{i−1/2}, q^+_i + B_i ψ^+_{i−1/2}, q^−_i, q^+_{i+1/2}, q^−_{i+1/2}, q^+_{i+1}, q^−_{i+1} + B_{i+1} ψ^−_{i+3/2}, q^+_{i+3/2})^T.    (2.2.15)

[Figure 2.2: Red-black relaxation for 4 cells — spatial points 1 through 9, with points 1–5 forming the pair of red cells and points 5–9 the pair of black cells.]

For our implementations, we developed a multigrid scheme that uses either a parallel block Jacobi relaxation or a parallel block red-black Gauss-Seidel relaxation. For Jacobi relaxation, all of the two-cell relaxations are performed simultaneously for the whole domain. For red-black relaxation, we update half of the total number of cell pairs during the first (red) stage of the algorithm and the other half during the second (black) stage of the algorithm, each time skipping neighboring cell pairs. To illustrate, Figure 2.2 shows 4 cells. The even numbers represent the cell centers and the odd numbers the cell edges. For simplicity, we omitted the vertical lines (passing through each spatial grid point) that contain the Gauss quadrature points (angles). For a Jacobi type of relaxation process, variables at spatial points 5 through 9 will be updated at the same time as points 1 through 5. Each pair of cells will use the relaxation matrix just described. On a parallel machine like the CM-2, red-black relaxation takes approximately twice as much CPU time as a Jacobi relaxation. This is partially offset by a better convergence rate (refer to the Fourier analysis in Section 2.9 and the numerical results in Section 2.10).
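The difference between the two sweep orderings can be sketched as follows. The pair update here is a hypothetical placeholder (a simple average with the pair's inflow neighbors), not the actual two-cell inversion; it only illustrates that Jacobi updates every pair from old data, while the black stage of red-black sees the freshly updated red values:

```python
import numpy as np

def sweep(u, pair_ids):
    """Apply a placeholder two-cell update to the listed cell pairs.

    Each pair owns four points; the update reads only inflow data from
    outside the pair, mimicking the lagged inflows of mu-line relaxation.
    """
    new = u.copy()
    for p in pair_ids:
        lo, hi = 4 * p, 4 * p + 4
        left = u[lo - 1] if lo > 0 else 0.0
        right = u[hi] if hi < len(u) else 0.0
        new[lo:hi] = 0.5 * (left + right)   # placeholder "relaxation"
    return new

u = np.arange(8.0)
jacobi = sweep(u, [0, 1])          # all pairs at once, using old data
red = sweep(u, [0])                # red stage
redblack = sweep(red, [1])         # black stage sees updated red values
```

The two final states differ only in the black pair, which in the red-black sweep was computed from the already-updated red values.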
2.3 Two-cell Inversion. Direct solution of the two-cell relaxation matrix equation can be done in a very efficient way on SIMD computers like the CM-2. First notice that the matrix can be written as the difference of an easily inverted matrix, which we call A, and a rank-four matrix, which we write as VW^T. The matrix A has a special structure that allows its inverse to be calculated in place:

    [ I+2B_i  0       −2B_i   0        0           0         0          0          ]
    [ 0       I       0       B_i      0           0         0          0          ]
    [ B_i     0       I       0        −B_i        0         0          0          ]
A = [ 0       −2B_i   0       I+2B_i   0           0         0          0          ]
    [ 0       0       0       0        I+2B_{i+1}  0         −2B_{i+1}  0          ]
    [ 0       0       0       −B_{i+1} 0           I         0          B_{i+1}    ]
    [ 0       0       0       0        B_{i+1}     0         I          0          ]
    [ 0       0       0       0        0           −2B_{i+1} 0          I+2B_{i+1} ]    (2.3.1)

This is an (8n × 8n) block matrix with n × n diagonal blocks.
The rank-four matrix VW^T is defined by

    [ 1    0    0    0  ]
    [ ½1   ½1   0    0  ]
    [ ½1   ½1   0    0  ]
V = [ 0    1    0    0  ]    (8n × 4)    (2.3.2)
    [ 0    0    1    0  ]
    [ 0    0    ½1   ½1 ]
    [ 0    0    ½1   ½1 ]
    [ 0    0    0    1  ]

and

      [ ½w^T   w^T   0     −½w^T  0      0     0     0     ]
W^T = [ −½w^T  0     w^T   ½w^T   0      0     0     0     ]    (4 × 8n)    (2.3.3)
      [ 0      0     0     0      ½w^T   w^T   0     −½w^T ]
      [ 0      0     0     0      −½w^T  0     w^T   ½w^T  ]

where 1 = (1,1,…,1)^T, 0 = (0,0,…,0)^T, and w^T = (w_1, w_2, …, w_n). We can use the Sherman-Morrison [6] formula to calculate the inverse of the two-cell μ-line relaxation matrix M = A − VW^T:

    M^{-1} = (I + A^{-1} V E^{-1} W^T) A^{-1},    (2.3.4)

where E = I − W^T A^{-1} V. If y is an 8n-vector, we have M^{-1}y = (I + A^{-1} V E^{-1} W^T) A^{-1} y. The connectivity of A is two interleaved 4 × 4 block matrices. Thus, inverting A amounts to solving 2n tridiagonal matrices, each of size 4 × 4, which can be done in parallel. Since all the blocks are diagonal, and thus commute, we make use of the quotient notation M_1/M_2 = M_1 M_2^{-1}. We have A^{-1} =
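The block Sherman-Morrison (Woodbury) identity used here is easy to check numerically. The following sketch, with random matrices standing in for the thesis's A, V, and W, verifies that M^{-1}y = (I + A^{-1}V E^{-1} W^T)A^{-1}y with E = I − W^T A^{-1} V:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 4                        # stand-ins for 8n and the rank four
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # the "easy" part
V = rng.standard_normal((n, k))
W = rng.standard_normal((n, k))
M = A - V @ W.T                    # matrix to invert, as in M = A - V W^T

Ainv = np.linalg.inv(A)
E = np.eye(k) - W.T @ Ainv @ V     # small k x k capacitance matrix
y = rng.standard_normal(n)

# Solve M x = y using only A^{-1} and the small E^{-1}.
x = (np.eye(n) + Ainv @ V @ np.linalg.inv(E) @ W.T) @ (Ainv @ y)
```

Only applications of A^{-1} (cheap: tridiagonal per angle in the thesis's setting) and one 4 × 4 inverse are required, which is what makes the decomposition attractive on a SIMD machine.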
    [ I/D_i      0               2B_i/D_i    0            2B_i²/(D_iD_{i+1})          0                  4B_i²B_{i+1}/(D_iD_{i+1})          0               ]
    [ 0          (I+2B_i)/D_i    0           −B_i/D_i     0                           0                  0                                  0               ]
    [ −B_i/D_i   0               (I+2B_i)/D_i 0           (B_i+2B_i²)/(D_iD_{i+1})    0                  2B_{i+1}(B_i+2B_i²)/(D_iD_{i+1})   0               ]
    [ 0          2B_i/D_i        0           I/D_i        0                           0                  0                                  0               ]
    [ 0          0               0           0            I/D_{i+1}                   0                  2B_{i+1}/D_{i+1}                   0               ]
    [ 0          2B_i(B_{i+1}+2B_{i+1}²)/(D_iD_{i+1})  0  (B_{i+1}+2B_{i+1}²)/(D_iD_{i+1})  0            (I+2B_{i+1})/D_{i+1}               0               −B_{i+1}/D_{i+1} ]
    [ 0          0               0           0            −B_{i+1}/D_{i+1}            0                  (I+2B_{i+1})/D_{i+1}               0               ]
    [ 0          4B_iB_{i+1}²/(D_iD_{i+1})  0            2B_{i+1}²/(D_iD_{i+1})       0                  2B_{i+1}/D_{i+1}                   0               I/D_{i+1}       ]    (2.3.5)

where B_i = diag(…, b_{ij}, …), D_i = diag(…, 1 + 2b_{ij} + 2b_{ij}², …), and b_{ij} = μ_j/(σ_t h_i). The matrix E = I − W^T A^{-1} V is the 4 × 4 matrix

                           [ d_1   0      a_1   c_1  ]
E = I − W^T A^{-1} V =     [ 0     d_1    b_1   a_11 ]    (2.3.6)
                           [ a_2   c_2    d_2   0    ]
                           [ b_2   a_22   0     d_2  ]

where, with d_{ij} = 1 + 2b_{ij} + 2b_{ij}²,

    d_1 = 1 − 2 Σ_{j=1}^{N} (w_j + w_j b_{ij})/d_{ij},    (2.3.7)

and d_2 is the analogous sum over the second cell, with b_{ij} replaced by b_{i+1,j}.
The remaining entries are similar sums over the quadrature index; for example,

    a_11 = 2 Σ_{j=1}^{N} (w_j b_{ij}² + w_j b_{ij}² b_{i+1,j})/(d_{ij} d_{i+1,j}),    (2.3.12)

    a_22 = 2 Σ_{j=1}^{N} (w_j b_{i+1,j}² + w_j b_{ij} b_{i+1,j}²)/(d_{ij} d_{i+1,j}),    (2.3.13)

and

    a_2 = 2 Σ_{j=1}^{N} (w_j b_{ij} b_{i+1,j} + w_j b_{ij} b_{i+1,j}²)/(d_{ij} d_{i+1,j}),    (2.3.16)

with a_1, b_1, c_1, b_2, and c_2 given by the corresponding sums with the roles of the two cells interchanged. E^{-1} is also a 4 × 4 matrix and can be calculated analytically. Its entries are shown in Appendix A.

2.4 Multigrid. To illustrate the multigrid scheme, we develop it here in two-level form. Let L^h denote the fine-grid operator, L^{2h} the coarse-grid operator, and I^h_{2h} and I^{2h}_h the interpolation and restriction operators, respectively. Let ν_1 and ν_2 be small integers (e.g., ν_1 = ν_2 = 1), which determine the number of relaxation sweeps performed respectively before and after the coarse-grid correction. One multigrid V(ν_1, ν_2) cycle is then represented (in two-level form) by the following:
1. Relax ν_1 times on L^h u^h = f^h.
2. Calculate the residual r^h = f^h − L^h u^h.
3. Restrict the residual to the coarse grid: r^{2h} = I^{2h}_h r^h.
4. Solve L^{2h} u^{2h} = r^{2h} on the coarse grid.
5. Correct the fine-grid approximation: u^h ← u^h + I^h_{2h} u^{2h}.
6. Relax ν_2 times on L^h u^h = f^h.

[Figure 2.3: Fine and coarse grid — fine-grid cells i, i+1 merged into one coarse-grid cell.]

Our multigrid scheme is applied with regard to the spatial variable only. (For a multilevel scheme in angle, see Chapter 3.) Figure 2.3 illustrates grid points on the fine grid and on the coarse grid. The interpolation and restriction operators used here are based on the same finite element principle as in the derivation of the MLD scheme [10]. They are defined as follows:

             [ S_{1,1}  S_{1,2}                                ]
I^{2h}_h =   [                  S_{3,1}  S_{3,2}               ]    (2.4.1)
             [                                  ⋱              ]
             [                        S_{m−1,1}  S_{m−1,2}     ]
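The six steps above can be sketched generically. The code below runs a two-grid V(1,1) cycle on a 1D Poisson model problem — a stand-in chosen for brevity, so the operator, the weighted-Jacobi smoother, and the full-weighting/linear-interpolation transfers are not the MLD transport ones of this chapter:

```python
import numpy as np

def poisson(n, h):
    # 1D model operator -u'' with homogeneous Dirichlet BCs.
    return (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
            - np.diag(np.ones(n - 1), -1)) / h**2

def two_grid(A, Ac, R, P, u, f, nu1=1, nu2=1, omega=2.0 / 3.0):
    D = np.diag(A)
    for _ in range(nu1):                    # 1. pre-relaxation
        u = u + omega * (f - A @ u) / D
    r = f - A @ u                           # 2. fine-grid residual
    rc = R @ r                              # 3. restrict
    ec = np.linalg.solve(Ac, rc)            # 4. coarse-grid solve
    u = u + P @ ec                          # 5. correct
    for _ in range(nu2):                    # 6. post-relaxation
        u = u + omega * (f - A @ u) / D
    return u

n, h = 7, 1.0 / 8.0
A = poisson(n, h)
P = np.zeros((n, 3))                        # linear interpolation
for j in range(3):
    P[2 * j:2 * j + 3, j] = [0.5, 1.0, 0.5]
R = 0.5 * P.T                               # full weighting
Ac = R @ A @ P                              # Galerkin coarse operator
f = np.ones(n)
u = np.zeros(n)
for _ in range(20):
    u = two_grid(A, Ac, R, P, u, f)
```

A few cycles reduce the residual to near machine precision, which is the hallmark of a well-matched smoother/coarse-grid pair; the transport analog replaces the smoother by two-cell μ-line relaxation and the transfers by (2.4.1)–(2.4.3).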
where, with fine cells i and i+1 forming one coarse cell,

          [ I   0                      0                      0 ]              [ 0   0                          0                          0 ]
S_{i,1} = [ 0   h_i/(h_i+h_{i+1}) I    0                      0 ]    S_{i,2} = [ 0   h_{i+1}/(h_i+h_{i+1}) I    0                          0 ]    (2.4.2)
          [ 0   0                      h_i/(h_i+h_{i+1}) I    0 ]              [ 0   0                          h_{i+1}/(h_i+h_{i+1}) I    0 ]
          [ 0   0                      0                      0 ]              [ 0   0                          0                          I ]

and

             [ T_{1,1}                       ]
             [ T_{2,1}                       ]
I^h_{2h} =   [          T_{1,3}              ]    (2.4.3)
             [          T_{2,3}              ]
             [                   ⋱           ]
             [                      T_{1,m−1} ]
             [                      T_{2,m−1} ]

where

          [ I                              0                                 0                         0                                ]
T_{1,i} = [ 0                              (h_i+2h_{i+1})/(h_i+h_{i+1}) I    0                         −h_{i+1}/(h_i+h_{i+1}) I         ]
          [ h_{i+1}/(h_i+h_{i+1}) I        0                                 h_i/(h_i+h_{i+1}) I       0                                ]
          [ 0                              2h_{i+1}/(h_i+h_{i+1}) I          0                         (h_i−h_{i+1})/(h_i+h_{i+1}) I   ]

and

          [ (h_{i+1}−h_i)/(h_i+h_{i+1}) I  0                                 2h_i/(h_i+h_{i+1}) I      0                                ]
T_{2,i} = [ 0                              h_{i+1}/(h_i+h_{i+1}) I           0                         h_i/(h_i+h_{i+1}) I              ]
          [ −h_i/(h_i+h_{i+1}) I           0                                 (2h_i+h_{i+1})/(h_i+h_{i+1}) I  0                          ]
          [ 0                              0                                 0                         I                                ]
The order of variables for these operators is the same as in (2.2.14). The coarse-grid operator is defined as L^{2h} = I^{2h}_h L^h I^h_{2h}, where L^h is given by (2.2.6)–(2.2.9). Note that L^{2h} has the same form as L^h, but on a coarser grid. We use the notation 2h to indicate a coarse grid, although our grids are not really assumed to be uniform. In the general nonuniform case, the mesh size of cell i on the coarse grid is given by h_i^{2h} = h_{2i−1} + h_{2i}.

2.5 Storage and Data Structure. The implementation of this scheme on the CM-2 involves the following considerations: We assign one processor to each point (i.e., to each (z_i, μ_j) pair) in the computational mesh, assuming there are at least (2m+1)n processors. If we are using fewer than (2m+1)n physical processors, the CM-2 will reuse the physical processors, creating virtual processors to store the additional grid points. Variables that are defined at different grid levels have an index that represents the grid level. We store these variables such that, if they are evaluated at the same (spatial and angular) coordinate pair, they will be assigned to the same processor, even if their grid levels are distinct. This avoids communication when data is accessed from different grid levels for a given spatial and angular coordinate pair. For example, the variable ψ appears as ψ(:serial, :news, :news) in the code. Here the first index represents the grid level and the remaining indices represent the spatial and angular coordinates.
[Figure 2.4: Computational mesh for four cells — spatial points 1 through 9 on the horizontal axis, angles on the vertical axis.]

In relaxation for the nonuniform grid, some processors will require products of variables that belong to different cells, as can be seen, for example, in any row of (2.3.5). In particular, consider the first row of this matrix. Even though this row will have its elements stored in a processor in the first cell of the two-cell pair, it uses data from both cells (e.g., B_i²B_{i+1}). To avoid extra complication in the parallel algorithm, we store these variables in the proper processors at the outset. For example, we have an array b that stores the values of matrix B for the first cell, and an array bp1 that stores the values of matrix B for the second cell of the pair. In this way, all processors representing grid points of a two-cell pair will contain data from both cells of the pair. Figure 2.4 illustrates the computational mesh for four cells (horizontal axis) and four angles (vertical axis). Vertical lines 1–5 represent the variables associated with cell_1 and cell_2, while vertical lines 5–9 represent the variables associated with cell_3 and cell_4. Processors represented by vertical lines 1–4 will contain all the information associated with cell_1 and cell_2, while processors represented by vertical lines 5–8 will
contain all the information associated with cell_3 and cell_4. Consider cell_1 and cell_2. Array b contains the values b(i,j) = μ(j)/(σ_t(cell_1) h(cell_1)). Thus, b(1,3) = b(2,3) = b(3,3) = b(4,3). Array bp1 contains the values bp1(i,j) = μ(j)/(σ_t(cell_2) h(cell_2)). Thus, bp1(1,3) = bp1(2,3) = bp1(3,3) = bp1(4,3). However, b(1,1) ≠ b(1,2), bp1(1,1) ≠ bp1(1,2), and so on. The arrays b and bp1 are constant on the same subsets of the computational mesh. We also use some replication of data in a slightly different way for the mesh size variable h. The elements of array h contain the cell size h_i, the array hm1 contains h_{i−1}, and the array hp1 contains h_{i+1}. These three arrays are constant on vertical lines in Figure 2.4. This data structure is the same for all the grid levels, but with a different number of cells for each grid level. Some communication, of course, is necessary between processors. In fact, between the boundaries of every pair of cells we will have some shifting of data. For example, consider
the variables (2.2.14), updated with the appropriate two-cell relaxation matrix (cell_i and cell_{i+1}). Notice that ψ^+_{i+3/2} will be updated by this matrix, but the variables of the next pair will be updated by the next two-cell relaxation matrix (cell_{i+2} and cell_{i+3}). Again consider Figure 2.4. Vertical line 5 represents the edge variables between cell_2 and cell_3. The two-cell relaxation matrix for cell_1 and cell_2 will be stored on the processors associated with vertical lines 1–4. So, to update the variables on line 5, the processors on line 5 will need to access data from line 4. In the CM-2, conformable arrays (arrays with the same shape and size) are always stored in the same set of processors in the same order. To compute expressions like the elements of the matrix E in (2.3.6) without the need for communication, we store the variable w_j (a one-dimensional array) as a
two-dimensional array that is conformable with array b (Figure 2.4). Consequently, w is constant over horizontal lines in Figure 2.4. Note that the elements of E^{-1} in (2.3.6) will also be conformable with the variables w and b.

2.6 Relaxation in Parallel. A two-cell relaxation step consists of applying M^{-1} (M is given by (2.2.13)) to a right-hand-side vector, which we call y. What follows is a description of the process used to perform a two-cell relaxation step. If the current grid has k cells, then k/2 two-cell relaxations will be performed simultaneously for Jacobi, while k/4 two-cell relaxation steps will be performed in each of the two stages for red-black Gauss-Seidel. Each two-cell relaxation can be performed through the following steps:

1. Form A^{-1}y: This step is done at the same time for all cells, using the matrix A^{-1} as shown in (2.3.5). In order to update any variable, the associated processor needs only one row of this matrix. For example, with the order of variables shown in (2.2.14), to update the variable ψ^+_i we need only the second row of A^{-1}, so the processor that updates this variable needs only to store this row's entries. Note, though, that this row's entries will multiply data that are allocated in other processors. Through the use of the CM-2 intrinsic function eoshift we can perform a move of data in the z direction (index i), and then multiply by the appropriate entry of the matrix. To update all the variables, we thus have the following combination of shifts:

    ψ_new(i,j) = A_1 ψ(i−2,j) + A_2 ψ(i−1,j) + A_3 ψ(i,j) + A_4 ψ(i+1,j) + A_5 ψ(i+2,j),

for j = 1,…,n simultaneously. Here, A_1, A_2, A_3, A_4, and A_5 are elements of the appropriate row. Note that processor (i,j) does not have variables other than the ones with indices i, j. It will therefore use the CM-2 intrinsic function eoshift¹ in the following way: temp = ψ_{i+1} = eoshift(ψ_i, 1, 1), ψ_{i+2} = eoshift(temp, 1, 1), and so on. Each new shift takes advantage of the previous one. The big advantage of this step is that this updating occurs at the same time for all processors, allowing this matrix multiplication task to be performed in only 8 communications.
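The end-off shift can be modeled with np.roll plus zero-fill. This is a hedged NumPy analog of the eoshift calls above (Fortran's eoshift shifts elements toward lower indices for a positive shift and fills the vacated positions with zeros):

```python
import numpy as np

def eoshift(a, axis, shift):
    """End-off shift: out[i] = a[i + shift], zero-filled at the boundary."""
    out = np.roll(a, -shift, axis=axis)
    idx = [slice(None)] * a.ndim
    if shift > 0:
        idx[axis] = slice(-shift, None)   # vacated tail
    else:
        idx[axis] = slice(None, -shift)   # vacated head
    out[tuple(idx)] = 0.0
    return out

psi = np.arange(5.0)
right_nbr = eoshift(psi, 0, 1)    # neighbor one cell to the right
left_nbr = eoshift(psi, 0, -1)    # neighbor one cell to the left
```

Chaining such shifts, as in the text, builds up the ψ(i±2, j) terms with one extra communication per shift.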
2. Form V E^{-1} W^T (A^{-1}y): Note that, for each cell pair, the vector from Step 1 has length 8n. In this step we take advantage of the fact that the 8n × 8n matrix V E^{-1} W^T is the 8 × 8 block matrix whose (p,q) block is the n × n rank-one matrix F_{pq} 1 w^T, where the 8 × 8 matrix F is derived from a combination of rows and columns of the matrix E^{-1}:

    F = P E^{-1} Q^T,    (2.6.1)

where P and Q^T are the 8 × 4 and 4 × 8 scalar coefficient patterns appearing in V and W^T, respectively. Remember that 1 and w are n-vectors. Note that, if the grids are nonuniform, then the matrices E^{-1} and F will have their elements as arrays and will be different for each distinct pair of cells. Multiplication by this block matrix can thus be done in three steps:

2.1 Form z_p = w^T g_p (p = 1,…,8). This step consists of a dot product of w with each n-vector g_p that makes up the 8n-vector g obtained in Step 1. First we multiply each entry of w by the corresponding entry of g, which is performed for all processors at the same time because w is conformable. Recall that, through the replication of data, the vector w was stored for all the spatial grid points. The index p varies from 1 to 8 for each cell pair. The second stage of the dot product is to perform a summation in the angle index direction. This is done with the use of the CM-2 intrinsic function sum. The resulting vector z has length 8.

2.2 Form y = Fz.

¹The first argument of the eoshift function is the array name, the second is the array index that is going to be shifted, and the last is the stride of the shift.
This step is a multiplication of a vector by a matrix, and can be performed on the CM-2 similarly to the way Step 1 of this relaxation was performed. Depending on which point a processor represents, it will have stored different rows of the matrix F. Its variables will be updated through the use of the CM-2 intrinsic function eoshift only (in this case, 16 of them). The vector y has length 8.

2.3 Form ŷ = 1 ⊗ y: This spreads the result of Step 2.2 across all the angles (for all spatial points at the same time). This is done with the use of the CM-2 intrinsic function spread. The vector ŷ has length 8n.

3. Form A^{-1}ŷ as in Step 1 and add the result to the vector formed in Step 1.

A special remark should be made here. It is not hard to see that Steps 2.2 and 2.3 can have their order interchanged. In fact, since all the CM-2 processors execute the same instruction simultaneously, unless instructed otherwise by the use of masks, we make use of this property in the following way. We spread the vector z throughout all the angles represented by different processors. Since all the processors had the appropriate row elements of the matrix F from the outset, we perform Step 2.2 last. This makes the use of masks necessary only for distinguishing between processors according to the kind of spatial point each represents, not the kind of angle. It is easy to show that, if we take advantage of the structure of the various matrices and perform the matrix multiplication steps as described, the total operation count and the number of communications per processor for the two-cell μ-line relaxation are both O(n), more precisely 2(n−1).
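Steps 2.1–2.3 amount to per-block dot products with w along the angle direction, one small 8 × 8 matvec, and a replication back across the angles. A hedged sketch with assumed shapes (8 block rows, n angles; random data standing in for the actual F and weights):

```python
import numpy as np

rng = np.random.default_rng(1)
n_blk, n_ang = 8, 4
w = rng.random(n_ang)             # quadrature weights, replicated in space
F = rng.random((n_blk, n_blk))    # small dense matrix built from E^{-1}
g = rng.random((n_blk, n_ang))    # result of Step 1, one row per block

z = g @ w                                     # Step 2.1: dot products ("sum" over angles)
y = F @ z                                     # Step 2.2: small matvec
out = np.repeat(y[:, None], n_ang, axis=1)    # Step 2.3: "spread" across angles
```

The rank-structured product is thus formed with O(n) work per block rather than the O(n²) of a dense 8n × 8n multiply, matching the operation count quoted above.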
2.7 Multigrid in Parallel. The parallel relaxation process of the multigrid algorithm was explained in Section 2.6. For calculation of the residual on the finer grid, again we use the CM-2 intrinsic function eoshift. For example, to calculate the residual at a cell-center grid point for a negative angle, we use equation (2.2.3). Note that the processor that contains the variable ψ^−_{i,j} will not contain the other variables necessary for the calculation of the residual at this point. Therefore, we perform a move of data:

    ψ^−_{i+1/2} = eoshift(ψ^−_i, 1, 1),
    ψ^−_{i−1/2} = eoshift(ψ^−_i, 1, −1).

For calculation of the right-hand side of this equation, we perform a one-step multiplication for the conformable arrays w and ψ^+ (the same for ψ^−), and then, using the CM-2 intrinsic function sum, we perform a summation in the angle direction for the resulting products. After the residual is calculated, in order to solve for the error on the coarser grid, we have to transfer the residual to the coarser grid. This is done by multiplying r^h, the residual on the fine grid, by the restriction operator shown in (2.4.1). In some processors there will be no operation or communication at all because, as in Section 2.5, the grid level index is serial and some processors will contain all the information they need. For example, for a negative angle at the left-hand side of a two-cell pair (negative-angle outflow), the restriction consists of r^{2h}_{i−1/2} = r^h_{i−1/2} (see the first row of (2.4.2)). This requires no communication. For processors that represent cell centers on the coarse grid (second and third rows of (2.4.2)), we use the CM-2 intrinsic function eoshift in a way that is
similar to what was done in Step 1 of relaxation:

    temp1 = r^h_{i+1} = eoshift(r^h_i, 1, 1),
    temp2 = r^h_{i−1} = eoshift(r^h_i, 1, −1).

We then calculate the residual on the coarse grid:

    r^{2h}_i = (h_{i−1}/(h_{i−1}+h_i)) temp1 + (h_i/(h_i+h_{i−1})) temp2.

After forming r^{2h}, we perform parallel relaxation again, this time to estimate the error on the coarser grid (u^{2h}). In order to use the coarse-grid quantities to correct the solution on a finer grid, we multiply u^{2h} by the matrix I^h_{2h} shown in (2.4.3). This multiplication is performed similarly to the multiplication of r^h by I^{2h}_h.

2.8 Discretization in Edge-Edge Notation. In the following section we develop a Fourier analysis of the convergence properties of the multigrid scheme. In this analysis we use a different notation for the MLD scheme, called edge-edge notation. Consider the discretized spatial domain with m cells as before, where i is the center of the cell and i−½ and i+½ are the left- and right-hand sides, respectively. The MLD equations using variables at the center and edges were shown in (2.2.1)–(2.2.4). In edge-edge notation the indices l and r represent the left and right sides of each cell, respectively. In order to write these equations in edge-edge notation, we notice that ψ^+_{i,l} = 2ψ^+_i − ψ^+_{i+1/2}, ψ^+_{i,r} = ψ^+_{i+1/2}, ψ^−_{i,l} = ψ^−_{i−1/2}, and ψ^−_{i,r} = 2ψ^−_i − ψ^−_{i−1/2}. After we substitute these relations into (2.2.1)–(2.2.4), we have
    (μ_j/(σ_t h_i))(ψ^+_{i,r,j} − ψ^+_{i−1,r,j}) + ½(ψ^+_{i,l,j} + ψ^+_{i,r,j}) = ¼ Σ_{k=1}^{N} w_k (ψ_{i,l,k} + ψ_{i,r,k}),    (2.8.1)

    (μ_j/(σ_t h_i))(ψ^+_{i,r,j} − ψ^+_{i,l,j}) + ψ^+_{i,r,j} = ½ Σ_{k=1}^{N} w_k ψ_{i,r,k},    (2.8.2)

i = 1,…,m, j = 1,…,n, and the negative-angle equations are

    (μ_j/(σ_t h_i))(ψ^−_{i,l,j} − ψ^−_{i+1,l,j}) + ½(ψ^−_{i,l,j} + ψ^−_{i,r,j}) = ¼ Σ_{k=1}^{N} w_k (ψ_{i,l,k} + ψ_{i,r,k}),    (2.8.3)

    (μ_j/(σ_t h_i))(ψ^−_{i,l,j} − ψ^−_{i,r,j}) + ψ^−_{i,l,j} = ½ Σ_{k=1}^{N} w_k ψ_{i,l,k},    (2.8.4)

i = 1,…,m, j = 1,…,n. In this section we omit the external source term q for simplicity. The omission of a superscript on sum terms in this section indicates that the summation is performed over all N = 2n angles (n positive and n negative). If we multiply (2.8.1) and (2.8.3) by 2, subtract (2.8.2) and (2.8.4) from them, and rewrite equations (2.8.2) and (2.8.4), we have

    (μ_j/(σ_t h_i))(ψ^+_{i,l,j} + ψ^+_{i,r,j} − 2ψ^+_{i−1,r,j}) + ψ^+_{i,l,j} = ½ Σ_{k=1}^{N} w_k ψ_{i,l,k},    (2.8.5)

    (μ_j/(σ_t h_i))(ψ^+_{i,r,j} − ψ^+_{i,l,j}) + ψ^+_{i,r,j} = ½ Σ_{k=1}^{N} w_k ψ_{i,r,k},    (2.8.6)

i = 1,…,m, j = 1,…,n, and the negative-angle equations are

    (μ_j/(σ_t h_i))(ψ^−_{i,l,j} + ψ^−_{i,r,j} − 2ψ^−_{i+1,l,j}) + ψ^−_{i,r,j} = ½ Σ_{k=1}^{N} w_k ψ_{i,r,k},    (2.8.7)

    (μ_j/(σ_t h_i))(ψ^−_{i,l,j} − ψ^−_{i,r,j}) + ψ^−_{i,l,j} = ½ Σ_{k=1}^{N} w_k ψ_{i,l,k},    (2.8.8)

i = 1,…,m, j = 1,…,n. Formulating the problem in terms of the (n) positive angles only, using vector notation such that ψ^±(·) is an n-vector ψ^±(·) = (ψ^±_1(·),…,ψ^±_n(·))^T, we can rewrite these equations in matrix form as

    B_i(ψ^+_{i,l} + ψ^+_{i,r} − 2ψ^+_{i−1,r}) + ψ^+_{i,l} = R(ψ^+_{i,l} + ψ^−_{i,l}),    (2.8.9)
    B_i(ψ^+_{i,r} − ψ^+_{i,l}) + ψ^+_{i,r} = R(ψ^+_{i,r} + ψ^−_{i,r}),    (2.8.10)

i = 1,…,m, and the negative-angle equations are

    B_i(ψ^−_{i,l} + ψ^−_{i,r} − 2ψ^−_{i+1,l}) + ψ^−_{i,r} = R(ψ^+_{i,r} + ψ^−_{i,r}),    (2.8.11)

    B_i(ψ^−_{i,l} − ψ^−_{i,r}) + ψ^−_{i,l} = R(ψ^+_{i,l} + ψ^−_{i,l}),    (2.8.12)

i = 1,…,m, where B_i = diag(μ_1,…,μ_n)/(σ_t h_i) and R = ½ 1 w^T. After we multiply this system by σ_t h, we notice that two-cell μ-line relaxation (on a uniform mesh, with B = diag(μ_j)/(σ_t h)) involves the inversion of the 8n × 8n matrix

          [ I+B−R  −R     −B     0      0      0      0      0     ]
          [ −R     I+B−R  0      B      0      0      0      0     ]
          [ B      0      I+B−R  −R     −2B    0      0      0     ]
    σ_t h [ 0      −B     −R     I+B−R  0      0      0      0     ]    (2.8.13)
          [ 0      0      0      0      I+B−R  −R     −B     0     ]
          [ 0      0      0      −2B    −R     I+B−R  0      B     ]
          [ 0      0      0      0      B      0      I+B−R  −R    ]
          [ 0      0      0      0      0      −B     −R     I+B−R ]

acting on (ψ^−_{i,l}, ψ^+_{i,l}, ψ^−_{i,r}, ψ^+_{i,r}, ψ^−_{i+1,l}, ψ^+_{i+1,l}, ψ^−_{i+1,r}, ψ^+_{i+1,r})^T, with the lagged inflow terms 2Bψ^+_{i−1,r} and 2Bψ^−_{i+2,l} on the right-hand side.
Notice that we can write the transformation between the center-edge notation and the edge-edge notation variables in matrix notation as

    [ ψ^+_{i,l} ]   [ 2I  −I ] [ ψ^+_i       ]
    [ ψ^+_{i,r} ] = [ 0    I ] [ ψ^+_{i+1/2} ],    (2.8.18)

    [ ψ^−_{i,l} ]   [ I    0 ] [ ψ^−_{i−1/2} ]
    [ ψ^−_{i,r} ] = [ −I  2I ] [ ψ^−_i       ].    (2.8.19)

2.9 Fourier Analysis. In this section we analyze the convergence properties of the two-grid algorithm described in Section 2.4. We performed Fourier analysis for the two kinds of relaxation used in our parallel multigrid schemes: Jacobi and red-black Gauss-Seidel relaxation. Convergence factors for both kinds of schemes are compared to the numerical results. We will assume a special kind of Fourier spatial dependence for the errors in an infinite spatial domain with uniform cells of size h each.

[Figure 2.5: Group of cells for Fourier analysis — cells i, i+1, i+2, i+3, labeled RL, RR, BL, BR.]

Considering a
group of four cells, we define two black cells (BL, black left, and BR, black right) and two red cells (RL, red left, and RR, red right) as indicated in Figure 2.5. For a given μ we denote by ψ(z,μ) the vector that contains elements representing fluxes at the edges of the red and black cells (see (2.9.1)). The vector ψ_1(λ,μ) contains elements representing coefficients at the edges of the red and black cells in the group of four cells considered above. This is the same as

                [ ψ_{i,l}(z,μ)   ]                  [ ψ_l^{RL}(λ,μ) ]
                [ ψ_{i,r}(z,μ)   ]                  [ ψ_r^{RL}(λ,μ) ]
                [ ψ_{i+1,l}(z,μ) ]                  [ ψ_l^{RR}(λ,μ) ]
    ψ(z,μ) =    [ ψ_{i+1,r}(z,μ) ]  and  ψ_1(λ,μ) = [ ψ_r^{RR}(λ,μ) ]    (2.9.1)
                [ ψ_{i+2,l}(z,μ) ]                  [ ψ_l^{BL}(λ,μ) ]
                [ ψ_{i+2,r}(z,μ) ]                  [ ψ_r^{BL}(λ,μ) ]
                [ ψ_{i+3,l}(z,μ) ]                  [ ψ_l^{BR}(λ,μ) ]
                [ ψ_{i+3,r}(z,μ) ]                  [ ψ_r^{BR}(λ,μ) ]

where the subscripts l and r represent the left and right sides of each cell, respectively. For each particular frequency λ we will assume ψ(z,μ) = e^{iλx} ψ_1(λ,μ) for the remaining domain, where x is the distance between points of the computational domain and the corresponding points of the group of four cells considered. These points are multiples of h away from each other. Considering 4P cells with periodic boundaries, we can write ψ(z,μ) = Σ_{k=0}^{P−1} e^{iλ_k x} ψ_1(λ_k,μ), where λ_k = 2πk/(4Ph), k = 0,1,…,P−1. The steps for the multigrid-scheme Fourier analysis follow. Representing the relaxation step by Aψ^{l+1} = Bψ^l, we write A^h ψ^{l+1} = B^h ψ^l for the fine grid and A^{2h} ψ^{l+1} = B^{2h} ψ^l for the coarse grid. The matrices A and B are derived by substituting ψ(z,μ) = e^{iλx} ψ_1(λ,μ) into (2.8.9)–(2.8.12) for the μ-line two-cell relaxation (Jacobi or red-black Gauss-Seidel). Thus, the matrices A and B depend on λ. In the following
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. I' I i description, 1;h and are the interpolation and restriction operators for the edge edge notation, respectively. Steps for the Fourier analysis: 1. Relaxation on the fine grid: '!1!_1+} = Ah1 Bh'!/!_1. 2. Calculation of the residual on the fine grid: r.h = Bh(Ah1 Bh1)'!1!_1 3. Transfer of the residual to the coarse grid: f_2h = = Bh(Ah1 Bh !)'!!!_'. 4. Calculation of the error on the coarse grid: !i2h = (A2hB2ht1 f_2h. 5. Correction to the solution: = + 1;h!i2h. The convergence rate for the twolevel multigrid V{l, I)cycle is given by the spectral radius of the following matrix: {2.9.2) The elements of matrix Z will depend on ). and the convergence factor for the Fourier analysis will be the maximum spectral radius of Z taken over the set ).k = 2"h'P 1 k = 0, ... P 1. To write the matrix Z, first we need to calculate matrices A and B on both the fine and coarse grids. The MLD scheme in edgeedge notation was derived in the previous section. Later in this section we show the equivalence between our analysis, done in edgeedge notation, and our edgecenter algorithm notation. In order to perform the Fourier analysis, we need to write these equations for all the angles and the four cells (RL, RR, BL, and BR) necessary in the red black GaussSeidel twocell relaxation analysis. The MLD relaxation will always be represented as two equations per cell per direction for each cell independent of the 37
kind of notation we are using. Consequently, to represent red-black Gauss-Seidel relaxation, the number of equations is 8 × (number of directions). For red-black Gauss-Seidel relaxation, all the red cells are updated before any of the black cells. Consequently, when we write the MLD equations for the red cells in the Fourier analysis, whenever there is a reference to a variable that belongs to a black cell, that variable is assumed to be at iterative level l instead of l+1. When we write the equations for the black cells, there are only terms at level l+1 (no terms on the right-hand side of the equations). The Fourier analysis for the Jacobi relaxation is done similarly; its analysis is simpler than the red-black Gauss-Seidel case. The main difference is that we need to consider only one kind of cell pair, and when referencing cell pairs other than the one we are writing equations for, we will assume them to be at level l.

1. Equations on the fine grid. These equations generate the matrices A_h and B_h used in Step 1. When the Fourier modes are substituted in equations (2.8.1)-(2.8.4) for the four cells, red-black relaxation can be written as

[ F          U  0  0 ]            [ 0  0  0  L e^{-iλH} ]
[ L          F  0  0 ]  ψ^{l+1} = [ 0  0  U  0          ]  ψ^l,    (2.9.3)
[ 0          L  F  U ]            [ 0  0  0  0          ]
[ U e^{iλH}  0  L  F ]            [ 0  0  0  0          ]

where H = 4h,

            [ I + BR  R       B       0      ]
F = σ_t h   [ R       I + BR  0       B      ],    (2.9.4)
            [ B       0       I + BR  R      ]
            [ 0       B       R       I + BR ]
    [ 0   0  0  0 ]
U = [ 0   0  0  0 ],    (2.9.5)
    [ 2B  0  0  0 ]
    [ 0   0  0  0 ]

and

    [ 0  0  0  0  ]
L = [ 0  0  0  2B ].    (2.9.6)
    [ 0  0  0  0  ]
    [ 0  0  0  0  ]

The order of variables for this matrix is

ψ = (ψ^{RL}, ψ^{RR}, ψ^{BL}, ψ^{BR})^T,    (2.9.7)

where

ψ^{RL} = (ψ_l^{-RL}, ψ_l^{+RL}, ψ_r^{-RL}, ψ_r^{+RL}),  ψ^{RR} = (ψ_l^{-RR}, ψ_l^{+RR}, ψ_r^{-RR}, ψ_r^{+RR}),

ψ^{BL} = (ψ_l^{-BL}, ψ_l^{+BL}, ψ_r^{-BL}, ψ_r^{+BL}),  and  ψ^{BR} = (ψ_l^{-BR}, ψ_l^{+BR}, ψ_r^{-BR}, ψ_r^{+BR}).

For Jacobi relaxation we have

[ F  U ] ψ^{l+1} = [ 0          L e^{-iλH} ] ψ^l,    (2.9.8)
[ L  F ]           [ U e^{iλH}  0          ]

where H = 2h.

2. Equations on the coarse grid. These equations will generate the matrix A_{2h} - B_{2h} for Step 4. The equations on this level are assumed to be solved exactly. Since
coarsening the grid halves the number of points, we will have two cells in the Fourier analysis for red-black Gauss-Seidel relaxation and one cell for Jacobi relaxation. The matrix equation (A_{2h} - B_{2h}) ε_{2h} = r_{2h} for the red-black Gauss-Seidel relaxation case is given by

[ F      U + L ] ε_{2h} = r_{2h},    (2.9.9)
[ L + U  F     ]

where F, U, and L are the same as before with h replaced by 2h. The order of variables is also the same, but for two cells instead of four cells. For Jacobi relaxation we have

F' ε_{2h} = r_{2h},    (2.9.10)

where F' = F + U + L.

3. Interpolation and restriction operators. In order to define the interpolation and restriction operators for the equations in edge-edge notation, we first notice that

A_h^{(e)} = Q A_h^{(c)} Q^{-1}  and  A_{2h}^{(e)} = Q A_{2h}^{(c)} Q^{-1},    (2.9.11)

where

Q = diag(Q_1, ..., Q_1),    (2.9.12)

the matrix Q_1 is given by (2.8.19), A_h^{(c)} and A_{2h}^{(c)} are the operators in (2.2.13) in the edge-center notation on the fine and coarse grids, respectively, and A_h^{(e)} and A_{2h}^{(e)} are the operators in the edge-edge notation on the fine and coarse grids, respectively. The interpolation operator used in the Fourier analysis is given by

                 [ S_1  0   ]
I_{2h}^{h(e)} =  [ S_2  0   ],    (2.9.13)
                 [ 0    S_1 ]
                 [ 0    S_2 ]
where

      [ 2I  0  0  0 ]        [ 0  0  0  0  ]
S_1 = [ 0   I  0  0 ]   S_2 = [ 0  I  0  I  ],
      [ I   0  I  0 ]        [ 0  0  I  0  ]
      [ 0   0  0  0 ]        [ 0  0  0  2I ]

and the restriction operator is

I_h^{2h(e)} = [ T_1  T_2  0    0   ],    (2.9.14)
              [ 0    0    T_1  T_2 ]

where

      [ I    0    0    0   ]        [ I/2  0    I/2  0   ]
T_1 = [ 0    I    0    0   ]   T_2 = [ 0    I/2  0    I/2 ].
      [ I/2  0    I/2  0   ]        [ 0    0    I    0   ]
      [ 0    I/2  0    I/2 ]        [ 0    0    0    I   ]

We can observe the relationship between the restriction and interpolation operators for the edge-edge and edge-center notation:

I_h^{2h(e)} = Q I_h^{2h(c)} Q^{-1}    (2.9.15)

and

I_{2h}^{h(e)} = Q I_{2h}^{h(c)} Q^{-1}.    (2.9.16)

From Section 2.4 we have

A^{2h(c)} = I_h^{2h(c)} A^{h(c)} I_{2h}^{h(c)}.

Substituting for the edge notation, we have

A^{2h(e)} = I_h^{2h(e)} A^{h(e)} I_{2h}^{h(e)},
which is equal to the fine grid operator (A^{h(e)}) applied to grid 2h.

4. Numerical experiments and results. For each λ we have a different matrix Z. The convergence factors for the Fourier analysis are obtained as the maximum over λ_k = 2πk/(4hP), k = 0, ..., P-1, of the spectral radius of Z in (2.9.2). For each σ_t h, we evaluate the spectral radius of Z for the P different frequencies, and the convergence factor from the Fourier analysis is the maximum. We compared the convergence factors between the Fourier analysis prediction and the CM-2 numerical results, and found good agreement (Tables 2.1 and 2.2). In fact, for this problem the Fourier analysis prediction was more accurate than the alternative analysis of convergence shown in [10]. Note that the convergence factor appears to be O((σ_t h)^{-2}) when σ_t h ≫ 1 and O((σ_t h)^3) when σ_t h ≪ 1 for both kinds of relaxation (Tables 2.1 and 2.2). This is exactly what happens with our numerical results. The analytical proof of convergence shown in [10] showed the convergence factor to be bounded by O((σ_t h)^{-1}) when σ_t h ≫ 1 and O((σ_t h)^2) when σ_t h ≪ 1 for both kinds of relaxation. Our results suggest that these bounds are not sharp.

2.10 Numerical Results. The CM-2 codes were run for different values of σ_t h on the fine grid. In Tables 2.3 and 2.4 we chose these values to be the worst cases predicted by the Fourier analysis shown. All of the convergence factors (i.e., the ratio of Euclidean norms of the residuals before and after a multigrid V-cycle) shown here are for the case with no absorption (γ = 1). The results are shown for the two kinds of
Table 2.1. Convergence factors from the Fourier analysis (Fa) prediction and CM-2 results for Jacobi relaxation for various values of σ_t h (S4 case).

σ_t h     Fa             CM-2
10^-6     5.3 x 10^-13   9.4 x 10^-10
10^-4     2.4 x 10^-10   8.7 x 10^-7
10^-3     2.3 x 10^-7    4.1 x 10^-4
10^-2     1.3 x 10^-4    1.0 x 10^-2
10^-1     7.2 x 10^-3    1.5 x 10^-2
1.        5.1 x 10^-3    4.2 x 10^-3
10^1      3.8 x 10^-6    3.9 x 10^-6
10^2      3.5 x 10^-7    3.4 x 10^-7
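The two-grid Fourier procedure that produces predictions like the Fa column above can be sketched generically. The code below is an illustrative stand-in, not the thesis code: since the MLD transport matrices (2.9.3)-(2.9.8) are not reproduced here, it applies the same recipe — build the relaxation and coarse-grid symbols for each frequency, form the two-grid error matrix, and take the maximum spectral radius over all frequencies — to the 1-D Poisson model problem with ω-Jacobi relaxation, linear interpolation, and full weighting.

```python
import numpy as np

def two_grid_factor(omega=2/3, nu1=1, nu2=1, n_theta=256):
    """Local Fourier (two-grid) analysis for the 1-D Poisson model problem.
    For each low frequency theta, the aliasing harmonic theta + pi is coupled
    in, giving a 2x2 error symbol; the convergence factor is the maximum
    spectral radius over all frequencies, as in the thesis's max over lambda_k."""
    h = 1.0
    rho_max = 0.0
    for theta in np.linspace(-np.pi/2, np.pi/2, n_theta):
        if abs(theta) < 1e-6:          # skip the singular zero frequency
            continue
        th = np.array([theta, theta + np.pi])
        A = (2 - 2*np.cos(th)) / h**2                 # fine-grid symbols (both harmonics)
        S = np.diag(1 - omega*(1 - np.cos(th)))       # omega-Jacobi relaxation symbol
        R = (1 + np.cos(th)) / 2                      # full-weighting restriction symbol
        P = R.copy()                                  # linear interpolation symbol
        A2 = (2 - 2*np.cos(2*theta)) / (2*h)**2       # coarse-grid symbol
        # coarse-grid correction, then pre-/post-relaxation
        CGC = np.eye(2) - np.outer(P, R) @ np.diag(A) / A2
        E = np.linalg.matrix_power(S, nu2) @ CGC @ np.linalg.matrix_power(S, nu1)
        rho_max = max(rho_max, max(abs(np.linalg.eigvals(E))))
    return rho_max
```

For ω = 2/3 this yields a factor near 0.11 for the V(1,1)-style analysis, and more relaxation sweeps reduce it further, mirroring the qualitative behavior of the MLD analysis above.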
Table 2.2. Convergence factors from the Fourier analysis (Fa) prediction and CM-2 results for red-black Gauss-Seidel relaxation for various values of σ_t h (S4 case).

σ_t h     Fa             CM-2
10^-6     3.5 x 10^-13   3.3 x 10^-10
10^-4     1.2 x 10^-10   2.8 x 10^-7
10^-3     1.2 x 10^-7    7.2 x 10^-6
10^-2     8.7 x 10^-6    4.3 x 10^-3
10^-1     3.8 x 10^-3    8.0 x 10^-3
1.        8.2 x 10^-4    1.1 x 10^-3
10^1      3.2 x 10^-6    3.17 x 10^-6
10^2      2.7 x 10^-7    2.5 x 10^-7
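The Galerkin coarse-grid identity A^{2h} = I_h^{2h} A^h I_{2h}^h used in the analysis (Section 2.4) can be verified on a simple model problem. The sketch below is an illustration with the standard 1-D Poisson stencils, not the MLD operators: with linear interpolation and full weighting, the Galerkin product reproduces the coarse three-point operator exactly.

```python
import numpy as np

def laplacian(n, h):
    """Standard three-point operator for -u'' on n interior points, spacing h."""
    return (np.diag(2.0*np.ones(n)) - np.diag(np.ones(n-1), 1)
            - np.diag(np.ones(n-1), -1)) / h**2

m = 3                       # coarse interior points
n = 2*m + 1                 # fine interior points (h = H/2)
H, h = 1.0/(m+1), 1.0/(n+1)

# linear interpolation P (n x m): coarse point j sits at fine point 2j+1
P = np.zeros((n, m))
for j in range(m):
    P[2*j, j] = 0.5
    P[2*j+1, j] = 1.0
    P[2*j+2, j] = 0.5

R = 0.5 * P.T               # full weighting: rows are (1/4)[1 2 1]

A2h_galerkin = R @ laplacian(n, h) @ P
assert np.allclose(A2h_galerkin, laplacian(m, H))
```

The assertion holds exactly; this is the sense in which the coarse-grid operator "is equal to the fine grid operator applied to grid 2h" in the text above.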
Table 2.3. Worst case convergence factor for multigrid with Jacobi V(1,1) relaxation, Fourier analysis (Fa), and CM-2 results (for m = 512).

angles   S4     S8     S16    S32
σ_t h    .230   .100   .150   .400
Fa       .012   .012   .013   .013
CM-2     .016   .015   .016   .014

relaxation (Jacobi and red-black Gauss-Seidel) used in our multigrid scheme. The convergence factors shown for the CM-2 codes were obtained for one V-cycle after five V-cycles. The convergence factors were close to the ones predicted by the Fourier analysis, as Tables 2.3 and 2.4 show. Tables 2.3 and 2.4 show results for uniform grids for different numbers of angles; e.g., S4 means 4 scattering angles (n = 2), and so on. Table 2.5 shows convergence factors obtained for nonuniform grids for various numbers of angles (N = 4, 8, 16); the grid was generated randomly with h varying between 1 and 100. Table 2.6 shows the behavior of the convergence factor for three relaxation methods, two-cell Jacobi (V(1,1) and V(2,2)) and two-cell red-black Gauss-Seidel (V(1,1)), for varying σ_t h. Note that the convergence factor appears to be O((σ_t h)^{-2}) when σ_t h ≫ 1 and O((σ_t h)^3) when σ_t h ≪ 1 for all three relaxation methods. To make a fair comparison between the different relaxations, we have to consider the difference in time spent to attain the above convergence factors. This will be analyzed in the next section. In Table 2.7 we make σ_t h equal to .1 and vary the number of cells (m). For a fixed mesh size, the convergence factor is basically constant. Table 2.8 shows the
Table 2.4. Worst case convergence factor for multigrid with red-black Gauss-Seidel relaxation, Fourier analysis (Fa), and CM-2 results (for m = 512).

angles   S4      S8      S16     S32
σ_t h    .15     .24     .1      .2
Fa       .0042   .0045   .0043   .0045
CM-2     .0063   .0065   .0086   .0045

Table 2.5. Convergence factors for nonuniform cells (Jacobi and red-black Gauss-Seidel relaxation on the CM-2), 1 ≤ h ≤ 100, and σ_t = 1.

angles      S4            S8            S16
Jacobi      1.6 x 10^-4   1.3 x 10^-4   5.6 x 10^-4
Red-black   1.3 x 10^-4   1.0 x 10^-4   9.7 x 10^-5
Table 2.6. Convergence factors for the Jacobi (V(1,1) and V(2,2)) and red-black Gauss-Seidel (V(1,1)) relaxations, for various values of σ_t h. S4 case and m = 512.

σ_t h    Jacobi V(1,1)   red-black V(1,1)   Jacobi V(2,2)
10^-5    9.4 x 10^-10    3.3 x 10^-10       1.47 x 10^-10
10^-4    8.7 x 10^-7     2.8 x 10^-7        1.38 x 10^-7
10^-3    4.1 x 10^-4     7.2 x 10^-5        7.40 x 10^-5
10^-2    1.0 x 10^-2     4.3 x 10^-3        2.41 x 10^-3
10^-1    1.5 x 10^-2     8.0 x 10^-3        2.71 x 10^-3
1.       4.2 x 10^-3     1.1 x 10^-3        1.03 x 10^-3
10^1     3.9 x 10^-5     3.2 x 10^-5        1.86 x 10^-5
10^2     3.4 x 10^-7     2.5 x 10^-7        4.16 x 10^-7
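The asymptotic behavior discussed in the text can be checked directly against the entries of Table 2.6 by fitting log-log slopes at the two ends of the range; the sketch below does this for the Jacobi V(1,1) column.

```python
import numpy as np

# (sigma_t h, convergence factor) pairs from the Jacobi V(1,1) column of Table 2.6
sth = np.array([1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 1e1, 1e2])
rho = np.array([9.4e-10, 8.7e-7, 4.1e-4, 1.0e-2, 1.5e-2, 4.2e-3, 3.9e-5, 3.4e-7])

# slope of log(rho) vs log(sigma_t h) at the small end: expect about +3
slope_small = np.log10(rho[1]/rho[0]) / np.log10(sth[1]/sth[0])
# slope at the large end: expect about -2
slope_large = np.log10(rho[-1]/rho[-2]) / np.log10(sth[-1]/sth[-2])

assert 2.5 < slope_small < 3.5     # consistent with O((sigma_t h)^3)
assert -2.5 < slope_large < -1.5   # consistent with O((sigma_t h)^-2)
```

Both fitted exponents (about 3.0 and -2.1) are one power better than the analytical bounds of [10], supporting the claim that those bounds are not sharp.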
Table 2.7. Convergence rates for Jacobi (V(1,1) and V(2,2) cycles) and red-black (V(1,1)) on the CM-2. S4 case and σ_t h = 0.1.

m       Jacobi V(1,1)   red-black V(1,1)   Jacobi V(2,2)
512     1.5 x 10^-2     9.5 x 10^-3        2.1 x 10^-3
1024    1.5 x 10^-2     9.7 x 10^-3        2.8 x 10^-3
2048    1.5 x 10^-2     9.8 x 10^-3        2.9 x 10^-3
4096    1.5 x 10^-2     9.9 x 10^-3        3.0 x 10^-3
32768   1.6 x 10^-2     9.9 x 10^-3        3.2 x 10^-3
65536   1.6 x 10^-2     9.9 x 10^-3        3.1 x 10^-3

Table 2.8. Convergence factors for the Jacobi (V(1,1) and V(2,2)) and red-black Gauss-Seidel (V(1,1)) relaxations. S4 case and m = 512 (nonuniform code). h1 = randomly generated grid; h2 = randomly generated grid; p1 = uniform grid (h = 0.1).

grid   Jacobi V(1,1)   red-black V(1,1)   Jacobi V(2,2)
h1     1.5 x 10^-4     1.26 x 10^-4       1.26 x 10^-4
h2     2.84 x 10^-4    5.4 x 10^-6        9.2 x 10^-6
p1     1.5 x 10^-2     8.0 x 10^-3        3.7 x 10^-3
convergence factors for N = 4 and three distinct meshes (h1, h2, and p1). Grid p1 is uniform with mesh size h equal to 10^-1 for the whole spatial domain, while grids h1 and h2 are nonuniform and randomly generated (this means that some of the cell sizes are smaller or bigger than 0.1). Since from the Fourier analysis we saw that the worst convergence factors occur around σ_t h = 10^-1, we expected the convergence factors to be smaller for grids h1 and h2 than for the uniform grid p1. In Tables 2.5 and 2.8, the convergence rates were measured using the nonuniform version of the algorithm.

2.11 Timings. First we show the cost behavior when we keep n constant (equal to one) and vary the spatial dimension m (Figures 2.6 and 2.7 and Table 2.9). Second, we analyze the cost when n varies and the spatial dimension m is kept constant (Figure 2.8). There are two codes, one for uniform and one for nonuniform grids. The results in this section are similar for both codes, although the timings obtained with the nonuniform grid code are roughly double those of the uniform grid code. All the timings shown in this section were measured for a V(1,1) or V(2,2) cycle, most of them using the uniform code. The sequential timings were measured using one processor of a Cray Y-MP, which has a vector architecture. When we increase the number of grid points of the computational grid, we should have no increase in the time spent for our relaxation step on computers like the CM-2, unless we reach the point where all the processors are being used. However, in our multigrid scheme, the number of levels should be defined such that the coarsest grid consists of two cells. Hence, the number of grid levels will always be log_2(m/2), so every time we double the number of cells (m) in our finest grid, we have new computations at the newly added grid level, since the grid level index is serial.
Figure 2.6: Timings (in seconds) for a V(1,1) cycle for small m and n = 1 (CM-2 Jacobi and Cray timings).

Figure 2.7: Timings (in seconds) for a V(1,1) cycle for large m and n = 1 (CM-2 Jacobi, CM-2 red-black, and Cray timings, versus the number of cells m).
This increase in time corresponds only to the new calculations. This can be observed in Figure 2.6 for m between 2^6 and 2^9 specifically, where we notice that the increase in time when we double m is linearly proportional to log_2 m, while after m = 2^9 there is a bigger increase in the slope of the graph every time we double m, because the processors are saturated. For our example, when m reaches the value of 2^10, we have begun to saturate the CM-2. Here the number of grid points (2m+1) is 2049. When these timings were measured, we used one sequencer, which consists of 2^12 processors with 512 floating point units (FPUs). Notice that 2048 is four times the number of FPUs, where four is the vector length at each FPU. This explains the beginning of the increase in the slopes of the piecewise linear graph in Figure 2.6, since we will be reusing these FPUs. Also, when we vary m between 2^10 and 2^11, we are changing the number of grid points by 2048, which causes an even bigger increase in the slopes of Figure 2.6. In fact, the timings will double (disregarding the overhead) every time we further increase the number of cells (2m+1) by multiples of 2048 (the number of FPUs x 4). If we look at the timings and dimensions for larger m, the parallel code starts increasing its time proportionally to the increase in m, similar to the way a sequential code does. Consider the points m = 2^15 (32768) and m = 2^16 (65536) in Figure 2.7. This is explained by the fact that, for this range of m, every time we double m we will be reusing FPUs. Figure 2.8 shows the variation of cost when n changes. The multigrid method in this chapter was applied to the spatial variable, so none of the increase in cost is due to the multigrid scheme used. The increase of cost here is due to the use of the CM-2 intrinsic functions sum and spread in the angular variable direction.
This happens, for example, in step 2 of the parallel relaxation. Of course, when we
Figure 2.8: Timings (in seconds) for a V(1,1) cycle varying the number of angles (n), with m = 512 (CM-2 Jacobi, CM-2 red-black, and Cray timings).
increase n to the point of reusing the FPUs, the time cost will increase similarly to the way it did when m was increased. Figure 2.7 also shows the additional time spent if we use a red-black Gauss-Seidel relaxation instead of a Jacobi. The timings for red-black Gauss-Seidel (V(1,1)) and Jacobi (V(1,1) and V(2,2)) two-cell relaxation are compared to the Cray timings in Table 2.9. The Cray timings were measured for a V(1,1) cycle. Notice that since the code on the Cray is sequential, it is irrelevant for timing purposes whether we use the Jacobi or red-black Gauss-Seidel relaxation, since both relaxations will take approximately the same number of sequential steps. On the CM-2, for the same number of V-cycles, the Jacobi relaxation in a V(1,1) cycle is much faster than the red-black Gauss-Seidel relaxation in a V(1,1) cycle, but its convergence factor is, in general, worse than the red-black V(1,1) convergence rate (Table 2.6). If we consider Jacobi relaxation for the V(2,2) cycle, the timings will be slightly better than the red-black Gauss-Seidel V(1,1) cycle, and the convergence factors in this case will be comparable. We show here that if the additional time used in doing a V-cycle with red-black Gauss-Seidel relaxation were used to perform more of the Jacobi V-cycles, the convergence factor (for both V(1,1) and V(2,2)) would in general be better. For a fixed number of V-cycles, Table 2.11 compares Jacobi V(1,1) and V(2,2) with red-black Gauss-Seidel V(1,1). In this table we compare the convergence factors for each relaxation using the time for a red-black V(1,1) as the standard unit. For example, if red-black V(1,1) took twice as long as Jacobi V(1,1), then two Jacobi V(1,1) cycles could be performed in the same time as one red-black V(1,1). Using the time for red-black V(1,1) as the standard unit of time, the proper convergence
Table 2.9. Timings (in seconds) for Jacobi (V(1,1) and V(2,2)) and red-black Gauss-Seidel (V(1,1)) on the CM-2 and Cray Y-MP. S4 case, σ_t h = 0.1.

m       Jacobi V(1,1)   red-black V(1,1)   Jacobi V(2,2)   Cray V(1,1)
512     1.19            1.78               1.69            2.9
1024    1.93            3.00               2.82            5.85
2048    2.78            4.54               4.35            11.71
4096    4.60            7.57               7.41            23.43
32768   31.00           53.53              52.69           187.5
65536   63.14           109.54             108.6           376.0

Table 2.10. Timings (in seconds) for Jacobi (V(1,1) and V(2,2)) and red-black Gauss-Seidel (V(1,1)) on the CM-2 and Cray Y-MP. S4 case, σ_t h = .1 (nonuniform code).

m       Jacobi V(1,1)   red-black V(1,1)   Jacobi V(2,2)   Cray V(1,1)
512     2.81            3.90               2.80            2.9
1024    4.27            5.65               4.31            5.8
2048    7.20            9.80               9.73            11.7
4096    12.78           17.78              13.10           23.4
32768   106.73          146.24             145.25          187.5
65536   222.47          304.17             302.13          376.0
Table 2.11. Comparison between Jacobi V(1,1), Jacobi V(2,2), and red-black convergence factors (ρ_j1, ρ_j2, and ρ_rb) for various values of σ_t h (S4 case).

σ_t h    (ρ_j1)^(t_rb/t_j1)   ρ_rb           (ρ_j2)^(t_rb/t_j2)
10^-5    2.8 x 10^-14         3.3 x 10^-10   4.4 x 10^-11
10^-4    8.1 x 10^-10         2.8 x 10^-7    5.9 x 10^-10
10^-3    8.6 x 10^-6          7.2 x 10^-5    4.5 x 10^-5
10^-2    1.0 x 10^-3          4.3 x 10^-3    1.7 x 10^-3
10^-1    1.8 x 10^-3          8.0 x 10^-3    2.0 x 10^-3
1.       2.7 x 10^-4          1.1 x 10^-3    7.1 x 10^-4
10^1     2.4 x 10^-7          3.2 x 10^-5    1.0 x 10^-5
10^2     2.0 x 10^-10         2.5 x 10^-7    1.9 x 10^-7
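The entries of Table 2.11 can be reproduced from Tables 2.6 and 2.9. The sketch below does this for the σ_t h = 0.1 row at m = 512: each Jacobi convergence factor is raised to the power t_rb/t, the number of its cycles that fit in one red-black V(1,1) time.

```python
# Reproducing the sigma_t h = 0.1 row of Table 2.11 from Tables 2.6 and 2.9 (m = 512).
t_j1, t_rb, t_j2 = 1.19, 1.78, 1.69                 # seconds per cycle, Table 2.9
rho_j1, rho_rb, rho_j2 = 1.5e-2, 8.0e-3, 2.71e-3    # factor per cycle, Table 2.6

# convergence factor achieved per unit of red-black V(1,1) time
adj_j1 = rho_j1 ** (t_rb / t_j1)
adj_j2 = rho_j2 ** (t_rb / t_j2)

assert abs(adj_j1 - 1.8e-3) / 1.8e-3 < 0.1   # Table 2.11 lists 1.8 x 10^-3
assert abs(adj_j2 - 2.0e-3) / 2.0e-3 < 0.1   # Table 2.11 lists 2.0 x 10^-3
assert adj_j1 < rho_rb and adj_j2 < rho_rb   # both Jacobi variants beat red-black
```

This is the quantitative sense in which Jacobi is "superior" in the discussion that follows: per unit time, both Jacobi cycles reduce the error more than red-black Gauss-Seidel does.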
factor for Jacobi V(1,1) would be (ρ_j1)^2. We define

t_rb = time for one V(1,1) cycle with red-black Gauss-Seidel relaxation,
t_j1 = time for one V(1,1) cycle with Jacobi relaxation,
t_j2 = time for one V(2,2) cycle with Jacobi relaxation.

Likewise, we define the convergence factors ρ_rb, ρ_j1, and ρ_j2. The correct comparisons are

ρ_rb  vs.  (ρ_j1)^(t_rb/t_j1)   and   ρ_rb  vs.  (ρ_j2)^(t_rb/t_j2).

These comparisons are shown in Table 2.11. It is clear that both Jacobi V(1,1) and V(2,2) are superior to red-black Gauss-Seidel. If we compare the Jacobi V(1,1) and V(2,2) relaxations, we conclude that Jacobi V(1,1) gives, in general, a better convergence factor in this parallel environment for the same CPU time. Table 2.10 shows some timings obtained with the nonuniform implementation of the algorithm. We notice that the nonuniform timings are roughly doubled in all the cases, so the same analysis that was applied to the uniform version can be applied here without changing the conclusions.

2.12 Conclusions. In this chapter we have studied a parallel algorithm for a multigrid scheme for solving the transport equations in slab geometry. This algorithm is suitable for SIMD computers and takes advantage of this kind of architecture during the different stages of the multigrid scheme. It was implemented on the CM-2 and was much faster than a sequential version of the same algorithm on the Cray Y-MP (using only one processor), especially for large grid sizes. For the relaxation step of the multigrid algorithm, we used the Sherman-Morrison formula and developed an efficient relaxation with the use of a few intrinsic functions on
the CM-2. The interpolation and restriction operations, for some grid points, were performed without communication. The increase in the CPU time spent in a V-cycle, when either one of the grid dimensions was increased, was small before processor saturation. For the spatial dimension, this increase is due to the addition of a new grid level, and for the angular dimension, it is due to the use of the CM-2 intrinsic functions sum and spread. After saturation of the CM-2 is reached, we have shown that the sharper increase in timings is caused by reuse of FPUs. In fact, when the CM-2 becomes saturated, the parallel timings start doubling when one of the grid dimensions is doubled. This, of course, always happens in the sequential timings for any doubling of m or n. The convergence rates attained per V-cycle were extremely good and matched theoretical results. We implemented three different kinds of relaxation, Jacobi V(1,1), Jacobi V(2,2), and red-black V(1,1), in two kinds of implementation each: uniform and nonuniform meshes. We have shown that, in this context, the Jacobi V(1,1) cycle was superior to the Jacobi V(2,2) cycle and the red-black Gauss-Seidel V(1,1) cycle. The results and timings from this chapter show the good performance of our parallel algorithms for solving the isotropic form of the transport equation on SIMD architectures.
CHAPTER 3

PARALLEL MULTIGRID FOR ANISOTROPIC SCATTERING

3.1 Introduction. In this chapter we develop a parallel algorithm for the anisotropic transport equation and discuss its implementation on the CM-200. Timings and convergence rates are shown and compared to a sequential implementation on the Cray Y-MP. As shown in Chapter 1, the transport equation for a semi-infinite slab of width 2 can be written as

μ (∂ψ/∂z)(z,μ) + σ_t ψ(z,μ) = S(z,μ) + q(z,μ).    (3.1.1)

For the anisotropic scattering case, the scattering term S in (3.1.1) can be written as

S(z,μ) = σ_s ∫ dμ' Σ_s(z, μ'→μ) ψ(z,μ') = Σ_{l=0}^{N-1} (2l+1) σ_l φ_l(z) p_l(μ),    (3.1.2)

where

ψ(z,μ) = Σ_{l=0}^{N-1} (2l+1) φ_l(z) p_l(μ),    (3.1.3)

and

φ_l(z) = Σ_{j=1}^{N} w_j p_l(μ_j) ψ(z,μ_j)    (3.1.4)
for l = 0, 1, ..., N-1 are the moments of the N-degree approximation of the flux, μ_1, μ_2, ..., μ_N are the quadrature points, and w_1, w_2, ..., w_N are the quadrature weights. Writing ψ(z) = (ψ(z,μ_1), ψ(z,μ_2), ..., ψ(z,μ_N))^T to represent the flux and φ(z) = (φ_0(z), φ_1(z), ..., φ_{N-1}(z))^T to represent the Legendre moments at the quadrature points, equations (3.1.3) and (3.1.4) can be written as

φ(z) = T ψ(z)  and  ψ(z) = T^{-1} φ(z).

The scattering term (3.1.2) can be written as

S(z) = T^{-1} D T ψ(z),

where T and T^{-1} are N x N matrices with elements

T_{l+1,j} = w_j p_l(μ_j)  and  (T^{-1})_{j,l+1} = (2l+1) p_l(μ_j),

and D is an N x N diagonal matrix with diagonal elements D_{i,i} = σ_{i-1}. The requirement that (3.1.1) hold at each quadrature point yields a coupled system of ODEs:

M ψ'(z) + σ_t ψ(z) = T^{-1} D T ψ(z) + q,    (3.1.5)

where M = diag(μ_j) and D = diag(σ_{j-1}). This is the flux representation of the S_N equations. If the right-hand side of (3.1.5) is known, it becomes an
uncoupled system of ODEs. This leads to the iteration

M (ψ^{l+1})'(z) + (σ_t - α) ψ^{l+1}(z) = T^{-1} (D - αI) T ψ^l(z) + q,    (3.1.6)

which is called a shifted source iteration. The variable α is a shift parameter. The system of ODEs will be stable for α ≤ σ_t. Suppose σ_t = σ_0 ≥ σ_1 ≥ ... ≥ σ_{N-1} ≥ 0 and σ_t = 1. If α is chosen to be (σ_{N/2} + σ_{N-1})/2, a shifted source iteration will attenuate all the errors in the upper half of the flux moments (φ_{N/2} through φ_{N-1}). In other words, it will attenuate the errors that vary rapidly in angle, but poorly attenuate those that vary slowly in angle. Consider the error given by ε^k(z) = ψ(z) - ψ^{k-1}(z), whose moments are given by T ε^k(z) = (e_0(z), e_1(z), ..., e_{N-1}(z))^T. Each one of these error moments will be decreased approximately by the factor (σ_i - α)/(σ_t - α) (i = 0, ..., N-1), which shows that the first error moment (e_0(z)) will not be decreased and that the error moments e_{N/2}(z) and e_{N-1}(z) (corresponding to σ_{N/2} and σ_{N-1}) will be minimized. This suggests that multigrid in angle be used together with a shifted source iteration at each grid level, as Morel and Manteuffel developed in [14]. In the angular multigrid algorithm, the source iteration is performed at each level with the appropriate shift applied to the cross-sectional moments for that angular level. We will refer to the different levels of the angular multigrid by using k. The number of angles (N) is halved (Figure 3.1) when going from a fine to a coarse grid, but the angles on the coarse grid are not a subset of the angles on the fine grid. The angles at grid level k are the N_k Gauss quadrature points at that level (μ_1, ..., μ_{N_k}). When coarsening the angular grid, we disregard the second half of the error moments (e_{N/2}(z), ..., e_{N-1}(z)) at the fine level, since those error moments were already attenuated by the shifted source iteration at that level.
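The attenuation factors (σ_i - α)/(σ_t - α) can be tabulated for a hypothetical forward-peaked moment spectrum to see the effect of the shift choice. The sketch below uses σ_l = 2^{-l}, which is an assumption for illustration only, not a spectrum from the thesis.

```python
import numpy as np

N = 8
sigma = 0.5 ** np.arange(N)        # hypothetical moments: sigma_0 = sigma_t = 1
sigma_t = sigma[0]
alpha = 0.5 * (sigma[N//2] + sigma[N-1])   # the shift chosen in the text

g = (sigma - alpha) / (sigma_t - alpha)    # per-moment error attenuation factors

assert abs(g[0] - 1.0) < 1e-12          # lowest moment is not attenuated at all
assert np.max(np.abs(g[N//2:])) < 0.05  # upper half of the moments is crushed
assert np.max(np.abs(g[:N//2])) > 0.2   # lower half is left for coarser levels
```

The numbers make the division of labor explicit: the shifted source iteration handles the rapidly varying (high-moment) errors, while the slowly varying errors survive and must be removed by the coarser angular grids.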
Applying the appropriate shift at each angular level results in isotropic scattering at the coarsest angular grid (k = log_2 N), which can be solved using the CM-2 parallel
Figure 3.1: Computational grids for the first and second levels of angular multigrid.
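The Legendre transform pair T, T^{-1} defined in Section 3.1, and the moment-space restriction used in the angular multigrid, can be exercised numerically with Gauss-Legendre quadrature. The sketch below (an illustration assuming a 1/2 normalization in T so that T^{-1}T = I on [-1,1]) builds the pair for N_k = 8 and N_{k+1} = 4, then checks that restricting a flux that is a cubic polynomial in μ reproduces its values at the coarse quadrature points exactly, even though the coarse points are not a subset of the fine ones.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, Legendre

def moment_matrix(N):
    """T: flux at N Gauss points -> first N Legendre moments (with 1/2 factor)."""
    mu, w = leggauss(N)
    P = np.array([Legendre.basis(l)(mu) for l in range(N)])  # P[l, j] = p_l(mu_j)
    return 0.5 * P * w, mu                                   # T[l, j] = w_j p_l(mu_j)/2

def inverse_moment_matrix(N):
    """T^{-1}: N moments -> flux, psi_j = sum_l (2l+1) phi_l p_l(mu_j)."""
    mu, _ = leggauss(N)
    P = np.array([Legendre.basis(l)(mu) for l in range(N)])
    return ((2*np.arange(N) + 1)[:, None] * P).T, mu

Tk, mu_fine = moment_matrix(8)
Tinv, _ = inverse_moment_matrix(8)
assert np.allclose(Tinv @ Tk, np.eye(8))      # exact discrete Legendre transform pair

Tc_inv, mu_coarse = inverse_moment_matrix(4)
R = Tc_inv @ np.hstack([np.eye(4), np.zeros((4, 4))]) @ Tk   # T_{k+1}^{-1} [I, 0] T_k

p = lambda mu: 1.0 + mu - 2*mu**2 + 0.5*mu**3  # cubic flux: only moments l <= 3 nonzero
assert np.allclose(R @ p(mu_fine), p(mu_coarse))
```

Both transforms are dense matrix-vector products here; the thesis's storage design performs them in O(N) parallel steps on the machine.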
multigrid solver developed in Chapter 2. The parallel isotropic algorithm uses a multigrid scheme in the spatial domain. In order to solve (3.1.6) at level k, we use the same spatial discretization used for the isotropic model in Chapter 2: the MLD scheme [13]. When this discretization is used we obtain

H_k ψ^{l+1} = S_k ψ^l + q_k,

where H_k and S_k both contain the shift α_k at angular level k. The matrix H_k is described in Section 3.4, and

S_k = T_k^{-1} (D_k - α_k I) T_k.    (3.1.7)

The transport sweep has been used before to perform the shifted source iteration [9]. Letting m be the number of cells (2m+2 spatial points), the transport sweep is a sequential algorithm that takes 2m steps, corresponding to solving a block lower bidiagonal system by block forward substitution and a block upper bidiagonal system by block backward substitution. To make the source iteration economical for the CM-200, we perform it at each angular level by using cyclic reduction, which takes log_2 m parallel steps. Throughout the angular multigrid cycle, multiplications by T and T^{-1}, which represent Legendre transforms, will be performed many times. Our storage design and algorithm for these multiplications allow Legendre transforms to be performed in O(N) parallel steps. An outline of the remainder of this chapter is as follows. In Section 3.2 we describe the angular multigrid algorithm in the flux notation. In Section 3.3 we describe the spatial discretization used (matrix H). In Section 3.4 we show how to perform the shifted source iteration as a cyclic reduction for the MLD discretization. In Section 3.5 we develop the parallel angular multigrid algorithm. In Section 3.6
we develop the parallel cyclic reduction algorithm. In Section 3.7 we analyze the stability of our cyclic reduction. In Section 3.8 we report on convergence rates and timing tests.

3.2 Angular Multigrid.

3.2.1 General description. The angular multigrid method was designed to accelerate problems with highly forward-peaked scattering [14], where the probability that a neutron is scattered through a small angle is higher than for a large angle. Before we describe the implementation details of this algorithm, we give a general description of both cases treated in this section: the V(1,0) and the V(1,1) angular multigrid cycles. Consider that the discretization of (3.1.6) can be written as the matrix equation Lψ = q. A standard iteration scheme for solving this equation is to use the splitting L = H - S, where S is the scattering matrix defined in (3.1.7), and rewrite the equation as

H ψ^{l+1} = S ψ^l + q.

This iterative process is called source iteration. The angular multigrid method proceeds as follows. Set k = 1 and N_k = N for the first angular grid level.

1. Perform a shifted source iteration on the fine grid: H_k ψ^{l+1} = S_k ψ^l + q_k. Compute the residual r_k = S_k(ψ^{l+1} - ψ^l). Calculate the moments for this residual: (φ^{l+1} - φ^l)_k = T_k r_k. Disregard the high moments: (φ^{l+1} - φ^l)_{k+1} = [I, 0] (φ^{l+1} - φ^l)_k.
2. Set N_{k+1} = N_k/2 and k = k+1. Perform a source iteration on the coarse grid with the source q_{k+1} equal to T_{k+1}^{-1} (φ^{l+1} - φ^l)_{k+1}:

H_{k+1} ψ^{l+1} = S_{k+1} ψ^l + q_{k+1}.

Repeat this coarsening until N_k is equal to 2.

3. On the coarsest grid (N_k = 2), this problem will be reduced to the isotropic model. Solve it exactly using the spatial multigrid algorithm.

4. Add the corrections from the coarse grid to the flux solution on the fine grid. In the case of a V(1,1) cycle, perform the source iteration using the newest calculated solution for level k: H_k ψ^{l+1} = S_k ψ^l + q_k. In the case of a V(1,0), do not perform a source iteration. Set N_{k-1} = 2 N_k and k = k-1. Repeat this step until k = 1 and N_k = N.

In the next two subsections we detail our V(1,0) and V(1,1) implementations. The V(1,0) cycle, in general, does not need a special description, but in order to develop an algorithm that takes advantage of the greater simplicity of the V(1,0) cycle over the V(1,1) cycle and to avoid redundant calculations, we write the algorithms in a slightly different way for the interpolation steps (Steps 5-7 in the next section).

3.2.2 The V(1,0) cycle. Consider writing the transport equation in its discretized form using MLD for the spatial discretization, and let N be the number of angles and m the number of spatial cells:

H ψ = S ψ + q.    (3.2.1)
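Before detailing the steps, the V(1,0) recursion outlined in Section 3.2.1 can be illustrated with a toy moment-space model. The sketch below makes two simplifying assumptions that are not from the thesis: each shifted source iteration simply multiplies error moment l by (σ_l - α_k)/(σ_t - α_k), and the coarsest (isotropic) level is solved exactly. It nonetheless shows how relaxation at each level crushes the upper half of the moments while the recursion removes the rest.

```python
import numpy as np

def v10_cycle(err, sigma, sigma_t):
    """Toy V(1,0) angular multigrid acting on a vector of error moments.
    Relax (attenuating the upper half), recurse on the lower half with the
    shifted cross sections, and solve the two-moment coarsest level exactly."""
    N = len(err)
    if N == 2:
        return np.zeros(2)                    # isotropic level: spatial multigrid solves it
    alpha = 0.5 * (sigma[N//2] + sigma[N-1])  # shift for this level
    err = err * (sigma[:N] - alpha) / (sigma_t - alpha)   # shifted source iteration
    low = v10_cycle(err[:N//2], sigma[:N//2] - alpha, sigma_t - alpha)
    return np.concatenate([low, err[N//2:]])  # high moments keep their relaxed values

N = 8
sigma = 0.5 ** np.arange(N)    # hypothetical forward-peaked moments, sigma_t = 1
e0 = np.ones(N)                # initial error, all moments O(1)
e1 = v10_cycle(e0, sigma, 1.0)

assert np.all(e1[:2] == 0)          # coarsest moments solved exactly
assert np.max(np.abs(e1)) < 0.05    # every moment strongly attenuated after one cycle
```

Note how the shifted cross sections σ_l - α_k are passed down, so that the coarsest level indeed sees an isotropic problem, matching the structure of Steps 1-4.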
A submatrix of the shifted matrix H is shown in (3.4.1). The matrix S = T^{-1}(D - αI)T.

Step 0. Initialization. Set k = 1.

Step 1. Shifted source iteration in parallel. In the source iteration, the right-hand side variables are assumed to be known and the left-hand side variables are calculated using the following expression:

H_k ψ^{l+1} = S_k ψ^l + q_k,    (3.2.2)

where

S_k = T_k^{-1} (D_k - α_k I) T_k.    (3.2.3)

At level k, the number of angles is half the number in the previous grid. This indicates that the Gauss quadrature points are different at each angular grid level and, consequently, the matrices T_k and M_k will depend on the grid level. The matrix H_k represents the MLD spatial discretization for the shifted source iteration (Section 3.4). In the spatial discretization, the variable ε_k = σ_t - α_k multiplies the identity matrix. This step is performed using cyclic reduction, which, for given boundary conditions, will calculate the fluxes for all spatial and angular points in 2 log_2 m steps. The cyclic reduction algorithm for the matrix equation (3.2.2) is described in Section 3.4. The initial guess for the flux on the right-hand side of the matrix equation is ψ^0 = 0. This source iteration can be repeated with the new estimated flux (ψ^{l+1}).

Step 2. Calculating the residual. The residual is given by

r_k = S_k (ψ^{l+1} - ψ^l).    (3.2.4)
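The reason the sweep in Step 1 can be done in logarithmically many parallel steps: a (block) lower bidiagonal solve is the first-order recurrence x_i = a_i x_{i-1} + c_i, which can be evaluated by repeatedly composing affine maps, doubling the span each round. The sketch below is a scalar illustration of this idea; the thesis's cyclic reduction operates on blocks and runs on the CM-200.

```python
import numpy as np

def parallel_recurrence(a, c):
    """Solve x_i = a_i * x_{i-1} + c_i (with x_0 = c_0) in O(log m) vector steps
    by composing the affine maps t -> a_i * t + c_i in a doubling sweep."""
    a, x = a.copy(), c.copy()
    shift = 1
    while shift < len(x):
        # compose each map with the one `shift` positions to its left
        x[shift:] = x[shift:] + a[shift:] * x[:-shift]
        a[shift:] = a[shift:] * a[:-shift]
        shift *= 2
    return x

rng = np.random.default_rng(0)
m = 16
a = rng.uniform(-0.9, 0.9, m); a[0] = 0.0
c = rng.uniform(-1.0, 1.0, m)

# sequential forward substitution (the analogue of the 2m-step transport sweep)
x_seq = np.empty(m)
x_seq[0] = c[0]
for i in range(1, m):
    x_seq[i] = a[i] * x_seq[i-1] + c[i]

assert np.allclose(parallel_recurrence(a, c), x_seq)
```

Each round is one fully parallel vector operation, so log_2 m rounds replace the m sequential substitution steps; doing this for both the forward and backward sweeps gives the 2 log_2 m count quoted above.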
Multiplication by the matrix S_k is applied in conjunction with the restriction operation, as described in the next step.

Step 3. Restriction of the residual. The external source term on the next grid level (q_{k+1}) is equal to the restricted residual. The restriction operator is given by

R_k = T_{k+1}^{-1} [I, 0] T_k.    (3.2.5)

Putting (3.2.4) and (3.2.5) together, we have

q_{k+1} = T_{k+1}^{-1} [I, 0] (D_k - α_k I) T_k (ψ^{l+1} - ψ^l).    (3.2.6)

To compute this efficiently, we proceed as follows.

• Calculate Δψ = ψ^{l+1} - ψ^l.

• Multiply Δψ by T_k. This represents a change from flux representation to moment representation. The matrix T_k is given by

T_k = [ w_1 p_0(μ_1)      w_2 p_0(μ_2)      ...  w_N p_0(μ_N)     ]
      [ w_1 p_1(μ_1)      w_2 p_1(μ_2)      ...  w_N p_1(μ_N)     ]    (3.2.7)
      [ ...                                                        ]
      [ w_1 p_{N-1}(μ_1)  w_2 p_{N-1}(μ_2)  ...  w_N p_{N-1}(μ_N) ]
PAGE 77
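As a concreteness check on Step 3, the restriction without the scattering factor (transform to moments with T_k, keep the lower half, re-expand with T_{k+1}^{-1} at the coarse Gauss points) can be sketched as follows. This is our own illustration with our own names, not the thesis code:

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, Legendre

def flux_to_moments(psi, mu, w):
    """Moments phi_l = sum_j w_j P_l(mu_j) psi_j (the action of T_k)."""
    n = len(mu)
    return np.array([np.sum(w * Legendre.basis(l)(mu) * psi) for l in range(n)])

def moments_to_flux(phi, mu):
    """Flux psi_j = sum_l (2l+1) phi_l P_l(mu_j) (the action of T_k^{-1})."""
    return sum((2 * l + 1) * phi_l * Legendre.basis(l)(mu)
               for l, phi_l in enumerate(phi))

def restrict_angle(resid_fine, n):
    """Sketch of the angular restriction (3.2.5): moments on the n-point fine
    grid, keep the lower n/2 moments, re-expand on the n/2-point coarse grid.
    Note the coarse Gauss points are not a subset of the fine ones."""
    mu_f, w_f = leggauss(n)
    mu_c, _ = leggauss(n // 2)
    phi = flux_to_moments(resid_fine, mu_f, w_f / 2.0)  # weights sum to 1
    return moments_to_flux(phi[: n // 2], mu_c)
```

A residual that is a low-degree polynomial in angle passes through unchanged (re-evaluated at the coarse points), while a residual proportional to a discarded high-degree polynomial restricts to zero, which is the point of coarsening in angle.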
Multiplications by [I_{k+1} \; 0] and by (D_k - \alpha_k I) are interchangeable. We multiply by [I_{k+1} \; 0] first, thus disregarding the second half of the moments (the higher moments) and considering only the first half of the moments (the lower moments) on the coarser grid (k+1). This can be written as

    [I \; 0]_{\frac{N}{2} \times N} \, T_k (\psi^{l+1} - \psi^l).

At angular level k we multiply by the scattering matrix truncated to N/2 rows and columns. Notice that this is a truncation of the (shifted) scattering matrix used with the P_{N-1} expansion on the fine grid (k) and is given by

    diag(\sigma_0 - \alpha_k, \ldots, \sigma_{\frac{N}{2}-1} - \alpha_k),

where \alpha_k is the shift on the fine grid (k) and is equal to (\sigma_{N/2} + \sigma_N)/2. Finally, multiply by T_{k+1}^{-1} (N_{k+1} = N_k/2), which is calculated using the Gauss quadrature points of the coarse level. Recall that the N_{k+1} Gauss quadrature points are not a subset of the N_k points. The matrix at level (k+1) is given by
    T_{k+1}^{-1} = \begin{bmatrix} P_0(\mu_1) & 3P_1(\mu_1) & 5P_2(\mu_1) & \cdots & (2N-3)P_{N-2}(\mu_1) & (2N-1)P_{N-1}(\mu_1) \\ P_0(\mu_2) & 3P_1(\mu_2) & 5P_2(\mu_2) & \cdots & (2N-3)P_{N-2}(\mu_2) & (2N-1)P_{N-1}(\mu_2) \\ \vdots & & & & & \vdots \\ P_0(\mu_N) & 3P_1(\mu_N) & 5P_2(\mu_N) & \cdots & (2N-3)P_{N-2}(\mu_N) & (2N-1)P_{N-1}(\mu_N) \end{bmatrix}_{N \times N};    (3.2.8)

that is, its (j, l) entry is (2l+1)P_l(\mu_j), evaluated at the Gauss quadrature points of that level.

Using q_{k+1} as the new source term, repeat Steps 1 through 3 until we have a P_1 expansion on the right-hand side (when N_k = 2).

Step 4. Calculation of the error at the coarsest level. With the appropriate shift (\alpha = \sigma_1) at this level (k = log_2 n), the shifted scattering matrix reduces to diag(\sigma_0 - \sigma_1, 0), so we have an isotropic transport equation for the error, with total cross section \sigma_t - \sigma_1 and isotropic scattering cross section \sigma_0 - \sigma_1.
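The claim that the shift \alpha = \sigma_1 leaves an isotropic operator at the P_1 level can be checked directly (our own numerical illustration, not thesis code; the two-point quadrature and normalized weights are assumptions matching (3.2.7)-(3.2.8)):

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

# Two-angle (S2 / P1) check that the shift alpha = sigma_1 leaves an isotropic
# operator: every row of S is the same, so S psi is constant in angle.
mu, w = leggauss(2)
w = w / 2.0                                   # weights normalized to sum to 1
T = np.array([w * np.ones(2), w * mu])        # rows: w_j P_0(mu_j), w_j P_1(mu_j)
Tinv = np.array([[1.0, 3 * m] for m in mu])   # (j, l) entry: (2l+1) P_l(mu_j)
sigma0, sigma1 = 0.9, 0.5                     # sample moments (assumed values)
S = Tinv @ np.diag([sigma0 - sigma1, 0.0]) @ T
```

Since the shifted P_1 moment is exactly zero, only the isotropic moment sigma_0 - sigma_1 survives, which is why the coarsest-level problem can be handed to the isotropic solver.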
At this level, instead of performing the source iteration through cyclic reduction as was done on the finer angular grid levels, we use the parallel multigrid-in-space algorithm to solve the isotropic equations. For N = 2, multigrid in space solves this system of equations exactly in one V-cycle.

Step 5. Interpolation of the correction. Starting with k = log_2 n, transform each correction from the flux domain to the moment domain, padding the upper (truncated) moments with zeros:

    \hat{e}_k = \begin{bmatrix} T_k e_k \\ 0 \end{bmatrix}.    (3.2.9)

The elongation in the CM codes is not performed explicitly, since the entries (i.e., the processors) where \hat{e}_k is not calculated were previously initialized to zero. Set k = k - 1 and calculate a new \hat{e}_k. Repeat this process until k = 2, summing the moments for levels l = 2, \ldots, log_2 n as

    \hat{e} = \sum_{l=2}^{\log_2 n} \hat{e}_l.    (3.2.10)

Step 6. Addition of the corrections. At the finest angular grid, the summation of the errors at all angular levels is transformed from moment to flux representation, and this flux correction is then added to the latest flux solution at this level:

    \psi_1 \leftarrow \psi_1 + T_1^{-1} \hat{e}.    (3.2.11)

Recall that, at this level, T_1^{-1} is an N \times N matrix generated by using the Gauss quadrature points for all n scattering angles.

3.2.3 The V(1,1) cycle. For the V(1,1) cycle, we modify the interpolation steps (Steps 5 and 6). The general interpolation operation at
level k can be described as

    \psi_k \leftarrow \psi_k + T_k^{-1} \hat{e}_{k+1}.    (3.2.12)

After we calculate the corrected flux, we use it to evaluate the right-hand side of a new source iteration using the following expression:

    H_k \psi^{l+1} = S_k \psi^l + r_k,    (3.2.13)

where r_k is the restricted residual used at level k when going down the cycle. Set k = k - 1 and repeat this process until k = 1.

3.3 Discretization.

The MLD equations using variables at the edges were derived in (2.8.5)-(2.8.8). For the anisotropic model, we use the MLD discretization in the form of equations (2.8.5)-(2.8.8), but with the scattering term

    S_s(x, \mu) = \int_{-1}^{1} d\mu' \, \sigma_s(x, \mu' \cdot \mu) \psi(x, \mu') = \sum_{l=0}^{N-1} (2l+1) \sigma_l(x) \phi_l(x) P_l(\mu),    (3.3.1)

where \phi_l is the l-th Legendre moment of \psi. Using this for the scattering term and multiplying (2.8.5)-(2.8.8) by h, we have

    \mu_j(\psi^+_{i,l,j} + \psi^+_{i,r,j}) - 2\mu_j \psi^+_{i-1,r,j} + \sigma_t h \psi^+_{i,l,j} = h \sum_{l=0}^{N-1} (2l+1)\sigma_l(x)\phi_{l,i,l} P_l(\mu_j),    (3.3.2)

    \mu_j(\psi^+_{i,r,j} - \psi^+_{i,l,j}) + \sigma_t h \psi^+_{i,r,j} = h \sum_{l=0}^{N-1} (2l+1)\sigma_l(x)\phi_{l,i,r} P_l(\mu_j),    (3.3.3)

i = 1, \ldots, m, j = 1, \ldots, n, and the negative-angle equations are

    \mu_j(\psi^-_{i,l,j} + \psi^-_{i,r,j}) - 2\mu_j \psi^-_{i+1,l,j} + \sigma_t h \psi^-_{i,r,j} = h \sum_{l=0}^{N-1} (2l+1)\sigma_l(x)\phi_{l,i,r} P_l(-\mu_j),    (3.3.4)

    \mu_j(\psi^-_{i,l,j} - \psi^-_{i,r,j}) + \sigma_t h \psi^-_{i,l,j} = h \sum_{l=0}^{N-1} (2l+1)\sigma_l(x)\phi_{l,i,l} P_l(-\mu_j),    (3.3.5)
i = 1, \ldots, m, j = 1, \ldots, n. Using vector notation in which \psi^+_i(\cdot) is an n-vector (\psi^+_i = (\psi^+_{i,1}, \ldots, \psi^+_{i,n})^T), we can rewrite these equations in matrix form as

    M_i(\psi^+_{i,l} + \psi^+_{i,r}) - 2M_i \psi^+_{i-1,r} + \sigma_t \psi^+_{i,l} = S^+ \psi_{i,l} + q^+_{i,l},    (3.3.6)

    M_i(\psi^+_{i,r} - \psi^+_{i,l}) + \sigma_t \psi^+_{i,r} = S^+ \psi_{i,r} + q^+_{i,r},    (3.3.7)

i = 1, \ldots, m, and the negative-angle equations are

    M_i(\psi^-_{i,l} + \psi^-_{i,r}) - 2M_i \psi^-_{i+1,l} + \sigma_t \psi^-_{i,r} = S^- \psi_{i,r} + q^-_{i,r},    (3.3.8)

    M_i(\psi^-_{i,l} - \psi^-_{i,r}) + \sigma_t \psi^-_{i,l} = S^- \psi_{i,l} + q^-_{i,l},    (3.3.9)

i = 1, \ldots, m, where the matrices S^+ and S^- represent the upper and lower parts (the positive- and negative-angle rows) of the matrix S = T^{-1}DT, respectively, \psi_{i,e} denotes the vector of fluxes over all angles at edge e of cell i, and the n \times n matrix M_i = diag(\mu_1/h_i, \ldots, \mu_n/h_i). If we use shifted source iteration to solve these matrix equations, the right-hand sides are assumed known, the matrix S becomes S = T^{-1}(D - \alpha I)T, and \sigma_t is replaced by \Sigma = \sigma_t - \alpha in the above equations.

3.4 Source Iteration in Parallel.

The source iteration is performed using cyclic reduction [15]. Using equations (3.3.6)-(3.3.9), the source iteration for the entire domain can be written as H\psi^{l+1} = f^l, where the matrix H and the vector f already contain the shifts (\alpha). The matrix H is 4mn \times 4mn. A submatrix of H, corresponding to 2 cells of a spatial domain with m cells (an 8n \times 8n matrix), is
    \begin{bmatrix}
    \Sigma I + M_i & 0 & -M_i & 0 & 0 & 0 & 0 & 0 \\
    0 & \Sigma I + M_i & 0 & M_i & 0 & 0 & 0 & 0 \\
    M_i & 0 & \Sigma I + M_i & 0 & -2M_i & 0 & 0 & 0 \\
    0 & -M_i & 0 & \Sigma I + M_i & 0 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & \Sigma I + M_{i+1} & 0 & -M_{i+1} & 0 \\
    0 & 0 & 0 & -2M_{i+1} & 0 & \Sigma I + M_{i+1} & 0 & M_{i+1} \\
    0 & 0 & 0 & 0 & M_{i+1} & 0 & \Sigma I + M_{i+1} & 0 \\
    0 & 0 & 0 & 0 & 0 & -M_{i+1} & 0 & \Sigma I + M_{i+1}
    \end{bmatrix}_{8n \times 8n}.    (3.4.1)

The variables for this matrix are ordered as

    \psi = (\psi^-_{i,l}, \psi^+_{i,l}, \psi^-_{i,r}, \psi^+_{i,r}, \psi^-_{i+1,l}, \psi^+_{i+1,l}, \psi^-_{i+1,r}, \psi^+_{i+1,r})^T.    (3.4.2)

The right-hand side for this two-cell system can be represented as

    f = (f^-_{i,l}, f^+_{i,l}, f^-_{i,r}, f^+_{i,r}, f^-_{i+1,l}, f^+_{i+1,l}, f^-_{i+1,r}, f^+_{i+1,r})^T.    (3.4.3)

Notice that this relaxation can be performed separately for the negative and positive variables, since the above system of equations decouples into two systems, one for the positive and another for the negative variables.

Positive-variable cyclic reduction. Writing the above system for the positive variables over the entire domain generates a block lower bidiagonal (BLB) system with block matrices of size 2n \times 2n. The associated matrix is
    \begin{bmatrix}
    D_1 & 0 & 0 & \cdots & 0 \\
    F_2 & D_2 & 0 & \cdots & 0 \\
    0 & F_3 & D_3 & \cdots & 0 \\
    \vdots & & \ddots & \ddots & \vdots \\
    0 & \cdots & 0 & F_m & D_m
    \end{bmatrix}_{2mn \times 2mn}.    (3.4.4)

The variables for this matrix are

    \psi^+ = (\psi^+_1, \psi^+_2, \ldots, \psi^+_m)^T,    (3.4.5)

where \psi^+_i = (\psi^+_{i,l}, \psi^+_{i,r}). The right-hand side for this smaller system is

    f^+ = (f^+_1, f^+_2, \ldots, f^+_m)^T,    (3.4.6)

and, from rows 2 and 4 of (3.4.1),

    D_i = \begin{bmatrix} \Sigma I + M_i & M_i \\ -M_i & \Sigma I + M_i \end{bmatrix}, \qquad F_i = \begin{bmatrix} 0 & -2M_i \\ 0 & 0 \end{bmatrix}.    (3.4.7)

Notice that the index i in this notation represents the variables evaluated for cell i: (\psi^+_{i,l}, \psi^+_{i,r}). If we perform the first step of cyclic reduction (eliminating every other
row, starting with the first row or cell), we obtain the following system:

    \begin{bmatrix}
    D_2 & 0 & 0 & \cdots & 0 \\
    F_4^2 & D_4 & 0 & \cdots & 0 \\
    0 & F_6^2 & D_6 & \cdots & 0 \\
    \vdots & & \ddots & \ddots & \vdots \\
    0 & \cdots & 0 & F_m^2 & D_m
    \end{bmatrix}_{mn \times mn},    (3.4.8)

where

    F_i^2 = \begin{bmatrix} 0 & -4 M_i M_{i-1}^2 \Delta_{i-1}^{-1} \\ 0 & 0 \end{bmatrix},

with \Delta_j = (\Sigma I + M_j)^2 + M_j^2. Repeating the elimination produces, at each level k, a system of the same form with stride \delta between the remaining cells,
where \delta = 2^{(k-1)} and

    F_i^k = \begin{bmatrix} 0 & -2^{2(k-1)} M_i M_{i-1}^2\Delta_{i-1}^{-1} M_{i-2}^2\Delta_{i-2}^{-1} \cdots M_{i-(2^{(k-1)}-1)}^2\Delta_{i-(2^{(k-1)}-1)}^{-1} \\ 0 & 0 \end{bmatrix}.    (3.4.10)

Coarsening of the cells in cyclic reduction for the positive variables is illustrated for m = 8 in Figure 3.2. If we keep coarsening the grids in this fashion, we end up with one cell with a matrix equation given by

    D_m \psi^+_m = f^k_m,    (3.4.11)

where k = log_2 m. The vector f^k_m is obtained through the same recursive elimination process used on the left-hand side of the matrix equations, where we eliminate every other row starting with the first row. After solving this equation for \psi^+_m, the variables that were eliminated previously can be calculated. For example, at level k the variables \psi^+_i can be calculated by the following expressions:

    \psi^+_i = D_i^{-1}(f^k_i - F^k_i \psi^+_{i-\delta}),    (3.4.12)

for i = m-\delta, m-3\delta, \ldots, 3\delta, and

    \psi^+_\delta = D_\delta^{-1} f^k_\delta,    (3.4.13)

where

    D_i^{-1} = \Delta_i^{-1} \begin{bmatrix} \Sigma I + M_i & -M_i \\ M_i & \Sigma I + M_i \end{bmatrix}.    (3.4.14)
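The elimination and back-substitution of (3.4.8)-(3.4.14) can be sketched with scalar stand-ins for the 2n x 2n blocks (our illustration only; the real algorithm applies the same recurrences blockwise, and in parallel across cells):

```python
import numpy as np

def cyclic_reduction(D, F, f):
    """Solve the lower bidiagonal system D[i]*x[i] + F[i]*x[i-1] = f[i]
    (scalar stand-ins for the blocks; F[0] = 0 at the boundary) by cyclic
    reduction; m = len(D) must be a power of two."""
    D = np.asarray(D, float).copy()
    F = np.asarray(F, float).copy()
    f = np.asarray(f, float).copy()
    m = len(D)
    idx, levels = np.arange(m), []
    while len(idx) > 1:
        drop, keep = idx[0::2], idx[1::2]    # eliminate every other row,
        levels.append((drop, keep))          # starting with the first
        for d, k in zip(drop, keep):
            c = F[k] / D[d]
            f[k] -= c * f[d]                 # fold row d into row k
            F[k] = -c * F[d]                 # new coupling, one stride back
        idx = keep
    x = np.zeros(m)
    x[idx[0]] = f[idx[0]] / D[idx[0]]        # the single remaining cell (3.4.11)
    for drop, keep in reversed(levels):      # back-substitution (3.4.12)-(3.4.13)
        for j, d in enumerate(drop):
            xl = x[keep[j - 1]] if j > 0 else 0.0
            x[d] = (f[d] - F[d] * xl) / D[d]
    return x
```

Each level halves the number of active cells, so with one processor per cell the whole sweep takes O(log_2 m) parallel steps, which is the source of the 2 log_2 m count quoted in Step 1.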
[Figure 3.2: Coarsening of cells during cyclic reduction for the positive variables. Starting from the edge variables \psi_1, \ldots, \psi_{2m+2} with m = 8, each level k = 1, \ldots, log_2 m retains every other cell, leaving the last cell's variables (\psi_{16}, \psi_{17}) at the coarsest level.]
We can describe the above algorithm in matrix notation. For example, to go from (3.4.4) to (3.4.8), we multiply the matrix at level 1 by the matrix

    \begin{bmatrix}
    -F_2 D_1^{-1} & I & 0 & 0 & \cdots & 0 & 0 \\
    0 & 0 & -F_4 D_3^{-1} & I & \cdots & 0 & 0 \\
    \vdots & & & \ddots & & & \vdots \\
    0 & 0 & \cdots & -F_{m-2} D_{m-3}^{-1} & I & 0 & 0 \\
    0 & 0 & \cdots & 0 & 0 & -F_m D_{m-1}^{-1} & I
    \end{bmatrix}.

This matrix, called the restriction matrix, has half the number of rows of the original, since it eliminates every other cell of the original grid. The restriction matrix to go from level k-1 to level k is

    \begin{bmatrix}
    -F^{k-1}_{2\delta} D^{-1}_{\delta} & I & 0 & \cdots & 0 \\
    0 & 0 & -F^{k-1}_{4\delta} D^{-1}_{3\delta} & I & \cdots \\
    \vdots & & & \ddots & \vdots \\
    0 & \cdots & 0 & -F^{k-1}_{m} D^{-1}_{m-\delta} & I
    \end{bmatrix},    (3.4.15)

where \delta is the stride at level k-1 (\delta = 2^{(k-2)}). To obtain f^k, the restriction operator is applied repeatedly to the right-hand side of the original matrix equation until level k is reached. At k = log_2 m, we solve the final matrix equation and start interpolating back to finer levels. The
interpolated variables at level k are calculated by a matrix equation that simply restates (3.4.12) and (3.4.13): the block rows corresponding to cells kept at level k+1 are identity rows (those values are copied), while each eliminated cell i is filled in as

    \psi^+_i = -D_i^{-1} F^k_i \, \psi^+_{i-\delta} + D_i^{-1} f^k_i, \qquad i = \delta, 3\delta, \ldots, m-\delta,    (3.4.16)

with F^k_\delta = 0, so that the first eliminated cell reduces to (3.4.13).

Negative-variable cyclic reduction. The same process can be applied to the negative-variable relaxation matrix, in which case we have a block upper bidiagonal (BUB) system with block matrices of size 2n \times 2n:
    \begin{bmatrix}
    D_1 & F_1 & 0 & \cdots & 0 \\
    0 & D_2 & F_2 & \cdots & 0 \\
    \vdots & & \ddots & \ddots & \vdots \\
    0 & \cdots & 0 & D_{m-1} & F_{m-1} \\
    0 & \cdots & 0 & 0 & D_m
    \end{bmatrix}_{2mn \times 2mn}.    (3.4.17)

The variables for this matrix are

    \psi^- = (\psi^-_1, \psi^-_2, \ldots, \psi^-_m)^T,    (3.4.18)

where \psi^-_i = (\psi^-_{i,l}, \psi^-_{i,r}). The right-hand side for this system is

    f^- = (f^-_1, f^-_2, \ldots, f^-_m)^T,    (3.4.19)

where f^-_i = (f^-_{i,l}, f^-_{i,r}), and

    D_i = \begin{bmatrix} \Sigma I + M_i & -M_i \\ M_i & \Sigma I + M_i \end{bmatrix}, \qquad F_i = \begin{bmatrix} 0 & 0 \\ -2M_i & 0 \end{bmatrix}.    (3.4.20)

Again, the index i in this notation represents the variables of cell i: (\psi^-_{i,l}, \psi^-_{i,r}). If we perform the first step of the cyclic reduction (eliminating every other row, starting with the last row or cell), we obtain the following system:
    \begin{bmatrix}
    D_1 & F_1^2 & 0 & \cdots & 0 \\
    0 & D_3 & F_3^2 & \cdots & 0 \\
    \vdots & & \ddots & \ddots & \vdots \\
    0 & \cdots & 0 & D_{m-3} & F_{m-3}^2 \\
    0 & \cdots & 0 & 0 & D_{m-1}
    \end{bmatrix}_{mn \times mn},    (3.4.21)

where

    F_i^2 = \begin{bmatrix} 0 & 0 \\ -4 M_i M_{i+1}^2 \Delta_{i+1}^{-1} & 0 \end{bmatrix}.    (3.4.22)

Repeating the elimination produces systems of the same form at each level,
where \delta = 2^{(k-1)} and

    F_i^k = \begin{bmatrix} 0 & 0 \\ -2^{2(k-1)} M_i M_{i+1}^2\Delta_{i+1}^{-1} M_{i+2}^2\Delta_{i+2}^{-1} \cdots M_{i+(2^{(k-1)}-1)}^2\Delta_{i+(2^{(k-1)}-1)}^{-1} & 0 \end{bmatrix}.    (3.4.23)

Coarsening of the cells in the cyclic reduction for the negative variables, starting with m = 8, is illustrated in Figure 3.3. If we keep coarsening the grids in this fashion, we end up with one cell with a matrix equation given by

    D_1 \psi^-_1 = f^k_1,    (3.4.24)

where k = log_2 m. After solving this equation for the vector \psi^-_1, the variables that were eliminated previously can be calculated. For example, at level k, the variables \psi^-_i can be calculated by the following matrix expressions:

    \psi^-_i = D_i^{-1}(f^k_i - F^k_i \psi^-_{i+\delta}),    (3.4.25)

for i = 1+\delta, 1+3\delta, \ldots, m-3\delta+1, and

    \psi^-_{m-\delta+1} = D^{-1}_{m-\delta+1} f^k_{m-\delta+1},    (3.4.26)

where

    D_i^{-1} = \Delta_i^{-1} \begin{bmatrix} \Sigma I + M_i & M_i \\ -M_i & \Sigma I + M_i \end{bmatrix}.    (3.4.27)

We can describe the above algorithm in matrix notation. For example, to go from (3.4.17) to (3.4.21), we multiply the matrix at level 1 by the matrix

    \begin{bmatrix}
    I & -F_1 D_2^{-1} & 0 & 0 & \cdots & 0 \\
    0 & 0 & I & -F_3 D_4^{-1} & \cdots & 0 \\
    \vdots & & & \ddots & & \vdots \\
    0 & 0 & \cdots & 0 & I & -F_{m-1} D_m^{-1}
    \end{bmatrix}.
[Figure 3.3: Coarsening of cells during cyclic reduction for the negative variables (m = 8); each level retains every other cell starting from the last, leaving the first cell's variables (\psi_2, \psi_3) at the coarsest level.]
This matrix, called the restriction matrix, has half the number of rows of the original matrix, since it eliminates every other cell of the original grid. The restriction matrix to go from level k-1 to level k is

    \begin{bmatrix}
    I & -F^{k-1}_{1} D^{-1}_{1+\delta} & 0 & \cdots & 0 \\
    0 & 0 & I & -F^{k-1}_{1+2\delta} D^{-1}_{1+3\delta} & \cdots \\
    \vdots & & & \ddots & \vdots \\
    0 & \cdots & 0 & I & -F^{k-1}_{m-2\delta+1} D^{-1}_{m-\delta+1}
    \end{bmatrix},    (3.4.28)

where \delta is the stride at step k-1 (\delta = 2^{(k-2)}). To obtain f^k, we just keep applying the restriction operator to the right-hand side of the original matrix equation until level k is reached. At k = log_2 m, we solve the final matrix equation and start interpolating back to finer levels. The interpolated variables at level k are calculated by a matrix equation that restates (3.4.25) and (3.4.26): the block rows of cells kept at level k+1 are identity rows, while each eliminated cell i is filled in as

    \psi^-_i = -D_i^{-1} F^k_i \, \psi^-_{i+\delta} + D_i^{-1} f^k_i, \qquad i = 1+\delta, 1+3\delta, \ldots, m-\delta+1,

with F^k at the last eliminated cell equal to zero.
3.5 Angular Multigrid on the CM-200.

3.5.1 Storage. We assign one processor to each point (i.e., each (x_i, \mu_j) pair). At different angular levels we have different Gauss quadrature points. Consider the set of processors that contain the same spatial grid point and different angular grid points. Each of these processors will contain one angle from each angular level. For example, processor P_1 will store \mu_1 for level 1 and all the variables calculated at this point; it will also contain \mu_1 for level 2. Figure 3.4 illustrates the distribution of data for N = 8 (n = 4) on the CM-200. In our implementation we always have two variables associated with the same quantity (one for the positive angles, another for the negative angles), for example, \psi^+(x_i, \mu_j) and \psi^-(x_i, \mu_j), q^+(x_i, \mu_j) and q^-(x_i, \mu_j), and so on. As in Chapter 2, we use some duplication of data to avoid communication. For example, the variables w_j (a one-dimensional array) are stored as a two-dimensional array (w_{i,j}). The diagonal matrices M_i are stored as full matrices, where each column is a replication of the diagonal. At each processor we store the Legendre polynomials of all degrees evaluated at the angle associated
with that processor. Figure 3.5 illustrates this for the case where the starting number of scattering angles is eight.

3.5.2 Algorithm. Now we describe the steps to perform angular multigrid on the CM-200.

1. Initialization. This is performed completely with in-place calculations.

2. Calculation of the right-hand side for the shifted source iteration. The right-hand side variables for the shifted source iteration step are assumed known. They are calculated through the expression f = S\psi + q, where S = T^{-1}DT (the matrix D contains the shift \alpha). The initial guess for the flux on the right-hand side of the matrix equation may be the zero vector (\psi^0 = 0) or a nonzero vector. For a nonzero vector, we notice that multiplication of the vector \psi^l by the matrix S can be performed in three stages:

- multiplication by the matrix T;
- multiplication by the diagonal matrix D;
- multiplication by the matrix T^{-1}.

These three stages are performed at different points of the algorithm and are described next. After the multiplication by the matrix S is performed, we have a resultant vector whose entries are stored conformable to the entries of the vector q. Consequently, the addition of this vector to q takes only one parallel step. Details about the implementation of the parallel cyclic reduction are discussed in the following section.

Multiplication of a vector by the N \times N matrix T_k: z = T_k y.
[Figure 3.4. Data structure for a particular spatial grid point and for different angular levels (k = 1, 2), showing the fluxes and the Legendre moments.]
[Figure 3.5. Data structure for Legendre polynomials for a particular spatial grid point and for two angular levels, starting with N = 4.]
Every row of the matrix T, given in (3.2.7), updates one entry of the moment vector z for all the spatial grid points. For example, processor P_{i,j}, representing angle \mu_j and a spatial point x_i, uses row j of the matrix T to update its entry z_j of the vector z. In particular, the processors that contain the entries of angle \mu_1 use the first row of the matrix T for updating the variables at \mu_1. Looking back at the data structure, we can observe that the multiplications w_j P_{k-1}(\mu_j) y_j are all performed in place and are done simultaneously for all the points in the spatial and angular domain. The next step is to perform a summation across the angles. This summation is done with the CM-200 intrinsic function sum, and is also done for all the spatial points simultaneously. The only sequential part of this matrix multiplication is due to the single-instruction mode of the CM-200: the different processors use different polynomials (because they are updated with different rows of the matrix T) to update the entries of z, and this has to be done sequentially for this kind of data structure on an SIMD computer. This step executes in N stages for the entire angular and spatial domain. We can express this as

    do k = 1, N
        t_j = w_j P_{k-1}(\mu_j) y_j    forall procs_j (j = 1, ..., N)
        z_k = sum(t_j, 2)
    continue

for all i = 1, ..., 2m+2 simultaneously. On MIMD computers, e.g. the CM-5, different processors can perform different kinds of instructions simultaneously, so an algorithm like this takes one parallel step on a MIMD computer (each processor would be evaluating a Legendre polynomial of a different degree).
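The N-stage summation across angles described above can be mimicked with a numpy array standing in for the processor grid (our sketch, not the CM Fortran code; on the CM-200 the elementwise product inside each stage is a single parallel step):

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, Legendre

def moments_staged(y, mu, w):
    """z = T y computed the SIMD way: stage k forms t_j = w_j P_k(mu_j) y_j in
    every 'processor' j at once, then one sum-reduction across the angle axis
    yields moment z_k; the N stages themselves run sequentially."""
    N = len(mu)
    z = np.empty(N)
    for k in range(N):                       # sequential over moment degree
        t = w * Legendre.basis(k)(mu) * y    # elementwise: in place per processor
        z[k] = t.sum()                       # reduction across the angle axis
    return z
```

The staged result agrees with an ordinary matrix-vector product by T, which is why only N stages, not N^2 operations per spatial point, show up in the parallel timing.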
Multiplication of a vector by the N \times N matrix T_k^{-1}: z = T_k^{-1} y.

Multiplication by T^{-1}, given in (3.2.8), is performed in a manner similar to that for T. To illustrate, consider updating the processors that store the variables at \mu_1 and any of the 2m+2 spatial grid points. These updates are done using the first row of the matrix T^{-1}. Here we notice that the entries of this row are all stored at the processor that contains \mu_1 (as shown in Figure 3.5), but they are multiplied by array elements of y distributed across different processors throughout the CM-200. Furthermore, it is not convenient to use the CM-200 shifts, since the shifts for different rows differ for the same elements of y: the variable y_3, for example, needs a shift of 2 for row 1 and a shift of 1 for row 2. If we create a new vector that stores the partial summations for all spatial and angular points, the multiplication completes in N steps, where N-1 is the maximum degree of the polynomials at each processor. The partial summations are calculated as

    do d = 1, N
        sp_j = sp_j + (2d-1) P_{d-1}(\mu_j) y_d    for procs_j (j = 1, ..., n)
    continue

for i = 1, ..., 2m+1 simultaneously.

Multiplication of a vector by the N \times N diagonal matrix D_k: z = D_k y. This multiplication is done in place, since each processor has the appropriate entry of the matrix and only multiplies its own vector entry:

    z_j = d_j y_j    for i = 1, ..., 2m+1, j = 1, ..., n simultaneously.

3. Calculation of the residual. This step is performed completely in parallel and takes only one step for
the whole spatial and angular domain.

4. Restriction of the residual (coarsening). First, we change from the flux to the moment domain. This is accomplished through the multiplication by the matrix T_k, described above for a generic vector. Second, we perform the truncation of moments. This is obtained automatically by disregarding the processors that store the high moments at that level. Finally, we transform the moments back into fluxes through multiplication by T_{k+1}^{-1}, as described above for a generic vector.

5. Calculation of the error on the coarsest grid. This is performed by using the parallel isotropic scattering solver developed in Chapter 2.

6. Interpolation of the error. Multiplication by the matrix T_k is performed as described previously. The additions of the vectors e_2, e_3, ..., e_{log_2 n} are performed with in-place calculations in one parallel step.

7. Correction of the flux at the first angular level. Multiplication by the matrix T_1^{-1} is as previously described. The addition of the resultant vector to the flux at the finest angular level is done in place.

3.6 Cyclic Reduction on the CM-200.

3.6.1 Storage and Initialization. Each column of the matrix in (3.4.1) multiplies data belonging to a unique cell (for each level of the cyclic reduction algorithm). Each F_i^k (3.4.10) has a single nonzero submatrix, which we call fe_i^k, and which depends on the grid level k and the spatial cell i. For nonuniform grids, the matrices F_i^k are different for distinct cells. The level index for this variable is stored sequentially in each processor, as is the case for all the variables that contain a level
index. The storage is done this way because the different calculation levels of the cyclic reduction must be performed sequentially (the reduction takes log_2 m steps and each level uses results from the preceding level). Throughout the transfer of data between grids (coarsening and interpolation), this storage minimizes communication. In this section we omit the angular index, since all the steps of the parallel cyclic reduction are performed for all angles simultaneously. Most of the steps for the spatial grid points are also performed in parallel, even in the case of variable mesh sizes. For example, all processors have their own definition of fe_i^k and perform their calculation simultaneously. Notice that fe_i^k has more product terms on coarser grid levels (larger k). The algorithm for the calculation of fe on the CM-200 is shown here:

    do 2 k = 1, ml
        t  = M
        t1 = M
        t2 = delta
        do 1 ii = 1, 2**(k-1) - 1
            t1 = shift(t1, 2, 1)
            t2 = shift(t2, 2, 1)
            t  = t1**2 * t / t2
    1   continue
        fe(k) = t * 2**(2*(k-1))
    2   continue

for all i = 1, ..., 2m+2, j = 1, ..., n simultaneously; ml is the number of levels in the cyclic reduction (ml = log_2 m).
3.6.2 Algorithm for Cyclic Reduction on a SIMD Computer. Here we describe the algorithm for the positive-variable cyclic reduction; the negative-variable cyclic reduction is performed in a similar way.

1. Coarsening. Coarsening (restriction) for cyclic reduction at each level k is performed through the multiplication of the matrix in (3.4.9) by the matrix in (3.4.15). Note that the only nonzero elements of the latter matrix belong to the identity blocks and to block matrices of the form F_i^k D_{i-\delta}^{-1}. The matrix D_i^{-1} is calculated analytically in (3.4.14) and can be represented here as

    D_i^{-1} = \begin{bmatrix} d_{11} & d_{12} \\ d_{21} & d_{22} \end{bmatrix}.    (3.6.1)

Consequently, the matrix F_i^k D_{i-\delta}^{-1} is represented by

    F_i^k D_{i-\delta}^{-1} = \begin{bmatrix} fe_i^k d_{21} & fe_i^k d_{22} \\ 0 & 0 \end{bmatrix}.    (3.6.2)

Notice that this matrix has only two nonzero n \times n submatrices, fe_i^k d_{21} and fe_i^k d_{22}, henceforth denoted r1 and r2, respectively. These submatrices multiply q_{2i} and q_{2i+1} (the source terms at the left and right sides of a cell, respectively). The first step of the coarsening algorithm on the CM-200 is

    t1 = r1 * q
    t2 = r2 * q

At cell i, these products are required from its left neighbor (cell i-\delta). Thus, we need to shift the two arrays (t1 and t2) from the left by \delta:

    t1 = shift(t1, 1, \delta)
    t2 = shift(t2, 1, \delta)

where \delta is the stride at step k-1 (\delta = 2^{(k-1)}).
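The multiply-shift-mask pattern of the coarsening step can be sketched in numpy (our illustration; np.roll plays the role of the CM-200 shift and a boolean array the role of maske, and the coefficient is a scalar stand-in for the blocks r1, r2):

```python
import numpy as np

def masked_neighbor_update(q, coeff, delta):
    """Sketch of the CM-200 update pattern: every point multiplies its own
    source by a local coefficient, the array is shifted so that each cell
    sees its left neighbour's contribution, and only the left-edge points
    (even indices, playing the role of maske) accumulate it."""
    t = coeff * q                      # local multiply in every processor
    t = np.roll(t, delta)              # fetch the left neighbour's result
    out = q.copy()
    mask_left = np.arange(len(q)) % 2 == 0
    out[mask_left] += t[mask_left]     # where(maske) q2 = q + t; masko copies q
    return out
```

The point of the pattern is that the multiply, the shift, and the masked add are each one data-parallel step, independent of the number of cells.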
Finally, we can update the right- and left-side variables for each cell through the use of two masks that distinguish the right and left sides of each spatial cell (in the matrix, the left and right sides of a cell correspond to the first n rows and the second n rows of each 2n block row, respectively). The mask maske selects all the left-side variables for all the cells, and masko selects all the right-side variables. The algorithm to update the two kinds of variables on the CM-200 is

    forall (i = 2 : 2m+2 : 2) maske(i, :) = .true.
    forall (i = 3 : 2m+2 : 2) masko(i, :) = .true.
    where (maske)
        q2h = qh + t1 + t2
    endwhere
    where (masko)
        q2h = qh
    endwhere

Notice that during the update of the variables for the right side of each cell, there is no communication between the processors.

2. Solution of the block diagonal system at the coarsest level. When we get to the coarsest level, we end up with a block diagonal system that involves only the last cell, as shown in (3.4.11). The inverse of the matrix D_m is calculated analytically in (3.4.14) and is represented in the following algorithm as in (3.6.1). To update the variables at this level, we can use maske and masko (defined previously). Since the first n rows of the system \psi_m = D_m^{-1} f_m represent the variables at the left side of cell m, we use maske to update these variables. Recall that the vector f^k_m represents the right-hand side of the matrix system at level k = log_2 m. The algorithm for the left-side variable update of each cell is
    t = eoshift(f, 1, 1)
    where (maske)
        psi = d11*f + d12*t
    endwhere

The algorithm for the right-side variable update of each cell is

    t = eoshift(f, 1, -1)
    where (masko)
        psi = d21*t + d22*f
    endwhere

Notice that the shifts on the coarsest grid are both unit shifts.

3. Interpolation. Interpolation for level k is accomplished through the application of (3.4.12) and (3.4.13) to the variables of level k+1, or through the use of the matrix equation shown in (3.4.16). The block matrices necessary for the interpolation step are D_i^{-1} and D_i^{-1} F_i^k. The matrix D_i^{-1} is represented here as in (3.6.1); thus the matrix D_i^{-1} F_i^k is given by

    D_i^{-1} F_i^k = \begin{bmatrix} 0 & d_{11} fe_i^k \\ 0 & d_{21} fe_i^k \end{bmatrix},

where fe_i^k is the only nonzero n \times n block submatrix of F_i^k. Thus, on the CM-200 the interpolation algorithm for the variables at the left side of each cell is

    t0 = d11*f
    t1 = d12*f
    t2 = d11*fe*psi
    delta2 = 2*delta + 1
    where (maske)
        psi = t0 + eoshift(t1, 1, 1) + eoshift(t2, 1, 1)
        psi = eoshift(psi, 1, delta2)
    endwhere

For all the right-side cell variables we have

    t0 = d21*f
    t1 = d22*f
    t2 = d21*fe*psi
    where (masko)
        psi = eoshift(t0, 1, -1) + t1 + t2
        psi = eoshift(psi, 1, delta2)
    endwhere

3.7 Stability of the Cyclic Reduction.

The transport sweep is performed by cyclic reduction, as was shown in Section 3.4. In proceeding from level k to level k+1, subvectors of the right-hand side vector are multiplied by F_i^k D_{i-\delta}^{-1}. To prove stability, it is necessary to prove that the elements of the matrices F_i^k are bounded at all levels k. For the positive variables, we have shown that

    F_i^k = \begin{bmatrix} 0 & fe_i^k \\ 0 & 0 \end{bmatrix},    (3.7.3)

where

    fe_i^k = -2^{2(k-1)} M_i \prod_{s=1}^{2^{(k-1)}-1} M_{i-s}^2 \Delta_{i-s}^{-1}    (3.7.4)

and

    \Delta_j = (\Sigma I + M_j)^2 + M_j^2.    (3.7.5)

We want to prove that \|F_i^k\| \le \|F_i^1\| for k = 1, \ldots, log_2 m.
LEMMA 3.7.1. Let M_j and \Delta_j be diagonal matrices, where \Delta_j is given by (3.7.5). Then

    \| M_j^2 \Delta_j^{-1} \| \le 1/2.

PROOF. From the definition of the matrix norm, \|M_j^2 \Delta_j^{-1}\| is the largest diagonal entry m^2 / ((\Sigma + m)^2 + m^2), where m > 0 runs over the entries of M_j. Since \Sigma > 0 implies (\Sigma + m)^2 \ge m^2, each entry is bounded by m^2/(2m^2) = 1/2. \square

THEOREM 3.7.2 (Stability of the positive-variable cyclic reduction). Let F_i^k and F_i^1 be given by (3.4.10) and (3.4.7), respectively. Then \|F_i^k\| \le \|F_i^1\| for k = 1, \ldots, log_2 m.

PROOF. From the definition of F_i^k, its norm is equal to the norm of its nonzero n \times n block submatrix:

    \|F_i^k\| = 2^{2(k-1)} \| M_i M_{i-1}^2\Delta_{i-1}^{-1} M_{i-2}^2\Delta_{i-2}^{-1} \cdots M_{i-(2^{(k-1)}-1)}^2\Delta_{i-(2^{(k-1)}-1)}^{-1} \|.

By Lemma 3.7.1, we thus have

    \|F_i^k\| \le 2^{2(k-1)} \left(\tfrac{1}{2}\right)^{2^{(k-1)}-1} \|M_i\| = 2^{\,2k-1-2^{(k-1)}} \|M_i\| \le 2\|M_i\| = \|F_i^1\|,

since 2(k-1) \le 2^{(k-1)} for all k \ge 1. \square

THEOREM 3.7.3 (Stability of the negative-variable cyclic reduction). Let F_i^k and F_i^1 be given by (3.4.23) and (3.4.20), respectively. Then \|F_i^k\| \le \|F_i^1\| for k = 1, \ldots, log_2 m.
PROOF. From the definition of F_i^k, its norm is equal to the norm of its nonzero n \times n block submatrix:

    \|F_i^k\| = 2^{2(k-1)} \| M_i M_{i+1}^2\Delta_{i+1}^{-1} \cdots M_{i+(2^{(k-1)}-1)}^2\Delta_{i+(2^{(k-1)}-1)}^{-1} \| \le 2^{2(k-1)} \left(\tfrac{1}{2}\right)^{2^{(k-1)}-1} \|M_i\| \le 2\|M_i\| = \|F_i^1\|. \square

3.8 Experimental Results.

In this section we first compare the convergence rates of the two kinds of angular multigrid cycle used in this chapter, V(1,0) and V(1,1). We then compare the timings of one V(1,0) cycle and one V(1,1) cycle, to determine which gives the better convergence rate once the difference in cost is taken into account. We also show how these algorithms behave as m and as n are increased. The convergence factors are measured for the forward-peaked Fokker-Planck scattering model. Table 3.1 shows the convergence rates for m = 4 and various values of n. The convergence rates for V(1,0) (\rho_{1,0}) and V(1,1) (\rho_{1,1}) increase as n increases, but appear to reach an upper bound for n \ge 32 (\rho_{1,0} = .57 and \rho_{1,1} = .34). This agrees with the Fourier analysis in [14], which predicts \rho \le .36 for the V(1,1) cycle. As expected, the convergence rates for V(1,1) are approximately the square of the convergence rates for V(1,0) (i.e., \rho_{1,1} \approx \rho_{1,0}^2) for the various values of N. The two
Table 3.1. Convergence rates for the V(1,1) and V(1,0) angular multigrid cycles for constant m (m = 4) and various values of N (S_N cases).

    n     V(1,0)   V(1,1)
    2     0.32     0.12
    4     0.47     0.22
    8     0.54     0.29
    16    0.56     0.33
    32    0.57     0.34
    64    0.57     0.34
    128   0.57     0.34
Table 3.2. Timings in seconds for the V(1,1) and V(1,0) angular multigrid cycles for constant m (m = 4) and various values of n (S_N cases, N = 2n) on the CM-200 and Cray Y-MP.

    n     V(1,0) Cray   V(1,1) Cray   V(1,0) CM   V(1,1) CM
    2     0.023         0.024         0.34        0.23
    4     0.062         0.065         0.37        0.36
    8     0.180         0.188         0.56        0.62
    16    0.665         0.694         0.74        1.01
    32    2.872         2.973         1.35        2.02
    64    14.780        15.102        2.68        3.90
    128   87.920        88.593        5.07        8.73
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. I; '' Table 3.3. Timings in seconds for the V(1, 1) and V(1, 0) angular multigrid cycles for constant m (m = 64) and various values of n (SN cases, N = 2n) on the CM200 and Cray YMP. n V(1, 0)Gray V(1, 1)Gray V(1,0)GM2 V(1, 1)GM2 2 0.335 0.363 0.51 .72 4 0.59 0.671 0.69 1.09 8 0.922 1.093 0.94 1.71 16 1.943 2.320 1.61 2.76 32 5.768 6.790 2.83 5.74 64 22.993 26.010 5.17 10.32 128 114.980 125.210 12.77 25.42 100
Table 3.4. Comparison between the V(1,1) and V(1,0) anisotropic codes on the CM-200 for various values of n (S_N cases, N = 2n): convergence rates, with the V(1,0) rate normalized by the cycle-time ratio.

      n    ρ1,0^(t1,1/t1,0)   ρ1,1
      2         0.19          0.12
      4         0.46          0.22
      8         0.50          0.29
     16         0.45          0.33
     32         0.43          0.34
     64         0.44          0.34
    128         0.43          0.34
Figure 3.6: Timings for a V(1,1) cycle varying n, with m = 4 (CM-200 and Cray Y-MP V(1,1) timings; seconds against number of angles n = 2, ..., 128).
Figure 3.7: Timings for a V(1,1) cycle varying n, with m = 64 (CM-200 and Cray Y-MP V(1,1) timings; seconds against number of angles n = 2, ..., 128).
methods would be equivalent if a V(1,1) cycle took twice as long as a V(1,0) cycle. Tables 3.2 and 3.3 show the timings on the Cray Y-MP and CM-200 for m = 4 and m = 64, respectively. If we consider the parallel timings on the CM-200 for V(1,1) (t1,1) and V(1,0) (t1,0), then, using the time for one V(1,1) cycle as the standard unit, we can compare the convergence rates of the two algorithms (Table 3.4). It is clear from this table that V(1,1) is superior to V(1,0) on the CM-200. The same conclusions hold for our sequential codes on the Cray Y-MP. This computer has 64 vector registers of length 64 bits; for our timings we used only one of the 8 processors available on the Cray Y-MP. Figures 3.6 and 3.7 show how much faster the CM-200 V(1,1) codes are than the Cray Y-MP sequential version.

In general, our algorithms behaved as expected. When N increases, we notice that, even though the number of levels in the angular multigrid scheme is log2 n, the timings for the whole algorithm are proportional to N. This is caused mostly by the T^{-1} and T multiplications, which were shown in Section 3.5 to take N steps. This means that whenever N is doubled, the timings on the CM-200 at least double, which is exactly what Tables 3.2 and 3.3 show. Of course, if we consider the same algorithm sequentially, we should expect the time spent to be at least proportional to N²; this can be observed in the same tables for the sequential timings on the Cray Y-MP.

For the same number of angles, as m increases the convergence rates do not change considerably (Table 3.5), while the timings increase, due mainly to the cyclic reduction step. These increases are proportional to 2 log2 m for the V(1,0) cycle and 4 log2 m for the V(1,1) cycle. The same happens at the coarsest level when the parallel multigrid isotropic code is used, as shown earlier in Chapter 2.
The behavior when we increase m is shown in Table 3.6.
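The time normalization behind Table 3.4 can be reproduced from Tables 3.1 and 3.2: a cycle with convergence factor ρ and cost t per cycle reduces the error by ρ^(s/t) in s seconds, so raising ρ1,0 to the power t1,1/t1,0 expresses the V(1,0) rate per unit of V(1,1) time. Below is a minimal sketch, not the thesis code; the CM-200 values for m = 4 are transcribed from the tables above, and small residual discrepancies with Table 3.4 would reflect run-to-run timing variation.

```python
# Fair comparison of the V(1,0) and V(1,1) cycles: a cycle with
# convergence factor rho and cost t reduces the error by rho**(s/t)
# in s seconds, so rho10**(t11/t10) is the V(1,0) rate measured in
# units of one V(1,1) cycle (the normalization behind Table 3.4).

def time_normalized_rate(rho, t_cycle, t_unit):
    """Error-reduction factor of this cycle per t_unit seconds."""
    return rho ** (t_unit / t_cycle)

# CM-200 data for m = 4, from Tables 3.1 and 3.2:
# n -> (rho10, rho11, t10 [s], t11 [s])
data = {
    8:  (0.54, 0.29, 0.56, 0.62),
    16: (0.56, 0.33, 0.74, 1.01),
    32: (0.57, 0.34, 1.35, 2.02),
    64: (0.57, 0.34, 2.68, 3.90),
}

for n, (rho10, rho11, t10, t11) in sorted(data.items()):
    eff10 = time_normalized_rate(rho10, t10, t11)
    # V(1,1) wins whenever its factor beats the time-normalized V(1,0) one.
    print(n, round(eff10, 2), rho11, rho11 < eff10)
```

For n = 32 this gives a normalized V(1,0) factor of 0.43 against ρ1,1 = 0.34, matching the corresponding row of Table 3.4.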
Table 3.5. Convergence rates for the V(1,0) and V(1,1) angular multigrid cycles for N = 8 and various values of m.

     m    V(1,0)   V(1,1)
     4    0.53     0.22
    16    0.53     0.28
    32    0.53     0.28
    64    0.53     0.28

Table 3.6. Timings in seconds for the V(1,0) and V(1,1) angular multigrid cycles for various values of m and N = 8 on the CM-200.

     m    V(1,1)   V(1,0)
     4    0.54     0.53
     8    0.54     0.45
    16    0.63     0.56
    32    0.76     0.72
    64    1.14     1.46
CHAPTER 4
CONCLUSIONS

In this chapter, we summarize our results for the parallel multilevel methods developed in this thesis. Efficient parallel algorithms for the neutron transport equations were developed. In Chapter 2 the isotropic case was considered. A multigrid scheme was used in which various types of relaxation were considered: Jacobi V(1,1), Jacobi V(2,2), and red-black Gauss-Seidel V(1,1) µ-line two-cell relaxations were implemented and analyzed. After an evaluation of CM-2 timings, we concluded that Jacobi V(1,1) is the best method for this type of architecture. We also showed, through comparisons with a sequential implementation on the Cray Y-MP, that the parallel algorithms are much faster than the sequential ones and consequently represent an important step toward the development of parallel algorithms for these equations. Some steps of the algorithm could be made even more efficient with the use of MIMD architectures like the CM-5.

In Chapter 3 we developed an algorithm for solving the anisotropic transport equation. The sequential angular multigrid algorithm was developed earlier by Morel and Manteuffel [14] and contains a step that is traditionally sequential. In this thesis that step was parallelized by using cyclic reduction. Depending on the way the matrix equations are written, this process can be stable or unstable; we wrote these equations in a convenient form and proved their stability. Cyclic reduction
is a well-known algorithm for solving linear systems of equations and is available in some software libraries. These libraries, however, are written for generic matrix equations and do not exploit the specific matrix structure that we have. This makes our development of parallel cyclic reduction important as an example of an approach for parallel computers that pays special attention to the structure of the problem. The steps we used to create an efficient parallel cyclic reduction algorithm can also be applied when cyclic reduction appears in other contexts. We improved our sequential versions toward more vectorizable codes with a few compiler directives and small changes, but we do not claim to have used a completely vectorized implementation in our measurements.

From our experimental results we conclude that the V(1,1)-cycle is, in general, preferable to the V(1,0)-cycle on the CM-200. Note that both of these algorithms are faster than their sequential versions; our parallel timings behaved like O(N), while our sequential codes behaved like O(N²). This behavior was dictated by the transformations from the flux domain to the moment domain and vice versa, which are accomplished through multiplications by N x N matrices. These matrix multiplications are the numerical tools used to obtain a Legendre expansion of a function f (multiplication by T) and to evaluate a function at N nodes, given an N-term Legendre expansion (multiplication by T^{-1}). The sequential steps for these transforms are known to be O(N²); recently a fast algorithm was developed in [1] which takes O(N log N) sequential steps. However, that algorithm requires evaluation at the Chebyshev points in [-1, 1] rather than at the Gauss quadrature points, which would imply that the transformation between the moment and flux representations is not exact.
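The structure exploited in the parallel step can be illustrated on a scalar model problem. The following is a minimal sketch of cyclic reduction for a tridiagonal system with n = 2^k - 1 unknowns; it is a simplification for illustration, since the thesis's version operates on block equations and exploits their specific structure on the CM-200.

```python
# Cyclic reduction for a scalar tridiagonal system A x = d, with
# sub-, main, and super-diagonals a, b, c (a[0] = c[-1] = 0) and
# n = 2**k - 1 unknowns.  At stride h, every equation in the level
# folds its neighbors i-h and i+h into itself; all equations in a
# level are updated independently, which is what maps well to SIMD.

def cyclic_reduction(a, b, c, d):
    a, b, c, d = list(a), list(b), list(c), list(d)
    n = len(b)
    # Forward elimination: after the level at stride h, equation i
    # couples only x[i-2h], x[i], x[i+2h].
    h = 1
    while 2 * h <= n:
        for i in range(2 * h - 1, n, 2 * h):
            al = -a[i] / b[i - h]
            ga = -c[i] / b[i + h] if i + h < n else 0.0
            b[i] += al * c[i - h] + (ga * a[i + h] if i + h < n else 0.0)
            d[i] += al * d[i - h] + (ga * d[i + h] if i + h < n else 0.0)
            a[i] = al * a[i - h]
            c[i] = ga * c[i + h] if i + h < n else 0.0
        h *= 2
    # Back substitution: solve the middle unknown, then fill in by halves.
    x = [0.0] * n
    h = (n + 1) // 2
    x[h - 1] = d[h - 1] / b[h - 1]
    h //= 2
    while h >= 1:
        for i in range(h - 1, n, 2 * h):
            left = a[i] * x[i - h] if i - h >= 0 else 0.0
            right = c[i] * x[i + h] if i + h < n else 0.0
            x[i] = (d[i] - left - right) / b[i]
        h //= 2
    return x
```

Each inner loop is data-parallel, so the sequential depth is the number of levels, O(log2 n), rather than the O(n) sweep of ordinary tridiagonal elimination.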
Our CM-2 algorithm for the Legendre transforms is O(N), and could be O(1) on a MIMD computer. It represents an improvement that can be used in many applications, including algorithms involving quadratures, approximation theory, and the solution
of PDEs. In this thesis we dealt with non-absorption scattering cases (c = 1). Recent work has been done for isotropic scattering with various levels of absorption [11] in slab geometry. Currently we are working on a parallel version of the anisotropic model with absorption.

Our final remark is that we should keep in mind that all the work developed here was for the one-dimensional transport equations, but the same kinds of algorithms can be applied to two-dimensional and three-dimensional transport equations, and they should maintain their superiority over sequential algorithms. In addition to choosing very accurate methods when using parallel computers, we should search for algorithms that take advantage of the kind of architecture we are using. The best algorithms thus will represent a compromise between accuracy and time efficiency, will be architecture dependent, and will evolve as new architectures are designed.
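The transformations T and T^{-1} discussed in the conclusions can be sketched in a few lines. The construction below is illustrative, not the thesis code: it builds both dense N x N matrices from the Gauss-Legendre nodes, and because N-point Gauss quadrature integrates polynomials of degree up to 2N - 1 exactly, the flux-to-moment and moment-to-flux maps are exact inverses of one another, the property that is lost when Chebyshev points are used instead.

```python
import numpy as np

# Moment <-> flux transformations for an S_N angular discretization:
# the flux is sampled at the N Gauss-Legendre points mu_j, and the
# moments are the Legendre expansion coefficients phi_l, l = 0..N-1.
# Both directions are dense N x N matrix multiplies, hence O(N^2)
# work sequentially.  (Construction and names are illustrative.)

def legendre_transform_pair(N):
    mu, w = np.polynomial.legendre.leggauss(N)  # quadrature nodes, weights
    # P[l, j] = P_l(mu_j), built with the three-term recurrence
    # (l+1) P_{l+1} = (2l+1) mu P_l - l P_{l-1}.
    P = np.zeros((N, N))
    P[0] = 1.0
    if N > 1:
        P[1] = mu
    for l in range(1, N - 1):
        P[l + 1] = ((2 * l + 1) * mu * P[l] - l * P[l - 1]) / (l + 1)
    Tinv = P.T                                   # moments -> point values
    T = ((2 * np.arange(N) + 1) / 2)[:, None] * P * w  # point values -> moments
    return T, Tinv, mu

T, Tinv, mu = legendre_transform_pair(8)
# Because the quadrature is exact through degree 2N - 1, T @ Tinv is
# the identity up to rounding.
```

Applying either matrix costs O(N²) sequential operations; the fast algorithm of [1] reaches O(N log N), but only at the Chebyshev points.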
Appendix

The inverse of the matrix E in (3.21) is given by

    E^{-1} = [e_ij],  i, j = 1, ..., 4,                                             (A.1)

where

    (A.2)-(A.10)

    e_32 = { [d1 d2 (c2 b1 + a1 a2)](b2) + (a22 b1 + b2 a1)(a2) } / det2,           (A.11)

    e_33 = [d1 d2 (c2 b1 + a1 a2)](d1) / det2,                                      (A.12)

    e_34 = (a22 b1 + a1 b2)(d1) / det2,                                             (A.13)

    (A.14)

    e_42 = { (c2 a11 + a2 c1)(b2) + [d1 d2 (a22 a11 + b2 c1)](a2) } / det2,         (A.15)

    e_43 = (c2 a11 + a2 c1)(d1) / det2,                                             (A.16)

    e_44 = d1 d2 (a11 a22 + b2 c1)(d1) / det2,                                      (A.17)

    det1 = [d1 d2 (c1 b2 + a1 a2)][d1 d2 (a11 a22 + b1 c2)](c1 a22 + a1 c2)(a11 b2 + b1 a2),   (A.18)

and

    (A.19)
BIBLIOGRAPHY

[1] Bradley K. Alpert and Vladimir Rokhlin, "A Fast Algorithm for the Evaluation of Legendre Expansions", SIAM Journal on Scientific and Statistical Computing, 12, No. 1, pp. 158-179, January 1991.

[2] Kenneth M. Case and P. F. Zweifel, "Linear Transport Theory", Reading, Mass.: Addison-Wesley, 1967.

[3] Carlo Cercignani, "The Boltzmann Equation and Its Applications", New York, NY: Springer-Verlag, 1988.

[4] J. J. Duderstadt and W. R. Martin, "Transport Theory", New York, NY: John Wiley and Sons, 1979.

[5] V. Faber and T. A. Manteuffel, "Neutron Transport from the Viewpoint of Linear Algebra", Invariant Imbedding and Equations (Nelson, Faber, Manteuffel, Seth, and White, eds.), Lecture Notes in Pure and Applied Mathematics, 115, pp. 37-61, Marcel Dekker, April 1989.

[6] Gene H. Golub and C. F. Van Loan, "Matrix Computations", Baltimore and London: The Johns Hopkins University Press, 1989.

[7] M. S. Lazo and J. E. Morel, "A Linear Discontinuous Galerkin Approximation for the Continuous Slowing Down Operator", Nucl. Sci. Eng., 92, p. 98, 1986.

[8] E. W. Larsen and J. E. Morel, "Asymptotic Solutions of Numerical Transport Problems in Optically Thick, Diffusive Regimes II", J. Comp. Phys., 83, p. 212, 1989.

[9] E. E. Lewis and W. F. Miller, "Computational Methods of Neutron Transport", New York, NY: John Wiley and Sons, 1984.

[10] T. Manteuffel, S. McCormick, J. Morel, S. Oliveira, and G. Yang, "Fast Multigrid Solver for Transport Problems", submitted to Nucl. Sci. Eng.

[11] T. Manteuffel, S. McCormick, J. Morel, and G. Yang, "Fast Multigrid Solver for Transport Problems with Absorptions", submitted to Nucl. Sci. Eng.
[12] T. Manteuffel, S. McCormick, J. Morel, S. Oliveira, and G. Yang, "Parallel Multigrid Methods for Transport Equations", submitted to SIAM Journal on Scientific and Statistical Computing.

[13] J. E. Morel and E. W. Larsen, "A New Class of S_N Spatial Differencing Schemes", Nucl. Sci. Eng., 1988.

[14] J. E. Morel and T. A. Manteuffel, "An Angular Multigrid Acceleration Technique for the S_n Equations with Highly Forward-Peaked Scattering", Nucl. Sci. Eng., 107, pp. 330-342, 1991.

[15] R. Sweet, "A Generalized Cyclic-Reduction Algorithm", SIAM Journal on Numerical Analysis, Vol. 11, pp. 506-520, 1974.

[16] P. Wesseling, "An Introduction to Multigrid Methods", Chichester, England: John Wiley and Sons, 1992.

[17] G. Milton Wing, "An Introduction to Transport Theory", New York, NY: John Wiley and Sons, 1962.
