Citation
Tiled algorithms for matrix computations on multicore architectures

Material Information

Title:
Tiled algorithms for matrix computations on multicore architectures
Creator:
Bouwmeester, Henricus M
Place of Publication:
Denver, CO
Publisher:
University of Colorado Denver
Publication Date:
Language:
English
Physical Description:
1 electronic file.

Subjects

Subjects / Keywords:
Computer algorithms ( lcsh )
Numerical analysis ( lcsh )
Computer algorithms ( fast )
Numerical analysis ( fast )
Genre:
non-fiction ( marcgt )

Notes

Review:
The current computer architecture has moved towards the multi/many-core structure. However, the algorithms in the current sequential dense numerical linear algebra libraries (e.g. LAPACK) do not parallelize well on multi/many-core architectures. A new family of algorithms, the tile algorithms, has recently been introduced to circumvent this problem. Previous research has shown that it is possible to write efficient and scalable tile algorithms for performing a Cholesky factorization, a (pseudo) LU factorization, and a QR factorization. The goal of this thesis is to study tiled algorithms in a multi/many-core setting and to provide new algorithms which exploit the current architecture to improve performance relative to current state-of-the-art libraries while maintaining the stability and robustness of these libraries.
Thesis:
Thesis (Ph.D.)--University of Colorado Denver. Applied mathematics
Bibliography:
Includes bibliographic references.
General Note:
School of Mathematical and Statistical Sciences
Statement of Responsibility:
by Henricus M. Bouwmeester.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
861757580 ( OCLC )
ocn861757580


Full Text
TILED ALGORITHMS FOR MATRIX COMPUTATIONS ON MULTICORE
ARCHITECTURES
by
Henricus M Bouwmeester
B.S., Colorado Mesa University, 1998
M.S., Colorado State University, 2000
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Applied Mathematics
2012


This thesis for the Doctor of Philosophy degree by
Henricus M Bouwmeester
has been approved
by
Julien Langou, Advisor
Lynn Bennethum, Chair
Gita Alaghband
Elizabeth R. Jessup
Stephen Billups
October 26, 2012


Bouwmeester, Henricus M (Ph.D., Applied Mathematics)
Tiled Algorithms for Matrix Computations on Multicore Architectures
Thesis directed by Associate Professor Julien Langou
ABSTRACT
Current computer architecture has moved towards the multi/many-core structure.
However, the algorithms in the current sequential dense numerical linear algebra li-
braries (e.g. LAPACK) do not parallelize well on multi/many-core architectures. A
new family of algorithms, the tile algorithms, has recently been introduced to circum-
vent this problem. Previous research has shown that it is possible to write efficient
and scalable tile algorithms for performing a Cholesky factorization, a (pseudo) LU
factorization, and a QR factorization. The goal of this thesis is to study tiled al-
gorithms in a multi/many-core setting and to provide new algorithms that exploit
the current architecture to improve performance relative to current state-of-the-art
libraries while maintaining the stability and robustness of these libraries.
In Chapter 2, we confront the problem of computing the inverse of a symmetric
positive definite matrix with tiled algorithms. We observe that, using a dynamic
task scheduler, it is relatively painless to translate existing LAPACK code to ob-
tain a ready-to-be-executed tile algorithm. However we demonstrate that nontrivial
compiler techniques (array renaming, loop reversal and pipelining) need to be ap-
plied to further increase the parallelism of our application, both theoretically and
experimentally.
Chapter 3 revisits existing algorithms for the QR factorization of rectangular
matrices composed of p x q tiles, where p > q, for an unlimited number of processors.
Within this framework, we study the critical paths and performance of algorithms
such as Sameh-Kuck, Fibonacci, Greedy, and those found within PLASMA. We
iii


also provide a monotonically increasing function to transform the elimination list of a
coarse-grain algorithm to a tiled algorithm. Although the optimality from the coarse-
grain Greedy algorithm does not translate to the tiled algorithms, we propose a new
algorithm and show that it is optimal in the tiled context.
In Chapters 2 and 3, our context includes an unbounded number of processors.
The exercise was to find algorithmic variants with short critical paths. Since the
number of resources is unlimited, any task is executed as soon as all its dependencies
are satisfied. In Chapters 4 and 5, we set ourselves in the more realistic context
of bounded number of processors. In this context, at a given time, the number
of ready-to-go tasks can exceed the number of available resources, and therefore a
schedule which prescribes which tasks to execute when needs to be defined. For
the Cholesky factorization, we study standard schedules and find that the critical
path schedule is best. We also derive a lower bound on the time to solution of the
optimal schedule. We conclude that the critical path schedule is nearly optimal for
our study. For the QR factorization problem, we study the problem of optimizing
the reduction trees (therefore the algorithm) and the schedule simultaneously. This
is considerably harder than the Cholesky factorization where the algorithm is fixed
and so, for Cholesky factorization, we are concerned only with schedules. We provide
a lower bound for the time to solution for any tiled QR algorithm and any schedule.
We also show that, in some cases, the optimal algorithm for an unbounded number of
processors (found in Chapter 3) cannot be scheduled to solve optimally the combined
problem. We compare our algorithms and schedules with our lower bound.
Finally, in Chapter 6 we study a recursive tiled algorithm in the context of matrix
multiplication, using the Winograd-Strassen algorithm with a dynamic task scheduler.
Whereas most implementations obtain either one or two levels of recursion, our
implementation supports any level of recursion.
iv


The form and content of this abstract are approved. I recommend its publication.
Approved: Julien Langou
v


DEDICATION
I dedicate this thesis to my wife, my parents and all of my family for their care,
unwavering support, and steadfast belief in my dreams and goals.
vi


ACKNOWLEDGMENT
My deepest appreciation goes to my advisor and mentor, Professor Julien Langou,
whose persistence, guidance, and help were instrumental to this dissertation, which
would not have been possible without him.
Thank you to my outstanding co-authors on the papers from whom I learned the
importance of collaboration and gained insight into the thought processes of others.
They helped me to make my ideas and explanations more understandable to all of
the readers.
I would like to thank my committee members, Professor Lynn Bennethum for
her diligence in being the chair, Professor Stephen Billups, and Professors Elizabeth
Jessup and Gita Alaghband for their expertise in computer science and parallel pro-
gramming.
In addition, I would like to thank all of the current and former staff of the Mathe-
matics and Statistical Science Department at the University of Colorado Denver who
helped me with their administrative support. Financial support was provided by the
University of Colorado Denver and the National Science Foundation grant numbers
NSF CCF-811520 and NSF CCF-1054864. This research was conducted using the
resources of the Center for Computational Mathematics (CCM) at the University of
Colorado Denver and the Innovative Computing Laboratory (ICL) at the University
of Tennessee.
Finally, I would like to thank my fellow PhD/Masters colleagues for the laughs,
great times, and beer sessions: Jenny Diemunsch, Jeff Larson, Brad Lowery, Tim
Morris, Eric Sullivan, and many more.
vii


TABLE OF CONTENTS
Figures .................................................................. x
Tables.................................................................... xiii
Chapter
1. Introduction............................................................... 1
2. Cholesky Inversion......................................................... 8
2.1 Tile in-place matrix inversion..................................... 10
2.2 Algorithmic study.................................................. 13
2.3 Conclusion and future work......................................... 15
3. QR Factorization.......................................................... 17
3.1 The QR factorization algorithm .................................... 20
3.1.1 Kernels...................................................... 23
3.1.2 Elimination lists............................................ 26
3.1.3 Execution schemes............................................ 28
3.2 Critical paths .................................................... 30
3.2.1 Coarse-grain algorithms...................................... 30
3.2.1.1 Sameh-Kuck algorithm................................... 30
3.2.1.2 Fibonacci algorithm.................................... 31
3.2.1.3 Greedy algorithm....................................... 31
3.2.2 Tiled algorithms............................................. 32
3.3 Experimental results............................................... 51
3.4 Conclusion......................................................... 60
4. Scheduling of Cholesky Factorization...................................... 67
4.1 ALAP Derived Performance Bound .................................... 68
4.2 Critical Path Scheduling .......................................... 69
4.3 Scheduling with synchronizations................................... 71
4.4 Theoretical Results................................................ 73
viii


4.5 Toward an αopt.................................................... 75
4.6 Related Work ...................................................... 75
4.7 Conclusion and future work......................................... 76
5. Scheduling of QR Factorization........................................... 78
5.1 Performance Bounds................................................. 78
5.2 Optimality......................................................... 80
5.3 Elimination Tree Scheduling........................................ 81
5.4 Conclusion......................................................... 83
6. Strassen Matrix-Matrix Multiplication ................................... 84
6.1 Strassen-Winograd Algorithm ....................................... 85
6.2 Tiled Strassen-Winograd Algorithm.................................. 86
6.3 Related Work ...................................................... 91
6.4 Experimental results............................................... 93
6.5 Conclusion......................................................... 96
7. Conclusion............................................................... 97
Appendix
A. Integer Programming Formulation of Tiled QR.............................. 99
A.1 IP Formulation..................................................... 99
A.1.1 Variables................................................... 99
A.1.2 Constraints................................................ 100
A.1.2.1 Precedence constraints............................... 105
A.1.3 Objective function......................................... 107
References................................................................. 108
Glossary................................................................... 113
ix


FIGURES
Figure
1.1 Three variants for the Cholesky decomposition............................ 3
1.2 Data layout of tiled matrix ............................................. 5
1.3 Three variants of the Cholesky decomposition applied to a tiled matrix of
4x4 tiles............................................................... 5
2.1 DAGs of Step 3 of the Tile Cholesky Inversion (t = 4)........ 12
2.2 Scalability of Algorithm 2.1 (in place) and its out-of-place variant intro-
duced in § 2.2, using our dynamic scheduler against vecLib, ScaLAPACK
and LAPACK libraries................................................... 13
2.3 Impact of loop reversal on performance.................................. 15
3.1 Icon representations of the kernels .................................... 42
3.2 Critical Path length for the weighted FlatTree on a matrix of 4 x 4 tiles. 43
3.3 Illustration of first and second parts of the proof of Theorem 3.14 using
the Fibonacci algorithm on a matrix of 15 x 2 tiles.................. 49
3.4 Greedy versus GrASAP on matrix of 15 x 2 tiles....................... 50
3.5 Tiled matrices of p x q where the critical path length of GrASAP is
shorter than that of GREEDY for 1 < p < 100 and 1 < q < p............ 50
3.6 Upper bound and experimental performance of QR factorization TT
kernels................................................................ 52
3.7 Overhead in terms of critical path length and time with respect to
Greedy (Greedy =1) 53
3.8 Overhead in terms of critical path length and time with respect to
Greedy (Greedy =1) 55
3.9 Kernel performance for double complex precision......................... 56
3.10 Kernel performance for double precision................................. 57
3.11 Upper bound and experimental performance of QR factorization All kernels 58
x


3.12 Overhead in terms of critical path length and time with respect to
Greedy (Greedy =1) .......................................... 59
3.13 Overhead in terms of critical path length and time with respect to
Greedy (Greedy =1) .......................................... 60
4.1 ALAP execution for 5x5 tiles ....................................... 69
4.2 Example derivation of task priorities via the Backflow algorithm.... 71
4.3 Theoretical results for matrix of 40 x 40 tiles..................... 74
4.4 Values of a for matrices of t x t tiles where 3 < t < 40............ 75
4.5 Asymptotic efficiency versus a = p/n for LU decomposition and versus
a = p/t² for Tiled Cholesky factorization........................... 76
5.1 Scheduling comparisons for each of the algorithms versus the ALAP De-
rived bounds on a matrix of 20 x 6 tiles.............................. 79
5.2 Tail-end execution using ALAP on unbounded resources for GrASAP
and Fibonacci on a matrix of 15 x 4 tiles.......................... 80
5.3 ALAP Derived bound comparison for all algorithms for a matrix of 15 x 4
tiles................................................................. 81
5.4 Comparison of speedup for CP Method on GrASAP, ALAP Derived
bound from GrASAP, and optimal schedules for a matrix of 5 x 5 tiles
on 1 to 14 processors................................................. 82
6.1 Task graph for the Strassen-Winograd Algorithm. Execution time pro-
gresses from left to right. Large ovals depict multiplication and small
ovals addition/subtraction....................................... 86
6.2 Strassen-Winograd DAG for matrix of 4 x 4 tiles with one recursion. Exe-
cution time progresses from left to right. Large ovals depict multiplication
and small ovals addition/subtraction.................................. 88
6.3 Required extra memory allocation for temporary storage for varying re-
cursion levels........................................................ 93
xi


6.4 Comparison of tuning parameters nb and r................................ 94
6.5 Scalability and Efficiency comparisons on 48 threads with matrices of 64 x
64 tiles and nb = 200................................................... 95
6.6 Scalability and efficiency comparisons executed on 12 threads with matrices
of 64 x 64 tiles and nb = 200........................................... 96
xii


TABLES
Table
2.1 Length of the critical path as a function of the number of tiles t......... 13
3.1 Kernels for tiled QR. The unit of time is nb³/3, where nb is the blocksize. . 23
3.2 Time-steps for coarse-grain algorithms..................................... 32
3.3 Time-steps for tiled algorithms............................................ 63
3.4 Neither Greedy nor Asap are optimal........................................ 63
3.5 Three schemes applied to a column whose update kernel weight is not an
integer multiple of the elimination kernel weight...................... 64
3.6 Greedy versus PT_TT and Fibonacci (Theoretical)............................ 65
3.7 Greedy versus PT_TT (Experimental Double).................................. 66
3.8 Greedy versus PT_TT (Experimental Double Complex).......................... 66
3.9 Greedy versus Fibonacci (Experimental Double).............................. 66
3.10 Greedy versus Fibonacci (Experimental Double Complex) .................... 66
4.1 Task Weights............................................................... 67
4.2 Upper bound on speedup and efficiency for 5 x 5 tiles...................... 69
5.1 Schedule lengths for matrix of 5 x 5 tiles................................. 82
6.1 Strassen-Winograd Algorithm................................................ 85
6.2 Recursion levels which minimize the number of tasks for a tiled matrix of
size p x p................................................................ 89
6.3 128 x 128 tiles of size nb = 200 ......................................... 90
6.4 Comparison of the total number of tasks and critical path length for matrix
of p x p tiles......................................................... 90
xiii


1. Introduction
The High-Performance Computing (HPC) landscape is trending more towards a
multi/many-core architecture [8] as is evidenced by the recent projects of major chip
manufacturers and reports of surveys conducted by consulting companies [52]. The
computational algorithms for dense linear algebra need to be re-examined to make
better use of these architectures and provide higher levels of efficiency. Some of these
algorithms may have a straightforward translation from the current state-of-the-art
libraries while others require much more thought and effort to gain performance in-
creases. In this thesis, we endeavor to achieve algorithms that exploit the architecture
to improve performance relative to current state-of-the-art computational libraries
while maintaining the stability and robustness of these libraries.
Our research will make use of the BLAS (Basic Linear Algebra Subprograms) [15]
and the LAPACK (Linear Algebra PACKage) [7] libraries. The BLAS are a stan-
dard to perform basic linear algebra operations involving vectors and matrices while
LAPACK performs the more complicated and higher level linear algebra operations.
The BLAS divide numerical linear algebra operations into three distinct group-
ings based upon the amount of input data and the computational cost. Operations
involving only vectors are considered Level 1; those involving both vectors and ma-
trices are Level 2; and those involving only matrices are Level 3. For matrices of size
n x n, the Level 3 operations are most coveted since they use O(n²) data for O(n³)
computations and inherently reduce the amount of memory traffic. Since the BLAS
provide fundamental linear algebra operations, hardware and software vendors such
as Intel, AMD, and IBM provide optimized BLAS libraries for a variety of archi-
tectures. The BLAS library can be multithreaded to make use of multi/many-core
architectures and most of the vendor supplied libraries are multithreaded.
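As a concrete illustration (a sketch of ours in Python with NumPy, not the libraries' own interfaces), the three BLAS levels correspond to operations of the following shape; for n x n matrices, the Level 3 call reads O(n²) data but performs O(n³) flops, which is what makes it friendly to the memory hierarchy:

import numpy as np

n = 1000
alpha = 2.0
x, y = np.random.rand(n), np.random.rand(n)
A, B, C = np.random.rand(n, n), np.random.rand(n, n), np.random.rand(n, n)

# Level 1 (vector-vector), e.g. AXPY: y <- alpha*x + y -- O(n) data, O(n) flops
y = alpha * x + y

# Level 2 (matrix-vector), e.g. GEMV: y <- A x -- O(n^2) data, O(n^2) flops
y = A @ x

# Level 3 (matrix-matrix), e.g. GEMM: C <- A B + C -- O(n^2) data, O(n^3) flops
C = A @ B + C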
The LAPACK library is a collection of subroutines for solving most of the common
problems in numerical linear algebra and was developed to make use of Level 3 BLAS
1


operations as much as possible. Algorithms within LAPACK are written to make use
of panels which can be either a block of columns or a block of rows so that updates
are performed using matrix multiplications instead of vector operations. LAPACK
can make use of a multithreaded BLAS to exploit multi/many-core architectures but
this may not be enough to fully exploit the capability of the architecture.
As an example, let us consider the Cholesky decomposition to factorize a symmet-
ric positive definite (SPD) matrix into its triangular factor. There are three variants
for performing the Cholesky decomposition: bordered, left-looking, and right-looking.
All three work on either the upper or lower triangular portion of the matrix and
produce the same triangular factor, but depending on the usage, one may have an
advantage over the others. In Figure 1.1 we depict the steps involved in each variant
using the lower triangular formulations. At the start of each variant, the matrix is
subdivided into blocks of rows and columns, or panels, which will take advantage of
the Level 3 BLAS.
The bordered variant, as depicted in Figure 1.1a, involves a loop over three
steps. The first step updates the purple row block using the already factorized green
portion, the second step updates the next triangular block to be factorized (in red),
and the third step performs the factorization of the triangular block. This is then
repeated until the entire matrix is factorized. Note that the lower portion of the
matrix is not touched by the preceding steps.
The left-looking variant (see Figure 1.1b) involves four steps. The first step
updates the triangular block in red which is then factorized in the second step, the
third step updates the block column (in cyan) below the triangular block using the
previous columns, and the last step updates the column block using the factorization
of step 2. It is called left-looking since the algorithm does not affect the portion to
the right of the current block of the matrix and only looks to the left for its updates.
The top most triangular portion of the matrix is in its final form and will not change
2


(a) Bordered variant    (b) Left-looking variant    (c) Right-looking variant
Figure 1.1: Three variants for the Cholesky decomposition
3


in the succeeding steps of the algorithm.
The right-looking variant (see Figure 1.1c) involves three steps. It performs the
factorization of the red triangular block, updates the block column (in cyan) and then
updates the blue trailing matrix on the right. This is called right-looking since it does
not require anything from the previous factorized matrix and pushes its updates to
the right part of the matrix. The entire matrix to the left of the column block in
which the algorithm is currently working is in its final form.
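For concreteness, a minimal sketch of the blocked right-looking variant in Python/NumPy follows; the panel width nb and the helper name are ours, and the three statements in the loop body correspond to the factorize / panel-update / trailing-update steps described above.

import numpy as np

def cholesky_right_looking(A, nb=4):
    # Blocked right-looking Cholesky; returns the lower triangular factor L.
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # Step 1: factor the diagonal block (POTRF).
        A[k:k+b, k:k+b] = np.linalg.cholesky(A[k:k+b, k:k+b])
        if k + b < n:
            Lkk = A[k:k+b, k:k+b]
            # Step 2: triangular solve on the panel below the diagonal block (TRSM).
            A[k+b:, k:k+b] = np.linalg.solve(Lkk, A[k+b:, k:k+b].T).T
            # Step 3: symmetric rank-b update of the trailing matrix (SYRK).
            A[k+b:, k+b:] -= A[k+b:, k:k+b] @ A[k+b:, k:k+b].T
    return np.tril(A)

# Quick check on a small SPD matrix.
M = np.random.rand(8, 8)
A = M @ M.T + 8 * np.eye(8)
L = cholesky_right_looking(A, nb=3)
assert np.allclose(L @ L.T, A)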
The advantage of the bordered variant is that it does the least amount of oper-
ations to determine if a matrix is SPD. The advantage of the right-looking variant
is that it provides the most parallelism. A major disadvantage of the left-looking
variant is the added fork-join that it must perform between the steps as compared to
the other two variants which will negatively affect its parallel performance.
The LAPACK scheme of using panels has three distinct disadvantages which
limit its performance. The first can be seen in the third step of the right-looking
Cholesky decomposition (Figure 1.1c) where potentially a large symmetric rank k
operation is performed; the memory architecture will bound the performance of the
algorithm. Secondly, there is some impact of the synchronizations that must be
performed between each step. Third, the idea of panels does not allow for fine-
grained tasks. We alleviate the last two of these restrictions through the use of tiled
algorithms whereas the first is overcome through the use of Level 3 BLAS operations
within the tiled algorithm.
We approach this via tiling a matrix which means reordering the data of the
matrix into smaller regions of contiguous memory as depicted in Figure 1.2. By
varying the tile size, this data layout allows us to tune the algorithm such that the
data needed for the kernels is present in the cache of the processor core. Moreover,
we are able to increase the amount of parallelism and minimize the synchronizations.
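A sketch of this reordering (hypothetical helpers of ours, assuming the matrix order N is a multiple of the tile size nb): each nb x nb tile becomes one contiguous block of memory, so a kernel working on a tile touches a single compact region.

import numpy as np

def to_tile_layout(A, nb):
    # Copy an N x N matrix into a t x t array of contiguous nb x nb tiles.
    N = A.shape[0]
    assert N % nb == 0
    t = N // nb
    tiles = np.empty((t, t, nb, nb))
    for i in range(t):
        for j in range(t):
            tiles[i, j] = A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]  # tile (i, j), stored contiguously
    return tiles

def from_tile_layout(tiles):
    # Inverse transformation, back to the standard layout.
    t, _, nb, _ = tiles.shape
    A = np.empty((t*nb, t*nb))
    for i in range(t):
        for j in range(t):
            A[i*nb:(i+1)*nb, j*nb:(j+1)*nb] = tiles[i, j]
    return A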
Let us revisit the Cholesky decomposition as described earlier and apply each
4


Figure 1.2: Data layout of tiled matrix
step to the tiled matrix. In Figure 1.3 we present the directed acyclic graphs (DAG)
for the three variants on a tiled matrix of 4 x 4 tiles. In each of the DAGs, the tasks
are represented as the nodes and the data dependencies are the edges. The dashed
horizontal lines designate a full sweep through all of the steps in each algorithm.
(a) Bordered variant (b) Left-looking variant (c) Right-looking variant
Figure 1.3: Three variants of the Cholesky decomposition applied to a tiled matrix
of 4 x 4 tiles.
The first observation that one makes is that the height of each DAG varies ac-
cording to which variant is chosen. The bordered variant is the tallest since the tasks
5


become almost sequential where the only portion that is not sequential is that of
the row block update. The left-looking variant is almost of height t² where t is the
number of tiles in a column of the tiled matrix. It gains parallelism from being able
to update the column block of the final step within the loop in parallel. As before,
the right-looking variant is the shortest and provides the most parallelism.
However, in the tiled versions, the synchronizations between each step of the
LAPACK algorithms are superficial and can be removed. By doing so, these three
variants reduce to only the right-looking variant.
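The task view can be sketched as follows (our own illustration, foreshadowing the kernel names of Algorithm 2.1): the right-looking tiled Cholesky generates one task per tile kernel, and the DAG is obtained by connecting tasks that touch the same tiles.

def tiled_cholesky_tasks(t):
    # Enumerate the tile tasks of the right-looking tiled Cholesky, in sequential order.
    tasks = []
    for k in range(t):
        tasks.append(("POTRF", k))                 # factor diagonal tile (k, k)
        for i in range(k + 1, t):
            tasks.append(("TRSM", i, k))           # triangular solve on tile (i, k)
        for i in range(k + 1, t):
            tasks.append(("SYRK", i, k))           # trailing update of diagonal tile (i, i)
            for j in range(k + 1, i):
                tasks.append(("GEMM", i, j, k))    # trailing update of tile (i, j)
    return tasks

# For t = 4 this gives 4 POTRF, 6 TRSM, 6 SYRK and 4 GEMM tasks, 20 in total.
print(len(tiled_cholesky_tasks(4)))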
The main difference between the tiled version and the blocked version is the
amount of parallelism that is gained from updates of the trailing matrix. Instead of
performing a large symmetric rank-k update, where k is the number of rows in a row
block or the number of columns in a column block, the operation is decomposed into
smaller symmetric rank-nb updates and associated matrix multiplications, where nb
is the size of a tile such that N = t nb. In the right-looking variant, for an N x N
matrix the size of the first trailing matrix is (N - k) x (N - k) so that the update
operation for this first matrix has a computational cost of O(kN²). The tiled update
consists of both symmetric updates and matrix multiplications of tiles of size nb x nb
so that the computational cost per task is O(nb³).
The size of the tiles will determine the granularity of the tasks for a tiled algo-
rithm. For a matrix of size N x N, the tile size of the tiled t x t matrix can vary from
the entire size of the matrix (t = 1) down to a scalar (t = N), but is held constant
throughout the execution of the algorithm. However, a balance must be kept between
the efficiency of the kernel and the amount of data movement.
Therefore, a tiled algorithm does overcome the restriction on the granularity
imposed by the panels concept of LAPACK as well as alleviate some of the synchro-
nizations and associated overhead. The memory bound is still present due to needing
to move the updates of the trailing matrices.
6


The Parallel Linear Algebra Software for Multicore Architectures (PLASMA) [2]
library provides the framework for the tiled algorithms. For the experimental por-
tions, our main assumption will be a shared memory machine architecture wherein
each processor has direct access to all portions of the memory in which the matrices
are stored.
In Chapter 2 we describe more fully the implementation of the tiled Cholesky
decomposition and further the tiled Cholesky Inversion algorithm. It shows that
translating from LAPACK to PLASMA can be straight forward, but that there are
caveats to be taken into account. In Chapter 3 we have implemented a tiled QR
decomposition showing that the implementation is not straight-forward. Moreover,
results from previous work do not translate directly to the tiled algorithm.
Unlike Chapters 2 and 3 where an unbounded number of processors is assumed,
Chapters 4 and 5 restrict the number of processors and provide bounds on the per-
formance of the algorithms. We observe the theoretical speed-up and efficiency and
provide more realistic bounds on the performance.
Finally, Chapter 6 presents a study on a tiled implementation of the Strassen-
Winograd algorithm for matrix-matrix multiplication. Unlike the other algorithms
presented, it is a recursive algorithm which becomes interesting in the scope of tiled
matrices.
7


2. Cholesky Inversion
In this chapter, we present joint work with Emmanuel Agullo, Jack Dongarra,
Jakub Kurzak, Julien Langou, and Lee Rosenberg [4].
The appropriate direct method to compute the solution of a symmetric positive
definite system of linear equations consists of computing the Cholesky factorization
of that matrix and then solving the underlying triangular systems. It is not recom-
mended to use the inverse of the matrix in this case. However some applications need
to explicitly form the inverse of the matrix. A canonical example is the computation
of the variance-covariance matrix in statistics. Higham [32, p.260,§3] lists more such
applications.
With their advent, multicore architectures [50] induce the need for algorithms and
libraries that fully exploit their capacities. A class of such algorithms called tile
algorithms [18, 19] has been developed for one-sided dense factorizations (Cholesky,
LU and QR) and made available as part of the Parallel Linear Algebra Software for
Multicore Architectures (PLASMA) library [2]. In this chapter, we extend this class
of algorithms to the case of the (symmetric positive definite) matrix inversion. Besides
constituting an important functionality for a library such as PLASMA, the study of
the matrix inversion on multicore architectures represents a challenging algorithmic
problem. Indeed, first, contrary to standalone one-sided factorizations that have been
studied so far, the matrix inversion exhibits many anti-dependences [6] (Write after
Read). This is a false or artificial dependency which is reliant on the name of the
data and not the actual data flow. For example, given two operations where the
first only reads the data in the matrix A and the second only writes to the location
of A, then in a parallel execution there may be a case where the data being read
by the first operation is wrong since the second may have already written to the
location. By copying the data from A beforehand, both operations can be executed
in parallel. Those anti-dependences can be a bottleneck for parallel processing, which
8


is critical on multicore architectures. It is thus essential to investigate (and adapt)
well known techniques used in compilation such as using temporary copies of data to
remove anti-dependences to enhance the degree of parallelism of the matrix inversion.
This technique is known as array renaming [6] (or array privatization [28]). Second,
loop reversal [6] is to be investigated. Third, the matrix inversion consists of three
successive steps (first of which is the Cholesky decomposition). In terms of scheduling,
it thus represents an opportunity to study the effects of pipelining [6] those steps on
performance.
The current version of PLASMA (version 2.1) is scheduled statically. Initially de-
veloped for the IBM Cell processor [34], this static scheduling relies on POSIX threads
and simple synchronization mechanisms. It has been designed to maximize data reuse
and load balancing between cores, allowing for very high performance [3] on today's
multicore architectures. However, in the case of matrix inversion, the design of an
ad-hoc static scheduling is a time consuming task and raises load balancing issues
that are much more difficult to address than for a stand-alone Cholesky decomposi-
tion, in particular when dealing with the pipelining of multiple steps. Furthermore,
the growth of the number of cores and the more complex memory hierarchies make
executions less deterministic. In this chapter, we rely on an experimental in-house
dynamic scheduler [33]. This scheduler is based on the idea of expressing an algorithm
through its sequential representation and unfolding it at runtime using data hazards
(Read after Write, Write after Read, Write after Write) as constraints for parallel
scheduling. The concept is rather old and has been validated by a few successful
projects. We could have as well used schedulers from the Jade project from Stanford
University [42] or from the SMPSs project from the Barcelona Supercomputer Center
[40].
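The principle can be sketched as follows (a toy model of ours, not the scheduler of [33]): each task declares the tiles it reads and writes, and an edge is added to the DAG whenever a Read-after-Write, Write-after-Read or Write-after-Write hazard is detected against an earlier task of the sequential program order.

def build_dag(tasks):
    # tasks: list of (name, reads, writes) in sequential program order.
    # Returns dependence edges (i -> j) derived from RAW, WAR and WAW hazards.
    edges = set()
    for j, (_, reads_j, writes_j) in enumerate(tasks):
        for i in range(j):
            _, reads_i, writes_i = tasks[i]
            raw = writes_i & reads_j      # j reads what i wrote
            war = reads_i & writes_j      # j overwrites what i reads (anti-dependence)
            waw = writes_i & writes_j     # j overwrites what i wrote
            if raw or war or waw:
                edges.add((i, j))
    return edges

# Tiny example on tiles named by their indices. A task may start as soon as
# every predecessor in the DAG has completed, regardless of program order.
tasks = [
    ("POTRF(0)",  {(0, 0)},         {(0, 0)}),
    ("TRSM(1,0)", {(0, 0), (1, 0)}, {(1, 0)}),
    ("TRSM(2,0)", {(0, 0), (2, 0)}, {(2, 0)}),
    ("SYRK(1,0)", {(1, 0), (1, 1)}, {(1, 1)}),
]
print(build_dag(tasks))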
Our discussions are illustrated with experiments conducted on a dual-socket quad-
core machine based on an Intel Xeon EMT64 processor operating at 2.26 GHz. The
9


theoretical peak is equal to 9.0 Gflop/s per core or 72.3 Gflop/s for the whole machine,
composed of 8 cores. The machine is running Mac OS X 10.6.2 and is shipped with
the Apple vecLib v126.0 multithreaded BLAS [15] and LAPACK vendor library, as
well as LAPACK [7] v3.2.1 and ScaLAPACK [13] v1.8.0 references.
2.1 Tile in-place matrix inversion
Tile algorithms are a class of Linear Algebra algorithms that allow for fine gran-
ularity parallelism and asynchronous scheduling, enabling high performance on mul-
ticore architectures [3, 18, 19, 41]. The matrix of order n is split into t x t square
submatrices of order b (n = b x t). Such a submatrix is of small granularity (we
fixed b = 200 in this chapter) and is called a tile. So far, tile algorithms have been
essentially used to implement one-sided factorizations [3, 18, 19, 41].
Algorithm 2.1 extends this class of algorithms to the case of the matrix inver-
sion. As in state-of-the-art libraries (LAPACK, ScaLAPACK), the matrix inversion
is performed in-place, i.e., the data structure initially containing matrix A is directly
updated as the algorithm is progressing, without using any significant temporary
extra storage; eventually, A^{-1} replaces A. Algorithm 2.1 is composed of three steps.
Step 1 is a Tile Cholesky Factorization computing the Cholesky factor L (lower tri-
angular matrix satisfying A = LL^T). This step was studied in [19]. Step 2 computes
L^{-1} by inverting L. Step 3 finally computes the inverse matrix A^{-1} = L^{-T} L^{-1}.
Each step is composed of multiple fine granularity tasks (since operating on tiles).
These tasks are part of the BLAS (SYRK, GEMM, TRSM, TRMM) and LAPACK
(POTRF, TRTRI, LAUUM) standards. A more detailed description is beyond the
scope of this extended chapter and is not essential to the understanding of the rest
of the chapter. Indeed, from a high level point of view, an operation based on tile
algorithms can be represented as a Directed Acyclic Graph (DAG) [22] where nodes
represent the fine granularity tasks in which the operation can be decomposed and the
edges represent the dependences among them. For instance, Figure 2.1a represents
10


Algorithm 2.1: Tile In-place Cholesky Inversion (lower format). Matrix A is
the on-going updated matrix (in-place algorithm).
Input: A, Symmetric Positive Definite matrix in tile storage (t x t tiles).
Result: A^{-1}, stored in-place in A.
 1  Step 1: Tile Cholesky Factorization (compute L such that A = LL^T);
 2  for j = 0 to t-1 do
 3      for k = 0 to j-1 do
 4          A_{j,j} <- A_{j,j} - A_{j,k} A_{j,k}^T   (SYRK(j,k));
 5      A_{j,j} <- CHOL(A_{j,j})   (POTRF(j));
 6      for i = j+1 to t-1 do
 7          for k = 0 to j-1 do
 8              A_{i,j} <- A_{i,j} - A_{i,k} A_{j,k}^T   (GEMM(i,j,k));
 9      for i = j+1 to t-1 do
10          A_{i,j} <- A_{i,j} A_{j,j}^{-T}   (TRSM(i,j));
11  Step 2: Tile Triangular Inversion of L (compute L^{-1});
12  for j = t-1 to 0 do
13      A_{j,j} <- TRTRI(A_{j,j})   (TRTRI(j));
14      for i = t-1 to j+1 do
15          A_{i,j} <- A_{i,i} A_{i,j}   (TRMM(i,j));
16          for k = j+1 to i-1 do
17              A_{i,j} <- A_{i,j} + A_{i,k} A_{k,j}   (GEMM(i,j,k));
18          A_{i,j} <- -A_{i,j} A_{j,j}   (TRMM(i,j));
19  Step 3: Tile Product of Lower Triangular Matrices (compute A^{-1} = L^{-T} L^{-1});
20  for i = 0 to t-1 do
21      for j = 0 to i-1 do
22          A_{i,j} <- A_{i,i}^T A_{i,j}   (TRMM(i,j));
23      A_{i,i} <- A_{i,i}^T A_{i,i}   (LAUUM(i));
24      for j = 0 to i-1 do
25          for k = i+1 to t-1 do
26              A_{i,j} <- A_{i,j} + A_{k,i}^T A_{k,j}   (GEMM(i,j,k));
27      for k = i+1 to t-1 do
28          A_{i,i} <- A_{i,i} + A_{k,i}^T A_{k,i}   (SYRK(i,k));
11


the DAG of Step 3 of Algorithm 2.1.
Figure 2.1: DAGs of Step 3 of the Tile Cholesky Inversion (t = 4).
Algorithm 2.1 is based on the variants used in LAPACK 3.2.1. Bientinesi, Gunter
and van de Geijn [10] discuss the merits of algorithmic variations in the case of the
computation of the inverse of a symmetric positive definite matrix. Although of
definite interest, this is not the focus of this extended chapter.
We have implemented Algorithm 2.1 using our dynamic scheduler introduced
in the beginning of the chapter. Figure 2.2 shows its performance against state-of-
the-art libraries and the vendor library on the machine described in the beginning
of the chapter. For a matrix of small size, it is difficult to extract parallelism and
have a full use of all the cores [3, 18, 19, 41]. We indeed observe a limited scalability
(N = 1000, Figure 2.2a). However, tile algorithms (Algorithm 2.1) still benefit
from a higher degree of parallelism than blocked algorithms [3, 18, 19, 41]. Therefore
Algorithm 2.1 (in place) consistently achieves a significantly better performance
than vecLib, ScaLAPACK and LAPACK libraries. A larger matrix size (N = 4000,
Figure 2.2b) allows for a better use of parallelism. In this case, an optimized imple-
mentation of a blocked algorithm (vecLib) competes well against tile algorithms (in
place) on few cores (left part of Figure 2.2a). However, only tile algorithms scale to
a larger number of cores (rightmost part of Figure 2.2b) thanks to a higher degree
of parallelism. In other words, the tile Cholesky inversion achieves a better strong
12


Table 2.1: Length of the critical path as a function of the number of tiles t.

          In-place case    Out-of-place case
Step 1    3t - 2           3t - 2
Step 2    3t - 3           2t - 1
Step 3    3t - 2           t
scalability than the blocked versions, similarly to what had been observed for the
factorization step [3, 18, 19, 41].
(a) n = 1000    (b) n = 4000
Figure 2.2: Scalability of Algorithm 2.1 (in place) and its out-of-place variant introduced
in § 2.2, using our dynamic scheduler against the vecLib, ScaLAPACK and LAPACK
libraries.
2.2 Algorithmic study
In § 2.1, we compared the performance of the tile Cholesky inversion against
state-of-the-art libraries. In this section, we focus on the tile Cholesky inversion and we
discuss the impact of several variants of Algorithm 2.1 on performance.
Array renaming (removing anti-dependences). The dependence between
SYRK(0,1) and TRMM(1,0) in the DAG of Step 3 of Algorithm 2.1 (Figure 2.1a)
represents the constraint that the SYRK operation (l. 28 of Algorithm 2.1) needs
to read A_{k,i} = A_{1,0} before TRMM (l. 22) can overwrite A_{i,j} = A_{1,0}. This anti-
dependence (Write after Read) can be removed thanks to a temporary copy of A_{1,0}.
13


Similarly, all the SYRK-TRMM anti-dependences, as well as the TRMM-LAUUM and
GEMM-TRMM anti-dependences, can be removed. We have designed a variant of
Algorithm 2.1 that removes all the anti-dependences thanks to the use of a large
working array (this technique is called array renaming [6] in compilation). The
subsequent DAG (Figure 2.1b) is split into multiple pieces, leading to a
shorter critical path (Table 2.1). We implemented the out-of-place algorithm, based
on our dynamic scheduler too. Figure 2.2a shows that our dynamic scheduler exploits
its higher degree of parallelism to achieve a much higher strong scalability even on
small matrices (N = 1000). For a larger matrix (Figure 2.2b), the in-place algorithm
already achieved very good scalability. Therefore, using up to 7 cores, their performance
is similar. However, there is not enough parallelism with a 4000 x 4000 matrix
to use all 8 cores efficiently with the in-place algorithm; thus the higher performance
of the out-of-place version in this case (rightmost part of Figure 2.2b).
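The transformation can be sketched as follows (our schematic illustration; the tile indices are those of the SYRK/TRMM example above): copying the tile that is only read, or letting the writer use a fresh buffer, breaks the Write-after-Read constraint so that both tasks can run concurrently.

import numpy as np

nb = 4
A00 = np.random.rand(nb, nb)
A10 = np.random.rand(nb, nb)               # read by SYRK(0,1), overwritten by TRMM(1,0)
A11 = np.tril(np.random.rand(nb, nb))

# In place: SYRK must read A10 before TRMM overwrites it (Write after Read).
# Array renaming: TRMM writes a renamed buffer, so both tasks may run concurrently.
A10_out = np.empty_like(A10)

def syrk_0_1():                            # only reads A10
    return A00 + A10.T @ A10

def trmm_1_0():                            # writes the renamed tile instead of A10
    A10_out[:] = A11.T @ A10

new_A00 = syrk_0_1()
trmm_1_0()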
Loop reversal (exploiting commutativity). The most internal loop of each
step of Algorithm 2.1 (l. 8, l. 17 and l. 26) consists of successive commutative GEMM
operations. Therefore they can be performed in any order, among which increasing
order and decreasing order of the loop index. Their ordering impacts the length of the
critical path. Algorithm 2.1 orders those three loops in increasing (U), decreasing
(D) and increasing (U) order, respectively. We had manually chosen these respective orders (UDU)
because they minimize the critical path of each step (values reported in Table 2.1).
A naive approach would have, for example, consisted of consistently ordering
the loops in increasing order (UUU). In this case (UUU), the critical path of TRTRI
would have been equal to t² - 2t + 3 (in-place) or ½t² - ½t + 2 (out-of-place) instead
of 3t - 3 (in-place) or 2t - 1 (out-of-place) for (UDU). Figure 2.3 shows how loop
reversal impacts performance.
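Schematically (our own illustration), reversing the inner loop does not change the accumulated tile, since the GEMM updates commute, but it changes the order in which the input tiles are read and hence when dependent tasks become ready:

# The inner GEMM loop of Step 2 accumulates into A[i][j]; the sum is the same
# in increasing (U) or decreasing (D) order, but the order changes when each
# A[i][k] / A[k][j] is last read, and therefore when dependent tasks may start.
def gemm_updates(i, j, increasing=True):
    ks = range(j + 1, i)
    return [("GEMM", i, j, k) for k in (ks if increasing else reversed(ks))]

print(gemm_updates(5, 1, increasing=True))    # U: k = 2, 3, 4
print(gemm_updates(5, 1, increasing=False))   # D: k = 4, 3, 2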
Pipelining. Pipelining the multiple steps of the inversion reduces the length of
its critical path. For the in-place case, the critical path is reduced from 9t - 7 tasks
14


(a) n = 1000    (b) n = 4000
Figure 2.3: Impact of loop reversal on performance.
(t is the number of tiles) to 9t - 9 tasks (negligible). For the out-of-place case, it
is reduced from 6t - 3 to 5t - 2 tasks. We studied the effect of pipelining on the
performance of the inversion on a 8000 x 8000 matrix with an artificially large tile
size (b = 2000 and t = 4). As expected, we observed almost no effect on performance
of the in-place case (about 36.4 seconds with or without pipelining). For the out-of-
place case, the elapsed time grows from 25.1 to 29.2 seconds (16% overhead) when
pipelining is prevented.
2.3 Conclusion and future work
We have proposed a new algorithm to compute the inverse of a symmetric positive
definite matrix on multicore architectures. An experimental study has shown both
an excellent scalability of our algorithm and a significant performance improvement
compared to state-of-the-art libraries. Beyond extending the class of so-called tile
algorithms, this study brought back to the fore well known issues in the domain of
compilation. Indeed, we have shown the importance of loop reversal, array renaming
and pipelining.
The use of a dynamic scheduler allowed an out-of-the-box pipeline of the differ-
ent steps whereas loop reversal and array renaming required a manual change to the
15


algorithm. Future work consists of enabling the scheduler to perform loop reversal
and array renaming itself. We exploited the commutativity of GEMM operations to
perform loop reversal. Their associativity would furthermore allow them to be processed
in parallel (following a binary tree); the subsequent impact on performance is to be
studied. Array renaming requires extra memory. It will be interesting to address the
problem of maximizing performance under a memory constraint.
This work aims to be incorporated into PLASMA.
16


3. QR Factorization
In this chapter we present joint work with Mathias Jacquelin, Julien Langou, and
Yves Robert [31].
Given an m-by-n matrix A with n ≤ m, we consider the computation of its QR
factorization, which is the factorization A = QR, where Q is an m-by-n unitary
matrix (Q^H Q = I), and R is upper triangular.
The QR factorization of an m-by-n matrix with n ≤ m is the time-consuming
stage of some important numerical computations. It is needed for solving a linear
least squares problem with m equations (observations) and n unknowns and is used
to compute an orthonormal basis (the Q-factor) of the column span of the initial
matrix A. For example, all block iterative methods (used to solve large sparse linear
systems of equations or computing some relevant eigenvalues of such systems) require
orthogonalizing a set of vectors at each step of the process. For these two usage
examples, while n ≤ m, n can range from n ≪ m to n = m. We note that the
extreme case n = m is also relevant: the QR factorization of a matrix can be used to
solve (square) linear systems of equations. While this requires twice as many flops as
an LU factorization, using a QR factorization (a) is unconditionally stable (Gaussian
elimination with partial pivoting or pairwise pivoting is not) and (b) avoids pivoting
so it may well be faster in some cases.
To obtain a QR factorization, we consider algorithms which apply a sequence of
m-by-m unitary transformations U_i (U_i^H U_i = I), i = 1, ..., ℓ, on the left of the
matrix A, such that after ℓ transformations the resulting matrix R = U_ℓ ... U_1 A is
upper triangular, in which case R is indeed the R-factor of the QR factorization. The
Q-factor (if needed) can then be obtained by computing Q = U_1^H ... U_ℓ^H. These types
of algorithms are in regular use, e.g., in the LAPACK and ScaLAPACK libraries,
and are favored over other algorithms (Cholesky QR or Gram-Schmidt) for their
stability.
17


The unitary transformation U_i is chosen so as to introduce some zeros in the cur-
rent update matrix U_{i-1} ... U_1 A. The two basic transformations are Givens rotations
and Householder reflections. One Givens rotation introduces one additional zero;
the whole triangularization requires mn - n(n + 1)/2 Givens rotations for n ≤ m.
One elementary Householder reflection simultaneously introduces m - i zeros in po-
sitions i + 1 to m in column i; the whole triangularization requires n Householder
reflections for n ≤ m. (See LAPACK subroutine GEQR2.) The LAPACK GEQRT
subroutine constructs a compact WY representation to apply a sequence of nb House-
holder reflections; this enables one to introduce the appropriate zeros in nb consecutive
columns and thus leverage optimized Level 3 BLAS subroutines during the update.
The blocking of Givens rotations is also possible but is more costly in terms of flops.
The main interest of Givens rotations over Householder transformations is that
one can concurrently introduce zeros using disjoint pairs of rows, in other words, two
transformations U_i and U_{i+1} may be applicable concurrently. This is not possible
using the original Householder reflection algorithm since the transformations work
on whole columns and thus do not exhibit this type of intrinsic parallelism, forcing
this kind of Householder reflections to be applied sequentially. The advantages of
Householder reflections over Givens rotations are that, first, Householder reflections
perform fewer flops, and second, the compact WY transformation enables high se-
quential performance of the algorithm. In a multicore setting, where data locality
and parallelism are crucial algorithmic characteristics for enabling performance, the
tiled QR factorization algorithm combines both ideas: use of Householder reflections
for high sequential performance and use of a scheme such as Givens rotations to en-
able parallelism within cores. In essence, one can think either (i) of the tiled QR
factorization as a Givens rotation scheme but on tiles (mb-by-nb submatrices) instead
of on scalars (1-by-1 submatrices) as in the original scheme, or (ii) of it as a blocked
Householder reflection scheme where each reflection is confined to an extent much
18


less than the full column span, which enables concurrency with other reflections.
Tiled QR factorization in the context of multicore architectures has been intro-
duced in [18, 19, 41]. Initially the focus was on square matrices and the sequence of
unitary transformations presented was analogous to Sameh-Kuck [45], which cor-
responds to reducing the panels with flat trees. The possibility of using any tree in
order to either maximize parallelism or minimize communication is explained in [26].
The focus of this chapter is on maximizing parallelism. We reduce the communica-
tion (data movement between memory hierarchy) within the algorithm to acceptable
levels by tiling the operations. Stemming from the observation that a binary tree is
best for tall and skinny matrices and a flat tree is best for square matrices, Hadri
et al. [30] propose to use trees which combine flat trees at the bottom level with a
binary tree at the top level in order to exhibit more parallelism. Our theoretical and
experimental work explains that we can adapt Fibonacci [36] and Greedy [24, 25]
to tiles, resulting in yet better algorithms in terms of parallelism. Moreover our new
algorithms do not have any tuning parameter such as the domain size in the case
of [30].
The sequential kernels of the Tiled QR factorization (executed on a core) are made
of standard blocked algorithms such as LAPACK encoded in kernels; the development
of these kernels is well understood. The focus of this chapter is on improving the
overall degree of parallelism of the algorithm. Given a p-by-q tiled matrix, we seek
to find an appropriate sequence of unitary transformations on the tiled matrix so as
to maximize parallelism (minimize critical path length). We will get our inspiration
from previous work from the 1970s/80s on Givens rotations where the question was
somewhat related: given an m-by-n matrix, find an appropriate sequence of Givens
rotations so as to maximize parallelism. This question is essentially answered in [24, 25,
36, 45]; we call this class of algorithms coarse-grain algorithms.
Working with tiles instead of scalars, we introduce four essential differences be-
19


tween the analysis and the reality of the tiled algorithms versus the coarse-grain
algorithms. First, while there are only two states for a scalar (nonzero or zero), a
tile can be in three states (zero, triangle or full). Second, there are more operations
available on tiles to introduce zeros; we have a total of three different tasks which can
introduce zeros in a matrix. Third, in the tiled algorithms, the factorization and the
update are dissociated to enable factorization stages to overlap with update stages;
whereas, in the coarse-grain algorithm, the factorization and the associated update
are considered as a single stage. Lastly, while coarse-grain algorithms have only one
task, we end up with six different tasks: three from the factorizations (zeroing of
tiles) and three for each of the associated updates (since these have been unlinked
from the factorization). Each of these six tasks has a different computational weight;
this dramatically complicates the critical path analysis of the tiled algorithms.
While the Greedy algorithm is optimal among coarse-grain algorithms, we show
that it is not in the case of tiled algorithms. However, we have devised a new tiled
algorithm and proved that it is optimal.
3.1 The QR factorization algorithm
Tiled algorithms are expressed in terms of tile operations rather than elementary
operations. Each tile is of size nb x nb, where nb is a parameter tuned to squeeze
the most out of arithmetic units and memory hierarchy. Typically, nb ranges from
80 to 200 on state-of-the-art machines [5]. Algorithm 3.1 outlines a naive tiled QR
algorithm, where loop indices represent tiles:
Algorithm 3.1: Generic QR algorithm for a tiled p x q matrix.
1 for k = 1 to min(p, q) do
2     forall the i ∈ {k + 1, ..., p} using any ordering do
3         elim(i, piv(i, k), k)
In Algorithm 3.1, k is the panel index, and elim(i,piv(i,k),k) is an orthogonal
transformation that combines rows i and piv(i, k) to zero out the tile in position (i, k).
20


However, this formulation is somewhat misleading, as there is much more freedom
for QR factorization algorithms than, say, for Cholesky algorithms (and contrary to
LU elimination algorithms, there are no numerical stability issues). For instance, in
column 1, the algorithm must eliminate all tiles (i, 1) where i > 1, but it can do so
in several ways. Take p = 6. Algorithm 3.1 uses the transformations
elim(2,1,1), elim(3,1,1), elim(4,1,1), elim(5,1,1), elim(6,1,1)
But the following scheme is also valid:
elim(3,1,1), elim(6,4,1), elim(2,1,1), elim(5, 4,1), elim(4,1,1)
In this latter scheme, the first two transformations elim(3,1,1) and elim(6,4,1) use
distinct pairs of rows, and they can execute in parallel. On the contrary, elim(3,1,1)
and elim(2,1,1) use the same pivot row and must be sequentialized. To compli-
cate matters, it is possible to have two orthogonal transformations that execute in
parallel but involve zeroing a tile in two different columns. For instance we can
add elim(6,5,2) to the previous transformations and run it concurrently with, say,
elim(2,1,1). Any tiled QR algorithm will be characterized by an elimination list,
which provides the ordered list of the transformations used to zero out all the tiles
below the diagonal. This elimination list must obey certain conditions so that the fac-
torization is valid. For instance, elim(6, 5, 2) must follow elim(6, 4,1) and elim(5,4,1)
in the previous list, because there is a flow dependence between these transformations.
Note that, although the elimination list is given as a totally ordered sequence, some
transformations can execute in parallel, provided that they are not linked by a de-
pendence: in the example, elim(6,4,1) and elim(2,1,1) could have been swapped,
and the elimination list would still be valid.
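As an illustration (a sketch of ours, with hypothetical helper names), here are two ways of generating a valid elimination list for one column of p tiles: the flat-tree ordering used by Algorithm 3.1, and a binary-tree ordering in which eliminations on disjoint pairs of rows can proceed concurrently.

def flat_tree(rows, k):
    # Eliminate every row of the column against the top row as the single pivot.
    pivot = rows[0]
    return [(i, pivot, k) for i in rows[1:]]

def binary_tree(rows, k):
    # Pairwise eliminations: surviving rows are combined two by two, round after round.
    elims, alive = [], list(rows)
    while len(alive) > 1:
        survivors = []
        for a, b in zip(alive[0::2], alive[1::2]):
            elims.append((b, a, k))       # zero row b against pivot row a
            survivors.append(a)
        if len(alive) % 2:
            survivors.append(alive[-1])   # odd row out survives this round
        alive = survivors
    return elims

print(flat_tree(range(1, 7), 1))    # the flat-tree list used above for p = 6
print(binary_tree(range(1, 7), 1))  # disjoint pairs of rows, usable concurrently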
In order to describe more fully the dependencies inherent in the eliminations
we shall observe a snippet of an example. In Figure 3.1a, to the left we have the
row identifications, the empty circles represent zeroed elements, and the filled circles
21


represent the pivots used to zero out the elements. The first column's eliminations
are shown in green and the second's in red. From the elimination list, we define I_{s,k} as
the set of rows in column k that are zeroed out at time step s.
(a) Diagram of elimination list

Elimination list:
elim(13, 10, 1)   }
elim(14, 11, 1)   }  elim(I_{s1,1}, 1)
elim(15, 12, 1)   }
elim(9, 5, 1)     }
elim(10, 6, 1)    }  elim(I_{s2,1}, 1)
elim(11, 7, 1)    }
elim(12, 8, 1)    }
elim(15, 14, 2)   }  elim(I_{s1,2}, 2)
elim(12, 9, 2)    }
elim(13, 10, 2)   }  elim(I_{s2,2}, 2)
elim(14, 11, 2)   }
What may not be so evident from the elimination list but is more apparent in
the diagram of the elimination list are the following dependency relationships (note
that ≺ indicates that the operation on the left must finish prior to the operation on
the right starting):

elim(piv(I_{s,k}, k), k - 1) ≺ elim(I_{s,k}, k)    (3.1a)
elim(I_{s,k}, k - 1) ≺ elim(I_{s,k}, k)            (3.1b)
elim(I_{s-1,k}, k) ≺ elim(I_{s,k}, k)              (3.1c)

However, not all of these dependencies may cause an elimination to be locked to a
particular time step. In fact, some dependencies may not be needed for a particular
instance, but the addition of these will not create an artificial lock. For example,
elim(I_{s2,2}, 2) is dependent upon elim(piv(I_{s2,2}, 2), 1), elim(I_{s2,2}, 1), and elim(I_{s1,2}, 2),
but elim(I_{s1,2}, 2) only depends upon elim(piv(I_{s1,2}, 2), 1) and elim(I_{s1,2}, 1).
22


Table 3.1: Kernels for tiled QR. The unit of time is nb³/3, where nb is the blocksize.

                                        Panel            Update
Operation                               Name    Cost     Name    Cost
Factor square into triangle             GEQRT   4        UNMQR   6
Zero square with triangle on top        TSQRT   6        TSMQR   12
Zero triangle with triangle on top      TTQRT   2        TTMQR   6
Before formally stating the conditions that guarantee the validity of (the elimi-
nation list of) an algorithm, we explain how orthogonal transformations can be im-
plemented.
3.1.1 Kernels
To implement a given orthogonal transformation elim(i,piv(i, k),k), one can use
six different kernels, whose costs are given in Table 3.1. In this table, the unit of time
is the time to perform nb³/3 floating-point operations.
There are two main possibilities to implement an orthogonal transformation
elim(i,piv(i,k),k): The first version eliminates tile (i,k) with the TS (Triangle on
top of square) kernels, as shown in Algorithm 3.2:
Algorithm 3.2: Elimination elim(i, piv(i,k), k) via TS (Triangle on top of
square) kernels.
1 GEQRT(piv(i,k), k)
2 TSQRT(i, piv(i,k), k)
3 for j = k + 1 to q do
4     UNMQR(piv(i,k), k, j)
5     TSMQR(i, piv(i,k), k, j)
Here the tile panel (piv(i,k), k) is factored into a triangle (with GEQRT). The
transformation is applied to subsequent tiles (piv(i,k), j), j > k, in row piv(i,k) (with
UNMQR). Tile (i,k) is zeroed out (with TSQRT), and subsequent tiles (i,j), j > k,
in row i are updated (with TSMQR). The flop count is 4 + 6 + (6 + 12)(q - k) =
23


10 + 18(q - k) (expressed in the same time unit as in Table 3.1). Dependencies are the
following:
GEQRT(piv(i,k), k) ≺ TSQRT(i, piv(i,k), k)
GEQRT(piv(i,k), k) ≺ UNMQR(piv(i,k), k, j) for j > k
UNMQR(piv(i,k), k, j) ≺ TSMQR(i, piv(i,k), k, j) for j > k
TSQRT(i, piv(i,k), k) ≺ TSMQR(i, piv(i,k), k, j) for j > k
TSQRT(i,piv(i, k), k) and UNMQR(piv(i, k), k,j) can be executed in parallel, as well
as IJNMQR operations on different columns j,j' > k. With an unbounded number
of processors, the parallel time is thus 4 + 6 + 12 = 22 time-units.
Algorithm 3.3: Elimination elim(i, piv(i,k), k) via TT (Triangle on top of triangle) kernels.
1  if k > 0 then
2      TTQRT(i, piv(i,k), k)
3  if k < q then
4      for j = k + 1 to q do
5          if k > 0 then
6              TTMQR(i, piv(i,k), k, j)
7      GEQRT(i, k + 1)
8      for j = k + 2 to q do
9          UNMQR(i, k + 1, j)
The second approach to implement the orthogonal transformation
elim(i,piv(i,k),k) is with the TT (Triangle on top of triangle) kernels, as shown
in Algorithm 3.3. Here tile (i,k) is zeroed out (with TTQRT) and subsequent tiles
(i,j) and (piv(i,k),j), j > k, in rows i and piv(i,k) are updated (with TTMQR).
Immediately following, tile (i,k + 1) is factored into a triangle and the correspond-
ing transformations are applied to the remaining columns in row i. Necessarily, TTQRT requires the triangularization of tiles (i,k) and (piv(i,k),k) to be completed in order to proceed. For the first column there are no updates to be applied from previous columns, so the triangularization of these tiles (with GEQRT) can be considered a preprocessing step. The flop count is 2(4 + 6(q − k)) + 2 + 6(q − k) = 10 + 18(q − k), just as before. Dependencies are the following:
GEQRT(piv(i,k), k) ≺ UNMQR(piv(i,k), k, j)           for j > k     (3.2a)
GEQRT(i, k) ≺ UNMQR(i, k, j)                          for j > k     (3.2b)
GEQRT(piv(i,k), k) ≺ TTQRT(i, piv(i,k), k)                          (3.2c)
GEQRT(i, k) ≺ TTQRT(i, piv(i,k), k)                                 (3.2d)
TTQRT(i, piv(i,k), k) ≺ TTMQR(i, piv(i,k), k, j)      for j > k     (3.2e)
UNMQR(piv(i,k), k, j) ≺ TTMQR(i, piv(i,k), k, j)      for j > k     (3.2f)
UNMQR(i, k, j) ≺ TTMQR(i, piv(i,k), k, j)             for j > k     (3.2g)
Now the factor operations in row piv(i, k) and i can be executed in parallel. Moreover,
the UNMQR updates can be run in parallel with the TTQRT factorization. Thus,
with an unbounded number of processors, the parallel time is 4+6+6 = 16 time-units.
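These per-elimination counts are easy to tabulate. The short Python sketch below (an illustration of our own, not thesis code) uses the kernel weights of Table 3.1 to reproduce the sequential weight 10 + 18(q − k), identical for both variants, and the unbounded-processor times 22 (TS) and 16 (TT).

GEQRT, UNMQR, TSQRT, TSMQR, TTQRT, TTMQR = 4, 6, 6, 12, 2, 6   # weights of Table 3.1

def ts_elimination(k, q):
    """Total weight and longest dependency chain of Algorithm 3.2 (TS kernels)."""
    seq = GEQRT + TSQRT + (UNMQR + TSMQR) * (q - k)   # 10 + 18(q - k)
    chain = GEQRT + TSQRT + TSMQR                     # 4 + 6 + 12 = 22
    return seq, chain

def tt_elimination(k, q):
    """Total weight and longest dependency chain of Algorithm 3.3 (TT kernels)."""
    seq = 2 * (GEQRT + UNMQR * (q - k)) + TTQRT + TTMQR * (q - k)   # 10 + 18(q - k)
    chain = GEQRT + UNMQR + TTMQR                     # 4 + 6 + 6 = 16
    return seq, chain

assert ts_elimination(1, 5)[0] == tt_elimination(1, 5)[0] == 10 + 18 * (5 - 1)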
Recall our definition of the set I_{s,k} as the set of rows in column k that will be zeroed out at time step s in the coarse-grain algorithm. The following dependencies are then a direct consequence of (3.1c) as applied to the zeroing of a tile and the corresponding updates.

TTQRT(I_{s−1}, piv(I_{s−1}, k), k) ≺ TTQRT(I_s, piv(I_s, k), k)                       (3.3a)
TTMQR(I_{s−1}, piv(I_{s−1}, k), k, j) ≺ TTMQR(I_s, piv(I_s, k), k, j)   for j > k     (3.3b)
In Algorithm 3.2 and 3.3, it is understood that if a tile is already in triangular
form, then the associated GEQRT and update kernels do not need to be applied.
All the new algorithms introduced in this chapter are based on TT kernels. From
an algorithmic perspective, TT kernels are more appealing than TS kernels, as they
offer more parallelism. More precisely, we can always break a TS kernel into two TT
kernels: We can replace a TSQRT(i,piv(i, k), k) (following a GEQRT(piv(i, k), k)) by
a GEQRT(i, k) and a TTQRT(i,piv(i, k), k). A similar transformation can be made
for the updates. Hence a TS-based tiled algorithm can always be executed with TT
kernels, while the converse is not true. However, the TS kernels provide more data
locality, they benefit from a very efficient implementation (see §3.3), and several ex-
isting algorithms use these kernels. For all these reasons, and for comprehensiveness,
our experiments will compare approaches based on both kernel types.
At this point, the PLASMA library only contains TS kernels. We have mapped
the PLASMA algorithm to a TT kernel algorithm using this conversion. Going from a
TS kernel algorithm to a TT kernel algorithm is implicitly done by Hadri et al. [11]
when going from their Semi-Parallel to their Fully-Parallel algorithms.
3.1.2 Elimination lists
As stated above, any algorithm factorizing a tiled matrix of size p x q is charac-
terized by its elimination list. Obviously, the algorithm must zero out all tiles below
the diagonal: for each tile (i,k), i > k, 1 ≤ k ≤ min(p,q), the list must contain exactly one entry elim(i, piv(i,k), k), where piv(i,k) denotes some row index. There are two conditions for a transformation elim(i, piv(i,k), k) to be valid:

- both rows i and piv(i,k) must be ready, meaning that all their tiles left of the panel (of indices (i,k') and (piv(i,k),k') for 1 ≤ k' < k) must have already been zeroed out: all transformations elim(i, piv(i,k'), k') and elim(piv(i,k), piv(piv(i,k),k'), k') must precede elim(i, piv(i,k), k) in the elimination list;

- row piv(i,k) must be a potential annihilator, meaning that tile (piv(i,k), k) has not been zeroed out yet: the transformation elim(piv(i,k), piv(piv(i,k),k), k) must follow elim(i, piv(i,k), k) in the elimination list.
Any algorithm that factorizes the tiled matrix obeying these conditions is called a
generic tiled algorithm in the following.
Theorem 3.1 No matter what elimination list (any combination of TT, TS) is used, the total weight of the tasks for performing a tiled QR factorization algorithm is constant and equal to 6pq² − 2q³.

Proof: The quantities of the kernels satisfy the following relations:

L1 :: GEQRT = TTQRT + q
L2 :: UNMQR = TTMQR + (1/2)q(q − 1)
L3 :: TTQRT + TSQRT = pq − (1/2)q(q + 1)
L4 :: TTMQR + TSMQR = (1/2)pq(q − 1) − (1/6)q(q − 1)(q + 1)
The quantity of TTQRT provides the number of tiles zeroed out via a triangle on
top of a triangle kernel. Thus equation L1 is composed of two parts: necessarily, the diagonal tiles must be triangularized, and each TTQRT must admit one more triangularization in order to provide a pairing. The number of updates of these triangularizations, given by L2, is simply the sum of the updates from the triangularization of the tiles on the diagonal and the updates from the zeroed tiles via TTQRT. The combination of TTQRT and TSQRT, equation L3, is exactly the total number of tiles that are zeroed, namely every tile below the diagonal. Hence, the total number of updates, provided by L4, is the number of tiles below the diagonal beyond the first column minus the sum of the progression through the columns. Now we define

L5 := 4 L1 + 6 L2 + 6 L3 + 12 L4,

then

L5 :: 4 GEQRT + 6 TSQRT + 2 TTQRT + 6 UNMQR + 6 TTMQR + 12 TSMQR.
As can be noted in L5, the coefficients of each term correspond precisely to the weight
of the kernels as derived from the number of flops each kernel incurs. Simplifying L5,
we have our result

L5 = 6pq² − 2q³.  □
A critical result of Theorem 3.1 is that no matter what elimination list is used, the total weight of the tasks for performing a tiled QR factorization algorithm is constant; by using our unit task weight of n_b³/3, with m = p·n_b and n = q·n_b, we obtain 2mn² − (2/3)n³ flops, which is exactly the same number as for a standard Householder reflection algorithm as found in LAPACK (e.g., [14]).
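As a quick sanity check on Theorem 3.1, the Python sketch below (our own illustration, not thesis code) counts the kernels of the TS-based algorithm that, in each column, factors the panel tile and uses it to eliminate every tile below it (the FlatTree/Sameh-Kuck scheme described in Section 3.2.1), and verifies relations L1–L4 as well as the total weight 6pq² − 2q³.

def flattree_ts_counts(p, q):
    geqrt = unmqr = tsqrt = tsmqr = ttqrt = ttmqr = 0
    for k in range(1, q + 1):
        geqrt += 1                      # factor the panel tile (k, k)
        unmqr += q - k                  # update row k to the right of the panel
        tsqrt += p - k                  # zero out tiles (k+1..p, k)
        tsmqr += (p - k) * (q - k)      # corresponding updates
    return geqrt, unmqr, tsqrt, tsmqr, ttqrt, ttmqr

p, q = 7, 5
geqrt, unmqr, tsqrt, tsmqr, ttqrt, ttmqr = flattree_ts_counts(p, q)
assert geqrt == ttqrt + q                                                   # L1
assert unmqr == ttmqr + q * (q - 1) // 2                                    # L2
assert ttqrt + tsqrt == p * q - q * (q + 1) // 2                            # L3
assert ttmqr + tsmqr == p * q * (q - 1) // 2 - q * (q - 1) * (q + 1) // 6   # L4
total = 4 * geqrt + 6 * unmqr + 6 * tsqrt + 12 * tsmqr + 2 * ttqrt + 6 * ttmqr
assert total == 6 * p * q ** 2 - 2 * q ** 3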
3.1.3 Execution schemes
In essence, the execution of a generic tiled algorithm is fully determined by its
elimination list. This list is statically given as input to the scheduler, and the exe-
cution progresses dynamically, with the scheduler executing all required transforma-
tions as soon as possible. More precisely, each transformation involves several kernels,
whose execution starts as soon as they are ready, i.e., as soon as all dependencies have
been enforced. Recall that a tile (i,k) can be zeroed out only after all tiles (i,k'),
with k' < k, have been zeroed out. Execution progresses as follows:
Before being ready for elimination, tile (i,k), i > k, must be updated k − 1 times, in order to zero out the k − 1 tiles to its left (of index (i,k'), k' < k). The last update is a transformation TTMQR(i, piv(i, k−1), k−1, k) for some row index piv(i, k−1) such that elim(i, piv(i, k−1), k−1) belongs to the elimination list.
When completed, this transformation triggers the transformation GEQRT(i, k),
which can be executed immediately after the completion of the TTMQR. In
turn, GEQRT(i,k) triggers all updates UNMQR(i,k,j) for all j > k. These
updates are executed as soon as they are ready for execution.
The elimination elim(i,piv(i, k), k) is performed as soon as possible when both
rows i and piv(i,k) are ready. Just after the completion of GEQRT(i, k) and
GEQRT(piv(i, k),k), kernel TTQRT(i, piv(i, k), k) is launched. When finished,
it triggers the updates TTMQR(i,piv(i, k), k,j) for all j > k.
Obviously, the degree of parallelism that can be achieved depends upon the elim-
inations that are chosen. For instance, if all eliminations in a given column use the
same factor tile, they will be sequentialized. This corresponds to the flat tree elimi-
nation scheme described below: in each column k, it uses elim(i,k,k) for all i > k.
On the contrary, two eliminations elim(i,piv(i,k),k) and elim(i',piv(i',k),k) in the
same column can be fully parallelized provided that they involve four different rows.
Finally, note that several eliminations can be initiated in different columns simulta-
neously, provided that they involve different pairs of rows, and that all these rows are
ready (i.e., they have the desired number of leftmost zeros).
The following lemma will prove very useful; it states that we can assume w.l.o.g.
that each tile is zeroed out by a tile above it, closer to the diagonal.
Lemma 3.2 Any generic tiled algorithm can be modified, without changing its exe-
cution time, so that all eliminations elim(i,piv(i,k),k) satisfy i > piv(i, k).
Proof: Define a reverse elimination as an elimination elim(i, piv(i,k), k) where i < piv(i,k). Consider a generic tiled algorithm whose elimination list contains some reverse eliminations. Let k_0 be the first column to contain one of them. Let i_0 be the largest row index involved in a reverse elimination in column k_0. The elimination list in column k_0 may contain several reverse eliminations elim(i_1, i_0, k_0), elim(i_2, i_0, k_0), ..., elim(i_r, i_0, k_0), in that order, before row i_0 is eventually zeroed out by the transformation elim(i_0, piv(i_0, k_0), k_0). Note that piv(i_0, k_0) < i_0 by definition of i_0. We modify the algorithm by exchanging the roles of rows i_0 and i_1 in column k_0: the elimination list now includes elim(i_0, i_1, k_0), elim(i_2, i_1, k_0), ..., elim(i_r, i_1, k_0), and elim(i_1, piv(i_0, k_0), k_0). All dependencies are preserved, and the execution time is unchanged. Now the largest row index involved in a reverse elimination in column k_0 is strictly smaller than i_0, and we repeat the procedure until there does not remain any reverse elimination in column k_0. We proceed inductively to the following columns, until all reverse eliminations have been suppressed. □
3.2 Critical paths
In this section we describe several generic tiled algorithms, and we provide their
critical paths, as well as optimality results. These algorithms are inspired by algo-
rithms that have been introduced twenty to thirty years ago [45, 36, 25, 24], albeit for
a much simpler, coarse-grain model. In this old model, the time-unit is the time
needed to execute an orthogonal transformation across two matrix rows, regardless
of the position of the zero to be created, hence regardless of the length of these rows.
Although the granularity is much coarser in this model, any existing algorithm for
the old model can be transformed into a generic tiled algorithm, just by enforcing
the very same elimination list provided by the algorithm. Critical paths are obtained
using a discrete event based simulator specially developed to this end, based on the
Simgrid framework [47]. It carefully handles dependencies across tiles, and allows for
the analysis of both static and dynamic algorithms.1
3.2.1 Coarse-grain algorithms
We start with a short description of three algorithms for the coarse-grain model.
These algorithms are illustrated in Table 3.2 for a 15 x 6 matrix.
3.2.1.1 Sameh-Kuck algorithm
The Sameh-Kuck algorithm [45] uses the panel row for all eliminations in each
column, starting from below the diagonal and proceeding downwards. Time-steps
indicate the time-unit at which the elimination can be done, assuming unbounded
resources. Formally, the elimination list is
{ elim(i, k, k) : i = k + 1, k + 2, ..., p ;  k = 1, 2, ..., min(p, q) }
1The discrete event based simulator, together with the code for all tiled algorithms, is publicly
available at http://graal.ens-lyon.fr/~mjacquel/tiledQR.html
This algorithm is also referred to as FlatTree.
3.2.1.2 Fibonacci algorithm
The Fibonacci algorithm is the Fibonacci scheme of order 1 in [36]. Let coarse(i,k) be the time-step at which tile (i,k), i > k, is zeroed out. These values are computed as follows. In the first column, there are one 5, two 4s, three 3s, four 2s and four 1s (we would have had five 1s with p = 16). Given x as the least integer such that x(x+1)/2 ≥ p − 1, we have coarse(i, 1) = x − y + 1, where y is the least integer such that i ≤ y(y+1)/2 + 1. Let the row indices of the z tiles that are zeroed out at step s, 1 ≤ s ≤ x, range from i to i + z − 1. The elimination list for these tiles is elim(i + j, piv(i + j, 1), 1), with piv(i + j, 1) = i + j − z for 0 ≤ j ≤ z − 1. In other words, to eliminate a bunch of z consecutive tiles at the same time-step, the algorithm uses the z rows above them, pairing them in the natural order. The elimination scheme of the next column is the same as that of the previous column, shifted down by one row and adding two time-units: coarse(i, k) = coarse(i−1, k−1) + 2, while the pairing obeys the same rule.
3.2.1.3 Greedy algorithm
At each step, the Greedy algorithm [24, 25] eliminates as many tiles as possible
in each column, starting with bottom rows. The pairing for the eliminations is done
exactly as for Fibonacci. There is no closed-form formula to compute coarse(i,k),
the time-step at which tile (i, k) is eliminated, but it is possible to provide recursive
expressions (see [24, 25]).
Consider a rectangular p × q matrix, with p > q. With the coarse-grain model, the critical path of Sameh-Kuck is p + q − 2, and that of Fibonacci is x + 2q − 2, where x is the least integer such that x(x+1)/2 ≥ p − 1. There is no closed-form formula for the critical path of Greedy, but it is known to be optimal. For square q × q matrices, critical paths are slightly different (2q − 3 for Sameh-Kuck, x + 2q − 4 for Fibonacci).
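These coarse-grain schedules are easy to generate programmatically. The Python sketch below (our own illustration, not the thesis simulator) implements the closed forms for Sameh-Kuck and Fibonacci, simulates Greedy step by step, and recovers the critical paths quoted above for the 15 × 6 example of Table 3.2.

def coarse_sameh_kuck(i, k):
    """FlatTree/Sameh-Kuck: tile (i, k) is eliminated with the panel row k."""
    return i + k - 2

def coarse_fibonacci(i, k, p):
    """Fibonacci scheme of order 1 (closed form of Section 3.2.1.2)."""
    if k > 1:
        return coarse_fibonacci(i - 1, k - 1, p) + 2
    x = 1
    while x * (x + 1) // 2 < p - 1:
        x += 1
    y = 1
    while i > y * (y + 1) // 2 + 1:
        y += 1
    return x - y + 1

def coarse_greedy(p, q):
    """Simulate the Greedy coarse-grain algorithm; return {(i, k): time step}."""
    z = [0] * (p + 1)                       # leading zeros in each row (1-based)
    steps, s = {}, 0
    while any(z[r] < min(r - 1, q) for r in range(1, p + 1)):
        s += 1
        snapshot = z[:]                     # decide from the state at step start
        for k in range(1, q + 1):
            ready = [r for r in range(k, p + 1) if snapshot[r] == k - 1]
            for r in ready[len(ready) - len(ready) // 2:]:   # bottom half is eliminated
                steps[(r, k)] = s
                z[r] += 1
    return steps

p, q = 15, 6
greedy = coarse_greedy(p, q)
print(max(coarse_sameh_kuck(p, k) for k in range(1, q + 1)))   # p + q - 2 = 19
print(max(coarse_fibonacci(i, k, p) for (i, k) in greedy))     # x + 2q - 2 = 15
print(max(greedy.values()))                                    # 14 (optimal)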
Table 3.2: Time-steps for coarse-grain algorithms (★: diagonal tile).
         (a) Sameh-Kuck       | (b) Fibonacci        | (c) Greedy
row  1:  ★                    | ★                    | ★
row  2:  1 ★                  | 5 ★                  | 4 ★
row  3:  2 3 ★                | 4 7 ★                | 3 6 ★
row  4:  3 4 5 ★              | 4 6 9 ★              | 3 5 8 ★
row  5:  4 5 6 7 ★            | 3 6 8 11 ★           | 2 5 7 10 ★
row  6:  5 6 7 8 9 ★          | 3 5 8 10 13 ★        | 2 4 7 9 12 ★
row  7:  6 7 8 9 10 11        | 3 5 7 10 12 15       | 2 4 6 9 11 14
row  8:  7 8 9 10 11 12       | 2 5 7 9 12 14        | 2 4 6 8 10 13
row  9:  8 9 10 11 12 13      | 2 4 7 9 11 14        | 1 3 5 8 10 12
row 10:  9 10 11 12 13 14     | 2 4 6 9 11 13        | 1 3 5 7 9 11
row 11:  10 11 12 13 14 15    | 2 4 6 8 11 13        | 1 3 5 7 9 11
row 12:  11 12 13 14 15 16    | 1 4 6 8 10 13        | 1 3 4 6 8 10
row 13:  12 13 14 15 16 17    | 1 3 6 8 10 12        | 1 2 4 6 8 10
row 14:  13 14 15 16 17 18    | 1 3 5 8 10 12        | 1 2 4 5 7 9
row 15:  14 15 16 17 18 19    | 1 3 5 7 10 12        | 1 2 3 5 6 8
3.2.2 Tiled algorithms
As stated above, each coarse-grain algorithm can be transformed into a tiled
algorithm, simply by keeping the same elimination list, and triggering the execution
of each kernel as soon as possible. However, because the weights of the factor and
update kernels are not the same, it is much more difficult to compute the critical
paths of the transformed (tiled) algorithms. Table 3.3 is the counterpart of Table 3.2,
and depicts the time-steps at which tiles are actually zeroed out. Note that the tiled
version of Sameh-Kuck is indeed the FlatTree algorithm in PLASMA [18, 19], and
we have renamed it accordingly. As an example, Algorithm 3.4 shows the Greedy
algorithm for the tiled model.
A first (and quite unexpected) result is that Greedy is no longer optimal, as shown in the first two columns of Table 3.4a for a 15 × 2 matrix. In each column and at each step, the Asap algorithm starts the elimination of a tile as soon as there are at least two rows ready for the transformation. When s ≥ 2 eliminations can start simultaneously, Asap pairs the 2s rows just as Fibonacci and Greedy do, the first row (closest to the diagonal) with row s + 1, the second row with row s + 2, and so on. As a
matter of fact, when processing the second column, both Asap and Greedy begin with the elimination of rows 10 to 15 (at time step 20). However, once tiles (13,2), (14,2) and (15,2) are zeroed out (i.e., at time step 22), Asap eliminates 4 zeros, in rows 9 through 12. On the contrary, Greedy waits until time step 26 to eliminate 6 zeros in rows 6 through 12. In a sense, Asap is the counterpart of Greedy at the tile level. However, Asap is not optimal either, as shown in Table 3.4a for a 15 × 3 matrix. On larger examples, the critical path of Greedy is better than that of Asap, as shown in Table 3.4b.
We can however use the optimality of the coarse-grain Greedy to devise an
optimal tiled algorithm. Let us define the following algorithm:
Definition 3.3 Given a matrix of p × q tiles, with p ≥ q, the GrASAP(i) algorithm
1. uses Algorithm 3.3 to execute Greedy on the first q − i columns and propagate the updates through column q, and
2. for columns q − i + 1 through q, applies the Asap algorithm.
Clearly, if we let i = q we obtain the Asap algorithm. We define GrASAP to be GrASAP(1), i.e., only the elimination of the last column will differ from Greedy, and we will show that GrASAP is an optimal tiled algorithm.
Although we cannot provide an elimination list for the entire tiled matrix of size p × q, we do provide an elimination list for the first q − 1 columns. This tiled elimination list describes the time-steps at which Algorithm 3.3 is complete, i.e., all of the factorization kernels are complete for k ≤ q − 1 and the corresponding update kernels are complete for all columns k < j ≤ q.
We must make note of one consequence of the coarse-grain elimination list before proceeding. We will use this repeatedly within the proof of translating a coarse-grain elimination list to a tiled elimination list.
Lemma 3.4 Given an elimination list from any coarse-grain algorithm, let s = coarse(i,k) be the time step at which element (i,k) is eliminated and let

I_{s,k} = { i | s = coarse(i, k) }.

Then for any s we have

s − 1 = coarse(I_{s,k}, k) − 1 ≥ max( coarse(I_{s,k}, k−1), coarse(piv(I_{s,k}, k), k−1) ),

and in particular

s_1 − 1 = max( coarse(I_{s_1,k}, k−1), coarse(piv(I_{s_1,k}, k), k−1) ),

where s_1 = min_{k+1 ≤ i ≤ p} (coarse(i, k)).

Proof: This follows directly from the dependencies given in (3.1a)–(3.1c). □
Theorem 3.5 Given the elimination list of a coarse-grain algorithm for a matrix of size p × q, using Algorithm 3.3, the tiled elimination list for all but the last column is given by

tiled(i, k) = 10k + 6 coarse(i, k),   1 ≤ i ≤ p, 1 ≤ k ≤ q − 1,

where coarse(i, k) is the elimination list of the coarse-grain algorithm.

Proof: In this analysis, when k is clear, we will use I_s instead of I_{s,k}. By abuse of notation, we will write GEQRT(i, k) to denote the time at which the task GEQRT(i, k) is complete, and similarly for all of the kernels. Thus we will prove that

tiled(i, k) = TTMQR(i, piv(i,k), k, j)   for j > k.

Note that j represents the column in which the updates are applied, and all columns j for j > k have the same update history. In Algorithm 3.3, the two j-loops spawn
mutually independent tasks. Since we have an unbounded number of processors, these
tasks can all run simultaneously. So j represents any one of these columns.
We will proceed by induction on k. For the first column, k = 1, we do not have
any dependencies which concern the GEQRT operations. Thus from (3.2b) we have
for 1 ≤ i ≤ p,

GEQRT(i, 1) = 4          (3.4)
UNMQR(i, 1, j) = 4 + 6 = 10          (3.5)
Since each column in the coarse-grain elimination list is composed of one or more
time steps, we must also proceed with induction on the time steps. Let
s_1 = min_{2 ≤ i ≤ p} (coarse(i, 1)).

In the case k = 1, we have that

s_1 = 1.          (3.6)

In other words, the first tasks finish at time step 1 for the coarse-grain algorithm. This is a complicated manner in which to state that s_1 = 1, but it will be needed in the general setting.
So for s_1, from (3.2c) and (3.2d) we have

TTQRT(I_{s_1}, piv(I_{s_1}, 1), 1) = max( GEQRT(piv(I_{s_1}, 1), 1), GEQRT(I_{s_1}, 1) ) + 2 = 4 + 2,

thus

TTQRT(I_{s_1}, piv(I_{s_1}, 1), 1) = 4 + 2 s_1.          (3.7)
Now from (3.2e), (3.2f), and (3.2g) we have

TTMQR(I_{s_1}, piv(I_{s_1}, 1), 1, j) = max( TTQRT(I_{s_1}, piv(I_{s_1}, 1), 1), UNMQR(piv(I_{s_1}, 1), 1, j), UNMQR(I_{s_1}, 1, j) ) + 6
    = 10·1 + 6 s_1
    = 10·1 + 6 coarse(I_{s_1}, 1).

Therefore,

TTMQR(I_{s_1}, piv(I_{s_1}, 1), 1, j) = tiled(I_{s_1}, 1).
Assume that for 1 ≤ t ≤ s − 1 we have

TTQRT(I_t, piv(I_t, 1), 1) = 4 + 2t          (3.8)
TTMQR(I_t, piv(I_t, 1), 1, j) = 10 + 6t          (3.9)

then from (3.2c), (3.2d), and (3.3a) we have

TTQRT(I_s, piv(I_s, 1), 1) = max( GEQRT(piv(I_s, 1), 1), GEQRT(I_s, 1), TTQRT(I_{s−1}, piv(I_{s−1}, 1), 1) ) + 2
    = max( 4, 4, 4 + 2(s − 1) ) + 2.          (3.10)

Thus

TTQRT(I_s, piv(I_s, 1), 1) = 4 + 2s.          (3.11)
From (3.2e), (3.2f), (3.2g), and (3.3b) we have

TTMQR(I_s, piv(I_s, 1), 1, j) = max( TTQRT(I_s, piv(I_s, 1), 1), UNMQR(piv(I_s, 1), 1, j), UNMQR(I_s, 1, j), TTMQR(I_{s−1}, piv(I_{s−1}, 1), 1, j) ) + 6
    = max( 4 + 2s, 10, 10, 10 + 6(s − 1) ) + 6
    = 10 + 6(s − 1) + 6
    = 10 + 6s
    = 10·1 + 6 coarse(I_s, 1).

Thus

TTMQR(I_s, piv(I_s, 1), 1, j) = tiled(I_s, 1),          (3.12)

establishing our base case for the induction on k.
Now assume that for 1 ≤ h ≤ k − 1 we have, for any s in column h,

TTMQR(I_s, piv(I_s, h), h, j) = tiled(I_s, h).

In order to start the elimination of the next column, we must have that all updates from the elimination of the previous column are complete. Thus, using the induction assumption, we have

GEQRT(i, k) = TTMQR(i, piv(i, k−1), k−1, k) + 4,

so that

GEQRT(i, k) = 10(k − 1) + 6 coarse(i, k − 1) + 4          (3.13)
and

UNMQR(i, k, j) = max( GEQRT(i, k), TTMQR(i, piv(i, k−1), k−1, j) ) + 6,

so that

UNMQR(i, k, j) = 10k + 6 coarse(i, k − 1).          (3.14)

Again, we must proceed with an induction on the time steps in column k. Let

s_1 = min_{k+1 ≤ i ≤ p} (coarse(i, k)).          (3.15)
From (3.2c) and (3.2d) we have

TTQRT(I_{s_1}, piv(I_{s_1}, k), k) = max( GEQRT(piv(I_{s_1}, k), k), GEQRT(I_{s_1}, k) ) + 2
    = max( 10(k−1) + 6 coarse(piv(I_{s_1}, k), k−1) + 4, 10(k−1) + 6 coarse(I_{s_1}, k−1) + 4 ) + 2
    = 10(k−1) + 6 max( coarse(piv(I_{s_1}, k), k−1), coarse(I_{s_1}, k−1) ) + 4 + 2.

From the application of Lemma 3.4, we have

coarse(I_{s_1}, k) − 1 = max( coarse(I_{s_1,k}, k−1), coarse(piv(I_{s_1,k}, k), k−1) ),

such that

TTQRT(I_{s_1}, piv(I_{s_1}, k), k) = 10(k−1) + 6 [coarse(I_{s_1}, k) − 1] + 4 + 2.

Therefore,

TTQRT(I_{s_1}, piv(I_{s_1}, k), k) = 10(k − 1) + 6 s_1.          (3.16)
For the updates, we must again examine the three dependencies which result from (3.2e), (3.2f), and (3.2g), such that we have

TTMQR(I_{s_1}, piv(I_{s_1}, k), k, j) = max( UNMQR(piv(I_{s_1}, k), k, j), UNMQR(I_{s_1}, k, j), TTQRT(I_{s_1}, piv(I_{s_1}, k), k) ) + 6
    = max( 10k + 6 coarse(piv(I_{s_1}, k), k−1), 10k + 6 coarse(I_{s_1}, k−1), 10(k−1) + 6 s_1 ) + 6.

Using Lemma 3.4, we have

TTMQR(I_{s_1}, piv(I_{s_1}, k), k, j) = max( 10k + 6(s_1 − 1), 10(k−1) + 6 s_1 ) + 6
    = 10k + max( 6 s_1 − 6, 6 s_1 − 10 ) + 6
    = 10k + 6 s_1.

Therefore

TTMQR(I_{s_1}, piv(I_{s_1}, k), k, j) = tiled(I_{s_1}, k).          (3.17)
Now assume that for s_1 ≤ t ≤ s − 1 we have

TTQRT(I_t, piv(I_t, k), k) ≤ 10(k − 1) + 6t          (3.18)
TTMQR(I_t, piv(I_t, k), k, j) = 10k + 6t          (3.19)
and note that we do not have equality for s > s_1, via Lemma 3.4. From (3.2c), (3.2d), and (3.3a) we have

TTQRT(I_s, piv(I_s, k), k) = max( GEQRT(piv(I_s, k), k), GEQRT(I_s, k), TTQRT(I_{s−1}, piv(I_{s−1}, k), k) ) + 2
    ≤ max( 10(k−1) + 6 coarse(piv(I_s, k), k−1) + 4, 10(k−1) + 6 coarse(I_s, k−1) + 4, 10(k−1) + 6(s−1) ) + 2.

Note that from Lemma 3.4

s − 1 ≥ max( coarse(piv(I_s, k), k−1), coarse(I_s, k−1) ),

such that

TTQRT(I_s, piv(I_s, k), k) ≤ 10(k−1) + 6(s−1) + 4 + 2.

Thus

TTQRT(I_s, piv(I_s, k), k) ≤ 10(k − 1) + 6s.          (3.20)
For the updates, we must examine the four dependencies which result from (3.2e), (3.2f), (3.2g), and (3.3b), such that we have

TTMQR(I_s, piv(I_s, k), k, j) = max( UNMQR(piv(I_s, k), k, j), UNMQR(I_s, k, j), TTQRT(I_s, piv(I_s, k), k), TTMQR(I_{s−1}, piv(I_{s−1}, k), k, j) ) + 6
    = max( 10k + 6 coarse(piv(I_s, k), k−1), 10k + 6 coarse(I_s, k−1), 10(k−1) + 6s, 10k + 6(s−1) ) + 6.
As before, Lemma 3.4 allows us to write

TTMQR(I_s, piv(I_s, k), k, j) = max( 10k + 6(s − 1), 10(k−1) + 6s ) + 6
    = 10k + max( 6s − 6, 6s − 10 ) + 6
    = 10k + 6s.

Therefore

TTMQR(I_s, piv(I_s, k), k, j) = tiled(I_s, k).          (3.21)

□
Corollary 3.6 Given an elimination list for a coarse-grain algorithm on a matrix of size p × q where p > q, the critical path length of the corresponding tiled algorithm is bounded by

tiled(p, q − 1) + 4 + 2 ≤ CP(p, q) ≤ tiled(p, q).

Proof: For any tiled matrix, the last column will necessarily need to be factorized, which explains the addition of four time steps, and since p > q at least one TTQRT will be present, which accounts for the two time steps, thereby establishing the lower bound. By including one more column, the upper bound not only includes the factorization of column q but also the respective updates onto column q + 1, such that the critical path of the p × q tiled matrix must be smaller. □
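Both results are easy to exercise numerically. The Python fragment below is a minimal sketch of our own (not thesis code) of the Theorem 3.5 translation and the Corollary 3.6 lower bound; it reuses the coarse_greedy simulation sketched in Section 3.2.1 and assumes p > q, so that tile (p, q − 1) lies below the diagonal.

def tiled_time(i, k, coarse):
    """Theorem 3.5: time at which the last update TTMQR(i, piv(i,k), k, j) completes, k <= q-1."""
    return 10 * k + 6 * coarse[(i, k)]

def cp_lower_bound(coarse, p, q):
    """Corollary 3.6 lower bound: add GEQRT (weight 4) and one TTQRT (weight 2) in column q."""
    return tiled_time(p, q - 1, coarse) + 4 + 2

# Example usage with the Greedy schedule of Section 3.2.1 on 15 x 6 tiles:
# coarse = coarse_greedy(15, 6); print(cp_lower_bound(coarse, 15, 6))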
Corollary 3.7 Given an elimination list for a coarse-grain algorithm on a matrix of
size p x q where p = q, the critical path length of the corresponding tiled algorithm is
CP(p, q) = tiled(p, q − 1) + 4.
Proof: In the last column, we need only to factorize the diagonal tile which explains
the additional four time steps. Moreover, there are no further columns to apply any
updates to nor any tiles below the diagonal that need to be eliminated. Thus the
result is obtained.
In the remainder of this chapter, we will make use of diagrams to clarify certain
aspects of the proofs and provide examples to further illustrate the points being made.
These diagrams make use of the kernel representations as shown in Figure 3.1.
Figure 3.1: Icon representations of the kernels (GEQRT, weight 4; UNMQR, weight 6; TTQRT, weight 2; TTMQR, weight 6).
We have a closed-form expression for the critical path of tiled FlatTree for all
three cases: single tiled column, square tiled matrix, and rectangular tiled matrix of
more than one column.
Proposition 3.8 Consider a tiled matrix of size p × q, where p ≥ q ≥ 1. The critical path length of FlatTree is

CP_FT(p, q) =  2p + 2,            if q = 1;
               22p − 24,          if p = q > 1;
               6p + 16q − 22,     if p > q > 1.

Proof: Consider first the case q = 1. We shall proceed by induction on p to show that the critical path of FlatTree is of length 2p + 2. If p = 1, then from Table 3.1 the result is obtained since only GEQRT(1,1) is required. With the base case established, now assume that this holds for all p − 1 ≥ q = 1. Thus at time t = 2(p − 1) + 2 = 2p, we have that for all p − 1 ≥ i ≥ 1 tile (i, 1) has been factorized into a triangle and for all p − 1 ≥ i > 1, tile (i, 1) has been zeroed out. Therefore, tile (p, 1) will be zeroed out with TTQRT(p, 1) at time t + 2 = 2(p − 1) + 2 + 2 = 2p + 2.
Considering the second case p = q > 1, we will use Figure 3.2 to illustrate. We initialize with a triangularization of the first column and send the update to the remaining column(s), 10 time units. Then we fill the pipeline with the updates onto the remaining column(s) from the zeroing operations of the first column, 6(p − 1) time units. Then for each column after the first, except the last one, we fill the pipeline with the triangularization, update of triangularization, and update of zeroing for the bottom-most tile, (4 + 6 + 6)(q − 2) time units. In the last column, we then triangularize the bottom-most tile, 4 time units. Thus

10 + 6(p − 1) + (4 + 6 + 6)(q − 2) + 4 = 6p + 16q − 24 = 22p − 24.

The third case is analogous to the second case, but we still need to zero out the bottom-most tile in the last column, which explains the difference of 2 in the formula from the square case. □
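The closed form of Proposition 3.8 is straightforward to transcribe; the Python fragment below (a direct transcription of our own, assuming p ≥ q ≥ 1) returns, for the 4 × 4 case of Figure 3.2, the value 22·4 − 24 = 64.

def cp_flattree(p, q):
    """Critical path length of tiled FlatTree (Proposition 3.8), p >= q >= 1."""
    if q == 1:
        return 2 * p + 2
    if p == q:
        return 22 * p - 24
    return 6 * p + 16 * q - 22   # p > q > 1

assert cp_flattree(4, 4) == 64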
Figure 3.2: Critical path length for the weighted FlatTree on a matrix of 4 × 4 tiles.
We remind that, for the coarse algorithm,

coarse(p, q) =  0,            if q < 1;
                0,            if p = q = 1;
                p + q − 2,    if p > q ≥ 1;
                2q − 3,       if p = q > 1.

So we find that, considering a tiled matrix of size p × q, where p ≥ q ≥ 1, the critical path length of FlatTree is given as

CP(p, q) = 10(q − 1) + 6 coarse(p, q − 1) + 4 + 2 (coarse(p, q) − coarse(p, q − 1)).
Proposition 3.9 The critical path length of Fibonacci is greater than 22q − 30 and less than 22q + 6⌈√(2p)⌉.

Proof: The critical path length of the coarse-grain Fibonacci algorithm for a p × q matrix is

coarse(p, q) = x + 2q − 2.

Thus from Corollary 3.6 we have

10(q − 1) + 6(x + 2(q − 1) − 2) + 4 + 2 ≤ CP(p, q) ≤ 10q + 6(x + 2q − 2).

Recall that x is the least integer such that

x(x + 1)/2 ≥ p − 1,

whereby

x = ⌈ −1/2 + √(2(p − 1) + 1/4) ⌉.

Thus x ≤ ⌈√(2p)⌉ and therefore

22q − 30 < CP(p, q) ≤ 22q − 12 + 6⌈√(2p)⌉.  □
Similarly to [25] in which an iterate of a column is defined for the coarse-grain
algorithms, we define a weighted iterate and our notation will follow in the same
manner.
A column of length n is a sequence of n integers

a = a_1^{n_1} ⋯ a_q^{n_q},

where the power means concatenation, with the following restrictions:

a_1 > 0;  a_{i+1} > a_i, 1 ≤ i ≤ q − 1;
n_i > 0, 1 ≤ i ≤ q;  n_1 + ⋯ + n_q = n.
We define on the set of columns of length n the classical partial ordering of ℝ^n:

x ≤ y ⟺ (x_i ≤ y_i, 1 ≤ i ≤ n),

and the s-truncate (1 ≤ s ≤ n) of a is a column of length s composed of the s first elements of a, denoted a_s.
Definition 3.10 Given a task weight of w and a column a = a_1^{n_1} ⋯ a_q^{n_q}, the column c = c_1^{m_1} ⋯ c_p^{m_p} is called an iterate of a, or c = iter(a), if

(i) c is a column of length n − 1;
(ii) a_1 + w ≤ c_1;
   (a) if a_1 + w ≤ c_i < a_2 then m_i ≤ ⌊n_1/2⌋;
   (b) if there exists an h such that a_{k−1} + w ≤ c_h < a_k then
       m_h ≤ ⌊(n_1 + ⋯ + n_{k−1} − m_1 − ⋯ − m_{h−1}) / 2⌋
       for 2 ≤ k ≤ q and 1 ≤ h ≤ p, with m_0 = 0;
   (c) else a_j ≤ c_h and
       m_h ≤ ⌊(n_1 + ⋯ + n_j − m_1 − ⋯ − m_{h−1}) / 2⌋,
       where j = min(p + 1, q).

Definition 3.11 Given a task weight of w, let a = a_1^{n_1} ⋯ a_q^{n_q} be a column of length n; then the sequence b = b_1^{m_1} ⋯ b_p^{m_p}, or b = optiter(a), is defined as

(i) for b_1 and m_1:
   (a) if n_1 = 1, then b_1 = a_2 + w and m_1 = ⌊(n_1 + n_2)/2⌋;
   (b) if n_1 > 1, then b_1 = a_1 + w and m_1 = ⌊n_1/2⌋.
(ii) if there exists k such that a_{k−1} + w ≤ b_{i−1} < a_k, then let

r_{i−1} = n_1 + ⋯ + n_{k−1} − m_1 − ⋯ − m_{i−1},   i > 1;

   (a) if b_{i−1} < a_k and r_{i−1} > 1, then b_i = b_{i−1} + w and m_i = ⌊r_{i−1}/2⌋;
   (b) else b_i = a_k + w and m_i = ⌊(n_k + r_{i−1})/2⌋;

(iii) if b_{i−1} ≥ a_j, where j = min(p + 1, q), then b_i = b_{i−1} + w and

m_i ≤ ⌊(n_1 + ⋯ + n_j − m_1 − ⋯ − m_{i−1}) / 2⌋.

Proposition 3.12 Given an iterated column a of length n, the sequence b = b_1^{m_1} ⋯ b_p^{m_p}, or b = optiter(a), is an iterate of a.

Proof: The proof follows directly from the definition. □
Proposition 3.13
(i) Let a_n be a column of length n and c_{n−1} = iter(a_n) an iterated column of a_n. Then

b_{n−1} = optiter(a_n) ≤ iter(a_n) = c_{n−1}.

(ii) Let a_n and c_n be two columns of length n such that a_n ≤ c_n. Then

optiter(a_n) ≤ optiter(c_n).

Proof:
(i) Clearly, b_1 ≤ c_1 by definition, since b_1 is chosen to be as small as possible. Moreover, by definition b_i ≤ c_j for i ≤ j since (i) if b_i < a_k and r_{i−1} > 1, meaning there are enough elements available to perform a pairing, then b_i is again chosen as small as possible, (ii) otherwise b_i = a_k + w, which is again the smallest, and (iii) else b_{i−1} ≥ a_j and b_i is chosen as the next smallest element. Thus b_{n−1} ≤ c_{n−1}.

(ii) This is another direct application of the definition and follows along the same argument. □
Clearly, letting w = 1 gives the definitions of iter and optiter of the coarse-grain algorithms as presented in [25]. Definition 3.11 is the Asap algorithm on a single tiled column; it can be viewed as the counterpart of the coarse-grain Greedy algorithm in the tiled case and follows a bottom-to-top elimination of the tiles. In order to preserve the bottom-to-top elimination, the weight of the updates must be an integer multiple of the iterated column weight.
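To illustrate the bottom-to-top Asap reduction of a single column, the following Python sketch (a simplified discrete-event simulation of our own, not the formal optiter construction) takes the times at which the rows of a column become available and an elimination weight w, repeatedly pairs the bottom-most available rows, and returns the time at which only one row remains.

import heapq

def asap_column(ready, w):
    """Bottom-to-top Asap reduction of one column; ready[r] is row r's availability time."""
    events = [(t, r) for r, t in enumerate(ready)]   # (time, row) availability events
    heapq.heapify(events)
    available, remaining, finish = [], len(ready), 0
    while remaining > 1:
        t, r = heapq.heappop(events)
        available.append(r)
        available.sort()
        while len(available) >= 2:
            available.pop()                          # bottom-most available row is eliminated
            pivot = available.pop()                  # into the available row just above it
            finish = max(finish, t + w)
            heapq.heappush(events, (t + w, pivot))   # the pivot becomes available again later
            remaining -= 1
    return finish

print(asap_column([0] * 5, 2))   # 6: a binomial-tree reduction of five ready rows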
Theorem 3.14 Given a matrix of p × q tiles, a factorization kernel weight of γ, an elimination kernel weight of α, and an update kernel weight of β = nα for some n ∈ ℕ, the GrASAP algorithm is optimal within the class of algorithms that progress left to right, bottom to top.
Proof: From Theorem 3.5 we have a direct translation from any coarse-grain algorithm in this class to the tiled algorithm for the first q − 1 columns. Thus we are given the time steps at which rows in column q are available for elimination. Now we fix the time steps for the elimination of the last column and follow whatever tree the algorithm provides for this last column. We can replace the elimination of the first q − 1 columns, and the updates from these eliminations onto the remaining columns, with the tiled Greedy algorithm. This is possible since the translation function is monotonically increasing and we know that Greedy is optimal for the coarse-grain algorithms and therefore optimal for the first q − 1 columns in the tiled algorithms. In other words, we slow down the eliminations and updates on the first q − 1 columns when not using the tiled Greedy algorithm. (An illustrative example is shown in Figure 3.3b.)

Let c be the next-to-last column of the coarse-grain elimination table, which is of length p − (q − 1) + 1. Now letting

a = (γ + β)(q − 1) + β coarse(p, q − 1) + γ

provides an iterated column of length p − (q − 1) + 1 for the tiled algorithm. With w = α we have that b = optiter(a) is an optimal iterated column of length p − q + 1, with the elimination progressing from bottom to top. This can be applied to any tiled algorithm in this class, since we only concern ourselves with the time steps at which the last column's elements are available for elimination. In other words, this is a speeding up of the elimination of the last column while adhering to any restrictions incurred from the previous columns. (An illustrative example is shown in Figure 3.3c.)

Combining these two ideas, let Greedy be performed on the first q − 1 columns and then Asap on the remaining column q. This provides an optimal algorithm in this class of algorithms. □
In Figure 3.4 we provide an illustrated example of the Greedy and GrASAP
algorithms on a matrix of 15 x 2 tiles where the operations are given by Figure 3.1.
It can be seen that GrASAP finishes before Greedy since Greedy must wait
and progress with the same elimination scheme as the coarse-grain algorithm while
GrASAP can begin eliminating in the last column as soon as a pair of tiles becomes
available. (The elimination of the first column is shown in light gray.)
We have analyzed the critical path length of GrASAP versus that of Greedy
for tiled matrices of size p × q where 1 ≤ p ≤ 100 and 1 ≤ q ≤ p (see Figure 3.5). In all
cases where there is a difference (which is just over 44% of the cases), the difference
is always two time steps.
Figure 3.3: Illustration of the first and second parts of the proof of Theorem 3.14 using the Fibonacci algorithm on a matrix of 15 × 2 tiles (panel (c): Asap on column q).
We now show that without having the update kernel weight an integer multiple
of the elimination kernel weight, the bottom to top progression is nullified and we
cannot provide optimality of the algorithm.
Assume that the update kernel weight is 3 and the elimination kernel weight is 2. Let a_n be the column from some elimination scheme given in the first column of Table 3.5. We shall apply three iterated schemes to this column: (1) an Asap scheme that progresses from bottom to top, (2) an Asap scheme that can progress in any manner, and (3) an Asap scheme which may provide a lag.
In Table 3.5 we clearly see that elimination scheme (3) provides the best time for
the algorithm. The reason is that enough of a lag was provided such that a binomial
tree could progress without hindrance. Therefore without integer multiple weights on
the update kernel, we cannot know which scheme will be optimal.
Figure 3.4: Greedy versus GrASAP on a matrix of 15 × 2 tiles.
Figure 3.5: Tiled matrices of p × q where the critical path length of GrASAP is shorter than that of Greedy, for 1 ≤ p ≤ 100 and 1 ≤ q ≤ p.
The PLASMA library provides more algorithms, that can be informally described
as trade-offs between FlatTree and BinaryTree. (We remind that FlatTree is the same algorithm as Sameh-Kuck.) These algorithms are referred to as
PlasmaTree in all the following, and differ by the value of an input parameter
called the domain size BS. This domain size can be any value between 1 and p,
inclusive. Within a domain, that includes BS consecutive rows, the algorithm works
just as FlatTree: the first row of each domain acts as a local panel and is used to
zero out the tiles in all the other rows of the domain. Then the domains are merged:
the panel rows are zeroed out by a binary tree reduction, just as in BinaryTree.
As the algorithm progresses through the columns, the domain on the very bottom is
reduced accordingly, until such time that there is one less domain. For the case that
BS = 1, PlasmaTree follows a binary tree on the entire column, and for BS = p,
the algorithm executes a flat tree on the entire column. It seems very difficult for a
user to select the domain size BS leading to best performance, but it is known that
BS should increase as q increases. Table 3.3 shows the time-steps of PlasmaTree
with a domain size of BS = 5. In the experiments of §3.3, we use all possible values
of BS and retain the one leading to the best value.
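The per-column reduction just described can be sketched in a few lines of Python (a simplified illustration of our own, not the PLASMA implementation): partition the rows holding a nonzero in the current column into domains of BS consecutive rows, reduce each domain with a flat tree rooted at its first row, and merge the domain heads with a binary tree.

def plasmatree_column(rows, BS):
    """Elimination pairs (row, pivot) for one column; rows[0] is the panel row."""
    elims, heads = [], []
    for d in range(0, len(rows), BS):
        domain = rows[d:d + BS]
        heads.append(domain[0])
        elims += [(r, domain[0]) for r in domain[1:]]   # flat tree inside the domain
    while len(heads) > 1:                               # binary-tree merge of domain heads
        nxt = []
        for i in range(0, len(heads), 2):
            if i + 1 < len(heads):
                elims.append((heads[i + 1], heads[i]))
            nxt.append(heads[i])
        heads = nxt
    return elims

# BS = 1 degenerates to a binary tree over the column, BS = len(rows) to a flat tree.
print(plasmatree_column(list(range(1, 16)), 5))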
3.3 Experimental results
All experiments were performed on a 48-core machine composed of eight hexa-
core AMD Opteron 8439 SE (codename Istanbul) processors running at 2.8 GHz.
Each core has a theoretical peak of 11.2 Gflop/s with a peak of 537.6 Gflop/s for
the whole machine. The Istanbul micro-architecture is a NUMA architecture where
each socket has 6 MB of level-3 cache and each processor has a 512 KB level-2 cache
and a 128 KB level-1 cache. After having benchmarked the AMD ACML and Intel
MKL BLAS libraries, we selected MKL (10.2) since it appeared to be slightly faster
in our experimental context. Linux 2.6.32 and Intel Compilers 11.1 were also used in
conjunction with PLASMA 2.3.1.
For all results, we show both double and double complex precision, using all 48
cores of the machine. The matrices are of size m = 8000 and 200 ≤ n ≤ 8000. The tile size is kept constant at n_b = 200, so that the matrices can also be viewed as p × q tiled matrices where p = 40 and 1 ≤ q ≤ 40. All kernels use an inner blocking parameter of i_b = 32.
In double precision, an FMA (fused multiply-add, y ← αx + y) involves three double precision numbers for two flops, but these two flops can be combined into one FMA and thus completed in one cycle. In double complex precision, the operation
Figure 3.6: Upper bound and experimental performance of QR factorization, TT kernels. (a) Upper bound (double complex); (b) Experimental (double complex); (c) Upper bound (double); (d) Experimental (double).
y ← αx + y involves six double precision numbers for eight flops; there is no FMA.
The ratio of computation/communication is therefore, potentially, four times higher in
double complex precision than in double precision. Communication aware algorithms
are much more critical in double precision than in double complex precision.
For each experiment, we provide a comparison of the theoretical performance to
the actual performance. The theoretical performance is obtained by modeling the
limiting factor of the execution time as either the critical path, or the sequential time
divided by the number of processors. This is similar in approach to the Roofline
Figure 3.7: Overhead in terms of critical path length and time with respect to Greedy (Greedy = 1). (a) Theoretical CP length; (b) Experimental (double complex); (c) Experimental (double).
model [53]. Taking γ_seq as the sequential performance, T as the total number of flops, cp as the length of the critical path, and P as the number of processors, the upper bound on performance, γ_ub, is

γ_ub = γ_seq · T / max( T/P, cp ).
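In code, the bound above is a one-liner; the helper below (a sketch of our own) takes γ_seq in Gflop/s per core, the total work T and critical path cp in the same flop units, and the core count P.

def perf_upper_bound(gamma_seq, T, cp, P):
    """Roofline-like bound: performance is limited by the work per core or by the critical path."""
    return gamma_seq * T / max(T / P, cp)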
Figures 3.6a and 3.6c depict the upper bound on performance of all algorithms which
use the Triangle on top of triangle kernels. Since PlasmaTree provides an addi-
tional tuning parameter of the domain size, we show the results for each value of this
parameter as well as the composition of the best of these domain sizes. Again, it
is not evident what the domain size should be for the best performance, hence our
exhaustive search.
Part of our comprehensive study also involved comparisons made to the Semi-
Parallel Tile and Fully-Parallel Tile CAQR algorithms found in [11] which are much
the same as those found in PLASMA. As with PLASMA, the tuning parameter
BS controls the domain size upon which a flat tree is used to zero out tiles below
the root tile within the domain and a binary tree is used to merge these domains.
Unlike PLASMA, it is not the bottom domain whose size decreases as the algorithm
progresses through the columns, but instead is the top domain. In this study, we found
that the PLASMA algorithms performed identically or better than these algorithms
and therefore we do not report these comparisons.
Figures 3.6b and 3.6d illustrate the experimental performance reached by Greedy,
Fibonacci and PlasmaTree algorithms using the TT (Triangle on top of trian-
gle) kernels. In both cases, double or double complex precision, the performance
of Greedy is better than PlasmaTree even for the best choice of domain size.
Moreover, as expected from the analysis in §3.2.2, Greedy outperforms Fibonacci
the majority of the time. Furthermore, we see that, for rectangular matrices, the
experimental performance in double complex precision matches the upper bound on
performance. This is not the case for double precision because communications have
higher impact on performance.
While it is apparent that Greedy does achieve higher levels of performance, the
percentage may not be as obvious. To that end, taking Greedy as the baseline, we
present in Figure 3.7 the theoretical, double, and double complex precision overhead
for each algorithm that uses the Triangle on top of triangle kernel as compared to
Greedy. These overheads are respectively computed in terms of critical path length
and time. At a smaller scale (Figure 3.13), it can be seen that Greedy can perform
up to 13.6% better than PlasmaTree.
Figure 3.8: Overhead in terms of critical path length and time with respect to Greedy (Greedy = 1). (a) Theoretical CP length; (b) Experimental (double complex); (c) Experimental (double).
For all matrix sizes considered, p = 40 and 1 < q < 40, in the theoretical model,
the critical path length for Greedy is either the same as that of PlasmaTree
(q = 1) or is up to 25% shorter than PlasmaTree (q = 6). Analogously, the critical
path length for Greedy is at least 2% to 27% shorter than that of Fibonacci. In the
experiments, the matrix sizes considered were p = 40 and q ∈ {1, 2, 4, 5, 10, 20, 40}. In double precision, Greedy is at most 1.5% slower than the best PlasmaTree (q = 1) and at most 12.8% faster than the best PlasmaTree (q = 5). In double complex precision, Greedy is at most 1.5% slower than the best PlasmaTree (q = 1) and at most 13.6% faster than the best PlasmaTree (q = 2). Similarly, in double precision, Greedy provides a gain of 2.6% to 28.1% over Fibonacci, and in double complex precision, Greedy is at most 2.1% slower and at most 28.2% faster than Fibonacci.
Although it is evidenced that PlasmaTree does not vary too far from Greedy
or Fibonacci, one must keep in mind that there is a tuning parameter involved and
we choose the best of these domain sizes for PlasmaTree to create the composite
result, whereas with Greedy, there is no such parameter to consider. Of particular
interest is the fact that Greedy always performs better than any other algorithm² for p ≫ q. In the scope of PlasmaTree, a domain size BS = 1 will force the use of a binary tree so that both Greedy and PlasmaTree behave the same. However, as the matrix tends more to a square, i.e., as q tends toward p, we observe that the performance of all of the algorithms, including FlatTree, is on par with Greedy.
As more columns are added, the parallelism of the algorithm is increased and the
critical path becomes less of a limiting factor, so that the performance of the kernels
is brought to the forefront. Therefore, all of the algorithms are performing similarly
since they all share the same kernels.
Figure 3.9: Kernel performance for double complex precision. (a) Factorization kernels; (b) Update kernels.

²When q = 1, Greedy and FlatTree exhibit close performance. They both perform a binary tree reduction, albeit with different row pairings.
Figure 3.10: Kernel performance for double precision. (a) Factorization kernels; (b) Update kernels.
In order to accurately assess the impact of the kernel selection towards the per-
formance of the algorithms, Figures 3.9 and 3.10 show both the in cache and out
of cache performance using the No Flush and MultCallFlushLRU strategies as presented in [29, 51]. Since an algorithm using TT kernels will need to call GEQRT as well as TTQRT to achieve the same result as the TS kernel TSQRT, the comparison is made between GEQRT + TTQRT and TSQRT (and similarly for the updates). For n_b = 200, the observed ratio of in-cache kernel speed for TSQRT to GEQRT
of cache, the ratio for TSQRT to GEQRT + TTQRT is 1.3193 and for TSMQR
to UNMQR + TTMQR it is 1.3032. Thus, we can expect about a 30% difference
between the selection of the kernels, since we will have instances of using in cache
and out of cache throughout the run. Most of this difference is due to the higher
efficiency and data locality within the TT kernels as compared to the TS kernels.
Having seen that kernel performance can have a significant impact, we also com-
pare the TT based algorithms to those using the TS kernels. The goal is to provide
a complete assessment of all currently available algorithms, as shown in Figure 3.11.
For double precision, the observed difference in kernel speed is 4.976 GFLOP/sec for
Figure 3.11: Upper bound and experimental performance of QR factorization, all kernels. (a) Upper bound (double complex); (b) Experimental (double complex); (c) Upper bound (double); (d) Experimental (double).
the TS kernels versus 3.844 GFLOP/sec for the TT kernels which provides a ratio
of 1.2945 and is in accordance with our previous analysis. It can be seen that as the
number of columns increases, whereby the amount of parallelism increases, the effect
of the kernel performance outweighs the benefit provided by the extra parallelism
afforded through the TT algorithms. Comparatively, in double complex precision,
Greedy does perform better, even against the algorithms using the TS kernels. As
before, one must keep in mind that Greedy does not require the tuning parameter
of the domain size to achieve this better performance.
Figure 3.12: Overhead in terms of critical path length and time with respect to Greedy (Greedy = 1). (a) Theoretical CP length; (b) Experimental (double complex); (c) Experimental (double).
From these experiments, we showed that in double complex precision, Greedy
demonstrated better performance than any of the other algorithms and moreover,
it does so without the need to specify a domain size as opposed to the algorithms
in PLASMA. In addition, in double precision, for matrices where p ≫ q, Greedy
continues to excel over any other algorithm using the TT kernels, and continues to
do so as the matrices become more square.
3.4 Conclusion
Figure 3.13: Overhead in terms of critical path length and time with respect to Greedy (Greedy = 1). (a) Theoretical CP length; (b) Experimental (double complex); (c) Experimental (double).
In this chapter, we have presented Fibonacci and Greedy, two new algorithms
for tiled QR factorization. These algorithms exhibit more parallelism than state-
of-the-art implementations based on reduction trees. We have provided accurate
estimations for the length of their critical path.
Comprehensive experiments on multicore platforms confirm the superiority of
the new algorithms for p x q matrices, as soon as, say, p > 2q. This holds true
when comparing not only with previous algorithms using TT (Triangle on top of
triangle) kernels, but also with all known algorithms based on TS (Triangle on top
of square) kernels. Given that TS kernels offer more locality, and benefit from better
60


elementary arithmetic performance, than TT kernels, the better performance of the
new algorithms is even more striking, and further demonstrates that a large degree
of parallelism was not exploited in previously published solutions.
Future work will investigate several promising directions. First, using rectangular
tiles instead of square tiles could lead to efficient algorithms, with more locality and
still the same potential for parallelism. Second, refining the model to account for
communications, and extending it to fully distributed architectures, would lay the groundwork for the design of MPI implementations of the new algorithms, unleashing
their high level of performance on larger platforms. Finally, the design of robust
algorithms, capable of achieving efficient performance despite variations in processor
speeds, or even resource failures, is a challenging but crucial task to fully benefit from
future platforms with a huge number of cores.
Algorithm 3.4: Greedy algorithm via TT kernels.
 1  for j = 1 to q do
        /* nZ(j) is the number of tiles which have been eliminated in column j */
 2      nZ(j) = 0
        /* nT(j) is the number of tiles which have been triangularized in column j */
 3      nT(j) = 0
 4  while column q is not finished do
 5      for j = q down to 1 do
 6          if j == 1 then
                /* Triangularize the first column if not yet done */
 7              nTnew = nT(j) + (p − nT(j))
 8              if p − nT(j) > 0 then
 9                  for k = p down to 1 do
10                      GEQRT(k, j)
11                      for jj = j + 1 to q do
12                          UNMQR(k, j, jj)
13          else
                /* Triangularize every tile having a zero in the previous column */
14              nTnew = nZ(j − 1)
15              for k = nT(j) to nTnew − 1 do
16                  GEQRT(p − k, j)
17                  for jj = j + 1 to q do
18                      UNMQR(p − k, j, jj)
            /* Eliminate every tile triangularized in the previous step */
19          nZnew = nZ(j) + ⌊(nT(j) − nZ(j)) / 2⌋
20          for kk = nZ(j) to nZnew − 1 do
21              piv(p − kk) = p − kk − nZnew + nZ(j)
22              TTQRT(p − kk, piv(p − kk), j)
23              for jj = j + 1 to q do
24                  TTMQR(p − kk, piv(p − kk), j, jj)
            /* Update the number of triangularized and eliminated tiles for the next step */
25          nT(j) = nTnew
26          nZ(j) = nZnew


Table 3.3: Time-steps for tiled algorithms (★: diagonal tile).
         (a) Sameh-Kuck        | (b) Fibonacci         | (c) Greedy           | (d) BinaryTree       | (e) PlasmaTree (BS = 5)
row  1:  ★                     | ★                     | ★                    | ★                    | ★
row  2:  6 ★                   | 14 ★                  | 12 ★                 | 6 ★                  | 6 ★
row  3:  8 28 ★                | 12 48 ★               | 10 42 ★              | 8 28 ★               | 8 28 ★
row  4:  10 34 50 ★            | 12 46 70 ★            | 10 40 64 ★           | 6 36 56 ★            | 10 34 50 ★
row  5:  12 40 56 72 ★         | 10 42 68 92 ★         | 8 36 62 86 ★         | 10 34 70 90 ★        | 12 40 56 72 ★
row  6:  14 46 62 78 94 ★      | 10 40 64 90 114 ★     | 8 34 56 84 106 ★     | 6 44 68 104 124 ★    | 14 46 62 78 94 ★
row  7:  16 52 68 84 100 116   | 10 40 62 86 112 136   | 8 34 56 78 102 128   | 8 28 78 102 138 158  | 6 54 74 90 106 122
row  8:  18 58 74 90 106 122   | 8 36 62 84 108 134    | 8 30 52 78 100 122   | 6 42 62 112 136 172  | 8 28 82 102 118 134
row  9:  20 64 80 96 112 128   | 8 34 58 84 106 130    | 6 28 50 72 100 118   | 12 40 76 96 146 170  | 10 34 50 110 130 146
row 10:  22 70 86 102 118 134  | 8 34 56 80 106 128    | 6 28 50 72 94 116    | 6 46 74 110 130 180  | 12 40 56 72 138 158
row 11:  24 76 92 108 124 140  | 8 34 56 78 102 128    | 6 28 50 68 94 116    | 8 28 80 108 144 164  | 16 52 68 84 100 166
row 12:  26 82 98 114 130 146  | 6 28 56 78 100 122    | 6 28 44 66 88 110    | 6 36 56 114 142 178  | 6 56 80 96 112 128
row 13:  28 88 104 120 136 152 | 6 28 50 78 100 122    | 6 22 44 66 88 110    | 10 34 64 84 148 176  | 8 28 84 108 124 140
row 14:  30 94 110 126 142 158 | 6 28 44 72 100 122    | 6 22 44 60 82 104    | 6 38 62 92 112 182   | 10 34 50 112 136 152
row 15:  32 100 116 132 148 164| 6 22 44 60 94 116     | 6 22 38 60 76 98     | 8 28 66 90 114 134   | 12 40 56 72 140 164
Table 3.4: Neither Greedy nor Asap is optimal.

(a) Neither Greedy nor Asap is optimal (time-steps, 15 × 3 tiled matrix):

         Greedy      | Asap
row  1:  ★           | ★
row  2:  12 ★        | 12 ★
row  3:  10 42 ★     | 10 40 ★
row  4:  10 40 64    | 10 36 86
row  5:  8 36 62     | 8 34 80
row  6:  8 34 56     | 8 32 74
row  7:  8 34 56     | 8 30 68
row  8:  8 30 52     | 8 28 62
row  9:  6 28 50     | 6 28 56
row 10:  6 28 50     | 6 26 50
row 11:  6 28 50     | 6 24 46
row 12:  6 28 44     | 6 24 44
row 13:  6 22 44     | 6 22 44
row 14:  6 22 44     | 6 22 40
row 15:  6 22 38     | 6 22 38

(b) Greedy generally outperforms Asap (critical path lengths):

 p      Algorithm    q = 16    q = 32    q = 64    q = 128
 16     Greedy        310
        Asap          310
 32     Greedy        360       650
        Asap          402       656
 64     Greedy        374       726      1342
        Asap          588       844      1354
 128    Greedy        396       748      1452      2732
        Asap          966       1222     1748      2756
Table 3.5: Three schemes applied to a column whose update kernel weight is not an
integer multiple of the elimination kernel weight.
a_n   (1)   (2)   (3)
6 6 13 14 12
6 11 8 10
6 9 8 10
6 9 12 8
6 9 9 8
6 7 7 8
6 7 7 8
6 5 5 5
6 5 5 5
6 5 5 5
Table 3.6: Greedy versus PT_TT and Fibonacci (Theoretical)
p   q   Greedy   PT_TT   BS   Overhead   Gain   Fibonacci   Overhead   Gain
40 1 16 16 1 1.0000 0.0000 22 1.3750 0.2727
40 2 54 60 3 1.1111 0.1000 72 1.3333 0.2500
40 3 74 98 5 1.3243 0.2449 94 1.2703 0.2128
40 4 104 132 5 1.2692 0.2121 116 1.1154 0.1034
40 5 126 166 5 1.3175 0.2410 138 1.0952 0.0870
40 6 148 198 10 1.3378 0.2525 160 1.0811 0.0750
40 7 170 226 10 1.3294 0.2478 182 1.0706 0.0659
40 8 192 254 10 1.3229 0.2441 204 1.0625 0.0588
40 9 214 282 10 1.3178 0.2411 226 1.0561 0.0531
40 10 236 310 10 1.3136 0.2387 248 1.0508 0.0484
40 11 258 336 20 1.3023 0.2321 270 1.0465 0.0444
40 12 280 358 20 1.2786 0.2179 292 1.0429 0.0411
40 13 302 380 20 1.2583 0.2053 314 1.0397 0.0382
40 14 324 402 20 1.2407 0.1940 336 1.0370 0.0357
40 15 346 424 20 1.2254 0.1840 358 1.0347 0.0335
40 16 368 446 20 1.2120 0.1749 380 1.0326 0.0316
40 17 390 468 20 1.2000 0.1667 402 1.0308 0.0299
40 18 412 490 20 1.1893 0.1592 424 1.0291 0.0283
40 19 432 512 20 1.1852 0.1562 446 1.0324 0.0314
40 20 454 534 20 1.1762 0.1498 468 1.0308 0.0299
40 21 476 554 20 1.1639 0.1408 490 1.0294 0.0286
40 22 498 570 20 1.1446 0.1263 512 1.0281 0.0273
40 23 520 586 20 1.1269 0.1126 534 1.0269 0.0262
40 24 542 602 20 1.1107 0.0997 556 1.0258 0.0252
40 25 564 618 20 1.0957 0.0874 578 1.0248 0.0242
40 26 586 634 20 1.0819 0.0757 600 1.0239 0.0233
40 27 608 650 20 1.0691 0.0646 622 1.0230 0.0225
40 28 630 666 20 1.0571 0.0541 644 1.0222 0.0217
40 29 652 682 20 1.0460 0.0440 666 1.0215 0.0210
40 30 668 698 20 1.0449 0.0430 688 1.0299 0.0291
40 31 684 714 20 1.0439 0.0420 710 1.0380 0.0366
40 32 700 730 20 1.0429 0.0411 732 1.0457 0.0437
40 33 716 746 20 1.0419 0.0402 754 1.0531 0.0504
40 34 732 762 20 1.0410 0.0394 776 1.0601 0.0567
40 35 748 778 20 1.0401 0.0386 798 1.0668 0.0627
40 36 764 794 20 1.0393 0.0378 820 1.0733 0.0683
40 37 780 810 20 1.0385 0.0370 842 1.0795 0.0736
40 38 796 826 20 1.0377 0.0363 862 1.0829 0.0766
40 39 812 842 20 1.0369 0.0356 878 1.0813 0.0752
40 40 826 856 20 1.0363 0.0350 892 1.0799 0.0740
Table 3.7: Greedy versus PT_TT (Experimental Double)
p q GREEDY (d) PT_TT(d) BS Overhead Gain
40 1 36.9360 37.5020 1 1.0153 -0.0153
40 2 58.5090 52.7180 3 0.9010 0.0990
40 4 103.2670 90.7940 10 0.8792 0.1208
40 5 115.3060 100.5540 5 0.8721 0.1279
40 10 153.5180 145.8200 17 0.9499 0.0501
40 20 170.8730 171.8270 27 1.0056 -0.0056
40 40 184.5220 182.8160 19 0.9908 0.0092
Table 3.8: Greedy versus PT_TT (Experimental Double Complex)
P q GREEDY(z) PT_TT(z) BS Overhead Gain
40 1 42.0710 42.7120 1 1.0152 -0.0152
40 2 60.4420 52.1970 5 0.8636 0.1364
40 4 95.1820 84.1120 5 0.8837 0.1163
40 5 107.6370 96.7530 5 0.8989 0.1011
40 10 135.0270 128.4320 17 0.9512 0.0488
40 20 144.4010 146.4220 28 1.0140 -0.0140
40 40 152.9280 151.9090 8 0.9933 0.0067
Table 3.9: Greedy versus Fibonacci (Experimental Double)
P q GREEDY(d) FIB(d) Overhead Gain
40 1 36.9360 26.5610 0.7191 0.2809
40 2 58.5090 49.4870 0.8458 0.1542
40 4 103.2670 100.1440 0.9698 0.0302
40 5 115.3060 115.0020 0.9974 0.0026
40 10 153.5180 152.0090 0.9902 0.0098
40 20 170.8730 170.4780 0.9977 0.0023
40 40 184.5220 180.2990 0.9771 0.0229
Table 3.10: Greedy versus Fibonacci (Experimental Double Complex)
p q GREEDY(z) FIB(z) Overhead Gain
40 1 42.0710 30.2280 0.7185 0.2815
40 2 60.4420 48.9570 0.8100 0.1900
40 4 95.1820 97.1650 1.0208 -0.0208
40 5 107.6370 105.9610 0.9844 0.0156
40 10 135.0270 134.5500 0.9965 0.0035
40 20 144.4010 145.5530 1.0080 -0.0080
40 40 152.9280 150.0980 0.9815 0.0185
4. Scheduling of Cholesky Factorization
In Chapter 2 we studied the Cholesky Inversion algorithm, which consists of three steps: the Cholesky factorization, the inversion of the triangular factor, and the multiplication of two triangular matrices. In this chapter, we will focus on the Cholesky factorization
but unlike the previous work where the number of processors was unbounded, we will
consider the factorization in the context of a finite number of processors. By limiting
the number of processors, the scheduling of the tasks becomes an issue. Moreover, the
weight (processing time) of each task must be taken into consideration when creating
the schedule.
As before, we will be considering the critical path length for the algorithm, but not as a function of the number of tiles; rather, as a function of the weights of the tasks. The weights are based upon the total computational cost for each kernel and are provided in Table 4.1. A more in-depth analysis of the length of the critical path with weighted tasks for the Cholesky Inversion algorithm can be found in [16], which also provides 9t - 10 as the weighted critical path length for the Cholesky factorization of a matrix of t x t tiles.
Table 4.1: Task Weights

Kernel    # flops                      Weight (in n_b^3 / 3 flops)
POTRF     (1/3) n_b^3 + O(n_b^2)       1
TRSM      n_b^3                        3
SYRK      n_b^3 + O(n_b^2)             3
GEMM      2 n_b^3 + O(n_b^2)           6
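To see how these weights enter the critical path computation, the following minimal Python sketch (not the thesis code; the task names and dependency rules of the right-looking tiled factorization are spelled out explicitly here as an assumption) builds the weighted DAG for a t x t tile matrix and computes its longest path. For small t it reproduces the value 9t - 10 quoted above from [16].

    WEIGHTS = {"POTRF": 1, "TRSM": 3, "SYRK": 3, "GEMM": 6}

    def cholesky_dag(t):
        """Map each task of the tiled right-looking Cholesky factorization
        to the list of tasks it depends on."""
        preds = {}
        for j in range(t):
            # POTRF(j) waits for every symmetric update of the diagonal tile (j, j)
            preds[("POTRF", j)] = [("SYRK", j, k) for k in range(j)]
            for i in range(j + 1, t):
                # TRSM(i, j) waits for POTRF(j) and for all updates of tile (i, j)
                preds[("TRSM", i, j)] = [("POTRF", j)] + [("GEMM", i, j, k) for k in range(j)]
                # SYRK(i, j) uses the freshly computed A(i, j) and accumulates into A(i, i)
                preds[("SYRK", i, j)] = [("TRSM", i, j)] + ([("SYRK", i, j - 1)] if j else [])
            for i in range(j + 2, t):
                for m in range(j + 1, i):
                    # GEMM(i, m, j) updates tile (i, m) with column j
                    preds[("GEMM", i, m, j)] = [("TRSM", i, j), ("TRSM", m, j)] + \
                        ([("GEMM", i, m, j - 1)] if j else [])
        return preds

    def weighted_critical_path(preds):
        finish = {}
        def cp(task):
            if task not in finish:
                finish[task] = WEIGHTS[task[0]] + max((cp(p) for p in preds[task]), default=0)
            return finish[task]
        return max(cp(task) for task in preds)

    for t in range(2, 8):
        assert weighted_critical_path(cholesky_dag(t)) == 9 * t - 10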
The upper bound on performance given by perfect speedup and the critical path, introduced in Chapter 2, remains too optimistic and does not take into account any information which can be garnered from the DAG of the algorithm. This work makes progress towards providing a more representative bound on the performance of the Cholesky factorization in the tiled setting.
We also make progress toward a bound on the minimum number of processors required to obtain the shortest possible weighted critical path (minimum makespan) for the Cholesky factorization of a matrix of t x t tiles.
4.1 ALAP Derived Performance Bound
To obtain our bounds, we calculate the latest possible start time for each task
(ALAP) and consider an unbounded number of processors without any costs for
communication. If we did account for communication, we might see the critical path
length increase which would in turn decrease our upper bound. We start at the final
tasks and consider how many processors are needed to execute these tasks without
increasing the length of the critical path. We step backwards in time until such a
point that there are more processors needed to keep the critical path length constant.
Thus we must add enough processors to execute the tasks and in turn create more
idle time for the execution of tasks which are successors. At a certain point, there is
no more need to add processors and this is then the number of processors needed to
obtain the constant length critical path.
By forcing as late as possible (ALAP) start times, any schedule will keep as many or fewer processors active as the ALAP execution on an unbounded number of processors. Thus, by evaluating the Lost Area (LA), or idle time, for a given number of processors, p, at the end of the ALAP execution on an unbounded number of processors, we can increase the sequential time by the amount of LA and divide this result by p to obtain the best possible execution time, i.e.,

    T_p ≥ (T_seq + LA_p) / p        (4.1)

and we define this to be the ALAP Derived Performance Bound. Hence the maximum speedup that we can expect is given by

    S_p = T_seq / T_p ≤ (p · T_seq) / (T_seq + LA_p).
An example will help to further illustrate this technique. In Figure 4.1 we are given the ALAP execution of a 5 x 5 tiled matrix, which has T_seq = 125. The ordered pairs indicated provide the number of processors and idle time, respectively, and Table 4.2 gives the values for T_p, speedup, and efficiency. For more than four processors, there are enough processors to obtain the critical path length, which becomes our limiting factor.
Figure 4.1: ALAP execution for 5 x 5 tiles
Table 4.2: Upper bound on speedup and efficiency for 5 x 5 tiles
p      T_p       S_p      E_p
1 125.00 1.00 1.00
2 64.50 1.94 0.97
3 45.33 2.76 0.92
4 37.25 3.36 0.84
5 35.00 3.57 0.71
6 35.00 3.57 0.60
7 35.00 3.57 0.51
8 35.00 3.57 0.45
9 35.00 3.57 0.40
10 35.00 3.57 0.36
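As a concrete illustration, the sketch below shows one way such a table could be produced. It is not the thesis code; the input format (a dictionary mapping each task to its start and finish time in the ALAP execution) and the precise reading of the Lost Area (the idle area accumulated while stepping backwards from the end of the ALAP execution until more than p tasks are simultaneously active) are assumptions made for this example.

    def alap_derived_bound(alap_intervals, p):
        """alap_intervals: task -> (start, finish) in the ALAP execution on an
        unbounded number of processors.  Returns (bound on T_p, bound on speedup)."""
        t_seq = sum(f - s for s, f in alap_intervals.values())      # total work
        horizon = max(f for _, f in alap_intervals.values())        # critical path length
        events = sorted({e for s, f in alap_intervals.values() for e in (s, f)})
        lost_area = 0.0
        # Step backwards over the piecewise-constant activity profile and add
        # the idle area (p - active) until more than p tasks would be needed.
        for lo, hi in zip(reversed(events[:-1]), reversed(events[1:])):
            active = sum(1 for s, f in alap_intervals.values() if s <= lo and hi <= f)
            if active > p:
                break
            lost_area += (p - active) * (hi - lo)
        t_p = max((t_seq + lost_area) / p, horizon)   # never below the critical path
        return t_p, t_seq / t_p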
4.2 Critical Path Scheduling
In order to provide a critical path schedule, we use the Backflow algorithm to assign priorities to the tasks of a DAG such that each task's priority adheres to its dependencies. The algorithm is described in four steps:

STEP 1 : Beginning at the final task in the DAG, set its priority to its processing time.
STEP 2 : Moving in the reverse direction, set each incidental task's priority to the sum of its processing time and the final task's priority.
STEP 3 : For each task in STEP 2, moving in the reverse direction, set each incidental task's priority to the sum of its processing time and the maximum priority of any incidental successor task.
STEP 4 : Repeat the procedure until all tasks have been assigned a priority.
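A minimal sketch of this procedure (not the thesis implementation; the DAG is assumed to be given as a dictionary mapping each task to its successors, with a separate dictionary of processing times):

    def backflow_priorities(successors, weights):
        """Assign cp(task) = processing time + max priority among its successors."""
        cp = {}
        def priority(task):
            if task not in cp:
                cp[task] = weights[task] + max((priority(s) for s in successors[task]),
                                               default=0)
            return cp[task]
        for task in successors:
            priority(task)
        return cp

    # For the four-task example of [43] discussed below (A, B, C, D with weights
    # 3, 3, 1, 1 and the single dependence C -> D):
    succ = {"A": [], "B": [], "C": ["D"], "D": []}
    w = {"A": 3, "B": 3, "C": 1, "D": 1}
    cp = backflow_priorities(succ, w)   # cp(A)=3, cp(B)=3, cp(C)=2, cp(D)=1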
An example is given in Figure 4.2. The processing times are given in parentheses and the assigned priorities (cp) are designated in square brackets. Tasks A and B will each be assigned a priority of 16 since cp(A) = 3 + max(cp(C), cp(D)), and cp(B) is computed in the same manner from the priorities of B's successors.
By following the path with the highest priorities, a critical path can be discerned from the weighted DAG. Thus any schedule which then chooses from the available tasks those with the highest priorities to execute first inherently follows the critical path. It is well known that critical path scheduling is not always optimal. As an example [43], take two processors and four tasks. Let tasks A, B, C, and D have weights of 3, 3, 1, and 1, respectively, and let the only relationship between the tasks be that C is a predecessor of D. Then cp(A) = cp(B) = 3, cp(C) = 2, and cp(D) = 1. A critical path schedule would choose to schedule tasks A and B simultaneously and follow up with C and then D, resulting in a schedule of length five. However, if A and C are scheduled simultaneously and then D follows A on the same processor and B follows C on the other processor, the length of the schedule is four.
(a) State at start   (b) State at finish
Figure 4.2: Example derivation of task priorities via the Backflow algorithm
We will use this critical path information to analyze three schedules by choosing
available tasks via max(cp), rand(cp), or min(cp). The max(cp) will naturally follow
the critical path by scheduling tasks with the highest cp first and, vice versa, the
min(cp) will schedule from the available tasks those with the minimum cp. Between
these two, we also choose randomly amongst the available tasks with rand(cp).
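The three strategies can be realized with an ordinary list scheduler. The sketch below is an assumed implementation, not the thesis scheduler; the representation of the DAG by predecessor lists and per-task weights is an assumption. The only difference between max(cp), rand(cp), and min(cp) is the key used to order the ready tasks.

    import heapq
    import random

    def list_schedule(preds, weights, cp, p, policy="max"):
        """Non-preemptive list schedule of a weighted DAG on p processors.
        preds: task -> list of predecessor tasks; cp: Backflow priorities."""
        succ = {t: [] for t in preds}
        indeg = {t: len(ps) for t, ps in preds.items()}
        for t, ps in preds.items():
            for q in ps:
                succ[q].append(t)
        key = {"max": lambda t: -cp[t],
               "min": lambda t: cp[t],
               "rand": lambda t: random.random()}[policy]
        ready = [t for t, d in indeg.items() if d == 0]
        running = []                      # heap of (finish time, task)
        free, time, makespan = p, 0.0, 0.0
        while ready or running:
            ready.sort(key=key)
            while ready and free > 0:     # start the preferred ready tasks
                task = ready.pop(0)
                heapq.heappush(running, (time + weights[task], task))
                free -= 1
            finish, task = heapq.heappop(running)
            done = [task]
            while running and running[0][0] == finish:   # simultaneous completions
                done.append(heapq.heappop(running)[1])
            time = makespan = finish
            free += len(done)
            for d in done:
                for s in succ[d]:
                    indeg[s] -= 1
                    if indeg[s] == 0:
                        ready.append(s)
        return makespan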
4.3 Scheduling with synchronizations
The right-looking version of the LAPACK Cholesky factorization as depicted in Figure 1.1c provides an alternative schedule which can be easier to analyze and understand. We will apply the three steps of the algorithm to a matrix of t x t tiles. In the tiled setting, we can provide synchronization points between the varying tasks of each step and simply schedule any of the available tasks, since there are no dependencies between the tasks in each grouping. By adding these synchronizations,
this schedule is not able to obtain the critical path no matter how many processors
are available. Algorithm 4.1 is the right-looking variant of the Cholesky factorization
with added synchronization points.
Algorithm 4.1: Schedule for tiled right-looking Cholesky factorization with added synchronizations to allow for grouping.

Tile Cholesky Factorization (compute L such that A = LL^T):
for i = 0 to t-1 do
    schedule POTRF(i);
    synchronize;
    for j = i+1 to t-1 do
        schedule TRSM(j,i);
    synchronize;
    for j = i+1 to t-1 do
        for k = i+1 to j-1 do
            schedule GEMM(j,i,k);
    synchronize;
    for j = i+1 to t-1 do
        schedule SYRK(j,i);
    synchronize;
Naturally, we can improve upon the above schedule by removing the synchronization between some of the groupings (Algorithm 4.2). The update of the trailing matrix is composed of two groupings, namely the GEMMs and the SYRKs, which can be executed in parallel if enough processors are available. Moreover, the added synchronization point between the update of the trailing matrix and the factorization of the next diagonal tile may also be removed. This schedule does become more complex, but given enough processors, the schedule is able to obtain the critical path as the limiting factor to performance. The minimum number of processors, p, needed to obtain the critical path is p = ⌈(t-1)²/2⌉ for a matrix of t x t tiles, since the highest degree of parallelism is realized for the update of the first trailing matrix, which is of size (t-1) x (t-1).
Both of these schedules differ from the critical path scheduling due to the added
synchronization points and will show lower theoretical performance. In the theoretical
results, we only show Algorithm 4.2.
Algorithm 4.2: Improvement upon Algorithm 4.1

Tile Cholesky Factorization (compute L such that A = LL^T):
for i = 0 to t-1 do
    schedule POTRF(i);
    synchronize;
    for j = i+1 to t-1 do
        schedule TRSM(j,i);
    synchronize;
    for j = i+1 to t-1 do
        for k = i+1 to j-1 do
            schedule GEMM(j,i,k);
        schedule SYRK(j,i);
4.4 Theoretical Results
In the following figures, our Rooftop bound will be that which uses the perfect speedup until the weighted critical path is the limiting factor, i.e.,

    Rooftop Bound = max(Critical Path, T_seq / p).        (4.2)
We will compare this to our ALAP Derived bound which was derived using the ALAP execution, our various scheduling strategies, and the following lower bound. From [20, p. 221, §7.4.2], given our DAG, we know that the makespan, MS, of any list schedule σ, for a given number of processors, p, is

    MS(σ, p) ≤ (2 - 1/p) · MS_opt(p)

where MS_opt is the makespan of the optimal list schedule without communication costs. However, we do not know MS_opt and must therefore substitute the makespan
of the Critical Path Scheduling using the maximum strategy to compute our lower
bound.
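In code, both quantities are one-liners; the helper names below are illustrative only and are not taken from the thesis.

    def rooftop_bound(t_seq, critical_path, p):
        # equation (4.2): perfect speedup until the weighted critical path dominates
        return max(critical_path, t_seq / p)

    def makespan_lower_bound(cp_schedule_makespan, p):
        # from MS(sigma, p) <= (2 - 1/p) * MS_opt(p), with sigma taken to be the
        # Critical Path (max) schedule:  MS_opt(p) >= MS(sigma, p) / (2 - 1/p)
        return cp_schedule_makespan / (2.0 - 1.0 / p)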
The ALAP Derived bound improves upon the Rooftop bound precisely in the area
that is of most concern, namely where there is enough parallelism but not enough
processors to fully exploit that parallelism.
Figure 4.3: Theoretical results for matrix of 40 x 40 tiles. (a) Speedup with p = 40. (b) Efficiency. (c) Comparison to the new bound (scalability with p = 40; curves: Critical Path max, Critical Path random, Critical Path min, and Algorithm 4.2).
Figure 4.3(a) shows that the Critical Path schedule is actually quite decent and comes to within two percent of the ALAP Derived bound. Moreover, the ALAP Derived bound has significantly reduced the gap between the Rooftop bound and any of our list schedules.
4.5 Toward an α_opt
It is interesting to know how many processors one needs to be able to schedule all of the tasks and maintain the weighted critical path. We will view this problem in terms of tiles and state the problem as follows:

Given a matrix of t x t tiles, determine the minimum number of processors, p_opt, needed to schedule all tasks and achieve an execution time equal to the weighted critical path.

Toward that end, we will let p = αt² where 0 < α < 1. Our analysis will be asymptotic in which we let t tend to infinity. From the analysis of Algorithm 4.2, we already know that α_opt ≤ 1/2. Using MATLAB, we have calculated the α value for matrices of t x t tiles as shown in Figure 4.4. It is our conjecture that α_opt ≈ 1/5.
Figure 4.4: Values of α for matrices of t x t tiles where 3 ≤ t ≤ 40
4.6 Related Work
For the LU decomposition with partial pivoting, much work has been accomplished to discern asymptotically optimal algorithms for all values of α [37, 35, 43]. They consider a problem of size n and assume p = αn processors on which to schedule the LU decomposition. The critical path length of the optimal schedule in this case is n², and α_opt ≈ 0.347, where α_opt is a solution to the equation 3α - α³ = 1.

Figure 4.5: Asymptotic efficiency versus α = p/n for LU decomposition and versus α = p/t² for tiled Cholesky factorization.
In Figure 4.5, we make a comparison between the algorithmic efficiency of the LU
decomposition and the tiled Cholesky factorization. In the case of the LU decompo-
sition, the attainable upper bound on efficiency closely resembles our previous bound
of perfect speedup which is then limited by the critical path. On the other hand, the
tiled Cholesky factorization does not exhibit this type of efficiency which can be seen
from the gap between our Rooftop bound and the ALAP Derived bound. Unlike the
work in [37], we do not provide an algorithm which attains the ALAP Derived bound.
4.7 Conclusion and future work
In many research papers, the performance of an algorithm is usually compared to
either the performance of the GEMM kernel or against perfect scalability resulting in
large discrepancies between the peak performance of the algorithm and these upper
bounds. If an algorithm displays a DAG such as that of the tiled Cholesky factor-
ization, it is unrealistic to expect perfect scalability or even make comparisons to the
performance of the GEMM kernel. Thus one needs to consider a new bound which
is more representative of the algorithm and accounts for the structure of the DAG.
Without such a bound it is difficult to assess whether there are any performance
gains to be achieved. Although we do not have a closed-form expression for this new
bound, we have shown that such a bound exists. Moreover, we have also shown that
any algorithm which schedules the tiled Cholesky factorization while maintaining the
weighted critical path will require O(t²) processors for a matrix of t x t tiles, and the coefficient is somewhere around 0.2.
In this chapter, we have applied a combination of existing techniques to a tiled
Cholesky factorization algorithm to discover a more realistic bound on the perfor-
mance and efficiency. We did so by considering an ALAP execution on an unbounded
number of processors and used this information to calculate the idle time for any
list schedule on a bounded number of processors. This is then used to calculate the
maximum speedup and efficiency that may be observed.
Further work is necessary to provide a closed-form expression of the new bound
dependent upon the number of processors. In addition, we need to include communi-
cation costs in the bound to make it more reflective of the actual scheduling problem
on parallel machines. As can be seen in Figure 4.3(c), the Critical Path Schedule is
within 2% of our ALAP Derived bound. Although scheduling a DAG on a bounded
number of processors is an NP-complete problem, it may not be the case for the DAG of the tiled Cholesky factorization. Further investigation might show that the
Critical Path Scheduling is the optimal schedule.
5. Scheduling of QR Factorization
In this chapter, we present collaborative work with Jeffrey Larson. We revisit
the tiled QR factorization as presented in Chapter 3 but do so in the context of a
bounded number of resources. (Chapter 3 was concerned with finding the optimal
elimination tree on an unlimited number of processors.) We will be using the same
analytical tools as in Chapter 4 to derive good schedules and to improve upon the
Rooftop bound. The Cholesky factorization has just one DAG; therefore, Chapter 4 is a standard scheduling problem, i.e., how to schedule a DAG on a finite number of processors. Unlike the previous chapter, we will need to consider all of the various
algorithms (i.e., elimination trees) and cannot distill the analysis down to a single
DAG.
5.1 Performance Bounds
Each of the algorithms studied in Chapter 3, namely FlatTree, Fibonacci,
Greedy, and GrASAP, produces a unique DAG for a matrix of p x q tiles. In turn,
the ALAP Derived bounds (4.1) for each elimination tree will also be unique. In Fig-
ure 5.1, we give the computed upper bounds and make comparisons to the scheduling
strategies of maximum, random, and minimum via the Critical Path Method for a
matrix of 20 x 6 tiles. The matrix size was chosen such that the critical path length
of Greedy is 136 and the critical path length of GrASAP is 134 (see Figure 3.5 in
§ 3.2.2).
The GrASAP algorithm for a tiled matrix is optimal in the scope of unbounded
resources. However, by the manner in which the ALAP Derived bound is computed,
the bound created by using GrASAP cannot hold for all of the other algorithms.
Consider the ALAP execution of the Fibonacci and GrASAP algorithms on a
matrix of 15 x 4 tiles. In Figure 5.2, we show the execution of the last tasks for
GrASAP on the left and Fibonacci on the right. More of the tasks in the ALAP
execution for Fibonacci are pushed towards the end of the execution which means
Figure 5.1: Scheduling comparisons for each of the algorithms versus the ALAP Derived bounds on a matrix of 20 x 6 tiles. Speedup with p = 20, q = 6 for (a) FlatTree, (b) Fibonacci, (c) Greedy, and (d) GrASAP.
the ALAP Derived bound will be higher than that of GrASAP for a schedule that
uses fewer than 10 processors. In other words, as we add more processors, the Lost
Area (LA) increases much faster for GrASAP than it does for Fibonacci. Since
the critical path length for Fibonacci is greater than that of GrASAP, after a
certain number of processors, the ALAP Derived bound for Fibonacci falls below
that of GrASAP. These observations are evident in Figure 5.3 where we show a
comparison of the bound for each algorithm. Recall that the Rooftop bound (4.2) only
takes into account the critical path length of an algorithm such that for GrASAP
Figure 5.2: Tail-end execution using ALAP on unbounded resources for GrASAP (left) and Fibonacci (right) on a matrix of 15 x 4 tiles.
it can be considered a bound for all the algorithms since GrASAP is optimal for
unlimited resources and thus has the shortest critical path length.
5.2 Optimality
There is no reason for the tree found optimal in Chapter 3 on an unbounded number of resources to be optimal on a bounded number of resources. We cast the problem
of finding the optimal tree with the optimal schedule as an integer programming
problem. (A complete description of the formulation can be found in Appendix A.)
Similarly, in [1] a Mixed-Integer Linear Programming approach was used to provide
an upper bound on performance. However, the integer programming problem size
grows exponentially as the matrix size increases, thus the largest feasible problem
size was a matrix of 5 x 5 tiles. In Figure 5.4 we show the speedup of the GrASAP algorithm with its bound and make comparisons to an optimal tree with an optimal schedule, and Table 5.1 provides the actual schedule lengths for all of the algorithms using the CP Method for the matrix of 5 x 5 tiles.
Figure 5.3: ALAP Derived bound comparison for all algorithms for a matrix of 15 x 4 tiles. (a) Speedup with p = 15, q = 4. (b) Efficiency with p = 15, q = 4.
5.3 Elimination Tree Scheduling
We pair up the choice of the elimination tree with a type of scheduling strategy
to obtain the following bounds:
    (GrASAP, Rooftop bound) ≤ (GrASAP, ALAP Derived bound) ≤ (GrASAP, optimal schedule) ≤ (GrASAP, CP schedule).

Moreover, we also have

    (GrASAP, Rooftop bound) ≤ (optimal tree, optimal schedule) ≤ (GrASAP, optimal schedule) ≤ (GrASAP, CP schedule),

where each pair (elimination tree, schedule or bound) denotes the execution time obtained for that elimination tree under that schedule or bound.
Combining these inequalities with Table 5.1 gives rise to the following questions:
Given an optimal elimination tree for the tiled QR factorization on an
unbounded number of resources
(Q1) does there always exist a scheduling strategy such that the schedule
on limited resources is optimal?
(Q2) does the ALAP Derived bound for this elimination tree hold true
for any scheduling strategy on any other elimination tree?
Figure 5.4: Comparison of speedup for the CP Method on GrASAP, the ALAP Derived bound from GrASAP, and optimal schedules for a matrix of 5 x 5 tiles on 1 to 14 processors.
Table 5.1: Schedule lengths for matrix of 5 x 5 tiles
Procs ALAP Derived Bound(GRASAP) Optimal Tree/Schedule GrASAP CP Method Greedy Fibonacci FlatTree
1 500 500 500 500 500 500
2 255 256 256 256 256 256
3 176 176 178 178 178 176
4 138 138 140 140 140 140
5 116 116 118 118 118 116
6 102 104 104 104 104 104
7 92 94 94 94 94 94
8 86 88 88 88 88 88
9 82 84 84 84 86 86
10 80 80 82 82 86 86
11 80 80 80 80 86 86
12 80 80 80 80 86 86
13 80 80 80 80 86 86
14 80 80 80 80 80 86
From Chapter 3 we have that GrASAP is an optimal elimination tree for the tiled
QR factorization. We know that the length of an optimal schedule for GrASAP
on p processors will necessarily be greater than or equal to the ALAP Derived bound for GrASAP on p processors, by way of construction of the ALAP Derived bound. Thus (Q1) implies (Q2). We cannot address the first question directly since the size of
the matrix needed to produce a counter example is too large for verification with the
integer programming formulation.
Therefore we need to find a matrix size for which a schedule exists whose execution
length is smaller than the ALAP Derived bound from GrASAP on the same matrix.
As we have seen in Figure 5.3, the Fibonacci elimination tree on a tall and skinny
tiled matrix provides the best hope.
Consider a matrix of 34 x 4 tiles on 10 processors. The ALAP Derived bound
from GrASAP is 188. Using the Critical Path method to schedule the Fibonacci
elimination tree we obtain a schedule length of 184. Therefore the ALAP Derived
bound from GrASAP does not hold for this schedule. By implication, we have that
(Q1) is false. However, the Rooftop bound from GrASAP is still a valid bound for
all of the schedules.
5.4 Conclusion
In this chapter we have applied the same tools used in Chapter 4 to provide performance bounds for the tiled QR factorization. Further, we have shown that the ALAP Derived bound is algorithm dependent. This leaves the Rooftop bound, as computed using the GrASAP algorithm, as the only bound we have for all algorithms.
The analysis in this chapter has also shown that an algorithm that is optimal for an unbounded number of resources does not imply that a scheduling strategy exists such that it can be scheduled optimally on a bounded number of resources.
6. Strassen Matrix-Matrix Multiplication
Matrix multiplication is the underlying operation in many, if not most, of the applications in numerical linear algebra and, as such, it has garnered much attention. Algorithms such as the Cholesky factorization, the LU decomposition, and more recently the QR-based Dynamically Weighted Halley iteration for the polar decomposition [38], spend a majority of their computational cost in matrix-matrix multiplication. The conventional BLAS Level 3 subprogram for matrix-matrix multiplication has a computational cost of O(n^α), where α = log_2 8 = 3, but there exist algorithms of subcubic computational cost. In 1969, Volker Strassen [48] presented an algorithm that computes the product of two square matrices of size n x n, where n is even, using only 7 matrix multiplications at the cost of 18 matrix additions/subtractions; the algorithm can then be applied recursively to each of the 7 multiplications. This compares to the standard cubic algorithm, which requires 8 matrix multiplications and only 4 matrix additions. When Strassen's algorithm is applied recursively down to a constant size, the computational cost is O(n^α) where α = log_2 7 ≈ 2.807. Two years later, Shmuel Winograd proved that a minimum of 7 matrix multiplications and 15 matrix additions/subtractions, which is fewer than the 18 of Strassen's, are required for the product of two 2 x 2 matrices; see [54].
and early 1980s, Bini [12] provided a < 2.78 with Schonage [46] following up by
showing a < 2.522 but was usurped the following year by Romani [44] who discerned
a < 2.517. In 1986, Strassen brought forth a new approach which lead to a < 2.497.
In 1990, Coppersmith and Winograd [23] improved upon Strassens result providing
the asymptotic exponent a < 2.376. This final result still stands, but it is conjectured
that a = 2 + e for any e > 0 where e can me made as small as possible. Although
the Coppersmith-Winograd algorithm may be reasonable to implement, since the
constant of the algorithm is huge and will not provide an advantage except for very
84


large matrices, we will not consider it and instead focus on the Strassen-Winograd
Algorithm.
6.1 Strassen-Winograd Algorithm
Here we discuss the algorithm as it would be implemented to compute the product of two matrices A and B where the result is stored into matrix C. The algorithm is recursive, thus we describe one step. Given the input matrices A, B, and C, divide them into four submatrices,

    A = [ A11  A12 ]     B = [ B11  B12 ]     C = [ C11  C12 ]
        [ A21  A22 ],        [ B21  B22 ],        [ C21  C22 ],
then the 7 matrix multiplications and 15 matrix additions/subtractions are computed
as depicted in Table 6.1 and Figure 6.1 shows the task graph of the Strassen-Winograd
algorithm for one level of recursion.
Table 6.1: Strassen-Winograd Algorithm

Phase 1:   T1 = A21 + A22      T5 = B12 - B11
           T2 = T1 - A11       T6 = B22 - T5
           T3 = A11 - A21      T7 = B22 - B12
           T4 = A12 - T2       T8 = T6 - B21

Phase 2:   Q1 = T2 x T6        Q5 = T1 x T5
           Q2 = A11 x B11      Q6 = T4 x B22
           Q3 = A12 x B21      Q7 = A22 x T8
           Q4 = T3 x T7

Phase 3:   U1 = Q1 + Q2        C11 = Q2 + Q3
           U2 = U1 + Q4        C12 = U1 + U3
           U3 = Q5 + Q6        C21 = U2 - Q7
                               C22 = U2 + Q5
In essence, Strassen's approach is very similar to the observation that Gauss made concerning the multiplication of two complex numbers. The product (a + bi)(c + di) = ac - bd + (bc + ad)i would naively take four multiplications, but can actually be accomplished via three multiplications by discerning that bc + ad = (a + b)(c + d) - ac - bd.

Figure 6.1: Task graph for the Strassen-Winograd Algorithm. Execution time progresses from left to right. Large ovals depict multiplication and small ovals addition/subtraction.
6.2 Tiled Strassen-Winograd Algorithm
In our tiled version, the matrices are subdivided such that each submatrix is of the form

    M_ij = [ M_ij,11  M_ij,12  ...  M_ij,1n ]
           [ M_ij,21  M_ij,22  ...  M_ij,2n ]
           [   ...      ...           ...   ]
           [ M_ij,n1  M_ij,n2  ...  M_ij,nn ]

where the M_ij,kl are tiles of size n_b x n_b. As before, one can proceed with full recursion; unlike before, this would not terminate at the scalar level, but rather it would terminate with the multiplication of two tiles using a sequential BLAS Level 3 matrix-matrix multiplication. The recursion can also be cut off at a higher level, at which point the tiled matrix multiplication of Algorithm 6.1 computes the resulting multiplication. For the addition/subtraction of the submatrices in Phase 1 and Phase 3 of the Strassen-Winograd algorithm, a similar approach is used, which is also executed in parallel.
Algorithm 6.1: Tiled Matrix Multiplication (tiled_gemm)
/* Input: n x n tiled matrices A and B; Output: n x n tiled matrix C such that C = A x B */
for i = 1 to n do
    for j = 1 to n do
        for k = 1 to n do
            C_ij <- A_ik x B_kj + C_ij

If the cutoff for the recursion occurs before the tile level, the computation for each C_ij can be executed in parallel. Therefore our tiled implementation of the Strassen-Winograd algorithm exploits two levels of parallelism. Moreover, this allows some parts of the matrix multiplications to occur early on, as can be seen in Figure 6.2, which shows the DAG for a matrix of 4 x 4 tiles with one level of recursion. Both Figure 6.2 and Figure 6.1 illustrate one level of recursion, but the tiled task graph of a 4 x 4 tiled matrix clearly portrays the high degree of parallelism.

The conventional matrix-matrix multiplication algorithm requires 8 multiplications and 4 additions, whereas the Strassen-Winograd algorithm requires 7 multiplications and 15 additions/subtractions for each level of recursion. Therefore, there are more tasks for the Strassen-Winograd algorithm as compared to the conventional matrix-matrix multiplication, and it would behoove us to reduce the number of tasks, which would also reduce the algorithmic complexity. On the other hand, since we are reducing the number of multiplications, the computational cost is also reduced, since a multiplication requires a cubic operation versus the quadratic operation of the additions/subtractions.
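As a small illustration of Algorithm 6.1, the sketch below uses NumPy blocks as stand-ins for tiles; the helper to_tiles and the sizes are made up for the example, and each innermost update corresponds to one GEMM task of the DAG.

    import numpy as np

    def tiled_gemm(A, B, C):
        """A, B, C: n x n lists of n_b x n_b tiles; computes C <- C + A * B."""
        n = len(A)
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    C[i][j] += A[i][k] @ B[k][j]      # one task of the task graph

    # Check on a 4 x 4 grid of 2 x 2 tiles against the flat product.
    nb, n = 2, 4
    A = np.random.rand(n * nb, n * nb)
    B = np.random.rand(n * nb, n * nb)
    to_tiles = lambda M: [[M[i*nb:(i+1)*nb, j*nb:(j+1)*nb].copy() for j in range(n)]
                          for i in range(n)]
    At, Bt, Ct = to_tiles(A), to_tiles(B), to_tiles(np.zeros((n * nb, n * nb)))
    tiled_gemm(At, Bt, Ct)
    assert np.allclose(np.block(Ct), A @ B)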
The total number of tasks, T, of the Strassen-Winograd algorithm with r levels of recursion applied to a matrix of p x p tiles is given by

    T = 7^r (p / 2^r)^3 + 15 · Σ_{i=0}^{r-1} 7^i (p / 2^(i+1))^2

PAGE 27

Similarly,alltheSYRK-TRMManti-dependences,aswellasTRMM-LAUMMand GEMM-TRMManti-dependencescanberemoved.Wehavedesignedavariantof Algorithm2.1thatremovesalltheanti-dependencesthankstotheuseofalarge workingarraythistechniqueiscalled arrayrenaming [6]incompilation[6].The subsequentDAGFigure2.1bissplitinmultiplepiecesFigure2.1b,leadingtoa shortercriticalpathTable2.1.Weimplementedtheout-of-placealgorithm,based onourdynamicschedulertoo.Figure2.2ashowsthatourdynamicschedulerexploits itshigherdegreeofparallelismtoachieveamuchhigherstrongscalabilityevenon smallmatrices N =1000.ForalargermatrixFigure2.2b,thein-placealgorithm alreadyachievedverygoodscalability.Therefore,usingupto7cores,theirperformancearesimilar.However,thereisnotenoughparallelismwitha4000 4000matrix touseecientlyall8coreswiththein-placealgorithm;thusthehigherperformance oftheout-of-placeversioninthiscaseleftmostpartofFigure2.2b. Loopreversalexploitingcommutativity. Themostinternalloopofeach stepofAlgorithm2.1l.8,l.17andl.26consistsinsuccessivecommutativeGEMM operations.Thereforetheycanbeperformedinanyorder,amongwhichincreasing orderanddecreasingorderoftheloopindex.Theirorderingimpactsthelengthofthe criticalpath.Algorithm2.1ordersthosethreeloopsinincreasingUanddecreasing Dorder,respectively.WehadmanuallychosentheserespectiveordersUDU becausetheyminimizethecriticalpathofeachstepvaluesreportedinTable2.1. Anaiveapproachwouldhave,forexample,beencomprisedofconsistentlyordering theloopsinincreasingorderUUU.InthiscaseUUU,thecriticalpathofTRTRI wouldhavebeenequalto t 2 )]TJ/F15 11.9552 Tf 11.122 0 Td [(2 t +3in-placeor 1 2 t 2 )]TJ/F18 7.9701 Tf 12.317 4.707 Td [(1 2 t +2out-of-placeinstead of3 t )]TJ/F15 11.9552 Tf 12.577 0 Td [(3in-placeor2 t )]TJ/F15 11.9552 Tf 12.576 0 Td [(1out-of-placeforUDU.Figure2.3showshowloop reversalimpactsperformance. Pipelining. Pipeliningthemultiplestepsoftheinversionreducesthelengthof itscriticalpath.Forthein-placecase,thecriticalpathisreducedfrom9 t )]TJ/F15 11.9552 Tf 12.044 0 Td [(7tasks 14

PAGE 28

a n =1000 b n =4000 Figure2.3:Impactofloopreversalonperformance. t isthenumberoftilesto9 t )]TJ/F15 11.9552 Tf 12.61 0 Td [(9tasksnegligible.Fortheout-of-placecase,it isreducedfrom6 t )]TJ/F15 11.9552 Tf 12.651 0 Td [(3to5 t )]TJ/F15 11.9552 Tf 12.652 0 Td [(2tasks.Westudiedtheeectofpipeliningonthe performanceoftheinversionona8000 8000matrixwithanarticiallylargetile size b =2000and t =4.Asexpected,weobservedalmostnoeectonperformance ofthein-placecaseabout36 : 4secondswithorwithoutpipelining.Fortheout-ofplacecase,theelapsedtimegrowsfrom25 : 1to29 : 2seconds%overheadwhen pipeliningisprevented. 2.3Conclusionandfuturework Wehaveproposedanewalgorithmtocomputetheinverseofasymmetricpositive denitematrixonmulticorearchitectures.Anexperimentalstudyhasshownboth anexcellentscalabilityofouralgorithmandasignicantperformanceimprovement comparedtostate-of-the-artlibraries.Beyondextendingtheclassofso-calledtile algorithms,thisstudybroughtbacktotheforewellknownissuesinthedomainof compilation.Indeed,wehaveshowntheimportanceofloopreversal,arrayrenaming andpipelining. Theuseofadynamicschedulerallowedanout-of-the-boxpipelineofthedierentstepswhereasloopreversalandarrayrenamingrequiredamanualchangetothe 15

PAGE 29

algorithm.Thefutureworkdirectionsconsistinenablingtheschedulertoperform itselfloopreversalandarrayrenaming.WeexploitedthecommutativityofGEMM operationstoperformarrayrenaming.Theirassociativitywouldfurthermoreallow toprocesstheminparallelfollowingabinarytree;thesubsequentimpactonperformanceistobestudied.Arrayrenamingrequiresextra-memory.Itwillbeinteresting toaddresstheproblemofthemaximizationofperformanceundermemoryconstraint. ThisworkaimstobeincorporatedintoPLASMA. 16

PAGE 30

3.QRFactorization InthischapterwepresentjointworkwithMathiasJacquelin,JulienLangou,and YvesRobert[31]. Givenan m -byn matrix A with n m ,weconsiderthecomputationofitsQR factorization,whichisthefactorization A = QR ,where Q isan m -byn unitary matrix Q H Q = I n ,and R isuppertriangular. TheQRfactorizationofan m -byn matrixwith n m isthetimeconsuming stageofsomeimportantnumericalcomputations.Itisneededforsolvingalinear leastsquaresproblemwith m equationsobservationsand n unknownsandisused tocomputeanorthonormalbasisthe Q -factorofthecolumnspanoftheinitial matrix A .Forexample,allblockiterativemethodsusedtosolvelargesparselinear systemsofequationsorcomputingsomerelevanteigenvaluesofsuchsystemsrequire orthogonalizingasetofvectorsateachstepoftheprocess.Forthesetwousage examples,while n m n canrangefrom n m to n = m .Wenotethatthe extremecase n = m isalsorelevant:theQRfactorizationofamatrixcanbeusedto solvesquarelinearsystemsofequations.Whilethisrequirestwiceasmanyopsas anLUfactorization,usingaQRfactorizationaisunconditionallystableGaussian eliminationwithpartialpivotingorpairwisepivotingisnotandbavoidspivoting soitmaywellbefasterinsomecases. ToobtainaQRfactorization,weconsideralgorithmswhichapplyasequenceof m -bym unitarytransformations, U i U H i U i = I ,, i =1 ;:::;` ,ontheleftofthe matrix A ,suchthatafter ` transformationstheresultingmatrix R = U ` :::U 1 A is uppertriangular,inwhichcase, R isindeedthe R -factoroftheQRfactorization.The Q -factorifneededcanthenbeobtainedbycomputing Q = U H 1 :::U H ` .Thesetypes ofalgorithmsareinregularuse,e.g.,intheLAPACKandScaLAPACKlibraries, andarefavoredoverothersalgorithmsCholeskyQRorGram-Schmidtfortheir stability. 17

PAGE 31

Theunitarytransformation U i ischosensoastointroducesomezerosinthecurrentupdatematrix U i )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 :::U 1 A .ThetwobasictransformationsareGivensrotations andHouseholderreections.OneGivensrotationintroducesoneadditionalzero; thewholetriangularizationrequires mn )]TJ/F20 11.9552 Tf 12.567 0 Td [(n n +1 = 2Givensrotationsfor n
PAGE 32

lessthanthefullcolumnspan,whichenablesconcurrencywithotherreections. TiledQRfactorizationinthecontextofmulticorearchitectureshasbeenintroducedin[18,19,41].Initiallythefocuswasonsquarematricesandthesequenceof unitarytransformationspresentedwasanalogousto Sameh-Kuck [45],whichcorrespondstoreducingthepanelswithattrees.Thepossibilityofusinganytreein ordertoeithermaximizeparallelismorminimizecommunicationisexplainedin[26]. Thefocusofthischapterisonmaximizingparallelism.Wereducethecommunicationdatamovementbetweenmemoryhierarchywithinthealgorithmtoacceptable levelsbytilingtheoperations.Stemmingfromtheobservationthatabinarytreeis bestfortallandskinnymatricesandaattreeisbestforsquarematrices,Hadri etal.[30],proposetousetreeswhichcombineattreesatthebottomlevelwitha binarytreeatthetoplevelinordertoexhibitmoreparallelism.Ourtheoreticaland experimentalworkexplainsthatwecanadapt Fibonacci [36]and Greedy [24,25] totiles,resultinginyetbetteralgorithmsintermsofparallelism.Moreoverournew algorithmsdonothaveanytuningparametersuchasthedomainsizeinthecase of[30]. ThesequentialkernelsoftheTiledQRfactorizationexecutedonacorearemade ofstandardblockedalgorithmssuchasLAPACKencodedinkernels;thedevelopment ofthesekernelsiswellunderstood.Thefocusofthischapterisonimprovingthe overalldegreeofparallelismofthealgorithm.Givena p -byq tiledmatrix,weseek tondanappropriatesequenceofunitarytransformationsonthetiledmatrixsoas tomaximizeparallelismminimizecriticalpathlength.Wewillgetourinspiration frompreviousworkfromthe1970s/80sonGivensrotationswherethequestionwas somewhatrelated:givenan m -byn matrix,ndanappropriatesequenceofGivens rotationsastomaximizeparallelism.Thisquestionisessentiallyansweredin[24,25, 36,45];wecallthisclassofalgorithms coarse-grainalgorithms ." Workingwithtilesinsteadofscalars,weintroducefouressentialdierencesbe19

PAGE 33

tweentheanalysisandtherealityofthetiledalgorithmsversusthecoarse-grain algorithms.First,whilethereareonlytwostatesforascalarnonzeroorzero,a tilecanbeinthreestateszero,triangleorfull.Second,therearemoreoperations availableontilestointroducezeros;wehaveatotalofthreedierenttaskswhichcan introducezerosinamatrix.Third,inthetiledalgorithms,thefactorizationandthe updatearedissociatedtoenablefactorizationstagestooverlapwithupdatestages; whereas,inthecoarse-grainalgorithm,thefactorizationandtheassociatedupdate areconsideredasasinglestage.Lastly,whilecoarse-grainalgorithmshaveonlyone task,weendupwithsixdierenttasks:threefromthefactorizationszeroingof tilesandthreeforeachoftheassociatedupdatessincethesehavebeenunlinked fromthefactorization.Eachofthesesixtaskshavedierentcomputationalweights; thisdramaticallycomplicatesthecriticalpathanalysisofthetiledalgorithms. Whilethe Greedy algorithmisoptimalforcoarse-grainalgorithms,weshow thatitisnotinthecaseoftiledalgorithms.However,wehavedevisedandproved thattheredoesexistanoptimaltiledalgorithm. 3.1TheQRfactorizationalgorithm Tiledalgorithmsareexpressedintermsoftileoperationsratherthanelementary operations.Eachtileisofsize n b n b ,where n b isaparametertunedtosqueeze themostoutofarithmeticunitsandmemoryhierarchy.Typically, n b rangesfrom 80to200onstate-of-the-artmachines[5].Algorithm3.1outlinesanaivetiledQR algorithm,whereloopindicesrepresenttiles: Algorithm3.1: GenericQRalgorithmforatiled p q matrix. 1 for k=1 to min p;q do 2 forallthe i 2f k +1 ;:::;p g usinganyordering, do 3 elim i;piv i;k ;k InAlgorithm3.1, k isthepanelindex,and elim i;piv i;k ;k isanorthogonal transformationthatcombinesrows i and piv i;k tozerooutthetileinposition i;k 20

PAGE 34

However,thisformulationissomewhatmisleading,asthereismuchmorefreedom forQRfactorizationalgorithmsthan,say,forCholeskyalgorithmsandcontrarilyto LUeliminationalgorithms,therearenonumericalstabilityissues.Forinstancein column1,thealgorithmmusteliminatealltiles i; 1where i> 1,butitcandoso inseveralways.Take p =6.Algorithm3.1usesthetransformations elim ; 1 ; 1 ; elim ; 1 ; 1 ; elim ; 1 ; 1 ; elim ; 1 ; 1 ; elim ; 1 ; 1 Butthefollowingschemeisalsovalid: elim ; 1 ; 1 ; elim ; 4 ; 1 ; elim ; 1 ; 1 ; elim ; 4 ; 1 ; elim ; 1 ; 1 Inthislatterscheme,thersttwotransformations elim ; 1 ; 1and elim ; 4 ; 1use distinctpairsofrows,andtheycanexecuteinparallel.Onthecontrary, elim ; 1 ; 1 and elim ; 1 ; 1usethesamepivotrowandmustbesequentialized.Tocomplicatematters,itispossibletohavetwoorthogonaltransformationsthatexecutein parallelbutinvolvezeroingatileintwodierentcolumns.Forinstancewecan add elim ; 5 ; 2totheprevioustransformationsandrunitconcurrentlywith,say, elim ; 1 ; 1.AnytiledQRalgorithmwillbecharacterizedbyan eliminationlist whichprovidestheorderedlistofthetransformationsusedtozerooutallthetiles belowthediagonal.Thiseliminationlistmustobeycertainconditionssothatthefactorizationisvalid.Forinstance, elim ; 5 ; 2mustfollow elim ; 4 ; 1and elim ; 4 ; 1 inthepreviouslist,becausethereisaowdependencebetweenthesetransformations. Notethat,althoughtheeliminationlistisgivenasatotallyorderedsequence,some transformationscanexecuteinparallel,providedthattheyarenotlinkedbyadependence:intheexample, elim ; 4 ; 1and elim ; 1 ; 1couldhavebeenswapped, andtheeliminationlistwouldstillbevalid. Inordertodescribemorefullythedependenciesinherentintheeliminations weshallobserveasnippetofanexample.InFigure3.1a,totheleftwehavethe rowidentications,theemptycirclesrepresentzeroedelements,andthelledcircles 21

PAGE 35

representthepivotsusedtozeroouttheelements.Therstcolumn'seliminations areshowningreenandthesecondinred.Fromtheeliminationlist,wedene I s;k as thesetofrowsincolumn k thatarezeroedoutattimestep s aDiagramofeliminationlist elim ; 10 ; 1 elim ; 11 ; 1 elim ; 12 ; 1 9 > = > ; I s 1 ; 1 elim I s 1 ; 1 ; 1 elim ; 5 ; 1 elim ; 6 ; 1 elim ; 7 ; 1 elim ; 8 ; 1 9 > > > = > > > ; I s 2 ; 1 elim I s 2 ; 1 ; 1 elim ; 14 ; 2 o I s 1 ; 2 elim I s 1 ; 2 ; 2 elim ; 9 ; 2 elim ; 10 ; 2 elim ; 11 ; 2 9 > = > ; I s 2 ; 2 elim I s 2 ; 2 ; 2 bEliminationlist Whatmaynotbesoevidentfromtheeliminationlistbutismoreapparentin thediagramoftheeliminationlistarethefollowingdependencyrelationshipsnote that indicatesthattheoperationontheleftmustnishpriortotheoperationon therightstarting: elim piv I s;k ;k ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 elim I s;k ;k .1a elim I s;k ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 elim I s;k ;k .1b elim I s )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 ;k ;k elim I s;k ;k .1c However,notallofthesedependenciesmaycauseaneliminationtobelockedtoa particulartimestep.Infact,somedependenciesmaynotbeneededforaparticular instance,buttheadditionofthesewillnotcreateanarticiallock.Forexample, elim I s 2 ; 2 ; 2isdependentupon elim piv I s 2 ; 2 ; 1, elim I s 2 ; 2 ; 1,and elim I s 1 ; 2 ; 2but elim I s 1 ; 2 ; 2onlydependsupon elim piv I s 1 ; 2 ; 1and elim I s 1 ; 2 ; 1. 22

PAGE 36

Table3.1:KernelsfortiledQR.Theunitoftimeis n 3 b 3 ,where n b istheblocksize. PanelUpdate OperationNameCostNameCost Factorsquareintotriangle GEQRT 4 UNMQR 6 Zerosquarewithtriangleontop TSQRT 6 TSMQR 12 Zerotrianglewithtriangleontop TTQRT 2 TTMQR 6 Beforeformallystatingtheconditionsthatguaranteethevalidityoftheeliminationlistofanalgorithm,weexplainhoworthogonaltransformationscanbeimplemented. 3.1.1Kernels Toimplementagivenorthogonaltransformation elim i;piv i;k ;k ,onecanuse sixdierentkernels,whosecostsaregiveninTable3.1.Inthistable,theunitoftime isthetimetoperform n 3 b 3 oating-pointoperations. Therearetwomainpossibilitiestoimplementanorthogonaltransformation elim i;piv i;k ;k :Therstversioneliminatestile i;k withthe TSTriangleon topofsquare kernels,asshowninAlgorithm3.2: Algorithm3.2: Elimination elim i;piv i;k ;k via TSTriangleontopof square kernels. 1 GEQRT piv i;k ;k 2 TSQRT i;piv i;k ;k 3 for j= k +1 to q do 4 UNMQR piv i;k ;k;j 5 TSMQR i;piv i;k ;k;j Herethetilepanel piv i;k ;k isfactoredintoatrianglewith GEQRT .The transformationisappliedtosubsequenttiles piv i;k ;j j>k ,inrow piv i;k with UNMQR .Tile i;k iszeroedoutwith TSQRT ,andsubsequenttiles i;j j>k inrow i areupdatedwith TSMQR .Theopcountis4+6++12 q )]TJ/F20 11.9552 Tf 12.441 0 Td [(k = 23

PAGE 37

10+18 q )]TJ/F20 11.9552 Tf 12.201 0 Td [(k expressedinsametimeunitasinTable3.1.Dependenciesarethe following: GEQRT piv i;k ;k TSQRT i;piv i;k ;k GEQRT piv i;k ;k UNMQR piv i;k ;k;j for j>k UNMQR piv i;k ;k;j TSMQR i;piv i;k ;k;j for j>k TSQRT i;piv i;k ;k TSMQR i;piv i;k ;k;j for j>k TSQRT i;piv i;k ;k and UNMQR piv i;k ;k;j canbeexecutedinparallel,aswell as UNMQR operationsondierentcolumns j;j 0 >k .Withanunboundednumber ofprocessors,theparalleltimeisthus4+6+12=22time-units. Algorithm3.3: Elimination elim i;piv i;k ;k via TTTriangleontopof triangle kernels. 1 if k> 0 then 2 TTQRT i;piv i;k ;k 3 if k 0 then 6 TTMQR i;piv i;k ;k;j 7 GEQRT i;k +1 8 for j= k +2 to q do 9 UNMQR i;k;j Thesecondapproachtoimplementtheorthogonaltransformation elim i;piv i;k ;k iswiththe TTTriangleontopoftriangle kernels,asshown inAlgorithm3.3.Heretile i;k iszeroedoutwith TTQRT andsubsequenttiles i;j and piv i;k ;j j>k ,inrows i and piv i;k areupdatedwith TTMQR Immediatelyfollowing,tile i;k +1isfactoredintoatriangleandthecorrespondingtransformationsareappliedtotheremainingcolumnsinrow i .Necessarily, TTQRT musthavethetriangularizationoftile i;k and piv i;k ;k completed 24

PAGE 38

inordertoproceed.Hencefortherstcolumntherearenoupdatestobeappliedfrompreviouscolumnssuchthatthetriangularizationofthesetileswith GEQRT iscompletedandcanbeconsideredapreprocessingstep.Theopcountis 2+6 q )]TJ/F20 11.9552 Tf 11.722 0 Td [(k +2+6 q )]TJ/F20 11.9552 Tf 11.722 0 Td [(k =10+18 q )]TJ/F20 11.9552 Tf 11.722 0 Td [(k ,justasbefore.Dependenciesarethe following: GEQRT piv i;k ;k UNMQR piv i;k ;k;j for j>k .2a GEQRT i;k UNMQR i;k;j for j>k .2b GEQRT piv i;k ;k TTQRT i;piv i;k ;k .2c GEQRT i;k TTQRT i;piv i;k ;k .2d TTQRT i;piv i;k ;k TTMQR i;piv i;k ;k;j for j>k .2e UNMQR piv i;k ;k;j TTMQR i;piv i;k ;k;j for j>k .2f UNMQR i;k;j TTMQR i;piv i;k ;k;j for j>k .2g Nowthefactoroperationsinrow piv i;k and i canbeexecutedinparallel.Moreover, the UNMQR updatescanberuninparallelwiththe TTQRT factorization.Thus, withanunboundednumberofprocessors,theparalleltimeis4+6+6=16time-units. Recallourdenitionoftheset I s;k tobethesetofrowsincolumn k thatwillbe zeroedoutattimestep s inthecoarse-grainalgorithm.Thusthefollowingdependenciesareadirectconsequenceof3.1casappliedtothezeroingofatileandthe correspondingupdates. TTQRT I s )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 ;piv I s )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 ;k ;k TTQRT I s ;piv I s ;k ;k .3a TTQRT I s )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 ;piv I s )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 ;k ;k;j TTMQR I s ;piv I s ;k ;k;j for j>k .3b InAlgorithm3.2and3.3,itisunderstoodthatifatileisalreadyintriangular form,thentheassociated GEQRT andupdatekernelsdonotneedtobeapplied. Allthenewalgorithmsintroducedinthischapterarebasedon TT kernels.From analgorithmicperspective, TT kernelsaremoreappealingthan TS kernels,asthey 25

PAGE 39

oermoreparallelism.Moreprecisely,wecanalwaysbreaka TS kernelintotwo TT kernels:Wecanreplacea TSQRT i;piv i;k ;k followinga GEQRT piv i;k ;k by a GEQRT i;k anda TTQRT i;piv i;k ;k .Asimilartransformationcanbemade fortheupdates.Hencea TS -basedtiledalgorithmcanalwaysbeexecutedwith TT kernels,whiletheconverseisnottrue.However,the TS kernelsprovidemoredata locality,theybenetfromaveryecientimplementationsee x 3.3,andseveralexistingalgorithmsusethesekernels.Forallthesereasons,andforcomprehensiveness, ourexperimentswillcompareapproachesbasedonbothkerneltypes. Atthispoint,thePLASMAlibraryonlycontains TS kernels.Wehavemapped thePLASMAalgorithmto TT kernelalgorithmusingthisconversion.Goingfroma TS kernelalgorithmtoa TT kernelalgorithmisimplicitlydonebyHadrietal.[11] whengoingfromtheirSemi-Parallel"totheirFully-Parallel"algorithms. 3.1.2Eliminationlists Asstatedabove,anyalgorithmfactorizingatiledmatrixofsize p q ischaracterizedbyitseliminationlist.Obviously,thealgorithmmustzerooutalltilesbelow thediagonal:foreachtile i;k i>k ,1 k min p;q ,thelistmustcontain exactlyoneentry elim i;?;k ,where ? denotessomerowindex piv i;k .Thereare twoconditionsforatransformation elim i;piv i;k ;k tobevalid: bothrows i and piv i;k mustbeready,meaningthatalltheirtilesleft ofthepanelofindices i;k 0 and piv i;k ;k 0 for1 k 0
PAGE 40

Anyalgorithmthatfactorizesthetiledmatrixobeyingtheseconditionsiscalleda generictiledalgorithm inthefollowing. Theorem3.1 NomatterwhateliminationlistanycombinationofTT,TSisused thetotalweightofthetasksforperformingatiledQRfactorizationalgorithmis constantandequalto 6 pq 2 )]TJ/F15 11.9552 Tf 11.955 0 Td [(2 q 3 Proof: Wehavethatthequantityofeachkernelisgivenbythefollowing L 1 :: GEQRT = TTQRT + q L 2 :: UNMQR = TTMQR + = 2 q q )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 L 3 :: TTQRT + TSQRT = pq )]TJ/F15 11.9552 Tf 11.955 0 Td [( = 2 q q +1 L 4 :: TTMQR + TSMQR = = 2 pq q )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 )]TJ/F15 11.9552 Tf 11.955 0 Td [( = 6 q q )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 q +1 Thequantityof TTQRT providesthenumberoftileszeroedoutviaatriangleon topofatrianglekernel.Thusequation L 1 iscomposedoftwoparts:necessarily, thediagonaltilesmustbetriangularizedandeach TTQRT mustadmitonemore triangularizationinordertoprovideapairing.Thenumberofupdatesofthesetriangularizations,givenby L 2 ,issimplythesumoftheupdatesfromthetriangularization ofthetilesonthediagonalandtheupdatesfromthezeroedtilesvia TTQRT .The combinationof TTQRT and TSQRT ,equation L 3 ,isexactlythetotalnumberof tilesthatarezeroed,namelyeverytilebelowthediagonal.Hence,thetotalnumber ofupdates,providedby L 4 ,isthenumberoftilesbelowthediagonalbeyondtherst columnminusthesumoftheprogressionthroughthecolumns.Nowwedene L 5 =4 L 1 +6 L 2 +6 L 3 +12 L 4 then L 5 =4 GEQRT +6 TSQRT +2 TTQRT +6 UNMQR +6 TTMQR +12 TSMQR: Ascanbenotedin L 5 ,thecoecientsofeachtermcorrespondpreciselytotheweight ofthekernelsasderivedfromthenumberofopseachkernelincurs.Simplifying L 5 27

PAGE 41

wehaveourresult L 5 =6 pq 2 )]TJ/F15 11.9552 Tf 11.955 0 Td [(2 q 3 : AcriticalresultofTheorem3.1isthatnomatterwhateliminationlistisused,the totalweightofthetasksforperformingatiledQRfactorizationalgorithmisconstant andbyusingourunittaskweightof n 3 b = 3,with m = pn b ,and n = qn b ,weobtain 2 mn 2 )]TJ/F15 11.9552 Tf 12.317 0 Td [(2 = 3 n 3 opswhichistheexactsamenumberasforastandardHouseholder reectionalgorithmasfoundinLAPACKe.g.,[14]. 3.1.3Executionschemes Inessence,theexecutionofagenerictiledalgorithmisfullydeterminedbyits eliminationlist.Thislistisstaticallygivenasinputtothescheduler,andtheexecutionprogressesdynamically,withtheschedulerexecutingallrequiredtransformationsassoonaspossible.Moreprecisely,eachtransformationinvolvesseveralkernels, whoseexecutionstartsassoonastheyareready,i.e.,assoonasalldependencieshave beenenforced.Recallthatatile i;k canbezeroedoutonlyafteralltiles i;k 0 with k 0 k ,mustbeupdated k )]TJ/F15 11.9552 Tf 9.628 0 Td [(1times, inordertozerooutthe k )]TJ/F15 11.9552 Tf 11.976 0 Td [(1tilestoitsleftofindex i;k 0 k 0 k .These updatesareexecutedassoonastheyarereadyforexecution. Theelimination elim i;piv i;k ;k isperformedassoonaspossiblewhenboth rows i and piv i;k areready.Justafterthecompletionof GEQRT i;k and 28

PAGE 42

GEQRT piv i;k ;k ,kernel TTQRT i;piv i;k ;k islaunched.Whennished, ittriggerstheupdates TTMQR i;piv i;k ;k;j forall j>k Obviously,thedegreeofparallelismthatcanbeachieveddependsupontheeliminationsthatarechosen.Forinstance,ifalleliminationsinagivencolumnusethe samefactortile,theywillbesequentialized.Thiscorrespondstotheattreeeliminationschemedescribedbelow:ineachcolumn k ,ituses elim i;k;k forall i>k Onthecontrary,twoeliminations elim i;piv i;k ;k and elim i 0 ;piv i 0 ;k ;k inthe samecolumncanbefullyparallelizedprovidedthattheyinvolvefourdierentrows. Finally,notethatseveraleliminationscanbeinitiatedindierentcolumnssimultaneously,providedthattheyinvolvedierentpairsofrows,andthatalltheserowsare readyi.e.,theyhavethedesirednumberofleftmostzeros. Thefollowinglemmawillproveveryuseful;itstatesthatwecanassumew.l.o.g. thateachtileiszeroedoutbyatileaboveit,closertothediagonal. Lemma3.2 Anygenerictiledalgorithmcanbemodied,withoutchangingitsexecutiontime,sothatalleliminationselim i;piv i;k ;k satisfy i>piv i;k Proof: Denea reverse eliminationasanelimination elim i;piv i;k ;k where i
PAGE 43

isstrictlysmallerthan i 0 ,andwerepeattheprocedureuntiltheredoesnotremainany reverseeliminationincolumn k 0 .Weproceedinductivelytothefollowingcolumns, untilallreverseeliminationshavebeensuppressed. 3.2Criticalpaths Inthissectionwedescribeseveralgenerictiledalgorithms,andweprovidetheir criticalpaths,aswellasoptimalityresults.Thesealgorithmsareinspiredbyalgorithmsthathavebeenintroducedtwentytothirtyyearsago[45,36,25,24],albeitfor amuchsimpler, coarse-grain model.Inthisold"model,thetime-unitisthetime neededtoexecuteanorthogonaltransformationacrosstwomatrixrows,regardless ofthepositionofthezerotobecreated,henceregardlessofthelengthoftheserows. Althoughthegranularityismuchcoarserinthismodel,anyexistingalgorithmfor theoldmodelcanbetransformedintoagenerictiledalgorithm,justbyenforcing theverysameeliminationlistprovidedbythealgorithm.Criticalpathsareobtained usingadiscreteeventbasedsimulatorspeciallydevelopedtothisend,basedonthe Simgridframework[47].Itcarefullyhandlesdependenciesacrosstiles,andallowsfor theanalysisofbothstaticanddynamicalgorithms. 1 3.2.1Coarse-grainalgorithms Westartwithashortdescriptionofthreealgorithmsforthecoarse-grainmodel. ThesealgorithmsareillustratedinTable3.2fora15 6matrix. 3.2.1.1 Sameh-Kuck algorithm The Sameh-Kuck algorithm[45]usesthepanelrowforalleliminationsineach column,startingfrombelowthediagonalandproceedingdownwards.Time-steps indicatethetime-unitatwhichtheeliminationcanbedone,assumingunbounded resources.Formally,theeliminationlistis f elim i;k;k ;i = k +1 ;k +2 ;:::;p ;k =1 ; 2 ;:::; min p;q g 1 Thediscreteeventbasedsimulator,togetherwiththecodeforalltiledalgorithms,ispublicly availableat http://graal.ens-lyon.fr/ ~ mjacquel/tiledQR.html 30

PAGE 44

Thisalgorithmisalsoreferredas FlatTree 3.2.1.2 Fibonacci algorithm The Fibonacci algorithmistheFibonaccischemeoforder1in[36].Let coarse i;k bethetime-stepatwhichtile i;k i>k ,iszeroedout.Thesevaluesarecomputedasfollows.Intherstcolumn,thereareone5,two4's,three3's, four2'sandfour1'swewouldhavehadve1'swith p =16.Given x astheleast integersuchthat x x +1 = 2 p )]TJ/F15 11.9552 Tf 11.796 0 Td [(1,wehave coarse i; 1= x )]TJ/F20 11.9552 Tf 11.796 0 Td [(y +1where y isthe leastintegersuchthat i y y +1 = 2+1.Lettherowindicesofthe z tilesthatare zeroedoutatstep s ,1 s x ,rangefrom i to i + z )]TJ/F15 11.9552 Tf 10.282 0 Td [(1.Theeliminationlistforthese tilesis elim i + j;piv i + j; 1 ; 1,with piv i + j = i + j )]TJ/F20 11.9552 Tf 10.703 0 Td [(z for0 j z )]TJ/F15 11.9552 Tf 10.703 0 Td [(1.Inother words,toeliminateabunchof z consecutivetilesatthesametime-step,thealgorithm usesthe z rowsabovethem,pairingtheminthenaturalorder.Nowtheelimination schemeofthenextcolumnisthesameasthatofthepreviouscolumn,shifteddown byonerow,andaddingtwotime-units: coarse i;k = coarse i )]TJ/F15 11.9552 Tf 11.96 0 Td [(1 ;k )]TJ/F15 11.9552 Tf 11.96 0 Td [(1+2,while thepairingobeysthesamerule. 3.2.1.3 Greedy algorithm Ateachstep,the Greedy algorithm[24,25]eliminatesasmanytilesaspossible ineachcolumn,startingwithbottomrows.Thepairingfortheeliminationsisdone exactlyasfor Fibonacci .Thereisnoclosed-formformulatocompute coarse i;k thetime-stepatwhichtile i;k iseliminated,butitispossibletoproviderecursive expressionssee[24,25]. Considerarectangular p q matrix,with p>q .Withthecoarse-grainmodel, thecriticalpathof Sameh-Kuck is p + q )]TJ/F15 11.9552 Tf 11.71 0 Td [(2,andthatof Fibonacci is x +2 q )]TJ/F15 11.9552 Tf 11.709 0 Td [(2, where x istheleastintegersuchthat x x +1 = 2 p )]TJ/F15 11.9552 Tf 13.068 0 Td [(1.Thecriticalpathof Greedy isunknown,butthecriticalpathof Greedy isoptimal.Forsquare q q matrices,criticalpathsareslightlydierent q )]TJ/F15 11.9552 Tf 11.367 0 Td [(3for Sameh-Kuck x +2 q )]TJ/F15 11.9552 Tf 11.368 0 Td [(4for Fibonacci 31

PAGE 45

Table3.2:Time-stepsforcoarse-grainalgorithms. a Sameh-Kuck b Fibonacci c Greedy ? ? ? 1 ? 5 ? 4 ? 23 ? 47 ? 36 ? 345 ? 469 ? 358 ? 4567 ? 36811 ? 25710 ? 56789 ? 3581013 ? 247912 ? 67891011 357101215 24691114 789101112 25791214 24681013 8910111213 24791114 13581012 91011121314 24691113 1357911 101112131415 24681113 1357911 111213141516 14681013 1346810 121314151617 13681012 1246810 131415161718 13581012 124579 141516171819 13571012 123568 3.2.2Tiledalgorithms Asstatedabove,eachcoarse-grainalgorithmcanbetransformedintoatiled algorithm,simplybykeepingthesameeliminationlist,andtriggeringtheexecution ofeachkernelassoonaspossible.However,becausetheweightsofthefactorand updatekernelsarenotthesame,itismuchmorediculttocomputethecritical pathsofthetransformedtiledalgorithms.Table3.3isthecounterpartofTable3.2, anddepictsthetime-stepsatwhichtilesareactuallyzeroedout.Notethatthetiled versionof Sameh-Kuck isindeedthe FlatTree algorithminPLASMA[18,19],and wehaverenameditaccordingly.Asanexample,Algorithm3.4showsthe Greedy algorithmforthetiledmodel. Arstandquiteunexpectedresultisthat Greedy isnolongeroptimal,as showninthersttwocolumnsofTable3.3afora15 2matrix.Ineachcolumnand ateachstep, the Asap algorithm "startstheeliminationofatileassoonasthereare atleasttworowsreadyforthetransformation.When s 2eliminationscanstart simultaneously, Asap pairsthe2 s rowsjustas Fibonacci and Greedy ,therstrow closesttothediagonalwithrow s +1,thesecondrowwithrow s +2,andsoon.Asa 32

PAGE 46

matterofafact,whenprocessingthesecondcolumn,both Asap and Greedy begin withtheeliminationoflines10to15attimestep20.However,oncetiles ; 2, ; 2and ; 2arezeroedouti.e.attimestep22, Asap eliminates4zeros,in rows9through12.Onthecontrary, Greedy waitsuntiltimestep26toeliminate6 zerosinrows6through12.Inasense, Asap isthecounterpartof Greedy atthe tilelevel.However, Asap isnotoptimaleither,asshowninTable3.3afora15 3 matrix.Onlargerexamples,thecriticalpathof Greedy isbetterthanthatof Asap asshowninTable3.3b. Wecanhoweverusetheoptimalityofthecoarse-grain Greedy todevisean optimaltiledalgorithm.Letusdenethefollowingalgorithm: Denition3.3 Givenamatrixof p q tiles,with p>q ,the GrASAP i algorithm 1.usesAlgorithm3.3toexecute Greedy ontherst q )]TJ/F20 11.9552 Tf 11.085 0 Td [(i columnsandpropagate theupdatesthroughcolumn q 2.andforcolumns q )]TJ/F20 11.9552 Tf 11.956 0 Td [(i +1 through q ,applythe Asap algorithm. Clearly,ifwelet i = q weobtainthe Asap algorithm.Wedene GrASAP tobe GrASAP ,i.e.,onlytheeliminationofthelastcolumnwilldierfrom Greedy andwewillshowthat GrASAP isanoptimaltiledalgorithm. Althoughwecannotprovideaneliminationlistfortheentiretiledmatrixof size p q ,wedoprovideaneliminationlistfortherst q )]TJ/F15 11.9552 Tf 12.411 0 Td [(1columns.Thistiled eliminationlistdescribesthetime-stepsatwhichAlgorithm3.3iscomplete,i.e.,allof thefactorizationkernelsarecompletefor k q )]TJ/F15 11.9552 Tf 10.954 0 Td [(1andcorrespondingupdatekernels arecompleteforallcolumns k
PAGE 47

Lemma3.4 Givenaneliminationlistfromanycoarse-grainalgorithm,let s = coarse i;k bethetimestepatwhichelement i;k iseliminatedandlet I s;k = f i j s = coarse I s;k ;k g : Thenforany s wehave s )]TJ/F15 11.9552 Tf 11.955 0 Td [(1= coarse I s;k ;k )]TJ/F15 11.9552 Tf 11.956 0 Td [(1 max 0 B @ coarse I s;k ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 coarse piv I s;k ;k ;k )]TJ/F15 11.9552 Tf 11.956 0 Td [(1 1 C A andinparticular s 1 )]TJ/F15 11.9552 Tf 11.955 0 Td [(1=max 0 B @ coarse I s 1 ;k ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 coarse piv I s 1 ;k ;k ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 1 C A where s 1 =min k +1 i p coarse i;k Proof: Thisfollowsdirectlyfromthedependenciesgivenin3.1a-3.1c. Theorem3.5 Giventheeliminationlistofacoarse-grainalgorithmforamatrixof size p q ,usingAlgorithm3.3,thetiledeliminationlistforallbutthelastcolumnis givenby tiled i;k =10 k +6 coarse i;k ; 1 i p; 1 kk: Notethat j representsthecolumninwhichtheupdatesareappliedandallcolumns j for j>k havethesameupdatehistory.InAlgorithm3.3,thetwo j -loopsspawn 34

PAGE 48

mutuallyindependenttasks.Sincewehaveanunboundednumberofprocessors,these taskscanallrunsimultaneously.So j representsanyoneofthesecolumns. Wewillproceedbyinductionon k .Fortherstcolumn, k =1,wedonothave anydependencieswhichconcernthe GEQRT operations.Thusfrom3.2bwehave for1 i p GEQRT i; 1=4.4 UNMQR i; 1 ;j =4+6=10.5 Sinceeachcolumninthecoarse-graineliminationlistiscomposedofoneormore timesteps,wemustalsoproceedwithinductiononthetimesteps.Let s 1 =min 2 i p coarse i; 1 : .6 Inthecase k =1,wehavethat s 1 =1 : Inotherwords,thersttasksnishattimestep1forthecoarse-grainalgorithm. Thisisacomplicatedmannerinwhichtostatethat s 1 =1,butitwillbeneededin thegeneralsetting. Sofor s 1 ,from3.2cand3.2dwehave TTQRT I s 1 ;piv I s 1 ; 1 ; 1=max 0 B @ GEQRT piv I s 1 ; 1 ; 1 GEQRT I s 1 ; 1 1 C A +2 =4+2 thus TTQRT I s 1 ;piv I s 1 ; 1 ; 1=4+2 s 1 : .7 35

PAGE 49

Nowfrom3.2e,3.2f,and3.2gwehave TTMQR I s 1 ;piv I s 1 ; 1 ; 1 ;j =max 0 B B B B @ TTQRT I s 1 ;piv I s 1 ; 1 ; 1 UNMQR piv I s 1 ; 1 ; 1 ;j UNMQR I s 1 ; 1 ;j 1 C C C C A +6 =10 1+6 s 1 =10 1+6 coarse I s 1 ; 1 Therefore, TTMQR I s 1 ;piv I s 1 ; 1 ; 1 ;j = tiled I s 1 ; 1 : .8 Assumethatfor1 t s )]TJ/F15 11.9552 Tf 11.955 0 Td [(1wehave TTQRT I t ;piv I t ; 1 ; 1=4+2 t .9 TTMQR I t ;piv I t ; 1 ; 1 ;j =10+6 t .10 thenfrom3.2c,3.2d,and3.3awehave TTQRT I s ;piv I s ; 1 ; 1=max 0 B B B B @ GEQRT piv I s ; 1 ; 1 GEQRT I s ; 1 TTQRT I s )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 ;piv I s )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 ; 1 ; 1 1 C C C C A +2 =max 0 B B B B @ 4 4 4+2 s )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 1 C C C C A +2 Thus TTQRT I s ;piv I s ; 1 ; 1=4+2 s: .11 36

PAGE 50

From3.2e,3.2f,3.2g,and3.3bwehave TTMQR I s ;piv I s ; 1 ; 1 ;j =max 0 B B B B B B B @ TTQRT I s ;piv I s ; 1 ; 1 UNMQR piv I s ; 1 ; 1 ;j UNMQR I s ; 1 ;j TTMQR I s )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 ;piv I s )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 ; 1 ; 1 ;j 1 C C C C C C C A +6 =max 0 B B B B B B B @ 4+2 s 10 10 10+6 s )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 1 C C C C C C C A +6 =10+6 s )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+6 =10+6 s =10 1+6 coarse I s ; 1 Thus TTMQR I s ;piv I s ; 1 ; 1 ;j = tiled I s ; 1.12 establishingourbasecasefortheinductionon k Nowassumethatfor1 h k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1wehave,forany s incolumn h TTMQR I s ;piv I s ;h ;h;j = tiled I s ;h : Inordertostarttheeliminationofthenextcolumn,wemusthavethatallupdates fromtheeliminationofthepreviouscolumnarecomplete.Thususingtheinduction assumption,wehave GEQRT i;k = TTMQR i;piv i;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 ;k )]TJ/F15 11.9552 Tf 11.956 0 Td [(1 ;k +4 sothat GEQRT i;k =10 k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+6 coarse i;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+4.13 37

PAGE 51

and UNMQR i;k;j =max 0 B @ GEQRT i;k TTMQR i;piv i;k )]TJ/F15 11.9552 Tf 11.956 0 Td [(1 ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 ;j 1 C A +6 sothat UNMQR i;k;j =10 k +6 coarse i;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 : .14 Again,wemustproceedwithaninductiononthetimestepsincolumn k .Let s 1 =min k +1 i p coarse i;k : .15 From3.2cand3.2dwehave TTQRT I s 1 ;piv I s 1 ;k ;k =max 0 B @ GEQRT piv I s 1 ;k ;k GEQRT I s 1 ;k 1 C A +2 =max 0 B @ 10 k )]TJ/F15 11.9552 Tf 11.956 0 Td [(1+6 coarse piv I s 1 ;k ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+4 10 k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+6 coarse I s 1 ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+4 1 C A +2 =10 k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+6max 0 B @ coarse piv I s 1 ;k ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 coarse I s 1 ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 1 C A +4+2 FromtheapplicationofLemma3.4,wehave coarse I s 1 ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1=max 0 B @ coarse I s 1 ;k ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 coarse piv I s 1 ;k ;k ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 1 C A suchthat TTQRT I s 1 ;piv I s 1 ;k ;k =10 k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+6[ coarse I s 1 ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1]+4+2 : Therefore, TTQRT I s 1 ;piv I s 1 ;k ;k =10 k )]TJ/F15 11.9552 Tf 11.956 0 Td [(1+6 s 1 : .16 38

PAGE 52

Fortheupdates,wemustagainexaminethethreedependencieswhichresultfrom 3.2e,3.2f,and3.2gsuchthatwehave TTMQR I s 1 ;piv I s 1 ;k ;k;j =max 0 B B B B @ UNMQR piv I s 1 ;k ;k;j UNMQR I s 1 ;k;j TTQRT I s 1 ;piv I s 1 ;k ;k 1 C C C C A +6 =max 0 B B B B @ 10 k +6 coarse I s 1 ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 10 k +6 coarse piv I s 1 ;k ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 10 k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+6 s 1 1 C C C C A +6 UsingLemma3.4,wehave TTMQR I s 1 ;piv I s 1 ;k ;k;j =max 0 B @ 10 k +6 s 1 )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 10 k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+6 s 1 1 C A +6 =10 k +max 0 B @ 6 s 1 )]TJ/F15 11.9552 Tf 11.955 0 Td [(6 6 s 1 )]TJ/F15 11.9552 Tf 11.955 0 Td [(10 1 C A +6 =10 k +6 s 1 Therefore TTMQR I s 1 ;piv I s 1 ;k ;k;j = tiled I s 1 ;k : .17 Nowassumethatfor s 1 t s )]TJ/F15 11.9552 Tf 11.956 0 Td [(1wehave TTQRT I t ;piv I t ;k ;k 10 k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+6 t .18 TTMQR I t ;piv I t ;k ;k;j =10 k +6 t: .19 39

PAGE 53

andnotethatwedonothaveequalityfor s>s 1 viaLemma3.4.From3.2c,3.2d, and3.3awehave TTQRT I s ;piv I s ;k ;k =max 0 B B B B @ GEQRT piv I s ;k ;k GEQRT I s ;k TTQRT I s )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 ;piv I s )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 ;k ;k 1 C C C C A +2 max 0 B B B B @ 10 k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+6 coarse piv I s ;k ;k )]TJ/F15 11.9552 Tf 11.956 0 Td [(1+4 10 k )]TJ/F15 11.9552 Tf 11.956 0 Td [(1+6 coarse I s ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+4 10 k )]TJ/F15 11.9552 Tf 11.956 0 Td [(1+6 s )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 1 C C C C A +2 NotethatfromLemma3.4 s )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 max 0 B @ coarse piv I s ;k ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 coarse I s ;k )]TJ/F15 11.9552 Tf 11.956 0 Td [(1 1 C A suchthat TTQRT I s ;piv I s ;k ;k 10 k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+6 s )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+4+2 : Thus TTQRT I s ;piv I s ;k ;k 10 k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+6 s: .20 Fortheupdates,wemustexaminethefourdependencieswhichresultfrom3.2e, 3.2f,3.2g,and3.3bsuchthatwehave TTMQR I s ;piv I s ;k ;k;j =max 0 B B B B B B B @ UNMQR piv I s ;k ;k;j UNMQR I s ;k;j TTQRT I s ;piv I s ;k ;k TTMQR I s )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 ;piv I s )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 ;k ;k;j 1 C C C C C C C A +6 =max 0 B B B B B B B @ 10 k +6 coarse I s ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 10 k +6 coarse piv I s ;k ;k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 10 k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+6 s 10 k +6 s )]TJ/F15 11.9552 Tf 11.956 0 Td [(1 1 C C C C C C C A +6 : 40

PAGE 54

Asbefore,Lemma3.4allowsustowrite TTMQR I s ;piv I s ;k ;k;j =max 0 B @ 10 k +6 s )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 10 k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+6 s 1 C A +6 =10 k +max 0 B @ 6 s )]TJ/F15 11.9552 Tf 11.955 0 Td [(6 6 s )]TJ/F15 11.9552 Tf 11.955 0 Td [(10 1 C A +6 =10 k +6 s Therefore TTMQR I s ;piv I s ;k ;k;j = tiled i;k : .21 Corollary3.6 Givenaneliminationlistforacoarse-grainalgorithmonamatrixof size p q where p>q ,thecriticalpathlengthofthecorrespondingtiledalgorithmis boundedby tiled p;q )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+4+2 CP p;q q atleastone TTQRT will bepresentwhichaccountsforthetwotimestepstherebyestablishingthelowerbound. Byincludingonemorecolumn,theupperboundnotonlyincludesthefactorization ofcolumn q ,butalsotherespectiveupdatesontocolumn q +1suchthatthecritical pathofthe p q tiledmatrixmustbesmaller. Corollary3.7 Givenaneliminationlistforacoarse-grainalgorithmonamatrixof size p q where p = q ,thecriticalpathlengthofthecorrespondingtiledalgorithmis CP p;q = tiled p;q )]TJ/F15 11.9552 Tf 11.956 0 Td [(1+4 : Proof: Inthelastcolumn,weneedonlytofactorizethediagonaltilewhichexplains theadditionalfourtimesteps.Moreover,therearenofurthercolumnstoapplyany 41

PAGE 55

updatestonoranytilesbelowthediagonalthatneedtobeeliminated.Thusthe resultisobtained. Intheremainderofthischapter,wewillmakeuseofdiagramstoclarifycertain aspectsoftheproofsandprovideexamplestofurtherillustratethepointsbeingmade. ThesediagramsmakeuseofthekernelrepresentationsasshowninFigure3.1. kernelweight GEQRT 4 TTQRT 2 kernelweight UNMQR 6 TTMQR 6 Figure3.1:Iconrepresentationsofthekernels Wehaveaclosed-formexpressionforthecriticalpathoftiled FlatTree forall threecases:singletiledcolumn,squaretiledmatrix,andrectangulartiledmatrixof morethanonecolumn. Proposition3.8 Consideratiledmatrixofsize p q ,where p q 1 .Thecritical pathlengthof FlatTree is CP ft p;q = 8 > > > > < > > > > : 2 p +2 ; if q =1 ; 22 p )]TJ/F15 11.9552 Tf 11.956 0 Td [(24 ; if p = q> 1 ; 6 p +16 q )]TJ/F15 11.9552 Tf 11.955 0 Td [(22 ; if p>q> 1 Proof: Considerrstthecase q =1.Weshallproceedbyinductionon p to showthatthecriticalpathof FlatTree isoflength2 p +2.If p =1,thenfrom Table3.1theresultisobtainedsinceonly GEQRT ; 1isrequired.Withthebase caseestablished,nowassumethatthisholdsforall p )]TJ/F15 11.9552 Tf 12.615 0 Td [(1 >q =1.Thusattime t =2 p )]TJ/F15 11.9552 Tf 11.389 0 Td [(1+2=2 p ,wehavethatforall p )]TJ/F15 11.9552 Tf 11.389 0 Td [(1 i 1tile i; 1hasbeenfactorized intoatriangleandforall p )]TJ/F15 11.9552 Tf 11.027 0 Td [(1 i> 1,tile i; 1hasbeenzeroedout.Therefore,tile p; 1willbezeroedoutwith TTQRT p; 1attime t +2=2 p )]TJ/F15 11.9552 Tf 11.429 0 Td [(1+2+2=2 p +2. 42

PAGE 56

Consideringthesecondcase p = q> 1,wewillbeusingFigure3.2toillustrate. Weinitializewithatriangularizationoftherstcolumnandsendtheupdatetothe remainingcolumns,10timeunits.Thewellthepipelinewiththeupdatesonto theremainingcolumnsfromthezeroingoperationsoftherstcolumn,6 p )]TJ/F15 11.9552 Tf 10.364 0 Td [(1time units.Thenforeachcolumnaftertherst,exceptthelastone,wellthepipeline withthetriangularization,updateoftriangularization,andupdateofzeroingforthe bottommosttile,+6+6 p )]TJ/F15 11.9552 Tf 9.643 0 Td [(2timeunits.Inthelastcolumn,wethentriangularize thebottommosttile,4timeunits.Thus 10+6 p )]TJ/F15 11.9552 Tf 11.956 0 Td [(1++6+6 q )]TJ/F15 11.9552 Tf 11.955 0 Td [(2+4=6 p +16 q )]TJ/F15 11.9552 Tf 11.955 0 Td [(24=22 p )]TJ/F15 11.9552 Tf 11.955 0 Td [(24 Thethirdcaseisanalogoustothesecondcasebutwestillneedtozerooutthe bottommosttileinthelastcolumnwhichexplainsthedierenceof2intheformula fromthesquarecase. Figure3.2:CriticalPathlengthfortheweighted FlatTree onamatrixof4 4 tiles. Weremindthatforthecoarsealgorithm, coarse p;q = 8 > > > > > > > < > > > > > > > : 0 ; if q< 1; 0 ; if p = q =1; p + q )]TJ/F15 11.9552 Tf 11.955 0 Td [(2 ; if p>q 1; 2 p )]TJ/F15 11.9552 Tf 11.955 0 Td [(3 ; if p = q> 1. Sowendthatconsideringatiledmatrixofsize p q ,where p q 1.Thecritical pathlengthof FlatTree isgivenas CP p;q =10 q )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+6 coarse p;q )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+4+2 coarse p;q )]TJ/F20 11.9552 Tf 11.955 0 Td [(coarse p;q )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 : 43

PAGE 57

Proposition3.9 Thecriticalpathlengthof Fibonacci isgreaterthan 22 q )]TJ/F15 11.9552 Tf 10.755 0 Td [(30 and lessthan 22 q +6 d p 2 p e : Proof: Thecriticalpathlengthofthecoarse-grain Fibonacci algorithmfora p q matrixis coarse p;q = x +2 q )]TJ/F15 11.9552 Tf 11.956 0 Td [(2 : ThusfromProposition3.6wehave 10 q )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+6 x +2 q )]TJ/F15 11.9552 Tf 11.956 0 Td [(1 )]TJ/F15 11.9552 Tf 11.955 0 Td [(2+4 CP p;q < 10 q +6 x +2 q )]TJ/F15 11.9552 Tf 11.955 0 Td [(2 : Recallthat x istheleastintegersuchthat x x +1 2 p )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 whereby x = )]TJ/F15 11.9552 Tf 10.494 8.088 Td [(1 2 + p 8 p )]TJ/F15 11.9552 Tf 11.955 0 Td [(7 2 : Thus x p 2 p andtherefore 22 q )]TJ/F15 11.9552 Tf 11.955 0 Td [(30 a i ; 1 i q )]TJ/F15 11.9552 Tf 11.955 0 Td [(1; n i > 0 ; 1 i q ; n 1 + + n q = n: 44

PAGE 58

Wedeneonthesetofcolumnsoflength n theclassicalpartialorderingof R n : x y x i y i ;i i n andthe s -truncate s n of a isacolumnoflength s composedofthe s rst elementsof a andisdenoted a s Denition3.10 Givenataskweightof w andcolumn a = a n 1 1 a n q q ,thecolumn c = c m 1 1 c m p p iscalledaniterateof a ,or c = iter a ,if i c isacolumnoflength n )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 ii a 1 + w c 1 aif a 1 + w c 1 a 2 then m 1 b n 1 = 2 c bifthereexistsan h suchthat a k )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 + w c h a k then m h b n 1 + + n k )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 )]TJ/F20 11.9552 Tf 11.956 0 Td [(m 1 )-222()]TJ/F20 11.9552 Tf 40.514 0 Td [(m h )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 = 2 c for 2 k q and 1 h p with m 0 =0 celse a j c h and m h b n 1 + + n j )]TJ/F20 11.9552 Tf 11.955 0 Td [(m 1 )-222()]TJ/F20 11.9552 Tf 40.514 0 Td [(m h )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 = 2 c where j =min p +1 ;q Denition3.11 Givenataskweightof w ,let a = a n 1 1 a n q q beacolumniterateof length n thenthesequence b = b m 1 1 b m p p ,or b = opiter a ,isdenedas ifor b 1 and m 1 aif n 1 =1 ,then b 1 = a 2 + w and m 1 = b n 1 + n 2 = 2 c bif n 1 > 1 ,then b 1 = a 1 + w and m 1 = b n 1 = 2 c 45

PAGE 59

iiifthereexists k suchthat a k )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 + w b i )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 a k ,then r i )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 = n 1 + + n k )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 )]TJ/F20 11.9552 Tf 11.955 0 Td [(m 1 )-222()]TJ/F20 11.9552 Tf 40.515 0 Td [(m i )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 1 : aif b i )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 1 ,then b i = b i )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 + w ,and m i = b r i )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 = 2 c belse b i = a k + w m i = b n k + r i )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 = 2 c iiiif b i )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 >a j where j =min i;q ,then b i = b i )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 + w ,and m i b n 1 + + n j )]TJ/F20 11.9552 Tf 11.956 0 Td [(m 1 )-222()]TJ/F20 11.9552 Tf 40.514 0 Td [(m i )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 = 2 c : Proposition3.12 Givenaniteratedcolumn a oflength n ,thesequence b = b m 1 1 b m p p ; or b = optiter a isaniterateof a Proof: Theprooffollowsdirectlyfromthedenition. Proposition3.13 iLet a n beacolumnoflength n and c n )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 = iter a n aniteratedcolumnof a n Then b n )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 = optiter a n iter a n = c n )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 : iiLet a n and c n betwocolumnsoflength n suchthat a n c n .Then optiter a n optiter c n : Proof: iClearly, b 1 c 1 bydenitionsince b 1 ischosentobeassmallaspossible. Moreover,bydenition b i c j for i j sinceiif b i )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 1, meaningthereareenoughelementsavailabletoperformapairing,then b i is 46

PAGE 60

againchosenassmallaspossible,iiotherwise b i = a k + w whichisthesmallest again,iiielse b i )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 >a j and b i ischosenasthenextsmallestelement.Thus b m 1 + + m i n )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 c m 1 + + m i n )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 ; 1 i p sothat b n )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 c n )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 iiThisisanotherdirectapplicationofthedenitionandfollowsalongthesame argument. Clearly,letting w =1givesthedenitionsof iter and optiter ofthecoarse-grain algorithmsaspresentedin[25].Denition3.11isthe Asap algorithmonasingletiled columnandcanbeviewedasthecounterpartofthecoarse-grain Greedy algorithm inthetiledcaseandfollowsabottomtotopeliminationofthetiles.Inorderto preservethebottomtotopelimination,theweightoftheupdatesmustbeaninteger multipleoftheiteratedcolumnweight. Theorem3.14 Givenamatrixof p q tiles,afactorizationkernelweightof ,an eliminationkernelweightof ,andanupdatekernelweightof = n forsome n 2 N ,the GrASAP algorithmisoptimalinthecontextoftheclassofalgorithms thatprogresslefttoright,bottomtotop. Proof: FromTheorem3.5wehaveadirecttranslationfromanycoarse-grainalgorithminthisclasstothetiledalgorithmfortherst q )]TJ/F15 11.9552 Tf 12.126 0 Td [(1columns.Thusweare giventhetimestepsatwhichrowsincolumn q areavailableforelimination.Now wexthetimestepsfortheeliminationofthelastcolumnandfollowwhatevertree thealgorithmprovidesforthislastcolumn.Wecanreplacetheeliminationofthe rst q )]TJ/F15 11.9552 Tf 11.873 0 Td [(1columnsandupdatesfromtheseeliminationsontotheremainingcolumns withthetiled Greedy algorithm.Thisispossiblesincethetranslationfunctionis monotonicallyincreasingandweknowthat Greedy isoptimalforthecoarse-grain 47

PAGE 61

algorithmsandthereforeoptimalfortherst q )]TJ/F15 11.9552 Tf 10.441 0 Td [(1columnsinthetiledalgorithms.In anothermannerofspeaking,weslowdowntheeliminationsandupdatesontherst q )]TJ/F15 11.9552 Tf 11.429 0 Td [(1columnswhennotusingthetiled Greedy algorithm.Anillustrativeexample isshowninFigure3.3b. Let c bethenexttolastcolumnofthecoarse-graineliminationtablewhichisof length p )]TJ/F15 11.9552 Tf 11.955 0 Td [( q )]TJ/F15 11.9552 Tf 11.956 0 Td [(1+1.Nowletting a = + q )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+ coarse p;q )]TJ/F15 11.9552 Tf 11.955 0 Td [(1+ providesaniteratedcolumnoflength p )]TJ/F15 11.9552 Tf 12.238 0 Td [( q )]TJ/F15 11.9552 Tf 12.238 0 Td [(1+1forthetiledalgorithm.With w = wehavethat b = optiter a isanoptimaliteratedcolumnoflength p )]TJ/F20 11.9552 Tf 11.979 0 Td [(q +1 withtheeliminationprogressingfrombottomtotop.Thiscanbeappliedtoanytiled algorithminthisclasssinceweonlyconcernourselveswiththetimestepsatwhichthe lastcolumn'selementsareavailableforelimination.Inotherwords,thisisaspeeding upoftheeliminationofthelastcolumnwhileadheringtoanyrestrictionsincurred fromthepreviouscolumns.AnillustrativeexampleisshowninFigure3.3c. Combiningthesetwoideas,let Greedy beperformedontherst q )]TJ/F15 11.9552 Tf 11.582 0 Td [(1columns andthen Asap ontheremainingcolumn q .Thisprovidesanoptimalalgorithmin thisclassofalgorithms. InFigure3.4weprovideanillustratedexampleofthe Greedy and GrASAP algorithmsonamatrixof15 2tileswheretheoperationsaregivenbyFigure3.1. Itcanbeseenthat GrASAP nishesbefore Greedy since Greedy mustwait andprogresswiththesameeliminationschemeasthecoarse-grainalgorithmwhile GrASAP canbegineliminatinginthelastcolumnassoonasapairoftilesbecomes available.Theeliminationoftherstcolumnisshowninlightgray. Wehaveanalyzedthecriticalpathlengthof GrASAP versusthatof Greedy fortiledmatrices p q where1 p 100and1 q p seeFigure3.5.Inall caseswherethereisadierencewhichisjustover44%ofthecases,thedierence isalwaystwotimesteps. 48

PAGE 62

a Fibonacci b Greedy onrst q )]TJ/F8 9.9626 Tf 9.962 0 Td [(1columns c Asap oncolumn q Figure3.3:IllustrationofrstandsecondpartsoftheproofofTheorem3.14using the Fibonacci algorithmonamatrixof15 2tiles. Wenowshowthatwithouthavingtheupdatekernelweightanintegermultiple oftheeliminationkernelweight,thebottomtotopprogressionisnulliedandwe cannotprovideoptimalityofthealgorithm. Assumethattheupdatekernelweightis3andtheeliminationkernelweightis 2.Let a 11 =3 7 6 4 becolumnfromsomeeliminationscheme.Weshallapplythree iteratedschemestothiscolumn:an Asap schemethatprogressesfrombottomto top,an Asap schemethatcanprogressinanymanner,andan Asap scheme whichmayprovidealag. InTable3.5weclearlyseethateliminationschemeprovidesthebesttimefor thealgorithm.Thereasonisthatenoughofalagwasprovidedsuchthatabinomial treecouldprogresswithouthindrance.Thereforewithoutintegermultipleweightson theupdatekernel,wecannotknowwhichschemewillbeoptimal. 49

PAGE 63

a Greedy b GrASAP Figure3.4: Greedy versus GrASAP onmatrixof15 2tiles. Figure3.5:Tiledmatricesof p q wherethecriticalpathlengthof GrASAP is shorterthanthatof Greedy for1 p 100and1 q p ThePLASMAlibraryprovidesmorealgorithms,thatcanbeinformallydescribed astrade-osbetween FlatTree and BinaryTree .Weremindthat FlatTree isthesameasalgorithmas Sameh-Kuck .Thesealgorithmsarereferredtoas PlasmaTree inallthefollowing,anddierbythevalueofaninputparameter calledthe domainsizeBS .Thisdomainsizecanbeanyvaluebetween1and p inclusive.Withinadomain,thatincludes BS consecutiverows,thealgorithmworks justas FlatTree :therstrowofeachdomainactsasalocalpanelandisusedto zerooutthetilesinalltheotherrowsofthedomain.Thenthedomainsaremerged: 50

PAGE 64

thepanelrowsarezeroedoutbyabinarytreereduction,justasin BinaryTree Asthealgorithmprogressesthroughthecolumns,thedomainontheverybottomis reducedaccordingly,untilsuchtimethatthereisonelessdomain.Forthecasethat BS =1, PlasmaTree followsabinarytreeontheentirecolumn,andfor BS = p thealgorithmexecutesaattreeontheentirecolumn.Itseemsverydicultfora usertoselectthedomainsize BS leadingtobestperformance,butitisknownthat BS shouldincreaseas q increases.Table3.3showsthetime-stepsof PlasmaTree withadomainsizeof BS =5.Intheexperimentsof x 3.3,weuseallpossiblevalues of BS andretaintheoneleadingtothebestvalue. 3.3Experimentalresults Allexperimentswereperformedona48-coremachinecomposedofeighthexacoreAMDOpteron8439SEcodenameIstanbulprocessorsrunningat2.8GHz. Eachcorehasatheoreticalpeakof11.2Gop/swithapeakof537.6Gop/sfor thewholemachine.TheIstanbulmicro-architectureisaNUMAarchitecturewhere eachsockethas6MBoflevel-3cacheandeachprocessorhasa512KBlevel-2cache anda128KBlevel-1cache.AfterhavingbenchmarkedtheAMDACMLandIntel MKLBLASlibraries,weselectedMKL.2sinceitappearedtobeslightlyfaster inourexperimentalcontext.Linux2.6.32andIntelCompilers11.1werealsousedin conjunctionwithPLASMA2.3.1. Forallresults,weshowbothdoubleanddoublecomplexprecision,usingall48 coresofthemachine.Thematricesareofsize m =8000and200 n 8000.The tilesizeiskeptconstantat n b =200,sothatthematricescanalsobeviewedas p q tiledmatriceswhere p =40and1 q 40.Allkernelsuseaninnerblocking parameterof i b =32. Indoubleprecision,anFMA fusedmultiply-add ", y x + y involvesthree doubleprecisionnumbersfortwoops,butthesetwoopscanbecombinedintoone FMAandthuscompletedinonecycle.Indoublecomplexprecision,theoperation 51

PAGE 65

aUpperbounddoublecomplex bExperimentaldoublecomplex cUpperbounddouble dExperimentaldouble Figure3.6:UpperboundandexperimentalperformanceofQRfactorizationTT kernels y x + y involvessixdoubleprecisionnumbersforeightops;thereisnoFMA. Theratioofcomputation/communicationistherefore,potentially,fourtimeshigherin doublecomplexprecisionthanindoubleprecision.Communicationawarealgorithms aremuchmorecriticalindoubleprecisionthanindoublecomplexprecision. Foreachexperiment,weprovideacomparisonofthetheoreticalperformanceto theactualperformance.Thetheoreticalperformanceisobtainedbymodelingthe limitingfactoroftheexecutiontimeaseitherthecriticalpath,orthesequentialtime dividedbythenumberofprocessors.ThisissimilarinapproachtotheRooine 52

PAGE 66

aTheoreticalCPlength bExperimentaldoublecomplex cExperimentaldouble Figure3.7:Overheadintermsofcriticalpathlengthandtimewithrespectto Greedy Greedy =1 model[53].Taking seq asthesequentialperformance, T asthetotalnumberofops, cp asthelengthofthecriticalpath,and P asthenumberofprocessors,theupper boundonperformance, ub ,is ub = seq T max )]TJ/F21 7.9701 Tf 6.894 -4.977 Td [(T P ;cp Figures3.6aand3.6cdepicttheupperboundonperformanceofallalgorithmswhich usethe Triangleontopoftriangle kernels.Since PlasmaTree providesanadditionaltuningparameterofthedomainsize,weshowtheresultsforeachvalueofthis parameteraswellasthecompositionofthebestofthesedomainsizes.Again,it 53


Part of our comprehensive study also involved comparisons made to the Semi-Parallel Tile and Fully-Parallel Tile CAQR algorithms found in [11], which are much the same as those found in PLASMA. As with PLASMA, the tuning parameter BS controls the domain size upon which a flat tree is used to zero out tiles below the root tile within the domain, and a binary tree is used to merge these domains. Unlike PLASMA, it is not the bottom domain whose size decreases as the algorithm progresses through the columns, but instead the top domain. In this study, we found that the PLASMA algorithms performed identically to or better than these algorithms and therefore we do not report these comparisons.

Figures 3.6b and 3.6d illustrate the experimental performance reached by the Greedy, Fibonacci, and PlasmaTree algorithms using the TT (Triangle on top of triangle) kernels. In both cases, double or double complex precision, the performance of Greedy is better than PlasmaTree even for the best choice of domain size. Moreover, as expected from the analysis in Section 3.2.2, Greedy outperforms Fibonacci the majority of the time. Furthermore, we see that, for rectangular matrices, the experimental performance in double complex precision matches the upper bound on performance. This is not the case for double precision because communications have a higher impact on performance.

While it is apparent that Greedy does achieve higher levels of performance, the percentage may not be as obvious. To that end, taking Greedy as the baseline, we present in Figure 3.7 the theoretical, double, and double complex precision overhead for each algorithm that uses the Triangle-on-top-of-triangle kernel as compared to Greedy. These overheads are respectively computed in terms of critical path length and time. At a smaller scale (Figure 3.13), it can be seen that Greedy can perform up to 13.6% better than PlasmaTree.


Figure 3.8: Overhead in terms of critical path length and time with respect to Greedy (Greedy = 1). (Panels: (a) theoretical CP length; (b) experimental, double complex; (c) experimental, double.)

For all matrix sizes considered, p = 40 and 1 <= q <= 40, in the theoretical model, the critical path length for Greedy is either the same as that of PlasmaTree (q = 1) or is up to 25% shorter than PlasmaTree (q = 6). Analogously, the critical path length for Greedy is at least 2% and up to 27% shorter than that of Fibonacci. In the experiments, the matrix sizes considered were p = 40 and q in {1, 2, 4, 5, 10, 20, 40}. In double precision, Greedy has a decrease of at most 1.5% relative to the best PlasmaTree (q = 1) and a gain of at most 12.8% over the best PlasmaTree (q = 5). In double complex precision, Greedy has a decrease of at most 1.5% relative to the best PlasmaTree (q = 1) and a gain of at most 13.6% over the best PlasmaTree (q = 2).


Similarly, in double precision, Greedy provides a gain of 2.6% to 28.1% over Fibonacci, and in double complex precision, Greedy has a decrease of at most 2.1% and a gain of at most 28.2% over Fibonacci.

Although it is evident that PlasmaTree does not vary too far from Greedy or Fibonacci, one must keep in mind that there is a tuning parameter involved, and we choose the best of these domain sizes for PlasmaTree to create the composite result, whereas with Greedy there is no such parameter to consider. Of particular interest is the fact that Greedy always performs better than any other algorithm (see footnote 2) for p >> q. In the scope of PlasmaTree, a domain size BS = 1 will force the use of a binary tree so that both Greedy and PlasmaTree behave the same. However, as the matrix tends more to a square, i.e., q tends toward p, we observe that the performance of all of the algorithms, including FlatTree, is on par with Greedy. As more columns are added, the parallelism of the algorithm is increased and the critical path becomes less of a limiting factor, so that the performance of the kernels is brought to the forefront. Therefore, all of the algorithms perform similarly since they all share the same kernels.

Figure 3.9: Kernel performance for double complex precision. (Panels: (a) factorization kernels; (b) update kernels.)

Footnote 2: When q = 1, Greedy and FlatTree exhibit close performance. They both perform a binary tree reduction, albeit with different row pairings.


Figure 3.10: Kernel performance for double precision. (Panels: (a) factorization kernels; (b) update kernels.)

In order to accurately assess the impact of the kernel selection on the performance of the algorithms, Figures 3.9 and 3.10 show both the in-cache and out-of-cache performance using the NoFlush and MultCallFlushLRU strategies as presented in [29, 51]. Since an algorithm using TT kernels will need to call GEQRT as well as TTQRT to achieve the same as the TS kernel TSQRT, the comparison is made between GEQRT + TTQRT and TSQRT, and similarly for the updates. For n_b = 200, the observed ratio of in-cache kernel speed for TSQRT to GEQRT + TTQRT is 1.3374, and for TSMQR to UNMQR + TTMQR it is 1.3207. For out-of-cache, the ratio for TSQRT to GEQRT + TTQRT is 1.3193 and for TSMQR to UNMQR + TTMQR it is 1.3032. Thus, we can expect about a 30% difference between the selections of the kernels, since we will have instances of using in-cache and out-of-cache throughout the run. Most of this difference is due to the higher efficiency and data locality within the TS kernels as compared to the TT kernels.

Having seen that kernel performance can have a significant impact, we also compare the TT-based algorithms to those using the TS kernels. The goal is to provide a complete assessment of all currently available algorithms, as shown in Figure 3.11. For double precision, the observed difference in kernel speed is 4.976 GFLOP/sec for the TS kernels versus 3.844 GFLOP/sec for the TT kernels, which provides a ratio of 1.2945 and is in accordance with our previous analysis.


Figure 3.11: Upper bound and experimental performance of QR factorization, all kernels. (Panels: (a) upper bound, double complex; (b) experimental, double complex; (c) upper bound, double; (d) experimental, double.)

It can be seen that as the number of columns increases, whereby the amount of parallelism increases, the effect of the kernel performance outweighs the benefit provided by the extra parallelism afforded through the TT algorithms. Comparatively, in double complex precision, Greedy does perform better, even against the algorithms using the TS kernels. As before, one must keep in mind that Greedy does not require the tuning parameter of the domain size to achieve this better performance.


Figure 3.12: Overhead in terms of critical path length and time with respect to Greedy (Greedy = 1). (Panels: (a) theoretical CP length; (b) experimental, double complex; (c) experimental, double.)

From these experiments, we showed that in double complex precision, Greedy demonstrated better performance than any of the other algorithms and, moreover, it does so without the need to specify a domain size, as opposed to the algorithms in PLASMA. In addition, in double precision, for matrices where p >> q, Greedy continues to excel over any other algorithm using the TT kernels, and continues to do so as the matrices become more square.

3.4 Conclusion


Figure 3.13: Overhead in terms of critical path length and time with respect to Greedy (Greedy = 1). (Panels: (a) theoretical CP length; (b) experimental, double complex; (c) experimental, double.)

In this chapter, we have presented Fibonacci and Greedy, two new algorithms for tiled QR factorization. These algorithms exhibit more parallelism than state-of-the-art implementations based on reduction trees. We have provided accurate estimations for the length of their critical path.

Comprehensive experiments on multicore platforms confirm the superiority of the new algorithms for p x q matrices, as soon as, say, p >= 2q. This holds true when comparing not only with previous algorithms using TT (Triangle on top of triangle) kernels, but also with all known algorithms based on TS (Triangle on top of square) kernels. Given that TS kernels offer more locality, and benefit from better


elementary arithmetic performance, than TT kernels, the better performance of the new algorithms is even more striking, and further demonstrates that a large degree of parallelism was not exploited in previously published solutions.

Future work will investigate several promising directions. First, using rectangular tiles instead of square tiles could lead to efficient algorithms, with more locality and still the same potential for parallelism. Second, refining the model to account for communications, and extending it to fully distributed architectures, would lay the groundwork for the design of MPI implementations of the new algorithms, unleashing their high level of performance on larger platforms. Finally, the design of robust algorithms, capable of achieving efficient performance despite variations in processor speeds, or even resource failures, is a challenging but crucial task to fully benefit from future platforms with a huge number of cores.


Algorithm 3.4: Greedy algorithm via TT kernels.

 1  for j = 1 to q do
        /* nZ_j is the number of tiles which have been eliminated in column j */
 2      nZ_j = 0
        /* nT_j is the number of tiles which have been triangularized in column j */
 3      nT_j = 0
 4  while column q is not finished do
 5      for j = q down to 1 do
 6          if j == 1 then
                /* Triangularize the first column if not yet done */
 7              nT_new = nT_j + (p - nT_j)
 8              if p - nT_j > 0 then
 9                  for k = p down to 1 do
10                      GEQRT(k, j)
11                      for jj = j+1 to q do
12                          UNMQR(k, j, jj)
13          else
                /* Triangularize every tile having a zero in the previous column */
14              nT_new = nZ_(j-1)
15              for k = nT_j to nT_new - 1 do
16                  GEQRT(p - k, j)
17                  for jj = j+1 to q do
18                      UNMQR(p - k, j, jj)
            /* Eliminate every tile triangularized in the previous step */
19          nZ_new = nZ_j + floor((nT_j - nZ_j) / 2)
20          for kk = nZ_j to nZ_new - 1 do
21              piv(p - kk) = p - kk - (nZ_new - nZ_j)
22              TTQRT(p - kk, piv(p - kk), j)
23              for jj = j+1 to q do
24                  TTMQR(p - kk, piv(p - kk), j, jj)
            /* Update the number of triangularized and eliminated tiles at the next step */
25          nT_j = nT_new
26          nZ_j = nZ_new
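The bookkeeping of line 19 is easy to observe in isolation. The Python sketch below, an illustration only, tracks the pair (nT, nZ) for a single, already triangularized column of a p x 1 tile matrix and counts the elimination sweeps; the simplification to one column is an assumption made purely to keep the example short.

    # Greedy elimination sweeps on one triangularized column of p tiles.
    def greedy_first_column_sweeps(p):
        nT, nZ, sweeps = p, 0, 0
        while nZ < p - 1:                 # the column is finished when p-1 tiles are zeroed
            nZ += (nT - nZ) // 2          # the elimination rule of line 19
            sweeps += 1
        return sweeps

    for p in (4, 8, 15, 40):
        print(p, greedy_first_column_sweeps(p))   # ceil(log2(p)) sweeps, i.e. a binary tree

This matches the footnote above: on a single column, Greedy degenerates to a binary tree reduction.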


Table3.3:Time-stepsfortiledalgorithms. a Sameh-Kuck b Fibonacci c Greedy d BinaryTree e PlasmaTree BS =5 ? ? ? ? ? 6 ? 14 ? 12 ? 6 ? 6 ? 828 ? 1248 ? 1042 ? 828 ? 828 ? 103450 ? 124670 ? 104064 ? 63656 ? 103450 ? 12405672 ? 10426892 ? 8366286 ? 10347090 ? 12405672 ? 1446627894 ? 10406490114 ? 8345684106 ? 64468104124 ? 1446627894 ? 16526884100116 10406286112136 8345678102128 82878102138158 6547490106122 18587490106122 8366284108134 8305278100122 64262112136172 82882102118134 20648096112128 8345884106130 6285072100118 12407696146170 103450110130146 227086102118134 8345680106128 628507294116 64674110130180 12405672138158 247692108124140 8345678102128 628506894116 82880108144164 16526884100166 268298114130146 6285678100122 628446688110 63656114142178 6568096112128 2888104120136152 6285078100122 622446688110 10346484148176 82884108124140 3094110126142158 6284472100122 622446082104 6386292112182 103450112136152 32100116132148164 622446094116 62238607698 8286690114134 12405672140164 Table3.4:Neither Greedy nor Asap areoptimal. a Greedy nor Asap areoptimal. a Greedy b Asap ? ? 12 ? 12 ? 1042 ? 1040 ? 104064 103686 83662 83480 83456 83274 83456 83068 83052 82862 62850 62856 62850 62650 62850 62446 62844 62444 62244 62244 62244 62240 62238 62238 b Greedy generallyoutperforms Asap q p Algorithm 16 32 64 128 16 Greedy 310 Asap 310 32 Greedy 360 650 Asap 402 656 64 Greedy 374 726 1342 Asap 588 844 1354 128 Greedy 396 748 1452 2732 Asap 966 1222 1748 2756 63


Table3.5:Threeschemesappliedtoacolumnwhoseupdatekernelweightisnotan integermultipleoftheeliminationkernelweight. a 11 6 6 13 14 12 6 11 8 10 6 9 8 10 6 9 12 8 6 9 9 8 6 7 7 8 6 7 7 8 6 5 5 5 6 5 5 5 6 5 5 5 64


Table3.6:GreedyversusPT TTandFibonacciTheoretical pqGREEDYPT TTBSOverheadGainFibOverheadGain 401161611.00000.0000221.37500.2727 402546031.11110.1000721.33330.2500 403749851.32430.2449941.27030.2128 40410413251.26920.21211161.11540.1034 40512616651.31750.24101381.09520.0870 406148198101.33780.25251601.08110.0750 407170226101.32940.24781821.07060.0659 408192254101.32290.24412041.06250.0588 409214282101.31780.24112261.05610.0531 4010236310101.31360.23872481.05080.0484 4011258336201.30230.23212701.04650.0444 4012280358201.27860.21792921.04290.0411 4013302380201.25830.20533141.03970.0382 4014324402201.24070.19403361.03700.0357 4015346424201.22540.18403581.03470.0335 4016368446201.21200.17493801.03260.0316 4017390468201.20000.16674021.03080.0299 4018412490201.18930.15924241.02910.0283 4019432512201.18520.15624461.03240.0314 4020454534201.17620.14984681.03080.0299 4021476554201.16390.14084901.02940.0286 4022498570201.14460.12635121.02810.0273 4023520586201.12690.11265341.02690.0262 4024542602201.11070.09975561.02580.0252 4025564618201.09570.08745781.02480.0242 4026586634201.08190.07576001.02390.0233 4027608650201.06910.06466221.02300.0225 4028630666201.05710.05416441.02220.0217 4029652682201.04600.04406661.02150.0210 4030668698201.04490.04306881.02990.0291 4031684714201.04390.04207101.03800.0366 4032700730201.04290.04117321.04570.0437 4033716746201.04190.04027541.05310.0504 4034732762201.04100.03947761.06010.0567 4035748778201.04010.03867981.06680.0627 4036764794201.03930.03788201.07330.0683 4037780810201.03850.03708421.07950.0736 4038796826201.03770.03638621.08290.0766 4039812842201.03690.03568781.08130.0752 4040826856201.03630.03508921.07990.0740 65


Table 3.7: Greedy versus PT TT, experimental, double precision.

  p   q   GREEDY (d)  PT TT (d)  BS   Overhead  Gain
  40  1   36.9360     37.5020    1    1.0153    -0.0153
  40  2   58.5090     52.7180    3    0.9010     0.0990
  40  4   103.2670    90.7940    10   0.8792     0.1208
  40  5   115.3060    100.5540   5    0.8721     0.1279
  40  10  153.5180    145.8200   17   0.9499     0.0501
  40  20  170.8730    171.8270   27   1.0056    -0.0056
  40  40  184.5220    182.8160   19   0.9908     0.0092

Table 3.8: Greedy versus PT TT, experimental, double complex precision.

  p   q   GREEDY (z)  PT TT (z)  BS   Overhead  Gain
  40  1   42.0710     42.7120    1    1.0152    -0.0152
  40  2   60.4420     52.1970    5    0.8636     0.1364
  40  4   95.1820     84.1120    5    0.8837     0.1163
  40  5   107.6370    96.7530    5    0.8989     0.1011
  40  10  135.0270    128.4320   17   0.9512     0.0488
  40  20  144.4010    146.4220   28   1.0140    -0.0140
  40  40  152.9280    151.9090   8    0.9933     0.0067

Table 3.9: Greedy versus Fibonacci, experimental, double precision.

  p   q   GREEDY (d)  FIB (d)    Overhead  Gain
  40  1   36.9360     26.5610    0.7191    0.2809
  40  2   58.5090     49.4870    0.8458    0.1542
  40  4   103.2670    100.1440   0.9698    0.0302
  40  5   115.3060    115.0020   0.9974    0.0026
  40  10  153.5180    152.0090   0.9902    0.0098
  40  20  170.8730    170.4780   0.9977    0.0023
  40  40  184.5220    180.2990   0.9771    0.0229

Table 3.10: Greedy versus Fibonacci, experimental, double complex precision.

  p   q   GREEDY (z)  FIB (z)    Overhead  Gain
  40  1   42.0710     30.2280    0.7185    0.2815
  40  2   60.4420     48.9570    0.8100    0.1900
  40  4   95.1820     97.1650    1.0208   -0.0208
  40  5   107.6370    105.9610   0.9844    0.0156
  40  10  135.0270    134.5500   0.9965    0.0035
  40  20  144.4010    145.5530   1.0080   -0.0080
  40  40  152.9280    150.0980   0.9815    0.0185


4. Scheduling of Cholesky Factorization

In Chapter 2 we studied the Cholesky Inversion algorithm, which consists of three steps: Cholesky factorization, inversion of the factor, and the multiplication of two triangular matrices. In this chapter, we will focus on the Cholesky factorization but, unlike the previous work where the number of processors was unbounded, we will consider the factorization in the context of a finite number of processors. By limiting the number of processors, the scheduling of the tasks becomes an issue. Moreover, the weight (processing time) of each task must be taken into consideration when creating the schedule.

As before, we will be considering the critical path length for the algorithm, but not as a function of the number of tiles; rather as a function of the weights of the tasks. The weights are based upon the total computational cost for each kernel and are provided in Table 4.1. A more in-depth analysis of the length of the critical path with weighted tasks for the Cholesky Inversion algorithm can be found in [16], which also provides 9t - 10 as the weighted critical path length for the Cholesky factorization of a matrix of t x t tiles.

Table 4.1: Task weights.

  Kernel   # flops                     Weight (in units of (1/3) n_b^3 flops)
  POTRF    (1/3) n_b^3 + O(n_b^2)      1
  TRSM     n_b^3                       3
  SYRK     n_b^3 + O(n_b^2)            3
  GEMM     2 n_b^3 + O(n_b^2)          6

The upper bound on performance of perfect speedup and critical path introduced in Chapter 2 remains too optimistic and does not take into account any information which can be garnered from the DAG of the algorithm. This work makes progress towards providing a more representative bound on the performance of the Cholesky factorization in the tiled setting.
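The weights of Table 4.1 already fix the sequential weighted time for a t x t tile matrix. The short Python sketch below is an illustration only; it uses the usual task counts of the tiled factorization (t POTRFs, t(t-1)/2 TRSMs, t(t-1)/2 SYRKs and t(t-1)(t-2)/6 GEMMs).

    # Sequential weighted time of the tiled Cholesky factorization (Table 4.1 weights).
    def t_seq(t):
        n_potrf = t
        n_trsm  = t * (t - 1) // 2
        n_syrk  = t * (t - 1) // 2
        n_gemm  = t * (t - 1) * (t - 2) // 6
        return 1 * n_potrf + 3 * n_trsm + 3 * n_syrk + 6 * n_gemm

    print(t_seq(5))   # 125, the value T_seq used in the 5 x 5 example of Section 4.1 below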


We also provide gains toward a bound on the minimum number of processors required to obtain the shortest possible weighted critical path (minimum makespan) for the Cholesky factorization of a matrix of t x t tiles.

4.1 ALAP Derived Performance Bound

To obtain our bounds, we calculate the latest possible start time for each task (ALAP) and consider an unbounded number of processors without any costs for communication. If we did account for communication, we might see the critical path length increase, which would in turn decrease our upper bound. We start at the final tasks and consider how many processors are needed to execute these tasks without increasing the length of the critical path. We step backwards in time until such a point that more processors are needed to keep the critical path length constant. Thus we must add enough processors to execute the tasks and in turn create more idle time for the execution of tasks which are successors. At a certain point, there is no more need to add processors, and this is then the number of processors needed to obtain the constant length critical path.

By forcing as-late-as-possible (ALAP) start times, any schedule will keep as many or fewer processors active as the ALAP execution on an unbounded number of processors. Thus, by evaluating the Lost Area (LA), or idle time, for a given number of processors, p, at the end of the ALAP execution on an unbounded number of processors, we can increase the sequential time by the amount of LA and divide this result by p to obtain the best possible execution time, i.e.,

    T_p = (T_seq + LA(p)) / p                                        (4.1)

and we define this to be the ALAP Derived Performance Bound. Hence the maximum speedup that we can expect is given by

    T_seq * p / (T_seq + LA(p)).


An example will help to further illustrate this technique. In Figure 4.1 we are given the ALAP execution of a 5 x 5 tiled matrix which has T_seq = 125. The ordered pairs indicated provide the number of processors and idle time, respectively, and in Table 4.2 are given the values for T_p, speedup, and efficiency. For more than four processors, there are enough processors to obtain the critical path length, which becomes our limiting factor.

Figure 4.1: ALAP execution for 5 x 5 tiles. (The execution profile plots processors versus time for the POTRF, TRSM, SYRK, and GEMM tasks; the (processors, idle time) pairs are (1, 0), (2, 4), (3, 11), (4, 24), and (5, 45).)

Table 4.2: Upper bound on speedup and efficiency for 5 x 5 tiles.

  p    T_p      S_p    E_p
  1    125.00   1.00   1.00
  2    64.50    1.94   0.97
  3    45.33    2.76   0.92
  4    37.25    3.36   0.84
  5    35.00    3.57   0.71
  6    35.00    3.57   0.60
  7    35.00    3.57   0.51
  8    35.00    3.57   0.45
  9    35.00    3.57   0.40
  10   35.00    3.57   0.36
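The entries of Table 4.2 can be reproduced directly from bound (4.1). The sketch below is for illustration only; it uses the idle-time values read from Figure 4.1, the critical path length 35 (i.e. 9t - 10 for t = 5), and caps the bound by the critical path as described in the text.

    # ALAP Derived Performance Bound for the 5 x 5 example.
    T_seq, cp = 125, 35
    LA = {1: 0, 2: 4, 3: 11, 4: 24, 5: 45}        # (processors, idle time) from Figure 4.1

    for p in range(1, 11):
        la = LA.get(p, LA[5])                     # for p >= 5 the critical path dominates
        T_p = max(cp, (T_seq + la) / p)           # bound (4.1), limited by the critical path
        print(p, round(T_p, 2), round(T_seq / T_p, 2), round(T_seq / (p * T_p), 2))
    # Reproduces the T_p, S_p and E_p columns of Table 4.2.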


4.2 Critical Path Scheduling

In order to provide a critical path schedule, we use the Backflow algorithm to assign priorities to tasks of a DAG such that each task's priority adheres to its dependencies. The algorithm is described in four steps:

STEP 1: Beginning at the final task in the DAG, set its priority to its processing time.
STEP 2: Moving in the reverse direction, set each incidental task's priority to the sum of its processing time and the final task's priority.
STEP 3: For each task in STEP 2, moving in the reverse direction, set each incidental task's priority to the sum of its processing time and the maximum priority of any incidental successor task.
STEP 4: Repeat the procedure until all tasks have been assigned a priority.

An example is given in Figure 4.2. The processing times are given in parentheses and the assigned priorities (cp) are designated in square brackets. Tasks A and B will be assigned a priority of 16 since cp(A) = 3 + max(cp(C), cp(D)) and cp(B) = 3 + max(cp(D), cp(E)).

By following the path with the highest priorities, a critical path can be discerned from the weighted DAG. Thus any schedule which then chooses from the available tasks those with the highest priorities to execute first inherently follows the critical path. It is well known that critical path scheduling is not always optimal. As an example [43], take two processors and four tasks. Let tasks A, B, C, and D have weights of 3, 3, 1, and 1, respectively, and let the only relationship between tasks be that C is a predecessor of D. Then cp(A) = cp(B) = 3, cp(C) = 2, and cp(D) = 1. A critical path schedule would choose to schedule tasks A and B simultaneously and follow up with C and then D, resulting in a schedule of length five. However, if A and C are scheduled simultaneously, and then D follows A on the same processor and B follows C on the other processor, the length of the schedule is four.
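The priority assignment itself is just a longest-path-to-exit computation. The Python sketch below, an illustration only, applies it to the four-task example from [43] discussed above.

    # Backflow priorities (cp values) on the example DAG: weights A=3, B=3, C=1, D=1, edge C -> D.
    import functools

    weights = {"A": 3, "B": 3, "C": 1, "D": 1}
    succs   = {"A": [], "B": [], "C": ["D"], "D": []}

    @functools.lru_cache(maxsize=None)
    def cp(task):
        # priority = own processing time plus the largest successor priority
        return weights[task] + max((cp(s) for s in succs[task]), default=0)

    print({t: cp(t) for t in weights})   # {'A': 3, 'B': 3, 'C': 2, 'D': 1}

With these priorities, a max(cp) list schedule on two processors starts A and B and needs five time units, whereas starting C alongside A yields a schedule of length four, as noted above.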


Figure 4.2: Example derivation of task priorities via the Backflow algorithm. (Panels: (a) state at start; (b) state at finish.)

We will use this critical path information to analyze three schedules by choosing available tasks via max(cp), rand(cp), or min(cp). The max(cp) strategy will naturally follow the critical path by scheduling tasks with the highest cp first and, vice versa, the min(cp) strategy will schedule from the available tasks those with the minimum cp. Between these two, we also choose randomly amongst the available tasks with rand(cp).

4.3 Scheduling with synchronizations

The right-looking version of the LAPACK Cholesky factorization, as depicted in Figure 1.1c, provides an alternative schedule which can be easier to analyze and understand. We will apply the three steps of the algorithm to a matrix of t x t tiles. In the tiled setting, we can provide synchronization points between the varying tasks of each step and simply schedule any of the available tasks since there are no dependencies between the tasks in each grouping. By adding these synchronizations,


this schedule is not able to obtain the critical path no matter how many processors are available. Algorithm 4.1 is the right-looking variant of the Cholesky factorization with added synchronization points.

Algorithm 4.1: Schedule for tiled right-looking Cholesky factorization with added synchronizations to allow for grouping.

 1  Tile Cholesky Factorization (compute L such that A = L L^T);
 2  for i = 0 to t-1 do
 3      schedule POTRF(i);
 4      synchronize;
 5      for j = i+1 to t-1 do
 6          schedule TRSM(j, i);
 7      synchronize;
 8      for j = i+1 to t-1 do
 9          for k = i+1 to j-1 do
10              schedule GEMM(j, i, k);
11      synchronize;
12      for j = i+1 to t-1 do
13          schedule SYRK(j, i);
14      synchronize;

Naturally, we can improve upon the above schedule by removing the synchronization between some of the groupings (Algorithm 4.2). The update of the trailing matrix is composed of two groupings, namely the GEMMs and the SYRKs, which can be executed in parallel if enough processors are available. Moreover, the added synchronization point between the update of the trailing matrix and the factorization of the next diagonal tile may also be removed. This schedule does become more complex, but given enough processors, the schedule is able to obtain the critical path as the limiting factor to performance. The minimum number of processors, p, needed to obtain the critical path is p = ceil((t-1)^2 / 2) for a matrix of t x t tiles, since the highest degree of parallelism is realized for the update of the first trailing matrix, which is of


size (t-1) x (t-1).

Both of these schedules differ from the critical path scheduling due to the added synchronization points and will show lower theoretical performance. In the theoretical results, we only show Algorithm 4.2.

Algorithm 4.2: Improvement upon Algorithm 4.1.

 1  Tile Cholesky Factorization (compute L such that A = L L^T);
 2  for i = 0 to t-1 do
 3      schedule POTRF(i);
 4      synchronize;
 5      for j = i+1 to t-1 do
 6          schedule TRSM(j, i);
 7      synchronize;
 8      for j = i+1 to t-1 do
 9          for k = i+1 to j-1 do
10              schedule GEMM(j, i, k);
11          schedule SYRK(j, i);

4.4 Theoretical Results

In the following figures, our Rooftop bound will be that which uses the perfect speedup until the weighted critical path is the limiting factor, i.e.,

    Rooftop Bound = max(Critical Path, T_seq / p).                   (4.2)

We will compare this to our ALAP Derived bound, which was derived using the ALAP execution, our various scheduling strategies, and the following lower bound. From [20, p. 221, Section 7.4.2], given our DAG, we know that the makespan, MS, of any list schedule, sigma, for a given number of processors, p, is

    MS(sigma, p) <= (2 - 1/p) * MS_opt(p),

where MS_opt is the makespan of the optimal list schedule without communication costs. However, we do not know MS_opt and must therefore substitute the makespan of the Critical Path Scheduling using the maximum strategy to compute our lower bound.
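The two formulas above are cheap to evaluate. The sketch below is an illustration only: the Rooftop bound uses the 5 x 5 values of Section 4.1 (T_seq = 125, weighted critical path 35), while the list-schedule makespan fed to the lower bound is an assumed value, not a computed schedule.

    def rooftop(T_seq, cp, p):
        # Rooftop bound (4.2): limited by either the critical path or perfect speedup.
        return max(cp, T_seq / p)

    def makespan_lower_bound(ms_list, p):
        # Any list schedule satisfies MS(sigma, p) <= (2 - 1/p) MS_opt(p),
        # hence MS_opt(p) >= MS(sigma, p) / (2 - 1/p).
        return ms_list / (2.0 - 1.0 / p)

    print(rooftop(125, 35, 4))              # 35.0; compare with the ALAP Derived 37.25 of Table 4.2
    print(makespan_lower_bound(40.0, 4))    # about 22.9, for an assumed list-schedule makespan of 40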


The ALAP Derived bound improves upon the Rooftop bound precisely in the area that is of most concern, namely where there is enough parallelism but not enough processors to fully exploit that parallelism.

Figure 4.3: Theoretical results for a matrix of 40 x 40 tiles. (Panels: (a) speedup; (b) efficiency; (c) comparison to the new bound.)

Figure 4.3a shows that the Critical Path schedule is actually quite decent and comes to within two percent of the ALAP Derived bound. Moreover, the ALAP Derived bound has significantly reduced the gap between the Rooftop bound and any of our list schedules.


4.5 Toward an alpha_opt

It is interesting to know how many processors one needs to be able to schedule all of the tasks and maintain the weighted critical path. We will view this problem in terms of tiles and state the problem as follows:

Given a matrix of t x t tiles, determine the minimum number of processors, p_opt, needed to schedule all tasks and achieve an execution time equal to the weighted critical path.

Toward that end, we will let p = alpha * t^2 where 0 < alpha <= 1. Our analysis will be asymptotic, in which we let t tend to infinity. From the analysis of Algorithm 4.2, we already know that alpha_opt <= 1/2. Using MATLAB, we have calculated the alpha value for matrices of t x t tiles as shown in Figure 4.4. It is our conjecture that alpha_opt is approximately 1/5.

Figure 4.4: Values of alpha for matrices of t x t tiles where 3 <= t <= 40.

4.6 Related Work

For the LU decomposition with partial pivoting, much work has been accomplished to discern asymptotically optimal algorithms for all values of alpha [37, 35, 43]. They consider a problem of size n and assume p = alpha*n processors on which to schedule the LU decomposition. The critical path length of the optimal schedule in this case is of order n^2, and alpha_opt is approximately 0.347, where alpha_opt is a solution to the equation 3*alpha - alpha^3 = 1.


Figure 4.5: Asymptotic efficiency versus alpha = p/n for the LU decomposition and versus alpha = p/t^2 for the tiled Cholesky factorization. (Panels: (a) LU decomposition; (b) tiled Cholesky factorization.)

In Figure 4.5, we make a comparison between the algorithmic efficiency of the LU decomposition and the tiled Cholesky factorization. In the case of the LU decomposition, the attainable upper bound on efficiency closely resembles our previous bound of perfect speedup which is then limited by the critical path. On the other hand, the tiled Cholesky factorization does not exhibit this type of efficiency, which can be seen from the gap between our Rooftop bound and the ALAP Derived bound. Unlike the work in [37], we do not provide an algorithm which attains the ALAP Derived bound.

4.7 Conclusion and future work

In many research papers, the performance of an algorithm is usually compared to either the performance of the GEMM kernel or against perfect scalability, resulting in large discrepancies between the peak performance of the algorithm and these upper bounds. If an algorithm displays a DAG such as that of the tiled Cholesky factorization, it is unrealistic to expect perfect scalability or even make comparisons to the performance of the GEMM kernel. Thus one needs to consider a new bound which is more representative of the algorithm and accounts for the structure of the DAG. Without such a bound it is difficult to assess whether there are any performance


gains to be achieved. Although we do not have a closed-form expression for this new bound, we have shown that such a bound exists. Moreover, we have also shown that any algorithm which schedules the tiled Cholesky factorization while maintaining the weighted critical path will require O(t^2) processors for a matrix of t x t tiles, and the coefficient is somewhere around 0.2.

In this chapter, we have applied a combination of existing techniques to a tiled Cholesky factorization algorithm to discover a more realistic bound on the performance and efficiency. We did so by considering an ALAP execution on an unbounded number of processors and used this information to calculate the idle time for any list schedule on a bounded number of processors. This is then used to calculate the maximum speedup and efficiency that may be observed.

Further work is necessary to provide a closed-form expression of the new bound dependent upon the number of processors. In addition, we need to include communication costs in the bound to make it more reflective of the actual scheduling problem on parallel machines. As can be seen in Figure 4.3c, the Critical Path Schedule is within 2% of our ALAP Derived bound. Although scheduling a DAG on a bounded number of processors is an NP-complete problem, it may not be the case for the DAG of the tiled Cholesky factorization. Further investigation might show that the Critical Path Scheduling is the optimal schedule.


5. Scheduling of QR Factorization

In this chapter, we present collaborative work with Jeffrey Larson. We revisit the tiled QR factorization as presented in Chapter 3 but do so in the context of a bounded number of resources. Chapter 3 was concerned with finding the optimal elimination tree on an unlimited number of processors. We will be using the same analytical tools as in Chapter 4 to derive good schedules and to improve upon the Rooftop bound. The Cholesky factorization has just one DAG, therefore Chapter 4 is a standard scheduling problem, i.e., how to schedule a DAG on a finite number of processors. Unlike the previous chapter, we will need to consider all of the various algorithms (i.e., elimination trees) and cannot distill the analysis down to a single DAG.

5.1 Performance Bounds

Each of the algorithms studied in Chapter 3, namely FlatTree, Fibonacci, Greedy, and GrASAP, produces a unique DAG for a matrix of p x q tiles. In turn, the ALAP Derived bounds (4.1) for each elimination tree will also be unique. In Figure 5.1, we give the computed upper bounds and make comparisons to the scheduling strategies of maximum, random, and minimum via the Critical Path Method for a matrix of 20 x 6 tiles. The matrix size was chosen such that the critical path length of Greedy is 136 and the critical path length of GrASAP is 134 (see Figure 3.5 in Section 3.2.2).

The GrASAP algorithm for a tiled matrix is optimal in the scope of unbounded resources. However, by the manner in which the ALAP Derived bound is computed, the bound created by using GrASAP cannot hold for all of the other algorithms. Consider the ALAP execution of the Fibonacci and GrASAP algorithms on a matrix of 15 x 4 tiles. In Figure 5.2, we show the execution of the last tasks for GrASAP on the left and Fibonacci on the right. More of the tasks in the ALAP execution for Fibonacci are pushed towards the end of the execution, which means the ALAP Derived bound will be higher than that of GrASAP for a schedule that uses fewer than 10 processors.


Figure 5.1: Scheduling comparisons for each of the algorithms versus the ALAP Derived bounds on a matrix of 20 x 6 tiles. (Panels: (a) FlatTree; (b) Fibonacci; (c) Greedy; (d) GrASAP.)

In other words, as we add more processors, the Lost Area (LA) increases much faster for GrASAP than it does for Fibonacci. Since the critical path length for Fibonacci is greater than that of GrASAP, after a certain number of processors, the ALAP Derived bound for Fibonacci falls below that of GrASAP. These observations are evident in Figure 5.3, where we show a comparison of the bound for each algorithm. Recall that the Rooftop bound (4.2) only takes into account the critical path length of an algorithm, such that for GrASAP it can be considered a bound for all the algorithms, since GrASAP is optimal for unlimited resources and thus has the shortest critical path length.


Figure 5.2: Tail-end execution using ALAP on unbounded resources for GrASAP and Fibonacci on a matrix of 15 x 4 tiles. (GrASAP on the left, Fibonacci on the right; the tasks shown are GEQRT, TTQRT, UNMQR, and TTMQR.)

5.2 Optimality

There is no reason for the tree found optimal in Chapter 3 on an unbounded number of resources to be optimal on a bounded number of resources. We cast the problem of finding the optimal tree with the optimal schedule as an integer programming problem. A complete description of the formulation can be found in Appendix A. Similarly, in [1] a Mixed-Integer Linear Programming approach was used to provide an upper bound on performance. However, the integer programming problem size grows exponentially as the matrix size increases, thus the largest feasible problem size was a matrix of 5 x 5 tiles. In Figure 5.4 we show the speedup of the GrASAP algorithm with its bound and make comparisons to an optimal tree with an optimal schedule, and Table 5.1 provides the actual schedule lengths for all of the algorithms using the CP Method for the matrix of 5 x 5 tiles.


Figure 5.3: ALAP Derived bound comparison for all algorithms for a matrix of 15 x 4 tiles. (Panels: (a) speedup; (b) efficiency.)

5.3 Elimination Tree Scheduling

We pair up the choice of the elimination tree with a type of scheduling strategy to obtain the following bounds:

    (GrASAP, Rooftop bound) <= (optimal tree, optimal schedule) <= (GrASAP, optimal schedule) <= (GrASAP, CP schedule).

Moreover, we also have

    (GrASAP, Rooftop bound) <= (GrASAP, ALAP Derived bound) <= (GrASAP, optimal schedule) <= (GrASAP, CP schedule).

Combining these inequalities with Table 5.1 gives rise to the following questions. Given an optimal elimination tree for the tiled QR factorization on an unbounded number of resources:

(Q1) does there always exist a scheduling strategy such that the schedule on limited resources is optimal?

(Q2) does the ALAP Derived bound for this elimination tree hold true for any scheduling strategy on any other elimination tree?


Figure 5.4: Comparison of speedup for the CP Method on GrASAP, the ALAP Derived bound from GrASAP, and optimal schedules for a matrix of 5 x 5 tiles on 1 to 14 processors.

Table 5.1: Schedule lengths for a matrix of 5 x 5 tiles.

            ALAP Derived     Optimal          CP Method
  Procs     Bound (GrASAP)   Tree/Schedule    GrASAP  Greedy  Fibonacci  FlatTree
  1         500              500              500     500     500        500
  2         255              256              256     256     256        256
  3         176              176              178     178     178        176
  4         138              138              140     140     140        140
  5         116              116              118     118     118        116
  6         102              104              104     104     104        104
  7         92               94               94      94      94         94
  8         86               88               88      88      88         88
  9         82               84               84      84      86         86
  10        80               80               82      82      86         86
  11        80               80               80      80      86         86
  12        80               80               80      80      86         86
  13        80               80               80      80      86         86
  14        80               80               80      80      80         86
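As a quick numerical check of the orderings of Section 5.3, the sketch below plugs in the p = 3 row of Table 5.1; it is an illustration only, with T_seq = 500 and the GrASAP critical path length of 80 read off the large-p rows of the table.

    p, T_seq, cp_grasap = 3, 500, 80

    rooftop      = max(cp_grasap, T_seq / p)   # about 166.7
    alap_derived = 176                         # ALAP Derived bound from GrASAP (Table 5.1)
    optimal_tree = 176                         # optimal tree with optimal schedule (Table 5.1)
    grasap_cp    = 178                         # GrASAP with the CP-Method schedule (Table 5.1)

    assert rooftop <= alap_derived <= grasap_cp
    assert rooftop <= optimal_tree <= grasap_cp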


From Chapter 3 we have that GrASAP is an optimal elimination tree for the tiled QR factorization. We know that the length of an optimal schedule for GrASAP on p processors will necessarily be greater than or equal to the ALAP Derived bound for GrASAP on p processors, by way of construction of the ALAP Derived bound. Thus (Q1) implies (Q2). We cannot address the first question directly since the size of the matrix needed to produce a counterexample is too large for verification with the integer programming formulation.

Therefore we need to find a matrix size for which a schedule exists whose execution length is smaller than the ALAP Derived bound from GrASAP on the same matrix. As we have seen in Figure 5.3, the Fibonacci elimination tree on a tall and skinny tiled matrix provides the best hope.

Consider a matrix of 34 x 4 tiles on 10 processors. The ALAP Derived bound from GrASAP is 188. Using the Critical Path method to schedule the Fibonacci elimination tree, we obtain a schedule length of 184. Therefore the ALAP Derived bound from GrASAP does not hold for this schedule. By implication, we have that (Q1) is false. However, the Rooftop bound from GrASAP is still a valid bound for all of the schedules.

5.4 Conclusion

In this chapter we have applied the same tools used in Chapter 4 to provide performance bounds for the tiled QR factorization. Further, we have shown that the ALAP Derived bound is algorithm dependent. This leaves that the only bound we have for all algorithms is the Rooftop bound as computed using the GrASAP algorithm.

The analysis in this chapter has also shown that an optimal algorithm for an unbounded number of resources does not imply that a scheduling strategy exists such that it can be scheduled optimally.


6. Strassen Matrix-Matrix Multiplication

Matrix multiplication is the underlying operation in many, if not most, of the applications in numerical linear algebra and, as such, it has garnered much attention. Algorithms such as the Cholesky factorization, the LU decomposition and, more recently, the QR-based Dynamically Weighted Halley iteration for the polar decomposition [38], spend a majority of their computational cost in matrix-matrix multiplication. The conventional BLAS Level 3 subprogram for matrix-matrix multiplication is of O(n^omega), where omega = log_2 8 = 3, computational cost, but there exist subcubic computational cost algorithms. In 1969, Volker Strassen [48] presented an algorithm that computes the product of two square matrices of size n x n, where n is even, using only 7 matrix multiplications at the cost of needing 18 matrix additions/subtractions, which then can be called recursively for each of the 7 multiplications. This compares to the standard cubic algorithm which requires 8 matrix multiplications and only 4 matrix additions. When Strassen's algorithm is applied recursively down to a constant size, the computational cost is O(n^omega) where omega = log_2 7 is approximately 2.807. Two years later, Shmuel Winograd proved that a minimum of 7 matrix multiplications and 15 matrix additions/subtractions, which is less than the 18 of Strassen's, are required for the product of two 2 x 2 matrices, see [54]. These discoveries have spawned a flurry of research over the years. In 1978, Pan [39] showed that omega < 2.796. In the late 1970's and early 1980's, Bini [12] provided omega < 2.78, with Schonhage [46] following up by showing omega < 2.522, but was usurped the following year by Romani [44] who discerned omega < 2.517. In 1986, Strassen brought forth a new approach which led to omega < 2.497. In 1990, Coppersmith and Winograd [23] improved upon Strassen's result, providing the asymptotic exponent omega < 2.376. This final result still stands, but it is conjectured that omega = 2 + epsilon for any epsilon > 0, where epsilon can be made as small as possible. Although the Coppersmith-Winograd algorithm may be reasonable to implement, since the constant of the algorithm is huge and will not provide an advantage except for very


large matrices, we will not consider it and instead focus on the Strassen-Winograd algorithm.

6.1 Strassen-Winograd Algorithm

Here we discuss the algorithm as it would be implemented to compute the product of two matrices A and B where the result is stored into matrix C. The algorithm is recursive, thus we describe one step. Given the input matrices A, B, and C, divide them into four submatrices,

    A = [ A11  A12 ]      B = [ B11  B12 ]      C = [ C11  C12 ]
        [ A21  A22 ],         [ B21  B22 ],         [ C21  C22 ],

then the 7 matrix multiplications and 15 matrix additions/subtractions are computed as depicted in Table 6.1, and Figure 6.1 shows the task graph of the Strassen-Winograd algorithm for one level of recursion.

Table 6.1: Strassen-Winograd Algorithm.

  Phase 1:   T1 = A21 + A22     T5 = B12 - B11
             T2 = T1  - A11     T6 = B22 - T5
             T3 = A11 - A21     T7 = B22 - B12
             T4 = A12 - T2      T8 = T6  - B21
  Phase 2:   Q1 = T2 * T6       Q5 = T1 * T5
             Q2 = A11 * B11     Q6 = T4 * B22
             Q3 = A12 * B21     Q7 = A22 * T8
             Q4 = T3 * T7
  Phase 3:   U1 = Q1 + Q2       C11 = Q2 + Q3
             U2 = U1 + Q4       C12 = U1 + U3
             U3 = Q5 + Q6       C21 = U2 - Q7
                                C22 = U2 + Q5

In essence, Strassen's approach is very similar to the observation that Gauss made concerning the multiplication of two complex numbers. The product (a + bi)(c + di) = (ac - bd) + (bc + ad)i would naively take four multiplications, but can actually be accomplished via three multiplications by observing that bc + ad = (a + b)(c + d) - ac - bd.
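The formulas of Table 6.1 can be transcribed directly for a dense matrix split into four equal blocks. The NumPy sketch below is an illustration only (dense blocks rather than tiles, one level of recursion) and checks the result against the ordinary product.

    import numpy as np

    def strassen_winograd_one_level(A, B):
        n = A.shape[0] // 2
        A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
        B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
        # Phase 1: 8 additions/subtractions
        T1 = A21 + A22;  T5 = B12 - B11
        T2 = T1 - A11;   T6 = B22 - T5
        T3 = A11 - A21;  T7 = B22 - B12
        T4 = A12 - T2;   T8 = T6 - B21
        # Phase 2: the 7 multiplications
        Q1 = T2 @ T6;  Q2 = A11 @ B11;  Q3 = A12 @ B21;  Q4 = T3 @ T7
        Q5 = T1 @ T5;  Q6 = T4 @ B22;   Q7 = A22 @ T8
        # Phase 3: 7 more additions/subtractions
        U1 = Q1 + Q2;  U2 = U1 + Q4;  U3 = Q5 + Q6
        C11 = Q2 + Q3; C12 = U1 + U3; C21 = U2 - Q7; C22 = U2 + Q5
        return np.block([[C11, C12], [C21, C22]])

    A, B = np.random.rand(8, 8), np.random.rand(8, 8)
    assert np.allclose(strassen_winograd_one_level(A, B), A @ B)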


Figure 6.1: Task graph for the Strassen-Winograd Algorithm. Execution time progresses from left to right. Large ovals depict multiplication and small ovals addition/subtraction.

6.2 Tiled Strassen-Winograd Algorithm

In our tiled version, the matrices are subdivided such that each submatrix is of the form

    M_ij = [ M_ij11  M_ij12  ...  M_ij1n ]
           [ M_ij21  M_ij22  ...  M_ij2n ]
           [   ...     ...   ...    ...  ]
           [ M_ijn1  M_ijn2  ...  M_ijnn ],

where the matrices M_ijkl are tiles of size n_b x n_b. As before, one can proceed with full recursion; unlike before, this would not terminate at the scalar level, but rather it would terminate with the multiplication of two tiles using a sequential BLAS Level 3 matrix-matrix multiplication. The recursion can also be cut off at a higher level, at which point the tiled matrix multiplication of Algorithm 6.1 computes the resulting multiplication. For the addition/subtraction of the submatrices in Phase 1 and Phase 3 of the Strassen-Winograd algorithm, a similar approach is used, which is also executed in parallel.


Algorithm 6.1: Tiled Matrix Multiplication (tiled_gemm).

    /* Input: n x n tiled matrices A and B; Output: n x n tiled matrix C such that C = A * B */
 1  for i = 1 to n do
 2      for j = 1 to n do
 3          for k = 1 to n do
 4              C_ij <- A_ik * B_kj + C_ij

If the cutoff for the recursion occurs before the tile level, the computation for each C_ij can be executed in parallel. Therefore our tiled implementation of the Strassen-Winograd algorithm exploits two levels of parallelism. Moreover, this allows some parts of the matrix multiplications to occur early on, as can be seen in Figure 6.2, which shows the DAG for a matrix of 4 x 4 tiles with one level of recursion. Both Figure 6.2 and Figure 6.1 illustrate one level of recursion, but the tiled task graph of a 4 x 4 tiled matrix clearly portrays the high degree of parallelism.

The conventional matrix-matrix multiplication algorithm requires 8 multiplications and 4 additions, whereas the Strassen-Winograd algorithm requires 7 multiplications and 15 additions/subtractions for each level of recursion. Therefore, there are more tasks for the Strassen-Winograd algorithm as compared to the conventional matrix-matrix multiplication, and it would behoove us to reduce the number of tasks, which would also reduce the algorithmic complexity. On the other hand, since we are reducing the number of multiplications, the computational cost is also reduced, since this requires a cubic operation versus the quadratic operation of the additions/subtractions.

The total number of tasks, T, of the Strassen-Winograd algorithm is given by

    T = 7^r (p/2^r)^3 + 15 * sum_{i=0}^{r-1} 7^(r-i-1) (p/2^(r-i))^2
      = p^3 (7/8)^r + 5 p^2 ((7/4)^r - 1).
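As a concrete, runnable reference for Algorithm 6.1, the NumPy sketch below operates on dense blocks instead of PLASMA tiles; it is an illustration only, and the tile size used in the test is an arbitrary assumption.

    import numpy as np

    def tiled_gemm(A, B, C, n, nb):
        # View A, B, C as an n x n grid of nb x nb tiles and accumulate C tile by tile.
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    C[i*nb:(i+1)*nb, j*nb:(j+1)*nb] += (
                        A[i*nb:(i+1)*nb, k*nb:(k+1)*nb] @
                        B[k*nb:(k+1)*nb, j*nb:(j+1)*nb])
        return C

    n, nb = 4, 8                                   # 4 x 4 tiles of size 8 (assumed, for the test)
    A, B = np.random.rand(n*nb, n*nb), np.random.rand(n*nb, n*nb)
    C = tiled_gemm(A, B, np.zeros((n*nb, n*nb)), n, nb)
    assert np.allclose(C, A @ B)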


Figure 6.2: Strassen-Winograd DAG for a matrix of 4 x 4 tiles with one level of recursion. Execution time progresses from left to right. Large ovals depict multiplication and small ovals addition/subtraction.


Table 6.2: Recursion levels which minimize the number of tasks for a tiled matrix of size p x p.

  p      r_min   Gflop SW    Gflop GEMM
  4      1       8.96e-01    1.02e+00
  8      1       7.15e+00    8.18e+00
  16     1       5.72e+01    6.55e+01
  32     1       4.57e+02    5.24e+02
  64     2       3.20e+03    4.19e+03
  128    3       2.24e+04    3.35e+04
  256    4       1.57e+05    2.68e+05
  512    5       1.09e+06    2.14e+06
  1024   6       7.69e+06    1.71e+07

The task count T is minimized at recursion level r_min, where

    r_min = ceil( ln( p ln(8/7) / (5 ln(7/4)) ) / ln 2 ),

and the total number of flops, F, is given by

    F = m * 7^r (p/2^r)^3 + 15 a * sum_{i=0}^{r-1} 7^(r-i-1) (p/2^(r-i))^2
      = m p^3 (7/8)^r + 5 a p^2 ((7/4)^r - 1),

where m = 2 n_b^3 - n_b^2 for the multiplications and a = n_b^2 for the additions/subtractions. As we increase the recursion, the number of tasks will decrease up to a certain point, r_min. The reason for this is that at each recursion we reduce the number of p^3 tasks by 1/8 while increasing the p^2 tasks by 15.

In our experiments, letting the tile size n_b = 200 provided the best performance. Thus Table 6.2 shows the corresponding values of the minimizing recursion level for various numbers of tiles. However, as can be seen in Table 6.3, the r_min which provides the minimum number of tasks does not provide the least amount of computational cost. The computational costs will be minimized at the full recursion.

Even though the number of tasks, and thereby the computational complexity, are minimized for r_min, Table 6.4 shows that the critical path length increases exponentially with the recursion levels.
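The counting formulas above are easy to evaluate directly. The Python sketch below is an illustration only; the clamping of r_min to at least one level is an assumption made to match the range of recursion levels used in Table 6.2.

    from math import ceil, log

    def tasks(p, r):
        return p**3 * (7/8)**r + 5 * p**2 * ((7/4)**r - 1)

    def flops(p, r, nb):
        m, a = 2*nb**3 - nb**2, nb**2          # per-tile multiply and add/subtract costs
        return m * p**3 * (7/8)**r + 5 * a * p**2 * ((7/4)**r - 1)

    def r_min(p):
        return max(1, ceil(log(p * log(8/7) / (5 * log(7/4)), 2)))

    print(r_min(128))                          # 3, as in Table 6.2
    print(int(tasks(128, 1)))                  # 1,896,448, the first Strassen-Winograd row of Table 6.3
    print(round(flops(4, 1, 200) / 1e9, 2))    # about 0.9 Gflop, the first row of Table 6.2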


Table 6.3: 128 x 128 tiles of size n_b = 200.

  Algorithm            Recursion   Tasks        Gflop
  strassen winograd    1           1,896,448    2.92e+04
                       2           1,774,592    2.56e+04
                       3           1,762,048    2.24e+04
                       4           1,915,712    1.96e+04
                       5           2,338,288    1.72e+04
                       6           3,212,252    1.51e+04
                       7           4,859,338    1.33e+04
  tiled DGEMM          --          4,177,920    3.36e+04

Table 6.4: Comparison of the total number of tasks and critical path length for a matrix of p x p tiles.

  p    r   # tasks    CP      Ratio
  4    0   64         2       32.0
       1   116        7       16.6
  8    0   512        3       170.7
       1   688        9       76.4
       2   1,052      52      20.2
  16   0   4,096      4       1,023.5
       1   4,544      13      349.5
       2   5,776      66      87.5
       3   8,324      361     23.1
  32   0   32,768     5       6,553.6
       1   32,512     21      1,548.2
       2   35,648     94      379.2
       3   44,272     459     96.5
       4   62,108     2,524   24.6
  64   0   262,144    6       43,690.7
       1   244,736    37      6,614.5
       2   242,944    150     1,619.6
       3   264,896    655     404.4
       4   325,264    3,210   101.3


Thus the amount of parallelism is likewise reduced for each recursion level.

6.3 Related Work

In practice the Strassen-Winograd algorithm imparts a large amount of overhead for small matrices, thus it is customary to overcome this issue by using it in conjunction with a conventional matrix multiplication operation. The key idea is to provide a recursion cutoff point such that once this is reached, the algorithm switches from calling itself recursively to calling, e.g., the BLAS Level 3 matrix multiplication. The recursion cutoff point is a tuning parameter which depends upon the machine architecture and can be either dynamic or static. In [49], one level of recursion is used, while Chou [21] provides two levels of recursion by explicitly coding the 49 matrix multiplications which then are processed by either 7 or 49 processors. Our approach is to provide the recursion cutoff point as a parameter which can be set by the user.

There are various methods which can be used to deal with nonsquare matrices and/or matrices of odd order. Methods such as static padding, dynamic padding, and dynamic peeling all provide these mechanisms. In this chapter, we only study matrices of tile order p = 2^k, i.e., n = 2^k n_b.

In [17], Boyer et al. propose schedules for both in-place and out-of-place implementations without the need for extra computations. Discovery of these algorithms was accomplished by using an exhaustive search algorithm. Their out-of-place algorithm makes use of the resultant matrix for temporary storage for the intermediate computations. This introduces unwanted data dependencies and more data movement, leading to a loss of parallelism. Hence, we do not consider their out-of-place algorithm and focus on the classical Strassen-Winograd algorithm by using temporary storage allocations for all of the intermediate computations. As an example, given a tiled matrix of 128 x 128 tiles of size 200 x 200, the input matrices and resultant matrix require 1.96608 GB of storage, and allowing full recursion for our algorithm


Algorithm 6.2: Tiled Strassen-Winograd (tiled_gesw).

    /* p is equal to the recursion cutoff: do the multiplication */
 1  if p = r then
 2      tiled_gemm(p, A, B, C)
 3  else
        /* p is greater than the recursion cutoff, so we split the problem in half */
 4      p = p/2
 5      /* Phase 1 */
 6      tiled_geadd(p, A21, A22, T1)
 7      tiled_geadd(p, T1,  A11, T2)
 8      tiled_geadd(p, A11, A21, T3)
 9      tiled_geadd(p, A12, T2,  T4)
10      tiled_geadd(p, B12, B11, T5)
11      tiled_geadd(p, B22, T5,  T6)
12      tiled_geadd(p, B22, B12, T7)
13      tiled_geadd(p, T6,  B21, T8)
14      /* Phase 2 */
15      tiled_gesw(p, T2,  T6,  Q1)
16      tiled_gesw(p, A11, B11, Q2)
17      tiled_gesw(p, A12, B21, Q3)
18      tiled_gesw(p, T3,  T7,  Q4)
19      tiled_gesw(p, T1,  T5,  Q5)
20      tiled_gesw(p, T4,  B22, Q6)
21      tiled_gesw(p, A22, T8,  Q7)
22      /* Phase 3 */
23      tiled_geadd(p, Q1, Q2, U1)
24      tiled_geadd(p, U1, Q4, U2)
25      tiled_geadd(p, Q5, Q6, U3)
26      tiled_geadd(p, Q2, Q3, C11)
27      tiled_geadd(p, U1, U3, C12)
28      tiled_geadd(p, U2, Q7, C21)
29      tiled_geadd(p, U2, Q5, C22)


would require an additional 3.2766 GB of temporary storage (see Figure 6.3).

Figure 6.3: Required extra memory allocation for temporary storage for varying recursion levels.

In [27] a sequential implementation of Strassen-Winograd is studied by Douglas et al., which also provides means for a hybrid parallel implementation where the lower level is the sequential Strassen-Winograd and the upper level algorithm is the standard subdivided matrix multiplication. Comparisons are made to sequential implementations available in IBM's ESSL and Cray's Scientific and Math library. Although this would have been an interesting project in and of itself within a tiled framework, the emphasis in this chapter was to provide the Strassen-Winograd at the upper level.

After this work was completed, Ballard et al. published work which places Strassen-Winograd in a parallel distributed environment [9]. The paper is well written and they do show improvement over the standard algorithm for matrices of order over 94,000. Their algorithm is communication optimal and applies better to the distributed environment than the shared memory environment, seeing that we do not have as much control over the memory distribution.

6.4 Experimental results

All experiments were performed on a 48-core machine composed of eight hexa-core AMD Opteron 8439 SE (code name Istanbul) processors running at 2.8 GHz.


Each core has a theoretical peak of 11.2 Gflop/s, with a peak of 537.6 Gflop/s for the whole machine. The Istanbul micro-architecture is a NUMA architecture where each socket has 6 MB of level-3 cache and each processor has a 512 KB level-2 cache and a 128 KB level-1 cache. After having benchmarked the AMD ACML and Intel MKL BLAS libraries, we selected MKL since it appeared to be slightly faster in our experimental context. Using MKL, for DGEMM each core has a peak of 9.7 Gflop/s, with a peak of 465.6 Gflop/s for the whole machine. Linux 2.6.32 and Intel Compilers 11.1 were also used in conjunction with PLASMA 2.3.1.

The parameter for tile size has a direct effect on the amount of data movement and the efficiency of the kernels. Figure 6.4a presents the performance comparison for varying tile sizes, indicating n_b = 200 as the most efficient. As expected, it is also evidenced that the efficiency of the Strassen-Winograd algorithm is dependent upon the efficiency of the tiled DGEMM. When running on 48 threads, increasing the recursion level decreases the performance of the algorithm as depicted in Figure 6.4 (r = 0 is the tiled matrix-matrix multiplication).

Figure 6.4: Comparison of tuning parameters n_b and r. (Panels: (a) tile size comparison; (b) recursion level comparison.)

Our implementation allows for the tuning of the recursion level, which can range from one recursion up to full recursion. In Figure 6.4b, matrices of 64 x 64 tiles are


used and the recursion level ranges from one to five. Although r_min = 2 for 64 x 64 tiles, the best performance using 48 threads is seen at r = 1. This is due to the amount of parallelism lost by having the critical path length increase as the recursion level increases, which offsets any gains from the reduction in tasks, computational cost, and complexity.

Figure 6.5: Scalability and efficiency comparisons on 48 threads with matrices of 64 x 64 tiles and n_b = 200. (Panels: (a) scalability comparison; (b) efficiency comparison.)

Figures 6.5a and 6.5b illustrate the performance and efficiency reached by the Strassen-Winograd implementation, with r = 1, as compared to the multithreaded MKL DGEMM and the tiled DGEMM implementation. The Strassen-Winograd implementation outperforms the tiled DGEMM up to the point where we lose parallelism. However, both show sub-par performance when compared to the multithreaded MKL implementation.

If we run on 12 cores (a typical current architecture for a node), then we do outperform tiled DGEMM (Algorithm 6.1) if a loss of parallelism is not a factor and there is not too much data movement, i.e., keep n_b small enough so that more tiles fit into the cache but large enough to retain the efficiency of the GEMM kernel. As depicted in Figure 6.6, the performance of the Strassen-Winograd algorithm is best when r = r_min, since the number of tasks and computational complexity is minimized, which reflects the analysis of Section 6.2.


Figure 6.6: Scalability and efficiency comparisons executed on 12 threads with matrices of 64 x 64 tiles and n_b = 200. (Panels: (a) scalability comparison; (b) efficiency comparison.)

6.5 Conclusion

In this chapter we have shown and analyzed an implementation of the Strassen-Winograd algorithm in a tiled framework and provided comparisons to the multithreaded MKL standard library as well as a tiled matrix multiplication. The interest in this implementation is that it can support any level of recursion and any level of parallelism through the use of the recursion and tile size parameters.

Although our implementation did not perform as well as the highly tuned and optimized multithreaded MKL library, on 12 cores with 2 levels of recursion its performance was only lower by about 2%. Ultimately, it is a formidable task to surpass the MKL implementation considering the computational complexity of a recursive tiled algorithm.


7. Conclusion

In this thesis, we have studied tiled algorithms both theoretically and experimentally. Our aim was to alleviate the constraints of memory boundedness, task granularity, and synchronicity imposed by the LAPACK library, as brought to light in the Cholesky Decomposition example in the introduction. Moreover, we have also detailed that one may translate the LAPACK algorithms directly to tiled algorithms, while in other cases a new approach may provide better performance gains.

In the study of the Cholesky Inversion, the tiled algorithms provided a unique insight into the interaction of the three distinct steps involved in the algorithm and how these may be intertwined. We had observed that the choice of the variant in the inversion of the triangular factor (Step 2) has a great impact on the performance of the algorithm. In the scope of an unlimited number of processors, this choice can lead to an algorithm which performs the inversion of the matrix in almost the same amount of time it takes to do the Cholesky factorization. Moreover, we note that the combination of the variants with the shortest critical path length does not translate into the best performing pipelined algorithm. Even though variant 3 of Step 2 (the triangular inversion) did not provide us with the shortest critical path length within itself, combined with the other two steps it does provide a Cholesky Inversion algorithm that performed the best overall.

We have also observed that a simple translation from the LAPACK routines may not provide a tiled algorithm with the best performance. We showed that loop reversals are needed to alleviate anti-dependencies which negatively affect the performance of the tiled algorithms.

In the case where a tiled algorithm already existed, namely the QR Factorization, we made use of a new tiled algorithm to improve upon performance. These algorithms exhibit more parallelism than state-of-the-art implementations based on elimination trees. Using ideas from the 1970/80's, we have theoretically shown that the new


algorithm, GrASAP, is optimal in the scope of an unbounded number of processors. We have provided accurate estimations for the length of the critical paths for all of these new tiled algorithms and have provided explicit formulas for some of the algorithms.

In the framework of a bounded number of processors, our theoretical work has afforded a new bound which more accurately reflects the performance expectations of the tiled algorithms. In the case of the Cholesky Factorization, the schedule produced using the Critical Path Method proved to be within seven percent of the ALAP Derived bound, indicating that this scheduling strategy is well suited for this application. The ALAP Derived bound has also been used as a tool to show that optimality in the scope of an unbounded number of processors does not translate to optimality in the scheduling on a bounded number of processors.

Overall, the theoretical and experimental portions of this thesis give credence to the impact of tiled algorithms for multi/many-core architectures.


APPENDIXA.IntegerProgrammingFormulationofTiledQR A.1IPFormulation Weformulatetheprobleminquestionusingintegerprogramming.Anequivalent binaryprogrammingformulationwasalsoconstructedwithtimeasanadditional index,butwefoundtheintegerformulationsweremorequicklysolved. Let i j ,and h denoterowsranging1 ;:::;p ; k l l 1 denotecolumnsranging 1 ;:::;q ; r s ,and t denotetimestepsranging1 ;:::;T .Theupperboundonthe numberoftimesteps, T ,maycomefromanyexistingalgorithmgreedy,ASAP, GRASP,etc..Letthedecisionvariablesbedenedasfollows: A.1.1Variables Letall t beanintegerrangingfromzeroto T denotingthetimewecompletethe followingtasksand t =0meansthetaskisneverperformed. UNMQR: CompleteapplyingthereectorsfromGEQRTacrosstherow.Requires6unitsoftime. w ikl = t 2 [0 ;T ]:ifwenishtheupdateoftile i;k attime t A.1 Thisupdatewasnecessitatedby x il = s : l
PAGE 113

TTQRT: Cancelonetriangulartileusinganothertriangletile.Requires2 unitsoftime. z ijk = t 2 [0 ;T ]:ifwecompletezeroingtile i;k usingtile j;k attheendoftimeunit t A.4 Forthe y and z actions,itisusefultohaveabinaryvariablewhichis1iftheaction occurs.Explicitly, ^ y ijkl = 8 > < > : 1:if y ijkl > 0 0:otherwise ^ z ijk = 8 > < > : 1:if z ijk > 0 0:otherwise A.5 A.1.2Constraints 1.Timeconstraintsforeachofthefouractions: aTimefor w ikl i. w ikl mustoccuratleast3timestepsafterearlier w updates. w ikl w ikl 1 +3 8 k 2 ;i l;l 1
PAGE 114

bTimefor x ik i. x ik mustoccur2timestepsafterany w updates.Identicaltoequation A.9. ii. x ik mustoccur2timestepsafterany y updates. x ik y ijkl + y jikl +2 8 j;l;i k;l
PAGE 115

Wedenethebinaryvariables 1 hijkl 2 hijkl 3 hijkl ,and 4 hijkl ,andinclude thedisjunctiveconstraints y ijkl + y jikl +3 y hikl + y ihkl + )]TJ/F15 11.9552 Tf 12.747 0 Td [(^ y hikl )]TJ/F15 11.9552 Tf 12.747 0 Td [(^ y ihkl T + 1 hijkl T A.13 8 h;i;j;ll;k 2 y ijkl +3 z hjk + z jhk + )]TJ/F15 11.9552 Tf 12.664 0 Td [(^ z hjk )]TJ/F15 11.9552 Tf 12.664 0 Td [(^ z jhk T 8 h;i;j;k>l;k 2 A.19 dTimefor z ijk i. z ijk mustoccur1timestepafterany w action.Identicaltoequation A.10. ii. z ijk mustoccur1timestepafterany y action.Identicaltoequation A.19. iii. z ijk mustoccur1timestepafterany x action.Identicaltoequation A.12. 102

PAGE 116

iv. z ijk mustoccur1timestepbeforeorafteranyother z action. A.Case1.Weuse i;k tozero j;k and i;k tozero h;k Weneedtoenforce z jik +1 z hik + )]TJ/F15 11.9552 Tf 12.664 0 Td [(^ z hik T or z hik +1 z jik + )]TJ/F15 11.9552 Tf 12.664 0 Td [(^ z jik T Todothis,wedenebinaryvariables 5 hijk and 6 hijk andinclude thedisjunctiveconstraintsasfollows. z jik +1 z hik + )]TJ/F15 11.9552 Tf 12.665 0 Td [(^ z hik T + 5 hijk T 8 h;i;j;k A.20 z hik +1 z jik + )]TJ/F15 11.9552 Tf 12.665 0 Td [(^ z jik T + 6 hijk T 8 h;i;j;k A.21 5 hijk + 6 hijk 1 8 h;i;j;k A.22 B.Case2.Weuse i;k tozero j;k and h;k tozero i;k We needtoenforce z jik +1 z ihk + )]TJ/F15 11.9552 Tf 12.664 0 Td [(^ z ihk T 8 h;i;j;k i>j ??A.23 2.Atilecan'tzeroitself,andhencewecan'tupdatejustasinglerowafterwards. z iik =0 8 i;ky iikl =0 8 i;k;l A.24 3.BothtilesinvolvedinTTQRTmustbetrianglesbeforeonecanzeroanother x ik )]TJ/F15 11.9552 Tf 12.665 0 Td [(^ z ijk T + z ijk 8 i;j;k;x jk )]TJ/F15 11.9552 Tf 12.664 0 Td [(^ z ijk T + z ijk 8 i;j;k A.25 4.Forceupdatesaftertriangleandzeroingactions. aAfteratileistriangularized,updatesmustoccurinthenextcolumn. x ik w ilk )]TJ/F15 11.9552 Tf 11.955 0 Td [(3 8 i;kk A.26 103

PAGE 117

bAfteratileiszeroed,updatesmustoccurinthenextcolumn. z ijk y ijlk + y jilk 8 i;j;kl A.28 6.Noupdatestoatilecanoccuraftertriangularization. x ik w ikl 8 i k;k>lx ik y ijkl + y jikl 8 i k;j;k>l A.29 7.Afteratile i;k iszeroed,wecan'tuseittozero. z ijk z hik 8 h;i;j;k A.30 8.Tilesonorbelowthediagonalmustbetriangularizedatsomepoint,andwe can'tnishatriangularizationuntiltimestep2. x ik 2 8 i k A.31 9.Tilesstrictlybelowthediagonalmustbezeroedatsomepoint. X j ^ z ijk =1 8 i>k A.32 10.Can'ttriangularizeabovethediagonal. x ik =0 8 i

A.1.2.1Precedenceconstraints ThemostcumbersomeconstraintstoformulatearethoseforcingthecorrespondingTTQRTandTTMQRoperationstobeexecutedinthesameorder. Foreachtile i;k involvedinazeroingprocesswith j;k and h;k ,theorder oftheupdatesmustfollowtheorderofthezeroingprocesses.Wewant z jik > > > < > > > > : 1:ifwe[use i;k tozero h;k ]andalso[use i;k tozero j;k ] 0:otherwise A.36 a 1 hijk ^ z hik a 1 hijk ^ z jik a 1 hijk +1 ^ z hik +^ z jik A.37 2. a 2 a 2 hijk = 8 > > > > < > > > > : 1:ifwe[use h;k tozero i;k ]after[using i;k to zero j;k ] 0:otherwise A.38 a 2 hijk ^ z ihk a 2 hijk ^ z jik a 2 hijk +1 ^ z ihk +^ z jik A.39 3. b b hijk = 8 > < > : 1:if[ z hik >z jik ]regardlessofwhether[ z hik =0] 0:otherwise A.40 105


Tb hijk z hik )]TJ/F20 11.9552 Tf 11.955 0 Td [(z jik T b hijk )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 z hik )]TJ/F20 11.9552 Tf 11.955 0 Td [(z jik A.41 4. c 1 c 1 hijk = 8 > < > : 1:if[ z hik >z jik ] 0:otherwise A.42 c 1 hijk a 1 hijk c 1 hijk b hijk c 1 hijk +1 a 1 hijk + b hijk A.43 5. c c hijk = 8 > > > > < > > > > : 1:ifzeroingactionsincolumn k ofrows h and i to occurafterthezeroingactionsofrows i and j 0:otherwise A.44 c hijk c 1 hijk c hijk a 2 hijk c hijk c 1 hijk + a 2 hijk A.45 Sowenowhaveavariable c hijk thatis1whenupdatesofrow h and i mustcome beforetheupdatesof i and j .Wecandenesimilarvariablesfortheupdates ratherthanthezeros. 6. d d hijk = 8 > > > > > > > < > > > > > > > : 1:ifwe[update i;k and h;k together]andalso [update i;k and j;k together].Allupdates occurbecauseofzeroingincolumn k )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 0:otherwise A.46 d hijk ^ y hik k )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 +^ y ihk k )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 d hijk ^ y jik k )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 +^ y ijk k )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 d hijk +1 ^ y hik k )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 +^ y ihk k )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 +^ y jik k )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 +^ y ijk k )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 A.47 106

7. $e$

$e_{hijk} = 1$ if the updates in column $k$ of rows $h$ and $i$ are to occur after the updates of rows $i$ and $j$, or the updates in rows $i$ and $j$ never happen, and $0$ otherwise.   (A.48)

$T e_{hijk} \ge (y_{hik,k-1} + y_{ihk,k-1}) - (y_{jik,k-1} + y_{ijk,k-1})$,  $T (e_{hijk} - 1) \le (y_{hik,k-1} + y_{ihk,k-1}) - (y_{jik,k-1} + y_{ijk,k-1})$   (A.49)

8. $f$

$f_{hijk} = 1$ if the updates in column $k$ of rows $h$ and $i$ are to occur after the updates of rows $i$ and $j$, and $0$ otherwise.   (A.50)

$f_{hijk} \le d_{hijk}$,  $f_{hijk} \le e_{hijk}$,  $f_{hijk} + 1 \ge d_{hijk} + e_{hijk}$   (A.51)

Lastly, to force the updates in order:

$c_{hijk} \le f_{hijl} \quad \forall h,i,j,\ k < l$
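The auxiliary variables above combine two standard linearization patterns: the logical AND or OR of binary variables, as in (A.37), (A.43), and (A.45), and a big-M indicator for an ordering between two bounded integer quantities, as in (A.41) and (A.49). The following short Python sketch is illustrative only (the generic names a, b, c, w, x, y and the bound T are not those of the model); it checks by brute force that each pattern forces the intended logical value.

    # Sketch of the two linearization patterns used for the auxiliary variables.
    # Illustrative only; variable names are generic, not those of the model.
    from itertools import product

    T = 10  # big-M constant: an upper bound on any difference x - y below

    def and_feasible(c, a, b):
        # Linearization of c = a AND b, as in (A.37) and (A.43):
        # c <= a, c <= b, c + 1 >= a + b.
        return c <= a and c <= b and c + 1 >= a + b

    def or_feasible(c, a, b):
        # Linearization of c = a OR b, as in (A.45):
        # c >= a, c >= b, c <= a + b.
        return c >= a and c >= b and c <= a + b

    def indicator_feasible(w, x, y):
        # Big-M indicator for "x > y", as in (A.41) and (A.49):
        # T*w >= x - y and T*(w - 1) <= x - y.
        return T * w >= x - y and T * (w - 1) <= x - y

    # For every input pair, the feasible binary value must agree with the
    # intended logical meaning.
    for a, b in product((0, 1), repeat=2):
        assert [c for c in (0, 1) if and_feasible(c, a, b)] == [a & b]
        assert [c for c in (0, 1) if or_feasible(c, a, b)] == [a | b]

    for x, y in product(range(4), repeat=2):
        feasible = [w for w in (0, 1) if indicator_feasible(w, x, y)]
        if x > y:
            assert feasible == [1]
        elif x < y:
            assert feasible == [0]
        else:
            assert feasible == [0, 1]  # a tie leaves the indicator unconstrained

    print("all linearization checks passed")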
Glossary

A schematic sketch showing how several of the routines defined below combine in a tiled QR factorization is given at the end of this glossary.

Anti-dependency: Dependency which occurs when an instruction requires a value that is later updated. Also known as a Write-After-Read dependency.

BLAS: Basic Linear Algebra Subprograms are routines that provide standard building blocks for performing basic vector and matrix operations.

DAG: Directed Acyclic Graph, in which each node represents a task and each edge represents the data dependencies between the tasks.

Elimination List: Table which provides the ordered list of the transformations used to zero out all the tiles below the diagonal.

GEMM: Routine which is part of the BLAS that computes the product of two general matrices.

GEQRT: Routine which is part of LAPACK that constructs a compact WY representation to apply a sequence of $i_b$ Householder reflections.

LAPACK: Linear Algebra PACKage provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems.

LAUUM: Routine which is part of LAPACK that computes the product of an upper or lower triangular matrix with its transpose.

Pipelining: An implementation technique where multiple instructions are overlapped in execution.

PLASMA: Parallel Linear Algebra Software for Multicore Architectures provides routines to solve dense general systems of linear equations, symmetric positive
definite systems of linear equations, and linear least squares problems, using LU, Cholesky, QR and LQ factorizations.

POTRF: Routine which is part of LAPACK that computes the Cholesky factorization of a symmetric positive definite matrix.

ScaLAPACK: Scalable Linear Algebra PACKage is a library of high-performance linear algebra routines for parallel distributed-memory machines which solves dense and banded linear systems, least squares problems, eigenvalue problems, and singular value problems.

Strong Scalability: Scalability which shows how the solution time varies with the number of processors for a fixed total problem size.

SYRK: Routine which is part of the BLAS that computes the rank-k update of the upper or lower triangular component of a symmetric matrix.

TRMM: Routine which is part of the BLAS that computes the product of a general matrix with an upper or lower triangular matrix.

TRSM: Routine which is part of the BLAS that solves a triangular linear equation.

TRTRI: Routine which is part of LAPACK that computes the inverse of an upper or lower triangular matrix.

TSMQR: Multiplies a complex matrix, formed by stacking two square matrices one on top of the other, by the unitary matrix Q of the QR factorization formed by TSQRT.

TSQRT: Routine which constructs a compact WY representation of a matrix composed of an upper triangular matrix stacked on top of a square matrix.

TTMQR: Multiplies a complex matrix, formed by stacking two square matrices one on top of the other, by the unitary matrix Q of the QR factorization formed by TTQRT.

TTQRT: Routine which constructs a compact WY representation of a matrix composed of two upper triangular matrices stacked one on top of the other.

UNMQR: Multiplies a complex matrix by the unitary matrix Q of the QR factorization formed by GEQRT.

WAR: Write-After-Read. An anti-dependency.
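To illustrate how several of the routines defined in this glossary fit together, the following is a schematic sketch in Python of the task sequence of a tiled QR factorization driven by an elimination list. The function and variable names are placeholders chosen for this sketch; they mirror the kernel names above but are not the LAPACK or PLASMA interfaces, and the tasks are listed sequentially, ignoring the DAG scheduling that a runtime such as PLASMA would perform.

    # Schematic sketch of the task sequence of a tiled QR factorization driven
    # by an elimination list.  The kernel names mirror the glossary entries but
    # are used here only as task labels; this is NOT the LAPACK/PLASMA API.

    def tiled_qr_tasks(tile_rows, tile_cols, elimination_list):
        """elimination_list[k] is an ordered list of (piv, i) pairs meaning
        'use tile (piv, k) to zero tile (i, k)'.  Returns the kernel tasks in a
        valid sequential order (the parallel DAG scheduling is ignored here)."""
        tasks = []
        for k in range(min(tile_rows, tile_cols)):
            # Triangularize every tile of column k on or below the diagonal,
            # then update the trailing tiles of its row (GEQRT + UNMQR).
            for i in range(k, tile_rows):
                tasks.append(("GEQRT", i, k))
                for col in range(k + 1, tile_cols):
                    tasks.append(("UNMQR", i, k, col))
            # Zero the tiles strictly below the diagonal in the order given by
            # the elimination list, updating the trailing columns after each
            # elimination (TTQRT + TTMQR).
            for piv, i in elimination_list[k]:
                tasks.append(("TTQRT", piv, i, k))
                for col in range(k + 1, tile_cols):
                    tasks.append(("TTMQR", piv, i, k, col))
        return tasks

    # Example: a 3x2 tile matrix with a flat-tree elimination list, in which the
    # diagonal tile of each column is used to zero every tile below it.
    elimination_list = {0: [(0, 1), (0, 2)], 1: [(1, 2)]}
    for task in tiled_qr_tasks(3, 2, elimination_list):
        print(task)

Different elimination lists (flat tree, binary tree, and so on) produce different task orders and hence different amounts of parallelism; choosing among such orders is the trade-off that the integer program formulated in the appendix explores.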