Citation
Parallel inversion of large matrices on graphics hardware

Material Information

Title:
Parallel inversion of large matrices on graphics hardware
Creator:
Treadwell, Jason E ( author )
Place of Publication:
Denver, Colo.
Publisher:
University of Colorado Denver
Publication Date:
Language:
English
Physical Description:
1 electronic file (59 pages)

Thesis/Dissertation Information

Degree:
Master's (Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Computer Science and Engineering, CU Denver
Degree Disciplines:
Computer science

Subjects

Subjects / Keywords:
Computer graphics ( lcsh )
Matrices ( lcsh )
Computer graphics ( fast )
Matrices ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Review:
Matrix inversion is an important operation in domains ranging from graphics and real-time applications to nonlinear optimization and regression analysis. Traditional methods of matrix inversion have cubic run times that scale poorly as the matrix dimension grows.
Review:
A method by M. Altman produces a sequence of cubically and monotonically-convergent approximations of the matrix inverse using a few matrix multiplications and additions each iteration. While these operations themselves are hardly trivial, the homogeneous and independent nature of their constituent computations makes them prime candidates for parallel acceleration using recent advances in GPU (graphics processing unit) hardware.
Review:
Using a consumer GPU, we show that matrices can be quickly inverted with high fidelity using Altman's method. We exploit the error tolerance of the method to gain further speed by using mixed-precision computations and compare the results to those from double-precision and CPU computations.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: Adobe Reader.
Statement of Responsibility:
by Jason E. Treadwell.

Record Information

Source Institution:
University of Colorado Denver Collections
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
987243395 ( OCLC )
ocn987243395
Classification:
LS1193.E52 2016m T74 ( lcc )

Full Text
PARALLEL INVERSION OF LARGE MATRICES ON GRAPHICS HARDWARE
by
JASON E. TREADWELL B.S., University of Colorado, 2008
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science, Computer Science Program
2016


This thesis for the Master of Science degree by Jason E. Treadwell has been approved for the Department of Computer Science by
Gita Alaghband, Chair
Tom Altman, Advisor
Farnoush Banaei-Kashani
Boris Stilman
December 17, 2016


Treadwell, Jason E. (M.S., Computer Science)
Parallel Inversion of Large Matrices on Graphics Hardware
Thesis directed by Professor Tom Altman
ABSTRACT
Matrix inversion is an important operation in domains ranging from graphics and real-time applications to nonlinear optimization and regression analysis. Traditional methods of matrix inversion have cubic runtimes that scale poorly as the matrix dimension grows.
A method by M. Altman produces a sequence of cubically and monotonically-convergent approximations of the matrix inverse using a few matrix multiplications and additions each iteration. While these operations themselves are hardly trivial, the homogeneous and independent nature of their constituent computations makes them prime candidates for parallel acceleration using recent advances in GPU (graphics processing unit) hardware.
Using a consumer GPU, we show that matrices can be quickly inverted with high fidelity using Altman's method. We exploit the error tolerance of the method to gain further speed by using mixed-precision computations and compare the results to those from double-precision and CPU computations.
The form and content of this abstract are approved. I recommend its publication.
Approved: Tom Altman


TABLE OF CONTENTS
CHAPTER
I. INTRODUCTION.................................................. 1
1.1 Matrices and Their Inverses............................... 1
1.2 Altman's Iterative Matrix Inversion Method................ 3
1.3 Computational Aspects of Altman's Method ................. 6
II. COMPUTER LINEAR ALGEBRA....................................... 7
2.1 BLAS and OpenBLAS ........................................ 7
2.2 Parallel Computation on the GPU with CUDA................. 8
2.3 CUDA Linear Algebra with cuBLAS.......................... 12
III. IMPLEMENTATION OF ALTMAN'S METHOD........................... 14
3.1 Development Environment and General Architecture......... 14
3.2 The Initial Inverse Approximation........................ 15
3.3 Computation of the Sequence.............................. 16
3.4 Recovery from Divergence................................. 18
IV. EXPERIMENTAL RESULTS....................................... 20
4.1 Testing Environment and Overview of Experiments ......... 20
4.2 Diagonally-Dominant Random Matrices...................... 22
4.3 Random Matrices.......................................... 25
4.4 Real-World Large Matrices................................ 27
4.5 Hilbert Matrices......................................... 29
4.6 Comparison of the Orders of Convergence.................. 30
V. CONCLUSION.................................................. 33
5.1 Implications of Results ................................. 33
5.2 Further Research......................................... 33
REFERENCES..................................................... 35
APPENDIX


A. ACMI Inverter Usage.......................................................... 38
B. Sample ACMI Inversion Runs.................................................. 39
C. Selected Source Code......................................................... 43


FIGURES
FIGURE
2.1 CPU/GPU Architectural Differences....................................... 8
2.2 Heterogeneous Programming Model........................................ 10
2.3 CUDA Grid/Block/Thread Model .......................................... 11
4.1 GeForce GTX 1080 Streaming Multiprocessor.............................. 21
4.2 Inversion Runtime for n x n Diagonally-Dominant Random Matrices . 23
4.3 Inversion Runtime for n x n Random Matrices.......................... 25
4.4 Inversion Runtime for n x n Boolean Matrices......................... 26
4.5 Inversion Runtime for Large, Sparse Matrices.......................... 28
4.6 Inversion Error for n x n Hilbert Matrices............................. 30
4.7 Order-p Inversion Runtime for Diagonally-Dominant Random Matrices . 31
4.8 Order-p Inversion Runtime for Random Matrices ......................... 32
4.9 Total Order-p Inversion Iterations for Random Matrices ................ 32


TABLES
TABLE
2.1 Differences between a CPU and a GPU..................................... 9
3.1 BLAS Matrix/Vector Routines Used....................................... 16
3.2 Implemented Custom Matrix Routines..................................... 17
4.1 GeForce GTX 1080 Specifications........................................ 20
4.2 Sparse Matrix Collection Specimens .................................... 27


CHAPTER I
INTRODUCTION
1.1 Matrices and Their Inverses
An m x n matrix is a two-dimensional array of m rows and n columns of real-valued or complex elements. Matrices are used to model systems of linear equations, graphs, raster graphics and manifold other computational phenomena. Consider a system of three equations over three scalar variables $x_1$, $x_2$ and $x_3$:
$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + a_{13}x_3 &= b_1, \\ a_{21}x_1 + a_{22}x_2 + a_{23}x_3 &= b_2, \\ a_{31}x_1 + a_{32}x_2 + a_{33}x_3 &= b_3. \end{aligned}$$
This linear system can be represented as a matrix of the coefficients times a vector (or n x 1 matrix) of the variables equal to the vector of the right-hand side:
$$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}.$$
The same system is represented concisely and algebraically as
Ax = b,
where capital and small symbols represent matrices and vectors, respectively.
Addition, multiplication and other operations may be performed on matrices. The element $c_{ij}$ in the $i$th row and $j$th column of the matrix sum $C = A + B$ is simply the sum of the elements from the corresponding positions: $a_{ij} + b_{ij}$. Each element $c_{ij}$ of the matrix product $C = AB$ is the dot product of the $i$th row of $A$ and
the $j$th column of $B$. An m x p matrix is multiplied on the right by a p x n matrix to yield an m x n result; each row from the left must correspond with one column on the right. Multiplication of the matrix $(a_{ij})$ by a scalar $c$ is commutative and yields $(ca_{ij})$.
The algebra of matrices admits the inverse of a square (i.e., n x n) matrix1. Analogous to scalar algebra, where the inverse of a scalar $x$ satisfies $x^{-1}x = 1$, a matrix multiplied by its inverse produces the n x n identity matrix $I$ of ones on the main diagonal and zeros elsewhere.
$$A^{-1}A = AA^{-1} = I = (i_{jk}), \qquad i_{jk} = \begin{cases} 1 & \text{if } j = k, \\ 0 & \text{otherwise}, \end{cases}$$
$$AI = IA = A.$$
Matrix multiplication is otherwise not generally commutative, yet it is associative and distributes over addition (which itself is both commutative and associative). We can apply the inverse of the matrix A to both sides of the previous system to simultaneously solve all three equations.
$$A^{-1}Ax = A^{-1}b \;\Longrightarrow\; x = A^{-1}b.$$
A singular matrix has no inverse. Every invertible matrix is square, has a unique inverse, has a nonzero determinant, consists exclusively of linearly-independent columns and satisfies a host of additional conditions per the invertible matrix theorem. Every positive-definite matrix is invertible [9]. Given a nonsingular matrix, the process of inversion to compute its inverse is hardly trivial.
While inverting a matrix to solve a linear system is neither computationally efficient [6] nor numerically-stable [9] (i.e., prone to amplified round-off error), an approximate inverse can be quickly computed with sufficient accuracy to satisfy certain
1The Moore-Penrose pseudoinverse generalization of the matrix inverse applies to non-square matrices [20], but was not investigated here.


real-time robotics applications [32]. Matrix inversion is also employed to perform regression analysis [20], nonlinear programming [5], Newton's method for nonlinear systems [6] as well as ray casting in computer graphics [30].
Gauss-Jordan elimination is a slow [29], but introductory method of inversion where we use Gaussian elimination to reduce the left side of the augmented matrix $[A \mid I]$ to $I$, producing $A^{-1}$ on the right side [6]. The LU and LUP triangular decompositions of the matrix $A$, which are numerically-stable methods of solving linear systems, can be used to solve $Ax_i = e_i$ for each unit vector $e_i$ to produce the columns of $A^{-1}$ more quickly [9]. Cholesky decomposition of Hermitian, positive-definite matrices is much more efficient than LU decomposition and can similarly be used to invert a matrix [29]. Each of these traditional inversion methods has $O(n^3)$ runtime complexity.
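For reference, the following is a minimal C sketch of Gauss-Jordan inversion with partial pivoting (an illustration added here, not code from this thesis); the three nested loops make the $O(n^3)$ cost plain.

#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Invert an n x n matrix via Gauss-Jordan elimination on the augmented
   system [A | I]. a is row-major and is destroyed; inv receives A^-1.
   Returns false if A appears singular. Purely illustrative O(n^3) code. */
static bool gauss_jordan_invert(double *a, double *inv, size_t n) {
  for (size_t i = 0; i < n; ++i)              /* start right half as I */
    for (size_t j = 0; j < n; ++j)
      inv[i*n + j] = (i == j) ? 1.0 : 0.0;

  for (size_t col = 0; col < n; ++col) {
    /* Partial pivoting: pick the largest remaining entry in this column. */
    size_t piv = col;
    for (size_t r = col + 1; r < n; ++r)
      if (fabs(a[r*n + col]) > fabs(a[piv*n + col])) piv = r;
    if (a[piv*n + col] == 0.0) return false;  /* singular */
    if (piv != col)
      for (size_t j = 0; j < n; ++j) {
        double t = a[col*n + j]; a[col*n + j] = a[piv*n + j]; a[piv*n + j] = t;
        t = inv[col*n + j]; inv[col*n + j] = inv[piv*n + j]; inv[piv*n + j] = t;
      }

    /* Scale the pivot row so the pivot becomes 1. */
    double p = a[col*n + col];
    for (size_t j = 0; j < n; ++j) { a[col*n + j] /= p; inv[col*n + j] /= p; }

    /* Eliminate this column from every other row. */
    for (size_t r = 0; r < n; ++r) {
      if (r == col) continue;
      double f = a[r*n + col];
      for (size_t j = 0; j < n; ++j) {
        a[r*n + j]   -= f * a[col*n + j];
        inv[r*n + j] -= f * inv[col*n + j];
      }
    }
  }
  return true;
}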
Winograd proved that matrix multiplication is no harder than inversion while Aho, Hopcroft, and Ullman proved the converse [9]; the lower and upper runtime bounds for matrix inversion thus have quadratic and cubic order, respectively.
1.2 Altman's Iterative Matrix Inversion Method
M. Altman [3] defined a family of sequences $(R_n)$ of monotonically-convergent approximations to the inverse $A^{-1}$:
$$R_{n+1} = R_n\bigl(I + T_n + T_n^2 + \cdots + T_n^{p-1}\bigr), \qquad T_n = I - AR_n, \qquad \lim_{n \to \infty} R_n = A^{-1}.$$
Each sequence of this family has degree or order $p$, where $p \ge 2$, each iteration of the sequence requires $p$ matrix multiplications, and the sequence converges towards $A^{-1}$ with degree $p$. The following are the sequences of degree 2, 3 and 4 that converge quadratically, cubically and quartically towards $A^{-1}$, respectively (noting that we
compute $AR_n$ only once per iteration):
$$\begin{aligned} \text{quadratic:} \quad & R_{n+1} = R_n\bigl(2I - AR_n\bigr), \\ \text{cubic:} \quad & R_{n+1} = R_n\bigl(3I - 3AR_n + (AR_n)^2\bigr), \\ \text{quartic:} \quad & R_{n+1} = R_n\bigl(4I - AR_n\bigl(6I - 4AR_n + (AR_n)^2\bigr)\bigr). \end{aligned}$$
Given an appropriate starting approximation $R_0$, each sequence forms a method of quickly computing the inverse of $A$ in $k$ iterations within some acceptable error measure $\|I - AR_k\|$ (a Frobenius norm) that approaches zero as $R_k$ approaches $A^{-1}$ (since $AA^{-1} = I$). When such an error bound exists and is attainable, the runtime of this method can drastically outperform the traditional inversion methods described in the previous section.
We wish to know which of these sequences provides the fastest method of inversion. Since matrix multiplication dominates the computational runtime of an iteration of any sequence in the family, the closer we get to the true inverse per multiplication, the better. Consider two such sequences: one of degree $p$ (which requires $p$ multiplications per iteration) and one of degree $q$ such that, after $m$ iterations of the first sequence and $n$ iterations of the second, $v$ matrix multiplications have been performed over each (i.e., $mp = nq = v$). The better or faster sequence is the one which, after $v$ multiplications, results in the approximate inverse with the lower error measure. Suppose the sequence of degree $p$ is the better method. Altman showed that, since the error measures of these iterations are a function of the initial inverse approximation's error measure raised to the power $p^m$ or $q^n$ (respectively), we have
$$p^m \ge q^n, \qquad p^{v/p} \ge q^{v/q}, \qquad p^{1/p} \ge q^{1/q}.$$
The runtime is thus minimized by using the sequence of degree $p = \lceil e \rceil = 3$, which
converges cubically with successive iterations to the inverse of A:
$$R_{n+1} = R_n\bigl(3I - 3AR_n + (AR_n)^2\bigr).$$
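A quick numeric check (added here for clarity) makes the choice concrete:
$$2^{1/2} \approx 1.414, \qquad 3^{1/3} \approx 1.442, \qquad 4^{1/4} \approx 1.414, \qquad 5^{1/5} \approx 1.380,$$
so $p^{1/p}$ is maximized over the integers at $p = 3$.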
The choice of the initial approximation $R_0$ used to seed the inversion is crucial. The convergence of the method is guaranteed provided that $\|I - AR_0\| < 1$. For any invertible matrix, we may choose $R_0 = \alpha A^*$, where $A^*$ is the conjugate transpose of $A$. We shall only consider real-valued matrices, so we effectively have $R_0 = \alpha A^T$. The scalar multiple $\alpha$ must lie within a finite interval and leads to the fastest convergence at the critical value $\alpha = 2/(M^2 + m^2)$, where $m$ and $M$ are the minimum and maximum eigenvalues of $A$, respectively.
Unfortunately, computing eigenvalues2 is generally about as hard as matrix inversion itself [28]. Fortunately, an acceptable choice of $\alpha$ can be quickly drawn from a readily-computable interval whose upper bound produces the fastest-converging choice of $R_0$; in practice, we use $\alpha = 1/\|A\|^2$, i.e., $R_0 = A^T/\|A\|^2$ (see Section 3.2).
Should $A$ be self-adjoint and positive definite, we can perform the much faster computation $R_0 = \alpha I$, where the fastest convergence occurs with $\alpha = 2/(M + m)$. An easily-computed alternative is $\alpha = 1/\|A\|$.
The iterative improvement of Altman's method makes it self-correcting (in a sense) and grants it some tolerance of floating-point round-off error and less-than-ideal choices of $R_0$. In contrast to other inversion methods, Altman's can provide useful intermediate results when its runtime is constrained [32].
2One typical method of which is the QR algorithm [28].


1.3 Computational Aspects of Altman's Method
Since the runtime of an Altman iteration is dominated by its matrix multiplications, the method's complexity depends upon the multiplication algorithm used. The square matrix multiplication problem has $\Omega(n^2)$ and $O(n^3)$ runtime bounds; the theoretical lower bound simply reflects the size of the output.
Specifically, naive matrix multiplication performed by computing all $n^2$ dot products runs in $O(n^3)$ time using constant space. Strassen's landmark 1969 divide-and-conquer algorithm runs in $O(n^{\lg 7})$ time and can be run in parallel almost to the same extent as the naive algorithm, yet is less numerically stable, consumes more space and doesn't exploit sparse matrices well. Furthermore, its runtime has a larger constant factor, outperforming the naive algorithm only at a highly-variable crossover point as high as n = 2150 [9]. The Coppersmith-Winograd algorithm of 1987 currently has the best known upper bound of $O(n^{2.376})$, yet has a crossover point so high as to be wildly impractical for any conceivable applications [8].
For these reasons, linear algebra libraries tend to implement optimized versions of the naive multiplication algorithm [19]. Using that algorithm, the computation of a single Altman sequence iteration has the same $O(n^3)$ runtime as the traditional matrix inversion methods. However, the lower constant factor of the former combined with a small number of iterations (due to the fast convergence) can make for a shorter total runtime. Finally, and perhaps most importantly, each Altman iteration requires only matrix multiplication and addition to compute; these operations scale very well in parallel.
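To make the constant-factor argument concrete, here is the naive algorithm as a minimal C routine (an illustrative sketch, not ACMI code); optimized BLAS implementations layer blocking, vectorization and threading onto this same loop structure.

#include <stddef.h>

/* C <- A * B for dense n x n row-major matrices: the O(n^3) triple loop
   that optimized GEMM implementations ultimately build upon. */
static void naive_gemm(const double *A, const double *B, double *C, size_t n) {
  for (size_t i = 0; i < n; ++i)
    for (size_t j = 0; j < n; ++j) {
      double sum = 0.0;
      for (size_t k = 0; k < n; ++k)
        sum += A[i*n + k] * B[k*n + j];
      C[i*n + j] = sum;
    }
}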


CHAPTER II
COMPUTER LINEAR ALGEBRA
2.1 BLAS and OpenBLAS
The Basic Linear Algebra Subprograms, or BLAS, are a venerable specification of routines that perform elementary matrix operations. BLAS includes addition, multiplication, rank update, norm and solution routines that are divided among three levels [1]:
1. Scalar, vector and vector-vector operations.
2. Matrix-vector operations.
3. Matrix-matrix operations.
BLAS was originally developed in Fortran and traditionally employs column-major ordering: a matrix A with m rows of n columns is strided across columns and is laid out in sequential memory as $\{a_{11}, a_{21}, \ldots, a_{m1}, a_{12}, a_{22}, \ldots, a_{mn}\}$. BLAS provides routines for working with both single and double-precision floating-point matrix elements; the latter precision requires twice as much storage, but is needed when numerical instability renders the former inadequate. In our implementation of Altman's inversion method, we used the level 3 GEMM routines to multiply matrices; we treated matrices as $1 \times n^2$ vectors to norm them and to add them together using level 1 routines.
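As an illustration of the interface (assuming the standard CBLAS C bindings rather than quoting ACMI's source), the snippet below multiplies two n x n single-precision matrices stored column-major and takes a Frobenius norm by treating a matrix as a $1 \times n^2$ vector:

#include <cblas.h>

/* C <- A * B, with A, B and C all n x n, single precision and
   column-major with leading dimension n (a level 3 GEMM call). */
void multiply(const float *A, const float *B, float *C, int n) {
  cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
              n, n, n,        /* m, n, k       */
              1.0f, A, n,     /* alpha, A, lda */
              B, n,           /* B, ldb        */
              0.0f, C, n);    /* beta, C, ldc  */
}

/* Frobenius norm of an n x n matrix via a level 1 routine,
   treating the matrix as one long vector of n*n elements. */
float frobenius_norm(const float *A, int n) {
  return cblas_snrm2(n * n, A, 1);
}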
There are many BLAS implementations. For running computations purely in software, we employed the OpenBLAS library, an up-to-date fork of the highly-optimized and sophisticated GotoBLAS library [23]. Specifically, we used OpenBLAS compiled to divide its computations across all available cores using POSIX threads on Linux with bindings for the C programming language. Matrix multiplication in OpenBLAS
is highly-performant on recent Intel processors, outperforming Intel's own MKL implementation of BLAS [35].
In addition, we implemented custom routines for handling tasks not easily supported by BLAS, such as adding a multiple of the identity matrix to another matrix, computing the trace of a matrix (i.e., the sum of its diagonal elements) and converting a matrix from single to double precision. These routines were dominated in runtime by the algebraic matrix operations and were thus computed serially for simplicity.
2.2 Parallel Computation on the GPU with CUDA
For parallel computations accelerated on graphics hardware, we used Nvidia's proprietary CUDA platform. Introduced in 2007 and the subject of much research and commercial application, CUDA is a General Purpose Graphics Processing Unit (or GPGPU) toolkit for executing parallel code on a system's graphics hardware for non-graphical tasks. CUDA implements a model of heterogeneous computing where a general-purpose host CPU coordinates suitable tasks across one or more heavily-specialized GPU devices.
CUDA was foreshadowed in the mid-1990s when GPU transistor counts began to exceed those of CPUs. Heterogeneous GPGPU computing exploits the massively-parallel nature of GPU hardware, which was originally developed to rapidly process large numbers of independent pixels for computer games and other graphical applications [34].
Figure 2.1: CPU/GPU Architectural Differences [24]


Table 2.1: Differences between a CPU and a GPU
CPU GPU
Contains several powerful cores. Contains many lightweight cores.
Cache dominates the die. Logic dominates the die.
High core clock rates. High memory bandwidth.
Meant for complicated sequential logic. Designed for instruction-level parallelism.
Predicts and caches code paths. Hides latency through parallelism.
Fully-independent cores. Cores execute the same instruction.
Expensive thread context switches. Very cheap thread context switches.
A typical modern CPU contains 2 to 4 cores whereas a recent Nvidia GeForce GTX 1080 card has 2560 cores in total. GPU hardware implements a single instruction, multiple thread (SIMT) architecture that differs from its single instruction, multiple data (SIMD) counterpart in that an operand vector's elements are processed by independent threads, whose behavior might not be uniform [7]. Computations on the GPU cover up driver and bus communication latency as well as lower core clock rates by efficiently executing many threads in parallel. CUDA naturally fits into the current trend towards parallel computing and can accelerate suitable workloads by a factor of 5 to 400; this acceleration comes at the cost of the increased complexity inherent to concurrent computation [34].
A heterogeneous application using the CUDA runtime consists of a standard CPU program binary bundled with GPU compute kernel functions written in C++ and compiled with Nvidia's nvcc compiler. These kernels are loaded into GPU memory during initialization of the runtime. During execution, the program interacts with the GPU over the PCI Express bus through the device driver by transferring data from host memory to device memory, launching kernels over data in device memory and reading results back to host memory. Kernel launches are queued in order of
invocation and dispatched by the driver1, running asynchronously with respect to the CPU program, which synchronizes with the launched kernels when accessing device memory or upon explicit command.
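A minimal host-side sequence following this model might look as follows (an illustrative sketch with a hypothetical scale_kernel; error checking omitted):

#include <cuda_runtime.h>

/* Hypothetical kernel: each thread scales one element. */
__global__ void scale_kernel(float *x, float alpha, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread ID */
  if (i < n) x[i] *= alpha;
}

/* Host code: allocate device memory, copy data in, launch, copy back. */
void scale_on_device(float *hostData, float alpha, int n) {
  float *devData;
  size_t bytes = (size_t)n * sizeof(float);

  cudaMalloc(&devData, bytes);
  cudaMemcpy(devData, hostData, bytes, cudaMemcpyHostToDevice);

  int threads = 256, blocks = (n + threads - 1) / threads;
  scale_kernel<<<blocks, threads>>>(devData, alpha, n);   /* asynchronous launch */

  /* The device-to-host copy synchronizes with the queued kernel. */
  cudaMemcpy(hostData, devData, bytes, cudaMemcpyDeviceToHost);
  cudaFree(devData);
}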
Figure 2.2: Heterogeneous Programming Model [24]
The GPU hardware consists of a pool of global device memory and some number of streaming multiprocessors (or SMs). Each SM consists of cores that execute integer and floating-point operations in parallel, smaller numbers of double-precision and special function (e.g., transcendental) units, caches, thousands of registers to be shared among all resident threads, a small pool of shared memory for fast inter-thread communication as well as warp schedulers.
Kernels are launched with a specified number of blocks and threads per block. Each block is scheduled separately by the driver onto the next-available SM where
1The programmer may establish one or more streams through which kernels can be run concurrently on the GPU.
its threads are divided into warps consisting of 32 threads each. A warp's threads run on distinct execution cores that all simultaneously execute the same instruction dispatched by the warp scheduler [7]. The set of blocks comprising a kernel execution is called a grid. Each thread in the grid has a distinct ID with which it can determine its portion of the total parallel computation.
Global device memory access is a major bottleneck for kernels, taking hundreds of clock cycles per request. Kernels can be optimized to internally use shared memory or to coalesce memory access; the latter occurs when a block's threads simultaneously operate over contiguous and adjacent runs of global memory, minimizing the number of memory reads and writes performed before each instruction by the GPU, which reads from and writes to memory in 32-byte or wider contiguous chunks [7].
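For example, an element-wise kernel such as a single-to-double conversion coalesces naturally when consecutive threads touch consecutive elements (a sketch in the spirit of, but not copied from, ACMI's kernels):

/* Promote an n-element single-precision array to double precision.
   Thread i handles element i, so each 32-thread warp reads and writes
   one contiguous run of memory and its accesses coalesce. */
__global__ void promote_kernel(const float *src, double *dst, size_t n) {
  size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dst[i] = (double)src[i];
}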
Figure 2.3: CUDA Grid/Block/Thread Model [24]
In many ways, GPUs outperform CPUs at individual floating-point operations. Single-precision floating-point operations form the workhorse of GPU computations and are thus heavily optimized [34]. Double-precision operations incur a performance penalty as they require twice the memory and registers as well as special units on the
streaming multiprocessor. However, for iterative methods and other computations that are sensitive to numerical instability, double-precision arithmetic proves to be invaluable.
On Nvidia's Tesla GPUs (meant for servers and for scientific computing), these double-precision units are outnumbered 3:1 and 2:1 on the older Kepler and the latest Pascal architectures, respectively; this ratio reflects the relative maximum throughputs of these precisions [7] [26]. Tesla GPUs aren't intended for consumer desktop computing and are extremely expensive, whereas the GeForce line of GPUs, which are intended for computer gaming, are far more affordable and common.
Nvidia's own white papers appear to be mute on the topic, but sources on the Internet variously claim that GeForce GPUs are either fabricated with fewer double-precision units or are outright crippled by having these units simply disabled despite being in parity with their Tesla counterparts [33] [15]. Whatever the cause, the result is that the ratio of single-precision to available double-precision units on a high-end Pascal GeForce GPU is 32:1, resulting in a dramatic loss of performance when computing with double precision on the consumer-grade units.
2.3 CUDA Linear Algebra with cuBLAS
Our GPU implementation of Altman's method runs on CUDA primarily through Nvidia's cuBLAS library. This BLAS implementation is parallel and heavily optimized for working with dense matrices, outperforming optimized CPU BLAS implementations by a factor of at least 15 [7]. CUDA also comes packaged with the cuSPARSE library optimized for computations over sparse matrices, but this isn't used for inverting such matrices as their inverses and intermediate iterations tend to be dense [11].
The cuBLAS library includes BLAS-like extensions such as the GEAM routines, which are used to more conveniently perform matrix addition and transposition.


Handwritten kernels were employed for the same tasks that weren't well-suited to OpenBLAS. Since these operations are easily divided into independent divide-and-conquer problems and are thus embarrassingly parallel [16], their kernels can't exploit inter-thread communication and thus only achieve optimization by coalesced memory access. A handwritten kernel was tested for computing the Frobenius norm of a matrix less some multiple of the identity, but it was still outperformed by the much-simpler level 1 SNRM2 routine offered by cuBLAS. A custom kernel is also used to convert matrices from single to double precision; while the copy function from CUDA's Thrust implementation of the C++ STL can accomplish this with identical performance, including Thrust among the linked libraries substantially increased compilation time and inflated the size of the binary.


CHAPTER III
IMPLEMENTATION OF ALTMAN'S METHOD
3.1 Development Environment and General Architecture
Our implementation of Altman's method is called ACMI1. ACMI was developed on Linux (specifically openSUSE 42.1 and Debian 8.5) using an Nvidia GeForce GTX 1080 graphics card, a high-end, consumer-grade device (see Section 4.1 for additional details regarding computer hardware). ACMI was written primarily in C and compiled with GCC; device kernels and related code were written in C++ and compiled with Nvidia's nvcc compiler from the CUDA 8.0 SDK. We have not explored the simultaneous usage of multiple graphics cards over SLI for this application.
For reasons (see Section 2.2) that are as economical as they are practical, ACMI was developed for relatively-inexpensive consumer hardware, hence the GeForce GPU used. Compared to their Tesla counterparts, these units perform double-precision arithmetic very slowly. ACMI was thus written to use single-precision arithmetic by default for as long as it suffices in order to speed up computation, resorting to double precision per heuristics described in this chapter. The size and numerical stability of the source matrix determine how much ACMI can accomplish in single precision. However, ACMI may also be configured to run all computations in double precision or on the CPU from the start.
ACMI can accept a Matrix Market2 file as input or generate a random source matrix with various configurable characteristics, including size, diagonal dominance and element domains. Since software pseudorandom number generators can be of dubious quality [21] and even respectable ones can fail a statistical matrix rank test [22], ACMI can optionally use a very high-quality [18] hardware random number generator
1ACMI's a Convergent Matrix Inverter.
2 ACMI uses source code provided by NIST for reading and writing matrices encoded in the Matrix Market format [27].
available on Ivy Bridge and newer Intel processors. The inverse computed by the program may be written to disk in the Matrix Market format. ACMI's command-line interface also lets the user configure the order of convergence, initial floating-point precision, maximum inversion error, whether to use the GPU and other inversion parameters (see Appendix A for a complete listing).
During execution, ACMI loads or generates the source matrix, initializes the GPU and runs the centerpiece altmanInvert() routine over the matrix. This routine allocates work matrices, computes an initial inverse approximation $R_0$ and computes iterations $\{R_i\}$ of the Altman sequence until an approximate inverse is generated with an error measure $\|I - AR_i\|$ under the given limit ($10^{-5}$ by default); a lower error limit might necessitate additional sequence iterations. If the sequence diverges, the routine either restarts with a different initial approximation, promotes the work matrices to double precision or halts returning the previous inverse approximation, depending on the circumstances.
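The control flow just described can be summarized with the following sketch (hypothetical names and types standing in for ACMI's internals; this is an outline under stated assumptions, not the actual source):

#include <math.h>

/* Hypothetical stand-ins for ACMI's internal routines. */
typedef struct Mat Mat;
Mat   *initial_approximation(const Mat *A);   /* R0, per Section 3.2           */
double iterate(const Mat *A, Mat **R);        /* one step; returns ||I - A*R|| */
int    early_divergence(int iteration);       /* should we blame the seed?     */
Mat   *general_purpose_seed(const Mat *A);
int    in_single_precision(const Mat *R);
void   promote_to_double(Mat **R);

/* Outline of the driver loop described above (illustrative only). */
Mat *altman_invert(const Mat *A, double errLimit, int maxIter) {
  Mat *R = initial_approximation(A);
  double prevErr = INFINITY;

  for (int i = 0; i < maxIter; ++i) {
    double err = iterate(A, &R);            /* next approximation and its error */
    if (err <= errLimit) break;             /* met the requested error bound    */
    if (err > prevErr) {                    /* divergence: recover (Section 3.4) */
      if (early_divergence(i))          R = general_purpose_seed(A);
      else if (in_single_precision(R))  promote_to_double(&R);
      else                              break;   /* return the best effort      */
    }
    prevErr = err;
  }
  return R;
}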
3.2 The Initial Inverse Approximation
As discussed in Section 1.2, determining the ideal initial inverse approximation of a matrix $A$ requires knowledge of its eigenvalues and is, therefore, a computationally-intensive operation. Acceptable and quickly-computable substitutes are $R_0 = I/\|A\|$ when $A$ is positive definite and $R_0 = A^T/\|A\|^2$ for the general case. However, determining whether $A$ is positive definite is related to the eigenvalue problem and is itself unacceptably expensive [6].
It turns out that the inverse approximation for positive-definite matrices works for many matrices outside that class. For these matrices as well as true positive-definite ones, this approximation yields much faster convergence than when using its general-case counterpart. Moreover, we can usually determine early in the inversion process whether the positive-definite approximation shall lead to divergence and never
produce a suitable inverse. ACMI's default operation is, therefore, to simply apply the positive-definite approximation first and switch to the general approximation in the event of early divergence.
Unexpectedly, both initial approximations usually fail to satisfy $\|I - AR_0\| < 1$ and do not meet Altman's requirement for guaranteed convergence. This frequently happens to be of little consequence. We could artificially satisfy the condition by lowering the $\alpha$ scalar used to produce $R_0$, but this dramatically slows the convergence of the sequence, in practice.
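In plain C, both seeds reduce to a Frobenius norm followed by a scaled copy; the sketch below is illustrative only and is not ACMI's code (which works through the BLAS wrappers of Section 3.3).

#include <math.h>
#include <stddef.h>

/* Frobenius norm of an n x n matrix stored as a flat array of n*n elements. */
static double frob_norm(const double *A, size_t n) {
  double s = 0.0;
  for (size_t i = 0; i < n * n; ++i) s += A[i] * A[i];
  return sqrt(s);
}

/* R0 = I / ||A||: the fast seed tried first. */
static void seed_positive_definite(const double *A, double *R0, size_t n) {
  double alpha = 1.0 / frob_norm(A, n);
  for (size_t i = 0; i < n * n; ++i) R0[i] = 0.0;
  for (size_t i = 0; i < n; ++i) R0[i * n + i] = alpha;
}

/* R0 = A^T / ||A||^2: the general-purpose fallback seed. */
static void seed_general(const double *A, double *R0, size_t n) {
  double nrm = frob_norm(A, n);
  double alpha = 1.0 / (nrm * nrm);
  for (size_t i = 0; i < n; ++i)
    for (size_t j = 0; j < n; ++j)
      R0[j * n + i] = alpha * A[i * n + j];   /* transpose while scaling */
}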
3.3 Computation of the Sequence
ACMI implements BLAS wrappers that invoke their respective cuBLAS or OpenBLAS routines of the appropriate precision when GPU computation is enabled or disabled, respectively. ACMI additionally implements handwritten CUDA kernels and host functions for performing custom computations.
Table 3.1: BLAS Matrix/Vector Routines Used
Routine                Function
GEMM(α, A, B, β, C)    C ← αAB + βC
GEAM(α, A, β, B, C)    C ← αA + βB
SCAL(α, x)             x ← αx
AXPY(α, x, y)          y ← y + αx
NRM2(A)                ‖A‖
The GEMM routine performs matrix-matrix multiplication and is the main workhorse of the inversion process. GEAM is a BLAS-like extension to cuBLAS and simplifies matrix-matrix addition; we wrote a CPU implementation of this routine using SCAL and AXPY in OpenBLAS. NRM2 computes a Frobenius norm and is used to compute the scalar coefficient of the initial inverse approximation. The handwritten transpose


Table 3.2: Implemented Custom Matrix Routines
Routine                Function
transpose(A, T)        T ← Aᵀ
trace(A)               Σᵢ aᵢᵢ (sum of the diagonal elements)
addId(α, A)            A ← A + αI
minusIdNrm2(A)         ‖A − I‖
and trace routines perform their namesakes and are used to compute the sequence's general-case initial inverse approximation $R_0$ as well as the optimized, positive-definite initial error measure $\|I - AR_0\|$, respectively. The addId routine was written to perform linear-time addition of a scalar multiple of the identity to a matrix; this routine helps us avoid having to allocate a large and slow identity matrix and is also used with NRM2 to implement minusIdNrm2 for computing each inversion iteration's error measure. An iteration of the cubically-convergent Altman sequence is thus computed by ACMI to produce the next approximate inverse R of the matrix A as follows:
gemm(1, mA, mR, 0, mAR);     // mAR <- A*R
err = minusIdNrm2(mAR);      // compute error of the previous iteration
                             // based on error, stop or back up
gemm(1, mAR, mAR, 0, mX);    // mX <- (AR)^2
geam(-3, mAR, 1, mX, mX);    // mX <- -3AR + (AR)^2
addId(3, mX);                // mX <- 3I - 3AR + (AR)^2
gemm(1, mR, mX, 0, mAR);     // mAR <- R(3I - 3AR + (AR)^2)
swap(&mR, &mAR);             // mR <- next R, mAR <- previous R
swap(&mX, &mAR);             // mX <- previous R
Floating-point operations are not generally associative with respect to round-off
error; the order in which a computation is performed is important. In an effort to
reduce accumulated error, the above sequence of matrix operations was borrowed
from earlier work [32]. The previous iteration is saved in case ACMI decides to
back up an iteration to reduce error before changing strategy or quitting. While
both BLAS engines compute each individual operation in parallel, the sequence of
operations is itself serially-interdependent and presents no opportunity for parallel
execution of separate matrix operations (i.e., all CUDA operations are run through
a single stream).
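As an example of the handwritten routines of Table 3.2, an addId-style kernel needs only one thread per diagonal element; the sketch below assumes the column-major layout and is illustrative rather than ACMI's actual kernel.

/* A <- A + alpha * I for an n x n single-precision matrix in device
   memory, column-major: thread i updates diagonal element A[i + i*n],
   so no explicit identity matrix is ever allocated. */
__global__ void add_id_kernel(float *A, float alpha, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) A[(size_t)i * n + i] += alpha;
}

A launch of ⌈n/256⌉ blocks of 256 threads covers the whole diagonal; its cost is negligible next to the GEMM calls.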
In addition to the above, ACMI implements the quadratically and quartically-convergent Altman sequences as well; the latter requires a fifth work matrix (i.e., much more memory). Their relative performances shall be examined in Chapter 4.
3.4 Recovery from Divergence
As we shall see, many matrices, both large and small, cannot be accurately inverted using ACMI with single-precision arithmetic alone. However, the substantial performance penalty of using double precision on the GPU compels us to perform as much of the inversion as possible in single precision. We therefore need a mechanism for detecting when to abandon single precision.
The simplest method of accomplishing this is to compute iterations until one diverges. Since Altman's sequences converge monotonically to the inverse, an iteration whose error measure exceeds that of the previous indicates that floating-point round-off error has caught up with ACMI and halted the progress of the inversion. Since Altman's method is iteratively improving and tolerant of errors, ACMI simply promotes its work matrices to double precision after divergence occurs in order to overcome the round-off error and continue the inversion process.
However, our inverse approximation might have accumulated substantial damage by the time outright divergence has occurred. This damage can require additional (and expensive) double-precision iterations to correct; it might even derail the remainder of the inversion. We would like a reliable method of measuring numerical instability and detecting divergence ahead of time to elude this damage. To this end, we employed the sequences rate of convergence.
A sequence $\{x_n\}$ that converges with order $p$ towards zero has constant rate $\mu$ such that [13]
$$|x_n| \le \varepsilon_n, \qquad \lim_{n \to \infty} \frac{\varepsilon_{n+1}}{\varepsilon_n^{\,p}} = \mu > 0.$$
Substituting our monotonically-decreasing error measure and using the cubically-convergent sequence, we have
$$\lim_{n \to \infty} \frac{\|I - AR_{n+1}\|}{\|I - AR_n\|^3} = \mu.$$
We don't know what the value of $\mu$ is for a given inversion, but we do know that $\mu$ exists, since this sequence does indeed converge cubically to the inverse. We therefore wouldn't expect the sequence of rate measurements $\{\mu_n\}$ to itself diverge unless we fall victim to numerical instability and, based on evidence, hypothesize that such a divergence would occur before that of $\{R_n\}$. ACMI measures this rate after each iteration and promotes its matrices after the rate exceeds some given threshold. The ideal value of this threshold varies; ACMI tolerates a limit of $\mu < 1$ by default, which was found to be reasonable for many matrices during initial testing. We shall explore the effectiveness of this strategy in Chapter 4.
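Concretely, the per-iteration check reduces to one division and one comparison (an illustrative sketch, not ACMI's source):

/* Estimate the rate mu_n = e_{n+1} / e_n^3 from successive error
   measures e_n = ||I - A*R_n|| of the cubic sequence; exceeding the
   configured limit triggers promotion to double precision. */
static int rate_exceeds_limit(double errPrev, double errCurr, double muLimit) {
  double mu = errCurr / (errPrev * errPrev * errPrev);
  return mu > muLimit;
}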
If divergence occurs within the first few iterations, the initial inverse approximation is probably to blame; ACMI then computes the starting approximation for the general case and restarts the inversion. When divergence can't be blamed on the initial approximation and occurs while working in double precision, ACMI terminates the inversion and returns the previously-computed inverse approximation as a consolation prize.
It should be noted that, if the inverter is allowed to indefinitely continue producing inverse approximations after hitting the wall of numerical instability (but not when the matrix is singular or the starting approximation is inadequate, which lead to outright divergence), its iteratively-improving nature shall cause their error measures to hover around the lowest-achievable figure indefinitely.


CHAPTER IV
EXPERIMENTAL RESULTS
4.1 Testing Environment and Overview of Experiments
The experiments were run on a custom-built workstation sporting an Intel Xeon E3-1275 v5 processor from the Skylake generation. The CPU has 4 hyper-threaded cores running at 3.6 GHz with 256KB L2 cache per core and 8MB of shared L3 cache; the system memory is 2133 MHz ECC DDR4. The workstation's graphics card is a Pascal-generation Nvidia GeForce GTX 1080, which was the highest-end and most-recent consumer GPU offered by Nvidia at the time of the system's construction in the summer of 2016; its specifications and SM architecture are shown in Table 4.1 and Figure 4.1, respectively.
Table 4.1: GeForce GTX 1080 Specifications
Specification Rating
Compute capability 6.1
Streaming multiprocessors 20
CUDA cores 2560 total (128 per SM)
Maximum core clock rate 1734 MHz
GFLOPS 8873 [25]
Memory 8 GiB GDDR5X
Memory clock rate 5005 MHz
Memory bus width 256-bit
Memory bandwidth 320 GiB/sec [25]
L2 cache 2 MiB
Max threads per block/SM 1024/2048


Figure 4.1: GeForce GTX 1080 Streaming Multiprocessor [25]
The operating system used is openSUSE 42.1 installed with the 367.27 version of Nvidia's graphics/CUDA drivers.
In order to eliminate pseudorandom statistical weaknesses, all random matrices were generated using Intel's hardware random number generator (unless otherwise noted). Each random matrix is populated with integer elements no greater than the matrix dimension in magnitude1; allowing much larger or non-integer elements didn't seem to significantly affect any of the results.
Several different classes of matrices were inverted using ACMI and shall be examined in the following sections.
1 Diagonals in diagonally-dominant matrices are necessarily not so bounded.


Single and double precision on the GPU, single
Octave is a free and open-source numerical computing platform that is largely compatible with MATLAB. Depending upon the characteristics of an input matrix, Octave performs one of several methods of inversion, including LU and Cholesky factorization [12] [2]; Octave thus provides a method of testing ACMI against traditional matrix inversion methods. Octave itself depends upon BLAS and was configured to use the same multithreaded OpenBLAS library used by ACMI.
CUDA kernel launches were tuned to saturate all SMs on the GeForce GPU, allocating up to 40 blocks per grid with 1024 threads per block. The OpenBLAS routines used to perform inversion in software use up to 8 POSIX threads to execute their computations in parallel. The time required to generate or load matrices and to transfer them to GPU memory is not figured into the tested inversion runtimes.
4.2 Diagonally-Dominant Random Matrices
We first tested ACMI against easily-invertible random matrices. A matrix $A$ is strictly diagonally-dominant [6] if, for all $1 \le i \le n$,
$$|a_{ii}| > \sum_{\substack{j = 1 \\ j \ne i}}^{n} |a_{ij}|.$$
Such a matrix is easily, rapidly and accurately solved by Gaussian elimination [13]. Forcing the diagonal elements to be positive well-conditions3 the matrix, altogether making the matrix easily invertible.
2Double-precision CPU inversions using ACMI aren't studied as they generally form a higher and unnecessary runtime upper bound.
3 Most such matrices have a condition number of about 2.
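A generator in the spirit of these test matrices is sketched below (illustrative only: ACMI's own generator also supports the hardware RNG of Section 3.1 and other element domains, and rand() here is merely a placeholder).

#include <math.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill an n x n matrix (flat array) with random integer entries of
   magnitude at most n, then overwrite each diagonal entry with a
   positive value exceeding the row's off-diagonal absolute sum,
   making the matrix strictly diagonally dominant. */
static void make_diag_dominant(double *A, size_t n) {
  for (size_t i = 0; i < n; ++i)
    for (size_t j = 0; j < n; ++j)
      A[i*n + j] = (double)(rand() % ((int)n + 1));

  for (size_t i = 0; i < n; ++i) {
    double rowSum = 0.0;
    for (size_t j = 0; j < n; ++j)
      if (j != i) rowSum += fabs(A[i*n + j]);
    A[i*n + i] = rowSum + 1.0;   /* strictly dominant, positive diagonal */
  }
}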


We generated diagonally-dominant matrices with dimensions ranging from 128 x 128 to 16384 x 16384. Five matrices were generated for each size tested and their inversion runtime averages are graphed in Figure 4.2, where ACMI's double-precision, single-precision (with promotion to double precision if needed) and CPU inverters are compared to each other and to GNU Octave's matrix inverter. The target error
Figure 4.2: Inversion Runtime for n x n Diagonally-Dominant Random Matrices
measure limit for each inversion was $1.15 \cdot 10^{-5}$; this was chosen to be slightly over ACMI's default of $10^{-5}$ in order to accommodate some large matrices that could not be promoted to double precision due to memory constraints.
The single-precision ACMI inverter generally and greatly outperforms all its competitors, taking 30.4 seconds on average to invert a 16384 x 16384 matrix while its nearest competitor, GNU Octave, takes 198 seconds, 6.5 times as long. Octave outperformed ACMI once during the hump in the latter's curve; this range of matrix dimensions represents a region of instability that required ACMI to promote its
matrices to double precision in order to meet the inversion error target. As can be
seen in the curve of OpenBLAS inversions performed by ACMI, this region is peculiar to ACMI's CUDA inverter. We do not yet know why our CUDA inverter has trouble with diagonally-dominant matrices of these dimensions. An observation that held across all tests is that, once promoted from single precision, an inversion only rarely requires more than two additional iterations to meet its error target4. Despite this and due to the double-precision hardware limitations of GeForce GPUs, these promoted iterations dominate the runtimes of their inversions.
To minimize the extent of this region, we had to raise the convergence rate limit5 to $\mu < 5 \cdot 10^{7}$; limits as high as $\mu < 5 \cdot 10^{5}$ extend the ranges of matrix sizes affected by this single-precision CUDA anomaly. This limit was found by trial-and-error and is poorly-understood; nevertheless, it shows that ACMI's default limit of $\mu < 1$ is hardly as general-purpose as previously hoped. The only way to eliminate the region is to raise the maximum-tolerated error to about $1 \cdot 10^{-3}$, which is unacceptably high.
Even when inverting matrices whose sizes lie within this region, single-precision inversions with promotion dramatically outperform pure double-precision runs by a factor of 2 to 3 with respect to runtime. ACMI's double-precision CUDA and single-precision CPU inverters are beaten here by Octave, also by large factors; this might be due to the sophistication of Octave as much as it is to the ease with which diagonally-dominant matrices are solved by Gaussian elimination.
These diagonally-dominant matrices require few ACMI iterations to invert: the smaller matrices tested require 5-6 iterations, 1024 x 1024 matrices require 7 and the largest require 8-10.
The largest specimen inverted during the entirety of this research was a 22800 x 22800 diagonally-dominant matrix, taking 91.9 seconds to invert to under $1.2 \cdot 10^{-5}$ error in 9 iterations. The inversion required almost the entirety of the GPU's 8 GiB of RAM and was performed entirely in single precision using ACMI. ACMI took over
4This observation no longer holds when the convergence rate limit is set to be very low.
5See Section 3.4 for a description of this limit.
110 minutes to invert such a matrix within the same error bound using the CPU alone; Octave needed 513 seconds.
4.3 Random Matrices
We then tested fully-random, non-diagonally-dominant matrices where negative elements are allowed. While such matrices, unlike their diagonally-dominant cousins, may be singular, they prove in practice to almost always be invertible6. Figures 4.3 and 4.4 show the runtime results for random integer and Boolean matrices, respectively. The runtime behaviors of these two classes are remarkably similar, though the
Figure 4.3: Inversion Runtime for n x n Random Matrices
raw figures show that Boolean matrices are slightly harder to invert. A convergence rate limit of $\mu < 5$ was chosen as most of these inversions would diverge if pushed beyond this limit at single precision. This limit was also found by trial-and-error and
6 During this research, random singularity was only observed with very small matrices, so small that they did not figure into our experiments.
Figure 4.4: Inversion Runtime for n x n Boolean Matrices
differs greatly from that used to optimize the inversions of diagonally-dominant matrices, suggesting that computation of the ideal such limit is non-trivial. The default ACMI inversion error target of $10^{-5}$ was used for all tests.
While ACMI using single precision (with promotion) takes only a ninth of the time needed by double precision to invert a random 8192 x 8192 integer matrix (36.6 vs. 332 seconds), GNU Octave defeated ACMI for nearly all matrix sizes tested7, taking only 25 seconds to invert an 8192 x 8192 matrix.
In contrast to diagonally-dominant matrices, fully-random matrices tend to be poorly-conditioned8 and possess negative eigenvalues, requiring ACMI to use the general-purpose $R_0$ initial inverse approximation to successfully invert these matrices. This alternate initial approximation appears to be the culprit for ACMI's poor
7While ACMI appears to have won at inverting the largest matrix size tested here, the resulting approximate inverses had poor error measures on the order of $10^{-2}$ as there was insufficient memory for promotion to double precision.
8The condition numbers are on the order of $10^4$; ACMI must be told to skip the positive-definite $R_0$ approximation lest it waste a couple iterations before restarting.
performance, sometimes requiring over 30 iterations to invert within the error target (see the p = 3 graph in Figure 4.9).
4.4 Real-World Large Matrices
We, of course, wished to demonstrate practical applications of ACMI by testing it against some large, non-random matrices modeling real-world problems. To this end, we employed the University of Florida Sparse Matrix collection [10], choosing the matrices listed in Table 4.2 as test subjects. Runtime results are shown in Figure 4.5; since single-precision ACMI clearly dominates its other two modes, we compared its results to GNU Octave alone. Here, the default convergence rate limit of p < 1 was used to no apparent ill effect.
Table 4.2: Sparse Matrix Collection Specimens [10]
ID: Matrix n Description
1: 494_bus 494 Bus power system
2: 685_bus 685 Bus power system
3: 1138_bus 1138 Bus power system
4: CAG_mat1916 1916 Combinatorial problem
5: ex37 3565 Computational fluid dynamics problem
6: lshp_265 265 Thermal problem
7: Kuu 7102 Structural problem
8: heart3 2339 Quasi-static FE model of a heart
9: c-40 9941 Non-linear optimization problem
10: p2p-Gnutella25 22687 Peer-to-peer file sharing network
11: pli 22695 Magnetohydrodynamic problem
For all but two (small) matrices, ACMI defeated Octave9, taking 66.5 seconds to
9Octave couldn't invert the largest two matrices within a reasonable time limit, hence the truncated results in the histogram.

Figure 4.5: Inversion Runtime for Large, Sparse Matrices
invert the c-40 matrix whereas Octave required 879. Unlike with random matrices, ACMI wins even at the non-positive-definite tests (8-11) that require the much more slowly-convergent, general-purpose $R_0$ seed. These matrices aren't necessarily friendlier than the fully-random matrices, since the heart3 matrix happened to have a much worse condition number of $2.1 \cdot 10^{6}$. Among these tests, ACMI needed to compute as few as 10 iterations and as many as 32 for the ex37 and c-40 matrices, respectively.
The heart3 matrix happened to break ACMI's automatic detection of poor initial inverses: whereas every other tested matrix inversion diverged within the first two iterations if the positive-definite $R_0$ were wrongly chosen (successfully triggering ACMI to restart the inversion from the general-purpose $R_0$ in the process), this matrix produced nine iterations from the positive-definite seed before diverging.
As an example of the superior convergence of the positive-definite $R_0$ approximation, forcing ACMI to use the general-purpose $R_0$ to invert the Kuu matrix results
in a runtime of 24.6 seconds over 27 iterations instead of 20 seconds over 15 iterations.
4.5 Hilbert Matrices
The numerical stability of an invertible matrix $A$ can be predicted by its condition number, $\|A\| \, \|A^{-1}\|$. A condition number significantly greater than one indicates an ill-conditioned matrix where small perturbations of the elements have great influence on operations performed on the matrix and their results [6] [13]. The higher the condition number, the greater the error inflation. Naturally, an ill-conditioned matrix won't tolerate floating-point round-off error well and shall complicate ACMI's computations, possibly even making effective matrix inversion impossible.
Despite being positive definite, the symmetric Hilbert matrices are notoriously ill-conditioned [6]; even the tiny 10 x 10 Hilbert matrix has a condition number of $1.6 \cdot 10^{13}$ [13]. They thus provided an extreme under which we could test the numerical stability of ACMI. The n x n Hilbert matrix is
$$H_n = \begin{pmatrix} 1 & \frac{1}{2} & \cdots & \frac{1}{n} \\ \frac{1}{2} & \frac{1}{3} & \cdots & \frac{1}{n+1} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{n} & \frac{1}{n+1} & \cdots & \frac{1}{2n-1} \end{pmatrix}.$$
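For completeness, generating $H_n$ is trivial (an illustrative helper, not ACMI code):

#include <stddef.h>

/* Fill an n x n flat array with the Hilbert matrix: entry (i, j) is
   1 / (i + j - 1) for 1-based indices, i.e. 1 / (i + j + 1) for the
   0-based loop indices used here. */
static void make_hilbert(double *H, size_t n) {
  for (size_t i = 0; i < n; ++i)
    for (size_t j = 0; j < n; ++j)
      H[i*n + j] = 1.0 / (double)(i + j + 1);
}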
Figure 4.6 shows the best inversion error measures achieved using ACMI in double precision to invert $H_2$ through $H_{13}$ and compares them to those obtained by Octave. Inversion runtimes are negligible across all methods at these dimensions and are not shown; these tests serve to show error tolerance, not parallel performance.
ACMI's CPU and GPU inverters perform similarly, both producing inverses of far greater fidelity than Octave. Despite the fact that these matrices all benefit from the positive-definite $R_0$ approximation, they require many iterations to reach their best error measures: 14 iterations for $H_3$ and 35 for $H_{12}$. Altering the convergence rate limit when inverting Hilbert matrices didn't affect the total iterations computed, but did increase the proportion of double-precision iterations needed by ACMI.


Figure 4.6: Inversion Error for n x n Hilbert Matrices
4.6 Comparison of the Orders of Convergence
Thus far, we had only been computing the cubically-convergent Altman sequence of inverse approximations using ACMI. Since the quadratically and quartically-convergent sequences produce more quickly computable and convergent iterations, respectively, we asked whether parallel computation through CUDA ever grants them an advantage over the sequence of cubic order; the latter is optimal in theory, but not necessarily so with finite-precision software. Figures 4.7 and 4.8 compare the three sequences when applied to diagonally-dominant random and fully-random matrices, respectively. The number of iterations computed by each sequence for the latter case are shown in Figure 4.9. All tests are run in single precision, yet typically promote to double precision for the last two iterations. The same convergence rate limits are used as in the previous sections dealing with these matrix classes.
Surprisingly, the cubic-order sequence is frequently outperformed by the quadratic,
Figure 4.7: Order-p Inversion Runtime for Diagonally-Dominant Random Matrices
despite the latter requiring many more iterations to perform each inversion. This result appears to be due to the finite precision and numerical instability of our matrix arithmetic. For each test, all three sequences tend to promote to double precision after similar inversion error measures have been achieved. The runtime dominance of the following few double-precision iterations, which are much more expensive for higher-order sequences, covers up the slower convergence of the lower-order sequence iterations performed in single precision.
It should be noted that, given its slower convergence, the quadratic sequence is more sensitive to lower convergence rate limits and requires them to be raised in order to compete with the other sequences. When this value is low and fixed for all sequences, the cubic-order sequence substantially outperforms the quadratic. Generally, even though a low convergence rate limit can lower the total number of inversion iterations performed, it likely raises the ratio of double-precision to single-precision iterations at the expense of the runtime.


Figure 4.9: Total Order-p Inversion Iterations for Random Matrices


CHAPTER V
CONCLUSION
5.1 Implications of Results
A parallel implementation of Altman's matrix inversion method using a consumer-grade GPU and single-precision arithmetic can greatly outperform other matrix inverters, such as GNU Octave. Moreover, the method is error tolerant and allows extended single-precision computation until double precision is absolutely necessary. This runtime advantage is not present when executing the method entirely with double precision or on the CPU. While our implementation defeated GNU Octave at inverting large, easily-invertible matrices as well as matrices representing large scientific and industrial problems, Octave outperformed the implementation at inverting certain ill-conditioned random matrices.
We attempted to exploit the inversion sequence's rate of convergence to predict oncoming divergence and optimize the inversion strategy. The convergence rate limit proved non-trivial to optimize and even increased the inversion computation time when set too low. On the hardware used, a few double-precision iterations dominated the runtime of many single-precision iterations to the extent that attempting to minimize the latter proved to not be worthwhile.
5.2 Further Research
This research studied GPGPU matrix inversion using only consumer Nvidia GPUs through the CUDA framework. Since these consumer units perform relatively poorly during double-precision computations, a study of the performance of ACMI on professional Tesla units would be a logical next step. Adding support for other GPGPU platforms, such as OpenCL and Vulkan, should also be of interest.


Matrix inversion very quickly exhausts single-precision numerical stability and even double precision frequently proves to be inadequate for ACMI. Unfortunately, quadruple-precision support is currently unavailable outside of relatively-exotic IBM mainframes [31] and Power CPUs [14]. A compromise of worthwhile pursuit would be the double-double method, which extends floating-point precision by using pairs of double-precision numbers to represent each vector or matrix element [17].
As discovered in Chapter 4, certain ranges of dimensions of diagonally-dominant random matrices are much less numerically stable when using CUDA and cuBLAS and require promotion to much slower double-precision computation to invert well. CPU inversion using OpenBLAS does not suffer from this instability; an explanation of the discrepancy would be valuable.
Finally, Strassen's matrix multiplication algorithm has acceptable numerical stability for some applications and scales very well under parallel execution [9]. A GPU implementation of this algorithm might appreciably accelerate Altman's method.
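For reference, one level of Strassen's recursion on a 2x2 block partition replaces the eight block products of the naive scheme with seven [9]:

    M1 = (A11 + A22)(B11 + B22),   M2 = (A21 + A22) B11,
    M3 = A11 (B12 - B22),          M4 = A22 (B21 - B11),
    M5 = (A11 + A12) B22,          M6 = (A21 - A11)(B11 + B12),
    M7 = (A12 - A22)(B21 + B22),

    C11 = M1 + M4 - M5 + M7,       C12 = M3 + M5,
    C21 = M2 + M4,                 C22 = M1 - M2 + M3 + M6.

The seven sub-products are independent and could each be dispatched to the GPU (for instance as separate cuBLAS GEMM calls) in parallel, which is what makes the algorithm attractive here despite its extra additions.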
REFERENCES
[1] BLAS (Basic Linear Algebra Subprograms), http://www.netlib.org/blas/. Accessed: 2016-10-15.
[2] ?getrf. https://software.intel.com/en-us/node/520877. Accessed: 2016-10-25.
[3] M. Altman. Approximation Methods in Functional Analysis. California Institute of Technology, 1959.
[4] M. Altman. An optimum cubically convergent iterative method of inverting a linear bounded operator in Hilbert space. Pacific Journal of Mathematics, 10(4):1107-1113, December 1960.
[5] Bradley, Hax, and Magnanti. Applied Mathematical Programming. Addison-Wesley, 1977.
[6] R. Burden and J. Faires. Numerical Analysis. Cengage Learning, 9th edition, August 2010.
[7] J. Cheng, M. Grossman, and T. McKercher. Professional CUDA C Programming. Wrox, 1st edition, September 2014.
[8] D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation, 9:251-280, March 1990.
[9] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 3rd edition, 2009.
[10] T. A. Davis and Y. Hu. The University of Florida Sparse Matrix Collection. ACM Transactions on Mathematical Software, 38(1):1:1-1:25, 2011. Accessed: 2016-10-27.
[11] J. L. Dongarra and V. Eijkhout. Numerical linear algebra algorithms and software. Journal of Computational and Applied Mathematics, 123:489-514, November 2000.
[12] J. W. Eaton. Techniques used for linear algebra, https://www.gnu.org/software/octave/doc/interpreter/Techniques-Used-for-Linear-Algebra.html, 2016. Accessed: 2016-10-25.
[13] W. Gautschi. Numerical Analysis. Birkhauser, 2nd edition, 2012.
[14] GCC Team. GCC 6 release series: Changes, new features, and fixes, https://gcc.gnu.org/gcc-6/changes.html, October 2016. Accessed: 2016-10-28.
[15] H. Hagedoorn. Nvidia GeForce GTX 1080 review: Pascal GPU architecture. http://www.guru3d.com/articles-pages/nvidia-geforce-gtx-1080-review,3.html, May 2016. Accessed: 2016-10-20.
[16] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann, 1st edition, June 2012.
[17] Y. Hida, X. S. Li, and D. H. Bailey. Library for double-double and quad-double arithmetic. December 2007.
[18] Intel. Intel Digital Random Generator (DRNG) Software Implementation Guide, 1.1 edition, August 2012.
[19] J. Li, S. Ranka, and S. Sahni. Strassen's matrix multiplication on GPUs. Parallel and Distributed Systems (ICPADS), December 2011.
[20] J. Liesen and V. Mehrmann. Linear Algebra. Springer, 1st edition, December 2015.
[21] G. Marsaglia. Random numbers fall mainly in the planes. Proc. N. A. S., 61(1):25-28, September 1968.
[22] G. Marsaglia. Xorshift RNGs. Journal of Statistical Software, 8(14), July 2003.
[23] K. Milfeld. GotoBLAS2. https://www.tacc.utexas.edu/research-development/tacc-software/gotoblas2. Accessed: 2016-10-14.
[24] NVIDIA Corporation. CUDA C Programming Guide, https://docs.nvidia.com/cuda/cuda-c-programming-guide, September 2016. Accessed: 2016-10-16.
[25] NVIDIA Corporation. NVIDIA GeForce GTX 1080. http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf, 2016. Accessed: 2016-10-20.
[26] NVIDIA Corporation. NVIDIA Tesla P100. https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf, 2016. Accessed: 2016-10-20.
[27] National Institute of Standards and Technology. ANSI C library for Matrix Market I/O. http://math.nist.gov/MatrixMarket/mmio-c.html, May 2000. Accessed: 2016-10-21.
[28] V. Pan, Z. Chen, and A. Zheng. The complexity of the matrix eigenproblem. Proceedings of the thirty-first annual ACM symposium on Theory of computing, pages 507-516, May 1999.
[29] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd edition, 1992.
[30] S. D. Roth. Ray casting for modeling solids. Computer Graphics and Image Processing, 18:109-144, February 1982.
[31] E. Schwarz. The IBM z13 SIMD accelerators for integer, string, and floating-point. http://arith22.gforge.inria.fr/slides/sl-schwarz.pdf, June 2015. Accessed: 2016-10-28.
[32] R. Senser. GPU Declarative Framework. PhD thesis, University of Colorado Denver, November 2014.
[33] T. Valich. Pascal secrets: what makes Nvidia GeForce GTX 1080 so fast? http://vrworld.com/2016/05/10/pascal-secrets-nvidia-geforce-gtx-1080, May 2016. Accessed: 2016-10-20.
[34] N. Wilt. The CUDA Handbook: A Comprehensive Guide to GPU Programming. Addison-Wesley Professional, 1st edition, June 2013.
[35] Z. Xianyi. OpenBLAS wiki. https://github.com/xianyi/OpenBLAS/wiki. Accessed: 2016-10-14.
APPENDIX A
ACMI INVERTER USAGE
Listing A.1: ACMI Inverter Usage

ACMI Convergent Matrix Inverter
J. Treadwell, 2016

Usage:
  acmi [options] <input file>

Only Matrix Market I/O is supported. To generate and invert a random matrix, enter the matrix dimension prepended by 0 in lieu of the input file; prepend with /0 for symmetry.

Options:
  -f            Print matrix file info and exit
  -c            Perform all computations in software without the GPU
  -l            Use low-speed, safe initial R0 approximation
  -o            Output computed matrix inverse to path
  -q            Set the order of convergence (2-4, default: cubic)
  -p <#bits>    Set initial matrix element floating-point precision (32 or 64, default: 32)
  -e <+real>    Set inversion error limit (default: 1e-05)
  -t            Set inversion time limit in ms (default: none)
  -m <+real>    Set max-allowed single-precision convergence rate; prepend with x and
                set to >= 1 to use a multiple of the starting rate (default: 1)
  -b <+int>     Set max blocks run by each GPU kernel (default: 40, max: 65535)

Random matrix options:
  -H            Enable hardware random number generator (unseedable)
  -R            Enable real elements
  -N            Enable negative elements
  -D            Generate dominant diagonal elements
  -V <+real>    Set max element magnitude (default: matrix dimension, max: 32767)
  -U            Output generated, uninverted matrix to path
  -S            Set PRNG seed (not yet portable)
APPENDIX B
SAMPLE ACMI INVERSION RUNS
Listing B.1: GPU single-precision, random diagonally-dominant (n = 2048).
1 > ./acmi 02048 -DS deadbeef
2 7998.438 MiB device memory available
3 seeding PRNG with deadbeef
4 generating 32-bit random 2048x2048 integer nonnegative
5 diagonally-dominant matrix...
6 16.000 MiB/matrix; allocating 64.000 MiB total
7 setting up cuBLAS
8 setting up kernels on GeForce GTX 1080
9 blocks/matrix: 40, threads/matrix: 40960
10 blocks/vector: 2, threads/vector: 2048
11 uploading matrix to device
12 inverting random matrix with cubic convergence...
13 initializing work matrices...
14 computed alpha = 1.05396e-08
15 R0: err = 4.4255e+01, μ = 0.0000e+00 (0.000s)
16 R1: err = 4.2322e+01, μ = 4.8828e-04 (0.000s)
17 R2: err = 3.7015e+01, μ = 4.8829e-04 (0.010s)
18 R3: err = 2.4766e+01, μ = 4.8835e-04 (0.008s)
19 R4: err = 7.4232e+00, μ = 4.8870e-04 (0.007s)
20 R5: err = 2.0094e-01, μ = 4.9125e-04 (0.008s)
21 R6: err = 5.9508e-05, μ = 7.3342e-03 (0.008s)
22 R7: err = 5.9787e-06, μ = 2.8372e+07 (0.007s)
23 inversion halted in 7 iterations after 0m0.056s
24 shutting down cuBLAS
25 shutting down kernels
Listing B.2: GPU double-precision, random diagonally-dominant (n = 2048).
1 > ./acmi 02048 -DS deadbeef -p 64
2 7998.438 MiB device memory available
3 seeding PRNG with deadbeef
4 generating 64-bit random 2048x2048 integer nonnegative
5 diagonally-dominant matrix...
6 32.000 MiB/matrix; allocating 128.000 MiB total
7 setting up cuBLAS
8 setting up kernels on GeForce GTX 1080
9 blocks/matrix: 40, threads/matrix: 40960
10 blocks/vector: 2, threads/vector: 2048
11 uploading matrix to device
12 inverting random matrix with cubic convergence...
13 initializing work matrices...
14 computed alpha = 1.05396e-08
15 R0: err = 4.4255e+01, μ = 0.0000e+00 (0.000s)
16 R1: err = 4.2322e+01, μ = 4.8828e-04 (0.000s)
17 R2: err = 3.7015e+01, μ = 4.8829e-04 (0.272s)
18 R3: err = 2.4766e+01, μ = 4.8835e-04 (0.185s)
19 R4: err = 7.4229e+00, μ = 4.8868e-04 (0.182s)
20 R5: err = 2.0076e-01, μ = 4.9086e-04 (0.182s)
21 R6: err = 4.1234e-06, μ = 5.0957e-04 (0.183s)
22 inversion halted in 6 iterations after 0m1.186s
23 shutting down cuBLAS
24 shutting down kernels
Listing B.3: CPU single-precision, random diagonally-dominant (n = 2048).
1 > ./acmi 02048 -DS deadbeef -c
2 seeding PRNG with deadbeef
3 generating 32-bit random 2048x2048 integer nonnegative
4 diagonally-dominant matrix...
5 16.000 MiB/matrix; allocating 64.000 MiB total
6 GPU acceleration disabled!
7 inverting random matrix with cubic convergence.
8 initializing work matrices...
9 computed alpha = 1.05396e-08
10 R0: err = 4.4255e+01, μ = 0.0000e+00 (0.003s)
11 R1: err = 4.2322e+01, μ = 4.8828e-04 (0.530s)
12 R2: err = 3.7015e+01, μ = 4.8829e-04 (0.523s)
13 R3: err = 2.4766e+01, μ = 4.8835e-04 (0.523s)
14 R4: err = 7.4229e+00, μ = 4.8868e-04 (0.523s)
15 R5: err = 2.0077e-01, μ = 4.9089e-04 (0.523s)
16 R6: err = 1.7253e-05, μ = 2.1318e-03 (0.523s)
17 R7: err = 3.8735e-06, μ = 7.5426e+08 (0.523s)
18 inversion halted in 7 iterations after 0m3.846s
Listing B.4: Laptop GPU single-precision, random diagonally-dominant (n = 2048).
1 > ./acmi 02048 -DS deadbeef
2 850.562 MiB device memory available
3 seeding PRNG with deadbeef
4 generating 32-bit random 2048x2048 integer nonnegative
5 diagonally-dominant matrix...
6 16.000 MiB/matrix; allocating 64.000 MiB total
7 setting up cuBLAS
8 setting up kernels on NVS 4200M
9 blocks/matrix: 40, threads/matrix: 40960
10 blocks/vector: 2, threads/vector: 2048
11 uploading matrix to device
12 inverting random matrix with cubic convergence...
13 initializing work matrices...
14 computed alpha = 1.05396e-08
15 R0: err = 4.4255e+01, μ = 0.0000e+00 (0.004s)
16 R1: err = 4.2322e+01, μ = 4.8828e-04 (0.000s)
17 R2: err = 3.7014e+01, μ = 4.8829e-04 (0.948s)
18 R3: err = 2.4766e+01, μ = 4.8836e-04 (0.714s)
19 R4: err = 7.4232e+00, μ = 4.8870e-04 (0.713s)
20 R5: err = 2.0094e-01, μ = 4.9125e-04 (0.714s)
21 R6: err = 5.9371e-05, μ = 7.3176e-03 (0.713s)
22 R7: err = 5.9608e-06, μ = 2.8482e+07 (0.714s)
23 inversion halted in 7 iterations after 0m5.234s
24 shutting down cuBLAS
25 shutting down kernels
Listing B.5: Single-precision, random (n = 4096) with promotion to double precision.
1 > ./acmi -HD 04096
2 7998.438 MiB device memory available
3 using RDRAND RNG
4 generating 32-bit random 4096x4096 integer nonnegative
5 diagonally-dominant matrix...
6 64.000 MiB/matrix; allocating 256.000 MiB total
7 setting up cuBLAS
8 setting up kernels on GeForce GTX 1080
9 blocks/matrix: 40, threads/matrix: 40960
10 blocks/vector: 4, threads/vector: 4096
11 uploading matrix to device
12 inverting random matrix with cubic convergence...
13 initializing work matrices...
14 computed alpha = 1.86387e-09
15 R0: err = 6.3000e+01, μ = 0.0000e+00 (0.001s)
16 R1: err = 6.1047e+01, μ = 2.4414e-04 (0.000s)
17 R2: err = 5.5544e+01, μ = 2.4414e-04 (0.079s)
18 R3: err = 4.1838e+01, μ = 2.4415e-04 (0.059s)
19 R4: err = 1.7885e+01, μ = 2.4421e-04 (0.058s)
20 R5: err = 1.3999e+00, μ = 2.4472e-04 (0.059s)
21 R6: err = 5.3994e-04, μ = 1.9681e-04 (0.059s)
22 R7: err = 5.3156e-05, μ = 3.3769e+05 (0.059s)
23 diverging, extending to double precision...
24 R8: err = 4.5140e-04, μ = 0.0000e+00 (0.064s)
25 R9: err = 2.9680e-13, μ = 3.2268e-03 (1.479s)
26 inversion halted in 9 iterations after 0m3.382s
27 shutting down cuBLAS
28 shutting down kernels
Listing B.6: Excerpt of an attempt to invert within an unachievable error limit with divergence detection disabled, showing self-corrective bouncing where double-precision numerical stability ends (n = 1024).
1 R18: err = 1.1310e+00, μ = 3.2129e-01 (0.026s)
2 R19: err = 9.2789e-01, μ = 6.4140e-01 (0.027s)
3 R20: err = 7.5785e-01, μ = 9.4861e-01 (0.026s)
4 R21: err = 4.3523e-01, μ = 9.9993e-01 (0.029s)
5 R22: err = 8.2442e-02, μ = 1.0000e+00 (0.027s)
6 R23: err = 5.6034e-04, μ = 1.0000e+00 (0.027s)
7 R24: err = 1.7601e-10, μ = 1.0004e+00 (0.027s)
8 R25: err = 4.8498e-12, μ = 8.8943e+17 (0.029s)
9 R26: err = 5.0036e-12, μ = 4.3863e+22 (0.026s)
10 R27: err = 4.9887e-12, μ = 3.9825e+22 (0.025s)
11 R28: err = 5.0116e-12, μ = 4.0365e+22 (0.028s)
12 R29: err = 5.1043e-12, μ = 4.0550e+22 (0.028s)
13 R30: err = 4.8396e-12, μ = 3.6393e+22 (0.027s)
14 R31: err = 4.8361e-12, μ = 4.2664e+22 (0.028s)
15 R32: err = 5.0029e-12, μ = 4.4231e+22 (0.028s)
16 R33: err = 4.9357e-12, μ = 3.9418e+22 (0.027s)
Listing B.7: 10 x 10 Hilbert matrix.
1 > ./acmi mats/hilb10.mtx -p 64
2 7998.438 MiB device memory available
3 loading mats/hilb10.mtx
4 matrix is 10x10 and 0.001 MiB in size
5 0.001 MiB/matrix; allocating 0.003 MiB total
6 setting up cuBLAS
7 setting up kernels on GeForce GTX 1080
8 blocks/matrix: 1, threads/matrix: 100
9 blocks/vector: 1, threads/vector: 10
10 uploading matrix to device
11 inverting mats/hilb10.mtx with cubic convergence...
12 initializing work matrices...
13 computed alpha = 0.560059
14 R0: err = 2.9344e+00, μ = 0.0000e+00 (0.000s)
15 R1: err = 2.8557e+00, μ = 1.1302e-01 (0.000s)
16 R2: err = 2.7731e+00, μ = 1.1908e-01 (0.000s)
17 R3: err = 2.6940e+00, μ = 1.2634e-01 (0.000s)
18 R4: err = 2.6116e+00, μ = 1.3357e-01 (0.000s)
19 R5: err = 2.5429e+00, μ = 1.4276e-01 (0.000s)
20 R6: err = 2.4541e+00, μ = 1.4925e-01 (0.000s)
21 R7: err = 2.3916e+00, μ = 1.6181e-01 (0.000s)
22 R8: err = 2.3137e+00, μ = 1.6914e-01 (0.000s)
23 R9: err = 2.2264e+00, μ = 1.7976e-01 (0.000s)
24 R10: err = 2.1733e+00, μ = 1.9694e-01 (0.000s)
25 R11: err = 2.0897e+00, μ = 2.0357e-01 (0.001s)
26 R12: err = 1.9970e+00, μ = 2.1885e-01 (0.000s)
27 R13: err = 1.9492e+00, μ = 2.4474e-01 (0.000s)
28 R14: err = 1.8725e+00, μ = 2.5285e-01 (0.000s)
29 R15: err = 1.7619e+00, μ = 2.6838e-01 (0.000s)
30 R16: err = 1.7039e+00, μ = 3.1151e-01 (0.000s)
31 R17: err = 1.6522e+00, μ = 3.3399e-01 (0.000s)
32 R18: err = 1.5440e+00, μ = 3.4236e-01 (0.000s)
33 R19: err = 1.4254e+00, μ = 3.8725e-01 (0.000s)
34 R20: err = 1.3839e+00, μ = 4.7781e-01 (0.000s)
35 R21: err = 1.3287e+00, μ = 5.0136e-01 (0.000s)
36 R22: err = 1.2029e+00, μ = 5.1278e-01 (0.000s)
37 R23: err = 1.0393e+00, μ = 5.9712e-01 (0.000s)
38 R24: err = 9.8325e-01, μ = 8.7588e-01 (0.001s)
39 R25: err = 9.4944e-01, μ = 9.9881e-01 (0.000s)
40 R26: err = 8.5587e-01, μ = 1.0000e+00 (0.000s)
41 R27: err = 6.2693e-01, μ = 9.9998e-01 (0.000s)
42 R28: err = 2.4643e-01, μ = 1.0001e+00 (0.000s)
43 R29: err = 1.4969e-02, μ = 1.0002e+00 (0.000s)
44 R30: err = 1.5726e-04, μ = 4.6887e+01 (0.000s)
45 R31: err = 2.2343e-04, μ = 5.7455e+07 (0.000s)
46 WARNING: diverged R30 is the best we can do
47 inversion halted in 31 iterations after 0m0.002s
48 WARNING: failed to converge to error < 1e-05 within 2147483647 ms
49 shutting down cuBLAS
50 shutting down kernels
Listing B.8: Doomed attempt to invert a 2x2 singular matrix of all ones.
1 > ./acmi mats/1111.mtx
2 7998.438 MiB device memory available
3 loading mats/1111.mtx
4 matrix is 2x2 and 0.000 MiB in size
5 0.000 MiB/matrix; allocating 0.000 MiB total
6 setting up cuBLAS
7 setting up kernels on GeForce GTX 1080
8 blocks/matrix: 1, threads/matrix: 4
9 blocks/vector: 1, threads/vector: 2
10 uploading matrix to device
11 inverting mats/1111.mtx with cubic convergence...
12 initializing work matrices...
13 computed alpha = 0.5
14 R0: err = 1.0000e+00, μ = 0.0000e+00 (0.000s)
15 R1: err = 1.0000e+00, μ = 1.0000e+00 (0.000s)
16 diverged, retrying with alternate R0...
17 R0: err = 1.0000e+00, μ = 0.0000e+00 (0.000s)
18 R1: err = 1.0000e+00, μ = 1.0000e+00 (0.000s)
19 diverging, extending to double precision...
20 backing up to reduce error
21 R0: err = 1.0000e+00, μ = 0.0000e+00 (0.000s)
22 R1: err = 1.0000e+00, μ = 1.0000e+00 (0.000s)
23 R2: err = 1.0000e+00, μ = 1.0000e+00 (0.000s)
24 WARNING: diverged R1 is the best we can do
25 inversion halted in 4 iterations after 0m0.000s
26 WARNING: failed to converge to error < 1e-05 within 2147483647 ms
27 shutting down cuBLAS
28 shutting down kernels
APPENDIX C
SELECTED SOURCE CODE
For the complete source code, please refer to: https://github.com/zauberkraut/acmi
Listing C.1: Altman Inversion Implementation

/* invert.c

   ACMI convergent inversion algorithm implementation. */

#include <math.h>
#include <time.h>
#include "acmi.h"

enum { MAX_RESTART_ITER = 2 };

/* Quickly computes ||I - AR0|| for R0 = alpha*I. */
static double traceErr(double alpha, Mat mA) {
  return sqrt(MatN(mA) + 1 - 2*alpha*trace(mA));
}
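/* Note on traceErr(): with R0 = alpha*I we have I - A*R0 = I - alpha*A, and
   ||I - alpha*A||_F^2 = n - 2*alpha*tr(A) + alpha^2*||A||_F^2.  Because
   altmanInvert() sets alpha = 1/||A||_F, the last term equals 1, so the
   initial error follows from the trace alone without ever forming A*R0. */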
/* Returns the number of milliseconds elapsed since start. */
static int msSince(struct timespec* start) {
  struct timespec time;
  clock_gettime(CLOCK_MONOTONIC, &time);
  return (time.tv_sec - start->tv_sec)*1000 +
         (time.tv_nsec - start->tv_nsec)/1.e6;
}

/* Swaps matrix pointers. */
static void swap(Mat* mp, Mat* np) {
  Mat t = *mp;
  *mp = *np;
  *np = t;
}

/* The inversion algorithm.

   If convRateLimit < 0, |convRateLimit| specifies a multiple of the
   first measured rate to use as the limit. */
double altmanInvert(const Mat mA, Mat* const mRp,
                    const int convOrder, const double errLimit,
                    const int msLimit, double convRateLimit,
                    bool safeR0) {
  static struct timespec start;

  debug("initializing work matrices...");
  clock_gettime(CLOCK_MONOTONIC, &start); // start clock
  const int matCount = convOrder < 4 ? 4 : 5;
  Mat mR = MatBuild(mA), // allocate matrices with A's dimensions
      mAR = MatBuild(mA),
      mX = MatBuild(mA),
      mY = matCount < 5 ? 0 : MatBuild(mA);

  const double alpha = 1/nrm2(mA);
  debug("computed alpha = %g", alpha);
  double err = NAN;

  if (safeR0) {
    debug("starting with safe R0");
    transpose(alpha*alpha, mA, mR);
} else {
MatClear(mR);
addld(alpha, mR);
err = traceErr(alpha, mA);
int iter = 0, totallters = 0, ms; // while time remains for (; (ms = msSince(&start)) < msLimit; iter++, totallters++) {
static double prevErr = INFINITY; static int prevMS = 0;
gemm(l, mA, mR, 0, mAR); // mAR <- AR
if (iter || safeRO) { // dont overwrite above optimization
err = minusIdNrm2(mAR);
}
// rate of convergence
const double convRate = fabs(err)/pow(fabs(prevErr), convOrder);
debug (" sR/od: err = '/,.4e, p = '/,.4e (/0.3fs)", iter < 10,
iter, err, convRate, (ms prevMS)/1000.); prevMS = ms;
if (err <= errLimit) { // weve achieved the desired accuracy break;
}
// handle divergence
if (!safeRO && err >= prevErr && iter <= MAX_RESTART_ITER) {
/* Our attempt to exploit the self-adjoint, positive-definite R0 = alpha*I failed. Start over using the slow, safe R0 = alpha"2 A"T instead. */ debug("diverged, retrying with alternate R0...");
safeRO = true;
transpose(alpha*alpha, mA, mR);
prevErr = INFINITY;
iter = -1; totallters--;
continue;
} else if (MatElemSize(mA) < MAX_ELEM_SIZE && convRateLimit > 0 && (convRate >= convRateLimit || err >= prevErr)) { debug("diverging, extending to double precision...");
if (MatDev(mA) && // quit if we lack GPU memory for promotion !checkDevMemEnough(MatN(mA), MatElemSize(mA), matCount)) { debug("not enough device RAM; halting");
if (err >= prevErr) { // back up if last iter were better swap(&mX, &mR);
}
break;
}
MatPromote(mA); MatPromote(mR); MatPromote(mAR);
MatPromote(mX); if (mY) {
MatPromote(mY) ;
}
double tmp = prevErr ;
prevErr = INFINITY; // in case error shall reinflate if (err >= tmp) { // if weve fully diverged...
debug("backing up to reduce error"); swap(&mX &mR) ; iter -= 2; totallters--; continue; // recompute AR
}
else if (err >= prevErr || !isfinite(err)) {
warn (" diverged R/od is the best we can do", iter 1);
swap(&mX, &mR); // back up R to its previous value
err = prevErr ;
break; // quit
else {
if (iter == 1 && convRateLimit < 0) {
// set rate limit to initial value convRateLimit *= -convRate;
}
prevErr
err ;
switch (convOrder) { // compute the next iteration case 2: // quadratic convergence
138 gemm(l, mR, mAR, 0, mX); // mX <
139 geam(-l, mX, 2, mR, mAR); // mAR <
140 swap(&mR &mAR); // mR <
141 swap(&mX &mAR); // mX <
142 break;
143 case 3: // cubic convergence
144 gemm(l, mAR, mAR, 0, mX); // mX <
145 geam(-3, mAR, 1, mX, mX); // mX <
146 addld(3 mX); // mX <
147 gemm(l, mR, mX, 0, mAR); // mAR <
148 swap(&mR &mAR); // mR <
149 swap(&mX &mAR); // mX <
150 break;
151 case 4: // quartic convergence
152 gemm(l, mAR, mAR, 0, mX); // mX <-
153 geam(-4, mAR, 1, mX, mX); // mX <-
154 addld(6 mX); // mX <-
155 gemm(-l, mAR, mX, 0, mY); // mY <-
156 addld(4, mY); // mY <-
157 swap(&mR &mX) ; // mX <-
158 gemm(l, mX, mY, 0, mR); // mR <-
159 break;
160 def ault :
RAR
2R RAR = next R next R, mAR <- prev R previous R, mAR <- junk
(AR) ~2
-3AR + (AR) "2 31 3AR + (AR)"2 R (31 3AR + (AR) ~2) next R, mAR <- prev R previous R
(AR) "2
-4AR + (AR) "2
61 4AR + (AR) "2
-AR (61 4AR + (AR) "2)
41 - AR(6I~4AR + (AR) "2) previous R
R(4I AR (6I-4AR + (AR) "2) )
f atal (" unsupported convergence order: 7od", convOrder);
}
} // end while
ms = msSince(&start); int minutes = ms/60000;
double seconds = (ms minutes*60000)/1000.; debug (" inversion halted in /0d iterations after /0dm/0.3f s totallters, minutes, seconds); if (err > errLimit) {
warn("failed to converge to error < /0g within /0d ms", errLimit, msLimit) ;
}
// cleanup MatFree(mAR);
MatFree(mX); if (mY) {
MatFree(mY) ;
}
*mRp = mR; // return inverted matrix return err;
Listing C.2: Linear Algebraic Function Wrappers
/* l inalg. c
ACMI linear algebraic operations implemented using (cu)BLAS and CUDA kernels. */
#include
#include
#include "acmi.h"
static cublasHandle_t g_cublasHandle;
void gpuSetUp(const int maxBlocksPerKernel const int n) { debug("setting up cuBLAS");
if (cublasCreate(&g_cublasHandle) != CUBLAS_STATUS_SUCCESS) {
fatal("couldnt open cuBLAS handle");
}
cuSetUp(maxBlocksPerKernel, n);
void gpuShutDown() {
debug("shutting down cuBLAS");
if (g_cublasHandle) cublasDestroy(g_cublasHandle); cuShutDown () ;
/* mT <- alpha*mAr'T */
void transpose(double alpha, Mat mA, Mat mT) { const int n = MatN(mA); const void* const a = MatElems(mA); void* const t = MatElems(mT); const bool dev = MatDev(mA); const double beta = 0;
switch (MatElemSize(mA)) { case 4:
if (dev) {
float alpha32 = alpha;
cublasSgeam(g_cublasHandle CUBLAS_0P_T, CUBLAS_0P_N n, n, &alpha32, a, n, (float*)febeta, t, n, t, n);
} else {
cblas.somatcopy(CblasColMajor, CblasTrans, n, n, alpha, a, n,
break;
case 8:
if (dev) {
cublasDgeam(g_cublasHandle CUBLAS_0P_T CUBLAS_0P_N n, n, fealpha, a, n, febeta, t, n, t, n);
} else {
cblas.domatcopy(CblasColMajor, CblasTrans, n, n, alpha, a, n,
break;
}
}
/* C <- alpha*A*B + beta*C */
void gemm(double alpha, Mat mA, Mat mB, double beta, Mat mC) { const int n = MatN(mA); const void* const a = MatElems(mA); const void* const b = MatElems(mB); void* const c = MatElems(mC); const bool dev = MatDev(mA);
switch (MatElemSize(mA)) { case 4:
if (dev) {
float alpha32 = alpha, beta32 = beta;
cublasSgemm(g_cublasHandle, CUBLAS_0P_N, CUBLAS_0P_N, n, n, n, &alpha32, a, n, b, n, &beta32, c, n);
} else {
cblas.sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n, alpha, a, n, b, n, beta, c, n);
}
break;
case 8:
if (dev) {
cublasDgemm(g_cublasHandle CUBLAS_0P_N CUBLAS_0P_N n, n, n, fealpha, a, n, b, n, febeta, c, n);
} else {
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n, alpha, a, n, b, n, beta, c, n);
}
break;
}
}
/* mB = mC => C <- alpha*A + beta*C
otherwise C <- alpha*A + beta*B */
void geam(double alpha, Mat mA, double beta, Mat mB, Mat mC) { const int n = MatN(mA); const int n2 = MatN2(mA); const void* const a = MatElems(mA); const void* const b = MatElems(mB); void* const c = MatElems(mC); const bool dev = MatDev(mA);
switch (MatElemSize(mA)) { case 4:
if (dev) {
float alpha32 = alpha, beta32 = beta;
cublasSgeam(g_cublasHandle, CUBLAS_0P_N, CUBLAS_0P_N, n, n, &alpha32, a, n, &beta32, b, n, c, n);
} else {
if (b == c) {
cblas.sscal(n2, beta, c, 1);
} else {
memset(c, 0, MatSize(mC) ) ; cblas_saxpy(n2, beta, b, 1, c, 1);
}
cblas_saxpy(n2, alpha, a, 1, c, 1);
}
break;
case 8:
if (dev) {
cublasDgeam(g_cublasHandle CUBLAS_0P_N CUBLAS_0P_N n, n, fealpha, a, n, febeta, b, n, c, n);
} else {
if (b == c) {
cblas_dscal(n2, beta, c, 1);
} else {
memset(c, 0, MatSize(mC)) ; cblas_daxpy(n2, beta, b, 1, c, 1);
}
cblas_daxpy(n2, alpha, a, 1, c, 1);
}
break;
132 }
133
134 /* mA <- mA + alpha*I */
135 void addld(double alpha, Mat mA) {
136 if (MatDev(mA)) {
137 cuAddld(alpha, MatElems(mA), MatN(mA), MatElemSize(mA));
138 } else for (int diag = 0; diag < MatN(mA); diag++) {
139 /* This could be marginally sped up using *axpy with a Ixn
140 vector of ones and a stride of n + 1 over the matrix, but
141 its not worth the trouble. */
142 MatPut(mA, diag, diag, alpha + MatGet(mA, diag, diag));
143 }
144 }
145
146 /* Computes the Frobenius norm of a matrix. */
147 double nrm2(Mat mA) {
148 const int n2 = MatN2(mA);
149 const void* a = MatElems(mA);
150 const bool dev = MatDev(mA);
151 double norm;
152
153 switch (MatElemSize(mA)) {
154 case 4:
155 if (dev) {
156 float norm32 ;
157 cublasSnrm2(g_cublasHandle, n2, a, 1, (float*)&norm32);
158 norm = norm32;
159 } else {
160 norm = cblas_snrm2(n2, a, 1);
161 }
162 break;
163
164 case 8:
165 if (dev) {
166 cublasDnrm2(g_cublasHandle n2 a, 1, (double*)fenorm);
167 } else {
168 norm = cblas_dnrm2(n2, a, 1);
169 }
170 break;
171 }
172
173 return norm;
174 }
175
176 /* Computes the sum of the entries on the main diagonal. */
177 double trace(Mat mA) {
178 double trace;
179
180 if (MatDev(mA)) {
181 trace = cuTrace(MatElems(mA), MatN(mA), MatElemSize(mA));
182 } else {
183 trace = 0.;
184 for (int i = 0; i < MatN(mA); i++) {
185 trace += MatGet(mA, i, i);
186 }
187 }
188
189 return trace;
190 }
191
192 /* Computes the norm of (I A). */
193 double minusIdNrm2(Mat mA) {
194 addld(-l, mA); // sort of a hack, but it works very well
195 double norm = nrm2(mA);
196 addld (1 , mA) ;
197 return norm;
198 }
Listing C.3: Handwritten CUDA Kernels
1 /* kernels. cu
2
3 Custom ACMI CUDA kernels. */
4
5 #include "acmi.h"
6
7 namespace {
8
9 enum { SWEEP.FACTOR = 16 };
10
11 /* Kernel parameters. */
12 int g_blocksPerVector , g_threadsPerVectorBlock,
13 g_blocksPerSweep, g_threadsPerSweepBlock,
14 g_blocksPerMatrix, g_threadsPerMatrixBlock;
15 double* g_buckets;
16 int g_bucketsLen;
17
18 /* Copies elements from one nxn matrix to another, converting them
19 to the precision of the destination matrix. */
20 templateCtypename T, typename U> __global__ void
21 kernCopy(T* dst const U* src const int64_t n2) {
22 const int offset = blockldx.x*blockDim.x + threadldx.x;
23 const int stride = gridDim.x*blockDim.x;
24 const T* const end = dst + n2;
25 src += off set ;
26 dst += off set ;
27
28 for (; dst < end; dst += stride, src += stride) {
29 *dst = src;
30 }
31 }
32
33 templateCtypename T> __global__ void
34 kernAddld(const T alpha, T* a, const int n) {
35 const T* const end = a + n*n;
36 a += (blockldx.x*blockDim.x + threadldx.x)*(n + 1);
37 const int stride = gridDim.x*blockDim.x*(n + 1);
38
39 for (; a < end; a += stride) {
40 *a += alpha;
41 }
42 }
43
44 /* Sweeps sums of matrix elements into buckets. */
45 templateCtypename T> __global__ void
46 kernSweepSums(const T* a, const int64_t len, const int pitch,
47 T* const buckets, const int bucketsLen) {
48 const int offset = blockldx.x*blockDim.x + threadldx.x;
49 const T* const end = a + len;
50 a += offset pitch;
51 const int stride = gridDim.x*blockDim.x*pitch;
52
53 if (offset < bucketsLen) {
54 T partialSum = 0.;
55 for (; a < end; a += stride) {
56 partialSum += *a;
57 }
58 buckets [offset] = partialSum;
59 }
60 }
61
62 } // end anonymous namespace
extern "C" {
/* Sets up kernel parameters. */
void cuSetUp(const int maxBlocksPerKernel, const int n) { cudaDeviceProp prop;
cudaGetDeviceProperties(feprop , 0); // assume using first device
debug (" setting up kernels on 7oS", prop.name); if (maxBlocksPerKernel > prop.maxGridSize [0]) {
fatal ("max blocks supported: 7od", prop maxGridSize [0] ) ;
}
const int maxThreadsPerBlock = prop.maxThreadsPerBlock; const int64_t n2 = n*n;
g_bucketsLen = (n + SWEEP_FACT0R 1)/SWEEP.FACTOR; g_buckets = (double*)cuMalloc(sizeof(double)*g_bucketsLen);
g_blocksPerVector = (n + maxThreadsPerBlock 1) /
maxThreadsPerBlock;
g_blocksPerVector = iMin(maxBlocksPerKernel, g_blocksPerVector); g_blocksPerSweep = (g_bucketsLen + maxThreadsPerBlock 1) /
maxThreadsPerBlock;
g_blocksPerSweep = iMin(maxBlocksPerKernel, g_blocksPerSweep); g_blocksPerMatrix = (n2 + maxThreadsPerBlock 1) /
maxThreadsPerBlock;
g_blocksPerMatrix = iMin(maxBlocksPerKernel, g_blocksPerMatrix); g_threadsPerVectorBlock = iMin(maxThreadsPerBlock, n); g_threadsPerSweepBlock = iMin(maxThreadsPerBlock, g_bucketsLen); g_threadsPerMatrixBlock = iMin(maxThreadsPerBlock, n2);
auto threadsPerVector = g_blocksPerVector*g_threadsPerVectorBlock, threadsPerMatrix = g_blocksPerMatrix*g_threadsPerMatrixBlock; debug (" biocks/matrix : 7od threads/matrix : 7od\n"
" blocks/vector : 7od threads/vector : 7od", g_blocksPerMatrix ,
threadsPerMatrix, g_blocksPerVector, threadsPerVector);
/* Cleans up kernel environment. */
void cuShutDownO {
debug("shutting down kernels"); cuFree(g.buckets);
}
/* Doubles matrix precision. */
void cuPromote(void* dst, void* src, int srcElemSize int64_t n2) { kernCopy<<>>
((double*)dst, (const float*)src, n2);
}
/* Adds alpha*I to the matrix backed by the device array elems. */
void cuAddld(double alpha, void* elems, int n, int elemSize) { switch (elemSize) { case 4:
kernAddld<<>>
((float)alpha, (float*)elems n) ;
break; case 8:
kernAddld<<>> (alpha, (double*)elems, n); break;
}
}
/* Computes a matrix trace by sweeping sums. */
double cuTrace(void* elems, int n, int elemSize) { const int64_t n2 = n*n;
double trace = NAN;
switch (elemSize) { float trace32;
case 4:
kernSweepSums<<>>
((float*)elems n2, n + 1, (float*)g_buckets g_bucketsLen);
kernSweepSums <<<1, 1>>>
((float*)g_buckets g_bucketsLen , 1, (float*)g_buckets , 1);
cuDownload(&trace32 g_buckets sizeof(float)); trace = trace32; break; case 8:
kernSweepSums<<>>
((double*)elems, n2, n + 1, (double*)g_buckets, g_bucketsLen);
kernSweepSums <<<1, 1>>>
((double*)g_buckets g_bucketsLen , 1, (double*)g_buckets , 1);
cuDownload(fetrace g_buckets sizeof(double)) ; break;
return trace;
} // end extern "C"
Listing C.4: Random Matrix Generator
/* Tests for CPU support of the RDRAND instruction. */ static bool rdRandSupported() {
unsigned eax, ebx, ecx, edx;
return __get_cpuid(1, &eax, &ebx &ecx &edx) && ecx & bit_RDRND;
static int cstdRandl6() {
return (int)((unsigned)rand() >> 15); // discard correlated bits
}
/* Uses RDRAND instruction to generate high-quality random integers.
Requires an Ivy Bridge or newer x86 CPU. Requires no seeding. */ static int rdRandl6() { static int n = 0;
static unsigned long long r = 0; if (! n) {
/* RDRAND always pulls 6^ bits, so splitting each term into four 16-bit numbers is far faster. */
_rdrand64_step(&r); // assume entropy suffices
n = 4;
}
int s = (int)(r & Oxffff); r >>= 16; n--;
return s;
}
/* Draws off a bit from a random number and returns it as a sign. */ static int drawSign(int* n) { int tmp = *n;
*n >>= 1;
return (tmp & 1) ? 1 : -1;
/* Generates a random, probably-invertible matrix of integers.
(Certainly-invertible if diagonally-dominant). Allowing negative entries shall probably cause ill conditioning.
TODO: replace bool cascade with binary flags
TODO: large dominant diagonals might swallow bias of +1
TODO: populating elements in column-major order _might_ be faster
TODO: 16-bit granularity of reals can be course */
Mat MatNewRand(int n, int elemSize double maxElem, bool symm, bool real, bool neg, bool diagDom, bool useHardwareRNG) {
if (useHardwareRNG && !rdRandSupported()) {
fatal("your CPU doesnt support the RDRAND instruction");
}
int (*randl6)() = useHardwareRNG ? rdRandl6 : cstdRandl6; int rMax = neg ? SHRT.MAX : USHRT_MAX;
Mat m = MatNew(n, elemSize, false); int maxElemEx = floor(maxElem) + 1;
for (int row = 0; row < n; row++) { int col = symm ? row : 0;
double rowSum = 0;
// if symmetric, sum the elements before this diagonal for (int i = 0; i < col; i++) {
rowSum += MatGet(m, row, i);
}
for (; col < n; col++) {
if (!diagDom || row != col) { // diagonals are summed later int r = randl6();
double sign = neg ? drawSign(fer) : 1;
double absElem = real ? (double)r/rMax maxElem :
(double)(r /0 maxElemEx); double elem = sign*absElem;
MatPut(m, row, col, elem);
if (symm && row != col) { // mirror symmetric element MatPut(m, col, row, elem);
}
rowSum += absElem;
}
}
if (diagDom) {
double sign = neg ? (randl6() & 1 ? 1 : -1) : 1;
/* Make diagonal element strictly greater than the sum of the other row entries. */
double diag = sign (real ? nextafter(rowSum INFINITY) :
rowSum + 1) ;
MatPut(m, row, row, diag);
}
}
return m;
52


Full Text

PAGE 1

PARALLELINVERSIONOFLARGEMATRICESONGRAPHICSHARDWARE by JASONE.TREADWELL B.S.,UniversityofColorado,2008 Athesissubmittedtothe FacultyoftheGraduateSchoolofthe UniversityofColoradoinpartialfulllment oftherequirementsforthedegreeof MasterofScience ComputerScienceProgram 2016

PAGE 2

ThisthesisfortheMasterofSciencedegreeby JasonE.Treadwell hasbeenapprovedforthe DepartmentofComputerScience by GitaAlaghband,Chair TomAltman,Advisor FarnoushBanaei-Kashani BorisStilman December17,2016 ii

PAGE 3

Treadwell,JasonE.M.S.,ComputerScience ParallelInversionofLargeMatricesonGraphicsHardware ThesisdirectedbyProfessorTomAltman ABSTRACT Matrixinversionisanimportantoperationindomainsrangingfromgraphicsand real-timeapplicationstononlinearoptimizationandregressionanalysis.Traditional methodsofmatrixinversionhavecubicruntimesthatscalepoorlyasthematrix dimensiongrows. AmethodbyM.Altmanproducesasequenceofcubicallyandmonotonicallyconvergentapproximationsofthematrixinverseusingafewmatrixmultiplications andadditionseachiteration.Whiletheseoperationsthemselvesarehardlytrivial,the homogeneousandindependentnatureoftheirconstituentcomputationsmakesthem primecandidatesforparallelaccelerationusingrecentadvancesinGPU graphics processingunit hardware. UsingaconsumerGPU,weshowthatmatricescanbequicklyinvertedwithhigh delityusingAltman'smethod.Weexploittheerrortoleranceofthemethodto gainfurtherspeedbyusingmixed-precisioncomputationsandcomparetheresultsto thosefromdouble-precisionandCPUcomputations. Theformandcontentofthisabstractareapproved.Irecommenditspublication. Approved:TomAltman iii

PAGE 4

TABLEOFCONTENTS CHAPTER I.INTRODUCTION...............................1 1.1MatricesandTheirInverses.......................1 1.2Altman'sIterativeMatrixInversionMethod..............3 1.3ComputationalAspectsofAltman'sMethod.............6 II.COMPUTERLINEARALGEBRA......................7 2.1BLASandOpenBLAS.........................7 2.2ParallelComputationontheGPUwithCUDA............8 2.3CUDALinearAlgebrawithcuBLAS..................12 III.IMPLEMENTATIONOFALTMAN'SMETHOD..............14 3.1DevelopmentEnvironmentandGeneralArchitecture.........14 3.2TheInitialInverseApproximation...................15 3.3ComputationoftheSequence......................16 3.4RecoveryfromDivergence........................18 IV.EXPERIMENTALRESULTS.........................20 4.1TestingEnvironmentandOverviewofExperiments.........20 4.2Diagonally-DominantRandomMatrices................22 4.3RandomMatrices............................25 4.4Real-WorldLargeMatrices.......................27 4.5HilbertMatrices.............................29 4.6ComparisonoftheOrdersofConvergence...............30 V.CONCLUSION.................................33 5.1ImplicationsofResults.........................33 5.2FurtherResearch.............................33 REFERENCES...................................35 APPENDIX iv

PAGE 5

A.ACMIInverterUsage..............................38 B.SampleACMIInversionRuns.........................39 C.SelectedSourceCode..............................43 v

PAGE 6

FIGURES FIGURE 2.1CPU/GPUArchitecturalDierences....................8 2.2HeterogeneousProgrammingModel.....................10 2.3CUDAGrid/Block/ThreadModel.....................11 4.1GeForceGTX1080StreamingMultiprocessor...............21 4.2InversionRuntimefor n n Diagonally-DominantRandomMatrices..23 4.3InversionRuntimefor n n RandomMatrices...............25 4.4InversionRuntimefor n n BooleanMatrices...............26 4.5InversionRuntimeforLarge,SparseMatrices...............28 4.6InversionErrorfor n n HilbertMatrices.................30 4.7Orderp InversionRuntimeforDiagonally-DominantRandomMatrices.31 4.8Orderp InversionRuntimeforRandomMatrices.............32 4.9TotalOrderp InversionIterationsforRandomMatrices.........32 vi

PAGE 7

TABLES TABLE 2.1DierencesbetweenaCPUandaGPU...................9 3.1BLASMatrix/VectorRoutinesUsed....................16 3.2ImplementedCustomMatrixRoutines...................17 4.1GeForceGTX1080Specications......................20 4.2SparseMatrixCollectionSpecimens....................27 vii

PAGE 8

CHAPTERI INTRODUCTION 1.1MatricesandTheirInverses An m n matrix isatwo-dimensionalarrayof m rowsand n columnsofrealvaluedorcomplex elements .Matricesareusedtomodelsystemsoflinearequations, graphs,rastergraphicsandmanifoldothercomputationalphenomena.Considera systemofthreeequationsoverthreescalarvariables x 1 ;x 2 and x 3 : a 11 x 1 + a 12 x 2 + a 13 x 3 = b 1 ; a 21 x 1 + a 22 x 2 + a 23 x 3 = b 2 ; a 31 x 1 + a 32 x 2 + a 33 x 3 = b 3 : Thislinearsystemcanberepresentedasamatrixofthecoecientstimesa vector or n 1matrixofthevariablesequaltothevectoroftheright-handside: 0 B B B B @ a 11 a 12 a 13 a 21 a 22 a 23 a 31 a 32 a 33 1 C C C C A 0 B B B B @ x 1 x 2 x 3 1 C C C C A = 0 B B B B @ b 1 b 2 b 3 1 C C C C A : Thesamesystemisrepresentedconciselyandalgebraicallyas Ax = b; wherecapitalandsmallsymbolsrepresentmatricesandvectors,respectively. Addition,multiplicationandotheroperationsmaybeperformedonmatrices. Theelement c ij inthe i throwand j thcolumnofthematrixsum C = A + B is simplythesumoftheelementsfromthecorrespondingpositions: a ij + b ij .Each element c ij ofthematrixproduct C = AB isthedotproductofthe i throwof A and 1

PAGE 9

the j thcolumnof B .An m p matrixismultipliedontherightbya p n matrix toyieldan m n result;eachrowfromtheleftmustcorrespondwithonecolumnon theright.Multiplicationofthematrix a ij byascalar c iscommutativeandyields ca ij Thealgebraofmatricesadmitsthe inverse ofa square i.e., n n matrix 1 Analogoustoscalaralgebra,wheretheinverseofascalar x satises x )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 x =1,a matrixmultipliedbyitsinverseproducesthe n n identity matrix I ofonesonthe maindiagonalandzeroselsewhere. A )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 A = AA )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 = I = i jk i jk = 8 > > < > > : 1if j = k; 0otherwise : ; AI = IA = A: Matrixmultiplicationisotherwisenotgenerallycommutative,yetitisassociative anddistributesoveradditionwhichitselfisbothcommutativeandassociative. Wecanapplytheinverseofthematrix A tobothsidesoftheprevioussystemto simultaneouslysolveallthreeequations. A )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 Ax = A )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 b )167(! x = A )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 b: A singular matrixhasnoinverse.Every invertible matrixissquare,hasa uniqueinverse,hasanonzerodeterminant,consistsexclusivelyoflinearly-independent columnsandsatisesahostofadditionalconditionsperthe invertiblematrixtheorem .Everypositive-denitematrixisinvertible[9].Givenanonsingularmatrix,the processof inversion tocomputeitsinverseishardlytrivial. Whileinvertingamatrixtosolvealinearsystemisneithercomputationallyefcient[6]nornumerically-stable[9]i.e.,pronetoampliedround-oerror,anapproximateinversecanbequicklycomputedwithsucientaccuracytosatisfycertain 1 The Moore-Penrosepseudoinverse generalizationofthematrixinverseappliestonon-square matrices[20],butwasnotinvestigatedhere. 2

PAGE 10

real-timeroboticsapplications[32].Matrixinversionisalsoemployedtoperform regressionanalysis[20],nonlinearprogramming[5],Newton'smethodfornonlinear systems[6]aswellasraycastingincomputergraphics[30]. Gauss-Jordaneliminationisaslow[29],butintroductorymethodofinversion whereweuseGaussianeliminationtoreducetheleftsideoftheaugmentedmatrix [ A : I ]to I ,producing A )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 ontherightside[6].TheLUandLUPtriangulardecompositionsofthematrix A ,whicharenumerically-stablemethodsofsolvinglinear systems,canbeusedtosolve Ax i = e i foreachunitvector e i toproducethecolumns of A )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 morequickly[9].CholeskydecompositionofHermitian,positive-denitematricesismuchmoreecientthanLUdecompositionandcansimilarlybeusedto invertamatrix[29].Eachofthesetraditional"inversionmethodshas n 3 runtimecomplexity. Winogradprovedthatmatrixmultiplicationisnoharderthaninversionwhile Aho,Hopcroft,andUllmanprovedtheconverse[9];thelowerandupperruntime boundsformatrixinversionthushavequadraticandcubicorder,respectively. 1.2Altman'sIterativeMatrixInversionMethod M.Altman[3]denedafamilyofsequences R n ofmonotonically-convergent approximationstotheinverse A )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 : R n +1 = R n I + T n + T 2 n + + T p )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 n ; T n = I )]TJ/F20 11.9552 Tf 11.955 0 Td [(AR n ; lim n !1 R n = A )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 : Eachsequenceofthisfamilyhas degree or order p ,where p 2,eachiterationof thesequencerequires p matrixmultiplications,andthesequenceconvergestowards A )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 withdegree p .Thefollowingarethesequencesofdegree2,3and4thatconverge quadratically,cubicallyandquarticallytowards A )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 ,respectivelynotingthatwe 3

PAGE 11

compute AR n onlyonceperiteration: quadratic R n +1 = R n I )]TJ/F20 11.9552 Tf 11.955 0 Td [(AR n ; cubic R n +1 = R n I )]TJ/F15 11.9552 Tf 11.955 0 Td [(3 AR n + AR n 2 ; quartic R n +1 = R n I )]TJ/F20 11.9552 Tf 11.956 0 Td [(AR n I )]TJ/F15 11.9552 Tf 11.955 0 Td [(4 AR n + AR n 2 : Givenanappropriatestartingapproximation R 0 ,eachsequenceformsamethod ofquicklycomputingtheinverseof A in k iterationswithinsomeacceptable error measure k I )]TJ/F20 11.9552 Tf 11.983 0 Td [(AR k k aFrobeniusnormthatapproacheszeroas R k approaches A )]TJ/F18 7.9701 Tf 6.587 0 Td [(1 since AA )]TJ/F18 7.9701 Tf 6.586 0 Td [(1 = I .Whensuchanerrorboundexistsandisattainable,theruntimeof thismethodcandrasticallyoutperformthetraditionalinversionmethodsdescribed intheprevioussection. Wewishtoknowwhichofthesesequencesprovidesthefastestmethodofinversion.Sincematrixmultiplicationdominatesthecomputationalruntimeofaniteration ofanysequenceinthefamily,thecloserwegettothetrueinversepermultiplication,thebetter.Considertwosuchsequences:oneofdegree p whichrequires p multiplicationsperiterationandoneofdegree q suchthat,after m iterationsof therstsequenceand n iterationsofthesecond, matrixmultiplicationshavebeen performedovereachi.e., mp = nq = .The better orfastersequenceistheone which,after multiplications,resultsintheapproximateinversewiththelowererror measure.Supposethesequenceofdegree p isthebettermethod.Altmanshowed that,sincetheerrormeasuresoftheseiterationsareafunctionoftheinitialinverse approximation'serrormeasureraisedto p m or q n respectively,wehave p m >q n ; p nq=p >q mp=q ; p 1 =p >q 1 =q : Theruntimeisthusminimizedbyusingthesequenceofdegree p = d e e =3,which 4

PAGE 12

convergescubicallywithsuccessiveiterationstotheinverseof A : R n +1 = R n I )]TJ/F15 11.9552 Tf 11.955 0 Td [(3 AR n + AR n 2 : Thechoiceoftheinitialapproximation R 0 usedtoseedtheinversioniscrucial. Theconvergenceofthemethodisguaranteedprovidedthat k I )]TJ/F20 11.9552 Tf 11.588 0 Td [(AR 0 k < 1.Forany invertiblematrix,wemaychoose R 0 = A ,where A istheconjugatetransposeof A .Weshallonlyconsiderreal-valuedmatrices,soweeectivelyhave R 0 = A T .The scalarmultiple mustliewithinaniteintervalandleadstothefastestconvergence atthe criticalvalue =2 = M 2 + m 2 ,where m and M andtheminimumand maximumeigenvaluesof A ,respectively. Unfortunately,computingeigenvalues 2 isgenerallyaboutashardasmatrixinversionitself[28].Fortunately,anacceptablechoiceof canbequicklydrawnfrom theinterval 0 1 k A k 2 ; wheretheupperboundproducesthefastest-convergingchoiceof R 0 Should A beself-adjointandpositivedenite,wecanperformthemuchfaster computation R 0 = I ,wherethefastestconvergenceoccurswith =2 = M + m Aneasily-computedalternativeis = 1 k A k : TheiterativeimprovementofAltman'smethodmakesitself-correcting"ina senseandgrantsitsometoleranceofoating-pointround-oerrorandless-thanidealchoicesof R 0 .Incontrasttootherinversionmethods,Altman'scanprovide usefulintermediateresultswhenitsruntimeisconstrained[32]. 2 Onetypicalmethodofwhichisthe QRalgorithm [28]. 5

PAGE 13

1.3ComputationalAspectsofAltman'sMethod SincetheruntimeofanAltmaniterationisdominatedbyitsmatrixmultiplications,themethod'scomplexitydependsuponthemultiplicationalgorithmused.The squarematrixmultiplicationproblemhas n 2 and O n 3 runtime;thetheoretical lowerboundsimplyreectsthesizeoftheoutput. Specically,navematrixmultiplicationperformedbycomputingall n 2 dotproductsrunsin n 3 timeusingconstantspace.Strassen'slandmark1969divide-andconqueralgorithmrunsin n lg7 timeandcanberuninparallelalmosttothesame extentasthenavealgorithm,yetislessnumericallystable,consumesmorespaceand doesn'texploitsparsematriceswell.Furthermore,itsruntimehasalargerconstant factor,outperformingthenavealgorithmonlyatahighly-variable crossoverpoint ashighas n =2150[9].TheCoppersmith-Winogradalgorithmof1987currentlyhas thebestknownupperboundof O n 2 : 376 ,yethasacrossoverpointsohighastobe wildlyimpracticalforanyconceivableapplications"[8]. Forthesereasons,linearalgebralibrariestendtoimplementoptimizedversions ofthenavemultiplicationalgorithm[19].Usingthatalgorithm,thecomputation ofasingleAltmansequenceiterationhasthesame n 3 runtimeasthetraditional matrixinversionmethods.However,thelowerconstantfactoroftheformercombined withasmallnumberofiterationsduetothefastconvergencecanmakeforashorter totalruntime.Finally,andperhapsmostimportantly,eachAltmaniterationrequires onlymatrixmultiplicationandadditiontocompute;theseoperationsscaleverywell inparallel. 6

PAGE 14

CHAPTERII COMPUTERLINEARALGEBRA 2.1BLASandOpenBLAS The BasicLinearAlgebraSubprograms ,orBLAS,areavenerablespecication ofroutinesthatperformelementarymatrixoperations.BLASincludesaddition, multiplication,rankupdate,normandsolutionroutinesthataredividedamongthree levels [1]: 1.Scalar,vectorandvector-vectoroperations. 2.Matrix-vectoroperations. 3.Matrix-matrixoperations. BLASwasoriginallydevelopedinFortranandtraditionallyemploys columnmajor ordering|amatrix A with m rowsof n columnsis strided acrosscolumns andislaidoutinsequentialmemoryas f a 11 ;a 21 ; ;a m 1 ;a 12 ;a 22 ; ;a mn g .BLAS providesroutinesforworkingwithbothsingleanddouble-precisionoating-point matrixelements;thelatterprecisionrequirestwiceasmuchstorage,butisneeded whennumericalinstabilityrenderstheformerinadequate.Inourimplementationof Altman'sinversionmethod,weusedthelevel3 GEMM routinestomultiplymatrices; wetreatedmatricesas1 n 2 vectorstonormthemandtoaddthemtogetherusing level1routines. TherearemanyBLASimplementations.Forrunningcomputationspurelyinsoftware,weemployedthe OpenBLAS library,anup-to-dateforkofthehighly-optimized andsophisticatedGotoBLASlibrary[23].Specically,weusedOpenBLAScompiled todivideitscomputationsacrossallavailablecoresusingPOSIXthreadsonLinux withbindingsfortheCprogramminglanguage.MatrixmultiplicationinOpenBLAS 7

PAGE 15

ishighly-performantonrecentIntelprocessors,outperformingIntel'sown MKL implementationofBLAS[35]. Inaddition,weimplementedcustomroutinesforhandlingtasksnoteasilysupportedbyBLAS,suchasaddingamultipleoftheidentitymatrixtoanothermatrix, computingthetraceofamatrixi.e.,thesumofitsdiagonalelementsandconverting amatrixfromsingletodoubleprecision.Theseroutinesweredominatedinruntime bythealgebraicmatrixoperationsandwerethuscomputedseriallyforsimplicity. 2.2ParallelComputationontheGPUwithCUDA Forparallelcomputationsacceleratedongraphicshardware,weusedNvidia's proprietary CUDA platform.Introducedin2007andthesubjectofmuchresearch andcommercialapplication,CUDAisa GeneralPurposeGraphicsProcessingUnit or GPGPU toolkitforexecutingparallelcodeonasystem'sgraphicshardwarefor non-graphicaltasks.CUDAimplementsamodelof heterogeneouscomputing where ageneral-purpose host CPUcoordinatessuitabletasksacrossoneormoreheavilyspecializedGPU devices CUDAwasforeshadowedinthemid-1990swhenGPUtransistorcountsbegan toexceedthoseofCPUs.HeterogeneousGPGPUcomputingexploitsthemassivelyparallelnatureofGPUhardware,whichwasoriginallydevelopedtorapidlyprocess largenumbersofindependentpixelsforcomputergamesandothergraphicalapplications[34]. Figure2.1: CPU/GPUArchitecturalDierences [24] 8

PAGE 16

Table2.1: DierencesbetweenaCPUandaGPU CPUGPU Containsseveralpowerfulcores.Contains many lightweightcores. Cachedominatesthedie.Logicdominatesthedie. Highcoreclockrates.Highmemorybandwidth. Meantforcomplicatedsequentiallogic.Designedfor instruction-levelparallelism Predictsandcachescodepaths.Hideslatencythroughparallelism. Fully-independentcores.Coresexecutethesameinstruction. Expensivethreadcontextswitches.Verycheapthreadcontextswitches. AtypicalmodernCPUcontains2to4coreswhereasarecentNvidiaGeForce GTX1080cardhas2560coresintotal.GPUhardwareimplementsa singleinstruction,multiplethread SIMT architecturethatdiersfromits multipledata SIMDcounterpartinthatanoperandvector'selementsareprocessedbyindependentthreads,whosebehaviormightnotbeuniform[7].ComputationsontheGPU coverup"driverandbuscommunicationlatencyaswellaslowercoreclockratesby ecientlyexecutingmanythreadsinparallel.CUDAnaturallytsintothecurrent trendtowardsparallelcomputingandcanacceleratesuitableworkloadsbyafactor of5to400;thisaccelerationcomesatthecostoftheincreasedcomplexityinherent toconcurrentcomputation[34]. AheterogeneousapplicationusingtheCUDAruntimeconsistsofastandardCPU programbinarybundledwithGPU computekernel functionswritteninC ++ and compiledwithNvidia's nvcc compiler.ThesekernelsareloadedintoGPUmemory duringinitializationoftheruntime.Duringexecution,theprograminteractswith theGPUoverthePCIExpressbusthroughthedevicedriverbytransferringdata fromhostmemorytodevicememory, launching kernelsoverdataindevicememory andreadingresultsbacktohostmemory.Kernellaunchesarequeuedinorderof 9

PAGE 17

invocationanddispatchedbythedriver 1 ,runningasynchronouslywithrespecttothe CPUprogram,whichsynchronizeswiththelaunchedkernelswhenaccessingdevice memoryoruponexplicitcommand. Figure2.2: HeterogeneousProgrammingModel [24] TheGPUhardwareconsistsofapoolof global devicememoryandsomenumberof streamingmultiprocessors or SMs .EachSMconsistsofcoresthatexecute integerandoating-pointoperationsinparallel,smallernumbersofdouble-precision andspecialfunctione.g.,transcendentalunits,caches,thousandsofregisterstobe sharedamongallresidentthreads,asmallpoolof shared memoryforfastinter-thread communicationaswellas warp schedulers. Kernelsarelaunchedwithaspeciednumberof blocks andthreadsperblock. Eachblockisscheduledseparatelybythedriverontothenext-availableSMwhere 1 Theprogrammermayestablishoneormore streams throughwhichkernelscanberunconcurrentlyontheGPU. 10

PAGE 18

itsthreadsaredividedintowarpsconsistingof32threadseach.Awarp'sthreads runondistinctexecutioncoresthatallsimultaneouslyexecutethesameinstruction dispatchedbythewarpscheduler[7].Thesetofblockscomprisingakernelexecution iscalleda grid .EachthreadinthegridhasadistinctIDwithwhichitcandetermine itsportionofthetotalparallelcomputation. Globaldevicememoryaccessisamajorbottleneckforkernels,takinghundreds ofclockcyclesperrequest.Kernelscanbeoptimizedtointernallyusesharedmemory orto coalesce memoryaccess;thelatteroccurswhenablock'sthreadssimultaneously operateovercontiguousandadjacentrunsofglobalmemory,minimizingthenumber ofmemoryreadsandwritesperformedbeforeeachinstructionbytheGPU,which readsfromandwritestomemoryin32-byteorwidercontiguouschunks[7]. Figure2.3: CUDAGrid/Block/ThreadModel [24] Inmanyways,GPUsoutperformCPUsatindividualoating-pointoperations. Single-precisionoating-pointoperationsformtheworkhorse"ofGPUcomputations andarethusheavilyoptimized[34].Double-precisionoperationsincuraperformance penaltyastheyrequiretwicethememoryandregistersaswellasspecialunitsonthe 11

PAGE 19

streamingmultiprocessor.However,foriterativemethodsandothercomputations thataresensitivetonumericalinstability,double-precisionarithmeticprovestobe invaluable. OnNvidia'sTeslaGPUsmeantforserversandforscienticcomputing,these double-precisionunitsareoutnumbered3:1and2:1ontheolderKeplerandthelatest Pascalarchitectures,respectively;thisratioreectstherelativemaximumthroughputsoftheseprecisions[7][26].TeslaGPUsaren'tintendedforconsumerdesktop computingandareextremelyexpensive,whereastheGeForcelineofGPUs,which areintendedforcomputergaming,arefarmoreaordableandcommon. Nvidia'sownwhitepapersappeartobemuteonthetopic,butsourcesonthe InternetvariouslyclaimthatGeForceGPUsareeitherfabricatedwithfewerdoubleprecisionunitsorareoutrightcrippled"byhavingtheseunitssimplydisableddespite beinginparitywiththeirTeslacounterparts[33][15].Whateverthecause,theresult isthattheratioofsingle-precisiontoavailabledouble-precisionunitsonahighendPascalGeForceGPUis32:1,resultinginadramaticlossofperformancewhen computingwithdoubleprecisionontheconsumer-gradeunits. 2.3CUDALinearAlgebrawithcuBLAS OurGPUimplementationofAltman'smethodrunsonCUDAprimarilythrough Nvidia's cuBLAS library.ThisBLASimplementationisparallelandheavilyoptimizedforworkingwithdensematrices,outperformingoptimizedCPUBLASimplementationsbyafactorofleast15[7].CUDAalsocomespackagedwiththecuSPARSElibraryoptimizedforcomputationsoversparsematrices,butthisisn'tused forinvertingsuchmatricesastheirinversesandintermediateiterationstendtobe dense[11]. ThecuBLASlibraryincludesBLAS-like"extensionssuchasthe GEAM routines, whichareusedtomoreconvenientlyperformmatrixadditionandtransposition. 12

PAGE 20

Handwrittenkernelswereemployedforthesametasksthatweren'twell-suitedto OpenBLAS.Sincetheseoperationsareeasilydividedintoindependentdivide-andconquerproblemsandarethus embarrassinglyparallel [16],theirkernelscan'texploit inter-threadcommunicationandthusonlyachieveoptimizationbycoalescedmemory access.AhandwrittenkernelwastestedforcomputingtheFrobeniusnormofamatrix lesssomemultipleoftheidentity,butitwasstilloutperformedbythemuch-simpler level1 SNRM2 routineoeredbycuBLAS.Acustomkernelisalsousedtoconvertmatricesfromsingletodoubleprecision;whilethe copy functionfromCUDA's Thrust implementationoftheC ++ STLcanaccomplishthiswithidenticalperformance,includingThrustamongthelinkedlibrariessubstantiallyincreasedcompilationtime andinatedthesizeofthebinary. 13

PAGE 21

CHAPTERIII IMPLEMENTATIONOFALTMAN'SMETHOD 3.1DevelopmentEnvironmentandGeneralArchitecture OurimplementationofAltman'smethodiscalled ACMI 1 .ACMIwasdeveloped onLinuxspecicallyopenSUSE42.1andDebian8.5usinganNvidiaGeForceGTX 1080graphicscard|ahigh-end,consumer-gradedeviceseeSection4.1foradditional detailsregardingcomputerhardware.ACMIwaswrittenprimarilyinCandcompiledwithGCC;devicekernelsandrelatedcodewerewritteninC ++ andcompiled withNvidia's nvcc compilerfromtheCUDA8.0SDK.Wehavenotexploredthe simultaneoususageofmultiplegraphicscardsoverSLIforthisapplication. ForreasonsseeSection2.2thatareaseconomicalastheyarepractical,ACMI wasdevelopedforrelatively-inexpensiveconsumerhardware,hencetheGeForceGPU used.ComparedtotheirTeslacounterparts,theseunitsperformdouble-precision arithmeticveryslowly.ACMIwasthuswrittentousesingle-precisionarithmeticby defaultforaslongasitsucesinordertospeedupcomputation,resortingtodouble precisionperheuristicsdescribedinthischapter.Thesizeandnumericalstability ofthesourcematrixdeterminehowmuchACMIcanaccomplishinsingleprecision. However,ACMImayalsobeconguredtorunallcomputationsindoubleprecision orontheCPUfromthestart. ACMIcanacceptaMatrixMarket 2 leasinputorgeneratearandomsource matrixwithvariouscongurablecharacteristics,includingsize,diagonaldominance andelementdomains.Sincesoftwarepseudorandomnumbergeneratorscanbeofdubiousquality[21]andevenrespectableonescanfailastatisticalmatrixranktest[22], ACMIcanoptionallyuseaveryhigh-quality[18]hardwarerandomnumbergenera1 ACMI'saConvergentMatrixInverter. 2 ACMIusessourcecodeprovidedbyNISTforreadingandwritingmatricesencodedinthe MatrixMarketformat[27]. 14

The inverse computed by the program may be written to disk in the Matrix Market format. ACMI's command-line interface also lets the user configure the order of convergence, initial floating-point precision, maximum inversion error, whether to use the GPU and other inversion parameters (see Appendix A for a complete listing).

During execution, ACMI loads or generates the source matrix, initializes the GPU and runs the centerpiece altmanInvert routine over the matrix. This routine allocates work matrices, computes an initial inverse approximation R_0 and computes iterations {R_i} of the Altman sequence until an approximate inverse is generated with an error measure ||I - AR_i|| under the given limit (10^-5 by default); a lower error limit might necessitate additional sequence iterations. If the sequence diverges, the routine either restarts with a different initial approximation, promotes the work matrices to double precision or halts, returning the previous inverse approximation, depending on the circumstances.

3.2 The Initial Inverse Approximation

As discussed in Section 1.2, determining the ideal initial inverse approximation of a matrix A requires knowledge of its eigenvalues and is, therefore, a computationally-intensive operation. Acceptable and quickly-computable substitutes are R_0 = I/||A|| when A is positive definite and R_0 = A^T/||A||^2 for the general case. However, determining whether A is positive definite is related to the eigenvalue problem and is itself unacceptably expensive [6].

It turns out that the inverse approximation for positive-definite matrices works for many matrices outside that class. For these matrices, as well as true positive-definite ones, this approximation yields much faster convergence than when using its general-case counterpart. Moreover, we can usually determine early in the inversion process whether the positive-definite approximation shall lead to divergence and never produce a suitable inverse.

ACMI's default operation is, therefore, to simply apply the positive-definite approximation first and switch to the general approximation in the event of early divergence.

Unexpectedly, both initial approximations usually fail to satisfy ||I - AR_0|| < 1 and do not meet Altman's requirement for guaranteed convergence. This frequently happens to be of little consequence. We could artificially satisfy the condition by lowering the scalar α used to produce R_0, but this dramatically slows the convergence of the sequence in practice.

3.3 Computation of the Sequence

ACMI implements BLAS wrappers that invoke their respective cuBLAS or OpenBLAS routines of the appropriate precision when GPU computation is enabled or disabled, respectively. ACMI additionally implements handwritten CUDA kernels and host functions for performing custom computations.

Table 3.1: BLAS Matrix/Vector Routines Used

    Routine                Function
    GEMM(α, A, B, β, C)    C <- αAB + βC
    GEAM(α, A, β, B, C)    C <- αA + βB
    SCAL(α, x)             x <- αx
    AXPY(α, x, y)          y <- y + αx
    NRM2(A)                ||A||

The GEMM routine performs matrix-matrix multiplication and is the main workhorse of the inversion process. GEAM is a "BLAS-like" extension to cuBLAS and simplifies matrix-matrix addition; we wrote a CPU implementation of this routine using SCAL and AXPY in OpenBLAS. NRM2 computes a Frobenius norm and is used to compute the α scalar coefficient of the initial inverse approximation.

Table 3.2: Implemented Custom Matrix Routines

    Routine               Function
    transpose(α, A, T)    T <- αA^T
    trace(A)              Σ_{i=1}^{n} a_ii
    addId(α, A)           A <- A + αI
    minusIdNrm2(A)        ||I - A||

The handwritten transpose and trace routines perform their namesakes and are used to compute the sequence's general-case initial inverse approximation R_0 as well as the optimized, positive-definite initial error measure ||I - AR_0||, respectively. The addId routine was written to perform linear-time addition of a scalar multiple of the identity to a matrix; this routine helps us avoid having to allocate a large and slow identity matrix and is also used with NRM2 to implement minusIdNrm2 for computing each inversion iteration's error measure. An iteration of the cubically-convergent Altman sequence is thus computed by ACMI to produce the next approximate inverse R of the matrix A as follows:

    gemm(1, mA, mR, 0, mAR);    // mAR <- AR
    err = minusIdNrm2(mAR);     // compute error of the prev iteration
    ...                         // based on error, stop or back up
    gemm(1, mAR, mAR, 0, mX);   // mX <- (AR)^2
    geam(-3, mAR, 1, mX, mX);   // mX <- -3AR + (AR)^2
    addId(3, mX);               // mX <- 3I - 3AR + (AR)^2
    gemm(1, mR, mX, 0, mAR);    // mAR <- R(3I - 3AR + (AR)^2)
    swap(&mR, &mAR);            // mR <- next R, mAR <- prev R
    swap(&mX, &mAR);            // mX <- previous R

Floating-point operations are not generally associative with respect to round-off error; the order in which a computation is performed is important. In an effort to reduce accumulated error, the above sequence of matrix operations was borrowed from earlier work [32]. The previous iteration is saved in case ACMI decides to "back up" an iteration to reduce error before changing strategy or quitting. While both BLAS engines compute each individual operation in parallel, the sequence of operations is itself serially-interdependent and presents no opportunity for parallel execution of separate matrix operations (i.e., all CUDA operations are run through a single stream).

In addition to the above, ACMI implements the quadratically and quartically-convergent Altman sequences as well; the latter requires a fifth work matrix (i.e., much more memory). Their relative performances shall be examined in Chapter 4.

3.4 Recovery from Divergence

As we shall see, many matrices, both large and small, cannot be accurately inverted using ACMI with single-precision arithmetic alone. However, the substantial performance penalty of using double precision on the GPU compels us to perform as much of the inversion as possible in single precision. We therefore need a mechanism for detecting when to abandon single precision.

The simplest method of accomplishing this is to compute iterations until one diverges. Since Altman's sequences converge monotonically to the inverse, an iteration whose error measure exceeds that of the previous indicates that floating-point round-off error has caught up with ACMI and halted the progress of the inversion. Since Altman's method is iteratively improving and tolerant of errors, ACMI simply promotes its work matrices to double precision after divergence occurs in order to overcome the round-off error and continue the inversion process.

However, our inverse approximation might have accumulated substantial "damage" by the time outright divergence has occurred. This damage can require additional and expensive double-precision iterations to correct; it might even derail the remainder of the inversion. We would like a reliable method of measuring numerical instability and detecting divergence ahead of time to elude this damage. To this end, we employed the sequence's rate of convergence. A sequence {x_n} that converges with order p towards zero has constant rate μ such that [13]

    lim_{n→∞} |x_{n+1}| / |x_n|^p = μ,    μ > 0.
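
To make the cubic case (p = 3) concrete, here is a small worked example under the assumption that μ is close to 1:

    |x_n|     = 10^-2
    |x_{n+1}| ≈ μ * (10^-2)^3 = 10^-6
    |x_{n+2}| ≈ μ * (10^-6)^3 = 10^-18

A well-behaved inversion therefore crosses the default 10^-5 error target within an iteration or two of reaching even modest accuracy, and quickly runs into the round-off floor of whatever floating-point format is in use.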

Substituting our monotonically-decreasing error measure and using the cubically-convergent sequence, we have

    lim_{n→∞} ||I - AR_{n+1}|| / ||I - AR_n||^3 = μ.

We don't know what the value of μ is for a given inversion, but we do know that μ exists, since this sequence does indeed converge cubically to the inverse. We therefore wouldn't expect the sequence of rate measurements {μ_n} to itself diverge unless we fall victim to numerical instability and, based on evidence, hypothesize that such a divergence would occur before that of {R_n}. ACMI measures this rate after each iteration and promotes its matrices after the rate exceeds some given threshold. The ideal value of this threshold varies; ACMI tolerates a limit of μ < 1 by default, which was found to be reasonable for many matrices during initial testing. We shall explore the effectiveness of this strategy in Chapter 4.

If divergence occurs within the first few iterations, the initial inverse approximation is probably to blame; ACMI then computes the starting approximation for the general case and restarts the inversion. When divergence can't be blamed on the initial approximation and occurs while working in double precision, ACMI terminates the inversion and returns the previously-computed inverse approximation as a consolation prize.

It should be noted that, if the inverter is allowed to indefinitely continue producing inverse approximations after hitting the wall of numerical instability (but not when the matrix is singular or the starting approximation is inadequate, which lead to outright divergence), its iteratively-improving nature shall cause their error measures to hover around the lowest-achievable figure indefinitely.
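
In code, the per-iteration decision described above amounts to only a few lines. The sketch below is a simplified paraphrase of the logic in Listing C.1, not verbatim ACMI source; the function and enum names are illustrative, err and prevErr are the current and previous error measures, p is the order of convergence and rateLimit is the tolerated single-precision rate.

    #include <math.h>
    #include <stdbool.h>

    typedef enum { KEEP_GOING, PROMOTE_TO_DOUBLE, GIVE_UP } Action;

    /* Simplified paraphrase of ACMI's divergence check (cf. Listing C.1). */
    Action checkIteration(double err, double prevErr, int p,
                          double rateLimit, bool singlePrec) {
      double convRate = err / pow(prevErr, p);   /* estimate of mu_n        */
      if (!isfinite(err) || err >= prevErr)      /* outright divergence     */
        return singlePrec ? PROMOTE_TO_DOUBLE : GIVE_UP;
      if (singlePrec && convRate >= rateLimit)   /* divergence predicted    */
        return PROMOTE_TO_DOUBLE;
      return KEEP_GOING;                         /* keep iterating as-is    */
    }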

CHAPTER IV
EXPERIMENTAL RESULTS

4.1 Testing Environment and Overview of Experiments

The experiments were run on a custom-built workstation sporting an Intel Xeon E3-1275 v5 processor from the Skylake generation. The CPU has 4 hyper-threaded cores running at 3.6 GHz with 256 KB L2 cache per core and 8 MB of shared L3 cache; the system memory is 2133 MHz ECC DDR4. The workstation's graphics card is a Pascal-generation Nvidia GeForce GTX 1080, which was the highest-end and most-recent consumer GPU offered by Nvidia at the time of the system's construction in the summer of 2016; its specifications and SM architecture are shown in Table 4.1 and Figure 4.1, respectively.

Table 4.1: GeForce GTX 1080 Specifications

    Specification               Rating
    Compute capability          6.1
    Streaming multiprocessors   20
    CUDA cores                  2560 total (128 per SM)
    Maximum core clock rate     1734 MHz
    GFLOPS                      8873 [25]
    Memory                      8 GiB GDDR5X
    Memory clock rate           5005 MHz
    Memory bus width            256-bit
    Memory bandwidth            320 GiB/sec [25]
    L2 cache                    2 MiB
    Max threads per block/SM    1024/2048
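
Several of these figures can be confirmed programmatically on whatever card is installed. The sketch below is not part of ACMI (though ACMI's kernel setup in Listing C.3 performs a similar query); it simply prints a few properties of the first CUDA device using the runtime API.

    /* Print a few device properties for the first CUDA device. */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);
      printf("%s: %d SMs, %d max threads/block, %zu MiB global memory\n",
             prop.name, prop.multiProcessorCount, prop.maxThreadsPerBlock,
             prop.totalGlobalMem >> 20);
      return 0;
    }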

[Figure 4.1: GeForce GTX 1080 Streaming Multiprocessor [25]]

The operating system used is openSUSE 42.1 installed with the 367.27 version of Nvidia's graphics/CUDA drivers.

In order to eliminate pseudorandom statistical weaknesses, all random matrices were generated using Intel's hardware random number generator unless otherwise noted. Each random matrix is populated with integer elements no greater than the matrix dimension in magnitude^1; allowing much larger or non-integer elements didn't seem to significantly affect any of the results.

Several different classes of matrices were inverted using ACMI and shall be examined in the following sections.

^1 Diagonals in diagonally-dominant matrices are necessarily not so bounded.

Single and double precision on the GPU, single precision on the CPU^2 and GNU Octave were all compared by the speed and quality of their inversions. Every single-precision run was automatically promoted to double precision when necessary to continue the inversion. Each test was performed five times to produce an averaged result.

Octave is a free and open-source numerical computing platform that is largely compatible with MATLAB. Depending upon the characteristics of an input matrix, Octave performs one of several methods of inversion, including LU and Cholesky factorization [12][2]; Octave thus provides a method of testing ACMI against traditional matrix inversion methods. Octave itself depends upon BLAS and was configured to use the same multithreaded OpenBLAS library used by ACMI.

CUDA kernel launches were tuned to saturate all SMs on the GeForce GPU, allocating up to 40 blocks per grid with 1024 threads per block. The OpenBLAS routines used to perform inversion in software use up to 8 POSIX threads to execute their computations in parallel. The time required to generate or load matrices and to transfer them to GPU memory is not figured into the tested inversion runtimes.

4.2 Diagonally-Dominant Random Matrices

We first tested ACMI against easily-invertible random matrices. A matrix A is strictly diagonally-dominant [6] if, for all 1 ≤ i ≤ n,

    |a_ii| > Σ_{j=1, j≠i}^{n} |a_ij|.

Such a matrix is easily, rapidly and accurately solved by Gaussian elimination [13]. Forcing the diagonal elements to be positive well-conditions^3 the matrix, altogether making the matrix easily invertible.

^2 Double-precision CPU inversions using ACMI aren't studied as they generally form a higher and unnecessary runtime upper bound.
^3 Most such matrices have a condition number of about 2.
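
Generating such a test matrix is straightforward. The hypothetical sketch below (makeDiagDominant is an illustrative name; ACMI's actual generator appears in Listing C.4) fills a column-major matrix with small random integers and then overwrites each diagonal entry with a positive value exceeding the sum of the magnitudes of the rest of its row.

    #include <math.h>
    #include <stdlib.h>

    /* Fill a column-major n x n matrix with entries in [-n, n], then force
       strict diagonal dominance with a positive diagonal. */
    void makeDiagDominant(double* a, int n) {
      for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
          a[(size_t)j*n + i] = rand() % (2*n + 1) - n;

      for (int i = 0; i < n; i++) {
        double rowSum = 0;
        for (int j = 0; j < n; j++)
          if (j != i) rowSum += fabs(a[(size_t)j*n + i]);
        a[(size_t)i*n + i] = rowSum + 1;  /* |a_ii| > sum of off-diagonals */
      }
    }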

We generated diagonally-dominant matrices with dimensions ranging from 128 × 128 to 16384 × 16384. Five matrices were generated for each size tested, and their inversion runtime averages are graphed in Figure 4.2, where ACMI's double-precision, single-precision (with promotion to double precision if needed) and CPU inverters are compared to each other and to GNU Octave's matrix inverter.

[Figure 4.2: Inversion Runtime for n × n Diagonally-Dominant Random Matrices. Runtime in seconds (log scale) versus n for 32-bit CUDA, 64-bit CUDA, OpenBLAS and GNU Octave.]

The target error measure limit for each inversion was 1.15 × 10^-5; this was chosen to be slightly over ACMI's default of 10^-5 in order to accommodate some large matrices that could not be promoted to double precision due to memory constraints.

The single-precision ACMI inverter generally and greatly outperforms all its competitors, taking 30.4 seconds on average to invert a 16384 × 16384 matrix while its nearest competitor, GNU Octave, takes 198 seconds (6.5 times as long). Octave outperformed ACMI once, during the "hump" in the latter's curve; this range of matrix dimensions represents a "region" of instability that required ACMI to promote its matrices to double precision in order to meet the inversion error target.

As can be seen in the curve of OpenBLAS inversions performed by ACMI, this region is peculiar to ACMI's CUDA inverter. We do not yet know why our CUDA inverter has trouble with diagonally-dominant matrices of these dimensions. An observation that held across all tests is that, once promoted from single precision, an inversion only rarely requires more than two additional iterations to meet its error target^4. Despite this, and due to the double-precision hardware limitations of GeForce GPUs, these promoted iterations dominate the runtimes of their inversions.

To minimize the extent of this region, we had to raise the convergence rate limit^5 to μ < 5 × 10^7; limits as high as μ < 5 × 10^5 extend the ranges of matrix sizes affected by this single-precision CUDA anomaly. This limit was found by trial-and-error and is poorly-understood; nevertheless, it shows that ACMI's default limit of μ < 1 is hardly as general-purpose as previously hoped. The only way to eliminate the region is to raise the maximum-tolerated error to about 1 × 10^-3, which is unacceptably high.

Even when inverting matrices whose sizes lie within this region, single-precision inversions with promotion dramatically outperform pure double-precision runs, by a factor of 23 with respect to runtime. ACMI's double-precision CUDA and single-precision CPU inverters are beaten here by Octave, also by large factors; this might be due to the sophistication of Octave as much as it is to the ease with which diagonally-dominant matrices are solved by Gaussian elimination.

These diagonally-dominant matrices require few ACMI iterations to invert: the smaller matrices tested require 5-6 iterations, 1024 × 1024 matrices require 7 and the largest require 8-10.

The largest specimen inverted during the entirety of this research was a 22800 × 22800 diagonally-dominant matrix, taking 91.9 seconds to invert to under 1.2 × 10^-5 error in 9 iterations. The inversion required almost the entirety of the GPU's 8 GiB of RAM and was performed entirely in single precision using ACMI.

^4 This observation no longer holds when the convergence rate limit is set to be very low.
^5 See Section 3.4 for a description of this limit.

ACMI took over 110 minutes to invert such a matrix within the same error bound using the CPU alone; Octave needed 513 seconds.

4.3 Random Matrices

We then tested fully-random, non-diagonally-dominant matrices where negative elements are allowed. While such matrices, unlike their diagonally-dominant cousins, may be singular, they prove in practice to almost always be invertible^6. Figures 4.3 and 4.4 show the runtime results for random integer and Boolean matrices, respectively. The runtime behaviors of these two classes are remarkably similar, though the raw figures show that Boolean matrices are slightly harder to invert.

[Figure 4.3: Inversion Runtime for n × n Random Matrices. Runtime in seconds (log scale) versus n for 32-bit CUDA, 64-bit CUDA, OpenBLAS and GNU Octave.]

A convergence rate limit of μ < 5 was chosen, as most of these inversions would diverge if pushed beyond this limit at single precision.

^6 During this research, random singularity was only observed with very small matrices, so small that they did not figure into our experiments.

[Figure 4.4: Inversion Runtime for n × n Boolean Matrices. Runtime in seconds (log scale) versus n for 32-bit CUDA, 64-bit CUDA, OpenBLAS and GNU Octave.]

This limit was also found by trial-and-error and differs greatly from that used to optimize the inversions of diagonally-dominant matrices, suggesting that computation of the ideal such limit is non-trivial. The default ACMI inversion error target (10^-5) was used for all tests.

While ACMI using single precision with promotion takes only a ninth of the time needed by double precision to invert a random 8192 × 8192 integer matrix (36.6 vs. 332 seconds), GNU Octave defeated ACMI for nearly all matrix sizes tested^7, taking only 25 seconds to invert an 8192 × 8192 matrix.

In contrast to diagonally-dominant matrices, fully-random matrices tend to be poorly-conditioned^8 and possess negative eigenvalues, requiring ACMI to use the general-purpose R_0 initial inverse approximation to successfully invert these matrices.

^7 While ACMI appears to have won at inverting the largest matrix size tested here, the resulting approximate inverses had poor error measures on the order of 10^-2, as there was insufficient memory for promotion to double precision.
^8 The condition numbers are on the order of 10^4; ACMI must be told to skip the positive-definite R_0 approximation lest it waste a couple iterations before restarting.

This alternate initial approximation appears to be the culprit for ACMI's poor performance, sometimes requiring over 30 iterations to invert within the error target (see the p = 3 graph in Figure 4.9).

4.4 Real-World Large Matrices

We, of course, wished to demonstrate practical applications of ACMI by testing it against some large, non-random matrices modeling real-world problems. To this end, we employed the University of Florida Sparse Matrix Collection [10], choosing the matrices listed in Table 4.2 as test subjects. Runtime results are shown in Figure 4.5; since single-precision ACMI clearly dominates its other two modes, we compared its results to GNU Octave alone. Here, the default convergence rate limit of μ < 1 was used to no apparent ill effect.

Table 4.2: Sparse Matrix Collection Specimens [10]

    ID: Matrix             n       Description
    1:  494_bus            494     Bus power system
    2:  685_bus            685     Bus power system
    3:  1138_bus           1138    Bus power system
    4:  CAG_mat1916        1916    Combinatorial problem
    5:  ex37               3565    Computational fluid dynamics problem
    6:  lshp_265           265     Thermal problem
    7:  Kuu                7102    Structural problem
    8:  heart3             2339    Quasi-static FE model of a heart
    9:  c-40               9941    Non-linear optimization problem
    10: p2p-Gnutella25     22687   Peer-to-peer file sharing network
    11: pli                22695   Magnetohydrodynamic problem

[Figure 4.5: Inversion Runtime for Large, Sparse Matrices. Runtime in seconds (log scale) per test matrix for 32-bit CUDA and GNU Octave.]

For all but two small matrices, ACMI defeated Octave^9, taking 66.5 seconds to invert the "c-40" matrix whereas Octave required 879. Unlike with random matrices, ACMI wins even at the non-positive-definite tests (8-11) that require the much more slowly-convergent, general-purpose R_0 seed. These matrices aren't necessarily "friendlier" than the fully-random matrices, since the "heart3" matrix happened to have a much worse condition number of 2.1 × 10^6. Among these tests, ACMI needed to compute as few as 10 iterations and as many as 32, for the "ex37" and "c-40" matrices, respectively.

The "heart3" matrix happened to "break" ACMI's automatic detection of poor initial inverses: whereas every other tested matrix inversion diverged within the first two iterations if the positive-definite R_0 were wrongly chosen (successfully triggering ACMI to restart the inversion from the general-purpose R_0 in the process), this matrix produced nine iterations from the positive-definite seed before diverging.

^9 Octave couldn't invert the largest two matrices within a reasonable time limit, hence the truncated results in the histogram.

As an example of the superior convergence of the positive-definite R_0 approximation, forcing ACMI to use the general-purpose R_0 to invert the "Kuu" matrix results in a runtime of 24.6 seconds over 27 iterations, instead of 20 seconds over 15 iterations.

4.5 Hilbert Matrices

The numerical stability of an invertible matrix A can be predicted by its condition number: ||A|| ||A^-1||. A condition number significantly greater than one indicates an ill-conditioned matrix, where small perturbations of the elements have great influence on operations performed on the matrix and their results [6][13]. The higher the condition number, the greater the error inflation. Naturally, an ill-conditioned matrix won't tolerate floating-point round-off error well and shall complicate ACMI's computations, possibly even making effective matrix inversion impossible.

Despite being positive definite, the symmetric Hilbert matrices are notoriously ill-conditioned [6]; even the tiny 10 × 10 Hilbert matrix has a condition number of 1.6 × 10^13 [13]. They thus provided an extreme under which we could test the numerical stability of ACMI. The n × n Hilbert matrix is

    H_n = [ 1      1/2        ...   1/n
            1/2    1/3        ...   1/(n+1)
            ...    ...              ...
            1/n    1/(n+1)    ...   1/(2n-1) ].

Figure 4.6 shows the best inversion error measures achieved using ACMI in double precision to invert H_2 through H_13 and compares them to those obtained by Octave. Inversion runtimes are negligible across all methods at these dimensions and are not shown; these tests serve to show error tolerance, not parallel performance.

ACMI's CPU and GPU inverters perform similarly, both producing inverses of far greater fidelity than Octave. Despite the fact that these matrices all benefit from the positive-definite R_0 approximation, they require many iterations to reach their best error measures: 14 iterations for H_3 and 35 for H_12. Altering the convergence rate limit when inverting Hilbert matrices didn't affect the total iterations computed, but did increase the proportion of double-precision iterations needed by ACMI.
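
Since the entries of H_n follow directly from the layout above (with 1-based indices, the (i, j) entry is 1/(i + j - 1)), generating a test instance takes only a few lines. The sketch below is not taken from ACMI; fillHilbert is an illustrative name.

    #include <stddef.h>

    /* Fill a column-major n x n array with the Hilbert matrix H_n. */
    void fillHilbert(double* h, int n) {
      for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
          h[(size_t)j*n + i] = 1.0 / (i + j + 1);  /* 1/(i + j - 1), 1-based */
    }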

[Figure 4.6: Inversion Error for n × n Hilbert Matrices. Best error measure ||I - H_n R|| (log scale) versus n for CUDA, OpenBLAS and GNU Octave.]

4.6 Comparison of the Orders of Convergence

Thus far, we had only been computing the cubically-convergent Altman sequence of inverse approximations using ACMI. Since the quadratically and quartically-convergent sequences produce more quickly-computable and more rapidly-convergent iterations, respectively, we asked whether parallel computation through CUDA ever grants them an advantage over the sequence of cubic order; the latter is optimal in theory, but not necessarily so with finite-precision software. Figures 4.7 and 4.8 compare the three sequences when applied to diagonally-dominant random and fully-random matrices, respectively. The number of iterations computed by each sequence for the latter case is shown in Figure 4.9. All tests are run in single precision, yet typically promote to double precision for the last two iterations. The same convergence rate limits are used as in the previous sections dealing with these matrix classes.

[Figure 4.7: Order-p Inversion Runtime for Diagonally-Dominant Random Matrices. Runtime in seconds (log scale) versus n for p = 2, 3, 4.]

Surprisingly, the cubic-order sequence is frequently outperformed by the quadratic, despite the latter requiring many more iterations to perform each inversion. This result appears to be due to the finite precision and numerical instability of our matrix arithmetic. For each test, all three sequences tend to promote to double precision after similar inversion error measures have been achieved. The runtime dominance of the following few double-precision iterations, which are much more expensive for higher-order sequences, covers up the slower convergence of the lower-order sequence iterations performed in single precision.

It should be noted that, given its slower convergence, the quadratic sequence is more sensitive to lower convergence rate limits and requires them to be raised in order to compete with the other sequences. When this value is low and fixed for all sequences, the cubic-order sequence substantially outperforms the quadratic. Generally, even though a low convergence rate limit can lower the total number of inversion iterations performed, it likely raises the ratio of double-precision to single-precision iterations at the expense of the runtime.
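
For reference, the three updates ACMI implements appear as code in Listing C.1; written with E_n denoting AR_n (a notational shorthand introduced here), they are:

    p = 2:  R_{n+1} = R_n (2I - E_n)
    p = 3:  R_{n+1} = R_n (3I - 3E_n + E_n^2)
    p = 4:  R_{n+1} = R_n (4I - E_n (6I - 4E_n + E_n^2))

Each increase in order costs roughly one additional matrix multiplication per iteration, which is why the double-precision iterations of the higher-order sequences are the most expensive.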

[Figure 4.8: Order-p Inversion Runtime for Random Matrices. Runtime in seconds (log scale) versus n for p = 2, 3, 4.]

[Figure 4.9: Total Order-p Inversion Iterations for Random Matrices. Iteration count versus n for p = 2, 3, 4.]

CHAPTER V
CONCLUSION

5.1 Implications of Results

A parallel implementation of Altman's matrix inversion method using a consumer-grade GPU and single-precision arithmetic can greatly outperform other matrix inverters, such as GNU Octave. Moreover, the method is error-tolerant and allows extended single-precision computation until double precision is absolutely necessary. This runtime advantage is not present when executing the method entirely with double precision or on the CPU. While our implementation defeated GNU Octave at inverting large, easily-invertible matrices as well as matrices representing large scientific and industrial problems, Octave outperformed the implementation at inverting certain ill-conditioned random matrices.

We attempted to exploit the inversion sequence's rate of convergence to predict oncoming divergence and optimize the inversion strategy. The convergence rate limit proved non-trivial to optimize and even increased the inversion computation time when set too low. On the hardware used, a few double-precision iterations dominated the runtime of many single-precision iterations to the extent that attempting to minimize the latter proved to not be worthwhile.

5.2 Further Research

This research studied GPGPU matrix inversion using only consumer Nvidia GPUs through the CUDA framework. Since these consumer units perform relatively poorly during double-precision computations, a study of the performance of ACMI on professional Tesla units would be a logical next step. Adding support for other GPGPU platforms, such as OpenCL and Vulkan, should also be of interest.

Matrix inversion very quickly exhausts single-precision numerical stability, and even double precision frequently proves to be inadequate for ACMI. Unfortunately, quadruple-precision support is currently unavailable outside of relatively-exotic IBM mainframes [31] and Power CPUs [14]. A compromise of worthwhile pursuit would be the double-double method, which extends floating-point precision by using pairs of double-precision numbers to represent each vector or matrix element [17].

As discovered in Chapter 4, certain ranges of dimensions of diagonally-dominant random matrices are much less numerically stable when using CUDA and cuBLAS and require promotion to much slower double-precision computation to invert well. CPU inversion using OpenBLAS doesn't suffer this instability; an explanation for this discrepancy would be appreciated.

Finally, Strassen's matrix multiplication algorithm has acceptable numerical stability for some applications and scales very well with parallel execution [9]. A GPU implementation of this algorithm might appreciably accelerate Altman's method.
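
To give a flavor of the double-double idea [17], its basic building block is an exact two-term sum: the rounding error of a double-precision addition can itself be captured in a second double, and a double-double value then carries the (hi, lo) pair through every subsequent operation. The sketch below is the standard Knuth two-sum, not ACMI code, and assumes round-to-nearest arithmetic with no unsafe compiler reassociation.

    /* Exact addition of two doubles: the rounded sum is returned in *hi and
       its rounding error in *lo, so that a + b == *hi + *lo exactly. */
    void twoSum(double a, double b, double* hi, double* lo) {
      double s  = a + b;
      double bb = s - a;                     /* the part of s that came from b */
      *hi = s;
      *lo = (a - (s - bb)) + (b - bb);       /* recovered rounding error       */
    }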

REFERENCES

[1] BLAS (Basic Linear Algebra Subprograms). http://www.netlib.org/blas/. Accessed: 2016-10-15.
[2] ?getrf. https://software.intel.com/en-us/node/520877. Accessed: 2016-10-25.
[3] M. Altman. Approximation Methods in Functional Analysis. California Institute of Technology, 1959.
[4] M. Altman. An optimum cubically convergent iterative method of inverting a linear bounded operator in Hilbert space. Pacific Journal of Mathematics, 10:1107-1113, December 1960.
[5] Bradley, Hax, and Magnanti. Applied Mathematical Programming. Addison-Wesley, 1977.
[6] R. Burden and J. Faires. Numerical Analysis. Cengage Learning, 9th edition, August 2010.
[7] J. Cheng, M. Grossman, and T. McKercher. Professional CUDA C Programming. Wrox, 1st edition, September 2014.
[8] D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation, 9:251-280, March 1990.
[9] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 3rd edition, 2009.
[10] T. A. Davis and Y. Hu. The University of Florida Sparse Matrix Collection. ACM Transactions on Mathematical Software, 38:1:1-1:25, 2011. Accessed: 2016-10-27.
[11] J. L. Dongarra and V. Eijkhout. Numerical linear algebra algorithms and software. Journal of Computational and Applied Mathematics, 123:489-514, November 2000.
[12] J. W. Eaton. Techniques used for linear algebra. https://www.gnu.org/software/octave/doc/interpreter/Techniques-Used-for-Linear-Algebra.html, 2016. Accessed: 2016-10-25.
[13] W. Gautschi. Numerical Analysis. Birkhauser, 2nd edition, 2012.
[14] GCC Team. GCC 6 release series: Changes, new features, and fixes. https://gcc.gnu.org/gcc-6/changes.html, October 2016. Accessed: 2016-10-28.
[15] H. Hagedoorn. Nvidia GeForce GTX 1080 review - Pascal GPU architecture. http://www.guru3d.com/articles-pages/nvidia-geforce-gtx-1080-review,3.html, May 2016. Accessed: 2016-10-20.

[16] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann, 1st edition, June 2012.
[17] Y. Hida, X. S. Li, and D. H. Bailey. Library for double-double and quad-double arithmetic. December 2007.
[18] Intel. Intel Digital Random Generator (DRNG) Software Implementation Guide, 1.1 edition, August 2012.
[19] J. Li, S. Ranka, and S. Sahni. Strassen's matrix multiplication on GPUs. Parallel and Distributed Systems (ICPADS), December 2011.
[20] J. Liesen and V. Mehrmann. Linear Algebra. Springer, 1st edition, December 2015.
[21] G. Marsaglia. Random numbers fall mainly in the planes. Proc. N.A.S., 61:25-28, September 1968.
[22] G. Marsaglia. Xorshift RNGs. Journal of Statistical Software, 8, July 2003.
[23] K. Milfeld. GotoBLAS2. https://www.tacc.utexas.edu/research-development/tacc-software/gotoblas2. Accessed: 2016-10-14.
[24] NVIDIA Corporation. CUDA C Programming Guide. https://docs.nvidia.com/cuda/cuda-c-programming-guide, September 2016. Accessed: 2016-10-16.
[25] NVIDIA Corporation. NVIDIA GeForce GTX 1080. http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf, 2016. Accessed: 2016-10-20.
[26] NVIDIA Corporation. NVIDIA Tesla P100. https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf, 2016. Accessed: 2016-10-20.
[27] National Institute of Standards and Technology. ANSI C library for Matrix Market I/O. http://math.nist.gov/MatrixMarket/mmio-c.html, May 2000. Accessed: 2016-10-21.
[28] V. Pan, Z. Chen, and A. Zheng. The complexity of the matrix eigenproblem. Proceedings of the thirty-first annual ACM symposium on Theory of computing, pages 507-516, May 1999.
[29] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd edition, 1992.
[30] S. D. Roth. Ray casting for modeling solids. Computer Graphics and Image Processing, 18:109-144, February 1982.

[31] E. Schwarz. The IBM z13 SIMD accelerators for integer, string, and floating-point. http://arith22.gforge.inria.fr/slides/s1-schwarz.pdf, June 2015. Accessed: 2016-10-28.
[32] R. Senser. GPU Declarative Framework. PhD thesis, University of Colorado Denver, November 2014.
[33] T. Valich. Pascal secrets: what makes Nvidia GeForce GTX 1080 so fast? http://vrworld.com/2016/05/10/pascal-secrets-nvidia-geforce-gtx-1080, May 2016. Accessed: 2016-10-20.
[34] N. Wilt. The CUDA Handbook: A Comprehensive Guide to GPU Programming. Addison-Wesley Professional, 1st edition, June 2013.
[35] Z. Xianyi. OpenBLAS wiki. https://github.com/xianyi/OpenBLAS/wiki. Accessed: 2016-10-14.

APPENDIX A
ACMI INVERTER USAGE

Listing A.1: ACMI Inverter Usage

    ACMI Convergent Matrix Inverter
    J. Treadwell, 2016

    Usage:
      acmi [options] <file>

    Only Matrix Market I/O is supported. To generate and invert a
    random matrix, enter the matrix dimension prepended by '@' in lieu
    of the input file; prepend with '%' for symmetry.

    Options:
      -f            Print matrix file info and exit
      -c            Perform all computations in software without the GPU
      -l            Use low-speed, safe initial R0 approximation
      -o <path>     Output computed matrix inverse to path
      -q <order>    Set the order of convergence (2-4, default: cubic)
      -p <#bits>    Set initial matrix element floating-point precision
                    (32 or 64, default: 32)
      -e <+real>    Set inversion error limit (default: 1e-05)
      -t <ms>       Set inversion time limit in ms (default: none)
      -m <+real>    Set max-allowed single-precision convergence rate;
                    prepend with 'x' and set to >= 1 to use a multiple of
                    the starting rate (default: 1)
      -b <+int>     Set max blocks run by each GPU kernel
                    (default: 40, max: 65535)
    Random matrix options:
      -H            Enable hardware random number generator (unseedable)
      -R            Enable real elements
      -N            Enable negative elements
      -D            Generate dominant diagonal elements
      -V <+real>    Set max element magnitude
                    (default: matrix dimension, max: 32767)
      -U <path>     Output generated, uninverted matrix to path
      -S <seed>     Set PRNG seed (not yet portable)
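
As a usage illustration (a hypothetical invocation composed from the options above, not a run reproduced from this thesis), the following would generate a 4096 x 4096 diagonally-dominant random matrix with the hardware RNG, tighten the error limit and write the computed inverse to disk; Appendix B reproduces several actual runs.

    ./acmi @4096 -H -D -e 1e-06 -o inverse.mtx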

APPENDIXB SAMPLEACMIINVERSIONRUNS ListingB.1:GPUsingle-precision,randomdiagonally-dominant n =2048. 1 >./acmi@2048-DSdeadbeef 2 7998.438MiBdevicememoryavailable 3 seedingPRNGwithdeadbeef 4 generating32-bitrandom2048x2048integernonnegative 5 diagonally-dominantmatrix... 6 16.000MiB/matrix;allocating64.000MiBtotal 7 settingupcuBLAS 8 settingupkernelsonGeForceGTX1080 9 blocks/matrix:40,threads/matrix:40960 10 blocks/vector:2,threads/vector:2048 11 uploadingmatrixtodevice 12 invertingrandommatrixwithcubicconvergence... 13 initializingworkmatrices... 14 computedalpha=1.05396e-08 15 R0:err=4.4255e+01, =0.0000e+000.000s 16 R1:err=4.2322e+01, =4.8828e-040.000s 17 R2:err=3.7015e+01, =4.8829e-040.010s 18 R3:err=2.4766e+01, =4.8835e-040.008s 19 R4:err=7.4232e+00, =4.8870e-040.007s 20 R5:err=2.0094e-01, =4.9125e-040.008s 21 R6:err=5.9508e-05, =7.3342e-030.008s 22 R7:err=5.9787e-06, =2.8372e+070.007s 23 inversionhaltedin7iterationsafter0m0.056s 24 shuttingdowncuBLAS 25 shuttingdownkernels ListingB.2:GPUdouble-precision,randomdiagonally-dominant n =2048. 1 >./acmi@2048-DSdeadbeef-p64 2 7998.438MiBdevicememoryavailable 3 seedingPRNGwithdeadbeef 4 generating64-bitrandom2048x2048integernonnegative 5 diagonally-dominantmatrix... 6 32.000MiB/matrix;allocating128.000MiBtotal 7 settingupcuBLAS 8 settingupkernelsonGeForceGTX1080 9 blocks/matrix:40,threads/matrix:40960 10 blocks/vector:2,threads/vector:2048 11 uploadingmatrixtodevice 12 invertingrandommatrixwithcubicconvergence... 13 initializingworkmatrices... 14 computedalpha=1.05396e-08 15 R0:err=4.4255e+01, =0.0000e+000.000s 16 R1:err=4.2322e+01, =4.8828e-040.000s 17 R2:err=3.7015e+01, =4.8829e-040.272s 18 R3:err=2.4766e+01, =4.8835e-040.185s 19 R4:err=7.4229e+00, =4.8868e-040.182s 20 R5:err=2.0076e-01, =4.9086e-040.182s 21 R6:err=4.1234e-06, =5.0957e-040.183s 22 inversionhaltedin6iterationsafter0m1.186s 23 shuttingdowncuBLAS 24 shuttingdownkernels ListingB.3:CPUsingle-precision,randomdiagonally-dominant n =2048. 1 >./acmi@2048-DSdeadbeef-c 39

2 seedingPRNGwithdeadbeef 3 generating32-bitrandom2048x2048integernonnegative 4 diagonally-dominantmatrix... 5 16.000MiB/matrix;allocating64.000MiBtotal 6 GPUaccelerationdisabled! 7 invertingrandommatrixwithcubicconvergence... 8 initializingworkmatrices... 9 computedalpha=1.05396e-08 10 R0:err=4.4255e+01, =0.0000e+000.003s 11 R1:err=4.2322e+01, =4.8828e-040.530s 12 R2:err=3.7015e+01, =4.8829e-040.523s 13 R3:err=2.4766e+01, =4.8835e-040.523s 14 R4:err=7.4229e+00, =4.8868e-040.523s 15 R5:err=2.0077e-01, =4.9089e-040.523s 16 R6:err=1.7253e-05, =2.1318e-030.523s 17 R7:err=3.8735e-06, =7.5426e+080.523s 18 inversionhaltedin7iterationsafter0m3.846s ListingB.4:LaptopGPUsingle-precision,randomdiagonally-dominant n =2048. 1 >./acmi@2048-DSdeadbeef 2 850.562MiBdevicememoryavailable 3 seedingPRNGwithdeadbeef 4 generating32-bitrandom2048x2048integernonnegative 5 diagonally-dominantmatrix... 6 16.000MiB/matrix;allocating64.000MiBtotal 7 settingupcuBLAS 8 settingupkernelsonNVS4200M 9 blocks/matrix:40,threads/matrix:40960 10 blocks/vector:2,threads/vector:2048 11 uploadingmatrixtodevice 12 invertingrandommatrixwithcubicconvergence... 13 initializingworkmatrices... 14 computedalpha=1.05396e-08 15 R0:err=4.4255e+01, =0.0000e+000.004s 16 R1:err=4.2322e+01, =4.8828e-040.000s 17 R2:err=3.7014e+01, =4.8829e-040.948s 18 R3:err=2.4766e+01, =4.8836e-040.714s 19 R4:err=7.4232e+00, =4.8870e-040.713s 20 R5:err=2.0094e-01, =4.9125e-040.714s 21 R6:err=5.9371e-05, =7.3176e-030.713s 22 R7:err=5.9608e-06, =2.8482e+070.714s 23 inversionhaltedin7iterationsafter0m5.234s 24 shuttingdowncuBLAS 25 shuttingdownkernels ListingB.5:Single-precision,random n =4096withpromotiontodoubleprecision. 1 >./acmi-HD@4096 2 7998.438MiBdevicememoryavailable 3 usingRDRANDRNG 4 generating32-bitrandom4096x4096integernonnegative 5 diagonally-dominantmatrix... 6 64.000MiB/matrix;allocating256.000MiBtotal 7 settingupcuBLAS 8 settingupkernelsonGeForceGTX1080 9 blocks/matrix:40,threads/matrix:40960 10 blocks/vector:4,threads/vector:4096 11 uploadingmatrixtodevice 12 invertingrandommatrixwithcubicconvergence... 13 initializingworkmatrices... 14 computedalpha=1.86387e-09 15 R0:err=6.3000e+01, =0.0000e+000.001s 16 R1:err=6.1047e+01, =2.4414e-040.000s 17 R2:err=5.5544e+01, =2.4414e-040.079s 40

18 R3:err=4.1838e+01, =2.4415e-040.059s 19 R4:err=1.7885e+01, =2.4421e-040.058s 20 R5:err=1.3999e+00, =2.4472e-040.059s 21 R6:err=5.3994e-04, =1.9681e-040.059s 22 R7:err=5.3156e-05, =3.3769e+050.059s 23 diverging,extendingtodoubleprecision... 24 R8:err=4.5140e-04, =0.0000e+000.064s 25 R9:err=2.9680e-13, =3.2268e-031.479s 26 inversionhaltedin9iterationsafter0m3.382s 27 shuttingdowncuBLAS 28 shuttingdownkernels ListingB.6:Excerptofanattempttoinvertwithinanunachievableerrorlimit withdivergencedetectiondisabled,showingself-correctivebouncingwheredoubleprecisionnumericalstabilityends n =1024. 1 R18:err=1.1310e+00, =3.2129e-010.026s 2 R19:err=9.2789e-01, =6.4140e-010.027s 3 R20:err=7.5785e-01, =9.4861e-010.026s 4 R21:err=4.3523e-01, =9.9993e-010.029s 5 R22:err=8.2442e-02, =1.0000e+000.027s 6 R23:err=5.6034e-04, =1.0000e+000.027s 7 R24:err=1.7601e-10, =1.0004e+000.027s 8 R25:err=4.8498e-12, =8.8943e+170.029s 9 R26:err=5.0036e-12, =4.3863e+220.026s 10 R27:err=4.9887e-12, =3.9825e+220.025s 11 R28:err=5.0116e-12, =4.0365e+220.028s 12 R29:err=5.1043e-12, =4.0550e+220.028s 13 R30:err=4.8396e-12, =3.6393e+220.027s 14 R31:err=4.8361e-12, =4.2664e+220.028s 15 R32:err=5.0029e-12, =4.4231e+220.028s 16 R33:err=4.9357e-12, =3.9418e+220.027s ListingB.7:10 10Hilbertmatrix. 1 >./acmimats/hilb10.mtx-p64 2 7998.438MiBdevicememoryavailable 3 loadingmats/hilb10.mtx 4 matrixis10x10and0.001MiBinsize 5 0.001MiB/matrix;allocating0.003MiBtotal 6 settingupcuBLAS 7 settingupkernelsonGeForceGTX1080 8 blocks/matrix:1,threads/matrix:100 9 blocks/vector:1,threads/vector:10 10 uploadingmatrixtodevice 11 invertingmats/hilb10.mtxwithcubicconvergence... 12 initializingworkmatrices... 13 computedalpha=0.560059 14 R0:err=2.9344e+00, =0.0000e+000.000s 15 R1:err=2.8557e+00, =1.1302e-010.000s 16 R2:err=2.7731e+00, =1.1908e-010.000s 17 R3:err=2.6940e+00, =1.2634e-010.000s 18 R4:err=2.6116e+00, =1.3357e-010.000s 19 R5:err=2.5429e+00, =1.4276e-010.000s 20 R6:err=2.4541e+00, =1.4925e-010.000s 21 R7:err=2.3916e+00, =1.6181e-010.000s 22 R8:err=2.3137e+00, =1.6914e-010.000s 23 R9:err=2.2264e+00, =1.7976e-010.000s 24 R10:err=2.1733e+00, =1.9694e-010.000s 25 R11:err=2.0897e+00, =2.0357e-010.001s 26 R12:err=1.9970e+00, =2.1885e-010.000s 27 R13:err=1.9492e+00, =2.4474e-010.000s 28 R14:err=1.8725e+00, =2.5285e-010.000s 29 R15:err=1.7619e+00, =2.6838e-010.000s 30 R16:err=1.7039e+00, =3.1151e-010.000s 41

31 R17:err=1.6522e+00, =3.3399e-010.000s 32 R18:err=1.5440e+00, =3.4236e-010.000s 33 R19:err=1.4254e+00, =3.8725e-010.000s 34 R20:err=1.3839e+00, =4.7781e-010.000s 35 R21:err=1.3287e+00, =5.0136e-010.000s 36 R22:err=1.2029e+00, =5.1278e-010.000s 37 R23:err=1.0393e+00, =5.9712e-010.000s 38 R24:err=9.8325e-01, =8.7588e-010.001s 39 R25:err=9.4944e-01, =9.9881e-010.000s 40 R26:err=8.5587e-01, =1.0000e+000.000s 41 R27:err=6.2693e-01, =9.9998e-010.000s 42 R28:err=2.4643e-01, =1.0001e+000.000s 43 R29:err=1.4969e-02, =1.0002e+000.000s 44 R30:err=1.5726e-04, =4.6887e+010.000s 45 R31:err=2.2343e-04, =5.7455e+070.000s 46 WARNING:diverged,R30isthebestwecando 47 inversionhaltedin31iterationsafter0m0.002s 48 WARNING:failedtoconvergetoerror<1e-05within2147483647ms 49 shuttingdowncuBLAS 50 shuttingdownkernels ListingB.8:Doomedattempttoinverta2 2singularmatrixofallones. 1 >./acmimats/1111.mtx 2 7998.438MiBdevicememoryavailable 3 loadingmats/1111.mtx 4 matrixis2x2and0.000MiBinsize 5 0.000MiB/matrix;allocating0.000MiBtotal 6 settingupcuBLAS 7 settingupkernelsonGeForceGTX1080 8 blocks/matrix:1,threads/matrix:4 9 blocks/vector:1,threads/vector:2 10 uploadingmatrixtodevice 11 invertingmats/1111.mtxwithcubicconvergence... 12 initializingworkmatrices... 13 computedalpha=0.5 14 R0:err=1.0000e+00, =0.0000e+000.000s 15 R1:err=1.0000e+00, =1.0000e+000.000s 16 diverged,retryingwithalternateR0... 17 R0:err=1.0000e+00, =0.0000e+000.000s 18 R1:err=1.0000e+00, =1.0000e+000.000s 19 diverging,extendingtodoubleprecision... 20 backinguptoreduceerror 21 R0:err=1.0000e+00, =0.0000e+000.000s 22 R1:err=1.0000e+00, =1.0000e+000.000s 23 R2:err=1.0000e+00, =1.0000e+000.000s 24 WARNING:diverged,R1isthebestwecando 25 inversionhaltedin4iterationsafter0m0.000s 26 WARNING:failedtoconvergetoerror<1e-05within2147483647ms 27 shuttingdowncuBLAS 28 shuttingdownkernels 42

APPENDIXC SELECTEDSOURCECODE Forthecompletesourcecode,pleasereferto:https://github.com/zauberkraut/acmi ListingC.1:AltmanInversionImplementation 1 /*invert.c 2 3 ACMIconvergentinversionalgorithmimplementation.*/ 4 5 #include 6 #include"acmi.h" 7 8 enum{MAX_RESTART_ITER=2}; 9 10 /*Quicklycomputes||I-AR0||forR0=alpha*I.*/ 11 staticdoubletraceErrdoublealpha,MatmA{ 12 returnsqrtMatNmA+1-2*alpha*tracemA; 13 } 14 15 /*Returnsthenumberofmillisecondselapsedsincestart.*/ 16 staticintmsSincestructtimespec*start{ 17 structtimespectime; 18 clock_gettimeCLOCK_MONOTONIC,&time; 19 returntime.tv_sec-start->tv_sec*1000+ 20 time.tv_nsec-start->tv_nsec/1.e6; 21 } 22 23 /*Swapsmatrixpointers.*/ 24 staticvoidswapMat*mp,Mat*np{ 25 Matt=*mp; 26 *mp=*np; 27 *np=t; 28 } 29 30 /*Theinversionalgorithm. 31 32 IfconvRateLimit<0,|convRateLimit|specifiesamultipleofthe 33 firstmeasuredratetouseasthelimit.*/ 34 doublealtmanInvertconstMatmA,Mat*constmRp, 35 constintconvOrder,constdoubleerrLimit, 36 constintmsLimit,doubleconvRateLimit, 37 boolsafeR0{ 38 staticstructtimespecstart; 39 40 debug"initializingworkmatrices..."; 41 clock_gettimeCLOCK_MONOTONIC,&start; //startclock 42 constintmatCount=convOrder<4?4:5; 43 MatmR=MatBuildmA, //allocatematriceswithA'sdimensions 44 mAR=MatBuildmA, 45 mX=MatBuildmA, 46 mY=matCount<5?0:MatBuildmA; 47 48 constdoublealpha=1/nrm2mA; 49 debug"computedalpha=%g",alpha; 50 doubleerr=NAN; 51 52 ifsafeR0{ 53 debug"startingwithsafeR0"; 54 transposealpha*alpha,mA,mR; 43

55 }else{ 56 MatClearmR; 57 addIdalpha,mR; 58 err=traceErralpha,mA; 59 } 60 61 intiter=0,totalIters=0,ms; //whiletimeremains 62 for;ms=msSince&start=prevErr&&iter<=MAX_RESTART_ITER{ 83 /*Ourattempttoexploittheself-adjoint,positive-definite 84 R0=alpha*Ifailed.Startoverusingtheslow,safe 85 R0=alpha^2*A^Tinstead.*/ 86 debug"diverged,retryingwithalternateR0..."; 87 88 safeR0=true; 89 transposealpha*alpha,mA,mR; 90 prevErr=INFINITY; 91 iter=-1;totalIters--; 92 93 continue; 94 }elseifMatElemSizemA0 95 &&convRate>=convRateLimit||err>=prevErr{ 96 debug"diverging,extendingtodoubleprecision..."; 97 98 ifMatDevmA&& //quitifwelackGPUmemoryforpromotion 99 !checkDevMemEnoughMatNmA,MatElemSizemA,matCount{ 100 debug"notenoughdeviceRAM;halting"; 101 iferr>=prevErr{ //backupiflastiterwerebetter 102 swap&mX,&mR; 103 } 104 break; 105 } 106 107 MatPromotemA;MatPromotemR;MatPromotemAR; 108 MatPromotemX; 109 ifmY{ 110 MatPromotemY; 111 } 112 113 doubletmp=prevErr; 114 prevErr=INFINITY; //incaseerrorshallreinflate 115 iferr>=tmp{ //ifwe'vefullydiverged... 116 debug"backinguptoreduceerror"; 117 swap&mX,&mR; 118 iter-=2;totalIters--; 119 continue; //recomputeAR 120 } 44

121 }elseiferr>=prevErr||!isfiniteerr{ 122 warn"diverged,R%disthebestwecando",iter-1; 123 124 swap&mX,&mR; //backupRtoitspreviousvalue 125 err=prevErr; 126 break; //quit 127 }else{ 128 ifiter==1&&convRateLimit<0{ 129 //setratelimittoinitialvalue 130 convRateLimit*=-convRate; 131 } 132 133 prevErr=err; 134 } 135 136 switchconvOrder{ //computethenextiteration 137 case2: //quadraticconvergence 138 gemm1,mR,mAR,0,mX; //mX<-RAR 139 geam-1,mX,2,mR,mAR; //mAR<-2R-RAR=nextR 140 swap&mR,&mAR; //mR<-nextR,mAR<-prevR 141 swap&mX,&mAR; //mX<-previousR,mAR<-junk 142 break; 143 case3: //cubicconvergence 144 gemm1,mAR,mAR,0,mX; //mX<-AR^2 145 geam-3,mAR,1,mX,mX; //mX<--3AR+AR^2 146 addId3,mX; //mX<-3I-3AR+AR^2 147 gemm1,mR,mX,0,mAR; //mAR<-R3I-3AR+AR^2 148 swap&mR,&mAR; //mR<-nextR,mAR<-prevR 149 swap&mX,&mAR; //mX<-previousR 150 break; 151 case4: //quarticconvergence 152 gemm1,mAR,mAR,0,mX; //mX<-AR^2 153 geam-4,mAR,1,mX,mX; //mX<--4AR+AR^2 154 addId6,mX; //mX<-6I-4AR+AR^2 155 gemm-1,mAR,mX,0,mY; //mY<--AR6I-4AR+AR^2 156 addId4,mY; //mY<-4I-AR6I-4AR+AR^2 157 swap&mR,&mX; //mX<-previousR 158 gemm1,mX,mY,0,mR; //mR<-R4I-AR6I-4AR+AR^2 159 break; 160 default: 161 fatal"unsupportedconvergenceorder:%d",convOrder; 162 } 163 } //endwhile 164 165 ms=msSince&start; 166 intminutes=ms/60000; 167 doubleseconds=ms-minutes*60000/1000.; 168 debug"inversionhaltedin%diterationsafter%dm%.3fs", 169 totalIters,minutes,seconds; 170 iferr>errLimit{ 171 warn"failedtoconvergetoerror<%gwithin%dms",errLimit, 172 msLimit; 173 } 174 175 //cleanup 176 MatFreemAR; 177 MatFreemX; 178 ifmY{ 179 MatFreemY; 180 } 181 *mRp=mR; //returninvertedmatrix 182 returnerr; 183 } 45

ListingC.2:LinearAlgebraicFunctionWrappers 1 /*linalg.c 2 3 ACMIlinearalgebraicoperationsimplementedusingcuBLASand 4 CUDAkernels.*/ 5 6 #include 7 #include 8 #include"acmi.h" 9 10 staticcublasHandle_tg_cublasHandle; 11 12 voidgpuSetUpconstintmaxBlocksPerKernel,constintn{ 13 debug"settingupcuBLAS"; 14 ifcublasCreate&g_cublasHandle!=CUBLAS_STATUS_SUCCESS{ 15 fatal"couldn'topencuBLAShandle"; 16 } 17 cuSetUpmaxBlocksPerKernel,n; 18 } 19 20 voidgpuShutDown{ 21 debug"shuttingdowncuBLAS"; 22 ifg_cublasHandlecublasDestroyg_cublasHandle; 23 cuShutDown; 24 } 25 26 /*mT<-alpha*mA^T*/ 27 voidtransposedoublealpha,MatmA,MatmT{ 28 constintn=MatNmA; 29 constvoid*consta=MatElemsmA; 30 void*constt=MatElemsmT; 31 constbooldev=MatDevmA; 32 constdoublebeta=0; 33 34 switchMatElemSizemA{ 35 case4: 36 ifdev{ 37 floatalpha32=alpha; 38 cublasSgeamg_cublasHandle,CUBLAS_OP_T,CUBLAS_OP_N,n,n, 39 &alpha32,a,n,float*&beta,t,n,t,n; 40 }else{ 41 cblas_somatcopyCblasColMajor,CblasTrans,n,n,alpha,a,n, 42 t,n; 43 } 44 break; 45 46 case8: 47 ifdev{ 48 cublasDgeamg_cublasHandle,CUBLAS_OP_T,CUBLAS_OP_N,n,n, 49 &alpha,a,n,&beta,t,n,t,n; 50 }else{ 51 cblas_domatcopyCblasColMajor,CblasTrans,n,n,alpha,a,n, 52 t,n; 53 } 54 break; 55 } 56 } 57 58 /*C<-alpha*A*B+beta*C*/ 59 voidgemmdoublealpha,MatmA,MatmB,doublebeta,MatmC{ 60 constintn=MatNmA; 61 constvoid*consta=MatElemsmA; 62 constvoid*constb=MatElemsmB; 63 void*constc=MatElemsmC; 64 constbooldev=MatDevmA; 65 46

66 switchMatElemSizemA{ 67 case4: 68 ifdev{ 69 floatalpha32=alpha,beta32=beta; 70 cublasSgemmg_cublasHandle,CUBLAS_OP_N,CUBLAS_OP_N,n,n,n, 71 &alpha32,a,n,b,n,&beta32,c,n; 72 }else{ 73 cblas_sgemmCblasColMajor,CblasNoTrans,CblasNoTrans,n,n, 74 n,alpha,a,n,b,n,beta,c,n; 75 } 76 break; 77 78 case8: 79 ifdev{ 80 cublasDgemmg_cublasHandle,CUBLAS_OP_N,CUBLAS_OP_N,n,n,n, 81 &alpha,a,n,b,n,&beta,c,n; 82 }else{ 83 cblas_dgemmCblasColMajor,CblasNoTrans,CblasNoTrans,n,n, 84 n,alpha,a,n,b,n,beta,c,n; 85 } 86 break; 87 } 88 } 89 90 /*mB=mC=>C<-alpha*A+beta*C 91 otherwiseC<-alpha*A+beta*B*/ 92 voidgeamdoublealpha,MatmA,doublebeta,MatmB,MatmC{ 93 constintn=MatNmA; 94 constintn2=MatN2mA; 95 constvoid*consta=MatElemsmA; 96 constvoid*constb=MatElemsmB; 97 void*constc=MatElemsmC; 98 constbooldev=MatDevmA; 99 100 switchMatElemSizemA{ 101 case4: 102 ifdev{ 103 floatalpha32=alpha,beta32=beta; 104 cublasSgeamg_cublasHandle,CUBLAS_OP_N,CUBLAS_OP_N,n,n, 105 &alpha32,a,n,&beta32,b,n,c,n; 106 }else{ 107 ifb==c{ 108 cblas_sscaln2,beta,c,1; 109 }else{ 110 memsetc,0,MatSizemC; 111 cblas_saxpyn2,beta,b,1,c,1; 112 } 113 cblas_saxpyn2,alpha,a,1,c,1; 114 } 115 break; 116 117 case8: 118 ifdev{ 119 cublasDgeamg_cublasHandle,CUBLAS_OP_N,CUBLAS_OP_N,n,n, 120 &alpha,a,n,&beta,b,n,c,n; 121 }else{ 122 ifb==c{ 123 cblas_dscaln2,beta,c,1; 124 }else{ 125 memsetc,0,MatSizemC; 126 cblas_daxpyn2,beta,b,1,c,1; 127 } 128 cblas_daxpyn2,alpha,a,1,c,1; 129 } 130 break; 131 } 47

132 } 133 134 /*mA<-mA+alpha*I*/ 135 voidaddIddoublealpha,MatmA{ 136 ifMatDevmA{ 137 cuAddIdalpha,MatElemsmA,MatNmA,MatElemSizemA; 138 }elseforintdiag=0;diag
198 } ListingC.3:HandwrittenCUDAKernels 1 /*kernels.cu 2 3 CustomACMICUDAkernels.*/ 4 5 #include"acmi.h" 6 7 namespace{ 8 9 enum{SWEEP_FACTOR=16}; 10 11 /*Kernelparameters.*/ 12 intg_blocksPerVector,g_threadsPerVectorBlock, 13 g_blocksPerSweep,g_threadsPerSweepBlock, 14 g_blocksPerMatrix,g_threadsPerMatrixBlock; 15 double*g_buckets; 16 intg_bucketsLen; 17 18 /*Copieselementsfromonenxnmatrixtoanother,convertingthem 19 totheprecisionofthedestinationmatrix.*/ 20 template__global__void 21 kernCopyT*dst,constU*src,constint64_tn2{ 22 constintoffset=blockIdx.x*blockDim.x+threadIdx.x; 23 constintstride=gridDim.x*blockDim.x; 24 constT*constend=dst+n2; 25 src+=offset; 26 dst+=offset; 27 28 for;dst__global__void 34 kernAddIdconstTalpha,T*a,constintn{ 35 constT*constend=a+n*n; 36 a+=blockIdx.x*blockDim.x+threadIdx.x*n+1; 37 constintstride=gridDim.x*blockDim.x*n+1; 38 39 for;a__global__void 46 kernSweepSumsconstT*a,constint64_tlen,constintpitch, 47 T*constbuckets,constintbucketsLen{ 48 constintoffset=blockIdx.x*blockDim.x+threadIdx.x; 49 constT*constend=a+len; 50 a+=offset*pitch; 51 constintstride=gridDim.x*blockDim.x*pitch; 52 53 ifoffset
63 64 extern"C"{ 65 66 /*Setsupkernelparameters.*/ 67 voidcuSetUpconstintmaxBlocksPerKernel,constintn{ 68 cudaDevicePropprop; 69 cudaGetDeviceProperties&prop,0; //assumeusingfirstdevice 70 debug"settingupkernelson%s",prop.name; 71 ifmaxBlocksPerKernel>prop.maxGridSize[0]{ 72 fatal"maxblockssupported:%d",prop.maxGridSize[0]; 73 } 74 constintmaxThreadsPerBlock=prop.maxThreadsPerBlock; 75 76 constint64_tn2=n*n; 77 g_bucketsLen=n+SWEEP_FACTOR-1/SWEEP_FACTOR; 78 g_buckets=double*cuMallocsizeofdouble*g_bucketsLen; 79 80 g_blocksPerVector=n+maxThreadsPerBlock-1/ 81 maxThreadsPerBlock; 82 g_blocksPerVector=iMinmaxBlocksPerKernel,g_blocksPerVector; 83 g_blocksPerSweep=g_bucketsLen+maxThreadsPerBlock-1/ 84 maxThreadsPerBlock; 85 g_blocksPerSweep=iMinmaxBlocksPerKernel,g_blocksPerSweep; 86 g_blocksPerMatrix=n2+maxThreadsPerBlock-1/ 87 maxThreadsPerBlock; 88 g_blocksPerMatrix=iMinmaxBlocksPerKernel,g_blocksPerMatrix; 89 g_threadsPerVectorBlock=iMinmaxThreadsPerBlock,n; 90 g_threadsPerSweepBlock=iMinmaxThreadsPerBlock,g_bucketsLen; 91 g_threadsPerMatrixBlock=iMinmaxThreadsPerBlock,n2; 92 93 autothreadsPerVector=g_blocksPerVector*g_threadsPerVectorBlock, 94 threadsPerMatrix=g_blocksPerMatrix*g_threadsPerMatrixBlock; 95 debug"blocks/matrix:%d,threads/matrix:%dn" 96 "blocks/vector:%d,threads/vector:%d",g_blocksPerMatrix, 97 threadsPerMatrix,g_blocksPerVector,threadsPerVector; 98 } 99 100 /*Cleansupkernelenvironment.*/ 101 voidcuShutDown{ 102 debug"shuttingdownkernels"; 103 cuFreeg_buckets; 104 } 105 106 /*Doublesmatrixprecision.*/ 107 voidcuPromotevoid*dst,void*src,intsrcElemSize,int64_tn2{ 108 kernCopy<<>> 109 double*dst,constfloat*src,n2; 110 } 111 112 /*Addsalpha*Itothematrixbackedbythedevicearrayelems.*/ 113 voidcuAddIddoublealpha,void*elems,intn,intelemSize{ 114 switchelemSize{ 115 case4: 116 kernAddId<<>> 117 floatalpha,float*elems,n; 118 break; 119 case8: 120 kernAddId<<>> 121 alpha,double*elems,n; 122 break; 123 } 124 } 125 126 /*Computesamatrixtracebysweepingsums.*/ 127 doublecuTracevoid*elems,intn,intelemSize{ 128 constint64_tn2=n*n; 50

129 doubletrace=NAN; 130 131 switchelemSize{ 132 floattrace32; 133 134 case4: 135 kernSweepSums<<>> 136 float*elems,n2,n+1,float*g_buckets,g_bucketsLen; 137 kernSweepSums<<<1,1>>> 138 float*g_buckets,g_bucketsLen,1,float*g_buckets,1; 139 cuDownload&trace32,g_buckets,sizeoffloat; 140 trace=trace32; 141 break; 142 case8: 143 kernSweepSums<<>> 144 double*elems,n2,n+1,double*g_buckets,g_bucketsLen; 145 kernSweepSums<<<1,1>>> 146 double*g_buckets,g_bucketsLen,1,double*g_buckets,1; 147 cuDownload&trace,g_buckets,sizeofdouble; 148 break; 149 } 150 151 returntrace; 152 } 153 154 } //endextern"C" ListingC.4:RandomMatrixGenerator 1 /*TestsforCPUsupportoftheRDRANDinstruction.*/ 2 staticboolrdRandSupported{ 3 unsignedeax,ebx,ecx,edx; 4 return__get_cpuid1,&eax,&ebx,&ecx,&edx&&ecx&bit_RDRND; 5 } 6 7 staticintcstdRand16{ 8 returnintunsignedrand>>15; //discardcorrelatedbits 9 } 10 11 /*UsesRDRANDinstructiontogeneratehigh-qualityrandomintegers. 12 RequiresanIvyBridgeornewerx86CPU.Requiresnoseeding.*/ 13 staticintrdRand16{ 14 staticintn=0; 15 staticunsignedlonglongr=0; 16 17 if!n{ 18 /*RDRANDalwayspulls64bits,sosplittingeachtermintofour 19 16-bitnumbersisfarfaster.*/ 20 _rdrand64_step&r; //assumeentropysuffices 21 n=4; 22 } 23 24 ints=intr&0xffff; 25 r>>=16; 26 n--; 27 28 returns; 29 } 30 31 /*Drawsoffabitfromarandomnumberandreturnsitasasign.*/ 32 staticintdrawSignint*n{ 33 inttmp=*n; 34 *n>>=1; 35 returntmp&1?1:-1; 36 } 37 51

38 /*Generatesarandom,probably-invertiblematrixofintegers. 39 Certainly-invertibleifdiagonally-dominant.Allowingnegative 40 entriesshallprobablycauseill-conditioning. 41 TODO:replaceboolcascadewithbinaryflags 42 TODO:largedominantdiagonalsmightswallowbiasof+1 43 TODO:populatingelementsincolumn-majororder_might_befaster 44 TODO:16-bitgranularityofrealscanbecourse*/ 45 MatMatNewRandintn,intelemSize,doublemaxElem,boolsymm, 46 boolreal,boolneg,booldiagDom, 47 booluseHardwareRNG{ 48 ifuseHardwareRNG&&!rdRandSupported{ 49 fatal"yourCPUdoesn'tsupporttheRDRANDinstruction"; 50 } 51 int*rand16=useHardwareRNG?rdRand16:cstdRand16; 52 intrMax=neg?SHRT_MAX:USHRT_MAX; 53 54 Matm=MatNewn,elemSize,false; 55 intmaxElemEx=floormaxElem+1; 56 57 forintrow=0;row