Citation
Topics in communication avoiding algorithms and stability analysis

Material Information

Title:
Topics in communication avoiding algorithms and stability analysis
Creator:
Lowery, Bradley R. ( author )
Place of Publication:
Denver, CO
Publisher:
University of Colorado Denver
Publication Date:
2014
Language:
English
Physical Description:
1 electronic file (136 pages).

Subjects

Subjects / Keywords:
Kernel functions ( lcsh )
Error analysis (Mathematics) ( lcsh )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Review:
High performance linear algebra kernels are essential for scientific computing. Fast, accurate, and robust algorithms are required to process the large amount of data present in today's applications. In this thesis, we present new performance analysis, new error analysis, new stable algorithms, and new computational results. An algorithm's performance depends on the computational cost and the communication cost. We begin with a study of the communication cost for dense linear algebra algorithms. We improve the lower bounds for the amount of communication in matrix multiplication. We also review optimal algorithms for dense linear algebra, focusing on recursive algorithms. We also consider the communication cost of the reduction operation. A reduction is a collective communication that is often used to communicate data in a parallel application. We present two new algorithms, each developed under a different model. In a unidirectional model, we prove our new algorithm is optimal. In a bidirectional model, we show experimentally that our new algorithm has the same time complexity as a reverse optimal broadcast. Our implementations show that the new algorithms are viable options in practice. In the remaining chapters, we turn our attention to error analysis. We present a complete error analysis study of computing an oblique QR factorization. As part of this study we introduce a new, stable, communication avoiding algorithm. Performance experiments confirm the benefit of the communication avoiding algorithms. Finally, we consider the error due to the balancing algorithm, a preprocessing step to the nonsymmetric eigenvalue problem. We modify the balancing algorithm by improving its stopping criteria. We present several test cases where the previous balancing algorithm deteriorated the accuracy of the computation. The new algorithm successfully maintains satisfactory backward error for all test cases.
Thesis:
Thesis (Ph.D.)--University of Colorado Denver. Applied mathematics
Bibliography:
Includes bibliographic references.
General Note:
Department of Mathematical and Statistical Sciences
Statement of Responsibility:
by Bradley R. Lowery.

Record Information

Source Institution:
University of Colorado Denver
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
898492485 ( OCLC )
ocn898492485



Full Text


TOPICS IN COMMUNICATION AVOIDING ALGORITHMS AND STABILITY ANALYSIS

by

BRADLEY R. LOWERY
B.S., South Dakota State University, 2005
M.S., South Dakota State University, 2010

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Applied Mathematics, 2014


This thesis for the Doctor of Philosophy degree by Bradley R. Lowery has been approved for the Department of Mathematical and Statistical Sciences by

Stephen Billups, Chair
Julien Langou, Advisor
Lynn Bennethum
Jan Mandel
George Bosilca

April 29, 2014


Lowery, Bradley R. (Ph.D., Applied Mathematics)
Topics in Communication Avoiding Algorithms and Stability Analysis
Thesis directed by Associate Professor Julien Langou

ABSTRACT

High performance linear algebra kernels are essential for scientific computing. Fast, accurate, and robust algorithms are required to process the large amount of data present in today's applications. In this thesis, we present new performance analysis, new error analysis, new stable algorithms, and new computational results.

An algorithm's performance depends on the computational cost and the communication cost. We begin with a study of the communication cost for dense linear algebra algorithms. We improve the lower bounds for the amount of communication in matrix multiplication. We also review optimal algorithms for dense linear algebra, focusing on recursive algorithms.

We also consider the communication cost of the reduction operation. A reduction is a collective communication that is often used to communicate data in a parallel application. We present two new algorithms, each developed under a different model. In a unidirectional model, we prove our new algorithm is optimal. In a bidirectional model, we show experimentally that our new algorithm has the same time complexity as a reverse optimal broadcast. Our implementations show that the new algorithms are viable options in practice.

In the remaining chapters, we turn our attention to error analysis. We present a complete error analysis study of computing an oblique QR factorization. As part of this study we introduce a new, stable, communication avoiding algorithm. Performance experiments confirm the benefit of the communication avoiding algorithms.

Finally, we consider the error due to the balancing algorithm, a preprocessing step to the nonsymmetric eigenvalue problem. We modify the balancing algorithm by improving its stopping criteria. We present several test cases where the previous balancing algorithm deteriorated the accuracy of the computation. The new algorithm successfully maintains satisfactory backward error for all test cases.

The form and content of this abstract are approved. I recommend its publication.

Approved: Julien Langou


DEDICATION

I dedicate this dissertation to my family. Particularly to my loving wife, Rachel, for her unwavering support over the last several years. I also dedicate this work to my parents, Bob and Diane, for their endless support and encouragement throughout my life.


ACKNOWLEDGMENT

I would like to thank my advisor, Dr. Julien Langou, for the patience, guidance, encouragement and advice he provided throughout my studies and research. I am extremely grateful for the prompt feedback and constant interaction we maintained throughout the research process.

I would also like to thank my committee members, Dr. Steve Billups, Dr. Lynn Bennethum, Dr. George Bosilca and Dr. Jan Mandel. Thank you for serving as my committee members and for your expertise and instruction throughout my graduate studies.

A special thanks to all my fellow graduate students for the support, encouragement, and laughs. In particular, I want to thank Henc Bouwmeester for providing his expertise in numerical analysis and parallel computing.

Financial support was provided by the University of Colorado Denver and the National Science Foundation (award numbers NSF CCF 1054864 and NSF CNS 0958354). Research was conducted using resources at the University of Colorado Denver and the University of Colorado Boulder (award number NSF CNS 0821794).


TABLE OF CONTENTS

Tables .......... x
Figures .......... xi

Chapter
1. Introduction .......... 1
2. State of Communication Avoiding Algorithms .......... 6
   2.1 Introduction .......... 6
   2.2 Communications in Sequential Algorithms .......... 6
       2.2.1 Lower Bounds for Matrix-Matrix Multiplication .......... 8
       2.2.2 Optimal Algorithms for Ordinary Matrix-Matrix Multiplication .......... 15
           2.2.2.1 Block Matrix-Matrix Multiplication Algorithm .......... 15
           2.2.2.2 Cache Oblivious Recursive Algorithm .......... 17
       2.2.3 Communications in Linear Algebra .......... 21
           2.2.3.1 Communications for Kernels .......... 21
           2.2.3.2 Recursive LU with Partial Pivoting .......... 22
           2.2.3.3 Recursive Cholesky .......... 26
           2.2.3.4 Recursive QR .......... 28
   2.3 Communications in Distributed Memory Algorithms .......... 31
       2.3.1 Lower Bounds for Matrix-Matrix Multiplication .......... 32
       2.3.2 Optimal Matrix-Matrix Multiplication Algorithms .......... 34
3. QR Factorization in an Oblique Inner Product .......... 38
   3.1 Introduction .......... 38
   3.2 Stability Analysis of CHOLQR .......... 41
   3.3 Stability Analysis of PRE-CHOLQR .......... 45
   3.4 Factorization Based on Cholesky Factor of A .......... 47
   3.5 Factorization Based on Eigenvalue Decomposition of A .......... 51
   3.6 Stability of Gram-Schmidt .......... 53


   3.7 Stability Experiments .......... 56
   3.8 Communication Cost .......... 63
   3.9 Performance Experiments .......... 65
   3.10 Conclusion .......... 67
4. A Greedy Algorithm for Optimally Pipelining a Reduction .......... 68
   4.1 Introduction .......... 68
   4.2 Model .......... 70
   4.3 Collective Communications and Related Work .......... 71
   4.4 Standard Algorithms .......... 73
   4.5 A Greedy Algorithm for a Unidirectional System .......... 77
       4.5.1 Proof of Optimality .......... 81
       4.5.2 Theoretical Results .......... 84
   4.6 Bidirectional System .......... 87
       4.6.1 Theoretical Results .......... 87
       4.6.2 The Bi-greedy Algorithm .......... 88
   4.7 Schedule Complexity .......... 92
   4.8 Numerical Results .......... 94
   4.9 Unequal Segmentation .......... 95
   4.10 Conclusion .......... 98
5. On Matrix Balancing and Eigenvector Computation .......... 100
   5.1 Introduction .......... 100
   5.2 The Balancing Algorithm .......... 101
   5.3 Case Study .......... 105
   5.4 Backward Error .......... 107
   5.5 Experiments .......... 108
   5.6 Matrix Exponential .......... 112
   5.7 Conclusion .......... 113


6. Conclusion .......... 115
References .......... 117
Appendix
A. Exact Eigenvalue and Eigenvectors for Balancing Case Study .......... 122


TABLES

Table
4.1 Time analysis for standard algorithms in a unidirectional system .......... 74
4.2 Lower bounds for each term in the time for a reduction as well as the lower bounds for each standard algorithm in a unidirectional system .......... 75
4.3 Time analysis for bidirectional reduction algorithms .......... 89
4.4 Optimal segmentation compared to best of the equi-segmentations .......... 97


FIGURES

Figure
2.1 Snapshot of the operations needed in Equation (2.42) .......... 35
3.1 Representativity error .......... 59
3.2 A-norm representativity error for CHOL-EQR .......... 61
3.3 Componentwise versus normwise representativity bound for CHOLQR demonstrated with case 3 .......... 61
3.4 Orthogonality error .......... 62
3.5 Performance results on 1 node with 16 threads .......... 66
4.1 Regions where binomial or pipeline or binary is better in terms of the number of processors p and the message size m .......... 76
4.2 C code for the uni-greedy reduction algorithm .......... 79
4.3 Example schedule for the uni-greedy algorithm .......... 80
4.4 Theoretical results for a unidirectional system .......... 85
4.5 Ratio between the time for the best standard algorithm and the uni-greedy algorithm for varying values of p (number of processors) and m (message size) .......... 86
4.6 Theoretical results for a bidirectional system .......... 89
4.7 Butterfly algorithm versus reverse optimal broadcast algorithm .......... 90
4.8 Example schedule for the bi-greedy algorithm .......... 92
4.9 Experimental results for each algorithm for all possible message sizes for 64 processors .......... 95
4.10 Ratio for optimal segmentation versus the best equal segmentation .......... 97
5.1 Backward error and absolute eigenvalue condition number versus iterations .......... 110
5.2 Backward error bounds .......... 112
5.3 Ratio of balanced matrix norm and original matrix norm .......... 113


1. Introduction

Linear algebra is a key computational tool in many applications of scientific computing. The need for high performance linear algebra kernels continues to grow as the amount of data that must be processed continues to climb. The performance of an algorithm is limited by the cost of moving data. In the sequential case, the data must be moved between levels of memory to the cache where the computation is performed. In the parallel distributed memory case, the data must be transferred between processors. In both cases, reducing or minimizing the amount of data transferred can greatly improve the performance. Algorithms that are designed to reduce or minimize the amount of data transferred are called communication avoiding algorithms.

In Chapter 2, we review the state-of-the-art of communication avoiding algorithms. The history of communication avoiding algorithms begins with matrix multiplication. We review the lower bounds for the amount of data that must be transferred in both the sequential case and the parallel distributed case. We also provide new analysis that improves the lower bounds for matrix multiplication. We also review algorithms that match the lower bounds asymptotically and provide detailed analysis to determine an upper bound on the communications for these algorithms. The communication lower bounds have been extended to other linear algebra kernels such as LU factorization, Cholesky factorization, and QR factorization. We do not improve on the lower bounds for these problems, but we do study optimal algorithms. We choose to focus on recursive algorithms.

Speed is not the only consideration when designing an algorithm. The accuracy of the computed solution must also be considered. Due to finite-precision arithmetic, errors in the computation are inevitable. Determining the amount of error an algorithm will accrue and understanding how the errors occur is vital to developing robust algorithms.

In Chapter 3, we consider the problem of computing a QR factorization in an oblique inner product. The Q factor has orthonormal columns with respect to a nonstandard inner product. The problem is a generalization of computing a QR factorization where Q has orthonormal columns with respect to the Euclidean inner product. An oblique QR factorization is useful for some applications such as the conjugate gradient method and the generalized eigenvalue problem.

There are numerous algorithms to compute an oblique QR factorization. We present new error analysis for some classic algorithms. We also present a new, stable, communication avoiding algorithm. This work extends the previous error analysis by Rozložník et al. [41]. We also develop a set of test cases to assess whether the error bounds we obtained are descriptive of the actual error. Numerical experiments confirm the bounds are descriptive. The study also shows what underlying properties of the input data determine the magnitude of the error.

We discuss the communication differences between the algorithms in a distributed memory setting. Performance experiments demonstrate the benefit of the communication avoiding algorithms.

Obviously, we will never completely avoid communicating data between processors in parallel applications. In Chapter 4, we study algorithms for the reduction operation, which is a collective communication operation. Typically, data is transferred among processors according to collective communications. In a reduction, data is initially distributed among all of the processors in the collective. The data must be combined by some specified binary operation and the result returned on a single specified processor.

We present two new reduction algorithms. We consider two communication models: unidirectional and bidirectional. We develop our first new algorithm assuming a unidirectional model and prove the new algorithm is optimal. The second new algorithm is developed assuming a bidirectional model. We show experimentally that our second new algorithm has the same time complexity as a reverse optimal broadcast. Timing experiments show the algorithms are a viable option in practice.

In Chapter 5, we return our attention to error analysis. We consider the error in computing the eigenvalue decomposition of a nonsymmetric matrix due to initially balancing the matrix. The goal of the balancing routine is to improve the convergence of iterative methods and the accuracy of the computation. Balancing has long been assumed to be innocuous, since it can be implemented with no floating-point error. However, examples exist where balancing will deteriorate the eigenvalue condition number, the accuracy of the eigenvectors, and the backward error of the eigenvalue decomposition.

We show the source of the error in these examples is a poor stopping criterion. We present a change to the balancing routine that effectively improves the stopping criterion. Our new algorithm prevents loss of accuracy in all of our examples.

Notation and Terminology. In this section, we introduce notation and terminology that is used throughout this thesis.

A word is a single data entry. A message is a packet of words that can be transferred between levels of memory in the sequential case and across processors in the distributed memory case. For a particular problem, the total number of words is the volume of words transferred. When considering the time for a particular algorithm, we will use a linear model to represent the time. The time to transfer a single message is given by α + β⁻¹m, where α is the latency, β is the bandwidth (words/sec), and m is the number of words in the message. The computational time is given by γ⁻¹N, where N is the number of floating-point operations (flops) and γ is the computational rate (flops/sec). When considering the total time for any algorithm, we will assume that communication and computation do not overlap. Although communications and computations can occur simultaneously, given current technology, this assumption is a standard accepted model. We model the total time of an algorithm with

    α (number of messages) + β⁻¹ (volume of words transferred) + γ⁻¹ (flops).

The following notation will be used when considering the asymptotic behavior of functions.
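The linear time model above is simple enough to state directly in code. The following Python sketch is purely illustrative (the function name and the hardware parameters are our assumptions, not values from the thesis):

```python
# Illustrative sketch of the linear cost model:
# total time = alpha*(number of messages) + (1/beta)*(words moved) + (1/gamma)*(flops),
# where alpha is the latency (sec/message), beta the bandwidth (words/sec),
# and gamma the computational rate (flops/sec).

def modeled_time(messages, words, flops, alpha, beta, gamma):
    return alpha * messages + words / beta + flops / gamma

# A single m-word message costs alpha + (1/beta)*m; hypothetical hardware numbers:
single = modeled_time(messages=1, words=1000, flops=0,
                      alpha=1e-6, beta=1e9, gamma=1e9)
```

Under these assumed parameters, the 1000-word message spends equal time in the latency term and the bandwidth term (10⁻⁶ seconds each).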


Definition 1.1. Let f(x) and g(x) be defined on a subset of the real line. Then f(x) = O(g(x)) if and only if there exists a positive real number c such that, for all x sufficiently close to an understood limit (typically 0 or ∞),

    |f(x)| ≤ c |g(x)|.

Definition 1.2. Let f(x) and g(x) be defined on a subset of the real line. Then f(x) = Ω(g(x)) if and only if there exists a positive real number c such that, for all x sufficiently close to an understood limit (typically 0 or ∞),

    |f(x)| ≥ c |g(x)|.

Definition 1.3. Let f(x) and g(x) be defined on a subset of the real line. Then f(x) = Θ(g(x)) if and only if f(x) = Ω(g(x)) and f(x) = O(g(x)).

In Chapter 3 and Chapter 5, we study the floating-point error due to various algorithms. We assume standard floating-point arithmetic:

    fl(x op y) = (x op y)(1 + δ),  |δ| ≤ u,

where u denotes machine precision and op ∈ {+, −, ×, /}. To avoid clouding the major results, we will not track small constants and merely let c represent a small constant that may change from one line to another. We will also make assumptions on the dimensions of matrices, m and n, to simplify the error bounds. We will frequently assume, for example, nu < 1/2, so

    nu / (1 − nu) ≤ 2nu = cnu.

To assess the accuracy of an algorithm, we will study the backward error of the algorithm. Let f(x) = y be a problem with input x and exact solution y. Let f̃(x) = ỹ be an algorithm to solve f with a computed solution ỹ. Let f(x̃) = ỹ. Hence, ỹ is the exact solution to f with input x̃. The distance between x and x̃ is called the backward error. An algorithm is said to be backward stable if

    ‖x − x̃‖ / ‖x‖ ≤ O(u).

The quantity on the left side of the above equation is called the relative backward error. In words, an algorithm is backward stable if the computed solution is the exact solution to a nearby problem. Backward error results for the major kernels (matrix-matrix multiplication, triangular solve, Cholesky factorization, and Householder QR factorization) can be found in [24, Chapters 3, 8, 10, 19].

The set of m × n real matrices is represented by R^{m×n}. We will use uppercase letters to represent matrices and lowercase letters to represent elements of the matrix. We will use colon notation to represent a column or row of a matrix. If A ∈ R^{m×n}, then A(j) or A(:, j) is the j-th column of A, A(i, :) is the i-th row of A, and a_ij is the element in the i-th row and j-th column.

Inequalities between matrices are understood to hold entry-wise. For example, A ≤ B if and only if a_ij ≤ b_ij for all i and j.

2. State of Communication Avoiding Algorithms

2.1 Introduction

In this chapter, we present the state-of-the-art of communication avoiding algorithms. We begin with a review of the lower bounds for the number of words transferred in the sequential case for matrix-matrix multiplication. We provide a proof to improve the lower bound for matrix multiplication as well as study the communication requirements for optimal algorithms.

We also review lower bounds and optimal algorithms for communications in dense linear algebra methods. We do not improve on the lower bounds for dense linear algebra, but we do consider optimal algorithms for some classic linear algebra problems. In Section 2.2.3, we present details of asymptotically-optimal algorithms for LU factorization, Cholesky factorization, and QR factorization. We choose to focus on recursive algorithms.

Finally, we review the communication lower bounds for matrix-matrix multiplication in the distributed memory case. Again, we present details of an optimal algorithm.

2.2 Communications in Sequential Algorithms

We begin with a review of the lower bounds for the number of words transferred in the sequential case for matrix-matrix multiplication. The lower bounds are valid only for ordinary matrix-matrix multiplication.

Definition 2.1 (Ordinary matrix-matrix multiplication algorithm). An ordinary matrix-matrix multiplication algorithm computes C = AB by computing each c_ijk = a_ik b_kj explicitly, and the c_ij entry is formed as the sum of c_ijk over all k.

The order in which the computations are performed constitutes variants of the ordinary matrix-matrix multiplication algorithm. Strassen's algorithm [44] is not an ordinary matrix-matrix multiplication algorithm. Strassen's algorithm reduces the number of scalar multiplications and therefore does not explicitly compute each c_ijk.

In 1981, Hong and Kung [27] considered the number of words transferred between fast and slow memory in a 2-level memory hierarchy when multiplying two matrices.
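Definition 2.1 is easy to make concrete. The sketch below is ours, for illustration only (lists of lists stand in for matrices); it forms every product c_ijk explicitly, which is exactly what Strassen-type algorithms avoid:

```python
# Ordinary matrix-matrix multiplication in the sense of Definition 2.1:
# each c_ijk = a_ik * b_kj is computed explicitly and summed over k.

def ordinary_matmul(A, B):
    m, n, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            for k in range(n):
                c_ijk = A[i][k] * B[k][j]  # formed explicitly
                C[i][j] += c_ijk           # c_ij is the sum over k
    return C

C = ordinary_matmul([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])
```

Reordering the three loops gives the variants mentioned above; all of them perform the same mnp scalar multiplications.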


Theorem 2.2 (Sequential matrix-matrix multiplication lower bound, Hong and Kung, 1981 [27]). Consider the ordinary matrix-matrix multiplication algorithm for multiplying m × n and n × p matrices on a computer with unlimited slow memory and fast memory of size M. The number of words, W, transferred between slow and fast memory is

    W = Ω(mnp / √M).

The result shows that, for fixed M, the number of words transferred is at least proportional to the amount of computation (mnp), rather than just the amount of initial data (mn + np). In 2004, Irony et al. [28] provided a new proof for the result of Hong and Kung [27] (Theorem 2.2).

Theorem 2.3 (Sequential matrix-matrix multiplication lower bound, Irony et al., 2004 [28]). Consider the ordinary matrix-matrix multiplication algorithm for multiplying m × n and n × p matrices on a computer with unlimited slow memory and fast memory of size M. The total number of words that are transferred between fast and slow memory is at least

    W(m, n, p) ≥ mnp / (2√2 √M) − M.

A benefit of Theorem 2.3 is that this is an explicit lower bound, allowing for a comparison to algorithms without asymptotic analysis. In Section 2.2.1, we use a similar proof technique as Irony et al. to improve this lower bound for ordinary matrix-matrix multiplication. In Section 2.2.2, we present and analyze optimal algorithms.

A lower bound for the number of messages that must be transferred between slow and fast memory can be obtained from a lower bound for the number of words. In the sequential model, the largest a message can be is M, the size of fast memory. Dividing the lower bound for the number of words by M gives a lower bound for the number of messages transferred. We state the general result in the following theorem.


Theorem 2.4 (Demmel et al. [15]). Let W be a lower bound for the number of words transferred by an algorithm. Let M be an upper bound for the size of a message. A lower bound for the number of messages, L, transferred by the algorithm is L ≥ W/M.

Theorem 2.4 holds for both the sequential model and the distributed memory model. The theorem also holds for any algorithm, not just the ordinary matrix-matrix multiplication algorithm.

2.2.1 Lower Bounds for Matrix-Matrix Multiplication

In this section, we consider the sequential case with two levels of memory. We derive a lower bound on the number of words that must be communicated between slow and fast memory for multiplying two matrices, C = AB, where A ∈ R^{m×n} and B ∈ R^{n×p}. We restrict the algorithms to ordinary matrix-matrix multiplication. The proof is similar to Irony et al. [28]. We manage to increase the constant to provide a better lower bound.

Assume fast memory has size M and the size of the slow memory is unlimited. There are a total of mnp scalar multiplications to perform. That is, we must explicitly compute c_ijk = a_ik b_kj. At the evaluation of c_ijk, it is not required that c_ijk be accumulated into the entry c_ij. It may take up its own space in fast memory, be written to its own location in slow memory, or be accumulated with c_ij.

We have the following set of instructions:

1. Read data from slow memory.
2. Write data to slow memory.
3. Compute (create) c_ijk in fast memory.
4. Delete data from fast memory.

A sequence of these instructions defines an algorithm for ordinary matrix-matrix multiplication. We will assume a copy of A and B is always in slow memory. Therefore we never write a_ik or b_kj back to slow memory. Elements of A and B are only read from slow memory or deleted from fast memory. At the end of the algorithm, C is stored in slow memory.

Split the instructions into segments. For each segment, the number of reads and the number of writes sum exactly to M. We wish to determine the maximum number of c_ijk that can be computed during a segment. Let M_a, M_b, and M_c be the number of elements of A, B, and C, respectively, in fast memory at the start of a segment. By an element of C we mean a partial sum of the element c_ij. The partial sum might not be combined with another partial sum that was computed earlier and stored in slow memory. Let R_a, R_b, and R_c be the number of elements read from slow memory during the segment. Let W_a, W_b, and W_c be the number of elements written to slow memory during the segment. Note that W_a = W_b = 0 since we never write A or B. Let N_a, N_b, and N_c be the number of elements left in fast memory at the end of the segment.

There are at most M_a + R_a elements of A in fast memory during a segment. We say at most since we do not assume the elements in fast memory at the start of the segment and the elements read during the segment are distinct. There are at most M_b + R_b elements of B in fast memory during a segment. The elements of C that are created either take up a space in fast memory (an accumulation spot, one space for each c_ij, at the end of the segment) or must be written to slow memory. So, we can create at most W_c + N_c elements of C.

We will use the three-dimensional version of the discrete Loomis-Whitney inequality [6, 32].

Lemma 2.5 (Loomis-Whitney Inequality [6, 32]). Let V be a finite set with elements in Z³, and let V_x, V_y, and V_z be orthogonal projections of V onto the coordinate planes. The cardinality of V, |V|, satisfies

    |V| ≤ √(|V_x| |V_y| |V_z|).

In our case, the set V consists of the triplets (i, j, k) representing the multiplications c_ijk performed during a segment. The orthogonal projections are subsets of the elements of A and B in fast memory during the segment (either initially in fast memory or read from slow memory) and the elements of C that are created (either written to slow memory or remaining in fast memory). Hence, the maximum number of scalar multiplications that can be performed during a segment is

    max √((M_a + R_a)(M_b + R_b)(N_c + W_c)),
    subject to  R_a + R_b + R_c + W_c = M,
                M_a + M_b + M_c ≤ M,
                N_a + N_b + N_c ≤ M,
                R_i ≥ 0, M_i ≥ 0, N_i ≥ 0, for i ∈ {a, b, c}.

The first constraint is the assumption that the total number of reads and writes combined is M. Recall that elements of A or B are never written to slow memory, so W_a = W_b = 0. The second constraint says that at most M elements can be in fast memory at the start of a segment. The third constraint says that at most M elements can be in fast memory at the end of a segment. The final constraint says all variables are nonnegative. The nonnegative constraint is implicit in the remaining equations.

Since M_c, N_a, N_b, and R_c do not appear in the objective function, we can always increase the number of scalar multiplications by taking M_c = N_a = N_b = R_c = 0. Now the problem is reduced to

    max √((M_a + R_a)(M_b + R_b)(N_c + W_c)),
    subject to  R_a + R_b + W_c = M,
                M_a + M_b ≤ M,
                N_c ≤ M.
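Lemma 2.5 can be sanity-checked numerically. The sketch below is ours: it samples random subsets V of a small integer cube and verifies the inequality against the three coordinate-plane projections:

```python
import itertools
import math
import random

# Check |V| <= sqrt(|Vx| * |Vy| * |Vz|) for random finite subsets V of Z^3.

def projections(V):
    Vx = {(j, k) for (i, j, k) in V}  # project out the first coordinate
    Vy = {(i, k) for (i, j, k) in V}  # project out the second coordinate
    Vz = {(i, j) for (i, j, k) in V}  # project out the third coordinate
    return Vx, Vy, Vz

random.seed(0)
cube = list(itertools.product(range(4), repeat=3))
for _ in range(200):
    V = set(random.sample(cube, random.randint(1, len(cube))))
    Vx, Vy, Vz = projections(V)
    assert len(V) <= math.sqrt(len(Vx) * len(Vy) * len(Vz)) + 1e-12
```

Equality holds when V is a full box (for example, the whole 4 × 4 × 4 cube), which is why the bound is tight in the matrix-multiplication argument above.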


An upper bound is obtained by noting each value is less than M. The maximum number of scalar multiplications that can be performed during a segment is less than or equal to √(8M³). This is the value obtained by Irony et al.

To improve on the bound, we solve the constrained optimization problem. We can normalize the problem by letting M = 1. The variables now represent fractions of the size of fast memory. Since all quantities are nonnegative and the objective function involves only multiplication and addition, the maximum is obtained when all inequality constraints are active. Let M_b = 1 − M_a and W_c = 1 − (R_a + R_b). We have

    max √((M_a + R_a)(1 − M_a + R_b)(2 − (R_a + R_b))),
    subject to  M_a ≤ 1,
                R_a + R_b ≤ 1.                                (2.1)

The global maximum is obtained at either a critical point or on the boundaries. Let x = R_a, y = R_b, and z = M_a. Let f(x, y, z) = (x + z)(1 − z + y)(2 − (x + y)); then

    ∂f/∂x = (1 − z + y)(2 − 2x − y − z),
    ∂f/∂y = (x + z)(1 − x − 2y + z),
    ∂f/∂z = (2 − (x + y))(1 − x + y − 2z).

The feasible critical points are (x, 1 − x, 1 − x) for 0 ≤ x ≤ 1, and f(x, 1 − x, 1 − x) = 1.

Next, we check the maximum value obtained on the boundaries.

(a) Let x = 0, 0 ≤ y ≤ 1, and 0 ≤ z ≤ 1: f(0, y, z) = z(1 − z + y)(2 − y) and

    ∂f/∂y (0, y, z) = z(1 − 2y + z),
    ∂f/∂z (0, y, z) = (2 − y)(1 + y − 2z).

The only feasible critical point is (0, 1, 1), and f(0, 1, 1) = 1.

(b) Let 0 ≤ x ≤ 1, y = 0, and 0 ≤ z ≤ 1: f(x, 0, z) = (x + z)(1 − z)(2 − x) and

    ∂f/∂x (x, 0, z) = (1 − z)(2 − 2x − z),
    ∂f/∂z (x, 0, z) = (2 − x)(1 − x − 2z).

The only feasible critical point is (1, 0, 0), and f(1, 0, 0) = 1.

(c) Let 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 − x, and z = 0: f(x, y, 0) = x(1 + y)(2 − (x + y)) and

    ∂f/∂x (x, y, 0) = (1 + y)(2 − 2x − y),
    ∂f/∂y (x, y, 0) = x(1 − x − 2y).

There are two feasible critical points, (0, 1, 0) and (1, 0, 0); f(0, 1, 0) = 0 and f(1, 0, 0) = 1.

(d) Let 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and z = 1: f(x, y, 1) = (x + 1) y (2 − (x + y)) and

    ∂f/∂x (x, y, 1) = y(1 − 2x − y),
    ∂f/∂y (x, y, 1) = (x + 1)(2 − x − 2y).

The only feasible critical point is (0, 1, 1), and f(0, 1, 1) = 1.

(e) Let x + y = 1 and 0 ≤ z ≤ 1:

    f(1 − y, y, z) = (1 − y + z)(1 + y − z)(2 − ((1 − y) + y))
                   = (1 − (y − z))(1 + (y − z))
                   = 1 − (y − z)² ≤ 1.
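The critical-point and boundary analysis can be spot-checked numerically. The following sketch is ours: it grid-searches the normalized objective over the feasible region and confirms both that no feasible point exceeds 1 and that the value 1 is attained on the line (x, 1 − x, 1 − x):

```python
# Normalized objective of problem (2.1), with x = R_a, y = R_b, z = M_a:
# f(x, y, z) = (x + z)(1 - z + y)(2 - (x + y)),
# subject to 0 <= z <= 1 and x, y >= 0 with x + y <= 1.

def f(x, y, z):
    return (x + z) * (1.0 - z + y) * (2.0 - (x + y))

n = 100
best = 0.0
for i in range(n + 1):
    for j in range(n + 1 - i):  # enforces x + y <= 1 on the grid
        for k in range(n + 1):
            best = max(best, f(i / n, j / n, k / n))
```

Taking the square root of the denormalized maximum recovers the √(M³) bound established next.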


Finally,wemustchecktheedges.Themaximumvalueontheedgesoffaceeareless thanorequalto1bythepreviousinequality. aLet x =0 0 y 1 ,and z =1 f ;y; 1= y )]TJ/F22 11.9552 Tf 11.819 0 Td [(y ,whichhasamaximumof1at ; 1 ; 1 bLet x =0 0 y 1 ,and z =0 f ;y; 0=0 cLet x =0 y =0 ,and 0 z 1 f ; 0 ;z =2 z )]TJ/F22 11.9552 Tf 12.285 0 Td [(z ,whichhasamaximumof 1 = 2 at ; 0 ; 1 = 2 dLet 0 x 1 y =0 ,and z =0 f x; 0 ; 0= x )]TJ/F22 11.9552 Tf 11.503 0 Td [(x ,whichhasamaximumof1at ; 0 ; 0 eLet 0 x 1 y =0 ,and z =1 f x; 0 ; 1=0 Themaximumofthenormalizedproblemis1andoccursat x; 1 )]TJ/F22 11.9552 Tf 12.702 0 Td [(x; 1 )]TJ/F22 11.9552 Tf 12.702 0 Td [(x .Denormalizingtheproblemweobservethemaximumnumberofscalarmultiplicationsina singlesegmentis p M 3 Theorem2.6. Let M bethesizeoffastmemoryandslowmemoryhaveunlimitedsize. Considerasegmentofanordinarymatrix-matrixmultiplicationsalgorithminwhichthe numberofreadsandthenumberofwritessumto M .Themaximumnumberofscalar multiplicationsthatcanbeperformedduringasegmentis p M 3 Proof. Themaximumofnormalizedproblem.1isobtainedwhen 0 R a 1 R b = 1 )]TJ/F22 11.9552 Tf 11.955 0 Td [(R a ,and M a =1 )]TJ/F22 11.9552 Tf 11.955 0 Td [(R a .Thedenormalizedproblemis max p M a + R a M )]TJ/F22 11.9552 Tf 11.955 0 Td [(M a + R b M )]TJ/F15 11.9552 Tf 11.955 0 Td [( R a + R b ; subjectto 0 M a M 0 R a + R b M: 13

PAGE 25

Themaximumisobtainedfor 0 R a M R b = M )]TJ/F22 11.9552 Tf 12.206 0 Td [(R a ,and M a = M )]TJ/F22 11.9552 Tf 12.207 0 Td [(R a andthe maximumis p M 3 Alowerboundforthetotalnumberofwordsandmessagestransferredimmediately follows. Theorem2.7 SequentialMatrix-MatrixMultiplicationLowerBound Let A 2 R m n B 2 R n p ,and C 2 R m p .If M isthesizeoffastmemoryandslowmemoryisunlimited, thenumberofwords, W m;n;p ,andthenumberofmessages, L m;n;p ,transferred betweenslowandfastmemorytocompute C = AB withanordinarymatrix-matrixmultiplicationalgorithmsatisfy W m;n;p mnp p M )]TJ/F22 11.9552 Tf 11.955 0 Td [(M; .2 L m;n;p mnp M 3 = 2 )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 : .3 Proof. FromTheorem2.6,thetotalnumberofcompletesegmentsofsize M islessthan orequalto j mnp p M 3 k .Theminimumnumberofwordstransferredbetweenfastandslow memoryis mnp p M 3 M mnp p M )]TJ/F22 11.9552 Tf 11.955 0 Td [(M: Alowerboundonthenumberofmessagesisobtainedbynotingatmost M words canbetransferredinasinglemessage.Dividingby M ,weobtainthelowerboundforthe numberofmessages. Thelowerboundsforthenumberofwordsandmessagescommunicatedcanbeextend tomultiplelevelsofmemory.Lettherebe l levelsofmemoryeachofsize M k .Computation isperformedintherstlevelofmemoryandslowmemoryisunlimited M l = 1 Theorem2.8. Let A 2 R m n B 2 R n p ,and C 2 R m p .Lettherebe l levelsofmemory eachofsize M k and M l = 1 slowmemory.Thenumberofwordsandthenumberof 14


messages transferred between level $k$ and $k+1$ to compute $C = AB$ satisfy
\[
W_k(m,n,p) \ge \frac{mnp}{\sqrt{S_k}} - S_k, \tag{2.4}
\]
\[
L_k(m,n,p) \ge \frac{mnp}{M_k \sqrt{S_k}} - \frac{S_k}{M_k}, \tag{2.5}
\]
where $S_k = \sum_{i=1}^{k} M_i$.

Proof. Consider the number of words communicated between level $k$ and level $k+1$. Let levels 1 to $k$ collectively represent fast memory and levels $k+1$ to $l$ collectively represent slow memory. By Theorem 2.7 with $M = S_k$, the lower bound (2.4) for the number of words transferred is obtained. The lower bound (2.5) for the number of messages follows immediately, since at most $M_k$ words can be communicated at one time.

Notes. In the proof of the lower bound, $N_a$, $N_b$, and $N_c$ of one segment are the values of $M_a$, $M_b$, and $M_c$ of the next segment. The lower bound assumes fast memory is full at the start and end of every segment. If we hope to attain the lower bound, we must design our algorithm so that $N_a$, $N_b$, and $N_c$ are used in the following segment.

2.2.2 Optimal Algorithms for Ordinary Matrix-Matrix Multiplication

In this section, we present a detailed analysis of two communication-optimal algorithms for ordinary matrix-matrix multiplication.

Definition 2.9 (Communication-optimal Algorithms). An algorithm is said to be communication optimal if the number of words/messages transferred is bounded above by a quantity whose leading term is a constant multiple of a lower bound.

2.2.2.1 Block Matrix-Matrix Multiplication Algorithm

Algorithm 2.1 is the block matrix-matrix multiplication algorithm. Each block is of size $b \times b$. Line 1 sets the block size so that three blocks fit in fast memory at one time. Three blocks must fit in fast memory to perform the operations in Line 7 without more reads


and writes than shown in the algorithm. The algorithm is shown with explicit cache (fast memory) management. The block algorithm is a generalization of the standard matrix-matrix multiplication algorithm for scalars. Reordering the loops produces different variants of the algorithm. The loop ordering we present benefits from requiring the block $C_{ij}$ to be read and written only once. See [21, Section 1.3] for an introduction to block algorithms.

Algorithm 2.1: Block Matrix-Matrix Multiplication
Input: $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$
Output: $C \in \mathbb{R}^{m \times p}$, s.t. $C = C + AB$
Notes: $C_{ij}$, $A_{ik}$, and $B_{kj}$ are blocks, each of size $b \times b$.
1  $b = \sqrt{M/3}$, $m_b = \lceil m/b \rceil$, $n_b = \lceil n/b \rceil$, $p_b = \lceil p/b \rceil$
2  for $i = 1 : m_b$ do
3    for $j = 1 : p_b$ do
4      Load $C_{ij}$
5      for $k = 1 : n_b$ do
6        Load $A_{ik}$, $B_{kj}$
7        $C_{ij} = C_{ij} + A_{ik} B_{kj}$
8        Delete $A_{ik}$, $B_{kj}$
9      Write $C_{ij}$

Theorem 2.10. Let $M$ be the size of fast memory. Let $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$. The total number of words, $W_{BMM}(m,n,p)$, and the total number of messages, $L_{BMM}(m,n,p)$, transferred between main memory and fast memory by Algorithm 2.1 satisfy
\[
W_{BMM}(m,n,p) \le 2\sqrt{3}\,\frac{mnp}{\sqrt{M}} + O(mp), \tag{2.6}
\]
\[
L_{BMM}(m,n,p) \le 3\sqrt{3}\,\frac{mnp}{M^{3/2}} + O(mp). \tag{2.7}
\]


Proof. The total volume transferred between fast and slow memory in Algorithm 2.1 is
\[
W(m,n,p) = 2 m_b p_b b^2 + 2 m_b n_b p_b b^2
= 2mp + \frac{2mnp}{b}
= 2mp + \frac{2mnp}{\sqrt{M/3}}.
\]
The total number of messages transferred between fast and slow memory in Algorithm 2.1 is
\[
L(m,n,p) = 2 m_b p_b + m_b n_b p_b
= \frac{2mp}{b^2} + \frac{mnp}{b^3}
= \frac{2mp}{M/3} + \frac{mnp}{(M/3)^{3/2}}.
\]

Algorithm 2.1 is a factor of $2\sqrt{3}$ larger than the lower bound (2.2) for the number of words transferred and a factor of $3\sqrt{3}$ larger than the lower bound (2.3) for the number of messages transferred. Algorithm 2.1 includes a parameter (the block size) that must be tuned depending on the size of fast memory; see Line 1 of Algorithm 2.1 ($b = \sqrt{M/3}$). The next algorithm does not require any tuning to use the memory hierarchy optimally.

2.2.2.2 Cache Oblivious Recursive Algorithm

A cache oblivious algorithm is one that does not require specific knowledge of the size of fast memory. Recursive algorithms are usually cache oblivious. We present the cache oblivious algorithm for multiplying two rectangular matrices proposed and analyzed by Frigo et al. [18]. Frigo et al. analyze the algorithm in terms of cache lines, where the cache is organized into cache lines of size $L$ and there are $M/L$ cache lines. When a cache miss occurs (the data requested is not in cache), an entire cache line is discarded and replaced by the needed data. For consistency with the rest of our analysis, we present the communication requirements ignoring cache lines.
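Before analyzing the recursive algorithm, the explicit blocking of Algorithm 2.1 can be made concrete. The following Python sketch is our illustration (it is not code from the thesis): the Load/Write steps of the pseudo-code correspond to the block of entries touched in each iteration, and each block of `C` is fully accumulated before the loops move on.

```python
def block_matmul(A, B, C, b):
    """Blocked C = C + A*B with b-by-b blocks, using Algorithm 2.1's i, j, k order.

    A (m x n), B (n x p), C (m x p) are row-major lists of lists.
    Each block of C is visited once per (i0, j0) pair, mirroring the single
    Load/Write of C_ij in the pseudo-code.
    """
    m, n, p = len(A), len(B), len(B[0])
    for i0 in range(0, m, b):              # block row of C
        for j0 in range(0, p, b):          # block column of C
            for k0 in range(0, n, b):      # inner (reduction) dimension
                for i in range(i0, min(i0 + b, m)):
                    for j in range(j0, min(j0 + b, p)):
                        s = C[i][j]
                        for k in range(k0, min(k0 + b, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

The `min(..., m)` guards handle matrix dimensions that are not multiples of the block size, which the pseudo-code leaves implicit via the ceilings $\lceil m/b \rceil$.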


Let $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$. If $m = n = p = 1$, then the recursive algorithm returns the scalar multiplication. Otherwise, if $m$ is the largest dimension, let
\[
A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix}, \quad \text{where } A_1 \in \mathbb{R}^{\lceil m/2 \rceil \times n}.
\]
Recursively multiply $C_1 = A_1 B$ and $C_2 = A_2 B$. If $n$ is the largest dimension, let
\[
A = \begin{pmatrix} A_1 & A_2 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix},
\]
where $A_1 \in \mathbb{R}^{m \times \lceil n/2 \rceil}$ and $B_1 \in \mathbb{R}^{\lceil n/2 \rceil \times p}$. Recursively multiply $C = A_1 B_1 + A_2 B_2$. If $p$ is the largest dimension, let
\[
B = \begin{pmatrix} B_1 & B_2 \end{pmatrix}, \quad \text{where } B_1 \in \mathbb{R}^{n \times \lceil p/2 \rceil}.
\]
Recursively multiply $C_1 = A B_1$ and $C_2 = A B_2$. Algorithm 2.2 gives the pseudo-code. Intuitively, the algorithm is communication optimal since, once the matrices fit in fast memory, the data for the remaining recursions remains in fast memory. In Theorem 2.11, we formally state and prove an upper bound for the number of words transferred by the recursive matrix-matrix multiplication algorithm. Our contribution is the analysis without considering cache lines.

Theorem 2.11. Let $M$ be the size of fast memory. Let $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$. The total number of words, $W_{RMM}(m,n,p)$, transferred between main memory and fast memory by Algorithm 2.2 satisfies
\[
W_{RMM}(m,n,p) \le 12\sqrt{3}\,\frac{mnp}{\sqrt{M}}. \tag{2.8}
\]


Algorithm 2.2: Recursive Matrix-Matrix Multiplication, recMM (Frigo et al., 1999 [18])
Input: $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$
Output: $C \in \mathbb{R}^{m \times p}$, s.t. $C = AB$
1  if $m = n = p = 1$ then
2    % scalar multiplication
3    $C = A \cdot B$
4  else if $m \ge n$ and $m \ge p$ then
5    $C_1 = \mathrm{recMM}(A_1, B)$
6    $C_2 = \mathrm{recMM}(A_2, B)$
7  else if $n > m$ and $n \ge p$ then
8    $C = \mathrm{recMM}(A_1, B_1) + \mathrm{recMM}(A_2, B_2)$
9  else
10   $C_1 = \mathrm{recMM}(A, B_1)$
11   $C_2 = \mathrm{recMM}(A, B_2)$

Proof. The total volume transferred satisfies the recurrence relation
\[
W_{RMM}(m,n,p) \le
\begin{cases}
mn + mp + np & \text{if } mn + mp + np \le M, \\
2\,W_{RMM}(m/2, n, p) & \text{otherwise, if } m \ge n \text{ and } m \ge p, \\
2\,W_{RMM}(m, n/2, p) + 3mp & \text{otherwise, if } n > m \text{ and } n \ge p, \\
2\,W_{RMM}(m, n, p/2) & \text{otherwise}.
\end{cases} \tag{2.9}
\]
The third case requires $3mp$ more memory transfers than just the two recursive multiplications. These transfers account for the addition of the two matrices: for matrix addition, at worst we must read the two summand matrices from slow memory and write $C$ back to slow memory. The base case is obtained when the three matrices fit into fast memory; the only transfers between fast and slow memory are to read $A$ and $B$ from slow memory and to write $C$ to slow memory. Let $k_1$, $k_2$, and $k_3$ be the smallest integers such that $m' = m/2^{k_1} \le \sqrt{M/3}$, $n' = n/2^{k_2} \le \sqrt{M/3}$, and $p' = p/2^{k_3} \le \sqrt{M/3}$. Equivalently, $k_1 = \lceil \log_2(\sqrt{3}\,m/\sqrt{M}) \rceil$, $k_2 = \lceil \log_2(\sqrt{3}\,n/\sqrt{M}) \rceil$, and $k_3 = \lceil \log_2(\sqrt{3}\,p/\sqrt{M}) \rceil$. Matrices of size $m' \times n'$, $n' \times p'$, and $m' \times p'$ fit in fast memory, and the criterion for the base case in the recurrence relation is


satisfied. Iterating the recurrence relation $k_1 + k_2 + k_3$ times gives
\[
W_{RMM}(m,n,p) \le \left( \frac{mn}{2^{k_1+k_2}} + \frac{mp}{2^{k_1+k_3}} + \frac{np}{2^{k_2+k_3}} \right) 2^{k_1+k_2+k_3} + 3mp\,2^{k_2}
\]
\[
= mn\,2^{k_3} + mp\,2^{k_2} + np\,2^{k_1} + 3mp\,2^{k_2}
\le 6\sqrt{3}\,\frac{mnp}{\sqrt{M}} + 3mp\,\frac{2\sqrt{3}\,n}{\sqrt{M}}
= 12\sqrt{3}\,\frac{mnp}{\sqrt{M}}.
\]
We use the inequality $2^{k_1} = 2^{\lceil \log_2(\sqrt{3}\,m/\sqrt{M}) \rceil} \le 2\sqrt{3}\,m/\sqrt{M}$ and similar inequalities for $k_2$ and $k_3$.

Algorithm 2.2 transfers at most $12\sqrt{3}$ times more words than the lower bound. The bound for the recursive algorithm is 6 times greater than the bound for the block matrix-matrix multiplication algorithm. However, the recursive algorithm has the benefit that no tuning depending on the size of fast memory is required. Also, the recursive algorithm is communication-optimal for multiple levels of memory: the same analysis used for the two-level memory problem can be applied to each level of memory, and the algorithm is communication-optimal at every level. The block algorithm would require multiple levels of blocking (inner blocking) to utilize each level of memory optimally.

Corollary 2.12. Let $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, and $C \in \mathbb{R}^{m \times p}$. Let there be $l$ levels of memory, each of size $M_k$, with $M_l = \infty$ (slow memory). An upper bound for the number of words transferred between level $k$ and $k+1$ to compute $C = AB$ with Algorithm 2.2 is
\[
W_k(m,n,p) \le 12\sqrt{3}\,\frac{mnp}{\sqrt{S_k}}, \tag{2.10}
\]
where $S_k = \sum_{i=1}^{k} M_i$.

One downfall of the recursive algorithm is that it computes $C = AB$. In many cases one is interested in computing $C = C + AB$. Using the recursive algorithm to first compute


$AB$ and then adding the result to $C$ requires storing an extra $m \times p$ matrix. This also impacts the amount of communication and computation, but the effect is negligible.

2.2.3 Communications in Linear Algebra

In this section, we review optimal algorithms for LU factorization with partial pivoting, Cholesky factorization, and QR factorization. We choose to focus on recursive algorithms for each factorization. Lower bounds for nearly all dense linear algebra computations were proven by Ballard et al. [1]. They showed that if $W$ scalar multiplications are performed by the algorithm, then the minimum number of words transferred between fast and slow memory is $W/\sqrt{M}$. We show that the following recursive algorithms are optimal. We first need to compute an upper bound for the number of words transferred by various kernels that are used in the algorithms.

2.2.3.1 Communications for Kernels

We need an upper bound for the number of words transferred for solving a triangular system of equations and for a general matrix-matrix multiplication. We obtain an upper bound by assuming a classic blocking algorithm for each kernel. Algorithm 2.3 shows the block algorithm for solving a triangular system with multiple right-hand sides. The block algorithm for matrix-matrix multiplication was previously shown in Algorithm 2.1.

The number of words communicated to solve an $m \times m$ triangular system of equations with $n$ right-hand sides, where the solution overwrites the right-hand side, is denoted $W_{TS}(m,n)$. The number of words communicated for multiplying an $m \times n$ matrix by an $n \times p$ matrix and adding the result to an $m \times p$ matrix is denoted $W_{MM}(m,n,p)$.

From Algorithm 2.3, it is straightforward to show that
\[
W_{TS}(m,n) \le
\begin{cases}
\dfrac{m^2}{2} + 2mn & \text{if } m \le \sqrt{M/3}, \\[4pt]
\sqrt{3}\,\dfrac{nm^2}{\sqrt{M}} + nm & \text{if } m > \sqrt{M/3}.
\end{cases} \tag{2.11}
\]
The bound for $m \le \sqrt{M/3}$ is obtained since $m_b = 1$ and the inner loop in Algorithm 2.3 is removed. For matrix-matrix multiplication, we only need the special case when $B$ is


Algorithm2.3: BlockTriangularSolve Input : A 2 R m m ,and B 2 R m n A isuppertriangular. Output : X 2 R m n ; s.t. AX = B X overwrites B Notes : B ik and A ij areblockseachofsize b b 1 b = p M= 3 n b = d n=b e m b = d m=b e 2 for k =1 n b do 3 for i = m b & 1 do 4 Load B ik 5 for j = m b & i +1 do 6 %Load A ij B jk 7 B ik = B ik )]TJ/F22 11.9552 Tf 11.955 0 Td [(A ij B jk 8 %Delete A ij B jk 9 Load A ii 10 B ik = A )]TJ/F20 7.9701 Tf 6.586 0 Td [(1 ii B ik 11 Write B ik square.Itisstraightforwardtoshowthat W MM m;n;n 8 > > < > > : 3 mn + n 2 if n p M= 3 ; 2 p 3 mn 2 p M +2 mn if n> p M= 3 : .12 2.2.3.2RecursiveLUwithPartialPivoting Algorithm2.4givesthepseudo-codeforrecursiveLUfactorizationwithpartialpivotingpresentedbyToledo[48].Toledoshowstherecursivealgorithmisoptimalinterms ofthenumberofwordstransferred.Wepresentourownanalysisandobtainthesame boundswithonlyslightlydifferentconstantsthanthoseobtainedbyToledo.Ouranalysis isconsistentwiththeanalysisusedfortheotheralgorithmsinthisthesis. Theorem2.13. Let A 2 R n n .Let M bethesizeoffastmemory.Thenumberofwords, W RLU ,transferredbetweenfastandslowmemorybyAlgorithm2.4satisfy W RLU n;n 2 p 3 3 n 3 p M + 11 4 n 2 log 2 n + O n 2 .13 Proof. Initiallyweignorethecommunicationsrequiredforthepermutations.Thenumber 22


Algorithm 2.4: Recursive LU Factorization with Partial Pivoting, recLU (Toledo, 1997 [48])
Input: $A \in \mathbb{R}^{m \times n}$
Output: $L \in \mathbb{R}^{m \times n}$, $U \in \mathbb{R}^{n \times n}$, and $P$; $L$ is lower trapezoidal, $U$ is upper triangular, and $P$ is a permutation matrix such that $PA = LU$.
1  if $n = 1$ then
2    % pivot and scale.
3    $x = \arg\max_i |A_{i1}|$
4    $P$ is the permutation matrix corresponding to exchanging row 1 and row $x$
5    $\tilde{A} = PA$
6    $U = \tilde{A}_{11}$; $L = \tilde{A}/U$
7  else
8    Let $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$, where $A_{11} \in \mathbb{R}^{\lceil n/2 \rceil \times \lceil n/2 \rceil}$
9    $\left[ \begin{pmatrix} L_{11} \\ L_{21} \end{pmatrix}, U_{11}, P_1 \right] = \mathrm{recLU}\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix}$
10   $\begin{pmatrix} \tilde{A}_{12} \\ \tilde{A}_{22} \end{pmatrix} = P_1 \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix}$
11   $U_{12} = L_{11}^{-1} \tilde{A}_{12}$
12   $A'_{22} = \tilde{A}_{22} - L_{21} U_{12}$
13   $[L_{22}, U_{22}, P_2] = \mathrm{recLU}(A'_{22})$
14   $\tilde{L}_{21} = P_2 L_{21}$
15   $P = \begin{pmatrix} I & 0 \\ 0 & P_2 \end{pmatrix} P_1$, $L = \begin{pmatrix} L_{11} & 0 \\ \tilde{L}_{21} & L_{22} \end{pmatrix}$, $U = \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix}$

of words transferred for the rest of the operations satisfies the recursion relation
\[
W_{RLU}(m,n) =
\begin{cases}
2m & \text{if } n = 1, \\
W_{RLU}(m, n/2) + W_{TS}(n/2, n/2) + W_{MM}(m - n/2,\, n/2,\, n/2) \\
\qquad + W_{RLU}(m - n/2,\, n/2) & \text{otherwise}.
\end{cases} \tag{2.14}
\]
Let $b = \sqrt{M/3}$, $k = \log_2 n$, and $l = \log_2(n/b)$. There are $k$ levels in the recursion. At each level, the number of recursive calls doubles (two calls to recLU). At level $i$, there are $2^{i-1}$ triangular systems to solve, each of order $n/2^i$. Hence, the number of words communicated


due to solving triangular systems is less than or equal to
\[
\sum_{i=1}^{k} 2^{i-1}\, W_{TS}\!\left(\frac{n}{2^i}, \frac{n}{2^i}\right) \tag{2.15}
\]
\[
= \sum_{i=1}^{l-1} 2^{i-1} \left( \frac{1}{b}\left(\frac{n}{2^i}\right)^3 + \left(\frac{n}{2^i}\right)^2 \right) + \sum_{i=l}^{k} 2^{i-1}\,\frac{5}{2} \left(\frac{n}{2^i}\right)^2 \tag{2.16}
\]
\[
= \sum_{i=1}^{l-1} \left( \frac{n^3}{2b}\,\frac{1}{4^i} + \frac{n^2}{2}\,\frac{1}{2^i} \right) + \sum_{i=l}^{k} \frac{5n^2}{4}\,\frac{1}{2^i} \tag{2.17}
\]
\[
= \frac{n^3}{2b} \sum_{i=1}^{l-1} \frac{1}{4^i} + \frac{n^2}{2} \sum_{i=1}^{l-1} \frac{1}{2^i} + \frac{5n^2}{4} \sum_{i=l}^{k} \frac{1}{2^i} \tag{2.18}
\]
\[
\le \frac{n^3}{6b} + \frac{1}{2}\,n^2. \tag{2.19}
\]
The summations in (2.18) are bounded by the sums of the corresponding geometric series to obtain the final bound.

At level $i-1$, there are $2^{i-1}$ calls to recLU. The number of words communicated for each call is $W_{RLU}(n - jn/2^{i-1},\, n/2^{i-1})$, for $j = 0, \ldots, 2^{i-1} - 1$. Each call spawns a matrix-matrix multiplication at level $i$ with $W_{MM}(n - jn/2^{i-1} - n/2^i,\, n/2^i,\, n/2^i)$ words communicated. Hence, the number of words communicated due to the matrix-matrix multiplication updates is at most
\[
\sum_{i=1}^{k} \sum_{j=1}^{2^{i-1}} W_{MM}\!\left( n - (j-1)\frac{n}{2^{i-1}},\; \frac{n}{2^i},\; \frac{n}{2^i} \right). \tag{2.20}
\]
Expanding $W_{MM}$ with (2.12), splitting the outer sum at $i = l$, and bounding the resulting geometric series (using the equality $\sum_{j=1}^{2^{i-1}} (j-1) = 2^{i-2}(2^{i-1} - 1)$) gives, after simplification,
\[
\sum_{i=1}^{k} \sum_{j=1}^{2^{i-1}} W_{MM}\!\left( n - (j-1)\frac{n}{2^{i-1}},\; \frac{n}{2^i},\; \frac{n}{2^i} \right)
\le \frac{1}{2}\,\frac{n^3}{b} + \frac{3}{4}\,n^2 k + \frac{3}{4}\,n^2. \tag{2.26}
\]


At level $k$, there are $n$ calls to the base case of recLU. Each one corresponds to the LU factorization of a column of an updated submatrix. The number of words communicated for the base cases is at most
\[
\sum_{i=0}^{n-1} W_{RLU}(n - i,\, 1) = \sum_{i=0}^{n-1} 2(n - i) = n^2 + n. \tag{2.27}
\]
Adding together (2.19), (2.26), and (2.27) gives
\[
W_{RLU}(n,n) \le \frac{2}{3}\,\frac{n^3}{\sqrt{M/3}} + \frac{3}{4}\,n^2 \log_2 n + \frac{9}{4}\,n^2 + n.
\]
Finally, we consider the communications due to the permutations. To permute a column, at worst the column must be read from slow memory and written back after the exchange. Columns are permuted in Algorithm 2.4 either in the base case or in Line 10 and Line 14. At level $i$ there are $2^{i-1}$ calls to recLU, and each call permutes $n/2^{i-1}$ columns. The total number of columns permuted is
\[
\sum_{i=1}^{k+1} 2^{i-1}\,\frac{n}{2^{i-1}} = n(k+1).
\]


At worst, $2n$ words are communicated for each column permuted; therefore the total number of words communicated due to permutations is at most $2n^2(\log_2 n + 1)$.

The recursive LU factorization algorithm is optimal in terms of the number of words transferred. The recursion for recLU only occurs on the entire columns of $A$. This is because we require partial pivoting for the stability of the computation.

2.2.3.3 Recursive Cholesky

When $A$ is symmetric positive definite, a Cholesky factorization can be used. Cholesky factorization does not require pivoting for stability. Algorithm 2.5 shows a recursive algorithm for computing the Cholesky factorization. Since there is no pivoting requirement, we can define the recursion on the first diagonal block of $A$, as opposed to block columns for LU factorization. The base case is the Cholesky factorization of a scalar, which is simply computing the square root of a scalar. Updating the matrix consists of solving a triangular system of equations and a symmetric rank-$k$ update (SYRK). Finally, the Cholesky factorization of the updated trailing submatrix is computed by the recursive algorithm.

Gustavson [23] compared the recursive algorithm to the left-looking and right-looking algorithms. Gustavson did not analyze the communications of the algorithms. Rather, he showed that the recursive algorithm benefits from performing all the operations on square submatrices; the left-looking and right-looking algorithms require the majority of the operations to be performed on rectangular submatrices.

Theorem 2.14. Let $A \in \mathbb{R}^{n \times n}$. Let $M$ be the size of fast memory. The number of words, $W_{RCH}$, communicated between fast and slow memory by Algorithm 2.5 satisfies
\[
W_{RCH}(n,n) \le \frac{\sqrt{3}}{3}\,\frac{n^3}{\sqrt{M}} + \frac{3}{2}\,n^2 + 2n. \tag{2.28}
\]


Algorithm 2.5: Recursive Cholesky, recChol
Input: $A \in \mathbb{R}^{n \times n}$; $A$ is symmetric positive definite.
Output: $L \in \mathbb{R}^{n \times n}$; $L$ is lower triangular such that $A = LL^T$.
1  if $n = 1$ then
2    $L = \sqrt{A}$
3  else
4    Let $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$, where $A_{11} \in \mathbb{R}^{\lceil n/2 \rceil \times \lceil n/2 \rceil}$
5    $L_{11} = \mathrm{recChol}(A_{11})$
6    $L_{21} = A_{21} L_{11}^{-T}$
7    $\tilde{A}_{22} = A_{22} - L_{21} L_{21}^T$
8    $L_{22} = \mathrm{recChol}(\tilde{A}_{22})$

Proof. The number of words communicated satisfies the recursion relation
\[
W_{RCH}(n) =
\begin{cases}
2 & \text{if } n = 1, \\
2\,W_{RCH}(n/2) + W_{TS}(n/2, n/2) + W_{SYRK}(n/2, n/2, n/2) & \text{otherwise},
\end{cases} \tag{2.29}
\]
where $W_{SYRK}$ is the number of words transferred for a symmetric rank-$k$ update. Let $b = \sqrt{M/3}$, $k = \log_2 n$, and $l = \log_2(n/b)$. There are $k$ levels in the recursion. Each level of recursion doubles the number of recursive calls to the algorithm, and the size of the problem is halved (the two recursive calls have the same order). At level $i$, there are $2^{i-1}$ triangular systems to solve, each of order $n/2^i$. The number of words communicated for solving the triangular systems is the same as it was for recLU and is given in (2.19). There are also $2^{i-1}$ symmetric rank-$n/2^i$ updates at level $i$ of the recursion. The number of words transferred for each update is half of that for a matrix-matrix multiplication, since we only have to compute the lower triangular part of the matrix. The number of words transferred due to the symmetric rank-$n/2^i$ updates is at most


\[
\sum_{i=1}^{k} 2^{i-1}\, W_{SYRK}\!\left( \frac{n}{2^i}, \frac{n}{2^i}, \frac{n}{2^i} \right) \tag{2.30}
\]
\[
= \sum_{i=1}^{l-1} 2^{i-1}\,\frac{1}{b}\left(\frac{n}{2^i}\right)^3 + \sum_{i=1}^{l-1} 2^{i-1} \left(\frac{n}{2^i}\right)^2 + \sum_{i=l}^{k} 2^{i-1}\cdot 2 \left(\frac{n}{2^i}\right)^2 \tag{2.31}
\]
\[
= \frac{n^3}{2b} \sum_{i=1}^{l-1} \frac{1}{4^i} + \frac{n^2}{2} \sum_{i=1}^{l-1} \frac{1}{2^i} + n^2 \sum_{i=l}^{k} \frac{1}{2^i} \tag{2.32}
\]
\[
\le \frac{n^3}{6b} + n^2, \tag{2.34}
\]
where the summations are again bounded by the sums of the corresponding geometric series.

At level $k$, there are $n$ base cases, and each communicates at most 2 words (read a scalar from slow memory and write a scalar to slow memory). Therefore, at most $2n$ words are communicated for the base cases. Adding this to (2.19) and (2.34) gives the bound.

Comparing Theorem 2.13 and Theorem 2.14 shows that the Cholesky algorithm is able to eliminate the $n^2 \log_2 n$ term from the bound.

2.2.3.4 Recursive QR

We next consider a recursive algorithm by Elmroth and Gustavson [16] to compute the QR factorization of a matrix. Algorithm 2.6 gives the pseudo-code. The algorithm computes the $R$ factor of a rectangular matrix, and the $Q$ factor is stored in compacted form: $Q = I - YTY^T$, where $Y$ is lower trapezoidal and $T$ is lower triangular. This is known as the compact WY representation. For further details see [43].
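Before turning to QR, the recursion of Algorithm 2.5 is compact enough to sketch directly. The following Python version is our illustration, not the thesis's implementation: the base case takes a scalar square root, $L_{21}$ is obtained by forward substitution against $L_{11}$ (the triangular solve), and the trailing block is updated symmetrically before the recursive call.

```python
import math

def rec_chol(A):
    """Recursive Cholesky in the spirit of Algorithm 2.5: A = L L^T, A SPD.

    A is a row-major list of lists; returns the lower triangular factor L.
    """
    n = len(A)
    if n == 1:                              # base case: scalar square root
        return [[math.sqrt(A[0][0])]]
    h = (n + 1) // 2                        # split at ceil(n/2), as in line 4
    A11 = [row[:h] for row in A[:h]]
    A21 = [row[:h] for row in A[h:]]
    A22 = [row[h:] for row in A[h:]]
    L11 = rec_chol(A11)                     # line 5: L11 = recChol(A11)
    # Line 6: L21 = A21 * L11^{-T}, row-wise forward substitution against L11.
    L21 = []
    for r in A21:
        x = [0.0] * h
        for j in range(h):
            x[j] = (r[j] - sum(x[t] * L11[j][t] for t in range(j))) / L11[j][j]
        L21.append(x)
    # Line 7: symmetric update of the trailing block, A22 - L21 L21^T.
    for i in range(n - h):
        for j in range(n - h):
            A22[i][j] -= sum(L21[i][t] * L21[j][t] for t in range(h))
    L22 = rec_chol(A22)                     # line 8: recurse on the update
    # Assemble L = [[L11, 0], [L21, L22]].
    L = [L11[i] + [0.0] * (n - h) for i in range(h)]
    L += [L21[i] + L22[i] for i in range(n - h)]
    return L
```

In a communication-aware implementation, the forward substitution and the symmetric update would be the blocked kernels of Algorithm 2.3 and SYRK; here they are written as plain loops to keep the recursion structure visible.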


Algorithm 2.6: Recursive QR (Elmroth/Gustavson), recQR
Input: $A \in \mathbb{R}^{m \times n}$
Output: $Y \in \mathbb{R}^{m \times n}$, $T \in \mathbb{R}^{n \times n}$, and $R \in \mathbb{R}^{n \times n}$; $Y$ is lower trapezoidal, $T$ is lower triangular, and $R$ is upper triangular such that $(I - YTY^T)^T A = \begin{pmatrix} R \\ 0 \end{pmatrix}$.
1  if $n = 1$ then
2    $Y = A + \operatorname{sign}(A_{11})\,\|A\|_2\, e_1$
3    $T = 2/(Y^T Y)$
4    $R = -\operatorname{sign}(A_{11})\,\|A\|_2$
5  else
6    Let $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$, where $A_{11} \in \mathbb{R}^{\lceil n/2 \rceil \times \lceil n/2 \rceil}$
7    $\left[ \begin{pmatrix} Y_{11} \\ Y_{21} \end{pmatrix}, T_{11}, R_{11} \right] = \mathrm{recQR}\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix}$
8    $\begin{pmatrix} R_{12} \\ \tilde{A}_{22} \end{pmatrix} = \left( I - \begin{pmatrix} Y_{11} \\ Y_{21} \end{pmatrix} T_{11} \begin{pmatrix} Y_{11}^T & Y_{21}^T \end{pmatrix} \right) \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix}$
9    $[Y_{22}, T_{22}, R_{22}] = \mathrm{recQR}(\tilde{A}_{22})$
10   $T_{21} = -T_{22}\,(Y_{22}^T Y_{21})\,T_{11}$
11   $T = \begin{pmatrix} T_{11} & 0 \\ T_{21} & T_{22} \end{pmatrix}$, $Y = \begin{pmatrix} Y_{11} & 0 \\ Y_{21} & Y_{22} \end{pmatrix}$, $R = \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix}$

Computing $T$ requires extra flops compared to the standard Householder algorithm. Elmroth and Gustavson suggest a hybrid algorithm which applies the recursive algorithm on a panel and updates the rest of the matrix using a right-looking algorithm. The hybrid algorithm reduces the computation, since only a small $T$ is computed for each panel. We consider the communications of the full recursive algorithm only.

Theorem 2.15. Let $A \in \mathbb{R}^{n \times n}$. Let $M$ be the size of fast memory. The number of words, $W_{RQR}$, communicated between fast and slow memory by Algorithm 2.6 satisfies
\[
W_{RQR}(n,n) \le \frac{10}{3}\,\frac{n^3}{\sqrt{M/3}} + \frac{5}{2}\,n^2 \log_2 n + 10\,n^2. \tag{2.35}
\]


Proof. The number of words communicated satisfies the recursion relation
\[
W_{RQR}(m,n) =
\begin{cases}
2m + 1 & \text{if } n = 1, \\
W_{RQR}(m, n/2) + W_{RQR}(m - n/2,\, n/2) + 4\,W_{MM}(n/2, n/2, n/2) \\
\qquad + W_{MM}(m, n/2, n/2) + W_{MM}(n/2, m, n/2) & \text{otherwise}.
\end{cases} \tag{2.36}
\]
Let $b = \sqrt{M/3}$, $k = \log_2 n$, and $l = \log_2(n/b)$. There are $k$ levels in the recursion. All of the updates are matrix-matrix multiplications of different sizes. At level $i$, there are $4 \cdot 2^{i-1}$ square matrix-matrix multiplications of order $n/2^i$, since there are four per update (one in Line 8 and three in Line 10). The number of words transferred due to the square matrix-matrix multiplication updates is 4 times what was calculated in (2.34).

The recursions for the other two matrix-matrix multiplications are similar to the recursive matrix-matrix multiplication algorithm, except that the largest dimension does not decrease by $n/2$ for the update. For $n > \sqrt{M/3}$, $W_{MM}(n/2, m, n/2) = W_{MM}(m, n/2, n/2)$. For $n \le \sqrt{M/3}$, the two differ only in lower-order terms: the first must read two $m \times n/2$ matrices and write one, whereas the second must only read two $m \times n/2$ matrices. The number of words communicated by the remaining two matrix-matrix multiplications is less than or equal to
\[
\sum_{i=1}^{k} \sum_{j=1}^{2^{i-1}} \left[ W_{MM}\!\left( n - (j-1)\frac{n}{2^{i-1}},\, \frac{n}{2^i},\, \frac{n}{2^i} \right) + W_{MM}\!\left( \frac{n}{2^i},\, n - (j-1)\frac{n}{2^{i-1}},\, \frac{n}{2^i} \right) \right]. \tag{2.37}
\]


Expanding $W_{MM}$ and bounding the geometric series as before gives
\[
\le \frac{2n^3}{b} + 2n^2(l-1) + \frac{5}{2}\,n^2(k-l+1) + n^2 \tag{2.39}
\]
\[
\le \frac{2n^3}{b} + \frac{5}{2}\,n^2 k + n^2. \tag{2.40}
\]
At level $k$, there are $n$ base cases. The size of the column for each base case is $n - i$, for $i = 0, \ldots, n-1$. The number of communications due to the base cases is at most $n^2$. Summing 4 times (2.34), (2.40), and $n^2$ gives the desired bound.

Another recursive QR algorithm is given by Frens and Wise [17]. Their algorithm is defined for square matrices and returns the $R$ factor and the $Q$ factor explicitly. The algorithm handles rectangular matrices by padding the matrix with zeros. Special kernels are used to avoid extra computation for the added zero entries. We do not analyze the Frens and Wise algorithm at this time.

2.3 Communications in Distributed Memory Algorithms

In 2004, Irony et al. [28] provided a new proof of the result of Hong and Kung [27] (Theorem 2.2) and extended the lower bounds for matrix-matrix multiplication to the parallel distributed case. The parallel distributed bound assumes the processors store only the minimum amount of data, distributed among the processors. The bound does not apply to algorithms which store multiple copies of $A$ and $B$ (i.e., it assumes no replication).

Theorem 2.16 (Parallel matrix-matrix multiplication lower bound, Irony et al., 2004 [28]). Consider the ordinary matrix-matrix multiplication algorithm for multiplying $m \times n$ and $n \times p$ matrices on a $P$-processor distributed memory parallel computer with $M$ words of local memory per processor. The total number of words that are sent and received by at least one processor is at least
\[
\frac{mnp}{2\sqrt{2}\,P\sqrt{M}} - M. \tag{2.41}
\]

For square matrices, in the memory bounded case we have $M = O(n^2/P)$. Using this with (2.41) shows that the number of words sent and received by at least one processor is $\Omega(n^2/\sqrt{P})$.


2.3.1 Lower Bounds for Matrix-Matrix Multiplication

In the distributed memory model, each processor has a local memory of size $M$, there are $P$ processors, and the elements of $A$ and $B$ are distributed among the processors. At least one copy of $A$, $B$, and $C$ must be stored, so $M \ge (mn + np + mp)/P$. The following is the set of possible instructions:

1. Send data to another processor;
2. Receive data from another processor;
3. Compute (create) $c_{ijk}$;
4. Delete data from a processor.

In this model, data can only be deleted if another copy is stored elsewhere, since a copy of all three matrices must be stored at the end of the computation. We will consider the number of sends and receives by a single processor. We again split the set of instructions into segments of exactly $M$ sends and receives. Let $M_x$ be the number of elements of $x \in \{A, B, C\}$ stored on the processor at the start of the segment. Let $R_x$ be the number of elements of $x \in \{A, B, C\}$ received from another processor during the segment. Let $S_x$ be the number of elements of $x \in \{A, B, C\}$ sent from the processor during the segment. Let $N_x$ be the number of elements of $x \in \{A, B, C\}$ stored on the processor at the end of the segment.

The maximum number of elements of $A$ on the processor during a segment is at most the number of elements on the processor at the beginning of the segment plus the number of elements received during the segment. The same is true for $B$. The number of elements of $C$ that are created during a segment is at most the number of elements stored on the processor at the end of the segment plus the number of elements sent to another processor. By the Loomis-Whitney inequality (Theorem 2.5), the maximum number of elementary


multiplications that can be performed is
\[
\max \sqrt{(M_a + R_a)(M_b + R_b)(N_c + S_c)},
\]
subject to
\[
R_a + R_b + R_c + S_a + S_b + S_c = M, \qquad
M_a + M_b + M_c \le M, \qquad
N_a + N_b + N_c \le M.
\]
All the variables are nonnegative, and the objective function is an increasing function. Therefore, the objective function will be maximized when the variables that do not appear in the objective function are zero and the inequality constraints are active. Hence, if we take $R_c = S_a = S_b = M_c = N_a = N_b = 0$, then the maximum number of elementary multiplications that can be performed is
\[
\max \sqrt{(M_a + R_a)(M_b + R_b)(N_c + S_c)},
\]
subject to
\[
R_a + R_b + S_c = M, \qquad M_a + M_b = M, \qquad N_c = M.
\]
The optimization problem is the same as the optimization problem obtained in the sequential case (see Section 2.2.1). Hence, the maximum number of elementary multiplications during a segment is $\sqrt{M^3}$.

At least one processor must perform at least $mnp/P$ elementary multiplications. Therefore, a lower bound on the number of words sent and received by this processor is
\[
\frac{mnp}{P\sqrt{M}} - M.
\]
By Theorem 2.4, a lower bound for the number of messages sent or received by at least one processor is obtained. We state these lower bounds in Theorem 2.17. For the memory


bounded case, $M = cn^2/P$ for some constant $c$, we obtain the results in Corollary 2.18.

Theorem 2.17. Let $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, and $C \in \mathbb{R}^{m \times p}$. Consider the ordinary matrix-matrix multiplication algorithm to compute $C = AB$ on a $P$-processor distributed memory parallel computer with $M$ words of local memory per processor. The total number of words, $W(m,n,p)$, that are sent and received by at least one processor is at least
\[
W(m,n,p) \ge \frac{mnp}{P\sqrt{M}} - M.
\]
The total number of messages, $L(m,n,p)$, that are sent and received by at least one processor is at least
\[
L(m,n,p) \ge \frac{mnp}{P M^{3/2}} - 1.
\]

Corollary 2.18. Let $A, B, C \in \mathbb{R}^{n \times n}$. Consider the ordinary matrix-matrix multiplication algorithm to compute $C = AB$ on a $P$-processor distributed memory parallel computer with $M = cn^2/P$ words of local memory per processor. The total number of words, $W(n)$, that are sent and received by at least one processor is at least
\[
W(n) \ge \frac{1}{\sqrt{c}}\,\frac{n^2}{\sqrt{P}} - c\,\frac{n^2}{P}.
\]
The total number of messages, $L(n)$, that are sent and received by at least one processor is at least
\[
L(n) \ge \frac{\sqrt{P}}{c^{3/2}} - 1.
\]

Proof. Substitute $M = cn^2/P$ into the equations in Theorem 2.17.

2.3.2 Optimal Matrix-Matrix Multiplication Algorithms

In this section, we restrict our discussion of optimal algorithms to square $n \times n$ matrices. The PUMMA (Parallel Universal Matrix Multiplication Algorithm) algorithm [13] and the SUMMA (Scalable Universal Matrix Multiplication Algorithm) algorithm [51] are two


[Figure 2.1: Snapshot of the operations needed in Equation (2.42). $A_{ik}$ and $B_{kj}$ must be broadcast to processor $(i,j)$.]

parallel distributed matrix-matrix multiplication algorithms. Both algorithms distribute all three matrices across the same rectangular grid. We will assume a square processing grid ($\sqrt{P} \times \sqrt{P}$). This allows for an easier comparison between the algorithm and the lower bound. The processors are indexed by their row and column index. Let $b$ be the block size, where $1 \le b \le n/\sqrt{P}$. The sub-matrix $C_{ij}$, located on processor $(i,j)$, can be computed by a series of rank-$b$ updates. That is,
\[
C_{ij} = C_{ij} + \sum_{k=1}^{n/b} A_{ik} B_{kj}. \tag{2.42}
\]
$A_{ik}$ and $B_{kj}$ are sent to processor $(i,j)$ for the computation. $A_{ik}$ contributes to the calculation of $C_{ij}$ for all $j$. Hence, $A_{ik}$ must be sent to all processors in row $i$. Similarly, $B_{kj}$ must be sent to all processors in column $j$. Figure 2.1 shows a snapshot of the required operations. The shaded rectangles are $n/\sqrt{P} \times b$ and $b \times n/\sqrt{P}$ matrices; $A_{ik}$ is sent to all processors in row $i$, and $B_{kj}$ is sent to all processors in column $j$.

The time for the PUMMA algorithm depends on the broadcast algorithm used to communicate the data within a processor row or column. The SUMMA algorithm uses a pipelined broadcast algorithm. Algorithm 2.7 shows the pseudo-code for the SUMMA algorithm.

We will analyze the communication requirements for the SUMMA algorithm. Each send and receive communicates $nb/\sqrt{P}$ words. The first block of $A$ will reach the last processor


Algorithm 2.7: SUMMA [51]
Input: $A, B, C \in \mathbb{R}^{n \times n}$
Output: $C \in \mathbb{R}^{n \times n}$, s.t. $C = C + AB$
Notes: Assume a square processing grid ($\sqrt{P} \times \sqrt{P}$); $b$ is the block size; $A_{ik}$ is an $n/\sqrt{P} \times b$ block and $B_{kj}$ is a $b \times n/\sqrt{P}$ block.
1  % Let $(i,j)$ be the local processor's id.
2  for $k = 1 : n/b$ do
3    % If $A_{ik}$ is not local, receive $A_{ik}$ from $(i, \operatorname{mod}(j-1, \sqrt{P}))$
4    % If $B_{kj}$ is not local, receive $B_{kj}$ from $(\operatorname{mod}(i-1, \sqrt{P}), j)$
5    $C_{ij} = C_{ij} + A_{ik} B_{kj}$
6    % Send $A_{ik}$ to $(i, \operatorname{mod}(j+1, \sqrt{P}))$
7    % Send $B_{kj}$ to $(\operatorname{mod}(i+1, \sqrt{P}), j)$

in a row after $\sqrt{P} - 1$ steps. This is called filling the pipe. Similarly, the first block of $B$ will reach the last processor in a column after $\sqrt{P} - 1$ steps. The time for all processors to receive their first message is
\[
2(\sqrt{P} - 1)\left( \alpha + \beta\,\frac{nb}{\sqrt{P}} \right),
\]
where $\alpha$ denotes the per-message latency, $\beta$ the per-word transfer time, and $\gamma$ (below) the per-flop compute time. The time for the next $n/b$ simultaneous sends and receives is
\[
2\,\frac{n}{b}\left( \alpha + \beta\,\frac{nb}{\sqrt{P}} \right). \tag{2.43}
\]
The remaining time (ignoring computation) is for the final block to reach the final processor. The time in (2.43) includes sending the last block once, so the remaining time (ignoring computation) is
\[
2(\sqrt{P} - 2)\left( \alpha + \beta\,\frac{nb}{\sqrt{P}} \right).
\]
Finally, the computations are perfectly distributed among all processors. Hence, the total time for SUMMA (Algorithm 2.7) is
\[
\left( 4\sqrt{P} + 2\,\frac{n}{b} - 6 \right) \alpha + \left( \frac{2n^2}{\sqrt{P}} + 4nb - \frac{6nb}{\sqrt{P}} \right) \beta + \frac{2n^3}{P}\,\gamma. \tag{2.44}
\]
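The timing analysis above concerns communication only. As a sanity check that the rank-$b$ sweep of Equation (2.42) reproduces the product, the following serial Python simulation (our illustration; a real SUMMA distributes the blocks with pipelined broadcasts, e.g., over MPI) walks a virtual $\sqrt{P} \times \sqrt{P}$ grid of block owners:

```python
import math

def summa_serial(A, B, P, b):
    """Serially simulate SUMMA's rank-b sweep (Eq. 2.42) on a sqrt(P) x sqrt(P) grid.

    Each grid cell (gi, gj) owns an (n/sqrt(P)) x (n/sqrt(P)) block of C and, at
    step k0, applies the rank-b update from the A strip in its row and the B strip
    in its column. Assumes sqrt(P) divides n. A, B are lists of lists.
    """
    n = len(A)
    q = math.isqrt(P)                           # grid dimension
    w = n // q                                  # width of one owned block
    C = [[0.0] * n for _ in range(n)]
    for k0 in range(0, n, b):                   # the n/b pipelined steps
        for gi in range(q):                     # grid row
            for gj in range(q):                 # grid column
                for i in range(gi * w, (gi + 1) * w):
                    for j in range(gj * w, (gj + 1) * w):
                        C[i][j] += sum(A[i][k] * B[k][j]
                                       for k in range(k0, min(k0 + b, n)))
    return C
```

Because every owner applies every one of the $n/b$ updates, the result equals $AB$ regardless of the block size; only the communication schedule (not simulated here) changes with $b$.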

PAGE 48

TheSUMMAalgorithmisoptimalintermsofthecomputationssincethecomputations areperfectlydistributedamongalltheprocessors.TheSUMMAalgorithmrequires M = 3 n 2 P +2 nb p P localmemoryperprocessor.Inthememoryboundedcase,thealgorithmwill becommunication-optimalifthenumberofwordstransferredandnumberofmessages transferredarewithinaconstantfactorofthelowerboundsstatedinCorollary2.18.The timefortheSUMMAalgorithmdependsontheblocksize 1 b n= p P .For b = n= p P ,thenumberofwordstransferredisboundedaboveby 6 n 2 p P andthenumberof messagestransferredisboundedaboveby 6 p P .Hence,for b = n= p P ,theSUMMA algorithmiscommunication-optimal. For b =1 ,thedominatingtermforthenumberofwordstransferedis 2 n 2 p P ,decreasing thequantitybyafactorof3.However,thenumberofwordstransferredisboundedby 4 p P +2 n .Hence,for b =1 ,theSUMMAalgorithmisnotcommunication-optimal intermsofthenumberofmessagestransferred.Thedeceaseinthebyafactorof3 forthenumberofwordstransferredissignicant.Anobviousquestioniswhatblock sizeminimizesthetotaltime.44ofthealgorithm?Theoptimalblocksizeisabout b opt = p = 2 AlthoughwepresentedtheSUMMAalgorithmforsquarematricesandsquareprocessinggrids,itcanbeadaptedforrectangularprocessinggridsandrectangularmatrices. AnotheroptimalalgorithmisbyCannon[7].However,Cannon'salgorithmisnotaseasily adaptedtorectangularprocessinggrids. 37
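The rank-$b$ update structure of Equation (2.42) is easy to exercise in serial code. The sketch below (Python with NumPy, an illustrative stand-in for the distributed MPI implementation; the function name `panel_matmul` is ours) performs the $n/b$ panel products that each processor would apply to its local block of $C$:

```python
import numpy as np

def panel_matmul(A, B, C, b):
    """Accumulate C += A @ B as a sequence of rank-b panel updates.

    This is the per-processor arithmetic of Equation (2.42): A[:, k:k+b]
    plays the role of the broadcast block A_ik and B[k:k+b, :] the role
    of B_kj.  The communication (pipelined row/column broadcasts over a
    sqrt(P) x sqrt(P) grid) is omitted in this serial sketch.
    """
    n = A.shape[1]
    for k in range(0, n, b):
        C += A[:, k:k+b] @ B[k:k+b, :]
    return C
```

The choice of $b$ does not change the result, only the number of updates, which is exactly the latency/bandwidth trade-off that the timing model (2.44) captures.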


3. QR Factorization in an Oblique Inner Product

3.1 Introduction

Sets of vectors that are mutually orthogonal with respect to an oblique inner product are often used in iterative methods such as the conjugate gradient method and Krylov subspace methods. However, these vectors cannot be computed exactly since floating-point arithmetic introduces error into the computation. Large error in the computation may slow the convergence of the associated iterative method, a topic we do not address in this chapter. Rather, we wish to improve our understanding of the error introduced in computing such vectors. For this study we focus on QR factorization algorithms.

In this chapter, we are interested in computing a $QR$ factorization of $Z \in \mathbb{R}^{m \times n}$ in an oblique inner product space. Given an $m \times m$ symmetric positive definite matrix $A$, we define the inner product by

  $\langle x, y \rangle_A := x^T A y, \quad x, y \in \mathbb{R}^m,$

and the induced norm by

  $\|x\|_A := \sqrt{x^T A x}.$

We seek factors $Q$ and $R$ such that $Z = QR$, $R$ is an $n \times n$ upper triangular matrix, and $Q$ is an $m \times n$ matrix such that $Q^T A Q = I$. $Z$ is assumed to be full rank and $m \ge n$. Recall from Chapter 1 that the singular values of a matrix $X \in \mathbb{R}^{m \times n}$ are $\sigma_1 \ge \cdots \ge \sigma_n$. If $X$ is full rank, then the condition number is $\kappa(X) = \sigma_1 / \sigma_n$.

Since $R^T R = (A^{1/2} Z)^T (A^{1/2} Z)$, we have the equality $\sigma_i(R) = \sigma_i(A^{1/2} Z)$ for the singular values of $R$. We can also derive an equality for the singular values of $Q$. Since $I = Q^T A Q = (A^{1/2} Q)^T (A^{1/2} Q)$, $A^{1/2} Q$ has orthonormal columns. Let $\hat{Z}$ have columns that form an orthonormal basis of the column space of $Z$. $\hat{Z} \hat{Z}^T$ is a projection onto the column space of $Z$, so $Q = \hat{Z} \hat{Z}^T Q$ and $A^{1/2} Q = A^{1/2} \hat{Z} \hat{Z}^T Q$. Then

  $\sigma_i(A^{1/2} \hat{Z}) = \sigma_i\big(A^{1/2} \hat{Z} (\hat{Z}^T Q)(\hat{Z}^T Q)^{-1}\big) = \sigma_i\big((\hat{Z}^T Q)^{-1}\big) = \sigma_{n-i+1}(\hat{Z}^T Q)^{-1} = \sigma_{n-i+1}(\hat{Z}\hat{Z}^T Q)^{-1} = \sigma_{n-i+1}(Q)^{-1}.$

We organize these results in the following theorem.

Theorem 3.1. Let $Z$ be full rank. Let $\hat{Z}$ have columns that form an orthonormal basis of the column space of $Z$. Then the $Q$ and $R$ factors of the oblique $QR$ factorization in the $A$ inner product satisfy

  $\sigma_i(Q) = \sigma_{n-i+1}(A^{1/2}\hat{Z})^{-1}, \quad \sigma_i(R) = \sigma_i(A^{1/2} Z), \quad$ for all $i = 1, \ldots, n.$

In particular, $\|Q\|_2 = \sigma_n(A^{1/2}\hat{Z})^{-1}$ and $\|R\|_2 = \|A^{1/2} Z\|_2$.

For the Euclidean case ($A = I$), we recover the standard $QR$ factorization, a classic problem in linear algebra. The stability of Euclidean $QR$ decomposition algorithms has been studied greatly. For Householder, Givens, and modified Gram-Schmidt algorithms we refer to [24, Chapter 19]. For classical Gram-Schmidt we refer to [19, 20].

There are two main classes of algorithms for computing the oblique $QR$ factorization: algorithms that use a factorization of $A$, and those that do not. The first compute a factorization $A = B^T B$ and convert the problem to a Euclidean $QR$ factorization $BZ = YR$. The $Q$ factor may be obtained by $Q = B^{-1} Y$. We consider the cases where the computed $B$ is the Cholesky factor of $A$ and where $B$ is obtained by the eigenvalue decomposition. If $A = VDV^T$, where $D$ is diagonal and $V$ is unitary, then we let $B = D^{1/2} V^T$. Gram-Schmidt schemes are designed to orthogonalize with respect to any inner product and do not require a factorization of $A$. Also, since we have the relationship $Z^T A Z = R^T R$, we can compute the $R$ factor as the Cholesky factor of the normal equation $Z^T A Z$. Again, no factorization of $A$ is required. Algorithms similar to Householder or Givens for an oblique QR factorization do not currently exist. However, a closely related problem of computing the $QR$ factorization where $Q$ is $A$-invariant (i.e., $Q^T A Q = A$) has been considered. Gulliksson [22] presents stability analysis for a Householder-like algorithm when $A$ is diagonal, and the result is used to solve the weighted least squares problem.

Rozložník et al. [41] extend the error analysis of the Gram-Schmidt methods to the general case of an oblique inner product. They also analyzed the method based on the eigenvalue decomposition of $A$. In Section 3.4 we provide stability bounds for when $B$ is the Cholesky factor of $A$. In Section 3.5 we analyze the case when $B$ is given from the eigenvalue decomposition and present error bounds improving on those from Rozložník et al. In Section 3.2 we analyze the stability of the algorithm based on computing the $R$ factor as the Cholesky factor of the normal equation $Z^T A Z$. In Section 3.3 we analyze the normal equation algorithm when first computing a Euclidean $QR$ factorization.

An application of the oblique $QR$ factorization is computing the solution to the generalized least squares problem

  $\min_x \|Zx - b\|_A.$   (3.1)

The linear least squares estimate computed via an oblique $QR$ factorization is given by $\tilde{x} = R^{-1} Q^T A b$. If the elements of $b$ are taken to be random variables, then the best linear unbiased estimate is the solution to (3.1), where $A$ is the inverse covariance matrix. For an overview of the generalized least squares problem we refer to [5, Chapter 4].

In some applications it is reasonable to assume that initially a factorization of $A$ is known rather than $A$ itself. This is the case when $A$ is an estimate of the covariance matrix computed by $\frac{1}{n-1} B^T B$, where $B$ has $n$ rows corresponding to data samples. Several articles [45, 46, 47] study the stability and performance of the modified Gram-Schmidt algorithm when a factorization $A = B^T B$ is known initially. There, $A$ is never explicitly formed. We do not assume any factorization of $A$ is known initially.
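The identities of Theorem 3.1 are easy to confirm numerically. The sketch below (NumPy; the sizes and random test matrices are arbitrary choices of ours, not from the text) builds an oblique QR factorization through the normal equation $R^T R = Z^T A Z$ — the CHOLQR idea analyzed in Section 3.2 — and checks $\sigma_i(R) = \sigma_i(A^{1/2} Z)$ and $\sigma_i(Q) = \sigma_{n-i+1}(A^{1/2}\hat{Z})^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 5
M = rng.standard_normal((m, m))
A = M @ M.T + m * np.eye(m)        # symmetric positive definite, modest condition number
Z = rng.standard_normal((m, n))    # full rank with probability one

# Oblique QR via the normal equation: R^T R = Z^T A Z, Q = Z R^{-1}.
R = np.linalg.cholesky(Z.T @ A @ Z).T     # upper-triangular Cholesky factor
Q = np.linalg.solve(R.T, Z.T).T           # Q = Z R^{-1}

w, V = np.linalg.eigh(A)
A_half = (V * np.sqrt(w)) @ V.T           # A^{1/2}
Z_hat = np.linalg.qr(Z)[0]                # orthonormal basis of the column space of Z

sv = lambda X: np.linalg.svd(X, compute_uv=False)     # singular values, descending
assert np.allclose(Q.T @ A @ Q, np.eye(n))            # A-orthonormal columns
assert np.allclose(sv(R), sv(A_half @ Z))             # sigma_i(R) = sigma_i(A^{1/2} Z)
assert np.allclose(sv(Q), 1.0 / sv(A_half @ Z_hat)[::-1])
```

The last assertion is the reversed-and-inverted pairing $\sigma_i(Q) = \sigma_{n-i+1}(A^{1/2}\hat{Z})^{-1}$ of Theorem 3.1.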


Another application is the generalized eigenvalue problem

  $Bx = \lambda A x,$   (3.2)

where $A \in \mathbb{R}^{m \times m}$ is symmetric positive definite and $B \in \mathbb{R}^{m \times m}$ is symmetric. If $Q \in \mathbb{R}^{m \times m}$ is $A$-orthogonal, then the standard eigenvalue problem

  $(Q^T B Q) x = \lambda x$

has the same eigenvalues as (3.2). The same idea can be used for iterative methods, where $Q \in \mathbb{R}^{m \times n}$ is a basis of a Krylov subspace of dimension $n$.

In Section 3.7 we test the bounds obtained in Sections 3.2, 3.3, 3.4, and 3.5. From these experiments we conclude that our bounds are tight. In Section 3.8 we compare the communications required by each algorithm. We see that communication avoiding algorithms will reduce the number of messages transferred. In Section 3.9 we provide performance experiments on an Intel cluster. The experiments confirm the benefit of the communication avoiding algorithms.

3.2 Stability Analysis of CHOLQR

The $R$ factor of an oblique $QR$ factorization is the Cholesky factor of $Z^T A Z$. This is easily seen from

  $Z^T A Z = R^T Q^T A Q R = R^T R.$

Once the $R$ factor is found from a Cholesky decomposition, the $Q$ factor is obtained by a triangle solve. Algorithm 3.1, which we call CHOLQR, summarizes this method.

Algorithm 3.1: CHOLQR
  Input: $Z \in \mathbb{R}^{m \times n}$, $A \in \mathbb{R}^{m \times m}$ symmetric positive definite
  Output: $Q \in \mathbb{R}^{m \times n}$, $R \in \mathbb{R}^{n \times n}$
  1  $B = AZ$;
  2  $C = Z^T B$;
  3  $R = \mathrm{chol}(C)$;
  4  $Q = Z R^{-1}$;

There are essentially three kernels: matrix multiplication, Cholesky decomposition, and triangle solve. The backward error results for each of these computations are

  $\tilde{B} = AZ + \Delta B,$ s.t. $|\Delta B| \le cmu\,|A||Z|,$   (3.3)
  $\tilde{C} = Z^T \tilde{B} + \Delta C_1,$ s.t. $|\Delta C_1| \le cmu\,|Z^T||\tilde{B}|,$   (3.4)
  $\tilde{C} = \tilde{R}^T \tilde{R} + \Delta C_2,$ s.t. $|\Delta C_2| \le cnu\,|\tilde{R}^T||\tilde{R}|,$   (3.5)
  $\tilde{Q}_{i,:}(\tilde{R} + \Delta R_i) = Z_{i,:},$ s.t. $|\Delta R_i| \le cnu\,|\tilde{R}|,$   (3.6)

where $c$ is a small constant that can change from one line to another. To ensure that the Cholesky factorization in (3.5) runs to completion we assume that $cn^{3/2} u\, \kappa(\tilde{C}) < 1$ (see [24, Chapter 10]). Let $\Delta Z = -[\tilde{Q}_{1,:} \Delta R_1; \ldots; \tilde{Q}_{i,:} \Delta R_i; \ldots; \tilde{Q}_{m,:} \Delta R_m]$. Then the final equation becomes

  $\tilde{Q}\tilde{R} = Z + \Delta Z,$ s.t. $|\Delta Z| \le cnu\,|\tilde{Q}||\tilde{R}|.$

This equation gives us the componentwise representativity bound

  $|Z - \tilde{Q}\tilde{R}| \le cnu\,|\tilde{Q}||\tilde{R}|,$

and the normwise bound follows:

  $\|Z - \tilde{Q}\tilde{R}\|_2 \le cnu\,\|\tilde{Q}\|_2 \|\tilde{R}\|_2.$

For the loss of orthogonality we use that $\tilde{Q} = Z\tilde{R}^{-1} + \Delta Z\,\tilde{R}^{-1}$ and form

  $\tilde{Q}^T A \tilde{Q} = (\tilde{R}^{-T} Z^T + \tilde{R}^{-T} \Delta Z^T)\, A\, (Z\tilde{R}^{-1} + \Delta Z\,\tilde{R}^{-1})$   (3.7)
  $\qquad = \tilde{R}^{-T} Z^T A Z \tilde{R}^{-1} + \tilde{R}^{-T} Z^T A\, \Delta Z\, \tilde{R}^{-1} + \tilde{R}^{-T} \Delta Z^T A Z \tilde{R}^{-1} + O(u^2).$

From (3.3), (3.4), and (3.5),

  $Z^T A Z = Z^T (\tilde{B} - \Delta B)$   (3.8)
  $\qquad = \tilde{C} - \Delta C_1 - Z^T \Delta B$
  $\qquad = \tilde{R}^T \tilde{R} + \Delta C_2 - \Delta C_1 - Z^T \Delta B.$

Substituting (3.8) into (3.7) gives

  $\tilde{Q}^T A \tilde{Q} = I + \tilde{R}^{-T} \Delta C_2 \tilde{R}^{-1} - \tilde{R}^{-T} \Delta C_1 \tilde{R}^{-1} - \tilde{R}^{-T} Z^T \Delta B\, \tilde{R}^{-1}$   (3.9)
  $\qquad + \tilde{R}^{-T} Z^T A\, \Delta Z\, \tilde{R}^{-1} + \tilde{R}^{-T} \Delta Z^T A Z \tilde{R}^{-1} + O(u^2).$

Since $Z\tilde{R}^{-1} = \tilde{Q} - \Delta Z\,\tilde{R}^{-1}$, we have $\|Z\tilde{R}^{-1}\|_2 \le \|\tilde{Q}\|_2 + O(u)$. Similarly, we have $\|\tilde{C}\|_2 \le \|Z^T \tilde{B}\|_2 + O(u)$ and $\|\tilde{B}\|_2 \le \|AZ\|_2 + O(u)$. Using these inequalities with (3.9) gives the bound

  $\|\tilde{Q}^T A \tilde{Q} - I\|_2 \le cmnu\,\big( \|\tilde{R}^{-1}\|_2^2 \|Z\|_2 \|AZ\|_2 + \|\tilde{R}^{-1}\|_2 \|\tilde{Q}\|_2 \|A\|_2 \|Z\|_2$   (3.10)
  $\qquad + \|\tilde{R}^{-1}\|_2 \|A\tilde{Q}\|_2 \|\tilde{Q}\|_2 \|\tilde{R}\|_2 \big) + O(u^2).$

We may eliminate the third term from (3.10) by noting that

  $\|A\tilde{Q}\|_2 \|\tilde{R}\|_2 \le \|A\|_2 \|Z\|_2 + O(u).$   (3.11)


Inequality (3.11) can be obtained from the following. Let $N = Z^T A Z - \tilde{R}^T \tilde{R}$. From (3.8), $\|N\|_2 = O(u)$ and $\|\tilde{R}\|_2^2 \le \|A^{1/2} Z\|_2^2 + \|N\|_2$. Hence,

  $\|\tilde{R}\|_2 \le \|A^{1/2} Z\|_2 \sqrt{1 + \frac{\|N\|_2}{\|A^{1/2} Z\|_2^2}}$   (3.12)
  $\qquad \le \|A^{1/2} Z\|_2 \left(1 + \frac{\|N\|_2}{\|A^{1/2} Z\|_2^2}\right) \le \|A^{1/2} Z\|_2 + O(u).$

Similarly, since $\tilde{Q}^T A \tilde{Q} - I = O(u)$,

  $\|A^{1/2} \tilde{Q}\|_2 \le 1 + O(u).$   (3.13)

Combining (3.12) and (3.13) we obtain (3.11). Theorem 3.2 summarizes the results.

Theorem 3.2. The factors $\tilde{Q}$ and $\tilde{R}$ computed by Algorithm 3.1 satisfy

  $\|Z - \tilde{Q}\tilde{R}\|_2 \le cnu\,\|\tilde{Q}\|_2 \|\tilde{R}\|_2,$   (3.14)
  $|Z - \tilde{Q}\tilde{R}| \le cnu\,|\tilde{Q}||\tilde{R}|,$   (3.15)
  $\|\tilde{Q}^T A \tilde{Q} - I\|_2 \le cmnu\,\|\tilde{R}^{-1}\|_2 \|Z\|_2 \big( \|\tilde{R}^{-1}\|_2 \|AZ\|_2 + \|\tilde{Q}\|_2 \|A\|_2 \big) + O(u^2).$   (3.16)

The loss of orthogonality bound (3.16) can be difficult to understand in its current form. At first glance a worst-case bound is

  $\|\tilde{Q}^T A \tilde{Q} - I\|_2 \le cmnu\, \kappa(\tilde{R})^2 \kappa(A) + O(u^2).$

However, in many cases the conditioning of $R$ increases when increasing the conditioning of $A$. This creates a bound that is proportional to $\kappa(A)^2$, a result not seen in numerical experiments (see Section 3.7). Instead we search for a worst-case bound that is proportional to $\kappa(A)$ if $\kappa(Z)$ is constant.

Let $\Delta Z = Z - \tilde{Q}\tilde{R}$. Assume that $cnu\,\|\tilde{Q}\|_2 \|\tilde{R}\|_2 / \sigma_{\min}(Z) \le 1/2$. From perturbation theory of singular values,

  $\sigma_{\min}(\tilde{Q}\tilde{R}) \ge \sigma_{\min}(Z) - \|\Delta Z\|_2 > 0$

and

  $\sigma_{\min}(\tilde{R}) \ge \frac{\sigma_{\min}(\tilde{Q}\tilde{R})}{\sigma_{\max}(\tilde{Q})} \ge \frac{\sigma_{\min}(Z) - \|\Delta Z\|_2}{\|\tilde{Q}\|_2} \ge \frac{\sigma_{\min}(Z)}{2\|\tilde{Q}\|_2}.$

Therefore,

  $\|\tilde{R}^{-1}\|_2 \le \frac{2\|\tilde{Q}\|_2}{\sigma_{\min}(Z)}.$

Substituting this into (3.16), along with $\|\tilde{Q}\|_2 \le \|A^{-1/2}\|_2 + O(u)$, gives an upper bound that is at worst proportional to $\kappa(A)$ when $\kappa(Z)$ is constant. We state the upper bound in the following corollary.

Corollary 3.3. The factor $\tilde{Q}$ computed by Algorithm 3.1 satisfies

  $\|\tilde{Q}^T A \tilde{Q} - I\|_2 \le cmnu\, \kappa(Z)^2 \kappa(A) + O(u^2).$

3.3 Stability Analysis of PRE-CHOLQR

In the CHOLQR algorithm the conditioning of $Z$ plays an important role in the loss of orthogonality. To remove the dependency on the conditioning of $Z$, one can first use a stable Euclidean $QR$ factorization (such as Householder $QR$) and apply the CHOLQR algorithm to the computed orthonormal factor. Algorithm 3.2 summarizes PRE-CHOLQR.
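In code, Algorithms 3.1 and 3.2 are only a few lines each. The sketch below (NumPy; library QR and Cholesky routines stand in for the particular kernels whose backward errors are analyzed above, and the function names are ours) makes the relationship between the two algorithms explicit:

```python
import numpy as np

def cholqr(Z, A):
    """Algorithm 3.1 (CHOLQR): R = chol(Z^T A Z), Q = Z R^{-1}."""
    R = np.linalg.cholesky(Z.T @ (A @ Z)).T    # upper-triangular Cholesky factor
    Q = np.linalg.solve(R.T, Z.T).T            # triangular solve for Q = Z R^{-1}
    return Q, R

def pre_cholqr(Z, A):
    """Algorithm 3.2 (PRE-CHOLQR): a Euclidean QR of Z first, then
    CHOLQR on the computed orthonormal factor; finally R = U S."""
    Y, S = np.linalg.qr(Z)                     # Householder-based, stable
    Q, U = cholqr(Y, A)
    return Q, U @ S
```

Because $Y$ is numerically orthonormal, the $\kappa(Z)^2$ factor of Corollary 3.3 does not enter the loss of orthogonality of PRE-CHOLQR.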


Algorithm 3.2: PRE-CHOLQR
  Input: $Z \in \mathbb{R}^{m \times n}$, $A \in \mathbb{R}^{m \times m}$ symmetric positive definite
  Output: $Q \in \mathbb{R}^{m \times n}$, $R \in \mathbb{R}^{n \times n}$
  1  $[Y, S] = \mathrm{qr}(Z)$;
  2  $[Q, U] = \mathrm{CHOLQR}(A, Y)$;
  3  $R = US$;

A normwise bound for the loss of orthogonality is obtained by letting $Z = \tilde{Y}$ in Theorem 3.2, where $\tilde{Y}$ is close to an orthonormal matrix $Y$ that is a basis for the column space of $Z$. Then $Y\tilde{U}^{-1} = \tilde{Q} + O(u)$ and $\|\tilde{U}^{-1}\|_2 \le \|\tilde{Q}\|_2 + O(u)$. This gives the loss of orthogonality bound

  $\|\tilde{Q}^T A \tilde{Q} - I\|_2 \le cmnu\,\|\tilde{Q}\|_2^2 \|A\|_2 + O(u^2).$

The representativity error can be obtained from the error bounds

  $Z = \tilde{Y}\tilde{S} + \Delta Z, \quad \|\Delta Z\|_2 \le cmn^2 u\,\|Z\|_2,$   (3.17)
  $\tilde{Q}\tilde{U} = \tilde{Y} + \Delta Y, \quad \|\Delta Y\|_2 \le cmu\,\|\tilde{Q}\|_2 \|\tilde{U}\|_2,$   (3.18)
  $\tilde{R} = \tilde{U}\tilde{S} + \Delta R, \quad \|\Delta R\|_2 \le cmu\,\|\tilde{U}\|_2 \|\tilde{S}\|_2.$   (3.19)

The second equation corresponds to the error of a triangle solve for multiple columns. Normwise and componentwise bounds follow immediately. Theorem 3.4 summarizes these results.

Theorem 3.4. The factors $\tilde{Q}$ and $\tilde{R}$ computed by Algorithm 3.2 satisfy

  $\|Z - \tilde{Q}\tilde{R}\|_2 \le cmn^2 u\,\|\tilde{Q}\|_2 \|\tilde{U}\|_2 \|\tilde{S}\|_2,$   (3.20)
  $|Z - \tilde{Q}\tilde{R}| \le cmn^2 u\,|\tilde{Q}||\tilde{U}||\tilde{S}|,$   (3.21)
  $\|\tilde{Q}^T A \tilde{Q} - I\|_2 \le cmn^2 u\,\|\tilde{Q}\|_2^2 \|A\|_2 + O(u^2).$   (3.22)

3.4 Factorization Based on the Cholesky Factor of A

The factorization based on computing the Cholesky factor of $A$ and converting the problem to the Euclidean case is given in Algorithm 3.3, which we name CHOL-EQR. Here one computes the Cholesky factor $C$, then computes the Euclidean $QR$ factorization of $CZ$. This provides the correct $R$ factor, and one obtains the $Q$ factor by $Q = C^{-1} Y$, where $Y$ is the computed orthonormal factor of the Euclidean $QR$ factorization.

Algorithm 3.3: CHOL-EQR
  Input: $Z \in \mathbb{R}^{m \times n}$, $A \in \mathbb{R}^{m \times m}$ symmetric positive definite
  Output: $Q \in \mathbb{R}^{m \times n}$, $R \in \mathbb{R}^{n \times n}$
  1  $C = \mathrm{chol}(A)$;
  2  $W = CZ$;
  3  $[Y, R] = \mathrm{qr}(W)$;
  4  $Q = C^{-1} Y$;

The algorithm consists of four major kernels. We have the following backward error results for each kernel:

  $A = \tilde{C}^T \tilde{C} + \Delta A,$ s.t. $\|\Delta A\|_2 \le cm^2 u\,\|A\|_2$ and $|\Delta A| \le cmu\,|\tilde{C}^T||\tilde{C}|,$   (3.23)
  $\tilde{W} + \Delta W_1 = \tilde{C} Z,$ s.t. $|\Delta W_1| \le cmu\,|\tilde{C}||Z|,$   (3.24)
  $\tilde{W} = \tilde{Y}\tilde{R} + \Delta W_2,$ s.t. $\|\Delta W_2\|_2 \le cmn^2 u\,\|\tilde{W}\|_2$ and $|\Delta W_2| \le cmn^{3/2} u\, ee^T |\tilde{W}|,$   (3.25)
  $\tilde{Y} = Y + \Delta Y,$ s.t. $\|\Delta Y\|_2 \le cmn^2 u,$   (3.26)
  $(\tilde{C} + \Delta C_j)\tilde{Q}_j = \tilde{Y}_j,$ s.t. $|\Delta C_j| \le cmu\,|\tilde{C}|,$   (3.27)

where $e$ is a vector of all ones and $Y$ is an exactly orthonormal matrix. Let $\Delta C = [\Delta C_1 \tilde{Q}_1, \ldots, \Delta C_j \tilde{Q}_j, \ldots, \Delta C_n \tilde{Q}_n]$. Then Equation (3.27) becomes

  $\tilde{C}\tilde{Q} + \Delta C = \tilde{Y},$ s.t. $|\Delta C| \le cmu\,|\tilde{C}||\tilde{Q}|.$   (3.28)

Combining (3.24), (3.25), and (3.28) we have

  $\tilde{C} Z = \tilde{Y}\tilde{R} + \Delta W_2 + \Delta W_1 = (\tilde{C}\tilde{Q} + \Delta C)\tilde{R} + \Delta W_2 + \Delta W_1.$   (3.29)

Solving for $Z - \tilde{Q}\tilde{R}$ gives

  $Z - \tilde{Q}\tilde{R} = \tilde{C}^{-1} \Delta C\, \tilde{R} + \tilde{C}^{-1} \Delta W_2 + \tilde{C}^{-1} \Delta W_1.$

A componentwise bound follows immediately:

  $|Z - \tilde{Q}\tilde{R}| \le cmu\,|\tilde{C}^{-1}||\tilde{C}||\tilde{Q}||\tilde{R}| + cnu\,|\tilde{C}^{-1}||\tilde{C}||Z| + cmn^{3/2} u\,|\tilde{C}^{-1}|\, ee^T |\tilde{C} Z|$
  $\qquad \le cmu\,|\tilde{C}^{-1}||\tilde{C}||\tilde{Q}||\tilde{R}| + cmn^{3/2} u\,|\tilde{C}^{-1}|\, ee^T |\tilde{C} Z|.$

A normwise bound also follows:

  $\|Z - \tilde{Q}\tilde{R}\|_2 \le cmu\,\kappa(\tilde{C}) \|\tilde{Q}\|_2 \|\tilde{R}\|_2 + cnu\,\kappa(\tilde{C}) \|Z\|_2 + cmn^{3/2} u\,\|\tilde{C}^{-1}\|_2 \|\tilde{C} Z\|_2$   (3.30)
  $\qquad \le cmn^2 u\,\kappa(\tilde{C}) \|\tilde{Q}\|_2 \|\tilde{R}\|_2.$   (3.31)

Analyzing the normwise bound, it appears that a worst-case upper bound for the loss of representativity is proportional to $\kappa(A)$. However, in the numerical experiments the dependency on $\kappa(A)$ is not observed (see Section 3.7). The componentwise bound also fails to provide a descriptive bound on the loss of representativity. This leads us to consider a different measure of representativity error, namely, the $A$-norm.

Let $\Delta Z = Z - \tilde{Q}\tilde{R}$. From (3.23),

  $\Delta Z^T A\, \Delta Z = \Delta Z^T \tilde{C}^T \tilde{C}\, \Delta Z + \Delta Z^T \Delta A\, \Delta Z.$

Therefore,

  $\|\Delta Z\|_A^2 \le \|\tilde{C}\, \Delta Z\|_2^2 + \|\Delta Z^T \Delta A\, \Delta Z\|_2.$

From (3.29),

  $\|\tilde{C}\, \Delta Z\|_2 \le cmn^2 u\,\|\tilde{C}\|_2\, \||\tilde{Q}||\tilde{R}|\|_2.$

So

  $\|\Delta Z\|_A^2 \le \big(cmn^2 u\,\|\tilde{C}\|_2\, \||\tilde{Q}||\tilde{R}|\|_2\big)^2 + \|\Delta Z^T \Delta A\, \Delta Z\|_2$

and

  $\|\Delta Z\|_A \le cmn^2 u\,\|\tilde{C}\|_2\, \||\tilde{Q}||\tilde{R}|\|_2 + \|\Delta Z^T \Delta A\, \Delta Z\|_2^{1/2}.$

From (3.23) and (3.31), we know that $\|\Delta Z^T \Delta A\, \Delta Z\|_2 = O(u^3)$, which gives us a final error bound of

  $\|\Delta Z\|_A \le cmn^2 u\,\|\tilde{C}\|_2\, \||\tilde{Q}||\tilde{R}|\|_2 + O(u^2).$

To derive a bound for the loss of orthogonality we begin with (3.28) and compute

  $\tilde{Q}^T \tilde{C}^T \tilde{C} \tilde{Q} = (\tilde{Y} - \Delta C)^T (\tilde{Y} - \Delta C).$


Substituting (3.23) on the left and expanding on the right, we have

  $\tilde{Q}^T (A - \Delta A) \tilde{Q} = \tilde{Y}^T \tilde{Y} - \tilde{Y}^T \Delta C - (\tilde{Y}^T \Delta C)^T + O(u^2).$

Substituting (3.26) for $\tilde{Y}$ and rearranging gives

  $\tilde{Q}^T A \tilde{Q} - I = \Delta Y^T Y + (\Delta Y^T Y)^T - \tilde{Y}^T \Delta C - (\tilde{Y}^T \Delta C)^T + \tilde{Q}^T \Delta A\, \tilde{Q} + O(u^2).$

A normwise bound follows:

  $\|\tilde{Q}^T A \tilde{Q} - I\|_2 \le cmnu + cmu\,\|\tilde{Y}^T\|_2 \|\tilde{C}\|_2 \|\tilde{Q}\|_2 + cmu\,\|\tilde{Q}^T\|_2 \|\tilde{C}^T\|_2 \|\tilde{Y}\|_2$
  $\qquad + cmu\,\|\tilde{Q}^T\|_2 \|\tilde{C}^T\|_2 \|\tilde{C}\|_2 \|\tilde{Q}\|_2 + O(u^2)$
  $\qquad \le cmu\,\|\tilde{Q}^T\|_2 \|A\|_2 \|\tilde{Q}\|_2 + O(u^2) \le cmu\,\|A\|_2 \|\tilde{Q}\|_2^2 + O(u^2).$

We summarize the bounds in the following theorem.

Theorem 3.5. The factors $\tilde{Q}$ and $\tilde{R}$ computed by the oblique $QR$ factorization based on the Cholesky factor of $A$ satisfy the following error bounds:

  $|Z - \tilde{Q}\tilde{R}| \le cmu\,|\tilde{C}^{-1}||\tilde{C}||\tilde{Q}||\tilde{R}| + cmn^{3/2} u\,|\tilde{C}^{-1}|\, ee^T |\tilde{C} Z|,$   (3.32)
  $\|Z - \tilde{Q}\tilde{R}\|_2 \le cmn^2 u\,\kappa(\tilde{C}) \|\tilde{Q}\|_2 \|\tilde{R}\|_2,$   (3.33)
  $\|Z - \tilde{Q}\tilde{R}\|_A \le cmn^2 u\,\|A\|_2^{1/2}\, \||\tilde{Q}||\tilde{R}|\|_2 + O(u^2),$   (3.34)
  $\|\tilde{Q}^T A \tilde{Q} - I\|_2 \le cmnu\,\|A\|_2 \|\tilde{Q}\|_2^2 + O(u^2).$   (3.35)
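Algorithm 3.3 admits an equally short sketch (NumPy; note that `numpy.linalg.cholesky` returns the lower-triangular factor, so we transpose to obtain $A = C^T C$ with $C$ upper triangular — this choice is ours, not prescribed by the text):

```python
import numpy as np

def chol_eqr(Z, A):
    """Algorithm 3.3 (CHOL-EQR): C = chol(A), Euclidean QR of CZ,
    then Q = C^{-1} Y by a triangular solve."""
    C = np.linalg.cholesky(A).T        # upper triangular, A = C^T C
    Y, R = np.linalg.qr(C @ Z)         # the Euclidean QR provides the R factor
    Q = np.linalg.solve(C, Y)          # Q = C^{-1} Y
    return Q, R
```

By construction $Q^T A Q = Y^T C^{-T} (C^T C) C^{-1} Y = Y^T Y \approx I$, in line with the bound (3.35).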


3.5 Factorization Based on the Eigenvalue Decomposition of A

The factorization based on computing the eigenvalue decomposition of $A$ and converting the problem to the Euclidean case is given in Algorithm 3.4, which we call EQR-SYEV. Here one computes the eigenvalue decomposition $A = VDV^T$, then computes the Euclidean $QR$ factorization of $D^{1/2} V^T Z$. This provides the correct $R$ factor, and we obtain the $Q$ factor by $Q = V D^{-1/2} Y$, where $Y$ has orthonormal columns and is obtained from the Euclidean $QR$ factorization. As explained in the introduction, this is the same procedure as for the Cholesky factorization of $A$, where the Cholesky factor is replaced by $D^{1/2} V^T$.

Algorithm 3.4: EQR-SYEV
  Input: $Z \in \mathbb{R}^{m \times n}$, $A \in \mathbb{R}^{m \times m}$ symmetric positive definite
  Output: $Q \in \mathbb{R}^{m \times n}$, $R \in \mathbb{R}^{n \times n}$
  1  $[V, D] = \mathrm{eig}(A)$;
  2  $X = V^T Z$;
  3  $W = D^{1/2} X$;
  4  $[Y, R] = \mathrm{qr}(W)$;
  5  $U = D^{-1/2} Y$; $Q = VU$;

The algorithm consists of six major kernels. We have the following backward error results for each kernel:

  $A + \Delta A = \tilde{V}\tilde{D}\tilde{V}^T,$ s.t. $\|\Delta A\|_2 \le cm^{5/2} u\,\|A\|_2,$   (3.36)
  $\tilde{V} = V + E,$ s.t. $\|E\|_2 \le cm^{5/2} u,$   (3.37)
  $\tilde{X} = \tilde{V}^T Z + \Delta X,$ s.t. $\|\Delta X\|_2 \le cmu\,\|Z\|_2,$   (3.38)
  $\tilde{W} = \tilde{D}^{1/2}(\tilde{X} + \Delta S),$ s.t. $\|\Delta S\|_2 \le cu\,\|\tilde{X}\|_2,$   (3.39)
  $\tilde{W} = \tilde{Y}\tilde{R} + \Delta W,$ s.t. $\|\Delta W\|_2 \le cmn^{3/2} u\,\|\tilde{W}\|_2,$   (3.40)
  $\tilde{U} = \tilde{D}^{-1/2}(\tilde{Y} + \Delta U),$ s.t. $\|\Delta U\|_2 \le cu\,\|\tilde{Y}\|_2,$   (3.41)
  $\tilde{Q} = \tilde{V}\tilde{U} + \Delta Q,$ s.t. $\|\Delta Q\|_2 \le cmu\,\|\tilde{U}\|_2.$   (3.42)

Beginning with the backward error result for the Euclidean $QR$ factorization of $\tilde{W}$ (3.40), as well as the backward error result (3.39) for computing $\tilde{W}$ from the multiplication by a diagonal matrix, we get

  $\tilde{D}^{1/2}(\tilde{X} + \Delta S) = \tilde{Y}\tilde{R} + \Delta W.$

Multiplying by $\tilde{D}^{-1/2}$ and substituting (3.38) for $\tilde{X}$ and (3.41) for $\tilde{D}^{-1/2}\tilde{Y}$, we get

  $\tilde{V}^T Z + \Delta X + \Delta S = \tilde{U}\tilde{R} - \tilde{D}^{-1/2}\Delta U\, \tilde{R} + \tilde{D}^{-1/2}\Delta W.$

Multiplying by $\tilde{V}$ and substituting (3.42) for $\tilde{V}\tilde{U}$, we get

  $(I + E)Z + \tilde{V}\Delta X + \tilde{V}\Delta S = (\tilde{Q} - \Delta Q)\tilde{R} - \tilde{V}\tilde{D}^{-1/2}\Delta U\, \tilde{R} + \tilde{V}\tilde{D}^{-1/2}\Delta W.$   (3.43)

Note that $\|\Delta W\|_2 \le cmn^{3/2} u\,\|\tilde{D}^{1/2}\tilde{V}^T Z\|_2 + O(u^2) \le cmn^{3/2} u\,\|A^{1/2} Z\|_2 + O(u^2)$ and $\|A^{1/2} Z\|_2 \le \|\tilde{R}\|_2 + O(u)$. Solving for $Z - \tilde{Q}\tilde{R}$ gives the representativity error bound

  $\|Z - \tilde{Q}\tilde{R}\|_2 \le cm^{5/2} u\,\|A^{-1/2}\|_2 \|\tilde{R}\|_2.$

To calculate the loss of orthogonality we begin by combining (3.41) and (3.42) to get $\tilde{Q} = \tilde{V}\tilde{D}^{-1/2}\tilde{Y} + \tilde{V}\tilde{D}^{-1/2}\Delta U + \Delta Q$. This, along with substituting (3.36) for $A$, gives

  $\tilde{Q}^T A \tilde{Q} = (\tilde{V}\tilde{D}^{-1/2}\tilde{Y} + \tilde{V}\tilde{D}^{-1/2}\Delta U + \Delta Q)^T\, (\tilde{V}\tilde{D}\tilde{V}^T - \Delta A)\, (\tilde{V}\tilde{D}^{-1/2}\tilde{Y} + \tilde{V}\tilde{D}^{-1/2}\Delta U + \Delta Q).$

Expanding and simplifying using (3.37), we obtain the bound for the loss of orthogonality. The results are summarized in Theorem 3.6.

Theorem 3.6. The factors $\tilde{Q}$ and $\tilde{R}$ computed by the oblique $QR$ factorization based on the symmetric eigenvalue decomposition of $A$ satisfy the following error bounds:

  $\|Z - \tilde{Q}\tilde{R}\|_2 \le cm^{5/2} u\,\|A^{-1/2}\|_2 \|\tilde{R}\|_2,$   (3.44)
  $\|\tilde{Q}^T A \tilde{Q} - I\|_2 \le cm^{5/2} u\,\|A\|_2 \|\tilde{Q}\|_2^2 + O(u^2).$   (3.45)

The representativity bound is an improvement on the bound presented in [41]. There they give a bound which is proportional to $\kappa(A)^{1/2}\|Z\|_2$, which is clearly an upper bound of (3.44). However, this bound can be an overestimate in some cases, as we show in Section 3.7. For completeness we state the result of Rozložník et al. in Theorem 3.7. Note that there is a typographical error in [41] for the representativity bound. The result is correctly displayed in [41, Table 7.1] and in the theorem below.

Theorem 3.7. [41, Thm 2.1] The factors $\tilde{Q}$ and $\tilde{R}$ computed by the oblique $QR$ factorization based on the symmetric eigenvalue decomposition of $A$ satisfy the following error bounds:

  $\|Z - \tilde{Q}\tilde{R}\|_2 \le cm^{5/2} u\,\kappa(A)^{1/2}\|Z\|_2,$   (3.46)
  $\|\tilde{Q}^T A \tilde{Q} - I\|_2 \le cm^{5/2} u\,\|A\|_2 \|\tilde{Q}\|_2^2 + O(u^2).$   (3.47)

3.6 Stability of Gram-Schmidt

The Gram-Schmidt algorithms are also possible algorithms for an oblique $QR$ factorization. An update step of the Gram-Schmidt algorithms computes an orthogonal projection. If these projections are computed to be $A$-orthogonal, then the computed $Q$ is $A$-orthogonal. The stability of the Gram-Schmidt algorithms was analyzed in detail in [41]. We do not provide any new analysis for these algorithms. For completeness we state the results from Rozložník et al. in this section. In the theorems below we state the results as they are stated in [41]. The bounds represent the worst-case upper bound. In three cases we provide a slight variation of the bounds: (1) the representativity error in all the Gram-Schmidt algorithms can be expressed as a componentwise bound; (2) the bound for the loss of orthogonality of CGS can be much less than the worst-case bound, and a tight bound can be achieved trivially from the previous analysis; (3) the bound for the loss of orthogonality of MGS includes a factor for the maximum growth of the ratio between the 2-norm and the $A$-norm of the updated columns, and this factor can be bounded by $\sigma_{\min}(A^{1/2}\hat{Z})^{-1}$, where $\hat{Z}$ is an orthonormal basis of $Z$. Each of these improvements is discussed following the corresponding theorem.

Classical (CGS) and modified (MGS) Gram-Schmidt both satisfy the same representativity bound, which is stated in Theorem 3.8.

Theorem 3.8. [41, Thm 3.1] The factors $\tilde{Q}$ and $\tilde{R}$ computed by either classical or modified Gram-Schmidt satisfy

  $Z = \tilde{Q}\tilde{R} + E, \quad \|E\|_2 \le cn^{3/2} u\,\big(\|Z\|_2 + \|\tilde{Q}\|_2 \|\tilde{R}\|_2\big).$   (3.48)

This is the best normwise bound one can obtain. However, the bound can also be stated as a componentwise bound, which at times can represent a much smaller error than the corresponding normwise bound. In Theorem 3.8, $E$ satisfies

  $|E| \le cn^{3/2} u\,\big(|Z| + |\tilde{Q}||\tilde{R}|\big).$   (3.49)

In Section 3.7 we show that (3.49) is tight for all of our test cases.

Theorem 3.9. [41, Thm 4.1] If $cm^{3/2} n u\,\kappa(A)\,\kappa(A^{1/2} Z)\,\kappa(Z) < 1$, then the loss of orthogonality of the computed $\tilde{Q}$ by classical Gram-Schmidt is bounded by

  $\|\tilde{Q}^T A \tilde{Q} - I\|_2 \le \frac{cm^{3/2} n u\,\|A\|_2^{1/2} \|\tilde{Q}\|_2\,\kappa(A)^{1/2}\,\kappa(Z)\,\kappa(A^{1/2} Z)}{1 - cm^{3/2} n u\,\|A\|_2^{1/2} \|\tilde{Q}\|_2\,\kappa(A)^{1/2}\,\kappa(Z)\,\kappa(A^{1/2} Z)}.$   (3.50)


The bound in Theorem 3.9 is not tight in all cases. It is easy to adjust the analysis of Rozložník et al. to provide a tight bound. Also, to simplify the equation we can change the assumption in Theorem 3.9 to $cm^{3/2} n u\,\kappa(A)\,\kappa(A^{1/2} Z)\,\kappa(Z) < 1/2$. With this assumption the denominator is bounded below by $1/2$, and this constant can be absorbed into the arbitrary constant $c$ in the numerator. With this adjustment and an improvement of the numerator, the loss of orthogonality of the computed $\tilde{Q}$ by classical Gram-Schmidt is bounded by

  $\|\tilde{Q}^T A \tilde{Q} - I\|_2 \le cm^{3/2} n u\,\|A\|_2 \|Z\|_2 \|\tilde{Q}\|_2 \|R^{-1}\|_2\,\kappa(A^{1/2} Z).$   (3.51)

The difference between (3.51) and (3.50) is the factor of $\|R^{-1}\|_2$ in (3.51), which satisfies

  $\|R^{-1}\|_2 \le \sigma_{\min}(A)^{-1/2}\,\sigma_{\min}(Z)^{-1}.$

This inequality is what gives rise to the factor $\kappa(A)^{1/2}\,\kappa(Z)$ that appears in (3.50). In Section 3.7 we show that (3.51) is tight for all of our test cases.

The last two theorems state the results for MGS and classical Gram-Schmidt with reorthogonalization (CGS2). In either case a better bound is not obtained.

Theorem 3.10. [41, Thm 3.2] If $cm^{3/2} n u\,\kappa(A)\,\kappa(A^{1/2} Z) < 1$, then the loss of orthogonality of the computed $\tilde{Q}$ by modified Gram-Schmidt is bounded by

  $\|\tilde{Q}^T A \tilde{Q} - I\|_2 \le \frac{cm^{3/2} n u\,\|A\|_2 \|\tilde{Q}\|_2\,\displaystyle\max_{i,j} \frac{\|\tilde{z}_i^{(j-1)}\|_2}{\|\tilde{z}_i^{(j-1)}\|_A}\,\kappa(A^{1/2} Z)}{1 - cm^{3/2} n u\,\|A\|_2 \|\tilde{Q}\|_2\,\displaystyle\max_{i,j} \frac{\|\tilde{z}_i^{(j-1)}\|_2}{\|\tilde{z}_i^{(j-1)}\|_A}\,\kappa(A^{1/2} Z)},$   (3.52)

where $\tilde{z}_i^{(j-1)}$ denotes the $i$-th column after $j-1$ orthogonalization steps.

Theorem 3.11. [41, Thm 5.1, 5.2] The factors $\tilde{Q}$ and $\tilde{R}$ computed by classical Gram-Schmidt with reorthogonalization satisfy

  $Z = \tilde{Q}\tilde{R} + E, \quad \|A^{1/2} E\|_2 \le cn^{3/2} u\,\|A\|_2^{1/2} \|\tilde{Q}\|_2 \|\tilde{R}\|_2,$   (3.53)
  $\|\tilde{Q}^T A \tilde{Q} - I\|_2 \le cm^{3/2} n u\,\|A\|_2 \|\tilde{Q}\|_2^2.$   (3.54)

3.7 Stability Experiments

In this section, we demonstrate that the error bounds presented are descriptive of the true error. All experiments were performed using MATLAB. It will be helpful for this section to recall the identities in Theorem 3.1. The loss of orthogonality for the most stable algorithms depends on $\|A\|_2 \|Q\|_2^2$, and the loss of representativity depends on $\|Q\|_2 \|R\|_2$. The test cases we develop represent the extreme situations of these bounds. Both of these bounds depend greatly on the column space of $Z$. The bound for the loss of orthogonality satisfies the following inequality:

  $\frac{\sigma_1(A)}{\sigma_n(A)} \le \|A\|_2 \|Q\|_2^2 \le \kappa(A).$

And the bound for the loss of representativity satisfies

  $\|Z\|_2 \le \|Q\|_2 \|R\|_2 \le \kappa(A)^{1/2} \|Z\|_2.$


Ifinsteadtheleftsingularvectorsaretheeigenvectorsassociatedwiththe n largest eigenvaluesof A ,then k Q k 2 = n A )]TJ/F20 7.9701 Tf 6.587 0 Td [(1 = 2 and k Q k 2 k R k 2 = 1 A n A 1 = 2 k Z k 2 .Ifthe largesteigenvaluesareclusteredtogether,theboundinapproximately k Z k 2 Wemayalsoattainaboundapproximately k Z k 2 ifwetaketheleftsingularvectorsof Z tobetheeigenvectorsof A associatedwith n smallesteigenvalues.Ifwealsospecify thattheleftsingularvectorof Z associatedwith 1 Z istheeigenvectorofassociated with m )]TJ/F23 7.9701 Tf 6.586 0 Td [(n +1 A ,then k Q k 2 k R k 2 = m )]TJ/F24 5.9776 Tf 5.757 0 Td [(n +1 A m A 1 = 2 k Z k 2 .Ifthesmallesteigenvaluesare clusteredtogether,theboundisapproximately k Z k 2 Ourtestcasesexhibitthedifferentsituationsdescribed.Case1wetakeleftsingularvectorsof Z tobeeigenvectorsassociatedwiththe n smallesteigenvaluesof A .This providesaworstcaseboundforthelossoforthogonalityandabestcaseboundontherepresentativityerror.Case2wetakeleftsingularvectorsof Z tobeeigenvectorsassociated withthe n largesteigenvaluesof A .Thisprovidesabestcaseboundforbotherrors.Case3 wetakeleftsingularvectorsof Z tobeeigenvectorsassociatedwiththe d n= 2 e smallestand b n= 2 c largesteigenvaluesof A .Thisprovidesaworstcaseboundforbotherrors.Case4 wetake Z toberandom.Thisexampleindicateswhattobeexpectedwhen Z and A are notcorrelated. Thecasesdescribedthusfarallowustovary A and Z independently.Forcompletenesswealsoconsiderthecasewhen Z is A -orthogonal,case5.Thiswasatestcase presentedin[41].Toconstructsucha Z ,weagaintaketheleftsingularvectorsof Z to beeigenvectorsof A .Theassociatedsingularvaluesarethentheinversesquarerootof eigenvaluesof A .Wechosetousetheeigenvectorsassociatedwith d n= 2 e smallestand b n= 2 c largesteigenvaluesof A .Otherchoicescouldhavebeenconsidered,butwefeelthis isadequatetodemonstratethebounds. 
A isconstructedfromitseigenvaluedecomposition, A = VDV T V 2 R m m isa randomorthonormalmatrixcomputedinMATLABas V = orth randn m V isxedfor eachproblem.Theeigenvaluesof A arecomputedsuchthat log d ii areevenlyspacedand 57

PAGE 69

A isthelargesteigenvalue.Thatis,if =log A = m )]TJ/F15 11.9552 Tf 12.35 0 Td [(1 ,then d ii =10 i )]TJ/F20 7.9701 Tf 6.586 0 Td [(1 Z isconstructedfromitssingularvaluedecomposition, Z = U W T U 2 R m n iseither arandomorthonormalmatrixcase4ortheappropriatecolumnsof V dependingonthe case.Thesingularvaluesarealsoconstructedso log ii areevenlyspacedand Z is thelargestsingularvalueexcludingcase5.Inallcases, W 2 R n n istakentobea randomorthonormalmatrix.Otherdistributionsonthesingularvaluesandeigenvalues whereconsidered,butthecasepresentedhereprovidesagooddemonstrationofthetest cases.Inallexperiments m =80 and n =10 .Finally,welet A varyfrom 10 to 10 15 and Z = A 1 = 2 .Thisallowsustoincludecase5andviewthecorrelateddependency. Todisplaythelossoforthogonalityweplot k I )]TJ/F22 11.9552 Tf 11.547 0 Td [(Q T AQ k 2 .Forthelossofrepresentativityweuse k Z )]TJ/F22 11.9552 Tf 11.243 0 Td [(QR k 2 = k Z k 2 orifyouareusingthe A -norm, k Z )]TJ/F22 11.9552 Tf 11.243 0 Td [(QR k A = k Z k A .Inthe stabilitysectionstheonlyalgorithmwhichweexplicitlystatethelossofrepresentativity intermsofth A -normis CHOL )]TJ/F22 11.9552 Tf 12.606 0 Td [(EQR .However,an A -normboundfortheotheralgorithmscanbereadilyattainedfromthe2-normbound.Since k Z k A = k A 1 = 2 Z k 2 k A k 1 = 2 2 k Z k 2 : Figure3.1showstherepresentativityerrorsolidlinesandboundsdashedlinesfor eachcaseandeachalgorithm.Forthebouundsweomittheconstantsthataredependent on m and n forplottingpurposes.Inthisgureweshowthecomponentwiseboundfor allalgorithmsexceptSYEV-EQRsinceweonlyhaveanormwiseboundforSYEV-EQR. CHOLQR,PRE-CHOLQR,CGS,MGS,andCGS2allshowedaconstantrelativeerrorof approximately 10 )]TJ/F20 7.9701 Tf 6.587 0 Td [(15 forboththeactualerrorandthecomponentwiseerrorbounds. 
In Figure 3.1(d) we see that SYEV-EQR is the only algorithm for which the representativity error is not constant. For cases 2, 3, and 4 the error is proportional to $\kappa(A)^{1/2}$. For cases 2 and 3, this is due to the fact that the eigenvector associated with the largest eigenvalue of $A$ is in the column space of $Z$. Case 4 shows that SYEV-EQR is not stable even for random matrices. Again, the bounds are descriptive for all cases.
Figure 3.1: Representativity error, (a) CHOLQR, (b) PRE-CHOLQR, (c) CHOL-EQR, (d) SYEV-EQR, (e) CGS, (f) MGS (figures are continued on the next page). $m = 80$ and $n = 10$.
Figure 3.1: Representativity error (continued), (g) CGS2.

In Figure 3.1(c) we see that the 2-norm bounds for CHOL-EQR are non-descriptive for some cases. The bounds for cases 2, 3, and 4 show a dependency on $\kappa(A)^{1/2}$ which is not observed in the experiments. If we instead measure the representativity error in the $A$-norm, then the bound (3.34) in Theorem 3.5 is descriptive (see Figure 3.2). The dependency on $\kappa(A)^{1/2}$ is seen in case 1, since $\||Q||R|\|_2 \approx \|Q\|_2\|R\|_2$ and therefore $\|A\|_2^{1/2}\||Q||R|\|_2 / \|Z\|_A \approx \kappa(A)^{1/2}$. For this example we do not vary $\kappa(Z)$; $\kappa(Z) = 10$.

Figure 3.3 shows the componentwise versus normwise representativity error for CHOLQR. The solid line is the true error, the dash-dotted line is the componentwise bound, and the dashed line is the normwise bound for case 3. This example shows that the normwise bound can be non-descriptive of the true error. The normwise bound for case 3 is proportional to $\kappa(A)^{1/2}$; however, the true error shows almost no dependency on $\kappa(A)$.

Case 3 introduces scaling into the columns of $Q$ and the rows of $R$. $Z$ is numerically in the span of the eigenvectors associated with the $n/2$ largest eigenvalues. Hence, the first $n/2$ columns of $Q$ are also in the span of these eigenvectors and have a 2-norm of about $\|A\|_2^{-1/2}$. The remaining columns will be in the span of the eigenvectors associated with the smallest eigenvalues and have a 2-norm of about $1/\lambda_m(A)^{1/2}$. The opposite is true of the rows of $R$: the first $n/2$ rows of $R$ have a 2-norm of about $\|A\|_2^{1/2}\|Z\|_2$, and the last rows are small relative to the first $n/2$ rows. Therefore, $\||Q||R|\|_2 \approx \|Z\|_2$.
Figure 3.2: $A$-norm representativity error for CHOL-EQR. $\kappa(Z) = 10$, $m = 80$, $n = 10$.

Figure 3.3: Componentwise versus normwise representativity bound for CHOLQR, demonstrated with case 3. $\kappa(Z) = 10$, $m = 80$, and $n = 10$.

Figure 3.4 shows the loss of orthogonality for each algorithm and each case. The most stable algorithms are PRE-CHOLQR, CHOL-EQR, SYEV-EQR, and CGS2, i.e., those with a loss of orthogonality proportional to $\|A\|_2\|Q\|_2^2$. For these algorithms the loss of orthogonality is independent of the conditioning of $Z$. Cases 1, 3, and 5 all demonstrate that the worst-case bound of $\kappa(A)$ is obtainable. Case 2 shows that the best-case bound of $\lambda_1(A)/\lambda_n(A)$ is obtainable. For our experiments this value is small, since we are using a log-linear distribution of the eigenvalues of $A$; the value could be much larger if $\lambda_1(A)$ were much larger than the other eigenvalues. We also see that for the case when $Z$ is random (case 4) the error tends to follow the case when $Z$ is in the span of the eigenvectors associated with the largest eigenvalues (i.e., the best case).

CHOLQR and CGS are very similar. The best case obtained for these algorithms is worse than any case of the most stable algorithms. Case 2 shows that the loss of orthogonality for CHOLQR and CGS is proportional to $\kappa(Z)^2$, since we know that it does not depend on the conditioning of $A$. Similarly, MGS is proportional to $\kappa(Z)$ in the best case. These observations recover what is known for the Euclidean case ($A = I$).
Figure 3.4: Orthogonality error, (a) CHOLQR, (b) PRE-CHOLQR, (c) CHOL-EQR, (d) SYEV-EQR, (e) CGS, (f) MGS (figures are continued on the next page). $m = 80$, $n = 10$.
Figure 3.4: Orthogonality error (continued), (g) CGS2.

Case 5 is a special case where all algorithms show the same error. This example shows that when $Z$ is $A$-orthogonal, the orthogonality error does not depend on the conditioning of $Z$. The reason that the error matches the worst-case bound of the most stable algorithms is that we chose the columns of $Z$ to be in the span of the eigenvectors associated with the largest and smallest eigenvalues, similar to case 1.

3.8 Communication Cost

In this section, we compare the communication cost of each algorithm in a distributed memory system. We will focus on the case when $Z$ is tall and skinny ($m \gg n$). The computational kernels of the algorithms vary significantly, from eigenvalue decompositions to Euclidean QR factorizations, Cholesky factorizations, and projections. Also, if $A$ is a general sparse matrix, the exact computational requirements of the algorithms are unknown. The varying degree of computation makes it difficult to compare the algorithms directly.

One common kernel among all the algorithms is multiplying $A$ (or a factorization of $A$) with the columns of $Z$ (or updated columns). The way in which this kernel is implemented is the main difference in the communication requirements of the algorithms. Let $\mathrm{spMM}(A,1) = f(m)\,\alpha + g(m,1)\,\beta^{-1}$ be the communication cost to multiply a sparse $m \times m$ matrix $A$ by a single column vector. The Gram-Schmidt algorithms multiply the columns of $Z$ (or updated columns) by $A$ independently. The communication cost to multiply $n$ columns
independently is
$$n \cdot \mathrm{spMM}(A,1) = n f(m)\,\alpha + n g(m,1)\,\beta^{-1}. \qquad (3.55)$$
All the algorithms other than the Gram-Schmidt ones implement the multiplication with $A$ (or a factorization of $A$) by simultaneously multiplying all the columns of $Z$. If an algorithm multiplies the columns of $Z$ by $A$ simultaneously and we assume that the $n$ columns are always communicated at the same time (this is reasonable if $n$ is small), then the communication cost is
$$\mathrm{spMM}(A,n) = f(m)\,\alpha + g(m,n)\,\beta^{-1}. \qquad (3.56)$$
Comparing Equations (3.55) and (3.56), we see that the Gram-Schmidt algorithms will communicate $n$ times as many messages. The number of words communicated depends on the algorithm that is used to perform the matrix multiplication. If entire rows of $A$ are stored on the processors, so that only $Z$ must be communicated among the processors, then $n g(m,1) = g(m,n)$ and the total volume communicated is the same. Hence, the algorithms which satisfy Equation (3.56) are considered communication avoiding algorithms, since they reduce the number of messages by a factor of $n$.

Other factors play a role in the performance of each algorithm. Algorithms that require a factorization of $A$ may be too computationally expensive to be efficient; their efficiency depends largely on the density of $A$. The algorithms CHOL-EQR, SYEV-EQR, and PRE-CHOLQR require a Euclidean QR factorization. For tall and skinny matrices, a communication avoiding algorithm, TSQR [15], can be used to reduce the communications. Finally, the communication avoiding algorithms may require more flops. For example, CHOLQR and CGS have similar stability properties. CHOLQR requires the extra computation of solving a triangular system to obtain $Q$. However, this computation does not require any communication, since we assume that each processor stores a copy of $R$, and CHOLQR requires $n$ times fewer communications than CGS. We experimentally investigate the performance of the algorithms in the following section.
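The message-count comparison between Equations (3.55) and (3.56) can be made concrete with a small sketch. Here $f(m)$ and $g(m,n)$ are left as opaque inputs, since their actual values depend on the sparsity of $A$ and the data distribution; the struct and function names are ours, not from the thesis code:

```c
/* Communication cost (message count, word count) of the multiply kernel. */
typedef struct { double messages; double words; } comm_cost;

/* Eq. (3.56): all n columns of Z are multiplied and communicated at once. */
comm_cost blocked_cost(double f_m, double g_mn) {
    comm_cost c = { f_m, g_mn };
    return c;
}

/* Eq. (3.55): the Gram-Schmidt algorithms multiply the n columns
   independently, paying the latency term n times. */
comm_cost gram_schmidt_cost(int n, double f_m, double g_m1) {
    comm_cost c = { n * f_m, n * g_m1 };
    return c;
}
```

When rows of $A$ are stored locally so that $n\,g(m,1) = g(m,n)$, the two variants move the same total volume but differ by a factor of $n$ in message count, which is exactly the communication-avoiding argument made above.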
3.9 Performance Experiments

In this section, we consider the performance of the algorithms for tall and skinny matrices ($m \gg n$). Our experiments were performed on an Intel cluster. This machine has 24 compute nodes, with each node consisting of two Intel Xeon E5-2670 Sandy Bridge processors running at 3.3 GHz. Thus, each node is a symmetric multiprocessor with 16 cores and 64 GB of memory. The theoretical peak is 26.4 GFlops/s per core, or 422.4 GFlops/s per node. Each node also has two NVIDIA Tesla M2090 GPUs; however, these are not used in the experiments. The nodes are connected with a QDR Infiniband interconnect at 40 Gbit/s. For multi-threaded experiments we link with the MKL optimized BLAS and LAPACK libraries.

The performance of each algorithm depends on the density of $A$. To compare the extreme cases we consider the case where $A$ is dense and the case where $A$ is tridiagonal. In both cases we use a normalized FLOP count to plot the FLOP rate. When $A$ is a dense matrix we use $2m^2n + 2mn^2$, which is the number of FLOPs for CHOLQR. When $A$ is a tridiagonal matrix we use $2mn^2$, which is again the number of FLOPs for CHOLQR, since the multiplication with $A$ is $O(mn)$. The time for each algorithm was taken to be the minimum of 10 experiments.

In Figure 3.5, we show the performance for a single node with 16 threads. In Figure 3.5(a), $A$ is a dense matrix and $m$ is fixed at 10000. In this experiment, both CHOLQR algorithms greatly outperform the other algorithms. CHOLQR reaches a peak performance of 270 GFLOPs/s, and PRE-CHOLQR peaks at about 200 GFLOPs/s. None of the Gram-Schmidt algorithms were above 10 GFLOPs/s. We did not test the CHOL-EQR algorithm, since the $O(m^3)$ cost of the Cholesky factorization of a dense matrix would have been much too costly compared to the other algorithms.

In Figure 3.5(b), $A$ is a tridiagonal matrix and $m$ is fixed at 100000. Again, CHOLQR outperforms all other algorithms. We do not show the peak performance of CHOLQR (which is about 85 GFLOPs/s) in this figure so we can compare PRE-CHOLQR with the
Figure 3.5: Performance results on 1 node with 16 threads: (a) $A$ is a dense matrix with $m = 10000$; (b) $A$ is a tridiagonal matrix with $m = 100000$.

other algorithms. For small $n$, CGS is able to outperform PRE-CHOLQR, reaching a peak performance of about 7 GFLOPs/s. However, as $n$ increases, PRE-CHOLQR continues to improve and eventually outperforms all the Gram-Schmidt algorithms. The Cholesky factorization of a tridiagonal matrix is inexpensive, and we see for this example that the CHOL-EQR algorithm is a viable option. Its performance is fairly similar to that of PRE-CHOLQR. This is because, for small $n$, both algorithms are dominated by the Euclidean QR factorization kernel. As $n$ increases, the cost of the Cholesky factorization of $A$ and an extra sparse matrix multiplication becomes a larger overhead than solving the normal equations.

A common kernel among the algorithms is multiplying $A$ (or a factorization of $A$ for CHOL-EQR) by a vector. The best algorithms are able to utilize level 3 BLAS kernels to perform the operation on multiple vectors at a time. These algorithms include CHOLQR, PRE-CHOLQR, and CHOL-EQR (SYEV-EQR as well, but it was not implemented). The Gram-Schmidt algorithms are only able to utilize the less efficient level 2 BLAS kernels.

Comments. In both plots, there are two implementations of MGS. The difference between the two is the order in which the entries of $R$ are computed. In MGS-col, the entries of $R$ are computed one column at a time. To do this, the computation of the inner product (which includes a multiplication with $A$) is inside the innermost loop. The cost of MGS-col is $O(m^2n^2)$ for dense matrices. However, we can also compute the entries of $R$ a row
at a time. By doing this, the multiplication with $A$ can be brought outside the innermost loop, and the cost of this algorithm is the same as that of CGS. The two implementations are equivalent when $A = I$.

PRE-CHOLQR and CHOL-EQR both lose significant performance for $n = 70$ to $n = 120$ when $A$ is tridiagonal. We cannot explain this loss in performance at this time.

3.10 Conclusion

In this chapter, we presented a stability analysis of many well-known algorithms for performing an oblique QR factorization. We also introduced a new algorithm, PRE-CHOLQR. We find that the most stable algorithms have a loss of orthogonality that is proportional to $\|A\|\|Q\|^2$, which is equivalent to the multiplication error in forming the orthogonality check, $\|I - Q^T A Q\|$. We conclude that this is a best-case bound. We also provide a set of test cases to assess the tightness of the bounds obtained. From these tests we see that the bounds are tight in all cases.

The new algorithm, PRE-CHOLQR, is stable and can significantly outperform the existing stable algorithms, which are CGS2, CHOL-EQR, and SYEV-EQR. CGS2 communicates $n$ times as many messages for the multiplication of $A$ with the columns of $Z$; this increase in communication is observed to decrease the performance of CGS2. Both CHOL-EQR and SYEV-EQR require a factorization of the inner product matrix. For dense matrices the factorization is too expensive for either algorithm to be efficient. For sparse matrices, factoring $A$ may be a viable option.
4. A Greedy Algorithm for Optimally Pipelining a Reduction

4.1 Introduction

Applications relying on parallel distributed communication require the use of collective communications. In this chapter, we focus on the reduction operation (MPI_Reduce), in which each process holds a piece of data and these pieces of data need to be combined, using an associative operation, to form the result on the root process. We present two new algorithms for performing a reduction. The operation associated with our reduction needs to be associative and commutative. The two algorithms are developed under two different communication models (unidirectional and bidirectional). Both algorithms use a greedy scheduling scheme.

Collective communications like reduction can often be the bottleneck of massively parallel application codes; therefore, optimizing collective communications can greatly improve the performance of such application codes. The performance of a reduction algorithm is highly dependent on the machine parameters (e.g., network topology) and the underlying architecture; this makes the systematic optimization of a collective communication for a given machine a real challenge. For a unidirectional, fully connected network, we prove that our greedy algorithm is optimal when some realistic assumptions are respected. Previous algorithms fit the same assumptions but are only appropriate for some given configurations; our algorithm is optimal for all configurations. We note that there are some configurations where our greedy algorithm significantly outperforms any existing algorithm. This result represents a contribution to the state of the art. For a bidirectional, fully connected network, we present a different greedy algorithm. We verify by experimental simulations that our algorithm matches the time complexity of an optimal broadcast with the addition of the computation. Besides reversing an optimal broadcast algorithm, the greedy algorithm is the first known reduction algorithm to experimentally attain this time complexity. Simulations show that this greedy algorithm performs well in practice, outperforming the state-of-the-art reduction algorithms. With respect to practical application,
much work is still needed in terms of auto-tuning and configuring for various architectures (nodes of multi-core, multi-port networks, etc.).

In the unidirectional context, we compared the greedy algorithm (uni-greedy) with three standard algorithms (binomial, pipeline, and binary), using a linear model to represent the point-to-point cost of communicating between processors. Unlike the standard algorithms, no closed-form expression exists for the completion time of the greedy algorithm; therefore, simulations are used to compare the algorithms. While we know that our algorithm is optimal in this context, these simulations indicate the performance gain of the new algorithm over the classic ones. In particular, for mid-size messages, the new algorithm is about 50% faster than the pipeline, binary tree, or binomial tree algorithms.

In the bidirectional context, we again compare the greedy algorithm (bi-greedy) with the three standard algorithms, as well as with a reduce-scatter/gather (butterfly) algorithm. For the standard algorithms we see similar results as in the unidirectional case. The butterfly algorithm performs well for mid-size messages, but has poor asymptotic behavior.

In the experimental context, we have implemented the binomial, pipeline, uni-greedy, and bi-greedy algorithms using OpenMPI version 1.4.3; all algorithms are implemented using point-to-point MPI functions: MPI_Send and MPI_Recv, or MPI_Sendrecv. Moreover, the OpenMPI library provides a state-of-the-art implementation of the reduction. We also compare with an implementation of the butterfly algorithm. Numerical comparisons of the greedy algorithm to OpenMPI's built-in function MPI_Reduce, as well as to the binomial and pipeline algorithms, show that the greedy algorithm is the best for medium-size messages, confirming that the results found theoretically apply in our experimental context. However, the more simplistic algorithms, binomial and pipeline, perform better for small and large messages, respectively. Finally, the implementation of the butterfly algorithm exhibits similar results as in the theoretical context.

Finally, the idea of unequal segmentation is considered. Typically, when a message is split into segments during a reduction operation, the segments are assumed to be of equal
size. We investigate whether the greedy and pipeline algorithms can be improved by allowing segments to have unequal sizes. It turns out that, for some parameter values, the greedy algorithm in a unidirectional system was optimized by unequal segmentations. This indicates that removing the equal-segmentation assumption from our theory can lead to better algorithms. However, the gains obtained were marginal. We also note that the pipeline algorithm was always optimized by equal segmentation.

4.2 Model

Unidirectional and bidirectional systems are two ways to describe how processors are allowed to communicate with each other. In a unidirectional system, a processor can only send a message or receive a message at a given time, but not both. A bidirectional system assumes that, at a given time, a processor may receive a message from a processor and simultaneously send a message to another (potentially different) processor. We assume a unidirectional system for the initial sections and the optimality proof (Sections 4.4 and 4.5). In Section 4.6 we adapt the algorithm to a bidirectional system.

The parallel system contains $p$ processors indexed from $0$ to $p-1$. Additionally, we will assume the system is fully connected (every processor has a direct connection to all other processors) and homogeneous. These assumptions are not verified in practice on current architectures; however, they represent a standard theoretical framework, and our experiments indicate that they are valid enough in practice to develop useful algorithms.

A linear model (Hockney [26]) is used to represent the cost of a point-to-point communication. The time required for each communication is given by $\alpha + \beta^{-1}m$, where $\alpha$ is the latency (startup time), $\beta$ is the bandwidth (number of words per unit time), and $m$ is the size of the message. Since we are performing a reduction, computations are also involved, so we will also assume that the time for a computation follows a linear model and is given by $\gamma^{-1}m$, where $\gamma$ is the computational rate. We will investigate the theoretical case when $\gamma = \infty$ (i.e., very fast processing units) as well as when $\gamma$ is finite. Furthermore, we will assume that communication and computation cannot overlap.
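The linear model of this section is easy to encode directly. The following minimal sketch (the struct and function names are ours) makes the two cost terms explicit:

```c
/* Hockney-style linear model of Section 4.2: sending m words costs
   alpha + m/beta, and combining m words costs m/gamma. Communication and
   computation are assumed not to overlap, so their costs simply add. */
typedef struct { double alpha, beta, gamma_rate; } machine;

double t_comm(const machine *mc, double m) { return mc->alpha + m / mc->beta; }
double t_comp(const machine *mc, double m) { return m / mc->gamma_rate; }
```

For example, with $\alpha = 10$, $\beta = 1$, and $\gamma = 2$, a 100-word message costs 110 time units to send and 50 time units to combine.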
To achieve better performance, the message can be split into $q$ segments of sizes $s_1, \ldots, s_q$. In Section 4.5 we show the greedy algorithm is optimal for any segmentation. To obtain the best performance of a given algorithm, an optimal segmentation is determined. In Section 4.5.2 we restrict the optimization to equi-segmentation ($s_i = s$ for all $i$, except that the last segment may possibly be smaller).

4.3 Collective Communications and Related Work

A collective communication operation is an operation on data distributed between a set of processors (a collective). Examples of collective communications are broadcast, reduce (all-to-one), allreduce (all-to-all), scatter, and gather.

Broadcast: Data on one processor (the root) is distributed to the other processors in the collective.

Reduce: Each processor in a collective has data that is combined entry-wise, and the result is stored on the root processor.

AllReduce: Same as reduce, except that all processors in the collective contain a copy of the result.

Scatter: Data that is on the root processor is split into pieces, and the pieces are distributed between the processors in the collective. The result is that each processor (including the root processor) contains a piece of the original data.

Gather: The reverse operation of a scatter.

Optimizing the reduction operation is closely related to optimizing the broadcast operation: any broadcast algorithm can be reversed to perform a reduction. Bar-Noy et al. [2] and Träff and Ripke [49] both provide algorithms that produce an optimal broadcast schedule for a bidirectional system. Here the messages are split into segments and the segments are broadcast in rounds. In both cases the optimality is in the sense that the algorithm meets the lower bound on the number of communication rounds. For the theoretical case when $\gamma = \infty$, reversing an optimal broadcast will provide an optimal reduction. However,
for $\gamma < \infty$ this is no longer valid, as an optimal schedule will most likely take the computation into account.

Rabenseifner [39] provides a reduce-scatter/gather algorithm (butterfly) which provides optimal load balancing to minimize the computation. Rabenseifner does not predefine a segmentation of the message, but rather uses the techniques of recursive halving and recursive doubling. The algorithm is done in two phases. In the first phase the message is repeatedly halved in size and exchanged among processes; at the end of the first phase the final result is distributed among all the processors. This phase is known as a reduce-scatter. The second phase gathers the results, recursively doubling the size of the message. In [40] Rabenseifner and Träff improve on the algorithm for a non-power-of-two number of processors.

Sanders et al. [42] introduce an algorithm to schedule a reduction or broadcast using two binary trees. The authors notice that two binary trees can be used simultaneously, effectively reducing the bandwidth term by a factor of two compared to a single binary tree. The time complexity approaches the lower bound for large messages. However, for small messages the latency term is twice as large as optimal. Also, the time complexity is never better than that of a reversed optimal broadcast.

Other models have been used to describe more complex machine architectures. Heterogeneous networks, where the processors are composed of different communication and computation rates, have been considered. Beaumont et al. [3, 4] consider optimizing a broadcast, and Legrand et al. [30] consider optimizing scatter and reduce. In these papers, the problem is formulated as a linear program and solved to maximize the throughput (number of messages broadcast per time unit). Higher-dimensional systems have also been considered [9]; here a processor can communicate with more than one processor at a time.

The machine parameters and architecture vary between machines, and one algorithm may perform better on one machine than on another. Accurately determining machine parameters can be a difficult task. In practice, auto-tuning for each machine is required to obtain a well-performing algorithm. Vadhiyar et al. [50] discuss how to experimentally determine the optimal algorithm. Pjesivac-Grbović et al. [38] compare various models that can be used to auto-tune a machine. These models are more complex than the Hockney model and can provide a more accurate model of the communication cost. However, a linear model provides a good basis for theoretically comparing algorithms. Other models, such as LogP, LogGP, and PLogP, each provide useful information to help auto-tune a machine.

4.4 Standard Algorithms

Before introducing the greedy algorithms it is helpful to review three standard algorithms: binomial, pipeline, and binary.

Since we are using a unidirectional system, we will call a processor either a sending processor or a receiving processor. Sending and receiving processors are paired together to perform a reduction. After a receiving processor receives a segment, it combines its segment with the segment received. After a processor sends a segment, it is said to be reduced for that segment. For the reduction to be complete, each processor except the root must be reduced for each segment. The three algorithms are described below.

Binomial. At a given time, suppose there are $n$ processors left to be reduced for a message; then $\lfloor n/2 \rfloor$ processors are assigned as sending processors and $\lfloor n/2 \rfloor$ are assigned as receiving processors. Afterwards there will be $\lceil n/2 \rceil$ processors left for reduction, and the process is repeated until $n = 1$. Segmenting the message would only increase the latency of the communication time, since all processors must be finished before the next segment can be started; therefore, we do not consider segmentation.

Pipeline. At a given time, suppose that processors $k+1$ to $p-1$ have been reduced for segment $s_i$. Processor $k$ is assigned as a sending processor for segment $s_i$, and processor $k-1$ is assigned as a receiving processor for segment $s_i$.
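The binomial halving above takes $\lceil \log_2 p \rceil$ rounds; a direct round count, iterating the $\lceil n/2 \rceil$ recurrence rather than calling a logarithm, can be sketched as:

```c
/* Round count of the binomial reduction: each round retires floor(n/2)
   sender/receiver pairs, leaving ceil(n/2) processors, until only the
   root remains. The result equals ceil(log2(p)). */
int binomial_rounds(int p) {
    int rounds = 0;
    for (int n = p; n > 1; n = (n + 1) / 2)   /* ceil(n/2) survive a round */
        rounds++;
    return rounds;
}
```

For example, $p = 5$ shrinks as $5 \to 3 \to 2 \to 1$, giving $3 = \lceil \log_2 5 \rceil$ rounds.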
Table 4.1: Time analysis for the standard algorithms in a unidirectional system, where $N = \lceil \log_2(p+1) \rceil$, $s_{\mathrm{opt}}$ is the optimal equi-segment size, and $T_{\mathrm{opt}}$ is the time for the algorithm at $s_{\mathrm{opt}}$. Formulae are valid for $p > 3$.

Binomial:
  Time: $\lceil \log_2 p \rceil \left( \alpha + \beta^{-1} m + \gamma^{-1} m \right)$

Pipeline:
  Time: $(p-1)\left( \alpha + \beta^{-1} s + \gamma^{-1} s \right) + 2(q-1)\left( \alpha + \beta^{-1} s + \gamma^{-1} s \right)$
  $s_{\mathrm{opt}}$: $\left( \dfrac{2m\alpha}{(p-3)\left(\beta^{-1}+\gamma^{-1}\right)} \right)^{1/2}$
  $T_{\mathrm{opt}}$: $\left[ \left((p-3)\alpha\right)^{1/2} + \left(2m\left(\beta^{-1}+\gamma^{-1}\right)\right)^{1/2} \right]^2$

Binary:
  Time: $2\left(\lceil \log_2(p+1) \rceil - 1\right)\left( \alpha + \beta^{-1} s + \gamma^{-1} s \right) + 4(q-1)\left( \alpha + \beta^{-1} s + \gamma^{-1} s \right)$
  $s_{\mathrm{opt}}$: $\left( \dfrac{2m\alpha}{(N-3)\left(\beta^{-1}+\gamma^{-1}\right)} \right)^{1/2}$
  $T_{\mathrm{opt}}$: $2\left[ \left((N-3)\alpha\right)^{1/2} + \left(2m\left(\beta^{-1}+\gamma^{-1}\right)\right)^{1/2} \right]^2$

Binary. At a given time, suppose there are $2^n - 1$ processors left to be reduced for segment $s_i$; then $2^{n-1}$ processors are assigned as sending processors and $2^{n-2}$ processors are assigned as receiving processors. If $p \ne 2^n - 1$ for any $n$, then the initial step of the algorithm reduces the extra processors so that in the next step there are $2^n - 1$ processors, for some $n$, left for reduction. Since there are twice as many sending processors as receiving processors, two processors will send to the same receiving processor, and the time for one step of the binary algorithm is $2\left(\alpha + \beta^{-1} s + \gamma^{-1} s\right)$.

Given $p$, $\alpha$, $\beta$, $\gamma$, $m$, and $s$, the time for a reduction can be calculated for each algorithm. Using these formulae, the optimal segment size ($s_{\mathrm{opt}}$) and optimal time ($T_{\mathrm{opt}}$) can also be calculated. When calculating the optimal segment size, we assume that the segments are of equal size. A summary of these values is given in Table 4.1. It is unknown what the optimal segmentation would be if the segments were of unequal size. The formulae for time are upper bounds in the case when $m$ is not divisible by $s$. These formulae differ from the formulae that appear in [49, 38] by a factor of 2 in the bandwidth term, since those authors assume a bidirectional system; in a unidirectional system there is extra lag time.

To better understand the algorithms, it is helpful to know the lower bounds for each quantity: latency, bandwidth, and computation. The lower bounds for any reduction algorithm, as well as for other collective communications, are discussed in [8]. The following rationales are used to obtain the lower bounds.
Latency. Every processor in a collective has at least one message that must be communicated. At any time step, at most half of the remaining processors can send their messages to another processor to perform the reduction. This gives the lower bound $\lceil \log_2 p \rceil \alpha$.

Bandwidth. Consider the case when $p = 3$. Each processor must be involved in a communication for each element of the message. When two processors are involved in a communication, the third processor is idle. Therefore, two communications must be performed for each element of the message, and they cannot occur simultaneously. This gives the lower bound $2m\beta^{-1}$. For $p > 3$, the lower bound holds, since a message of the same size on more processors cannot be communicated faster. $p = 2$ is a special case, with a bandwidth term of $m\beta^{-1}$.

Computation. Each element of the message must be involved in $p-1$ computations. Distributing the computations evenly among all the processors gives the lower bound $\frac{p-1}{p}\, m\, \gamma^{-1}$.
Table 4.2: Lower bounds for each term in the time for a reduction, as well as the lower bounds for each standard algorithm in a unidirectional system, where $N = \lceil \log_2(p+1) \rceil$. Formulae are valid for $p > 2$.

            Latency                         Bandwidth                             Computation
Reduce      $\lceil\log_2 p\rceil\,\alpha$  $2m\,\beta^{-1}$                      $\frac{p-1}{p}\,m\,\gamma^{-1}$
Binomial    $\lceil\log_2 p\rceil\,\alpha$  $\lceil\log_2 p\rceil\,m\,\beta^{-1}$ $\lceil\log_2 p\rceil\,m\,\gamma^{-1}$
Pipeline    $(p-1)\,\alpha$                 $(p+2m-3)\,\beta^{-1}$                $(p+2m-3)\,\gamma^{-1}$
Binary      $2(N-1)\,\alpha$                $2(N+2m-3)\,\beta^{-1}$               $2(N+2m-3)\,\gamma^{-1}$
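The "Reduce" row of Table 4.2 can be encoded directly. A sketch follows (function names are ours); note that the three terms are independent lower bounds, not an additive total:

```c
/* Lower bounds for any reduction (Table 4.2): latency, bandwidth, and
   computation are bounded separately; no algorithm need attain all three. */

/* ceil(log2(p)) computed without floating point, for p >= 1. */
static int ceil_log2(int p) {
    int r = 0;
    for (int n = 1; n < p; n *= 2) r++;
    return r;
}

double lb_latency(int p, double alpha) {
    return ceil_log2(p) * alpha;
}
double lb_bandwidth(int p, double m, double beta) {
    return (p == 2 ? 1.0 : 2.0) * m / beta;   /* p = 2 is the special case */
}
double lb_computation(int p, double m, double gamma_rate) {
    return ((double)(p - 1) / (double)p) * m / gamma_rate;
}
```

For instance, with $p = 3$, $m = 50$, and $\beta = 1$, the bandwidth bound is $2m\beta^{-1} = 100$, twice the $p = 2$ value.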
Given the machine parameters $\alpha$ and $\beta$, one can select the algorithm that provides the best performance. In practice, the latency is much larger than the inverse bandwidth ($\alpha \gg \beta^{-1}$). For small messages ($\alpha \gg \beta^{-1}m$), the time for a reduction is dominated by the latency. The binomial algorithm minimizes the latency term and will provide the best performance for small messages. For large messages ($\alpha \ll \beta^{-1}m$), the time for a reduction is dominated by the bandwidth. The pipeline algorithm provides the smallest bandwidth term and will therefore give the best performance for large messages. For medium-size messages, the binary algorithm will provide the best performance. The binary algorithm has both of the qualities that give binomial and pipeline their good lower bounds for latency and bandwidth, respectively, differing only by a factor of two.

The values of $m$ that are considered small and large depend on the machine parameters $\alpha$, $\beta$, and $p$. Figure 4.1 shows which algorithm is best for $\alpha = 10$, $\beta = 1$, $\gamma = \infty$ as a function of the number of processors $p$ and the message size $m$. Here we are assuming that there is no time required for the computation ($\gamma = \infty$).

Figure 4.1: Regions where binomial, pipeline, or binary is better in terms of the number of processors $p$ and the message size $m$. For each algorithm, each $p$, and each $m$, the optimal segment size $s_{\mathrm{opt}}$ is obtained from the formulae given in Table 4.1. The machine parameters are $\alpha = 10$, $\beta = 1$, $\gamma = \infty$.
4.5 A Greedy Algorithm for a Unidirectional System

When a message is split into segments, a reduction algorithm consists of scheduling $(p-1)q$ send/receive pairs (one for each segment of each non-root processor). For a segment $i$ of size $s_i$, the sending processor in a communication will be available after $t_{\mathrm{comm}}(i) = \alpha + \beta^{-1}s_i$, and the receiving processor will be available after $t_{\mathrm{comm}}(i) + t_{\mathrm{comp}}(i)$, where $t_{\mathrm{comp}}(i) = \gamma^{-1}s_i$. We add an extra requirement: segment $i$ must be reduced on a processor before that processor can be involved in a reduction for segment $j$, $j > i$. Binomial, binary, and pipeline all satisfy this requirement and are therefore algorithms we optimize over. This requirement provides two benefits. First, since the schedule that is generated for a given number of processors and number of segments will vary, we must log the current state on each processor; with this requirement we only need a small array with a length equal to the number of processors. Second, if a processor had started but not been reduced for multiple segments, then each segment's current result would have to be stored on the processor, requiring additional storage space on that processor.

Since we assume that each processor must reduce the segments in order, a reduction can be scheduled by considering one segment at a time. Let $I$ be a set of $n$ processors that still need to send a given segment (or, for the root, receive the final message). Let $x_i$ be the time at which processor $i$ completed its previous task. We define the state of these processors as
$$X(n) = \{ x_i \mid i \in I \}.$$
The state of the processors that have sent the segment is $S(n)$. The processors represented in $X(n)$ and $S(n)$ are disjoint, and their union is the set of all processors. The permutation $P(X(n))$ orders the elements of $X(n)$ from smallest to largest. The time complexity of an algorithm is given by the final state of the root processor.

A schedule for a send/receive pair is given by
$$\left(X(n-1,k),\, S(n-1,k)\right) = \mathrm{reduce}\left(X(n,k),\, S(n,k),\, a,\, b,\, t\right).$$
Processor $a$ sends segment $k$ to processor $b$ starting at time $t$. $X(n-1,k)$ is the updated state of the processors that still need to send segment $k$, and $S(n-1,k)$ is the updated state of the processors that have sent segment $k$. The state of processor $a$ is removed from $X(n,k)$ and its updated state is added to $S(n-1,k)$. The state of processor $b$ is initially in $X(n,k)$, and its updated state is in $X(n-1,k)$, since processor $b$ still needs to send segment $k$. We will assume that successive send/receive pairs are scheduled in the order in which the communications are started. This way, the state of a processor that is not a member of the send/receive pair, but whose current state is less than $t$, is set equal to $t$.

Lemma 4.1. $\mathrm{reduce}(X(n), S(n), a, b, t)$, where $P(X(n)) = (x_{g_1}, x_{g_2}, \ldots, x_{g_n})$, is a valid send/receive pair if and only if (i) $x_a \le t$ and $x_b \le t$; (ii) $a \ne 0$; and (iii) $t \ge x_{g_2}$.

Proof. (i) requires that processors $a$ and $b$ have completed their previous task before $t$. (ii) states that the root can never be a sending processor. And (i) implies (iii).

A greedy algorithm is obtained by scheduling every send/receive pair using $t = x_{g_2}$ in Lemma 4.1. When a segment is done being scheduled, the final state of the processors is used as the initial state for the next segment. Algorithm 4.1 produces the unidirectional greedy (uni-greedy) schedule. A proof of optimality is given in Section 4.5.1.

We give the C code for the uni-greedy algorithm on the next page in Figure 4.2. In this code, we are able to generate the greedy schedule dynamically during the reduction, rather than precomputing the schedule with Algorithm 4.1. The algorithm assumes that MPI_Op is commutative (and associative as well, of course). The algorithm allocates two extra buffers of segment size. The variable s is the size of a segment and needs to be initialized (and, if possible, tuned) in advance. The implementation is restricted to root being


Figure 4.2: C code for the uni-greedy reduction algorithm. The algorithm assumes that MPI_Op is commutative (and associative as well, of course). The algorithm allocates two extra buffers of size segment size. The global variable global_s is the size of a segment and needs to be initialized (if possible, tuned) in advance. The implementation is restricted to root being 0, MPI_Datatype being int, and MPI_Op being +. These restrictions are not a consequence of the algorithm and can be removed.


Algorithm 4.1: Uni-greedy Schedule
1  $X^p(1) = \{0, \ldots, 0\}$, $S^p(1) = \emptyset$
2  for $i = 1$ to $q$ do
3      for $j = p$ downto $2$ do
4          $P(X^j(i)) = (x_{g_1}, x_{g_2}, \ldots, x_{g_j})$
5          $t = x_{g_2}$
6          $(X^{j-1}(i), S^{j-1}(i)) = \mathrm{reduce}(X^j(i), S^j(i), g_1, g_2, t)$
7      $X^p(i+1) = X^1(i) \cup S^1(i)$

Figure 4.3: Uni-greedy algorithm for 15 processors and 5 equi-segments for $t_{comm} = 2$ and $t_{comp} = 1$. Each color represents a different segment. Filled-in circles represent receiving processors, open circles represent sending processors, and hexagons represent computation.

0, MPI_Datatype being int, and MPI_Op being +. These restrictions are not a consequence of the algorithm and can be removed. In the next section we analyze the theoretical time for the greedy algorithm for various values of $\alpha$, $\beta$, and $\gamma$.

No closed-form expression for the time complexity of the uni-greedy algorithm exists like those given in Table 4.1 for the standard algorithms. Figure 4.3 gives an example for 15 processors and 5 equi-segments with $t_{comm} = 2$ and $t_{comp} = 1$. Time 12 to 15 provides an example where, if we allowed segments to be reduced out of order, the uni-greedy algorithm would not be optimal. Without this restriction processor 8 could send segment 4 to processor 0 and the overall time complexity could be reduced by 1. We choose to omit the figure that shows the improvement.
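The schedule of Algorithm 4.1 can be reproduced with a small simulation. The sketch below is an assumption-laden reading of the description above (the sender is always the non-root member of the two smallest-state processors, and idle states below $t$ are raised to $t$); the function name `unigreedy_time` is ours, not the thesis code's. With a single segment the greedy schedule coincides with binomial, so for $p = 16$ and $t_{comm} = 2$, $t_{comp} = 1$ it should finish at $\lceil \log_2 16 \rceil \cdot 3 = 12$.

```c
#include <stdlib.h>

/* Simulate the uni-greedy schedule (Algorithm 4.1) and return the time at
 * which the root (processor 0) holds the fully reduced message.  t_comm[k]
 * and t_comp[k] are the per-segment costs (t_comm = alpha + s/beta,
 * t_comp = s/gamma in the model of Section 4.5). */
static double unigreedy_time(int p, int q,
                             const double *t_comm, const double *t_comp)
{
    double *x = calloc(p, sizeof *x);        /* completion-time states     */
    int *active = malloc(p * sizeof *active);
    for (int seg = 0; seg < q; seg++) {
        int n = p;
        for (int i = 0; i < p; i++) active[i] = 1;
        while (n >= 2) {
            /* two smallest states among processors still holding segment */
            int g1 = -1, g2 = -1;
            for (int i = 0; i < p; i++) {
                if (!active[i]) continue;
                if (g1 < 0 || x[i] < x[g1]) { g2 = g1; g1 = i; }
                else if (g2 < 0 || x[i] < x[g2]) g2 = i;
            }
            double t = x[g2];                /* greedy choice t = x_{g2}   */
            int snd = (g1 == 0) ? g2 : g1;   /* the root never sends       */
            int rcv = (g1 == 0) ? g1 : g2;
            for (int i = 0; i < p; i++)      /* raise idle states to t     */
                if (active[i] && x[i] < t) x[i] = t;
            x[snd] = t + t_comm[seg];        /* sender is done with seg    */
            active[snd] = 0;
            x[rcv] = t + t_comm[seg] + t_comp[seg];
            n--;
        }
        /* x[] carries over as the initial state of the next segment */
    }
    double root = x[0];
    free(x); free(active);
    return root;
}
```

With two processors and two unit segments the simulation pipelines, finishing at $2(t_{comm}+t_{comp}) = 6$.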


4.5.1 Proof of Optimality

In this section, we prove that the uni-greedy algorithm is optimal. The proof is a variation of the proof in [14]. Using the notation as in Lemma 4.1, we define
$$\mathrm{optreduce}(X^n) = \mathrm{reduce}(X^n, S^n, g_1, g_2, x_{g_2}).$$
We also define the partial ordering on $\mathbb{R}^n$:
$$X \preceq Y \iff x_i \le y_i,\quad 1 \le i \le n.$$

Lemma 4.2. Let $X \in \mathbb{R}^n$ and $Y \in \mathbb{R}^n$ be nondecreasing such that $X \preceq Y$, and let $a \in \mathbb{R}$ and $b \in \mathbb{R}$ such that $a \le b$. If $x_k \le a \le x_{k+1}$ and $y_l \le b \le y_{l+1}$, then
$$(x_1, \ldots, x_k, a, x_{k+1}, \ldots, x_n) \preceq (y_1, \ldots, y_l, b, y_{l+1}, \ldots, y_n).$$

Proof. For all $i$ such that $i \le \min(k, l)$, $x_i \le y_i$. For all $i$ such that $i \ge \max(k, l) + 1$, the entries on both sides are shifted by one position, and $x_{i-1} \le y_{i-1}$. If $k = l$, then $a \le b$ is the only other comparison needed. If $k < l$, then $a \le x_{k+1} \le y_{k+1}$, and for $k + 1 < i \le l$, $x_{i-1} \le y_{i-1} \le y_i$. If $k > l$, then at position $l+1$ we have $x_{l+1} \le x_k \le a \le b$; for $l + 1 < i \le k$ we have $x_i \le x_k \le a \le b \le y_{l+1} \le y_{i-1}$; and at position $k+1$ we have $a \le b \le y_{l+1} \le y_k$. In every case the inserted vectors satisfy the ordering.

Lemma 4.3. Suppose $P(X^n) \preceq P(Y^n)$. Let $(\bar X, \bar S) = \mathrm{optreduce}(X^n)$, and let $(\bar Y, \bar R) = \mathrm{reduce}(Y^n, R^n, a, b, t)$ be any valid send/receive pair. Then $P(\bar X) \preceq P(\bar Y)$ and $P(\bar S) \preceq P(\bar R)$.

Proof.

We will show that $\bar X \preceq \bar Y$. Let $x_{g_k}$ be the smallest state greater than $x_{g_2}$ and $y_{h_l}$ be the smallest state greater than $t$. Then
$$\bar X = (x_{g_2}, \ldots, x_{g_2}, x_{g_k}, \ldots, x_{g_n}) \quad\text{and}\quad \bar Y = (t, \ldots, t, y_{h_l}, \ldots, y_{h_n}).$$
For all $i$ such that $i < \min(k, l)$, $x_{g_2} \le y_{h_2} \le t$. For all $i$ such that $i \ge \max(k, l)$, $x_{g_i} \le y_{h_i}$. If $k \le l$, then for all $i$ such that $k \le i < l$, $x_{g_i} \le y_{h_i} \le t$. If $k > l$, then for all $i$ such that $l \le i < k$, $x_{g_2} \le t < y_{h_i}$. Hence $\bar X \preceq \bar Y$. The updated states of the two senders satisfy $x_{g_2} + t_{comm} \le t + t_{comm}$, so $\bar S \preceq \bar R$ as well, and the same bound applies to the two receivers.

An iteration applies send/receive pairs to a segment until every non-root processor has sent it; the time complexity of an algorithm is then determined by the states resulting from each step and the state of the root processor after finishing the final computation. Let $X^p(k+1) = \mathrm{iter}(X^p(k))$ be an iteration that applies any set of reduce operations.

Lemma 4.4. If $P(X^p(k)) \preceq P(Y^p(k))$, $X^p(k+1) = \mathrm{optiter}(X^p(k))$, and $Y^p(k+1) = \mathrm{iter}(Y^p(k))$, then $P(X^p(k+1)) \preceq P(Y^p(k+1))$.

Proof. By Lemma 4.3 the state vectors of the remaining processors at every step will satisfy the conditions of Lemma 4.3. Hence, we can conclude that $S(k) \preceq R(k)$ and $X(k) \preceq Y(k)$, where $(X(k), S(k)) = \mathrm{optreduce}(X(k))$ and $(Y(k), R(k)) = \mathrm{reduce}(Y(k))$. The initial states for segment $k+1$ are $X^p(k+1) = S(k) \cup X(k)$ and $Y^p(k+1) = R(k) \cup Y(k)$. By Lemma 4.2, $P(X^p(k+1)) \preceq P(Y^p(k+1))$.

A reduction algorithm is obtained by repeatedly applying an iteration for every segment, using the ending state vector of one segment as the initial state vector of the following. We obtain the greedy algorithm by always applying optiter. Any reduction algorithm is obtained by repeatedly applying any iteration.

Theorem 4.5. In a unidirectional, fully connected, homogeneous system the time complexity of the uni-greedy algorithm is minimal among all reduction algorithms that reduce segments in order.

Proof. $X^p(k)$ is the state of the processors at the start of segment $k$ when applying the greedy algorithm, and $Y^p(k)$ is the state of the processors at the start of segment $k$ when applying another reduction algorithm. We assume that the initial state of the processors before any reduction is done is zero; that is, $X^p(1) = Y^p(1) = \{0, \ldots, 0\}$. By Lemma 4.4, $P(X^p(2)) \preceq P(Y^p(2))$, and by induction $X^p(q+1) \preceq Y^p(q+1)$. Hence, $x_0 = \max X^p(q+1) \le \max Y^p(q+1) = y_0$.

4.5.2 Theoretical Results

In Section 4.5.1 we showed that, given any segmentation of a message, the uni-greedy algorithm is optimal among algorithms that reduce the segments in order on a processor. However, selecting the optimal segmentation over all possible segmentations is difficult.


(a) Message size versus ratio    (b) Message size versus time
Figure 4.4: Theoretical results for $p = 64$ plotting message size $m$ versus ratio (4.4a) and time (4.4b) in a unidirectional system. The machine parameters are fixed at $\alpha = 10$, $\beta = 1$, and $\gamma = \infty$.

To avoid this difficulty, we restrict ourselves to considering only equi-segmentation in this section. In Section 4.9 we will investigate unequal segmentation.

We compare the optimally segmented uni-greedy algorithm with optimally segmented binomial, binary, and pipeline algorithms. For pipeline and binary, the optimal segment sizes are obtained by the formulae in Table 4.1. For binomial, the messages were always reduced with a single segment. The parameters used for the theoretical experiments were: $\alpha = 0, 1, 10, 100, 1000$; $\beta = 1$; $\gamma = 1, \infty$; $p = 2^k$ s.t. $k = 2, 3, \ldots, 10$; and $m = 2^k$ bytes s.t. $k = 2, 3, \ldots, 16$.

Figures 4.4a, 4.4b, and 4.5 show the results for when $\alpha = 10$, $\beta = 1$, and $\gamma = \infty$. In Figures 4.4a and 4.4b, the number of processors is fixed at 64. In Figure 4.4a the algorithms are compared by plotting the ratio between each algorithm's time and the time for the uni-greedy algorithm. For small messages the optimal segment size for uni-greedy is 1, and uni-greedy is exactly the binomial algorithm. So for small messages ($m < 64$ bytes in this example) uni-greedy and binomial have the same time. For large messages ($m > 1.64 \times 10^4$ bytes in this example), the poor startup of the pipeline algorithm is negligible and the pipeline algorithm approaches the uni-greedy algorithm asymptotically. The middle


region is where uni-greedy is the most relevant, seeing an increase in performance over the standard algorithms of up to approximately 50%. Figure 4.4b plots the time for each algorithm.

Figure 4.5 gives the results for each value of $p$ and $m$; the uni-greedy algorithm is compared with the best of the standard algorithms by plotting
$$\mathrm{Ratio} = \frac{\text{Minimum time of the standard algorithms}}{\text{Time for the greedy algorithm}}.$$
A ratio of 1 (white in the figure) indicates that a standard algorithm has the same time as the uni-greedy algorithm. Larger numbers indicate a more significant improvement. Again, the region where uni-greedy provides the greatest improvement is for medium size messages. As the number of processors increases, the range of message sizes where uni-greedy provides the most improvement (ratio $\ge 1.5$) increases. This is largely due to the difference in the number of time steps required to reduce the first segment for uni-greedy ($\lceil \log_2 p \rceil$) versus that of pipeline ($p - 1$). For larger $p$, pipeline requires a larger message before it is able to make up for the extra time to reduce the first segment.

Figure 4.5: Ratio between the time for the best standard algorithm and the uni-greedy algorithm for varying values of $p$ (number of processors) and $m$ (message size). The machine parameters are fixed at $\alpha = 10$, $\beta = 1$, and $\gamma = \infty$.
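The startup gap just mentioned is only $\lceil \log_2 p \rceil$ versus $p - 1$ rounds for the first segment; a couple of lines make the growth explicit (the helper names are ours, and the values checked come directly from these formulas, not from new measurements):

```c
/* Rounds needed before the first segment reaches the root. */
static int ceil_log2(int p)
{
    int r = 0;
    while ((1 << r) < p) r++;   /* smallest r with 2^r >= p */
    return r;
}
static int greedy_first_segment_rounds(int p)   { return ceil_log2(p); }
static int pipeline_first_segment_rounds(int p) { return p - 1; }
```

For $p = 64$ the gap is already 6 versus 63 rounds, which is why pipeline needs ever larger messages to amortize its startup as $p$ grows.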


4.6 Bidirectional System

In this section, we adapt the uni-greedy algorithm presented in Section 4.5 under the unidirectional context to an algorithm more suited for a bidirectional system. We wish to compare the new algorithm to the current state of the art. We again have the standard algorithms: binomial, binary, and pipeline. We also consider a butterfly algorithm [39, 40], which minimizes the computation term of the reduction. All of the algorithms mentioned so far work for non-commutative operations. If the operation is commutative, then any broadcast algorithm can be used in reverse for a reduction. Bar-Noy et al. [2] and Träff and Ripke [49] provide broadcast algorithms that match the lower bound on the number of communication rounds for broadcasting $q$ segments among $p$ processors. The time complexity for such a reduction algorithm is
$$(\lceil \log_2 p \rceil + q - 1)(\alpha + \beta^{-1} s + \gamma^{-1} s),$$
where $s$ is the segment size. From theoretical simulations it is seen that our new algorithm (bi-greedy) has the same time complexity as a reverse optimal broadcast. The rest of this section is organized as follows: in Section 4.6.1 we analyze the time complexity of the reduction algorithms; in Section 4.6.2 we introduce the new algorithm (bi-greedy).

4.6.1 Theoretical Results

We begin by summarizing the time complexity for each algorithm in Table 4.3. Binomial and butterfly do not have a choice of segment size, so the optimal segment/time rows are not included for those algorithms. Binomial, binary, pipeline, and bi-greedy all work in the same way: the message is split into $q$ segments of equi-segment size and segments are communicated in rounds, where the time for one round is $\alpha + \beta^{-1} s + \gamma^{-1} s$. For these algorithms it is clear that bi-greedy is the best, since it requires the minimum number of rounds for any number of segments. The butterfly algorithm does not start with a fixed segmentation of the message, but rather recursively halves the message size and exchanges the message between processors. This method allows the computation to be distributed


evenly among the processors, hence minimizing the computational term (4.2). If the computational rate is slow (i.e., $\gamma$ is small), then minimizing the computation term can be advantageous.

Figures 4.6a and 4.6b show the results for when $p = 64$, $\alpha = 50000$, $\beta = 1$, and $\gamma = 6$. In Figure 4.6a the ratio between the time of each algorithm and the time of bi-greedy is shown. When comparing bi-greedy with binomial, binary, and pipeline, we see similar results as in the unidirectional case. For small messages ($m < 10$ KB in this example), bi-greedy and binomial have the same time. For large messages ($m > 10^5$ KB in this example), pipeline approaches the bi-greedy algorithm asymptotically. For medium length messages ($200 < m < 10^5$ KB in this example), bi-greedy provides the greatest improvement.

Table 4.3: Time analysis for bidirectional reduction algorithms, where $N = \lceil \log_2(p+1) \rceil$, $s_{opt}$ is the optimal equi-segment size, and $T_{opt}$ is the time for the algorithm at $s_{opt}$. Formulae are valid for $p > 3$. The reduce-scatter/gather formula is only a lower bound when $p$ is not a power of two.

Binomial:
  Time: $\lceil \log_2 p \rceil \,(\alpha + \beta^{-1} m + \gamma^{-1} m)$

Pipeline:
  Time: $(p + q - 2)(\alpha + \beta^{-1} s + \gamma^{-1} s)$
  $s_{opt} = \left( \dfrac{\alpha m}{(p-2)(\beta^{-1} + \gamma^{-1})} \right)^{1/2}$
  $T_{opt} = \left[ \big((p-2)\alpha\big)^{1/2} + \big(m(\beta^{-1} + \gamma^{-1})\big)^{1/2} \right]^2$

Binary:
  Time: $2(\lceil \log_2(p+1) \rceil + q - 1)(\alpha + \beta^{-1} s + \gamma^{-1} s)$
  $s_{opt} = \left( \dfrac{\alpha m}{(N-2)(\beta^{-1} + \gamma^{-1})} \right)^{1/2}$
  $T_{opt} = 2\left[ \big((N-2)\alpha\big)^{1/2} + \big(m(\beta^{-1} + \gamma^{-1})\big)^{1/2} \right]^2$

Bi-greedy and Optimal Broadcast:
  Time: $(\lceil \log_2 p \rceil + q - 1)(\alpha + \beta^{-1} s + \gamma^{-1} s)$
  $s_{opt} = \left( \dfrac{\alpha m}{(\lceil \log_2 p \rceil - 1)(\beta^{-1} + \gamma^{-1})} \right)^{1/2}$
  $T_{opt} = \left[ \big((\lceil \log_2 p \rceil - 1)\alpha\big)^{1/2} + \big(m(\beta^{-1} + \gamma^{-1})\big)^{1/2} \right]^2$

Butterfly:
  Time: $2\lceil \log_2 p \rceil \alpha + 2\dfrac{p-1}{p}\beta^{-1} m + \dfrac{p-1}{p}\gamma^{-1} m$

(a) Message size versus ratio    (b) Message size versus time
Figure 4.6: Theoretical results for a bidirectional system when $p = 64$, plotting message size $m$ versus ratio (4.6a) and time (4.6b). The formula for the reference line is
$$\lceil \log_2 p \rceil \alpha + \beta^{-1} m + \frac{p-1}{p} \gamma^{-1} m,$$
which is the sum of the lower bounds for each term ($\alpha$, $\beta^{-1}$, and $\gamma^{-1}$). The machine parameters are fixed at $\alpha = 50000$, $\beta = 1$, and $\gamma = 6$.


Figure 4.7: Parameters for which there exists a message size, $m$, such that the butterfly algorithm is better than a reverse optimal broadcast algorithm.

4.6.2 The Bi-greedy Algorithm

In the bidirectional model, each processor has two ports (one sending and one receiving) rather than a single port that either sends or receives.

The bidirectional greedy (bi-greedy) algorithm adheres to the following restrictions: the message is split using equi-segmentation, and the ports are filled giving priority to segments with smaller indices. This restriction differs from that of uni-greedy, in that a processor may receive a segment before sending a segment with a smaller index. For the uni-greedy algorithm, the smaller-indexed segment must be sent first. A downfall of this relaxation is that extra storage must be allocated. However, other reduction algorithms also require a larger buffer (see Rabenseifner [39]).

Algorithm 4.2 on the next page gives the pseudo-code for the bi-greedy algorithm. $S$ and $R$ are $p \times 1$ arrays where entry $i$ gives the time when processor $i-1$ has finished sending or receiving, respectively, the last scheduled communication. $C$ is a $p \times 2$ array where $C(i,1)$ is the start time and $C(i,2)$ is the end time of the last scheduled computation on processor $i-1$. $M$ is a $p \times q$ array where entry $(i,j)$ gives the time when processor $i-1$ has finished sending segment $j$, except for the root ($i = 1$), for which it gives the time after finishing the final computation. In each inner loop the maximum number of send/receive pairs for segment $j$ is determined. The final entries of sendProc and recvProc are scheduled to send and receive segment $j$, respectively. These processors are paired entry-wise to obtain send/receive pairs.


Algorithm 4.2: Bi-greedy Algorithm
1   $S = \mathrm{zeros}(p,1)$; $R = \mathrm{zeros}(p,1)$; $C = \mathrm{zeros}(p,2)$; $t = 0$; $M = \mathrm{zeros}(p,q)$
2   while $\min\{M(i,j) \mid 1 \le i \le p,\ 1 \le j \le q\} = 0$ do
3       segStart $\leftarrow \min\{j \mid M(1,j) = 0\}$; stop $\leftarrow 1$; $j \leftarrow$ segStart $- 1$
4       while stop do
5           $j \leftarrow j + 1$
6           COMP $\leftarrow \{i \mid C(i,1) \le t < C(i,2)\}$
7           $I \leftarrow \{i \mid M(i,j) = 0\}$
8           sendProc $\leftarrow I \setminus$ COMP $\setminus \{i \mid S(i) > t\}$
9           recvProc $\leftarrow I \setminus$ COMP $\setminus \{i \mid R(i) > t\}$
10          freeProc $\leftarrow$ sendProc $\cap$ recvProc
11          sendProc $\leftarrow$ sendProc $\setminus$ freeProc; recvProc $\leftarrow$ recvProc $\setminus$ freeProc
12          $s \leftarrow |$sendProc$|$; $r \leftarrow |$recvProc$|$; $f \leftarrow |$freeProc$|$
13          if $s = r$ then
14              $y \leftarrow \lfloor f/2 \rfloor$
15              sendProc $\leftarrow$ sendProc $\cup \{$freeProc$(i) \mid f - y + 1 \le i \le f\}$
16              recvProc $\leftarrow$ recvProc $\cup \{$freeProc$(i) \mid 1 \le i \le y\}$
17          else if $s < r$ then
18              $m \leftarrow r - s$; $x \leftarrow \max(\lfloor (f - m)/2 \rfloor, 0)$
19              sendProc $\leftarrow$ sendProc $\cup \{$freeProc$(i) \mid f - (m + x) + 1 \le i \le f\}$
20              if $x > 0$ then recvProc $\leftarrow$ recvProc $\cup \{$freeProc$(i) \mid 1 \le i \le x\}$
21          else if $r < s$ then
22              $m \leftarrow s - r$; $x \leftarrow \max(\lfloor (f - m)/2 \rfloor, 0)$
23              recvProc $\leftarrow$ recvProc $\cup \{$freeProc$(i) \mid 1 \le i \le m + x\}$
24              if $x > 0$ then sendProc $\leftarrow$ sendProc $\cup \{$freeProc$(i) \mid f - x + 1 \le i \le f\}$
25          $l \leftarrow \min(|$sendProc$|, |$recvProc$|)$
26          if $l = 0$ then sendProc $\leftarrow \emptyset$; recvProc $\leftarrow \emptyset$
27          else sendProc $\leftarrow \{$sendProc$(i) \mid 1 \le i \le l\}$; recvProc $\leftarrow \{$recvProc$(i) \mid 1 \le i \le l\}$
28          $M(i,j) \leftarrow t + t_{comm}(j)$ for all $i \in$ sendProc
29          $S(i) \leftarrow t + t_{comm}(j)$ for all $i \in$ sendProc
30          $R(i) \leftarrow t + t_{comm}(j) + t_{comp}(j)$ for all $i \in$ recvProc
31          $C(i,1) \leftarrow t + t_{comm}(j)$; $C(i,2) \leftarrow t + t_{comm}(j) + t_{comp}(j)$ for all $i \in$ recvProc
32          if $|\{i \mid M(i,j) = 0\}| = 1$ then $M(1,j) \leftarrow \max\{M(i,j) \mid 1 \le i \le p\} + t_{comp}(j)$
33          if $|\{i \mid S(i) \le t\}| + |\{i \mid R(i) \le t\}| < 2$ then stop $\leftarrow 0$
34          else if $j \ge q$ then stop $\leftarrow 0$
35      $t \leftarrow t + 1$


Figure 4.8: Bi-greedy algorithm for 16 processors and 5 equi-segments for $t_{comm} = 2$ and $t_{comp} = 1$. Each color represents a different segment. Solid rectangles represent receiving processors and rectangles with end strips represent sending processors. The darker rectangles represent computation.

The algorithm assumes a discrete time step.

Figure 4.8 gives an example for 16 processors and 5 equi-segments with $t_{comm} = 2$ and $t_{comp} = 1$. For this example we see that the time complexity matches that of the optimal broadcast algorithm. From experiments we conjecture that for all $p$ and $q$ the bi-greedy algorithm has the same time complexity as a reverse optimal broadcast algorithm. We do not provide a proof at this time.

4.7 Schedule Complexity

In this section we discuss the computation time and the storage requirements of both schedules. These are important aspects of a practical reduction algorithm and the reason we include this section in the paper. However, the focus of this paper is on the performance of greedy reduction algorithms and not the computation or storage requirements to compute the schedules. It may be possible to improve the codes to reduce the time to compute the schedule and to reduce the storage requirements.

To analyze the time required for the uni-greedy algorithm, we consider the C code in Figure 4.2. The code consists of two loops. The number of outer while loops is equal


to the time of the uni-greedy algorithm with $\alpha = 1$, $\beta = \infty$, and $\gamma = \infty$. We do not know a formula for the time of the uni-greedy algorithm, but we can bound the number of loops by the minimum of the time for binomial, binary, and pipeline. The inner for loop determines which segment is sent/received for each node. The segments that may be sent during a step are between the smallest indexed segment that has not been reduced to the root node (qend) and the current segment to be sent from node $p - 1$ (qstart). In any given step there are at most $\min(p, q)$ different segments that can be scheduled. In all cases we see that the time to compute the schedule is bounded by $O(pq)$. The bi-greedy scheduling algorithm follows the same two-loop structure and its time complexity is also bounded by $O(pq)$.

For both algorithms, all nodes compute the entire schedule. There are two possibilities for running the reduction codes: (1) all nodes compute the entire schedule initially, and the schedule for each node is stored for subsequent reductions; (2) since the algorithms are sequential in time, the communications can be interlaced with the computation of the schedule. We use method (2) for our experiments in Section 4.8. We also experimented with the time to compute only the schedule. On a modern workstation, for 2048 segments and 1024 nodes, the uni-greedy schedule was computed in 7 ms and the bi-greedy schedule was computed in 700 ms. The number of segments and nodes were chosen to represent the time for scheduling a large message. In Section 4.8, with only 64 nodes, a large message will take over a second to be reduced. Hence, the time to schedule either reduction should be a small fraction of the total time for a reduction.

To compute the uni-greedy schedule we are required to store an array of integers of length $q$. Therefore a uni-greedy schedule with 2048 segments requires 8 KB. The storage requirements are independent of the number of nodes. The bidirectional schedule requires a $p$ by $q$ array of integers and 12 integer arrays of length $p$. Therefore a bi-greedy schedule with 2048 segments and 1024 nodes requires 8 MB. If the memory required for computing the bi-greedy schedule is not available during an application, then the schedule


would need to be precomputed, and each node only needs to store its own schedule. In this case, the storage required would be 4 arrays of length $\log_2 p + q - 1$. For 2048 segments and 1024 nodes each node would require 32 KB.

4.8 Numerical Results

For the numerical experiments we implemented the uni-greedy algorithm developed under the unidirectional context (Figure 4.2) and the bi-greedy algorithm developed under the bidirectional context (Algorithm 4.2). The algorithms were implemented using OpenMPI version 1.4.3. All experiments were performed on the Janus supercomputer, which consists of 1,368 nodes with each node having 12 cores. Since processors on a single node have access to shared memory, only one processor per node was used, with a total of 64 nodes in use for the experiments. The nodes are connected using a fully non-blocking quad-data-rate (QDR) InfiniBand interconnect. The time for a reduction was calculated by taking the minimum time of 10 experiments.

We wish to compare the algorithms assuming that each is optimally tuned for the best segmentation. For a message of size $m$, segments of size $2^i$ s.t. $i = 0, \ldots, \lfloor \log_2 m \rfloor$ were tested, and the best time was taken to be the time of an optimally tuned algorithm. Figures 4.9a and 4.9b compare the optimally tuned greedy algorithms with the other algorithms for reduction.

Comparing bi-greedy with binomial and pipeline shows similar results to those of the theoretical results. For small messages, the time for binomial and greedy are close, although binomial performs about twice as well as greedy in a few cases. For large messages, pipeline begins to outperform all algorithms and is a factor of two better than bi-greedy. However, for medium size messages ($100 < m < 1000$ KB in this example), the bi-greedy algorithm performs best.

(a) Message size versus ratio    (b) Message size versus time
Figure 4.9: Experimental results for each algorithm for all possible message sizes for 64 processors.

The butterfly algorithm attains the same time as bi-greedy for $m = 1000$ KB, but has poor performance for large and small messages. Comparing the bi-greedy algorithm with the built-in MPI function MPI_Reduce shows that greedy, in most cases, is the best. Finally, the uni-greedy algorithm, although designed for a unidirectional system, performed fairly well in general, matching the performance of bi-greedy for small and large messages.

For small messages, greedy is equivalent to binomial, but there is a fairly large difference in the performance in this region. This is because the two algorithms implemented use different send/receive pairings, and the performance differences indicate the system is actually heterogeneous.

Checking all of the possible segment sizes for a given message size is not practical. Auto-tuning could be used to determine the optimal segmentation for different message sizes.

4.9 Unequal Segmentation

So far we have only optimized the greedy algorithm by splitting a message into equally sized segments (except possibly the last segment). An immediate question is then asked: what if the segments have unequal size? This is a question we have not found to be considered in the literature. Can the existing algorithms be improved?
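Splitting a message of $m$ units into ordered segments of positive integer sizes is a composition of $m$, and there are $2^{m-1}$ of them, since each of the $m - 1$ interior boundaries is either cut or not. A short sketch (function name ours) enumerates them this way and, for $m = 10$, recovers the 512 segmentations examined in this section's experiment.

```c
/* Enumerate all ordered segmentations (compositions) of a message of m
 * units by deciding, for each of the m-1 interior boundaries, whether to
 * cut there.  Returns the number enumerated, or -1 if any segmentation
 * fails to sum back to m (a sanity check). */
static long count_segmentations(int m)
{
    long count = 0;
    for (long mask = 0; mask < (1L << (m - 1)); mask++) {
        int sum = 0, seg = 1;                /* current segment size */
        for (int b = 0; b < m - 1; b++) {
            if (mask & (1L << b)) { sum += seg; seg = 1; }  /* cut here */
            else seg++;
        }
        sum += seg;                          /* close the last segment */
        if (sum != m) return -1;
        count++;
    }
    return count;
}
```

The exponential growth in $m$ is why the exhaustive search below is restricted to $m = 10$.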


To investigate these questions the message size was fixed to $m = 10$ and all possible segmentations were checked for $\alpha = 0, 1, \ldots, 10, 20, 30, \ldots, 100, 200, 300, \ldots, 1000$; $\beta = 1$; $\gamma = 1, \infty$; and $p = 2^n$ s.t. $n = 2, \ldots, 10$ and $p = 3 \cdot 2^n$ s.t. $n = 1, \ldots, 9$. There are a total of 512 possible segmentations. Examples of possible segmentations are $(1,3,3,1,2)$, $(2,4,1,2,1)$, $(1,5,3,1)$, etc. An equi-segmentation is one that has the same segment size for all segments, except possibly the last segment may be smaller.

For the greedy algorithm, 61 out of 986 experiments were optimized by an unequal segmentation. To compare unequal to equi-segmentation, we use the value
$$\mathrm{Ratio} = \frac{\text{Best time for equi-segmentations}}{\text{Best time for all segmentations}}.$$
A ratio of 1 indicates that one of the optimal segmentations is an equi-segmentation. If there were multiple segmentations that were optimal and one of them was an equi-segmentation, then the experiment was said to be optimized by an equi-segmentation. The maximum improvement over equi-segmentation was 7.3%. Of the 61 that did see an increase in performance, the average improvement was 2.0%.

Table 4.4 shows results for all experiments that were optimized by unequal segmentation. We note that, in all cases, one of the optimal segmentations was nearly the same as the best equi-segmentation. Indeed, to obtain the optimal unequal segmentation from the best of the equi-segmentations, the size of only one or two segments had to be increased or decreased, by a value of 1 in most cases.

Figures 4.10a and 4.10b show the results for all $\alpha$, $\gamma$, and $p$. For $\alpha > 20$, the greedy algorithm was always optimized by an equi-segmentation and therefore is not graphed. A pattern as to when unequal segmentation is optimal is not evident from the figures. For the pipeline algorithm, all experiments were optimized by equi-segmentation. The binary algorithm was not checked for unequal segmentation.

The experiment is limited to small values of $m$ since the number of possible segmentations grows exponentially as $m$ increases. The algorithms were not implemented for


Table 4.4: Optimal segmentation compared to best of the equi-segmentations.

Parameters | Ratio | Equi-segmentation | Optimal segmentation
$p=6$, $\alpha=1$, $\gamma=1$ | 1.0408 | (4,4,2) | (5,3,2), (3,5,2), (3,4,2,1), (4,2,2,2), (2,4,2,2), (3,2,2,2,1), (2,3,2,2,1)
$p=6$, $\alpha=2$, $\gamma=1$ | 1.0357 | (4,4,2) | (5,3,2), (3,5,2)
$p=8$, $\alpha=0$, $\gamma=1$ | 1.0571 | (1,1,1,1,1,1,1,1,1,1) | (2,1,1,1,1,1,1,1,1)
$p=8$, $\alpha=1$, $\gamma=1$ | 1.0200 | (4,4,2) | (5,2,2,1), (3,4,2,1)
$p=12$, $\alpha=0$, $\gamma=1$ | 1.0732 | (1,1,1,1,1,1,1,1,1,1) | (2,2,1,1,1,1,1,1)
$p=12$, $\alpha=1$, $\gamma=1$ | 1.0169 | (4,4,2) | (3,3,1,2,1), (2,2,1,2,2,1)
$p=12$, $\alpha=3$, $\gamma=1$ | 1.0130 | (5,5) | (4,5,1)
$p=16$, $\alpha=1$, $\gamma=\infty$ | 1.0526 | (3,3,3,1) | (5,3,2), (4,2,2,2)
$p=16$, $\alpha=1$, $\gamma=1$ | 1.0161 | (3,3,3,1) | (3,2,2,2,1)
$p=16$, $\alpha=2$, $\gamma=1$ | 1.0278 | (4,4,2) | (4,3,3)
$p=24$, $\alpha=3$, $\gamma=1$ | 1.0108 | (3,3,3,1) | (5,4,1), (4,3,3)
$p=24$, $\alpha=4$, $\gamma=1$ | 1.0388 | (5,5) | (5,4,1), (4,3,3)
$p=32$, $\alpha=1$, $\gamma=\infty$ | 1.0222 | (4,4,2) | (3,3,2,2)
$p=32$, $\alpha=1$, $\gamma=1$ | 1.0141 | (2,2,2,2,2) | (3,2,2,2,1)
$p=32$, $\alpha=2$, $\gamma=1$ | 1.0118 | (4,4,2) | (4,3,3), (5,2,2,1), (3,3,2,2)
$p=32$, $\alpha=3$, $\gamma=1$ | 1.0104 | (4,4,2) | (4,3,3)
$p=48$, $\alpha=1$, $\gamma=\infty$ | 1.0408 | (4,4,2) | (4,2,2,2), (3,2,3,2)
$p=48$, $\alpha=1$, $\gamma=1$ | 1.0260 | (2,2,2,2,2) | (3,2,2,2,1), (2,2,1,2,2,1)
$p=48$, $\alpha=2$, $\gamma=\infty$ | 1.0164 | (4,4,2) | (5,3,2)
$p=48$, $\alpha=2$, $\gamma=1$ | 1.0215 | (3,3,3,1) | (3,3,2,2)
$p=48$, $\alpha=3$, $\gamma=1$ | 1.0093 | (4,4,2) | (4,3,3), (3,4,3), (3,3,2,2), (3,2,3,2)
$p=48$, $\alpha=4$, $\gamma=1$ | 1.0084 | (4,4,2) | (5,4,1), (5,2,3), (4,3,3), (3,4,3)
$p=64$, $\alpha=1$, $\gamma=1$ | 1.0253 | (3,3,3,1) | (3,2,2,2,1), (2,2,2,2,1,1), (2,2,2,1,2,1), (2,2,2,1,1,1,1), (2,2,1,1,1,1,1,1)
$p=64$, $\alpha=2$, $\gamma=1$ | 1.0106 | (3,3,3,1) | (3,3,2,2)
$p=64$, $\alpha=3$, $\gamma=1$ | 1.0093 | (4,4,2) | (3,3,2,2)
$p=96$, $\alpha=1$, $\gamma=\infty$ | 1.0182 | (4,4,2) | (3,3,2,2), (3,2,3,2), (2,3,3,2)
$p=96$, $\alpha=1$, $\gamma=1$ | 1.0119 | (1,1,1,1,1,1,1,1,1,1) | (2,2,2,2,1,1), (2,2,2,1,1,1,1), (2,2,1,1,1,1,1,1)
$p=96$, $\alpha=2$, $\gamma=1$ | 1.0098 | (3,3,3,1) | (3,3,2,2)
$p=96$, $\alpha=3$, $\gamma=1$ | 1.0085 | (3,3,3,1) | (3,3,2,2)
$p=128$, $\alpha=1$, $\gamma=1$ | 1.0116 | (2,2,2,2,2) | (2,2,2,1,1,1,1), (2,1,1,1,1,1,1,1,1), (1,1,1,2,1,1,1,1,1)
$p=192$, $\alpha=1$, $\gamma=\infty$ | 1.0169 | (3,3,3,1) | (2,2,2,2,1,1)
$p=256$, $\alpha=1$, $\gamma=\infty$ | 1.0161 | (2,2,2,2,2) | (4,3,2,1), (3,2,3,2), (2,3,3,2), (3,3,2,1,1), (2,2,3,2,1), (2,3,2,1,1,1), (3,2,1,2,1,1), (3,1,2,2,1,1), (2,2,2,2,1,1), (3,2,1,1,2,1), (3,1,2,1,2,1), (2,2,2,1,2,1), (2,2,1,2,2,1), (2,1,2,2,2,1), (1,2,2,2,2,1), (2,1,2,2,1,2)
$p=256$, $\alpha=1$, $\gamma=1$ | 1.0109 | (1,1,1,1,1,1,1,1,1,1) | (2,1,2,1,1,1,1,1), (2,1,1,1,1,1,1,1,1)
$p=256$, $\alpha=2$, $\gamma=\infty$ | 1.0256 | (4,4,2) | (5,3,2), (4,3,3)
$p=256$, $\alpha=2$, $\gamma=1$ | 1.0085 | (2,2,2,2,2) | (3,2,3,2), (2,2,2,3,1), (2,1,2,2,2,1), (2,1,2,2,1,1,1)
$p=256$, $\alpha=3$, $\gamma=\infty$ | 1.0217 | (4,4,2) | (4,3,3)
$p=256$, $\alpha=3$, $\gamma=1$ | 1.0224 | (3,3,3,1) | (3,2,3,2)
$p=256$, $\alpha=4$, $\gamma=1$ | 1.0265 | (3,3,3,1) | (5,3,2), (4,3,3)
$p=256$, $\alpha=5$, $\gamma=1$ | 1.0303 | (4,4,2) | (5,3,2)
$p=256$, $\alpha=6$, $\gamma=1$ | 1.0279 | (4,4,2) | (5,3,2)
$p=384$, $\alpha=3$, $\gamma=1$ | 1.0142 | (3,3,3,1) | (3,2,3,2), (2,3,3,2)
$p=512$, $\alpha=1$, $\gamma=\infty$ | 1.0154 | (2,2,2,2,2) | (2,2,2,2,1,1), (2,2,2,1,2,1)
$p=512$, $\alpha=2$, $\gamma=\infty$ | 1.0238 | (4,4,2) | (4,3,3)
$p=512$, $\alpha=3$, $\gamma=\infty$ | 1.0100 | (4,4,2) | (4,3,3)
$p=512$, $\alpha=3$, $\gamma=1$ | 1.0069 | (3,3,3,1) | (3,2,3,2), (2,3,3,2)
$p=512$, $\alpha=4$, $\gamma=1$ | 1.0061 | (3,3,3,1) | (2,3,3,2)
$p=512$, $\alpha=5$, $\gamma=1$ | 1.0167 | (4,4,2) | (4,3,3)
$p=512$, $\alpha=6$, $\gamma=1$ | 1.0154 | (4,4,2) | (4,3,3)
$p=512$, $\alpha=7$, $\gamma=1$ | 1.0095 | (4,4,2) | (4,3,3)
$p=768$, $\alpha=1$, $\gamma=\infty$ | 1.0147 | (2,2,2,2,2) | (2,2,2,2,1,1), (2,2,2,1,2,1), (2,1,2,2,2,1), (1,2,2,2,2,1)
$p=768$, $\alpha=1$, $\gamma=1$ | 1.0099 | (1,1,1,1,1,1,1,1,1,1) | (2,1,1,1,1,1,1,1,1)
$p=768$, $\alpha=2$, $\gamma=\infty$ | 1.0111 | (3,3,3,1) | (4,3,3), (3,4,2,1), (4,2,3,1), (3,2,3,2), (2,3,3,2), (2,2,4,2), (2,2,3,3)
$p=768$, $\alpha=3$, $\gamma=\infty$ | 1.0189 | (4,4,2) | (4,3,3)
$p=768$, $\alpha=3$, $\gamma=1$ | 1.0066 | (2,2,2,2,2) | (2,3,3,2)
$p=768$, $\alpha=4$, $\gamma=\infty$ | 1.0164 | (4,4,2) | (4,3,3)
$p=768$, $\alpha=4$, $\gamma=1$ | 1.0174 | (3,3,3,1) | (3,2,3,2), (2,3,3,2)
$p=768$, $\alpha=5$, $\gamma=1$ | 1.0317 | (3,3,3,1) | (4,3,3)
$p=768$, $\alpha=6$, $\gamma=1$ | 1.0341 | (4,4,2) | (4,3,3)
$p=768$, $\alpha=7$, $\gamma=1$ | 1.0270 | (4,4,2) | (4,3,3)
$p=768$, $\alpha=8$, $\gamma=1$ | 1.0209 | (4,4,2) | (4,3,3)
$p=1024$, $\alpha=3$, $\gamma=\infty$ | 1.0093 | (4,4,2) | (4,3,3)

(a) $\gamma = \infty$    (b) $\gamma = 1$
Figure 4.10: Ratio for optimal segmentation versus the best equal segmentation.
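The ratio column of Table 4.4 can be cross-checked against the summary statistics quoted in the text (61 cases, maximum improvement 7.3%, average 2.0%); the array below simply transcribes the ratios in table order.

```c
#include <math.h>

/* Ratios transcribed from Table 4.4, in row order. */
static const double ratios[] = {
    1.0408, 1.0357, 1.0571, 1.0200, 1.0732, 1.0169, 1.0130, 1.0526,
    1.0161, 1.0278, 1.0108, 1.0388, 1.0222, 1.0141, 1.0118, 1.0104,
    1.0408, 1.0260, 1.0164, 1.0215, 1.0093, 1.0084, 1.0253, 1.0106,
    1.0093, 1.0182, 1.0119, 1.0098, 1.0085, 1.0116, 1.0169, 1.0161,
    1.0109, 1.0256, 1.0085, 1.0217, 1.0224, 1.0265, 1.0303, 1.0279,
    1.0142, 1.0154, 1.0238, 1.0100, 1.0069, 1.0061, 1.0167, 1.0154,
    1.0095, 1.0147, 1.0099, 1.0111, 1.0189, 1.0066, 1.0164, 1.0174,
    1.0317, 1.0341, 1.0270, 1.0209, 1.0093
};
static const int nratios = sizeof ratios / sizeof ratios[0];

static double max_ratio(void)
{
    double m = 0.0;
    for (int i = 0; i < nratios; i++)
        if (ratios[i] > m) m = ratios[i];
    return m;
}

static double avg_ratio(void)
{
    double s = 0.0;
    for (int i = 0; i < nratios; i++)
        s += ratios[i];
    return s / nratios;
}
```

The maximum, 1.0732, is the 7.3% case ($p = 12$, $\alpha = 0$, $\gamma = 1$), and the mean improvement comes out at about 2.0%, matching the text.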


unequal segmentations, since the small theoretical improvements most likely will not be obtained. Further investigation is required to determine whether for larger message sizes the improvement of unequal segmentation becomes greater.

4.10 Conclusion

Two new algorithms for reduction are presented. The two algorithms are developed under two different communication models (unidirectional and bidirectional).

We prove that the unidirectional algorithm is optimal under certain assumptions. Our assumptions are a fully connected, homogeneous system, and processing the segments in order. Previous algorithms that satisfy the same assumptions are only appropriate for some configurations. The uni-greedy algorithm is optimal for all configurations. The most improvement over standard algorithms is for messages of medium length, providing about 50% improvement when compared to the best of the standard algorithms in that region. The region of medium length messages becomes more prevalent as the number of processors increases.

We adapted the greedy algorithm that was developed in the unidirectional context to an algorithm more suited for a bidirectional system. With simulations we found that the bi-greedy algorithm matched the time complexity of optimal broadcast algorithms scheduled in reverse order as a reduction. Similar performance improvements were found as in the unidirectional case.

Implementation of the algorithms confirms the theoretical results. Our implementation of the bi-greedy algorithm was best among all algorithms implemented for medium size messages. For small and large messages, the more simplistic algorithms (binomial and pipeline), which are asymptotically optimal, outperformed the greedy algorithms.

Finally, the concept of unequal segmentation was discussed and analyzed for the greedy algorithm and the pipeline algorithm in a unidirectional system. For the greedy algorithm, unequal segmentation provided, in our sample test suite, the optimal segmentation 6.2% of the time, with a maximum improvement of 7.3%. The pipeline algorithm was


always optimized by equal segmentation.


5. On Matrix Balancing and Eigenvector Computation

5.1 Introduction

For a given vector norm $\|\cdot\|$, an $n$-by-$n$ square matrix $A$ is said to be balanced if and only if, for all $i$ from 1 to $n$, the norm of its $i$-th column and the norm of its $i$-th row are equal. $A$ and $\tilde A$ are diagonally similar means that there exists a diagonal nonsingular matrix $D$ such that $\tilde A = D^{-1} A D$. Computing $D$ and/or $\tilde A$ such that $\tilde A$ is balanced is called balancing $A$, and $\tilde A$ is called the balanced matrix.

Balancing is a standard preprocessing step while solving the nonsymmetric eigenvalue problem (see, for example, the subroutine DGEEV in the LAPACK library). Indeed, since $A$ and $\tilde A$ are similar, the eigenvalue problem on $A$ is similar to the eigenvalue problem on $\tilde A$. Balancing is used to hopefully improve the accuracy of the computation [55, 10, 11, 12]. However, there are examples where the eigenvalue accuracy [52], eigenvector accuracy [35], and backward error will deteriorate. We will demonstrate these examples throughout the chapter.

Using the 2-norm, given an $n$-by-$n$ square matrix $A$, Osborne [36] proved that there exists a unique matrix $\tilde A$ such that $A$ and $\tilde A$ are diagonally similar and $\tilde A$ is balanced. The balanced matrix, $\tilde A$, has minimal Frobenius norm among all diagonally similar matrices. Osborne [36] also provides a practical iterative algorithm for computing $\tilde A$.

Parlett and Reinsch [37] (among other things) proposed to approximate $D$ with $D_2$, a diagonal matrix containing only powers of two on the diagonal. While $D_2$ is only an approximation of $D$, $\tilde A_2 = D_2^{-1} A D_2$ is balanced enough to successfully reduce the norm of the matrix. The key idea is that there is no floating-point error in computing $\tilde A_2$ on a base-two computing machine.

What we know balancing does is decrease the norm of the matrix. Currently, the stopping criterion for the LAPACK balancing routine, GEBAL, is to stop if a step in the algorithm does not provide a significant reduction in the norm of the matrix, ignoring the diagonal elements. Ignoring the diagonal elements is a reasonable assumption since they


are unchanged by a diagonal similarity transformation. However, it is our observation that this stopping criterion is not strict enough. In some test cases, it allows the algorithm to continue to balance the matrix while not significantly decreasing the norm of the matrix. At the same time, the floating-point entries drastically change in magnitude. This combination causes poor accuracy when recovering the eigenvectors of the original matrix.

We propose a simple fix to the stopping criterion. Including the diagonal elements and balancing with respect to the 2-norm solves the problem of deteriorating backward error in our test cases. Using the 2-norm also prevents balancing a Hessenberg matrix, which previously was an example of balancing deteriorating the eigenvalue condition number [52].

Other work on balancing was done by Chen and Demmel [10, 11, 12]. They extended the balancing literature to sparse matrices. They also looked at using a weighted norm to balance the matrix. For nonnegative matrices, if the weight vector is taken to be the Perron vector (which is nonnegative), then the largest eigenvalue has perfect conditioning. However, little can be said about the conditioning of the other eigenvalues. This is the only example in the literature that guarantees balancing improves the condition number of any of the eigenvalues.

The rest of the chapter is organized as follows. In Section 5.2 we provide background and review the current literature on balancing. In Section 5.3 we introduce a simple example to illustrate why balancing can deteriorate the backward error and the accuracy of the eigenvectors. In Section 5.4 we provide backward error analysis. Finally, in Section 5.5 we look at a few test cases and conclude that using the 2-norm in the balancing algorithm is the best solution.

5.2 The Balancing Algorithm

Algorithm 5.1 shows Osborne's original balancing algorithm, which balances a matrix in the 2-norm [36]. The algorithm assumes that the matrix is irreducible. A matrix is


reducible if there exists a permutation matrix, $P$, such that
$$P A P^T = \begin{pmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{pmatrix}, \qquad (5.1)$$
where $A_{11}$ and $A_{22}$ are square matrices. A diagonal similarity transformation can make the off-diagonal block, $A_{12}$, arbitrarily small with $D = \mathrm{diag}(\rho I, I)$ for arbitrarily large $\rho$. For convergence of the balancing algorithm it is necessary for the elements of $D$ to be bounded. For reducible matrices, the diagonal blocks can be balanced independently.

Algorithm 5.1: Balancing (Osborne)
Input: An irreducible matrix $A \in \mathbb{R}^{n \times n}$.
Notes: For each $k$, $A$ is overwritten by $D^{-1} A D$, and the sequence of $A$'s converges to a unique balanced matrix. $D$ also converges to a unique nonsingular diagonal matrix.
1  $D \leftarrow I$
2  for $k = 0, 1, 2, \ldots$ do
3      for $i = 1, \ldots, n$ do
4          $c \leftarrow \sqrt{\textstyle\sum_{j \ne i} |a_{j,i}|^2}$,  $r \leftarrow \sqrt{\textstyle\sum_{j \ne i} |a_{i,j}|^2}$
5          $f \leftarrow \sqrt{r/c}$
6          $d_{ii} \leftarrow f \, d_{ii}$
7          $A(:,i) \leftarrow f A(:,i)$,  $A(i,:) \leftarrow A(i,:)/f$

The balancing algorithm operates on columns and rows of $A$ in a cyclic fashion. In line 4, the 2-norms (ignoring the diagonal element) of the current column and row are calculated. Line 5 calculates the scalar, $f$, that will be used to update $d_{ii}$, column $i$, and row $i$. The quantity $f^2 c^2 + r^2/f^2$ is minimized at $f = \sqrt{r/c}$, and has a minimum value of $2rc$. It is obvious that for this choice of $f$ the Frobenius norm is non-increasing, since
$$\|A^{(nk+i)}\|_F^2 - \|A^{(nk+i+1)}\|_F^2 = c^2 + r^2 - 2rc = (c - r)^2 \ge 0.$$


Osborne showed that Algorithm 5.1 converges to a unique balanced matrix, and $D$ converges to a unique (up to a scalar multiple) nonsingular diagonal matrix. The balanced matrix has minimum Frobenius norm among all diagonal similarity transformations.

Parlett and Reinsch [37] generalized the balancing algorithm to any $p$-norm. Changing line 4 to

$$c \leftarrow \Big(\sum_{j \neq i} |a_{j,i}|^p\Big)^{1/p}, \qquad r \leftarrow \Big(\sum_{j \neq i} |a_{i,j}|^p\Big)^{1/p},$$

the algorithm will converge to a balanced matrix in the $p$-norm. Parlett and Reinsch, however, do not quantify whether the norm of the matrix will be reduced or minimized. The norm will in fact be minimized, only it is a non-standard matrix norm. Specifically, for the balanced matrix $\tilde A$,

$$\|\tilde A\|_p = \min\big\{ \|\operatorname{vec}(D^{-1}AD)\|_p \;\big|\; D \in \mathcal{D} \big\},$$

where $\mathcal{D}$ is the set of all nonsingular diagonal matrices and $\operatorname{vec}$ stacks the columns of $A$ into an $n^2$ column vector. For $p = 2$, this is the Frobenius norm.

Parlett and Reinsch also restrict the diagonal elements of $D$ to be powers of the radix base $\beta$ (typically 2). This restriction ensures there is no computational error in the balancing algorithm. The algorithm provides only an approximately balanced matrix. Doing so without computational error is desirable, and the exact balanced matrix is not necessary. Finally, they introduce a stopping criterion for the algorithm. Algorithm 5.2 shows Parlett and Reinsch's algorithm for any $p$-norm. Algorithm 5.2 balancing in the 1-norm is essentially what is currently implemented by LAPACK's GEBAL routine.

Again, the algorithm proceeds in a cyclic fashion, operating on a single row/column at a time. In each step, the algorithm seeks to minimize $f^p c^p + r^p/f^p$. The exact minimum is obtained for $f = \sqrt{r/c}$. Lines 8-11 find the closest approximation to the exact minimizer. At the completion of the second inner while loop, $f$ satisfies the inequality

$$\beta^{-1}\,\frac{r}{c} \leq f^2 < \beta\,\frac{r}{c}.$$
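A sketch of the radix-restricted iteration in NumPy follows. The acceptance threshold 0.95 matches the constant used in LAPACK's GEBAL; the sweep cap and the `include_diagonal` switch (which turns line 6 into the variant of Algorithm 5.3, proposed later in this chapter) are our own additions:

```python
import numpy as np

def pr_balance(A, p=1.0, beta=2.0, include_diagonal=False, max_sweeps=100):
    """Radix-restricted cyclic balancing in the p-norm (Algorithm 5.2 style).
    Scale factors are powers of the radix `beta`, so A <- D^{-1} A D is
    exact in floating point.  include_diagonal=True uses line 6 of the
    proposed Algorithm 5.3 instead."""
    A = A.astype(float).copy()
    n = A.shape[0]
    d = np.ones(n)
    for _ in range(max_sweeps):
        converged = True
        for i in range(n):
            col = A[:, i] if include_diagonal else np.delete(A[:, i], i)
            row = A[i, :] if include_diagonal else np.delete(A[i, :], i)
            c = np.sum(np.abs(col) ** p) ** (1.0 / p)
            r = np.sum(np.abs(row) ** p) ** (1.0 / p)
            if c == 0.0 or r == 0.0:
                continue
            s = c ** p + r ** p
            f = 1.0
            while f * c < (r / f) / beta:        # walk f up toward sqrt(r/c)
                f *= beta
            while f * c >= (r / f) * beta:       # walk f down toward sqrt(r/c)
                f /= beta
            if (f * c) ** p + (r / f) ** p < 0.95 * s:   # scale only if it pays off
                converged = False
                d[i] *= f
                A[:, i] *= f
                A[i, :] /= f
        if converged:
            break
    return A, d

# Example: one accepted scaling by f = 32 = 2^5 balances this matrix.
B, d = pr_balance(np.array([[1.0, 1024.0], [1.0, 1.0]]))
```

Because $f$ is an exact power of the radix, every update of $A$ and $d$ is a floating-point multiplication by a power of 2 and hence introduces no rounding error.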


Algorithm 5.2: Balancing (Parlett and Reinsch)
Input: A matrix $A \in \mathbb{R}^{n \times n}$. We will assume that $A$ is irreducible.
Output: A diagonal matrix $D$ and $A$, which is overwritten by $D^{-1}AD$.
Notes: $\beta$ is the radix base. On output $A$ is nearly balanced in the $p$-norm.
1:  $D \leftarrow I$
2:  converged $\leftarrow 0$
3:  while converged $= 0$ do
4:    converged $\leftarrow 1$
5:    for $i = 1, \ldots, n$ do
6:      $c \leftarrow \big(\sum_{j \neq i} |a_{j,i}|^p\big)^{1/p}$,  $r \leftarrow \big(\sum_{j \neq i} |a_{i,j}|^p\big)^{1/p}$
7:      $s \leftarrow c^p + r^p$,  $f \leftarrow 1$
8:      while $fc < (r/f)/\beta$ do
9:        $f \leftarrow f\beta$
10:     while $fc \geq (r/f)\beta$ do
11:       $f \leftarrow f/\beta$
12:     if $(fc)^p + (r/f)^p < \gamma s$ then
13:       converged $\leftarrow 0$,  $d_{ii} \leftarrow f\, d_{ii}$
14:       $A(:,i) \leftarrow f\, A(:,i)$,  $A(i,:) \leftarrow A(i,:)/f$

Here $\gamma < 1$ is a fixed threshold (LAPACK's GEBAL uses $\gamma = 0.95$), so a scaling is applied only when it decreases $c^p + r^p$ by a significant factor.

However, it is easier to absorb the addition of the diagonal element into the calculation of $c$ and $r$. This is what we do in our proposed algorithm, Algorithm 5.3. The only difference between Algorithm 5.3 and Algorithm 5.2 is in line 6. In Algorithm 5.3 we include the diagonal elements in the calculation of $c$ and $r$.

Algorithm 5.3: Balancing (Proposed)
Input: A matrix $A \in \mathbb{R}^{n \times n}$. We will assume that $A$ is irreducible.
Output: A diagonal matrix $D$ and $A$, which is overwritten by $D^{-1}AD$.
Notes: $\beta$ is the radix base. On output $A$ is nearly balanced in the $p$-norm.
1:  $D \leftarrow I$
2:  converged $\leftarrow 0$
3:  while converged $= 0$ do
4:    converged $\leftarrow 1$
5:    for $i = 1, \ldots, n$ do
6:      $c \leftarrow \|A(:,i)\|_p$,  $r \leftarrow \|A(i,:)\|_p$
7:      $s \leftarrow c^p + r^p$,  $f \leftarrow 1$
8:      while $fc < (r/f)/\beta$ do
9:        $f \leftarrow f\beta$
10:     while $fc \geq (r/f)\beta$ do
11:       $f \leftarrow f/\beta$
12:     if $(fc)^p + (r/f)^p < \gamma s$ then
13:       converged $\leftarrow 0$,  $d_{ii} \leftarrow f\, d_{ii}$
14:       $A(:,i) \leftarrow f\, A(:,i)$,  $A(i,:) \leftarrow A(i,:)/f$

5.3 Case Study

Consider

the matrix

$$A = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 2 & 1 & 0 \\ 0 & 0 & 3 & 1 \\ \epsilon & 0 & 0 & 4 \end{pmatrix},$$

where $0 < \epsilon \ll 1$ [35]. As $\epsilon \to 0$ the eigenvalues tend towards the diagonal elements. Although these are not the exact eigenvalues, which are computed in Appendix A, we will refer to the eigenvalues as the diagonal elements. The diagonal matrix

$$D = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \epsilon^{1/4} & 0 & 0 \\ 0 & 0 & \epsilon^{1/2} & 0 \\ 0 & 0 & 0 & \epsilon^{3/4} \end{pmatrix},$$

for any $\epsilon > 0$, balances $A$ exactly for any $p$-norm. The balanced matrix is

$$\tilde A = D^{-1}AD = \begin{pmatrix} 1 & \epsilon^{1/4} & 0 & 0 \\ 0 & 2 & \epsilon^{1/4} & 0 \\ 0 & 0 & 3 & \epsilon^{1/4} \\ \epsilon^{1/4} & 0 & 0 & 4 \end{pmatrix}.$$

If $v$ is an eigenvector of $\tilde A$, then $Dv$ is an eigenvector of $A$. The eigenvector associated with 4 is the most problematic. The eigenspace of 4 for $A$ is $\operatorname{span}\{(1, 3, 6, 6)^T\}$. All of the components have about the same magnitude and therefore contribute equally towards the backward error. The eigenspace of 4 for $\tilde A$ approaches $\operatorname{span}\{(0, 0, 0, 1)^T\}$ as $\epsilon \to 0$, and the components begin to differ greatly in magnitude. Errors in the computation of the small elements are fairly irrelevant for the backward error of $\tilde A$. But, when converting back to the eigenvectors of $A$, the errors in the small components will be magnified and a poor backward error is expected. Computing the eigenvalue decomposition via Matlab's routine eig with $\epsilon = 10^{-32}$, the first component is zero. This results in no accuracy in the eigenvector computation and backward error. Appendix A provides detailed computations


of the exact eigenvalues and eigenvectors of the balanced and unbalanced matrices.

This example can be generalized to more cases where one would expect the backward error to deteriorate. If a component of the eigenvector that is originally of significance is made to be insignificant in the balanced matrix, then error in the computation is magnified when converting back to the original matrix.

Moreover, the balanced matrix fails to reduce the norm of the matrix by a significant amount, since

$$\frac{\|\operatorname{vec}(\tilde A)\|_1}{\|\operatorname{vec}(A)\|_1} \to \frac{10}{13} \quad \text{as } \epsilon \to 0.$$

If we ignore the diagonal elements, then

$$\frac{\|\operatorname{vec}(\tilde A_0)\|_1}{\|\operatorname{vec}(A_0)\|_1} \to 0,$$

where $A_0$ denotes the off-diagonal entries of $A$. Excluding the diagonal elements makes it appear as if balancing is reducing the norm of the matrix greatly. But, when the diagonal elements are included, the original matrix is already nearly balanced.

5.4 Backward Error

In this section, we study the backward error of the eigenvalue decomposition when the balancing algorithm is used as a preprocessing step. In the analysis, $c(n)$ is a low order polynomial dependent only on the size of $A$, and $V_i$ is the $i$th column of the matrix $V$.

Let $\tilde A = D^{-1}AD$, where $D$ is a diagonal matrix and the elements of the diagonal of $D$ are restricted to being exact powers of the radix base (typically 2). Therefore, there is no error in computing $\tilde A$. If a backward stable algorithm is used to compute the eigenvalue decomposition of $\tilde A$, then $\tilde A \tilde V = \tilde V \tilde\Lambda + E$, where $\|\tilde V_i\|_2 = 1$ for all $i$ and $\|E\|_F \leq c(n)\, u\, \|\tilde A\|_F$. The eigenvectors of $A$ are obtained by multiplying $\tilde V$ by $D$; again, there is no error in the computation $V = D\tilde V$. Scaling the columns of $V$ to have 2-norm of 1 gives $AV_i = \lambda_i V_i + DE_i/\|D\tilde V_i\|_2$, and we have the following upper bound on the


backward error:

$$\|AV_i - \lambda_i V_i\|_2 = \frac{\|DE_i\|_2}{\|D\tilde V_i\|_2} \leq c(n)\,u\,\frac{\|D\|_2}{\|D\tilde V_i\|_2}\,\|\tilde A\|_F \leq c(n)\,u\,\frac{\|D\|_2}{\sigma_{\min}(D)}\,\|\tilde A\|_F = c(n)\,u\,\kappa(D)\,\|\tilde A\|_F,$$

where $c(n)$ is a low order polynomial that depends only on the size of the matrix and $\kappa(D) = \|D\|_2 \|D^{-1}\|_2$. Therefore,

$$\frac{\|AV - V\Lambda\|_F}{\|A\|_F} \leq c(n)\,u\,\kappa(D)\,\frac{\|\tilde A\|_F}{\|A\|_F}. \qquad (5.2)$$

If $\tilde V_i$ is close to the singular vector associated with the minimum singular value, then the bound will be tight. This will be the case when $\tilde A$ is near a diagonal matrix. On the other hand, if $\tilde A$ is not near a diagonal matrix, then most likely $\|D\tilde V_i\|_2 \approx \|D\|_2$ and the upper bound is no longer tight. For this case it would be better to expect the error to be close to $c(n)\,u\,\|\tilde A\|_F$.

Ensuring that the quantity

$$u\,\kappa(D)\,\frac{\|\tilde A\|_F}{\|A\|_F} \qquad (5.3)$$

is small will guarantee that the backward error remains small. The quantity is easy to compute and is therefore a reasonable stopping criterion for the balancing algorithm. However, in many cases the bound is too pessimistic and will tend to not allow balancing at all (see Section 5.5).

5.5 Experiments

We will consider four examples to justify balancing in the 2-norm and including the diagonal elements in the stopping criterion. The first example is the case study discussed in Section 5.3 with $\epsilon = 10^{-32}$.
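For the case study, the quantity (5.3) can be evaluated directly. The sketch below (with $\epsilon = 10^{-32}$ and unit roundoff $u = 2^{-53}$) shows that the bound explodes to roughly $10^8$, correctly flagging that this balancing endangers the backward error:

```python
import numpy as np

eps = 1e-32
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 2.0, 1.0, 0.0],
              [0.0, 0.0, 3.0, 1.0],
              [eps, 0.0, 0.0, 4.0]])
d = np.array([1.0, eps ** 0.25, eps ** 0.5, eps ** 0.75])   # diagonal of D
A_bal = A * d[np.newaxis, :] / d[:, np.newaxis]             # D^{-1} A D, entrywise

u = 2.0 ** -53                                              # unit roundoff
kappa_D = d.max() / d.min()                                 # = ||D||_2 ||D^{-1}||_2
bound = u * kappa_D * np.linalg.norm(A_bal, 'fro') / np.linalg.norm(A, 'fro')
# kappa_D is 1e24 while the norm ratio is close to 1, so the bound is
# about 1e8 >> 1: undoing this balancing can destroy all backward accuracy.
```

Note that the balanced matrix itself is benign (every off-diagonal entry is $\epsilon^{1/4} = 10^{-8}$); the danger sits entirely in $\kappa(D)$.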


The second example is a near triangular matrix, a generalization of the case study. Let $T$ be a triangular matrix with normally distributed nonzero entries. Let $N$ be a dense random matrix with normally distributed entries. Then $A = T + \epsilon N$, with $\epsilon = 10^{-30}$. Balancing will tend towards a diagonal matrix, and the eigenvectors will contain entries with little significance. Error in these components will be magnified when converting back to the eigenvectors of $A$.

The third example is a Hessenberg matrix, that is, the Hessenberg form of a dense random matrix with normally distributed entries. It is well known that balancing this matrix using the LAPACK algorithm will increase the eigenvalue condition number [52].

The final example is a badly scaled matrix. Starting with a dense random matrix with normally distributed entries, a diagonal similarity transformation with the logarithm of the diagonal entries evenly spaced between 0 and 10 gives us the initial matrix. This is an example where we want to balance the matrix. This example is used to show that the new algorithm will still balance matrices and improve the accuracy of the eigenvalues.

To display our results we will plot the quantities of interest versus iterations of the balancing algorithm. The black stars and squares on each line indicate the iteration in which the balancing algorithm converged. Plotting this way allows us to see how the quantities vary during the balancing algorithm. This is especially useful to determine if the error bound (5.3) is a viable stopping criterion.

To check the accuracy of the eigenvalues we will examine the absolute condition number of an eigenvalue, which is given by the formula

$$\kappa(\lambda, A) = \frac{\|x\|_2\,\|y\|_2}{|y^H x|}, \qquad (5.4)$$

where $x$ and $y$ are the right and left eigenvectors associated with $\lambda$ [29].

Figure 5.1 plots the relative backward error,

$$\|AV - V\Lambda\|_2 / \|A\|_2, \qquad (5.5)$$


(solid lines) and machine precision times the maximum absolute condition number of the eigenvalues,

$$u \cdot \max\{\kappa(\lambda_i, A) \mid 1 \leq i \leq n\},$$

(dashed lines) versus iterations of the balancing algorithms. The dashed lines are an estimate of the forward error.

Figure 5.1: Backward error and absolute eigenvalue condition number versus iterations. (a) Case study. (b) Near upper triangular. (c) Hessenberg form. (d) Badly scaled.

In Figures 5.1(a) and 5.1(b) the LAPACK algorithm has no accuracy in the backward sense, while both the 1-norm and 2-norm algorithms maintain good backward error. On the other hand, the eigenvalue condition number is reduced much more by the LAPACK algorithm in the second example. The LAPACK algorithm has a condition number 7 orders of magnitude less than the 1-norm algorithm and 10 orders of magnitude less than the 2-norm algorithm. Clearly there is a trade-off to maintain desirable backward error. If only the eigenvalues are of interest, then the current balancing algorithm may still be desirable. However, the special case of balancing a random matrix already reduced to Hessenberg form would still be problematic and would need to be avoided.

In Figure 5.1(c) both the backward error and the eigenvalue condition number worsen with the LAPACK and 1-norm algorithms. The 2-norm algorithm does not allow for much scaling to occur and as a consequence maintains a good backward error and a small eigenvalue condition number. This is the only example for which the 1-norm is significantly worse than the 2-norm algorithm (5 orders of magnitude difference). This example is the only reason we choose to use the 2-norm over the 1-norm algorithm.

The final example (Figure 5.1(d)) shows all algorithms have similar results: they maintain good backward error and reduce the eigenvalue condition number. This indicates that the new algorithms would not prevent balancing when appropriate.

Figure 5.2 plots the relative backward error (5.5) (solid lines) and the backward error bound (5.3) (dashed lines). For the bound to be used as a stopping criterion, the bound should be a fairly good estimate of the true error. Otherwise, the bound would prevent balancing when balancing could have been used without hurting the backward error. Figures 5.2(a) and 5.2(b) show the bound would not be tight for the LAPACK algorithm and therefore not a good stopping criterion. Figure 5.2(d) shows that the bound is not tight for all algorithms. For this reason we do not consider using the bound as a stopping criterion.

Figure 5.3 shows the ratio of the norm of the balanced matrix and the norm of the original matrix versus iterations of the balancing algorithm. In Figures 5.3(a), 5.3(b), 5.3(c) the norm is not decreased by much in any of the algorithms. The maximum decrease is only 70% for the near triangular matrix (Figure 5.3(b)). In Figure 5.3(d) all algorithms successfully reduce the norm of the matrix by 9 orders of magnitude. In all three examples where the LAPACK algorithm deteriorates the backward error, the norm is not successfully reduced by a significant quantity. But, in our example where balancing is most beneficial, the norm


is reduced significantly.

Figure 5.2: Backward error bounds. (a) Case study. (b) Near upper triangular. (c) Hessenberg form. (d) Badly scaled.

5.6 Matrix Exponential

Balancing could also be used for the computation of the matrix exponential, since $e^{XAX^{-1}} = X e^A X^{-1}$. If the exponential of the balanced matrix can be computed more accurately, then balancing may be justified. There are numerous ways to compute the matrix exponential. The methods are discussed in the paper by Moler and Van Loan, "Nineteen Dubious Ways to Compute the Exponential of a Matrix" [33], later updated in 2003 [34]. The scaling and squaring method is considered one of the best options. Currently, MATLAB's function expm implements this method using a variant by Higham [25]. The scaling and squaring method relies on the Padé approximation of the matrix exponential, which is more accurate when $\|A\|$ is small. Hence, balancing would be beneficial.
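The scaling and squaring idea can be sketched in a few lines. Here a truncated Taylor series stands in for the Padé approximant used by expm, and the norm threshold and term count are illustrative choices rather than the tuned values from Higham's algorithm:

```python
import numpy as np
from math import ceil, log2

def expm_scale_square(A, norm_cap=0.5, terms=20):
    """Compute e^A via e^A = (e^{A/2^k})^{2^k}: scale A down until
    ||A/2^k|| <= norm_cap, approximate the exponential of the small
    matrix, then square the result k times."""
    nrm = np.linalg.norm(A, 1)
    k = max(0, ceil(log2(nrm / norm_cap))) if nrm > norm_cap else 0
    B = A / 2.0 ** k
    E = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for j in range(1, terms + 1):        # truncated Taylor series for e^B
        term = term @ B / j
        E = E + term
    for _ in range(k):                   # undo the scaling by squaring
        E = E @ E
    return E
```

For the nilpotent matrix $A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$ the series is exact and the sketch reproduces $e^A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$.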


Figure 5.3: Ratio of balanced matrix norm and original matrix norm. (a) Case study. (b) Near upper triangular. (c) Hessenberg form. (d) Badly scaled (log scale).

However, the matrix exponential possesses the special property $e^A = (e^{A/s})^s$. Scaling the matrix by $s$ successfully reduces the norm of $A$ to an acceptable level for an accurate computation of $e^{A/s}$ by the Padé approximant. When $s$ is a power of 2, repeated squaring is used to obtain the final result. Since the scaling of $A$ successfully achieves the same goal as balancing, balancing is not required as a preprocessing step.

5.7 Conclusion

We proposed a change to the LAPACK balancing algorithm GEBAL that improves the stopping criterion for deciding when a matrix is balanced enough. Including the diagonal elements in the balancing algorithm provides a better measure of significant decrease in the matrix norm than the current version. By including the diagonal elements and switching to the 2-norm we are able to avoid deteriorating the backward error for our test cases.


The proposed change is designed to protect the backward error of the computation. In some cases the new algorithm will prevent decreasing the maximum eigenvalue condition number as much as the current algorithm does. If only the eigenvalues are of interest, then it may be desirable to use the current version. However, for the case where $A$ is dense and poorly scaled, the new algorithm will still balance the matrix and improve the eigenvalue condition number. If accurate eigenvectors are desired, then one should consider not balancing the matrix.


6. Conclusion

In this thesis, we present a study of communication avoiding algorithms and stability analysis.

We improve the lower bounds for the volume of words that are transferred for matrix multiplication. We present a review of recursive algorithms that are optimal in terms of the volume of words transferred.

We present stability analysis of many well-known algorithms for performing an oblique QR factorization. We also introduce a new algorithm, PRE-CHOLQR. We provide a set of test cases to assess the tightness of the obtained bounds. From these tests, we see that the bounds are tight for all cases.

The new algorithm, PRE-CHOLQR, is stable and can significantly outperform existing stable algorithms. The existing stable algorithms are CGS2, CHOL-EQR, and SYEV-EQR. CGS2 communicates $n$ times as many messages for the multiplication with $A$ and the columns of $Z$ compared to PRE-CHOLQR. The increase in communication is observed to decrease the performance of CGS2. Both CHOL-EQR and SYEV-EQR require a factorization of the inner product matrix. For dense matrices, the factorization is too expensive for either algorithm to be efficient. For sparse matrices, factoring $A$ may be a viable option.

Two new algorithms for the reduction operation are presented. The two algorithms are developed under two different communication models (unidirectional and bidirectional). We prove that the unidirectional algorithm is optimal under certain assumptions. Our assumptions are that the network is fully connected and homogeneous. We also assume the algorithms reduce the segments in order at a processor. We adapted the greedy algorithm that was developed in the unidirectional context to an algorithm that was more suited for a bidirectional system. With simulations we found that the bi-greedy algorithm matched the time complexity of an optimal broadcast algorithm scheduled in reverse order as a reduction. Similar performance improvements were found as in the unidirectional case.


Implementations of the algorithms confirm the theoretical results: the greedy algorithms provide the most improvement for medium size messages. Our implementation of the bi-greedy algorithm performs best among all algorithms implemented for medium size messages. For small and large messages, the more simplistic algorithms (binomial and pipeline), which are asymptotically optimal, outperformed the greedy algorithms.

Unequal segmentation was also studied. In our sample test suite, the greedy algorithm was optimized by an unequal segmentation 6.2% of the time. The maximum theoretical improvement over equal segmentation was 7.3%.

Finally, we proposed a change to the LAPACK balancing algorithm GEBAL that improves the stopping criterion for deciding when a matrix is balanced enough. Including the diagonal elements in the balancing algorithm provides a better measure of significant decrease in the matrix norm than the current version. By including the diagonal elements and switching to the 2-norm we are able to avoid deteriorating the backward error for our test cases.


REFERENCES

[1] G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Minimizing communication in numerical linear algebra. SIAM Journal on Matrix Analysis and Applications, 32:866, 2011.

[2] Amotz Bar-Noy, Shlomo Kipnis, and Baruch Schieber. Optimal multiple message broadcasting in telephone-like communication systems. Discrete Appl. Math., 100:1, March 2000.

[3] O. Beaumont, A. Legrand, L. Marchal, and Y. Robert. Pipelining broadcasts on heterogeneous platforms. IEEE Transactions on Parallel and Distributed Systems, 16:300-313, April 2005.

[4] Olivier Beaumont, Loris Marchal, and Yves Robert. Broadcast trees for heterogeneous platforms. In 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), page 3. Society Press, 2005.

[5] Å. Björck. Numerical Methods for Least Squares Problems. Number 51. Society for Industrial Mathematics, 1996.

[6] Y. D. Burago and V. A. Zalgaller. Geometric Inequalities. Grundlehren Math. Wiss. Springer, Berlin, 1988.

[7] Lynn E. Cannon. A cellular computer to implement the Kalman filter algorithm. Technical report, DTIC Document, 1969.

[8] E. Chan, M. Heimlich, A. Purkayastha, and R. van de Geijn. On optimizing collective communication. In CLUSTER '04: Proceedings of the 2004 IEEE International Conference on Cluster Computing, page 145, 2004.

[9] Ernie Chan, Robert A. van de Geijn, William Gropp, and Rajeev Thakur. Collective communication on architectures that support simultaneous communication over multiple links. In PPOPP, page 2, 2006.

[10] Tzu-Yi Chen. Balancing sparse matrices for computing eigenvalues. Master's thesis, University of California at Berkeley, 1998.

[11] Tzu-Yi Chen. Preconditioning Sparse Matrices for Computing Eigenvalues and Solving Linear Systems of Equations. PhD thesis, University of California at Berkeley, 2001.

[12] Tzu-Yi Chen and James W. Demmel. Balancing sparse matrices for computing eigenvalues. Linear Algebra and Its Applications, 309:261, 2000.


[13] Jaeyoung Choi, David W. Walker, and Jack J. Dongarra. PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers. Concurrency: Practice and Experience, 6:543, 1994.

[14] Michel Cosnard and Yves Robert. Complexity of parallel QR factorization. Journal of the A.C.M., 33:712, 1986.

[15] James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou. Communication-optimal parallel and sequential QR and LU factorizations. SIAM Journal on Scientific Computing, 34:A206-A239, 2012.

[16] Erik Elmroth and Fred G. Gustavson. Applying recursion to serial and parallel QR factorization leads to better performance. IBM Journal of Research and Development, 44:605, 2000.

[17] Jeremy D. Frens and David S. Wise. QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism. In ACM SIGPLAN Notices, volume 38, page 144. ACM, 2003.

[18] Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-oblivious algorithms. In Foundations of Computer Science, 1999. 40th Annual Symposium on, page 285. IEEE, 1999.

[19] L. Giraud, J. Langou, and M. Rozložník. The loss of orthogonality in the Gram-Schmidt orthogonalization process. Computers & Mathematics with Applications, 50:1069, 2005.

[20] L. Giraud, J. Langou, M. Rozložník, and J. van den Eshof. Rounding error analysis of the classical Gram-Schmidt orthogonalization process. Numerische Mathematik, 101:87-100, 2005.

[21] Gene H. Golub and Charles F. Van Loan. Matrix Computations (4th ed.). Johns Hopkins University Press, Baltimore, MD, USA, 2013.

[22] M. Gulliksson. Backward error analysis for the constrained and weighted linear least squares problem when using the weighted QR factorization. SIAM Journal on Matrix Analysis and Applications, 16:675, 1995.

[23] Fred G. Gustavson. Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM Journal of Research and Development, 41:737, 1997.

[24] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 2002.

[25] Nicholas J. Higham. The scaling and squaring method for the matrix exponential revisited. SIAM Journal on Matrix Analysis and Applications, 26:1179, 2005.


[26] R. Hockney. The communication challenge for MPP: Intel Paragon and Meiko CS-2. Parallel Computing, 20:389, 1994.

[27] Jia-Wei Hong and H. T. Kung. I/O complexity: The red-blue pebble game. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, STOC '81, page 326, New York, NY, USA, 1981. ACM.

[28] Dror Irony, Sivan Toledo, and Alexander Tiskin. Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput., 64:1017-1026, September 2004.

[29] D. Kressner. Numerical Methods for General and Structured Eigenvalue Problems, volume 46 of Lecture Notes in Computational Science and Engineering Series. Springer-Verlag Berlin and Heidelberg GmbH & Company KG, 2005.

[30] A. Legrand, L. Marchal, and Y. Robert. Optimizing the steady-state throughput of scatter and reduce operations on heterogeneous platforms. J. Parallel Distrib. Comput., 65:1497, December 2005.

[31] Frans Lemeire. Equilibration of matrices to optimize backward numerical stability. BIT Numerical Mathematics, 16:143, 1976.

[32] L. H. Loomis and H. Whitney. An inequality related to the isoperimetric inequality. Bull. AMS, 55:961-962, 1949.

[33] Cleve Moler and Charles Van Loan. Nineteen dubious ways to compute the exponential of a matrix. SIAM Review, 20:801, 1978.

[34] Cleve Moler and Charles Van Loan. Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Review, 45:3, 2003.

[35] Ron Morgan. http://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=13&t=4270, 2013.

[36] E. E. Osborne. On pre-conditioning of matrices. J. ACM, 7:338, October 1960.

[37] Beresford N. Parlett and Christian Reinsch. Balancing a matrix for calculation of eigenvalues and eigenvectors. Numerische Mathematik, 13:293, 1969.

[38] J. Pjesivac-Grbović, T. Angskun, G. Bosilca, G. E. Fagg, E. Gabriel, and J. J. Dongarra. Performance analysis of MPI collective operations. In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, 8 pp., April 2005.

[39] Rolf Rabenseifner. Optimization of collective reduction operations. In International Conference on Computational Science, page 1, 2004.

[40] Rolf Rabenseifner and Jesper Larsson Träff. More efficient reduction algorithms for non-power-of-two number of processors in message-passing parallel systems. In PVM/MPI, page 36, 2004.


[41] Miroslav Rozložník, Miroslav Tůma, Alicja Smoktunowicz, and Jiří Kopal. Numerical stability of orthogonalization methods with a non-standard inner product. BIT Numerical Mathematics, page 1, 2012.

[42] Peter Sanders, Jochen Speck, and Jesper Larsson Träff. Two-tree algorithms for full bandwidth broadcast, reduction and scan. Parallel Computing, 35:581-594, 2009. Selected papers from the 14th European PVM/MPI Users Group Meeting.

[43] Robert Schreiber and Charles Van Loan. A storage-efficient WY representation for products of Householder transformations. SIAM Journal on Scientific and Statistical Computing, 10:53, 1989.

[44] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354, 1969.

[45] S. J. Thomas. A block algorithm for orthogonalization in elliptic norms. Lecture Notes in Computer Science, 634:379, 1992.

[46] S. J. Thomas and R. V. M. Zahar. Efficient orthogonalization in the M-norm. Congressus Numerantium, 80:23, 1991.

[47] S. J. Thomas and R. V. M. Zahar. An analysis of orthogonalization in elliptic norms. Congressus Numerantium, 86:193, 1992.

[48] Sivan Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl., 18:1065, October 1997.

[49] Jesper Larsson Träff and Andreas Ripke. Optimal broadcast for fully connected processor-node networks. J. Parallel Distrib. Comput., 68:887, 2008.

[50] Sathish S. Vadhiyar, Graham E. Fagg, and Jack Dongarra. Automatically tuned collective communications. In Proceedings of the 2000 ACM/IEEE Conference on Supercomputing (CDROM), Supercomputing '00, Washington, DC, USA, 2000. IEEE Computer Society.

[51] Robert A. van de Geijn and Jerrell Watts. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience, 9:255, 1997.

[52] David S. Watkins. A case where balancing is harmful. Electron. Trans. Numer. Anal., 23:1, 2006.

[53] David S. Watkins. The Matrix Eigenvalue Problem: GR and Krylov Subspace Methods. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1st edition, 2007.

[54] J. H. Wilkinson and C. Reinsch, editors. Handbook for Automatic Computation, Volume II, Linear Algebra. Springer-Verlag, New York, 1971.


[55] Zhaojun Bai, James Demmel, Jack Dongarra, Axel Ruhe, and Henk van der Vorst, editors. Templates for the Solution of Algebraic Eigenvalue Problems. SIAM, Philadelphia, 2000.


Appendix A. Exact Eigenvalues and Eigenvectors for Balancing Case Study

Consider the matrix

$$A = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 2 & 1 & 0 \\ 0 & 0 & 3 & 1 \\ \epsilon & 0 & 0 & 4 \end{pmatrix},$$

where $0 < \epsilon \ll 1$. The characteristic equation of $A$ is $(\lambda - 1)(\lambda - 2)(\lambda - 3)(\lambda - 4) = \epsilon$, so the eigenvalues of $A$ are

$$\lambda_1 = \tfrac{1}{2}\Big(5 - \sqrt{5 + 4\sqrt{1+\epsilon}}\Big), \quad \lambda_2 = \tfrac{1}{2}\Big(5 - \sqrt{5 - 4\sqrt{1+\epsilon}}\Big), \quad \lambda_3 = \tfrac{1}{2}\Big(5 + \sqrt{5 - 4\sqrt{1+\epsilon}}\Big), \quad \lambda_4 = \tfrac{1}{2}\Big(5 + \sqrt{5 + 4\sqrt{1+\epsilon}}\Big),$$

with corresponding eigenvectors given, up to scaling, by back-substitution in $(A - \lambda_i I)V_i = 0$:

$$V_i = \big(1,\; \lambda_i - 1,\; (\lambda_i - 1)(\lambda_i - 2),\; (\lambda_i - 1)(\lambda_i - 2)(\lambda_i - 3)\big)^T, \qquad i = 1, 2, 3, 4.$$

(The last equation of $(A - \lambda_i I)V_i = 0$, namely $\epsilon + (4 - \lambda_i)(\lambda_i - 1)(\lambda_i - 2)(\lambda_i - 3) = 0$, holds precisely because $\lambda_i$ satisfies the characteristic equation.)
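These closed forms are easy to sanity-check numerically (the value of $\epsilon$ here is illustrative):

```python
import numpy as np

eps = 1e-6
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 2.0, 1.0, 0.0],
              [0.0, 0.0, 3.0, 1.0],
              [eps, 0.0, 0.0, 4.0]])
s = np.sqrt(1.0 + eps)
lams = np.array([(5 - np.sqrt(5 + 4 * s)) / 2,
                 (5 - np.sqrt(5 - 4 * s)) / 2,
                 (5 + np.sqrt(5 - 4 * s)) / 2,
                 (5 + np.sqrt(5 + 4 * s)) / 2])
# Each closed-form eigenvalue satisfies the characteristic equation
# (l - 1)(l - 2)(l - 3)(l - 4) = eps ...
resid = (lams - 1) * (lams - 2) * (lams - 3) * (lams - 4) - eps
# ... and the values agree with the numerically computed spectrum of A.
```

As expected, the four values are real and sit within $O(\epsilon)$ of the diagonal entries $1, 2, 3, 4$.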


Let $V = (V_1\; V_2\; V_3\; V_4)$ and $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \lambda_3, \lambda_4)$. Note that as $\epsilon \to 0$, we have

$$\Lambda \to \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 4 \end{pmatrix},$$

and, with some rescaling,

$$V \to \begin{pmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 2 & 3 \\ 0 & 0 & 2 & 6 \\ 0 & 0 & 0 & 6 \end{pmatrix}.$$

To balance $A$, we must find an invertible diagonal matrix

$$D = \begin{pmatrix} \alpha & 0 & 0 & 0 \\ 0 & \beta & 0 & 0 \\ 0 & 0 & \gamma & 0 \\ 0 & 0 & 0 & \delta \end{pmatrix}$$

such that $\tilde A = D^{-1}AD$ has equal row and column norms for each diagonal element, where

$$D^{-1}AD = \begin{pmatrix} 1 & \beta/\alpha & 0 & 0 \\ 0 & 2 & \gamma/\beta & 0 \\ 0 & 0 & 3 & \delta/\gamma \\ \epsilon\alpha/\delta & 0 & 0 & 4 \end{pmatrix}.$$

If we require that the diagonal elements of $D$ are all positive, the balancing condition can be written as

$$\frac{\beta}{\alpha} = \frac{\gamma}{\beta} = \frac{\delta}{\gamma} = \frac{\epsilon\alpha}{\delta}. \qquad (A.1)$$

The balancing condition (A.1) can be met by parameterizing $D$ in terms of $\alpha$, where $\beta = \alpha\epsilon^{1/4}$,


$\gamma = \alpha\epsilon^{1/2}$, $\delta = \alpha\epsilon^{3/4}$, and then (taking $\alpha = 1$) the balancing matrix $D$ is

$$D = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \epsilon^{1/4} & 0 & 0 \\ 0 & 0 & \epsilon^{1/2} & 0 \\ 0 & 0 & 0 & \epsilon^{3/4} \end{pmatrix}.$$

The balanced matrix $\tilde A$ is then

$$\tilde A = D^{-1}AD = \begin{pmatrix} 1 & \epsilon^{1/4} & 0 & 0 \\ 0 & 2 & \epsilon^{1/4} & 0 \\ 0 & 0 & 3 & \epsilon^{1/4} \\ \epsilon^{1/4} & 0 & 0 & 4 \end{pmatrix}.$$

Computing the eigenvectors of $\tilde A$ directly (equivalently, forming $\tilde V_i = D^{-1}V_i$, up to scaling), we have

$$\tilde V_i = \big(1,\; (\lambda_i - 1)\epsilon^{-1/4},\; (\lambda_i - 1)(\lambda_i - 2)\epsilon^{-1/2},\; (\lambda_i - 1)(\lambda_i - 2)(\lambda_i - 3)\epsilon^{-3/4}\big)^T, \qquad i = 1, 2, 3, 4.$$


Note that as $\epsilon \to 0$, the balanced matrix $\tilde A$ tends towards a diagonal matrix:

$$\tilde A \to \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 4 \end{pmatrix},$$

with the (suitably rescaled) eigenvectors of the balanced matrix approaching the identity matrix.