Citation
GPU declarative framework

Material Information

Title:
GPU declarative framework
Creator:
Senser, Robert W. ( author )
Place of Publication:
Denver, CO
Publisher:
University of Colorado Denver
Publication Date:
Language:
English
Physical Description:
1 electronic file (205 pages)

Subjects

Subjects / Keywords:
Declarative programming languages ( lcsh )
C++ (Computer program language) ( lcsh )
Programming languages (Electronic computers) ( lcsh )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Review:
This dissertation presents our novel declarative framework, called the Declarative Framework for GPUs (DEFG). GPUs are highly sophisticated computing devices, capable of computing at very high speeds. The framework makes the development of OpenCL-based GPU applications less complex and less time consuming. The framework's approach is two-fold. First, we developed the DEFG domain-specific language in such a way that it uses primarily declarative statements, and design patterns, to define the CPU actions to manage GPU kernels. This approach makes the GPU processing power more accessible. It does this by lowering the amount of specialized GPU knowledge the GPU software developer must possess. It also decreases the volume of code written, meaning these applications can be written more easily and quickly. The second aspect of our approach is the addition of high-value GPU capabilities, most notably the declarative utilization of additional GPU devices. The use of multiple devices, in this DEFG manner, allows for scaling the application run-time performance without rewriting the entire application. This aspect of DEFG makes developing multiple-GPU applications faster and more straightforward. In this dissertation, we describe DEFG's novel parser, optimizer, and code generator, as well as provide detailed descriptions of DEFG's domain-specific language and associated design patterns. Taken together, these components make it possible for the developer to write DEFG source and generate C/C++ programs containing highly optimized OpenCL requests that provide high performance. In order to demonstrate the viability of DEFG, we produce applications related to common areas of computer science: image filtering, graph processing, sorting, and numerical algebra, specifically iterative matrix inversion. We select these applications because each one explores different aspects of GPU use and OpenCL.
Taken together, they demonstrate DEFG's ability to function well over a wide range of applications. To show DEFG's capabilities in image processing, we implement the Sobel operator and median image filter applications, with an emphasis on multiple-GPU processing. Our graph-processing, breadth-first search application shows DEFG's ability to process large irregular data structures, with multiple GPUs. In the sorting realm, the novel roughly sorting application shows DEFG's GPU-based sorting capability. Finally, the numerical algebra application, an interesting iterative matrix inversion implementation, exhibits DEFG's ability to implement iterative, GPU-based, numerical processing. DEFG's domain-specific language is able to use mainly declarations to describe the CPU actions needed to manage complex GPU actions. We demonstrate that this approach to GPU-oriented software development succeeds through the production of our OpenCL applications.
Thesis:
Thesis (Ph.D.)--University of Colorado Denver. Computer sciences and information systems
Bibliography:
Includes bibliographic references.
System Details:
System requirements: Adobe Reader.
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Robert W. Senser, Jr.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
904620394 ( OCLC )
ocn904620394



Full Text

PAGE 1

GPU DECLARATIVE FRAMEWORK
by
ROBERT W. SENSER, JR.
B.S., University of Wyoming, 1973
M.A., University of Hawaii, 1975
M.S., University of Hawaii, 1975

A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Computer Science and Information Systems
2014

PAGE 2

© 2014
ROBERT W. SENSER, JR.
ALL RIGHTS RESERVED

PAGE 3

This thesis for the Doctor of Philosophy degree by
Robert W. Senser, Jr.
has been approved for the
Computer Science and Information Systems Program
by

Gita Alaghband, Chair
Tom Altman, Advisor
Michael Mannino
Boris Stilman
Tam N. Vu

November 7, 2014

PAGE 4

Senser, Robert W., Jr. (Ph.D., Computer Science and Information Systems)
GPU Declarative Framework
Thesis directed by Professor Tom Altman.

ABSTRACT

This dissertation presents our novel declarative framework, called the Declarative Framework for GPUs (DEFG). GPUs are highly sophisticated computing devices, capable of computing at very high speeds. The framework makes the development of OpenCL-based GPU applications less complex and less time consuming. The framework's approach is two-fold. First, we developed the DEFG domain-specific language in such a way that it uses primarily declarative statements, and design patterns, to define the CPU actions to manage GPU kernels. This approach makes the GPU processing power more accessible. It does this by lowering the amount of specialized GPU knowledge the GPU software developer must possess. It also decreases the volume of code written, meaning these applications can be written more easily and quickly. The second aspect of our approach is the addition of high-value GPU capabilities, most notably the declarative utilization of additional GPU devices. The use of multiple devices, in this DEFG manner, allows for scaling the application run-time performance without rewriting the entire application. This aspect of DEFG makes developing multiple-GPU applications faster and more straightforward.

In this dissertation, we describe DEFG's novel parser, optimizer, and code generator, as well as provide detailed descriptions of DEFG's domain-specific language and associated design patterns. Taken together, these components make it possible for the developer to write DEFG source and generate C/C++ programs containing highly optimized OpenCL requests that provide high performance.

In order to demonstrate the viability of DEFG, we produce applications related

PAGE 5

to common areas of computer science: image filtering, graph processing, sorting, and numerical algebra, specifically iterative matrix inversion. We select these applications because each one explores different aspects of GPU use and OpenCL. Taken together, they demonstrate DEFG's ability to function well over a wide range of applications. To show DEFG's capabilities in image processing, we implement the Sobel operator and median image filter applications, with an emphasis on multiple-GPU processing. Our graph-processing, breadth-first search application shows DEFG's ability to process large irregular data structures, with multiple GPUs. In the sorting realm, the novel roughly sorting application shows DEFG's GPU-based sorting capability. Finally, the numerical algebra application, an interesting iterative matrix inversion implementation, exhibits DEFG's ability to implement iterative, GPU-based, numerical processing.

DEFG's domain-specific language is able to use mainly declarations to describe the CPU actions needed to manage complex GPU actions. We demonstrate that this approach to GPU-oriented software development succeeds through the production of our OpenCL applications.

The form and content of this abstract are approved. I recommend its publication.

Approved: Tom Altman

PAGE 6

This work is dedicated to the memory of my parents, Robert W. Senser and Maria E. Senser, and my younger brother, Dwight W. Senser, Ph.D., whose precious life was lost to a fool with a handgun.

You three Sensers showed me the way.

PAGE 7

ACKNOWLEDGEMENTS

Getting this work to the "finish line" was not an easy task. Two people's extensive support has been beyond measure:

To my intrepid advisor, Tom Altman, my deepest thank you. You brought out my best, tolerated my worst, and made it possible for me to run with my passions.

Most of all, to my wonderful wife, Linda Senser, who supported me through the worst moments, who laughed with me through the nuttiest, and who never gave up even when I wanted to: danke sehr.

PAGE 8

TABLE OF CONTENTS

LIST OF FIGURES ... ix
LIST OF TABLES ... xi
GLOSSARY ... xii

CHAPTER
I. INTRODUCTION ... 1
   1.1 GPU Software Development ... 1
   1.2 Motivation ... 2
   1.3 Contributions ... 3
   1.4 Delimitations ... 5
II. RELATED WORK: Graphics Processing Units and OpenCL ... 6
   2.1 Introduction ... 6
   2.2 Basic Overview of CPUs, GPUs, and PRAMs ... 7
   2.3 Modern GPUs ... 8
   2.4 GPGPU ... 10
   2.5 OpenCL and GPU Basics ... 10
   2.6 Parallelization and Domain-Specific Languages ... 14
III. OVERVIEW OF DEFG AND ITS PERFORMANCE ... 16
   3.1 Introduction ... 16
   3.2 DEFG Framework and DEFG Language ... 17
   3.3 Viability of DEFG ... 19
   3.4 Discussion of Results ... 21
IV. DEFG THEORY OF OPERATIONS ... 24
   4.1 Introduction ... 24

PAGE 9

   4.2 DEFG Design Patterns ... 25
   4.3 DEFG Internal Operations ... 37
V. NEW AND DIVERSE DEFG APPLICATIONS ... 48
   5.1 Application: Image Filters ... 50
   5.2 Application: Breadth-First Search ... 65
   5.3 Application: Sorting Roughly Sorted Data ... 86
   5.4 Application: Altman Method of Matrix Inversion ... 106
VI. ACCOMPLISHMENTS, OBSERVATIONS, AND FUTURE RESEARCH ... 118
   6.1 DEFG Accomplishments ... 118
   6.2 Some Noteworthy Observations ... 119
   6.3 Conflicting DEFG Aims and Static Optimization ... 121
   6.4 Future Research ... 121
   6.5 DEFG Technical Improvements ... 124
BIBLIOGRAPHY ... 125
APPENDICES ... 132
A. DEFG User's Guide ... 133
   A.1 Introduction ... 133
   A.2 Intended Audience ... 134
   A.3 DEFG Examples ... 134
   A.4 Common DEFG Design Patterns ... 137
   A.5 DEFG Language Reference ... 141
   A.6 DEFG Advanced Features ... 160
   A.7 How to Execute the DEFG Translator ... 164
   A.8 DEFG Error Handling ... 165
B. Source Code and Other Items ... 167
   B.1 Hardware and Software Description ... 167
   B.2 Suggested DEFG Technical Improvements ... 167
   B.3 The DEFG Mini-Experiment with Four GPUs ... 169
   B.4 DEFG Application Source Code ... 171
   B.5 DEFG Diagnostic Source Code ... 189
   B.6 DEFG Major Components ... 192

PAGE 10

LIST OF FIGURES

Figure
2.1 OpenCL Developer's View ... 12
3.1 Sample DEFG Code ... 17
3.2 Snippet of Generated OpenCL Code ... 18
3.3 Snippet of Sobel OpenCL Kernel Code ... 18
3.4 Sample DEFG Code Showing a Sequence ... 19
3.5 Application Lines-of-Code Comparison ... 21
3.6 Application Run-Time Performance Comparison ... 22
4.1 DEFG Translation-Steps Diagram ... 38
4.2 Sample XML Output From DEFG Parser ... 40
4.3 Sample XML Output Snippet From DEFG Optimizer ... 41
4.4 DEFG Code Generation Diagram ... 45
4.5 C/C++ Snippet for sobel_filter Kernel Execution ... 47
5.1 Sobel Operator Performed with DEFG: Before and After Images ... 53
5.2 Median 5×5 Filter Performed with DEFG: Original, Noise-Added, and After-Processing Images ... 54
5.3 Kernel Code for Median Filter with 3×3 Neighborhood ... 56
5.4 Image Schematic Showing Overlap with 2 GPUs ... 57
5.5 DEFG Code to Execute the 5×5 Median Filter ... 58
5.6 Plot of Filter by Image, Average Run Times ... 61
5.7 Prefix-Sum-based Buffer Allocation ... 69
5.8 BFS Application's DEFG Loop ... 72
5.9 BFS Application's kernel1 ... 73
5.10 BFS Application's kernel2 ... 73
5.11 BFSDP2GPU Application's kernel1a2 ... 77
5.12 BFSDP2GPU Application's kernel1b ... 77
5.13 BFSDP2GPU DEFG Pseudo Code ... 79
5.14 BFS Versus BFSDP2GPU Run Times with Rodinia Graphs ... 81
5.15 LR, RL, and DM Pseudo Code ... 90
5.16 LRmax and RLmin Kernels ... 91
5.17 DM and UB Kernels ... 92
5.18 RSORT DEFG Declare Statements ... 93
5.19 RSORT DEFG Executable Statements ... 95

PAGE 11

5.20 Plot of Sort Run Times for 2^23 (8,388,608) Items ... 97
5.21 Two-Server Plot of Sort Run Times with 2^23 Items ... 100
5.22 Plot of Run-Time Breakout with 2^23 Items ... 102
5.23 Abbreviated RSORT DEFG Executable Statements ... 104
5.24 Plot of Sort Run Times with 2^26 Items ... 105
5.25 Plot of Sort Run Times with 2^27 Items ... 106
5.26 IMIFLX Application Processing Loop ... 111
5.27 SweepSquares Kernel Source Code ... 113
5.28 Table and Plot of M500 Matrix Norm Values ... 115
B.1 Run-Time Comparison with 1, 2 and 4 GPUs ... 170

PAGE 12

LIST OF TABLES

Table
2.1 GPU Performance Constraints ... 10
3.1 Test Configurations ... 20
3.2 Lines of Code ... 20
3.3 Run-time Performance, in Milliseconds ... 21
5.1 Execution Times on Hydra Server, in Milliseconds ... 57
5.2 Images Used with Filter Application Testing ... 59
5.3 Run Times for Various Images ... 60
5.4 Detailed Run Times for BUFLO Image ... 62
5.5 Run Times for pthread Experiment ... 63
5.6 Rodinia Graph Characteristics ... 80
5.7 Run Times of BFS Versus BFSDP2GPU, in Seconds ... 81
5.8 SNAP Graph Characteristics ... 82
5.9 Run Times from SNAP Graphs, BFS Versus BFSDP2GPU, in Seconds ... 82
5.10 Run Times, in Seconds, for Sorting 2^23 (8,388,608) Items ... 98
5.11 Two-Server Run Times with 2^23 Items, in Seconds ... 101
5.12 Run-Time Breakout with 2^23 Items, in Seconds ... 102
5.13 Sort Run Times on Hydra with 2^26 Items, in Seconds ... 105
5.14 Sort Run Times on Hydra with 2^27 Items, in Seconds ... 105
5.15 Comparison of M1000 Run Times ... 113
5.16 IMIFLX Inversion Results for Various Matrices ... 117
A.1 A Partial List of DEFG Loaders and Functions ... 160
B.1 Testing Configurations, Hardware and Software ... 167

PAGE 13

GLOSSARY

BFSDP2GPU: Multiple-GPU, breadth-first search DEFG application. 30
CUDA: NVIDIA-provided capability to execute C/C++ on NVIDIA GPUs. 2
DEFG: Framework to declaratively create GPU applications. 3
GPU: Acronym for Graphics Processing Unit. 1
IMIFLX: Iterative matrix inversion DEFG application. 110
LVI: Acronym for Large Very Irregular, applied to graphs. 65
MEDIAN: Median image filter DEFG application. 54
OpenCL: Khronos Group specification to execute C/C++ on GPUs. 2
PRAM: Acronym for the abstraction Parallel Random Access Machine. 6
RSORT: Roughly sorting DEFG application. 33
SIMD: Acronym for Single Instruction, Multiple Data. 9
SIMT: Acronym for Single Instruction, Multiple Thread. 8
SISD: Acronym for Single Instruction, Single Data. 67
SOBEL: Sobel image filter DEFG application. 19

PAGE 14

CHAPTER I

INTRODUCTION

1.1 GPU Software Development

The graphics processing unit (GPU) is a hardware component having the potential to rapidly execute computer algorithms and code. The raw double-precision floating-point throughput of GPUs now exceeds two tera floating-point operations per second (TFLOPS). One GPU, the AMD Radeon HD 7990, can perform 8.2 single-precision TFLOPS and 2.04 double-precision TFLOPS [3]. This computational capability comes at an attractive cost of $1K retail and has made GPUs very attractive for executing non-graphic applications. The high throughput provided by GPUs has been used in the high performance computing (HPC) scientific community to develop many types of applications.

However, while the GPU hardware costs are relatively low, GPU software development costs can be prohibitively high. The software development process associated with GPUs can be complex and more costly than standard software development because of the architecture of GPU hardware and the specialized software development environments needed for GPU programming. Achieving high performance in a GPU environment requires the developer to understand not only the application requirements and parallel software, but also the additional unique characteristics of the GPU hardware. These unique characteristics add complexity, and many pertinent low-level details, to the development environment.

PAGE 15

This dissertation presents a novel OpenCL programming framework, which is designed to lower the software development complexity, and thus, the costs of HPC GPU usage. Our approach to simplifying OpenCL programming is two-fold: first, we make use of declarative statements and design patterns to define the CPU side of a GPU application, thereby lowering the number of lines of code written, and lessening the low-level GPU knowledge needed by the developer.

Second, we add options such as the utilization of multiple GPU devices to scale the application run-time performance without the need to rewrite the application. For certain application types, computing power can be added, in the form of additional GPU devices, without adding significant software development costs.

1.2 Motivation

Developing software for use in high performance computing is often a difficult undertaking. It requires not only a thorough understanding of the application problem being solved and the algorithms used to solve it, but also an in-depth understanding of the unique characteristics of the hardware platform being utilized. When the platform is parallel in nature, the software becomes even more difficult to write due to the added complexities of parallel execution [47]. The HPC use of GPUs for general processing fits into this latter category of especially difficult software development.

The GPU has been shown to have very high throughput capabilities [71]. Ideally, existing HPC software would be moved to the GPU, and the GPU's high throughput would be immediately available. And at first glance, this appears to be possible because both of the common GPU programming environments, NVIDIA's Compute Unified Device Architecture (CUDA) and the Khronos Group's OpenCL Specification, provide the capability to execute C/C++ code as GPU kernels [65, 70]. Unfortunately, GPU software produced this way tends to have very poor performance characteristics.

PAGE 16

High performance GPU programming requires the use of specialized, parallel algorithms and GPU-specific, low-level application programming interfaces (APIs). This use, in turn, requires that the developer possess a thorough understanding of the overall GPU hardware architecture. For example, the developer must avoid the major issues of memory latency and instruction path divergence if he or she wants to obtain high levels of GPU performance. The result is that GPU software development tends to be both complex and time consuming [34].

In this dissertation, we present a novel declarative framework, called the Declarative Framework for GPUs (DEFG), which makes development of OpenCL-based GPU applications less complex, and less time consuming [82, 80, 81]. It mitigates the need for a deep understanding of the full CPU-side API used with technologies such as OpenCL. DEFG allows the developer to focus on the algorithms being used and the most efficient usage of the overall GPU architecture.

In addition, to clearly show DEFG's viability, we demonstrate its use and performance with four diverse, general GPU applications. Each application puts different demands on the framework, thereby showing the framework's applicability, flexibility, and general robustness. Here robustness refers to the framework's ability to elegantly handle differing applications' demands and requirements. For certain applications, the DEFG approach makes it possible to scale the application to multiple GPU cards without application changes. This application scaling is made possible by declaring the nature of the application and developing the application GPU kernels, then having the framework generate the code to interconnect the CPU and GPUs in a high-performance manner.

1.3 Contributions

In addition to the construction of the novel DEFG framework, our research contributes two groups of OpenCL applications. The first group consists of three existing OpenCL

PAGE 17

applications that were converted to DEFG. These DEFG conversions showed the run-time performance of DEFG matching, or exceeding, that of the native OpenCL applications. And, this level of performance was achieved with the software developer writing fewer lines of code, relative to the corresponding, original OpenCL application.

The second group contains four new OpenCL applications, which are used as DEFG use-cases. These latter applications demonstrate the applicability of DEFG in diverse domains, ranging from graph processing to sorting. Each of these applications is measured and analyzed. In some cases, the analysis includes comparisons, in terms of run-time performance and other metrics, between DEFG and non-DEFG application versions.

In summary, this dissertation makes the following contributions to computer science:

- The design, implementation, testing, and analysis of our novel Declarative Framework for GPUs (DEFG).
- Application: Sobel and Median image filtering using multiple GPUs. Provides DEFG implementation, measurement, and analysis [86, 87].
- Application: Breadth-first search application using multiple GPUs. Provides GPU algorithmic design, DEFG implementation, measurement, and analysis [38, 59].
- Application: Sorting roughly (partially) sorted data. Provides GPU algorithmic design, DEFG implementation, measurement, and analysis [9, 10].
- Application: Altman's iterative method of matrix inversion. Provides GPU algorithmic design, DEFG implementation, measurement, and analysis [7].

PAGE 18

1.4 Delimitations

The two most common GPU programming environments are the Khronos Group's OpenCL Specification and NVIDIA's CUDA. CUDA is limited to only NVIDIA products, whereas OpenCL is supported by many hardware vendors and can be run on CPUs and other devices [65, 70]. Due to OpenCL's wider applicability and higher level of API flexibility, our work focuses on OpenCL.

This dissertation covers a wide range of interesting application areas, algorithms, and GPU-related topics such as GPGPU, OpenCL, and the DEFG domain-specific language. However, there are also some related topics that are specifically omitted. In particular, this work does not focus on highly specialized, GPU-generation-specific or product-unique, algorithms and techniques. The GPU technology is constantly changing, and overly-focused techniques, helpful for a specific family or generation of GPUs, may soon be obsolete.

This "constantly changing" characteristic can be observed with the latest generations of GPU cards, such as the NVIDIA Fermi and Kepler GPUs. They contain L1 and L2 memory caches, lacking in many previous GPU designs [64, 66]. These hardware memory caches tend to make software-provided data caches, held in GPU thread-local storage, obsolete.

PAGE 19

CHAPTER II

RELATED WORK: Graphics Processing Units and OpenCL

2.1 Introduction

Technologies such as OpenCL and CUDA make it possible to run nearly standard C code on graphics processing units (GPUs). GPUs are auxiliary processors that can be packaged on separate cards, included on the central processing unit (CPU) motherboard, or manufactured within the CPU's integrated-circuit die.¹ When OpenCL and CUDA are used to solve general problems with GPUs, the acronym "GPGPU," which stands for General-Purpose computation on Graphics Processing Units, is often used [69]. With these technologies, one can run parallel algorithms on GPUs with the expectation of achieving high performance. At first glance, it may be tempting to use basic parallel random access machine (PRAM) algorithms on GPUs, since GPUs appear to supply the majority of the functionality required by the PRAM model. However, we will see that software based only on basic PRAM algorithms tends to perform poorly on GPUs. The solution to this performance issue lies in avoiding the common GPU performance pitfalls, such as instruction path divergence and excessive memory latency.

¹ An integrated-circuit die is a small section of semiconductor material.
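To make "nearly standard C code" concrete, the sketch below writes a data-parallel vector addition in kernel style. This is a hedged, host-only C illustration, not code from this dissertation: in real OpenCL C the function would carry __kernel and __global qualifiers and obtain its index from get_global_id(0); here a plain parameter and a host-side loop (run_grid, our invented name) stand in for the grid of work-items.

```c
#include <stddef.h>

/* Kernel-style function: each (simulated) work-item computes exactly one
 * output element, identified by its global id. */
void vector_add_kernel(const float *a, const float *b, float *c, size_t gid) {
    c[gid] = a[gid] + b[gid];
}

/* Host-side stand-in for the OpenCL runtime, which would launch one
 * work-item per index over the whole index space. */
void run_grid(const float *a, const float *b, float *c, size_t n) {
    for (size_t gid = 0; gid < n; gid++)
        vector_add_kernel(a, b, c, gid);
}
```

The point of the style is that the per-element function contains no loop over the data; the parallel runtime, not the kernel, supplies the iteration.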

PAGE 20

2.2 Basic Overview of CPUs, GPUs, and PRAMs

CPUs tend to have a limited number of cores (often less than 16) with significant amounts of cache memory and a limited number of software-managed threads. These CPUs have architectures with specialized logic for predictive branching, out-of-order instruction execution, and other advanced techniques, all of it aimed at keeping the CPU busy processing instructions. GPUs tend to have hundreds of cores that can simultaneously handle thousands of hardware-managed threads. CPUs perform thread switches under software control, whereas GPUs have hardware-managed threads. GPUs can switch between threads with no significant delays, because there is no software-managed context switching involved. Memory caching may be present on newer GPU designs, but when a given thread stalls, GPUs tend to rely on their fast thread switching to keep the processors executing instructions. With the large number of GPU threads, the expectation is that there is going to be a dispatchable thread available. CPUs, on the other hand, tend to rely on memory caching to minimize memory-access-induced stalls [30].

The GPU architecture has a resemblance to the conceptual PRAM. This acronym refers to an abstract model of a machine with memory easily shared between the parallel processors and the presence of as many parallel processors as required [14, 42].² From a high-level view, GPUs, with their shared memory and large number of processors and threads, are similar to PRAMs. They appear to be loosely equivalent, as each has uniform shared access to global memory and a large number of concurrent processors. However, to get high performance at a low cost, the GPU architecture has a number of features which make this "equivalence" view incorrect. As will be seen in the discussion on GPU performance, the techniques used to get high levels of GPU performance are contrary to the unrestricted nature of the PRAM model.

² For our purposes, we can ignore the four different types of PRAMs outlined by Berman since, in practice, PRAMs and GPUs are not very similar.
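Besides thread switching, a kernel can sidestep slow global memory by staging reused data in fast on-chip (local) memory, a technique Section 2.5 returns to. The plain-C sketch below mimics that staging for a hypothetical one-dimensional three-point averaging stencil; GROUP, stencil_group, and the halo layout are illustrative inventions of this example, not code from the dissertation, and a real kernel would perform the copy cooperatively into __local memory.

```c
#include <stddef.h>

#define GROUP 4   /* illustrative work-group size, not from the text */

/* Each input element is read by up to three neighboring work-items, so a
 * kernel would copy the group's tile, plus a one-element halo on each
 * side, from slow "global" memory into a fast local tile once, then do
 * all three reads per output from the tile. base indexes the left halo. */
void stencil_group(const int *global_in, int *global_out,
                   size_t base, size_t n) {
    int local_tile[GROUP + 2];           /* tile plus two halo elements */
    for (size_t i = 0; i < GROUP + 2; i++) {
        size_t g = base + i;
        local_tile[i] = (g < n) ? global_in[g] : 0;  /* one global read */
    }
    for (size_t lid = 0; lid < GROUP; lid++) {       /* reuse is local  */
        global_out[base + 1 + lid] =
            (local_tile[lid] + local_tile[lid + 1] + local_tile[lid + 2]) / 3;
    }
}
```

Without the tile, the group would issue roughly three global reads per output; with it, each element is fetched from global memory only once.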

PAGE 21

2.3 Modern GPUs

The modern GPU is a specialized electronic circuit that is common in almost all computers. Most personal computers now include some type of GPU. The commodity nature of GPUs has helped keep their unit costs low, though they have achieved the potential to exhibit very high throughput [78]. The GPU was originally designed to provide high-speed graphical rendering (computer-generated graphics). With the advent of NVIDIA's Compute Unified Device Architecture, the programming of NVIDIA GPUs for non-graphics use became a more straightforward process [95]. The OpenCL Specification appeared after CUDA and provides similar programming capabilities over a much wider range of GPU cards and devices [34, 70].

Before CUDA and OpenCL, the problem to be solved in parallel on a GPU had to be re-factored as a rendering (display) problem, and when the rendering was completed, the display raster image had to be captured and reformatted to generate output results. Now, OpenCL and CUDA make it possible to program GPUs in nearly standard C, using common programming constructs. A key point is that OpenCL and CUDA make it possible to code algorithms, possibly designed for the PRAM model, on the GPU in a manner that does not require presenting the problem as a graphical rendering problem. GPUs can now solve non-rendering problems without resorting to exotic technologies and approaches. Current generation GPUs can process double-precision floating-point numbers, and high-end GPUs support error-correcting memory [66, 65].

As alluded to above, the hypothetical PRAM model and the GPU may appear to be similar. However, the modern GPU is a very specialized piece of hardware and has unique characteristics that make it problematic to code PRAM algorithms directly onto a GPU [42]. Two of these characteristics are listed in Table 2.1. Instruction path divergence relates to the nature of the GPU's instruction processing.

The GPU can be described as a Single Instruction, Multiple Thread (SIMT) type of parallel processor. SIMT is very similar to the well-known Single Instruction,

PAGE 22

Multiple Data (SIMD) model of the parallel processor. However, it is different in that work-items (threads) can follow different paths through the same code, but at a significant performance penalty [42]. With many current GPU designs, each work-item in a work-group executes exactly the same instruction. But the instructions not in an active work-item, meaning not on the current execution code path, have their results voided. The impact of this GPU design technique is that the work-items not on the current instruction path are effectively inactive. These work-items are not doing any worthwhile work, resulting in a performance loss.

The GPU global memory access delays relate to the high clock speed of the GPU relative to the lower speed of the associated global memory. Historically, many GPU designs lacked any type of hardware caching of global memory. More recent GPU designs, such as the NVIDIA Fermi and Kepler designs, do provide some global memory caching [66]. The lack of a cache, or sufficient cache size, is normally compensated for by the GPU rapidly shifting between its hardware-managed threads. At the time a given thread stalls for a memory access, the notion is that another of the hardware-managed threads is ready to dispatch. When solving application problems with high locality of memory reference, this approach works well. However, when solving certain classes of problems with classic algorithms, such as graph-theoretic problems or sparse matrix problems, there may not be a work-item ready to dispatch due to the lack of locality of memory reference. This dispatching irregularity usually results in poor performance.

In summary, when the program code to be executed by a GPU is not designed for GPU use, it may perform very poorly due to instruction path divergence and global-access-induced memory stalls. To get beyond these issues, software developers can write GPU programs that access other classes of memory, and use special coding techniques that avoid excessive instruction path divergence.
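One such coding technique is to give every work-item an identical instruction stream. As a hedged illustration (the function names and the 255 threshold are ours, not from this dissertation), the plain-C sketch below shows a branching clamp next to a branchless rewrite; OpenCL C supports this style directly through built-ins such as min() and select().

```c
/* Divergent form: under SIMT, work-items taking the "then" path and
 * work-items taking the "else" path serialize, each group idling while
 * the other executes. */
int clamp255_branching(int x) {
    if (x > 255)
        return 255;
    else
        return x;
}

/* Branchless form: every work-item executes the same instructions.
 * (x > 255) is 0 or 1, so `over` is an all-zeros or all-ones mask that
 * selects 255 or x with pure arithmetic, avoiding any divergent branch. */
int clamp255_branchless(int x) {
    int over = -(x > 255);
    return (255 & over) | (x & ~over);
}
```

Both functions compute the same result; only the second keeps a SIMT work-group fully occupied when neighboring work-items straddle the threshold.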

PAGE 23

Table 2.1: GPU Performance Constraints

  1. Instruction Path Divergence: Occurs when threads take different paths through the code.
  2. Global Memory Access Characteristics: Each access to global memory requires the time needed to execute 200-500 instructions.

2.4 GPGPU

The acronym GPGPU, which stands for General-Purpose computation on Graphics Processing Units, refers to the use of GPUs to solve general problems beyond the rendering of graphical images. Scientists and software developers noticed that early GPUs provided very fast parallel processing for operations such as scaling and shading. Noticing the high levels of performance achieved, and the low costs, these scientists and developers began to formulate non-graphical problems in graphical terms and then perform the processing on inexpensive GPUs. This was a breakthrough, as it showed that GPUs could effectively be used for high performance non-graphical computing, with low hardware costs. As time passed, products like OpenCL and CUDA made it much easier to perform general-purpose computing on GPUs [69]. The term GPGPU now refers both to the early efforts of doing general-purpose computation on graphics-only GPUs [48] and to the wider field of doing general-purpose computation on any type of GPU, be it fully programmable or not [69].

2.5 OpenCL and GPU Basics

The OpenCL specification is managed by the non-profit consortium Khronos Group, and OpenCL-enabled products are supplied by many software and hardware vendors [34]. This specification enables the development of applications over a range of devices, not all of them GPUs. These OpenCL-enabled devices are supplied by

PAGE 24

vendors such as NVIDIA, AMD/ATI, and Intel. Altera has recently announced the availability of OpenCL for its high-end FPGA cards [6].

OpenCL devices are programmed in C, and the CPU side of the OpenCL application can be programmed in C or via a C++ wrapper. There are third-party OpenCL bindings for a number of other languages, including Java, Python, and Microsoft's .NET. It is worth noting that the GPU programming models supplied by OpenCL and NVIDIA's CUDA are similar conceptually, but not at all the same at the source code level [48]. The use of CUDA is limited to only NVIDIA hardware.

OpenCL is part of the very dynamic graphics hardware and software arena; here, continuous product changes and enhancements are the norm. The features and limits present today may be significantly different in a year; this means that OpenCL is subject to frequent updates. As stated earlier in the Delimitations section, this work does not focus on highly specialized, GPU-generation-specific coding techniques, but instead focuses on algorithms, techniques, and approaches for solving GPGPU problems that are applicable to OpenCL over a range of applications and products.

2.5.1 GPU Developer's View and Execution Model

The developer's view of the GPU code is that of one or more kernels. As mentioned above, OpenCL kernels are functions written in the C programming language. OpenCL provides extensions to C that facilitate the execution of the kernel on the GPU. These extensions provide special GPU variable types and access to OpenCL-specific GPU internal variables. OpenCL also provides CPU-side C extensions; these extensions provide the complex CPU-side OpenCL API. This API provides a large number of varied CPU functions, and function options. Included are functions to copy buffers of memory to and from the GPU, invoke GPU kernels, and manage error conditions. From a high-level perspective, the CPU copies the required memory buffers to the GPU and then requests that kernels be executed. When the kernels

PAGE 25

are finished, the CPU can request buffers be copied back from the GPU.

[Figure 2.1: OpenCL Developer's View. The diagram shows the GPU's global memory and a grid of work-groups; each work-group has its own local memory and contains work-items 1 through m, each with its own private memory.]

A kernel is executed by a work-item; work-items are grouped into work-groups; work-groups are grouped into a grid. The developer sees the program code and its execution in terms of kernels, work-items, work-groups, and grids. Figure 2.1 contains a diagram expressing these relationships. At a given time, each work-item is executing one kernel. Each work-item has access to the shared global memory, a limited amount of work-group shared memory, and its own private memory. How these different types of memory are used by the kernel has a major impact on the GPU performance, because access to the plentiful global memory is relatively slow. It is often necessary to design GPU algorithms that permit reused data to be kept in local memory or private memory [65]. Both local memory and private memory tend to be high-speed RAM, packaged within the GPU itself.

Developers program the GPU kernels and set the characteristics of the work-groups


and grid. However, OpenCL uses wavefronts to execute the work-groups, and hence the work-items. The work-groups are executed in arbitrary order, and may or may not execute in parallel. This arbitrary-ordering characteristic of work-group execution impacts the developer, because the order of work-group execution cannot be predicted. The algorithms and code used cannot be dependent on work-group ordering. The work-items in a work-group all utilize the same program counter and they can share the local work-group memory [61].

2.5.2 GPU Performance

As the program counter is shared, and since the SIMT model is being used, it is possible for the behavior of one work-item to impact the other work-items in the work-group. In particular, when a decision statement or a loop causes instruction path divergence, the work-items not on the instruction path still execute the instructions at the program counter, but the instruction's actions are voided. This means the work-items not on the current path are effectively paused. Depending on the length of the diverged path, the non-active work-items can stay paused for a relatively long period of time. Instruction path divergence is a major issue in getting high performance from GPUs [65, 34].

GPUs are very fast processors, but the time it takes to access global memory can stall the GPU for approximately 200-800 instruction cycles, depending on the actual GPU. This delay is caused by memory latency. CPUs have similar memory latency issues, and are designed with sophisticated, multi-level memory caches to help alleviate this performance issue. As mentioned previously, GPUs may also have caches but, in general, GPUs use a different solution to the memory latency issue: GPUs rapidly dispatch another thread that is not stalled. It becomes the developer's task to utilize an algorithm that facilitates the availability of a dispatchable thread. This is one of the areas where the PRAM view of GPUs breaks down; PRAMs do not have
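A useful mental model for this voiding behavior is predication: every work-item walks the same instruction stream, and a predicate masks out the effects on the untaken path. The sketch below (our illustration in plain C, not GPU code) selects between two already-computed values without branching, much as SIMT hardware discards the inactive path's results:

```c
/* Branch-free select: both candidate values are "computed"; a
 * predicate-derived mask voids the one not taken, analogous to how SIMT
 * hardware voids instructions for work-items off the current path. */
int select_branchless(int pred, int if_val, int else_val) {
    int mask = -(pred != 0);   /* all ones if pred is true, all zeros otherwise */
    return (if_val & mask) | (else_val & ~mask);
}
```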


this type of requirement. Techniques such as data pre-fetch, use of local registers, and use of local shared memory are often employed to mitigate memory latency issues in GPUs. The GPU's hardware registers and local memory are accessible with minimal delays, and the pre-fetching of data involves using techniques to pre-load the required data into local storage or a software cache [48].

An additional major GPU performance consideration occurs when transferring data between the CPU and the GPU. As a high-performance GPU is often provided on a separate card, located on a "slow" PCI Express bus, the movement of data between the CPU and GPU is slow relative to the performance of the GPU [30]. This performance difference, between the PCI Express bus data transfer rate and the GPU throughput rate, is a major concern in achieving high performance [65].

When using complex algorithms, achieving high performance with OpenCL can be a complex and difficult undertaking. In order to achieve good GPU performance, Farber suggests the following three basic rules for GPGPU programming [30]: get the data on the GPU and leave it there; give the GPU ample work to do;³ and, focus on the reuse of data within the GPU to avoid memory bandwidth limitations. These three basic rules form the basic tenets for DEFG's generation of OpenCL code.

2.6 Parallelization and Domain-Specific Languages

Numerous attempts have been made to construct languages, compilers, and tools to make the production of high-performance parallel solutions easier. In 2003, Shen et al. talked about the holy grail of parallelization, which is the automated parallelization of serial programs, being out of reach [83]. However, progress is being made. One approach towards the efficient production of GPU-based parallel solutions is the use

³ Of these three rules, perhaps this one is the most complex. Finding ways to always have the GPU working is not easy in the face of possibly hidden instruction path divergence and memory latency issues.


of a domain-specific language (DSL). DEFG is a DSL, a language and related tools that facilitate the production of OpenCL applications. Martin Fowler defines a DSL as a computer programming language of limited expressiveness focused on a particular domain, and suggests that DSLs can be broken into two categories: internal DSLs and external DSLs [32]. DSLs of both varieties have been produced for GPU-based high performance computing.

Internal DSLs for GPU-based HPC include extensions to Python, such as PyGPU, PyCUDA, and PyOpenCL [50, 49, 53]. These DSLs tend to consist of Python wrappers placed around a particular GPU's API. There are also C/C++ extensions, such as Bacon [92]. Aside from DEFG, other GPU external DSLs include the SPL digital signal processing language and the MATLAB Parallel Computing Toolbox. The MATLAB toolbox supports CUDA and it permits passing some MATLAB functions to the GPU. It also permits direct GPU kernel execution [57, 100]. Both MATLAB and DEFG require that the GPU kernel be provided by the developer.

DSLs have the ability to provide high-level abstractions for complex computing tasks; they can be used to hide complexity [32]. In this dissertation, we show a DSL, namely DEFG, which provides abstractions for the complex CPU code that must be written for OpenCL GPU applications. These abstractions are produced in such a way that the developer is shielded from a great deal of complexity encountered when using the various OpenCL API functions and options [45].


CHAPTER III

OVERVIEW OF DEFG AND ITS PERFORMANCE

3.1 Introduction

This chapter provides an overview of DEFG and summarizes its capabilities and performance [80, 81]. Later chapters will describe the internal workings of DEFG and its associated design patterns, and show the use of DEFG with additional applications. In addition, Section A of the Appendix provides a full description of the DEFG language. Here, we present three sample DEFG application solutions and discuss the way DEFG relates to the application's OpenCL host code and kernel code. We focus our attention on the basics of the DEFG environment and actual DEFG performance results, in terms of both developer productivity and run times.

We approach this discussion of the DEFG implementation as follows: using three existing OpenCL applications and their existing OpenCL kernels without any changes, the existing host CPU code is replaced with DEFG-generated code. The DEFG source modules need, on average, about 90% fewer lines of code than the corresponding hand-written host OpenCL modules. We compare the computational performance of the three applications over two different OpenCL platforms, which we call CPU and GPU-Tesla. Performance variations between the DEFG and reference results are identified and analyzed. The next few pages summarize the DEFG implementation and DEFG language, as well as the three existing OpenCL applications we use as reference applications and their conversion to DEFG. We then present a preliminary


look at our experimental results, in terms of lines of code and run times.

3.2 DEFG Framework and DEFG Language

The DEFG implementation consists of a parser written in Java, utilizing ANTLR3 [11], a Java-based optimizer specific to DEFG, and our code generator, which is written in C++. The parser handles syntax checking and results in an abstract syntax tree, expressed as an XML document. The XML syntax tree is then optimized for run-time performance and decorated with cross-reference information needed for code generation. Finally, this tree is processed by our code generator, which uses the TinyXML2 library to accept the XML-formatted tree [91]. For example, the twelve lines of DEFG code shown in Figure 3.1 result in approximately 460 lines of C/C++ code, a snippet of which is shown in Figure 3.2. The OpenCL kernel executed by this code is shown in Figure 3.3. Note that this generated OpenCL code is intended to execute on any supported OpenCL device, including the CPU. With OpenCL, the CPU can function as both the host and the device to execute the kernel.

The DEFG declarative language consists of a number of declare, execute, and call statements, and optional statements, such as sequence/times and loop/while. An example DEFG source file is shown in Figure 3.1. The declare statement is used to name the DEFG application, to define and name the GPU kernels to be executed, to define any required scalar variables such as a graph's node count, and to define the required data buffers.

01. declare application sobel
02. declare integer Xdim
03.         integer Ydim (0)
04.         integer BUF_SIZE
05. declare gpu gpuone ( * )
06. declare kernel sobel_filter SobelFilter_Kernels ( [[ 2D, Xdim, Ydim ]] )
07. declare integer buffer image1 ( Xdim Ydim )
08.         integer buffer image2 ( Xdim Ydim )
09. call init_input ( image1(in) Xdim(out) Ydim(out) BUF_SIZE(out) )
10. execute run1 sobel_filter ( image1(in) image2(out) )
11. call disp_output ( image2(in) $Xdim(in) $Ydim(in) )
12. end

Figure 3.1: Sample DEFG Code


// *** buffers in
cl_mem buffer_image1 = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    BUF_SIZE * sizeof(int), (void *)image1, &status);
if (status != CL_SUCCESS) { /* handle error */ }
status = clSetKernelArg(sobel_filter, 0, sizeof(cl_mem), (void *)&buffer_image1);
if (status != CL_SUCCESS) { /* handle error */ }
cl_mem buffer_image2 = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
    BUF_SIZE * sizeof(int), (void *)NULL, &status);
if (status != CL_SUCCESS) { /* handle error */ }
status = clSetKernelArg(sobel_filter, 1, sizeof(cl_mem), (void *)&buffer_image2);
if (status != CL_SUCCESS) { /* handle error */ }
// *** execution
size_t global_work_size[2]; global_work_size[0] = Xdim; global_work_size[1] = Ydim;
status = clEnqueueNDRangeKernel(commandQueue, sobel_filter, 2, NULL, global_work_size,
    NULL, 0, NULL, NULL);
if (status != CL_SUCCESS) { /* handle error */ }
// *** result buffers
status = clEnqueueReadBuffer(commandQueue, buffer_image2, CL_TRUE, 0,
    BUF_SIZE * sizeof(int), image2, 0, NULL, NULL);
if (status != CL_SUCCESS) { /* handle error */ }

Figure 3.2: Snippet of Generated OpenCL Code

__kernel void sobel_filter(__global uchar4 *inputImage, __global uchar4 *outputImage) {
    uint x = get_global_id(0); uint y = get_global_id(1);
    uint width = get_global_size(0); uint height = get_global_size(1);
    float4 Gx = (float4)(0); float4 Gy = Gx;
    int c = x + y * width;
    /* Read each texel component and calculate ... */
    if (x >= 1 && x < width - 1 && y >= 1 && y < height - 1) {
        ...

patterns where the GPU kernels may have to be executed a variable number of times. Figure 3.4 contains a DEFG example which executes the kernel once for each graph node. Figure 3.4, line 9, shows the use of the sequence statement. DEFG also contains additional looping statements to process scalar values returned by the GPU kernels. This capability is used in the DEFG breadth-first search solution to conditionally stop the parallel device processing. DEFG generates OpenCL 1.1 code, in keeping with the limits of NVIDIA's current OpenCL support [67].

01. declare application floydwarshall
02. declare integer NODE_CNT
03.         integer BUF_SIZE
04. declare gpu gpuone ( any )
05. declare kernel floydWarshallPass FloydWarshall_Kernels ( [[ 2D, NODE_CNT ]] )
06. declare integer buffer buffer1 ( $BUF_SIZE )
07.         integer buffer buffer2 ( $BUF_SIZE )
08. call init_input ( buffer1(in) buffer2(in) $NODE_CNT(out) $BUF_SIZE(out) )
09. sequence NODE_CNT times
10. execute run1 floydWarshallPass ( buffer1(inout) buffer2(out) NODE_CNT(in) DEFG_CNT(in) )
11. call disp_output ( buffer1(in) buffer2(in) NODE_CNT(in) )
12. end

Figure 3.4: Sample DEFG Code Showing a Sequence

3.3 Viability of DEFG

In order to test the viability of DEFG, we selected three existing OpenCL applications based on well-known algorithms: Sobel image filtering and Floyd-Warshall all-pairs shortest path (APSP), both from the AMD APP SDK, and breadth-first search from the OpenDwarfs benchmark [1, 31]. We will refer to these applications as SOBEL, FW, and BFS, respectively. SOBEL was chosen because it represents the class of less-complex GPU problems, where a single kernel is called once, and because it has significant RAM locality of reference. DEFG supports concurrent execution on multiple GPU devices, in a declarative manner, and SOBEL provides a good test case for this added support. This multiple-GPU support is discussed later in Section 5.1.

FW and BFS were selected because they represent two different classes of graph-oriented GPU problems, with BFS being the more complex to implement in DEFG. The FW algorithm requires that a common operation be repeated for each graph node. In this implementation, FW's GPU kernel is called once for each node. This call-for-each-node behavior must be managed from the CPU host, and hence from DEFG. The OpenDwarfs BFS implementation is based on the work by Harish and Narayanan, and uses a version of Dijkstra's algorithm [22, 38]. The actual OpenDwarfs BFS code is an OpenCL port of the CUDA code from the Rodinia benchmark [85]. This BFS implementation requires that a pair of kernels be repeated until success is indicated by the second kernel. This repetition is managed by the CPU host code.

All three of these applications were converted to DEFG, keeping the unmodified OpenCL kernels. The conversions to DEFG produce exactly the same results as the corresponding reference versions. Before discussing the performance results, we summarize the hardware and software used. The tests were run on two different configurations, which we call CPU and GPU-Tesla T20. These configurations are described in Table 3.1. In GPU performance terms, the CPU configuration is significantly less powerful than GPU-Tesla T20 because the CPU is not using a GPU resource; it executes the kernel on the CPU.

Table 3.1: Test Configurations

  Name           Configuration Data
  CPU            Windows 7, Intel I3 Processor, 1.33 GHz, 4 GB RAM, using
                 AMD OpenCL SDK 2.8 (no GPU)
  GPU-Tesla T20  Penguin Computing Cluster, Linux CentOS 5.3, AMD Opteron
                 2427 Processor, 2.2 GHz, 24 GB RAM, using NVIDIA OpenCL
                 SDK 4.0, NVIDIA Tesla T20 with 14 Compute Units, 1147 MHz and
                 2687M RAM

Table 3.2: Lines of Code

           DEFG Declarative   Generated   Reference
  BFS      42                 620         364
  FW       12                 481         478
  SOBEL    12                 467         442


Table 3.3: Run-time Performance, in Milliseconds

                  CPU              GPU-Tesla T20
              DEFG    Ref.        DEFG    Ref.
  BF-4096      1.5     2.6         4.3     5.8
  BF-65536    12.3    14.2         8.0    11.3
  FW         111.8   152.0         6.0    51.2
  SOBEL       23.0    24.8         3.7     4.1

[Figure 3.5: Application Lines-of-Code Comparison]

3.4 Discussion of Results

In terms of developer-written module line-count results, the three DEFG modules were much smaller than their reference counterparts. Table 3.2 lists the line counts for SOBEL, BFS, and FW; shown are the number of lines of DEFG declarative code, the number of lines of DEFG-generated OpenCL code, and the estimated number of non-comment lines in the OpenCL reference version. This data is shown graphically in Figure 3.5. On average, the DEFG code is 3.9 percent of the generated code, and 5.6 percent of the reference code. It should be noted that the reference code tended to include additional functionality and that the DEFG generated-code counts included an additional 150 lines of template code used to identify and select the requested GPU devices.


[Figure 3.6: Application Run-Time Performance Comparison]

The run-time performance comparison turned out to be very interesting. The raw run times, in milliseconds, are presented in Table 3.3. Figure 3.6 presents this data in 3D form. The results shown are the application averages, over ten individual runs for each application. When unexpected results were encountered, we performed reruns with temporary manual code modifications in an effort to isolate the root causes of the unexpected behaviors. At different times, these code changes were made into both the DEFG and reference OpenCL code. However, the numbers shown here are only the original times, i.e., those prior to any manual code modifications.

SOBEL is the simplest application and the run-time performance results between DEFG and the reference cases are comparable. The SOBEL results are shown on the graph in purple. The DEFG performance was slightly faster on all of the configurations. This similarity of results is not surprising as the OpenCL CPU host operations to execute SOBEL are not complex.

The run-time results of the FW tests, which are shown in green, were a surprise to us. We saw no obvious explanation for why DEFG should be consistently faster. For example, with GPU-Tesla T20 the reference FW needed 51.2 ms while DEFG FW consumed only 6 ms. We reviewed the OpenCL code for both DEFG and the AMD SDK-supplied reference case, and did not find any significant differences in buffer


usage or the OpenCL API functions used. We did notice that the reference case was using asynchronous events, when not in fact required, and we temporarily disabled them and reran the reference case. The FW reference case run times on the Tesla T20 dropped three-fold, from an average of 51.2 ms to 17 ms. We feel the DEFG Tesla time of 6 ms and the reference case time of 17 ms are reasonably close, and this test tends to show that, for this implementation of the Floyd-Warshall algorithm, both the DEFG and reference run times were reasonably comparable.

The BFS run-time comparisons used two different graphs. The first graph has 4,096 nodes, with the results shown in blue on the graph, and the second has 65,536 nodes, shown in red. As a historical note, our earlier prototype version of DEFG BFS was substantially slower than the reference version of BFS; the prototype DEFG needed 59.4 ms to perform what the reference BFS did in 11.3 ms. The full DEFG version contains buffer management optimization. When the full DEFG is used, the 59.4 ms run time drops to 8.0 ms. This dramatic improvement in performance was due to the DEFG optimizer's removal of unneeded buffer transfer operations.

We cannot leave the BFS performance topic without noting that the OpenCL CPU configuration's performance was better than the GPU performance for the 4,096-node case. We postulate that this is explained by the BFS implementation being used. This graph algorithm implementation is based on the work by Harish [38], which does not compensate for the lack of memory caching on many GPUs. The CPU version most likely fared so well due to the multiple levels of memory caching provided by the Intel I3; it is also likely that the 4,096-node case fit entirely in the Intel I3's cache.

In summary, these experiments have shown that, at least with these three applications, the declarative approach used in DEFG can be used to produce OpenCL applications with fewer lines of code and comparable run-time performance levels, relative to hand-written OpenCL application host code.


CHAPTER IV

DEFG THEORY OF OPERATIONS

4.1 Introduction

Our Declarative Framework for GPUs (DEFG) generates the CPU-side code for OpenCL GPU applications. One of the principal DEFG goals is to make the development of GPU software less onerous and difficult. The OpenCL GPU APIs tend to be numerous, very complex, and verbose. DEFG hides much of this complexity and removes unneeded verbosity. Our approach enables the developer, where possible, to declare what is needed and have the DEFG software generate the solution to produce what is needed.

Another principal goal is to have DEFG generate code that is humanly readable, allowing DEFG to also function as a learning tool. The developer can declare what is needed and then review the generated DEFG code for insights into how to utilize OpenCL in separate, non-DEFG applications. We use static optimization techniques as much as possible, so as to have the DEFG-generated code be readable. This approach tends to avoid the use of complex optimization and dynamic-decision code at run time. We avoid them because their use tends to make the generated code large, heavy with called modules, and rather hard to understand.

As with the CPU, the GPU has limits; it has finite amounts of processing power and memory. In order to get beyond the limits present in a single GPU, a DEFG aim is to facilitate the use of multiple GPUs within a single application. GPU kernel


designs often break their work into small units that have minimal interactions with each other. With these types of designs, spreading the work over additional GPU hardware is relatively easy to do, if the work can be partitioned in a reasonable way. DEFG provides for this work partitioning in a number of common application use cases. These cases will be described in the section on DEFG Design Patterns.

This Theory of Operations chapter is divided into two main sections, plus a smaller third section. Section 4.2 talks about the DEFG Design Patterns and stays at a rather high level. It is written from the point of view of a developer wanting to use DEFG. Section 4.3 dives into the concepts and notions behind the workings of DEFG and provides numerous specific implementation details. This second subsection is written from the point of view of the intrepid soul wishing to understand the inner workings of the DEFG environment. Section 6.2.1 explores why a DSL is useful for GPU software development.

4.2 DEFG Design Patterns

In software engineering, the term design pattern is often ambiguous; different researchers, designers, and developers may view the notion of a design pattern with varied experiences, expectations, and requirements.¹ So as to avoid confusion, we will define what we mean by a DEFG design pattern after we provide a brief overview of the DEFG domain. We will then describe the existing common DEFG design patterns. Before proceeding, we note that a given domain may use only certain commonly described software engineering design patterns and may introduce unique design patterns of its own.

The DEFG language is a domain-specific computer language, commonly called a "DSL". As such, it is intended to be used to solve problems within its limited domain. The DEFG domain mainly consists of providing the CPU-side code for OpenCL

¹ We know this to be true from our own extensive software engineering experiences.


applications. A key concept here is that DEFG generates a limited range of solutions and this limited range is reflected in its design patterns. Many of the common software engineering design patterns, such as fork-join, from process control, pipeline, from data management [58], and observer [33], from object-oriented programming, do not apply to DEFG.

For purposes of this dissertation, we will use the Gamma, et al. definition of a design pattern [33]; a design pattern is "a solution to a problem in a context." Some of our design patterns are very simple and others are quite complex. When describing our more complex patterns, we will follow the Gamma approach of describing the DEFG patterns with four essential elements: the pattern name, the problem addressed, the solution provided, and the consequences of use.

4.2.1 DEFG Invocation Patterns

4.2.1.1 Sequential-Flow Pattern

The Sequential-Flow pattern is the default behavior of DEFG. In DEFG programs, all of the statements after the declare statements are executable statements and the default behavior is to execute these in order, from top to bottom. DEFG programs that use this pattern, and not the Single-Kernel Repeat Sequence and Multiple-Kernel Loop patterns, discussed next, tend to have the following structure:²

declare application <name>
declare integer ...
declare gpu ...
declare kernel ...
declare integer buffer ...
call ...
execute ...
call ...
end

For some DEFG applications, this simple behavior is sufficient to provide the

² Three dots, an ellipsis, represent omitted text and text between < and > represents the name of applications, kernels, functions, etc.


basis for a solution. Each referenced GPU kernel and GPU function is executed once and the execution occurs in the order listed. The call and execute statements can be mixed in any order. This design pattern is clearly shown in our SOBEL and MEDIAN digital filter applications.³

4.2.1.2 Single-Kernel Repeat Sequence Pattern

When a given kernel needs to be repeatedly executed a fixed number of times, and this number is known before the kernel is executed, the Single-Kernel Repeat Sequence design pattern is used. This pattern provides the preferred DEFG approach to repeatedly executing a kernel for a preset number of iterations. Whenever possible, this pattern is preferred over the Multiple-Kernel Loop pattern because the DEFG optimizer can more easily deal with the OpenCL data transfer operations. The Invoke-While pattern (see below) has more interaction concerns and may not be optimized as well by DEFG. This Single-Kernel Repeat Sequence pattern may be used within the Multiple-Kernel Loop pattern to help perform complex operations. DEFG programs using this pattern tend to have this structure:

declare application <name>
declare integer ...
declare gpu ...
declare kernel ...
declare integer buffer ...
...
sequence <number> times
execute ...
...
end

The referenced GPU kernel is executed <number> times. This design pattern is very handy when a self-contained iteration operation must be executed, and is demonstrated in our Floyd-Warshall (FW) application.

³ The DEFG application names are capitalized.
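To make the pattern concrete, the serial sketch below (ours; the real FW kernel is the AMD SDK's and runs its inner loops in parallel on the device) shows the shape of the iteration: one pass fixes an intermediate node k, and the host invokes that pass once per node, which is the role the sequence NODE_CNT times statement plays:

```c
#define N 4   /* tiny illustrative graph; NODE_CNT in the DEFG source */

/* One Floyd-Warshall pass: relax all pairs through intermediate node k.
 * On the GPU this corresponds to one kernel invocation over an N x N grid. */
static void fw_pass(int d[N][N], int k) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (d[i][k] + d[k][j] < d[i][j])
                d[i][j] = d[i][k] + d[k][j];
}

/* Host-side driver: the "sequence NODE_CNT times / execute" shape. */
void fw_all_pairs(int d[N][N]) {
    for (int k = 0; k < N; k++)
        fw_pass(d, k);
}
```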


4.2.1.3 Multiple-Kernel Loop Pattern

Sometimes the number of times a GPU kernel, or other group of DEFG statements, needs to be repeatedly executed is not known in advance. For this case, DEFG provides this Multiple-Kernel Loop pattern. The range and flexibility of this design pattern, and the loop/while DEFG statements that implement it, are purposely limited by this domain-specific language so as to enable our very useful DEFG optimizations. The main limit imposed is that this design pattern cannot be embedded. The Multiple-Kernel Loop design pattern tends to have this style of DEFG coding:

declare application <name>
declare integer ...
declare gpu ...
declare kernel ...
declare integer buffer ...
...
loop
...
execute ...
...
execute ...
...
while ...
...
end

We note that this pattern can contain the previous two patterns. The consequence of this pattern's non-embedded limit is that some algorithms/applications may have to be re-factored; however, as re-factoring for GPU use is common, our experience is that this will not likely be an issue for the developer. We have found that by having the GPU kernels perform the work and having the CPU manage the work, this pattern's limits are not frequently encountered. Of course, there may be applications where this type of re-factoring is not desirable or possible. In this case, using DEFG might not be the correct software development tool. Our BFS applications show this Multiple-Kernel Loop pattern in use; it is also used in our iterative matrix inversion (IMI) application.
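The host-side control flow this pattern produces can be sketched as follows (plain C with our stand-ins for the kernels; in the generated code each step is an OpenCL kernel enqueue and the continue flag is read back from a device buffer, as in the BFS solution):

```c
/* Stand-ins for two GPU kernels: one does a unit of work, the other
 * decides whether another pass is needed (cf. the BFS continue flag). */
static void kernel_step(int *x)        { *x /= 2; }
static int  kernel_check(const int *x) { return *x > 1; }

/* The loop/while shape: repeat both kernels until the check says stop. */
int run_until_done(int x) {
    int again, passes = 0;
    do {
        kernel_step(&x);          /* execute run1 ... */
        again = kernel_check(&x); /* execute run2 ..., flag copied to host */
        passes++;
    } while (again);              /* while again ne 0 */
    return passes;
}
```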


4.2.2 DEFG Concurrent-GPU Patterns

Providing support for applications to utilize more than a single GPU is a primary aim of DEFG. In order to provide this capability with minimal code changes, we supply several design patterns. The Multiple-Execution pattern engages additional GPUs.

4.2.2.1 Multiple-Execution Pattern

This pattern is both simple and deceptively complex. It is simple in the sense that, with several simple code changes, any non-BLAS⁴ DEFG program can be changed to utilize more than a single GPU. It is complex in that it only makes sense to do this if the algorithms and kernels used by the application can support multiple-GPU operations. Additional comments about this pattern are given later in Section 5.1 and in Section A of the Appendix.

The problem this design pattern addresses is the need for more GPU processing power and memory. The solution it provides is to engage additional GPUs in the application execution. The benefits of adding more GPUs, when the algorithms and kernels permit, may seem rather obtuse from a distance. However, if the application data needing to be processed on the single GPU is a single byte larger than the RAM available on the single GPU, the application will fail. The ability to get a solution by utilizing a second, already-available GPU without having to rewrite the entire application is a significant advantage.

⁴ The BLAS design pattern is discussed below.


This pattern is engaged by using a declare gpu statement that selects multiple GPU devices, and using multi_exec statements instead of execute statements. The consequences of using this pattern are enormous. If the application is not designed correctly, it will fail or, worse, produce wrong results. If the application is designed for multiple-GPU use, there is the potential for getting results more quickly, handling larger application data sets, or both. DEFG programs that use this pattern tend to have this structure:

declare application <name>
declare integer ...
declare gpu <name> ( all )
declare kernel ...
declare integer buffer ...
...
multi_exec ...
...
end

A number of our applications, including SOBELM, MEDIANM, RSORTM, and BFSDP2GPU, use this pattern. The BFSDP2GPU application utilizes this pattern in a much more complex manner compared to the other listed applications. This breadth-first search application has active communications between its two GPUs and it also manages the multiple-thread updating of its shared buffer with the Prefix-Allocation pattern, which is discussed below. The BFSDP2GPU application is specifically discussed in Section 5.2.

4.2.2.2 Divide-Process-Merge Pattern

This pattern is used in association with the multi-GPU pattern. It is as much a set of design guidelines as it is a pattern that activates additional DEFG features. When the Multiple-Execution pattern is active, DEFG's default behavior is to automatically divide the data buffers into segments of equal size and give each segment to a unique GPU. DEFG also correspondingly lowers the size of the work-group for each GPU.
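The default equal-size split can be sketched as follows (our approximation in C; the function name is hypothetical, and the last-device-takes-the-remainder rule is our assumption, since the exact remainder handling inside DEFG is not described here):

```c
#include <stddef.h>

/* Compute device dev's segment of a len-element buffer split evenly
 * across ngpus devices. The last device absorbs any remainder; this
 * remainder policy is our assumption, not documented DEFG behavior. */
void split_segment(size_t len, int ngpus, int dev,
                   size_t *offset, size_t *count) {
    size_t seg = len / (size_t)ngpus;
    *offset = (size_t)dev * seg;
    *count  = (dev == ngpus - 1) ? len - *offset : seg;
}
```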


We note that this behavior can be changed by using the buffer options, as discussed in Section A of the Appendix.

When DEFG is left in its default multi-GPU buffer behavior, the developer must be certain that this mode of operation fits the application's requirements. It may be the case that the CPU function used to display, or further process, the resulting data may have to take special steps. In particular, our RSORTM application has special merge processing in the CPU function that writes the resulting sorted data to disk. This design pattern uses the default multi-GPU buffer behavior, which splits the buffers into equal segments. Our RSORTM application makes use of this pattern and is discussed in Section 5.3.

4.2.2.3 Overlapped-Split-Process-Concatenate Pattern

This pattern is also used in association with the multi-GPU pattern. It is mutually exclusive to the preceding Divide-Process-Merge pattern. This pattern is engaged by adding the halo option to the buffers holding the GPU data, as shown in this abbreviated DEFG code:

declare application <name>
declare integer ...
declare gpu <name> ( all )
declare kernel ...
declare integer buffer ... halo ( <n> )
...
multi_exec ...
...
end

Some data contains boundaries or edges that bring about special processing requirements when work splitting is engaged. The obvious case for this is two-dimensional image filtering. But it also shows up in applications such as moving-average calculation and digital signal processing. When the dimensions of these edges are fixed in size and known in advance, this pattern can be used to have DEFG automatically duplicate the edge data between GPUs. DEFG manages internally the


insertion of the duplicated information at the splits of the data; it also removes these overlaps when the data is returned. The halo buffer option causes DEFG to use this special processing and the associated n value provides the size of the overlapped area. For a 1D structure, the n value represents individual elements. For a 2D structure, n provides the number of overlapped image lines. This design pattern is very effectively used in our SOBELM and MEDIANM image processing applications.

4.2.3 DEFG Prefix-Allocation Pattern

As part of DEFG, a number of OpenCL kernels are supplied. These kernels include bermanPrefixSumP1, bermanPrefixSumP2b, and getCellValue. The first two kernels are based on the prefix sum algorithms given in Berman [14]. This very abbreviated code shows these kernels being used:

...
multi_exec run2 PrefixSumP1 ( offset(out) ... )
...
sequence KCNT times
multi_exec run2 PrefixSumP2b ( offset2(inout) offset(inout) ... )
...
multi_exec run3 getCellValue ( offset2(in) ... )
...

These three kernels form a general prefix sum capability that can be used from any DEFG application. We considered using other prefix sum (prefix scan) algorithms; however, they had the power-of-2 buffer-size requirement, which is not acceptable here, since DEFG is intended to be used in a general-purpose manner. The provision of a prefix sum capability in DEFG makes it possible to allocate the space in a shared buffer without having to use performance-impacting, low-level synchronization constructs. The ideas behind this buffer allocation approach are discussed in Section 5.2. The getCellValue kernel returns the last value in the results buffer, which is equal to the number of items used when allocating buffer space.
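The allocation idea can be seen in a serial sketch (ours; the supplied kernels compute the same result in parallel): each thread contributes a count of the items it wants to write, an exclusive prefix sum turns those counts into non-overlapping offsets into the shared buffer, and the final running total, the value getCellValue retrieves, is the total space consumed:

```c
/* Serial model of the prefix-sum allocation: counts[i] is how many cells
 * thread i needs; offsets[i] is where it may safely write, and offsets[n]
 * is the grand total (the value getCellValue fetches). Note there is no
 * power-of-2 length requirement, matching DEFG's choice of algorithm. */
void exclusive_prefix_sum(const int *counts, int *offsets, int n) {
    int running = 0;
    for (int i = 0; i < n; i++) {
        offsets[i] = running;
        running += counts[i];
    }
    offsets[n] = running;   /* offsets must have room for n + 1 entries */
}
```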


This code does not show some of the "housekeeping" steps needed to manage the buffer passed to getCellValue. These steps and the full parameter lists for the kernels can be observed in our multi-GPU BFSDP2GPU application's source code. The source code for these kernels is given in the Source Code Appendix, Section B.4.

4.2.4 DEFG Dynamic-Swap Pattern

This design pattern represents the DEFG capability we added to logically swap GPU buffers. The use of this pattern increases DEFG performance and makes the DEFG source code easier to read. It is common to see a given GPU kernel repeated a number of times where the output from the previous iteration is the input to the current iteration. DEFG assigns a fixed name to each GPU buffer. Arrays of buffers are not supported. This pattern enables the content of two fixed-name buffers to be swapped without actually moving the data between the buffers; the buffers are "interchanged" by just swapping their respective CPU references. This pattern is used via the interchange statement, as shown in this abbreviated DEFG code:

...
declare integer buffer <buffer1> ...
declare integer buffer <buffer2> ...
...
interchange <buffer1> <buffer2>
...

This very handy design pattern is used in our RSORT application.

4.2.5 DEFG Code-Morsel Pattern

The Code-Morsel design pattern came about after our surrender to the elegant power of small C/C++ code snippets. After a great deal of thought, we included in DEFG the capability to insert arbitrary snippets of C/C++ code. As we utilized this facility, we came to view these snippets as very useful and we, correspondingly, gave them a
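The effect of the interchange statement amounts to the classic pointer swap sketched below (ours, in plain C): only the host-side references move, never the buffer contents:

```c
/* Swap two buffer references; O(1) regardless of buffer size, which is
 * why interchanging beats copying the data between fixed-name buffers. */
void interchange_refs(int **a, int **b) {
    int *tmp = *a;
    *a = *b;
    *b = tmp;
}
```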


positive name; we called them "morsels." DEFG morsels can be used in a number of ways. We break down their use into two categories: cosmetic and functional.

Cosmetic morsels are generally benign and are used to do things like add descriptive program output or assist with application debugging. They do not participate in the basic processing of the application. Functional morsels do participate in the active processing of the application and they can add a lot of power to the application. Unfortunately, they also have the potential to create hideous, hard-to-find bugs.

The code in the morsels is not parsed by DEFG. This means the DEFG optimizer has no indication what the morsel is doing as far as consuming data or updating it. Morsels are used via the include and code statements. Appendix Section A, the DEFG User's Guide, provides additional morsel-usage information. Shown here is an abbreviated morsel sample:

...
code [[ printf("version %s size: %d, logSize: %d\n", ...); ]]
...
loop
...
// something in this loop sets values in buffer againPart ...
...
code [[ again = againPart[0] + againPart[2]; ]]
while again ne 0
...

The first code statement used above is cosmetic and the second is functional. Cosmetic morsels are used in many of our applications and functional morsels are used heavily in the BFSDP2GPU application.


4.2.6 DEFG Anytime Pattern

Anytime processing is our facility to stop an algorithm or kernel before its normal ending point, so that results can be presented earlier. This is not program termination associated with an error. Here is an abbreviated DEFG sample:

...
loop
...
loop_escape at <n> ms
...
while again ne 0
...

A common use of the anytime approach is to release the current application results when an event, such as the passage of a certain amount of time, occurs. Of course, this type of processing is only of value if the application's algorithms and design facilitate the presence of incremental results. This pattern is used in our iterative matrix inversion application and described in Section 5.4.

4.2.7 DEFG BLAS-Usage Pattern

The BLAS-Usage pattern enables the use of BLAS-based double-precision matrix multiplication from the AMD clMath library [2].⁵ This pattern is used with the DEFG blas statement. This abbreviated DEFG code shows its use:

...
declare double <d1> ...
        double <d2> ...
declare double buffer <bufferA> ...
        double buffer <bufferB> ...
        double buffer <bufferC> ...
...
blas <d1> * <bufferA> * <bufferB> + <d2> * <bufferC> -> <bufferC>
...

Here d1 and d2 are scalar variables; bufferA, bufferB, and bufferC are buffers

⁵ DEFG has the potential to include other capabilities from this, and other OpenCL-oriented, application libraries.


holding the matrices. The results are stored in bufferC. This design pattern is used in our iterative matrix inversion (IMI) application.

4.2.8 When to consider not using DEFG

After describing DEFG and its design patterns, it is important to note that using DEFG to create certain types of OpenCL applications might not be a wise choice. Here are the application characteristics we have identified that indicate that DEFG use may be questionable:

1. The application has "tight" integration with other components concerning resource sharing, threading models, specialized GUI processing, etc.

This tight integration is likely to lead to problems since DEFG manages all its resources (connection handles, buffers, GPUs, etc.) as though it is the sole user. DEFG is designed to be used by a single operating system thread; it is not a multi-threaded CPU application.

2. Some of the application's OpenCL operations are conditionally executed.

It is possible to put conditional DEFG morsels around the DEFG execute and multi_exec statements. However, this approach is likely to create issues as the DEFG optimizer cannot be depended upon to have the application's variables and buffers updated with the correct content.

3. Complex CPU processing is integrated with the GPU processing.

This item is similar to the previous one. If the complex CPU processing can be put into functions invoked with the DEFG call statement, this approach will likely work. If the complex code is inserted with DEFG morsels, the problems described in the previous point are likely to occur.

4. Complex error handling, such as ignoring an error or restarting after an error, is in use.


When DEFG encounters an error condition, it expects to terminate its processing. The DEFG User's Guide, presented in Appendix Section A, describes how the DEFG generated-error text and the call to the exit function can be redirected. However, DEFG does not generate code to continue execution after an error.

4.3 DEFG Internal Operations

4.3.1 The DEFG Translator

The general design of our DEFG Translator is diagrammed in Figure 4.1. The translator is really a compiler; it inputs the DEFG source code and outputs C/C++ code, which is then used by a standard compiler. We use the term translator with our DEFG tool, instead of the more common compiler term, to simplify describing the interactions of the DEFG translator and the standard compiler.

Before describing the specifics of how DEFG programs become C/C++ programs, let us consider what type of application is produced by DEFG. If we ignore the GPU aspects of DEFG, we can say that DEFG is a domain-specific language (DSL) that has two main facets. First, DEFG actions rely upon the flow-of-control options just described; in other words, DEFG has a fixed set of inter-mixable execution-flow models that it follows. Second, DEFG marshals data for movement to remote devices and it manages these devices with optimized remote procedure calls (RPCs). In addition, there are provisions to insert additional CPU actions into the DEFG programs via the use of the DEFG call and code statements. However, while DEFG facilitates the execution of these additional actions, it does not "understand" or "manage" them; they exist outside the context of the DEFG Translator. We now describe our translator in detail.

The translation is done in three steps, as depicted in Figure 4.1. The DEFG source code, stored in a text file, is processed by the DEFG Parser and the results of


Figure 4.1: DEFG Translation-Steps Diagram

the compilation are stored in an XML document. The resulting XML document, also having been stored in a text file, is substantially updated by the DEFG Optimizer. The optimized XML document is then used by the DEFG Code Generator, along with a C/C++ code template. The XML is parsed by the code generator, processed, and the final C/C++ program is written out. This program, stored in a normal text file, is then further processed as a standard C/C++ program. When an error is encountered by our translator, it stops processing the input, writes error text, and terminates. The additional step showing compilation by a standard compiler is not shown in Figure 4.1. Programs generated by DEFG have the potential to run on mobile processors, as discussed in Section A.6 of the Appendix.

4.3.1.1 The DEFG Parser

The DEFG parser is written in the Java version of ANTLR3 [11]. The full DEFG grammar for our translator is shown in Section B.6 of the Appendix. Our DEFG grammar definition, used by ANTLR, contains embedded Java statements. These Java statements output the XML snippets that, in total, form the resulting XML document. The XML document contains a full description of the parsed DEFG program, stored in an easy-to-process XML form. Our parser detects simple DEFG syntax errors, but does not do any verification of references between DEFG statements. These more complex checks are done by our optimizer.

We now focus on the parser inputs and outputs. The input used for the sample output example below is shown in Chapter III, Figure 3.1. Figure 4.2 shows the generated XML output. Some XML nodes and attributes have been omitted and replaced with ellipsis ("...") to keep the size of the figure manageable. In the figure, line 1 is a comment that notes the translation date and the name of the input file. The next line provides the application name, sobel; the main XML attribute marks this as a program to create an application and not a C/C++ function.

XML was chosen for the intermediate abstract syntax tree format because, in Java, XML documents can be easy to create and relatively easy to parse. In addition, this generated XML document is stored in a simple text file and it is easily viewed from any number of tools. The readability of XML is sometimes disputed, but it is humanly readable and having the abstract parse tree visible as an XML document made the parser and optimizer debugging somewhat easier.

Lines 3 through 5 show that three integer variables are defined. Lines 6 through 8 define the GPU devices to be used. In this case, the GPU selected will be the first device that the OpenCL runtime presents, because of the "..." setting. The OpenCL kernel to be used, sobel_filter, is defined in lines 9 through 11. Lines 12 and 13 define the buffers to be used, image1 and image2, and their respective sizes. The call to the init_input function is defined next, in lines 14 through 19. Note the included references to the image1, Xdim, Ydim, and BUF_SIZE buffers and variables. Each buffer or variable has an associated mode attribute. These mode settings are critical for the correct operation of the DEFG optimization. The code generated from lines 20 through 23 will cause the sobel_filter kernel to be executed on the selected GPU. These lines will be further decorated by the optimizer, as shown below. Finally,


lines 24 through 28 define that the disp_output function be run on the CPU. The application end is marked in line 29.

Figure 4.2: Sample XML Output From DEFG Parser (XML listing omitted)

4.3.1.2 The DEFG Optimizer

For clarity, we will now start exclusively using the term "tree" to denote the XML document holding the DEFG program's abstract syntax tree. The DEFG optimizer is a Java program that has three basic purposes: (1) it looks for and reports DEFG coding errors not detectable by the parser, (2) it decorates the tree with cross-reference and optimization information, and (3) it reforms the tree branches by relocating selected requests to move GPU buffers to optimal tree locations. The decorating of the tree with cross-reference information is critical for the correct functioning of the DEFG code generator. The code generator handles one tree branch (DEFG statement) at a time and requires that the tree has all of the information needed for a


single statement's code generation included on each tree branch. This requirement for complete information on each branch has made the code generator, discussed below, more straightforward.

DEFG optimization occurs in two forms. The first form applies to all DEFG call, execute, and multi_exec statements. The in, out, and inout options, on the associated variables and buffers, are used to determine when the contents of a given variable or buffer need to be transferred between the CPU and GPU. These transfers are only performed if the given option setting indicates the need to update the data from the CPU or GPU. Figure 4.3 shows a snippet of the decorated tree for the execution of the sobel_filter kernel. Lines 20 through 23 correspond with the same lines in Figure 4.2. We can see that lines 21 and 22 now contain additional information. In particular, Figure 4.3 shows that more information, such as the argument count, data type, and required-movement setting, is present. The move attribute setting of toDev, present on line 21, is optimization information that will inform the code generation when to actually transfer the given buffer.

Figure 4.3: Sample XML Output Snippet From DEFG Optimizer (XML snippet omitted)

The second form of DEFG optimization is the relocation of certain buffer movement operations to locations outside of loops. When a given call, execute, or multi_exec statement accesses variables or buffers that are not modified inside a given loop, the request for the data is moved to a tree location that precedes the loop. This movement of certain requests prevents unneeded and repetitive data movement operations from being executed at run time.

The DEFG loop/while statements are limited and cannot be embedded inside other


loops. There are several reasons for this DEFG limit, but the main one relates to the optimizations performed. The optimizer has to "understand" the DEFG looping and we found that the optimization could be performed reasonably, if we limited the DEFG loop/while statement power by forbidding embedding. To be clear, the DEFG sequence statement can be embedded in loop/while statements; a loop/while just cannot be embedded in another loop/while. More will be said about these limits in Chapter VI, on future research. We believe there is a better approach, based on using a less static optimization technique.

The implementation of these two optimizations in DEFG has given the DEFG-generated code good performance. The performance is similar to that achieved with hand-written C/C++ OpenCL applications, at least in the applications we tested. We believe that these optimizations are one of the anchor facilities of DEFG; they help make DEFG a viable application generation approach by providing good run-time performance.

4.3.1.3 The DEFG Code Generator

The generation of the DEFG application code is a complicated, non-trivial endeavor. It is so for a number of reasons. The code generator has to create C/C++ code that produces the desired result, on multiple operating system platforms. (Footnote 6: We have limited our formal testing to Linux and Windows.) We support OpenCL GPUs from different vendors. In addition, the code generator has to produce code for two modes of operation: single-GPU execution and multiple-GPU execution. It is also tasked with generating code that a person will be comfortable reading. Although OpenCL is a defined standard, it has multiple implementer-defined features and options [70]. Where possible, we have used the simplest, common-denominator approaches; when these caused errors or produced poor performance, we used more advanced, and sometimes environment-specific, approaches. The developer is shielded


from all these issues by DEFG; it handles the many details of developing C/C++ code for GPU use.

A high-level diagram of the DEFG Code Generator is shown in Figure 4.4. The DEFG Code Generator is written in C/C++, executes on Windows, and uses the TinyXML2 [91] library to parse its input XML file. An early step in its processing is accessing the text file which holds the DEFG template C/C++ code. The DEFG template is copied directly to the file containing the generated code. This template provides the "boilerplate" code that initializes the run-time resources and selects the one or more devices for run-time processing; it also contains special markers. These markers are placeholders that are replaced by the unique code generated for each DEFG translation.

Speaking abstractly, the first major step in the code generation is producing the GPU device selection and device management code. Here, the OpenCL environment and command queues are created, the needed GPU kernels are loaded, and the required CPU memory buffers are allocated. The CPU buffer memory is obtained with C-style malloc calls and is of a fixed size. This size is changeable, at run time, by the DEFG_MAX_BUF environment variable. Once allocated, the buffers are segregated and managed by the generated DEFG code.

From a high level, after the environment is established, the steps outlined in the middle box of Figure 4.4 are performed. These operations generate the code for the DEFG statements defined in the input tree. After these operations are completed, the code to release all of the resources and terminate the application is generated. After the code generation is completed, the output file is available for compilation by a standard compiler.

In the remainder of this section, we will focus on the basic operations outlined in the middle box of Figure 4.4. These operations consist of: GPU-Oriented Operations, CPU-Oriented Operations, and Loop/While Statement Operations.
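The marker mechanism described above can be sketched in a few lines of C. This is our illustration of the idea only: the marker text and function name are invented, and the real DEFG template and its markers are not shown in this document.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy a code template to the output, replacing one placeholder marker with
 * the code generated for this translation.  "//<DEFG_INSERT>" is a made-up
 * marker name used only for this sketch. */
static void emit_from_template(const char *tmpl, const char *marker,
                               const char *generated,
                               char *out, size_t outsz) {
    const char *hit = strstr(tmpl, marker);
    if (hit == NULL) {                    /* no marker: pass template through */
        snprintf(out, outsz, "%s", tmpl);
        return;
    }
    snprintf(out, outsz, "%.*s%s%s",
             (int)(hit - tmpl), tmpl,     /* boilerplate before the marker */
             generated,                   /* unique generated code         */
             hit + strlen(marker));       /* boilerplate after the marker  */
}
```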


The optimized syntax tree is processed one branch at a time, beginning with the root. For each non-declare statement in the tree, a single operation type from one of these three groups is performed. We will summarize each of these operations below. We note, in advance, that the multiple-GPU support buffer options: halo, multi, and nonpartable are handled in the buffer-movement code generation and in the execute NDRange code generation. Through this layering and abstraction, these complex buffer options are handled without greatly impacting the more basic DEFG code generation for marshaling buffers, making OpenCL API calls, and handling errors.

The GPU-oriented code generation operation is made up of two phases. The first phase is only done when required by the move attributes in the input tree. It consists of generating the OpenCL code to create buffers and to transfer variables and buffers. The OpenCL GPU device buffers are only created when actually used. They are transferred (actually copied) when their contents are valid on the CPU and not the GPU. This can happen if the generated CPU-side code updates a variable or buffer on the CPU. It is the job of the DEFG optimizer to manage this coordination of transfers.

The second phase of this operation consists of generating the code to invoke the requested kernel or call the BLAS library. With the GPU kernel invocations, each selected GPU has the kernel arguments set and the kernel started via the OpenCL clEnqueueNDRangeKernel API call. Code is generated to describe any detected error indications.

The CPU-oriented code generation has four options, depending on the DEFG statement being handled. For a code statement the text between the "[[" and "]]" delimiters is inserted directly into the generated code. The use of the code statement, called a "morsel," is discussed in the User's Guide, Section A in the Appendix. The set statement's code generation consists of simply copying the associated value into the referenced scalar variable. This statement may seem very limited in power. However,


Figure 4.4: DEFG Code Generation Diagram


the DEFG optimizer is aware of its actions. This causes any needed transfer operations to be generated, if this field is then referenced from a GPU. The DEFG timer statements cause the generation of code to start, stop, and read the DEFG-provided CPU timer.

The most interesting CPU-oriented code generation involves the DEFG call statement. The transfer code is generated only when required by the move attributes in the input tree; it transfers (copies) variables and buffers from the GPU to the CPU. After these optional actions are completed, the call to the defined C/C++ function is made. The DEFG call is part of the DEFG optimizer processing; CPU variables and buffers are only updated with GPU content when the GPU content has been updated.

The loop and while operation generates the C/C++ code to implement the DEFG loop and while statements. Due to some unwanted behavior in the way C/C++ local variables are scoped, we use the C/C++ if and goto statements to implement this DEFG looping capability.

In Figure 4.5, we show a snippet of the C/C++ code generated for the DEFG execute statement shown in Figure 3.1, line 10. This snippet causes execution of the sobel_filter kernel with image1 as input and image2 holding the resulting filtered image.

This snippet has had many of the less important C/C++ lines and parameters removed; these removals are marked with ellipsis ("..."). The generated code includes comments to facilitate the code being "human readable." The first eleven lines show the conditional execution of a buffer creation on the GPU device. The next six lines show the transfer of the image1 to the GPU device and the setting of the first two arguments for the kernel execution on the device. The last two lines show the staging of the kernel's execution in the OpenCL command queue.

// *** KERNEL: <<<...>>>
// *** WRITE FIELD: image1 in-toDev
...
if (defg_create_1 == 0) {
    defg_create_1 = 1;
    ...
    defg_buffer_image1[0] = clCreateBuffer(defg_context, ..., &defg_status);
    if (defg_status != CL_SUCCESS) {
        // error handling, etc.
        ...
    }
}
...
defg_status = clEnqueueWriteBuffer(defg_Queue[0], defg_buffer_image1[0], ...);
...
defg_status = clSetKernelArg(defg_sobel_filter[0], 0, ...);
...
defg_status = clSetKernelArg(defg_sobel_filter[0], 1, ...);
...
// *** EXECUTION: sobel_filter
defg_status = clEnqueueNDRangeKernel(defg_Queue[0], defg_sobel_filter[0], ...);
if (defg_status != CL_SUCCESS) {
    // error handling, etc.
    ...
}
...

Figure 4.5: C/C++ Snippet for sobel_filter Kernel Execution

This chapter has described the DEFG translator. Our design splits the translation processing into three steps, which works well, as it allows us to separate the complexities inherent in optimization and code generation. The middle step, the optimization step, really keeps the error detection and high-level optimization away from the down-and-dirty OpenCL-oriented code generation. For DEFG to support the CUDA product, which has similarities to OpenCL, but also has differences, a new code generator would need to be provided; the parser and optimizer would remain largely unchanged.
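The if/goto looping shape described in the code-generator discussion can be sketched as follows. This is our schematic of the control structure only; the label name and the loop body are invented stand-ins for what DEFG actually emits.

```c
#include <assert.h>

/* Schematic of the generated loop/while control shape: a label at the top
 * and a conditional goto at the bottom avoid the C block-scoping issues a
 * brace-delimited while-loop would create for generated local variables.
 * Halving the counter stands in for "run kernels, read back 'again'". */
static int run_generated_loop(int again) {
    int iterations = 0;
defg_loop_top:                            /* generated label */
    again = again / 2;                    /* stand-in for kernel work */
    iterations++;
    if (again != 0) goto defg_loop_top;   /* while (again ne 0) */
    return iterations;
}
```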


CHAPTER V

NEW AND DIVERSE DEFG APPLICATIONS

In order to demonstrate the power and viability of DEFG, we designed and implemented a set of new applications using DEFG and OpenCL. Some of these new applications are based on our existing SOBEL and BFS applications, discussed in Chapter III. The others are entirely new implementations.

Our first new applications consist of two image filters. The first filter is an enhanced Sobel operator filter for multiple-GPU operation and the second is a median filter, functioning in single-GPU and multiple-GPU modes. We refer to our filtering applications as SOBEL and MEDIAN and may sometimes add a suffix to indicate additional functionality. For example, SOBELM is our multiple-GPU version of SOBEL.

In the next new application, multiple-GPU support is added to the existing DEFG-based BFS application. Here, parallel prefix scan is applied in a novel and interesting way to dynamically manage the buffers passed between the different GPU devices. Prefix scan's allocation of the buffer space to individual GPU threads makes it possible to manage the shared buffers without the costs of slow, atomic locking on buffer structures. This application is referred to as BFSDP2GPU.

The new RSORT application is our proof-of-concept implementation of roughly sorting. Roughly sorting is used when the data to be sorted is already partially in sequence. This sorting approach scans the data, computes a measure of disorder and then breaks the data into segments, which are individually sorted [9, 10].
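The roughly-sorting idea can be illustrated with a small CPU sketch. We assume the disorder measure k (no element more than k slots from its sorted position) has already been computed, as RSORT's scan phase does; given that, sorting overlapping windows of length 2k at stride k is enough to finish the sort. This is our illustration, not the RSORT GPU code.

```c
#include <assert.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Finish sorting an array known to be k-disordered: every window of 2k
 * elements, advanced k at a time, is sorted in place.  Each pass fixes
 * the front k elements of its window into their final positions. */
static void rough_sort(int *a, int n, int k) {
    if (k <= 0) return;                        /* already fully sorted */
    for (int start = 0; start < n; start += k) {
        int len = 2 * k;
        if (start + len > n) len = n - start;  /* clip the last windows */
        qsort(a + start, (size_t)len, sizeof *a, cmp_int);
    }
}
```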


Our final new DEFG application implements the Altman iterative matrix inversion algorithm [7, 8] and is referred to as IMI. In this application, DEFG forms an interface layer between the application's logic and the Basic Linear Algebra Subprograms (BLAS) numerical library [2].

Each of our new applications is described, and analyzed, in this chapter. The analysis for each application varies based on the goals for each. For example, our image filtering applications show the ease with which DEFG can provide the CPU-side code for high-performance, multiple-GPU applications. We carefully measure the respective run-time performances. On the other hand, the iterative matrix inversion application shows the ability of DEFG to be expanded to use existing GPU libraries. It is not related to multiple-GPU operations and, hence, we are not as interested in run-time performance. We are more concerned with the sizes and types of matrices that can be processed.

In Designing Scientific Applications on GPUs [23], Raphael Couturier talks about the implementation of GPU applications. He groups them into categories by image processing, optimization, numerical applications, and adds software development. Our mix of applications directly maps to three of Couturier's four categories. The image filtering applications implement the Sobel operator and the median filter. Touching on the Couturier topic of numerical applications, in particular solving sparse linear systems, is the DEFG iterative matrix inversion application. This DEFG environment, with its multiple GPU support, touches the recurring Couturier topic of enabling the support of applications over multiple GPUs. These applications provide a good mixture of GPU solutions that demonstrate the power of DEFG.

The run-time results were obtained using the Hydra server at the University of Colorado Denver's Computer Science and Engineering Department, with a few exceptions. In these exceptions, results were obtained using other hardware and we clearly note these exceptions.


5.1 Application: Image Filters

5.1.1 Problem Definition and Significance

Image filters can be used to enhance the quality of images, as well as, to locate the edges contained in images. In this section, two DEFG image applications will be discussed: Median filtering and the Sobel operator, both within the context of single-GPU and multiple-GPU operations. Single-GPU operation refers to executing the DEFG application on a single GPU; likewise, multiple-GPU operation refers to executing the DEFG application on multiple GPUs. Having the option to execute a given DEFG application on multiple GPUs provides the potential to obtain results more quickly and to solve larger problems.

5.1.2 Related Work

The Sobel operator can be used for image edge detection and is designed to approximate the gradient value at the specific pixel being processed [84]. A computed gradient value that is relatively high represents a hill, slope or wall, that is, an edge. The Sobel operator for edge detection was first described in an unpublished 1968 article from the Stanford AI Lab by Irwin Sobel and Jerome Fredman: A 3×3 Isotropic Gradient Operator for Image Processing [21, 86]. This operator computes an approximation of two gradients using a pair of 3×3 convolution masks, one mask for the horizontal estimation of the gradient and one for the vertical [93]. There are other similar operators including the Robinson operator and the Kirsh operator, differing in the weights used in the convolution mask [87].

The median filter was outlined by Tukey, in 1971, for use in signal smoothing and it is a special case of a rank filter [43, 87]. It is commonly used to enhance the quality of an image by forcing points with highly varied intensities to be similar to their neighbors [84]. The median value of a given pixel's neighbors is computed and


is used to replace the pixel's value. The processed pixel is normally centered in its neighborhood. In our median filter work, neighborhoods of 3×3 and 5×5 pixels are used. This type of filtering has many uses, including that of noise removal, which is an instance of "smoothing." Variations in the median filter include increasing the size of the neighborhood and changing the shape of the neighborhood.

Over time, numerous improvements to the Sobel operator have been proposed. Since the Sobel operator only uses a 3×3 mask, it is very sensitive to noise in the image. Ma, et al. [56] proposed an improved Sobel algorithm using a median computation, based on a larger 5×5 mask. Using color images, Wesolkowski, et al. [96] demonstrated that the Sobel operator shows good edge detection results, as compared with other similar operators. Wang [94] showed how a filter based on the Sobel operator can be used as an integral component of a vehicle identification system where the Sobel operator is used to highlight the contour of the vehicle against the environment around the vehicle. In our work, we use the Sobel operator in its original form, with a 3×3 mask.

5.1.3 Approach to Research

We begin our DEFG Application Chapter with the Sobel operator and median filter applications because they represent the class of straight-forward GPU problems, where a single kernel is called once and there is significant locality of memory reference. This locality-of-reference characteristic tends to provide for good GPU performance without having to create a complex kernel or set of kernels. The median filter presents an application that is similar to the Sobel operator, but is more computationally intense, especially when using neighborhoods of size 5×5. This increased computational intensity is useful when exploring the performance characteristics of the multiple-GPU DEFG versions.

Our single-GPU Sobel and median filter implementations show that DEFG handles single-invocation OpenCL kernels with minimal developer effort, and with good performance. When performing multiple-GPU DEFG processing, the Sobel and median filters provide more complex usage cases since the images sent to each GPU must overlap slightly and DEFG needs to manage this data overlap. A goal of DEFG is to enable multiple GPU operations with very limited changes to DEFG code. Our expectation is that when this DEFG capability is used appropriately, the OpenCL kernels will not require modifications.

A few words of caution concerning the DEFG multiple-GPU capabilities: of course, the OpenCL kernels must have been implemented using GPU coding techniques that permit multiple GPU operations. DEFG makes the utilization of multiple-GPU-capable kernels less work for the developer, but it does not provide any "Holy Grail" (see Shen [83]) for automatic parallelization of existing algorithms and code.

DEFG provides a number of design patterns that support multiple-GPU processing. Multiple-GPU image applications can use DEFG's Overlapped-Split-Process-Concatenate design pattern to handle those neighborhood masks that are at the image edges. Since the Sobel operator and median filter require at least a 3×3 neighborhood mask, some image content must be repeated so that the edge pixels at the image-split locations are present where they are needed. Each GPU must have in memory the pixels it requires. When these separately processed images are later concatenated to reform the final image, this overlapped image content must be taken into consideration. The Overlapped-Split-Process-Concatenate design pattern handles this overlapping in a generalized and easy-to-use manner. The DEFG buffer halo option facilitates this design pattern; it provides DEFG with the needed information to correctly split and reform images. The halo option is discussed in Chapter IV and in the User's Guide, Appendix Section A.
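The Overlapped-Split-Process-Concatenate pattern can be modeled, apart from any GPU, on a 1-D signal with a 3-point averaging stencil: split with a one-element halo, process each half independently, then keep only each part's owned points when concatenating. All names and sizes below are our illustration, not DEFG code.

```c
#include <assert.h>
#include <string.h>

enum { N = 8, HALF = N / 2, HALO = 1 };

/* 3-point "kernel": interior points are averaged, edge points pass through. */
static void stencil3(const int *in, int *out, int n) {
    out[0] = in[0];
    out[n - 1] = in[n - 1];
    for (int i = 1; i < n - 1; i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3;
}

/* Split-with-halo, process each half as if on its own device, concatenate. */
static void split_process_concat(const int *in, int *out) {
    int part[HALF + HALO], res[HALF + HALO];

    /* "GPU 1": elements 0..HALF-1 plus one halo element below. */
    memcpy(part, in, (HALF + HALO) * sizeof *in);
    stencil3(part, res, HALF + HALO);
    memcpy(out, res, HALF * sizeof *out);               /* keep owned points */

    /* "GPU 2": elements HALF..N-1 plus one halo element above. */
    memcpy(part, in + HALF - HALO, (HALF + HALO) * sizeof *in);
    stencil3(part, res, HALF + HALO);
    memcpy(out + HALF, res + HALO, HALF * sizeof *out); /* drop the halo */
}
```

Because each part carries the halo its stencil needs, the concatenated result matches processing the whole signal on one device.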


5.1.4 Additional Background

5.1.4.1 Single-GPU DEFG Sobel Operator Application

In Section 3.4, we discussed our DEFG-based version of the Sobel operator, utilizing the Sobel kernel from the AMD Application SDK 2.8 [1]. The CPU-side OpenCL code was replaced by DEFG-generated code. This DEFG Sobel implementation produced exactly the same results as the SDK version and showed very similar run-time performance. Figure 5.1 shows the results of running the Sobel operator with the DEFG. The image on the left is the input image and the image on the right shows the filtered results. The edges of the highway lane markers, treetops, and cloud banks are delineated and highlighted. The filter has the effect of highlighting changes in color.

5.1.4.2 Single-GPU DEFG Median Filter Application

Our filter application research efforts involve using the DEFG multiple-GPU support to produce enhanced application performance characteristics. However, it became clear that the Sobel operator was not computationally intense, in terms of run time, relative to the time it takes to move the image to and from the GPU; a large portion of the GPU-based Sobel Filter execution time was consumed moving the image to and from the GPU.

Figure 5.1: Sobel Operator Performed with DEFG: Before and After Images
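For reference, the per-pixel computation the Sobel kernel performs can be sketched on the CPU as two 3×3 convolutions. The |gx| + |gy| magnitude shortcut used here is a common approximation; the AMD SDK kernel's exact arithmetic may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Sobel response at interior pixel (x, y) of a width-wide grayscale image:
 * horizontal and vertical 3x3 convolutions combined as |gx| + |gy|. */
static int sobel_at(const int *img, int width, int x, int y) {
    static const int gx_mask[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
    static const int gy_mask[3][3] = { {-1,-2,-1}, { 0, 0, 0}, { 1, 2, 1} };
    int gx = 0, gy = 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            int p = img[(y + dy) * width + (x + dx)];
            gx += gx_mask[dy + 1][dx + 1] * p;
            gy += gy_mask[dy + 1][dx + 1] * p;
        }
    return abs(gx) + abs(gy);   /* large value => an edge at (x, y) */
}
```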


Figure 5.2: Median 5×5 Filter Performed with DEFG: Original, Noise-Added, and After-Processing Images

Therefore, a second filter application was added: the MEDIAN filter. The median filtering was supported in both 3×3 and 5×5 neighborhood versions. This DEFG CPU-side program to support the median filter was very similar to the Sobel version. The significant differences between these two filter applications were in the corresponding OpenCL kernel code. Figure 5.2 shows the results of having run the median 5×5 filter with DEFG. The left-most image is the original image. The middle image consists of the original image with approximately 2 percent of the pixels replaced by black "noise" pixels. The right-most image shows the results of running the median application to clean up the noise. A careful examination of the resultant image shows that the noise was removed, but the quality of the image suffered as a result. One can see that, in the right-most image, the blades of grass are somewhat blurred and the clouds are less clear; some of the quite sharp cloud edges have been dulled.

Figure 5.3 shows the Median filter kernel code, with a 3×3 neighborhood. Lines 12 through 20 moved the neighborhood pixels into the sort_buf array. Lines 21 through 29 ordered the elements in the array and line 30 copied the middle pixel value, the median, to the pixel being processed. (Footnote 1: The sorting algorithm used here was not highly optimized, helping ensure that this median filter kernel is more computationally intense than our Sobel operator kernel.) The larger 5×5 version of this kernel was very similar to this 3×3 version; the major difference was that 25 pixels were sorted and


the middle element of the resulting 25-item array was chosen as the new pixel value.

We note that these two kernels from our single-GPU work were used in our multiple-GPU work without modification. This was in line with our aim of, where possible, obtaining DEFG multiple-GPU support without having to recode the OpenCL kernel or make substantial changes to the DEFG code. In the best case, we expect that the changes to the DEFG code are simple changes to the declarative statements and the simple switching of execute statements to multi_exec. The algorithms and kernels that are suitable for this relatively easy switch to multiple-GPU processing tend to be self-contained and have simple data access patterns. Some computer scientists describe these as embarrassingly parallel, perfectly parallel, or pleasingly parallel algorithms [98]. Embarrassingly parallel or not, applications that use multiple GPUs still need the infrastructure code present to manage the many GPUs and the many buffers. DEFG provides this infrastructure code.

5.1.4.3 Multiple-GPU DEFG Filter Applications

The DEFG Overlapped-Split-Process-Concatenate design pattern and the associated halo option are used here to permit the filtering to be performed on two separate GPUs, with each GPU processing half of the image. As noted previously, the complicating factor is the need for any pixel processed to have its neighbors present. This means that a small section of the image, the parts of any halo, may need to be present on both GPUs.

The code used to execute the Sobel operator with a single GPU was previously shown in Figure 3.1. For comparison, the DEFG code used to execute the Median 5×5 filter on two GPUs is shown in Figure 5.5. A review of the two DEFG programs shows three significant changes beyond the kernel name change. First, the declare gpu, on line 5, has been changed to use the all option. Next, the halo option has been added to the buffer declarations on lines 7 and 8. Finally, line 10 has been


changed from an execute statement to a multi_exec statement. The all option allows DEFG to use all the GPUs attached to the CPU node; in the case of the UCD Hydra server, this was two GPUs. The addition of the halo option notifies DEFG that when running with multiple GPUs, DEFG must manage the overlapped halo area. The (2) denotes that the overlap is 2 units; as this is a 2-dimensional data structure, i.e., an image, the 2 denotes two lines. The use of multi_exec causes DEFG to execute kernel code on all the selected GPUs. These changes are sufficient to allow for the generation of C/C++ code utilizing more than one GPU, even though the partial image sent to each GPU must be overlapped. The overlap management is handled entirely by DEFG.

Figure 5.4 shows a schematic representation of a 6 pixel × 6 pixel image that might be processed by DEFG with two GPUs. In the schematic, the pixels prefixed

01. __kernel void median_filter(__global uint *inputImage,
02.                             __global uint *outputImage) {
03.   uint sort_buf[9];
04.   uint h;
05.   int i, j;
06.   uint x = get_global_id(0);
07.   uint y = get_global_id(1);
08.   uint width = get_global_size(0);
09.   uint height = get_global_size(1);
10.   int c = x + y * width;
11.   if (x >= 1 && x < width - 1 && y >= 1 && y < height - 1) {
12.     sort_buf[0] = inputImage[c - width - 1];
13.     sort_buf[1] = inputImage[c - width];
14.     sort_buf[2] = inputImage[c - width + 1];
15.     sort_buf[3] = inputImage[c - 1];
16.     sort_buf[4] = inputImage[c];
17.     sort_buf[5] = inputImage[c + 1];
18.     sort_buf[6] = inputImage[c + width - 1];
19.     sort_buf[7] = inputImage[c + width];
20.     sort_buf[8] = inputImage[c + width + 1];
21.     for (i = 0; i < 8; i++) {
22.       for (j = i + 1; j < 9; j++) {
23.         if (sort_buf[i] > sort_buf[j]) {
24.           h = sort_buf[i];
25.           sort_buf[i] = sort_buf[j];
26.           sort_buf[j] = h;
27.         }
28.       }
29.     }
30.     outputImage[c] = sort_buf[4];
31.   }
32. }

Figure 5.3: Kernel Code for Median Filter with 3×3 Neighborhood


Schematic Image:

A11  A12  A13  A14  A15  A16
A21  A22  A23  A24  A25  A26
A31b A32b A33b A34b A35b A36b
B41a B42a B43a B44a B45a B46a
B51  B52  B53  B54  B55  B56
B61  B62  B63  B64  B65  B66

Figure 5.4: Image Schematic Showing Overlap with 2 GPUs

Table 5.1: Execution Times on Hydra Server, in Milliseconds

Filter    Write to    Execution   Read from   Total GPU
Name      GPU (ms)    GPU (ms)    GPU (ms)    Run (ms)
SOBEL     1           1           1           3
SOBELM    2           1           1           6

with "A" are in the sub-image for GPU 1 processing and the pixels prefixed with "B" are in the sub-image for GPU 2 processing. In this hypothetical example, a Sobel operator is to be run for the non-edge pixels in the full image. So, the halo statement option would be used since the Sobel operator uses a 3×3 mask. The use of this halo option would have the effect of copying pixels A11 ... A36b plus B41a ... B46a to GPU 1 and A31b ... A36b plus B41a ... B66 to GPU 2. By copying the additional pixels, each GPU would have all of the values needed to correctly execute the Sobel operator. When the image buffers are later transferred back to the CPU, the image overlap area is managed by DEFG, so that the correct pixels appear in the final image.
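The row ranges implied by the schematic can be computed with a small helper. This sketch is ours, not DEFG's internal logic; it widens each device's equal share of rows by a per-boundary halo h (DEFG's halo (2), a two-line total overlap, corresponds to h = 1 here) and clips at the image edges.

```c
#include <assert.h>

struct row_range { int first; int last; };   /* inclusive rows sent to a GPU */

/* Rows to ship to device 'gpu' of 'num_gpus', for an image 'height' rows
 * tall, widening the owned share by 'halo' rows at each internal boundary. */
static struct row_range halo_split(int height, int num_gpus, int halo, int gpu) {
    int share = height / num_gpus;               /* rows this GPU owns */
    struct row_range r;
    r.first = gpu * share - halo;                /* widen upward   */
    r.last  = (gpu + 1) * share - 1 + halo;      /* widen downward */
    if (r.first < 0) r.first = 0;                /* clip at image edges */
    if (r.last > height - 1) r.last = height - 1;
    return r;
}
```

With height 6, two GPUs, and h = 1, GPU 1 receives rows 0 through 3 (A11 ... B46a) and GPU 2 receives rows 2 through 5 (A31b ... B66), matching the schematic.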


01. declare application median5
02. declare integer Xdim (0)
03.         integer Ydim (0)
04.         integer BUF_SIZE (0)
05. declare gpu gpugrp (all)
06. declare kernel median5_filter Median5Filter_Kernels [[ 2D, Xdim, Ydim ]]
07. declare integer buffer image1 (Xdim Ydim) halo (2)
08.         integer buffer image2 (Xdim Ydim) halo (2)
09. call init_input (image1 (in) Xdim (out) Ydim (out) BUF_SIZE (out))
10. multi_exec run1 median5_filter (image1 (in) image2 (out))
11. call disp_output (image2 (in) Xdim (in) Ydim (in))
12. end

Figure 5.5: DEFG Code to Execute the 5×5 Median Filter

5.1.5 Experimental Results

5.1.5.1 Multiple-GPU DEFG Filter Performance

Table 5.1 shows the SOBEL and SOBELM applications' run-time performances, obtained from executing these applications on Hydra. (Footnote 2: The SOBELM run time is not equal to the sum of the partial totals due to timer limits.) SOBEL is the name given to the DEFG single-GPU version of the Sobel operator and SOBELM is the name for the DEFG multiple-GPU version. Clearly the single-GPU version, which used 3 ms, was faster than the 2-GPU version, which used 6 ms. Equally clear is that the time needed to move the image to and from the GPU, 2 ms in the case of SOBEL, was greater than the time needed for the execution of the kernel, in this case 1 ms. The SOBEL application, directly based on the Sobel application from the AMD SDK, uses more time moving the image than processing it.

In an effort to understand the unimpressive performance with our SOBELM 2-GPU application, we manually inserted additional timers into the DEFG-generated code and performed additional DEFG logging. These steps gave us basic run-time profiles for SOBEL and SOBELM. After reviewing these run-time profiles, we postulated several potential explanations for the observed performance: (1) the OpenCL multi-GPU operations have generally poor performance; (2) the image being used for filter testing was too small; (3) the Sobel operator was not computationally intense enough for multi-GPU use; (4) the DEFG approach to multi-GPU kernel execution


on the GPU performed poorly; and, (5) the DEFG approach to multi-GPU buffer movement to and from the GPU performed poorly.

Our previous experiences with other OpenCL applications, in particular the mining of digital coins [54], showed that the OpenCL support for multiple-GPU operations was "solid" and showed notably high performance. So we discounted the first explanation. In order to test the view that the filter test image was too small, we obtained and experimented with a set of larger images. Our full set of images is listed in Table 5.2. The BUFLO [3], FIELD [4], and ROAD [5] images were obtained from Google Images and the World Wide Web. This table assigns a unique name for the images, along with presenting each image's characteristics.

Table 5.2: Images Used with Filter Application Testing

  Image     Description         Image          Image            Source
  Name                          Dimensions     Size (bytes)
  BUFLO     Downtown Buffalo    1714 × 1162    5,977,382        Mike Boncaldo
  FIELD     Rice field          512 × 512      786,486          freedigitalphotos
  IMG1000   Colored box         1000 × 1000    3,000,054        generated
  IMG5000   Colored box         5000 × 5000    75,000,054       generated
  IMG7000   Colored box         7000 × 7000    147,000,054      generated
  IMGHUGE   Colored box         22000 × 22000  1,452,000,054    generated
  ROAD      Road to the clouds  170 × 153      78,390           freedigitalphotos

As part of testing the explanation that the Sobel operator is not computationally intensive enough, we added the median filter to our set of DEFG applications. We produced median filter applications in two forms. The first form used a 3×3 neighborhood with 9 pixels and the second form used a 5×5 neighborhood with 25 pixels. The second form, with its 25-pixel neighborhood, was more computationally intense due to the median filter's sorting step. Using these additional images and filters, we performed a series of new experiments on Hydra.

The results of these experiments are shown in Table 5.3. The first column lists

[3] http://www.mikeboncaldo.com/photos - Used with permission.
[4] http://www.freedigitalphotos.com - Used as per agreement.
[5] http://www.freedigitalphotos.com - Used as per agreement.


Table 5.3: Run Times for Various Images

  Image     Filter     Average Execution
  Name      Name       (seconds)
  FIELD     SOBEL      0.003
            SOBELM     0.006
            MEDIAN     0.003
            MEDIANM    0.006
            MEDIAN5    0.009
            MEDIAN5M   0.008
  BUFLO     SOBEL      0.018
            SOBELM     0.023
            MEDIAN5    0.058
            MEDIAN5M   0.048
  IMG1000   SOBEL      0.010
            SOBELM     0.013
            MEDIAN5    0.023
            MEDIAN5M   0.019
  IMG5000   SOBEL      0.215
            SOBELM     0.240
            MEDIAN5    0.542
            MEDIAN5M   0.412
  IMG7000   SOBEL      0.420
            SOBELM     0.479
            MEDIAN5    1.062
            MEDIAN5M   0.794
  IMGHUGE   SOBEL      failed (-4)
            SOBELM     4.650
            MEDIAN5    failed (-4)
            MEDIAN5M   7.775

the image used, the second provides the name of the filter used, and the last column shows the average run times for three Hydra executions of the named image with the given filter. The characteristics of each image are given in Table 5.2. [6]

We note that the execution times for SOBEL and SOBELM, compared to MEDIAN and MEDIANM, for the FIELD image, were very similar. Our experience has been that the Sobel operator and the 3×3 median filter showed similar run times, no matter which image was used. Therefore, we omitted the MEDIAN and MEDIANM execution times in the remainder of Table 5.3. Instead, we focused on the SOBEL/SOBELM and MEDIAN5/MEDIAN5M results.

The FIELD image execution times showed that MEDIAN5 took about 3 times longer to execute as compared to SOBEL, and that MEDIAN5M took less than 2 times the time of SOBELM. Clearly, the impact of using two GPUs with the MEDIAN5M median filter application was different from that with the SOBELM Sobel operator application. Comparing the MEDIAN5 and MEDIAN5M results, using the FIELD

[6] The FIELD image from Table 5.2 is shown in Figure 5.2, on the left, and the ROAD image from Table 5.2 is shown in Figure 5.1, also on the left.


Figure 5.6: Plot of Filter by Image, Average Run Times

image, showed that the MEDIAN5M application was slightly faster, having used 0.008 seconds versus 0.009 seconds for single-GPU MEDIAN5.

The average run times for the IMG1000, IMG5000, and IMG7000 images, plotted with applications SOBEL, SOBELM, MEDIAN5, and MEDIAN5M, are displayed in Figure 5.6. We note that IMGHUGE is not included in this plot, as it was too large to be processed by the single-GPU applications. This plot shows that multiple-GPU MEDIAN5M was faster than single-GPU MEDIAN5 for IMG5000, of size 5000×5000, and for IMG7000, of size 7000×7000. For the higher-computational-intensity 5×5 median filter, the multiple-GPU processing was faster when using the larger images.

5.1.5.2 Profiled Multiple-GPU Performance

In an effort to further understand the computational intensity issue, we ran additional experiments with the DEFG OpenCL logging facility engaged. These logs provided OpenCL API call-profiling information. In particular, the times needed to write the


Table 5.4: Detailed Run Times for BUFLO Image

  Filter     Write to    Execution    Read from    Total GPU
  Name       GPU (ms)    (ms)         GPU (ms)     Run (ms)
  SOBEL      6           5            3            19
  SOBELM     7           5            7            23
  MEDIAN5    6           45           3            59
  MEDIAN5M   8           25           8            44

images to the GPUs, execute the kernels, and read the images back from the GPUs were logged. Table 5.4 shows the results of these experiments, using the BUFLO image. This data indicates that the 5×5 median filter benefited from 2-GPU execution, with an execution time of 25 milliseconds versus a 1-GPU time of 45 milliseconds. It is also clear that the clock time needed to write the images to the GPUs, and read the images back from the GPUs, had increased with 2-GPU operation. For example, the MEDIAN5 write-to-GPU time was 6 ms and the MEDIAN5M write-to-GPU time was 8 ms. We concluded that, with this implementation of DEFG, the Sobel operator is not computationally intense enough to benefit from 2-GPU usage. The Sobel operator showed increased write and read times while the execution time did not decrease.

The increase in write and read times was unexpected; our expectation was that these times would drop with 2-GPU operation. A review of the OpenCL documentation, combined with our previous experiences with other OpenCL applications, led us to postulate that the DEFG use of a single operating system thread for all operations might be contributing to the increased run times for the write and read operations. We note that the DEFG-generated code used the OpenCL asynchronous calls and maintained multiple command queues when executed with more than a single GPU. In order to better understand this unimpressive 2-GPU performance, we wrote a non-DEFG OpenCL test program that used Linux pthreads [7] to provide multiple operating system threads. We then compared the run-time speeds of doing the 2-GPU OpenCL

[7] pthread is the name given to a commonly used Linux multiple-threading library.


Table 5.5: Run Times for pthread Experiment

  Run Mode    Run Time (ms)    Comments
  1 Thread    610              DEFG-style
  2 Threads   387              pthread-based

write and read operations using a single-threaded approach against doing the same processing with a pthread-based, 2-thread approach.

Both test programs moved 128 MB to each of 2 GPUs and then moved the same 128 MB back. The test programs were executed three times, and the average execution times for the 1-thread and 2-thread write/read operations are shown in Table 5.5. As we had postulated, the single-operating-system-thread run times were longer. The average for 1-thread operations was 0.610 seconds in duration, while the 2-thread operations needed only 0.387 seconds. This simple experiment tends to support our belief that the single-threaded approach to OpenCL buffer movement operations was a performance issue.

Before we leave the topic of 1-GPU and 2-GPU image processing, we note that the IMGHUGE image, which is extremely large at approximately 1.45 GB, was processed by the 2-GPU versions of the filter applications and not by the 1-GPU versions. The 1-GPU versions failed with OpenCL error -4, defined by the OpenCL header as:

CL_MEM_OBJECT_ALLOCATION_FAILURE

Plainly stated, the memory allocation on the single GPU failed due to lack of global memory.

5.1.5.3 Summary of Results

Let us now return to the five potential causes that we previously noted for the unexpected multiple-GPU performance. The causes are re-listed here along with our thoughts and conclusions about the validity of each cause.


1. The OpenCL multiple-GPU operations have generally poor performance.

   This is not true; OpenCL multiple-GPU applications have been shown to have good performance. We listed digital coin mining as a valid example of good OpenCL performance with multiple GPUs [54].

2. The image being used for filter testing was too small.

   This was basically true for the smallish FIELD image. The larger images did show the performance benefit of using multiple GPUs with a computationally-intense filter, such as the 5×5 median filter.

3. The Sobel operator was not computationally intense enough for multi-GPU use.

   With our DEFG implementation of the Sobel operator for multiple GPUs, this was true.

4. The DEFG approach to multi-GPU kernel execution on the GPU performs poorly.

   Our experiments showed that the DEFG approach to multiple-GPU kernel execution did not perform poorly. Comparing the MEDIAN5 and MEDIAN5M kernel execution times showed a drop from 45 ms to 25 ms. This was a speedup of 1.8.

5. The DEFG approach to multi-GPU buffer movement to and from the GPU performs poorly.

   Unfortunately, our experiments did show that the DEFG approach to multiple-GPU buffer management did not perform as well as it theoretically could have. We saw increases in run times for multiple-GPU buffer movements, and we attributed this to the DEFG run-time use of a single operating system thread. The separate experiment we ran with pthreads tended to confirm that a multi-threaded approach could be faster.


Our experiments have shown that the multiple-GPU support can be useful when used with workloads that are computationally intense enough to offset the added overhead of the increased multiple-GPU buffer transfer times. It is conceivable that the next major version of DEFG will support pthread-style processing when multiple GPUs are used.

5.2 Application: Breadth-First Search

5.2.1 Problem Definition and Significance

Breadth-first search is a well-studied graph-theoretic problem, with practical application in shortest-path analysis of social networks, in Internet packet routing, in the World Wide Web, and in many other areas [16]. Numerous breadth-first search (BFS) algorithms have been implemented on GPUs, with Harish and Narayanan providing one of the first published GPU implementations [38]. The breadth-first search application discussed in Chapter III is based on this Harish work, and it will be used in this section as a comparison basis for our new BFS application.

More recently, Merrill et al. have described an interesting method for managing the intermediate data structures needed in BFS by using prefix sum to avoid locking and serialization of data structure items [59]. Doing GPU-based graph processing with large, very irregular (LVI) graphs can be challenging, because of the high variation in vertex degree and the sheer volume of the vertex and edge data structures. The storage consumed by an LVI graph's data structures may exceed the memory capacity of a single GPU. The specific problem being addressed here is the implementation of BFS with multiple-GPU support, specifically designed to process LVI graphs, using OpenCL and DEFG. The intent is that DEFG and its design patterns will provide the necessary mechanisms needed to manage the flows of data between the GPUs; by using this support, the GPU application developer can focus on the application's algorithms and processing.


5.2.2 Related Work

There are numerous implementations of BFS on GPUs, utilizing a variety of approaches and algorithms. Since 2007, the performance of these BFS algorithms has improved, sometimes by a speedup of up to 15 [42]. Harish and Narayanan provide one of the first published GPU implementations, using NVIDIA's CUDA [38]. This work provides basic GPU-usable algorithms for implementing breadth-first search, single-source shortest path, and all-pairs shortest path; the performance results obtained are often used as a baseline for judging the performance of improved GPU algorithms. The Harish breadth-first search algorithm processes search nodes in parallel, and is level-synchronous. The algorithm consists of two major segments and, in their implementation, each of these segments becomes a CUDA kernel. Using the two separate kernels provides an automatic serialization barrier, thereby enforcing the required level synchronization. Harish references the previous work by Bader and Madduri as part of the basis for their work.

The Bader and Madduri work covers BFS on the Cray MTA-2 [12]. The Cray MTA-2 architecture, with its ample global shared memory, is thread-oriented and has very fast context-switch times between threads, making it somewhat similar to the GPU architectural model. These characteristics allow the MTA-2 to not depend upon memory caching, but instead, much like GPUs, to switch to a different thread when memory stalls occur. This architectural similarity makes the algorithm provided by Bader a reasonable starting point for GPU-based BFS processing. The level-synchronized BFS algorithm published by Bader uses the zero-overhead synchronization provided by the MTA-2.

Whereas the Bader work and the Harish work are aimed at getting results within particular environments, Dehne and Yogaratnam provide an overview of the advanced programming techniques, such as packing multiple variables' values into one 32-bit integer, that can be used with both CUDA and OpenCL to achieve better performance


[26]. They start with the basic PRAM models of parallel computation and present guidelines on how to adapt these models for use on GPUs. Special attention is given to the difficulties encountered in the GPU processing of highly irregular data access patterns. These difficulties include coalescing global memory accesses, dealing with concurrent-write memory accesses, SIMT thread execution instruction path divergence, and thread synchronization.

In opposition to the Harish approach, Luo et al. [55] suggest hierarchical queue management and hierarchical kernel arrangement approaches, which may provide improved performance. The suggested hierarchical queue management approach facilitates having many threads adding to the BFS queue by breaking the queue up into independent segments. Hierarchical kernel management avoids some synchronization at the top kernel level by using the GPU barriers and fine-grained synchronization operators at certain levels. This work deals mainly with optimized locking and synchronization issues and does not address the GPU workload imbalances covered by Hong et al. [42].

Hong et al. cover accelerating the performance of BFS given the specific constraints of the GPU environment. This work notes that the poor performance of the earlier PRAM-like GPU implementations was largely due to GPU thread load imbalances. Load imbalance occurs when the work to be done is allocated to the threads in a warp or work-group [8] improperly, and some threads complete their work and remain idle while other threads continue to work in the same warp or work-group. The solution proposed by Hong involves dividing the GPU code into SISD-like and SIMD-like portions and assigning virtual warps to the SISD portions. Virtual warps are artificial groupings of threads that allow the software to better manage the warps for certain processing steps. Hong et al. make it clear that improved GPU performance can be obtained by introducing non-standard programming techniques, like virtual warps, to

[8] warp is a CUDA term and work-group is an OpenCL term; they both refer to the group of threads being executed together under a single instruction counter, SIMD/SIMT style.


the GPU programming environment.

Dinneen et al. [27] provide a run-time performance comparison between OpenCL and CUDA, and talk about using OpenCL. In addition, they compare the performance of synchronization at the kernel level with fine-grained, atomic synchronization at the level of individual data items. Their results show that OpenCL and CUDA, over their tests, have similar run-time performance, to within 2% of each other. The results of their synchronization comparisons show that different input graphs, depending on size and density, benefit from each approach. No general preference for OpenCL or CUDA is identified.

Merrill et al. demonstrate a breadth-first search parallelization, which uses prefix sum for buffer management, and achieves an asymptotically optimal O(|N| + |E|) work complexity [59]. None of the previously described BFS works achieved this optimum level of work complexity. The Merrill approach performs synchronous, level-by-level traversal of the graph from the starting vertex. This BFS method uses the expansion of the vertex frontier by traversing the associated edges and then pruning the frontier to contain only unmarked vertices. As the marked vertices are pruned, potential duplicate vertices are also removed. Merrill refers to these two phases as "neighbor expansion" and "status-lookup and filtering", respectively. Prefix sum is used to manage shared, updated buffers in a way that avoids the use of serialization and locking; Merrill refers to this technique as cooperative allocation. Here, we quote: "Given a list of allocation requirements for each thread, prefix sum computes the offsets where each thread should start writing its output elements." [59] Of particular interest to our work is the use of prefix sum in the management of shared buffers. These buffers are shared between GPU threads and between different GPUs. In this approach, the virtual pointers, denoting the edges in the graph, are passed between GPUs as the graph is traversed. In the next section, we take a deeper look at buffer allocation using prefix sum.


5.2.2.1 Buffer Allocation Using Prefix Sum

The parallelized allocation of shared buffer space using prefix sum provides a buffer management technique that does not require the use of serialization and atomic locking. Figure 5.7 provides an example of allocating buffer space based on prefix sum. In this example, threads t1, t2, t3, and t4 need to put values into the output results buffer. The threads need to allocate 3, 2, 0, and 1 items, respectively. The prefix sum computes the sum of the preceding items for each thread, which is the offset to the beginning of each thread's area. The color coding matches the requirement for each thread with its associated area in the buffer. For example, thread t2, marked in green, requires space for two in the buffer, and the two items are allocated at offsets 3 and 4. This approach gets around the bottleneck that often occurs on GPUs with allocating space in a shared buffer or queue; the bottleneck is avoided since no serialization or atomic operations are needed. In our work, this approach will be used to create an OpenCL version of BFS, which supports multiple GPU devices. In particular, it will be used to manage the edge virtual pointers that need to be passed between GPUs as the graph is traversed.

Figure 5.7: Prefix Sum based Buffer Allocation


5.2.2.2 Identification of Graph Repositories

The testing of graph algorithms with large, very irregular graphs requires test data. One source of such test data is well-known repositories of real-world data. These repositories' graphs are based on actual data from areas such as social networking, communications networks, and geographical mapping; these graphs are not artificially generated test data. Such repositories include:

1. Stanford Network Analysis Package (SNAP) [88]

   This package of applications supplies graph data sets from 14 areas, including social networks, communications networks, and citation networks. Hong et al. made use of the repository's soc-LiveJournal1 [9] data set in their 2011 article on BFS processing with the Cray MTA-2 [42]. It has 4,847,571 nodes and 68,993,773 edges. Other SNAP data sets/graphs are even larger. The memetracker data set has 96 million nodes and 418 million links (edges). The social-network, directed graph soc-LiveJournal1 is used in the BFS experiments described below.

2. Center for Discrete Mathematics and Theoretical Computer Science (DIMACS) [20]

   The repository provided by DIMACS, as part of the 9th DIMACS Implementation Challenge - Shortest Paths, contains 12 USA road networks. Each network (a network being a graph with additional values attached to nodes or edges) is available in two forms: one form gives the distance on each edge and the other gives the transit time. The networks vary in size from the New York City network, with 246,346 nodes and 733,846 edges, to the full USA network, with 23,947,347 nodes and 58,333,344 edges. Harish and Narayanan [38] use graphs from this repository in their work on all-pairs shortest path.

[9] We denote graph names, OpenCL kernel names, source code variable names, and similar objects in italics.


3. University of Florida Sparse Matrix Collection [25, 24]

   This collection is maintained at the University of Florida and contains sparse matrices that arise in real applications. The collection is used by the numerical linear algebra community for the evaluation of sparse matrix algorithms.

5.2.3 Application Software Design

The DEFG-based breadth-first search application from Chapter III, based on the breadth-first search benchmark from OpenDwarfs [31], was designed to work with a single GPU device. We refer to this application as BFS. In order to utilize more than one GPU device, we modified this existing application's kernels, introduced additional kernels, and enhanced the application's DEFG code. This new application has the moniker BFSDP2GPU. Our basic design approach was to continue using the original BFS application's Harish techniques and enhance them to be utilized over two GPUs. This two-GPU execution required that shared buffers, containing the edges on the BFS frontier, be moved between the GPUs. These buffers were shared between GPU threads, as well as between GPUs. We used the prefix-scan approach, as previously described, to manage these shared buffers.

5.2.3.1 The BFS Single-GPU Application

We first describe the processing done in the basic BFS application. The BFS application's DEFG code and kernel code are listed in the Appendix, Section B.4. Here, we will show only significant code snippets. Figure 5.8 shows lines 25 through 41 of the BFS application's main processing loop. Line 26 executed kernel1 and line 35 executed kernel2. The STOP variable was used to control the loop; the loop was exited when the frontier emptied, wherein the second kernel did not update the STOP value to 1. This loop utilized the two kernels to move through the tree, which was held in the graph_nodes and graph_edges buffers, and to update the breadth-first search


25. loop
26.   execute part1 kernel1 ( graph_nodes(in)
27.     graph_edges(in)
28.     graph_mask(in)
29.     updating_graph_mask(inout)
30.     graph_visited(in)
31.     cost(inout)
32.     NODE_CNT(in)
33.   )
34.   set STOP
35.   execute part2 kernel2 ( graph_mask(inout)
36.     updating_graph_mask(inout)
37.     graph_visited(inout)
38.     STOP(inout)
39.     NODE_CNT(in)
40.   )
41. while STOP eq 1

Figure 5.8: BFS Application's DEFG Loop

frontier, which was held in the graph_mask buffer.

Figures 5.9 and 5.10 list the kernels used. They are taken, with only cosmetic changes, from the OpenDwarfs Benchmark. [10] The first kernel was given a node to process, and if the node had a non-zero g_graph_mask value, meaning the node was on the current frontier, the node's edges were processed. If the node pointed to by an edge had a zero g_graph_visited value, the node was put on the new frontier and the updated cost was carried forward. The second kernel served two purposes: it copied the new frontier, stored in g_updating_graph_mask, to the frontier, stored in g_graph_mask, and it set the g_over variable, which maps to the STOP variable in the DEFG code, to 1. Setting this variable to 1 caused the loop to be repeated. These two kernels, along with the associated DEFG code, performed the breadth-first search, starting with a node preset in the frontier.

5.2.3.2 The BFSDP2GPU Two-GPU Application

Enhancing this approach to use two GPUs entailed several new operations. The graph was shared between the GPUs, and the GPUs had to actively exchange the lists of the nodes on the frontier.

[10] OpenDwarfs Benchmark: Copyright July 29, 2011 by Virginia Polytechnic Institute and State University. All rights reserved.


01. __kernel void kernel1( __global const Node* g_graph_nodes,
02.                        __global int*  g_graph_edges,
03.                        __global int*  g_graph_mask,
04.                        __global int*  g_updating_graph_mask,
05.                        __global int*  g_graph_visited,
06.                        __global int*  g_cost,
07.                        int no_of_nodes )
08. {
09.   unsigned int tid = get_global_id(0);
10.   if (tid < no_of_nodes && g_graph_mask[tid] != 0)
11.   {
12.     g_graph_mask[tid] = 0;
13.     for (int i = g_graph_nodes[tid].starting;
14.          i < (g_graph_nodes[tid].no_of_edges + g_graph_nodes[tid].starting); i++)
15.     {
16.       int id = g_graph_edges[i];
17.       if (!g_graph_visited[id])
18.       {
19.         g_cost[id] = g_cost[tid] + 1;
20.         g_updating_graph_mask[id] = 1;
21.       }
22.     }
23.   }
24. }

Figure 5.9: The BFS kernel1 Kernel

structure. The DEFG_GLOB structure was used internally by DEFG to manage the presentation of each buffer to the GPUs. For the sake of brevity, we will limit ourselves to showing only highly significant code segments and omit segments of lesser importance.

We note that DEFG did not perform dynamic workload balancing; the allocation of work given to the GPUs, and GPU threads, was determined entirely by the DEFG application implementation. Our experience has been that server nodes with multiple GPU cards were often configured with matching pairs of GPU cards. For this reason, we considered it sufficient to have the application split the workload approximately into equal parts. Our graph application gave half of the nodes to each GPU. When this graph partitioning was performed by the ArrayPartition2GPU2 function, no attempt was made to group the nodes in some manner to minimize the cross-GPU communications. The additional work to optimize the graph traversal could have meant unacceptable resource usage within this preprocessing step. This graph partitioning approach was similar to that used in the Merrill work [59].

The DEFG use of more than a single GPU is engaged by replacing the execute statements with the multi_exec statements and changing the declare gpu statement to select multiple GPUs. Once these changes were in place, DEFG would automatically assign the work to the selected GPUs. Of course, the DEFG application and kernels must have been designed to function with multiple GPUs. In the case of the previously-described image filtering applications, the use of more than a single GPU required very little code enhancement. However, for this breadth-first search application, the DEFG code needed to be significantly enhanced to dynamically share the BFS frontier between GPUs. We note that the added complexity was not in the management of the additional GPUs within OpenCL or in the marshaling of buffers to the correct GPU; instead, the additional complexity was rooted in the allocation and population of the shared buffers. We used the DEFG Prefix-Allocation design


pattern to manage this buffer sharing.

Our original BFS application required two kernels: kernel1 and kernel2. The new BFSDP2GPU application required six kernels. The BFS kernel1 was replaced by five kernels: bermanPrefixSumP1, bermanPrefixSumP2b, getCellValue, kernel1a2, and kernel1b. The bermanPrefixSumP1, bermanPrefixSumP2b, and getCellValue kernels were provided by the DEFG Prefix-Allocation design pattern. The kernel2 kernel was retained from the original BFS application. The remaining two kernels, kernel1a2 and kernel1b, were new and unique to this application.

So as to perform the prefix sum processing, the bermanPrefixSumP1 kernel was executed once and then the bermanPrefixSumP2b was executed log2(buffer_size) times. These kernels were based on the work by Berman and Paul [14] and have a time complexity of O(n log n). The getCellValue kernel was used to quickly obtain the GPU-specific, run-time length of the shared buffers. [11] In an effort to achieve improved performance, we considered using other prefix sum algorithms besides Berman and Paul. The prefix sum algorithm described in Harris et al. [40] achieved a time complexity of O(n). However, the Harris, and other reviewed algorithms, had a power-of-2 buffer-length requirement, which was not acceptable here as the sizes of the buffers to be managed vary at run time. We considered extending the buffers, at run time, to a power-of-2 size and using a Harris-like approach. Ultimately, we decided against this buffer-expansion approach, due to the performance implications; the size of the managed buffers can be quite large. The size exceeded 9 million items in our test graphs.

With the allocation of the shared buffers provided by the kernels described above, the kernel1a2 and kernel1b kernels then provided the functionality of the original kernel1. Population of the buffers to be moved and shared was done by kernel1a2. After kernel1a2 completed and the cross-GPU data movements occurred, kernel1b

[11] This total buffer size information was generated as a side effect of the prefix sum operations.


did the basic breadth-first search processing of the original kernel1 kernel. After these steps were completed, the original kernel2 was executed.

The kernel1a2 kernel source code is shown in Figure 5.11. As stated above, its primary purpose was to traverse the frontier for each GPU and place the active edges and costs into their assigned locations in the shared g_frontier and g_payload buffers. The g_payload buffer contained the accumulated BFS tree traversal costs. We note in passing that the Harish-based approach is not a particularly good high-performance GPU software design, as this kernel may induce GPU hardware thread divergence when the number of edges per thread is highly varied; each graph node is processed by one, and only one, GPU thread. A node with a large number of edges can be processed much more slowly than a node with a small number of edges. Nonetheless, we felt that the Harish approach was sufficient to test our prefix sum-based DEFG Prefix-Allocation design pattern.

The original kernel1 was given a node to process and, if the node had a non-zero g_graph_mask value, then the node's edges were processed. In the 2-GPU version, this work was done by kernel1b, which is shown in Figure 5.12. Note that some lines have been omitted from the figure; these omitted lines are very similar to lines 13 to 26, except that the shared buffers from the other GPU are processed. This kernel was given two sets of buffers, one from each GPU, and it scanned these buffers, processing only the edges for the currently-running GPU.

We will now discuss the BFSDP2GPU application's DEFG code. Due to the code length, we will describe it as DEFG pseudo code; we have included only a few DEFG statements in full, those being the statements with a specialized syntax or purpose. Most of the DEFG statements are presented in a less verbose, minimized form. The BFSDP2GPU pseudo code is shown in Figure 5.13. An ellipsis was used to indicate omitted statements and parameters. The references to minor scalar variables have been omitted. However, due to their key functions, we showed the important


01. __kernel void kernel1a2( __global const Node* g_graph_nodes,
02.                          __global int*  g_graph_edges,
03.                          __global int*  g_graph_mask,
04.                          __global int*  g_graph_offset,
05.                          __global int*  g_cost,
06.                          __global int*  g_frontier,
07.                          __global int*  g_payload,
08.                          int no_of_nodes )
09. {
10.   unsigned int tid = get_global_id(0);
11.   if (tid < no_of_nodes)
12.   {
13.     // frontier-membership test
14.     if (g_graph_mask[tid] > 0)
15.     {
16.       int cost  = g_cost[tid];
17.       int max   = g_graph_nodes[tid].no_of_edges + g_graph_nodes[tid].starting;
18.       int index = g_graph_offset[tid];
19.       for (int i = g_graph_nodes[tid].starting; i < max; i++)
20.       {
21.         g_frontier[index] = g_graph_edges[i];
22.         g_payload[index]  = cost + 1;
23.         index++;
24.       }
25.     }
26.   }
27. }

Figure 5.11: The kernel1a2 Kernel

references to the KCNT and STOP variables.

This code shows the C++ functions and OpenCL kernels being utilized. Lines 11 and 12 contain the calls that obtained the input graph and performed its partitioning. Likewise, the calls used to merge the cost array and output the final node-by-node costs are in lines 27 and 28. Lines 14 and 26 mark the boundaries of the "outer" loop. Line 16 shows the "inner" loop execution of the bermanPrefixSumP2b kernel having been done KCNT times. The bermanPrefixSumP1 and bermanPrefixSumP2b kernels were used at the start of each "outer" loop iteration to establish the shared buffer offsets for each node. At lines 18 through 20, the getCellValue kernel was used to obtain the size of the shared buffer for each GPU. Note that the amount of shared buffer space utilized by each GPU was likely different.

The "@"<GPU-ID> notation, shown on lines 19 and 20, is used to indicate which buffer is presented to a given GPU. The first GPU selected is assigned an ID of 0, and the ID is incremented for each additional GPU in use. This specialized technique is used when the default DEFG approach to multiple-GPU buffer management is not applicable. The default DEFG approach of simply splitting the buffer is not applicable because the size of the shared areas used on each GPU is variable at run time and each shared buffer list must be named uniquely. Although this "@"<GPU-ID> notation is not elegant, we feel it is within the design goals of DEFG, as it is a declaration of what to do and not coding for how to do it.

The broadcast statements on line 22 caused the listed buffers to be made available to the other GPU. Due to OpenCL limits, these DEFG broadcast statements merely cause the named buffers to be copied to the CPU's memory and made available, from there, to the other GPUs. As will be covered in our run-time performance discussion, this approach to broadcasting data to the other GPUs, unfortunately, brings about somewhat unimpressive performance.

Lines 21 and 23 show the execution of the two kernels, which was derived from the


01. declare application bfsdp2gpu
02. declare integer NODE_CNT ...
03. declare gpu gpugrp (all)
04. declare kernel bermanPrefixSumP1 bfsdp_kernelv3 ( [[1D,NODE_CNTt2]] )
05.         kernel bermanPrefixSumP2b bfsdp_kernelv3 ( [[1D,NODE_CNTt2]] )
06.         kernel getCellValue bfsdp_kernelv3 ( [[1D,1]] )
07.         kernel kernel1a2 bfsdp_kernelv3 ( [[1D,NODE_CNT]] )
08.         kernel kernel1b bfsdp_kernelv3 ( [[1D,EDGE_CNT]] )
09.         kernel kernel2 bfsdp_kernelv3 ( [[1D,NODE_CNT]] )
10. declare integer buffer graph_edges (EDGE_CNT) nonpartable ...
11. call init_input (graph_nodes(out) ...)
12. call ArrayPartition2GPU2 (graph_nodes(inout) ...)
13. // set misc. control variable calculations, e.g. KCNT
14. loop
15.   multi_exec run2 bermanPrefixSumP1 (offset(out) ...)
16.   sequence KCNT times
17.     multi_exec run2 bermanPrefixSumP2b (offset2(inout) ...)
18.   multi_exec run3 getCellValue (offset2(in)
19.     NODE_CNT0@0(in) NODE_CNT1@1(in)
20.     listused0@0(out) listused1@1(out))
21.   multi_exec s2 kernel1a2 (graph_nodes(in) ...)
22.   broadcast frontier0@0 ... broadcast payload1@1
23.   multi_exec s3 kernel1b (frontier0(in) ...)
24.   set STOP
25.   multi_exec s4 kernel2 (graph_nodes(in) ... STOP ...)
26. while STOP eq 1
27. call MergeCost2GPU2 (cost(inout) ...)
28. call disp_output (cost(in) ...)
29. end

Figure 5.13: BFSDP2GPU DEFG Pseudo Code

original BFS kernel1. Lines 14 and 26 utilized the same loop management approach as used in the original BFS DEFG code. The kernel executed on line 25 was the same kernel2 as used in BFS.

5.2.4 Approach to Research

The development of our DEFG-based BFSDP2GPU application was done with two main research aims in mind: to further show the viability of DEFG and to demonstrate a two-GPU, breadth-first search application that utilizes prefix sum for shared buffer management. In order to test our new BFS application within these aims, we compared the results and run-time performance of our original BFS application against our new two-GPU BFS application. We used selected graphs from the SNAP repository, and the Rodinia/OpenDwarfs Benchmark, in these comparisons [85, 88]. Our new BFSDP2GPU application produced correct results. We note that we compared


Table 5.6: Rodinia Graph Characteristics

  Graph   Nodes      Edges
  g65536  65,536     393,216
  g1M     1,000,000  6,001,836
  g2M     2,000,000  11,999,346
  g3M     3,000,000  18,003,170
  g4M     4,000,000  23,999,338
  g5M     5,000,000  29,995,814

breadth-first search results from the existing BFS application, over a subset of the Rodinia/OpenDwarfs graphs, with our new application. The results matched exactly. [12] Unfortunately, the performance of our new BFSDP2GPU application was not impressive relative to that of the BFS application.

5.2.5 Experimental Results

We first explore the BFSDP2GPU run-time performance using graphs from the Rodinia Benchmark. We then run this application with two graphs from the SNAP Package. Seeing similar run-time results from both sets of graphs, we then perform an analysis to find the explanations for the performance we observed.

5.2.5.1 Run-Time Performance Results from Rodinia Graph Data

These tests were run comparing the BFS and BFSDP2GPU run-time performance, using a set of the Rodinia BFS Benchmark graphs. The benchmark is provided with a number of generated test graphs, and we used the included tool, graphgenr.cpp, to generate additional large graphs. The 65,536-node graph, g65536, was included with the benchmark and the larger graphs were generated with the Rodinia tool. The graphs are listed in Table 5.6; it contains the name we gave each graph, and the number of nodes and edges. All of these are directed graphs.

Table 5.7 shows the average run times of three BFSDP2GPU runs for each of the

[12] The BFS application results had been previously compared with the original BFS results from the OpenDwarfs OpenCL software.


Table 5.7: Run Times of BFS Versus BFSDP2GPU, in Seconds

Graph    BFS Application  BFSDP2GPU Application  Slowdown Factor
g65536   0.008            0.140                  17.5
g1M      0.074            0.762                  10.3
g2M      0.158            2.345                  14.8
g3M      0.276            1.650                  6.0
g4M      0.377            2.490                  6.6
g5M      0.505            4.089                  8.1

Figure 5.14: BFS Versus BFSDP2GPU Run Times with Rodinia Graphs

graphs. As an example, the 0.008 in the first data row and second column means that the BFS application processed the g65536 graph in an average of 0.008 seconds. The BFS processing was started with the graph root set to the first graph node. These same run-time results are presented in bar-chart form in Figure 5.14. Given that our intention was for BFSDP2GPU to be a high-performance application, these results were disappointing. Our new application's run times ranged between 6 and 17.5 times those of the existing DEFG BFS application.


Table 5.8: SNAP Graph Characteristics

Graph             Nodes      Edges       Description
soc-Slashdot0811  77,360     905,468     Slashdot social network from November 2008
soc-LiveJournal1  4,847,571  68,993,773  LiveJournal online social network

Table 5.9: Run Times from SNAP Graphs, BFS Versus BFSDP2GPU, in Seconds

Graph             BFS Application  BFSDP2GPU Application  Slowdown Factor
soc-Slashdot0811  0.011            0.075                  7.1
soc-LiveJournal1  0.292            3.571                  12.2

5.2.5.2 Run-Time Results from Two SNAP Graphs

We performed a second set of performance tests, using the SNAP soc-LiveJournal1 and soc-Slashdot0811 graphs, to help determine if the application's slower-than-expected performance with the Rodinia data was perhaps due to some trait of the data. Two SNAP graphs were chosen because of their vastly different sizes and sources. Table 5.8 provides a detailed description of these directed graphs. We considered the second graph, soc-LiveJournal1, to be a good example of a VLI graph.

In Table 5.9, we observe similar poor performance as was observed with the Rodinia graphs. Rather than further note the performance of our new application with additional run-time testing results, we turn our attention to the reasons for the observed performance outcomes.

5.2.5.3 Run-Time Performance Analysis

The aforementioned Merrill article demonstrated good breadth-first search performance with multiple GPUs, using NVIDIA's CUDA [59]. Our decision to include the prefix sum-based buffer management in DEFG, as a design pattern, was largely based on the good performance shown by Merrill. Given the performance we saw with BFSDP2GPU, our task became finding explanations for the large performance gap between Merrill's results and ours.


After executing numerous additional performance tests and reviewing the designs of DEFG and the BFSDP2GPU application, we outline four potential causes for the unimpressive run-time results of our application: OpenCL's lack of direct GPU-to-GPU communications; DEFG's lack of support for variable-length buffers; the mixture of sparse-array and list-based data structures in the application; and the application's GPU work allocation method. We now explore each of these.

5.2.5.4 OpenCL-Provided GPU-to-GPU Communications Limitations

The DEFG broadcast statement is used to provide the GPU-to-GPU communications capability. Since OpenCL does not provide for actual GPU-to-GPU communications, the DEFG broadcast causes two major OpenCL API calls to be generated for each transfer. The first API call, clEnqueueReadBuffer, moves the buffer contents to the CPU and the second, clEnqueueWriteBuffer, delivers them to the requesting GPU.

The GPU-to-GPU communications capability provided by NVIDIA's CUDA uses the PCIe hardware bus for direct GPU-to-GPU communications and is presented as having attractive transfer rates [68]. CUDA tests run on the Hydra server showed CUDA inter-card transfer rates of 6.05 GB/second. With 6 GB/second as an actual baseline figure, we produced a DEFG-based OpenCL application, called DiagBR, to time the movement of buffers between the two Hydra GPUs, without any kernel processing involved. Our aim was to compare the CUDA and DEFG/OpenCL inter-card data transfer speeds on Hydra. DiagBR produced a transfer rate of around 286 MB/second. The CUDA rate was approximately 21 times the OpenCL rate.

With this fact established, we turned on the BFSDP2GPU application's DEFG logging so as to profile the run time consumed by the two broadcast statements' OpenCL API calls. When the application was rerun with the g1M graph, we found that a total of 0.590 seconds was used by the broadcasting. This is 76.6% of the total run time of 0.771 seconds. We note that this total run-time value, obtained


with logging enabled, is slightly different than the 0.762 value shown in Table 5.7; we attribute this difference to the run-time overhead added by the DEFG logging. It is clear that the lack of an OpenCL-provided GPU-to-GPU direct communications facility drastically impacts the application's run-time performance. This loss of performance is not due to an error in DEFG; instead, it is due to OpenCL's lack of direct GPU-to-GPU communications support.

5.2.5.5 DEFG Variable-Length Buffers

The performance numbers we have shown so far for BFSDP2GPU included a manual step of tuning the size of the application's broadcast-list buffers for each differently sized input graph. With the g1M graph, this broadcast-list buffer size was manually set to 2M elements; with the g6M graph it was set to 9M elements.

When a trial test run was done with a 9M size and the g1M graph, the average run time increased from 0.762 seconds to 2.634 seconds. This is a very noticeable change and it indicates how sensitive this application is to the broadcast-list buffer size. This DEFG performance issue can be handled, in the future, by enhancing DEFG to be able to declare that a given scalar variable holds the current buffer transfer size and using this variable's value in the calls to the OpenCL buffer movement functions. We note that in our run-time tests, even these manually tuned transfer buffer sizes overstate the buffer size, as they are sized for the largest lists used. Enhancing DEFG to utilize the transfer size of the actual list would prevent the slowdowns associated with this undesirable overstating of the volume of data to be transferred.

5.2.5.6 Data Structures and Threading Model

The data structure design and threading model used in this application have an impact on its run-time performance. If we assume that the DEFG broadcast processing


took no time at all, then an optimistic estimate of the BFSDP2GPU processing time for the g1M graph could be 24% of 0.762, or 0.183 seconds. The time for the BFS application processing the same graph was 0.074 seconds. Clearly, the slow broadcast performance does not, by itself, explain the entire performance issue.

In additional profiled executions, we measured that about 5% of the BFSDP2GPU run time was spent doing the ArrayPartition2GPU2 and MergeCost2GPU2 pre-processing and post-processing operations and another 3.5% was spent in the prefix sum operation. If we subtract the pre-processing and post-processing 5%, and the prefix sum processing of 3.5%, from the 0.183 seconds (0.183 − 0.085 × 0.762 ≈ 0.118), we are then left with an execution time of about 0.118 seconds. This 0.118-second figure is a crude estimate of the time needed to move the graph buffers to the GPUs, to populate the frontier data and to perform the actual BFS processing. This crude estimate exceeds the equivalent BFS run time of 0.074 by 0.044 seconds, nearly 60%. Some aspect of the performance difference is not yet exposed.

Unfortunately, the Linux timers we could use on Hydra are accurate only to milliseconds and many of the DEFG-logged times with this g1M graph dropped to the 1 ms level, and below. This limitation severely affected our ability to further analyze the performance differences. Nevertheless, in our opinion, the BFSDP2GPU use of two different data structure types and the one-node-per-thread work management model were not providing sufficiently good performance. The performance lost to the lack of OpenCL GPU-to-GPU direct communications significantly overshadowed other performance issues. However, the mixing of the dense-style array structure, for the BFS frontier and cost array, with the list-style structure, for the shared lists, also caused extra work to be done and, we suspect, a loss of performance. The use of the classic Harish GPU threading model may also be suspect, in terms of performance, because of the potential thread divergence issues, previously mentioned, and the unfortunate characteristic that threads assigned to nodes not on the frontier have


very little valid work to perform.

5.2.6 BFSDP2GPU Goals and Run-Time Performance

These explanations, and views, bring us back to the Merrill article and our previously established aims for this application and DEFG. The two main goals for BFSDP2GPU were: implement this reasonably complex application in DEFG, and utilize the DEFG Prefix-Allocation design pattern. We achieved both of these. In order to have Prefix-Allocation be a general design pattern in DEFG, it must be able to be used in an independent manner and not be highly integrated with a specific application or a certain GPU threading model. The Merrill work used a very sophisticated GPU-mapping approach, not the Harish GPU threading model. In hindsight, it appears the high level of performance observed with the Merrill work, compared to the Harish BFS approach, can only occur if the added prefix sum and marshaling overhead can be offset by a decrease in the run time needed to do the actual BFS processing. This was not the case with BFSDP2GPU. The performance of our BFSDP2GPU application did not meet expectations; however, it helped show the range of applications DEFG can handle and it helped us further understand the limits of OpenCL-based processing with multiple GPUs.

5.3 Application: Sorting Roughly Sorted Data

5.3.1 Problem Definition and Significance

General comparison sorting has an O(n log n) optimal performance bound. This performance level can be improved upon in specific cases; for example, when the sequence to be sorted is already partially sorted. We refer to this sorting of partially sorted data as roughly sorting. Formally, roughly sorting pertains to sorting nearly sorted sequences. In this context, a roughly sorted sequence is k-sorted if no sequence element is more than k steps out of sequence. With the sequence k-sorted, Altman,


et al. show that this sequence can be sorted with an O(n log k) performance bound [9, 10].

The sorting of sequences is a primitive operation in many algorithms and data-manipulation operations including binary searching, closest pair determination, uniqueness determination, identifying outliers, database acceleration, and data mining [22, 51, 41, 35]. Roughly sorted sequences can occur when a previously sorted sequence is perturbed with relatively minor changes [10]. An example of a roughly sorted sequence would be a list of all the cities in a region, sorted constantly by city population, where population updates are performed very frequently. The population of each city may change constantly, but it is unlikely that a city's position in the sorted list would significantly change in a very short interval of time.

We have developed GPU-based parallel roughly sorting in DEFG. The implementation uses the high-level algorithm outlined by Altman; it involves three distinct phases. The first phase is the determination of the degree of disorder in the sequence, that is, determining the k value. The second phase is the partitioning of the input sequence into fixed-length blocks, using the k value to size the blocks. In the third phase, these blocks are individually sorted, in parallel. When all of these blocks are sorted, the entire k-sorted sequence is sorted. The emphasis of this work is on phases one and two. We use the comb sort [76] to satisfy phase three. With k values less than 200[13], our GPU-based rough sort showed impressive run-time results compared to those of the standard CPU Quick Sort. Our rough sort shows increased performance when k is small relative to n, because O(n log k) is substantially less than O(n log n).

[13] The 200 value is an estimate. With some of the data sets we used this value was as high as 2,000.
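To make the k-sorted definition concrete, it can be checked in linear time: a sequence is k-sorted exactly when no prefix maximum exceeds any element more than k positions to its right. The following C++ sketch is our own illustration of such a checker; it is not part of RSORT:

```cpp
#include <vector>
#include <algorithm>

// Returns true if 'a' is k-sorted: no element is more than k
// positions out of sequence. Equivalently, for all i < j with
// j - i > k, a[i] <= a[j]. Checked in O(n) with a running prefix
// maximum against a precomputed suffix minimum.
bool isKSorted(const std::vector<int>& a, int k) {
    int n = (int)a.size();
    if (n == 0) return true;
    std::vector<int> sufMin(n);          // sufMin[i] = min(a[i..n-1])
    sufMin[n - 1] = a[n - 1];
    for (int i = n - 2; i >= 0; --i)
        sufMin[i] = std::min(a[i], sufMin[i + 1]);
    int preMax = a[0];
    for (int i = 0; i < n; ++i) {
        preMax = std::max(preMax, a[i]);
        // every element more than k positions to the right must not be smaller
        if (i + k + 1 < n && preMax > sufMin[i + k + 1]) return false;
    }
    return true;
}
```

With this checker, the block-reversed 16-item example shown later in Section 5.3.4 tests as 4-sorted but not 3-sorted.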


5.3.2 Related Work and GPU-based Sorts

Roughly sorting is an instance of an adaptive sorting algorithm. Estivill-Castro and Wood provide a dated survey of adaptive sorting algorithms [29]. Their work identifies a number of different measures of disorder. In a later work, Estivill-Castro identifies Dis and Max as the two most common measures of disorder used with adaptive sorting [19]. Dis is equivalent to the measure of k used here for roughly sorting.

The third phase of roughly sorting requires that sorting be performed on the individual blocks of roughly sequenced data. This means a GPU-based sort is required. Satish, et al. describe their CUDA-based radix and merge sorts where they claim a 2-4 times speedup over previous GPU-based sorts. They also claim it to be 23% faster than even a highly optimized CPU sorting routine [79]. Harris, et al. [40] describe the use of CUDA GPU-based prefix scan in radix-based parallel sorting, based on the prefix scan work by Blelloch [15]. Merrill and Grimshaw provide an in-depth summary of sorting on GPU architectures in their technical report [60].

As our work is OpenCL-based, and not CUDA-based, we could not directly use the Satish-produced sorts. We decided not to use a radix-based sort due to the limits associated with the format of the sort-key values and other issues [22]. The ubiquitous Quick Sort is not usable here, as OpenCL 1.1 does not support recursive calls in kernel code [22]. Therefore, for our work, we needed a different sort. We searched for a sort that was an in-place sort and was not recursive, and chose the comb sort [62, 76, 97].

The comb sort was first designed by Dobosiewicz [18] in 1980 and was later rediscovered by Lacey and Box in 1991 [76]. The Lacey article provides a good overview of the comb sort. The comb sort is a variant of the bubble sort, but with much-improved performance. The comb sort has outer and inner loops like the bubble sort; however, instead of comparing adjacent values, the comparisons are between elements some gap apart. The starting gap is set to the number of items to sort and at the start of each sorting iteration gap is set to: gap = gap/shrink. The shrink value is normally set to


1.3. Lacey discusses how the value of 1.3 is obtained [76]. The gap value, therefore, drops with each iteration. The iterations stop when the gap value is 1 and the just-completed iteration resulted in no comparison swaps. The comb sort is also used in the GPU work by Nagao and Mori [62]. They use the comb sort in their language analysis work for the same reasons we do: this sort has a sort-in-place design, avoids the use of recursion, and provides good performance.

5.3.3 Application Software Design

In this section, we describe our roughly sorting applications. We produced two DEFG implementations of roughly sorting; we call our single-GPU roughly sorting version RSORT and the other version is RSORTM, our multiple-GPU implementation.

The Altman roughly sorting approach specifies the LR, RL, and DM steps [10]. The pseudo code for steps LR, RL, and DM is given in Figure 5.15. After these steps are completed, a parallel sort is performed. We use the comb sort for this parallel sorting step. Our implementation of the RSORT application requires the DEFG-created code for the CPU side, the GPU-based comb sort kernel and the kernels for the LR, RL, and DM steps. In addition, a simple kernel is required to find the upper limit on the DM-supplied distance measures. We call this additional step UB. In total, five kernels are used by RSORT. The result of the DM step is an array of distances; this array's maximum value is the desired k value for the partially-sorted data.

The kernels for the LR, left-to-right maximum, and RL, right-to-left minimum, steps require parallel implementations. In a parallel-processing environment, these cannot be written as common serial for loops. We have implemented them as OpenCL kernels using parallel prefix scan. These two kernels are based on the Berman [14] prefix scan approach, which was also used in the DEFG Prefix-Allocation design pattern.
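For reference, the comb sort described above can be written serially as follows. This is a sketch of the classic Lacey and Box formulation, not the RSORT GPU kernel:

```cpp
#include <vector>
#include <utility>
#include <algorithm>

// Serial comb sort: a bubble sort with a shrinking comparison gap.
// The gap starts at n and is divided by the shrink factor (1.3) at
// the start of each pass; sorting ends when the gap is 1 and the
// just-completed pass made no swaps.
void combSort(std::vector<int>& a) {
    int n = (int)a.size();
    int gap = n;
    bool swapped = true;
    while (gap > 1 || swapped) {
        gap = (int)(gap / 1.3);        // shrink factor from Lacey and Box
        if (gap < 1) gap = 1;
        swapped = false;
        for (int i = 0; i + gap < n; ++i) {
            if (a[i] > a[i + gap]) {   // compare elements 'gap' apart
                std::swap(a[i], a[i + gap]);
                swapped = true;
            }
        }
    }
}
```

The GPU version used by RSORT runs many such sorts in parallel, one per sort group, each over its own block of the input.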


While describing this RSORT application, and unlike the previous DEFG application descriptions, we show the complete source code. No code is omitted. This tends to make the code figures larger, but it makes it possible to see how much can be accomplished with less than a hundred lines of DEFG code and the associated kernels. In addition, inclusion of the full source code provides solid examples of DEFG code-insertion morsels.

Figure 5.16 shows the LRmax and RLmin kernels. The LRmax kernel computed the running maximum, moving left to right, and the second kernel computed the running minimum, moving right to left. Both kernels functioned in a similar manner and we will summarize only the inner workings of LRmax. Lines 29 through 38 of

procedure LR(B[1..n]);
begin
  B[1] := a_1;
  for i := 2 to n do
    if B[i-1] < a_i
      then B[i] := a_i
      else B[i] := B[i-1];
end.

procedure RL(C[1..n]);
begin
  C[n] := a_n;
  for i := n-1 downto 1 do
    if C[i+1] > a_i
      then C[i] := a_i
      else C[i] := C[i+1];
end.

procedure DM(D[1..n]);
begin
  i := n;
  for j := n downto 1 do
    while i > 0 and C[i] <= B[j] and (j = 1 or C[i] >= B[j-1]) do
    begin
      D[i] := i - j;
      i := i - 1;
    end
end.

Figure 5.15: LR, RL, and DM Pseudo Code


01. __kernel void LRmax(__global int src[], __global int dst[], uint stride)
02. {
03.   uint block = get_global_id(0);
04.   uint size = get_global_size(0);
05.   uint arnold = size - stride;
06.   if (block >= arnold) return;
07.   uint js = block + stride;
08.   if (js >= size) return;
09.   int src_j_item = src[block];
10.   int src_js_item = src[js];
11.   if (block < stride) {             // copy already processed
12.     dst[block] = src_j_item;
13.   }
14.   if (src_js_item < src_j_item) {
15.     dst[js] = src_j_item;
16.   } else {
17.     dst[js] = src_js_item;
18.   }
19. }

01. __kernel void RLmin(__global int src[], __global int dst[], uint stride)
02. {
03.   uint block = get_global_id(0);
04.   uint size = get_global_size(0);
05.   uint js = block + stride;
06.   if (block >= size - stride) {     // copy already processed
07.     dst[block] = src[block];
08.     return;
09.   }
10.   if (src[js] > src[block]) {
11.     dst[js] = src[block];
12.   } else {
13.     dst[js] = src[js];
14.   }
15. }

Figure 5.16: LRmax and RLmin Kernels

the CPU-side DEFG code, in Figure 5.19, utilized this kernel. This DEFG CPU-side code provided the src, dst, and stride parameter values to the kernels; the value of stride began with one and was doubled on each successive iteration.

In Figure 5.16's LRmax kernel, lines 3 through 10 set up the index values used to access the src and dst arrays, obtained the values to be processed, and performed boundary checking. The following lines, 11 through 18, copied the already-processed values from the src to dst array and processed the current values. The RLmin kernel worked in a similar manner, putting items in a decreasing sequence while it moved right to left.

The DM and UB kernels are shown in Figure 5.17. The DM kernel used the maximum and minimum arrays, produced by the previous two kernels, to establish


the distance that each value, to be sorted, was out of sequence. The for loop used in the DM algorithm, shown in Figure 5.15, was replaced by the GPU execution of this function at line 49 of Figure 5.19. The while loop in the DM algorithm was replaced by the kernel's for and if statements on lines 04 and 05.[14] The second kernel in Figure 5.17 was straightforward; its function was to determine if any value in the D array was larger than the scalar value d. If any element was larger, the again value was set to 1. The value of d, the radius in the DEFG code, was started at one and doubled with each additional invocation. The purpose of kernel UB was to quickly establish an upper bound on the maximum value in the D array.

Next, we discuss the DEFG program that was used to drive these kernels. Altman describes the overall roughly sorting process in three phases and we will describe our DEFG code in terms of these three phases. Our DEFG code is listed in two figures, 5.18 and 5.19. The DEFG declarations used by our implementation are shown in Figure 5.18. We note that five kernels were declared in lines 14 through 18 and that six equal-sized buffers were declared in lines 19 through 24. We also note that

[14] Algorithm DM is deceptively complex and converting it to an OpenCL kernel was interesting and challenging.

01. __kernel void DM(__global int B[], __global int C[], __global int D[])
02. {
03.   int i = get_global_id(0);
04.   for (int j = i; j >= 0; j--) {
05.     if (j <= i && i >= 0 && C[i] <= B[j] && (j == 0 || C[i] >= B[j-1])) {
06.       D[i] = i - j;
07.       break;
08.     }
09.   }
10. }

01. __kernel void UB(__global int D[], uint size, int d, __global uint *again)
02. {
03.   if (*again == 1) return;
04.   int i = get_global_id(0);
05.   if (D[i] <= d) {
06.     // good
07.   } else {
08.     *again = 1;
09.   }
10. }

Figure 5.17: DM and UB Kernels


01. declare application RSort
02. declare integer stride
03.   integer size
04.   integer sizeDB
05.   integer genK
06.   integer bufSize
07.   integer radius
08.   integer groups
09.   integer again
10.   integer offset
11.   integer offset2
12.   integer logSize
13. declare gpu gpuone *
14. declare kernel LRmax RSort_Kernels [[1D, size]]
15.   kernel RLmin RSort_Kernels [[1D, size]]
16.   kernel DM RSort_Kernels [[1D, size]]
17.   kernel UB RSort_Kernels [[1D, size]]
18.   kernel comb_sort RSort_Kernels [[1D, groups]]
19. declare integer buffer arrayS (bufSize)
20.   integer buffer LR (bufSize)
21.   integer buffer LRout (bufSize)
22.   integer buffer RL (bufSize)
23.   integer buffer RLout (bufSize)
24.   integer buffer DMbuf (bufSize)

Figure 5.18: RSORT DEFG Declare Statements

the comb_sort kernel was declared somewhat differently concerning the work-group size. The work-group dimension, or width, is the number following the "1D," within the double brackets. The first four kernels had a work-group width of size and the comb_sort kernel had a work-group width of groups. The size variable contained the number of items to be sorted and the groups variable contained the number of parallel sorts to be done by the comb sort. Given a fixed number of items to sort, the larger the value of groups, compared to size, the better this application performs, because a larger number of parallel smaller sort processes complete in less time than a smaller number of larger parallel sort processes. There was one GPU thread allocated to each sort group and, hence, each comb_sort instance.

We continue by describing the first-phase processing. The LRmax kernel was driven by lines 29 through 38 of the DEFG code, shown in Figure 5.19. The kernel was called once from line 30 to start the prefix-maximum processing and to move the results to the LR array. The loop shown in lines 33 through 38 continued iterating the kernel, with the stride being increased on each iteration. Looping was terminated


when the again value exceeded log2 of the items to be sorted.

The purpose of line 36, containing the DEFG interchange statement, requires some explanation. The LRmax kernel does not use atomic locking or synchronization. It strictly read data from the LR array and wrote to the LRout array. The purpose of the interchange statement was to perform a high-performance swap of the two arrays.

Lines 39 through 48 functioned similarly to lines 29 through 38, except the RLmin kernel was used. The DM kernel was executed from line 49. The resulting DMbuf array was processed by lines 50 to 54 to obtain an upper-bound radius value.

The second phase consisted of using the new radius value to determine the size of the sorting blocks and to test various special conditions. These operations, implemented with DEFG morsels, are shown in lines 55 through 57. With the groups value having been set in line 57, the third processing phase could begin.

Phase three consisted of two calls to the comb_sort kernel. This kernel was first called on line 58 and then again on line 61. The code on lines 59 and 60 did offset the array to be sorted by a radius value and lowered the groups value by one. This action overlapped the previously-sorted data with new sort blocks and was done to re-sort the previously sorted groups. This subtle step made it possible for out-of-sequence items to be moved between the original sort groups.

Line 63 called the putMergeArray function to write the sorted data to disk. The call to the sync function[15] informed the DEFG optimizer to transfer the updated arrayS array contents to the CPU and the DEFG code morsel on line 63 performed the call to output the results to the sorted.txt file. This function was called from a morsel, and not a DEFG call statement, because DEFG does not directly support string literals. Strings are outside the proper domain of DEFG.

[15] The sync function is made available in the defg_loader.h header file.
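For clarity, the work performed by the LR, RL, and DM kernels, taken together, can be expressed as one serial function that returns the k value. This is our own CPU illustration of the logic in Figures 5.15 and 5.17 (with the DM inner loop written as in the DM kernel); it is not the DEFG-generated code:

```cpp
#include <vector>
#include <algorithm>

// Serial equivalent of the LR, RL, and DM stages: returns the k value,
// the maximum distance any element is out of sequence.
int computeK(const std::vector<int>& a) {
    int n = (int)a.size();
    if (n == 0) return 0;
    std::vector<int> B(n), C(n), D(n, 0);
    // LR: running left-to-right maximum
    B[0] = a[0];
    for (int i = 1; i < n; ++i) B[i] = std::max(B[i - 1], a[i]);
    // RL: running right-to-left minimum
    C[n - 1] = a[n - 1];
    for (int i = n - 2; i >= 0; --i) C[i] = std::min(C[i + 1], a[i]);
    // DM: per-element displacement, as in the DM kernel's for/if pair
    for (int i = 0; i < n; ++i) {
        for (int j = i; j >= 0; --j) {
            if (C[i] <= B[j] && (j == 0 || C[i] >= B[j - 1])) {
                D[i] = i - j;
                break;
            }
        }
    }
    // UB's job, done directly here: the maximum of the D array
    return *std::max_element(D.begin(), D.end());
}
```

On the GPU, LR and RL are computed with the stride-doubling prefix scan described above, and UB replaces the final maximum with repeated doubling of a candidate radius.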


25. code [[ char *arg = "16"; if (argc > 1) { arg = argv[1]; } ]]
26. code [[ if (argc > 2) { size = (int)pow(2.0, (double)atoi(argv[2])); } ]]
27. code [[ getArray(arg, arrayS, size); bufSize = size; ]]
28. code [[ logSize = (int)(log((double)size) / log(2.0)); ]]
29. set stride                                     // start LR processing
30. execute LR1 LRmax arrayS(in) LRout(out) stride(in)
31. call times2 stride(*)
32. set again
33. loop
34.   execute LR2 LRmax LR(inout) LRout(out) stride(in)
35.   call times2 stride(*)
36.   interchange LR LRout
37.   code [[ again++; ]]
38. while again lt logSize
39. set stride                                     // start RL processing
40. execute RL1 RLmin arrayS(in) RLout(out) stride(in)
41. call times2 stride(*)
42. set again
43. loop
44.   execute RL2 RLmin RL(in) RLout(out) stride(in)
45.   call times2 stride(*)
46.   interchange RL RLout
47.   code [[ again++; ]]
48. while again lt logSize
49. execute DM1 DM LR(in) RL(in) DMbuf(out)        // DM processing
50. loop                                           // start UB processing
51.   set again
52.   call times2 radius(*)
53.   execute UB1 UB DMbuf(in) size(in) radius(in) again(inout)
54. while again ne 0
55. code [[ radius *= 2; ]]                        // determine block sizes
56. code [[ if (radius > size) { radius = size; } ]]
57. code [[ groups = (int)ceil((double)size / (double)radius); ]]
58. execute SORT1 comb_sort arrayS(inout) radius(in) offset(in) groups(in)
59. code [[ offset2 = radius / 2; ]]
60. call dec groups(*)
61. execute SORT2 comb_sort arrayS(inout) radius(in) offset2(in) groups(in)
62. call sync arrayS(in)
63. code [[ putMergeArray("sorted.txt", arrayS, size); ]]
64. end

Figure 5.19: RSORT DEFG Executable Statements

5.3.4 Experimental Results

5.3.4.1 Approach to Experimentation

In this section, we compare the RSORT and RSORTM run-time performance relative to the performance of the ubiquitous CPU-based Linux Quick Sort. We used Quick Sort as our sort run-time performance baseline. We considered several other options for the baseline sort. The AMD Application SDK [1] includes a GPU sorting example, but this is based on a radix-style sorting algorithm. This sort was rejected, because we did not want to work around the limits of radix sorting [22]. As noted in Section


5.3.2 on related work, most available GPU-based sorts are written to use CUDA and not OpenCL. Therefore, we settled on using the ubiquitous Linux CPU-based Quick Sort[16] as our baseline sort. Quick Sort is a commonly used general-purpose sort and we view it as a valid measurement "yardstick."

We found it difficult to locate usable, real-world, roughly sorted data for our performance testing. Our solution to this problem was to write a tool to artificially create the needed sorting test cases. The tool was given the desired radius value and the number of items to be generated; it then generated the requested data. The numbers in the generated data were integers between one and the number of desired items, these numbers having been perturbed to achieve the requested radius. The resultant data could either have one perturbation or have each segment of size radius + 1 items perturbed, depending on what was requested. A highly perturbed data set of 16 items with a radius of 4 would contain:

5 4 3 2 1 10 9 8 7 6 15 14 13 12 11 16

We most often used test data that was highly perturbed, and noted when the run-time performance test being discussed was based on minimally perturbed data. The data sets, of test data, were named with the radius, which we called a "k value," and the size. For our tests with large numbers of items, the size was given as a power of 2. We performed a number of different run-time tests. The data sets used had either 2^23, 2^26, or 2^27 elements.

5.3.4.2 Single-GPU Performance

Single-GPU Experiment One

Table 5.10 shows the run-time comparison results from executing our RSORT application against Quick Sort (QSORT). These were executed on Hydra and the numbers

[16] By this, we mean using the qsort function from the Linux C/C++ run-time library.


Figure 5.20: Plot of Sort Run Times for 2^23 (8,388,608) Items

shown are the average seconds computed from three runs, performed for each combination of k value, perturbation, and sort type. For example, the literal "K:10" means the performance tests for the corresponding table line were done with data generated using the radius (k) value set to 10. In the column headings, "MaxP" denotes highly perturbed test data and "OneP" denotes data with a single perturbation. "Qsort" corresponds to Quick Sort and "Rsort" to our RSORT application. The data used to generate these results was of size 2^23, or 8,388,608 items. A plot of these results is shown in Figure 5.20.
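The highly perturbed data sets described above follow a simple block-reversal scheme, which can be sketched as follows. This is a hypothetical reconstruction of our generator tool's highly perturbed mode, with the function name makePerturbed chosen for illustration:

```cpp
#include <vector>
#include <algorithm>
#include <numeric>

// Generate a highly perturbed test sequence: the values 1..n with
// every full segment of (radius + 1) consecutive items reversed, so
// that no item ends up more than 'radius' positions out of place.
std::vector<int> makePerturbed(int n, int radius) {
    std::vector<int> a(n);
    std::iota(a.begin(), a.end(), 1);              // 1, 2, ..., n
    int block = radius + 1;
    for (int start = 0; start + block <= n; start += block)
        std::reverse(a.begin() + start, a.begin() + start + block);
    return a;                                      // trailing partial block untouched
}
```

For 16 items and a radius of 4 this produces the example sequence shown in Section 5.3.4.1: 5 4 3 2 1 10 9 8 7 6 15 14 13 12 11 16.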


Table 5.10: Run Times, in Seconds, for Sorting 2^23 (8,388,608) Items

Gen. K  Qsort  Rsort  Qsort  Rsort
Value   MaxP   MaxP   OneP   OneP
K:10    0.948  0.329  0.822  0.285
K:100   0.919  0.734  0.823  0.619
K:200   0.912  0.998  0.823  0.840
K:300   0.910  0.842  0.822  0.678
K:400   0.904  0.886  0.821  0.678
K:500   0.904  0.919  0.821  0.678
K:600   0.904  1.443  0.820  1.195
K:700   0.903  1.470  0.821  1.202
K:800   0.900  1.496  0.822  1.205
K:900   0.900  1.528  0.821  1.200
K:1000  0.901  1.558  0.822  1.196

The blue and green lines show the run times for QSORT; blue for the highly perturbed data and green for minimally perturbed data. The results for these two are similar.

The red and purple lines show the run times for RSORT. The red line is for the highly perturbed data and purple is for the minimally perturbed data. The RSORT run-time performance was better up to K:200 and again at K:300, K:400 and one-perturbation K:500. We note the bump in the red and purple lines at K:200. Our suspicion is that this "bump" was due to the non-linear sorting performance of the comb sort. A simple, serial comb sort test was separately executed, entirely on a CPU. We performed the comb sort over differently-sized, highly-perturbed number sequences; this showed that the ratio of sorting-comparisons count over items sorted was not monotonic. Although the comb sort showed good performance, the comb sort's run-time performance was not linearly related to the number of items sorted.

After the K:300 to K:500 range, the QSORT performance remained superior. We note that the RSORT maximally perturbed results, the red line, climbed slightly from K:600 onwards and the RSORT minimally perturbed data, the purple line, stays rather flat. This behavior was not a surprise. The RSORT maximally perturbed data was much more out of sequence than the minimally perturbed


data; the parallel comb sorts had significantly more work to perform.

Perhaps a more interesting observation is that the purple line stays flat from K:300 to K:500 and then flat again from K:600 onwards. This behavior was associated with the UB kernel's upper-bounds processing and its search for an upper bound that was a power of 2. The RSORT executions performed for K:300, K:400, and K:500 were done with a radius of 512; the K:600 to K:1000 range was performed with 1,024. With these data sets of 2^23 items, this testing showed that roughly sorting is faster, for the k values tested, up to 200.

Single-GPU Experiment Two

We ran an additional set of tests using just the highly perturbed data, over a larger range of k values, and on two servers: Hydra and Rabbit. The hardware and software specifications for Hydra and Rabbit are described in the Appendix, Section B.1. The results of this performance testing are listed in Table 5.11 and shown graphically in Figure 5.21; the blue and green lines correspond to the Qsort processing; the red and purple lines correspond to Rsort. Both sort applications were run three times on each server and averages computed. Our earliest performance tests with these higher k values were executed only on Hydra. However, we could find absolutely no reasonable explanation for the substantial jumps in Hydra's run times at K:1100 and K:2100. After looking at low-level logs and profiling data, our next step was to add the Rabbit server to the testing environment; we wanted to observe these same substantial jumps on a second server. On Rabbit, they did not occur.

We note that the jumps tend to occur at the powers of 2: 512, 1,024 and 2,048; we also know that these were the radii values produced by the UB kernel and that they were used in the allocation and sizing of the application's sorting blocks. Therefore, these radii values directly impacted the size of the OpenCL work-groups. In order to help further understand this, we again engaged the DEFG logging of the major OpenCL API calls and executed additional tests. We processed this log data into


Figure 5.21: Two Server Plot of Sort Run Times with 2^23 Items

summarized profile data. The kernel execution steps are shown in Table 5.12 and graphed in Figure 5.22. The run times of the sorting step, the comb_sort execution, for Hydra at the k value of 2100 had clearly increased. Whereas the Rabbit run-time increase from K:2000 to K:2100 was times 1.82, the Hydra increase was times 3.44. At a k of 2100, the two servers were clearly behaving very differently, in terms of run times.

Our basic explanation for this difference is that the two servers are using different GPU technologies. The Hydra server used older NVIDIA GPUs and an older OpenCL driver. The Rabbit server used smaller, but newer, AMD GPU cards with current OpenCL drivers. We suspect that the older NVIDIA OpenCL driver, present on Hydra, was not organizing its local GPU work-groups as well as the newer AMD OpenCL driver used on Rabbit. With the multiple-GPU testing, where larger


data sets were used, we observed that this Hydra anomaly was not present. We suspect that we encountered a "bug" in the NVIDIA OpenCL driver.

Before discussing the multiple-GPU performance, we note that we analyzed the green-line upward bumps for the Rabbit server at K:1400, K:1900, and K:2100. We attributed them to a lack of CPU memory resources. The Rabbit server had a rather small CPU, with limited RAM, and two mid-range GPU cards. Our suspicion was that "garbage collection" of CPU memory was occurring at these "bump" times.

Table 5.11: Two Server Run Times with 2^23 Items, in Seconds

Gen. K  Hydra  Hydra  Rabbit  Rabbit
Value   Qsort  Rsort  Qsort   Rsort
K:10    0.948  0.329  1.229   0.177
K:100   0.919  0.734  1.232   0.557
K:200   0.912  0.998  1.231   0.823
K:300   0.910  0.842  1.220   0.661
K:400   0.904  0.886  1.222   0.707
K:500   0.904  0.919  1.233   0.715
K:600   0.904  1.443  1.222   0.689
K:700   0.903  1.470  1.223   0.690
K:800   0.900  1.496  1.215   0.719
K:900   0.900  1.528  1.213   0.724
K:1000  0.901  1.558  1.208   0.747
K:1100  0.900  2.221  1.209   0.995
K:1200  0.899  2.249  1.213   1.061
K:1300  0.898  2.280  1.226   1.012
K:1400  0.898  2.317  1.404   1.059
K:1500  0.898  2.335  1.212   1.059
K:1600  0.898  2.363  1.207   1.107
K:1700  0.898  2.393  1.228   1.076
K:1800  0.895  2.421  1.203   1.128
K:1900  0.895  2.442  1.408   1.100
K:2000  0.897  2.475  1.202   1.133
K:2100  0.895  6.845  1.506   1.965
K:2200  0.897  6.868  1.202   2.040

5.3.4.3 Multiple-GPU Performance

As we have previously mentioned, our view is that the DEFG provision of multiple-GPU support is valuable. We, therefore, added this capability to RSORT, forming


Figure 5.22: Plot of Run-Time Breakout with 2^23 Items

Table 5.12: Run-Time Breakout with 2^23 Items, in Seconds

Server  Data    LR     RL     DM     UB     Sort
Hydra   K:2000  0.029  0.028  0.582  0.110  1.782
Hydra   K:2100  0.280  0.280  0.607  0.100  6.127
Rabbit  K:2000  0.025  0.025  0.223  0.010  0.746
Rabbit  K:2100  0.018  0.028  0.256  0.004  1.355

RSORTM. The use of two GPUs with roughly sorting allowed us to obtain sorted results faster and to sort larger data sets. The multiple-GPU support was implemented using the Divide-Process-Merge design pattern.

Figure 5.23 shows a very abbreviated listing of the DEFG code used in RSORTM. RSORTM is built upon RSORT; most of the lines of code shown in this figure are additions to the code shown in Figures 5.18 and 5.19. Line 22 selected multiple GPUs, lines 27 and 28 declared the additional kernels used in the UB loop, and line 36 declared the extra buffer also used in the UB loop processing. Lines 46 through 69 were a repeat of the original left-to-right maximum code with the DEFG execute statements replaced by multi_exec statements.

Continuing with lines 71 to 77, we observe that the UB processing in RSORTM


differs from the approach used in RSORT. Here, we encounter a small DEFG design restriction. The DEFG approach used to split buffers into equal portions, with a portion given to each GPU, does not work with scalar variables. We cannot reasonably split a scalar variable in half. The upshot of this shortcoming was that the again variable used to manage this loop had to be set with the values returned from the againPart buffer. The morsel code at line 76 combined the two values from the buffer. The UB kernel, used in RSORT, was replaced by the new UBreset and UBsplit kernels. Kernel UBreset did reset the buffer data and UBsplit compared the radius upper bound with the DMbuf contents. The OpenCL code for these simple kernels is shown in the Appendix, Section B.4.

The putMergeArray function, not shown in this abbreviated listing, was called to copy the two sorted data segments to disk. They were merged into one sorted segment as they were copied. This was the merge step inherent in the DEFG Divide-Process-Merge design pattern.

Next, we compare the performance of RSORTM against both RSORT and QSORT. The testing was performed on Hydra using test data files of 2^26 and 2^27 sort items.[17] Tables 5.13 and 5.14 show the testing results and these are shown graphically in Figures 5.24 and 5.25. We note that the unexpected drop in Hydra performance seen in the previous testing was not present with these larger test data sets.

Looking at Figure 5.24, we note the RSORTM green line is always below the RSORT red line. With this test data, RSORTM was consistently faster than RSORT. We also note that the RSORTM performance was faster than QSORT up to a k value of 2000. The addition of the second GPU made roughly sorting the faster solution for a wider range of k values. Looking back at the image filtering applications discussed previously, and comparing the applications, RSORTM clearly has enough work to utilize both of the GPUs.

[17] The Rabbit server was not able to handle data files of this large size, due to disk and memory size limits.
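The merge inherent in the Divide-Process-Merge pattern is a standard two-way merge of sorted runs. The following sketch shows the in-memory merging that putMergeArray performs while copying; it is illustrative only, as the actual function also writes the result to disk:

```cpp
#include <vector>
#include <algorithm>

// Merge two independently sorted segments (one per GPU) into a
// single sorted result, as in the Divide-Process-Merge merge step.
std::vector<int> mergeSegments(const std::vector<int>& lo,
                               const std::vector<int>& hi) {
    std::vector<int> out(lo.size() + hi.size());
    std::merge(lo.begin(), lo.end(), hi.begin(), hi.end(), out.begin());
    return out;
}
```

Because each GPU sorts its half of arrayS completely, this single linear-time pass is all that remains on the CPU side.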


...
04. declare application RSortm
...
22. declare gpu gpugrp (all)
...
27. kernel UBsplit RSort_Kernels [[ 1D,size ]]
28. kernel UBreset RSort_Kernels [[ 1D,size ]]
...
36. integer buffer againPart (againSize)
...
46. multi_exec LR1 LRmax arrayS(in) LR(out) stride(in)
...
50. loop
51. multi_exec LR2 LRmax LR(inout) LRout(out) stride(in)
...
55. while again lt logSize
...
69. multi_exec DM1 DM LR(in) RL(in) DMbuf(out)
...
71. loop
72. multi_exec UB1 UBreset againPart(inout)
73. call times2 radius(inout)
74. multi_exec UB2 UBsplit DMbuf(in) againPart(inout) size(in) radius(in)
75. call sync againPart(in)
76. code [[ again = againPart[0] + againPart[2]; ]]
77. while again ne 0
...
80. code [[ if (groups < 2) { printf("sort end, too few sort groups, ... ]]
...
82. code [[ groupsMulti = groups / DEFG_GPU_COUNT; ]]
83. multi_exec SORT1 comb_sort arrayS(inout) ... groupsMulti(in)
...
88. multi_exec SORT2 comb_sort arrayS(inout) ... groupsMulti(in)
...
91. end

Figure 5.23: Abbreviated RSORT DEFG Executable Statements

Due to lack of memory on a single Hydra GPU card, RSORT cannot be executed with the 2^27-sized data set. RSORTM, with access to twice as much GPU memory, was able to process this large data set. The performance results are presented in Figure 5.25. We observe that the cross-over point with this data set was similar to the 2^26 data set; the performance crossover was at the k value of 2000.

5.3.4.4 RSORT Run-Time Performance

As expected, using our artificially generated, partially sorted data sets, RSORT was shown to produce good performance results, relative to CPU-based QSORT, when the k-value was approximately 100, or less. This high level of performance occurred
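The merge step of the Divide-Process-Merge pattern can be sketched in a few lines. putMergeArray itself writes the merged result to disk; this in-memory version, under the hypothetical name mergeSegments, only illustrates the merging of the two independently sorted segments.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Merge two independently sorted segments (one produced per GPU) into
// one fully sorted sequence -- the "Merge" step of Divide-Process-Merge.
// Illustrative stand-in for DEFG's putMergeArray, which also writes the
// merged output to disk as it copies.
std::vector<int> mergeSegments(const std::vector<int>& left,
                               const std::vector<int>& right) {
    std::vector<int> merged(left.size() + right.size());
    std::merge(left.begin(), left.end(),
               right.begin(), right.end(), merged.begin());
    return merged;
}
```

Because each segment is already sorted, the merge is a single linear pass, so the CPU-side cost of the final step stays small relative to the GPU sorting work.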


with both the single-GPU and multiple-GPU RSORT usage modes; when more than a single GPU was used by RSORT, sorted results were produced even more quickly. The use of multiple GPUs also permitted the sorting of larger data sets. For the vast majority of our run-time tests, highly perturbed data sets were used. For this reason, our view is that most other data sets, with equal k values, would be likely to experience faster performance, compared to our generated data sets. For k values over about 100, we suggest that the size and distribution of the data must be taken into consideration when looking at the general use of roughly sorting.

Figure 5.24: Plot of Sort Run Times with 2^26 Items

Table 5.13: Sort Run Times on Hydra with 2^26 Items, in Seconds

Program  Gen K:  Gen K:  Gen K:  Gen K:  Gen K:
Name     10      1000    2000    4000    8000
Qsort    8.394   8.008   7.972   7.922   7.890
Rsort    2.527   11.216  15.360  17.120  29.556
RsortM   1.459   6.487   7.400   11.189  24.682

Table 5.14: Sort Run Times on Hydra with 2^27 Items, in Seconds

Program  Gen K:  Gen K:  Gen K:  Gen K:  Gen K:
Name     10      1000    2000    4000    8000
Qsort    17.317  16.539  16.460  16.389  16.318
RsortM   2.912   11.896  16.447  19.613  31.243


Figure 5.25: Plot of Sort Run Times with 2^27 Items

5.4 Application: Altman Method of Matrix Inversion

5.4.1 Problem Definition and Significance

Matrix inversion is a well-studied problem with application to many endeavors, including the solution of simultaneous equations in Engineering, Economics, and Chemistry [72, 13, 5, 74]. Further, simultaneous equations need to be solved in real-time robotics and these solutions are required in a limited time interval [46]. The use of iterative matrix inversion, in the context of anytime algorithms [77], has the potential to supply the solutions for simultaneous equations in real-time environments. The use of simultaneous equations in robotics is demonstrated in numerous works [28, 44, 63].

There are many methods for inverting matrices. One such method is M. Altman's iterative method; it has the potential for inverting matrices with high degrees of adaptability, scalability and robustness, due to the iterative nature of its design [7, 8]. Iterative inversion consists of making an initial estimated inversion and then performing additional algebraic operations to improve the quality of the inversion. In this work, we use the notion of anytime algorithms to manage our GPU-based Altman inversion processing. Anytime algorithms use a well-defined quality measure


to monitor the solution progress and then allocate resources effectively [101]. This anytime algorithm approach makes it possible to offer a tradeoff between solution quality and computational time [36].

Assume we have a matrix A and we wish to find its inverse, R. The Altman method requires we have an R_0 such that ||I - AR_0|| < 1, and then we iteratively apply R_{n+1} = R_n(3I - 3AR_n + (AR_n)^2). We implemented a GPU-based anytime algorithm version of Altman's iterative method using DEFG and OpenCL, and then performed an analysis of the inversion results.

Our efforts had a focus on DEFG and the demonstration of DEFG's capabilities. As such, we provided a DEFG-based, GPU implementation of the iterative Altman numerical method, with added anytime processing. This was done to help show the range of problems and applications handled by DEFG. We did not explore the numerical intricacies of the actual Altman method. Matrix multiplication was used as a building block, and an existing OpenCL matrix multiplication facility, clMath [2], was used to satisfy our requirement for a high-speed matrix multiplication capability.

This Altman inversion method has some interesting adaptive characteristics [7, 8]. It is tolerant of errors in computation and initial settings. The inversion results are dependent on the initial R_0 and on the number of iterations performed. More iterations tend to compensate for a poorly set R_0 or poorly computed R_n. If R_0 is not otherwise available, it can be formed with R_0 = alpha * I, where one possible value of alpha is 1/||A||, with ||A|| being the Euclidean norm of A. Different values of alpha lead to different convergence characteristics. We use the Euclidean norm of I - AR_{n+1} as our inversion quality measure. This scalar value decreases as the quality of the inversion increases. These characteristics provide the ability to use an anytime algorithm to limit the number of iterations, enabling the desirable option to choose between solution quality and computational time.
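The iteration can be exercised on the CPU with ordinary dense-matrix arithmetic. The sketch below, with hypothetical helper names, applies the update R_{n+1} = R_n(3I - 3AR_n + (AR_n)^2) and the Euclidean-norm quality measure to a small matrix; it is illustrative only and is unrelated to the clMath-based GPU code described later.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Minimal helpers for an N x N row-major dense matrix.
using Mat = std::vector<double>;

Mat matmul(const Mat& a, const Mat& b, int n) {
    Mat c(n * n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// Euclidean norm of I - A*R, the inversion quality measure.
double residualNorm(const Mat& a, const Mat& r, int n) {
    Mat ar = matmul(a, r, n);
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double v = (i == j ? 1.0 : 0.0) - ar[i * n + j];
            sum += v * v;
        }
    return std::sqrt(sum);
}

// One Altman step: R <- R * (3I - 3AR + (AR)^2).
Mat altmanStep(const Mat& a, const Mat& r, int n) {
    Mat ar = matmul(a, r, n);
    Mat ar2 = matmul(ar, ar, n);
    Mat w(n * n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            w[i * n + j] = (i == j ? 3.0 : 0.0)
                           - 3.0 * ar[i * n + j] + ar2[i * n + j];
    return matmul(r, w, n);
}
```

Starting from R_0 = alpha * I with alpha = 1/||A||, the residual norm decreases monotonically (cubically, in fact) whenever the starting condition ||I - AR_0|| < 1 holds, which is what makes the iteration suitable for anytime control.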


5.4.2 Related Work

5.4.2.1 Anytime Algorithms

Boddy and Dean coined the term "anytime algorithm" in their work on time-dependent planning [17]. They describe anytime algorithms as "algorithms that can be interrupted at any point to supply an answer whose quality increases with increasing computational time." The work of Zilberstein brings the characteristics of anytime algorithms into clearer focus with his list of desirable anytime traits: measurable quality, recognizable quality, monotonicity, consistency, diminishing returns, interruptibility, and preemptibility. Zilberstein makes an additional point, perhaps best given as a direct quote: "Anytime computation extends the traditional notion of computational procedure by allowing it to return many possible approximate answers to any given input" [101].

Anytime algorithms have previously been implemented on GPUs by Sab and Manharam [77]. They demonstrate the use of their PAP* parallel search algorithm in which they compose a pool of CUDA kernels in advance and then at run time, depending on the load and quality-level goals, select an appropriate set of kernels to launch.

5.4.2.2 Iterative Approach

This iterative approach distinguishes itself from direct, decomposition inversion approaches such as Cholesky Decomposition and Gauss-Jordan by producing useful intermediate inversion results [52, 90]. The iterative approach used here was outlined by M. Altman in his work: An optimum cubically convergent iterative method of inverting a linear bounded operator in Hilbert space [7]. A few years later, Petryshyn improved the error estimates for the M. Altman method [73]. More recent works in this area include the Stanimirovic article: Self-correcting iterative methods for


computing {2}-inverses [89]. Stanimirovic pointed out that this class of iterative inversion algorithms will self-correct only if the matrix is in fact invertible. (A fact that we can attest to after accidentally passing a singular matrix to our first GPU version of the Altman method!)

5.4.2.3 GPU-based High-Performance Matrix Libraries

The Altman algorithm requires numerous matrix operations, including matrix addition, subtraction, and multiplication. We used an existing GPU high-performance library: clMath, formerly called APPML [4]. clMath is the Accelerated Parallel Processing Math Libraries, which contains the Basic Linear Algebra Subprograms (BLAS) and the Fast Fourier Transform (FFT) functions. It is written for OpenCL and designed to run on AMD GPUs [2] and can be used with OpenCL GPUs from other vendors. In 2013, AMD re-branded APPML as "clMath" and converted it to an Open Source product. NVIDIA supplies cuBLAS. cuBLAS is the NVIDIA Basic Linear Algebra Subroutines library for use with their hardware. Unfortunately, cuBLAS is not compatible with OpenCL. This is unfortunate since cuBLAS has been available longer than clMath, is vendor supported, and is likely a more stable and mature linear algebra library.

5.4.3 Application Software Design

5.4.3.1 The Implementation of Altman's Method in DEFG

Our DEFG implementation of Altman's iterative method of matrix inversion has gone through a number of versions. Each version was verified to be sure that correct inversion results were produced. (The results were verified by comparing the matrix inversion results from our IMI applications with corresponding MATLAB inversion results.) In our earlier versions, we used the clMath capabilities for all matrix operations. Testing, and periodic reviews of our IMI versions, revealed that our early DEFG IMI source code was not easily read and visualized by


the developer, and the general application performance was somewhat slower than we had expected. Because of these issues, we changed our IMI application design and implementation. Our final IMI version, called IMIFLX, used the clMath library only when absolutely needed. For the simpler operations, such as establishing the identity matrix or multiplying a matrix by a constant, we used a different approach. The new approach used specialized GPU kernels to multiply a matrix by a scalar and used CPU-side DEFG morsels to initialize matrices.

The final version, IMIFLX, has the "FLX" name suffix because of the flexible manner in which it obtained the square matrix to be inverted. The matrix was loaded from an input file or was internally generated. The test matrices generated by IMIFLX were Hilbert matrices [99], identity matrices, or "invertible" matrices. The invertible matrices had their non-principal diagonal values set to 1 and their principal diagonal values set to the integer values 1, 2, ..., N; N being the width of the matrix [75].

For the sake of brevity, we will limit our presentation of the IMIFLX application's code to an abbreviated listing, which shows most of the application's non-declarative code. The code is shown in Figure 5.26. The figure's comment lines tend to describe the linear algebra step performed next; the following statements beginning with blas or execute perform the corresponding DEFG operations. The code statements, which are DEFG morsels, inserted the associated C++ code into the application. As an example, the code statements at lines 2 and 3 established an identity matrix multiplied by the previously established alpha value. There was one blas statement before the loop, and each loop iteration required the execution of three blas statements. The small kernels executed at lines 11, 13, and 21 performed matrix multiplication by a constant. The kernel executed at line 19 performs a matrix buffer copy. The code for these simple kernels is shown in the Appendix, Section B.4. The earlier IMI versions used blas statements instead of these kernels. The performance implications of this change
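The layout of the "invertible" test matrices can be written down directly. makeInvertible below is a hypothetical helper name for a sketch of the generation rule just described; IMIFLX's own generator is not shown in this chapter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build the "invertible" test matrix described above: every off-diagonal
// entry is 1, and the principal diagonal holds the integers 1, 2, ..., N.
// Row-major storage; illustrative sketch only.
std::vector<double> makeInvertible(int n) {
    std::vector<double> m(static_cast<std::size_t>(n) * n, 1.0);
    for (int i = 0; i < n; ++i)
        m[static_cast<std::size_t>(i) * n + i] = i + 1.0;
    return m;
}
```

The M500, M1000, M5000, and M8000 matrices reported later follow this pattern at widths 500 through 8000.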


are discussed in the next section.

The loop_escape statement, at line 29, demonstrates an example of DEFG anytime processing. DEFG maintained an internal timer that tracked the run time of the loop, and the loop_escape statement was used to exit the loop when a run-time threshold had been exceeded. Since the Altman algorithm provided an inversion result that improved monotonically with each iteration, this algorithm was appropriate for anytime algorithm usage.

The three kernel executions listed together, at lines 23 through 25, warrant some explanation. The calculation of the required Euclidean norm, used here to evaluate the quality of the inversion, required that the squares of all matrix values be summed. Parallel prefix scan provided a high-performance approach to obtaining this sum [40]. However, many of the available high-performance prefix scan algorithms require an

01. // mA holds the matrix to be inverted; mRn = IdentityMatrix * alpha
02. code [[ for (int i = 0; i ... ]]
...
05. blas ... -> mP
06. loop
07. // desired result: mRnp1 = mRn*(mI*3 - mP*3 + mP2); mP2 is mP*mP
08. // mW = mP * mP
09. blas dOne*mP*mP + dZero*mW -> mW
10. // mW += mI*3
11. execute k8 PlusIdentityThree mW(inout) mSIZE(in)
12. // mW -= mP*3
13. execute k9 MinusMatThree mW(inout) mP(in) mSIZEt2(in)
14. // mRnp1 = mRn * mW
15. blas dOne*mRn*mW + dZero*mRnp1 -> mRnp1
16. // mP = mA * mRnp1
17. blas dOne*mA*mRnp1 + dZero*mP -> mP
18. // copy mP to mW
19. execute k10 CopyArray mW(out) mP(in) mSIZEt2(in)
20. // mW -= mI
21. execute k11 MinusIdentity mW(inout) mSIZE(in)
22. // result = norm(mW)
23. execute k12 SweepSquares mW(in) mSIZEt2(in) mBasket(inout) basketSize(in)
24. execute k13 prefixSum mS(out) mBasket(in) mLocal(*) basketSize(in)
25. execute k14 ReadLastSqrt mS(in) basketSize(in) result(out)
26. release result // gets value onto CPU
27. // cpu: compare epsilon and result
28. code [[ if (result <= epsilon) LCNT = cycles; ]]
29. loop_escape at 6 secs // "anytime" processing
30. // mRn <==> mRnp1
31. interchange mRn mRnp1
32. call inc LCNT(inout)
33. while LCNT lt cycles

Figure 5.26: IMIFLX Application Processing Loop
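The run-time guard behind loop_escape can be modeled on the CPU side with a timer check at the bottom of each iteration. The sketch below is illustrative only: anytimeLoop is a hypothetical name, and the code DEFG actually generates is not shown in this chapter.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Model of an anytime-managed loop: iterate until convergence, the
// iteration cap, or the "loop_escape" time budget, whichever comes
// first. Returns the number of iterations performed. Hypothetical
// sketch of the logic behind DEFG's "loop_escape at N secs".
int anytimeLoop(double budgetSeconds, int maxIterations,
                const std::function<bool()>& iterateOnce) {
    auto start = std::chrono::steady_clock::now();
    int iterations = 0;
    while (iterations < maxIterations) {
        bool converged = iterateOnce();  // one Altman step, for example
        ++iterations;
        if (converged) break;
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        if (elapsed.count() >= budgetSeconds) break;  // anytime escape
    }
    return iterations;
}
```

Because the Altman residual shrinks monotonically, whatever iterate the timer interrupts is still the best answer produced so far, which is exactly the property anytime algorithms require.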


input data array size that is a power of 2. This prefix scan and power-of-2 topic was mentioned in Section 5.3, when discussing BFSDP2GPU. With our IMI application, we did not want to limit our processing to matrices that have a power-of-2 number of elements, and we wanted to have improved performance over the Berman Prefix Scan algorithm's work upper bound of O(n log n). We, therefore, split the norm processing into three kernels and we introduced a power-of-two-sized intermediate buffer to hold partial sums. Our improved approach was possible because the IMI norm processing only needed the final sum of squares and did not need any of the preceding partial sums.

Our approach stored the partial sums in the mBasket buffer, which contained basketSize elements. The basketSize value was set with a power-of-2 value that was somewhat larger than the maximum number of threads the GPU provides; for Hydra we used a value of 1,024. Each GPU thread produced a small number of sums and these sums were stored in mBasket. This work was performed by the SweepSquares kernel; it summed the squares of the values in the matrix. The code for this kernel, SweepSquares, is shown in Figure 5.27. The for loop, in the kernel, incremented the k index value by a stride of basket_length, to minimize the GPU hardware's global memory accesses.

The prefixSum kernel was taken from the AMD OpenCL SDK. It was modeled after the approach taken by Harris, et al. [39] and Blelloch [15]. Here we used it to sum the partial sums harbored in the basketM array. The simple ReadLastSqrt kernel returned the square root of the last sum produced by the prefixSum kernel. The upper bound on work for this approach is O(n + log basketSize) and, since basketSize is a constant, the upper bound with a large matrix is actually O(n). The source code for the prefixSum kernel is available from the AMD OpenCL SDK and the ReadLastSqrt kernel code is available in the Appendix, Section B.4.
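The three-kernel scheme can be modeled sequentially on the CPU, which also makes its O(n + basketSize) work bound easy to see. basketNorm below is a hypothetical, single-threaded model: each of basketLength "thread" slots sweeps the input with a stride of basketLength, as SweepSquares does, and the basket is then reduced and square-rooted, standing in for the prefixSum and ReadLastSqrt kernels.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// CPU model of the three-kernel norm pipeline: strided partial sums of
// squares into a fixed-size basket, then a final reduction and square
// root. Illustrative only; the GPU version runs the tid loop in parallel.
double basketNorm(const std::vector<double>& input, int basketLength) {
    std::vector<double> basket(basketLength, 0.0);
    for (int tid = 0; tid < basketLength; ++tid)
        for (std::size_t k = tid; k < input.size(); k += basketLength)
            basket[tid] += input[k] * input[k];
    double total = 0.0;               // stands in for the prefix-sum kernel
    for (double b : basket) total += b;
    return std::sqrt(total);
}
```

Note that only the basket needs a power-of-2 size; the input length is unrestricted, which is the point of the redesign.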


01. __kernel void SweepSquares(
02.     __global double *input,    // buffer of data values
03.     const int length,          // full length of buffer
04.     __global double *basket,   // basket of partial sums
05.     const int basket_length)   // full length of basket
06. {
07.     double d;
08.     double sum = 0.0;
09.     if (length < 1) return;
10.     unsigned int tid = get_global_id(0);
11.     if (tid >= length) return;
12.     // make strides of basket_length width
13.     for (int k = tid; k < length; k += basket_length) {
14.         d = input[k];
15.         sum += d * d;
16.     }
17.     basket[tid] = sum;
18. }

Figure 5.27: SweepSquares Kernel

MatThree, CopyArray, and MinusIdentity kernels. With the M1000 matrix, the measured run times dropped from 4.711 seconds to 2.113, a speedup of 2.22. In Table 5.15, the Summed Seconds column contains the sum of the BLAS and the clEnqueueNDRangeKernel API execution run times. The 2.02 seconds IMIFLX value was calculated using 0.047 x 43 + 0.0005 x 101. We note that the BLAS and NDRange processing account for the lion's share of the actual run time. Looking at the BLAS Count and NDR Count columns, we can see that each application version generated 144 requests. In the case of IMIFLX, 43 were BLAS requests; IMIB had 99. Our IMIFLX application version was faster than IMIB because it made many fewer BLAS requests. The kernel requests, which replaced the omitted BLAS requests, were much faster, with run times less than 0.001 seconds, versus the 0.047 seconds for each BLAS request. Not surprisingly, the blas statements were the dominant factor in the run times.

5.4.4.2 Inversion Results and Anytime Processing

As the IMIFLX application used the Altman iterative approach, it converged to a solution. It stopped when: the computed Euclidean Norm was less than the specified epsilon value; a numeric overflow had occurred; the maximum number of iterations count had been matched; or the anytime processing had intervened. The table in Figure 5.28 shows the Euclidean Norm values for processing an M500 matrix, with an epsilon setting of 0.00001. The plot in Figure 5.28 presents this data; the Iteration count is on the X axis and the Norm value is on the Y axis. We can see that the application's Norm values monotonically decreased from 20 towards zero. On the 13th iteration, the Norm was less than 0.00001 and the processing stopped.

Table 5.16 shows the results from 20 sample executions of IMIFLX, with different matrices. For each execution, the table provides name, type, and size for the matrix and the epsilon value, number of iterations, and total run time from the actual execution.

M500 Norm Values

Iteration  Norm Value
1          19.905700
2          16.298600
3          10.775400
4           6.320110
5           3.687980
6           2.186990
7           1.362090
8           0.943485
9           0.684855
10          0.320215
11          0.032834
12          0.000035
13          0.000000

Figure 5.28: Table and Plot of M500 Matrix Norm Values

Line 13 shows the results of the described M500 execution; it required 13 iterations and consumed 0.259 seconds. The next line, 13a, is a second execution of the same matrix but with the anytime limit set at 0.2 seconds. Here we see that the processing consumed 0.206 seconds and used 10 iterations. If we look back at the table shown in Figure 5.28, we see that the Norm was 0.320215 at this point. When the execution was terminated by the anytime processing, the application had brought the norm to within 2% of its full-run final value.

5.4.4.3 Range of IMI Inversions

Lines 1 through 12, of Table 5.16, present the results of inverting a set of the notoriously difficult-to-invert Hilbert Matrices [99]. The matrix sizes ranged from 2x2 to 13x13. Our inversion application was able to invert these matrices up to size 12x12. At 13x13, an "#INF (Infinity)" C/C++ run-time result occurred. This error was returned when the result of an operation was too large to be accurately stored. Starting at size 10x10, the epsilon value was raised; these larger epsilon


values were the approximate lower bound that the norm value could achieve. (We note that double-precision data types were used here and that running this OpenCL application on a CPU produced very similar results.) Using lower values of epsilon, with these cases, tended to iterate until the max-iteration count was encountered.

In order to determine an approximate upper bound on matrix size that our application could handle on the Hydra server, we inverted matrices of increasing size. Line 16 shows an 8000x8000 matrix being handled successfully; however, the 8500x8500 instance at line 17 fails with an OpenCL error of -4. This OpenCL error was discussed previously in Section 5.1; it occurs when the GPU cannot allocate the requested global memory. The IMI inversion of this matrix exceeded the memory capability of the NVIDIA Tesla T20 on Hydra, which has 2,687 MB of RAM. Our application required five matrix-sized buffers. With the 8000x8000 matrix, each buffer required 512 MB; five of these needed 2,560 MB. For 8500x8500, each buffer required 578 MB of RAM; five of these needed 2,890 MB. The larger 8500x8500 matrix exceeded the memory capacity of the Hydra T20 GPU.

The last three matrices listed were obtained from the University of Florida Sparse Matrix Collection [24, 25]. Matrix 685_bus is from a power network problem with 3,249 non-zero values and matrix 1138_bus is a similar problem with 4,054 non-zero values. The much larger Kuu matrix is from a structural problem and has 340,200 non-zero values. Our iterative inversion application encountered no issues with these real-world matrices.

In this section, we demonstrated DEFG handling our OpenCL numerical application, which used the clMath BLAS library. We obtained improved performance by using kernels for the non-multiplication matrix operations, using the BLAS functions only when absolutely needed. We showed that the M. Altman approach to matrix inversion works well on GPUs, and showed the useful DEFG anytime processing performing an early exit of an iterative process, when triggered by an external event.
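The memory arithmetic above is easy to check. The sketch below, under the hypothetical name imiBytesNeeded, computes the footprint of the five double-precision matrix buffers, using decimal megabytes as the figures above do.

```cpp
#include <cassert>

// Global-memory estimate for the IMI application: five N x N
// double-precision buffers. Matches the chapter's figures for the
// Hydra Tesla T20 (2,687 MB): 8000x8000 fits, 8500x8500 does not.
unsigned long long imiBytesNeeded(unsigned long long n) {
    const unsigned long long buffers = 5;
    return buffers * n * n * sizeof(double);
}
```

At 8000x8000 the five buffers need 2,560,000,000 bytes, just under the T20's capacity; at 8500x8500 they need 2,890,000,000 bytes, which triggers the OpenCL -4 allocation failure.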


Table 5.16: IMIFLX Inversion Results for Various Matrices

Cnt  Matrix Name  Type          Size       Epsilon  Iterations  Run Time (Seconds)
1    H2           Hilbert       2x2        0.00001   4          0.018
2    H3           Hilbert       3x3        0.00001   8          0.022
3    H4           Hilbert       4x4        0.00001  12          0.023
4    H5           Hilbert       5x5        0.00001  15          0.030
5    H6           Hilbert       6x6        0.00001  18          0.034
6    H7           Hilbert       7x7        0.00001  21          0.036
7    H8           Hilbert       8x8        0.00001  24          0.037
8    H9           Hilbert       9x9        0.00001  27          0.042
9    H10          Hilbert       10x10      0.001    30          0.035
10   H11          Hilbert       11x11      0.005    40          0.057
11   H12          Hilbert       12x12      0.15     70          0.089
12   H13          Hilbert       13x13      n.a.     n.a.        #INF error
13   M500         Invertible    500x500    0.00001  13          0.259
13a  M500         Invt-AnyTime  500x500    0.00001  10          0.206
14   M1000        Invertible    1000x1000  0.00001  14          2.112
15   M5000        Invertible    5000x5000  0.00001  16          329.619
16   M8000        Invertible    8000x8000  0.00001  17          1380.320
17   M8500        Invertible    8500x8500  n.a.     n.a.        error -4
18   685_bus      Repository    685x685    0.00001  12          0.665
19   1138_bus     Repository    1138x1138  0.00001  14          3.262
20   Kuu          Repository    7102x7102  0.00001   9          605.310


CHAPTER VI

ACCOMPLISHMENTS, OBSERVATIONS, AND FUTURE RESEARCH

While envisioning, designing, implementing, testing, and debugging DEFG and developing new DEFG-based applications, we became very familiar with the process of developing GPU applications. Based on this familiarity and our associated expertise, we offer this chapter. It summarizes our DEFG accomplishments, lists our noteworthy observations, and suggests interesting future research.

6.1 DEFG Accomplishments

In this dissertation, we produced the DEFG parser, optimizer, and code generator. These three, along with several DEFG design patterns, give DEFG its unique ability to generate the CPU-side of OpenCL applications, as seen in the aforementioned SOBEL, MEDIAN, BFSDP2GPU, RSORT, and IMIFLX applications. Each of these DEFG applications explored a different aspect of GPUs and OpenCL. The SOBEL and MEDIAN image filter applications used DEFG with image filtering, with a focus on multiple-GPU processing. Our 5x5 median image filter showed a speedup with multiple GPUs. Our graph-processing application, BFSDP2GPU, showed the DEFG ability to process large, very irregular data structures with multiple GPUs. The GPU sorting RSORT application implemented the novel roughly sorting algorithm for partially sorted data. It demonstrated good performance in both single-GPU and


multiple-GPU versions. The last application, our iterative matrix inversion implementation IMIFLX, exhibited DEFG's ability to implement iterative, GPU-based, numerical processing.

DEFG is designed to make the development of the CPU-side of OpenCL applications less work for the set of applications that follow one, or more, of the DEFG design patterns. In Chapter III, we showed that DEFG-produced applications can match the speed of equivalent hand-written applications.

The value of DEFG is clear: it enables the implementation of OpenCL applications using the declarative approach. This approach to application development is advantageous; it is likely to consume far less developer time, compared to writing hand-written C/C++, because the developer has to write many fewer lines of source code, and the DEFG code that is written is simpler, relative to the same hand-written C/C++.

6.2 Some Noteworthy Observations

6.2.1 Usefulness of a DSL for GPU Software Development

What is it about the style of software development used with GPUs that makes the use of a DSL attractive? Our answer to this question is based on Farber's suggested rules for GPGPU programming [30], paraphrased here: get the data on the GPU and leave it there, give the GPU ample work to do, and focus on data reuse within the GPU to avoid global memory limitations.

This high-performance GPU methodology comes down to putting the computational work on the GPU and utilizing the CPU mainly to manage the GPU's operations. GPU developers strive to minimize the transfer of data to and from the GPUs, due to the transfer's significant consumption of time. The application work is done in the kernels; kernels need to be efficient, and often, highly optimized. The CPU application code facilitates the GPU's execution; it manages the movement of data


to and from the GPU, and it handles scheduling of the GPU kernels.

Our view is that these CPU-based data movement and GPU scheduling operations can quite often be expressed with a set of predefined design patterns. Our DSL, DEFG, supplies these design patterns. DEFG makes it possible to declare the characteristics of the GPU-required data and, using design patterns, manage the needed GPU operations. With the declarations made and design patterns chosen, DEFG then produces the corresponding CPU-side C/C++ program. Our experience has been that most application variability and complexity tend to be concentrated in the GPU kernels.

6.2.2 Declarative Approach for Kernel Code Generation

Having observed that DEFG's declarative approach can work well for generating the CPU-side of OpenCL applications, we explored using a similar declarative approach for GPU kernel generation. We are not as optimistic that the declarative approach will work well for the GPU kernels, due to the wide variations we observed in kernel processing.

In our view, one reason that DEFG works well is because of the supplied design patterns. As stated above, many GPU applications require their CPU-side code to perform similar actions; these similar actions often map well into patterns.

Our point here is that with GPU kernels, we did not find the same similarity of actions. We reviewed the kernels used in our port of existing applications and our newly developed applications, and we observed very few common kernel usage patterns that are significant and substantial. The few similar patterns seen on the GPU side were trivial actions, such as locating the index to a required buffer element or making certain a given index offset value was within the buffer size. Different applications tended to use substantially different kernel processing.


6.3 Conflicting DEFG Aims and Static Optimization

At the outset of this project, two of our principal objectives for DEFG were: to simplify the GPU software development process, and to generate human-readable code. It became obvious that these objectives were somewhat in opposition to each other. Generating human-readable code implies producing modules that are reflective of the programming logic being used.

In other words: to be readable, the modules need to be coded in a straightforward manner and uncluttered. However, as we introduced significant performance optimizations, we saw that the static optimization step required inserting additional code for many special cases. For example, when optimizing the buffer transfers inside a looping structure containing multiple loop exit points, the optimizer should consider the buffer transfer requirements for each exit point, and generate the appropriate code. With static optimization, these special cases may force the generation of significant amounts of specialized logic, at many locations within the code. The insertion of this specialized code can make the generated code hard to understand and seem cluttered. One reason we decided to construct an external DSL was to generate human-readable code. To our surprise, we found that it is not always possible to generate uncluttered, readable code when a high level of performance is also a major concern.

6.4 Future Research

6.4.1 Additional DEFG Design Patterns

DEFG has demonstrated the ability to generate standalone applications and callable C/C++ functions. DEFG has the potential to be even more useful if two additional design patterns are implemented: multiple-GPU load balancing and resource sharing.

Our current DEFG implementation supports multiple GPUs but does not, in and of itself, make any attempt to balance the workload between the selected GPUs. The


workload assignment is a function of how the application is written. This approach works well on a system with matched GPU devices. However, if the selected GPUs are not matched, or the application does not assign the workload to the devices appropriately, DEFG-supplied load balancing could be of great value. A DEFG design pattern that dynamically allocates work to the selected GPUs could compensate for mismatched GPUs and hard-to-predict workloads.

Similar to load balancing, resource sharing could become an issue. This issue may arise when a DEFG-generated program is utilized as a C/C++ function and the function is used from within a loop. The current DEFG obtains and releases its resources on every invocation. This behavior could be problematic if the DEFG-generated function is constantly re-invoked, after performing only a small amount of work. A design pattern that allowed for the holding, and sharing, of resources could be useful in preventing this loss of performance associated with repeated allocation and release of resources.

6.4.2 DEFG Support for CUDA

We suggest designing and implementing a version of DEFG for NVIDIA's CUDA. CUDA is a widely used platform for generating NVIDIA-specific GPU applications. As noted earlier in this work, CUDA has much better support for direct GPU-to-GPU communications than OpenCL provides. While this is a worthwhile goal, we see two major obstacles to implementing CUDA support in DEFG; the first is rather obvious and the second is more subtle.

First, the CUDA environment is similar to the OpenCL environment but definitely not equal. For example, the CUDA terminology greatly differs from that of OpenCL; the GPU kernels have a different syntax; and the actual CPU-side call-level APIs have a different granularity compared to OpenCL [65, 70]. In our experience, for equivalent problems, CUDA-based application solutions require fewer API calls. As


a result of these types of differences, the DEFG code generator would require significant refactoring to elegantly support CUDA. The parser and optimizer would not require such refactoring. The BFSDP2GPU application has the potential to perform much better with CUDA, due to CUDA's superior GPU-to-GPU communication capabilities.

Additionally, the OpenCL environment does not require that the local work-group value be supplied on OpenCL clEnqueueNDRangeKernel API calls; setting this parameter is optional. DEFG makes significant use of this OpenCL feature. With CUDA, the equivalent to this parameter, threads per block, is required and it is limited to certain values depending on the CUDA block-size value provided. A version of DEFG for CUDA would have to find an appropriate automated way to set this parameter or force the DEFG application software developer to provide a reasonable value.

While challenging, implementing a CUDA version of DEFG would be worthwhile because of CUDA's wide-scale use and the added performance it could give to applications, such as BFSDP2GPU.

6.4.3 DEFG Re-factored as an Internal DSL

DEFG is currently designed in the style of an external DSL; it consists of a parser, optimizer, and code generator. The generated program is compiled by a standard C/C++ compiler. After making significant use of the current DEFG, we suggest that producing an internal DSL, using many of the DEFG components, would produce a very useful GPU development tool.

Martin Fowler suggests that DSLs can be categorized in two ways: internal DSLs and external DSLs [32]. An internal DSL is constructed inside a standard programming language through the use of objects, macros, and other programming language extensions. This new DSL could be implemented as a set of specialized objects in


an object-oriented language such as C++, C#, or Java. (Providing Java support has the potential to enable DEFG use with the mobile Android Platform.) The items declared in the current DEFG design could become programming objects, and the high-value DEFG buffer transfer optimizations could be implemented as programming objects centered upon the basic OpenCL API functions. Great care in designing this new internal DSL would need to be taken, so as to ensure that the data structures used are consistent with the constraints of the OpenCL API functions.

In this approach, many DEFG design patterns would become higher-level objects, making use of the just-described programming objects. With this new DSL in place, the current DEFG could be re-implemented and greatly simplified. The DEFG parser would remain largely as is.

6.5 DEFG Technical Improvements

As a result of our work, we discovered a number of technical improvements and enhancements that could be added to DEFG. They are paraphrased here, in summarized form, and fully described in Section B.2 of the Appendix.

1. Add a DEFG optimizer step to verify the in/out/inout option settings.
2. Enhance the DEFG code statement to include a list of variables used.
3. Optimize DEFG to release over-allocated CPU memory.
4. Add DEFG interchange statement functionality to the execute statement.
5. Improve DEFG's ability to collect run-time statistics.
6. Consider use of dynamic optimization in DEFG buffer transfer operations.

Even without these improvements, DEFG has shown itself to be a capable and efficient tool for creating OpenCL-based GPU applications.
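Picking up the internal-DSL suggestion from Section 6.4.3, a DEFG-style declaration could become an ordinary C++ object whose methods carry the buffer-transfer optimization. All class and member names below are hypothetical; this is a sketch of the idea, not a proposed DEFG API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sketch of an internal-DSL building block: a declared buffer becomes an
// object, and the transfer-optimizing runtime decides whether a
// host-to-device copy is actually needed before a kernel launch.
class Buffer {
public:
    Buffer(std::string name, std::size_t bytes)
        : name_(std::move(name)), bytes_(bytes) {}

    // Called when CPU-side code (a "morsel") modifies the host copy.
    void markHostDirty() { hostDirty_ = true; }

    // Returns true when a host-to-device copy would be issued; clean
    // buffers skip the transfer, mirroring DEFG's static optimization
    // done dynamically.
    bool ensureOnDevice() {
        bool copied = hostDirty_;
        hostDirty_ = false;  // the device copy is now current
        return copied;
    }

private:
    std::string name_;
    std::size_t bytes_;
    bool hostDirty_ = true;  // initial contents must be uploaded once
};
```

In a real internal DSL these objects would wrap clCreateBuffer and clEnqueueWriteBuffer calls; the design-pattern classes of Section 6.4.3 would then compose them into complete execution sequences.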


BIBLIOGRAPHY

[1] Advanced Micro Devices, Inc. Accelerated Parallel Processing (APP) SDK. Website, 2013. http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/

[2] Advanced Micro Devices, Inc. Accelerated Parallel Processing Math Libraries (APPML). Website, 2013. http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-math-libraries/

[3] Advanced Micro Devices, Inc. AMD Radeon HD 7990 Graphics. Website, 2013. http://www.amd.com/us/products/desktop/graphics/7000/7990/Pages/radeon-7990.aspx

[4] Advanced Micro Devices, Inc. clMath (formerly APPML). Website, 2014. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-math-libraries/

[5] R. Aggarwal and K. Jacques. The impact of FDICIA and prompt corrective action on bank capital and risk: Estimates using a simultaneous equations model. Journal of Banking & Finance, 25:1139-1160, 2001.

[6] Altera, Inc. Altera web page on OpenCL. Website, 2013. http://www.altera.com/products/software/opencl/opencl-index.html

[7] M. Altman. An optimum cubically convergent iterative method of inverting a linear bounded operator in Hilbert space. Pacific Journal of Mathematics, 10:1107-1113, 1960.

[8] T. Altman. A method of inexact steepest descent for systems of linear equations. Computers and Mathematics with Applications, 19:65-69, 1990.

[9] T. Altman and B. Chlebus. Sorting roughly sorted sequences in parallel. Information Processing Letters, 33:297-300, 1990.

[10] T. Altman and Y. Igarashi. Roughly sorting: Sequential and parallel approach. Journal of Information Processing, 12:154-158, 1989.

[11] ANTLR3. Antlr3. Website, 2013. http://www.antlr3.org/


[12] D. Bader and K. Madduri. Designing multithreaded algorithms for breadth-first search and st-connectivity on the Cray MTA-2. In Proceedings of the 2006 International Conference on Parallel Processing, ICPP '06, pages 523–530, Washington, DC, USA, 2006. IEEE Computer Society.

[13] R. Barr, T. Pilkington, J. Boineau, and M. Spach. Determining surface potentials from current dipoles, with application to electrocardiography. Biomedical Engineering, IEEE Transactions on, :88–92, 1966.

[14] K. Berman and J. Paul. Fundamentals of Sequential and Parallel Algorithms. PWS Publishing Co., Boston, MA, USA, 1st edition, 1996.

[15] G. Blelloch. Prefix sums and their applications. A Carnegie Mellon University Research Showcase Report, 1990.

[16] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D. Hwang. Complex networks: Structure and dynamics. Physics Reports, 424:175–308, 2006.

[17] M. Boddy and T. Dean. Deliberation scheduling for problem solving in time-constrained environments. Artificial Intelligence, 67:245–285, 1994.

[18] B. Brejova. Analyzing variants of Shellsort. Information Processing Letters, 79:223–227, 2001.

[19] V. Castro and D. Wood. An adaptive generic sorting algorithm that uses variable partitioning. International Journal of Computer Mathematics, 61-4:181–194, 1996.

[20] Center for Discrete Mathematics and Theoretical Computer Science. DIMACS. Website, 2010. http://www.dis.uniroma1.it/challenge9/download.shtml

[21] P. Corke. Robotics, Vision and Control: Fundamental Algorithms in MATLAB, volume 73. Springer, 2011.

[22] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms, third edition, 2009.

[23] R. Couturier. Designing Scientific Applications on GPUs. CRC Press, 2013.

[24] T. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38:1, 2011.

[25] T. Davis and Y. Hu. The University of Florida Sparse Matrix Collection. Website, 2014. http://www.cise.ufl.edu/research/sparse/matrices/

[26] F. Dehne and K. Yogaratnam. Exploring the limits of GPUs with parallel graph algorithms. arXiv preprint arXiv:1002.4482, 2010.


[27] M. Dinneen, M. Khosravani, and A. Probert. Using OpenCL for implementing simple parallel graph algorithms. In Proceedings of the 17th Annual Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '11, part of WORLDCOMP), volume 11, pages 1–6, 2011.

[28] J. Edd, S. Payen, B. Rubinsky, M. Stoller, and M. Sitti. Biomimetic propulsion for a swimming surgical micro-robot. In Intelligent Robots and Systems, 2003 (IROS 2003), Proceedings, 2003 IEEE/RSJ International Conference on, volume 3, pages 2583–2588. IEEE, 2003.

[29] V. Estivill-Castro and D. Wood. A survey of adaptive sorting algorithms. ACM Computing Surveys (CSUR), 24:441–476, 1992.

[30] R. Farber. CUDA Application Design and Development. Elsevier Science, Burlington, 2011.

[31] W. Feng, H. Lin, T. Scogland, and J. Zhang. OpenCL and the 13 dwarfs: a work in progress. In Proceedings of the Third Joint WOSP/SIPEW International Conference on Performance Engineering, pages 291–294. ACM, 2012.

[32] M. Fowler. Domain-Specific Languages. Addison-Wesley Professional, 2010.

[33] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Pearson Education, 1994.

[34] B. Gaster, L. Howes, D. Kaeli, P. Mistry, and D. Schaa. Heterogeneous Computing with OpenCL: Revised OpenCL 1.2 Edition. Morgan Kaufmann, 2012.

[35] N. Govindaraju, N. Raghuvanshi, and D. Manocha. Fast and approximate stream mining of quantiles and frequencies using graphics processors. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD '05, pages 611–622, New York, NY, USA, 2005. ACM.

[36] E. Hansen and S. Zilberstein. Monitoring and control of anytime algorithms: A dynamic programming approach. Artificial Intelligence, 126:139–157, 2001.

[37] Hardkernel Co., Ltd. ODROID Platforms. Website, 2013. http://www.hardkernel.com/main/products/prdt_info.php?g_code=G138745696275

[38] P. Harish and P. Narayanan. Accelerating large graph algorithms on the GPU using CUDA. In Proceedings of the 14th International Conference on High Performance Computing, HiPC '07, pages 197–208, Berlin, Heidelberg, 2007. Springer-Verlag.

[39] M. Harris and M. Garland. Optimizing parallel prefix operations for the Fermi architecture. GPU Computing Gems Jade Edition, page 29, 2011.

[40] M. Harris, S. Sengupta, and J. Owens. Parallel prefix sum (scan) with CUDA. GPU Gems, 3:851–876, 2007.


[41] B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, and P. Sander. Relational joins on graphics processors. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 511–524, New York, NY, USA, 2008. ACM.

[42] S. Hong, S. K. Kim, T. Oguntebi, and K. Olukotun. Accelerating CUDA graph algorithms at maximum warp. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '11, pages 267–276, New York, NY, USA, 2011. ACM.

[43] T. Huang, G. Yang, and G. Tang. A fast two-dimensional median filtering algorithm. Acoustics, Speech and Signal Processing, IEEE Transactions on, 27:13–18, 1979.

[44] K. Ikuta, H. Ishii, and M. Nokata. Safety evaluation method of design and control for human-care robots. The International Journal of Robotics Research, 22:281–297, 2003.

[45] The Khronos Group Inc. OpenCL Reference Pages. Website, 2010. http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/

[46] M. Ishii, S. Sakane, M. Kakikura, and Y. Mikami. A 3-D sensor system for teaching robot paths and environments. The International Journal of Robotics Research, 6:45–59, 1987.

[47] L. Jordan and G. Alaghband. Fundamentals of Parallel Processing. Prentice Hall Professional Technical Reference, 2002.

[48] D. Kirk and W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2010.

[49] A. Klöckner. PyOpenCL. Website, 2010. http://mathema.tician.de/software/pyopencl

[50] A. Klöckner. PyCUDA. Website, 2012. http://mathema.tician.de/software/pycuda

[51] D. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, 2nd ed. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1998.

[52] A. Krishnamoorthy and D. Menon. Matrix inversion using Cholesky decomposition. arXiv preprint arXiv:1111.4144, 2011.

[53] C. Lejdfors. PyGPU – Python for the GPU. Website, 2007. http://fileadmin.cs.lth.se/cs/Personal/Calle_Lejdfors/pygpu/


[54] litecoin.info. Mining hardware comparison. Website, 2014. https://litecoin.info/Mining_hardware_comparison

[55] L. Luo, M. Wong, and W. Hwu. An effective GPU implementation of breadth-first search. In Proceedings of the 47th Design Automation Conference, pages 52–55. ACM, 2010.

[56] C. Ma, L. Yang, W. Gao, and Z. Liu. An improved Sobel algorithm based on median filter. In Mechanical and Electronics Engineering (ICMEE), 2010 2nd International Conference on, volume 1, pages V1-88. IEEE, 2010.

[57] Mathworks. Parallel Computing Toolbox. Website, 2013. http://www.mathworks.com/products/parallel-computing/

[58] M. McCool, J. Reinders, and A. Robison. Structured Parallel Programming: Patterns for Efficient Computation. Elsevier, 2012.

[59] D. Merrill, M. Garland, and A. Grimshaw. Scalable GPU graph traversal. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 117–128. ACM, 2012.

[60] D. Merrill and A. Grimshaw. Revisiting sorting for GPGPU stream architectures. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, pages 545–546. ACM, 2010.

[61] A. Munshi, B. Gaster, T. G. Mattson, and D. Ginsburg. OpenCL Programming Guide. Addison-Wesley Professional, 2011.

[62] M. Nagao and S. Mori. A new method of n-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese. In Proceedings of the 15th Conference on Computational Linguistics – Volume 1, pages 611–615. Association for Computational Linguistics, 1994.

[63] Y. Nagasaka, K. Kuroki, S. Suzuki, Y. Itoh, and J. Yamaguchi. Integrated motion control for walking, jumping and running on a small bipedal entertainment robot. In Robotics and Automation, 2004 (ICRA '04), Proceedings, 2004 IEEE International Conference on, volume 4, pages 3189–3194. IEEE, 2004.

[64] NVIDIA Corporation. Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Fermi. Technical report, NVIDIA Corporation, 2011.

[65] NVIDIA Corporation. CUDA C Programming Guide, 4.2 edition, 2012.

[66] NVIDIA Corporation. Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110. Technical report, NVIDIA Corporation, 2012.

[67] NVIDIA Corporation. CUDA Zone – OpenCL. Website, 2013. https://developer.nvidia.com/opencl


[68] NVIDIA Corporation. NVIDIA GPUDirect. Website, 2014. https://developer.nvidia.com/gpudirect

[69] NVIDIA Corporation. NVIDIA web page on GPGPU. Website, 2013. http://www.nvidia.com/object/what-is-gpu-computing.html

[70] Khronos OpenCL Working Group and A. Munshi. The OpenCL Specification, version 1.2, document revision 15, 2012.

[71] J. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. Phillips. GPU computing. Proceedings of the IEEE, 96:879–899, 2008.

[72] G. Petrie and T. Kennie. Terrain modeling in surveying and civil engineering. Computer-Aided Design, 19:171–187, 1987.

[73] W. Petryshyn. On the inversion of matrices and linear operators. Proceedings of the American Mathematical Society, 16:893–901, 1965.

[74] R. Porra. The chequered history of the development and use of simultaneous equations for the accurate determination of chlorophylls a and b. Photosynthesis Research, 73-3:149–156, 2002.

[75] M. Rosenzweig. How to Construct an Invertible Matrix? Just Choose Large Diagonals. Website, 2013. http://matthewhr.wordpress.com/2013/09/01/how-to-construct-an-invertible-matrix-just-choose-large-diagonals/

[76] S. Lacey and R. Box. A Fast Easy Sort. Website, 1991. http://cs.clackamas.cc.or.us/molatore/cs260Spr03/combsort.htm

[77] A. Saba and R. Mangharam. Anytime algorithms for GPU architectures. AVICPS 2010, page 31, 2010.

[78] J. Sanders and E. Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 1st edition, 2010.

[79] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IEEE International Symposium on Parallel & Distributed Processing, 2009 (IPDPS 2009), pages 1–10. IEEE, 2009.

[80] R. Senser and T. Altman. DEF-G: Declarative Framework for GPU Environment. In Proceedings of the 19th Annual Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '13, part of WORLDCOMP), volume II, pages 490–496, 2013.

[81] R. Senser and T. Altman. A second generation of DEFG: Declarative Framework for GPUs. In Proceedings of the 20th Annual Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '14, part of WORLDCOMP), volume t.b.d., page t.b.d., 2014. To be available November 2014.


[82] R. Senser and T. Altman. Poster: DEFG, Declarative Framework for GPUs – ID P4194. NVIDIA GPU Technology Conference, GTC 2014, March 24–27, 2014, San Jose, CA, 2014. http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php?searchByKeyword=senser&searchItems=&sessionTopic=&sessionEvent=2&sessionYear=2014&sessionFormat=5&submit=&select=+#sthash.v9BEY0N0.dpuf

[83] J. Shen and M. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors, beta edition. McGraw-Hill Science/Engineering/Math, 2003.

[84] M. Sid-Ahmed. Image Processing. McGraw-Hill, 1994.

[85] K. Skadron. Rodinia: Accelerating compute-intensive applications with accelerators. Website, 2013. http://lava.cs.virginia.edu/Rodinia/

[86] I. Sobel and G. Feldman. A 3x3 isotropic gradient operator for image processing. A 1968 talk at the Stanford Artificial Project, 1968.

[87] M. Sonka, V. Hlavac, and R. Boyle. Image processing, analysis, and machine vision. Chapman & Hall, pages 2–6, 1998.

[88] Stanford SNAP Group. Stanford Network Analysis Project. Website, 2014. http://snap.stanford.edu/

[89] P. Stanimirovic. Self-correcting iterative methods for computing {2}-inverses. Arch. Math. (Brno), 39:27–36, 2003.

[90] R. Tewarson. A direct method for generalized matrix inversion. SIAM Journal on Numerical Analysis, 4:499–507, 1967.

[91] TINYXML2. Tinyxml2. Website, 2013. http://www.grinninglizard.com/tinyxml2/index.html

[92] N. Tuck. Bacon: A GPU programming language with just in time specialization (draft). University of Massachusetts Lowell, Lowell, MA 01854, 2012.

[93] O. Vincent and O. Folorunso. A descriptive algorithm for Sobel image edge detection. In Proceedings of Informing Science and IT Education Conference (InSITE), pages 97–107, 2009.

[94] W. Wang. Research on Sobel operator for vehicle recognition. In Artificial Intelligence, 2009 (JCAI '09), International Joint Conference on, pages 448–451. IEEE, 2009.

[95] Z. Wei and J. JaJa. Optimization of linked list prefix computations on multithreaded GPUs using CUDA. In IPDPS, pages 1–8. IEEE, 2010.

[96] S. Wesolkowski, M. Jernigan, and R. Dony. Comparison of color image edge detectors in multiple color spaces. In Image Processing, 2000, Proceedings, 2000 International Conference on, volume 2, pages 796–799. IEEE, 2000.


[97] Wikipedia. Comb sort. Website, 2014. http://en.wikipedia.org/wiki/Comb_sort

[98] Wikipedia. Embarrassingly parallel. Website, 2014. http://en.wikipedia.org/wiki/Embarrassingly_parallel

[99] Wikipedia. Hilbert matrix. Website, 2014. http://en.wikipedia.org/wiki/Hilbert_matrix

[100] J. Xiong, J. Johnson, R. Johnson, and D. Padua. SPL: A language and compiler for DSP algorithms. In ACM SIGPLAN Notices, volume 36, pages 298–308. ACM, 2001.

[101] S. Zilberstein. Using anytime algorithms in intelligent systems. AI Magazine, 17:73, 1996.


APPENDIX A

DEFG User's Guide

A.1 Introduction

DEFG is a Domain-Specific Language (DSL) that uses declarative statements to provide much of the information needed to generate the CPU portion of OpenCL GPU applications. It is not a general-purpose programming language, nor is it intended to be. DEFG gives up some of the power that a universal language, such as C++ or Java, provides. In exchange for decreased power, DEFG is able to perform certain GPU-related activities efficiently with minimal software developer effort. This efficiency relates both to the degree of effort required by the developer to create an OpenCL application, and to the performance achieved by the generated OpenCL application.

While not a general-purpose language in itself, DEFG does have the ability, with its code statement, to insert arbitrary C/C++ code into the applications it generates. These inserted C/C++ code snippets are referred to as DEFG morsels, which are intended to allow the developer to insert small snippets to do minor calculations, to format display output, and to assist in debugging.

This User's Guide provides DEFG source code examples and summary descriptions of the common DEFG design patterns. These examples and design pattern descriptions are followed by the detailed DEFG Language Reference section. Also covered in this guide are advanced features and error handling.


A.2 Intended Audience

The intended audience for this User's Guide is both experienced developers who are starting to use GPUs, and seasoned GPU developers. It is assumed that the reader has a basic understanding of GPU technology, the OpenCL Specification, and GPU kernel programming.

A.3 DEFG Examples

Two DEFG example programs are presented here. They provide a glimpse of how the DEFG language is utilized, and they show the CPU-side operations needed for GPU implementation of two common applications. The first example provides for the execution of a single GPU kernel, which is an image filter. The second one shows more of the DEFG power in the form of the looping operation used in a GPU-based Breadth-First Search (BFS) implementation.

A.3.1 Sobel Image Filter Example

This is a simple DEFG application. It loads an image, copies it to the GPU, processes the image with a GPU kernel named sobel_filter, and then brings the image back to the CPU for display.

    // SobelRef.txt: Sobel algorithm in DEFG syntax
    declare application sobel
    declare integer Xdim
            integer Ydim
            integer BUF_SIZE
    declare gpu gpuone ( * )
    declare kernel sobel_filter SobelFilter_Kernels ( [[ 2D, Xdim, Ydim ]] )
    declare integer buffer image1 ( Xdim Ydim )
            integer buffer image2 ( Xdim Ydim )
    call init_input image1 (in) Xdim (out) Ydim (out) BUF_SIZE (out)
    execute run1 sobel_filter image1 (in) image2 (out)
    call disp_output image2 (in) Xdim (in) Ydim (in)
    end
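The GPU kernel itself lives separately in SobelFilter_Kernels.cl and is not shown in the dissertation text here. For intuition only, the per-pixel gradient that a Sobel filter computes can be sketched as a sequential C++ function; this is an illustrative sketch, not the actual kernel code used by DEFG:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sobel gradient magnitude at interior pixel (x, y) of a grayscale
// image stored row-major; the caller is responsible for skipping
// the one-pixel border where the 3x3 window would fall off the image.
double sobelAt(const std::vector<int>& img, int width, int x, int y) {
    auto p = [&](int dx, int dy) { return img[(y + dy) * width + (x + dx)]; };
    // Horizontal and vertical 3x3 Sobel convolutions.
    int gx = -p(-1,-1) + p(1,-1) - 2*p(-1,0) + 2*p(1,0) - p(-1,1) + p(1,1);
    int gy = -p(-1,-1) - 2*p(0,-1) - p(1,-1) + p(-1,1) + 2*p(0,1) + p(1,1);
    return std::sqrt(double(gx) * gx + double(gy) * gy);
}
```

A GPU kernel performs this same computation once per work-item, with (x, y) derived from the 2D global work-item ID instead of a loop.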


The DEFG declare statements define the characteristics of the DEFG application. In this case the application name is sobel; it has three integer scalar variables; it uses any available GPU; and the OpenCL kernel name is sobel_filter, from the OS file named "SobelFilter_Kernels.cl". There are two GPU integer buffers: image1 and image2. The C++ function init_input is called to load the image, and the C++ function disp_output is called to display the resulting image. Between these two C++ functions, the DEFG execute statement schedules the execution of the GPU kernel and arranges for the movement of the two buffers. The in and out denotations allow DEFG to optimize the movement of the integer buffers between the CPU and GPU.

A.3.2 Breadth-First Search Example

The second DEFG example shows multiple kernel execution and the looping capabilities of DEFG. Here, a DEFG looping construct is used to repeat the set of kernels, and DEFG internally optimizes the loop's data buffer movements between the CPU and GPU devices.

    // bfsRef.txt: BFS Harish version
    declare application bfs
    declare integer NODE_CNT
            integer EDGE_CNT
            integer MAX_DEGREE
            integer STOP
    declare gpu gpuone ( * )
    declare kernel kernel1 bfs_kernel ( [[ 1D, NODE_CNT ]] )
            kernel kernel2 bfs_kernel ( [[ 1D, NODE_CNT ]] )
    declare struct buffer graph_nodes ( NODE_CNT )
            integer buffer graph_edges ( EDGE_CNT )
            integer buffer graph_mask ( NODE_CNT )
            integer buffer updating_graph_mask ( NODE_CNT )
            integer buffer graph_visited ( NODE_CNT )
            integer buffer cost ( NODE_CNT )
    call init_input graph_nodes (out)
                    graph_edges (out)
                    graph_mask (out)


                    updating_graph_mask (out)
                    graph_visited (out)
                    cost (out)
                    NODE_CNT (out)
                    EDGE_CNT (out)
                    MAX_DEGREE (out)
    loop
      execute part1 kernel1 graph_nodes (in)
                            graph_edges (in)
                            graph_mask (in)
                            updating_graph_mask (inout)
                            graph_visited (in)
                            cost (inout)
                            NODE_CNT (in)
      set STOP (0)
      execute part2 kernel2 graph_mask (inout)
                            updating_graph_mask (inout)
                            graph_visited (inout)
                            STOP (inout)
                            NODE_CNT (in)
    while ( STOP eq 1 )
    call disp_output cost (in) NODE_CNT (in)
    end

Here, the DEFG declare, call, and execute statements function as was described in the first example. This example introduces the loop, set, and while DEFG statements to manage the required looping behavior. The loop statement denotes the beginning of a processing loop. The while statement denotes the end of the processing loop and provides the loop-exit condition. The set statement assigns the given value to the scalar variable STOP. In this example, the value previously set in STOP is changeable by the second kernel executed and thus forms the loop-exit control, which is managed by the second kernel.

Although this appears to be just a procedural programming loop, the DEFG optimizer very carefully uses the characteristics of this loop to generate the OpenCL code that moves the data buffers. For instance, the graph_edges buffer, which is used by the first kernel within the loop, is only moved once from the CPU to the


GPU. If the in, inout, and out denotations were coded differently, DEFG could generate code to move this buffer in each iteration of the loop.

A.4 Common DEFG Design Patterns

DEFG relies heavily on a set of templates to generate the needed OpenCL code. When these templates are combined with certain DEFG coding techniques, the results are the DEFG design patterns. This section describes the most commonly used patterns. Some of them are engaged via the use of certain DEFG statements or keywords. Other design patterns are invoked by using the common DEFG statements in specific groupings. By their very nature, these design patterns can be overlapped, and in certain cases, limited by the presence of other patterns. Most of the complexity in using these design patterns arises when using the DEFG multiple-GPU card support. Use of the single-GPU DEFG design patterns tends to be straightforward.

A.4.1 Execution-Flow Design Patterns

DEFG provides the three execution-flow design patterns described in this section. These patterns provide the basic processing template that the DEFG code generator uses to form the C/C++ code.

A.4.1.1 Sequential-Flow Design Pattern

The Sequential-Flow design pattern is always present. It causes the statements following the DEFG declare statements to be executed from top to bottom. The Single-Kernel Repeat Sequence and Multiple-Kernel Loop design patterns, discussed below, can substantially alter this top-to-bottom ordering.


A.4.1.2 Single-Kernel Repeat Sequence Design Pattern

The Single-Kernel Repeat Sequence is a looping design pattern that allows a single DEFG execute or multi_exec statement to be executed a fixed number of times. The DEFG statement used to invoke this pattern is sequence. It can be used with both the Sequential-Flow and Multiple-Kernel design patterns.

A.4.1.3 Multiple-Kernel Loop Design Pattern

The Multiple-Kernel Loop design pattern is a conditional looping pattern. It is used within the Sequential-Flow design pattern. The DEFG statements needed to form this design pattern are the loop and while statements. The Multiple-Kernel Loop design pattern may not be embedded within itself. In other words, a DEFG loop/while construct may not contain another loop/while construct. The Single-Kernel Repeat Sequence design pattern may be used within the Multiple-Kernel Loop design pattern.

A.4.2 Anytime Processing Design Pattern

The Anytime Processing design pattern is used with the Multiple-Kernel Loop design pattern. This pattern is used to cause a premature exit of the design pattern's loop. The criterion for exiting the loop can be run-time based, through the use of the DEFG anytime statement, or data-comparison based, through the use of a DEFG code morsel.

A.4.3 Application versus Module Design Patterns

Standalone applications, with a C/C++ main entry function, and callable functions can both be created with DEFG.
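The run-time-based loop exit used by the Anytime Processing pattern above can be pictured in plain C++; this is an illustrative sketch of the general anytime-algorithm idea, not the code DEFG generates:

```cpp
#include <cassert>
#include <chrono>

// Runs `step` repeatedly until it reports convergence or the time budget
// expires, whichever comes first; returns the number of iterations run.
// This mirrors the anytime idea: the loop always yields a usable (if
// possibly less refined) result when the deadline arrives.
template <typename Step>
long runAnytime(Step step, std::chrono::milliseconds budget) {
    auto deadline = std::chrono::steady_clock::now() + budget;
    long iterations = 0;
    bool done = false;
    while (!done && std::chrono::steady_clock::now() < deadline) {
        done = step();   // one refinement pass; true means converged
        ++iterations;
    }
    return iterations;
}
```

In DEFG the same effect is obtained declaratively with the anytime statement (or a code morsel for data-based criteria) rather than hand-written loop control.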


A.4.3.1 Application Design Pattern

The Application design pattern generates a standalone application and is indicated by using a DEFG declare application statement.

A.4.3.2 Module Design Pattern

The Module design pattern generates a callable C/C++ function and is indicated by using the DEFG declare module statement, sometimes followed by one or more optional DEFG declare parameter statements.

A.4.4 BLAS Design Pattern

The clMath Basic Linear Algebra Subprograms (BLAS) library is provided for OpenCL [2]. The clMath library was formerly known as APPML. This set of subprograms provides a set of OpenCL-callable BLAS capabilities. DEFG makes use of the double-precision matrix multiply API, clAmdBlasDgemm, from this library with its blas statement.¹ The main difference between execute and blas is that the execute statement executes a specific OpenCL kernel, and the blas statement executes the desired subprogram from the library, which then calls its own set of OpenCL kernels.

A.4.5 Virtual-Pointer/Prefix Sum Design Pattern

DEFG is supplied with several OpenCL kernels that perform parallel prefix sum. We have found that prefix sum is a common operation performed in DEFG applications. These kernels can be used by any OpenCL application that needs prefix-sum processing. They are specifically included in DEFG to facilitate the developer's use of prefix sum in assigning offsets in a shared buffer, without the use of low-level synchronization. These kernels are named: bermanPrefixSumP1, bermanPrefixSumP2b,

¹ Support for additional clMath functions may be added to DEFG at a later date.


and getCellValue. They are used in the Breadth-First Search application, discussed in Section 5.2.

A.4.6 Multiple GPU Support Design Patterns

When the design of the GPU algorithms used in a DEFG program permits the concurrent use of more than one GPU device, DEFG supports design patterns to enable the use of multiple GPU devices.

A.4.6.1 Multiple-Execution Design Pattern

With certain applications, it is desirable to split the workload over more than a single GPU device. This may be done to obtain faster performance, to handle larger problems, or both. DEFG makes this possible with the use of the multi_exec statement. With this capability, DEFG will execute the workload over the group of GPU devices declared in the declare gpu statement. The basic algorithms in use must be intended for this high degree of parallelism, and the OpenCL kernels in use must also have been designed for this mode of use. This pattern is used in the image filters applications, discussed in Section 5.1. The design pattern described next is often used with this Multiple-Execution design pattern.

A.4.6.2 Managed Buffer Design Pattern

When the just-described Multiple-Execution design pattern is used, it is often necessary to segment the data passed to each GPU. DEFG provides a number of declare buffer options to make it possible for DEFG to automatically segment the data. These options are halo, multi, and nonpartable. In the single-GPU context, these options have no effect; they are ignored. When multi_exec is used, DEFG buffers are considered partable, unless a buffer option is supplied. The default partable behavior is to split the data into equal segments for passing to the selected GPUs. When the nonpartable


option is used, the buffer is not segmented and is given, in full, to each GPU. The halo option is often used with images, because it notifies DEFG that the data contains small edges. The result of using halo is that edge data is sent to each GPU along with the edge. The multi option is used with the DEFG_GLOB variable to handle complex segmentations of the input data. With the DEFG_GLOB approach, the C/C++ function that brings the data into the DEFG-generated program controls the splitting of the data. It also notifies the DEFG-generated program, via DEFG_GLOB, how the data is allocated to the GPU buffers. This option is often needed when graphical data needs to be processed over more than a single GPU. Use of DEFG_GLOB is shown in the Breadth-First Search application, discussed in Section 5.2.

A.5 DEFG Language Reference

A.5.1 DEFG Statements

This section lists the DEFG statements in alphabetical order. Each DEFG statement is described, followed by a syntax description and then a very small code example. The "<" and ">" metacharacters, used in the syntax descriptions, denote names, variables, constants, and literals; the single "[" and "]" metacharacters denote optional repeating sequences; and the "|" metacharacter denotes options. Note that the double "[[" and "]]" character sequences are literal characters and are part of the actual DEFG syntax.

1. blas Statement

DEFG is able to utilize the clAmdBlasDgemm BLAS function to perform double-precision matrix multiplication. To utilize the DEFG blas statement, the AMD clMath (formerly APPML) library must have already been installed. The DEFG blas statement is a wrapper around the clAmdBlasDgemm C/C++ function. Note that this statement requires that all fields be provided. In most


cases, the scalar variables must be set to 0 or 1 to get the desired results. Note that the ">" character sequence is part of the actual statement syntax. In the example below, the linear algebra calculation performed is:

    matrix3 = scalar1 * matrix1 * matrix2 + scalar2 * matrix3

Syntax:
    blas <scalar1> * <matrix1> * <matrix2> + <scalar2> * <matrix3> > <matrix3>

Example:
    set dZero (0.0)
    set dOne (1.0)
    blas dOne * mP * mP + dZero * mW > mW

2. broadcast Statement

When a DEFG program is being used in multiple-GPU mode, the broadcast statement is used to make the contents of a data buffer on a given GPU available to the other GPUs. The statement's GPU number must be a constant and preceded by the "@" character. In the current DEFG implementation, this statement forces the data buffer to be copied back to the CPU for later usage by a requesting GPU. This operation is not done as a GPU-to-GPU copy and, therefore, tends to be relatively slow.

Syntax:
    broadcast <variable> @<gpu>

Example:
    broadcast frontier0 @0

3. call Statement

It is very common for a DEFG program to rely on C/C++ functions to bring


data into the DEFG program and to forward the GPU-processed data to the CPU for further processing or output. The call statement is used to execute C/C++ functions. The name of the called C/C++ function, and all referenced DEFG variable names, must follow C/C++ naming conventions and not begin with either defg or DEFG. The required options in, inout, and out are described in this manual's DEFG Statement Options section. These three options provide the DEFG Translator with the data movement directions.

Syntax:
    call <function name> <variable> (<direction>) [<variable> (<direction>)]
    where: <function name> is a C/C++ function name
    and: <variable> is a DEFG scalar or buffer name
    and: <direction> is a direction

Example:
    call init_input image1 (in) Xdim (out) Ydim (out) BUF_SIZE (out)

The DEFG-generated C/C++ code for this call statement would be:

    init_input(image1, Xdim, Ydim, BUF_SIZE);

whereas the corresponding declaration of the actual C/C++ function could be:

    void init_input(void *image1, int &Xdim, int &Ydim, int &BUF_SIZE);

Note that the function is of type void, and note the required use of the C++ call-by-reference operator, that is, the "&" character, on all the scalar values passed between the DEFG-generated code and the C/C++ function. Since the data buffers are passed as void pointers, they are also passed by reference.

4. code Statement

The DEFG code statement may be used to insert C/C++ code into the generated C/C++ program. These inserted code snippets are called morsels. The morsel's code snippet, provided between the "[[" and "]]" markers, is inserted into the generated program without being parsed or verified. Common uses for this statement are outputting scalar values for debugging and generating input data for testing. Note that when outputting GPU results, the GPU's data must be current and valid on the CPU. References to any DEFG fields from embedded C/C++ code are not managed by the DEFG optimizer. Therefore, the DEFG release statement can be used to force a copy of the GPU data back to the CPU. Obviously, the DEFG code statement can create hard-to-find software errors; it should only be used with great care.

Syntax:
    code [[ <native C/C++ code> ]]

Example:
    code [[ printf("KCNT: %d\n", KCNT); ]]

5. declare application Statement

Each DEFG application requires a name, which is provided by the declare application statement. The name is listed in the generated C/C++ code along with a DEFG translation timestamp.

Syntax:
    declare application <application name>
    where: <application name> is the name used in code generation

Example:
    declare application sobel

6. declare buffer Statement

Each data buffer used within a DEFG program must be declared, and all in


a single declare buffer statement. The maximum size of the buffers can be changed with a run-time environment variable, called DEFG_MAX_BUF. The data buffers are treated as C/C++ arrays of type double, float, integer, or struct. The DEFG support for struct arrays is limited, and these limits are discussed in the DEFG Data Types section.

Syntax:
    declare buffer <buffer entry> [<buffer entry>]
    where: <buffer entry> consists of <data type> <name> (<size>) [<buffer option>]
    and: <data type> consists of the data type
    and: <name> consists of the buffer name
    and: <size> consists of the number of occurrences
    and: <buffer option> consists of halo | local | multi | nonpartable

Example:
    declare integer buffer data1 ( size )
            integer buffer data2 ( size )

The halo, local, multi, and nonpartable options are described in the DEFG Statement Options section; the DEFG Advanced Features section provides additional information.

7. declare gpu Statement

The gpu declaration determines which device or devices are selected for use. The devices are normally any GPUs, or the CPU. The GPU selection criteria are defined as follows:

    (a) * : Matches any single device, CPU included.
    (b) any : Matches any single GPU, CPU excluded.
    (c) all : Matches all GPUs.


c all :MatchesallGPUs. dquotedlist:Matchesdevicenameslisted. Syntax: declaregpu < gpu group name > < gpu criteria > where: < gpu group name > isthegroupname and: < gpu criteria > consistsof j any j all j < gpu list > and: < gpu list > consistsofasequenceofoneormorequotedGPUnames. Twoexamples: declaregpugpuone* declaregpugpuone"GeForceGT220" 8. declarekernel Statement Thenamesofthekernelorkernelsrequiredaredeclaredwiththe declarekernel statement.Eachkernelreferencedhasaninternalname,andinmostcasesa lename.Theinternalnamemustmatchtheactualkernelnameandthele nameisthenameofthetextlecontainingthekernelcode.Thetextlemust haveanextensionof.cl".Notethatthereisanoptiontoinsertthekerneltext intotheDEFGcodethroughtheuseofthe insertcode[[..]] phrase.This kernelinsertcodeoptionisintendedfordebuggingpurposes;thatis,forthe temporaryexecutionofsmallamountsofuncommentedkernelcode. Eachkernelrequiresglobalparametersandmayincludelocalsizingparameters.ThecurrentDEFGversionusesthe [[..]] syntaxtocontainthe parametersneededforOpenCL.Therstparameteris 1D or 2D ,denotingthe dimensionalityofthedata.Thenextoneortwoparametersdenotetheassociateddimensionsize.Notetheuseofcommas,whichisnotthenormalDEFG syntaxdesign,andtheoptionalcoloncharacter.Theoptionallocalparameters, withoneortwovalues,areprecededbyacoloncharacter. 146


Syntax:
    declare kernel <kernel name> <kernel file name> (<kernel run-time>)
    where: <kernel name> consists of the name of the kernel function
    and: <kernel file name> consists of the name of the kernel text file
    and: <kernel run-time> consists of the sizing parameters

Note that <kernel file name> can be replaced by the insert code [[..]] phrase, as shown below.

Three examples:
    declare kernel mfavg_filter Mfavg_Kernels ( [[ 1D, Dim ]] )
    declare kernel local_sample Samples ( [[ 1D, 100:10 ]] )
    declare kernel tiny_kernel insert code [[
        __kernel void tiny_kernel(__global int *p1) { p1[0] = 34; return; }
    ]] ( [[ 1D, 1 ]] )

9. declare module Statement

DEFG has the option to generate C/C++ functions instead of standalone applications with a C/C++ main entry point. In order to generate a function, the declare module statement is used. The module name provided becomes the function name of the generated C/C++ code. The arguments to the C/C++ function are provided with the declare parameter statement, discussed in the next section.

Syntax:
    declare module <module name>
    where: <module name> is the name used for the generated function

Example:
    declare module sobelc
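The [[ 1D, Dim ]] and 100:10 sizing parameters above correspond to OpenCL's global and local work sizes. As a general OpenCL note (this is standard host-side practice, not DEFG-specific machinery): OpenCL 1.x requires the global work size passed to clEnqueueNDRangeKernel to be evenly divisible by the local work-group size, so host code typically rounds the requested size up:

```cpp
#include <cassert>
#include <cstddef>

// Round a requested global work size up to the nearest multiple of the
// local (work-group) size; each kernel must then guard against the
// padded work-items whose IDs fall beyond the real problem size.
std::size_t roundUpGlobal(std::size_t n, std::size_t local) {
    return ((n + local - 1) / local) * local;
}
```

For example, a Samples kernel declared with [[ 1D, 100:10 ]] needs no padding, but a problem size of 101 with a local size of 10 would be launched with a global size of 110.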


10. declare parameter Statement

When the DEFG declare module statement is used, the arguments to the generated function are defined using the declare parameter statement. Only the parameter name is supplied, and the name must reference a declared variable or buffer.

Syntax:
    declare parameter <parameter name>
    where: <parameter name> matches a declared variable or buffer

Example:
    declare parameter image1

11. declare (variable) Statement

Each developer-defined scalar variable used within a DEFG program must be declared. All variables are declared within a single declare statement. These variables are treated as C/C++ variables of type double, float, or integer. Note that the literal "variable" is not part of this DEFG statement.

Syntax:
    declare <variable entry> [<variable entry>]
    where: <variable entry> consists of <data type> <name>

Example:
    declare integer num1data1
            integer data2

12. end_timer Statement

DEFG automatically provides one timer that the developer can use to compute the run times of DEFG code. See the start_timer statement below for a full description.
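DEFG's timer itself is documented under start_timer; as a rough plain-C++ analogue of what a start_timer / end_timer pair measures (an illustrative sketch, not the code DEFG generates):

```cpp
#include <cassert>
#include <chrono>

// Wall-clock timer analogous to DEFG's start_timer / end_timer pair:
// record a start point, then report the elapsed time in milliseconds.
class WallTimer {
    std::chrono::steady_clock::time_point start_;
public:
    void start() { start_ = std::chrono::steady_clock::now(); }
    double elapsedMs() const {
        return std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - start_).count();
    }
};
```

A steady (monotonic) clock is the right choice for run-time measurement, since it is unaffected by system clock adjustments.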


13. execute Statement

The DEFG execute statement schedules the execution of the kernel on a single GPU. If the needed input variables and buffers are not already on the GPU, then copies of these are moved to the GPU. The required options in, inout, and out express the direction of data movement and are more fully described in this guide's Statement Options section. The correct setting of these options is critical, as the DEFG optimizer uses their values to efficiently move the variables and buffers between the devices. The incorrect setting of these options can cause poor performance and/or incorrect results.
Syntax: execute <run name> <kernel name> [ <variable> <option> ... ]
where: <run name> is the unique name of this execution step
and: <kernel name> consists of the kernel name
and: <variable> consists of a variable name
and: <option> is one of in, inout, or out
Example: execute run1 sobel_filter image1 in image2 out

14. include Statement

The DEFG include statement may be used to insert C/C++ code into the generated C/C++ program. The code snippet provided between the '[[' and ']]' markers can define C-style macros, as well as contain C/C++ #include statements. The DEFG include statement can create hard-to-find software errors; it should only be used after careful consideration of the side effects of its use. This statement differs from the code statement insertions in that the include insertions occur at the very beginning of the generated C/C++ program.
Syntax:


include [[ <native C/C++ include and similar statements> ]]
Example: include [[#define INDEX2(xi,xj,xsize,xind) xind=xi*xsize+xj;]]

15. interchange Statement

The DEFG interchange statement is used to make DEFG programs easier to understand, faster, and smaller, by making it possible to interchange the contents of two GPU buffers. The interchange occurs without copying the contents back to the CPU and actually swapping the buffer contents. This statement is often used when a given kernel is executed more than once and the output of the previous iteration is the input to the next iteration.
Syntax: interchange <variable1> <variable2>
Example:
loop
execute Run2 Kernel LR inout LRout out size in stride in groupSize in
interchange LR LRout
while ...

16. loop Statement

The DEFG loop statement is used to repeat a sequence of DEFG statements a variable number of times. The loop termination condition is handled by a single variable, checked by the while clause. If more than one condition needs to be checked, then the while clause has to be preceded by a code morsel that processes the multiple conditions and returns the result in one scalar DEFG variable. DEFG loop statements may not be layered or embedded in other loops, but the DEFG loop statement may contain one or more sequence statements. These DEFG domain-specific language limits make it possible for the DEFG


optimizer to efficiently manage the OpenCL buffer transfers within the loop via static optimizations.
Syntax: loop <DEFG statements> while <condition>
where: <DEFG statements> consists of executable statements
and: <condition> consists of <variable name> <operator> <numeric constant>
and: <operator> consists of one of eq, ne, lt, le, gt, ge
Example:
loop
// maybe execute some DEFG code
set again
// execute some DEFG code that updates the again variable
while again eq 1
The eq, ne, lt, le, gt, and ge operators represent =, ≠, <, ≤, >, and ≥, respectively.

17. multi_exec Statement

The DEFG multi_exec statement schedules the execution of the kernel on two or more GPUs. The associated declare gpu statement must have provided for at least two GPUs, or the multi_exec statement fails with an error. More information on the use of this statement is available in the execute statement description and in the Multiple GPU Support section. The required options in, inout, and out are described in this guide's Statement Options section.
Syntax: multi_exec <run name> <kernel name> [ <variable> <option> ... ]
where: <run name> is the unique name of this execution step


and: <kernel name> consists of the kernel name
and: <variable> consists of a variable name
and: <option> is one of in, inout, or out
Example: multi_exec run1 sobel_filter image1 in image2 out

18. output timer Statement

DEFG automatically provides one timer that the developer can use to compute the run times of DEFG code. See the start timer statement for a full description.

19. release Statement

In certain instances, the DEFG optimizer needs to be informed that a variable or buffer must contain valid contents on the CPU. The release statement provides this functionality. As an example, a release statement is required for any variable or buffer that is returned to a calling program from a DEFG module; this statement guarantees the data being returned to the caller is valid. This statement may also be used before the end timer statement to verify that the referenced CPU field has valid, not old, contents.
Syntax: release <name>
where: <name> consists of the name of a variable
Example: release image2

20. sequence Statement

The DEFG sequence statement is a looping construct and must be immediately followed by an associated execute or multi_exec statement. The associated execute or multi_exec statement is re-executed, in a sequence, the number of times


specified. The DEFG_CNT system variable provides an iteration count. This DEFG_CNT system variable is zero-indexed.
Syntax: sequence <count> times <exec>
where: <count> consists of a variable or constant
and: <exec> consists of an execute or multi_exec statement
Example:
sequence NODE_CNT times
execute run1 FWarshall buffer1 inout buffer2 inout DEFG_CNT in

21. set Statement

The DEFG set statement is used to copy the value of a constant to a scalar variable. Note: for scalar-variable-to-scalar-variable copying, a code morsel may be used.
Syntax: set <name> <value>
where: <name> consists of the name of the variable
and: <value> consists of the constant value given to the variable
Example: set STOP

22. start timer Statement

DEFG automatically provides one timer that the developer may use to compute simple run times of DEFG code. Note that to get accurate times, the data being processed or updated likely has to be copied back to the CPU; otherwise the full time used may not be captured. This may require use of the release statement to force the updated buffers to the CPU.


Three simple statements are used to compute run times: start timer, end timer, and output timer. Here is an example of their use.
Example:
start timer
// do something to update image2
release image2
end timer
// optionally, do something un-timed
output timer

23. while Statement

The DEFG while statement is always paired with a preceding loop statement. See the loop statement description for the details of the while statement.

A.5.2 DEFG Statement Options

1. halo Option

When the multi_exec statement is used, DEFG has the ability to share buffer data between the selected GPUs. The halo option is used on the declare buffer statement to provide DEFG with the number of edge cells (for 1D processing) or rows (for 2D processing) that must be sent to both GPUs that are processing the edge. The GPU processing boundaries are at the edges and, when halo is used, DEFG manages the needed duplication of data for each GPU. This statement option is most commonly used with image filters, where each pixel's processing requires the data values of the neighboring pixels.

2. in/inout/out/* Option

In this discussion, field refers to either a DEFG buffer or variable. The options


in, inout, out, and * are used to inform the DEFG optimizer how a given field on a DEFG call, execute, or multi_exec statement is used. The field can be used for input, output, or both. The * can be used with the call statement to mark a given field as don't care, which means the DEFG optimizer does not need to move any data for this field. The in marks a field as input; the out marks a field as output; and inout marks a field as both input and output. A given in, inout, or out is appropriate for only the DEFG statement it appears with. For example, an in associated with a given field on a call statement means that field is used for input by this called function. Likewise, an in associated with a given field on an execute statement means that field is used for input by this GPU kernel.
A note of caution: if these options are not set correctly, then a given DEFG program may perform poorly, due to unnecessary data moves between the devices. Worse, erroneous results may be produced due to the necessary data moves not being performed. The correct results may be contained in the GPU's memory, but if these options are not set correctly, then these results may not be present on the CPU. The DEFG translator does not parse the GPU kernel code to verify passed fields and their option settings.

3. local Option

local is a declare buffer statement option that is used to mark a buffer as local to the GPU. Buffers of this type are not transferred between the CPU and GPU, and they are usually restricted in size. Local buffers are normally processed by the kernels more quickly than buffers kept in GPU global storage. Note that each GPU work-group has private local storage.

4. multi Option

When the multi_exec statement is used, a sharing of the workload between GPUs is implied. In cases where the other declare buffer statement options do not


provide the required data segmentation, the multi option is available. When this option is used, DEFG relies on the C/C++ program loading the given data buffer to set the DEFG_GLOB variable (actually a C/C++ structure) with the information to determine which area of the given buffer goes to a given GPU. This option is complex and not easy to use. As the name implies, this variable is global and shared by all multi buffers. It should be used as a solution of last resort.

5. nonpartable Option

The nonpartable option makes it possible to mark a buffer as non-segmented when the multi_exec statement is used. A buffer declared with the nonpartable option is passed, in full, to each GPU. This option is very useful when the same read-only data must be passed to each GPU participating in a multi_exec step.

6. [[...]] Option

The [[...]] construct is used to pass the exact characters between the "[[" and "]]" delimiters to the DEFG run-time code. In the case of C/C++ code, the code is processed by the CPU C/C++ compiler or the OpenCL driver, as appropriate. In the case of the global and (optional) local sizing parameters in the DEFG declare kernel statement, these parameters are passed to the OpenCL clEnqueueNDRangeKernel function. Note that these NDRange values may be adjusted automatically when multiple-GPU support is active.

A.5.3 DEFG Data Types

This section discusses data type support in DEFG. First, note that the common C/C++ char and string data types are not directly supported by DEFG. The DEFG code morsels make it possible to deal with char and string data types from within DEFG, but the majority of the DEFG statements do not support them. The remainder of this section describes the data types supported by DEFG.

A.5.3.1 Data Type: double

The DEFG double data type maps directly to the C/C++ double data type provided by the C/C++ compiler.

A.5.3.2 Data Type: float

The DEFG float data type maps directly to the C/C++ float data type provided by the C/C++ compiler.

A.5.3.3 Data Type: integer

The DEFG integer data type maps directly to the C/C++ int data type provided by the C/C++ compiler.

A.5.3.4 Data Type: struct

The DEFG direct support of C/C++ structures is limited to copying the contents of the struct variables through DEFG. When accessing the actual fields within a struct, a DEFG code statement must be used.

A.5.4 DEFG System Variables

DEFG provides a number of internal system variables. These variables make it possible for the DEFG developer to access DEFG internal information.

A.5.4.1 DEFG_CNT Variable

DEFG_CNT is a read-only DEFG system variable that provides the current loop count when a DEFG sequence is used. It is a zero-indexed counter. When this variable is read outside of a sequence loop, it has a value of zero.


A.5.4.2 DEFG_GLOB Structure

The DEFG_GLOB variable is a C/C++ structure that is internal to DEFG. It is used to manage the partitioning of data buffers when the multi option is used on a buffer. The values in DEFG_GLOB are normally set by the C/C++ function that populates the data buffers. This structure must be passed as an argument to any such C/C++ function. See the DEFG Advanced Features section for more details.

A.5.4.3 DEFG_GPU Variable

The read-only DEFG system variable DEFG_GPU provides the ordinal of the current GPU. This variable is intended to be used with the multi_exec statement, when it is necessary to know which of the selected GPUs is active. This variable is zero-indexed.

A.5.4.4 DEFG_GPU_COUNT Variable

The read-only DEFG system variable DEFG_GPU_COUNT provides the maximum number of GPUs active for this program. The number of GPUs active is determined by the declare gpu statement and the actual hardware configuration.

A.5.5 DEFG Environment Variables

These DEFG environment variables can be accessed via the operating system. They can be used to influence the compilation behavior of the DEFG translator and the run-time behavior of the generated DEFG program.

A.5.5.1 DEFG_MAX_BUF Environment Variable

The DEFG_MAX_BUF environment variable is used at run time to set the size of the CPU buffers used to hold DEFG data buffers. The default value is 4MB. Larger values can be used, depending on the available RAM on the CPU and GPU.
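A setting like DEFG_MAX_BUF is conventionally read once at startup: use the environment value when present and sane, else fall back to the documented 4MB default. The sketch below is an assumption about how such a lookup is typically coded, not DEFG's actual implementation, and it treats the value as a plain byte count.

```c
#include <stdlib.h>

/* Illustrative sketch (not DEFG's actual code): resolve a buffer-size
 * setting from the DEFG_MAX_BUF environment variable, defaulting to
 * 4MB when the variable is unset or not a positive number. */
size_t get_max_buf(void) {
    const char *s = getenv("DEFG_MAX_BUF");
    if (s != NULL) {
        long v = strtol(s, NULL, 10);
        if (v > 0)
            return (size_t)v;          /* use the caller's override */
    }
    return 4 * 1024 * 1024;            /* documented 4MB default */
}
```

Validating the parsed value (here, rejecting non-positive numbers) keeps a malformed environment setting from silently producing a zero-length buffer.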


A.5.5.2 DEFG_TIMERS Environment Variable

The DEFG_TIMERS environment variable is used at compile time. This is not a run-time setting. When this environment variable is set, many of the low-level OpenCL calls for buffer movement and kernel execution are timed and displayed. This feature can be helpful for DEFG debugging and tuning; it should only be used for debugging and testing, as it impacts the overall performance and stability of DEFG.

A.5.6 DEFG Utilities and Functions

DEFG provides additional features to assist the developer.

A.5.6.1 rsDevices Utility

rsDevices is a small utility program that can be used to list the available GPUs on a given CPU. The output lists the OpenCL platforms available and then each OpenCL-supported device. Each device name is followed by "|CPU" or "|GPU", depending on its device type. These device names can be used with the DEFG declare gpu statement's quoted list of GPU names.
Here is the output from a sample Linux execution:
$ ./rsDevices
platform: Advanced Micro Devices, Inc.
platform: NVIDIA Corporation
Six-Core AMD Opteron(tm) Processor 2427 |CPU
Tesla S2050 |GPU
Tesla S2050 |GPU
$


A.5.6.2 Loader Functions

DEFG provides a number of C/C++ functions to input data files, output results, and perform simple calculations. These functions are supplied in the "defg_loaders.h" include file. Additional loader functions can be added to this header file, or additional header files can be listed in DEFG. These new headers can be referenced with a DEFG include statement. A partial list of the commonly used "defg_loaders.h" functions is given in Table A.1. [2]

Facility                      Description                   Function Name
Image Loader                  From the AMD SDK              init_input
Floyd-Warshall Graph Loader   From the AMD SDK              init_input
BFS Graph Loader              From the Rodinia Benchmark    init_input
Array Partition               Partition for 2-GPU use       ArrayPartition2GPU
Array Merge                   Merge for 2-GPU use           MergeCost2GPU
Dump Scalar                   Output a scalar variable      dumpScalar
Dump Buffer                   Output a buffer               dumpBuffer
Debug Exit                    Immediate processing stop     debugExit
Increment                     Increment a scalar            inc
Decrement                     Decrement a scalar            dec

Table A.1: A Partial List of DEFG Loaders and Functions

A developer should review the "defg_loaders.h" include file to see exactly which functions are available and the types of data processed. New application-specific functions can be added to this header file.

[2] Duplicate function names are managed through the use of application-name C-style macros; see the actual header file for the names and details.

A.6 DEFG Advanced Features

A.6.1 Direct Insertion of C/C++ Code

The DEFG optimizer uses the limited domain of the DEFG language to assist in the generation of efficient OpenCL code. Specifically, this domain limit makes it possible for the optimizer to manage the scalar variables and data buffers in ways


that avoid unneeded OpenCL data-movement operations between the CPU and GPU. This optimizer does not process the actions performed in the DEFG code statements, referred to as DEFG morsels. Therefore, morsels that modify data, or make assumptions as to the current CPU or GPU location of modified data, should be written with great care. The release statement can be used to notify DEFG that a given buffer needs to be made current on the CPU. The DEFG optimizer may move, or not move, data between the CPU and GPU at expected times; it may have pre-staged the data movement or postponed the data movement, depending on factors unique to the optimizer.

A.6.2 Multiple GPU Support

When an application is written in DEFG, the application is limited by the capabilities of the resources available. A single-GPU DEFG application is limited to the resources of one GPU. If the data being processed is even one byte larger than the available GPU memory, the application likely will not execute. Therefore, DEFG supplies the capability to utilize more than a single GPU. For applications that permit a higher degree of parallelism, this capability makes it possible to obtain faster performance, handle larger problems, or both. When this capability is engaged with the multi_exec statement, the processing is spread over the GPUs selected by the declare gpu statement. It cannot be overemphasized that this multiple-GPU support capability is limited by the algorithms in use and the implementation of the OpenCL kernels used. This feature simplifies the CPU coding to support multiple GPUs; it does not automatically turn an arbitrary, single-GPU application into a new one with multiple-GPU support.
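Spreading processing over the selected GPUs implies partitioning each partable buffer into per-device ranges. The sketch below shows one generic even-split scheme (first n % g devices get one extra item); it is an assumption for illustration, not DEFG's actual partitioning algorithm, and the Chunk type is ours.

```c
/* Illustrative even work split across g devices: divide n items so
 * that every device gets n/g items and the first n%g devices get one
 * extra. Generic scheme for illustration, not DEFG's algorithm. */
typedef struct { int offset; int length; } Chunk;

void split_work(int n, int g, Chunk out[]) {
    int base = n / g;        /* items every device receives */
    int extra = n % g;       /* leftover items, one each to the front */
    int pos = 0;
    for (int i = 0; i < g; i++) {
        out[i].offset = pos;
        out[i].length = base + (i < extra ? 1 : 0);
        pos += out[i].length;
    }
}
```

The ranges tile the buffer exactly, which is the invariant a multi_exec-style dispatch needs; a halo scheme would then widen each range by the declared number of edge cells or rows.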


A.6.3 Support for Mobile GPU Platforms

DEFG has the potential to be used on mobile platform hardware, when the Linux operating system is used. We do not anticipate DEFG being used with the Android operating system. C/C++ programs generated by DEFG have been built and executed on the ARM Cortex-A9 processor used with the ODROID-U3 [37]. Unfortunately, the Linux OpenCL driver for the integrated ARM Mali-400 Quad-Core 440MHz GPU was not available at the time of our testing.

A.6.4 Use of DEFG_GLOB

Caveat: DEFG_GLOB is a very advanced DEFG feature and should only be used by experienced C/C++ developers who have determined that the other DEFG buffer options will not provide the functionality they need.
When the DEFG multi option is added to a buffer and multi_exec is used, the buffer's layout is then managed by the C/C++ function that populates the buffer. This capability provides a mechanism for the developer to influence what data goes to each GPU, with DEFG performing the data transfers to the selected GPUs. The DEFG_GLOB variable is a C/C++ structure that contains offsets and lengths, which are used to manage associated DEFG buffers. The actual structure is declared in "defg_template.txt" and its format is DEFG-version dependent.

A.6.5 Global and Local Range Settings

The DEFG declare kernel statement requires that the OpenCL global range be set. This parameter determines the number of device threads started and how they are to be managed. OpenCL has the capability to set the local range automatically and DEFG routinely uses this capability. In some instances, the developer may wish to set the local range directly. In this event, the global range within the "[[" and "]]" phrase is followed by a colon character and then the local range is given. The use


of local storage by a kernel is likely to require the developer to set the corresponding declare kernel statement's local range. See the declare kernel statement description for its full syntax.

A.6.6 A Few DEFG Advanced Techniques

A.6.6.1 Changing the DEFG Run-Time Output Location

DEFG generates informational output and error messages at run time. The generated DEFG code uses the DEFG_PRINTF C-style macro to route output to the printf function. This routing may be changed by updating the DEFG_PRINTF macro content in the "defg_template.txt" file.

A.6.6.2 Changing the DEFG Exit Behavior

At run time, DEFG will sometimes abort the processing due to unrecoverable errors. The generated DEFG code uses the DEFG_EXIT C-style macro to route execution to the exit function. This routing may be changed by updating the DEFG_EXIT macro setting in the generated code.

A.6.6.3 Additional Anytime Exit Capabilities

Additional Anytime-like processing can be inserted into a DEFG application with code morsels. This basically involves using a C/C++ if statement and goto statement to jump out of the DEFG loop. [3] It will be necessary to examine the generated code to determine the label of the C/C++ loop exit. This is a very advanced DEFG feature and is DEFG-version dependent. It should only be used as a solution of last resort.

[3] The use of a goto statement is related to the scoping of internal DEFG C/C++ variables.
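The DEFG_PRINTF and DEFG_EXIT techniques above are both instances of the same macro-indirection pattern: route a whole class of calls through one macro so the destination can be changed in a single place. The sketch below demonstrates the pattern with a capture buffer standing in for the rerouted destination; the buffer and helper are ours, not content from "defg_template.txt".

```c
#include <stdio.h>
#include <string.h>

/* Illustrative macro-indirection sketch: all output goes through
 * DEFG_PRINTF, so redefining the one macro redirects every message.
 * Here it is redefined to capture into a buffer instead of stdout;
 * the buffer and report_status() are illustrative, not DEFG's code. */
static char log_buf[256];

#define DEFG_PRINTF(...) snprintf(log_buf, sizeof log_buf, __VA_ARGS__)

void report_status(int code) {
    DEFG_PRINTF("status: %d", code);   /* lands in log_buf, not stdout */
}
```

Swapping `snprintf(log_buf, ...)` for `printf` (the default) or `fprintf(logfile, ...)` changes every call site at once, which is why the generated code funnels output through the macro.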


A.6.6.4 DEFG call Statement Is Preferred over DEFG code Morsel

As stated earlier, the code morsel can easily introduce errors. This small section discusses why the call statement is preferred over the code statement.
The DEFG call statement contains the in/out/inout/* settings. These inform the DEFG optimizer how the fields being given to the called function are used. Using this information, the optimizer moves the needed input data to the CPU and moves updated DEFG variables and buffers back to the GPU. The DEFG code morsels lack this capability. DEFG will insert the provided C/C++ code into the generated program at the given location, but DEFG does not monitor the variables and buffers accessed. Data read in the morsel may be invalid, and updates done in the morsel may be lost. It is the developer's responsibility to be certain that the data used and updated in the morsel is correctly used.

A.7 How to Execute the DEFG Translator

In order to execute the DEFG translator under Windows, the translate batch file is supplied. This batch file will execute the DEFG parser, optimizer, and code generator. The DEFG source code must be in a text file with the extension type of ".txt", and the resulting program will have an extension of type ".cpp". The translate batch file expects one input command-line argument, and that must be the name of the source file to be translated from DEFG to C/C++. The example below translates "sobel.txt" to "sobel.cpp".
Example: translate sobel


A.8 DEFG Error Handling

A.8.1 Translator Errors

The DEFG Translator has somewhat limited error-reporting capabilities and will only report one translation error at a time. It may take multiple editing sessions and translator executions to find all the errors present. When an error occurs, a corresponding ".cpp" file will not be generated.
Example: Translation with an error on line 21
C> translate sobelerr
C> echo off
"******** TRANSLATE ********"
"sobelerr parse"
sobelerr.txt line 21:2 extraneous input 'this is an error' expecting END
C>

A.8.2 Run-Time Errors

DEFG translates the DEFG input into a C/C++ program that is then compiled and executed. This C/C++ program can produce run-time errors. When DEFG-generated code senses a run-time error condition, its default behavior is to describe the error with a call to the C/C++ printf function. This behavior can be adjusted via the DEFG_PRINTF C-style macro defined in "defg_template.txt". Here is the display of a run-time error showing an OpenCL failure:
Example from Windows:
C> sobel
NOTICE: run1 CPU
clCreateKernel status: -44
C>
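Rather than looking up a raw status such as -44 by hand in the header each time, a small translation table can name it directly. The values below are the standard ones from the Khronos cl.h header; the helper function itself is ours, not an OpenCL or DEFG API.

```c
#include <string.h>

/* Map a few common OpenCL status codes to their cl.h names.
 * The code values are the standard Khronos definitions; the
 * helper function is illustrative, not part of OpenCL or DEFG. */
const char *cl_error_name(int code) {
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -30: return "CL_INVALID_VALUE";
        case -44: return "CL_INVALID_PROGRAM";
        case -46: return "CL_INVALID_KERNEL_NAME";
        case -48: return "CL_INVALID_KERNEL";
        default:  return "unknown; see the OpenCL header";
    }
}
```

For the failure shown above, cl_error_name(-44) yields CL_INVALID_PROGRAM, which points at the kernel program object handed to clCreateKernel rather than at the kernel name.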


The "NOTICE:" line is a DEFG informational message and not an error. The subsequent line is the actual error output message. The "clCreateKernel" is the name of the OpenCL API function that failed, the number shown with it is the source-code line number displaying this error message, and the -44 is the returned OpenCL error code. These displayed error codes are normally listed and described in the "opencl.h" header file.

A.8.3 Useful Debugging Techniques

The C/C++ programs generated by DEFG, like any other programs, may contain errors. Here are some debugging techniques that are known to be useful with DEFG problems and programming mistakes:

- Have available, and use, an existing test case with known results.
- Debug the DEFG program first on a simple computer using a single DEFG device before debugging it on a complex server with high-powered GPUs. Any Windows personal computer large enough to support C/C++ compilers can run OpenCL, with or without an actual GPU defined to OpenCL.
- If possible, debug each program portion separately. DEFG program lines can be commented out by inserting "//" at the beginning of the program line.
- Use call statements to the dumpScalar and dumpBuffer functions to display intermediate results.
- Use code morsels to output descriptive text, as well as processing results.
- Use additional temporary DEFG data buffers to move intermediate results to the CPU for additional analysis.
- Inspect the generated C/C++ code. It is possible to temporarily modify this code with additional debugging statements.


APPENDIX B
Source Code and Other Items

B.1 Hardware and Software Description

We used three different platforms in our research. The vast majority of our work was done on the University of Colorado Denver, Department of Computer Science and Engineering's Penguin server, known as Hydra. In certain instances, two other computers were used. All three are described below in Table B.1.

Server: Hydra
  Penguin Computing Cluster, Linux CentOS 5.3, AMD Opteron 2427 2.2GHz, 24GB RAM, using NVIDIA OpenCL SDK 4.0; two NVIDIA Tesla T20s, each with 14 Compute Units, 1147MHz and 2687M RAM
  OpenCL: CUDA version 5 driver and OpenCL 1.1 (CUDA 4.2.1)
  Compiler: gcc version 4.4.4

Server: Rabbit
  Single server, Linux 2.6.32-5-686, AMD Sempron 145 2.8GHz, 2GB RAM, using AMD SDK 2.8; GPU1: AMD HD7850 (2GB RAM) and GPU2: AMD Radeon R9 270X (2GB RAM)
  OpenCL: AMD Catalyst 13.25.5 driver and AMD OpenCL 2.8 SDK
  Compiler: gcc version 4.4.5 (Debian 4.4.5-8)

CPU-only
  Windows 7, Intel i3 processor, 1.33GHz, 4GB RAM, no GPU
  OpenCL: support via AMD OpenCL 2.8 SDK x86 CPU driver
  Compiler: Microsoft VC 2008

Table B.1: Testing Configurations, Hardware and Software

B.2 Suggested DEFG Technical Improvements

DEFG, as with many other complex software implementations, can benefit from added features and existing-feature enhancements. Below is a list of potential DEFG features, in no particular order, that we have considered for future addition to DEFG.

1. Add a DEFG optimizer step to verify the in/out/inout option settings. These settings are used with execute, multi_exec, and call statements. Currently, DEFG does not cross-check the option settings; the option settings coded by the developer might be syntactically valid, but semantically wrong. Using the wrong option settings can cause DEFG to omit required transfer operations, creating hard-to-find failures. Verification would facilitate finding DEFG coding errors.

2. The DEFG code statement, the morsel, could be enhanced to include an optional list of DEFG field names and their associated in/out/inout settings. This would make it possible for the DEFG optimizer to assist in always having valid data present in the DEFG variables and buffers. In the present version, this is the responsibility of the developer.

3. DEFG currently allocates a DEFG_MAX_BUF-sized CPU memory segment for each declared buffer. This approach can waste a significant amount of CPU memory. Once the CPU buffer is loaded with data, DEFG could release the unused CPU memory. The DEFG CPU buffers are currently allocated with a C++ malloc call. The unused memory could be carefully released with an associated realloc call. The performance implications of this change would need to be explored. We note that the memory allocated on the GPU is not allocated in fixed-size blocks; its allocation size is determined by the width of the data stored.

4. The DEFG interchange statement is often used, after an execute or multi_exec statement, to swap the contents of two DEFG buffers. With a syntax change to the execute and multi_exec statements, the interchange statement functionality


could be included within the execute and multi_exec. This would make the DEFG programs simpler, and likely faster.

5. DEFG has the ability to automatically collect run-time statistics for each major OpenCL API request. This facility could be expanded to collect timing data for all major DEFG actions, including the calls to CPU functions and the execution of DEFG morsels. The added information could then be used to obtain detailed run-time profile statistics, which are often quite helpful in achieving high levels of performance.

6. The DEFG optimization is currently performed statically; it is completed just before program generation. In the future, DEFG could also have the option to utilize dynamic run-time optimization of the OpenCL buffer transfers. This change would greatly simplify the static optimization and likely permit DEFG to provide additional looping structures.

B.3 The DEFG Mini-Experiment with Four GPUs

One of our goals is to produce applications that utilize multiple GPUs. In this short discussion, we describe our results from performing a mini-experiment with a small, 4-GPU, DEFG application. We wrote a computationally intense DEFG diagnostic program, called DIAG4WAY; it performed, 26 times, the multi_exec of a small kernel. This kernel mainly executed this step:

for (int i = 0; i < 1024*512; i++) {
  d = d + i; e = (int)sqrt((float)d); d = d - e; d = d + e;
}

The purpose of this contrived workload is obviously to keep the GPU very busy. We would have preferred to run our existing MEDIAN5M application in this 4-GPU environment; however, as we lacked administrator privileges on the 4-GPU server, we were not able to install the needed AMD SDK to provide the image-loading modules


needed by our filtering applications. We used DIAG4WAY instead, and we obtained interesting results. The DIAG4WAY source code is in Section B.4.

Figure B.1: Run-Time Comparison with 1, 2 and 4 GPUs

We ran our diagnostic four times for each of the 1-GPU, 2-GPU, and 4-GPU execution modes and averaged the respective run times. These run-time averages are: 1-GPU: 0.797 secs, 2-GPU: 0.409 secs, and 4-GPU: 0.220 secs. Figure B.1 graphs the execution seconds against the number of GPUs. The 2-GPU test shows a 1.95 speedup and the 4-GPU shows a 3.62 speedup.
DEFG shows a significant multiple-GPU performance gain with this very computationally intense application. A special thanks to Mark Smith, at the Exxact Corporation [1], for providing a weekend of access to one of their GPU-equipped servers. This server had 32 Xeon processors (E5-2660 @ 2.20GHz) and four NVIDIA K20, Kepler-generation GPU cards, each with 5119MB of RAM.

[1] Exxact Corporation, Fremont, CA; a distributor of AMD, NVIDIA, and PNY products; URL: www.exxactcorp.com


B.4 DEFG Application Source Code

B.4.1 BFSDP2GPU Application

001. // bfsdp2gpu.txt: BFS Harish-like version with VPs for 2 GPUs
002. declare application bfsdp2gpu
003. declare integer NODE_CNT
004.         integer NODE_CNTt2
005.         integer NODE_CNT0(0)
006.         integer NODE_CNT0p1
007.         integer NODE_CNT1(0)
008.         integer NODE_CNT1p1
009.         integer KCNT
010.         integer KCNT0
011.         integer KCNT1
012.         integer EDGE_CNT
013.         integer MAX_DEGREE
014.         integer STOP
015.         integer STOP0
016.         integer STOP1
017.         integer LIST_WIDTH
018.         integer listused0(0)
019.         integer listused1(0)
020. declare gpu gpugrp(all)
021. declare kernel bermanPrefixSumP1 bfsdp_kernelv3 [[1D,NODE_CNTt2]]
022.         kernel bermanPrefixSumP2b bfsdp_kernelv3 [[1D,NODE_CNTt2]]
023.         kernel getCellValue bfsdp_kernelv3 [[1D,1]]
024.         kernel kernel1a2 bfsdp_kernelv3 [[1D,NODE_CNT]]
025.         kernel kernel1b bfsdp_kernelv3 [[1D,EDGE_CNT]]
026.         kernel kernel2 bfsdp_kernelv3 [[1D,NODE_CNT]]
027. declare integer buffer graph_edges(EDGE_CNT) nonpartable
028.         integer buffer frontier0(LIST_WIDTH) nonpartable // list
029.         integer buffer payload0(LIST_WIDTH) nonpartable // list
030.         integer buffer frontier1(LIST_WIDTH) nonpartable // list
031.         integer buffer payload1(LIST_WIDTH) nonpartable // list
032.         struct buffer graph_nodes(NODE_CNT) multi
033.         integer buffer graph_mask(NODE_CNT) multi
034.         integer buffer updating_graph_mask(NODE_CNT) multi
035.         integer buffer graph_visited(NODE_CNT) multi
036.         integer buffer cost(NODE_CNT) multi
037.         integer buffer offset(NODE_CNT) multi
038.         integer buffer offset2(NODE_CNT) multi
039. call init_input graph_nodes out
040.      graph_edges out
041.      graph_mask out
042.      updating_graph_mask out
043.      graph_visited out
044.      cost out
045.      NODE_CNT in
046.      EDGE_CNT out
047.      MAX_DEGREE out
048.
049. // partition nodes and some other buffers for multi-GPU use
050. call ArrayPartition2GPU 2 graph_nodes inout
051.      graph_visited inout
052.      cost inout
053.      graph_mask inout
054.      updating_graph_mask inout
055.      NODE_CNT0 out
056.      NODE_CNT1 out
057.      NODE_CNT in
058.      DEFG_GLOB *


059.
060. code [[ NODE_CNT0p1 = NODE_CNT0 + 1; NODE_CNT1p1 = NODE_CNT1 + 1; NODE_CNTt2 = (NODE_CNT+1)*2; ]]
061. loop
062. multi_exec run2 bermanPrefixSumP1 offset out
063.      graph_mask in
064.      NODE_CNT0p1 @0 in
065.      NODE_CNT1p1 @1 in
066.
067. code [[ KCNT0 = (int)ceil(log((double)NODE_CNT0p1)/log(2.0)); ]]
068. code [[ KCNT1 = (int)ceil(log((double)NODE_CNT1p1)/log(2.0)); ]]
069. code [[ KCNT = KCNT0 >= KCNT1 ? KCNT0 : KCNT1; ]]
070. sequence KCNT times
071. multi_exec run2 bermanPrefixSumP2b offset2 inout
072.      offset inout
073.      DEFG_CNT in
074.      NODE_CNT0p1 @0 in
075.      NODE_CNT1p1 @1 in
076.
077. code [[ if (KCNT%2 == 0) { cl_mem s = defg_buffer_offset[0]; defg_buffer_offset[0] = defg_buffer_offset2[0]; defg_buffer_offset2[0] = s; } ]]
078. code [[ if (KCNT%2 == 0) { cl_mem s = defg_buffer_offset[1]; defg_buffer_offset[1] = defg_buffer_offset2[1]; defg_buffer_offset2[1] = s; } ]]
079. multi_exec run3 getCellValue offset2 in
080.      NODE_CNT0 @0 in
081.      NODE_CNT1 @1 in
082.      listused0 @0 out
083.      listused1 @1 out
084.
085. call sync listused0 in
086. call sync listused1 in
087. code [[ if (listused0 >= LIST_WIDTH || listused1 >= LIST_WIDTH) {
088.    printf("Error list overflow %d %d.\n", listused0, listused1); exit(0);
089. } ]]
090. multi_exec s2 kernel1a2 graph_nodes in
091.      graph_edges in
092.      graph_mask inout
093.      offset2 in
094.      cost in
095.      frontier0 @0 out
096.      payload0 @0 out
097.      NODE_CNT0 @0 in
098.      frontier1 @1 out
099.      payload1 @1 out
100.      NODE_CNT1 @1 in
101.
102. broadcast frontier0 @0
103. broadcast payload0 @0
104. broadcast frontier1 @1
105. broadcast payload1 @1
106. multi_exec s3 kernel1b frontier0 in
107.      payload0 in
108.      listused0 in
109.      frontier1 in
110.      payload1 in
111.      listused1 in
112.      updating_graph_mask inout
113.      graph_visited in
114.      cost inout


115.      DEFG_GPU in // which GPU
116.
117. set STOP0 0
118. set STOP1 0
119. set STOP 0
120. multi_exec s4 kernel2 graph_nodes in
121.      graph_mask inout
122.      updating_graph_mask inout
123.      graph_visited inout
124.      STOP0 @0 inout
125.      STOP1 @1 inout
126. //   STOP inout // debug, is wrong
127.      NODE_CNT0 @0 in
128.      NODE_CNT1 @1 in
129.
130. call logicalor STOP out STOP0 in STOP1 in
131. while STOP eq 1
132. // merge costs into 1 array for testing
133. call MergeCost2GPU 2 cost inout DEFG_GLOB *
134. call disp_output cost * NODE_CNT *
135. end

OpenCL Kernels:

001. // kernel for DEFG BFSDP2GPU with PrefixSum-based buffer allocation
002. // using "berman" 2-phase PrefixSum version
003. //
004. // macros used to separate device and node for VP
005. #define MAP_DEVICE(x) x&1
006. #define MAP_NODE(x) x>>1
007. typedef struct
008. {
009.   int starting;
010.   int no_of_edges;
011. } Node;
012. ////
013. //// PrefixSum kernels -- two versions
014. ////
015. // Simple 1-thread prefix sum -- slow but reliable -- unused
016. __kernel void kernelPrefixSum(
017.   __global int *output, // buffer of sums
018.   __global int *input,  // buffer of values
019.   __global int *block,  // work area
020.   const int length)     // length of buffers
021. {
022.   if (length < 1) return;
023.   // clearly, i must be increasing with each call...
024.   // NOTICE: length+1; goes 1 past end of normal buffer!!!!
025.   for (int k = 0; k

042.   if (offset > size) return;
043.   if (offset == 0) {
044.     output[0] = 0;
045.   } else if (offset == 1) {
046.     output[1] = input[0];
047.   } else {
048.     output[offset] = input[offset-2] + input[offset-1];
049.   }
050.   return;
051. }
052. // Berman page 378, part 2
053. __kernel void bermanPrefixSumP2b(
054.   __global int *buffer2, // buffer of new partial sums
055.   __global int *buffer1, // buffer of partial sums
056.   const int CNT,
057.   const int size) // size of buffer
058. {
059.   int offset = get_global_id(0);
060.   if (offset >= size) return;
061.   int k = 1 < buffer2
064.   buffer2[offset] = buffer1[offset];
065.   if (offset buffer1
069.   buffer1[offset] = buffer2[offset];
070.   if (offset 0
104. {
105.   int cost = g_cost[tid];


106.         int max = g_graph_nodes[tid].no_of_edges + g_graph_nodes[tid].starting;
107.         int index = g_graph_offset[tid];
108.         for (int i = g_graph_nodes[tid].starting; i < ...
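The two-phase "berman" prefix sum above (kernel `bermanPrefixSumP1` seeding pairwise partial sums, kernel `bermanPrefixSumP2b` run `ceil(log2(n+1))` times by the DEFG host loop with buffer swaps) can be sketched on the CPU as follows. This is an illustrative sketch of the doubling scheme, not the DEFG-generated code; the function names are invented here.

```c
#include <assert.h>
#include <string.h>

/* Phase 1: seed an exclusive scan with pairwise partial sums,
   mirroring kernel bermanPrefixSumP1 (output has n+1 entries). */
static void prefix_phase1(const int *input, int *output, int n)
{
    for (int i = 0; i <= n; i++) {
        if (i == 0)      output[0] = 0;
        else if (i == 1) output[1] = input[0];
        else             output[i] = input[i-2] + input[i-1];
    }
}

/* Phase 2: doubling passes with strides 2, 4, 8, ..., mirroring the
   repeated bermanPrefixSumP2b launches; each pass writes into a second
   buffer and the host swaps them, as the DEFG code[[...]] blocks do. */
static void prefix_phase2(int *buf, int *tmp, int n)
{
    for (int k = 2; k <= n; k <<= 1) {
        for (int i = 0; i <= n; i++)
            tmp[i] = (i >= k) ? buf[i] + buf[i-k] : buf[i];
        memcpy(buf, tmp, (size_t)(n + 1) * sizeof(int));
    }
}
```

After phase 1, `buf[i]` covers a window of at most two inputs ending at `i-1`; each doubling pass glues a completed prefix onto the front of that window, so `buf[i]` becomes the exclusive prefix sum of the first `i` inputs.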

B.4.2 IMIFLX Application
001. // imiflx.txt: Altman iterative matrix inversion
002. // *input .txt|.mtx I|M|H [epsilon maxCycles newAlpha]
003. //
004. declare application imiflx
005. include [[
006. #define INDEX2(xi,xj,xsize,xind) xind = xi*xsize+xj;
007. #ifdef _WIN32
008. #define isfinite _finite
009. #endif
010. ]]
011. declare integer mSIZE
012.         integer mSIZEt2
013.         integer LIMIT
014.         integer cycles
015.         integer localWork
016.         integer localSize   // 2*localWork
017.         integer basketSize  // 2*localWork
018.         double epsilon
019.         double result
020.         double newAlpha
021.         double norm
022.         double alpha
023.         integer LCNT
024.         integer ITR
025.         double dOne 1.0
026.         double dZero 0.0
027.         double dThree 3.0
028.         double dmThree -3.0
029. declare gpu gpuone (*)
030. declare kernel CopyArray imiflx [[ 1D,mSIZEt2 ]]
031.         kernel PlusIdentityThree imiflx [[ 1D,mSIZE ]]
032.         kernel MinusMatThree imiflx [[ 1D,mSIZEt2 ]]
033.         kernel prefixSum imiflx [[ 1D,localWork:localWork ]]
034.         kernel MinusIdentity imiflx [[ 1D,mSIZE ]]
035.         kernel SweepSquares imiflx [[ 1D,basketSize ]]
036.         kernel ReadLastSqrt imiflx [[ 1D,1 ]]
037. declare double buffer mA mSIZE mSIZE
038.         double buffer mP mSIZE mSIZE
039.         double buffer mRn mSIZE mSIZE
040.         double buffer mRnp1 mSIZE mSIZE
041.         double buffer mW mSIZE mSIZE
042.         double buffer mS basketSize
043.         double buffer mBasket basketSize
044.         double buffer mLocal localSize local
045. call defg_matloader mA(out) mSIZE(out)
046. code[[ mSIZEt2 = mSIZE * mSIZE; ]]
047. code[[ if (mSIZEt2 >= (int)(DEFG_MAX_BUF * sizeof(double)))
     { printf("buffer overflow error\n"); exit(0); } ]]
048. code[[ cycles = mSIZEt2; ]]
049. code[[ newAlpha = -1.0; ]]
050. code[[ epsilon = 0.00001; ]]
051. code[[ if (argc > 2) { epsilon = atof(argv[2]); } ]]
052. code[[ if (argc > 3) { cycles = atoi(argv[3]); } ]]
053. code[[ if (argc > 4) { newAlpha = atof(argv[4]); } ]]
054. code[[ LIMIT = mSIZEt2 < 16 ? mSIZEt2 : 16; ]]
055. code[[ printf("imiflx GPU version using SweepSquares and prefixSum\n"); ]]
056. code[[ printf("SQMAT size: %d, epsilon: %lf, max cycles: %d, newAlpha: %g\n",
     mSIZE, epsilon, cycles, newAlpha); ]]
057. // compute norm
058. execute k1 SweepSquares mA(in) mSIZEt2(in) mBasket(inout) basketSize(in)
059. execute k2 prefixSum mS(out) mBasket(in) mLocal(*) basketSize(in)
060. execute k3 ReadLastSqrt mS(in) basketSize(in) norm(out)
061. release norm // gets value onto CPU


062. code[[ alpha = 1.0 / norm; ]]
063. code[[ if (newAlpha != -1.0) alpha = newAlpha; ]]
064. code[[ if (alpha < 0.0 || alpha >= 1.0) { printf("Error, invalid alpha!\n"); exit(0); } ]]
065. // cpu: mRn = IdentityMatrix * alpha
066. code[[ for (int i = 0; i < ...
     ...
069. ... -> mP
070. loop
071. // desired result: mRnp1 = mRn * (mI*3 - mP*3 + mP2); mP2 is mP*mP
072. // mW = mP * mP
073. blas dOne*mP*mP + dZero*mW -> mW
074. // mW += mI * 3
075. execute k8 PlusIdentityThree mW(inout) mSIZE(in)
076. // mW -= mP * 3
077. execute k9 MinusMatThree mW(inout) mP(in) mSIZEt2(in)
078. // mRnp1 = mRn * mW
079. blas dOne*mRn*mW + dZero*mRnp1 -> mRnp1
080. // mP = mA * mRnp1
081. blas dOne*mA*mRnp1 + dZero*mP -> mP
082. // copy mP to mW
083. execute k10 CopyArray mW(out) mP(in) mSIZEt2(in)
084. // mW -= mI
085. execute k11 MinusIdentity mW(inout) mSIZE(in) // note: mSIZE not mSIZEt2
086. // result = norm(mW)
087. execute k12 SweepSquares mW(in) mSIZEt2(in) mBasket(inout) basketSize(in)
088. execute k13 prefixSum mS(out) mBasket(in) mLocal(*) basketSize(in)
089. execute k14 ReadLastSqrt mS(in) basketSize(in) result(out)
090. release result // gets value onto CPU
091. code[[ ITR = LCNT + 1; ]]
092. code[[ if (!isfinite(result)) { printf("infinite result, exiting\n"); LCNT = cycles; } ]]
093. code[[ if (result != result) { printf("result is nan, exiting\n"); LCNT = cycles; } ]]
094. code[[ if (result <= epsilon) LCNT = cycles; ]]
095. loop_escape at 6 secs // "anytime" processing
096. // mRn <==> mRnp1
097. interchange mRn mRnp1
098. call inc LCNT(inout)
099. while LCNT lt cycles
100. call defg_write_matrix mRn(in) mSIZE(in)
101. end

OpenCL Kernels:
The prefix sum kernel is used from the AMD OpenCL 2.8 SDK and is copyrighted by AMD;
the full kernel source code can be obtained from this SDK.
001-076 contain the prefix_sum kernel and are omitted; see above.
077. //
078. // senser-written kernels start here
079. //
080. __kernel void CopyArray(
081.     __global double *output, // buffer of out data values
082.     __global double *input,  // buffer of in data values
083.     const int length)        // length of buffer
084. {
085.     unsigned int tid = get_global_id(0);
086.     if (tid >= length) return;
087.     output[tid] = input[tid];
088. }
089. __kernel void SweepSquares(
090.     __global double *input,  // buffer of data values
091.     const int length,        // full length of buffer
092.     __global double *basket, // basket of partial sums
093.     const int basket_length) // full length of basket
094. {
095.     double d;
096.     double sum = 0.0;


097.     // int index;
098.     if (length < 1) return;
099.     unsigned int tid = get_global_id(0);
100.     if (tid >= length) return;
101.     // strides of basket_length size....
102.     for (int k = tid; k < ...
         ...
114.     if (tid >= length) return;
115.     unsigned int offset = tid * length + tid;
116.     matrix[offset] += 3;
117. }
118. __kernel void MinusMatThree(
119.     __global double *baseMatrix,  // buffer of result values
120.     __global double *otherMatrix, // 2nd matrix buffer
121.     const int length)             // size of full matrix
122. {
123.     unsigned int tid = get_global_id(0);
124.     if (tid >= length) return;
125.     baseMatrix[tid] -= 3 * otherMatrix[tid];
126. }
127. __kernel void MinusIdentity(
128.     __global double *matrix, // buffer of values
129.     const int length)        // size of diag, NOT full length
130. {
131.     unsigned int tid = get_global_id(0);
132.     if (tid >= length) return;
133.     unsigned int offset = tid * length + tid;
134.     matrix[offset] -= 1.0;
135. }
136. __kernel void ZeroBasketDEAD(
137.     __global double *matrix, // basket of future partial sums
138.     const int length)        // size of basket
139. {
140.     unsigned int tid = get_global_id(0);
141.     if (tid >= length) return;
142.     matrix[tid] = 0.0;
143. }
144. __kernel void ReadLastSqrt(
145.     __global double *matrix, // array
146.     const int length,        // size of array
147.     __global double *last)   // return value
148. {
149.     *last = sqrt(matrix[length-1]);
150. }


B.4.3 MEDIAN Application
01. // median.txt: Median algorithm in DEFG syntax
02. declare application median
03. declare integer Xdim
04.         integer Ydim
05.         integer BUF_SIZE
06. declare gpu gpuone (*)
07. declare kernel median_filter insertcode [[
08. __kernel void median_filter(__global uint *inputImage, __global uint *outputImage)
09. {
10.     uint sort_buf[9];
11.     uint h;
12.     int i;
13.     int j;
14.     uint x = get_global_id(0);
15.     uint y = get_global_id(1);
16.
17.     uint width = get_global_size(0);
18.     uint height = get_global_size(1);
19.     int c = x + y * width;
20.     if (x >= 1 && x < ...
        ...
34.     if (sort_buf[i] > sort_buf[j]) {
35.         h = sort_buf[i];
36.         sort_buf[i] = sort_buf[j];
37.         sort_buf[j] = h;
38.     }
39.     }
40.     }
41.     outputImage[c] = sort_buf[4];
42.     }
43. }
44. ]] [[ 2D,Xdim,Ydim ]]
45. declare integer buffer image1 Xdim Ydim halo
46.         integer buffer image2 Xdim Ydim halo
47. call init_input image1(in) Xdim(out) Ydim(out) BUF_SIZE(out)
48. start_timer
49. execute run1 median_filter image1(in) image2(out)
50. call sync image2(in) // helps timer accuracy
51. end_timer
52. call disp_output image2(in) Xdim(in) Ydim(in)
53. output_timer
54. end
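The kernel above assigns one work-item per pixel: it gathers the 3x3 neighbourhood into `sort_buf`, bubble-sorts it, and writes `sort_buf[4]` (the median) to the output. A CPU sketch of the same per-pixel computation (illustrative names; border pixels skipped, as in the kernel's range test):

```c
#include <assert.h>

/* 3x3 median filter over an interior pixel region, mirroring the
   gather/bubble-sort/select logic of the median_filter kernel. */
static void median3x3(const unsigned int *in, unsigned int *out,
                      int width, int height)
{
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            unsigned int sort_buf[9];
            int n = 0;
            for (int dy = -1; dy <= 1; dy++)       /* gather 3x3 window */
                for (int dx = -1; dx <= 1; dx++)
                    sort_buf[n++] = in[(y+dy)*width + (x+dx)];
            for (int i = 0; i < 8; i++)            /* bubble sort, as in kernel */
                for (int j = i + 1; j < 9; j++)
                    if (sort_buf[i] > sort_buf[j]) {
                        unsigned int h = sort_buf[i];
                        sort_buf[i] = sort_buf[j];
                        sort_buf[j] = h;
                    }
            out[y*width + x] = sort_buf[4];        /* the median */
        }
    }
}
```

An O(n²) sort of nine elements is cheap and branch-predictable on a GPU, which is why the kernel sorts rather than using a selection network.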


B.4.4 MEDIAN5 Application
01. // median5.txt: Median algorithm in DEFG syntax
02. // use 5x5, not 3x3, "window" to compute the median
03. declare application median
04. declare integer Xdim
05.         integer Ydim
06.         integer BUF_SIZE
07. declare gpu gpuone (*)
08. declare kernel median_filter insertcode [[
09. __kernel void median_filter(__global uint *inputImage, __global uint *outputImage)
10. {
11.     uint sort_buf[25];
12.     uint h;
13.     int i;
14.     int j;
15.     int k;
16.     uint x = get_global_id(0);
17.     uint y = get_global_id(1);
18.
19.     uint width = get_global_size(0);
20.     uint height = get_global_size(1);
21.     int c = x + y * width;
22.     if (x >= 2 && x < ...
        ...
42.     if (sort_buf[i] > sort_buf[j]) {
43.         h = sort_buf[i];
44.         sort_buf[i] = sort_buf[j];
45.         sort_buf[j] = h;
46.     }
47.     }
48.     }
49.     outputImage[c] = sort_buf[12];
50.     }
51. }
52. ]] [[ 2D,Xdim,Ydim ]]
53. declare integer buffer image1 Xdim Ydim halo
54.         integer buffer image2 Xdim Ydim halo
55. call init_input image1(in) Xdim(out) Ydim(out) BUF_SIZE(out)
56. execute run1 median_filter image1(in) image2(out)
57. call disp_output image2(in) Xdim(in) Ydim(in)
58. end


B.4.5 MEDIAN5M Application
01. // median5m.txt: Multi-GPU Median 5x5 algorithm in DEFG syntax
02. // use 5x5 "window" to compute the median
03. declare application median5
04. declare integer Xdim
05.         integer Ydim
06.         integer BUF_SIZE
07. declare gpu gpuone (all)
08. declare kernel median5_filter insertcode [[
09. __kernel void median5_filter(__global uint *inputImage, __global uint *outputImage)
10. {
11.     uint sort_buf[25];
12.     uint h;
13.     int i;
14.     int j;
15.     int k;
16.     uint x = get_global_id(0);
17.     uint y = get_global_id(1);
18.
19.     uint width = get_global_size(0);
20.     uint height = get_global_size(1);
21.     int c = x + y * width;
22.     if (x >= 2 && x < ...
        ...
42.     if (sort_buf[i] > sort_buf[j]) {
43.         h = sort_buf[i];
44.         sort_buf[i] = sort_buf[j];
45.         sort_buf[j] = h;
46.     }
47.     }
48.     }
49.     outputImage[c] = sort_buf[12];
50.     }
51. }
52. ]] [[ 2D,Xdim,Ydim ]]
53. declare integer buffer image1 Xdim Ydim halo
54.         integer buffer image2 Xdim Ydim halo
55. call init_input image1(in) Xdim(out) Ydim(out) BUF_SIZE(out)
56. multi_exec run1 median5_filter image1(in) image2(out)
57. call disp_output image2(in) Xdim(in) Ydim(in)
58. end


B.4.6 RSORT Application
01. // RSort.txt: Altman's roughly-sort algorithm in DEFG syntax
02. // usage: pgm
03. // arg1 can be an input file or the genK value if generating data
04. // arg2 is the size, if generating data, 2**size
05. declare application RSort
06. include [[ char *vers = "V1.1"; ]]
07. declare integer stride
08.         integer size 64
09.         integer sizeDB
10.         integer genK 0
11.         integer bufSize
12.         integer radius
13.         integer groups
14.         integer again
15.         integer offset
16.         integer offset2
17.         integer logSize
18. declare gpu gpuone (*)
19. declare kernel LRmax RSort_Kernels [[ 1D,size ]]
20.         kernel RLmin RSort_Kernels [[ 1D,size ]]
21.         kernel DM RSort_Kernels [[ 1D,size ]]
22.         kernel UB RSort_Kernels [[ 1D,size ]]
23.         kernel comb_sort RSort_Kernels [[ 1D,groups ]]
24. declare integer buffer arrayS bufSize
25.         integer buffer LR bufSize
26.         integer buffer LRout bufSize
27.         integer buffer RL bufSize
28.         integer buffer RLout bufSize
29.         integer buffer DMbuf bufSize
30. code[[ char *arg = "16"; if (argc > 1) { arg = argv[1]; } ]]
31. code[[ if (argc > 2) { size = (int)pow(2.0, (double)atoi(argv[2])); } ]]
32. code[[ if ((int)(size * sizeof(int)) >= DEFG_MAX_BUF)
     { printf("Error, buffer too small!\n"); exit(0); } ]]
33. // has to be C-style invocation due to arg being a string...
34. code[[ getArray(arg, arrayS, size); bufSize = size; ]]
35. code[[ logSize = (int)(log((double)size) / log(2.0)); ]]
36. code[[ if (bufSize > 16) sizeDB = 16; else sizeDB = bufSize; ]]
37. code[[ printf("version %s size: %d, logSize: %d\n", vers, size, logSize); ]]
38. // ==> LR
39. set stride
40. execute LR1 LRmax arrayS(in) LR(out) stride(in)
41. call times2 stride(*)
42. set again
43. loop
44. execute LR2 LRmax LR(inout) LRout(out) stride(in)
45. call times2 stride(*)
46. interchange LR LRout
47. code[[ again++; ]]
48. while again lt logSize
49. // ==> RL
50. set stride
51. execute RL1 RLmin arrayS(in) RL(out) stride(in)
52. call times2 stride(*)
53. set again
54. loop
55. execute RL2 RLmin RL(in) RLout(out) stride(in)
56. call times2 stride(*)
57. interchange RL RLout
58. code[[ again++; ]]
59. while again lt logSize
60. // ==> DM
61. execute DM1 DM LR(in) RL(in) DMbuf(out)


62. // ==> FU
63. call cpy groups(*) size(*)
64. loop
65. set again
66. call times2 radius(*)
67. execute UB1 UB DMbuf(in) size(in) radius(in) again(inout)
68. while again ne 0
69. code[[ radius *= 2; ]]
70. code[[ if (radius > size) { radius = size; } ]]
71. code[[ groups = (int)ceil((double)size / (double)radius); ]]
72. // ==> COMB SORT pass 1
73. execute SORT1 comb_sort arrayS(inout) radius(in) offset(in) groups(in)
74. // ==> COMB SORT pass 2
75. code[[ offset2 = radius / 2; ]]
76. // lower groups by one and then put it back
77. call dec groups(*)
78. // groups is used implicitly, see kernel declare...
79. execute SORT2 comb_sort arrayS(inout) radius(in) offset2(in) groups(in)
80. call inc groups(*)
81. call sync arrayS(in)
82. code[[ putMergeArray("sorted.txt", arrayS, size); ]]
83. end

OpenCL Kernels:
001. /*
002.  * For a description of the algorithm and the terms used, please see:
003.  * http://en.wikipedia.org/wiki/Comb_sort
004.  */
005. __kernel void comb_sort(__global int *base, uint size, uint offset, uint groups)
006. {
007.     uint block = get_global_id(0);
008.     if (block >= groups) return; // used with multi-GPU
009.     __global int *input;
010.     const float shrink = 1.3f;
011.     int swap;
012.     uint i, gap = size;
013.     bool swapped = false;
014.
015.     input = base + block * size + offset;
016.     while (gap > 1 || swapped) {
017.         if (gap > 1) {
018.             gap = (size_t)((float)gap / shrink);
019.         }
020.
021.         swapped = false;
022.
023.         for (i = 0; gap + i < ...
024.             if (input[i] - input[i + gap] > 0) {
025.                 swap = input[i];
026.                 input[i] = input[i + gap];
027.                 input[i + gap] = swap;
028.                 swapped = true;
029.             }
030.         }
031.     }
032.
033. }
034. __kernel void LRmax(__global int src[], __global int dst[], uint stride)
035. {
036.     // src is input array
037.     // dst is output array
038.     // stride is the offset to the relevant rhs cells


039.     uint block = get_global_id(0);
040.     uint size = get_global_size(0);
041.     uint arnold = size - stride;
042.     if (block >= arnold) return; /* Terminator! */
043.     uint js = block + stride;
044.     if (js >= size) return; /* Terminator II */
045.     int src_j_item = src[block];
046.     int src_js_item = src[js];
047.     if (block < ...
         ...
066.     if (block >= size - stride) { // copy already processed
067.         dst[block] = src[block];
068.     }
069.     if (src[js] > src[block]) {
070.         dst[js] = src[block];
071.     } else {
072.         dst[js] = src[js];
073.     }
074. }
075. __kernel void DM(__global int B[], __global int C[], __global int D[])
076. {
077.     int i = get_global_id(0);
078.     for (int j = i; j >= 0; j--) {
079.         if (j <= i && i >= 0 && C[i] <= B[j] && (j == 0 || C[i] >= B[j-1])) {
080.             D[i] = i - j;
081.             break;
082.         }
083.     }
084. }
085. __kernel void UB(__global int D[], uint size, int d, __global uint *again)
086. {
087.     if (*again == 1) return; // speeds up CPU version??
088.     int i = get_global_id(0);
089.     if (D[i] + 0 <= d) { // this +1 sets up a better radius for sorting control
090.         // good
091.     } else {
092.         *again = 1;
093.     }
094. }
095. __kernel void UBsplit(__global int D[], __global uint again[], uint size, int d)
096. {
097.     again[1] = d;
098.     if (again[0] != 0) return; // speeds up CPU version??
099.     int i = get_global_id(0);
100.     // debug ...
101.     ... d) { // this +1 sets up a better radius for sorting control
102.         again[0] = D[i]; // 1;


103.         again[1] = i;
104.     }
105. }
106. __kernel void UBreset(__global uint again[])
107. {
108.     if (get_global_id(0) > 0) return;
109.     again[0] = 0;
110. }
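The `comb_sort` kernel above assigns one work-item per group, each sorting an independent sub-array with the standard shrink-factor-1.3 comb sort. A single-array CPU rendering of the same inner loop (illustrative; the kernel's `block`/`offset` group addressing is dropped):

```c
#include <assert.h>
#include <stdbool.h>

/* Comb sort, mirroring the kernel's gap-shrinking loop: the gap is
   divided by 1.3 each pass until the array is swap-free at gap 1. */
static void comb_sort_cpu(int *input, unsigned int size)
{
    const float shrink = 1.3f;
    unsigned int gap = size;
    bool swapped = false;
    while (gap > 1 || swapped) {
        if (gap > 1)
            gap = (unsigned int)((float)gap / shrink);
        swapped = false;
        for (unsigned int i = 0; gap + i < size; i++) {
            if (input[i] - input[i + gap] > 0) {   /* out of order */
                int swap = input[i];
                input[i] = input[i + gap];
                input[i + gap] = swap;
                swapped = true;
            }
        }
    }
}
```

Comb sort suits the RSORT design because, after the rough-sort phases bound each element's displacement, every group is nearly sorted and the gap sequence converges quickly.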


B.4.7 RSORTM Application
01. // RSortm.txt: Altman's roughly-sort algorithm in DEFG syntax, multi-GPU
02. declare application RSortm
03. include [[
04. char *vers = "MV1.1b";
05. ]]
06. declare integer groupSize
07.         integer stride
08.         integer size
09.         integer sizeDB
10.         integer genK
11.         integer bufSize
12.         integer radius
13.         integer groups
14.         integer groupsMulti
15.         integer again
16.         integer offset
17.         integer offset2
18.         integer logSize
19.         integer againSize
20. declare gpu gpugrp (all)
21. declare kernel LRmax RSort_Kernels [[ 1D,size ]]
22.         kernel RLmin RSort_Kernels [[ 1D,size ]]
23.         kernel SHexit RSort_Kernels [[ 1D,size ]]
24.         kernel DM RSort_Kernels [[ 1D,size ]]
25.         kernel UBsplit RSort_Kernels [[ 1D,size ]]
26.         kernel UBreset RSort_Kernels [[ 1D,size ]]
27.         kernel comb_sort RSort_Kernels [[ 1D,groups ]]
28. declare integer buffer arrayS bufSize
29.         integer buffer LR bufSize
30.         integer buffer LRout bufSize
31.         integer buffer RL bufSize
32.         integer buffer RLout bufSize
33.         integer buffer DMbuf bufSize
34.         integer buffer againPart againSize
35. code[[ char *arg = "16"; if (argc > 1) { arg = argv[1]; } ]]
36. code[[ if (argc > 2) { size = (int)pow(2.0, (double)atoi(argv[2])); } ]]
37. code[[ if ((int)(size * sizeof(int)) >= DEFG_MAX_BUF
38.      { printf("Error, buffer too small!\n"); exit(0); } ]]
39. code[[ getArray(arg, arrayS, size); bufSize = size; ]]
40. code[[ if (bufSize > 16) sizeDB = 16; else sizeDB = bufSize; ]]
41. code[[ logSize = (int)(log((double)size) / log(2.0)) - 1; ]] // <<---- multi change
42. code[[ printf("version %s size: %d, logSize: %d\n", vers, size, logSize); ]]
43. // ==> LR
44. set stride
45. multi_exec LR1 LRmax arrayS(in) LR(out) stride(in)
46. call times2 stride(*)
47. // main LR loop
48. set again
49. loop
50. multi_exec LR2 LRmax LR(inout) LRout(out) stride(in)
51. call times2 stride(*)
52. interchange LR LRout
53. code[[ again++; ]]
54. while again lt logSize
55. // ==> RL
56. set stride
57. multi_exec RL1 RLmin arrayS(in) RL(out) stride(in)
58. call times2 stride(*)
59. // main RL loop
60. set again
61. loop
62. multi_exec RL2 RLmin RL(inout) RLout(out) stride(in)


63. call times2 stride(*)
64. interchange RL RLout
65. code[[ again++; ]]
66. while again lt logSize
67. // ==> DM
68. multi_exec DM1 DM LR(in) RL(in) DMbuf(out)
69. call cpy groups(*) size(*)
70. loop
71. multi_exec UB1 UBreset againPart(inout)
72. call times2 radius(inout)
73. multi_exec UB2 UBsplit DMbuf(in) againPart(inout) size(in) radius(in)
74. call sync againPart(in)
75. code[[ again = againPart[0] + againPart[2]; ]]
76. while again ne 0
77. code[[ radius *= 2; ]]
78. code[[ groups = (int)ceil((double)size / (double)radius); ]]
79. code[[ if (groups < 2) { printf("sort ended, too few sort groups, use 1 GPU sort!\n"); exit(0); } ]]
80. // ==> COMB SORT pass 1
81. code[[ groupsMulti = groups / DEFG_GPU_COUNT; ]]
82. multi_exec SORT1 comb_sort arrayS(inout) radius(in) offset(in) groupsMulti(in)
83. // ==> COMB SORT pass 2
84. code[[ offset2 = radius / 2; ]]
85. call dec groups(*)
86. code[[ groupsMulti = groups / DEFG_GPU_COUNT; ]] // this is critical!!
87. multi_exec SORT2 comb_sort arrayS(inout) radius(in) offset2(in) groupsMulti(in)
88. call sync arrayS(in)
89. code[[ putMergeArray("sorted.txt", arrayS, size); ]]
90. end

The OpenCL kernels for RSORTM are the RSORT kernels.


B.4.8 SOBEL Application
01. // Sobel.txt: Sobel algorithm in DEFG syntax
02. declare application sobel
03. declare integer Xdim
04.         integer Ydim
05.         integer BUF_SIZE
06. declare gpu gpuone (*)
07. declare kernel sobel_filter SobelFilter_Kernels [[ 2D,Xdim,Ydim ]]
08. declare integer buffer image1 Xdim Ydim halo
09.         integer buffer image2 Xdim Ydim halo
10. call init_input image1(in) Xdim(out) Ydim(out) BUF_SIZE(out)
11. execute run1 sobel_filter image1(in) image2(out)
12. call disp_output image2(in) Xdim(in) Ydim(in)
13. end

OpenCL Kernel:
The SOBEL kernel is used from the AMD OpenCL 2.8 SDK and is copyrighted by AMD. The full kernel source code can be obtained from this SDK.

B.4.9 SOBELM Application
01. // Sobelm.txt: Sobel algorithm in DEFG syntax
02. // 'm' version for 2-GPU execution
03. declare application sobelm
04. declare integer Xdim
05.         integer Ydim
06.         integer BUF_SIZE
07. declare gpu gpuone (all)
08. declare kernel sobel_filter SobelFilter_Kernels [[ 2D,Xdim,Ydim ]]
09. declare integer buffer image1 Xdim Ydim halo
10.         integer buffer image2 Xdim Ydim halo
11. call init_input image1(in) Xdim(out) Ydim(out) BUF_SIZE(out)
12. multi_exec run1 sobel_filter image1(in) image2(out)
13. call disp_output image2(in) Xdim(in) Ydim(in)
14. end

OpenCL Kernel:
The SOBEL kernel is used from the AMD OpenCL 2.8 SDK and is copyrighted by AMD. The full kernel source code can be obtained from this SDK.


B.5 DEFG Diagnostic Source Code
B.5.1 diagAT Diagnostic Program
01. //
02. // DiagAT.txt: diagnostic to verify anytime escape-at in DEFG
03. //
04. declare application diagat
05. declare integer BUF_SIZE
06.         integer iMore
07. declare gpu gpuone (*)
08. declare kernel dummy_kernel insertcode [[
09. __kernel void dummy_kernel(__global int *p1)
10. {
11.     p1[0] = 34;
12.     return;
13. }
14. ]] [[ 1D,1 ]]
15.         kernel dummy_kernel2 insertcode [[
16. __kernel void dummy_kernel2(__global int *p1)
17. {
18.     p1[0] = 99;
19.     return;
20. }
21. ]] [[ 1D,1 ]]
22. declare integer buffer b1 BUF_SIZE
23.         integer buffer b2 BUF_SIZE
24. start_timer
25. loop
26. execute run1 dummy_kernel b1(out)  // b1[0] = 34
27. execute run1 dummy_kernel2 b2(out) // b2[0] = 99
28. call sync b1(in)
29. call sync b2(in)
30. loop_escape at 20 ms
31. while iMore eq 1
32. end_timer
33. code[[ if (b1[0] == 34 && b2[0] == 99)
     printf("ok.\n"); else printf("error.\n"); ]]
34. output_timer
35. end


B.5.2 diagIK Diagnostic Program
01. //
02. // DiagIK.txt: diagnostic to verify include kernel code in DEFG
03. //
04. declare application diagik
05. declare integer BUF_SIZE
06. declare gpu gpuone (*)
07. declare kernel dummy_kernel insertcode [[
08. __kernel void dummy_kernel(__global int *p1)
09. {
10.     p1[0] = 34;
11.     return;
12. }
13. ]] [[ 1D,1 ]]
14.         kernel dummy_kernel2 insertcode [[
15. __kernel void dummy_kernel2(__global int *p1)
16. {
17.     p1[0] = 99;
18.     return;
19. }
20. ]] [[ 1D,1 ]]
21. declare integer buffer b1 BUF_SIZE
22.         integer buffer b2 BUF_SIZE
23. start_timer
24. execute run1 dummy_kernel b1(out)  // b1[0] = 34
25. execute run1 dummy_kernel2 b2(out) // b2[0] = 99
26. call sync b1(in)
27. call sync b2(in)
28. end_timer
29. code[[ if (b1[0] == 34 && b2[0] == 99)
     printf("ok.\n"); else printf("error.\n"); ]]
30. output_timer
31. end


B.5.3 diag4way Diagnostic Program
01. //
02. // Diag4way.txt: diagnostic to verify 4-way operations
03. // vers: C5D4_D6C7D6E2
04. declare application diag4way
05. declare integer BUF_SIZE
06.         integer N
07.         integer iMore
08. declare gpu gpuall (all)
09. declare kernel dummy_kernel insertcode [[
10. __kernel void dummy_kernel(__global int *p1)
11. {
12.     uint index = get_global_id(0);
13.     p1[index] = 2 * p1[index];
14.     int d = 0;
15.     int e;
16.     for (int i = 0; i < 1024*512; i++) { d = d + i;
         e = (int)sqrt((float)d); d = d - e; d = d + e; }
17.     p1[index] += d;
18.     p1[index] -= d;
19.     return;
20. }
21. ]] [[ 1D,BUF_SIZE ]]
22. declare integer buffer b1 BUF_SIZE
23. code[[ for (int ii = 0; ii < ...

B.6 DEFG Major Components
B.6.1 ParserV2.g: DEFG Grammar Source Code
DEFG Parser ANTLR code available from the Department of Computer Science and Engineering, University of Colorado Denver.
B.6.2 DEFGopt.java: DEFG Optimizer Source Code
DEFG Optimizer Java code available from the Department of Computer Science and Engineering, University of Colorado Denver.
B.6.3 defgv2.cpp: DEFG Code Generator Source Code
DEFG Code Generator C++ code available from the Department of Computer Science and Engineering, University of Colorado Denver.