Citation
A framework for performance tuning and analysis on parallel computing platforms

Material Information

Title:
A framework for performance tuning and analysis on parallel computing platforms
Creator:
Gehrke, Allison S. ( author )
Language:
English
Physical Description:
1 electronic file (203 pages)

Subjects

Subjects / Keywords:
Parallel computers -- Programming ( lcsh )
Performance technology ( lcsh )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Review:
Emerging parallel processor designs create a computing paradigm capable of advancing numerous scientific areas, including medicine, data mining, biology, physics, and earth sciences. However, the trends in many-core hardware technology have advanced far ahead of the advances in software technology and programmer productivity. For the most part, scientists and software developers leverage many-core and GPU (Graphics Processing Unit) computing platforms after painstakingly uncovering the inherent task and data-level parallelism in their application. In many cases, the development does not realize the full potential of the parallel hardware. Moreover, often the exploitable resources, such as processor registers and on-chip programmer-controlled memories, scale with each new generation of many-core system, and software performance drifts over hardware generations. An opportunity exists to meet the challenges in mapping scientific applications to parallel computer systems through a synthesis of architectural insight, profile-driven performance analysis, and execution optimization. This thesis explores an analysis and optimization framework that directs code-tuning strategies and applies science to the art of performance optimization for efficient execution on throughput-oriented systems. The framework demonstrates systematic performance gain through profile-driven analysis on three representative scientific kernels on three different throughput-oriented architectures.
Thesis:
Thesis (Ph.D.)--University of Colorado Denver
Bibliography:
Includes bibliographic references
System Details:
System requirements: Adobe Reader.
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Allison S. Gehrke.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
945638632 ( OCLC )
ocn945638632
Classification:
LD1193.E52 2015d G35 ( lcc )

Full Text
A FRAMEWORK FOR PERFORMANCE TUNING AND ANALYSIS ON
PARALLEL COMPUTING PLATFORMS
by
ALLISON S. GEHRKE
Master of Science, University of Colorado, Boulder, 1998
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Computer Science and Engineering
2015


This thesis for the Doctor of Philosophy degree by
Allison S. Gehrke
has been approved for the
Department of Computer Science and Engineering
by
Ilkyeun Ra, Advisor
Gita Alaghband, Chair
Tim Benke
Larry Hunter
Zhiping Walter
September 14, 2015


Gehrke, Allison S. (Ph.D., Computer Science and Information Systems)
A Framework for Performance Tuning and Analysis on Parallel Computing Platforms
Thesis directed by Associate Professor Ilkyeun Ra
ABSTRACT
Emerging parallel processor designs create a computing paradigm capable of ad-
vancing numerous scientific areas, including medicine, data mining, biology, physics,
and earth sciences. However, the trends in many-core hardware technology have
advanced far ahead of the advances in software technology and programmer produc-
tivity. For the most part, scientists and software developers leverage many-core and
GPU (Graphics Processing Unit) computing platforms after painstakingly uncover-
ing the inherent task and data-level parallelism in their application. In many cases,
the development does not realize the full potential of the parallel hardware. Moreover,
often the exploitable resources, such as processor registers and on-chip programmer-
controlled memories, scale with each new generation of many-core system, and software
performance drifts over hardware generations.
An opportunity exists to meet the challenges in mapping scientific applications to
parallel computer systems through a synthesis of architectural insight, profile-driven
performance analysis, and execution optimization. This thesis explores an analysis
and optimization framework that directs code-tuning strategies and applies science
to the art of performance optimization for efficient execution on throughput-oriented
systems. The framework demonstrates systematic performance gain through profile-
driven analysis on three representative scientific kernels on three different throughput-
oriented architectures.


The form and content of this abstract are approved. I recommend its publication.
Approved: Ilkyeun Ra


DEDICATION
I dedicate this thesis to my family. To my husband, Chris Gehrke, because this
simply would not have been possible without you at my side. I thank you for your
generosity, your patience, your devotion to our family, and for shouldering so much so
that I might succeed. To my children, Shea Gehrke, Kaya Gehrke, and Gale Gehrke
who are my pride and joy and who put up with many, many hours of "not now, I'm
working on my thesis." To my dad, James Timothy Skinner, Jr., who built a cluster
of computers that became my first experience with High Performance Computing
(HPC) that fostered my interest in HPC development and research. To my step-
mother, Beatriz Calvo, who was always there to help and who gave me the best piece
of advice: perfection means complete.


ACKNOWLEDGMENT
I would like to thank my advisor, Dr. Ilkyeun Ra, for his help in guiding this
research, for his patience, and for his goodwill. Dr. Ra enabled this work even
though it is not his primary research focus. I would also like to thank our collaborator,
Dr. Timothy Benke. Dr. Benke's research became the basis and motivation for this
thesis. I enjoyed the inter-disciplinary work the most and will continue to help advance
scientific computing and apply all I have learned in my career.


TABLE OF CONTENTS
Tables......................................................................... x
Figures .................................................................. xiii
Chapter
1. Introduction................................................................ 1
1.1 Motivation........................................................... 4
1.2 Contributions ....................................................... 7
1.3 Dissertation Outline................................................. 9
2. Literature Review.......................................................... 11
2.1 Workload Characterization........................................... 11
2.2 Performance Analysis and Optimization............................... 13
2.3 Automatic Performance Tuning ....................................... 17
3. Analysis Framework ........................................................ 22
3.1 Framework Methodology............................................... 26
3.2 NVidia Metrics and Analysis......................................... 28
3.2.1 Memory Analysis Metrics and Optimizations ..................... 36
3.2.2 Instruction Analysis Metrics and Optimizations................ 44
3.2.3 Latency Optimizations......................................... 52
3.3 Xeon Phi Metrics and Analysis....................................... 52
4. Scientific Kernel Characterization......................................... 56
4.1 Stencil............................................................. 59
4.2 Single Precision General Matrix Multiply (SGEMM).................... 61
4.3 Kinetic Simulation.................................................. 62
4.4 Characterization Summary............................................ 63
5. Hardware and Architecture Characterization................................. 66
5.1 Throughput Oriented Architectures................................... 66
5.2 NVidia Discrete GPU Architecture.................................... 67
5.2.1 Fermi....................................................... 74
5.2.2 Maxwell..................................................... 79
5.3 Intel Xeon Phi Coprocessor Architecture ........................ 82
5.3.1 Xeon-Phi Performance Tuning ................................ 85
6. Analysis and Results.................................................... 87
6.1 Fermi Optimization Analysis and Results........................... 87
6.1.1 SGEMM Analysis on Fermi..................................... 87
6.1.2 Stencil Analysis on Fermi................................... 93
6.2 Maxwell Optimization Analysis and Results....................... 106
6.2.1 SGEMM Analysis on Maxwell.................................. 106
6.2.2 Stencil Analysis on Maxwell................................ 109
6.2.3 RK4 Analysis on Maxwell.................................... 110
6.3 Xeon Phi Optimization Analysis and Results...................... 115
6.3.1 SGEMM Analysis on Xeon Phi Coprocessors ................... 115
6.3.2 Stencil Analysis on Xeon Phi Coprocessors.................. 123
6.4 Summary of Analysis and Results................................... 130
7. Conclusion.............................................................. 131
Bibliography .............................................................. 137
Appendix
A. Advancement of Computational Simulation in Kinetic Modeling........... 153
A.1 Related Work ..................................................... 153
A.1.1 Computational Methods To Investigate Ion Channels.......... 153
A.1.2 Simulation Tools........................................... 156
A.1.3 Numerical Methods for Modeling Chemical Kinetics .......... 158
A.1.4 Genetic Algorithm for Optimization ........................ 160
A.2 Modeling Ion Channels............................................ 161
A.3 Kingen Case Study ............................................. 172
A.3.1 Application Characterization and Profile................... 173
A.3.1.1 System Level Analysis................................ 174
A.3.1.2 Application Level Analysis........................... 175
A.3.1.3 Computer Architecture Analysis....................... 177
A.3.2 Computing Framework........................................ 179
A.3.3 Experimental Results and analysis.......................... 183
A.3.4 Case Study Conclusion...................................... 189


TABLES
Table
3.1 Low Level Instruction Related Hardware Events from NVidia.......... 29
3.2 Low Level Instruction Related Hardware Events from NVidia Continued 30
3.3 Low Level L1 Events from NVidia's nvprof............................. 31
3.4 Low Level Memory Related Hardware Events from NVidia..................... 32
3.5 Low Level Latency Related and Miscellaneous Hardware Events from NVidia 33
3.6 Fermi Derived Metrics from NVidia........................................ 34
3.7 L2 and DRAM Instruction to Byte Ratios on NVidia......................... 35
3.8 Fermi Memory Analysis Metrics............................................ 37
3.9 Maxwell Memory Transaction Metrics ...................................... 39
3.10 Maxwell Memory Throughput Metrics....................................... 40
3.11 Maxwell Memory Utilization Metrics...................................... 41
3.12 Miscellaneous Maxwell Memory Metrics.................................... 42
3.13 Maxwell Instruction Utilization Related Metrics......................... 47
3.14 Maxwell Flop and Efficiency Related Metrics ............................ 48
3.15 Maxwell Instruction Counts.............................................. 49
3.16 Maxwell Miscellaneous Instruction Metrics............................... 50
3.17 Xeon Phi Hardware Events from Intel's VTune [33]................... 53
3.18 Xeon Phi Formulas for Performance Analysis ............................. 54
4.1 Example Instruction Counts and Distributions............................. 58
4.2 Various static metrics across each benchmark............................. 64
4.3 Architecture Stresses of Benchmarks Before and After Optimization ... 65
5.1 NVidia Architectures..................................................... 68
5.2 Compute Capabilities on NVidia Hardware ................................. 68
5.3 NVidia Core Architecture................................................. 69
5.4 Theoretical Peaks on NVidia Architecture................................. 70
5.5 Execution Limits in NVidia GPUs ..................................... 71
5.6 Essential Hardware Features and Throughputs on Fermi and Maxwell . . 72
5.7 Instruction Throughputs on Fermi and Maxwell..................... 74
5.8 Summary of Features on First Three Generations of NVidia GPUs [6] . . 76
5.9 Minimum Theoretical CPIs......................................... 84
6.1 SGEMM Base CUDA Implementation with Small Input...................... 88
6.2 SGEMM Base CUDA Implementation with Medium Input..................... 90
6.3 SGEMM Optimized CUDA Implementation with Medium Input............ 92
6.4 Stencil Base CUDA Implementation with Default Input ................. 94
6.5 Stencil Optimized CUDA Implementation with Default Input......... 96
6.6 Kingen Model01 Baseline Implementation with 8,192 Chromosomes ... 98
6.7 Kingen Model01 Memory Access Optimization with 8,192 Chromosomes 100
6.8 Kingen Model01 Memory Access Optimization with 8,192 Chromosomes 101
6.9 Kingen Model01 Register Optimization................................ 102
6.10 Kingen Model01 Fix yt dym Usage Optimization........................ 104
6.11 SGEMM Baseline Implementation with Medium Dataset on Maxwell. . 107
6.12 SGEMM Baseline Implementation with Medium Dataset on Maxwell.
Only memory related metrics are shown since the kernel is memory-bound. 108
6.13 SGEMM Optimized Implementation with Medium Dataset on Maxwell . 109
6.14 Stencil Baseline Implementation on Maxwell........................ 111
6.15 Stencil Optimized Implementation on Maxwell....................... 112
6.16 RK4 Baseline Implementation on Maxwell............................ 113
6.17 RK4 Optimized Implementation on Maxwell........................... 114
6.18 SGEMM Baseline Implementation on Xeon-Phi......................... 116
6.19 SGEMM Baseline Implementation on Xeon-Phi. Investigation may be
warranted if the measured values satisfy the thresholds of the performance
heuristic. Actual values are filled in under the value column and the
performance heuristic is provided for reference in the last column. 117
6.20 SGEMM Transpose Optimization on Xeon-Phi .......................... 118
6.21 SGEMM Transposed Optimization on Xeon-Phi. Investigation may be
warranted if the measured values satisfy the thresholds of the performance
heuristic. Actual values are filled in under the value column and the
performance heuristic is provided for reference in the last column. 120
6.22 SGEMM MKL Library Optimization on Xeon-Phi........................... 121
6.23 SGEMM MKL Optimization on Xeon-Phi................................... 122
6.24 SGEMM Scaling Over Three Implementations............................. 124
6.25 Stencil Baseline Implementation on Xeon-Phi.......................... 125
6.26 Stencil Baseline Analysis Metrics on Xeon-Phi........................ 126
6.27 Stencil Cache-Blocking Optimization on Xeon Phi Measured Events . 127
6.28 Measured Analysis Metrics from Stencil Cache-Blocking Optimization on
Xeon-Phi............................................................. 128
6.29 Stencil Results on Xeon Phi as Reported by the Application........... 129
6.30 Baseline and Optimized Run-times..................................... 130
6.31 Optimization Speedups................................................ 130
7.1 NVidia Analysis Summary.............................................. 132
7.2 NVidia Optimization Summary.......................................... 133
7.3 Xeon Phi Optimization Summary........................................ 134
A.1 Comparison of Process Runtime Between Compilers...................... 176
A.2 Comparison of CPI and FP Impacting Metrics........................... 176
A.3 Percentage of the Process Runtime By Function........................ 177
A.4 Chromosome Distribution.............................................. 187


FIGURES
Figure
1.1 Analysis Framework Process Flow......................................... 4
3.1 Performance Architecture Continuum..................................... 22
3.2 Framework Optimization Algorithm ...................................... 25
4.1 Stencil Baseline Implementation Pseudocode............................. 61
4.2 Kernel Instruction Mix Measured on Maxwell............................. 65
5.1 Fermi Architecture..................................................... 78
5.2 Fermi Symmetric Multiprocessor with Core .............................. 79
5.3 Maxwell Symmetric Multiprocessor Design................................ 81
5.4 Xeon Phi Coprocessor Architecture...................................... 83
6.1 Roofline Model for C2050 with and without ECC........................ 105
6.2 SGEMM Scaling Over each Optimization ................................. 123
A.1 Example Kinetic Scheme (model)......................................... 164
A.2 Kinetic Scheme Proposed By [148]....................................... 165
A.3 Implementation of Kinetic Scheme Proposed By [148]..................... 166
A.4 Optimization Improved Kinetic Scheme Proposed By [148]................. 167
A.5 Modified kinetic scheme................................................ 168
A.6 Optimization of Revised Model Improves Fit............................. 169
A.7 New Models Under Analysis.............................................. 170
A.8 Analysis Process....................................................... 173
A.9 Thread Utilization..................................................... 174
A.10 Instruction Mix...................................................... 178
A.11 Computational Complexity............................................. 180
A.12 Calculation Complexity............................................... 183
A.13 Complexity Graph..................................................... 184
A.14 Kernel Source Code................................................... 185
A.15 Speedup on Several Architectures................................... 186
A.16 Speedup with Larger Workload....................................... 188


1. Introduction
A major paradigm shift occurred during the last decade in computer architecture
design. Programmers are no longer able to rely on significant increases in processor
clocks with each new generation of hardware to increase application performance.
The multi-core and many-core revolution is well underway and with it the necessity
to leverage advanced parallel systems to increase application performance generation
over generation. The milestone switch to many-core led to a desperate need for new
tools and methodologies for performance analysis and optimization.
The tectonic shift in computing led to the emergence of different classes of ap-
plication, including throughput computing applications, and to the rise of high per-
formance computing (HPC) in diverse domains. 3-D graphics applications running
on graphics hardware (whether discrete or integrated) are a prototypical example of
throughput orientation: billions of pixels are processed within the render time of a frame.
Throughput computing is a broader category of application that includes, but is not
limited to, graphics applications. Throughput computing applications are charac-
terized as having plenty of data level or task level parallelism and the data can be
processed independently and in any order on different processing elements, for a sim-
ilar set of operations [105]. In general, throughput computing applications are ideal
for parallel architectures. The synthesis of core principles to describe, understand,
and compare performance of throughput-oriented architectures is explored in detail
in this dissertation.
Before the many-core revolution, the markets served by central processing units
(CPU) and graphics processing units (GPU) were very clearly defined. Today's com-
puting landscape is anything but clear, and it can be very difficult for scientists and
developers alike to choose an appropriate platform for their application. There are
multi-core systems with 4-20 cores that tend to be more CPU-like but limited in
the number of cores that can be supported; many-core systems that are more GPU-
like with several hundred cores that trade off single-threaded performance for many,
simpler cores; and enormous, distributed, installations in super computing centers.
The focus of this research is throughput-oriented, many-core systems as exemplified
by NVidia discrete graphics cards and Intel's Xeon Phi coprocessors. Multi-core plat-
forms don't have enough parallelism to support many scientific applications of interest
and their design tends to remain fundamentally latency-oriented (see Section 5.1 for
detailed discussion of latency-oriented vs. throughput-oriented). Supercomputing
centers have limited availability to the general public and have high maintenance
and overhead costs. Many performance principles in this research apply to many
of the components running in supercomputing centers, but the challenges unique to
distribution on that scale are out of scope for this work.
HPC represents a broad range of scientific applications characterized into dwarfs
(an algorithmic method that captures a pattern of computation and communication)
by Berkeley researchers [17]. Scientific kernels have different algorithmic properties
that may or may not map well to certain architectures. Many parallel platforms are
suitable for both throughput-oriented and scientific applications but which hardware
is more efficient or a better match for a given software application is an open research
question. Specific kernels from both throughput computing and a computational
simulation kernel are explored in-depth in the analysis framework described in this
thesis.
Computational simulation is tightly connected to the exponential increase in the
power of computers [64]. Performance improvements for computational sciences such
as biology, physics, and chemistry are critically dependent on advances in multi-core
and many-core hardware, and the performance challenge, described by [139], is largely
being addressed by rapid progress in this area. However, these systems require sub-
stantial investment to migrate and optimize software and development efforts are
not performance portable even within families of architectures from the same com-
pany. The productivity of programmers is not keeping pace with advanced computing
systems and effective tools typically trail new systems by several years [167]. This
situation diminishes the value of these systems and undermines the potential of sci-
entific applications to serve the public with reliable and robust solutions. There is a
compelling need for software that can sustain high performance on new architectures
and scale with multi-core, many-core, hybrid, and massively parallel systems.
Researchers from the University of California, Berkeley made several recommen-
dations to guide the transition to parallel systems that are relevant to this work
including [17]:
- The overarching goal should be to make it easy to write programs that execute
efficiently on highly parallel computing systems.
- Autotuners should play a larger role than conventional compilers in translating
parallel programs.
- Architects should not include features that significantly affect performance or
energy if programmers cannot accurately measure their impact via performance
counters and energy counters.
The framework at the core of this thesis uses profile-driven analysis to guide
optimization efforts so programmers can write code that executes efficiently on parallel
systems. One goal was to test optimization techniques recommended by NVidia and
Intel engineers for application in automated environments. Positive results enable the
methodology to be used in auto-tuning systems to select optimizations that are likely
to have the most impact on performance and reduce the combinatorially large search
space. Figure 1.1 illustrates the main components of the analysis framework and
how the process flows. The analysis framework provides insight into how applications
run on hardware. Kernel characteristics are compiled once per architecture and, if
necessary, once per code variant. All generations of NVidia's discrete GPUs (see Table
5.1) are considered members of the same architecture in this context, as each reports
similar micro-architecture-agnostic metrics on the same code. Intel's Xeon Phi is a
completely different x86-based architecture, with a different instruction set, different
programming model, and different profiling tools; it requires a different set of metrics
to describe kernel behavior. The peak rates and primary architectural features to
describe hardware behavior are compiled once and are static across any kernel and
optimization variant. Hardware specifications are used to compare measured data
with peak rates. The steps from kernel characterization through optimization strategy
selection form a loop that repeats as optimization progresses. Optimization ends when
the kernel runs near peak architectural limits or has hit kernel-imposed or algorithmic limits.
Figure 1.1: Kernel characteristics, hardware behavior, performance profile, analysis
algorithm, and optimization strategies are the primary building blocks of the analysis
framework.
1.1 Motivation
Expertise to develop software applications that effectively leverage gigaFLOPS
systems (10^9 floating point operations per second) and that scale beyond is scant,
both in tools and in well-defined best practices. The proliferation of research focused
on performance analysis of emerging multi-core and GPU architectures [23], mostly
using case studies, is testament to a nascent field grappling to characterize, leverage,
and understand the complexity inherent in the requisite tools of the trade. A few
of the biggest challenges in many-core performance optimization research are how
to efficiently build scientific applications that achieve a reasonable fraction of the
available theoretical peak performance, how to measure the relative success of different
optimizations, how to rate architectural efficiency to measure the goodness of the
mapping from software to hardware, and how to achieve performance portability.
There are many barriers to entry in programming on parallel systems. Few have
described sequential programming as easy, but parallelism complicates the task ex-
ponentially. In sequential computing, operations are performed in order making it
easier to reason about correctness and characteristics of a program [110]. Parallel
computing complicates our reasoning along several dimensions and requires modifi-
cation in programming approaches [110]. Here are a few factors that contribute to
the challenges inherent in deploying parallel solutions:
- A determination must be made about how the application can be restructured to
run in parallel. Architectural details need to be considered because structuring
code for many threads of a GPU is different from structuring code to exploit
wide vectors of Intel Xeon Phi coprocessors.
- There are few established best practices to guide the development and opti-
mization process. This thesis proposes a systematic guide that bridges the gap
between theory and the many practical details that come into play when code
hits silicon. One goal is to compile best practices described by engineers at
NVidia and Intel into an automated framework and explore why they work.
- The application bottleneck must be identified and analyzed for parallel oppor-
tunities. Parallelism won't necessarily help if large time-consuming portions of
the code are inherently serial. This thesis assumes the bottleneck with respect
to the function that consumes the most run-time has been identified through
the top down system level analysis as described in the Kingen case study in
Appendix A.3.
- How to determine if the increase in performance will be worth the development
effort is not well understood. Simple techniques are employed in this work to
estimate how much optimization opportunity remains.
- Developing applications that will fully exploit the machine is challenging as the
problem lives at the cross section of inherent algorithmic characteristics and
architectural features. These dynamics are explored and how close to peak
efficiency a kernel executes is quantified.
- Latency, either reducing latency or hiding latency, is key to performance. Mem-
ory must be managed carefully and optimized to coalesce access and reduce
cache issues like thrashing and false sharing. These techniques require much
more programmer involvement than sequential programmers tend to consider
(see the coalescing sketch after this list).
- Modifications to existing scientific models mean changes in code, and different
sized models may have different optimization points on different hardware.
- Performance tuning is an important and time-consuming step that requires in-
depth knowledge of architectural features as well as the primary application
features. Performance optimization is often described as an art rather than a
science which hints at the complexity and lack of well-understood best practices.
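To make the coalescing point above concrete, the sketch below contrasts a coalesced and a strided global-memory access pattern in CUDA. This is a minimal illustration; the kernel names and the STRIDE constant are assumptions for the example, not code from this thesis.

// Sketch: coalesced vs. strided global-memory access in CUDA.
// Kernel names and STRIDE are illustrative, not from this thesis.
#include <cuda_runtime.h>

#define STRIDE 32  // hypothetical stride, e.g., column access in row-major data

// Coalesced: consecutive threads in a warp read consecutive 4-byte words,
// so each warp's loads combine into a few wide memory transactions.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read words STRIDE elements apart, so each
// warp's loads scatter across many transactions and waste bandwidth.
__global__ void copyStrided(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * STRIDE) % n];  // scattered index pattern
}

Profiling both kernels shows the strided version issuing many more memory transactions per request, which is exactly the kind of metric the memory analysis in this framework inspects.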
A guiding theme throughout this dissertation, promoted in [53], is the need to
broaden the success in parallelizing scientific applications. As such, there is an
urgent need for more research on how to program for many-core architectures and
how to measure results. A primary contribution of this thesis is the creation of an
analysis framework that guides developers and scientists through systematic optimiza-
tion. We explore capabilities within the framework to generate performance metrics
and analysis to drive optimization strategy. The framework will aid developers and
computational scientists to use the full computational resources available in parallel
systems.
1.2 Contributions
This thesis demonstrates that a performance analysis system can construct tai-
lored mappings of scientific applications to modern many-core architecture resources.
The framework increases scientific productivity by driving development toward effi-
cient implementations. Architectural details are essential to high performing appli-
cations and the ability to fully exploit the machine. There is a large divide between
the power available in complicated parallel architectures and the tools most people
need to effectively leverage them and come within range of peak performance.
In the scope of this thesis, the computationally intensive numerical simulation of
ion channel kinetics is examined as a target for evaluating the analysis framework. In
addition, two well-known scientific kernels are studied to demonstrate the generic ap-
plicability of the methodology and to compare the results reported here against known
benchmarks. In all, three scientific kernels are mapped to three different throughput-
oriented architectures. The three kernels are described in detail in Chapter 4 and the
hardware is described in Chapter 5.
This dissertation makes the following key contributions:
1. Characterization of throughput oriented architectures through investigation of
three different platforms: NVidia Fermi, NVidia Maxwell, and Intel Xeon Phi
coprocessors. The ability of scientific applications to map to each architecture
is explored. The peak capabilities of throughput-oriented architectures are de-
scribed and applied beyond what can be found in technical manuals or tutorials.
Advanced computing systems, though complex with a large number of tuning
options that impact performance, actually have relatively few architectural fea-
tures that dominate performance. Just as Amdahl's law defines the maximum speedup
from parallel execution (the performance gain of parallel computers is limited by the
serial portion), the primary performance limiter defines the maximum speedup from
tuning, so effort should focus on the optimizations with the most potential to improve
performance (see the worked example after this list). We describe how to interpret
the performance metrics hardware exposes and identify optimization options.
2. Presentation of profile-driven analysis to drive optimization strategy. The
framework uses a hierarchy of profile data as input to the analysis algorithm.
Analysis metrics are defined from low level events and derived performance met-
rics to codify analysis guidelines in order to identify targets for optimization.
The analysis framework identifies performance limiters and focuses optimization
effort in those areas only. Significant development time is wasted improving
code that, by definition, will not improve performance. The framework system-
atically drives the time-intensive and error prone process of performance opti-
mization and enables investigation of how performance of a kernel scales as the
available resources and parallelism change. Optimization strategies are defined
on each architecture for compute-bound, memory-bound, and latency-bound
kernels. Being bound by compute, memory, or latency means the performance
of the kernel is dominated by compute operations, memory operations, or ex-
posed latency. Analysis metrics are used to guide which optimization within
each of the three categories are likely to improve performance.
3. Presentation of an analysis framework that bridges the gap between architec-
tural theory, performance analysis and the many details that come into play
when code hits silicon. This thesis focuses on execution optimization as opposed
to algorithmic optimization which strives to reduce the number of operations
and is expressed in big O notation. The analysis framework that drives perfor-
mance optimization is a synthesis of performance theory, kernel behavior, and
architectural features.
4. Presentation of the performance of three representative scientific kernels on
three throughput-oriented architectures. We demonstrate the methodology that
guided the efficient implementation of the implicit RK4 solver on several parallel
platforms is applicable to scientific kernels in general. The results demonstrate
that performance of a target kernel can be improved by profile-driven analysis
and that performance improvements are sustained across architectural genera-
tions. The increase in performance opens the door for new usage models which
is imperative to continued innovation on parallel systems.
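As a worked example of the Amdahl's-law analogy in the first contribution above (the numbers here are illustrative, not measurements from this thesis), the bound on overall speedup when a fraction p of execution time is accelerated by a factor s is:

% Amdahl's law: overall speedup when a fraction p of execution
% is accelerated by a factor s (the remaining 1 - p is untouched).
S_{\text{overall}} = \frac{1}{(1 - p) + p/s}

If, say, memory operations account for p = 0.8 of a kernel's time and an optimization makes them s = 4 times faster, S = 1 / (0.2 + 0.8/4) = 2.5; optimizing the other 20% of the kernel, even infinitely, could never exceed S = 1/0.8 = 1.25. This is why the framework directs effort at the primary performance limiter first.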
1.3 Dissertation Outline
The next chapter is an in-depth literature review of related work and how this
dissertation builds on previous success. Chapter 3 describes our analysis framework
that synthesizes profiling, optimization, and analysis to compare architectural effi-
ciencies and determine if an application is using a reasonable percentage of hardware
resources available. Chapter 4 introduces the scientific kernels that are under eval-
uation in this thesis. Chapter 5 describes the three platforms studied in this thesis,
NVidia's Fermi, NVidia's Maxwell, and Intel's Xeon Phi coprocessors. Chapter 6
demonstrates the positive performance results leveraged through the methodology
described in this thesis. Finally, Chapter 7 discusses conclusions and directions for
future research. Appendix A.3 is a case study we published that describes the step-by-
step process of adapting new and existing computational biology models to multi-core
and distributed memory architectures. We discuss our implementation and software
optimization process to demonstrate the challenges and complexities of software de-
velopment for advanced parallel architectures.


2. Literature Review
2.1 Workload Characterization
Workload demand and architectural behavior are the two components under study
when evaluating and optimizing application performance. It follows directly that
quantitative description of workloads is a fundamental part of performance evalu-
ation [28], a key focus of this dissertation. Workload characterization directly in-
forms design of future applications, compilers and architectures. For example, target
thread dispatch rates can be estimated against thread length distributions for com-
mon GPGPU programs. Another, better-known and more studied, design choice is that
the size of on-die storage should match the working set of target workloads [106]. In addi-
tion, GPU kernels and applications often stress the same bottlenecks [69] so workload
characterization is used to identify a representative set of diverse workloads that exer-
cise important orthogonal architectural features. Redundant workloads in an analysis
set, meaning two workloads with very similar execution behavior on a system, require
twice the analysis work but provide no additional insight. Benchmark development
relies heavily on workload characterization to quantify primary architectural features
that impact performance. In this thesis, scientific applications are characterized to
determine kernel-imposed limits which reduce achievable peak architectural rates.
A new class of throughput computing applications has emerged across diverse
domains that process large amounts of data. A distinguishing feature of these appli-
cations is they expose plenty of data level parallelism and the data can be processed
independently [106]. Another new class of application is computational simulation
which historically ran on high performance computing (HPC) platforms with dis-
tributed compute nodes only. Throughput-oriented computing platforms are viable
options for these new types of applications, and it's often not clear to expert pro-
grammers and domain scientists alike which is best suited for their application. The
general-purpose CPU (central processing unit) is capable of running many types of
applications and has recently evolved to provide multiple cores and many cores to pro-
cess data in parallel [106]. GPUs (graphics processing units) are designed for graphics
applications with massive parallelism on smaller processing cores [106]. Some scien-
tific models can be too big to reside in smaller on-chip memory of GPUs and must
consider the trade-offs between the two architectures. One goal of this dissertation is
to provide practical guidance to inform that decision.
Qualitatively, GPU and many-core systems like the Xeon Phi coprocessors favor
workloads that perform large numbers of compute operations; that exhibit high de-
grees of memory locality; and that are highly data-parallel [134]. The challenge many
researchers address with workload analysis studies is how to quantify those charac-
teristics. Workloads are very often characterized with metrics designed to measure
sensitivity to specific performance limiters of target systems. For example, Rodinia
researchers [35] characterized benchmarks in terms of warp occupancy, the average
number of active threads over all issued warps, because occupancy was identified in
early generation GPGPU architectures as a performance limiter. Kerr et al. [91] use
Ocelot [59] to characterize thread activity, the average fraction of threads that are
active at a given time, to measure impact from thread divergence. In [106], kernels
are classified based on the compute and memory requirements, regularity of memory
accesses, and the granularity of tasks. Architects make design tradeoff decisions by
identifying key architectural parameters that are important to performance and by
characterizing how benchmarks respond to changes in those parameters [35].
A clear taxonomy of workload-specific tuning parameters, independent
of any system the workload may run on, as distinct from architecture-dependent
tuning parameters, would help clarify performance limits imposed by the kernel it-
self. Goswami et al. [69] propose a set of GPU microarchitecture-agnostic GPGPU
workload characteristics. Byfl is a tool developed by [134] that reports counter values
in a hardware-independent manner. If metrics are architecture-agnostic, then any
workload will report the same values regardless of microarchitectural differences in
the target platform. Important research remains to identify a set of workload charac-
terization metrics that help quantify the extent to which the same workload exhibits
different properties when implemented on different architectures [35]. Improving our
understanding of these properties would improve evaluation for suitability to emerging
architectures. This is an important benefit for domain scientists who need guidance
on which system is a good match for their application and don't have time or ex-
pertise to experiment with multiple approaches. The framework described in this
dissertation extends existing characterization studies to understand which properties
impose limits below theoretical peaks hardware can support.
In general, workloads are characterized and optimized for specific architectures.
Code is tuned to leverage architectural features of the target system. An interesting
question is whether the reverse can be analyzed: in other words, what is the best architecture
for a given algorithm and which among the available systems are strong candidates?
Researchers and industry analysts widely share a vision of heterogeneous computing
that automatically selects the best computational unit for the task from among in-
tegrated accelerators [35]. One goal of this dissertation is to advance this vision by
exploring the intersection and dependencies between workload characterization and
the systems they run on.
2.2 Performance Analysis and Optimization
Performance analysis and optimization is at the foundation of computer archi-
tecture research and development [55] and of this dissertation. Architectural details
are essential to optimal performance of software applications and the ability to fully
exploit the machine. This dissertation couples characterization of the scientific ker-
nels under investigation with detailed architectural analysis and optimization over
multiple generations of hardware to improve understanding of how to map scientific
applications to modern architectures.
One of the biggest limitations of the many-core era is that software must be ex-
plicitly parallelized to leverage many cores and massive parallelism of GPUs [180].
Parallel applications have transformed the challenge from latency-limited computing
to throughput-limited computing [180]. The many-core era ushered in a produc-
tivity gap as programmers struggle to maintain performance portability on rapidly
changing systems with non-linear performance impacts. This under-scores the need
to better understand throughput orientation and this trend will continue well into
the heterogeneous systems era.
Programming challenges arise from interactions among architectural constraints.
For example, optimizations that improve the performance of an individual thread
tend to increase a thread's resource usage. As each thread's resource usage increases,
the total number of threads that can occupy a symmetric multiprocessor (SM) on a
GPU decreases [150]. Another source of unpredictability, using the CUDA runtime
system as an example, is black-box register allocation that makes it difficult for pro-
grammers to fully understand the performance characteristics of their applications.
The optimization framework developed in this dissertation demonstrates the perfor-
mance delta that can arise across generations and describes how our system
adapts.
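To make the thread-resource tradeoff above concrete, the host-side sketch below asks the CUDA runtime how many blocks of a kernel can co-reside on one multiprocessor; heavier per-thread register or shared-memory usage lowers the reported count. The kernel and block size are placeholders for illustration, and the occupancy API shown requires CUDA 6.5 or later.

// Sketch: query how many blocks of a kernel fit on one SM, and the
// resulting occupancy. The kernel is a placeholder for illustration.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void placeholderKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int blockSize = 256;   // threads per block (a typical choice)
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, placeholderKernel, blockSize, 0 /* dynamic smem */);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (float)(blocksPerSM * blockSize) /
                      (float)prop.maxThreadsPerMultiProcessor;
    printf("blocks per SM: %d, occupancy: %.2f\n", blocksPerSM, occupancy);
    return 0;
}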
Fundamentally, CPUs and GPUs are built based on very different design goals.
CPUs were historically designed for a wide variety of applications optimized to im-
prove single task or single-threaded performance. CPUs reduce latency and GPUs
hide latency to maximize performance. Memory bandwidth on CPUs is low compared
to GPUs, and CPU cache access filtering tends to limit the CPU's ability to
expose memory-level parallelism. Workload optimizations for CPU that contribute
to performance are: multithreading, cache blocking, and reorganization of memory ac-
cesses [106]. Therefore, CPU architectural advances to improve performance include
branch prediction, out-of-order execution, super-scalar execution, and frequency scal-
ing [106]. CPU programmers could count on increasing clock frequencies to improve
performance and did not have to continually optimize applications from one genera-
tion to the next. However, CPUs have evolved to multi-cores and many-cores with
wider SIMD units, and applications must now be parallelized and optimized to ex-
ploit thread level parallelism (TLP) and data level parallelism (DLP) to effectively
hide latency.
GPUs were historically designed for graphics applications optimized for high
throughput of pixels. GPUs trade-off single threaded performance for massive data
and thread level parallelism. As GPUs are throughput-limited architectures, graph-
ics applications are very latency tolerant. GPU threads are independent and GPUs
immediately switch to a different thread to process a different pixel when an active
thread stalls on a long latency operation (like a request to memory) [106]. Thread
switching on GPU is an almost zero cost operation as compared to the high penalty
of context-switching in CPUs. Characterizing workload throughputs on throughput-
oriented architectures is critical for modeling performance and for identifying bottle-
necks and their relevant optimizations. However, like CPU evolution toward more
parallelism, GPUs have evolved to support dependent thread operation and cache
hierarchies.
Several factors have contributed to the strong paradigm shifts in both CPU and
GPU. CPU has exhausted the performance gains historically hidden from the pro-
grammer like superscalar architectures and pipelining [132] and turned to multipro-
cessing to continue to deliver processor performance increases. GPU (and CPU) is
evolving to support important new classes of applications described in Section 2.1 that ex-
pose data level parallelism but have different requirements from the hardware than
traditional graphics applications.


It remains to be seen how close CPU and GPU architectures will grow toward each
other and how heterogeneous systems manage the complexity. CPU and GPU cache
hierarchies play similar roles in filtering memory requests to the off-chip interface but
with very different memory access rates and sensitivity to the memory hierarchy [74].
There is unresolved contention over whether CPU or GPU is best for scientific
kernels of interest: CPUs are transitioning to multi-core and many-core designs and
moving into the GPU's dominance in throughput computing, while GPU architectures
are moving into the CPU's dominance in maximizing single-threaded performance and
use of a cache hierarchy.
Analytical performance modeling and simulation are two prevalent approaches
to understanding and predicting performance. Analytical modeling is orders of mag-
nitude faster than simulation but less accurate. Analytical modeling enables the
exploration of very large design spaces. Eeckhout et al. [55] describe three primary
approaches to analytical modeling: mechanistic or white-box modeling, which
builds a model based on first principles; empirical or black-box modeling, which builds a
model through training on simulation results; and hybrid mechanistic-empirical
modeling, which aims to leverage the best of both methods [55].
Several studies demonstrate good approximation of GPU performance using an-
alytical approaches including [79], [124], [154], and [186]. Most performance models
rely on some type of implementation and don't address how far the optimized version
is from global optima [102]. The methodology described in this thesis can be used
to inform analytical modeling techniques with a cogent description of performance
limits and application of relevant optimizations.
Another important research goal is to use metrics that meaningfully map to a
performance construct and how to optimize it ([55], [154]). An example is cycles per
instruction (CPI). CPI is a dot product of event frequencies and penalties and as such
provides more insight than instructions per cycle (IPC) yet IPC is more widely used
[57]. Bhuiyan et al. [23] use dot products of event frequencies and penalties in their
analytical performance model to connect applications to architectures. Although not
obvious, Zhang et al. apply similar principles by measuring execution times on the
primary limiters in NVIDIA GeForce 200-series GPUs. An important research goal
of this thesis is to determine the suitability of an architecture for a given application
by building and improving on the work of Bhuiyan et al. [23] and Zhang et al. [186].
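As a concrete reading of the CPI construction above, with illustrative event frequencies and penalties (not measured values from this thesis):

% CPI as a dot product of per-instruction event frequencies f_i
% and their penalties p_i, on top of a base CPI:
\mathrm{CPI} = \mathrm{CPI}_{\text{base}} + \sum_i f_i \, p_i

For example, a base CPI of 1, branch mispredictions at f = 0.01 per instruction with a 20-cycle penalty, and cache misses at f = 0.005 with a 200-cycle penalty give CPI = 1 + (0.01)(20) + (0.005)(200) = 2.2. Each term localizes how much a specific event class costs, which is the insight IPC alone does not provide.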
Many studies in performance analysis and optimization research focus on very spe-
cific goals targeting a type of application on a target architecture. Authors routinely
note that their methods should be more generally applicable, and there are sound the-
oretical arguments to support this claim, yet few actually demonstrate more general
applicability on any vector, be it other application domains or diverse architectures.
This dissertation tries to improve on generality goals and demonstrate how analysis
methodologies can be extended to other applications and architectures. General Pur-
pose GPU (GPGPU) breaks assumptions in the purely throughput-limited paradigm
in ways that are not yet fully understood. One research goal of this dissertation is
to clearly describe how the GP in modern GPUs has impacted throughput oriented
design goals.
2.3 Automatic Performance Tuning
The primary motivation for auto-tuning is to sustain performance trends on
rapidly changing architectures and execution environments, a concept known as per-
formance portability. This thesis demonstrates that an automated analysis system
can construct tailored mappings of scientific numerical kernels to modern many-core
architecture resources. As there is no single configuration good for all systems and
all applications, tremendous effort is necessary to develop applications and map them
onto target machines [42]. In addition, successive generations of massively-parallel
architectures tend to require a complete reapplication of the optimization process
to maintain high performance [150]. A key research goal of this dissertation is to
demonstrate sustained performance across scientific models and across hardware.
Programmers must choose which architecture is best suited to their method, which
is very difficult to intuit. Examples of the difficulties, especially for accelerators like
GPUs, include an unusual programming model, architectures that are evolving
very quickly, and technical details that are not publicly available [82].
Software can adapt to hardware, so auto-tuning benefits architects, who don't have
to overprovision hardware designs for legacy applications and implementations that
are tuned for the previous generation [7].
Compilers have failed to achieve high performance on new architectures, so the
responsibility has fallen on domain experts and expert programmers [180]. The pri-
ority for compilers is correctness and very general applicability. The priority for
auto-tuners is to leverage the specifics of different architectural features and work-
load characteristics to find the fastest running configuration. Compilers use simple
models of architecture behavior that may be overly simplistic compared to the com-
plex hardware of advanced systems. In addition, compilers have difficulty determin-
ing the behavior of algorithms that are dependent on the inputs for optimization.
Compilers handle two-level memory well and work best when latencies are small and
deterministic [180]. Compilers do not optimize multi-level cache hierarchies and out-
of-order execution well. Auto-tuning has become a commonly accepted technique to
address these issues, and generate highly tuned code automatically. Bilmes et al.
[25] observed that all code variants could be enumerated and explored and, given the
capabilities of modern microprocessors, the optimal configuration for a given problem
size could be determined. This observation resulted in PhiPAC, which is considered
the progenitor of auto-tuners [180].
Two approaches to performance auto-tuning are used in practice, model-driven
optimization and empirical search optimization. The model-driven approach comes
from the compiler community and includes techniques like loop blocking, loop un-
rolling, loop permutation, fusion and distribution, prefetching, and software pipelin-
ing. Empirical optimizers estimate the values of key optimization parameters by
generating many program versions and running them on specific hardware to de-
termine which values give the best performance for a given application on a given
machine. In contrast, conventional compilers use models of programs and machines
to choose the best parameters [183]. Some researchers present a hybrid approach and
use simple architectural models in the first stage to limit the search space for second
stage of empirical search.
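A minimal sketch of the empirical half of this hybrid, written as CUDA host code, appears below. It times one tuning parameter (thread-block size) over a small candidate set and keeps the fastest; the kernel and the candidate values are illustrative assumptions, and a real auto-tuner would enumerate many more parameters (tiling, unrolling, etc.).

// Sketch of empirical search over one parameter: thread-block size.
// The kernel and candidate set are placeholders for illustration.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 24;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    int candidates[] = {64, 128, 256, 512, 1024};
    int best = 0;
    float bestMs = 1e30f;
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    for (int b : candidates) {
        int grid = (n + b - 1) / b;
        scale<<<grid, b>>>(d, n);      // warm-up launch
        cudaEventRecord(t0);
        scale<<<grid, b>>>(d, n);      // timed launch
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        if (ms < bestMs) { bestMs = ms; best = b; }
        printf("block %4d: %.3f ms\n", b, ms);
    }
    printf("best block size: %d (%.3f ms)\n", best, bestMs);
    cudaFree(d);
    return 0;
}

The model-driven half would shrink the candidate set before this loop runs, which is exactly the hybrid division of labor described above.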
A comparison of empirical and model-driven approaches found that
model-driven approaches can be just as effective as empirical search (widely believed
to be superior) [183]. At least one study asserts that a model-driven tuning strategy
is impractical since some device parameters can't be queried and some param-
eter tradeoffs are difficult to model [50]. However, the full benefit of model-driven
has probably not been fully explored due to the de-facto preference and wide suc-
cess with empirical search. In general, the performance impact of specific optimiza-
tions can't be predicted, in large part because model-driven tuning is under-studied and not
well-understood. There is considerable room for improvement in both empirical and
model-driven optimization techniques (hand-optimized code still significantly outper-
forms automated code for generating the BLAS) [183].
Automatic library generation with empirical search has been a very effective strat-
egy on CPUs, including ATLAS [176], Sparsity [171], FFTW [60], and Spiral [140].
Many studies borrow similar methods to autotune performance by empirical search
on GPUs. Jiang and Snir [82] implemented a high performance matrix multiplica-
tion library generator for GPU that they refer to as an ATLAS for GPU. Liu et al.
[113] use a greedy algorithm to accelerate search and explore the influence of program
inputs. Baskaran et al. [18] use a hybrid model-driven empirical search to determine
optimal parameters for unrolling and tiling. Meng and Skadron [125] and Choi and
Singh [36] build a performance model to guide the tuning process for iterative stencil
loops and sparse matrix-vector multiply, respectively. Ryoo et al. [150] carve the op-
timization search space in order of performance impact. There is a lot of interest in
the community for encapsulating details into auto-tuned libraries for computational
scientists to use. X-Stack researchers [7] observe that although an optimal imple-
mentation may be unique to a particular dataset-architecture-application tuple, a
generalized optimization design space can capture them all.
Auto-tuning research is distinguished along several vectors:
- The method used (empirical or model-driven).
- The application domain of focus (FFT, matrix multiplication, tridiagonal
solvers, etc.).
- Strategies for pruning the search space.
- The tools used in the tuning framework (profiling techniques, type of metrics
measured, static analysis or dynamic run-time analysis).
- Algorithmic properties and whether domain knowledge is leveraged.
- The tuning parameter targets researchers employ (switch points, shared memory
allocation, thread occupancy, etc.).
Research contributions that advance performance auto-tuning are distin-
guished and compared along one of the categories above. However, when designing
and implementing auto-tuners the same core challenges remain. Empirical search
techniques must grapple with the challenge of an exploding search space. Researchers
employ many different methods to avoid impractical exhaustive search. No single
method has taken a clear leadership position over others. Davidson et al. [50] use
algorithmic knowledge to decouple tuning parameters and use heuristics to estimate
search starting points. Ryoo et al. [150] prune the search space by capturing first-
order performance effects. Choosing the right tuning parameters is also a challenge
since the decision often involves a tradeoff between multiple objectives whose per-
formance impact is often nonlinear and difficult to predict. Unpredictable non-linear
performance effects are a major challenge for auto-tuners that has not yet been compre-
hensively described. For example, high thread occupancy can limit the shared
memory available per thread and increase register pressure.
Most auto-tuners target specific optimizations for the target application on target
systems [102]. They include both architecture-independent optimizations based on
improving the source code and workload analysis as well as architecture-specific opti-
mizations like optimal thread group size [180]. Many researchers assert their method
can be generalized to other types of workloads or architectures but this is rarely
proven or systematically explored [69]. The argument makes sense in principle, as
long as the code can be parameterized and its properties, such as demand for registers
and shared memory, expressed as functions of the parameters [99]. In general, there
is ample opportunity to extract, generalize, and encapsulate performance auto-tuning
functionality so that the complexity is hidden from application scientists [7]; this
dissertation explores that opportunity. Research directions for auto-tuning include making auto-tuning
more efficient, expanding it to additional application domains, and achieving more
generality [180].
3. Analysis Framework
The key to high performance is an effective architecture-algorithm mapping, which
is not straightforward in most cases [64]. This framework uses architectural insight
to guide optimization strategy. Many architectural features are available from hard-
ware specifications and software queries to the hardware. Important architectural
throughputs that are not documented can be approximated with benchmark testing.
Workload characterization is described in detail in Chapter 4 and is expanded here
as it applies to the optimization methodology.
Figure 3.1 illustrates the relationship between implementations with architectural
insight and those without. Lee et al. [105] recommend an application-driven design
methodology that identifies essential hardware architecture features based on applica-
tion characteristics. This type of approach is integrated throughout the framework.
Figure 3.1: Figure extracted from [64]. Algorithmic implementations specific to a
particular architecture lead to high performance; implementations without architectural
insight lead to poor performance.
Optimization and tuning is applied at the kernel level as opposed to at the ap-
plication or system level. Kernels are specific algorithms or functions developed for
a specific task and are typically building blocks inside applications. For example,
SGEMM (Single Precision General Matrix Multiply) is a kernel that performs matrix
multiplication. SGEMM is a pervasive component in scientific computing applica-
tions. Input to the framework is a kernel that has been identified as the primary
performance bottleneck in the application (as RK4 was identified in Appendix A)
and is the target for optimization and tuning.
The optimization strategy for this profile-driven tuning framework couples per-
formance analysis with empirical observation to guide and limit the performance
optimization search space. Programming throughput-oriented processors puts more
emphasis on parallelism and scalability than programming sequential processors. We
use the principles described in the roofline model [179] to identify the maximum
theoretical capabilities of the hardware and compare them with the execution limits
imposed by workload characteristics.
Performance metrics are used to identify primary performance limiters including in-
struction throughput, memory throughput, latency, or some combination. We employ
profiler feedback to eliminate categories of optimizations that are unrelated to the ob-
served bottleneck which, by definition, cannot improve performance.
If latency is effectively hidden, hardware resources do not suffer from underuti-
lization and there must be hardware units that are running near peak and represent
performance limiters. Optimizations that improve the efficiency of the bottlenecked
resource will have the greatest impact on performance. This concept has important
implications for setting expectations and correct interpretation of the results. For ex-
ample, if a code is memory-bandwidth bound and an optimization doubles memory
throughput, it's easy to expect the code will run twice as fast. However, the 2x im-
provement is the upper bound, depending on how close other limiters are behind the
primary limit. Conversely, if a code is memory-bound and an optimization doubles
instruction throughput, there should be no expectation of improved runtime since the
code is limited by memory bandwidth and remains so after the compute optimization
is applied.
The analysis framework is built from the following components:
Low level performance events defined for each architecture.
Metrics (formulas) are derived from the low level events.
Analysis metrics use low level events and metrics to inform optimization
decision-making.
Methodology and software tools to measure performance events.
Interpretation of the events and metrics to limit optimization search space (to
those that improve memory or those that improve compute, for example) and
specific strategies within each.
Algorithm to automatically process and quantify performance metrics.
Documented peak capabilities of the hardware under analysis.
Understanding of kernel imposed limits.
Architectural review of relevant hardware to understand specific optimizations.
An important component of the framework is the specification of which types of
events need to be measured for throughput-oriented architectures, the formulas for
building metrics from the events, and how to interpret and apply the results. Figure
3.2 illustrates the algorithm that drives performance optimization in this framework.
Figure 3.2: Optimizations are selected based on whether the kernel is compute-bound,
memory-bound, or latency-bound, and appropriate optimization strategies follow.
This basic flow is recommended by performance engineers at NVidia. The automated
framework demonstrates how to implement those ideas systematically.
3.1 Framework Methodology
The first step in the framework is to measure profile data and determine the
primary performance limiter. High performance throughput-oriented architectures
require enough thread level parallelism to hide latency. However, on GPU archi-
tectures, additional threads beyond what is sufficient for latency coverage will not
necessarily increase performance, and since additional threads reduce the shared memory
and/or registers available per thread, unnecessarily high occupancy can actually limit performance. The
NVidia programming manual [9] asserts that occupancy beyond 50 percent does
not typically translate into increased performance. In other words, going from 50% oc-
cupancy to 75% occupancy, a 25% gain, does not imply a 25% gain in performance in
general. In fact, [169] demonstrated it is possible to lose performance by maximizing
occupancy. We initially predict an appropriate balance point based on at least 50%
occupancy and full usage of on-chip resources.
A key principle embedded in our analysis strategy is that performance metrics
must be evaluated within the larger context of the primary performance limiter.
For example, metrics may indicate poor memory access patterns. This data should
be acted on only if the kernel is memory-bound. The optimization algorithm first
identifies which hardware resource is the primary performance limiter. This, in turn,
defines an optimization space and a set of events to measure within that space. These
events are then combined into threshold metrics and heuristics to determine how
the resource under evaluation is limiting performance which identifies a subset of
candidate optimizations. Each of the following sections (memory bound in section
3.2.1, compute bound in section 3.2.2, and latency bound in section 3.2.3) follows this
recipe. For each performance-limiting source, a set of optimization strategies is
identified, generated, and tested for each architecture.
Each architecture has an ideal instruction-to-memory-bytes ratio defined by the
machine's peak limits (see Chapter 5). A kernel is classified as instruction
throughput limited or memory bandwidth limited by comparing the achieved instruction :
byte ratio to that ideal, a common practice in performance-oriented communities. If the measured in-
struction : byte ratio is higher than the ideal, the code is likely instruction throughput
limited. If the measured instruction : byte ratio is lower than the hardware ideal,
the code is likely memory bandwidth limited. By itself, the instruction : byte ratio is a bi-
nary classifier: the outcome is either compute-bound or memory-bound. However,
some kernels don't get close to hardware limits for either instruction throughput or mem-
ory bandwidth. This is often an indication that latency is exposed and the kernel is
latency-bound.
To formalize the above into pseudo-code, a kernel is classified as compute-bound,
memory-bound, or latency-bound following these rules:
If the measured instruction : byte ratio is higher than the hardware ideal,
and measured instruction throughput is 70% or more of the hardware's peak
capability, then the kernel is compute-bound.
If the measured instruction : byte ratio is lower than the hardware ideal, and
measured memory throughput is 70% or more of the hardware's peak capabil-
ity, then the kernel is memory-bound.
If no hardware unit is near its relative hardware peak, the kernel is likely
latency-bound.
Heuristic thresholds can be parameterized and adjusted as necessary but general
industry consensus is that code is limited by a given hardware resource if achieved
throughput is approximately 70%-80% of the peak hardware capability.
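These classification rules are simple enough to express directly in code. The following
is a minimal sketch in C; the structure fields, function name, and default threshold are
illustrative assumptions for this framework, not part of any profiler's API.

/* Classify a kernel as compute-, memory-, or latency-bound from measured
   throughputs and the hardware's ideal instruction:byte balance. */
struct KernelProfile {
    double inst_byte_ratio;   /* measured instruction : byte ratio */
    double inst_throughput;   /* achieved instruction throughput */
    double mem_throughput;    /* achieved memory throughput */
};
struct HardwarePeaks {
    double ideal_inst_byte_ratio; /* peak inst rate / peak bandwidth */
    double peak_inst_throughput;
    double peak_mem_throughput;
};

const char* classify(struct KernelProfile k, struct HardwarePeaks hw) {
    double threshold = 0.70; /* industry-consensus 70%-of-peak heuristic */
    int near_inst = k.inst_throughput >= threshold * hw.peak_inst_throughput;
    int near_mem  = k.mem_throughput  >= threshold * hw.peak_mem_throughput;
    if (k.inst_byte_ratio > hw.ideal_inst_byte_ratio && near_inst)
        return "compute-bound";
    if (k.inst_byte_ratio < hw.ideal_inst_byte_ratio && near_mem)
        return "memory-bound";
    return "latency-bound"; /* no unit near its relative peak */
}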
3.2 NVidia Metrics and Analysis
Tables 3.1 through 3.5 describe the low level hardware events NVidia supports on dis-
crete graphics cards with CUDA support. The low level events are common between
Fermi and Maxwell. However, Maxwell supports many more metrics, without clear
documentation of how those metrics are calculated. The derived metrics and analysis
differ between Fermi and Maxwell: Maxwell evolved significantly since Fermi
to natively support more of the efficiency metrics practitioners need to effectively
measure performance. Since they differ, analysis metrics are given for both Fermi
and Maxwell.
Compile-time data such as the number of registers per thread, grid size, block size,
static shared memory allocated per CUDA block (bytes), dynamic shared memory al-
located per CUDA block (bytes), constant memory allocated per CUDA block (bytes),
spilled loads (bytes), and spilled stores (bytes) were also collected in addition to the low
level events collected from NVidia's profiling tool, nvprof.
Table 3.1: Low Level Instruction Related Hardware Events from NVidia's nvprof.
Event name Event Description
inst_issued Difference between issued and executed instructions: instruction issues that happen due to serialization, instruction cache misses, etc. Will rarely be zero; a concern only if it is a significant percentage of instructions issued.
inst_executed Counts instructions encountered during execution. Incremented by one per warp.
thread_inst_executed_0 Number of instructions executed by all threads, does not include replays. For each instruction it increments by the number of threads in the warp that execute the instruction in pipeline 0.
thread_inst_executed_1 Number of instructions executed by all threads, does not include replays. For each instruction it increments by the number of threads in the warp that execute the instruction in pipeline 1.
gld_request Number of executed load instructions where the state space is not specified and hence generic addressing is used. Can include load operations from the global, local, and shared state spaces. Increments per warp on an SM.
gst_request Number of executed store instructions where the state space is not specified and hence generic addressing is used. Can include store operations to the global, local, and shared state spaces. Increments per warp on an SM.
shared_load Number of executed load instructions where the state space is specified as shared. Increments per warp on a multiprocessor.
shared_store Number of executed store instructions where the state space is specified as shared. Increments per warp on a multiprocessor.
local_load Number of executed load instructions where the state space is specified as local. Increments per warp on a multiprocessor.
local_store Number of executed store instructions where the state space is specified as local. Increments per warp on a multiprocessor.
Table 3.2: Low level instruction related hardware events from NVidia's nvprof -
continued.
gld_inst_8bit Total number of 8-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_16bit Total number of 16-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_32bit Total number of 32-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_64bit Total number of 64-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_128bit Total number of 128-bit global load instructions that are executed by all the threads across all thread blocks.
gst_inst_8bit Total number of 8-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_16bit Total number of 16-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_32bit Total number of 32-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_64bit Total number of 64-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_128bit Total number of 128-bit global store instructions that are executed by all the threads across all thread blocks.
Table 3.3: Low Level L1 Events from NVidia's nvprof.
Event Name Description
l1_global_load_miss Number of cache lines that miss in L1 cache for global memory load accesses. In the case of perfect coalescing this increments by 1, 2, and 4 for 32, 64, and 128 bit accesses by a warp respectively. Incremented by 1 per L1 line (a line is 128B).
l1_global_load_hit Number of cache lines that hit in L1 cache for global memory load accesses. In the case of perfect coalescing this increments by 1, 2, and 4 for 32, 64, and 128 bit accesses by a warp respectively.
l1_local_load_hit Number of cache lines that hit in L1 cache for local memory load accesses. In the case of perfect coalescing this increments by 1, 2, and 4 for 32, 64, and 128 bit accesses by a warp respectively.
l1_local_load_miss Number of cache lines that miss in L1 cache for local memory load accesses. In the case of perfect coalescing this increments by 1, 2, and 4 for 32, 64, and 128 bit accesses by a warp respectively.
l1_local_store_hit Number of cache lines that hit in L1 cache for local memory store accesses. In the case of perfect coalescing this increments by 1, 2, and 4 for 32, 64, and 128 bit accesses by a warp respectively.
l1_local_store_miss Number of cache lines that miss in L1 cache for local memory store accesses. In the case of perfect coalescing this increments by 1, 2, and 4 for 32, 64, and 128 bit accesses by a warp respectively.
l1_shared_bank_conflict Number of shared memory bank conflicts, caused when addresses for two or more shared memory requests fall in the same memory bank. Increments by N-1 and 2*(N-1) for an N-way conflict for 32 bit and 64 bit shared memory accesses respectively.
Table 3.4: Low level memory related hardware events from NVidia's nvprof.
Event name Event Description
fb_subp0_read_sectors Number of DRAM read requests to sub-partition 0. Increments by 1 for each 32 byte access.
fb_subp1_read_sectors Number of DRAM read requests to sub-partition 1.
fb_subp0_write_sectors Number of DRAM write requests to sub-partition 0.
fb_subp1_write_sectors Number of DRAM write requests to sub-partition 1.
l2_subp0_read_hit_sectors Number of read requests from L1 that hit in slice 0 of L2 cache. Increments by 1 for each 32-byte access.
l2_subp1_read_hit_sectors Number of read requests from L1 that hit in slice 1 of L2 cache. Increments by 1 for each 32-byte access.
l2_subp0_read_sector_queries Number of read requests from L1 to slice 0 of L2 cache. Increments by 1 for each 32-byte access.
l2_subp1_read_sector_queries Number of read requests from L1 to slice 1 of L2 cache. Increments by 1 for each 32-byte access.
l2_subp0_write_sector_queries Number of write requests from L1 to slice 0 of L2 cache. Increments by 1 for each 32-byte access.
l2_subp1_write_sector_queries Number of write requests from L1 to slice 1 of L2 cache. Increments by 1 for each 32-byte access.
Table 3.5: Low level latency related and miscellaneous hardware events from NVidia's
nvprof.
gputime Execution time for the GPU kernel or memory copy method in microseconds.
elapsed_clocks Number of cycles.
global_store_transaction Number of global store transactions. Increments by 1 per transaction. A transaction can be 32, 64, 96, or 128 bytes.
threads_launched Number of threads launched on a multiprocessor.
warps_launched Number of warps launched on a multiprocessor.
branch Number of branch instructions executed per warp on a multiprocessor.
divergent_branch Number of divergent branches within a warp. This counter is incremented by one if at least one thread in a warp diverges (that is, follows a different execution path) via a conditional branch.
active_warps Accumulated number of active warps per cycle. Every cycle it increments by the number of active warps in that cycle, which can range from 0 to 48.
active_cycles Number of cycles a multiprocessor has at least one active warp.
atom_count Number of warps executing atomic reduction operations for thread-to-thread communication. Increments by one if at least one thread in a warp executes the instruction.
Table 3.6: Fermi derived metrics from NVidia.
Metric Name Formula
dram reads fb_subp0_read_sectors + fb_subp1_read_sectors
dram writes fb_subp0_write_sectors + fb_subp1_write_sectors
thread instructions executed thread_inst_executed_0 + thread_inst_executed_1
l2_read_requests l2_subp0_read_sector_queries + l2_subp1_read_sector_queries
l2_write_requests l2_subp0_write_sector_queries + l2_subp1_write_sector_queries
reads from L1 that hit in L2 l2_subp0_read_hit_sectors + l2_subp1_read_hit_sectors
L2 read hit rate l2_l1_read_hits / l2_read_requests
L1 hit rate for local load requests l1_local_load_hit / (l1_local_load_hit + l1_local_load_miss)
number of registers per block num_registers_per_thread * num_threads_per_block
total number of blocks grid_dim_x * grid_dim_y * grid_dim_z
effective max blocks per SM, thread limit min(max_active_blocks_per_sm, max_active_threads_per_sm / block_size)
effective max blocks per SM, shared memory limit min(max_active_blocks_per_sm, total_shmem_per_sm / shmem_per_block)
effective max blocks per SM, register limit min(max_active_blocks_per_sm, max_32bit_reg_per_sm / num_reg_per_block)
Table 3.6 lists the formulas for metrics derived from the low level events on
Fermi. The events and metrics above are collected with scripts that drive nvprof from
the command line. The scripts format the data for easy import into a template Excel
sheet. The performance template has additional analysis metrics defined to drive
optimization. This process is semi-automated as-is, and its output is machine-readable
for easy integration into an auto-tuner or any other package.
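The last three rows of Table 3.6 compose into a single occupancy estimate. The sketch
below is an illustrative reading of those formulas in C; the parameter names come from
the table and the compile-time data above, not from any NVidia API.

static int imin(int a, int b) { return a < b ? a : b; }

/* Resident blocks per SM = min of the thread, shared memory, and register
   limits from Table 3.6. Inputs come from hardware specs and compile-time
   data (registers per thread, shared memory per block). */
int effective_blocks_per_sm(int max_active_blocks_per_sm,
                            int max_active_threads_per_sm, int block_size,
                            int total_shmem_per_sm, int shmem_per_block,
                            int max_32bit_reg_per_sm, int reg_per_thread) {
    int num_reg_per_block = reg_per_thread * block_size;
    int thread_limit = max_active_threads_per_sm / block_size;
    int shmem_limit  = shmem_per_block > 0
                     ? total_shmem_per_sm / shmem_per_block
                     : max_active_blocks_per_sm;
    int reg_limit    = max_32bit_reg_per_sm / num_reg_per_block;
    /* occupancy = blocks * block_size / max_active_threads_per_sm */
    return imin(imin(thread_limit, shmem_limit),
                imin(reg_limit, max_active_blocks_per_sm));
}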
The measured instruction to byte ratio (see Table 3.7) is compared to the theo-
retical peaks of the machine. The instruction to byte ratio is calculated both with respect
to DRAM bytes and with respect to L2 bytes. According to NVidia engineers, if the
code has a high hit rate in the L2 cache, it's better to look at the L2 counters than the
DRAM counters; accesses to L2 are still expensive compared to arithmetic.
Table 3.7: L2 and DRAM instruction to byte ratios on NVidia.
Ratio Formula
DRAM ratio (32 * inst_issued) / (32B * (dram_reads + dram_writes))
L2 ratio (32 * inst_issued) / (32B * (l2_read_requests + l2_write_requests))
If the selected instruction to byte ratio is greater than the balanced ratio for the
hardware, then the kernel is likely compute bound and optimization efforts should
focus on improving instruction throughput. If the selected instruction to byte ratio
is less than the balanced ratio for the hardware, then the kernel is likely memory
bound and optimization efforts should focus on improving memory throughput. An
important check on this direction is to look at how close measured performance is to
peak limits. If the instruction to byte ratio indicates memory bound but no memory
unit is operating near peak throughput, then latency exposure is likely the limiting
factor and optimizations should focus on hiding latency better.
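As a concrete sketch, the two ratios of Table 3.7 and the L2-versus-DRAM counter
choice can be computed as follows. This assumes the Fermi event names from the
tables above; the struct and function are hypothetical, and the 0.9 hit-rate cutoff is an
arbitrary example value, not an NVidia-documented constant.

/* Compute the instruction:byte ratio from nvprof event counts (Table 3.7).
   All sector counters are 32-byte granularity. */
struct FermiEvents {
    double inst_issued;
    double dram_reads, dram_writes;        /* fb_subp{0,1}_*_sectors summed */
    double l2_read_requests, l2_write_requests;
    double l2_l1_read_hits;
};

double inst_byte_ratio(struct FermiEvents e) {
    double l2_hit_rate = e.l2_l1_read_hits / e.l2_read_requests;
    double dram_bytes = 32.0 * (e.dram_reads + e.dram_writes);
    double l2_bytes   = 32.0 * (e.l2_read_requests + e.l2_write_requests);
    /* Prefer L2 counters when the L2 hit rate is high (example cutoff 0.9),
       per the guidance above; otherwise use DRAM counters. */
    double bytes = (l2_hit_rate > 0.9) ? l2_bytes : dram_bytes;
    return (32.0 * e.inst_issued) / bytes;
}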
3.2.1 Memory Analysis Metrics and Optimizations
The analysis metrics described in this section are considered if the code is iden-
tified as memory-bound. They are derived from the metrics and events described in
3.2 and represent the thresholds and heuristics applied to determine which optimiza-
tion is likely to have a significant performance impact. NVidia's profiling tool, nvprof,
evolved significantly between Fermi and Maxwell, and many of the metrics that had to
be manually measured, compiled, and computed for Fermi are output directly by
nvprof for Maxwell (compute capability 5.0 and beyond). Memory analysis
metrics for Fermi are given in Table 3.8 and for Maxwell in Tables 3.9 through 3.12.
Table 3.8: Fermi memory analysis metrics.
Metric Formula
dram read throughput (32 * dram_reads) / execution_time_in_seconds
dram write throughput (32 * dram_writes) / execution_time_in_seconds
dram throughput dram_read_throughput + dram_write_throughput
dram throughput to peak throughput ratio dram_throughput / theoretical_peak_memory_bandwidth
L2 read throughput (32 * l2_read_requests) / execution_time_in_seconds
L2 write throughput (32 * l2_write_requests) / execution_time_in_seconds
L2 throughput l2_read_throughput + l2_write_throughput
L1 global load hit rate l1_global_load_hit / (l1_global_load_hit + l1_global_load_miss)
L2 queries l2_read_requests + l2_write_requests
L2 queries due to local memory 2 * 4 * l1_local_load_miss
Transactions per load request (l1_global_load_hit + l1_global_load_miss) / gld_request
The following are guidelines documented by NVidia engineers for applying metrics
to specific optimizations. To determine if register spilling is impacting memory, esti-
mate how much of the L2 or DRAM traffic is due to local memory. The percentage of L2
queries due to local memory is the ratio of (2 * 4 * l1_local_load_miss) to (l2_read_requests +
l2_write_requests). Multiply by 2 because a load miss implies a store happened first.
Multiply by 4 because a local memory transaction is 128 bytes, which is 4 L2 trans-
actions (32 bytes each).
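Put as code, the spill check reads as below; this is a small sketch using the same Fermi
event names, with a hypothetical 10% cutoff chosen only for illustration.

/* Fraction of L2 traffic caused by local memory (register spills).
   x2: a local load miss implies an earlier store; x4: one 128B local
   transaction equals four 32B L2 transactions. */
double l2_local_fraction(double l1_local_load_miss,
                         double l2_read_requests, double l2_write_requests) {
    return (2.0 * 4.0 * l1_local_load_miss)
         / (l2_read_requests + l2_write_requests);
}
/* Example (hypothetical threshold): if l2_local_fraction(...) > 0.10,
   spilling is worth addressing with the optimizations in this section. */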
To determine if the memory access pattern is problematic, compare the application-
requested throughput with the hardware throughput. If the application-requested
throughput is much lower than the hardware throughput, then many more bytes are
being moved than are being used by the application:
gld_request much less than (l1_global_load_hit + l1_global_load_miss) * (word_size / 32)
gst_request much less than l2_write_requests * (word_size / 32)
To determine if loads are coalesced, compare the number of global load requests
with the number of L1 cache line requests from global memory. The measured number
of transactions per load request, (l1_global_load_hit + l1_global_load_miss) / gld_request,
is compared with the expected number of transactions per load. The expected
transactions per load for fp32 (single precision floating point) is 1, because the 32
threads of a warp each request 4 bytes, a 128-byte request that matches the transaction
size. The word size for fp64 (doubles) is 8 bytes: 32 threads each requesting 8 bytes
yields a 256-byte request, or two expected transactions for doubles.
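The coalescing check is a one-line comparison once the events are collected. The
sketch below assumes the Fermi event names above; the 1.5x slack factor is an
illustrative choice, not an NVidia-documented constant.

/* Flag poorly coalesced loads: measured transactions per load request vs.
   the expected count (warp_size * word_size / 128B line). */
int loads_poorly_coalesced(double l1_global_load_hit,
                           double l1_global_load_miss,
                           double gld_request, int word_size_bytes) {
    double measured = (l1_global_load_hit + l1_global_load_miss) / gld_request;
    double expected = (32.0 * word_size_bytes) / 128.0; /* 1 fp32, 2 fp64 */
    return measured > 1.5 * expected; /* 1.5x slack: illustrative only */
}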
Maxwell memory transaction related metrics are listed in Table 3.9, memory
throughput related metrics in Table 3.10, memory utilization related metrics in
Table 3.11, and miscellaneous memory metrics in Table 3.12.
Table 3.9: Maxwell memory transaction metrics.
Metric Name Description
shared_load_transactions_per_request Average number of shared memory load transactions performed for each shared memory load
shared_store_transactions_per_request Average number of shared memory store transactions performed for each shared memory store
local_load_transactions_per_request Average number of local memory load transactions performed for each local memory load
local_store_transactions_per_request Average number of local memory store transactions performed for each local memory store
gld_transactions_per_request Average number of global memory load transactions performed for each global memory load
gst_transactions_per_request Average number of global memory store transactions performed for each global memory store
shared_store_transactions Number of shared memory store transactions
shared_load_transactions Number of shared memory load transactions
local_load_transactions Number of local memory load transactions
local_store_transactions Number of local memory store transactions
gld_transactions Number of global memory load transactions
gst_transactions Number of global memory store transactions
dram_read_transactions Device memory read transactions
dram_write_transactions Device memory write transactions
atomic_transactions Global memory atomic and reduction transactions
atomic_transactions_per_request Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction
sysmem_read_transactions Number of system memory read transactions
sysmem_write_transactions Number of system memory write transactions
l2_read_transactions Memory read transactions seen at L2 cache for all read requests
l2_write_transactions Memory write transactions seen at L2 cache for all write requests
local_memory_overhead Ratio of local memory traffic to total memory traffic between the L1 and L2 caches
l2_atomic_transactions Memory read transactions seen at L2 cache for atomic and reduction requests
l2_tex_read_transactions Memory read transactions seen at L2 cache for read requests from the texture cache
l2_tex_write_transactions Memory write transactions seen at L2 cache for write requests from the texture cache
tex_cache_transactions Texture cache read transactions
Table 3.10: Maxwell memory throughput metrics.
Metric Name Description
gld_requested_throughput Requested global memory load throughput
gst_requested_throughput Requested global memory store throughput
gld_throughput Global memory load throughput
gst_throughput Global memory store throughput
dram_read_throughput Device memory read throughput
dram_write_throughput Device memory write throughput
tex_cache_throughput Texture cache throughput
local_load_throughput Local memory load throughput
local_store_throughput Local memory store throughput
shared_load_throughput Shared memory load throughput
shared_store_throughput Shared memory store throughput
l2_tex_read_throughput Memory read throughput seen at L2 cache for read requests from the texture cache
l2_tex_write_throughput Memory write throughput seen at L2 cache for write requests from the texture cache
l2_read_throughput Memory read throughput seen at L2 cache for all read requests
l2_write_throughput Memory write throughput seen at L2 cache for all write requests
sysmem_read_throughput System memory read throughput
sysmem_write_throughput System memory write throughput
l2_atomic_throughput Memory read throughput seen at L2 cache for atomic and reduction requests
ecc_throughput ECC throughput from L2 to DRAM
Table 3.11: Maxwell memory utilization metrics.
Metric Name Description
l2_utilization The utilization level of the L2 cache relative to the peak utilization
tex_fu_utilization The utilization level of the multiprocessor function units that execute global, local and texture memory instructions
ldst_fu_utilization The utilization level of the multiprocessor function units that execute shared load, shared store and constant load instructions
dram_utilization The utilization level of the device memory relative to the peak utilization
tex_utilization The utilization level of the texture cache relative to the peak utilization
sysmem_utilization The utilization level of the system memory relative to the peak utilization
shared_utilization The utilization level of the shared memory relative to the peak utilization
Table 3.12: Miscellaneous Maxwell Memory Metrics.
Metric Name Description
global_hit_rate Hit rate for global loads
local_hit_rate Hit rate for local loads and stores
tex_cache_hit_rate Texture cache hit rate
l2_tex_read_hit_rate Hit rate at L2 cache for all read requests from the texture cache
l2_tex_write_hit_rate Hit rate at L2 cache for all write requests from the texture cache
gld_efficiency Ratio of requested global memory load throughput to required global memory load throughput. Values greater than 100% indicate that, on average, the load requests of multiple threads in a warp fetched from the same memory address
gst_efficiency Ratio of requested global memory store throughput to required global memory store throughput. Values greater than 100% indicate that, on average, the store requests of multiple threads in a warp targeted the same memory address
shared_efficiency Ratio of requested shared memory throughput to required shared memory throughput
ldst_issued Number of issued local, global, shared and texture memory load and store instructions
ldst_executed Number of executed local, global, shared and texture memory load and store instructions
stall_memory_dependency Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding
stall_texture Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests
stall_other Percentage of stalls occurring due to miscellaneous reasons
stall_constant_memory_dependency Percentage of stalls occurring because of an immediate constant cache miss
stall_memory_throttle Percentage of stalls occurring because of memory throttle
Inefficient memory access patterns are well-known and documented performance
limiters on throughput-oriented architectures. The framework measures the bytes
requested by the application and compares them to the bytes moved by the hardware.
The two can differ if memory access patterns cause inefficient use of the memory
bus. This framework identifies whether the access pattern is problematic using heuristics
developed and described by NVidia engineers [126]. If the number of global memory
load (or store) bytes requested by the application is much smaller than the number of
bytes moved by the hardware, then efficiency is much less than 100% and bandwidth
is being wasted. Below 50%, scattered accesses are most likely the problem.
Another indication that bandwidth is being wasted is a measured number of
transactions per load request higher than the expected number: the application is
using only some of the bytes in each transaction and has to generate more transactions
to satisfy every thread in the warp. For example, fp64 accesses require at least 2
transactions per memory access and fp32 accesses require at least 1 transaction per
memory access. If the number of memory transactions is higher than expected, it
indicates bad access patterns. The primary optimization to consider for problematic
access patterns is to ensure accesses are coalesced. Loads can be coalesced by using
structure-of-arrays storage (as opposed to array-of-structures) or by padding multi-
dimensional structures so that warp accesses are aligned on line boundaries.
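To make the structure-of-arrays point concrete, here is a minimal CUDA sketch
contrasting the two layouts; the struct and kernel names are invented for illustration.

/* Array-of-structures: consecutive threads read x from addresses 16 bytes
   apart, so a warp's loads span several 128B lines (poor coalescing). */
struct ParticleAoS { float x, y, z, w; };
__global__ void scale_aos(ParticleAoS* p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= s;
}

/* Structure-of-arrays: consecutive threads read consecutive floats, so a
   warp's 32 x 4B loads coalesce into a single 128B transaction. */
struct ParticleSoA { float *x, *y, *z, *w; };
__global__ void scale_soa(ParticleSoA p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] *= s;
}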
Candidate optimizations when register spilling is a problem are to increase the
register count per thread (using a higher limit in the -maxrregcount compiler option
or a lower thread count with the kernel launch bounds), increase the size of the L1
cache (which reduces the bytes available for shared memory), use non-caching global
memory loads, or fetch data through the texture cache [126]. Increasing the
number of registers per thread can decrease occupancy, potentially making global
memory accesses less efficient. However, this can still be a net win if the fewer
total bytes accessed in global memory reduce pressure on the memory interface. The
challenge is to find the right balance. The purpose of increasing the size of the L1
cache (which decreases shared memory) on Fermi is to enable more spills/fills to hit in
the cache and reduce memory traffic. Non-caching loads disable the L1 cache only (not
the L2) and generate smaller transactions (32B instead of 128B), which is more efficient
for scattered or partially-filled access patterns. Fetching data through the texture or
constant cache can be effective if smaller transactions help with memory efficiency
(the kernel is using all bytes fetched from memory and not generating multiple
transactions that could be coalesced into one), and because that cache is not polluted
by other global memory loads. An auto-tuner is an appropriate framework for
experimenting with L1 and caching configurations to select the best option.
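These knobs map to real CUDA mechanisms; the following sketch shows them
together. __launch_bounds__, cudaFuncSetCacheConfig, and -maxrregcount are
documented CUDA features; the specific values (256 threads, 2 blocks) are
placeholder choices for illustration.

/* Cap threads per block and hint resident blocks so the compiler can
   allocate more registers per thread (reduces spilling). */
__global__ void __launch_bounds__(256, 2) my_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void configure() {
    /* On Fermi, prefer a larger L1 so spills/fills hit in cache. */
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
}
/* Related compiler options (command line, not code):
     nvcc -maxrregcount=64 ...   adjust the per-thread register cap
     nvcc -Xptxas -dlcm=cg ...   non-caching (L2-only) global loads */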
3.2.2 Instruction Analysis Metrics and Optimizations
The analysis metrics and optimizations described in this section are evaluated
if the code is identified as compute-bound (used interchangeably with instruction
throughput limited). The instruction analysis metrics described in this section are
derived from the metrics and events described in 3.2 and represent the thresholds and
heuristics applied to determine which optimization is likely to improve instruction
throughput. As discussed in 3.2.1, NVidia's profiling tool, nvprof, has evolved sig-
nificantly, and many of the instruction analysis metrics that had to be manually
measured, compiled, and computed for Fermi are output directly by nvprof
for Maxwell. One of the complicating factors in this work is that the tools and methods
for collecting performance counters differ between platforms. To characterize the
two NVidia architectures, instruction analysis metrics for both Fermi and Maxwell
are detailed in this section. However, the concepts for how to measure, interpret, and
apply the data are the same for any throughput-oriented architecture.
The following are a few analysis metrics for Fermi given as guidelines by NVidia
engineers for compute-bound kernels.
Serialization Impact: serialization is significant if inst_issued is signifi-
cantly higher than inst_executed.

inst_executed / inst_issued    (3.1)

Shared Memory Bank Conflicts: shared memory bank conflicts can limit in-
struction throughput if conflicts are a significant percentage of instructions and
the kernel is instruction throughput limited.

l1_shared_bank_conflict / (shared_load + shared_store)    (3.2)

l1_shared_bank_conflict / inst_issued    (3.3)

Register Spilling Impact: measure whether register spills are inflating the instruction
count. Count L1 misses for caching loads (equation 3.4); count L2 read requests
for non-caching loads and stores (equation 3.5).

l1_local_load_miss / inst_issued    (3.4)

l2_read_requests / inst_issued    (3.5)

Local Memory Impact: percentage of instructions due to local memory accesses.

(l1_local_load_hit + l1_local_load_miss + l1_local_store_hit + l1_local_store_miss) / inst_issued    (3.6)

Branch Divergence Impact: branch divergence can waste instructions.

divergent_branch / branch    (3.7)
All Divergence Impact: branch divergence is just one way instructions can be
serialized; instructions can serialize for other reasons as well.

100 * ((32 * inst_executed) - thread_instructions_executed) / (32 * inst_executed)    (3.8)

Instructions Per Clock (IPC):

(inst_executed / num_SMs) / elapsed_clocks    (3.9)
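These formulas translate directly into code. The sketch below computes equations
3.1, 3.3, 3.7, and 3.9 from raw event counts; the struct is a hypothetical container for
nvprof output, not a profiler API.

#include <stdio.h>

/* Compute-bound analysis metrics from Fermi nvprof events.
   num_sms comes from the device properties. */
struct ComputeEvents {
    double inst_executed, inst_issued;
    double l1_shared_bank_conflict;
    double divergent_branch, branch;
    double elapsed_clocks;
};

void print_compute_metrics(struct ComputeEvents e, int num_sms) {
    printf("serialization (3.1): %f\n", e.inst_executed / e.inst_issued);
    printf("bank conflicts (3.3): %f\n",
           e.l1_shared_bank_conflict / e.inst_issued);
    printf("divergence (3.7): %f\n", e.divergent_branch / e.branch);
    printf("IPC (3.9): %f\n",
           (e.inst_executed / num_sms) / e.elapsed_clocks);
}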
Table 3.13 describes instruction utilization related metrics for Maxwell, Table 3.14
describes flop related metrics, Table 3.15 describes instruction count related metrics,
and Table 3.16 describes miscellaneous instruction metrics on Maxwell.
Table 3.13: Maxwell instruction utilization related metrics.
Metric Description
issue_slot_utilization Percentage of issue slots that issued at least one instruction, averaged across all cycles
cf_fu_utilization The utilization level of the multiprocessor function units that execute control-flow instructions
tex_fu_utilization The utilization level of the multiprocessor function units that execute global, local and texture memory instructions
ldst_fu_utilization The utilization level of the multiprocessor function units that execute shared load, shared store and constant load instructions
double_precision_fu_utilization The utilization level of the multiprocessor function units that execute double-precision floating-point instructions
special_fu_utilization The utilization level of the multiprocessor function units that execute sin, cos, ex2, popc, flo, and similar instructions
single_precision_fu_utilization The utilization level of the multiprocessor function units that execute single-precision floating-point instructions and integer instructions
Table 3.14: Maxwell flop and efficiency related metrics.
Metric Description
ipc Instructions executed per cycle
issued_ipc Instructions issued per cycle
flop_count_dp Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special)
flop_count_dp_add Number of double-precision floating-point add operations executed by non-predicated threads
flop_count_dp_fma Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads
flop_count_dp_mul Number of double-precision floating-point multiply operations executed by non-predicated threads
flop_count_sp Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special)
flop_count_sp_add Number of single-precision floating-point add operations executed by non-predicated threads
flop_count_sp_fma Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads
flop_count_sp_mul Number of single-precision floating-point multiply operations executed by non-predicated threads
flop_count_sp_special Number of single-precision floating-point special operations executed by non-predicated threads
eligible_warps_per_cycle Average number of warps eligible to issue per active cycle
flop_sp_efficiency Ratio of achieved to peak single-precision floating-point operations
flop_dp_efficiency Ratio of achieved to peak double-precision floating-point operations
branch_efficiency Ratio of non-divergent branches to total branches
warp_execution_efficiency Ratio of average active threads per warp to the maximum number of threads per warp supported on a multiprocessor
Table 3.15: Maxwell instruction counts.
Metric Description
inst_per_warp Average number of instructions executed by each warp
inst_replay_overhead Average number of replays for each instruction executed
inst_executed The number of instructions executed
inst_issued The number of instructions issued
inst_fp_32 Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
inst_fp_64 Number of double-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
inst_integer Number of integer instructions executed by non-predicated threads
inst_bit_convert Number of bit-conversion instructions executed by non-predicated threads
inst_control Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)
inst_compute_ld_st Number of compute load/store instructions executed by non-predicated threads
inst_misc Number of miscellaneous instructions executed by non-predicated threads
inst_inter_thread_comm Number of inter-thread communication instructions executed by non-predicated threads
Table 3.16: Maxwell miscellaneous instruction metrics.
Metric Description
issue_slots The number of issue slots used
cf_issued Number of issued control-flow instructions
cf_executed Number of executed control-flow instructions
ldst_issued Number of issued local, global, shared and texture memory load and store instructions
ldst_executed Number of executed local, global, shared and texture memory load and store instructions
atomic_transactions Global memory atomic and reduction transactions
stall_inst_fetch Percentage of stalls occurring because the next assembly instruction has not yet been fetched
stall_exec_dependency Percentage of stalls occurring because an input required by the instruction is not yet available
stall_sync Percentage of stalls occurring because the warp is blocked at a __syncthreads() call
stall_other Percentage of stalls occurring due to miscellaneous reasons
stall_pipe_busy Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy
stall_not_selected Percentage of stalls occurring because the warp was not selected
eligible_warps_per_cycle Average number of warps that are eligible to issue per active cycle
Compute-bound optimization strategies focus on reducing the number of instruc-
tions executed or on using higher-performing instructions. Factors that can prevent code
from reaching peak compute capability include raw instruction throughput and instruc-
tion serialization. It is important to understand the instruction mix of the kernel
because 32-bit floats, 64-bit floats, integers, memory loads and stores, and transcen-
dentals all have different throughputs. When peak GFLOP/s is quoted for hardware,
it usually refers to 32-bit floating point throughput, which is only achievable if 100%
of the instructions are 32-bit floating point operations. A kernel with some percentage
of 64-bit floats will, by definition, have a maximum limit lower than the 32-bit
theoretical ceiling. This is why instruction distributions are measured and applied to
instruction throughputs in kernel analysis.
One optimization strategy to improve instruction throughput is to change the
raw instruction mix to prefer higher-throughput instruction types. One example is to
replace 64-bit floats with 32-bit floats where the lower level of precision is acceptable.
Floating point literals without an f suffix (52.8 as opposed to 52.8f) are interpreted
as 64-bit doubles per the C standard. Another is to use transcendental instruction
types, which are hardware-optimized approximations. Again, this trades accuracy
for speed and can only be used if the loss in accuracy can be tolerated.
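Both substitutions are one-line changes in CUDA; the kernel below is an illustrative
sketch (the function name and data are invented). __expf is a documented CUDA
fast-math intrinsic, and -use_fast_math applies such substitutions globally.

__global__ void mix_example(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        /* 52.8 (no suffix) is a double literal and would force fp64
           arithmetic here; the f suffix keeps the expression in fp32. */
        float scaled = in[i] * 52.8f;
        /* __expf is the hardware-approximated transcendental; expf is the
           accurate library version. Use the intrinsic only when reduced
           accuracy is tolerable. */
        out[i] = __expf(scaled);
    }
}
/* Compiling with nvcc -use_fast_math applies this class of substitution
   globally. */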
Instruction serialization occurs when threads in a warp issue the same instruction
in sequence as opposed to the entire warp issuing the instruction at once [126]. NVidia
profile counters refer to this as replays because the same instruction is replayed for
different threads in a warp. Replays can be caused by shared memory bank conflicts
and constant memory bank conflicts. Warp divergence can also cause instructions to
serialize. Serialization due to divergent branches and shared memory conflicts can be
measured with profile counters.
Shared memory bank conflicts can significantly impact performance if the kernel
is instruction throughput limited and shared memory bank conflicts are a significant
percentage of instructions issued. If a warp accesses a 32x32 shared memory array
by column, each thread of the warp accesses the same bank, resulting in a 32-way
bank conflict. Bank conflicts can be avoided by padding the shared memory array
so that each thread accesses a different bank.
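A minimal CUDA sketch of the padding fix, using the classic tiled transpose as the
example; the kernel name is invented, and the grid is assumed to cover a width that is
a multiple of 32.

/* A 32x32 tile padded to 32x33: column accesses by a warp now map to 32
   distinct banks, so the 32-way conflict disappears at the cost of one
   unused float per row. */
__global__ void transpose_tile(float* out, const float* in, int width) {
    __shared__ float tile[32][33];           /* 33, not 32: the padding */
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   /* row write */
    __syncthreads();
    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  /* column read */
}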
Constant memory throughput can be measured against hardware peak capabil-
ities to determine if constant memory bank conflicts are limiting constant memory
throughput. Constant memory resides in global memory and can process 4B per SM
per clock.
3.2.3 Latency Optimizations
The optimizations described in this section are evaluated if the code is identified
as latency-bound. One reason memory throughput can be lower than hardware limits
is that the number of concurrent accesses is insufficient to hide memory latency.
Little's law can be used to approximate whether there are sufficient concurrent accesses
to saturate the bus: high-performing kernels need (memory latency) x (bandwidth)
bytes in flight to hide latency. Concurrent accesses can be increased by
increasing occupancy and by modifying the code to process several elements per thread.
Occupancy is increased by adjusting threadblock dimensions to maximize occupancy
at the given register and shared memory requirements. There is a balance point where
occupancy is sufficient, and whatever register and shared memory budget remains
should be devoted to the kernel accordingly.
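As a worked example of Little's law with hypothetical numbers (a ~500 ns effective
memory latency and 150 GB/s of bandwidth, chosen only for illustration):

/* bytes in flight = latency x bandwidth
     = 500e-9 s * 150e9 B/s = 75,000 bytes.
   With 128B transactions that is ~586 concurrent memory transactions; at
   one 4B load per thread, roughly 19K threads must be resident to
   saturate the bus. */
double bytes_in_flight(double latency_s, double bandwidth_Bps) {
    return latency_s * bandwidth_Bps;
}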
3.3 Xeon Phi Metrics and Analysis
This section describes the low level events, analysis metrics, and performance
thresholds collected on the Xeon Phi Coprocessors to drive performance analysis
and optimization. The hardware counters exposed to developers on the Xeon Phi are
very different from what is measured on NVidia using nvprof. However, the basic
concepts remain the same.
Table 3.17: Xeon Phi Hardware Events from Intel's VTune [33].
Metric Description
CPU_CLK_UNHALTED Number of cycles executed by the core.
DATA_PAGE_WALK Number of L1 TLB misses.
DATA_READ_MISS Number of memory read accesses that miss the internal data cache, whether the access is cacheable or noncacheable. Cache accesses resulting from prefetch instructions are included.
DATA_READ_MISS_OR_WRITE_MISS Number of demand loads or stores that miss a thread's L1 cache.
DATA_READ_OR_WRITE Number of loads and stores seen by a thread's L1 data cache.
DATA_WRITE_MISS Number of memory write accesses that miss the internal data cache, whether the access is cacheable or noncacheable.
EXEC_STAGE_CYCLES Number of cycles when the thread was executing computational operations.
HWP_L2MISS Number of hardware prefetches that missed L2.
INSTRUCTIONS_EXECUTED Number of instructions executed by the thread.
L2_DATA_READ_MISS_CACHE_FILL Number of data read accesses that missed the L2 cache and were satisfied by another L2 cache. Can include promoted read misses that started as code accesses.
L2_DATA_READ_MISS_MEM_FILL Number of data read accesses that missed the L2 cache and were satisfied by main memory. Can include promoted read misses that started as code accesses.
L2_DATA_WRITE_MISS_CACHE_FILL Number of data write (RFO) accesses that missed the L2 cache and were satisfied by another L2 cache.
L2_DATA_WRITE_MISS_MEM_FILL Number of data write (RFO) accesses that missed the L2 cache and were satisfied by main memory.
L2_VICTIM_REQ_WITH_DATA Number of evictions that resulted in a memory write operation.
SNP_HITM_L2 Number of incoming snoops that hit modified data in L2 (thus resulting in an L2 eviction).
VPU_ELEMENTS_ACTIVE Number of VPU operations executed by the thread.
VPU_INSTRUCTIONS_EXECUTED Number of VPU instructions executed by the thread.
Table 3.18 lists all equations that support analysis on the Xeon Phi Coprocessors.
Table 3.18: Xeon Phi Formulas for Performance Analysis.
Metric Name Formula
FLOP/s VPU_ELEMENTS_ACTIVE / Time
SP GFLOP/sec 16 (SP SIMD lanes) * 2 (FMA) * 1.1 (GHz) * 56 (#cores) = 1971.2
DP GFLOP/sec 8 (DP SIMD lanes) * 2 (FMA) * 1.1 (GHz) * 56 (#cores) = 985.6
Average CPI per Thread CPU_CLK_UNHALTED / INSTRUCTIONS_EXECUTED
Average CPI per Core Average CPI per Thread / num_hardware_threads
L1 Compute to Data Access Ratio VPU_ELEMENTS_ACTIVE / DATA_READ_OR_WRITE
L2 Compute to Data Access Ratio VPU_ELEMENTS_ACTIVE / DATA_READ_MISS_OR_WRITE_MISS
Vectorization Intensity VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED
L1 Misses DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1
L1 Hit Rate (DATA_READ_OR_WRITE - L1 Misses) / DATA_READ_OR_WRITE
Estimated Latency Impact (CPU_CLK_UNHALTED - EXEC_STAGE_CYCLES - DATA_READ_OR_WRITE) / DATA_READ_MISS_OR_WRITE_MISS
L1 TLB miss ratio DATA_PAGE_WALK / DATA_READ_OR_WRITE
L2 TLB miss ratio LONG_DATA_PAGE_WALK / DATA_READ_OR_WRITE
L1 TLB misses per L2 TLB miss DATA_PAGE_WALK / LONG_DATA_PAGE_WALK
The L1 compute to data access ratio gives the average number of vectorized operations
that occur for each L1 cache access.
In practice, an achieved bandwidth of approximately 140 GB/sec is near the maxi-
mum that an application is likely to see. This is likely due to limits in the network
infrastructure.
The following are performance heuristics published by Intel [33] to guide per-
formance analysis on the Xeon Phi coprocessors. Intel documentation recommends
investigation if any of the given heuristics hold. However, as discussed in this anal-
ysis framework, the memory-related heuristics should be investigated only if the kernel
is memory-bound, and the compute-related heuristics only if the kernel is compute-bound.
(A code sketch of a few of these checks follows the list.)
average CPI per thread is greater than 4
average CPI per core is greater than 1
vectorization intensity is less than 8 (SP)
vectorization intensity is less than 16 (DP)
L1 compute to data access ratio is less than the vectorization intensity
L2 compute to data access ratio is less than 100x the L1 compute to data access
ratio
L1 hit rate is less than 95%
estimated latency impact is greater than 145
L1 TLB miss ratio is greater than 1%
L2 TLB miss ratio is greater than 0.1%
L1 TLB misses per L2 TLB miss is near 1
bandwidth is less than 80 GB/s
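A minimal sketch of a few of these checks in C, using the event names from Table
3.17; the struct is a hypothetical container for VTune counter values, not a VTune API.

#include <stdio.h>

/* A few of the Intel heuristics above, computed from raw VTune event
   counts (Table 3.17) for one thread. */
struct PhiEvents {
    double cpu_clk_unhalted, instructions_executed;
    double vpu_elements_active, vpu_instructions_executed;
    double data_read_or_write;
};

void phi_checks(struct PhiEvents e, int hw_threads_per_core) {
    double cpi_thread = e.cpu_clk_unhalted / e.instructions_executed;
    double cpi_core = cpi_thread / hw_threads_per_core;
    double vec_intensity = e.vpu_elements_active
                         / e.vpu_instructions_executed;
    double l1_ratio = e.vpu_elements_active / e.data_read_or_write;
    if (cpi_thread > 4.0) printf("investigate: CPI per thread > 4\n");
    if (cpi_core > 1.0) printf("investigate: CPI per core > 1\n");
    if (vec_intensity < 8.0) printf("investigate: SP vectorization < 8\n");
    if (l1_ratio < vec_intensity)
        printf("investigate: L1 compute/data ratio is low\n");
}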
4. Scientific Kernel Characterization
Workload characterization for the analysis framework has three primary objec-
tives: capture the instruction mix (instruction frequencies), form an algorithmic es-
timate of flops and bytes to make an educated guess whether the kernel should be compute-
or memory-bound, and gather relevant domain knowledge that may inform optimization
with respect to data structures and organization. According to [57], a benchmark
(kernel) is really a specification of event frequencies. For example, [57] asserts:
Cycles per instruction (CPI) is a most natural metric for expressing processor perfor-
mance because it is the product of two measurable things: CPI = (cycles per event)
* (events per instruction). The number of cycles per event is determined by the event
type for a particular microarchitecture, and the number of events per instruction is
known for each workload (independent of the microarchitecture).
We apply this concept to the instruction mix of the workload (kernel-dependent, mi-
croarchitecture-independent) and the throughput for each instruction type (kernel-
independent, microarchitecture-dependent).
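A small sketch of this decomposition: given a measured instruction mix and per-type
peak throughputs, the kernel-imposed ceiling is a mix-weighted harmonic combination.
The instruction categories and the numbers in the trailing comment are illustrative
placeholders, not measured values.

/* Kernel-imposed instruction throughput ceiling: a weighted harmonic mean
   of per-type peaks, weighted by the kernel's instruction mix. */
double kernel_imposed_peak(const double* mix_fraction,  /* sums to 1.0 */
                           const double* peak_per_type, /* inst/s per type */
                           int num_types) {
    double inv = 0.0;
    for (int t = 0; t < num_types; t++)
        inv += mix_fraction[t] / peak_per_type[t]; /* time share per type */
    return 1.0 / inv;
}
/* e.g., 90% fp32 at peak P and 10% fp64 at peak P/32 caps the kernel at
   1 / (0.9/P + 0.1*32/P) = P/4.1, far below the quoted fp32 peak. */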
A processor's architecture defines the instructions it can execute. Its micro-
architecture determines how the instructions are executed [121]. Ideally, micro-
architecturally agnostic events help isolate kernel behavior from hardware influences.
Micro-architecturally agnostic metrics have the nice property that the same values
will be reported on any microarchitecture, so any differences observed in
performance must be due to differences in hardware rather than the complex inter-
play between software and hardware. However, this only holds on microarchitectures
within the same architectural family with the same instruction set. Even within the
same instruction set, many events are influenced by the hardware in subtle ways.
The primary purpose of capturing the instruction mix is to quantify kernel-im-
posed limits on performance rooflines. This helps more accurately predict how far
from peak a kernel is and how much optimization opportunity remains. For example,
most peak GFLOP/s claims are cited with respect to peak single-precision (SP) opera-
tions. This limit is achievable only if 100% of the instructions are single-precision. In
practice, measuring instruction types isn't supported by all profiling tools. Fermi
has a limited set of instruction events, not enough to estimate how the
instruction mix limits achievable peak. Maxwell improved over previous generations,
includes support for many instruction types, and is the basis for Figure A.10.
VTune on the Xeon Phi Coprocessors gives no indication of instruction
mix other than VPU instructions (arguably the most important type, but
broader visibility would help this kind of analysis). Instructions can be measured
in other ways, or parameterized estimates can be applied to roofline models. These
types of issues are why being within 70% to 80% of peak is considered very good; if
kernel-imposed limits are accounted for, 80% of peak is probably closer to 90% or
better.
One difficulty in comparing benchmarks or kernels across architectures is that the
source code for a kernel running on NVidia graphics cards with CUDA and the source
code for the same kernel running on Xeon Phi are, by definition, not the same source
code. They have different instruction types and instruction distributions specific to
the hardware the code runs on. The Xeon Phi Coprocessors are not compatible
with the proprietary CUDA extensions required to drive NVidia graphics cards, so
code written for CUDA-enabled graphics cards will not compile on Xeon Phi and vice
versa. However, the conceptual goal remains the same for all throughput-oriented
architectures: understand the kernel behaviors that may limit the theoretical
peak capabilities of the hardware. Architectures can be compared using the ratio of
performance to peak capability of the machine to normalize out differences in
the architecture. Assuming both kernels are optimized, relative performance can then be
assessed.
Table 4.1: Example instruction counts and distributions.
Metric Description
dynamic instruction count per kernel Per-kernel accounting of the dynamic instructions executed.
average instruction count per thread Average number of dynamic instructions executed by a thread.
thread count per kernel Count of total threads spawned per kernel.
total thread count Count of total threads spawned in a workload.
floating-point instruction count Total number of floating point instructions executed.
integer instruction count Total number of integer instructions executed.
special instruction count Total number of special functional unit instructions executed.
memory instruction count Total number of memory instructions.
branch instruction count Total number of branch instructions.
barrier instruction count Total number of barrier synchronization instructions.
To demonstrate the general applicability of our approach, we analyzed two additional
scientific kernels (beyond the scientific program that has been the basis of our research
to date). We selected Stencil and SGEMM because they are well-studied
kernels whose characteristics have been established in performance optimization research,
and we validate how well the framework guides optimization against those known characteristics.
In addition, both Stencil and SGEMM behave differently on different architectures
depending on the level of optimization performed. We demonstrate performance
improvement over several kernel implementations on several architectures to estab-
lish the validity of the analysis framework. The additional kernels demonstrate that our
methodology works with very different application domains, and the Xeon Phi Co-
processors demonstrate that it works with a very different architectural
paradigm. The three scientific application kernels we selected are Stencil, single pre-
cision general matrix multiply (SGEMM), and the RK4 implementation for kinetic
modeling. The following sections describe each application in more detail, including
why it was selected for study.
4.1 Stencil
Partial differential equations (PDEs) represent a very common workload class in a broad range of scientific applications. Solving PDEs numerically is very important to the scientific community, and PDE solvers tend to be very computationally intensive, which makes them interesting candidates for acceleration [159]. We chose to include the stencil benchmark in part because stencil acceleration on advanced hardware is a very active area of research. Stencil applications are often implemented using iterative finite-difference techniques where each point in a multidimensional grid is updated with weighted contributions from a subset of its neighbors, which represent the coefficients of the PDE for that data element [47].
The stencil evaluated in the NVidia GPU analysis is a seven-point stencil that solves a 3D heat equation from the Parboil benchmark suite [159]. The 7-point stencil contains points directly adjacent to the center in each direction [135]. This benchmark includes a GPU-optimized version drawn from several published works including [145] and [45] and uses Jacobi iterations, which means the calculation is not done in place; the algorithm alternates the source and target arrays after each iteration [47]. For each grid point, the stencil executes 8 floating-point operations (two multiplications and six additions) and transfers at least 16 bytes (8 bytes have to be read and 8 written with double precision). The stencil's flop-to-byte ratio is 0.5, which is very low (the ideal balance for most throughput-oriented architectures is between four and ten), so the algorithm is likely to be memory-bound on most architectures.
The typical bottleneck of an unoptimized stencil implementation is data locality. The Parboil Stencil benchmark applies register tiling (blocking) along the Z-dimension and other tiling optimizations to improve locality. Data elements re-used by the stencil along the third dimension may not fit in the cache for large problem sizes, and baseline, or naive, implementations tend to be memory-bound. Cache tiling (blocking) is an optimization technique that forms smaller tiles of loop iterations which are executed together and result in better temporal and spatial locality; a sketch of this transformation follows Figure 4.1 below. The Parboil authors found that even with optimizations the performance limiter of the Stencil benchmark is global memory bandwidth (for the architectures they tested), which we were able to independently confirm in our analysis. Figure 4.1 is pseudocode for the baseline Parboil implementation; the same pseudocode is also used in [47].
The study of finite difference stencils is large, with many varied implementation choices. The stencil evaluated on Xeon Phi Coprocessors is an 8th-order (25-point) isotropic acoustic wave equation developed by Andreolli [16]. This algorithm is different from the 7-point stencil evaluated on GPUs, but the flop-to-byte ratio is very similar. The series of implementations and optimizations that Andreolli developed is very useful for validating the methodology proposed in this thesis.
void stencil3d(double *A, double *Anext, int niter, int x, int y, int z) {
  for (int t = 0 to niter) {
    for (int i = 1 to x-1) {
      for (int j = 1 to y-1) {
        for (int k = 1 to z-1) {
          Anext[i,j,k] = C0 * A[i,j,k]
                       + C1 * (A[i+1,j,k] + A[i-1,j,k]
                             + A[i,j+1,k] + A[i,j-1,k]
                             + A[i,j,k+1] + A[i,j,k-1]);
        }}}
    swap A and Anext
  }
}
Figure 4.1: Pseudocode for the baseline stencil implementation used in GPU architecture analysis.
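To make the cache-tiling discussion above concrete, the following is a minimal C sketch (ours, not the Parboil code) of blocking one sweep of the same loop nest; TILE is a tunable block size we introduce for illustration, and the bounds checks handle ragged edge tiles:

#define IDX(i, j, k, y, z) (((i) * (y) + (j)) * (z) + (k))
#define TILE 32  /* tunable; sized so the working set of a tile fits in cache */

/* One cache-blocked sweep of the 7-point stencil.  Iterating over small
 * j/k tiles improves temporal and spatial locality relative to the
 * baseline loop nest in Figure 4.1. */
void stencil3d_tiled(const double *A, double *Anext,
                     int x, int y, int z, double c0, double c1) {
    for (int jj = 1; jj < y - 1; jj += TILE)
    for (int kk = 1; kk < z - 1; kk += TILE)
    for (int i = 1; i < x - 1; i++)
    for (int j = jj; j < jj + TILE && j < y - 1; j++)
    for (int k = kk; k < kk + TILE && k < z - 1; k++)
        Anext[IDX(i, j, k, y, z)] =
            c0 * A[IDX(i, j, k, y, z)]
          + c1 * (A[IDX(i + 1, j, k, y, z)] + A[IDX(i - 1, j, k, y, z)]
                + A[IDX(i, j + 1, k, y, z)] + A[IDX(i, j - 1, k, y, z)]
                + A[IDX(i, j, k + 1, y, z)] + A[IDX(i, j, k - 1, y, z)]);
}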
4.2 Single Precision General Matrix Multiply (SGEMM)
General matrix multiplication is a function that performs matrix multiplication
of the form shown in Equation 4.1.
C = alpha * A * B + beta * C,    (4.1)

where A, B, and C are matrices and alpha and beta are floating-point scalars. Many studies set beta to zero and alpha to one, which reduces the equation to C = A * B, a form that is easy to verify for correctness without losing operations.
SGEMM performs O(n^3) compute operations, where n is the matrix dimension (we can assume square matrices without loss of generality), and O(n^2) data accesses. Therefore, the flop-to-byte ratio is O(n), which means SGEMM should be compute-bound when properly blocked. Naive implementations are often memory-bound, but any well-optimized implementation will be compute-bound. Due to the nature of matrix multiplication, the matrices can be blocked to fit in virtually any cache or on-chip storage (shared memory on GPUs) to find a good balance point for high performance on any architecture. SGEMM is unique in the property that an optimized implementation on any throughput-oriented architecture will be compute-bound.
SGEMM is a commonly implemented library routine with very broad applicability as a key building block in numerical linear algebra codes. The vast majority of library routines support two modes: normal-normal (NN), where the A, B, and C matrices are stored in column-major order, and normal-transposed (NT), where the A and C matrices are stored in column-major order and the B matrix is stored in row-major order. Few scientific kernels are as well understood, studied, optimized, and characterized as thoroughly as general matrix multiply. We use SGEMM as a known reference application for validation of our methodology. The Parboil implementation parameterized the code so that register tiling and shared memory tiling can be configured at compile time, which facilitates testing.
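As a rough illustration of why blocking makes SGEMM compute-bound, here is a minimal cache-blocked SGEMM sketch in C (ours, not the Parboil kernel). It computes C = alpha*A*B + beta*C for square n x n row-major matrices and, for brevity, assumes n is divisible by the block size BS:

#define BS 64  /* block size; tuned so three BSxBS tiles fit in cache */

/* Cache-blocked SGEMM.  Each BSxBS tile of A and B is reused BS times
 * once resident, raising the flop-to-byte ratio from O(1) toward O(BS). */
void sgemm_blocked(int n, float alpha, const float *A,
                   const float *B, float beta, float *C) {
    for (int i = 0; i < n * n; i++)
        C[i] *= beta;                      /* scale C by beta once */
    for (int ii = 0; ii < n; ii += BS)
    for (int kk = 0; kk < n; kk += BS)
    for (int jj = 0; jj < n; jj += BS)
        for (int i = ii; i < ii + BS; i++)
        for (int k = kk; k < kk + BS; k++) {
            float a = alpha * A[i * n + k];
            for (int j = jj; j < jj + BS; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}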
4.3 Kinetic Simulation
Kingen was developed by our collaborator, Dr. Tim Benke, a neuroscientist from the University of Colorado, Anschutz Medical Campus, to simulate AMPA receptor ion channel activity and to optimize kinetic model rate constants to biological data. Kingen uses a genetic algorithm to stochastically search parameter space for global optima. Because each individual in the population describes a rate constant parameter set in the kinetic model, and the model is evaluated for each individual, there is significant computational complexity and parallelism in even a simple model run. The bottleneck in Kingen is the ordinary differential equation (ODE) solver, which is typical for scientific applications.
Numerical methods are distinguished, among other things, by stiffness, a property of the ODEs themselves. Stiff systems appear mainly in chemical kinetics, electric circuits, and control theory [90]. Stiffness measures the difficulty of solving an ODE numerically and is characterized by disparate time scales [96] (small time steps are required for stability). Stiff systems require complex implicit methods to avoid numerical instabilities, while nonstiff systems can be solved by relatively simple explicit methods. The system of ODEs that describes AMPA currents is stiff, and there has been relatively little focus in acceleration research on how to optimize stiff systems of ODEs.
The most popular codes for the numerical solution of a stiff system of ODEs are based on the backward differentiation formulas (BDF). One of the most powerful methods characterized by high accuracy and high stability for solving ODEs is the implicit Runge-Kutta method. The Runge-Kutta 4th-order method (RK4) is considered a workhorse in scientific and engineering applications, providing accurate and stable solutions to the stiff systems of ODEs common in computational chemistry, and it is the selected solver within our simulation framework. The 4th-order Runge-Kutta formulas are given below.
x(t + h) = x(t) + (1/6)(F1 + 2 F2 + 2 F3 + F4)    (4.2)

where

F1 = h f(t, x)                    (4.3)
F2 = h f(t + h/2, x + F1/2)       (4.4)
F3 = h f(t + h/2, x + F2/2)       (4.5)
F4 = h f(t + h, x + F3)           (4.6)
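A minimal C sketch of one RK4 step for a single equation follows (our illustration; Kingen's actual solver advances a full system of state variables, so x and the F terms become vectors):

/* One classical RK4 step for dx/dt = f(t, x), following Eqs. 4.2-4.6. */
typedef double (*ode_rhs)(double t, double x);

double rk4_step(ode_rhs f, double t, double x, double h) {
    double F1 = h * f(t, x);
    double F2 = h * f(t + h / 2.0, x + F1 / 2.0);
    double F3 = h * f(t + h / 2.0, x + F2 / 2.0);
    double F4 = h * f(t + h, x + F3);
    return x + (F1 + 2.0 * F2 + 2.0 * F3 + F4) / 6.0;
}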
4.4 Characterization Summary
Many challenges face researchers in this field. One is that every tool provides different variations of similar metrics, and care must be taken to make sure the granularity and unit of the reported data are understood and used appropriately. This problem is demonstrated here in the very different suites of metrics for Maxwell versus Fermi, two architectures from the same family.
Table 4.2 demonstrates the challenge domain scientists face with simulation code that models real-world phenomena, as opposed to synthetic benchmarks. RK4 is the only kernel using all the available registers, indicating that the Parboil benchmarks have no register pressure, probably because the footprint of the code is small enough. In addition, RK4 launches far fewer threads than stencil yet executes approximately 15x more instructions, which means RK4 threads execute several orders of magnitude more instructions per thread than the stencil benchmark. RK4 launches more threads than SGEMM, but the difference in thread count does not come close to accounting for the difference in instructions executed. The question is whether optimizing single-threaded performance on fewer but more complex processing cores would be a better fit than the massively parallel but simple processing cores of standard graphics hardware.
Table 4.2: Various static metrics across each benchmark.

                                     SGEMM     Stencil          RK4
regs/thread                             47          25           63
shared memory per thread (bytes)         4           0            0
threads launched                     1,280     131,072       20,352
warps launched                          40       4,096          636
dynamic instructions executed      103,240  24,753,000  374,047,500
Table 4.3, extracted from [159], shows a brief summary of the major architectural features stressed by each benchmark in its unoptimized and optimized forms. One of the reasons we focused on the Parboil benchmark suite is that it provides optimized and unoptimized code for each benchmark, which we can use to evaluate the analysis framework. Figure 4.2 clearly shows that the RK4 kernel from the Kingen application has a very different profile from the other benchmarks, with significant convert instructions and fewer integer instructions. A challenge with numerical simulations in practice is knowing, before significant development effort is spent, whether the simulation equations will fit into limited GPU resources and how much data can be moved on-chip to closer memory for significant performance gains.
Table 4.3: Architecture Stresses of Benchmarks Before and After Optimization

Benchmark   Unoptimized Bottleneck   Optimizations Applied   Optimized Bottleneck
stencil     locality                 coarsening, tiling      bandwidth
sgemm       bandwidth                coarsening, tiling      instruction throughput
[Figure: "Instruction Mix for Scientific Kernels" bar chart of normalized instruction counts, broken down into single- and double-precision FP, integer, control-flow, load/store, misc, and inter-thread instructions, for sgemm base/opt (medium), stencil base/opt (default), and RK4 base/optimized.]
Figure 4.2: Instruction mix measured on Maxwell for scientific kernels.
5. Hardware and Architecture Characterization
This chapter describes the architectures that are evaluated in this thesis. The purpose is to understand the performance implications of primary architectural features and to compile the hardware specifications the analysis framework requires as input. Section 5.1 defines what it means for an architecture to be throughput-oriented and explains the implications throughput-centric design decisions have on performance and on how code executes on the machine. Section 5.2 describes NVidia's Fermi and Maxwell architectures, and Section 5.3 describes Intel's Xeon Phi Coprocessors. These three platforms are used as industry-recognized examples of throughput-oriented architectures and demonstrate that the concepts applied in the analysis framework are generally applicable to all throughput-oriented systems.
5.1 Throughput Oriented Architectures
The key to high performance on any architecture is to reduce latency or to increase throughput while tolerating latency. Latency measures the amount of time to complete a task, and throughput is the total amount of work completed per unit time [61]. The traditional commodity CPU is the exemplar of latency-oriented architectures, which place a premium on reducing the latency of a single thread (or task). The traditional commodity GPU is the exemplar of throughput-oriented architectures, which place a premium on maximizing the total amount of work that can be completed within a given amount of time. Architectural design tends to focus on one over the other, and throughput-oriented systems trade single-threaded speed for increased throughput over all threads. The trade-offs throughput-oriented architectures make have important implications for understanding the performance of scientific applications running on many-core and GPU platforms.

Throughput-oriented systems employ three key architectural features to hide latency and increase throughput: hardware multithreading, many simple processing units, and SIMD execution [61]. Simple cores tend to execute instructions in order and avoid speculative execution and branch prediction; out-of-order execution, speculative execution, and branch prediction are the very techniques latency-oriented systems use to speed up single-threaded performance. Single instruction, multiple data (SIMD) execution is attractive for throughput-oriented systems because it reduces the die area dedicated to control structures, which makes room for more compute units.
Graphics architectures are known to be very latency tolerant, which means any given thread may experience longer latency than is acceptable on latency-oriented CPUs, but the latency is hidden by switching in the next clock to another thread with ready operands. Latency tolerance implies that processor utilization does not drop just because some threads are blocked on long-latency operations [61]. This, in turn, implies that at any given moment some hardware resource should be running near full rate; this key observation is under-appreciated in performance-oriented research. The methodology described in the analysis framework is based on this principle.
5.2 NVidia Discrete GPU Architecture
Graphics processing units (GPUs) have evolved over the last decade from dedicated fixed-function 3D graphics processors to programmable, massively parallel processing architectures useful both for traditional graphics and for high-performance computation. The GeForce 8800, launched in 2006, was the first unified graphics and computing architecture released by NVidia with compute unified device architecture (CUDA), the software technology that enabled NVidia GPUs to be programmed with high-level programming languages. However, the general consensus is that Fermi, the third generation of compute-capable NVidia GPUs, was the first graphics architecture to support a complete set of features for scientific computing.
Table 5.1: NVidia Architectures

Year   Microarchitecture   Compute Capability          Compute Generation
2006   GeForce 8800        1.0                         1st Gen
2008   Tesla               1.0 - 1.3                   2nd Gen
2010   Fermi               2.0 - 2.1                   3rd Gen
2012   Kepler              3.0, 3.2, 3.5, 3.7          4th Gen
2014   Maxwell             5.0, 5.2, 5.3               5th Gen
2016   Pascal              ? (touted as 10x Maxwell)   6th Gen
2018   Volta               ?                           7th Gen
A brief summary of NVidia microarchitectures is given in Table 5.1. Generation one begins with the introduction of general-purpose compute capability, and subsequent generations count up from there. Table 5.2 shows the basic features of each of the major compute capability versions. The lack of double-precision support was seen as a major disadvantage for some scientific applications; double-precision floating point was added with compute capability 1.3.
Table 5.2: Compute Capabilities on NVidia Hardware

Compute Capability   Basic Features
compute_10           Basic features for general purpose compute
compute_11           Adds atomic operations on global memory
compute_12           Adds atomic operations on shared memory and vote instructions
compute_13           Adds double-precision floating point
compute_20           Adds support for Fermi
compute_30           Adds support for Kepler
compute_50           Adds support for Maxwell
Table 5.3 compares Fermi with Maxwell on key features of compute and memory capability. The peak rates in Table 5.4 are derived from these hardware specifications. Instruction throughputs are often given in giga-flops per second, or GFLOPS. A flop is defined as either an addition or a multiplication of single-precision (32-bit) or double-precision (64-bit) numbers.
Table 5.3: NVidia Core Architecture

                                          Fermi C2050   Maxwell GTX 960
Compute Capability                        2.0           5.0/5.2
CUDA Cores                                448           1024
Number of SM per GPU                      14            8
Number of SP per SM (CUDA cores per SM)   32            128
Number of SFU per SM                      4             64
Number of LD/ST per SM                    16            32
Warp Schedulers per SM                    2             4
Max Processor Clock (GHz)                 1.15          1.18
Memory Clock (GHz)                        1.5           5.4
Memory Interface Bus Type                 GDDR5         GDDR5
Memory Interface Width (bits)             384           128
Theoretical Memory Bandwidth (GB/s)       115           112
L1 bytes per SM                           65,536        24,576
L1 bytes per GPU                          917,504       196,608
Number of Shared Memory Banks             32            32
L2 Cache Capacity (KB)                    768           1,048
Table 5.4: Theoretical Peaks on NVidia Architecture

                                  Fermi (C2050)   Maxwell (GTX 960)
Peak SP FLOPs (GFlop/s)           1030            2308
Peak DP FLOPs (GFlop/s)           515             72
Peak SP FLOPs per SM (GFlop/s)    74              302
Peak Memory with ECC on (GB/s)    115.2           112.2
Execution limits in NVidia GPUs are described in Table 5.5. These limits represent trade-offs between occupancy and resource utilization that have to be balanced with the characteristics of the workload. Notice the difference in double-precision peak GFlop/s between Fermi and Maxwell (515 compared to 72): the Maxwell architecture reduced focus on scientific computing in favor of more efficient execution of traditional 3D graphics. Occupancy is usually defined in terms of thread occupancy, the ratio of active threads to the maximum supported per SM. Occupancy can be limited by the resource usage of each thread block, specifically with respect to registers and shared memory. Here is an example using Fermi limits to illustrate: if each thread in a thread block loads 12 single-precision floats into shared memory, then each thread is using 48 bytes of shared memory (12 x 4 bytes each). With a thread block size of 256 threads per block, each block is using 12,288 bytes of shared memory (256 threads x 48 bytes per thread). The number of active thread blocks will be limited to four per SM (49,152 bytes max shared memory per SM / 12,288 bytes per block). Four blocks x 256 threads per block is 1,024 active threads per SM out of 1,536 maximum possible. In other words, occupancy is limited to 67% (1,024/1,536) because of shared memory utilization. The same reasoning can be applied to the number of registers.
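The shared-memory occupancy reasoning above is mechanical enough to capture in a few lines of C; this sketch (ours) reproduces the worked example using the Fermi limits from Table 5.5:

#include <stdio.h>

int main(void) {
    const int smem_per_sm       = 49152;  /* shared memory per SM (bytes) */
    const int max_threads_sm    = 1536;   /* max active threads per SM    */
    const int threads_per_block = 256;
    const int smem_per_thread   = 12 * 4; /* 12 floats x 4 bytes = 48 B   */

    int smem_per_block = threads_per_block * smem_per_thread; /* 12,288 B */
    int blocks_per_sm  = smem_per_sm / smem_per_block;        /* 4        */
    int active_threads = blocks_per_sm * threads_per_block;   /* 1,024    */

    printf("occupancy = %.0f%%\n",                            /* 67%      */
           100.0 * active_threads / max_threads_sm);
    return 0;
}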
Table 5.5: Execution Limits in NVidia GPUs

                                          Fermi     Maxwell
Max Active Threads per SM                 1,536     2,048
Max Active Warps                          48        64
Threads per Warp                          32        32
Max Active Blocks per SM                  8         32
Max Threads per Block                     1,024     1,024
Max 32-bit Registers per SM               32,768    65,536
Max Registers per Block                   32,768    65,536
Max Active Threads per GPU                21,504    16,384
Max Registers per Thread                  63        255
Shared Memory per SM (bytes)              49,152    96,000
Shared Memory per Thread Block (bytes)    49,152    49,152
Table 5.6 lists essential hardware features. The number of processing cores equals number_SM_per_GPU x number_cores_per_SM. Instruction throughput in giga-instructions per second equals max_freq_processing_cores (GHz) x number_processing_cores. Peak FP32 throughput (GFlop/s) equals max_freq_processing_cores (GHz) x number_processing_cores x number_flops_per_processing_core. Peak FP64 throughput is calculated as 1/2 the SP rate for Fermi and 1/32 the SP rate for Maxwell.
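These formulas are simple enough to encode directly; the sketch below (ours) derives the Fermi C2050 rates in Table 5.6 from the basic hardware features:

#include <stdio.h>

int main(void) {
    const int    sms            = 14;
    const int    cores_per_sm   = 32;
    const double core_clock_ghz = 1.15;
    const double flops_per_core = 2.0;   /* an FMA counts as mul + add */

    int    cores    = sms * cores_per_sm;        /* 448                  */
    double ginst_s  = core_clock_ghz * cores;    /* ~515 GInst/s         */
    double fp32_gfs = ginst_s * flops_per_core;  /* ~1030 GFlop/s        */
    double fp64_gfs = fp32_gfs / 2.0;            /* 1/2 SP rate on Fermi */

    printf("GInst/s = %.0f, FP32 = %.0f GFlop/s, FP64 = %.0f GFlop/s\n",
           ginst_s, fp32_gfs, fp64_gfs);
    return 0;
}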
Table 5.6: Essential Hardware Features and Throughputs on Fermi and Maxwell

                                      Fermi (C2050)          Maxwell (GTX 960)
# Symmetric Multiprocessors           14                     8
Cores per SM                          32                     128
Core Clock (MHz)                      1150                   1279
Memory Clock (MHz)                    1500                   3505
Instruction Throughput (GInst/s)      515                    1206
Peak FP32 Throughput (GFlop/s)        1030                   2412
Peak FP64 Throughput (GFlop/s)        515                    75
Shared Memory (GB/s)                  1030                   1206
DRAM GB/s (w/ ECC on)                 144 (112)              112
Ideal Ratio Instructions : Memory     515 / 112 = 4.5 : 1    1206 / 112 = 10.7 : 1
LD/ST unit pressure                   16 / 32 = 1/2          32 / 128 = 1/4
DP performance (fraction of SP)       1/2                    1/32
Table 5.7 lists instruction throughputs from the NVidia manuals for Fermi [6] and Maxwell [11]. The throughputs are given in terms of operations per clock cycle per SM. These rates are used with the kernel's instruction mix profile to determine if the theoretical peak instruction throughput rate should be lowered due to the kernel's instruction mix. The instruction mix gives the percentage of each instruction type in the kernel. For example, the instruction mix for the SGEMM base implementation with the medium input set size (see Figure 4.2) is FP32: 5%, Int: 70%, control flow: 1%, Ld/St: 9%, and Misc: 15%. The weighted average is calculated to determine if the theoretical compute peak is limited by the kernel's instruction mix. Using Maxwell as an example, the average operations per clock cycle per SM is 128 x 0.05 + 128 x 0.70 + 64 x 0.01 + 32 x 0.09 + 64 x 0.15 = 109. This number is used as the number of processing cores to calculate the kernel-limited peak instruction throughput. In this example, that is 2 x (109 x 8) x 1.18 = 2057 GFlop/s. The peak SP GFlop/s on Maxwell is 2308, so the kernel's instruction mix reduces the theoretical max instruction throughput of the hardware by approximately 11%. This is an important execution difference that helps quantify how much optimization opportunity actually remains and whether higher-performing instructions can be used to improve the performance of compute-bound kernels.
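The weighted-average calculation generalizes as in the sketch below (ours), using the Maxwell per-SM rates from Table 5.7 and the SGEMM base instruction mix above:

#include <stdio.h>

int main(void) {
    /* Maxwell ops/clock/SM (Table 5.7) and the SGEMM base mix, in the
     * order fp32, integer, control flow, load/store, misc. */
    double rates[] = { 128, 128, 64, 32, 64 };
    double mix[]   = { 0.05, 0.70, 0.01, 0.09, 0.15 };

    double ops_per_clock = 0.0;
    for (int i = 0; i < 5; i++)
        ops_per_clock += rates[i] * mix[i];       /* ~109 */

    /* 2 flops per FMA x (weighted ops x 8 SMs) x 1.18 GHz clock. */
    double peak = 2.0 * (ops_per_clock * 8) * 1.18;
    printf("kernel-limited peak = %.0f GFlop/s\n", /* ~2060; the text's 2057
                                                      rounds the 109 first */
           peak);
    return 0;
}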
Table 5.7: Instruction Throughputs on Fermi and Maxwell
(operations per clock cycle per SM)

                                         Compute Capability 2.0   Compute Capability 5.0
                                         (Fermi)                  (Maxwell)
32-bit FP add, mult, mult-add            32                       128
64-bit FP add, mult, mult-add            16                       1
32-bit FP reciprocal, square root, exp   4                        32
convert instructions                     16                       32
32-bit int add, subtract                 32                       128
control flow, compare, min, max          32                       64
load/store                               16                       32
5.2.1 Fermi
Fermi is the third generation of compute-capable GPUs and represents a big leap over early generations, as detailed in Table 5.8. The primary differences in Fermi over previous generations are: memory is accessed in groups of 32 threads (compared to 16) to match the instruction issue width; an L1 cache was added in each SM, configurable as 16 KB L1 / 48 KB shared memory or 48 KB L1 / 16 KB shared memory; a global 768 KB L2 cache per chip was added; and Fermi added dual-issue, which means instructions from two different warps can be issued to two different pipes. Dual-issue on Fermi requires at least two active warps to hit peak throughputs.
Figure 5.1 is an illustration of a Fermi-based GPU [6]. In this example, there are 16 SMs. The C2050 cards in the Hydra lab have 14 SMs, and all specific Fermi analysis in this thesis assumes the Hydra configuration. Beginning with Fermi, NVidia GPUs use a small L1 cache for register spilling and for small local arrays that must be stored in global memory (because the compiler can't use register addressing if array indices are unknown). Earlier generation GPUs suffered from performance cliffs that an application could fall over with a moderate increase in register usage (from a developer's perspective); the L1 cache was designed to mitigate these spill/fill issues. Local memory is called local because it is private to individual threads, but local memory resides in the same area as global memory, with the same long latencies associated with DRAM access. If a significant percentage of DRAM or L2 accesses are due to local memory, these are likely spills/fills that are not doing productive work from the kernel's perspective and can decrease performance by putting increased pressure on memory.
Table 5.8: Summary of Features on First Three Generations of NVidia GPUs [6]

GPU                                G80                 GT200               Fermi
CUDA cores                         128                 240                 512
DP floating point capability       none                30 FMA ops/clock    256 FMA ops/clock
SP floating point capability       128 MAD ops/clock   240 MAD ops/clock   512 FMA ops/clock
Special Function Units (SFUs)/SM   2                   2                   4
Warp schedulers (per SM)           1                   1                   2
Shared Memory (KB per SM)          16                  16                  Configurable 48 KB or 16 KB
L1 Cache (per SM)                  None                None                Configurable 16 KB or 48 KB
L2 Cache                           None                None                768 KB
ECC Memory Support                 No                  No                  Yes
Concurrent Kernels                 No                  No                  Up to 16
Load/Store Address Width           32-bit              32-bit              64-bit
A larger L1 can improve performance when spilling registers or with misaligned, strided access patterns. If there is a load hit in L1, there is no bus traffic to L2 and memory. If there is a load miss in L1, 128 bytes of traffic per miss are generated. A cache line request is serviced at the throughput of L1 or L2 in the case of a cache hit, or at the throughput of device memory otherwise.
In single-precision floating-point operations (fp32), all global memory accesses are four-byte words. A warp has always consisted of 32 threads on NVidia hardware, executed together as single instruction, multiple thread (SIMT). SIMT differs from SIMD in that SIMT is scalar with no set vector width. SIMT allows each thread to take its own code path, and branching is handled by the hardware. Though branching is legal, there can be a performance penalty, as the execution of all threads taking each path is serialized in the hardware. The reality is that performance will likely suffer if threads are not scheduled and executed in powers of two to fill each warp, in much the same way that empty SIMD lanes hurt performance in vector architectures. The difference is really in how SIMT code versus SIMD code is programmed and less about how the hardware executes a vector of threads or a vector of data elements.
Figure 5.1: Figure from NVidia's white paper for Fermi [6]. Fermi has 32 processing units per SM. All SMs are backed by a large L2 cache. The smaller blocks in light blue along the edge of the frame are the L1/shared memory caches, which are allocated per SM.
Figure 5.2 is a close-up of one symmetric multiprocessor [6].
Figure 5.2: Figure from NVidia's white paper on Fermi [6]. The close-up of one SM shows the 32 processing cores per SM, 4 SFU units, and 16 LD/ST units.
5.2.2 Maxwell
Maxwell was released in 2014 and is NVidia's 5th-generation GPU capable of supporting general-purpose compute. Tables 5.3 and 5.5 highlight the basic changes in important architectural features between Fermi and Maxwell. The primary features to note are:
- The maximum number of active thread blocks per SM quadrupled to 32. This can improve occupancy for kernels with small thread blocks of 64 threads or fewer (assuming available registers and shared memory are not the occupancy limiter).

- Maxwell redesigned the symmetric multiprocessor to align better with warp size; the number of CUDA cores was reduced (with respect to Kepler) to a power of 2, making it easier to utilize the SM efficiently [11].

- Maxwell has dedicated shared memory per SM (64 KB on first-generation Maxwell; 96 KB on the GTX 960 used here, per Table 5.5), unlike Fermi and Kepler, which partitioned 64 KB of memory between L1 cache and shared memory. The functionality of the L1 and texture caches was merged into a single unit. The per-thread-block shared memory limit remains 48 KB in Maxwell, but the increase in total available shared memory can lead to occupancy improvements [11].
Figure 5.3: Figure from NVidia's white paper on Maxwell [11]. Each SM contains four warp schedulers, and each warp scheduler is capable of dispatching two instructions per warp every clock. Maxwell aligns with warp size by being partitioned into four distinct 32-core processing blocks, with 128 cores total per SM.
5.3 Intel Xeon Phi Coprocessor Architecture
The performance optimization methodology proposed in this thesis depends on a thorough characterization of the hardware and key theoretical peak limits, which are described here for the Xeon Phi Coprocessor. Throughput-oriented architectures depend on efficient utilization of hardware resources, and developers need a way to measure utilization to guide performance optimization. Intel has published key efficiency and analysis metrics to determine how well an application is utilizing available resources. The keys to performance on Xeon Phi Coprocessors are to express sufficient parallelism, vectorize efficiently, hide I/O latency, and scale parallel code with the number of cores. This means the number of cores, threads, and SIMD (vector) operations must be used effectively for the best performance. All events, metrics, equations, and heuristic guidelines described in this thesis are extracted from Intel developer documentation [33].
Figure 5.4 illustrates the Xeon Phi Coprocessor: 50 or more simplified coprocessor cores (depending on configuration), four threads per core, 512-bit registers for SIMD (vector) operations, vector processing units (VPUs) performing 512-bit vector operations on 16 SP or 8 DP floating-point values, 32 KB L1 data and instruction caches, and a 512 KB L2 cache per core shared among the four threads.
Figure 5.4: Intel's Xeon Phi Coprocessor block diagram. Figure from [10].
Measuring efficiency in terms of floating-point operations, as done in the NVidia analysis, is convenient because it is easily compared to the peak floating-point performance of the machine. However, the Intel Xeon Phi Coprocessor does not have events to count floating-point operations. The alternative is to measure the number of vector instructions executed [33]. Most vector instructions have four-cycle latency and single-cycle throughput, so four threads should be used to hide vector unit latency [142]. When a vector operation on two full vectors is performed, the vpu_elements_active event is incremented by 16 (for single precision) or 8 (for double precision). The rule of thumb is to take the ratio of vpu_elements_active to vpu_instructions_executed: if it approaches 8 or 16, the loop is well vectorized. Vectorization intensity cannot exceed 8 for double-precision code or 16 for single-precision code.
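This rule of thumb is easy to codify; the sketch below (ours) assumes the two event counts have already been collected with VTune, and the counts shown are hypothetical:

#include <stdio.h>

/* Vectorization intensity: compare against 16 (SP) or 8 (DP). */
static double vec_intensity(long long vpu_elements_active,
                            long long vpu_instructions_executed) {
    return (double)vpu_elements_active / (double)vpu_instructions_executed;
}

int main(void) {
    double vi = vec_intensity(150000000LL, 10000000LL); /* hypothetical counts */
    printf("vectorization intensity = %.1f (ideal: 16 SP / 8 DP)\n", vi);
    return 0;
}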
Another way to measure processing efficiency is with cycles per instruction (CPI). Table 5.9 lists the minimum CPI the machine is capable of depending on how many hardware threads are used. If the measured CPI is close to the theoretical minimum, it is an indication the kernel is compute-bound, and optimizations should focus on areas that improve compute efficiency.
Table 5.9: Minimum Theoretical CPIs

Number of Hardware Threads per Core   Best Theoretical CPI per Core   Best Theoretical CPI per Thread
1                                     1.0                             1.0
2                                     0.5                             1.0
3                                     0.5                             1.5
4                                     0.5                             2.0
The Intel Xeon Phi Coprocessor supports up to four hardware threads on each core, but the pipeline can only issue up to two instructions per cycle. The additional threads (traditional Xeon processor pipelines support two hardware threads and issue four instructions per cycle) are there to cover the latency of the executing threads. Xeon Phi cores are simpler than their Xeon host counterparts: they run at about a third the speed of Intel Xeon processors and operate on instructions in order (instructions must wait for previous instructions to complete before they can execute), so it is important to be able to switch to other threads with ready operands when a thread stalls. The coprocessor will not issue instructions from the same hardware thread in back-to-back cycles, which means that, in order to achieve the maximum issue rate, at least two hardware contexts (hardware threads) must be running. In contrast, the processors in NVidia cards are much smaller and focus on floating-point operations and single instruction, multiple thread execution. In general, Xeon Phi handles code with lots of branches and control statements better than NVidia processors; NVidia runs best with SIMT code with no control statements.
The theoretical aggregate memory bandwidth available on Xeon Phi Coprocessors is 352 GB/s, but internal limitations (the ring interconnect) limit achievable bandwidth to approximately 140 GB/s. The ideal instruction-to-byte ratio on Xeon Phi Coprocessors is six. If the measured instruction-to-byte ratio is less than six, the kernel is memory-bound; if the measured ratio is more than six, the kernel is compute-bound.
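This heuristic can be codified in a line or two; a sketch (ours), with the threshold of six from above:

/* Classify a Xeon Phi kernel by its measured instruction-to-byte ratio. */
static const char *classify(double instructions_executed, double bytes_moved) {
    return (instructions_executed / bytes_moved < 6.0) ? "memory-bound"
                                                       : "compute-bound";
}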
The primary difference between developing for NVidia cards versus Xeon Phi Coprocessors is the programming model. NVidia uses CUDA, its proprietary framework for developing GPGPU applications on NVidia graphics cards. CUDA uses C/C++ extensions, and code must be compiled with NVidia compilers to integrate host C/C++ code with CUDA code that will be off-loaded and executed on the graphics device. Xeon Phi uses the x86 programming model and ecosystem that has a long history in desktop computing. MPI, OpenMP, Cilk, and Threading Building Blocks (TBB), to name a few, can be used to off-load computation onto the coprocessors from the host.
5.3.1 Xeon-Phi Performance Tuning
The three factors that most influence performance on Xeon Phi Coprocessors are
scalability, vectorization, and memory utilization. These map almost directly to the
latency, compute-bound, and memory-bound spaces already described in the context
of NVidia architectures.
The Xeon Phi Coprocessors are like discrete graphics cards in that they are connected to the host via PCIe as a peripheral attachment. The programming model is heterogeneous in nature, with a host that runs on Xeon processors (more like traditional CPUs) and a device (the coprocessors) that runs simpler cores and a different OS. The host either off-loads computation tasks to the coprocessors, or code can be compiled to run on the coprocessors natively. The native model involves transferring files from the host to the coprocessors so they can be run directly from the device. The native execution model is explored in this thesis, since the coprocessors are throughput-oriented and designed for parallel applications. The heterogeneous nature of off-loading from a host is beyond the scope of this work.
The challenge in porting code that ran in a CUDA programming model (or distributed on a cluster with MPI) to Xeon Phi Coprocessors is that the code must be vectorized to fully utilize the machine. NVidia's programming model is SIMT (single instruction, multiple thread), which means each scalar thread can operate independently. Vector-based architectures use SIMD (single instruction, multiple data), in which each data item represents one element in a vector, with limited flexibility in terms of control flow and code complexity for vectorized sections of code. Any given thread can operate on up to 16 vector elements at a time on Xeon Phi Coprocessors. The difference between SIMD and SIMT requires non-trivial code restructuring to map N chromosomes on N threads to X*S chromosomes on X threads, each processing S vector elements, where N = X*S; a sketch of this restructuring follows below. Spawning threads to different cores can be done with OpenMP, MPI, TBB, Cilk, etc., and choosing which model is most appropriate is another challenge. Fortunately, the same optimization strategy described in the analysis framework applies as for all throughput-oriented architectures. Briefly, the analysis algorithm breaks the optimization space into three categories: memory, compute, and latency. Defined metrics are measured within each space to determine which are limiting performance. The optimization search space for each iteration is reduced to one of the three, and optimization strategies within each are selected for evaluation. The results are compared to the theoretical peaks of the machine. If performance is within 20% of a roofline limit, whether the max limit of the hardware or a modulated limit imposed by the kernel, most performance improvements have been found and optimization is complete.
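As an illustration of the N = X*S restructuring mentioned above, the OpenMP sketch below (ours; the fitness computation is a placeholder for the real per-chromosome model evaluation, and n is assumed to be a multiple of 16) spawns threads across cores with a parallel loop and fills the 16 single-precision lanes of each 512-bit VPU with a simd loop:

#include <omp.h>

/* Map N independent chromosome evaluations onto X threads x S = 16
 * vector lanes, where N = X*S. */
void evaluate_population(float *fitness, const float *genes,
                         int n, int genes_per_chrom) {
    #pragma omp parallel for              /* X threads across cores  */
    for (int c = 0; c < n; c += 16) {
        #pragma omp simd                  /* S = 16 SP lanes per VPU */
        for (int lane = 0; lane < 16; lane++) {
            int id = c + lane;
            float acc = 0.0f;
            for (int g = 0; g < genes_per_chrom; g++)
                acc += genes[(long)id * genes_per_chrom + g];
            fitness[id] = acc;            /* placeholder fitness     */
        }
    }
}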
86


Full Text

PAGE 1

AFRAMEWORKFORPERFORMANCETUNINGANDANALYSISON PARALLELCOMPUTINGPLATFORMS by ALLISONS.GEHRKE MasterofScience,UniversityofColorado,Boulder,1998 Athesissubmittedtothe FacultyoftheGraduateSchoolofthe UniversityofColoradoinpartialfulllment oftherequirementsforthedegreeof DoctorofPhilosophy ComputerScienceandEngineering 2015

PAGE 2

ThisthesisfortheDoctorofPhilosophydegreeby AllisonS.Gehrke hasbeenapprovedforthe DepartmentofComputerScienceandEngineering by IlkyeunRa,Advisor GitaAlaghband,Chair TimBenke LarryHunter ZhipingWalter September14,2015 ii

PAGE 3

Gehrke,AllisonS.Ph.D.,ComputerScienceandInformationSystems AFrameworkforPerformanceTuningandAnalysisonParallelComputingPlatforms ThesisdirectedbyAssociateProfessorIlkyeunRa ABSTRACT Emergingparallelprocessordesignscreateacomputingparadigmcapableofadvancingnumerousscienticareas,includingmedicine,datamining,biology,physics, andearthsciences.However,thetrendsinmany-corehardwaretechnologyhave advancedfaraheadoftheadvancesinsoftwaretechnologyandprogrammerproductivity.Forthemostpart,scientistsandsoftwaredevelopersleveragemany-coreand GPUGraphicalProcessingUnitcomputingplatformsafterpainstakinglyuncoveringtheinherenttaskanddata-levelparallelismintheirapplication.Inmanycases, thedevelopmentdoesnotrealizethefullpotentialoftheparallelhardware.Moreover, oftentheexploitableresources,suchasprocessorregistersandon-chipprogrammercontrollermemories,scalewitheachnewgenerationofmany-coresystemandsoftware performancedriftsoverhardwaregenerations. Anopportunityexiststomeetthechallengesinmappingscienticapplicationsto parallelcomputersystemsthroughasynthesisofarchitecturalin-sight,proledriven performanceanalysis,andexecutionoptimization.Thisthesisexploresananalysis andoptimizationframeworkthatdirectscode-tuningstrategiesandappliesscience totheartofperformanceoptimizationforecientexecutiononthroughput-oriented systems.Theframeworkdemonstratessystematicperformancegainthroughproledrivenanalysisonthreerepresentativescientickernelsonthreedierentthroughputorientedarchitectures. iii

PAGE 4

Theformandcontentofthisabstractareapproved.Irecommenditspublication. Approved:IlkyeunRa iv

PAGE 5

DEDICATION Idedicatethisthesistomyfamily.Tomyhusband,ChrisGehrke,becausethis simplywouldnothavebeenpossiblewithoutyouatmyside.Ithankyouforyour generosity,yourpatience,yourdevotiontoourfamily,andforshoulderingsomuchso thatImightsucceed.Tomychildren,SheaGehrke,KayaGehrke,andCaleGehrke whoaremyprideandjoyandwhoputupwithmany,manyhoursof...notnow,I'm workingonmythesis".Tomydad,JamesTimothySkinner,Jr.,whobuiltacluster ofcomputersthatbecamemyrstexperiencewithHighPerformanceComputing HPCthatfosteredmyinterestinHPCdevelopmentandresearch.Tomystepmother,BeatrizCalvo,whowasalwaystheretohelpandwhogavemethebestpiece ofadvice:perfectionmeanscomplete. v

PAGE 6

ACKNOWLEDGMENT Iwouldliketothankmyadvisor,Dr.IlkyeunRa,forhishelpinguidingthis research,forhispatience,andforhisgood-will.Dr.Raenabledthisworkeven thoughitisnothisprimaryresearchfocus.Iwouldalsoliketothankourcollaborator, Dr.TimothyBenke.Dr.Benke'sresearchbecamethebasisandmotivationforthis thesis.Ienjoyedtheinter-disciplinaryworkthemostandwillcontinuetohelpadvance scienticcomputingandapplyallIhavelearnedinmycareer. vi

PAGE 7

TABLEOFCONTENTS Tables........................................x Figures.......................................xiii Chapter 1.Introduction...................................1 1.1Motivation................................4 1.2Contributions..............................7 1.3DissertationOutline...........................9 2.LiteratureReview................................11 2.1WorkloadCharacterization.......................11 2.2PerformanceAnalysisandOptimization................13 2.3AutomaticPerformanceTuning....................17 3.AnalysisFramework..............................22 3.1FrameworkMethodology........................26 3.2NVidiaMetricsandAnalysis......................28 3.2.1MemoryAnalysisMetricsandOptimizations.........36 3.2.2InstructionAnalysisMetricsandOptimizations........44 3.2.3LatencyOptimizations......................52 3.3XeonPhiMetricsandAnalysis.....................52 4.ScienticKernelCharacterization.......................56 4.1Stencil..................................59 4.2SinglePrecisionGeneralMatrixMultiplySGEMM.........61 4.3KineticSimulation............................62 4.4CharacterizationSummary.......................63 5.HardwareandArchitectureCharacterization.................66 5.1ThroughputOrientedArchitectures..................66 5.2NVidiaDiscreteGPUArchitecture...................67 vii

PAGE 8

5.2.1Fermi...............................74 5.2.2Maxwell..............................79 5.3IntelXeonPhiCoprocessorArchitecture...............82 5.3.1Xeon-PhiPerformanceTuning.................85 6.AnalysisandResults..............................87 6.1FermiOptimizationAnalysisandResults...............87 6.1.1SGEMMAnalysisonFermi...................87 6.1.2StencilAnalysisonFermi....................93 6.2MaxwellOptimizationAnalysisandResults..............106 6.2.1SGEMMAnalysisonMaxwell..................106 6.2.2StencilAnalysisonMaxwell...................109 6.2.3RK4AnalysisonMaxwell....................110 6.3XeonPhiOptimizationAnalysisandResults.............115 6.3.1SGEMMAnalysisonXeonPhiCoprocessors.........115 6.3.2StencilAnalysisonXeonPhiCoprocessors...........123 6.4SummaryofAnalysisandResults...................130 7.Conclusion....................................131 Bibliography ....................................137 Appendix A.AdvancementofComputationalSimulationinKineticModeling......153 A.1RelatedWork..............................153 A.1.1ComputationalMethodsToInvestigateIonChannels.....153 A.1.2SimulationTools.........................156 A.1.3NumericalMethodsforModelingChemicalKinetics.....158 A.1.4GeneticAlgorithmforOptimization..............160 A.2ModelingIonChannels.........................161 A.3KingenCaseStudy...........................172 viii

PAGE 9

A.3.1ApplicationCharacterizationandProle............173 A.3.1.1SystemLevelAnalysis..................174 A.3.1.2ApplicationLevelAnalysis................175 A.3.1.3ComputerArchitectureAnalysis.............177 A.3.2ComputingFramework......................179 A.3.3ExperimentalResultsandanalysis...............183 A.3.4CaseStudyConclusion......................189 ix

PAGE 10

TABLES Table 3.1LowLevelInstructionRelatedHardwareEventsfromNVidia......29 3.2LowLevelInstructionRelatedHardwareEventsfromNVidiaContinued30 3.3LowLevelL1EventsfromNVidia'snvprof.................31 3.4LowLevelMemoryRelatedHardwareEventsfromNVidia........32 3.5LowLevelLatencyRelatedandMiscellaneousHardwareEventsfromNVidia33 3.6FermiDerivedMetricsfromNVidia.....................34 3.7L2andDRAMInstructiontoByteRatiosonNVidia...........35 3.8FermiMemoryAnalysisMetrics.......................37 3.9MaxwellMemoryTransactionMetrics...................39 3.10MaxwellMemoryThroughputMetrics...................40 3.11MaxwellMemoryUtilizationMetrics....................41 3.12MiscellaneousMaxwellMemoryMetrics..................42 3.13MaxwellInstructionUtilizationRelatedMetrics..............47 3.14MaxwellFlopandEciencyRelatedMetrics...............48 3.15MaxwellInstructionCounts.........................49 3.16MaxwellMiscellaneousInstructionMetrics.................50 3.17XeonPhiHardwareEventsfromIntel'sVTune[33]............53 3.18XeonPhiFormulasforPerformanceAnalysis...............54 4.1ExampleInstructionCountsandDistributions...............58 4.2Variousstaticmetricsacrosseachbenchmark................64 4.3ArchitectureStressesofBenchmarksBeforeandAfterOptimization...65 5.1NVidiaArchitectures.............................68 5.2ComputeCapabilitiesonNVidiaHardware................68 5.3NVidiaCoreArchitecture..........................69 5.4TheoreticalPeaksonNVidiaArchitecture.................70 x

PAGE 11

5.5ExecutionLimitsinNVidiaGPUs.....................71 5.6EssentialHardwareFeaturesandThroughputsonFermiandMaxwell..72 5.7InstructionThroughputsonFermiandMaxwell..............74 5.8SummaryofFeaturesonFirstThreeGenerationsofNVidiaGPUs[6]..76 5.9MinimumTheoreticalCPIs.........................84 6.1SGEMMBaseCUDAImplementationwithSmallInput.........88 6.2SGEMMBaseCUDAImplementationwithMediumInput........90 6.3SGEMMOptimizedCUDAImplementationwithMediumInput.....92 6.4StencilBaseCUDAImplementationwithDefaultInput.........94 6.5StencilOptimizedCUDAImplementationwithDefaultInput......96 6.6KingenModel01baselineImplementationwith8,192Chromosomes...98 6.7KingenModel01MemoryAccessOptimizationwith8,192Chromosomes100 6.8KingenModel01MemoryAccessOptimizationwith8,192Chromosomes101 6.9KingenModel01RegisterOptimization...................102 6.10KingenModel01FixytdymUsageOptimization.............104 6.11SGEMMBaselineImplementationwithMediumDatasetonMaxwell...107 6.12SGEMMBaselineImplementationwithMediumDatasetonMaxwell. Onlymemoryrelatedmetricsareshownsincethekernelismemory-bound.108 6.13SGEMMOptimizedImplementationwithMediumDatasetonMaxwell.109 6.14StencilBaselineImplementationonMaxwell................111 6.15StencilOptimizedImplementationonMaxwell...............112 6.16RK4BaselineImplementationonMaxwell.................113 6.17RK4OptimizedImplementationonMaxwell................114 6.18SGEMMBaselineImplementationonXeon-Phi..............116 xi

PAGE 12

6.19SGEMMBaselineImplementationonXeon-Phi.Investigationmaybe warrantedifthemeasuredvaluessatisfythethresholdsoftheperformance heuristic.Actualvaluesarelledinunderthevaluecolumnandthe performanceheuristicisprovidedforreferenceinthelastcolumn.....117 6.20SGEMMTransposeOptimizationonXeon-Phi..............118 6.21SGEMMTransposedOptimizationonXeon-Phi.Investigationmaybe warrantedifthemeasuredvaluessatisfythethresholdsoftheperformance heuristic.Actualvaluesarelledinunderthevaluecolumnandthe performanceheuristicisprovidedforreferenceinthelastcolumn.....120 6.22SGEMMMKLLibraryOptimizationonXeon-Phi.............121 6.23SGEMMMKLOptimizationonXeon-Phi.................122 6.24SGEMMScalingOverThreeImplementations...............124 6.25StencilBaselineImplementationonXeon-Phi...............125 6.26StencilBaselineAnalysisMetricsonXeon-Phi...............126 6.27StencilCache-BlockingOptimizationonXeonPhi-MeasuredEvents..127 6.28MeasuredAnalysisMetricsfromStencilCache-BlockingOptimizationon Xeon-Phi...................................128 6.29StencilResultsonXeonPhiasReportedbytheApplication.......129 6.30BaselineandOptimizedRun-times.....................130 6.31OptimizationSpeedups............................130 7.1NVidiaAnalysisSummary..........................132 7.2NVidiaOptimizationSummary.......................133 7.3XeonPhiOptimizationSummary......................134 A.1ComparisonofProcessRuntimeBetweenCompilers............176 A.2ComparisonofCPIandFPImpactingMetrics...............176 A.3PercentageoftheProcessRuntimeByFunction..............177 A.4ChromosomeDistribution..........................187 xii

PAGE 13

FIGURES Figure 1.1AnalysisFrameworkProcessFlow......................4 3.1PerformanceArchitectureContinuum....................22 3.2FrameworkOptimizationAlgorithm....................25 4.1StencilBaselineImplementationPseudocode................61 4.2KernelInstructionMixMeasuredonMaxwell................65 5.1FermiArchitecture..............................78 5.2FermiSymmetricMultiprocessorwithCore................79 5.3MaxwellSymmetricMultiprocessorDesign.................81 5.4XeonPhiCoprocessorArchitecture.....................83 6.1RooineModelforC2050withandw/oECC...............105 6.2SGEMMScalingOvereachOptimization.................123 A.1ExampleKineticSchememodel......................164 A.2KineticSchemeProposedBy[148]......................165 A.3ImplementationofKineticSchemeProposedBy[148]...........166 A.4OptimizationImprovedKineticSchemeProposedBy[148].........167 A.5Modiedkineticscheme............................168 A.6OptimizationofRevisedModelImprovesFit................169 A.7NewModelsUnderAnalysis.........................170 A.8AnalysisProcess................................173 A.9ThreadUtilization...............................174 A.10InstructionMix................................178 A.11ComputationalComplexity..........................180 A.12CalculationComplexity............................183 A.13ComplexityGraph...............................184 A.14KernelSourceCode..............................185 xiii

PAGE 14

A.15SpeeduponSeveralArchitectures......................186 A.16SpeedupwithLargerWorkload........................188 xiv

PAGE 15

1.Introduction Amajorparadigmshiftoccurredduringthelastdecadeincomputerarchitecture design.Programmersarenolongerabletorelyonsignicantincreasesinprocessor clockswitheachnewgenerationofhardwaretoincreaseapplicationperformance. Themulti-coreandmany-corerevolutioniswellunderwayandwithitthenecessity toleverageadvancedparallelsystemstoincreaseapplicationperformancegeneration overgeneration.Themilestoneswitchtomany-coreledtoadesperateneedfornew toolsandmethodologiesforperformanceanalysisandoptimization. ThetectonicshiftincomputingledtotheemergenceofdierentclassesofapplicationincludingthroughputcomputingapplicationsandtotheriseofhighperformancecomputingHPCindiversedomains.3-Dgraphicsapplicationsrunning ongraphicshardwarewhetherdiscreteorinternalareaproto-typicalexampleof throughput-orientation;billionsofpixelsprocessedwithintherendertimeofaframe. Throughputcomputingisabroadercategoryofapplicationthatincludes,butisnot limitedto,graphicsapplications.Throughputcomputingapplicationsarecharacterizedashavingplentyofdatalevelortasklevelparallelismandthedatacanbe processedindependentlyandinanyorderondierentprocessingelements,forasimilarsetofoperations[105].Ingeneral,throughputcomputingapplicationsareideal forparallelarchitectures.Thesynthesisofcoreprinciplestodescribe,understand, andcompareperformanceofthroughput-orientedarchitecturesisexploredindetail inthisdissertation. Beforethemany-corerevolution,themarketsservedbycentralprocessingunits CPUandgraphicalprocessingunitsGPUwereveryclearlydened.Today'scomputinglandscapeisanythingbutclearanditcanbeverydicultforscientistsand developersaliketochooseanappropriateplatformfortheirapplication.Thereare multi-coresystemswith4-20coresthattendtobemoreCPU-likebutlimitedin thenumberofcoresthatcanbesupported;many-coresystemsthataremoreGPU1

PAGE 16

likewithseveralhundredcoresthattrade-osingle-threadedperformanceformany, simplercores;andenormous,distributed,installationsinsupercomputingcenters. Thefocusofthisresearchisthroughput-oriented,many-coresystemsasexemplied byNVidiadiscretegraphicscardsandIntel'sXeonPhicoprocessors.Multicoreplatformsdon'thaveenoughparallelismtosupportmanyscienticapplicationsofinterest andtheirdesigntendstoremainfundamentallylatency-orientedseeSection5.1for detaileddiscussionoflatency-orientedvs.throughput-oriented.Supercomputing centershavelimitedavailabilitytothegeneralpublicandhavehighmaintenance andoverheadcosts.Manyperformanceprinciplesinthisresearchapplytomany ofthecomponentsrunninginsuper-computingcentersbutthechallengesuniqueto distributiononthatscaleareoutofscopeforthiswork. HPCrepresentsabroadrangeofscienticapplicationscharacterizedintodwarfs analgorithmicmethodthatcapturesapatternofcomputationandcommunication byBerkeleyresearchers[17].Scientickernelshavedierentalgorithmicproperties thatmayormaynotmapwelltocertainarchitectures.Manyparallelplatformsare suitableforboththroughput-orientedandscienticapplicationsbutwhichhardware ismoreecientorabettermatchforagivensoftwareapplicationisanopenresearch question.Specickernelsfromboththroughputcomputingandacomputational simulationkernelareexploredin-depthintheanalysisframeworkdescribedinthis thesis. Computationalsimulationistightlyconnectedtotheexponentialincreaseinthe powerofcomputers[64].Performanceimprovementsforcomputationalsciencessuch asbiology,physics,andchemistryarecriticallydependentonadvancesinmulti-core andmany-corehardwareandtheperformancechallenge,describedby[139]islargely beingaddressedbyrapidprogressinthisarea.However,thesesystemsrequiresubstantialinvestmenttomigrateandoptimizesoftwareanddevelopmenteortsare notperformanceportableevenwithinfamiliesofarchitecturesfromthesamecom2

PAGE 17

pany.Theproductivityofprogrammersisnotkeepingpacewithadvancedcomputing systemsandeectivetoolstypicallytrailnewsystemsbyseveralyears[167].This situationdiminishesthevalueofthesesystemsandunderminesthepotentialofscienticapplicationstoservethepublicwithreliableandrobustsolutions.Thereisa compellingneedforsoftwarethatcansustainhighperformanceonnewarchitectures andscalewithmulti-core,many-core,hybrid,andmassivelyparallelsystems. ResearchersfromtheUniversityofCalifornia,Berkeleymadeseveralrecommendationstoguidethetransitiontoparallelsystemsthatarerelevanttothiswork including[17]: "Theoverarchinggoalshouldbetomakeiteasytowriteprogramsthatexecute ecientlyonhighlyparallelcomputingsystems." "Autotunersshouldplayalargerrolethanconventionalcompilersintranslating parallelprograms." "Architectsshouldnotincludefeaturesthatsignicantlyaectperformanceor energyifprogrammerscannotaccuratelymeasuretheirimpactviaperformance countersandenergycounters." Theframeworkatthecoreofthisthesisusesprole-drivenanalysistoguide optimizationeortssoprogrammerscanwritecodethatexecutesecientlyonparallel systems.OnegoalwastotestoptimizationtechniquesrecommendedbyNVidiaand Intelengineersforapplicationinautomatedenvironments.Positiveresultsenablethe methodologytobeusedinauto-tuningsystemstoselectoptimizationsthatarelikely tohavethemostimpactonperformanceandreducethecombinatoriallylargesearch space.Figure1.1illustratesthemaincomponentsoftheanalysisframeworkand howtheprocessows.Theanalysisframeworkprovidesinsightintohowapplications runonhardware.Kernelcharacteristicsarecompiledonceperarchitectureand,if necessary,oncepercodevariant.AllgenerationsofNVidia'sdiscreteGPUsseeTable 3

PAGE 18

5.1areconsideredmembersofthesamearchitectureinthiscontextaseachreport similarmicro-architecturalagnosticmetricsonthesamecode.Intel'sXeonPhiisa completelydierentx86basedarchitecture,withadierentinstructionset,dierent programmingmodel,dierentprolingtools,andrequiresadierentsetofmetrics todescribekernelbehavior.Thepeakratesandprimaryarchitecturalfeaturesto describehardwarebehaviorarecompiledonceandarestaticacrossanykerneland optimizationvariant.Hardwarespecicationsareusedtocomparemeasureddata withpeakrates.Kernelcharacteristicsthroughoptimizationstrategiesareinaloop asoptimizationprogresses.Optimizationisoverwhenthekernelisrunningnearpeak architecturallimitsorhashitkerneloralgorithmiclimits. Figure1.1:Kernelcharacteristics,hardwarebehavior,performanceprole,analysis algorithm,andoptimizationstrategiesaretheprimarybuildingblocksoftheanalysis framework. 1.1Motivation ExpertisetodevelopsoftwareapplicationsthateectivelyleveragegigaFLOP systems 9 oatingpointoperationspersecondandthatscalebeyondisscant 4

PAGE 19

bothintoolsandwell-denedbestpractices.Theproliferationofresearchfocused onperformanceanalysisofemergingmulti-coreandGPUarchitectures[23]mostly usingcasestudiesistestamenttoanascenteldgrapplingtocharacterize,leverage, andunderstandthecomplexityinherentintherequisitetoolsofthetrade.Afew ofthebiggestchallengesinmanycoreperformanceoptimizationresearcharehow toecientlybuildscienticapplicationsthatachieveareasonablefractionofthe availabletheoreticalpeakperformance,howtomeasuretherelativesuccessofdierent optimizations,howtoratearchitecturaleciencytomeasurethegoodnessofthe mappingfromsoftwaretohardware,andhowtoachieveperformanceportability. Therearemanybarrierstoentryinprogrammingonparallelsystems.Fewhave describedsequentialprogrammingaseasybutparallelismcomplicatesthetaskexponentially.Insequentialcomputing,operationsareperformedinordermakingit easiertoreasonaboutcorrectnessandcharacteristicsofaprogram[110].Parallel computingcomplicatesourreasoningalongseveraldimensionsandrequiresmodicationinprogrammingapproaches[110].Hereareafewfactorsthatcontributeto thechallengesinherentindeployingparallelsolutions: Adeterminationmustbemadeabouthowtheapplicationcanberestructuredto runinparallel.Architecturaldetailsneedtobeconsideredbecausestructuring codeformanythreadsofaGPUisdierentthanstructuringcodetoexploit widevectorsofIntelXeonPhicoprocessors. Therearefewestablishedbestpracticestoguidethedevelopmentandoptimizationprocess.Thisthesisproposesasystematicguidethatbridgesthegap betweentheoryandthemanypracticaldetailsthatcomeintoplaywhencode hitssilicon.Onegoalistocompilebestpracticesdescribedbyengineersat NVidiaandIntelintoanautomatedframeworkandexplorewhytheywork. Theapplicationbottleneckmustbeidentiedandanalyzedforparalleloppor5


- How to determine if the increase in performance will be worth the development effort is not well understood. Simple techniques are employed in this work to estimate how much optimization opportunity remains.

- Developing applications that will fully exploit the machine is challenging, as the problem lives at the cross section of inherent algorithmic characteristics and architectural features. These dynamics are explored, and how close to peak efficiency a kernel executes is quantified.

- Latency, either reducing latency or hiding latency, is key to performance. Memory must be managed carefully and optimized to coalesce accesses and to reduce cache issues like thrashing and false sharing. These techniques require much more programmer involvement than sequential programmers tend to consider.

- Modifications to existing scientific models mean changes in code, and different-sized models may have different optimization points on different hardware.

- Performance tuning is an important and time-consuming step that requires in-depth knowledge of architectural features as well as the primary application features. Performance optimization is often described as an art rather than a science, which hints at the complexity and lack of well-understood best practices.

A guiding theme throughout this dissertation is promoted in [53] as the need to "broaden the success in parallelizing" scientific applications. As such, there is an urgent need for more research on how to program for many-core architectures and how to measure results.


A primary contribution of this thesis is the creation of an analysis framework that guides developers and scientists through systematic optimization. We explore capabilities within the framework to generate performance metrics and analysis to drive optimization strategy. The framework will aid developers and computational scientists in using the full computational resources available in parallel systems.

1.2 Contributions

This thesis demonstrates that a performance analysis system can construct tailored mappings of scientific applications to modern many-core architecture resources. The framework increases scientific productivity by driving development toward efficient implementations. Architectural details are essential to high-performing applications and the ability to fully exploit the machine. There is a large divide between the power available in complicated parallel architectures and the tools most people need to effectively leverage them and come within range of peak performance.

In the scope of this thesis, the computationally intensive numerical simulation of ion channel kinetics is examined as a target for evaluating the analysis framework. In addition, two well-known scientific kernels are studied to demonstrate the generic applicability of the methodology and to compare the results reported here against known benchmarks. In all, three scientific kernels are mapped to three different throughput-oriented architectures. The three kernels are described in detail in Chapter 4 and the hardware is described in Chapter 5.

This dissertation makes the following key contributions:

1. Characterization of throughput-oriented architectures through investigation of three different platforms: NVidia Fermi, NVidia Maxwell, and Intel Xeon Phi coprocessors. The ability of scientific applications to map to each architecture is explored. The peak capabilities of throughput-oriented architectures are described and applied beyond what can be found in technical manuals or tutorials.


Advanced computing systems, though complex with a large number of tuning options that impact performance, actually have relatively few architectural features that dominate performance. Just as Amdahl's law defines the max speedup from parallel execution (the performance gain of parallel computers is limited by the serial portion), the primary performance limiter defines the max speedup, so optimization should focus on the optimizations with the most potential to improve performance. We describe how to interpret the performance metrics hardware exposes and identify optimization options.

2. Presentation of profile-driven analysis to drive optimization strategy. The framework uses a hierarchy of profile data as input to the analysis algorithm. Analysis metrics are defined from low-level events and derived performance metrics to codify analysis guidelines in order to identify targets for optimization. The analysis framework identifies performance limiters and focuses optimization effort in those areas only. Significant development time is wasted improving code that, by definition, will not improve performance. The framework systematically drives the time-intensive and error-prone process of performance optimization and enables investigation of how the performance of a kernel scales as the available resources and parallelism change. Optimization strategies are defined on each architecture for compute-bound, memory-bound, and latency-bound kernels. Being bound by compute, memory, or latency means the performance of the kernel is dominated by compute operations, memory operations, or exposed latency. Analysis metrics are used to guide which optimizations within each of the three categories are likely to improve performance.

3. Presentation of an analysis framework that bridges the gap between architectural theory, performance analysis, and the many details that come into play when code hits silicon. This thesis focuses on execution optimization as opposed to algorithmic optimization, which strives to reduce the number of operations and is expressed in big-O notation.


The analysis framework that drives performance optimization is a synthesis of performance theory, kernel behavior, and architectural features.

4. Presentation of the performance of three representative scientific kernels on three throughput-oriented architectures. We demonstrate that the methodology that guided the efficient implementation of the implicit RK4 solver on several parallel platforms is applicable to scientific kernels in general. The results demonstrate that performance of a target kernel can be improved by profile-driven analysis and that performance improvements are sustained across architectural generations. The increase in performance opens the door for new usage models, which is imperative to continued innovation on parallel systems.

1.3 Dissertation Outline

The next chapter is an in-depth literature review of related work and how this dissertation builds on previous success. Chapter 3 describes our analysis framework that synthesizes profiling, optimization, and analysis to compare architectural efficiencies and determine if an application is using a reasonable percentage of the hardware resources available. Chapter 4 introduces the scientific kernels that are under evaluation in this thesis. Chapter 5 describes the three platforms studied in this thesis: NVidia's Fermi, NVidia's Maxwell, and Intel's Xeon Phi coprocessors. Chapter 6 demonstrates the positive performance results leveraged through the methodology described in this thesis. Finally, Chapter 7 discusses conclusions and directions for future research. Appendix A.3 is a case study we published that describes the step-by-step process of adapting new and existing computational biology models to multi-core and distributed memory architectures.


We discuss our implementation and software optimization process to demonstrate the challenges and complexities of software development for advanced parallel architectures.


2. Literature Review

2.1 Workload Characterization

Workload demand and architectural behavior are the two components under study when evaluating and optimizing application performance. It follows directly that quantitative description of workloads is a fundamental part of performance evaluation [28], a key focus of this dissertation. Workload characterization directly informs the design of future applications, compilers, and architectures. For example, target thread dispatch rates can be estimated against thread length distributions for common GPGPU programs. Another, more well-known and studied design choice is that the size of on-die storage should match the working set of target workloads [106]. In addition, GPU kernels and applications often stress the same bottlenecks [69], so workload characterization is used to identify a representative set of diverse workloads that exercise important orthogonal architectural features. Redundant workloads in an analysis set, meaning two workloads with very similar execution behavior on a system, require twice the analysis work but provide no additional insight. Benchmark development relies heavily on workload characterization to quantify the primary architectural features that impact performance. In this thesis, scientific applications are characterized to determine kernel-imposed limits, which reduce achievable peak architectural rates.

A new class of throughput computing applications has emerged across diverse domains that process large amounts of data. A distinguishing feature of these applications is that they expose plenty of data-level parallelism and the data can be processed independently [106]. Another new class of application is computational simulation, which historically ran only on high performance computing (HPC) platforms with distributed compute nodes. Throughput-oriented computing platforms are viable options for these new types of applications, and it's often not clear to expert programmers and domain scientists alike which is best suited for their application.


The general-purpose CPU (central processing unit) is capable of running many types of applications and has recently evolved to provide multiple cores and many cores to process data in parallel [106]. GPUs (graphics processing units) are designed for graphics applications with massive parallelism on smaller processing cores [106]. Some scientific models can be too big to reside in the smaller on-chip memory of GPUs and must consider the trade-offs between the two architectures. One goal of this dissertation is to provide practical guidance to inform that decision.

Qualitatively, GPU and many-core systems like the Xeon Phi coprocessors favor workloads that perform large numbers of compute operations, that exhibit high degrees of memory locality, and that are highly data-parallel [134]. The challenge many researchers address with workload analysis studies is how to quantify those characteristics. Workloads are very often characterized with metrics designed to measure sensitivity to specific performance limiters of target systems. For example, Rodinia researchers [35] characterized benchmarks in terms of warp occupancy, the average number of active threads over all issued warps, because occupancy was identified in early-generation GPGPU architectures as a performance limiter. Kerr et al. [91] use Ocelot [59] to characterize thread activity, the average fraction of threads that are active at a given time, to measure impact from thread divergence. In [106], kernels are classified based on their compute and memory requirements, regularity of memory accesses, and the granularity of tasks. Architects make design trade-off decisions by identifying key architectural parameters that are important to performance and by characterizing how benchmarks respond to changes in those parameters [35].

A clear taxonomy of workload-specific tuning parameters that are independent of any system the workload may run on, as distinct from architecture-dependent tuning parameters, would help clarify performance limits imposed by the kernel itself. Goswami et al. [69] propose a set of GPU microarchitecture-agnostic GPGPU workload characteristics. Byfl is a tool developed by [134] that reports counter values in a hardware-independent manner.


If metrics are architecture-agnostic, then any workload will report the same values regardless of microarchitectural differences in the target platform. Important research remains to identify a set of workload characterization metrics that help quantify the extent to which the same workload exhibits different properties when implemented on different architectures [35]. Improving our understanding of these properties would improve evaluation for suitability to emerging architectures. This is an important benefit for domain scientists who need guidance on which system is a good match for their application and don't have the time or expertise to experiment with multiple approaches. The framework described in this dissertation extends existing characterization studies to understand which properties impose limits below the theoretical peaks hardware can support.

In general, workloads are characterized and optimized for specific architectures. Code is tuned to leverage architectural features of the target system. An interesting question is: can the reverse be analyzed? In other words, what is the best architecture for a given algorithm, and which among the available systems are strong candidates? Researchers and industry analysts widely share a vision of heterogeneous computing that automatically selects the best computational unit for the task from among integrated accelerators [35]. One goal of this dissertation is to advance this vision by exploring the intersection and dependencies between workload characterization and the systems workloads run on.

2.2 Performance Analysis and Optimization

Performance analysis and optimization is at the foundation of computer architecture research and development [55] and of this dissertation. Architectural details are essential to optimal performance of software applications and the ability to fully exploit the machine. This dissertation couples characterization of the scientific kernels under investigation with detailed architectural analysis and optimization over multiple generations of hardware to improve understanding of how to map scientific applications to modern architectures.


One of the biggest limitations of the many-core era is that software must be explicitly parallelized to leverage many cores and the massive parallelism of GPUs [180]. Parallel applications have transformed the challenge from latency-limited computing to throughput-limited computing [180]. The many-core era ushered in a "productivity gap" as programmers struggle to maintain performance portability on rapidly changing systems with non-linear performance impacts. This underscores the need to better understand throughput orientation, and this trend will continue well into the heterogeneous systems era.

Programming challenges arise from interactions among architectural constraints. For example, optimizations that improve the performance of an individual thread tend to increase that thread's resource usage. As each thread's resource usage increases, the total number of threads that can occupy a symmetric multiprocessor (SM) on a GPU decreases [150]. Another source of unpredictability, using the CUDA runtime system as an example, is black-box register allocation, which makes it difficult for programmers to fully understand the performance characteristics of their applications. The optimization framework developed in this dissertation demonstrates the performance delta that can come about across generations and describes how our system adapts.

Fundamentally, CPUs and GPUs are built based on very different design goals. CPUs were historically designed for a wide variety of applications, optimized to improve single-task or single-threaded performance. CPUs reduce latency and GPUs hide latency to maximize performance. Memory bandwidth on CPUs is low as compared to GPUs, and CPU cache access filtering tends to modulate the CPU's ability to expose memory-level parallelism. Workload optimizations for CPU that contribute to performance are multithreading, cache blocking, and reorganization of memory accesses [106].


Therefore, CPU architectural advances to improve performance include branch prediction, out-of-order execution, super-scalar execution, and frequency scaling [106]. CPU programmers could count on increasing clock frequencies to improve performance and did not have to continually optimize applications from one generation to the next. However, CPUs have evolved to multi-cores and many-cores with wider SIMD units, and applications must now be parallelized and optimized to exploit thread-level parallelism (TLP) and data-level parallelism (DLP) to effectively hide latency.

GPUs were historically designed for graphics applications optimized for high throughput of pixels. GPUs trade off single-threaded performance for massive data- and thread-level parallelism. As GPUs are throughput-limited architectures, graphics applications are very latency tolerant. GPU threads are independent, and GPUs immediately switch to a different thread to process a different pixel when an active thread stalls on a long-latency operation like a request to memory [106]. Thread switching on a GPU is an almost zero-cost operation as compared to the high penalty of context switching on CPUs. Characterizing workload throughputs on throughput-oriented architectures is critical for modeling performance and for identifying bottlenecks and their relevant optimizations. However, like the CPU evolution toward more parallelism, GPUs have evolved to support dependent thread operation and cache hierarchies.

Several factors have contributed to the strong paradigm shifts in both CPU and GPU. The CPU has exhausted the performance gains historically hidden from the programmer, like superscalar architectures and pipelining [132], and turned to multiprocessing to continue to deliver processor performance increases. GPU and CPU are evolving to support important new classes of applications, described in 2.1, that expose data-level parallelism but have different requirements from the hardware than traditional graphics applications.


It remains to be seen how close CPU and GPU architectures will grow toward each other and how heterogeneous systems will manage the complexity. CPU and GPU cache hierarchies play similar roles in filtering memory requests to the off-chip interface but with very different memory access rates and sensitivity to the memory hierarchy [74]. There is unresolved contention over whether CPU or GPU is best for the scientific kernels of interest, due to CPUs transitioning to multi-core and many-core and moving into the GPU's dominance in throughput computing, and GPU architectures moving into the CPU's dominance in maximizing single-threaded performance and usage of a cache hierarchy.

Analytical performance modeling and simulation are two prevalent approaches to understanding and predicting performance. Analytical modeling is orders of magnitude faster than simulation but less accurate. Analytical modeling enables the exploration of very large design spaces. Eeckhout et al. [55] describe three primary approaches to analytical modeling. There is mechanistic or white-box modeling, which builds a model based on first principles. Empirical or black-box modeling builds a model through training based on simulation results. Hybrid mechanistic-empirical modeling aims at leveraging the best from both methods [55].

Several studies demonstrate good approximation of GPU performance using analytical approaches, including [79], [124], [154], and [186]. Most performance models rely on some type of implementation and don't address how far the optimized version is from the global optimum [102]. The methodology described in this thesis can be used to inform analytical modeling techniques with a cogent description of performance limits and application of relevant optimizations.

Another important research goal is to use metrics that meaningfully map to a performance construct and how to optimize it [55], [154]. An example is cycles per instruction (CPI). CPI is a dot product of event frequencies and penalties and as such provides more insight than instructions per cycle (IPC), yet IPC is more widely used [57].


Bhuiyan et al. [23] use dot products of event frequencies and penalties in their analytical performance model to connect applications to architectures. Although not obvious, Zhang et al. apply similar principles by measuring execution times on the primary limiters in NVIDIA GeForce 200-series GPUs. An important research goal of this thesis is to determine the suitability of an architecture for a given application by building and improving on the work of Bhuiyan et al. [23] and Zhang et al. [186].

Many studies in performance analysis and optimization research focus on very specific goals targeting a type of application on a target architecture. Authors routinely note that their methods should be more generally applicable, and there are sound theoretical arguments to support this claim, yet few actually demonstrate more general applicability on any vector, be it other application domains or diverse architectures. This dissertation tries to improve on generality goals and demonstrate how analysis methodologies can be extended to other applications and architectures. General-purpose GPU (GPGPU) computing breaks assumptions of the purely throughput-limited paradigm in ways that are not yet fully understood. One research goal of this dissertation is to clearly describe how the "GP" in modern GPUs has impacted throughput-oriented design goals.

2.3 Automatic Performance Tuning

The primary motivation for auto-tuning is to sustain performance trends on rapidly changing architectures and execution environments, a concept known as performance portability. This thesis demonstrates that an automated analysis system can construct tailored mappings of scientific numerical kernels to modern many-core architecture resources. As there is no single configuration good for all systems and all applications, tremendous effort is necessary to develop applications and map them onto target machines [42]. In addition, successive generations of massively parallel architectures tend to require a complete reapplication of the optimization process to maintain high performance [150].


A key research goal of this dissertation is to demonstrate sustained performance across scientific models and across hardware.

Programmers must choose which architecture is best suited to their method, which is very difficult to intuit. Examples of the difficulties, especially for accelerators like GPUs, include an unusual programming model, emerging architectures that evolve very quickly, and technical details of the architectures that are not publicly available [82]. Software can adapt to hardware, so auto-tuning benefits architects, who then don't have to overprovision hardware designs for legacy applications and implementations that are tuned for the previous generation [7].

Compilers have failed to achieve high performance on new architectures, so the responsibility has fallen on domain experts and expert programmers [180]. The priority for compilers is correctness and very general applicability. The priority for auto-tuners is to leverage the specifics of different architectural features and workload characteristics to find the fastest-running configuration. Compilers use simple models of architecture behavior that may be overly simplistic compared to the complex hardware of advanced systems. In addition, compilers have difficulty determining the behavior of algorithms whose optimization depends on the inputs. Compilers handle two-level memory well and work best when latencies are small and deterministic [180]. Compilers do not optimize multi-level cache hierarchies and out-of-order execution well. Auto-tuning has become a commonly accepted technique to address these issues and generate highly tuned code automatically. Bilmes et al. [25] observed that all code variants could be enumerated and explored and, given the capabilities of modern microprocessors, the optimal configuration for a given problem size could be determined; this work resulted in PhiPAC, which is considered the progenitor of auto-tuners [180].

Two approaches to performance auto-tuning are used in practice: model-driven optimization and empirical search optimization.


The model-driven approach comes from the compiler community and includes techniques like loop blocking, loop unrolling, loop permutation, fusion and distribution, prefetching, and software pipelining. Empirical optimizers estimate the values of key optimization parameters by generating many program versions and running them on specific hardware to determine which values give the best performance for a given application on a given machine. In contrast, conventional compilers use models of programs and machines to choose the best parameters [183]. Some researchers present a hybrid approach and use simple architectural models in a first stage to limit the search space for a second stage of empirical search.

A comparison of the differences between empirical and model-driven approaches found that model-driven approaches can be just as effective as empirical search, which is widely believed to be superior [183]. At least one study asserts that a model-driven tuning strategy is impractical, since some device parameters can't be queried and some parameter trade-offs are difficult to model [50]. However, the full benefit of model-driven tuning has probably not been fully explored, due to the de-facto preference for and wide success with empirical search. In general, the performance impact of specific optimizations can't be predicted, in large part because model-driven tuning is under-studied and not well understood. There is considerable room for improvement in both empirical and model-driven optimization techniques: hand-optimized code still significantly outperforms automated code for generating the BLAS [183].

Automatic library generation with empirical search has been a very effective strategy on CPUs, including ATLAS [176], Sparsity [171], FFTW [60], and Spiral [140]. Many studies borrow similar methods to auto-tune performance by empirical search on GPUs. Jiang and Snir [82] implemented a high-performance matrix multiplication library generator for GPUs that they refer to as an "ATLAS for GPU". Liu et al. [113] use a greedy algorithm to accelerate search and explore the influence of program inputs.


Baskaran et al. [18] use a hybrid model-driven empirical search to determine optimal parameters for unrolling and tiling. Meng and Skadron [125] and Choi and Singh [36] build a performance model to guide the tuning process for iterative stencil loops and sparse matrix-vector multiply, respectively. Ryoo et al. [150] carve the optimization search space in order of performance impact. There is a lot of interest in the community in encapsulating details into auto-tuned libraries for computational scientists to use. X-Stack researchers [7] observe that although an optimal implementation may be unique to a particular dataset-architecture-application tuple, a generalized optimization design space can capture them all.

Auto-tuning research is distinguished along several vectors:

- The method used (empirical or model-driven).

- The application domain of focus (FFT, matrix multiplication, tridiagonal solvers, etc.).

- Strategies for pruning the search space.

- The tools used in the tuning framework (profiling techniques, type of metrics measured, static analysis or dynamic run-time analysis).

- Algorithmic properties and whether domain knowledge is leveraged.

- The tuning parameter targets researchers employ (switch points, shared memory allocation, thread occupancy, etc.).

The contributions of research that advances performance auto-tuning are distinguished and compared along one of the categories above. However, when designing and implementing auto-tuners, the same core challenges remain. Empirical search techniques must grapple with the challenge of an exploding search space. Researchers employ many different methods to avoid impractical exhaustive search. No single method has taken a clear leadership position over the others.


Davidson et al. [50] use algorithmic knowledge to decouple tuning parameters and use heuristics to estimate search starting points. Ryoo et al. [150] prune the search space by capturing first-order performance effects. Choosing the right tuning parameters is also a challenge, since the decision often involves a trade-off between multiple objectives whose performance impact is often nonlinear and difficult to predict. Unpredictable non-linear performance effects are a major challenge for auto-tuners that has not yet been comprehensively described. For example, high thread occupancy can limit the shared memory available per thread and increase register pressure.

Most auto-tuners target specific optimizations for the target application on target systems [102]. They include both architecture-independent optimizations, based on improving the source code and workload analysis, as well as architecture-specific optimizations like optimal thread group size [180]. Many researchers assert their method can be generalized to other types of workloads or architectures, but this is rarely proven or systematically explored [69]. The argument makes sense in principle, "as long as the code can be parameterized and its properties, such as demand for registers and shared memory, expressed as functions of the parameters" [99]. In general, there is ample opportunity, which this dissertation explores, to extract, generalize, and encapsulate performance auto-tuning functionality so the complexity is hidden from application scientists [7]. Research directions for auto-tuning include making auto-tuning more efficient, expanding it to additional application domains, and achieving more generality [180].
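To make the empirical-search challenges above concrete, the following is a minimal sketch of an exhaustive parameter search with crude resource-based pruning. The compile_variant and time_kernel helpers, the parameter values, and the resource limits are all hypothetical stand-ins for a real build-and-measure harness, not artifacts of this thesis.

    import itertools

    # Hypothetical per-block resource limits used only to discard clearly
    # infeasible configurations before any expensive measurement.
    MAX_THREADS_PER_BLOCK = 1024
    MAX_SHMEM_PER_BLOCK = 48 * 1024  # bytes

    def empirical_search(compile_variant, time_kernel):
        """Enumerate code variants, measure each, and keep the fastest.

        compile_variant(block_size, unroll) -> a runnable kernel variant
        time_kernel(variant) -> measured runtime in seconds
        Both callables are assumed to be supplied by the tuning harness.
        """
        best_config, best_time = None, float("inf")
        for block_size, unroll in itertools.product([64, 128, 256, 512],
                                                    [1, 2, 4, 8]):
            # Prune: estimated shared memory grows with block size and unroll.
            est_shmem = block_size * unroll * 8  # illustrative estimate, bytes
            if block_size > MAX_THREADS_PER_BLOCK or est_shmem > MAX_SHMEM_PER_BLOCK:
                continue
            runtime = time_kernel(compile_variant(block_size, unroll))
            if runtime < best_time:
                best_config, best_time = (block_size, unroll), runtime
        return best_config, best_time

A model-driven first stage would replace the crude pruning test with architectural formulas, shrinking the space handed to the empirical second stage.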


3. Analysis Framework

The key to high performance is an effective architecture-algorithm mapping, which is not straightforward in most cases [64]. This framework uses architectural insight to guide optimization strategy. Many architectural features are available from hardware specifications and software queries to the hardware. Important architectural throughputs that are not documented can be approximated with benchmark testing. Workload characterization is described in detail in Chapter 4 and is expanded here as it applies to the optimization methodology.

Figure 3.1 illustrates the relationship between implementations with architectural insight and those without. Lee et al. [105] recommend an "application driven design methodology that identifies essential hardware architecture features based on application characteristics". This type of approach is integrated throughout the framework.

Figure 3.1: Figure extracted from [64]. Algorithmic implementations specific to a particular architecture lead to high performance, and implementations without any insight lead to poor performance.

Optimization and tuning are applied at the kernel level, as opposed to at the application or system level.


Kernels are specific algorithms or functions developed for a specific task and are typically building blocks inside applications. For example, SGEMM (Single-Precision General Matrix Multiply) is a kernel that performs matrix multiplication. SGEMM is a pervasive component in scientific computing applications. Input to the framework is a kernel that has been identified as the primary performance bottleneck in the application (as RK4 was identified in Appendix A) and is the target for optimization and tuning.

The optimization strategy for this profile-driven tuning framework couples performance analysis with empirical observation to guide and limit the performance optimization search space. Throughput-oriented processors put more emphasis on parallelism and scalability than programming sequential processors. We use the principles described in the roofline model [179] to identify the maximum theoretical capabilities of the hardware as compared with execution limits from workload characteristics. Performance metrics are used to identify primary performance limiters, including instruction throughput, memory throughput, latency, or some combination. We employ profiler feedback to eliminate categories of optimizations that are unrelated to the observed bottleneck and which, by definition, cannot improve performance.

If latency is effectively hidden, hardware resources do not suffer from underutilization, and there must be hardware units that are running near peak and represent performance limiters. Optimizations that improve the efficiency of the bottlenecked resource will have the greatest impact on performance. This concept has important implications for setting expectations and correctly interpreting the results. For example, if a code is memory-bandwidth bound and an optimization doubles memory throughput, it's easy to expect the code to run twice as fast. However, the 2x improvement is the upper bound, depending on how close other limiters are behind the primary limit. Conversely, if a code is memory-bound and an optimization doubles instruction throughput, there should be no expectation of improved runtime, since the code is limited by memory bandwidth and remains so after the compute optimization is applied.
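To make the roofline comparison concrete, the following is a minimal sketch: the attainable performance bound is the minimum of the compute ceiling and the bandwidth ceiling scaled by the kernel's operational intensity. The peak numbers shown are illustrative placeholders, not measurements from this thesis.

    def roofline_bound(peak_gflops, peak_gb_per_s, flops_per_byte):
        """Attainable GFLOP/s = min(compute ceiling, bandwidth * intensity)."""
        return min(peak_gflops, peak_gb_per_s * flops_per_byte)

    # A kernel doing 0.5 flops per byte on a machine balanced near
    # 1030 / 144 (about 7.2 flops per byte) is memory-bandwidth limited.
    bound = roofline_bound(peak_gflops=1030.0, peak_gb_per_s=144.0,
                           flops_per_byte=0.5)
    print(f"attainable: {bound:.1f} GFLOP/s")  # 72.0, far below the compute peak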


The analysis framework is built from the following components:

- Low-level performance events defined for each architecture.

- Metric formulas derived from the low-level events.

- Analysis metrics that use the low-level events and metrics to inform optimization decision-making.

- Methodology and software tools to measure performance events.

- Interpretation of the events and metrics to limit the optimization search space (to those optimizations that improve memory or those that improve compute, for example) and specific strategies within each.

- An algorithm to automatically process and quantify performance metrics.

- Documented peak capabilities of the hardware under analysis.

- Understanding of kernel-imposed limits.

- Architectural review of the relevant hardware to understand specific optimizations.

An important component of the framework is its specification of which types of events need to be measured for throughput-oriented architectures, the formulas for building metrics from the events, and how to interpret and apply the results. Figure 3.2 illustrates the algorithm that drives performance optimization in this framework.


Figure 3.2: Optimizations are selected based on whether the kernel is compute-bound, memory-bound, or latency-bound, which determines the appropriate optimization strategies. This basic flow is recommended by performance engineers at NVidia. The automated framework demonstrates how to implement those ideas systematically.


3.1 Framework Methodology

The first step in the framework is to measure profile data and determine the primary performance limiter. High-performance throughput-oriented architectures require enough thread-level parallelism to hide latency. However, on GPU architectures, additional threads beyond what is sufficient for latency coverage will not necessarily increase performance, and since additional threads reduce the shared memory and/or registers available per thread, unnecessarily high occupancy can actually limit performance. The NVidia programming manual [9] asserts that more than 50 percent occupancy does not typically scale with increased performance. In other words, going from 50% occupancy to 75% occupancy, a 25% gain, does not in general imply a 25% gain in performance. In fact, [169] demonstrated it is possible to lose performance by maximizing occupancy. We initially predict an appropriate balance point based on at least 50% occupancy and full usage of on-chip resources; a sketch of this prediction appears at the end of this section.

A key principle embedded in our analysis strategy is that performance metrics must be evaluated within the larger context of the primary performance limiter. For example, metrics may indicate poor memory access patterns; this data should be acted on only if the kernel is memory-bound. The optimization algorithm first identifies which hardware resource is the primary performance limiter. This, in turn, defines an optimization space and a set of events to measure within that space. These events are then combined into threshold metrics and heuristics to determine how the resource under evaluation is limiting performance, which identifies a subset of candidate optimizations. Each of the following sections (memory-bound in Section 3.2.1, compute-bound in Section 3.2.2, and latency-bound in Section 3.2.3) follows this recipe. For each performance-limiting source, a set of optimization strategies is identified, generated, and tested for each architecture.
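The occupancy balance-point prediction mentioned above can be sketched as follows. The formulas mirror the effective-max-blocks rows of Table 3.6, and the Fermi-like resource limits in the defaults are illustrative assumptions.

    def blocks_per_sm(block_size, shmem_per_block, regs_per_thread,
                      max_blocks=8, max_threads=1536,
                      shmem_per_sm=48 * 1024, regs_per_sm=32 * 1024):
        """Effective max resident blocks per SM: the tightest of the block,
        thread, shared-memory, and register limits (compare Table 3.6)."""
        return min(max_blocks,
                   max_threads // block_size,
                   shmem_per_sm // max(shmem_per_block, 1),
                   regs_per_sm // max(regs_per_thread * block_size, 1))

    def occupancy(block_size, shmem_per_block, regs_per_thread,
                  warp_size=32, max_warps_per_sm=48):
        blocks = blocks_per_sm(block_size, shmem_per_block, regs_per_thread)
        active_warps = blocks * block_size // warp_size
        return min(active_warps / max_warps_per_sm, 1.0)

    # A 256-thread block using 4 KB of shared memory and 32 registers per
    # thread is register-limited to 4 blocks, predicting about 67% occupancy,
    # comfortably above the 50% balance point.
    print(f"predicted occupancy: {occupancy(256, 4096, 32):.0%}")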


Each architecture has an ideal balance of instructions to memory bytes, a ratio defined by each machine's peak limits (see Chapter 5). A kernel is classified as instruction-throughput limited or memory-bandwidth limited by comparing the achieved flop:byte ratio, a common practice in performance-oriented communities. If the measured instruction:byte ratio is higher than the ideal, the code is likely instruction-throughput limited. If the measured instruction:byte ratio is lower than the hardware ideal, the code is likely memory-bandwidth limited. The instruction:byte ratio is a binary operator; the outcome is either compute-bound or memory-bound. However, some kernels don't get close to hardware limits for instruction throughput or memory bandwidth. This is often an indication that latency is exposed and the kernel is latency-bound.

To formalize the above into pseudo-code, a kernel is classified as compute-bound, memory-bound, or latency-bound following these rules:

- If the measured instruction:byte ratio is higher than the hardware ideal, and measured instruction throughput is 70% or more of the hardware peak capability, then the kernel is compute-bound.

- If the measured instruction:byte ratio is lower than the hardware ideal, and measured bandwidth throughput is 70% or more of the hardware peak capability, then the kernel is memory-bound.

- If no hardware unit is near its relative hardware peak, the kernel is likely latency-bound.

Heuristic thresholds can be parameterized and adjusted as necessary, but the general industry consensus is that code is limited by a given hardware resource if achieved throughput is approximately 70%-80% of the peak hardware capability. A minimal executable rendering of these rules follows.
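One possible Python rendering of these classification rules is sketched below; the 0.70 threshold is the consensus value cited above, and the inputs are assumed to be measured as fractions of the relevant hardware peaks.

    def classify_kernel(inst_byte_ratio, ideal_inst_byte_ratio,
                        achieved_inst_frac, achieved_bw_frac, threshold=0.70):
        """Classify a kernel as compute-, memory-, or latency-bound.

        achieved_inst_frac: measured instruction throughput / hardware peak
        achieved_bw_frac:   measured memory bandwidth / hardware peak
        """
        if inst_byte_ratio > ideal_inst_byte_ratio and achieved_inst_frac >= threshold:
            return "compute-bound"
        if inst_byte_ratio < ideal_inst_byte_ratio and achieved_bw_frac >= threshold:
            return "memory-bound"
        # Neither unit is near its peak: latency is likely exposed.
        return "latency-bound"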


3.2 NVidia Metrics and Analysis

Tables 3.1 through 3.5 describe low-level hardware events NVidia supports on discrete graphics cards with CUDA support. The low-level events are common between Fermi and Maxwell. However, Maxwell supports many more metrics, without clear documentation of how the metrics are calculated. The derived metrics and analysis are different between Fermi and Maxwell. Maxwell evolved significantly since Fermi to natively support more of the efficiency metrics practitioners need to effectively measure performance. Since they differ, analysis metrics are given for both Fermi and Maxwell.

Compile-time data such as the number of registers per thread, grid size, block size, static shared memory allocated per CUDA block (bytes), dynamic shared memory allocated per CUDA block (bytes), constant memory allocated per CUDA block (bytes), spilled loads (bytes), and spilled stores (bytes) were also collected in addition to the low-level events collected from NVidia's profiling tool, nvprof.


Table 3.1: Low-level instruction-related hardware events from NVidia's nvprof.

inst_issued: Difference between issued and executed instructions; instruction issues that happened due to serialization, instruction cache misses, etc. Will rarely be zero; a concern only if it's a significant percentage of instructions issued.
inst_executed: Counts instructions encountered during execution. Incremented by one per warp.
thread_inst_executed_0: Number of instructions executed by all threads; does not include replays. For each instruction it increments by the number of threads in the warp that execute the instruction in pipeline 0.
thread_inst_executed_1: Number of instructions executed by all threads; does not include replays. For each instruction it increments by the number of threads in the warp that execute the instruction in pipeline 1.
gld_request: Number of executed load instructions where the state space is not specified and hence generic addressing is used. It can include load operations from global, local, and shared state space. Increments per warp on an SM.
gst_request: Number of executed store instructions where the state space is not specified and hence generic addressing is used. It can include store operations to global, local, and shared state space. Increments per warp on an SM.
shared_load: Number of executed load instructions where state space is specified as shared; increments per warp on a multiprocessor.
shared_store: Number of executed store instructions where state space is specified as shared; increments per warp on a multiprocessor.
local_load: Number of executed load instructions where state space is specified as local; increments per warp on a multiprocessor.
local_store: Number of executed store instructions where state space is specified as local; increments per warp on a multiprocessor.


Table 3.2: Low-level instruction-related hardware events from NVidia's nvprof (continued).

gld_inst_8bit: Total number of 8-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_16bit: Total number of 16-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_32bit: Total number of 32-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_64bit: Total number of 64-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_128bit: Total number of 128-bit global load instructions that are executed by all the threads across all thread blocks.
gst_inst_8bit: Total number of 8-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_16bit: Total number of 16-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_32bit: Total number of 32-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_64bit: Total number of 64-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_128bit: Total number of 128-bit global store instructions that are executed by all the threads across all thread blocks.


Table 3.3: Low-level L1 events from NVidia's nvprof.

l1_global_load_miss: Number of cache lines that miss in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32-, 64-, and 128-bit accesses by a warp, respectively. Incremented by 1 per L1 line (a line is 128B).
l1_global_load_hit: Number of cache lines that hit in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32-, 64-, and 128-bit accesses by a warp, respectively.
l1_local_load_hit: Number of cache lines that hit in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32-, 64-, and 128-bit accesses by a warp, respectively.
l1_local_load_miss: Number of cache lines that miss in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32-, 64-, and 128-bit accesses by a warp, respectively.
l1_local_store_hit: Number of cache lines that hit in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32-, 64-, and 128-bit accesses by a warp, respectively.
l1_local_store_miss: Number of cache lines that miss in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32-, 64-, and 128-bit accesses by a warp, respectively.
l1_shared_bank_conflict: Number of shared bank conflicts caused when addresses for two or more shared memory requests fall in the same memory bank. Increments by N-1 and 2*N-1 for an N-way conflict for 32-bit and 64-bit shared memory accesses, respectively.


Table 3.4: Low-level memory-related hardware events from NVidia's nvprof.

fb_subp0_read_sectors: Number of DRAM read requests to sub-partition 0. Increments by 1 for each 32-byte access.
fb_subp1_read_sectors: Number of DRAM read requests to sub-partition 1.
fb_subp0_write_sectors: Number of DRAM write requests to sub-partition 0.
fb_subp1_write_sectors: Number of DRAM write requests to sub-partition 1.
l2_subp0_read_hit_sectors: Number of read requests from L1 that hit in slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_read_hit_sectors: Number of read requests from L1 that hit in slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_read_sector_queries: Number of read requests from L1 to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_read_sector_queries: Number of read requests from L1 to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_write_sector_queries: Number of write requests from L1 to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_write_sector_queries: Number of write requests from L1 to slice 1 of L2 cache. This increments by 1 for each 32-byte access.


Table 3.5: Low-level latency-related and miscellaneous hardware events from NVidia's nvprof.

gputime: Execution time for the GPU kernel or memory copy method, in microseconds.
elapsed_clocks: Number of cycles.
global_store_transaction: Number of global store transactions. Increments by 1 per transaction. A transaction can be 32, 64, 96, or 128 bytes.
threads_launched: Number of threads launched on a multiprocessor.
warps_launched: Number of warps launched on a multiprocessor.
branch: Number of branch instructions executed per warp on a multiprocessor.
divergent_branch: Number of divergent branches within a warp. This counter is incremented by one if at least one thread in a warp diverges (that is, follows a different execution path) via a conditional branch.
active_warps: Accumulated number of active warps per cycle. For every cycle it increments by the number of active warps in the cycle, which can be in the range 0 to 48.
active_cycles: Number of cycles a multiprocessor has at least one active warp.
atom_count: Number of warps executing atomic reduction operations for thread-to-thread communication. Increments by one if at least one thread in a warp executes the instruction.


Table 3.6: Fermi derived metrics from NVidia.

dram reads: fb_subp0_read_sectors + fb_subp1_read_sectors
dram writes: fb_subp0_write_sectors + fb_subp1_write_sectors
thread instructions executed: thread_inst_executed_0 + thread_inst_executed_1
l2 read requests: l2_subp0_read_sector_queries + l2_subp1_read_sector_queries
l2 write requests: l2_subp0_write_sector_queries + l2_subp1_write_sector_queries
reads from L1 that hit in L2: l2_subp0_read_hit_sectors + l2_subp1_read_hit_sectors
L2 read hit rate: l2_l1_read_hits / l2_read_requests
L1 hit rate for local load requests: l1_local_load_hit / (l1_local_load_hit + l1_local_load_miss)
number of registers per block: num_registers_per_thread * num_threads_per_block
total number of blocks: grid_dim_x * grid_dim_y * grid_dim_z
effective max blocks per SM (thread limit): min(max_active_blocks_per_sm, max_active_threads_per_sm / block_size)
effective max blocks per SM (shared memory limit): min(max_active_blocks_per_sm, total_shmem_per_sm / shmem_per_block)
effective max blocks per SM (register limit): min(max_active_blocks_per_sm, max_32bit_reg_per_sm / num_reg_per_block)

Table 3.6 lists the formulas for metrics derived from the low-level events on Fermi. The events and metrics above are collected with scripts that drive nvprof via the command line.
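A minimal sketch of such a driver script is shown below. The event list is abbreviated, the kernel binary name is hypothetical, and the CSV parsing assumes nvprof's --csv event-summary layout, which may need adjusting for a particular nvprof version.

    import csv
    import subprocess

    EVENTS = ["fb_subp0_read_sectors", "fb_subp1_read_sectors",
              "fb_subp0_write_sectors", "fb_subp1_write_sectors", "inst_issued"]

    def collect_events(binary="./kernel"):
        """Run the target binary under nvprof and sum each event's totals."""
        subprocess.run(["nvprof", "--events", ",".join(EVENTS),
                        "--csv", "--log-file", "events.csv", binary], check=True)
        totals = {}
        with open("events.csv") as f:
            # Skip nvprof's "==...==" banner lines before the CSV header.
            rows = csv.DictReader(line for line in f if not line.startswith("=="))
            for row in rows:
                name = row.get("Event Name")
                if name in EVENTS:
                    totals[name] = totals.get(name, 0) + int(row["Total"])
        return totals

    def derive(t):
        """Derived metrics in the style of Table 3.6."""
        return {"dram_reads": t["fb_subp0_read_sectors"] + t["fb_subp1_read_sectors"],
                "dram_writes": t["fb_subp0_write_sectors"] + t["fb_subp1_write_sectors"],
                "inst_issued": t["inst_issued"]}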


The scripts format the data for easy import into a template Excel sheet. The performance template has additional analysis metrics defined to drive optimization. This process is semi-automated as-is, and its output is machine-readable for easy integration into an auto-tuner or any other package.

The measured instruction-to-byte ratio (see Table 3.7) is compared to the theoretical peaks of the machine. The instruction-to-byte ratio is calculated with respect to DRAM bytes and with respect to L2 bytes. According to NVidia engineers, if the code has a high hit rate in the L2 cache, it's better to look at L2 counters instead of DRAM; accesses to L2 are still expensive compared to arithmetic.

Table 3.7: L2 and DRAM instruction-to-byte ratios on NVidia.

DRAM ratio: (32 * inst_issued) / (32B * (dram_reads + dram_writes))
L2 ratio: (32 * inst_issued) / (32B * (l2_read_requests + l2_write_requests))

If the selected instruction-to-byte ratio is higher than the balanced ratio for the hardware, then the kernel is likely compute-bound and optimization efforts should focus on improving instruction throughput. If the selected instruction-to-byte ratio is lower than the balanced ratio for the hardware, then the kernel is likely memory-bound and optimization efforts should focus on improving memory throughput. An important check on this determination is to look at how close measured performance is to peak limits. If the instruction-to-byte ratio indicates memory-bound but no memory unit is operating near peak throughput, then latency exposure is likely the limiting factor and optimizations should focus on hiding latency better.
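The Table 3.7 ratios and the choice between them might be codified as in this sketch; the L2 hit-rate cutoff used to prefer the L2 counters is an illustrative assumption rather than a documented threshold.

    def inst_byte_ratios(inst_issued, dram_reads, dram_writes,
                         l2_read_requests, l2_write_requests):
        """Table 3.7 ratios: both counters tally 32-byte units, and
        inst_issued counts per warp (32 threads), hence the paired 32s."""
        dram_ratio = (32 * inst_issued) / (32 * (dram_reads + dram_writes))
        l2_ratio = (32 * inst_issued) / (32 * (l2_read_requests + l2_write_requests))
        return dram_ratio, l2_ratio

    def select_ratio(dram_ratio, l2_ratio, l2_read_hit_rate, hit_cutoff=0.8):
        # With a high L2 hit rate, DRAM traffic understates the kernel's true
        # memory demand, so compare against the L2 ratio instead.
        return l2_ratio if l2_read_hit_rate >= hit_cutoff else dram_ratio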


3.2.1 Memory Analysis Metrics and Optimizations

The analysis metrics described in this section are considered if the code is identified as memory-bound. They are derived from the metrics and events described in 3.2 and represent the thresholds and heuristics applied to determine which optimization is likely to have a significant performance impact. NVidia's profiling tool, nvprof, evolved significantly between Fermi and Maxwell, and many of the metrics that had to be manually measured, compiled, and computed for Fermi are automatically output from nvprof for Maxwell (compute capability 5.0 and beyond). Memory analysis metrics for Fermi are given in Table 3.8 and for Maxwell in Tables 3.9 through 3.12.


Table 3.8: Fermi memory analysis metrics.

dram read throughput: (32B * dram_reads) / execution_time_in_seconds
dram write throughput: (32B * dram_writes) / execution_time_in_seconds
dram throughput: dram_read_throughput + dram_write_throughput
dram throughput to peak throughput ratio: dram_throughput / theoretical_peak_memory_bandwidth
L2 read throughput: (32B * l2_reads) / execution_time_in_seconds
L2 write throughput: (32B * l2_writes) / execution_time_in_seconds
L2 throughput: l2_read_throughput + l2_write_throughput
L1 global load hit rate: l1_global_load_hit / (l1_global_load_hit + l1_global_load_miss)
L2 queries: l2_read_requests + l2_write_requests
L2 queries due to local memory: 2 * 4 * l1_local_load_miss
Transactions per load request: (l1_global_load_hit + l1_global_load_miss) / gld_request

The following are guidelines documented by NVidia engineers for applying the metrics to specific optimizations. To determine if register spilling is impacting memory, estimate how much of the L2 or DRAM traffic is due to local memory. The percentage of L2 queries due to local memory is the ratio of (2 * 4 * l1_local_load_miss) to (l2_read_requests + l2_write_requests). Multiply by 2 because a load miss implies a store happened first. Multiply by 4 because a local memory transaction is 128 bytes, which is 4 L2 transactions (32 bytes each).
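The local-memory traffic estimate above can be codified as in this sketch; the factors of 2 and 4 follow the reasoning in the preceding paragraph, and the warning threshold named in the comment is an illustrative choice.

    def l2_local_memory_fraction(l1_local_load_miss, l2_read_requests,
                                 l2_write_requests):
        """Fraction of L2 queries attributable to register spilling.

        x2: every spill load miss implies an earlier spill store.
        x4: one 128-byte local-memory transaction spans four 32-byte
            L2 transactions.
        """
        return (2 * 4 * l1_local_load_miss) / (l2_read_requests + l2_write_requests)

    # If, say, more than ~10% of L2 traffic is spill traffic (an illustrative
    # cutoff), the register-spilling optimizations below become candidates.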


To determine if the memory access pattern is problematic, compare the application throughput with the hardware throughput. If the application throughput is much lower than the hardware throughput, then many more requests are being made than are being used by the application:

- gld_request much less than (l1_global_load_hit + l1_global_load_miss) * word_size / 32

- gst_request much less than l2_write_requests * word_size / 32

To determine if loads are coalesced, compare the number of global load requests with the number of L1 cache line requests from global memory. The number of transactions per load request is evaluated by comparing the measured ratio with the expected number of transactions per load. The expected number of transactions per load for fp32 (single-precision floating point) is 1, because 32 threads of a warp each requesting 4 bytes makes a 128-byte request, which matches the transaction size. The word size for fp64 (doubles) is 8 bytes; 32 threads of a warp each requesting 8 bytes makes a 256-byte request, which is two expected transactions for doubles.
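A sketch of the coalescing check, with the expected transaction counts for fp32 and fp64 derived as in the paragraph above:

    # Expected L1 transactions per warp-wide load, keyed by word size in bytes.
    EXPECTED_TRANSACTIONS = {4: 1.0, 8: 2.0}

    def load_coalescing_ratio(l1_global_load_hit, l1_global_load_miss,
                              gld_request, word_size=4):
        """Measured L1 line transactions per global load request, relative to
        the ideal for a fully coalesced warp-wide access."""
        measured = (l1_global_load_hit + l1_global_load_miss) / gld_request
        # A result well above 1.0 means each warp request is being split into
        # extra transactions: a coalescing problem.
        return measured / EXPECTED_TRANSACTIONS[word_size]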


Maxwell memory transaction related metrics are listed in Table 3.9, memory throughput related metrics in Table 3.10, memory utilization related metrics in Table 3.11, and miscellaneous memory metrics in Table 3.12.

Table 3.9: Maxwell memory transaction metrics.

shared_load_transactions_per_request: Average number of shared memory load transactions performed for each shared memory load.
shared_store_transactions_per_request: Average number of shared memory store transactions performed for each shared memory store.
local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load.
local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store.
gld_transactions_per_request: Average number of global memory load transactions performed for each global memory load.
gst_transactions_per_request: Average number of global memory store transactions performed for each global memory store.
shared_store_transactions: Number of shared memory store transactions.
shared_load_transactions: Number of shared memory load transactions.
local_load_transactions: Number of local memory load transactions.
local_store_transactions: Number of local memory store transactions.
gld_transactions: Number of global memory load transactions.
gst_transactions: Number of global memory store transactions.
dram_read_transactions: Device memory read transactions.
dram_write_transactions: Device memory write transactions.
atomic_transactions: Global memory atomic and reduction transactions.
atomic_transactions_per_request: Average number of global memory atomic and reduction transactions performed for each atomic and reduction instruction.
sysmem_read_transactions: Number of system memory read transactions.
sysmem_write_transactions: Number of system memory write transactions.
l2_read_transactions: Memory read transactions seen at L2 cache for all read requests.
l2_write_transactions: Memory write transactions seen at L2 cache for all write requests.
local_memory_overhead: Ratio of local memory traffic to total memory traffic between the L1 and L2 caches.
l2_atomic_transactions: Memory read transactions seen at L2 cache for atomic and reduction requests.
l2_tex_read_transactions: Memory read transactions seen at L2 cache for read requests from the texture cache.
l2_tex_write_transactions: Memory write transactions seen at L2 cache for write requests from the texture cache.
tex_cache_transactions: Texture cache read transactions.


Table 3.10: Maxwell memory throughput metrics.

gld_requested_throughput: Requested global memory load throughput.
gst_requested_throughput: Requested global memory store throughput.
gld_throughput: Global memory load throughput.
gst_throughput: Global memory store throughput.
dram_read_throughput: Device memory read throughput.
dram_write_throughput: Device memory write throughput.
tex_cache_throughput: Texture cache throughput.
local_load_throughput: Local memory load throughput.
local_store_throughput: Local memory store throughput.
shared_load_throughput: Shared memory load throughput.
shared_store_throughput: Shared memory store throughput.
l2_tex_read_throughput: Memory read throughput seen at L2 cache for read requests from the texture cache.
l2_tex_write_throughput: Memory write throughput seen at L2 cache for write requests from the texture cache.
l2_read_throughput: Memory read throughput seen at L2 cache for all read requests.
l2_write_throughput: Memory write throughput seen at L2 cache for all write requests.
sysmem_read_throughput: System memory read throughput.
sysmem_write_throughput: System memory write throughput.
l2_atomic_throughput: Memory read throughput seen at L2 cache for atomic and reduction requests.
ecc_throughput: ECC throughput from L2 to DRAM.


Table 3.11: Maxwell memory utilization metrics.

l2_utilization: The utilization level of the L2 cache relative to the peak utilization.
tex_fu_utilization: The utilization level of the multiprocessor function units that execute global, local, and texture memory instructions.
ldst_fu_utilization: The utilization level of the multiprocessor function units that execute shared load, shared store, and constant load instructions.
dram_utilization: The utilization level of the device memory relative to the peak utilization.
tex_utilization: The utilization level of the texture cache relative to the peak utilization.
sysmem_utilization: The utilization level of the system memory relative to the peak utilization.
shared_utilization: The utilization level of the shared memory relative to peak utilization.


Table 3.12: Miscellaneous Maxwell memory metrics.

global_hit_rate: Hit rate for global loads.
local_hit_rate: Hit rate for local loads and stores.
tex_cache_hit_rate: Texture cache hit rate.
l2_tex_read_hit_rate: Hit rate at L2 cache for all read requests from the texture cache.
l2_tex_write_hit_rate: Hit rate at L2 cache for all write requests from the texture cache.
gld_efficiency: Ratio of requested global memory load throughput to required global memory load throughput. Values greater than 100% indicate that, on average, the load requests of multiple threads in a warp fetched from the same memory address.
gst_efficiency: Ratio of requested global memory store throughput to required global memory store throughput. Values greater than 100% indicate that, on average, the store requests of multiple threads in a warp targeted the same memory address.
shared_efficiency: Ratio of requested shared memory throughput to required shared memory throughput.
ldst_issued: Number of issued local, global, shared, and texture memory load and store instructions.
ldst_executed: Number of executed local, global, shared, and texture memory load and store instructions.
stall_memory_dependency: Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding.
stall_texture: Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests.
stall_other: Percentage of stalls occurring due to miscellaneous reasons.
stall_constant_memory_dependency: Percentage of stalls occurring because of an immediate constant cache miss.
stall_memory_throttle: Percentage of stalls occurring because of memory throttle.


Inefficient memory access patterns are well-known and documented limiters of performance on throughput-oriented architectures. The framework measures the bytes requested by the application and compares them to the bytes moved by the hardware. The two can differ if memory access patterns cause inefficient use of the memory bus. This framework identifies whether the access pattern is problematic using heuristics developed and described by NVidia engineers [126]. If the number of global memory load or store bytes requested by the application is much smaller than the number of bytes moved by the hardware, then efficiency is much less than 100% and bandwidth is being wasted. Below 50%, scattered accesses are most likely the problem. Another indication that bandwidth is being wasted comes from comparing the number of transactions per load request to the expected number of transactions per load request. If the number of transactions per load request is higher than expected, the application is using only some of the bytes per transaction and has to generate more transactions to satisfy every thread in the warp. For example, fp64 instructions require at least 2 transactions per memory access and fp32 instructions require at least 1 transaction per memory access. If the number of memory transactions is higher than expected, it indicates bad access patterns. The primary optimization to consider for problematic access patterns is to ensure accesses are coalesced. Loads can be coalesced by using structure-of-arrays storage (as opposed to array-of-structures) or by padding multi-dimensional structures so that warp accesses are aligned on line boundaries.

Candidate optimizations when register spilling is a problem are to increase the register count per thread (using a higher limit in the -maxrregcount compiler option or a lower thread count with the grid launch bounds), increase the size of the L1 cache (which reduces the bytes available for shared memory), use non-caching global memory loads, or try fetching data from the texture cache [126]. Increasing the number of registers per thread can decrease occupancy, potentially making global memory accesses less efficient.


However, this can still result in a net win if the fewer total bytes accessed in global memory reduce pressure on the memory interface. The challenge is to find the right balance. The purpose of increasing the size of the L1 cache (which decreases shared memory on Fermi) is to enable more spills/fills to hit in the cache and reduce memory traffic. Non-caching loads disable the L1 cache only (not the L2) and generate smaller transactions (32B instead of 128B), which is more efficient for scattered or partially-filled access patterns. Fetching data from the texture or constant cache can be effective if smaller transactions help with memory efficiency (the kernel is using all bytes fetched from memory and is not generating multiple transactions that could be coalesced into one) and because the cache is not polluted by other global memory loads. An auto-tuner is an appropriate framework in which to experiment with L1 and caching configurations to select the best option; a sketch of such a sweep follows.
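As one illustration of that experiment, an auto-tuner might sweep the register ceiling and the load-caching mode and time each build. The source and binary names are hypothetical; -maxrregcount and -Xptxas -dlcm={ca,cg} are the standard nvcc controls for the per-thread register limit and the L1 caching behavior discussed above.

    import subprocess
    import time

    def sweep_register_and_caching(src="kernel.cu"):
        """Build and time each (register limit, caching mode) combination."""
        results = {}
        for maxrreg in (16, 32, 63):           # per-thread register ceilings
            for dlcm in ("ca", "cg"):          # ca: cache in L1; cg: bypass L1
                subprocess.run(["nvcc", "-O3", f"-maxrregcount={maxrreg}",
                                "-Xptxas", f"-dlcm={dlcm}",
                                src, "-o", "kernel_variant"], check=True)
                start = time.perf_counter()
                subprocess.run(["./kernel_variant"], check=True)
                results[(maxrreg, dlcm)] = time.perf_counter() - start
        return min(results, key=results.get)   # fastest configuration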


3.2.2 Instruction Analysis Metrics and Optimizations

The analysis metrics and optimizations described in this section are evaluated if the code is identified as compute-bound (used interchangeably with instruction-throughput limited). The instruction analysis metrics described in this section are derived from the metrics and events described in 3.2 and represent thresholds and heuristics applied to determine which optimization is likely to improve instruction throughput. As discussed in 3.2.1, NVidia's profiling tool, nvprof, has evolved significantly, and many of the instruction analysis metrics that had to be manually measured, compiled, and computed for Fermi are automatically output from nvprof for Maxwell. One of the complicating factors in this work is that the tools and methods to collect performance counters are different between platforms. To characterize the two NVidia architectures, instruction analysis metrics for both Fermi and Maxwell are detailed in this section. However, the concepts for how to measure, interpret, and apply the data are the same for any throughput-oriented architecture.

The following are a few analysis metrics for Fermi, given as guidelines by NVidia engineers for compute-bound kernels.

- Serialization impact: serialization is significant if instructions issued is significantly higher than instructions executed.

    inst_executed / inst_issued    (3.1)

- Shared memory bank conflicts: shared memory bank conflicts can limit instruction throughput if conflicts are a significant percentage of instructions and if the kernel is instruction-throughput limited.

    l1_shared_bank_conflict / (shared_load + shared_store)    (3.2)

    l1_shared_bank_conflict / inst_issued    (3.3)

- Register spilling impact: measure if register spills are impacting instruction count. Count L1 misses for caching loads (Equation 3.4). Count L2 read requests for non-caching loads and stores (Equation 3.5).

    l1_local_load_miss / inst_issued    (3.4)

    l2_read_requests / inst_issued    (3.5)

- Local memory impact: percentage of instructions due to local memory access.

    (l1_local_load_hit + l1_local_load_miss + l1_local_store_hit + l1_local_store_miss) / inst_issued    (3.6)

- Branch divergence impact: branch divergence can waste instructions.

    divergent_branch / branch    (3.7)


- All divergence impact: branch divergence is just one way instructions can be serialized; this metric captures serialization of thread execution from all causes. (Here inst_executed counts once per warp while thread instructions executed counts per thread, hence the factor of 32.)

    100 * (32 * inst_executed - thread_instructions_executed) / (32 * inst_executed)    (3.8)

- Instructions per clock (IPC):

    inst_executed / num_SMs / elapsed_clocks    (3.9)

A minimal sketch consolidating these checks follows.
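The sketch below turns several of Equations 3.1 through 3.9 into executable checks; the input is assumed to be a dictionary of the raw Fermi counters named in Tables 3.1 through 3.5, and no thresholds are imposed beyond the guidelines above.

    def instruction_checks(c, num_sms, warp_size=32):
        """c: dict of raw Fermi counters from Tables 3.1-3.5."""
        serialization = c["inst_executed"] / c["inst_issued"]          # (3.1)
        bank_conflicts = (c["l1_shared_bank_conflict"]
                          / (c["shared_load"] + c["shared_store"]))    # (3.2)
        spill_load_frac = c["l1_local_load_miss"] / c["inst_issued"]   # (3.4)
        divergence = c["divergent_branch"] / c["branch"]               # (3.7)
        all_divergence = (100 * (warp_size * c["inst_executed"]
                                 - c["thread_instructions_executed"])
                          / (warp_size * c["inst_executed"]))          # (3.8)
        ipc = c["inst_executed"] / num_sms / c["elapsed_clocks"]       # (3.9)
        return {"serialization": serialization,
                "bank_conflicts": bank_conflicts,
                "spill_load_frac": spill_load_frac,
                "divergence": divergence,
                "all_divergence_pct": all_divergence,
                "ipc": ipc}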


Table 3.13 describes instruction utilization related metrics for Maxwell, Table 3.14 describes flop and efficiency related metrics, Table 3.15 describes instruction count related metrics, and Table 3.16 describes miscellaneous instruction metrics on Maxwell.

Table 3.13: Maxwell instruction utilization related metrics.

issue_slot_utilization: Percentage of issue slots that issued at least one instruction, averaged across all cycles.
cf_fu_utilization: The utilization level of the multiprocessor function units that execute control-flow instructions.
tex_fu_utilization: The utilization level of the multiprocessor function units that execute global, local, and texture memory instructions.
ldst_fu_utilization: The utilization level of the multiprocessor function units that execute shared load, shared store, and constant load instructions.
double_precision_fu_utilization: The utilization level of the multiprocessor function units that execute double-precision floating-point instructions.
special_fu_utilization: The utilization level of the multiprocessor function units that execute sin, cos, ex2, popc, flo, and similar instructions.
single_precision_fu_utilization: The utilization level of the multiprocessor function units that execute single-precision floating-point instructions and integer instructions.
Table 3.14: Maxwell flop and efficiency related metrics.

    ipc: Instructions executed per cycle
    issued_ipc: Instructions issued per cycle
    flop_count_dp: Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special)
    flop_count_dp_add: Number of double-precision floating-point add operations executed by non-predicated threads
    flop_count_dp_fma: Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads
    flop_count_dp_mul: Number of double-precision floating-point multiply operations executed by non-predicated threads
    flop_count_sp: Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special)
    flop_count_sp_add: Number of single-precision floating-point add operations executed by non-predicated threads
    flop_count_sp_fma: Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads
    flop_count_sp_mul: Number of single-precision floating-point multiply operations executed by non-predicated threads
    flop_count_sp_special: Number of single-precision floating-point special operations executed by non-predicated threads
    eligible_warps_per_cycle: Average number of warps eligible to issue per active cycle
    flop_sp_efficiency: Ratio of achieved to peak single-precision floating-point operations
    flop_dp_efficiency: Ratio of achieved to peak double-precision floating-point operations
    branch_efficiency: Ratio of non-divergent branches to total branches
    warp_execution_efficiency: Ratio of average active threads per warp to the maximum number of threads per warp supported on a multiprocessor
Table 3.15: Maxwell instruction counts.

    inst_per_warp: Average number of instructions executed by each warp
    inst_replay_overhead: Average number of replays for each instruction executed
    inst_executed: The number of instructions executed
    inst_issued: The number of instructions issued
    inst_fp_32: Number of single-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
    inst_fp_64: Number of double-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
    inst_integer: Number of integer instructions executed by non-predicated threads
    inst_bit_convert: Number of bit-conversion instructions executed by non-predicated threads
    inst_control: Number of control-flow instructions executed by non-predicated threads (jump, branch, etc.)
    inst_compute_ld_st: Number of compute load/store instructions executed by non-predicated threads
    inst_misc: Number of miscellaneous instructions executed by non-predicated threads
    inst_inter_thread_communication: Number of inter-thread communication instructions executed by non-predicated threads
Table 3.16: Maxwell miscellaneous instruction metrics.

    issue_slots: The number of issue slots used
    cf_issued: Number of issued control-flow instructions
    cf_executed: Number of executed control-flow instructions
    ldst_issued: Number of issued local, global, shared and texture memory load and store instructions
    ldst_executed: Number of executed local, global, shared and texture memory load and store instructions
    atomic_transactions: Global memory atomic and reduction transactions
    stall_inst_fetch: Percentage of stalls occurring because the next assembly instruction has not yet been fetched
    stall_exec_dependency: Percentage of stalls occurring because an input required by the instruction is not yet available
    stall_sync: Percentage of stalls occurring because the warp is blocked at a __syncthreads() call
    stall_other: Percentage of stalls occurring due to miscellaneous reasons
    stall_pipe_busy: Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy
    stall_not_selected: Percentage of stalls occurring because the warp was not selected
    eligible_warps_per_cycle: Average number of warps that are eligible to issue per active cycle
Compute-bound optimization strategies focus on reducing the number of instructions executed or on using higher performing instructions. Some factors that can limit code from reaching peak compute capability are raw instruction throughput and instruction serialization. It is important to understand the instruction mix of the kernel because 32-bit floats, 64-bit floats, integers, memory loads and stores, and transcendentals all have different throughputs. When peak GFLOP/s is quoted for hardware, it often refers to the 32-bit floating point throughput, which is only achievable if 100% of the instructions are 32-bit floats. A kernel with some percentage of 64-bit floats will, by definition, have a max limit that is lower than the 32-bit theoretical ceiling. This is why instruction distributions are measured and applied to instruction throughputs in kernel analysis.

One optimization strategy to improve instruction throughput is to change the raw instruction mix to prefer higher throughput instruction types. One example is to replace 64-bit floats with 32-bit floats where the lower level of precision is acceptable. Floating point literals without an "f" suffix (52.8 as opposed to 52.8f) are interpreted as 64-bit floats per the C standard. Another is to use transcendental instruction types, which are hardware optimized approximations. Again, this trades off accuracy for speed and can only be used if the loss in accuracy can be tolerated; both ideas are sketched below.
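The following is a minimal sketch of both ideas, assuming the kernel can tolerate single precision and an approximate result (__expf is CUDA's fast, hardware-approximated counterpart of expf):

    // Sketch: trading precision for instruction throughput.
    __global__ void scaleKernel(float* v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float slow = v[i] * 52.8;   // no "f" suffix: the 64-bit literal
                                        // forces a double-precision multiply
            float fast = v[i] * 52.8f;  // "f" suffix keeps the math in fp32
            v[i] = slow + __expf(fast); // __expf: SFU-approximated transcendental
        }
    }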
Instruction serialization occurs when threads in a warp issue the same instruction in sequence as opposed to the entire warp issuing the instruction at once [126]. NVidia profile counters refer to this as "replays" because the same instruction is replayed for different threads in a warp. Replays can be caused by shared memory bank conflicts and constant memory bank conflicts. Warp divergence can also cause instructions to serialize. Serialization due to divergent branches and shared memory conflicts can be measured with profile counters.

Shared memory bank conflicts can significantly impact performance if the kernel is instruction throughput limited and shared memory bank conflicts are a significant percentage of instructions issued. Warps access shared memory by columns, which implies that each thread of a warp will access the same bank of a 32x32 shared memory array, resulting in a 32-way bank conflict. Bank conflicts can be avoided by padding shared memory so each thread accesses a different bank, as sketched below.
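A minimal sketch of the padding fix, using the common 32x32 transpose tile as the example (a square matrix of dimension width is assumed):

    // Without the +1 pad, the column read below would be a 32-way bank conflict.
    __global__ void transposeTile(const float* in, float* out, int width) {
        __shared__ float tile[32][33];        // 33 columns: the padding column
        int x = blockIdx.x * 32 + threadIdx.x;
        int y = blockIdx.y * 32 + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();
        x = blockIdx.y * 32 + threadIdx.x;    // transposed block coordinates
        y = blockIdx.x * 32 + threadIdx.y;
        out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free
    }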
Constant memory throughput can be measured against hardware peak capabilities to determine if constant memory bank conflicts are limiting constant memory throughput. Constant memory resides in global memory and can process 4B per SM per clock.

3.2.3 Latency Optimizations

The optimizations described in this section are evaluated if the code is identified as latency-bound. One reason memory throughput can be lower than hardware limits is because the number of concurrent accesses is insufficient to hide memory latency. Little's law can be used to approximate if there are sufficient concurrent accesses to saturate the bus. High performing kernels need memory latency * bandwidth bytes in flight (Little's law) to hide latency. Concurrent accesses can be increased by increasing occupancy and by modifying the code to process several elements per thread. Occupancy is increased by adjusting thread block dimensions to maximize occupancy at given register and shared memory requirements. There is a balance point where occupancy is sufficient, and whatever remains to the kernel for registers and shared memory should be utilized accordingly.

3.3 Xeon Phi Metrics and Analysis

This section describes the low level events, analysis metrics, and performance thresholds that need to be collected on the Xeon Phi Coprocessors to drive performance analysis and optimization. The hardware counters that are exposed to developers on the Xeon Phi are very different than what is measured on NVidia using nvprof. However, the basic concepts remain the same.

Table 3.17: Xeon Phi Hardware Events from Intel's VTune [33]

    CPU_CLK_UNHALTED: Number of cycles executed by the core.
    DATA_PAGE_WALK: Number of L1 TLB misses.
    DATA_READ_MISS: Number of memory read accesses that miss the internal data cache, whether or not the access is cacheable or noncacheable. Cache accesses resulting from prefetch instructions are included.
    DATA_READ_MISS_OR_WRITE_MISS: Number of demand loads or stores that miss a thread's L1 cache.
    DATA_READ_OR_WRITE: Number of loads and stores seen by a thread's L1 data cache.
    DATA_WRITE_MISS: Number of memory write accesses that miss the internal data cache, whether or not the access is cacheable or noncacheable.
    EXEC_STAGE_CYCLES: Number of cycles when the thread was executing computational operations.
    HWP_L2MISS: Number of hardware prefetches that missed L2.
    INSTRUCTIONS_EXECUTED: Number of instructions executed by the thread.
    L2_DATA_READ_MISS_CACHE_FILL: Number of data read accesses that missed the L2 cache and were satisfied by another L2 cache. Can include promoted read misses that started as code accesses.
    L2_DATA_READ_MISS_MEM_FILL: Number of data read accesses that missed the L2 cache and were satisfied by main memory. Can include promoted read misses that started as code accesses.
    L2_DATA_WRITE_MISS_CACHE_FILL: Number of data write (RFO) accesses that missed the L2 cache and were satisfied by another L2 cache.
    L2_DATA_WRITE_MISS_MEM_FILL: Number of data write (RFO) accesses that missed the L2 cache and were satisfied by main memory.
    L2_VICTIM_REQ_WITH_DATA: Number of evictions that resulted in a memory write operation.
    SNP_HITM_L2: Number of incoming snoops that hit modified data in L2, thus resulting in an L2 eviction.
    VPU_ELEMENTS_ACTIVE: Number of VPU operations executed by the thread.
    VPU_INSTRUCTIONS_EXECUTED: Number of VPU instructions executed by the thread.

Table 3.18 lists all equations that support analysis on the Xeon Phi Coprocessors.
Table 3.18: Xeon Phi Formulas for Performance Analysis

    FLOP/s = VPU_ELEMENTS_ACTIVE / Time
    SP GFLOP/s = 16 (SP SIMD lanes) * 2 (FMA) * 1.1 GHz * 56 (# cores) = 1971
    DP GFLOP/s = 8 (DP SIMD lanes) * 2 (FMA) * 1.1 GHz * 56 (# cores) = 985.6
    Average CPI per Thread = CPU_CLK_UNHALTED / INSTRUCTIONS_EXECUTED
    Average CPI per Core = Average CPI per Thread / num_hardware_threads
    L1 Compute to Data Access Ratio = VPU_ELEMENTS_ACTIVE / DATA_READ_OR_WRITE
    L2 Compute to Data Access Ratio = VPU_ELEMENTS_ACTIVE / DATA_READ_MISS_OR_WRITE_MISS
    Vectorization Intensity = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED
    L1 Misses = DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1
    L1 Hit Rate = (DATA_READ_OR_WRITE - L1 Misses) / DATA_READ_OR_WRITE
    Estimated Latency Impact = (CPU_CLK_UNHALTED - EXEC_STAGE_CYCLES - DATA_READ_OR_WRITE) / DATA_READ_MISS_OR_WRITE_MISS
    L1 TLB miss ratio = DATA_PAGE_WALK / DATA_READ_OR_WRITE
    L2 TLB miss ratio = LONG_DATA_PAGE_WALK / DATA_READ_OR_WRITE
    L1 TLB misses per L2 TLB miss = DATA_PAGE_WALK / LONG_DATA_PAGE_WALK

The L1 ratio calculates an average of the number of vectorized operations that occur for each L1 cache access.

In practice, achieved bandwidth of approximately 140 GB/sec is near the maximum that an application is likely to see. This is likely due to limits in the on-chip (ring) interconnect.
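Because these formulas reduce to simple arithmetic over event counts, the framework computes them directly. A sketch follows; the struct fields mirror the events above and are illustrative, not a real VTune API:

    /* Sketch: deriving Xeon Phi efficiency metrics from raw event counts. */
    typedef struct {
        double vpu_elements_active, vpu_instructions_executed;
        double data_read_or_write, data_read_miss_or_write_miss;
        double cpu_clk_unhalted, instructions_executed;
    } PhiEvents;

    /* approaches 16 (SP) or 8 (DP) when loops are fully vectorized */
    double vectorization_intensity(const PhiEvents* e) {
        return e->vpu_elements_active / e->vpu_instructions_executed;
    }

    double l1_compute_to_data_ratio(const PhiEvents* e) {
        return e->vpu_elements_active / e->data_read_or_write;
    }

    double cpi_per_thread(const PhiEvents* e) {
        return e->cpu_clk_unhalted / e->instructions_executed;
    }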
The following are performance heuristics published by Intel [33] to guide performance analysis on the Xeon Phi coprocessors. Intel documentation recommends investigation if any of the given heuristics hold. However, as discussed in this analysis framework, the memory-related heuristics should be investigated only if the kernel is memory-bound, and the compute-related heuristics should be investigated only if the kernel is compute-bound.

    - average CPI per thread is greater than 4
    - average CPI per core is greater than 1
    - vectorization intensity is less than 16 (SP)
    - vectorization intensity is less than 8 (DP)
    - L1 compute to data access ratio is less than the vectorization intensity
    - L2 compute to data access ratio is less than 100x the L1 compute to data access ratio
    - L1 hit rate is less than 95%
    - estimated latency impact is greater than 145
    - L1 TLB miss ratio is greater than 1%
    - L2 TLB miss ratio is greater than 0.1%
    - L1 TLB misses per L2 TLB miss is near 1
    - bandwidth is less than 80 GB/s
4. Scientific Kernel Characterization

Workload characterization for the analysis framework has three primary objectives: the instruction mix, to capture instruction frequencies; an algorithmic estimate of flops and bytes, to make an educated guess whether the kernel should be compute- or memory-bound; and relevant domain knowledge that may inform optimization with respect to data structures and organization. According to [57], a benchmark kernel is really a specification of event frequencies. For example, [57] asserts:

"Cycles per instruction (CPI) is a most natural metric for expressing processor performance because it is the product of two measurable things: CPI = cycles per event * events per instruction. The number of cycles per event is determined by the event type for a particular microarchitecture, and the number of events per instruction is known for each workload independent of the microarchitecture."

We apply this concept to the instruction mix of the workload (kernel-dependent, microarchitecture-independent) and the throughput for each instruction type (kernel-independent, microarchitecture-dependent).

A processor's architecture defines the instructions it can execute. Its microarchitecture determines how the instructions are executed [121]. Ideally, microarchitecturally agnostic events help isolate kernel behavior from hardware influences. Microarchitecturally agnostic metrics have the nice property that the same values will be reported on any microarchitecture, and therefore any differences observed in performance must be due to differences in hardware as opposed to the complex interplay between software and hardware. However, this only holds on microarchitectures within the same architectural family with the same instruction set. Even within the same instruction set, many events are influenced by the hardware in subtle ways.

The primary purpose of capturing the instruction mix is to quantify kernel-imposed limits on performance rooflines. This helps more accurately predict how far from peak a kernel is and how much optimization opportunity remains. For example,
most peak GFlop/s claims are cited with respect to peak single-precision (SP) operations. This limit is achievable only if 100% of the instructions are single-precision. In practice, measuring instruction types isn't supported with all profiling tools. Fermi has a limited set of instruction events, not enough to use for an estimate of how the instruction mix limits achievable peak. Maxwell improved over previous generations, includes support for many instruction types, and is the basis for Figure A.10. VTune on the Xeon Phi Coprocessors does not give any indication of instruction mix other than VPU instructions, which are arguably the most important type, but broader visibility would help this type of analysis. Instructions can be measured in other ways, or parameterized estimates can be applied to roofline models. These types of issues are why being within 70% to 80% of peak is considered very good. If kernel-imposed limits are accounted for, 80% of peak is probably closer to 90% or better.

One difficulty in comparing benchmarks or kernels on different architectures is that the source code for a kernel running on NVidia graphics cards with CUDA and the source code for the same kernel running on Xeon Phi are, by definition, not the same source code. They have different instruction types and instruction distributions relevant to the hardware the code is running on. The Xeon Phi Coprocessors are not compatible with the proprietary CUDA extensions required to drive NVidia graphics cards, and code written for CUDA-enabled graphics cards will not compile on Xeon Phi and vice versa. However, the conceptual goal remains the same for all throughput-oriented architectures, which is to understand kernel behaviors that may limit theoretical peak capabilities of the hardware. Architectures can be compared using the ratio of performance to peak capability of the machine to normalize out any differences in the architecture. Assuming both kernels are optimized, relative performance can be assessed.
Table 4.1: Example instruction counts and distributions.

    dynamic instruction count per kernel: Per kernel accounting of the dynamic instructions executed.
    average instruction count per thread: Average number of dynamic instructions executed by a thread.
    thread count per kernel: Count of total threads spawned per kernel.
    total thread count: Count of total threads spawned in a workload.
    floating-point instruction count: Total number of floating point instructions executed.
    integer instruction count: Total number of integer instructions executed.
    special instruction count: Total number of special functional unit instructions executed.
    memory instruction count: Total number of memory instructions.
    branch instruction count: Total number of branch instructions.
    barrier instruction count: Total number of barrier synchronization instructions.
To demonstrate general applicability of our approach, we analyzed two additional scientific kernels beyond the scientific program that has been the basis of our research to date. We selected Stencil and SGEMM to analyze because they are well-studied kernels that have been characterized in performance optimization research. We validate how well the framework guides optimization using those known characteristics. In addition, both Stencil and SGEMM behave differently on different architectures depending on the level of optimization performed. We demonstrate performance improvement over several kernel implementations on several architectures to establish the validity of the analysis framework. The additional kernels demonstrate our methodology works with very different application domains, and the Xeon Phi Coprocessors demonstrate our methodology works with a very different architectural paradigm. The three scientific application kernels we selected are Stencil, single-precision general matrix multiply (SGEMM), and the RK4 implementation for kinetic modeling. The following sections describe each application in more detail, including why they were selected for study.

4.1 Stencil

Partial differential equations (PDEs) represent a very common workload class in a broad range of scientific applications. Solving PDEs numerically is very important to the scientific community, and PDE solvers tend to be very computationally intensive, which makes them interesting candidates for acceleration [159]. We chose to include the stencil benchmark in part because stencil acceleration on advanced hardware is a very active area of research. Stencil applications are often implemented using iterative finite-difference techniques where each point in a multidimensional grid is updated with weighted contributions from a subset of its neighbors, which represent the coefficients of the PDE for that data element [47].

The stencil evaluated in NVidia GPU analysis is a seven point stencil that solves a 3D heat equation from the Parboil benchmark suite [159]. The 7-point stencil contains
points directly adjacent to the center in each direction [135]. This benchmark includes a GPU-optimized version drawn from several published works, including [145] and [45], and uses Jacobi iterations, which means the calculation is not done in place; the algorithm alternates the source and target arrays after each iteration [47]. For each grid point, the stencil will execute 8 floating point operations (two multiplications and six additions) and transfer at least 16 bytes (bytes have to be read and written) with double precision. The stencil's flop to byte ratio is 0.5, which is very low (the ideal balance for most throughput-oriented architectures is between four and ten), so the algorithm is likely to be memory-bound on most architectures. A minimal sketch of the update follows; the baseline pseudocode appears in Figure 4.1.
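The sketch below writes this 7-point Jacobi update as a naive CUDA kernel; the array names and coefficients c0/c1 are illustrative, and the Parboil code differs in details:

    // out is written from in; the host swaps the arrays between iterations
    // (Jacobi: the calculation is not done in place).
    __global__ void stencil7pt(float c0, float c1, const float* in, float* out,
                               int nx, int ny, int nz) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int k = blockIdx.z * blockDim.z + threadIdx.z;
        if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1 && k > 0 && k < nz - 1) {
            int idx = (k * ny + j) * nx + i;
            // 8 flops (two multiplications, six additions) per grid point:
            out[idx] = c1 * (in[idx - 1]       + in[idx + 1] +
                             in[idx - nx]      + in[idx + nx] +
                             in[idx - nx * ny] + in[idx + nx * ny])
                     - c0 * in[idx];
        }
    }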
The typical bottleneck of an unoptimized stencil implementation is data locality. The Parboil Stencil benchmark applies register tiling (blocking) along the Z dimension and other tiling optimizations to improve locality. Stencil re-use of data elements along the third dimension may not be able to fit in the cache for large problem sizes, and baseline, or naive, implementations tend to be memory-bound. Cache tiling (blocking) is an optimization technique to form smaller tiles of loop iterations which are executed together and result in better temporal and spatial locality. The Parboil authors found that even with optimizations the performance limitation of the Stencil benchmark is global memory bandwidth for the architectures they tested, which we were able to independently confirm in our analysis. Figure 4.1 is pseudocode for the baseline Parboil implementation; the same pseudocode is also used in [47].

The study of finite difference stencils is large, with many varied implementation choices. The stencil evaluated on Xeon Phi Coprocessors is an 8th-order isotropic acoustic wave equation developed by Andreolli [16]. This algorithm is different than the 7-point stencil evaluated on GPUs, but the flops to bytes ratio is very similar. The series of implementations and optimizations that Andreolli developed is very useful to validate the methodology proposed in this thesis.

Figure 4.1: Pseudocode for the baseline stencil implementation used in GPU architecture analysis.

4.2 Single Precision General Matrix Multiply (SGEMM)

General matrix multiplication is a function that performs matrix multiplication of the form shown in Equation 4.1:

    C = alpha * A * B + beta * C                                       (4.1)

where A, B and C are matrices and alpha and beta are floating point scalars. Many studies set beta to zero and alpha to 1, which reduces the equation to C = A * B, an easy form to verify correctness without losing operations.

SGEMM performs O(n^3) compute operations, where n is the matrix dimension (we can assume square matrices without loss of generality). SGEMM performs O(n^2) data accesses. Therefore, the flop to byte ratio is O(n), which means SGEMM should be compute-bound when properly blocked. Naive implementations are often memory-bound, but any well-optimized implementation will be compute-bound. Due to the nature of matrix multiplication, the matrices can be blocked to fit in virtually any cache or on-chip storage (shared memory on GPUs) to find a good balance point for high performance on any architecture. A naive form is sketched below.
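The following is a minimal sketch of the naive (unblocked) form of Equation 4.1, one thread per element of C; square row-major matrices are assumed:

    __global__ void sgemmNaive(int n, float alpha, const float* A,
                               const float* B, float beta, float* C) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float acc = 0.0f;
            // O(n) unblocked loads per thread: this is why the naive form
            // tends to be memory-bound until the matrices are tiled.
            for (int k = 0; k < n; ++k)
                acc += A[row * n + k] * B[k * n + col];
            C[row * n + col] = alpha * acc + beta * C[row * n + col];
        }
    }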
SGEMM is unique in the property that an optimized implementation on any throughput-oriented architecture will be compute-bound.

SGEMM is a commonly implemented library routine with very broad applicability as a key building block in numerical linear algebra codes. The vast majority of library routines support two modes: normal-normal (NN), where the A, B, and C matrices are stored in column-major order, and normal-transposed (NT), where the A and C matrices are stored in column-major order and the B matrix is stored in row-major order. Few scientific kernels are as well understood, studied, optimized, and characterized as thoroughly as general matrix multiply. We use SGEMM as a known reference application for validation of our methodology. The Parboil implementation parameterized the code so that register tiling and shared memory tiling can be configured at compile time, which facilitates testing.

4.3 Kinetic Simulation

Kingen was developed by our collaborator, Dr. Tim Benke, a neuroscientist from the University of Colorado, Anschutz Medical Campus, to simulate AMPA receptor ion channel activity and to optimize kinetic model rate constants to biological data. Kingen uses a genetic algorithm to stochastically search parameter space to find global optima. As each individual in the population describes a rate constant parameter set in the kinetic model and the model is evaluated for each individual, there is significant computation complexity and parallelism in even a simple model run. The bottleneck in kingen is in the ordinary differential equation (ODE) solver, which is typical for scientific applications.

Numerical methods are distinguished, among other things, by stiffness, a property of the ODEs themselves. Stiff systems mainly appear in chemical kinetics, electric circuits, and control theory [90]. Stiffness measures the difficulty of solving an ODE numerically and is characterized by disparate time scales [96] (small time-steps are required for stability). Stiff systems require complex implicit methods to avoid
numerical instabilities, while nonstiff systems can be solved by relatively simple explicit methods. The system of ODE equations that describe AMPA currents is stiff, and there has been relatively little focus in acceleration research on how to optimize stiff systems of ODEs.

The most popular codes for the numerical solution of a stiff system of ODEs are based on the backward differentiation formulas (BDF). One of the most powerful methods, characterized by high accuracy and high stability for solving ODEs, is an implicit Runge-Kutta method. The Runge-Kutta 4th order method (RK4) is considered a workhorse in scientific engineering applications, providing accurate and stable solutions to stiff systems of ODEs common in computational chemistry. This method is the selected solver within our simulation framework. The 4th order Runge-Kutta formulas are given below:

    x(t + h) = x(t) + 1/6 (F1 + 2 F2 + 2 F3 + F4)                      (4.2)

where

    F1 = h f(t, x)                                                     (4.3)
    F2 = h f(t + h/2, x + F1/2)                                        (4.4)
    F3 = h f(t + h/2, x + F2/2)                                        (4.5)
    F4 = h f(t + h, x + F3)                                            (4.6)
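A minimal sketch of one RK4 step as CUDA device code; NSTATES and deriv() are illustrative stand-ins for the kinetic model, and the h factor of equations 4.3 through 4.6 is folded into the final update:

    #define NSTATES 16   // illustrative model size

    // Stand-in for the kinetic-model equations dx/dt = f(t, x).
    __device__ void deriv(float t, const float* x, float* dxdt) {
        for (int s = 0; s < NSTATES; ++s) dxdt[s] = -x[s];
    }

    __device__ void rk4Step(float t, float h, float* x) {
        // These per-thread arrays are exactly the local-memory traffic
        // discussed in the RK4 analysis of Chapter 6.
        float f1[NSTATES], f2[NSTATES], f3[NSTATES], f4[NSTATES], tmp[NSTATES];
        deriv(t, x, f1);
        for (int s = 0; s < NSTATES; ++s) tmp[s] = x[s] + 0.5f * h * f1[s];
        deriv(t + 0.5f * h, tmp, f2);
        for (int s = 0; s < NSTATES; ++s) tmp[s] = x[s] + 0.5f * h * f2[s];
        deriv(t + 0.5f * h, tmp, f3);
        for (int s = 0; s < NSTATES; ++s) tmp[s] = x[s] + h * f3[s];
        deriv(t + h, tmp, f4);
        for (int s = 0; s < NSTATES; ++s)
            x[s] += (h / 6.0f) * (f1[s] + 2.0f * f2[s] + 2.0f * f3[s] + f4[s]);
    }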
4.4 Characterization Summary

Many challenges face researchers in this field. One is that every tool provides different variations of similar metrics, and care must be taken to make sure the granularity and unit of the reported data is understood and used appropriately. This problem is demonstrated here in the very different suites of metrics for Maxwell versus Fermi, two architectures from the same family.

Table 4.2 demonstrates the challenge domain scientists face with simulation code that models real-world phenomena, as opposed to synthetic benchmarks. RK4 is the only kernel that is using all the available registers, indicating the Parboil benchmarks have no register pressure, probably because the footprint of the code is small enough. In addition, RK4 launches far fewer threads than stencil yet executes approximately 15x more instructions. This means RK4 threads are executing several orders of magnitude more instructions per thread than the stencil benchmark. RK4 is launching more threads than SGEMM, but the difference in thread count doesn't come close to accounting for the difference in instructions executed. The question is: would optimizing single-threaded performance on fewer but more complex processing cores be a better fit than the massively parallel but simple processing cores of standard graphics hardware?

Table 4.2: Various static metrics across each benchmark.

    Metric                            SGEMM      Stencil       RK4
    regs/thread                       47         25            63
    shared memory per thread (bytes)  4          0             0
    threads launched                  1,280      131,072       20,352
    warps launched                    40         4,096         636
    dynamic instructions executed     103,240    24,753,000    374,047,500

Table 4.3, extracted from [159], shows a brief summary of the major architectural features stressed by each benchmark in its unoptimized and optimized forms. One of the reasons we focused on the Parboil benchmark suite is because it provides
optimized and unoptimized code for each benchmark, which we can use to evaluate the analysis framework. Figure 4.2 clearly shows that the RK4 kernel from the kingen application has a very different profile from the other benchmarks, with significant convert instructions and fewer integer instructions. A challenge with numerical simulations in practice is knowing, before significant development efforts are spent, if the simulation equations will fit into limited GPU resources and how much data can be moved on-chip to closer memory for significant performance gains.

Table 4.3: Architecture Stresses of Benchmarks Before and After Optimization

    Benchmark    Unoptimized Bottleneck    Optimizations Applied    Optimized Bottleneck
    stencil      locality                  coarsening, tiling       bandwidth
    sgemm        bandwidth                 coarsening, tiling       instruction throughput

Figure 4.2: Instruction mix measured on Maxwell for scientific kernels.
5. Hardware and Architecture Characterization

This chapter describes the architectures that are evaluated in this thesis. The purpose is to understand the performance implications of primary architectural features and to compile the hardware specifications the analysis framework requires as input. In the next section, Section 5.1, what it means for an architecture to be throughput-oriented is defined. The implications throughput-centric design decisions have on performance and on how code executes on the machine are explained. Section 5.2 describes NVidia's Fermi and Maxwell architectures and Section 5.3 describes Intel's Xeon Phi Coprocessors. These three platforms are used as industry recognized examples of throughput oriented architectures and demonstrate that the concepts applied in the analysis framework are generally applicable to all throughput-oriented systems.

5.1 Throughput Oriented Architectures

The key to high performance on any architecture is to reduce latency or increase throughput while tolerating latency. Latency measures the amount of time to complete a task and throughput is the total amount of work completed per unit time [61]. The traditional commodity CPU is the exemplar of latency-oriented architectures that place a premium on reducing latency of a single thread or task. The traditional commodity GPU is the exemplar of throughput-oriented architectures that place a premium on maximizing the total amount of work that can be completed within a given amount of time. Architectural design tends to focus on one over the other, and throughput oriented systems trade single-threaded speed for increased throughput over all threads. The trade-offs throughput-oriented architectures make have important implications for understanding performance of scientific applications running on many-core and GPU platforms.

Throughput-oriented systems employ three key architectural features to hide latency and increase throughput: hardware multithreading, many simple processing units, and SIMD execution [61]. Simple cores tend to execute instructions in
order and avoid speculative execution and branch prediction. Out-of-order execution, speculative execution and branch prediction are the very techniques latency-oriented systems use to speed up single-threaded performance. Single instruction multiple data (SIMD) execution is attractive for throughput oriented systems because it reduces die area dedicated to control structures, which makes room for more compute units.

Graphics architectures are known to be very latency tolerant, which means any given thread may experience longer latency than what is acceptable on latency-oriented CPUs, but the latency is hidden by switching in the next clock to another thread with ready operands. Latency tolerance implies processor utilization does not drop just because some threads are blocked on long latency operations [61]. This, in turn, implies that at any given moment there should be some hardware resource running near full rate, and this key observation is under-appreciated in performance oriented research. The methodology described in the analysis framework is based on this principle.

5.2 NVidia Discrete GPU Architecture

Graphics processing units (GPUs) have evolved over the last decade from dedicated fixed-function 3D graphics processors to programmable massively parallel processing architectures useful for traditional graphics and for high-performance computation. The GeForce 8800 was launched in 2006 and was the first unified graphics and computing architecture released by NVidia with compute unified device architecture (CUDA), software technology that enabled NVidia GPUs to be programmed with high level programming languages. However, general consensus is that Fermi, the third generation of compute capable NVidia GPUs, was the first graphics architecture to support a complete set of features for scientific computing.
Table 5.1: NVidia Architectures

    Year    Microarchitecture    Compute Capability    Compute Generation
    2006    GeForce 8800         1.0                   1st Gen
    2008    Tesla                1.0-1.3               2nd Gen
    2010    Fermi                2.0-2.1               3rd Gen
    2012    Kepler               3.0, 3.2, 3.5, 3.7    4th Gen
    2014    Maxwell              5.0, 5.2-5.3          5th Gen
    2016    Pascal               10x Maxwell?          6th Gen
    2018    Volta                ?                     7th Gen

A brief summary of NVidia microarchitectures is listed in Table 5.1. Generation one begins with the introduction of general purpose compute capability, and subsequent generations count up from there. Table 5.2 shows basic features of each of the major compute capability versions. The lack of double precision support was seen as a major disadvantage for some scientific applications and was added with compute capability 1.3.

Table 5.2: Compute Capabilities on NVidia Hardware

    compute_10: Basic features for general purpose compute
    compute_11: Adds atomic operations on global memory
    compute_12: Adds atomic operations on shared memory and vote instructions
    compute_13: Adds double-precision floating point
    compute_20: Adds support for Fermi
    compute_30: Adds support for Kepler
    compute_50: Adds support for Maxwell
Table 5.3 compares Fermi with Maxwell on key features of compute and memory capability. Peak rates in Table 5.4 are derived from these hardware specifications. Instruction throughputs are often given in giga-flops-per-second, or GFLOPS. A flop is defined as either an addition or multiplication of single (32-bit) or double (64-bit) precision numbers.

Table 5.3: NVidia Core Architecture

    Feature                                  Fermi (C2050)    Maxwell (GTX 960)
    Compute Capability                       2.0              5.0/5.2
    CUDA Cores                               448              1024
    Number of SM per GPU                     14               8
    Number of SP per SM (CUDA cores per SM)  32               128
    Number of SFU per SM                     4                64
    Number of LD/ST per SM                   16               32
    Warp Schedulers per SM                   2                4
    Max Processor Clock (GHz)                1.15             1.18
    Memory Clock (GHz)                       1.5              5.4
    Bus Type - Memory Interface              GDDR5            GDDR5
    Memory Interface Width (bits)            384              128
    Theoretical Memory Bandwidth (GB/s)      115              112
    L1 bytes per SM                          65,536           24,576
    L1 bytes per GPU                         917,504          196,608
    Number of Shared Memory Banks            32               32
    L2 Cache Capacity (KB)                   768              1,048
Table 5.4: Theoretical Peaks on NVidia Architecture

    Metric                          Fermi (C2050)    Maxwell (GTX 960)
    Peak SP FLOPs (GFlop/s)         1030             2308
    Peak DP FLOPs (GFlop/s)         515              72
    Peak SP FLOPs per SM (GFlop/s)  74               302
    Peak Memory with ECC on (GB/s)  115.2            112.2

Execution limits in NVidia GPUs are described in Table 5.5. These limits represent trade-offs between occupancy and resource utilization that have to be balanced with the characteristics of the workload. Notice the difference in double precision peak GFlop/s between Fermi and Maxwell (515 compared to 72). The Maxwell architecture reduced focus on scientific computing in favor of more efficient execution of traditional 3D graphics. Occupancy is usually defined in terms of thread occupancy, the ratio of active threads to the maximum supported per SM. Occupancy can be limited by the resource usage of each thread block, specifically with respect to registers and shared memory. Here is an example using Fermi limits to illustrate: if each thread in a thread block loads 12 single-precision floats in shared memory, then each thread is using 48 bytes of shared memory (12 floats * 4 bytes each). With a thread block size of 256 threads per block, each block is using 12,288 bytes of shared memory (256 threads * 48 bytes per thread). The number of active thread blocks will be limited to four per SM (49,152 max shared memory bytes per SM / 12,288 bytes per block). Four blocks * 256 threads per block is 1,024 active threads per SM out of 1,536 maximum possible. In other words, occupancy is limited to 67% (1,024/1,536) because of shared memory utilization. The same reasoning can be applied to the number of registers.
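Newer CUDA toolkits automate this same calculation through an occupancy API; the following is a minimal sketch, where the kernel is a hypothetical stand-in:

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void exampleKernel(float* v) { v[threadIdx.x] += 1.0f; }

    int main() {
        int blocksPerSM = 0;
        // 256 threads per block, 12,288 bytes of dynamic shared memory per
        // block: the hand calculation above, done by the runtime.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, exampleKernel, 256, 12288);
        printf("active blocks per SM: %d\n", blocksPerSM);
        return 0;
    }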
Table 5.5: Execution Limits in NVidia GPUs

    Limit                                   Fermi     Maxwell
    Max Active Threads per SM               1,536     2,048
    Max Active Warps                        48        64
    Threads per Warp                        32        32
    Max Active Blocks per SM                8         32
    Max Threads per Block                   1,024     1,024
    Max 32-bit Registers per SM             32,768    65,536
    Max Registers per Block                 23,768    65,536
    Max Active Threads per GPU              21,504    16,384
    Max Registers per Thread                63        255
    Shared Memory per SM (bytes)            49,152    96,000
    Shared Memory per Thread Block (bytes)  49,152    49,152

Table 5.6 lists essential hardware features. The number of processing cores equals number_SM_per_GPU * number_cores_per_SM. Instruction throughput in giga-instructions per second equals max_freq_processing_cores (GHz) * number_processing_cores. Peak FP32 throughput (GFlop/s) equals max_freq_processing_cores (GHz) * number_processing_cores * number_flops_per_processing_core. Peak FP64 throughput is calculated as 1/2 the SP rate for Fermi, and 1/32 the SP rate for Maxwell.
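A sketch of that arithmetic using the Fermi C2050 numbers:

    #include <stdio.h>

    int main() {
        int sms = 14, cores_per_sm = 32;
        double ghz = 1.15;
        int cores = sms * cores_per_sm;    /* 448 processing cores           */
        double ginst = ghz * cores;        /* 515 GInst/s                    */
        double fp32 = 2.0 * ghz * cores;   /* 1030 GFlop/s (FMA = 2 flops)   */
        double fp64 = fp32 / 2.0;          /* 515 GFlop/s; Maxwell: fp32/32  */
        printf("%d cores, %.0f GInst/s, %.0f SP GFlop/s, %.0f DP GFlop/s\n",
               cores, ginst, fp32, fp64);
        return 0;
    }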
Table 5.6: Essential Hardware Features and Throughputs on Fermi and Maxwell

    Feature                             Fermi (C2050)      Maxwell (GTX 960)
    # Symmetric Multiprocessors         14                 8
    Cores per SM                        32                 128
    Core Clock (MHz)                    1150               1279
    Memory Clock (MHz)                  1500               3505
    Instruction Throughput (GInst/s)    515                1206
    Peak FP32 Throughput (GFlop/s)      1030               2412
    Peak FP64 Throughput (GFlop/s)      515                75
    Shared Memory (GB/s)                1030               1206
    DRAM (GB/s, w/ ECC on)              144                112
    Ideal Ratio Instructions:Memory GB  515/112 = 4.5:1    1206/112 = 10.7:1
    ld/st unit pressure                 16/32 = 1/2        32/128 = 1/4
    DP performance (fraction of SP)     1/2                1/32

Table 5.7 lists instruction throughputs from NVidia manuals for Fermi [6] and Maxwell [11]. The throughputs are given in terms of operations per clock cycle per SM. These rates are used with the kernel's instruction mix profile to determine if the theoretical peak instruction throughput rate should be lowered due to the kernel's
instruction mix. The instruction mix gives the percent of each instruction type in the kernel. For example, the instruction mix for the SGEMM base implementation with the medium input set size (see Figure 4.2) is FP32: 5%, Int: 70%, control flow: 1%, Ld/st: 9%, and Misc: 15%. The weighted average is calculated to determine if the theoretical compute peak is limited by the kernel's instruction mix. Using Maxwell as an example, the average operations per clock cycle per SM is: 128 * 0.05 + 128 * 0.70 + 64 * 0.01 + 32 * 0.09 + 64 * 0.15 = 109. This number is used as the number of processing cores to calculate the kernel-limited peak instruction throughput. In this example, that is: 2 * 109 * 8 * 1.18 = 2057 GFlop/s. The peak SP GFlop/s on Maxwell is 2308, so the kernel's instruction mix reduces the theoretical max instruction throughput of the hardware by approximately 11%. This is an important execution difference that helps quantify how much optimization opportunity actually remains and whether higher performing instructions can be used to improve performance of compute-bound kernels.
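A sketch of this weighting as the framework applies it; the arrays follow the SGEMM example above, and the factor of two counts a multiply-add as two flops:

    /* Weight per-clock instruction throughputs (Table 5.7) by the measured
       instruction mix to get a kernel-limited compute ceiling. */
    double kernel_limited_peak_gflops(const double* ops_per_clock_per_sm,
                                      const double* mix_fraction,
                                      int types, int sms, double ghz) {
        double weighted = 0.0;
        for (int t = 0; t < types; ++t)
            weighted += ops_per_clock_per_sm[t] * mix_fraction[t];
        return 2.0 * weighted * sms * ghz;
    }

    /* SGEMM base, medium input, on Maxwell (FP32, Int, CF, Ld/St, Misc):
         double tput[] = {128, 128, 64, 32, 64};
         double mix[]  = {0.05, 0.70, 0.01, 0.09, 0.15};
         kernel_limited_peak_gflops(tput, mix, 5, 8, 1.18)  ~= 2057 GFlop/s */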
Table 5.7: Instruction Throughputs on Fermi and Maxwell

    Instruction Type                          CC 2.0 (Fermi)    CC 5.0 (Maxwell)
    32-bit FP add, mult, mult-add             32                128
    64-bit FP add, mult, mult-add             16                1
    32-bit FP reciprocal, square root, exp    4                 32
    convert instructions                      16                32
    32-bit int add, subtract                  32                128
    control flow, compare, min, max           32                64
    load/store                                16                32

5.2.1 Fermi

Fermi is the third generation of compute-capable GPUs and represents a big leap over early generations, as detailed in Table 5.8. The primary differences in Fermi over previous generations are that memory is accessed in groups of 32 threads (compared to 16) to match instruction issue width; the addition of an L1 cache in each SM (configurable as 16 KB L1 / 48 KB shared memory or 48 KB L1 / 16 KB shared memory); a global 768 KB L2 cache per chip; and dual-issue, which means instructions from two different warps can be issued to two different pipes. Dual-issue on Fermi requires at least two active warps to hit peak throughputs.

Figure 5.1 is an illustration of a Fermi-based GPU [6]. In this example, there are 16 SMs. The C2050 cards in the Hydra lab have 14 SMs, and all specific Fermi analysis
in this thesis assumes the Hydra configuration. Beginning with Fermi, NVidia GPUs use a small L1 cache for register spilling and for small local arrays that must be stored in global memory because the compiler can't use register addressing if array indices are unknown. Earlier generation GPUs suffered from performance cliffs: an application could fall over with a moderate increase in register usage from a developer's perspective. The L1 cache was designed to mitigate spill/fill issues. Local memory is called 'local' because it is private to individual threads, but local memory resides in the same area as global memory, with the same long latencies associated with DRAM access. If a significant percentage of DRAM or L2 accesses are due to local memory, these are likely spills/fills that are not doing productive work from the kernel's perspective and can decrease performance by putting increased pressure on memory.
Table 5.8: Summary of Features on First Three Generations of NVidia GPUs [6]

    GPU                                 G80                  GT200                Fermi
    CUDA cores                          128                  240                  512
    DP floating point capability        none                 30 FMA ops/clock     256 FMA ops/clock
    SP floating point capability        128 MAD ops/clock    240 MAD ops/clock    512 FMA ops/clock
    Special Function Units (SFUs)/SM    2                    2                    4
    Warp schedulers per SM              1                    1                    2
    Shared Memory (KB) per SM           16                   16                   Configurable 48 KB or 16 KB
    L1 Cache per SM                     None                 None                 Configurable 16 KB or 48 KB
    L2 Cache                            None                 None                 768 KB
    ECC Memory Support                  No                   No                   Yes
    Concurrent Kernels                  No                   No                   Up to 16
    Load/Store Address Width            32-bit               32-bit               64-bit

A larger L1 can improve performance when spilling registers or with misaligned, strided access patterns. If there is a load hit in L1, there is no bus traffic to L2 and memory. If there is a load miss in L1, 128 bytes per miss are generated. A cache line request is serviced at the throughput of L1 or L2 in case of a cache hit, or at
the throughput of device memory otherwise.

In single precision floating point operations (fp32), all global memory accesses are four-byte words. A warp has always consisted of 32 threads on NVidia hardware, and warps are executed together as single instruction, multiple thread (SIMT). SIMT differs from SIMD in that SIMT is scalar with no set vector width. SIMT allows each thread to take its own code path, and branching is handled by the hardware. Though branching is legal, there can be a performance penalty, as the execution for all threads taking each path is serialized in the hardware. The reality is, performance will likely suffer if threads are not scheduled and executed with powers of two to fill each warp, in much the same way as empty SIMD lanes hurt performance in vector architectures. The difference is really in how SIMT code versus SIMD code is programmed and less about how the hardware executes a vector of threads or a vector of data elements.
Figure 5.1: Figure from NVidia's whitepaper for Fermi [6]. Fermi has 32 processing units per SM. All SMs are backed by a large L2 cache. The smaller blocks in light blue along the edge of the frame are the L1/shared memory caches, which are allocated per SM.

Figure 5.2 is a close-up of one symmetric multiprocessor [6].
Figure 5.2: Figure from NVidia's whitepaper on Fermi [6]. Close-up of one SM shows the 32 processing cores per SM, 4 SFU units, and 16 LD/ST units.

5.2.2 Maxwell

Maxwell was released in 2014 and is NVidia's 5th generation GPU capable of supporting general purpose compute. Tables 5.3 and 5.5 highlight the basic changes in important architectural features between Fermi and Maxwell. The primary features to note are:
- The maximum number of active thread blocks per SM quadrupled to 32. This can improve occupancy for kernels with small thread blocks of 64 threads or less (assuming available registers and shared memory are not the occupancy limiter).

- Maxwell redesigned the symmetric multiprocessor to align better with warp size; the number of CUDA cores was reduced with respect to Kepler to a power of 2, making it easier to utilize the SM efficiently [11].

- Maxwell has 64 KB of dedicated shared memory per SM, unlike Fermi and Kepler, which partitioned the 64 KB of memory between L1 cache and shared memory. Functionality of the L1 and texture caches was merged into a single unit. The per-thread-block shared memory limit remains 48 KB in Maxwell, but the increase in total available shared memory can lead to occupancy improvements [11].
Figure 5.3: Figure from NVidia's whitepaper on Maxwell [11]. Each SM contains four warp schedulers, and each warp scheduler is capable of dispatching two instructions per warp every clock. Maxwell aligns with warp size by being partitioned into four distinct 32-core processing blocks, with 128 cores total per SM.
5.3 Intel Xeon Phi Coprocessor Architecture

The performance optimization methodology proposed in this thesis depends on a thorough characterization of the hardware and key theoretical peak limits, which are described here for the Xeon Phi Coprocessor. Throughput oriented architectures depend on efficient utilization of hardware resources, and developers need a way to measure utilization to guide performance optimization. Intel has published key efficiency and analysis metrics to determine how well an application is utilizing available resources. Keys to performance on Xeon Phi Coprocessors are to express sufficient parallelism, vectorize efficiently, and hide I/O latency, and parallel code must scale with the number of cores. This means the number of cores, threads, and SIMD vector operations must be used effectively for the best performance. All events, metrics, equations, and heuristic guidelines described in this thesis are extracted from Intel developer documentation [33].

Figure 5.4 illustrates Xeon Phi Coprocessors with 50 or more simplified coprocessor cores (depending on configuration), four threads per core, 512-bit registers for SIMD operations (vector ops), vector processing units (VPUs) with 512-bit vector operations on 16 SP or 8 DP floating point arithmetic operations, 32 KB data and instruction L1 caches, and each core with a 512 KB L2 cache shared among four threads.
Figure 5.4: Intel's Xeon Phi Coprocessor Block Diagram. Figure from [10].

Measuring efficiency in terms of floating-point operations, as done in the NVidia analysis, is convenient as it's easily compared to the peak floating-point performance of the machine. However, the Intel Xeon Phi Coprocessor does not have events to count floating-point operations. The alternative is to measure the number of vector instructions executed [33]. Most vector instructions have four-cycle latency and single-cycle throughput, so four threads should be used to hide vector unit latency [142]. When a vector operation on two full vectors is performed, the vpu_elements_active event is incremented by 16 for single precision or 8 for double precision. The rule of thumb is to take the ratio of vpu_elements_active to vpu_instructions_executed. If it approaches 8 or 16, the loop is well vectorized. Vectorization intensity can't exceed 8 for double-precision code or 16 for single-precision code.

Another way to measure processing efficiency is with cycles per instruction. Table 5.9 lists the minimum CPI the machine is capable of depending on how many hardware threads are used. If measured CPI is close to the theoretical minimum, it
is an indication the kernel is compute-bound and optimizations should focus on areas that improve compute efficiency.

Table 5.9: Minimum Theoretical CPIs

    Hardware Threads per Core    Best Theoretical CPI per Core    Best Theoretical CPI per Thread
    1                            1.0                              1.0
    2                            0.5                              1.0
    3                            0.5                              1.5
    4                            0.5                              2.0

The Intel Xeon Phi Coprocessor supports up to four hardware threads on each core. The pipeline can only issue up to two instructions per cycle. The additional threads (traditional Xeon processor pipelines support two hardware threads and issue four instructions per cycle) are there to cover latency of the executing threads. Xeon Phi cores are simpler than their Xeon host counterparts: they run at about a third the speed of Intel Xeon processors and operate on instructions in-order (instructions must wait for previous instructions to complete before they can execute), so it's important to be able to switch to other threads with ready operands when a thread stalls. The Coprocessor will not issue instructions from the same hardware thread in back-to-back cycles. This means that, in order to achieve the maximum issue rate, at least two hardware contexts (hardware threads) must be running. In contrast, the processors in NVidia cards are much smaller and focus on floating point operations and single instruction, multiple thread execution. In general, Xeon Phi handles code with lots of branches and control statements better than NVidia processors. NVidia runs best with SIMT code with no control statements.
The theoretical aggregate memory bandwidth available on Xeon Phi Coprocessors is 352 GB/s, but internal limitations (the ring interconnect) limit achievable bandwidth to approximately 140 GB/s. The ideal instruction to byte ratio on Xeon Phi Coprocessors is six. If the measured instruction to byte ratio is less than six, the kernel is memory bound; if the measured ratio is more than six, the kernel is compute bound.

The primary difference between developing for NVidia cards versus for Xeon Phi Coprocessors is the programming model. NVidia uses CUDA, their proprietary framework for developing GPGPU applications on NVidia graphics cards. CUDA uses C/C++ extensions, and code must be compiled with NVidia compilers to integrate host C/C++ code with CUDA code that will be off-loaded and executed on the graphics device. Xeon Phi uses the x86 programming model and ecosystem that has a long history in desktop computing. MPI, OpenMP, Cilk, and Threading Building Blocks (TBB), to name a few, can be used to off-load computation onto the coprocessors from the host.

5.3.1 Xeon Phi Performance Tuning

The three factors that most influence performance on Xeon Phi Coprocessors are scalability, vectorization, and memory utilization. These map almost directly to the latency, compute-bound, and memory-bound spaces already described in the context of NVidia architectures.

The Xeon Phi Coprocessors are like discrete graphics cards in that they are connected to the host via PCIe and are a peripheral attachment. The programming model is heterogeneous in nature, with a host that runs on Xeon processors (more like traditional CPUs) and a device (the coprocessors) that runs simpler cores and a different OS. The host either off-loads computation tasks to the coprocessors or code can be compiled to run on the coprocessors natively. The native model involves transferring files from the host to the coprocessors so they can be run directly from the
device. The native execution model is explored in this thesis since the coprocessors are throughput oriented and designed for parallel applications. The heterogeneous nature of off-loading from a host is beyond the scope of this work.

The challenge in porting code that ran in a CUDA programming model, or distributed on a cluster with MPI, to Xeon Phi Coprocessors is that the code must be vectorized to fully utilize the machine. NVidia's programming model is SIMT (single instruction multiple thread), which means each scalar thread can operate independently. Vector-based architectures use SIMD (single instruction multiple data), which means data items represent one element in the vector, with limited flexibility in terms of control flow and code complexity for vectorized sections of code. Any given thread can operate on up to 16 vector elements at a time on Xeon Phi Coprocessors. The difference between SIMD and SIMT requires non-trivial code restructuring to map N chromosomes on N threads to X * S chromosomes on X threads, each processing S vector elements, where N = X * S. Spawning threads to different cores can be done with OpenMP, MPI, TBB, Cilk, etc., and choosing which model is most appropriate is another challenge; a sketch of the restructuring follows.
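A minimal sketch of that restructuring, assuming OpenMP 4.0-style pragmas; evaluate_all and its body are illustrative stand-ins, not the Kingen code, and n is assumed to be a multiple of 16:

    // N chromosomes become (N / 16) iterations of 16 SP vector lanes each,
    // spread across cores by the parallel-for (N = X * S with S = 16).
    void evaluate_all(const float* genes, float* fitness, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i += 16) {
            #pragma omp simd
            for (int lane = 0; lane < 16; ++lane)
                fitness[i + lane] = 2.0f * genes[i + lane];  // stand-in model
        }
    }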
Fortunately, the same optimization strategy described in the analysis framework applies as it does for all throughput-oriented architectures. Briefly, the analysis algorithm breaks the optimization space into three categories: memory, compute, and latency. Defined metrics are measured within each space to determine which are limiting performance. The optimization search space for each iteration is reduced to one of the three. Optimization strategies within each are selected for evaluation. The results are compared to theoretical peaks of the machine. If within 20% of a roofline limit, whether the max limit of the hardware or a modulated limit imposed by the kernel, most performance improvements have been found and optimization is complete.

6. Analysis and Results

6.1 Fermi Optimization Analysis and Results

In this section, each of the three scientific kernels selected for study is evaluated following the methodology described in Chapter 3. The Parboil benchmark [159] provided base CUDA implementations and optimized CUDA implementations for evaluation on NVidia hardware. Known and well-characterized kernels were profiled to validate the optimization methodology used in this framework. The Parboil benchmark suite also provided several datasets for each scientific kernel that span a range of input sizes. For example, the sgemm benchmark includes small and medium input datasets. In small, matrices of size 128x96 and 96x160 are multiplied together. In medium, matrices of size 1024x992 and 992x1056 are multiplied. The range of inputs enables exploration of how the algorithm scales with input size, and smaller subsets are useful when faster runtime facilitates experimentation and study.

6.1.1 SGEMM Analysis on Fermi

The first decision in the analysis algorithm is to determine if the CUDA base implementation of SGEMM with the small input size is compute-bound, memory-bound, or latency-bound. The metrics listed in Table 6.1 are the key analysis metrics used by the framework. The first section has memory-bound related metrics and the instruction to byte ratios that indicate if a kernel is memory-bound, compute-bound, or latency-bound. The second section has instruction-bound related metrics, and the third section has latency-bound metrics. This benchmark reports an 88% L2 cache hit rate, which is high, so it makes sense to look at L2 counters instead of DRAM counters (accesses to L2 are still expensive compared to arithmetic) to determine how the benchmark's byte to flop ratio compares with the hardware ideal. According to the literature in Section 3.1, this benchmark is instruction throughput limited. However, the kernel self-reports about 30 GFLOP/s, which is far from the 1030 GFLOP/s peak, which means the benchmark is not limited by compute.
Table 6.1: SGEMM Base CUDA Implementation with Small Input

    Metric                               Value
    Fermi Ideal Instr:Byte Ratio         4.5
    wall time (usec)                     80.62
    Instr:Byte Ratio (DRAM)              122
    Instr:Byte Ratio (L2)                11
    DRAM Throughput                      3 GB/s
    DRAM Percent of Peak                 3%
    L2 Throughput                        32 GB/s
    L2 Hit Rate                          88%
    L1 Local Load Hit Rate               0%
    L1 Global Load Hit Rate              86%
    Num Trans per Load Request           1.06
    Serialization Impact                 30%
    Percent Instr due to Reg Spilling    0%
    Shared Memory Bank Conflicts         0%
    Local Memory Percent of Instr        0%
    Percent Divergent Branches           0%
    Instr Percent of Peak                49%
    GFLOP/s As Reported by Kernel        30
    Thread Occupancy                     100%
    Achieved Occupancy                   64%
    IPC                                  0.49
This example underscores three important observations. One, kernel behavior can impose execution limits through the instruction mix and other resource usage. We use a weighted average of the instruction mix as described in Section 3.1 to adjust the instruction throughput roofline limit to comprehend kernel imposed limits. Two, the compute throughput quoted by many researchers in the field is usually GFlop/s and is approximated by counting the number of operations in the source code, which represents a lower bound. The translation to assembly code invariably results in more operations than are obvious through source code inspection. Comparing GFlop/s is useful to determine if an optimization has had an impact by comparing one variant with another, but it doesn't indicate how much more optimization opportunity remains. How much optimization remains can be quantified by comparing the limiting throughput as a fraction of theoretical peak. Three, if the kernel is not close to either the memory bandwidth peak or the compute throughput peak, the kernel is probably latency bound and/or is not stressing the full capability of the hardware, which changes the balance of the machine. This is the case with the small input set for the base CUDA implementation, and we will focus on larger input sets to fill the machine.

Table 6.2 is the baseline SGEMM implementation on Fermi using larger-sized input matrices. SGEMM on the medium input dataset went from memory-bound in the base implementation to compute-bound in the optimized version. This SGEMM implementation is typical of most in that the benchmark reports GFlop/s as a lower bound on performance. The framework methodology also measures time to execute to compare optimization effects. Both are used for SGEMM as two different data points which essentially measure the same thing.
Table 6.2: SGEMM Base CUDA Implementation with Medium Input

    Metric                               Value
    Fermi Ideal Instr:Byte Ratio         4.5
    wall time (usec)                     35960
    Instr:Byte Ratio (DRAM)              28
    Instr:Byte Ratio (L2)                4.19
    DRAM Throughput (GB/s)               16
    DRAM Percent of Peak                 14%
    L2 Throughput (GB/s)                 106
    L2 Hit Rate                          89%
    L1 Local Load Hit Rate               0%
    L1 Global Load Hit Rate              55%
    Num Trans per Load Request           1.01
    Serialization Impact                 32%
    Shared Memory Bank Conflicts         0%
    Percent Instr due to Reg Spilling    0%
    Local Memory Percent of Instr        0%
    Percent Divergent Branches           0%
    Instr Percent of Peak                58%
    GFLOP/s As Reported by Kernel        59
    Thread Occupancy                     100%
    Achieved Occupancy                   93%
    IPC                                  0.58
The automated framework recommends memory optimizations on the base implementation with the medium sized input and instruction optimizations on the optimized version. This is the same approach as was taken manually and is consistent with how Parboil characterized their implementation. Parboil must have been referring to the medium dataset in their workload characterization (it is not clear in their paper), which makes sense since the small dataset is a reduced input size for study purposes and is not keeping the machine full with enough concurrently running threads.
Table 6.3: SGEMM Optimized CUDA Implementation with Medium Input

    Metric                               Value
    Fermi Ideal Instr:Byte Ratio         4.5
    wall time (usec)                     5420
    Instr:Byte Ratio (DRAM)              10.93
    Instr:Byte Ratio (L2)                5.49
    DRAM Throughput                      31.6 GB/s
    DRAM Percent of Peak                 27%
    L2 Throughput                        62.9 GB/s
    L2 Hit Rate                          66%
    L1 Local Load Hit Rate               0%
    L1 Global Load Hit Rate              1%
    Num Trans per Load Request           1.12
    Serialization Impact                 1%
    Shared Memory Bank Conflicts         0%
    Percent Instr due to Reg Spilling    0%
    Local Memory Percent of Instr        0%
    Percent Divergent Branches           0%
    Instr Percent of Peak                67%
    GFLOP/s As Reported by Kernel        396
    Thread Occupancy                     50%
    Achieved Occupancy                   48%
    IPC                                  0.67
6.1.2 Stencil Analysis on Fermi

For the Stencil kernel and Kingen's RK4, analysis will focus on the larger datasets since the smaller inputs have characteristics unique to not fully loading the machine. The base implementation of the Stencil kernel with the default (large) input size is memory-bandwidth bound; its instruction to byte ratio is lower than the ideal for the Fermi hardware and memory throughput is 76% of peak (see Table 6.4). The first round of optimization should focus on memory optimization, which is consistent with the literature, which indicates the base implementation has locality related optimization opportunities.
Table 6.4: Stencil Base CUDA Implementation with Default Input

    Metric                               Value
    Fermi Ideal Instr:Byte Ratio         4.5
    wall time (usec)                     3106
    Instr:Byte Ratio (DRAM)              3.71
    Instr:Byte Ratio (L2)                3.10
    DRAM Throughput                      87.6 GB/s
    DRAM Percent of Peak                 76%
    L2 Throughput                        104.8 GB/s
    L2 Hit Rate                          52%
    L1 Local Load Hit Rate               0%
    L1 Global Load Hit Rate              55%
    Num Trans per Load Request           1.3
    Serialization Impact                 13%
    Shared Memory Bank Conflicts         0%
    Percent Instr due to Reg Spilling    0%
    Local Memory Percent of Instr        0%
    Percent Divergent Branches           3%
    Instr Percent of Peak                55%
    GFLOP/s As Reported by Kernel        N/A
    Thread Occupancy                     100%
    Achieved Occupancy                   92%
    IPC                                  0.55
The optimized implementation of the Stencil kernel with default input is also memory-bandwidth bound; its instruction to byte ratio is lower than the ideal for Fermi hardware and memory throughput is 96% of peak (see Table 6.5). This is also consistent with the literature, which indicates the optimized implementation is memory-bandwidth bound and doesn't give suggestions for further improvement. This kernel is using memory bandwidth as efficiently as possible and there is no further optimization available. The only way to improve this kernel is to reduce the traffic to memory, since how memory is being used is very efficient. This is likely not possible due to algorithmic requirements.
Table 6.5: Stencil Optimized CUDA Implementation with Default Input

    Metric                               Value
    Fermi Ideal Instr:Byte Ratio         4.5
    wall time (usec)                     2150
    Instr:Byte Ratio (DRAM)              3.38
    Instr:Byte Ratio (L2)                3.71
    DRAM Throughput                      111.5 GB/s
    DRAM Percent of Peak                 97%
    L2 Throughput                        101.7 GB/s
    L2 Hit Rate                          36%
    L1 Local Load Hit Rate               0%
    L1 Global Load Hit Rate              1%
    Num Trans per Load Request           1.0
    Serialization Impact                 1%
    Shared Memory Bank Conflicts         0%
    Percent Instr due to Reg Spilling    0%
    Local Memory Percent of Instr        0%
    Percent Divergent Branches           0%
    Instr Percent of Peak                72%
    GFLOP/s As Reported by Kernel        N/A
    Thread Occupancy                     67%
    Achieved Occupancy                   64%
    IPC                                  0.72
The RK4 kernel evaluated 8,192 chromosomes, which in the Kingen application represent rate constant sets. The first scientific model scientists tested is referred to as model01. The results for the baseline implementation of model01 are listed in Table 6.6.

Baseline RK4 Kernel

The chromosome throughput, the rate Kingen wants to maximize, is: 8192 / 0.02168 = 377,859 chromosomes/sec. The framework analysis indicates this kernel is very memory-bound (see Table 6.6). These numbers are unusually low, which means memory is really a problem. Below 1 indicates that for every byte fetched from memory, not even a complete instruction is executed. If the instruction to byte value were 2, that would mean two instructions are executed for every byte fetched from memory. This is memory-bound because the hardware is balanced to execute three to four instructions for every byte. DRAM percent of peak bandwidth is 88%, so the implementation is, in fact, memory bound.

The framework analysis algorithm makes the choice to focus on memory-related optimizations only; any improvements in instruction throughput will make no difference to performance, as the kernel is bottlenecked at the memory interface. Two optimization opportunities are present in the measured profile metrics. One, there are many L1 transactions due to local memory misses; 37% of L1 cache requests are due to local memory accesses. Remember, local memory is used for register spilling and for local arrays that must be stored in global memory because the compiler can't resolve array indexing at compile time. Two, there are memory access pattern issues: the number of transactions per load request is 13, as compared to the expected 1-2. This means the kernel is requesting 13x more bytes from memory than are actually being used; it's an indication of poor bus utilization. To address memory access issues, the data needs to be transposed so that each thread in a warp accesses contiguous addresses in memory. It's similar to vector operations, only on NVidia hardware it's a vector of threads; the fix is sketched below.
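A minimal sketch of the coalesced layout; thetaT, numChromosomes, and numParams are illustrative names, not the Kingen identifiers:

    // Storing theta as [parameter][chromosome] instead of
    // [chromosome][parameter] lets the 32 threads of a warp read
    // consecutive addresses on each iteration (coalesced access).
    __global__ void readTheta(const float* thetaT, float* out,
                              int numChromosomes, int numParams) {
        int c = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per chromosome
        if (c < numChromosomes) {
            float acc = 0.0f;
            for (int p = 0; p < numParams; ++p)
                acc += thetaT[p * numChromosomes + c];  // stride-1 across the warp
            out[c] = acc;
        }
    }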
Table 6.6: Kingen Model01 Baseline Implementation with 8,192 Chromosomes

    Metric                               Value
    Fermi Ideal Instr:Byte Ratio         4.5
    wall time (usec)                     20920
    Instr:Byte Ratio (DRAM)              0.65
    Instr:Byte Ratio (L2)                0.80
    DRAM Throughput (GB/s)               101
    DRAM Percent of Peak                 88%
    L2 Throughput (GB/s)                 82
    L2 Hit Rate                          18%
    L1 Local Load Hit Rate               12%
    L1 Global Load Hit Rate              76%
    Num Trans per Load Request           13
    Serialization Impact                 85%
    Shared Memory Bank Conflicts         0%
    Percent Instr due to Reg Spilling    22%
    Local Memory Percent of Instr        37%
    Percent Divergent Branches           0%
    Instr Percent of Peak                11%
    GFLOP/s As Reported by Kernel        N/A
    Thread Occupancy                     33%
    Achieved Occupancy                   31%
    IPC                                  0.11

Another way to address memory pressure, regardless of what the profiled metrics say, is to reduce the number of bytes that go to global memory.
The first optimization chosen was to fix the access pattern issue. The theta parameters (variables in the RK4 solver) access pattern was transposed for coalesced access. Results of this optimization are shown in Table 6.7. These results are disappointing, since the optimization resulted in only a 2% improvement in execution time. The number of transactions per load request reduced from 13 to 10, which means performance is scaling poorly with the reduction in transactions per load request (about 8%). This indicates that the access pattern may not be the first order limiter, and plenty of opportunity in the access pattern still remains to get from 10 transactions per load request down to the expected 1. The analysis algorithm was refined based on these results to move as much data as possible to on-chip storage as the first memory-bound optimization to try. Shared memory and registers are much faster than accesses to global memory, and these hardware resources should be used as fully as possible without reducing occupancy below its performance threshold. The next optimization uses shared memory for variable reuse.

Optimization 2: Use Shared Memory for Variable Reuse

Bringing data into shared memory resulted in a 1.22x improvement over the baseline version. The kernel is still memory-bound, but the measured instruction to byte ratio continues to become more balanced. Memory access pattern is no longer an issue (variables that were non-aligned were moved into shared memory), but there is still a large number of L2 and DRAM transactions due to local memory access. The RK4 kernel uses five arrays that are the size of the number of states in the kinetic model to hold derivatives and intermediate calculations. These are not being stored as local variables in registers but are going to global memory. Early generations of NVidia architecture would only store "small" arrays in registers if array indexing could be resolved statically at compile time. Small is not quantified in NVidia manuals. The next optimization is to reduce global memory traffic from local memory access by moving three of the five data structures into local variables with register storage.


Basically, unwind the arrays so that each index is a local variable. This optimization resulted in a 1.46x performance improvement over the baseline version.

Table 6.7: Kingen Model 01 Memory Access Optimization with 8,192 Chromosomes

    Metric                               Value
    Fermi Ideal Instr:Byte Ratio         4.5
    Wall time (usec)                     20510
    Instr:Byte Ratio (DRAM)              0.64
    Instr:Byte Ratio (L2)                0.78
    DRAM Throughput (GB/s)               101
    DRAM Percent of Peak                 88%
    L2 Throughput (GB/s)                 83
    L2 Hit Rate                          16%
    L1 Local Load Hit Rate               15%
    L1 Global Load Hit Rate              72%
    Num Trans per Load Request           10
    Serialization Impact                 86%
    Shared Memory Bank Conflicts         0%
    Percent Instr due to Reg Spilling    22%
    Local Memory Percent of Instr        39%
    Percent Divergent Branches           0%
    Instr Percent of Peak                11%
    GFLOP/s As Reported by Kernel        N/A
    Thread Occupancy                     33%
    Achieved Occupancy                   32%
    IPC                                  0.11
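A minimal sketch of the "unwind the array" transformation described above, with illustrative names rather than the Kingen source. A locally indexed array may be placed in local (off-chip) memory when indexing cannot be resolved at compile time; promoting each element to its own scalar lets the compiler keep the values in registers.

    // Stand-in for the model's derivative function (assumption for this sketch).
    __device__ float deriv(int i, float y) { return y * (i + 1); }

    // Before: dym[] may be spilled to local memory.
    __device__ float step_array(float y) {
        float dym[4];
        for (int i = 0; i < 4; ++i) dym[i] = deriv(i, y);
        return dym[0] + dym[1] + dym[2] + dym[3];
    }

    // After: one scalar per element; each can live in a register.
    __device__ float step_registers(float y) {
        float dym0 = deriv(0, y);
        float dym1 = deriv(1, y);
        float dym2 = deriv(2, y);
        float dym3 = deriv(3, y);
        return dym0 + dym1 + dym2 + dym3;
    }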


Table 6.8: Kingen Model 01 Memory Access Optimization with 8,192 Chromosomes

    Metric                               Value
    Fermi Ideal Instr:Byte Ratio         4.5
    Wall time (usec)                     17175
    Instr:Byte Ratio (DRAM)              1.10
    Instr:Byte Ratio (L2)                1.27
    DRAM Throughput (GB/s)               96
    DRAM Percent of Peak                 83%
    L2 Throughput (GB/s)                 83
    L2 Hit Rate                          24%
    L2 Transactions due to Local Mem     91%
    DRAM Transactions due to Local Mem   79%
    L1 Local Load Hit Rate               24%
    L1 Global Load Hit Rate              15%
    Num Trans per Load Request           1
    Serialization Impact                 92%
    Shared Memory Bank Conflicts         0%
    Percent Instr due to Reg Spilling    9%
    Local Memory Percent of Instr        20%
    Percent Divergent Branches           0%
    Instr Percent of Peak                19%
    GFLOP/s As Reported by Kernel        N/A
    Thread Occupancy                     33%
    Achieved Occupancy                   31%
    IPC                                  0.19


Table 6.9: Kingen Model 01 Register Optimization

    Metric                               Value
    Fermi Ideal Instr:Byte Ratio         4.5
    Wall time (usec)                     14279
    Instr:Byte Ratio (DRAM)              1.25
    Instr:Byte Ratio (L2)                1.31
    DRAM Throughput (GB/s)               96
    DRAM Percent of Peak                 83%
    L2 Throughput (GB/s)                 91
    L2 Hit Rate                          35%
    L2 Transactions due to Local Mem     53%
    DRAM Transactions due to Local Mem   50%
    L1 Local Load Hit Rate               20%
    L1 Global Load Hit Rate              25%
    Num Trans per Load Request           1
    Serialization Impact                 87%
    Shared Memory Bank Conflicts         0%
    Percent Instr due to Reg Spilling    5%
    Local Memory Percent of Instr        11%
    Percent Divergent Branches           0%
    Instr Percent of Peak                20%
    GFLOP/s As Reported by Kernel        N/A
    Thread Occupancy                     33%
    Achieved Occupancy                   30%
    IPC                                  0.20


Confirmation that the optimization worked as intended is also demonstrated in the L2-transactions and DRAM-transactions-due-to-local-memory metrics; both were reduced substantially from the last optimization to this one. The kernel is still memory-bandwidth bound and still suffering from high local memory access. There are few local arrays left in the kernel, so these local memory accesses must be due to register spilling. This makes sense: the optimization moved a lot of variables into registers for the local arrays. Determining whether an optimization exists that could reduce register pressure requires analysis of the kernel. Counting simple access patterns, that is, quantifying how much each variable is used, helped here. In the last optimization, yt was a global variable passed into the kernel, and dym was made into local variables. yt is used far more frequently in the kernel, so it makes sense to spread out dym access to global memory and move yt into registers.

Optimization 4: Fix yt and dym Usage

Table 6.10 demonstrates the performance gain from moving kernel variables around to store more frequently used variables in registers and infrequently used variables in global memory, as counted from source code analysis. This optimization results in a 1.83x performance gain over the baseline version. The kernel is still memory-bandwidth bound and is still limited by register spilling and local memory use. Can we do better? How much optimization remains?

Figure 6.1 demonstrates graphically with a roofline model how much performance headroom remains for the RK4 optimization on Fermi. The roofline model is a bound-and-bottleneck model that "ties together floating-point performance, operational intensity, and memory performance in a 2D graph" [179]. Peaks are defined with hardware specifications or microbenchmarking. Operational intensity is defined by [179] as operations per byte of DRAM traffic, since off-chip memory bandwidth is seen as the constraining resource in system performance.
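As a minimal sketch of the roofline bound defined by [179]: attainable performance is the minimum of peak compute and peak DRAM bandwidth times operational intensity. The peak numbers below are placeholders, not measured values for any of the machines in this chapter.

    #include <algorithm>
    #include <cstdio>

    int main() {
        double peak_gflops = 1000.0;  // hypothetical peak GFLOP/s
        double peak_bw     = 144.0;   // hypothetical peak DRAM bandwidth, GB/s
        double oi          = 0.5;     // operational intensity, FLOPs per DRAM byte

        // Roofline: min(compute roof, bandwidth roof at this intensity).
        double attainable = std::min(peak_gflops, peak_bw * oi);
        double ridge      = peak_gflops / peak_bw;  // OI where the roofs meet

        printf("attainable %.1f GFLOP/s (%s-bound; ridge at OI = %.2f)\n",
               attainable, oi < ridge ? "bandwidth" : "compute", ridge);
        return 0;
    }

A kernel whose operational intensity lies left of the ridge point sits under the slanted bandwidth line, which is exactly where Figure 6.1 places RK4.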


Table 6.10: Kingen Model 01 Fix yt/dym Usage Optimization

    Metric                               Value
    Fermi Ideal Instr:Byte Ratio         4.5
    Wall time (usec)                     11420
    Instr:Byte Ratio (DRAM)              1.51
    Instr:Byte Ratio (L2)                1.54
    DRAM Throughput (GB/s)               92
    DRAM Percent of Peak                 80%
    L2 Throughput (GB/s)                 90
    L2 Hit Rate                          40%
    L2 Transactions due to Local Mem     69%
    DRAM Transactions due to Local Mem   68%
    L1 Local Load Hit Rate               19%
    L1 Global Load Hit Rate              30%
    Num Trans per Load Request           1
    Serialization Impact                 89%
    Shared Memory Bank Conflicts         0%
    Percent Instr due to Reg Spilling    6%
    Local Memory Percent of Instr        12%
    Percent Divergent Branches           0%
    Instr Percent of Peak                24%
    GFLOP/s As Reported by Kernel        N/A
    Thread Occupancy                     33%
    Achieved Occupancy                   31%
    IPC                                  0.24


The RK4 kernel with this scientific model has a constant challenge with memory bandwidth because the requirements of the scientific model don't fit in the on-chip resources available on Fermi. Some optimizations remain. During analysis we observed that more shared memory could be allocated without impacting occupancy, and that perhaps trading lower occupancy for more shared memory would be a positive trade-off. These can be implemented and evaluated, of course, but the point of this thesis is not to guarantee optimal performance; it is to show a systematic method for evaluating which optimizations are worthwhile and to quantify how much headroom remains, so the choice can be made whether the investment in further development is worth the potential gain.

Figure 6.1: Roofline model, defined by [179], shows the RK4 kernel is bandwidth-bound since its overhead line is on the diagonal slant. Stencil is right at the ridge point between memory-bound and compute-bound, which was observed in the analysis metrics as well. SGEMM is ultimately limited by compute. The model also illustrates how much performance opportunity remains for each.


6.2 Maxwell Optimization Analysis and Results

The analysis algorithm on Maxwell follows the same methodology as for Fermi, with some exceptions in the details to account for microarchitectural differences and differences in the profiler output. Maxwell's generation of nvprof supports more events and metrics than Fermi's, so fewer metric derivations are required, but for some derived metrics it is less clear which low-level events were the inputs. These differences are reflected in the tables given for Maxwell as compared to Fermi.

Maxwell's unified L1/texture cache is used to cache reads from global memory. The L2 cache, shared by all multiprocessors, is still used to cache accesses to local or global memory, including temporary register spills. On Fermi, the L2 transactions due to local memory and the DRAM transactions due to local memory were estimated with the l1_local_load_misses event. This event is not supported on Maxwell due to how the L1 cache has been redefined. Local memory traffic and register spills are estimated on Maxwell using the local_memory_overhead metric, which is a ratio of local memory traffic to total memory traffic between the L1 and L2 caches.

6.2.1 SGEMM Analysis on Maxwell

Table 6.11 shows the results for the baseline implementation of SGEMM on Maxwell using the medium dataset. This table lists only the metrics necessary to evaluate whether the kernel is memory-bound, compute-bound, or latency-bound. The L2 hit rate is 91%, so the analysis algorithm chooses the L2 instruction-to-byte ratio to determine if the kernel is memory-bound. Maxwell has a different balance point for instructions to bytes, and measured values less than 10.7 are probably memory-bandwidth bound.


Table 6.11: SGEMM Baseline Implementation with Medium Dataset on Maxwell. The bolded values are used to determine if the kernel is compute-bound or memory-bound. Measured instruction-to-byte ratios are compared with the hardware ideal. The L2 hit rate is high, which tells the framework to compare the instruction-to-byte ratio for L2 as opposed to the instruction-to-byte ratio for DRAM.

    Metric                            Value
    Maxwell Ideal Instr:Byte Ratio    10.7
    Wall time (usec)                  17674
    Instr:Byte Ratio (DRAM)           82
    Instr:Byte Ratio (L2)             7.2
    DRAM Throughput (GB/s)            16
    DRAM Percent of Peak              14%
    L2 Throughput (GB/s)              182
    L2 Hit Rate                       91%

Table 6.12 lists memory-bound-related metrics for SGEMM on Maxwell. Note the high utilization of the texture cache unit. This implies pressure on the memory subsystem that needs to be relieved for performance gain. The analysis algorithm favors on-chip storage if shared memory isn't being used, and this kernel is memory-bandwidth bound, so the analysis algorithm would choose evaluation of shared memory use as a first optimization. SGEMM is embarrassingly parallel and can be blocked to fit in nearly any size of near memory, which is a well-documented optimization approach for high-performing SGEMM code.

Table 6.13 lists profile metrics after columns of the matrices are loaded into shared memory. This table merges the first decision point, which determines which resource is bounding performance, with additional metrics that guide the next optimization. For readability, only a subset of relevant metrics is included; the full suite of metrics that were measured is described in Section 3.2.
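The shared-memory blocking referred to above follows the standard tiled pattern. This is a minimal sketch of that pattern, not the exact kernel measured in Table 6.13; it assumes n is a multiple of TILE so the bounds checks can be omitted.

    // Each block stages TILE x TILE tiles of A and B in shared memory so every
    // element fetched from global memory is reused TILE times.
    #define TILE 16

    __global__ void sgemm_tiled(const float* A, const float* B, float* C, int n) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < n / TILE; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();                      // tile fully loaded
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                      // done with this tile
        }
        C[row * n + col] = acc;                   // assumes n % TILE == 0
    }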


Table 6.12: SGEMM Baseline Implementation with Medium Dataset on Maxwell. Only memory-related metrics are shown since the kernel is memory-bound.

    Metric                                   Value
    Shared Memory Load Trans Per Request     0
    Shared Memory Store Trans Per Request    0
    Local Memory Load Trans Per Request      0
    Local Memory Store Trans Per Request     0
    Global Load Trans Per Request            6
    Global Store Trans Per Request           4
    Local Memory Overhead                    0%
    Load/Store Function Utilization          Low
    Texture Unit Utilization                 High

Performance is improved, and the kernel indicates it is memory-bandwidth limited according to the L2 instruction-to-byte ratio. However, the DRAM memory interface is not near peak. This underscores a few details practitioners need to be aware of that are not made clear in performance optimization research. One, a kernel is not DRAM-bound if DRAM utilization is at 24% of peak. Two, the identification of memory-bound, compute-bound, or latency-bound is commonly accepted methodology, but the need to measure each performance-limiting interface, and how to interpret the results, is not. This kernel shows high shared memory utilization and moderately high (mid) single-precision function utilization. This indicates the kernel is shared-memory-bound and close to instruction-bound. This is all consistent with SGEMM analysis in the literature. Un-optimized SGEMM is known to be memory-bound, and careful blocking techniques can shift it to instruction-bound on just about any architecture. This specific optimization is close but hasn't fully pushed into instruction-bound, which is expected.


Table 6.13: SGEMM Optimized Implementation with Medium Dataset on Maxwell

    Metric                                   Value
    Maxwell Ideal Instr:Byte Ratio           10.7
    Wall time (usec)                         2866
    Instr:Byte Ratio (DRAM)                  37
    Instr:Byte Ratio (L2)                    7.9
    DRAM Throughput (GB/s)                   25
    DRAM Percent of Peak                     23% (Low)
    L2 Throughput (GB/s)                     119
    L2 Read Hit Rate                         73%
    Shared Memory Load Trans Per Request     2
    Shared Memory Store Trans Per Request    1
    Global Load Trans Per Request            8
    Global Store Trans Per Request           4
    Load/Store Function Utilization          Low
    Texture Unit Utilization                 Low
    Texture Cache Hit Rate                   50%
    Branch Efficiency                        100%
    Shared Memory Efficiency                 198%
    Shared Memory Utilization                High
    Single Precision Function Utilization    Mid
    FLOP Efficiency (Peak Single)            29%

6.2.2 Stencil Analysis on Maxwell

The baseline implementation of the stencil kernel is memory-bandwidth bound. Whether the DRAM instruction-to-byte ratio or the L2 instruction-to-byte ratio is considered, both are well below the balance point of 10.7.


The first round of optimization should therefore focus on memory optimizations, which is consistent with what is reported in the literature. There are memory access issues, as there are 8 load transactions per request when single-precision loads should require only 1. Optimizations should seek to fix access patterns, which is also confirmed in the literature, which cites locality-related issues.

The optimized version of the Stencil kernel on Maxwell is now closer to compute-bound, though it is right on the edge of the balance point. This means the kernel is operating right at the joint of the slanted line (memory-bandwidth bound) and the horizontal line (compute-bound) in roofline models. The SP function utilization is reported by the profiler as Mid, and the efficiency of SP operations (the ratio of achieved to peak SP floating-point operations) is very low for an instruction-bound kernel. The peak efficiency reported by the profiler doesn't account for instruction mix, which lowers the theoretical upper bound, but it is still unlikely to get greater than 50%. There are several ways to interpret this. Being on the edge between memory-bound and compute-bound, finer-grained analysis might help isolate where exactly the performance limit is. The analysis in this research is based on overall execution of the kernel as a whole; the methodology is not appropriate for tuning that attempts to recover very small incremental gains. There may be some latency that is exposed, as well as instruction optimizations that may have some impact, but probably not much.

6.2.3 RK4 Analysis on Maxwell

Table 6.16 lists the results of the baseline implementation of RK4 and indicates the kernel is very instruction-throughput limited. Compute-bound implies some compute function is executing near peak. The metric that jumps out from the compute-bound set of metrics is the high double-precision function utilization. This result is very surprising, since the kernel defines all local variables as single-precision floats. The RK4 kernel is a system of ordinary differential equations.


Table 6.14: Stencil Baseline Implementation on Maxwell

    Metric                                   Value
    Maxwell Ideal Instr:Byte Ratio           10.7
    Wall time (usec)                         2970
    Instr:Byte Ratio (DRAM)                  4.87
    Instr:Byte Ratio (L2)                    2.30
    DRAM Throughput (GB/s)                   87
    DRAM Percent of Peak                     78%
    L2 Throughput (GB/s)                     185
    L2 Hit Rate                              60%
    Global Load Trans Per Request            8.55
    Global Store Trans Per Request           4
    Load/Store Function Utilization          Low
    Texture Unit Utilization                 Low
    Texture Cache Hit Rate                   50%
    Branch Efficiency                        100%
    Single Precision Function Utilization    Low
    Control-Flow Function Utilization        Low
    FLOP Efficiency (Peak Double)            1.54%
    Achieved Occupancy                       82%
    IPC                                      1.2

Many scalar constants are included in the computation, including 3.0, 4.0, 2.0, etc. All of these are considered double precision per the C standard, so the operations these literals participated in were promoted by the compiler to double precision. The first optimization chosen by the analysis algorithm is to reduce pressure on the double-precision compute unit by appending an "f" to the scalars (for example, 3.0f, 4.0f, 2.0f) so that the compiler interprets them as single-precision floats and does not promote the rest of the operands.
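The literal-promotion issue can be shown in two lines. The expression below is illustrative, not the RK4 source: an unsuffixed literal such as 2.0 has type double per the C standard, so the surrounding single-precision operands are promoted and the math runs on the scarce double-precision units.

    __device__ float promoted(float y, float h) {
        return y + h / 2.0 * y;   // 2.0 is a double: y and h are promoted
    }

    __device__ float single_prec(float y, float h) {
        return y + h / 2.0f * y;  // 2.0f keeps the expression in single precision
    }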


Table 6.15: Stencil Optimized Implementation on Maxwell

    Metric                                   Value
    Maxwell Ideal Instr:Byte Ratio           10.7
    Wall time (usec)                         2160
    Instr:Byte Ratio (DRAM)                  12.55
    Instr:Byte Ratio (L2)                    10.84
    DRAM Throughput (GB/s)                   71
    DRAM Percent of Peak                     64%
    L2 Throughput (GB/s)                     82.5
    L2 Hit Rate                              19%
    Global Load Trans Per Request            3.25
    Global Store Trans Per Request           4
    Load/Store Function Utilization          Low
    Texture Unit Utilization                 Low
    Texture Cache Hit Rate                   48%
    Branch Efficiency                        100%
    Shared Memory Efficiency                 99%
    Shared Memory Utilization                Low
    Single Precision Function Utilization    Mid
    Control-Flow Function Utilization        Low
    FLOP Efficiency (Peak Single)            2.1%
    Achieved Occupancy                       97%
    IPC                                      2.5


Table 6.16: RK4 Baseline Implementation on Maxwell

    Metric                                   Value
    Maxwell Ideal Instr:Byte Ratio           10.7
    Wall time (usec)                         10779
    Instr:Byte Ratio (DRAM)                  823
    Instr:Byte Ratio (L2)                    81
    DRAM Throughput (GB/s)                   0.079
    DRAM Percent of Peak                     0%
    L2 Throughput (GB/s)                     0.798
    L2 Hit Rate                              92%
    Global Load Trans Per Request            28
    Global Store Trans Per Request           24
    Branch Efficiency                        100%
    Single Precision Function Utilization    Low
    Double Precision Function Utilization    High

Table 6.17 demonstrates the results from this very simple optimization of appending scalars with an "f". The kernel is still compute-bound, but now the single-precision function unit reports high utilization and the single-precision FLOP efficiency is at 65%. This is approaching peak utilization. Further optimization, if desired, must continue to focus on improving instruction throughput.

The Fermi results on this kernel were memory-bandwidth limited. Maxwell is effectively hiding memory latency where Fermi could not. Part of the reason is the 4x increase in registers per thread from Fermi to Maxwell. Local arrays on Maxwell are stored in registers instead of global memory, which substantially reduces traffic to memory. It is interesting to note that the same baseline code was run on Fermi and Maxwell.


The data structures have the same memory access issue on Maxwell as on Fermi; see the 28 global load transactions per request and 24 global store transactions per request in Table 6.17. Both architectures execute a warp of 32 threads that require consecutive memory addresses across the vector of threads for best performance. However, that only applies if memory is limiting performance. In this case, compute is limiting performance, and if the effort were made to fix the memory access issue, performance would not improve. This underscores why it is critically important to understand how to interpret analysis metrics and to pay attention only to those that are relevant to the hardware that is limiting performance.

Table 6.17: RK4 Optimized Implementation on Maxwell

    Metric                                   Value
    Maxwell Ideal Instr:Byte Ratio           10.7
    Wall time (usec)                         325
    Instr:Byte Ratio (DRAM)                  528
    Instr:Byte Ratio (L2)                    49
    DRAM Throughput (GB/s)                   2.15
    DRAM Percent of Peak                     0%
    L2 Throughput (GB/s)                     23
    L2 Hit Rate                              93%
    Global Load Trans Per Request            28
    Global Store Trans Per Request           24
    Texture Cache Hit Rate                   1.3%
    Branch Efficiency                        100%
    Single Precision Function Utilization    Max
    Double Precision Function Utilization    Low
    FLOP Efficiency (Peak Single)            65%


6.3 Xeon Phi Optimization Analysis and Results

6.3.1 SGEMM Analysis on Xeon Phi Coprocessors

SGEMM was run on the Xeon Phi in native execution mode. 56 cores were used (the system has 57 cores total; 1 core was left for master OS work) with 4 threads per core, for a total of 224 threads. The baseline implementation parallelized the outer-most loop using the standard #pragma omp parallel for structure, with the matrix arrays and common variables declared as shared. This is a very common naive implementation used in the performance literature.

Table 6.18 lists the hardware events measured with Intel's VTune profiler. These measured values are inputs to the analysis metrics given in Table 6.19. The analysis methodology described in this thesis measures the instruction-to-byte ratio to make a first-order approximation of whether the kernel is compute-bound or memory-bound. However, the coprocessors do not have events to count floating-point instructions. Instead, this is measured with the number of vector instructions executed. Instruction-to-byte ratios are calculated with the L1 compute-to-data-access ratio and the L2 compute-to-data-access ratio.

The L1 and L2 compute-to-data-access ratios are analogous to the L2 and DRAM instruction-to-byte ratios we measured and evaluated on NVidia hardware. The L1 ratio calculates an average of the number of vectorized operations that occur for each L1 cache access, or an instruction-to-L1-byte ratio. The L1 heuristic indicates well-performing code should be able to achieve a ratio of computation to L1 access that is greater than or equal to the vectorization intensity, which is one data access per computation of a vector of elements. The L2 compute-to-data-access ratio is an average of the number of vectorized operations that occur for each L2 access, or an instruction-to-L2-byte ratio.
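A minimal sketch of the naive baseline described above (illustrative code, not the exact source that was measured): the outermost loop is parallelized with #pragma omp parallel for, the matrices are shared, and the inner product walks B down a column with a stride of n.

    #include <omp.h>

    // C = A * B for n x n row-major single-precision matrices.
    void sgemm_naive(const float* A, const float* B, float* C, int n) {
        #pragma omp parallel for shared(A, B, C)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < n; ++k)
                    acc += A[i * n + k] * B[k * n + j];  // strided walk of B
                C[i * n + j] = acc;
            }
    }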


Table 6.18: SGEMM Baseline Implementation on Xeon Phi

    Metric                           Value
    CPU_CLK_UNHALTED                 1,066,810,000,000
    DATA_PAGE_WALK                   387,200,000
    DATA_READ_MISS                   2,048,000,000
    DATA_READ_MISS_OR_WRITE_MISS     2,049,000,000
    DATA_READ_OR_WRITE               42,969,970,000
    DATA_WRITE_MISS                  0
    EXEC_STAGE_CYCLES                134,860,000,000
    HWP_L2MISS                       215,300,000
    INSTRUCTIONS_EXECUTED            104,670,000,000
    L1_DATA_HIT_INFLIGHT_PF1         0
    L1 Misses                        2,049,000,000
    L2_DATA_READ_MISS_CACHE_FILL     1,417,900,000
    L2_DATA_READ_MISS_MEM_FILL       299,100,000
    L2_DATA_WRITE_MISS_CACHE_FILL    0
    L2_DATA_WRITE_MISS_MEM_FILL      100,000
    L2_VICTIM_REQ_WITH_DATA          1,300,000
    LONG_DATA_PAGE_WALK              0
    VPU_ELEMENTS_ACTIVE              21,309,063,927
    VPU_INSTRUCTIONS_EXECUTED        42,784,128,352

The L2 performance heuristic indicates the ideal balance is one L2 data access for every 100 L1 data accesses. The L2 compute-to-data-access ratio for the baseline implementation of SGEMM is less than 100x the L1 compute-to-data-access ratio, which means this kernel is memory-bound; more specifically, it is L2-bound.


Efficient L1 cache blocking or reduced data access will result in a higher L2 compute-to-data-access ratio [33]. Recommended optimizations are to improve data locality for the L1 cache or restructure the code so the compiler generates more efficient vectorized code.

Table 6.19: SGEMM Baseline Implementation on Xeon Phi. Investigation may be warranted if the measured values satisfy the thresholds of the performance heuristic. Actual values are filled in under the Value column, and the performance heuristic is provided for reference in the last column.

    Metric                             Value    Performance Heuristic
    GFLOP/s                            3.4      (0% of peak)
    Average CPI per Thread             10.2     greater than 4.0
    Average CPI per Core               0.05     greater than 1.0
    L1 Compute to Data Access Ratio    0.5      less than Vectorization Intensity (0.5)
    L2 Compute to Data Access Ratio    10.4     less than 100x L1 Compute-to-Data-Access Ratio (49.6)
    Vectorization Intensity            0.5      less than 8 (DP), less than 16 (SP)
    Memory Bandwidth (GB/s)            2.77     less than 80 GB/s
    Hit Rate                           95%      less than 95%
    Estimated Latency Impact           433.9    greater than 145
    L1 TLB Miss Ratio                  0.01     greater than 1%
    L2 TLB Miss Ratio                  0.0      greater than 0.1%
    L1 TLB Misses per L2 TLB Miss      N/A      near 1
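The derived metrics of Table 6.19 can be reproduced directly from the raw events of Table 6.18. The sketch below follows the ratio definitions given above, which this chapter takes from Intel's published heuristics [33]; the exact hit-rate formula is an assumption that happens to reproduce the tabulated value.

    #include <cstdio>

    int main() {
        // Raw VTune events from Table 6.18.
        double cpu_clk      = 1066810000000.0;  // CPU_CLK_UNHALTED
        double instructions = 104670000000.0;   // INSTRUCTIONS_EXECUTED
        double vpu_elems    = 21309063927.0;    // VPU_ELEMENTS_ACTIVE
        double vpu_instrs   = 42784128352.0;    // VPU_INSTRUCTIONS_EXECUTED
        double data_rw      = 42969970000.0;    // DATA_READ_OR_WRITE
        double data_rw_miss = 2049000000.0;     // DATA_READ_MISS_OR_WRITE_MISS

        double cpi_thread = cpu_clk / instructions;     // ~10.2
        double vect_int   = vpu_elems / vpu_instrs;     // ~0.5
        double l1_cda     = vpu_elems / data_rw;        // ~0.5
        double l2_cda     = vpu_elems / data_rw_miss;   // ~10.4
        double l1_hit     = 1.0 - data_rw_miss / data_rw; // ~95%

        printf("CPI/thread %.1f, VI %.2f, L1 C:DA %.2f, L2 C:DA %.1f, "
               "L1 hit %.0f%%\n",
               cpi_thread, vect_int, l1_cda, l2_cda, 100.0 * l1_hit);
        return 0;
    }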


The first optimization of SGEMM observed that memory accesses on matrix B are not contiguous: column-major indexing leads to strided access. Simply transposing B improves the memory access issue, and the L2 compute-to-data ratio should increase [58]. Table 6.20 lists the measured values of the transposed implementation.

Table 6.20: SGEMM Transpose Optimization on Xeon Phi

    Metric                           Value
    CPU_CLK_UNHALTED                 637,820,000,000
    DATA_PAGE_WALK                   600,000
    DATA_READ_MISS                   56,000,000
    DATA_READ_MISS_OR_WRITE_MISS     74,000,000
    DATA_READ_OR_WRITE               56,086,880,000
    DATA_WRITE_MISS                  0
    EXEC_STAGE_CYCLES                134,390,000,000
    HWP_L2MISS                       16,600,000
    INSTRUCTIONS_EXECUTED            104,690,000,000
    L1_DATA_HIT_INFLIGHT_PF1         0
    L1 Misses                        74,000,000
    L2_DATA_READ_MISS_CACHE_FILL     0
    L2_DATA_READ_MISS_MEM_FILL       0
    L2_DATA_WRITE_MISS_CACHE_FILL    0
    L2_DATA_WRITE_MISS_MEM_FILL      0
    L2_VICTIM_REQ_WITH_DATA          100,000
    LONG_DATA_PAGE_WALK              0
    VPU_ELEMENTS_ACTIVE              21,320,063,960
    VPU_INSTRUCTIONS_EXECUTED        42,785,128,355
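A minimal sketch of the transpose optimization described above (illustrative, not the measured source): B is copied into BT once so the inner product walks both operands with unit stride, which the compiler can vectorize cleanly.

    #include <omp.h>

    void sgemm_transposed(const float* A, const float* B, float* C, int n) {
        float* BT = new float[(size_t)n * n];
        #pragma omp parallel for
        for (int k = 0; k < n; ++k)            // one-time transpose of B
            for (int j = 0; j < n; ++j)
                BT[j * n + k] = B[k * n + j];

        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < n; ++k)
                    acc += A[i * n + k] * BT[j * n + k];  // both unit stride
                C[i * n + j] = acc;
            }
        delete[] BT;
    }

The transpose is O(n^2) work against the O(n^3) multiply, so its cost is quickly amortized for realistic matrix sizes.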


Table 6.21 lists the analysis metrics for the SGEMM transpose implementation. The optimization did increase the L2 compute-to-data-access ratio as expected, and its value now exceeds the heuristic threshold. Several other values now indicate possible performance opportunities: the L1 compute-to-data-access ratio is now less than the vectorization intensity, the vectorization intensity itself is very low, and the estimated latency impact is high. Efficient vectorization is one of the primary performance features of Xeon Phi coprocessors; the low L1 compute-to-data-access ratio indicates computational density is low; and the estimated latency impact is a rough gauge of whether the majority of L1 data misses are hitting in L2. SGEMM is well known to require proper blocking in caches (or usage of shared memory on GPUs) for high performance. The final optimization measures Intel's Math Kernel Library (MKL) implementation of SGEMM, which underwent significant optimization effort. We use a highly optimized version to validate that the analysis metrics will correctly identify high-performing code.

Table 6.22 lists the measured metrics for the MKL version of SGEMM. Table 6.23 lists results for the analysis metrics on the optimized MKL version of SGEMM. Note that the L2 compute-to-data-access ratio and the estimated latency impact can't be given: the denominator is DATA_READ_MISS_OR_WRITE_MISS, the number of demand loads or stores that miss in the L1 cache, and no samples were measured for that event, which results in a divide-by-zero value for the ratio. No misses in the L1 cache is impressive; clearly the optimized version is blocked to fit in the L1 cache.

Scaling is an important consideration for applications that run on throughput-oriented architectures. Performance gain is primarily achieved through parallelism, and if the kernel doesn't scale with parallel resources, the machine will quickly become underutilized. Figure 6.2 demonstrates how well each implementation of SGEMM scales with the number of cores. The y-axis is wall-clock time of each implementation (lower numbers are better).


Table 6.21: SGEMM Transposed Optimization on Xeon Phi. Investigation may be warranted if the measured values satisfy the thresholds of the performance heuristic. Actual values are filled in under the Value column, and the performance heuristic is provided for reference in the last column.

    Metric                             Value    Performance Heuristic
    GFLOP/s                            6.66     (0% of peak)
    Average CPI per Thread             6.09     greater than 4.0
    Average CPI per Core               0.03     greater than 1.0
    L1 Compute to Data Access Ratio    0.38     less than Vectorization Intensity (0.50)
    L2 Compute to Data Access Ratio    288.1    less than 100x L1 Compute-to-Data-Access Ratio (38)
    Vectorization Intensity            0.50     less than 8 (DP), less than 16 (SP)
    Memory Bandwidth (GB/s)            0.36     less than 80 GB/s
    Hit Rate                           100%     less than 95%
    Estimated Latency Impact           6045     greater than 145
    L1 TLB Miss Ratio                  0%       greater than 1%
    L2 TLB Miss Ratio                  0        greater than 0.1%
    L1 TLB Misses per L2 TLB Miss      N/A      near 1

The top set of lines is the first, naive implementation of SGEMM. Measured kernel scaling and perfect scaling for comparison are shown for each of the three implementations. Figure 6.2 demonstrates that scaling for the baseline implementation and the transposed optimization begins to fall off, while the MKL version is both the fastest and will continue to scale on future architectures.


Table 6.22: SGEMM MKL Library Optimization on Xeon Phi

    Metric                           Value
    CPU_CLK_UNHALTED                 47,630,000,000
    DATA_PAGE_WALK                   0
    DATA_READ_MISS                   0
    DATA_READ_MISS_OR_WRITE_MISS     0
    DATA_READ_OR_WRITE               10,093,060,000
    DATA_WRITE_MISS                  0
    EXEC_STAGE_CYCLES                11,610,000,000
    HWP_L2MISS                       800,000
    INSTRUCTIONS_EXECUTED            16,010,000,000
    L1_DATA_HIT_INFLIGHT_PF1         80,200,000
    L1 Misses                        80,200,000
    L2_DATA_READ_MISS_CACHE_FILL     42,900,000
    L2_DATA_READ_MISS_MEM_FILL       166,600,000
    L2_DATA_WRITE_MISS_CACHE_FILL    0
    L2_DATA_WRITE_MISS_MEM_FILL     0
    L2_VICTIM_REQ_WITH_DATA          179,200,000
    LONG_DATA_PAGE_WALK              0
    VPU_ELEMENTS_ACTIVE              183,944,551,832
    VPU_INSTRUCTIONS_EXECUTED        11,419,034,257

The transpose optimization is interesting because it increased performance by approximately 2x, which is typically considered very good. However, the measured analysis metrics indicate that plenty of optimization opportunity remained. This is because the transpose optimization, though much better than the baseline version, does not get close to peak hardware utilization, which implies its performance will degrade generation over generation.


Table 6.23: SGEMM MKL Library Optimization on Xeon Phi. Investigation may be warranted if the measured values do not meet the threshold given. Measured values are provided in the Value column, and the performance threshold is given for reference in the last column.

    Metric                             Value    Performance Threshold
    GFLOP/s                            high     (percentage of peak)
    Average CPI per Thread             3.0      greater than 4.0
    Average CPI per Core               0.01     greater than 1.0
    L1 Compute to Data Access Ratio    18.2     less than Vectorization Intensity (2293)
    L2 Compute to Data Access Ratio    N/A      less than 100x L1 Compute-to-Data-Access Ratio
    Vectorization Intensity            16       less than 8 (DP), less than 16 (SP)
    Memory Bandwidth (GB/s)            2.44     less than 80 GB/s
    Hit Rate                           100%     less than 95%
    Estimated Latency Impact           N/A      greater than 145
    L1 TLB Miss Ratio                  0%       greater than 1%
    L2 TLB Miss Ratio                  0        greater than 0.1%
    L1 TLB Misses per L2 TLB Miss      N/A      near 1


Figure 6.2: Axes are log-log scale. The dotted lines indicate perfect linear scaling with cores. Deviation from the dotted lines shows the extent to which performance does not scale with additional parallel resources.

Table 6.24 shows the normalized scaling for each SGEMM optimization. The numbers are demonstrated graphically in Figure 6.2. The values are a ratio of a ratio, comparing the difference in performance over the difference in the number of cores and of the number of threads.

6.3.2 Stencil Analysis on Xeon Phi Coprocessors

Stencil was run on the Xeon Phi in native execution mode. 56 cores were used (the system has 57 cores total; 1 core was left for master OS work) with 4 threads per core, for a total of 224 threads. The baseline implementation aligns array pointers so they are 64-byte aligned (the size of a cache line). Table 6.25 lists the measured events for the baseline implementation of Stencil developed by [16]. The baseline implementation and the cache-blocked optimization used the same parameters: NX=256, NY=300, NZ=300, iterations=50, num_threads=224, HALF_LENGTH=8, n1_thread_block=256, n2_thread_block=1, n3_thread_block=124.


Table 6.24: A 100% value means performance scaled perfectly with cores and threads; for example, the number of cores doubled and performance improved 2x. 29% means the software is only taking advantage of approximately 1/3rd of the additional resources available.

                            4 to 8 cores    8 to 16 cores    16 to 32 cores    32 to 56 cores
    sgemm baseline          106%            85%              79%               55%
    sgemm transpose-opt     89%             29%              44%               29%
    sgemm sdk mkl opt       94%             100%             97%               80%

Table 6.26 provides the analysis metrics for the baseline implementation of stencil. The L1 compute-to-data-access ratio, the L2 compute-to-data-access ratio, and the L1 hit rate are all low. However, note that in theory the vectorization intensity cannot exceed 8 for double-precision code or 16 for single-precision code. If the measured vectorization intensity is much lower, the loops were not well vectorized. The Intel optimization guide [33] does indicate that care should be taken when applying these metrics to larger pieces of code: mask manipulation instructions count as vector instructions and can skew the ratio [33]. "Larger pieces of code" is not quantified further. Review of the vectorization reports generated by the compiler (with the -qopt-report=5 and -qopt-report-phase=vec command-line options) indicates that the compiler does a better job of vectorizing the baseline version than the cache-blocking version. The absolute vectorization value should not be weighed as heavily as a relative comparison or order-of-magnitude judgement call. Considering the unreasonably high vectorization intensity here, the L1 compute-to-data-access ratio should instead be compared against 16 (or 8), the highest value the vectorization intensity can legitimately take. The analysis framework recommends focusing on remedies for the low L2 compute-to-data-access ratio, which include blocking data for the L1 cache or reducing data accesses.


Table 6.25: Stencil Baseline Implementation on Xeon Phi

    Metric                           Value
    CPU_CLK_UNHALTED                 1,979,860,000,000
    DATA_PAGE_WALK                   111,800,000
    DATA_READ_MISS                   111,800,000
    DATA_READ_MISS_OR_WRITE_MISS     111,800,000
    DATA_READ_OR_WRITE               47,849,660,000
    DATA_WRITE_MISS                  137,000,000
    EXEC_STAGE_CYCLES                137,000,000
    HWP_L2MISS                       111,900,000
    INSTRUCTIONS_EXECUTED            360,510,000,000
    L1_DATA_HIT_INFLIGHT_PF1         0
    L1 Misses                        11,257,000,000
    L2_DATA_READ_MISS_CACHE_FILL     684,500,000
    L2_DATA_READ_MISS_MEM_FILL       125,500,000
    L2_DATA_WRITE_MISS_CACHE_FILL    0
    L2_DATA_WRITE_MISS_MEM_FILL      0
    L2_VICTIM_REQ_WITH_DATA          170,200,000
    LONG_DATA_PAGE_WALK              N/A
    VPU_ELEMENTS_ACTIVE              839,812,519,430
    VPU_INSTRUCTIONS_EXECUTED        53,700,000

Cache blocking is a well-known technique to improve performance of stencil codes. The basic idea is to break the array down into blocks of cells to improve data locality, instead of processing the array cell by cell from beginning to end.


Table 6.26: Investigation may be warranted if the measured values do not meet the threshold given.

    Metric                             Value    Performance Threshold
    GFLOP/s                            54       (3% of peak)
    Average CPI per Thread             5.49     greater than 4.0
    Average CPI per Core               0.02     greater than 1.0
    L1 Compute to Data Access Ratio    17.55    less than Vectorization Intensity (15639)
    L2 Compute to Data Access Ratio    74.60    less than 100x L1 Compute-to-Data-Access Ratio (1755)
    Vectorization Intensity            15639    less than 8 (DP), less than 16 (SP)
    Memory Bandwidth (GB/s)            1.41     less than 80 GB/s
    L1 Hit Rate                        76%      less than 95%
    Estimated Latency Impact           143.6    greater than 145
    L1 TLB Miss Ratio                  0%       greater than 1%
    L2 TLB Miss Ratio                  0        greater than 0.1%
    L1 TLB Misses per L2 TLB Miss      0        near 1

The concept is straightforward; however, many details complicate the implementation. For example, setting new parameters that specify the size of the blocks in each dimension, and managing them appropriately, is difficult.

Table 6.27 lists the measured events for the cache-blocking optimized implementation of Stencil developed by [16], referred to as dev02. Table 6.28 gives the analysis metrics for the cache-blocked optimized version of the stencil implementation. Cache blocking made a 4x improvement in the run time, but the efficiency metrics indicate the cache blocking and vectorization can still be improved (see Table 6.29 for the metrics the stencil code self-reports).
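The block decomposition described above can be sketched with a generic one-point stencil (this is illustrative, not the 3DFD code of [16]): the grid is swept in blocks over the two outer dimensions so each block's working set stays resident in cache, while the innermost dimension remains unit-stride and vectorizable.

    // in/out are n1 x n2 x n3 grids, i (n1) fastest-varying; b2, b3 are the
    // cache-block sizes in the two outer dimensions.
    void stencil_blocked(const float* in, float* out,
                         int n1, int n2, int n3, int b2, int b3) {
        for (int kb = 0; kb < n3; kb += b3)              // block loops
            for (int jb = 0; jb < n2; jb += b2)
                for (int k = kb; k < kb + b3 && k < n3; ++k)
                    for (int j = jb; j < jb + b2 && j < n2; ++j)
                        for (int i = 1; i + 1 < n1; ++i) {  // unit stride
                            size_t c = ((size_t)k * n2 + j) * n1 + i;
                            out[c] = 0.5f * in[c]
                                   + 0.25f * (in[c - 1] + in[c + 1]);
                        }
    }

The block sizes b2 and b3 are exactly the kind of parameters (like n2_thread_block and n3_thread_block above) that have to be chosen and managed, which is why they are natural auto-tuning candidates.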


Table 6.27: Measured hardware events from VTune of the stencil cache-blocking optimization on Xeon Phi coprocessors.

    Metric                           Value
    CPU_CLK_UNHALTED                 424,870,000,000
    DATA_PAGE_WALK                   507,800,000
    DATA_READ_MISS                   11,031,000,000
    DATA_READ_MISS_OR_WRITE_MISS     11,016,000,000
    DATA_READ_OR_WRITE               25,994,300,000
    DATA_WRITE_MISS                  0
    EXEC_STAGE_CYCLES                50,040,000,000
    HWP_L2MISS                       79,000,000
    INSTRUCTIONS_EXECUTED            67,920,000,000
    L1_DATA_HIT_INFLIGHT_PF1         22,750,000
    L1 Misses                        11,038,750,000
    L2_DATA_READ_MISS_CACHE_FILL     207,700,000
    L2_DATA_READ_MISS_MEM_FILL       51,100,000
    L2_DATA_WRITE_MISS_CACHE_FILL    300,000
    L2_DATA_WRITE_MISS_MEM_FILL      1,100,000
    L2_VICTIM_REQ_WITH_DATA          123,100,000
    LONG_DATA_PAGE_WALK              0
    VPU_ELEMENTS_ACTIVE              40,233,120,699
    VPU_INSTRUCTIONS_EXECUTED        40,205,120,615

The vectorization report for the cache-blocking implementation indicates some loops are not vectorized because the iteration count wasn't explicitly known before executing the loop, and because there are vector dependencies that prevent vectorization.


The vectorization diagnostic about the loop iteration count is emitted starting with Intel C++ Compiler 15.0. The differences between the results reported here and by Andreolli [16] are probably due, in part, to differences in compiler and compiler options.

Table 6.28: Measured Analysis Metrics from Stencil Cache-Blocking Optimization on Xeon Phi. Investigation may be warranted if the measured values do not meet the performance threshold.

    Metric                             Value    Performance Threshold
    GFLOP/s                            11.2     (0% of peak)
    Average CPI per Thread             6.26     greater than 4.0
    Average CPI per Core               0.03     greater than 1.0
    L1 Compute to Data Access Ratio    1.55     less than Vectorization Intensity (1.0)
    L2 Compute to Data Access Ratio    3.65     less than 100x L1 Compute-to-Data-Access Ratio (155)
    Vectorization Intensity            1.0      less than 8 (DP), less than 16 (SP)
    Memory Bandwidth (GB/s)            2.85     less than 80 GB/s
    L1 Hit Rate                        58%      less than 95%
    Estimated Latency Impact           32       greater than 145
    L1 TLB Miss Ratio                  2%       greater than 1%
    L2 TLB Miss Ratio                  0        greater than 0.1%
    L1 TLB Misses per L2 TLB Miss      N/A      near 1

Though the cache-blocking implementation made a significant improvement, the stencil is not anywhere near peak for either compute or bandwidth. The cache-blocking and vectorization efficiency can be improved.


Stencil applications often take many parameters, such as array sizes, cache blocking sizes, number of threads, and number of iterations, so developers can experiment to find high-performing parameter sets. Auto-tuning is ideal for optimizing cache block sizing and other parameters that require compilation to generate new variants but no code modification. The solution space can be explored with a simple scripting tool.

Table 6.29: Stencil Results on Xeon Phi as Reported by the Application

                             Time (sec)    Throughput (MPoints/s)    GFlops
    Stencil Baseline         15.66         61.82                     3.77
    Stencil Cache-blocked    3.56          272.09                    16.6


6.4 Summary of Analysis and Results

Table 6.30 compares baseline and optimized results of the three scientific kernels on three architectures. The large runtime differences between the Xeon Phi Coprocessors and the other two architectures are due to differences in the algorithms and inputs that were used for each kernel and a difference in the performance optimization effort. Due to time and resource constraints, the optimization progression on the Xeon Phi Coprocessors did not go as far toward bringing the kernels as close to achievable peaks as they are on the NVidia architectures.

Table 6.30: Comparison of baseline and optimized results of the three scientific kernels on three architectures.

               Fermi (usec)             Maxwell (usec)           Xeon Phi (sec)
               Baseline    Optimized    Baseline    Optimized    Baseline    Optimized
    RK4        20920       11420        10779       325          525         31
    SGEMM      35960       5420         17674       2866         6.3         3.2
    Stencil    3106        2150         2970        2160         15.36       3.56

Table 6.31: Speedups of optimized versions over baseline for each scientific kernel on each architecture.

               Fermi    Maxwell    Xeon Phi Coprocessors
    RK4        1.8x     33x        17x
    SGEMM      6.6x     6.2x       2x
    Stencil    1.4x     1.4x       4.3x


7. Conclusion

A primary contribution of this thesis is the creation of an analysis framework that guides developers and scientists through systematic performance optimization. The framework applies a more disciplined, scientific approach to the art of performance optimization. Capabilities within the framework are explored to generate performance metrics and analysis to drive optimization strategy. The framework helps developers and computational scientists leverage the full computational resources available in parallel systems.

Table 7.1 through Table 7.3 summarize the relationship between each significant performance metric and the recommended optimization for NVidia's Fermi, NVidia's Maxwell, and Intel's Xeon Phi Coprocessors. Additional throughput-oriented metrics are available on Maxwell-based systems that give advanced insight into which hardware unit is running near full utilization and is probably limiting performance. For example, any metric from Table 3.13 or Table 3.14 that ends in 'utilization' or 'efficiency' can be used to focus optimization efforts. If the profiler reports high double-precision floating-point utilization, any optimization that reduces the pressure on double-precision compute will likely see some type of performance gain.


Table 7.1: Summary of the relationship between significant performance bottlenecks on NVidia and their relevant metrics, as defined by NVidia engineers. Table 7.2 maps the performance bottlenecks to recommended optimizations. Memory-related issues are listed first, followed by instruction-related limits and latency, in order.

    Memory-bound at DRAM?
        Formula: 32 * inst_issued : 32B * (dram_reads + dram_writes)
        Compare the measured instruction-to-DRAM-byte ratio with the hardware ideal.

    Memory-bound at L2?
        Formula: 32 * inst_issued : 32B * (L2_reads + L2_writes)
        Compare the instruction-to-L2-byte ratio with the hardware ideal.

    Register spill memory impact
        Formula: 2 * 4 * l1_local_load_miss : (l2_read_requests + l2_write_requests)
        Estimate L2 queries due to local memory and compare to total L2 queries.

    Poor memory access pattern
        Formula: gld_request << (l1_global_load_hit + l1_global_load_miss) * word_size / 32
        Compare application throughput with hardware throughput.

    Coalesced loads?
        Formula: (l1_global_load_miss + l1_global_load_hit) / gld_request
        Compare the expected number of transactions to the hardware-requested number of transactions per load.

    Serialization impact
        Formula: inst_executed / inst_issued
        Estimate the percentage of instructions due to serialization.

    Shared memory bank conflict
        Formula: l1_shared_bank_conflict / (shared_load + shared_store)
        Estimate the percent of shared-memory instructions that conflict.

    Local memory instruction impact
        Formula: (l1_local_load_miss + l1_local_load_hit + l1_local_store_miss + l1_local_store_hit) / inst_issued
        Percentage of total instructions due to local memory.

    Branch divergence
        Formula: divergent_branch / branch
        Branch divergence can waste instructions.

    IPC
        Formula: inst_executed / #SM / elapsed_clocks
        Compare measured IPC to peak IPC.

    Occupancy
        Formula: active_warps / active_cycles / MAX_WARPS_PER_SM
        Low occupancy when neither compute-bound nor memory-bound indicates a latency issue.


Table 7.2: NVidia performance analysis-to-optimization mapping. Identified bottlenecks from Table 7.1 are mapped to optimizations that are likely to positively impact performance. The framework recommends memory-related optimizations when a kernel is memory-bound and compute-related optimizations when a kernel is compute-bound.

    Memory-bound: Focus only on optimizations that improve memory throughput.
    Instruction-bound: Focus only on optimizations that improve instruction throughput.
    Register spill impact: Reduce pressure on local memory.
    Poor memory access pattern: Fix access patterns for efficient loads and stores.
    Coalesced loads: Restructure data to guarantee coalesced loads.
    Serialization impact: Drill into lower-level memory metrics to find the source of serialization.
    Shared memory bank conflict: Restructure data to eliminate bank conflicts.
    Branch divergence: Reduce divergence between threads in a warp.
    Low IPC: Something is limiting compute throughput; check instruction efficiency and latency.
    Occupancy: Low occupancy can indicate latency is exposed if the kernel is neither compute-bound nor memory-bound. Change execution parameters to expose more thread-level parallelism.
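A minimal sketch of how the mapping in Tables 7.1 and 7.2 can be mechanized (the thresholds below are illustrative stand-ins, not the framework's calibrated values): the first decision selects the memory- or instruction-focused track from the instruction:byte ratio, and the memory-side metrics then pick a concrete recommendation.

    struct Profile {
        double instr_per_byte;        // measured instruction:byte ratio
        double ideal_instr_per_byte;  // e.g., 4.5 on Fermi, 10.7 on Maxwell
        double trans_per_load;        // transactions per load request
        double local_mem_pct_instr;   // fraction of instructions due to local memory
    };

    const char* recommend(const Profile& p) {
        if (p.instr_per_byte < p.ideal_instr_per_byte) {   // memory-bound track
            if (p.trans_per_load > 2.0)
                return "fix access patterns / coalesce loads";
            if (p.local_mem_pct_instr > 0.10)
                return "reduce register spilling and local arrays";
            return "move reused data to shared memory or registers";
        }
        return "focus on instruction-throughput optimizations";  // compute-bound track
    }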


Table 7.3: Summary of the relationship between each significant performance metric and recommended optimizations for Intel's Xeon Phi Coprocessors. These performance heuristics are published by Intel [33] to guide performance analysis on Xeon Phi coprocessors.

    Average CPI per thread > 4
        Applications operating within the cores (i.e., doing computations on cacheable working sets) should be able to obtain CPIs at or lower than these thresholds.

    Average CPI per core > 1
        The goal in general is to reduce CPI per core, but some optimizations can increase CPI. Best used as an efficiency and comparison metric rather than as a primary determiner of optimization.

    Vectorization intensity < 8 (DP), 16 (SP)
        Examine the vectorization report for data dependencies, non-unit-stride accesses, or true indirection.

    L1 compute-to-data-access ratio < vectorization intensity
        General tunings to reduce data access: reduce the number of instructions in the critical path, remove conditionals or initialization, and remove anything not needed in inner loops.

    L2 compute-to-data-access ratio < 100x L1 compute-to-data-access ratio
        Improve locality for the L1 cache by restructuring code or enabling the compiler to generate more efficient code.

    L1 hit rate < 95%
        Increase data locality with cache blocking, software prefetching, data alignment, and streaming stores to keep more data in cache.

    Latency impact > 145
        Indicates whether the majority of L1 data misses are hitting in L2. Try padding data structures while maintaining alignment, or changing the access stride.

    L1 TLB miss ratio > 1%
        In general, any program transformation that improves spatial locality will benefit both cache utilization and TLB utilization. The TLB is just another kind of cache.

    L1 TLB misses per L2 TLB miss near 1
        If the L1-to-L2 ratio is high, consider using large pages.


Profilers in general need to continue to evolve to better support performance engineers by providing more insight into what is happening on the machine and guidance on how to interpret metrics. Specific recommendations include: well-defined low-level events; detailed documentation on the formula definitions that employ low-level events in derived analysis metrics; valid ranges for each metric and suggestions for debugging when measured results fall outside normal ranges; dynamic instruction counts for the primary instruction types; defined instruction throughputs for each instruction type; and detailed documentation of the theoretical peak capabilities of the machine and achievable peaks under optimum conditions (the best any real application might see) for all hardware units that can limit performance. Peak performance can be found from hardware specifications or microbenchmarking, but it should be more readily available. Very often, DRAM bandwidth is documented but not bandwidth to the caches or other interfaces. The analysis in this thesis would have been greatly aided if all of the above had been available for all architectures under evaluation.

The analysis methodology described in this thesis is generally applicable to any throughput-oriented kernel that can run on Fermi, Maxwell, or Xeon Phi coprocessors. The framework is limited to the specific optimizations explored for the three scientific kernels. Additional optimizations for given conditions in hardware, and how they are represented in metrics, are opportunities for future research. The framework can be applied to auto-tuning systems to reduce the combinatorial search space by focusing optimizations in areas that are likely to improve performance. Significant development time is wasted improving code that, by definition, will not improve performance. The genesis of auto-tuners was built on the observation that all code variants could be enumerated and explored and, given the capabilities of modern microprocessors, the optimal configuration for a given problem size could be determined [25]. This is true but very inefficient. Diagnosing performance limiters up front reduces the search space significantly and makes the machine less of a black box.


Historically, conventional wisdom produced architectures with similar designs; this changed with the transition to multicore, manycore, and massively parallel systems, and microprocessors have become more diverse [179]. This diversity and focus on parallelism motivates the need to bring a stronger understanding of how throughput-oriented architectures work into the mainstream vernacular. A few of the biggest challenges in manycore performance optimization research are how to efficiently build applications that achieve a reasonable fraction of the available theoretical peak performance, how to measure the relative success of different optimizations, how to interpret profile events in the appropriate context, and how to rate architectural efficiency to measure the goodness of the mapping from software to hardware. Scientific applications are particularly susceptible to underutilization of parallel platforms. The framework described in this thesis advances progress in each of those areas.

Heterogeneous computing is a rapidly growing field and an area of active research. Heterogeneous implies a mix of different types of computing accelerators and hosts cooperating on common tasks. Most heterogeneous platforms to date select different types of compute nodes more by resource availability than by appropriateness of the kernel to the computing hardware. This is due in large part to the fact that dynamically choosing the best computing resource for a given executing kernel is a very challenging problem with few clear answers. Given a near-peak optimization, architectures can be compared using the ratio of achieved throughput to peak capability. This efficiency measure normalizes out the differences between two machines with respect to different peak compute and bandwidth rates, and what remains is a quantification of architectural efficiency for a given kernel on each piece of hardware. This is very useful for architectural research and for understanding how kernels execute on hardware, but it can't give a priori insight into the platform or compute node of choice without a development effort. This thesis lays the groundwork for further development of dynamic mappings of kernels to hardware.


BIBLIOGRAPHY

[1] Channellab v2.050113. Software documentation by Synaptosoft.

[2] Optimizing execution performance for an isotropic double precision 3D finite difference stencil algorithm on the Intel Xeon Phi coprocessor. https://software.intel.com/en-us/articles/optimizing-execution-performance-for-an-isotropic-double-precision-3de-finite-difference Accessed: 2015-09-04.

[3] pOSKI: An extensible autotuning framework to perform optimized SpMVs on multicore architectures.

[4] The gambler's ruin problem, genetic algorithms, and the sizing of populations. Piscataway, NJ, 1997. IEEE Press.

[5] The roofline model: A pedagogical tool for auto-tuning kernels on multicore architectures. Hot Chips, 2008.

[6] NVidia's next generation CUDA compute architecture: Fermi. White Paper, January 2011.

[7] X-Stack: Auto-tuning for Performance and Productivity on Extreme-scale Computations, February 2011.

[8] Profiler: User's guide. Online Documentation, October 2012.

[9] CUDA C Programming Guide, 2013.

[10] Intel Xeon Phi coprocessor block diagram. Online, 2015.

[11] NVidia GeForce GTX 980: Featuring Maxwell, the most advanced GPU ever made. White Paper, January 2015.

[12] David Abramson, Blair Bethwaite, Colin Enticott, Slavisa Garic, Tom Peachey, Anushka Michailova, and Saleh Amirriazi. Embedding optimization in computational science workflows. Journal of Computational Science, 1:41-47, 2010.

[13] Jens Ackermann, Paul Baecher, Thorsten Franzel, Michael Goesele, and Kay Hamacher. Massively-parallel simulation of biochemical systems. In Proceedings of Massively Parallel Computational Biology on GPUs. Lecture Notes in Informatics (LNI), 2009.

[14] Enrique Alba and Jose M. Troya. A survey of parallel distributed genetic algorithms. Complex., 4:31-52, March 1999.


[15] A Ali, L Johnsson, and D Mirkovic. Empirical auto-tuning code generator for FFT and trigonometric transforms. ... with International Symposium on Code ..., 2007.

[16] Cedric Andreolli. Eight optimizations for 3-dimensional finite difference (3DFD) code with an isotropic (ISO). https://software.intel.com/en-us/articles/eight-optimizations-for-3-dimensional-finite-difference-3dfd-code-with-an-isotropic-iso Accessed: 2014-10-21.

[17] Krste Asanovic, Rastislav Bodik, Bryan C Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, D A Patterson, William Lester Plishker, John Shalf, S. W. Williams, and Katherine Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, University of California, Berkeley, Berkeley, dec 2006.

[18] Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J Ramanujam, Atanas Rountev, and P Sadayappan. A compiler framework for optimization of affine loop nests for GPGPUs. In ICS '08: Proceedings of the 22nd annual international conference on Supercomputing. ACM Request Permissions, June 2008.

[19] Babak Behzad, Joseph Huchette, Huong Vu Thanh Luu, Ruth Aydt, Surendra Byna, Yushu Yao, Quincey Koziol, and Prabhat. A framework for auto-tuning HDF5 applications. ACM, June 2013.

[20] A. Bellen. Parallelism across the steps for difference and differential equations. In Alfredo Bellen, Charles Gear, and Elvira Russo, editors, Numerical Methods for Ordinary Differential Equations, volume 1386 of Lecture Notes in Mathematics, pages 22-35. Springer Berlin / Heidelberg, 1989. 10.1007/BFb0089229.

[21] TA Benke, A Lüthi, JT Isaac, and GL Collingridge. Modulation of AMPA receptor unitary conductance by synaptic activity. Nature Neuroscience, 395:793-797, 1998.

[22] TA Benke, A Lüthi, MJ Palmer, MA Wikström, WW Anderson, JT Isaac, and GL Collingridge. Mathematical modelling of non-stationary fluctuation analysis for studying channel properties of synaptic AMPA receptors. Journal of Physiology, 537(pt 2):407-420, 2001.

[23] MA Bhuiyan, MC Smith, and VK Pallipuram. Performance, optimization, and fitness: Connecting applications to architectures. Concurrency and Computation: Practice and Experience.

[24] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC benchmark suite: characterization and architectural implications. ACM, October 2008.


[25] Jeff Bilmes, Krste Asanovic, Chee-Whye Chin, and Jim Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In ICS '97: Proceedings of the 11th international conference on Supercomputing. ACM Request Permissions, July 1997.

[26] Romain Brette, Michelle Rudolph, Ted Carnevale, Michael Hines, David Beeman, James M. Bower, Markus Diesmann, Abigail Morrison, Philip H. Goodman, Frederick C. Harris, Jr., Milind Zirpe, Thomas Natschlager, Dejan Pecevski, Bard Ermentrout, Mikael Djurfeldt, Anders Lansner, Olivier Rochel, Thierry Vieville, Eilif Muller, Andrew P. Davison, Sami El Boustani, and Alain Destexhe. Simulation of networks of spiking neurons: A review of tools and strategies. Journal of Computational Neuroscience, 23:349-398, 2007.

[27] Andre R Brodtkorb, Trond R Hagen, and Martin L Sætra. Graphics processing unit (GPU) programming strategies and trends in GPU computing. Journal of Parallel and Distributed Computing, 73:4-13, jan 2013.

[28] M Calzarossa and G Serazzi. Workload characterization: A survey. Proceedings of the IEEE, 81:1136-1150, 1993.

[29] R.C. Cannon, M.O. Gewaltig, P. Gleeson, U.S. Bhalla, H. Cornelis, M.L. Hines, F.W. Howell, E Muller, J.R. Stiles, S. Wils, and E. De Schutter. Interoperability of neuroscience modeling software: current status and future directions. Neuroinformatics, 3:127-138, 2007.

[30] Erick Cantu-Paz. A summary of research on parallel genetic algorithms. Technical report, University of Illinois at Urbana-Champaign, 1995.

[31] Erick Cantu-Paz. A survey of parallel genetic algorithms. Calculateurs Paralleles, Reseaux et Systemes Repartis, 10:141-171, 1997.

[32] Erick Cantu-Paz. Efficient and accurate parallel genetic algorithms. Kluwer Academic Publishers, 2000.

[33] Shannon Cepeda. Optimization and performance tuning for Intel Xeon Phi coprocessors, part 2: Understanding and using hardware events. Online Documentation, November 2012.

[34] John Chandy, Sungho Kim, Balkrishna Ramkumar, Steven Parkes, and Prithviraj Banerjee. An evaluation of parallel simulated annealing strategies with application to standard cell placement. IEEE Trans. on Comp. Aid. Design of Int. Cir. and Sys, 16:398-410, 1997.

[35] Shuai Che, J. W Sheaffer, M Boyer, L. G Szafaryn, Liang Wang, and K Skadron. A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads. Workload Characterization (IISWC), 2010 IEEE International Symposium on, pages 1-11, 2010.


[36] JW Choi and A Singh. Model-driven autotuning of sparse matrix-vector multiply on GPUs. ACM SIGPLAN Notices, 2010.

[37] Matthias Christen, Olaf Schenk, and Helmar Burkhart. PATUS: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. IPDPS, pages 676-687, 2011.

[38] I-Hsin Chung and JK Hollingsworth. Using information from prior runs to improve automated tuning systems. In Supercomputing, 2004. Proceedings of the ACM/IEEE SC2004 Conference, page 30. IEEE, 2004.

[39] I-Hsin Chung and JK Hollingsworth. A case study using automatic performance tuning for large-scale scientific programs. High Performance Distributed Computing, 2006 15th IEEE International Symposium on, pages 45-56, 2006.

[40] R Clint Whaley, Antoine Petitet, and Jack J Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, January 2001.

[41] David Colquhoun and F.J. Sigworth. Fitting and statistical analysis of single-channel records. In Bert Sakmann and Erwin Neher, editors, Single-Channel Recording, pages 483-587. Springer US, 2009.

[42] Guojing Cong, Seetharami Seelam, I-Hsin Chung, Sophia Wen, and David J Klepacki. Towards a framework for automated performance tuning. IPDPS, pages 1-8, 2009.

[43] Xiang Cui, Yifeng Chen, Changyou Zhang, and Hong Mei. Auto-tuning dense matrix multiplication for GPGPU with cache. In Parallel and Distributed Systems (ICPADS), 2010 IEEE 16th International Conference on, pages 237-242, 2010.

[44] Valeriu Damian, Adrian Sandu, Mirela Damian, Florian Potra, and Gregory R. Carmichael. The kinetic preprocessor KPP - a software environment for solving chemical kinetics. Computers And Chemical Engineering, 26:1567-1579, 2002.

[45] K Datta, M Murphy, V Volkov, S Williams, J Carter, L Oliker, D Patterson, J Shalf, and K Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for, pages 1-12, 2008.

[46] Kaushik Datta. Auto-tuning stencil codes for cache-based multicore platforms, 2009.

[47] Kaushik Datta, Shoaib Kamil, Samuel Williams, Leonid Oliker, John Shalf, and Katherine Yelick. Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors. SIAM Review, 51:129-159, February 2009.


[48] Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Katherine Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, SC '08, pages 4:1-4:12, Piscataway, NJ, USA, 2008. IEEE Press.

[49] A Davidson. Toward techniques for auto-tuning GPU algorithms. In Proceedings of Para, 2010.

[50] Andrew A Davidson, Yao Zhang, and John D Owens. An auto-tuned method for solving large tridiagonal systems on the GPU. IPDPS, pages 956-965, 2011.

[51] Lorenzo Dematte and Davide Prandi. GPU computing for systems biology. Briefings in Bioinformatics, 11:323-333, 2010.

[52] Jack Dongarra and Victor Eijkhout. Self-adapting numerical software and automatic tuning of heuristics. International Conference on Computational Science, 2660 (Chapter 78):759-770, 2003.

[53] Jack Dongarra, Dennis Gannon, Geoffrey Fox, and Ken Kennedy. The impact of multicore on computational science software. CTWatch Quarterly, 3, 2007.

[54] Y Dotsenko, SS Baghsorkhi, B Lloyd, and NK Govindaraju. Auto-tuning of fast Fourier transform on graphics processors. pages 257-266, 2011.

[55] Lieven Eeckhout. Computer Architecture Performance Evaluation Methods. Morgan & Claypool Publishers, September 2010.

[56] Hubert Eichner, Tobias Klug, and Alexander Borst. Neural simulations on multi-core architectures. Frontiers in Neuroinformatics, 4:12, 2010.

[57] PG Emma. Understanding some simple processor-performance limits. IBM Journal of Research and Development, 41:215-232, 1997.

[58] Jianbin Fang, Ana Lucia Varbanescu, Henk Sips, Lilun Zhang, Yonggang Che, and Chuanfu Xu. Benchmarking Intel Xeon Phi to Guide Kernel Design. Technical report.

[59] Naila Farooqui, Andrew Kerr, Gregory Diamos, S Yalamanchili, and K Schwan. A framework for dynamically instrumenting GPU compute applications within GPU Ocelot. In GPGPU-4: Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units. ACM Request Permissions, March 2011.

[60] M Frigo and SG Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93:216-231.

[61] Michael Garland and David B Kirk. Understanding throughput-oriented architectures. Communications of the ACM, 53:58-66, November 2010.


[62] C.W. Gear. Massive parallelism across space in ODEs. Applied Numerical Mathematics, 11:27-43, 1993.

[63] Allison Gehrke, Katherine Rennie, Timothy Benke, Daniel A. Connors, and Ilkyeun Ra. Modeling ion channel kinetics with HPC. High Performance Computing and Communications, 10th IEEE International Conference on, 0:562-567, 2010.

[64] Stefan Goedecker and Adolfy Hoisie. Performance optimization of numerically intensive codes. Society for Industrial and Applied Mathematics, 2001.

[65] David Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Professional, 1989.

[66] David Goldberg. The Design of Innovation: Lessons from and for Competent Genetic Algorithms. Kluwer Academic Publishers, 2002.

[67] D.E. Goldberg, K. Deb, and J.H. Clark. Genetic algorithms, noise, and the sizing of populations. Complex Systems, 6:333-362, 1992.

[68] Jorge Gonzalez-Dominguez, Guillermo L Taboada, Basilio B Fraguela, Maria J Martin, and Juan Touriño. Servet: A benchmark suite for autotuning on multicore clusters. IPDPS, pages 1-9, 2010.

[69] N Goswami, R Shankar, M Joshi, and Tao Li. Exploring GPGPU workloads: Characterization methodology, analysis and microarchitecture evaluation implications. Audio, Transactions of the IRE Professional Group on, pages 1-10, December 2010.

[70] S Grauer-Gray, Lifan Xu, R Searles, S Ayalasomayajula, and J Cavazos. Auto-tuning a high-level language targeted to GPU codes. Innovative Parallel Computing (InPar), 2012, pages 1-10, 2012.

[71] Meron Gurkiewicz and Alon Korngreen. A numerical approach to ion channel modelling using whole-cell voltage-clamp recordings and a genetic algorithm. PLoS Comput Biol, 3(8):e169, 08 2007.

[72] M Hall, J Chame, C Chen, J Shin, and G Rudy. Loop transformation recipes for code generation and auto-tuning. ... and Compilers for ..., 2010.

[73] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 2007.

[74] J Hestness, SW Keckler, and DA Wood. A comparative analysis of microarchitecture effects on CPU and GPU memory system behavior. Workload Characterization (IISWC), 2014 IEEE International Symposium on, pages 150-160, 2014.


[75] Tony Hey, Stewart Tansley, and Kristin Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.

[76] Bertil Hille. Ion Channels of Excitable Membranes. Sinauer Associates, Inc, 2001.

[77] M.L. Hines and N.T. Carnevale. The NEURON simulation environment. Neural Computation, 9:1179-1209, 1997.

[78] Michael Hines and Nicholas T. Carnevale. The handbook of brain theory and neural networks, chapter Computer modeling methods for neurons, pages 226-230. MIT Press, Cambridge, MA, USA, 1998.

[79] Sunpyo Hong and Hyesoon Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In ISCA '09: Proceedings of the 36th annual international symposium on Computer architecture. ACM Request Permissions, June 2009.

[80] Kenneth Hoste and Lieven Eeckhout. Comparing benchmarks using key microarchitecture-independent characteristics. In 2006 IEEE International Symposium on Workload Characterization, pages 83-92. IEEE, 2006.

[81] Eric Jakobsson, R. Jay Mashl, and Tsai-Tien Tseng. Investigating ion channels using computational methods. In Thomas J. McIntosh and Sidney A. Simon, editors, Peptide-Lipid Interactions, volume 52 of Current Topics in Membranes, pages 255-273. Academic Press, 2002.

[82] Changhao Jiang and Marc Snir. Automatic tuning matrix multiplication performance on graphics hardware. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, PACT '05, pages 185-196, Washington, DC, USA, 2005. IEEE Computer Society.

[83] Changhao Jiang and Marc Snir. Automatic tuning matrix multiplication performance on graphics hardware. IEEE PACT, pages 185-196, 2005.

[84] Rongsheng Jin, Tue G Banke, Mark L Mayer, Stephen F Traynelis, and Eric Gouaux. Structural basis for partial agonist action at ionotropic glutamate receptors. Nature Neuroscience, 6:803-810, 2003.

[85] H Jordan, P Thoman, JJ Durillo, S Pellegrini, P Gschwandtner, T Fahringer, and H Moritsch. A multi-objective auto-tuning framework for parallel codes. In High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pages 1-12. IEEE, 2012.

[86] Earl Joseph, Addison Snell, and Christopher G. Willard. Council on competitiveness study of U.S. industrial HPC users. Technical report, 2004.


[87] Tamito Kajiyama, Akira Nukada, Reiji Suda, Hidehiko Hasegawa, and Akira Nishida. Toward automatic performance tuning for numerical simulations in the SILC matrix computation framework. In Software Automatic Tuning, pages 175–192. Springer New York, New York, NY, January 2010.

[88] S. Kamil, C. Chan, L. Oliker, J. Shalf, and S. Williams. An auto-tuning framework for parallel multicore stencil computations. Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–12, 2010.

[89] Edward Kandrot and Jason Sanders. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 2010.

[90] Peter Kaps and Peter Rentrop. Generalized Runge-Kutta methods of order four with stepsize control for stiff ordinary differential equations. Numerische Mathematik, 33:55–68, 1979. 10.1007/BF01396495.

[91] A. Kerr, G. Diamos, and S. Yalamanchili. A characterization and analysis of PTX kernels. Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, pages 3–12, 2009.

[92] Malik Khan, Protonu Basu, Gabe Rudy, Mary Hall, Chun Chen, and Jacqueline Chame. A script-based autotuning compiler system to generate high-performance CUDA code. ACM Transactions on Architecture and Code Optimization, 9:1–25, January 2013.

[93] Malik Muhammad Zaki Murtaza Khan. Autotuning, Code Generation and Optimizing Compiler Technology for GPUs. PhD thesis.

[94] H. Kim, R. Vuduc, S. Baghsorkhi, and J. Choi. Performance analysis and tuning for general purpose graphics processing units (GPGPU). 7:1–96, November 2012.

[95] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers, 2010.

[96] Christof Koch and Idan Segev. Methods in Neuronal Modeling: From Ions to Networks. The MIT Press, 1998.

[97] A. Korngreen. Optimizing ion channel kinetics using a massively parallel genetic algorithm on the GPU. Presented at the GPU Technology Conference, 2009.

[98] K. Kothapalli, R. Mukherjee, M. S. Rehman, S. Patidar, P. J. Narayanan, and K. Srinathan. A performance prediction model for the CUDA GPGPU platform. Audio, Transactions of the IRE Professional Group on, pages 463–472, December 2009.

[99] J. Kurzak, S. Tomov, and J. Dongarra. Autotuning GEMMs for Fermi. SC11, 2011.


[100] J. Kurzak, S. Tomov, and J. Dongarra. Autotuning GEMM kernels for the Fermi GPU. Parallel and Distributed Systems, IEEE Transactions on, 23:2045–2057, 2012.

[101] Jakub Kurzak, David Bader, and Jack Dongarra, editors. Scientific Computing with Multicore and Accelerators, volume 20102756 of Chapman & Hall/CRC Computational Science. CRC Press, December 2010.

[102] Junjie Lai and A. Seznec. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. Code Generation and Optimization (CGO), 2013 IEEE/ACM International Symposium on, pages 1–10, 2013.

[103] Monica D. Lam, Edward E. Rothberg, and Michael E. Wolf. The cache performance and optimizations of blocked algorithms. ACM SIGARCH Computer Architecture News, 19:63–74, April 1991.

[104] Hey-Kyoung Lee, Kogo Takamiya, Jung-Soo Han, Hengye Man, Chong-Hyun Kim, Gavin Rumbaugh, Sandy Yu, Lin Ding, Chun He, Ronald S. Petralia, Robert J. Wenthold, Michela Gallagher, and Richard L. Huganir. Phosphorylation of the AMPA receptor GluR1 subunit is required for synaptic plasticity and retention of spatial memory. Cell, 112:631–643, 2003.

[105] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. SIGARCH Comput. Archit. News, 38:451–460, June 2010.

[106] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In ISCA '10: Proceedings of the 37th Annual International Symposium on Computer Architecture. ACM Request Permissions, June 2010.

[107] David G. Levitt. Modeling of ion channels. The Journal of General Physiology, 113:789–794, 1999.

[108] Yinan Li, Jack Dongarra, and Stanimire Tomov. A note on auto-tuning GEMM for GPUs. ICCS, 5544 Chapter 89:884–892, 2009.

[109] Yinan Li, Jack Dongarra, and Stanimire Tomov. A note on auto-tuning GEMM for GPUs. In Gabrielle Allen, Jaroslaw Nabrzyski, Edward Seidel, Geert van Albada, Jack Dongarra, and Peter Sloot, editors, Computational Science – ICCS 2009, volume 5544 of Lecture Notes in Computer Science, pages 884–892. Springer Berlin Heidelberg, 2009.


[110] Calvin Lin and Lawrence Snyder. Principles of Parallel Programming. Pearson Addison Wesley, 2008.

[111] Haibo Lin, Chao Li, Qian Wang, Yi Zhao, Ninghe Pan, Xiaotong Zhuang, and Ling Shao. Automated tuning in parallel sorting on multi-core architectures. In Pasqua D'Ambra, Mario Guarracino, and Domenico Talia, editors, Euro-Par 2010 – Parallel Processing, volume 6271 of Lecture Notes in Computer Science, pages 14–25. Springer Berlin/Heidelberg, 2010.

[112] John C. Linford, John Michalakes, Manish Vachharajani, and Adrian Sandu. Multi-core acceleration of chemical kinetics for simulation and prediction. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 7:1–7:11, New York, NY, USA, 2009. ACM.

[113] Yixun Liu, Eddy Z. Zhang, and Xipeng Shen. A cross-input adaptive framework for GPU program optimizations. In Distributed Processing (IPDPS), pages 1–10. IEEE.

[114] Yixun Liu, E. Z. Zhang, and Xipeng Shen. A cross-input adaptive framework for GPU program optimizations. Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–10, 2009.

[115] Andreas Lüthi, Martin Wikström, Mary Palmer, Paul Matthews, Tim Benke, John Isaac, and Graham Collingridge. Bi-directional modulation of AMPA receptor unitary conductance by synaptic activity. BMC Neuroscience, 5:44, 2004.

[116] Wenjing Ma, S. Krishnamoorthy, and G. Agrawal. Parameterized micro-benchmarking: An auto-tuning approach for complex applications. In Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on, pages 181–182. IEEE, 2011.

[117] Wenjing Ma, Sriram Krishnamoorthy, Oreste Villa, Karol Kowalski, and Gagan Agrawal. Optimizing tensor contraction expressions for hybrid CPU-GPU execution. Cluster Computing, 16:131–155, March 2013.

[118] Alberto Magni, Dominik Grewe, and Nick Johnson. Input-aware auto-tuning for directive-based GPU programming. In GPGPU-6: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, pages 66–75, New York, New York, USA, March 2013. ACM Request Permissions.

[119] A. Mametjanov, D. Lowell, Ching-Chen Ma, and B. Norris. Autotuning stencil-based computations on GPUs. In Cluster Computing (CLUSTER), 2012 IEEE International Conference on, pages 266–274. IEEE, 2012.

[120] Azamat Mametjanov and Boyana Norris. Software autotuning for sustainable performance portability. arXiv.org, September 2013.


[121] Grant McFarland. Microprocessor Design: A Practical Guide from Design Planning to Manufacturing. 2006.

[122] A. C. McKellar and E. G. Coffman, Jr. Organizing matrices and matrix operations for paged memory systems. Communications of the ACM, 12, March 1969.

[123] R. E. Melnick, M. J. Siclari, F. Marconi, T. Barber, and A. Verhoff. An overview of a recent industry effort at CFD code validation. In 26th AIAA Fluid Dynamics Conference, 2005.

[124] Jiayuan Meng, V. A. Morozov, K. Kumaran, V. Vishwanath, and T. D. Uram. GROPHECY: GPU performance projection from CPU code skeletons. In High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for, pages 1–11. ACM Request Permissions, 2011.

[125] Jiayuan Meng and Kevin Skadron. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs. In Proceedings of the 23rd International Conference on Supercomputing – ICS '09, pages 256–265. ACM, June 2009.

[126] Paulius Micikevicius. Analysis driven optimization. In GPU Technology Conference, 2010.

[127] Asim Munawar, Mohamed Wahib, Masaharu Munetomo, and Kiyoshi Akama. A survey: Genetic algorithms and the fast evolving world of parallel computing. High Performance Computing and Communications, 10th IEEE International Conference on, 0:897–902, 2008.

[128] M. Valiev, E. J. Bylaska, N. Govind, K. Kowalski, T. P. Straatsma, H. J. J. van Dam, D. Wang, J. Nieplocha, E. Apra, T. L. Windus, and W. A. de Jong. NWChem: a comprehensive and scalable open-source solution for large scale molecular simulations. Computer Physics Communications, 181:1477–1489, 2010.

[129] Akira Nukada and Satoshi Matsuoka. Auto-tuning 3-D FFT library for CUDA GPUs. SC2009, page 1, 2009.

[130] Akira Nukada and Satoshi Matsuoka. Auto-tuning 3-D FFT library for CUDA GPUs. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 30:1–30:10, New York, NY, USA, 2009. ACM.

[131] Michael C. Oh and Victor A. Derkach. Dominant role of the GluR2 subunit in regulation of AMPA receptors by CaMKII. Nature Neuroscience, 8:853–854, 2005.

[132] Kunle Olukotun and Lance Hammond. The future of microprocessors. ACM Queue, 3, 2005.


[133] M. A. O'Neil and M. Burtscher. Microarchitectural performance characterization of irregular GPU kernels. Workload Characterization (IISWC), 2014 IEEE International Symposium on, pages 130–139, 2014.

[134] S. Pakin and P. McCormick. Hardware-independent application characterization. Workload Characterization (IISWC), 2013 IEEE International Symposium on, pages 111–112, 2013.

[135] Joshua Peraza, Ananta Tiwari, Michael Laurenzano, Laura Carrington, William A. Ward, and Roy Campbell. Understanding the performance of stencil computations on Intel's Xeon Phi. In 2013 IEEE International Conference on Cluster Computing (CLUSTER), pages 1–5. IEEE, 2013.

[136] K. Plant, K. A. Pelkey, Z. A. Bortolotto, D. Morita, A. Terashima, C. J. McBain, G. L. Collingridge, and J. T. Isaac. Transient incorporation of native GluR2-lacking AMPA receptors during hippocampal long-term potentiation. Nature Neuroscience, 9:602–604, 2006.

[137] D. E. Post, R. P. Kendall, and E. M. Whitney. Case study of the Falcon code project. In Proceedings of the Second International Workshop on Software Engineering for High Performance Computing System Applications, SE-HPCS '05, pages 22–26, New York, NY, USA, 2005. ACM.

[138] Douglass E. Post, Richard P. Kendall, and Robert F. Lucas. Advances in Computers, volume 66, chapter The Opportunities, Challenges, and Risks of High Performance Computing in Computational Science and Engineering, pages 239–301. Academic Press, 2006.

[139] Douglass E. Post and Lawrence G. Votta. Computational science demands a new paradigm. Physics Today, 58:35–41, 2005.

[140] M. Püschel, J. M. F. Moura, J. R. Johnson, D. Padua, M. M. Veloso, B. W. Singer, Jianxin Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 93:232–275.

[141] A. Qasem. Automatic tuning of scientific applications. 2007.

[142] Rezaur Rahman. Intel Xeon Phi Coprocessor Architecture and Tools: The Guide for Application Developers. September 2013.

[143] D. Reed. Making infrastructure invisible, 2009.

[144] K. J. Rennie, M. A. Streeter, T. A. Benke, and A. T. Moritz. Modeling channel properties in vestibular calyx terminals. Biomedical Sciences Instrumentation, 41:358–363, 2005.


[145] G. Rivera and Chau-Wen Tseng. Tiling optimizations for 3D scientific computations. In Supercomputing, ACM/IEEE 2000 Conference, page 32. IEEE Computer Society, 2000.

[146] A. Rizzi and J. Vos. Towards establishing credibility in CFD simulations. In 27th AIAA Fluid Dynamics Conference, 1996.

[147] Patrick J. Roache. Verification and Validation in Computational Science and Engineering. Hermosa Publishers, 1998.

[148] Antoine Robert and James R. Howe. How AMPA receptor desensitization depends on receptor occupancy. The Journal of Neuroscience, 23:847–858, 2003.

[149] Shane Ryoo. Program Optimization Strategies for Data-parallel Many-core Processors. PhD thesis, 2008.

[150] Shane Ryoo, Christopher I. Rodrigues, Sam S. Stone, John A. Stratton, Sain-Zee Ueng, Sara S. Baghsorkhi, and Wen-mei W. Hwu. Program optimization carving for GPU computing. Journal of Parallel and Distributed Computing, 68:1389–1401, January 2008.

[151] K. J. Sampson, V. Iyer, A. R. Marks, and R. S. Kass. A computational model of Purkinje fibre single cell electrophysiology: implications for the long QT syndrome. The Journal of Physiology, 588:2643–2655, 2010.

[152] Daisuke Sato, Yuanfang Xie, James Weiss, Zhilin Qu, Alan Garfinkel, and Allen Sanderson. Acceleration of cardiac tissue simulation with graphic processing units. Medical and Biological Engineering and Computing, 47:1011–1015, 2009. 10.1007/s11517-009-0514-4.

[153] Katsuto Sato, Hiroyuki Takizawa, Kazuhiko Komatsu, and Hiroaki Kobayashi. Automatic tuning of CUDA execution parameters for stencil processing. In Software Automatic Tuning, pages 209–228. Springer New York, New York, NY, January 2010.

[154] Jaewoong Sim, Aniruddha Dasgupta, Hyesoon Kim, and Richard Vuduc. A performance analysis framework for identifying potential benefits in GPGPU applications. In PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM Request Permissions, February 2012.

[155] A. B. Sinha and L. V. Kale. Towards automatic performance analysis. In Parallel Processing, 1996. Vol. 3. Software., Proceedings of the 1996 International Conference on, pages 53–60. IEEE, 1996.

[156] Addison Snell and Christopher G. Willard. Revitalizing manufacturing: Transforming the way America builds. Technical report, 2010.


[157] Alexander I. Sobolevsky, Michael P. Rosconi, and Eric Gouaux. X-ray structure, symmetry and mechanism of an AMPA-subtype glutamate receptor. Nature, 462:745–758, 2009.

[158] John E. Stone, David J. Hardy, Ivan S. Ufimtsev, and Klaus Schulten. GPU-accelerated molecular modeling coming of age. Journal of Molecular Graphics and Modelling, 29:116–125, 2010.

[159] John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and W. M. W. Hwu. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing, 2012.

[160] Wai Teng Tang, Wen Jun Tan, R. Krishnamoorthy, Yi Wen Wong, Shyh-Hao Kuo, R. S. M. Goh, S. J. Turner, and Weng-Fai Wong. Optimizing and auto-tuning iterative stencil loops for GPUs with the in-plane method. Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pages 452–462, 2013.

[161] K. Teranishi, J. Cavazos, and R. Suda. Software automatic tuning, 2010.

[162] Peter D. Tieleman. Atomistic Simulations of Ion Channels, pages 53–95. Chapman & Hall/CRC Mathematical & Computational Biology, 2003.

[163] Adrian Tineo, Sadaf R. Alam, and Thomas C. Schulthess. Towards autotuning by alternating communication methods. SIGMETRICS Performance Evaluation Review, 40, October 2012.

[164] Tiwari. A scalable auto-tuning framework for compiler optimization. Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–12, 2009.

[165] A. Tiwari, J. K. Hollingsworth, Chun Chen, M. Hall, Chunhua Liao, D. J. Quinlan, and J. Chame. Auto-tuning full applications: A case study. International Journal of High Performance Computing Applications, 25:286–294, August 2011.

[166] Yash Ukidave, David Kaeli, Fanny Nina Paravecino, Leiming Yu, Charu Kalra, Amir Momeni, Zhongliang Chen, Nick Materise, Brett Daley, and Perhaad Mistry. NUPAR. In the 6th ACM/SPEC International Conference, pages 253–264, New York, New York, USA, 2015. ACM Press.

[167] Michael L. Van De Vanter, D. E. Post, and Mary E. Zosel. HPC needs a tool strategy. In Proceedings of the Second International Workshop on Software Engineering for High Performance Computing System Applications, SE-HPCS '05, pages 55–59, New York, NY, USA, 2005. ACM.

[168] P. J. van der Houwen and B. P. Sommeijer. Parallel ODE solvers. SIGARCH Comput. Archit. News, 18:71–81, June 1990.


[169] Vasily Volkov. Better performance at lower occupancy. In GPU Technology Conference, 2010.

[170] Richard Vuduc, James W. Demmel, and Katherine A. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series, 16:521–530, January 2005.

[171] Richard Wilson Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, 2003.

[172] Lai Wei and John Mellor-Crummey. Autotuning tensor transposition. In 2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW), pages 342–351. IEEE.

[173] Josef Weidendorfer and Peter Luksch. A framework for transparent load balancing in parallel numerical simulation. In Annual Simulation Symposium, pages 125–132, 2001.

[174] Gerhard Weikum, Christof Hasse, Alex Moenkeberg, and Peter Zabback. The COMFORT automatic tuning project, invited project review. Inf. Syst., 19:381–432, 1994.

[175] R. Clint Whaley. ATLAS version 3.9: Overview and status. In Software Automatic Tuning, pages 19–32. Springer New York, January 2010.

[176] R. Clinton Whaley, Antoine Petitet, and Jack Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001.

[177] S. Williams, J. Carter, L. Oliker, and J. Shalf. Resource-efficient, hierarchical auto-tuning of a hybrid lattice Boltzmann computation on the Cray XT4. 2009.

[178] S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. Yelick, and D. Bailey. PERI – auto-tuning memory-intensive kernels for multicore. Journal of Physics: Conference Series, 125:012038, July 2008.

[179] Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52:65–76, April 2009.

[180] Samuel Webb Williams. Auto-tuning Performance on Multicore Computers. PhD thesis, ProQuest, 2008.

[181] M. Wolfe. Compilers and more: Optimizing GPU kernels. HPCWire, October 2008.

[182] R. Yokota and L. Barba. Hierarchical N-body simulations with autotuning for heterogeneous systems. Computing in Science & Engineering, 14:30–39, 2012.


[183] Kamen Yotov, Xiaoming Li, Gang Ren, Michael Cibulskis, Gerald DeJong, Maria Garzaran, David Padua, Keshav Pingali, Paul Stodghill, and Peng Wu. A comparison of empirical and model-driven optimization. SIGPLAN Not., 38:63–76, May 2003.

[184] H. You, Q. Liu, and Z. Li. The design of an auto-tuning I/O framework on Cray XT5 system. icl.cs.utk.edu.

[185] Ian J. Youngs. A comparison between physical properties of carbon black-polymer and carbon nanotubes-polymer composites. Journal of Applied Physics, 108:074108, 2010.

[186] Yao Zhang and J. D. Owens. A quantitative performance analysis model for GPU architectures. IEEE International Symposium on High Performance Computer Architecture. Proceedings, pages 382–393, February 2011.

[187] Yongpeng Zhang and Frank Mueller. Auto-generation and auto-tuning of 3D stencil codes on GPU clusters. In CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization. ACM Request Permissions, March 2012.

[188] Francisco Vázquez, Jose Jesus Fernandez, and Ester M. Garzon. Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach. Parallel Computing, 38:408–420, August 2012.


APPENDIX A. Advancement of Computational Simulation in Kinetic Modeling

A.1 Related Work

This section covers research related to computer simulations of ion channel proteins. A brief discussion of different simulation approaches is used to identify where the work presented here fits within existing research and how this dissertation addresses shortcomings. Simulations are used to study ion channels at different levels of detail. At the highest level, individual atoms are modeled, but a connection to voltage-current relationships is hard to make [162]. At intermediate levels of detail, protein and ions are treated as individual particles but solvent is not [162]. We are modeling ion channels at this level, where a link between simulation and experiment is possible [162]. Ion channel modeling is one component in systems modeling of many types (the neural cell as a system, a network of cells involved in the same function as a system, etc.), and progress made at this level facilitates modeling at other levels. At the lowest level of detail, ions and solvent are treated implicitly and the challenge is in obtaining accurate results [162].

Molecular dynamics (MD) simulations are another class of simulation that is limited computationally. MD simulations have an important function in modeling ion channels and are now the most popular approach to the theoretical investigation of ion channels [107]. In the past, exact solutions were not attempted because of computational limitations, but this situation is changing and we are at the beginning of a new era in ion channel modeling [107].

A.1.1 Computational Methods To Investigate Ion Channels

Current neuroscience simulation development has two primary computational limitations: optimization routines are not integrated in the scientific modeling workflow, and the development of parallel simulation software has been relatively slow [26].


Software applications in neuroscience simulation have evolved and some have become de-facto standards in their domain [?]. Neuron [77] and Genesis have the largest user base for whole-cell and small network modeling, and MCell does for stochastic modeling of individual particles [?]. In other domains, where the research target is not as common or as well developed, researchers still write new simulation programs from scratch [?]. We follow this approach and have written our own simulation program.

Researchers interested in parameter optimization tend to take a model developed within one system and run it on another system that provides optimization. For example, [144] used the NEURON simulation environment to model the properties of the calyx terminal. The kinetic parameters were optimized with a genetic algorithm outside the NEURON environment to match kinetic data from the whole cell recordings. Numerical integration needs automated optimization due to the large number of ordinary differential equations (for a definition and further explanation see Section A.1.3) and the time steps required for the simulations. Manual approaches are too labor intensive and error-prone, and they severely limit the types of models that will be considered and the types of studies that will be conducted. Statistical comparison of a wide variety of kinetic schemes is a powerful tool to determine which parameters are really making a difference, but these types of studies are rare in the literature. By developing optimizing kinetic studies to go hand in hand with simulation and experiment, the range of applicability of the experimental and simulation studies is extended.

Although it is well known that simulation environments tend to be computationally demanding and would benefit from exploiting parallelism at any level, simulation tools do not transparently support large-scale parallelization. There are a few publicly available parallel simulators, but they are not as general, flexible, and documented as commonly used serial simulators [26]. For example, extensions are under development (Parallel Genesis), and Neuron does have support for parallel processing, but setting up distributed models requires considerable effort [26].


Supporting emerging architectures at any level of parallelism is a nascent effort for most simulation environments and application development in general. As such, most have no focus on optimal mappings to parallel resources. Optimal mappings are important to ensure applications will be able to scale to new generations of hardware, which amortizes the development cost of porting to a new parallel architecture over future generations.

The modeling methods used in [71] are similar to the target application and case study profiled in Section A.3. They used a genetic algorithm (GA) in combination with a gradient descent algorithm to fit whole-cell voltage-clamp data to kinetic models. Their model was programmed using NEURON [77] and the process was parallelized using Parallel Virtual Machine (PVM) on a cluster of ten Pentium 4 computers with a 3 GHz clock speed. PVM is not widely used today, although its method to implement parallelism, the Message Passing Interface (MPI), is a de-facto standard in distributed parallel computing. This type of custom parallel implementation is required for current simulation research that optimizes model parameters.

[71] reported runtimes that ranged from less than an hour for simple models to more than a week for a 20-parameter model fitting many experimental points. The authors don't mention how their ODEs were solved, but given the simplicity of their models, the fact that the numerical routine wasn't mentioned with respect to the computational concerns, and the very modest simulation times they report over runs that spanned several thousand generations, they must have used a much more computationally forgiving explicit numerical solver.

[71] observed that computing power is a primary concern with optimization and that a common method to address this is by limiting the parameter search range. However, this method didn't impact the complexity of their algorithm or the simulation runtime. This is probably because the GA is the computational bottleneck only when the objective function used to determine fitness is complex relative to a numerical routine that isn't computationally demanding.


Approximating numerical solutions of partial differential equations (PDEs) and ODEs is inevitably the computational bottleneck for scientific applications, and this is where improving performance through highly parallel machines will have the greatest impact.

Researchers in [152] demonstrate the potential of using GPUs to accelerate simulations of a computationally demanding problem in cardiology using a forward Euler method. GPUs are increasingly being employed in a variety of similar domains to accelerate simulations. Any work to improve simulation performance and increase scientific productivity will benefit neuroscience, cardiology, and MD simulations, to name a few important example application areas.

Computational resources on powerful machines are readily available at reasonable costs and are becoming more so every day. However, the time and cost required to learn enough about the technology for computationally driven research projects to be successful is a significant barrier for many. [81] is a review of investigating ion channels using computational methods and submits that it should be possible for an investigator to fill in a web form with details of the experiments to be simulated and have the underlying program do the simulations and other calculations automatically. We agree, and this dissertation progresses the field in this direction.

A.1.2 Simulation Tools

Several simulation tools mentioned in previous sections are discussed in a little more detail here. These tools allow modelers to perform simulations for a few parameter sets but do not perform optimization. They are very useful analysis tools for studying channel features, but they are not intended for large-scale simulations or high-throughput optimization of many potential rate constant sets. Each model and its associated rate constant set are manually configured through the program GUI and numerically integrated.


Neuron is a simulation program that specializes in modeling neurons and small networks of neurons. Neuron uses spatial discretization to reduce the cable equation (cable theory is applied to the study of electrical signaling in neurons, and the cable equation describes the relationship between current and voltage in a one-dimensional cable) to a set of ODEs with first order derivatives in time [77]. Users are offered a choice of two stable implicit methods, Backward Euler and a variant of Crank-Nicholson. Neuron also provides the principal axis (PrAxis) method for minimizing a function to fit data, but this is not a robust general optimization routine. [71] demonstrated that only the combination of convergence by a GA to within 40% of target values followed by the PrAxis routine produced a good fit. PrAxis requires parameters that have already been constrained to a tight range around their target value to find a good fit.

ChannelLab is a simulation program that specializes in modeling the kinetics of ion channels to help researchers and students study channel function. It uses Monte Carlo simulation, Runge-Kutta and, less commonly, the Q-matrix method for numerical integration to find a solution for a given set of rate constants. ChannelLab does not offer any type of parameter optimization; integration must be re-run for every new rate constant set and gives the response of a channel to, for example, changing concentrations of agonist [1].

Ultimately, this dissertation hopes to make a substantial contribution to the efforts of the Kinetic PreProcessor (KPP) [44] and NWChem software. KPP was designed as a general analysis tool to aid simulation of chemical kinetic systems. NWChem aims to provide computational chemistry tools that are scalable to large scientific problems and was recently released into open source by the Pacific Northwest National Laboratory [128].


KPP maintains computational efficiency in generated code but does not support emerging architectures. However, KPP supports framework extension to generate optimized code for other platforms, which [112] have done for the Weather Research and Forecast with Chemistry model. We would like to extend this work to the chemical kinetics of ion channels and provide more substantive support for automatic performance tuning on GPUs.

NWChem has a strong focus on classical molecular dynamics capabilities that provide for the simulation of macromolecules and solutions [128]. In addition, NWChem supports the use of available parallel computing resources. However, the scripts that direct the processing require users to specify architecture, memory limits and types of memory constraints. We would like to evaluate NWChem and integrate more transparency for the user so they can achieve optimal performance on GPU architectures without specifying architectural details in simulation scripts.

A.1.3 Numerical Methods for Modeling Chemical Kinetics

The solution techniques in computational simulation are often classified by their formulation as systems of ordinary differential equations (ODEs) [138]. ODEs describe rates of change based on a single variable, which is typically time but doesn't have to be; for example, a function that describes how the concentration of glutamate in a kinetic experiment changes over time. Any model can be represented as a set of differential equations, and these equations are solved with numerical integration to find a solution for a given set of rate constants.

The numerical methods used to solve ODEs that describe chemical kinetics replace the derivatives in the differential equations with finite-difference approximations, which reduces the differential equations to algebraic equations [96]. There are two major classes of finite-difference methods, characterized by whether the equations implicitly or explicitly define the solution at each time step. Solutions for implicit and explicit methods vary widely, and many algorithms within each differ in terms of how they handle sources of error in the calculation, known as convergence, consistency, and stability.
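
To make the implicit/explicit distinction concrete, consider the generic ODE \( dx/dt = f(t, x) \) advanced with step size \( h \) (standard textbook notation, not notation specific to this dissertation). The simplest representative of each class is:

\[ x_{n+1} = x_n + h\,f(t_n,\,x_n) \qquad \text{(forward Euler: explicit)} \]

\[ x_{n+1} = x_n + h\,f(t_{n+1},\,x_{n+1}) \qquad \text{(backward Euler: implicit)} \]

The explicit update uses only already-known values, while the implicit update defines \( x_{n+1} \) through an algebraic equation that must be solved at every step; that extra work per step is what buys the stability needed for the stiff systems discussed next.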


Numerical methods are also distinguished by stiffness, a property of the ODEs themselves. Stiff systems mainly appear in chemical kinetics, electric circuits, and control theory [90]. Stiffness measures the difficulty of solving an ODE numerically and is characterized by disparate time scales [96] (small time-steps are required for stability). Stiff systems require complex implicit methods to avoid numerical instabilities, while nonstiff systems can be solved by relatively simple explicit methods. The system of ODEs that describes AMPA currents is stiff, and further discussion is limited to implicit numerical methods.

Computational limitations are often taken into consideration when choosing a numerical method. Methods are classified by how the accuracy depends on the step size [96]. Higher-order methods require more work at each time step and are more complicated to implement; the advantage is that they are more accurate. Conversely, lower-order methods have less work to do and are less accurate. So, there is a trade-off between accuracy and the cost and complexity of implementing and using a given method.

The most popular codes for the numerical solution of a stiff system of ODEs are based on the backward differentiation formulas (BDF). One of the most powerful methods, characterized by high accuracy and high stability for solving ODEs, is an implicit Runge-Kutta method. The Runge-Kutta 4th order method (RK4) is considered a workhorse in scientific engineering applications, providing accurate and stable solutions to stiff systems of ODEs common in computational chemistry. This method is the selected solver within our simulation framework. Section A.3 details our preliminary research in mapping RK4 onto advanced parallel architectures, which significantly accelerates Kingen and potentially any other application which uses RK4 as a numerical solution. We intend to examine the applicability of our framework to other scientific applications that use the RK4 method. In addition, in future research we are interested in how other solvers may differ, and why, in their optimal mappings to GPU hardware.


A.1.4 Genetic Algorithm for Optimization

The main choice in optimization is between search methods and traditional simplex or gradient algorithms, including the widely used Gauss-Newton-Marquadt method [41]. Traditional gradient methods sample a small part of the parameter space, are dependent on initial parameter values, and can get stuck in local minima. Gradient descent does very well if the error function is a convex curve. However, our optimization is measured by a complex function with an unknown structure (a template extracted from data sets describing a specific type of AMPA receptor). In addition, [14] analyzed gradient-based optimizers and identified bottlenecks that can limit their parallel efficiency. For these reasons, we use a genetic algorithm (GA), which is a search method based on Darwinian evolution.

GAs are increasingly used to solve hard problems of practical interest in diverse fields [66]. GAs are known for their ability to find optimal solutions within a defined parameter space from initial random populations [65]. [185] demonstrated that a GA was able to find a unique solution within the parameter range specified. GAs use selection and recombination operators to generate new search points in the search space, which can potentially jump out of local minima, and these operators must be tuned for best results (i.e., mutation rate and crossover rate).
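
To make these operators concrete, the C++ sketch below builds one generation using tournament selection, one-point crossover, and per-gene mutation. This is a generic illustration under assumed operator choices and rates, not Kingen's actual tuned implementation; the Individual layout is likewise an assumption.

#include <random>
#include <utility>
#include <vector>

// One GA generation from tournament selection, one-point crossover, and
// per-gene mutation. Illustrative only: operator choices, rates, and the
// Individual layout are assumptions, not Kingen's tuned implementation.
struct Individual { std::vector<double> genes; double fitness = 0.0; };

std::vector<Individual> nextGeneration(const std::vector<Individual>& pop,
                                       double crossoverRate, double mutationRate,
                                       std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, pop.size() - 1);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    // Tournament of two: the individual with the lower error wins.
    auto tournament = [&]() -> const Individual& {
        const Individual& a = pop[pick(rng)];
        const Individual& b = pop[pick(rng)];
        return a.fitness < b.fitness ? a : b;
    };
    std::vector<Individual> next;
    while (next.size() < pop.size()) {
        Individual child = tournament();             // selection (copy of winner)
        if (coin(rng) < crossoverRate) {             // one-point crossover
            const Individual& mate = tournament();
            std::size_t cut = pick(rng) % child.genes.size();
            for (std::size_t g = cut; g < child.genes.size(); ++g)
                child.genes[g] = mate.genes[g];
        }
        for (double& g : child.genes)                // small multiplicative mutation
            if (coin(rng) < mutationRate)
                g *= 1.0 + 0.1 * (2.0 * coin(rng) - 1.0);
        next.push_back(std::move(child));
    }
    return next;
}

Tuning crossoverRate and mutationRate trades exploration against exploitation, which is exactly the tuning burden the text refers to.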


The majority of research in parallel genetic algorithms has concentrated on parallel GAs with multiple populations [14]. However, the literature has not produced evidence that multiple sub-populations are generally more effective in terms of the quality of the solution than a single large population. Population size is divided to distribute workloads and reach performance targets only, not to improve the search. How to ideally partition the population, and the migration rate between populations (the effects of migrating too frequently or infrequently), are not under consideration in this dissertation.

Common knowledge in genetic algorithm research is that the size of the population is important because it influences whether the GA can find good solutions and the time it takes to reach them [67], [30], [4]. GAs usually require a bigger population for solving non-trivial problems [30]. GPUs can avoid sub-populations because massive data parallelism enables large populations.

GAs do not guarantee optimality, and there are no theoretical principles to guide our understanding of the effect of the population size on the quality and efficiency of the search. [97] demonstrated that larger populations led to better performance in terms of the optimization of the fitness score. The sequential version of our target application used 100 individuals in the genetic algorithm due to computational limitations. It is likely this population size does not contain enough genetic diversity to cover the search space for this problem adequately. However, the necessary and sufficient size is not known. Ideally, you want enough individuals to ensure a high quality solution but not so many that you are wasting computing resources. Given the massive data parallelism available on GPUs, we can now test much larger populations over many more generations to set population size more systematically and have more confidence that we have found global optima.

A.2 Modeling Ion Channels

Ion channels are trans-membrane proteins that open and close to regulate the flow of ions (currents) across membranes in all cells. These ionic currents are critical to intra- and inter-cellular signaling. Ion channels are especially suitable biological entities for computational studies through kinetic modeling. These kinetic features critically influence the temporal coding of cell-signaling information. AMPA receptors (AMPARs), ligand-gated ion channels activated by the timed release of the neurotransmitter glutamate, are responsible for nearly all fast excitatory neuronal signaling in the central nervous system.


Temporal coding by these receptors ultimately influences temporal coding by entire neural networks. Numerous studies have demonstrated that these receptors activate, deactivate and desensitize in a complex fashion. The kinetic properties are reportedly altered both under normal conditions that may mediate learning and memory and in pathological conditions such as epilepsy. Detailed kinetic descriptions are therefore useful both for understanding their function and for implementation into accurate mathematical models of these processes. Understanding how AMPARs work could play a key role in therapeutic drug design for epilepsy, stroke and neurodegenerative disorders.

AMPARs are thought to be composed of tetramerically arranged subunits. Crystallographic studies suggest that interactions between subunits are likely asymmetric. Functional studies suggest that the interactions between subunits are complicated: agonist binding and desensitization may proceed sequentially, while channel opening may be dependent on resulting agonist concentration-dependent binding conformations. Current models of AMPAR have at least three distinct open states, each dependent on increasing numbers of bound agonist molecules. Previously published kinetic schemes are unable to simultaneously describe several features, including the effects of partial agonists on channel conductance and related shifts in conductance due to phosphorylation which preserve affinity and desensitization/deactivation.

Scientists represent the properties of ion channels with electrical circuit diagrams that have equivalent electrical properties, and much of what we know about ion channels is deduced from electrical measurements [76]. AMPARs are classified as ligand-gated because they gate ion movements and generate electrical signals in response to glutamate. The behavior of the currents can be accurately captured using kinetic models that describe the transitions between conformational states of the channel. This class of models is commonly known as "Markov Models" [96]. A kinetic scheme is an instantiation of a Markov model and describes states, or conformations of the protein (open, closed, or desensitized), and the transition rates between them (the timing of the switching from one state to another). When the channel opens, ions flow and current enters the membrane.


Figure A.1 is a very simple kinetic scheme to illustrate how ion channel kinetics are modeled conceptually. The purpose is to demonstrate what a computer has to do to simulate kinetic properties. The figure on the top is the kinetic model, followed by the ODEs that describe this model. In this example, there are three kinetic equations, one for each state. Under the differential equations is the translation to compact form, with a vector for the derivatives, a matrix of rate constants, and a vector of the states on the right-hand side (a worked example of this form is given after Figure A.1's caption below). The matrix is solved to get the solution to the system of equations. However, in kinetic modeling rate constants can be voltage or concentration dependent, which is a non-linear function of time. These systems must be numerically approximated with more computationally demanding implicit solvers.

Scientists simulate the kinetics of ion channels to investigate how different inputs (e.g., for AMPARs, the alteration of the relative amount and temporal characteristics of glutamate stimulation) influence reaction rates and transition states of the kinetic scheme. Figure A.2 is a kinetic scheme proposed by [148]. In this study, the researchers used Monte Carlo simulations with a software program called ChannelLab (Synaptosoft, Inc.) and then manually compared thousands of different solutions to experimental data to find a good fit. Figure A.3 illustrates the implementation of this model and Figure A.4 shows that optimization improved the manual fit.


Figure A.1: This is a very simple model with three states, two closed (C1 and C2) and one open (O), and four rate constants that describe the transition rate between the states (the model's parameters). Figure modified from [97].
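
As a worked example of the compact form described above, take the scheme of Figure A.1 and label the four rate constants k1 (C1 -> C2), k2 (C2 -> C1), k3 (C2 -> O), and k4 (O -> C2); the labels are illustrative assumptions, since the figure defines the actual names. The three state equations are

\begin{align*}
\frac{d[C_1]}{dt} &= -k_1[C_1] + k_2[C_2] \\
\frac{d[C_2]}{dt} &= k_1[C_1] - (k_2 + k_3)[C_2] + k_4[O] \\
\frac{d[O]}{dt} &= k_3[C_2] - k_4[O]
\end{align*}

or, in the matrix form the text describes,

\[
\frac{d\mathbf{x}}{dt} =
\begin{pmatrix}
-k_1 & k_2 & 0 \\
k_1 & -(k_2 + k_3) & k_4 \\
0 & k_3 & -k_4
\end{pmatrix}
\mathbf{x},
\qquad
\mathbf{x} = \bigl([C_1],\ [C_2],\ [O]\bigr)^{T}.
\]

Each column of the rate matrix sums to zero, reflecting conservation: channels only move between states, so the total across all states stays constant.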


Figure A.2: The kinetic model proposed by [148]. This model has 16 states and 9 parameters. The "O" states are open with either 2, 3, or 4 bound glutamate molecules. The "R" states are closed with 0, 1, 2, 3, or 4 bound glutamate molecules. The "D" states are desensitized (have recently opened and won't reopen, even in the presence of glutamate, for the duration of the recovery from desensitization period). The desensitized states differ in the number of subunits occupied by glutamate and the number of subunits with partially closed binding domains. For example, channels in state $D^2_4$ have four glutamates bound and two partially closed binding domains [148].


Figure A.3: Implementation of the kinetic scheme (black traces) shows the extent that manually optimized parameters deviate from original data (red traces) [148]. (A) A hump in the steady-state current vs glutamate curve is a feature not explored previously as a non-ideal consequence of the original model, not seen in original data (red). The inhibition of currents by glutamate and the peak current are reasonably modeled to original data (red). (B) Time course of entry into desensitization (msec) is well approximated to original data (red). (C) Time course of current in response to 5 mM glutamate is similar to original data (red). (D) Recovery from desensitization deviates from Hodgkin-Huxley fits to original data (red traces) [148].


Figure A.4: Optimization improves the original data fit. (A) A hump in the steady-state current vs glutamate curve is still present, suggesting this is a consequence of the model structure and not due to inadequate optimization. The inhibition of currents by glutamate and the peak current are now closely modeled to original data (red). (B) Time course of entry into desensitization (msec) is well approximated to original data (red). (C) Time course of current in response to 5 mM glutamate is similar to original data (red). (D) Recovery from desensitization more closely follows Hodgkin-Huxley fits to original data (red traces).


Figure A.5 is a revised kinetic scheme to better describe previously published data [148]. Figure A.6 demonstrates that this model is a better fit to experimental data.

Figure A.5: The original kinetic scheme was altered to allow multiple open states from agonist bound receptors. This was justified by a recent report [84] suggesting that maximally bound receptors can transition to multiple open states, and it preserves the feature that only maximally bound receptors can reach the largest conductance open state (4). Transitions to open states are regulated by a factor (ep) that is influenced by ligand-binding domain closure [84]. All other rates are as defined in the original scheme.


Figure A.6: Optimization of the revised model further improves the original data fit. (A) A hump in the steady-state current vs glutamate curve is no longer present, suggesting this is a consequence of the model structure and not due to inadequate optimization. The inhibition of currents by glutamate and the peak current are now closely modeled to original data (red). (B) Time course of entry into desensitization (msec) is well approximated to original data (red). (C) Time course of current in response to 5 mM glutamate is similar to original data (red). (D) Recovery from desensitization more closely follows Hodgkin-Huxley fits. See Figure A.4 for a comparison of the fits.


The model proposed by [148] does not match the structure of the protein; AMPA receptors have four independent sub-units [157], but they are modeled as dependent units. Researchers have not been able to address this with improved models, in part due to computational limitations. Accelerating the simulations on parallel architectures enables us to evaluate several models simultaneously to find a better representation of the structure of the protein. The models shown in Figure A.7 are three examples of the modeling experiments we are performing. Each models the four subunits of AMPARs independently. The model numbering we have employed is a loose tracking system, and we are currently evaluating Model 17 (several models were retired quickly when they couldn't converge or didn't make biological sense).

(a) Model 10   (b) Model 12   (c) Model 16

Figure A.7: Three models we have implemented. (A) Subunits are independent. (B) Model 12 is Model 10, but the rate constants notated with a * are influenced by the glutamate concentration and therefore sensitive to the number of sub-units bound. (C) Another model we have implemented and tested that includes variations of previous models.

In summary, the model shown in Figure A.5 is a better fit and demonstrates the feasibility of an automated fitting approach. Newer models are being implemented in order to describe the full spectrum of biological data more completely. We use a genetic algorithm to rapidly optimize rate constants describing multiple previously published, hybrid and novel kinetic schemes. Through our modeling experiments,


we have determined that the data was best described by a hybrid kinetic scheme with cooperative agonist binding. Schemes incorporating full subunit independence did not fit the data as well. Only novel kinetic schemes which considered subunit interdependence were able to fit the data reasonably well. These modeling experiments are helpful in directing future experiments exploring the nature of conductance changes that could underlie plasticity.


A.3 Kingen Case Study

Case studies play an essential role in addressing programming, prediction, and development challenges and help mature computational science [138]. The case study profiled in this section was published in [63], and the focus is to examine the step-by-step process of adapting new and existing computational biology models to multicore and distributed memory architectures. We analyze different strategies that may be more efficient in multicore vs. manycore environments. This case study characterizes the challenges in computational science and demonstrates the urgent need for research that improves leverage of high performance computing architectures, in part by supporting "plug-in" adaptation [167] of widely used algorithms to specific platforms.

Kingen was developed to simulate AMPA receptor ion channel activity and to optimize kinetic model rate constants to biological data. Kingen uses a genetic algorithm to stochastically search parameter space to find global optima. As each individual in the population describes a rate constant parameter set in the kinetic model and the model is evaluated for each individual, there is significant computational complexity and parallelism in even a simple model run.

Implementation and optimization of kinetic schemes were initially coded in a serial fashion and were found to be prohibitively time consuming. We developed a process to adapt scientific applications to parallel architectures (Figure A.8). We integrated a strategy recommended by Intel engineers and conducted our analysis level-by-level, starting with system level analysis and drilling down to progressively finer levels of analysis. There is, of course, some interplay and overlap between the levels, but the approach defines a meaningful framework within our overall process.

A.3.1 Application Characterization and Profile


Figure A.8: Model illustrating the process we followed to characterize, profile, optimize, and port our application to parallel architectures.

Kingen simulates AMPAR ion channel activity and optimizes kinetic model rate constants to biological data. The program was originally developed sequentially in C but took too long to be useful (+ days). The execution time was prohibitive enough to prevent its use for exploring the implications of and validating existing and alternative models, testing a broader range of parameter sets, and conducting sensitivity analyses. An efficient method for modeling and simulation is a tremendous advantage to researchers in knowledge discovery.

Kingen was carefully restructured to be a highly parallelized program that approximates the solution of a system of linear differential equations that kinetically describe AMPAR-mediated ionic currents. The GA is used to optimize the model's rate constants. Kingen is designed to study AMPARs, but the approach is also applicable to other types of ion channels [185], [71].

Successful porting of scientific applications to multicore and manycore architectures heavily depends on primary characteristics of the application. We profiled Kingen and characterized our typical workload to optimize the application for performance and to identify how best to map it to parallel architectures.


A.3.1.1 System Level Analysis

System level analysis identifies network, disk, memory, and processor usage. Performance bottlenecks at the system level are likely to have a much greater impact than issues found at lower levels. Issues at the system level can also mask what may look like application or architectural level issues, sending developers down the wrong optimization path. We measured metrics for disk performance, memory usage, and processor usage using PIN and the VTune Performance Analyzer, analysis tools developed by Intel.

Performance increase is the driving factor for parallelizing an application, so a first check when porting a program to parallel architectures should be on thread utilization. Figure A.9 shows Kingen's thread profile and demonstrates that all available cores are kept busy most of the time, indicating that Kingen is processor-bound. Stalls due to memory or disk requests can't be happening very often, and the way to increase performance for this application is through more cores.

Figure A.9: Thread utilization is over 90% for all eight cores on a Windows multicore server. The wait and under-utilized portions reflect when the application enters the serial region for data I/O between generations. Across all cores, the threads are fully utilized 93% of the time, underutilized 4.8%, and serial 1.65%.

We also measured hardware performance counters to track available bytes of memory (remained constant), context switches/sec, processor queue length, % processor privileged time, and % processor time, and graphed them over runtime (not shown for space).


These results told the same story as Figure A.9: process utilization is high and all others low, until execution hits the serial region for I/O, when context switches and processor privileged time spiked. However, this increase and the related drop in processor time are not a concern, since the total runtime spent in serial execution is approximately 1.65% and has a negligible impact on overall performance.

A.3.1.2 Application Level Analysis

There are no system level concerns, so we continue with application level analysis. To tune and port our application, it was important to understand where it spends most of its time during execution. During optimization, it only makes sense to focus on those regions of code that dominate the runtime. Otherwise, significant man-hours can be spent hand-tuning code that will have little impact on performance goals.

During analysis of the runtime hotspots, we also looked at cycles per instruction retired (CPI) and at floating point (FP) related metrics, since it was clear Kingen is compute-intensive due to its dependence on FP operations. Through experimentation on how to reduce the CPI and FP performance impacting issues, it became clear we needed to upgrade our compiler and establish a new baseline. The upgrade changed, among other things, the instruction set default to require Intel Streaming SIMD Extensions 2 (Intel SSE2). According to Intel's release notes, this may change floating point results since the SSE instructions will be used instead of the X87 instructions. This had a significant impact on our application.

Table A.1 shows the break-down of runtime spent in each function under the two versions of Intel's compiler. The functions calc_funcs_ampa and calc_glut_conc are called from runAmpaLoop. Program hotspots show similar trends but demonstrate that many FP operations in the calc_funcs_ampa function were optimized in v11.1, specifically through the use of SIMD.

Table A.1: Comparison of Process Runtime Between Compilers

                      Version 10.1    Version 11.1
  calc_funcs_ampa     59.51%          30.45%
  runAmpaLoop         40.04%          40.99%
  calc_glut_conc      0.45%           2.16%
  operator[]          0%              25.92%
  get_delta           0%              0.48%

Table A.2 shows the difference in CPI and FP related metrics under the two versions of Intel's compiler. This program, compiled with version 11.1, is approximately 9.4 times faster than with version 10.1.


According to Intel's documentation, a CPI of 0.75 is considered good and a value of 4 is considered poor; an FP assist of 0.2 is considered low and 1 is considered high. The FP instructions ratio measures overall FP activity. The significant FP operations affecting performance were addressed by upgrading to an optimizing compiler.

Table A.2: Comparison of CPI and FP Impacting Metrics

           CPI      FP Assist Performance Impact    FP Instructions Ratio
  v10.1    3.464    0.85                            0.13
  v11.1    0.536    0.0011                          0.0028

We analyzed the hotspots from the three functions and discovered there were many redundant calculations being performed inside a loop that can iterate over a million times. And this loop is called many times to simulate different states of the model (e.g., recovery from desensitization). We restructured the code to remove redundancies and the overhead inherent in function calls (we integrated calc_funcs_ampa within the caller function). These optimizations had a significant impact on performance (2x speedup), but runtime is still dominated by the runAmpaLoop functions, as shown in Table A.3 below. The bottleneck is in the ODE solver, which is typical for scientific applications.
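
The flavor of the redundancy removal is sketched below: a loop-invariant product is hoisted out of a hot loop. The variable names here are hypothetical stand-ins, not Kingen's actual code.

#include <vector>

// Illustrative only (hypothetical names, not Kingen's actual code):
// an earlier version recomputed the product k_on * glut_conc * volt_factor
// on every iteration of a loop that can run over a million times.
// Hoisting the loop-invariant work out removes the redundancy.
void integrateStates(std::vector<double>& state, double h,
                     double k_on, double glut_conc, double volt_factor) {
    const double scale = k_on * glut_conc * volt_factor;  // computed once
    for (std::size_t i = 0; i + 1 < state.size(); ++i)
        state[i + 1] = state[i] + h * scale * state[i];
}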


Table A.3: Percentage of the Process Runtime By Function

  runAmpaLoop       91.83%
  calc_glut_conc    4.4%
  ge                0.02%
  libm_sse2_exp     0.02%
  All others        3.73%

A.3.1.3 Computer Architecture Analysis

For computer architecture related analysis, we measured various metrics related to the L1 data cache, DTLB ratios, and L2 cache ratios. All ratios of interest were negligible. Specifically, the L1 and L2 cache miss rate was 0, the L2 modified lines eviction rate was 0, and the L1 data cache miss performance impact was 0. Our experiments indicate that Kingen has a small instruction footprint that fits in the L1 instruction cache, and typical workloads are small, which the L2 cache is effectively handling.

The only related ratio that registered was the load rate, which is 0.69. One memory read operation can be served by a core each cycle, and Intel documentation says a high value for this ratio indicates that execution time may be bound by memory read operations. Previous analysis disputes this in our case, but load efficiency is an important concern when mapping applications to manycore architectures like GPUs, which have much smaller shared memory than CPUs.

Figure A.10 shows Kingen's instruction mix to characterize the dominant work in the application. Predictably, FP operations represent the largest percentage of retired instructions.


Figure A.10: Instruction mix for Kingen. Floating point operations (x87 + SIMD) dominate the percentage of instructions retired.


A.3.2 Computing Framework

We implemented and ran Kingen on several different parallel architectures: multicore, using Intel's Threading Building Blocks (TBB) library; a multicore cluster with 192 compute cores (16 nodes, 12 cores per node), using MPI; and an Nvidia GPU system for the compute-intensive application kernels. We evaluate our parallel implementation, speedup, and the computational complexity of the multicore and cluster implementations in the following subsections. GPU analysis is detailed in Chapter ??.

TBB is developed by Intel as a template library that extends C++. The TBB API includes basic algorithms for common parallel patterns (e.g., loop parallelization and reductions), thread-safe containers, scalable shared memory allocators, mutexes, and a task-scheduler for scheduling non-blocking tasks. TBB abstracts CPU resources and allows parallelism to be expressed with constructs that were designed to be familiar to C++ developers. We implemented coarse-grained parallelism with parallel_for, a TBB template function that parallelizes loops that have independent iterations. The chromosome loop in the GA has independent iterations to evaluate each chromosome in the population, and this is where we implemented coarse-grained parallelism. The iteration space is broken up into chunks of work, and TBB runs each chunk on a separate thread.

Kingen's coarse-grained parallelism encloses a great deal of computational complexity. Figure A.11 is a graph of time as a function of the number of chromosomes being evaluated in each generation. The computational complexity between generations under several computing frameworks is apparent as the problem size quickly starts to grow exponentially.
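
A minimal sketch of the chromosome-loop parallelization with TBB's parallel_for follows; Chromosome and evaluateChromosome are stand-ins for Kingen's actual data structures and simulation/fitness code, not its real implementation.

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

// Stand-ins for Kingen's actual types and evaluation routine.
struct Chromosome { std::vector<double> rateConstants; double fitness = 0.0; };
double evaluateChromosome(const Chromosome& c);  // runs the simulation, returns error

// Coarse-grained parallelism: every iteration of the chromosome loop is
// independent, so TBB splits the iteration space into chunks and runs
// each chunk on a separate worker thread.
void evaluatePopulation(std::vector<Chromosome>& population) {
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, population.size()),
        [&](const tbb::blocked_range<std::size_t>& chunk) {
            for (std::size_t i = chunk.begin(); i != chunk.end(); ++i)
                population[i].fitness = evaluateChromosome(population[i]);
        });
}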


Figure A.11: Genetic algorithms rely on genetic diversity to improve convergence. As the number of individuals per generation increases, the problem starts to grow exponentially.

We also ported Kingen to a Linux-based computing cluster that has 17 nodes (a master and 16 compute nodes) with two 2.2 GHz AMD Opteron six-core processors per node. The cluster computing power includes 204 cores (12 cores on the master node and 192 cores on the compute nodes). The cluster enables us to examine much larger populations in the genetic algorithm. This is quantified in Figure A.11, where you can see how many individuals we can use in each generation on the cluster before our execution time grows exponentially (600).

We parallelized the application at the same coarse-grained level as we did on the multicore architectures, using MPI and a master-slave algorithm to automatically handle load distribution to the available cores. As soon as a given compute core is done processing one chromosome, it sends its results back to the master, and the master, knowing which core just became available for more work, sends that core another chromosome to be processed. This algorithm is elegant in that no one compute node is ever held up waiting for some other process to finish. Each compute core gets more work assigned as soon as it has completed its task, for as long as there are more chromosomes to process.
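
The sketch below captures the master-slave protocol just described: the master primes every compute core with one chromosome, then hands the next chromosome to whichever core reports a result. NUM_PARAMS, the message tags, and evaluate() are illustrative assumptions standing in for Kingen's rate-constant sets and simulation/fitness routine.

#include <mpi.h>
#include <vector>

static const int NUM_PARAMS = 9;
enum { TAG_WORK = 1, TAG_RESULT = 2, TAG_STOP = 3 };

double evaluate(const double* params) {
    double s = 0.0;  // placeholder for the RK4 simulation + error computation
    for (int i = 0; i < NUM_PARAMS; ++i) s += params[i];
    return s;
}

void runMaster(int workers, int numChromosomes) {
    std::vector<double> chromo(NUM_PARAMS, 1.0);  // illustrative parameter set
    int sent = 0, done = 0;
    for (int w = 1; w <= workers; ++w) {          // prime every compute core
        if (sent < numChromosomes) {
            MPI_Send(chromo.data(), NUM_PARAMS, MPI_DOUBLE, w, TAG_WORK, MPI_COMM_WORLD);
            ++sent;
        } else {
            MPI_Send(nullptr, 0, MPI_DOUBLE, w, TAG_STOP, MPI_COMM_WORLD);
        }
    }
    while (done < numChromosomes) {               // each result frees up a core
        double fitness; MPI_Status st;
        MPI_Recv(&fitness, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_RESULT, MPI_COMM_WORLD, &st);
        ++done;
        if (sent < numChromosomes) {
            MPI_Send(chromo.data(), NUM_PARAMS, MPI_DOUBLE, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
            ++sent;
        } else {
            MPI_Send(nullptr, 0, MPI_DOUBLE, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
        }
    }
}

void runWorker() {
    double params[NUM_PARAMS];
    for (;;) {
        MPI_Status st;
        MPI_Recv(params, NUM_PARAMS, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_STOP) break;
        double fitness = evaluate(params);
        MPI_Send(&fitness, 1, MPI_DOUBLE, 0, TAG_RESULT, MPI_COMM_WORLD);
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) runMaster(size - 1, 50);
    else           runWorker();
    MPI_Finalize();
    return 0;
}

Because a new chromosome is dispatched the moment a result arrives, no core waits on any other core's progress, which is the property the text highlights.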


Speedup analysis on the cluster indicates that our coarse-grained parallelism can be too coarse (see the discussion around Figure A.16) when working on multicore and smaller scaled systems. On more moderately parallel systems (as opposed to massively parallel systems), common in many research departments and test frameworks, it makes sense to explore more dynamic scheduling algorithms and/or finer-grained parallel approaches within the application. For example, in Kingen we calculate the error for a given rate constant chromosome many times under different conditions, for many different kinetic processes, and at different points in the curves that describe these processes (e.g., inhibition of currents by glutamate and the peak current, time course of entry into desensitization, time course of current in response to 5 mM glutamate, and recovery from desensitization under different concentrations of glutamate). Each of these errors is summed together to get the "fitness score" for each chromosome, and each simulation can be run in parallel, which is an additional level of data-level parallelism.

Our coarse-grained parallelism at the chromosome level runs each of the simulations sequentially for each chromosome. This workload can potentially be distributed differently, by parallelizing each error evaluation, for more efficient use of the idle cores that impact the speedup at different combinations of the number of execution cores and number of chromosomes under smaller workloads (small population size), as seen in Figure A.15(b) (speedup was also evaluated for 100 chromosomes with similar results, not shown). Of course, the communication overhead in MPI has to be managed carefully. We are evaluating different approaches to determine the effects they have on efficient mappings to different architectures. The runAmpaLoop simulation loop function computes the 4th order Runge-Kutta formulas given below to numerically integrate the differential equations that describe our kinetic scheme:

\[ x(t + h) = x(t) + \tfrac{1}{6}\,(F_1 + 2F_2 + 2F_3 + F_4) \tag{A.1} \]


where

\begin{align}
F_1 &= h\,f(t,\,x) \tag{A.2} \\
F_2 &= h\,f(t + \tfrac{h}{2},\; x + \tfrac{F_1}{2}) \tag{A.3} \\
F_3 &= h\,f(t + \tfrac{h}{2},\; x + \tfrac{F_2}{2}) \tag{A.4} \\
F_4 &= h\,f(t + h,\; x + F_3) \tag{A.5}
\end{align}

We are currently exploring options to alleviate the bottleneck in the computation of the AMPA functions. Typically, ODE solvers are inherently serial, since they time-step the solution over an interval and each time step is dependent on the calculations from the previous time step. This is true in our simulation loop. This does not mean parallelism can't be exploited, just that how it is done requires careful analysis. Any gain made in this loop will have a large impact on overall performance, since this is where Kingen spends the vast majority of runtime.

We may also be able to improve performance by changing how the data are structured and how computations are applied. Currently, the model is simulated and evaluated with the compute-intensive routine, RK4, for each chromosome. We may be able to restructure the code (including loop inversion) so that each time step of the simulations is evaluated for every chromosome. This approach is best suited for massively parallel GPU accelerators, since we can exploit data-parallelism with compute-intensive routines.

The simulation loop performs many complex calculations for each state in the kinetic model. Figure A.12 graphs this complexity by measuring the dependence height, number of operations, and number of memory operations involved in each state computation. Figure A.13 shows the dependence graphs for each state, using several as examples. These two figures demonstrate there are parallel opportunities within the sequential constraints of the simulation loop. The differential equation for each state represents 22-way parallelism, but the operations are very fine-grained and hard to exploit.
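
A minimal C++ sketch of one step implementing equations (A.1)–(A.5) follows. The State/Rhs types and the generic right-hand side f are illustrative assumptions; Kingen's actual solver applies these same four stages to the kinetic model's state vector inside runAmpaLoop, with the state-specific rate computations shown in Figure A.14.

#include <functional>
#include <vector>

using State = std::vector<double>;
using Rhs = std::function<State(double, const State&)>;  // f(t, x)

// One classical RK4 step, x(t+h) from x(t). Here f returns the raw
// derivative, so each stage corresponds to F_i = h * f(...) in
// equations (A.2)-(A.5), with the factor h folded into the final sum.
State rk4Step(const Rhs& f, double t, double h, const State& x) {
    auto plus = [](const State& base, double a, const State& dir) {
        State r(base.size());
        for (std::size_t i = 0; i < base.size(); ++i) r[i] = base[i] + a * dir[i];
        return r;
    };
    State f1 = f(t, x);
    State f2 = f(t + h / 2, plus(x, h / 2, f1));
    State f3 = f(t + h / 2, plus(x, h / 2, f2));
    State f4 = f(t + h, plus(x, h, f3));
    State out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = x[i] + (h / 6.0) * (f1[i] + 2 * f2[i] + 2 * f3[i] + f4[i]);
    return out;
}

The serial dependence the text describes is visible here: f2 needs f1, f3 needs f2, and f4 needs f3, so the four stages cannot be computed concurrently; the exploitable parallelism is across the elements of the state vector within each stage, and across chromosomes.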


Figure A.14 is the source code for the computations that are being performed in the four stages of RK4 and is the source for Figure A.12 and Figure A.13.

Figure A.12: Calculation complexity for the differential equations describing each state in the kinetic model, over dependence height, number of operations, and number of memory operations. The x-axis represents the ODE for each state in the kinetic model. The y-axis is the state calculation complexity.

A.3.3 Experimental Results and Analysis

We test Kingen on several different architectures including four cores on 32-bit Windows, four cores on 64-bit Linux, eight cores on 32-bit Windows, and 204 cores on a Linux-based cluster. Figure A.15 shows speedups we have obtained as a result of our step-by-step process and coarse-grained parallelism. Note that Figure A.15 is based on 50 chromosomes per generation; as the number of chromosomes per generation increases, the performance improvement is magnified.

Figure A.15(b) shows a flattening out of the linear speed-up we were achieving between 30 and 50 cores. This is explained by how the chromosomes (individuals in the genetic algorithm) are sent to each core using MPI messaging and the master-slave algorithm (see Table A.4). The workload modeled in Figure A.15 has 50 chromosomes to process. At first, all available compute cores receive one chromosome to process. The only exception is if there are fewer chromosomes than available cores, in which case most or many receive 1 (depending on the number of chromosomes) and the rest are idle; this happens with this workload when the number of cores is 52 or greater.


Figure A.13: Dependence graphs per state. Each is independent and points to further optimization opportunities in the code. They were computed for Eq. (A.2) (nw[2]) but are the same for equations (A.3), (A.4), and (A.5).


Figure A.14: Source code that is the basis for Figures A.12 and A.13.


(a) Multicore with TBB. (b) Multicore cluster with MPI.
Figure A.15: Speedup on several architectures with a baseline of 50 chromosomes. Timings were averaged over 10 executions. This is speedup over a serial C++ implementation with a greatly increased workload, as opposed to the original application with a less demanding workload. We achieved 15x speedup over the original but now have more demanding workloads (models with more parameters to fit, running for many more generations). (a) Quad-core and 8-core Windows and Linux multicore architectures. We set the baseline to our current model in serial, before redundancies in the hotspot region or other optimization opportunities were addressed. There is 2x speedup even sequentially (1 core) because the sequential baseline is post-compiler-upgrade and pre-manual code optimizations. (b) The 204-core Linux-based cluster with MPI. The baseline on this architecture includes all code optimizations.


With 51 execution cores, there is one master and 50 compute cores, one for each of the 50 chromosomes, and execution is bound by the time it takes to evaluate one chromosome per generation, on average.

After the first distribution of work, some compute nodes receive more chromosomes to process. For example, the first row of Table A.4 is read as: with 10 execution cores, 1 is the master and 9 are compute cores; the 9 compute cores process 5 chromosomes each on average (9 x 5 = 45 chromosomes accounted for), and ultimately 5 of those 9 must compute an additional chromosome (or 6 total), for a total of 50 chromosomes among all cores. It doesn't matter if all 9 cores compute 6 chromosomes or if just 1 core computes 6 chromosomes; the runtime is bound by the cores with the most work to do.

Table A.4: Chromosome Distribution

    Execution cores | Compute cores | Chromosomes per core (avg) | Chromosomes on most-tasked cores
    10              | 9             | 5                          | 6
    20              | 19            | 2                          | 3
    30              | 29            | 1                          | 2
    40              | 39            | 1                          | 2
    50              | 49            | 1                          | 2
    60              | 59            | 1                          | 1

Between 30 and 50 cores, where the speedup levels off, the number of chromosomes processed by the most-tasked cores is the same, two (Table A.4). This is why there is no speedup in this range. There is linear speedup between 10 and 20 cores and between 20 and 30 cores because the cores with the most work went from 6 chromosomes to 3 chromosomes and from 3 chromosomes to 2 chromosomes, respectively.
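The load pattern in Table A.4 is ceiling division: one core is the master, and the heaviest-tasked compute core processes ceil(50 / compute cores) chromosomes. The short program below (illustrative only, not part of Kingen) reproduces the table, including the idle cores that appear at 52 or more total cores.

    #include <algorithm>
    #include <cstdio>

    int main() {
        const int chromosomes = 50;
        for (int cores = 10; cores <= 60; cores += 10) {
            int compute = cores - 1;                     // one core is the master
            int busy = std::min(compute, chromosomes);   // cores that receive work
            int avg  = chromosomes / busy;               // floor: typical busy-core load
            int most = (chromosomes + busy - 1) / busy;  // ceil: heaviest core's load
            std::printf("%2d cores: %2d compute, %d avg, %d most-tasked, %d idle\n",
                        cores, compute, avg, most, compute - busy);
        }
        return 0;
    }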


The execution is bound by the number of chromosomes processed by the heaviest-tasked cores through the distribution of work. We don't see the speedup affected on the quad-core and 8-core multicore systems because, with any reasonable workload (50 is about the fewest chromosomes we can consider), we don't have enough cores to see this effect.

Figure A.16 demonstrates that speedup with a more substantial workload (many chromosomes) doesn't have the same dependence on the combination of the number of cores and the number of chromosomes that the application shows with smaller populations. We tested 1,000, 3,000 (shown in Figure A.16), and 5,000 chromosomes, and they all had similar speedup lines on the cluster. We believe there is a tipping point between 100 and 600 chromosomes where the effect seen in Figure A.15b is active and it makes sense to devote effort to keeping idle cores busy. This point is likely related to where the execution size starts to grow exponentially (see Figure A.11).

Figure A.16: Speedup using many more chromosomes on the 204-core Linux cluster. We had to experimentally derive the baseline sequential runtime (running 3,000 chromosomes sequentially takes prohibitively long). We measured the average execution of 1 chromosome over 20 examples and found little variation in the mean. We multiplied this average by 3,000 for the baseline sequential value in the graph. We compared this approach with smaller populations of 20 and 50 chromosomes over multiple generations and found that it yields accurate results.


A.3.4 Case Study Conclusion

Impressive performance gains for Kingen were achieved with a process that systematically proceeds through different levels of analysis. Improving the sequential version is important, as those gains were magnified in parallel. On a multicore architecture, the TBB implementation achieves a 15x reduction in runtime (days, with more demanding workloads, as compared with 30+ days sequentially with a simpler model). This analysis reveals there is more parallelism to exploit. The most efficient mapping depends on the architecture and is probably significantly different for multicore architectures than for manycore architectures. In addition, this application is ideally suited for GPU acceleration.

Scientific applications need to be thoroughly profiled, and typical workloads characterized, to reach performance goals and to map to new architectures. Researchers in HPC need reliable tools that reduce errors and increase throughput to facilitate porting to emerging systems.