Citation
Techniques to leverage data-parallel GPU acceleration for computer vision algorithms

Material Information

Title:
Techniques to leverage data-parallel GPU acceleration for computer vision algorithms
Creator:
Nichols, Allen Paul
Publication Date:
Language:
English
Physical Description:
ix, 57 leaves : illustrations ; 28 cm

Subjects

Subjects / Keywords:
Graphics processing units ( lcsh )
Computer vision ( lcsh )
Computer algorithms ( lcsh )
Parallel processing (Electronic computers) ( lcsh )
Computer algorithms ( fast )
Computer vision ( fast )
Graphics processing units ( fast )
Parallel processing (Electronic computers) ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 56-57).
General Note:
Department of Electrical Engineering
Statement of Responsibility:
by Allen Paul Nichols.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
747820349 ( OCLC )
ocn747820349
Classification:
LD1193.E54 2011m N52 ( lcc )

Full Text
TECHNIQUES TO LEVERAGE
DATA-PARALLEL GPU ACCELERATION FOR
COMPUTER VISION ALGORITHMS
by
Allen Paul Nichols
B.S., University of Colorado Denver, 2007
A report submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Electrical Engineering
2011


This report for the Master of Science
degree by
Allen Paul Nichols
has been approved
by
Daniel A. Connors
Robert Grabbe
Yiming Deng
Date


Nichols, Allen Paul (M.S., Electrical Engineering)
TECHNIQUES TO LEVERAGE DATA-PARALLEL GPU ACCELERATION FOR
COMPUTER VISION ALGORITHMS
Report directed by Professor Daniel A. Connors
ABSTRACT
Graphics Processing Units (GPUs) have proven to be a powerful and efficient computational
platform. An increasing number of applications demand more efficient computing power at a
lower cost. The modern GPU can natively perform thousands of parallel computations per
clock cycle, and it can far outperform the traditional CPU in terms of computational power,
measured in Floating Point Operations per Second (FLOPS). Traditionally, GPUs have been
used exclusively for graphics processing. Recent developments have allowed GPUs to be used
for more than just graphics processing and rendering. With a growing set of applications,
these new GPUs are known as GPGPUs (General Purpose GPUs). NVIDIA has developed the CUDA
(Compute Unified Device Architecture) API (Application Programming Interface), which
enables software developers to access the GPU through standard programming languages such
as C. CUDA gives developers access to the GPU's virtual instruction set, onboard memory and
parallel computational elements. Taking advantage of this parallel computational power can
result in significant speedup for many applications. One such class of applications is
computer vision algorithms. From the assembly line to home entertainment systems, the need
for efficient real-time computer vision systems is growing quickly. This report explores
the potential of using the CUDA API and NVIDIA GPUs to speed up common computer vision
algorithms. Through real-life algorithm optimization and translation, several approaches to
GPU optimization for existing code are proposed in this report.
This abstract accurately represents the content of the candidate's report. I
recommend its publication.
Daniel A. Connors


TABLE OF CONTENTS
Tables
Figures
Chapter
1. Introduction
2. Background and Motivation
2.1 Computer Vision
2.1.1 San Diego Vision Benchmark Suite
2.1.2 Image Segmentation
2.1.3 Disparity Map
2.1.4 Feature Tracking
2.1.5 Support Vector Machines
2.2 Data-level Parallelism
2.2.1 Data-Level Parallelism Opportunities with SD-VBS
2.2.2 The Diminishing Return of the Traditional CPU
2.3 Graphics Processing Unit Architecture
2.3.1 CUDA Overview
2.4 GPGPU Motivation
2.4.1 Arithmetic Intensity
2.4.2 Demonstration of GPU Execution Efficiency: Sorting
2.4.3 Demonstration of GPU Execution Efficiency: Array Reduction
2.4.4 Non-traditional Approach for Computer Vision Applications
3. Non-traditional Exploitation of GPUs
3.1 Data-size Based GPU versus CPU Execution
3.2 Computation Reformulation
3.3 Computation Speculation
4. Experimental Results
4.1 Data-size Based GPU versus CPU Execution
4.2 Computation Reformulation
4.3 Computation Speculation
5. Conclusion
Bibliography


TABLES
Table
3.1 Pre-calculating look-up tables for the SVM application
3.2 Look-up table access statistics for the SVM application


FIGURES
Figure
2.1 A Pre-Processed image (left) and the same image processed by the Image Segmentation
algorithm (right)
2.2 Stereo Image Inputs (left and right). Output image after processing by the Disparity
algorithm (bottom)
2.3 Series of sequential image inputs (left) and the resulting Representative Motion
Vectors (right)
2.4 A representative data set and the result of the Support Vector Machine algorithm
2.5 Thread Processing Cluster for the GTX200 Series GPU
2.6 Architecture diagram of the GTX200 series Graphics Processing Unit (GPU)
2.7 Example of CUDA processing flow
2.8 The CPU code implementation of matrix multiplication
2.9 Example code for invoking the memory transfers and kernel execution on the GPU for
matrix multiplication
2.10 Example code for memory transfers to and from the GPU for matrix multiplication
2.11 Example CUDA kernel code written for a GPU to do a matrix multiplication
2.12 Arithmetic Intensity for the SD-VBS and the SPEC 2006 benchmarks
2.13 Identifying hotspots: Percentage of dynamic execution of SD-VBS programs attributed
to program code
2.14 Operation counts for SD-VBS program hotspots
2.15 Sorting performance (CPU vs. GPU) for various sizes of a 1-dimensional array of
floating point numbers
2.16 Sorting performance detail for GPU Execution
2.17 Reduction performance (CPU vs. GPU) for various sizes of a 1-dimensional array of
floating point numbers
2.18 Reduction performance detail for GPU Execution
3.1 Code for Sobel_dX computation on CPU
3.2 Diagram of Sobel_dX computation on CPU
3.3 Code for Sobel_dX computation on GPU
3.4 Diagram of Sobel_dX computation on GPU
3.5 Code for SVM computation conditionally requiring polynomial calculation
3.6 Diagram of SVM potential polynomial calculation based on i and k
3.7 Distributions of loop iterations that require polynomial calculations in SVM for cif
input
3.8 Timeline of unique polynomial calculations in SVM for input cif
4.1 Image Segmentation performance change using GPU implementation
4.2 fSortIndices performance change using GPU implementation
4.3 Tracking kernels (BlurImage, Sobel_dX, Sobel_dY) performance change using GPU
implementation
4.4 Tracking kernels (BlurImage, Sobel_dX, Sobel_dY) GPU performance change
4.5 Tracking kernel BlurImage comparison of CPU and GPU: minimum, maximum, average
4.6 Tracking kernel Sobel_dX comparison of CPU and GPU: minimum, maximum, average
4.7 Tracking kernel Sobel_dY comparison of CPU and GPU: minimum, maximum, average
4.8 Tracking kernels (BlurImage, Sobel_dX, Sobel_dY) standard deviation as a percentage
of average
4.9 GPU performance for SVM application
4.10 GPU performance for each table access versus generation
4.11 GPU performance for SVM application


1. Introduction
Computer vision is the science that enables computer systems to extract information
from an image or a sequence of images. The development of computer vision is essential
for the advancement of a multitude of areas including medicine, entertainment and
security. Computer vision systems are useful for tasks such as industrial control,
event detection, informational organization and object modeling. Other domains of
computer vision systems include motion analysis, image restoration and scene
construction. With a wide variety of emerging applications, the demand for more advanced
computer vision systems is growing quickly.
There are several limiting factors when developing accurate real-time computer
vision systems. Most computer vision tasks require a great deal of mathematical
computation. For many computer vision algorithms, the analysis of a single image can take
anywhere from a few seconds to several hours. In short, computer vision algorithms
require a large number of computations as well as an equally large number of memory
values. Computer vision algorithms are applied to a broad range of environments, from
home entertainment systems to the operation of unmanned aerial and ground vehicles. Each
new generation of applications increases the need for computational resources.
Traditionally, software developers and even scientists have relied strictly on increases
in processor clock frequency as the primary method of gaining performance for the next
generation of applied algorithms.
Emerging technology constraints are slowing the rate of performance growth in
computer systems. Specifically, designers are finding it difficult to meet strict
processor power consumption and cooling constraints. Design modifications that address
power consumption generally limit processor performance and reduce peak operating
frequency, thus breaking the trend of increased system performance with every new
processor generation. As such, modern architectures have diverged from the clock speed
race into the multicore era, with multiple processing cores on a single chip. While the
multicore design strategy improves chip power density, there remains significant
potential for improving run-time power consumption.
Graphics Processing Units (GPUs) are commonly found on graphics cards and
main computer boards. These specialized processors have great potential for solving
a number of problems. Unlike traditional Central Processing Units (CPUs), the GPU
contains as many as several hundred mathematical computation cores. These cores
have recently evolved from performing only simple graphics computations into fully
capable processing engines. Because they contain so many processing cores, GPUs can
perform mathematical computations in a massively parallel manner.
Many applications and algorithms have the potential to take advantage of the
parallel processing capabilities of GPU systems. In most cases, problems possessing
data-level parallelism are best suited for GPU execution. Data parallelism focuses on
distributing large amounts of data across different parallel computing cores. A
problem is data-parallel if each core can perform the same task on different pieces of
the distributed data. There are distinct ranges of data-parallel problems. Small-scale
image processing that includes the parallel manipulation or analysis of pixels can be
achieved with multiprocessor extensions such as Single Instruction, Multiple Data
(SIMD). Larger data-parallel problems can be solved with large-scale distributed
systems consisting of multiple independent computers that communicate through a
computer network or network grid. Data-level parallel problems that are not large-scale
are ideal candidates for optimization. In this categorization, modern computer vision
applications fall into a unique domain, since they are more complex than simple image
processing tasks yet are not large enough to require the massive computing resources of
a distributed system.
This thesis investigates the potential of adapting computer vision algorithms to
execute on GPUs. As GPUs operate in a heterogeneous system in which both the
CPU and GPU perform some fraction of the computational work, there are unique
performance constraints to explore. This thesis considers two primary parameters in
developing optimal GPU solutions: problem size and problem reformulation.
Problem Size: For every application, the problem size (number of data items
to be processed) is a direct factor to consider when computing results on a GPU. Most
GPU systems are deployed as hardware accelerators for CPU systems. Because the GPU
includes its own memory system, a necessary step in the heterogeneous system is to
transfer data between the CPU and GPU memory systems. Applications with a smaller
input size may not benefit from using the GPU, even for the computationally intensive
sections, because the transfer cost can outweigh the computational gain. An optimal use
of GPUs would include either compile-time or run-time capabilities to detect
computation sizes and automatically choose the best resource (CPU only or CPU/GPU
combination) for the task. A compile-time technique would have the code written in such
a way that the executable chooses a particular resource at fixed computation
thresholds, while a run-time approach would allow the application to determine the
optimal method at run-time based on computation size, available resources and any other
computation parameters.
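As a rough illustration of the two selection strategies described above, the following C sketch chooses an execution resource from the problem size. The type and function names, the fixed threshold and the linear cost model are all assumptions made for illustration; they are not taken from this report, and real values would have to be measured for a particular CPU/GPU pair.

```c
#include <stddef.h>

/* Hypothetical resource labels; not part of CUDA or of this report. */
typedef enum { RUN_ON_CPU, RUN_ON_GPU } exec_target;

/* Assumed break-even point in data items. Placeholder value only. */
#define GPU_THRESHOLD_ITEMS ((size_t)65536)

/* Compile-time style decision: the threshold is fixed in the executable. */
exec_target choose_target_fixed(size_t num_items)
{
    return (num_items >= GPU_THRESHOLD_ITEMS) ? RUN_ON_GPU : RUN_ON_CPU;
}

/* Assumed linear cost model, in seconds. All fields are illustrative. */
typedef struct {
    double cpu_sec_per_item;       /* CPU compute cost per data item        */
    double gpu_sec_per_item;       /* GPU compute cost per data item        */
    double transfer_sec_per_item;  /* CPU<->GPU memory copy cost per item   */
    double gpu_fixed_overhead_sec; /* launch and setup cost paid once       */
} cost_model;

/* Run-time style decision: estimate both costs for this problem size and
 * pick the cheaper resource, falling back to the CPU if no GPU exists. */
exec_target choose_target_runtime(size_t num_items, const cost_model *m,
                                  int gpu_available)
{
    if (!gpu_available)
        return RUN_ON_CPU;

    double n   = (double)num_items;
    double cpu = m->cpu_sec_per_item * n;
    double gpu = m->gpu_fixed_overhead_sec
               + (m->transfer_sec_per_item + m->gpu_sec_per_item) * n;

    return (gpu < cpu) ? RUN_ON_GPU : RUN_ON_CPU;
}
```

Under this model the GPU is selected only once the per-item savings outweigh the fixed overhead and the transfer cost, which is the behavior the paragraph above attributes to small input sizes.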


Reformulation: For every application, there is a unique degree to which the
computation tasks can be transformed for the GPU resources. Often the reformulation
requires significant code development and sufficient experience with GPU programming
techniques. Several forms of reformulation for GPU execution have been studied for
sorting [1] and parallel reduction methods [1]. This thesis examines two techniques for
mapping non-traditional computations to GPUs: work reformulation and computation
speculation. These techniques are not usually exploited on traditional CPU systems,
as they appear to require more computational time. However, because the GPU can
perform a large number of computations very quickly, there appear to be
successful ways to exploit GPUs for non-data-level parallel algorithms.
This thesis is organized as follows: Chapter 2 discusses the motivation and
background of computer vision applications. Chapter 3 examines several examples of
problem solving on GPUs using reformulation techniques. Chapter 4, the experimental
results section, shows performance data for the various optimization cases. Finally,
Chapter 5 concludes this thesis.


2. Background and Motivation
2.1 Computer Vision
Computer vision systems are integrated within a wide variety of industrial and
scientific applications. Such systems extract key information from an image or series of
images necessary to complete a single specific task or a series of tasks. There are
increasing uses of computer vision in emerging software applications and product
development. In the scientific realm, computer vision methods exist in medical imaging,
intelligent robotics and topographical modeling. There are also many industrial
applications such as autonomous vehicles, visual surveillance, industrial inspection and
quality control. Even modern day digital cameras now apply simple algorithms for face
detection to ensure the best outcome in photographs.
Many computer vision applications already have well-defined algorithms and
associated methodologies. Most of these algorithms tend to be computationally
intensive as well as memory intensive. The computationally intensive nature of
computer-based vision algorithms has traditionally deterred the development of new
applications because results could not be generated in real-time. In many cases, only
off-line application of computer vision systems is expected. Nevertheless, as computer
performance generally increases in each generation, there is great potential for
real-time computer vision applications to become more common. However, as time
progresses, the data input available to computer vision algorithms grows as image size
and resolution increase. This increase in resolution allows for greater scientific
discovery and higher industrial precision. The trade-off is that more precise and more
widely available images further increase the demand for speed and the volume of
computations that must be performed. In many ways, computer system designs are
consistently behind the performance needs of the latest applications.
The application of multi-core and many-core processing architectures is
an area of current research and development. A system with multiple processor cores
would be ideal for a wide variety of computer vision applications. Each algorithm must
be carefully studied to determine whether all or part of the computation can be
performed in parallel. Some computer-based vision algorithms lend themselves to
computational parallelism while others do not.
2.1.1 San Diego Vision Benchmark Suite
To gain a better understanding of the possible advantages to be gained with
many-core systems, a set of common computer vision algorithms is carefully inspected
and tested. The specific subset of algorithms examined was developed by the Department
of Computer Science and Engineering at the University of California, San Diego, and is
known as the San Diego Vision Benchmark Suite (SD-VBS) [2]. The suite contains
applications from the following representative areas: Image Processing and Formation;
Image Analysis; Image Understanding; and Motion, Tracking and Stereo Vision. It
contains nine representative computer vision applications, and each application
includes a set of image inputs that vary in size.


2.1.2 Image Segmentation
The Image Segmentation algorithm processes an image by dividing it into
conceptual regions or segments. These regions include boundaries, borders and objects
that appear in the image. The algorithm functions on the premise that a set of pixels
share a common set of characteristics [2]. This algorithm is commonly applied to
fingerprint and face recognition. Other applications include medical imaging, machine
vision and computational photography.
To ease understanding of this complex algorithm, it is broken down here into
three sections. The first section constructs a similarity matrix, which is computed by
analyzing pixel pairs across the entire input image. The second section computes
discrete regions based on the results of the similarity matrix computation. The third
and final section normalizes the segmentation results from the regions previously
computed. This application is very computationally intensive because of the fine
granularity of its multiple complex operations [2]. Figure 2.1 shows an example of an
image after it has been processed by the Image Segmentation algorithm.
2.1.3 Disparity Map
This algorithm takes a pair of stereo images as input. The stereo images are
taken from slightly different positions looking in the same direction. This is similar
to a person taking a picture with the left eye looking through a traditional camera
viewfinder and then switching the camera to the right eye and taking an additional
picture. There are a number of applications in which this algorithm could be utilized.
One possible computer vision application would be to use two cameras as depth sensors
on a conveyor belt for an assembly line. After running this algorithm against the input
images, the computer control system would have depth information about where each
product is on the conveyor belt.
Figure 2.1: A Pre-Processed image (left) and the same image processed by the Image
Segmentation algorithm (right).
The Disparity algorithm computes depth information based on a pair of
stereo images. Take, for example, a robot with cameras placed as eyes for virtual
vision. This algorithm could be applied to the camera inputs, and the computer could
calculate depth information from the stereo input. Other industrial applications
include systems such as intelligent cruise control, pedestrian tracking and collision
avoidance.
The San Diego Vision Benchmark Suite has implemented this algorithm using the
concept of Stereopsis [2] which allows depth analysis to be performed based on a set of
stereo images. Figure 2.2 shows an example of a set of images that has been processed
by the Disparity algorithm and the generated disparity map.
This algorithm computes a dense disparity map between two images while
preserving any discontinuities resulting from image boundaries. The concept of dense
disparity mapping operates on the premise that every pixel of a given image is
important, not just sections or features of an image. Because every pixel is analyzed,
this algorithm is very computationally intensive. The general sections of the Disparity
algorithm are filtering, correlation, calculation of the sum of squared differences and
sorting.
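To make the sum-of-squared-differences step more concrete, the following C sketch scores candidate disparities for one window of a single image row and keeps the best one. It is a simplified illustration under stated assumptions, not the SD-VBS code: the function names, the 8-bit single-row layout and the convention that a left-image pixel at x matches a right-image pixel at x minus the disparity are all chosen here for the example.

```c
#include <stddef.h>

/* Sum of squared differences between a window of `width` pixels starting
 * at `x` in the left row and the same window shifted left by `disparity`
 * pixels in the right row. The caller must guarantee x >= disparity and
 * x + width <= row length. */
unsigned long ssd_row(const unsigned char *left, const unsigned char *right,
                      size_t x, size_t width, size_t disparity)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < width; ++i) {
        long d = (long)left[x + i] - (long)right[x + i - disparity];
        sum += (unsigned long)(d * d);
    }
    return sum;
}

/* Returns the disparity in [0, max_disparity] with the smallest SSD for
 * the window at `x`. Disparities larger than `x` are skipped because they
 * would read before the start of the right row. */
size_t best_disparity(const unsigned char *left, const unsigned char *right,
                      size_t x, size_t width, size_t max_disparity)
{
    size_t best = 0;
    unsigned long best_cost = ssd_row(left, right, x, width, 0);

    for (size_t d = 1; d <= max_disparity && d <= x; ++d) {
        unsigned long cost = ssd_row(left, right, x, width, d);
        if (cost < best_cost) {
            best_cost = cost;
            best = d;
        }
    }
    return best;
}
```

Because the score for each window position depends only on the two input rows, every position can be evaluated independently, which is the property that makes a dense disparity computation a candidate for data-level parallelism.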
Figure 2.2: Stereo Image Inputs (left and right). Output image after processing by the
Disparity algorithm (bottom).
2.1.4 Feature Tracking
The operation of feature tracking is fundamental in computer vision systems.
Feature tracking is the process of locating and characterizing moving objects given a
set of subsequent frames. When tracking is enabled in vision systems, multiple objects
in a given field of view can be monitored and analyzed with other computer vision
algorithms. Applications for this algorithm include robotic vision systems and
automotive traffic monitoring systems. Figure 2.3 shows an example of a series of
sequential image inputs and the resulting motion vectors that this algorithm generates.
The SD-VBS implementation uses the Kanade-Lucas-Tomasi (KLT) tracking
algorithm [2]. The overall algorithm can be broken down into three major sections.
The first section operates at pixel-level granularity, while the second and third
sections operate on coarse-grained data, or feature points. The first section is an
image processing phase that performs tasks such as noise filtering, gradient image
computation and image pyramid computation. This low-level image processing consists
mostly of Multiply-and-Accumulate computations. The second and third sections contain
the core functionality of the algorithm. These routines involve feature extraction and
feature tracking, respectively. The core functionality is based on a large number of
complex matrix operations and vector estimation.
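The Multiply-and-Accumulate pattern mentioned above can be sketched in C as a one-dimensional filter pass over a row of pixels. The function name, the float pixel format and the clamp-to-edge border handling are assumptions made for this illustration; the sketch does not reproduce the SD-VBS source.

```c
#include <stddef.h>

/* Applies a filter of odd length `ksize` to a row of `n` pixels (n >= 1).
 * Samples that fall outside the row reuse the nearest edge pixel. */
void filter_row(const float *in, float *out, size_t n,
                const float *kernel, size_t ksize)
{
    size_t half = ksize / 2;

    for (size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < ksize; ++k) {
            /* Source index is i + k - half, clamped to [0, n - 1]. */
            size_t j;
            if (i + k < half)
                j = 0;
            else if (i + k - half >= n)
                j = n - 1;
            else
                j = i + k - half;

            acc += in[j] * kernel[k];  /* multiply-and-accumulate */
        }
        out[i] = acc;
    }
}
```

A smoothing pass could use weights such as 0.25, 0.5, 0.25, while a gradient pass could use -1, 0, 1. In both cases each output pixel is computed independently of the others, so the outer loop is the natural place to distribute work across parallel cores.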
Figure 2.3: Series of sequential image inputs (left) and the resulting Representative
Motion Vectors (right).
2.1.5 Support Vector Machines
The support vector machine (SVM) algorithm is used for data classification and
regression analysis. For each application the algorithm separates the input data into
two categories. The categories are separated by the calculated maximal geometric margin
between the data sets. This classic machine learning algorithm is closely related to
neural networks and is a form of generalized linear classifier. Figure 2.4 shows a
representative data set with lines depicting the maximal margins.
The SVM algorithm is organized into two distinct stages. The first stage is
the training phase, in which the SVM classifier is trained on the training data. When
the training phase is complete, the classifier has a polynomial function that describes
the learning data. The second stage applies the input data set to the classifier.
As the algorithm continues, these two stages are iterated multiple times to achieve
higher accuracy of the polynomial function.
Figure 2.4: A representative data set and the result of the Support Vector Machine
algorithm. H3 (green) does not separate the two classes. H1 (blue) does, with a small
margin, and H2 (red) does so with the maximum margin.
2.2 Data-level Parallelism
Data parallelism is a method of parallel computing across multiple processors.
This type of parallelism focuses on distributing the data across different parallel
computing nodes. When multiprocessor systems are given a data parallel task, each
processor performs the same task on a different piece of data. Flynn's taxonomy
classifies this as Single Instruction, Multiple Data or SIMD [3]. In some instances, a
single execution thread controls all of the ongoing operations. In other situations,
different threads control the ongoing operations, but ultimately all threads execute
the same code.
We call these algorithms data parallel because their parallelism comes from
simultaneous operations using large sets of data [4]. To illustrate data-level
parallelism, consider a system with ten processor cores. The task at hand is to add
together two matrices with ten elements in each matrix. Traditional code can be written
so that each processor does one addition and store. At run-time, each processor
executes exactly the same instructions, but on different pieces of data. For example,
matrix addition can assign each matrix position to a unique processor and complete many
times faster than on a system with only one processor core. In this particular example
each matrix position is independent of the surrounding positions, making data-level
parallelism possible.
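The ten-element addition described above can be sketched in C as follows. The function names and the core_id parameter are hypothetical and chosen for this illustration; the loop is a serial stand-in for ten cores that would each execute the same function on a different index.

```c
#include <stddef.h>

#define NUM_ELEMENTS 10

/* The work assigned to a single processing core: one addition and one
 * store. Every core runs this same code on a different element. */
void add_one_element(const float *a, const float *b, float *c,
                     size_t core_id)
{
    c[core_id] = a[core_id] + b[core_id];
}

/* Serial stand-in for ten cores. No iteration reads a value written by
 * another iteration, so the iterations could run in any order or all at
 * once without changing the result. */
void add_matrices(const float *a, const float *b, float *c)
{
    for (size_t core_id = 0; core_id < NUM_ELEMENTS; ++core_id)
        add_one_element(a, b, c, core_id);
}
```

On a GPU the loop would disappear and the index would come from the thread's position in the launch grid, but the per-element function body would remain essentially the same.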
2.2.1 Data-Level Parallelism Opportunities with SD-VBS
Large portions of the SD-VBS benchmark suite exhibit forms of data-level
parallelism. Upon careful inspection of the internal workings of the algorithms, it can
be seen that a large portion of the routines are performing the same task on different
pieces of data repeatedly. There are also several portions of the algorithms which are
dependent on the previous iteration and cannot be parallelized at the data level.
In the Disparity algorithm, one section of the code calculates an output matrix
based on two input matrices. The algorithm takes a given point from each matrix and
performs a simple calculation. The result is then stored in an output matrix. Examples
like this lend themselves very nicely to data-level parallelism. Other portions of this
algorithm allow for the same optimization.
The Image Segmentation algorithm operates at pixel granularity, which results
in a large number of repetitive operations. The execution time of this algorithm is
almost completely consumed by sorting a one-dimensional array of numbers. A sort
performed in parallel can complete significantly faster than a serial sort.
A closer look at the feature tracking algorithm shows several opportunities
for data-level parallelization. The image processing phase operates on the entire
image, which makes it parallelization friendly. The feature extraction and feature
tracking portions operate at feature-level precision and consist of complex matrix
operations. These complex operations involve matrix inversions and motion vector
estimations and are computationally intensive. While parallelism is possible, it is
more challenging for this algorithm [2].
The support vector machine is by nature irregular and random. The sections
of this algorithm are not necessarily considered candidates for data-level parallelism.
The algorithm is iterative and consists mostly of complex computations. It can be
optimized with thread-level parallelism and instruction-level parallelism [2].
2.2.2 The Diminishing Return of the Traditional CPU
Microprocessors that feature a single processing core have evolved over several
decades. These single-core processors are the driving force behind the development of
the modern personal computing platform. For example, the Intel Pentium processor,
released in 1993, could perform over one billion floating-point operations per second
(FLOPS) [5]. By comparison, the Intel Core i7 processor, released in 2008, is rated
for 50 billion FLOPS [6].
The continued advancement of software applications has steadily grown the
demand for faster and more powerful processing. As this demand has increased,
software developers have mostly relied on hardware advancement to increase the speed
and performance of the new applications they develop [7]. This advancement began to
slow when energy consumption and heat dissipation became limiting factors in increasing
the clock speed of the processor. In order to increase performance while keeping energy
consumption down, processor designers increased the number of processing cores on
a single chip die [8]. Modern day personal computing platforms have as many as six
processing cores on a single chip die.
Even with six cores on a single chip, today's CPU is not optimized for
large-scale data-level parallelism. The x86 instruction set does include the Streaming
SIMD Extensions [9], which are inherently parallel but very limited. A closer look at
the architecture of CPUs reveals that only about twenty percent of the chip area is
dedicated to arithmetic [10]. The main design goal of a CPU is the complex control
of the entire computer. Traditionally, CPUs are designed for low-latency operation and
for keeping their pipelines busy (high cache hit rates and efficient branch
prediction), not for high bandwidth.
For many years, consumer-grade software has been developed to run in a
serial manner and typically utilizes only a single processing core. Software written
with this serial methodology will not experience a significant performance increase
with the addition of computing cores. The software development community must optimize
current and future applications for multicore systems to achieve the best run-time
performance. This relatively new incentive for parallel program development is referred
to as the concurrency revolution [11]. High-performance parallel computing is not a new
concept to the computing industry. The scientific industry has been using large-scale
clusters of single-core processors for several decades to solve large complicated
problems [7].
2.3 Graphics Processing Unit Architecture
The underlying architecture of the GPU is optimized for data-level parallelism.
Taking a closer look at the NVIDIA GPU reveals the chip is organized into an array
of highly threaded streaming multiprocessors (SMs). Each streaming multiprocessor
consists of several streaming processors (SPs). Streaming multiprocessors are grouped
into thread processing clusters (TPCs). The number of SPs per SM depends on the
generation of the chip. The GTX200 series consists of eight SPs per SM and three SMs
per TPC [10]. Each SM cluster has a portion of shared control logic and instruction
cache as well as texture filtering processors. Figure 2.5 shows a TPC block diagram
for the GTX200 series GPU. The GTX200 chip has 240 streaming processors (SP).
Every SP has a multiply-add unit plus an additional multiply unit. The combined
power of that many SPs in this GPU exceeds one teraflop [7]. Each SM contains a
special function unit which can compute floating-point functions such as square root and
transcendental functions. Each streaming processor is threaded and can run thousands of
threads per application. Graphics cards are commonly built to run 5,000-12,000 threads
simultaneously on this GPU. The GTX200 can support 1,024 threads per SM and 30,720
threads for the entire chip [10]. In contrast, the Intel Core i7 series can support two
threads per core.
[Block diagram: cores, TF units, L1 cache]
Figure 2.5: Thread Processing Cluster for the GTX200 Series GPU.
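The chip-level figures quoted for the GTX200 are mutually consistent, which a few lines of C can confirm. The sketch below uses only the numbers given in the text (eight SPs per SM, three SMs per TPC, 240 SPs, 1,024 threads per SM); the constant and function names are illustrative assumptions and are not part of any NVIDIA API.

```c
/* Figures quoted in the text for the GTX200 series. */
enum {
    SPS_PER_SM     = 8,    /* streaming processors per multiprocessor */
    SMS_PER_TPC    = 3,    /* multiprocessors per thread processing cluster */
    TOTAL_SPS      = 240,  /* streaming processors on the chip */
    THREADS_PER_SM = 1024  /* resident threads supported per multiprocessor */
};

/* Number of SMs implied by the SP total. */
int gtx200_sm_count(void) { return TOTAL_SPS / SPS_PER_SM; }

/* Number of TPCs implied by the SM count. */
int gtx200_tpc_count(void) { return gtx200_sm_count() / SMS_PER_TPC; }

/* Chip-wide resident thread capacity. */
int gtx200_max_threads(void) { return gtx200_sm_count() * THREADS_PER_SM; }
```

The 240 SPs correspond to 30 SMs arranged in 10 TPCs, and 30 SMs at 1,024 threads each give the 30,720 threads stated for the entire chip.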
GPUs are optimized for the execution throughput of a massive number of threads.
The hardware takes advantage of this by switching to different threads while other
threads wait for long-latency memory accesses. This methodology enables minimal
control logic for each execution thread [7]. Each thread is very lightweight and requires
very little creation overhead.
From a memory perspective, the GPU is architected quite differently than a CPU.
Each GPU currently comes with up to four gigabytes of Graphics Double Data Rate
(GDDR) DRAM which is used as global memory. For the GTX200 series, every SM
has twenty-four kilobytes of dedicated memory (shared between SPs). This dedicated
memory boasts a one-cycle access time. The architecture of the GPU is designed to
exploit arithmetic intensity and data-level parallelism. Figure 2.6 shows the architecture
of the GTX200 series.
[Block diagram: Host, Input Assembler, Vertex/Geom/Pixel Thread Issue, Setup and
Rasterize, L2 texture caches, GDDR3 memory controllers]
Figure 2.6: Architecture diagram of the GTX200 series Graphics Processing Unit
(GPU).
Graphics processors have traditionally been designed for very specific specialized
tasks. Most of their transistors perform calculations related to 3D computer graphics
rendering. Typically GPUs perform memory-intensive work such as texture mapping
and rendering polygons. The GPU also performs geometric calculations such as rotation
and translation of vertices into different coordinate systems. The on-chip programmable
shaders can manipulate vertices and textures.
Specialized video decoding processes are optimized on the modern GPU. These
processes include:
Motion compensation (mocomp)
Inverse discrete cosine transform (iDCT)
Inverse telecine 3:2 and 2:2 pull-down correction
Inverse modified discrete cosine transform (iMDCT)
In-loop deblocking filter
Intra-frame prediction
Inverse quantization (IQ)
Variable-Length Decoding (VLD), more commonly known as slice-level acceler-
ation
Spatial-temporal deinterlacing and automatic interlace/progressive source de-
tection
Bitstream processing (CAVLC/CABAC).
The programming interfaces and instruction sets change quite frequently in the
graphics processing realm. New versions of GPU APIs such as Microsoft's Direct3D and
OpenGL (originally developed by SGI) are released every year. Not only are these APIs
quickly changing to meet demand, they are also very specialized for graphics rendering
and video processing.
GPUs by nature are designed as numeric computing engines, but their use has
been historically limited by graphics-oriented APIs. These APIs are designed with
unusual programming models and developers must be experts in computer graphics to
use them. To advance the use of GPGPU programming, BrookGPU [12] and Sh [13] were
projects that had the goal of abstracting the GPU as a streaming processor. The Brook
model was based solely on a streaming computation abstraction in which data is represented
as streams and computation as kernels [14].
NVIDIA has developed a parallel computing architecture which is known as
the Compute Unified Device Architecture (CUDA). This computing engine, which is the
core of modern NVIDIA GPUs, is accessible to software developers through extensions
of industry standard programming languages. The development of CUDA has enabled
developers to access the virtual instruction set and memory of the GPU. This has
enabled the exploitation of the native parallel computational elements of the NVIDIA
GPU. Figure 2.7 shows a high-level overview of the steps required for data transfer and
code execution on the GPU.
Figure 2.7: Example of CUDA processing flow.
2.3.1 CUDA Overview
The CUDA model is designed for maximum compatibility. NVIDIA utilizes
a soft instruction set, which enables the GPU designers to change the low-level
hardware and instruction set without having to address backwards compatibility.
NVIDIA has also built scalability into the CUDA model: CUDA code is not tied to a
specific release of the NVIDIA GPU. This can be contrasted with traditional CPUs,
where the hard instruction set is published. CPU software developers often optimize
their programs for the number of cores available, which can change as new CPUs are
released.
As seen in Figure 2.7, a CUDA program consists of multiple execution phases.
Depending on the phase, the execution involves the CPU, the GPU or both. Portions
of the CUDA code are executed on the GPU while other portions are executed on the
CPU. The NVIDIA compiler, known as nvcc, translates the code for the GPU and CPU
accordingly. This model is easy to work with because the device code is written
in ANSI C extended with keywords for labeling data-parallel functions, called kernels,
and their associated data structures [14].
The GPU kernels create a large number of threads on the device in order to take
advantage of data parallelism. For example, a matrix multiplication problem would be
very simple to implement using the CUDA API. If the matrix was a 1000 x 1000 matrix,
the kernel would invoke 1,000,000 threads on the GPU to perform the computation.
Each thread block is responsible for computing one square sub-matrix [15]. Furthermore,
each thread within the block is responsible for computing one element of the sub-matrix.
CUDA threads are much lighter weight than CPU threads. In most cases, CPU
threads take thousands of clock cycles to generate and schedule, while CUDA
threads take only a few [14].
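The block and thread decomposition described above can be emulated sequentially in plain C to make the index arithmetic concrete. This is a minimal sketch under stated assumptions, not CUDA code: the names thread_body, matmul_blocked and tile are hypothetical, the loops stand in for threads that a GPU would run concurrently, and the matrix width is assumed to be a multiple of the block edge length.

```c
/* Work done by one "thread": one element of P = M * N
 * (row-major storage, width x width matrices). */
static void thread_body(const float *M, const float *N, float *P,
                        int width, int row, int col)
{
    float sum = 0.0f;
    for (int k = 0; k < width; ++k)
        sum += M[row * width + k] * N[k * width + col];
    P[row * width + col] = sum;
}

/* Sequential stand-in for a kernel launch. The two outer loops walk the
 * grid of blocks; the two inner loops walk the threads inside one block,
 * so each block fills one tile x tile sub-matrix of P.
 * Assumption: width is a multiple of tile. */
void matmul_blocked(const float *M, const float *N, float *P,
                    int width, int tile)
{
    for (int by = 0; by < width / tile; ++by)
        for (int bx = 0; bx < width / tile; ++bx)
            for (int ty = 0; ty < tile; ++ty)
                for (int tx = 0; tx < tile; ++tx)
                    thread_body(M, N, P, width,
                                by * tile + ty, bx * tile + tx);
}
```

For a 1000 x 1000 matrix the four loops together make 1,000,000 calls to thread_body, one per output element, which matches the thread count given in the text.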
The matrix multiplication problem is a great example to explore. This common
mathematical computation is easily translated into CUDA code. The CPU code for
performing this multiplication is shown in figure 2.8. This example assumes that the
matrices have already been allocated and generated. Furthermore, it assumes that these
two dimensional array elements are stored in a linear addressed memory system. The
inner loop iterates over the variable k and steps through a column of N and a row of
M. The loop then calculates the corresponding dot product. The outer loops (i and j)
jointly iterate over all of the rows and columns of the matrix MN. Finally, the outermost
loop identifies every row in the resultant matrix P.
void MatrixMul(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i) {
        for (int j = 0; j < Width; ++j) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[i * Width + k];
                float b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
    }
}
Figure 2.8: The CPU code implementation of matrix multiplication.
Now that the traditional implementation has been explored, the following shows
how this code translates into data-parallel CUDA code. The CUDA API allows the
programmer to continue to use the same function call within the code. This is useful
because many programs can be modified for GPU execution without a major
upset in the program architecture.
As previously discussed, GPU execution happens in three basic phases. The first
phase is to allocate memory on the device and copy the input matrices to the GPU.
The second phase is invoking the kernel on the GPU device itself. In the third phase,
once the computation is complete, the resulting matrix is copied from the GPU device
to the CPU.
Phase one, the memory allocation and transfer step, is similar to its CPU counterpart.
A memory allocation command (similar to malloc) is called to allocate memory on the
GPU device. After the memory is allocated, the CPU transfers the data to the GPU for
processing. The GPU stores this data in the onboard DRAM of the GPU card. Figure 2.10 shows
the code to transfer to and from device memory. To put this concept in perspective,
the NVIDIA T10 processor comes with up to 4 GB of DRAM.
Next, the kernel is invoked on the GPU card itself. The GPU card is optimized for
efficient thread scheduling and execution. With the data available in DRAM, execution
can begin. The GPU card is also optimized for memory access and caching. If one
thread is waiting for DRAM, another thread executes until the stalled thread can
continue. With single-cycle cache available to the core-blocks, the threads
execute quickly and efficiently.
The final phase of the CUDA cycle is retrieving the result from the GPU card
(see Figure 2.10). This step is fairly straightforward, as it is a simple memory transfer
from the GPU back to the CPU.
The code shown in Figure 2.9 is separated into functions similar to the three
phases that have been discussed. This code is the overall function that calls the
memory transfer functions as well as the kernel itself.
The amount of kernel code itself is relatively short in comparison to the setup
code. This kernel is responsible for creating all of the threads that were previously
mentioned. The CUDA kernel code is shown in figure 2.11.
Some problems do not lend themselves to porting to the GPU as easily as
this example. However, many problems can be analyzed and parallelized with moderate
effort. Many factors must be taken into account when considering porting code
to the GPU.
void MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P)
{
    // Interface host call to the device kernel code and invoke the kernel
    // Device matrices
    Matrix Mdevice;
    Matrix Ndevice;
    Matrix Pdevice;
    // Allocate and copy the M and N matrices to the device
    Mdevice = AllocateDeviceMatrix(M);
    CopyToDeviceMatrix(Mdevice, M);
    Ndevice = AllocateDeviceMatrix(N);
    CopyToDeviceMatrix(Ndevice, N);
    // Allocate the P matrix on the device
    Pdevice = AllocateDeviceMatrix(P);
    // Define thread and grid dimensions
    dim3 threads(MATRIX_SIZE, MATRIX_SIZE);
    dim3 grid(WP / threads.x, HP / threads.y);
    // Launch the device threads to calculate matrix P
    // Invoke the CUDA Kernel
    MatrixMulKernel<<<grid, threads>>>(Mdevice, Ndevice, Pdevice);
    // Copy the resulting P matrix from the device
    CopyFromDeviceMatrix(P, Pdevice);
    // Free the device matrices
    cudaFree(Mdevice.elements);
    Mdevice.elements = NULL;
    cudaFree(Ndevice.elements);
    Ndevice.elements = NULL;
    cudaFree(Pdevice.elements);
    Pdevice.elements = NULL;
}
Figure 2.9: Example code for invoking the memory transfers and kernel execution on
the GPU for matrix multiplication.
// Copy a host (CPU) matrix to a device (GPU) matrix.
void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost)
{
    int size = Mhost.width * Mhost.height * sizeof(float);
    Mdevice.height = Mhost.height;
    Mdevice.width = Mhost.width;
    Mdevice.pitch = Mhost.pitch;
    cudaMemcpy(Mdevice.elements, Mhost.elements, size,
               cudaMemcpyHostToDevice);
}
// Copy a device (GPU) matrix to a host (CPU) matrix.
void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice)
{
    int size = Mdevice.width * Mdevice.height * sizeof(float);
    cudaMemcpy(Mhost.elements, Mdevice.elements, size,
               cudaMemcpyDeviceToHost);
}
Figure 2.10: Example code for memory transfers to and from the GPU for matrix
multiplication.
2.4 GPGPU Motivation
General-purpose computing on graphics processing units (GPGPU) is the tech-
nique of using a GPU, which typically handles computation only for computer graphics,
to perform computation in applications traditionally handled by the CPU. There are
several opportunities to apply data-level parallel techniques of GPUs to non-traditional
programming tasks. The GPU is optimized for data-parallel operations such as scan,
sort, search, data queries, differential equations and linear algebra. This optimization
allows for a wide range of applications such as databases or scientific simulations such as
fluid dynamics [14]. The data-parallel architecture of the GPU requires programming
idioms that are by no means new to the parallel computing world.
// Matrix multiplication kernel thread specification
__global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
{
    // Multiply the two matrices
    // Get the column and row id
    int column = threadIdx.x;
    int row = threadIdx.y;
    // Calculate the P matrix result and the column and row
    float Presult = 0;
    // Calculate the P result value by multiplying the
    // M row with the N column and adding the values
    for (int i = 0; i < WM; i++)
    {
        float Mvalue = M.elements[row * WM + i];
        float Nvalue = N.elements[i * WN + column];
        Presult += Mvalue * Nvalue;
    }
    // Set the P result value
    P.elements[row * WP + column] = Presult;
}
Figure 2.11: Example CUDA kernel code written for a GPU to do a matrix multiplica-
tion.
2.4.1 Arithmetic Intensity
Arithmetic intensity (AI) is defined as the number of operations performed per
word of memory transferred. With today's computing hardware, computation is
relatively cheap, but memory bandwidth is a scarce commodity. This is especially true
for GPUs with their inherent massive floating-point computational ability.
Traditionally, the best predictor of good GPU performance on a specific problem is a
high level of arithmetic intensity. To this end, two benchmark suites, SD-VBS and
the SPEC 2006 [16] were examined for their characteristics of arithmetic intensity. The
SPEC CPU benchmark suite is designed to provide performance measurements that can
be used to compare compute-intensive workloads on different computer systems. SPEC
has held to the principle that better benchmarks are based on applications and these ap-
plications can come from any area of work. For example, the current SPEC CPU2006
suite includes applications from the following areas: AI game theory, bioinformatics,
chemistry, compilers, interpreters, data compression, physics, speech recognition, video
processing, and weather prediction.
To investigate arithmetic intensity, an experimental tool was developed using the
Pin [17] instrumentation engine. The Pin instrumentation framework allows a program's
execution to be traced to collect operation counts, memory addresses and control
behaviors. Figure 2.12 shows the arithmetic intensity (ratio of arithmetic operations to
memory operations) of the SD-VBS benchmarks alongside the SPEC 2006 benchmarks.
The results are interesting in that the two sets of benchmarks have nearly the
same average arithmetic intensity, around 2.2. However, the SD-VBS suite spans
a wider range of intensity: its applications are more likely to be significantly higher
or lower than the average. In the case of SPEC 2006, the applications are more
consistently close to the average arithmetic intensity.
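As a worked illustration of the ratio, the following C sketch computes arithmetic intensity from operation counts and applies it to the naive matrix multiplication of Figure 2.8. The counting model is an assumption made for illustration (one multiply, one add and two loads per inner-loop iteration, plus one store per output element, with no cache reuse); it is not the Pin-based tool used for the measurements, and the function names are hypothetical.

```c
/* Arithmetic intensity: arithmetic operations per memory operation. */
double arithmetic_intensity(unsigned long long arith_ops,
                            unsigned long long mem_ops)
{
    return mem_ops == 0 ? 0.0 : (double)arith_ops / (double)mem_ops;
}

/* Counts for the naive width x width matrix multiply of Figure 2.8,
 * assuming every array access goes to memory (no cache reuse). */
double matmul_intensity(unsigned long long width)
{
    unsigned long long inner = width * width * width;   /* inner-loop trips */
    unsigned long long arith = 2ULL * inner;            /* multiply + add   */
    unsigned long long mem   = 2ULL * inner             /* loads of M and N */
                             + width * width;           /* stores to P      */
    return arithmetic_intensity(arith, mem);
}
```

Under this model the naive multiply has an intensity just under 1, which suggests why reusing loaded data on chip matters for raising the ratio.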
Benchmarks from the SD-VBS set having high arithmetic intensity are Localization,
SIFT, and Stitch. These applications appear to be naturally well suited for
GPU execution. For example, Scale-Invariant Feature Transform (SIFT) is a computer
vision algorithm that detects and describes local features in images. The algorithm
was published by David Lowe in 1999 [18]. Applications of SIFT include object
recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture
recognition, and video tracking. A considerable amount of research has already been
done on developing GPU versions of the SIFT algorithm [19]. For the remaining parts
of this thesis, the benchmarks Multi_ncut, SVM, and Tracking were modified to operate
on GPUs using the CUDA framework. These applications have the lowest arithmetic
intensity and thus represent interesting opportunities to show how non-traditional codes
may be transformed for GPU systems.
Figure 2.12: Arithmetic Intensity for the SD-VBS and the SPEC 2006 benchmarks.
Another relevant factor in considering the use of GPUs for any application is the
amount of execution credited to small regions of code. GPUs are effectively hardware
accelerators in a heterogeneous environment. In short, large portions of application code
may still be best suited for CPU execution. However, if small important kernels of code
account for large portions of the dynamic execution, then the task of off-loading some of
the execution to the GPU is straightforward. To this end, this thesis evaluated the SD-
VBS benchmarks using the GNU gprof profiling tool. Several top functions were isolated
to account for the majority of program execution. Furthermore, using a Pin profiling
tool to collect the code execution counts for each assembly instruction, Figure 2.13 and
Figure 2.14 were created. Figure 2.13 shows the cumulative percentage of dynamic
program execution attributed to the program's static percentage of operations. The
curves show that 5% of the static operations account for over 55% of the dynamic
execution of all benchmark programs. In more detail, 5% of the code covers 90% of
execution for applications MSER, Stitch, Localization and SVM. Based on the
same curves, 15% of the static code covers 80% of the execution for all programs.
The benchmark Multi_ncut is the application with the lowest percentage of dynamic
execution covered by the program code. In contrast, MSER has one of the highest
coverages, as 3% of the static code covers nearly 99% of the dynamic execution.
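A coverage curve of this kind can be derived from per-instruction execution counts by sorting the counts in descending order and accumulating them. The C sketch below illustrates the idea; it is an assumption about how such a curve could be computed, not the Pin tool used in this work, and the names dynamic_coverage and by_count_desc are hypothetical.

```c
#include <stdlib.h>

/* qsort comparator: descending execution counts. */
static int by_count_desc(const void *a, const void *b)
{
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;
    return (x < y) - (x > y);
}

/* Fraction of dynamic execution covered by the hottest static_fraction
 * of the static instructions. counts[i] is the execution count of static
 * instruction i; the array is sorted in place. */
double dynamic_coverage(unsigned long long *counts, size_t n,
                        double static_fraction)
{
    unsigned long long total = 0, covered = 0;
    size_t hot = (size_t)(static_fraction * (double)n + 0.5); /* nearest */
    if (hot > n)
        hot = n;

    qsort(counts, n, sizeof counts[0], by_count_desc);
    for (size_t i = 0; i < n; ++i) {
        total += counts[i];
        if (i < hot)
            covered += counts[i];
    }
    return total == 0 ? 0.0 : (double)covered / (double)total;
}
```

Evaluating the function at 5%, 10% and 15% of the static code gives the points of the curve that the text discusses.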
[Plot legend: Disparity, Localization, MSER, Multi_ncut, SIFT, Stitch, SVM, Texture,
Tracking. X-axis: Percentage of Instructions Causing Execution, 0% to 50%]
Figure 2.13: Identifying hotspots: Percentage of dynamic execution of SD-VBS
programs attributed to program code.
Figure 2.14 expands on the cumulative percentage of dynamic execution by
presenting the exact number of assembly operations that make up a large percentage of
the program's execution. Three separate data bars are presented, showing the
distribution of 5%, 10% and 15% of the static code. For example, in the case of Disparity, 5%
of the static code amounts to approximately 550 assembly code operations. Given the
coverage trends of Figure 2.13, across most benchmarks, 99% of the dynamic execution
covered by 15% of the static program code can be implemented with as few as 2250
assembly operations.
Figure 2.14: Operation counts for SD-VBS program hotspots.
2.4.2 Demonstration of GPU Execution Efficiency: Sorting
The inherent calculations of computer vision algorithms take many different
forms. Numerical sorting, for example, illustrates how GPUs can attain significant
performance increases over traditional CPUs in certain cases. For relatively small sets
of data, the CPU is very efficient at sorting. To illustrate CPU vs. GPU
performance for sorting, Figure 2.15 shows the relative performance of a sort
performed on a CPU vs. the same sort performed on the GPU. One can see that near
the middle of the graph, the GPU overtakes the CPU in terms of performance. For this
particular example, the GPU overtakes the CPU when the array to be sorted reaches
6,000 floating point numbers. In some cases, the CPU may outperform the GPU, since
the GPU must perform memory transfers to and from its dedicated RAM. The setup
overhead of tasking the GPU often lets the CPU win for small data sets: the CPU can
finish a brute-force sort of the array in less time than is required to send the task
to the GPU for processing. In the case of 2 million elements, the GPU speedup for
sorting is nearly 60x over CPU performance.
[Chart title: GPU vs. CPU Sorting Performance (with GPU overhead). Series: CPU Sort
Time, GPU Sort Time, GPU Sorting Performance Multiplier]
Figure 2.15: Sorting performance (CPU vs. GPU) for various sizes of a 1-dimensional
array of floating point numbers.
[Chart title: GPU Sorting Performance Detail. Series: Memory Copy (CPU->GPU), GPU
Execution (computation), Memory Copy (GPU->CPU), GPU Execution Percentage of
total GPU Time]
Figure 2.16: Sorting performance detail for GPU Execution.
2.4.3 Demonstration of GPU Execution Efficiency: Array Reduction
GPUs also excel at reduction algorithms such as finding the sum, average,
minimum, maximum or variance of a collection of numbers. Traditionally, a CPU would
step in a serial fashion through every value to find the sum, for example. The GPU
implementation reduces the array over a number of threads, effectively using
one thread for every pair of numbers. Each time the add is performed, the number of
elements shrinks until the final solution is reached. Figure 2.17 shows the relative
reduction performance of summing an array on a CPU vs. summing the same array
on the GPU. If the array size is under 6000 elements, the CPU outperforms the GPU.
However, as the data sizes increase, the GPU decisively outperforms the CPU.
In this experiment, 4 million elements were reduced to a single summation value
3.5x faster with the GPU.
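The pairwise reduction can be sketched sequentially in C. This is an illustrative emulation under stated assumptions, not the CUDA kernel used in the experiment: the inner loop stands in for the threads that would run concurrently on a GPU, and the function name reduce_sum is hypothetical.

```c
#include <stddef.h>

/* Pairwise (tree) sum. On each pass, "thread" i adds element i + half
 * into element i, so the number of live elements roughly halves every
 * pass. The array is overwritten and the total ends up in data[0]. */
float reduce_sum(float *data, size_t n)
{
    if (n == 0)
        return 0.0f;
    while (n > 1) {
        size_t half = (n + 1) / 2;              /* live elements after pass */
        for (size_t i = 0; i + half < n; ++i)   /* one iteration per thread */
            data[i] += data[i + half];
        n = half;
    }
    return data[0];
}
```

The number of passes grows with the logarithm of the array length, whereas a serial sum performs one dependent addition per element.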
In this particular example, a great deal of time is consumed due to overhead.
Figure 2.18 shows the memory transfer times side-by-side with the actual GPU execution
time. In the case where 4 million elements are reduced, merely 3.6% of the total time
is spent on the actual computation.
[Chart title: GPU vs. CPU Reduction Performance (with GPU overhead). Series: CPU
Reduction Time, GPU Reduction Time, GPU Execution Performance Multiplier]
Figure 2.17: Reduction performance (CPU vs. GPU) for various sizes of a 1-dimensional
array of floating point numbers.
[Chart title: GPU Reduction Performance Detail. X-axis: Number of Elements Reduced,
3,128 to 4,000,000. Series: Memory Copy (CPU->GPU), GPU Execution (computation),
Memory Copy (GPU->CPU), GPU Execution Percentage of total GPU Time]
Figure 2.18: Reduction performance detail for GPU Execution.
2.4.4 Non-traditional Approach for Computer Vision Applications
Clearly, for traditional computational cases such as sorting numbers and reducing an
array of numbers, the GPU offers a significant performance advantage over CPUs.
Likewise, since small code regions of computer vision algorithms account for substantial
portions of execution, there is good potential for using a GPU to accelerate the
execution of some code functions. However, several benchmarks from the SD-VBS suite
do not appear to have the arithmetic intensity required to map to the data-parallel
nature of GPU systems. While compilers can perform transformations to change some of
the fundamental execution characteristics of applications, more significant
programmer-based reformulations are necessary to synthesize new computational models.
Often such reformulation techniques are not considered by the original programmer
since the resulting transformation would not achieve performance gains on a
traditional CPU. In the next section, the thesis explores reformulation of a few
SD-VBS benchmarks.
3. Non-traditional Exploitation of GPUs
3.1 Data-size Based GPU versus CPU Execution
In most applications there is a section of code that runs efficiently on the GPU.
However, this is not true in all cases, specifically for small amounts of data. For
small tasks, the CPU performs very well and does not require any additional memory
transfers between processing elements. For example, when sorting a small
amount of data, the CPU will win every time due to the setup overhead of GPU
execution. When the dataset is sufficiently large, the GPU will perform better than the
CPU.
For a given program or algorithm, the input data set can vary in size depending
on what the algorithm is processing. Computer vision algorithms are a great example
of how input sizes can vary at run-time. All computer vision algorithms need an image
input, which often comes from a camera or from a saved image taken with a camera.
With the variety of cameras and resulting picture quality, there exists a broad spectrum
of image input sizes and resolutions.
The approach to get the best performance using the GPU to complement the
CPU would be to implement intelligent run-time program execution. Conceptually, the
program would decide at run-time which device is best suited to take on the upcoming
operation. Code written in this manner will serve to complement CPU execution.
This will ultimately support the exploitation of the heterogeneous parallel computing
capability of a CPU/GPU system.
In the case where an algorithm needs to sort an array, the program would decide
if the CPU or the GPU could do the sort faster. During run-time, the program would
look at the array and make a determination based on the number of elements contained
in the array. Matrix multiplication is another example where the GPU could outperform
the CPU if the matrix size is sufficiently large.
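A run-time decision of this kind can be sketched in C as follows. The threshold is taken from the sorting experiment of Figure 2.15 and would in practice have to be measured per operation and per machine; the GPU path is a stub that falls back to the CPU so that the example stays self-contained, and all names (device_t, choose_sort_device, adaptive_sort, gpu_sort_stub) are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

typedef enum { DEVICE_CPU, DEVICE_GPU } device_t;

/* Break-even size reported for sorting in Figure 2.15 (assumed here to
 * apply unchanged; a real system would calibrate it). */
#define SORT_GPU_THRESHOLD 6000

/* Pick the device expected to finish first for n elements. */
device_t choose_sort_device(size_t n)
{
    return n >= SORT_GPU_THRESHOLD ? DEVICE_GPU : DEVICE_CPU;
}

static int cmp_float(const void *a, const void *b)
{
    float x = *(const float *)a, y = *(const float *)b;
    return (x > y) - (x < y);
}

/* Hypothetical GPU entry point. A real build would copy the data to the
 * device and launch a CUDA sort; this stub sorts on the CPU instead. */
static void gpu_sort_stub(float *data, size_t n)
{
    qsort(data, n, sizeof data[0], cmp_float);
}

/* Sort in place and report which path was selected. */
device_t adaptive_sort(float *data, size_t n)
{
    device_t dev = choose_sort_device(n);
    if (dev == DEVICE_GPU)
        gpu_sort_stub(data, n);
    else
        qsort(data, n, sizeof data[0], cmp_float);
    return dev;
}
```

The same pattern would apply to matrix multiplication or reduction, each with its own break-even size.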
3.2 Computation Reformulation
An example of computation reformulation appears in the application Tracking,
which has three main routines that account for a large percentage of execution:
Sobel_dX, Sobel_dY, and BlurImage. The Sobel computation is commonly used in image
processing, particularly within edge detection algorithms. Figure 3.1 shows the CPU
implementation of the Sobel computation algorithm.
Figure 3.2 illustrates the computation for one point of the output image of the
X-based Sobel operation. The Sobel calculation uses a window size of three to generate
the output value in two distinct phases. The first phase takes three horizontal values
(the point to the left, the center point, and the point to the right) and multiplies them
respectively by the values of the first kernel array. The second phase takes three
vertical points (the point in the previous row, the center point, and the point in the
next row) and multiplies them respectively by a second kernel array. In this example,
for each calculated point, there are a total of six multiplies and four additions.
/* Code for Sobel_dX computation on the CPU */
for(i=startRow; i<...; i++) {
    for(j=startCol; j<...; j++) {
        temp = 0;
        for(k=-halfKernel; k<=halfKernel; k++) {
            temp += subsref(imageIn,i,j+k) * asubsref(kernel_2,k+halfKernel);
        }
        subsref(tempOut,i,j) = temp/kernelSum_2;
    }
}
for(i=startRow; i<...; i++) {
    for(j=startCol; j<...; j++) {
        temp = 0;
        for(k=-halfKernel; k<=halfKernel; k++) {
            temp += subsref(tempOut,(i+k),j) * asubsref(kernel_1,k+halfKernel);
        }
        subsref(imageOut,i,j) = temp/(float)kernelSum_1;
    }
}
Figure 3.1: Code for Sobel_dX computation on CPU.
[Diagram: imageIn to tempOut via Kernel2, then tempOut to imageOut via Kernel1]
Figure 3.2: Diagram of Sobel_dX computation on CPU.
The GPU solution using the NVIDIA CUDA interface describes the computation
task of each individual thread. In this implementation each thread is responsible
for calculating nine multiplies and eight additions, nearly 50% more work than the CPU
version performs. The resulting code demonstrates the set of code reformulations that
are necessary to generate a data-parallel computation from an existing computation
meant for sequential execution. Figure 3.3 shows the CUDA code reformulation for the
Sobel_dX computation.
temp0  = SM_REF(s_data,col-1,row-1, SOBEL_DX_RADIUS) * kernel2_0;
temp0 += SM_REF(s_data,col-1,row,   SOBEL_DX_RADIUS) * kernel2_1;
temp0 += SM_REF(s_data,col-1,row+1, SOBEL_DX_RADIUS) * kernel2_2;
temp1  = SM_REF(s_data,col,row-1,   SOBEL_DX_RADIUS) * kernel2_0;
temp1 += SM_REF(s_data,col,row,     SOBEL_DX_RADIUS) * kernel2_1;
temp1 += SM_REF(s_data,col,row+1,   SOBEL_DX_RADIUS) * kernel2_2;
temp2  = SM_REF(s_data,col+1,row-1, SOBEL_DX_RADIUS) * kernel2_0;
temp2 += SM_REF(s_data,col+1,row,   SOBEL_DX_RADIUS) * kernel2_1;
temp2 += SM_REF(s_data,col+1,row+1, SOBEL_DX_RADIUS) * kernel2_2;
temp = temp0 * kernel1_0 + temp1 * kernel1_1 + temp2 * kernel1_2;
d_base[global_base_index] = temp/(float)(kernelSum2*kernelSum1);
Figure 3.3: Code for Sobel_dX computation on GPU.
The GPU implementation is far more computationally efficient. Each thread
calculates the appropriate data point based on the input data. Figure 3.4 illustrates the
work performed by each thread on a GPU machine. From the code it can be seen how
the thread loads nine points of data from memory to perform the calculation.
An interesting note is the GPU implementation actually uses nine data points to
perform the calculation. In contrast the CPU uses six data points (three horizontal and
three vertical). The GPU architecture has an advantage over the CPU because each
thread has quick one-cycle access to the window data.
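The relationship between the two-pass CPU form and the per-point GPU form can be checked with a small C sketch. It is a simplified illustration under stated assumptions: the normalization by the kernel sums is omitted, the kernels are passed in as parameters, the horizontal pass is run over every row so that the vertical pass can read the neighbouring rows, and the names PIX, sobel_two_pass and sobel_fused are hypothetical rather than taken from SD-VBS.

```c
/* Row-major accessor for an image that is `width` pixels wide. */
#define PIX(img, width, r, c) ((img)[(r) * (width) + (c)])

/* Two-pass (separable) form: horizontal pass with kh into tmp, then a
 * vertical pass with kv into out. Only interior points of out are written. */
void sobel_two_pass(const float *in, float *tmp, float *out,
                    int rows, int cols,
                    const float kh[3], const float kv[3])
{
    for (int r = 0; r < rows; ++r)
        for (int c = 1; c < cols - 1; ++c) {
            float t = 0.0f;
            for (int k = -1; k <= 1; ++k)
                t += PIX(in, cols, r, c + k) * kh[k + 1];
            PIX(tmp, cols, r, c) = t;
        }
    for (int r = 1; r < rows - 1; ++r)
        for (int c = 1; c < cols - 1; ++c) {
            float t = 0.0f;
            for (int k = -1; k <= 1; ++k)
                t += PIX(tmp, cols, r + k, c) * kv[k + 1];
            PIX(out, cols, r, c) = t;
        }
}

/* Fused form: each output point reads its own 3x3 window, so no point
 * depends on an intermediate image. This is the per-thread view. */
void sobel_fused(const float *in, float *out, int rows, int cols,
                 const float kh[3], const float kv[3])
{
    for (int r = 1; r < rows - 1; ++r)
        for (int c = 1; c < cols - 1; ++c) {
            float acc = 0.0f;
            for (int dr = -1; dr <= 1; ++dr) {
                float rowsum = 0.0f;
                for (int dc = -1; dc <= 1; ++dc)
                    rowsum += PIX(in, cols, r + dr, c + dc) * kh[dc + 1];
                acc += rowsum * kv[dr + 1];
            }
            PIX(out, cols, r, c) = acc;
        }
}
```

Both functions produce the same interior values. The fused form repeats work that the two-pass form shares through the intermediate image, but each output point becomes independent of every other, which is the property that allows one thread per point.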
[Diagram: imageIn to tempOut]
Figure 3.4: Diagram of Sobel_dX computation on GPU.
3.3 Computation Speculation
In some computer vision applications the amount of computation depends on the
content of the data itself. Conceptually, as the content of an image changes, there may be
more points of data exposed to certain computer vision algorithms. One example of this
variable amount of computation is found in the application Support Vector Machines
(SVMs). SVMs are a set of related supervised learning methods that analyze data and
recognize patterns, used for classification and regression analysis. The original SVM
algorithm was invented by Vladimir Vapnik and the current standard implementation
(soft margin) was proposed by Corinna Cortes and Vladimir Vapnik [20]. The standard
SVM takes a set of input data, and predicts, for each given input, which of two possible
classes the input is a member of, which makes the SVM a non-probabilistic binary
linear classifier. Given a set of training examples, each
marked as belonging to one of two categories, an SVM training algorithm builds a model
that predicts whether a new example falls into one category or the other. Intuitively,
an SVM model is a representation of the examples as points in space, mapped so that
the examples of the separate categories are divided by a clear gap that is as wide as
possible. New examples are then mapped into that same space and predicted to belong
to a category based on which side of the gap they fall on.
In the case of the S VM benchmark, there is a loop with a conditional statement in
the loop body. Depending on the value of an array indexed by the loop iteration variable,
38


the conditional is taken and requires the program to execute additional computation.
The result is that the loop execution takes a variable amount of execution time and
is difficult to parallelize on multiple processing cores. Figure 3.5 illustrates the loop
example from the SVM application in which the calculation does not conform to a
traditional data-parallel problem. In this example, the function polynomial computes the
dot product of two 256-entry vectors, rows i and k of the array Vector. The multiplication
represents a significant amount of work; however, only a few iterations of the loop pass
the conditional check inside the loop.
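#include <math.h>

/* Dot product of two 256-entry vectors, raised to the power exp
   and scaled by dimension. */
float polynomial(float exp, float A[], float B[], float dimension) {
    float sum = 0;
    for (int index = 0; index < 256; index++) {
        sum = sum + A[index] * B[index];
    }
    return (pow(sum, exp) / dimension);
}

/* Loop fragment: only iterations that pass the check pay for polynomial.
   The bound N (number of rows, as labeled in Figure 3.6) is assumed. */
for (i = 0; i < N; i++) {
    if (subsref(a,i,0) > 0) {
        count++;
        s += asubsref(a,i) * asubsref(Y,i)
             * polynomial(3, Vector[i], Vector[k], 256);
    }
}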
Figure 3.5: Code for SVM computation conditionally requiring polynomial calculation.
Figure 3.6 shows two vectors i and k potentially multiplied together in SVM. In
terms of optimizing this particular type of problem, it is difficult to determine if there
is a performance increase in off-loading the computational work to the GPU. If the
given input dictates that most of the conditionals are not taken, then the additional
computation would be skipped. This presents an interesting problem of finding the
[Diagram labels: rows 0 to N-1 with Vector[i] and Vector[k] marked; columns 0 to 255]
Figure 3.6: Diagram of SVM potential polynomial calculation based on i and k.
methodology for best optimization using the CPU and the GPU. For this case, one
approach would be to have the GPU pre-calculate every possible computation needed
by the iterative loop and store the results in a table. This would result in performing
far more computations than the algorithm strictly needs. For a
CPU system, the additional computation would be prohibitive. For a GPU system,
the additional computation is comparatively cheap, because the table entries are
independent of one another and can be computed in parallel.
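The pre-calculation idea can be sketched in plain C. This is a minimal CPU-side sketch under stated assumptions, not the report's GPU code: the names polynomial3, build_table and lookup are hypothetical, the exponent 3 is taken from the call in Figure 3.5, and the table stores all n*n ordered pairs for simplicity. Each entry depends only on rows i and k, so on a GPU each entry could be assigned to its own thread.

```c
#include <assert.h>
#include <stddef.h>

#define DIM 256  /* vector length used in Figure 3.5 */

/* Polynomial kernel on two DIM-entry vectors: (A . B)^3 / dimension.
   The exponent is hard-coded to 3 (an assumption taken from the call in
   Figure 3.5) so that the sketch does not need the math library. */
float polynomial3(const float A[], const float B[], float dimension) {
    float sum = 0.0f;
    for (int index = 0; index < DIM; index++) {
        sum += A[index] * B[index];
    }
    return (sum * sum * sum) / dimension;
}

/* Speculatively fill table[i*n + k] for every pair (i, k), whether or not
   the later loop will ask for it. Entries are independent of each other,
   which is what makes this step data-parallel. */
void build_table(float vectors[][DIM], int n, float *table) {
    assert(n > 0 && table != NULL);
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            table[(size_t)i * n + k] =
                polynomial3(vectors[i], vectors[k], (float)DIM);
        }
    }
}

/* The conditional loop then replaces the polynomial call with a lookup. */
float lookup(const float *table, int n, int i, int k) {
    assert(i >= 0 && i < n && k >= 0 && k < n);
    return table[(size_t)i * n + k];
}
```

With the table in place, the body of the conditional in Figure 3.5 would read a stored value instead of recomputing the 256-element dot product each time the check passes.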
Table 3.1 shows, for the given inputs of the SVM benchmark, how much of the
table is calculated and consequently used, as well as the table size and number of table
accesses made. For example, when the algorithm is run with the cif input, 98.47%
of the table is used during execution, and the table contains 5050 entries and is accessed
97839 times in the program. For non-test inputs, a table of pre-calculated results would
have between 85% and 98% of the pre-calculated entries referenced. This indicates that
over the long run of the program, there is good reason to build the table all at once.
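The usage figures reported in Table 3.1 (share of entries used, total accesses) could be gathered with a small amount of bookkeeping around each table lookup. The sketch below is an assumption about how such instrumentation might look; the report does not describe how the counts were collected, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    unsigned char *seen;  /* seen[e] is 1 once entry e has been accessed */
    int size;             /* number of table entries */
    int unique;           /* distinct entries accessed so far */
    long accesses;        /* total accesses so far */
} usage_tracker;

/* Start tracking a table of `size` entries; `buffer` must hold `size` bytes. */
void tracker_init(usage_tracker *t, unsigned char *buffer, int size) {
    assert(t != NULL && buffer != NULL && size > 0);
    memset(buffer, 0, (size_t)size);
    t->seen = buffer;
    t->size = size;
    t->unique = 0;
    t->accesses = 0;
}

/* Record one access to table entry `entry`. */
void tracker_record(usage_tracker *t, int entry) {
    assert(t != NULL && entry >= 0 && entry < t->size);
    t->accesses++;
    if (!t->seen[entry]) {
        t->seen[entry] = 1;
        t->unique++;
    }
}

/* Percentage of table entries touched at least once. */
double tracker_percent_used(const usage_tracker *t) {
    return 100.0 * t->unique / t->size;
}
```

Sampling the unique count after every loop invocation would also yield a cumulative curve of table entries accessed over time.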
Table 3.2 shows that for each input there are phases of the program that call the
polynomial function. The polynomial function is inside a loop (see Figure 3.5). The
second column is the number of times the loop is called. The third column shows how
many times that loop iterates when it is called. For example, cif calls the loop 6441
Table 3.1: Motivation to pre-calculate the entire table
Input      Percentage of Table Used   Table Size   Run-Time Accesses to Table
cif        98.47%                     5050         97839
qcif       97.60%                     2628         30126
sqcif      97.32%                     1830         30907
sim        94.11%                     136          1753
sim_fast   85.23%                     210          1776
test       50.00%                     10           20
times, and each invocation runs 100 iterations. Based on the next columns for cif, the
minimum number of times that polynomial is called across the 6441 invocations is zero, the
average is 15.2 times out of 100, and the maximum number of times polynomial is called
is 30 out of 100. Finally, the standard deviation shows the variation from one loop
invocation to the next. Overall, the per-invocation results show that a single loop execution would
not alone justify computing all of the potential values of a table.
Table 3.2: Number of accesses to table per loop invocation.
                                                Percentage of Loop Execution
Input      Loop Invocations   Loop Iterations   Min    Avg     Max     Standard Deviation
cif        6441               100               0.0%   15.2%   30.0%   9.0%
qcif       2635               72                0.0%   15.9%   41.7%   11.5%
sqcif      2197               60                0.0%   23.4%   50.0%   14.8%
sim        466                16                0.0%   23.5%   37.5%   15.6%
sim_fast   214                20                0.0%   41.5%   75.0%   27.1%
test       18                 4                 0.0%   27.8%   75.0%   32.2%
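The table sizes in Table 3.1 equal n(n+1)/2 for the loop iteration counts n in Table 3.2 (for example, 100 * 101 / 2 = 5050 for cif), which is consistent with storing one entry per unordered pair (i, k). The text does not state the table layout, so the packed indexing below is an assumption drawn from the numbers, and tri_size and tri_index are hypothetical helper names.

```c
#include <assert.h>

/* Number of unordered pairs (i, k) with 0 <= i, k < n, including i == k. */
int tri_size(int n) {
    assert(n >= 0);
    return n * (n + 1) / 2;
}

/* Map a pair (i, k) to a slot in a packed lower-triangular table.
   Because the dot product is symmetric, (i, k) and (k, i) share one slot. */
int tri_index(int i, int k) {
    assert(i >= 0 && k >= 0);
    if (i < k) {          /* swap so that i >= k */
        int t = i;
        i = k;
        k = t;
    }
    return i * (i + 1) / 2 + k;
}
```

Packing the table this way would roughly halve both the pre-calculation work and the memory transferred back from the GPU compared with a full n-by-n table.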
Figure 3.7 shows the exact distribution of loop iterations that require the polynomial
calculation in SVM for the cif input. For this input, each loop invocation runs 100
iterations. However, as the figure shows, no invocation requires
more than 30 polynomial calculations. Again, these results indicate that, on conventional
systems, a single loop invocation would not use many of the pre-computed values.
[Axis labels: Execution Occurrences; Active Loop Executions (Out of 100)]
Figure 3.7: Distributions of loop iterations that require polynomial calculations in SVM
for cif input.
Figure 3.8 shows the number of unique polynomial calculations required over a
timeline of loop executions. Overall, the results indicate that over time nearly all of
the table entries are required. In this case, declining to build the table of pre-computed or
speculated results would be short-sighted, because that decision considers each loop
invocation only in isolation. On a traditional CPU system, however, pre-computing the
table would be costly given the large table size and upfront overhead of computation.
[Axis label: Table Entries Accessed]
Figure 3.8: Timeline of unique polynomial calculations in SVM for input cif.
4. Experimental Results
The experimental results show that in many cases the GPU out-performs the
CPU as expected. The performance increase is seen in all of the translation methods
from computation speculation to reformulation. However, with a smaller problem size,
the CPU can often beat the GPU time simply due to the overhead associated with
executing code on the GPU device. The data also shows that in many cases the actual
computation time of the GPU is only a small percentage of total GPU execution time.
4.1 Data-size Based GPU versus CPU Execution
The Image Segmentation algorithm yields some interesting experimental re-
sults. This benchmark requires a large amount of processing power in order to sort a
large set of floating point numbers. Figure 4.1 shows that for an input size greater than
96x128 there is a definite performance increase when using the GPU implementation.
Performance is severely decreased for the very small input sizes due to the GPU overhead
requirements such as memory transfers and kernel setup time. In this particular
case, a small input would be executed on the CPU, while mid-size and
larger inputs would be executed on the GPU.
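A size-based dispatch of this kind can be sketched in a few lines of C. The threshold below uses the 96x128 crossover reported for Figure 4.1; the break-even point depends on the machine, so the constant is a tunable assumption, and should_use_gpu is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Crossover reported for Figure 4.1: the GPU version wins only for inputs
   larger than 96x128. Treat this constant as a tunable assumption. */
#define CROSSOVER_PIXELS (96L * 128L)

/* Return true when the input is large enough that GPU overhead
   (memory transfers, kernel setup) is expected to be amortized. */
bool should_use_gpu(int width, int height) {
    assert(width > 0 && height > 0);
    return (long)width * (long)height > CROSSOVER_PIXELS;
}
```

A caller would check should_use_gpu once per image and then invoke either the CPU or the GPU implementation of the sort.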
The majority of the Image Segmentation algorithm consists of one major function
known as fSortIndices, which consumes more than 99 percent of the execution
time of the overall algorithm. This function simply sorts a large array of floating
point numbers. This particular application is slowed down due to the structure of the
Image Segmentation algorithm. The structure of this algorithm requires the sort
Figure 4.1: Image Segmentation performance change using GPU implementation.
to be done via indexing, wherein the function builds an index into the original array
that lists where the smallest through largest values occur. If this application utilized an in-place
style sort then the overall performance would be significantly increased. The original
author of this algorithm implemented a brute force sorting method which requires n^2
comparisons (where n is the number of elements to be sorted).
This algorithm was translated to the GPU by first building an index on the
CPU side and then transferring all the values to the GPU to be sorted. Once the
GPU executes an in-place sort, the resulting sorted array of floating point numbers
is transferred back to the CPU. The CPU then correlates the returned array to the
appropriate index values. Performance is negatively impacted because the CPU
must do the correlation work.
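The index-building sort described above can be sketched in C. The first function is a brute-force rank sort that uses on the order of n^2 comparisons, in the spirit of the original implementation. The second is a swapped-in alternative that keeps each value paired with its index during an ordinary qsort, so no separate correlation pass is needed. Both are illustrative assumptions: neither is the code of fSortIndices, and all names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Brute-force index sort: the rank of each element is the number of
   elements that must precede it. Ties are broken by position so every
   rank is unique. Inputs are assumed to contain no NaN values. */
void sort_indices_bruteforce(const float *v, int n, int *idx) {
    assert(v != NULL && idx != NULL && n >= 0);
    for (int i = 0; i < n; i++) {
        int rank = 0;
        for (int j = 0; j < n; j++) {
            if (v[j] < v[i] || (v[j] == v[i] && j < i)) rank++;
        }
        idx[rank] = i;  /* idx[rank] is the position of the rank-th smallest */
    }
}

typedef struct {
    float value;
    int index;
} keyed;

/* Order by value, then by original index so equal values keep their order. */
int cmp_keyed(const void *a, const void *b) {
    const keyed *x = a;
    const keyed *y = b;
    if (x->value < y->value) return -1;
    if (x->value > y->value) return 1;
    return (x->index > y->index) - (x->index < y->index);
}

/* Sort (value, index) pairs together so the index order falls out of the
   sort itself. Returns 0 on success, -1 if memory allocation fails. */
int sort_indices_keyed(const float *v, int n, int *idx) {
    assert(v != NULL && idx != NULL && n >= 0);
    keyed *tmp = malloc((size_t)(n > 0 ? n : 1) * sizeof *tmp);
    if (tmp == NULL) return -1;
    for (int i = 0; i < n; i++) {
        tmp[i].value = v[i];
        tmp[i].index = i;
    }
    qsort(tmp, (size_t)n, sizeof *tmp, cmp_keyed);
    for (int i = 0; i < n; i++) idx[i] = tmp[i].index;
    free(tmp);
    return 0;
}
```

Carrying the index alongside each value is one way a device-side sort could return the index array directly instead of leaving the correlation step to the CPU.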
Figure 4.2 shows the relative execution time for the fSortIndices function. This
figure compares the CPU execution time side-by-side with the total GPU execution
time (which includes memory transfer and kernel setup time). Another interesting note is the
relative time of the GPU arithmetic execution time versus the total GPU execution
time. For the small input sizes nearly all of the GPU execution time is consumed by
computation. For the larger input sizes, the share of time
spent on computation relative to memory transfer trends downward: relative
to total execution time, the memory transfer times actually increase with larger
input sizes.
[Legend: GPU Execution Percentage of Total GPU Time; axis label: Input Size (11x15, 18x24, 25x33, 33x44, 64x85, 96x128, 100x133, 132x176, 200x267, 264x352, 480x640, 600x800)]
Figure 4.2: fSortIndices performance change using GPU implementation.
4.2 Computation Reformulation
The example of reformulating computation for GPUs comes from the computer vision
application Tracking. The Tracking benchmark consists of
three main functions: Sobel_dX, Sobel_dY, and BlurImage. This particular example
actually reformulates how the computation is performed. Figure 4.3 depicts the
performance change for the overall Tracking benchmark. The graph shows that there
is a performance increase once the image input is sqcif or larger. The results for
Tracking are similar to Image Segmentation in that for small input sizes there is an
overall performance decrease due to the GPU memory transfer and kernel setup time.
Figure 4.3: Tracking kernels (BlurImage, Sobel_dX, Sobel_dY) performance change
using GPU implementation.
The three sections of the Tracking benchmark yield different results. Higher
performance increases are seen with the GPU-optimized BlurImage than with the Sobel
functions. BlurImage, for example, using the full HD input size has a performance increase
of 8.3x while calcSobel_dY only experiences a performance increase of 2.8x. The
performance increase for the dY algorithm is greater than for the dX algorithm because
the CPU utilizes onboard caching optimization for the dX version. Figure 4.4 shows that
for all three algorithms there is a performance improvement for all input image sizes.
This result is interesting because the GPU is doing more computational work than the CPU.
[Axis label: Input Size (96x128, 144x176, 240x320, 288x352, 480x640, 540x960, 1080x1920)]
Figure 4.4: Tracking kernels (BlurImage, Sobel_dX, Sobel_dY) GPU performance
change.
Figure 4.5 shows the performance improvement of BlurImage in terms of execution
time measured in microseconds. The blue line shows how much of the overall GPU time
is spent on the execution; the rest is spent on memory transfers and overhead. For the
high definition image (1080x1920), 0.35% of the total GPU time is actually
spent doing the execution and the other 99.65% of that time is used on kernel setup
and memory transfers.
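The relationship between kernel time, overhead, and end-to-end speedup can be made concrete with a small helper. The struct fields and function names below are hypothetical, and the timings in the usage note are made-up round numbers, not measurements from this report.

```c
#include <assert.h>

/* Timing breakdown of one GPU run, all in the same unit
   (for example, microseconds). */
typedef struct {
    double kernel;    /* time spent executing the kernel itself */
    double transfer;  /* host-to-device and device-to-host copies */
    double setup;     /* kernel launch and other fixed overhead */
} gpu_timing;

/* Total GPU time as seen by the application. */
double gpu_total(gpu_timing t) {
    return t.kernel + t.transfer + t.setup;
}

/* Fraction of total GPU time spent on computation, i.e. the quantity the
   figures label "GPU Execution Percentage of Total GPU Time" (as a ratio). */
double compute_fraction(gpu_timing t) {
    double total = gpu_total(t);
    assert(total > 0.0);
    return t.kernel / total;
}

/* End-to-end speedup over the CPU, overhead included. */
double speedup(double cpu_time, gpu_timing t) {
    double total = gpu_total(t);
    assert(total > 0.0);
    return cpu_time / total;
}
```

With a kernel time of 1, transfers of 2, and setup of 1 (arbitrary units), the compute fraction is 0.25. A CPU time of 8 then gives an end-to-end speedup of 2, even though the kernel alone is 8 times faster than the CPU.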
The Sobel algorithm performance for the delta X direction can be seen in Figure 4.6.
The results show the performance improvement of calcSobel_dX for various input sizes.
The minimum, average and maximum values have similar trends on each data set with
the CPU versus the GPU. For example, the largest (rightmost) data set is very flat for
both the CPU values as well as the GPU values. In all cases for this algorithm there is
a performance increase by using the GPU implementation of the algorithm.
Similarly, Figure 4.7 shows the performance improvements of the Sobel algorithm in
the delta Y direction. The results indicate a performance improvement for every given input.
[Legend: CPU Avg, GPU Avg, GPU Execution Percentage of Total GPU Time; axis label: Input Size (Pixels)]
Figure 4.5: Tracking kernel BlurImage comparison of CPU and GPU: minimum, maximum,
average.
[Legend: Min, CPU Avg, GPU Avg, GPU Execution Percentage of Total GPU Time; axis label: Input Size (Pixels)]
Figure 4.6: Tracking kernel Sobel_dX comparison of CPU and GPU: minimum, maximum,
average.
When the algorithm is optimized for the GPU, less than one quarter of one percent of
the total GPU time is used for computation. More than 99.7% of the GPU time is taken
up by memory transfers and kernel setup.
Figure 4.7: Tracking kernel Sobel_dY comparison of CPU and GPU: minimum, maximum,
average.
Figure 4.8 shows an overview plot of all the kernels involved in Tracking. The
plot shows the standard deviation as a percentage of the average. In all cases, the GPU
has a lower standard deviation than the same function run on the CPU. Each GPU run
has a more consistent execution time than the corresponding CPU-only run. Some of the
deviations on the CPU are attributed to the operating system scheduling various other
tasks on the CPU while the benchmark is running.
4.3 Computation Speculation
An example of speculative computation comes from SVM. The SVM benchmark
was implemented on the GPU in such a way that a large amount of pre-calculation is
performed at the beginning of the algorithm. Not all of the pre-calculated data may be
used during the run. Figure 4.9 shows the overall performance improvement of SVM by
using the GPU to pre-calculate the entire table. For the smaller inputs we see a 5x to
10x performance increase. For the larger inputs we see a consistent 28x increase. The
[Legend: BlurImage, calcSobel_dX, and calcSobel_dY, each on CPU and GPU; axis label: Input Size (Pixels), 96x128 to 1080x1920]
Figure 4.8: Tracking kernels (BlurImage, Sobel_dX, Sobel_dY) standard deviation as
a percentage of average.
bars (plotted on the right-axis) show how much of the GPU time is used in the actual
computation (the remainder is GPU kernel setup and memory transfer time). For the
largest input (cif), less than 5% of the GPU execution time is used on computation.
[Legend: GPU Performance, GPU Execution Percentage of Total GPU Time; axis label: Image Input Type (test, sim_fast, sim, sqcif, qcif, cif)]
Figure 4.9: GPU performance for SVM application.
Figure 4.10 shows the speedup of each table access versus generation. Each bar
indicates the speedup factor of the GPU compared to the CPU for each table access
versus generation. The number on top of each bar shows the standard deviation
as a percentage of the average. The data shows that the GPU time is more consistent
than the CPU time and deviates little from its average.
[Panel title: GPU Performance Multiplier Shown with the Standard Deviation (as a percent of the average); axis label: Image Input Type]
Figure 4.10: GPU performance for each table access versus generation.
The polynomial algorithm, a portion of the SVM application, exhibits a very large
performance improvement using the GPU implementation. Figure 4.11 shows the total
speedup of the polynomial function and the total speedup of the SVM application. The
polynomial function runs as much as sixty times faster than on the CPU for the
sqcif input. The overall speedup is around 2.5 to 3 times faster than the CPU-only
implementation for all input sizes.
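One standard way to relate a large speedup of one function to a smaller overall speedup is Amdahl's law, which the report itself does not invoke. The helper below is a sketch under that assumption; the fraction p of CPU run time spent in the accelerated function is not reported here, so any value passed for it is hypothetical.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction p of the original run time
   is accelerated by a factor s and the remaining time is unchanged. */
double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

For instance, under the purely hypothetical assumption that two thirds of the CPU run time is spent in polynomial, a 60x speedup of that part gives an overall speedup of about 2.9x.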
[Panels: GPU SVM Polynomial Algorithm Performance; GPU SVM Overall Performance. Axis labels: Speedup Factor Compared to CPU Execution; Image Input Type]
Figure 4.11: GPU performance for SVM application.
5. Conclusion
The optimization approaches described, together with the experimental results, have proved
to be successful techniques for real-world applications. The results show that, given realistic
problem sizes, a performance increase is seen with GPU-optimized algorithms.
The example of sorting optimization shows that in some cases the GPU can
actually slow down the overall performance of the algorithm. This particular example
also shows that the GPU can perform tasks such as sorting significantly faster than
a CPU given a large enough input data set. Future work would potentially focus on
optimizing additional algorithms commonly used in software.
Many cases exist where optimization can be done by reformulating the computation
methods contained in the algorithm. Often the GPU will actually perform
more computations than the CPU version. Reformulation can allow the GPU to per-
form the same calculations in a parallel form. Additional work for this case would focus
on creating templates and examples of ways to reformulate commonly seen algorithms.
Speculative computation involves careful observation of the underlying algorithm
and identifying the generation of large lookup tables and the like. Code for CPUs is
often written to optimize the code for serial execution and minimize the total number of
computations performed. Future work would involve finding a method to easily identify
when such situations occur in software code.
This thesis shows that different types of computationally intensive problems can
be optimized further by using the processing power of the GPU. By identifying and
properly applying the techniques presented, a software developer could increase the
performance of a given process by using the GPU. Utilizing the GPU, which exists on
many desktop systems today, is a great way to improve performance, boost efficiency
and decrease hardware costs.
BIBLIOGRAPHY
[1] M. Pharr and R. Fernando, GPU Gems 2: Programming Techniques for
High-Performance Graphics and General-Purpose Computation. Addison-Wesley
Professional, 2005.
[2] SD-VBS: The San Diego Vision Benchmark Suite, October 2009.
[3] M. Flynn, Some computer organizations and their effectiveness, IEEE Trans.
Comput., vol. C-21, pp. 948+, 1972.
[4] W. D. Hillis and G. L. Steele, Jr., Data parallel algorithms, Commun. ACM,
vol. 29, pp. 1170-1183, December 1986.
[5] INTEL, The evolution of a revolution,
http://download.intel.com/pressroom/kits/intelprocessorhistory.pdf, accessed 2008.
[6] INTEL, Intel microprocessor export compliance metrics,
http://www.intel.com/support/processors/sb/cs-023143.htm, accessed 2010.
[7] D. B. Kirk and W.-m. W. Hwu, Programming Massively Parallel Processors: A
Hands-on Approach. Morgan Kaufmann, 1 ed., February 2010.
[8] W.-m. Hwu, K. Keutzer, and T. G. Mattson, The concurrency challenge, IEEE
Des. Test, vol. 25, pp. 312-320, July 2008.
[9] Intel Corporation, Intel 64 and IA-32 Architectures Optimization Reference Manual.
No. 248966-023a, Intel Corporation, January 2011.
[10] NVIDIA, GeForce GTX 200 GPU architectural overview,
http://www.nvidia.com/docs/io/55506/geforce_gtx_200_gpu_technical_brief.pdf, accessed 2008.
[11] H. Sutter and J. Larus, Software and the concurrency revolution, Queue, vol. 3,
pp. 54-62, September 2005.
[12] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanra-
han, Brook for GPUs: stream computing on graphics hardware, in SIGGRAPH
'04: ACM SIGGRAPH 2004 Papers, (New York, NY, USA), pp. 777-786, ACM
Press, 2004.
[13] M. McCool, S. Du Toit, T. Popa, B. Chan, and K. Moule, Shader algebra, ACM
Trans. Graph., vol. 23, pp. 787-795, August 2004.
[14] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips,
GPU computing, Proceedings of the IEEE, vol. 96, pp. 879-899, May 2008.
[15] NVIDIA, NVIDIA CUDA Programming Guide 2.0. 2008.
[16] Standard Performance Evaluation Corporation, The SPEC CPU 2006 benchmark
suite, 2006.
[17] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J.
Reddi, and K. Hazelwood, Pin: building customized program analysis tools with
dynamic instrumentation, SIGPLAN Not., vol. 40, pp. 190-200, June 2005.
[18] D. G. Lowe, Object recognition from local scale-invariant features, in Proceedings
of the International Conference on Computer Vision, Volume 2, ICCV
'99, (Washington, DC, USA), pp. 1150, IEEE Computer Society, 1999.
[19] S. Warn, W. Emeneker, J. Cothren, and A. W. Apon, Accelerating SIFT on parallel
architectures, in CLUSTER, pp. 1-4, IEEE, 2009.
[20] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, vol. 20,
pp. 273-297, 1995. 10.1007/BF00994018.