Point-In-Polygon (PIP) search acceleration using GPU based quadtree framework

Material Information

Point-In-Polygon (PIP) search acceleration using GPU based quadtree framework
Karunagaran, Sreeviddyadevi ( author )
Place of Publication:
Denver, Colo.
Publisher:
University of Colorado Denver
Publication Date:
Physical Description:
1 electronic file (137 pages)

Thesis/Dissertation Information

Master's (Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Electrical Engineering, CU Denver
Degree Disciplines:
Electrical engineering


Subjects / Keywords:
Graphics processing units ( lcsh )
Computer graphics ( lcsh )
Polygons ( lcsh )
Computer graphics ( fast )
Graphics processing units ( fast )
Polygons ( fast )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


The point-in-polygon (PIP) problem defines for a given set of points in a plane whether those points are inside, outside or on the boundary of a polygon. The PIP for a plane is a special case of point range finding that has applications in numerous areas that deal with processing geometrical data, such as computer graphics, computer vision, geographical information systems (GIS), motion planning and CAD. Point in range search can be performed by a brute force search method; however, the solution becomes increasingly prohibitive when scaling the input to a large number of data points and distinct ranges (polygons). The core issue of the brute force approach is that each data point must be compared to the boundary range, requiring both computation and memory access. By using a spatial data structure such as a quadtree, the number of data points computationally compared within a specific polygon range can be reduced to be more efficient in terms of performance on the CPU. While Graphics Processing Unit (GPU) systems have the potential to advance the computational requirements of a number of problems, their massive number of processing cores execute efficiently only in certain rigid execution models such as data-level parallel problems. The goal of this thesis is to demonstrate that GPU systems can still process an irregular data structure such as a quadtree and that it can be efficiently traversed on a GPU to find the points inside a polygon. Compared with an optimized serial CPU implementation based on recursive Depth-First Search (DFS), the stack-based iterative Breadth-First Search (BFS) on the GPU has a performance gain of 3x to 500x.
Includes bibliographical references.
System Details:
System requirements: Adobe Reader.
Statement of Responsibility:
by Sreeviddyadevi Karunagaran.

Record Information

Source Institution:
University of Colorado Denver Collections
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
982959636 ( OCLC )
LD1193.E54 2016m K47 ( lcc )



Full Text
SREEVIDDYADEVI KARUNAGARAN B.E., Anna University, India, 2008
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science Electrical Engineering

This thesis for the Master of Science degree by Sreeviddyadevi Karunagaran has been approved for the Department of Electrical Engineering by
Dan Connors, Chair
Yiming Deng
Chao Liu
April 14, 2016

Karunagaran, Sreeviddyadevi (M.S., Electrical Engineering)
Point in Polygon (PIP) Search Acceleration Using GPU Based Quadtree Framework
Thesis directed by Professor Dan Connors
The point-in-polygon (PIP) problem defines for a given set of points in a plane whether those points are inside, outside or on the boundary of a polygon. The PIP for a plane is a special case of point range finding that has applications in numerous areas that deal with processing geometrical data, such as computer graphics, computer vision, geographical information systems (GIS), motion planning and CAD. Point in range search can be performed by a brute force search method; however, the solution becomes increasingly prohibitive when scaling the input to a large number of data points and distinct ranges (polygons). The core issue of the brute force approach is that each data point must be compared to the boundary range, requiring both computation and memory access. By using a spatial data structure such as a quadtree, the number of data points computationally compared within a specific polygon range can be reduced to be more efficient in terms of performance on the CPU. While Graphics Processing Unit (GPU) systems have the potential to advance the computational requirements of a number of problems, their massive number of processing cores execute efficiently only in certain rigid execution models such as data-level parallel problems. The goal of this thesis is to demonstrate that GPU systems can still process an irregular data structure such as a quadtree and that it can be efficiently traversed on a GPU to find the points inside a polygon. Compared with an optimized serial CPU implementation based on recursive Depth-First Search (DFS), the stack-based iterative Breadth-First Search (BFS) on the GPU has a performance gain of 3x to 500x.

The form and content of this abstract are approved. I recommend its publication.
Approved: Dan Connors

Tables.................................................................... vii
Figures ................................................................. viii
1. Introduction............................................................. 1
2. Background and Motivation................................................ 5
2.1 Quadtree Background................................................ 5
2.1.1 Quadtree Construction........................................ 6
2.2 Related Work ..................................................... 10
2.3 Graphics Processing Unit Architecture............................. 11
2.4 Motivation........................................................ 17
3. Approach ............................................................... 21
3.1 Exploitation of Quadtree on CPU and GPU........................... 21
3.1.1 CPU Implementation ......................................... 21
3.1.2 GPU Implementation ......................................... 23
      Drawbacks on Implementing DFS on GPU........................ 26
      BFS Implementation on GPU................................... 26
3.1.3 BFS Implementation on GPU Different Scenarios of Query
      Polygon..................................................... 32
      Query Polygon inside a Leaf Node............................ 35
      Query Polygon Overlapping 2 Nodes........................... 37
      Query Polygon Overlapping All Leaf Nodes.................... 39
      Query Polygon Containing No Points.......................... 42
4. Experimental Results.................................................... 44
4.1 Results........................................................... 44
5. Conclusion............................................................... 50
References.................................................................. 51

A. CUDA Code for Brute Force Search..................................... 53
B. Header File for Memory Allocation..................................... 63
C. CUDA C Code for PIP Search on GPU..................................... 73

3.1 Criteria for Quadtree Node Classification

2.1 Example of Data Points Mapped within a Quadtree Data Structure. . 6
2.2 Level 1 of Quadtree.................................................... 7
2.3 Level 2 of Quadtree.................................................... 7
2.4 Level 3 of Quadtree.................................................... 7
2.5 Level 4 of Quadtree.................................................... 8
2.6 NW - North west, NE - North east, SW - South west, SE - South east. . 10
2.7 Kepler Architecture................................................... 13
2.8 GTX680 SMX............................................................ 14
2.9 Kepler Memory Hierarchy............................................... 15
2.10 Kepler Work Flow...................................................... 16
2.11 Comparing Bruteforce Search and Quadtree Search on CPU................ 18
2.12 Comparing Bruteforce Search on CPU with GPU........................... 19
2.13 CPU Performance of Different Sized Polygons Using Quadtree............ 20
3.1 Overlap of Node on the Horizontal Edge of Polygon..................... 23
3.2 Overlap of Node on the Vertical Edge of Polygon....................... 23
3.3 BFS GPU Implementation................................................ 30
3.4 Process Flow.......................................................... 32
3.5 Level 1 Quadtree...................................................... 33
3.6 Level 2 Quadtree...................................................... 33
3.7 Level 3 Quadtree...................................................... 34
3.8 Level 4 Quadtree...................................................... 34
3.9 Sample Datapoints (left) and the Quadtree (right) at Level 1.......... 35
3.10 Sample Datapoints (left) and the Quadtree (right) at Level 2.......... 35
3.11 Sample Datapoints (left) and the Quadtree (right) at Level 3.......... 36
3.12 Sample Datapoints (left) and the Quadtree (right) at Level 4.......... 37

3.13 Sample Datapoints (left) and the Quadtree (right) at Level 1............. 37
3.14 Sample Datapoints (left) and the Quadtree (right) at Level 2............. 38
3.15 Sample Datapoints (left) and the Quadtree (right) at Level 3............. 38
3.16 Sample Datapoints (left) and the Quadtree (right) at Level 4............. 39
3.17 Sample Datapoints (left) and the Quadtree (right) at Level 1............. 40
3.18 Sample Datapoints (left) and the Quadtree (right) at Level 2............. 40
3.19 Sample Datapoints (left) and the Quadtree (right) at Level 3............. 41
3.20 Sample Datapoints (left) and the Quadtree (right) at Level 4............. 41
3.21 Stack Based Iterative BFS Traversal................................ 42
3.22 Sample Datapoints (left) and the Quadtree (right) at Level 1............. 43
3.23 Sample Datapoints (left) and the Quadtree (right) at Level 2............. 43
4.1 Performance Comparison between CPU and GPU for Small Polygons. . . 45
4.2 Performance Comparison between CPU and GPU for Medium Polygons. 45
4.3 Performance Comparison between CPU and GPU for Large Polygons. . . 46
4.4 Performance Comparison for Different Sized Polygons using GPU.. 47
4.5 Execution Time of Least to Most Complex Medium Sized Polygons. ... 48
4.6 Speed-up of GPU over CPU........................................... 49

1. Introduction
The point-in-polygon (PIP) problem defines for a given set of points in a plane whether those points are inside, outside or on the boundary of a polygon. The PIP for a plane is a special case of point range finding that has applications in numerous areas that deal with processing geometrical data, such as computer graphics, computer vision, geographical information systems (GIS), motion planning and computer-aided design (CAD). Point in range search can be performed by a brute force search method; however, the solution becomes increasingly prohibitive when scaling the input to a large number of data points and distinct ranges (polygons).
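The per-point PIP test itself is commonly implemented with the even-odd (ray-crossing) rule; a minimal C sketch follows. The function and variable names are illustrative choices of this sketch, not identifiers from the thesis code, and boundary points (which the problem statement distinguishes) are not handled specially here.

```c
/* Even-odd (ray-crossing) point-in-polygon test: cast a ray from the
 * query point (px, py) toward +x and count how many polygon edges it
 * crosses; an odd count means the point is inside.  vx/vy hold the
 * n polygon vertices in order. */
static int point_in_polygon(int n, const double *vx, const double *vy,
                            double px, double py)
{
    int inside = 0;
    for (int i = 0, j = n - 1; i < n; j = i++) {
        /* Does edge (j, i) straddle the horizontal line y = py, and
         * does the crossing point lie to the right of the query point? */
        if (((vy[i] > py) != (vy[j] > py)) &&
            (px < (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i]))
            inside = !inside;
    }
    return inside;
}
```

The brute force approach described above simply runs this test once per data point per polygon, which is exactly the cost the quadtree is introduced to avoid.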
Range search algorithms that make use of spatial data structures perform much better than those that do not partition the data before processing. The quadtree is a hierarchical spatial data structure used for both indexing and compressing geographic database layers due to its applicability to many types of data, its ease of implementation and its relatively good performance. Traditionally, the quadtree is built on the Central Processing Unit (CPU). To speed up range searching problems, it is essential to raise the number of queries that can be processed within a given time frame; a purely sequential approach to this would demand an increase in processor speeds.
Graphics Processing Units (GPUs) have proven to be a powerful and efficient computational platform. An increasing number of applications are demanding more efficient computing power at a lower cost. The modern GPU can natively perform thousands of parallel computations per clock cycle. Relative to the traditional power of a CPU, the GPU can far out-perform the CPU in terms of computational power or Floating Point Operations per Second (FLOPS). Traditionally GPUs have been used exclusively for graphics processing. Recent developments have allowed GPUs to be used for more than just graphics processing and rendering. With a growing set of applications these new GPUs are known as GPGPUs (General Purpose GPUs).

NVIDIA has developed the CUDA (Compute Unified Device Architecture) API (Application Programming Interface), which enables software developers to access the GPU through standard programming languages such as C. CUDA gives developers access to the GPU's virtual instruction set, onboard memory and parallel computational elements. Taking advantage of this parallel computational power can yield significant speedups for many applications, computer vision algorithms among them. From the assembly line to home entertainment systems, the need for efficient real-time computer vision systems is growing quickly. This thesis explores the potential of using the CUDA API and NVIDIA GPUs to speed up common algorithms; through real-life algorithm optimization and translation, several approaches to GPU optimization of existing code are proposed.
In the past few years, there has been a rapid adoption of GPGPU parallel computing techniques for both high-performance computing and mobile systems. As GPUs exploit models of data-parallel execution that generally describe a common task across different parallel computing elements, there can be severe limitations for any irregular individual thread behaviors. Irregular execution disrupts the efficiency of the rigid GPU groups of threads by causing workload disparity that effectively leads to idle or underutilized resource units. Most research focuses on the hardware aspects of execution disparity such as branch divergence [7], local memory bank conflicts [5] and non-coalesced global memory accesses [15]. There are a number of proposed architecture concepts to mitigate some of the performance downfalls of handling non-uniform data on GPU architectures [10]. Many of the proposed solutions reduce the frequency of the occurrences but do not fully address the inherent run-time execution characteristics of an algorithm. Overall, irregular data patterns limit the performance potential of GPGPU algorithms.
While irregular algorithms restrain the full use of data-level resources of GPU

systems, GPU implementations may still achieve performance benefit over multicore implementations. In effect, the raw number of GPU resources serves as a brute-force method of carrying out computations with significant arithmetic intensity. Nevertheless, there are alternative and emerging software-based models of distributing execution tasks on GPUs such as dynamic parallelism support. Another technique proposed is persistently scheduled thread groups [6] that abandons the traditional data-level model for stable thread groups assigned to GPU compute units that dynamically retrieve computation tasks from software-based workload queues. The result of persistently scheduled groups can be better load balance, utilization and reduced overhead in thread block creation. At the same time, such techniques have not fully addressed exploiting patterns and variations in model data specific to algorithms.
Generally, tasks that involve irregular data or non-deterministic algorithms are not effectively mapped to GPU systems. For example, in graph-based algorithms, the irregular nature of the edge connectivity between graph nodes is not well suited for data-level task definition on GPU computing units. In this case, a group of neighboring GPU threads may be assigned a diverse workload of processing nodes with a few edges as well as nodes with thousands of edges. This form of imbalance is characterized as static workload disparity, as a portion of the runtime utilization can be traced to the static connectivity of graph nodes. Only if a graph's structure is persistent, not changing over several evaluations, might there be well-reasoned opportunities to reorganize the data, effectively performing static load balancing in which each GPU thread group is assigned data with less variation in work. However, in such cases, there is a cost to partitioning the graph nodes of the model data.
This thesis investigates the potential of processing quadtrees for PIP search problems that execute on GPUs. As GPUs operate in a heterogeneous system in which both the CPU and GPU perform some fraction of the computational work, there are unique performance constraints to explore. This thesis considers two primary

parameters in scaling optimal GPU quadtree solutions: the data point problem size and the characteristics of the polygons being searched.
This thesis is organized as follows: Chapter 2 discusses the motivation and background of computer vision applications. Chapter 3 examines several examples of the PIP problem solving on GPUs. The experimental results section, Chapter 4, shows performance data for the various optimization cases. Finally, Chapter 5 concludes this thesis.

2. Background and Motivation
2.1 Quadtree Background
A quadtree is a tree data structure in which each internal node has exactly four children. Quadtrees are used to partition a 2D space by recursively subdividing it into four quadrants or regions. The regions may be square, rectangular or of arbitrary shape. Quadtrees may be classified according to the type of data represented, such as areas, points, lines and curves. Figure 2.1 demonstrates a set of 2D data points that have been used to compose a quadtree. The reasoning for using a quadtree structure is that any region search that requires the data points within a polygon area can simply reference the tree rather than the data. There are clear search benefits in moving from the data to the tree representation. In the case of the lower right quadrant, any region search will consult the tree structure and determine that only a limited number of point values (X, Y) are required for comparison. In short, the tree representation saves computational work for the polygon search by immediately establishing that only a limited number of points exist in that entire quadrant. By comparison, it is far cheaper for a polygon that resides in the lower right quadrant to consult the tree than to consult the full data set, brute-force checking every data point against the polygon range.

Figure 2.1: Example of Data Points Mapped within a Quadtree Data Structure.
2.1.1 Quadtree Construction
Figures 2.2 through 2.5 demonstrate quadtree construction on the CPU up to level 4. Figure 2.2 shows the sample data points at level 1; it represents the root node containing all the data points. Figure 2.3 shows the sample data points at level 2, representing the four child nodes of the root node. The space in Figure 2.3 is further subdivided to generate level 3 of the quadtree, as shown in Figure 2.4. Figure 2.5 represents the tree at level 4. There are two empty nodes at level 4, as there are no points in those quadrants.

Figure 2.2: Level 1 of Quadtree.
Figure 2.3: Level 2 of Quadtree.
Figure 2.4: Level 3 of Quadtree.

Figure 2.5: Level 4 of Quadtree.
The point quadtree structure recursively divides space into four equal rectangles based on the location of the points. The root of the quadtree represents the whole 2D space. Successive points divide each new sub-region into quadrants until all points are indexed. With each subsequent level, quadtrees can have a maximum of four children for every parent node. Each node represents a region in the 2D space which contains a subset of the data points.
The common algorithm to construct a quadtree is based on sequential point insertion, and this sequential approach is not suitable for massively parallel processors. Therefore the quadtree construction is done on the CPU. The performance of quadtrees is O(n log n) for insertion, O(log n) for searching, and O(n log n) for optimization to balance the tree [8]. This makes quadtrees an ideal data structure for multiple insertion of points and range searching [8]. Though linear quadtrees reduce the overhead involved in transferring the tree from the CPU main memory to the GPU memory, the pointer-based quadtree has several other advantages. Pointer-based quadtrees are more memory space efficient and can be accessed in any traversal order by following the pointers, whereas linear quadtrees only allow preorder traversal and require logarithmic-time searches to find the next child for any other order [14].

The quadtree is built using pointers. The tree is built by recursive function calls, sequentially inserting the points. Each parent node has pointers to all of its children. There are three structures in the quadtree: the node, the point and the point buffer. Each node is one of three types: the root node, the link node and the leaf node. The root node corresponds to the entire array of input points. The four children of the root node represent the four quadrants (Northwest NW, Northeast NE, Southwest SW, Southeast SE). A link node is a node that can be further subdivided, and the leaf nodes correspond to the quadrants for which further subdivision is not necessary.
The point structure holds the input points to the quadtree, which occupy a 2D space. A leaf node has a list of point buffers, each of which holds a predetermined maximum number of points. Starting at the root, the quadrant or direction (NW, NE, SW or SE) in which the input point lies is determined and a child node is built in that direction. This step is repeated until a leaf node is reached.
Figure 2.6 shows the node index and its corresponding direction within its parent node. The point is then added to that leaf node's buffer of points. If the buffer exceeds some predetermined maximum number of elements, the points are moved into the next buffer in the buffer list. The root node is subdivided recursively. A quadtree with k levels including the root would have 4^(k-1) nodes on the kth level and (4^k - 1)/3 nodes in total [8].

1 2
3 4
Figure 2.6: NW - North west, NE - North east, SW - South west, SE - South east.
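As a concrete illustration of the insertion procedure described above, the following C sketch builds a pointer-based point quadtree. The names, the square regions, and the single fixed-capacity point array per leaf are simplifications made by this sketch; the thesis code keeps a list of point buffers per leaf and distinguishes root, link and leaf node types explicitly.

```c
#include <stdlib.h>

/* Illustrative pointer-based point quadtree.  A leaf stores up to CAP
 * points; on overflow it splits into the four children (0=NW, 1=NE,
 * 2=SW, 3=SE, mirroring the index/direction layout of Figure 2.6)
 * and pushes its stored points down. */
#define CAP 4

typedef struct { double x, y; } Point;

typedef struct Node {
    double cx, cy, half;        /* square region: center and half-width */
    struct Node *child[4];      /* NULL while the node is a leaf */
    Point pts[CAP];
    int npts;                   /* becomes -1 once the node has split */
} Node;

static Node *make_node(double cx, double cy, double half)
{
    Node *n = calloc(1, sizeof *n);
    n->cx = cx; n->cy = cy; n->half = half;
    return n;
}

/* Which quadrant of node n does point p fall in? */
static int quadrant(const Node *n, Point p)
{
    int east  = p.x >= n->cx;
    int north = p.y >= n->cy;
    return north ? (east ? 1 : 0) : (east ? 3 : 2);
}

static void quad_insert(Node *n, Point p)
{
    if (n->npts >= 0) {                          /* still a leaf */
        if (n->npts < CAP) { n->pts[n->npts++] = p; return; }
        /* Overflow: split into four children and reinsert. */
        double h = n->half / 2;
        n->child[0] = make_node(n->cx - h, n->cy + h, h);
        n->child[1] = make_node(n->cx + h, n->cy + h, h);
        n->child[2] = make_node(n->cx - h, n->cy - h, h);
        n->child[3] = make_node(n->cx + h, n->cy - h, h);
        for (int i = 0; i < CAP; i++)
            quad_insert(n->child[quadrant(n, n->pts[i])], n->pts[i]);
        n->npts = -1;                            /* now a link node */
    }
    quad_insert(n->child[quadrant(n, p)], p);    /* descend */
}
```

Repeated insertion of the whole input array through `quad_insert` reproduces the recursive, sequential construction that the thesis performs on the CPU.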
2.2 Related Work
The traversal path for each polygon may differ; therefore linearizing the tree based on the traversal path for each polygon and iterating over the linear traversal order is computationally intensive. Several application-specific approaches have been proposed to handle the problem. In algorithms like Barnes-Hut, a point's traversal can be computed without executing the entire algorithm, and a preprocessing pass can determine each point's linear traversal, avoiding repeatedly visiting interior nodes during vector operations. However, for PIP search, where a point's full traversal is only determined as the traversal progresses, the preprocessing step can be expensive.
Goldfarb et al. [2015] [4] demonstrate a GPU implementation of tree traversal algorithms by installing ropes: extra pointers that connect a node in the tree not to its children but to the next node that a point would visit if the children are not visited. The drawback of using ropes is that an additional traversal is required, prior to performing the actual traversals, to install the ropes into the nodes of the tree structure. As a result, earlier attempts to use ropes to accelerate tree traversals have relied on application-specific transformations that leverage the semantics of the algorithm to efficiently place ropes. The problem with using Autoropes [4] is that

since the rope pointers are computed during traversal and stored on the stack, stack manipulation causes overhead and efficiency is compromised.
Work by Zhang et al. [2013] [16] presents a speedup of 90x by constructing a quadtree using parallel primitives, transforming a multidimensional geospatial computing problem into a chain of generic parallel primitives designed for one-dimensional arrays. Another work, by Bedorf et al. [2015] [1], presents an octree construction in which the tree is built on the GPU after mapping particles in 3D space to a linear array; a vectorized BFS traversal is then performed on the octree.
All of these implementations have relied on linearizing the tree for BFS traversal on the GPU, or they have preprocessed the tree to either linearize it or install extra pointers. CUDA GPU ray traversal through a hierarchical data structure such as a bounding volume hierarchy (BVH) is usually carried out using depth-first search by maintaining a stack. This thesis explores the parallelization of breadth-first search (BFS) to implement range search on GPUs. The kernel is optimized to minimize thread divergence and memory accesses. BFS is used instead of depth-first search (DFS) as it is easier to parallelize and implement efficiently on the GPU. BFS and DFS are fundamental building blocks for more sophisticated algorithms used in many computational fields that range from gaming and image processing to social network analysis. BFS is used as a core computational kernel in a number of benchmark suites, including Rodinia, Parboil and the Graph500 supercomputer benchmark.
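The pruning traversal at the heart of the range search can be sketched in plain C. The snippet below replaces recursion with an explicit worklist array, playing the role of the stack in the thesis's stack-based iterative traversal; the complete-quadtree layout, the rectangle type and all names are this sketch's assumptions, and the query region is simplified to an axis-aligned rectangle rather than a general polygon.

```c
#define MAXN 64

typedef struct { double x0, y0, x1, y1; } Rect;   /* axis-aligned box */

/* Do rectangles a and b intersect (touching counts)? */
static int overlaps(Rect a, Rect b)
{
    return a.x0 <= b.x1 && b.x0 <= a.x1 && a.y0 <= b.y1 && b.y0 <= a.y1;
}

/* Count leaf regions of a complete quadtree (node 0 covers `world`,
 * leaves sit at depth `levels`) that overlap the query rectangle.
 * Pop a region; prune the whole subtree if it cannot intersect the
 * query, count it if it is a leaf, otherwise push its 4 quadrants. */
static int range_visit(Rect world, int levels, Rect query)
{
    struct { Rect r; int level; } stack[MAXN];
    int top = 0, visited = 0;
    stack[top].r = world; stack[top].level = 1; top++;
    while (top > 0) {
        --top;
        Rect r = stack[top].r;
        int level = stack[top].level;
        if (!overlaps(r, query))
            continue;                               /* prune subtree */
        if (level == levels) { visited++; continue; } /* leaf region */
        double mx = (r.x0 + r.x1) / 2, my = (r.y0 + r.y1) / 2;
        Rect q[4] = { {r.x0, my, mx, r.y1},   /* NW */
                      {mx, my, r.x1, r.y1},   /* NE */
                      {r.x0, r.y0, mx, my},   /* SW */
                      {mx, r.y0, r.x1, my} }; /* SE */
        for (int i = 0; i < 4; i++) {
            stack[top].r = q[i]; stack[top].level = level + 1; top++;
        }
    }
    return visited;
}
```

In the GPU version each polygon's traversal runs this loop independently; only the leaves that survive the `overlaps` prune have their buffered points tested against the polygon.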
2.3 Graphics Processing Unit Architecture
The underlying architecture of the GPU is optimized for data-level parallelism. Taking a closer look at the NVIDIA GPU reveals the chip is organized into an array of highly threaded streaming multiprocessors (SMs). Each streaming multiprocessor consists of several streaming processors (SPs). Streaming multiprocessors are grouped into thread processing clusters (TPCs). The number of SPs per SM depends on the generation of the chip.

NVIDIA has chips that increase the number of streaming processors (SP) from 240 to 2048. Each SP has a multiply-add unit plus an additional multiply unit. The combined power of that many SPs exceeds one teraflop [9]. Each SM contains a special function unit which can compute floating-point functions such as square root and transcendental functions. Each streaming processor is threaded and can run thousands of threads per application. Graphics cards are commonly built to run 5,000-12,000 threads simultaneously. The GTX680 provides 192 cores per SMX and a total of 1536 cores [12]. In contrast, the Intel Core i7 series can support two threads per core.
GPUs are optimized via the execution throughput of a massive number of threads. The hardware takes advantages of this by switching to different threads while other threads wait for long-latency memory accesses. This methodology enables very minimal control logic for each execution thread [9]. Each thread is very lightweight and requires very little creation overhead.
From a memory perspective, the GPU is architected quite differently than a CPU. Each GPU currently comes with up to four gigabytes of Graphics Double Data Rate (GDDR) DRAM which is used as global memory. The GPU architecture is designed to exploit arithmetic intensity and data-level parallelism.
Figure 2.7 shows the architecture of the GTX680. This generation of the CUDA enabled GPU devices consists of an array of 8 next generation streaming multiprocessors (SMX), 4 GPCs (Graphics Processing Clusters) and 4 memory controllers. Each GPC has a dedicated raster engine and two SMX units. Figure 2.8 shows the architecture of a SMX. Each SMX unit contains 192 cores and four warp schedulers. Each warp scheduler is capable of dispatching two instructions per warp every clock [12].

Figure 2.7: Kepler Architecture.

Figure 2.8: GTX680 SMX.
In Kepler, as shown in Figure 2.9, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, as 16 KB of shared memory with 48 KB of L1 cache, or as a 32 KB / 32 KB split between shared memory and L1 cache. In addition to the L1 cache, Kepler introduces a 48 KB cache for data that is known to be read-only for the duration of the function, and it also has 1536 KB of L2 cache memory. The L2 cache services all loads, stores, and texture requests and provides high speed data sharing across the GPU [11].

Figure 2.9: Kepler Memory Hierarchy.
Graphics processors have traditionally been designed for very specific specialized tasks. Most of their transistors perform calculations related to 3D computer graphics rendering. Typically GPUs perform memory-intensive work such as texture mapping and rendering polygons. The GPU also performs geometric calculations such as rotation and translation of vertices into different coordinate systems. The on-chip programmable shaders can manipulate vertices and textures.
NVIDIA has developed a parallel computing architecture which is known as the Compute Unified Device Architecture (CUDA). This computing engine, which is the core of modern NVIDIA GPUs, is accessible to software developers through extensions of industry standard programming languages. The development of CUDA has enabled developers to access the virtual instruction set and memory of the GPU. This has enabled the exploitation of the native parallel computational elements of the NVIDIA GPU.

The Kepler architecture with Compute Capability 3.5 or higher supports dynamic parallelism, in which the GPU can launch new grids by itself. This feature allows algorithms that involve nested or recursive parallelism to be implemented on GPUs, resulting in better GPU utilization than before. In Figure 2.10, the Kepler host-to-GPU workflow shows the Grid Management Unit, which allows it to manage the actively dispatching grids, pause dispatch, and hold pending and suspended grids [11].
Figure 2.10: Kepler Work Flow.
The CUDA model is optimized for maximum compatibility. NVIDIA utilizes a soft instruction set which enables the GPU designers to change the low level hardware and instruction set without having to address backwards compatibility. Similarly, NVIDIA has also built in scalability to the CUDA model. These GPUs are scalable

in that the CUDA code that is written is not tied to a specific release of the NVIDIA GPU. This can be contrasted to traditional CPUs where the hard instruction set is published. CPU software developers often optimize their programs for how many cores are available which can change as new CPUs are released.
A CUDA program is composed of multiple execution phases. Depending on the phase, the execution involves the CPU, the GPU or both. Portions of the CUDA code are executed on the GPU while other portions are executed on the CPU. The NVIDIA compiler, known as nvcc, translates the code for the GPU and CPU accordingly. This model is very easy to work with because the device code is written in ANSI C extended with keywords for labeling data-parallel functions, called kernels, and their associated data structures [13].
The GPU implements a single-instruction, multiple-data (SIMD) model, unlike traditional CPU architectures where each thread may execute its own set of instructions. Multiple processing units are controlled by the same control signal from the control unit. Though each unit executes the same instruction, each operates on different data corresponding to its CUDA thread.
The GPU hardware maps the data blocks to the SMX through time and space multiplexing. Since each SMX has limited hardware resources such as shared memory, registers, and the number of threads that can be tracked and scheduled (scheduling slots), careful selection of block sizes allows the SMX to accommodate more blocks simultaneously, thus increasing the throughput.
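A small, concrete piece of this launch tuning is the grid-sizing arithmetic: given a chosen block size, host code rounds the point count up to a whole number of blocks so every point gets a thread. The helper below is plain C with illustrative names, not code from the thesis.

```c
/* Ceiling division used when sizing a GPU grid: the smallest number
 * of blocks of `threads_per_block` threads that covers npoints. */
static int blocks_for(int npoints, int threads_per_block)
{
    return (npoints + threads_per_block - 1) / threads_per_block;
}
```

For example, 1000 points with 256-thread blocks need 4 blocks (the last block's surplus threads simply exit early in the kernel).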
2.4 Motivation
In theory, for n items, a quadtree answers a range query in O(log(n)) time, compared to the O(n) cost of a brute force scan [3]; the relative cost is O(log(n)/n), making the quadtree extremely fast. As shown in Figure 2.11, an empirical comparison of quadtree search and brute force search shows that the quadtree has a 9x better performance for a medium number of queries and an 8x improvement for larger problems. The main benefit of the quadtree lies in localizing the queries. The brute force search on a CPU sequentially checks whether every single input point satisfies the criteria, whereas the quadtree isolates the region of interest faster by ignoring, at every level, the quadrants of the tree that lie outside the region of interest, thus reducing the number of quadrants to be processed at the next level. Therefore the quadtree gives better performance for dense, concentrated data than for a dense, evenly distributed data set.
Figure 2.11: Comparing Bruteforce Search and Quadtree Search on CPU.
As shown in Figure 2.12, the GPU provides better performance than the CPU for brute force search on larger datasets, but this is still lower than the CPU performance using a quadtree. The same algorithm is used for both the CPU and GPU brute force search to check whether a point is within the polygon, but the GPU is capable of launching 65535 blocks with a maximum of 1024 threads each, thus massively parallelizing the search. The GeForce GTX 680 Kepler architecture used here has a new parallel geometry pipeline optimized for tessellation and displacement mapping, as well as better performance per watt [12].
Figure 2.12: Comparing Bruteforce Search on CPU with GPU.
Figure 2.13 shows the quadtree performance for different sized polygons. The small polygons occupy 10 to 20 percent of the quadtree area, the medium sized polygons occupy 30 to 60 percent, and the large polygons occupy 70 to 90 percent. The CPU implementation performs better on small polygons than on larger polygons. The performance in this case depends on how quickly the quadrants of the quadtree that contain the polygon can be isolated: the amount of computation done for a larger polygon that occupies all four quadrants is higher than for a smaller polygon that lies inside only one quadrant.

Figure 2.13: CPU Performance of Different Sized Polygons Using Quadtree.

3. Approach
3.1 Exploitation of Quadtree on CPU and GPU
The PIP search is implemented on the CPU using a quadtree, and its performance is compared against the implementation on the GPU.
3.1.1 CPU Implementation
The traversal starts from the root node of the quadtree and performs a recursive Depth First Search (DFS). DFS traverses the tree depth-wise: starting at the root node, it moves down each branch until the bottom of the tree is reached before backtracking, using the recursive calls to obtain the next node to search. While visiting each node, it checks three conditions to determine whether the node is completely overlapping, partially overlapping or not overlapping the polygon.
Table 3.1 illustrates the conditions for each scenario. N0, N1, N2 and N3 are the four corners of a node, and the boundary of the polygon is marked by the four corners P0, P1, P2, P3 of a rectangle. A checkmark indicates that the corner of the node is within the boundary of all four edges of the polygon; an × mark indicates that the corner is outside. If all the corners of a node are within the four edges of the polygon, the node is completely overlapping the polygon. If all the corners are outside, the node is not overlapping the polygon. If at most three of the corners are outside, the node is partially overlapping the polygon; all conditions apart from the two shown in Table 3.1 are partially overlapping.

Table 3.1: Criteria for Quadtree Node Classification.
Case (checked against P0-P3)   N0  N1  N2  N3   Condition
1                              ✓   ✓   ✓   ✓    Completely overlapping
2                              ✗   ✗   ✗   ✗    Not overlapping
Before classifying a node as not overlapping, the polygon is also checked to see whether it lies inside the node. If the node is classified as completely overlapping, the boundary details of the node are stored and the tree is not traversed further from this node; the boundary of the node represents the range of points it contains, and all points within the node are counted as points within the polygon. If the node is classified as not overlapping, the tree is likewise not traversed further from this node. If the node is classified as partially overlapping, the tree is traversed further until the leaf nodes are reached, and the points are then extracted from the leaf nodes of the quadtree.
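The node classification and the recursive DFS described above can be sketched as follows for axis-aligned rectangular polygons. This is a hypothetical illustration rather than the thesis code: the corner tests of Table 3.1, including the polygon-inside-node check, are folded into equivalent interval-overlap tests, and the names `Node`, `classify` and `dfs` are assumed for the sketch.

```python
COMPLETE, PARTIAL, NONE = "complete", "partial", "none"

def classify(node, poly):
    """Classify a node rectangle against a polygon rectangle.
    Rectangles are (xmin, ymin, xmax, ymax) tuples."""
    nx0, ny0, nx1, ny1 = node
    px0, py0, px1, py1 = poly
    if nx1 < px0 or px1 < nx0 or ny1 < py0 or py1 < ny0:
        return NONE          # disjoint on some axis: not overlapping
    if px0 <= nx0 and nx1 <= px1 and py0 <= ny0 and ny1 <= py1:
        return COMPLETE      # all four node corners inside the polygon
    return PARTIAL           # everything else, incl. polygon inside node

class Node:
    def __init__(self, rect, children=(), points=()):
        self.rect, self.children, self.points = rect, list(children), list(points)

def dfs(node, poly, complete, partial_leaves):
    """Recursive DFS: collect the ranges of completely overlapping nodes
    and the partially overlapping leaves that need per-point checks."""
    kind = classify(node.rect, poly)
    if kind == NONE:
        return                              # prune this subtree
    if kind == COMPLETE:
        complete.append(node.rect)          # whole range of points is inside
        return
    if not node.children:
        partial_leaves.append(node)         # partial leaf: check each point
        return
    for child in node.children:
        dfs(child, poly, complete, partial_leaves)
```

On a query polygon covering one quadrant of a two-level tree, `dfs` reports that quadrant as completely overlapping and the neighbouring leaves (which share only a boundary with the polygon) as partial.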
In the case of a partially overlapping node, every point must be checked to see whether it lies within the boundary of the polygon. This check is optimized by classifying the kind of intersection between the node and the polygon such that for certain scenarios only the x or the y coordinate of each point needs to be verified.
If the node intersects the polygon along a horizontal edge of the polygon, as in Figure 3.1, only the y coordinate of the data points inside the node needs to be checked, as the x coordinates of all points are already within the polygon boundary limits. If the node intersects the polygon along a vertical edge, as in Figure 3.2, only the x coordinate needs to be checked, as the y coordinates of all points are within the polygon boundary limits. For all other overlap conditions, both coordinates of every point are checked against the boundary limits of the polygon.
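The single-coordinate shortcut can be sketched as below. This is a hypothetical illustration with rectangles as (xmin, ymin, xmax, ymax) tuples: when the node's x interval is contained in the polygon's, the node can only cross a horizontal edge, so only y is tested, and symmetrically for a vertical edge.

```python
def points_in_partial_node(node, poly, points):
    """Filter the points of a partially overlapping node, testing only one
    coordinate when the node crosses a single horizontal or vertical edge."""
    nx0, ny0, nx1, ny1 = node
    px0, py0, px1, py1 = poly
    if px0 <= nx0 and nx1 <= px1:      # node crosses a horizontal edge
        return [p for p in points if py0 <= p[1] <= py1]
    if py0 <= ny0 and ny1 <= py1:      # node crosses a vertical edge
        return [p for p in points if px0 <= p[0] <= px1]
    # general case (e.g. a corner overlap): test both coordinates
    return [p for p in points
            if px0 <= p[0] <= px1 and py0 <= p[1] <= py1]
```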

Figure 3.1: Overlap of Node on the Horizontal Edge of Polygon.

Figure 3.2: Overlap of Node on the Vertical Edge of Polygon.
In the GPU implementation, the x-y coordinates of all points in a partial node are checked against the polygon edges without classifying the type of node intersection, as the extra conditional statements would increase thread divergence and reduce performance.
3.1.2 GPU Implementation
Efficient implementation of quadtree traversal on the GPU using CUDA is challenging for the following reasons. 1. The quadtree is an irregular data structure whose memory accesses and work distribution are both irregular and data-dependent, and traversing irregular data structures results in thread divergence. 2. Traversal of a quadtree built using pointers results in many pointer-chasing memory operations, which lead to un-coalesced memory accesses that hurt performance [2]. 3. The quadtree is typically built in recursive depth first style on the CPU. Recursive DFS on the GPU would reduce un-coalesced memory accesses given that the memory for all the structures is preallocated, and recursion can be implemented on recent GPUs that support dynamic parallelism, but it can lead to low performance due to function call overhead. Recursion on a GPU requires explicit stack space per thread, and deep recursion will overflow the available stack space, in which case CUDA may reduce the maximum physical threads per launch. 4. Transferring a pointer based quadtree to the GPU is itself a challenge; though this can be done with the cudaMallocManaged function, debugging becomes harder.
To take advantage of the parallel nature of GPUs, BFS is used instead of DFS. The quadtree is queried to find the points inside a polygon by first finding the quadtree nodes that overlap the polygon; once these nodes are determined, the points inside the polygon can be found from them. Given a set of nodes arranged hierarchically, the system needs to find the minimum set which solves the query correctly. This could be done by assigning one thread to a polygon, but there is not enough stack space. Since our GPU implementation requires an explicit array in shared memory for each polygon, assigning one thread per polygon would also limit the number of polygons processed simultaneously by a block due to the limited shared memory size. The stack space per thread needs to be reduced, and thread divergence and memory access must be optimized. To minimize thread divergence, reduce stack space per thread and still achieve maximum parallelization, a polygon is assigned to a warp instead of to a thread or a block of threads. To optimize memory access, memory preallocation for the input points, query polygons and point buffers (lists that hold the points within a leaf node of the quadtree) is done on the CPU.
CUDA applications exploit massive data parallelism and can process massive amounts of data within a short period of time, but accesses to GDDR DRAM global memory limit this performance due to the finite global memory bandwidth. Therefore, algorithms and techniques need to be employed to reduce the demand on DRAM bandwidth and make efficient use of the data accessed from global memory [9].
The quadtree is built on the CPU from heap objects, and each heap object has several fields; for example, each quadtree node object contains 4 child pointers and data fields which store the attributes of the node. Other heap objects include data points and point buffers. Though the CPU implementation benefits from dynamic memory allocation, the C library malloc is not preferred for the CPU-GPU implementation because this dynamic memory-management function provides allocation and de-allocation of arbitrary segments of memory, whereas to improve performance on the GPU, accesses to memory segments need to be coalesced. Memory preallocation places consecutive points and point buffers sequentially in memory and therefore results in coalesced memory access, as points close to one another are more likely to be within the same polygon boundary. Preallocating memory for the quadtree nodes, however, is not expected to improve performance, as the nodes are stored in a depth first fashion on the CPU but a BFS traversal is done on the GPU.
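The effect of this preallocation can be sketched on the host side as follows. This is an illustrative layout with hypothetical function names, showing how linked per-leaf point buffers can be replaced by one contiguous array plus per-leaf (offset, count) pairs, so that consecutive leaves read consecutive memory.

```python
def flatten_point_buffers(leaf_point_lists):
    """Pack the per-leaf point lists into one contiguous array.
    Each leaf then stores only (offset, count) into the flat array,
    instead of a pointer chain of buffers."""
    flat, ranges, offset = [], [], 0
    for pts in leaf_point_lists:
        flat.extend(pts)
        ranges.append((offset, len(pts)))
        offset += len(pts)
    return flat, ranges

def leaf_points(flat, ranges, leaf_index):
    """Recover one leaf's points from the flat array."""
    off, cnt = ranges[leaf_index]
    return flat[off:off + cnt]
```

Copying `flat` to the device as one block is what makes reads of neighbouring leaves coalesce.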
The DRAM cores are slower because the market demands high capacity rather than low latency. These cores operate at a much lower speed than the interface: GDDR4 operates at 1/8th of the interface speed, and this difference will increase with every generation of GDDR as the storage capacity increases [9]. Since DRAM cores are slower than the interface, the system needs a wide buffer to feed the faster interface. Modern DRAM banks are designed to be accessed in burst mode; if the accesses are not to sequential locations, the transferred burst bytes are discarded. To take full advantage of the DRAM system in the CUDA environment, the processor must make use of all the data bursted out of the DRAM chip. Thus, from a memory throughput standpoint, computation must be structured around sequential memory accesses to achieve peak memory performance [9]; random memory access leads to a huge performance penalty.
Drawbacks of Implementing DFS on GPU
One option to traverse down the tree for every polygon is to use DFS by assigning a warp or a thread to each polygon. But a DFS implementation on the GPU would cause an overhead due to the data movement as the thread moves up and down the tree: the interior nodes of the tree are repeatedly visited as a thread has to return to a higher level of the tree to traverse down to other children [4], resulting in additional work for each thread.
BFS Implementation on GPU
Stack based breadth first tree-traversal allows parallelization. To take advantage of the parallel nature of the GPU, the threads are launched at level 3 of the quadtree instead of the root node; starting at the root node would leave only one thread active for the level 1 traversal, while starting at a deeper level results in more parallelism. Initially, the indices of the nodes at level 3 of the quadtree are stored in an array in shared memory, and these indices are used to fetch the quadtree nodes from global memory.
Here the tree is traversed down to find the boundaries of a polygon, extracting the quadtree nodes within that boundary or partially overlapping it. For completely overlapping nodes, the boundary of the node gives the range of points within the polygon. This method is more efficient, in both computation and memory, than traversing down the quadtree to find the individual points within the polygon. For partially overlapping nodes, each point needs to be checked against the boundary of the polygon. With this information, the range of points within the polygon region can be found, and this range is stored instead of the individual points to save memory on the GPU. With a very large number of input points and query polygons, storing individual points for each polygon would overflow the memory.
Each polygon is associated with a CUDA warp, where the number of threads in a warp depends on the CUDA configuration. By launching thousands of GPU blocks in parallel, N polygons can be queried, where N = (size of each block / 32) * (total number of blocks). Each warp executes the same algorithm but on a different polygon; assigning a warp to a polygon results in less thread divergence. Each thread in a warp reads the x, y coordinates of all four corners of a polygon, which are required for the tree traversal.
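The launch arithmetic above can be written out directly; the function name is a hypothetical convenience.

```python
WARP_SIZE = 32

def polygons_per_launch(threads_per_block, num_blocks):
    """One polygon per warp:
    N = (size of each block / 32) * (total number of blocks)."""
    return (threads_per_block // WARP_SIZE) * num_blocks
```

For example, 1024-thread blocks with a 65535-block grid give 32 * 65535 = 2097120 polygons per kernel launch.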
The GPU kernel replaces the recursive call with stack push operations that save the indices of each traversed child node that satisfies the query. Traversal is facilitated by a loop that assigns the node indices in the stack to the threads in a warp until the stack is empty, indicating there are no more nodes to visit at that level for a polygon. To save stack space, node indices are stored in the stack instead of the nodes themselves. At the beginning of each iteration of the loop, the nodes are popped from the stack. After popping a node, the function which checks the overlap conditions is executed on that node, possibly pushing more nodes onto the stack, or returning when it reaches a node that fails the criteria, an empty node or a leaf node.
If the number of nodes in the stack exceeds the number of threads assigned to a polygon, the warp loops over the stack in increments of the warp size. Within each iteration, each node is assigned to one thread. Each thread reads the quadtree node's properties and tests whether or not to traverse the tree further down from this node. If the criterion to traverse further is satisfied, the indices of all four children of the node are added to the stack, provided the child is not an empty node. One stack space is assigned to each polygon. Traversal at each level of the quadtree is synchronized, so that the same stack can be re-used for a polygon. The tree-traversal loop is repeated until all levels of the quadtree have been visited. The number of GPU threads used per level is equal to the number of nodes that were pushed onto the stack in the previous iteration.
The warp starts the traversal at level 3 of the quadtree. The overlap condition checks whether a node overlaps the polygon: the indices of the children of the nodes that satisfy the query are placed in an array, the nodes that do not satisfy the condition are ignored, and the next level is evaluated. Once the leaf nodes are reached, they are classified as completely or partially overlapping nodes, and from these nodes all the points inside the polygon are taken. The boundaries of the completely overlapping nodes, which give the range of points within those nodes, are stored.
In the case of partially overlapping nodes, each point within the node is checked against the boundary of the polygon, and the range of points that lie within the polygon is stored. At the last level of the tree, the warp still holds the polygon, and each thread within the warp is assigned to a leaf node that is either partially or completely overlapping the polygon. Each of these threads computes the points within its node that are inside the polygon.
This approach minimizes the use of shared memory space, as only the indices of the nodes that satisfy the required criteria, starting from level 3, are stored in shared memory instead of the nodes themselves. The indices are used to fetch the corresponding nodes from global memory. The order in which the nodes are stored in the stack does not matter, as every level in the quadtree is processed independently of the previous one and every node is processed independently of the other nodes in the stack.
The indices of the nodes are saved in the stack array in shared memory, and the stack is cleared after computing every level. Shared memory is used in order to gain significant speedups of approximately 150 to 200 times [8]. If a node index is -1, the node is empty and it is not fetched from global memory. The number of nodes that need to be visited at the next level is also stored in a shared memory variable for every polygon and is incremented atomically; the iterative loop launches threads based on the count in this variable. The loop executes until the last level of the quadtree is reached.
To implement this algorithm, the maximum number of levels in the quadtree is limited to 4 because of the limited size and number of stacks in shared memory. As the number of levels in a quadtree increases, the maximum stack space required per polygon also increases, which limits the number of polygons processed.
The size of shared memory per block on the GTX 680 is 49152 bytes [12]. As there are 32 warps in a block by default, each block is assigned 32 polygons. Each polygon requires an array in shared memory to store the node indices from each level that need to be traversed further. The maximum array size required for a 4 level quadtree is 64, as the maximum number of nodes at the leaf level is 64. This allocation occupies a shared memory space of 32 x 64 x 4 = 8192 bytes, which is well within the limit. For a 5 level quadtree the array size for each polygon grows to 256, and the resulting per-block allocation exceeds the available shared memory. Therefore, unless a larger shared memory becomes available in the future, the quadtree cannot be subdivided beyond level 4 in this case.

In Figure 3.3, a check mark indicates that the node satisfies the query and the tree will be traversed further from this node. A cross mark indicates either that the node does not satisfy the query or that the node is empty, and the threads terminate at this point without proceeding further.
Figure 3.3: BFS GPU Implementation.
The traversal starts at level 3 of the quadtree. The children of the nodes that satisfy the overlap conditions are pushed into the array, and the array is re-initialized after every iteration. The cross mark indicates that a node either does not satisfy the overlap conditions or is empty. The nodes may be stored in the array in any order, and the process repeats until the bottom of the tree is reached. The number of stack arrays in shared memory per block depends on the number of warps in a block; at most, the iterative loop processes 32 nodes at a time by default (equal to the number of threads in a warp). The conditional statements are written to avoid function return statements, which would prematurely exit the traversal loop and prevent the remainder of the loop body from executing.
A polygon is checked against a node of the quadtree for 3 scenarios: completely overlapping, partially overlapping and not overlapping. All nodes are checked for the not overlapping condition until a leaf node is reached. If a node satisfies this condition, the tree is not traversed further from that node; if it does not, its children are placed in the stack and the traversal continues.
A partially overlapping node requires more checks, and the use of if statements decreases performance due to control divergence: if threads within a warp take different paths, two passes over the divergent paths are required to execute all threads. Since the different execution paths are serialized, this leads to a performance penalty.
Therefore, to reduce the number of conditional (if) statements, the condition for a completely overlapping node is computed first on the leaf nodes; the nodes that satisfy the condition are classified as completely overlapping and the nodes that do not are classified as partially overlapping. Figure 3.4 shows a high level overview of the algorithm to find the points in a polygon using a quadtree on a GPU.

Figure 3.4: Process Flow.
3.1.3 BFS Implementation on GPU: Different Scenarios of Query Polygon
Figures 3.5 through 3.8 show the data points (left) and the corresponding tree representation (right); these data points are analyzed for different polygon overlap scenarios in the next sections. The numbers inside the circles represent the index of each node at that level. Figure 3.5 shows the root node, which contains all input data points. The root node is subdivided into 4 equal regions to form 4 nodes at level 2. Figure 3.6 shows the four children of the root node; the root node has all four children, as data points are present in all four directions (NW, NE, SW, SE) of the root node. The nodes at level 2 are again subdivided equally to form their child nodes. Figure 3.7 shows the nodes at level 3 of the quadtree. Child node 3 of parent node 1, child nodes 1 and 4 of parent node 3, and child node 3 of parent node 4 (all parents from level 2) are empty, as the parent nodes did not have any points in those directions. Figure 3.8 shows the child nodes at level 4; many of the nodes at this level are empty, as they do not contain any points.
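The level structure in these figures can be reproduced with a minimal point-quadtree builder. This is an illustrative sketch rather than the thesis code: regions are (xmin, ymin, xmax, ymax) tuples, and empty children are stored as None, just as empty nodes are omitted in the figures. The half-open containment test excludes points on the outermost top/right boundary; a real implementation would close the top-level boundary.

```python
def build(points, rect, level, max_level=4):
    """Subdivide rect into NW, NE, SW, SE quadrants down to max_level;
    quadrants containing no points become None (empty nodes)."""
    node = {"rect": rect, "children": [], "points": points}
    if level == max_level:
        return node                         # leaf: keep the points here
    x0, y0, x1, y1 = rect
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    quadrants = [(x0, my, mx, y1), (mx, my, x1, y1),   # NW, NE
                 (x0, y0, mx, my), (mx, y0, x1, my)]   # SW, SE
    for qx0, qy0, qx1, qy1 in quadrants:
        pts = [p for p in points if qx0 <= p[0] < qx1 and qy0 <= p[1] < qy1]
        node["children"].append(
            build(pts, (qx0, qy0, qx1, qy1), level + 1, max_level)
            if pts else None)
    return node
```

With one point per quadrant, all four level-2 children exist, as in Figure 3.6; with a single point, only the quadrants along its path are materialized, matching the empty nodes of Figures 3.7 and 3.8.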
Figure 3.5: Level 1 Quadtree.
Figure 3.6: Level 2 Quadtree.

Figure 3.7: Level 3 Quadtree.
Query Polygon inside a Leaf Node
Figure 3.9 shows the first scenario, the best case, in which the polygon lies within one quadrant at level 4 of the quadtree and the algorithm provides its best performance.
Figure 3.9: Sample Datapoints (left) and the Quadtree (right) at Level 1.
Figure 3.10 shows that at level 2, the algorithm isolates one quadrant (first child of the root node) and ignores all other quadrants, thus bringing down the number of quadrants to be processed to 1 from 4 (4 is the total number of nodes at this level).
Figure 3.10: Sample Datapoints (left) and the Quadtree (right) at Level 2.

Figure 3.11 shows that at level 3, the algorithm again isolates one quadrant, the first child of the quadrant isolated at the previous level. This step reduces the number of quadrants to be processed to 1 from 3 (3 is the total number of children of the node output from the previous level).
Figure 3.11: Sample Datapoints (left) and the Quadtree (right) at Level 3.
In Figure 3.12, at level 4, the algorithm isolates one quadrant, the first child of the quadrant isolated at level 3. This step reduces the number of quadrants to be processed to 1 from 2 (2 is the total number of children of the node output from level 3). Finally, only one quadrant at level 4 needs to be processed in order to get the points inside the polygon in this case.

Figure 3.12: Sample Datapoints (left) and the Quadtree (right) at Level 4.
Query Polygon Overlapping 2 Nodes
Figure 3.13 shows the second scenario, a condition in which the query polygon overlaps 2 nodes at level 2 and the algorithm provides medium performance.
Figure 3.13: Sample Datapoints (left) and the Quadtree (right) at Level 1.
Figure 3.14 shows that at level 2, the algorithm isolates two quadrants (the second and fourth children of the root node) and ignores the other two, thus bringing down the number of quadrants to be processed to 2.
Figure 3.14: Sample Datapoints (left) and the Quadtree (right) at Level 2.
Figure 3.15 shows that at level 3, the algorithm isolates 7 quadrants: all four children of the second child of the root node, and the first, second and fourth children of the fourth child of the root node. This step reduces the number of nodes to be processed by one, as one of the children of the nodes output from the previous level is empty.
Figure 3.15: Sample Datapoints (left) and the Quadtree (right) at Level 3.
Figure 3.16 shows that at level 4, the algorithm isolates 5 quadrants from the children of the 7 quadrants from the previous level. By traversing down to level 4, the number of data points that need to be evaluated is reduced by ignoring the 3rd child of the 4th child from level 3. If the quadrant ignored at level 4 had held a larger number of data points, using a 3-level quadtree would have resulted in a huge performance penalty. Finally, only 5 quadrants at level 4 need to be processed in order to get the points inside the polygon in this case.
Figure 3.16: Sample Datapoints (left) and the Quadtree (right) at Level 4.
Query Polygon Overlapping All Leaf Nodes
Figure 3.17 shows the last scenario, a condition in which the polygon contains the maximum number of quadrants at level 4 of the quadtree and the algorithm provides the least performance of all the scenarios.

Figure 3.17: Sample Datapoints (left) and the Quadtree (right) at Level 1.
In Figure 3.18, at level 2, the polygon overlaps all 4 quadrants at this level, so the thread has to descend the tree from all four nodes.
Figure 3.18: Sample Datapoints (left) and the Quadtree (right) at Level 2.
In Figure 3.19, at level 3, the algorithm isolates all 12 quadrants at this level. The maximum possible number of nodes at this level is 16, but four of those nodes are empty and are ignored even though they lie within the polygon.

Figure 3.19: Sample Datapoints (left) and the Quadtree (right) at Level 3.
Figure 3.20 shows that at level 4, the algorithm isolates all 12 quadrants at this level. The maximum possible number of nodes at this level is 64, but 52 of those nodes are empty and are ignored even though they lie within the polygon. Finally, only 12 quadrants at level 4 need to be processed in order to get the points inside the polygon in this case. If a larger number of data points were distributed equally across the entire 2D space, leaving no empty nodes, all 64 nodes would need to be processed.
Figure 3.20: Sample Datapoints (left) and the Quadtree (right) at Level 4.

Figure 3.21 shows the stack based iterative BFS traversal on the GPU, in reference to Figure 3.19 and Figure 3.20, with the objects in the stack after the 1st and 2nd iterations.
Figure 3.21: Stack Based Iterative BFS Traversal.
Query Polygon Containing No Points
Figure 3.22 shows one case where the polygon overlaps a region in which there are no points; Figure 3.23 shows two distinct cases. At level 2, the algorithm isolates the first and third children of the root node. At level 3, the polygon does not overlap with any of the children of the nodes from level 2; the nodes that it overlaps are empty nodes, and therefore the tree is not traversed any further.

Figure 3.22: Sample Datapoints (left) and the Quadtree (right) at Level 1.
Figure 3.23: Sample Datapoints (left) and the Quadtree (right) at Level 2.

4. Experimental Results
The experimental results show that in many cases the GPU outperforms the CPU, as expected.
4.1 Results
Performance is compared in terms of the number of polygons that can be queried per second, measured for a wide range of input sizes. As the number of data points increases, the performance of the CPU diminishes at a faster rate than that of the GPU. On the GPU, performance is affected as the input grows because the amount of work done per thread increases and the overhead of transferring large amounts of data between the CPU and GPU increases. But once the quadtree is transferred to the GPU, any number of polygons can be queried using the iterative BFS traversal method described above.
Figure 4.1, Figure 4.2 and Figure 4.3 show the performance improvement of the CPU-GPU approach: for small polygons, a factor of 3.2 on small datasets and 449 on very large datasets; for medium sized polygons, a factor of 4 on small datasets and 594 on very large datasets; and for large polygons, a factor of 3.6 on small datasets and 591 on very large datasets.

Figure 4.1: Performance Comparison between CPU and GPU for Small Polygons.
Figure 4.2: Performance Comparison between CPU and GPU for Medium Polygons.

Figure 4.3: Performance Comparison between CPU and GPU for Large Polygons.
The CPU performance diminishes at a faster rate than the GPU mainly because of the quadtree structure and the memory hierarchy. Since the tree is a pointer based structure, there is a cache miss every time a node is accessed. At the leaf nodes, the points are stored in a linked list in which each buffer contains a pre-determined number of points, and there is a cache miss every time a new buffer in the list is accessed. Cache miss penalties are therefore higher for larger datasets.
On the GPU, the point buffers are stored sequentially in memory through memory preallocation. Though there is a penalty when a node is accessed, cache miss penalties are not incurred while reading the data points of a node. In addition, in the GPU implementation, the traversal is accelerated by starting the BFS at level 3 of the quadtree.
Figure 4.4 shows the execution time for different polygon sizes on the GPU. As on the CPU, the GPU performs better on smaller polygons than on larger ones. In this case, the smaller polygons occupy 10 to 20 percent of the region and the larger polygons occupy 70 to 90 percent. Once the leaf nodes are reached, the number of nodes that must be considered to compute the points is small for a small polygon, but a large polygon can contain up to the maximum of 64 nodes for a level 4 quadtree, and the points within all 64 of those nodes must be taken into account.
Figure 4.4: Performance Comparison for Different Sized Polygons using GPU.
The performance is determined by the number of partial nodes a polygon contains: for a partial node, each point inside the node must be checked against the polygon boundary, whereas for a completely overlapping node only the range of points the node contains is stored. Figure 4.5 shows the execution time of individual medium-sized polygons with a data input size of 3000. The polygons are ranked by the number of partial nodes they contain, and the execution time of each is measured. As the number of partial nodes increases, the execution time per polygon increases. The number of polygons calculated per second is the average of the outputs of all these individual polygons.
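The distinction between partial and fully overlapping nodes can be sketched in C (illustrative names only; the thesis code records buffer ranges for fully overlapped nodes rather than re-counting their points on the spot):

```c
#include <assert.h>

/* Axis-aligned rectangle: a node's quadrant or a query polygon's bounds. */
typedef struct { int xmin, ymin, xmax, ymax; } Rect;

typedef struct { int x, y; } Pt;

enum { OUTSIDE, PARTIAL, FULL };

/* Classify how a node's quadrant overlaps the query rectangle. */
static int classify(Rect node, Rect query) {
    if (node.xmax <= query.xmin || node.xmin >= query.xmax ||
        node.ymax <= query.ymin || node.ymin >= query.ymax)
        return OUTSIDE;
    if (node.xmin >= query.xmin && node.xmax <= query.xmax &&
        node.ymin >= query.ymin && node.ymax <= query.ymax)
        return FULL;
    return PARTIAL;
}

/* Count hits in one node. A fully overlapped node contributes all of its
 * points with no per-point test; a partial node pays for a point-in-rect
 * test on every point, which is why partial nodes dominate the runtime. */
static int count_in_node(Rect node, Rect query, const Pt *pts, int n) {
    int c = classify(node, query);
    if (c == OUTSIDE) return 0;
    if (c == FULL)    return n;          /* just record the range */
    int hits = 0;
    for (int i = 0; i < n; i++)
        if (pts[i].x >= query.xmin && pts[i].x <= query.xmax &&
            pts[i].y >= query.ymin && pts[i].y <= query.ymax)
            hits++;
    return hits;
}
```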
Figure 4.5: Execution Time of Least to Most Complex Medium Sized Polygons.
Figure 4.6 shows the performance comparison between the CPU and the GPU measured for different polygon sizes. The significant speed-up on the GPU is due to memory-coalesced access and to accelerating the traversal by starting the BFS from level 3. In this case, the algorithm traverses one level down the tree to quickly find the set of nodes that satisfy the query.
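A simplified C sketch of this seeding step (with a made-up 4 x 4 grid of quadrants standing in for the precomputed list of level 3 nodes): instead of pushing the root onto the BFS queue, only the quadrants at the seeding level that overlap the query are pushed, so the upper levels of the tree are never visited during traversal.

```c
#include <assert.h>

#define REGION 1024   /* root quadrant is REGION x REGION */
#define DIM    4      /* 4 x 4 = 16 quadrants at the seeding level */

typedef struct { int xmin, ymin, xmax, ymax; } Rect;

static int overlaps(Rect a, Rect b) {
    return a.xmin < b.xmax && a.xmax > b.xmin &&
           a.ymin < b.ymax && a.ymax > b.ymin;
}

/* Fill `queue` with the indices (row * DIM + col) of the seeding-level
 * quadrants that overlap the query; returns how many were pushed.
 * A real traversal would then run BFS downward from these nodes only. */
static int seed_bfs(Rect query, int *queue) {
    int side = REGION / DIM, n = 0;
    for (int row = 0; row < DIM; row++)
        for (int col = 0; col < DIM; col++) {
            Rect quad = { col * side, row * side,
                          (col + 1) * side, (row + 1) * side };
            if (overlaps(quad, query))
                queue[n++] = row * DIM + col;
        }
    return n;
}
```

A small query seeds only one or a few nodes, while a region-covering query seeds all of them, which mirrors the small-versus-large polygon behavior measured above.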
The maximum performance gain is achieved for the medium polygons, which are more computationally intensive than the small polygons. Only a small number of threads are active for small polygons, which does not take advantage of the throughput-oriented nature of the GPU. The work generated by medium-sized polygons is optimal for the GPU, as the performance decreases slightly for the larger polygons. The performance is mainly determined by the number of partial nodes a polygon contains. On the CPU, all the partial nodes in a polygon are processed sequentially, but on the GPU, 32 partial nodes within a polygon are computed in parallel, with 48 (number of SMX units x (CUDA cores per SMX / warp size)) [12] polygons being computed simultaneously.
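The figure of 48 follows from the GeForce GTX 680 organization in [12] (8 SMX units, 192 CUDA cores per SMX, 32-thread warps), with one warp assigned per polygon; the arithmetic can be checked directly:

```c
#include <assert.h>

/* GeForce GTX 680 organization per the NVIDIA whitepaper [12]. */
enum {
    NUM_SMX       = 8,    /* streaming multiprocessors */
    CORES_PER_SMX = 192,  /* CUDA cores per SMX */
    WARP_SIZE     = 32    /* threads per warp; one warp handles one polygon */
};

/* Warps that can execute core-resident per SMX, and polygons in flight. */
static int warps_per_smx(void)       { return CORES_PER_SMX / WARP_SIZE; }
static int concurrent_polygons(void) { return NUM_SMX * warps_per_smx(); }
```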
Figure 4.6: Speed-up of GPU over CPU.

5. Conclusion
Range search, which is a fundamental operation in GIS and spatial databases, always performs better when a spatial data structure is used, and its performance improves further when the search is implemented on a GPU. This thesis has shown that an irregular pointer-based quadtree need not be linearized in order to achieve a significant performance gain on a GPU, and that the CPU-GPU approach provides a speed-up of 3x to 500x over the pure CPU approach. In a real-world scenario, the range search problem is carried out on irregular polygons, so for future work the approach presented here can be extended to implement PIP search on irregular polygons. The GPU implementation can also be extended to larger datasets as the amount of RAM per GPU increases, and with larger shared memory, quadtrees with deeper levels can be traversed. Increases in shared memory size and in global memory bandwidth would yield a significant further improvement in performance.

[1] Jeroen Bédorf, Evghenii Gaburov, and Simon Portegies Zwart. A sparse octree gravitational N-body code that runs entirely on the GPU processor. J. Comput. Phys., 231(7):2825-2839, April 2012.
[2] Martin Burtscher and Keshav Pingali. An efficient CUDA implementation of the tree-based Barnes Hut n-body algorithm.
[3] David Eppstein. Fast hierarchical clustering and other applications of dynamic closest pairs. J. Exp. Algorithmics, 5, December 2000.
[4] Michael Goldfarb, Youngjoon Jo, and Milind Kulkarni. General transformations for GPU execution of tree traversals. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, pages 10:1-10:12, New York, NY, USA, 2013. ACM.
[5] C. Gou and Georgi Gaydadjiev. Addressing GPU on-chip shared memory bank conflicts using elastic pipeline. International Journal of Parallel Programming, 41(3), 2013.
[6] Kshitij Gupta, Jeff A. Stuart, and John D. Owens. A study of persistent threads style GPU programming for GPGPU workloads. In Innovative Parallel Computing, page 14, May 2012.
[7] Tianyi David Han and Tarek S. Abdelrahman. Reducing branch divergence in GPU programs. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4, pages 3:1-3:8, New York, NY, USA, 2011. ACM.
[8] Maria Kelly and Alexander Breslow. Quadtree construction on the GPU: A hybrid CPU-GPU approach. https://www.secs.../mkelly1/quadtrees.pdf.
[9] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2010.
[10] Jiayuan Meng, David Tarjan, and Kevin Skadron. Dynamic warp subdivision for integrated branch and memory divergence tolerance. SIGARCH Comput. Archit. News, 38(3):235-246, June 2010.
[11] NVIDIA. Whitepaper: NVIDIA Kepler GK110. 2012.
[12] NVIDIA. Whitepaper: NVIDIA GeForce GTX 680. 2012.
[13] John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879-899, May 2008.
[14] Clifford A. Shaffer and Patrick R. Brown. A paging scheme for pointer-based quadtrees.
[15] Bo Wu, Zhijia Zhao, Eddy Z. Zhang, Yunlian Jiang, and Xipeng Shen. Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU. In 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2013.
[16] Jianting Zhang and Simin You. High-performance quadtree constructions on large-scale geospatial rasters using GPGPU parallel primitives. International Journal of Geographical Information Science, 27(11):2207-2226, 2013.

APPENDIX A. CUDA Code for Brute Force Search
/*
 * File name :
 * Create random points and polygons.
 * Perform a brute force search to find points in polygon on GPU.
 */
/**<************************# Includes **************************/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include "hw4_4A_timing.h"
/**<************************# Defines **************************/
#define BUILD_FULL 1
#define BUILD_ADAPTIVE 2
#define MODE_RANDOM 1
#define MODE_FILE 2
#define TRUE 1
#define FALSE 0
#ifndef RANGE
#define RANGE 1024
#endif
#ifndef SIZE
#define SIZE 20000
#endif
/**<************************ Global variables ********************/
int pointMode = MODE_RANDOM;
char *inputPointFileName;
char *outputTreeFileName;
int rangeSize = RANGE;
int bucketSize = 32;
int numPoints = SIZE;
int numPolygon = 50;
int pointRangeX = RANGE;
int pointRangeY = RANGE;
typedef int POLYID;
/**<************************ Structure definition ************/
// Input data point.
typedef struct POINT {
    int x;
    int y;
    int index;
} POINT;
// Point coordinates of a node.
typedef struct NODEPP {
    int xminPP;
    int yminPP;
} NODEPP;

// Query polygon.
typedef struct Polygon {
    // Coordinates of corners of a polygon and its width and height.
    int xmin;
    int ymin;
    int width;
    int height;
    int xmax;
    int ymax;
    int count;
    int index;
    NODEPP nodePoint[20000];
} Polygon;
/**<***************** Performance functions ***************/
typedef struct perf_struct {
    double sample;
    struct timeval tv;
} perf;

perf timeRecord;

void init_perf(perf *t) {
    t->sample = 0;
}

// Start time measurement.
void start_perf_measurement(perf *t) {
    gettimeofday(&(t->tv), NULL);
}

// Stop time measurement.
void stop_perf_measurement(perf *t) {
    struct timeval ctv;
    gettimeofday(&ctv, NULL);
    struct timeval diff;
    timersub(&ctv, &(t->tv), &diff);
    double time = diff.tv_sec + diff.tv_usec / 1000000.0;
    t->sample = time;
}

// Print time difference.
void print_perf_measurement(perf *t) {
    printf("%f", t->sample);
}
/**<************************ Random number generator *********************/
/*
 * Generate random numbers.
 * 1- Generate random data points.
 * 2- Generate random numbers for polygon corners.
 */
// Generate a random number within a range.
int randn(int range) {
    int a;
    a = rand() % range;
    return a;
}

// Create random query polygons.
Polygon *createRandomPolygon(unsigned int nPolygon, unsigned int range) {
    Polygon *polyarray = (Polygon *) malloc(sizeof(Polygon) * nPolygon);
    unsigned int index;
    for (index = 0; index < nPolygon; index++) {
        polyarray[index].xmin = randn(512);
        polyarray[index].ymin = randn(512);
        polyarray[index].width = 400;
        polyarray[index].height = 300;
        polyarray[index].xmax = polyarray[index].xmin + polyarray[index].width;
        polyarray[index].ymax = polyarray[index].ymin + polyarray[index].height;
        polyarray[index].index = index;
    }
    return polyarray;
}
// Create random data points.
POINT *createRandomPoints(unsigned int nPoints, unsigned int range) {
    POINT *pointArray = (POINT *) malloc(sizeof(POINT) * nPoints);
    unsigned int index;
    for (index = 0; index < nPoints; index++) {
        pointArray[index].x = randn(range);
        pointArray[index].y = randn(range);
        pointArray[index].index = index;
    }
    return pointArray;
}

POINT *createPoints(int numPoints) {
    POINT *pointArray;
    pointArray = createRandomPoints(numPoints, rangeSize);
    return pointArray;
}
/**<************************ Kernel function ********************/
/**<************************ Brute force search *****************/
/*
 * Function to search points inside polygons using brute force search.
 * 1- Assign each thread to a data point.
 * 2- The thread checks if the point is within a given polygon.
 * 3- Repeat till all polygons are checked.
 * 4- Store the attributes of the points which satisfy the query.
 * 5- Then a grid stride is done, so that the thread loops over to get the
 *    next point.
 */
// Brute force search.
__global__ void bruteForce(POINT *d_pointArray, Polygon *d_Polygon,
                           int numPoints, int numPolygon)
{
    // Get global thread index.
    int tidG = threadIdx.x + blockIdx.x * blockDim.x;
    int pid;
    // Boundary check for threads.
    if (tidG < numPoints) {
        // Loop till all points are checked (grid stride).
        for (int tid = tidG; tid < numPoints; tid = tid + blockDim.x * gridDim.x) {
            // Loop till all polygons are checked.
            for (pid = 0; pid < numPolygon; pid++) {
                // Check every point with every polygon.
                if ((d_pointArray[tid].x >= d_Polygon[pid].xmin) &&
                    (d_pointArray[tid].x <= d_Polygon[pid].xmax) &&
                    (d_pointArray[tid].y >= d_Polygon[pid].ymin) &&
                    (d_pointArray[tid].y <= d_Polygon[pid].ymax)) {
                    // Store the points that lie within a polygon; the
                    // counter is shared between threads, so it must be
                    // incremented atomically.
                    int slot = atomicAdd(&d_Polygon[pid].count, 1);
                    d_Polygon[pid].nodePoint[slot].xminPP = d_pointArray[tid].x;
                    d_Polygon[pid].nodePoint[slot].yminPP = d_pointArray[tid].y;
                }
            }
        }
    }
}

/**<************************ Main function ***************************/
/*
 * Main function.
 * 1- Create points randomly.
 * 2- Create polygons randomly.
 * 3- Copy points and polygons from host to device.
 * 4- Launch CUDA kernel to perform the search.
 * 5- Time the kernel execution.
 */
int main(int argc, char **argv) {
    // Host variables.
    float diff_GPU;
    struct Polygon *randomPoly;
    float time = 0.0;
    int NumberofPolygons;
    // Device variables.
    POINT *d_pointArray;
    Polygon *d_Polygon;
    // Create data points and polygons randomly.
    POINT *pointArray = createPoints(numPoints);
    randomPoly = createRandomPolygon(numPolygon, rangeSize);
    // Allocate device memory.
    cudaMalloc((void**)&d_pointArray, sizeof(POINT) * numPoints);
    cudaMalloc((void**)&d_Polygon, sizeof(Polygon) * numPolygon);
    // Memory transfer from host to device.
    cudaMemcpy(d_pointArray, pointArray, sizeof(POINT) * numPoints,
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_Polygon, randomPoly, sizeof(Polygon) * numPolygon,
               cudaMemcpyHostToDevice);
    // Time the brute force search.
    cudaEvent_t start_GPU, stop_GPU;
    cudaEventCreate(&start_GPU);
    cudaEventCreate(&stop_GPU);
    cudaEventRecord(start_GPU, 0);
    // Kernel launch.
    bruteForce<<<(numPoints / 1024), 1024>>>(d_pointArray, d_Polygon,
                                             numPoints, numPolygon);
    cudaError_t cudaerr = cudaDeviceSynchronize();
    if (cudaerr != cudaSuccess) {
        printf("kernel launch failed with error \"%s\".\n",
               cudaGetErrorString(cudaerr));
    }
    cudaEventRecord(stop_GPU, 0);
    cudaEventSynchronize(stop_GPU);
    cudaEventElapsedTime(&diff_GPU, start_GPU, stop_GPU);
    time = diff_GPU;
    // Calculate number of polygons per second.
    NumberofPolygons = numPolygon / time;
    printf("%d\n", NumberofPolygons);
    cudaEventDestroy(start_GPU);
    cudaEventDestroy(stop_GPU);
    return 0;
}

APPENDIX B. Header File for Memory Allocation
/*
 * File Name : MemoryManager.h
 * Define objects of quadtree.
 * Define query polygons.
 * Preallocate memory for all quadtree objects on CPU.
 * Memory allocation on the GPU.
 */
/**<************************# Includes ********************************/
#include <stdlib.h>
#include <stdbool.h>
/**<************************ Structure definition *********************/
typedef int POINTID;
typedef int NODEID;
typedef int BUFFID;
typedef int POLYID;
/*
 * The CPU has three main structures: NODE (nodes of the quadtree),
 * POINT (the input points for the quadtree) and
 * POINT_BUFFER (the array of points which each leaf node of the quadtree
 * holds).
 */
typedef struct POINT {
    int x;
    int y;
    int index;
} POINT;

typedef struct POINT_BUFFER {
    // Array of points.
    POINTID pointArray;
    unsigned int pointCount;
    // Use a linked list to point to another buffer at the same level.
    BUFFID nextId;
} POINT_BUFFER;
typedef struct NODE {
    // Level.
    unsigned int level;
    // Keep track of type of NODE: LEAF, LINK.
    unsigned int type;
    // Location in 2D space.
    int posX;
    int posY;
    // Size of quadrant.
    int width;
    int height;
    // Description of points.
    int count[4];
    int total;
    int index;
    int parent_index;
    int offset;
    // For adaptive implementation.
    int open;
    // This handles the 4 regions 0,1,2,3 representing sub-quadrants.
    NODEID child[4];
    BUFFID pBuffer;
    bool P0, P1, P2, P3;
} NODE;
// Definition of query polygon.
typedef struct Polygon {
    // Size of polygon:
    // xmin, ymin, xmax and ymax define the 4 corners of a polygon.
    int xmin;
    int ymin;
    int width;
    int height;
    int xmax;
    int ymax;
    int count;
    int index;
    POLYID *PolyPointId;
} Polygon;

// Structure to store points from completely overlapping nodes in GPU.
typedef struct CP {
    int polygonIndexCP;
    int CPcount;
    NODEID nodeNumCP[64];
} CP;

// Structure to store points from partially overlapping nodes in GPU.
typedef struct NODEPP {
    int xminPP;
    int yminPP;
    int xmaxPP;
    int ymaxPP;
} NODEPP;

typedef struct PP {
    int polygonIndexPP;
    int PPcount;
    NODEPP nodePP_array[64];
} PP;
/**<************************ Memory preallocation **********************/
// Preallocating memory for point structure.
typedef struct {
    POINT *pArray;
    int currCount, maxSize;
} point_array_t;

point_array_t pArr;

void allocatePointMemory() {
    pArr.pArray = (POINT *)malloc(100000000 * sizeof(POINT));
    pArr.currCount = 0;
    pArr.maxSize = 100000000;
}

// Function to fetch memory for point from preallocated memory.
POINT *getPoint(int id) {
    if (id >= pArr.currCount || id < 0) {
        exit(-1);
    }
    return &pArr.pArray[id];
}

// Allocate memory for a new point.
POINTID getNewPoint() {
    if (pArr.currCount >= pArr.maxSize) {
        pArr.pArray = (POINT *)realloc(pArr.pArray,
                                       (pArr.maxSize + 10000) * sizeof(POINT));
        pArr.maxSize += 10000;
    }
    return pArr.currCount++;
}

// Fetch memory for array of points.
POINTID getPointArray(int nelems) {
    int strtId = pArr.currCount;
    pArr.currCount += nelems;
    if (pArr.currCount >= pArr.maxSize) {
        pArr.maxSize = (pArr.maxSize + 10000) > pArr.currCount ?
                       pArr.maxSize + 10000 : pArr.currCount + 1000;
        pArr.pArray = (POINT *)realloc(pArr.pArray,
                                       pArr.maxSize * sizeof(POINT));
    }
    return strtId;
}
// Preallocating memory for node structure.
typedef struct {
    NODE *nodeArray;
    int currCount, maxSize;
} node_array_t;

node_array_t nodeArr;

void allocateNodeMemory() {
    nodeArr.nodeArray = (NODE *)malloc(1000 * sizeof(NODE));
    nodeArr.currCount = 0;
    nodeArr.maxSize = 1000;
}

// Function to fetch memory for node from preallocated memory.
NODE *getNode(int id) {
    if (id >= nodeArr.currCount || id < 0) {
        exit(-1);
    }
    return &nodeArr.nodeArray[id];
}

// Allocate memory for a new node.
NODEID getNewNode() {
    if (nodeArr.currCount >= nodeArr.maxSize) {
        nodeArr.nodeArray = (NODE *)realloc(nodeArr.nodeArray,
                                            (nodeArr.maxSize + 10000) * sizeof(NODE));
        nodeArr.maxSize += 10000;
    }
    return nodeArr.currCount++;
}
// Preallocating memory for point buffer structure.
typedef struct {
    POINT_BUFFER *bufferArray;
    int currCount, maxSize;
} buff_array_t;

buff_array_t bufferArr;

void allocatePOINTBUFFERMemory() {
    bufferArr.bufferArray =
        (POINT_BUFFER *)malloc(4000000 * sizeof(POINT_BUFFER));
    bufferArr.currCount = 0;
    bufferArr.maxSize = 4000000;
}

// Function to fetch memory for point buffer from preallocated memory.
POINT_BUFFER *getBUFFER(int id) {
    if (id >= bufferArr.currCount || id < 0) {
        exit(-1);
    }
    return &bufferArr.bufferArray[id];
}

// Allocate memory for a new buffer.
BUFFID getNewBUFFER() {
    if (bufferArr.currCount >= bufferArr.maxSize) {
        bufferArr.bufferArray = (POINT_BUFFER *)realloc(bufferArr.bufferArray,
            (bufferArr.maxSize + 10000) * sizeof(POINT_BUFFER));
        bufferArr.maxSize += 10000;
    }
    return bufferArr.currCount++;
}

// Get the node indices of the level 3 nodes.
typedef struct {
    NODEID *Level3Nodes;
    int count;
} level3Node;

level3Node Level3NodeArray;

void levelThreeNode() {
    Level3NodeArray.Level3Nodes = (NODEID *)malloc(64 * sizeof(NODEID));
    int current = getNewNode() - 1;
    int i;
    int j = 0;
    for (i = 0; i <= current; i++) {
        if (getNode(i)->level == 2) {
            Level3NodeArray.Level3Nodes[j] = i;
            j++;
        }
    }
    Level3NodeArray.count = j;
}

/**<****************** CUDA device memory allocation *****************/
// Allocate CUDA memory for point structure.
POINT *d_POINT;
void allocatePointMemoryCuda() {
    cudaMalloc((void**)&d_POINT, sizeof(POINT) * pArr.currCount);
}

// Allocate CUDA memory for point buffer structure.
POINT_BUFFER *d_POINT_BUFFER;
void allocatePoint_BufferMemoryCuda() {
    cudaMalloc((void**)&d_POINT_BUFFER,
               sizeof(POINT_BUFFER) * bufferArr.currCount);
}

// Allocate CUDA memory for node structure.
NODE *d_NODE;
void allocateNodeMemoryCuda() {
    cudaMalloc((void**)&d_NODE, sizeof(NODE) * nodeArr.currCount);
}

// Allocate memory for level 3 nodes.
NODEID *d_NODE_In;
void allocateNodeIDMemoryCuda() {
    cudaMalloc((void**)&d_NODE_In, sizeof(NODEID) * Level3NodeArray.count);
}

/**<******************* Memory transfer from CPU to GPU **************/
// Copy point structure to GPU.
void PointCudaCopy() {
    cudaMemcpy(d_POINT, pArr.pArray, sizeof(POINT) * pArr.currCount,
               cudaMemcpyHostToDevice);
}

// Copy point buffer structure to GPU.
void Point_BufferCudaCopy() {
    cudaMemcpy(d_POINT_BUFFER, bufferArr.bufferArray,
               sizeof(POINT_BUFFER) * bufferArr.currCount,
               cudaMemcpyHostToDevice);
}

// Copy node structure to GPU.
void NodeCudaCopy() {
    cudaMemcpy(d_NODE, nodeArr.nodeArray, sizeof(NODE) * nodeArr.currCount,
               cudaMemcpyHostToDevice);
}

// Copy level 3 nodes to GPU.
void NodeIDCudaCopy() {
    cudaMemcpy(d_NODE_In, Level3NodeArray.Level3Nodes,
               sizeof(NODEID) * Level3NodeArray.count,
               cudaMemcpyHostToDevice);
}

APPENDIX C. CUDA C Code for PIP Search on GPU
/*
 * File name :
 * Construct quadtree in CPU.
 * Traverse quadtree in GPU to find PIP.
 */
/**<************************# Includes ********************************/
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#include "MemoryManager.h"
/**<************************# Defines *********************************/
#define CUDA_KERNEL_DIM(...) <<< __VA_ARGS__ >>>
#define BUILD_FULL 1
#define BUILD_ADAPTIVE 2
#define MODE_RANDOM 1
#define MODE_FILE 2
#define TRUE 1
#define FALSE 0
#define pMax 32
#ifndef RANGE
#define RANGE 1024
#endif
/**<***************** Global variables *******************************/
int pointMode = MODE_RANDOM;
char *inputPointFileName;
char *outputTreeFileName;
int rangeSize = RANGE;
int bucketSize = 32;
int numPoints = 3000;
int numLevels = 3;
int numSearches = 0;
int printTree = 0;
int outputTree = 0;
int quadTreeMode = BUILD_FULL;
int numPolygon = 1099120;
int pointRangeX = RANGE;
int pointRangeY = RANGE;
int completeIndex = 0;
int NotIndex = 0;
int PartialIndex = 0;
int arraysize = 100;
/**<***************** Enumerations ***********************************/
enum {
    TYPE_NONE = 0,
    TYPE_LEAF,
    TYPE_LINK
};
enum {
    FullyOverlapped = 0,
    PartiallyOverlapped
};
/**<****************** Timing Functions ******************************/
typedef struct perf_struct {
    double sample;
    struct timeval tv;
} perf;

perf timeRecord;

void init_perf(perf *t) {
    t->sample = 0;
}

// Start time measurement.
void start_perf_measurement(perf *t) {
    gettimeofday(&(t->tv), NULL);
}

// Stop time measurement.
void stop_perf_measurement(perf *t) {
    struct timeval ctv;
    gettimeofday(&ctv, NULL);
    struct timeval diff;
    timersub(&ctv, &(t->tv), &diff);
    double time = diff.tv_sec + diff.tv_usec / 1000000.0;
    t->sample = time;
}

// Print time difference.
void print_perf_measurement(perf *t) {
    printf("%f", t->sample);
}
/**<************************ Random number generator *****************/
/*
 * Generate random numbers.
 * 1- Generate random data points.
 * 2- Generate random numbers for polygon corners.
 */
int randn(int range) {
    int a;
    a = rand() % range;
    return a;
}

int randnRng(int N, int M)
{
    int r;
    // Initialize random seed.
    srand(time(NULL));
    // Generate number between N and M.
    r = M + random() / (RAND_MAX / (N - M + 1) + 1);
    return r;
}

// Creates random polygons with random xmin, ymin, width and height.
Polygon *createRandomPolygon(unsigned int nPolygon, unsigned int range) {
    Polygon *polyarray = (Polygon *) malloc(sizeof(Polygon) * nPolygon);
    unsigned int index;
    for (index = 0; index < nPolygon; index++) {
        polyarray[index].xmin = randn(723);
        polyarray[index].ymin = randn(723);
        polyarray[index].width = randnRng(100, 300);
        polyarray[index].height = randnRng(100, 300);
        polyarray[index].xmax = polyarray[index].xmin + polyarray[index].width;
        polyarray[index].ymax = polyarray[index].ymin + polyarray[index].height;
        polyarray[index].count = 0;
        polyarray[index].index = index;
    }
    return polyarray;
}

// Generate random data points, given a number of points and their range.
POINTID createRandomPoints(unsigned int nPoints, unsigned int range) {
    POINTID pointArray = getPointArray(nPoints);
    unsigned int index;
    for (index = 0; index < nPoints; index++) {
        POINT *p = getPoint(pointArray + index);
        p->x = randn(range);
        p->y = randn(range);
        p->index = index;
    }
    return pointArray;
}
/**<************************ Tree Functions **************************/
/**<************************ Set node ********************************/
/*
 * Set node parameters.
 * 1- Set appropriate values for x, y, width, height and level of a node.
 * 2- Initialize the rest of the node parameters.
 */
void setNode(NODEID nodeid, int x, int y, int w, int h, int type, int level) {
    // Get memory for node.
    NODE *node = getNode(nodeid);
    // Set the 5 parameters.
    node->posX = x;
    node->posY = y;
    node->width = w;
    node->height = h;
    node->level = level;
    // Reset all of the tracking values.
    int i;
    for (i = 0; i < 4; i++) {
        node->child[i] = -1;
        node->count[i] = 0;
    }
    node->total = 0;
    node->index = 0;
    node->offset = 0;
    node->open = TRUE;
    node->type = type;
    node->pBuffer = -1;
}

/**<************** Count number of nodes ******************/
int countNodesQuadTree(NODEID nodeid) {
    int sum = 0;
    int i;
    if (nodeid == -1)
        return 0;
    // Depth first traversal to find the total number of nodes.
    if (getNode(nodeid)->type == TYPE_LEAF) {
        return 1;
    } else {
        for (i = 0; i < 4; i++) {
            sum = sum + countNodesQuadTree(getNode(nodeid)->child[i]);
        }
    }
    return sum + 1;
}
/**<*************** Assign index and offset ********************/
/*
 * 1- DFS traversal of tree to assign indices.
 * 2- Count data points in leaf node.
 */
void walkNodeTable(NODE *parent, NODEID nodeid, int *offset, int *index) {
    if (nodeid == -1)
        return;
    NODE *node = getNode(nodeid);
    BUFFID ptrID;
    int i;
    if (parent)
        node->parent_index = parent->index;
    else
        node->parent_index = -1;
    // Assign index and offset.
    node->index = *index;
    node->offset = *offset;
    // Advance the next index.
    *index = *index + 1;
    if (node->type == TYPE_LEAF) {
        int count = 0;
        // Get indices of points.
        for (ptrID = node->pBuffer; ptrID != -1;
             ptrID = getBUFFER(ptrID)->nextId) {
            POINT_BUFFER *ptr = getBUFFER(ptrID);
            count = count + ptr->pointCount;
        }
        // Assign total number of points in node.
        node->total = count;
        *offset = *offset + count;
    } else {
        for (i = 0; i < 4; i++) {
            walkNodeTable(node, node->child[i], offset, index);
        }
    }
}
/**<************************ Build quadtree **************************/
/**<************************ Create new point buffer *****************/
/*
 * Create new point buffer.
 * 1- Get memory for new buffer from preallocated memory.
 * 2- Allocate memory inside each buffer to hold 32 data points.
 * 3- Assign first point to buffer and set point count.
 * 4- Initialize pointer to next buffer.
 */
BUFFID newPointBuffer(POINT *p) {
    // Allocate memory for point buffer.
    BUFFID ptrID = getNewBUFFER();
    POINT_BUFFER *ptr = getBUFFER(ptrID);
    // Allocate a bucket for 32 data points.
    ptr->pointArray = getPointArray(bucketSize);
    // Get point and set parameters.
    *(getPoint(ptr->pointArray)) = *p;
    ptr->pointCount = 1;
    ptr->nextId = -1;
    return ptrID;
}

/**<************************ Assign point to a node ******************/
/*
 * Add a point to leaf node.
 * 1- If a buffer of a leaf node is not full, add points to that buffer.
 * 2- If the buffer is full, then add new buffer to end of list.
 */
void addPointToNode(NODE *node, POINT *p) {
    BUFFID ptrID;
    BUFFID lastBufferID = -1;
    // Add points till buffer is full.
    if (node->pBuffer != -1) {
        for (ptrID = node->pBuffer; ptrID != -1;
             ptrID = getBUFFER(ptrID)->nextId) {
            POINT_BUFFER *ptr = getBUFFER(ptrID);
            // Add to the first buffer that has room.
            if (ptr->pointCount < bucketSize) {
                *(getPoint((ptr->pointArray) + (ptr->pointCount))) = *p;
                ptr->pointCount++;
                return;
            }
            lastBufferID = ptrID;
        }
        // Boundary case of adding a new link at the end of the list.
        getBUFFER(lastBufferID)->nextId = newPointBuffer(p);
    } else {
        node->pBuffer = newPointBuffer(p);
    }
}
/**<****************** Get direction of node ***************/
/*
 * 1- Get node direction based on input data point.
 * 2- A point is checked to see if it lies in the NW, NE, SW or SE
 *    quadrant of a node.
 * 3- Return direction to calling function.
 */
int getNodeDirection(NODE *nParent, struct POINT *p) {
    int posX, posY;
    int x, y;
    int index;
    // Get the point.
    x = p->x;
    y = p->y;
    // Child width and height.
    int width = nParent->width / 2;
    int height = nParent->height / 2;
    // Decide direction (north west (NW), north east (NE), south west (SW)
    // or south east (SE)) of a point.
    for (index = 0; index < 4; index++) {
        switch (index) {
        case 0: // NW
            posX = nParent->posX;
            posY = nParent->posY + height;
            if ((x >= posX) && (x < posX + width) && (y >= posY)
                    && (y < posY + height)) {
                return 0;
            }
            break;
        case 1: // NE
            posX = nParent->posX + width;
            posY = nParent->posY + height;
            if ((x >= posX) && (x < posX + width) && (y >= posY)
                    && (y < posY + height)) {
                return 1;
            }
            break;
        case 2: // SW
            posX = nParent->posX;
            posY = nParent->posY;
            if ((x >= posX) && (x < posX + width) && (y >= posY)
                    && (y < posY + height)) {
                return 2;
            }
            break;
        case 3: // SE
            posX = nParent->posX + width;
            posY = nParent->posY;
            if ((x >= posX) && (x < posX + width) && (y >= posY)
                    && (y < posY + height)) {
                return 3;
            }
            break;
        }
    }
    exit(-1);
    return (-1);
}
/**<******************* Build node *****************************/
/*
 * 1- Check the direction of the node.
 * 2- Calculate and assign the node's x, y, width and height based on
 *    the direction.
 * 3- Assign the node level.
 */
void buildNode(NODEID node, NODE *nParent, int direction) {
    int posX, posY, width, height, level;
    switch (direction) {
    case 0: // NW
        posX = nParent->posX;
        posY = nParent->posY + nParent->height / 2;
        break;
    case 1: // NE
        posX = nParent->posX + nParent->width / 2;
        posY = nParent->posY + nParent->height / 2;
        break;
    case 2: // SW
        posX = nParent->posX;
        posY = nParent->posY;
        break;
    case 3: // SE
        posX = nParent->posX + nParent->width / 2;
        posY = nParent->posY;
        break;
    }
    // Width and height of the child node is simply 1/2 of the parent.
    width = nParent->width / 2;
    height = nParent->height / 2;
    // Set the level.
    level = nParent->level + 1;
    setNode(node, posX, posY, width, height, TYPE_NONE, level);
}

NODEID createChildNode(NODE *parent, int direction) {
    NODEID node = getNewNode();
    buildNode(node, parent, direction);
    return (node);
}
/**<************************ Build full quadtree *********************/
/*
 * 1- Get node direction.
 * 2- Build node in that direction.
 * 3- Repeat till bottom of tree is reached.
 * 4- If bottom of the tree is reached, then add point to leaf nodes.
 */
void buildFullTree(NODEID nodeid, unsigned int level, POINT *p) {
    NODE *node = getNode(nodeid);
    NODEID dirNode;
    NODEID child;
    int direction;
    // Check if bottom of tree is reached.
    if (node->level == level) {
        addPointToNode(node, p);
    } else {
        // Get direction of point in the node.
        direction = getNodeDirection(node, p);
        dirNode = node->child[direction];
        if (dirNode != -1) {
            buildFullTree(dirNode, level, p);
        } else {
            // Create child node in that direction.
            child = createChildNode(node, direction);
            node->child[direction] = child;
            // Assign node type.
            if (getNode(child)->level == level)
                getNode(child)->type = TYPE_LEAF;
            else
                getNode(child)->type = TYPE_LINK;
            buildFullTree(node->child[direction], level, p);
        }
    }
}
/**<************************ Build adaptive quadtree *****************/
/*
 * Quadtree grows based on input points.
 * 1- Get direction of the node based on the input point.
 * 2- Build node in that direction.
 * 3- Add points to the node's buffer till a predetermined limit is reached.
 * 4- Once the limit is reached, divide node into sub nodes.
 * 5- Build child nodes, push points to them and repeat.
 */
void buildAdaptiveTree(NODEID node, unsigned int level, POINT *p);

void pushPointToChildren(NODE *node, unsigned int level, POINT *p) {
    NODEID dirNode;
    NODEID child;
    int direction;
    // Get node direction.
    direction = getNodeDirection(node, p);
    dirNode = node->child[direction];
    if (dirNode != -1) {
        buildAdaptiveTree(dirNode, level, p);
    } else {
        // Build node in that direction.
        child = createChildNode(node, direction);
        node->child[direction] = child;
        getNode(child)->type = TYPE_LEAF;
        buildAdaptiveTree(node->child[direction], level, p);
    }
}

void pushAllNodePointsToChildren(NODE *node, unsigned int level) {
    BUFFID ptrID;
    int i;
    ptrID = node->pBuffer;
    if (ptrID == -1)
        return;
    POINT_BUFFER *ptr = getBUFFER(ptrID);
    // Should have only 1 bucket's worth.
    if (ptr->nextId != -1) {
        printf("pushAllNodePointsToChildren: error\n");
        exit(-1);
    }
    // Get direction of each point and push it down to a child's buffer.
    for (i = 0; i < ptr->pointCount; i++) {
        pushPointToChildren(node, level, (getPoint((ptr->pointArray) + i)));
    }
    node->pBuffer = -1;
    node->open = FALSE;
    if (node->type == TYPE_LEAF)
        node->type = TYPE_LINK;
}
void buildAdaptiveTree(NODEID nodeid, unsigned int level, POINT *p) {
    NODE *node = getNode(nodeid);
    // Have we reached the bottom: force the point there (linked buckets).
    if (node->level == level) {
        addPointToNode(node, p);
        return;
    }
    if (node->open == FALSE) {
        // If we got here, this is an empty link node and we push the
        // point down.
        pushPointToChildren(node, level, p);
        return;
    }
    // All of the following checks assume the node is open.
    // Node is open but point buffer is empty.
    if (node->pBuffer == -1) {
        addPointToNode(node, p);
        return;
    }
    // Check if the point belongs on this node.
    if ((node->pBuffer != -1) &&
        (getBUFFER(node->pBuffer)->pointCount < bucketSize)) {
        // Add to current pointBuffer.
        POINT_BUFFER *ptr = getBUFFER(node->pBuffer);
        POINT *pt = getPoint(ptr->pointArray + ptr->pointCount);
        *pt = *p;
        getBUFFER(node->pBuffer)->pointCount++;
        return;
    }
    if ((node->pBuffer != -1) &&
        (getBUFFER(node->pBuffer)->pointCount == bucketSize)) {
        // Full buffer.
        pushPointToChildren(node, level, p);
        // Push all points and delete this node's buffer.
        pushAllNodePointsToChildren(node, level);
        return;
    }
    printf("Should never get here\n");
    exit(-1);
}

/**<************************ Build quadtree ************************/
void buildQuadTree(NODEID node, unsigned int level, POINTID pointArray,
                   int nPoints, int treeType) {
    int i;
    for (i = 0; i < nPoints; i++) {
        if (treeType == BUILD_FULL)
            buildFullTree(node, level, getPoint(pointArray + i));
        else
            buildAdaptiveTree(node, level, getPoint(pointArray + i));
    }
}
/**<************************ Print functions *************************/
/*
 * Functions to print the quadtree.
 * 1- DFS traversal of quadtree.
 * 2- Print quadtree node details to a file.
 * 3- Print data points in leaf node to a file.
 */
void printTableNode(FILE *fp, NODEID nodeid) {
    int i;
    if (nodeid == -1)
        return;

POINT IN POLYGON (PIP) SEARCH ACCELERATION USING GPU BASED QUADTREE FRAMEWORK
by
SREEVIDDYADEVI KARUNAGARAN
B.E., Anna University, India, 2008
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science, Electrical Engineering, 2016
This thesis for the Master of Science degree by Sreeviddyadevi Karunagaran has been approved for the Department of Electrical Engineering by Dan Connors, Chair; Yiming Deng; Chao Liu. April 14, 2016.
Karunagaran, Sreeviddyadevi (M.S., Electrical Engineering)
Point in Polygon (PIP) Search Acceleration Using GPU Based Quadtree Framework
Thesis directed by Professor Dan Connors
The form and content of this abstract are approved. I recommend its publication.

Approved: Dan Connors


TABLE OF CONTENTS

Tables
Figures

Chapter
1. Introduction
2. Background and Motivation
   2.1 Quadtree Background
       2.1.1 Quadtree Construction
   2.2 Related Work
   2.3 Graphics Processing Unit Architecture
   2.4 Motivation
3. Approach
   3.1 Exploitation of Quadtree on CPU and GPU
       3.1.1 CPU Implementation
       3.1.2 GPU Implementation
       3.1.3 BFS Implementation on GPU - Different Scenarios of Query Polygon
4. Experimental Results
   4.1 Results
5. Conclusion

References

Appendix
A. CUDA Code for Brute Force Search
B. Header File for Memory Allocation
C. CUDA C Code for PIP Search on GPU


TABLES

Table
3.1 Criteria for Quadtree Node Classification


FIGURES

Figure
2.1 Example of Data Points Mapped within a Quadtree Data Structure
2.2 Level 1 of Quadtree
2.3 Level 2 of Quadtree
2.4 Level 3 of Quadtree
2.5 Level 4 of Quadtree
2.6 NW - Northwest, NE - Northeast, SW - Southwest, SE - Southeast
2.7 Kepler Architecture
2.8 GTX 680 SMX
2.9 Kepler Memory Hierarchy
2.10 Kepler Work Flow
2.11 Comparing Brute Force Search and Quadtree Search on CPU
2.12 Comparing Brute Force Search on CPU with GPU
2.13 CPU Performance of Different Sized Polygons Using Quadtree
3.1 Overlap of Node on the Horizontal Edge of Polygon
3.2 Overlap of Node on the Vertical Edge of Polygon
3.3 BFS GPU Implementation
3.4 Process Flow
3.5 Level 1 Quadtree
3.6 Level 2 Quadtree
3.7 Level 3 Quadtree
3.8 Level 4 Quadtree
3.9 Sample Data Points (left) and the Quadtree (right) at Level 1
3.10 Sample Data Points (left) and the Quadtree (right) at Level 2
3.11 Sample Data Points (left) and the Quadtree (right) at Level 3
3.12 Sample Data Points (left) and the Quadtree (right) at Level 4
3.13 Sample Data Points (left) and the Quadtree (right) at Level 1
3.14 Sample Data Points (left) and the Quadtree (right) at Level 2
3.15 Sample Data Points (left) and the Quadtree (right) at Level 3
3.16 Sample Data Points (left) and the Quadtree (right) at Level 4
3.17 Sample Data Points (left) and the Quadtree (right) at Level 1
3.18 Sample Data Points (left) and the Quadtree (right) at Level 2
3.19 Sample Data Points (left) and the Quadtree (right) at Level 3
3.20 Sample Data Points (left) and the Quadtree (right) at Level 4
3.21 Stack Based Iterative BFS Traversal
3.22 Sample Data Points (left) and the Quadtree (right) at Level 1
3.23 Sample Data Points (left) and the Quadtree (right) at Level 2
4.1 Performance Comparison between CPU and GPU for Small Polygons
4.2 Performance Comparison between CPU and GPU for Medium Polygons
4.3 Performance Comparison between CPU and GPU for Large Polygons
4.4 Performance Comparison for Different Sized Polygons using GPU
4.5 Execution Time of Least to Most Complex Medium Sized Polygons
4.6 Speed-up of GPU over CPU


1. Introduction

The point-in-polygon (PIP) problem defines for a given set of points in a plane whether those points are inside, outside or on the boundary of a polygon. PIP for a plane is a special case of point range finding, which has applications in numerous areas that deal with processing geometrical data, such as computer graphics, computer vision, geographical information systems (GIS), motion planning and computer-aided design (CAD). Point-in-range search can be performed by a brute force search method; however, the solution becomes increasingly prohibitive when scaling the input to a large number of data points and distinct ranges (polygons).

Range search algorithms which make use of spatial data structures perform much better than the ones that do not partition the data before processing. The quadtree is a hierarchical spatial data structure that is used for both indexing and compressing geographic database layers due to its applicability to many types of data, its ease of implementation and relatively good performance. As done traditionally, the quadtree is built on the Central Processing Unit (CPU). To speed up range searching problems, it is essential to increase the threshold on the number of queries processed within a given time frame. A purely sequential approach to this will demand an increase in processor speeds.

Graphics Processing Units (GPUs) have proven to be a powerful and efficient computational platform. An increasing number of applications are demanding more efficient computing power at a lower cost. The modern GPU can natively perform thousands of parallel computations per clock cycle. Relative to the traditional power of a CPU, the GPU can far out-perform the CPU in terms of computational power, or Floating Point Operations per Second (FLOPS). Traditionally, GPUs have been used exclusively for graphics processing. Recent developments have allowed GPUs to be used for more than just graphics processing and rendering. With a growing set of applications, these new GPUs are known as GPGPUs (General Purpose GPUs).


NVIDIA® has developed the CUDA (Compute Unified Device Architecture) API (Application Programming Interface), which enables software developers to access the GPU through standard programming languages such as 'C'. CUDA gives developers access to the GPU's virtual instruction set, onboard memory and the parallel computational elements. Taking advantage of this parallel computational power results in significant speedup for multiple applications. One such application is computer vision algorithms. From the assembly line to home entertainment systems, the need for efficient real-time computer vision systems is growing quickly. This paper explores the potential power of using the CUDA API and NVIDIA® GPUs to speed up common computer vision algorithms. Through real-life algorithm optimization and translation, several approaches to GPU optimization for existing code are proposed in this report.

In the past few years, there has been a rapid adoption of GPGPU parallel computing techniques for both high-performance computing and mobile systems. As GPUs exploit models of data-parallel execution that generally describe a common task across different parallel computing elements, there can be severe limitations for any irregular individual thread behaviors. Irregular execution disrupts the efficiency of the rigid GPU groups of threads by causing workload disparity that effectively leads to idle or underutilized resource units. Most research focus is on the hardware aspects of execution disparity, such as branch divergence [7], local memory bank conflicts [5] and non-coalesced global memory accesses [15]. There are a number of proposed architecture concepts to mitigate some of the performance downfalls of handling non-uniform data on GPU architectures [10]. Many of the proposed solutions reduce the frequency of the occurrences but do not fully address the inherent run-time execution characteristics of an algorithm. Overall, irregular data patterns limit the performance potential of GPGPU algorithms.

While irregular algorithms restrain the full use of the data-level resources of GPU


systems, GPU implementations may still achieve a performance benefit over multicore implementations. In effect, the raw number of GPU resources serves as a brute-force method of carrying out computations with significant arithmetic intensity. Nevertheless, there are alternative and emerging software-based models of distributing execution tasks on GPUs, such as dynamic parallelism support. Another proposed technique is persistently scheduled thread groups [6], which abandons the traditional data-level model for stable thread groups assigned to GPU compute units that dynamically retrieve computation tasks from software-based workload queues. The result of persistently scheduled groups can be better load balance, utilization and reduced overhead in thread block creation. At the same time, such techniques have not fully addressed exploiting patterns and variations in model data specific to algorithms.

Generally, tasks that involve irregular data or non-deterministic algorithms are not effectively mapped to GPU systems. For example, in graph-based algorithms, the irregular nature of the edge connectivity between graph nodes is not well suited for data-level task definition on GPU computing units. In this case, a group of neighboring GPU threads may be assigned a diverse workload of processing nodes with a few edges as well as nodes with thousands of edges. This form of imbalance is characterized as static workload disparity, as a portion of the runtime utilization can be traced to the static connectivity of graph nodes. Only if a graph's structure is persistent, not changing over several evaluations, might there be well-reasoned opportunities to reorganize the data, effectively performing static load balancing in which each GPU thread group is assigned data with less variation in work. However, in such cases, there is a cost to partitioning the graph nodes into the model data.

This thesis investigates the potential of processing quadtrees for PIP search problems that execute on GPUs. As GPUs operate in a heterogeneous system in which both the CPU and GPU perform some fraction of the computational work, there are unique performance constraints to explore. This thesis considers two primary


parameters in scaling optimal GPU quadtree solutions: the data point problem size and the characteristics of the polygons being searched.

This thesis is organized as follows: Chapter 2 discusses the background and motivation. Chapter 3 examines several examples of solving the PIP problem on GPUs. The experimental results section, Chapter 4, shows performance data for the various optimization cases. Finally, Chapter 5 concludes this thesis.


2. Background and Motivation

2.1 Quadtree Background

A quadtree is a tree data structure that designates each internal node to have exactly 4 children. Quadtrees are used to partition a 2D space by recursively subdividing it into 4 quadrants or regions. The regions may be square or rectangular, or may have arbitrary shapes. Quadtrees may be classified according to the type of data represented, such as areas, points, lines and curves. Figure 2.1 demonstrates a set of 2D data points that have been used to compose a quadtree. The reasoning for using a quadtree structure is that any region search that requires the data points within a polygon area can simply reference the tree rather than the data. There are clear search benefits from the data-to-tree representation. In the case of the lower right quadrant, any region search will consult the tree structure and determine that only a limited number of point values (X, Y) are required for comparison. In short, the tree representation saves the computational calculations for the polygon search by immediately indicating that only a limited number of points exist in that entire quadrant. In the comparison case, it is easier for a polygon that resides in the lower right quadrant to consult the tree than to consult the full data set using full brute force checking of every data point within the polygon range.


Figure 2.1: Example of Data Points Mapped within a Quadtree Data Structure.

2.1.1 Quadtree Construction

Figure 2.2, Figure 2.3, Figure 2.4 and Figure 2.5 demonstrate quadtree construction on the CPU up to level 4. Figure 2.2 shows the sample data points at level 1, and it represents the root node containing all the data points. Figure 2.3 shows the sample data points at level 2, and it represents the four child nodes of the root node. Figure 2.3 is further subdivided to generate level 3 of the quadtree, as shown in Figure 2.4. Figure 2.5 represents the tree at level 4. There are two empty nodes at level 4, as there are no points in that direction.


Figure 2.2: Level 1 of Quadtree.

Figure 2.3: Level 2 of Quadtree.

Figure 2.4: Level 3 of Quadtree.


Figure 2.5: Level 4 of Quadtree.

The point quadtree structure recursively divides space into four equal rectangles based on the location of the points. The root of the quadtree represents the whole 2D space. Successive points divide each new sub-region into quadrants until all points are indexed. With each subsequent level, quadtrees can have a maximum of four children for every parent node. Each node represents a region in the 2D space which contains a subset of the data points.

The common algorithm to construct a quadtree is based on sequential point insertion, and this sequential approach is not suitable for massively parallel processors. Therefore the quadtree construction is done on the CPU. The performance of quadtrees is O(n log n) for insertion, O(log n) for searching, and O(n log n) for optimization for the purposes of balancing the tree [8]. This makes quadtrees an ideal data structure for multiple insertion of points and range searching [8]. Though linear quadtrees reduce the overhead involved in transferring the tree from the CPU main memory to the GPU memory, the pointer based quadtree has several other advantages. Pointer based quadtrees are more memory space efficient, and they can be accessed in any traversal order by following the pointers, whereas linear quadtrees only allow preorder traversal and require logarithmic-time searches to find the next child for any other order [14].


The quadtree is built using pointers. The tree is built by recursive function calls and by sequentially inserting the points. Each parent node has pointers to all of its children. There are 3 structures in the quadtree: the node, the point and the point buffers. Each node is one of three types: the root node, the link node or the leaf node. The root node corresponds to the entire array of input points. The four children of the root node represent the 4 quadrants (Northwest - NW, Northeast - NE, Southwest - SW, Southeast - SE). A link node is a node that can be further subdivided, and the leaf nodes correspond to the quadrants for which further subdivision is not necessary.

The point structure holds the input points to the quadtree, which occupy a 2D space. A leaf node has a list of point buffers, each of which holds a predetermined maximum number of points. Starting at the root, the quadrant or direction (NW, NE, SW or SE) in which the input point lies is determined, and a child node is built in that direction. This step is repeated until the leaf node is reached.

Figure 2.6 shows the node index and its corresponding direction on its parent node. The point is then added to that leaf node's buffer of points. If the buffer exceeds some predetermined maximum number of elements, the points are moved into the next buffer in the buffer list. The root node is subdivided recursively. A quadtree with k levels including the root would have 4^(k-1) nodes on the k-th level and (4^k - 1)/3 nodes in total [8].


Figure 2.6: NW - Northwest, NE - Northeast, SW - Southwest, SE - Southeast.

2.2 Related Work

The traversal path for each polygon may differ; therefore, linearizing the tree based on the traversal path for each polygon and iterating over the linear traversal order is computationally intensive. Several application-specific approaches have been proposed to handle the problem. In algorithms like Barnes-Hut, a point's traversal can be computed without executing the entire algorithm, and a preprocessing pass can determine each point's linear traversal, avoiding repeatedly visiting interior nodes during vector operations. However, for PIP search, where a point's full traversal is only determined as the traversal progresses, the preprocessing step can be expensive.

Goldfarb et al. (2015) [4] demonstrate a GPU implementation of tree traversal algorithms by installing ropes: extra pointers that connect a node in the tree not to its children, but instead to the next new node that a point would visit if its children are not visited. The drawback of using ropes is that an additional traversal is required, prior to performing the actual traversals, to create the ropes in the nodes of the tree structure. As a result, earlier attempts to use ropes to accelerate tree traversals have relied on application-specific transformations that leverage the semantics of the algorithm to efficiently place ropes. The problem with using "Autoropes" [4] is that


since the rope pointers are computed during traversal and stored on the stack, the stack manipulation causes overhead and efficiency is compromised.

Work by Zhang et al. (2013) [16] presents a speedup of 90x by constructing a quadtree using parallel primitives, transforming a multidimensional geospatial computing problem into a chain of generic parallel primitives that are designed for one-dimensional arrays. Another work, by Bedorf et al. (2015) [1], presents an octree construction where the tree is constructed on the GPU after mapping particles in 3D space to a linear array, and a vectorized BFS traversal is done on the octree. All of these implementations have relied on linearizing the tree for BFS traversal on the GPU, or they have preprocessed the tree to either linearize it or install extra pointers on the tree. CUDA GPU ray traversal through a hierarchical data structure such as a bounding volume hierarchy (BVH) is usually carried out using depth-first search by maintaining a stack. This paper explores the parallelization of breadth-first search (BFS) to implement range search on GPUs. The kernel is optimized to minimize thread divergence and memory access. BFS is used instead of depth-first search (DFS) as it is easier to parallelize and implement efficiently on the GPU. BFS and DFS are fundamental building blocks for more sophisticated algorithms used in many computational fields that range from gaming and image processing to social network analysis. BFS is used as a core computational kernel in a number of benchmark suites, including Rodinia, Parboil and the Graph500 supercomputer benchmark.

2.3 Graphics Processing Unit Architecture

The underlying architecture of the GPU is optimized for data-level parallelism. Taking a closer look at the NVIDIA® GPU reveals that the chip is organized into an array of highly threaded streaming multiprocessors (SMs). Each streaming multiprocessor consists of several streaming processors (SPs). Streaming multiprocessors are grouped into thread processing clusters (TPCs). The number of SPs per SM depends on the generation of the chip.


NVIDIA® has chips that increase the number of streaming processors (SPs) from 240 to 2048. Each SP has a multiply-add unit plus an additional multiply unit. The combined power of that many SPs in this GPU exceeds one teraflop [9]. Each SM contains a special function unit which can compute floating-point functions such as square root and transcendental functions. Each streaming processor is threaded and can run thousands of threads per application. Graphics cards are commonly built to run 5,000-12,000 threads simultaneously on this GPU. The GTX 680 can support 192 threads per SMX and a total of 1536 threads simultaneously [12]. In contrast, the Intel® Core™ i7 series can support two threads per core.

GPUs are optimized via the execution throughput of a massive number of threads. The hardware takes advantage of this by switching to different threads while other threads wait for long-latency memory accesses. This methodology enables very minimal control logic for each execution thread [9]. Each thread is very lightweight and requires very little creation overhead.

From a memory perspective, the GPU is architected quite differently than a CPU. Each GPU currently comes with up to four gigabytes of Graphics Double Data Rate (GDDR) DRAM, which is used as global memory. The GPU architecture is designed to exploit arithmetic intensity and data-level parallelism.

Figure 2.7 shows the architecture of the GTX 680. This generation of CUDA enabled GPU devices consists of an array of 8 next generation streaming multiprocessors (SMX), 4 GPCs (Graphics Processing Clusters) and 4 memory controllers. Each GPC has a dedicated raster engine and two SMX units. Figure 2.8 shows the architecture of an SMX. Each SMX unit contains 192 cores and four warp schedulers. Each warp scheduler is capable of dispatching two instructions per warp every clock [12].


Figure 2.7: Kepler Architecture.


Figure 2.8: GTX 680 SMX.

In Kepler, as shown in Figure 2.9, each SMX has 64KB of on-chip memory that can be configured as 48KB of shared memory with 16KB of L1 cache, as 16KB of shared memory with 48KB of L1 cache, or as a 32KB/32KB split between shared memory and L1 cache. In addition to the L1 cache, Kepler introduces a 48KB cache for data that is known to be read-only for the duration of the function, and it also has 1536KB of L2 cache memory. The L2 cache services all loads, stores, and texture requests and provides high speed data sharing across the GPU [11].


Figure 2.9: Kepler Memory Hierarchy.

Graphics processors have traditionally been designed for very specific specialized tasks. Most of their transistors perform calculations related to 3D computer graphics rendering. Typically, GPUs perform memory-intensive work such as texture mapping and rendering polygons. The GPU also performs geometric calculations such as rotation and translation of vertices into different coordinate systems. The on-chip programmable shaders can manipulate vertices and textures.

NVIDIA® has developed a parallel computing architecture known as the Compute Unified Device Architecture (CUDA). This computing engine, which is the core of modern NVIDIA® GPUs, is accessible to software developers through extensions of industry standard programming languages. The development of CUDA has enabled developers to access the virtual instruction set and memory of the GPU. This has enabled the exploitation of the native parallel computational elements of the NVIDIA® GPU.


The Kepler architecture, with Compute Capability 3.5 or higher, supports dynamic parallelism, in which the GPU can launch new grids by itself. This feature allows algorithms that involve nested parallelism or recursive parallelism to be implemented on GPUs, which results in better GPU utilization than before. In Figure 2.10, the Kepler host-to-GPU workflow shows the Grid Management Unit, which allows the GPU to manage the actively dispatching grids, pause dispatch, and hold pending and suspended grids [11].

Figure 2.10: Kepler Work Flow.

The CUDA model is optimized for maximum compatibility. NVIDIA® utilizes a soft instruction set which enables the GPU designers to change the low level hardware and instruction set without having to address backwards compatibility. Similarly, NVIDIA® has also built scalability into the CUDA model. These GPUs are scalable


in that the CUDA code that is written is not tied to a specific release of the NVIDIA® GPU. This can be contrasted to traditional CPUs, where the hard instruction set is published. CPU software developers often optimize their programs for how many cores are available, which can change as new CPUs are released.

A CUDA program is comprised of multiple execution phases. Depending on the phase, the execution involves the CPU, the GPU or both. Portions of the CUDA code are executed on the GPU while other portions are executed on the CPU. The NVIDIA® compiler, known as nvcc, translates the code for the GPU and CPU accordingly. This model is very easy to work with because the device code is written in ANSI C extended with keywords for labeling data-parallel functions, called kernels, and their associated data structures [13].

The GPU implements a single instruction, multiple data (SIMD) model, unlike traditional CPU architectures where each thread may execute its own set of instructions. Multiple processing units are controlled by the same control signal from the control unit. Though each unit executes the same instruction, they have different data that corresponds to the CUDA threads.

The GPU hardware maps the data blocks to the SMX through time and space multiplexing. Since each SMX has limited hardware resources, such as shared memory, registers, and the number of threads that can be tracked and scheduled (scheduling slots), careful selection of block sizes allows the SMX to accommodate more blocks simultaneously, thus increasing the throughput.

2.4 Motivation

In theory, for n items, the quadtree gives a performance of O(n log n), compared to a brute force method's performance of O(n^2) [3], so a quadtree is extremely fast; the run time of the quadtree relative to brute force is on the order of log(n)/n. As shown in Figure 2.11, the empirical investigation of quadtree search and brute force search shows that the quadtree has a 9x better performance for a medium number of queries and an 8x improvement for larger problems. The main purpose of the quadtree lies in localizing the queries. The brute force search on a CPU checks sequentially whether every single point of the input data satisfies the criteria, whereas quadtrees help isolate the region of interest faster by ignoring the quadrants of the tree that lie outside the region of interest at every level, thus reducing the number of quadrants to be processed at the next level. Therefore the quadtree gives better performance for dense, concentrated data than for a dense, equally distributed data set.

Figure 2.11: Comparing Brute Force Search and Quadtree Search on CPU.

As shown in Figure 2.12, the GPU provides better performance for brute force search compared to the CPU for larger data sets, but this performance is still lower than the CPU performance using a quadtree. The same algorithm is used for both the CPU and GPU brute force search to check if a point is within the polygon, but the GPU is capable of launching 65535 blocks with a maximum of 1024 threads each, thus massively


parallelizing the search. The GeForce GTX 680 Kepler architecture used here has a new parallel geometry pipeline optimized for tessellation and displacement mapping, and it also has better performance/watt [12].

Figure 2.12: Comparing Brute Force Search on CPU with GPU.

Figure 2.13 shows the quadtree performance for different sized polygons. The small polygons occupy 10 to 20 percent of the quadtree area, the medium sized polygons occupy 30 to 60 percent of the quadtree area, and the large polygons occupy 70 to 90 percent of the quadtree area. The CPU implementation performs better on small polygons compared to larger polygons. The performance improvement in this case relates to how quickly the quadrants of the quadtree that contain the polygon can be isolated. The amount of computation done for a larger polygon that occupies all four quadrants is higher compared to a smaller polygon that lies inside only one quadrant.


Figure 2.13: CPU Performance of Different Sized Polygons Using Quadtree.


3. Approach

3.1 Exploitation of Quadtree on CPU and GPU

The PIP search is implemented on the CPU using a quadtree. The performance on the CPU is compared against the implementation on the GPU.

3.1.1 CPU Implementation

The traversal starts from the root node of the quadtree and performs DFS by recursion. The traversal runs recursively until a leaf node is reached. The depth-first search (DFS) algorithm traverses a graph in a depth-wise motion and uses a recursive function to get the next node at which to start a search when the bottom of the tree is reached. The traversal starts at the root node and moves down the tree until the bottom of the tree is reached along each branch before backtracking. While visiting each node, it checks all three conditions to see whether the node is completely overlapping, partially overlapping or not overlapping.

Table 3.1 illustrates the conditions that satisfy each scenario. N0, N1, N2 and N3 are the four corners of a node, and the boundary of the polygon is marked by the four corners P0, P1, P2, P3 of a rectangle. A check mark indicates that the corner of the node is within the boundary of all four edges of the polygon. An x mark indicates that the corner of the node is outside the boundary of all four edges of the polygon. If all the corners of a node are within the four edges of the polygon, then the node is completely overlapping the polygon. If all the corners of a node are outside the four edges of the polygon, then the node is not overlapping the polygon. If at most three of the corners of a node are outside the four edges of the polygon, then the node is partially overlapping the polygon. All the conditions apart from the ones shown in Table 3.1 are partially overlapping.


Table 3.1: Criteria for Quadtree Node Classification.

Case          N0   N1   N2   N3   Condition
P0 P1 P2 P3   ✓    ✓    ✓    ✓    Completely overlapping
P0 P1 P2 P3   ✗    ✗    ✗    ✗    Not overlapping

Before classifying the node as not overlapping, the polygon is also checked to see if it lies inside the node. If the node is classified as a completely overlapping node, then the boundary details of this node are stored and the tree is not traversed further from this node. The boundary of the node represents the range of points the node contains, and all the points within the node are considered as points within the polygon. If the node is classified as a not overlapping node, then the tree is not traversed further from this node. If the node is classified as a partially overlapping node, then the tree is traversed further until the leaf node is reached. The points are then extracted from the leaf nodes of the quadtree.

In the case of a partially overlapping node, every point needs to be checked to see if it lies within the boundary of the polygon. This check is optimized by classifying the kind of intersection between the node and the polygon in such a way that for certain scenarios only either the x or the y coordinate of the points needs to be verified.

If the intersection of the node is along the horizontal edge of the polygon, as in Figure 3.1, then only the y coordinate of the data points inside the node needs to be checked, as the x coordinate of every point is within the polygon boundary limits. If the intersection between the node and the polygon is along the vertical edge of the polygon, as in Figure 3.2, then only the x coordinate of the points needs to be checked, as the y coordinate of every point is within the polygon boundary limits. For all the other overlap conditions, the x-y coordinates of every point are checked against the boundary limits of the polygon.


Figure 3.1: Overlap of Node on the Horizontal Edge of Polygon.

Figure 3.2: Overlap of Node on the Vertical Edge of Polygon.

In the case of the GPU implementation, the x-y coordinates of all points of a partial node are checked against the polygon edges without classifying them based on the type of node intersection, as classification would lead to more conditional statements, which would increase thread divergence and reduce performance.

3.1.2 GPU Implementation

Efficient implementation of quadtree traversal on the GPU using CUDA is challenging for the following reasons. 1. The quadtree is an irregular data structure whose memory accesses and work distribution are both irregular and data-dependent, and traversing irregular data structures results in thread divergence. 2. Traversal of a


quadtree built using pointers results in a lot of pointer-chasing memory operations. Pointer-chasing leads to many un-coalesced memory accesses, which affects performance [2]. 3. As done typically, the quadtree is built in recursive depth-first style on the CPU. Recursive DFS on the GPU would be advantageous, as it would reduce un-coalesced memory accesses, considering that the memory for all the structures is preallocated. Recursion can be implemented only on the latest GPUs with the CUDA 6.5 version that supports dynamic parallelism, but recursion could lead to low performance due to function call overhead. Recursion on a GPU requires explicit stack space per thread, and deep recursion will cause overflow of the available stack space; in this case CUDA may reduce the maximum physical threads per launch. 4. Transferring a pointer based quadtree to a GPU proves to be a challenge. Though this task can be implemented with the cudaMallocManaged function, debugging becomes harder.

To take advantage of the parallel nature of GPUs, BFS is used instead of DFS. The quadtree is queried to find the points inside a polygon by first finding the quadtree nodes that overlap the polygon. Once the nodes that overlap the polygon are determined, the points inside the polygon can be found from these nodes. Given a set of nodes, arranged hierarchically, the system needs to find the minimum set which solves the query correctly. This could be done by assigning one thread to a polygon, but there is not enough stack space. Since our GPU implementation method requires an explicit array in the shared memory for each polygon, assigning one thread to a polygon would also place a limit on the number of polygons processed simultaneously by a block, due to the limitation in the shared memory size. The stack space per thread needs to be reduced, and optimization for thread divergence and memory access should be done. To minimize thread divergence, reduce stack space per thread and also achieve maximum parallelization, a polygon is assigned to a warp instead of a thread or a block of threads. To optimize for memory access, memory preallocation for the


input points, the query polygons, and the point buffer list that holds the points within a leaf node of the quadtree is done on the CPU.

CUDA applications exploit massive data parallelism and are capable of processing massive amounts of data within a short period of time. But data accesses to the GDDR DRAM global memory limit this performance due to limitations in global memory bandwidth. Therefore, algorithms and techniques need to be employed to reduce the demand on the DRAM bandwidth and make efficient use of the data accessed from the global memory [9].

The quadtree is built on the CPU from heap objects, and each heap object has several fields. For example, each quadtree node object contains 4 child pointers and data fields which store the attributes of the node. Other heap objects include data points and point buffers. Though the CPU implementation benefits from dynamic memory allocation, the use of the C library malloc is not preferred for the CPU-GPU implementation, because this dynamic memory-management function provides allocation and de-allocation of arbitrary segments of memory. In order to improve performance on the GPU, the accesses to memory segments need to be coalesced. Memory preallocation places consecutive points and point buffers sequentially in memory and therefore results in memory coalesced accesses, as points close to one another are more likely to be within the same polygon boundary. But preallocating memory for the quadtree nodes is not expected to improve performance, as the nodes are stored in a depth-first fashion on the CPU but a BFS traversal is done on the GPU.

DRAM cores are slower because the market demands a high capacity rather than a lower latency DRAM. These cores operate at a much lower speed compared to the interface speed. GDDR4 operates at 1/8th of the interface speed, and this difference will increase with every generation of GDDR as the storage capacity increases [9]. Since DRAM cores are slower than the interface, the system needs to increase the


buffer width to feed into the faster interface. Modern DRAM banks are designed to be accessed in burst mode. If the accesses are not to sequential locations, the transferred burst bytes are discarded. In order to take full advantage of the DRAM system in the CUDA environment, the system needs to make sure that the processor makes use of all the data bursted out of the DRAM chip. Thus, from a memory throughput standpoint, to get peak memory performance, computation must be structured around sequential memory accesses [9]. Random memory access leads to a huge performance penalty due to random memory reads.

One option to traverse down the tree for every polygon is to use DFS by assigning a warp or a thread to each polygon. But a DFS implementation on the GPU would cause overhead due to the data movement as the thread moves up and down the tree. In this case the interior nodes of the tree are repeatedly visited, as a thread has to return to a higher level of the tree to traverse down to other children [4]. This results in additional work for each thread.

Stack based breadth-first tree traversal allows parallelization. To take advantage of the parallel nature of the GPU, the threads are launched at level 3 of the quadtree instead of at the root node. Starting at the root node would result in only one thread being active for the level 1 traversal of the quadtree. Starting at a deeper level prevents this and results in more parallelism. Initially, the index of each node from level 3 of the quadtree is stored in an array in the shared memory. This index is used to fetch the quadtree node from the global memory.

Here the tree is traversed down to find the boundaries of a polygon, thus extracting the quadtree nodes within that boundary or partially overlapping the boundary. For completely overlapping nodes, the boundary of the node gives the


range of points within the polygon. This method is more efficient, in terms of both computation and memory, than traversing down the quadtree to find the individual points within the polygon. For partially overlapping nodes, each point needs to be checked against the boundary of the polygon. With this information, the range of points within the polygon region can be found, and this range is stored instead of the individual points to save memory on the GPU. If there are a very large number of input points and query polygons, then storing individual points for each polygon would overflow the memory.

The polygons are associated with a CUDA warp, where the number of threads in a warp depends on the CUDA configuration. By launching thousands of GPU blocks in parallel, 'N' polygons can be queried, where N = (size of each block / 32) * total number of blocks. Each warp executes the same algorithm but on a different polygon. Assigning a warp to a polygon results in less thread divergence. Each thread in a warp reads the x, y coordinates of all four corners of a polygon, which are required for the tree traversal.

The GPU kernel replaces the recursive call with stack push operations that save the indices of each child of a traversed node that satisfies the query. Traversal is facilitated by a loop that simultaneously assigns the indices of the nodes in the stack to the threads in a warp until the stack is empty, indicating there are no more nodes to visit at that level for a polygon. To save stack space, node indices are stored in the stack instead of the nodes themselves. At the beginning of each iteration of the loop, the nodes are popped from the stack. After popping a node from the stack, the function which checks the overlap conditions is executed on that node, possibly pushing more nodes onto the stack or returning when it reaches a node that fails the criteria, an empty node, or a leaf node.

If the number of nodes in the stack exceeds the number of threads assigned to a polygon, then a loop over the stack with increments of the number of threads in a warp is


done. Within each loop iteration, each node is assigned to one thread. Each thread reads the quadtree node's properties and tests whether or not to traverse the tree further down from this node. If the criteria to traverse the tree down from this node are satisfied, then the indices of all four of the node's children are added to the stack, provided the child is not an empty node. One stack space is assigned to each polygon. Traversal at each level of the quadtree is synchronized, so that the same stack can be re-used for a polygon. The tree-traversal loop is repeated until all the levels of the quadtree are visited. The number of GPU threads used per level is equal to the number of nodes that were pushed onto the stack in the previous iteration.

The warp starts the traversal at level 3 of the quadtree. The indices of the children of the nodes that satisfy the query are placed in an array. The condition checks whether a node overlaps the polygon. Once the indices of the children of the satisfying nodes are placed in the array, and the other nodes which do not satisfy the condition are ignored, the next level is evaluated. Once the leaf level is reached, the nodes are classified as completely or partially overlapping nodes, and from these nodes all the points that are inside the polygon are taken. The boundaries of the completely overlapping nodes, which give the range of points within those nodes, are stored.

In the case of partially overlapping nodes, each point within the node is checked against the boundary of the polygon, and the range of points that lie within the polygon is stored. At the last level of the tree, the warp still holds the polygon, and the threads within the warp are assigned to the leaf nodes that either partially or completely overlap the polygon. Each of these threads computes the points within its node that are inside the polygon.

This approach minimizes the use of shared memory space, as only the indices of the nodes that satisfy the required criteria, starting from level 3, are stored in shared memory instead of the nodes themselves. The indices of the nodes are used to fetch the


corresponding nodes from global memory. The order in which the nodes are stored in the stack does not matter, as every level in the quadtree is processed independent of the previous one, and every node is processed independent of the other nodes in the stack. The indices of the nodes are saved in the stack array in shared memory, and the stack is cleared after computing each level. Shared memory blocks are used in order to gain significant speedups of approximately 150 to 200 times [8]. If a node index returns -1, then the node is empty and it is not fetched from global memory. The number of nodes that need to be visited at the next level is also stored in a shared memory variable for every polygon, and it is incremented atomically. The iterative loop launches threads based on the count in this variable. The loop executes until the last level of the quadtree is reached.

To implement this algorithm, the maximum number of levels in the quadtree should be 4 because of the limitations in the size and the number of stacks in shared memory. As the number of levels in a quadtree increases, the maximum stack space required per polygon would also increase, and this would limit the number of polygons processed.

The size of shared memory per block on the GTX 680 is 49152 bytes [12]. As there are 32 warps in a block by default, each block is assigned 32 polygons. Each polygon requires an array in shared memory in order to store the node indices from each level that need to be traversed further. The maximum array size required for a 4-level quadtree is 64, as the maximum number of nodes at the leaf level is 64. This allocation occupies a shared memory space of 32 * 64 * 4 = 8192 bytes, which is well within the limit. The array size for each polygon would need to be increased to 256 for a 5-level quadtree, and that allocation results in a shared memory usage of 66560 bytes, which exceeds the available shared memory size. Therefore, unless a larger shared memory becomes available in the future, the quadtree cannot be subdivided beyond level 4 in this case.
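The level-synchronized loop described above can be illustrated with a small CPU-side sketch. This is a simplified model for illustration only (array-backed nodes, a single polygon reduced to its bounding box, no warps or shared memory), not the thesis kernel; the names `Node`, `Poly`, and `bfs_level` are hypothetical:

```c
#define MAX_FRONTIER 64   /* max nodes at the deepest level of a 4-level tree */

typedef struct { int xmin, ymin, xmax, ymax; int child[4]; } Node;
typedef struct { int xmin, ymin, xmax, ymax; } Poly;

/* 1 if the node's rectangle overlaps the polygon's bounding rectangle. */
static int overlaps(const Node *n, const Poly *p) {
    return n->xmin <= p->xmax && n->xmax >= p->xmin &&
           n->ymin <= p->ymax && n->ymax >= p->ymin;
}

/* One BFS level: test every frontier node and push the non-empty children of
   the overlapping ones. Returns the size of the next frontier, which plays
   the role of the counter the GPU version increments atomically. */
int bfs_level(const Node *nodes, const int *frontier, int count,
              int *next, const Poly *p) {
    int nextCount = 0;
    for (int i = 0; i < count; i++) {        /* one thread per node on the GPU */
        const Node *n = &nodes[frontier[i]];
        if (!overlaps(n, p)) continue;       /* cross mark: prune this subtree */
        for (int c = 0; c < 4; c++)
            if (n->child[c] != -1)           /* -1 marks an empty child */
                next[nextCount++] = n->child[c];
    }
    return nextCount;
}
```

On the GPU, the `for` over the frontier becomes one thread per stack entry, and the two index arrays are the per-polygon stack in shared memory that is cleared between levels.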


In Figure 3.3, the check mark indicates that the node satisfies the query and the tree will be traversed further from this node. The cross mark indicates either that the node does not satisfy the query or that the node is empty, and the threads terminate at this point without proceeding further.

Figure 3.3: BFS GPU Implementation.

The traversal starts at level 3 of the quadtree. The children of the nodes that satisfy the overlap conditions are pushed into the array. The array is initialized after every iteration. The cross mark indicates that the node either does not satisfy the overlap conditions or is empty. The nodes are stored in the array in any order. The process repeats until the bottom of the tree is reached. The number of stack arrays in shared memory per block depends on the number of warps in a block. At most, the iterative loop processes 32 nodes at a time by default, equal to the number of threads in a warp. The conditional statements are restructured to avoid function return statements, which would make the traversal loop exit prematurely and prevent the remainder of the loop body from executing.

A polygon is checked against a node of the quadtree for 3 scenarios: completely overlapping, partially overlapping, and not overlapping. All nodes are


checked for the "not overlapping" condition until a leaf node is reached. If a node satisfies the condition, then the tree is not traversed from this node; if the node does not satisfy the condition, then its children are placed in the stack and the traversal continues.

The number of checks for a partially overlapping node is higher, and the use of "if" statements decreases performance due to control divergence. If threads within a warp take different paths, then 2 passes over the different paths of the code are required to execute all threads. Since the different execution paths are serialized, this leads to a performance penalty.

Therefore, in order to reduce conditional "if" statements, the condition for a completely overlapping node is first computed on the leaf nodes; the nodes that satisfy the condition are classified as completely overlapping nodes, and the nodes that do not satisfy the condition are classified as partially overlapping nodes. Figure 3.4 shows a high-level overview of the algorithm to find the points in a polygon using a quadtree on a GPU.
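The three overlap cases can be organized as described above. The sketch below is a hypothetical CPU-side version (the `classify` helper and `Rect` type are illustrative, and the polygon is reduced to its bounding box): the cheap "not overlapping" test is applied at every level, while the complete/partial split is computed only at the leaves:

```c
enum Overlap { NOT_OVERLAPPING, PARTIALLY_OVERLAPPING, COMPLETELY_OVERLAPPING };

typedef struct { int xmin, ymin, xmax, ymax; } Rect;

/* Classify a quadtree node's rectangle against a polygon's bounding box.
   Interior levels only need the "not overlapping" test; the complete/partial
   split is computed once, at the leaf level, to limit divergent branches. */
enum Overlap classify(Rect node, Rect poly, int isLeaf) {
    /* Separating-axis test for axis-aligned rectangles. */
    if (node.xmax < poly.xmin || node.xmin > poly.xmax ||
        node.ymax < poly.ymin || node.ymin > poly.ymax)
        return NOT_OVERLAPPING;
    if (!isLeaf)
        return PARTIALLY_OVERLAPPING;   /* keep traversing; refine at the leaf */
    /* Leaf: the node sits entirely inside the polygon's box, else partial. */
    int complete = node.xmin >= poly.xmin && node.xmax <= poly.xmax &&
                   node.ymin >= poly.ymin && node.ymax <= poly.ymax;
    return complete ? COMPLETELY_OVERLAPPING : PARTIALLY_OVERLAPPING;
}
```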


Figure 3.4: Process Flow.

3.1.3 BFS Implementation on GPU - Different Scenarios of Query Polygon

Figure 3.5, Figure 3.6, Figure 3.7 and Figure 3.8 show the data points (left) and the corresponding tree representation (right). These data points will be analyzed for different polygon overlap scenarios in the next sections. The numbers inside the circles represent the index of the node at that level. Figure 3.5 shows the root node


that contains all the input data points. The root node is subdivided into 4 equal regions to form 4 nodes at level 2. Figure 3.6 shows the four children of the root node. The root node has all four children, as data points are present in all four directions (NW, NE, SW, SE) of the root node. The nodes at level 2 are again subdivided equally to form the child nodes. Figure 3.7 shows the nodes at level 3 of the quadtree. Child node 3 of parent node 1 from level 2, child nodes 1 and 4 of parent node 3 from level 2, and child node 3 of parent node 4 from level 2 are empty, as the parent nodes did not have any points in those directions. Figure 3.8 shows the child nodes at level 4. Many of the nodes at this level are empty, as they do not contain any points.

Figure 3.5: Level 1 Quadtree.

Figure 3.6: Level 2 Quadtree.


Figure 3.7: Level 3 Quadtree.

Figure 3.8: Level 4 Quadtree.
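The subdivision shown in Figures 3.5 through 3.8 assigns each point to one of the four directions of its parent quadrant. A minimal sketch of that assignment (the helper `quadrantOf` is illustrative, not from the thesis code, and assumes screen coordinates with y increasing downward):

```c
/* Quadrant indices in the order used by the figures: NW, NE, SW, SE. */
enum Quadrant { NW = 0, NE, SW, SE };

/* Return which sub-quadrant of the node centred at (cx, cy) contains the
   point (x, y); points on a centre line go to the east/south side. */
enum Quadrant quadrantOf(int x, int y, int cx, int cy) {
    if (y < cy)
        return (x < cx) ? NW : NE;   /* upper half of the quadrant */
    return (x < cx) ? SW : SE;       /* lower half of the quadrant */
}
```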

Figure 3.9 shows the first scenario, which demonstrates the best case: the polygon lies within one quadrant at level 4 of the quadtree, and the algorithm provides its best performance.

Figure 3.9: Sample Data Points (left) and the Quadtree (right) at Level 1.

Figure 3.10 shows that at level 2, the algorithm isolates one quadrant (the first child of the root node) and ignores all other quadrants, thus bringing down the number of quadrants to be processed to 1 from 4 (the total number of nodes at this level).

Figure 3.10: Sample Data Points (left) and the Quadtree (right) at Level 2.


Figure 3.11 shows that at level 3, the algorithm again isolates one quadrant, which is the first child of the quadrant isolated at the previous level. This step reduces the number of quadrants to be processed to 1 from 3 (the total number of children of the node output from the previous level).

Figure 3.11: Sample Data Points (left) and the Quadtree (right) at Level 3.

In Figure 3.12, at level 4, the algorithm isolates one quadrant, which is the first child of the quadrant from level 3. This step reduces the number of quadrants to be processed to 1 from 2 (the total number of children of the node output from level 3). Finally, only one quadrant at level 4 needs to be processed in order to get the points inside the polygon in this case.


Figure 3.12: Sample Data Points (left) and the Quadtree (right) at Level 4.

Figure 3.13 shows the second scenario, which demonstrates a condition where the query polygon overlaps 2 nodes at level 2 and the algorithm provides medium performance.

Figure 3.13: Sample Data Points (left) and the Quadtree (right) at Level 1.

Figure 3.14 shows that at level 2, the algorithm isolates two quadrants (the second


and fourth children of the root node) and ignores the other two quadrants, thus bringing down the number of quadrants to be processed to 2.

Figure 3.14: Sample Data Points (left) and the Quadtree (right) at Level 2.

Figure 3.15 shows that at level 3, the algorithm isolates 7 quadrants: all four children of the second child of the root node, and the first, second and fourth children of the fourth child of the root node. This step reduces the number of nodes to be processed by one, as one of the children of the nodes output from the previous level is empty.

Figure 3.15: Sample Data Points (left) and the Quadtree (right) at Level 3.

Figure 3.16 shows that at level 4, the algorithm isolates 5 quadrants from the children of the 7 quadrants from the previous level. By traversing down to level 4, the number of data points that need to be evaluated is reduced by ignoring the 3rd child of the 4th


child from level 3. If the quadrant that is ignored at level 4 had a larger number of data points, then using a 3-level quadtree would have resulted in a huge performance penalty. Finally, only 5 quadrants at level 4 need to be processed in order to get the points inside the polygon in this case.

Figure 3.16: Sample Data Points (left) and the Quadtree (right) at Level 4.

Figure 3.17 shows the last scenario, which demonstrates a condition where the polygon contains the maximum number of quadrants at level 4 of the quadtree and the algorithm provides the least performance compared to all other scenarios.


Figure 3.17: Sample Data Points (left) and the Quadtree (right) at Level 1.

In Figure 3.18, at level 2, the polygon overlaps all 4 quadrants at this level. Therefore the thread has to descend the tree from all four nodes.

Figure 3.18: Sample Data Points (left) and the Quadtree (right) at Level 2.

In Figure 3.19, at level 3, the algorithm isolates all 12 quadrants at this level. The maximum possible number of nodes at this level is 16, but four of those nodes are empty, and these four nodes are ignored even though they lie within the polygon.


Figure 3.19: Sample Data Points (left) and the Quadtree (right) at Level 3.

Figure 3.20 shows that at level 4, the algorithm isolates all 12 quadrants at this level. The maximum possible number of nodes at this level is 64, but 52 of those nodes are empty, and these 52 nodes are ignored even though they lie within the polygon. Finally, only 12 quadrants at level 4 need to be processed in order to get the points inside the polygon in this case. If there were a larger number of data points distributed equally across the entire 2D space and there were no empty nodes, then all 64 nodes would need to be processed.

Figure 3.20: Sample Data Points (left) and the Quadtree (right) at Level 4.


Figure 3.21 shows the stack-based iterative BFS traversal on the GPU, in reference to Figure 3.19 and Figure 3.20.

Figure 3.21: Stack Based Iterative BFS Traversal.

Figure 3.22 shows one case where the polygon overlaps a region where there are no points. Figure 3.23 shows two distinct cases. At level 2, the algorithm isolates the first and third children of the root node. At level 3, the polygon does not overlap with any of the children of the nodes from level 2. The nodes that it overlaps are empty nodes, and therefore the tree is not traversed any further.


Figure 3.22: Sample Data Points (left) and the Quadtree (right) at Level 1.

Figure 3.23: Sample Data Points (left) and the Quadtree (right) at Level 2.


4. Experimental Results

The experimental results show that in many cases the GPU out-performs the CPU, as expected.

4.1 Results

The performance is compared in terms of the number of polygons that can be queried per second. The number of polygons queried per second is measured for a wide range of input points. As the number of data points increases, the performance of the CPU diminishes at a faster rate than that of the GPU. In the case of GPUs, as the size of the input increases, performance is affected because the amount of work done per thread increases, and the overhead of transferring a large amount of data between the CPU and GPU also increases. But once the quadtree is transferred to the GPU, any number of polygons can be queried using the iterative BFS traversal method described above.

Figure 4.1, Figure 4.2 and Figure 4.3 show that the CPU-GPU approach gives a performance improvement by a factor of 3.2 for small data sets and 449 for very large data sets in the case of small polygons; by a factor of 4 for small data sets and 594 for very large data sets in the case of medium-sized polygons; and by a factor of 3.6 for small data sets and 591 for very large data sets in the case of large polygons.


Figure 4.1: Performance Comparison between CPU and GPU for Small Polygons.

Figure 4.2: Performance Comparison between CPU and GPU for Medium Polygons.


Figure 4.3: Performance Comparison between CPU and GPU for Large Polygons.

The CPU performance diminishes at a faster rate than the GPU mainly because of the quadtree structure and the memory hierarchy. Since the tree is a pointer-based structure, there is a cache miss every time a node is accessed. At the leaf node, the points are stored in a linked list, where each node in the list contains a pre-determined number of points, and there is a cache miss every time a new buffer in the list is accessed. Therefore cache miss penalties are higher for larger data sets. In the case of GPUs, the point buffers are stored sequentially in memory through memory preallocation. Though there is a penalty when a node is accessed, cache miss penalties are not encountered while reading data points from a node. Also, in the GPU implementation, the traversal is accelerated by starting the BFS at level 3 of the quadtree.

Figure 4.4 shows the execution time of different polygon sizes on the GPU. As with the CPU, the GPU performs better on smaller polygons compared to the larger


polygons. In this case, the smaller polygons occupy 10 to 20 percent of the region and the larger polygons occupy 70 to 90 percent of the region. Once the leaf nodes are reached, the number of nodes that need to be taken into account to compute points is very small for a small polygon; but for a large polygon, which could contain a maximum of 64 nodes for a level-4 quadtree, points within all 64 of these nodes need to be taken into account.

Figure 4.4: Performance Comparison for Different Sized Polygons using GPU.

The performance is determined by the number of partial nodes a polygon contains, because for partial nodes each point inside a node needs to be checked against the polygon boundary, whereas in the case of completely overlapping nodes only the range of points a node contains is stored. Figure 4.5 shows the execution time of individual medium-sized polygons with a data input size of 3000. The polygons are ranked based on the number of partial nodes they contain, and their execution times are measured. As the number of partial nodes increases, the execution time per polygon increases. The number of polygons calculated per second is the average of the output


of all these individual polygons.

Figure 4.5: Execution Time of Least to Most Complex Medium Sized Polygons.

Figure 4.6 shows the performance comparison between CPU and GPU measured for different sizes of polygons. The significant speedup on the GPU is due to the memory-coalesced access and to accelerating the traversal by starting the BFS from level 3; in this case the algorithm traverses one level down the tree to quickly find the set of nodes that satisfy the query.

The maximum performance gain is achieved for the medium polygons, which are more computationally intensive than small polygons. Very few threads are active for small polygons, and this does not take advantage of the throughput-oriented nature of the GPU. The work on medium-sized polygons is optimal for the GPU, as the performance decreases slightly for larger polygons. The performance is mainly determined by the number of partial nodes a polygon contains. In the case of the CPU, all the partial nodes in a polygon are processed sequentially, but in the case of the GPU,


32 partial nodes within a polygon are computed in parallel, with 48 (total number of SMX x CUDA cores per SMX / threads per warp [12]) polygons being computed simultaneously.

Figure 4.6: Speed-up of GPU over CPU.
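As a worked check of the concurrency figure quoted above, using the GTX 680 whitepaper's configuration of 8 SMX units with 192 CUDA cores each [12] and one polygon per 32-thread warp (the helper name below is illustrative, and this is a rough occupancy model rather than a guaranteed hardware behavior):

```c
/* Rough model: with one polygon per 32-thread warp, the number of polygons
   in flight is bounded by the total core count divided by the warp size. */
int concurrentPolygons(int smxCount, int coresPerSmx, int warpSize) {
    return smxCount * coresPerSmx / warpSize;
}
/* GTX 680 [12]: 8 SMX x 192 cores = 1536 cores; 1536 / 32 = 48 polygons. */
```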


5. Conclusion

The range search, which is a fundamental operation in GIS and spatial databases, always performs better when a spatial data structure is used, and its performance is further improved when the search is implemented on a GPU. This thesis has shown that an irregular pointer-based quadtree need not be linearized in order to achieve a significant performance gain on a GPU, and that the CPU-GPU approach provides a speedup of 3x to 500x over the pure CPU approach. In a real-world scenario, the range search problem is carried out on irregular polygons. For future work, the work presented here can be extended to implement PIP search on irregular polygons. The work on the GPU can be extended to compute larger data sets as the size of RAM per GPU increases. With an increase in shared memory size, quadtrees with deeper levels can be traversed. There would be a significant improvement in performance with an increase in shared memory size and an increase in global memory bandwidth.


REFERENCES

[1] Jeroen Bedorf, Evghenii Gaburov, and Simon Portegies Zwart. A sparse octree gravitational n-body code that runs entirely on the GPU processor. J. Comput. Phys., 231:2825-2839, April 2012.

[2] Martin Burtscher and Keshav Pingali. An efficient CUDA implementation of the tree-based Barnes Hut n-body algorithm.

[3] David Eppstein. Fast hierarchical clustering and other applications of dynamic closest pairs. J. Exp. Algorithmics, 5, December 2000.

[4] Michael Goldfarb, Youngjoon Jo, and Milind Kulkarni. General transformations for GPU execution of tree traversals. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, pages 10:1-10:12, New York, NY, USA, 2013. ACM.

[5] C. Gou and G. N. Gaydadjiev. Addressing GPU on-chip shared memory bank conflicts using elastic pipeline. International Journal of Parallel Programming, 41, 2013.

[6] Kshitij Gupta, Jeff A. Stuart, and John D. Owens. A study of persistent threads style GPU programming for GPGPU workloads. In Innovative Parallel Computing, page 14, May 2012.

[7] Tianyi David Han and Tarek S. Abdelrahman. Reducing branch divergence in GPU programs. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4, pages 3:1-3:8, New York, NY, USA, 2011. ACM.

[8] Maria Kelly and Alexander Breslow. Quadtree construction on the GPU: A hybrid CPU-GPU approach.

[9] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2010.

[10] Jiayuan Meng, David Tarjan, and Kevin Skadron. Dynamic warp subdivision for integrated branch and memory divergence tolerance. SIGARCH Comput. Archit. News, 38:235-246, June 2010.

[11] NVIDIA. Whitepaper: Kepler GK110. 2012.


[12] NVIDIA. Whitepaper: NVIDIA GeForce GTX 680. 2012.

[13] John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips. GPU computing. Proceedings of the IEEE, 96:879-899, May 2008.

[14] Clifford A. Shaffer and Patrick R. Brown. A paging scheme for pointer-based quadtrees.

[15] Bo Wu, Zhijia Zhao, Eddy Z. Zhang, Yunlian Jiang, and Xipeng Shen. Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU. 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2013.

[16] Jianting Zhang and Simin You. High-performance quadtree constructions on large-scale geospatial rasters using GPGPU parallel primitives. International Journal of Geographical Information Science, 27:2207-2226, 2013.


APPENDIX A. CUDA Code for Brute Force Search

/*********************************************************************
 * Create random points and polygons.
 * Perform a brute force search to find points in polygon on GPU.
 *********************************************************************/

/**< #Includes */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <sys/time.h>
#include "hw4_4A_timing.h"
#include <cuda.h>
#include <cuda_runtime.h>

/**< #Defines */
#ifdef __CDT_PARSER__
#define __host__
#define __shared__
#endif

#define BUILD_FULL 1
#define BUILD_ADAPTIVE 2
#define MODE_RANDOM 1
#define MODE_FILE 2
#define TRUE 1
#define FALSE 0

#ifndef RANGE
#define RANGE 1024


#endif

#ifndef SIZE
#define SIZE 20000
#endif

/**< Global variables */
int pointMode = MODE_RANDOM;
char *inputPointFileName;
char *outputTreeFileName;
int rangeSize = RANGE;
int bucketSize = 32;
int numPoints = SIZE;
int numPolygon = 50;
int pointRangeX = RANGE;
int pointRangeY = RANGE;
typedef int POLYID;

/**< Structure definition */
// Input data point
typedef struct POINT {
    int x;
    int y;
    int index;
} POINT;

// Point coordinates.
typedef struct NODEPP {
    int xminPP;
    int yminPP;


} NODEPP;

// Query polygon
typedef struct Polygon {
    // Coordinates of corners of a polygon and its width and height.
    int xmin;
    int ymin;
    int width;
    int height;
    int xmax;
    int ymax;
    int count;
    int index;
    NODEPP nodePoint[20000];
} Polygon;

/**< Performance functions */
typedef struct perf_struct {
    double sample;
    struct timeval tv;
} perf;

perf timeRecord;

void init_perf(perf *t) {
    t->sample = 0;
}

// Start time measurement.


void start_perf_measurement(perf *t) {
    gettimeofday(&t->tv, NULL);
}

// Stop time measurement.
void stop_perf_measurement(perf *t) {
    struct timeval ctv;
    gettimeofday(&ctv, NULL);
    struct timeval diff;
    timersub(&ctv, &t->tv, &diff);
    double time = diff.tv_sec + diff.tv_usec / 1000000.0;
    t->sample = time;
}

// Print time difference.
void print_perf_measurement(perf *t) {
    printf("%f", t->sample);
}

/**< Random number generator */
/**
 * Generate random numbers.
 * 1 - Generate random data points.
 * 2 - Generate random numbers for polygon corners.
 */

// Generate a random number within a range.


int randn(int range) {
    int a;
    a = rand() % range;
    return a;
}

// Create random query Polygons.
Polygon *createRandomPolygon(unsigned int nPolygon, unsigned int range) {
    Polygon *polyarray = (Polygon *)malloc(sizeof(Polygon) * nPolygon);
    unsigned int index;
    for (index = 0; index < nPolygon; index++) {
        // Corner assignment (the original loop body was lost in extraction;
        // this follows the pattern used in Appendix C).
        polyarray[index].xmin = randn(range);
        polyarray[index].ymin = randn(range);
        polyarray[index].width = randn(range - polyarray[index].xmin);
        polyarray[index].height = randn(range - polyarray[index].ymin);
        polyarray[index].xmax = polyarray[index].xmin + polyarray[index].width;
        polyarray[index].ymax = polyarray[index].ymin + polyarray[index].height;
        polyarray[index].count = 0;
        polyarray[index].index = index;
    }
    return polyarray;
}

// Create random input data points.
POINT *createRandomPoints(unsigned int nPoints, unsigned int range) {
    POINT *pointArray = (POINT *)malloc(sizeof(POINT) * nPoints);
    unsigned int index;
    for (index = 0; index < nPoints; index++) {

        pointArray[index].x = randn(range);
        pointArray[index].y = randn(range);
        pointArray[index].index = index;
    }
    return pointArray;
}

POINT *createPoints(int numPoints) {
    POINT *pointArray;
    pointArray = createRandomPoints(numPoints, rangeSize);
    return pointArray;
}

/**< Kernel function: Brute force search */
/**
 * Function to search points inside polygon using brute force search method.
 * 1 - Assign each thread to a data point.
 * 2 - The thread checks if the point is within a given polygon.
 * 3 - Repeat till all polygons are checked.
 * 4 - Store the attributes of the polygon which satisfy the query.
 * 5 - Then grid stride is done, so that the thread loops over to get the
 *     next point.
 */

// Brute force search
__global__ void bruteForce(POINT *d_pointArray, Polygon *d_Polygon,
                           int numPoints, int numPolygon)


{
    // Get global thread index.
    int tidG = threadIdx.x + blockIdx.x * blockDim.x;
    int pid;
    // Boundary check for threads; the grid stride lets each thread
    // loop over to get the next point.
    while (tidG < numPoints) {
        // Check the point against every polygon.
        for (pid = 0; pid < numPolygon; pid++) {
            if (d_pointArray[tidG].x >= d_Polygon[pid].xmin &&
                d_pointArray[tidG].x <= d_Polygon[pid].xmax &&
                d_pointArray[tidG].y >= d_Polygon[pid].ymin &&
                d_pointArray[tidG].y <= d_Polygon[pid].ymax) {
                // Store the points that lie within a polygon; the counter is
                // incremented atomically since many threads update it.
                int c = atomicAdd(&d_Polygon[pid].count, 1);
                d_Polygon[pid].nodePoint[c].xminPP = d_pointArray[tidG].x;
                d_Polygon[pid].nodePoint[c].yminPP = d_pointArray[tidG].y;
            }
        }
        tidG += blockDim.x * gridDim.x;
    }


}

/**< Main function */
/**
 * Main function.
 * 1 - Create points randomly.
 * 2 - Create polygons randomly.
 * 3 - Copy points and polygons from host to device.
 * 4 - Launch CUDA kernel to perform the search.
 * 5 - Time the kernel execution.
 */
int main(int argc, char **argv) {
    // Host variables
    float diff_GPU;
    struct Polygon *randomPoly;
    float time = 0.0;
    int NumberofPolygons;
    // Device variables
    POINT *d_pointArray;
    Polygon *d_Polygon;
    // Create data points and polygons randomly.
    POINT *pointArray = createPoints(numPoints);
    randomPoly = createRandomPolygon(numPolygon, rangeSize);
    // Allocate device memory.
    cudaMalloc((void **)&d_pointArray, sizeof(POINT) * numPoints);


    cudaMalloc((void **)&d_Polygon, sizeof(Polygon) * numPolygon);
    // Memory transfer from host to device.
    cudaMemcpy(d_pointArray, pointArray, sizeof(POINT) * numPoints,
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_Polygon, randomPoly, sizeof(Polygon) * numPolygon,
               cudaMemcpyHostToDevice);
    // Time brute force search.
    perfTimer timeRecord;
    initTimer(&timeRecord);
    cudaEvent_t start_GPU, stop_GPU;
    cudaEventCreate(&start_GPU);
    cudaEventCreate(&stop_GPU);
    cudaEventRecord(start_GPU, 0);
    // Kernel launch (the launch configuration was lost in extraction;
    // 256 threads per block is a plausible choice).
    bruteForce<<<(numPoints + 255) / 256, 256>>>(d_pointArray, d_Polygon,
                                                 numPoints, numPolygon);
    cudaError_t cudaerr = cudaDeviceSynchronize();
    if (cudaerr != cudaSuccess) {
        printf("kernel launch failed with error \"%s\".\n",
               cudaGetErrorString(cudaerr));
    }
    cudaEventRecord(stop_GPU, 0);
    cudaEventSynchronize(stop_GPU);
    cudaEventElapsedTime(&diff_GPU, start_GPU, stop_GPU);
    time = diff_GPU;
    // Calculate number of polygons per second.
    NumberofPolygons = numPolygon / time;
    printf("%d\n", NumberofPolygons);
    cudaEventDestroy(start_GPU);


    cudaEventDestroy(stop_GPU);
    return 0;
}


APPENDIX B. Header File for Memory Allocation

/*********************************************************************
 * File Name: MemoryManager.h
 * Define objects of quadtree.
 * Define query polygons.
 * Preallocate memory for all quadtree objects on CPU.
 * Memory allocation on the GPU.
 *********************************************************************/

/**< #Includes */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/**< Structure definition */
typedef int POINTID;
typedef int NODEID;
typedef int BUFFID;
typedef int POLYID;

/**
 * The CPU has three main structures: NODE (nodes of the quadtree),
 * POINT (the input points for the quadtree) and
 * POINT_BUFFER (the array of points which each leaf node of the
 * quadtree holds).
 */
typedef struct POINT {
    int x;


    int y;
    int index;
} POINT;

typedef struct POINT_BUFFER {
    // Array of points
    POINTID pointArray;
    unsigned int pointCount;
    // Use a linked list to point to another NODE at the same level.
    BUFFID nextId;
} POINT_BUFFER;

typedef struct NODE {
    // Level
    unsigned int level;
    // Keep track of type of NODE: LEAF, LINK
    unsigned int type;
    // Location in 2D space
    int posX;
    int posY;
    // Size of quadrant
    int width;
    int height;
    // Description of points
    int count[4];
    int total;
    int index;
    int parent_index;
    int offset;


    // For Adaptive implementation
    int open;
    NODEID child[4];
    // This handles the 4 regions (0, 1, 2, 3) representing sub-quadrants
    BUFFID pBuffer;
    bool P0, P1, P2, P3;
} NODE;

// Definition of query polygon
typedef struct Polygon {
    // Size of Polygon.
    // xmin, ymin, xmax and ymax define the 4 corners of a polygon.
    int xmin;
    int ymin;
    int width;
    int height;
    int xmax;
    int ymax;
    int count;
    int index;
    POLYID *PolyPointId;
} Polygon;

// Structure to store points from completely overlapping nodes in GPU.
typedef struct CP {
    int polygonIndexCP;
    int CPcount;
    NODEID nodeNumCP[64];
} CP;


// Structure to store points from partially overlapping nodes in GPU.
typedef struct NODEPP {
    int xminPP;
    int yminPP;
    int xmaxPP;
    int ymaxPP;
} NODEPP;

typedef struct PP {
    int polygonIndexPP;
    int PPcount;
    NODEPP nodePP_array[64];
} PP;

/**< Memory preallocation */
// Preallocating memory for point structure.
typedef struct {
    POINT *pArray;
    int currCount, maxSize;
} point_array_t;

point_array_t pArr;

void allocatePointMemory() {
    pArr.maxSize = 1000000000;
    pArr.pArray = (POINT *)malloc(pArr.maxSize * sizeof(POINT));
    pArr.currCount = 0;
    if (pArr.pArray == NULL) {
        exit(-1);   // allocation failed
    }


}

// Function to fetch memory for point from preallocated memory.
POINT *getPoint(int id) {
    if (id >= pArr.currCount || id < 0) {
        exit(-1);
    }
    return &pArr.pArray[id];
}

// Allocate memory for a new point.
POINTID getNewPoint() {
    if (pArr.currCount >= pArr.maxSize) {
        pArr.pArray = (POINT *)realloc(pArr.pArray,
                                       (pArr.maxSize + 10000) * sizeof(POINT));
        pArr.maxSize += 10000;
    }
    return pArr.currCount++;
}

// Fetch memory for array of points.
POINTID getPointArray(int nelems) {
    int strtId = pArr.currCount;
    pArr.currCount += nelems;
    if (pArr.currCount >= pArr.maxSize) {
        pArr.maxSize = pArr.maxSize + 10000 > pArr.currCount
                           ? pArr.maxSize + 10000
                           : pArr.currCount + 1000;
        pArr.pArray = (POINT *)realloc(pArr.pArray,
                                       pArr.maxSize * sizeof(POINT));
    }


    return strtId;
}

// Preallocating memory for node structure.
typedef struct {
    NODE *nodeArray;
    int currCount, maxSize;
} node_array_t;

node_array_t nodeArr;

void allocateNodeMemory() {
    nodeArr.maxSize = 1000;
    nodeArr.nodeArray = (NODE *)malloc(nodeArr.maxSize * sizeof(NODE));
    nodeArr.currCount = 0;
}

// Function to fetch memory for node from preallocated memory.
NODE *getNode(int id) {
    if (id >= nodeArr.currCount || id < 0) {
        exit(-1);
    }
    return &nodeArr.nodeArray[id];
}

// Allocate memory for a new node.
NODEID getNewNode() {
    if (nodeArr.currCount >= nodeArr.maxSize) {
        nodeArr.nodeArray = (NODE *)realloc(nodeArr.nodeArray,
                                            (nodeArr.maxSize + 10000) * sizeof(NODE));


        nodeArr.maxSize += 10000;
    }
    return nodeArr.currCount++;
}

// Preallocating memory for point buffer structure.
typedef struct {
    POINT_BUFFER *bufferArray;
    int currCount, maxSize;
} buff_array_t;

buff_array_t bufferArr;

void allocatePOINTBUFFERMemory() {
    bufferArr.maxSize = 4000000;
    bufferArr.bufferArray =
        (POINT_BUFFER *)malloc(bufferArr.maxSize * sizeof(POINT_BUFFER));
    bufferArr.currCount = 0;
}

// Function to fetch memory for point buffer from preallocated memory.
POINT_BUFFER *getBUFFER(int id) {
    if (id >= bufferArr.currCount || id < 0) {
        exit(-1);
    }
    return &bufferArr.bufferArray[id];
}

// Allocate memory for a new buffer.
BUFFID getNewBUFFER() {


    if (bufferArr.currCount >= bufferArr.maxSize) {
        bufferArr.bufferArray =
            (POINT_BUFFER *)realloc(bufferArr.bufferArray,
                                    (bufferArr.maxSize + 10000) * sizeof(POINT_BUFFER));
        bufferArr.maxSize += 10000;
    }
    return bufferArr.currCount++;
}

// Get the node indices of the level 3 nodes.
typedef struct {
    NODEID *Level3Nodes;
    int count;
} level3Node;

level3Node Level3NodeArray;

void levelThreeNode()
{
    // Upper bound: one slot per allocated node (the original malloc size
    // was lost in extraction).
    Level3NodeArray.Level3Nodes =
        (NODEID *)malloc(nodeArr.currCount * sizeof(NODEID));
    int current = getNewNode() - 1;   // index of the most recently allocated node
    int i;
    int j = 0;
    for (i = 0; i <= current; i++) {
        if (getNode(i)->level == 2) {
            Level3NodeArray.count++;
            Level3NodeArray.Level3Nodes[j] = i;
            j++;
        }
    }
}


/**< CUDA device memory allocation */
// Allocate CUDA memory for Point structure.
POINT *d_POINT;
void allocatePointMemoryCuda() {
    cudaMalloc((void **)&d_POINT, sizeof(POINT) * pArr.currCount);
}

// Allocate CUDA memory for Point buffer structure.
POINT_BUFFER *d_POINT_BUFFER;
void allocatePoint_BufferMemoryCuda() {
    cudaMalloc((void **)&d_POINT_BUFFER,
               sizeof(POINT_BUFFER) * bufferArr.currCount);
}

// Allocate CUDA memory for Node structure.
NODE *d_NODE;
void allocateNodeMemoryCuda() {
    cudaMalloc((void **)&d_NODE, sizeof(NODE) * nodeArr.currCount);
}

// Allocate memory for level 3 nodes.
NODEID *d_NODE_In;
void allocateNodeIDMemoryCuda() {
    cudaMalloc((void **)&d_NODE_In, sizeof(NODEID) * Level3NodeArray.count);
}

/**< Memory transfer from CPU to GPU */
// Copy Point structure to GPU.


void PointCudaCopy() {
    cudaMemcpy(d_POINT, pArr.pArray, sizeof(POINT) * pArr.currCount,
               cudaMemcpyHostToDevice);
}

// Copy Point buffer structure to GPU.
void Point_BufferCudaCopy() {
    cudaMemcpy(d_POINT_BUFFER, bufferArr.bufferArray,
               sizeof(POINT_BUFFER) * bufferArr.currCount,
               cudaMemcpyHostToDevice);
}

// Copy Node structure to GPU.
void NodeCudaCopy() {
    cudaMemcpy(d_NODE, nodeArr.nodeArray, sizeof(NODE) * nodeArr.currCount,
               cudaMemcpyHostToDevice);
}

// Copy level 3 nodes to GPU.
void NodeIDCudaCopy() {
    cudaMemcpy(d_NODE_In, Level3NodeArray.Level3Nodes,
               sizeof(NODEID) * Level3NodeArray.count,
               cudaMemcpyHostToDevice);
}


APPENDIX C. CUDA C Code for PIP Search on GPU

/*********************************************************************
 * Construct quadtree in CPU.
 * Traverse quadtree in GPU to find PIP.
 *********************************************************************/

/**< #Includes */
#include <stdio.h>
#include <stdlib.h>
#include "MemoryManager.h"
#include <string.h>
#include <math.h>
#include <sys/time.h>
#include <time.h>
#include <cuda.h>
#include <cuda_runtime.h>

#ifdef __CDT_PARSER__
/**< #Defines */
#define __host__
#define __shared__
#define CUDA_KERNEL_DIM(...)
#else
#define CUDA_KERNEL_DIM(...) <<<__VA_ARGS__>>>
#endif

#define BUILD_FULL 1


#define BUILD_ADAPTIVE 2
#define MODE_RANDOM 1
#define MODE_FILE 2
#define TRUE 1
#define FALSE 0
#define pMax 32
#ifndef RANGE
#define RANGE 1024
#endif

/**< ***************** Global variables **************************** */
int pointMode = MODE_RANDOM;
char *inputPointFileName;
char *outputTreeFileName;
int rangeSize = RANGE;
int bucketSize = 32;
int numPoints = 3000;
int numLevels = 3;
int numSearches = 0;
int printTree = 0;
int outputTree = 0;
int quadTreeMode = BUILD_FULL;
int numPolygon = 1099120;
int pointRangeX = RANGE;
int pointRangeY = RANGE;
int completeIndex = 0;
int NotIndex = 0;
int PartialIndex = 0;
int arraysize = 100;


/**< ***************** enums ****************************** */
enum {
    TYPE_NONE = 0, TYPE_ROOT, TYPE_LINK, TYPE_LEAF
};

enum {
    FullyOverlapped = 0, PartiallyOverlapped
};

/**< ****************** Generic Functions **************** */
/**< ****************** Timing Functions ***************** */
typedef struct perf_struct {
    double sample;
    struct timeval tv;
} perf;

perf timeRecord;

void init_perf(perf *t) {
    t->sample = 0;
}

// Start time measurement.
void start_perf_measurement(perf *t) {
    gettimeofday(&t->tv, NULL);
}

// Stop time measurement.


void stop_perf_measurement(perf *t) {
    struct timeval ctv;
    gettimeofday(&ctv, NULL);
    struct timeval diff;
    timersub(&ctv, &t->tv, &diff);
    double time = diff.tv_sec + diff.tv_usec / 1000000.0;
    t->sample = time;
}

// Print time difference.
void print_perf_measurement(perf *t) {
    double total_time = 0.0;
    double stdev = 0.0;
}

/**< ************************ Random number generator ********************* */
/**
 * Generate random numbers.
 * 1 - Generate random data points.
 * 2 - Generate random numbers for polygon corners.
 ************************************************************************/
int randn(int range) {
    int a;
    a = rand() % range;
    return a;
}

int randnRng(int N, int M)
{


    int r;
    // Initialize random seed.
    srand(time(NULL));
    // Generate a number between N and M.
    r = M + rand() / (RAND_MAX / (N - M + 1) + 1);
    return r;
}

// Creates random polygon with random xmin, ymin, width and height.
Polygon *createRandomPolygon(unsigned int nPolygon, unsigned int range) {
    Polygon *polyarray = (Polygon *)malloc(sizeof(Polygon) * nPolygon);
    int PointArraySize = getNewPoint() - 1;
    unsigned int index;
    for (index = 0; index < nPolygon; index++) {
        // Loop body lost in extraction; reconstructed: pick a random
        // lower-left corner and a random extent inside the range.
        int xmin = randn(range);
        int ymin = randn(range);
        polyarray[index].xmin = xmin;
        polyarray[index].ymin = ymin;
        polyarray[index].xmax = xmin + randn(range - xmin);
        polyarray[index].ymax = ymin + randn(range - ymin);
    }
    return polyarray;
}

POINTID createRandomPoints(unsigned int nPoints, unsigned int range) {
    POINTID pointArray = getPointArray(nPoints);
    unsigned int index;
    for (index = 0; index < nPoints; index++) {
        POINT *p = getPoint(pointArray + index);
        p->x = randn(range);
        p->y = randn(range);
        p->index = index;
    }
    return pointArray;
}

/**< ************************ Tree Functions *************************** */
/**< ************************ Set node ********************************* */
/**
 * Set node parameters.
 * 1 - Set appropriate values for x, y, width, height and level of a node.
 * 2 - Initialize rest of the node parameters.
 **********************************************************************/
void setNode(NODEID nodeid, int x, int y, int w, int h, int type,
             int level) {
    // Get memory for node.
    NODE *node = getNode(nodeid);
    // Set the 5 parameters.
    node->posX = x;
    node->posY = y;
    node->width = w;
    node->height = h;
    node->level = level;


    // Reset all of the tracking values.
    int i;
    for (i = 0; i < 4; i++) {
        node->child[i] = -1;
        node->count[i] = 0;
    }
    node->total = 0;
    node->index = 0;
    node->offset = 0;
    node->open = TRUE;
    node->type = type;
    node->pBuffer = -1;
}

/**< ************** Count number of nodes ****************** */
int countNodesQuadTree(NODEID nodeid) {
    int sum = 0;
    int i;
    if (nodeid == -1)
        return 0;
    // Depth first traversal to find the total number of nodes.
    if (getNode(nodeid)->type == TYPE_LEAF) {
        return 1;
    } else {
        for (i = 0; i < 4; i++) {
            sum = sum + countNodesQuadTree(getNode(nodeid)->child[i]);
        }
    }


    return sum + 1;
}

/**< *************** Assign index and offset ******************** */
/**
 * 1 - DFS traversal of tree to assign indices.
 * 2 - Count data points in leaf node.
 **************************************************************/
void walkNodeTable(NODE *parent, NODEID nodeid, int *offset, int *index) {
    NODE *node = getNode(nodeid);
    BUFFID ptrID;
    int i;
    if (node == NULL)
        return;
    if (parent)
        node->parent_index = parent->index;
    else
        node->parent_index = -1;
    // Assign index and offset.
    node->index = *index;
    node->offset = *offset;
    // Advance the next index.
    *index = *index + 1;
    if (node->type == TYPE_LEAF) {
        int count = 0;
        // Get indices of points.
        for (ptrID = node->pBuffer; ptrID != -1;
             ptrID = getBUFFER(ptrID)->nextId) {
            POINT_BUFFER *ptr = getBUFFER(ptrID);


            count = count + ptr->pointCount;
        }
        // Assign total number of points in node.
        node->total = count;
        *offset = *offset + count;
    } else {
        for (i = 0; i < 4; i++) {
            walkNodeTable(node, node->child[i], offset, index);
        }
    }
}

/**< ************************ Build quadtree **************************** */
/**< ************************ Create new point buffer ******************** */
/**
 * Create new point buffer.
 * 1 - Get memory for new buffer from preallocated memory.
 * 2 - Allocate memory inside each buffer to hold 32 data points.
 * 3 - Assign first point to buffer and set point count.
 * 4 - Initialize pointer to next buffer.
 ***********************************************************************/
BUFFID newPointBuffer(POINT *p) {
    // Allocate memory for point buffer.
    BUFFID ptrID = getNewBUFFER();
    POINT_BUFFER *ptr = getBUFFER(ptrID);
    // Allocate a bucket for 32 data points.
    ptr->pointArray = getPointArray(bucketSize);
    // Get point and set parameters.
    *getPoint(ptr->pointArray) = *p;


    ptr->pointCount = 1;
    ptr->nextId = -1;
    return ptrID;
}

/**< ************************ Assign point to a node ******************** */
/**
 * Add a point to leaf node.
 * 1 - If a buffer of a leaf node is not full, add points to that buffer.
 * 2 - If the buffer is full, then add new buffer to end of list.
 ***********************************************************************/
void addPointToNode(NODE *node, POINT *p) {
    BUFFID ptrID;
    BUFFID lastBufferID;
    // Add points till buffer is full.
    if (node->pBuffer != -1) {
        for (ptrID = node->pBuffer; ptrID != -1;
             ptrID = getBUFFER(ptrID)->nextId) {
            POINT_BUFFER *ptr = getBUFFER(ptrID);
            // Add to the end of the list.
            if (ptr->pointCount < bucketSize) {
                *getPoint(ptr->pointArray + ptr->pointCount) = *p;
                ptr->pointCount++;
                return;
            }
            lastBufferID = ptrID;
        }
        // Boundary case of adding a new link.
        getBUFFER(lastBufferID)->nextId = newPointBuffer(p);
    } else {


        node->pBuffer = newPointBuffer(p);
    }
}

/**< ****************** Get direction of node *************** */
/**
 * 1 - Get node direction based on input data point.
 * 2 - A point is checked to see if it lies in NW, NE, SW or SE direction
 *     of a node.
 * 3 - Return direction to calling function.
 ***********************************************************/
int getNodeDirection(NODE *nParent, struct POINT *p) {
    int posX, posY;
    int x, y;
    int index;
    // Get the point.
    x = p->x;
    y = p->y;
    // Child width and height.
    int width = nParent->width / 2;
    int height = nParent->height / 2;
    // Decide direction (Northwest NW, Northeast NE, Southwest SW,
    // Southeast SE) of a point.
    for (index = 0; index < 4; index++) {
        switch (index) {
        case 0: // NW
            posX = nParent->posX;
            posY = nParent->posY + height;
            if (x >= posX && x < posX + width && y >= posY


                && y < posY + height)
                return index;
            break;
        case 1: // NE
            posX = nParent->posX + width;
            posY = nParent->posY + height;
            if (x >= posX && x < posX + width && y >= posY
                && y < posY + height)
                return index;
            break;
        case 2: // SW
            posX = nParent->posX;
            posY = nParent->posY;
            if (x >= posX && x < posX + width && y >= posY
                && y < posY + height)
                return index;
            break;
        case 3: // SE
            posX = nParent->posX + width;
            posY = nParent->posY;
            if (x >= posX && x < posX + width && y >= posY
                && y < posY + height)
                return index;
            break;
        }

    }
    exit(-1);
    return -1;
}

/**< ******************* Build node ***************************** */
/**
 * 1 - Check the direction of the node.
 * 2 - Calculate and assign the node's x, y, width and height based on
 *     direction.
 * 3 - Assign the node level.
 ***************************************************************/
void buildNode(NODEID node, NODE *nParent, int direction) {
    int posX, posY, width, height, level;
    switch (direction) {
    case 0: // NW
        posX = nParent->posX;                       // 0
        posY = nParent->posY + nParent->height / 2; // 512
        break;
    case 1: // NE
        posX = nParent->posX + nParent->width / 2;
        posY = nParent->posY + nParent->height / 2;
        break;
    case 2: // SW
        posX = nParent->posX;
        posY = nParent->posY;
        break;
    case 3: // SE


        posX = nParent->posX + nParent->width / 2;
        posY = nParent->posY;
        break;
    }
    // Width and height of the child node is simply 1/2 parent.
    width = nParent->width / 2;
    height = nParent->height / 2;
    // Set the level.
    level = nParent->level + 1;
    setNode(node, posX, posY, width, height, TYPE_NONE, level);
}

NODEID createChildNode(NODE *parent, int direction) {
    NODEID node = getNewNode();
    buildNode(node, parent, direction);
    return node;
}

/**< ************************ Build full quadtree ************************* */
/**
 * 1 - Get node direction.
 * 2 - Build node in that direction.
 * 3 - Repeat till bottom of tree is reached.
 * 4 - If bottom of the tree is reached, then add point to leaf node's
 *     buffer.
 ************************************************************************/
void buildFullTree(NODEID nodeid, unsigned int level, POINT *p) {
    NODEID dirNode;
    NODEID child;
    int direction;


    NODE *node = getNode(nodeid);
    // Check if bottom of tree is reached.
    if (node->level == level) {
        addPointToNode(node, p);
    } else {
        // Get direction of point in the node.
        direction = getNodeDirection(node, p);
        dirNode = node->child[direction];
        if (dirNode != -1) {
            buildFullTree(dirNode, level, p);
        } else {
            // Create child node in that direction.
            child = createChildNode(node, direction);
            node->child[direction] = child;
            // Assign node type.
            if (getNode(child)->level == level)
                getNode(child)->type = TYPE_LEAF;
            else
                getNode(child)->type = TYPE_LINK;
            buildFullTree(node->child[direction], level, p);
        }
    }
}

/**< ************************ Build adaptive quadtree ******************** */
/**
 * Quadtree grows based on input points.
 * 1 - Get direction of the node based on the input point.
 * 2 - Build node in that direction.


 * 3 - Add points to the node's buffer till predetermined limit is reached.
 * 4 - Once limit is reached, divide node to subnodes.
 * 5 - Build child nodes and push points to it and repeat.
 ***********************************************************************/
void buildAdaptiveTree(NODEID node, unsigned int level, POINT *p);

void pushPointToChildren(NODE *node, unsigned int level, POINT *p) {
    NODEID dirNode;
    NODEID child;
    int direction;
    // Get node direction.
    direction = getNodeDirection(node, p);
    dirNode = node->child[direction];
    if (dirNode != -1) {
        buildAdaptiveTree(dirNode, level, p);
    } else {
        // Build node in that direction.
        child = createChildNode(node, direction);
        node->child[direction] = child;
        getNode(child)->type = TYPE_LEAF;
        buildAdaptiveTree(node->child[direction], level, p);
    }
}

void pushAllNodePointsToChildren(NODE *node, unsigned int level) {
    BUFFID ptrID;
    int link = 0;
    int i;
    ptrID = node->pBuffer;


    if (ptrID == -1)
        return;
    POINT_BUFFER *ptr = getBUFFER(ptrID);
    // Should have only 1 bucket's worth.
    if (ptr->nextId != -1) {
        printf("pushAllNodePointsToChildren: error\n");
        exit(-1);
    }
    // Get direction of node and push points to the node's buffer.
    for (i = 0; i < ptr->pointCount; i++) {
        pushPointToChildren(node, level, getPoint(ptr->pointArray + i));
    }
    node->pBuffer = -1;
    node->open = FALSE;
    if (node->type == TYPE_LEAF)
        node->type = TYPE_LINK;
}

void buildAdaptiveTree(NODEID nodeid, unsigned int level, POINT *p) {
    NODEID dirNode;
    NODEID child;
    int direction;
    NODE *node = getNode(nodeid);
    // Have we reached the bottom: force to put point there (linked
    // buckets).
    if (getNode(nodeid)->level == level) {
        addPointToNode(node, p);
        return;
    }


    if (node->open == FALSE) {
        // If got to here, then this is an empty link node and we push
        // point down.
        pushPointToChildren(node, level, p);
        return;
    }
    // All of the following checks assume node is open.
    // Node is open but point buffer is empty.
    if (node->pBuffer == -1) {
        addPointToNode(node, p);
        return;
    }
    // Check if the point belongs on this node.
    if (node->pBuffer != -1 &&
        getBUFFER(node->pBuffer)->pointCount < bucketSize) {
        // Add to current point buffer.
        POINT_BUFFER *ptr = getBUFFER(node->pBuffer);
        POINT *pt = getPoint(ptr->pointArray + ptr->pointCount);
        *pt = *p;
        getBUFFER(node->pBuffer)->pointCount++;
        return;
    }
    if (node->pBuffer != -1 &&
        getBUFFER(node->pBuffer)->pointCount == bucketSize) {
        // Full buffer.
        pushPointToChildren(node, level, p);
        // Push all points and delete this node's buffer.
        pushAllNodePointsToChildren(node, level);
        return;


    }
    printf("Should never get here\n");
    exit(-1);
}

/**< ************************ Build quadtree ************************ */
void buildQuadTree(NODEID node, unsigned int level, POINTID pointArray,
                   int nPoints, int treeType) {
    int i;
    // Loop body lost in extraction; reconstructed: insert every point with
    // the selected build technique and time the construction.
    start_perf_measurement(&timeRecord);
    for (i = 0; i < nPoints; i++) {
        if (treeType == BUILD_FULL)
            buildFullTree(node, level, getPoint(pointArray + i));
        else
            buildAdaptiveTree(node, level, getPoint(pointArray + i));
    }
    stop_perf_measurement(&timeRecord);
}

/**< ************************ Print functions *********************** */
// Print a single node's details to file (signature reconstructed).
void printTableNode(FILE *fp, NODEID nodeid) {
    int i;

    NODE *node = getNode(nodeid);
    // Print node details of tree to file.
    fprintf(fp, "%d %d : [%d %d] %d %d : %d", node->index, node->offset,
            node->posX, node->posY, node->width, node->height, node->total);
    fprintf(fp, "next line\n");
    fprintf(fp, "%d : ", node->parent_index);
    for (i = 0; i < 4; i++) {
        int index = -1;
        if (node->child[i] != -1) {
            index = getNode(node->child[i])->index;
        }
        fprintf(fp, "%d ", index);
    }
    fprintf(fp, "\n");
}

// Print node details to file.
void printTableNodeDataFile(FILE *fp, NODEID nodeid) {
    int i;
    if (nodeid == -1)
        return;
    printTableNode(fp, nodeid);
    for (i = 0; i < 4; i++) {
        printTableNodeDataFile(fp, getNode(nodeid)->child[i]);
    }
}

// Print index of data point.
void printLeafPointsDataFile(FILE *fp, NODEID nodeid) {


    BUFFID ptrID;
    int i;
    if (nodeid == -1) {
        return;
    }
    NODE *node = getNode(nodeid);
    if (node->type == TYPE_LEAF) {
        // Print indices of points.
        for (ptrID = node->pBuffer; ptrID != -1;
             ptrID = getBUFFER(ptrID)->nextId) {
            POINT_BUFFER *ptr = getBUFFER(ptrID);
            for (i = 0; i < ptr->pointCount; i++) {
                fprintf(fp, "%d\n", getPoint(ptr->pointArray + i)->index);
            }
        }
    } else {
        for (i = 0; i < 4; i++) {
            printLeafPointsDataFile(fp, node->child[i]);
        }
    }
}

// Print x, y coordinates of points in leaf node to file.
void printQuadTreeDataFile(NODEID root, char *outputFile,
                           POINTID pointArray, int nPoints) {
    FILE *fp;
    int i;
    fp = fopen(outputFile, "w");


    if (fp == NULL) {
        puts("Cannot open file");
        exit(-1);
    }
    fprintf(fp, "%d %d\n", pointRangeX, pointRangeY);
    fprintf(fp, "%d\n", nPoints);
    for (i = 0; i < nPoints; i++) {
        fprintf(fp, "%d %d\n", getPoint(pointArray + i)->x,
                getPoint(pointArray + i)->y);
    }
    int countNodes = countNodesQuadTree(root);
    fprintf(fp, "%d\n", countNodes);
    int offset = 0;
    int index = 0;
    // Calculate offset and index per node.
    walkNodeTable(NULL, root, &offset, &index);
    // Print.
    printTableNodeDataFile(fp, root);
    // Print all points.
    printLeafPointsDataFile(fp, root);
    fclose(fp);
}

// Print point details along with leaf node details.
void printQuadTreeLevel(NODEID nodeid, int level) {
    if (nodeid == -1)
        return;
    NODE *node = getNode(nodeid);
    BUFFID ptrID;


    if (node->level == level) {
        printf("Node <%d> L=%d [%d %d] %d %d\n", node->type, node->level,
               node->posX, node->posY, node->width, node->height);
        if (node->pBuffer != -1) {
            int link = 0;
            int i;
            for (ptrID = node->pBuffer; ptrID != -1;
                 ptrID = getBUFFER(ptrID)->nextId) {
                POINT_BUFFER *ptr = getBUFFER(ptrID);
                printf("Link %d: ", link);
                for (i = 0; i < ptr->pointCount; i++) {
                    printf("%d:(%d,%d) ",
                           getPoint(ptr->pointArray + i)->index,
                           getPoint(ptr->pointArray + i)->x,
                           getPoint(ptr->pointArray + i)->y);
                }
                printf("\n");
                link++;
            }
        }
    }
    int i;
    for (i = 0; i < 4; i++) {
        printQuadTreeLevel(node->child[i], level);
    }
}

// Print details of a node passed to the function.
void printQuadTree(NODEID nodeid) {


    if (nodeid == -1)
        return;
    NODE *node = getNode(nodeid);
    printf("Node <%d> L=%d [%d %d] %d %d\n", node->type, node->level,
           node->posX, node->posY, node->width, node->height);
    if (node->pBuffer != -1) {
        BUFFID ptrID;
        int link = 0;
        int i;
        for (ptrID = node->pBuffer; ptrID != -1;
             ptrID = getBUFFER(ptrID)->nextId) {
            printf("Link %d: ", link);
            POINT_BUFFER *ptr = getBUFFER(ptrID);
            for (i = 0; i < ptr->pointCount; i++) {
                printf("%d:(%d,%d) ", getPoint(ptr->pointArray + i)->index,
                       getPoint(ptr->pointArray + i)->x,
                       getPoint(ptr->pointArray + i)->y);
            }
            printf("\n");
            link++;
        }
    }
    int i;
    for (i = 0; i < 4; i++) {
        printQuadTree(node->child[i]);
    }
}

// Print node details along with point density.


int printNodeStats(NODEID nodeid, int printDetails) {
    int children[4];
    BUFFID ptrID;
    if (nodeid == -1)
        return 0;
    NODE *node = getNode(nodeid);
    int count = 0;
    int sum = 0;
    if (node->pBuffer != -1) {
        int link = 0;
        int i;
        for (ptrID = node->pBuffer; ptrID != -1;
             ptrID = getBUFFER(ptrID)->nextId) {
            POINT_BUFFER *ptr = getBUFFER(ptrID);
            for (i = 0; i < ptr->pointCount; i++) {
                count++;
            }
        }
        sum++;
    }
    int i;
    if (printDetails) {
        for (i = 0; i < 4; i++) {
            children[i] = 0;
            if (node->child[i] != -1)
                children[i] = 1;
        }
        printf("QuadNode <%d> L=%d [%d %d] %d %d %d %f : %d %d %d %d\n",
               node->type, node->level, node->posX, node->posY, node->width,


               node->height, count,
               (float)count / (float)(node->height * node->width),
               children[0], children[1], children[2], children[3]);
    }
    for (i = 0; i < 4; i++) {
        sum += printNodeStats(node->child[i], printDetails);
    }
    return sum;
}

void printQuadTreeStats(NODEID rootNode, unsigned int level,
                        int printDetails) // level = 4, printDetails = 0
{
    int sum;
    sum = printNodeStats(rootNode, printDetails);
}

/**< ************************ Search functions ********************** */
/**< ************************ Search point **************************** */
/**
 * Function to search point sequentially.
 * 1 - Measures time taken to search a point from a point array.
 *********************************************************************/
// Function to search a given point from a set of random points.
double searchPoints(POINTID pointArray, int numPoints, POINTID searchArray,
                    int numSearches) {
    int s, p;
    int matches = 0;


    long long cost = 0;
    POINT *point, *search;
    // Time the search.
    start_perf_measurement(&timeRecord);
    // Brute force: compare every search point against every data point
    // (loop structure reconstructed; lost in extraction).
    for (p = 0; p < numPoints; p++) {
        point = getPoint(pointArray + p);
        for (s = 0; s < numSearches; s++) {
            search = getPoint(searchArray + s);
            cost++;
            if (search->x == point->x && search->y == point->y) {
                matches++;
            }
        }
    }
    stop_perf_measurement(&timeRecord);
    print_perf_measurement(&timeRecord);
    return timeRecord.sample;
}

double searchSmartPoints(POINTID pointArray, int numPoints,
                         POINTID searchArray, int numSearches) {
    int s, p;
    int matches = 0;
    long long cost = 0;
    int i;
    POINT *point, *search;
    char *maskArray = (char *)malloc(sizeof(char) * numSearches);


    for (i = 0; i < numSearches; i++)
        maskArray[i] = TRUE;
    // Time the search.
    start_perf_measurement(&timeRecord);
    // Loop structure reconstructed: skip search points already matched.
    for (p = 0; p < numPoints; p++) {
        point = getPoint(pointArray + p);
        for (s = 0; s < numSearches; s++) {
            if (maskArray[s] == FALSE)
                continue;
            search = getPoint(searchArray + s);
            cost++;
            if (search->x == point->x && search->y == point->y) {
                matches++;
                maskArray[s] = FALSE;
            }
        }
    }
    stop_perf_measurement(&timeRecord);
    print_perf_measurement(&timeRecord);
    return timeRecord.sample;
}

/**< ********************* Search point in quadtree ********************** */
/**
 * Functions to search point using quadtree.
 * 1 - Measures time taken to search point in a quadtree.
 * 2 - Keeps track of the cost of search and the number of matches.
 * 3 - Search function is implemented for both full and adaptive quadtree.


 ***********************************************************************/
// Find node in which point lies for a full quadtree.
NODEID findQuadTreeNode(NODEID nParentid, struct POINT *p) {
    int posX, posY;
    int x, y;
    int index;
    if (nParentid == -1)
        return nParentid;
    NODE *nParent = getNode(nParentid);
    if (nParent->type == TYPE_LEAF)
        return nParentid;
    // Get the point.
    x = p->x;
    y = p->y;
    // Child width and height.
    int width = nParent->width / 2;
    int height = nParent->height / 2;
    for (index = 0; index < 4; index++) {
        switch (index) {
        case 0: // NW
            posX = nParent->posX;
            posY = nParent->posY + height;
            if (x >= posX && x < posX + width && y >= posY
                && y < posY + height) {
                return findQuadTreeNode(nParent->child[0], p);
            }
            break;
        case 1: // NE
            posX = nParent->posX + width;


            posY = nParent->posY + height;
            if (x >= posX && x < posX + width && y >= posY
                && y < posY + height) {
                return findQuadTreeNode(nParent->child[1], p);
            }
            break;
        case 2: // SW
            posX = nParent->posX;
            posY = nParent->posY;
            if (x >= posX && x < posX + width && y >= posY
                && y < posY + height) {
                return findQuadTreeNode(nParent->child[2], p);
            }
            break;
        case 3: // SE
            posX = nParent->posX + width;
            posY = nParent->posY;
            if (x >= posX && x < posX + width && y >= posY
                && y < posY + height) {
                return findQuadTreeNode(nParent->child[3], p);
            }
            break;
        }
    }
    return -1;
}

// Function to find node in adaptive quadtree.
NODEID descendQuadTreeNode(NODEID nParentid, struct POINT *p) {


    int posX, posY;
    int x, y;
    int index;
    if (nParentid == -1)
        return -1;
    NODE *nParent = getNode(nParentid);
    // This node has points.
    if (nParent->pBuffer != -1)
        return nParentid;
    // Get the point.
    x = p->x;
    y = p->y;
    // Child width and height.
    int width = nParent->width / 2;
    int height = nParent->height / 2;
    for (index = 0; index < 4; index++) {
        switch (index) {
        case 0: // NW
            posX = nParent->posX;
            posY = nParent->posY + height;
            if (x >= posX && x < posX + width && y >= posY
                && y < posY + height) {
                return descendQuadTreeNode(nParent->child[0], p);
            }
            break;
        case 1: // NE
            posX = nParent->posX + width;
            posY = nParent->posY + height;
            if (x >= posX && x < posX + width && y >= posY


                && y < posY + height) {
                return descendQuadTreeNode(nParent->child[1], p);
            }
            break;
        case 2: // SW
            posX = nParent->posX;
            posY = nParent->posY;
            if (x >= posX && x < posX + width && y >= posY
                && y < posY + height) {
                return descendQuadTreeNode(nParent->child[2], p);
            }
            break;
        case 3: // SE
            posX = nParent->posX + width;
            posY = nParent->posY;
            if (x >= posX && x < posX + width && y >= posY
                && y < posY + height) {
                return descendQuadTreeNode(nParent->child[3], p);
            }
            break;
        }
    }
    return -1;
}

// Search full quadtree.
double searchPointsFullQuadTree(NODEID nodeid, POINTID pointArray,
                                int numPoints, POINTID searchArray,
                                int numSearches) {


    int i;
    int s;
    int matches = 0;
    int index;
    long long cost = 0;
    POINT point, *search;
    BUFFID ptrID;
    NODEID leaf;
    // Time the search.
    start_perf_measurement(&timeRecord);
    for (s = 0; s < numSearches; s++) {
        search = getPoint(searchArray + s);
        // Descend to the leaf covering the search point (reconstructed).
        leaf = findQuadTreeNode(nodeid, search);
        if (leaf == -1)
            continue;
        for (ptrID = getNode(leaf)->pBuffer; ptrID != -1;
             ptrID = getBUFFER(ptrID)->nextId) {
            POINT_BUFFER *ptr = getBUFFER(ptrID);
            for (i = 0; i < ptr->pointCount; i++) {
                point = *getPoint(ptr->pointArray + i);
                cost++;
                if (search->x == point.x && search->y == point.y) {
                    matches++;
                    break;
                }
            }
        }


    }
    stop_perf_measurement(&timeRecord);
    print_perf_measurement(&timeRecord);
    return timeRecord.sample;
}

// Search adaptive quadtree.
double searchPointsAdaptiveQuadTree(NODEID root, POINTID pointArray,
                                    int numPoints, POINTID searchArray,
                                    int numSearches) {
    int i;
    int s;
    int matches = 0;
    int index;
    long long cost = 0;
    POINT point, *search;
    BUFFID ptrID;
    NODEID node;
    // Time the search.
    start_perf_measurement(&timeRecord);
    for (s = 0; s < numSearches; s++) {
        search = getPoint(searchArray + s);
        // Descend to the node holding the point's bucket (reconstructed).
        node = descendQuadTreeNode(root, search);
        if (node == -1)
            continue;
        for (ptrID = getNode(node)->pBuffer; ptrID != -1; ptrID =


             getBUFFER(ptrID)->nextId) {
            POINT_BUFFER *ptr = getBUFFER(ptrID);
            for (i = 0; i < ptr->pointCount; i++) {
                point = *getPoint(ptr->pointArray + i);
                cost++;
                if (search->x == point.x && search->y == point.y) {
                    matches++;
                    continue;
                }
            }
        }
    }
    stop_perf_measurement(&timeRecord);
    print_perf_measurement(&timeRecord);
    return timeRecord.sample;
}

/**< ************************ Support system *************************** */
// Get data points from file.
POINTID createFilePoints(char *inputFileName) {
    FILE *fdin;
    if (inputFileName == NULL) {
        printf("Error opening file\n");
        exit(-1);
    }
    fdin = fopen(inputFileName, "r");
    if (fdin == NULL) {
        printf("Error opening file\n");
        exit(-1);


    }
    if (fscanf(fdin, "%d %d\n", &pointRangeX, &pointRangeY) != 2) {
        printf("Error point file\n");
        exit(-1);
    }
    if (fscanf(fdin, "%d\n", &numPoints) != 1) {
        printf("Error point file\n");
        exit(-1);
    }
    POINTID pointArray = getPointArray(numPoints);
    int index;
    int x, y;
    // Per-point read reconstructed from the surrounding assignments.
    for (index = 0; index < numPoints; index++) {
        if (fscanf(fdin, "%d %d\n", &x, &y) != 2) {
            printf("Error point file\n");
            exit(-1);
        }
        POINT *p = getPoint(pointArray + index);
        p->x = x;
        p->y = y;
        p->index = index;
    }
    fclose(fdin);
    return pointArray;
}

// Create points randomly or read from file.
POINTID createPoints(int numPoints) {
    POINTID pointArray;
    if (pointMode == MODE_RANDOM) {


        pointArray = createRandomPoints(numPoints, rangeSize);
    } else {
        pointArray = createFilePoints(inputPointFileName);
    }
    return pointArray;
}

static void usage(char *argv0) {
    char *help = "Usage: %s [switches]\n"
                 "-n num_points\n"
                 "-s number_search_points\n"
                 "-l number_levels\n"
                 "-b bucket_size\n"
                 "-p : print quadtree\n"
                 "-h : print this help information\n";
    fprintf(stderr, help, argv0);
    exit(-1);
}

// Command line options.
void setup(int argc, char **argv) {
    int opt;
    while ((opt = getopt(argc, argv, "n:l:s:b:r:i:o:aph")) != EOF) {
        switch (opt) {
        case 'n':
            numPoints = atoi(optarg);
            break;
        case 'l':
            numLevels = atoi(optarg);


            break;
        case 's':
            numSearches = atoi(optarg);
            break;
        case 'b':
            bucketSize = atoi(optarg);
            break;
        case 'r':
            rangeSize = atoi(optarg);
            break;
        case 'i':
            inputPointFileName = optarg;
            pointMode = MODE_FILE;
            break;
        case 'o':
            outputTreeFileName = optarg;
            outputTree = 1;
            break;
        case 'p':
            printTree = 1;
            break;
        case 'a':
            quadTreeMode = BUILD_ADAPTIVE;
            break;
        case 'h':
        default:
            usage(argv[0]);
            break;
        }


    }
}

/**< ****************** Kernel function ************************ */
/**
 * Iterative BFS traversal of quadtree to find points inside polygon.
 * 1 - Start traversal at level 3.
 * 2 - Assign warp to polygon and thread within each warp to quadtree nodes.
 * 3 - Check if node overlaps polygon.
 * 4 - If node overlaps, then traverse tree from this node.
 * 5 - If the bottom of the tree is reached, then classify nodes based on
 *     kind of overlap.
 * 6 - Completely overlapping nodes - get the range of points from node
 *     boundary.
 * 7 - Partially overlapping nodes - check every point in node with polygon
 *     boundary.
 * 8 - Get the range and store.
 * 9 - Loop over to evaluate next set of polygons.
 ****************************************************************/
// Function to check if a polygon is smaller than a given node.
__device__ bool searchPolygonCUDA(int x0, int x1, int xp0, int xp1,
                                  int y0, int y1, int yp0, int yp1)
{
    bool N0 = false;
    bool N1 = false;
    bool N2 = false;
    bool N3 = false;
    // Check if a polygon is within a node.


    if (xp0 >= x0 && xp0 <= x1 && yp0 >= y0 && yp0 <= y1) {
        N0 = true;
    }
    if (xp1 >= x0 && xp1 <= x1 && yp0 >= y0 && yp0 <= y1) {
        N1 = true;
    }
    if (xp0 >= x0 && xp0 <= x1 && yp1 >= y0 && yp1 <= y1) {
        N2 = true;
    }
    if (xp1 >= x0 && xp1 <= x1 && yp1 >= y0 && yp1 <= y1) {
        N3 = true;
    }
    if (N0 == false && N1 == false && N2 == false && N3 == false) {
        return true;
    }
    return false;
}

// Function to check completely overlapping conditions.
__device__ void NodeCheckComplete(NODEID nodeid, NODE node,
                                  Polygon indexOfPolyArray, int **complete,
                                  int **partial, int **notN) {
    if (nodeid != (NODEID)-1) {
        // x, y coordinates, width and height of node.
        int x0, y0, x1, y1, xp0, yp0, xp1, yp1, w, h;
        int i, j, numOfPoints;
        bool P0 = false;
        bool P1 = false;
        bool P2 = false;


        bool P3 = false;
        // x, y coordinates, width and height of polygon.
        xp0 = indexOfPolyArray.xmin;
        yp0 = indexOfPolyArray.ymin;
        xp1 = indexOfPolyArray.xmax;
        yp1 = indexOfPolyArray.ymax;
        x0 = node.posX;
        y0 = node.posY;
        x1 = x0 + node.width;
        y1 = y0 + node.height;
        // Check for completely overlapping node.
        if (x0 >= xp0 && x0 <= xp1 && y0 >= yp0 && y0 <= yp1) {
            P0 = true;
        }
        if (x1 >= xp0 && x1 <= xp1 && y0 >= yp0 && y0 <= yp1) {
            P1 = true;
        }
        if (x0 >= xp0 && x0 <= xp1 && y1 >= yp0 && y1 <= yp1) {
            P2 = true;
        }
        if (x1 >= xp0 && x1 <= xp1 && y1 >= yp0 && y1 <= yp1) {
            P3 = true;
        }
        // If all corners of a node are within a polygon, then classify
        // the node as completely overlapping.


        if (P0 == true && P1 == true && P2 == true && P3 == true) {
            **complete = 1;
        }
        // If no corner of the node is within the polygon, then classify
        // the node as not overlapping.
        else if (P0 == false && P1 == false && P2 == false && P3 == false
                 && searchPolygonCUDA(x0, x1, xp0, xp1, y0, y1, yp0,
                                      yp1) == true) {
            **notN = 1;
        }
        // Classify remaining nodes as partially overlapping.
        else {
            **partial = 1;
        }
    }
}

// Function to check not overlapping conditions.
__device__ void NodeCheckNotOverlap(NODEID nodeid, NODE node,
                                    Polygon indexOfPolyArray, int **complete,
                                    int **partial, int **notN) {
    if (nodeid != -1) {
        int x0, y0, x1, y1, xp0, yp0, xp1, yp1, w, h;
        int i, j, numOfPoints;
        bool P0 = false;
        bool P1 = false;


        bool P2 = false;
        bool P3 = false;
        xp0 = indexOfPolyArray.xmin;
        yp0 = indexOfPolyArray.ymin;
        xp1 = indexOfPolyArray.xmax;
        yp1 = indexOfPolyArray.ymax;
        x0 = node.posX;
        y0 = node.posY;
        x1 = x0 + node.width;
        y1 = y0 + node.height;
        // Check for not overlapping node.
        if (x0 >= xp0 && x0 <= xp1 && y0 >= yp0 && y0 <= yp1) {
            P0 = true;
        }
        if (x1 >= xp0 && x1 <= xp1 && y0 >= yp0 && y0 <= yp1) {
            P1 = true;
        }
        if (x0 >= xp0 && x0 <= xp1 && y1 >= yp0 && y1 <= yp1) {
            P2 = true;
        }
        if (x1 >= xp0 && x1 <= xp1 && y1 >= yp0 && y1 <= yp1) {
            P3 = true;
        }
        if (P0 == false && P1 == false && P2 == false && P3 == false
            && searchPolygonCUDA(x0, x1, xp0, xp1, y0,


                                 y1, yp0, yp1) == true) {
            **notN = 1;
        } else {
            **notN = 0;
        }
    }
}

// Iterative BFS traversal of quadtree to find points inside polygon.
__global__ void searchOverlapNodeCUDA(NODE *d_NODE, NODEID *d_NODE_In,
        Polygon *d_Polygon, int d_nodeCount, int d_level3Count,
        int d_numPolygon, int QuadtreeLevel, POINT *d_POINT,
        POINT_BUFFER *d_POINT_BUFFER, CP *d_cp, PP *d_pp) {
    // Global thread index.
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int b_tid = blockDim.x * blockIdx.x + threadIdx.x;
    int val = -1;
    int val1 = -1;
    int *completeNode, *partialNode, *notNode;
    completeNode = &val;
    partialNode = &val1;
    notNode = &val;
    int NumWarp = 32;
    int threadnumber = tid / 32;    // global warp index (divisor garbled)
    int lane_id = threadIdx.x % 32; // lane within the warp (reconstructed)
    // Keeps count of nodes in the shared memory array.
    __shared__ int nv[32];


    __shared__ int nc[32];
    // Variables to store range of points.
    int minCheckX = 1024;
    int maxCheckX = 0;
    int minCheckY = 1024;
    int maxCheckY = 0;
    int nodeIndexPP = 0;
    int numlevelsQ = 1;
    // Shared frontier queue, one pMax-entry segment per warp
    // (declaration reconstructed; lost in extraction).
    __shared__ NODEID boundary[32 * pMax];
    // Initialize arrays.
    if (threadIdx.x < 32) {
        nv[threadIdx.x] = 0;
        nc[threadIdx.x] = 0;
    }
    if (threadIdx.x < 32 * pMax)

    {
        boundary[threadIdx.x] = -1;
    }
    __syncthreads();
    // Assign warp to a polygon and loop till all polygons are evaluated.
    // Warp and queue indexing reconstructed; lost in extraction.
    int qid1 = (blockDim.x * blockIdx.x + threadIdx.x) / 32; // global warp
    int numWarps = (gridDim.x * blockDim.x) / 32;
    int qidd = threadIdx.x / 32;  // warp slot within this block
    int qb = qidd * pMax;         // base of this warp's queue segment
    for (int qid = qid1; qid < d_numPolygon; qid += numWarps) {
        int xp0 = d_Polygon[qid].xmin;
        int yp0 = d_Polygon[qid].ymin;
        int xp1 = d_Polygon[qid].xmax;
        int yp1 = d_Polygon[qid].ymax;
        nc[qidd] = 0;
        nv[qidd] = 0;
        int iter = 0;
        // Seed the frontier with the level 3 nodes handled by this lane.
        if (lane_id < d_level3Count) {
            NODEID nid = d_NODE_In[lane_id];
            NodeCheckNotOverlap(nid, d_NODE[nid], d_Polygon[qid],
                                &completeNode, &partialNode, &notNode);
            if (*notNode != 1) {

                boundary[qb + atomicAdd(&nc[qidd], 1)] =
                    d_NODE_In[lane_id];
                notNode = &val;
            }
        }
        // Expand the frontier one tree level per iteration; each lane
        // tests the four children of one frontier node
        // (loop bounds reconstructed; lost in extraction).
        while (iter < QuadtreeLevel - numlevelsQ) {
            int levelStart = nv[qidd];
            int levelEnd = nc[qidd];
            nv[qidd] = levelEnd;
            for (int id = levelStart + lane_id; id < levelEnd;
                 id += NumWarp) {
                NODEID nid = boundary[qb + id];
                for (int i = 0; i < 4; i++) {
                    if (d_NODE[nid].child[i] == -1)
                        continue; // skip missing children
                    NodeCheckNotOverlap(d_NODE[nid].child[i],
                                        d_NODE[d_NODE[nid].child[i]],
                                        d_Polygon[qid], &completeNode,
                                        &partialNode, &notNode);
                    if (*notNode != 1) {

                        if (nid != -1) {
                            // Add nodes that satisfy the query to array.
                            boundary[qb + atomicAdd(&nc[qidd], 1)] =
                                d_NODE[nid].child[i];
                            notNode = &val;
                        }
                    }
                }
            }
            iter = iter + 1;
        }
        __syncthreads();
        // Assign threads in a warp to nodes and get the points within a
        // node that overlaps a polygon.
        for (int id = lane_id; id < nc[qidd]; id += NumWarp) {
            NODEID nid = boundary[qb + id];
            // Classify the frontier node against the polygon
            // (call reconstructed; lost in extraction).
            NodeCheckComplete(nid, d_NODE[nid], d_Polygon[qid],
                              &completeNode, &partialNode, &notNode);
            // Completely overlapping nodes.
            if (*completeNode == 1) {
                *completeNode = -1; // reset the flag for the next node

                BUFFID ptrID;
                NODE node = d_NODE[nid];
                if (node.pBuffer != -1) {
                    d_cp[qid].nodeNumCP[atomicAdd(&d_cp[qid].CPcount, 1)]
                        = nid;
                    d_cp[qid].polygonIndexCP = qid;
                }
            }
            // Partially overlapping nodes.
            else if (*partialNode == 1) {
                *partialNode = -1; // reset the flag for the next node
                BUFFID ptrID;
                NODE node = d_NODE[nid];
                if (node.pBuffer != -1) {
                    for (ptrID = node.pBuffer; ptrID != -1;
                         ptrID = d_POINT_BUFFER[ptrID].nextId) {
                        if (nid != -1) {
                            // Check every point within the leaf node
                            // against polygon boundary.
                            POINT_BUFFER ptr = d_POINT_BUFFER[ptrID];
                            for (int c = 0; c < ptr.pointCount; c++) {

                                if (d_POINT[ptr.pointArray + c].x >= xp0 &&
                                    d_POINT[ptr.pointArray + c].x <= xp1 &&
                                    d_POINT[ptr.pointArray + c].y >= yp0 &&
                                    d_POINT[ptr.pointArray + c].y <= yp1) {
                                    // Store the range of points.
                                    if (minCheckX >
                                        d_POINT[ptr.pointArray + c].x) {
                                        minCheckX =
                                            d_POINT[ptr.pointArray + c].x;
                                    }
                                    if (maxCheckX <
                                        d_POINT[ptr.pointArray + c].x) {
                                        maxCheckX =
                                            d_POINT[ptr.pointArray + c].x;
                                    }
                                    if (minCheckY >
                                        d_POINT[ptr.pointArray + c].y) {
                                        minCheckY =
                                            d_POINT[ptr.pointArray + c].y;
                                    }
                                    if (maxCheckY <
                                        d_POINT[ptr.pointArray + c].y) {
                                        maxCheckY =
                                            d_POINT[ptr.pointArray + c].y;
                                    }
                                }
                            }
                        }
                    }
                    // Record the bounding range of the matching points.

                    nodeIndexPP = atomicAdd(&d_pp[qid].PPcount, 1);
                    d_pp[qid].nodePP_array[nodeIndexPP].xminPP = minCheckX;
                    d_pp[qid].nodePP_array[nodeIndexPP].yminPP = minCheckY;
                    d_pp[qid].nodePP_array[nodeIndexPP].xmaxPP = maxCheckX;
                    d_pp[qid].nodePP_array[nodeIndexPP].ymaxPP = maxCheckY;
                    d_pp[qid].polygonIndexPP = qid;
                }
            }
        }
    }
}

/**< ************************ Main function *************************** */
/**
 * Two techniques to build QuadTrees
 * 1 - full: extend all the way down, only leafs hold points
 *         : counts are kept at intermediate levels
 *         : nulls are still used to know where points are.


 * 2 - adaptive: items are pushed around as needed to form tree
 *         : points over LIMIT are pushed down.
 ********************************************************************/
int main(int argc, char **argv) {
    setup(argc, argv);
    // Host variables.
    int index;
    NODEID rootNode;
    struct Polygon *randomPoly;
    // Device variable declaration.
    Polygon *d_Polygon;
    CP *d_cp;
    PP *d_pp;
    PP *d_hostPoint;
    // Preallocate memory for all objects in CPU.
    allocatePointMemory();
    allocateNodeMemory();
    allocatePOINTBUFFERMemory();
    d_hostPoint = (PP *)malloc(numPolygon * sizeof(PP));
    // Create random points and polygons.
    POINTID pointArray = createPoints(numPoints);
    randomPoly = createRandomPolygon(numPolygon, rangeSize);
    // Get memory for root node.
    rootNode = getNewNode();
    // Start node: root.
    setNode(rootNode, 0, 0, rangeSize, rangeSize, TYPE_ROOT, 0);
    // Create the quadtree.
    buildQuadTree(rootNode, numLevels, pointArray, numPoints, quadTreeMode);
    // Time quadtree construction.


    double buildTime = timeRecord.sample;
    printf("QuadTree Build Time: %f\n", buildTime);
    // Print the quadtree.
    if (printTree) {
        printQuadTree(rootNode);
    }
    printQuadTreeStats(rootNode, numLevels, 0);
    if (outputTree) {
        printQuadTreeDataFile(rootNode, outputTreeFileName, pointArray,
                              numPoints);
    }
    // Search section.
    if (numSearches > 0) {
        // Create some search points.
        POINTID searchArray = createRandomPoints(numSearches, rangeSize);
        // Search points in an array.
        double baseTime = searchPoints(pointArray, numPoints, searchArray,
                                       numSearches);
        double smartTime = searchSmartPoints(pointArray, numPoints,
                                             searchArray, numSearches);
        // Search points using quadtree.
        double quadTime;
        if (quadTreeMode == BUILD_FULL)
            quadTime = searchPointsFullQuadTree(rootNode, pointArray,
                                                numPoints, searchArray,
                                                numSearches);


        else {
            quadTime = searchPointsAdaptiveQuadTree(rootNode, pointArray,
                                                    numPoints, searchArray,
                                                    numSearches);
        }
    }
    // CUDA memory allocation.
    allocatePointMemoryCuda();
    allocatePoint_BufferMemoryCuda();
    allocateNodeMemoryCuda();
    cudaMalloc((void **)&d_Polygon, sizeof(Polygon) * numPolygon);
    cudaMalloc((void **)&d_cp, sizeof(CP) * numPolygon);
    cudaError_t err = cudaMalloc((void **)&d_pp, sizeof(PP) * numPolygon);
    // Time the kernel execution.
    float diff_GPU;
    cudaEvent_t start_GPU, stop_GPU;
    cudaEventCreate(&start_GPU);
    cudaEventCreate(&stop_GPU);
    cudaEventRecord(start_GPU, 0);
    // Copy nodes at level 3.
    levelThreeNode();
    allocateNodeIDMemoryCuda();
    int d_nodeCount = nodeArr.currCount;
    int d_level3Count = Level3NodeArray.count;
    // Host-to-device copies.
    PointCudaCopy();
    Point_BufferCudaCopy();
    NodeCudaCopy();
    NodeIDCudaCopy();


    cudaMemcpy(d_Polygon, randomPoly, sizeof(Polygon) * numPolygon,
               cudaMemcpyHostToDevice);
    // Launch kernel with 65535 blocks and 1024 threads per block.
    searchOverlapNodeCUDA<<<65535, 1024>>>(d_NODE, d_NODE_In, d_Polygon,
            d_nodeCount, d_level3Count, numPolygon, numLevels, d_POINT,
            d_POINT_BUFFER, d_cp, d_pp);
    cudaError_t cudaerr = cudaDeviceSynchronize();
    if (cudaerr != cudaSuccess) {
        printf("kernel launch failed with error \"%s\".\n",
               cudaGetErrorString(cudaerr));
    }
    cudaEventRecord(stop_GPU, 0);
    cudaEventSynchronize(stop_GPU);
    cudaEventElapsedTime(&diff_GPU, start_GPU, stop_GPU);
    // Memory transfer from device to host.
    cudaMemcpy(d_hostPoint, d_pp, sizeof(PP) * numPolygon,
               cudaMemcpyDeviceToHost);
    // Calculation of number of polygons per second.
    int NumberofPolygons = numPolygon / (diff_GPU / 1000);
    printf("%d\n", NumberofPolygons);
    // Destroy CUDA events.
    cudaEventDestroy(start_GPU);
    cudaEventDestroy(stop_GPU);
    // Free CUDA memory.
    cudaFree(d_Polygon);
    cudaFree(d_POINT_BUFFER);
    cudaFree(d_POINT);
    cudaFree(d_NODE);
    return 0;


}