Citation
Acceleration of computer vision algorithms using soft-core CPUs with custom hardware

Material Information

Title:
Acceleration of computer vision algorithms using soft-core CPUs with custom hardware
Creator:
Grover, Eric Richard ( author )
Place of Publication:
Denver, Colo.
Publisher:
University of Colorado Denver
Publication Date:
Language:
English
Physical Description:
1 electronic file (79 pages)

Thesis/Dissertation Information

Degree:
Master's (Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Electrical Engineering, CU Denver
Degree Disciplines:
Electrical engineering

Subjects

Subjects / Keywords:
Computer vision ( lcsh )
Computer engineering ( lcsh )
Computer engineering ( fast )
Computer vision ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Review:
Computer vision is a domain of machine learning that includes methods for processing, analyzing, and extracting high-dimensional data from real world digital images and video. There are numerous breakthrough technological inventions ranging from autonomous vehicles to medical image scans that depend on lowering the execution time of computer vision algorithms on computer systems. To overcome the performance barriers of specific computer vision mathematical sequences, new emerging technologies in computer architecture must be explored. The thesis investigates new custom computer hardware solutions described using hardware description language design techniques and evaluated with field-programmable gate array (FPGA)-based simulation and synthesis tools.
Review:
Specifically, the thesis extends a Microblaze soft-core CPU with new instruction set architecture (ISA) assembly instructions to accelerate three specific, frequently utilized computational vision tasks with dedicated on-chip hardware units. The first algorithm is an ordered buffer that returns the lowest k values of the values written to the buffer. The second algorithm component is a series of SIMD extensions to accelerate image pattern block matching. The third computer vision task examined is a 2D convolution kernel that is accelerated by parallelizing and pipelining a series of multiply-and-accumulate operations. Each algorithm is compared to a software-only Microblaze implementation.
Bibliography:
Includes bibliographical references.
System Details:
System requirements: Adobe Reader.
Statement of Responsibility:
by Eric Richard Grover.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
on10121 ( NOTIS )
1012118466 ( OCLC )
on1012118466
Classification:
LD1193.E54 2017m G76 ( lcc )

Full Text
ACCELERATION OF COMPUTER VISION
ALGORITHMS USING SOFT-CORE CPUS WITH CUSTOM HARDWARE
by
ERIC RICHARD GROVER B.S., Brigham Young University, 2007
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science Electrical Engineering Program 2017
2017
ERIC RICHARD GROVER ALL RIGHTS RESERVED
This thesis for the Master of Science degree by Eric Richard Grover has been approved for the Electrical Engineering Program by
Dan Connors, Chair
Chao Liu
Jae Do Park
Date: July 7, 2017


Grover, Eric Richard (M.S., Electrical Engineering Program)
Acceleration of Computer Vision Algorithms Using Soft-Core CPUs With Custom Hardware Thesis directed by Associate Professor Dan Connors
ABSTRACT
Computer vision is a domain of machine learning that includes methods for processing, analyzing, and extracting high-dimensional data from real world digital images and video. There are numerous breakthrough technological inventions ranging from autonomous vehicles to medical image scans that depend on lowering the execution time of computer vision algorithms on computer systems. To overcome the performance barriers of specific computer vision mathematical sequences, new emerging technologies in computer architecture must be explored. The thesis investigates new custom computer hardware solutions described using hardware description language design techniques and evaluated with field-programmable gate array (FPGA)-based simulation and synthesis tools.
Specifically, the thesis extends a Microblaze soft-core CPU with new instruction set architecture (ISA) assembly instructions to accelerate three specific, frequently utilized computational vision tasks with dedicated on-chip hardware units. The first algorithm is an ordered buffer that returns the lowest k values of the values written to the buffer. The second algorithm component is a series of SIMD extensions to accelerate image pattern block matching. The third computer vision task examined is a 2D convolution kernel that is accelerated by parallelizing and pipelining a series of multiply-and-accumulate operations. Each algorithm is compared to a software-only Microblaze implementation.
The form and content of this abstract are approved. I recommend its publication.
Approved: Daniel A. Connors
TABLE OF CONTENTS
CHAPTER
I. INTRODUCTION...................................................1
II. BACKGROUND AND RELATED WORK...................................3
2.1 Background................................................3
2.2 Related Work..............................................5
III. APPROACH.....................................................8
3.1 Overview..................................................8
3.2 Soft-Core CPU.............................................8
3.3 MB-Lite Modification: SIMD Pipeline Enhancement...........9
3.4 Computer Vision Kernel: Ordered K-Buffer.................11
3.5 Computer Vision Kernel: Block Matching...................19
3.6 Computer Vision Kernel: 2D Convolution...................25
3.7 Software Toolchain.......................................31
3.8 Cycle-Accurate Simulator.................................31
3.9 Execution Statistics.......................................32
IV. EXPERIMENT RESULTS..........................................33
4.1 Overview.................................................33
4.2 Ordered Buffer...........................................34
4.3 Block Matching...........................................54
4.4 2D Convolution....................................59
V. CONCLUSION..............................................65
APPENDIX A CODE...........................................67
REFERENCES................................................69
LIST OF TABLES
TABLE
1. Ordered buffer instructions.........................................................13
2. Ordered Buffer illustrated example..............................................16
3. Block matching instructions........................................................20
4. Ordered Buffer datasets............................................................35
5. Ordered Buffer software comparison algorithms.....................................41
6. Ordered buffer operations per clock cycle calculations............................52
7. Ordered Buffer power and area comparison..........................................53
8. Ordered Buffer performance summary.................................................54
9. Ordered Buffer GOPS/W calculation..................................................54
10. Block matching speed-up............................................................56
11. Block matching operations per clock cycle calculations.............................57
12. Block matching power and area comparison...........................................58
13. Block matching performance summary.................................................58
14. 2D convolution speed-up............................................................61
15. 2D Convolution operations per clock cycle calculations............................61
16. 2D convolution power and area comparison...........................................63
17. 2D convolution performance summary.................................................63
LIST OF FIGURES
FIGURE
1. Microblaze architecture...............................................................9
2. Modified Microblaze architecture....................................................10
3. Vector register 8-bit pixel indices.................................................11
4. k-Nearest Neighbor algorithm example................................................12
5. Ordered buffer tag-value pair insertion flowchart...................................14
6. Block matching between image frames.................................................19
7. Mean absolute difference (MAD) equation for block matching..........................20
8. Block matching scoring software example.............................................21
9. Right-shift template pixels.........................................................22
10. Right-shift target pixels by amount in r3..........................................23
11. Absolute difference and mask.......................................................23
12. Result pushed into dedicated FIFO..................................................23
13. Right-shift target pixels by amount in r3..........................................23
14. Absolute difference and mask.......................................................24
15. Result pushed into dedicated FIFO..................................................24
16. Right-shift target pixels by amount in r3..........................................24
17. Absolute difference and mask.......................................................25
18. Result pushed into dedicated FIFO..................................................25
19. 2D image convolution example.......................................................26
20. Convolution kernel.................................................................27
21. Convolution inputs (within green box) for calculation of indicated pixel location in relation to 15 vector registers, showing sample data.......................27
22. Single-pixel convolution pseudo-code ('&' is an array concatenation operator, calculated pixel result in tmp[15]).......................29
23. State of registers after one operation...............................................29
24. Hardware-accelerated convolution pseudo-code.........................................30
25. Victim buffer sizing (random inputs, k=8, 100 datasets with results averaged)........37
26. Victim buffer sizing (random inputs, k=32, 100 datasets with results averaged).......38
27. Victim buffer sizing (100 random datasets with results shown separately, each with 1,000 pairs, k=32).......................38
28. Victim buffer sizing (best-case inputs, k=8).........................................39
29. Victim buffer sizing (best-case inputs, k=32)........................................39
30. Victim buffer sizing (worst-case inputs, k=8)........................................40
31. Victim buffer sizing (worst-case inputs, k=32).......................................40
32. Number of input pairs vs. k (HW ordered buffer, random inputs, 100 datasets with results averaged).......................45
33. Number of input pairs vs. k (SW ordered buffer, random inputs, 100 datasets with results averaged).......................45
34. Number of input pairs vs. k (array sort, random inputs, 100 datasets with results averaged).......................45
35. Number of input pairs vs. k (HW ordered buffer, best-case inputs)...............46
36. Number of input pairs vs. k (SW ordered buffer, best-case inputs)...............46
37. Number of input pairs vs. k (array sort, best-case inputs)...........................46
38. Number of input pairs vs. k (HW ordered buffer, worst-case inputs)..............47
39. Number of input pairs vs. k (SW ordered buffer, worst-case inputs)..............47
40. Number of input pairs vs. k (array sort, worst-case inputs).....................47
41. Performance improvement over SW ordered buffer (random inputs, 100 datasets with results averaged).......................48
42. Performance improvement over array sort (random inputs, 100 datasets with results averaged).......................48
43. Performance improvement over SW ordered buffer (best-case inputs)...............48
44. Performance improvement over array sort (best-case inputs)......................48
45. Performance improvement over SW ordered buffer (worst-case inputs)..............49
46. Performance improvement over array sort (worst-case inputs).....................49
47. Instruction histogram, 1000 input samples, k=16, random data....................50
48. Instruction histogram, 1000 input samples, k=16, random data, SW ordered buffer.50
49. Instruction histogram, 1000 input samples, k=16, random data, array sort........50
50. Block matching software implementation..........................................55
51. Block matching instruction histogram............................................56
52. 2D convolution software-only implementation.....................................59
53. 2D convolution instruction histogram (hardware accelerated).....................60
54. 2D convolution instruction histogram (software only)............................60
CHAPTER I
INTRODUCTION
Computer vision is a domain of machine learning that includes methods for processing, analyzing, and extracting high-dimensional data from real world digital images and video. The development of computer vision is essential for the advancement of a range of new industrial inventions including medical imaging, entertainment, security, robotics, and autonomous vehicles. With a wide variety of emerging applications, the demand for more advanced computer vision systems is quickly growing. There are several limiting factors when developing accurate real-time computer vision systems. Most computer vision tasks require a great deal of mathematical computation. Computer vision algorithms require a large number of computations as well as an equally large number of memory values. Each new generation of application increases the need for more computational resources.
Traditionally, software developers and even scientists have relied strictly on increases in processor clock frequency as the primary method of gaining performance for the next generation of applied algorithms. Similarly, in years past, performance has been addressed by exploiting instruction-level parallelism (ILP). However, chip area increases exponentially with increasing instruction parallelism under ILP, while only linearly with thread-level and multicore parallel architectures [1]. While multiprocessing techniques perform well for independent sequences of computer operations, there are specific highly complex dependences within computer vision computations that cannot be parallelized at the processing-core level alone but must be addressed with specialized logic blocks.
To overcome the performance barriers of specific computer vision mathematical sequences, new emerging technologies in computer architecture must be explored. The thesis investigates new custom computer hardware solutions described using hardware description language design techniques and evaluated with field-programmable gate array (FPGA)-based simulation and synthesis tools.
Specifically, the thesis extends a Microblaze soft-core CPU to accelerate a selection of frequently utilized computational vision tasks. The soft-core CPU enables new instruction set architecture (ISA) assembly instructions to interface to dedicated on-chip hardware units to accelerate three specific computational sequences. The first algorithm is an ordered buffer that returns the lowest k values of the values written to the buffer. The second algorithm component is a series of SIMD extensions to accelerate image pattern block matching. The third computer vision task examined is a 2D convolution kernel that is accelerated by parallelizing and pipelining a series of multiply-and-accumulate operations. Each algorithm is compared to a software-only Microblaze implementation.
The thesis is divided into the following chapters. CHAPTER II provides the background and related work necessary to understand soft-core CPUs and design environments. CHAPTER III introduces the proposed hardware and the related instruction set architecture interfaces for examining the acceleration of three computer vision mathematical kernels. Next, CHAPTER IV covers experimental results for the kernels. Finally, CHAPTER V provides a conclusion to the thesis.
CHAPTER II
BACKGROUND AND RELATED WORK
2.1 Background
2.1.1 Single Instruction, Multiple Data (SIMD)
One method of increasing parallel operations is through single instruction, multiple data (SIMD) processors. A SIMD processor features large-width registers, logically divided into multiple data values. Each instruction to the CPU performs the same operation, in parallel, on each logical data value. The performance improvement is proportional to the number of values that can fit in a SIMD register.
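As a rough illustration of that idea (a sketch using GCC/Clang's portable vector extensions as a stand-in, not the thesis hardware), a single statement below performs sixteen 8-bit additions in parallel:

#include <stdint.h>

/* Sixteen 8-bit lanes packed into one 128-bit value; this mirrors the
   idea of a wide SIMD register logically divided into smaller elements. */
typedef uint8_t u8x16 __attribute__((vector_size(16)));

u8x16 add_pixels(u8x16 a, u8x16 b)
{
    return a + b;   /* one SIMD operation: 16 parallel 8-bit additions */
}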
2.1.2 Instruction Set Architecture (ISA)
Software is composed of a series of instructions that are fed into a processor. The instructions are the way a user interacts with the hardware. The hardware vendor and the software vendor both need to agree on what those instructions are so that the vendors are compatible with each other. This agreement is an Instruction Set Architecture, or ISA.
The ISA lays out the specific instruction formats at the bit level, as well as defines the specific operation that occurs for any given instruction.
An ISA does not specify how a processor carries out the operations for an instruction. Two different processors could support the same ISA but carry out the work completely differently. An example of this is the x86 processors, where multiple companies, such as Intel and AMD, have created their own versions of processors that implement a version of the x86 instruction set. Because each uses the x86 ISA, the same software will run on either an Intel processor or an AMD processor. (This is a slight simplification, since there are different versions and extensions of the x86 instruction set. Some more advanced instructions are only available on one vendor's processors. Software is often written to check which instructions are available in the ISA for that processor and utilize the most optimal instructions available.)
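On x86, for example, GCC and Clang expose a builtin for exactly this kind of runtime check (a minimal sketch; the "avx2" feature string is just an example):

#include <stdio.h>

int main(void)
{
    /* __builtin_cpu_supports() queries the running CPU for an ISA
       extension so code can pick the best available path. */
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        puts("using the AVX2 code path");
    else
        puts("using the portable fallback path");
    return 0;
}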
2.1.3 Field-Programmable Gate Array (FPGA)
A Field-Programmable Gate Array (FPGA) is an integrated circuit that contains many different primitive building blocks for digital logic circuits. After an FPGA is manufactured, it can be programmed to provide custom circuit behavior. An FPGA includes various components such as logic gates (OR, AND), memory, look-up tables, clock generators, and a routing fabric going between all of the components. The routing fabric is laid out in a general grid pattern. This allows any component to be electrically connected to nearly any other component. Routes are set up in the fabric during the programming process by turning transistors on or off to either connect or isolate components along a route.
Due to their flexible functionality, FPGAs are frequently used for prototyping and testing new application-specific integrated circuit (ASIC) designs prior to going through the expensive setup process to fabricate an ASIC. After an ASIC is fabricated it cannot be changed without going through the expensive setup process again, so finding any problems with the design prior to fabrication translates into cost savings.
FPGAs are also used in end products that are made in low quantities. For these types of products, the slightly higher cost per chip of the FPGA is outweighed by avoiding the much higher setup cost of ASIC manufacturing. FPGAs are frequently used in research for the same reason.
FPGA manufacturers also provide models of their products that can be used in computer simulations. A computer simulation can model chip behavior, electrical path delays, primitive component usage (area), and power usage of a design on the FPGA. In this way a design can be vetted as much as possible before being put on an FPGA for hardware testing. The results obtained in this thesis are from computer simulations of the FPGA.
The components on an FPGA are grouped into blocks called slices. Each slice contains multiple features. For example, a slice may contain one look-up-table, some RAM, a FIFO, and two multiplexors. These slices are then replicated across the FPGA. What is contained in each slice depends on the FPGA. The total slice count is used as a single figure of merit for relative FPGA resource usage of each implementation presented in this thesis.
A modular group of digital functionality is often called a core. An integrated circuit is generally composed of multiple cores connected together to meet a particular need. A core is described by a special kind of file, called a netlist, that is a very detailed, computer-readable description of digital logic building blocks and the connections between them. This file allows cores to be shared and reused across designs or even within the same design.
When a core is realized on an FPGA by programming the FPGA, it is called a soft core because it is realized in a programmable fabric that is changeable (soft). This is opposed to a hard core that is realized in an ASIC that cannot be changed once it is fabricated.
A soft core on an FPGA allows modifications of the core itself as well as the addition of new functionality. For example, a soft-core processor could be expanded to include new instructions that may be unique to a particular application or research effort.
2.2 Related Work
The Parallel Ultra-Low Power (PULP) accelerator targets energy efficiency for ultra-low power platforms such as micro-unmanned aerial vehicles (UAVs). PULP was designed from the logic level down to the fabrication process to be ultra-low power, achieving a peak energy efficiency of 211 billion operations per second per watt (GOPS/W) [2].
The Analog Devices Blackfin commercial embedded CPU provides several dedicated computer vision accelerators and two processor cores, and achieves a peak computational performance of 2 billion multiply-accumulate operations per second (GMACS). Based on the published power draw of approximately 0.446 W, and counting two operations per multiply-accumulate, Blackfin has a peak energy efficiency of 8.96 GOPS/W [3].
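As a quick check on that figure (counting two operations per multiply-accumulate; truncating to two decimals gives the quoted value):

\frac{2\ \text{GMACS} \times 2\ \text{ops/MAC}}{0.446\ \text{W}} = \frac{4\ \text{GOPS}}{0.446\ \text{W}} \approx 8.97\ \text{GOPS/W}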
The Convolution Engine (CE) focuses on accelerating convolution-like data flow, that is, map and reduce operations. CE strikes a balance between a fully custom hardware accelerator and SIMD, with 2-3x energy and area efficiency and an 8-15x performance improvement over a general-purpose processor (GPP) with SIMD extensions [4]. Absolute power and performance numbers were not published.
A commercial vision processor made by Movidius, the Myriad 2, provides two 32-bit UltraSPARC processors coupled with twelve 128-bit VLIW SHAVE processors in a low-power system-on-a-chip (SoC). The Myriad 2 also provides 16- and 32-bit floating point operations. The literature indicates teraflops of performance within one Watt. Assuming at least one teraflop (one trillion floating-point operations per second), the Myriad 2 appears to achieve at least 1,000 GOPS/W.
The performance per Watt of both research and commercial low-power embedded computer-vision-accelerated CPUs varies greatly. Developing a computer vision solution on a Myriad 2 can provide high performance per Watt. This thesis takes an approach somewhat similar to the CE: adding an accelerator to find a better balance between fully dedicated hardware and a general-purpose CPU. One difference between the CE and this thesis is that this thesis evaluates adding the accelerator to an existing soft-core CPU, that is, not utilizing a fully discrete computer vision ASIC or an entirely new CPU architecture. This thesis evaluates possible accelerators that could be added to any existing CPU, with the goal of having an accelerator that significantly increases performance per Watt while remaining generic enough to be used in more than one type of algorithm.
CHAPTER III
APPROACH
3.1 Overview
The focus of this thesis is to evaluate the use of custom hardware interfaced with a soft-core CPU using various ISA extensions. The hardware represents a dedicated, efficient, small footprint of circuitry that can accelerate portions of a few select computer vision algorithms. The algorithm accelerations are generic enough to apply to more than one algorithm where possible, but specific enough to provide a significant increase in performance per Watt.
The algorithms to accelerate are central elements of computer vision and machine learning. Specifically, three computational kernels are selected: k-Nearest Neighbor, 2-dimensional convolution, and block matching. The hardware-accelerated algorithms are compared with software-only counterparts running on the soft-core CPU alone. The primary evaluation criteria are computational performance, power usage, and chip area (FPGA slices).
3.2 Soft-Core CPU
The MB-Lite [5] soft-core CPU was chosen as the baseline processing core for its extendible interface to custom logic functions. MB-Lite implements the Xilinx Microblaze ISA and supports most Microblaze instructions except floating-point, divide, Fast Simplex Link (FSL), cache management, and some less frequently used instructions. Figure 1 illustrates the MB-Lite (Microblaze) standard 5-stage pipeline: fetch, decode, execute, memory, and write-back.
Figure 1 Microblaze architecture
3.3 MB-Lite Modification: SIMD Pipeline Enhancement
To evaluate the custom hardware, a separate single-instruction, multiple-data (SIMD) pipeline is interfaced to the MB-Lite; it includes 32 vector registers, each 128 bits wide. SIMD operations are used in the block matching and 2D convolution kernels in this thesis. After the instruction fetch stage, the instruction goes to both the SIMD decode and the scalar decode stages. The SIMD pipeline stages mirror the scalar stages. The other interface points between the SIMD and scalar pipelines are the pipeline stall and flush signals, scalar operand A, scalar operand B, scalar operand D, and the SIMD result (lower 32-bit word). Figure 2 illustrates the MB-Lite modification with the SIMD enhancement.
As a simplification to the design, memory operations must be done using the standard Microblaze store and load instructions on the 32-bit scalar registers. A custom instruction is used to transfer data between vector registers and scalar registers, up to 32 bits at a time. Although memory transaction efficiency is an important part of a computer architecture, it is outside the scope of this thesis. Memory transactions in this implementation have a fixed one-clock-cycle latency.
Each 128-bit vector register is logically divided into 16, 8-bit values, indexed as shown in figure 3. The most significant bit of the register is on the left.
Figure 3 Vector register 8-bit pixel indices (a 128-bit register divided into 16 8-bit pixel values, indexed 15 down to 0 from left to right)
For some operations, such as shifting values, two vector registers are fused to effectively form a 256-bit register (32, 8-bit values).
In this thesis, register references will be of the following forms:
rX where X is an integer from 0-31, refers to one of 32, 32-bit general purpose scalar registers
vrX where X is an integer from 0-31, refers to one of 32, 128-bit vector registers.
3.4 Computer Vision Kernel: Ordered K-Buffer
Frequently in computer vision there are various scoring mechanisms that require the algorithm to find a subset of points based on value. An example is in the k-Nearest Neighbor classification algorithm in which a data object is classified based on the majority classification of its nearest k neighbors. The core of the problem involves ordering the k closest neighbors from a set of N data objects.
In the context of computer vision algorithms, k-Nearest Neighbors is extremely useful for detecting matches between data vectors derived from feature extraction. In machine learning, pattern recognition, and computer vision, feature extraction starts from an initial set of measured data and builds derived values (features or vectors) intended to be informative. Based on extracted features, interesting points in an image (frame) can be identified to provide a "feature description" of the scene's objects. Common uses of feature tracking include object recognition, mapping and navigation, image stitching, 3D modeling, and video tracking. Common feature extraction algorithms are SIFT [6], BRIEF [7], and ORB [8].
A hardware-based ordered buffer allows a computer system to efficiently keep only the lowest k values as they are being calculated. With respect to feature vectors, the ordering of the k nearest matching vectors determines the confidence of a feature match between images or frames of video. With a dedicated hardware implementation, the need for additional memory buffers and multiple reads of values from memory for sorting is reduced compared to running the code sequence on a traditional CPU.
3.4.1 Software Example
Pseudo-code for the k-Nearest Neighbor algorithm is shown in figure 4. In this example, k index positions are recorded to track the respective best (lowest) matching k vector values. The best match is typically decided by some comparison score, such as Euclidean distance expressed over the dimensions of the feature vector.
// Scan over the N vector scores
for (int index = 0; index < N; index++) {
    int score = database[index];

    // Update the K ordered entries
    for (int k = 0; k < K; k++) {
        if (score <= KNN_Score[k]) {
            // Shift larger entries down to make room
            for (int j = K-1; j > k; j--) {
                KNN_Score[j] = KNN_Score[j-1];
                KNN_Index[j] = KNN_Index[j-1];
            }
            KNN_Score[k] = score;
            KNN_Index[k] = index;
            break;
        }
    }
}
Figure 4 k-Nearest Neighbor algorithm example
3.4.2 Ordered Buffer Implementation
The ordered buffer is a hardware-managed buffer that receives tag-value pairs and returns the associated tags of the lowest k values. The 32-bit tag is application dependent (likely a memory pointer or index position) and the provided 32-bit sorting value is used by the ordered buffer for ordering operations. The set of custom ISA instructions for the ordered buffer is shown in table 1.
Table 1 Ordered buffer instructions
Instruction    Description
obk rA    Set the value of k (keeps the lowest k values)
obclr    Empty the ordered buffer contents
obwr rA, rB    Add value rA with tag value rB to the ordered buffer
obrd rD    Read the next tag, sorted by value, from the ordered buffer
Internally the ordered buffer maintains two buffers: a sorted buffer of logical length k (maximum value of k is 128), and an unsorted victim buffer. Pairs in the sorted buffer are always in a sorted state and pairs in the victim buffer are not in any particular order.
There are two sorting phases within the ordered buffer implementation. The first occurs when a tag-value pair is written to the ordered buffer (figure 5). The sorting value is compared with the highest pair and the lowest pair in the sorted buffer to determine if it can simply append or prepend the pair to the sorted buffer. If the sorting value of the pair lies somewhere in the middle of the sorted buffer, then no determination can immediately be made and the new tag-value pair is appended to the victim buffer.
Figure 5 Ordered buffer tag-value pair insertion flowchart
The second sorting phase occurs when the victim buffer is full or when an attempt is made to read the final result with the obrd instruction while pairs remain in the victim buffer. This sort is a full insertion sort involving all values in the sorted buffer and the victim buffer. Pairs in the sorted buffer are already in order, so the sorting algorithm can take advantage of that and skip unnecessary comparisons. After the lowest k values are found and in the sorted buffer, the sorting process stops and the remaining values in the victim buffer are discarded.
A few algorithm optimizations occur on incoming tag-value pairs when there are already k values in the sorted buffer. If the incoming value is greater than the largest value in the sorted buffer, then the new tag-value pair is discarded since it is known that there are at least k values less than the new value (which are present in the sorted buffer). If the incoming value is less than the lowest pair in the sorted buffer, then the new tag-value pair is pushed onto the front of the sorted buffer and the largest pair on the back of the sorted buffer is discarded.
As a simplification to the implementation, the rest of the processor is stalled during the full-sort operation. Each obrd instruction retrieves one tag from the sorted buffer, returned in order of their associated sorting value, lowest to highest.
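The insertion behavior can be summarized with a small software model. This is an illustrative sketch only, not the VHDL: the buffer depths, the names, and the linear insertion sort are assumptions made for clarity.

/* Illustrative C model of the ordered buffer (obwr plus the full sort). */
#include <stdint.h>
#include <string.h>

#define K_MAX      128   /* maximum k supported by the hardware  */
#define VICTIM_MAX 8     /* victim buffer depth (a design knob)  */

typedef struct { uint32_t value, tag; } pair_t;

static pair_t sorted_buf[K_MAX];      /* always in ascending order */
static pair_t victim_buf[VICTIM_MAX]; /* unsorted                  */
static int n_sorted, n_victim, k;     /* k is set by obk           */

/* Second phase: insertion-sort victim pairs into the sorted buffer,
   keeping only the k lowest; leftovers are discarded. */
static void full_sort(void)
{
    for (int v = 0; v < n_victim; v++) {
        pair_t p = victim_buf[v];
        if (n_sorted == k && p.value >= sorted_buf[k - 1].value)
            continue;                          /* not among the k lowest */
        int i = (n_sorted < k) ? n_sorted++ : k - 1;
        while (i > 0 && sorted_buf[i - 1].value > p.value) {
            sorted_buf[i] = sorted_buf[i - 1]; /* shift larger pairs up  */
            i--;
        }
        sorted_buf[i] = p;
    }
    n_victim = 0;
}

/* First phase: decide where an incoming tag-value pair goes (obwr). */
static void ob_write(uint32_t value, uint32_t tag)
{
    pair_t p = { value, tag };
    if (n_sorted == k && value >= sorted_buf[n_sorted - 1].value)
        return;                                 /* discard: k smaller known */
    if (n_sorted == 0 || value >= sorted_buf[n_sorted - 1].value) {
        sorted_buf[n_sorted++] = p;             /* append at the high end   */
    } else if (value <= sorted_buf[0].value) {  /* prepend at the low end   */
        memmove(&sorted_buf[1], &sorted_buf[0],
                ((n_sorted < k) ? n_sorted : k - 1) * sizeof(pair_t));
        sorted_buf[0] = p;                      /* largest drops off if full */
        if (n_sorted < k) n_sorted++;
    } else {
        victim_buf[n_victim++] = p;             /* middle: defer the decision */
        if (n_victim == VICTIM_MAX)
            full_sort();                        /* buffer full: stall and sort */
    }
}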
3.4.2.1 Illustrated example
The following value-tag pairs will be inserted into the ordered buffer and the results read. This example uses a k value of 4 (the sorted buffer size) and a victim buffer size of 2. The example is demonstrated in table 2.
(593, 0xFF01)
(989, 0xFF02)
(54, 0xFF03)
(625, 0xFF04)
(995, 0xFF05)
(5, 0xFF06)
(999, 0xFF07)
(100, 0xFF08)
(4, 0xFF09)
Table 2 Ordered Buffer illustrated example

1. Internal buffers begin in an empty state.
   In-order: empty | Victim: empty
2. Insert (593, 0xFF01); the first value is inserted into the in-order buffer.
   In-order: (593, 0xFF01) | Victim: empty
3. Insert (989, 0xFF02); value 989 is compared with 593 and inserted after 593.
   In-order: (593, 0xFF01) (989, 0xFF02) | Victim: empty
4. Insert (54, 0xFF03); value 54 is compared with 593 and 989. The value is pushed onto the front of the in-order buffer since it is less than 593.
   In-order: (54, 0xFF03) (593, 0xFF01) (989, 0xFF02) | Victim: empty
5. Insert (625, 0xFF04); value 625 is compared with 54 and 989. Since it is between the min and max of the in-order buffer, the value is put in the victim buffer for later sorting.
   In-order: (54, 0xFF03) (593, 0xFF01) (989, 0xFF02) | Victim: (625, 0xFF04)
6. Insert (995, 0xFF05); value 995 is compared with 54 and 989. The value is appended to the end of the buffer since it is greater than 989.
   In-order: (54, 0xFF03) (593, 0xFF01) (989, 0xFF02) (995, 0xFF05) | Victim: (625, 0xFF04)
7. Insert (5, 0xFF06); value 5 is compared with 54 and 995. The value is pushed onto the front of the in-order buffer since it is less than 54. The last value in the buffer, 995, is discarded.
   In-order: (5, 0xFF06) (54, 0xFF03) (593, 0xFF01) (989, 0xFF02) | Victim: (625, 0xFF04)
8. Insert (999, 0xFF07); value 999 is compared with 5 and 989. Since 999 is greater than 989, and there are already K values in the buffer, value 999 is discarded.
   In-order: (5, 0xFF06) (54, 0xFF03) (593, 0xFF01) (989, 0xFF02) | Victim: (625, 0xFF04)
9. Insert (100, 0xFF08); value 100 is compared with 5 and 989. Since it is between the min and max of the in-order buffer, the value is put in the victim buffer for later sorting.
   In-order: (5, 0xFF06) (54, 0xFF03) (593, 0xFF01) (989, 0xFF02) | Victim: (625, 0xFF04) (100, 0xFF08)

At this point the victim buffer is full and a full insertion sort is initiated. Each value of the in-order buffer is compared with each value of the victim buffer; when a victim buffer value is less than an in-order buffer value, the victim buffer value is inserted before it.

10.-14. Sort: compare, no change (the buffer states are unchanged through these comparisons).
15. Sort: compare, insert 100 before 593; the value 989 is discarded. (If there were more values in the victim buffer, 100 would be compared with the remaining victim buffer values.)
    In-order: (5, 0xFF06) (54, 0xFF03) (100, 0xFF08) (593, 0xFF01) | Victim: (625, 0xFF04)
16. Sort: compare, no change. This is the last in-order buffer value and the last victim buffer value. The lowest K values have now been found.
    In-order: (5, 0xFF06) (54, 0xFF03) (100, 0xFF08) (593, 0xFF01) | Victim: (625, 0xFF04)
17. Sort: discard any remaining values in the victim buffer. End of sorting.
    In-order: (5, 0xFF06) (54, 0xFF03) (100, 0xFF08) (593, 0xFF01) | Victim: empty
18. Insert the last value (4, 0xFF09); value 4 is compared with 5 and 593. The value is pushed onto the front of the in-order buffer since it is less than 5. The last value in the buffer, 593, is discarded.
    In-order: (4, 0xFF09) (5, 0xFF06) (54, 0xFF03) (100, 0xFF08) | Victim: empty
19. Read tag 0xFF09. Since no values have been inserted into the victim buffer since the last full sort, no additional sorting is needed before reading out the tags.
    In-order: (5, 0xFF06) (54, 0xFF03) (100, 0xFF08)
20. Read tag 0xFF06.
    In-order: (54, 0xFF03) (100, 0xFF08)
21. Read tag 0xFF03.
    In-order: (100, 0xFF08)
22. Read tag 0xFF08.
    In-order: empty


3.5 Computer Vision Kernel: Block Matching
3.5.1 Overview
Block matching is used in image processing algorithms such as compression, motion estimation, and template-based object recognition. Figure 6 demonstrates an example of block matching (scoring) between two image frames in a video sequence. A favorable score indicates a match of the block of pixels between the two images. The block size parameters, height and width, express additional computational requirements that scale with each parameter. In addition, the search window represents additional block matching scoring. The core of the block matching kernel is the mathematical calculation comparing the pixel values between two selected blocks, not the additional search requirements. However, once multiple block matching scores have been calculated, the ordered k-buffer kernel could then accelerate finding the set of best matches in the search area.

Figure 6 Block matching between image frames (a block in frame k is compared against positions within a search window in frame k+1)
One calculation of block matching involves finding the mean absolute difference (MAD) between corresponding blocks of pixels from two input images, A and B. This is shown in figure 7. The result is used as a score to find a likely match.
\mathrm{MAD} = \frac{1}{n^2} \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} \left| A_{i,j} - B_{i,j} \right|

Figure 7 Mean absolute difference (MAD) equation for block matching
The implementation in this thesis accelerates calculating the summation portion of this equation: the Sum of Absolute Differences (SAD). This is done in one dimension at a time. The other dimension could be done by first transposing the pixels, but that is not covered in this thesis.
Table 3 Block matching instructions
Instruction    Description
vshr8c vrD, vrA, vrB, rC    Shifts the 256-bit register formed from (vrB & vrA) to the right by the number of 8-bit elements specified in scalar register rC (rC*8 bits), writing the result to vrD. The value in rC is then incremented by 1.
vrsub8am vrD, vrA, vrB, rC    Subtracts vrA from vrB, element-wise, and writes the absolute value of each element result to the destination register vrD. Register rC contains an element mask indicating which elements to include in the operation. If an element is excluded, a zero is written to its position.
vat8 vrA    Sums all 16 8-bit elements in vector register vrA using an adder tree. The calculation is done at a rate of one result per clock cycle with a latency of 3 clock cycles. The 32-bit summation result is read into a scalar register using the vatr8 instruction.
vatr8 rD    Reads the 32-bit summation result from vat8 into scalar register rD.
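A short software model may make the fused-register shift semantics concrete. This is an illustrative sketch only: byte arrays stand in for the 128-bit registers, and index 0 is the rightmost (least significant) pixel, as in figure 3.

#include <stdint.h>

/* Model of vshr8c: treat (vrB & vrA) as one 32-pixel value, shift it
   right by rC pixels into vrD, then increment rC. */
void model_vshr8c(uint8_t vrD[16], const uint8_t vrA[16],
                  const uint8_t vrB[16], uint32_t *rC)
{
    uint32_t s = *rC;
    for (int i = 0; i < 16; i++) {
        uint32_t src = i + s;            /* position in the fused register */
        if (src < 16)       vrD[i] = vrA[src];
        else if (src < 32)  vrD[i] = vrB[src - 16];
        else                vrD[i] = 0;  /* shifted past the high register */
    }
    (*rC)++;                             /* auto-increment for the next SAD */
}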
3.5.2 Software Example
Given 32 pixels from an image in vector registers 1 and 2, 32 pixels from an image template in vector registers 3 and 4, a block size of T pixels, an image register offset of A pixels, and an image template register offset of B pixels, the scoring for block matching can be performed with the operations shown in figure 8.
// load pixel mask with a width of T into scalar register r3
ld(r3, (1 << T) - 1);

// shift (vr3 & vr4) to the right by A pixels and store in vr6
vshr8(vr6, vr4, vr3, A);

for (i = 0; i < N; i++) {
    // shift (vr1 & vr2) to the right by r1 pixels and store in vr5, increment r1
    vshr8c(vr5, vr2, vr1, r1);

    // absolute-differences operation on vr5 and vr6, pixel mask of r3, store result in vr7
    vrsub8am(vr7, vr5, vr6, r3);

    // sum the differences in vr7
    vat8(vr7);
}

uint32 results[N];

for (i = 0; i < N; i++) {
    // read one result and store in scalar register r4
    vatr8(r4);
    store(r4, &results[i]);
}
Figure 8 Block matching scoring software example
3.5.3 Block Matching Implementation
Up to 32, 8-bit pixels are loaded into vector registers from each of the images to be compared. For each SAD calculation, three instructions are executed with a throughput of one instruction per clock cycle: (1) right shift to current pixel, (2) calculate absolute values of differences for width of block, and (3) sum differences. The result of the summation is stored in a dedicated FIFO at full precision until read out to a scalar register.
The summation of the differences is done using a fully pipelined adder tree. The adder tree has a latency of 3 clock cycles. Because of this latency, efficiency is gained by doing multiple SAD operations before attempting to read results out of the FIFO. For this explanation, the block of pixels that is fixed will be called the template and the block of pixels that is moved or shifted will be called the target. For motion estimation, the template would be the previous frame and the target would be the current frame.
The SAD operation subtracts each 8-bit pixel in one vector register from each 8-bit pixel with the same position in another vector register. To limit which pixels are involved in the operation, the absolute difference instruction (vrsub8am) takes a byte mask parameter that causes zeros to be written to the destination register for pixels that lie outside of the mask.
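Under the same modeling assumptions, the masked absolute-difference step can be sketched as follows. Treating the mask in rC as one bit per element (bit i gating element i) is an assumption of this sketch; the text does not specify the mask encoding.

#include <stdint.h>
#include <stdlib.h>

/* Model of vrsub8am: element-wise |vrB - vrA|, writing zero to
   elements whose mask bit is clear. */
void model_vrsub8am(uint8_t vrD[16], const uint8_t vrA[16],
                    const uint8_t vrB[16], uint32_t mask)
{
    for (int i = 0; i < 16; i++) {
        int diff = (int)vrB[i] - (int)vrA[i];
        vrD[i] = ((mask >> i) & 1) ? (uint8_t)abs(diff) : 0;
    }
}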
In the example illustrated below, the block is 11 pixels wide and is compared with 5 positions in the target image. Within the 32 template pixels register, the right-most pixel of the block is at index 7. Within the 32 target pixels register, the right-most pixel of the block for the first comparison is at index 2.
1. Load up to 32 template pixels to two fused vector registers (3 and 4)
2. Load up to 32 target image pixels to two fused vector registers (1 and 2)
3. Right-shift template pixels using instruction: vshr8(vr6, vr4, vr3, 7) (figure 9)
Figure 9 Right-shift template pixels
4. Load the target starting register offset of 2 into scalar register r3.
5. Right-shift target pixels by the amount in r3 and increment r3 from 2 to 3 using instruction vshr8c(vr5, vr2, vr1, r3) (figure 10)
Figure 10 Right-shift target pixels by amount in r3
6. Perform absolute difference operation, writing zeros to pixels that are outside of the mask using instruction vrsub8am(vr7, vr5, vr6, r3) (figure 11)
Figure 11 Absolute difference and mask
7. Sum all values in vr7 using vat8(vr7) and put the result into the dedicated FIFO (figure 12)
Figure 12 Result pushed into dedicated FIFO (value: 274)
8. Right-shift target pixels by amount in r3; increment r3 from 3 to 4 using instruction vshr8c(vr5, vr2, vr1, r3) (figure 13)
Figure 13 Right-shift target pixels by amount in r3
9. Perform absolute difference operation, writing zeros to pixels that are outside of the mask using instruction vrsub8am(vr7, vr5, vr6, r3) (figure 14)
Figure 14 Absolute difference and mask
10. Sum all values in vr7 using vat8(vr7) and put the result into the dedicated FIFO (figure 15)
Figure 15 Result pushed into dedicated FIFO (value: 271)
11. Right-shift target pixels by amount in r3; increment r3 from 4 to 5 using instruction vshr8c(vr5, vr2, vr1, r3) (figure 16)
Figure 16 Right-shift target pixels by amount in r3
12. Perform the absolute difference operation, writing zeros to pixels that are outside of the mask, using instruction vrsub8am(vr7, vr5, vr6, r3) (figure 17)
Figure 17 Absolute difference and mask
13. Sum all values in vr7 using vat8(vr7) and put the result into the dedicated FIFO (figure 18)
Figure 18 Result pushed into dedicated FIFO (value: 268)
14. Read results, one per clock cycle, out of dedicated FIFO and write to memory: 274, 271, 268
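For comparison, the software-only baseline computes the same scores with a scalar double loop. The following is a minimal sketch under assumed names and bounds; the actual software implementation is listed in figure 50.

#include <stdint.h>
#include <stdlib.h>

/* Scalar SAD of a blockWidth-pixel block at each candidate offset. */
void sad_scan(const uint8_t *template_px, const uint8_t *target_px,
              int blockWidth, int nPositions, uint32_t *results)
{
    for (int pos = 0; pos < nPositions; pos++) {
        uint32_t sum = 0;
        for (int i = 0; i < blockWidth; i++)
            sum += (uint32_t)abs((int)template_px[i] - (int)target_px[pos + i]);
        results[pos] = sum;   /* one score per candidate position */
    }
}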
3.6 Computer Vision Kernel: 2D Convolution
3.6.1 Overview
Two-dimensional (2D) convolution is the process of adding each element of the image to its local neighbors, weighted by the kernel. Convolution is the core process in image processing for blurring, sharpening, embossing, edge detection, and more. Convolution is mathematically accomplished by multiplying the kernel (an array of numbers) element-wise against the 2D neighborhood of an image position, summing the products, and repeating that process for all locations of the image. Figure 19 illustrates the base example of 2D image convolution for a 3x3 kernel applied to an image at a single input position to generate an output result.
Figure 19 2D image convolution example
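As a baseline reference for what the accelerator computes, a direct scalar 2D convolution can be sketched as below. The 5x5 kernel size matches the hardware described later in this section; the zero padding at the borders and the simple truncation are assumptions of this sketch.

#include <stdint.h>

/* Direct 5x5 convolution of an 8-bit image; results are right-shifted
   by shiftAmt and truncated to 8 bits, as the vflt instruction does. */
void conv5x5(const uint8_t *img, uint8_t *out, int w, int h,
             const int16_t kernel[5][5], int shiftAmt)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int32_t acc = 0;
            for (int ky = -2; ky <= 2; ky++) {
                for (int kx = -2; kx <= 2; kx++) {
                    int iy = y + ky, ix = x + kx;
                    uint8_t px = (iy < 0 || iy >= h || ix < 0 || ix >= w)
                                     ? 0 : img[iy * w + ix];  /* zero pad */
                    acc += kernel[ky + 2][kx + 2] * px;
                }
            }
            out[y * w + x] = (uint8_t)(acc >> shiftAmt);  /* truncate */
        }
    }
}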
Custom instructions are created to accelerate 2-dimensional convolution of 8-bit pixels. The instructions are capable of convolving a 5x5 pixel kernel with a partial image of up to 5x16 pixels at a throughput of 1 pixel per clock cycle. The 2D convolution instructions operate directly on the SIMD vector registers.
The input for the 2D convolution operation comes from fixed positions in the vector registers. Additional pixels are calculated by shifting the vector registers by one pixel and repeating the process. Each result is stored in a special internal register.
3.6.2 Instructions
3.6.2.1 Instruction: vflt
Given the kernel and surrounding pixels, the vflt instruction performs convolution for a single pixel. The kernel is loaded into vector registers 16 and 17 (see figure 20).
 1   4   6   4   1
 2   8  12   8   2
 0   0   0   0   0
-2  -8 -11  -8  -2
-1  -4  -6  -4  -1
Figure 20 Convolution kernel
Partial rows of an image are loaded into vector registers 1-15. Pixels of image rows are loaded into consecutively numbered registers to represent their respective spatial arrangement as shown in figure 21: row 1 in registers 1-3, row 2 in registers 4-6, row 3 in registers 7-9, row 4 in registers 10-12, and row 5 in registers 13-15.
Figure 21 Convolution inputs (within green box) for calculation of indicated pixel location in relation to 15 vector registers, showing sample data
The input pixels to the convolution operation are outlined in figure 21 with a green box, centered on the left-most pixel of register 8. The purpose of registers 1, 4, 7, 10, and 13 is to provide the left neighboring pixels for the calculation; only the right two pixels of each of these registers are utilized in the calculation. Upon execution of the vflt instruction, the following occurs: pixels from the indicated area are multiplied by the respective values in the kernel, the multiplication results are registered in the adder tree pipeline for summation, the rows are shifted to the left by one pixel, and the summation result is shifted to the right by the number of bits specified by the vflt instruction.
The pseudo-code for calculating one pixel is shown in figure 22. Each vector register is represented by vrX[Y], where X is the register number and Y is the 8-bit pixel in the register, with Y=0 being the rightmost pixel. The variable tmp is an internal working vector register, shiftAmount is the number of bits to right-shift each result, and values of the form rowZ, where Z is an integer, are scalar variables for convenience. The state of the registers after one operation, each row shifted by one pixel, is shown in figure 23.
row1 = vr1[1]*vr16[15]  + vr1[0]*vr16[14]  + vr2[15]*vr16[13]  + vr2[14]*vr16[12]  + vr2[13]*vr16[11]
row2 = vr4[1]*vr16[10]  + vr4[0]*vr16[9]   + vr5[15]*vr16[8]   + vr5[14]*vr16[7]   + vr5[13]*vr16[6]
row3 = vr7[1]*vr16[5]   + vr7[0]*vr16[4]   + vr8[15]*vr16[3]   + vr8[14]*vr16[2]   + vr8[13]*vr16[1]
row4 = vr10[1]*vr16[0]  + vr10[0]*vr17[15] + vr11[15]*vr17[14] + vr11[14]*vr17[13] + vr11[13]*vr17[12]
row5 = vr13[1]*vr17[11] + vr13[0]*vr17[10] + vr14[15]*vr17[9]  + vr14[14]*vr17[8]  + vr14[13]*vr17[7]

tmp[15] = (row1 + row2 + row3 + row4 + row5) >> shiftAmount

vr1 = vr1[15..1] & vr2[15]
vr2 = vr2[15..1] & vr3[15]
vr3 = vr3[15..1] & 0x00

vr4 = vr4[15..1] & vr5[15]
vr5 = vr5[15..1] & vr6[15]
vr6 = vr6[15..1] & 0x00

vr7 = vr7[15..1] & vr8[15]
vr8 = vr8[15..1] & vr9[15]
vr9 = vr9[15..1] & 0x00

vr10 = vr10[15..1] & vr11[15]
vr11 = vr11[15..1] & vr12[15]
vr12 = vr12[15..1] & 0x00

vr13 = vr13[15..1] & vr14[15]
vr14 = vr14[15..1] & vr15[15]
vr15 = vr15[15..1] & 0x00

Figure 22 Single-pixel convolution pseudo-code ('&' is an array concatenation operator, calculated pixel result in tmp[15])
Figure 23 State of registers after one operation
Intermediate calculations are full precision, including the final division by a power-of-two divisor before being truncated to 8 bits. The convolution operation has a throughput of one pixel per clock cycle with a latency of 5 clock cycles. All multiplication is done in the first clock cycle using 25 multipliers, with addition occurring on subsequent clock cycles using a pipelined adder tree of 3-input adders.
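A plausible breakdown of that latency, assuming one pipeline stage per adder-tree level (the thesis states only the totals): one cycle for the 25 parallel multiplies, three stages for a 3-input adder tree reducing 25 products, and one cycle for the final shift and truncation.

\text{latency} = 1 + \lceil \log_3 25 \rceil + 1 = 1 + 3 + 1 = 5\ \text{clock cycles}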
For design simplicity, the operations occur directly on the register file itself without respect to the normal pipeline stages. Thus, the CPU pipeline must first be cleared by issuing 4 nops before the 16 vflt instructions. This design is a balance between reduced performance and a reduction in power and area by allowing more FPGA resource sharing.
3.6.2.2 Instruction: vfltr
After calculating 16 pixels, the instruction vfltr copies the 16 resulting pixels from the internal working vector register (tmp) to the vector register destination specified by the instruction.
3.6.3 Software Example
Pseudo-code is shown in figure 24 for convolving row R of an image, assuming the width of the image is a multiple of 16, with image pixels in array img and result stored in array imgResult. This process would be repeated for each row of the image.
zero registers 1, 4, 7, 10, and 13
populate registers 16 and 17 with the kernel
shiftAmt = 1;

vr2  = img[R-2][0..15];
vr5  = img[R-1][0..15];
vr8  = img[R  ][0..15];
vr11 = img[R+1][0..15];
vr14 = img[R+2][0..15];

for (i = 0; i < width; i += 16) {
    // On the last block of 16, these registers
    // will just be holding values of zero.
    if (i < width-16) {
        vr3  = img[R-2][i+16..i+31];
        vr6  = img[R-1][i+16..i+31];
        vr9  = img[R  ][i+16..i+31];
        vr12 = img[R+1][i+16..i+31];
        vr15 = img[R+2][i+16..i+31];
    }

    for (j = 0; j < 16; j++) {
        vflt(shiftAmt);
    }

    vfltr(vr20);
    imgResult[R][i+0..i+15] = vr20;
}
Figure 24 Hardware-accelerated convolution pseudo-code
3.7 Software Toolchain
The GNU C/C++ toolchain was used to cross-compile Microblaze code. The assembler was modified to include the custom instructions. The custom instructions are used explicitly in the C/C++ code as inline assembly.
3.8 Cycle-Accurate Simulator
To speed up cycle-accurate simulations, the VHDL was translated by hand to C++. The two-process design methodology of the VHDL, promoted by Jiri Gaisler [9], lent itself well to C++ execution flow. The VHDL entities have combinational and synchronous logic explicitly separated. Each VHDL entity was directly translated to a C++ class. All combinational logic was put into a method named evaluate and all synchronous logic was put into a method named clock. The logic on each VHDL line was modified to provide the same logic in C++ syntax.
During execution of the simulator, a clock cycle begins by calling each object's evaluate() method repeatedly in a loop. After each iteration of the loop, the memory space of each object is checked for signal changes as the combinational logic settles. After an iteration runs without any signal changes, the clock() method is called once on each object to latch the new inputs. The process then repeats for the next clock cycle.
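In outline, the simulator's per-cycle loop looks like the following C++ sketch. The class and method names follow the description above; the signal-change check is abstracted here into a settled() test, which is an assumption of this sketch.

#include <vector>

// Each translated VHDL entity becomes a class with separated
// combinational (evaluate) and synchronous (clock) logic.
struct Entity {
    virtual void evaluate() = 0;   // combinational logic
    virtual void clock()    = 0;   // latch inputs on the clock edge
    virtual ~Entity() = default;
};

// One simulated clock cycle: iterate evaluate() until no signal
// changes remain (combinational settling), then clock every entity.
void run_cycle(std::vector<Entity*>& entities, bool (*settled)())
{
    do {
        for (Entity* e : entities)
            e->evaluate();
    } while (!settled());          // re-evaluate while signals change

    for (Entity* e : entities)
        e->clock();
}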
File I/O and terminal I/O are accomplished through communication with memory-mapped devices. The devices are C++ objects that make the applicable host operating-system calls when commanded via their assigned Microblaze memory addresses. The custom simulator allowed for faster simulation times than the equivalent Modelsim simulation.
3.9 Execution Statistics
Execution statistics are collected within the simulator. Statistics collected and evaluated are the number of times each instruction is executed, the number of clock cycles lost due to pipeline flushes or data dependency hazards, and the number of clock cycles the CPU is stalled due to an ordered buffer sort operation.
Control of the statistics collection, which includes clearing, pausing, and starting counters, is done by custom Microblaze instructions and inline assembly in the Microblaze source code. The statistics control instructions are used to limit statistics collection during the execution of the algorithm itself and to ignore program startup, screen printing, and program shutdown.
CHAPTER IV
EXPERIMENT RESULTS
4.1 Overview
The implementations are evaluated on clock cycles for the complete algorithm, estimated power usage, and chip area (FPGA resource usage).
The VHDL design was synthesized, placed and routed using Xilinx Vivado 2015.2 for a Kintex 7, XC7K70T. The Vivado power estimating tools were used to generate estimated static and dynamic power usage of the design. Default node activity built into the Vivado power estimator was used (vectorless power estimation).
The power estimates are not as consistent as initially expected. The most likely cause is the use of the power estimator tool's default node activity vectors instead of actual vectors from simulation.
Also, the implementations presented take up a small percentage of the total FPGA area, and routing power is a significant contributor to total power usage. This means routing decisions by the FPGA tools can have a greater impact on the total estimated power usage. More meaningful power estimates might be achieved by instantiating multiple cores of the design and estimating power with actual simulation vectors. Although power estimates obtained from the power estimator tool are presented in this thesis, additional work is required to improve the fidelity of the results.
The speed-up-per-slice metric is more robust and should be used for evaluating the designs. The absolute number of slices, by contrast, should only be used for relative comparisons on the same FPGA, since the number of slices required by a design varies between different FPGAs.
The highest clock rate at which the design meets timing is 80 MHz, limited by at least one slow path. Due to time constraints for this thesis, the design was not further optimized to increase the clock rate; however, the relative performance improvement metrics are not affected by the clock rate.
4.2 Ordered Buffer
4.2.1 Victim Buffer Sizing
Although it would initially seem that a larger victim buffer would improve efficiency by reducing the number of full sort operations, that is not always the case. The size of the victim buffer also indirectly drives the efficacy of the initial comparison with the lowest and highest pairs in the sorted buffer when a pair is written to the ordered buffer (the first phase of sorting). The most efficiency is gained when a new input pair can either be discarded, because it is higher than the k lowest pairs known up to that point, or prepended to the beginning of the sorted buffer, causing the previous largest pair in the k buffer to be discarded. In either case, a full sort operation is avoided.
The narrower the gap between the smallest and largest pairs in the sorted buffer, the greater the chance that input pairs fall outside the sorted buffer range in the first phase and can thus be added to the sorted buffer or discarded without a full sort operation. The sooner in the process this gap is narrowed, the more input pairs can be handled in the first phase without resorting to a full sort.
The size of the victim buffer drives how frequently the full-sort operation takes place. More frequent full-sort operations (a smaller victim buffer) can narrow the gap in the sorted buffer sooner and thus allow input pairs to be handled in the first phase that would otherwise have been added to the victim buffer.
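Condensed from the software model in Appendix A (values shown without their tags), the per-pair decision is roughly the following; this is a sketch of the software model, not the exact hardware sequencing:

    // First-phase decision for a new value, condensed from Appendix A.
    // sorted_buf[] holds up to K values in ascending order; victim_buf[]
    // is unsorted and triggers the second-phase full sort when it fills.
    enum { K = 32, VICTIM_SIZE = 8 };
    static int sorted_buf[K];           static int sorted_len = 0;
    static int victim_buf[VICTIM_SIZE]; static int victim_len = 0;

    void full_sort(void);  // second phase; see ob_sort() in Appendix A

    void ob_insert(int data)
    {
        if (sorted_len < K &&
            (sorted_len == 0 || data >= sorted_buf[sorted_len - 1])) {
            sorted_buf[sorted_len++] = data;  // extends the sorted buffer in-order
        } else if (sorted_len == K && data >= sorted_buf[K - 1]) {
            /* larger than the k lowest seen so far: discard immediately */
        } else {
            victim_buf[victim_len++] = data;  // defer to the victim buffer
        }
        if (victim_len == VICTIM_SIZE)
            full_sort();                      // narrows the sorted-buffer gap
    }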
Although these principles hold for every dataset, the actual efficiency and optimal victim buffer size of the ordered buffer depend on the dataset, the insertion order, and the sorted buffer size (k). We can, however, observe results for a uniform random distribution dataset, a worst-case dataset, and a best-case dataset. Table 4 describes each dataset.
Table 4 Ordered Buffer datasets
Dataset Description
Uniform Random: Generated using the rand() function in the C standard library (libc). The algorithm is executed 100 times, each with a different random dataset, and the results are averaged. Clock cycles spent in the rand() function call are excluded.
Worst-Case: The worst-case input occurs when the pairs being written to the ordered buffer lie between the minimum and maximum pairs in the sorted buffer and thus must be added to the victim buffer. No values can be filtered out in the first-phase comparisons; all sorting is done by the full-sort operation. This dataset starts with one low value (0) followed by one high value (2,000,000,000), which both get added to the sorted buffer. Each subsequent value is one less than the previous value, which guarantees that regardless of the sorting state of the buffers, the new value always lies between the lowest and highest values in the sorted buffer.
Best-Case: The best-case input occurs when the pairs being added fall outside the minimum and maximum values in the sorted buffer. Each new pair can either be directly added to the sorted buffer in-order or discarded, and no future sort operation needs to take place for that value. The dataset used is a series starting with zero in which each subsequent value is one more than the previous value. The ordered buffer is filled after k pairs, and the pairs that follow are discarded for being too large. No pairs end up being inserted into the victim buffer.
The figure of merit for computational efficiency comparison of the ordered buffer is the total number of clock cycles to execute the algorithm divided by the number of input pairs (lower is better).
Figure 25 and figure 26 (k=8 and k=32, respectively) show the computational efficiency for various numbers of dataset input pairs and victim buffer sizes using the random datasets (100 random datasets with the results averaged). When k=8, the optimal victim buffer size for any given number of input pairs is 4 pairs. When k=32, the optimal victim buffer size for any given number of input pairs is 10 pairs.
To better visualize the behavior of the ordered buffer with random input data, a slice of figure 26 is shown in figure 27 for 1,000 input pairs. The results from each of the 100 random datasets are shown individually instead of being averaged. Although each dataset yields different results, each exhibits a similar trend: the clock cycles per input pair decline steeply to a minimum at roughly a victim buffer size of 10 pairs, averaging about 11 clock cycles per input pair, and then slowly increase to an average of about 12 clock cycles per input pair when the victim buffer size is 32. Based on the figure, the results appear to be distributed in a Gaussian-like manner around a second-order trend line.
Figure 28 and figure 29 (k=8 and k=32, respectively) show the computational efficiency for various dataset input pair counts and victim buffer sizes using the best-case dataset. Since the victim buffer is not used with a best-case dataset, the efficiency is constant across victim buffer sizes for a given number of input pairs.
Figure 30 and figure 31 (k=8 and k=32, respectively) show the computational efficiency for various numbers of dataset input pairs and victim buffer sizes using the worst-case dataset. There is no clock-cycles-per-input-pair minimum present in either figure up to the maximum victim buffer size of 32 that was measured, but the rate of change of efficiency with increasing victim buffer size quickly decreases. The efficiency improvement when going from a victim buffer size of 9 to a victim buffer size of 10 is just 0.03 clock cycles per input pair (at 1,000 input pairs). A victim buffer size of about 9 would be a reasonable balance between hardware resources and computational efficiency.
Further analysis could be done in future work to formulate a parametric model of the observed behavior. However, given the variability in computational efficiency near the observed minimums across the test cases, selecting a statistically optimal victim buffer size, rather than one that appears reasonable from the observations, is unlikely to yield significant efficiency improvements. Based on the preceding figures, a victim buffer size of 8 pairs was chosen as a reasonable compromise. The software comparison results that follow utilize this victim buffer size.
Figure 25 Victim buffer sizing (random inputs, k=8, 100 datasets with results averaged)
Figure 26 Victim buffer sizing (random inputs, k=32, 100 datasets with results averaged)
Figure 27 Victim buffer sizing (100 random datasets with results shown separately, each with 1,000 pairs, k=32)
Figure 28 Victim buffer sizing (best-case inputs, k=8)
Figure 29 Victim buffer sizing (best-case inputs, k=32)
Figure 30 Victim buffer sizing (worst-case inputs, k=8)
Figure 31 Victim buffer sizing (worst-case inputs, k=32)


4.2.2 Performance Comparison with Software-Only Implementations
4.2.2.1 Overview
The hardware ordered buffer is compared to a software implementation of the same algorithm. The software ordered buffer is also compared to another software implementation that simply sorts an array in memory. The comparison of the two software implementations illustrates performance improvements that are due to the algorithm itself rather than to special hardware. Descriptions of each software implementation are in table 5.
Table 5 Ordered Buffer software comparison algorithms
Software Implementation Description
Software (SW) Ordered Buffer: Modeled after the hardware, with sorted and victim buffers as arrays in memory. Each new value goes through the same comparisons as the hardware version prior to being added to one of the buffers in memory or being discarded. When the victim buffer is full, a software-based sort algorithm executes.
Array Sort: Modeled after the traditional method of inserting all new values into an array in memory followed by a single selection sort operation. The selection sort stops as soon as the lowest k values are found.
The figure of merit for computational efficiency is the total number of clock cycles to execute the algorithm divided by the number of input pairs (lower is better). In the figures that follow, the computational efficiency of the indicated implementation is evaluated for the given number of input values (200 through 5000), the given sorted buffer size k (4 through 128), and the given input dataset (random, best-case, worst-case).
4.2.2.2 Performance Comparison: Random Dataset
Figure 32 shows the computational efficiency of the hardware implementation of the ordered buffer with the random dataset (results of 100 datasets averaged; see table 4). The implementation scales well for larger values of k. With 5000 input pairs, it goes from 6 clock cycles per pair at k=4 to about 22 clock cycles per pair at k=128. The software array sort (figure 34) does not scale as well, going from 58 to 1,649, respectively. The software ordered buffer (figure 33) scales similarly to the hardware ordered buffer, going from 37 to 282, respectively. This shows that much can be gained from the ordered buffer algorithm itself, whether implemented in hardware or software.
The improvement in computational efficiency of the hardware ordered buffer compared to the software ordered buffer ranges from 5.1x (5,000 input pairs and k=4) to 13.5x (200 input pairs and k=128). Either end of the spectrum is still a significant increase in performance. See figure 41.
The hardware ordered buffer compared to the array sort implementation ranges from 7.6x (200 input pairs and k=4) to 79.1x (5,000 input pairs and k=88). See figure 42.
4.2.2.3 Performance Comparison: Best-Case Dataset
Figure 35 shows the computational efficiency of the hardware implementation of the ordered buffer with the best-case dataset (see table 4). Since the best-case dataset involves no sort operations, the computational efficiency is nearly constant at about 6 clock cycles per input pair for nearly any number of input pairs or values of k (slightly higher for fewer than 1,000 input pairs). The trend is similar for the software ordered buffer (figure 36) at about 36 clock cycles per input pair, except that efficiency improves slightly with an increasing number of input pairs as software overhead is amortized.
The selection sort algorithm used in the array sort is less impacted by the different datasets and exhibits results similar to the random dataset (figure 37).
The hardware ordered buffer provides an improvement in computational efficiency of 4.3x to 5.0x over the software ordered buffer (figure 43), and an 8.3x to 265.7x improvement over the array sort (figure 44). The latter result is the peak improvement achieved by the hardware ordered buffer over the array sort for the tested input parameters.
4.2.2.4 Performance Comparison: Worst-Case Dataset
Figure 38 shows the computational efficiency of the hardware implementation of the ordered buffer with the worst-case dataset (see table 4). Even with the worst-case dataset, the computational efficiency still scales well with increasing k. With 5000 input pairs, it goes from 11 clock cycles per pair at k=4 to about 149 clock cycles per pair at k=128. The software array sort (figure 40) does not scale as well, going from 58 to 1,528, respectively. The software ordered buffer (figure 39) fares much worse with this dataset, going from 121 to 2,584, respectively.
With the worst-case dataset, the software ordered buffer has no advantage over the array sort, since no data can be inserted in-order or discarded at insertion time; it simply performs a full sort each time the victim buffer fills up. It even incurs additional overhead by sorting multiple times, as opposed to the array sort's single sort operation. The hardware ordered buffer still outperforms either software implementation, since the dedicated hardware incurs less clock cycle overhead even for frequent sort operations.
The hardware ordered buffer provides an improvement in computational efficiency of 10.0x to 16.4x over the software ordered buffer (figure 45), and a 4.3x to 9.3x improvement over the array sort (figure 46).
Although memory and cache performance was not modeled, it is expected that the software ordered buffer would have better performance than the array sort for large datasets, since pairs can be handled in the first stage while the data is still in cache or CPU registers. The array sort may experience more cache misses and be more impacted by memory latencies due to the need to iterate through the entire input dataset multiple times. It is possible that the software ordered buffer would outperform the array sort on the worst-case dataset if memory and cache performance were considered.
Figure 32 Number of input pairs vs. k (HW ordered buffer, random inputs, 100 datasets with results averaged)
Figure 33 Number of input pairs vs. k (SW ordered buffer, random inputs, 100 datasets with results averaged)
Figure 34 Number of input pairs vs. k (array sort, random inputs, 100 datasets with results averaged)
Figure 35 Number of input pairs vs. k (HW ordered buffer, best-case inputs)
Figure 36 Number of input pairs vs. k (SW ordered buffer, best-case inputs)
Figure 37 Number of input pairs vs. k (array sort, best-case inputs)
Figure 38 Number of input pairs vs. k (HW ordered buffer, worst-case inputs)
Figure 39 Number of input pairs vs. k (SW ordered buffer, worst-case inputs)
Figure 40 Number of input pairs vs. k (array sort, worst-case inputs)
Figure 41 Performance improvement over SW ordered buffer (random inputs, 100 datasets with results averaged)
Figure 42 Performance improvement over array sort (random inputs, 100 datasets with results averaged)
Figure 43 Performance improvement over SW ordered buffer (best-case inputs)
Figure 44 Performance improvement over array sort (best-case inputs)
Figure 45 Performance improvement over SW ordered buffer (worst-case inputs)
Figure 46 Performance improvement over array sort (worst-case inputs)
4.2.2.5 Instruction Analysis
An instruction histogram is shown below (figure 47, figure 48, figure 49) for each implementation for 1,000 input pairs, k=16, and the random dataset (100 random datasets with the results averaged). This sampling provides a representative view of the relative instruction usage between implementations.
Figure 47 Instruction histogram, 1000 input samples, k=16, random data
Figure 48 Instruction histogram, 1000 input samples, k=16, random data, SW ordered buffer
Figure 49 Instruction histogram, 1000 input samples, k=16, random data, array sort
A few instructions listed in the figures are not actual instructions:
flush: A nop clock cycle due to a pipeline stage flush. These occur when taking a branch. There will be one or two stages flushed, depending on whether the delay slot is utilized by the compiler, every time a branch is taken.
stall: The CPU was stopped while the ordered buffer did a full sort.
nop: A nop due to a bubble being inserted into the pipeline to handle a data hazard.
In the software implementations (figure 48, figure 49), the large number of pipeline stage flushes relative to other instructions is due to the large number of branches taken to iterate over the data. These are a significant portion of the total clock cycles. Other CPU implementations with branch prediction would probably not incur so many pipeline stage flushes. In the hardware implementation, these loops over the data are managed in dedicated hardware.
For the case above, the number of operations per instruction, the number of instructions, the total operations, and the operations per clock cycle are shown in table 6 for each implementation.
Table 6 Ordered buffer operations per clock cycle calculations
Instruction Operations per instruction HW Ordered Buffer Instr. Count SW Ordered Buffer Instr. Count Array Sort Instr. Count HW Ordered Buffer Operations SW Ordered Buffer Operations Array Sort Operations
addik 1 1,034 6,308 16,882 1,034 6,308 16,882
addk 1 1 1,167 145 1 1,167 145
beqi 0 0 997 0 0 0 0
beqid 0 1 998 16 0 0 0
bgei 0 0 16 31,728 0 0 0
bgeid 0 0 5,594 16 0 0 0
blei 0 1 129 0 0 0 0
bleid 0 0 13 2 0 0 0
blti 0 1,001 3,090 0 0 0 0
bltid 0 16 1,979 16,880 0 0 0
brid 0 0 879 0 0 0 0
brlid 0 0 1,011 0 0 0 0
bslli 1 0 4,111 32,776 0 4,111 32,776
cmp 1 1,016 2,925 16,896 1,016 2,925 16,896
cmpu 1 1 144 0 1 144 0
flush 0 2,013 14,269 64,262 0 0 0
imm 0 0 6,536 0 0 0 0
lw 1 0 0 31,760 0 0 31,760
lwi 1 0 9,816 0 0 9,816 0
nop 0 0 1,003 0 0 0 0
obrd 1 16 0 0 16 0 0
obwr 3 1,000 0 0 3,000 0 0
or 1 0 16 0 0 16 0
rsubk 1 0 19 0 0 19 0
rtsd 1 0 1,004 0 0 1,004 0
stall 4 1,748 0 0 6,992 0 0
sw 1 0 16 1,032 0 16 1,032
swi 1 16 2,812 0 16 2,812 0
xor 1 0 997 16 0 997 16
xori 1 0 1,000 1 0 1,000 1
Totals 7,864 66,849 212,412 12,076 30,335 99,508
Operations per clock cycle 1.54 0.45 0.47
Given a clock rate of 80 MHz, the operations per second of the software ordered buffer is 0.0360 GOPS, the array sort is 0.0376 GOPS, and the hardware ordered buffer is 0.1232 GOPS.
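These figures follow directly from table 6; for the hardware ordered buffer, for example, 12,076 operations / 7,864 clock cycles ≈ 1.54 operations per clock cycle, and 1.54 × 80 × 10^6 clock cycles per second ≈ 0.1232 GOPS.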
4.2.3 Power and Area Efficiency
Table 7 shows the deltas between MB-Lite without the ordered buffer and MB-Lite with the ordered buffer. (See section 4.1 regarding fidelity of power results.)
Table 7 Ordered Buffer power and area comparison
Total Power (W) LUTs Registers DSP48E1 Block RAMs Total Slices
MB-Lite 0.246 2319 1526 3 2 1075
MB-Lite + Ordered Buffer 0.260 2978 1712 3 6 1221
Increase 5.7% 28.4% 12.2% 0% 200% 13.6%
The total slice count is used as a single metric for FPGA resource usage. Considering both the minimum and the maximum speed-up provided by the hardware ordered buffer, table 8 shows the speed-up adjusted for additional power draw and additional resource usage. The Watts Factor and Slice Factor are each calculated by dividing the MB-Lite + Ordered Buffer value by the MB-Lite-only value. Speed-up factors are taken from the preceding performance improvement figures and encompass all datasets and test cases.
Even considering the minimum speed-up (worst case) and adjusting it for additional power draw and resource usage, the ordered buffer provides an appreciable performance improvement at 4.03x per Watt factor and 3.75x per slice factor.
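As a worked example from table 8: the Watts Factor is 0.260 W / 0.246 W ≈ 1.06, and dividing the minimum 4.25x speed-up by it gives roughly the 4.03x speed-up per Watt factor shown; likewise, the Slice Factor is 1221 / 1075 ≈ 1.14, giving roughly the 3.75x speed-up per slice factor.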
Table 8 Ordered Buffer performance summary
Metric Min Max
Power
Watts Factor 1.06x 1.06x
Speed-up Factor 4.25x 265.70x
Speed-up per Watt Factor 4.03x 251.39x
Area (or resource usage)
Slice Factor 1.14x 1.14x
Speed-up Factor 4.25x 265.70x
Speed-up per Slice Factor 3.75x 233.92x
The GOPS/W metric (table 9) is dependent on the fidelity of the estimated power usage. A value of 0.4738 GOPS/W is not very high performance compared with the related work, but future work on optimizing FPGA resource usage, including instantiating multiple cores instead of just one, should improve this value.
Table 9 Ordered Buffer GOPS/W calculation
Metric Value
GOPS/W
GOPS 0.1232
Watts 0.260
GOPS/W 0.4738
4.3 Block Matching
4.3.1 Performance Comparison with Software-Only Implementation
A simple block matching algorithm was run using a block width of 10 and a horizontal search of 11 pixels. The result is an array of 23 scores. A vertical search would require transposing pixels.
The software implementation of the block matching algorithm is shown in figure 50.
    void blockMatch(unsigned char *refPixels, unsigned char *comparePixels,
                    int blockWidth, int refOffset, int compOffset,
                    int iterations, int *result) {
        int curOffset = (31-compOffset);
        int rrefOffset = (31-refOffset);

        int tmpResult;

        for(int i=0; i<iterations; i++) {
            tmpResult = 0;
            for(int j=0; j<blockWidth; j++) {
                tmpResult += abs(refPixels[rrefOffset-j] -
                                 comparePixels[curOffset-j]);
            }
            result[iterations-1-i] = tmpResult;
            curOffset--;
        }
    }
Figure 50 Block matching software implementation
Since the execution flow of the block matching algorithm, in both the software and hardware implementations, does not depend on the dataset, only one input dataset is needed.
Figure 51 shows the instruction histogram for execution of the hardware-accelerated and software-only block matching implementations. The two most frequently used instructions in the software implementation are subtraction (rsubk) and loading data (lbu). This is consistent with the additional loop overhead and with referencing the same data in memory multiple times, which the software implementation requires.
Figure 51 Block matching instruction histogram

Table 10 shows the speed-up factor of the block matching hardware acceleration.
This is calculated by dividing the software implementation clock cycle count by the hardware implementation clock cycle count and subtracting one. The hardware-accelerated speed-up of 6.0x is modest compared to the ordered buffer, but it is still a significant speed-up.
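Using the counts in table 10: 3,106 / 444 ≈ 7.0, and subtracting one gives the 6.0x speed-up factor.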
Table 10 Block matching speed-up
Clock Cycles
Hardware-accelerated 444
Software-only 3106
Speed-up Factor 6.0x
The number of operations per instruction, the number of instructions, the total operations, and the operations per clock cycle are shown in table 11 for the hardware-accelerated implementation and the software-only implementation.
Table 11 Block matching operations per clock cycle calculations
Instruction | Operations per Instruction | HW-accel. Instr. Count | SW-only Instr. Count | HW-accel. Operations | SW-only Operations
addik 1 57 274 57 274
addk 1 6 266 6 266
bleid 0 1 23 0 0
bltid 0 44 242 0 0
brlid 0 1 1 0 0
bsll 1 1 0 1 0
bslli 1 45 22 45 22
bsrai 1 0 220 0 220
cmp 1 44 242 44 242
flush 0 46 221 0 0
lbu 1 0 440 0 440
lwi 1 21 4 21 4
mstov32i 1 16 0 16 0
rsubk 1 23 904 23 904
rtsd 0 1 1 0 0
sw 1 44 22 44 22
swi 1 5 4 5 4
vat8 16 22 0 352 0
vatr8 1 22 0 22 0
vrsub8am 32 22 0 704 0
vshr8 16 1 0 16 0
vshr8c 16 22 0 352 0
xor 1 0 220 0 220
Totals 444 3,106 1,708 2,618
Operations per clock cycle 3.85 0.84
Given a clock rate of 80 MHz, the operations per second of the software-only implementation is 0.0672 GOPS, and the hardware-accelerated implementation is 0.3080 GOPS.
4.3.2 Power and Area Efficiency
Table 12 shows the deltas between MB-Lite without block matching, and MB-Lite with block matching. (See section 4.1 regarding fidelity of power results.)
Table 12 Block matching power and area comparison
Total Power (W) LUTs Registers DSP48E1 Block RAMs Slices
MB-Lite 0.246 2319 1526 3 2 1075
MB-Lite + Block Matching 0.276 11297 6325 3 2.5 3647
Increase 12.3% 387.1% 314.5% 0.0% 25.0% 239.3%
The total slice count is used as a single metric for FPGA resource usage. Table 13 shows the speed-up adjusted for additional power draw and additional resource usage. The Watts Factor and Slice Factor are each calculated by dividing the MB-Lite + Block Matching value by the MB-Lite-only value. The speed-up is from table 10.
Table 13 Block matching performance summary
Metric Factor
Power
Watts Factor 1.12x
Speed-up Factor 6.00x
Speed-up per Watt Factor 5.34x
Area (or resource usage)
Slice Factor 3.39x
Speed-up Factor 6.00x
Speed-up per Slice Factor 1.77x
The speed-up per Watt factor of 5.34x is still significant; however, the speed-up per slice factor of only 1.77x demonstrates the significant resources required for this implementation. Included in that hardware, though, is a full SIMD implementation for the CPU that can be used by more than just this algorithm.
4.4 2D Convolution
4.4.1 Performance Comparison with Software-Only Implementation
A 512x512 pixel image is used for input and convolved with a 5x5 kernel. The software-only implementation is shown in figure 52.
    void filter(unsigned char *image, unsigned int height, unsigned int width,
                char *kernel, unsigned int kheight, unsigned int kwidth,
                unsigned char div, unsigned char *result) {
        unsigned char *ptr = image+(2*width);
        int kdiv2h = kheight/2;  // integer division (intended to truncate)
        int kdiv2w = kwidth/2;   // integer division (intended to truncate)

        kheight = 5;
        kwidth = 5;

        int i, j, ki, kj;
        int sum;

        int tmpx, tmpy;

        for(i=0; i<height; i++) {
            for(j=0; j<width; j++) {
                sum = 0;
                for(ki=0; ki<kheight; ki++) {
                    for(kj=0; kj<kwidth; kj++) {
                        tmpx = i+ki-kdiv2h;
                        tmpy = j+kj-kdiv2w;
                        if(tmpx >= 0 && tmpx < height &&
                           tmpy >= 0 && tmpy < width) {
                            sum += (*arr2d(kernel, ki, kj, kwidth)) *
                                   (*arr2d(ptr, i+ki-kdiv2h, j+kj-kdiv2w, width));
                        }
                    }
                }
                *arr2d(result, i, j, width) = (sum >> div);
            }
        }
    }
Figure 52 2D convolution software-only implementation
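For context, a hypothetical invocation of figure 52's filter() routine on the 512x512 test image might look as follows. The arr2d row-major indexing macro is an assumption: figure 52 uses it but does not define it, and the buffer setup and kernel values below are illustrative only:

    // Hypothetical driver for figure 52's filter(). The arr2d macro is an
    // assumed row-major indexing helper; it is not defined in the figure.
    #define arr2d(base, row, col, width) (&(base)[(row)*(width) + (col)])

    void filter(unsigned char *image, unsigned int height, unsigned int width,
                char *kernel, unsigned int kheight, unsigned int kwidth,
                unsigned char div, unsigned char *result);

    enum { IMG_H = 512, IMG_W = 512 };
    static unsigned char image[IMG_H * IMG_W];   // input pixels
    static unsigned char output[IMG_H * IMG_W];  // convolved result
    static char box5x5[25] = {                   // uniform 5x5 box kernel
        1,1,1,1,1, 1,1,1,1,1, 1,1,1,1,1, 1,1,1,1,1, 1,1,1,1,1
    };

    int main(void)
    {
        /* ... fill image[] with pixel data ... */
        // div=5 divides the 25-tap sum by 32, keeping the result in 8 bits
        filter(image, IMG_H, IMG_W, box5x5, 5, 5, 5, output);
        return 0;
    }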
Figure 53 and figure 54 show the instruction histograms for execution of the hardware-accelerated and software-only 2D convolution implementations, respectively. The bulk of the instructions in the hardware-accelerated version move data from memory into the vector registers (lwi and mstov32i). For the software-only version, the most used instruction is addition.
Figure 53 2D convolution instruction histogram (hardware accelerated)
Figure 54 2D convolution instruction histogram (software only)
Table 14 shows the speed-up factor of the 2D convolution hardware acceleration. This is calculated by dividing the software implementation clock cycle count by the hardware implementation clock cycle count and subtracting one. The hardware-accelerated speed-up of 99.8x is significant.
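From table 14: 157,262,161 / 1,560,234 ≈ 100.8, and subtracting one gives the 99.8x speed-up factor.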
Table 14 2D convolution speed-up
Clock Cycles
Hardware-accelerated 1,560,234
Software-only 157,262,161
Speed-up Factor 99.8x
The number of operations per instruction, the number of instructions, the total operations, and the operations per clock cycle are shown in table 15 for the hardware-accelerated implementation and the software-only implementation.
Table 15 2D Convolution operations per clock cycle calculations
Instruction | Operations per Instruction | HW-accel. Instr. Count | SW-only Instr. Count | HW-accel. Operations | SW-only Operations
addik 1 115,528 18,874,884 115,528 18,874,884
addk 1 5,636 40,283,252 5,636 40,283,252
and 1 0 1,572,864 0 1,572,864
beqi 0 16,384 0 0 0
beqid 0 64 6,554,113 0 0
bgeid 0 49,153 6,538,240 0 0
bleid 0 512 0 0 0
bltid 0 17,408 8,096,292 0 0
bneid 0 0 7,864,320 0 0
brid 0 126 0 0 0
brlid 0 1 1 0 0
bslli 1 1 6,538,240 1 6,538,240
bsra 1 0 262,144 0 262,144
bsrai 1 0 262,144 0 262,144
bsrli 1 1 1,310,720 1 1,310,720
cmp 1 32,768 0 32,768 0
cmpu 1 32,769 8,111,616 32,769 8,111,616
flush 0 20,793 7,891,933 0 0
halt 0 0 0 0 0
Table 15 2D Convolution operations per clock cycle calculations (cont.)
imm 0 1 1 0 0
lbu 1 0 13,045,832 0 13,045,832
lbui 1 2 1 2 1
lw 1 2,554 0 2,554 0
lwi 1 378,207 524,815 378,207 524,815
mstov32i 1 326,918 0 326,918 0
mstov8i 1 2 0 2 0
mul 1 512 13,307,976 512 13,307,976
mvtos32i 1 65,536 0 65,536 0
nop 0 131,072 0 0 0
rsubk 1 0 7,864,320 0 7,864,320
rtsd 0 1 1 0 0
sb 1 0 262,144 0 262,144
sext8 1 0 6,522,916 0 6,522,916
srl 1 0 2 0 2
swi 1 69,360 526 69,360 526
vadd8 16 13 0 208 0
vflt 31 262,144 0 8,126,464 0
vfltr 1 16,384 0 16,384 0
xor 1 16,384 0 16,384 0
xori 1 0 1,572,864 0 1,572,864
Totals 1,560,234 157,262,161 9,189,234 120,317,260
Operations per clock cycle 5.89 0.77
Given a clock rate of 80 MHz, the operations per second of the software-only implementation is 0.0616 GOPS, and the hardware-accelerated implementation is 0.4712 GOPS.
4.4.2 Power and Area Efficiency
Table 16 shows the deltas between MB-Lite without 2D convolution, and MB-Lite with 2D convolution. (See section 4.1 regarding fidelity of power results.)
Table 16 2D convolution power and area comparison
Total Power (W) LUTs Registers DSP48E1 Block RAMs Total Slices
MB-Lite 0.246 2319 1526 3 2 1075
MB-Lite + 2D Convolution 0.271 11187 7297 3 2 3446
Increase 10.2% 382.4% 378.2% 0.0% 0.0% 220.6%
The total slice count is used as a single metric for FPGA resource usage. Table 17 shows the speed-up adjusted for additional power draw and additional resource usage. The Watts Factor and Slice Factor are each calculated by dividing the MB-Lite + 2D Convolution value by the MB-Lite-only value. The speed-up is from table 14.
Table 17 2D convolution performance summary
Metric Factor
Power
Watts Factor 1.10x
Speed-up Factor 99.79x
Speed-up per Watt Factor 90.59x
Area (or resource usage)
Slice Factor 3.39x
Speed-up Factor 99.79x
Speed-up per Slice Factor 29.42x
GOPS/W
GOPS 0.4712
Watts 0.271
GOPS/W 1.739
The slice factor of 3.39x demonstrates the significant amount of hardware required for this implementation. However, included in that hardware is a full SIMD implementation for the CPU that can be used by more than this algorithm, and the speed-up per slice factor of 29.42x still represents a significant performance gain.
The GOPS/W metric (table 17) is dependent on the fidelity of the estimated power usage. A value of 1.739 GOPS/W is not very high performance compared with the related work, but future work on optimizing FPGA resource usage, including instantiating multiple cores instead of just one, should improve this value.
CHAPTER V
CONCLUSION
The speed-up per slice factor of 3.75x to 233.92x for the ordered buffer shows that it can provide a significant increase in performance with a relatively small amount of additional hardware. The total number of slices utilized increased from 1,075 to only 1,221.
Block matching has a speed-up per slice factor of only 1.77x. This is due to the modest speed-up of only 6.0x coupled with the relatively large amount of hardware required: the number of slices utilized went from 1,075 to 3,647. However, this includes a full SIMD implementation for the CPU that can be used by more than just the block matching algorithm.
The speed-up per slice factor of 29.42x for 2D convolution is a significant increase in performance. The algorithm's speed-up of 99.79x helped offset the significant increase in hardware: the number of slices utilized went from 1,075 to 3,446. Again, just as with block matching, this includes a full SIMD implementation for the CPU that can be used by more than the 2D convolution algorithm.
The throughputs of the hardware ordered buffer, hardware-accelerated block matching, and hardware-accelerated 2D convolution were 0.1232, 0.3080, and 0.4712 GOPS, respectively. These figures, and the corresponding GOPS/W values, are significantly less than those of the related work in section 2.2. However, multiple improvements can be made, many of which are utilized by the related work, including more efficient use of the FPGA, adding multiple cores, and adding multiple accelerators per core.
From the perspective of evaluating the relative performance improvement of adding one of the accelerators presented in this thesis to an existing soft-core CPU, the ordered buffer and 2D convolution are both candidates for future design work. If the application requires a SIMD pipeline, the addition of the hardware-accelerated block matching can provide some performance improvement. But further research into more efficient implementations is recommended.
Each implementation (ordered buffer, block matching, and 2D convolution) achieved the goal of a design that is generic enough (or composed of generic blocks of functionality) to allow the accelerators to be used in multiple algorithms.
APPENDIX A CODE
Ordered buffer software-only implementation (SW ordered buffer):
    typedef struct {
        int data;
        unsigned int tag;
    } ob_data_t;

    static ob_data_t ob_kfifo[BUFFER_MAX_K];
    static int ob_klen = 0;
    static ob_data_t ob_vfifo[BUFFER_MAX_VICTIM];
    static int ob_vlen = 0;
    static unsigned int ob_k = 0;

    void ob_sort();

    void ob_print() {
        xil_printf("K: [");
        for(int i=0; i<ob_klen; i++) {
            xil_printf("%d,", ob_kfifo[i].data);
        }
        xil_printf("] -- V: [");
        for(int i=0; i<ob_vlen; i++) {
            xil_printf("%d,", ob_vfifo[i].data);
        }
        xil_printf("]\n");
    }

    void ob_setK(unsigned int k) {
        ob_k = k;
    }

    void ob_clear() {
        ob_klen = 0;
        ob_vlen = 0;
    }

    void ob_write(int data, unsigned int tag) {
        if(ob_klen < ob_k &&
           (ob_klen == 0 || data >= ob_kfifo[ob_klen-1].data)) {
            ob_kfifo[ob_klen].data = data;
            ob_kfifo[ob_klen].tag = tag;
            ob_klen++;
        } else if(ob_klen == ob_k && data >= ob_kfifo[ob_klen-1].data) {
            // don't do anything
        } else {
            ob_vfifo[ob_vlen].data = data;
            ob_vfifo[ob_vlen].tag = tag;
            ob_vlen++;
        }

        if(ob_vlen == BUFFER_MAX_VICTIM) {
            ob_sort();
        }
    }

    void ob_read(unsigned int *valuesArray, unsigned int k) {
        int i;
        ob_sort();
        for(i=0; i<k; i++) {
            valuesArray[i] = ob_kfifo[i].tag;
        }
    }

    void ob_sort() {
        int i;
        int j;
        ob_data_t tmp;

        for(i=0; i<ob_k; i++) {
            if(i >= ob_klen && ob_vlen > 0) {
                ob_kfifo[ob_klen++] = ob_vfifo[ob_vlen-1];
                ob_vlen--;
            }
            for(j=0; j<ob_vlen; j++) {
                if(ob_vfifo[j].data < ob_kfifo[i].data) {
                    // Swap
                    tmp = ob_kfifo[i];
                    ob_kfifo[i] = ob_vfifo[j];
                    ob_vfifo[j] = tmp;
                }
            }
        }
        ob_vlen = 0;
    }
REFERENCES
[1] S. Stojanovic, D. Bojic and M. Bojovic, "An Overview of Selected Heterogeneous and Reconfigurable Architectures," in Advances in Computers: Dataflow Processing, vol. 96, Academic Press, 2015.
[2] F. Conti, D. Rossi, A. Pullini, I. Loi and L. Benini, "PULP: A Ultra-Low Power Parallel Accelerator for Energy-Efficient and Flexible Embedded Vision," Journal of Signal Processing Systems, vol. 84, pp. 339-354, September 2016.
[3] Analog Devices, Inc., "Blackfin Dual Core Embedded Processor," [Online]. Available: http://www.analog.com/media/en/technical-documentation/data-sheets/ADSP-BF606_607_608_609.pdf. [Accessed 21 July 2017].
[4] W. Qadeer et al., "Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing," in Proc. 40th Annu. Int. Symp. Computer Architecture, Tel-Aviv, Israel, 2013.
[5] T. Kranenburg and R. van Leuken, "MB-LITE: a robust, light-weight soft-core implementation of the MicroBlaze architecture," in Proceedings of Design, Automation & Test in Europe Conference & Exhibition 2010, pp. 997-1000, April 2010.
[6] D. G. Lowe, "Object Recognition from Local Scale-Invariant Features," in The Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150-1157, 1999.
[7] M. Calonder, V. Lepetit, C. Strecha and P. Fua, "BRIEF: Binary Robust Independent Elementary Features," in Computer Vision - ECCV 2010, pp. 778-792, 2010.
[8] E. Rublee, V. Rabaud, K. Konolige and G. Bradski, "ORB: an efficient alternative to SIFT or SURF," in IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 2011.
[9] J. Gaisler, "A structured VHDL design method," [Online]. Available: http://www.gaisler.com/doc/vhdl2proc.pdf. [Accessed 24 March 2017].