Citation
Optimizing software through fully understanding the underlying architecture and strengthening the compiler

Material Information

Title:
Optimizing software through fully understanding the underlying architecture and strengthening the compiler
Creator:
Mezher, Ahmed Eskander ( author )
Place of Publication:
Denver, CO
Publisher:
University of Colorado Denver
Publication Date:
2013
Language:
English
Physical Description:
1 electronic file (91 pages)

Subjects

Subjects / Keywords:
Compilers (Computer programs) ( lcsh )
Parallelizing compilers ( lcsh )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Review:
Software developers are trying to write more efficient and faster programs. Increasing speed and efficiency depends not only on software programming knowledge, but also on an understanding of the underlying computer architecture and compiler. Many programmers do not consider the machine architecture when they write code. Knowing the architecture specifications of the computer before writing code can influence program efficiency and speed. Also, optimizing compilers have a significant impact on the speed of program execution. The GCC compiler is a well-known compiler that has advanced techniques to optimize code. However, the GCC compiler has several weaknesses in certain areas, such as optimizing arithmetic operations, optimizing function calls, optimizing loops and providing full automatic parallelization. Those weaknesses can make programs run slower. This thesis helps developers write more efficient software by presenting the weaknesses and strengths of the machine architecture and building tools to improve the weaknesses of the GCC compiler. In this thesis we show the negative impact of not considering the architecture specifications when programs are written, to motivate software developers to pay attention to the architecture specifications of the machine. Several tools are designed to improve the weaknesses in the GCC compiler. Our tools work together with the GCC compiler to help developers design more efficient code. Also, our tools can work on different compiler architectures; they are not limited to a specific one. We tested our tools on real C applications such as Strassen.cpp, Bubble Sort.cpp and Selection Sort.cpp, and we successfully optimized them. Therefore, by using our tools together with the GCC compiler, programs will be more efficient and faster.
Through understanding the weaknesses and strengths of the underlying computer architecture, and using our tools together with the GCC compiler, developers can design more efficient software.
Thesis:
Thesis (M.S.)--University of Colorado Denver. Computer science
Bibliography:
Includes bibliographic references.
System Details:
System requirements: Adobe Reader.
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Ahmed Eskander Mezher.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
891741426 ( OCLC )
ocn891741426

Downloads

This item has the following downloads:


Full Text
OPTIMIZING SOFTWARE THROUGH FULLY UNDERSTANDING THE UNDERLYING
ARCHITECTURE & STRENGTHENING THE COMPILER
by
AHMED ESKANDER MEZHER
B.Sc., University of Technology, 2009
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado Denver in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science
2013


This thesis for the Master of Science degree
by
Ahmed E. Mezher
has been approved for the
Computer Science Program
By
Gita Alaghband, Chair
Tom Altman
Bogdan Chlebus


September 12, 2013
Mezher, Ahmed E. (Master of Science in Computer Science)
Optimizing Software Through Fully Understanding the Underlying Architecture &
Strengthening the Compiler
Thesis directed by Professor Gita Alaghband
ABSTRACT
Software developers are trying to write more efficient and faster programs. Increasing
speed and efficiency depends not only on software programming knowledge, but also on an
understanding of the underlying computer architecture and compiler. Many programmers do not
consider the machine architecture when they write code. Knowing the architecture specifications
of the computer before writing code can influence program efficiency and speed. Also,
optimizing compilers have a significant impact on the speed of program execution. The GCC
compiler is a well-known compiler that has advanced techniques to optimize code. However, the
GCC compiler has several weaknesses in certain areas, such as optimizing arithmetic operations,
optimizing function calls, optimizing loops and providing full automatic parallelization. Those
weaknesses can make programs run slower. This thesis helps developers write more efficient
software by presenting the weaknesses and strengths of the machine architecture and building
tools to improve the weaknesses of the GCC compiler. In this thesis we show the negative
impact of not considering the architecture specifications when programs are written, to motivate
software developers to pay attention to the architecture specifications of the machine. Several
tools are designed to improve the weaknesses in the GCC compiler. Our tools work together with
the GCC compiler to help developers design more efficient code. Also, our tools can work on
different compiler architectures; they are not limited to a specific one. We tested our
tools on real C applications such as Strassen.cpp, Bubble Sort.cpp and Selection
Sort.cpp, and we successfully optimized them. Therefore, by using our tools together with the
GCC compiler, programs will be more efficient and faster. Through understanding the
weaknesses and strengths of the underlying computer architecture, and using our tools together
with the GCC compiler, developers can design more efficient software.
The form and content of this abstract are approved. I recommend its publication.
Approved: Gita Alaghband


CONTENTS
Chapter
1. Introduction....................................................................................8
2. Finding the limits of the underlying computer architecture through de-optimization..............4
2.1 Optimization information and recommendations for the AMD Opteron processor.................4
2.2 Find the limits of the underlying computer architecture through de-optimization techniques...5
2.3 Methods.....................................................................................6
2.4 Analysis and results.......................................................................10
2.5 Detailed results...........................................................................11
2.5.1 Branch density de-optimization.........................................................11
2.5.2 Unpredictable instruction de-optimization..............................................15
2.5.3 Branch pattern de-optimization.........................................................18
2.5.4 Float comparison de-optimization.......................................................19
2.5.5 Costly instruction de-optimization.....................................................22
2.5.6 Load-store dependency de-optimization..................................................25
2.5.7 High latency instruction de-optimization...............................................28
2.5.8 If condition de-optimization...........................................................31
2.5.9 Loop re-rolling de-optimization........................................................34
2.5.10 Dependency chain de-optimization......................................................37
3. Building tools to optimize weaknesses in the GCC compiler......................................40
3.1 Compiler...................................................................................40
3.2 GCC compiler...............................................................................41
3.3 GCC optimizations..........................................................................42
3.4 Our optimizations to the GCC compiler......................................................47
3.5 Methods, analysis and results..............................................................49
3.5.1 Division vs. multiplication............................................................49
3.5.2 Loop and recursive function............................................................52
3.5.3 Loop re-rolling........................................................................54
3.5.4 Loop unrolling.........................................................................57
3.5.5 Power vs. multiplication...............................................................59


3.5.6 SQRT function vs. division................................................................62
3.5.7 The cost of the function call inside and outside loops...................................65
3.6 Automatic parallelization in compilers........................................................67
3.7 Automatic parallelization in GCC..............................................................70
3.8 Our improvement to automatic parallelization of the GCC compiler.............................71
3.9 Optimizing real C programs by our tool........................................................73
3.9.1 Strassen optimization.....................................................................73
3.9.2 Bubble sort optimization..................................................................75
3.9.3 Selection sort optimization...............................................................77
4. Conclusion........................................................................................80
References............................................................................................83


LIST OF FIGURES
Figure
2. 1 Branch Density (modtencounter).....................................................14
2. 2 Unpredictable instructions (factorial over array) de-optimization..................17
2. 3 Compare_two_floats.................................................................21
2. 4 Costly instruction de-optimization.................................................24
2. 5 Costly instruction de-optimization.................................................27
2. 6 Fib de-optimization................................................................30
2. 7 If Condition de-optimization.......................................................33
2. 8 Loop re-rolling de-optimization....................................................36
2. 9 Dependency chain de-optimization...................................................39
3. 1 Division vs. Multiplication-milliseconds...........................................51
3. 2 Loop and recursive function- Time in milliseconds..................................54
3. 3 Loop re-rolling- Milliseconds......................................................56
3. 4 Loop unrolling- seconds............................................................59
3. 5 Power vs. multiplication- seconds..................................................62
3. 6 SQRT function vs. division- seconds................................................64
3. 7 Function calls in and out the loop- Millisecond....................................67
3. 8 Strassen optimization..............................................................75
3. 9 Bubble sort optimization...........................................................77
3.10 Selection sort optimization........................................................79


LIST OF TABLES
Table
2. 1 Branch density de-optimization for the AMD Opteron and Intel Nehalem.................14
2. 2 Unpredictable instructions de-optimization for the AMD Opteron and Intel Nehalem.....17
2. 3 Compare two floats de-optimization...................................................22
2. 4 Costly instruction de-optimization for the AMD Opteron and Intel Nehalem..............25
2. 5 Costly Instruction de-optimization for the Opteron and Intel Nehalem..................28
2. 6 Fib de-optimization for the Opteron and Intel Nehalem................................30
2. 7 If condition de-optimization for the AMD Opteron and Intel Nehalem...................33
2. 8 Loop re-rolling de-optimization for the AMD Opteron and Intel Nehalem................36
2. 9 Dependency chain de-optimization for the AMD Opteron and Intel Nehalem..............39
3. 1 Time in milliseconds for the Division vs. multiplication optimization technique......51
3. 2 Time in milliseconds for the loop and recursive function optimization technique.......53
3. 3 Time in milliseconds for the loop re-rolling optimization technique...................56
3. 4 Time in seconds for the loop unrolling optimization technique.........................59
3. 5 Time in seconds for the power vs. multiplication optimization technique...............61
3. 6 Time in seconds for the SQRT function vs. division optimization technique.............64
3. 7 Time in milliseconds for the cost of the function call in and out the loop optimization
technique.................................................................................66
3. 8 Running time in seconds for the Strassen optimization.................................74
3. 9 Running time in seconds for the Bubble sort optimization..............................76
3.10 Time in seconds for the Selection sort optimization..................................78


1. Introduction
This thesis is in the area of compilers and software. Writing efficient software depends
not only on the writing skills of programmers but also on knowledge of the underlying
architecture and compiler. Code optimized to run fast on one machine can sometimes run very
slowly on a different machine. For example, suppose we have optimized code in which a loop
has been unrolled several times so that it runs very fast on machine A, and a programmer runs
the same code on a completely different machine B. If the programmer does not know the
architecture specifications of machine B, then the program may run very slowly because the
instruction cache of machine B is smaller than that of machine A, and when the number of
instructions in a program exceeds the size of the instruction cache, the program becomes
slower. Therefore, knowing the architecture specifications helps programmers write optimized
code. Additionally, the effort required to optimize generally depends upon knowledge of the
architecture specifications, such as: How long is its pipeline? How does it detect hazards?
Does it dynamically schedule? How does its caching work? By understanding these aspects of the
machine architecture, programmers can write very efficient software. Does ignoring the
architecture specifications when writing software affect its efficiency? Chapter two answers
this question. In chapter two, we researched the weaknesses and strengths of the underlying
computer architecture and show the consequences of not considering the architecture
specifications when programs are written. The goal of chapter two is to design a series of
de-optimizations for the Opteron and to show how gracefully, or not, its performance degrades.
In some circumstances, serious degradations in performance were found. In others, expected
de-optimizations were difficult if not impossible to implement. By examining the results of
chapter two, one can gain a thorough understanding of what development aspects need attention
and what aspects can be safely ignored. The de-optimizations that we have implemented are
branch density, unpredictable instructions, branch patterns, float comparisons, costly
instructions, load-store dependency, high latency instructions, if conditions, loop re-rolling
and dependency chains. Branch patterns, unpredictable instructions and branch density were
deemed unsuccessful because of the power of the hardware. The other de-optimizations showed a
great impact on the Opteron and on the Intel Nehalem as well.
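The loop-unrolling scenario sketched earlier in this introduction can be illustrated with a minimal C example. This is our own illustrative sketch, not code from the thesis; the function names and the unroll factor of four are assumptions made for the example.

```c
#include <stddef.h>

/* Rolled loop: one addition and one branch test per element. */
long sum_rolled(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* The same loop unrolled by four: fewer branch tests per element,
   but a larger code footprint.  On a machine whose instruction cache
   is too small for the enlarged loop body, this "optimization" can
   turn into a slowdown, which is the point made above. */
long sum_unrolled(const int *a, int n) {
    long s = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)   /* handle the leftover elements */
        s += a[i];
    return s;
}
```

Both functions compute the same sum; only the code size and branch count per element differ, which is why the same source can behave differently on machines with different instruction caches.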
Moreover, some programmers depend only on the compiler to optimize their code.
Depending on the compiler alone will not fully optimize code, because compilers may have
weaknesses of their own. We chose to research the GCC compiler because it is a well-known
compiler used by many developers and programmers. We developed tools to help write more
efficient code and compared the results with the GCC compiler. In chapter three, we present
several weaknesses in the GCC compiler and then build two tools that help programmers using
the GCC compiler to write more efficient code. The first tool can detect the possible
optimizations in a given program and present the programmer with a list of those
optimizations. The programmer can then apply the listed optimizations manually to make the
program more efficient. The optimizations that we have implemented for the GCC compiler
concern the division operation, loops and recursive functions, loop re-rolling, loop
unrolling, the power operation, the square root function and function calls. None of these
optimizations depends on the compiler architecture except for loop unrolling, because loop
unrolling depends on the size of the instruction cache, and the size of the instruction cache
differs from one architecture to another. Another part of our study is automatic
parallelization in the GCC compiler, which does not support dependency detection. The GCC
compiler can parallelize loops that do not have dependencies; consequently, the programmer
needs to check the program manually for dependencies. Checking dependencies visually is not
always easy because it requires following the iterations of the loop. If a loop has a very
long sequence of instructions, it becomes very hard and time-consuming to decide whether the
loop has dependencies or not. Therefore, we designed a tool that can detect most of the loop
dependencies in programs. Our tool gives the programmer a message indicating whether or not
the particular program has dependencies. Our tool helps programmers identify dependencies,
and they can save a lot of time by using it. The main goal of chapter three is to improve the
weaknesses we have detected in the GCC compiler and to build tools with a series of
optimizations that can help programmers write more efficient code. The main goal of the
thesis is to help developers write the most efficient software possible.


2. Finding the limits of the underlying computer architecture through de-optimization
techniques
2.1 Optimization information and recommendations for the AMD Opteron processor
Many optimizations can be applied to 32-bit and 64-bit software to work well with AMD
processors and the AMD architecture. These optimizations include instruction-decoding
optimizations, cache and memory optimizations, branch optimizations, scheduling
optimizations, integer optimizations, optimizations with the SIMD instructions, and x87
floating-point optimizations. Instruction-decoding optimizations help the processor maximize
the number of decoded instructions. There are two types of instructions: direct-path
instructions and vector-path instructions. In general, direct-path instructions minimize the
number of operations that cost the AMD processor; three direct-path instructions, or one and
a half double direct-path instructions, can be decoded per cycle. Therefore, using
direct-path instructions rather than vector-path instructions can optimize code. Other
optimizations of the instruction-decoding type have been applied to load-execute
instructions, load-execute integer instructions, and many others. Memory optimization is
another optimization that can be applied to the AMD processor. Memory has several
optimizations; one of them is using non-temporal store instructions such as MOVNTPS and
MOVNTQ. These instructions let the processor write without first reading the data from
memory, so they save time by not reading memory or cache, and therefore make programs
faster. This optimization can be applied to both 32-bit and 64-bit software. Another
optimization that can be applied to the AMD processor is branch optimization, which improves
branch prediction and minimizes branch penalties. One possible branch optimization is
avoiding conditional branches that depend on random data, as these branches are difficult to
predict. Moreover, scheduling optimization is another optimization that can be applied to the
AMD processor. One possible scheduling optimization is pushing data directly onto the stack
instead of loading it into a register first; this technique reduces register pressure and
eliminates data dependencies. This optimization also applies to both 32-bit and 64-bit
software [12]. We have shown some of the recommendations for the AMD processor. These
hardware recommendations can make programs faster. We then experimented with the AMD
hardware in order to find the limits of the hardware, and what other weaknesses in the
hardware could be improved by the AMD designers.
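The recommendation above about avoiding conditional branches that depend on random data can be sketched in C. This is our own illustration, not code from the thesis; the function names are assumptions made for the example.

```c
#include <stddef.h>

/* Branchy version: the conditional depends on the data, so on random
   input the branch predictor mispredicts roughly half the time. */
size_t count_ge_branchy(const int *a, size_t n, int t) {
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] >= t)
            c++;
    return c;
}

/* Branchless version: the comparison result (0 or 1) is added
   directly, so there is no data-dependent branch to mispredict. */
size_t count_ge_branchless(const int *a, size_t n, int t) {
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        c += (size_t)(a[i] >= t);
    return c;
}
```

Both functions return the same count; the branchless form trades the unpredictable branch for a cheap arithmetic operation, which is exactly the kind of transformation the AMD recommendations suggest for random data.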
2.2 Find the limits of the underlying computer architecture through de-optimization techniques
The computer industry is developing very fast. CPU speeds are increasing, and most
computers come with more than one CPU. Software developers often focus on writing good
software without considering the architecture specifications, relying on compilers to do the
dirty work of dealing with the details of the architecture. Fully understanding the
hardware's performance, and what can hurt it, gives us a clear picture of how to develop
hardware and software. In this part of the thesis, the limits of the underlying computer
architecture were researched by intentionally de-optimizing software. Since developers
typically do not consider the architecture specifications, it is useful to know how optimized
hardware will behave under the worst circumstances. With this viewpoint, one can imagine that
a company might choose one piece of hardware over another not because it performs the best in
very special circumstances, but rather because its performance degrades the slowest in
adverse circumstances. The goal of this part of the thesis was to design a series of
de-optimizations for the Opteron and to show how gracefully, or not, its performance
degrades. After the implementation, serious degradations in performance were found. Also, in
some circumstances, it was hard to implement de-optimization techniques because of the power
of the hardware. The main benefit of this work is that one can look at the results and decide
what development aspects need attention and what aspects can be safely ignored.
2.3 Methods
In our implementation, we chose the AMD Opteron and the Intel Nehalem. The Opteron
was chosen as the CPU to benchmark. The Intel Nehalem is used to check whether the
de-optimizations are universal; that is, if a de-optimization affects both the Opteron and
the Intel Nehalem, then it may be fairly universal. In our implementation, we chose ten
de-optimizations to implement. The de-optimizations are:
Branch Density: The code for this de-optimization contains densely packed branch
instructions in order to overload branch target buffers and/or branch prediction
algorithms.

Unpredictable Instruction: The code for this de-optimization has intentionally
misaligned return instructions in order to prevent branch prediction.

Float Comparison: The code for this de-optimization uses normal float comparison (a
conditional) when comparing integers (after casting) would suffice.

Costly Instruction: The code for this de-optimization uses division when
multiplication would suffice.

Load-Store Dependency: The code for this de-optimization sets up a load-store
dependency, the intention of which is to cause the CPU to wait until stores complete
before loads can occur.

Dependency Chain: The code for this de-optimization creates a very long dependency
chain that quickly exhausts the resources of the dynamic scheduler.

High Latency Instruction: The code for this de-optimization uses a high-latency
instruction (LOOP) when lower-latency instructions (DEC/JZ) would suffice.

If Condition Organization: The code for this de-optimization intentionally orders the
sub-conditions of an IF statement so that sub-conditions which take longer to
evaluate come first.

Loop Re-Rolling: The code for this de-optimization breaks a loop into two loops that
could otherwise be combined, so that the dynamic scheduler has less scheduling leeway.

Branch Patterns: The code for this de-optimization tries to find patterns that are
difficult for branch prediction hardware to predict.


All of these de-optimizations are described in more detail below.
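To make one of these concrete, the Costly Instruction idea can be sketched in C: dividing every element by the same run-time divisor versus computing the reciprocal once and multiplying. The function names are ours, for illustration only.

```c
/* De-optimized form: one floating-point division per element.  On
   most x86 CPUs a divide costs several times as many cycles as a
   multiply. */
void scale_div(const double *a, double *out, int n, double d) {
    for (int i = 0; i < n; i++)
        out[i] = a[i] / d;
}

/* Optimized form: divide once up front, then multiply per element. */
void scale_mul(const double *a, double *out, int n, double d) {
    double r = 1.0 / d;             /* the only division */
    for (int i = 0; i < n; i++)
        out[i] = a[i] * r;
}
```

For a general divisor the two versions may differ in the last bit of floating-point precision; when the divisor is a power of two, the results are identical.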
Most of these optimizations and de-optimizations were written in C using the GCC compiler,
after making sure that GCC would not affect or undo the de-optimization techniques. In some
cases GCC can interfere with a de-optimization, so C could not be used. When C could not be
used, the NASM assembly language was used to build C-importable assembly modules. These
modules were then used by a C wrapper in order to implement the de-optimization. Typically,
the C wrapper builds an array that is passed to an assembly module, which in turn processes
the array. This raises an important question: how do we evaluate the de-optimization
techniques? Several programs and tools can give information about code, such as CodeAnalyst
and VTune. At a glance, CodeAnalyst appeared ideal since it proffers a great deal of
information about program execution and it was written for the chosen hardware platform.
However, it turned out to be less than ideal: it is cumbersome to work with, and it is hard
to evaluate the data for sections of code together. In short, it is great for profiling code,
but poor for generating lots of test results, so CodeAnalyst was not an easy answer to the
problem of evaluating de-optimizations. Rather than using prepackaged software to evaluate
de-optimizations, we decided to wrap the important sections with code that counts the number
of cycles. In the post-Pentium world, there is a ready resource for counting clock cycles:
the CPU Timestamp Counter (CTC). The CTC counts the number of clock cycles executed since
the CPU was booted. It can be a little tricky to work with, especially in multi-core,
multi-CPU environments. However, if the core/CPU used for execution can be tightly
controlled, then it can be trusted to provide a reasonable measure of the number of cycles
needed to run a section of code.
#if defined(__i386__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ __volatile__ (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}
#elif defined(__x86_64__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
#endif
The above code was inserted into all of the C optimizations and de-optimizations. With this
code, a call to rdtsc() yields the current clock-cycle count (from the CTC). When running the
code, whether an optimization or a de-optimization, the number of clock cycles differs
slightly from one execution to another, so we decided to run the code several times and take
the average over all executions. For simplicity, we created a program called the Version
Tester (VT). This program is simply an executable that takes a configuration file as an
argument. Within the configuration file, the number of test iterations is specified along
with a series of programs to test. For each program, there is a description and a run command
line. For each specified program, the Version Tester runs the executable for the specified
number of iterations. After each iteration, it captures the number of cycles (the number of
cycles is written to stdout by the executable being tested). After completing the specified
number of test iterations, the VT computes the average number of cycles per iteration.
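A portable sketch of this averaging harness might look as follows. The thesis counts CPU cycles with rdtsc and reads the per-run counts from the tested program's stdout; this sketch instead times an in-process function with the standard clock() so that it runs on any platform, and the function names are our own.

```c
#include <time.h>

typedef void (*bench_fn)(void);

/* Run a piece of code a configured number of iterations and report
   the average cost per iteration, mirroring the Version Tester idea. */
double average_cost(bench_fn f, int iterations) {
    double total = 0.0;
    for (int i = 0; i < iterations; i++) {
        clock_t start = clock();
        f();                            /* the section being measured */
        total += (double)(clock() - start);
    }
    return total / iterations;          /* average clock ticks per run */
}

/* A trivial workload, only so the harness has something to time. */
void spin(void) {
    volatile long s = 0;
    for (long i = 0; i < 100000; i++)
        s += i;
}
```

Averaging over many iterations smooths out the run-to-run variation noted above, exactly as the VT does with its cycle counts.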


2.4 Analysis and results
Many of the de-optimizations designed during the implementation worked as expected.
For example, the Branch Density de-optimization consistently showed a 12% slowdown on the
Opteron for all array sizes. This comported well with the expected cost of so consistently
missing the branch target buffer (BTB) on the Opteron. Other de-optimizations had a greater
effect than anticipated. The Loop Re-Rolling and Dependency Chain de-optimizations showed
very significant slowdowns for all array sizes. The expectation was that winnowing down the
options available to the Opteron's dynamic scheduler would have an impact, but not such a
stark impact. The upshot of these surprises is: software engineers must consider the dynamic
scheduler in order to write software that can run efficiently on the Opteron. Finally, some
de-optimizations had little or an inconclusive effect. For example, the Branching Pattern
de-optimization turned out to be very difficult to realize. A slowdown was achieved using
random data, but no significant slowdown was achieved using patterned data. Since using
random data to cause mispredictions yields no useful information about the Opteron, this
de-optimization was deemed unsuccessful. Overall, modern CPUs are a wild cacophony of
interacting threads and processes, which makes profiling optimizations (and de-optimizations)
very difficult. It turns out that Hydra was a great place to test de-optimizations; its
multiple nodes with multiple CPUs made it easy to run processes with relative confidence that
the program being tested would not end up competing for resources. As a result, the data for
the Opteron is very smooth and consistent, while the noisy data for the Intel Nehalem
reflects its more meager resources.


2.5 Detailed results
2.5.1 Branch density de-optimization
For this de-optimization, we designed an algorithm called mod_ten_counter. This
algorithm takes an array size as input. It then generates an array of the specified size in
which each element is populated with an integer x where 0 <= x <= 9. (Note that the data for
this array is not random; if it were, then branch mispredictions could artificially inflate
the effect of the Branch Density de-optimization.) After this array is populated, the number
of times that each integer x appears in the array is counted.
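The algorithm described above can be sketched in C as follows. The thesis implements the counting step in NASM; this version only illustrates the algorithm, and the population pattern i % 10 is our assumption of one simple non-random scheme.

```c
/* Fill the array with the predictable pattern i % 10 (deliberately
   non-random, so branch mispredictions do not pollute the
   measurement), then count how often each digit 0..9 occurs. */
void mod_ten_counter(int *a, int size, int counts[10]) {
    for (int d = 0; d < 10; d++)
        counts[d] = 0;
    for (int i = 0; i < size; i++)
        a[i] = i % 10;              /* populate with 0..9, repeating */
    for (int i = 0; i < size; i++)
        counts[a[i]]++;             /* tally each digit */
}
```

The counting loop is the section that the optimized and de-optimized NASM versions below implement with differently spaced branch instructions.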
The executables associated with this algorithm are called mod_ten_counter_op.exe and
mod_ten_counter_deop_1.exe (on Windows). The code is written in C. The section of code
being optimized/de-optimized is written in NASM. The cycles counted include only the
time that mod_ten_counter spends running its assembly code.
The optimized version counts the integer instances within the array using a structure that
is much like a case-switch statement in C. However, the branch statements within this structure
are spaced out using NOP instructions. These NOP instructions are used to ensure that the
optimized version maintains proper alignment and spacing for branch instructions so that branch
target buffer (BTB) misses are not incurred. The important optimized section is below:
cmp ecx, 0
je mark_0    ; We have a 0
nop
dec ecx
je mark_1    ; We have a 1
nop
dec ecx
je mark_2    ; We have a 2
nop
dec ecx
je mark_3    ; We have a 3
nop
dec ecx
je mark_4    ; We have a 4
nop
dec ecx
je mark_5    ; We have a 5
nop
dec ecx
je mark_6    ; We have a 6
nop
dec ecx
je mark_7    ; We have a 7
nop
dec ecx
je mark_8    ; We have a 8
nop
dec ecx
je mark_9    ; We have a 9
Notice that there is a NOP between each DEC/JE pair. This was done to create space
between branches and to better align the branching instructions. The de-optimized version
counts the integer instances with much the same structure as the optimized version; however,
it does not maintain proper alignment and spacing, so it should incur many BTB misses. The
important de-optimized section is below:
cmp ecx, 0
je mark_0       ; We have a 0
dec ecx
je mark_1       ; We have a 1
dec ecx
je mark_2       ; We have a 2
dec ecx
je mark_3       ; We have a 3
dec ecx
je mark_4       ; We have a 4
dec ecx
je mark_5       ; We have a 5
dec ecx
je mark_6       ; We have a 6
dec ecx
je mark_7       ; We have a 7
dec ecx
je mark_8       ; We have a 8
dec ecx
je mark_9       ; We have a 9


Notice that, unlike the optimized version, there are no additional NOP instructions between
DEC/JE instructions. This means that the de-optimized version has a very tightly packed bunch
of branches, unlike the optimized version. Also, note that, on average, the optimized version
executes 5 more instructions per iteration, yet it significantly outperforms the de-optimized
version.
Data
As can be seen by the slowdown percentages below, packing branches as densely as the
de-optimization can have a significant impact on run-time. On the Opteron, it caused ~10%
slowdown for all array sizes; this slowdown is due to the branch target buffer misses that come
with packing branches too densely. On the Nehalem, this de-optimization also had a big impact,
though the reasons are less well understood. One point of note for the Nehalem is the length of
its pipeline, which is 17 versus 12 on the Opteron. Thus, additional slowdown could be due to a
higher cost of BTB misses on the Nehalem.


Table 2. 1: Branch Density de-optimization for the AMD Opteron and Intel Nehalem
Array Size AMD Opteron: Difference Slowdown(%) Intel Nehalem: Difference Slowdown(%)
10 -47 -7.31 39 9.92
100 185 7.80 331 27.36
1000 2234 11.86 6355 89.42
10000 23230 12.69 67653 110.38
100000 203760 10.98 537362 84.27
1000000 1620652 8.40 5306766 89.63
10000000 17263048 8.34 52082971 78.85


2.5.2 Unpredictable instruction de-optimization
For this de-optimization, we designed an algorithm called factorial_over_array. This
algorithm takes an array size as input. It then generates an array of the specified size in which
each element is populated with an integer x where 0 <= x <= 12. After this array is populated, the
factorial of each element of the array is calculated and written back into the array, overwriting
the original array element.
The executables associated with this algorithm are called factorial_over_array_op.exe
and factorial_over_array_deop_1.exe (on Windows). The code is written in C. The section of
code being optimized/de-optimized is written in NASM. The cycles being counted include only
the time that factorial_over_array spends running its assembly code.
The optimized and de-optimized versions are almost identical. Of course, both perform
factorial, recursively, on an integer argument. However, the optimized version properly spaces
the JNE and RET instructions so that the RET is not mispredicted. Below is the important
optimized section:
nop
mov eax, [esp+4]    ; Get the integer whose factorial is
                    ; being calculated
cmp eax, 1          ; Have we hit one yet?
jne calculate       ; If we haven't then do another call
nop                 ; This is here for alignment purposes
ret                 ; return with a 1
Notice that there is a NOP instruction between the JNE instruction and the RET instruction. This
was done in order to create space between the two and ensure that the RET instruction is
properly aligned.
The de-optimized version, on the other hand, does not properly space them, and so it results in a
large number of mispredictions on the RET instruction. Below is the important de-optimized
section:
mov eax, [esp+4]    ; Get the integer whose factorial is
                    ; being calculated
cmp eax, 1          ; Have we hit one yet?
jne calculate       ; If we haven't then do another call
ret                 ; return with a 1
Notice that, unlike the optimized version, there is no NOP instruction between the JNE and the
RET instruction. This means that the two branching instructions are adjacent and the RET
instruction is misaligned.
Data
As can be seen by the slowdown percentages below, the cost of mis-aligning heavily used
RET calls is high. On the Opteron, it caused ~12% slowdown for all array sizes; this slowdown
is due to the branch mispredictions engendered by mis-aligning RET. On the Nehalem, this
de-optimization had an inconclusive impact. Therefore, it is reasonable to assume that the way that
the Nehalem indexes branching instructions must be quite different from the Opteron.


Figure 2. 2 Unpredictable instructions (factorial_over_array) de-optimization
Table 2. 2 Unpredictable instructions de-optimization for the AMD Opteron and Intel
Nehalem
Array Size AMD Opteron: Difference Slowdown(%) Intel Nehalem: Difference Slowdown(%)
10 80 7.07 21 2.58
100 1510 0.88 2436 32.15
1000 10429 11.61 -990 -1.24
10000 115975 12.76 36158 5.33
100000 1139748 12.59 238852 3.54
1000000 11529624 12.66 -3505949 -5.42
10000000 175191774 18.90 3081557 0.51


2.5.3 Branch pattern de-optimization
For this de-optimization, we designed an algorithm called even_number_sieve. This
algorithm takes an array size and a pattern as inputs. It then generates an array of the specified
size in which each element is populated with either 1 or 2. The order in which 1s and 2s appear
depends upon the pattern provided; this pattern essentially represents the branching pattern being
tested. For example, the pattern "12" would cause the array to be populated with alternating 1s
and 2s; the pattern "r" would cause the array to be populated with a random sequence of 1s and
2s. After the array has been constructed, the algorithm iterates over the array and replaces all of
its odd entries, the 1s, with 0s.
Only one executable was needed for this de-optimization because, in this case, an
optimized version is simply a branching pattern that can be easily managed by the CPU's branch
prediction mechanism, while a de-optimization is a branching pattern that cannot be easily
managed by the branch prediction mechanism.
The executable associated with this algorithm is called even_number_sieve.exe (on
Windows). The code is written in C. The section of code being optimized/de-optimized is also
written in C; there is no assembly for this de-optimization.
The cycles being counted include only the time that is taken to mark the odd elements of the
array.
The code that evaluates the array elements simply tests whether each element x (mod
2) is 1 or not. If it is, then it is replaced with a 0. Below is the important code:
unsigned long long pstart = rdtsc();
for (i = 0; i < size_of_array; i++) {
    if ( number_array[i] % 2 == 1 ) {
        number_array[i] = 0;
    }
}
printf( "Cycles=%d\n", ( rdtsc() - pstart ) );
Notice that each element, mod 2, is compared to 1. If it is equal, then the element is
replaced with 0.
We were never able to generate interesting results from this de-optimization. It is
reasonable to conclude that the branch prediction mechanism on the Opteron is very good. Of
course, many cycles were lost when random data was used. However, no branch prediction
mechanism can find and use a pattern in random data; it would not then be random.
2.5.4 Float comparison de-optimization
For this de-optimization, we designed an algorithm called compare_two_floats. This
algorithm takes a number of iterations as input. It then generates two simple floats and tests
them, over and over again, for the requested number of iterations.
The executables associated with this algorithm are called compare_two_floats_op.exe
and compare_two_floats_deop.exe (on Windows). The code is written in C. The section of
code being optimized/de-optimized is also written in C; there is no assembly for this
de-optimization. The cycles being counted include only the time that compare_two_floats spends
comparing the floating point values.
The optimized version performs the comparison of floating point values by casting the
floats into integers and then performing integer comparison. The important optimized section is
below:
#define FLOAT2INTCAST(f) (*((int *)(&f)))
float t = f1 - f2;
pstart = rdtsc();
for (j = 0; j < number_of_iteration; j++) {
    if ( FLOAT2INTCAST(t) <= 0 ) {
        Countnumbers(i);
        count++;
    } else {
        count++;
    }
}
result = rdtsc() - pstart;
printf( "Cycles=%d\n", result );
Notice that the two floats being compared, f1 and f2, are subtracted (outside of the section being
timed) and then casted and compared to zero. Thus, no float comparison occurs.
The de-optimized version performs the comparison of floating point values in
the normal fashion, i.e. by straightforwardly comparing the floats. The important de-optimized
section is below:
float t = f1 - f2;
pstart = rdtsc();
for (j = 0; j < number_of_iteration; j++) {
    if ( f1 <= f2 ) {
        Countnumbers(i);
        count++;
    } else {
        count++;
    }
}
result = rdtsc() - pstart;
printf( "Cycles=%d\n", result );
Notice that f1 and f2 are being compared in the straightforward fashion. Thus, they are
being compared as floats.
Data
As can be seen by the slowdown percentages below, the cost of casting and then
comparing integers is higher until the array size crosses a certain threshold; beyond that threshold, the
cost of comparing floats begins to dominate. On the Opteron, it caused ~6% slowdown for array
sizes greater than 10000; this slowdown is due to the fact that the floating point data path on the
Opteron is more costly than the integer data path. On the Nehalem, this de-optimization had a
similar impact. Therefore, it is reasonable to assume that the Nehalem has a similar discrepancy
between its integer and floating point data paths.
Figure 2. 3 Compare two floats de-optimization


Table 2. 3: Compare two floats de-optimization
Array Size AMD Opteron: Difference Slowdown(%) Intel Nehalem: Difference Slowdown(%)
10 -10991 -97.19 43 26.70
100 -10960 -86.18 471 49.68
1000 -9892 -37.97 5285 59.90
10000 -982 -0.60 52211 58.82
100000 88962 5.86 413932 41.79
1000000 987559 6.56 4298502 44.46
10000000 9949917 6.62 41906333 46.14
2.5.5 Costly instruction de-optimization
For this de-optimization, we designed an algorithm called div_vs_mult. This algorithm
takes an array size as input. It then generates an array of the specified size in which each
element is an integer x that is a random power of two. After this array is populated, each
element in the array is divided by two.
The executables associated with this algorithm are called div_vs_mult_op.exe and
div_vs_mult_deop_1.exe (on Windows). The code is written in C. The section of code being
optimized/de-optimized is also written in C; there is no assembly for this de-optimization. The
cycles being counted include only the time that div_vs_mult spends dividing each element of
the array by 2.
Both the optimized and de-optimized versions do basically the same thing. The only
difference is how each version divides each element of the array by 2. In the optimized version,
this is done by multiplying each element by 0.5. This means that the optimized version is able to
use the multiple multiply data paths available on the Opteron. The important optimized section
is below:
unsigned long long start = rdtsc();
for (i = 0; i < size_of_array; i++) {
    test_array[i] = test_array[i] * 0.5;
}
printf( "Cycles=%d\n", ( rdtsc() - start ) );
Notice that the division of each element occurs by multiplying it by 0.5. Thus, no actual
division occurs.
The de-optimized version does the same thing as the optimized version. It just uses
division, instead of multiplication, to do it. This can be costly on an Opteron since it has more
limited resources for division, i.e. a single data path. The important de-optimized section is
below:
unsigned long long start = rdtsc();
for (i = 0; i < size_of_array; i++) {
    test_array[i] = test_array[i] / 2.0;
}
printf( "Cycles=%d\n", ( rdtsc() - start ) );
Notice that the division of each element occurs by dividing it by 2.0. (Note the use of 2.0 rather
than 2 ensures that the de-optimization is processed as a floating point operation, just like the
optimization). Thus, division occurs with each iteration.
Data
As can be seen by the slowdown percentages below, the cost of using division when
multiplication would suffice is very high. On the Opteron, it caused ~25% slowdown for all
array sizes; this slowdown is due to the fact that the Opteron can only handle division on one of
its scalar pipelines. On the Nehalem, this de-optimization had little impact. Therefore, it is
reasonable to assume that the Nehalem may have more resources for division available.
Figure 2. 4 Costly instruction de-optimization


Table 2. 4: Costly instruction de-optimization for the AMD Opteron and Intel Nehalem
Array Size AMD Opteron: Difference Slowdown(%) Intel Nehalem: Difference Slowdown(%)
10 99 18.1 -42 -14.34
100 949 24.73 57 2.04
1000 9754 26.49 1630 5.96
10000 94658 25.67 16364 5.07
100000 938712 25.27 117817 4.54
1000000 9619257 23.94 675940 2.64
10000000 96477334 23.98 8585454 3.41
2.5.6 Load-store dependency de-optimization
For this de-optimization, we designed an algorithm called dependency. This algorithm
takes an array size as input. It then generates an array of the specified size in which each element
is populated with an integer x where 0 <= x <= 9. After this array is populated, each element in
the array has the previous element added to it; this effect ripples from the front of the array to the
back of the array. The last element in the array is the sum of all elements in the array. Thus, this
is basically a prefix sum.
The executables associated with this algorithm are called dependency_op.exe and
dependency_deop.exe (on Windows). The code is written in C. The section of code being
optimized/de-optimized is also written in C; there is no assembly for this de-optimization. The
cycles being counted include only the time that dependency spends summing array values.
The optimized version performs the additions by keeping the previous three array
elements in temporary variables; it creates few load-store dependencies since the previous array
element does not need to be reloaded, i.e. the sum for the previous element doesn't need to be
written before it can be reloaded. The important optimized section is below:
int temp_prev = test_array[0], temp1, temp2;
unsigned long long start = rdtsc();
for (i = 3; i < size_of_array; i += 3 ) {
    temp2 = test_array[i - 2] + temp_prev;
    temp1 = test_array[i - 1] + temp2;
    test_array[i - 2] = temp2;
    test_array[i - 1] = temp1;
    test_array[i] = temp_prev = test_array[i] + temp1;
}
printf( "Cycles=%d\n", ( rdtsc() - start ) );
Notice that the value being added to the current element of the array comes from the temporary
variable. Thus, its value doesn't need to be re-loaded, i.e. no load-store dependency is (likely to
be) created.
The de-optimized version, on the other hand, performs this computation in a more
natural way. However, since it must wait for the previous element to have its value written
before the next one can be calculated, many load-store dependencies are created. The
important de-optimized section is below:
unsigned long long start = rdtsc();
for (i = 1; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] + test_array[i - 1];
}
printf( "Cycles=%d\n", ( rdtsc() - start ) );
Notice that, unlike the optimized version, there are no temporary variables that can be used to
prevent load-store dependencies.


Data
As can be seen by the slowdown percentages below, the cost associated with these kinds
of dependencies is very high. On the Opteron, it caused ~60% slowdown for all array sizes; this
is due to the fact that the Opteron does not schedule stores in the way that it schedules other
instruction types leading to very costly dependency stalls. On the Nehalem, this de-optimization
had a lesser impact. Therefore, it is reasonable to assume that its dynamic scheduler is better at
managing stores.
Figure 2. 5 Load-store dependency de-optimization


Table 2. 5: Load-store dependency de-optimization for the AMD Opteron and Intel Nehalem
Array Size AMD Opteron: Difference Slowdown(%) Intel Nehalem: Difference Slowdown(%)
10 87 37.5 12 8.16
100 2392 77.98 230 21.9
1000 9858 78.41 1610 15.38
10000 79180 63.89 17517 17.84
100000 786708 63.1 176893 18.7
1000000 8181981 64.12 1205000 12.12
10000000 81498679 58.83 17276216 18.06
2.5.7 High latency instruction de-optimization
For this de-optimization, we designed an algorithm called fibonacci. This algorithm
takes an array size as input. It then generates an empty array of the specified size. After this
array is created, it calculates the fibonacci number associated with each index of the array.
The executables associated with this algorithm are called fib_op.exe and fib_deop.exe
(on Windows). The code is written in C. The section of code being optimized/de-optimized is
written in NASM. The cycles being counted include only the time that fib spends running its
assembly code.
The optimized and de-optimized versions are almost identical. Of course, both calculate
the fibonacci number associated with each index. However, the optimized version uses the
combination of DEC and JNZ instructions in order to control its branching. Below is the
important optimized section:


calculate:
mov edx, eax
add ebx, edx
mov eax, ebx
mov dword [edi], ebx
add edi, 4
dec ecx
jnz calculate
Notice that each iteration ends with DEC and JNZ instructions. Thus, the loop will end
when the ECX register is zero.
The de-optimized version on the other hand uses a LOOP instruction instead of the
DEC/JNZ combination. The important thing to note about the LOOP instruction is that it has a
high latency (approximately 8 cycles) compared to DEC/JNZ. Below is the important
de-optimized section:
calculate:
mov edx, eax
add ebx, edx
mov eax, ebx
mov dword [edi], ebx
add edi, 4
loop calculate
Notice that, unlike the optimized version, the loop is controlled by the LOOP instruction. Just
like the optimized version, the loop will end when the ECX register is zero.
Data
As can be seen by the slowdown percentages below, the cost of the LOOP instruction is
high. On the Opteron, it caused ~17% slowdown for all array sizes; this is due solely to the very
high latency of the LOOP instruction. On the Nehalem, this de-optimization had a greater
impact. It is hard to speculate on what might cause the Nehalem to have such a poor


implementation of LOOP.
Figure 2. 6 Fib de-optimization
Table 2. 6: Fib de-optimization for the Opteron and Intel Nehalem
Array Size AMD Opteron: Difference Slowdown(%) Intel Nehalem: Difference Slowdown(%)
10 17 9.04 9 8.57
100 155 29.4 244 81.87
1000 1386 34.74 2239 104.52
10000 14588 22.03 19519 62.16
100000 123271 17.91 256839 87.42
1000000 1206678 16.94 2430301 81.51
10000000 11716747 14.07 19896396 65.26


2.5.8 If condition de-optimization
For this de-optimization, we designed an algorithm called if. This algorithm takes an
array size as input. It then generates an array of the specified size in which each element is
populated with floats x where 0.5 <= x <= 11.0. After this array is populated, each element in
the array is evaluated with an if-then statement in order to determine whether to increment or
decrement a dummy variable.
The executables associated with this algorithm are called if_op.exe and if_deop.exe
(on Windows). The code is written in C. The section of code being optimized/de-optimized is
also written in C; there is no assembly for this de-optimization. The cycles being counted
include only the time that if spends evaluating the elements of the array with the if-then
statement.
Both the optimized and de-optimized versions do basically the same thing. The only
difference is the order in which they each evaluate the clauses of the if-then statement. One of these
clauses is very easy and fast to evaluate; it is a comparison to zero. The other clause is
hard and time consuming to evaluate; it is a floating point comparison. The important optimized
section is below:
int dummy = 0, mod = 0;
unsigned long long start = rdtsc();
for (i = 0; i < size_of_array; i++) {
    mod = (i % 2);
    if ( mod == 0 && test_array[i] > 1.5 ) {
        dummy++;
    } else {
        dummy--;
    }
}
printf( "Cycles=%d\n", ( rdtsc() - start ) );
Notice the ordering of the clauses within the if-then statement. This is an optimal
ordering since the floating point comparison clause need only be evaluated half of the time in
order to know the value of the entire statement.
Again, the de-optimized version does the same thing as the optimized version. It just
evaluates the statement in a different order. The important de-optimized section is below:
int dummy = 0, mod = 0;
unsigned long long start = rdtsc();
for (i = 0; i < size_of_array; i++) {
    mod = (i % 2);
    if ( test_array[i] > 1.5 && mod == 0 ) {
        dummy++;
    } else {
        dummy--;
    }
}
printf( "Cycles=%d\n", ( rdtsc() - start ) );
Again, notice the ordering of the clauses within the if-then statement. This is a sub-optimal
ordering since the floating point comparison clause needs to be evaluated each and every time in
order to know the value of the entire statement.
Data
As can be seen by the slowdown percentages below, the cost of improperly ordering
clauses within a conditional is very high. On the Opteron, it caused ~37% slowdown for all array
sizes; this is due to being forced to evaluate the most expensive of the conditional clauses for
each iteration; therefore, a simple switch in ordering can make this program 37% faster. On the
Nehalem, this de-optimization had a similar impact.


Figure 2. 7 If Condition de-optimization
Table 2. 7: If Condition de-optimization for the AMD Opteron and Intel Nehalem
Array Size AMD Opteron: Difference Slowdown(%) Intel Nehalem: Difference Slowdown(%)
10 -207 -26.90 57 23.85
100 338 15.48 990 56.99
1000 6438 38.27 4886 33.86
10000 63975 38.09 47374 37.87
100000 651438 38.94 520198 50.64
1000000 6385986 37.37 4577710 41.86
10000000 66023847 35.89 49997484 47.52


2.5.9 Loop re-rolling de-optimization
For this de-optimization, we designed an algorithm called loop_re_rolling. This
algorithm takes an array size as input. It then generates an array of the specified size in which
each element is populated with integers x where 0 <= x <= 9. After this array is populated, two
more arrays are built; one will hold the squares (x^2) of each element in the integer array and the
other will hold the cubes (x^3) of each element in the array. After these arrays are built, the squares
and cubes arrays are populated by iterating over the elements of the integer array.
The executables associated with this algorithm are called loop_re_rolling_op.exe and
loop_re_rolling_deop.exe (on Windows). The code is written in C. The section of code being
optimized/de-optimized is also written in C; there is no assembly for this de-optimization. The
cycles being counted include only the time that the executables spend filling out the squares
and cubes array from the original.
The optimized version fills out the squares and cubes arrays in a single loop, since the
integer array is used to populate both. This is ideal since it gives the dynamic instruction
scheduler more flexibility when scheduling instructions to run. The important optimized section
is below:
unsigned long long pstart = rdtsc();
for (i = 0; i < size_of_array; i++) {
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
    cubic_array[i] = load_store_array[i] * load_store_array[i] * load_store_array[i];
}
printf( "Cycles=%d\n", ( rdtsc() - pstart ) );
Notice that the calculations that populate the elements of each array, squares and cubes, are
within the same loop.


The de-optimized version fills out the squares and cubes array using separate loops. This
is sub-optimal since it takes away some of the flexibility that the dynamic instruction scheduler
might otherwise have when scheduling instructions to run. The important de-optimized section is
below:
unsigned long long pstart = rdtsc();
for (i = 0; i < size_of_array; i++) {
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
}
for (i = 0; i < size_of_array; i++) {
    cubic_array[i] = load_store_array[i] * load_store_array[i] * load_store_array[i];
}
printf( "Cycles=%d\n", ( rdtsc() - pstart ) );
Notice that the calculations that populate the elements of each array, squares and cubes, are
within separate loops.
Data
As can be seen by the slowdown percentages below, the cost of not combining loops that
can be combined is very high. On the Opteron, it caused ~50% slowdown for all array sizes; this
is solely due to the fact that splitting the computations into separate loops isolates them such that
the dynamic scheduler cannot schedule them together; some of the flexibility that it might have
had otherwise has been lost. On the Nehalem, this de-optimization had an impact, but a lesser
one. It is hard to imagine why this is less costly on the Nehalem.


Figure 2. 8 Loop re-rolling de-optimization
Table 2. 8: Loop re-rolling de-optimization for the AMD Opteron and Intel Nehalem
Array Size AMD Opteron: Difference Slowdown(%) Intel Nehalem: Difference Slowdown(%)
10 189 43.95 83 28.32
100 1653 53.68 357 15.35
1000 21311 54.96 4696 21.28
10000 234336 52.42 39311 17.10
100000 2352328 50.93 134786 6.05


2.5.10 Dependency chain de-optimization
For this de-optimization, we designed an algorithm called dependency_chain. This
algorithm takes an array size as input. It then generates an array of the specified size in which
each element is populated with integers x where 0 <= x <= 20. After this array is populated, all
of the elements of the array are summed into a single integer variable.
The executables associated with this algorithm are called dependency_chain_op.exe and
dependency_chain_deop_1.exe (on Windows). The code is written in C. The section of code
being optimized/de-optimized is also written in C; there is no assembly for this de-optimization.
The cycles being counted include only the time that the executables spend adding the elements of
the array.
The optimized version adds the elements of the array by striding through the array in four
element chunks and adding elements to four different temporary variables. After the array has
been processed, the four temporary variables are added. The advantage of the optimized version
is that it creates four large dependency chains instead of one massive one. Thus, the dynamic
scheduler has many more options when scheduling instructions. The important optimized section
is below:
int sum = 0, sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
unsigned long long start = rdtsc();
for (i = 0; i < size_of_array; i += 4 ) {
    sum1 += test_array[i];
    sum2 += test_array[i + 1];
    sum3 += test_array[i + 2];
    sum4 += test_array[i + 3];
}
sum = sum1 + sum2 + sum3 + sum4;
printf( "Cycles=%d\n", ( rdtsc() - start ) );
Notice that the summing occurs across four temporary variables. These temporary
variables are then added after the array has been processed.
The de-optimized version sums each element of the array into one variable. This creates a
massive dependency chain that quickly exhausts the scheduling resources of the dynamic
scheduler. The important de-optimized section is below:
int sum = 0;
unsigned long long start = rdtsc();
for (i = 0; i < size_of_array; i++) {
    sum += test_array[i];
}
printf( "Cycles=%d\n", ( rdtsc() - start ) );
Notice that the summing occurs using one variable only.
Data
As can be seen by the slowdown percentages below, the cost of creating a massive
dependency chain is high. On the Opteron, it caused ~150% slowdown for all array sizes; this is
solely due to the exhaustion of the scheduling resources of the dynamic scheduler; exhausting
these resources effectively makes the program run sequentially (no ILP). On the Nehalem, this
de-optimization had a large impact but not quite as large as the Opteron. One can only imagine
that it has more scheduling resources at its disposal.


Figure 2. 9 Dependency chain de-optimization
Table 2. 9: Dependency chain de-optimization for the AMD Opteron and Intel Nehalem
Array Size AMD Opteron: Difference Slowdown(%) Intel Nehalem: Difference Slowdown(%)
10 43 19.03 3 2.5
100 841 120.31 256 57.66
1000 8857 160.02 2793 79.03
10000 87503 160.64 30073 86.46
100000 877172 155.3 272096 78.83
1000000 8633066 142.06 2937193 88.53
10000000 90226731 132.98 25436239 71.12


3. Building tools to optimize weaknesses in the GCC compiler
3.1 Compiler
There are many definitions of a compiler, but the simplest one is a program that
converts other programs from a high level language to a low level language. Compilers
differ from one language to another. However, compilers can be similar to each other
if they belong to related programming languages. For example, C++, C# and
Java are languages that extend C, so compilers for those languages are similar in terms
of structure and operation. Programs are written in high level languages instead of low level
languages for several reasons. One reason is that programs in high level languages
are much shorter than programs written in low level languages, so high level languages
help users or programmers write less code more efficiently. Another
advantage is that programs can be compiled to different low level languages, and then
can be run on different machines. However, programs in high level languages are slower
than programs written in machine languages because of the time spent converting
from the high level language to the low level language. So, the most efficient compiler converts from a high
level language to a low level language in the shortest time [6][7]. Every compiler
works in phases; the number and sequence of phases differ from one compiler
to another. The first phase in a compiler is lexical analysis. Lexical analysis
reads the program as text and then divides the whole program into tokens
according to the rules of the language. The second phase in a compiler is syntax
analysis. Syntax analysis takes the output of the previous phase and represents it
as a tree corresponding to the structure of the program. The third compiler phase
is type checking. Type checking is responsible for checking a program to see if it
follows the grammatical and structural rules of the language. The fourth compiler phase is
intermediate code generation. Intermediate code generation is responsible for translating
the program into a machine-independent intermediate language. The fifth compiler phase is register allocation.
The variables and names used in the previous phase (intermediate code generation)
are mapped to the registers of the machine. The sixth compiler phase is called
machine code generation. In machine code generation, the code in the intermediate
language is converted to assembly language. The last compiler phase is assembly and
linking. In the assembly and linking phase, the assembly code is translated into binary
form. Lexical analysis, syntax analysis and type checking are called the frontend of a
compiler, while register allocation, machine code generation, and assembly and linking are
called the backend of a compiler. Most compiler optimizations are done on these two parts,
the frontend and the backend. The optimization procedure differs
from one compiler to another, but there are basic optimizations that work in all of
them [7].
3.2 GCC compiler
GCC is one of the most popular compilers and is used for several different languages.
GCC is used widely because it is free, open-source software and has very effective tools
to optimize programs efficiently. Currently, GCC supports six languages: C, C++,
Objective-C, FORTRAN, Java and Ada. Another language, COBOL, may be added in the
future. GCC is part of the GNU project, which was launched in 1984 to develop a
complete UNIX-like operating system as free software. There have been many versions
of GCC from the first version until today [8].
The basic language of the GCC compiler is C, and the GCC compiler itself was
written in C. So, in the beginning, GCC was written as a C compiler, and after that
several languages were added to it. The first language added to the GCC compiler was
C++, which is an extension of the C language. Objective-C, another language extended
from C, was also added to the GCC compiler. Another important language added to the
GCC compiler is FORTRAN. FORTRAN is considered stronger than C for numerical
work because it is a rich language with math libraries; it can perform complex operations
and it deals with complex numbers. Another important language added to GCC is Java.
Java is different from the other languages because the form of its code is different, and
the executable file generated by the Java compiler is different from that of other
languages: the executable is run by a Java virtual machine. The last language added to the
GCC compiler is Ada, which was developed by Ada Core Technologies and added to the
GCC compiler in 2001. The Ada language is designed to handle large programs [8] [9].
3.3 GCC optimizations
Compilers basically aim to make programs faster and more efficient and to reduce
the size of code. The most important goal of compiler optimization is speeding up
programs. Code generated by an optimizing compiler is called optimized code.
Optimized code is produced by several changes to the source code that make the code
more efficient and faster than the original. The GCC compiler has several ways to decide
whether a particular piece of code can be improved, such as control flow analysis and
data flow analysis. Control flow analysis is used to see how the control statements are
used and which path is taken from the beginning of a control statement to its end. The
other one is data flow analysis, which is used to identify how data moves through a
program [10] p101-p117. There are several optimization techniques used in the GCC
compiler to optimize programs. Copy propagation is one of the optimizing techniques
used to reduce redundant information. In this method, constant values are used instead of
copy operations. To clarify this optimizing technique, let's consider the following
example:
i=10;
x[j]=i;
y[min]=b;
b=i;
The GCC compiler will optimize the previous code into the following:
i=10;
x[j]=10;
y[min]=b;
b=10;
In the optimized code, constant values are used directly to set the variables x[j] and b,
while in the original code a copy operation is used to copy the values. The copy operation
costs more time than simply setting the constant values as in the optimized code [10]
p104.
There is another GCC compiler optimization called dead code elimination. Dead
code elimination removes extra or unused code in programs, or removes code that does
not affect the sequence of operations or the result of the program. Code written with extra
calculations or extra variables is called dead code. Also, when there is code that can
never be reached by the compiler, that code is called unreachable code. For example:
if (i == 10) {
    ...
}
else {
    if (i == 10) {
        ...
    }
}
In the above example, the second if statement will never be reached by the compiler,
because either the first statement is taken or neither of them is. So, in all cases, the second
statement is never reached; this is called unreachable code, and the optimization that
removes it is called unreachable code elimination.
The GCC compiler, especially in its latest versions, has many smart techniques to
optimize code. GCC has optimizations called architecture-independent optimizations.
The most interesting thing about these optimizations is that they do not depend on the
features of a given architecture or a given processor, so they optimize code across
different architectures and processors. The GCC compiler has different optimization
switches such as -O, -O1, -O2, -O3, -Os and -O0. Passing nothing or -O0 turns off all
GCC optimizations. -O1 turns on the first level of GCC optimizations. -O2 and -O3 turn
on levels two and three of the GCC optimizations, which are added on top of the first
level. -O enables GCC optimizations that attempt to reduce code size and execution time
without increasing the compilation time. -Os attempts to reduce the size of the code only,
without considering the compilation time [10].
In the GCC compiler, you can turn off a single optimization as follows: insert
no between -f and the name of the optimization that you want to disable. Suppose we
want to disable the optimization that guesses branch probabilities; the disabling
command will look like this:
gcc filename.c -o filename -O1 -fno-guess-branch-probability
The original name of the branch probability guessing optimization is
-fguess-branch-probability [10] p108.
As mentioned before, there are three levels of GCC optimizations. The first level
is called level one. Level one has many optimizations that reduce code size and improve
time consumption. -ftree-dce is one of the level one GCC optimizations; it eliminates
unreachable code and dead code in programs. -fif-conversion is another optimization in
level one, which can change conditional jumps into non-branching code. There is also
another optimization in level one that predicts branch probabilities in programs, called
-fguess-branch-probability. In level two, many optimizations are added to the first level,
including loop transformation, mathematical optimizations and jump handling. One of
the mathematical optimizations is called -fpeephole. Peephole optimizations replace long
instruction sequences with shorter ones. For example, suppose we have the following
code:
A = 2;
for (i = 1; i < 10; i++)
    A += 2;
By applying the second level of optimization, the GCC compiler can replace the above
code with A=20. This optimization saves the considerable time spent executing a for
loop with many iterations. There is also another optimization in level two that can
improve recursive functions, called -foptimize-sibling-calls, which optimizes sibling and
tail-recursive calls. In general, function calls consume much time, so eliminating calls
makes programs faster. There are other optimizations in level two that can reduce the size
of code. The -Os switch in GCC attempts to reduce code size, which improves memory
usage. -Os applies all level two optimizations except those that enlarge the program's
code. Level three turns on all level one and level two optimizations plus additional ones
such as -funswitch-loops, -fgcse-after-reload and -finline-functions [10] p110-p112.
Furthermore, there are extra optimizations called specific optimizations. One of
the specific optimizations is -fbounds-check, which checks that the indices used for array
accesses are valid. -ffunction-cse is a specific optimization that stores function addresses
in registers. There is also a specific optimization called -finline-functions, which expands
simple functions into their calling functions. -ffloat-store is another specific
optimization, which disables storing floating point values in registers. The specific
optimizations are not enabled by -O1, -O2 or -O3; basically, they are enabled by using
-f followed by the name of the optimization, or they can be enabled by -O [10].
3.4 Our optimizations to the GCC compiler
In our implementation, we found several weaknesses in the GCC compiler, so we
addressed these weaknesses and built a tool that bundles all of the resulting
optimizations. The weaknesses that we found in the GCC compiler involve the division
operation, loops versus recursive functions, loop re-rolling, loop unrolling, the power
operation, the square root operation and function calls. The division operation consumes
much more time than multiplication, so our goal in this optimization technique is to
replace division with multiplication to make programs more efficient and faster.
Functions are used in many programs, and function calls are similar to the iterations used
in loops, but function calls are much more time consuming than loop iterations.
Therefore, in programs that have function calls, the calls can be replaced by loops in
order to make the programs faster. Loop re-rolling is one of the optimization techniques
used in many compilers to optimize loops. The basic idea in loop re-rolling is to combine
several loops into one loop, which reduces the number of instructions, because in general
one combined loop executes fewer instructions than separate individual loops. So, the
main goal of this optimization technique is to combine several loops into one loop, when
possible, to save time and make programs faster. We also implemented another loop
optimization technique called loop unrolling. Loop unrolling is used in many compilers
to improve the execution time of code that contains loops. The main idea of loop
unrolling is to reduce the number of loop-control instructions. The article "An aggressive
approach to loop unrolling" states that many programs spend 90% of their time
executing 10% of the code [11]. So, reducing the time consumed in that small portion of
the code will make programs faster. The same article states that loop unrolling can
improve instruction-level parallelism and memory hierarchy locality [11]. Loop unrolling
can be applied to many loops but not all, because some code is difficult to unroll, such as
loops over two-dimensional arrays and pointers. The basic idea in loop unrolling is to
replicate the loop body several times and reduce the number of iterations, which makes
programs faster. However, loop unrolling has a disadvantage: it increases code size,
which affects memory. The main drawback of increasing code size is its effect on the
instruction cache. When the instruction cache overflows, the optimized code is hurt and
the program becomes slower. This can happen when the instruction cache is small and
the original code is large. There is another issue regarding loop unrolling, which is
choosing the unrolling factor. An important question is: how many times should we
unroll a given loop? We introduce this question in order to help users or programmers
choose the best unrolling factor. Choosing a good unrolling factor depends on several
factors, such as the size of the instruction cache, the size of the code and the number of
instructions in the program.
In our implementation, we help the programmer or user find the best unrolling
factor when a loop unrolling opportunity is found. Our tool requires as input the size of
the instruction cache, the length of an individual instruction (because instruction length
differs from one architecture to another), and the number of iterations of the loop. The
output of our tool is the best maximum unrolling factor. The programmer or user should
not exceed this unrolling factor, because doing so would overflow the
instruction cache and degrade the performance of the program. In addition to those
optimization techniques, there is another one, which is optimizing the power operation.
The power function in C and C++ (pow()) is used in many mathematical programs. A
power operation with a small integer exponent can be replaced by multiplication, and
both give the same result. We want to replace pow() with multiplication because the
power operation is much more expensive than multiplication, so the replacement makes
programs faster. Another optimization technique included in our tool optimizes the
square root operation (sqrt()). sqrt() is a C and C++ function used in many programs to
find the square root of a number. The sqrt() function is much more expensive than
division, as can be seen easily from the results of our implementation. So, in this
optimization technique, we replace the sqrt() function with a division operation where
possible to make programs faster. The last optimization technique that we implemented
in our tool is optimizing function calls. When a function inside a for loop is used as a
constant, moving this function call out of the loop saves much time, because otherwise
the number of calls to the function equals the number of iterations of the loop. Therefore,
in this optimization technique, we move a function call out of a loop whenever possible
to make programs faster.
3.5 Methods, analysis and results
3.5.1 Division vs. multiplication
We designed a benchmark that tests division in order to see whether the compiler
can optimize it. We used an array populated with powers of integers from 1 to 12. We
divided each element in the array by 5, ran the program for different array sizes, and
collected the results. We also implemented an optimized version of the benchmark. In the
optimized code, we multiplied each element in the array by 0.2; the two codes give the
same result, and the only difference between them is using multiplication instead of
division. We compiled each code in two different ways: once with all GCC optimizations
disabled and once with all GCC optimizations enabled. The results show that division has
a real impact even with the GCC compiler. The average slowdown is 118%, and from
this we can tell that the GCC compiler has very limited optimization ability in this area.
The point of this benchmark is to tell programmers not to use division, even when relying
on the compiler, because the compiler cannot optimize this kind of code. If you have the
option to replace a division with a multiplication, do so, because division degrades
performance compared with multiplication. We measured time in milliseconds to see
which version is faster. The array sizes used in this optimization technique range from
1000000 to 10000000. We used large arrays for the millisecond measurements because
the CPU is very fast; for example, with a small array the measured time is zero most of
the time. Measuring in milliseconds lets us see the effect of our implementation in real
time.
Here are the results and flow charts for our implementations (depending on time in
milliseconds):
A: Division code without compiler optimization.
B: Multiplication code without compiler optimization.
C: Division code with compiler optimization.
D: Multiplication code with compiler optimization.
Table 3. 1: Time in milliseconds for the Division vs. Multiplication optimization
technique.
Array size A B C D
1000000 20.00ms 20.00ms 10.00ms 0.00ms
2000000 39.99ms 30.00ms 20.00ms 10.00ms
3000000 70.00ms 50.00ms 30.00ms 20.00ms
4000000 100.00ms 69.99ms 50.00ms 19.99ms
5000000 110.00ms 90.00ms 60.00ms 19.99ms
8000000 179.99ms 139.99ms 89.99ms 39.99ms
10000000 230.00ms 190.00ms 120.00ms 59.99ms
Figure 3. 1 Division vs. Multiplication- milliseconds
3.5.2 Loop and recursive function
We designed another benchmark to test whether the GCC compiler can optimize
function calls. We know that function calls take much time compared with other
programming techniques. We implemented two benchmarks; both calculate the factorial
of a number, and we repeat this operation many times to see the difference in
performance between the two. The original benchmark calculates the factorial of a
number using a recursive function call, so the function is called several times to compute
the factorial. The other benchmark, our optimized benchmark, calculates the factorial of
a number using a loop instead of calling a function several times. We know that using a
loop instead of the function calls will optimize our program; the GCC compiler is weak
at this kind of optimization. We measured time in milliseconds to see the difference in
performance between the two benchmarks. We ran the benchmarks with different
numbers of repetitions, from 1000000 to 10000000, and collected the time in
milliseconds so that we could clearly see which benchmark is faster. The results show
that our optimized code (the loop version) is much faster than the code with function
calls, even when the latter is compiled with GCC optimization. Therefore, the GCC
compiler has a weakness in this kind of optimization. Let's see the codes, results and
graphs.
The original code, using a recursive function (result is a global variable initialized to 1):
int tail(int factorial)
{
    if (factorial > 1)
    {
        result = result * factorial;
        tail(factorial - 1);
    }
    return result;
}
Our optimized code, using a loop instead of calling the function every time:
int tail(int factorial)
{
    temp = factorial;
    for (i = 1; i < temp; i++)
    {
        result = result * factorial;
        factorial = factorial - 1;
    }
    return result;
}
A: The original recursive function code without GCC optimization.
B: Our optimized For Loop code without GCC optimization.
C: The original recursive function code with GCC optimization.
D: Our optimized For Loop code with GCC optimization.
Table 3. 2: Time in milliseconds for the Loop and recursive function optimization
technique.
Iterations A B C D
1000000 4950 4190 4260 1380
2000000 9870 7890 8310 2770
3000000 14530 12600 12730 4160
4000000 19820 16280 16930 5550
6000000 29760 23500 25500 8340
8000000 38860 31350 33870 11110
10000000 49060 39460 42630 13890
Figure 3. 2 Loop and recursive function- Time in milliseconds
3.5.3 Loop re-rolling
We designed another benchmark to test whether the GCC compiler can optimize
loops using the loop re-rolling optimization technique. This optimization is designed to
work with for loops. There are two benchmarks: the original benchmark and our
optimized benchmark. The two benchmarks do the same job; they calculate squares and
cubes of numbers. We used three arrays: load_store_array, quadratic_array and
cubic_array. load_store_array is initialized randomly with integers from 0 to 9. After this
array is populated, the two other arrays are built. The original benchmark calculates the
squares and the cubes in two separate loops, while the other benchmark calculates the
squares and the cubes in one loop. We developed these two benchmarks to test the GCC
compiler because loop re-rolling is a very important technique used by many compilers
for optimizing loops. In our implementation, we measured time in milliseconds to see
which benchmark is faster. The time is measured only for the part of the code that we
want to optimize, not for the whole program, and this gives us a clear clue about
performance. The GCC compiler cannot do this kind of optimization: when we tested
both programs (the original code and our optimized code), our optimized code was
clearly much faster than the original code. We ran both benchmarks with different array
sizes from 10 to 1000. In this optimization technique, our optimized code is
approximately 40% faster than the original code. From this percentage, we can see that
the GCC compiler has a significant weakness in loop re-rolling. Let's see the codes,
results and graphs.
The original code:
for (i = 0; i < array_size; i++) {
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
}
for (i = 0; i < array_size; i++) {
    cubic_array[i] = load_store_array[i] * load_store_array[i] * load_store_array[i];
}
Our optimized code:
for (i = 0; i < array_size; i++) {
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
    cubic_array[i] = load_store_array[i] * load_store_array[i] * load_store_array[i];
}
A: The original code without GCC optimization.
B: The original code with GCC optimization.
C: Our optimized code without GCC optimization.
D: Our optimized code with GCC optimization.
Table 3. 3: Time in milliseconds for the Loop Re-Rolling optimization technique
Array size A B C D Slower Percentage
10 587 284 481 160 32.5%
20 1000 320 778 227 41.436 %
30 1428 412 1103 297 39.394 %
40 1887 522 1427 367 42.234 %
50 2491 612 1722 437 40.046 %
60 3028 711 2043 506 40.514 %
70 3393 821 2356 577 42.288 %
80 3933 921 2664 646 42.105 %
90 4394 1021 2977 714 42.997 %
100 4897 1128 3303 789 42.586 %
Figure 3. 3 Loop Re-Rolling- Milliseconds
3.5.4 Loop unrolling
We designed another benchmark to test whether the GCC compiler can optimize
loops using the loop unrolling technique. We did this because loops are used widely in
many programs, so optimizing loops can significantly improve many of them. There are
two benchmarks in our implementation: the original benchmark and our optimized
benchmark. Both do the same job with different techniques. In the original benchmark,
we multiplied each number in an array by itself and stored the result back, using a for
loop whose number of iterations equals the size of the array. In our optimized
benchmark, we unrolled the loop, meaning that we repeated the statement inside the loop
body several times to reduce the number of iterations, and we know that reducing the
number of iterations speeds up the benchmark. In our optimized benchmark the loop
body contains six copies of the statement, so the number of iterations is reduced by about
83%; for example, if the array size is 120, the number of iterations will be 20 instead of
120. We measured time in seconds to see which benchmark is faster. The time is
measured only for the part of the code that we want to optimize, not for the whole
program, and this gives us a clear clue about performance. The GCC compiler cannot do
this kind of optimization: when we tested both programs (the original code and our
optimized code), our optimized code was clearly much faster than the original code. We
ran both benchmarks with different array sizes from 10000000 to 900000000. From the
results, we can see that this optimization technique is more effective than our previous
optimization techniques, because our optimized benchmark without GCC optimization is
faster than the original benchmark with GCC optimizations enabled. Let's see the codes,
results and graphs.
The original code:
for (i = 0; i < size_of_array; i++) {
    test_array[i] = test_array[i] * test_array[i];
}
Our optimized code:
for (i = 0; i < size_of_array; i += 6) {
    test_array[i]   = test_array[i] * test_array[i];
    test_array[i+1] = test_array[i+1] * test_array[i+1];
    test_array[i+2] = test_array[i+2] * test_array[i+2];
    test_array[i+3] = test_array[i+3] * test_array[i+3];
    test_array[i+4] = test_array[i+4] * test_array[i+4];
    test_array[i+5] = test_array[i+5] * test_array[i+5];
}
A: The original code without GCC optimizations.
B: The original code with GCC optimizations.
C: Our optimized code (Unrolling code) without GCC optimization.
D: Our optimized code (Unrolling code) with GCC optimizations.
Table 3. 4: Time in seconds for the Loop Unrolling optimization technique
Array size A B C D
10000000 5.79 3.86 0.96 0.55
20000000 11.61 7.74 1.93 1.1
40000000 23.4 15.6 3.9 2.7
60000000 35.22 23.48 5.87 3.93
80000000 46.5 31 7.75 4.87
100000000 58.05 38.7 9.67 5.83
900000000 527.91 351.94 87.98 45.1
Figure 3. 4 Loop Unrolling- seconds
3.5.5 Power vs. multiplication
We designed another benchmark to test whether the GCC compiler can optimize
the power operation. The pow() function is slower than multiplication because of the
extra time used to call the function and execute its subroutine, beyond the multiplication
itself. We designed this benchmark because we know that multiplication is much faster
than the power operation, and the power operation is widely used in many math
programs, so optimizing the power operation will help improve many programs. There
are two benchmarks: the original benchmark and our optimized benchmark. The two
benchmarks do the same job; they square the numbers in an array. We used one array,
called test_array[], and the result is put back in the same array. test_array[] is initialized
randomly with different integers. The original benchmark squares all the numbers in
test_array[] using the pow() function, while the other benchmark squares them using a
multiplication instead of pow(). We measured time in seconds to see which benchmark is
faster. The time is measured only for the part of the code that we want to optimize, not
for the whole program, and this gives us a clear clue about performance. The GCC
compiler cannot do this kind of optimization: when we tested both programs (the original
code and our optimized code), our optimized code was clearly much faster than the
original code. We ran both benchmarks with different array sizes from 10000000 to
900000000. From the results, we can see that this optimization technique is more
effective than our previous optimization techniques, because our optimized
multiplication benchmark without GCC optimization is faster than the original
benchmark with GCC optimizations enabled. Let's see the codes, results and graphs.
The original power code:
for (i = 0; i < size_of_array; i++) {
    test_array[i] = pow(test_array[i], 2);
}
Our optimized code:
for (i = 0; i < size_of_array; i++) {
    test_array[i] = test_array[i] * test_array[i];
}
A: The original power code without GCC optimizations.
B: Our optimized code (multiplication code) without GCC optimization.
C: The original power code with GCC optimizations.
D: Our optimized code (multiplication code) with GCC optimizations.
Table 3. 5: Time in seconds for the Power vs. Multiplication optimization technique
Array size A B C D
10000000 1.93 0.11 1.41 0.03
20000000 3.87 0.23 2.86 0.07
40000000 7.80 0.45 5.71 0.13
60000000 11.74 0.68 8.59 0.19
80000000 15.50 0.92 11.41 0.31
100000000 19.35 1.14 14.28 0.32
900000000 175.97 10.18 127.25 3.29
Figure 3. 5 Power vs. Multiplication- seconds
3.5.6 SQRT function vs. division
We designed another benchmark to test whether the GCC compiler can optimize
the sqrt() function, which is used in C to calculate the square root of a number. We
compared sqrt() with division because the sqrt() function takes much more time than a
division. sqrt() is used in many mathematical benchmarks, so optimizing it will speed up
those programs. In this implementation, we wrote two codes that do the same job; both
involve a square root computed by different mechanisms. One code uses the sqrt()
function, while the other uses a division operation instead. In the original benchmark, we
multiplied each number in the array by sqrt(Number1), while in our optimized code we
multiplied each number in the array by Number1/Number2. We used different numbers,
and the results are almost the same, because the main idea of this implementation is to
show that the sqrt() function takes much more time, due to the extra cost of calling
sqrt(), while our optimized benchmark performs only a division. In both codes, the call to
sqrt() or the division is repeated for the full number of iterations. We measured time in
seconds to see which benchmark is faster. The time is measured only for the part of the
code that we want to optimize, not for the whole program, and this gives us a clear clue
about performance. The GCC compiler cannot do this kind of optimization: when we
tested both programs (the original code and our optimized code), our optimized code was
clearly much faster than the original code. We ran both benchmarks with different array
sizes from 10000000 to 100000000. From the results, we can see that this optimization
technique is more effective than our previous optimization techniques, because our
optimized benchmark without GCC optimization is faster than the original benchmark
with GCC optimizations enabled. Let's see the results and graphs.
A: Sqrt() function code without GCC optimization.
B: Our optimized division code without GCC optimization.
C: Sqrt() function code with GCC optimization.
D: Our optimized division code with GCC optimization.
Table 3. 6: Time in seconds for the SQRT function vs. Division optimization
technique
Array size A B C D
10000000 1.69 0.13 1.40 0.03
20000000 3.38 0.26 2.79 0.08
30000000 5.06 0.40 4.21 0.11
40000000 6.78 0.52 5.82 0.16
50000000 8.45 0.64 7.02 0.18
60000000 10.17 0.76 8.44 0.23
70000000 11.87 0.87 9.87 0.26
80000000 13.57 1.02 11.24 0.30
90000000 15.28 1.16 12.60 0.35
100000000 17.00 1.28 13.95 0.37
Figure 3. 6 SQRT function vs. Division- seconds
3.5.7 The cost of the function call inside and outside loops
We designed a benchmark that tests whether a compiler can optimize a function call inside a loop. This optimization is designed to work with for loops. The original code calls the function strlen(), which returns the length of a string, inside a for loop. We passed this code to the GCC compiler to see whether it could optimize it. We developed this benchmark because we know that calling a function many times costs significant time. We also optimized the benchmark manually and compared the two versions, to see which optimizes better: the GCC compiler or our manual transformation. In our optimization, we moved the call to strlen() out of the loop, so that instead of calling the function on every iteration of the for loop it is called only once and its result is used as a constant inside the loop. This optimization saves much time and makes the program faster. The GCC compiler cannot perform this kind of optimization: when we tested both programs (the original code and our optimized code), our optimized code was clearly much faster than the original code. We rely on time in milliseconds for the comparison, measured only for the part of the code that we want to optimize, not for the whole program, which gives a clear picture of the performance of that region. With this optimization technique, our optimized code is approximately four times faster than the original code. From this ratio, we can see that the GCC compiler has a significant weakness. Let us look at the code, results and graphs.
This is the code that we passed to the GCC compiler for optimization:
unsigned sum(const unsigned char *s) {
    /* x[] is a global array initialized randomly with integers less than 100. */
    unsigned result = 0;
    for (size_t i = 0; i < strlen(s); i++) {  /* strlen() is re-evaluated on every iteration */
        result += x[i];
    }
    return result;
}
This is our optimized code, transformed manually:
unsigned sum(const unsigned char *s) {
    unsigned result = 0;
    size_t length = strlen(s);                /* called once, outside the loop */
    for (size_t i = 0; i < length; i++) {
        result += x[i];
    }
    return result;
}
A: The original function calls in a loop code without GCC optimizations.
B: Function calls out of loop code without GCC optimizations (our optimized code).
C: The original function calls in a loop code with GCC optimizations.
D: Function calls out of loop code with GCC optimizations (our optimized code).
These results show the difference between the code optimized by the GCC compiler and our optimized code, in milliseconds.
Table 3.7: Time in milliseconds for the cost of the function call inside and outside the loop optimization technique
Array size A B C D Average slowdown percentage
1000000 20.00 ms 10.00 ms 20.00 ms 9.99 ms 100.2 %
2000000 50.00 ms 20.00 ms 40.00 ms 9.99 ms 300.4 %
3000000 80.00 ms 29.99 ms 50.00 ms 9.99 ms 400.501 %
4000000 99.99 ms 30.00 ms 70.00 ms 10.00 ms 600.000 %
5000000 140.00 ms 40.00 ms 80.00 ms 10.00 ms 700.000 %
6000000 160.00 ms 50.00 ms 99.99 ms 19.99 ms 400.2 %
10000000 249.00 ms 89.99 ms 169.99 ms 29.99 ms 466.822 %
Figure 3.7: Function calls inside and outside the loop (milliseconds)
3.6 Automatic parallelization in compilers
As programs have begun to execute in parallel, programmers need the compiler to take care of parallelization. The basic idea of parallelization is to distribute a job among processors in order to complete a big piece of work in a significantly smaller amount of time. Today, computers have more than one central processing unit, and to take advantage of those CPUs they can collaborate to finish the job. There are two types of parallelism: shared memory and distributed memory. The main difference between them is how communication occurs. In a shared memory environment, threads communicate with each other through reads and writes, while in a distributed memory environment, processes communicate with each other through messages [1] [2].
There is also an architecture that mixes the previous two types, called a mixed-type multiprocessor architecture; it combines the shared memory environment with the distributed memory environment.
(a) Shared memory multiprocessor [4]
(b) Distributed memory multiprocessor [4]
In a shared memory environment, processors share the same memory, while in a distributed memory environment each processor has its own memory. In a mixed-type architecture, threads may share the same memory and may also communicate with other physical locations through messages, as in a distributed memory architecture. Parallel processing can complete big jobs in a very small amount of time by distributing the work between processors: each processor has a part of the work to finish, and in the end all processors give their results to the master processor. The master processor (or process) is in charge of creating the worker processes, collecting the results from the workers, and finally terminating them. To get good results from parallel processing, the programmer should balance the work between processors when distributing it. Many programs can be parallelized perfectly without difficulty or further work, but other programs need extra work to parallelize, such as synchronization [4]. We need synchronization when two processes have access to the same data; in this situation, the programmer needs to resolve the conflicts between the processes before implementing the parallelism. Synchronization takes much time, because it both limits the parallelism and consumes time in its own mechanisms. For instance, when many processes need to access the same value, we need synchronization techniques so that the value is accessed by only one process at a time. Two operations used in synchronization are produce and consume: produce sets a bit to full if it is empty, while consume waits for the bit to be full, then reads the value and sets the bit back to empty [4]. With this method, conflicts over the shared value cannot happen, because only one process at a time can access it.

There is another problem that makes some programs harder or even impossible to parallelize: dependencies. Dependencies are one of the main obstacles facing parallel processing. The basic problem is that some computations depend on the value of a previous computation; since all processes work at the same time, parallel execution would produce wrong results. There are several kinds of dependencies, such as read after write, write after read and write after write (read after read imposes no ordering constraint). Dependencies limit the ability to parallelize programs, and even when such programs are parallelized, they become slower. In programs with only a few dependencies, we can keep one or more sequential sections and parallelize the rest of the code. Sometimes the whole program, including its intermediate results, is dependent on itself: for instance, when each result at every level of a program is used in the next level, it is very difficult or impossible to parallelize. Deciding which sections of a program can be parallelized is up to the programmer, who examines the code to see which sections have dependencies and are difficult to parallelize, and which do not. Dependency detection can also be delegated to compilers: some compilers are built to work with parallel environments and can detect various kinds of dependencies in programs. One compiler with very powerful dependency-detection techniques is the Intel compiler, which can analyze loops and determine which loops have dependencies and are hard to parallelize [5].
3.7 Automatic parallelization in GCC
The GCC compiler is one of the compilers that support automatic parallelization. GCC can parallelize loops, but its automatic parallelization has some limitations. One limitation is that GCC cannot detect dependencies in loops, so it can only parallelize loops that have no dependencies. Another limitation is that GCC parallelizes innermost loops only; it cannot parallelize outer loops. At present, GCC parallelizes every dependence-free loop, with no strategy for deciding, based on performance considerations, whether a loop is worth parallelizing [3].
3.8 Our improvement to automatic parallelization of the GCC compiler
Our improvement to the GCC compiler is a tool that addresses the weaknesses in GCC's automatic parallelization. Our tool can detect dependencies in most loops. The input to our tool is a C or C++ program, and the output is a message stating either that the loop can safely be parallelized, or that there are dependencies and it is difficult to parallelize. Our tool uses the idea of the GCD test, which detects most dependencies between array accesses in loops. The GCD test works as follows:
1- We consider only the subscript expressions (the values inside the brackets) of the arrays used in the loop when calculating the GCD.
2- We assign the values in the subscripts to variables. For instance, for A[2i+3] = A[2i-1]*B+C we have a=2, b=3, c=2 and d=-1; a=2 and b=3 represent A[2i+3], while c=2 and d=-1 represent A[2i-1].
3- We find the GCD of a and c; in the previous example, GCD(2, 2) = 2.
4- We compute (b-d); in the example above, (3-(-1)) = 4.
Since GCD(a, c) = 2 divides (b-d) = 4, there may be dependencies in our example. We can confirm the result by following the iterations. Suppose we have the following example with 6 iterations:
A[2i+3] = A[2i-1]
i=0: A[3]=A[-1]
i=1: A[5]=A[1]
i=2: A[7]=A[3]
i=3: A[9]=A[5]
i=4: A[11]=A[7]
i=5: A[13]=A[9]
From this trace we can clearly see that iteration 2 depends on iteration 0, iteration 3 depends on the result of iteration 1, and so on.
For further clarification of the GCD test, suppose we have the following example:
for (...)
    A[2i] = A[2i-1]*B+C
end for
From this example, we can set the values for the GCD test as follows:
a=2
b=0
c=2
d=-1
GCD(a, c) = 2
(b-d) = 1
Since GCD(a, c) = 2 does not divide (b-d) = 1 [4], there are no dependencies in this example. Therefore, by using the GCD test, we can find dependencies in programs without manually following the iterations of loops.
Therefore, in our tool we implemented the idea of the GCD test. Our tool takes a C or C++ program as input and prints a message stating either that a loop has dependencies or that there are no possible dependencies in the program. This tool is designed to help the user or programmer discover dependencies in programs compiled by the GCC compiler.
3.9 Optimizing real C programs by our tool
After we built our tool with its seven optimization techniques (division, recursive function, loop re-rolling, loop unrolling, power operation, SQRT function and recursive function in a loop), we tested it on real C programs: Strassen, Bubble sort and Selection sort.
3.9.1 Strassen optimization
Strassen is a C program that performs matrix multiplication faster than the conventional method. The program was passed as input to our tool, and we optimized it in several ways: first using the GCC compiler only, second using our tool only, and third using both our tool and the GCC compiler. Our tool detected several optimization opportunities in Strassen.cpp that are not exploited by the GCC compiler. It found that several divisions in the program can be replaced by multiplications, and that two loops in Strassen.cpp can be unrolled, to make the program run faster. For each, our tool gave hints or messages showing that these optimizations can improve the program. After optimizing Strassen.cpp with our tool, we ran three programs: the original Strassen.cpp, Strassen.cpp optimized by the GCC compiler only, and Strassen.cpp optimized by both our tool and the GCC compiler. We saw an interesting result: the program optimized by our tool only is faster than the original program, and also faster than the program optimized by the GCC compiler. In addition, running the program optimized by both our tool and the GCC compiler gave a much larger speedup.
Results
A: The code without GCC optimization and without our tool optimization.
B: The code with GCC optimization but without our tool optimization.
C: The code without GCC optimization but with our tool optimization.
D: The code with GCC optimization and with our tool optimization.
The runtime for our programs:
Table 3.8: Running time in seconds for the Strassen optimization
Matrix size Strassen A Strassen B Strassen C Strassen D
256 0.62 0.47 0.29 0.15
512 4.56 3.32 2.09 1.07
1024 41.38 33.20 27.65 23.17
2048 371.49 321.51 217.96 204.08
4096 3341.04 2945.98 2349.21 2222.88
Figure 3.8: Strassen optimization
3.9.2 Bubble sort optimization
Bubble sort is a very well-known algorithm used to sort numbers. BubbleSort.cpp is the original C++ program used in our implementation, and OptimizedBubbleSort.cpp is the version optimized by our tool. BubbleSort.cpp was passed as input to our tool, and we optimized it in several ways: first using the GCC compiler only, second using our tool only, and third using both our tool and the GCC compiler. Our tool detected several optimization opportunities in BubbleSort.cpp that are not exploited by the GCC compiler: it found that several loops can be unrolled several times to make BubbleSort.cpp faster, and it gave hints indicating that these optimizations can improve the program. After optimizing BubbleSort.cpp manually and collecting the running times, we found that the program optimized by our tool only (OptimizedBubbleSort.cpp) is faster than the original BubbleSort.cpp, and the program optimized by both our tool and the GCC compiler is much faster than the original program optimized by the GCC compiler only.
Results and graphs
A: The code without GCC optimization and without our tool optimization.
B: The code with GCC optimization but without our tool optimization.
C: The code without GCC optimization but with our tool optimization.
D: The code with GCC optimization and with our tool optimization.
The runtime for our programs:
Table 3.9: Running time in seconds for the Bubble Sort optimization
Array size BubbleSort A BubbleSort B BubbleSort C BubbleSort D
10000 0.64 0.28 0.52 0.17
20000 2.63 1.89 2.13 0.83
40000 10.45 6.68 8.62 3.59
80000 42.58 28.38 34.52 13.42
100000 65.55 43.80 54.04 20.90
Figure 3.9: Bubble Sort optimization
3.9.3 Selection sort optimization
Selection sort is another sorting algorithm used to order numbers. SelectionSort.cpp is the original C++ program used in our implementation, and OptimizedSelectionSort.cpp is the version optimized by our tool. SelectionSort.cpp was passed as input to our tool, and we optimized it in several ways: first using the GCC compiler only, second using our tool only, and third using both our tool and the GCC compiler. Our tool detected several optimization opportunities in SelectionSort.cpp that are not exploited by the GCC compiler: it found that several loops can be unrolled several times to make SelectionSort.cpp faster. After optimizing SelectionSort.cpp manually and collecting the results, we found that the program optimized by our tool only (OptimizedSelectionSort.cpp) is faster than the original SelectionSort.cpp, and the program optimized by both our tool and the GCC compiler is much faster than the original program optimized by the GCC compiler only. On Strassen.cpp, BubbleSort.cpp and SelectionSort.cpp, our tool worked very well, and it added considerable speedup over the programs optimized by the GCC compiler alone.

Therefore, our tool works very well together with the GCC compiler to make programs faster. Developers who use the GCC compiler to optimize their code can use our tool alongside it to get fast and efficient results. We performed this evaluation on real C code in order to see the effectiveness of our tool when developers optimize real software.
Results and graphs
A: The code without GCC optimization and without our tool optimization.
B: The code with GCC optimization but without our tool optimization.
C: The code without GCC optimization but with our tool optimization.
D: The code with GCC optimization and with our tool optimization.
Table 3.10: Time in seconds for the Selection Sort optimization
Array size SelectionSort A SelectionSort B SelectionSort C SelectionSort D
100000 33.60 11.64 14.50 6.31
200000 135.65 46.87 59.34 25.62
400000 544.67 188.83 237.86 104.12
800000 2197.39 751.43 959.02 428.39
1000000 3539.38 1361.53 1600 862.97
Figure 3.10: Selection sort optimization
4. Conclusion
In this thesis, we have presented several ways to help developers write more efficient code. Writing good software depends on the architecture specifications and the compiler. We researched both, and we introduced tools to help developers write better-optimized code. Regarding the architecture specifications, developers should fully understand the architecture of the specific hardware they target; not considering the architecture specifications can make programs run slower. In chapter II of this thesis, we designed intentionally de-optimized benchmarks to find the strengths and weaknesses of the underlying computer architecture, and to show the effects of ignoring the architecture specifications when programs are written. The de-optimizations show, convincingly, that ignoring architecture specifications when writing software can be very costly indeed. Most of the de-optimizations had effects greater than 25% of running time; some had effects of up to 150%. These are not trivial slowdowns. They show that prioritizing aesthetics, portability, etc. ahead of the hardware can result in significant slowdowns. This is striking when one considers that many of the de-optimizations we implemented look just like code one sees every day in business and academic environments. Moreover, much of it was compiled straightforwardly, i.e. the compiler did not resolve the issues during compilation. The results of chapter II should, at the very least, make software developers reconsider many of their accumulated habits. Sometimes the best-looking code performs the worst.
Compilers have very powerful techniques for optimizing code. Writing good software therefore depends not only on the skills of the programmer and the architecture specifications, but also on the power of the compiler. One of the compiler's tasks is optimizing code: compilers can make programs run faster, work more efficiently, and reduce code size. To write optimized software, developers should therefore choose a powerful compiler to optimize their code. The GCC compiler is one of the well-known compilers used by many developers. In chapter III of this thesis, we researched the weaknesses of the GCC compiler. Several weaknesses were found, and we then built several optimization techniques: division vs. multiplication, loop and recursive function, loop re-rolling, loop unrolling, power vs. multiplication, SQRT function vs. division, and the cost of a function call inside and outside loops. To make these optimizations easier to use, we built a tool that helps developers optimize their code manually; it gives users a message indicating whether or not their programs need optimization. We built this tool because in some circumstances the GCC compiler cannot fully optimize code, so optimization by the user will make programs run faster.

The GCC compiler has another weakness: its automatic parallelization can only parallelize loops that have no dependencies. We therefore built a second tool that performs dependency detection. It can detect most of the dependencies in programs and gives users a message indicating whether or not a particular program has dependencies. The main goal of chapter III was to build tools that improve on the weaknesses of the GCC compiler and help programmers write more efficient code.

To make sure our tools can optimize real C or C++ applications, we tested them on real code such as Strassen.cpp, BubbleSort.cpp and SelectionSort.cpp.
Strassen.cpp is a C program that performs matrix multiplication quickly. The optimization of Strassen.cpp was implemented successfully, and we found that our optimized code without GCC optimization is faster than the original Strassen.cpp with GCC optimizations. Moreover, we optimized BubbleSort.cpp: the optimization was implemented successfully, and the Bubble Sort optimized by our tool is much faster than the Bubble Sort optimized by the GCC compiler only. We also successfully optimized the Selection Sort algorithm, and again the program optimized by our tool is much faster than the program optimized by the GCC compiler. Our tool is therefore a powerful companion that works with the GCC compiler to make programs efficient and fast. The main goal of this thesis is to help developers write more efficient code, by showing that fully understanding the architecture specifications leads to efficient code and that improving on the weaknesses of the GCC compiler makes programs run faster.
REFERENCES
[1] Midkiff, S. P. (2012). Automatic Parallelization: An Overview of Fundamental Compiler Techniques (Vol. 7, pp. 3-5). Morgan & Claypool. Retrieved June 6, 2013, from http://0-www.morganclaypool.com.skyline.ucdenver.edu/doi/abs/10.2200/S00340ED1V01Y201201CAC019
[2] Adve, S. V. and Gharachorloo, K., "Shared memory consistency models: a tutorial," Computer, vol. 29, no. 12, pp. 66-76, Dec. 1996. doi: 10.1109/2.546611. URL: http://0-ieeexplore.ieee.org.skyline.ucdenver.edu/stamp/stamp.jsp?tp=&arnumber=546611&isnumber=11956
[3] Stallman, R. 2013. Using the GNU Compiler Collection. Boston, USA: GNU Press.
[4] Jordan, H. and Alaghband, G. 2003. Fundamentals of Parallel Processing. Upper Saddle River, NJ: Prentice Hall/Pearson Education.
[5] Software.intel.com. 1999. Automatic Parallelization with Intel Compilers. Intel Developer Zone. [online] Available at: http://software.intel.com/en-us/articles/automatic-parallelization-with-Intel-compilers [Accessed: 13 Jun 2013].
[6] Wilhelm, R. and Seidl, H. 2010. Compiler Design. Heidelberg: Springer.
[7] Mogensen, T. 2009. Basics of Compiler Design. [Kbh.]: Torben Ægidius Mogensen.
[8] Griffith, A. 2002. GCC: The Complete Reference. New York: McGraw-Hill/Osborne.
[9] Gcc.gnu.org. 2013. GCC, the GNU Compiler Collection - GNU Project - Free Software Foundation (FSF). [online] Available at: http://gcc.gnu.org/ [Accessed: 13 Jun 2013].
[10] Von Hagen, W. 2006. The Definitive Guide to GCC. Berkeley, CA: Apress.
[11] Davidson, J. W. and Jinturkar, S. 2001. An Aggressive Approach to Loop Unrolling. Technical Report. University of Virginia, Charlottesville, VA, USA.
[12] AMD64 Technology. Software Optimization Guide for AMD64 Processors. 2005.
OPTIMIZING SOFTWARE THROUGH FULLY UNDERSTANDING THE UNDERLYING ARCHITECTURE & STRENGTHENING THE COMPILER

by
AHMED ESKANDER MEZHER
B.Sc., University of Technology, 2009

A thesis submitted to the Faculty of the Graduate School of the University of Colorado Denver in partial fulfillment of the requirements for the degree of Master of Science, Computer Science, 2013

This thesis for the Master of Science degree by Ahmed E. Mezher has been approved for the Computer Science Program by Gita Alaghband, Chair; Tom Altman; Bogdan Chlebus.

September 12, 2013
Ahmed Eskander Mezher (Master of Science in Computer Science)
Optimizing Software Through Fully Understanding the Underlying Architecture & Strengthening the Compiler
Thesis directed by Professor Gita Alaghband

ABSTRACT

Software developers are trying to write more efficient and faster programs. Increasing speed and efficiency not only depends on software programming knowledge, but also on understanding of the underlying computer architecture and compiler. Many programmers do not consider the machine architecture when they write code. Knowing the architecture specifications of the computer before writing code can influence program efficiency and speed. Also, optimizing compilers have a significant impact on the speed of program execution. The GCC compiler is a well-known compiler that has advanced techniques to optimize code. However, the GCC compiler has several weaknesses in certain areas, such as optimizing arithmetic operations, optimizing function calls, optimizing loops and providing full automatic parallelization. Those weaknesses can make programs run slower. This thesis helps developers to write more efficient software by presenting the weaknesses and strengths of the machine architecture and building tools to improve the weaknesses of the GCC compiler. In this thesis we show the negative impact of not considering the architecture specifications when programs are written, to motivate software developers to pay attention to the architecture specifications of the machine. Several tools are designed to improve the weaknesses in the GCC compiler. Our tools work together with the GCC compiler to help developers design more efficient code. Also, our tools can work on different compiler architectures; they are not limited to a specific compiler architecture. We tested our tools to optimize real C applications such as Strassen.cpp, BubbleSort.cpp and SelectionSort.cpp, and we successfully optimized them. Therefore, by using our tools together with the GCC compiler, programs will be more efficient and faster. Through understanding the weaknesses and strengths of the underlying computer architecture and using our tools together with the GCC compiler, developers can design more efficient software.

The form and content of this abstract are approved. I recommend its publication.
Approved: Gita Alaghband

CONTENTS

Chapter
1. Introduction
2. Finding the limits of the underlying computer architecture through de-optimization
2.1 Optimization information and recommendations for the AMD Opteron processor
2.2 Finding the limits of the underlying computer architecture through de-optimization techniques
2.3 Methods
2.4 Analysis and results
2.5 Detailed results
2.5.1 Branch density de-optimization
2.5.2 Unpredictable instruction de-optimization
2.5.3 Branch pattern de-optimization
2.5.4 Float comparison de-optimization
2.5.5 Costly instruction de-optimization
2.5.6 Load-store dependency de-optimization
2.5.7 High-latency instruction de-optimization
2.5.8 If-condition de-optimization
2.5.9 Loop re-rolling de-optimization
2.5.10 Dependency chain de-optimization
3. Building tools to optimize weaknesses in the GCC compiler
3.1 Compiler
3.2 GCC compiler
3.3 GCC optimizations
3.4 Our optimizations to the GCC compiler
3.5 Methods, analysis and results
3.5.1 Division vs. multiplication
3.5.2 Loop and recursive function
3.5.3 Loop re-rolling
3.5.4 Loop unrolling
3.5.5 Power vs. multiplication
3.5.6 SQRT function vs. division
3.5.7 The cost of the function call inside and outside loops
3.6 Automatic parallelization in compilers
3.7 Automatic parallelization in GCC
3.8 Our improvement to automatic parallelization of the GCC compiler
3.9 Optimizing real C programs by our tool
3.9.1 Strassen optimization
3.9.2 Bubble sort optimization
3.9.3 Selection sort optimization
4. Conclusion
References


LIST OF FIGURES

Figure
2.1 Branch Density (mod_ten_counter)  14
2.2 Unpredictable instructions (factorial_over_array) de-optimization  17
2.3 Compare_two_floats  21
2.4 Costly instruction de-optimization  24
2.5 Costly instruction de-optimization  27
2.6 Fib de-optimization  30
2.7 If condition de-optimization  33
2.8 Loop re-rolling de-optimization  36
2.9 Dependency chain de-optimization  39
3.1 Division vs. multiplication, milliseconds  51
3.2 Loop and recursive function, time in milliseconds  54
3.3 Loop re-rolling, milliseconds  56
3.4 Loop unrolling, seconds  59
3.5 Power vs. multiplication, seconds  62
3.6 SQRT function vs. division, seconds  64
3.7 Function calls in and out of the loop, milliseconds  67
3.8 Strassen optimization  75
3.9 Bubble sort optimization  77
3.10 Selection sort optimization  79


LIST OF TABLES

Table
2.1 Branch density de-optimization for the AMD Opteron and Intel Nehalem  14
2.2 Unpredictable instructions de-optimization for the AMD Opteron and Intel Nehalem  17
2.3 Compare two floats de-optimization  22
2.4 Costly instruction de-optimization for the AMD Opteron and Intel Nehalem  25
2.5 Costly instruction de-optimization for the Opteron and Intel Nehalem  28
2.6 Fib de-optimization for the Opteron and Intel Nehalem  30
2.7 If condition de-optimization for the AMD Opteron and Intel Nehalem  33
2.8 Loop re-rolling de-optimization for the AMD Opteron and Intel Nehalem  36
2.9 Dependency chain de-optimization for the AMD Opteron and Intel Nehalem  39
3.1 Time in milliseconds for the division vs. multiplication optimization technique  51
3.2 Time in milliseconds for the loop and recursive function optimization technique  53
3.3 Time in milliseconds for the loop re-rolling optimization technique  56
3.4 Time in seconds for the loop unrolling optimization technique  59
3.5 Time in seconds for the power vs. multiplication optimization technique  61
3.6 Time in seconds for the SQRT function vs. division optimization technique  64
3.7 Time in milliseconds for the cost of the function call in and out of the loop optimization technique  66
3.8 Running time in seconds for the Strassen optimization  74
3.9 Running time in seconds for the Bubble sort optimization  76
3.10 Time in seconds for the Selection sort optimization  78


1. Introduction

This thesis is in the area of compilers and software. Writing efficient software depends not only on the writing skills of programmers but also on knowledge of the underlying architecture and compiler. Sometimes code that has been optimized to run fast on one machine may run very slowly on a different machine. For example, suppose we have optimized code whose loop has been unrolled several times so that it runs very fast on machine A, and a programmer runs the same code on a completely different machine B. If the programmer does not know the architecture specifications of machine B, the program may run very slowly because the instruction cache of machine B is smaller than that of machine A, and when the number of instructions in a program exceeds the size of the instruction cache, the program becomes slower. Therefore, knowing the architecture specifications helps programmers write optimized code. Additionally, the effort required to optimize generally depends upon knowledge of the architecture specifications, such as: How long is its pipeline? How does it detect hazards? Does it dynamically schedule? How does its caching work? By understanding these aspects of the machine architecture, programmers can write very efficient software.

Does ignoring the architecture specifications when writing software affect its efficiency? Chapter two answers this question. In chapter two, we researched the weaknesses and strengths of the underlying computer architecture and show the consequences of not considering the architecture specifications when programs are written. The goal of chapter two is to design a series of de-optimizations for the Opteron and to show how gracefully, or not, its performance degrades. In some circumstances, serious degradations in performance were found. In others, expected de-optimizations were difficult if not impossible to implement. By examining the results of chapter two, one can gain a thorough understanding of what development aspects need attention and what aspects can be safely ignored. The de-optimizations that we have implemented are branch density, unpredictable instructions, branch patterns, float comparisons, costly instructions, load-store dependency, high-latency instruction, if condition, loop re-rolling and dependency chain. Branch patterns, unpredictable instructions and branch density were deemed unsuccessful because of the power of the hardware. The other de-optimizations showed a great impact on the Opteron, and on the Intel Nehalem as well.

Moreover, some programmers depend only on the compiler to optimize their code. Depending on the compiler alone will not fully optimize code, because compilers may have weaknesses of their own. We chose to research the GCC compiler because it is a well-known compiler used by many developers and programmers. We will develop tools that help write more efficient code and compare the results with the GCC compiler. In chapter three, we present several weaknesses in the GCC compiler, and then build two tools that help programmers who use the GCC compiler to write more efficient code. The first tool can detect the optimizations applicable to a given program and present the programmer with a message containing a list of optimizations. The programmer can then implement the listed optimizations manually so that the program becomes more efficient.
The optimizations that we have implemented for the GCC compiler are the division operation, loop and recursive function, loop re-rolling, loop unrolling, the power operation, the square root function and function calls. None of these optimizations is architecture dependent except for loop unrolling, because loop unrolling depends on the size of the instruction cache, which differs from one architecture to another.

Another part of our study concerns automatic parallelization in the GCC compiler, which does not support dependency detection. The GCC compiler can parallelize loops that do not have dependencies. Consequently, the programmer needs to check the program manually for dependencies. Checking for dependencies visually is not always easy, because it requires following the iterations of the loop. If a loop has a very long sequence of instructions, it becomes very hard and time consuming to decide whether the loop has dependencies or not. Therefore, we designed a tool that can detect most loop dependencies in programs. Our tool gives the programmer a message indicating whether or not a particular program has dependencies. It helps programmers identify dependencies, and they can save a lot of time by using it.

The main goal of chapter three is to improve the weaknesses we have detected in the GCC compiler and to build tools with a series of optimizations that can help programmers write more efficient code. The main goal of the thesis is to help developers write the most efficient software possible.


2. Finding the limits of the underlying computer architecture through de-optimization techniques

2.1 Optimization information and recommendations for the AMD Opteron processor

Many optimizations have been applied to 32-bit and 64-bit software to work with AMD processors and the AMD architecture. These optimizations are instruction decoding optimizations, cache and memory optimizations, branch optimizations, scheduling optimizations, integer optimizations, optimizing with the SIMD instructions and x87 floating-point optimizations. Instruction decoding optimizations help the processor maximize the number of decoded instructions. There are two types of instructions: direct-path instructions and vector-path instructions. In general, direct-path instructions minimize the number of operations that cost the AMD processor. Three direct-path instructions can be completed in one cycle, and one and a half double direct-path instructions per cycle. Therefore, using direct-path instructions rather than vector-path instructions can optimize code. There are also other optimizations of the instruction decoding type, applied to load-execute instructions, load-execute integer instructions and many others. Memory optimization is another optimization that can be applied to the AMD processor. Memory has several optimizations; one of them is using store instructions such as MOVNTPS and MOVNTQ. These instructions let the processor write without first reading the data from memory, saving the time of reading memory or cache and thereby making programs faster. This optimization can be applied to 32-bit and 64-bit software. Another optimization that can be applied to the AMD processor is branch optimization, which can improve branch prediction and minimize branch penalties. One possible branch optimization is avoiding conditional branches that depend on random data, as these branches are difficult to predict. Moreover, scheduling optimization is another optimization that can be applied to the AMD processor. One possible scheduling optimization is pushing data directly onto the stack instead of loading it into a register first; this technique reduces register pressure and eliminates data dependencies. This optimization also applies to 32-bit and 64-bit software [12]. We have shown some of the recommendations for the AMD processor. These recommendations can make programs faster. We then experimented with the AMD hardware in order to find the limits of the hardware and what other weaknesses in it could be improved by the AMD designers.

2.2 Finding the limits of the underlying computer architecture through de-optimization techniques

The computer industry is developing very fast. CPU speeds are increasing, and most computers come with more than one CPU. Software developers focus on writing good software without considering the architecture specifications, relying on compilers to do the dirty work of concerning themselves with the details of the architecture. Fully understanding the performance of the hardware, and what can affect it, gives us a clear clue about developing hardware or software. In this part of the thesis, the limits of the underlying computer architecture were researched by intentionally de-optimizing software. Since developers typically do not consider the architecture specifications, it is useful to know how optimized hardware will behave under the worst circumstances. With this viewpoint, one can imagine that a company might choose one piece of hardware over another not because it performs the best in very special circumstances, but rather because its performance degrades the slowest in adverse circumstances.
The goal of this part of the thesis was to design a series of de-optimizations for the Opteron and to show how gracefully, or not, its performance degrades. After we ran the implementation, serious degradations in performance were found in some cases. In other circumstances, it was harder to implement de-optimization techniques because of the power of the hardware. The main benefit of this work is that one can look at the results and decide what development aspects need attention and what aspects can be safely ignored.

2.3 Methods

In our implementation, we chose the AMD Opteron and the Intel Nehalem. The Opteron was chosen as the CPU to benchmark. The Intel Nehalem is used to check whether the de-optimizations are universal or not; that is, if a de-optimization affects both the Opteron and the Intel Nehalem, then it may be fairly universal. We chose ten de-optimizations to implement:

Branch Density: The code for this de-optimization contains densely packed branch instructions in order to overload branch target buffers and/or branch prediction algorithms.

Unpredictable Instruction: The code for this de-optimization has intentionally misaligned return instructions in order to prevent branch prediction.

Float Comparison: The code for this de-optimization uses normal float comparison (a conditional) when comparing integers (after casting) would suffice.


Costly Instruction: The code for this de-optimization uses division when multiplication would suffice.

Load-Store Dependency: The code for this de-optimization sets up a load-store dependency, the intention of which is to cause the CPU to wait until stores complete before loads can occur.

Dependency Chain: The code for this de-optimization creates a very long dependency chain that quickly exhausts the resources of the dynamic scheduler.

High-Latency Instruction: The code for this de-optimization uses a high-latency instruction (LOOP) when lower-latency instructions (DEC/JNZ) would suffice.

If Condition Organization: The code for this de-optimization intentionally orders the sub-conditions of an IF statement so that the sub-conditions which take longer to evaluate come first.

Loop Re-Rolling: The code for this de-optimization breaks a loop into two loops that could otherwise be combined, so that the dynamic scheduler has less scheduling leeway.

Branch Patterns: The code for this de-optimization tries to find patterns that are difficult for branch prediction hardware to predict.


All of these de-optimizations are described in more detail below. Most of the optimizations and de-optimizations were written in C using the GCC compiler, in cases where we were sure that GCC would not affect or undo the de-optimization techniques. In some cases GCC can interfere with a de-optimization, so C could not be used. When C could not be used, the NASM assembly language was used to build C-importable assembly modules. These modules were then used by a C wrapper in order to implement the de-optimization. Typically, the C wrapper builds an array that is passed to an assembly module, which in turn processes the array.

An important question is how to evaluate the de-optimization techniques. Several programs can give information about code, such as CodeAnalyst and VTune. At a glance, CodeAnalyst appeared ideal, since it proffers a great deal of information about program execution and it was written for the chosen hardware platform. However, it turned out to be less than ideal: it is cumbersome to work with, and it is hard to evaluate the data for sections of code together. In short, it is great for profiling code but poor for generating lots of test results, so CodeAnalyst was not an easy answer to the problem of evaluating de-optimizations. Rather than using prepackaged software to evaluate the de-optimizations, we decided to wrap the important sections with code that counts the number of cycles. In the post-Pentium world, there is a ready resource for counting clock cycles: the CPU Timestamp Counter (CTC). The CTC counts the number of clock cycles executed since the CPU was booted. It can be a little tricky to work with, especially in multi-core, multi-CPU environments. However, if the core/CPU used for execution can be tightly controlled, then it can be trusted to provide a reasonable measure of the number of cycles needed to run a section of code.
    #if defined(__i386__)
    static __inline__ unsigned long long rdtsc(void)
    {
        unsigned long long int x;
        __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
        return x;
    }
    #elif defined(__x86_64__)
    static __inline__ unsigned long long rdtsc(void)
    {
        unsigned hi, lo;
        __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
    }
    #endif

The above code was inserted into all of the C optimizations and de-optimizations. With this code, a call to rdtsc() yields the current clock-cycle count (from the CTC). When running the code, whether an optimization or a de-optimization, the number of clock cycles differs slightly from one execution to another, so we decided to run the code several times and compute the average. For this we used a version tester (VT). This program is simply an executable that takes a configuration file as an argument. Within the configuration file, the number of test iterations is specified along with a series of programs to test. For each program, there is a description and a run command line. For each specified program, the version tester runs the executable for the specified number of iterations. After each iteration, it captures the number of cycles (the number of cycles is written to stdout by the executable being tested). After completing the test iterations, the VT computes the average number of cycles per iteration.


2.4 Analysis and results

Many of the de-optimizations designed during the implementation worked as expected. For example, the Branch Density de-optimization consistently showed a 12% slowdown on the Opteron for all array sizes. This comported very well with the expected cost of missing the branch target buffer (BTB) on the Opteron so consistently. Other de-optimizations had a greater effect than anticipated. The Loop Re-Rolling and Dependency Chain de-optimizations showed very significant slowdowns for all array sizes. The expectation was that winnowing down the options available to the Opteron dynamic scheduler would have an impact, but not such a stark impact. The upshot of these surprises is that software engineers must consider the dynamic scheduler in order to write software that can run efficiently on the Opteron. Finally, some de-optimizations had little or an inconclusive effect. For example, the Branching Pattern de-optimization turned out to be very difficult to realize. A slowdown was achieved using random data, but no significant slowdown was achieved using patterned data. Since using random data to cause mispredictions yields no useful information about the Opteron, this de-optimization was deemed unsuccessful. Overall, modern CPUs are a wild cacophony of interacting threads and processes, which makes profiling optimizations (and de-optimizations) very difficult. It turns out that Hydra was a great place to test de-optimizations; its multiple nodes with multiple CPUs made it easy to run processes with relative confidence that the program being tested would not end up competing for resources. As a result, the data for the Opteron is very smooth and consistent, while the noisy data for the Intel Nehalem reflects its meager resources.


2.5 Detailed results

2.5.1 Branch density de-optimization

For this de-optimization, a program (mod_ten_counter) takes an array size as input. It then generates an array of the specified size in which each element is populated with an integer x where 0 <= x <= 9. (Note that the data for this array is not random; if it were, then branch mispredictions could artificially inflate the effect of the Branch Density de-optimization.) After this array is populated, the number of times that each integer x appears in the array is counted. The section of code being optimized/de-optimized is written in NASM. The cycles being counted include only the time that mod_ten_counter spends running its assembly code.

The optimized version counts the integer instances within the array using a structure that is much like a case/switch statement in C. However, the branch statements within this structure are spaced out using NOP instructions. These NOP instructions ensure that the optimized version maintains proper alignment and spacing for branch instructions so that branch target buffer (BTB) misses are not incurred. The important optimized section is below:

    cmp ecx, 0
    je  mark_0   ; We have a 0
    nop
    dec ecx
    je  mark_1   ; We have a 1
    nop
    dec ecx
    je  mark_2   ; We have a 2
    nop
    dec ecx
    je  mark_3   ; We have a 3
    nop
    dec ecx
    je  mark_4   ; We have a 4
    nop
    dec ecx
    je  mark_5   ; We have a 5
    nop
    dec ecx
    je  mark_6   ; We have a 6
    nop
    dec ecx
    je  mark_7   ; We have a 7
    nop
    dec ecx
    je  mark_8   ; We have a 8
    nop
    dec ecx
    je  mark_9   ; We have a 9

Notice that there is a NOP between each DEC/JE pair. This was done in order to create space between branches and to better align the branching instructions.

The de-optimized version counts the integer instances with much the same structure as the optimized version. However, it does not maintain proper alignment and spacing, so it should incur many BTB misses. The important de-optimized section is below:

    cmp ecx, 0
    je  mark_0   ; We have a 0
    dec ecx
    je  mark_1   ; We have a 1
    dec ecx
    je  mark_2   ; We have a 2
    dec ecx
    je  mark_3   ; We have a 3
    dec ecx
    je  mark_4   ; We have a 4
    dec ecx
    je  mark_5   ; We have a 5
    dec ecx
    je  mark_6   ; We have a 6
    dec ecx
    je  mark_7   ; We have a 7
    dec ecx
    je  mark_8   ; We have a 8
    dec ecx
    je  mark_9   ; We have a 9

Notice that, unlike the optimized version, there are no additional NOP instructions between the DEC/JE instructions. This means that the de-optimized version has a very tightly packed bunch of branches. Also note that, on average, the optimized version executes 5 more instructions per iteration, yet it significantly outperforms the de-optimized version.

Data

As can be seen from the slowdown percentages below, packing branches as densely as the de-optimization does can have a significant impact on run time. On the Opteron, it caused a ~10% slowdown for all array sizes; this slowdown is due to the branch target buffer misses that come with packing branches too densely. On the Nehalem, this de-optimization also had a big impact, though the reasons are less well understood. One point of note for the Nehalem is the length of its pipeline, which is 17 stages versus 12 on the Opteron. Thus, the additional slowdown could be due to a higher cost of BTB misses on the Nehalem.


Figure 2.1 Branch Density (mod_ten_counter): clock cycles vs. array size for the optimized and de-optimized versions on the Opteron.

Table 2.1: Branch Density de-optimization for the AMD Opteron and Intel Nehalem

Array Size   Opteron: Difference   Slowdown (%)   Nehalem: Difference   Slowdown (%)
10           47                    7.31           39                    9.92
100          185                   7.80           331                   27.36
1000         2234                  11.86          6355                  89.42
10000        23230                 12.69          67653                 110.38
100000       203760                10.98          537362                84.27
1000000      1620652               8.40           5306766               89.63
10000000     17263048              8.34           52082971              78.85


2.5.2 Unpredictable instruction de-optimization

For this de-optimization, a program (factorial_over_array) takes an array size as input. It then generates an array of the specified size in which each element is populated with an integer x where 0 <= x <= 12. After this array is populated, the factorial of each element of the array is calculated and written back into the array, overwriting the original element. The section of code being optimized/de-optimized is written in NASM. The cycles being counted include only the time that factorial_over_array spends running its assembly code.

The optimized and de-optimized versions are almost identical. Both compute the factorial, recursively, of an integer argument. However, the optimized version properly spaces the JNE and RET instructions so that the RET is not mispredicted. Below is the important optimized section:

    nop
    mov eax, [esp+4]   ; Get the integer whose factorial is
                       ; being calculated
    cmp eax, 1         ; Have we hit one yet?
    jne Calculate      ; If we haven't then do another call
    nop                ; This is here for alignment purposes
    ret                ; return with a 1

Notice that there is a NOP instruction between the JNE instruction and the RET instruction. This was done in order to create space between the two and to ensure that the RET instruction is properly aligned.

The de-optimized version, on the other hand, does not use proper spacing, and so it results in a large number of mispredictions on the RET instruction. Below is the important de-optimized section:

    mov eax, [esp+4]   ; Get the integer whose factorial is
                       ; being calculated
    cmp eax, 1         ; Have we hit one yet?
    jne Calculate      ; If we haven't then do another call
    ret                ; return with a 1

Notice that, unlike the optimized version, there is no NOP instruction between the JNE and the RET instruction. This means that the two branching instructions are adjacent and the RET instruction is misaligned.

Data

As can be seen from the slowdown percentages below, the cost of misaligning heavily used RET calls is high. On the Opteron, it caused a ~12% slowdown for all array sizes; this slowdown is due to the branch mispredictions engendered by misaligning RET. On the Nehalem, this de-optimization had an inconclusive impact. Therefore, it is reasonable to assume that the way the Nehalem indexes branching instructions must be quite different from the Opteron.


Figure 2.2 Unpredictable instructions (factorial_over_array) de-optimization: clock cycles vs. array size for the optimized and de-optimized versions on the Opteron and Nehalem.

Table 2.2: Unpredictable instructions de-optimization for the AMD Opteron and Intel Nehalem

Array Size   Opteron: Difference   Slowdown (%)   Nehalem: Difference   Slowdown (%)
10           80                    7.07           21                    2.58
100          1510                  0.88           2436                  32.15
1000         10429                 11.61          990                   1.24
10000        115975                12.76          36158                 5.33
100000       1139748               12.59          238852                3.54
1000000      11529624              12.66          3505949               5.42
10000000     175191774             18.90          3081557               0.51


2.5.3 Branch pattern de-optimization

For this de-optimization, the algorithm takes an array size and a pattern as inputs. It then generates an array of the specified size whose contents depend upon the pattern provided; this pattern essentially represents the branching pattern being tested. Once the array is constructed, the algorithm iterates over the array and replaces all of the odd elements with zero.

Only one executable was needed for this de-optimization. In this case, an optimized version is a branching pattern that can be easily managed by the CPU's branch prediction mechanism, while a de-optimization is a branching pattern that cannot be easily managed by the branch prediction mechanism. The code is written in C; the section of code being optimized/de-optimized is also written in C, and there is no assembly for this de-optimization. The cycles being counted include only the time taken to mark the odd elements of the array. The code that evaluates the array elements simply tests whether each element x (mod 2) is 1 or not. If it is, then the element is replaced with a 0. Below is the important code:

    unsigned long long pstart = rdtsc();
    for ( i = 0; i < size_of_array; i++ ) {
        if ( number_array[i] % 2 == 1 ) {
            number_array[i] = 0;
        }
    }
    printf( "Cycles=%d\n", ( rdtsc() - pstart ) );

Notice that each element, mod 2, is compared to 1. If it is equal, then the element is replaced with 0. We were never able to generate interesting results from this de-optimization. It is reasonable to conclude that the branch prediction mechanism on the Opteron is very good. Of course, many cycles were lost when random data was used. However, no branch prediction mechanism can find and use a pattern in random data; it would not then be random.

2.5.4 Float comparison de-optimization

For this de-optimization, a program (compare_two_floats) takes a number of iterations as input. It then generates two simple floats and tests them, over and over again, for the requested number of iterations. The section of code being optimized/de-optimized is written in C; there is no assembly for this de-optimization. The cycles being counted include only the time that compare_two_floats spends comparing the floating-point values.

The optimized version performs the comparison of floating-point values by casting the floats to integers and then performing an integer comparison. The important optimized section is below:

    #define FLOAT2INTCAST(f) (*((int *)(&f)))

    float t = f1 - f2;
    pstart = rdtsc();
    for (j = 0; j < numberof_iteration; j++) {
        if ( FLOAT2INTCAST(t) <= 0 ) {
            Count_numbers(i);
            count++;
        } else {
            count++;
        }
    }
    result = rdtsc() - pstart;
    printf( "Cycles=%d\n", result );

Notice that the two floats being compared, f1 and f2, are subtracted (outside of the section being timed), and the difference is then cast and compared to zero. Thus, no float comparison occurs.

The de-optimized version performs the comparison of floating-point values in the normal fashion, i.e. by straightforwardly comparing the floats. The important de-optimized section is below:

    float t = f1 - f2;
    pstart = rdtsc();
    for (j = 0; j < numberof_iteration; j++) {
        if ( f1 <= f2 ) {
            Count_numbers(i);
            count++;
        } else {
            count++;
        }
    }
    result = rdtsc() - pstart;
    printf( "Cycles=%d\n", result );

Notice that f1 and f2 are being compared in the straightforward fashion. Thus, they are being compared as floats.

Data

As can be seen from the slowdown percentages below, the cost of casting and then comparing integers is higher until the array size crosses a certain threshold; at this threshold, the cost of comparing floats begins to dominate. On the Opteron, the de-optimization caused a ~6% slowdown for array sizes greater than 10000; this slowdown is due to the fact that the floating-point data path on the Opteron is more costly than the integer data path. On the Nehalem, this de-optimization had a similar impact. Therefore, it is reasonable to assume that the Nehalem has a similar discrepancy between its integer and floating-point data paths.

Figure 2.3 Compare_two_floats: clock cycles vs. array size for the optimized and de-optimized versions on the Opteron and Nehalem.


Table 2-3: Compare two floats de-optimization

Array Size   AMD Opteron: Difference   Slowdown(%)   Intel Nehalem: Difference   Slowdown(%)
10           10991                     97.19         43                          26.70
100          10960                     86.18         471                         49.68
1000         9892                      37.97         5285                        59.90
10000        982                       0.60          52211                       58.82
100000       88962                     5.86          413932                      41.79
1000000      987559                    6.56          4298502                     44.46
10000000     9949917                   6.62          41906333                    46.14

2.5.5 Costly instruction de-optimization

For this de-optimization, the algorithm takes an array size as input. It then generates an array of the specified size in which each element is an integer x where 2^1 <= x <= 2^12; thus each element is a random power of two less than or equal to 2^12. After this array is populated, each element in the array is divided by two. The section of code being optimized/de-optimized is also written in C; there is no assembly for this de-optimization. The cycles being counted include only the time spent dividing each element of the array by 2.

Both the optimized and de-optimized versions do basically the same thing. The only difference is how each version divides each element of the array by 2. In the optimized version,


this is done by multiplying each element by 0.5. This means that the optimized version is able to use the multiple multiply data paths available on the Opteron. The important optimized section is below:

unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] * 0.5;
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );

Notice that the division of each element occurs by multiplying it by 0.5. Thus, no actual division occurs.

The de-optimized version does the same thing as the optimized version. It just uses division, instead of multiplication, to do it. This can be costly on an Opteron since it has more limited resources for division, i.e. a single data path. The important de-optimized section is below:

unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] / 2.0;
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );

Notice that the division of each element occurs by dividing it by 2.0. (Note that the use of 2.0, rather than 2, ensures that the de-optimization is processed as a floating point operation, just like the optimization.) Thus, division occurs with each iteration.

Data

As can be seen by the slowdown percentages below, the cost of using division when multiplication would suffice is very high. On the Opteron, it caused ~25% slowdown for all array sizes; this slowdown is due to the fact that the Opteron can only handle division on one of


its scalar pipelines. On the Nehalem, this de-optimization had little impact. Therefore, it is reasonable to assume that the Nehalem may have more resources for division available.

Figure 2-4: Costly instruction de-optimization (clock cycles vs. array size; optimized and de-optimized versions on the Opteron and Nehalem)


Table 2-4: Costly instruction de-optimization for the AMD Opteron and Intel Nehalem

Array Size   AMD Opteron: Difference   Slowdown(%)   Intel Nehalem: Difference   Slowdown(%)
10           42                        14.34
100
1000
10000
100000
1000000
10000000

2.5.6 Load store dependency de-optimization

For this de-optimization, the algorithm takes an array size as input. It then generates an array of the specified size in which each element is populated with an integer x where 0 <= x <= 9. After this array is populated, each element in the array has the previous element added to it; this effect ripples from the front of the array to the back of the array. The last element in the array is the sum of all elements in the array. Thus, this is basically a prefix sum.

The code is written in C. The section of code being optimized/de-optimized is also written in C; there is no assembly for this de-optimization. The cycles being counted include only the time that dependency spends summing array values.

The optimized version performs the additions by keeping the previous three array


elements in temporary variables; it creates few load store dependencies, since a previously computed value is available in a temporary and need not be written and then reloaded. The important optimized section is below:

int temp_prev = test_array[0], temp1, temp2;
unsigned long long start = rdtsc();
for ( i = 3; i < size_of_array; i += 3 ) {
    temp2 = test_array[i - 2] + temp_prev;
    temp1 = test_array[i - 1] + temp2;
    test_array[i - 2] = temp2;
    test_array[i - 1] = temp1;
    test_array[i] = temp_prev = test_array[i] + temp1;
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );

Notice that the value being added to the current element of the array comes from a temporary variable rather than being loaded, i.e. no load store dependency is (likely to be) created.

The de-optimized version, on the other hand, performs this computation in a more natural way. However, since it must wait for the previous element to have its value written before the next one can be calculated, many load store dependencies are created. The important de-optimized section is below:

unsigned long long start = rdtsc();
for ( i = 1; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] + test_array[i - 1];
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );

Notice that, unlike the optimized version, there are no temporary variables that can be used to prevent load store dependencies.


Data

As can be seen by the slowdown percentages below, the cost associated with these kinds of dependencies is very high. On the Opteron, it caused ~60% slowdown for all array sizes; this is due to the fact that the Opteron does not schedule stores in the way that it schedules other instruction types, leading to very costly dependency stalls. On the Nehalem, this de-optimization had a lesser impact. Therefore, it is reasonable to assume that its dynamic scheduler is better at managing stores.

Figure 2-5: Load store dependency de-optimization (clock cycles vs. array size; optimized and de-optimized versions on the Opteron and Nehalem)


Table 2-5: Load store dependency de-optimization for the Opteron and Intel Nehalem

Array Size   AMD Opteron: Difference   Slowdown(%)   Intel Nehalem: Difference   Slowdown(%)
10
100
1000
10000
100000
1000000
10000000

2.5.7 High latency instruction de-optimization

For this de-optimization, the algorithm takes an array size as input. It then generates an empty array of the specified size. After this array is created, it calculates the Fibonacci number associated with each index of the array. The executables associated with this de-optimization were compiled on Windows. The code is written in C. The section of code being optimized/de-optimized is written in NASM. The cycles being counted include only the time that fib spends running its assembly code.

The optimized and de-optimized versions are almost identical. Of course, both calculate the Fibonacci number associated with each index. However, the optimized version uses the combination of DEC and JNZ instructions in order to control its branching. Below is the important optimized section:


calculate:
    mov   edx, eax
    add   ebx, edx
    mov   eax, ebx
    mov   dword [edi], ebx
    add   edi, 4
    dec   ecx
    jnz   calculate

Notice that each iteration ends with DEC and JNZ instructions. Thus, the loop will end when the ECX register is zero.

The de-optimized version, on the other hand, uses a LOOP instruction instead of the DEC/JNZ combination. The important thing to note about the LOOP instruction is that it has a high latency (approximately 8 cycles) compared to DEC/JNZ. Below is the important de-optimized section:

calculate:
    mov   edx, eax
    add   ebx, edx
    mov   eax, ebx
    mov   dword [edi], ebx
    add   edi, 4
    loop  calculate

Notice that, unlike the optimized version, the loop is controlled by the LOOP instruction. Just like the optimized version, the loop will end when the ECX register is zero.

Data

As can be seen by the slowdown percentages below, the cost of the LOOP instruction is high. On the Opteron, it caused ~17% slowdown for all array sizes; this is due solely to the very high latency of the LOOP instruction. On the Nehalem, this de-optimization had a greater impact. It is hard to speculate on what might cause the Nehalem to have such a poor


implementation of LOOP.

Figure 2-6: Fib de-optimization (clock cycles vs. array size; optimized and de-optimized versions on the Opteron and Nehalem)

Table 2-6: Fib de-optimization for the Opteron and Intel Nehalem

Array Size   AMD Opteron: Difference   Slowdown(%)   Intel Nehalem: Difference   Slowdown(%)
10           17                        9.04          9                           8.57
100          155                       29.4          244                         81.87
1000         1386                      34.74         2239                        104.52
10000        14588                     22.03         19519                       62.16
100000       123271                    17.91         256839                      87.42
1000000      1206678                   16.94         2430301                     81.51
10000000     11716747                  14.07         19896396                    65.26


2.5.8 If condition de-optimization

For this de-optimization, the algorithm takes an array size as input. It then generates an array of the specified size in which each element is populated with floats x where 0.5 <= x <= 11.0. After this array is populated, each element in the array is evaluated with an if-then statement in order to determine whether to increment or decrement a dummy variable. The executables associated with this de-optimization were compiled on Windows. The code is written in C. The section of code being optimized/de-optimized is also written in C; there is no assembly for this de-optimization. The cycles being counted include only the time spent evaluating the if-then statement.

Both the optimized and de-optimized versions do basically the same thing. The only difference is the order in which they each evaluate the clauses of the if-then statement. One of these clauses is very easy and fast to evaluate; it is a comparison to zero. The other clause is hard and time consuming to evaluate; it is a floating point comparison. The important optimized section is below:

int dummy = 0, mod = 0;
unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    mod = ( i % 2 );
    if ( mod == 0 && test_array[i] > 1.5 ) {
        dummy++;
    } else {
        dummy--;
    }
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );

Notice the ordering of the clauses within the if-then statement. This is an optimal ordering since the floating point comparison clause need only be evaluated half of the time in


order to determine the value of the conditional.

Again, the de-optimized version does the same thing as the optimized version. It just orders the clauses of the if-then statement sub-optimally. The important de-optimized section is below:

int dummy = 0, mod = 0;
unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    mod = ( i % 2 );
    if ( test_array[i] > 1.5 && mod == 0 ) {
        dummy++;
    } else {
        dummy--;
    }
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );

Again, notice the ordering of the clauses within the if-then statement. This is a sub-optimal ordering since the floating point comparison clause needs to be evaluated each and every time in order to determine the value of the conditional.

Data

As can be seen by the slowdown percentages below, the cost of improperly ordering clauses within a conditional is very high. On the Opteron, it caused ~37% slowdown for all array sizes; this is due to being forced to evaluate the most expensive of the conditional clauses for each iteration; therefore, a simple switch in ordering can make this program 37% faster. On the Nehalem, this de-optimization had a similar impact.


Figure 2-7: If condition de-optimization (clock cycles vs. array size; optimized and de-optimized versions on the Opteron and Nehalem)

Table 2-7: If condition de-optimization for the AMD Opteron and Intel Nehalem

Array Size   AMD Opteron: Difference   Slowdown(%)   Intel Nehalem: Difference   Slowdown(%)
10           207                       26.90
100
1000
10000
100000
1000000
10000000


2.5.9 Loop re-rolling de-optimization

For this de-optimization, the algorithm takes an array size as input. It then generates an array of the specified size in which each element is populated with integers x where 0 <= x <= 9. After this array is populated, two more arrays are built; one will hold the squares (x^2) of each element in the integer array and the other will hold the cubes (x^3) of each element in the array. After these arrays are built, the squares and cubes arrays are populated by iterating over the elements of the integer array. The section of code being optimized/de-optimized is also written in C; there is no assembly for this de-optimization. The cycles being counted include only the time that the executables spend filling out the squares and cubes arrays from the original.

The optimized version fills out the squares and cubes arrays in a single loop, since the integer array is used to populate both. This is ideal since it gives the dynamic instruction scheduler more flexibility when scheduling instructions to run. The important optimized section is below:

unsigned long long pstart = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
    cubic_array[i] = load_store_array[i] * load_store_array[i] * load_store_array[i];
}
printf( "Cycles=%llu\n", ( rdtsc() - pstart ) );

Notice that the calculations that populate the elements of each array, squares and cubes, are within the same loop.


The de-optimized version fills out the squares and cubes arrays using separate loops. This is sub-optimal since it takes away some of the flexibility that the dynamic instruction scheduler might otherwise have when scheduling instructions to run. The important de-optimized section is below:

unsigned long long pstart = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
}
for ( i = 0; i < size_of_array; i++ ) {
    cubic_array[i] = load_store_array[i] * load_store_array[i] * load_store_array[i];
}
printf( "Cycles=%llu\n", ( rdtsc() - pstart ) );

Notice that the calculations that populate the elements of each array, squares and cubes, are within separate loops.

Data

As can be seen by the slowdown percentages below, the cost of not combining loops that can be combined is very high. On the Opteron, it caused ~50% slowdown for all array sizes; this is solely due to the fact that splitting the computations into separate loops isolates them such that the dynamic scheduler cannot schedule them together; some of the flexibility that it might have had otherwise has been lost. On the Nehalem, this de-optimization had an impact, but a lesser one. It is hard to imagine why this is less costly on the Nehalem.


Figure 2-8: Loop re-rolling de-optimization (clock cycles vs. array size; optimized and de-optimized versions on the Opteron and Nehalem)

Table 2-8: Loop re-rolling de-optimization for the AMD Opteron and Intel Nehalem

Array Size   AMD Opteron: Difference   Slowdown(%)   Intel Nehalem: Difference   Slowdown(%)
10           189                       43.95         83                          28.32
100          1653                      53.68         357                         15.35
1000         21311                     54.96         4696                        21.28
10000        234336                    52.42         39311                       17.10
100000       2352328                   50.93         134786                      6.05


2.5.10 Dependency chain de-optimization

For this de-optimization, the algorithm takes an array size as input. It then generates an array of the specified size in which each element is populated with integers x where 0 <= x <= 20. After this array is populated, all of the elements of the array are summed into a single integer variable. The section of code being optimized/de-optimized is also written in C; there is no assembly for this de-optimization. The cycles being counted include only the time that the executables spend adding the elements of the array.

The optimized version adds the elements of the array by striding through the array in four-element chunks and adding elements to four different temporary variables. After the array has been processed, the four temporary variables are added. The advantage of the optimized version is that it creates four large dependency chains instead of one massive one. Thus, the dynamic scheduler has many more options when scheduling instructions. The important optimized section is below:

int sum = 0, sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i += 4 ) {
    sum1 += test_array[i];
    sum2 += test_array[i + 1];
    sum3 += test_array[i + 2];
    sum4 += test_array[i + 3];
}
sum = sum1 + sum2 + sum3 + sum4;
printf( "Cycles=%llu\n", ( rdtsc() - start ) );

Notice that the summing occurs across four temporary variables. These temporary


variables are then added after the array has been processed.

The de-optimized version sums each element of the array into one variable. This creates a massive dependency chain that quickly exhausts the scheduling resources of the dynamic scheduler. The important de-optimized section is below:

int sum = 0;
unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    sum += test_array[i];
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );

Notice that the summing occurs using one variable only.

Data

As can be seen by the slowdown percentages below, the cost of creating a massive dependency chain is high. On the Opteron, it caused ~150% slowdown for all array sizes; this is solely due to the exhaustion of the scheduling resources of the dynamic scheduler; exhausting these resources effectively makes the program run sequentially (no ILP). On the Nehalem, this de-optimization had a large impact, but not quite as large as on the Opteron. One can only imagine that it has more scheduling resources at its disposal.


Figure 2-9: Dependency chain de-optimization (clock cycles vs. array size; optimized and de-optimized versions on the Opteron and Nehalem)

Table 2-9: Dependency chain de-optimization for the AMD Opteron and Intel Nehalem

Array Size   AMD Opteron: Difference   Slowdown(%)   Intel Nehalem: Difference   Slowdown(%)
10
100
1000
10000
100000
1000000
10000000


3 Building tools to optimize weaknesses in the GCC compiler

3.1 Compiler

There are many definitions of a compiler, but the simplest one is a program that converts other programs from a high level language to a low level language. Compilers differ from one language to another. However, compilers can be similar to each other if their languages belong to the same family. For example, C++, C# and Java are languages that extend C, so compilers for those languages are similar in structure and operation. Programs are written in high level languages instead of low level languages for several reasons. One reason is that programs in high level languages are much shorter than programs written in low level languages; high level languages help users or programmers to write less code, and to write it more efficiently. Another advantage is that programs can be compiled to different low level languages and then run on different machines. However, programs in high level languages are slower to produce as executables than programs written in machine language because of the time spent converting from the high level language to the low level language. So, the most efficient compiler converts from the high level language to the low level language in the shortest time [6] [7].

Every compiler works in phases; the number and sequence of phases differ from one compiler to another. The first phase in a compiler is lexical analysis. Lexical analysis is the operation that reads a program as text, and then divides the whole program into tokens corresponding to the rules of the language. The second phase in a compiler is syntax analysis. Syntax analysis takes the output from the previous phase and represents it as a tree corresponding to the structure of the program. The third phase is type checking. Type checking is responsible for checking a program to see if it


follows the rules and regulations of the grammar and structure. The fourth compiler phase is intermediate code generation. Intermediate code generation is responsible for translating programs to a machine-independent intermediate language. The fifth compiler phase is register allocation. The variables and names used in the previous phase (intermediate code generation) are translated to numbers corresponding to the registers of the machine. The sixth compiler phase is called machine code generation. In machine code generation, the code in the intermediate language is converted to assembly language. The last compiler phase is assembly and linking. In the assembly and linking phase, the assembly code is translated to binary form. Lexical analysis, syntax analysis and type checking are called the frontend of a compiler, while register allocation, machine code generation, and assembly and linking are called the backend of a compiler. Most compiler optimizations are done in these two parts, the frontend and the backend of the compiler. The optimization procedure differs from one compiler to another, but there are basic optimizations that work in all of them [7].

3.2 GCC compiler

GCC is one of the most popular compilers and is used for different languages. GCC is used widely because it is open source and has very effective tools to optimize programs efficiently. Currently, GCC supports six languages: C, C++, Objective-C, FORTRAN, Java and Ada. Another language, COBOL, will be added in the future. GCC can be considered a part of the GNU project. The GNU project was launched in 1984 to complete the UNIX operating system as free software. There have been many versions of GCC from the first version until today [8].


The basic language of the GCC compiler is C, and the GCC compiler itself was written in C. So, in the beginning, the GCC compiler was written as a C compiler, and after that several languages were added to it. The first language that was added to the GCC compiler is C++, which is an extension of the C language. Furthermore, Objective-C, another language extended from C, was also added to the GCC compiler. Another important language added to the GCC compiler is FORTRAN. FORTRAN is considered stronger than C because FORTRAN is a rich language with math libraries; it can do complex operations and it deals with complex numbers. Another important language added to GCC is Java. Java is different from other languages because the form of its compiled code is different, and the executable is executed by a Java virtual machine. The last language added to the GCC compiler is Ada, which was developed by Ada Core Technologies and added to the GCC compiler in 2001. The Ada language is designed to handle large programs [8] [9].

3.3 GCC optimizations

Compilers basically aim to make programs faster and more efficient and to reduce the size of code. The most important goal of compiler optimizations is the speedup of programs. Code generated by optimizing compilers is called optimized code. Optimized code can be produced by several changes to the source code, in order to make that code efficient and run faster than the original code. There are several ways in the GCC compiler to decide whether a particular piece of code can be improved or not, such as control flow analysis and data flow analysis. Control flow analysis is used to see how the control


statements are used, and to see the path taken from the beginning of a control statement to its end. The other is data flow analysis, which is used to identify how data flows through programs [10] p101-p117.

There are several optimization techniques used in the GCC compiler to optimize programs. Copy propagation is one of the optimizing techniques used to reduce redundant information. In this method, constant values are used instead of copies of values. Consider the following example:

i = 10;
x[j] = i;
y[min] = b;
b = i;

The previous code will be optimized to the following code by the GCC compiler:

i = 10;
x[j] = 10;
y[min] = b;
b = 10;

In the previous example, in the optimized code, values are used directly to set the variables x[j] and b, while in the original code, a copy operation is used to copy the values. The copy operation costs more time than just setting the constant values as in the optimized code [10] p104.

There is another GCC compiler optimization called dead code elimination. Dead code elimination is a GCC compiler optimization that removes extra or unused code in programs, or removes code that does not affect the sequence of


operations or the result of programs. Much code is written with extra calculations or extra variables, called dead code. Also, when we have code that is never reached by the compiler, this code is called unreachable code. For example:

if (i == 10) {
    ...
} else {
    if (i == 10) {
        ...
    }
}

In the above example, the second condition will never succeed, because either the first statement is taken or neither of them is. So, in all cases, the second statement will never be reached; this phenomenon is called unreachable code, and the optimization that removes it is called unreachable code elimination.

The GCC compiler, and especially the latest versions, has different and very smart techniques to optimize code. The GCC compiler has optimizations called architecture independent optimizations. The most interesting thing about these optimizations is that they do not depend on the features of a given architecture or a given processor. So, these optimizations can be applied across different architectures and processors. The GCC compiler has different optimization switches such as -O, -O1, -O2, -O3, -Os, -O0 and nothing. Passing nothing or -O0 turns off all GCC optimizations. -O1 turns on the first level of GCC optimizations, which is level one. -O2 and -O3 turn on level two and level three of GCC optimizations; they add to the first level of GCC optimizations. -O enables the basic GCC optimizations, and it attempts to reduce the size of code and execution time without


increasing the compilation time. -Os attempts to reduce the size of code only, without considering the compilation time [10].

In the GCC compiler, you can turn off one of the optimizations by passing -fno- followed by the name of the optimization that we want to disable. Suppose we want to disable the optimization that guesses branch probabilities; the disabling command will be like this:

gcc filename.c -o filename -O1 -fno-guess-branch-probability

The original name of the guess branch probabilities optimization is -fguess-branch-probability [10] p108.

In the GCC compiler, as we mentioned before, there are three levels of GCC optimizations. The first level is called level one GCC optimizations. Level one has many optimizations that reduce the size of code and improve time consumption. -ftree-dce is one of the level one GCC optimizations; it eliminates unreachable code or dead code in programs. -fif-conversion is another optimization in level one, which can change conditional jumps into non-branching code. There is also another optimization in level one that predicts branch probabilities in programs, called -fguess-branch-probability.

In level two, there are many optimizations added to the first level, including loop transformation, mathematical optimizations and jump handling. One of the mathematical optimizations is called -fpeephole. This optimization replaces long instruction sequences with shorter ones. For example, suppose we have the following code:

A = 2;


for ( i = 1; i < 10; i++ )
    A += 2;

The GCC compiler, by applying the second level of optimization, can replace the above code with A = 20. So, this optimization saves the time of many iterations. There is also another optimization in level two that can improve recursive functions, called -foptimize-sibling-calls. This optimization attempts to eliminate calls of functions and replace them with the code of the function itself. In general, calls of a function consume much time, so replacing them with the code itself makes programs faster. There are also other optimizations in level two that can reduce the size of code. The command -Os in GCC attempts to reduce the size of code, and that improves memory use. -Os applies all level two optimizations with the exception of optimizations that enlarge program code. Level three optimizations turn on all level one and level two optimizations plus additional optimizations such as -funswitch-loops, -fgcse-after-reload and -finline-functions [10] p110-p112.

Furthermore, there are extra optimizations called specific optimizations. One of the specific optimizations is -fbounds-check, which can check code to validate the array indices used for array access. -ffunction-cse is one of the specific optimizations; it stores function addresses in registers. There is also another specific optimization called -finline-functions, which expands simple functions into the calling function. -ffloat-store is another specific optimization, which disables storing floating point values in registers. The specific optimizations are not enabled by -O1, -O2 or -O3; basically, they are enabled either by using the flag -f followed by the name of the optimization, or by -O because -O turns on all GCC optimizations [10].


3.4 Our optimizations to the GCC compiler

In our implementation, we found several weaknesses in the GCC compiler, so we addressed these weaknesses and then built a tool that contains all of the resulting optimizations. The weaknesses that we found in the GCC compiler involve the division operation, loops and recursive functions, loop re-rolling, loop unrolling, the power operation, the square root mathematical operation and function calls.

The division operation consumes much more time than multiplication, so our goal in this optimization technique is to replace the division operation with the multiplication operation to make programs efficient and faster. Functions are used in many programs, and function calls are similar to the iterations used in loops, but function calls are much more time consuming than iterations in loops. Therefore, in programs that have such function calls, the calls can be replaced by loops in order to make programs faster. Loop re-rolling is one of the optimization techniques used in many compilers to optimize loops in programs. The basic idea in loop re-rolling is to combine several loops into one loop, which obviously reduces the number of instructions, because in general the instructions in one combined loop are fewer than in separate individual loops. So, the main goal of this optimization technique is to combine several loops into one loop, when possible, to save time and make programs faster. We also implemented another optimization technique that can optimize loops, called loop unrolling. Loop unrolling is one of the optimization techniques used in many compilers to enhance the execution time of code that has loops. The main idea in loop unrolling is to reduce the number of instructions. Many programs spend most of their time executing a small portion of the code; it is often said that many programs spend 90% of the time


in executing 10% of the code [11]. So, reducing the time consumed in that small portion of the code can make programs much faster. Loop unrolling can improve instruction level parallelism and improve memory hierarchy locality [11]. Loop unrolling can be done on many loops, but not on all loops, because some code is difficult to unroll, such as code with two dimensional arrays and pointers. The basic idea in loop unrolling is to replicate the loop body several times and reduce the number of iterations, which makes programs faster. However, there is a disadvantage to the loop unrolling technique, which is that it increases the size of the code and therefore affects the memory. The main drawback is overflowing the instruction cache. When the instruction cache is overflowed, the optimized code is affected and programs become slower. This case can happen when the size of the instruction cache is small and the size of the original code is large. There is also another issue regarding loop unrolling, which is choosing the unrolling factor. There is an important question: how many times should we unroll a given loop? We introduced this question in order to help users or programmers choose the best unrolling factor. Choosing a good unrolling factor depends on several factors such as the size of the instruction cache, the size of the code and the number of instructions in the program. In our implementation, we help the programmer or user find the best unrolling factor when a loop unrolling opportunity is found. Our tool requires as input the size of the instruction cache, the length of an individual instruction (because the length of an instruction differs from one architecture to another), and the number of iterations of the loop. The output of our tool is the best maximum unrolling factor. The programmer or user should not exceed this unrolling factor, because doing so will overflow the

PAGE 57

49 instruction cache, and then will degrade the performance or the speed of programs. In addition to those optimization techniques, there is another one which is optimizing power operation. Power function in C or C++ (POW()) used in many mathematical programs. Power operation can be replaced by multiplication and definitely both of them give the same result. We want to replace power by multiplication because power operation is much expensive than multiplication, so replacing power with the multiplication will make programs faster. Also, there is another optimization technique that included in our tool which is optimizing square root mathematical operation (SQRT()). SQRT() is one of the functions in C and C++ that used in many programs to find the square root of a number. SQRT() func tion is much expensive than division, and we can see that easily from the results of our implementation. So, in our optimization technique, we are replacing the SQRT() function with the division operation to make programs faster. The last optimization tech nique that we implemented in our tool is optimizing function calls. Suppo se, when we have a function in For Loop that used as a constant, so taking out this function from the loop, will save much time because the calls of a function will be equal to the n umber of iterations in that loop. Therefore, in our optimization technique, we are taking out a function from a loop when this is po ssible to make programs faster. 3.5 Methods, a nalysis and results 3.5.1 Division vs. m ultiplication We designed a benchmark which tests the division in order to see if the compiler can optimize it or not. We used an array populated with power of integers from (1 12). We divided each element in the array by 5, and run the program for different size arrays, and then we collected the results. Also, we implemented our optimized benchmark. In our

PAGE 58

50 optimized code, we multiplied each element in the array by 0.2, and we know the two codes give the same result; the only difference between those two codes is using multiplication instead o f division. In the implementation, we compiled each code in two different types; one with disabling all GCC optimizations and with enabling all GCC optimizations. After doing the implementation, we can see that the division has a real impact on the GCC com piler. The average slowdown percentage is 118%, and from this point of view, we can tell the GCC compiler has very limited optimization ability in this field. The goal from this benchmark is to tell the programmer not using the division even if you are re lying on the compiler because the compiler cannot optimize this kind of code. If you have an option to change the division with multiplication then changes it because the division degrades the performance comparing with multiplication. We dep end on time in milli seconds to see which one is faster than the other. The size of the array that used in this optimization technique is 1000000 to 10 00000 for time measured in milli seconds. We implemented our optimization using large array size for measuring time in mi lli seconds because the CPU is really fast. For example, if you use small size of array, the time is zero most of the time. Furthe rmore, we measured time in milli seconds to see the effects of our implementations in real time. Here are the results and flow c harts for our implementat ions (depending on time in milli seconds): A: Division code without compiler optimization. B: Multiplication code without compiler optimization. C: Division code with compiler optimization.
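The replacement described above can be sketched as two small functions. This is a minimal illustration of the idea, not the thesis's benchmark code; the function names and the use of `std::vector` are assumptions made for the example, and the reciprocal 0.2 is only exact enough for a tolerance-based comparison.

```cpp
#include <vector>
#include <cstddef>

// Hypothetical helpers illustrating strength reduction: x / 5 vs. x * 0.2.
std::vector<double> divide_by_five(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] / 5.0;      // division: typically many more cycles
    return out;
}

std::vector<double> times_point_two(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] * 0.2;      // multiplication by the reciprocal
    return out;
}
```

Because 0.2 is not exactly representable in binary floating point, the two versions can differ in the last bit, which is why a real tool would apply this rewrite only when such rounding differences are acceptable.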


D: Multiplication code with compiler optimization.

Table 3-1: Time in milliseconds for the Division vs. Multiplication optimization technique.

Figure 3-1: Division vs. multiplication, run time in milliseconds against array sizes from 1,000,000 to 10,000,000, series A-D.


3.5.2 Loop and recursive function

We designed another benchmark that tests whether the GCC compiler can optimize function calls. Function calls cost much more time than other programming constructs. In our implementation, we wrote two benchmarks; both calculate the factorial of a number, and we repeat this operation many times to see the performance difference between them. The original benchmark calculates the factorial using a recursive function, so a function is called several times to compute the factorial. The other benchmark, our optimized one, calculates the factorial using a loop instead of calling a function repeatedly. Using a loop instead of a function call optimizes the program; the GCC compiler is weak at performing this kind of optimization.

We measured time in milliseconds to see the difference in performance between the two benchmarks. We ran the programs with iteration counts from 1,000,000 to 10,000,000 and collected the times in order to see clearly which benchmark performs better. The results show that our optimized code (the loop version) is much faster than the function-call version even when the latter is optimized by GCC. Let us see the codes, results and graphs.

The original code, using a recursive function:

int tail(int factorial) {
    if (factorial > 1) {
        result = result * factorial;
        tail(factorial - 1);
    }
    return result;
}

Our optimized code, using a loop instead of calling the function every time:

int tail(int factorial) {
    temp = factorial;
    for (i = 1; i < factorial; i++) {
        temp = temp * i;
    }
    return temp;
}
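The listings above rely on globals (`result`, `temp`, `i`), as in the thesis. A self-contained sketch of the same comparison, with illustrative function names of my own choosing, looks like this:

```cpp
// Recursive factorial: one call frame per step, so call overhead dominates.
long long factorial_recursive(int n) {
    if (n <= 1) return 1;
    return n * factorial_recursive(n - 1);
}

// Iterative factorial: same result, no call overhead inside the hot path.
long long factorial_loop(int n) {
    long long result = 1;
    for (int i = 2; i <= n; ++i)
        result *= i;
    return result;
}
```

Both functions return identical values for small n (up to 20, after which a 64-bit integer overflows), so a benchmark can swap one for the other without changing program behavior.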

Figure 3-2: Loop vs. recursive function, time in milliseconds for iteration counts from 1,000,000 to 10,000,000, series A-D.

3.5.3 Loop re-rolling

We designed another benchmark that tests whether the GCC compiler can optimize loops using the loop re-rolling technique. There are two benchmarks in this experiment: the original benchmark and our optimized benchmark. The two benchmarks do the same job; they calculate square numbers and cubic numbers. We used three arrays: load_store_array, quadratic_array and cubic_array. load_store_array is initialized randomly with integers from 0 to 9. After this array is populated, the two other arrays are built. The original benchmark calculates the squares and cubes in two separate loops, while the other benchmark calculates them in a single loop. We developed these benchmarks to test the GCC compiler because loop re-rolling is a very important technique used by many compilers for optimizing loops.

In our implementation, we measured time in milliseconds to see which benchmark performs better. The time is measured only for the part of the code that we want to optimize, not for the whole program, which gives a clear picture of the performance. The GCC compiler cannot do this kind of optimization: when we tested both programs (the original code and our optimized code), our optimized code was clearly much faster. We ran both benchmarks with different sizes from 10 to 1000. In this optimization technique, our optimized code is approximately 40% faster than the original code. From this percentage, we can see that the GCC compiler has a significant weakness in optimizing loop re-rolling. Let us see the codes, results and graphs.

The original code:

for (i = 0; i < size; i++) {
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
}
for (i = 0; i < size; i++) {
    cubic_array[i] = load_store_array[i] * load_store_array[i] * load_store_array[i];
}

Our optimized code:

for (i = 0; i < size; i++) {
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
    cubic_array[i] = quadratic_array[i] * load_store_array[i];
}
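The single-loop rewrite described above can be expressed as two compilable functions. This is an illustrative sketch, not the benchmark's exact code; the function names and the `std::vector` interface are assumptions made for the example.

```cpp
#include <vector>
#include <cstddef>

// Two separate passes over the data: a[i] is loaded twice per element overall.
void squares_and_cubes_separate(const std::vector<int>& a,
                                std::vector<int>& sq, std::vector<int>& cu) {
    for (std::size_t i = 0; i < a.size(); ++i) sq[i] = a[i] * a[i];
    for (std::size_t i = 0; i < a.size(); ++i) cu[i] = a[i] * a[i] * a[i];
}

// One fused pass: each element is loaded once and the square is reused
// to build the cube, so there is less loop overhead and less memory traffic.
void squares_and_cubes_fused(const std::vector<int>& a,
                             std::vector<int>& sq, std::vector<int>& cu) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        int v = a[i];
        sq[i] = v * v;
        cu[i] = sq[i] * v;
    }
}
```

The fusion is only legal when the loops have the same bounds and no cross-loop dependence, which is the case here because each iteration writes disjoint elements.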

Table 3-3: Time in milliseconds for the Loop Re-Rolling optimization technique.

Figure 3-3: Loop re-rolling, run time in milliseconds for sizes from 10 to 100, series A-D.


3.5.4 Loop unrolling

We designed another benchmark that tests whether the GCC compiler can optimize loops using the loop unrolling technique. We chose this optimization because loops are used widely in many programs, so optimizing them can significantly improve many programs. There are two benchmarks in our implementation: the original benchmark and our optimized benchmark. Both do the same job with a different strategy. In the original benchmark, we multiplied each element of the array by itself and stored the result back into the array. In our optimized benchmark, we opened up the source code, meaning that we repeated the statement inside the loop body several times to reduce the number of iterations; reducing the number of iterations speeds up the benchmark. In our optimized benchmark we repeated the loop body so that the number of iterations is reduced by about 80%; for example, if the array size is 100, the number of iterations becomes 20 instead of 100.

In our implementation, we measured time in seconds to see which benchmark is better optimized. The time is measured only for the part of the code that we want to optimize, not for the whole program, which gives a clear picture of the performance. The GCC compiler cannot do this kind of optimization: when we tested both programs (the original code and our optimized code), our optimized code was clearly much faster. We ran both benchmarks with different sizes from 10,000,000 to 900,000,000. From the results, this optimization technique brings a larger gain than our previous techniques, because our optimized benchmark without GCC optimization is faster even than the original benchmark with GCC optimizations enabled.

The original code:

for (i = 0; i < size_of_array; i++) {
    test_array[i] = test_array[i] * test_array[i];
}

Our optimized code:

for (i = 0; i < size_of_array; i += 6) {
    test_array[i]   = test_array[i]   * test_array[i];
    test_array[i+1] = test_array[i+1] * test_array[i+1];
    test_array[i+2] = test_array[i+2] * test_array[i+2];
    test_array[i+3] = test_array[i+3] * test_array[i+3];
    test_array[i+4] = test_array[i+4] * test_array[i+4];
    test_array[i+5] = test_array[i+5] * test_array[i+5];
}

A: The original code without GCC optimizations.
B: The original code with GCC optimizations.
C: Our optimized code (unrolled code) without GCC optimization.
D: Our optimized code (unrolled code) with GCC optimizations.
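Two details of the discussion above can be made concrete in a sketch: a cap on the unrolling factor so the unrolled body still fits the instruction cache, and an epilogue loop so the unrolled version stays correct when the array size is not a multiple of the factor. The function names and parameters below are illustrative assumptions, not the tool's actual interface.

```cpp
#include <vector>
#include <cstddef>

// Upper bound on the unrolling factor: how many copies of the loop body fit
// in the instruction cache, capped by the iteration count (hypothetical model).
std::size_t max_unroll_factor(std::size_t icache_bytes,
                              std::size_t body_bytes,   // code size of one body
                              std::size_t iterations) {
    std::size_t by_cache = icache_bytes / body_bytes;
    if (by_cache == 0) by_cache = 1;
    return by_cache < iterations ? by_cache : iterations;
}

// Unrolled-by-6 squaring loop with an epilogue for the leftover elements,
// so it is correct for any array size (the listing above assumes size % 6 == 0).
void square_unrolled(std::vector<int>& a) {
    std::size_t i = 0;
    for (; i + 6 <= a.size(); i += 6) {
        a[i]   *= a[i];   a[i+1] *= a[i+1];
        a[i+2] *= a[i+2]; a[i+3] *= a[i+3];
        a[i+4] *= a[i+4]; a[i+5] *= a[i+5];
    }
    for (; i < a.size(); ++i) a[i] *= a[i];   // epilogue: remainder iterations
}
```

For example, with a 32 KB instruction cache and a 64-byte loop body, the cap works out to 512 copies; the iteration count then limits the factor further for short loops.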


Table 3-4: Time in seconds for the Loop Unrolling optimization technique.

Figure 3-4: Loop unrolling, run time in seconds against array size, series A-D.

3.5.5 Power vs. multiplication

We designed another benchmark that tests whether the GCC compiler can optimize the power operation. The pow() function is slower than multiplication because extra time is spent calling the function and executing a subroutine with its own variables, in addition to performing the multiplication itself. We designed this benchmark because multiplication is much faster than the power operation, and the power operation is used widely in mathematical programs, so optimizing it helps improve many programs.

There are two benchmarks: the original benchmark and our optimized benchmark. Both do the same job; they calculate the power of numbers. We used one array, test_array[], and the result is stored back into the same array. test_array[] is initialized randomly with different integers. The original benchmark raises every element of test_array[] to a power using the pow() function, while the other benchmark computes the same powers using multiplication instead of pow().

In our implementation, we measured time in seconds to see which benchmark is better optimized. The time is measured only for the part of the code that we want to optimize, not for the whole program, which gives a clear picture of the performance. The GCC compiler cannot do this kind of optimization: when we tested both programs (the original code and our optimized code), our optimized code was clearly much faster. We ran both benchmarks with different sizes from 10,000,000 to 90,000,000. From the results, this optimization technique brings a larger gain than our previous techniques, because our optimized multiplication benchmark without GCC optimization is faster even than the original benchmark with GCC optimizations enabled. Let us see the codes, results and graphs.


The original power code:

for (i = 0; i < size_of_array; i++) {
    test_array[i] = pow(test_array[i], 2);
}

Our optimized code:

for (i = 0; i < size_of_array; i++) {
    test_array[i] = test_array[i] * test_array[i];
}

A: The original power code without GCC optimizations.
B: Our optimized code (multiplication code) without GCC optimization.
C: The original power code with GCC optimizations.
D: Our optimized code (multiplication code) with GCC optimizations.

Table 3-5: Time in seconds for the Power vs. Multiplication optimization technique.
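The rewrite above handles the exponent 2; for larger integer exponents the same idea generalizes to square-and-multiply, which replaces the pow() call with a handful of multiplications. This helper is an illustrative sketch with a hypothetical name, not part of the thesis's tool.

```cpp
// Exponentiation by squaring: computes x^n with O(log n) multiplications,
// avoiding the pow() library call entirely for non-negative integer exponents.
double int_pow(double x, unsigned n) {
    double result = 1.0;
    while (n > 0) {
        if (n & 1) result *= x;  // fold in the current bit of the exponent
        x *= x;                  // square the base
        n >>= 1;
    }
    return result;
}
```

Note that pow() on a floating-point exponent can round differently from repeated multiplication, so a compiler applies this rewrite only for exact integer exponents.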


Figure 3-5: Power vs. multiplication, run time in seconds against array size, series A-D.

3.5.6 SQRT function vs. division

We designed another benchmark that tests whether the GCC compiler can optimize the sqrt() function, which is used in the C language to calculate square roots. We compared sqrt() with division because a sqrt() call spends much more time than a division. sqrt() is used in many mathematical benchmarks, so optimizing it speeds up those programs.

In this implementation, we wrote two codes that do the same job with different mechanisms. There are two benchmarks: the original benchmark uses sqrt() while our optimized benchmark uses division. In the original benchmark, we multiplied each element of the array by sqrt(Number1), while in our optimized code we multiplied each element by Number1/Number2. We used different numbers, and the results are almost the same, because the main point of this implementation is to show that sqrt() spends extra time on the function call itself, while our optimized benchmark performs only a division. In both codes, the sqrt() call or the division is repeated once per iteration.

In our implementation, we measured time in seconds to see which benchmark is better optimized. The time is measured only for the part of the code that we want to optimize, not for the whole program, which gives a clear picture of the performance. The GCC compiler cannot do this kind of optimization: when we tested both programs (the original code and our optimized code), our optimized code was clearly much faster. We ran both benchmarks with different sizes from 10,000,000 to 100,000,000. From the results, this optimization technique brings a larger gain than our previous techniques, because our optimized benchmark without GCC optimization is faster even than the original benchmark with GCC optimizations enabled.

A: Sqrt() function code without GCC optimization.
B: Our optimized division code without GCC optimization.
C: Sqrt() function code with GCC optimization.
D: Our optimized division code with GCC optimization.
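Since the sqrt() argument in the benchmark above is loop-invariant, a closely related and always-safe rewrite is to hoist the call out of the loop so it executes once instead of once per iteration. The sketch below illustrates that variant; the function names are assumptions made for the example, not the benchmark's code.

```cpp
#include <cmath>
#include <vector>
#include <cstddef>

// Naive version: sqrt(n) is re-evaluated on every iteration even though
// its argument never changes inside the loop.
void scale_naive(std::vector<double>& a, double n) {
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] *= std::sqrt(n);
}

// Hoisted version: the loop-invariant sqrt(n) is computed once, outside.
void scale_hoisted(std::vector<double>& a, double n) {
    const double s = std::sqrt(n);
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] *= s;
}
```

Both versions multiply by exactly the same value, so the transformation cannot change results, unlike substituting a division for the square root, which is valid only when the two expressions happen to be numerically close.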


Table 3-6: Time in seconds for the SQRT function vs. Division optimization technique.

Figure 3-6: Sqrt function vs. division, run time in seconds against array size, series A-D.


3.5.7 The cost of a function call inside and outside loops

We designed a benchmark that tests whether a compiler can optimize a function call inside a loop; the called function returns the length of a string. We passed this code to the GCC compiler to see whether it could optimize it. We developed this benchmark because calling a function many times costs significant time. We optimized the benchmark manually and passed both versions to the compiler to see which is faster: the GCC optimization alone, or our optimization. In our optimization we moved the strlen() call from inside the loop to outside it, so instead of calling strlen() on every iteration, it is called only once, which makes the program faster. The GCC compiler cannot do this kind of optimization: when we tested both programs (the original code and our optimized code), our optimized code was clearly much faster.

We measured time in milliseconds to see which version is faster. The time is measured only for the part of the code that we want to optimize, not for the whole program, which gives a clear picture of the performance. In this optimization technique, our optimized code is approximately four times faster than the original code. From this result, we can see that the GCC compiler has a significant weakness in this area. Let us see the codes, results and graphs.

This is the code that we passed to the GCC compiler:

unsigned sum(const char *s) {
    // x[] is an array initialized randomly with integer numbers less than 100.
    unsigned result = 0;
    for (i = 0; i < strlen(s); i++) {
        result += x[i];
    }
    return result;
}

This is our optimized code, which we optimized manually:

unsigned sum(const char *s) {
    unsigned result = 0;
    length = strlen(s);
    for (i = 0; i < length; i++) {
        result += x[i];
    }
    return result;
}

A: The original code (function call inside the loop) without GCC optimizations.
B: Our optimized code (function call outside the loop) without GCC optimizations.
C: The original code (function call inside the loop) with GCC optimizations.
D: Our optimized code (function call outside the loop) with GCC optimizations.

These results show the difference, in milliseconds, between the code optimized by the GCC compiler and our optimized code.

Table 3-7: Time in milliseconds for the cost of a function call inside and outside the loop.


Figure 3-7: Function calls inside and outside the loop, run time in milliseconds for array sizes from 1,000,000 to 10,000,000, series A-D.

3.6 Automatic parallelization in compilers

When programs began to execute in parallel, programmers needed compilers to take care of parallelization. The basic idea of parallelization is to distribute a job among processors in order to complete a big piece of work in a significantly smaller amount of time. Today's computers have more than one central processing unit, and to take advantage of those CPUs, they can collaborate to finish a job. There are two types of parallelism: shared memory and distributed memory. The main difference between them is how communication occurs. In a shared memory environment, threads communicate with each other through reads and writes, while in a distributed memory environment, processes communicate with each other through messages [1] [2].


There is also an architecture that mixes the previous two types, called the mixed-type multiprocessor architecture; it combines the shared memory environment with the distributed memory environment.

Figure: (a) shared memory multiprocessor; (b) distributed memory multiprocessor [4].

In a shared memory environment, processors share the same memory, while in a distributed memory environment each processor has its own memory. In the mixed-type architecture, threads may share the same memory and may also communicate with other physical locations through messages, which is how the distributed memory architecture works. Parallel processing can complete big jobs in a very small amount of time through work distribution between processors. Each processor finishes a part of the work, and at the end all processors give their results to the master. The master processor (or process) is in charge of creating the worker processes, collecting their results and, at the end, terminating them. In order to get good results from parallel processing, the programmer should balance the work between processors when distributing it among them.

Many programs can be parallelized without difficulty or further work, but other programs need extra work to parallelize, such as synchronization [4]. We need synchronization when two processes access the same data; in this situation the programmer needs to resolve the conflicts between the processes before implementing the parallelism. Synchronization costs time, both because it limits parallelism and because of the time spent in the synchronization mechanisms themselves. For instance, when many processes need to access the same value, we need synchronization techniques to make sure the value is accessed only once at a time. Several operations are used in synchronization, such as produce and consume: produce sets a bit to full if it is empty, while consume waits for the bit to be full, then reads the value and sets the bit back to empty [4]. With this method, conflicts over the same value cannot happen, because only one process at a time can access it.

There is another problem that makes some programs harder or impossible to parallelize: dependencies. Dependencies are one of the obstacles facing parallel processing. The basic issue is that some processes need, or depend on, the value produced by a previous process; since all processes run at the same time, this can produce wrong results.
There are several kinds of dependencies: read after write, write after read, write after write and read after read. Dependencies limit the ability to parallelize programs, and even when such programs are parallelized, they may become slower. If a program has only a few dependencies, we can keep one or more sequential sections and parallelize the rest of the code. Sometimes the whole program, including its results, is interdependent: for instance, when the result of every stage of a program is used in the next stage, it is very difficult or impossible to parallelize.

Deciding which sections of a program can be parallelized can be left to the programmer. The programmer inspects the program and sees which sections of the code have dependencies and are difficult to parallelize, and which do not. Detecting dependencies can also be left to compilers. Some compilers are built to work in parallel environments and have the ability to detect dependencies in programs. One compiler with very powerful dependence detection techniques is the Intel compiler, which can analyze loops and determine which of them have dependencies and are hard to parallelize [5].

3.7 Automatic parallelization in GCC

The GCC compiler is one of the compilers that support automatic parallelization. GCC can parallelize loops, but it has some limitations. One limitation is that GCC cannot detect dependencies in loops, so it can only parallelize loops without dependencies. Another limitation is that GCC parallelizes innermost loops only; it cannot parallelize outer loops. Currently, GCC parallelizes every loop without a dependency, so there is no strategy for deciding, on performance grounds, which loops are worth parallelizing [3].


3.8 Our improvement to the automatic parallelization of the GCC compiler

Our improvement to the GCC compiler is a tool that addresses the weaknesses in its automatic parallelization. Our tool can detect dependencies in most loops. The input of the tool is a C or C++ program, and the output is a message: either the loop can be safely parallelized, or dependencies exist and it is difficult to parallelize. In our tool, we used the idea of the GCD test. The GCD test detects most of the dependencies between array accesses in loops. The GCD test works as follows:

1. We consider only the subscript expressions of the arrays used in the loop.
2. We assign the values in the subscripts to variables. For instance, for A[2i+3] = A[2i-1]*B + C we have a=2, b=3, c=2 and d=-1; a=2 and b=3 represent A[2i+3], while c=2 and d=-1 represent A[2i-1].
3. We find the GCD of a and c. In the previous example, GCD(2, 2) = 2.
4. We compute (b - d). In the example above, 3 - (-1) = 4.

Since GCD(a, c) = 2 divides (b - d) = 4, there are dependencies in our example. We can confirm this by following the iterations. Suppose we run the example for 6 iterations:

A[2i+3] = A[2i-1]
i=0: A[3]  = A[-1]
i=1: A[5]  = A[1]
i=2: A[7]  = A[3]
i=3: A[9]  = A[5]
i=4: A[11] = A[7]
i=5: A[13] = A[9]

From this trace we can see clearly that iteration 2 depends on iteration 0, iteration 3 depends on the result of iteration 1, and so on. For further clarification of the GCD test, suppose we have the following example:

A[2i] = A[2i-1]*B + C

From this example, we can set the values for the GCD test as follows: a=2, b=0, c=2, d=-1, so GCD(a, c) = 2 and (d - b) = -1. Since GCD(a, c) = 2 does not divide (d - b) = -1 [4], there are no dependencies in this example. Therefore, by using the GCD test, we can find dependencies in programs without manually following the loop iterations. We implemented the idea of the GCD test in our tool. The tool takes a C or C++ program as input and prints a message saying either that the loop has dependencies or that no dependencies were found.
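The test described above can be sketched in a few lines. This is an illustrative implementation for the pattern A[a*i + b] = ... A[c*i + d] ..., with function names of my own choosing; note that the GCD test is conservative, so a "true" answer means a dependence may exist, not that one certainly does.

```cpp
#include <cstdlib>

// Euclid's algorithm on absolute values.
int gcd(int x, int y) {
    x = std::abs(x);
    y = std::abs(y);
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x;
}

// GCD test: a dependence between A[a*i + b] and A[c*i + d] is possible only
// when gcd(a, c) divides (d - b); otherwise the subscripts can never coincide.
bool gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(a, c);
    if (g == 0) return b == d;   // degenerate case: both subscripts constant
    return (d - b) % g == 0;
}
```

Applied to the two examples above: A[2i+3] = A[2i-1] gives gcd(2, 2) = 2 dividing -4, so a dependence is possible, while A[2i] = A[2i-1] gives a difference of -1, which 2 does not divide, so the loop is safe to parallelize.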


This tool is basically designed to help the user or programmer discover dependencies in programs compiled with the GCC compiler.

3.9 Optimizing real C programs with our tool

After we built our tool with its seven optimization techniques (division, recursive functions, loop re-rolling, loop unrolling, the power operation, the sqrt() function and function calls inside loops), we tested it on real C programs such as Strassen, Bubble sort and Selection sort.

3.9.1 Strassen optimization

Strassen is a C program that multiplies matrices faster than the normal matrix multiplication method. The program was passed as input to our tool. We optimized it in several ways: first with the GCC compiler only, second with our tool only, and third with both our tool and the GCC compiler. Our tool detected several optimization opportunities in Strassen.cpp that are not optimized by the GCC compiler. It found several divisions that can be replaced by multiplications, and two loops that can be unrolled, to make the program run faster. The tool printed hints showing that these optimizations can improve the program. After optimizing Strassen.cpp with our tool, we ran three programs: the original Strassen.cpp, Strassen.cpp optimized by the GCC compiler only, and Strassen.cpp optimized by both our tool and the GCC compiler. We saw an interesting result: the program optimized by our tool only is faster than the original program, and also faster than the program optimized by the GCC compiler. In addition, when we ran the program optimized by both our tool and the GCC compiler, we obtained a much larger speedup.

Results:

A: The code without GCC optimization and without our tool's optimization.
B: The code with GCC optimization but without our tool's optimization.
C: The code without GCC optimization but with our tool's optimization.
D: The code with GCC optimization and with our tool's optimization.

The run times for our programs:

Table 3-8: Running time in seconds for the Strassen optimization

Array size   Strassen A   Strassen B   Strassen C   Strassen D
256          0.62         0.47         0.29         0.15
512          4.56         3.32         2.09         1.07
1024         41.38        33.20        27.65        23.17
2048         371.49       321.51       217.96       204.08
4096         3341.04      2945.98      2349.21      2222.88


Figure 3-8: Strassen optimization, run time in seconds for array sizes 256 to 4096, series Strassen A-D.

3.9.2 Bubble sort optimization

Bubble sort is a very well-known algorithm used to sort numbers. BubbleSort.cpp is the original C++ program used in our implementation, and OptimizedBubbleSort.cpp is the C++ program optimized by our tool. BubbleSort.cpp was passed as input to our tool. We optimized the C++ program in several ways: first with the GCC compiler only, second with our tool only, and third with both our tool and the GCC compiler. Our tool detected several optimization opportunities in BubbleSort.cpp that are not optimized by the GCC compiler. It found several loops that can be unrolled to make BubbleSort.cpp faster, and it printed hints indicating that these optimizations can improve the program. After we applied the optimizations to BubbleSort.cpp manually and collected the running times, we found that the program optimized by our tool only (OptimizedBubbleSort.cpp) is faster than the original BubbleSort.cpp, and that the program optimized by both our tool and the GCC compiler is much faster than the original program optimized by the GCC compiler only.

Results and graphs:

A: The code without GCC optimization and without our tool's optimization.
B: The code with GCC optimization but without our tool's optimization.
C: The code without GCC optimization but with our tool's optimization.
D: The code with GCC optimization and with our tool's optimization.

The run times for our programs:

Table 3-9: Running time in seconds for the Bubble Sort optimization

Array size   BubbleSort A   BubbleSort B   BubbleSort C   BubbleSort D
10000        0.64           0.28           0.52           0.17
20000        2.63           1.89           2.13           0.83
40000        10.45          6.68           8.62           3.59
80000        42.58          28.38          34.52          13.42
100000       65.55          43.80          54.04          20.90


77 Figure 3 9 Bubble Sort optimization 3.9.3 Selection s ort optimization Selection sort is one of the sorting algorithms that used to order numbers. Selection Sort.c pp is the original C ++ program that used in our implementation, and Optimized Selection Sort.c pp is a C ++ program that optimized by our tool. The SelectionSort c pp program was passed as input to our tool. We optimized the C ++ program in several ways. The first optimizing way is by using GCC compiler only. The second optimizing way is by using our tool only, and the third optimizing way is by using both our tool a nd GCC compiler. Our tool detected several optimization techniques in the Selection Sort.c pp which are not optimized by the GCC compiler. Our tool found that several loops can be unrolled several times to make Selection Sort.c pp faster After we optimized th e Selection Sort.c pp program manually, th en we collected the results, we 0 10 20 30 40 50 60 70 10000 20000 40000 80000 100000 RunTime Array Size Bubble Sort Optimization BubbleSort A BubbleSort B BubbleSort C BubbleSort D


found that the program optimized by our tool only (OptimizedSelectionSort.cpp) is faster than the original SelectionSort.cpp, and the program optimized by both our tool and the GCC compiler is much faster than the original program optimized by the GCC compiler only.

Our tool worked very well on Strassen.cpp, BubbleSort.cpp, and SelectionSort.cpp, adding substantial speedup over the versions optimized by the GCC compiler alone. The tool therefore works well together with the GCC compiler to make programs faster. Developers who use the GCC compiler to optimize their code can use our tool alongside it to get fast and efficient results. We carried out this implementation on real C++ code in order to see the effectiveness of our tool when developers want to optimize real software.

Results and graphs

A: The code without GCC optimization and without our tool optimization.
B: The code with GCC optimization but without our tool optimization.
C: The code without GCC optimization but with our tool optimization.
D: The code with GCC optimization and with our tool optimization.

Table 3-10: Running time in seconds for the Selection Sort optimization

Array size   SelectionSort A   SelectionSort B   SelectionSort C   SelectionSort D
100000       33.60             11.64             14.50             6.31
200000       135.65            46.87             59.34             25.62
400000       544.67            188.83            237.86            104.12
800000       2197.39           751.43            959.02            428.39
1000000      3539.38           1361.53           1600              862.97
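The unrolling suggested for SelectionSort.cpp targets a different loop shape than bubble sort: the inner loop is a minimum search, whose comparisons are independent updates of a single index. A minimal sketch, with an unroll factor of 2 chosen by us for illustration (the thesis does not give the tool's exact output):

```cpp
#include <utility>
#include <vector>

// Selection sort with the inner minimum-search loop unrolled by 2,
// examining two candidate elements per iteration.
void selectionSortUnrolled2(std::vector<int>& a) {
    int n = static_cast<int>(a.size());
    for (int i = 0; i < n - 1; ++i) {
        int min = i;
        int j = i + 1;
        for (; j + 1 < n; j += 2) {      // two candidates per iteration
            if (a[j] < a[min])     min = j;
            if (a[j + 1] < a[min]) min = j + 1;
        }
        if (j < n && a[j] < a[min])      // leftover candidate, if any
            min = j;
        std::swap(a[i], a[min]);
    }
}
```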


Figure 3-10: Selection sort optimization (runtime vs. array size for SelectionSort A–D)


4. Conclusion

In our thesis, we have presented several ways to help developers write more efficient code. Writing good software depends on both the architecture specifications and the compiler. We researched the architecture specifications and the compiler, and we introduced tools to help developers write well-optimized code.

For the architecture specifications, developers should fully understand the architecture of the specific hardware they target. Ignoring the architecture specifications can make programs run slower. In Chapter II of our thesis, we intentionally designed de-optimized benchmarks to expose the strengths and weaknesses of the underlying computer architecture and to show the effects of not considering the architecture specifications when programs are written. The de-optimizations show, convincingly, that ignoring architecture specifications when writing software can be very costly indeed. Most of the de-optimizations increased running time by more than 25%; some increased it by up to 150%. These are not trivial slowdowns. They show that prioritizing considerations such as aesthetics and portability ahead of the hardware can result in significant slowdowns. This is striking when one considers that many of the de-optimizations we implemented look just like code that one sees every day in business and academic environments. Moreover, much of that code was compiled straightforwardly, i.e. the compiler did not resolve the issues during compilation. The results of Chapter II should, at the very least, make software developers reconsider many of their accumulated habits.

Compilers have very powerful facilities for optimizing code. Therefore, writing good software depends not only on the skills of programmers and the architecture specifications


but also on the power of the compiler. One of the compiler's tasks is optimizing code. Compilers can make programs run faster and work more efficiently, and they can also reduce code size. Therefore, in order to write optimized software, developers should choose a powerful compiler to optimize their code. The GCC compiler is a well-known compiler used by many developers to optimize their code.

In Chapter III of our thesis, we researched the weaknesses of the GCC compiler. We found several weaknesses, and we then built optimization techniques to address them: Division vs. Multiplication, Loop vs. Recursive function, Loop Re-rolling, Loop Unrolling, Power vs. Multiplication, SQRT function vs. Division, and the cost of function calls inside and outside loops. To make using these optimizations easier, we built a tool that helps developers optimize their code manually: it tells users whether or not their programs need to be optimized. We built this tool because in some circumstances the GCC compiler cannot fully optimize code, so having users apply the optimizations themselves makes programs run faster.

The GCC compiler has another weakness: its automatic parallelization can only parallelize loops that have no dependencies. Therefore, we built a second tool that performs dependency detection. It can detect most of the dependencies in a program and then tell the user whether or not a particular program has dependencies. The main goal of Chapter III was to build tools that improve on the weaknesses of the GCC compiler and help programmers write more efficient code.

To make sure our tools can optimize real C or C++ applications, we tested them on real code such as Strassen.cpp, BubbleSort.cpp, and SelectionSort.cpp.
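The distinction the dependency-detection tool draws can be illustrated with two small loops. This is our own illustrative code, not output of the tool: the first loop has no loop-carried dependency and is safe to parallelize (for example under GCC's automatic parallelization); the second reads a value written by the previous iteration and cannot be parallelized as written.

```cpp
#include <vector>

// No loop-carried dependency: each iteration touches only its own element,
// so iterations can run in any order or in parallel.
void scaleAll(std::vector<double>& a, double k) {
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] *= k;
}

// Loop-carried dependency: iteration i reads a[i - 1], which iteration
// i - 1 wrote, so the iterations must execute sequentially.
void prefixSums(std::vector<double>& a) {
    for (std::size_t i = 1; i < a.size(); ++i)
        a[i] += a[i - 1];
}
```

A dependency-detection pass reports the second pattern so the developer (or the auto-parallelizer) knows to leave that loop sequential or restructure it.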


Strassen.cpp is a C++ program that performs matrix multiplication faster. The optimization of Strassen.cpp was successfully implemented, and we found that our optimized code without GCC optimization is faster than the original Strassen.cpp with GCC optimizations. Moreover, we optimized BubbleSort.cpp: the optimization was implemented successfully, and the Bubble Sort optimized by our tool is much faster than the Bubble Sort optimized by the GCC compiler only. Also, we successfully optimized the Selection Sort algorithm, and the program optimized by our tool is much faster than the program optimized by the GCC compiler. Therefore, our tool is a powerful tool that works with the GCC compiler to make programs efficient and faster.

The main goal of our thesis is to help developers write more efficient code by showing that fully understanding the architecture specifications leads to more efficient code, and that improving on the weaknesses of the GCC compiler makes programs run faster.
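One everyday example of the architecture-ignorant patterns Chapter II warns about is memory traversal order. The sketch below is ours, not one of the thesis benchmarks: both functions compute the same sum over a row-major matrix, but the column-by-column version strides `cols` elements per access and defeats the cache, while the row-by-row version walks memory sequentially.

```cpp
#include <vector>

// Row-major 2D array: element (i, j) lives at data[i * cols + j].

// Cache-friendly: consecutive accesses are adjacent in memory (stride 1).
long long sumRowMajor(const std::vector<int>& data, int rows, int cols) {
    long long s = 0;
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            s += data[i * cols + j];
    return s;
}

// Cache-hostile: consecutive accesses jump `cols` elements apart,
// so each touch may miss the cache on large matrices.
long long sumColMajor(const std::vector<int>& data, int rows, int cols) {
    long long s = 0;
    for (int j = 0; j < cols; ++j)
        for (int i = 0; i < rows; ++i)
            s += data[i * cols + j];
    return s;
}
```

Both are correct and look equally innocent in source code, which is exactly why such slowdowns survive everyday code review.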

