Citation
Software transactional memory design options and overhead

Material Information

Title:
Software transactional memory design options and overhead
Creator:
Alfuaires, Ibrahim
Publication Date:
Language:
English
Physical Description:
vii, 57 leaves ; 28 cm.

Subjects

Subjects / Keywords:
Transaction systems (Computer systems) ( lcsh )
Memory management (Computer science) ( lcsh )
Parallel programming (Computer science) ( lcsh )
Memory management (Computer science) ( fast )
Parallel programming (Computer science) ( fast )
Transaction systems (Computer systems) ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Thesis:
Thesis (M.S.)--University of Colorado Denver, 2010.
Bibliography:
Includes bibliographical references (leaves 55-57).
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Ibrahim Alfuaires.

Record Information

Source Institution:
University of Colorado Denver
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
656562425 ( OCLC )
ocn656562425

Full Text
SOFTWARE TRANSACTIONAL MEMORY
DESIGN OPTIONS AND OVERHEAD
By
Ibrahim Alfuaires
B.S., King Fahd University of Petroleum and Minerals, 2005
A thesis submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science
2010


This thesis for the Master of Science
degree by
Ibrahim Alfuaires
has been approved
by
Bogdan Chlebus
Date


Alfuaires, Ibrahim (M.S., Computer Science)
Software Transactional Memory (STM) Design Options and Overhead
Thesis directed by Associate Professor Bogdan S. Chlebus
ABSTRACT
The advent of multicore architectures, in which technology has made it possible
to integrate more and more cores on a single chip, has raised expectations of
better computation speedup and performance. In this thesis, I first briefly
discuss the problems associated with using the legacy mechanism, locks, to
write parallel programs: priority inversion, convoying, and deadlock.
Transactional memory (TM) was proposed as a lock-free synchronization
technique that provides a safe and easy concurrent programming environment. A
transaction is a group of read and write instructions to shared memory that
appears as one atomic operation to an outside observer and to other
transactions, and that either executes entirely or not at all. I present the
TM solution in general, which can be classified into three types: hardware
transactional memory (HTM), software transactional memory (STM), and hybrid
transactional memory (HybridTM). As its name suggests, STM implements the
whole TM system in software, with the aid of only the basic instructions
provided by an off-the-shelf processor. STM has been the preferred way to
implement TM because it does not require the extra cost of hardware. In this
thesis, I present most of the design options that researchers have used to
build the STM implementations we have had so far. These options are important
for studying and analyzing the overhead that may be encountered in different
STM implementations. Finally, I list some published, identified sources of
overhead in STM systems.
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.
Signed
Bogdan Chlebus


TABLE OF CONTENTS
Figures................................................................vii
Chapter
1. Introduction.........................................................1
1.1 Research Focus and Methodology......................................4
1.2 Lock Based Synchronization..........................................5
1.3 Problems with Locks................................................ 8
1.4 Lock free Synchronization..........................................11
2. Transactional Memory...............................................12
2.1 Transaction Basic Operations.......................................16
2.2 Semantics..........................................................17
2.2.1 Database Correctness Criteria (ACI Model)....................... 17
2.2.2 Single Lock-atomicity............................................19
2.2.3 Retry and OrElse.................................................20
3. Design Options in Software Transactional Memory.....................22
3.1 Isolation..........................................................22
3.1.1 Weak Isolation...................................................22
3.1.2 Strong Isolation.................................................26
3.2 Nesting Transactions.............................................32


3.2.1 Flattened Transaction...........................................32
3.2.2 Closed Transaction..............................................33
3.2.3 Open Transaction................................................35
3.3 Transaction Granularity.........................................35
3.3.1 Block or Word Granularity.......................................36
3.3.2 Object Granularity..............................................36
3.4 Version Management..............................................36
3.4.1 Eager Versioning...............................................37
3.4.2 Lazy Versioning................................................39
3.5 Concurrency Control..............................................40
3.5.1 Pessimistic Concurrency Control.................................41
3.5.2 Optimistic Concurrency Control..................................41
3.6 Blocking versus Non-blocking....................................42
3.6.1 Blocking.......................................................42
3.6.2 Non-blocking...................................................42
3.7 Object Acquisition..............................................43
3.7.1 Lazy Acquire...................................................44
3.7.2 Eager Acquire..................................................44
3.8 Contention Management...........................................44
3.9 Other Options...................................................45
3.9.1 Exceptions.....................................................45
3.9.2 Library versus Programming Language.............................46


3.9.3 Uniprocessor versus Multiprocessor..............................46
4. STM Overhead..................................................... 47
5. Findings...........................................................50
6. Conclusion.........................................................54
Bibliography..........................................................55


LIST OF FIGURES
Figure
1.1 Moore's Law Trend from 1971-2008............................1
1.2 THE SHARED MEMORY MULTIPROCESSORS...........................5
1.3 LOCK-BASED SYNCHRONIZATION MODEL............................7
3.1 Ananian's and Rinard's STM Data Structures.................29
3.2 Eager Versioning...........................................38
3.3 Lazy Versioning............................................40


1. Introduction
With the advent of multi-core processors, in which technology has made it
possible to integrate more than one core into a chip, it has been expected
that computation speed would at least double when we double the number of
cores. According to Moore's Law, as the figure below shows, the number of
transistors on a chip, and with it the number of cores, is expected to double
almost every 18 months, which suggests that computation speed should double
accordingly.
FIGURE 1.1: Moore's Law Trend from 1971-2008


So, as we increase the number of cores or processors by a factor of n, we
should expect an n-fold increase in computational power. However, this
increase occurs only in the ideal case and never in practice. The primary
reason for this degradation in real-world computational power is the cost of
inter-processor communication and coordination, or concurrency control. In
most cases, when an application is written for multiprocessing, the net
performance of the system is just 5-10% of the theoretical peak.
One of the most difficult aspects of parallel and distributed computing is
concurrency control, or synchronization, in the shared memory model. The need
for parallel and distributed computing has been growing because of the need
for massive computation in various fields: industry, medicine, and scientific
research. Almost all computer science programmers are familiar with
traditional, sequential programming, in which a single-processor machine is
used. With parallel programming, however, where more than one processor is
used and expected to increase performance and speed up calculations, the
programmer's task becomes much harder.
Concurrency is the problem of synchronizing concurrent access by multiple
processes or threads to shared memory. It is difficult because we must ensure
that the right operation happens at the right time, as if the process
performing it were the only one accessing that piece of data in shared memory.
As I mentioned above, the primary reason for the drop from the expected
performance is concurrency. Concurrency is the responsibility of the
programmer, not of the system, which explains why parallel programming is
harder than sequential programming. The computing industry has many
programmers, but few of them are familiar with parallel programming, and few
of those are good at it. Parallel programmers can make mistakes easily, and
those mistakes can make performance even worse. As we add more cores or
processors, the problem gets even harder for developers.
To exploit parallelism, developers need to master parallel programming, with
concurrency or synchronization as part of it. Many synchronization techniques
have been used to implement parallelism on multi-core systems. Locks have
been used by parallel programmers to implement synchronization. However, in
addition to the difficulty of using locks, several pitfalls have been
identified with them, such as priority inversion, deadlock, and convoying.
To avoid the problems associated with lock-based synchronization techniques,
several lock-free techniques have been investigated. A shared data structure
is lock free if its operations do not require mutual exclusion. Transactional
memory is one such technique; it provides a new concurrency mechanism that
avoids most of the pitfalls mentioned above and is a much easier concurrent
programming technique. In addition, it is more efficient than the other
proposed techniques. The idea of transactional memory was taken from database
transactions, where a transaction can be defined as a sequence of actions
that appears indivisible and instantaneous to an outside observer. The same
attributes that a database transaction has, i.e., failure atomicity,
consistency, isolation, and durability, apply to transactional memory, except
the last one, durability, since data in memory is usually transient, unlike
in a database.


1.1 Research Focus and Methodology
Software transactional memory (STM) has been criticized as a high-overhead
form of transactional memory (TM) compared to the other implementations:
hardware transactional memory (HTM) and hybrid transactional memory
(HybridTM). The overhead of STM comes from the need to build the TM semantics
and all the transaction primitives in software. Many STM implementations have
been proposed, with little emphasis on investigating overhead. The problem
with STM is that instructions that usually need only one or two clock cycles
to execute may need tens or even hundreds of clock cycles when executed under
STM. As an example, for a program that might need 100 clock cycles to
complete, an additional 500 clock cycles may be introduced as STM overhead.
That means almost 80% of the program's time is taken by the STM, and the
program takes six times as long. We can conclude that STM overhead is not
trivial and can be the dominant cost of the code.
I am going to study STM semantics and some of the approaches used to build
STM implementations. After understanding these semantics and design
approaches, I will identify how they may help reduce or increase the overhead
of STM systems. In addition, STMs have been built in different programming
languages, and I want to investigate whether the choice of language affects
overhead. This should help us choose the right design options to build a
low-overhead STM implementation.
I begin by giving background information about the legacy way of attaining
concurrency, lock-based synchronization. Then I discuss the major pitfalls of
locks and their use in synchronization. In Chapter 2, I introduce what
transactional memory is. Chapter 3 covers different design options of STM. In
Chapters 4 and 5, I discuss the overhead of STM and list some findings that
might help reduce it. The last chapter presents the conclusion.
1.2 Lock Based Synchronization
In the basic shared memory multiprocessing model, we have many processors
that can work concurrently and that might need to access the same piece of
data in shared memory to complete their computations. Let us assume that we
have the following multiprocessor environment:
FIGURE 1.2: SHARED MEMORY MULTIPROCESSORS


As we can see in the figure above, we have n processors all sharing the same
memory. Let us assume that the first three processors, P1, P2, and P3, are
about to execute the same instruction X := X + 1. The initial value is X = 1,
and if all of the processes try to execute at the same time, the resulting
value of X is unpredictable: it might be 2, 3, or 4. This is because all the
processes might read the same value of X at the same time, or two of them
might simultaneously read the value of X produced by the third one's
execution. This problem is called a data race, and to avoid it, the shared
resource, X in this case, must be protected from concurrent read and write
accesses. The solution here is to use locks to provide mutual exclusion.
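The lost-update scenario above can be modelled deterministically in a few lines. The following is a small sketch (the function names are my own) in which the worst-case interleaving, all three processors reading X before any of them writes back, loses two of the three increments:

```python
def racy_schedule(initial_x, n_procs):
    # Worst-case interleaving: every processor reads the shared X
    # before any of them writes back...
    reads = [initial_x] * n_procs
    x = initial_x
    # ...so each write-back clobbers the others' updates.
    for r in reads:
        x = r + 1
    return x

def serialized_schedule(initial_x, n_procs):
    # With mutual exclusion, each read-modify-write runs to completion
    # alone, so no increment is lost.
    x = initial_x
    for _ in range(n_procs):
        x = x + 1
    return x

print(racy_schedule(1, 3))        # 2: two of the three updates are lost
print(serialized_schedule(1, 3))  # 4: the intended result
```

With initial value 1 and three processors, the racy schedule ends with X = 2 while the serialized schedule ends with X = 4, exactly the unpredictability described above.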
Locks are programming constructs used so that only one processor has control
over a shared resource; until the resource is unlocked, the controlling
process is the only one that can read and write it. The code region that
accesses the shared resource under a lock is called a critical section. At
any time, at most one process, the one holding the lock for that section, can
be inside the critical section. The following diagram shows how the operation
proceeds when locking is used to perform synchronization:


Lock (X)
Read X into R
Add 1 to R
Write back R to X
Unlock (X)
FIGURE 1.3: LOCK-BASED SYNCHRONIZATION MODEL
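The five steps in Figure 1.3 translate directly into code. Here is a minimal sketch in Python (the variable names are my own), where the with-block performs the Lock(X)/Unlock(X) pair around the read-increment-write:

```python
import threading

x = 0                       # the shared variable X
x_lock = threading.Lock()   # the lock guarding X

def locked_increment(n_iters):
    global x
    for _ in range(n_iters):
        with x_lock:    # Lock(X)
            r = x       # Read X into R
            r = r + 1   # Add 1 to R
            x = r       # Write back R to X
                        # Unlock(X) happens on leaving the with-block

threads = [threading.Thread(target=locked_increment, args=(1000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every read-modify-write ran under mutual exclusion, so no update was
# lost: four threads times 1000 increments leaves x == 4000.
```

Without the with-block, some increments could be lost to the data race described earlier; with it, the final value is always deterministic.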
Based on granularity, locks can be classified as fine-grained or
coarse-grained. The main differences between the two types are the size of
the critical sections guarded by a lock and the number of locks used
throughout the system. In systems that use coarse-grained locking, one lock
is used to lock many shared resources, as in Linux kernel 2.0, the first to
support multiprocessing. In that case, the whole kernel is treated as one
critical section locked by one big lock. On the other hand, with fine-grained
locking, as implemented in newer kernel releases, many locks are used in the
system to provide concurrency. Instead of using only one lock for the whole
kernel, the big kernel critical section is divided into smaller critical
sections, each with its own lock. As an example, in the Linux 2.2 kernel
release, one spinlock controls access to the block I/O subsystem, another to
networking, and so on.


Locks have been used widely in many areas: databases, operating systems,
and parallel programming.
1.3 Problems with Locks
Many problems associated with using locks in concurrency control have been
published in literatures. One of these issuers is the difficulty of programming.
Most of the computer science programmers are much used to the sequential
coding and not the parallel one. More issues were published by Herlihy and
Moss as we see below [13]. In some operating systems, which that employs a
priority-based preemptive scheduler, tasks or processes are assigned different
priorities. Let assume that we have a 3 processes H, M and L with priorities
High, Medium and Low. The scheduler always makes sure that if a process is
ready to run and holds higher priority to be running or executed. In addition,
if a lower priority process is executing and during its execution, a higher
priority process is ready to execute, the scheduler will preempt the lower
priority one to let the higher priority process executes. Let us consider a
scenario where, the process L acquires the mutex or the lock to work on a
specific shared resource. In the middle of L execution, the process M becomes
ready to execute and hence the scheduler will have to preempt L, which still
holds the lock, and let M executes. Suppose during the execution of M,
process H would like to access the same shared resource whose lock is held
by the preempted process L. In this case, process H won't be able to acquire
the lock since L cannot preempt M due to the precedence in priorities. Finally,
although process H has higher priority than process M, H cannot execute and
this situation will be like if the priorities were inverted so that M has higher
8


priority but it hasn't. The problem that I have just described is called priority
inversion.
Another issue associated with using locks for concurrency control is lock
convoy, sometimes called convoying. Unlike the priority inversion case, here
the processes all have equal priorities. The problem of convoying happens
when we have several processes, some of which want to access a specific
shared resource R. Let us assume that process P0 is holding the lock for R
and the processor is executing P0. If the scheduler preempts P0 for any
reason, such as a page fault, P0 exhausting its quantum, or any kind of
interrupt, P0 will be descheduled while it still holds the lock for R. Now
the scheduler schedules P1, which also wants to access R; although P1 has the
right to execute, the lock for R is still with P0, and hence P1 will in turn
be descheduled. This results in heavy context switching and wasted scheduling
quanta, which leads to severe performance degradation of the system.
The next problem that might be experienced with locks is called deadlock.
Deadlock can happen when some processes lock the same set of objects in
different orders. To make it clearer, let us assume that we have two
processes P and Q working on shared objects identified by CR1, CR2, CR3, and
CR4. Process P works on the shared objects in the order
CR1->CR2->CR3->CR4. Process Q, based on its computation, needs to access the
shared objects in the opposite order, CR4->CR3->CR2->CR1. Let us assume that
both processes begin execution together, each acquiring the lock for each
shared object it needs to access. Process P acquires the locks for CR1 and
then CR2 without any problem, being the only process interested in those
objects at those times; similarly, process Q acquires the locks for CR4 and
then CR3. Now process P wants to proceed by acquiring the lock for CR3, which
is already held by process Q. Similarly, process Q wants to access the next
shared object in its chain, CR2, which is already locked by process P. The
two processes P and Q are each waiting for the other to release the lock on
CR2 or CR3, and this problem is what is called deadlock.
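One standard way to break this cycle is to impose a single global acquisition order on all locks, so that no two processes can end up holding locks while waiting for each other. The sketch below (the names and the order map are my own) shows the idea: even though Q "wants" the objects in the order CR4->CR1, both processes acquire in the same global order, so both always complete:

```python
import threading

# One lock per shared object, plus a single global acquisition order.
cr = {name: threading.Lock() for name in ("CR1", "CR2", "CR3", "CR4")}
order = {"CR1": 1, "CR2": 2, "CR3": 3, "CR4": 4}

log = []
log_lock = threading.Lock()

def run(process_name, wanted):
    # Acquire every needed lock in the global order, regardless of the
    # order the process "wants" them in. This removes the cycle that
    # produces deadlock when P goes CR1->CR4 while Q goes CR4->CR1.
    held = sorted(wanted, key=order.get)
    for name in held:
        cr[name].acquire()
    try:
        with log_lock:
            log.append(process_name)   # the "work" on the shared objects
    finally:
        for name in held:
            cr[name].release()

p = threading.Thread(target=run, args=("P", ["CR1", "CR2", "CR3", "CR4"]))
q = threading.Thread(target=run, args=("Q", ["CR4", "CR3", "CR2", "CR1"]))
p.start(); q.start()
p.join(); q.join()
# Both threads finish; with opposite raw acquisition orders and no
# global ordering, this program could hang forever instead.
```

If each thread instead acquired the locks in its own preferred order, the interleaving described in the text would leave P and Q blocked on each other indefinitely.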
Let us go back to locking granularity and the choice between fine-grained and
coarse-grained locks. As I mentioned, early Linux kernels used coarse-grained
locking, in which the whole kernel is one big critical section guarded by a
single lock. Coarse-grained locking is well known for its simplicity of
implementation and maintenance. Although this coarse-grained lock helped the
kernel programmers address the issues they faced in shifting to Symmetric
Multi-Processing (SMP), it did not scale well. As the number of processors
increases, the time spent waiting for the single kernel lock grows larger and
larger. Hence, with only one lock guarding the kernel, four independent
single-processor machines perform much better than one four-processor
system. To solve the scalability problem of coarse-grained locks, developers
started to use finer-grained locking more and more, and a modern kernel may
include thousands of locks. With this type of locking, each lock protects a
small resource instead of the whole kernel. However, fine-grained locking has
its own pitfalls. With thousands of locks, it is not easy to identify which
lock protects what, and in which order the locks should be acquired to avoid
problems such as deadlocks and races. In addition, bugs become much harder to
identify as the number of locks increases, and the number of bugs is expected
to increase accordingly. Finally, fine-grained locking increases the
complexity of the system and, over time, can sacrifice its maintainability.
1.4 Lock free Synchronization
Lock free or as it might be named in some literatures as non-blocking
synchronization does not require the existence of mutual exclusion between
the processes which are accessing a shared object concurrently. As we have
seen in the previous section, problems with locks, some problems might be
experienced when mutual exclusion is satisfied by using locks. If the shared
resource does need mutual exclusion, priority inversion problem is resolved
since the high priority process won't need to wait for the lower priority
process to release the lock. In addition, consider the problem of convoying
where a process holding the lock for a shared resource is descheduled or
interrupted for reasons I mentioned earlier. In case of lock free
synchronization, the shared resource does not mandate the acquisition of
lock to have access to, then descheduling or interrupting any process does not
delay the execution or operation of other processes interested to access this
shared resource. Even deadlocks can be avoided for the same reasons and as
a result performance could be improved by making the shared resources lock
free.
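The basic building block of such lock-free operations is an atomic read-modify-write instruction such as compare-and-swap (CAS). Below is a minimal sketch (the class and function names are my own; the CAS word is simulated, and its internal lock stands in only for the atomicity a real hardware instruction provides). Each thread retries its update until its CAS succeeds, and no thread ever waits inside another thread's critical section:

```python
import threading

class SimulatedCAS:
    """Models a hardware compare-and-swap word. The internal lock exists
    only to model the atomicity of the hardware instruction; callers never
    block on each other's critical sections."""
    def __init__(self, value=0):
        self._value = value
        self._hw = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Atomically: if the word still holds `expected`, install `new`.
        with self._hw:
            if self._value == expected:
                self._value = new
                return True
            return False

def lock_free_increment(word, n_iters):
    # Classic retry loop: read, compute, CAS. A failed CAS means another
    # thread made progress, so we simply re-read and try again.
    for _ in range(n_iters):
        while True:
            old = word.load()
            if word.compare_and_swap(old, old + 1):
                break

counter = SimulatedCAS(0)
threads = [threading.Thread(target=lock_free_increment, args=(counter, 500))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every one of the 4 x 500 increments eventually succeeds: counter == 2000.
```

The key property is that a descheduled thread holds nothing another thread needs: a failed CAS always means some other thread completed its update, so the system as a whole makes progress.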


2. Transactional Memory
As it can be inferred from its name, transactional memory has its root in
database transaction and hence we need first to define what a transaction is.
A transaction can be defined as a sequence of actions that appears as one
atomic action and instantaneous to the outside observer. Transactions have
been used widely in the database as a way to ensure concurrent execution of
programs. However, due to the differences between the transactions in
database and that of transactional memory, since the later uses the main
memory instead of the hard drive to perform transactional access, it is called
transactional memory.
The idea of using transactions in concurrent programming was first proposed
by Lomet in 1977 [16]. At the time, it was introduced not as transactional
memory but as an idea to use the database abstraction to build a
programming-language mechanism that would ensure the consistency of data
shared among several processes [8]. That paper lacked a practical
implementation competitive with explicit synchronization, which is why the
idea was not pursued until Herlihy and Moss, in 1993 [13], proposed the first
hardware transactional memory as a mechanism for building lock-free data
structures. Their paper is considered the first to formally define
transactional memory.
After that, many researchers became heavily involved in this area, and many
hardware transactional memories were proposed. However, the proposed hardware
designs had limitations, such as high cost and the need for an extra
instruction set to be integrated with the existing processors' instruction
set. That means that, for some, hardware transactional memory might not be a
feasible solution.
Given these limitations of hardware transactional memory, researchers tried
to find a more feasible solution that would still provide lock-free
synchronization. In 1995, Shavit and Touitou proposed the first software
transactional memory, which keeps the transactional approach but can be built
on existing machines using only Load-Linked/Store-Conditional
operations. Later, many software solutions were proposed by different
researchers, in different languages and using many design approaches.
Types of TM:
TM can be built in hardware, in software, or using both (hybrid). Hardware
transactional memory (HTM) is implemented as processor instructions, which
means it needs modifications to the processor, cache, and bus protocols to
support transactions. In software transactional memory (STM), the semantics
and protocols of TM are built completely in software, either in a runtime
library or as an extension to the programming language itself. In the latter
type, STM, no hardware upgrade is needed; minimal hardware support suffices,
namely one of the two basic operations Compare-and-Swap (CAS) or
Load-Linked/Store-Conditional. Hybrid transactional memory is a mix of
hardware and software transactional memory, so that transactions are
supported in software in the absence of hardware support and perform better
when the hardware exists. However, there is no formal specification of how
hybrid TM should be implemented, as many different approaches have been used
to build this type.
Why STM:
Many researchers consider software transactional memory (STM) a feasible
solution to the problems of concurrent programming. The first STM
implementation was published by Shavit and Touitou [19]. STM has many
advantages over hardware transactional memory (HTM), such as the cost of
implementation and maintenance. Since it is implemented in software, no
additional hardware cost is required, as in the case of HTM. In addition, any
change can be introduced to an STM implementation easily, which makes it more
flexible than HTM.
Previous Work:
Let me mention some previous STM research. The first real STM implementation
was proposed by Shavit and Touitou and is well known as the static STM
[19]. This implementation came after the first TM system, which was
implemented in hardware and required extra operations not available in
off-the-shelf processors at the time. They proposed the software solution to
make it possible to implement TM on existing processors, in software, using
the Load-Linked/Store-Conditional operation. The implementation was called
static because a transaction needs to know the set of locations it is going
to access in advance, which was considered a limitation. The advantage they
were targeting was a software solution that would not require extra hardware,
since it could be applied to existing machines.


However, the performance of this implementation was not even close to that of
the HTM implementation proposed by Herlihy and Moss.
Then Herlihy et al. proposed the dynamic software transactional memory (DSTM)
implementation [12]. In DSTM, they addressed the limitation of the static STM
by supporting dynamically created objects, which do not require prior
knowledge of the set of locations a transaction will access. The
implementation was obstruction-free, the weakest form of non-blocking
synchronization. In addition, they introduced the idea of contention
management, which avoids the livelock problems that may be encountered in
obstruction-free implementations.
Fraser later proposed a lock-free object-based software transactional memory
named after him (FSTM) [8]. In this lock-free implementation, each
lower-priority transaction helps a higher-priority conflicting transaction to
complete, so that progress is guaranteed. The helped transaction might in
turn help another conflicting transaction to complete, and so on.
Marathe et al. proposed an obstruction-free adaptive software transactional
memory (ASTM) [18]. They compared the two leading object-based software
transactional memories, DSTM and OSTM. They found that DSTM outperforms OSTM
by an order of magnitude on write-dominated workloads, while OSTM outperforms
DSTM by a factor of two on read-dominated workloads.
Herlihy et al. revised the original DSTM, resulting in DSTM2 [11]. As
mentioned, DSTM2 is a revised version of DSTM in which they tried to
incorporate lessons learned from its predecessor. They built it as a
transactional library, the first implementation to take that approach. The
main feature of DSTM2's design was giving users the ability to plug in their
own synchronization and recovery mechanisms, in the form of transactional
factories that transform stylized sequential interfaces into transactionally
synchronized data structures, in addition to the freedom to customize their
contention management techniques.
Recent implementations such as Ennals' STM and TL2 are lock-based STM
implementations [7][6].
2.1 Transaction Basic Operations
Each transactional memory has some basic operations, and most implementations
share the same operations with slight variation from one implementation to
another. These basic operations are used to manipulate transactions. The
following are the general operations used across most implementations.
- Start or Begin: starts a new transaction and allocates a fresh descriptor
in which the status is initialized to ACTIVE.
- Read/Write: used within the scope of the transaction to access
transactional memory.
- Commit: makes all the transaction's tentative changes permanent and ends
the transaction. It may fail, in which case it is equivalent to an abort.
- Abort: instead of making the transaction's tentative changes permanent,
discards all of them and also ends the transaction.
- Validate: tests the transaction's status; it returns true if the
transaction has committed or can commit, and false if the transaction aborts.
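To make these operations concrete, here is a deliberately simplified sketch of a transaction descriptor (the class and field names are my own, and a single global commit lock stands in for a real STM's fine-grained machinery). Writes are buffered in a write set until commit, and Validate checks that every location read is unchanged:

```python
import threading

memory = {}                      # shared transactional memory: address -> value
_commit_lock = threading.Lock()  # single global commit lock: a simplification

ACTIVE, COMMITTED, ABORTED = "ACTIVE", "COMMITTED", "ABORTED"

class Transaction:
    def __init__(self):
        # Start/Begin: allocate a fresh descriptor with status ACTIVE.
        self.status = ACTIVE
        self.read_set = {}   # address -> value observed at first read
        self.write_set = {}  # address -> tentative value (buffered writes)

    def read(self, addr):
        # Read: within the transaction, see our own tentative writes first.
        if addr in self.write_set:
            return self.write_set[addr]
        value = memory.get(addr)
        self.read_set.setdefault(addr, value)
        return value

    def write(self, addr, value):
        # Write: buffer the update; shared memory is untouched until commit.
        self.write_set[addr] = value

    def validate(self):
        # Validate: true iff every location read still holds the value seen.
        return all(memory.get(a) == v for a, v in self.read_set.items())

    def commit(self):
        # Commit: atomically re-validate and publish the write set, or abort.
        with _commit_lock:
            if not self.validate():
                self.abort()
                return False
            memory.update(self.write_set)
            self.status = COMMITTED
            return True

    def abort(self):
        # Abort: discard all tentative changes and end the transaction.
        self.write_set.clear()
        self.status = ABORTED
```

A transaction that increments memory["x"] would run t = Transaction(); t.write("x", t.read("x") + 1); t.commit(). If another transaction commits a conflicting write in between, validation fails at commit time and the commit degenerates into an abort, matching the semantics listed above.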
2.2 Semantics
Although much has been published about TM, there is still no formal TM
semantics. Intuitively, and as can be inferred from its name, the TM idea
originated from database transactions, which explains why most papers borrow
their correctness criteria from database transactions (serializability) or
from concurrent data structures (linearizability). However, both criteria
specify only some aspects of TM semantics, not all.
2.2.1 Database Correctness Criteria (ACI Model)
The semantic model used by databases was adopted from the beginning as a
model for TM, since the idea of transactions was taken from databases.
However, the database model is not enough to provide a formal semantics for
TM, because of the differences between database transactions and TM
transactions. Let us define the correctness model used by databases and then
check its applicability to TM. Databases use the ACID properties (failure
atomicity, consistency, isolation, and durability) as the model for their
transaction semantics.
Failure atomicity, or simply atomicity, requires all or nothing. "All" means
that a transaction executes to completion and commits, making its changes
visible. If it fails in the middle of its execution, nothing is written and
everything is left as if the transaction had not executed at all.
Consistency means that any execution of a transaction, successful or failed,
should leave the database or memory in a consistent state. A new transaction
always assumes that the world is in a consistent state.
Isolation means that a transaction must execute as if it were the only
transaction running at that time, even though other transactions execute
concurrently. In databases, the serializability correctness condition is
used to achieve isolation: the result of each transaction's execution is
isolated from the others' as if they had executed serially.
Durability requires that once a result is committed to the hardware it is
final, always present, and available to all other transactions. However,
this property is not required for parallel programming: the medium to which
a transaction's result is committed is main memory, whose data is transient,
unlike the database case where the medium is the hard disk.
The ACI model explained above still does not provide a formal semantics for
TM. ACI specifies no mechanism controlling the interaction between atomic
blocks and the code outside a transaction, that is, the interaction between
transactional and non-transactional accesses to shared resources. Unless a
semantics specifies this interaction it is not fully applicable to the TM
case, and this is the issue with the ACI model.
2.2.2 Single Lock-atomicity
Single lock-atomicity is not a technique for implementing TM; rather, it
specifies the interaction of atomic blocks in TM. Under this model, an
atomic block executes as if all atomic blocks were protected by a single
lock, so at most one block can execute at a time. The model specifies the
isolation expected from transactions but not a transaction's failure
atomicity. It is one way to achieve the serializability I talked about
earlier. This single lock can also be used to describe what happens when a
non-transactional access is encountered; a non-transactional access is any
access to shared data by code outside a transaction.
Having this kind of access to shared data, i.e. from code outside any
transaction, may lead to a datarace [9]. Before I define what a datarace is,
let me define what is known as a conflict. A conflict happens when more than
one concurrent thread accesses the same shared data and at least one of
these threads modifies the data. A datarace happens when such conflicting
accesses are not synchronized properly. When a datarace happens, the
behavior of the code cannot be predicted and will depend on the memory
consistency model used by the system memory. Dataraces can be caused by poor
programming of critical sections. To avoid a datarace, locking or other
synchronization techniques should be used to control concurrent access by
transactional and non-transactional code, or the program control flow should
be arranged carefully.
2.2.3 Retry and OrElse
So far, we have discussed several aspects of TM semantics proposed by
various researchers. However, we have said nothing about coordinating the
order in which two concurrently executing transactions change the program
state. Such coordination between transactions was first introduced by Harris
et al. [10] with the retry statement. When retry is used in a transaction
and anything causes the transaction to abort, it will reexecute later. The
important question about retry is when the transaction should reexecute
after it aborts. Harris, in the same paper, suggested that the transaction
should reexecute only once some value in the set of locations it read before
aborting has changed.
Another operation that can be used to coordinate the execution of two
transactions is orElse [10]. Its operation can be explained as follows [14].
Given P orElse Q, where P and Q are two different transactions, P is
executed first:
1. If P commits or explicitly aborts either by explicit abort statement or
uncaught exception, then Q will not be executed since the orElse
statement terminates.
2. If P executes a retry statement, then Q will be executed as part of the
orElse statement's execution.
3. If Q commits or explicitly aborts, then the orElse statement will
terminate.
20


4. If Q also executes a retry, then the execution of the orElse statement
blocks until a change occurs to a location read by either P or Q, and
then it reexecutes itself.
Finally, the orElse statement should itself be implemented atomically, since
it is a composition of two atomic operations.
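The four cases above can be modeled by treating retry as an exception that escapes P and triggers Q. The sketch below is purely illustrative (in a real STM, a propagated retry would block until a location read by P or Q changes, rather than raising immediately):

```python
class Retry(Exception):
    """Raised by a transaction body to signal a retry statement."""

class Abort(Exception):
    """Raised by a transaction body to abort explicitly."""

def or_else(p, q):
    # Case 1: if P commits (returns) or aborts (raises Abort or any
    # other exception), Q never runs and orElse terminates.
    # Case 2: if P retries, Q runs as part of the orElse execution.
    # Cases 3 and 4: Q's commit/abort ends the orElse; Q's retry
    # propagates, making the whole composition retry.
    try:
        return p()
    except Retry:
        return q()

def p_retries():
    raise Retry              # P executes a retry statement

def q_commits():
    return "Q committed"
```

For example, `or_else(p_retries, q_commits)` yields Q's result, while composing two retrying bodies re-raises `Retry` for the enclosing context to handle.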
3. Design Options in Software Transactional Memory
Different options have been used to build different STM implementations.
Some of these options may simplify the implementation, and some may make it
powerful enough to be a better candidate for replacing lock-based parallel
programming. Some options bear directly on the overhead that may be
associated with an STM, and some indirectly. In this chapter, I try to list
most of the options and tradeoffs that have been used so far in designing
STM implementations.
3.1 Isolation
I have already defined transaction isolation as part of the transaction
correctness criteria. The isolation property requires that transactions
executing concurrently do not affect each other's results; each transaction
executes as if it were the only one executing at that time. Blundell et al.
[3] introduced the terms weak and strong atomicity. However, their term
atomicity can be replaced by isolation, as can be inferred from that paper
[14].
3.1.1 Weak Isolation
Before discussing the weak isolation property, I need to give a formal
definition for it:
Weak isolation is the property of a TM system that guarantees isolation only
between transactions. Non-transactional code can directly access the same
shared data in main memory that transactional code accesses.
If a non-transactional access executes, i.e. a conflicting reference outside
any transaction occurs, the value returned may be inconsistent or may lead
to incorrect execution inside a transaction. With a weak-isolation
implementation, a datarace can result and the system can reach an undefined
state. The following problems, more precisely isolation violations, can
happen because non-transactional accesses reach main memory directly,
bypassing the STM system's protocols [1]:
- Non-repeatable reads: this problem happens when a transaction contains two
read instructions and a write instruction belonging to non-transactional
code executes concurrently. The two reads should return the same value, but
because the non-transactional write interferes with the transaction's
execution, the values returned by the two reads differ, as in the following
example:
Thread 1            Thread 2
atomic {
  r1 = x;           x = 1;
  r2 = x;
}
Let us assume that the initial value of x is 0. Thread 1 executes the first
read, so r1 is assigned 0. Then, before thread 1 performs the second read,
thread 2 writes the new value 1 to x. Thread 1 then performs its second
read, so r2 = 1. Thus r1 != r2 even though both reads belong to the same
transaction, which means isolation is violated.
- Intermediate lost update: this happens when two write accesses, one
transactional and one non-transactional, are performed concurrently on the
same shared data. As an example, consider the following:
Initial value of x is 0
Thread 1            Thread 2
atomic {
  r = x;            x = 10;
  x = r + 1;
}
If the two threads' executions were serialized, the final value of x would
be either 10 (if thread 1 executes first and then thread 2) or 11 (if
thread 2 executes first and then thread 1). Unfortunately, the value of x
ends up being 1, since thread 2 writes after thread 1 performs its read and
before it writes x. Thread 2's write is lost, as if it never happened.
- Intermediate dirty reads: this happens when a non-transactional access can
see an intermediate state of transactional code. The following example
illustrates the problem:
Initial value of x is even
Thread 1            Thread 2
atomic {
  x++;              r = x;
  x++;
}
Thread 1's transactional code always keeps the value of x even. However,
the non-transactional read access will see an odd value if it executes
between thread 1's two increments.
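To see that neither serial order can produce the intermediate-lost-update result, that interleaving can be replayed step by step. This is a deterministic, single-threaded replay, purely for illustration:

```python
# Replay the exact interleaving from the intermediate-lost-update
# example: thread 1 runs "atomic { r = x; x = r + 1; }" while
# thread 2's non-transactional "x = 10" lands between the read and
# the write. A single thread stands in for both.
x = 0          # initial value

r = x          # thread 1: reads 0 inside its transaction
x = 10         # thread 2: non-transactional write
x = r + 1      # thread 1: writes r + 1 = 1, overwriting the 10

# Neither serial outcome (10 or 11) results; thread 2's write is lost.
```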
Weak isolation does not mandate that all accesses to shared data be
transactional; non-transactional accesses are allowed, provided they do not
conflict with transactional ones. The programmer can guarantee this by
making non-transactional accesses take place at a different time or to
different memory locations. The approach used in most STM implementations to
achieve weak isolation is to divide memory manually into two parts: one part
accessed only by transactions, the other only by non-transactional code.
However, dividing memory this way is an error-prone process as software
evolves. Furthermore, such manual partitioning is sometimes not even
required under lock-based synchronization. Therefore, using weak isolation
in STM systems may conflict with the goal of achieving simple and reliable
concurrent programming.
3.1.2 Strong Isolation
The second approach is called strong isolation, where the TM semantics are
guaranteed between transactional and non-transactional code. In this model
the database model is enforced: all accesses to shared data are
transactional, or become so. Any operation outside an atomic block that
accesses shared data is effectively turned into a transaction. Strong
isolation is well known for the overhead it can add to a TM implementation
that adopts it, which explains why most high-performance implementations
were designed to provide weak isolation. Researchers have proposed several
ways to achieve strong isolation. One is to restrict transactional code to
data that live inside transactions; this is the approach used in Haskell
STM [3]. It works because all mutable data in Haskell STM are inside
transactions; the remaining data are read-only and can therefore be accessed
by either transactional or non-transactional code. On the other hand, in
non-functional languages, where most data are mutable, the program must be
divided into transactional and non-transactional data. This classification
is hard to implement, as data would need to be transformed into
transactional data when passed from the non-transactional part of the
program and then back to its original state after processing. Another
approach, proposed by Lev and Maessen, is to track shared objects at
runtime [15]: any object visible to more than one thread may be accessed
only through transactions during concurrent access. Although this can be
considered a good approach, it may introduce more overhead to the
STM. However, the authors rely heavily on the compiler for optimizations and
hence lower overhead. In the following, I present Ananian's and Rinard's STM
implementation, which supports strong isolation.
Example: Ananian and Rinard STM:
The STM implementation proposed by Ananian and Rinard in 2005 is one of the
few STM implementations built around the strong isolation design option [2].
Most STM implementations proposed so far employ weak isolation as the
transaction isolation property. Ananian's and Rinard's implementation
provides strong isolation: any memory access outside a transaction can abort
the running transaction. In most weak-isolation STM implementations, by
contrast, a non-transactional read or write to transactional data can happen
without aborting the transaction; instead, a mechanism ensures that
non-transactional accesses do not conflict with transactional ones. The
reason for choosing weak isolation is to avoid the overhead of instrumenting
every non-transactional read and write. Ananian's and Rinard's STM
implements a novel way of providing strong isolation that lowers this
instrumentation overhead: they place sentinel values in memory locations
accessed by transactions. I provide a detailed description of their strong
isolation implementation in the following paragraphs.
In their implementation, a special value called FLAG is placed in locations
of objects involved in transactions. Attempting to read or overwrite a
flagged value tells the non-transactional code that exceptional processing
is required.
Object Structures:
The basic data structures of Ananian's and Rinard's STM implementation are
shown in the figure below.
FIGURE 3.1: Ananian's and Rinard's STM Data Structures
Two additional fields are added to each object: a versions field and a
readers field. The versions field points to a singly-linked list of object
versions. Each object version has an owner field pointing to its transaction
object, whose single status field corresponds to committed, aborted, or
waiting; there is one object version per transaction. The second field is
the readers field, which
points to a singly-linked list of all the transactions that have read from
this object, other than committed or aborted ones. The FLAG value can be any
value, as long as it is not a common one. Each object field carries the
semantic value of the original object structure when it does not hold FLAG;
in the latter case, the field's value is that of the first committed
transaction in the object's version list.
Operations:
Ananian and Rinard support the following operations in their STM
implementation: transactional read/write, non-transactional read/write,
begin, abort, and commit. In the context of strong isolation, the only two
operations that concern us are the non-transactional read and write.
Read:
The non-transactional read in Ananian's and Rinard's STM is performed by the
readNT method, which reads field f of object o non-transactionally and
stores the result in v. When the value read is a FLAG value, the thread has
to copy back the field value from the most recently committed transaction.
In addition, the thread has to abort all the uncommitted transactions and
then try again. To tell the thread to abort only the transactional writers
and not the transactional readers, the kill_writers constant is passed to
the copy-back method, as seen in the code below.
inline readNT(o, f, v) {
  do
  :: v = object[o].field[f];
     if
     :: (v != FLAG) -> break /* done! */
     :: else
     fi;
     copyBackField(o, f, kill_writers, _st);
     if
     :: (_st == false_flag) ->
        v = FLAG;
        break
     :: else
     fi
  od
}
Write:
The non-transactional write in Ananian's and Rinard's STM is performed by
invoking the writeNT method, which writes the new value nval to field f of
object o. For this non-transactional write to be performed, the reader list
must be empty. To check this condition, a Load-Linked/Store-Conditional
operation is used so that the write is stored only if the reader list is
empty. If it is not empty, the copy-back procedure is called with the
kill_all constant, which ensures that all transactional readers and writers
are aborted and hence that the reader list becomes empty. The writeNT
procedure is shown below:
inline writeNT(o, f, nval) {
  if
  :: (nval != FLAG) ->
     do
     :: atomic {
          if /* this is a LL(readerList)/SC(field) */
          :: (object[o].readerList == NIL) ->
             object[o].fieldLock[f] = _thread_id;
             object[o].field[f] = nval;
             break /* success! */
          :: else
          fi
        } /* unsuccessful SC */
        copyBackField(o, f, kill_all, _st)
     od
  :: else -> /* create false flag */
     /* implement this as a short *transactional* write:  */
     /* start a new transaction, write FLAG, commit the   */
     /* transaction; repeat until successful.             */
     /* Implementation elided. */
  fi;
}
3.2 Nesting Transactions
Sometimes programmers need to execute a transaction inside another
transaction, so that the inner one executes within the dynamic scope of the
outer transaction. Such a transaction is called a nested transaction.
Suppose we have two transactions, one nested inside the other: the nested
one is called the inner transaction and the one containing it the outer
transaction. Assuming a single thread of control for both, when the outer
transaction reaches the inner one, control passes to the inner transaction.
The inner transaction then executes, and it can see any modification made by
the outer (parent) transaction. Larus and Rajwar listed various ways to link
such transactions, as we will see in the following sections [14].
3.2.1 Flattened Transaction
In this type of nested transaction, if the inner transaction aborts, the
outer transaction aborts as well. When the inner transaction commits,
however, the outer transaction is the only one that can see the changes the
inner one produced. In addition, the outer transaction does not
commit merely because the inner one does; once the outer commits, the inner
transaction's modifications become visible to all other transactions. The
following example illustrates the behavior of flattened transactions:
int x = 1;
atomic {
x = 2;
atomic flatten {
x = 3;
abort;
}
}
In the above example, if the outer transaction aborts, whether because the
inner one aborted or because it decided to abort for any other reason (even
if the inner one committed), then x will be 1. Although flattened
transactions are considered easy to program, they are also a poor
abstraction that does not provide composable atomic transactions, since
aborting the inner transaction aborts the outer one.
3.2.2 Closed Transaction
As an alternative to flattened transactions, a closed transaction has the
advantage that the inner transaction's effect on the outer one is weaker. In
a closed transaction, aborting the inner transaction does not terminate the
execution of the outer transaction. When an inner transaction aborts, all of
its log is discarded, a notification is sent to the outer one, and the
parent transaction
then chooses whether it wants to abort as well [14]. Similarly, when the
inner transaction commits, control passes to the outer one, and the outer
transactions can see any change introduced by the inner one. However, the
other transactions (the non-parents) cannot see the inner transaction's
modifications until all the surrounding parent transactions commit. The code
below illustrates the operation of closed nested transactions.
below explains the operation of closed nested transactions.
int x = 1;
atomic {
x = 2;
atomic closed {
x = 3;
abort;
}
}
When the inner transaction aborts, its operations are undone and all its log
is discarded. The value of x will be 2 if the outer transaction commits, or
1 if both the inner and outer transactions abort. Furthermore, if the inner
transaction commits but all the outer transactions abort, the value of x is
also 1, and the change introduced by the inner transaction is not visible to
other transactions, since its parents did not commit. What is interesting
about the closed model is that no conflict occurs between the inner
transaction and its outer ones. A conflict may still exist, however, between
the inner transaction and a transaction belonging to a different family
(none of its parents).
3.2.3 Open Transaction
In the open nested transaction model, when the inner transaction commits,
its modifications become visible to all transactions, parents or not,
whatever the results of the outer transactions. The following example is
similar to the previous ones except for the open keyword, which enables the
open nested transaction model.
int x = 1;
atomic {
x = 2;
atomic open {
x = 3;
}
abort;
}
In this case, if the inner transaction commits, its change is visible even
if the outer transaction aborts; the value of x is then 3.
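The globally visible value of x in the running example (x initially 1, the outer writes 2, the inner writes 3) can be tabulated per nesting model. This sketch is purely illustrative and mine, not from the cited sources:

```python
def final_x(model, inner_aborts, outer_aborts):
    """Globally visible x after: x = 1; outer writes 2; inner writes 3."""
    if model == "flattened":
        # Any abort, inner or outer, undoes everything.
        return 1 if inner_aborts or outer_aborts else 3
    if model == "closed":
        if outer_aborts:
            return 1                 # inner's effect dies with its parent
        return 2 if inner_aborts else 3
    if model == "open":
        if not inner_aborts:
            return 3                 # inner committed independently
        return 1 if outer_aborts else 2
    raise ValueError(model)
```

For instance, `final_x("closed", False, True)` is 1 (a committed inner write dies with an aborting parent), while `final_x("open", False, True)` is 3.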
3.3 Transaction Granularity
One of the key parameters in designing a TM is transaction granularity: the
unit of storage over which the TM system detects conflicts [14]. Most STM
implementations use one of two granularities: word (or block) and object
[20].
3.3.1 Block or Word Granularity
With this granularity, at most one transaction controls a shared word at a
time, ensuring that any modification of the shared word is atomic. A
dedicated record attached to each shared word stores its exclusive
ownership, and the system automatically transfers ownership of these records
between transactions. This implementation was introduced by Shavit and
Touitou in their STM paper, which is also considered the first formal STM
implementation [19].
3.3.2 Object Granularity
Object granularity is usually implemented in systems using object-oriented
languages. With this granularity, the STM detects a conflicting access to an
object even if the transactions referenced different fields. Most STM
implementations use object granularity: the original object structure need
not change when a non-transactional program is translated into a
transactional one, and any object is free to execute either inside or
outside a transaction without modification. Many STM implementations use
object granularity; one of them is DSTM [12].
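The difference between the two granularities comes down to how the conflict detector keys its ownership information. The sketch below is a simplified, illustrative model (names are mine), treating each access as an (object, field) pair:

```python
def conflicts(writes_a, writes_b, granularity):
    """Detect a write/write conflict between two transactions.

    Each access is an (object_id, field) pair. With 'object'
    granularity the detector keys on object_id alone, so writes to
    *different fields* of the same object still collide (a false
    conflict); with 'word' granularity it keys on the exact field.
    """
    if granularity == "object":
        key = lambda a: a[0]          # ignore the field
    else:                             # 'word' granularity
        key = lambda a: a             # whole (object, field) pair
    return bool({key(a) for a in writes_a} & {key(b) for b in writes_b})
```

Two transactions updating `("node7", "left")` and `("node7", "right")` conflict under object granularity but not under word granularity, which illustrates the precision/overhead tradeoff between the two designs.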
3.4 Version Management
During the execution of a transaction, an STM system has to maintain two
versions of the global data: the new or
updated version and the old or original version. The new or updated version
is the data produced by one of the pending transactions; it is supposed to
become visible to all other transactions once the executing transaction
succeeds and commits its computations. The old or original version,
maintained by the TM system during execution, is the one that existed in
memory before the executing transaction started. This version is required in
case the executing transaction aborts for any reason and needs to roll back
to the original value.
In general, researchers have proposed and implemented two approaches to
maintaining the new and old versions of global data: eager versioning and
lazy versioning. These approaches are not always called eager and lazy
versioning; some researchers call them direct and deferred update [14].
Eager versioning is direct update and lazy versioning is deferred update,
under different names. In this thesis I use the terms eager and lazy
versioning, as they are more widely used in the literature than direct and
deferred update.
3.4.1 Eager Versioning
A TM system implemented with eager versioning writes the new data version
directly to memory: the modification introduced by the executing transaction
is applied to the object itself. The new version is nevertheless not visible
to other transactions, since this type of versioning requires the TM system
to use some
form of concurrency control, such as locks, throughout the transaction's
execution. The TM system thus ensures that the uncommitted version of the
data residing in memory is not visible to any other transaction until it
commits. The old version is stored in an undo buffer or log so that, if the
transaction aborts, the TM system can restore the original version to the
memory location. This type of versioning commits quickly: a committing
transaction only needs to make the object visible to other transactions and
discard the old version stored in the undo buffer. Additional delay may be
experienced if the transaction aborts, however, since the original version
must be written back to its memory address. In addition, concurrency is
limited during the execution of the transaction until the final decision is
reached.
FIGURE 3.2: Eager Versioning
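The eager-versioning flow just described (write in place, log the old value, fast commit, slow abort) can be sketched in code. This is a minimal illustrative model with names of my own choosing, not any cited STM:

```python
class EagerTx:
    """Eager (direct-update) versioning: writes go straight to
    memory; an undo buffer keeps the old values for rollback."""

    def __init__(self, memory):
        self.memory = memory
        self.undo = {}                     # location -> old value

    def write(self, loc, val):
        self.undo.setdefault(loc, self.memory[loc])  # save original once
        self.memory[loc] = val                       # update in place

    def commit(self):
        self.undo.clear()                  # fast: just drop the undo log

    def abort(self):
        self.memory.update(self.undo)      # slower: restore old versions
        self.undo.clear()
```

Mirroring the figure's values: writing 10 over an initial 5 leaves 10 in memory and 5 in the undo buffer; commit discards the 5, abort restores it.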
3.4.2 Lazy Versioning
The second type of data versioning, lazy versioning, has been widely used in
most STM implementations. Here, any data modification is stored as a new
version in a buffer private to the transaction, while the old or original
version stays unchanged in memory; this enhances concurrency rather than
restricting it as eager versioning does. Once the transaction finishes
execution and decides to commit, the new version is moved from the private
buffer to memory. If the decision is to abort, the transaction only needs to
discard whatever is stored in its buffer. Lazy versioning thus aborts
quickly: as mentioned earlier, an aborting transaction just discards the new
version isolated in its per-transaction buffer. On the other hand, lazy
versioning may experience some delay on commit, since the TM system must
copy the new updated version over the old version stored in memory; this
delay comes from the need to search the buffer for the latest version.
FIGURE 3.3: Lazy Versioning
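The lazy-versioning flow (buffer the write privately, fast abort, copy out on commit) can be sketched the same way, again as a minimal illustrative model of my own:

```python
class LazyTx:
    """Lazy (deferred-update) versioning: writes are buffered
    privately; memory changes only at commit."""

    def __init__(self, memory):
        self.memory = memory
        self.buffer = {}                   # location -> new value

    def read(self, loc):
        # The transaction sees its own buffered writes first.
        return self.buffer.get(loc, self.memory[loc])

    def write(self, loc, val):
        self.buffer[loc] = val             # old version stays in memory

    def commit(self):
        self.memory.update(self.buffer)    # slower: copy new versions out
        self.buffer.clear()

    def abort(self):
        self.buffer.clear()                # fast: just discard the buffer
```

Mirroring the figure's values: writing 10 over an initial 5 leaves 5 in memory and 10 in the private buffer; commit copies the 10 out, abort discards it.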
3.5 Concurrency Control
Concurrent transactions executed by a TM system need to be synchronized to
avoid conflicting concurrent accesses to an object. I have already defined a
conflicting access: two or more transactions concurrently access the same
object and at least one of them modifies it. After a conflict occurs, the TM
system determines that a conflict has happened; this is called conflict
detection. Once a conflict is detected, the TM system or the code in a
transaction applies a mechanism or contention policy to ensure correctness,
either by aborting or by delaying one of the conflicting transactions. These
three events (conflict occurrence, detection, and resolution) must happen in
this order, though not necessarily at the same time. STM implementations
have used two different approaches to concurrency control: pessimistic and
optimistic.
3.5.1 Pessimistic Concurrency Control
In pessimistic concurrency control, the three events (conflict occurrence,
detection, and resolution) all happen at the same point during execution: a
conflict is detected and resolved as the transaction is about to access a
shared object. With pessimistic concurrency control, a transaction claims
exclusive ownership of an object, preventing other transactions from
accessing it. Deadlock can arise, however, when each of two transactions
holds exclusive access to an object the other one wants. Such deadlocks can
be avoided by enforcing a predetermined order for acquiring exclusive access
to objects. If a deadlock between two transactions does occur, one of them
has to be aborted.
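The ordered-acquisition idea can be sketched with plain locks. This is an illustrative model of pessimistic, deadlock-free acquisition (object names and the transfer example are mine):

```python
import threading

# One lock per shared object; acquiring it models claiming
# exclusive ownership pessimistically, at access time.
locks = {"a": threading.Lock(), "b": threading.Lock()}

def acquire_all(names):
    # Predetermined global order (sorted name): two transactions can
    # never each hold an object the other wants, so no deadlock.
    for name in sorted(names):
        locks[name].acquire()

def release_all(names):
    for name in sorted(names, reverse=True):
        locks[name].release()

def transfer(frm, to, amount, balances):
    acquire_all([frm, to])        # exclusive ownership of both objects
    try:
        balances[frm] -= amount
        balances[to] += amount
    finally:
        release_all([frm, to])
```

Two threads transferring in opposite directions ("a" to "b" and "b" to "a") would deadlock under naive first-come acquisition, but with the sorted order both acquire "a" before "b" and complete.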
3.5.2 Optimistic Concurrency Control
Optimistic concurrency control differs from pessimistic in that conflict
detection and resolution do not happen at the same point of execution as
the conflict occurrence. In this
case, more than one transaction can access the same object concurrently, but
any conflict must be detected and resolved before a transaction commits. A
conflicting transaction can be delayed, restarted, or aborted while the
other continues its execution. The advantages of this type of concurrency
control are avoiding locks and their problems, and providing more
concurrency, especially when conflicts are rare.
3.6 Blocking versus Non-blocking
With respect to forward progress guarantees, i.e. whether a transaction
attempting to access shared data will be able to finish its execution,
synchronization can be divided into two types: blocking and non-blocking.
3.6.1 Blocking
Blocking is the legacy way to provide concurrency control, using locks,
monitors, or semaphores. At any time, only one transaction is allowed
exclusive access to a piece of shared data. This type may prevent forward
progress through the lock problems I discussed in chapter 1: priority
inversion, convoying, and deadlock. Here, even though locks provide the
synchronization, they are used only inside the STM code, not by programmers
in their own code.
3.6.2 Non-blocking
Early STMs were implemented with non-blocking synchronization, which
provides better forward progress guarantees than the
blocking one, and avoids all the problems associated with blocking
synchronization techniques. Based on the strength of its forward progress
guarantees, non-blocking synchronization can be classified into three types:
Wait-freedom: the strongest assurance; it guarantees that every thread
competing to access shared data makes forward progress in a finite number of
its own time steps. This type causes neither deadlock nor starvation.
Lock-freedom: a weaker assurance; it guarantees that at least one of the
threads competing to access shared data makes forward progress within a
finite number of time steps of any of the concurrent threads. Deadlock can
be avoided with this type, but starvation cannot.
Obstruction-freedom: the weakest assurance; it guarantees that a thread
makes progress in a finite number of its own time steps in the absence of
contention. Although deadlocks are avoided with this type, livelocks are
not.
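To make the lock-free guarantee concrete, here is the classic compare-and-swap (CAS) retry loop as a sketch; this is my own illustration, and the `AtomicRef` class merely simulates a hardware CAS instruction so that the example is runnable in plain Python.

```python
import threading

class AtomicRef:
    """Simulates a single atomic memory word. Real lock-free code relies
    on a hardware CAS instruction; the internal lock here only stands in
    for that atomicity in plain Python."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def get(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Atomically: if the value equals expected, set it to new."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

def lock_free_increment(ref):
    # Lock-free retry loop: a failed CAS means some other thread's CAS
    # succeeded, so the system as a whole always makes progress, but an
    # unlucky individual thread may retry forever (starvation is possible).
    while True:
        old = ref.get()
        if ref.compare_and_swap(old, old + 1):
            return old + 1

counter = AtomicRef(0)
threads = [threading.Thread(
               target=lambda: [lock_free_increment(counter) for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.get())  # 4000: no increment is lost
```

The loop is lock-free but not wait-free: progress is guaranteed for the system, not for any particular thread.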
3.7 Object Acquisition
The acquisition of an object happens when a transaction declares its
ownership of the object in non-blocking STM systems, or acquires the lock
guarding that object in blocking STM systems. Based on the time of
acquisition, acquisition can be divided into two types: lazy acquire and
eager acquire.
3.7.1 Lazy Acquire
In this type of object acquisition, the object is acquired only at commit
time, which means that conflict detection is deferred until commit. With
this type, a doomed transaction is allowed to continue running, and the
system may overlook a potential conflict for some time. One advantage of
this type is that short-running reader transactions can commit in parallel
with the execution of long-running transactions that also commit. Many STM
implementations, such as OSTM and Haskell STM, use this type of object
acquisition and conflict detection.
3.7.2 Eager Acquire
With eager acquisition, a transaction has to acquire the object it is trying
to access at open time. This allows transactions to detect a conflict early
and hence to save the useless work done by a transaction that is eventually
going to abort. However, with eager acquisition a transaction may abort a
conflicting transaction and then itself fail to commit, which wastes the
work invested by both the aborted transaction and the aborting one. Many STM
implementations, such as DSTM, WSTM and McRT-STM, use this type of object
acquisition and conflict detection.
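The difference in timing can be sketched as follows; this is my own illustration, with assumed names (`TObject`, `Transaction`, a single `owner` word per object), not the design of any of the systems cited above.

```python
class TObject:
    """A transactional object with a simple ownership word."""
    def __init__(self, value):
        self.value = value
        self.owner = None          # transaction currently holding the object

class ConflictError(Exception):
    pass

class Transaction:
    def __init__(self, name, eager=True):
        self.name = name
        self.eager = eager
        self.write_set = []

    def _acquire(self, obj):
        if obj.owner is not None and obj.owner is not self:
            raise ConflictError(f"{self.name} conflicts with {obj.owner.name}")
        obj.owner = self

    def open_for_write(self, obj):
        if self.eager:
            self._acquire(obj)      # eager: conflict surfaces at open time
        self.write_set.append(obj)  # lazy: merely remember the object

    def commit(self):
        if not self.eager:
            for obj in self.write_set:
                self._acquire(obj)  # lazy: conflict surfaces only now
        for obj in self.write_set:
            obj.owner = None        # release ownership after committing
        self.write_set.clear()
```

Under eager acquisition a second transaction opening the same object fails immediately, saving the work it would otherwise waste; under lazy acquisition both opens succeed and the loser is discovered only at commit time.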
3.8 Contention Management
Whenever a conflict over an object is detected between two transactions, one
of them must be aborted to resolve the conflict. The part of the TM system
responsible for this decision, aborting one of the transactions, is called
the contention manager. The contention manager implements one or more
contention policies, and based on these policies the conflicting transaction
to abort is chosen. Some contention manager implementations may decide not
to abort the conflicting transaction; instead, the contention policy can
call for delaying it. Different contention managers can have different
costs, depending on their contention policies.
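As one concrete example of a contention policy, the following sketch (my own illustration, not a policy taken from any cited implementation) resolves a conflict with a timestamp rule: the older transaction wins, and the younger one is either aborted or merely delayed.

```python
import itertools

_clock = itertools.count()  # logical clock assigning start timestamps

class Tx:
    def __init__(self, name):
        self.name = name
        self.birth = next(_clock)   # logical start time
        self.status = "active"

def resolve_conflict(attacker, victim, policy="abort_younger"):
    """Decide which of two conflicting transactions is penalized.
    Returns the penalized transaction."""
    older, younger = sorted((attacker, victim), key=lambda t: t.birth)
    if policy == "abort_younger":
        younger.status = "aborted"      # the older transaction wins
        return younger
    elif policy == "delay_attacker":
        attacker.status = "waiting"     # back off and retry later
        return attacker
    raise ValueError(f"unknown policy: {policy}")

t1 = Tx("T1")
t2 = Tx("T2")
loser = resolve_conflict(attacker=t2, victim=t1)
print(loser.name, loser.status)  # T2 aborted (T2 started later)
```

The cost of such a manager depends on the policy: timestamp comparison is cheap, while policies that track past work or wait-for graphs cost more per conflict.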
3.9 Other Options
There are other options that can differentiate one STM implementation from
another:
3.9.1 Exceptions
Some TMs include exception handling mechanisms in which the try and catch
keywords are used to surround the atomic block. Exception handling can be
implemented to either terminate or abort the transaction.
When the TM terminates the transaction, control exits the transaction, but
before it leaves, whatever changes have been introduced are written, that
is, committed by the system.
The other form of exception handling aborts the transaction whenever an
exception is raised, without committing any changes.
In the end, it depends on the language used to implement the TM and on the
compiler whether a flexible solution can be provided that includes both of
the exception handling methods mentioned above.
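A toy sketch of the two policies follows; the `Atomic` context manager and its `on_exception` flag are my own assumptions for illustration, not the interface of any cited TM.

```python
class Atomic:
    """Toy atomic block over a dict-backed store. on_exception selects
    between the two policies: 'commit' (terminate the transaction but
    keep the updates made so far) or 'abort' (roll everything back)."""

    def __init__(self, store, on_exception="abort"):
        self.store = store
        self.on_exception = on_exception
        self._snapshot = None

    def __enter__(self):
        self._snapshot = dict(self.store)      # saved for a possible rollback
        return self.store

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None and self.on_exception == "abort":
            self.store.clear()
            self.store.update(self._snapshot)  # discard the partial updates
        # under 'commit', updates made before the exception are kept
        return False                           # re-raise the exception either way

data = {"x": 1}
try:
    with Atomic(data, on_exception="abort"):
        data["x"] = 99
        raise RuntimeError("boom")
except RuntimeError:
    pass
print(data["x"])  # 1: the aborting policy rolled the write back
```

Switching the flag to `"commit"` leaves `data["x"]` at 99 after the same exception, which is exactly the terminate-and-commit behavior described above.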
3.9.2 Library versus Programming Language
Another implementation option is whether to provide the transactions as part
of a library or to integrate them into the syntax of the programming
language. TM implementations exist for both options. Although the latter
option, in which transactions are integrated as primitives into the
programming language semantics, is preferable, the integration should be
done carefully, as Carlstrom et al. did with Atomos [4]. Having clear
semantics helps the compiler implement better optimizations.
3.9.3 Uniprocessor versus Multiprocessor
Some TM implementations are designed to run on a uniprocessor system,
whereas others take advantage of the new multiprocessor and multicore
trend.
4. STM Overhead
Many researchers have believed that TM in general is a promising solution to
the problems of programming multicore processors, or even of parallel
programming in general. TM was first introduced as a solution implemented in
hardware, HTM, but that kind of implementation has been considered
expensive, which made it less feasible in the view of most researchers. STM
was then proposed as a solution that still avoids most of the pitfalls of
locking-based synchronization but does not require any extra hardware and
hence adds no extra cost. This is the main reason most researchers shifted
to STM. Nowadays, many STM implementations have been proposed, in different
languages and with different flavors. Although STM is still largely a
research subject, some researchers have begun to argue that it will not go
far and might not be the optimal replacement for the currently used
synchronization techniques that rely on mutual exclusion. One of the most
important reasons for that belief is the overhead encountered when using
STM, which is why some researchers have called it a research toy [5].
Having discussed most of the STM design options and tradeoffs that have been
used in various STM implementations, I now present some of the sources of
overhead related, directly or indirectly, to these options.
In their effort to design a low overhead STM implementation, which they
named RSTM, Marathe et al. listed several possible sources of overhead [17].
These sources are:
Managed Code: they used C++ instead of the Java of their prior
implementation, ASTM, expecting to avoid the costs of run-time semantic
checks, lock-based thread-safe libraries and always-virtual method dispatch.
Even so, Sun's HotSpot Java implementation typically runs programs slightly
faster than equivalent C++ versions compiled with GCC.
Bookkeeping: in object-based STM implementations, the minimum number of CAS
operations required to acquire n objects and commit is n + 1. In addition, n
further CAS operations might be required for post-commit cleanup of object
headers. The private read lists and write lists add further overhead. As a
result, bookkeeping introduces considerable overhead in the single-thread
and low-contention cases.
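The n + 1 count can be made concrete with a sketch of a commit routine; this is my own illustration, in which `Header`, `Descriptor` and the `cas` helper are assumed names and the CAS itself is only simulated so the operations can be counted.

```python
cas_count = 0

def cas(obj, field, expected, new):
    """Simulated compare-and-swap on obj.field, counting every attempt.
    A real STM would use a hardware CAS instruction here."""
    global cas_count
    cas_count += 1
    if getattr(obj, field) == expected:
        setattr(obj, field, new)
        return True
    return False

class Header:
    """Per-object header holding the ownership word."""
    def __init__(self):
        self.owner = None

class Descriptor:
    """Transaction descriptor holding the status word."""
    def __init__(self):
        self.status = "active"

def commit(tx, objects):
    for h in objects:
        if not cas(h, "owner", None, tx):             # n CASes to acquire
            return False
    if not cas(tx, "status", "active", "committed"):  # +1 CAS to commit
        return False
    for h in objects:
        cas(h, "owner", tx, None)                     # n post-commit cleanups
    return True

tx = Descriptor()
objs = [Header() for _ in range(5)]
assert commit(tx, objs)
print(cas_count)  # 11 CAS operations: 5 acquires + 1 commit + 5 cleanups
```

Even in this uncontended run, committing five objects costs eleven atomic operations, which is why bookkeeping dominates in the low-contention case.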
Memory Management: memory management is required both for data objects and
for dynamically allocated metadata (transaction descriptors, DSTM Locators,
OSTM object handles). In garbage-collected languages, there is the
additional cost of tracing and reclamation.
Contention Management: conflict resolution can lead to significant overhead,
as can the sorting required for avoiding deadlocks. The work lost in
aborting or delaying a conflicting transaction can also be considered a
large overhead on top of the other sources I have already listed. However,
the contention manager's overhead ultimately depends on the options used in
building the STM. Choosing the synchronization style of the implementation
is critical, because an obstruction-free STM has a better chance of
minimizing the useless work that might be required by lock-free or wait-free
STM implementations.
Copying: in STM, every writer creates a copy of every object it is about to
write. Small objects might not result in significant overhead compared to
the overhead of bookkeeping. However, if an object is large and only a small
change to it is required, then the overhead resulting from the unneeded
copying can be significant.
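The copy-the-whole-object cost is easy to see in a sketch; this is my own illustration of clone-on-write as an object-granularity STM would perform it, and the names are assumptions.

```python
class BigObject:
    """A transactional object whose payload may be large."""
    def __init__(self, payload):
        self.payload = payload

def open_for_write(obj, index, value):
    """Clone-on-write: the whole payload is copied even though only one
    element changes, so the cost is proportional to the object's size."""
    clone = BigObject(list(obj.payload))
    clone.payload[index] = value
    return clone

original = BigObject(list(range(100_000)))
updated = open_for_write(original, 0, -1)
print(original.payload[0], updated.payload[0], len(updated.payload))
# 0 -1 100000: a one-element change still paid for a 100,000-element copy
```

Word or block granularity would log only the single changed location here, at the price of tracking every location individually.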
5. Findings
In this section, I list all of my recommendations regarding the STM design
options that might increase or decrease the overhead. I have already said a
few words about these options and their use in STM systems in the earlier
chapters.
Assertion 1
Regarding the isolation property, I have mentioned two design approaches
used by various STM implementations: weak isolation and strong isolation.
Strong isolation is considered better than weak isolation, since it
guarantees isolation both between transactions themselves and between
transactions and non-transactional accesses. Weakly isolated STM
implementations need the shared memory to be divided manually into two
parts: transactional data that can be accessed only by transactions, and
non-transactional data that can be accessed by non-transactional code
outside transactions. Dividing the memory this way is not ideal, since it is
error prone, especially as software evolves. Moreover, dividing the memory
does not help make STM, and TM in general, a compelling concurrent
programming technique, because locking-based synchronization with critical
sections does not require this hassle. In addition, failing to maintain that
division of the memory leads to several issues: non-repeatable reads,
intermediate lost updates and intermediate dirty reads [1].
The solution that I strongly recommend is to deny any direct non-
transactional access to the shared memory by implementing STM with strong
isolation. In addition to avoiding the issues related to weak isolation, the
overhead of implementing STM with strong isolation can be kept down. One
good approach to implementing strong isolation is to use a functional
language such as Haskell, because in Haskell STM all mutable data is
accessed through transactions, and hence no extra overhead is required to
achieve strong isolation.
However, imperative and object-oriented languages are more widely used,
especially in industrial and commercial applications, and programmers are
more familiar with imperative languages than with functional ones.
Therefore, I recommend implementing STM with strong isolation in these
languages as well. To contain the extra overhead that this approach adds to
the overall STM overhead, efficient read and write barriers can be applied
to memory accessed outside transactions. Optimization can then reduce the
overhead of these barriers and sometimes even remove it. The cost of the
suggested read and write barriers can be reduced by the following
enhancements:
1. Tracking each thread's local objects at runtime to eliminate unneeded
synchronization.
2. Eliminating barriers wherever they are not required.
3. Implementing some form of barrier aggregation to amortize the barriers'
cost.
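As a minimal sketch of the first enhancement above (my own illustration; `Cell` and its `thread_local` flag are assumptions standing in for runtime tracking of thread-local objects):

```python
import threading

class Cell:
    """A shared location. A strongly isolated STM would route even
    non-transactional accesses to it through a barrier."""
    def __init__(self, value, thread_local=False):
        self.value = value
        self.thread_local = thread_local  # discovered by runtime tracking
        self._lock = threading.Lock()

def nontransactional_write(cell, value):
    """Write barrier for code running outside any transaction."""
    if cell.thread_local:
        cell.value = value    # barrier elided: no other thread can see it
        return "elided"
    with cell._lock:          # full barrier for genuinely shared cells
        cell.value = value
    return "barrier"

shared = Cell(0)
local = Cell(0, thread_local=True)
print(nontransactional_write(shared, 5))  # barrier
print(nontransactional_write(local, 7))   # elided
```

When the runtime can prove a location thread-local, the barrier, and its synchronization cost, disappears entirely for that access.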
Assertion 2
Support for nested transactions is included in most of the STM
implementations I have seen, except for a few that are quite old. What I
discovered is that most of the STMs that support nested transactions use the
flattened type, while a few use the open one. I think the reason that made
most STM implementers choose the flattened type is that it is easy to
implement. However, as I mentioned, the flattened type does not provide
composable transactions. The rest, which use the closed type, might
experience more overhead: when an inner transaction aborts, the outer
transaction continues its computation, and if the outer transaction
eventually aborts as well, it would have been more efficient to abort the
outer one directly.
I think the trend in supporting nested transactions in STM implementations
will be to choose the open type; in my opinion it is the best of the three.
When an inner transaction commits, making all other transactions, parent or
not, see the result of the inner transaction saves a lot of STM overhead,
whereas hiding the result of a committed inner transaction adds extra
overhead to the STM. In addition, if the outer transaction decides to abort
and execute again, it does not need to re-execute the inner one, because it
has already executed. The open type also increases concurrency, since the
resources that would otherwise stay reserved by the inner transaction are
released earlier.
Assertion 3
Given the two design options around which the STM have been using to
detect the conflict object granularity or word (or block) granularity.
Although object granularity is more preferable especially with STM
implementations built in object-oriented languages, it can be major source of
overhead. The size of the object affects the overhead of any STM system. One
of the disadvantages of object granularity is sometimes with a large size
objects opened by a transaction to perform a write, the transaction will need
to create a new copy of that object to put its modification. The problem is the
transaction would create the new copy even if the modification is minor and
hence copying such object can cause significant overhead. Another issue is
the false conflict that might happen if two transactions reference two
different field of the same object like the case of multidimensional array. Such
problems might be avoided by using word granularity but again it requires
more space and time in order to track and compare read sets and write sets.
The time and space required by the world granularity is always required.
My opinion is with all the disadvantages of the object granularity, it is still
better than the word granularity in terms of overhead. Unlike the word
granularity, object granularity helps programmers to reason about their
programs and hence they can introduce further enhancements and
optimizations to detect conflicts. In addition, the overhead with the object
granularity might be noticeable only in case large objects.
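The false-conflict point can be illustrated with a small sketch (my own, with assumed names): two transactions write disjoint fields of the same object, which object granularity flags as a conflict while word granularity does not.

```python
def conflicts(writes_a, writes_b, granularity):
    """Each write set is a list of (object_id, field) pairs."""
    if granularity == "object":
        a = {obj for obj, _ in writes_a}     # track whole objects only
        b = {obj for obj, _ in writes_b}
    else:                                    # "word": track each field
        a, b = set(writes_a), set(writes_b)
    return not a.isdisjoint(b)

# T1 and T2 update different rows of the same multidimensional array:
t1_writes = [("matrix", "row0")]
t2_writes = [("matrix", "row1")]
print(conflicts(t1_writes, t2_writes, "object"))  # True: a false conflict
print(conflicts(t1_writes, t2_writes, "word"))    # False: no real overlap
```

The word-level sets are larger and must be compared element by element, which is exactly the extra space and time cost described above.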
6. Conclusion
In my thesis, I started by defining the problems posed by parallel
programming and the locking-based synchronization model. Then I mentioned
some of the published issues resulting from using the legacy locking
techniques in parallel programming. I also introduced TM, with a focus on
STM, and gave some background about the implementations, operations and
semantics of STM.
In my thesis, I concentrated on the main STM design tradeoffs. Some of these
options have a direct relation to the increased overhead of STM.
Unfortunately, the area of STM implementation overhead is still not clearly
mapped out. One reason is that STM, and TM in general, is still at the
research stage. Moreover, most parallel programmers are not familiar with
TM, and those who know it still do not see it as a mature environment to
build their code on.
The area of TM, and of STM specifically, is still an active area of
research, and more researchers are now interested in making it a feasible
and successful replacement for the current parallel programming paradigms.
More work needs to be done on evaluating the overhead of STM and, if
possible, on reducing this disadvantage.
One of the problems regarding the overhead of STM is the differences among
implementations; it is still an open issue to develop methods for evaluating
the overhead of all STM implementations.
BIBLIOGRAPHY
[1] Adl-Tabatabai, A., Shpeisman, T., Menon, V., Balensiefer, S., Grossman,
D., Hudson, R., Moore, K., and Saha, B., "Enforcing Isolation and Ordering
in STM," in Proceedings of the 2007 ACM SIGPLAN Conference on Programming
Language Design and Implementation (PLDI). 2007: San Diego, California.
[2] Ananian, C.S. and Rinard, M., "Efficient Object-Based Software
Transactions," in Proceedings of the OOPSLA 2005 Workshop on Synchronization
and Concurrency in Object-Oriented Languages (SCOOL). 2005: San Diego, CA.
[3] Blundell, C., E.C. Lewis, and M.M.K. Martin, "Deconstructing Transactions:
The Subtleties of Atomicity", in Fourth Annual Workshop on Duplicating,
Deconstructing, and Debunking. 2005.
[4] Carlstrom, B.D., et al., "The Atomos Transactional Programming
Language," in Proceedings of the 2006 ACM SIGPLAN Conference on Programming
Language Design and Implementation (PLDI 2006). 2006, ACM Press: Ottawa,
Ontario, Canada, pp. 1-13.
[5] Cascaval, C., Blundell, C., Michael, M., Cain, H., Wu, P., Chiras, S.,
and Chatterjee, S., "Software Transactional Memory: Why Is It Only a
Research Toy?," CACM, 51(11):40-46, November 2008.
[6] Dice, D., Shalev, O., and Shavit, N., "Transactional Locking II," in
Proceedings of the 20th International Symposium on Distributed Computing
(DISC), pp. 194-208, 2006.
[7] Ennals, R., "Software Transactional Memory Should Not Be
Obstruction-Free," Technical Report IRC-TR-06-052, Intel Research Cambridge,
2006.
[8] Fraser, K., "Practical Lock-Freedom.", PhD Thesis, University of Cambridge
Computer Laboratory, 2004.
[9] Gharachorloo, K., Memory Consistency Models for Shared-Memory
Multiprocessors. 1995, Computer System Laboratory, Stanford University.
[10] Harris, T., et al., "Composable Memory Transactions," in Proceedings of
the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming. 2005: Chicago, IL.
[11] Herlihy, M., Luchangco, V., and Moir, M., "A Flexible Framework for
Implementing Software Transactional Memory," in Proceedings of the 21st
Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems,
Languages, and Applications, pp. 253-262, 2006.
[12] Herlihy, M., Luchangco, V., Moir M., and Scherer III, W., "Software
transactional memory for dynamic-sized data structures," in Proceedings of
the Twenty-Second Annual Symposium on Principles of Distributed Computing,
pp. 92-101, 2003.
[13] Herlihy, M. and Moss, J.E.B. Transactional memory: architectural support
for lock-free data structures. ISCA 1993.
[14] Larus, J. and Rajwar, R., Transactional Memory. Morgan & Claypool
Publishers, 2006.
[15] Lev, Y. and Maessen, J.-W., "Toward a Safer Interaction with
Transactional Memory by Tracking Object Visibility," in Proceedings of the
OOPSLA 2005 Workshop on Synchronization and Concurrency in Object-Oriented
Languages (SCOOL). 2005.
[16] Lomet, D. "Process structuring, synchronization, and recovery using
atomic actions," In Proc. ACM Conf. on Language Design for Reliable Software,
Raleigh, NC, 1977, pp. 128-137
[17] Marathe, V., et al., "Lowering the Overhead of Nonblocking Software
Transactional Memory," in Proceedings of the First ACM SIGPLAN Workshop on
Languages, Compilers, and Hardware Support for Transactional Computing
(TRANSACT). 2006: Ottawa, Canada.
[18] Marathe, V., Scherer III, W., and Scott, M. "Adaptive Software
Transactional Memory," Proceedings of the Nineteenth International
Symposium on Distributed Computing, pp. 354-368, 2005
[19] Shavit, N. and Touitou, D., "Software Transactional Memory," in Proc.
14th ACM Symp. on Principles of Distributed Computing, Ottawa, Canada, 1995,
pp. 204-213.
[20] Wang, X., Ji, Z., Fu, C., and Hu, M. "A review of transactional memory in
multicore processors," Inform. Technol. J., 9:192-200. 2010.
57


SOFTWARE TRANSACTIONAL MEMORY
DESIGN OPTIONS AND OVERHEAD
By
Ibrahim Alfuaires
B.S., King Fahd University of Petroleum and Minerals, 2005
A thesis submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science
2010


This thesis for the Master of Science
degree by
Ibrahim Alfuaires
has been approved
by
Bogdan Chlebus
W, 8(9(0


Alfuaires, Ibrahim (M.S., Computer Science)
Software Transactional Memory (STM) Design Options and Overhead
Thesis directed by Associate Professor Bogdan S. Chlebus
ABSTRACT
The advent of the multicore architecture, in which the technology has
helped to integrate more and more cores in the single chip, has raised the
expectation of getting better computation speed up and performance. In this
thesis, I tried first to discuss briefly the problems associated with using the
legacy ways, locks, to write parallel programs such as: priority inversion,
convoying and deadlock. Transactional memory (TM) was proposed as a lock-
free synchronization technique which provides safe and easy concurrency
programming environment. Transaction is a group of reads and writes
instructions to a shared memory that appears as one atomic operation to the
outside observer and to other transactions and either execute all or nothing. I
am presenting the TM solution in general which is also can be classified into
three types: hardware transactional memory (HTM), software transactional
memory (STM) and hybrid transactional memory (HybridTM). As it can be
inferred from its name, STM implements the whole TM system in software
with the aid of the basic instructions provided by an off shelf processor. STM
has been more preferable to implement TM as it does not require the extra
cost of hardware. In my thesis, I am presenting most of the design options
that have been used by many researchers to build different STM
implementations we have had so far. These options are important to study
and analyze the overhead that might be encountered with different STM
implementations. At the end, I tried to list some of the published identified
sources of overhead in STM systems.
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.
Signed
Bogdan Chlebus


TABLE OF CONTENTS
Figures................................................................vii
Chapter
1. Introduction..........................................................1
1.1 Research Focus and Methodology......................................4
1.2 Lock Based Synchronization..........................................5
1.3 Problems with Locks................................................ 8
1.4 Lock free Synchronization..........................................11
2 Transactional Memory................................................12
2.1 Transaction Basic Operations.......................................16
2.2 Semantics..........................................................17
2.2.1 Database Correctness Criteria (ACI Model)....................... 17
2.2.2 Single Lock-atomicity............................................19
2.2.3 Retry and OrElse.................................................20
3. Design Options in Software Transactional Memory......................22
3.1 Isolation..........................................................22
3.1.1 Weak Isolation...................................................22
3.1.2 Strong Isolation.................................................26
3.2 Nesting Transactions.............................................32
iv


3.2.1 Flattened Transaction
32
3.2.2 Closed Transaction.............................................33
3.2.3 Open Transaction...............................................35
3.3 Transaction Granularity........................................35
3.3.1 Block or Word Granularity......................................36
3.3.2 Object Granularity.............................................36
3.4 Version Management..............................................36
3.4.1 Eager Versioning...............................................37
3.4.2 Lazy Versioning................................................39
3.5 Concurrency Control.............................................40
3.5.1 Pessimistic Concurrency Control................................41
3.5.2 Optimistic Concurrency Control.................................41
3.6 Blocking versus Non-blocking....................................42
3.6.1 Blocking.......................................................42
3.6.2 Non-blocking...................................................42
3.7 Object Acquisition..............................................43
3.7.1 Lazy Acquire...................................................44
3.7.2 Eager Acquire..................................................44
3.8 Contention Management...........................................44
3.9 Other Options...................................................45
3.9.1 Exceptions.....................................................45
3.9.2 Library versus Programming Language............................46
v


3.9.3 Uniprocessor versus Multiprocessor,
46
4. STM Overhead.......................................................47
5. Findings...........................................................50
6. Conclusion.........................................................54
Bibliography..........................................................55
vi


LIST OF FIGURES
Figure
1.1 Moore's Law Trend from 1971-2008.............................1
1.2 THE SHARED MEMORY MULTIPROCESSORS............................5
1.3 LOCK-BASED SYNCHRONIZATION MODEL.............................7
3.1 Ananinan's and Rinard's STM Data Structures.................29
3.2 Eager Versioning............................................38
3.3 Lazy Versioning.............................................40
vii


1. Introduction
With the advent of the multi-core processors, in which the technology has
helped to integrate more than one cores into the chip, it has been expected
that this would increase the computation speed by at least double when we
double the number of cores. According to Moore's Law as we can see in the
figure below, the number of cores is expected to double almost every 18
months, which means that the speed of the computation should be doubled
accordingly.
CPU Transistor Counts 1971 -2008 & Moores Law
2.000.000.000
1,000.000,000
100.000,000
c
3
8
o
to
c
CO
10,000,000
1,000,000
100,000
r.rf Com u t*. v ?
MTWFKI
(iKl '
2 CK-tvI ,
.u-i
nT2i-:
nvrrc
ins
pjiiij

Cu^vestop's Moores Law1:
transistor coum doubling
cvciy you's

Ml Sl*ll
H I

'eoca
10,000
2,300
1971 1980 1990 2000 2008
Date of Introduction
FIGURE 1.1: Moore's Law Trend from 1971-2008
1


So, as we increase the number of cores or processors by n, we should expect
n-fold increase in the power of computation power. However, this increase is
only in the ideal cases and never happens in the practical ones. The primary
reason for this degradation in the real world cases computational power is
the cost of inter-processor communication and coordination or concurrency
control. In most cases, when the applications is written for multi-processing,
the net result performance of the system is just 5-10 % of the theoretical peak
performance.
One of the most difficult things about parallel and distributed computing is
the concurrency control or synchronization in shared memory model. The
need for parallel and distributed computing has been increasing due to the
need for massive computation in various fields: industrial, medical or even
science research fields. Almost all the computer since programmers are
familiar with the traditional or sequential programming in which, single
processor machine is used. However, with parallel programming, where more
than one processor are used and expected to increase the performance of
computation and speed up the calculations, the programmer tasks become
much harder.
Concurrency is the problem of synchronizing concurrent access of multiple
processes or threads to shared memory. It is difficult because it is required to
make sure that the right operation is conducted in the right time and as if it is
the only process accessing that piece of data in shared memory.
As I mentioned above, the primary reason for this drop in the performance
from the expected one is concurrency. Concurrency is the responsibility of the
2


programmers and not of the systems. This justifies why parallel programming
is difficult compared to the sequential one. In the computing industry, there
are a lot of programmers but a few of them who are familiar with the parallel
programming. In addition, a few of those who are supposed to be parallel
programmers are good in this field. Parallel programmers can do mistakes
easily and those mistakes will make the performance even worse. As we add
more cores or processors, the problem gets even worse for developers.
To exploit parallelism, developers will need to master parallel programming
and concurrency or synchronization as part of that. Many synchronization
techniques have been used to implement parallelism on multi-core systems.
Locks have been used by parallel programmers to implement synchronization.
However, addition to the difficulty of using locks, several pitfalls have been
experienced and identified due to using them such as priority inversion,
deadlock and convoying.
To avoid problems associated with implementing locking based
synchronization techniques, several lock free techniques have been
investigated. A shared data structure is lock free if its operations do not
require mutual exclusion. Transactional memory is one of those techniques
which provides new concurrency techniques that avoid most of the pitfalls
mentioned above as well as it is much easier concurrent programming
technique. In addition, it is much efficient more the other proposed
techniques. The idea of transactional memory has been taken from database
transaction where a transaction can be defined as a sequence of actions that
appears indivisible and instantaneous to an outside observer. In addition, the
same attributes that the database transaction has i.e. failure atomicity,
consistency, isolation and durability, are applied to the transactional memory
except the last one durability since the data in memory is usually transient
unlike the database one.
3


1.1 Research Focus and Methodology
Software Transactional Memory (STM) has been blamed as a high overhead
transactional memory (TM) compared to the other implementations:
hardware transactional memory (HTM) and hybrid transactional memory
(HybridTM). The overhead of STM is due to the need to build the TM
semantics and all the transaction primitives in software. Many STM
implementations have been proposed with a little emphasis on the overhead
investigation. The problem with STM is that some instructions that usually do
not require more than one or two clock cycles to execute would need tens or
even hundreds clock cycles when executed with STM. As an example, a
program that might need 100 clock cycles to complete, addition 500 clock
cycles are introduced as STM overhead. That means almost 80% of the
program time is taken by the STM and hence the program is made even 5
times longer. We can conclude that the STM overhead is not a trivial one and
can be the most dominant cost of the code.
I am going to study the STM semantics and some of the approaches used to
build STM implementations. After understanding these semantics and design
approaches, I am going to identify how these may help in reducing or
increasing the overhead of STM systems. In addition, STM have been built
using different programming languages. I want to investigate if the type of
language can affect overhead or not. This would help us choose the right
design options to build low overhead STM implementation.
I begin by giving background information about the legacy way of attaining
concurrency, the locking based synchronization. Then, I discuss the major
4


pitfalls of locks and using them in synchronization. In chapter2,1 introduce
what transactional memory is. Chapter 3 covers different design options of
STM. In chapter 4 and 5,1 discuss the overhead of STM and I list some of the
findings that might help to reduce the STM overhead. The last chapter
includes the conclusion.
1.2 Lock Based Synchronization
In a shared memory multiprocessing basic model, we have many processors
that can work concurrently and they might need to access the same piece of
data in the shared memory to complete their computations. Let us assume
that we have the following multiprocessor environment:
Main Memory
FIGURE 1.2: SHARED MEMORY MULTIPROCESSORS
5


As we can see in the figure above, we have n processors all sharing the same
memory. Let us assume that the first three processors, P1, P2 and P3, are
about to execute the same instruction X := X + 1. The initial value of X is
1, and if all of the processors try to execute at the same time, the final
value of X is unpredictable: it might be 2, 3 or 4. This is because all the
processors might read the same value of X at the same time, or two of them
might simultaneously read the value produced by the third one's execution.
This problem is called a data race, and to avoid it, the shared resource,
X in this case, must be protected from concurrent read and write accesses.
The solution here is to use locks to provide mutual exclusion.
Locks are programming constructs that give one processor exclusive control
over a shared resource; until the resource is unlocked, the lock holder is
the only one that can read and write it. The region of code that accesses
the shared resource under a lock is called a critical section. At any time,
at most one process can be inside the critical section: the one holding its
lock. The following diagram shows the operation when locking is used to
perform synchronization:
Lock(X)
Read X into R
Add 1 to R
Write back R to X
Unlock(X)
FIGURE 1.3: LOCK-BASED SYNCHRONIZATION MODEL
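The steps of Figure 1.3 can be sketched in code. This is an illustrative
sketch only, not part of the thesis; the names are mine, and the lock plays
the role of Lock(X)/Unlock(X) around the read-modify-write:

```python
import threading

x = 0
x_lock = threading.Lock()

def increment():
    global x
    # Lock(X): only one thread may enter the critical section at a time.
    with x_lock:
        r = x        # Read X into R
        r = r + 1    # Add 1 to R
        x = r        # Write back R to X
    # Unlock(X) happens automatically when the "with" block exits.

threads = [threading.Thread(target=increment) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(x)  # always 3: the lock serializes the three increments
```

Without the lock, the three read-modify-write sequences could interleave
and lose updates, which is exactly the data race described above.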
Based on granularity, locks can be classified as fine-grained or
coarse-grained. The main differences between the two are the size of the
critical sections guarded by each lock and the number of locks used
throughout the system. In systems that use coarse-grained locking, one lock
guards many shared resources, as in Linux kernel 2.0, the first release to
support multiprocessing. There, the whole kernel is treated as a single
critical section protected by one big lock. On the other hand, with
fine-grained locking, as implemented in newer kernel releases, many locks
are used to provide concurrency. Instead of one lock over the whole kernel,
the big kernel critical section is divided into smaller critical sections,
each with its own lock. For example, in the Linux 2.2 kernel, one spinlock
controls access to the block I/O subsystem, another the networking
subsystem, and so on.
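The contrast can be sketched as follows; this is an illustrative sketch
with invented names (io_lock, net_lock), not kernel code:

```python
import threading

# Coarse-grained: one big lock would guard every shared structure,
# serializing even unrelated work.
big_lock = threading.Lock()

# Fine-grained: each subsystem gets its own lock, so threads touching
# unrelated data do not serialize on one another.
io_lock = threading.Lock()
net_lock = threading.Lock()

io_queue, net_queue = [], []

def submit_io(req):
    with io_lock:          # blocks only other I/O submitters
        io_queue.append(req)

def submit_packet(pkt):
    with net_lock:         # independent of the I/O lock
        net_queue.append(pkt)

submit_io("read block 7")
submit_packet("syn")
print(len(io_queue), len(net_queue))  # 1 1
```

With fine-grained locks, an I/O submission and a packet submission can
proceed in parallel; under the single big lock they could not.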
Locks have been used widely in many areas: databases, operating systems,
and parallel programming.
1.3 Problems with Locks
Many problems associated with using locks for concurrency control have been
published in the literature. One of these issues is the difficulty of
programming: most programmers are far more accustomed to sequential coding
than to parallel coding. Further issues were described by Herlihy and Moss
[13]. Consider an operating system that employs a priority-based preemptive
scheduler, in which tasks or processes are assigned different priorities.
Assume we have three processes H, M and L with priorities High, Medium and
Low. The scheduler always ensures that the highest-priority process that is
ready to run is the one executing; if a lower-priority process is executing
when a higher-priority process becomes ready, the scheduler preempts the
lower-priority one so that the higher-priority process can execute. Now
consider a scenario where process L acquires the mutex, or lock, for a
specific shared resource. In the middle of L's execution, process M becomes
ready, so the scheduler preempts L, which still holds the lock, and lets M
execute. Suppose that during M's execution, process H needs the same shared
resource whose lock is held by the preempted process L. Process H cannot
acquire the lock, and L cannot preempt M because of the priority ordering.
So although process H has higher priority than process M, H cannot execute;
the situation behaves as if the priorities of H and M were inverted, even
though they were not. The problem just described is called priority
inversion.
Another issue associated with using locks for concurrency control is the
lock convoy, sometimes simply called convoying. Unlike the priority
inversion case, the processes here all have equal priorities. Convoying
happens when several processes are interested in a specific shared resource
R. Assume process P0 currently holds the lock for R and is executing. If
the scheduler preempts P0 for any reason, such as a page fault, an
exhausted quantum or an interrupt, P0 is descheduled while still holding
the lock for R. The scheduler then schedules P1, which is also interested
in R; although P1 has the right to execute, the lock for R is still with
P0, so P1 is descheduled in turn. The result is heavy context switching and
wasted scheduling quanta, which leads to severe performance degradation of
the system.
The next problem that can arise with locks is deadlock. Deadlock can happen
when processes lock the same set of objects in different orders. To make
this clearer, assume two processes P and Q work on shared objects CR1, CR2,
CR3 and CR4. Process P accesses the shared objects in the order
CR1->CR2->CR3->CR4, while process Q, based on its computation, needs to
access them in the opposite order CR4->CR3->CR2->CR1. Suppose both
processes begin execution together, acquiring the lock for each shared
object they need to access. Process P acquires the locks for CR1 and then
CR2 without any problem, being the only process interested in those objects
at that time; similarly, process Q acquires the locks for CR4 and then CR3.
Now process P wants to proceed by acquiring CR3, which is already locked by
process Q, and process Q wants the next object in its chain, CR2, which is
already locked by process P. The two processes wait for each other to
release the locks on CR2 and CR3 forever; this is what is called deadlock.
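The standard remedy for this cycle is a global lock order: every process
acquires the locks in the same agreed order. A minimal sketch (names are
mine, and only two of the four objects are shown):

```python
import threading

cr1, cr2 = threading.Lock(), threading.Lock()

# Deadlock-prone: P locks cr1 then cr2 while Q locks cr2 then cr1.
# The fix sketched here imposes one global order (cr1 before cr2),
# so a wait cycle can never form.

def process(name, results):
    with cr1:          # both processes acquire cr1 first...
        with cr2:      # ...then cr2; no circular wait is possible
            results.append(name)

results = []
p = threading.Thread(target=process, args=("P", results))
q = threading.Thread(target=process, args=("Q", results))
p.start(); q.start()
p.join(); q.join()
print(sorted(results))  # ['P', 'Q'] - both complete, no deadlock
```

Enforcing such an order across thousands of locks is exactly what becomes
difficult with fine-grained locking, as discussed next.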
Returning to locking granularity and the choice between fine-grained and
coarse-grained locks: as mentioned already, the early Linux kernel used
coarse-grained locking, in which the whole kernel is one big critical
section guarded by a single lock. Coarse-grained locking is well known for
its simplicity of implementation and maintenance. Although it helped the
kernel programmers address the issues raised by the shift to Symmetric
Multi-Processing (SMP), it did not scale well: as the number of processors
increases, the time spent waiting for the single kernel lock grows larger
and larger. Hence, with only one lock guarding the kernel, four independent
single-processor machines perform much better than one four-processor
system. To solve this lack of scalability, developers started using
finer-grained locking more and more, and a modern kernel can include
thousands of locks, each protecting a small resource instead of the whole
kernel. However, fine-grained locking has its own pitfalls. With thousands
of locks, it is not easy to identify which lock protects which data and in
which order the locks should be acquired to avoid problems such as
deadlocks and races. In addition, bugs become much harder to find as the
number of locks increases, and the number of bugs is expected to grow
accordingly. Finally, fine-grained locking increases the complexity of the
system and, over time, can sacrifice its maintainability.
1.4 Lock free Synchronization
Lock-free synchronization, sometimes called non-blocking synchronization in
the literature, does not require mutual exclusion between processes
accessing a shared object concurrently. As we saw in the previous section,
several problems arise when mutual exclusion is enforced with locks. If the
shared resource does not require mutual exclusion, the priority inversion
problem disappears, since a high-priority process never needs to wait for a
lower-priority process to release a lock. Consider also the convoying
problem, where a process holding a lock is descheduled or interrupted for
the reasons mentioned earlier: when no lock must be acquired to access the
shared resource, descheduling or interrupting one process does not delay
the other processes interested in that resource. Even deadlocks are avoided
for the same reasons, and as a result performance can improve when shared
resources are made lock free.
2. Transactional Memory
As can be inferred from its name, transactional memory has its roots in
database transactions, so we first need to define what a transaction is. A
transaction is a sequence of actions that appears to an outside observer as
one atomic, instantaneous action. Transactions have been used widely in
databases as a way to ensure correct concurrent execution of programs. The
transactions of transactional memory differ from database transactions in
that they use main memory rather than the hard drive as the medium of
transactional access, hence the name transactional memory.
The idea of using transactions in concurrent programming was first proposed
by Lomet in 1977 [16]. At that time it was not introduced as transactional
memory, but as an idea to use the database abstraction to build a
programming language mechanism that would ensure the consistency of data
shared among several processes [8]. The paper lacked a practical
implementation competitive with explicit synchronization, which is why the
idea lay dormant until Herlihy and Moss in 1993 [13] proposed the first
hardware transactional memory as a mechanism for building lock-free data
structures. Their paper is considered the first to formally define
transactional memory.
After that, many researchers became heavily involved in this area and many
hardware transactional memories were proposed. However, the proposed
hardware designs had limitations, such as high cost and the need to add
extra instructions to existing processor instruction sets. For many, this
made hardware transactional memory a less feasible solution.
Given these limitations, researchers sought a more practical solution that
would still provide lock-free synchronization. In 1995, Shavit and Touitou
proposed the first software transactional memory, which retains the
transactional approach but can be built on existing machines using only
Load-Linked/Store-Conditional operations. Later, many software solutions
were proposed by different researchers, in different languages and using
many design approaches.
Types of TM:
TM can be built in hardware, in software, or using both (hybrid). Hardware
transactional memory (HTM) is implemented as processor instructions, which
means it needs modifications to the processor, cache and bus protocols to
support transactions. In software transactional memory (STM), the semantics
and protocols of TM are built entirely in software, either in a runtime
library or as an extension to the programming language itself. In the
latter case no hardware upgrade is needed; only minimal hardware support is
required, namely one of the two basic operations Compare-and-Swap (CAS) or
Load-Linked/Store-Conditional. Hybrid transactional memory mixes hardware
and software so that transactions are supported in software in the absence
of hardware support and perform better when the hardware exists. However,
there is no formal specification of how hybrid TM should be implemented,
and many different approaches have been used to build this type.
Why STM:
Software transactional memory (STM) is believed by many researchers to be a
feasible solution to the problems of concurrent programming. The first STM
implementation was published by Shavit and Touitou [19]. STM has several
advantages over hardware transactional memory (HTM), such as the cost of
implementation and maintenance. Since it is implemented in software, no
additional hardware cost is required, as in the case of HTM. In addition,
changes can be introduced to an STM implementation easily, which makes it
more flexible than HTM.
Previous Work:
Let me mention some previous STM research. The first real STM
implementation was proposed by Shavit and Touitou and is well known as the
static STM [19]. It came after the first TM system, which was implemented
in hardware and required extra operations not available in off-the-shelf
processors at the time. They proposed a software solution so that TM could
be implemented on existing processors using only the
Load-Linked/Store-Conditional operation. The implementation was called
static because a transaction needs to declare the set of locations it is
going to access in advance, which was considered a limitation. The
advantage they targeted was a software solution requiring no extra
hardware, applicable to existing machines. However, the performance of this
implementation was not even close to that of the HTM implementation
proposed by Herlihy and Moss.
Then Herlihy et al. proposed the dynamic software transactional memory
(DSTM) implementation [12]. In DSTM, they addressed the limitation of the
static STM by supporting dynamically created objects, which removes the
need for prior knowledge of the set of locations a transaction will access.
The implementation is obstruction-free, the weakest of the non-blocking
progress guarantees. In addition, they introduced the idea of contention
management, which avoids the livelock problems that may be encountered with
obstruction-free implementations.
Fraser later proposed a lock-free object-based software transactional
memory named after him (FSTM) [8]. In this lock-free implementation, each
lower-priority transaction helps a higher-priority conflicting transaction
to complete, so that progress is guaranteed. The helped transaction might
in turn help another conflicting transaction to complete, and so on.
Marathe et al. proposed an obstruction-free adaptive software transactional
memory (ASTM) [18]. They compared the two leading object-based software
transactional memories, DSTM and OSTM, and found that DSTM outperforms OSTM
by an order of magnitude on write-dominated workloads, while OSTM
outperforms DSTM by a factor of two on read-dominated workloads.
Herlihy et al. revised the original DSTM, resulting in DSTM2 [11]. DSTM2
incorporates lessons learned from its predecessor and is packaged as a
transactional library, the first implementation to take such an approach.
The main feature of the DSTM2 design is giving users the ability to plug in
their own synchronization and recovery mechanisms in the form of
transactional factories, which transform stylized sequential interfaces
into transactionally synchronized data structures, in addition to the
freedom to customize their contention management techniques.
Recent implementations such as Ennals' STM and TL2 are lock-based STM
implementations [7][6].
2.1 Transaction Basic Operations
Every transactional memory provides a set of basic operations, and most
implementations share the same operations with slight variations. These
basic operations are used to manipulate transactions. The following are the
general ones used in most implementations.
- Start or Begin: starts a new transaction and allocates a fresh descriptor
whose status is initialized to ACTIVE.
- Read/Write: used within the scope of the transaction to access
transactional memory.
- Commit: makes all the transaction's tentative changes permanent and ends
the transaction. It may fail, in which case it is equivalent to an abort.
- Abort: discards all the transaction's tentative changes instead of making
them permanent, and also ends the transaction.
- Validate: tests the transaction status, returning true if the transaction
has committed or can still commit, and false if it has aborted.
2.2 Semantics
Although a large literature has been published about TM, there is still no
formal TM semantics. Intuitively, and as its name suggests, the TM idea
originated with database transactions, which explains why most papers
borrow their correctness criteria either from database transactions
(serializability) or from concurrent data structures (linearizability).
However, both criteria specify only some aspects of TM semantics, not all.
2.2.1 Database Correctness Criteria (ACI Model)
The semantic model used by databases was adopted from the beginning as a
model for TM, since the transaction idea was taken from databases. However,
the database model is still not enough to provide a formal semantics for
TM, owing to the differences between database transactions and TM
transactions. Let us define the correctness model used by databases and
then check its applicability to TM. Databases use the ACID properties
(failure atomicity, consistency, isolation, and durability) as the model
for their transaction semantics.
Failure atomicity, or simply atomicity, requires all or nothing. All means
that a transaction executes to completion and commits, so its changes
become visible. If it fails in the middle of its execution, nothing is
written and everything is left as if it had not executed at all.
Consistency means that any execution of a transaction, successful or
failed, must leave the database or memory in a consistent state. A new
transaction can always assume that the world is in a consistent state.
Isolation means that a transaction must execute as if it were the only
transaction running at that time, even though other transactions execute
concurrently. In databases, the serializability correctness condition is
used to achieve isolation: the results of concurrent transactions are as if
the transactions had executed serially.
Durability requires that once a result is committed, it is final and
remains available to all other transactions. This property is not required
for parallel programming, since the medium to which a transaction commits
is main memory, where data is always transient, unlike the database case
where the medium is the hard disk.
The ACI model explained above still does not provide a formal semantics for
TM. ACI specifies no mechanism controlling the interaction between atomic
blocks and code outside a transaction, that is, the interaction between
transactional and non-transactional accesses to shared resources. Unless a
semantics specifies this interaction, it is not fully applicable to TM, and
this is the shortcoming of the ACI model.
2.2.2 Single Lock-atomicity
Single lock-atomicity is not a technique for implementing TM; rather, it
specifies the interaction of atomic blocks in TM. Under this model, an
atomic block executes as if all atomic blocks were protected by a single
lock, so at most one block can execute at a time. The model specifies the
isolation expected from transactions but not their failure atomicity; it is
one way to achieve the serializability discussed earlier. The single lock
can also describe what happens when a non-transactional access is
encountered, a non-transactional access being any code outside a
transaction that touches shared data.
Such access to shared data from outside a transaction may lead to a data
race [9]. Before defining a data race, let me define a conflict. A conflict
happens when more than one concurrent thread accesses the same shared data
and at least one of them modifies it. A data race occurs when such
conflicting accesses are not properly synchronized. When a data race
happens, the behavior of the code is unpredictable and depends on the
memory consistency model of the system. Data races can be caused by poorly
programmed critical sections. To avoid them, locking or other
synchronization techniques should control the concurrent access by
transactional and non-transactional code, or the program control flow
should be arranged carefully.
2.2.3 Retry and OrElse
So far we have discussed several aspects of TM semantics proposed by
various researchers, but nothing about how two concurrently executing
transactions should coordinate their changes to program state. Such
coordination was first introduced by Harris et al. [10] with the retry
statement. When retry is used in a transaction and anything causes the
transaction to abort, the transaction will reexecute later. The important
question is when it should reexecute: Harris et al. suggested in the same
paper that the transaction should reexecute only after some value in the
set of locations it read before aborting has changed.
Another operation that coordinates the execution of two transactions is
orElse [10]. Its operation can be explained as follows [14]. Given P orElse
Q, where P and Q are two different transactions, P is executed first:
1. If P commits or explicitly aborts, either by an explicit abort statement
or an uncaught exception, then Q is not executed, since the orElse
statement terminates.
2. If P executes a retry statement, then Q is executed as part of the
orElse statement.
3. If Q commits or explicitly aborts, then the orElse statement terminates.
4. If Q also executes a retry, then the orElse statement waits for a change
to a location read by either P or Q and then reexecutes itself.
Finally, the orElse statement must itself be implemented atomically, since
it is a composition of two atomic operations.
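The control flow of orElse can be sketched with an exception standing in
for the retry signal. This is an illustrative sketch only: a real STM
would block on the combined read sets of P and Q (case 4) rather than
simply propagate, and the names here are mine:

```python
class Retry(Exception):
    """Raised by a transaction body to signal `retry`."""

def or_else(p, q):
    """Run p; if p retries, run q; if q also retries, propagate."""
    try:
        return p()        # case 1: P commits, Q never runs
    except Retry:
        return q()        # case 2: P retried, so Q runs instead

def p():
    raise Retry           # P executes retry...

def q():
    return "Q ran"        # ...so Q executes as part of the orElse

print(or_else(p, q))      # Q ran
```

If `q` also raised Retry, the exception would escape `or_else`, which is
where a real implementation would instead wait for one of the read
locations to change before reexecuting the whole composition.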
3. Design Options in Software Transactional Memory
Different options have been used to build different STM implementations.
Some of these options simplify the implementation, and some make it more
powerful and thus a better candidate to replace lock-based parallel
programming. Some options bear directly on the overhead associated with STM
and some only indirectly. In this chapter, I list most of the options and
tradeoffs that have been used so far in designing STM implementations.
3.1 Isolation
I have already defined transaction isolation as part of the transaction
correctness criteria. The isolation property requires that concurrently
executing transactions do not affect each other's results: each transaction
executes as if it were the only transaction running at that time. Blundell
et al. [3] introduced the terms weak and strong atomicity; the term
atomicity there can be replaced by isolation, as can be inferred from the
paper [14].
3.1.1 Weak Isolation
Before discussing the weak isolation property, let me give a definition for
it:
Weak isolation is the property of a TM system that guarantees isolation
only between transactions. Non-transactional code can directly access the
same shared data in main memory that transactional code accesses.
If a non-transactional access executes, that is, a conflicting reference
outside a transaction happens, the returned result may be inconsistent or
may lead to incorrect execution inside the transaction. With a weak
isolation implementation, a data race can result and the system can be left
in an undefined state. The following problems, or more precisely isolation
violations, can happen because non-transactional accesses reach main memory
while bypassing the STM system's protocols [1]:
- Non-repeatable reads: this problem happens when a transaction performs
two reads of the same location while a non-transactional write executes
concurrently. The two reads should return the same value, but because the
non-transactional write interleaves with the transaction, they return
different values, as we can see in the following example:
Thread 1 (transactional)        Thread 2 (non-transactional)
atomic {
    r1 = x;                     x = 1;
    r2 = x;
}
Assume the initial value of x is 0. Thread 1 executes the first read and r1
is assigned 0. Then, before thread 1 performs its second read, thread 2
writes the new value 1 to x. Thread 1 then performs its second read, so
r2 = 1. Thus r1 != r2 even though both reads belong to the same
transaction, which means the isolation is violated.
- Intermediate lost update: this happens when two write accesses, one by
transactional and one by non-transactional code, are performed concurrently
on the same shared data. As an example, consider the following:
Initial value of x is 0
Thread 1 (transactional)        Thread 2 (non-transactional)
atomic {
    r = x;                      x = 10;
    x = r + 1;
}
If the two threads' executions were serialized, the final value of x would
be either 10, if thread 1 executes first and then thread 2, or 11, if
thread 2 executes first and then thread 1. Unfortunately, the value of x
ends up being 1, because thread 2 writes after thread 1 performs its read
and before thread 1 writes x. The thread 2 write is lost, as if it never
happened.
- Intermediate dirty reads: this happens when a non-transactional access
can see an intermediate state of transactional code. The following example
illustrates the problem:
Initial value of x is even
Thread 1 (transactional)        Thread 2 (non-transactional)
atomic {
    x++;                        r = x;
    x++;
}
The thread 1 transactional code always keeps the value of x even. However,
the non-transactional read will observe an odd value if it executes between
thread 1's two increments.
Weak isolation does not mandate that all accesses to shared data be
transactional; non-transactional accesses are allowed provided they do not
conflict with transactional ones. The programmer must guarantee this by
making non-transactional accesses take place at different times or target
different memory locations. The approach used in most STM implementations
to achieve weak isolation is to divide memory manually into two parts: one
accessed only by transactions and the other accessed only by
non-transactional code. However, dividing memory this way is an error-prone
process as software evolves. Furthermore, such manual partitioning is
sometimes not even required under lock-based synchronization. Therefore,
using weak isolation in STM systems may conflict with the goal of achieving
simple and reliable concurrent programming.
3.1.2 Strong Isolation
The second approach is called strong isolation: the TM semantics must be
guaranteed between transactional and non-transactional code. This model
enforces the database discipline in which all accesses to shared data go
through transactions. Under strong isolation, any operation outside an
atomic block that accesses shared data is eventually turned into a
transaction. Strong isolation is well known for the overhead it imposes on
TM implementations that adopt it, which explains why most high-performance
implementations were designed to provide only weak isolation. Researchers
have proposed several ways to achieve strong isolation, such as restricting
transactional code to data that live inside transactions, the approach used
in Haskell STM [3]. This works in Haskell because all mutable data are
transactional; the remaining data are read-only and can therefore be
accessed either transactionally or non-transactionally. With non-functional
languages, on the other hand, where most data are mutable, the program must
be divided into transactional and non-transactional data. This
classification is hard to implement, because data must be converted to
transactional form when it crosses from the non-transactional part of the
program and back to its original form after processing completes. Another
approach, proposed by Lev and Maessen, is to track shared objects at
runtime [15]: any object visible to more than one thread must be accessed
only through transactions during concurrent access. Although this can be
considered a good approach, it may introduce more overhead into the STM;
the authors rely heavily on compiler optimizations to keep the overhead
low. In the following, I present Ananian's and Rinard's STM implementation,
which supports strong isolation.
Example: Ananian and Rinard STM:
The STM implementation proposed by Ananian and Rinard in 2005 is one of the
few built with the strong isolation design option [2]. Most STM
implementations proposed so far employ weak isolation as the transaction
isolation property. Ananian's and Rinard's implementation provides strong
isolation in that any memory access outside a transaction can abort a
running transaction. In contrast, in most weakly isolated STM
implementations a non-transactional read or write to transactional data can
happen without aborting the transaction; instead, a mechanism ensures that
non-transactional accesses do not conflict with transactional ones. The
usual reason for choosing weak isolation is to avoid the overhead of
instrumenting every non-transactional read and write. Ananian's and
Rinard's STM implements a novel way of providing strong isolation that
lowers this instrumentation overhead: they place sentinel values in memory
locations accessed by transactions. I give a detailed description of their
strong isolation mechanism in the following paragraphs.
In their implementation, a special value called FLAG is placed in the
fields of objects that are involved in transactions. An attempt to read or
overwrite a flagged value tells the non-transactional code that exceptional
processing is required.
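The sentinel idea can be sketched as follows. This is an illustrative
sketch of the fast-path/slow-path split only, not the published
implementation (which involves object versions and transaction
descriptors); the names Slot, raw and recover are mine:

```python
FLAG = object()          # sentinel placed in transactionally owned slots

class Slot:
    def __init__(self, value):
        self.raw = value

    def nontransactional_read(self, recover):
        v = self.raw
        if v is FLAG:         # flagged: take the slow path, which would
            return recover()  # consult the transactional metadata
        return v              # fast path: plain read, no instrumentation

s = Slot(7)
print(s.nontransactional_read(lambda: 0))   # 7: fast path, no overhead
s.raw = FLAG                                # a transaction claims the slot
print(s.nontransactional_read(lambda: 42))  # 42: slow path invoked
```

The point of the design is that ordinary non-transactional reads pay only
one comparison against FLAG, instead of full instrumentation on every
access.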
Object Structures:
The basic data structures of Ananian's and Rinard's STM implementation are
shown in the figure below.
[Figure: object versions owned by transactions, e.g. Transaction ID #68 and
Transaction ID #56]
FIGURE 3.1: Ananian's and Rinard's STM Data Structures
Two additional fields are added to each object: a versions field and a
readers field. The versions field points to a singly-linked list of object
versions. Each object version has an owner field, which points to a
transaction object; the transaction object's single field, status, holds one of
the values committed, aborted or waiting. For each transaction, there is
a specific object version. The second field is the readers field, which also
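Read literally, the structure described above might be reconstructed as the following sketch. All class and field names here paraphrase the text and the figure caption; they are illustrative guesses, not Ananian and Rinard's actual definitions:

```python
# Hypothetical reconstruction of the per-object metadata described above.
class Transaction:
    def __init__(self, tid):
        self.tid = tid
        self.status = "WAITING"      # WAITING, COMMITTED, or ABORTED

class ObjectVersion:
    def __init__(self, owner, fields, next_version=None):
        self.owner = owner           # the transaction this version belongs to
        self.fields = fields         # that transaction's tentative field values
        self.next = next_version     # link to older versions (singly-linked list)

class TMObject:
    def __init__(self):
        self.versions = None         # head of the singly-linked version list
        self.readers = []            # transactions currently reading the object

def iter_versions(obj):
    v = obj.versions
    while v is not None:
        yield v
        v = v.next

# One version per writing transaction, newest at the head of the list:
t68, t56 = Transaction(68), Transaction(56)
obj = TMObject()
obj.versions = ObjectVersion(t56, {"val": 10})
obj.versions = ObjectVersion(t68, {"val": 20}, obj.versions)
print([(v.owner.tid, v.fields["val"]) for v in iter_versions(obj)])
# → [(68, 20), (56, 10)]
```

Each transaction that writes the object contributes its own ObjectVersion node, so the version list holds exactly one entry per writing transaction, matching the two transaction IDs (#68 and #56) shown in Figure 3.1.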


Full Text

PAGE 1

SOFTWARE TRANSACTIONAL MEMORY DESIGN OPTIONS AND OVERHEAD By Ibrahim Alfuaires B.S., King Fahd University of Petroleum and Minerals, 2005 A thesis submitted to the University of Colorado Denver in partial fulfillment of the requirements for the degree of Master of Science Computer Science 2010

PAGE 2

This thesis for the Master of Science degree by Ibrahim Alfuaires has been approved by Bogdan Chlebus Gitaa;hband

PAGE 3

Alfuaires, Ibrahim (M.S., Computer Science) Software Transactional Memory (STM) Design Options and Overhead Thesis directed by Associate Professor Bogdan S. Chlebus ABSTRACT The advent of the multicore architecture, in which the technology has helped to integrate more and more cores in the single chip, has raised the expectation of getting better computation speed up and performance. In this thesis, I tried first to discuss briefly the problems associated with using the legacy ways, locks, to write parallel programs such as: priority inversion, convoying and deadlock. Transactional memory (TM) was proposed as a lock free synchronization technique which provides safe and easy concurrency programming environment. Transaction is a group of reads and writes instructions to a shared memory that appears as one atomic operation to the outside observer and to other transactions and either execute all or nothing. I am presenting the TM solution in general which is also can be classified into three types: hardware transactional memory (HTM), software transactional memory {STM) and hybrid transactional memory (HybridTM). As it can be inferred from its name, STM implements the whole TM system in software with the aid of the basic instructions provided by an off shelf processor. STM has been more preferable to implement TM as it does not require the extra cost of hardware. In my thesis, I am presenting most of the design options that have been used by many researchers to build different STM implementations we have had so far. These options are important to study and analyze the overhead that might be encountered with different STM implementations. At the end, I tried to list some of the published identified sources of overhead in STM systems. This abstract accurately represents the content of the candidate's thesis. I recommend its publication. Bogdan Chlebus

PAGE 4

TABLE OF CONTENTS Figures ....................................................................................................................... vii Chapter 1. Introduction ........................................................................................................ 1 1.1 Research Focus and Methodology ............................................................. 4 1.2 lock Based Synchronization ............................................................................. 5 1.3 Problems with locks ......................................................................................... 8 1.4lock free Synchronization ................................................................................ 11 2 Transactional Memory ..................................................................................... 12 2.1 Transaction Basic Operations ......................................................................... 16 2.2 Semantics ........................................................................................................... 17 2.2.1 Database Correctness Criteria (ACI Model) .............................................. 17 2.2.2 Single lock-atomicity .................................................................................... 19 2.2.3 Retry and OrEise ............................................................................................ 20 3. Design Options in Software Transactional Memory ..................................... 22 3.11solation .............................................................................................................. 22 3.1.1 Weak Isolation ................................................................................................ 22 3.1.2 Strong Isolation .............................................................................................. 26 3.2 Nesting Transactions ........................................................................................ 32 iv

PAGE 5

3.2.1 Flattened Transaction .................................................................................. 32 3.2.2 Closed Transaction ...................................................................................... 33 3.2.3 Open Transaction ......................................................................................... 35 3.3 Transaction Granularity .................................................................................. 35 3.3.1 Block or Word Granularity .......................................................................... 36 3.3.2 Object Granularity ........................................................................................ 36 3.4 Version Management ...................................................................................... 36 3.4.1 Eager Versioning ............................................................................................ 37 3.4.2 Lazy Versioning .............................................................................................. 39 3.5 Concurrency Control ........................................................................................ 40 3.5.1 Pessimistic Concurrency Control... ............................................................. 41 3.5.2 Optimistic Concurrency Control... ............................................................... 41 3.6 Blocking versus Non-blocking .................................................................... 42 3.6.1 Blocking .................................................................................................. 42 3.6.2 Non-blocking .......................................................................................... 42 3. 7 Object Acquisition ..................................................................................... 43 3.7.1 Lazy Acquire ........................................................................................... 
44 3.7.2 Eager Acquire ......................................................................................... 44 3.8 Contention Management ............................................................................... 44 3.9 Other Options .................................................................................................... 45 3.9.1 Exceptions ....................................................................................................... 45 3.9.2 Library versus Programming Language ...................................................... 46 v

PAGE 6

3.9.3 Uniprocessor versus Multiprocessor .......................................................... 46 4. STM Overhead .................................................................................................... 47 5. Findings ....................................................................................................... 50 6. Conclusion ........................................................................................................... 54 Bibliography ............................................................................................................ 55 vi

PAGE 7

LIST OF FIGURES Figure 1.1 Moore's law Trend from 1971-2008 ............................................................ 1 1.2 THE SHARED MEMORY MULTIPROCESSORS .............................................. 5 1.3 LOCK-BASED SYNCHRONIZATION MODEL ................................................... 7 3.1 Ananinan's and Rinard's STM Data Structures ......................................... 29 3.2 Eager Versioning .................................................................................. 38 3.3Lazy Versioning .................................................................................... 40 vii

PAGE 8

1. Introduction With the advent of the multi-core processors, in which the technology has helped to integrate more than one cores into the chip, it has been expected that this would increase the computation speed by at least double when we double the number of cores. According to Moore's Law as we can see in the figure below, the number of cores is expected to double almost every 18 months, which means that the speed of the computation should be doubled accordingly. .... B Cll v; c:: 1'!1 F CPU Transistor Counts 1971 -2008 & Moore's Law 2.000.000.00() ,000.000,00() 100,000,00() 10,000,00() 1,000.00() CU...e 'Yoore .......-: trans.isfCI'COU"'''IOOLbhng , k!o .,.,.,, ...... af .. ,r'e'9Y'It 1971 198() 1990 2000 2008 Date of Introduction FIGURE 1.1: Moore's Law Trend from 1971-2008 1

PAGE 9

So, as we increase the number of cores or processors by n, we should expect n-fold increase in the power of computation power. However, this increase is only in the ideal cases and never happens in the practical ones. The primary reason for this degradation in the real world cases computational power is the cost of inter-processor communication and coordination or concurrency control. In most cases, when the applications is written for multi-processing, the net result performance of the system is just 5-10% of the theoretical peak performance. One of the most difficult things about parallel and distributed computing is the concurrency control or synchronization in shared memory model. The need for parallel and distributed computing has been increasing due to the need for massive computation in various fields: industrial, medical or even science research fields. Almost all the computer since programmers are familiar with the traditional or sequential programming in which, single processor machine is used. However, with parallel programming, where more than one processor are used and expected to increase the performance of computation and speed up the calculations, the programmer tasks become much harder. Concurrency is the problem of synchronizing concurrent access of multiple processes or threads to shared memory. It is difficult because it is required to make sure that the right operation is conducted in the right time and as if it is the only process accessing that piece of data in shared memory. As I mentioned above, the primary reason for this drop in the performance from the expected one is concurrency. Concurrency is the responsibility of the 2

PAGE 10

programmers and not of the systems. This justifies why parallel programming is difficult compared to the sequential one. In the computing industry, there are a lot of programmers but a few of them who are familiar with the parallel programming. In addition, a few of those who are supposed to be parallel programmers are good in this field. Parallel programmers can do mistakes easily and those mistakes will make the performance even worse. As we add more cores or processors, the problem gets even worse for developers. To exploit parallelism, developers will need to master parallel programming and concurrency or synchronization as part of that. Many synchronization techniques have been used to implement parallelism on multi-core systems. locks have been used by parallel programmers to implement synchronization. However, addition to the difficulty of using locks, several pitfalls have been experienced and identified due to using them such as priority inversion, deadlock and convoying. To avoid problems associated with implementing locking based synchronization techniques, several lock free techniques have been investigated. A shared data structure is lock free if its operations do not require mutual exclusion. Transactional memory is one of those techniques which provides new concurrency techniques that avoid most of the pitfalls mentioned above as well as it is much easier concurrent programming technique. In addition, it is much efficient more the other proposed techniques. The idea of transactional memory has been taken from database transaction where a transaction can be defined as a sequence of actions that appears indivisible and instantaneous to an outside observer. In addition, the same attributes that the database transaction has i.e. failure atomicity, consistency, isolation and durability, are applied to the transactional memory except the last one durability since the data in memory is usually transient unlike the database one. 3

PAGE 11

1.1 Research Focus and Methodology Software Transactional Memory (STM) has been blamed as a high overhead transactional memory (TM) compared to the other implementations: hardware transactional memory (HTM) and hybrid transactional memory (HybridTM). The overhead of STM is due to the need to build the TM semantics and all the transaction primitives in software. Many STM implementations have been proposed with a little emphasis on the overhead investigation. The problem with STM is that some instructions that usually do not require more than one or two clock cycles to execute would need tens or even hundreds clock cycles when executed with STM. As an example, a program that might need 100 clock cycles to complete, addition SOO clock cycles are introduced as STM overhead. That means almost 80% of the program time is taken by the STM and hence the program is made even 5 times longer. We can conclude that the STM overhead is not a trivial one and can be the most dominant cost of the code. I am going to study the STM semantics and some of the approaches used to build STM implementations. After understanding these semantics and design approaches, I am going to identify how these may help in reducing or increasing the overhead of STM systems. In addition, STM have been built using different programming languages. I want to investigate if the type of language can affect overhead or not. This would help us choose the right design options to build low overhead STM implementation. I begin by giving background information about the legacy way of attaining concurrency, the locking based synchronization. Then, I discuss the major 4

PAGE 12

pitfalls of locks and using them in synchronization. In chapter2, I introduce what transactional memory is. Chapter 3 covers different design options of STM. In chapter 4 and 5, I discuss the overhead of STM and I list some of the findings that might help to reduce the STM overhead. The last chapter includes the conclusion. 1.2 Lock Based Synchronization In a shared memory multiprocessing basic model, we have many processors that can work concurrently and they might need to access the same piece of data in the shared memory to complete their computations. Let us assume that we have the following multiprocessor environment: Main Memory X FIGURE 1.2: SHARED MEMORY MULTIPROCESSORS 5

PAGE 13

So as we can see in the above figure, we have n processors all share the same memory. Let us assume that the first three processors: P1, P2 and P3 are about to execute the same instruction X := X + 1 The initial value of X= 1 and if all of the processes tried to execute at the same time the value of X would be unpredictable, it might be 2, 3 or 4. This is because all the processes might read at the same time the same value X or even two of them read simultaneously the X value resulted from the execution done by the third one. This problem is called data race and to avoid getting such issue, the shared resources, i.e. X in this case, must be prevented from concurrent read and write accesses. The solution here is to use locks to provide mutual exclusion. Locks are programming constructs that can be used so that only one processor has the control over the shared resource and until this resource is unlocked the controlling process is the only one who can read and write to it. The area which has the shared resource and to be accessed with locks is called critical section. At any time, at most one process can be inside the critical section the processor who has the lock of that section. The following diagram shows how the operation is going to be when locking is used to perform synchronization: 6

PAGE 14

lock (X\ Read X into R Add 1 toR Write back R to X Unloc:k IX\ FIGURE 1.3: lOCK-BASED SYNCHRONIZATION MODEl Based on granularity, locks can be classified into fine-grained or coarse grained locks. The main differences between the two types are the size of the critical sections guarded by the lock and the number of locks used throughout the system. In the systems which use the coarse-grained locking, one lock is used to lock many shared resources like in the case of the first linux kernel 2.0 that supported multiprocessing. In this case, the whole kernel is considered as a critical section that is locked by one big lock. On the other hand, in case of fine-grained locking, as it has been implemented in the new kernel releases, many locks are used in the system to provide concurrency. Instead of using only one lock to lock the whole kernel, the big kernel critical section is divided into smaller critical sections each has its own lock. As an example, in the linux 2.2 kernel release, one spinlock is used to control the access to the block 1/0 subsystem and another one for networking and so on. 7

PAGE 15

locks have been used widely in many areas: databases, operating systems, and parallel programming. 1.3 Problems with locks Many problems associated with using locks in concurrency control have been published in literatures. One of these issuers is the difficulty of programming. Most of the computer science programmers are much used to the sequential coding and not the parallel one. More issues were published by Herlihy and Moss as we see below [13]. In some operating systems, which that employs a priority-based preemptive scheduler, tasks or processes are assigned different priorities. let assume that we have a 3 processes H, M and l with priorities High, Medium and low. The scheduler always makes sure that if a process is ready to run and holds higher priority to be running or executed. In addition, if a lower priority process is executing and during its execution, a higher priority process is ready to execute, the scheduler will preempt the lower priority one to let the higher priority process executes. let us consider a scenario where, the process l acquires the mutex or the lock to work on a specific shared resource. In the middle of l execution, the process M becomes ready to execute and hence the scheduler will have to preempt l, which still holds the lock, and let M executes. Suppose during the execution of M, process H would like to access the same shared resource whose lock is held by the preempted process l. In this case, process H won't be able to acquire the lock since l cannot preempt M due to the precedence in priorities. Finally, although process H has higher priority than process M, H cannot execute and this situation will be like if the priorities were inverted so that M has higher 8

PAGE 16

priority but it hasn't. The problem that I have just described is called priority inversion. Another issue that is associated with using locks in concurrency control is lock convoy or sometimes called convoying. Unlike the case with the priority inversion, the processes in this case all have equal priorities. The problem of convoying happens when we have several processes where some of them are interested to access a specific shared resource R. let us assume that process PO is now holding the lock for R and also the processor is executing PO. If the scheduler preempts PO for any reason, such as a page fault or PO has already exhausted its quantum or by any kind of interrupt, PO will be discheduled while it still holds the lock for R. Now, the scheduler is going to schedule Pl, which is also interested to access R, although Pl has the right to execute, the lock for R is still with PO and hence again Pl will be discheduled. This results in a huge context switching and waste of scheduling quantum which will lead to huge performance degradation of the system. The next problem that might be experienced with using locks is called deadlock. Deadlock can happen as a result of having some processes lock the same set of objects in different orders. To make it clearer, let us assume that we have two processes P and Q work on shared objects identified by CRl, CR2 CR3 and CR4. Process P will work on the shared objects in this order CRl On the other hand, process Q based on its computation will need to access the shared objects in the opposite order let us assume that both processes begin their execution together starting on acquiring the lock required for each shared object they need to access. Process P will acquire the lock for CRl and then 9

PAGE 17

CR2 without any problem as the only process interested to access these shared objects at those times and similarly with process Q, it will acquire the locks of CR4 and then of CR3 shared objects. Now, process P would like to proceed its execution by acquiring the shared object CR3 which is already locked by process Q. Similarly, process Q would like to access the next shared object in the chain, CR2, which is already locked by process P. The two processes P and Q are waiting to each other on releasing the lock for the shared objects CR2 and CR3 and this problem is what so called deadlock. Going back to the locking granularity and whether to use the fine-grained or the coarse-grained one. As I mentioned already, early linux kernel used the coarse-grained locking, in which the whole kernel is considered as a one big critical section and only one lock is available to access this critical section. Coarse-grained lock is well known to its simplicity in implementation and maintenance. Although, this type of lock, the coarse-grained, helped the kernel programmer address all the issues might be faced with by shifting to Symmetric Multi-Processor (SMP), it did not scale well. As the number of the processors increase, the amount of time to wait for the single kernel lock becomes bigger and bigger. Hence, with having only one lock to guard the kernel, the performance of four independent processors is much better than having four-processor system. To solve the problem of lack of scalability with the coarse-grained locks, developers have started to use finer-grained locking more and more and now the modern kernel could include thousands of locks. With this type of locks, each lock protects a small resource instead of protecting the whole kernel. However, using fine-grained locks has its own pitfalls. With thousands of locks, it is not easy to identify which lock for which 10

PAGE 18

and in which order you should acquire each of them in order to avoid problems such as deadlocks and races. In addition, it became much harder to identify the bugs as the number of locks increases and also the number of bugs is expected to increase accordingly. Finally, using the fine-grained locking approach will increase the level of complexity of the system and by time, it can sacrifice the maintainability of that system. 1.4 Lock free Synchronization Lock free or as it might be named in some literatures as non-blocking synchronization does not require the existence of mutual exclusion between the processes which are accessing a shared object concurrently. As we have seen in the previous section, problems with locks, some problems might be experienced when mutual exclusion is satisfied by using locks. If the shared resource does need mutual exclusion, priority inversion problem is resolved since the high priority process won't need to wait for the lower priority process to release the lock. In addition, consider the problem of convoying where a process holding the lock for a shared resource is descheduled or interrupted for reasons I mentioned earlier. In case of lock free synchronization, the shared resource does not mandate the acquisition of lock to have access to, then descheduling or interrupting any process does not delay the execution or operation of other processes interested to access this shared resource. Even deadlocks can be avoided for the same reasons and as a result performance could be improved by making the shared resources lock free. 11

PAGE 19

2. Transactional Memory As it can be inferred from its name, transactional memory has its root in database transaction and hence we need first to define what a transaction is. A transaction can be defined as a sequence of actions that appears as one atomic action and instantaneous to the outside observer. Transactions have been used widely in the database as a way to ensure concurrent execution of programs. However, due to the differences between the transactions in database and that of transactional memory, since the later uses the main memory instead of the hard drive to perform transactional access, it is called transactional memory. The idea of using transactions in concurrent programming was first proposed by Lomet in 1977 [16]. At that time, it was not introduced as transactional memory but as an idea to use the database abstraction to make a good programming language mechanism that would ensure the consistency of data shared among several processes [8]. This paper lacked the practical implementation compared to the explicit synchronization and this is way it had not been used until later when Herlihy and Moss in 1993 [13] proposed the first hardware transactional memory as a mechanism to build allock-free data structures. This paper is considered the first official paper defined what the transactional memory. After that many researchers have been involved heavily in this area and many hardware transactional memories have been proposed. However there were some limitations in the proposed hardware ones like the high cost and the need for extra instruction set to be integrated with the existing processors 12

PAGE 20

instruction set. That means that hardware transactional memory for some people, it might not be that feasible solution. Given the above limitations associated with the hardware transactional memory, researchers were trying to find more feasible solution that would be still a lock-free synchronization solution. In 1995, Shavit and Touitou, proposed the first software transactional memory that is will still use the transactional approach which can be built on the existing machine using only a load-Linked I Store-Conditional operations. later, many software solutions have been proposed by different researchers in different languages and using many design approaches. TypesofTM: TM can be built in hardware, software or using both software and hardware (hybrid). The hardware transactional memory (HTM) is implemented as a processor instructions which means it needs some modifications to the processor, cache and bus protocols to support transactions. In software transactional memory (STM), the semantics and protocols of TM are built completely in software either in runtime library or as extension to the programming language itself. In the later type, STM, no need to upgrade the hardware and the minimal hardware is required such that it supports the two basic operations Compare and Swap (CAS) or load-linked/ Store-Conditional. The hybrid transactional memory is a mix of hardware and software transactional memory so that the transactions are supported in software with the absence of hardware and would provide better performance if the hardware existed. However, there is no formal specification for how the 13

PAGE 21

hybrid TM should be implemented as many different approaches have been used to build this type. WhySTM: Software transactional memory (STM) has been believed by many researchers as a feasible solution for the problems of concurrent programming. The first STM implementation was published by Shavit and Touitou [19]. STM has many advantages over the hardware transactional memory (HTM) such the cost of implementation and maintainig. Since it is implemented in software, no additional cost required for the hardware as in the case of HTM. In addition, any change can be introduced to the STM implementation easily which makes it more flexible than HTM. Previous Work: let me mention some previous STM research works. The first real STM implementation was proposed by Shavit and Touitou which is well known as the static STM [19). This implementation came after the first TM system which was implemented in hardware and required some extra operation which was not available in the off-shelve processors at that time. They proposed the software solution to make it possible to implement TM with the existing processors in software and using load_linked/Store_Conditional operation. This implementation was called static because the transaction needs to need the set of the location it is going to access in advance which was considered a limitation to this implementation. The advantage they were targeting with this implementation is to provide a software solution that will not require extra hardware since it could be applied to the existing machines. 14

PAGE 22

However, the performance of this implementation was not even close to the HTM implementation proposed by Herlihy and Moss. Then, Herlihy et al. proposed dynamic software transactional memory (DSTM) implementation [12]. In DSTM, they addressed the limitation with the static STM by supporting the dynamically created objects that does not require prior knowledge for the set of the locations will be accessed by transaction. This implementation was an obstruction-free which is the weakest approach of non-blocking synchronization. In addition, they introduced the idea of contention management that was able to avoid live lock problems that may encountered with obstruction-free implementations. Fraser later proposed a lock-free objected based software transactional memory which was named after his name (FSTM) [8]. In this lock-free implementation, each lower priority transaction helps a higher priority conflicting transaction to complete so that the progress is guaranteed. The helped transaction might also help another conflicting transaction to complete execution and so on. Marathe et al. proposed an obstruction-free adaptive software transactional memory or (ASTM) [18]. They had done a comparison between the two leading object-based software transactional memories DSTM and OSTM. They found that the DSTM outperforms the OSTM by an order of magnitude on write-dominated workloads and OSTM outperforms DSTM with by a factor or two on read-dominated workloads. Herlihy et al. revised the first proposed DSTM and that resulted in DSTM2 [11]. As I mentioned, the DSTM2 is a revised version of the DSTM where they 15

PAGE 23

tried to incorporate lessons they learned from its predecessor. In this implementation, they did it as a transactional library which was considered the first implementation with such approach. The main feature included in the design of DSTM2 was proviCfing the users the ability to include their own synchronization and recovery mechanisms in the form of transactional factories that transform stylized sequential interfaces into transactionally synchronized data structures addition to the freedom in customizing their techniques for managing contention. Recent implementations such as Enalls' STM and Tl2 are lock-based STM implementations [7][6]. 2.1 Transaction Basic Operations Each transactional memory has some basic operations and most of the implementations have the same operations with a slight variation from one implementation to another. These basic operations are used to manipulate transactions. The following basic operations are the general ones used among most of the implementations. Start or Begin: starting a new transaction and allocate a fresh descriptor in which the status is initialized to ACTIVE. Read/Write: used within the scope of the transaction to access transactional memory. 16


Commit: makes all the transaction's tentative changes permanent and ends the transaction. It may fail, in which case it is equivalent to an abort.

Abort: instead of making the transaction's tentative changes permanent, discards all of those changes and also ends the transaction.

Validate: tests the transaction status. It returns either true, meaning the transaction has committed or can commit, or false, meaning the transaction has aborted.

2.2 Semantics

Although much literature has been published about TM, there is still no formal TM semantics. Intuitively, and as can be inferred from its name, the TM idea originated from database transactions, which explains why most papers use the correctness criteria of database transactions (serializability) or of concurrent data structures (linearizability). However, both criteria specify only some aspects of TM semantics, not all.

2.2.1 Database Correctness Criteria (ACI Model)

The semantic model used by databases has been adopted from the beginning as a model for TM, as the idea of transactions was taken from databases. However, the database model is still not enough to provide formal semantics for TM, due to the differences between the nature of database transactions and those of TM. Let us define the correctness model used by databases and then check its applicability to TM. Databases use the ACID properties (failure atomicity, consistency, isolation, and durability) as a model for their transaction semantics.
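To make the basic operations of Section 2.1 concrete, here is a minimal sketch of such an API in Python. This is my own illustration, not taken from any particular implementation: writes are buffered until commit, a single lock stands in for real concurrency control, and all names are assumptions.

```python
import threading

class Abort(Exception):
    """Raised when a transaction cannot make its changes permanent."""

class Transaction:
    """Minimal sketch of the basic operations: begin, read, write,
    validate, commit, abort. Writes are buffered until commit."""
    def __init__(self, memory, lock):
        self.memory = memory      # shared dict: location -> value
        self.lock = lock          # stand-in for real concurrency control
        self.writes = {}          # tentative (buffered) updates
        self.status = "ACTIVE"    # fresh descriptor, initialized to ACTIVE

    def read(self, loc):
        # A read first sees this transaction's own tentative writes.
        if loc in self.writes:
            return self.writes[loc]
        return self.memory.get(loc)

    def write(self, loc, value):
        self.writes[loc] = value  # buffered until commit

    def validate(self):
        return self.status == "ACTIVE"

    def commit(self):
        if not self.validate():
            raise Abort()
        with self.lock:           # make tentative changes permanent
            self.memory.update(self.writes)
        self.status = "COMMITTED"

    def abort(self):
        self.writes.clear()       # discard all tentative changes
        self.status = "ABORTED"

def begin(memory):
    """Start a new transaction over the given shared memory."""
    return Transaction(memory, threading.Lock())
```

A committed transaction's writes become visible in the shared memory; an aborted transaction leaves it untouched.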


Failure atomicity, or atomicity, requires all or nothing. "All" means that a transaction executes, finishes, and then commits, and hence its changes become visible. However, if it fails in the middle of its execution, nothing is written and everything is as if it had not executed at all. Consistency means that any successful or failed execution of a transaction should leave the database or memory in a consistent state. A new transaction will always assume that the world is in a consistent state. Isolation means that a transaction must execute as if it were the only transaction executing at that time, even though other transactions execute concurrently. In databases, the serializability correctness condition is used to achieve isolation, in which each transaction's execution result is isolated from those of other transactions as if they were executed serially. Durability requires that once a result is ready and committed to the hardware, it is final and should always be there, available to all other transactions. However, this property is not required for parallel programming: the medium to which a transaction's result is committed is memory, where data is always transient, unlike in the database case where the medium is the hard disk. The ACI model explained above does not provide formal semantics for TM. ACI does not provide any mechanism that controls the interaction between atomic blocks and code outside of a transaction, that is, the interaction between transactional and non-transactional access to shared resources. Unless a semantics specifies this interaction, it is not fully applicable to the TM case, and this is the issue with the ACI model.


2.2.2 Single-Lock Atomicity

Single-lock atomicity is not a technique to implement TM; rather, it specifies the interaction of atomic blocks in TM. Under this model, an atomic block executes as if all atomic blocks were protected by a single lock, so at most one block can execute at a time. The model specifies the isolation expected from transactions but does not specify the failure atomicity of a transaction. It is one way to achieve the serializability I talked about earlier. The single lock can also be used to describe what happens when a non-transactional access is encountered. A non-transactional access is any code outside a transaction that tries to access shared data. This kind of access to shared data may lead to a data race [9]. Before I define a data race, let me define what is known as a conflict. A conflict happens when more than one concurrent thread accesses the same shared data and at least one of these threads modifies the data. A data race happens if the conflicting accesses are not synchronized properly. When a data race happens, the behavior of the code cannot be predicted and will depend on the memory consistency model used by the system memory. Data races can be caused by poor programming of critical sections. To avoid data races, locking or other synchronization techniques should be used to control concurrent access by transactional and non-transactional code, or the program control flow should be carefully arranged.


2.2.3 Retry and OrElse

So far, we have discussed several aspects of TM semantics proposed by several researchers. However, we have said nothing about the order in which two transactions executing concurrently should change the program state. Such coordination between transactions was first introduced by Harris et al. [10], using the retry statement. When retry is used in a transaction, if anything causes the transaction to abort, it will re-execute later. What is important about retry is when the transaction should re-execute after it aborts. Harris et al., in the same paper, suggested that the transaction should re-execute once a change has been made to the set of values the transaction read before it aborted. Another operation that can be used to coordinate the execution of two transactions is orElse [10]. The operation of orElse can be explained as follows [14]. If we have P orElse Q, where P and Q are two different transactions, then P will be executed first:

1. If P commits or explicitly aborts, either by an explicit abort statement or by an uncaught exception, then Q will not be executed, since the orElse statement terminates.

2. If P executes a retry statement, then Q will be executed as part of the orElse statement's execution.

3. If Q commits or explicitly aborts, then the orElse statement terminates.


4. If Q executes a retry, then the execution of the orElse statement is suspended until a change to a location read by either P or Q allows it to re-execute.

Finally, the orElse statement should be implemented in an atomic way, since it is a composition of two atomic operations.
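The rules above can be sketched as follows. This is a deliberate simplification: retry is modeled as an exception, and the "wait until a read location changes" behavior of rule 4 is not shown; all names are assumptions.

```python
class Retry(Exception):
    """Signals that the alternative cannot proceed and should be
    re-executed later (once a location it read has changed)."""

def orElse(p, q):
    """Run p; if p retries, run q as part of the orElse statement.
    A normal return models a commit (rules 1 and 3). If q also
    retries, the whole composition retries (rule 4). A real STM
    would execute the composition atomically."""
    try:
        return p()       # rule 1: p commits, so orElse terminates
    except Retry:
        return q()       # rule 2: p retried, so q executes;
                         # a Retry from q propagates (rule 4)

def pop_or_default(queue, default):
    """Typical use: try to pop; fall back to a default instead of waiting."""
    def p():
        if not queue:
            raise Retry()     # nothing to pop yet
        return queue.pop(0)
    def q():
        return default        # the fallback alternative
    return orElse(p, q)
```

The composition reads naturally: the first alternative that commits wins, and only a retry (never an abort) falls through to the second.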


3. Design Options in Software Transactional Memory

Different options have been used to build different STM implementations. Some of these options might help simplify the implementation, and some might make it more powerful in a way that makes it a better candidate to replace lock-based parallel programming. Some of the options have a direct relation to the overhead that might be associated with STM, and some an indirect one. In this chapter, I try to list most of the options and tradeoffs that have been used so far in designing most STM implementations.

3.1 Isolation

I have already defined transaction isolation as part of the transaction correctness criteria. The isolation property requires that transactions executing concurrently do not affect each other's results. Each transaction executes as if it were the only transaction executing at that time. Blundell et al. [3] introduced the terms weak and strong atomicity. However, the term they used, atomicity, can be replaced by isolation, as can be inferred from that paper [14].

3.1.1 Weak Isolation

Before I start discussing the weak isolation property, I need to give a formal definition for it:


Weak isolation is the property of a TM system that guarantees isolation only between transactions. Non-transactional code can directly access the same shared data in main memory that is already accessed by transactional code. If a non-transactional access executes, or a conflicting reference outside the transaction happens, the result returned might be inconsistent or may lead to an incorrect execution inside the transaction. With a weak isolation implementation, a data race might result, and the system can end up in an undefined state. Several problems, or more precisely isolation violations, can happen because non-transactional accesses are allowed to reach main memory, bypassing the STM system protocols [1]:

Non-repeatable reads: this problem happens when a transaction contains two read instructions and a write instruction belonging to non-transactional code executes concurrently. The two reads must return the same value, but since the non-transactional write alters the execution of the transaction, the values returned by the two reads differ, as we can see in the following example:

Thread 1: atomic { r1 = x; r2 = x; }
Thread 2: x = 1;


Let us assume that the initial value of x is 0. Thread 1 executes the first read, and r1 is assigned 0. Then, before thread 1 performs the second read, thread 2 writes a new value, 1, to x. After that, thread 1 performs its second read, and now r2 = 1. So r1 != r2, even though both reads belong to the same transaction, which means isolation is violated.

Intermediate lost update: this happens when two write accesses are performed concurrently to the same shared data, one by transactional and one by non-transactional code. As an example, let us consider the following (the initial value of x is 0):

Thread 1: atomic { r = x; x = r + 1; }
Thread 2: x = 10;

If the two threads' executions were serialized, the final value of x should be either 10 (if thread 1 executes first and then thread 2) or 11 (if thread 2 executes first and then thread 1). Unfortunately, the value of x is going to be 1, since thread 2 writes after thread 1 performs its read access and before it writes x. Thread 2's write is lost, as if it never happened at all.

Intermediate dirty reads: this happens when a non-transactional access can see an intermediate state of transactional code. The following example explains this problem:


The initial value of x is even.

Thread 1: atomic { x++; x++; }
Thread 2: r = x;

Thread 1's transactional code will always keep the value of x even. However, the non-transactional read access will read an odd value if it executes in the middle of thread 1's two increments. Weak isolation does not mandate that all accesses to shared data be transactional; non-transactional accesses are allowed, but they must not conflict with transactional ones. This must be guaranteed by the programmer, by making the non-transactional accesses take place at a different time or to different memory locations. The approach used in most STM implementations to achieve weak isolation is to divide the whole memory manually into two parts: one part that is only accessed by transactions and another part that is only accessed by non-transactional code. However, dividing the memory like this is an error-prone process as software evolves. Furthermore, such a process of manually dividing memory is sometimes not even required in lock-based synchronization. Therefore, using weak isolation in STM systems may conflict with the goal of achieving simple and reliable concurrent programming.
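The intermediate-lost-update interleaving described above can be reproduced deterministically by scripting the thread steps by hand, instead of relying on thread timing:

```python
# Shared location and the interleaving from the lost-update example:
#   Thread 1 (transactional):      r = x; x = r + 1
#   Thread 2 (non-transactional):  x = 10
# Under weak isolation, nothing stops thread 2's write from landing
# between thread 1's read and write.
x = 0

r = x          # thread 1: transactional read (r = 0)
x = 10         # thread 2: non-transactional write slips in
x = r + 1      # thread 1: transactional write (x = 0 + 1)

assert x == 1  # thread 2's update is lost: neither 10 nor 11 survives
```

Under strong isolation (or a serialized execution), the only possible outcomes would be 10 or 11.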


3.1.2 Strong Isolation

The second approach is called strong isolation, where the TM semantics must be guaranteed between transactional and non-transactional code. In this model, the database model is enforced, in which all accesses to shared data are transactional, or happen through transactions. In strong isolation, any operation outside a transaction block that accesses shared data is eventually turned into a transaction. Strong isolation is well known for its cost in terms of the overhead that may be produced by a TM implementation adopting it. This explains why most high-performance implementations were designed to provide weak isolation. Researchers have proposed several ways to achieve strong isolation, such as restricting transactional code to accessing only data that lives in transactions, which is the approach used in Haskell STM [3]. However, this was achievable because all mutable data in Haskell STM is in transactions; the remaining data is read-only, which means it can be reached by either transactional or non-transactional accesses. On the other hand, with non-functional languages, where most data is mutable, the program must be divided into two parts: transactional and non-transactional data. However, this classification is hard to implement, as data would need to be transformed into transactional data when it moves from the non-transactional part of the program and then back to its original state after the processing completes. Another approach, proposed by Lev and Maessen, is to track shared objects at runtime [15]. Any object that is visible to more than one thread should be accessed only through transactions during concurrent access. Although this can be considered a good approach, it might introduce more overhead to the


STM. However, the authors depend heavily on the compiler for optimizations and hence lower overhead. In the following, I present Ananian's and Rinard's STM implementation, which supports strong isolation.

Example: Ananian and Rinard's STM: This STM implementation, proposed by Ananian and Rinard in 2005, is considered one of the few STM implementations built with the strong isolation design option [2]. Most STM implementations proposed so far employ weak isolation as the transaction isolation property. Ananian's and Rinard's STM implementation provides strong isolation, since any memory access outside a transaction can abort the running transaction. By contrast, in most STM implementations, which implement weak isolation, a non-transactional read or write to transactional data can happen without aborting that transaction. Instead, a mechanism is implemented such that the non-transactional accesses do not conflict with the transactional accesses. The reason behind choosing weak isolation is to avoid the overhead of adding instrumentation to every non-transactional read and write. Ananian's and Rinard's STM implements a novel way of providing strong isolation that lowers the instrumentation overhead. They achieved this goal by employing sentinel values, which are placed in memory locations accessed by transactions. I provide a detailed description of Ananian's and Rinard's strong isolation implementation in the following paragraphs.
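Before the detailed walk-through, the sentinel idea can be rendered loosely in Python. This is my own sketch of the mechanism, not the authors' code; all names are assumptions.

```python
FLAG = object()   # sentinel: a value assumed never to occur in normal data

def read_nt(obj, field, committed_versions, abort_writers):
    """Non-transactional read under strong isolation, sketched.
    An ordinary value is returned directly, so the common case needs
    almost no instrumentation. Reading the FLAG sentinel means a
    transaction owns the location: uncommitted writers are aborted and
    the last committed value is copied back before retrying."""
    while True:
        v = obj[field]
        if v is not FLAG:
            return v                           # common case: plain value
        abort_writers(obj, field)              # kill uncommitted writers
        obj[field] = committed_versions[field] # copy back committed value
```

The point of the sentinel is exactly this asymmetry: non-flagged reads pay nothing, and only the rare flagged read takes the slow path.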


In their implementation, they used a special value called FLAG that is placed in the locations of objects involved in transactions. Trying to read or overwrite a flagged value tells non-transactional code that exceptional processing is required.

Object Structures: The basic data structures of Ananian's and Rinard's STM implementation are shown in the figure below.


[Figure: an object with versions and readers fields; each object version points to its owning transaction's status field]

FIGURE 3.1: Ananian's and Rinard's STM Data Structures

Two additional fields are added to each object: a versions field and a readers field. The versions field points to a singly-linked list of object versions. Each object version has an owner field, which points to the owning transaction object's single field, status, whose value corresponds to committed, aborted, or waiting. For each transaction, there is a specific object version. The second field is the readers field, which also


points to a singly-linked list of all the transactions that have read from this object, other than the committed or aborted ones. The FLAG value can be any value, as long as it is not a common one. Each object field holds the semantic value of the original object structure when it does not contain the FLAG value. In the latter case, the actual value of the field is the value recorded by the first committed transaction in the object's version list.

Operations: Ananian and Rinard supported the following operations in their STM implementation: transactional read/write, non-transactional read/write, begin, abort, and commit. In the present context, strong isolation, the only two operations that concern us are the non-transactional read and write.

Read: The non-transactional read in Ananian's and Rinard's STM is performed using the readNT method. This method does a non-transactional read of field f from object o, and the result is stored in v. When the value read is a FLAG value, the thread has to copy back the field value from the most recently committed transaction. In addition, the thread has to abort all uncommitted writer transactions and then try again. To tell the thread to abort only the transactional writers and not the transactional readers, the kill_writers constant is passed to the copy-back method, as seen in the (Promela-style) code below:

inline readNT(o, f, v) {
  do
  :: v = object[o].field[f];
     if
     :: (v != FLAG) -> break  /* done! */
     :: else
     fi;
     copyBackField(o, f, kill_writers, _st);
     if
     :: (_st == false_flag) -> v = FLAG; break
     :: else
     fi
  od
}

Write: The non-transactional write access in Ananian's and Rinard's STM is performed by invoking the writeNT method. It writes the new value nval to field f of object o. For this non-transactional write to be performed, the reader list must be empty. To check this condition, a Load-Linked/Store-Conditional operation is used, so that the write is stored only if the reader list is empty. If it is not empty, the copy-back procedure is called with the kill_all constant. This constant makes sure that all transactional readers and writers are aborted, and hence the reader list becomes empty. The writeNT procedure is shown below:

inline writeNT(o, f, nval) {
  if
  :: (nval != FLAG) ->
     do
     :: atomic {
          if /* this is a LL(readerList)/SC(field) */
          :: (object[o].readerList == NIL) ->
             object[o].fieldLock[f] = thread_id;
             object[o].field[f] = nval;
             break  /* success! */
          :: else
          fi
        }
        /* unsuccessful SC */
        copyBackField(o, f, kill_all, _st)
     od
  :: else ->
     /* create false flag: implement this as a short       */
     /* *transactional* write. Start a new transaction,    */
     /* write FLAG, commit the transaction; repeat until   */
     /* successful. Implementation elided.                 */
  fi;
}

3.2 Nesting Transactions

Sometimes programmers might need to execute a transaction inside another transaction, so that the execution of the inner one is within the dynamic scope of the outer transaction. Such a transaction is called a nested transaction. Let us suppose that we have two transactions, one nested inside the other. In this case, the nested one is called the inner transaction, and the one that includes the inner transaction is called the outer transaction. If we assume that we have only one thread of control for both transactions, once the outer transaction reaches the inner one, control passes to the inner transaction. The inner transaction then takes its turn executing, and at the same time it can see any modification done by the outer, or mother, transaction. Larus and Rajwar listed various ways to link such transactions, as we will see in the following sections [14].

3.2.1 Flattened Transactions

In this type of nested transaction, if the inner transaction aborts, then the outer transaction aborts as well. However, when the inner transaction commits, the outer transaction is the only transaction that can see the changes produced by the inner one. In addition, the outer one will not


commit just because the inner one does, and once the outer one commits, the modifications of the inner transaction become visible to all other transactions. The following example explains the behavior of flattened transactions:

int x = 1;
atomic {
  x = 2;
  atomic flatten {
    x = 3;
    abort;
  }
}

In the above example, if the outer transaction aborts, whether because the inner one aborted or because it decides to abort for any other reason (even if the inner one committed instead of aborting), then x will be 1. Although flattened transactions are considered easy to program, they are also considered a bad abstraction model that does not provide compositional atomic transactions, since aborting the inner transaction causes the outer one to abort.

3.2.2 Closed Transactions

As an alternative to flattened transactions, closed transactions have the advantage that the effect of the inner transaction on the outer one is smaller. In a closed transaction, aborting the inner transaction does not terminate the execution of the outer transaction. When an inner transaction aborts, all its log is discarded and a notification is sent to the outer one, and after that the parent transaction


has to choose whether it wants to abort [14]. Similarly, in the committing case, if the inner transaction commits, control goes to the outer one. When an inner transaction commits, the outer transactions are able to see any change introduced by the inner one. However, other transactions (not the parents) cannot see the modifications of that inner transaction until all the surrounding parent transactions commit. The code below illustrates the operation of closed nested transactions:

int x = 1;
atomic {
  x = 2;
  atomic closed {
    x = 3;
    abort;
  }
}

When the inner transaction aborts, its operations are undone and all its log is discarded. The value of x will be 2 if the outer transaction commits; otherwise it will be 1, when both the inner and outer transactions abort. Furthermore, if the inner transaction commits and all the outer transactions abort, the value of x will also be 1, and the change introduced by the inner transaction will not be visible to the other transactions, since its parents did not commit. What is interesting about the closed model is that no conflict happens between the inner transaction and the outer ones. However, a conflict may exist between the inner transaction and another transaction that belongs to a different family (none of its parents).
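The closed-nesting rules can be sketched with per-transaction write logs: an inner commit merges the child's log into its parent, while an inner abort simply discards it. This is a simplification of the model in [14], and all names are assumptions.

```python
class Txn:
    """Closed nested transaction sketch. Each transaction buffers its
    writes in a log; an inner commit only merges into the parent's log,
    so only enclosing transactions see the inner changes until the
    outermost transaction commits them to memory."""
    def __init__(self, memory, parent=None):
        self.memory = memory
        self.parent = parent
        self.log = {}

    def read(self, loc):
        # Search own log, then enclosing transactions, then memory.
        t = self
        while t is not None:
            if loc in t.log:
                return t.log[loc]
            t = t.parent
        return self.memory[loc]

    def write(self, loc, val):
        self.log[loc] = val

    def commit(self):
        if self.parent is None:
            self.memory.update(self.log)      # outermost: changes go global
        else:
            self.parent.log.update(self.log)  # inner: visible to parent only

    def abort(self):
        self.log.clear()   # inner abort discards its log; the parent survives
```

Running the closed-nesting example from the text against this sketch gives the outcomes described: an inner abort leaves x = 2 once the outer commits, and an inner commit stays invisible to memory until the outer one commits.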


3.2.3 Open Transactions

In the open nested transaction model, when the inner transaction commits, its modifications become visible to all transactions, parents or not, whatever the results of the outer transactions are. The following example is similar to those given for the previous nesting models, except for the addition of the open keyword that enables the open nested transaction model:

int x = 1;
atomic {
  x = 2;
  atomic open {
    x = 3;
  }
  abort;
}

Here, if the inner transaction commits, its change is visible even if the outer transaction aborts; in this case the value of x is 3.

3.3 Transaction Granularity

One of the important parameters in designing a TM is the transaction granularity: the unit of storage over which a TM system detects conflicts [14]. In most STM implementations, two types of transaction granularity have been used: word (or block) and object [20].


3.3.1 Block or Word Granularity

In this type of transaction granularity, at most one transaction has control over a shared word at a time. This ensures that any modification or change to the shared word is atomic. Moreover, a dedicated record is attached to each shared word that serves to store the exclusive ownership of that word. The process of transferring ownership of the records between transactions is performed by the system automatically. This scheme was introduced by Shavit and Touitou in their STM paper, which is also considered the first formal STM implementation [19].

3.3.2 Object Granularity

Object granularity is usually implemented in systems that use object-oriented languages. This type of granularity causes the STM to detect a conflicting access to an object even if the transactions referenced different fields. Most STM implementations have been built with object granularity. Using object granularity, it is not required to change the original object structure when translating a non-transactional program into a transactional one. Moreover, any object is free to execute either inside a transaction or outside it without needing any change. Many STM implementations use object granularity; one of them is DSTM [12].

3.4 Version Management

During the execution of a transaction, an STM system has to maintain different versions of the global data. These different versions are the new or


updated version and the old or original version. The new or updated version is the version of the data produced by one of the pending transactions. This version is supposed to become visible to all other transactions once the executing transaction succeeds and decides to commit its computations. The old or original version, the second type, is maintained by the TM system during the execution period. The original version is the one that existed in memory before the executing transaction started. This version is required in case the executing transaction decides to abort for any reason and hence needs to roll back to the original value. In general, two approaches have been proposed and implemented by researchers to maintain the new and old versions of global data: eager versioning and lazy versioning. However, these two approaches are not always named eager and lazy versioning; some researchers call them direct and deferred update [14]. In the end, eager versioning is direct update and lazy versioning is deferred update, under different names. In my thesis, I chose to use the names eager and lazy versioning, as they are widely used in the literature and more common than direct and deferred update.

3.4.1 Eager Versioning

A TM system implemented with the eager versioning feature writes the new data or version directly to memory. This means that the modification introduced by the executing transaction is applied to the object itself. However, the new version must not become visible to other transactions, so this type of versioning requires the TM system to use some


form of concurrency control, such as locks, throughout the transaction's execution period. The TM system thus makes sure that the uncommitted version of the data residing in memory is not visible to any other transaction until it is committed. The old version is stored in an undo buffer or log so that, in case the transaction aborts, the TM system can restore the original version to its memory location. This type of versioning is considered fast-committing, since if the transaction decides to commit, it only needs to make the object visible to other transactions and discard the old version stored in the undo buffer. However, additional delay may be experienced if the transaction aborts, since the original version has to be written back to the original memory address. In addition, concurrency is limited during the execution of the transaction, until the final decision is reached.

[Figure: begin, commit, and abort steps showing the thread, memory, and undo buffer under eager versioning]

FIGURE 3.2: Eager Versioning
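The eager scheme can be sketched in a few lines, with the in-place write and the undo log made explicit. This is a simplification: the locking and visibility control described above are omitted, and all names are assumptions.

```python
class EagerTxn:
    """Eager versioning (direct update) sketch: new values go straight
    to memory, and the old value is saved in an undo log so that an
    abort can restore it."""
    def __init__(self, memory):
        self.memory = memory
        self.undo = {}          # loc -> original value

    def write(self, loc, val):
        if loc not in self.undo:
            self.undo[loc] = self.memory[loc]  # save the original once
        self.memory[loc] = val                 # update memory in place

    def commit(self):
        self.undo.clear()       # fast commit: just drop the undo log

    def abort(self):
        self.memory.update(self.undo)          # slow abort: copy back
        self.undo.clear()
```

The fast-commit/slow-abort tradeoff is visible directly: commit is a buffer clear, while abort walks the undo log and writes memory.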


3.4.2 Lazy Versioning

The second type of data versioning, lazy versioning, has been used widely in most STM implementations. In this type of versioning, any data modification is stored as a new version in a write buffer that is private to the transaction. The old or original version stays as is in memory, and hence this type enhances concurrency instead of restricting it as eager versioning does. Once the transaction finishes its execution and decides to commit, the new version is moved from the write buffer to memory. However, if the decision is to abort, the transaction just needs to discard whatever is stored in its buffer. Lazy versioning is considered fast-aborting since, as I mentioned earlier, when a transaction decides to abort, it just discards the new version that is isolated in the per-transaction buffer. On the other hand, lazy versioning may experience some delay if the transaction decides to commit, as the TM system needs to copy the new updated version over the old version stored in memory. This delay is produced by the need to search the buffer for the latest version.
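Lazy versioning is the mirror image of the eager scheme: the new version lives in a private buffer and reaches memory only on commit. A matching sketch (conflict detection is omitted, and all names are assumptions):

```python
class LazyTxn:
    """Lazy versioning (deferred update) sketch: memory keeps the old
    version; new values live in a per-transaction buffer until commit."""
    def __init__(self, memory):
        self.memory = memory
        self.buffer = {}        # private buffer of new versions

    def read(self, loc):
        # A transaction must see its own tentative writes first.
        if loc in self.buffer:
            return self.buffer[loc]
        return self.memory[loc]

    def write(self, loc, val):
        self.buffer[loc] = val  # memory stays untouched

    def commit(self):
        self.memory.update(self.buffer)  # slower commit: copy back
        self.buffer.clear()

    def abort(self):
        self.buffer.clear()     # fast abort: just discard the buffer
```

Here the tradeoff is reversed relative to eager versioning: abort is a buffer clear, while commit walks the buffer and writes memory.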


[Figure: begin, commit, and abort steps showing the thread, memory, and write buffer under lazy versioning]

FIGURE 3.3: Lazy Versioning

3.5 Concurrency Control

Concurrent transactions executed by a TM system need to be synchronized to avoid conflicting concurrent accesses to an object. I have already defined what a conflicting access is: two or more transactions concurrently access the same object, and at least one of them modifies that object. After a conflict occurs, the TM system determines that a conflict has happened; this is called conflict detection. Once a conflict is detected, the TM system


or the code in a transaction applies a contention policy to ensure correctness, either by aborting or by delaying one of the conflicting transactions. These three operations, conflict occurrence, detection, and resolution, must follow this order; they can happen at different times, but not in a different order. STM implementations have been designed using two different approaches to concurrency control: pessimistic and optimistic.

3.5.1 Pessimistic Concurrency Control

In this type of concurrency control, the three operations (conflict occurrence, detection, and resolution) all happen at the same point of time during execution. A conflict is detected and resolved as the transaction is about to access a shared object. With pessimistic concurrency control, a transaction claims exclusive ownership of an object, so other transactions are not allowed to access it. However, deadlock may happen when two transactions each hold exclusive access to an object and each wants to access the object held by the other. Such deadlocks can be avoided by enforcing a predetermined order for acquiring exclusive access to objects. In case of a deadlock between two transactions, one of them has to be aborted.

3.5.2 Optimistic Concurrency Control

The difference between optimistic concurrency control and the pessimistic kind is that conflict detection and resolution do not happen at the same point of execution as the conflict occurrence. In this


case, more than one transaction can access the same object concurrently. However, any conflict, if there is one, must be detected and resolved before a transaction commits. A conflicting transaction can be delayed, restarted, or even aborted while the other continues its execution. The advantages of this type of concurrency control are avoiding locks and their problems, and providing more concurrency, especially when conflicts rarely happen.

3.6 Blocking versus Non-blocking

Considering forward progress guarantees, i.e., whether a transaction that attempts to access shared data will be able to finish its execution, synchronization can be divided into two types: blocking and non-blocking.

3.6.1 Blocking

This is the legacy way of providing concurrency control, in which locks, monitors, or semaphores are used. At any time, only one transaction is allowed to access a piece of shared data, exclusively. However, this type may prevent forward progress by introducing the problems with locks I discussed in chapter 1: priority inversion, convoying, and deadlock. In this type, even though locks are the tools that provide the synchronization, they are used only in the STM code, not by programmers in their own code.

3.6.2 Non-blocking

Early STMs were implemented with non-blocking synchronization, which can provide better forward progress guarantees than the previous type, the


blocking one. This type avoids all the problems associated with blocking synchronization techniques. Based on the strength of its forward-progress guarantees, non-blocking synchronization can be classified into three types:

Wait-freedom: the strongest assurance; it guarantees that every thread competing to access shared data makes forward progress in a finite number of its own time steps. This type causes neither deadlock nor starvation.

Lock-freedom: a weaker assurance; it guarantees that at least one of the threads competing to access shared data makes forward progress in a finite number of steps of any of the concurrent threads. Deadlock can be avoided with this type, but not starvation.

Obstruction-freedom: the weakest assurance; it guarantees only that a thread makes progress in a finite number of its own time steps in the absence of contention. Although deadlocks are avoided with this type, livelocks are not.

3.7 Object Acquisition

The acquisition of an object happens when a transaction declares its ownership of an object in non-blocking STM systems, or acquires the lock protecting that object in blocking STM systems. Based on the time of acquisition, it can be divided into two types: lazy acquire and eager acquire.
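Lock-freedom, the middle guarantee above, is usually obtained with a compare-and-swap (CAS) retry loop. The following is a minimal Python sketch of my own, not code from any cited STM; since Python exposes no hardware CAS, the atomic instruction is simulated with a lock inside `AtomicRef`:

```python
import threading

class AtomicRef:
    """Simulated compare-and-swap; on real hardware this is a single
    atomic CAS instruction, not a lock-protected critical section."""
    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def get(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

def lock_free_increment(ref):
    # Retry loop: if the CAS fails, some OTHER thread must have changed
    # the value, i.e. made progress -- so the system as a whole advances
    # (lock-freedom). An individual thread can still lose the race
    # forever, which is why this is not wait-freedom.
    while True:
        old = ref.get()
        if ref.compare_and_swap(old, old + 1):
            return old + 1

counter = AtomicRef(0)
threads = [threading.Thread(
               target=lambda: [lock_free_increment(counter) for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter.get() == 4000   # no increment is ever lost
```

Note that no thread ever blocks while holding the shared value: a failed CAS is simply retried, so deadlock is impossible by construction.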


3.7.1 Lazy Acquire

In this type of object acquisition, the object is acquired only at commit time, which means that conflict detection is deferred until commit time. With this type a doomed transaction is allowed to continue, but the system also has the ability to overlook a potential conflict. One of the advantages of this type is that short-running reader transactions can commit in parallel with the execution of long-running transactions that also commit. Many STM implementations, such as OSTM and Haskell STM, use this type of object acquisition and conflict detection.

3.7.2 Eager Acquire

With eager acquisition, a transaction has to acquire the object it is trying to access at the beginning, at open time. This type allows transactions to detect a conflict early and hence save the useless work done by a transaction that is going to abort eventually. However, sometimes with eager acquisition a transaction aborts a conflicting transaction and then itself fails to commit, which wastes the work invested by both the aborted transaction and itself. Many STM implementations, such as DSTM, WSTM and McRT-STM, use this type of object acquisition and conflict detection.

3.8 Contention Management

Whenever a conflict over an object is detected between two transactions, one of the transactions must be aborted to resolve the conflict. The part of the


TM system that is responsible for this decision, aborting one of the transactions, is called a contention manager. The contention manager includes one or more contention policies, and based on these policies the conflicting transaction to abort is chosen. Some contention manager implementations may decide not to abort the conflicting transaction; instead, the contention policy can ask for that transaction to be delayed. Different contention managers may have different costs, depending on their contention policies.
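The optimistic, lazy-acquire design of sections 3.5.2 and 3.7.1, together with the contention manager's abort decision, can be sketched in a few lines of Python. This is a toy model of my own (the names `Cell` and `Transaction` are illustrative, not from any cited STM); a real implementation would use atomic instructions rather than brief per-object locks at commit:

```python
import threading

class Cell:
    """A shared object with a version stamp bumped on every commit."""
    def __init__(self, value):
        self.value, self.version = value, 0
        self.lock = threading.Lock()   # taken only briefly at commit time

class Transaction:
    def __init__(self):
        self.reads = {}    # cell -> version observed at first read
        self.writes = {}   # cell -> buffered new value (lazy acquire)

    def read(self, cell):
        if cell in self.writes:             # read-your-own-writes
            return self.writes[cell]
        self.reads.setdefault(cell, cell.version)
        return cell.value

    def write(self, cell, value):
        self.writes[cell] = value           # no ownership claimed yet

    def commit(self):
        """Acquire at commit time, validate the read set, then publish.
        Returning False plays the contention manager's role here: the
        transaction that discovers the conflict is the one aborted."""
        touched = sorted(set(self.reads) | set(self.writes), key=id)
        for c in touched:                   # fixed order avoids deadlock
            c.lock.acquire()
        try:
            if any(c.version != v for c, v in self.reads.items()):
                return False                # stale read -> abort
            for c, v in self.writes.items():
                c.value, c.version = v, c.version + 1
            return True
        finally:
            for c in touched:
                c.lock.release()

x = Cell(10)
t1, t2 = Transaction(), Transaction()
t1.write(x, t1.read(x) + 1)
t2.write(x, t2.read(x) + 1)
assert t1.commit()          # first committer wins
assert not t2.commit()      # t2's read of x is now stale: aborted
assert x.value == 11
```

The sketch shows why optimistic control wastes work under contention: t2 executed completely before its conflict was detected at commit, exactly the overhead discussed in chapter 4.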


In the end, it depends on the language used to implement the TM and on whether the compiler can provide a solution flexible enough to include both of the exception-handling methods mentioned above.

3.9.2 Library versus Programming Language

Another implementation option is whether to provide transactions as part of a library or to integrate them into the syntax of the programming language. Several TM implementations exist for each option. Although the latter option, in which transactions are integrated as primitives into the programming language semantics, is preferable, the integration should be done properly, as Carlstrom et al. argue [4]. Having clear semantics helps to implement better compiler optimizations.

3.9.3 Uniprocessor versus Multiprocessor

Some TM implementations have been designed to run on a uniprocessor system, whereas others take advantage of the new multicore trend.
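The abort-on-exception behaviour described above can be mimicked by buffering a transaction's writes and publishing them only if its body runs to completion. A minimal sketch, where `atomic` and `TxAborted` are illustrative names of my own rather than a real TM API:

```python
class TxAborted(Exception):
    """Raised when an exception escapes the atomic block: the
    transaction is aborted and none of its writes are committed."""

def atomic(body, store):
    buffer = dict(store)        # the transaction works on a private copy
    try:
        body(buffer)
    except Exception as exc:
        raise TxAborted("rolled back") from exc   # abort: discard buffer
    store.clear()
    store.update(buffer)        # commit: publish all writes at once

store = {"balance": 100}

def withdraw(tx):
    tx["balance"] -= 30

atomic(withdraw, store)
assert store["balance"] == 70   # committed

def bad(tx):
    tx["balance"] -= 30
    raise ValueError("failure inside the transaction")

try:
    atomic(bad, store)
except TxAborted:
    pass
assert store["balance"] == 70   # aborted: no partial update is visible
```

The terminate-style alternative in the text would instead run the commit path before re-raising the exception, publishing the buffer.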


4. STM Overhead

Many researchers believe that TM in general is a promising solution to the problems of programming multicore processors, or even of parallel programming in general. After TM was first introduced as a solution implemented in hardware, HTM, that kind of implementation came to be considered expensive, which made it less feasible in the eyes of most researchers. STM was proposed as a solution that still avoids most of the pitfalls of lock-based synchronization while not requiring any extra hardware, and hence adding no extra cost. This is the main reason most researchers shifted to STM. Nowadays, many STM implementations have been proposed, in different languages and with different flavors. Although STM is still a research subject, many researchers have started to believe that it will not go far and might not be the optimal replacement for the currently used synchronization techniques based on mutual exclusion. One of the most important reasons for that belief is the overhead incurred by using STM, which is why some researchers have called it a research toy [5]. Having discussed most of the STM design options and tradeoffs used in various STM implementations, I now list some of the sources of overhead related to these options, directly or indirectly. In their effort to design a low-overhead STM implementation, which they named RSTM, Marathe et al. listed several possible sources of overhead [17]. These sources are:


Managed Code: they used C++ instead of the Java they had used in their prior implementation, ASTM, expecting better performance by avoiding runtime semantic checks, lock-based thread-safe libraries and always-virtual method dispatch. However, Sun's HotSpot Java implementation typically runs programs slightly faster than equivalent C++ versions compiled with GCC.

Bookkeeping: in object-based STM implementations, the minimum number of CAS operations required to acquire n objects and commit is n+1. In addition, n more CAS operations might be required for post-commit cleanup of headers. The private read lists and write lists add further overhead. So bookkeeping introduces noticeable overhead in the single-thread or low-contention cases.

Memory Management: memory management is required both for data objects and for dynamically allocated metadata (transaction descriptors, DSTM locators, OSTM object handles). In garbage-collected languages, there is the additional cost of tracing and reclamation.

Contention Management: conflict resolution can lead to significant overhead, as can the sorting required to avoid deadlocks. The work that is lost to aborting or delaying a conflicting transaction can also be considered a high overhead on top of the other sources I have already listed. However, contention manager overhead finally depends on the options used in building the STM. Choosing the synchronization style of the implementation is critical, because an obstruction-free STM has a better chance of minimizing the useless work that might be required by lock-free or wait-free STM implementations.

Copying: in STM, every writer creates a copy of every object to be written. Small objects might not result in significant overheads


compared to the overheads of bookkeeping. However, if the object is large and only a small change is required, then the overhead resulting from the unneeded copying might be significant.
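The copying cost can be seen with a one-element update to a large object: an object-granularity writer clones the whole object, while a word-granularity undo log records only the single overwritten value. A sketch under my own naming, not RSTM's actual code:

```python
import copy

def write_with_object_copy(obj, index, value):
    """Object granularity: clone the WHOLE object before writing,
    O(len(obj)) time and space even for a one-slot change."""
    clone = copy.copy(obj)
    clone[index] = value
    return clone            # original stays untouched until commit

def write_with_word_log(obj, index, value, undo_log):
    """Word granularity: remember only the one overwritten slot,
    O(1) extra space; the log restores the slot if the tx aborts."""
    undo_log.append((index, obj[index]))
    obj[index] = value

big = [0] * 1_000_000
new_version = write_with_object_copy(big, 7, 42)
assert new_version[7] == 42 and big[7] == 0

log = []
write_with_word_log(big, 7, 42, log)
assert big[7] == 42 and log == [(7, 0)]
for i, old in reversed(log):        # abort path: roll back in place
    big[i] = old
assert big[7] == 0
```

For the million-element list above, the object copy touches a million slots to change one; the word log touches two.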


5. Findings

In this section I list my recommendations regarding the STM design options that might increase or decrease the overhead. I have already said a few words about these design options in chapter 2.

Assertion 1

Regarding the isolation property, I have described two design approaches used by various STM implementations: weak isolation and strong isolation. Strong isolation is considered better than weak isolation, since it guarantees isolation both between transactions themselves and between transactions and non-transactional accesses. Weak-isolation STM implementations need shared memory to be divided manually into two parts: transactional memory, which can be accessed only by transactions, and non-transactional memory, which can be accessed by non-transactional code outside transactions. Dividing the memory this way is not ideal, since it is error prone, especially as software evolves. Moreover, dividing the memory does not make STM, or TM in general, a more attractive concurrent programming technique, because lock-based synchronization with critical sections does not require this hassle. In addition, failing to maintain this division leads to several issues: non-repeatable reads, intermediate lost updates and intermediate dirty reads [1].


The solution that I strongly recommend is to deny any direct non-transactional access to shared memory by implementing STM with strong isolation. In addition to avoiding the issues of weak isolation, the overhead of an STM implemented with strong isolation can be kept low. One good approach to implementing strong isolation is to use a functional language such as Haskell: all mutable shared data in Haskell's STM is accessed through transactions, and hence no extra overhead is required to achieve strong isolation. However, imperative and object-oriented languages are used more, especially in industrial and commercial applications, and programmers are more familiar with imperative languages than with functional ones. Therefore, I recommend implementing STM with strong isolation in these types of languages. The extra overhead this approach introduces can be handled with efficient read and write barriers for memory accessed outside transactions. Optimization can reduce the overhead of the barriers and sometimes even remove it. The cost of the suggested read and write barriers can be reduced by the following enhancements:

1. Tracking thread-local objects at runtime, to eliminate unneeded synchronization.
2. Eliminating barriers wherever they are not required.
3. Implementing some form of barrier aggregation, to amortize the barrier cost.
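One way to realize the write barrier suggested above is to route every store, transactional or not, through a check of the object's ownership, so that non-transactional code behaves like a tiny one-write transaction. The following is a sketch with illustrative names of my own, not the barrier of any cited STM; enhancements 1 and 2 above correspond to skipping this check for cells the compiler or runtime can prove are thread-local:

```python
class Cell:
    def __init__(self, value):
        self.value = value
        self.owner = None      # transaction currently owning the cell, if any

def write_barrier(cell, value, tx=None):
    """Strong-isolation write barrier: code outside any transaction
    (tx=None) cannot overwrite a cell owned by a live transaction.
    A real STM would wait or invoke the contention manager here;
    this sketch simply refuses the racing store."""
    if cell.owner is not None and cell.owner is not tx:
        raise RuntimeError("barrier: cell owned by another transaction")
    cell.value = value

c = Cell(0)
write_barrier(c, 1)            # no owner: plain store allowed
assert c.value == 1

tx = object()                  # stand-in for a transaction descriptor
c.owner = tx
write_barrier(c, 2, tx=tx)     # the owning transaction itself may write
assert c.value == 2
try:
    write_barrier(c, 3)        # non-transactional store is stopped
except RuntimeError:
    pass
assert c.value == 2
```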


Assertion 2

Support for nested transactions has been included in most of the STM implementations I have seen, except for a few that are quite old. What I discovered is that most of the STMs which support nested transactions use the flattened type, while a few use the open one. I think the reason most STM implementers choose the flattened type is that it is easy to implement. However, as I have mentioned, the flattened type does not provide composable transactions. The rest, which use the closed type, might experience more overhead: when an inner transaction aborts, the outer transaction still continues its computation, and if the outer transaction then also aborts, it would have been more efficient to abort the outer one directly. I think the trend in supporting nested transactions in STM implementations will be to choose the open type. In my opinion, it is the best of the three. When an inner transaction commits, letting all other transactions, whether the parent or not, see the result of the inner transaction saves a lot of STM overhead; hiding the result of a committed inner transaction adds extra overhead to the STM. In addition, if the outer transaction decides to abort and execute again, it does not need to re-execute the inner one, because it has already executed. The open type also increases concurrency, since the resources reserved by the inner transaction are released earlier.
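The cost difference between the nesting flavours can be seen in a sketch of flattened nesting, where an inner abort necessarily discards the outer transaction's work as well. This is illustrative code of my own, not the nesting machinery of any cited STM:

```python
class FlattenedTx:
    """Flattened nesting: inner begin/commit only adjust a depth
    counter, so there is one write buffer for the whole nest and an
    inner abort throws away the outer transaction's work too."""
    def __init__(self):
        self.depth = 0
        self.writes = {}
        self.live = True

    def begin(self):
        self.depth += 1              # inner 'transaction' is just depth+1

    def write(self, key, value):
        self.writes[key] = value

    def abort(self):
        self.writes.clear()          # the outer work is lost as well
        self.depth = 0
        self.live = False

    def commit(self, store):
        self.depth -= 1
        if self.depth == 0 and self.live:
            store.update(self.writes)   # publish only at outermost commit

store = {}
tx = FlattenedTx()
tx.begin()                  # outer transaction
tx.write("a", 1)
tx.begin()                  # inner transaction, flattened into the outer
tx.write("b", 2)
tx.abort()                  # inner aborts -> the whole nest aborts
assert store == {} and tx.writes == {}
```

A closed-nesting implementation would instead keep a per-level buffer, discarding only the inner one on an inner abort, which is exactly the extra bookkeeping the text refers to.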


Assertion 3

There are two design options STMs have been using to detect conflicts: object granularity and word (or block) granularity. Although object granularity is preferable, especially for STM implementations built in object-oriented languages, it can be a major source of overhead, and the size of the objects affects the overhead of any STM system. One disadvantage of object granularity is that when a transaction opens a large object to perform a write, it needs to create a new copy of that object to hold its modification; the transaction creates the new copy even if the modification is minor, and copying such an object can cause significant overhead. Another issue is the false conflicts that can happen when two transactions reference two different fields of the same object, as in the case of a multidimensional array. Such problems might be avoided by using word granularity, but word granularity in turn requires more space and time to track and compare read sets and write sets, and this time and space cost is always paid. My opinion is that, with all its disadvantages, object granularity is still better than word granularity in terms of overhead. Unlike word granularity, object granularity helps programmers reason about their programs, and hence they can introduce further enhancements and optimizations to conflict detection. In addition, the overhead of object granularity might be noticeable only in the case of large objects.
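The false-conflict problem is easy to see when conflict detection is reduced to set intersection over the two granularities. This is a toy model of mine; real STMs compare read and write sets through their per-object or per-word metadata:

```python
def conflict_object_granularity(writes_a, writes_b):
    """Object granularity: a conflict is declared whenever both
    transactions touch the same object, even different fields of it."""
    objects = lambda ws: {obj for obj, _field in ws}
    return bool(objects(writes_a) & objects(writes_b))

def conflict_word_granularity(writes_a, writes_b):
    """Word granularity: a conflict needs the exact same
    (object, field) pair; the price is tracking every location."""
    return bool(set(writes_a) & set(writes_b))

# Two transactions writing DIFFERENT rows of the same matrix:
a = {("matrix", (0, 0))}
b = {("matrix", (1, 0))}
assert conflict_object_granularity(a, b)        # false conflict
assert not conflict_word_granularity(a, b)      # no real conflict
```

The multidimensional-array case in the text is exactly this: object granularity serializes the two writers needlessly, while word granularity lets them commit in parallel at the cost of a larger tracked set.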


6. Conclusion

In my thesis, I started by defining the problems posed by parallel programming and describing the lock-based synchronization model. Then I mentioned some of the published issues resulting from using legacy locking techniques in parallel programming. I also introduced TM, with a focus on STM, and gave some background about the implementations, operations and semantics of STM. In my thesis, I concentrated on the main STM design tradeoffs. Some of these options have a direct relation to the increased overhead of STM. Unfortunately, the overhead of STM implementations is still not clearly characterized. One reason is that STM, and TM in general, is still at the research stage. Moreover, most parallel programmers are not familiar with TM, and those who know it still do not see it as a mature environment to build their code on. TM, and STM specifically, is still an active area of research, and more researchers are now interested in making it a feasible and successful replacement for the current parallel programming paradigms. More work will need to be conducted on evaluating the overhead of STM and, if possible, on reducing this disadvantage. One of the problems regarding the overhead of STM is the differences among implementations; it is still an open issue to develop methods to evaluate the overhead of all STM implementations.


BIBLIOGRAPHY

[1] Adl-Tabatabai, A., Shpeisman, T., Menon, V., Balensiefer, S., Grossman, D., Hudson, R., Moore, K., Saha, B., "Enforcing Isolation and Ordering in STM," in Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 2007: California, USA.

[2] Ananian, S. and Rinard, M., "Efficient Object-Based Software Transactions," in Proceedings of the OOPSLA 2005 Workshop on Synchronization and Concurrency in Object-Oriented Languages (SCOOL). 2005: San Diego, CA.

[3] Blundell, C., Lewis, E.C., and Martin, M.M.K., "Deconstructing Transactions: The Subtleties of Atomicity," in Fourth Annual Workshop on Duplicating, Deconstructing, and Debunking. 2005.

[4] Carlstrom, B.D., et al., "The Atomos Transactional Programming Language," in Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2006). 2006, ACM Press: Ottawa, Ontario, Canada. pp. 1-13.

[5] Cascaval, C., Blundell, C., Michael, M., Cain, H., Wu, P., Chiras, S., and Chatterjee, S., "Software Transactional Memory: Why Is It Only a Research Toy?," CACM, 51(11):40-46, November 2008.

[6] Dice, D., Shalev, O., and Shavit, N., "Transactional Locking II," Proceedings of the 20th International Symposium on Distributed Computing (DISC). 2006, pp. 194-208.

[7] Ennals, R., "Software Transactional Memory Should Not Be Obstruction-Free," Technical Report IRC-TR-06-052, Intel Research Cambridge, 2006.


[8] Fraser, K., "Practical Lock-Freedom," PhD Thesis, University of Cambridge Computer Laboratory, 2004.

[9] Gharachorloo, K., Memory Consistency Models for Shared-Memory Multiprocessors. 1995, Computer Systems Laboratory, Stanford University.

[10] Harris, T., et al., "Composable Memory Transactions," in Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2005: Chicago, IL.

[11] Herlihy, M., Luchangco, V., and Moir, M., "A Flexible Framework for Implementing Software Transactional Memory," Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 253-262, 2006.

[12] Herlihy, M., Luchangco, V., Moir, M., and Scherer III, W., "Software Transactional Memory for Dynamic-Sized Data Structures," in Proceedings of the Twenty-Second Annual Symposium on Principles of Distributed Computing, pp. 92-101, 2003.

[13] Herlihy, M. and Moss, J.E.B., "Transactional Memory: Architectural Support for Lock-Free Data Structures," ISCA 1993.

[14] Larus, J. and Rajwar, R., Transactional Memory. Morgan & Claypool Publishers, 2006.

[15] Lev, Y. and Maessen, J.-W., "Toward a Safer Interaction with Transactional Memory by Tracking Object Visibility," in Proceedings of the OOPSLA 2005 Workshop on Synchronization and Concurrency in Object-Oriented Languages (SCOOL). 2005.

[16] Lomet, D., "Process Structuring, Synchronization, and Recovery Using Atomic Actions," in Proc. ACM Conf. on Language Design for Reliable Software, Raleigh, NC, 1977, pp. 128-137.

[17] Marathe, V., et al., "Lowering the Overhead of Nonblocking Software Transactional Memory," in Proceedings of the ACM SIGPLAN 2006 Workshop on


Languages, Compilers, and Hardware Support for Transactional Computing (TRANSACT). 2006: Ottawa, Canada.

[18] Marathe, V., Scherer III, W., and Scott, M., "Adaptive Software Transactional Memory," Proceedings of the Nineteenth International Symposium on Distributed Computing, pp. 354-368, 2005.

[19] Shavit, N. and Touitou, D., "Software Transactional Memory," in Proc. 14th ACM Symp. on Principles of Distributed Computing, Ottawa, Canada, 1995, pp. 204-213.

[20] Wang, X., Ji, Z., Fu, C., and Hu, M., "A Review of Transactional Memory in Multicore Processors," Inform. Technol. J., 9: 192-200, 2010.

PAGE 65

SOFTWARE TRANSACTIONAL MEMORY DESIGN OPTIONS AND OVERHEAD By Ibrahim Alfuaires B.S., King Fahd University of Petroleum and Minerals, 2005 A thesis submitted to the University of Colorado Denver in partial fulfillment of the requirements for the degree of Master of Science Computer Science 2010

PAGE 66

This thesis for the Master of Science degree by Ibrahim Alfuaires has been approved by Bogdan Chlebus

PAGE 67

Alfuaires, Ibrahim {M.S., Computer Science) Software Transactional Memory (STM) Design Options and Overhead Thesis directed by Associate Professor Bogdan S. Chlebus ABSTRACT The advent of the multicore architecture, in which the technology has helped to integrate more and more cores in the single chip, has raised the expectation of getting better computation speed up and performance. In this thesis, I tried first to discuss briefly the problems associated with using the legacy ways, locks, to write parallel programs such as: priority inversion, convoying and deadlock. Transactional memory (TM) was proposed as a lock free synchronization technique which provides safe and easy concurrency programming environment. Transaction is a group of reads and writes instructions to a shared memory that appears as one atomic operation to the outside observer and to other transactions and either execute all or nothing. I am presenting the TM solution in general which is also can be classified into three types: hardware transactional memory (HTM), software transactional memory (STM) and hybrid transactional memory (HybridTM). As it can be inferred from its name, STM implements the whole TM system in software with the aid of the basic instructions provided by an off shelf processor. STM has been more preferable to implement TM as it does not require the extra cost of hardware. In my thesis, I am presenting most of the design options that have been used by many researchers to build different STM implementations we have had so far. These options are important to study and analyze the overhead that might be encountered with different STM implementations. At the end, I tried to list some of the published identified sources of overhead in STM systems. This abstract accurately represents the content of the candidate's thesis. I recommend its publication.

PAGE 68

TABLE OF CONTENTS Figures ....................................................................................................................... vii Chapter 1. Introduction ........................................................................................................ 1 1.1 Research Focus and Methodology ............................................................. 4 1.2 Lock Based Synchronization ............................................................................. 5 1.3 Problems with Locks ......................................................................................... 8 1.4 Lock free Synchronization ................................................................................ 11 2 Transactional Memory ..................................................................................... 12 2.1 Transaction Basic Operations ......................................................................... 16 2.2 Semantics ........................................................................................................... 17 2.2.1 Database Correctness Criteria (ACI Model) .............................................. 17 2.2.2 Single Lock-atomicity .................................................................................... 19 2.2.3 Retry and OrEise ............................................................................................ 20 3. Design Options in Software Transactional Memory ..................................... 22 3.11solation .............................................................................................................. 22 3.1.1 Weak Isolation ................................................................................................ 22 3.1.2 Strong Isolation .............................................................................................. 26 3.2 Nesting Transactions ........................................................................................ 32 iv

PAGE 69

3.2.1 Flattened Transaction .................................................................................. 32 3.2.2 Closed Transaction ...................................................................................... 33 3.2.3 Open Transaction ......................................................................................... 35 3.3 Transaction Granularity .................................................................................. 35 3.3.1 Block or Word Granularity .......................................................................... 36 3.3.2 Object Granularity ........................................................................................ 36 3.4 Version Management ...................................................................................... 36 3.4.1 Eager Versioning ............................................................................................ 37 3.4.2 Lazy Versioning .............................................................................................. 39 3.5 Concurrency Control ........................................................................................ 40 3.5.1 Pessimistic Concurrency Control... ............................................................. 41 3.5.2 Optimistic Concurrency Control... ............................................................... 41 3.6 Blocking versus Non-blocking .................................................................... 42 3.6.1 Blocking .................................................................................................. 42 3.6.2 Non-blocking .......................................................................................... 42 3.7 Object Acquisition ..................................................................................... 43 3.7.1 Lazy Acquire ........................................................................................... 
44 3.7.2 Eager Acquire ......................................................................................... 44 3.8 Contention Management ............................................................................... 44 3.9 Other Options .................................................................................................... 45 3.9.1 Exceptions ....................................................................................................... 45 3.9.2 Library versus Programming Language ...................................................... 46 v

PAGE 70

3.9.3 Uniprocessor versus Multiprocessor .......................................................... 46 4. STM Overhead .................................................................................................... 47 5. Findings ....................................................................................................... 50 6. Conclusion ........................................................................................................... 54 Bibliography ............................................................................................................ 55 vi

PAGE 71

LIST OF FIGURES Figure 1.1 Moore's Law Trend from 1971-2008 ............................................................ 1 1.2 THE SHARED MEMORY MULTIPROCESSORS ............................................... 5 1.3 LOCK-BASED SYNCHRONIZATION MODEL ................................................... 7 3.1 Ananinan's and Rinard's STM Data Structures ......................................... 29 3.2 Eager Versioning ...................................................................................... 38 3.3 Lazy Versioning ........................................................................................ 40 vii

PAGE 72

1. Introduction With the advent of the multi-core processors, in which the technology has helped to integrate more than one cores into the chip, it has been expected that this would increase the computation speed by at least double when we double the number of cores. According to Moore's Law as we can see in the figure below, the number of cores is expected to double almost every 18 months, which means that the speed of the computation should be doubled accordingly. ... B Cll v; c:: I'll F CPU Transistor Counts 1971-2008 & Moore's Law 2.000.000.00() 1,000,000,000 100,000,00() 10,000,000 1,000.000 100,000 10,000 1971 CU.-.e s.IIOo.'l 'Moore .......-: tronsistot COU1'I Lblf"'g tro yOOJrt; .DPE' .. ,, .. ''!!UI!a , 198() 1990 , .k!o .. ,.,. .. Date of Introduction 2000 2008 FIGURE 1.1: Moore's Law Trend from 1971-2008 1

PAGE 73

So, as we increase the number of cores or processors by n, we should expect n-fold increase in the power of computation power. However, this increase is only in the ideal cases and never happens in the practical ones. The primary reason for this degradation in the real world cases computational power is the cost of inter-processor communication and coordination or concurrency control. In most cases, when the applications is written for multi-processing, the net result performance of the system is just 5-10% of the theoretical peak performance. One of the most difficult things about parallel and distributed computing is the concurrency control or synchronization in shared memory model. The need for parallel and distributed computing has been increasing due to the need for massive computation in various fields: industrial, medical or even science research fields. Almost all the computersince programmers are familiar with the traditional or sequential programming in which, single processor machine is used. However, with parallel programming, where more than one processor are used and expected to increase the performance of computation and speed up the calculations, the programmer tasks become much harder. Concurrency is the problem of synchronizing concurrent access of multiple processes or threads to shared memory. It is difficult because it is required to make sure that the right operation is conducted in the right time and as if it is the only process accessing that piece of data in shared memory. As I mentioned above, the primary reason for this drop in the performance from the expected one is concurrency. Concurrency is the responsibility of the 2

PAGE 74

programmers and not of the systems. This justifies why parallel programming is difficult compared to the sequential one. In the computing industry, there are a lot of programmers but a few of them who are familiar with the parallel programming. In addition, a few of those who are supposed to be parallel programmers are good in this field. Parallel programmers can do mistakes easily and those mistakes will make the performance even worse. As we add more cores or processors, the problem gets even worse for developers. To exploit parallelism, developers will need to master parallel programming and concurrency or synchronization as part of that. Many synchronization techniques have been used to implement parallelism on multi-core systems. Locks have been used by parallel programmers to implement synchronization. However, addition to the difficulty of using locks, several pitfalls have been experienced and identified due to using them such as priority inversion, deadlock and convoying. To avoid problems associated with implementing locking based synchronization techniques, several lock free techniques have been investigated. A shared data structure is lock free if its operations do not require mutual exclusion. Transactional memory is one of those techniques which provides new concurrency techniques that avoid most of the pitfalls mentioned above as well as it is much easier concurrent programming technique. In addition, it is much efficient more the other proposed techniques. The idea of transactional memory has been taken from database transaction where a transaction can be defined as a sequence of actions that appears indivisible and instantaneous to an outside observer. In addition, the same attributes that the database transaction has i.e. failure atomicity, consistency, isolation and durability, are applied to the transactional memory except the last one durability since the data in memory is usually transient unlike the database one. 3

1.1 Research Focus and Methodology

Software Transactional Memory (STM) has been blamed for high overhead compared to the other transactional memory (TM) implementations: hardware transactional memory (HTM) and hybrid transactional memory (HybridTM). The overhead of STM is due to the need to build the TM semantics and all the transaction primitives in software. Many STM implementations have been proposed with little emphasis on investigating this overhead. The problem with STM is that some instructions that usually require no more than one or two clock cycles to execute may need tens or even hundreds of clock cycles when executed under STM. As an example, for a program that might need 100 clock cycles to complete, an additional 500 clock cycles may be introduced as STM overhead. That means more than 80% of the program's time is taken by the STM, and the program runs six times longer than before. We can conclude that the STM overhead is not trivial and can be the dominant cost of the code. I am going to study the STM semantics and some of the approaches used to build STM implementations. After understanding these semantics and design approaches, I am going to identify how they may help in reducing or increasing the overhead of STM systems. In addition, STMs have been built using different programming languages, and I want to investigate whether the choice of language affects overhead. This would help us choose the right design options to build a low-overhead STM implementation. I begin by giving background information about the legacy way of attaining concurrency, lock-based synchronization. Then, I discuss the major

pitfalls of locks and of using them for synchronization. In Chapter 2, I introduce what transactional memory is. Chapter 3 covers different design options of STM. In Chapters 4 and 5, I discuss the overhead of STM and list some findings that might help to reduce it. The last chapter presents the conclusion.

1.2 Lock Based Synchronization

In the basic shared memory multiprocessing model, we have many processors that can work concurrently, and they might need to access the same piece of data in the shared memory to complete their computations. Let us assume that we have the following multiprocessor environment:

[Figure omitted: n processors connected to a shared main memory holding a variable X]

FIGURE 1.2: SHARED MEMORY MULTIPROCESSORS

As we can see in the figure above, we have n processors that all share the same memory. Let us assume that the first three processors, P1, P2, and P3, are about to execute the same instruction:

X := X + 1

The initial value of X is 1, and if all of the processes tried to execute at the same time, the value of X would be unpredictable; it might be 2, 3, or 4. This is because all the processes might read the same value of X at the same time, or two of them might simultaneously read the value of X produced by the execution of the third one. This problem is called a data race, and to avoid it, the shared resource, i.e. X in this case, must be protected from concurrent read and write accesses. The solution here is to use locks to provide mutual exclusion. Locks are programming constructs used so that only one processor has control over a shared resource; until the resource is unlocked, the controlling process is the only one that can read and write it. The region of code that accesses the shared resource under a lock is called a critical section. At any time, at most one process, the one holding the lock of that section, can be inside the critical section. The following diagram shows how the operation proceeds when locking is used to perform synchronization:

lock(X)
Read X into R
Add 1 to R
Write R back to X
unlock(X)

FIGURE 1.3: LOCK-BASED SYNCHRONIZATION MODEL

Based on granularity, locks can be classified into fine-grained and coarse-grained locks. The main differences between the two types are the size of the critical sections guarded by a lock and the number of locks used throughout the system. In systems that use coarse-grained locking, one lock is used to lock many shared resources, as in the first Linux kernel that supported multiprocessing, version 2.0. In that case, the whole kernel was considered one critical section locked by one big lock. On the other hand, with fine-grained locking, as implemented in newer kernel releases, many locks are used in the system to provide concurrency. Instead of using only one lock for the whole kernel, the big kernel critical section is divided into smaller critical sections, each with its own lock. As an example, in the Linux 2.2 kernel release, one spin lock is used to control access to the block I/O subsystem, another for networking, and so on.
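The locking sequence of Figure 1.3 can be sketched as runnable Java. This is a minimal illustration; the class and method names are mine, not from the thesis:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the Figure 1.3 protocol: every thread must hold the lock
// guarding x before performing the read-modify-write sequence.
class LockedCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private int x = 0;

    void increment() {
        lock.lock();          // lock(X)
        try {
            int r = x;        // read X into R
            r = r + 1;        // add 1 to R
            x = r;            // write R back to X
        } finally {
            lock.unlock();    // unlock(X)
        }
    }

    int get() { return x; }

    public static void main(String[] args) throws InterruptedException {
        LockedCounter c = new LockedCounter();
        Thread[] ts = new Thread[3];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> { for (int j = 0; j < 10000; j++) c.increment(); });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(c.get()); // prints 30000: no updates are lost
    }
}
```

Because the whole read-modify-write is inside the critical section, the data race described above cannot occur, at the price of serializing all three threads.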

Locks have been used widely in many areas: databases, operating systems, and parallel programming.

1.3 Problems with Locks

Many problems associated with using locks for concurrency control have been published in the literature. One of these issues is the difficulty of programming: most computer science programmers are used to sequential coding, not parallel coding. Further issues were described by Herlihy and Moss, as we see below [13]. Some operating systems employ a priority-based preemptive scheduler, in which tasks or processes are assigned different priorities. Let us assume that we have three processes H, M, and L with priorities high, medium, and low. The scheduler always makes sure that the ready process with the highest priority is the one executing. In addition, if a lower-priority process is executing and, during its execution, a higher-priority process becomes ready, the scheduler will preempt the lower-priority one to let the higher-priority process execute. Let us consider a scenario in which process L acquires the mutex, or lock, to work on a specific shared resource. In the middle of L's execution, process M becomes ready, so the scheduler preempts L, which still holds the lock, and lets M execute. Suppose that during the execution of M, process H wants to access the same shared resource whose lock is held by the preempted process L. In this case, process H cannot acquire the lock, and L cannot preempt M because of the precedence in priorities. Finally, although process H has higher priority than process M, H cannot execute, and the situation is as if the priorities were inverted so that M had the higher

priority, which it does not. The problem I have just described is called priority inversion. Another issue associated with using locks for concurrency control is lock convoy, sometimes called convoying. Unlike the priority inversion case, the processes here all have equal priorities. The problem of convoying happens when we have several processes, some of which want to access a specific shared resource R. Let us assume that process P0 is holding the lock for R and the processor is executing P0. If the scheduler preempts P0 for any reason, such as a page fault, P0 exhausting its quantum, or any kind of interrupt, P0 will be descheduled while it still holds the lock for R. Now the scheduler schedules P1, which is also interested in accessing R; although P1 has the right to execute, the lock for R is still with P0, and hence P1 will in turn be descheduled. This results in a large amount of context switching and wasted scheduling quanta, which leads to severe performance degradation of the system. The next problem that might be experienced with locks is called deadlock. Deadlock can happen as a result of some processes locking the same set of objects in different orders. To make this clearer, let us assume that we have two processes P and Q working on shared objects identified by CR1, CR2, CR3, and CR4. Process P will work on the shared objects in the order CR1 -> CR2 -> CR3 -> CR4. On the other hand, process Q, based on its computation, will need to access the shared objects in the opposite order, CR4 -> CR3 -> CR2 -> CR1. Let us assume that both processes begin their execution together, each acquiring the lock for the shared objects it needs to access. Process P will acquire the lock for CR1 and then

CR2 without any problem, being the only process interested in those shared objects at that time; similarly, process Q will acquire the locks of the shared objects CR4 and then CR3. Now, process P wants to proceed by acquiring the lock for shared object CR3, which is already locked by process Q. Similarly, process Q wants to access the next shared object in its chain, CR2, which is already locked by process P. The two processes P and Q are each waiting for the other to release the lock on shared objects CR2 and CR3, and this problem is what is called deadlock. Let us go back to locking granularity and whether to use fine-grained or coarse-grained locks. As I mentioned already, the early Linux kernel used coarse-grained locking, in which the whole kernel was considered one big critical section and only one lock was available to access it. Coarse-grained locking is well known for its simplicity of implementation and maintenance. Although this type of locking helped the kernel programmers address the issues faced in shifting to Symmetric Multi-Processing (SMP), it did not scale well. As the number of processors increases, the amount of time spent waiting for the single kernel lock grows bigger and bigger. Hence, with only one lock guarding the kernel, the performance of four independent single-processor machines is much better than that of one four-processor system. To solve the scalability problem of coarse-grained locks, developers started to use finer-grained locking more and more, and a modern kernel can include thousands of locks. With this type of locking, each lock protects a small resource instead of the whole kernel. However, fine-grained locking has its own pitfalls. With thousands of locks, it is not easy to identify which lock protects which resource

and in which order the locks should be acquired in order to avoid problems such as deadlocks and races. In addition, it becomes much harder to identify bugs as the number of locks increases, and the number of bugs is expected to increase accordingly. Finally, the fine-grained locking approach increases the complexity of the system and, over time, can sacrifice its maintainability.

1.4 Lock Free Synchronization

Lock-free, or as it is sometimes called in the literature, non-blocking, synchronization does not require mutual exclusion between processes accessing a shared object concurrently. As we saw in the previous section on problems with locks, several problems can be experienced when mutual exclusion is enforced with locks. If the shared resource does not need mutual exclusion, the priority inversion problem is resolved, since a high-priority process no longer needs to wait for a lower-priority process to release a lock. In addition, consider the problem of convoying, where a process holding the lock for a shared resource is descheduled or interrupted for the reasons I mentioned earlier. With lock-free synchronization, the shared resource does not mandate acquiring a lock before access, so descheduling or interrupting one process does not delay the execution of other processes interested in that shared resource. Even deadlocks are avoided for the same reason, and as a result performance can be improved by making shared resources lock free.

2. Transactional Memory

As can be inferred from its name, transactional memory has its roots in database transactions, and hence we first need to define what a transaction is. A transaction can be defined as a sequence of actions that appears as one atomic action, instantaneous to an outside observer. Transactions have been used widely in databases as a way to ensure correct concurrent execution of programs. However, because of the differences between transactions in databases and in transactional memory, since the latter uses main memory instead of the hard drive to perform transactional accesses, it is called transactional memory. The idea of using transactions in concurrent programming was first proposed by Lomet in 1977 [16]. At that time, it was not introduced as transactional memory but as an idea to use the database abstraction to build a good programming language mechanism that would ensure the consistency of data shared among several processes [8]. This proposal lacked a practical implementation competitive with explicit synchronization, and this is why it was not used until later, when Herlihy and Moss in 1993 [13] proposed the first hardware transactional memory as a mechanism for building lock-free data structures. Their paper is considered the first official paper to define transactional memory. After that, many researchers became heavily involved in this area, and many hardware transactional memories were proposed. However, the proposed hardware designs had some limitations, like the high cost and the need for an extra instruction set to be integrated with the existing processors'

instruction set. That means that, for some, hardware transactional memory might not be a feasible solution. Given the above limitations of hardware transactional memory, researchers tried to find a more feasible solution that would still provide lock-free synchronization. In 1995, Shavit and Touitou proposed the first software transactional memory, which still uses the transactional approach but can be built on existing machines using only Load-Linked/Store-Conditional operations. Later, many software solutions were proposed by different researchers, in different languages and using many design approaches.

Types of TM: TM can be built in hardware, in software, or using both software and hardware (hybrid). Hardware transactional memory (HTM) is implemented as processor instructions, which means it needs modifications to the processor, cache, and bus protocols to support transactions. In software transactional memory (STM), the semantics and protocols of TM are built completely in software, either in a runtime library or as an extension to the programming language itself. With STM, there is no need to upgrade the hardware; the minimal requirement is hardware support for a basic operation such as Compare-and-Swap (CAS) or Load-Linked/Store-Conditional. Hybrid transactional memory is a mix of hardware and software transactional memory, so that transactions are supported in software in the absence of hardware support and provide better performance when the hardware exists. However, there is no formal specification for how

hybrid TM should be implemented, as many different approaches have been used to build this type.

Why STM: Software transactional memory (STM) is believed by many researchers to be a feasible solution for the problems of concurrent programming. The first STM implementation was published by Shavit and Touitou [19]. STM has many advantages over hardware transactional memory (HTM), such as the cost of implementation and maintenance. Since it is implemented in software, no additional hardware cost is required, as in the case of HTM. In addition, any change can be introduced to an STM implementation easily, which makes it more flexible than HTM.

Previous Work: Let me mention some previous STM research. The first real STM implementation was proposed by Shavit and Touitou and is well known as the static STM [19]. This implementation came after the first TM system, which was implemented in hardware and required extra operations not available in off-the-shelf processors at that time. They proposed the software solution to make it possible to implement TM on existing processors, in software, using the Load-Linked/Store-Conditional operation. The implementation was called static because a transaction needs to know the set of locations it is going to access in advance, which was considered a limitation of this implementation. The advantage they were targeting was to provide a software solution that would not require extra hardware, since it could be applied to existing machines.
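The static restriction can be illustrated with a toy, single-threaded sketch in which a transaction must declare its locations at begin time. All names here are my own; the real Shavit-Touitou algorithm is lock-free and uses ownership records, which this sketch deliberately omits:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of a *static* transaction: the data set is fixed when the
// transaction begins, which is the limitation DSTM later removed.
// Writes are buffered and reach memory only on commit.
class StaticTxn {
    private final int[] memory;
    private final Map<Integer, Integer> writeSet = new HashMap<>();
    private Set<Integer> declared = new HashSet<>();

    StaticTxn(int[] memory) { this.memory = memory; }

    void begin(int... locations) {           // declare every location up front
        declared = new HashSet<>();
        for (int l : locations) declared.add(l);
        writeSet.clear();
    }

    int read(int loc) {
        if (!declared.contains(loc)) throw new IllegalStateException("undeclared location");
        return writeSet.getOrDefault(loc, memory[loc]);
    }

    void write(int loc, int v) {
        if (!declared.contains(loc)) throw new IllegalStateException("undeclared location");
        writeSet.put(loc, v);
    }

    void commit() {                          // tentative writes become visible
        for (Map.Entry<Integer, Integer> e : writeSet.entrySet())
            memory[e.getKey()] = e.getValue();
        writeSet.clear();
    }
}
```

The point of the sketch is the `begin(int...)` signature: any access outside the declared set fails, which is exactly why this design could not support dynamically created data structures.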

However, the performance of this implementation was not even close to that of the HTM implementation proposed by Herlihy and Moss. Later, Herlihy et al. proposed the dynamic software transactional memory (DSTM) implementation [12]. In DSTM, they addressed the limitation of the static STM by supporting dynamically created objects, which do not require prior knowledge of the set of locations that will be accessed by a transaction. This implementation was obstruction-free, which is the weakest form of non-blocking synchronization. In addition, they introduced the idea of contention management, which was able to avoid the livelock problems that may be encountered with obstruction-free implementations. Fraser later proposed a lock-free object-based software transactional memory, named after him (FSTM) [8]. In this lock-free implementation, each lower-priority transaction helps a higher-priority conflicting transaction to complete, so that progress is guaranteed. The helped transaction might in turn help another conflicting transaction to complete, and so on. Marathe et al. proposed an obstruction-free adaptive software transactional memory (ASTM) [18]. They compared the two leading object-based software transactional memories, DSTM and OSTM, and found that DSTM outperforms OSTM by an order of magnitude on write-dominated workloads, while OSTM outperforms DSTM by a factor of two on read-dominated workloads. Herlihy et al. revised the first proposed DSTM, which resulted in DSTM2 [11]. As I mentioned, DSTM2 is a revised version of DSTM in which they

tried to incorporate lessons learned from its predecessor. They implemented it as a transactional library, which was considered the first implementation with such an approach. The main feature of the design of DSTM2 was giving users the ability to include their own synchronization and recovery mechanisms, in the form of transactional factories that transform stylized sequential interfaces into transactionally synchronized data structures, in addition to the freedom to customize their contention management techniques. Recent implementations such as Ennals' STM and TL2 are lock-based STM implementations [7][6].

2.1 Transaction Basic Operations

Each transactional memory has some basic operations, and most implementations have the same operations with slight variations from one implementation to another. These basic operations are used to manipulate transactions. The following are the general operations used among most implementations.

Start or Begin: starts a new transaction and allocates a fresh descriptor in which the status is initialized to ACTIVE.

Read/Write: used within the scope of the transaction to access transactional memory.

Commit: makes all the transaction's tentative changes permanent and ends the transaction. It may fail, in which case it is equivalent to an abort.

Abort: instead of making the transaction's tentative changes permanent, discards all these changes and also ends the transaction.

Validate: tests the transaction status; it returns true if the transaction has committed or can commit, and false if the transaction has aborted.

2.2 Semantics

Although much literature has been published about TM, there is still no formal TM semantics. Intuitively, and as can be inferred from its name, the TM idea originated from database transactions, which explains why most papers use the correctness criteria of database transactions (serializability) or of concurrent data structures (linearizability). However, both criteria specify only some aspects of TM semantics, not all.

2.2.1 Database Correctness Criteria (ACI Model)

The semantic model used by databases has been adopted from the beginning as a model for TM, as the transaction idea was taken from databases. However, the database model is not enough to provide a formal semantics for TM, due to the differing natures of database transactions and those of TM. Let us define the correctness model used by databases and then check its applicability to TM. Databases use the ACID properties (failure atomicity, consistency, isolation, and durability) as the model for their transaction semantics.

Failure atomicity, or atomicity, requires all or nothing. "All" means that a transaction executes, finishes, and then commits, and hence its changes become visible. However, if it fails in the middle of its execution, nothing is written and everything is as if it had not executed at all. Consistency means that any successful or failed execution of a transaction should leave the database or memory in a consistent state. A new transaction always assumes that the world is in a consistent state. Isolation means that the execution of a transaction must appear as if it were the only transaction executing at that time, even though other transactions execute concurrently. In databases, the serializability correctness condition is used to achieve isolation, in which each transaction's execution is isolated from the others' as if they were executed serially. Durability requires that once a result is ready and committed, it is final and should always remain available to all other transactions. However, this property is not required for parallel programming, as the medium to which a transaction's result is committed is memory, whose data is transient, unlike in the database case where the medium is the hard disk. The ACI model explained above does not provide a formal semantics for TM. ACI does not provide any mechanism that controls the interaction between atomic blocks and the code outside of transactions, what can be called the interaction between transactional and non-transactional accesses to shared resources. Unless a semantics specifies this interaction, it is not fully applicable to the TM case, and this is the issue with the ACI model.
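The all-or-nothing behavior of failure atomicity, together with the basic operations of Section 2.1, can be modeled with a toy write buffer. This is an illustrative sketch of the semantics only (the names are mine and it is single-threaded), not how a real STM is implemented:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of failure atomicity: writes are buffered in the transaction's
// write set and reach shared memory only at commit; an abort throws the
// buffer away, leaving memory exactly as if the transaction never ran.
class AtomicityDemo {
    private final Map<String, Integer> memory = new HashMap<>();
    private final Map<String, Integer> writeSet = new HashMap<>();

    AtomicityDemo() { memory.put("x", 0); memory.put("y", 0); }

    int read(String loc) {                 // read sees tentative value first
        return writeSet.getOrDefault(loc, memory.get(loc));
    }
    void write(String loc, int v) {        // buffered, not yet visible
        writeSet.put(loc, v);
    }
    void commit() {                        // "all": every change becomes visible
        memory.putAll(writeSet);
        writeSet.clear();
    }
    void abort() {                         // "nothing": every change is discarded
        writeSet.clear();
    }
    int memValue(String loc) { return memory.get(loc); }
}
```

After an abort, both x and y keep their old values; after a commit, both tentative writes appear together, which is the indivisibility a transaction promises.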

2.2.2 Single Lock Atomicity

Single lock atomicity is not a technique for implementing TM; rather, it specifies the interaction of atomic blocks in TM. Under this model, an atomic block executes as if all atomic blocks were protected by a single lock, so that at most one block can execute at a time. The model specifies the isolation expected from transactions but does not specify the failure atomicity of a transaction. It is one way to achieve the serializability I discussed earlier. This lock can also be used to describe what happens when a non-transactional access is encountered. A non-transactional access is any code outside a transaction that tries to access shared data. This kind of access, i.e. code outside the transaction touching shared data, may lead to a data race [9]. Before I define what a data race is, let me define what is known as a conflict. A conflict happens when more than one concurrent thread accesses the same shared data and at least one of these threads modifies the data. A data race happens if the conflicting accesses are not synchronized properly. When a data race happens, the behavior of the code cannot be predicted and will depend on the memory consistency model used by the system. Data races can be caused by poor programming of critical sections. To avoid data races, locking or other synchronization techniques should be used to control concurrent access by transactional and non-transactional code, or the program control flow should be carefully arranged.
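The single-lock model can be read as an executable specification: every atomic block runs while holding one process-wide lock. This is a model of the semantics, not a realistic STM implementation, and the names are my own:

```java
// Single-lock atomicity as an executable model: at most one atomic block
// runs at a time, so blocks appear serialized to each other.
class SingleLockAtomicity {
    private static final Object GLOBAL_LOCK = new Object();

    static void atomic(Runnable block) {
        synchronized (GLOBAL_LOCK) {   // every block contends on ONE lock
            block.run();
        }
    }

    static int counter = 0;            // shared data, updated only atomically

    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 10000; j++)
                    atomic(() -> counter++);   // serialized read-modify-write
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(counter); // prints 40000
    }
}
```

Note what the model leaves open, exactly as the text says: a write to `counter` performed outside any `atomic(...)` call bypasses the lock and produces a data race.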

2.2.3 Retry and OrElse

So far, we have discussed several aspects of the TM semantics proposed by various researchers. However, we have said nothing about the order in which two transactions executing concurrently should change the program state. Such coordination between transactions was first introduced by Harris et al. [10], using the retry statement. When retry is used in a transaction, if anything causes the transaction to abort, it will re-execute later. What is important about retry is when the transaction should re-execute after it aborts. Harris et al., in the same paper, suggested that the transaction should re-execute once any change has been made to the set of values the transaction read before it aborted. Another operation that can be used to coordinate the execution of two transactions is orElse [10]. The operation of orElse can be explained as follows [14]. If we have P orElse Q, where P and Q are two different transactions, then P will be executed first:

1. If P commits or explicitly aborts, either by an explicit abort statement or an uncaught exception, then Q will not be executed, since the orElse statement terminates.

2. If P executes a retry statement, then Q will be executed as part of the orElse statement's execution.

3. If Q commits or explicitly aborts, then the orElse statement terminates.

4. If Q executes a retry, then the execution of the orElse statement is suspended, pending a change to a location read by either P or Q, after which it re-executes.

Finally, the orElse statement must itself be implemented atomically, since it is a composition of two atomic operations.
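The four rules above can be sketched as control flow, with retry modeled as an exception. This is an illustrative sketch (names are mine); a real STM would also block until a location read by P or Q changes before re-executing, which is omitted here:

```java
// Sketch of orElse composition: run P; if P retries, run Q; if Q also
// retries, the whole composition retries (the exception propagates).
class OrElseDemo {
    static class RetryException extends RuntimeException {}

    static void retry() { throw new RetryException(); }

    static void orElse(Runnable p, Runnable q) {
        try {
            p.run();            // rule 1: P commits (or aborts) -> done, Q skipped
        } catch (RetryException e) {
            q.run();            // rules 2-3: P retried, so Q runs
                                // rule 4: if Q also retries, the exception
                                // propagates and the caller re-executes later
        }
    }
}
```

Any exception other than RetryException thrown by P escapes the catch, matching rule 1's treatment of an uncaught exception as termination of the orElse statement.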

3. Design Options in Software Transactional Memory

Different options have been used to build different STM implementations. Some of these options might help simplify the implementation, and some might make it more powerful, in a way that makes it a better candidate to replace lock-based parallel programming. Some of the options bear directly on the overhead that may be associated with STM, and some indirectly. In this chapter, I try to list most of the options and tradeoffs that have been used so far in designing STM implementations.

3.1 Isolation

I have already defined transaction isolation as part of the transaction correctness criteria. The isolation property requires that transactions executing concurrently do not affect each other's results; each transaction executes as if it were the only transaction executing at that time. Blundell et al. [3] introduced the terms weak and strong atomicity. However, the term they used, atomicity, can be replaced by isolation, as can be inferred from that paper [14].

3.1.1 Weak Isolation

Before I start discussing the weak isolation property, I need to give a formal definition of it:

Weak isolation is the property of a TM system that guarantees isolation only between transactions. Non-transactional code can directly access the same shared data in main memory that is being accessed by transactional code. If a non-transactional access executes, that is, a conflicting reference outside a transaction happens, the result that is returned might be inconsistent or may lead to an incorrect execution inside the transaction. With a weak isolation implementation, data races can result and the system can be left in an undefined state. The following problems, or more precisely isolation violations, can happen when non-transactional accesses are allowed to reach main memory, bypassing the STM system's protocols [1]:

Non-repeatable reads: this problem happens when a transaction contains two read instructions and a write instruction belonging to non-transactional code executes concurrently. The two reads should return the same value, but since the non-transactional write alters the execution of the transaction, the values returned by the two reads differ, as we can see in the following example:

Thread 1: atomic { r1 = x; r2 = x; }

Thread 2: x = 1;

Let us assume that the initial value of x is 0. Thread 1 executes the first read, and r1 is assigned 0. Then, before thread 1 performs the second read, thread 2 writes a new value, 1, to x. After that, thread 1 performs its second read, and now r2 = 1. So r1 != r2, even though both reads belong to the same transaction, which means isolation is violated.

Intermediate lost update: this happens when two writes, one performed by a transaction and one by non-transactional code, go to the same shared data concurrently. As an example, let us consider the following, with the initial value of x equal to 0:

Thread 1: atomic { r = x; x = r + 1; }

Thread 2: x = 10;

If the two threads' executions are serialized, then the value of x should be either 10, if thread 1 executes first and then thread 2, or 11, if thread 2 executes first and then thread 1. Unfortunately, the value of x is going to be 1, since thread 2 writes after thread 1 performs its read and before thread 1 writes x. This means the write by thread 2 is lost, as if it never happened.

Intermediate dirty reads: this happens when a non-transactional access can see an intermediate state of transactional code. The following example explains this problem:

Initial value of x is even.

Thread 1: atomic { x++; x++; }

Thread 2: r = x;

The transactional code in thread 1 always keeps the value of x even. However, the non-transactional read access will see an odd value if it executes between thread 1's two increments. The weak isolation model does not mandate that all accesses to shared data be transactional; non-transactional accesses are allowed, but they must not conflict with transactional ones. This must be guaranteed by the programmer, by making the non-transactional accesses take place at a different time or to different memory locations. The approach used in most STM implementations to achieve weak isolation is to divide the whole memory manually into two parts: one accessed only by transactions and the other accessed only by non-transactional code. However, dividing the memory this way is an error-prone process as software evolves. Furthermore, such manual division of memory is sometimes not even required in lock-based synchronization. Therefore, using weak isolation in STM systems may conflict with the goal of achieving simple and reliable concurrent programming.
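The intermediate-lost-update interleaving described above is nondeterministic with real threads, but it can be replayed deterministically in straight-line code (the class name is my own illustration):

```java
// Deterministic replay of the "intermediate lost update" interleaving:
// the non-transactional write x = 10 lands between the transaction's
// read of x and its write of r + 1, and is silently overwritten.
class LostUpdateDemo {
    static int run() {
        int x = 0;
        int r = x;      // transactional read sees 0
        x = 10;         // non-transactional write, invisible to the txn
        x = r + 1;      // transactional write: x becomes 1, the 10 is lost
        return x;       // returns 1, not 10 or 11
    }
}
```

Under strong isolation (or a correct serialization) only 10 or 11 would be possible final values; the replayed schedule produces 1, exactly the violation weak isolation permits.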

3.1.2 Strong Isolation

The second approach is called strong isolation, in which the TM semantics must be guaranteed between transactional and non-transactional code. In this model, the database model is enforced, in which all accesses to shared data are transactional, or made through transactions. Under strong isolation, any operation outside a transaction block that accesses shared data is effectively turned into a transaction. Strong isolation is well known for its cost, in terms of the overhead it may impose on a TM implementation that adopts it. This explains why most high-performance implementations were designed to provide weak isolation. Researchers have proposed several ways to achieve strong isolation, such as restricting transactional code to data that belongs to transactions, which is the approach used in Haskell STM [3]. This works in Haskell STM because all mutable data is inside transactions; the remaining data is read-only, which means it can be accessed by either transactional or non-transactional accesses. On the other hand, with non-functional languages, where most data is mutable, the program must be divided into transactional and non-transactional data. This classification is hard to implement, as all the data would need to be transformed from the non-transactional part of the program into transactional data and then back to its original state after processing completes. Another approach, proposed by Lev and Maessen, is to track shared objects at runtime [15]. Any object that is visible to more than one thread should be accessed only through transactions during concurrent access. Although this can be considered a good approach, it might introduce more overhead to the


STM. However, the authors rely heavily on the compiler for optimizations and hence lower overhead. In the following, I present Ananian's and Rinard's STM implementation, which supports strong isolation.

Example: Ananian and Rinard STM

The STM implementation proposed by Ananian and Rinard in 2005 is one of the few STM implementations built around the strong isolation design option [2]. Most STM implementations proposed so far employ weak isolation as their transaction isolation property. Ananian's and Rinard's STM provides strong isolation, since any memory access outside a transaction can abort the running transaction. By contrast, in most STM implementations, which implement weak isolation, a non-transactional read or write to transactional data can happen without aborting that transaction; instead, a mechanism ensures that non-transactional accesses do not conflict with transactional ones. The reason for choosing weak isolation is to avoid the overhead of adding instrumentation to every non-transactional read and write. Ananian's and Rinard's STM implements a novel way of providing strong isolation that lowers this instrumentation overhead: it employs sentinel values placed in memory locations accessed by transactions. I provide a detailed account of their strong isolation implementation in the following paragraphs.


In their implementation, a special value called FLAG is placed in locations of objects that are involved in transactions. Trying to read or overwrite a flagged value tells the non-transactional code that exceptional processing is required.

Object Structures

The basic data structures of Ananian's and Rinard's STM implementation are shown in the figure below.


FIGURE 3.1: Ananian's and Rinard's STM Data Structures

Two additional fields are added to each object: a versions field and a readers field. The versions field points to a singly-linked list of object versions. Each object version has an owner field, which points to a transaction object whose single status field indicates committed, aborted or waiting. For each transaction, there is a specific object version. The second field is the readers field, which also


points to a singly-linked list of all the transactions that have read from this object, other than committed or aborted ones. The FLAG value can be any value, provided it is not a common one. Each object field holds the semantic value of the original object structure as long as it is not the FLAG value. In the latter case, the true value of the field is the value written by the first committed transaction in the object's version list.

Operations

Ananian and Rinard supported the following operations in their STM implementation: transactional read/write, non-transactional read/write, begin, abort and commit. In the context of strong isolation, the only two operations that concern us are the non-transactional read and write.

Read

The non-transactional read in Ananian's and Rinard's STM is performed by the readNT method, which reads field f from object o and stores the result in v. When the value read is the FLAG value, the thread has to copy back the field value from the most recently committed transaction. In addition, the thread has to abort all the uncommitted transactions and then try again. To tell the thread to abort only the transactional writers and not the transactional readers, the kill_writers constant is passed to the copy-back method, as seen in the code below:

inline readNT(o, f, v) {
  do
  :: v = object[o].field[f];
     if
     :: (v != FLAG) -> break /* done! */
     :: else


     fi;
     copyBackField(o, f, kill_writers, _st);
     if
     :: (_st == false_flag) -> v = FLAG; break
     :: else
     fi
  od
}

Write

The non-transactional write access in Ananian's and Rinard's STM is performed by invoking the writeNT method. It writes the new value nval to the field f of object o. For this non-transactional write to be performed, the reader list must be empty. To check this condition, a Load-Linked/Store-Conditional operation is used, so that the write is stored only if the reader list is empty. If it is not empty, the copy-back procedure is called with the kill_all constant, which ensures that all the transactional readers and writers are aborted and hence the reader list becomes empty. The writeNT procedure is shown below:

inline writeNT(o, f, nval) {
  if
  :: (nval != FLAG) ->
     do
     :: atomic { /* this is a LL(readerList)/SC(field) */
          if
          :: (object[o].readerList == NIL) ->
             object[o].fieldLock[f] = thread_id;
             object[o].field[f] = nval;
             break /* success! */
          :: else
          fi
        } /* unsuccessful SC */


        copyBackField(o, f, kill_all, _st)
     od
  :: else -> /* create false flag */
     /* Implement this as a short *transactional* write:  */
     /* start a new transaction, write FLAG, commit the   */
     /* transaction; repeat until successful.             */
     /* Implementation elided. */
  fi;
}

3.2 Nesting Transactions

Sometimes programmers need to execute a transaction inside another transaction, so that the execution of the inner one lies within the dynamic scope of the outer transaction. Such a transaction is called a nested transaction. Suppose we have two transactions, one nested inside the other. The nested one is called the inner transaction and the one that contains it is called the outer transaction. If we assume a single thread of control for both transactions, then when the outer transaction reaches the nested block, control passes to the inner transaction, which starts executing and can see any modification made so far by the outer (parent) transaction. Larus and Rajwar listed various ways to link such transactions, as we will see in the following sections [14].

3.2.1 Flattened Transactions

In this type of nested transaction, if the inner transaction aborts, the outer transaction aborts as well. When the inner transaction commits, however, the outer transaction is the only one that can see the changes produced by the inner one. In addition, the outer transaction will not


commit merely because the inner one does, and only once the outer transaction commits do the modifications of the inner transaction become visible to all the other transactions. The following example illustrates the behavior of flattened transactions:

int x = 1;
atomic {
  x = 2;
  atomic flatten {
    x = 3;
    abort;
  }
}

In the example above, if the outer transaction aborts, whether because the inner one aborted or because it decided to abort for any other reason (even if the inner one committed), then x remains 1. Although flattened transactions are easy to implement, they are a poor abstraction: they do not provide compositional atomic transactions, since aborting the inner transaction causes the outer one to abort.

3.2.2 Closed Transactions

As an alternative to flattened transactions, closed transactions have the advantage that the inner transaction's effect on the outer one is smaller. In a closed transaction, aborting the inner transaction does not terminate the execution of the outer transaction. When an inner transaction aborts, its log is discarded, a notification is sent to the outer one, and the parent transaction


then chooses whether it wants to abort as well [14]. Similarly, in the committing case, when the inner transaction commits, control passes to the outer one. The outer transactions can then see any change introduced by the inner one; however, other transactions (not the parents) cannot see the modifications of the inner transaction until all the surrounding parent transactions commit. The code below illustrates the operation of closed nested transactions:

int x = 1;
atomic {
  x = 2;
  atomic closed {
    x = 3;
    abort;
  }
}

When the inner transaction aborts, its operations are undone and its log is discarded. The value of x will be 2 if the outer transaction commits; it will be 1 if both the inner and the outer transaction abort. Furthermore, if the inner transaction commits but the outer transactions abort, the value of x will also be 1, and the change introduced by the inner transaction will not be visible to other transactions, since its parents did not commit. What is interesting about the closed model is that no conflict occurs between the inner transaction and its ancestors; a conflict may exist, however, between the inner transaction and a transaction belonging to a different family (none of its parents).
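The two nesting disciplines described so far can be summarized in a tiny sketch. This is an assumed model following the text, not any real STM: a single variable x, with the nested writes and abort decisions from the examples above.

```python
def run_nested(mode, inner_aborts, outer_aborts):
    """Final value of x for: x = 1; atomic { x = 2;
    atomic { x = 3; maybe abort } maybe abort }."""
    committed = 1          # value visible in memory before the transactions
    outer_log = 2          # outer writes x = 2
    inner_log = 3          # inner writes x = 3
    inner_ok = not inner_aborts
    if mode == "flattened" and not inner_ok:
        return committed   # inner abort drags the outer down: x stays 1
    # closed nesting: an inner abort only discards the inner log
    outer_value = inner_log if inner_ok else outer_log
    if outer_aborts:
        return committed   # outer abort discards everything: x stays 1
    return outer_value

print(run_nested("flattened", inner_aborts=True,  outer_aborts=False))  # 1
print(run_nested("closed",    inner_aborts=True,  outer_aborts=False))  # 2
print(run_nested("closed",    inner_aborts=False, outer_aborts=True))   # 1
```

The second call reproduces the claim in the text: under closed nesting, x ends up 2 when the inner transaction aborts but the outer one commits.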


3.2.3 Open Transactions

In the open nested transaction model, when the inner transaction commits, its modifications become visible to all transactions, parents or not, whatever the outcome of the outer transactions. The following example is similar to those in the previous sections, except for the open keyword that selects the open nested transaction model:

int x = 1;
atomic {
  x = 2;
  atomic open {
    x = 3;
  }
  abort;
}

Here, if the inner transaction commits, its change is visible even if the outer transaction aborts; in that case the value of x is 3.

3.3 Transaction Granularity

One of the key parameters in designing a TM is transaction granularity: the unit of storage over which a TM system detects conflicts [14]. Most STM implementations use one of two types of transaction granularity: word (or block) and object [20].


3.3.1 Block or Word Granularity

In this type of transaction granularity, at most one transaction has control over a shared word at a time. This ensures that any modification to the shared word is atomic. A dedicated record is attached to each shared word to store its exclusive ownership, and the system automatically transfers the ownership of these records between transactions. This scheme was introduced by Shavit and Touitou in their STM paper, which is also considered the first formal STM implementation [19].

3.3.2 Object Granularity

Object granularity is usually implemented in systems that use object-oriented languages. This type of granularity makes the STM detect a conflicting access to an object even if the transactions referenced different fields. Most STM implementations use object granularity. With object granularity, it is not necessary to change the original object structure when translating a non-transactional program into a transactional one. Moreover, any object is free to be used either inside a transaction or outside it without any change. Many STM implementations use object granularity; one of them is DSTM [12].

3.4 Version Management

During the execution of a transaction, an STM system has to maintain different versions of the global data: the new or


updated version and the old or original version. The new or updated version is the version of the data produced by one of the pending transactions. This version is supposed to become visible to all other transactions once the executing transaction succeeds and decides to commit its computation. The old or original version is maintained by the TM system during the execution period; it is the version that existed in memory before the executing transaction started. This version is needed in case the executing transaction decides to abort for any reason and hence must roll back to the original value. In general, two approaches have been proposed and implemented by researchers to maintain the new and old versions of global data: eager versioning and lazy versioning. These two approaches are not always called eager and lazy versioning; some researchers call them direct and deferred update [14]. Eager versioning is the direct update and lazy versioning is the deferred update; only the names differ. In my thesis, I use the names eager and lazy versioning, as they are more widely used in the literature than direct and deferred update.

3.4.1 Eager Versioning

A TM system implemented with eager versioning writes the new data or version directly to memory. This means that the modification introduced by the executing transaction is applied to the object itself. However, the new version is not visible to other transactions, as this type of versioning requires the TM system to use some


form of concurrency control, such as locks, throughout the transaction's execution period. The TM system thus makes sure that the uncommitted version of the data residing in memory is not visible to any other transaction until it is committed. The old version is stored in an undo buffer or log so that, in case the transaction aborts, the TM system can restore the original version to the memory location. This type of versioning supports fast commits: if the transaction decides to commit, it only needs to make the object visible to other transactions and discard the old version stored in the undo buffer. However, additional delay may be experienced if the transaction aborts, since the original version has to be written back to the original memory address. In addition, concurrency is limited during the execution of the transaction until the final decision is reached.

FIGURE 3.2: Eager Versioning
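The eager scheme can be sketched in a few lines. This is a minimal single-threaded illustration, not a real STM; the names (EagerTx, the memory dict) are assumptions of the sketch. Writes go straight to memory, old values are saved in an undo log, commit drops the log, and abort rolls it back.

```python
class EagerTx:
    def __init__(self, memory):
        self.memory = memory
        self.undo_log = []            # (address, old value) pairs

    def write(self, addr, value):
        self.undo_log.append((addr, self.memory[addr]))  # save old version
        self.memory[addr] = value                        # update in place

    def commit(self):
        self.undo_log.clear()         # fast commit: just drop old versions

    def abort(self):
        for addr, old in reversed(self.undo_log):        # slow abort: roll back
            self.memory[addr] = old
        self.undo_log.clear()

mem = {"x": 1}
tx = EagerTx(mem)
tx.write("x", 2)
print(mem["x"])   # 2 : the uncommitted new version is already in memory
tx.abort()
print(mem["x"])   # 1 : abort restored the old version from the undo log
```

Note how the asymmetry of the text shows up directly: commit is a constant-time clear, while abort has to walk the undo log and write memory.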


3.4.2 Lazy Versioning

The second type of data versioning, lazy versioning, has been used in most STM implementations. In this type of versioning, any data modification is stored as a new version in a write buffer (a redo log) that is private to the transaction. The old or original version stays as is in memory, so this type enhances concurrency instead of restricting it as eager versioning does. Once the transaction finishes its execution and decides to commit, the new version is moved from the write buffer to memory. If the decision is to abort, the transaction only needs to discard whatever is stored in its write buffer. Lazy versioning supports fast aborts since, as mentioned earlier, an aborting transaction just discards the new version isolated in its per-transaction buffer. On the other hand, lazy versioning may experience some delay when the transaction commits, as the TM system has to copy the new updated version over the old version stored in memory. This delay comes from the need to search the write buffer for the latest version.
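Lazy versioning is the mirror image of the eager sketch above. Again a minimal single-threaded illustration with assumed names, not a real STM: writes are buffered privately, reads must consult the buffer first so a transaction sees its own writes, commit copies the buffer out, and abort just discards it.

```python
class LazyTx:
    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}        # private buffer of new versions (redo log)

    def write(self, addr, value):
        self.write_buffer[addr] = value   # memory is left untouched

    def read(self, addr):
        # the transaction must see its own buffered writes first
        return self.write_buffer.get(addr, self.memory[addr])

    def commit(self):
        self.memory.update(self.write_buffer)  # slow commit: copy versions out
        self.write_buffer.clear()

    def abort(self):
        self.write_buffer.clear()     # fast abort: nothing to roll back

mem = {"x": 1}
tx = LazyTx(mem)
tx.write("x", 2)
print(mem["x"], tx.read("x"))   # 1 2 : memory still holds the old version
tx.commit()
print(mem["x"])                  # 2 : the new version reached memory at commit
```

Here the asymmetry is reversed: abort is a constant-time clear, while commit pays the cost of copying buffered versions into memory.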


FIGURE 3.3: Lazy Versioning

3.5 Concurrency Control

Concurrent transactions executed by a TM system need to be synchronized to avoid conflicting concurrent accesses to an object. I have already defined a conflicting access: two or more transactions concurrently access the same object and at least one of them modifies it. After a conflict occurs, the TM system determines that it happened; this is called conflict detection. When a conflict has been detected, the TM system


or the code in a transaction applies a mechanism or contention policy to ensure correctness, either by aborting or by delaying one of the conflicting transactions. These three operations, conflict occurrence, detection and resolution, must follow this order, although they can happen at different times. STM implementations have used two different approaches to concurrency control: pessimistic and optimistic.

3.5.1 Pessimistic Concurrency Control

In this type of concurrency control, the three operations, conflict occurrence, detection and resolution, all happen at the same point of execution. A conflict is detected and resolved as the transaction is about to access a shared object. With pessimistic concurrency control, a transaction claims exclusive ownership of an object, so that other transactions are not allowed to access it. However, deadlock can arise when each of two transactions holds exclusive access to an object that the other one needs. Such deadlocks can be avoided by enforcing a predetermined order for acquiring exclusive access to objects. If a deadlock does occur between two transactions, one of them has to be aborted.

3.5.2 Optimistic Concurrency Control

The difference between optimistic concurrency control and the pessimistic kind is that conflict detection and resolution do not happen at the same point of execution as the conflict occurrence. In this


case, more than one transaction can access the same object concurrently. However, any conflict must be detected and resolved before a transaction commits. A conflicting transaction can be delayed, restarted or even aborted while the other continues its execution. The advantages of this type of concurrency control are that it avoids locks and their problems and that it provides more concurrency, especially when conflicts rarely happen.

3.6 Blocking versus Non-blocking

Considering forward progress guarantees, that is, whether a transaction attempting to access shared data will be able to finish its execution, synchronization can be divided into two types: blocking and non-blocking.

3.6.1 Blocking

This is the legacy way of providing concurrency control, using locks, monitors or semaphores. At any time, only one transaction is allowed to access a shared datum, exclusively. This type may prevent forward progress by introducing the problems with locks I discussed in chapter 1: priority inversion, convoying and deadlock. Here, even though locks are the tools providing the synchronization, they are used only inside the STM code and not by the programmer.

3.6.2 Non-blocking

Early STMs were implemented with non-blocking synchronization, which can provide better forward progress guarantees than the previous type, the


blocking one. This type avoids all the problems associated with blocking synchronization techniques. Based on the strength of its forward progress guarantees, non-blocking synchronization can be classified into three types:

Wait-freedom: the strongest assurance, guaranteeing that every thread competing to access shared data makes forward progress in a finite number of its own time steps. This type causes neither deadlock nor starvation.

Lock-freedom: a weaker assurance, guaranteeing that at least one of the threads competing to access shared data makes forward progress in a finite number of steps of any of the concurrent threads. Deadlock is avoided with this type, but not starvation.

Obstruction-freedom: the weakest assurance, guaranteeing that a thread makes progress in a finite number of its own time steps in the absence of contention. Although deadlocks are avoided with this type, livelocks are not.

3.7 Object Acquisition

The acquisition of an object happens when a transaction declares its ownership of the object, in non-blocking STM systems, or acquires the lock protecting the object, in blocking STM systems. Based on the time of acquisition, there are two types: lazy acquire and eager acquire.
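The lock-free retry pattern underlying these guarantees can be sketched as follows. Python has no hardware compare-and-swap, so CAS is simulated here with an internal lock; the AtomicCell name and the whole setup are illustrative, not any real STM's API. The point is the caller-side shape: no lock is held across the retry loop, and every CAS round lets at least one competing thread succeed.

```python
import threading

class AtomicCell:
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()   # stands in for hardware atomicity

    def load(self):
        return self._value

    def cas(self, expected, new):
        """Atomically set the value to new iff it still equals expected."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def lock_free_increment(cell, times):
    for _ in range(times):
        while True:                      # retry until our CAS wins
            old = cell.load()
            if cell.cas(old, old + 1):   # some thread always succeeds
                break

cell = AtomicCell(0)
threads = [threading.Thread(target=lock_free_increment, args=(cell, 1000))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(cell.load())   # 4000
```

This is lock-freedom, not wait-freedom: the system as a whole always progresses, but an unlucky thread could in principle lose every CAS race and starve.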


3.7.1 Lazy Acquire

In this type of object acquisition, the object is acquired only at commit time, which means conflict detection is deferred until commit time. With this type, a doomed transaction is allowed to continue, but the system can also overlook a potential conflict that never materializes. One advantage of this type is that short-running read-only transactions can commit in parallel with long-running transactions that also commit. Many STM implementations use this type of object acquisition and conflict detection, for example OSTM and Haskell STM.

3.7.2 Eager Acquire

With eager acquisition, a transaction has to acquire an object at open time, when it first tries to access it. This allows transactions to detect a conflict early and hence avoid useless work if the transaction is going to abort eventually. Sometimes, however, a transaction eagerly aborts a conflicting transaction and then itself fails to commit, wasting the work invested by both the aborted transaction and itself. Many STM implementations use this type of object acquisition and conflict detection, for example DSTM, WSTM and McRT-STM.

3.8 Contention Management

Whenever a conflict is detected over an object between two transactions, one of the transactions must be aborted to resolve the conflict. The part of the


TM system responsible for this decision, aborting one of the transactions, is called the contention manager. The contention manager includes one or more contention policies, and based on these policies the conflicting transaction to abort is chosen. Some contention manager implementations may decide not to abort the conflicting transaction; instead, the contention policy can ask for that transaction to be delayed. Different contention managers may have different costs, depending on their contention policies.

3.9 Other Options

There are other options that can differentiate one STM implementation from another:

3.9.1 Exceptions

Some TMs include exception handling mechanisms, where the try and catch keywords surround the atomic block. Exception handling can be implemented to either terminate or abort the transaction. When the TM terminates the transaction, it exits the transaction, but before control leaves it, whatever changes have been introduced are committed by the system. The other kind of exception handling aborts the transaction whenever an exception occurs, without committing any changes.
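One common contention policy is an "oldest wins" timestamp rule; the sketch below shows it as one assumed policy among many, with illustrative names (Tx, resolve_conflict) rather than any real contention manager's interface. On conflict, the younger transaction is aborted so the older one, which has presumably invested more work, keeps making progress.

```python
class Tx:
    def __init__(self, start_time):
        self.start_time = start_time   # logical timestamp at begin
        self.aborted = False

def resolve_conflict(attacker, victim):
    """'Oldest wins' policy: abort the younger of the two conflicting
    transactions and return the survivor."""
    younger = attacker if attacker.start_time > victim.start_time else victim
    younger.aborted = True
    return attacker if younger is victim else victim

t1, t2 = Tx(start_time=10), Tx(start_time=20)
winner = resolve_conflict(t2, t1)     # t2 (younger) attacks t1 (older)
print(winner is t1, t2.aborted)       # True True
```

Because timestamps impose a total order and the survivor is always the older transaction, this policy also rules out livelock between a pair of transactions repeatedly aborting each other.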


In the end, it depends on the language used to implement the TM and on whether the compiler can provide a solution flexible enough to include both exception handling methods mentioned above.

3.9.2 Library versus Programming Language

Another implementation option is whether to provide transactions as part of a library or to integrate them into the syntax of the programming language. TM implementations exist for both options. Although the latter option, in which transactions are integrated as primitives into the programming language semantics, is preferable, the integration should be done properly, as Carlstrom et al. argue [4]. Clear semantics help the compiler implement better optimizations.

3.9.3 Uniprocessor versus Multiprocessor

Some TM implementations are designed to run on a uniprocessor system, whereas others take advantage of the new multicore trend.


4. STM Overhead

Many researchers believe that TM in general is a promising solution for the problems of programming multicore processors, and of parallel programming in general. TM was first introduced as a solution implemented in hardware, HTM, but that kind of implementation is considered expensive, which made it less attractive to most researchers. STM was then proposed as a solution that still avoids most of the pitfalls of lock-based synchronization while requiring no extra hardware and hence no extra cost. This is the main reason most researchers shifted to STM. Nowadays, many STM implementations have been proposed, in different languages and with different flavors. Although STM is still largely a research effort, many researchers have started to suspect that it will not go far and may not be the optimal replacement for the currently used synchronization techniques based on mutual exclusion. One of the most important reasons for that belief is the overhead encountered when using STM, which has led some researchers to call it a research toy [5]. Having discussed most of the STM design options and tradeoffs used in various STM implementations, I now present some of the sources of overhead related to these options, directly or indirectly. In their effort to design a low-overhead STM implementation, named RSTM, Marathe et al. listed several possible sources of overhead [17]. These sources are:


Managed code: they used C++ instead of the Java of their prior implementation, ASTM, expecting to avoid the costs of runtime semantic checks, lock-based thread-safe libraries and always-virtual method dispatch. However, Sun's HotSpot Java implementation typically runs programs slightly faster than equivalent C++ versions compiled with GCC.

Bookkeeping: in object-based STM implementations, acquiring n objects and committing requires a minimum of n+1 CAS operations. In addition, n CAS operations may be required for post-commit cleanup of object headers, and the private read and write lists add further overhead. Bookkeeping thus introduces extra overhead even in the single-thread or low-contention cases.

Memory management: memory management is required both for data objects and for dynamically allocated metadata (transaction descriptors, DSTM locators, OSTM object handles). In garbage-collected languages there is the additional cost of tracing and reclamation.

Contention management: conflict resolution can cause significant overhead, as can the sorting required to avoid deadlocks. The work lost to aborting or delaying a conflicting transaction can also be considered a high overhead on top of the other sources already listed. The contention manager's overhead ultimately depends on the options used in building the STM. Choosing the synchronization style is critical, because an obstruction-free STM has a better chance of minimizing the useless work that may be required by lock-free or wait-free STM implementations.

Copying: in STM, every writer creates a copy of every object it is about to write. Small objects may not cause significant overheads


compared to the overheads of bookkeeping. However, if the object is large and only a small change is required, the overhead resulting from the unneeded copying can be significant.
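The copying cost is easy to see in a sketch. The clone-on-write helper below is illustrative (not RSTM's mechanism): the writer clones the whole object even though only one field changes, so a large object pays a cost proportional to its size, not to the size of the change.

```python
import copy

def transactional_write(obj, field, value):
    clone = copy.deepcopy(obj)   # the whole object is copied...
    clone[field] = value         # ...even though only one field changes
    return clone                 # new version; the original is the old version

large = {f"f{i}": 0 for i in range(100_000)}   # a large object, 100,000 fields
new_version = transactional_write(large, "f0", 42)
print(new_version["f0"], large["f0"], len(new_version))  # 42 0 100000
```

A one-field update still touched all 100,000 fields of the clone, which is exactly the overhead the text describes for large objects.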


5. Findings

In this section, I list my recommendations regarding the STM design options that may increase or decrease the overhead. I have already said a few words about these design options in chapter 2.

Assertion 1

Regarding the isolation property, I described two design approaches used by various STM implementations: weak isolation and strong isolation. Strong isolation is considered better than weak isolation, since it guarantees isolation both between transactions themselves and between transactions and non-transactional accesses. Weak isolation STM implementations need to divide the shared memory manually into two parts: a transactional part that can be accessed only by transactions, and a non-transactional part that can be accessed by code outside transactions. Dividing the memory this way is not ideal, since it is error prone, especially as software evolves. Moreover, having to divide the memory undermines the case for STM, and TM in general, as a feasible concurrent programming technique, because lock-based synchronization with critical sections does not require this hassle. In addition, skipping that division under weak isolation leads to several issues: non-repeatable reads, intermediate lost updates and intermediate dirty reads [1].


The solution I strongly recommend is to deny any direct non-transactional access to the shared memory by implementing STM with strong isolation. Besides avoiding the issues related to weak isolation, the overhead of implementing STM with strong isolation can be reduced. One good approach to implementing strong isolation is to use a functional language such as Haskell: all the mutable data in Haskell STM are in transactions, so no extra overhead is required to achieve strong isolation. However, imperative and object-oriented languages are more widely used, especially in industrial and commercial applications, and programmers are more familiar with imperative languages than with functional ones. Therefore, I recommend implementing STM with strong isolation in these types of languages. As for the higher overhead this approach adds to the overall STM overhead, it can be addressed with efficient read and write barriers on memory accessed outside transactions. Optimization can reduce the overhead of these barriers and sometimes even remove it. The cost of the suggested read and write barriers can be reduced by the following enhancements:

1. Tracking thread-local objects at runtime to eliminate unneeded synchronization.
2. Eliminating barriers wherever they are not required.
3. Implementing some form of barrier aggregation to amortize the barriers' cost.


Assertion 2

Support for nested transactions is included in most of the STM implementations I have seen, except a few rather old ones. I found that most of the STMs that support nested transactions use the flattened type, while a few use the open one. I think the reason most STM implementers chose the flattened type is that it is easy to implement. However, as I mentioned, the flattened type does not provide compositional transactions. The rest, which use the closed type, may experience more overhead: when the inner transaction aborts, the outer transaction continues its computation, and if the outer transaction later aborts as well, it would have been more efficient to abort it directly. I think the trend in supporting nested transactions in STM implementations will be toward the open type. In my opinion, it is the best of the three. When the inner transaction commits, letting all the other transactions, parent or not, see its result saves a lot of STM overhead; hiding the result after the inner transaction commits adds extra overhead. In addition, if the outer transaction decides to abort and execute again, it does not need to re-execute the inner one, because it has already committed. The open type also increases concurrency, since the resources reserved by the inner transaction are released earlier.


Assertion 3

STM implementations detect conflicts at one of two granularities: object granularity or word (or block) granularity. Although object granularity is preferable, especially for STM implementations built in object-oriented languages, it can be a major source of overhead, and the size of the objects affects the overhead of any STM system. One disadvantage of object granularity is that when a transaction opens a large object to perform a write, it must create a new copy of that object to hold its modification. The transaction creates the new copy even if the modification is minor, so copying such objects can cause significant overhead. Another issue is the false conflict that can happen when two transactions reference two different fields of the same object, as in the case of a multidimensional array. Such problems can be avoided by using word granularity, but word granularity requires more space and time to track and compare read sets and write sets, and this cost is incurred regardless of object size. My opinion is that, despite all the disadvantages of object granularity, it is still better than word granularity in terms of overhead. Unlike word granularity, object granularity helps programmers reason about their programs, so they can introduce further enhancements and optimizations to conflict detection. In addition, the overhead of object granularity may be noticeable only in the case of large objects.
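The false-conflict problem can be sketched as follows (a toy example, with write sets represented simply as Python sets of tracked locations): two transactions that update different fields of the same object collide under object granularity, because both write sets contain the whole object, but not under word (field) granularity, where the write sets are disjoint.

```python
def conflicts(write_set_a, write_set_b):
    """Two transactions conflict if their write sets overlap."""
    return not write_set_a.isdisjoint(write_set_b)

# Transaction A writes point.x, transaction B writes point.y.
object_grain_a = {"point"}           # object granularity: whole object tracked
object_grain_b = {"point"}
word_grain_a = {("point", "x")}      # word granularity: individual fields tracked
word_grain_b = {("point", "y")}

print(conflicts(object_grain_a, object_grain_b))  # -> True  (false conflict)
print(conflicts(word_grain_a, word_grain_b))      # -> False (no conflict)
```

The trade-off discussed above is visible here: the per-field entries of word granularity avoid the false conflict but multiply the number of locations that must be tracked and compared on every commit.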


6. Conclusion

In my thesis, I started by defining the problems posed by parallel programming and the lock-based synchronization model. Then I mentioned some of the published issues that result from using legacy locking techniques in parallel programming. I also introduced TM with a focus on STM, and I gave some background about the implementations, operations, and semantics of STM. In my thesis, I concentrated on the main STM design tradeoffs. Some of these options have a direct relation to the increased overhead of STM. Unfortunately, the overhead of STM implementations is still not clearly characterized. One reason is that STM, or TM in general, is still at the research stage. Moreover, most parallel programmers are not familiar with TM, and those who are still do not see it as a mature environment to build their code on. TM, and STM specifically, is still an active area of research, and more researchers are now interested in making it a feasible and successful replacement for the current parallel programming paradigms. More work needs to be conducted on evaluating the overhead of STM and, if possible, on reducing this disadvantage. One of the problems in assessing STM overhead is the differences among implementations; it is still an open issue to develop methods to evaluate the overhead of all STM implementations.


BIBLIOGRAPHY

[1] Adl-Tabatabai, A., Shpeisman, T., Menon, V., Balensiefer, S., Grossman, D., Hudson, R., Moore, K., and Saha, B., "Enforcing Isolation and Ordering in STM," in Proceedings of the FCRC'07 Conference on Programming Language Design and Implementation (PLDI). 2007: California, USA.

[2] Ananian, S. and Rinard, M., "Efficient Object-Based Software Transactions," in Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications (SCOOL). 2005: San Diego, CA.

[3] Blundell, C., Lewis, E.C., and Martin, M.M.K., "Deconstructing Transactions: The Subtleties of Atomicity," in Fourth Annual Workshop on Duplicating, Deconstructing, and Debunking. 2005.

[4] Carlstrom, B.D., et al., "The Atomos Transactional Programming Language," in Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2006). 2006, ACM Press: Ottawa, Ontario, Canada. pp. 1-13.

[5] Cascaval, C., Blundell, C., Michael, M., Cain, H., Wu, P., Chiras, S., and Chatterjee, S., "Software Transactional Memory: Why Is It Only a Research Toy?," CACM, 51(11):40-46, November 2008.

[6] Dice, D., Shalev, O., and Shavit, N., "Transactional Locking II," in Proceedings of the 20th International Symposium on Distributed Computing (DISC). 2006, pp. 194-208.

[7] Ennals, R., "Software transactional memory should not be obstruction free," Technical Report No. IRC-TR-06-052, Intel Research Cambridge, 2006.


[8] Fraser, K., "Practical Lock-Freedom," PhD Thesis, University of Cambridge Computer Laboratory, 2004.

[9] Gharachorloo, K., "Memory Consistency Models for Shared-Memory Multiprocessors." 1995, Computer Systems Laboratory, Stanford University.

[10] Harris, T., et al., "Composable Memory Transactions," in Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2005: Chicago, IL.

[11] Herlihy, M., Luchangco, V., and Moir, M., "A flexible framework for implementing software transactional memory," in Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 253-262, 2006.

[12] Herlihy, M., Luchangco, V., Moir, M., and Scherer III, W., "Software transactional memory for dynamic-sized data structures," in Proceedings of the Twenty-Second Annual Symposium on Principles of Distributed Computing, pp. 92-101, 2003.

[13] Herlihy, M. and Moss, J.E.B., "Transactional memory: architectural support for lock-free data structures," ISCA 1993.

[14] Larus, J. and Rajwar, R., Transactional Memory. Morgan & Claypool Publishers, 2006.

[15] Lev, Y. and Maessen, J.-W., "Toward a Safer Interaction with Transactional Memory by Tracking Object Visibility," in Proceedings of the OOPSLA 2005 Workshop on Synchronization and Concurrency in Object-Oriented Languages (SCOOL). 2005.

[16] Lomet, D., "Process structuring, synchronization, and recovery using atomic actions," in Proc. ACM Conf. on Language Design for Reliable Software, Raleigh, NC, 1977, pp. 128-137.

[17] Marathe, V., et al., "Lowering the Overhead of Nonblocking Software Transactional Memory," in Proceedings of the SIGPLAN 2006 Workshop on


Languages, Compilers and Hardware Support for Transactional Computing (TRANSACT). 2006: Ottawa, Canada.

[18] Marathe, V., Scherer III, W., and Scott, M., "Adaptive Software Transactional Memory," in Proceedings of the Nineteenth International Symposium on Distributed Computing, pp. 354-368, 2005.

[19] Shavit, N. and Touitou, D., "Software transactional memory," in Proc. 14th ACM Symp. on Principles of Distributed Computing, Ottawa, Canada, 1995, pp. 204-213.

[20] Wang, X., Ji, Z., Fu, C., and Hu, M., "A review of transactional memory in multicore processors," Inform. Technol. J., 9: 192-200, 2010.