An agent based distributed parallel processing architecture

Material Information

An agent based distributed parallel processing architecture
Byassee, Jason Scot
Place of Publication:
Denver, Colo.
University of Colorado Denver
Publication Date:
Physical Description:
76 leaves : ; 28 cm

Thesis/Dissertation Information

Master's ( Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Computer Science and Engineering, CU Denver
Degree Disciplines:
Computer Science
Committee Chair:
Wolfe, William J.
Committee Members:
Kumar, Dianne
Price, Rod
Smith, Chris


Subjects / Keywords:
Parallel processing (Electronic computers) ( lcsh )
Electronic data processing -- Distributed processing ( lcsh )
Electronic data processing -- Distributed processing ( fast )
Parallel processing (Electronic computers) ( fast )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


Includes bibliographical references (leaves 74-76).
Statement of Responsibility:
by Jason Scot Byassee.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
45141305 ( OCLC )
LD1190.E52 2000m .B93 ( lcc )


This item has the following downloads:

Full Text
Jason Scot Byassee
BSEE, University of Colorado at Denver, 1994
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science

This thesis for the Master of Science
degree by
Jason Scot Byassee
has been approved
(Chris Smith

ByasseeJason S. (M.S. Computer Science)
An Agent based Distributed Parallel Processing Architecture
Thesis directed by Associate Professor William J. Wolfe
Parallel computing is the predominate solution for software applications that demand both superior
performance and substantial processing power. Distributed parallel systems^ or computing clusters,
add the potential for increased reliability at a reduced cost. Cluster computing is based on the
collaboration of independent nodes, which is conducive to an adaptive, fault tolerant execution
environment. To this end, the majority of implementations involve custom solutions where the
dynamics of the environment are integrated at the application level in order to exploit the distributed
system. This paper introduces Cotton, a versatile architecture that is designed to host parallel
applications in a distributed environment. Cotton is a novel approach to cluster computing that
incorporates an agent-based parallel programming model and corresponding software support
framework. This programming model provides a natural task parallelism abstraction, where agent
heuristics are defined by a strategy that combines a lightweight discipline with a functional
programming paradigm. The architecture design highlights features that include adaptive parallelism,
application transparency and fault tolerance.
This abstract accurately represents the content of the candidate's thesis. I recommend its publication.

1 Introduction........................
1.1 Approach............................
1.2 Organization........................
2 Computing Clusters..................
2.1 Cluster Environments................
2.1.1 Volunteer Clusters.................
2.1.2 Dedicated Clusters................
2.2 Projects..........................
2.3 Summary.............................
3 Distributed Parallel Computing......
3.1 Computing Architectures.............
3.2 Parallel Software Requirements......
3.2.1 Concurrency........................
3.2.2 Scalability.......................
3.2.3 Locality..........................
3.2.4 Modularity........................
3.3 Parallel Programming Models.........
3.3.1 Data Parallelism...................
3.3.2 Task Parallelism..................
3.3.3 Shared Memory.....................
3.4 Multitasking Strategies.............
3.4.1 Bulk Synchronous Parallel..........
3.4.2 Communicating Sequential Processes
3.4.3 Active Threads....................
3.5 Summary.............................
4 Adaptive and Fault Tolerant Systems.
4.1 Overview............................
4.1.1 Performance Optimization..........
4.1.2 Dynamic Demand....................
4.1.3 Resource Availability.............
4.2 Adaptive Parallelism................
4.3 Adaptive Model......................
4.4 Fault Tolerance.....................
4.4.1 Fault Types.......................
4.4.2 Operational Fault Mechanisms......
4.4.3 Implementation Policies...........
4.5 Adaptive System Design Tradeoffs....
4.6 Cotton Implementation...............
4.6.1 Load Balancing....................

I* 1 1 n n n 1 1 1i n n 1 1 1 1 1 1 n oc T/-

4.6.2 Resource Supervision................................................................27
4.6.3 Fault Tolerance.....................................................................28
4.7 Summary...............................................................................28
5 Agents..................................................................................29
5.1 Agents and Task Parallelism...........................................................29
5.2 Agent Design Patterns.................................................................29
5.3 Master-Slave Parallelism..............................................................30
5.3.1 Master-Slave Clustering Architecture.................................................30
5.3.2 Two-Tier Management Architecture....................................................31
5.4 Cotton Agent Model....................................................................31
5.4.1 Agent Communication..................................................................32
5.4.2 Natural Systems.....................................................................33
5.5 Agent Distribution....................................................................33
5.6 Summary...............................................................................34
6 Software Architecture...................................................................35
6.1 Introduction..........................................................................35
6.2 Implementation........................................................................35
6.2.1 Programming Language.................................................................35
6.2.2 Components..........................................................................37
6.3 Adaptive Support Services.............................................................48
6.3.1 Discovery............................................................................48
6.3.2 Load Balancing......................................................................49
6.4 Parallel Computation Process..........................................................51
6.4.1 Agent Initialization.................................................................51
6.4.2 Quicksort Implementation............................................................53
6.5 Fault Detection and Recovery..........................................................54
6.5.1 Distributed Agent Monitor............................................................55
6.5.2 Fault Scenario......................................................................57
6.6 Summary...............................................................................59
7 Web Experiment..........................................................................60
7.1 Background............................................................................60
7.2 Results...............................................................................61
7.3 Conclusion............................................................................62
7.4 Summary...............................................................................63
8 Conclusion..............................................................................64
8.1 Review................................................................................64
8.2 Future Work...........................................................................65
8.2.1 Data Input Streams...................................................................65
8.2.2 Multithreading Options..............................................................67
8.2.3 Multi-Level Recovery................................................................67
8.2.4 High Performance Architecture.......................................................68
8.3 Summary...............................................................................69
A. Application Programmer Interface......................................................70

2- 1 Computing Cluster.......................................................................4
3- 1 Data Parallelism Model.................................................................10
3-2 Task Parallelism Model...................................................................11
3-3 Shared Memory Model......................................................................12
3- 4 Bulk Synchronous Parallel Model........................................................14
4- 1 Hardware Failure (Node Crash)..........................................................20
4-2 Network Failure (Connection Interrupt)...................................................20
4-3 Software Failure (Premature Termination).................................................20
4- 4 Parallel Program Checkpointing................................................... 22
5- 1 Master-Slave Pattern............................................................... 30
5- 2 Agent Tree...................................................................... ..32
6- 1 Cotton Software Architecture Packages..................................................38
6-2 Agent Package Class Diagram..............................................................40
6-3 Distributed Agent Server.................................................................41
6-4 Mobile Agent Host........................................................................42
6-5 Jini Service Operation...................................................................43
6-6 Resource Registry........................................................................44
6-7 Network Messenger........................................................................45
6-8 Statistics Collection....................................................................46
6-9 Thread Monitor...........................................................................47
6-10 Thread Pool.............................................................................48
6-11 Resource Discovery.................................................................. 49
6-12 Multicast Load Balancing.......................................................... 49
6-13 Mobile Agent Load Balancing......................................................... 50
6-14 Task Agent Components...................................................................51
6-15 Child Agent Creation....................................................................52
6-16 Parent-Child Agent Connection...........................................................53
6-17 Quicksort Agent Tree....................................................................54
6-18 Fault Recovery..........................................................................55
6-19 Distributed Agent Monitor...............................................................56
6-20 Fault Scenario..........................................................................57
6-21 Fault Recovery Initialization...........................................................58
6- 22 Agent Regeneration....................................................................59
-1 Agent Applet.............................................................................60
7- 2 Experiment Results................................................................... 61
8- 1 Multiple Data Input Stream Agent Tree.......................................... 66
8-2 Descendant Retrieval.....................................................................68

The demand for processing power continues to increase at an astonishing pace. Power hungry
applications cover a wide and diverse spectrum from digital signal processing to data mining to
virtual reality. Sequential computing has traditionally served as the processing mechanism for these
applications. The technological advances that coniinue to be made with hardware have allowed the
computing power provided by sequential platforms to just keep pace with demand. The sequential
model has been adequate in many instances, yet the finite limitations provided by this model are
within reach. Future hardware enhancements are constrained by the speed of light and the laws of
thermodynamics [4].
Research and development involving parallel processing as a practical means to replace sequential
computing has existed for a number of years. Many parallel application solutions involve specialized
supercomputing platforms. These massive multiprocessor machines deliver both high performance
and a wealth of processing power. While adequate for even the most demanding applications, the
feasibility of parallel supercomputing solutions is limited by cost, reliability and versatility.
Current research is focused on a solution that represents a viable alternative to parallel
supercomputing. The concept is to create what is called a computing cluster or computing fabric -
essentially weaving together hundreds or even thousands of inexpensive microprocessors into one
massive parallel entity. The power of a computing cluster is theoretically infinite, only limited by
resource availability.
The appeal of cluster computing for processor intensive applications goes beyond the notion of
increased power at a lower cost. The dynamics of a computing cluster computing environment are
conducive to the following:
Parallel computation
Dynamic scalability
Fault tolerance
Platform independence
Linear speedup
There currently exist numerous computing cluster implementations that incorporate various
combinations of these features. Most solutions are application specific, where ^environmental
knowledge'' is integrated into the application itself. The motivation for this development is based on
the applicability of a general-purpose software architecture that hosts parallel applications. This
thesis presents Cotton a robust computing fabric.
1.1 Approach
While conceptually straightforward, computing cluster implementations present numerous
challenges. Distributed environments introduce complexities that require robust software solutions.
The following issues highlight areas that must be addressed in computing cluster software
architecture design:

Task Distribution
Task Coordination
Managing the complexity of multi-tasking in distributed environments is a popular research area.
Solutions span a broad spectrum from sequential process models (Bulk Synchronous Parallel) to
distributed messaging protocols (Java RMI and CORBA). Research at the system level has
concentrated on the contributions of agent-based models. The features introduced by agent
technology are conducive to a dynamic distributed environment. An agent in this context is a
software entity that exhibits some combination of the following behavior [9]:
Temporally Continuous
Agents provide a level of abstraction that separates the application from the underlying
implementation detail. An agent-based distributed system model designed at the agent level provides
a complexity management layer while maximizing the versatility of the overall architecture.
Essentially, a single paradigm addresses a variety of difficult requirements.
The Cotton architecture incorporates an agent model as a natural task parallelism abstraction. This
approach involves an innovative design strategy whereby lightweight agents collaborate to solve
problems. Based on a functional paradigm, these agents are dynamically distributed throughout the
computing cluster to perform parallel processing tasks.
The objective of the Cotton architecture design is to provide parallel applications an adaptive,
heterogeneous, fault tolerant execution environment while transparently managing the underlying
support services. This thesis introduces a novel approach to parallel programming with an agent-
based functional paradigm and corresponding software framework that supports this model in a
computing cluster environment.
1.2 Organization
This paper presents the parallel programming model and software framework that comprises the
Cotton architecture. The early chapters explore three disciplines (Distributed Parallel Computing,
Adaptive and Fault Tolerant Systems and Agents) that provide insight into the design of the Cotton
model for parallel programming and corresponding system architecture. The later chapters provide a
detailed presentation of the software implementation (Software Architecture) and evaluation
mechanism (Cotton Experiment). Each chapter is briefly outlined below.
This paper opens with an introduction to computing clusters in Chapter 2. This chapter covers
dedicated and volunteer cluster systems and current research areas.

Chapter 3 begins with a brief discussion of parallel computing architectures. This introduction is
followed by the software requirements of a parallel program and the three prevalent distributed
parallel programming models. This chapter concludes with a presentation of several popular
approaches for managing multi-tasking complexity.
Chapter 4 provides a detailed analysis of adaptive and fault tolerant systems. A model is introduced
that outlines the adaptation process. Current research is presented that compliments adaptive system
design tradeoffs. This is followed by an overview of operational faults in distributed systems.
Several prominent models for fault tolerance are explored with an emphasis on fault detection and
recovery. The functional programming model, as a robust fault tolerant solution, is covered in
detail. This chapter concludes with a focus on application transparency and the Cotton model for
fault tolerance.
Chapter 5 includes an in-depth examination of agent technology. This chapter begins with the agent-
task parallelism abstraction. An overview of agent design patterns is followed by a detailed
discussion of the Master-Slave pattern. Two implementations of the Master-Slave pattern are
reviewed. The Cotton agent model is introduced with a complex adaptive systems influence. This
chapter concludes with the agent distribution scheme employed in the Cotton architecture.
Chapter 6 outlines the Cotton software architecture design and implementation. This chapter begins
with a brief discussion of the system services. A presentation of the core packages provides an in-
depth look at the various software components. An operational overview details the discovery, load
balancing and computation processes. This is followed by a sample implementation of a parallel
application. This chapter concludes with an examination of the process models for agent
initialization and fault recovery.
Chapter 7 presents a web-based experiment that incorporates the Cotton software architecture.
Chapter 8 concludes with a summary and discussion of future research opportunities.

2 Computing Clusters
The computing technology for high performance, power-demanding applications is in the midst of a
transition from sequential and parallel supercomputers to parallel distributed systems. A computing
cluster is a versatile system that provides parallel applications the opportunity for exponential gains
in both reliability and processing power. A cluster consists of regions of nodes that can instantly be
assembled or dissipated based on current demand. A cluster is analogous to a computing fabric
where distributed processors are woven together to create a large parallel supercomputer. Figure 2-1
depicts a computing cluster that is comprised of a network of personal computers (PCs).
The following observation highlights the reasoning behind the growing popularity of computing
clusters (fabrics) [33]:
ftThe fabric is a dynamic architecture that eliminates rigid network boundaries to yield a
completely fluid system ... Development and deployment of distributed systems has the
potential to become a real-time process that is a simple matter of configuring resources in
the fabric. n
2.1 Cluster Environments
There exist two basic computing cluster environments that can be categorized by the requirements of
the host applications. A volunteer cluster is a system that is designed to exploit idle processing
power on existing machines. In this environment, idle CPU cycles on participating nodes
(volunteers) are borrowed for parallel processing tasks. This is an innovative, cost-effective
solution that can provide applications with substantial processing power that would otherwise go
unused. A dedicated cluster refers to a system where applications require superior performance.
Computing nodes must be available on demand. Real-time signal processing applications fall into
this realm, where high speed, demand driven computations are critical. From a hardware standpoint,
a dedicated cluster is more costly, due to the fact that 1)resources are dedicated as opposed to
borrowedand 2) the most advanced (and most expensive) hardware technology is used.
The following sections cover the volunteer and dedicated computing cluster environments.

2.1.1 Volunteer Clusters
The volunteer computing cluster concept is based on the premise of harnessing unused processing
power. Most computer processors, whether it is a personal computer at home or a desktop machine
at the office, sit idle most of the time. This processing power is essentially an untapped resource that
provides enormous potential for parallel applications.
A volunteer computing cluster implementation must satisfy two essential requirements:
1) A volunteer environment is heterogeneous by nature. The application software should
be executable on different architectures.
2) The computers (or nodes) must be physically connected at some point in time in order
to exchange data.
Traditionallythe term architecture neutral software has been somewhat of an oxymoron. Software
is written and compiled for a specific architecture. In order to port the software to a new
architecture, the code must be recompiled, and in many instances, the code must be modified. This
problem is a primary reason behind the growing popularity of the Java programming language. Java
programs are compiled into an architecture neutral bytecode which are then interpreted with a
platform dependent virtual machine. Volunteer cluster projects now solve the architecture neutrality
problem in one of two ways -1)provide compiled binaries for a variety of architectures, or 2)
develop the software in a language such as Java.
The growth of the Internet has allowed volunteer cluster computing to become a viable parallel
processing option. The Internet provides a physical connection medium that is used by many
volunteer cluster applications to transmit data for parallel processing jobs. The SETI@home project
(Search for Extraterrestrial Intelligence) [26] is a volunteer cluster computing application that
satisfies an enormous demand for processing power by distributing tasks via the Internet. The
project involves analyzing a continuous stream of data collected from a sky survey radio telescope.
The data is divided into small pieces that are distributed to volunteer machines. Participants
download client software that runs as a screen saver, searching the data for intelligent signals.
Internet connections to the project server are made periodically to upload search results and
download new data. This project has proven to be a very successful volunteer computing cluster
The volunteer cluster concept introduces the notion of adaptive parallelism. In this environment,
machines being used for parallel processing can be taken back at any time. The dynamic nature of
the system requires an adaptively parallel architecture, responding to ever-changing environmental
parameters that may include resource load, resource availability and network traffic.
2.1.2 Dedicated Clusters
Dedicated computing clusters are designed for applications that require superior real-time
performance. In this environment, both software and hardware architectures are optimized for speed.
In many instances, optimization involves homogenous platforms where software is pre-distributed,
executing natively. Cluster interconnects, the bottlenecks of most systems, are implemented with
low latency, high bandwidth technologies. These include Gigabit Ethernet, Scalable Coherent
Interface (SCI) and Myrinet [4].

Messaging in dedicated clusters is usually implemented with a low-level protocol that bypasses the
operating system. Traditional protocols such as TCP/IP utilize system calls to transmit data through
the kernel. The overhead incurred in this process can dominate the actual computation time. The
overall result is high network latencies and fractional network bandwidth available to the application
[24]. Current approaches to high performance messaging bypass the kernel. The Virtual Interface
Architecture (VIA) [32] is a low-level interface that provides user applications direct interaction with
the network interface. Work queues that send and receive data are notified via a doorbell of pending
requests. Requests are made in the form of descriptors by writing to a memory-mapped region or
triggering a card interrupt. The VIA specification is largely based on other direct messaging research
projects including Active Messages (UC Berkeley), Fast Messages (UIUC) and U-Net (Cornell
University) [32].
There are several popular message passing libraries that provide enhanced features that build on low-
level network interfaces. The Parallel Virtual Machine (PVM) [10] architecture is a process-based
parallel computational model that defines an explicit message passing interface. The design supports
heterogeneous networks of single and multiprocessor machines. Applications are developed using
PVM by partitioning tasks into sequential programs that incorporate PVM library calls. The
programs are precompiled for the various cluster architectures and then placed at respective
retrievable locations. Computations are initiated al one node and additional processes are
subsequently initiated on remote machines in the cluster.
The Message Passing Interface (MPI) [28] is a standard that defines a message passing library that is
similar to that of PVM. It differs from PVM in that MPI is not an environment -initialization and
process control are left to individual applications [4]. The MPI standard is gaining in popularity in
the high performance computing community.
The Message Passing Interface (MPI) and the Parallel Virtual Machine (PVM) are two such systems
comprised of software libraries and tools. The PVM offers a translucent access to hardware and
heterogeneous machine, network and application support [10]. While providing parallel applications
with superior performance, MPI and PVM do not completely conform to the adaptively parallel
paradigm. MPI programs are written assuming a fixed number of processors [25]. PVM
incorporates a master daemon that maintains configuration [10]. Programming language support is
Research projects involving dedicated clusters have evolved to the point of providing viable
solutions for many performance critical applications. The Beowulf project [3] originated in part to
determine the applicability of massively parallel computers for problem solving in earth and space
science applications. The project has evolved into an open development community involving both
academia and industry. A Beowulf Cluster is a closed system of dedicated to high performance
computing. A Beowulf distribution includes a suite of software tools including PVM, MPI, Bulk
Synchronous Parallel (BSP) in addition to SYS V-style IPC and pthreds [4]. Current research
involves utilizing multiple Ethernet networks in parallel to satisfy increasing bandwidth
requirements. Numerous Beowulf extensions to the Linux kernel are available that provide features
which include specialized device drivers, Ethernet channel bonding and virtual memory pre-paging.
A distributed process space package provides global process identification numbers for process
signaling and system wide control. The architecture design supports operating system performance
tuning at each node based on task granularity.

2.2 Projects
A number of research projects are currently exploring applications that exploit this evolutionary
technology. The National Scalable Cluster Project (NSCP) is a research project involving a
consortium of universities working to develop a high speed distributed computing meta-cluster [23].
The goal is to provide transparent access to a shared pool of data and resources located at various
physical locations across the country. Work is currently underway on a scalable, high-performance
system used for applications that include digital libraries, linguistics, imaging, virtual reality and data
A relatively new computing cluster research area involves intelligent systemswhere
administrative services are performed by the system itself. Industrial research has focused on a self-
configuringVself-tuning system where system resources are detected and work is subsequently
distributed to achieve maximum efficiency [15]. Tasks are redistributed when components fail. The
system continuously works to optimize performance. The following observation is made by an
industry representative [15]:
nThe question is not: will the system become too complex to manage? It already is. What we
need is a system that will take care of itself.n
2.3 Summary
This chapter has reviewed cluster computing with a discussion of volunteer and dedicated cluster
environments. This innovative technology extends the boundaries of traditional computing systems.
The primary motivation behind the Cotton project is to exploit the cluster computing concept by
providing a cost-effective parallel computing platform that is both adaptive and robust.
The following chapter examines distributed parallel computings including architectures, software and
program models.

Single Instruction Single Data (SISD)
Single Instruction Multiple Data (SIMD)
Multiple Instruction Single Data (MISD)
Multiple Instruction Multiple Data (MIMD)
3 Distributed Parallel Computing
Distributed parallel processing software designs are based on the underlying parallel programming
model. In the past 30 years, the parallel computing discipline has matured to yield component
classifications related to architecture and software modeling. In this chapter, we examine several
fundamental aspects of parallel programming. We begin with a brief overview of computing
architectures. This is followed by an analysis of parallel programming that includes software
requirements and software models. The chapter concludes with an analysis of parallel programming
research based on multitasking management.
3.1 Computing Architectures
The predominant computing architecture classification scheme is that of Flynn's taxonomy in which
characterization is based on instruction and data flow [6]. There exist four instruction-data
combinations identified below:
The von Neumann machine consists of a central processing unit and memory space, where machine
instructions are implemented in a fetch-execute cycle. The von Neumann machine is an SISD
implementation that supports standard sequential algorithms. The final three architectures,
characterized as parallel models, are discussed below.
SIMD architectures consist of multiple processors that execute the same instruction on different data.
The SIMD model is ideal for a specialized set of parallel applications, including image processing
and database operandi, where multiple data sets are manipulated simultaneously [36].
The MISD architecture consists of multiple processors that perform unique operations on the same
data. MISD implementations are rare due to the fact that most applications do not conform to this
architecture. Systolic arrays are one such MISD implementation, where nodes read and manipulate
data from neighbors on global clock cycles [6).
The MIMD architecture consists of multiple processors executing different instructions on different
data. This model provides the most versatility and potential for parallel distributed systems. MIMD
architectures are implemented with a distributed memory (each processor manipulates its own local
memory space) or shared memory (each processor accesses a global address space) model.
3.2 Parallel Software Requirements
Parallel programming levies unique requirements that must be satisfied, at some level, in a
distributed parallel processing software architecture. Foster [8] suggests four fundamental
2 3 4

requirements for parallel software, which include concurrency, scalability, locality and modularity.
Each of these requirements is covered below.
3.2.1 Concurrency
The degree of concurrency exhibited by an application dictates the applicability of a parallel
implementation. Algorithms can be designed in a manner so as to exploit a parallel environment,
although the design is difficult in many instances. In order to be a viable parallel processing
candidate, an algorithm must implement sufficient actions that can be executed simultaneously.
Amdahl's law states the limitations of parallel implementation when there exists strict sequential
components. In practice, most problem domains are growing in the areas where parallelism is
applicable [29].
One of the most difficult aspects of parallel algorithm implementation involves the coordination of
access to shared data. Minimizing the need for data synchronization in algorithm design is a
significant step in reducing complexity and optimizing performance. Nevertheless, simultaneous
task execution usually requires synchronization on some set of program data. The topic of shared
data access is addressed in various parallel programming models, three of which are discussed below.
3.2.2 Scalability
The scalability of a parallel application refers to the ability to take advantage of an environment in
which additional processors become available. The granularity of task division is critical in this
context. A more fine-grained approach will often result in more tasks than available processors. In
this case, dynamically added processors are exploited. The down side to a fine-grained approach is
the fact that messaging is at a premium (it is usually necessary to read and write data more often)
which impacts performance. A more coarse-grained approach might result in a situation in which
more processors are available than tasks.
Several parallel programming models implement a scheme whereby processes are statically mapped
to processors. This is an obvious detriment to scalability and portability. Algorithms that are
implemented for a specific number of processors do have the advantage of reduced complexity,
demonstrated by the data parallelism model (discussed in the following section).
3.2.3 Locality
The application performance in a distributed parallel environment is a function of data locality.
Local data can be accessed with minimal or no overhead. In contrast, remote data access will always
incur some performance penalty. Performance considerations dictate that data locality maintain a
significant influence on parallel program design.
3.2.4 Modularity
The advent of object-oriented programming methodologies can be attributed to code modularity and
software reuse. Although most parallel algorithms do not implement an object-oriented design,
similar design goals do apply. In this context, parallel programs are comprised of components that
reduce complexity and facilitate reuse.

3.3 Parallel Programming Models
We now examine three predominant distributed parallel programming models. Each model uniquely
addresses parallel software requirements. Most parallel computing environments incorporate a
variation or combination of shared memory, data parallelism and task parallelism.
3.3.1 Data Parallelism
The data parallel model focuses on data structures as the object of concurrent task execution.
Sequential identical repetitive operations are divided into independent tasks and performed
simultaneously. The following figure depicts the data parallel model.
Data parallel languages, such as Fortran 90, High Performance Fortran and pC++, provide
programming constructs for the explicit declaration of parallel operations. Data parallel compilers
provide additional recognition of implicit parallelism, such as nested loop segments that have no
shared variables. Parallel compilers construct an executable program from the explicit and implicit
parallel code segments, where data structures are partitioned into disjoint subdomains for parallel
operation [8]. Data distribution is addressed with a programmer specification. A data parallel
program provides directives that dictate a data distribution scheme.
The major advantage to the data parallel model is the fact that no explicit communication mechanism
is required. This is due, in large part, to the directives provided by the programmer. The compiler
infers communication operations from the program code [8]. Application performance is enhanced
with a focus on data locality.
There exist several disadvantages to this approach that make it impractical for many applications. A
major drawback is the fact that only a subset of algorithms can be represented as data parallel
Figure 3-1 Data Parallelism Model

programs. Data parallel program design and language proficiency are difficult to achieve in light of
the data partitioning and distribution responsibilities levied on the programmer. The fact that
distribution schemes are addressed prior to run time reduces scalability and the adaptive parallel
3.3.2 Task Parallelism
As opposed to data parallelism, task parallelism encapsulates both the data and the operations on
those data. At a high level, task parallelism represents an object-oriented design model. This model
is most often implemented as an MIMD system, where each task has the ability to operate
independently, yet communicate and exchange data with other tasks. The following figure depicts
the task parallelism model.
Figure 3-2 Task Parallelism Model
A task parallel algorithm is partitioned into independent concurrent tasks. Each task has three
distinct states:
1) Execution
2) Communication
3) Creation
The execution state involves performing the specified task, or algorithm code segment, using local
data or data received from another task. The communication state involves message passing between
independent tasks. In the task parallelism model, message sending is an asynchronous operation,
while message receiving is a synchronous (blocking) operation. The creation state involves
instantiating a new, independent task, which is created to work on some subset of the application.
Tasks in this model are often organized in a parent-child hierarchical tree, where parents partition the
task and then hand pieces off to some number of children.
The task parallelism model is conducive to adaptive parallelism. The mapping of tasks to processors
is truly dynamic by remaining completely independent of the algorithm. Additionally, this model is
inherently fault tolerant in that there is no central data store or control mechanism. This model also
offers the most diversity in terms of parallel algorithm conformance.

These advantages are contrasted by an increased reliance on task communication. The concept of
data locality is temporal. Input data is passed to a task, a code segment is executed using that data,
and the result is passed to another task. In this model, message passing is at a premium, where
implementation is based on a relatively significant amount of inter-task communication. Data
serialization can be costly, and in many instances, dominate total computation time.
The evolution of performance-based message passing technologies (highlighted in the previous
chapter) has significantly reduced messaging overhead. Given the advantages provided by this
model, distributed parallel software architectures based on task parallelism are conducive to
computing cluster environments.
3.3.3 Shared Memory
The shared memory model of parallel programming incorporates a common address space that tasks
use for data access and communication. In a loosely coupled architecture, this model is referred to as
distributed shared memory [4]. The following figure depicts the shared memory model.
Figure 3-3 Shared Memory Model
In this model, the shared memory system is responsible for data synchronization mechanisms and
ownership enforcement. Shared memory simplifies task level implementation. In addition to
asynchronous data access, communication transparency between tasks is realized.
The underlying shared memory implementation presents several challenges. Data locality introduces
a complex management problem that is usually solved with one of two ownership schemes [4].
Fixed ownership is where each data element is assigned to a central owner. The owner is consulted
before each write operation. In this instance, the owner can quickly become a performance
bottleneck. Dynamic ownership is when the owner of the data changes periodically, where the
owner of the data must first be located before each data modification. In this scheme, all data
modifications incur additional remote operations.

The advantage of a parallel programming shared memory design is the encapsulation of the
underlying implementation details. As discussed previously, the implementation specifics will
impact both the performance and dynamic nature of the overall system. Many distributed shared
memory designs are implemented on top of a message passing interface. Performance is comparable
with the task parallel model; in each instance, performance is driven by the communication and
computation requirements.
There are two aspects of the shared memory model that must be considered in a distributed parallel
processing architecture design. First, it can be difficult to write deterministic programs using shared
memory [8]. Additionally, most solutions preclude a completely dynamic, fault tolerant system by
implementing a global memory pool paradigm.
3.4 Multitasking Strategies
Distributed parallel programming introduces additional complexity in software design that presents
significant challenges. Most implementation problems are derived from simultaneous process
coordination. Coordination encapsulates multiple issues, including the communication, task division
and data synchronization.
The design of efficient, performance-based parallel programs introduces additional complications.
Task parallelism, in particular, is based on communication between nodes. In this model, data
locality is of primary importance. From the process perspective, communication involves interrupts,
which will accumulate and can be relatively expensive. Threading models have emerged as a
solution that reduce context switch times and provide some degree of operating system enforced
This section presents three multitasking management solutions. The first two are programming
models, Bulk Synchronous Parallel (BSP) and Communicating Sequential Processes (CSP), that are
designed to simplify coordination in a parallel environment. The third is software package, Active
Threads, which is designed to improve performance with a focus on data locality and thread
overhead. A brief discussion of this package provides an illustrative example of the numerous
parallel processing software tools and libraries that exist for parallel programming.
3.4.1 Bulk Synchronous Parallel
The BSP model was developed to address the complications associated with simultaneous process
coordination. The following diagram illustrates a BSP implementation.

Process 1
Process 2
Process 4

Figure 3-4 Bulk Synchronous Parallel Model
This model decouples communication and synchronization with a sequence of parallel super steps
that consist of three distinct compute-communicate-synchronize phases [22]. In the computation
phase, processes operate on local data and issue external communication requests. In the
communicate phase, data is exchanged as per the requests issued in the computation phase. The
synchronization stage guarantees all communication events are complete before the next super step
commences. The clear advantage of this model is the simplification of parallel programming:
Synchronous message passing fallacies, such as client deadlock, are eliminated.
Super steps provide natural breakpoints where system state can be assessed.
Correctness reasoning is similar to that of a sequential program.
The BSP model guarantees processes to be mutually independent within supersteps [25]. This
supports systems with heterogeneous nodes and reduces the complexity of program development.
This model is ideal for parallel applications that incorporate a similar degree of granularity among
tasks. When task granularity varies, performance and efficiency can be degraded. The difficulty
here is determining the phase timing as illustrated by the following two scenarios:
1) Tasks finish computations, sitting idle waiting for the communication phase to begin.
2) Tasks are interrupted before completion by the communication phase.

The advantages of this model are contrasted with the compromise in both performance and
efficiency, yet the simplification of parallel programming makes it a popular paradigm.
3.4.2 Communicating Sequential Processes
Communicating Sequential Processes (CSP) is a parallel programming model with goals similar to
that of BSP. This model is based on a mathematical language in which software behavior can be
both defined and substantiated [13]. CSP incorporates synchronous message passing using guarded
commands to implement selective communication [16]. Unlike standard task parallelism (which
incorporates asynchronous writing), this model reduces complexity with task synchronization at
every communication phase. This simplifies global state determination.
The disadvantage to CSP, similar to that of the BSP model described previously, is that performance
is degraded by tasks sitting idle. Each task must wait for an acknowledgement from a synchronous
write before proceeding.
3.4.3 Active Threads
A research effort at the University of California at Berkeley yielded a threading package (Active
Threads) that offers several features designed for fine-grained parallel applications [34]. The Active
Threads development is motivated by the increased importance of data locality as the gap widens
between processor performance and communication (for data access) latency. The developers also
note the overhead associated with the thread count in a fine-grained application. These issues are
addressed with a package that implements thread operations, thread scheduling and specialization
policies. The Active Threads design provides the user with a virtual processor abstraction, where
clients are encouraged to schedule threads that share data to run on the same virtual processor [34].
An extensible library of schedulers with various policies is included with the package. The Active
Threads runtime, consisting of a machine independent and machine dependent layer, implements the
scheduling policy, mapping virtual processors to the actual hardware.
3.5 Summary
This chapter has provided an overview of the prominent parallel programming models followed by
representative parallel programming solutions. A major design consideration in a distributed parallel
system involves the model for parallelism. To this end, there are design tradeoffs associated with
each implementation that will impact system complexity and overall performance.
The Cotton parallel programming model is based on task parallelism. We have noted two important
characteristics of task parallelism -1)the model introduces a significant degree of complexity, yet 2)
remains extremely robust. The versatility of task parallelism makes this model conducive to adaptive
and fault tolerant systems. This is the topic of the following chapter.

4.1.1 Performance Optimization
Dynamic load balancing is one component of performance optimization in a distributed parallel
environment. A significant amount of research has concentrated on load balancing schemes and
algorithms for use in parallel environments. The goal is to maximize the efficiency of the overall
system by balancing processor load while minimizing network traffic. Dynamic adaptation is
accomplished with task allocation and distribution that is based on the current state of the system at
any point in time.
4.1.2 Dynamic Demand
Application demands are rarely stable. In this context, dynamic adaptation has numerous
advantages. As an example, consider a computing cluster executing multiple jobs simultaneously. A
system that adapts to changing demands, such as a new job priority scheme, provides a great degree
of versatility. If the job priorities change during execution, system resources are reallocated based on
the new job priorities.
4 Adaptive and Fault Tolerant Systems
Adaptive systems are characterized as systems that modify their behavior based on changes in the
environment [14]. In this context, the system does not require outside intervention it thinksand
acts independently. A prime example of a true adaptive system is the Transmission Control
Protocol/Internet Protocol (TCP/IP), one of the most widely used network communication mediums
[14]. With TCP/IP, network efficiency is optimized via dynamic flow and congestion control.
Packets are retransmitted in response to network faults, in which the previous transmission attempt
was unsuccessful.
This chapter examines adaptation in the context of distributed parallel systems. An emphasis is
placed on fault tolerance, including adaptation phases, fault tolerant design strategies and
implementation methods.
4.1 Overview
The emerging popularity of distributed system solutions can, in large part, be attributed to the
inherent adaptive potential. This potential results from the fact that the architecture is based on a
collaboration of independent nodes. In this context, adaptation is a system response to environmental
changes. A system response in a distributed environment can be categorized as one of the following:
mid i
^ e V
p D R
12 3

4.1.3 Resource Availability
In a distributed system, resources can be added and/or removed at any point in time. If a new
machine is introduced to the system, adaptation involves a process whereby this machine is
discovered and subsequently employed. Conversely, when a machine is removed from the system
(due to a voluntary withdraw or crash), adaptation involves recognizing the lost resource and
possibly initiating a recovery process.
It is this second case where resources are abruptly removed from the system, that presents numerous
challenges. In a distributed environment, faults are possible at virtually any point in the system.
These faults fall into three general categories hardware crashes, network interrupts and software
failure. A distributed environment provides inherent fault tolerance in that the potential exists for
tasks to continue even as system resources are lost or software failures occur. A fault tolerant
architecture is designed to detect and recover from such faults. To this end, a true fault tolerant
system adapts with changes (such as task redistribution) in response to fault detection.
4.2 Adaptive Parallelism
One of the first successful research projects in the field of adaptive parallelism was the Piranha
system developed at Yale University [5]. Piranha is based on the Linda model,a coordination
language that provides underlying coordination whereby independent processes act as one parallel
program [21].Linda implements a virtual shared memory (VSM) system that is shared by all
processes. The VSM acts as a shared object repository in a global memory space. Linda is suited for
the master/worker model, where masters divide the task and place it in the global memory space.
Workers, in turn, retrieve tasks from the global memory space, perform the computation, and return
the results.
Traditional approaches to adaptive parallelism have focused on the process model, where
applications are implemented as a set of processes. In contrast, the Piranha system abstracts the
notion of a process from the application itself. The Piranha model is based on a process that is
created on demand at the location of an available processor. This process executes a single function
(piranha), that shares data with other processes via Linda's VSM. Piranha utilizes task descriptors
in the shared memory space (as opposed to processes) as movable, remappable computation units -
an approach that provides heterogeneity [5]. When a Piranha process must be suspended during
execution, that process is remapped (via the task descriptor) to a new processor. Prior to suspension,
a retreat function (implemented by the application programmer) is invoked.
The Piranha system makes several valuable contributions with respect to distributed parallel
processing architectures. First is the adaptive nature of the system, whereby processes are
dynamically created on demand. This is realized by separating the application from the processes.
An explicit boundary simplifies the application by removing static bindings between the processes
and algorithm implementation. A second Piranha contribution involves the notion of distributed
process invocation. This eliminates the need for initial process migration.
Two aspects are noted that differentiate the Cotton architecture from the Piranha system. First is the
global memory space implementation. This approach promotes susceptibility to catastrophic failures
when a shared memory node crashes. Second, Piranha requires the application programmer to
provide a retreat function that, to a certain degree, couples the application with the environment.

Byzantine Faults
Software Faults
Operational Faults
Change Detection
The change detection phase involves monitoring the environment for possible changes. Changes to
the system can be initiated internally or externally. In a distributed environment, the agreement
phase usually involves collaboration among nodes where an adaptive action agreement is reached
(this phase can be omitted if nodes can make decisions independently). The action phase involves
behavior modifications in response to environmenlal change(s).
Adaptive systems can be characterized by a correctness domain that corresponds to the behavior of
the algorithms implemented by the adaptive system [14]. The correctness domain of an application
refers to the execution environment or subset thereof in which the algorithm(s) function properly.
These domains coincide with the tolerable failures of fault tolerant systems. With respect to adaptive
systems, the following observation is noted [14]:
UA system with a smaller correctness domain is usually easier to construct and faster to
operate since it can be based on more simplifying assumptions.
The Cotton architecture incorporates an application model motivated in part by the notion of
simplification. In the context of fault tolerance, simplification reduces complexity, and in turn,
increases reliability.
4.4 Fault Tolerance
Distributed parallel systems provide various degrees of fault tolerance. One common approach to
fault tolerant system design is to simply exclude fault recovery. This method reduces both
complexity and overhead, yet places a strict limitation on the suitable application domains. Genetic
algorithms are ideal for a system that lacks recovery, in that a genetic algorithm often seeks the best
solution, not the optimal solution. In this instance, failures can simply be ignored (as they are not
fatal to the system as a whole) with the understanding that the final result might be degraded.
4.4.1 Fault Types
We will refer to fault tolerant systems as those that can detect and recover from faults. In this
context, we explore three fault types:
This can be a difficult requirement levied on the application programmer when the timing of the
function call is indiscriminant. Implementation may depend upon system progress or execution state
that is external (and hence unknown).
4.3 Adaptive Model
The dynamics of adaptation at the system level involve an adaptive event sequence. Research in this
area has yielded an adaptive system model that consists of the following three phases [14]:
\7 \/ \1/
12 3
\/ \1/
12 3

Software faults are categorized as faults where a portion or portions of a parallel algorithm produce
erroneous results essentially, a software bug. An operational fault occurs at the system level, and
may include hardware failures, network interrupts and software crashes. Byzantine faults are
arbitrary faults where components exhibit erratic behavior. Byzantine Faults
Fault detection schemes traditionally assume the use of fail stop components that is, components
whose functionality abruptly halts upon a fault [16]. Byzantine faults are different in that the
component continues to operate after a fault has occurred. Byzantine faults are extremely difficult to
detect for two reasons:
1) Faults are arbitrary in nature.
2) Faults are not bounded by a finite set of behaviors.
The Cotton architecture does not directly address Byzantine faults, although recovery is possible in
most instances. A Byzantine fault will eventually yield an invalid response at either the software or
operational (hardware) level. At the operational level, this invalid response can be detected in the
form of a connection timeout or an invalid acknowledgement. In these instances, the component is
marked as a failure. At the software level,a Byzantine fault can be detected by evaluating the
correctness of a result. In this case, the component failure might never be detected in a design that
decouples the application from software architecture. The fault is not fatal in that the incorrect
portion of the result can be recomputed. Software Faults
Detection of software faults involves verification of results at predetermined (usually intermediate)
stages of the parallel computation. In order to verify a result, a validation step is required. In this
context, fault recovery usually involves two stages of the adaptation process:
1) Change Detection verify an incorrect result.
2) Action repeat the computation.
Software fault recovery is most often implemented at the algorithm (or application) level, as opposed
to the architecture (or system) level, due to the fact that validation is algorithm dependent. Software
fault recovery simply involves result checks at intermediate stages. Incorrect results are recomputed. Operational Faults
Operational faults in a distributed environment can occur at virtually any location in the system at
any time during operation. The following diagrams present possible fault situations in a distributed
parallel environment.

Figure 4-2 Network Failure (Connection Interrupt)
Node 1 Node 2
Figure 4-3 Software Failure (Premature Termination)

Hardware faults refer to situations in which a node, or processor, that is currently involved in the
computation becomes unavailable. In a dedicated environment, this occurs when a machine crashes
or loses power. In a non-dedicated (or volunteer) environment, a hardware fault can also result from
a user halting the parallel program or simply turning off the machine.
Network faults occur when the physical connection between two nodes is lost. This can be a result
of a malfunction or physical break in the network hardware. Network faults can also refer to a
situation in which communication latencies are unreasonably high due to an extremely overloaded
Software faults in this context are operational faults as opposed to program bugs. These faults are
identified as non-responsive tasks. Software faults can be caused by a variety of situations including
internal deadlock and premature process or thread termination.
Distributed parallel processing systems that provide complete fault tolerance address operational
faults by providing a mechanism for fault recovery. Operational faults should be transparent to the
application. For instance, in reference to the figures above, the algorithm developer should not be
burdened with verifying that task A is able to communicate with task B.
4.4.2 Operational Fault Mechanisms
Extensive research in the realm of fault tolerant distributed systems has introduced programming
techniques and structured paradigms designed to provide fault detection and recovery. These
strategies are rarely implemented as mutually exclusive solutions. Most fault tolerant architectures
incorporate a combination of techniques to yield a complete fault tolerant system. We first review a
general fault detection scheme (Process Surveillance) followed by several recovery mechanisms. Process Surveillance
Application execution in a computing cluster consists of multiple processes running on some number
of distributed processors. At any time during execution, an operational fault can result in a lost, or
unreachable, process. Process surveillance is a basic fault detection technique whereby processes are
paired with one or more observer processes (it should be noted that this technique is valid for threads
as well). At periodic intervals, the observer(s) perform status checks on the monitored process. If a
lost process is detected, the observer(s) proceed to the agreement phase of adaptation. If deemed
necessary, the action phase (which, in many cases, involves recovering the lost process) will follow. Checkpointing
Checkpointing is a backward error recovery method whereby the execution state of a program is
saved periodically so that it may be recovered at a later time. When an error is encountered, the
system is rolled back to the last checkpoint. The concept of checkpointing is relatively simple, yet
implementation introduces significant complexities. The following diagram illustrates the
checkpointing process in a distributed parallel program.

Figure 4-4 Parallel Program Checkpointing
Figure 4-4 depicts multiple processes communicating via messages (represented by the directed lines
between processes). If the system is checkpointed at time CPI, message ml is lost. This message
has been sent by process 1 but not yet received by process 3. The same scenario exists at checkpoint
CP2 with message m3 between process 4 and process 3. Due lo ihe non-deterministic timing of
messages between processes, asynchronous checkpointing (saving a global system state without
process coordination) is difficult to implement.
It is possible for each process to save its state with regard to current pending messages. In this
situation, each process checkpoints only when there are no receives pending and all sends have been
received. This is depicted by checkpoint CP3 in Figure 4-4. The problem with this method is the
potential domino effect of rollbacks in the attempt to reach a consistent system state [16]. With no
restriction imposed on message exchanges, the potential always exists for rollback to the beginning
of the program.
Synchronous checkpointing is an alternative to the asynchronous implementation whereby
independent processes coordinate the job of providing a global snapshot of the system [16]. This
process involves two phases. In phase 1,an initiator requests that each process make tentative
checkpoints. Each process can accept the request, in which case subsequent communication is
suspended. If all processes accept, phase 2 begins whereby the initiator requests permanent
checkpoints tor each process.

Synchronous checkpointing guarantees a consistent system state with minimal rollback. There are
several drawbacks to this approach. First, additional messaging overhead is required for
implementation. This overhead can be magnified due to the fact that the system wide checkpointing
procedure is invalidated if a single process is not able to accept the checkpoint request. Second, a
global control mechanism is required to coordinate the checkpointing procedure for the system.
Checkpointing is a natural fit for the BSP model (covered in chapter 3) in that distinct
communication breakpoints are clearly defined. A project undertaken at the Massachusetts Institute
of Technology has focused on an adaptive, fault tolerant implementation using the BSP model in a
volunteer computing environment [25]. In this architecture, tasks are packaged and distributed to
volunteer nodes. Automatic checkpointing is implemented at the end of each BSP superstep when
results are sent back to the origin. Tasks that rail are subsequently redistributed at the beginning of
the subsequent superstep.
Task interdependence plays a major role in the efficiency of the checkpointing technique. For
applications that limit communication between tasks (tasks remain relatively independent), the
process is simplified. Checkpointing overhead must be weighed with system performance when this
fault tolerant strategy is implemented. Forward Recovery
In contrast to checkpointing,/orwar^/ error recovery is a technique that involves moving ahead to
recover from system errors. This technique is implemented with a predefined set of potential errors
and corresponding corrective actions. While advantageous in that it imposes minimal overhead, this
technique is application dependent. The error diagnosis must be exact and must match a
corresponding corrective action. These actions can be difficult to define; a comprehensive list that
predicts all possible error conditions and corresponding states is not feasible in most instances.
Forward error recovery can be ideal for special cases, although it is not applicable as a general-
purpose mechanism for fault tolerance. Replication
Replication is a fault tolerant design scheme based on redundancy. This is a common technique with
multiple variations that can be implemented in hardware, software, or a combination thereof. Stable
storage is an example of a system component that applications assume to be reliable. Disk
shadowing is a stable storage reliability technique that implements mirrored disks [16]. On each data
write, the data is written to two or more disks. If one disk fails, a duplicate is available. The obvious
disadvantage to this approach is the overhead involved with maintaining duplicate data, both from a
financial and operational perspective. Several design schemes have been implemented to alleviate
some of this burden. Redundant Arrays of Inexpensive Disks (RAID) is one such scheme that
utilizes check disks in conjunction with a coding technique to mask disk railure without duplicating
the entire set of data [16].
In the context of task parallelism, replication is used to initiate redundant tasks. A common
implementation involves duplicating a task on multiple nodes. One task is designated as the primary.
If the primary task fails, the duplicated task on a secondary node is available. This scheme is
remarkably simple, yet the additional overhead associated with duplicating tasks can be detrimental

to parallel processing performance. The number of duplicate tasks is directly proportional to the
degree of fault tolerance in the system. For instance, consider an application with x tasks executing
in a distributed system consisting of y processors. The highest degree of fault tolerance exists when
each task is duplicated on each processor (x [y-11 duplicates). This scenario is optimal for fault
tolerance, yet detrimental to both efficiency and performance. Functional Models
Functional programming has been an active area of research for a number of years. The pure
functional paradigm adheres to a mathematical model where programs are both referentially
transparent and free of side-effects. Although these characteristics are conducive to both parallelism
and fault tolerance, there has been limited work associated with functional models in this realm. Attributes
The following sections discuss three fundamental attributes of functional programs. Mathematical Model
The functional paradigm is based on a strict mathematical model. Functions represent mathematical
expressions. As a result, functional programs are more clear and concise than their imperative
Variables in mathematics represent values. In imperative languages, variables refer to memory
locations and represent values [27]. Pure functional languages adhere to a mathematical model in
that there is no concept of assignment where a memory location can be modified. Variables are
strictly named values (as opposed to memory locations) that do not change. Referential Transparency
Pure functional programs exhibit referential transparency. The principal of referential transparency
is defined as follows [27]:
''The meaning of an entity is unchanged when a part of the entity is replaced with an equal
In functional programs, expressions are comprised of other expressions (subexpressions). A
referentially transparent expression is one in which all subexpressions can be transparently
interchanged with their corresponding value (evaluation result). Function results are based solely on
the input parameters. This model is conducive to parallelism in that function values (evaluation
result) are independent of execution sequence.
24 Side-Effect Free
Functional programs are free of side-effects. There are no state maintenance facilities, and hence
functions lack internal state. This property guarantees that multiple calls to the same function with
the same input parameters will always yield the same result.
A simple example of an imperative programming side-effect is a function F that maintains and
returns a count x of the number of times that it has been called. Each call to function F results in
state modification. In this instance, the order of calls to various functions in the program directly
corresponds to the value of x at any point in time. Imperative models typically rely on side-effects,
or state changes, which makes program reasoning more difficult. Functional Fault Tolerance
A significant contribution to fault tolerant parallel processing using the functional model was made
by a research effort involving fault tolerant parallel processors [12]. The authors of this work make
the following observations with respect to functional programs:
Function invocation constitutes a natural rollback point The function to be computed,
its arguments, and the destination of the function's result are all the information necessary
to restart the function, and can be saved by the function's caller at low cost... Because of
referential transparency, the result of the restarted computations is identical to that of the
original computation, merely delayed in time."
The functional programming model is conducive to fault tolerance in that it both simplifies and
isolates error recovery. In a distributed parallel environment, checkpointing can be implemented
with a distributed scheme where, in the presence of faults, rollback is isolated (task localized). A
global checkpoint is not necessary. This scheme is made possible by a functional model in which 1)
data exchange is limited to passing input parameters and returning results and 2) program results are
independent of function execution sequence.
Although the functional paradigm is conducive to fault tolerance, there exist a limited number of
fault tolerant models whose designs are influenced by functional programming. One such model,
FTAG (Fault Tolerant Attribute Grammar), incorporates the functional model with attribute
grammars to provide the application programmer with high level programming abstractions [30].
FTAG computations consist of modules that are pure mathematical functions. At execution time,
modules are decomposed into primitive modules based on attribute dependencies. This process
results in a computation tree in which inherited attributes flow down and synthesized attributes
flow upward.
Fault tolerance in the FTAU system is accomplished with redoing and replication. Redoing
involves repeating Tailed parts of the computation. Replication involves implementing redundant
tasks in parallel, where identical modules have the same set of inherited and synthesized attributes.
While not directly addressed in the FTAG model, the redoing process relies on a stable object base
abstraction that is assumed to survive failures.

The FTAG development introduces a functional computational model that can tolerate both software
and operational faults. Module decomposition in the form of a computation tree is a conceptual
influence in the Cotton task parallelism design model.
4.4.3 Implementation Policies
Adaptive system designs address performance, application demand and resource availability to
different degrees with various combinations of adaptive mechanisms. Traditional solutions have
implemented either a complete integration or strong coupling policy. Recent work has emphasized
implementations that are transparent to the application. Complete Integration
Complete integration is one of the most common, yet least desirable implementation policies. With
respect to fault tolerance, complete integration requires that the application be designed to detect and
recover from faults. This approach has several negative consequences. First, a significant burden is
placed on the application programmer that requires expertise in fault tolerant design. Second, the
algorithm design becomes dependent upon the execution environment. Strong Coupling
Many distributed systems incorporate adaptation modules in the form of toolkits or libraries that are
used by the application. These components are incorporated by the application in the form of
building blocks [2]. Although this solution provides a more distinct separation between algorithm
implementation and adaptive mechanisms, the problems posed by the complete integration method
still persist. The real benefit of the strong coupling policy is the versatility provided to the
application in the form of components that can be constructed to yield a customized adaptive
implementation. Application Transparent
It is clearly desirable to provide adaptive behavior that is completely transparent to the application,
although distributed system architectures have provided limited support in this area. A research
project undertaken at the University of Kaiserslautern in Germany was one of the first to identify the
importance of application transparent fault tolerance [2]. This project involves a software
architecture development that completely separates the fault tolerant mechanisms from the
application. The architecture design is based on a client server model that implements server
replication. A fault tolerance layer (ft-layer), providing an application interface, is inserted between
the server on each node and the operating system kernel. The ft-layer provides fault tolerant
components that include service configuration and recovery, replica management, surveillance and
message distribution and selection. The ft-layer is interchangeable which allows customized
This architecture incorporates a modular fault tolerant design scheme that maintains application
transparency. It is noted that the ft-layer is implemented natively, which restricts a platform neutral

Decouple parallel applications from the execution environment.
Minimize overhead and complexities associated with adaptation.
Maintain platform independence.
4.6.1 Load Balancing
The Cotton architecture incorporates a dynamic distribution scheme whereby tasks are distributed to
available processors based on the current load. Prior to task initialization, the current system load
parameters are assessed. New tasks are initialized on the node in which system performance will be
4.6.2 Resource Supervision
In the absence of global control mechanisms, network resources are monitored at each node in the
Cotton system. A resource directory is maintained on each node that contains load and state
execution environment. Additionally, the design makes an assumption (valid in most instances) that
server replication overhead, relative to the server weight, is minimal.
4.5 Adaptive System Design Tradeoffs
Implementation of an adaptive architecture does not come without cost. These costs are incurred in
the form of additional overhead that can impact both system performance and overall efficiency.
One of the most simple, yet most effective fault tolerant design schemes is replication. With respect
to task parallelism, duplicate tasks execute simultaneously on different processors. If a primary task
does not complete successfully, a duplicate is available.
The obvious drawback of this implementation is the redundant processing, where duplicates may be
running at the expense of primary tasks. This scheme also introduces replication design decisions
(such as the number of duplicates to be created) that impact the degree of fault tolerance. A software
architecture that runs on an n node cluster has the highest degree of fault tolerance if successful task
completion is guaranteed when up to n -1 nodes are lost.
Load balancing is an example of an adaptive behavior that is required, in some form, for efficient
execution in a distributed parallel environment. The update frequency of processor load information
is directly proportional to the system load balance at any instant in time. The effort to provide an
optimal load balance with continuous updates can result in vast network traffic. There is a point of
diminishing returns where performance suffers as task migration and data distribution occur more
frequently and among further physical distances.
Parallel processing architectures must be designed to balance the degree of adaptive behavior with
overall performance. This consideration is addressed in the Cotton architecture design.
4.6 Cotton Implementation
A primary design goal of the Cotton architecture is to provide adaptive behavior that is transparent at
the application level. This behavior consists of load balancing, resource supervision and fault
tolerance. The Cotton design is motivated in part by the following objectives:
2 3

information. The directory is updated via a discovery process when new resources are added to the
system. When a resource is removed (via a crash or voluntary withdraw), the fault tolerance
mechanisms detect the lost resource and subsequently update the directory.
4.6.3 Fault Tolerance
The fault tolerant design scheme in the Cotton architecture involves task parallelism that incorporates
a functional programming model. The recursive nature of functional programs provides a natural
match for task parallelism. Additionally, functional programming introduces referential
transparency. Each task is implemented as a lightweight entity that remains side-effect free. This
model reduces the correctness domain and hence simplifies fault recovery by 1)inherently enforcing
referentially transparent, side-effect free algorithm implementation and 2) reducing code complexity.
This allows an isolated asynchronous rollback recovery implementation at the task level.
Cotton can be distinguished from many distributed parallel systems by the fact that the design
implements total locality. There exist no global control mechanisms or global memory spaces. This
implementation provides the potential for complete fault tolerance by eliminating system critical
4.7 Summary
This chapter has presented an overview of adaptive and fault tolerant systems. We have defined
these systems in the context of a model for adaptation. Fault types in a distributed system were
introduced with a focus on operational faults. Several detection and recovery mechanisms were
covered, including an in-depth analysis of the functional programming model.
The Cotton architecture provides adaptive behavior in the form of performance tuning, resource
supervision and fault tolerance. The approach to fault tolerance is influenced by the functional
paradigm. Functional programs adhere to a mathematical model and are both free of side-effects and
referentially transparent. These unique characteristics reduce the cost associated with distributed
checkpointing and permit an isolated rollback scheme. The Cotton design couples this approach with
application transparency, where implementation remains completely independent of the execution
The following chapter extends the discussion of the Cotton parallel programming model in
introducing a task parallelism agent abstraction.

5 Agents
Agent-based solutions are ubiquitous in distributed system software design. A multitude of
distributed applications, ranging from e-commerce to network management to data warehousing,
incorporate agent technology. This can be attributed to the characteristics, or behaviorsthat
agents exhibit. Agents provide a natural abstraction in a distributed environment by encapsulating
the functionality required for three essential actions travel, task performance and interaction.
In this chapter we begin with a brief discussion of agents and task parallelism. This is followed by a
study of agent design patterns with a focus on the Master-Slave model. Finally, agent heuristics are
presented in the context of an adaptive, fault tolerant system.
5.1 Agents and Task Parallelism
The task parallelism programming model is based on independent tasks that perform parallel
computations with a combination of communication and simultaneous execution. In this context, an
agent is a natural task abstraction. The task agent encapsulates the task data, operations to be
performed on that data, and the necessary communication mechanisms.
5.2 Agent Design Patterns
The study of agent-based solutions for widely diverse problem domains has revealed recurring
design patterns. Agent design patterns are categorized as one of the following [19]:
Traveling Patterns coordinate aspects of agent mobility including itineraries, routing and
Interaction Patterns define and manage agent communication facilities.
Task Patterns focus on partitioning and delegating tasks among agents.
Parallel applications that use a task agent model typically do not incorporate traveling patterns. To
this end, agent mobility simply involves traveling between nodes. Traveling patterns such as
Itinerary (agent routing schemes among multiple hosts) and Ticket (agent service detail and
permissions) offer capabilities that are overkill for parallel processing applications. While providing
robust, versatile solutions for agent mobility, the overhead incurred with these patterns will degrade
Agent interaction patterns can be useful in parallel applications, although the overhead incurred is
similar to that of most traveling patterns. Several popular interaction patterns include the Meeting
(initialization of local agent interaction), Locker (private data storage mechanism) and Messenger
(remote messaging agent)[19].
For parallel architectures, the task pattern is a critical design component. This pattern defines the
agent coordination model for parallel execution. The Master-Slave pattern is a common task design
model incorporated in a broad domain of parallel applications. This pattern is discussed in the
following section.

5.3 Master-Slave Parallelism
This Master-Slave pattern implements a divide and conquer strategy in which a master delegates
tasks to one or more slaves that in turn are distributed throughout the system and work in parallel.
The standard agent-based implementation involves master and slave agents. The execution sequence
is as follows [19]:
1) The master agent creates a slave agent.
2) The slave agent moves to a remote host and performs a task.
3) The slave agent returns to the master.
4) The slave agent passes task results to the master.
Figure 5-1 Master-Slave Pattern
There exist multiple variations of the Master-Slave model that derive from the premise of task
division and delegation. We now examine two agent architectures based on Master-Slave
5.3,1 Master-Slave Clustering Architecture
An innovative implementation of the Master-Slave model involving an agent-based clustering
architecture is the subject of recent dissertation research [7]. This Master-Slave model is based on
intervals in which distinct actions are performed system state retrieval, computation and merging.
The process begins with a Timer agent that initiates a master agent. The master then dispatches
Load/Gather agents to search for new resources and measure processor load information in the
system. Subsequently, Worker-Slave agents are dispatched (to locations dependent upon current
load), perform work, and then return to the master. A Merge agent merges and reports the results.
The experimental context for this architecture is a clustering algorithm that groups input data
according to set membership [7]. The experiment simulates a large-scale signal processing
application in which large bursts of data are incoming at intermittent intervals. This architecture
provides the ability to process data that is input at multiple nodes in the system (which is critical for
data intensive applications that require optimal performance).

It is interesting to note that several performance issues were addressed with enhancements to the
initial Master-Slave model. First, Worker-Slave agent creation was decoupled from LoadGather
intervals with intermittent dispatching. Second, a dual master implementation, each with its own
merge agent, eliminated a single master bottleneck.
5.3.2 Two-Tier Management Architecture
A two-tier management architecture, implemented as an industrial research project, extends the
Master-Slave model to incorporate network management [11].This architecture is based on what the
developers call the Team Agent Approach. A primary design goal is to boost performance by
minimizing network traffic and serialization overhead. This architecture incorporates three agent
types that collaborate to perform distributed parallel processing.
A Hive Agent (implemented as a stationary entity) is akin to a primary master that coordinates
mobile agent teams. The Hive dispatches Queen and Scout Agents. A Queen Agent (primary
master's slave or secondary master) leads teams, monitors hosts and spawns worker threads. Worker
threads (slaves) perform the actual computation. Scout Agents perform network resource analysis.
The developers note the benefits of a central master controller (the Hive Agent) that include 1)
efficient and balanced resource allocation among agent teams and 2) itinerary optimization that
reduces system bottlenecks. These advantages must be weighed against the risk incurred with a
single point of taiiure.
There are several contributions made by this design that are noted. First, there is a focus on agent
mobility as it relates to performance. The mobility of an agent is inversely proportional to its weight.
The massive Hive Agent is stationary. Scout Agents, who move the most, are implemented as
explicit lightweight entities. Second, a self-destruction mechanism is implemented in the worker
threads. If the corresponding Queen Agent is lost, the worker threads are terminated in order to halt
resource consumption.
5.4 Cotton Agent Model
The Cotton parallel programming model incorporates task agents in a derivation of the Master-Slave
pattern. In the Cotton architecture, a parent (master) agent delegates tasks to child (slave) agents.
Child agents, in turn, can delegate tasks to new child agents. This process results in a dynamic agent
tree, as illustrated by the following figure.

Figure 5-2 Agent Tree
When compared with the Master-Slave model, there are two key distinctions in the Cotton
implementation that are noted:
1) Cotton task agents can serve a dual parent-child role. With the exception of the top
level (root) agent, nodes in the tree structure serve as parents (masters) and children
(slaves). Leaves in the tree are strictly child agents.
2) Cotton task agents are stationary. Parent agents create child agents via remote requests.
The second caveat limits the overhead and performance impact associated with agent serialization.
Data that is necessary to perform the task is sent via remote messaging.
5.4.1 Agent Communication
The Cotton agent communication model (first referenced in the previous chapter) follows a strict
guideline. An agent can communicate with the parent or the immediate children. Additional
communication between agents (such as sibling to sibling) is not allowed, and better yet, not even
necessary. By adhering to this guideline, fault recovery is possible when agents are lost. The
regeneration of a lost agent does not involve retracing intricate communication paths.

Suspend the current task.
Save any state information that is required for resuming the task.
Migrate to a new host.
Recover any state information that was saved.
Resume the task.
5.4.2 Natural Systems
The approach taken with agents in Cotton differs from the intelligence-based approach of many agent
systems. The agents here are explicitly lightweight with basic heuristics. Agent interactions are
clearly defined and simple. At an individual level, each agent is a negligible entity. From a system
perspective, the small pieces (agents) produce tangible results in collaboration. This approach is
loosely based on the principles of artificial life; a system is modeled after living organisms [1].
Artificial life is a field of research that attempts to study and exploit the behaviors characteristic of
natural living systems [20]. The lightweight agents are akin to a colony of ants. In an ant colony,
individual ants collaborate to locate food, relay a food discovered message, and finally transport
the food back to the hive. The actions of an individual ant, out of context, seem to be meaningless.
The behaviors and interactions of each ant are extremely simple, yet the operation, taken as a whole,
is remarkably efficient. This biological entity conforms to the notion that the whole is greater than
the sum of the parts.
An agent system modeled in this manner provides inherent fault tolerance. Basic heuristics and
simple interactions provide the potential for systematic recovery from lost pieces, or agents. In some
applications, such as genetic algorithms, lost pieces are not critical. The Cotton agent model is
designed for applications in which all pieces (agents), however small, play an essential role.
5.5 Agent Distribution
There has been significant research involving task agent distribution schemes in agent-based
distributed parallel processing architectures. Distribution schemes are essentially load balancing
strategies that involve the following two goals:
1) Agents should be placed on the least loaded processors (so as to balance the system-
wide CPU utilization).
2) Agents that communicate frequently should be placed in close physical proximity (so as
to minimize network traffic).
The distribution scheme has a significant impact on performance and is therefore an important design
consideration. There are numerous load balancing algorithms that aim to optimize the previously
stated goals. The Metropolis algorithm is one such algorithm that involves potential energy
calculations for determining agent placement. The algorithm involves a combination of one body
potentials (representing agent hosts) and two body potentials (representing agent links). An agents
placement is determined by comparing potential energies of that agent on various hosts. An
extensive Metropolis load balancing analysis and simulation related to the Cotton architecture is the
subject of recent research [31].
Task migration is a feature that corresponds to agent movement during execution. Agent migration
steps are as follows:
2 3 4 5

The act of suspending and subsequently resuming execution with a saved state can be a complex
process that involves significant overhead. Task migration provides the following two benefits with
respect to load balancing:
1) Agents can migrate from heavier to lighter loaded processors.
2) Agents that communicate frequently can move closer to one another.
The Cotton design model does not support task migration. Task agents are created on demand at
locations determined to be optimal for maintaining a balanced system load. These agents remain on
that particular host throughout their lifecycle. This approach is termed adaptive placement.
Adaptive placement offers two distinct advantages when compared with implementations that
support migration:
1) A system that implements task migration introduces additional complexity that is
proportional in magnitude with the degree of fault tolerance.
2) Task agents are stationary serialization does not impact performance.
Conceptually, system performance in an adaptive placement implementation is inversely
proportional to the task granularity. Large tasks will suffer on overloaded processors. However,
system load is considered upon each agent creation in determining where the agent is placed. A
performance penalty is only realized with extremely heavyweight tasks.
A second argument against the adaptive placement scheme is that agents communicating frequently
are not able to congregate, or move closer together. The impact of this restriction is limited in the
Cotton system due to the communication guidelines incorporated in the agent model. Agents can
only communicate with their parent and children. Although latency is a consideration, the model
dictates a reduced rate of network traffic.
5.6 Summary
This chapter has presented the Cotton programming model in the context of an agent-based task
parallelism abstraction. This model is based on Master-Slave parallelism, where master agents
delegate tasks to slave agents. Cotton agents are organized in a hierarchical tree, where agent
interactions are clearly defined. These lightweight task agents conform to the principles of natural
living systems. The model incorporates an adaptive placement distribution scheme that eliminates
the overhead associated with agent migration. The model is based on reducing both complexity and
overhead, which is conducive to a reliable fault recovery implementation.
The following chapter introduces the Cotton software architecture. The architecture components are
covered in detail, providing a thorough examination of the operational framework.

6 Software Architecture
This chapter details the design and implementation of the Cotton software architecture. This chapter
is divided into two segments. The first segment examines the software architecture components. We
begin with a discussion of the Java programming language and distributed computing followed by a
detailed analysis of the software packages and corresponding services. This segment concludes with
a presentation of two essential components in an adaptive system discovery and load balancing.
The second segment presents system level operations. The parallel computation process is covered
in a section that describes the task agent components and agent initialization. This is followed by an
illustrative example of a parallel algorithm based on the Cotton parallel programming model. This
segment concludes with a review of the fault tolerance design scheme, including fault detection and
recovery implementation.
6.1 Introduction
The majority of computing cluster applications incorporate some form of global control mechanisms
or vital system services that are implemented on a single machine or a subset of the available
machines. In the Cotton system, the software (consisting of a variety of service-based components)
runs on every machine in the cluster. At the application levelevery machine is a perceptual clone
providing identical services. In this contextperceptual alludes to the fact that hardware platforms
can be heterogeneous, yet this detail is transparent to the application itself.
One of the prominent features of a cluster environment is the potential to execute multiple jobs
simultaneously. Each machine in the Cotton system provides the capability to initiate parallel
processing jobs. As a compliment to the job origination service, each machine also serves as a host
for pieces of parallel computations (in the form of task agents) that originate on other machines in the
6.2 Implementation
This section presents the Cotton software architecture in detail. We first discuss the motivation for
selecting the Java programming language as the development platform for the Cotton software. The
following section covers the major components of the architecture, including component
functionality and operational diagrams.
6.2.1 Programming Language
The Java programming language provides numerous features that make it the language of choice for
dynamic distributed systems. The following characteristics of Java are purported by the designers of
the language [17]:

Architecture Neutral
We discuss several of these below in the context of the Cotton software architecture.
One of the primary Cotton design goals is to develop an adaptive system that supports
heterogeneous platforms. This goal is realized with Java, which is a portable, platform independent
programming language. Java programs are compiled into an architecture neutral byte-code. Java
programs are executed via a Java Virtual Machine (JVM) that transforms the byte-code into platform
dependent machine instructions.
Software architecture designs are based on the premise of software reuse. Java is an object-oriented
programming language that provides numerous facilities that support modular designs. When
compared with languages such as C++, Java program designs have the potential to be more robust
and less complex. Java is almost a pure object-oriented language in the sense that objects
represent all program entities (with the exception of primitive types). Classes are rooted at a base
class (Object) that serves as the root for all class hierarchies. Java interfaces provide a data free type
abstraction. The Cotton software makes extensive use of these object-oriented facilities with a
design effort that incorporates an agent hierarchical framework and modular service components.
Distributed parallel software systems must incorporate some form of multiplexing in asynchronous
messaging environments. The most prominent method is the use of multiple processes, or threads,
on each node. In this context, messaging threads can block while other threads perform local tasks.
Java's class library supports multithreaded applications with thread classes. The language also
provides built-in threading synchronization primitives including atomic write operations and mutex
locks. The Cotton architecture fully exploits the Java threading library; each task agent executes as
an independent Java thread, and the service components (discussed below) make extensive use of
Javas multithreading facilities.
There exist multiple security mechanisms provided by Java that give rise to dynamic distributed
applications. At the application (language) level, memory is not directly accessible. This eliminates
a wide range of potential malicious activities associated with direct memory manipulation. One of
the most powerful features of the language is the capability to dynamically load and execute Java
code. The Java interpreter provides a byte-code verification mechanism that inspects dynamically
loaded (untrustedcode for illegal byte-code sequences. The Java sandbox model provides a
separate, isolated environment in which untrusted code executes. The sandbox restricts access to
local resources. At the application level, cryptography is supported with multiple features, including
digital signatures and secure streams. These extensive features provide the Cotton architecture with
built-in security mechanisms.
The numerous features of the Java language are somewhat offset by a relative lack of performance
when compared with traditional statically compiled languages such as C and C++. Java is not the
language implementation choice for many high performance software architectures. Javas lack of
performance is due in large part to the fact that 1)the language is (usually) interpreted and 2) Java

provides built-in memory management facilities (garbage collection). The performance of a Java
program will never match that of a native C or C++ counterpart. With that being said, the
predominate performance bottleneck in distributed parallel applications (assuming adequate
hardware) can be attributed to the operating system calls required for messaging. This is the focus of
a high performance Cotton implementation, a future work topic that is discussed in the final chapter.
In order to support adaptive parallelism in a heterogeneous environment, there will be a moderate
performance penalty.
Java programs are comprised of packages that contain Java class files. A package represents an
organization of logically related classes. The following section covers the packages that comprise
the Cotton software architecture.
6.2.2 Components
The following diagram illustrates the packages that comprise the Cotton software architecture. There
are six dependency levels ranging from 0-5. Packages in level 0 maintain no dependencies within
the architecture, whereas level 5 packages are dependent upon (both directly and indirectly) all lower

Figure 6-1 Cotton Software Architecture Packages

The following sections examine in detail the contents and functionality provided by each package. Agent Package
The Agent Package provides core agent functionality in the Cotton software architecture. This
package is comprised of three distinct modules (the base Agent package and two subpackages):
Agent package provides the abstract base classes for agents.
Communication subpackage provides network communication facilities for agents.
Implementation subpackage provides concrete agent implementations as well as agent
monitoring utilities.
The following class diagram depicts the core classes in these packages.

Implements Class Interface
External V
Figure 6-2 Agent Package Class Diagram
The Cotton software provides two types of agents Task Agents and Mobile Agents. Task Agents
are stationary agents created on demand that implement a piece of the parallel processing algorithm
The Mobile Agent class is a serializable base class that provides itinerary management facilities for
agents that move to various nodes throughout the network. All Cotton agents are rooted at a base
Abstract Agent class. This class provides generic initialization functions, including the assignment
of a globally unique agent identification.
40 Communication Package
This subpackage of Agent provides network communication for agents. This package includes the
Agent Link class that encapsulates the details of network operations, exposing an abstract interface
for agent to agent communication. The following services are provided:
Establish network connections.
Provide read (input) and write (output) functions.
Implement a buffering scheme for fault recovery. Implementation Package
This subpackage of Agent provides the following concrete agent and agent monitoring
Quick Sort Agent Task Agent derivative that implements a simple quick sort algorithm.
Load Balancing Agent Mobile Agent derivative that moves throughout the network.
Local load information is collected and network load information is deposited at each node.
Distributed Agent Monitor responsible for monitoring an agent and the immediate agent
family (parent and/or children).
Monitor Link provides monitor to monitor communication facilities, including parent and
child (or children) listen and response threads (Distributed Agent Monitor inner class). Agent Services Package
This package provides the two core services for agent operation -1)Distributed Agent Server and 2)
Mobile Agent Host. Each service is discussed below.
The Distributed Agent Server (DAS) creates Task Agents on demand via remote requests. The
following diagram depicts this operation.
Figure 6-3 Distributed Agent Server

The DAS is contacted by parent agents when it is necessary (based on task partitioning) to create a
new child agent. Based on current load information, a request is made to the DAS running on a
specific node. The DAS is implemented as an ideinpotent server there is no state change between
requests. This implementation simplifies the support of a fault tolerant system [16].
The Mobile Agent Host (MAH) provides management facilities for Mobile Agents. The following
diagram depicts the MAH operation.
Figure 6-4 Mobile Agent Host
The MAH implements a server that listens for incoming agents. Upon arrival, the Mobile Agent is
bound to an execution thread. When the local task is complete, the agent is sent to the next
destination in the itinerary. Display Package
This package includes the classes that comprise the graphical user interface (GUI). The GUI
provides an interface that is used for initializing parallel processing jobs in the Cotton system. There
are two Display subpackages that provide specific visualization components:
1) Agent subpackage provides the functionality to display the dynamic task agent
2) Resource subpackage provides the functionality to display the current state of each
node in the system. Jini Agent Package
This package incorporates Java's Jini technology [18], which allows a client to dynamically
download the Cotton software. The software in this package can be used when a new node that is
Jini enabled is introduced to the system. The following diagram illustrates this process.

Jini Lookup Service
service proxy
Client downloads
Cotton software
Client retrieves service
proxy object
Figure 6-5 Jini Service Operation
The following steps outline the initialization process using the Jini Agent Package components:
1) The new node (Agent Server Client) initiates the discovery process by locating a Jini
lookup service.
2) The client requests a server proxy object from the Jini lookup service.
3) The client makes a request (calls the download method), via the server proxy object, to
download the Cotton software.
4) The Agent Server Service provides the client a copy of the Cotton software.
6.2.2.S Resource Package
This package provides local management services for network resource load information. This
includes the Resource Registry class that maintains a network node directory and the Registry
Observer interface for Resource Registry clients. The following diagram depicts the Resource

Figure 6-6 Resource Registry
The Resource Registry listens for and receives periodic updates of network resource load information
for each available network node. Clients that have interest in network load information implement
the Registry Observer interface and in turn register for update notification. Resource Registry clients
include parent task agents that must determine the network node on which (usually the least loaded) a
new child agent should be created. Each registered client is notified when updates are received. This
is an implementation of the Observer design pattern [3]that is prevalent throughout the Cotton
software architecture design. Services Package
This package provides abstract interfaces for network services and parameters. The network service
interfaces include generic functions implemented by various components in the software architecture.
Service clients are limited to these interfaces, thereby decoupling clients from specific
implementation details. Component software changes are transparent to clients.
The network parameters in this package are encapsulated in the Network Communication interface.
These include well-known addresses (used for multicast) and well-known ports (for the Agent Server
and Agent Host servers). The Routing package is a subpackage of Services that contains the
Network Messenger. This class was designed to encapsulate the details of network messaging,
providing clients with a network transparent interface. The Network Messenger is illustrated in the
following diagram.

Figure 6-7 Network Messenger
The Network Messenger provides clients with Unreliable Datagram Protocol (UDP) direct and
multicast messaging services. This class provides both send and receive functionality. For data
transmission, the client initiates a simple function call. The network messenger initializes a datagram
packet and sends the packet to the desired destination. Clients receive data by registering with the
Network Messenger. Each registered client is paired with a network socket. When a message arrives
for a registered client on a given socket, a thread pool thread is used to notify the client. Statistics Package
This package provides statistics gathering and reporting functionality for each job that is run in the
system. The following parameters are available for each job:
Origin the IP address of the node where the job originated.
Duration the length of time that the job took to complete.
Agent Count the number of task agents involved.
Fault Count the number of faults (premature task agent terminations) incurred.
Completion status success (correct result) or failure (incorrect result).

All system nodes run a Local Data Compiler as part of the Cotton software package.
Each job is assigned a globally unique indentification number upon initialization.
The Distributed Agent Server is notified of the corresponding job number when a
create agent request is received. Each server monitors local agent activity (agent
count, faults).
When a job is completed, a statistics request message notifies each Distributed Agent
Server to send data for the specified job to the Network Data Compiler.
The statistics package is used in a debugging mode for analyzing the overall system performance in
terms of job size, execution speed, fault recovery and success rate. The following diagram depicts
the statistics collection components.
Local Data _
Network _
_cDmpL 1
Figure 6-8 Statistics Collection
Statistics collection is implemented as follows:
d.2.2.8 Utilities Package
This package is comprised of software components, or utilities, that are used by multiple packages in
the Cotton software architecture. The Utilities Package is implemented at the highest level (level 0)
in the package dependency hierarchy in order to decouple and eliminate circular dependencies among
packages. This package contains four subpackages which are detailed in the following sections. Agent Package
This package contains agent-related interfaces and classes, several of which are identified below, that
are used by components of the Agent and Agent Services packages.
Agent Connection this interface provides an abstract representation of a physical
connection, or link between agents for message passing.
Agent Observer this interface provides an abstract representation of an observer that
monitors an agent.
Agent ID this class provides a globally unique agent identification signature.
Child Link Data this class contains data that represents the connection information
between a parent and child agent that is necessary when a lost child must be
12 3
46 Display Package
This package provides components that encapsulate and extract agent graphical display information
for task agents. The Agent Data class encapsulates the agent ID, parent ID, task specific label and
current state. The Agent State Observer is an interface for clients (display coordinators) that require
agent state information. Network Package
This package provides a single class (Data Transmission) which encapsulates low level connection-
oriented network transmission operations. This includes serializing and unmarshalling objects and
catching and handling network error conditions. A variety of type conversion utility functions are
provided that are required for network communication. Threads Package
This package provides three essential thread utilities for clients that implement multithreaded
1. Timer this component provides notification (either single event or continuous) after a
specified interval.
2. Thread Monitor this component is a Timer derivative that monitors current thread
execution state.
3. Thread Pool-this component provides clients asynchronous tasking functionality.
The following diagram depicts the Thread Monitor operation.
Figure 6-9 Thread Monitor
Components that require thread status information implement the Thread Observer interface and
register a specific thread with the Thread Monitor. The monitor watches the thread and notifies the

corresponding client for registered events which include termination, thread death and interval status
The Thread Pool provides clients with asyncronous tasking functionality. The following diagram
depicts the Thread Pool operation.
Do Job
Do Job

V Thread
' Client
Figure 6-10 Thread Pool
The Thread Pool maintains a pool of available threads for performing tasks. Thread creation
overhead is minimized by reusing pooled threads. When a client requires asynchronous tasking, a
job (Thread Pool Job) is added to the Thread Pool job queue. Subsequently, a pooled thread removes
the pending job from the queue and performs the designated task.
6.3 Adaptive Support Services
This section presents two fundamental operations in the Cotton system. We begin with discovery
the process of notification and initialization for nodes that are dynamically added to the network.
This is followed by a review of two load balancing mechanisms (multicast and mobile agents) in the
Cotton architecture.
6.3.1 Discovery
When a new node is added to the network, each current node must be notified in order to take
advantage of the new resource. In turn, the new node must also be notified of all nodes currently
employed in the system. With the absence of a global network mapping or configuration utility, each
new node maintains the responsibility of notifying and collecting responses from other system nodes.
The Cotton software provides multicast and direct notification discovery mechanisms, both of which
involve system-wide notification and direct response. Multicast discovery uses a well-known
multicast address. For network configurations that do not support multicast, direct notification is
implemented by sending a message to each network address. In each instance, message recipients
send a directed response to the message initiator (new node). The following diagram depicts the
network discovery operation.

Figure 6-11 Resource Discovery
6.3.2 Load Balancing
The Cotton architecture provides two methods for load balancing. System hardware and network
configuration may dictate the optimal method. The first method involves multicast messaging. he
multicast load balancing operation is detailed in the figure below.
Figure 6-12 Multicast Load Balancing

The Cotton software provides a Resource Registry (discussed previously) that maintains load
information for the local machine and each network node. The multicast load balancing operates as
1) The Resource Registry on each node sends multicast messages (via the Network
Messenger) at random intervals, providing the current local load data.
2) The Resource Registry listens for multicast messages (via the Network Messenger) that
provide load updates from other system nodes.
The second load balancing mechanism involves mobile agents. This implementation is very similar
to that of multicast. The only difference is in the messaging medium, where mobile agents act as
load balancing messengers. The following diagram depicts the load balancing agent operation.
Network Node
Outgoing load
balancing agents
Incoming load
balancing agents
Figure 6-13 Mobile Agent Load Balancing
The load balancing agent scheme operates as follows:
1) When a node is initialized, a load balancing agent is created and then sent to a
randomly selected node in the network.
2) The load balancing agent maintains a current system load directory. Upon arrival, the
agent retrieves local load data and deposits system load data.
3) When the data exchange task is complete, the load balancing agent is sent to a new
(randomly selected) node.
For each of the previous implementations, the Resource Registry on each node maintains a system
load snapshot. This snapshot is continuously updated (via multicast messaging or load balancing
agents) upon receipt of load information from other nodes in the system.
This concludes the presentation of the components and corresponding functionality that comprise the
Cotton software architecture. We now transition to the parallel computation process that is examined
in the following section.

6.4 Parallel Computation Process
In the Cotton system, parallel computations, or jobs, are initiated with a top level (root) agent. Input
data is passed to the root agent. The root agent will in turn generate child agents (as determined by
the algorithm implementation) who are passed pieces of the computation. This recursive process
continues until the parallel job has been sufficiently partitioned and distributed to some number of
task agents. The following section demonstrates the agent initialization process. This is followed by
the Algorithm Implementation section that presents a sample parallel application using the Cotton
agent model.
6.4.1 Agent Initialization
Task agents in the Cotton design model are associated with two components:1)the Agent Link, and
2) the Distributed Agent Monitor. Each task agent is created with a corresponding Agent Link that
serves as the agents interface. The Distributed Agent Monitor is a surveillance mechanism that
monitors a corresponding task agent. The monitor provides fault detection services (discussed
below). The following diagram depicts the task agent components.
Task agent creation involves the following steps:
1) The parent agent creates a new Agent Link for communicating with the child agent. (In
the case of the root agent that has no parent, a top-level agent link is created as an
interface to the root.)
2) The Agent Link retrieves system load information from the Resource Registry.
3) Given the current load informationthe Agent Link sends a create agent request
(specifying the task agent type) to the Agent Server on the least loaded system node.
These three steps are depicted in the diagram below.
Figure 6-14 Task Agent Components

Figure 6-15 Child Agent Creation
Upon receipt of a create agent request, the Distributed Agent Server creates a new child agent and
a corresponding Agent Link. The parent-child connection is completed with the following four
1) The Agent Links connecting the parent and child establish data input and output
2) The Distributed Agent Server that creates the child returns an Agent ID to the parent
Agent Link.
3) The parent's Agent Link requests the creation of a child agent monitor.
4) The parent's Agent Link establishes two connections (listen and status) between the
parent and child monitors.
The following diagram illustrates the complete parent-child connection.

Figure 6-16 Parent-Child Agent Connection
6.4.2 Quicksort Implementation
The application programmer interface (API) for application developers in the Cotton system is
designed to be clear and concise (this API is covered in detail in Appendix A). Parallel algorithm
implementation requires the programmer to develop custom task agent(s) which is accomplished by
implementing the abstract doJob() method.
The development process is illustrated below with a sample quicksort algorithm. This algorithm
sorts a string of characters lexicographically. The algorithm is implemented with task (quicksort)
agents as follows:
1) The string to sort is passed to the root (parent) agent.
2) The parent agent identifies the pivot (first) character in the string, and then divides the
string into two substrings. The first string (loToSort) contains all characters
lexicographically less than or equal to the pivot character. The second string (hiToSort)
contains all characters lexicographically greater than the pivot character.
3) A child agent is created for each string whose length is greater than zero. In the case of
two children, the first agent is passed loToSort, and the second agent is passed
4) This process repeats recursively (resulting in an agent tree) until the task is sufficiently
partitioned and distributed (all string lengths are zero). Results (sorted strings) are then
passed up the agent tree from child to parent.
5) Each agent combines the results (low string, pivot character, high string) passed from
the children before passing the result to the parent.
6) The root agent eventually produces the final result (complete sorted string).

The Java code segment included in Appendix A shows the doJob() method of the Quicksort Agent (a
concrete implementation of the task agent).
The following diagram depicts the agent tree created for sorting the word quick using the quicksort
algorithm (for simplicity, agent links and monitors are excluded).
Figure 6-17 Quicksort Agent Tree
6.5 Fault Detection and Recovery
The Cotton system is designed to detect and recover from faults. In the context of task agent
parallelism, an unreachable agent identifies a fault. Task agents are lost (unreachable) as the result of
operational faults that include hardware, network or software failure. The Cotton fault detection and
recovery steps are as follows:
1) The parent agent detects the lost task agent (child).
2) The lost agent (child) is subsequently regenerated.
The following figure illustrates fault detection and recovery in an agent tree (for simplicity, agent
links and monitors are excluded).

Figure 6-18 Fault Recovery
The Cotton fault tolerant design scheme is motivated by the need to provide detection and recovery
services that are transparent at the application level. In the context of task agent parallelism, this
dictates that detection and recovery mechanisms remain independent of task agents. The following
section presents the Distributed Agent Monitor (DAM), the essential component in the fault detection
and recovery process. This presentation is followed by a fault scenario that details the detection and
recovery process.
6.5.1 Distributed Agent Monitor
Each task agent is associated with a DAM upon initialization. This component provides the
following three services:
1) Request status from parent and child agents at periodic intervals.
2) Respond to requests from parent and child agent monitors requesting agent status.
3) Coordinate agent regeneration tasks (if and when necessary).
The diagram below depicts the Distributed Agent Monitor operation.

No response is received.
An invalid response is received.
A valid response is received that indicates the agent has terminated prematurely (before
its task is complete).
If a parent agent fault is detected, the DAM terminates the monitored task agent and both threads
(listen and status) of execution. If a child agent fault is detected, the recovery process begins. This
process is discussed below.
Status request from parent and child agent monitors
Status response lo parent and child agent monitors
Status request to parent and child agent monitors
Status response from parent and child agent monitors
Figure 6-19 Distributed Agent Monitor
The DAM encapsulates two threads of execution -listen and status. The listen thread is responsible
for responding to status requests from parent and child agents. When the DAM receives a status
request (via the listen thread), the Thread Monitor is queried for the current status of the task agent.
This status is returned to the parent or child monitor that initiated the request.
The status thread is a Thread Monitor that performs two services:
1) Monitor the agent thread and notify the DAM upon termination.
2) Provide a timer that notifies the DAM when time expires.
The DAM uses a Thread Monitor timer that expires at specified intervals for timing status requests to
parent and child agents. There are three status response scenarios that indicate a fault has occurred:
12 3

6.5.2 Fault Scenario
The following sequence of events depicts a representative fault scenario:
The child agents host machine crashes.
The child agent dies before completion.
Connections between the monitors and agent links are broken.
The parent monitor receives no response from a child status request.
The parent agent link is blocked on a read from the child.
The diagram below illustrates these events.
Figure 6-20 Fault Scenario
In this scenario, the Distributed Agent Monitor delects the lost child agent (a result of an invalid
response from the status request). The recovery process begins with the following three steps:
1) The DAM discards the connections to the child Agent Monitor.
2) The DAM notifies the Agent Link that the child agent is dead.
3) The Agent Link suspends the read operation (from the child agent) and waits.
The following diagram depicts these steps.

Read suspended wailing
Figure 6-21 Fault Recovery Initialization
The fault recovery initialization sequence is followed by the agent regeneration process. Agent
regeneration consists of the following three steps performed by the Agent Link:
1) Generate a new child (by way of the child agent creation process discussed previously).
2) Write buffered data to the child agent.
3) Resume the read operation.
The following diagram depicts the agent regeneration process.

resume read
Figure 6-22 Agent Regeneration
At the conclusion of the agent regeneration steps, the fault recovery procedure is complete and the
parent Agent Link resumes normal operation.
6.6 Summary
This chapter has covered the Cotton software architecture. The architecture is comprised of multiple
packages that encapsulate various system service components. A parallel application was presented
that illustrates algorithm development with the agent-based parallel programming model. The
parent-child agent connection implementation was introduced as a precursor to the fault detection
and recovery process. Fault detection is implemented with an agent surveillance component
(Distributed Agent Monitor). Fault recovery involves collaboration between two agent components
(Distributed Agent Monitor and Agent Link), remaining independent of and transparent to the
individual task agents.
The following chapter presents a web-based experiment that is designed to test the software in a
harsh environment that requires both dynamic adaptation and fault tolerance.

7 Web Experiment
This chapter presents an implementation of the Cotton system in a volunteer cluster web-based
experiment. This experiment is devised to rigorously test the design and implementation of the
Cotton software architecture. Of particular interest is the performance of the fault detection and
recovery mechanisms.
7.1 Background
The volunteer cluster in this experiment is comprised of a distributed network of desktop office
machines. Task agents are distributed throughout this network to perform computations in parallel,
where agents borrow processing power from participants machines. In this experiment, the
Cotton software is packaged in a Java applet that is embedded in a web page (Agent Home Page).
When a participant navigates to the Agent Home Page, the applet is downloaded and initialized
(thereby executing the Cotton software). In turnthe participants machine is capable of both hosting
agents and initializing parallel processing jobs.
In this experiment, a user joins the cluster (thereby allowing his or her machine to host agents) by
simply navigating to the Agent Home Page. The applet can run as a background process where
agents are hosted intermittently, essentially coming and going transparently to the user. The
following diagram depicts the applet that is displayed on the Agent Home Page.
Aaents allowed

Agent Home Page
_ Go Away! _
Tell Me More
Web Browser
Figure 7-1 Agent Applet

Press the Go Away! button.
Leave the Agent Home Page.
Close the web browser.
Power down the machine.
Experiment Results
Faults Delay Detection Results
Figure 7-2 Experiment Results
As a result, the Cotton software services, including the Agent Host, are terminated. Any local agents
are abruptly killed. In this environment, the Cotton system must 1)adapt to a dynamic number of
resources, and 2) detect and recover lost agents. The following section presents the results collected
from this experiment.
7.2 Results
The following chart summarizes the results collected in this experiment.
The applet provides a dynamic display that shows the number of agents (represented by heads at the
top) that are currently hosted (running locally). The Go Away! button allows users to disable the
local Agent Host. Parallel jobs are initialized in a job window that is revealed by pressing the 4TelI
Me More button. For each job initialized locally, the job window displays the location (the machine
in the network) of each participating task agent.
This experiment is conducted in a harsh environment, where machines are continuously being added
and removed from the cluster. A user can withdraw from the experiment at any time by performing
one of the following actions:
2 3 4
0 8 6 4 2

The first bar (Faults) indicates that 38% of parallel jobs in this experiment incurred a fault. In this
context, a fault refers to lost agent(s) that were terminated prematurely (before completing the task).
The second bar (Delay) represents the time required for a job to complete. A comparison was made
between two sets of identical jobs those that incurred faults and those that were fault free. On
averagejobs with faults took 19% more time to complete. The third bar (Detection) denotes the
number of faults that were successfully detected (99%). The final bar (Results) indicates the job
success rate; 93% of jobs that incurred faults completed successfully (in this context, a job that
completes successfully is defined as a job that yields a correct result).
7.3 Conclusion
The results collected in this experiment provide an evaluation basis for the fault tolerant design
scheme in the Cotton software architecture. The experiment was designed to rigorously test the
software. The fact that 38% of the jobs incurred faults, significantly greater than that of a system
under normal operationsupports this design goal.
The average delay incurred by jobs with faults provides insight into the detection and recovery
process. Fault detection is essentially performed via surveillance. The detection time (the time that
elapses before fault determination) is an adjustable parameter, which in most instances, is negligible
compared to that of the recovery process. In this context, complete recovery time includes the
recovery of a lost agent and all descendants. This time is directly proportional to when and where
the fault occurred in the agent tree hierarchy. For instance, consider an agent that generates a deep
tree of descendants. If this agent is terminated early (relative to the time it was created), its
descendant tree is still shallow. In this case, work lost from the descendants is minimal and hence
total recovery time is small. In the alternative, the descendant tree is mature. Total recovery time is
costly due to the relatively substantial amount of work lost.
The 19% detection and recovery delay in this experiment is reasonable, although this could be
improved with an enhanced recovery scheme. This brings to light an opportunity for future work
involving multi-level recovery that is discussed in the following chapter.
The fault detection data suggests a very high success rate. Agent monitors that provide surveillance
performed as designed by detecting premature agent deaths and initiating the recovery process. The
results indicate a disparity in the overall success rate of fault incurring jobs; although 99% of faults
were detected, only 93% of jobs completed successfully. This discrepancy suggests a flaw in the
recovery procedure. An analysis of the software design and implementation has pinpointed several
potential sources of recovery errors :
The threading performance and implementation are somewhat suspect. In many
instances, the software incorporates a substantial number of threads (dependent upon
the application and the environment). Threading performance has been shown to
degrade at various thresholds. In addition, this multi-threaded environment introduces
a level of complexity that sets forth the potential for non-deterministic behavior.
A bandwidth limited network connection (between agents) can result in erroneous fault
detection. In this instance, the recovery process is initialized prematurely due to the
fact that responses to monitor requests are never received.

The Cotton software architecture makes heavy use of the Java threading package. Each task agent is
associated with three threads of execution one for the agent and two for the agent monitor. In
addition, each local service (Agent Host, Agent Server, Thread Pool, etc) incorporates some number
of threads. This multi-threaded environment introduces a number of complex issues that revolve
around task coordination and synchronization. Threading errors, including deadlock and race
conditions, are possible due to indiscriminate timing; it is not possible to prove correctness in this
The second multi-tasking issue involves the Java tlireading implementation. We have observed
inconsistencies in the Java threading performance that arise at various thread count thresholds. These
inconsistencies are possibly a result of the garbage collector that runs intermittently as a prioritized
A viable solution to these threading issues is to reduce the complexity associated with this
multithreaded environment. Several options, which include incorporating a process model and
experimentation with various virtual machines, are explored in the following chapter.
In addition to threading issues, the physical network connection is a potential problem source for
error recovery. In this case, the surveillance timeout period is critical. If this is set too low, agents
are continuously regenerated before completion. If set too high, jobs that incur faults will result in
unreasonable computation time. Faults will not be detected until long after they occur. Ideallythis
is a dynamic parameter that is dependent upon the physical network connection between parent and
7*4 Summary
This chapter covered the design and implementation of a Cotton web-based experiment. The
software architecture can be evaluated with an analysis of the experimental data. The results
demonstrate a robust system, supporting both the agent-based model and corresponding service
components. There are two isolated areas (threading and network connections) that leave room for
improvement. In particular, the functional programming paradigm proved to be a sound method of
fault tolerance.
+Since this writing, an error was discovered in the agent initialization procedure. This error involves
recovery of an agent that dies (terminates prematurely) after creation but prior to monitor
initialization. This problem has been corrected in the latest version of the Cotton system.
Preliminary results indicate a significant improvement in the job success rate.

8 Conclusion
The computational intensity of software applications continues to grow at a rate which outpaces that
of advances made in hardware technology. Mainstream applications whose demands were once
satisfied by a single processor machine now require multiprocessing architectures. Cluster computing
offers a viable solution that poses no rigid boundaries. This parallel-computing environment
provides a cost-effective platform that dynamically expands to meet ever-increasing demand.
8.1 Review
Application execution in a distributed parallel environment necessitates fundamental operational
services such as task distribution, coordination and messaging. Additional considerations fall under
the realm of dynamic adaptation, including performance optimization and fault tolerance. To this
end, there exist numerous custom solutions that are fully integrated into specific applications. This
involves a static mapping between the software services, the execution environment and the
We have presented a computing cluster software architecture designed to host parallel applications in
a distributed environment. The architecture consists of a parallel programming model and service
framework that incorporates both fundamental services and extended features. A cluster
environment is fully exploited by a dynamic software system; one that exhibits se-tuningself-
healing behaviors in the form of adaptation and fault tolerance. The Cotton software architecture
provides an execution environment for parallel applications that can be characterized by the
Fault Tolerant
Platform Independent
Application Transparent
The programming paradigm in the Cotton architecture is based on task parallelism. Task parallelism
involves three distinct states (execution, communication, creation) in which a computation is
performed, data is exchanged and new tasks are initialized. The absence of a central data store or
static task mapping, in conjunction with implementation flexibility, gives rise to an adaptive, fault
tolerant architecture.
The Cotton design strategy incorporates an agent model as a natural task parallelism abstraction. In
tms model, task agents collaborate to perform computations. Each task agent is responsible for a
piece of the parallel processing task. Agents are dynamically distributed throughout the network
based on current environmental parameters (such as processor load). The task agent computation
model is loosely based on Master-Slave parallelism. In this scheme, master agents (known as
parents) create and delegate tasks to slave agents (known as children). This process results in a
hierarchical agent computation tree.

Task agents in the Cotton architecture are different than that of most agent-based distributed systems.
These agents are explicitly lightweight as opposed to 4tintelligent,\ Agents incorporate a functional
programming paradigm that enforces a referentially transparent, side-effect free implementation.
Computations are both independent of execution sequence and only dependent upon input parameters
(as opposed to some saved state in the system). This paradigm supports fault tolerance by
simplifying the execution environment while providing the expressiveness that is required for
complex applications.
The simple heuristics of Cotton agents are akin to principles of many natural living systems. An
individual agent is a negligible entity, yet in collaboration they yield tangible results. The
lightweight agent model supports a distributed rollback recovery scheme. In this context, the
recovery of lost agents remains independent of global system state and computation progress. The
recovery process is isolated to that of the parent agent who lost a child.
The Cotton architecture provides application developers with a service framework. Parallel
application development is reduced to task agent algorithm implementation, where all services
remain transparent to the application. The architecture encapsulates all of the underlying operational
utilities including messaging (between agents), load balancing and agent distribution, resource
discovery, and fault detection and recovery.
The Cotton software is implemented in the Java programming language. Java supports a
heterogeneous environment where programs are compiled into an architecture neutral bytecode that
is interpreted by platform dependent virtual machines (JVM). The software was tested with an
experiment that uses the JVM inside of a web browser. This experiment is modeled after a volunteer
computing cluster, where participants navigate to a home page and download the Cotton software as
a Java applet.
This experiment was designed to test the software in a harsh environment, where resources are
continuously added to and removed from the network. The results demonstrate operational service
functionality and adaptive system behavior. The fault detection and recovery mechanisms proved to
be sound. Enhancements are possible by streamlining the threading implementation, addressing both
the complexity and overhead associated with this environment.
8.2 Future Work
We have identified several key areas that provide opportunities for future work, involving
enhancements and modifications to the current architecture. Each of these opportunities is discussed
8.2.1 Data Input Streams
The Cotton agent model is an extension of the Master-Slave pattern. In the Cotton architecture, a
parent (master) agent delegates tasks to child (slave) agents. Computation data is passed in to the root
(top-level master) agent. The root agent then delegates tasks to some number of child agents. Child
agents, in turn, can delegate tasks to new child agents.
Many applications have the potential to ingest a tremendous amount of data in a very short time
(such as real-time signal processing). In this instance, a single entry point (the root agent) can be a
performance bottleneck. A viable solution to this problem is to provide data entry capabilities at

multiple locations in the agent tree hierarchy. The following diagram depicts both the current and
enhanced agent tree implementations.
--------------- ...........................................
Current data entry point - Multiple data entry points
only at root agent. with enhanced architecture.
Figure 8-1 Multiple Data Input Stream Agent Tree
This scheme provides multiple data input streams at the expense of complicating the fault recovery
process. In this context, the functional programming paradigm is breached. Each task agent is no
longer guaranteed to be free of side-effects. The current distributed checkpointing and rollback
recovery implementation is inadequate due to the fact that the state of an agent may be dependent
upon the external system state. Potential solutions involve research that examines the feasibility of
augmenting the agent-based functional programming model.

8.2.2 Multithreading Options
The Cotton software architecture makes extensive use of multi-threading in order to achieve optimal
performance. The benefits realized in this implementation can be offset by increased complexity.
An analysis of the software has revealed that this complexity is a potential source for errors.
A solution to be explored involves incorporating a process model such as Communicating Sequential
Processes (CSP). As covered in chapter 3, CSP is based on a mathematical language in which
software behavior can be both defined and substantiated [13]. Communicating Threads for Java
(CJT) is a threading package for the Java programming language that is derived from the CSP model
[13]. All communication is performed via channels that provide synchronized communication.
This opportunity involves research that quantifies the overhead introduced by such a model. The
benefits realized from streamlining the threading implementation might outweigh a performance
A multithreading enhancement to be explored involves experimentation with various
implementations of the Java garbage collector. The garbage collection mechanism was identified in
the previous chapter as a potential source of errors. There currently exist several virtual machine
implementations that 1)claim to optimize garbage collection performance and 2) provide various
degrees of user-level control.
8.2.3 Multi-Level Recovery
As suggested in the previous chapter, the fault recovery time is a function of where (in the agent tree)
and when the fault takes place. A significant recovery delay is incurred when an agent terminates
prematurely after generating a deep tree of descendants. In this case, the recovery process consists of
regenerating the agent and all descendants.
A sophisticated descendant retrieval mechanism could minimize recovery delay. The following
diagram depicts this retrieval process.

Figure 8-2 Descendant Retrieval
The impact of a lost child agent is minimal if the parent agent has the ability to retrieve immediate
grandchildren (and hence all of the grandchild's descendants). This is not possible with the current
implementation. Agent communication follows strict guidelines. An agent communicates (and is
only aware of) its parent and immediate children.
A descendant retrieval scheme complicates both agent model and the recovery process in two ways:
1) Parent agents must be aware of the existence and physical location of all grandchildren.
2) A parent-grandchild data stream must be established during the retrieval process. The
existing grandchildren must then be reconnected with the regenerated child.
An enhanced architecture that addresses these issues will certainly introduce additional overhead.
The challenge is to design a descendant retrieval mechanism that minimizes the impact, both in terms
of complexity and overhead, to the current architecture.
8.2.4 High Performance Architecture
Work is currently underway on a high performance version of the Cotton software architecture, using
a high bandwidth, low latency networking medium (Myrinet). This effort involves incorporating the
Jaguar (Java Access to Generic Underlying Architectural Resources) model [35] that is based on
bypassing the operating system for messaging. Jaguar provides applications direct access to system

resources (a necessity for high-speed communication) while maintaining the protections of the Java
environment. This is accomplished with a compile-time transformation of predefined Java bytecode
segments into machine code primitives. This transformation provides application level access to
memory regions outside of the Java heap. In turn, Java programs can manipulate memory mapped
I/O and network devices through Java objects.
8.3 Summary
This thesis makes two contributions to cluster computing technology. The first is a novel approach
to parallel programming with an agent-based functional model. The second is an application
transparent service framework that supports this programming model. The Cotton architecture
incorporates this service framework, in conjunction with an agent-based functional model, to yield an
adaptive, robust fault tolerant execution environment for parallel applications.


The following guidelines outline the Cotton application programmer interface (API):
Task Agent implementations inherit from an abstract Distributed Agent base class. Algorithm
specific code is located in the overridden doJob() method. This method is invoked immediately
following agent initialization.
Agents communicate via Agent Links. Each agent is provided a link to its parent (as a class
member variable myParentLink). The first steps in the doJob() method is to read input data from
the parent agent. The last step in the doJob() method is to write the result back to the parent agent:
Typedata =; // read input data from parent agent
... // perform task
myParentLink.writeData(result); // write result back to parent agent
After receiving input data from the parent, the agent may need to create one or more child agents
(depending on the specific algorithm) in order to further partition the task. Child agent creation is
a three step process:
1) Create a new child agent of the desired type:
QuickSortAgent childAgent = new QuickSortAgent();
2) Create a link to connect to this new child agent:
AgentLink childLink = new AgentLink();
3) Connect this agent (denoted by the keyword thisto the new child by calling the
connect() method of the Agent Link class:
childLink.connect(this, childAgent); // connect this agent to the child
If a child agent does not need to be created, it may be desirable to make a loopback connection. In
this case, an Agent Link is created and used to connect the agent to itself:
AgentLink loopback = new AgentLink();
loopback.connect(this, this); // connect this agent to itself
As with the parent agent, communication with child agents (and loopback connections) is
performed via the Agent Link. Data transfer is accomplished with the writeData() and
read() methods of the Agent Link class:
childLink.writeData(data); // write task data to the child agent
result =; // read the result from the child agent
This API is demonstrated in the implementation of a quicksort algorithm that is included on the
following pages.

// Read input data (string to sort) from parent agent.
String inputstring = myParentLink.readstring{)
II Get pivot (first) character.
Character mid_ = new Character(message.charAt(0));
// Create array to hold the low characters,
char[]loChars = new char[message.length{)];
int loNum = 0;
II Create array to hold the high characters.
char[] hiChars = new char[message.length()]
int hiNum = 0;
// Sort the input string into two substrings,
for (int i =1i < inputstring.length() ++i)
char c = inputstring.charAt(i)
if (c <= mid_.charValue())
loChars[loNum] = c
hiChars[hiNum] = c
// Create strings using the low and high character arrays.
String loToSort = new String(loChars/ 0,loNum)
String hiToSort = new String(hiChars, 0, hiNum)
II Create an agent link to connect to a new child agent.
AgentLink lolink = new AgentLink{)
II Connect this parent agent 'this' to the new child agent
II 'QuickSortAgent1 passing it the task 'loToSort#. If the
// string is eirpty, use loopback connection,
if (loToSort.length() > 0)
lolink.connect(this, new QuickSortAgent(),loToSort)
lolink.connect(this, this, null);
// Write the task 'loToSort" to the child agent (via the lolink).
II Create an agent link to connect to a new child agent.
AgentLink hilink = new AgentLink()
// Connect this parent agent 'this1 to the new child agent
II QuickSortAgent# passing it the task 'hiToSort#. If the
// string is empty, use loopback connection,
if (hiToSort.length() > 0)
hilink.connect(this, new QuickSortAgent(), hiToSort);
hilink.connect(this, this, null);
II Write the task 'loToSort; to the child agent (via the lolink).

II Read and combine the results from the two child agents
// (via lolink and hilink).
String sortedString =lolink.readstring() + mid_ + hi 1 ink.readstring{)
// Pass the results 'sortedString" to the parent agent

[1] ALIFE Online,
[2] Becker, T., Application-Transparent Fault Tolerance in Distributed Systems, Proceedings of the
Second International Workshop on Configurable Distributed Systems, pages 36-45, 1994.
[3] Beowulf Project,
[4] Buyya, R., High Performance Cluster Computing: Architectures and Systems, Volume 1,Prentice
Hall, Upper Saddle River, New Jersey, 1999.
[5] Carriero, N., Freeman, E., Gelernter, D., and Kaminsky, D., Adaptive Parallelism and Piranha,
Department of Computer Science, Yale University. 1994.
[6] Computational Science Educational Project, htlp://, 1996.
[7] Dominic, S., Designing a Meta-Level Architecture in Java for Adaptive Parallelism by Mobile
Software Agents, Ph.D. Dissertation, Department of Mechanical Engineering, Colorado State
University, 1999.
[8] Foster, 1., Designing and Building Parallel Programs: Concepts and Tools for Parallel Software
Engineering, Addison-Wesley, Reading Massachusetts, 1995.
[9] Franklin, S., and Graesser., A., Is it an Agent, or just a Program?: A Taxonomy for Autonomous
Agents, Proceedings of the Third International Workshop on Agent Theories, Architectures, and
[10] Geist, Am Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V., PVM: Parallel
Virtual Machine: A User's Guide and Tutorial for Networked Parallelism,,1994.
[11] Ghanea-Hercock, R., Collins, J.C., and Ndumu D.T., Heterogeneous Mobile Agents for
Distributed Processing, Proceedings of the Third International Conference on Autonomous Agents,
pages 398-399, 1999.
[12J Harper, R., Nagle, G., and Serrano, M.T Use of a Functional Programming Model for Fault
Tolerant Parallel Programming, In Proceedings ol'the Nineteenth Symposium on Fault-Tolerant
Computing, pages 20-26, 1989.
[13] Hilderink, G. H., Communicating Threads for Java, University of Twente, the Netherlands, 1998.

[14] Hiltunen, M., and Schlichting, R. D., Adaptive and Fault-Tolerant Systems, Department of
Computer Science, University of Arizona, 1995.
[15] Hunt, G., Microsoft Millennium Project,
[16] Jalote, P., Fault Tolerance in Distributed Systems, Prentice Hall, Englewood Cliffs. New Jersey,
1994. "
[17] Java Language Overview, http://java.sun.ct)m/docs/white/index.html.
(18] Jini,
[19] Lange. D., and Oshima, M., Programming and Deploying Java Mobile Agents with Aglets.
Addison Wesley, Menlo Park, CA,1998.
[20] Langton, G.C., Artificial Life, SFI Studies in Ihe Sciences of Complexity, Volume VI, pages 1-47,
Addison-Wesley, Redwood City, CA,1989.
[21j Linda, Scientific Computing Associates Inc.,, 1997.
[22] McColI, B., BSP Computing, Oxford University Computing Laboratory,, 1999.
[23] National Scalable Cluster Project,
[24] Pant, A., An Efficient Implementation of Java Stream Sockets on VIA, National Center for
Supercomputing Applications.
[25] Sarmenta, L., An Adaptive, Fault-tolerant Implementation of BSP for Java-based Volunteer
Computing Systems, International Workshop on Java for Parallel and Distributed Computation (IPPS
[261 SETI@home, http://setiathome.ssl.berkeley.eclu/index.html.
|27j Skibinski, J., Haskell Companion, http://www.numeric-
|28] Snir, M.. Otto, S., Huss-Lederman, S., Walker, D., and Dongarra. J., MPI: The Complete
Reference, MIT Press,, 1996.
[29] Sterling, T., Salmon, J., Becker, D., and Savarcse. D., How to Build a Beowulf, MIT Press,
Cambridge, Massachusetts, 1999.
[30] Suzuki, M., Katayama, T., and Schlichting, R. D., Implementing Fault Tolerance with an Attribute
and Functional Based Model, Japan Advanced Institule of Science and Technology and the University
of Arizona Department of Computer Science.
[31] Tyree, J., Load Balancing in a Distributed A^ent Architecture. Master's Thesis, University of
Colorado at Denver, 1999.

[32] VI Architecture,
[33] Von Schweber, E.T and Von Schweber, L., Computing's Next Wave, PC Week Online,, 1998.
[34] Weissman, B., Gomes, B. A., and Feldman, J. A., Active Threads: Enabling Fine-Grained
Parallelism in Object-Oriented Languages^ International Conference on Parallel and Distributed
Processing Techniques and Applications (PDPTA 98), 1998.
[35] Welsh, M., and Culler, D., Jaguar: Enabling Efficient Communication and I/O in Java, University
of California at Berkeley, Department of Computer Science, 1999.
[36] Zomaya, A., Parallel and Distributed Computing Handbook, McGraw-Hill, New York, 1996.

Full Text


"The fabric is a dynamic architecture that eliminates rigid network boundaries to yield a completely fluid system ... Development and deployment of distributed systems has the potential to become a real-time process that is a simple matter of configuring resources ill the fabric.


"The question is not: will the systelll becollle too cOlllplex to lIlanage? It already is. What we need a system that will take care of itself"


process model,


correctness domain "A system with a smaller correctness dOlllain is usually easier to COllstruCt and Jaster to operate since it can be based more simplifyillg assumptions. best optimal and


fail stop


backward error recovery


checkpointing,forward error recovery


The meanillg of an entity is unchanged when a part of the emity is replaced with an equal part.


"Function invocation constitutes a natural rollback point ... The/unction to be computed, its arguments, and the destination of the jilllction 's result are all the information necessary to restart the function, al/d can be saved by the function's caller at low cost ... Because of referential transparency, the result of the restarted computations is identical to that of the original computatiol/, merely delayed in time.


correctlless domaill




adaptive placement.




lightweight lightweight




ApplicationTrallSparent Fault Tolerance in Distributed Systems, High Peiformance Cluster Computing: Architectures and Systems, Adaptive Parallelism and Piranha, Designing a Meta-Level Arclzitecture in lava for Adaptive Parallelism by Mobile Software Agents, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Is it an Agent. or just a Program?: Taxonomy for Autonomous Agents, PVM: Parallel Virtual Maclzine: User's Guide and Tutorialfor Networked Parallelism, Heterogeneous Mobile Agents for Distributed Processing, Use of a Functional Programllling Model for Fault Tolerant Parallel Progralllming, COllllllunicating Threads for lava,


Adaptil'c and Fault-Tolerant Systems, Fault Tolerance in Distribllted Systems. Programming and Deploying ja\'(/ Mobile Agenl.l \\'ith Aglets. Artificial Life, BSP COIllPlltillg, An Efficient Implementation of ja\'{/ Strealll Sockets on VIA, All Adaptil'e. Fallil-tolerant IlIIplelllentation of BSP for jal'{l-based Volllnteer Compllling Systellls, MPI: The Complete Reference, How to Bllild a BeolVlllf, IlIIplelllenting Fault Tolerance \\'ilh AlfriiJllle and FUllctional Based Model, Load Balancing in a Distributed Agent Architectllre.


Computing's Next Wave, Active Threads: Enabling Fine-Grained Parallelism in Object-Oriented Languages, Jaguar: Enabling l:.1ficient Communication and 110 in Java, Parallel and Distributed Computing Handbook,