Citation
Load balancing in a distributed agent architecture

Material Information

Title:
Load balancing in a distributed agent architecture
Creator:
Tyree, John L
Place of Publication:
Denver, CO
Publisher:
University of Colorado Denver
Publication Date:
1999
Language:
English
Physical Description:
x, 89 leaves : illustrations ; 28 cm

Subjects

Subjects / Keywords:
Parallel processing (Electronic computers) ( lcsh )
Computer network architectures ( lcsh )
Computer capacity -- Management ( lcsh )
Electronic data processing -- Distributed processing ( lcsh )
Computer capacity -- Management ( fast )
Computer network architectures ( fast )
Electronic data processing -- Distributed processing ( fast )
Parallel processing (Electronic computers) ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 88-89).
Thesis:
Computer science
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by John L. Tyree.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
44094571 ( OCLC )
ocm44094571
Classification:
LD1190.E52 1999m .T97 ( lcc )

Full Text
LOAD BALANCING IN A DISTRIBUTED AGENT ARCHITECTURE
by
John L. Tyree
B.S., University of Illinois, 1984
M.S.E., Purdue University, 1986
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science
1999


This thesis for the Master of Science
degree by
John L. Tyree
has been approved
by
Christopher Smith


Tyree, John L. (M.S., Computer Science)
Load Balancing in a Distributed Agent Architecture
Thesis directed by Associate Professor William J. Wolfe
ABSTRACT
In this thesis a load balancing algorithm for a distributed agent system was
examined and tested. The agent system studied is called Cotton. Cotton distributes
work among processors in a network by spawning agents and providing them a
portion of a task to perform. Cotton was designed to solve data intensive
problems that place heavy loads on networks as well as processors. A goal of Cotton
is to perform tasks in the least amount of time possible by maintaining a balance
between network and processor utilization. A load balancing algorithm called
Metropolis was evaluated to determine the appropriate input parameters for data
intensive tasks that run on bandwidth limited systems. The performance metric
considered was time to completion of a task. Network resource consumption was
considered to be a secondary criterion. A discrete event simulation was built and
used to test Metropolis. Input factors considered include: network and processor
weights, minimum and maximum hop distances, number of choices per hop distance,
and type of load determinant. Results of the simulation indicate that a simple load
determinant based on task queue length, combined with minimum hop distances and
numbers of choices, performed best.
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.


CONTENTS
Figures...................................................................ix
Tables.....................................................................x
Chapter
1. Introduction............................................................1
2. Review of Literature...................................................2
2.1 Load Balancing Algorithm Characteristics..............................2
2.1.1 Adaptability.........................................................2
2.1.2 Algorithm Coordination Location....................................2
2.1.3 LBA Policies.......................................................3
2.1.3.1 Information Policy.................................................3
2.1.3.2 Transfer Policy....................................................4
2.1.3.3 Selection Policy...................................................4
2.1.3.4 Location Policy....................................................5
2.1.3.5 Acceptance Policy..................................................5
2.2 Comparative Performance of Some LBA Features.......................5
2.2.1 Task Preemption....................................................5
2.2.2 Proximity..........................................................6
2.2.3 Topology...........................................................6


2.2.4 An Adaptive Strategy................................................7
2.2.5 Managed State Knowledge Strategy....................................7
2.2.6 Biased Load Information.............................................8
2.2.7 Load Determinants...................................................8
3. Mobile Agent System Description........................................10
3.1 General Characteristics...............................................10
3.1.1 Hierarchical........................................................10
3.1.2 Distributed.........................................................11
3.1.3 Heterogeneous Agents................................................11
3.1.4 Processor and Network Heterogeneity.................................12
3.1.5 Non-Uniform Resource Utilization....................................12
3.1.6 Lack of Application Side Effects in Agents..........................12
3.1.7 Efficient Medium and Coarse Grained Parallelism.....................13
3.1.8 Artificial Life Category of Agents..................................13
3.2 Agent Movement Decisions..............................................13
3.2.1 Available Information...............................................14
3.2.2 Non-Available Information...........................................14
3.2.3 Metropolis Algorithm................................................15
3.3 Performance Criteria..................................................17
4. Cotton Simulator.......................................................18


4.1 Purpose of Simulation.............................................18
4.2 Simulation Framework..............................................18
4.2.1 Architecture......................................................18
4.2.2 Simulation Topology...............................................21
4.2.3 Input Parameters..................................................22
4.2.4 Agent Life Cycle..................................................24
4.2.4.1 Starting State....................................................25
4.2.4.2 Latency State.....................................................25
4.2.4.3 Link State........................................................25
4.2.4.4 Preprocess State..................................................26
4.2.4.5 Spawn State.......................................................26
4.2.4.6 Work State........................................................26
4.2.4.7 Merge State.......................................................26
4.2.4.8 Finished State....................................................27
4.2.5 Work Consumption and Tasks........................................27
4.2.6 Tracking Resource Consumption.....................................27
4.2.7 State Progression.................................................28
4.2.8 Resource Slicing in the Upper Level Simulation....................29
4.2.9 Agent Scheduling and Resource Slicing at a Low Level..............31
4.2.10 Agent Path Creation and Selection.................................31


4.2.11 Move selection methods...........................................32
4.2.12 Load Determinants................................................32
4.2.12.1 Method 1: Number of Active Agents................................33
4.2.12.2 Method 2: Number of Active and Waiting Agents....................33
4.2.12.3 Method 3: Estimated time steps assuming 2 work units.............34
4.2.12.4 Method 4: Estimated time steps assuming current agents work.....34
4.2.12.5 Method 5: Estimated time steps using known work remaining.........35
4.2.13 Spawning depth and BF and work division........................36
4.2.14 Randomness in the Simulation.....................................37
4.2.15 Description of I/O configurations, possible output...............38
4.3 Comparison of Cotton with Simulation.............................44
4.3.1 Time.............................................................44
4.3.2 Resource Consumption.............................................44
4.3.3 Movement.........................................................44
4.3.4 Work.............................................................45
4.3.5 Information Availability.........................................45
4.3.6 Topology.........................................................45
5. Experiment Descriptions and Rationale..................................46
5.1 Input Parameters....................................................47
5.1.1 System Size.........................................................47


5.1.2 Fixed System Architecture Parameters...........................48
5.1.3 Variable System Architecture Parameters........................48
5.2 Agent Movement Input Parameters.................................53
6. Results...........................................................55
6.1 Effect of CPU Weight/Network Weight and Temperature.............55
6.1.1 CPU Limited System.............................................55
6.1.2 Bandwidth Limited System.......................................55
6.2 Effect of Maximum and Minimum Hop Distances....................57
6.3 Effect of Number of Choices Per Hop Distance...................58
6.4 Effect of Load Determinant.....................................58
6.4.1 NumActiveAgents and Active&WaitingAgents.......................58
6.4.2 2WorkUnitsPerLinkAgent and CurrentWorkUnitsPerLinkAgent........59
6.4.3 Actual.........................................................59
6.4.4 Load Determinant Comparisons...................................59
6.5 Effect of Agent Movement Method.................................60
6.6 Effect of Grid Size.............................................60
7. Summary...........................................................63
Appendix.............................................................65
References...........................................................88



FIGURES
Figure
3.1 Hierarchical Tree of Agents on Multiple Processors....................11
3.2 Metropolis Movement Choice Based on Energy Potentials..................17
4.1 Swarm Architecture.....................................................20
4.2 Simulation Topology....................................................21
4.3 State Diagram for Agent Life Cycle.....................................24
4.4 Input parameter frame..................................................38
4.5 Agent location and density grid........................................40
4.6 Host and Link Utilization..............................................41
4.7 Number of agents versus time...........................................42
4.8 Total host and link utilization........................................42
4.9 Agent state histogram..................................................43
5.1 System Architectures Derived from Processor Speed and Bandwidth.........47
5.2 Cycle Quota Selection for CPU Limited System...........................50
5.3 Packet Quota Selection for Bandwidth Limited System....................51
5.4 Latency Selection for CPU Limited System...............................52
5.5 Latency Selection for Bandwidth Limited System.........................53
6.1 Mobile Agent Ratio for Temperature = 0 Case............................57
6.2 Grid size 80x80 depicting approximate extent of agent movement.........62


TABLES
Table
4.1 Input Parameter Descriptions.........................................24
4.2 Resource Slicing Example for One Time Step..........................30
4.3 Effects of Randomness in Simulation.................................38
5.1 Input parameters for four system architectures.......................49
6.1 Effect of Grid Size on Time, Processors Visited and Mobile Agent Ratio ....61


1. Introduction
The abundance of low-cost processors with ever-increasing speed continues
to spur interest in the field of parallel processing. The focus of this investigation is a
platform independent parallel processing system herein referred to as Cotton. Cotton
uses lightweight mobile agents to distribute and execute tasks on a networked system
of processors. Cotton was designed to solve data intensive problems that
place heavy loads on networks as well as processors. A goal of Cotton is to perform
tasks in the least amount of time possible by maintaining a balance between network
and processor utilization. The context of network utilization includes bandwidth as
well as latency. The purpose of this investigation is to examine this load-balancing
problem. Load balancing in Cotton is performed using a relatively simple heuristic
called Metropolis which agents use to make movement decisions. A simulation of
Cotton was built to examine the effectiveness of Metropolis.
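The Metropolis heuristic used for movement decisions follows the classic Metropolis acceptance rule (energy potentials and a temperature parameter figure in the chapters that follow). As a rough illustration only — this is a reconstruction of the standard rule, not Cotton's code, and the energy and temperature inputs are assumptions:

```python
import math
import random

def metropolis_accept(current_energy, candidate_energy, temperature,
                      rng=random.random):
    """Metropolis rule: always accept a move to a lower-energy (less
    loaded) destination; accept an uphill move with probability
    exp(-dE / T), so a higher temperature means more exploration."""
    delta = candidate_energy - current_energy
    if delta <= 0:
        return True               # better or equal destination
    if temperature <= 0:
        return False              # T = 0 degenerates to greedy descent
    return rng() < math.exp(-delta / temperature)
```

At temperature zero an agent moves only downhill, which corresponds to the temperature = 0 case examined in the results.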


2. Review of Literature
A review of the research literature was performed to evaluate load balancing
schemes used in parallel processing. Surprisingly, little attention has been paid to the
particular problem of this study, which is balancing network bandwidth and processor
utilization to minimize task time to completion. Although many researchers
recognize a balance is desirable between task migration and network utilization, their
view of network utilization is focused on the overhead imposed by load balancing
communication activities. The cost of moving a task is generally considered to be
negligible. This is not the case with data intensive tasks. Data intensive tasks may
need to move a large amount of data and may be limited by bandwidth. Even though
no other study was found which overlapped with the core purpose of this study, the
range of possibilities available in various load balancing schemes was explored. The
purpose of this review was to summarize general factors involved in load balancing
for many kinds of problems and discuss the effectiveness of some specific strategies.
2.1 Load Balancing Algorithm Characteristics
2.1.1 Adaptability
Load balancing algorithms (LBAs) can be static, dynamic or adaptable.
Static LBAs fix the balancing scheme at compile time and do not use system state.
Dynamic LBAs use system state to adjust the balancing scheme that was set at
compile time. Adaptable LBAs select their sub-LBAs at run time based on system
state and adjust them as well. Static LBAs do not require information exchanges that
result in communication overhead, while dynamic and adaptive LBAs generally do
require these exchanges. Dynamic and adaptive LBAs have the advantage of being
able to adjust to changes in the computation and task heterogeneity. Parallel ray
tracing is an example where extreme data dependent imbalances arise during
execution [1]. The authors [1] also found a dynamic LBA scales better than a static
one for the parallel ray tracing problem.
2.1.2 Algorithm Coordination Location
Task migration decisions can be made at a central processor location
(centralized) or at each of the processors (distributed). A centralized LBA must
collect load information (all to one communication) and then send directive


commands (one to all communication). If the same information is made available in
a distributed LBA, all processors must broadcast their load information (all-to-all
communication). It is not a requirement that centralized or distributed LBAs
communicate to the full extent described above. For instance it is common that
processors in a distributed LBA may only communicate with their adjacent
neighbors. Similarly, processors in a centralized LBA may communicate when their
state changes. A combination of centralized and distributed coordination can also be
used. A combined example would be clusters of processors each having a central
processor which coordinates with other central processors.
The advantage of a centralized LBA is its ability to formulate the best
decision using all the state information. Advantages of a distributed LBA over a
centralized LBA are: there is no single point of failure and it scales better since
decision making is performed in parallel rather than sequentially. The advantage of
a distributed LBA which uses only local information is that it causes less
communication overhead. The advantages of a clustered LBA are: scalability,
natural mapping to subnetworks, and control over spreading of network and overhead
communication costs [2].
2.1.3 LBA Policies
Vaughn [3] identifies five LBA policies. They are discussed in the following
subsections.
2.1.3.1 Information Policy
The information policy governs the collection, distribution, and usage of load
information. It determines when the information is collected. Information collection
can be demand driven, periodic, or state change initiated. Demand driven collection
may be initiated by the sending processor, initiated by the receiving processor or both
(symmetric). Periodic collection is based on some measure of time while state
change initiated collection is event based. The information policy determines the
quantity of load information and also impacts the quality of information. The quality
of information is a function of the kind of information and its age. Receiver initiated
collection has the advantage of being more communication efficient than sender
initiated collection [4]. Two studies [5, 6] provide results indicating that, overall, the receiver
initiated collection performed best. Another study [7] indicated that sender initiated
collection policy performed better.
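The three collection triggers above can be sketched as a single dispatch function; the policy names and parameters below are illustrative, not drawn from any of the cited systems:

```python
def should_collect(policy, *, demand_pending=False, now=0.0,
                   last_collection=0.0, period=1.0, state_changed=False):
    """Return True when load information should be gathered under the
    given information-policy trigger (hypothetical parameter names)."""
    if policy == "demand":         # sender-, receiver-, or symmetric-initiated
        return demand_pending
    if policy == "periodic":       # based on some measure of time
        return now - last_collection >= period
    if policy == "state-change":   # event based
        return state_changed
    raise ValueError(f"unknown policy: {policy}")
```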


2.1.3.2 Transfer Policy
The transfer policy determines whether or not task(s) will be migrated to
other processors. Vaughn [3] gives three general examples of transfer policies. The
first uses a threshold on the load of the processor at which a task arrives. If the
threshold is exceeded, the task is suitable for migration. The second method compares the
relative difference between the loads of the current and remote processors. The third uses a
threshold on the load difference between the current and remote processors. Thresholds are
useful in reducing the likelihood of excessive migration (thrashing).
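Vaughn's three example transfer policies reduce to simple predicates. A minimal sketch, with hypothetical load units:

```python
def absolute_threshold(local_load, threshold):
    """Policy 1: the task is suitable for migration when the local
    processor's load exceeds a fixed threshold."""
    return local_load > threshold

def relative_comparison(local_load, remote_load):
    """Policy 2: migrate when the remote processor is less loaded."""
    return local_load > remote_load

def difference_threshold(local_load, remote_load, margin):
    """Policy 3: migrate only when the load gap exceeds a margin,
    which damps back-and-forth thrashing."""
    return local_load - remote_load > margin
```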
Some rigorous transfer policies perform load balancing during a
synchronization time step until a near perfect balanced system is achieved. This is
done iteratively until the load distribution has converged to some threshold.
Although such rigorous transfer policies achieve a desired degree of load balance,
they can be costly.
Very few researchers attribute much significance to the cost of data
movement in their transfer policy. This is probably because the tasks they test are
not bandwidth intensive. As an example Zaki [8] used an empirical rule of adding 10
percent to the threshold to account for data movement cost. The most liberal
estimate of data movement cost noted was in Tiermeyer [9] where the time to send a
task was 2 time units and the average time to completion was 13 time units.
2.1.3.3 Selection Policy
Once it is decided that tasks are to be moved, the selection policy identifies
tasks to be transferred. In a system with homogeneous tasks, the selection process
may be random or queue-order-based. In a system of heterogeneous tasks, it is
important to know and send the appropriate size task to achieve balance among
processors. Consequently, the expected execution duration is commonly estimated
and used as a measure of task size. A threshold may also be used on expected
execution duration to avoid sending the smallest tasks.
Rigorous methods of estimating execution duration include having the
application programmer provide the information or profiling recurrent tasks prior to
execution. The application programmer has the best predictive ability but may not
map processor cycle consumption correctly. A profiler can be used to analyze task


behavior prior to execution with the results stored in a runtime library. A
combination of the two may provide the best solution of all.
2.1.3.4 Location Policy
The location policy selects the destination for a task migration. This policy
may be sender initiated, receiver initiated or both (symmetric). If the size of the task
to be migrated is known, the granularity of the task size becomes important in
selecting a destination.
2.1.3.5 Acceptance Policy
The acceptance policy determines whether a destination processor can refuse
to accept a task sent to it by another processor. Refusal may be based on a threshold
load. This policy also regulates the number of processors a rejected task can visit
before it must stop moving and execute.
2.2 Comparative Performance of Some LBA Features
A load balancing algorithm is composed of many policies, each of which may
have many variations. Consequently, an enormous number of LBA combinations exists. Many
different LBAs exist in the literature. Each study has its own external input such as
degree of loading, task heterogeneity, network topology, etc. which makes it difficult
to compare LBAs. The following subsections discuss some interesting results that
focused on a limited number of LBA parameters.
2.2.1 Task Preemption
Some LBAs allow task migration after a process has begun. This is referred
to as task preemption. An advantage of this option is the greater flexibility provided
to balance load at any time. A disadvantage is the state of a process in progress can
be large and complicated to pack up and send. Weissman [10] developed a thread
migration strategy that reduces the overhead involved in saving thread state,
transferring it, and recreating the thread to reasonable levels.
Harchol-Balter [7] pointed out that most previous studies discourage
preemptive migration due to the inherent inefficiencies. However, they contend that a
preemptive scheme performs better. The reason given for the discrepancy is that


many other authors assume artificial job arrival distributions such as Poisson and
exponential process lifetime distributions for computational convenience. These
distributions do not match up well to real system behavior. They profiled one million
processes on multi-use UNIX systems and used the distributions derived from this
profiling. The distributions were predictable across a variety of processors and
workloads. Their strategy identifies old processes that, by virtue of their age, are expected
to have long remaining lifetimes. Long-lived processes are chosen for migration.
More long-lived processes exist in real systems than exist in simulated systems with
the previously described arrival and duration distributions. This resulted in fewer
processes migrating (less overhead), with the most viable candidates being selected.
Overall performance improved using this strategy.
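The age-based selection idea can be paraphrased in a few lines; this is a sketch of the heuristic as described, not Harchol-Balter's implementation, and the (pid, age) representation is assumed:

```python
def pick_migration_candidates(processes, min_age, k=1):
    """Choose the k oldest processes past an age floor. Under the
    heavy-tailed lifetime distributions observed on real systems, a
    process's age predicts a long remaining lifetime, so old
    processes are the most worthwhile candidates to migrate."""
    eligible = [(pid, age) for pid, age in processes if age >= min_age]
    eligible.sort(key=lambda p: p[1], reverse=True)   # oldest first
    return [pid for pid, _ in eligible[:k]]
```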
2.2.2 Proximity
Proximity is a useful metric in load balancing when communication costs and
quality are important. Increased distance between communicating processors
increases network traffic and time of task transfer. Delays due to transport increase
the information age, which decreases information quality. Both of these factors affect
performance. Billard [11] uses a scheme where the entire set of processors in a
system is dynamically divided into sets of overlapping clusters. Load balance
decisions are limited to within a cluster. The size and membership of a cluster are
adjusted based on feedback regarding the quality of information received. Cluster
membership is not explicit but is effectively achieved by giving weights to other
processors. Higher preference is given to processors that have most recently
provided good information. Billard [11] claims that this technique decreases
performance degradation by half when information quality is less than optimal.
2.2.3 Topology
Billard [11] showed the effect of load information quality is directly related to
topology. Loh [5] points out that each topology has a unique average processor distance
and a unique degree of connectivity. Loh [5] obtained performance comparisons of 5
different LBAs (gradient, sender initiated, receiver initiated, central job dispatcher,
and prediction based) on 4 different topologies (mesh, hypercube, Fibonacci, and
linear). Their results showed that topologies with smaller average processor distance
and greater average node degree (connectivity) performed best for all LBAs. Even
though the LBAs included centralized and distributed methods, the effect of topology
was consistent.


2.2.4 An Adaptive Strategy
Zaki [12] presented an adaptive method for a network of workstations.
Initially load is distributed evenly between all processors. Next, one of four receiver
initiated LBAs is selected after a relatively short period of evaluation. The purpose
was not to see which of the four LBAs worked best but rather whether a short
evaluation period is adequate to select an optimal LBA. During the evaluation
period, global load information is collected. This method is ideally suited to systems
with transient external load where an adaptive strategy is warranted. The four
selectable LBAs were (1) global (information) centralized (decision control), (2)
global distributed, (3) local centralized, and (4) local distributed. This method was
tested for 2 different algorithms with a range of input data sizes. The performance of
the selected LBA was compared to each of the performances of the other 3 LBAs (as
if they had been selected). This method chose the optimal LBA 75% of the time. In
the cases where the optimal LBA was not selected, the chosen LBA result was very
close to optimal.
2.2.5 Managed State Knowledge Strategy
Lee [13] presents an LBA (SALO) which manages load information from
other processors that many other LBAs dispose of after a single use. They call
this managed load information predictable state knowledge. SALO organizes
decision making into clusters of processors and has a sender initiated information
policy. SALO processors that are overloaded must poll other processors to find a
lightly loaded candidate to send work. Two problems arise with LBAs that rely on
extended local load information: (1) Communication costs rise to unacceptable levels
at high system load and (2) Lightly loaded processors may be selected by many
overloaded processors simultaneously, resulting in over-migration and overloading.
SALO attempts to reduce the high communication problem by keeping track of and
reusing quasi-static load information. The over-migration problem is dealt with by
maintaining a set of mutually exclusive distributed priority queues.
SALO collects predictable state knowledge by having each processor
maintain its own Poll_list, Inform_list and Global_list. The Poll_list contains the
lightly loaded processors in the same cluster; it is the list from which a
processor may poll to determine where to send work. The Inform_list contains the
list of processors in the same cluster that a processor must inform of its state when
it becomes lightly loaded. The Global_list contains the list of potential lightly loaded
processors at other clusters (used to balance load between clusters).
Several strategies are exploited to reduce communication. One of these is that when a
heavily loaded processor polls a candidate processor, the candidate knows
the poller must be heavily loaded and updates its Poll_list accordingly. This happens in
addition to the polling processor updating its Poll_list. Communication overhead
from updating processor tables is controlled by the Inform_list. Efficient inter-
cluster communication is aided by information in the Global_list table.
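A minimal sketch of the three lists and the piggybacked poll update described above; the names and update rules are paraphrased from the text, not taken from SALO itself:

```python
class SaloNode:
    """One processor's predictable state knowledge (illustrative only)."""
    def __init__(self):
        self.poll_list = set()     # lightly loaded peers in this cluster
        self.inform_list = set()   # peers to notify when we become light
        self.global_list = set()   # lightly loaded candidates elsewhere

    def on_polled_by(self, poller):
        # Being polled implies the poller is heavily loaded, so it is
        # no longer a lightly loaded candidate for us -- knowledge
        # gained without any extra message.
        self.poll_list.discard(poller)
```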
SALO's performance was compared to other similar LBAs that did not
exploit predictable state knowledge. SALO's response time was similar to the other
methods for load factors less than 0.5. Beyond a load factor of 0.5, differences of over
100 percent were reported. Another performance measure looked at was the polling
rate. At high system loads (> 0.7) the other method's polling rate grew exponentially
as the system was overloaded with useless polling. On the other hand, SALO's
polling rate remained relatively constant because of its knowledge and avoidance of
heavily loaded processors. In summary, SALO significantly outperformed other
similar LBAs in situations where the system was highly loaded.
2.2.6 Biased Load Information
Zambonelli [14] presents an LBA which distorts the load information passed
between processors to achieve more accurate and less frequent task migration. Their
LBA is distributed and uses only local information. Instead of processors advertising
actual load conditions, a load is broadcast which takes into account the load of other
processors around it. This promotes load balance of regions using only local
information. It also limits the amount of iterative and repetitive migration that would
have been necessary to achieve balance had biased information not been used. The
authors report that biased load information is less reliable when system loads are
changing rapidly and may cause performance to degrade below a non-biased load
strategy.
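The biasing idea amounts to advertising a blend of a processor's own load and its neighborhood average; a sketch with an assumed blend weight:

```python
def biased_load(own_load, neighbor_loads, bias=0.5):
    """Advertise a load mixed with the local neighborhood's average so
    that whole regions, not single processors, appear balanced. The
    blend weight `bias` is a hypothetical parameter."""
    if not neighbor_loads:
        return own_load
    neighborhood_avg = sum(neighbor_loads) / len(neighbor_loads)
    return (1 - bias) * own_load + bias * neighborhood_avg
```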
2.2.7 Load Determinants
A load determinant is the information used to measure load on a resource
such as a processor. Three general categories of load determinant are (1) the amount
of historical consumption, (2) the current number of tasks waiting on a resource, or
(3) a future prediction of the amount of resource to be consumed. The most common


load determinants used in the literature fall into the second and third categories with
the second category being the most predominant.
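As toy examples of the second and third categories (the function names and inputs are hypothetical):

```python
def queue_length_determinant(active_agents, waiting_agents):
    """Category 2: the current number of tasks on the resource -- the
    style of determinant this thesis finds performs best."""
    return active_agents + waiting_agents

def predicted_time_determinant(remaining_work_units, work_rate):
    """Category 3: a prediction of future consumption, here the
    estimated time steps needed to drain the remaining work."""
    return remaining_work_units / work_rate
```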


3. Mobile Agent System Description
3.1 General Characteristics
The mobile agent system examined in this work is called Cotton. Cotton
consists of a set of agents that execute their work on a set of processors connected by
a network. Cotton has the following characteristics:
Hierarchical
Distributed
Heterogeneous Agents
Processor and Network Heterogeneity
Non-Uniform Resource Utilization
Lack of Application Side Effects in Agents
Efficient Medium and Coarse Grained Parallelism
3.1.1 Hierarchical
A system of agents is used to perform a task. A task is initially given to a
root agent that divides the task into subtasks and spawns a child agent for each
subtask. If the child's subtask is further divisible, the child divides the subtask and
spawns more agents. A hierarchical tree of agents is formed in this manner. Some
agents receive an indivisible subtask from their parent agent. Such agents form the
terminal leaves of the hierarchical agent tree. Leaf agents perform some computation
on their subtask and pass the results back up the tree to their parent. All parent
agents in the hierarchy block while waiting for their children to return results. The
task is complete when all child agents have passed their results back up to the root
agent. An agent can either be a root, a branch or a leaf in the hierarchy. Problems
that can be recursively subdivided and distributed are ideally suited for Cotton.
Figure 3.1 shows an example of a tree of agents distributed on multiple processors.


Figure 3.1. Hierarchical Tree of Agents on Multiple Processors
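The divide, spawn, and merge life cycle described above can be sketched as a recursion; `run_agent`, `halve`, and `leaf_work` are illustrative names rather than Cotton's API, and the merge step is shown as a simple sum:

```python
def run_agent(task, divide, leaf_work):
    """Root/branch agents split their task and spawn a child per
    subtask; a leaf agent (indivisible task) does the computation;
    each parent blocks until its children return, then merges."""
    subtasks = divide(task)
    if subtasks is None:                       # leaf agent
        return leaf_work(task)
    return sum(run_agent(s, divide, leaf_work) for s in subtasks)

def halve(xs):
    """Example divide rule: split a list in half until it is small."""
    if len(xs) <= 2:
        return None                            # indivisible subtask
    mid = len(xs) // 2
    return xs[:mid], xs[mid:]
```

For example, `run_agent(list(range(10)), halve, sum)` recursively sums a list by building exactly the kind of agent tree Figure 3.1 depicts.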
3.1.2 Distributed
Agents created on one processor are able to move to other processors.
Consequently, a hierarchical tree of agents performing a task may span and loop
across many processors and network segments. An agent's decision to move to
another processor is based on processor and network load. No architectural
restriction is placed on the frequency of agent movement. An application
programmer can specify when movement is appropriate.
3.1.3 Heterogeneous Agents
An agent system consists of a set of host server processes and two kinds of
agents: load balance agents and task agents. An agent server process runs on each


processor which accepts agents. The agent server receives and sends task agents and
load balance agents. Task agents compose the hierarchy previously described. One
or more task agents may be running on a set of processors simultaneously. Different
kinds of task agents may be running simultaneously. Load balance agents move
cyclically or randomly within the active processor set. Load balance agents and
agent servers exchange load information and update their load tables with the most
current data. The load balance agents also make agent servers aware of newly
available or newly unavailable processors.
3.1.4 Processor and Network Heterogeneity
Agents can execute on any processor that runs the Java virtual machine. The
virtual machine acts as an interface between the operating system and the agent. The
hierarchical characteristic of Cotton does not require a specific network topology or a
certain number of processors. Agents must, however, be aware of the topology in
order to select and evaluate possible agent movements.
3.1.5 Non-Uniform Resource Utilization
Some other parallel processing systems divide a task into N equal subtasks,
send the subtasks to processors, execute the subtasks, and return the results. No
hierarchical distribution of the subtasks is performed. Two periods of relatively
uniform processor use and two periods of relatively uniform network use might
characterize resource utilization in such a system.
Network and processor resource utilization is non-uniform in Cotton. The
division and distribution of subtasks through creation of a hierarchy is typically
slower than in the above-described scheme. Preprocessing work used to create
subtasks is distributed over time as the agent tree expands. Processor utilization over
time is more likely to have a bell-shaped utilization curve with periods of
discontinuity. Network utilization over time is more frequent but less predictable.
3.1.6 Lack of Application Side Effects in Agents
Information passing between task agents is strictly limited to two events: (1)
the passing of the subtask to the child from the parent agent and (2) the return of the
result from the child to the parent agent. Communication between branches on the
agent tree is specifically not provided, to prevent any chance of side effects. Task
agents may communicate with their local agent server to gain processor and network
load information for making movement decisions. Parent and child agents ping each
other periodically to check for viability.
3.1.7 Efficient Medium and Coarse Grained Parallelism
Cotton is designed to impose a minimal amount of overhead on an
application. Agent creation and transport time is minimized to prevent excessive
latency. Agents are lightweight with minimal intelligence to promote efficient
mobility. Consequently, Cotton efficiently implements parallelism for applications
that finely subdivide tasks as well as applications that coarsely subdivide tasks.
3.1.8 Artificial Life Category of Agents
Cotton agents fall into the artificial life category of agent systems. This
category of agent system is characterized by simple individual agent decision making
rather than highly complex or global/centralized decision making. A carefully
chosen configuration and individual agent behavior pattern results in emergent
behavior exhibiting efficiencies or complexities not inherent in any individual
agent's behavior.
3.2 Agent Movement Decisions
Individual task agents make decisions to move or not to move to other
processors. The desired outcome of a set of agent movement decisions is a balance
between processor and network utilization that minimizes time to completion.
Determining the optimal set of movement choices is prohibitively time consuming
for a large agent hierarchy. Consequently, agents make movement decisions based on
a simple heuristic and locally available information.
Task agents base their decision to move or not move on some measure of
processor load and network load. Load balance agents move cyclically (or randomly)
from one processor to another and do not need to make any choices. Task agents can
move only to processors with an agent server running. It is possible for an agent to
move more than once as determined by the application programmer. Not all
information useful for determining processor and network load can be obtained and
used by Cotton. This is due to constraints imposed by efficiency requirements.
Various kinds of available and non-available information are discussed in the next
subsections. Finally, the movement algorithm that implements load balancing,
Metropolis, is described in detail.
3.2.1 Available Information
Load balance agents disseminate individual processor and individual network
segment load information to agent servers. Three kinds of load information are
available. The first is the processor load measured in percent CPU consumed during
some time interval. Similarly, the network load can be measured in percent of
bandwidth used during some time interval. The second kind of information is the
number of agents executing on a processor. The third kind of available information
is the moving agent's awareness of its potential impact on processor or network load.
This means the agent knows enough about its own state to estimate the burden it will
place on its destination processor and network path.
3.2.2 Non-Available Information
Load balance information that cannot be efficiently collected, distributed or
analyzed is considered to be non-available. The classification of some of the
following information as unavailable was arbitrary, as it might be possible to
efficiently gather and process it or some limited variation of it. Information
classified as non-available includes:
- The amount of agent task work consumed to date on each processor and
network segment.
- The amount of work each agent on every processor has remaining.
- The state of all other agents, useful for indirectly estimating work
remaining.
- The fractions of processor consumption from the current task, other agent
tasks, and non-agent tasks. It is possible that a cluster of processors may be
used for non-agent tasks simultaneously with agent tasks. Knowledge of
non-agent work would allow signature detection and avoidance of large
non-agent tasks. Similarly, sniffing network segments to derive patterns of
use would be useful for detection and avoidance schemes.
- A complete history of some kind of load data over time.
- The structure of the agent hierarchy, tracked to avoid inefficient
movements such as cycles.
3.2.3 Metropolis
The method used by Cotton to make movement decisions is the Metropolis
algorithm [15]. Prior to the execution of Metropolis, a remote processor is selected
at random, without regard to load, from a limited set of processors. Metropolis then
makes a choice between moving to a remote processor or staying on the local
processor. The load on the network (fully switched) is considered as well as
processor load.
A process similar to simulated annealing is used along with Metropolis to
force an agent to move to a remote processor even though load information suggests
the local choice is best. Temperature is used as an input to force such uphill
choices. Simulated annealing also uses a temperature input but also implements a
cooling schedule to gradually converge on the optimal solution [16]. The
temperature used with Metropolis does not use a cooling schedule but rather has a
constant temperature. This is because load conditions used for agent movement
decisions are so dynamic that carefully choosing a cooling schedule is not warranted.
Metropolis is based on a simple physical analogy that regards agents as
particles that interact with their environment. One-body potentials are associated
with the load on processors and are repulsive. Two-body potentials exist between
parent and child agents and are attractive. If transport costs between parent and child
are high (i.e. many network hops required, or network is heavily loaded), this
attractive two-body potential is also high. The goal of the Metropolis algorithm is to
minimize the sum of the one- and two-body potentials resulting from a move choice.
The algorithm operates on one particle (agent) at a time using the best available
information.
In practice, load is used to select the best remote move candidate (the
processor with the smallest load). Next the choice between remaining on the local
processor or moving to the remote processor is made using the following equations:
localPotential = localProcessorLoad * processorWeight
remotePotential = remoteProcessorLoad * processorWeight
networkPotential = networkLoad * networkWeight
deltaU = (remotePotential + networkPotential) - localPotential
where
localPotential = the local processor one-body potential
remotePotential = the remote processor one-body potential
networkPotential = the two-body potential
localProcessorLoad = the load on the local processor
remoteProcessorLoad = the load on the remote processor
networkLoad = the load on the network segment between the local and
remote processor
processorWeight = an adjustable constant used to tune movement decisions
networkWeight = an adjustable constant used to tune movement decisions
deltaU = the difference in potential energy of the local and remote choices
If deltaU is negative, potential energy associated with the agent is lowered by the
move and the move is accepted. Otherwise, a final opportunity for forcing
movement is attempted through a modified simulated annealing process. The
probability of accepting an uphill move is calculated from:
moveProbability = exp(-deltaU / temperature)
randomValue = a random number between 0.0 and 1.0
where
temperature is an adjustable constant used to adjust the frequency of uphill
movements
The uphill move is accepted if moveProbability exceeds randomValue. Figure 3.2
illustrates the potentials associated with an agent choosing between a local and a
remote processor.
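The decision procedure above can be sketched directly from the equations. This is an illustrative sketch; the variable names follow the thesis, but the class wrapper and the explicit randomValue parameter are additions for testability, not Cotton's actual code.

```java
// Sketch of the Metropolis move decision built from the potentials
// defined above. The class wrapper and explicit randomValue parameter
// are illustrative additions, not part of Cotton.
class MetropolisMove {
    static boolean acceptMove(double localProcessorLoad,
                              double remoteProcessorLoad,
                              double networkLoad,
                              double processorWeight,
                              double networkWeight,
                              double temperature,
                              double randomValue) {
        double localPotential = localProcessorLoad * processorWeight;
        double remotePotential = remoteProcessorLoad * processorWeight;
        double networkPotential = networkLoad * networkWeight;
        double deltaU = (remotePotential + networkPotential) - localPotential;
        if (deltaU < 0) {
            return true; // move lowers potential energy: accept
        }
        // Uphill move: accept with probability exp(-deltaU / temperature).
        double moveProbability = Math.exp(-deltaU / temperature);
        return moveProbability > randomValue;
    }
}
```

A lightly loaded remote processor produces a negative deltaU and an unconditional move; a heavily loaded one is accepted only occasionally, with a frequency governed by the constant temperature.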


Figure 3.2. Metropolis Movement Choice Based on Energy Potentials. Move with
lowest total energy is selected (except sometimes when temperature is non-zero).
3.3 Performance Criteria
The most significant performance criterion is time to completion for a task.
Minimizing processor and network bottlenecks is closely reflected in time to
completion and so is not considered separately. If time to completion can be kept
near its minimum, secondary performance criteria can be considered. These include:
minimizing network usage, minimizing the number of processors visited, minimizing
unnecessary cyclical movement, and minimizing the average radius (from start
processor) of processors visited by agents.


4. Cotton Simulator
A simulation was built of the Cotton agent system. This section describes the
purpose of the simulation, the simulation framework, and the similarities and
differences between the simulation and Cotton.
4.1 Purpose of Simulation
The Cotton simulation was developed to study load balancing of Cotton and
its effects on performance. It is possible to study load balancing in a real system.
The advantage of using a real system to evaluate load balancing is not having to
justify the adequacy of the mapping between the simulation and the real system.
Another advantage is not having to spend the time building the simulation.
On the other hand, using a simulation has several advantages. A simulation
provides a controlled environment where external forces have no impact unless they
are simulated. A simulation is not restricted to a fixed range of input parameters
(such as processor speed), so it is possible to simulate hardware that does not exist or
is prohibitively expensive. It is easier to take a close look at some behavior by
instrumenting it and observing the output. Isolation of a factor can be accomplished
by turning off noise elsewhere that may be an immovable part of a real system.
Performance measurements are unaffected by extensive data collection and analysis,
whereas this is a problem in a real system. In addition to the advantages mentioned
above, a completed working system was not available on which to perform load-
balancing tests.
4.2 Simulation Framework
4.2.1 Architecture
The simulation was written in Java using the Swarm simulation system
(Swarm) as the simulation engine. Swarm is a toolkit for building multi-agent
discrete event simulations. Swarm was developed by the Santa Fe Institute [17].
The basic unit of simulation in the Swarm system is the swarm, a collection of agents
executing a schedule of actions. Swarm contains tools in the form of object oriented
libraries that can be used to build models, control experiments, display and analyze
results.


The Swarm engine relies on a single top level controller called the swarm
executor. Other objects in a swarm simulation include swarms, schedules, actions,
and domain objects. A swarm contains a schedule that contains one or more actions.
Actions are performed on the domain objects during a time step. During a time step,
the swarm executor causes all the actions of all the schedules of all the swarms
registered with the swarm executor to be executed. One pass through all the actions
constitutes one time step. The executor continues to execute the actions of a
swarm's schedule until the swarm is explicitly de-scheduled. A model run of the
simulation terminates when no more swarms are scheduled to run.
To build a simulation one must create one or more swarms, register the
swarms with the swarm executor, and implement actions for each swarm's schedule.
Typically it is desirable to have some or all of the domain objects perform some
activity during each time step. This can be accomplished by implementing a step
method for the domain objects and calling that step method from within the action
object. The Cotton simulation has two primary swarms, ModelSwarm and
ModelSwarmGUI. ModelSwarm contains a grid of hosts (processors), links
(network connections), and switches (more network connections). ModelSwarmGUI
contains a separate display grid of the same hosts, links, and switches. During each
time step the swarm executor executes all the step methods for each host and link
(switches do not do anything) as part of the ModelSwarm's schedule of actions and
then updates the display grid of the ModelSwarmGUI as part of the
ModelSwarmGUI's schedule of actions. Figure 4.1 shows the simulation
architecture.


[Figure content: ModelSwarm and ModelSwarmGUI each register a schedule of
actions with the Swarm Executor, which executes all registered schedules every
time step. ModelSwarm's actions call the step method of each host and link in the
grid; ModelSwarmGUI's actions update the grid plot of the network.]
Figure 4.1. Swarm Architecture
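The executor's control flow described above can be sketched as a small loop. This is only an illustration of the execution model; MiniExecutor and its members are hypothetical names, not Swarm's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the executor/schedule/action execution model
// described above. This is not Swarm's actual API; all names here
// are hypothetical.
class MiniExecutor {
    interface Action { void perform(); }

    static class Schedule {
        final List<Action> actions = new ArrayList<>();
    }

    final List<Schedule> registered = new ArrayList<>();

    void register(Schedule s)   { registered.add(s); }
    void deschedule(Schedule s) { registered.remove(s); }

    // One time step: run every action of every registered schedule.
    // Iterate over a copy so an action may deschedule its own swarm.
    void step() {
        for (Schedule s : new ArrayList<>(registered)) {
            for (Action a : s.actions) {
                a.perform();
            }
        }
    }

    // A model run continues until no schedules remain registered.
    int run() {
        int steps = 0;
        while (!registered.isEmpty()) {
            step();
            steps++;
        }
        return steps;
    }
}
```

A schedule that de-schedules itself when its work is done terminates the run, just as the simulation ends when no more swarms are scheduled.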
Agents live on and move between hosts and links. Each host and link has a
queue of agents that are given the opportunity to execute inside each host or link each
time step. The lifecycle of agents is executed in the step methods of the hosts and
links. At the beginning of a time step each host is allocated a fixed number of
processor cycles while each link is allocated a fixed number of network packets.
Agents consume cycles on hosts as part of their subtasks or they consume packets on
links when moving. When the resources are completely consumed on a host or link,
agents must wait until the next time step to proceed. In this manner the processor
speed and network bandwidth can be set for a model run.


4.2.2 Simulation Topology
The topology selected for the simulation was a torus, formed from a rectangular
grid whose X-axis edges wrap to each other and whose Y-axis edges wrap to each
other. Components of the
topology include hosts, links, and switches. Switches are inert, consuming no cycles
or packets, but are included to preserve the network topology. Agents may travel
across switches without any impedance. Travel between two nearest hosts requires
crossing 2 links. Figure 4.2 shows a grid of size 8 by 8. All grid sizes have 50
percent links, 25 percent empty squares, 12.5 percent hosts and 12.5 percent
switches. The smallest possible grid is a 4 by 4 that has 2 hosts, 8 links and 2
switches.
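Given these fixed proportions, the component counts follow directly from the number of grid squares. A small sketch (class and method names are illustrative) confirms the stated counts:

```java
// Component counts for an X-by-Y torus grid under the stated
// proportions: 50% links, 25% empty squares, 12.5% hosts, and
// 12.5% switches. Names are illustrative, not simulator code.
class GridComposition {
    static int links(int x, int y)    { return x * y / 2; }
    static int empties(int x, int y)  { return x * y / 4; }
    static int hosts(int x, int y)    { return x * y / 8; }
    static int switches(int x, int y) { return x * y / 8; }
}
```

For the smallest 4 by 4 grid this gives 8 links, 4 empty squares, 2 hosts, and 2 switches, matching the description above.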
Link Link Link Link
Link Host Link Switch Link Host Link Switch
Link Link Link Link
Link Switch Link Host Link Switch Link Host
Link Link Link Link
Link Host Link Switch Link Host Link Switch
Link Link Link Link
Link Switch Link Host Link Switch Link Host

Figure 4.2. Simulation Topology


4.2.3 Input Parameters
The simulation begins with the input of the model run parameters. This can
be from a graphical user interface or as command line arguments. Many of the input
parameters are discussed in later sections within the context of the agent life cycle.
However, for clarity the following table lists and explains the input parameters.
Input Parameter Description
Latency The number of cycles an agent consumes on a host prior to moving onto the network (a link).
Number of Work Units The size of the task given to the first (root) agent in work units.
Depth of Finished Agents The desired average depth of leaf nodes in the agent tree at completion.
Maximum Hop Distance The maximum number of hops (hosts touched on path to destination) an agent can move.
Minimum Hop Distance The minimum number of hops (hosts touched on path to destination) an agent can move.
CPU Weight A factor used in the Metropolis algorithm in conjunction with the Network Weight to determine the most desirable move (or no move). Ranges from 0.0 to 1.0 and is always equal to (1 - Network Weight).
Network Weight A factor used in the Metropolis algorithm in conjunction with the CPU Weight to determine the most desirable move (or no move). Ranges from 0.0 to 1.0 and is always equal to (1 - CPU Weight).
Temperature A factor used in the Metropolis algorithm used to allow uphill moves. A higher value favors uphill climb choices.
Choices Per Hop Distance The number of host candidates considered within a single hop distance when an agent is deciding the best remote host to move to.
Hop Distances Considered Either is 0 or 1. When it is 0, only one hop distance (selected randomly between the minimum and maximum hop distances) is searched for remote host candidates. When it is 1, all hop distances from minimum to maximum hop distance (inclusive) are searched.
Move Selection Method Either is 0 or 1. When it is 0, the Metropolis algorithm is used to determine the agent movement choice. When it is 1, the movement choice is determined by the Mobile Agent Ratio (see below).
Load Determinant Either is 0, 1, 2, 3, or 4. Represents the nature of the load being used for load balancing. 0 uses the number of active agents. 1 uses active agents plus waiting agents. 2 uses estimated time steps derived from total number of agents and assumes each agent on a link has 2 work units. 3 is like 1 but assumes each agent on a link has an equal number of work units to the deciding agent. 4 calculates the exact load remaining for each agent and converts that into time steps.
Mobile Agent Ratio Mobile Agent Ratio equals (number of mobile agents/ total agents). Used only when the Fixed Mobile Agent Ratio method for move selection is used. This parameter represents the desired mobile agent ratio target achieved at completion.
World Size X The size of the grid in the X axis.
World Size Y The size of the grid in the Y axis.
First Agent Start Position X The X axis start position of the root agent (must be a host).
First Agent Start Position Y The Y axis start position of the root agent (must be a host).
Cycle Quota (cycles/step) The number of cycles given to a host for agents to consume each step.
Packet Quota (packets/step) The number of packets given to a link for agents to consume each step.
Cycle Slice Size (cycles/slice) The Cycle Quota is divided into portions that are consumed by agents on a host in round robin fashion. This portion size is the Cycle Slice Size.
Packet Slice Size (packets/slice) The Packet Quota is divided into portions that are consumed by agents on a link in round robin fashion. This portion size is the Packet Slice Size.
Host Cost (cycles/work unit) The number of cycles required to consume a work unit on a host.


Link Cost (packets/work unit) The number of packets required to move one work unit carried by an agent across a link.
Number of model runs Multiple consecutive models can be run or just a single model.
Table 4.1. Input Parameter Descriptions
4.2.4 Agent Life Cycle
An agent progresses through a subset of the eight life cycle states. The state
diagram for the agent system is shown in Figure 4.3. Each of the states is briefly
discussed below. Details of the complicated activities performed in some of the
states are discussed in later sections. For every state except the Link State, the agent
is physically located on a host. While in the Link State, the agent is on a link.
Figure 4.3. State Diagram for Agent Life Cycle.
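The states in Figure 4.3 can be summarized as an enumeration. The sketch below models only the stationary path through the life cycle (the mobile path through the Latency and Link States is omitted); the state names follow the text, but the transition helper itself is hypothetical.

```java
// The eight agent life cycle states from Figure 4.3. The transition
// helper is an illustrative simplification covering only stationary
// agents: Starting -> Preprocess -> Spawn/Work -> Merge/Finished.
class LifeCycle {
    enum State { STARTING, LATENCY, LINK, PREPROCESS, SPAWN, WORK, MERGE, FINISHED }

    static State nextStationary(State s, boolean canSpawn) {
        switch (s) {
            case STARTING:   return State.PREPROCESS;
            case PREPROCESS: return canSpawn ? State.SPAWN : State.WORK;
            case SPAWN:      return State.MERGE;
            case WORK:       return State.FINISHED;
            case MERGE:      return State.FINISHED;
            default:         return State.FINISHED;
        }
    }
}
```

A mobile agent would instead detour through the Latency and Link States on its way out and again on its way home, as the subsections below describe.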


4.2.4.1 Starting State
All agents begin in this state, whether they are spawned children or the root agent. The
agent selects its remote host and makes a decision to move or not to move. This is
the only opportunity given to agents to move. If the agent decides to be stationary it
progresses to the Preprocess State. If the agent is mobile it progresses to the Latency
State.
4.2.4.2 Latency State
The Latency State is where mobile agents wait for a period of network latency
to elapse. Agents that arrive in this state are either at their starting host and just
beginning to move to their destination, part way to their destination on an
intermediate host, at their destination host and beginning to move back home, or part
way home on an intermediate host. Since latency is specified in cycles, agents must
consume the latency by taking cycles from their current host. When the latency is
consumed (or paid for) the agent moves to the link that is next in its path toward its
destination. It then is on the network in the Link State.
4.2.4.3 Link State
The Link State represents an agent's presence on the network while it is
moving to its next host. Agents that arrive in this state are either coming from a host
or another link. Two links must be crossed between every pair of hosts. To move
on, an agent must consume a certain quantity of packets from the link. The quantity of
packets is equal to the input parameter, Link Cost, times the number of work units
the agent is carrying. Each step the agent has the opportunity to compete with other
agents (if present) for packets provided by the link. A link is allowed to give out a
number of packets equal to its Packet Quota every step. Once the agent consumes its
required packets it moves to the next object in its path to its destination. If the next
object is a link, the agent moves to that link and is again in the Link State. If the next
object is a host the agent moves to that host. Three possibilities exist depending on
where the agent is trying to go. If the agent is at its destination host, it is put in the
Preprocess State. If the agent is back home, it is done and put in the Finished State.
If the agent is at an intermediate host either on its way to the destination or on its way
home, it is put in the Latency State. An agent that journeys far from its start host will
iterate between the Latency State and the Link State until it arrives.
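The packet cost of travel described above can be expressed compactly. This is an illustrative sketch using assumed names; it combines the per-link cost (the Link Cost parameter times the work units carried) with the fact that two links separate every pair of neighboring hosts.

```java
// Packet cost of agent travel, per the Link State description above.
// Class and method names are illustrative, not simulator code.
class LinkCrossing {
    // Packets required to cross one link: Link Cost times the
    // number of work units the agent carries.
    static int packetsRequired(int linkCost, int workUnits) {
        return linkCost * workUnits;
    }

    // Two links lie between every pair of neighboring hosts, so a
    // journey of h host-to-host hops crosses 2 * h links in total.
    static int totalPackets(int linkCost, int workUnits, int hops) {
        return packetsRequired(linkCost, workUnits) * 2 * hops;
    }
}
```

For example, an agent carrying 4 work units with a Link Cost of 3 needs 12 packets per link, and a 2-hop journey crosses 4 links for 48 packets in all.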


4.2.4.4 Preprocess State
The Preprocess State is where an agent does one work unit of work on a host.
Agents that arrive in this state are either stationary, coming from the Starting State, or
have traveled and just arrived at their destination host. The number of cycles
required to consume the work unit is the Host Cost times 1 work unit. Each step the
agent has the opportunity to compete with other agents (if present) for cycles
provided by the host. A host is allowed to give out a number of cycles equal to its
Cycle Quota every step. Once the agent consumes its required cycles it progresses to
the next state. If the agent has enough work units to provide work units to possible
children it progresses to Spawn State, otherwise it progresses to the Work State.
4.2.4.5 Spawn State
The Spawn State is where an agent splits its workload into several pieces and
creates child agents to do each piece. Agents that arrive in this state come from the
Preprocess State and have enough work units to make it worthwhile to spawn and
divide their workload. No cycles are required to be consumed in this state so the
agent moves immediately to the Merge State.
4.2.4.6 Work State
The Work State is where some agents perform a second unit of work on a
host. Agents that arrive in this state come from the Preprocess State and did not have
enough work units to make it worthwhile to spawn and divide their workload. The
number of cycles required to consume the work unit is the Host Cost times 1 work
unit. Each step the agent has the opportunity to compete with other agents (if
present) for cycles provided by the host. A host is allowed to give out a number of
cycles equal to its Cycle Quota every step. Once the agent consumes its required
cycles it is finished and progresses to the next State. If the agent is stationary it
progresses to the Finished State. If the agent is mobile, it needs to move back home
so it progresses to the Latency State.
4.2.4.7 Merge State
The Merge State is where some agents perform a second unit of work on a
host. Agents that arrive in this state come from the Spawn State. The number of
cycles required to consume the work unit is the Host Cost times 1 work unit. Each
step the agent has the opportunity to compete with other agents (if present) for cycles
provided by the host. A host is allowed to give out a number of cycles equal to its
Cycle Quota every step. Before an agent can begin paying for its work unit it must
wait for all its children to return. It is quite possible an agent will wait many time
steps for all its children to return. Once the children return, the agent consumes its
required cycles and progresses to the next State. If the agent is stationary it
progresses to the Finished State. If the agent is mobile, it needs to move back home
so it progresses to the Latency State.
4.2.4.8 Finished State
All agents end up in this state. All agents finish on the host they started on.
When the root agent is in Finished State, the model has completed running.
4.2.5 Work Consumption and Tasks
The size of a task given to an agent (its processing workload) is measured in
an abstract quantity called work units. Since the purpose of this simulation is to
model data intensive situations, the number of work units is proportional to the
amount of data. An agent consumes work units by taking cycles from its current
host. The number of cycles required to consume a work unit is set by the input
parameter Host Cost. A typical agent consumes 2 work units during its life. The
first work unit is always consumed in the Preprocess State and represents
preprocessing work. The second work unit is consumed either in the Work State or
the Merge State depending on whether the agent is a leaf agent or a branch agent (or
root). Some leaf agents may consume 1 work unit or 3 work units. Details of these
cases are described in a later section describing spawning. The data intensive aspect
of the simulation is reflected in the assumption that the data passed to a child is equal
to the data received back from the child at completion.
4.2.6 Tracking Resource Consumption
An agent that begins consuming a work unit on a host may not finish before
the host's quota of cycles is depleted. This marks the end of activity on that host for
that time step. Similarly, an agent on a link may not consume all its required packets
in one step. The reason an agent may not finish is either that the number of cycles (or
packets) required is greater than the host's (or link's) quota or that other agents are
competing for resources at the same time. An agent has several class members of
type AgentTask which record the progress of resource consuming activities. The
following instances of AgentTask are part of each agent:
latencyTask tracks payment of cycles for latency,
preprocessTask tracks payment of cycles for the work unit consumed in the
Preprocess State,
linkTask tracks payment of network packets for agents on a link in Link State,
workTask tracks payment of cycles for the work unit consumed in the Work
State, and
mergeTask tracks payment of cycles for the work unit consumed in the Merge
State.
An AgentTask keeps track of the original task cost, the amount remaining,
and the number of original work units. In the case of linkTask, its taskCost is
initialized to the number of work units times the Link Cost. Other AgentTasks do
not need to know the number of work units. Methods of an AgentTask include:
consume(numCycles) consumes the number of cycles passed in and returns
what it did not need
isPaidFor() returns whether the task is paid for or not
resetRemaining() resets the amount remaining to the original task cost
In states where one of the five tasks above is required, the agent must
consume the appropriate number of cycles first using the consume method. The
isPaidFor method is called to check if consumption is complete. Once it is paid for,
the agent can proceed to its next state. Three of the AgentTasks, preprocessTask,
workTask, and mergeTask, are performed only once. The other two, latencyTask and
linkTask, are used more than once for mobile agents. These two call the
resetRemaining method before exiting their states to prepare for the next use.
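Based on the description above, an AgentTask might be sketched as follows. The method names consume, isPaidFor, and resetRemaining come from the text; the field names and exact signatures are assumptions.

```java
// Sketch of the AgentTask progress tracker described above. Method
// names follow the thesis; field names and signatures are assumptions.
class AgentTask {
    private final int taskCost; // original cost in cycles or packets
    private int remaining;      // amount still to be consumed

    AgentTask(int taskCost) {
        this.taskCost = taskCost;
        this.remaining = taskCost;
    }

    // Consumes cycles (or packets) and returns whatever was not needed.
    int consume(int numCycles) {
        int used = Math.min(numCycles, remaining);
        remaining -= used;
        return numCycles - used;
    }

    boolean isPaidFor() {
        return remaining == 0;
    }

    // Prepares a reusable task (latencyTask, linkTask) for its next use.
    void resetRemaining() {
        remaining = taskCost;
    }
}
```

An agent calls consume with whatever slice of resources it receives each step and checks isPaidFor before advancing to its next state.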
4.2.7 State Progression
The simulation initially allowed an agent to progress through only one state
during a time step. It was believed that any latency imposed by this rule would be
made insignificant by creating simulation runs with many time steps. However, the
number of steps required to achieve this would have made the simulation too slow
(hours/run). The simulation was then changed to allow an agent to progress as many
states as it could. The advantage of this change was that it eliminated the need to
justify that excessive imposed latency was not present. This modification required a
change in the execution of host and link step methods because agents were being
impeded by the call order of host and link step methods. The next section on
resource slicing describes this issue further.
4.2.8 Resource Slicing in the Upper Level Simulation
Host and link step methods are called from within a swarm action which is
indirectly called by the swarm executor. Only active hosts and links have their step
methods called. A host or link becomes active when it is included in the path of
a mobile agent (see section on Path Creation). Active hosts and links are kept in
separate dynamic arrays called hostList and linkList.
Originally, the contents of each list were shuffled and then iterated through
while calling each step method. The hosts were run first, then the links were run.
Even after the single state rule was relaxed, agents could at best execute on one host,
move to a link and execute there for the remainder of the time step. Agents on links
could move to hosts but could not execute there because the host step methods had
already run for that time step.
This problem was remedied by resource slicing. Resource slicing works by
calling a host's or link's step method many times and only allowing agents to
consume a fraction (slice) of the resources available for that time step. The size of
the slices is determined from the input parameters Cycle Slice Size and Packet Slice
Size. Calls to host and link step methods were proportionately interleaved relative to
the percentage of resources consumed for a time step. The following example
illustrates the process. Consider the following input parameters:
Cycle Quota = 20
Packet Quota = 10
Cycle Slice Size = 5
Packet Slice Size = 5
Number of Active Hosts = 2
Number of Active Links = 4
Table 4.2 shows the sequence of step calls for one time step. Consumption by the
links is not allowed in the second slice opportunity because the hosts need to catch
up. Proportional consumption between hosts and links allows for maximum agent
mobility.
Sequence of Step Calls    Resources Consumed    Cumulative Resources Consumed
                          this Step Call        as Percent of Quota
First Slice Opportunity
Host1.step                5 cycles              25%
Host2.step                5 cycles              25%
Link1.step                5 packets             50%
Link2.step                5 packets             50%
Link3.step                5 packets             50%
Link4.step                5 packets             50%
Second Slice Opportunity
Host1.step                5 cycles              50%
Host2.step                5 cycles              50%
Third Slice Opportunity
Host1.step                5 cycles              75%
Host2.step                5 cycles              75%
Link1.step                5 packets             100%
Link2.step                5 packets             100%
Link3.step                5 packets             100%
Link4.step                5 packets             100%
Fourth Slice Opportunity
Host1.step                5 cycles              100%
Host2.step                5 cycles              100%
Table 4.2. Resource Slicing Example for One Time Step
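The interleaving rule illustrated in Table 4.2 can be sketched in Java. This is a hypothetical reconstruction, not the thesis code; the class and method names are assumptions. Hosts take a slice at each opportunity, and links take one only when their cumulative fraction of quota has fallen behind the hosts'.

```java
import java.util.ArrayList;
import java.util.List;

public class SliceScheduler {
    // Build the sequence of step calls for one time step, interleaving link
    // slices among host slices in proportion to quota consumed.
    public static List<String> schedule(int cycleQuota, int packetQuota,
                                        int cycleSlice, int packetSlice,
                                        int numHosts, int numLinks) {
        List<String> calls = new ArrayList<>();
        int hostSlices = cycleQuota / cycleSlice;   // host slice opportunities
        int linkSlices = packetQuota / packetSlice; // link slice opportunities
        int hostsDone = 0, linksDone = 0;
        for (int s = 0; s < hostSlices; s++) {
            for (int h = 1; h <= numHosts; h++) calls.add("Host" + h + ".step");
            hostsDone++;
            // Links take a slice only when their cumulative fraction of quota
            // has fallen behind the hosts' (the "catch up" rule of Table 4.2).
            if (linksDone < linkSlices
                    && (double) linksDone / linkSlices < (double) hostsDone / hostSlices) {
                for (int l = 1; l <= numLinks; l++) calls.add("Link" + l + ".step");
                linksDone++;
            }
        }
        return calls;
    }
}
```

With the parameters of the example (Cycle Quota 20, Packet Quota 10, slice sizes 5, two hosts, four links), this rule reproduces the call sequence of Table 4.2, including the skipped link slice in the second opportunity.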
In this resource distribution scheme, resources that are not used from a slice
event are not available later in the time step. In other words, an agent that arrives late
in a time step on a link may not get many packets and may have to wait until the next
step to consume enough packets to pay its linkTask. This simulates the passage of
time within a time step.
4.2.9 Agent Scheduling and Resource Slicing at a Low Level
Multiple agents on a host or link compete for resources. Scheduling of
resource consumption among agents on hosts is managed by two queues, the
activeAgentList and the waitingAgentList. New agents and newly arrived mobile
agents are put at the end of activeAgentList. At the beginning of the step, agents
from the waitingAgentList are added to the activeAgentList. Agents in the Merge
State get put in the waitingAgentList if their children have not returned and remain
there until their children return or until the beginning of the next step method call.
Links only need an activeAgentList since there are never any agents in Merge State
on links.
Resource slicing occurs in the step method one cycle or packet at a time. A
Cycle (or Packet) Slice Size worth of resources gets dealt to agents in the activeAgentList in
round robin fashion. The step method terminates if all the resources are consumed or
if no active agents are present. Agents that are in the Finished State are put in the
finishedAgentList.
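The round robin dealing described above might look like the following sketch. The class, queue representation, and field names are assumptions for illustration, not the thesis implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RoundRobinDealer {
    // Hypothetical agent record: cycles (or packets) still needed by its task.
    static class Agent {
        int remaining;
        Agent(int remaining) { this.remaining = remaining; }
    }

    // Deal up to sliceSize units one at a time, round robin, to the agents in
    // the active queue; returns the number of units actually consumed. The
    // method stops early when no active agents remain.
    public static int deal(Deque<Agent> activeAgents, int sliceSize) {
        int consumed = 0;
        while (consumed < sliceSize && !activeAgents.isEmpty()) {
            Agent a = activeAgents.pollFirst();
            a.remaining--;                   // one unit of the slice
            consumed++;
            if (a.remaining > 0) {
                activeAgents.addLast(a);     // back of the round robin queue
            }
            // a finished agent would be moved to the finishedAgentList here
        }
        return consumed;
    }
}
```

Note that the method terminates either when the slice is exhausted or when there are no active agents left, matching the two termination conditions described above.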
4.2.10 Agent Path Creation and Selection
Each mobile agent has a path that contains an ordered list of hosts and links
leading to its destination. The path is the result of a selection process which first
considers a set of paths to remote hosts and then considers either staying on the local
host or moving to the remote candidate. An agent which is stationary has an empty
path. This section explains the process of creating a single path for later movement
consideration.
A path is first defined by its starting host and destination host. The start host
is where the agent currently exists. The destination host is selected randomly from
the set of hosts a certain hop distance (number of hosts away) from the start host.
One of two possible L-shaped paths is randomly selected between the start and
destination hosts. A path variable is then filled in with the start host followed by all
the links and hosts in between and finally the destination host.
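The construction of a single L-shaped path can be sketched as follows. This is an illustrative reconstruction, not the thesis code: it lists only the hosts, whereas the thesis's path variable also interleaves the links between consecutive hosts, and all names are hypothetical.

```java
import java.awt.Point;
import java.util.ArrayList;
import java.util.List;

public class LPath {
    // Return the ordered hosts of one of the two L-shaped paths from start to
    // dest; viaX == true turns at corner (dest.x, start.y), false at
    // (start.x, dest.y). A random boolean would pick between the two.
    public static List<Point> hosts(Point start, Point dest, boolean viaX) {
        Point corner = viaX ? new Point(dest.x, start.y)
                            : new Point(start.x, dest.y);
        List<Point> path = new ArrayList<>();
        walk(path, start, corner);  // leg 1: start toward the corner
        walk(path, corner, dest);   // leg 2: corner toward the destination
        path.add(dest);
        return path;
    }

    // Append the hosts from 'from' up to (but excluding) 'to' along one axis.
    private static void walk(List<Point> path, Point from, Point to) {
        int dx = Integer.signum(to.x - from.x);
        int dy = Integer.signum(to.y - from.y);
        for (Point p = new Point(from); !p.equals(to); p.translate(dx, dy)) {
            path.add(new Point(p));
        }
    }
}
```

Reversing the returned list gives the return path, as described later in this section.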
The process of selecting a path may involve evaluating several paths. The
simulation input parameters Maximum and Minimum Hop Distance bound the
distance for the search (as measured in number of hosts from the start host). Choices
Per Hop Distance is the number of candidates randomly selected from a specific hop
distance. Hop Distance Considered allows either looking at all hop distances within
the max and min values or randomly picking one hop distance between the max and
min values.
The path of the agent is used to guide the agent to its destination. Once it
arrives at its destination and prior to its return back to its start host, the path is
reversed. The agent then follows the reversed path back to its start host.
4.2.11 Move selection methods
Two move selection methods were implemented: Metropolis and fixed
mobile agent ratio. Metropolis was explained in detail in section 2.1.3. Fixed
mobile agent ratio is described in this section. The information used to determine
load is described in the next section.
The fixed mobile agent ratio method first selects a remote candidate. The
local or remote decision is made using the input parameter Mobile Agent Ratio. This
local/remote decision is a two-way choice that produces an actual mobile agent ratio
outcome close to the input Mobile Agent Ratio. The choice is random but is
proportionally weighted to achieve the Mobile Agent Ratio.
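The weighted two-way choice amounts to a single biased coin flip; the sketch below is an assumption about the mechanism, with a hypothetical class name.

```java
import java.util.Random;

public class MobileRatioChooser {
    // Two-way weighted choice: move to the remote candidate with probability
    // equal to the input Mobile Agent Ratio, so that over many decisions the
    // realized mobile agent ratio approaches the input value.
    public static boolean move(double mobileAgentRatio, Random rng) {
        return rng.nextDouble() < mobileAgentRatio;
    }
}
```

Over many decisions the fraction of true results converges to the input ratio, which is what makes the outcome mobile agent ratio track the input Mobile Agent Ratio.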
Metropolis often restricts the mobile agent ratio outcome to a narrow range
even when its input is varied considerably. In some preliminary test runs, the
optimal performance achieved was at a mobile agent ratio at the far edge of
Metropolis mobile agent ratio range. This spurred interest in creating a method that
could achieve a greater range of mobile agent ratio. An equally weighted choice
between moving or not moving would result in a mobile agent ratio close to 0.5.
The fixed mobile agent ratio method used a weighted choice to achieve a desired
input mobile agent ratio. This allows the examination of the full range of agent
movement that is quantified by mobile agent ratio.
4.2.12 Load Determinants
Load on hosts and links is used to make movement decisions in Metropolis.
The simulation allows the use of one of five load determining methods. Each
method is discussed below. Equations are provided showing how each of the three
Metropolis input parameters (Local host load, Remote host load, and Network load)
is calculated for each method.
One method that was tried was using the percent consumption of the Cycle
Quota or Packet Quota during the current or last time step. This method did not
produce meaningful results because of poor consumption information generated on
the local host. Prior to an agent making a move decision, its parent had just used
host cycles consuming its preprocessor task in the Preprocessing State.
Consequently, the latest percent consumption always shows the local host as 100
percent consumed, or at least partially busy. This results in poor move decisions,
with more agents mobile than necessary. The method was abandoned due to this anomaly.
Even if the simulation were modified to make effective use of a historical percent
consumption load determinant, it would not be expected to achieve good results.
Load data taken over too long a period is less relevant, while load data taken over
a short period is too narrowly focused to be representative. In addition, a highly
loaded resource cannot reflect demand beyond 100 percent. These shortcomings make
the method especially susceptible to degradation in information quality, as
evidenced by the effect of minor simulation implementation details on overall
performance. This
observation is confirmed by the absence of this type of load determinant in the
literature and the presence of ones based on task queue length or task duration
prediction.
4.2.12.1 Method 1: Number of Active Agents
This method simply uses the number of currently executing agents as load.
Agents in Merge and Finished State are not counted. Agents that have not made a
move decision yet are not counted. The variable numActive is the number of active
agents on a host or link. The Metropolis input parameters are:
Local host load = numActive
Remote host load = numActive + currentAgent
Network load = maxLinkAgents + currentAgent
4.2.12.2 Method 2: Number of Active and Waiting Agents
This method differs from the previous method in that agents waiting for their
children to return are also counted. The variable numWaiting is the number of
waiting agents on a host or link. The Metropolis input parameters are:


Local host load = numActive + numWaiting
Remote host load = numActive + numWaiting + currentAgent
Network load = maxLinkAgents + currentAgent
4.2.12.3 Method 3: Estimated time steps assuming
2 work units for other agents on links
This method computes load as an estimate of time steps required to
sequentially pay for all the tasks it will encounter on hosts and links. The estimate
also includes time steps the agent will have to wait while other agents existing on
other hosts and links complete their tasks. LatencyLoad is considered part of the
network load because it is a cost associated with using the network. In calculating
LinkLoad it is assumed that other agents on links have two work units. The
Metropolis input parameters are:
Local host load = (numActive*AAWL + numWaiting*WAWL) * hostCost / cycleQuota
Remote host load = (numActive*AAWL + numWaiting*WAWL + currentAgent*2.0) * hostCost / cycleQuota
Network load = LinkLoad + LatencyLoad
AAWL = active agent work load (average) = 1.5 work units
WAWL = waiting agent work load (average) = 1.0 work units
LinkLoad = (sum of agents on links in path + currentAgent) * 2 * linkRate / packetQuota
LatencyLoad = (number of links in path) * latencyRate / cycleQuota
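As an illustration, the Method 3 load formulas can be computed directly. The helper class, its name, and the way parameters are passed are assumptions for this sketch; the variable names (linkRate, latencyRate, and the quotas) follow the formulas above, and the current agent is counted as one additional agent on the links.

```java
public class Method3Load {
    static final double AAWL = 1.5; // average active agent work load (work units)
    static final double WAWL = 1.0; // average waiting agent work load (work units)

    // Local host load for Method 3, per the formulas above.
    public static double localHostLoad(int numActive, int numWaiting,
                                       double hostCost, double cycleQuota) {
        return (numActive * AAWL + numWaiting * WAWL) * hostCost / cycleQuota;
    }

    // Network load = LinkLoad + LatencyLoad, assuming 2 work units for the
    // agents already on the links in the path.
    public static double networkLoad(int agentsOnLinks, int linksInPath,
                                     double linkRate, double packetQuota,
                                     double latencyRate, double cycleQuota) {
        double linkLoad = (agentsOnLinks + 1 /* currentAgent */)
                          * 2 * linkRate / packetQuota;
        double latencyLoad = linksInPath * latencyRate / cycleQuota;
        return linkLoad + latencyLoad;
    }
}
```

For example, two active and one waiting agent with Host Cost 20 and Cycle Quota 20 give a local host load of (2*1.5 + 1*1.0)*20/20 = 4.0 estimated time steps.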
4.2.12.4 Method 4: Estimated time steps assuming
current agent's work units for other agents on links
This method is similar to the previous method except in calculating the
LinkLoad. It is assumed that other agents on links have the same number of work
units as the current agent. Using the current agent's work load inhibits movements
onto the network early in the task when agents pose a heavy burden to the network.
Some early trials indicated this early movement may be inefficient and costly. The
Metropolis input parameters are:
Local host load = (numActive*AAWL + numWaiting*WAWL) * hostCost / cycleQuota
Remote host load = (numActive*AAWL + numWaiting*WAWL + currentAgent*2.0) * hostCost / cycleQuota
Network load = LinkLoad + LatencyLoad
AAWL = active agent work load (average) = 1.5 work units
WAWL = waiting agent work load (average) = 1.0 work units
LinkLoad = (sum of agents on links in path + currentAgent) * currentAgentNumWorkUnits * linkRate / packetQuota
LatencyLoad = (number of links in path) * latencyRate / cycleQuota
4.2.12.5 Method 5: Estimated time steps using known
work remaining in agents' tasks
This method is similar to the previous two methods except instead of
estimating the amount of work other agents have to do, the remaining work is
determined precisely by interrogating each agent. This method is not possible in a
real system, since agents do not know how much work they have remaining at any
given time, and interrogating every agent would cause significant overhead.
This method was implemented and tested because it was expected to be
an upper bound on load information quality. The performance of other load
determinants could be compared to this load determinant as an "optimal" benchmark.
The Metropolis input parameters are:
Local host load =
(sum of cycles remaining for preprocess, work or merge, and
latency tasks for all active agents + sum of cycles remaining
for merge and latency tasks for all waiting agents) /
cycleQuota
Remote host load =
(sum of cycles remaining for preprocess, work or merge, and
latency tasks for all active agents + sum of cycles remaining
for merge and latency tasks for all waiting agents + remaining
cycles for preprocess, work, and latency tasks for current
agent) / cycleQuota
Network load = (sum of packets remaining for link tasks for all agents
detected on links in the path + packets expected to be
consumed at each link along the path by the current agent) /
packetQuota
4.2.13 Spawning, Depth, Branching Factor, and Work Division
Spawning is the process by which work is distributed through the agent
system. Blelloch [18] characterized parallel algorithms as having well-defined or
even predictable performance if their work (total number of operations) and depth
(longest chain of sequential dependencies in the computation) are known. In this
simulation the work of a task is measured in work units and depth is the average
depth of leaf agents in the agent tree. The simulation takes Number of Work Units as
input as well as Depth of Finished Agents. To achieve the input depth as the
outcome of a model run, the effective branching factor (EFB) is calculated using
work and depth as input. The EFB is used to determine how many children result
from a spawn event. An EFB of 2.6 would indicate 40 percent of spawn events
would have 2 children while 60 percent would have 3 children.
The EFB was derived as follows:
N = 1 + b + b^2 + ... + b^d
bN = b + b^2 + ... + b^(d+1)
bN - N = b^(d+1) - 1
N = (b^(d+1) - 1) / (b - 1)
where N = number of agents,
b = EFB, and
d = depth.
EFB was solved for numerically using Newton's method. The derivation assumes
integer values are used. For non-integer values the EFB is approximate and typically
overestimates the actual depth by 10 percent. However, consistency of work and
depth is the primary concern and is achieved using this technique.
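A sketch of the numerical solution (not the thesis code; names are hypothetical) applies Newton's method to f(b) = (b^(d+1) - 1)/(b - 1) - N:

```java
public class EffectiveBranchingFactor {
    // Solve f(b) = (b^(d+1) - 1)/(b - 1) - N = 0 for the EFB b, given the
    // desired number of agents N and average depth d, using Newton's method.
    public static double solve(double n, double d) {
        double b = 2.0; // initial guess; f is increasing for b > 1
        for (int i = 0; i < 100; i++) {
            double f = (Math.pow(b, d + 1) - 1) / (b - 1) - n;
            // f'(b) by the quotient rule
            double fp = ((d + 1) * Math.pow(b, d) * (b - 1)
                       - (Math.pow(b, d + 1) - 1)) / ((b - 1) * (b - 1));
            double next = b - f / fp;
            if (Math.abs(next - b) < 1e-12) return next;
            b = next;
        }
        return b;
    }
}
```

For exact integer cases the solver recovers the branching factor exactly; for example, a complete binary tree of depth 5 has N = 63 agents, and solve(63, 5) returns b = 2.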
An agent must have enough work units to spawn. The typical spawn event
leaves 2 work units with the parent and splits the rest evenly with the children. If not
enough work units are left to accommodate the number of children specified by the
EFB, a reduced number of children is produced. If an agent has only 2 work units, it
cannot spawn, since it needs those 2 work units to consume itself. Agents with
4 or more work units must spawn. Agents with 3 work units present an unbalanced
case where spawning results in one child with 1 work unit (consumed in preprocess
task). It was probable that this case would happen to half of all leaf agents and might
cause imbalance in the number of spawn events expected. This imbalance was
adjusted by spawning half the time and allowing the potential parent to consume 3
work units (1 as preprocess task and the other 2 as work task) the other half of the
time.
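The spawn rules above can be collected into a sketch. This is an illustrative reconstruction under the stated rules, with hypothetical names; the even split of remaining work units among children is an assumption about the "splits the rest evenly" step.

```java
import java.util.Random;

public class SpawnRule {
    // Returns the work units given to each child, or an empty array when the
    // agent does not spawn.
    public static int[] spawn(int workUnits, double efb, Random rng) {
        if (workUnits <= 2) {
            return new int[0];                  // too little work to spawn
        }
        if (workUnits == 3) {                   // unbalanced case: spawn only
            if (rng.nextDouble() < 0.5) {       // half the time; otherwise the
                return new int[0];              // parent consumes all 3 itself
            }
            return new int[] { 1 };             // one child with 1 work unit
        }
        // An EFB of 2.6 means 40% of spawn events get 2 children, 60% get 3.
        int children = (int) Math.floor(efb);
        if (rng.nextDouble() < efb - children) {
            children++;
        }
        int toShare = workUnits - 2;            // parent keeps 2 work units
        children = Math.min(children, toShare); // reduce children if short
        int[] shares = new int[children];
        for (int i = 0; i < children; i++) {    // split the rest evenly
            shares[i] = toShare / children + (i < toShare % children ? 1 : 0);
        }
        return shares;
    }
}
```

For example, an agent with 10 work units and an EFB of 3.0 keeps 2 work units and divides the remaining 8 among 3 children.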
4.2.14 Randomness in the Simulation
Various weighted random decisions are made in the simulation, which by
themselves introduce variability between model runs with identical input. Table 4.3
explains the various random events, their effects, and their estimated significance.
1. Spawning: the least likely number of children may be selected early in the run.
   Effects: the shape of the agent tree may be deeper or shallower than the typical run.
   Estimated significance: Moderate.
2. Spawning: more 3-work-unit childless agents may be created than average.
   Effects: a shallower depth and less opportunity for spread of the work units to other hosts by spawning.
   Estimated significance: Low.
3. Path Creation: 2 kinds of L-shaped paths are possible to the same host.
   Effects: the possibility exists that the path selected will be the more congested one (or less congested). Overall, the time to completion may be impacted somewhat; however, future path creation and movement decisions should adapt to a previous bad one to avoid compounding effects.
   Estimated significance: Low to Moderate.
4. Path Selection: if fewer choices are allowed within a hop distance, the best candidate is less likely to be selected.
   Effects: again, overall time to completion may suffer somewhat, but future selections by other agents will be even more likely to avoid the same overloaded host.
   Estimated significance: Low to Moderate.
5. Mobile Agent Ratio Method: the choice to move is random, weighted on the input mobile agent ratio.
   Effects: since this method is mostly random to begin with, significant performance degradation is not expected, but time to completion values will vary more widely.
   Estimated significance: Low to Moderate.
6. Metropolis Method: non-zero temperature inputs will cause uphill move decisions.
   Effects: the potential for a randomly occurring bad choice exists. Performance would suffer, especially if the bad choice occurs early.
   Estimated significance: Moderate.
Table 4.3. Effects of Randomness in Simulation
4.2.15 Description of I/O configurations, possible output
The simulation has three modes of execution: single model, multiple model
(GUI input), or multiple model (command line input). The single model collects
model specific data and displays the data in grids, graphs and histograms. The single
model allows the closest look at the simulation and is useful for demonstrations and
troubleshooting. The multiple model (GUI input) collects summary information at
the completion of each model and displays the results on graphs and histograms. It is
useful for examining variability and deciding on the number of models necessary to
smooth out random effects. The multiple model (command line input) collects
average data for several models and stores it in a file.
The single model displays the following:
An input parameter frame (Figure 4.4)
A grid showing agent location and density (Figure 4.5)
A grid showing host and link utilization (Figure 4.6)
A graph of number of agents versus time (Figure 4.7)
A graph of total host and link utilization (Figure 4.8)
A histogram showing number of agents currently in each state (Figure
4.9)

[Screenshot of the input parameter frame, with fields grouped under SYSTEM PARAMETERS, APPLICATION PARAMETERS, LOAD BALANCING PARAMETERS, and SIMULATION PARAMETERS, and a Start Simulation button.]
Figure 4.4. Input parameter frame.
Figure 4.5. Agent location and density grid.
Figure 4.6. Host and Link Utilization.
[Plot of Number of Live Agents vs. time steps elapsed.]
Figure 4.7. Number of agents versus time.
[Plot of Average CPU and Link Utilization vs. time steps elapsed.]
Figure 4.8. Total host and link utilization.
[Histogram of cumulative totals for each agent state; legend: STARTING, latency, link, preprocess, spawn, work, merge, finished.]
Figure 4.9. Agent state histogram.
The single model is also equipped with a control panel that allows the
simulation to be paused, stepped incrementally, restarted or stopped. Double
clicking on a host or link grid square brings up an inspection window where host and
link methods can be used to inspect internal state.
The multiple model (command line input) records the following information
averaged for several models:
time
mobile agent ratio
cycles consumed
packets consumed
actual depth
number of agents
hop count
number of hosts visited
number of links visited
standard deviations for above
time stamp
input parameters
4.3 Comparison of Cotton with Simulation
The following subsections discuss how well the simulation models Cotton.
4.3.1 Time
The smallest unit of time in the simulation is a step. A step represents the
opportunity for agents to consume a fixed amount of resources from hosts and links.
Although a step is the smallest measured time quantity in the simulation, as
mentioned in Section 3.2.8, a smaller increment of time is effectively modeled that
causes lost opportunities to consume resources. The true measure of time in both
systems is how it is related to resource consumption. A good match seems to exist
here.
4.3.2 Resource Consumption
The simulation assumes that network transmissions across a host incur
latency on that host. This may or may not be true with Cotton depending on the
network specifics. Another simulation assumption is that the network cost is
proportional to the number of work units an agent is carrying. This is consistent with
a data intensive application that must move a significant amount of data. Round
robin resource slicing in the simulation emulated the behavior of the Java virtual
machine in Cotton. Load balance agents are not included in the simulation, although
in Cotton they do consume some resources (less than 5 percent). Finally, no attempt is made
in the simulation (although it is possible) to account for non-agent competition for
resources on hosts and links.
4.3.3 Movement
The simulation allows an agent to move to any host within its hop limits.
Some restrictions are placed on agent movement in the simulation that are not in
Cotton. The number of moves is restricted to one and that move occurs near the
beginning of the agent life cycle. A second restriction is the path can only be one of
the two possible L-shaped routes to a host destination. A third restriction is the path
used to get to a destination is the same path used to get back to the start host. No
flexibility is given to avoid congestion on the return trip. In spite of these
simplifying assumptions, it is not believed they significantly obscure or alter the load
balancing properties of the modeled system.
4.3.4 Work
The simulation has a data intensive work consumption pattern. Agents
consume (typically) 2 work units in their life. The work units are consumed at 2
different times in the life of the agent. Both work units require consumption of the
same number of cycles. Cotton has no specific requirement for work division. The
simulation's work division/consumption scheme attempts to model a generic data
intensive case where the amount of input data is equal to the amount of output data.
4.3.5 Information Availability
The load information extracted from the simulation can all be collected from
Cotton. The only exception is the 5th load determinant (section 3.13.5) which was an
experiment in a more accurate heuristic. The simulation may actually outperform
Cotton in the area of information relevance. Load information in the simulation is
always current while Cotton has transmittal delays associated with any real system.
4.3.6 Topology
The simulation represents hosts the same way Cotton does. The network
connection between two hosts is represented by 2 links and a switch (which is inert)
in the simulation. Essentially there is no difference between the simulation and
Cotton in this area. The two links can be thought of as a single network connection.
5. Experiment Descriptions and Rationale
The goal of the experiments was to determine the effect of load balancing
input parameters on agent movement and performance. An infinite number of
system architectures could be created using the simulation. These system
architectures are defined by processor speed, network bandwidth and latency. Figure
5.1 shows the kinds of system architectures possible by varying processor speed and
bandwidth. The only system architecture that was interesting from a load balancing
perspective was the bandwidth limited system. An architecture with unlimited
bandwidth does not require load balancing (agents always choose to move) except
when latency is high. In contrast, in a bandwidth limited system agents must
carefully choose when to move or not move to prevent bottlenecks on processors and
on networks. Initially both CPU limited and bandwidth limited systems were
modeled and tested to confirm behavior was correct. Once reasonable input
parameters for the bandwidth limited system architecture were confirmed, further
experiments were restricted to this case.
Figure 5.1. System Architectures Derived from Combinations of Processor Speed
and Bandwidth.
5.1 Input Parameters
5.1.1 System Size
The system size (WorldX = 80, WorldY = 80) was chosen large enough to
prevent edge effects from impacting performance. A later experiment was performed
to look at this factor. The root agent start position (StartX = 41, StartY = 41)
chosen placed the agent in the middle of the grid. The number of models (50) was
selected by trial and error to smooth out variability in time to completion. The output
(time to completion, hosts visited, etc.) was then an average of a set of 50 models.
5.1.2 Fixed System Architecture Parameters
The input parameters that characterize the system architecture are processor
speed, bandwidth, and latency. In the simulation, processor speed consists of (Cycle
Quota/Host Cost). Bandwidth consists of (Packet Quota / Link Cost). Host Cost (20
cycles/work unit) and Link Cost (1 packet/work unit) were kept constant while Cycle
Quota and Packet Quota were varied. Number of Work Units (1000) and Depth (5.3)
were also kept constant. They were selected by trial and error in conjunction with
Cycle Quota, Packet Quota and Latency to produce time to completion values in a
reasonable range (not too small or large). Cycle Slice Size and Packet Slice Size were
chosen so that the ratios (Cycle Quota/Cycle Slice Size) and (Packet Quota/Packet
Slice Size) did not exceed 10 (for speed issues) and were not less than 5 to achieve
adequate resource slicing.
5.1.3 Variable System Architecture Parameters
Four system architectures were selected with varying processor speed (Cycle
Quota), bandwidth (Packet Quota), and latency. Low and high latency systems were
modeled for both CPU limited systems and the bandwidth limited systems
represented in Figure 5.1. The systems were referred to as:
1. Low Latency, CPU Limited
2. Low Latency, Bandwidth Limited
3. High Latency, CPU Limited
4. High Latency, Bandwidth Limited
Low latency represents a system that is relatively unaffected by latency. High latency
represents a system that is significantly affected by latency. CPU limited represents a
system that has ample network capacity but a shortage of processor capacity. A CPU
limited system achieves low time to completion by making nearly all its agents
mobile. Bandwidth limited represents a system that has a shortage of network
capacity relative to processor capacity. A bandwidth limited system achieves low
time to completion by carefully restricting the number of mobile agents to prevent
lost time from network congestion.
These three input parameters for the four system types were initially chosen
by carefully observing the single model output graphs showing the degree of
processor and network utilization. The candidate parameter sets were confirmed and
fine tuned by running a set of models varying one of the three parameters.
The input parameters for the four system architectures are listed in Table 5.1 along
with the resulting system parameters.
                                   Input Parameters               Resultant System Parameters
System Architecture                Cycle     Packet    Latency    Processor   Bandwidth   Latency
                                   Quota     Quota     (cycles)   Speed       (work       (cycles)
                                   (cycles/  (packets/            (work units units/
                                   step)     step)                per step)   step)
Low Latency, CPU Limited              5        100        10        0.25        100          10
Low Latency, Bandwidth Limited       20          5        10        1.00          5          10
High Latency, CPU Limited             5        100       100        0.25        100         100
High Latency, Bandwidth Limited      20          5       100        1.00          5         100
Table 5.1. Input parameters for four system architectures.
Figures 5.2, 5.3, 5.4, and 5.5 confirm the correctness of the previous
selections. A system that is limited by processor speed (Figure 5.2) or network
bandwidth (Figure 5.3) achieves significant improvement in time to completion (on
Y axis) when the controlling parameter (on X axis) is increased. In contrast, a
system that is not limited or minimally limited does not improve when the
controlling parameter is increased. Low latency systems do not show significant
improvement when latency is decreased while high latency systems do show
improvement (Figures 5.4 and 5.5).
Figure 5.2. Cycle Quota Selection for CPU Limited System
Figure 5.3. Packet Quota Selection for Bandwidth Limited System
Figure 5.4. Latency Selection for CPU Limited System
[Plot of time to completion versus latency (1 to 1000 cycles), with Cycle Quota = 20, Packet Quota = 5, and Latency varied.]
Figure 5.5. Latency Selection for Bandwidth Limited System
5.2 Agent Movement Input Parameters
The following agent movement input parameters were varied to see their
effect on performance. Input ranges for the parameters are also listed.
CPU Weight and Network Weight
(0.0, 1.0), (0.01, 0.99), (0.02, 0.98), (0.04, 0.96), (0.06, 0.94),
(0.08, 0.92), (0.1, 0.9), (0.2, 0.8), (0.3, 0.7), (0.4, 0.6),
(0.5, 0.5), (0.6, 0.4), (0.7, 0.3), (0.8, 0.2), (0.9, 0.1), (1.0, 0.0)
Temperature
0, 0.5, 1, 2, 3, 10, 10000
Maximum and Minimum Hop Distances
(1,1), (2,2), (3,3), (6,6), (1,2),
(1,3), (1,5), (2,3), (2,5), (4,6)
Number of Choices Per Hop Distance
1, 2, 4, 8
Load Determinant
all five methods
Agent Movement Method
Metropolis, Fixed Mobile Agent Ratio
Grid Size (number of processors)
4x4 (2), 8x8 (8), 12x12 (18), 16x16 (32), 80x80 (800)
6. Experiment Results
6.1 Effect of CPU Weight/Network Weight and Temperature
Since CPU Weight is equal to (1 - Network Weight), reference is only made to
CPU Weight in the following discussion.
6.1.1 CPU Limited System
Diagrams A1 and A2 in the appendix show a plot of time versus CPU
Weight for various temperatures. A1 is for a low latency system while A2 is for a
high latency system. Both systems responded similarly to the effects of CPU Weight
and temperature.
Performance increases rapidly at low values of CPU Weight and gradually
levels off at higher values. The optimal CPU Weight is at 1.0. This result is as
expected since higher values of CPU Weight cause increased agent mobility. Since
the cost of agent movement is low with this system, spreading the work out to as
many processors as possible minimizes time.
Low, but non-zero temperatures work best. Higher temperatures cause too
many agents to move to highly loaded processors, which reduces performance. Low
temperatures push agents to move just a little more but the effects of a few bad
decisions are outweighed by the mobility and spacing provided by the extra
movement. At low CPU Weight values, the best performance belongs to runs with
high temperature values. At very low CPU Weight values (0, 0.1) agent mobility is
very low and can benefit from induced movement even when associated with poor
movement choices. This effect disappears for CPU Weight values of 0.2 or greater.
6.1.2 Bandwidth Limited System
A3 and A4 show plots of time versus CPU Weight for various temperatures
for low and high latencies, respectively. The plots for the high latency system were
similar in shape to the low latency plots but were significantly higher in magnitude.
A zero value of temperature performed best. Increased temperature lowered
performance consistently. For the optimal temperature case (=0), performance
improves as CPU Weight increases at very low values. A minimum time is reached
for CPU Weights between 0.05 and 0.1. The existence of a minimal time was
expected between the extreme cases of non-mobile agents and entirely-mobile
agents. This is because a network limited system should perform best when a
balance between network and processor use is achieved. As CPU Weight increases
from the optimal value (0.06) performance gradually degrades and then levels off at a
CPU Weight of 0.4. A significant increase in time occurs from 0.4 to 0.5. Time
stays relatively constant for CPU Weights of 0.4 to 0.9. Another jump in time occurs
going from 0.9 to 1.0.
The increase in time observed going from CPU Weights of 0.4 to 0.5 is
caused by the weighted decisions made by agents using the Metropolis movement
rule which are:
localPotential = localProcessorLoad * processorWeight
remotePotential = remoteProcessorLoad * processorWeight
networkPotential = networkLoad * networkWeight
if (localPotential >= remotePotential + networkPotential) then move is
accepted.
There is a higher than normal proportion of movement decisions made where the
tuple (localProcessorLoad, remoteProcessorLoad, networkLoad) is (1,0,1). This
tuple results in no movement for CPU Weight less than 0.5 and movement when 0.5
or greater. This has a significant impact on the mobile agent ratio which is a good
predictor of performance for bandwidth limited systems. Consequently the mobile
agent ratio is relatively flat for the ranges (0.3 to 0.4) and (0.5 to 0.9). When the
CPU Weight reaches 1.0, the mobile agent ratio noticeably increases, which results in
poorer performance than at 0.9, where not all agents were mobile. Figure 6.1 shows
the corresponding mobile agent ratios for the zero temperature case. These
discontinuities occur for the load determinants: NumActiveAgents and
Active&WaitingAgents but not the other load determinants.
A higher density of points was gathered at low CPU Weights to confirm the
existence of the optimal area. At higher temperatures the optimal area does not exist
and the resulting time is relatively constant with CPU Weight. The tightly spaced
points at low CPU Weight for higher temperatures appear to exhibit local minima
and maxima, but this is due to randomness in the data.

Figure 6.1. Mobile Agent Ratio for Temperature 0 Case.
6.2 Effect of Maximum and Minimum Hop Distances
A5 shows the results of varying the minimum and maximum hop distances.
The hop distance is the number of hops an agent might make from its current host to
some remote host. The hop distance did not affect the CPU Weight (0.06) at which
the minimum time occurred. The minimum hop distance controlled the behavior
of the various runs. Runs with a minimum hop distance of 1 performed best overall.
Higher minimum hop distances caused worse performance. This is due to the
relatively slow network, which does not efficiently support excessive agent
movement. Since longer moves were prohibitively expensive, agents preferred to
move only 1 hop, and the choice of maximum hop distance had no impact on the
results. At higher
values of CPU Weight (> 0.04) runs with higher minimum hop distance performed
even worse.
The same pronounced increase in time occurred going from a CPU Weight of
0.4 to 0.5. This was explained in Section 6.1.2. The increase is more pronounced for
higher minimum hop distances since higher hop distances cause more burdensome
network delays.
6.3 Effect of Number of Choices Per Hop Distance
The number of choices an agent has within a particular hop distance has no
perceptible effect. This is because even though an agent may get only one chance to
move to a processor with lower usage in its lifetime, its children, which still carry the
bulk of future processing load, will have the opportunity to move. Poor choices by
parents will make the situation on the current host more intolerable for the child
agents and they will be more likely to move and improve the overall situation.
Adjustments by child agents produce the same benefits that choices per hop distance
can provide.
6.4 Effect of Load Determinant
Five load determinants were compared. These were:
NumActiveAgents,
Active&WaitingAgents,
2WorkUnitsPerLinkAgent,
CurrentWorkUnitsPerLinkAgent, and
Actual.
Each load determinant was run with varying CPU Weight and temperature for both
low and high latency. The results are plotted in A3, A4, A9 through A16.
6.4.1 NumActiveAgents and Active&WaitingAgents
NumActiveAgents was the load determinant used for the experiments
discussed in Sections 6.1 through 6.3. Active&WaitingAgents performed almost
identically to NumActiveAgents with slightly increased times.

6.4.2 2WorkUnitsPerLinkAgent and CurrentWorkUnitsPerLinkAgent
2WorkUnitsPerLinkAgent and CurrentWorkUnitsPerLinkAgent performed
distinctly differently from NumActiveAgents but were similar to each other. A
temperature of 10 was found to be optimal for the low latency cases with the optimal
CPU Weight still around 0.06 to 0.1. A relatively stable set of times (within 15
percent of optimal) was achieved for CPU Weights between 0.2 and 0.6 for
temperatures less than 10. The high latency cases had an optimal temperature of 1
and optimal CPU Weight range of 0.5 to 0.9. Compared to NumActiveAgents,
temperature had a positive impact, with essentially no detrimental impact except at
the highest temperature (10000). In addition, a wider range of CPU Weights achieved
good results. Finally, the mobile agent ratios were in the range of 0.7 for
2WorkUnitsPerLinkAgent and CurrentWorkUnitsPerLinkAgent but were much
lower for NumActiveAgents (~0.3), indicating that NumActiveAgents uses fewer
network resources.
6.4.3 Actual
Actual was similar to 2WorkUnitsPerLinkAgent but had an optimal time at a
CPU Weight of 0.5 at a temperature of 3. The high latency case had an optimal time
at a CPU Weight of 1.0 at a temperature of 0.
6.4.4 Load Determinant Comparisons
The performance of the optimal cases for the five load determinants is
summarized in A17 through A20. These results were averages from 500 model runs
as opposed to 50 runs used in the previous results. The 2X standard deviation values
on time for these 500 model runs were 1 percent or less.
Actual had the lowest time to completion (A17). This was expected since it
had the best load information possible. Surprisingly, 2WorkUnitsPerLinkAgent and
CurrentWorkUnitsPerLinkAgent were within 1 to 2 percent. This demonstrates the
value of reducing movement onto busy links, especially early in the task when
network heavy agents sometimes collide, over accuracy in determining load.
NumActiveAgents and Active&WaitingAgents had times 15 to 20 percent larger
than the other 3 methods. The best single times from the 500-run sets are also
shown (A17); the average times were 25 to 31 percent greater than the best times.

A18 through A20 show the agent mobility produced by the various load
determinants and the impact on the network. The highest mobile agent ratios,
bandwidth consumed, and number of hosts visited belong to
2WorkUnitsPerLinkAgent, CurrentWorkUnitsPerLinkAgent, and Actual with Actual
being the highest. NumActiveAgents uses 35 percent less bandwidth and visits 20
percent fewer hosts, with the tradeoff of being 15 percent slower than
2WorkUnitsPerLinkAgent and CurrentWorkUnitsPerLinkAgent.
6.5 Effect of Agent Movement Method
The results of runs made for the Fixed Mobile Agent Ratio movement
method are presented in A21 for low and high latency. Both low and high latency
cases show a preference for low mobile agent ratios (0.2 and 0.1, respectively). A22
compares the Fixed Mobile Agent Ratio and Metropolis movement methods for the
optimal results of both. The optimal value presented for Metropolis used the
NumActiveAgents load determinant. Metropolis is more than twice as fast for the
low latency case and 60 percent faster for the high latency case. The Fixed Mobile
Agent Ratio method makes purely random movement decisions; the comparison
demonstrates the value of using even simple load balancing strategies.
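The Fixed Mobile Agent Ratio baseline can be read as each agent moving with a fixed probability equal to the target ratio, using no load information at all; this reading and the names below are assumptions for illustration, not Cotton's implementation:

```python
import random

def fixed_ratio_decisions(n_agents, mobile_ratio, rng):
    """Purely random baseline: each agent decides to move with a fixed
    probability equal to the target mobile agent ratio."""
    return [rng.random() < mobile_ratio for _ in range(n_agents)]

rng = random.Random(42)
decisions = fixed_ratio_decisions(10_000, 0.2, rng)
observed_ratio = sum(decisions) / len(decisions)
# observed_ratio comes out close to the configured 0.2
```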
6.6 Effect of Grid Size
Five grid sizes were compared using the NumActiveAgents load determinant
at optimal temperature and CPU Weight, averaged over 500 model runs. Table 6.1
shows the resulting times for each grid size.

Grid Size  Processors  Theoretical best time  Number of steps  Processors visited  Mobile Agent Ratio  2X standard
                       to completion          (average)        (average)           (average)           deviation
4x4          2         500                    793                2                 0.49                5.79
8x8          8         125                    331                8                 0.47                1.93
12x12       18          56                    269               18                 0.34                2.48
16x16       32          32                    262               27                 0.32                2.66
80x80      800           2                    263               32                 0.31                2.73

Table 6.1. Effect of Grid Size on Time, Processors Visited, and Mobile Agent Ratio.

The theoretical best time to completion assumes no latency and infinite
bandwidth and is provided for comparison to the 4x4 and 8x8 grids. Grid sizes 4x4,
8x8, and 12x12 use all the available processors in their systems. Grid sizes 16x16
and 80x80 do not use all available processors. Although grid size 12x12 touches
every processor, it did not appear to perform much worse than the 16x16 and 80x80
grids. The 16x16 grid does not appear to be significantly affected by its degree of
confinement. Figure 6.2 shows the limited extent of agent movement for the 80x80
grid size as evidenced by the number of processors busy at the peak agent
distribution for a particular run.
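The theoretical best times in Table 6.1 are consistent with distributing the task's 1000 work units perfectly across the available processors at one work unit per processor per step, rounded up; this is a reconstruction of the apparent relationship, not a formula stated explicitly here:

```python
import math

def theoretical_best_time(work_units, processors):
    """Lower bound on completion time assuming perfect load balance,
    no latency, and infinite bandwidth."""
    return math.ceil(work_units / processors)

# Matches the theoretical-best column of Table 6.1:
for procs, expected in [(2, 500), (8, 125), (18, 56), (32, 32), (800, 2)]:
    assert theoretical_best_time(1000, procs) == expected
```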

Figure 6.2. Grid size 80x80 depicting approximate extent of agent movement
(agent density map).

7. Summary
A simulation of the Cotton agent system was built to perform load balancing
experiments. The goal of load balancing was not only to distribute tasks evenly
among processors but to efficiently do so on a system where data movement
requirements are limited by the network bandwidth capacity. A review of the
literature did not reveal a significant number of previous load balancing studies in
this context.
The simulation system size was an 80 by 80 torus with 800 available
processors. System resources (processor cycles and network packets) were created
on hosts and links and consumed by agents. The work of an agent's task was
distributed to processors by hierarchical spawning and movement of child agents.
The size of a task was 1000 work units. Each agent was capable of consuming 2
work units in its lifetime. The resulting agent tree had 500 agents with an average
depth of 5. The simulation was capable of many other model designs and could be
extended to other topologies in the future.
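The agent-tree figures above (a 1000-work-unit task, 2 units per agent, 500 agents, depth near 5) can be reproduced with a toy hierarchical spawner. The even-split policy and the branching factor of 4 are assumptions chosen for illustration, not Cotton's actual spawning logic:

```python
def spawn(work, fanout=4, depth=0):
    """Each agent consumes 2 work units and splits the remainder, in whole
    2-unit chunks, as evenly as possible among up to `fanout` children.
    Returns (agents in subtree, maximum depth reached)."""
    pairs = (work - 2) // 2                      # remaining 2-unit chunks
    q, rem = divmod(pairs, fanout)
    child_work = [2 * (q + 1)] * rem + [2 * q] * (fanout - rem)
    agents, max_depth = 1, depth
    for w in child_work:
        if w > 0:
            a, d = spawn(w, fanout, depth + 1)
            agents += a
            max_depth = max(max_depth, d)
    return agents, max_depth

agents, depth = spawn(1000)
# agents == 500 (1000 work units at 2 per agent); with fanout 4 the
# deepest agent in this toy model sits at depth 5
```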
Agents used a simple heuristic called Metropolis to make movement
decisions. Experiments were performed to determine the best input parameters for
Metropolis. Performance was measured in time to completion for a task with a fixed
amount of processing work. A secondary performance measure was the network
resources consumed. Further experiments were performed to determine the effect
that load determinants, agent hop distance, the number of choices to evaluate for
movement, and grid size had on performance.
The results of the experiments indicate a simple load balancing scheme such
as Metropolis is significantly better than random movement decisions. It was not
determined what the optimal time would be for the system modeled but the average
time to completion values over 500 simulation model runs were only 25 to 31
percent worse than the best result from the set of runs. Use of some load
determinants caused 35 percent less bandwidth to be used at a cost of 15 percent
increased time to completion. Increasing the bounds of search using hop distances
and choices within a hop distance provided no benefit for the bandwidth limited
system investigated. In this study, the limitations and best modes of operation were
substantially delineated for Metropolis.

The following Metropolis parameter recommendations were developed for
bandwidth limited systems running data intensive algorithms. The
2WorkUnitsPerLinkAgent load determinant should be used to optimize time to
completion for most circumstances. For low latency systems the CPU Weight and
temperature should be 0.08 and 10, respectively. For high latency systems the CPU
Weight and temperature should be 0.9 and 1, respectively. In circumstances where the impact to the
network must be minimized the load determinant, CPU Weight and temperature
should be NumActiveAgents, 0.06, and 0, respectively. The minimum and
maximum hop distances and choices per hop distance should always be 1.
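One way to record these recommendations is as a small lookup table; the structure below is a hypothetical encoding of the text above, not part of the Cotton system:

```python
# Recommended Metropolis parameters for bandwidth limited systems
# running data intensive algorithms (encoding of the recommendations above).
RECOMMENDATIONS = {
    # (goal, latency): (load determinant, CPU Weight, temperature)
    ("fast", "low_latency"):  ("2WorkUnitsPerLinkAgent", 0.08, 10),
    ("fast", "high_latency"): ("2WorkUnitsPerLinkAgent", 0.9, 1),
    ("minimal_network_impact", "any"): ("NumActiveAgents", 0.06, 0),
}
# Hop distances and choices per hop distance should always be 1.
HOP_PARAMETERS = {"min_hop": 1, "max_hop": 1, "choices_per_hop": 1}
```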

Appendix

A1. Effect of CPU Weight and Temperature for Low Latency, CPU Limited
System. (Time versus CPU Weight for temperatures 0, 0.5, 1, 2, 3, 10, and 10000.
Load Determinant: NumActiveAgents; Movement Method: Metropolis; HopDist:
max=1 min=1; ChoicesPerHopDist = 1.)

A2. Effect of CPU Weight and Temperature for High Latency, CPU Limited
System. (Time versus CPU Weight for temperatures 0, 0.5, 1, 2, 3, 10, and 10000.
Load Determinant: NumActiveAgents; Movement Method: Metropolis; HopDist:
max=1 min=1; ChoicesPerHopDist = 1.)

A3. Effect of CPU Weight and Temperature for Low Latency, Bandwidth Limited
System. (Time versus CPU Weight for temperatures 0, 0.5, 1, 2, 3, 10, and 10000.
Load Determinant: NumActiveAgents; Movement Method: Metropolis; HopDist:
max=1 min=1; ChoicesPerHopDist = 1.)

A4. Effect of CPU Weight and Temperature for High Latency, Bandwidth Limited
System. (Time versus CPU Weight for temperatures 0, 0.5, 1, 2, 3, 10, and 10000.
Load Determinant: NumActiveAgents; Movement Method: Metropolis; HopDist:
max=1 min=1; ChoicesPerHopDist = 1.)

A5. Effect of HopDist and ChoicesPerHopDist (=1) for Low Latency, Bandwidth
Limited System. (Time versus CPU Weight for min/max hop distances 1/1, 2/2,
3/3, 6/6, 1/2, 1/3, 1/5, 2/3, 2/5, and 4/6. Load Determinant: NumActiveAgents;
Movement Method: Metropolis; ChoicesPerHopDist = 1; Temperature = 0.)

A6. Effect of HopDist and ChoicesPerHopDist (=2) for Low Latency, Bandwidth
Limited System. (Time versus CPU Weight for min/max hop distances 1/1, 2/2,
3/3, 6/6, 1/2, 1/3, 1/5, 2/3, 2/5, and 4/6. Load Determinant: NumActiveAgents;
Movement Method: Metropolis; ChoicesPerHopDist = 2.)

A7. Effect of HopDist and ChoicesPerHopDist (=4) for Low Latency, Bandwidth
Limited System. (Time versus CPU Weight for min/max hop distances 1/1, 2/2,
3/3, 6/6, 1/2, 1/3, 1/5, 2/3, 2/5, and 4/6. Load Determinant: NumActiveAgents;
Movement Method: Metropolis; ChoicesPerHopDist = 4; Temperature = 0.)

A8. Effect of HopDist and ChoicesPerHopDist (=8) for Low Latency, Bandwidth
Limited System. (Time versus CPU Weight for min/max hop distances 1/1, 2/2,
3/3, 6/6, 1/2, 1/3, 1/5, 2/3, 2/5, and 4/6. Load Determinant: NumActiveAgents;
Movement Method: Metropolis; ChoicesPerHopDist = 8.)

A9. Effect of Load Determinant (Active&WaitingAgents) for Low Latency,
Bandwidth Limited System. (Time versus CPU Weight for temperatures 0 to 10000.
Load Determinant: NumActive&WaitingAgents; Movement Method: Metropolis;
HopDist: max=1 min=1; ChoicesPerHopDist = 1.)

A10. Effect of Load Determinant (Active&WaitingAgents) for High Latency,
Bandwidth Limited System. (Time versus CPU Weight for temperatures 0 to 10000.
Load Determinant: NumActive&WaitingAgents; Movement Method: Metropolis;
HopDist: max=1 min=1; ChoicesPerHopDist = 1.)

A11. Effect of Load Determinant (2WorkUnitsPerLinkAgent) for Low Latency,
Bandwidth Limited System. (Time versus CPU Weight for temperatures 0 to 10000.
Load Determinant: 2WorkUnitsPerLinkAgent; Movement Method: Metropolis;
HopDist: max=1 min=1; ChoicesPerHopDist = 1.)

A12. Effect of Load Determinant (2WorkUnitsPerLinkAgent) for High Latency,
Bandwidth Limited System.

A13. Effect of Load Determinant (CurrentWorkUnitsPerLinkAgent) for Low
Latency, Bandwidth Limited System. (Time versus CPU Weight for temperatures 0
to 10000. Load Determinant: CurrentWorkUnitsPerLinkAgent; Movement Method:
Metropolis; HopDist: max=1 min=1; ChoicesPerHopDist = 1.)

A14. Effect of Load Determinant (CurrentWorkUnitsPerLinkAgent) for High
Latency, Bandwidth Limited System. (Time versus CPU Weight for temperatures 0
to 10000. Load Determinant: CurrentWorkUnitsPerLinkAgent; Movement Method:
Metropolis; HopDist: max=1 min=1; ChoicesPerHopDist = 1.)

A15. Effect of Load Determinant (Actual) for Low Latency, Bandwidth Limited
System. (Time versus CPU Weight for temperatures 0 to 10000. Load Determinant:
Actual; Movement Method: Metropolis; HopDist: max=1 min=1;
ChoicesPerHopDist = 1.)

A16. Effect of Load Determinant (Actual) for High Latency, Bandwidth Limited
System. (Time versus CPU Weight for temperatures 0 to 10000. Load Determinant:
Actual; Movement Method: Metropolis; HopDist: max=1 min=1;
ChoicesPerHopDist = 1.)

A17. Effect of Load Determinant on Time to Completion. (Best and average of
500 model runs for each load determinant.)

A18. Effect of Load Determinant on Mobile Agent Ratio. (Average mobile agent
ratio over 500 model runs for each load determinant: Active&WaitingAgents,
NumActiveAgents, 2WorkUnitsPerLinkAgent, CurrentWorkUnitsPerLinkAgent,
and Actual.)

A19. Effect of Load Determinant on Bandwidth Consumed. (Average bandwidth
consumed, in packets, over 500 model runs for each load determinant.)

A20. Effect of Load Determinant on Number of Hosts Visited. (Average number of
hosts visited over 500 model runs for each load determinant.)

A21. Results using the Fixed Mobile Agent Ratio Movement Method. (Time versus
mobile agent ratio for the low latency and high latency, bandwidth limited systems.)

A22. Comparison of Optimal Results from 2 Movement Methods: Metropolis and
Fixed Mobile Agent Ratio. (Optimal times for the low and high latency cases.)

References
[1] A. Heirich and J. Arvo, "A Competitive Analysis of Load Balancing Strategies
for Parallel Ray Tracing," The Journal of Supercomputing 12, pp.57-68, 1998.
[2] L. Sanglu and X. Li, "A Scalable Load Balancing System for NOWs," Operating
Systems Review, vol.32, pp.55-63, 1998.
[3] J. Vaughan and M. O'Donovan, "Experimental Evaluation of Distributed Load
Balancing Implementations," Concurrency: Practice and Experience, vol 10(10), 763-
782, 1998.
[4] R. Blumofe and C. Leiserson, "Scheduling Multithreaded Computations by Work
Stealing," Proceedings of the 35th Annual IEEE Conference on Foundations of
Computer Science (FOCS), pp.356-368, November 1994.
[5] P. Loh, W. Hsu, C. Wentong, and N. Sriskanthan, "How Network Topology
Affects Dynamic Load Balancing," IEEE Parallel and Distributed Technology,
pp.25-35, Fall 1996.
[6] M. Willebeek-LeMair and A. Reeves, "Strategies for Dynamic Load Balancing on
Highly Parallel Computers," IEEE Trans. Parallel and Distributed Systems, vol.4,
no.9, Sept. 1993.
[7] M. Harchol-Balter and A. Downey, "Exploiting Process Lifetime Distributions
for Dynamic Load Balancing," ACM Trans. On Computer Systems, vol. 15, p.236,
1997.
[8] M. Zaki, S. Parthasarathy and W. Li, "Customized Dynamic Load Balancing,"
High Performance Cluster Computing (Buyya), Vol. 1, Chap. 24, Prentice Hall PTR,
1999.
[9] M. Tiemeyer and J. Wong, "A Task Migration Algorithm for Heterogeneous
Distributed Computing Systems," The Journal of Systems and Software 41, pp.175-
188, 1998.

[10] B. Weissman, B. Gomes, J. Quittek and M. Holtkamp, "Efficient Fine-Grain
Thread Migration with Active Threads," 12th International Parallel Processing
Symposium and 9th Symposium on Parallel and Distributed Processing, March 1998.
[11] E. Billard and J. Pasquale, "Load Balancing to Adjust for Proximity in some
Network Topologies," Parallel Computing 22, pp. 2007-2023, 1997.
[12] M. Zaki, W. Li and S. Parthasarathy, "Customized Dynamic Load Balancing for
a Network of Workstations," J. Parallel Distrib. Comput. 43, pp. 156-162, 1997.
[13] G. Lee, H. Lee, and J. Cho, "A Sender-Initiated Adaptive Load Balancing
Scheme Based on Predictable State Knowledge," IEICE Trans. Inf. & Syst., vol.E79-
D, no.3, pp.209-220, March 1996.
[14] F. Zambonelli, "Exploiting Biased Load Information in Direct-Neighbor Load
Balancing Policies," Parallel Computing 25, pp. 745-766, 1999.
[15] N. Metropolis et al., "Equation of State Calculations by Fast Computing
Machines," J. Chem. Phys., 21, 1087-1092, 1953.
[16] S. Kirkpatrick et al., "Optimization by Simulated Annealing," Science, 220,
671-680, 1983.
[17] N. Minar, "The Swarm Simulation System: A Toolkit for Building Multi-Agent
Simulations," Santa Fe Institute Technical Report 96-06-042, 1996.
[18] G. Blelloch, "Programming Parallel Algorithms," Communications of the ACM,
39(3), March 1996.