Citation
Design of the initialization, failure detection, routing, mutual exclusion, and election algorithms for a distributed system configured as a complete or incomplete hypercube

Material Information

Title:
Design of the initialization, failure detection, routing, mutual exclusion, and election algorithms for a distributed system configured as a complete or incomplete hypercube
Creator:
Bisque, Cheryl L
Publication Date:
Language:
English
Physical Description:
xiii, 160 leaves : illustrations ; 29 cm

Thesis/Dissertation Information

Degree:
Master's (Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Electrical Engineering, CU Denver
Degree Disciplines:
Electrical engineering

Subjects

Subjects / Keywords:
Computer algorithms ( lcsh )
Hypercube networks (Computer networks) ( lcsh )
Computer algorithms ( fast )
Hypercube networks (Computer networks) ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references.
General Note:
Submitted in partial fulfillment of the requirements for the degree, Master of Science, Department of Electrical Engineering.
Statement of Responsibility:
by Cheryl L. Bisque.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
26922764 ( OCLC )
ocm26922764
Classification:
LD1190.E54 1992m .B57 ( lcc )

Full Text
DESIGN OF THE INITIALIZATION, FAILURE DETECTION,
ROUTING, MUTUAL EXCLUSION, AND ELECTION
ALGORITHMS FOR A DISTRIBUTED SYSTEM

CONFIGURED AS A
COMPLETE OR INCOMPLETE HYPERCUBE
by
Cheryl L. Bisque
B.S., University of Southern Colorado, 1989
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Electrical Engineering
1992


This thesis for the Master of Science
degree by
Cheryl L. Bisque
has been approved for the
Department of
Electrical Engineering
Gita Alaghband
Date


Bisque, Cheryl L. (M.S., Electrical Engineering)
Design of the Initialization, Failure Detection,
Routing, Mutual Exclusion, and Election
Algorithms for a Distributed System Configured
as a Complete or Incomplete Hypercube
Thesis directed by Professor Gita Alaghband
ABSTRACT
The design of the initialization, failure de-
tection, routing, mutual exclusion, and election
algorithms for a distributed system configured as a
complete or incomplete hypercube is presented in
this paper. The initialization algorithm provides a
node with all of the information that is necessary
in order to acquire knowledge of the entire system
configuration. In addition, the initialization al-
gorithm has a very short execution time because this
algorithm does not require any message passing. The
failure detection algorithm ensures that each node
in the system knows the current status of all of the
nodes and links in the system, i.e., whether they
are working or not working. Moreover, the failure


detection algorithm requires fewer message transfers
than most other failure detection algorithms. The
routing algorithm ensures that the shortest working
path is always used to transmit a message from one
node to another node in the system. In addition,
the routing algorithm is more appealing than most
other routing algorithms because this algorithm
requires much less memory and no message passing.
The mutual exclusion algorithm guarantees that, when
one node in the system is using a nonsharable re-
source, no other node can gain access to this re-
source. This algorithm also ensures that any node
that needs to use a nonsharable resource can do so
within a finite amount of time. Moreover, the mutu-
al exclusion algorithm is more attractive than most
other mutual exclusion algorithms because this algo-
rithm requires a significantly smaller number of
message transfers. The election algorithm ensures
that the mutual exclusion algorithm is resilient to
node failures. In addition, the election algorithm
requires very few, if any, message transfers, and
this attribute makes this algorithm more appealing


than other election algorithms. The initialization,
failure detection, routing, mutual exclusion, and
election algorithms were tested on the Intel hyper-
cube system in order to ensure that the algorithms
functioned correctly.
This abstract accurately represents the content of
the candidate's thesis. I recommend its
publication.
Signed
Gita Alaghband


DEDICATION
Dedicated with love and appreciation to my husband,
Matthew, who delayed our honeymoon so that I could
complete this paper, and who, without a doubt, is
the most patient and understanding person that I
know.


CONTENTS
CHAPTER
1. INTRODUCTION ................................ 1
What Is a Distributed System?............... 1
Communication .............................. 3
Message Passing ....................... 3
Event Ordering.............................. 4
Time-Stamping ......................... 6
Configurations ............................. 7
Completely Connected .................. 8
Ring................................... 9
Star...................................11
Hypercube..............................13
Comparison.............................15
Initialization ............................ 17
Failure Detection ......................... 17
Routing.....................................18
Mutual Exclusion .......................... 19
Election....................................20
Implementation ............................ 21
2. SYSTEM REQUIREMENTS ........................ 23
Assumptions.................................24


3. INITIALIZATION...............................26
Principles..................................26
Properties..................................30
4. FAILURE DETECTION ........................ 31
Principles..................................32
Message Loss Resiliency .............. 46
Properties..................................49
Comparison to Other Algorithms ... 52
5. ROUTING......................................54
Principles..................................55
Properties..................................60
Comparison to Other Algorithms ... 61
6. MUTUAL EXCLUSION ........................... 63
Principles..................................63
Message Loss Resiliency .............. 79
Properties..................................83
Comparison to Other Algorithms ... 84
7. ELECTION..................................87
Principles..................................87
Message Loss Resiliency ............. 102
Properties.................................104
Comparison to Other Algorithms . 105


8. IMPLEMENTATION..............................107
Testing....................................108
9. CONCLUSIONS ..........................109
APPENDIX
A. NODE OPERATING SYSTEM.......................113
B. INITIALIZATION ALGORITHM .................. 119
C. FAILURE DETECTION ALGORITHM ............... 124
D. ROUTING ALGORITHM ......................... 131
E. MUTUAL EXCLUSION ALGORITHM ................ 133
F. ELECTION ALGORITHM..........................149
REFERENCES......................................159


FIGURES
Figure
1.1. A Completely Connected Configuration . 8
1.2. A Ring Configuration...................... 9
1.3. A Star Configuration......................11
1.4. A Complete Hypercube Configuration ... 13
1.5. An Incomplete Hypercube Configuration . 14
2.1. Link Numbering for a Hypercube............24
6.1. Node Interaction for the Mutual
Exclusion Algorithm ..................... 65


TABLES
Table
3.1. Required Global Variables for the
Initialization Algorithm ................ 27
3.2. Pseudocoded Version of the
Initialization Algorithm ................ 28
4.1. Required Message Types for the
Failure Detection Algorithm ............. 32
4.2. Required Global Variables for the
Failure Detection Algorithm ............. 33
4.3. Pseudocoded Version of the First
Failure Detection Procedure ............. 36
4.4. Pseudocoded Version of the Second
Failure Detection Procedure ............. 36
4.5. Pseudocoded Version of the Third
Failure Detection Procedure ............. 37
4.6. Pseudocoded Version of the Fourth
Failure Detection Procedure ............. 41
4.7. Pseudocoded Version of the Fifth
Failure Detection Procedure ...... 43
5.1. Required Global Variable for the
Routing Algorithm ....................... 55
5.2. Pseudocoded Version of the Routing
Algorithm............................... 56
6.1. Required Message Types for the
Mutual Exclusion Algorithm ....... 65
6.2. Required Global Variables for the
Mutual Exclusion Algorithm .............. 66


6.3. Pseudocoded Version of the First
Mutual Exclusion Procedure ............. 70
6.4. Pseudocoded Version of the Second
Mutual Exclusion Procedure ............. 70
6.5. Pseudocoded Version of the Third
Mutual Exclusion Procedure ............. 71
6.6. Pseudocoded Version of the Fourth
Mutual Exclusion Procedure ............. 72
6.7. Pseudocoded Version of the Fifth
Mutual Exclusion Procedure ............. 73
6.8. Pseudocoded Version of the Sixth
Mutual Exclusion Procedure ............. 74
6.9. Pseudocoded Version of the Seventh
Mutual Exclusion Procedure ............. 76
6.10. Pseudocoded Version of the Eighth
Mutual Exclusion Procedure ............. 77
7.1. Required Message Types for the
Election Algorithm ...................... 88
7.2. Required Global Variables for the
Election Algorithm ...................... 88
7.3. Pseudocoded Version of the First
Election Procedure ...................... 91
7.4. Pseudocoded Version of the Second
Election Procedure ...................... 92
7.5. Pseudocoded Version of the Third
Election Procedure ...................... 97
7.6. Pseudocoded Version of the Fourth
Election Procedure ...................... 98


7.7. Pseudocoded Version of the Fifth
Election Procedure ..................... 100
7.8. Pseudocoded Version of the Sixth
Election Procedure ..................... 101


CHAPTER 1
INTRODUCTION
The design of the initialization, failure
detection, routing, mutual exclusion, and election
algorithms for a distributed system configured as a
complete or incomplete hypercube is presented in
this paper.
In order to easily understand the reasons for
needing each of these algorithms, a basic under-
standing of a distributed system must be gained. In
this chapter, a complete explanation of a distrib-
uted system, including the specific characteristics
of a hypercube configuration, is given. The rea-
sons for requiring the initialization, failure de-
tection, routing, mutual exclusion, and election
algorithms are then presented.
What Is a Distributed System?
A distributed system is defined as a group of
processors that are linked by a communication net-
work. If these processors share a single memory and
clock, the system is tightly coupled. If each of


these processors has its own local memory and clock,
the system is loosely coupled.
Although a tightly coupled system is considered
to be a distributed system, it is referred to as a
multiprocessor system. A loosely coupled system is,
therefore, referred to as simply a distributed
system.
Thus, given these terms, a distributed system
is defined as a group of loosely coupled processors
that are linked by a communication network. These
processors are usually referred to as nodes, and
each of these nodes is numbered uniquely so that it
can be distinguished from the rest of the nodes in
the system [1].
Distributed systems have several advantages
over conventional, single processor, systems. These
advantages include resource sharing, information ex-
changing, decreased turnaround time, and increased
system reliability.
The first distributed system was designed by a
group of engineers at the Xerox Palo Alto Research
Center in the early 1970s. Since then, advances in hardware


technology and the desire to automate increasingly
difficult applications have led to more interest in
distributed systems [2].
Communication
The nodes in a distributed system must be able
to communicate with each other in order to share
information.
In a multiprocessor system, the nodes communi-
cate with each other by accessing the shared memory.
Since there is no shared memory in a distributed
system, the nodes communicate with each other by
sending and receiving messages through the communi-
cation network [3].
Message Passing
A message is defined as a block of data, to-
gether with some system information. The system
information includes the number of the source node
and the number of the destination node of the mes-
sage. The system information also includes the mes-
sage type, which is used by the destination node to
determine the meaning of the data that is included


in the message.
When a node needs to communicate with another
node, the node simply sends a message containing any
necessary data to the other node. The source node
must ensure that the type of message being sent is
correct so that the destination node will understand
the intent of the message.
When a node receives a message from another
node, the node determines the type of message that
was received, and the node then performs any func-
tions that are required upon receipt of a message
of this type [4].
Event Ordering
It is sometimes necessary to determine the
order in which events have occurred in a distrib-
uted system. For example, if several nodes request
to use a printer, it is very important to know the
order in which these requests were made so that the
node that made the earliest request can be given
permission to use the printer first. If the order
of these requests is not known, permission could be


given to the nodes in an order other than that in
which the requests were made, and this could result
in starvation of some of the nodes. For example, if
permission is always given to the lowest numbered
node that has made a request, the higher numbered
nodes could be required to wait indefinitely for
permission to be granted to them while a steady
stream of requests is being made by lower numbered
nodes [1].
In a multiprocessor system, event ordering is
easily accomplished because the shared memory and
clock can be used by each node in the system to de-
termine the global state of the system at any time.
Since message passing is the only form of communica-
tion between nodes in a distributed system and since
there is no common clock in a distributed system,
there is no global state observable from all of the
nodes in a distributed system, i.e., each node in a
distributed system will be aware of only the events
that occur at that node and the events that the node
has been made aware of via the receipt of a message
from another node. Thus, in a distributed system,


event ordering is not as easily accomplished. By
using a technique called time-stamping, a global
ordering of events can be established in a distrib-
uted system [4].
Time-Stamping
In order to implement time-stamping, each node
in the system must have a logical clock. Each of
the logical clocks in the system attempts to simu-
late a common system clock.
When a node needs to send a message, the node
increments the value of its logical clock by one,
and the node then time-stamps the message with the
new value of its logical clock, i.e., the node adds
the new value of its logical clock to the system
information in the message. The node then sends the
message. The value of the time-stamp in the message
represents the time at which the message was sent.
When a node receives a message, the node com-
pares the value of its own logical clock to the
time-stamp from the message. If the value of the
logical clock is greater than the time-stamp, the


node increments its logical clock by one. If the
value of the logical clock is less than the time-
stamp, the node sets its logical clock to the value
of the time-stamp, incremented by one. The new
value of the logical clock for this node represents
the time at which the message was received. After
setting its logical clock, the node determines the
type of message that was received, and the node then
performs any functions that are required upon re-
ceipt of a message of this type [4].
Configurations
Since message passing plays such an important
role in a distributed system, the physical config-
uration of the system is also very important because
the time that is needed to transmit messages between
nodes is dictated by the system configuration [3].
The nodes and links of a distributed system can
be physically configured in many different ways.
Some of the most popular configurations are the com-
pletely connected, ring, star, and hypercube config-
urations [5].


Completely Connected
A distributed system is said to be completely
connected if there is a direct link between all of
the pairs of nodes in the system. A completely con-
nected configuration containing six binary numbered
nodes is shown in Figure 1.1.
Figure 1.1. A Completely Connected Configuration
The number of links in a completely connected
distributed system is (n/2)(n - 1), where n is
the number of nodes in the system.
Because there is such a large number of links
that exist in a completely connected distributed
system, this type of configuration is very expensive
to implement.
A completely connected configuration provides


very fast transmission of messages between nodes
because there is a direct link between every two
nodes.
A completely connected distributed system is
very reliable because several link failures would
have to occur in order for the system to become dis-
connected [6].
Ring
A distributed system is said to be configured
as a ring if every node in the system is directly
linked to exactly two other nodes. A ring config-
uration containing six binary numbered nodes is
shown in Figure 1.2.


The number of links in a distributed system
configured as a ring is the same as the number of
nodes in the system.
A ring configuration is much less expensive to
implement than a completely connected configuration
because a ring configuration containing a specific
number of nodes has far fewer links than a com-
pletely connected configuration containing the same
number of nodes.
A ring configuration can be either unidirec-
tional or bidirectional. In a unidirectional ring
configuration, messages can be transmitted in only
one direction, whereas, in a bidirectional ring con-
figuration, messages can be transmitted in either
direction.
Both unidirectional and bidirectional ring
configurations provide much slower transmission of
messages between nodes than a completely connected
configuration because every node in a ring config-
uration has only two direct links to other nodes.
In addition, a unidirectional ring configuration
provides significantly slower transmission of


messages between nodes than a bidirectional ring
configuration.
A distributed system configured as a bidirec-
tional ring is fairly reliable because more than one
link or node failure would have to occur in order
for the system to become disconnected. A distrib-
uted system configured as a unidirectional ring is
not reliable because the system would become discon-
nected after only one link or node failure [1].
Star
A distributed system is said to be configured
as a star if there is one central node in the system
which has a direct link to all of the other nodes in
the system. A star configuration containing five
binary numbered nodes is shown in Figure 1.3.


The number of links in a distributed system
configured as a star is one less than the number of
nodes in the system.
A star configuration is slightly less expensive
to implement than a ring configuration because a
star configuration containing a specific number of
nodes has one fewer link than a ring configuration
containing the same number of nodes.
A star configuration should provide faster
transmission of messages between nodes than a ring
configuration and slower transmission of messages
between nodes than a completely connected configu-
ration because every node in a star configuration is
not more than two links away from every other node.
However, this is not always the case because the
central node in a star configuration can become a
bottleneck when numerous transmissions are simul-
taneously occurring, and this can result in a dras-
tic reduction in the speed of message transmission.
A distributed system configured as a star is
not very reliable because, if a node failure oc-
curred at the central node, the system would become


completely disconnected [1].
Hypercube
A distributed system is said to be configured
as a hypercube if there is a direct link between all
of the pairs of nodes in the system whose binary
representations differ in only one bit position [7].
A hypercube configuration is said to be com-
plete if the total number of nodes is a power of
two; otherwise, it is said to be incomplete [8]. A
complete hypercube configuration containing eight
binary numbered nodes is shown in Figure 1.4, and an
incomplete hypercube configuration containing six
binary numbered nodes is shown in Figure 1.5.
Figure 1.4. A Complete Hypercube Configuration


Figure 1.5. An Incomplete Hypercube Configuration
The maximum number of links in a distributed
system configured as a complete or incomplete hyper-
cube is (n/2)log2(n), where n is the number of
nodes in the system.
Both types of hypercube configurations are less
expensive to implement than a completely connected
configuration and more expensive to implement than a
ring configuration. This is due to the fact that a
hypercube configuration containing a specific number
of nodes has fewer links than a completely connected
configuration containing the same number of nodes,
while it has more links than a ring configuration
containing the same number of nodes.
Both types of hypercube configurations provide
faster transmission of messages between nodes than a
ring configuration and slower transmission of mes-


sages between nodes than a completely connected con-
figuration because every node in a hypercube config-
uration is not more than log2(n) links away from
every other node.
A distributed system configured as a complete
or incomplete hypercube is very reliable because
several link failures would have to occur in order
for the system to become disconnected [7].
Comparison
The completely connected configuration has very
fast transmission of messages between nodes, and it
is very reliable; however, this type of configura-
tion is impractical for systems containing more than
a few nodes because of the numerous links that are
required.
Both types of ring configurations and the star
configuration are much more practical in terms of
the number of links required; however, none of these
types of configurations are very reliable [6].
The complete and incomplete hypercube configu-
rations are very reliable, and they both offer fast


transmission of messages between nodes. These con-
figurations do require more links than the ring and
star configurations; however, the number of links
that are required for the hypercube configurations
is significantly less than that which is required
for the completely connected configuration.
Overall, the hypercube configurations provide a
nice balance between the expensive completely con-
nected configuration and the unreliable ring and
star configurations. Hypercube configurations are
also advantageous because they promote algorithm
embeddability, uniform structure, extension ability,
and easy software development [7]. Because of the
numerous advantages that are associated with the
hypercube configurations, the algorithms that are
presented in this paper were designed specifically
for a distributed system that is configured as a
complete or incomplete hypercube. The requirements
for this type of system, as well as the specific as-
sumptions that were made for these algorithms, are
explained in Chapter 2.


Initialization
After power is applied to a node in a distrib-
uted system, the node needs to perform initializa-
tion in order to find the system information that is
needed for the node to function as a cooperating
member of the system. Initialization is also needed
in order to initialize the operating system vari-
ables, such as the logical clock [8].
The initialization algorithm that was designed
for a distributed system configured as a hypercube
is described in Chapter 3, and the C program for
this algorithm can be found in Appendix B [9], [10].
Failure Detection
Link and node failures are two of the most
common types of failures in a distributed system.
Thus, in order to make a distributed system robust,
the system must be able to detect link and node
failures. When a link or node failure occurs, each
node in the system should be notified so that the
nodes do not attempt to send any messages through a
failed link and so that the nodes do not attempt to


use resources that reside at a failed node. After a
link or node is repaired, each node in the system
should again be notified. Each node in a distrib-
uted system should perform failure detection period-
ically in order to detect and announce any link or
node failures or recoveries. Failure detection is
also needed in order to reconfigure in the event of
a detected failure or recovery [1].
The failure detection algorithm that was de-
signed for a distributed system configured as a
hypercube is described in Chapter 4, and the C pro-
gram for this failure detection algorithm can be
found in Appendix C [9], [10].
Routing
When one node in a distributed system needs to
communicate with another node, the node simply sends
a message containing any necessary data to the other
node. In order to do this, the node must know the
series of links that the message should be routed
through in order to get from this node to the des-
tination node. Typically, each node in a distrib-


uted system has a routing table that is used to
determine the path that a message should be sent
through in order to get to another node in the sys-
tem. After a link or node failure or recovery oc-
curs in the system, the routing table for each node
must be updated so that the nodes do not attempt to
route messages through paths that are not working.
Thus, each node in a distributed system needs to
perform routing after a link or node failure or re-
covery in order to keep up-to-date information in
the routing table [1].
The routing algorithm that was designed for a
distributed system configured as a hypercube is de-
scribed in Chapter 5, and the C program for this
algorithm can be found in Appendix D [9], [10].
Mutual Exclusion
When one node in a distributed system is using
a nonsharable resource, no other node should be
allowed access to this resource. For example, when
one node is using a printer, no other nodes should
be given permission to use the printer because a


printer cannot be used by more than one node at a
time. However, when a node needs to use a nonshar-
able resource, it should be allowed to do so within
a finite amount of time. For example, permission to
use a nonsharable resource should not always be
given to the highest numbered node that has made a
request to use this resource because the lower num-
bered nodes could be required to wait indefinitely
for permission to be granted to them while a steady
stream of requests for the resource is being made
by higher numbered nodes. Each node in a distrib-
uted system should perform mutual exclusion in order
to equitably choose which node should be allowed
access to a nonsharable resource at any time [4].
The mutual exclusion algorithm that was de-
signed for a distributed system configured as a
hypercube is described in Chapter 6, and the C pro-
gram for this mutual exclusion algorithm can be
found in Appendix E [9], [10].
Election
In a distributed system, it is sometimes advan-


tageous to employ one or more of the nodes to per-
form certain functions that are needed by the other
nodes in the system. These types of nodes are usual-
ly referred to as coordinators. If a coordinator
node fails in a distributed system, a new coordina-
tor node must be elected so that the needs of the
other nodes in the system will continue to be met.
The mutual exclusion algorithm that is presented in
this paper performs its function by requiring that
some of the nodes in the system serve as coordinator
nodes. Each node in the system should therefore
perform election after a coordinator node failure
occurs in order to make the mutual exclusion algo-
rithm resilient to node failures [4].
The election algorithm that was designed for a
distributed system configured as a hypercube is de-
scribed in Chapter 7, and the C program for this
algorithm can be found in Appendix F [9], [10].
Implementation
A node operating system program was written in
order to control the execution of all of the algo-


rithms that are presented in this paper [9], [10].
This C program was then used to test all of the al-
gorithms on the Intel hypercube system, otherwise
known as the iPSC/2 system [11], [12]. This C pro-
gram can be found in Appendix A, and an explanation
of the testing that was performed on the algorithms
is given in Chapter 8.


CHAPTER 2
SYSTEM REQUIREMENTS
Because of the numerous advantages that are
associated with hypercube configurations, the algo-
rithms that are presented in this paper were de-
signed specifically for a distributed system that is
configured as a complete or incomplete hypercube.
In this chapter, the requirements for this type of
distributed system, as well as the specific assump-
tions that were made for these algorithms, are
explained.
As was previously stated, a distributed system
is said to be configured as a hypercube if there is
a direct link between all of the pairs of nodes in
the system whose binary representations differ in
only one bit position [7]. The configuration is
complete if the total number of nodes is a power of
two; otherwise, it is incomplete.
A distributed system configured as a hypercube
requires that a node number be hard-wired into each
of the nodes in the system. A hypercube distributed
system also requires that a link number be hard-


wired into each link in the system. The link number
for each link is established by the bit that differs
in the two nodes that the link is between [8]. An
example of the link numbering for a hypercube dis-
tributed system is shown in Figure 2.1.
Figure 2.1. Link Numbering for a Hypercube
Assumptions
The algorithms that are presented in this paper
assume that the nodes in the system are numbered
consecutively from zero. This assumption is real-
istic because this is the conventional approach for
numbering the nodes.
The algorithms also assume that each node in
the system knows the total number of nodes that
exist in the system. This is also a realistic as-
sumption because the nodes can determine the total


number of nodes in the system by simply communicat-
ing among themselves [8].


CHAPTER 3
INITIALIZATION
After power is applied to a node in a hyper-
cube distributed system, the node needs to perform
initialization in order to find its own physical
node number, the total number of nodes in the sys-
tem, and any other system information that is needed
for the node to function as a cooperating member of
the system. Initialization is also needed in order
to initialize the operating system variables, such
as the logical clock [8].
The initialization algorithm that was designed
for a hypercube distributed system is described in
this chapter, and the C program for this algorithm
can be found in Appendix B [9], [10].
Principles
In order to execute the initialization algo-
rithm, a node must have the global variables that
are explained in Table 3.1. Several other global
variables, such as those relating to the failure
detection, mutual exclusion, and election algo-
rithms, will be initialized during execution of the
initialization algorithm; however, these variables
are not shown here because they will be introduced
in later chapters.
Table 3.1. Required Global Variables for the
Initialization Algorithm
mynum:    integer
          This variable contains the physical num-
          ber of this node.
totnum:   integer
          This variable contains the total number
          of nodes in the system.
mygrp:    integer
          This variable contains the group number
          of this node.
totgrps:  integer
          This variable contains the total number
          of groups in the system.
degree:   integer
          This variable contains the maximum num-
          ber of physical links that are connected
          to any node in the system.
logclock: integer
          This variable is used to implement time-
          stamping, and it contains the current
          value of this node's logical clock.
A node executes the initialization algorithm
after power is applied to the node. During execu-
tion of this algorithm, a node finds necessary sys-
tem information, and it initializes the operating
system variables.
A pseudocoded version of the initialization
algorithm is shown in Table 3.2 [13], and a detailed
explanation of each of the functions that are per-
formed by this algorithm is then given.
Table 3.2. Pseudocoded Version of the
Initialization Algorithm
initialization:
BEGIN
set mynum to the physical number of this node
set totnum to the total number of nodes in the
system
set mygrp to ⌊mynum / 4⌋
set totgrps to ⌈totnum / 4⌉
set degree to ⌈log2(totnum)⌉
initialize logclock to 0
initialize failure detection variables
initialize mutual exclusion variables
initialize election variables
execute failure detection
END
During initialization, a node first determines
its own physical node number, and the node stores
this information in the mynum variable. The node
then determines the total number of nodes in the
system, and it stores this information in the totnum
variable.
Using the mynum variable, the node finds its
group number, which is equal to ⌊mynum / 4⌋, and it
stores this information in the mygrp variable. For
example, the nodes numbered zero, one, two, and
three are in group number zero, the nodes numbered
four, five, six, and seven are in group number one,
and so on.
Using the totnum variable, the node finds the
total number of groups in the system, which is equal
to ⌈totnum / 4⌉, and it stores this information in
the totgrps variable. For example, if the system
contains four nodes, numbered zero to three, there
is one group, and, if the system contains eight
nodes, there are two groups, and so on.
Using the totnum variable again, the node finds
the largest degree of any node in the system, which
is equal to ⌈log2(totnum)⌉, and it stores this in-
formation in the degree variable.
After finding all of the system information
shown above, the node initializes the logclock vari-
able to zero. The node then initializes all of the
global variables associated with the failure detec-
tion, mutual exclusion, and election algorithms, and
the node initiates failure detection to ensure that
all of its links are working properly.
Properties
This initialization algorithm has a very short
execution time because no message passing is in-
volved. In addition, the algorithm provides a node
with all of the information that is necessary in
order to acquire knowledge of the entire system con-
figuration. After initialization, a node will be
able to determine exactly how the system is config-
ured, including the total number of nodes, the total
number of links, and the location of each node and
link in the system.
CHAPTER 4
FAILURE DETECTION
Two of the most common types of failures in a
distributed system are link failures and node fail-
ures. In order to make a hypercube distributed sys-
tem robust, the system must be able to detect link
and node failures. Actually, a node failure can be
detected by detecting link failures on all of the
links that are connected to the node; thus, the only
types of failures that truly need to be detected are
link failures. Each node in the system needs to be
notified when a link failure occurs so that the
nodes do not attempt to send any messages through a
failed link and so that the nodes do not attempt to
use resources that reside at a failed node. Each
node in the system should also be notified after a
link recovers. Each node in a hypercube distributed
system should perform failure detection immediately
following initialization and periodically thereafter
in order to detect and announce any link failures or
recoveries and in order to reconfigure in the result
of a detected failure or recovery [1].
The failure detection algorithm that was de-
signed for a distributed system configured as a
hypercube is described in this chapter, and the C
program for this failure detection algorithm can be
found in Appendix C [9], [10].
Principles
The failure detection algorithm uses message
passing to perform its function. A brief explana-
tion of each type of message that is used by this
algorithm is given in Table 4.1.
Table 4.1. Required Message Types for the
Failure Detection Algorithm
AREYOUUP: A node sends an AREYOUUP message through
a link in order to determine if the link
is working.
IAMUP: A node sends an IAMUP message through a
link in order to inform the node at the
other end of the link that the link is
working.
DETECTION: A node sends a DETECTION message to an-
other node in order to inform the other
node of the current status of this
node's links, i.e., whether the links
are working or not working.
In order to execute the failure detection algo-
rithm, each node must have the global variables that
are explained in Table 4.2.
Table 4.2. Required Global Variables for the
Failure Detection Algorithm
mylinks:   array [1 .. degree] of boolean
           This array is used as a record of the
           links that are connected to this node:
           position j of this array is set to TRUE
           if link j is connected to this node and
           FALSE if not.
goodlinks: array [1 .. totnum, 1 .. degree] of boolean
           This array is used as a record of the
           current status of all of the links in
           the system: position [k, j] of this
           array contains a TRUE if link j for node
           k is working and a FALSE if not.
fdlinks:   array [1 .. degree] of boolean
           This array is used as a record of the
           links that messages have recently been
           received through: position j of this
           array contains a TRUE if a message has
           recently been received through link j
           and a FALSE if not.
tmplinks:  array [1 .. degree] of boolean
           This array is used as a temporary record
           of the status of the links for one of
           the nodes in the system: position j of
           this array contains a TRUE if link j for
           the node is working and a FALSE if not.
goodnodes: array [1 .. totnum] of boolean
           This array is used as a record of the
           current status of all of the nodes in
           the system: position j of this array
           contains a TRUE if node j is working and
           a FALSE if not.
oldnodes:  array [1 .. totnum] of boolean
           This array contains the previous values
           from the goodnodes array.
fdnodes:   array [1 .. totnum] of boolean
           This array is used as a record of the
           nodes that need to be sent a DETECTION
           message: position j of this array con-
           tains a TRUE if node j needs to be sent
           a DETECTION message and a FALSE if not.
routingt:  array [1 .. totnum] of integer
           This array is used as the routing table
           for this node: position j of this array
           contains the number of the first link
           that a message should be sent through in
           order to get from this node to node j or
           a -1 if no working path exists from this
           node to node j or a -5 if this node is
           node j.
The failure detection algorithm consists of
five procedures.
The first failure detection procedure is exe-
cuted whenever a message is received by a node, and
this procedure enables a node to keep track of the
links through which messages have recently been
received.
The second failure detection procedure is ini-
tiated when an AREYOUUP message is received by a
node. This procedure is executed in order to inform
the node that sent the AREYOUUP message that the
link that the AREYOUUP message was sent through is
working.
The third failure detection procedure is the
main failure detection procedure, and it is executed
by a node immediately following initialization and
periodically after that. During this procedure, a
node determines whether each of the links that are
directly connected to it is working or not working,
and the node reconfigures if any of these links have
recently failed or recovered.
The fourth failure detection procedure is ini-
tiated by a node after a link failure, a link recov-
ery, or a node recovery is detected by the node.
This procedure is executed in order to inform work-
ing nodes of any recently detected link failures and
recoveries and in order to inform recently recovered
nodes of the current status of the links.
The fifth failure detection procedure is initi-
ated when a DETECTION message is received by a node.
During this procedure, a node reconfigures from any
newly reported link failures and recoveries.
Pseudocoded versions of each of the failure
detection procedures are shown in Tables 4.3 through
4.7 [13]. In order to gain a complete understanding
of each failure detection procedure, the pseudocoded
version of each procedure is followed by a detailed
explanation of each of the functions that are per-
formed by the procedure.
Table 4.3. Pseudocoded Version of the
First Failure Detection Procedure
when any message is received through link k:
BEGIN
record a TRUE in fdlinks[k]
END
When a message is received by a node, the node
initiates the first failure detection procedure,
and, in this procedure, the node records a TRUE in
the position in the fdlinks array for the link that
the message was received through. Since all of the
positions in the fdlinks array are set to FALSE at
the end of each failure detection run, the first
failure detection procedure enables a node to keep
track of the links that messages have been received
through since the previous failure detection run.
Table 4.4. Pseudocoded Version of the
Second Failure Detection Procedure
when an AREYOUUP message is received through link k:
BEGIN
send an IAMUP message through link k
END
When a node receives an AREYOUUP message, the
node initiates the second failure detection proce-
dure, which consists of sending an IAMUP message
through the link that the AREYOUUP message was re-
ceived through. Thus, if a node sends an AREYOUUP
message through a working link, the node should
receive an IAMUP message through the link shortly
thereafter. A node can therefore determine whether
a link is working or not working by sending an
AREYOUUP message through the link.
Table 4.5. Pseudocoded Version of the
Third Failure Detection Procedure
failure detection:
BEGIN
FOR each link i DO
IF fdlinks[i] = FALSE and mylinks[i] = TRUE
no messages have been received through link i
send an AREYOUUP message through link i
END IF
END FOR
IF any AREYOUUP messages were sent
wait a predetermined amount of time
END IF
FOR each link i DO
IF fdlinks[i] = TRUE and
goodlinks[mynum, i] = FALSE
link i has recovered
record a TRUE in goodlinks[mynum, i]
set other_node to the number of the node at
other end of link i
record a TRUE in goodlinks[other_node, i]
END IF
IF fdlinks[i] = FALSE and
goodlinks[mynum, i] = TRUE
link i has failed
record a FALSE in goodlinks[mynum, i]
set other_node to the number of the node at
other end of link i
record a FALSE in goodlinks[other_node, i]
END IF
END FOR
IF there were any link recoveries or failures
execute routing
FOR each node i DO
IF routingt[i] >= 0
node i is working and should be sent a
DETECTION message
record a TRUE in fdnodes[i]
ELSE
node i is not working
record a FALSE in fdnodes[i]
END IF
END FOR
FOR each link i DO
copy fdlinks[i] to tmplinks[i]
END FOR
execute failure announcement
END IF
FOR each link i DO
reset fdlinks[i] to FALSE
END FOR
END
When a node initiates the third failure detec-
tion procedure, the node first checks all of the
positions in its fdlinks array. If any of the posi-
tions in this array contain a FALSE while the same
position in the mylinks array contains a TRUE, no
messages have been received through the correspond-
ing link since the last failure detection run; thus,
the node sends an AREYOUUP message through this link
in order to determine whether the link is working.
After sending any necessary AREYOUUP messages
through links, the node waits a predetermined amount
of time to receive IAMUP messages through these
links. If an IAMUP message is not received through
one of these links within this amount of time, the
node assumes that this link is not working.
The node then compares each position in the
fdlinks array to the corresponding position in the
goodlinks array in order to determine whether any of
the links that are connected to this node have re-
covered or failed since the previous failure detec-
tion run. If the position for a link in the fdlinks
array contains a TRUE and the corresponding position
for this link in the goodlinks array contains a
FALSE, this link has recently recovered from a fail-
ure; thus, the node records a TRUE in both of the
positions for this link in the goodlinks array in
order to keep track of the current status of this
link. If the position for a link in the fdlinks ar-
ray contains a FALSE and the corresponding position
for this link in the goodlinks array contains a
TRUE, this link has recently failed; thus, the node
records a FALSE in both of the positions for this
link in the goodlinks array.
If any recent link recoveries or failures have
occurred, the node initiates routing in order to up-
date its routing table.
After executing the routing algorithm, the node
determines which of the other nodes in the system
need to be notified of each link recovery and failure.
If a position in the routingt array contains a value
that is greater than or equal to zero, the corre-
sponding node is working and should therefore be
notified of each recovery and failure; thus, the
node records a TRUE in this node's position in the
fdnodes array. If a position in the routingt array
contains a value that is equal to negative one, the
corresponding node is not working and should not be
notified of each recovery and failure; thus, the
node records a FALSE in this node's position in the
fdnodes array. The node then copies all of the data
from the fdlinks array to the tmplinks array, and
the node executes the fourth failure detection pro-
cedure in order to announce each recent link recov-
ery and failure.
Table 4.6. Pseudocoded Version of the
Fourth Failure Detection Procedure
failure announcement:
BEGIN
FOR each node i DO
IF fdnodes[i] = TRUE
send a DETECTION message containing tmplinks
to node i
END IF
END FOR
FOR each node i DO
copy goodnodes[i] to oldnodes[i]
IF routingt[i] >= 0 and goodnodes[i] = FALSE
node i has recovered
record a TRUE in goodnodes[i]
END IF
IF routingt[i] = -1 and goodnodes[i] = TRUE
node i has failed
record a FALSE in goodnodes[i]
END IF
END FOR
IF there were any node recoveries or failures
execute election
END IF
END
During execution of the fourth failure detec-
tion procedure, a node sends a DETECTION message
containing the tmplinks array to any node whose po
sition in the fdnodes array contains a TRUE.
The node then compares each position in the
routingt array to the same position in the goodnodes
array in order to determine whether any of the nodes
in the system have recovered or failed since the
previous failure detection run. If a position in
the routingt array contains a value that is greater
than or equal to zero and the same position in the
goodnodes array contains a FALSE, the corresponding
node has recently recovered from a failure; thus,
the node records a FALSE in the position for this
node in the oldnodes array and a TRUE in the posi-
tion for this node in the goodnodes array in order
to keep track of the previous and current status of
this node. If a position in the routingt array
contains a value that is equal to negative one and
the same position in the goodnodes array contains a
TRUE, the corresponding node has recently failed;
thus, the node records a TRUE in the position for
this node in the oldnodes array and a FALSE in the
position for this node in the goodnodes array.
If any recent node recoveries or failures have
occurred, the node then initiates election so that
the mutual exclusion algorithm can be reconfigured.
Table 4.7. Pseudocoded Version of the
Fifth Failure Detection Procedure
when a DETECTION message is received from node k:
BEGIN
FOR each link i DO
copy data[i] from the DETECTION message to
tmplinks[i]
END FOR
FOR each link i DO
IF tmplinks[i] = TRUE and
goodlinks [k, i] = FALSE
link i has recovered
record a TRUE in goodlinks[k, i]
set other_node to the number of the node at
other end of link i
record a TRUE in goodlinks[other_node, i]
END IF
IF tmplinks[i] = FALSE and
goodlinks[k, i] = TRUE
link i has failed
record a FALSE in goodlinks[k, i]
set other_node to the number of the node at
other end of link i
record a FALSE in goodlinks[other_node, i]
END IF
END FOR
IF there were any link recoveries or failures
execute routing
END IF
IF there were any link recoveries
FOR each node i DO
IF routingt[i] >= 0 and goodnodes[i] = FALSE
node i has recovered and should be sent a
DETECTION message
record a TRUE in fdnodes[i]
ELSE
node i has not recovered
record a FALSE in fdnodes[i]
END IF
END FOR
END IF
IF there were any node recoveries
FOR each link i DO
copy goodlinks[mynum, i] to tmplinks[i]
END FOR
execute failure announcement
END IF
END
When a DETECTION message is received by a node,
the node initiates the fifth failure detection pro-
cedure, and, in this procedure, the node copies all
of the data from the DETECTION message into the
tmplinks array.
The node then compares each position in the
tmplinks array to the corresponding position in the
goodlinks array in order to determine whether each
of the link recoveries and failures that were re-
ported in the DETECTION message have been previous-
ly recorded. If the position for a link in the
tmplinks array contains a TRUE and the corresponding
position for this link in the goodlinks array con-
tains a FALSE, this link has a reported recovery
which has not been recorded yet; thus, the node
records a TRUE in both of the positions for this
link in the goodlinks array in order to keep track
of the current status of this link. If the position
for a link in the tmplinks array contains a FALSE
and the corresponding position for this link in the
goodlinks array contains a TRUE, this link has a
reported failure which has not been recorded yet;
thus, the node records a FALSE in both of the po-
sitions for this link in the goodlinks array.
If any link recoveries or failures had to be
recorded, the node initiates routing in order to up-
date its routing table.
After executing the routing algorithm, the node
determines if any of the other nodes in the system
have recently recovered from a failure. If a posi-
tion in the routingt array contains a value that is
greater than or equal to zero and the same position
in the goodnodes array contains a FALSE, the corre-
sponding node has recently recovered from a failure,
and it should therefore be notified of the current
status of this node's links so that it will have the
most recent system information; thus, the node re-
cords a TRUE in this node's position in the fdnodes
array. The node also records a FALSE in each posi-
tion in the fdnodes array for which the correspond-
ing node has not recently recovered. If any node
recoveries were found, the node then copies the cur-
rent status of each of its links from the goodlinks
array to the tmplinks array, and the node executes
the fourth failure detection procedure in order to
announce the current status of its links to each
recently recovered node.
Message Loss Resiliency
The most common type of failure in a distrib-
uted system is the loss of a message. Distributed
system algorithms should therefore be resilient to
this type of failure [1].
In order to make an algorithm resilient to
message losses, the consequences of a lost message
must first be considered, and the algorithm must
then be modified to ensure that a lost message does
not cause degraded performance.
The three types of messages that are used in
the failure detection algorithm are the AREYOUUP,
IAMUP, and DETECTION messages.
If an AREYOUUP or an IAMUP message is lost, a
working link will be reported as not working. If a
DETECTION message is lost, a node may not receive
the most recent system information. In both of
these cases, the loss of a message may result in a
node not being able to perform efficiently.
In order to recover from a lost AREYOUUP or
IAMUP message, the following modification was made
to the failure detection algorithm.
After sending an AREYOUUP message through a
link, a node waits a predetermined amount of time to
receive an IAMUP message through this link. If an
IAMUP message is not received through this link
within this amount of time, the node resends the
AREYOUUP message in case a message was lost. The
node repeats this procedure three times, if neces-
sary. If an IAMUP message is still not received
through this link, the node assumes that this link
is not working.
This modification makes use of probabilities.
Since it is not probable that a message will be lost
during each of the three attempts, this modification
should suit the needs of this algorithm fairly well.
This modification is not a cure-all, however, be-
cause, if a message is lost during each of the three
attempts, a working link will be reported as not
working.
In order to recover from a lost DETECTION mes-
sage, a new message called DETRCVD was introduced,
and the following modifications were made to the
failure detection algorithm.
When a node receives a DETECTION message, the
node initiates the fifth failure detection proce-
dure. Before exiting from this procedure, the node
sends a DETRCVD message back to the source node of
the DETECTION message.
After sending a DETECTION message to another
node, a node waits a predetermined amount of time to
receive a DETRCVD message from this node. If a
DETRCVD message is not received from this node
within this amount of time, the node resends the
DETECTION message in case a message was lost. The
node repeats this procedure until either the DETRCVD
message is received or the destination node is found
to not be working.
These modifications provide the failure detec-
tion algorithm with the means to recover from a
DETECTION or a DETRCVD message loss, and these modi-
fications will not significantly increase the execu-
tion time of the algorithm unless a message loss
occurs.
Properties
The failure detection algorithm that is pre-
sented in this paper ensures that each node in the
system knows the current status of all of the nodes
and links in the system, i.e., whether they are
working or not working. This algorithm is used to
determine whether there have been any recent link
failures or recoveries. After a link failure or
recovery is detected, the routing algorithm is ini-
tiated in order to determine whether a working path
exists to every other node in the system. From this
information, a node can determine whether there have
been any recent node failures or recoveries.
During failure detection, a node will only send
an AREYOUUP message through one of its links if a
message has not been received through this link
since the previous failure detection run. This
could result in a link failure not being discovered
during a failure detection run. If this occurs, the
link failure will be found either during the next
failure detection run or by the node at the other
end of this link.
This algorithm could be modified so that a link
failure would definitely be discovered during the
first failure detection run that follows the link
failure. In order to do this, a node would simply
need to send an AREYOUUP message through each of the
links that are directly connected to it when failure
detection is initiated. Although this method en-
sures that every link failure is detected during a
failure detection run, it may require a considerably
larger number of messages to be sent, which would
cause the execution time of the algorithm to in-
crease significantly.
If no messages are lost, the maximum number of
messages that will need to be sent from any node
that is performing failure detection is degree +
totnum - 1. For example, if there are eight nodes
in the system, the maximum number of messages is
ten: three AREYOUUP messages and seven DETECTION
messages. The minimum number of messages that will
need to be sent is zero because, if every directly
connected link for a node is working and a message
has been received through each of these links since
the previous run, no AREYOUUP messages or DETECTION
messages will need to be sent.
As was previously stated, the failure detection
algorithm should be run by each node periodically.
This algorithm can be run as often as desired with
very few adverse effects because messages are trans-
mitted to every node only if a failure or recovery
is detected. Deciding how often failure detection
should be initiated is a very complex decision. If
one node is used as a coordinator for other nodes in
the system, it may be wise to initiate failure de-
tection for this node and the nodes surrounding it
more often than for the rest of the nodes in the
system; however, this decision will have to be made
for each specific system, depending on the system
performance and control algorithms.
Comparison to Other Algorithms
The majority of the popular failure detection
algorithms use some type of handshaking scheme simi-
lar to the one presented in this paper. However,
there is one main difference between the failure
detection algorithm that is presented in this paper
and the other algorithms [1].
The algorithm in this paper is executed by a
node in an attempt to find any link failures and
recoveries surrounding this node. From this infor-
mation, the node then finds any node failures and
recoveries.
Other failure detection algorithms tend to be
executed by a node in an attempt to find both the
link failures and recoveries surrounding this node
and the node failures and recoveries surrounding
this node. In order to do this, a node that is exe-
cuting one of these algorithms must perform several
iterations of a link failure detection scheme in an
attempt to find a link failure for every link that
is connected to a node. This results in a consider-
able increase in the number of messages that need to
be sent by a node to perform failure detection [1],
and, since distributed algorithms are evaluated by
the number of messages passed, this makes these oth-
er failure detection algorithms less attractive than
the algorithm that is presented in this paper [4].


CHAPTER 5
ROUTING
Each node in the system has a signed integer
array called routingt which represents the routing
table for the node. When a node needs to send a
message to another node in the system, the node
looks in the destination node's position in its
routing table to find the first link that the mes-
sage should be sent through in order to get to the
destination node. The node then sends the message
through this link. When the node at the other end
of the link receives the message, this node deter-
mines whether it is the destination node for the
message. If this node is not the destination node,
it sends the message through the link that is con-
tained in the destination node's position in its
routing table. This procedure is repeated by each
node that receives the message until the message is
received by the destination node. After a link
failure or recovery is detected in the system, the
routing table for each node must be updated so that
the nodes do not attempt to route messages either
through paths that are not working or to nodes that
are not working. Thus, each node in the system
needs to perform routing after a link failure or
recovery in order to keep up-to-date information in
the routing table [1].
The routing algorithm that was designed for a
hypercube distributed system is described in this
chapter, and the C program for this algorithm can be
found in Appendix D [9], [10].
Principles
In order to execute the routing algorithm, a
node must have the goodlinks array and the routingt
array that are explained in Table 4.2, as well as
the global variable that is explained in Table 5.1.
Table 5.1. Required Global Variable for the
Routing Algorithm
list: array [1 .. totnum] of integer
This array is used as a record of each
of the nodes for which a working path
has been found, and these nodes are
listed in this array in the order in
which the corresponding paths were
found: position j of this array con-
tains the number of the jth node that a
working path was found for or a -1 if
fewer than j nodes have had paths found
for them.
By executing the failure detection algorithm, a
node can both detect and announce any recent link
failures and recoveries. After a link failure or
recovery is either detected by or announced to a
node, the node initiates routing. During routing, a
node uses the current status of each of the links in
the system in an attempt to determine the shortest
working path that can be taken to transmit a message
from this node to every other node in the system, if
such a path exists.
A pseudocoded version of the routing algorithm
is shown in Table 5.2 [13], and a detailed explana-
tion of this algorithm follows.
Table 5.2. Pseudocoded Version of the
Routing Algorithm
routing:
BEGIN
FOR each node i DO
reset routingt[i] to -1
END FOR
set routingt[mynum] to -5
FOR each position i in list DO
reset list[i] to -1
END FOR
initialize next_position to 0
FOR each link j DO
IF goodlinks[mynum, j] = TRUE
link j is working
set dest_node to the number of the node at
the other end of link j
record j in routingt[dest_node]
record dest_node in list[next_position]
increment next_position by 1
END IF
END FOR
FOR each node i in list DO
set current_node to i
FOR each link j DO
IF goodlinks[current_node, j] = TRUE
link j for the current_node is working
set dest_node to the number of the node at
the other end of link j
IF routingt[dest_node] = -1
no working path to the dest_node was
previously found
record routingt[current_node] in
routingt[dest_node]
record dest_node in list[next_position]
increment next_position by 1
END IF
END IF
END FOR
END FOR
END
When a node initiates routing, the node first
removes all of the old routing information from the
routingt array by resetting all of the positions in
this array to negative one. The node then sets its
own position in the routingt array to negative five
since a message does not need to be sent through any
links to get to this node. The node also resets
every position in the list array to negative one.
The node then attempts to find a working path
to all of the nodes that are directly connected to
it. In order to do this, the node first checks the
goodlinks array to determine which of its links are
working, and the node then performs the following
routine once for each of its working links, i.e.,
for each link whose position in the goodlinks array
contains a TRUE.
The node determines the number of the node at
the other end of the working link, i.e., the desti-
nation node. The node then records the number of
the working link in the destination node's position
in the routingt array because this link is the first
and only link that a message needs to be sent
through in order to get to the destination node.
Since a working path to the destination node has now
been found, the node records the destination node
number in the first available position in the list
array, i.e., the first position in the list array
that contains a value that is less than zero.
The node then attempts to find a working path
to each node for which such a path has not yet been
found. In order to do this, the node performs the
following routine once for each of the working links
of each of the nodes in the list array.
The node determines the number of the node at
the other end of the working link for the current
node from the list array, i.e., the destination
node. The node then checks the destination node's
position in the routingt array to determine whether
a working path to the destination node has already
been found. If no working path to the destination
node was previously found, i.e., if the position for
the destination node in the routingt array contains
a negative one, the node copies the value from the
current node's position in the routingt array to the
destination node's position in the routingt array
because the first link that a message should be sent
through in order to get to the destination node is
the same as that which should be used for the cur-
rent node. Since a working path to the destination
node has now been found, the node records the desti-
nation node number in the first available position
in the list array.
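The routine that is described above amounts to a breadth-first traversal of the working links, in which each newly reached node inherits the first link of the node through which it was reached. The following C sketch illustrates the idea for a complete three-dimensional hypercube; it is not the program from the appendix, and the fixed sizes, the neighbor helper, and the function signature are assumptions made for this illustration only.

```c
#include <assert.h>
#include <stdbool.h>

#define NODES 8   /* assumed: a complete 3-dimensional hypercube */
#define LINKS 3   /* log2(NODES) links per node */

/* In a hypercube, link j of node n leads to node n XOR 2^j. */
static int neighbor(int node, int link) { return node ^ (1 << link); }

/* Fill routingt[] for node mynum: routingt[d] holds the first link
 * to use toward node d, -5 for mynum itself, and -1 when no working
 * path to d exists.  goodlinks[n][j] is true when link j of node n
 * is working. */
void routing(int mynum, bool goodlinks[NODES][LINKS], int routingt[NODES])
{
    int list[NODES];          /* breadth-first list of reached nodes */
    int next_position = 0;

    for (int i = 0; i < NODES; i++) routingt[i] = -1;
    routingt[mynum] = -5;
    for (int i = 0; i < NODES; i++) list[i] = -1;

    /* First pass: nodes directly connected through a working link. */
    for (int j = 0; j < LINKS; j++) {
        if (goodlinks[mynum][j]) {
            int dest_node = neighbor(mynum, j);
            routingt[dest_node] = j;        /* first (and only) link */
            list[next_position++] = dest_node;
        }
    }

    /* Second pass: expand each reached node in order; a node found
     * through current_node inherits current_node's first link. */
    for (int i = 0; i < next_position; i++) {
        int current_node = list[i];
        for (int j = 0; j < LINKS; j++) {
            if (goodlinks[current_node][j]) {
                int dest_node = neighbor(current_node, j);
                if (routingt[dest_node] == -1) {
                    routingt[dest_node] = routingt[current_node];
                    list[next_position++] = dest_node;
                }
            }
        }
    }
}
```

Because nodes are appended to the list in the order in which they are first reached, every node is recorded through a shortest working path, exactly as the properties section below argues.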
Properties
During execution of the routing algorithm, a
node uses the current status of each of the links in
the system to determine the shortest working path
that can be taken to transmit a message from this
node to every other node in the system. If such a
path exists, the node records the first link of the
path in the destination node's entry in the routing
table. If no such path exists, either the destina-
tion node is not working or the system has become
disconnected and the destination node cannot be
reached; thus, the node records a negative one in
the destination node's entry in the routing table.
This routing algorithm therefore ensures that a node
does not attempt to route a message through a failed
link or node, and it also ensures that a node uses
the shortest possible path to transmit a message at
all times.
The routing algorithm that is presented in this
paper also has a very short execution time because
no message passing is involved.
Comparison to Other Algorithms
The majority of the popular routing algorithms
use a different routing approach than the one that
is presented in this paper [1].
The algorithm in this paper is executed by a
node in an attempt to find the shortest working path
to every other node in the system. When such a path
is found, only the number of the first link in this
path is recorded in the routing table. When a node
needs to send a message to another node in the sys-
tem, the node looks in the destination node's posi-
tion in its routing table to find the first link
that the message should be sent through in order to
get to the destination node, and the node then sends
the message through this link.
Other routing algorithms tend to be executed by
a node in an attempt to find some or all of the
working paths to every other node in the system.
All of these paths are then recorded in the routing
table. When a node needs to send a message to
another node in the system, the node looks in the
routing table for all of the alternative paths that
can be taken to send a message to the destination
node, and the node then sends the message through
one of these paths [1]. Because of the numerous
links in a hypercube distributed system, this type
of approach can require a lot of memory and a large
execution time; thus, the routing algorithm that is
presented in this paper is more appealing than these
other routing algorithms.
CHAPTER 6
MUTUAL EXCLUSION
When a node in a distributed system needs to
use a nonsharable resource such as a printer, it
should be allowed to do so within a finite amount of
time; however, when one node in a distributed system
is using a nonsharable resource, none of the other
nodes should be allowed access to this resource. In
order to equitably choose which node should be al-
lowed access to a nonsharable resource at any time,
each node in the system should execute mutual exclu-
sion [4].
The mutual exclusion algorithm that was de-
signed for a hypercube distributed system is de-
scribed in this chapter, and the C program for this
algorithm can be found in Appendix E [9], [10].
Principles
The mutual exclusion algorithm requires that
some of the nodes in the system serve as coordinator
nodes. These coordinator nodes are required to per-
form specific functions for some or all of the other
nodes in the system.
The smallest numbered working node in the sys-
tem is the system coordinator. The system coordina-
tor is responsible for keeping a record of all of
the current requests for mutual exclusion from the
groups in the system. The system coordinator is
also responsible for deciding which of the groups in
the system should be allowed mutual exclusion at any
time.
The smallest numbered working node in each
group is the group coordinator for that group. Each
group coordinator is responsible for keeping a re-
cord of all of the current requests for mutual ex-
clusion from the nodes in its group. Each group
coordinator is also responsible for requesting mu-
tual exclusion for its group from the system coordi-
nator and for deciding which of the nodes in its
group should be allowed mutual exclusion after the
system coordinator grants mutual exclusion to the
group.
A graphical representation of the interaction
between nodes for a system containing eight working
nodes is shown in Figure 6.1.
Figure 6.1. Node Interaction for the
Mutual Exclusion Algorithm
The mutual exclusion algorithm uses message
passing to perform its function. A brief explana-
tion of each type of message that is used by this
algorithm is given in Table 6.1.
Table 6.1. Required Message Types for the
Mutual Exclusion Algorithm
NREQUEST: A node sends an NREQUEST message to its group coordinator in order to request mutual exclusion.
GREQUEST: A group coordinator sends a GREQUEST message to the system coordinator in order to request mutual exclusion for its group.
GREPLY: The system coordinator sends a GREPLY message to a group coordinator in order to grant mutual exclusion permission to the group coordinator's group.
NREPLY: A group coordinator sends an NREPLY message to a node in its group in order to grant mutual exclusion permission to the node.
NRELEASE: A node sends an NRELEASE message to its group coordinator in order to relinquish
mutual exclusion permission.
GRELEASE: A group coordinator sends a GRELEASE message to the system coordinator in order to relinquish mutual exclusion permission for its group.
GREQREL: A group coordinator sends a GREQREL message to the system coordinator in order to both request mutual exclusion for its group and relinquish mutual exclusion permission for its group.
In order to execute the mutual exclusion algo-
rithm, each node must have the global variables that
are explained in Table 6.2.
Table 6.2. Required Global Variables for the
Mutual Exclusion Algorithm

syscoord: integer
This variable contains the node number of the
current system coordinator or a -1 if no system
coordinator currently exists.

grpcoord: array [1..totgrps] of integer
This array is used as a record of the current
group coordinator for each of the groups in the
system: position j of this array contains the
node number of the current group coordinator for
group j or a -1 if no group coordinator currently
exists for group j.

greq: array [1..totgrps] of integer
The system coordinator uses this array as a
record of the current requests for mutual
exclusion from the groups in the system:
position j of this array contains the time at
which a mutual exclusion request was made by a
node in group j or a -1 if none of the nodes in
group j have made such a request.

nreq: array [1..4] of integer
A group coordinator uses this array as a record
of the current requests for mutual exclusion
from the nodes in its group: position j of this
array contains the time at which a mutual
exclusion request was made by the jth node in
the group or a -1 if the jth node has not made
such a request.

lgrep: integer
This variable is used by the system coordinator
to keep track of the group that currently has
mutual exclusion permission, and it contains the
number of this group or a -1 if there is no such
group.

lnrep: integer
This variable is used by a group coordinator to
keep track of the node in its group that
currently has mutual exclusion permission, and
it contains the number of this node or a -1 if
there is no such node in the group.

The mutual exclusion algorithm consists of
eight procedures.

When a node needs to enter its critical
section, i.e., when a node needs mutual exclusion,
the node initiates the first mutual exclusion
procedure. In this procedure, a node requests
mutual exclusion from its group coordinator.

The second mutual exclusion procedure is initi-
ated when an NREQUEST message is received by a group
coordinator. During this procedure, a group coordi-
nator requests mutual exclusion for its group from
the system coordinator, if the group coordinator has
not already done so.
The third mutual exclusion procedure is initi-
ated when a GREQUEST message is received by the sys-
tem coordinator. During this procedure, the system
coordinator grants mutual exclusion to the group
that sent the GREQUEST message, if no other group in
the system currently has mutual exclusion.
The fourth mutual exclusion procedure is initi-
ated when a GREPLY message is received by a group
coordinator. During this procedure, a group coordi-
nator grants mutual exclusion to one of the nodes in
its group.
The fifth mutual exclusion procedure is initi-
ated when an NREPLY message is received by a node.
In this procedure, a node executes its critical sec-
tion, and the node then relinquishes mutual exclu-
sion permission to its group coordinator.
The sixth mutual exclusion procedure is initi-
ated when an NRELEASE message is received by a group
coordinator. During this procedure, a group coordi-
nator relinquishes mutual exclusion permission for
its group to the system coordinator, and, at the
same time, the group coordinator makes another mutu-
al exclusion request for its group from the system
coordinator, if any of the nodes in the group still
need mutual exclusion.
The seventh mutual exclusion procedure is ini-
tiated when a GRELEASE message is received by the
system coordinator, and the eighth mutual exclusion
procedure is initiated when a GREQREL message is re-
ceived by the system coordinator. During these pro-
cedures, the system coordinator grants mutual exclu-
sion to one of the groups in the system, if there
are any groups that need mutual exclusion.
Pseudocoded versions of each of the mutual ex-
clusion procedures are shown in Tables 6.3 through
6.10 [13], and the pseudocoded version of each pro-
cedure is followed by a detailed explanation of each
of the functions that are performed by the procedure.
Table 6.3. Pseudocoded Version of the
First Mutual Exclusion Procedure
mutual exclusion:
BEGIN
send an NREQUEST message to grpcoord[mygrp]
END
The first mutual exclusion procedure is exe-
cuted by a node when the node wants to enter its
critical section, i.e., when the node wants to re-
quest mutual exclusion. During execution of this
procedure, the node sends an NREQUEST message to its
group coordinator.
Table 6.4. Pseudocoded Version of the
Second Mutual Exclusion Procedure
when an NREQUEST message from node j is received by
a group coordinator:
BEGIN
record the time-stamp from the NREQUEST message in
nreq[j MOD 4]
FOR each position i in nreq DO
IF nreq[i] >= 0
the ith node in the group made a request
at time nreq[i]
END IF
END FOR
IF node j has the only active request in the group
send a GREQUEST message containing the time-
stamp from the NREQUEST message to syscoord
END IF
END
When a group coordinator receives an NREQUEST
message from a node in its group, the group coordi-
nator initiates the second mutual exclusion proce-
dure. In this procedure, the group coordinator
records the time-stamp from the NREQUEST message,
i.e., the time at which the request was made, in the
source node's position in the nreq array.
If the new mutual exclusion request is the only
active request from a node in the group coordina-
tor's group, i.e., if there are no other values
greater than or equal to zero in the nreq array, the
group coordinator sends a GREQUEST message contain-
ing the time-stamp from the NREQUEST message to the
system coordinator.
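The decision made in the second procedure can be expressed as a small C function. This is a sketch of the bookkeeping only, not the appendix program; the function name, the fixed group size of four, and the use of a return value in place of actually sending the GREQUEST message are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define GROUP_SIZE 4

/* nreq[i] holds the time-stamp of the ith group member's pending
 * request, or -1 when that node has no active request. */

/* Record node j's NREQUEST time-stamp; return true when the group
 * coordinator should forward a GREQUEST to the system coordinator,
 * i.e., when j's request is the only active one in the group. */
bool handle_nrequest(int nreq[GROUP_SIZE], int j, int timestamp)
{
    nreq[j % GROUP_SIZE] = timestamp;   /* j MOD 4, as in Table 6.4 */
    int active = 0;
    for (int i = 0; i < GROUP_SIZE; i++)
        if (nreq[i] >= 0)
            active++;
    return active == 1;
}
```

A GREQUEST is sent only for the first active request because, once the group holds an outstanding request at the system coordinator, later node requests are served when the group is eventually granted and released.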
Table 6.5. Pseudocoded Version of the
Third Mutual Exclusion Procedure
when a GREQUEST message from group coordinator k is
received by the system coordinator:
BEGIN
set kgroup to the number of the group for node k
record the data from the GREQUEST message in
greq[kgroup]
IF lgrep = -1
group k has the only active request in the
system
send a GREPLY message to grpcoord[kgroup]
record kgroup in lgrep
END IF
END
When the system coordinator receives a GREQUEST
message from a group coordinator, the system coordi-
nator initiates the third mutual exclusion proce-
dure. During execution of this procedure, the sys-
tem coordinator records the data from the message,
i.e., the time at which a node in the source group
made the request, in the source group's position in
the greq array.
If the new mutual exclusion request is the only
active request from a group in the system, i.e., if
the lgrep variable contains a value equal to nega-
tive one, the system coordinator sends a GREPLY mes-
sage back to the group coordinator of the group that
sent the GREQUEST message, and the system coordina-
tor records the number of this group in the lgrep
variable because this group has just been granted
mutual exclusion.
Table 6.6. Pseudocoded Version of the
Fourth Mutual Exclusion Procedure
when a GREPLY message from the system coordinator is
received by a group coordinator:
BEGIN
FOR each position i in nreq DO
IF nreq[i] >= 0
the ith node in the group made a request
at time nreq[i]
END IF
END FOR
set j to the number of the node that made the
earliest request
send an NREPLY message to node j
record j in lnrep
END
When a group coordinator receives a GREPLY mes-
sage from the system coordinator, the group coordi-
nator initiates the fourth mutual exclusion proce-
dure. In this procedure, the group coordinator
finds the smallest value in the nreq array that is
greater than or equal to zero, i.e., the time at
which the earliest mutual exclusion request was made
by a node in this group. The group coordinator then
sends an NREPLY message to the node that made this
request, and the group coordinator records the num-
ber of this node in the lnrep variable because this
node has just been granted mutual exclusion.
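Finding the earliest pending request, which the fourth, sixth, and seventh procedures all rely on, is a scan for the smallest nonnegative time-stamp. The following C helper is a minimal sketch of that scan; its name and signature are assumptions for this illustration and do not come from the appendix code.

```c
#include <assert.h>

/* req[i] is the time-stamp of the ith pending request, or -1 when
 * no request is pending in that position.  Return the index of the
 * earliest pending request, or -1 when none is pending. */
int earliest_request(const int req[], int len)
{
    int best = -1;
    for (int i = 0; i < len; i++)
        if (req[i] >= 0 && (best == -1 || req[i] < req[best]))
            best = i;
    return best;
}
```

Because requests are granted strictly in time-stamp order, every requester is served within a finite amount of time, which is the fairness property claimed later in this chapter.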
Table 6.7. Pseudocoded Version of the
Fifth Mutual Exclusion Procedure
when an NREPLY message from a group coordinator is
received by node j:
BEGIN
execute critical section
send an NRELEASE message to grpcoord[mygrp]
END
When a node receives an NREPLY message from its
group coordinator, the node initiates the fifth mu-
tual exclusion procedure. During this mutual exclu-
sion procedure, the node executes its critical sec-
tion. The node then sends an NRELEASE message back
to its group coordinator in order to relinquish mu-
tual exclusion permission.
Table 6.8. Pseudocoded Version of the
Sixth Mutual Exclusion Procedure
when an NRELEASE message from node j is received by
a group coordinator:
BEGIN
IF lnrep = j
node j has mutual exclusion permission
record a -1 in nreq[j MOD 4]
record a -1 in lnrep
FOR each position i in nreq DO
IF nreq[i] >= 0
the ith node in the group made a request
at time nreq[i]
END IF
END FOR
IF there are no active requests in nreq
send a GRELEASE message to syscoord
ELSE
set min_ts to the time of the earliest request
send a GREQREL message containing min_ts to
syscoord
END IF
END IF
END
When a group coordinator receives an NRELEASE
message from a node in its group that currently has
mutual exclusion permission, the group coordinator
initiates the sixth mutual exclusion procedure. In
this mutual exclusion procedure, the group coordi-
nator records a negative one in the source node's
position in the nreq array because the mutual exclu-
sion request that was previously made by this node
is no longer active. The group coordinator also re-
cords a negative one in the lnrep variable because
the source node has just relinquished mutual exclu-
sion permission.
The group coordinator then finds the smallest
value in the nreq array that is greater than or
equal to zero, i.e., the time at which the earliest
remaining mutual exclusion request was made by an-
other node in this group. If there is no value
greater than or equal to zero in the nreq array,
i.e., there are no remaining mutual exclusion re-
quests for this group, the group coordinator sends a
GRELEASE message to the system coordinator; other-
wise, the group coordinator sends a GREQREL message
containing the smallest value from the nreq array to
the system coordinator.
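The choice between GRELEASE and GREQREL in the sixth procedure can be sketched as a single decision function in C. This is an illustration only; the enum, the function name, and the use of return values in place of actual message sends are assumptions, and node numbers are mapped to nreq positions with j MOD 4 as in the pseudocode.

```c
#include <assert.h>

#define GROUP_SIZE 4

enum action { IGNORE, SEND_GRELEASE, SEND_GREQREL };

/* Handle an NRELEASE from node j.  nreq[] holds pending request
 * time-stamps (-1 when none); *lnrep is the node currently holding
 * permission.  On SEND_GREQREL, *min_ts receives the earliest
 * remaining time-stamp to carry in the GREQREL message. */
enum action handle_nrelease(int nreq[GROUP_SIZE], int *lnrep,
                            int j, int *min_ts)
{
    if (*lnrep != j)
        return IGNORE;                 /* j does not hold permission */
    nreq[j % GROUP_SIZE] = -1;         /* j's request is satisfied */
    *lnrep = -1;                       /* nobody in the group holds it */

    int best = -1;                     /* earliest remaining request */
    for (int i = 0; i < GROUP_SIZE; i++)
        if (nreq[i] >= 0 && (best == -1 || nreq[i] < nreq[best]))
            best = i;

    if (best == -1)
        return SEND_GRELEASE;          /* no requests remain */
    *min_ts = nreq[best];
    return SEND_GREQREL;               /* release and re-request at once */
}
```

Combining the release and the renewed request into one GREQREL message halves the group-to-system traffic when a group has a backlog of waiting nodes.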
Table 6.9. Pseudocoded Version of the
Seventh Mutual Exclusion Procedure
when a GRELEASE message from group coordinator k is
received by the system coordinator:
BEGIN
set kgroup to the number of the group for node k
IF lgrep = kgroup
kgroup has mutual exclusion permission
record a -1 in greq[kgroup]
record a -1 in lgrep
FOR each position i in greq DO
IF greq[i] >= 0
group i made a request at time greq[i]
END IF
END FOR
IF there are any active requests in greq
set j to the number of the group that made the
earliest request
send a GREPLY message to grpcoord[j]
record j in lgrep
END IF
END IF
END
When the system coordinator receives a GRELEASE
message from a group coordinator whose group cur-
rently has mutual exclusion permission, the system
coordinator initiates the seventh mutual exclusion
procedure. During execution of this procedure, the
system coordinator records a negative one in the
source group's position in the greq array because
the mutual exclusion request that was previously
made by this group is no longer active. The system
coordinator also records a negative one in the lgrep
variable because the source group has just relin-
quished mutual exclusion permission.
The system coordinator then finds the smallest
value in the greq array that is greater than or
equal to zero, i.e., the time at which the earliest
remaining mutual exclusion request was made by a
node in another group in the system. If there are
any remaining mutual exclusion requests, the system
coordinator sends a GREPLY message to the group co-
ordinator of the group that made the earliest re-
quest, and the system coordinator records the number
of this group in the lgrep variable because this
group has just been granted mutual exclusion.
Table 6.10. Pseudocoded Version of the
Eighth Mutual Exclusion Procedure
when a GREQREL message from group coordinator k is
received by the system coordinator:
BEGIN
set kgroup to the number of the group for node k
IF lgrep = kgroup
kgroup has mutual exclusion permission
record the data from the GREQREL message in
greq[kgroup]
record a -1 in lgrep
FOR each position i in greq DO
IF greq[i] >= 0
group i made a request at time greq[i]
END IF
END FOR
set j to the number of the group that made the
earliest request
send a GREPLY message to grpcoord[j]
record j in lgrep
END IF
END
When the system coordinator receives a GREQREL
message from a group coordinator whose group cur-
rently has mutual exclusion permission, the system
coordinator initiates the eighth mutual exclusion
procedure. In this procedure, the system coordi-
nator records the data from the message, i.e., the
time at which a node in the source group made the
request, in the source group's position in the greq
array. The system coordinator also records a nega-
tive one in the lgrep variable because the source
group has relinquished mutual exclusion permission.
The system coordinator finds the smallest value
in the greq array that is greater than or equal to
zero, i.e., the time at which the earliest remaining
mutual exclusion request was made by a node in one
of the groups in the system. The system coordinator
then sends a GREPLY message to the group coordinator
of the group that made this request, and the system
coordinator records the number of this group in the
lgrep variable because this group has just been
granted mutual exclusion.
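The seventh and eighth procedures differ only in whether the releasing group's entry in the greq array is cleared or refilled with a new time-stamp, so both can be folded into one C helper at the system coordinator. This is a sketch under assumed names; encoding a GRELEASE as a negative time-stamp argument and returning the granted group number instead of sending a GREPLY are conventions invented for this illustration.

```c
#include <assert.h>

#define TOTGRPS 2   /* assumed number of groups for the example */

/* greq[g] is the time-stamp of group g's pending request, or -1;
 * *lgrep is the group currently holding permission, or -1.
 * ts < 0 models a GRELEASE; ts >= 0 models a GREQREL carrying a
 * new earliest time-stamp for the releasing group.  Return the
 * group to grant next (a GREPLY would go to its coordinator), or
 * -1 when no request remains. */
int handle_group_release(int greq[TOTGRPS], int *lgrep,
                         int kgroup, int ts)
{
    if (*lgrep != kgroup)
        return -1;                     /* kgroup does not hold permission */
    greq[kgroup] = ts < 0 ? -1 : ts;   /* GRELEASE clears, GREQREL re-files */
    *lgrep = -1;

    int best = -1;                     /* group with the earliest request */
    for (int g = 0; g < TOTGRPS; g++)
        if (greq[g] >= 0 && (best == -1 || greq[g] < greq[best]))
            best = g;
    if (best >= 0)
        *lgrep = best;                 /* this group is granted next */
    return best;
}
```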
Message Loss Resiliency
The seven types of messages that are used in
the mutual exclusion algorithm are the NREQUEST,
GREQUEST, GREPLY, NREPLY, NRELEASE, GRELEASE, and
GREQREL messages.
If an NREQUEST message is lost, the algorithm
will continue to function correctly; however, the
node that sent the message will never be allowed
mutual exclusion. If a GREQUEST message is lost,
the algorithm will again continue to function cor-
rectly; however, the group that sent the message
will never be allowed mutual exclusion. If a
GREPLY, NREPLY, NRELEASE, GRELEASE, or GREQREL
message is lost, the algorithm will stop function-
ing, and none of the nodes in the system will be
allowed mutual exclusion.
In order to recover from all of these types of
message losses, a new message called NREQREL was
introduced, and the following modifications were
made to the mutual exclusion algorithm.
After sending an NREQUEST message to its group
coordinator, a node waits a predetermined amount of
time to receive an NREPLY message from its group
coordinator. If an NREPLY message is not received
by the node within this amount of time, the node
sends an NREQREL message to its group coordinator to
recover from any possible message losses. The node
continues to send NREQREL messages to its group co-
ordinator at predetermined intervals until the
NREPLY message is received.
When a group coordinator receives an NREQREL
message, the group coordinator checks the source
node's position in the nreq array to determine if
there is already an active mutual exclusion request
for this node. If there is no such request, an
NREQUEST message was lost; thus, in order to recover
from this message loss, the group coordinator treats
the NREQREL message as an NREQUEST message. If
there is already a request for the source node in
the nreq array, an NREQUEST message was not lost;
however, another type of message loss may have oc-
curred. Thus, the group coordinator does one of the
following two things. If there is a node in this
group that currently has mutual exclusion permis-
sion, i.e., the lnrep variable contains a value
greater than or equal to zero, the group coordinator
sends another NREPLY message to this node in order
to recover from an NREPLY or NRELEASE message loss,
if one has occurred. If none of the nodes in this
group currently have mutual exclusion permission,
the group coordinator sends a GREQREL message to the
system coordinator in order to recover from a
GREQUEST, GREPLY, GRELEASE, or GREQREL message loss,
if one has occurred.
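The group coordinator's three-way response to an NREQREL, as described above, is a pure function of the nreq array and the lnrep variable. The following C sketch captures that decision; the enum and function names are assumptions made for this illustration, not identifiers from the appendix code.

```c
#include <assert.h>

#define GROUP_SIZE 4

enum recovery { TREAT_AS_NREQUEST, RESEND_NREPLY, SEND_GREQREL };

/* Decide how a group coordinator answers an NREQREL from node j,
 * given the request table and the current permission holder. */
enum recovery handle_nreqrel(const int nreq[GROUP_SIZE],
                             int lnrep, int j)
{
    if (nreq[j % GROUP_SIZE] < 0)
        return TREAT_AS_NREQUEST;   /* the NREQUEST itself was lost */
    if (lnrep >= 0)
        return RESEND_NREPLY;       /* recover a lost NREPLY or NRELEASE */
    return SEND_GREQREL;            /* recover a lost group-level message */
}
```

Because the sender retries the NREQREL at fixed intervals until an NREPLY arrives, any single lost message is eventually repaired by one of these three responses.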
When the system coordinator receives a GREQREL
message from a group coordinator, the system coordi-
nator initiates the eighth mutual exclusion proce-
dure. At the beginning of this procedure, the sys-
tem coordinator checks the source group's position
in the greq array to determine if there is already
an active mutual exclusion request for this group.
If there is no such request, a GREQUEST message was
lost; thus, in order to recover from this message
loss, the system coordinator treats the GREQREL mes-
sage as a GREQUEST message. If there is already a
request for the source group in the greq array but
the source group does not currently have mutual ex-
clusion permission, a GREQUEST message was not lost;
however, another type of message loss may have oc-
curred. Thus, the system coordinator ignores the
GREQREL message, and it sends another GREPLY message
to the group coordinator of the group that currently
has mutual exclusion permission, i.e., the group
whose number is contained in the lgrep variable, in
order to recover from a GREPLY, GRELEASE, or GREQREL
message loss, if one has occurred.
When a group coordinator receives a GREPLY mes-
sage from the system coordinator, the group coordi-
nator initiates the fourth mutual exclusion proce-
dure. At the beginning of this procedure, the group
coordinator checks the lnrep variable to determine
if any node in its group currently has mutual exclu-
sion permission. If there is a node in this group
that currently has mutual exclusion permission, the
group coordinator ignores the GREPLY message, and it
sends another NREPLY message to this node in order
to recover from an NREPLY or NRELEASE message loss,
if one has occurred.
These modifications provide the mutual exclu-
sion algorithm with the means to automatically re-
cover from any message loss. This type of message
loss recovery scheme is advantageous because the
coordinator nodes are not required to wait for a
message to be received; thus, the coordinator nodes
do not spend a significant amount of time performing
mutual exclusion functions.
Properties
In order to request mutual exclusion, a node
must send an NREQUEST message to its group coordina-
tor. As soon as this node receives an NREPLY mes-
sage back from its group coordinator, it can enter
its critical section. Since only one node in one
group in the system is granted mutual exclusion at
any time, this mutual exclusion algorithm ensures
that, when one node is using a nonsharable resource,
no other node can gain access to this resource.
This algorithm also ensures that any node that wants
to enter its critical section can do so within a
finite amount of time because the coordinator nodes
choose which node should be allowed mutual exclusion
by comparing the times at which the requests for
mutual exclusion were made.
If no messages are lost, the maximum number of
messages that will need to be sent in order to in-
voke mutual exclusion is six: an NREQUEST, a
GREQUEST, a GREPLY, an NREPLY, an NRELEASE, and a
GRELEASE message.
If all of the links in the system are working,
this algorithm provides maximum parallelism between
groups because each group is completely independent
of the other groups in terms of the links that are
used to send the messages through.
Comparison to Other Algorithms
The majority of the popular mutual exclusion
algorithms use a more distributed approach than the
one that is presented in this paper [4].
The algorithm in this paper uses coordinator
nodes to control the entry of a node into its criti-
cal section. This type of approach is usually re-
ferred to as centralized because some of the nodes
in the system perform functions for the rest of the
nodes. The main disadvantage of using this type of
approach is that an election algorithm is required
in order to make the algorithm resilient to node
failures.
Other mutual exclusion algorithms tend to be
more distributed than this, i.e., all of the nodes
in the system work together in order to make a de-
cision. There are two main disadvantages of using
this type of approach: the participation of all of
the nodes in the system is required, and the number
of messages that need to be sent is directly propor-
tional to the number of nodes in the system. These
two disadvantages become much more prevalent in a
system containing more than a few nodes; thus, for
this type of system, the mutual exclusion algorithm
that is presented in this paper is more attractive
than the more distributed types of algorithms [4].
CHAPTER 7
ELECTION
The mutual exclusion algorithm that is pre-
sented in this paper performs its function by re-
quiring that some of the nodes in the system serve
as coordinator nodes. If a coordinator node failure
occurs, a new coordinator node must be elected so
that the mutual exclusion algorithm will continue to
function correctly. Each node in the system should
execute election after a node failure or recovery
occurs in order to make the mutual exclusion algo-
rithm resilient to node failures [4].
The election algorithm that was designed for a
hypercube distributed system is described in this
chapter, and the C program for this algorithm can be
found in Appendix F [9], [10].
Principles
The election algorithm uses message passing to
perform its function. A brief explanation of each
type of message that is used by this algorithm is
given in Table 7.1.