Citation
VSFIM

Material Information

Title:
VSFIM visual flight simulator
Creator:
Cohan, Brian Joseph
Publication Date:
Language:
English
Physical Description:
77 leaves : illustrations ; 28 cm

Subjects

Subjects / Keywords:
Computer network architectures ( lcsh )
Software engineering ( lcsh )
Parallel computers ( lcsh )
Computer software -- Development ( lcsh )
Computer simulation ( lcsh )
Computer network architectures ( fast )
Computer simulation ( fast )
Computer software -- Development ( fast )
Parallel computers ( fast )
Software engineering ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 76-77).
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Brian Joseph Cohan.

Record Information

Source Institution:
|University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
49630474 ( OCLC )
ocm49630474
Classification:
LD1190.E52 2001m .C63 ( lcc )

Full Text
VFSIM
VISUAL FLIT SIMULATOR
by
Brian Joseph Cohan
B.S., University of Houston, 1981
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science
2001




Cohan, Brian J. (M.S. Computer Science)
VFSim Visual Flit Simulator
Thesis directed by Associate Professor Dianne Kumar
ABSTRACT
Interconnection networks are an important component of parallel computer systems and
the efficient design of these networks is crucial to scalability and performance. As the
number of processors increase in parallel systems, the interconnection subsystem
becomes the limiting factor, and efficient and cost-effective tools for the design and
evaluation of highly scaled networks are needed. Software simulators are valuable in this
endeavor; however, as the design complexity of the model increases, the corresponding
complexity of the associated simulation software also increases. To manage this software
engineering challenge, a software design method using a loosely coupled visual API with
a base simulator is presented. This design results in five major benefits:
Visual Tutorial for interconnection network concepts.
Visual aid for the design of new architectures
Increased code maintainability and function verification.
Aid in the validation of existing and future designs.
Asynchronous expandability of the visual extension and/or the base simulator.
This abstract accurately represents the content of the candidates thesis. I recommend
its publication.
Signed
111
Dianne Kumar


CONTENTS
1. Introduction...................................................................1
2. Background.....................................................................2
2.1 Parallel Systems...........................................................2
2.2 Topology...................................................................3
2.2.1 Shared-Medium Networks.....................................................4
2.2.2 Direct Networks............................................................5
2.2.3 Indirect Networks..........................................................9
2.2.4 Hybrid N etworks...........................................................9
2.3 Switching..................................................................9
2.3.1 Circuit Switching........................................................ 9
2.3.2 SAF Packet Switching......................................................10
2.3.3 VCT Packet Switching......................................................10
2.3.4 Wormhole Switching........................................................11
2.4 Routing...................................................................12
2.4.1 Router Model..............................................................14
2.4.2 Deadlock, Livelock and Starvation.........................................15
2.4.3 Multicast Routing Hardware Implementation Schemes.........................16
2.4.3.1 Hardware Tree-Based Mulitcast Routing...................................17
2.4.3.2 Path-Based Multicast Routing.............................................18
3. VFSim Design and Functionality................................................19
3.1 Base Implementation.......................................................19
3.1.1 Key Data Structures.......................................................19
3.1.1.1 Global Message structure: struct g_msg_type..............................20
3.1.1.2 Queue Element structure: struct q_elt_type...............................22
3.1.1.3 Queue structure: struct q_type...........................................23
3.1.1.4 Node structure: struct nodetype..........................................24
3.1.2 Key Functions.............................................................27
3.2 Visual Implementation.....................................................32
3.2.1 Key Data Structures.......................................................32
3.2.1.1 Global Node Location structure: struct node_loc_type.....................34
3.2.1.2 Gui Control structure: struct gui_control_type..........................35
3.2.2 Key Functions.......................................................... 37
3.3 Visual API................................................................43
3.3.1 Base Implementation Function Placement....................................43
3.3.2 API Functions.............................................................45
4. Simulation Algorithms........................................................47
4.1 Dimension Order Unicast Routing...........................................47


FIGURES
Figure 2.1 Multiprocessor Representation............................................3
Figure 2.2 Generic node representation..............................................5
Figure 2.3 Binary tree..............................................................6
Figure 2.4 Star topology............................................................6
Figure 2.5 4-ary 2-cube mesh...................................................... 7
Figure 2.6 3-ary 3-cube mesh........................................................7
Figure 2.7 4-ary 2-cube torus.......................................................8
Figure 2.8 3-ary 3-cube torus.......................................................8
Figure 2.9 Message packet, flit and phit partitioning..............................10
Figure 2.10 Message blocking with wormhole switching...............................12
Figure 2.11 Generic router model...................................................14
Figure 2.12 Network Deadlock.......................................................15
Figure 2.13 Broadcast Tree-Based multicast in a 3-cube.............................17
Figure 2.14 Hardware Column-Path Multicast Routing.................................18
Figure 3.1 Node and queue relationships............................................20
Figure 3.2 Global Message Structure................................................22
Figure 3.3 Queue Element Structure.................................................23
Figure 3.4 Queue Structure.........................................................24
Figure 3.5 Node Structure..........................................................26
Figure 3.6 Base Implementation Key Function call Sequence..........................27
Figure 3.7 Global & Virtual structure relationships................................33
Figure 3.8 Global Node Location structure..........................................34
Figure 3.9 GUI Control structure...................................................36
Figure 3.10 Visual Initialization..................................................37
Figure 3.11 Visual Event...........................................................38
Figure 3.12 Global Channel Color Change............................................38
Figure 3.13 Rotate Button, DragDrop, Info pop-up Window............................39
Figure 4.1 a-d Unidirectional rings and their channel dependency graphs............48
Figure 4.2 Dimension Order Routing Simulation......................................49
Figure 4.3 Dimension Order Routing Simulation.................................... 50
Figure 4.4 Dimension Order Routing Simulation......................................50
Figure 4.5 Duato routing adaptive channels.........................................52
Figure 4.6 Duato routing deterministic channel.....................................52
Figure 4.7 Hybrid routing using low traffic........................................54
Figure 4.8 Hybrid routing with heavy traffic.......................................54
Figure 4.9 SW_Multicast routing in 2D..............................................55
vii


1. Introduction
Software simulation is an efficient and cost-effective method for designing and testing
interconnection network architectures. However, as the complexity of the software
increases, the simulators themselves become complex abstractions. This is a common
aspect of all software engineering, necessitating that the software model itself needs to be
evaluated for accuracy. Indeed, it is often more difficult to prove the correctness of a
software model than to build it.
Due to the dynamic and complex nature of message transfer within interconnection
networks, a visual abstraction is highly desirable. However, visual models that are highly
integrated, require extensive testing including regression testing if any code
modifications are made. To minimize testing and validation, a loosely coupled model is
preferred. Ideally, a visual abstraction should exist as a separate entity, merely presenting
an interface to a base simulator.
The visual simulator presented in this paper consists of a base portion that generates calls
to a visual abstraction. The calls are loosely coupled with the base simulator and cause a
visual event to occur. By using this model neither software layer is affected by the
operation of the other. A separation is maintained and the visual abstraction can verify
the base portion and vice versa. Five benefits are presented:
Visual Tutorial for interconnection network concepts.
Visual aid for the design of new architectures
Increased code maintainability and function verification.
Aid in the validation of existing and future designs.
Asynchronous expandability of the visual extension and/or the base simulator.
This paper is organized as follows: Chapter two presents an overview of interconnection
network concepts, highlighting the present challenges. The third Chapter presents the
design of the visual simulator. Chapter four discusses the algorithms which are presently
simulated with the visual simulator. Chapter five presents several simulation case studies,
and Chapter six elaborates on the advantages of this software model. A conclusion is
stated in Chapter seven and suggestions for future work in Chapter eight.
1


Figure 2.1 Multiprocessor Representation.
P = processor M = memory
Figure 2.1 shows a simplified schematic representation consisting of several processors,
each with its own memory, sharing an interconnection network subsystem. This
discussion will be further reduced to concentrate on the strictly orthogonal torus or k-ary
n-cube direct network model.
2.2 Topology
Topologies for interconnection networks may be classified into 4 types:
Shared-Medium Networks
Direct networks
Indirect Networks
Hybrid Networks
All of these topologies use asynchronous operating mode along with distributed network
control. Shared-medium networks (i.e., LAN, Backplane Bus) utilize a shared
transmission medium for all communicating devices. The limited bandwidth associated
with this topology restricts its use in multi-processor technology. Direct Networks (i.e..
Torus (k-ary, n-cube), Hypercube, Trees), or router-based, point-to-point networks scale
well and have become a popular topology for multi-processor applications. Indirect or
switched-based networks (i.e., Crossbar) consist of an interconnection of switches with
each switch connected to one or more nodes. Hybrid networks (i.e., Multiple-Backplane
Buses, Hierarchical Networks) combine techniques from the above 3 topologies.
3


2.2.2 Direct Networks
Direct networks are well suited for large-scale parallel computers since this topology
scales well to a large number of processors. Direct networks consist of a set of nodes
with each node directly connected to a subset of other nodes in the network. A generic
node architecture is depicted in figure 2.2 [5,15].
Figure 2.2 Generic node representation.
fe Router
[ E
1 - - t
Input Channels Output Channels


Figures 2.5 and 2.6 (mesh) and figures 2.7 and 2.8 (tori) show examples of the direct
network orthogonal topologies.
Figure 2.5 4-ary 2-cube mesh
Figure 2.6 3-ary 3-cube mesh
7


2.2.3 Indirect Networks
Instead of a direct connection between any 2 nodes, communication between nodes is
via a switch in indirect networks. Each node uses a network adapter that connects to a
network switch. The switch has a set of ports with each port consisting of one input and
one output link. The ports may or may not connect to a processor node. The
interconnection of the switches (versus the nodes) defines the network topology.
Indirect networks may be modeled by a graph G(N,C), where N is the set of switches
and C the set of links between switches. An interesting indirect topology is the case of a
single N X N switch connecting N nodes; this type of switch is known as a crossbar [5].
2.2.4 Hybrid Networks
Hybrid networks are a combination of shared-medium networks and direct or indirect
networks. An example is a bridged LAN. The objective is to increase bandwidth using
the shared-medium model and reduce node-to-node distance with the direct or indirect
network model. Hybrid networks do not scale as well as direct or indirect networks;
therefore, for high-performance parallel computers they are not a first-choice topology.
2.3 Switching
Parallel computer implementation require low-latency switching techniques to prevent
the switching layer from becoming the bottleneck in the interconnection subsystem.
Switching techniques have a profound impact on the performance and behavior of
interconnection networks. It can be argued that switching has a greater impact on
network performance than topology or the routing algorithm.
2.3.1 Circuit Switching
Circuit switching reserves a physical path through the network from source to
destination prior to data transmission. This is accomplished by injecting control signals
into the network. As these control signals progress through the network towards the
destination a physical link is reserved along the way. Once the control signals reach the
destination an ACK (acknowledgment) is transmitted back to the source node indicating
the message is ready for transmission. At this point the message contents are
transmitted at the full bandwidth of the hardware path. Once the message has been
transmitted the path is released. This may be accomplished in part by releasing the tail
message at the departure of each intermediate node or all at once following the
completion of the entire message transmission from the destination node.
Circuit switching is effective when messages are infrequent and long. An example is a
long-distance phone connection, and this type of switching has been used commercially
for telephony. A disadvantage of circuit switching is that the physical path is reserved
9


assigned output channel is free. Subsequent data flits simply follow the header flit since
the routing decision has already been made. If the output link is free the packet cuts-
through to the next node without the need for waiting for the entire packet to be
buffered at the intermediate node before being forwarded. The packet can effectively be
pipelined through successive node switches. However, if the header flit is blocked on a
busy output channel, the complete packet is buffered at the blocked node, and VCT
reduces to SAF packet switching.
Although at low network loads, VCT can provide high throughput rates, when network
loads are high, VCT displays the same disadvantages of SAF and requires adequate
buffering space at each node to store the complete message [5, 7, 8].
2.3.4 Wormhole Switching
In wormhole switching message packets are also pipelined through the network, however
the buffer requirements within the routers are reduced compared to VCT. The message
flits flowing through the network occupy buffers in several routers at any given time.
The major difference between wormhole switching and VCT is that with wormhole an
entire message cannot be buffered at a router.
Advantages of wormhole switching are small buffer requirements and message
pipelining. The Wormhole switching design allows for the use of small, compact and
fast routers. A disadvantage of wormhole switching is its blocking characteristics. Since
wormhole switching does not allow full message buffering at a single node (as in VCT),
when a message is blocked the message occupies buffers in multiple routers. This often
results in other messages in the network being blocked as well and greatly complicates
deadlock freedom.
11


Distributed Routing Routing path is determined in a distributed manner
while the packet travels through the network.
Multiphase Routing Hybrid routing schemes where the source node may
compute some destination nodes and others may be computed in a distributed
manner.
Implementation:
Table lookup Table lookup to determine routing path.
Finite-State Machine Executing a routing algorithm in software or hardware
according to a finite-state machine.
Adaptivity:
Deterministic Routing Deterministic routing algorithms always supply the
same path from source node to destination node.
Adaptive Routing Adaptive routing algorithms dynamically determine a path
through the network according to various factors such as network traffic, channel
status, and fault regions.
Adaptive Routing Definitions:
Progressiveness:
Progressive Progressive routing moves the header packet forward reserving a
new channel at each routing operation.
Backtracking Backtracking algorithms allow the header packet to backtrack,
releasing previously reserved channels if necessary. This method is mainly used
for fault-tolerant networks.
Minimality:
Profitable Also know as minimal routing algorithms that only supply channels
that bring the header flit closer to its destination node.
Misrouting May supply channels that move the packet away from its
destination. Also know as non-minimal.
Number of Paths:
Complete Allows for any number of alternative paths; also know as fully
adaptive.
Partial Partially adaptive routing.
13


2.4.2 Deadlock, Livelock and Starvation
Because resources in a network are finite, deadlock, livelock and starvation may occur
[5,8,12].
a Deadlock
Every packet in a deadlock cycle is requesting resources held by other packet(s)
while holding resources requested by other packet(s). This is displayed
graphically in figure 2.9.
Figure 2.12 Network Deadlock.
Livelock
A situation where a packet is traveling around its destination node, never
reaching it because the channels required to reach the destination are occupied by
other packets. This situation only occurs if packets are allowed to follow
nonminimal paths.
15


2.4.3.1 Hardware Tree-Based Multicast Routing
In Hardware Tree-Based multicast routing all of the header flits are used for routing at
each node. A message is sent along a common path as far as possible. Once the
common path has been exhausted, the message is split as necessary and routed on
different channels bound for a unique set of destination nodes. This procedure
continues until all destination nodes have been reached. Data flits are simultaneously
forwarded to all paths previously allocated by the header flits. Figure 2.10 shows a
Hardware Tree-Based Multicast broadcast in a three-cube network from source node 000
[5]. The routing algorithm is implemented at the hardware node level.
Advantages of Tree-Based Multicast routing:
No ordering of the destinations is required prior to message injection.
Shortest path is always taken between source node and all destinations.
Disadvantages of Tree-Based Multicast routing:
Higher probability of blocking than path-based multicast routing since all branch
channels must be available for the whole multidestination message to continue.
Figure 2.13 Broadcast Tree-Based multicast in a 3-cube.
17


3. VFSim Design and Functionality
VFSim is a unicast-multicast flit routing simulator. It runs in three modes:
Normal Mode
In normal mode VFSim should run for a high number of cycles to closely
approximate real operating conditions. Usually, a few thousand cycles are run to
reach steady state and then the simulation begins. Normal mode produces several
output files that record throughput, channel utilization, average latency and other
network metrics at selected clock cycle points and at completion of the simulation.
Text Mode
In text mode, VFSim produces textual output describing the movement of each
flit for each clock cycle.
Graphics Mode
The graphical mode displays a visual interface that shows the movement of each
flit at the network topology level and also at selected node level. Four nodes may
be tracked in detail simultaneously. Clock cycles are controlled by pressing any key
and may be set in the run file to any value (i.e. run_mode graphics n; where n is
cycles run each time a key is pressed).
VFSim is based on a 3-tier design consisting of a base implementation, a visual
implementation and an API layer, which loosely couples the base and visual layers.
When VFSim is run in normal or text mode, the visual layer is not entered. If run in
graphics mode VFSim extends the base implementation as a visual abstraction via the
API.
3.1 Base Implementation
The base implementation is written in ANSI-C on the Linux platform. The code has
been ported to Sun-Solaris and HP-UX without modification. Both static arrays and
dynamic data structures are used.
3.1.1 Key Data Structures
19


of hops in each dimension to reach destination. This array would have value one
if the message was unicast and greater than one for multicast messages.
num_dest_nodes holds the number of destination nodes for multicast
messages.
num_dest_absorbed holds the number of destinations that have been
absorbed.
num_data_flits indicates the number of data flits per message.
comm type indicates whether the message is unicast or multicast.
hop_count counts the total hops for the message.
total_msg_lat_sum accumulates the total message latency.
cc_gen indicates the clock cycle the message was generated.
cc_leave_src_q indicates the clock cycle when the message leaves the source
queue. This value does not include waiting time in the queue.
changing deter to deter count counts the number of times the message goes
from deterministic to deterministic channel of the same type and dimension.
changing other to any count counts the number of times the message goes
from adaptive to adaptive channel or from a deterministic to a different type of
deterministic channel.
waiting_time indicates the time the message has been waiting.
hop_count_sum_per_dest counts the number of hops needed in each
dimension to reach the destination positive value means move in positive
direction; negative value means move in negative direction.
struct gjmsgtag *prev, *next is a global linked list of msgs in the whole
network.
21


timeout holds a timeout value.
reinject_delay holds the reinjection delay value.
num_times_msg_absorbed holds the number of times the message has been
absorbed into the absorbed queue of a node.
total_time_msg_absorbed is the total time the message has been absorbed.
sim_route_time is a float variable where the value 1.0 indicates the flit is routed.
sim_remaining_time is a float variable where the value 1.0 indicates the flit is
switched.
first_flit_in_entire_msg is a Boolean variable indicating the flit is the first in
message or it is not the first flit.
last_flit_in_entire_msg has a similar function, but last instead of first flit.
smsg_ptr is hardware_column_path specific and will not be discussed.
g_msg_ptr is a pointer variable and points to the global message of type
g_msg_type.
next is a pointer to q_elt_type, which will allow a link-list structure of q_elt_type.
Figure 3.3 Queue Element Structure.
typedef struct qtag {
int l_flit_num;
int g_flit_num;
int flit_type;
int time_stamp;
int flit_routed;
int timeout;
int reinject_delay;
int num_times_msg_absorbed;
int total_time_msg_absorbed;
float sim_route_time;
float sim_remaining_time;
int first_flit_in_entire_msg;
int last_flit_in_entire_msg;
g_sorted_msg_type *smsg_ptr;
g_msg_type *g_msg_ptr;
struct qtag next;
} q__elt_type;
3.1.1.3 Queue structure: struct q_type
The queue structure (figure 3.4) contains variables describing a queue:
23


nummsgsinabsorbedmsgq holds the number of messages in the
absorbed queue of the node.
routing_type variable is used to change between routing types when using
hybrid routing. q_route_from is radix three-dimensional array of q_type. Array
indexes consist of
[MAX_DIM+1] [MAX_NUM_DIR+1] [MAX_NUM_TOTAL_YC +
MAX_NUM_SINK_CHAN + MAX_NUM_ABSORBED_CHAN] and relate
to dimension, direction, and channel, respectively. This variable holds the
current queue values at a given node.
q_rte_to_array is of type q_to_type and consists of the queues crossmap.
next_VC_on_PC three-dimensional array is used to hold the next YC (virtual
channel) to use the PC (physical channel).
num_pc_count array holds the PCs and is indexed according to MAX_DIM+1
(maximum dimensions).
q_route_from_neighbor[ ] [ ] and q_route_to_neighbor [ ] [ ] hold the node
number from and to the queue routes neighbor, respectively.
chan_util_sum, src_chan_util_sum, and dim chan util sum arrays hold
static data for channel utility, source channel utility and dimension channel utility,
respectively.
25


3.1.2 Key Functions
An overview of the key functions critical to the base implementation is presented in
figure 3.6.
Figure 3.6 Base Implementation Key Function call Sequence.
Main
Inject_msg_to_src_q_from_absorbed_q
Generate_new_msgs
Route_and_update_deadlocked_msgs_to_absorbed_q
Route_and_update_msgs_to_reg_and_sink_chans
Route
Update_flits_to_reg_and_sink_chans
Inc_sim_time_and_determine_if_can_move_flit_at_top_of_q
u
Increment_sim_time_for_flit_at_top_of_q
Increment_sim_time_for_header_flit_at_top_of_q
Increment_sim_time_for_data_flit_at_top_of_q
Inc_sim_time_for_flits_following_the_flit_at_top_of_q
Move_flit_at_top_of_q
Pipeline_data_update
( Continued in next figure.....)
27


A listing and a brief description of the base implementation functions follows:
void generate_msgs ( )
The generate_msgs function is called from the main base implementation driver.
It generates messages of length msg_len and places the message in the
q_route_ffom queues of their source node. The source and destination nodes are
normally chosen at random, but this may be over ridden as in debug_mode or
non-random graphics mode.
void update_node ( )
The update_node function updates the contents of a node in two steps:
1. The routing step done in the route() routine of route.c
2. The update step, consisting of moving individual flits from the current
node (node_num) toward the destination node. If the destination node is
the current node, the flits are absorbed.
void Route ( )
The Route function routes flits deterministically and/or adaptively according to
specific protocols based on the routing algorithm. Routing algorithms are
determined by a switch statement. Current routing algorithms implemented are
HTA, HW_COLUMN_PATH, SW_MULTICAST, DIM_ORDER_ROUTING,
DUATO_ROUTING, and HYBRID_ROUTING.
void Update_flits ( )
The update_flits function calls three major functions:
1. update_first_flit
2. update_required_times_for_remaining_flit
3. move_flit_at_top_of_q
These functions in turn call routines that update the times and actually move the
flit to a neighboring node queue, the current node absorb queue or is absorbed at
the current node if it is its destination node.
int Update_First_Flit ( )
This function returns the flit number returned by move_flit_at_top_of_q.
29


void Move_Flit_In_ALL_Dirs ( )
This function calls the following functions in this list that move the flit to
neighboring nodes if the flit is not sink assigned.
void Update_Values_for_Last_Data_Flit ( )
The Update_Values_for_Last_Data_Flit function performs the absorption of the
last data flit, which defines that a message has reached its destination in its entirety.
void Move_Not_SINK_ASSIGN_flit ( )
This function works with the following function for not sink assigned flits.
void First_Time_In_Not_SINK_ASSIGN ( )
This function moves the flit to the neighbor node if the flit is not assigned to the
current node as its destination. An enqueue occurs to the appropriate neighbor
node if criteria are met.
void NOT_First_Time_In_Not_SINK_ASSIGN( )
The NOT_First_Time_In_Not_SINK_ASSIGN function is only used for data
flits. This function is used if the same data flit is going to a different queue than
the first data flit. An enqueue may occur in the function if the criteria are met.
31


Figure 3.7 Global & Virtual structure relationships.
The node_loc_type structure stores data to redraw the global layout of the
network. This structure contains the locations of the global layout parts, such as
the node, represented as a ball and the channels represented as lines.
The gui_control_type structure stores data which allows for the redrawing of the
virtual queues (depicted dO-dl, a0-a3), unicast and multicast source queues and
the absorbed queue. This structure also tracks the routing and switching data for
each message. This structure must keep track of all the nodes in the global
network although only 4 nodes of this type are displayed at any given time.
1 x- ¥ ' x y
doLIJJJ Lit 11LI 1 1 ail i 111 Li 1J II 1 i aol 1 1 III II 1 1 1 1 1 1 all till IB II II II ^ a2l Ll l.l I B 1 1 1 1 1 1 a3l 11 1 I I II 1 1 1 1 I I UNICAST Source Queue ^ - doLlLl LI LL.11.1 1 1 dll il l 1 1 I l 111 1 1 aOl 1 1 1 1 1 II 1 1 1 1 1 1 all 1 1 1 1 1 Bill 1 1 1 -2ll.i l II 11 M.III -3l 11111-1111111 UNICAST Source Queue
= 11 m m 11 mm n i MULTICAST Source Queue mil 11 n ii 11111 n i MULTICAST Source Queue
i iiti n rm m 11111 ; No#e_@ v; ;...' : ABSORBED Message Queue mi mu n 11 mi 11 Ticrte 2 w 7 ABSORBED Message Queue
if, i rn rn i n m rm i'i >tsr . ii i ii 111 rn n mm
33


3.2.1.2 Gui Control structure: struct gui_control_type
The gui_control_type structure (figure 3.9) holds visual information about the 4 nodes
that are displayed in detail on the screen. Detail consists of source, virtual and absorbed
queues. This information is tracked and updated in this structure and allows for the
screen to be redrawn if necessary
node_queue[ ] [ ] tracks unicast flits in the virtual and source queues.
color Jd[][][] tracks the color_id of unicast messages.
node_multi_queue[ ] [ ] tracks multicast flits in the virtual and source queues.
color_multi_id[ ] [ ] [ ] tracks the color_id for multicast messages.
node_num holds number of the displayed node.
display holds the screen display handle.
display_position holds a display position (range 0-3).
switched_flag and switched_flag_delay are used for left justification of data
flits; these variables are needed to display the queue situation of the most left flit
as routed and the next to most left flit as switched.
routed_queue[ ] [ ] holds the routing state of the top most (or most left) flit in a
particular queue for unicast.
switched_queue holds the switched state of the top or next to top (or most left
or next to most left) flit in a particular queue for multicast.
routed_multi_queue[ ][][][] is similar structure for multicast messages as
routed_queue for unicast messages.
switched_multi_queue[ ][][][] is a similar structure for multicast messages as
switched_queue is for unicast messages. The additional array dimensions are
needed for multicast messages since these messages can exist at more than one
node at the same time.
drag_node holds the current node being dragged to the detail node display area
from the global network display area.
cast_type[ ] holds a value of UNICAST or MULTICAST to indicate whether a
message is unicast or multicast.
src_q_loc[ ] is a place holder for the source channel. This is needed since
absorbed messages may get routed or switched out-of-sequence when placed
back in the source queue.
35


3.2.2 Key Functions
An overview of the key functions critical to the visual implementation may be divided
into several sections.
Figure 3.10, the initialization section, lists the calls responsible for calculating
node positions, the initialization of key structures, and the drawing of the
starting screen state.
Figure 3.11 lists the function calls that occur for each visual event, which is
defined as each time the screen is updated. This may be set to one or N clock
cycles of the base implementation (e.g., if N=10, the base implementation runs
for 10 cycles for each Key_Press before the screen is updated).
Figure 3.12 lists the function groups responsible for the global channel color
change. Since this call is relatively inexpensive, it is executed every base
implementation cycle via a call back.
Figure 3.13 presents the function calls used to implement the rotate button, drag
& drop functionality, and the color click information pop-up window.
Figure 3.10 Visual Initialization.
37


Figure 3.13 Rotate Button, DragDrop, Info pop-up Window.
39


void Xsim2 ( )
This is a wrapper function that encapsulates the main of the base implementation.
This function allows step through (N cycles at a time) of the base simulator.
void Xclear_virtual_chan ( )
This function clears the active-only virtual channel display area. By only clearing the
active channels this function is more effective. A future improvement in
performance would be to implement an XOR screen update.
void Xredraw virtual chan ( )
This function redraws the virtual channel area. It only updates active areas, thus
enhancing performance.
void Xredraw_rsc ( )
This function updates the unicast route and switch letters,
void XMulti_rsc ( )
This function updates the multicast route and switch letters,
void Channel_[x, y]p ( )
This function draws the global physical channels for the x and y directions for 2
dimensional k-ary n-cubes.
void Channel_[x, y]t ( )
This function draws the global physical channels for the x and y tori for 2
dimensional k-ary n-cubes.
void Channelcube_[x, y,z]p ( )
This function draws the global physical channels for the x, y, and z directions for
3dimensional k-ary n-cubes.
void Channelcube_[x, y,z]t ( )
41


3.3 Visual API
The visual API is a collection of functions that act as a virtual barrier between the base
and the visual implementation. Visual events can only occur through API function calls.
These functions are constant, return void, and arguments are passed by copy to
guarantee that data mutation does not occur. They are called at critical points in the base
simulator code to visually illustrate the operation of the base simulator.
3.3.1 Base Implementation Function Placement
As the base implementation executes, state changes are transformed into visual events
via API function calls. The function call along with the placement in the base
implementation follows. This list follows the typical call sequence for each base clock
cycle. The base simulator driver function is listed first with one or more visual API
function calls.
Record_Data ( )
Xdeq_all_channels ( )
Xdraw_current_cycle ( )
generate_msgs( )
Xvirtual_chan( )
update_time_for_first_header_flit ( )
Xrsc( )
update_time_for_£irst_data_flit ( )
Xrsc( )
update_time_for_remaining_data_flit ( )
Xrsc( )
update time for remaining header flit ( )
Xrsc( )
move_flit ( )
43


3.3.2 API Functions
The following is a list of the API functions and a description of their operation.
Xdeq_all_channels ( )
This function sets the gui_control_type structure to zero. The gui_control_type
structure is bounded to 4 displayed nodes. Therefore this operation is finite and
not expensive.
Xset_ball_color ( )
This function encapsulates the operation:
node_loc[node].baU_color[dim_to][dir_to][chan_to] = color_id;
Xdraw_current_cycle ( )
This function draws the current clock cycle of the base implementation. This is an
API call to the Xlib, but is a very inexpensive call since it writes XOR text to the
screen in one fixed location.
Xvirtual_chan ( )
This function sets the virtual channel color id in the gui_control_type structure.
This is bounded to 4 nodes at any given time. Therefore this operation is finite
and not expensive.
int Xrsc ( )
This function sets the routed and/or switched variables in the gui__control_type
structure. This is bounded to 4 nodes at any given time. Therefore this operation
is finite and not expensive.
Xdrawjball ( )
This function is a call to the Xlib that redraws the ball structure in the global
network display, but is only called when a physical channel is occupied or released.
If running in graphics mode with cycles set to 1, then the call is made each clock
cycle; however, if cycles is set to 1000, this is preformed once each 1000 clock
cycles. Therefore, realistic simulations may be performed in graphics mode with
adequate performance. This function works in conjunction with the
Xdraw_channel function.
45


4. Simulation Algorithms
The VFSim implements six routing algorithms. The type of routing algorithm may be
chosen in the run file, which is the input setting file used by the VFSim. An
accompanying Java program is available to interactively produce this run file, and the
algorithm type may be chosen using the program. The following is a list and a discussion
of these algorithms.
4.1 Dimension Order Unicast Routing
Dimension Order routing is a simple deterministic unicast algorithm. Deadlock freedom
is achieved by routing flits in increasing dimensions and using 2 virtual channels per
physical channel. For ^-ary--cubes, ordering dimensions alone allows cycles (figure
4.1a) and therefore is not deadlock-free as depicted by the channel dependency graph in
figure 4.1b. In addition to dimension ordering, an acyclic path is necessary within each
dimension for deadlock freedom. This is achieved by splitting each physical channel into
2 virtual channels, a high and a low channel (dO and d1) (figure 4.1c). The channel
dependency graph in figure 4.1d is acyclic based on the following routing function:
For each physical channel, virtual channels dOi and d1i are defined. Virtual channels are
used in strictly increasing order by this routing function:
For a destination node N(j) if the current node N(i) is equal to N(j) store the packet.
Otherwise, use dOi, if ji.
47


The algorithm is depicted in figures 4.2 4.4 for 2 dimensions. The tan message follows
the path as shown from the source node 6 to the destination node 2. The message is
routed completely in the X-dimension and then in the Y-dimension. The dO channel is
used from node 6 to node 8 because the current node is less than the destination node
(figure 4.2). The dl channel is used from node 8 to node 5 (figure 4.3) and from node 5
to node 2 (figure 4.4). The change from channel dO to dl is necessary for deadlock
freedom since the current node is greater than the destination node.
Figure 4.2 Dimension Order Routing Simulation.
49


4.2 Duato Unicast Routing
Duatos routing algorithm adds a fully adaptive virtual channels) to two deterministic
virtual channels over which dimension order routing is used. The adaptive virtual
channel has a higher precedence than the dimension-order channels and is chosen if it is
free. When the adaptive channels) is blocked the algorithm reverts to one of the two
deterministic channels that uses dimension-order routing. A message may return to
adaptive channels if they are available.
With dimension-order routing, a message is routed along decreasing dimensions with a
dimension decrease occurring only when zero hops remain in all higher dimensions.
Maximum adaptivity is maintained by always choosing the path that will allow for the
maximum choices for adaptive channels. For example, in a 2-dimensional network (x,y)
assuming all adaptive channels are free, a flit at node (2,2) with destination node (0,0)
routes to (2,1) and then to (1,1), versus routing from (2,1) to (2,0).
Deadlock freedom in /&-ary--cubes is assured because [7]:
A whole message must fit in a queue for VCT switching, or in wormhole
switching the header flit must be at the top of the queue for the message to
advance.
Although a message can use adaptive channels it can revert to dimension-order
routing which is deadlock-free.
In figure 4.5 the flits displayed in nodes 0, 1, 2, 5 are all occupying adaptive virtual
channels. The destination node for the three messages displayed is node 2. The y-
dimension adaptive virtual channel of node 2 is occupied at this clock cycle, 16. On the
next clock cycle, 17, a flit from node 5 will try to enter node 2 in the y-dimension. Since
the y-dimension adaptive channel is occupied the message (pink) flit from node 5 will be
assigned to the dl deterministic channel as shown in figure 4.6.
51


4.3 Hybrid Unicast Routing
The Hybrid routing algorithm consists of three logically independent and pipelined
message paths: a Fast Deterministic Path (FDP), a Slow Deterministic Path (SDP), and
an Adaptive Path (AP). The FDP has the highest priority and is used for a message flit
entering on a deterministic channel that is also able to leave on a deterministic channel of
the same type (low/high) and dimension. The FDP requires the least number of stages:
h + d stages for a header flit and d stages for a data flit.
If a message cannot be routed along a FDP (i.e., if a deterministic channel of the same
type is not available or a message is being switched to a different type or dimension),
then the message is sent along the SDP which requires more logic and takes H + D clock
cycles, where (H + D) > (h + d).
The AP has the lowest priority and is used when both the FDP and SDP are not
available. Giving a higher priority to the deterministic paths reduces the adaptivity of a
message. At low traffic this loss is more than offset by the gains in router delay and
translates in message latencies that are lower than those of an adaptive router. At very
high traffic the probability of blocking is large and the loss of adaptivity translates into
latencies that are sometimes larger than those of an adaptive router but always orders of
magnitude lower than a deterministic router.
This routing scheme is deadlock free since for any given message, the choice of paths
selected is always a true subset of those that could be selected by Duatos algorithm.
Duatos algorithm is deadlock-free and therefore the Hybrid routing algorithm is
deadlock-free.
Figure 4.7 demonstrates the use of FDP in low traffic. Since all virtual channels are free
the message can utilize the FDP deterministic channels. Conversely, in figure 4.8, traffic
is heavy and messages at node 2 (pink) and at node 4 (red) are using the slower adaptive
channels. Although, the adaptive channels are slower, this is an advantage since these
messages would be otherwise blocked waiting for a free deterministic channel.
53


4.4 Software Unicast-Multicast Routing
Software Multicast routing is the simplest implementation of multicast routing
consisting of sending one unicast message for each destination address in a multicast
message. This routing algorithm uses Duato routing for the base routing of unicast
messages. Figure 4.9 displays software multicast routing for a two-destination address
multicast message. The tan lines depict the path taken from the source node 0 to
destination node 5 and 6. The path taken is as follows:

12 15 -> 11 -> 10 6
Figure 4.9 SW_Multicast routing in 2D.
4.5 Hardware Column Path Unicast-Multicast Routing
The Hardware Column Path routing algorithm can route unicast and multicast messages.
Unicast messages are routed using Duatos algorithm. For multicast messages, a path-
55


4.6 HTA Unicast-Multicast Routing
HTA (Hardware Tree-Based Routing Algorithm) is a tree-based routing algorithm.
Unicast messages are routed with Duatos deadlock-free algorithm (figure 4.11). For
multicast messages, routing takes place at each intermediate node along the
multidestination message path. The message is routed along a common path for all
header flits as far as possible, and then each header flit is routed into different channels
headed for a unique set of destination nodes. This branching continues as necessary
until all destination nodes have been reached.
HTA uses VCT switching with distinct virtual paths for unicast and multicast messages.
Unicast messages are routed using Duatos algorithm while multicast messages are routed
using fully adaptive routing along with a deadlock detection and recovery scheme.
Immediate header flit routing is the most commonly mechanism used with routing
algorithms. When a header flit is moved to a neighboring queue, it is usually routed at
this neighboring node without waiting for the remaining header flits in the message to be
routed. HTA, however, uses delayed header flit routing to lower the probability of
message blocking.
In delayed header flit routing, a header flit at a neighboring node is not routed until all
header flits at the current node have been routed. Header flits are kept in close
proximity, which prevents header flits from being assigned to queues at downstream
nodes before all flits in the message can use them. This keeps downstream node queues
free so that they are available for other messages in the network, which can use them
immediately. As the number of destinations grows, delayed header flit routing becomes
increasingly important in reducing message-blocking probability.
In HTA, each node has a dedicated holding queue called the deadlock queue (or
absorbed message queue). Following a predetermined amount of time (timeout delay) a
header flit that cannot be routed is considered to be in a potential deadlock situation and
is routed to the deadlock queue at the current node (figure 4.12 depicts the red and
yellow colored messages in absorbed queues at node 7 and 6, respectively). If one of the
header flits in a message has been assigned to the deadlock queue, all remaining flits in
the message must be routed as soon as they reach the top of the queue. If any of these
remaining header flits cannot be immediately routed, they are considered to be in
potential deadlock and are routed to the deadlock queue of the current node. Deadlock
queue messages are reinjected into the network following a predetermined time
(reinjection delay).
Deadlock-freedom is achieved as long as the deadlock queue is infinitely large. For
practical purposes, using processor memory attains this infinite size. However,
57


Figure 4.11 Duatos routing for unicast messages in HTA,
Figure 4.12 HTA deadlock queue (absorbed message queue) simulation.
UNICAST Sour-c*
oxmrr.i nini^TO
MU.T1CWT SuurM fiwue
3P&prrmrnxtrn
a&i;
OftKB filuvti# - A
igppaio t rmiTnan
UMJCUST Sooroa dun*
rrt rrrn nrnn m >
MULTICAST Source Ctutfuv
t grraxiLOixrrixrx}
Hocte.-S'. ; : - ; :.y
RBSORBEB 41*g &uu. "
- / dxrii.txi;ii.ir.uii i)
t Routing Hlgoritru* 3ib<4;
' clock: '-u
/ UNICrtSly:f^.
. N_target,thruput
.tfr*stiv.thruput
rs i ui atea_msg_ 1 at
jNg.aatv.4at
target_tt-u0ct -
errmv*ttv-uout y
a
Color IB: ;
CoMJnioat.iQn Typejt/
Sourtw,;Jtodi
tfcidats):
G HOT ABSCfiBEB
Bwt> 7 Hat ABSORBED
tests 8 ABSORBED
Nuto*r :
CCGenerateefi
CC Leave Sovffce Queue:
Tntal Msg Latency:
Meg Waxt-mg T xmtn
Hop Count:
Hop Count Per Beart.lrmt.lon:
ta /- :
HULTICftST
x - 4
3 ,
5
3
6
20
22
9
14
V -38 X
Color IB: 19
Conaunicetion Type; MULTICAST
Source Mode: 3
Nuatoer Bastintion NodeJ Oact- 6 NOT ABSORBED Dest 7 ABSORBED Best- 8 NOT ABSORBED 3
Nuaber Data Flits: 6
CC Generated: 9
CC Leave Source Queue: 12
Total Msg Latency: 27
Mag Halting Tine: 21
Hap Count: 7
Hap Count Per Destination: 12


Figure 4.13 shows a simulated multicast message using HTA routing with a source node
of 3 and destination nodes 4, 9, and 10. The path taken with the branch at node 14 is as
follows:
3 - 2 - 14 - 10
\
13 9 - 8 4
Figure 4.13 HTA Routing Algorithm in 2D.
60


Figure 5.1 Duatos Clock 29, Queue Size = 2.
j ^INECWtt fouro* S!k*ru miTixi - k-
HULYtemT toum Gunu '
tL' OIU33 .. .
tt "
HBSO#*** *<} Out
OHICftST .Surx
Ksam "j _
mittfc***'a*br*
T CXOXD ^ .
fUmteg--' -;-
nrcueaev n*-s*9* <***
COXED ^
Rouitln^ fU^rltha OvMiUtvr^i
c 1 ocfe -~ *
unicast
U^#**ivu4*ru0Mt ^'-i-STSia
j gptSwu3vW rir#i M1 uekv1
%+mm.
ItULTlCAST -
li^'Virvptft .
*Mrflt4^k,t Q::
: - jjg
rtl j^p^^ -
/rytMi*rupu
UHICAST6our^>* Qutuv
IWtTiCflST Bowro# Hutuir
txtxm
Ncwi** j$ .llf.
BSOR*H nr.ut Gkwvw
rnrrn -
-~.fr
UMICRtT $ouro Queue '
quo.;)
ntATlCFlBT Sourw 6ke>u#
cmxp
Nos4n>.! . :yjj _
ftfcSQftMtB n*s*^ ftuwruwr
fTTOTJ

Figure 5.2 Duatos Clock 29, Queue Size = 8.
UN1CMT
H_vfWtiv^tlrfupu<:
*iwul 1 art ^ (Sioe
<*afe*. -: ..&**!
tArg* ttrupu t -.- ~ .. r*?
fTw i ttirwvt ': usro-
62


Figure 5.5 Duatos Clock 134, Queue Size = 2.

v - -
UNICAST. souro* Wueo*; -
err nxi
MULTICAST Source Ouextr
crmn.- x -- -
tJabc 0 - : :
HliSWgfcG nessaoe Uueur
o ri.r.i) - .

UNICAST. " VX'r-X"
H_inr9*-t,-thrueut Jf. H^ffsotiv+^thruput iJjtJJ23C
simil4rtdrii*o<,Ut ^ TIVOOOOJ
2- :.;15S50
^fffrctive^^brupwt -g
- -* - - rtU_TIC3T \-f#
N_^r Q*t^thrupwt
N_e f f **-t i v_ Wupwt :
r *4*1 _rm**_ckdl kar o-t^-thr^jput ',
r^rarotive^-thrtjpat '
VNICftSt Sow
t_ rirrm
MULTICAST ts
: rnrm
NocJv v
: A&SORSC* SO
f:mxn
~ x?,;e64m
UNICAST Source Uuuc
muD
MULTICAST 4uurt Scwu
tXfXIO .
J*sgfag&~ -
HVSORBCO fVssage flueue
--fTTXCD

Color ID; 24
Connunicatton Type: UNICRST
Source Node: 4
NupAjwr Dest Antiort Nocte ; 1
De*t= B NOf HBSORHCtl
Nunber Dace Flits: 9
CC Generated: 14
CC Leave Source Queue: 7S
Total Ms* Latency: 0
Ms* Waiting Tine; 97
Hop Count; 3
Hop Count. Per Destination: 3
Figure 5.6 Duatos Clock 134, Queue Size = 8.
fJNXCaST iouroe fiuevw
_ r. r.7Ttri:i;u:;rxm:rr:D
MUI TICAST aourca flw*W'
_ i: i Lxn xxixrxiim n
Hie* < t - '
AasoniEB nramoe.Ovwwe
rrrnxm i rrmaaij
UNICAST Souro* Oueue
in m.t i mmmii
MULTICAST' Saurpr Hum
uxi m 111:03x11x11
; jrtte^waq^ imt.
ting^l^thrupiit
fat- ;0' T J 'a x
Color ID; 24 ,X-\
Coercnication Type. UNicosr
Source Mode; 4
Hunter DostmtIon Node Dost* 8 NUT ABSORBED t' ;'
Hunter Data Flits: 9
. CC Generated; -; '-:: 14 ..
CC Leave Source Quotas; 'i -42
Total Hag Latency; Q
Hsg Waiting line; /l '
Hop Count; 3 -.iW ..
Hop Count Per Destination; 3 ^ -X .V'. :'- '-
64


Figure 5.7 Hardware Column-Path Clock 1.
Figure 5.8 Hardware Column-Path Clock 6.
66


Figure 5.11 Hardware Column-Path Clock 19.
Figure 5.12 Hardware Column-Path Clock 27.
68


Figure 5.14 HTA Clock 64, Absorbed Message Queue.
riiinaranf^as g&- L-ax -
Color ID; . 19
Coeauniction Type: MULTICAST
Source Node t 3 ;
Hunber Dostlotion Hode Nustoer Data Elite: 5 ..-:
CC Generated: , 9
CC Leave Source Queue: --.. 12
; Total Msg Latency: . B5
' rtsfg Hi.L4rt& TximK : 37
Hop Count: 8
: Hop Count, Par- Dnest, irtaL ion: 12 -.:
shaTm n:i
SB

: Routli>9 Al^orlth*
jc 1 ock
= UMSCft&T
r -. -0* ttHii
aMft..flW45... lat -- ' 'Tv-'G- tirfttiti
wiloifkf -
:-.r~". #w$0 I B*rf*0Tiu* *ftrupu7T S^&***4$
. X- - . - - *.
MULTICASTr
N_t>rs Hw*frfcUvvltvSVut
iiBMlgttd :
*VQLrm*Q l at -
r^tQrMK.dtadlooki
#/ -' '.. . '--
immmtimtfl t giiTmfemn.mt
MULTICAST 3owr Cv~v
r:u.axi.rr;mxr.rx.TT3
Krj3--? --.7.:*'=.
*aoWB Mtssay^ oiw -
f.m.i'ii:iT.rriTXT.xrp
fWtTlCftST Source-Suet*
_ kitr.m.tn:nxcojj
(WfOOfTSFD IV3agv (Mur
i~.n,ixn:iTi~i'Tf.T'HH.:?
ar
Figure 5.15 HTA Clock 44, Random Messages in a 3-ary-5-cube.
ucne*
N_-tr#tthrucHrt '
Njtft ** tv* _thr M*H**fci
a*td_*ua_ lit ;
~ ,--^s
-tar\rt_-thrue<#t T
^ *f r ...-m-
: 9.23313 WtSr* -,,-*.*333 H^ffrectiMt^tfrupu^ & SinMletsd^esg^let 0.0909 71 '7
iiju.a0uLet _-
9.9*90 t-ji./iwe -T %11 .tsraet^tMrvput -Wi-..--.----: &^Kgt- . ^.T^rL^-V^
- $.*m >rf*ctive-9hrMput - tie&ST
70


not be without risk, and changes would need to be tested and the baseline simulator
functionality would need to be regression tested.
Provide for asynchronous expandability of the visual extension and/or the base
simulator.
A final requirement of the VFSim was to add visual functionality without increasing the
complexity of the existing software. It was also desirable to have the ability to enhance
the visual functionality and the base functionality asynchronously without the chance of
breaking the other code module.
6.2 Implementation
Implementation of the VFSim was a two-part approach. First, a functional visual
implementation was needed to support the desired functionality of the base simulator.
Some of the major visual function requirements were:
Visualize interconnection network topologies both in two and three
dimensions.
Visualize global flit movement from node to node.
Visualize virtual channel queues, including source and absorbed queues.
Visualize unicast and multicast messaging.
Visualize flit routing and switching.
Visualize any node at any clock cycle.
Present real-time global statistics.
Provide basic Window functionality.
Second, an API was needed to tie into the existing base simulator code. Although the
API code footprint was small in comparison to the visual extension or base simulator
code, implementation required a detailed understanding of both. An additional
requirement of the API was not to modify the base code.
72


7. Conclusion
Writing the visual extension of the VFSim using the X Window function library allowed
for the development of a robust and expandable visual flit simulator. A loosely coupled
API connecting the base implementation to the visual extension proved to be a good
design choice. The API provided increased code maintainability by allowing the division
of the base and visual implementations. With this model asynchronous software
development in both code modules was possible.
For future changes in the base implementation code, the visual extention may be used
for validation, including regression testing. This functionality will inevitably expedite the
implementation of new and enhanced features in the base simulator. Conversely, the
visual implementation may be enhanced using the base simulator to verify visual code
changes.
VFSim provides useful visual flit simulation and the API model provides an adequate
platform for future expansion.
74


References:
[1] K. V. Anjan and T. M. Pinkston: An Efficient Fully Adaptive Deadlock Recovery
Scheme: DISHA. ACM, 1995.
[2] R. V. Boppana, S. Chalasani and C. S. Raghavendra: On Multicast Wormhole
Routing in Multiprocessor Networks. IEEE, Los Alamitos, California, 1994.
[3] T. H. Cormen, C. E. Leiserson and R. L. Rivest: Introduction to Algorithms. The
MIT Press, Cambridge, Massachusetts London, England, 1990.
[4] D. E. Culler and J. P. Singh: Parallel Computer Architecture A
Hardware/Software Approach. Morgan Kaufmann Publishers, Inc. San Francisco,
California, 1999.
[5] J. Duato, S. Yalamanchili, and L. Ni. In Interconnection Networks: An Engineering
Approach, IEEE, Los Alamitos, California, 1997.
[6] B. W. Kemighan and D. M. Ritchie: The C Programming Language. Prentice Hall
PTR, Englewood Cliffs, New Jersey 07632, 1988.
[7] C. Larman: Applying UML and Patterns: An Introduction to Object-Oriented
Analysis and Design. Prentice Hall PTR, Upper Saddle River, New Jersey 07458, 1997.
[8] X. Lin and Lionel M. Ni: Deadlock-Free Multicast Wormhole Routing in
Multiprocessor Networks. ACM, 1991.
[9] X. Lin and Lionel M. Ni: Multicast Communication in Multiprocessor Networks.
Int Conf Parallel Processing, 1990.
[10] X. Lin, P. K. McKinley, and Lionel M. Ni: Performance Evaluation of Multicast
Wormhole Routing in 2D-Mesh Multicomputers. Int Conf Parallel Processing, 1991.
[11] Z. Liu and J. Duato: Adaptive Unicast and Multicast in 3D Mesh Networks.
IEEE, Los Alamitos, California, 1994.
[12] J. M. Martinez, P. Lopez, J. Duato et al: Software-Based Deadlock Recovery
Technique for True Fully Adaptive Routing in Wormhole Networks. IEEE, Los
Alamitos California, 1997.
[13] A. Nye: Xlib Programming Manual: OReilly & Associates, Inc., 1995.
76


Full Text
This thesis for the Master of Science
degree by
Brian Joseph Cohan
has been approved
by
Dianne Kumar
Tom Altman
II Kyeun Ra
/7 Dao/
Date


DEDICATION
I dedicate this thesis to my wife Linda and my children, Elizabeth, Danielle and Paula for
their unfaltering understanding and support. Also, I would like to dedicate this thesis to
my mentor, Dr. Dianne Kumar and die great faculty of The University of Colorado at
Denver.
IV


4.2 Duato Unicast Routing......................................................51
4.3 Hybrid Unicast Routing.....................................................53
4.4 Software Unicast-Multicast Routing.........................................55
4.5 Hardware Column Path Unicast-Multicast Routing.............................55
4.6 HTA Unicast-Multicast Routing..............................................57
5. Simulation Case Studies.......................................................61
5.1 Duato......................................................................61
5.2 Hardware Column-Path..............................................,.......65
5.3 HTA........................................................................69
6. Software Engineering Model....................................................71
6.1 Requirements...............................................................71
6.2 Implementation.............................................................72
6.3 Testing....................................................................73
7. Conclusion....................................................................74
8. Future Work...................................................................75
References:.....................................................................76
vi


Figure 4.10 Hardware Column Path Routing in 2D.............................56
Figure 4.11 Duatos routing for unicast messages in HTA....................59
Figure 4.12 HTA deadlock queue (absorbed message queue) simulation.........59
Figure 4.13 HTA Routing Algorithm in 2D....................................60
Figure 5.1 Duatos Clock 29, Queue Si2e = 2................................62
Figure 5.2 Duatos Clock 29, Queue Size = 8................................62
Figure 5.3 Duatos Clock 74, Queue Size = 2................................63
Figure 5.4 Duatos Clock 74, Queue Size = 8................................63
Figure 5.5 Duatos Clock 134, Queue Size = 2...............................64
Figure 5.6 Duatos Clock 134, Queue Size = 8...............................64
Figure 5.7 Hardware Column-Path Clock 1....................................66
Figure 5.8 Hardware Column-Path Clock 6....................................66
Figure 5.9 Hardware Column-Path Clock 11...................................67
Figure 5.10 Hardware Column-Path Clock 13..................................67
Figure 5.11 Hardware Column-Path Clock 19..................................68
Figure 5.12 Hardware Column-Path Clock 27..................................68
Figure 5.13 HTA Clock 54, Absorbed Message Queue...........................69
Figure 5.14 HTA Clock 64, Absorbed Message Queue...........................70
Figure 5.15 HTA Clock 44, Random Messages in a i-ary-i-cube................70
viii


2. Background
The number of applications using interconnection networks is growing rapidly.
Examples include: internal buses in VLSI (very large-scale integration) circuits, ,
telephone switches, internal networks for ATM (asynchronous transfer mode) switches,
vector supercomputer processor/memory interconnects, LANs (local area networks),
MANs (metropolitan area networks), WANs (wide area networks), and interconnection
networks for multicomputers and distributed shared-memory multiprocessors. This
paper and the accompanying visual simulator will concentrate on the application of
interconnection networks for multiprocessors using the k-ary n-cube architecture [4,5].
2.1 Parallel Systems
Although the performance of processors has increased at an aggressive rate, the demand
for greater computing power continually drives the need for even faster processors. An
example of this demand are the grand-challenge problems with the largest problems
requiring teraflops (trillion floating-point operations per second) for a few hours to
thousands of hours at a time. To meet this demand, processors are becoming
increasingly complex, and therefore expensive to design. An alternative, cost-effective
approach is the use of parallel computers using existing processors. Parallel systems
utilize several processors concurrently to solve a given computation. These systems
require the processors to communicate with each other, however, necessitating a
communication subsystem for the interconnection of the processors (and possibly
memory and disks in some architectures). These subsystems can easily become the
bottleneck and necessitates the need for very efficient networks. Therefore, the design
of high-performance network systems in parallel computers is critical for systems with
any significant number of nodes [9].
Of the many parallel computer architectures and types of interconnection networks, this
paper will concentrate on the multiprocessor model with the direct network
interconnection subsystem [1].
2


2.2.1 Shared-Medium Networks
Shared-medium networks are the least complex topology and consists of a transmission
medium that is shared by all communicating devices. Only one device may be allowed to
transmit at any one time, and arbitration strategies resolve access conflicts. The two
major classes of shared-medium networks are LANs and the Backplane Bus. The most
widely used contention-based LAN is Ethernet. It uses the CSMA/CD (carrier-sense
multiple access with collision detection) protocol to arbitrate access. The backplane bus
is usually implemented to support UMA (unified memory access) architectures. Shared-
medium networks have limited bandwidth and do not scale well; therefore, they are of
limited use in multiprocessor implementations [18].
4


Examples of direct network topologies include the tree (figure 2.3) and the star graph
(figure 2.4) in addition to orthogonal topologies as described below.
Figure 2.3 Binary tree.
oAT X
Figure 2.4 Star topology.
A network is defined as orthogonal iff nodes are arranged in an orthogonal n-
dimensional space having every link arranged in such a way that it produces a
displacement in a single dimension. A strictly orthogonal network is further restricted to
require that every node has at least one link crossing each dimension. The n-dimensional
mesh and k-ary n-cube (torus) fall in this category.
Formally, an n-dimensional mesh and torus is defined as:
k(0) x k(l) x .... x k(n-2) x k(n-l) nodes, k(i) nodes along each dimension i,
where k(i)>=2 and 0<=i<=n 1.
A node X has n coordinates (x(n-l), x(n-2), ...., x(l), x(0)), where 0<=x(i)<=k(i)-l for
0<=i<=n-l.
Nodes X and Y are neighbors iff y(i) = x(i) for all i, 0<=i<=n-l, except one, j,
where:
y(i) = x plus/or minus 1 mesh
y(i) = (x plus/or minus 1) mod k torus.
6


Figure 2.7 4-ary 2-cube torus
Figure 2.8 3-ary 3-cube torus


for the duration of the message transmission and blocks other messages from using any
part of the path, even when the path is not being used.
2.3.2 SAF Packet Switching
SAF (store-and-forward) packet switching consists of partitioning a message into fixed-
length packets dedicating the first few bytes of the packet for routing and control
information. These control bytes are referred to as the packet header. Packets are
individually routed from source to destination and are completely buffered at each
intermediate node before the message is forwarded to the next node. The header
information allows the router to determine the appropriate output link for the packet.
When messages are short and frequent, SAF packet switching is effective. SAF allows
for many message packets to be in the network simultaneously. Disadvantages of SAF is
the increased overhead of routing each individual packet and buffering requirements at
intermediate nodes can be prohibitive.
2.3.3 VCT Packet Switching
A flit (flow unit) represents a logical unit of information. A phit (physical unit) is a
physical unit of information. A single phit is the number of bits that can be transferred
across a physical network channel in one clock cycle. VCT (virtual cut-through) packet
switching partitions a message into a packet consisting of a header flit(s), which contains
the routing information, and data flit(s) that contain the data. For example, in figure 2.6 a
message is broken into three packets, a packet is divided into two flits per packet, and a
flit is made up of two phits.
Message (3 packets) Packet (2 flits) Flit (2 phits)
Figure 2.9 Message packet, flit and phit partitioning.
The size relationships between the message, packet, flit, and phit vary between different
machines. Typically, the flit size is equivalent to the phit size. An exception is the Cray
T3D where each flit is comprised of 8 16-bit phits. Specific size determinations reflect
design trade-offs in performance, reliability and hardware complexity.
The router examines the header flit(s) and determines the output link for the packet as
soon as it is received. The header flit can be forwarded to the next node as soon as the
10


Figure 2.10 Message blocking with wormhole switching.
Figure 2.7 illustrates message X being blocked in place by message Z in wormhole
switching. Also, message Y is blocked by message X. Notice that the buffers occupied
by message X in routers 1 and 2 will remain occupied until router 3 clears the buffer
needed by message X essentially blocking other messages that may require the occupied
buffers of routers 1 (message Y) and 2.
2.4 Routing
Routing algorithms establish the path through the network that is followed by each
message or packet. The number of proposed routing algorithms in the literature is
almost endless. A finite set of routing characteristics is useful in the classification of
routing algorithms. Table 2.1 lists routing criteria with alternative approaches. Table 2.2
lists different approaches associated with adaptive routing.
Routing Definitions:
Number of Destinations in Message:
Unicast Routing Packets are assigned to a single destination.
Multicast Routing Packets may be assigned to multiple destinations.
Routing Decisions:
Centralized Routing The routing path is established by a central controller.
Source Routing Routing path is centralized at the source node prior to packet
injection.
12


2.4.1 Router Model
The basic generic router used for multiprocessor implementation is comprised of
buffers, a switch, a routing unit, link controllers and a processor interface as shown in
figure 2.8. The buffers are commonly FIFO (first-in, first-out) and are used to store
messages in transit. Usually there are both input and output buffers. The switch
connects input buffers with output buffers. High-speed routers use a crossbar with full
connectivity. The routing unit, which will be discussed in depth in the next section,
determines which output buffers are assigned to a message. Link controllers (LC)
coordinate the transfer of the message across the physical channel between routers. The
processor interface implements a physical channel to a processor instead of another
router [5,15].
Figure 2.11 Generic router model.
Buffers
Processor
Interface
I
LC
LC
r ?
LC
a---
i_
Switch
(crossbar)
RoufingUnit
LG]
m
14


Starvation
A packet may be permanently stopped from reaching its destination if network
traffic is intense and the required resources are always granted to other packets
also requesting them. This situation usually occurs when an incorrect resource
assignment scheme is used to arbitrate resource request conflicts.
Deadlock is the most difficult problem to solve. There are three main strategies for
handling deadlock: deadlock prevention, deadlock avoidance, and deadlock recovery.
Deadlock Prevention
Deadlock prevention consists of allocating resources in such a way that a request
never leads to a deadlock. If a given routing function provides a path from
source to destination node and has no cycles in its extended channel dependency
graph it is deadlock-free. An example is circuit switching where backtracking is
allowed.
Deadlock Avoidance
As a packet advances through the network, resources are granted to the packet
only if the resulting global state of the network is safe, or deadlock-free. One
technique is to establishing an ordering between resources and granting resources
to each packet in decreasing order.
Deadlock Recovery
Resources are granted to packets without any restrictions. A deadlock may occur
with strategy and a detection mechanism must be in place. If a deadlock is
detected, some resources must be deallocated and granted to other packets.
2.4.3 Multicast Routing Hardware Implementation Schemes
Multicast communication consists of a message having more than one destination node.
This is a common operation in parallel computing and is critical for mutltiprocessor
performance. Two major approaches to multicast routing are Hardware Tree-Based and
Hardware Path-Based routing.
16


2.4.3.2 Path-Based Multicast Routing
In Path-Based Multicast routing a multicast message destination node set is divided into
disjoint subsets. These subsets are then composed into separate submulticast messages
and injected into the network. The messages are routed through the network using only
the first header flit in the message. Once the destination of the header flit is reached, the
first header flit is removed by the router and the next header flit is used to continue
routing the message to the next destination. This process continues until all destinations
are reached and the message is consumed by the network. Figure 2.11 depicts an
example of a path-based multicast routing algorithm.
Figure 2.14 Hardware Column-Path Multicast Routing.
Multicast Message: source node = 0, destination nodes: 1,2,4,5,7,8.
Given the 3-ary-2-cube mesh network below with labeled nodes, this
message will be divided into two disjoint subset messages consisting
of:
Message 1: 1,4,7.
Message 2: 2,5,8.
Message 1 will proceed to node 1 and the first header will be %
removed and then to node 4 and so on.
Message 2 will proceed to node 2 and the first header will be ii
removed and then to node 5 and so on.

18


The understanding of any software implementation usually starts with a knowledge of
key data structures and an explanation of how these structures are used. An overview of
the node and queue structures is presented in figure 3.1. An explanation of the
important variables in each structure and the code listing follows.
Figure 3.1 Node and queue relationships.
Node (nodetype) contains queues
3.1.1.1 Global Message structure: struct g__msg_type
The global message structure (figure 3.2) contains variables describing a message:
source_node describes the source node of the message.
dest_node[ ] array is of type dest_node_type and consists of the
dest_node_num (destination node number), dest_absorbed (destination for the
given node number is absorbed or not) and map array, which indicates number
20


Figure 3.2 Global Message Structure.
typedef struct g_msgtag {
int source_node
dest_node_type dest_node[MAX_NUM_DESTS_PER_MSG];
dest__node_type is defined as the following in blue
typedef struct dest_node_tag {
int dest_node_num;
int dest^absorbed; // msg to dest is absorbed
by dest node
int map[MAX_DIM+1]; // num of hops needed
in each dim
to reach dest
} dest_node_type;
int num_dest_nodes;
int num__dest_absorbed;
int num_data_flits;
int comm_type;
int hop__count;
int total_msg_lat_sum;
int cc_gen;
int cc_leave_src_q;
int changmg_deter_to_deter_count;
int changing_other_to_any_count;
int waiting_time;
int hop_count_sum_per_dest;
struct emsptae *prev. *next;
} g_msg_type;
3.1.1.2 Queue Element structure: struct q_elt_type
The queue element structure (figure 3.3) contains variables describing a queue element
(i.e., flit):
l_flit_num holds the local flit number which is unique to each flit in the
message. g_flit_num holds the global flit number.
flit_type holds the type of flit (i.e., address, data).
flit_routed is a Boolean variable indicating if the flit has been routed.
22


head and tail are the most important variables are listed first; these are pointer
variables that point to a q_elt_type structure.
size holds the queue size.
uni_msg_size and mul_msg_size hold the unicast and multicast message size
respectively. l_num_dest_nodes holds the local number of destination nodes for
a given message.
l_msg_len holds the length of the message currently in the queue.
time_stamp_of_first_addr_flit holds the time of the first address flit.
Figure 3.4 Queue Structure.
typedef struct {
q_elt_type *head, *tail;
int size;
int size_status;
int uni_msg_size;
int mul_msg_size;
int q_updated_this_cc;
int msg_deadlocked_route_move;
int TEMP_msg_deadlocked_route_move;
int all_addr_flits_moved_at_prev_node_TEMP;
int all_addr_flits_moved_at_prev_node;
int all_addr_flits_routed__at_this_node;
int first_flit_assigned_to_sink;
g_sorted_msg_type *smsg_ptr;
int l_num_dest_nodes;
int l_msgjen;
int l_msg__len_set_correctJy;
int last_flit_num_moved_in_q;
int time_stamp_ofjfirst_addr_flit;
int overall_route_cc_type;
int set_required_times;
float required_route_time;
float required__switch_time;
float required_channel_time;
} q_type;
3.1.1.4 Node structure: struct nodetype
The node structure (figure 3.5) contains variables describing a node:
24


9Z
isdAispou {
ISHcLUrWM3DM1N~XVVi] ums-[anuBq3-ujip 4ui
(SHdAXTnno:rimH~XVnl ums-\nTijmip-Djs m
[SHdAX MMOD MIN XVW] ums~[nn~uuqD 4111
[l+dIQ MIN XVJVl [l + MCTXVJX3 Joqq%nio43jnaf"b 4ui
-[l+>naNnN~XVJMj ll + ma~XWj]Jq#3uluojf-sjnoj-b 4ui
t[\ +JMiaXVIM]luno^d-u*nu 4ui
jbA iyxql wnN'xvkl
[SiklAX MVOO NON XVJMltHia MIN XVMl
+ ma XVRImMOD wav jiy^uiuimioj joj 3iup 4X3u~4ubav~3d 4111
bA_TVXOXWnN_XVW]
feadAX nwod non jcvwltaia min xvwl
U + MCI XVN] WIVOO wav 4ig jsjij joj sum 4X3U 4UBAV-3d jut
jbAwxoxwnN"xvw]lsadAX" wwo:Tminxvw1
Ixia min xvMi+ma xvrIjvwod wav 3mnb 4* iubav
(As^dAi~nnoD~nciH~xvn)
tdia min xvwJ h + ma xvwJmvod wav iubav 4
bA tvxox min_xvw] [>na nan xyMqMaxvw]
did WAV Jip^UlUTBUJ3J JOJ 3Uip 4X3 U~lUBAV-3d 4UJ
iTA"TVXOX M1N~ XVJVl jdiajMflN-XVWl
[l + nia XVJXJdld WAV 4?p isjq joj 3uip ixsu iubav-3d iut
bAjivxox'wn n-J
IdICI MIN XVN3U+MCI XVHlHia WAV snsnb jbjueav 3d 4ui
IdlQ MlN_XVMl fl_+ ma XVW]dIOWAriubaT^ 4ui
[S3cLAX vmoo WflN XVMdia NONjXVJM]
[l+MCl XWydDd uo 3Alx3u_P9^Pdnwi
[S3dAX_JMN03Wn N~XVW]
kna MIN XVMlft +WICI_XVJXbd uo 3A ixsuiui
iNVIO DdS_MlN XVW + DA TVXOX W0N_XVM1
U+dici ]vnN_xvix3U+Nia xvmI^jb z*rbj>dh-oirb
[nvio aaexosav min xvw +nvho TNis_wnNxvw
+0A TVXOX MIN xvw][_l+dici wnN~xvjv][i
+MG XVN]ujojJ ajnaTb scLtfb
;[SHdAX~ 0'[V_WnNXVWjPK3M^-u w
iXlTB 04 J31J40 pB3q JJ 4UI
fj343p04~J343ppB3q~JJ 4U]
l3dX4§upnoj 4ui
[saaO^NOIXVDINnWWOT'MlN]3?00-^^3^^^3 W
Ul+Wia'"XVMlsdoH"'lEnb3"J0jJ?P *u!
tbSsiupsqjosqB-uisSsufTunu 4ui
13AOUJ-34nOJ~3pOU4B~p35|30|pB3p-§SUJ~dJ/VHX *u!
13AOUJ-34nOJ~3pOU~4B~p33pO[pB3p-8sUJ 4UJ
} §B43pOU 43TU4S jppsdfo
sjra^njis spoj^ c*£ 3jn§iq


( ... Continued )
Inc_sim_time_for_flits_following_the_flit_at_top_o f_q
Inc_sim_time_for_addrjflit_following_the_flit_at_top_of_q
Inc_sim_time_for_data_flit_following_the_flit_at_top_of_q
it* Move_flit_at_top_of_q
Gather_stats_and_deq_and_enq
Move flit in all dirs
Move_flits_assigned_to_reg_or_abs_chan
Moving_addr_flit_or_moving_data_flit_to_the_first_
q_its_assigned_to
Moving_same_data_flit_to_a_different_q_than_the_first_q
_it_was moved to
Update_values_for_last_data_flit
28


int Update_Required_Times_For_First_FIit ( )
This function updates the times for the current first flit of the packet.
int Update_Time_For_First_Header_Flit ( )
The Update_Time_For_First_Header_Flit function updates the route and switch
time for the header flit. This function and the family of functions of this type are
called each clock cycle.
int Update_Time_For_First_Data_Flit ( )
The Update_Time_For_First_Data_Flit function updates the switch time for the
first data flit.
void Update Required Times For Remaining Flit ( )
This function updates the times for the remaining flits in the packet. This is
necessary since, for example, the first flit in a packet may be routed and the
second flit may be switched during the same clock cycle.
void Update Time For Remaining Header Flit ( )
The Update_Time_For_Remaining_Header_Flit function updates the time for the
remaining header flit(s) in multicast messages.
void Update_Time_ForJRemaining_Data_Flit ( )
The Update_Time_For_Remaining_Data_Flit function adjusts the switch time for
the remaining data flit(s).
void Move_Flit_At_Top_Of_Q ( )
This function moves the flit at the top of the queue to a neighboring node, the
absorbed queue of the current node, or is absorbed at the current node if this
node is its destination node.
void Move_Flit ( )
The Move_Flit function performs a dequeue from the current node setting the
stage for an enqueue at another queue or the absorption of the flit.
30


3.2 Visual Implementation
The visual implementation is written in ANSI-C using X library, Version 11 API on the
Linux platform [13,14]. The code has been ported to Sun-Solaris and HP-UX without
modification. Static data structures are used. Higher-level toolkits were avoided to
maintain the highest degree of integration with the base implementation and to provide
for the greatest degree of portability.
3.2.1 Key Data Structures
Most of the visual implementation is performed dynamically. When an event occurs in
the base software, which results in a network state change, a visual event is initiated.
This functionality can usually be implemented dynamically. However, the need to redraw
the screen during window resizing or an overlay necessitates storing some information.
Two structures are used to track the necessary data. These structures are depicted in
figure 3.7 along with structure listings in figure 3.8 and figure 3.9, respectively.
32


3.2.1.1 Global Node Location structure: struct node_loc_type
The node_loc_type structure (figure 3.8) contains the following variables:
ball_pos[ ] stores the ball center for a node.
num_pos[ ] stores ball number location.
pipe[ ] [ ] stores the position for drawing the x, y, z pipe representing a physical
channel.
torus [ ] [ ] holds the wrap-around physical channels.
ball_color [][][] stores the ball color.
pipe_color [ ] contains the color of the pipe, which changes during network
simulation.
torus_color [ ] likewise contains the color of the torus channel.
Figure 3.8 Global Node Location structure.
typedef struct node_loc_tag {
intball_pos[X2_POS];
int num_pos [X2_POS];
int pipe [X3_POS] (X4JPOS];
int torus [X3_POS] [X6JPOS];
unsigned long baU_color[X8_POS][X8_POS][X8_POS];
unsigned long pipe_color[X3_POS];
unsigned long torus_color[X3_POS];
} nodejbcjype;
L....' v/, '1^: ^------------5-----5 -----;----------------;
34


Figure 3.9 GUI Control structure
i
i
\
i
typedef struct gui_control_tag {
intnode_queuepCMAX_QUEUES][MAX_Q_SIZEJ;
int color_idjTOTAL_NODES] [XMAX-QUEUES] [MAX_Q_SIZE];
int node_multi_queue|XMAX-QUEUES][MAX_Q_SIZE];
int color_rnulti_id (TOTAL_NODES]
[ XMAX-Q UEUES] [MAX_Q_SIZE];
intnode_num;
int display;
int display_position;
int switched_flag;
int switched_flag_delay;
int routed_queue[XMAX_DISPLAY_NODES][ XMAX-QUEUES];
int switched_queue [XMAXJD ISPLAY_NODES][ XMAX-QUEUES];
int routed_multi_queuefXMAX_DISPLAY_NODES]
[XMAX-QUEUES] [SCREEN_Q_SIZE]
pCMAX_FLIT_COLOR_DISP];
int switched_multi_queue[XMAXJDISPLA Y_N ODES]
[XMAX-QUEUES] [SCREEN_Q_SIZE]
pCMAX_FLIT_COLOR_DISP|;
intdrag_node;
intcast_type[210];
int src_qJoc[SRC_CHAN_LOC];
} gui_controLtype;
36


Figure 3.11 Visual Event.
Figure 3.12 Global Channel Color Change.
38


A listing of the visual implementation functions follows:
void get_position_[2,3]d ( )
The get_position_[2,3]d calculates the global node positions and physical channel
locations. Input values determine these positions and consist of the k value (node
number) and n value (dimension). Example input values: k = 4, n = 3 would result
in the calculation of node and physical channel positions for a 16-node, 3-
dimensional k-ary n-cube torus.
void Xinit_gui ( )
This function initializes the screen drawing structures,
void draw_[nodes, nodes_cube] ( )
This function draws the nodes and physical channels based on the get_position_[2,
3]d function.
void Xredraw virtual chan ( )
This function draws the virtual channel section of the visual display, which include
the virtual queues, the source queues and the absorbed queues.
void Xredraw_rsc ( )
This function updates the route R and switch S letters that appear in the unicast
queues when a flit is either routed or switched.
void Xmulti_rsc ( )
This function updates the route R and switch S letters that appear in the
multicast queues when a flit is either routed or switched.
void XPrintGenlnfo ( )
This function displays the general statistic information, such as throughput. This
information if updated for each visual event.
40


This function draws the global physical channels for the x, y, and z tori for
3dimensional k-ary n-cubes.
void Xrotate_virtual_pos ( )
This function rotates the displayed virtual channel nodes. For example, if nodes 0-3
were displayed, the next click of the rotate button would display nodes 4-7. Modular
arithmetic is used on the boundaries.
void Xcheck_pos_virtual ( )
This function produces the drag and drop functionality,
void XCreatelnfoWin ( )
This function displays the pop-up packet information window.
42


Xvirtual_chan( )
update_values_for_last_data_flit ( )
Xset_baIl_color( )
node_loc[temp_q_elt->g_msg_ptr->source_node].ball_color[0] [0] [6] = 0;
First_Time_In_Not_SINK_ASSIGN ( )
if MULTICAST
Xvirtual_chan( )
else UNICAST
Xset_ball_color( )
node_loc[nbor_node] .ball_color[dim_to] [dir_to] [chan_to] =
temp_q_elt->g_msg_ptr->message_color_id;
Xdraw_ball( )
Xdraw_channel( )
Xvirtual_chan( )
NOT_First_Time_In_Not_SINK_ASSIGN ( )
if MULTICAST
Xvirtual_chan( )
else UNICAST
Xset_ball_color( )
node_loc[nbor_node] .ball_color[dim_to] [dir_to] [chan_to] =
temp_q_elt->g_msg_ptr->message_color_id;
Xdraw_ball( )
Xdraw_channel( )
Xvirtual_chan( )
44


Xdraw_channel ( )
This function is a call to the Xlib that redraws the ball structure in the global
network display, but is only called when a physical channel is occupied or released.
If running in graphics mode with cycles set to 1, then the call is made each clock
cycle; however, if cycles is set to 1000, this is preformed once each 1000 clock
cycles. Therefore, realistic simulations may be performed in graphics mode with
adequate performance. This function works in conjunction with the Xdraw_ball
function.
46


Figure 4.1 a-d Unidirectional rings and their channel dependency graphs.
(a)
(c)
d3 o'
:o di
(d)
dl2
dlO
a

t)d11
d03 o dOt O
d02 U
d!2
o
48


Figure 4.3 Dimension Order Routing Simulation.
Figure 4.4 Dimension Order Routing Simulation.
50


Figure 4.5 Duato routing adaptive channels.
clock 16 .
. UMCH3T
jteOCtXJ
H^ftcrt iv. thr s Irul *-ted*g, J t ^ Q.go
0.QM*
reeT_ms_**< i acs '
Sirqet^ttirupu^
#ff*ot.,v-t e.eo
MULTICAST
fcvv.^tHr'Vpurt
w4u*o_L Uryt.ttiruput
UNICAST $otr*4T '
4 r fTirnii'mimxiii
MULTICAST So.ro* Queue -
= rprrm rrrrrrrrrm--
r-te.i1,- it - -
fl#S0(fB6t> Message Queue
iilLi iTiip :mxrm
UNICAST Source Rueue - 1
11 :rn nTJT.TOxn.ixt
MULTILIST Seuroe Qum
ABSORBED Message Queue
r Liaiij nrtim.ixnj
Figure 4.6 Duato routing deterministic channel.
uHicfta?
!H ft i ivt^thruput
tpiwlarMKJ_o. ..
nuutcMki
*i mul^twdu },. lt
r 4^nu^ t-jwrigw fc^ttruput
4rf^vtivt;1hru(Mt
UMICAST ftoMTor
n mmrrnrn n n
MULTICAST Socro* Ouue ''
t mn i TTFrrTTrrrrt
Nrwiter-6 -
ABSORBED Me-ssege Queue'
o.ixuLxo i nmrrm
unicast Source Queue
LXLixmirirrmrrra
.MULTICAST Seuree Queue
f&ItjA'ljjr.
ABSORBED Message Queue
xixuxi.m'.u m i: :i
UNICAST Source Queue
ciimniEniiiii i
MULTICAST. Source Queue
Hocfe' I .
RBSOKSCD Message Queue
mxrixn raiui m

52


Figure 4.7 Hybrid routing using low traffic.
Figure 4.8 Hybrid routing with heavy traffic.
54


based approach is used that splits a multicast message into at most k multicast
submessages, where k equals the number of columns in the network. All destinations
located in the same column are put into the same submessage. Each submessage is
routed using Duatos algorithm.
In figure 4.10 a message with destinations 4, 6, 8, and 10 and source node 3 produces
two multicast submessages containing destinations 6, 10 & 4, 8, respectively.
The first sub-message traverses from 3 15 14 10 6
The second from 3^2-^l^13^12-^8^4.
Figure 4.10 Hardware Column Path Routing in 2D.
56


theoretically it can be argued that deadlock can occur for the N+l case, where N is the
storage capacity of a given processor memory.
58


5. Simulation Case Studies
An explanation is presented followed by a series of screen shots from the actual VFSim.
5.1 Duato
This case study demonstrates Duatos routing of unicast messages.
message length = 10
total virtual channels = 4
Figure 5.1 shows network traffic with queue size = 2 and figure 5.2 with queue size = 8.
At clock 29, the larger queue size enables more message transfer in the network, depicted
my more transfer traffic. At clock 74 (figure 5.3 and 5.4), not all of the source queues are
empty in the smaller queue size simulation, but are empty for the larger queue size. The
green message at node 1 is still waiting to leave the source queue in figure 5.3. The last
two figures (5.5 and 5.6) show the ending statistics with the pop-up information
windows, summarized below.
q-size = 2
clock cycle just before message absorption = 134
leave source queue clock cycle = 75
waiting time = 97
q-size = 8
clock cycle just before message absorption = 104
leave source queue clock cycle = 42
waiting time = 71
61


Figure 5.3 Duatos Clock 74, Queue Size = 2.
j I-
j # \
] r*+}:_mjN.jdcadlck
! thrutxrt
\ III, II I -"~ -'l -
14$$-'. : rWLTICflST r

H_4f*etiv*_tW"Optrt ££*
1 *Ul tadUMfe. 1 *t OuBMfS

r**l , W8 %** thl-M&U* .8W
ef rrt io*_thr.jpu* i£**
Figure 5.4 Duatos Clock 74, Queue Size = 8.
63


5.2 Hardware Column-Path
The following case study is based on one injected message with a source node of 0 and
destinations nodes of 5, 6, 9, and 10. The message length is 5 with 4 header flits and 1
data flit. Using Hardware Column-Path routing the message is split into two
submessages: one message headed for destinations 5 and 9, and one message headed for
destinations 6 and 10.
In figure 5.7, at clock cycle 1, the first submessage at node 0 is routed. Notice that the
message length in the source queue is 6. This represents 2 submessages of length 3 each.
A submessage consists of two header flits and one data flit since there are two
destinations and one data flit per submessage. In figure 5.8, at clock cycle 6, the second
submessage at node 0 is routed.
At clock cycle 11 the first submessage is at node 14 and the second submessage is at
node 2 and 3 (figure 5.9). At clock cycle 13, figure 5.10 shows the first submessage
traveling up the column containing destinations 6 and 10 and the second submessage at
node 2.
At clock cycle 19, figure 5.11 shows the first submessage at node 6 with destination node
10 already absorbed and the second submessage at node 13. Figure 5.12 (clock cycle 27)
shows that the first submessage has been completely absorbed (destinations 6 and 10)
and the second submessage is at node 5 with one header flit and one data flit.
Destination node 9 from the second submessage has been absorbed.
65


Figure 5.9 Hardware Column-Path Clock 11.
clock /&8$gtr
: UHIUHST _
- r* 040$
UtMUtvd^iui^Ut
aM '3r,l0OO-
^ JL.ytfoWi
- ^ -- <
y mui i '-
*#*) _nun^ Figure 5.10 Hardware Column-Path Clock 13.
:: UMIOWT - -
Q+ UrVOO
taafr>
.
*_*<**<*
tf f Oi ttif-upv fc
- nuLticpe? ,
5*r-*^-**yt4pt rt '
* i ui t*d. wart),. I art
&>kjn*Qu.v- C-
£ *1 % **
: -T" ;
-UKXCffr8T touro* OtWtMk
1-i.tn .i,mm:f:m:i n
HULTICWT' Se*-*
Dim
f&icfc* -HS. -_ _____
Msbfticv rmuK> StMi
-.rmxnxrxmxirrn
av
67


5.3 HTA
In figure 5.13, the simulation is at clock cycle 54. The yellow message (ID = 19), which
is a multicast message with destinations of 6, 7, and 8, has been routed to the absorbed
queue in both node 7 and node 8. With HTA routing the message has been replicated
and consists of three messages, destination 6 has been absorbed. Notice that the entire
message has been completely absorbed ( 1 header flit and 5 data flits for a total length of
6).
Figure 5.14 shows the simulation at clock cycle 64. The yellow absorbed message at
node 7 has been absorbed, but the yellow message at node 8 remains in the absorbed
message queue awaiting to be absorbed by destination node 8.
Figure 5.15 demonstrates a simulation of several unicast and multicast messages in a 3-
ary-i-cube. Even complex simulations may be visualized at any point in time.
Figure 5.13 HTA Clock 54, Absorbed Message Queue.
69


6. Software Engineering Model
Any software engineering project usually begins with a set of requirements. Software
requirements may consist of discussions, a list of target customers, specific goals, and/or
desired system functionality. Once the requirements have been defined, a software
model is chosen to meet these specifications [7]. The chosen model should be flexible
enough to sufficiently achieve the desired goals and allow for future enhancements. For
the VFSim, a separate visual application written in C using the X Window System was
used to present a visual extension API to the base implementation. This proved to be a
good choice with adequate functionality to support the requirements and complexity of
the VFSim.
6.1 Requirements
Produce a visual tutorial for interconnection network concepts.
Produce a visual aid for the design of new architectures.
Two major and early requirements of the VFSim were to provide a visual presentation as
a tutorial for interconnection network concepts and as a visual aid in designing new
architectures.
Provide an aid for the validation of existing and future designs of the base
simulator.
A third goal was to verify the functionality of the base simulator using a visual
demonstration. Without a visual extension, printouts of each step or a close examination
of the final results were available. Reading printouts of state changes in simulator
variables was difficult and time consuming. Relying on the final values of a simulation
required the assumption that the simulator was valid.
Increase the code maintainability and function verification of the existing
simulator.
Increased code maintainability and function verification was desirable. This is a major
challenge with any software system of appreciable size. The base simulator was no
exception. It had been written by several people over several years and consisted of
approximately 20,000 lines of C code. An obvious goal was to add features to the
existing software in terms of new or enhanced algorithms, switching techniques, and
architectures. It was clear that any change to the existing base simulator software would
71


6.3 Testing
Simulations were run comparing base state-change printouts to visual state changes.
Since the VFSim was an extension of the base simulator via the VFSim API, the VFSim
functionality could be verified using the base state-change printouts. Once the
correctness of the base simulator was achieved the base simulator could verify the
correctness of the VFSim. Conversely, once the correctness of the VFSim was
established the correctness of the base simulator could be verified using the VFSim.
This was only possible by using a visual implementation that was loosely coupled to the
base simulator code.
73


8. Future Work
The expansion possibilities for the base simulator seem endless. Adding new topologies,
switching schemes, and routing algorithms are obvious choices. The implementation of
bi-directional channels, using two or more physical channels, and the introduction of
new deadlock detection and recovery schemes come to mind.
Future work for the visual implementation should include expansion of user features.
Allowing the user to change simulator parameter settings dynamically without leaving the
application would be valuable. Enchanced visualization of flit movement may render
even greater insight into the mechanics of interconnection networks. A trace
mechanism for individual messages may be a beneficial and easily implemented
enhancement.
The visual implementation was written strictly in C; therefore, an obvious challenge
would be to rewrite the code in C++ without affecting performance. Another
interesting concept would be to implement the visual extension in another language,
such as Java.
Clearly, the possibilities for future work are boundless.
75


[14] A. Nye: Xlib Reference Manual: OReilly & Associates, Inc., 1993.
[15] D. K. Panda: Fast Barrier Synchronization in Wormhole >6-ary /?-cube Networks
with ultidestination Worms. IEEE, Los Alamitos California, 1995.
[16] F. Petrini, J. Duato, P. Lopez et al: LIFE: a Limited Injection, Fully adaptive,
Recovery-Based Routing Algorithm. IEEE, Los Alamitos California, 1997.
[17] D. Robinson, P. McKinley, and B. Cheng. Path-Based Multicast Communication in
Wormhole-Routed Unidirectional Torus Networks. J. Parallel Distributed Computing 45,
104-121, 1997.
[18] W. Stallings: Data and Computer Communications, 5th ed. Prentice Hall PTR,
Upper Saddle River, New Jersey 07458, 1997.
[19] S. Wamakulasuriya and T. M. Pinkston: Characterization of Deadlocks in
Interconnection Networks. IEEE, Los Alamitos California, 1997.
77