Citation
Construction and use of suffix trees

Material Information

Title:
Construction and use of suffix trees
Creator:
Edwards, Jerry
Publication Date:
Language:
English
Physical Description:
v, 106 leaves : ; 28 cm

Subjects

Subjects / Keywords:
Trees (Graph theory) ( lcsh )
Text processing (Computer science) ( lcsh )
Matching theory ( lcsh )
Natural language processing (Computer science) ( lcsh )
Computer algorithms ( lcsh )
Computer algorithms ( fast )
Matching theory ( fast )
Natural language processing (Computer science) ( fast )
Text processing (Computer science) ( fast )
Trees (Graph theory) ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 103-106).
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Jerry Edwards.

Record Information

Source Institution:
|University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
785737479 ( OCLC )
ocn785737479
Classification:
LD1193.E52 2011X E33 ( lcc )

Full Text
h
CONSTRUCTION AND USE OF SUFFIX TREES
by
Jerry Edwards
B.S.E., University of Colorado Denver, 2006
A thesis submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements of the degree of
Masters of Science
Computer Science
2011


This thesis for the Masters of Science
degree by
Jerry Edwards
has been approved
by
Tom Altman
Gita Alaghband
u/g / ft
Date


Edwards Jerry (M.S., Computer Science)
Construction and Use of Suffix Trees
Thesis directed by Professor Tom Altman
ABSTRACT
A suffix tree is a structure that allows us to search for patterns in a string
in linear time while holding the data in linear space. The discovery of this
system, based on work of Knuth, Pratt and Morris, by Peter Weiner was
declared to be the algorithm of the year in 1973. Through iterations of
discoveries and refinement we have developed a system that solves most
string evaluation problems in linear time and space. Our purpose is
discovering how to construct the fundamental building block used by such
string discovery systems as Google and the human genome project.
This abstract accurately represents the content of the candidates thesis. I
recommend its publication.
Signed
Tom Altman


DEDICATION
To my wife Laura, without your support I could not have survived the
hours or the corrections to my grammar. To my son Seth, you have waited
to go fishing for too long.


ACKNOWLEDGEMENT
I would like to thank my advisor, Dr. Tom Altman, without whom this work
would not have been more than the ramblings of a very tired student. I
would also like to give a warm thank you to the faculty and staff at CU
Denver, for all they contributed to get me to this point in my academic
career.


TABLE OF CONTENTS
TABLE OF FIGURES.....................................................iv
1. Introduction.......................................................1
1.1 Naive algorithm.................................................1
1.2 Rabin-Karp algorithm...........................................2
1.3 Boyer-Moore algorithm..........................................4
1.4 Knuth-Morris-Pratt algorithm...................................6
1.5 Ukkonens algorithm............................................9
2. The Suffix Tree...................................................11
2.1 Standing on the shoulders of giants...........................11
2.2 Constructing the Suffix Tree..................................14
2.3 Parts of the Tree.............................................17
2.3.1 Node.......................................................17
2.3.2 Edge.......................................................18
2.3.3 Suffix.....................................................18
2.3.4 Suffix Link................................................18
i


2.3.5 Iteration
19
2.3.6 Position.....................................................19
3. Process of Tree Creation............................................21
3.1 Test............................................................21
3.2 Canonize........................................................22
3.3 Preparing the Root..............................................24
3.4 Update..........................................................25
3.4.1 Updates Use of first.......................................27
3.5 Testing the Node Edges..........................................27
3.6 We Have Our Structures Explained................................28
4. A Walk Through of the Construction..................................29
4.1 Introduction of m to the Empty Suffix Tree....................29
4.2 Introduction of i to the Suffix Tree of m...................32
4.3 Introduction of s to the Suffix Tree of mi..................34
4.4 Introduction of s to the Suffix Tree of mis.................36
4.5 Introduction of i to the Suffix Tree of miss................37
4.6 Introduction of s to the Suffix Tree of missi
41


4.7 Introduction of s to the Suffix Tree of missis
42
4.8 Introduction of i to the Suffix Tree of mississ............44
4.9 Introduction of p to the Suffix Tree of mississi...........45
4.10 Introduction of p to the Suffix Tree of mississip..........52
4.11 Introduction of i to the Suffix Tree of mississipp.........53
5. Applications........................................................56
5.1 Does a substring exist?.........................................56
5.2 Palindrome......................................................57
6. Conclusion..........................................................59
Appendix...............................................................61
A. Web Access.......................................................61
B. Application Code.................................................61
BIBLIOGRAPHY
103


TABLE OF FIGURES
FIGURE 2.1 TRIE......................................11
FIGURE 2.2 TREE......................................12
FIGURE 2.3 GOAL OF OUR TREE..........................15
FIGURE 2.4 SUFFIX TREE...............................16
FIGURE 2.5 NODE......................................17
FIGURE 2.6 EDGE......................................17
FIGURE 2.7 SUFFIX....................................18
FIGURE 2.8 SUFFIX LINK...............................19
FIGURE 2.9 ITERATION.................................19
FIGURE 2.10 POSITION.................................20
FIGURE 3.1 TEST FUNCTION.............................21
FIGURE 3.2 CANONIZE1.................................23
FIGURE 3.3 CANONIZE 2................................23
FIGURE 3.4 CANONIZE 3................................24
FIGURE 4.1 FIRST ITERATION...........................29
FIGURE 4.2 SECOND ITERATION..........................32
FIGURE 4.3 THIRD ITERATION...........................34
FIGURE 4.4 FOURTH ITERATION..........................36
IV


FIGURE 4.5 FIFTH ITERATION.............................37
FIGURE 4.6 SIXTH ITERATION.............................41
FIGURE 4.7 SEVENTH ITERATION...........................42
FIGURE 4.8 EIGHTH ITERATION............................44
FIGURE 4.9 NINTH ITERATION.............................45
FIGURE 4.10 TENTH ITERATION............................52
FIGURE 4.11 ELEVETH ITERATION..........................53
FIGURE 4.12 FINAL TREE WITH SUFFIX LINKS...............55
FIGURE 5.1 SUBSTRING SEARCH............................56
FIGURE 5.2 TREE OF IPPISSISSIM.........................58
FIGURE 5.3 TREE OF MISSISSIPPI.........................58
v


1. Introduction
String matching is an important subset of string algorithms. In our
everyday lives pattern matching has become an integral part of the way
we interact with the world around us. When we order a product from
Amazon, find the movie on Netflix or research a topic using Google we
can see how much pattern matching in string has become prevalent in our
everyday lives. Moving from these common examples we can see the use
of string pattern matching in bioinformatics, DNA contamination discovery
and the human genome project. The versatility of this field of study has
lead to several implementations of trying to find the optimal balance of run
time and space utilization. Before we begin the detailed examination of our
chosen method let us review a few of the algorithms that are used to solve
this problem.
1.1 NaYve algorithm
The base line algorithm for the discovery of a pattern within a string
consists of nested for loops (Unknown, 2008). As we iterate through the
string we are searching using the outer loop to compare each index of the
pattern held in our inner loop. A pseudo code example of this approach
1


will allow us to see the simplicity of the structure and the variable nature of
its run time.
For i = 0 to size of m
For j = 0 to size of n
If m[i+j] != n[j] try next i
Else if j = size of n pattern found
In the case of the naive algorithm, we have the extremes of the pattern
being either at the beginning or at the end of the string we are searching.
These situations give the perspective run times of O(n) and O(nm). From
these extremes we are able to derive that we have an average run time of
0(n+m).
1.2 Rabin-Karp algorithm
The Rabin-Karp algorithm was created by Michael O. Rabin and Richard
M. Karp in 1987 (Cormen, Leiserson, Rivest, & Stein, 2001). This system
uses hashing to compare sections of our string to the pattern. The pseudo
code for this system is as follows:
2


patternHash = hash(pattern)
stringSubHash = hash(string[0,size of substring])
for i to (size of string size of pattern)
If subHash = stringSubHash
if string[i,i+m] = pattern
return success
else
stringSubHash = hash(string[i+1, i+size of substring])
We can see the value of this method as large sections that may be
checked in a single iteration. We can also see how, for single a pattern,
this is not an improvement over the complexity of the naive approach.
Although we no longer iterate through our inner loop it has been replaced
by the rehashing of the sub portion of the main string equal to the size of
the pattern. This adds the same level of complexity that we would find
when dealing with a single pattern search in the naive approach. This
algorithm finds its purpose in the evaluation of multiple patterns of equal
length. As these patterns require only one hash, we are able to move
through our string checking for multiple patterns without the increased
complexity we would see using an inner loop.
3


1.3 Boyer-Moore algorithm
Developed by Bob Boyer and J Strother Moore in 1977, this algorithm
preprocesses the search pattern while leaving the target string unchanged
(Boyer & Moore, 1977). The approach of this system is to create a table of
our search pattern starting at the end to facilitate a shifting approach. For
each item, in reverse order, we evaluate the minimum shift needed for the
possibility of discovering our pattern from a point of failure. To illustrate
this concept we will examine the pattern for ANMAN.
!(N) This expression tells us that we cannot be at the end of our pattern
shifting us one index. The expression !(A)N cannot exist within our pattern
allowing us to shift the entire width of our pattern. !(M)AN This expression
holds that while we are not in the proper position to discover our entire
pattern the shift can allow for the string to terminate at the previous index.
!(N)MAN This expression puts us in the same condition as the previous
expression !(M)AN. Because the item M is unique to the third position,
any deviation from the pattern would require that the shift be of length
three for this particular pattern. !(A)NMAN provides the same condition we
illustrated in the previous two possibilities giving us the same shift
4


condition. This gives us the values for the first table of comparison as
follows:
!(N) = shift of 1
!(A)N = shift of 5
!(M)AN = shift of 3
!(N)MAN = shift of 3
!(A)NMAN = shift of 3
The second table used to evaluate the shift follows a much simpler model.
We pass through our pattern and register the distance from the end of
each item. The characteristic table for our example ANMAN would be as
follows:
A = shift of 1
M = shift of 2
N = shift of 3
all others = shift of 5
Upon a failed verification, these tables are used to determine the distance
of the next shift, taking the greater value found in each of the perspective
tables. The complexity of this system is divided into two parts; one set for
5


the preprocessing to construct our tables and the second for the search
for our target pattern. For the complexity of creating the two tables based
on our search pattern, we have 0(n + |XI) where n is the length of the
search pattern and O(m) for the search of the target string, where m is the
length of the target.
1.4 Knuth-Morris-Pratt algorithm
Donald Knuth, Vaughan Pratt and James H. Morris are three pioneers in
the field of string matching. Algorithms such as the Boyer-Moore and
Ukkonen find their origins in the early research produced by their
collaboration (Knuth & James H. Morris, 1977). What became to be known
as the KMP algorithm was conceived by Knuth and Pratt working
independently, but in parallel, of Morris; the results were published jointly
in 1977. This system follows the naive approach in concept; we reduce
the algorithm to one loop and hold an index value to the minimal back
tracking value in a preprocessed array. This can be best shown by the
pseudo code as follows:
Creating the preprocessed array
while pos is less than the length of pattern
6


if pattern[pos -1] = pattern[cnd],
end = end + 1
T[pos] = end
pos = pos + 1
else if end > 0
end = T[cnd]
Else
T[pos] = 0
pos = pos + 1
The preceding code gives us three conditions where our values will be set.
The first condition is the continuation of the pattern where the index is
stored and incremented. The second condition sets the candidate to the
value held in the index of our processed array if our current candidate is
greater than zero. The third condition is executed if the preceding two
conditions have not been met. This sets the value at the current index to
zero and increments our position.
7


Search algorithm
Let n = 0, m = 0
while (n+m) < (length of string searched)
if pattern[m] = S[n + m],
if i = (length of pattern)-1, return n;
else
m = m +1
else
n = n + m T[m],
if T[m] is greater than -1,
m = T[m]
else
m = 0
8


From the search algorithm we iterate through the target string comparing
the pattern to each index in turn just as we did in the naive approach.
Where the algorithms diverge is at the discovery of a non matching index.
This allows the use of our preprocessed array to back track no farther than
is necessary to find substrings that may repeat within the pattern being
tested. For the complexity of this system we, as in the Boyer-Moore
algorithm, have the combined complexity of the preprocessing and search
portions. Combining the complexity of the two algorithms leads us to the
result of 0(m+n).
1.5 Ukkonens algorithm
As the primary focus of this work, we will now look at what distinguishes
Ukkonens algorithm from the preceding systems (Ukkonen, Suffix Tree
and Suffix Array Techniques for Pattern Analysis in Strings, 2005). The
primary difference between this system and the systems previously
introduced is the complexity. Just as in the KMP and Boyer-Moore
algorithm, Ukkonens algorithm uses two stages, preprocess and search.
Rather than processing the pattern to be found, Ukkonens algorithm
processes the string to be searched. This approach allows the majority of
the work to be completed in the preprocessing stage. As will be
9


demonstrated, the algorithms search complexity is a function of the length
of our pattern rather than the size of the string being searched. While the
overall complexity is the same as the KMP algorithm, 0(m+n), the
persistence of the trees created allow search engines to perform the rapid
return of results while periodically updating content.
10


2. The Suffix Tree
2.1 Standing on the shoulders of giants
In 1973, Peter Weiner introduced the first iteration of the suffix tree based
on the early work of Knuth, Pratt and Morris. Weiners version was
referred to as a position tree and was not of the form that we use today
(Weiner, 1973). The end structure of Weiners process is in the form of a
trie.
Trie(abaab)
FIGURE 2.1 TRIE
11


While the search capabilities of this structure promoted linear time,
storage and construction in this manner requires quadratic space and
time. This restriction makes it impractical for sets of large strings, such as
the human genome. As with all subsequent variations, the navigation of
the resulting tree is a linear function. This makes the creation of the
structure the main focus for improvement as examined by McCreight
(McCreight, 1976). With his implementation, space is compressed to a
linear level by the use of suffixes.
Tree(abaab)
FIGURE 2.2 TREE
While this was a huge step forward, the algorithm has the significant
disadvantage of requiring the string to be built in reverse order. This made
12


the algorithm more difficult to adapt to implementations, such as data
compression. Through further study and optimization Esko Ukkonen
produced a system for the creation of this structure in a practical
methodology (Ukkonen, Suffix Tree and Suffix Array Techniques for
Pattern Analysis in Strings, 2005). The algorithm he put forward inserted
strings in a left to right format giving the system the online attribute which
had been lacking in previous systems. Exploiting a number of algorithmic
techniques, Ukkonen reduced the time of a naive implementation to O(n)
time for constant-size alphabets. This advancement allowed for the use of
this system in such applications as the human genome project, LZW
compression, and the Google search engine. The versatility of Ukkonens
method has lead to the publication of many papers on space compression
and run time optimization. The purpose of this paper is to add an
explanation of the fundamental algorithms used to accomplish the
construction of this structure. We may ask ourselves, with
implementations in place and libraries available, why we cannot use these
tools to accomplish our objectives. The answer to this question is we can.
For many applications simply using a preexisting tool would help us to
accomplish our goals. These tools are designed to be used to generate
suffix trees with a specific input. Even with a general implementation we
13


cannot be assured that we are able to modify our data into a form that the
tools will accept. With the fundamental understanding of the building
blocks used to create our tree, we are able to re-task an application
allowing us to answers to the questions we have not yet thought to ask.
We can evaluate the methods to find new expansions of the algorithms
tasking or usage of particular functions for entirely new systems. A greater
knowledge of the underlying algorithm will better prepare us to identify and
correct issues we may encounter in the libraries utilized to implement this
process. To begin this process we must first introduce the components
from which our suffix tree is constructed.
2.2 Constructing the Suffix Tree
The best place to begin to grasp this concept is to start at the end. The
final product of this process is, of course, the tree that it created. We will
use the string Mississippi to illustrate this process. We are using this
particular string because, in its limited length, each of the processes used
in the construction of suffix trees is demonstrated. To provide a context for
this examination, let us look at our goal.
14


D
FIGURE 2.3 GOAL OF OUR TREE
This is the representation of the tree without suffix compression. The
nature of the tree is shown in the path of each edge. If we look at the suffix
starting with the first character, we notice that there are no branches. We
can identify that once we enter the tree from the edge beginning with the
characteristic m, there is no deviation from the pattern that can end at a
positive outcome. Each item must be followed by the next. In contrast the
next character i branches immediately. In the pattern available, entering
from this position can lead to either a p or an s. As we navigate through
the branches our options diminish. An i that leads to an s must always
be followed by si. Once at this point the pattern diverges again. What this
15


demonstrates is a straight line from the first item found through the
shortest line to the farthest common point. The structure shown allows us
to visualize the end result of the construction and how it will be used. It
also shows the repetition that could lead to expediential space
requirements. This leads us to the use of suffix compression introduced by
McCreight. The following figure illustrates the implementation.
FIGURE 2.4 SUFFIX TREE
We can see from this representation that each edge consists of an
ordered pair. Each pair relates to the entry and exit index relating to the
original string. While we can see that the space burden is greatly reduced,
the original string is still needed for comparison. This is not taken into
consideration when determining the space requirement of the tree,
16


allowing for a constraint of O(n) reduced from 0(n2). This aspect has not
changed between the implementations put forward by McCreight and
Ukkonen. Now that we have a clear picture of our goal, we can apply the
methodology used to discover the pattern. Each item in our string will pass
through three primary processes to evaluate four principal objects.
2.3 Parts of the Tree
For the construction of the suffix tree, there are four primary objects that
allow us to utilize the information and provide the structure of our tree.
2.3.1 Node
A holding point of transitions or container of edges.
FIGURE 2.5 NODE
17


2.3.2 Edge
Provides a connection between nodes holding the relevant suffix between
transitions.
Edge
FIGURE 2.6 EDGE
2.3.3 Suffix
A pair of index values representing a substring.
Suffix
FIGURE 2.7 SUFFIX
2.3.4 Suffix Link
A connection between nodes providing a relevant path through the
branches for continued testing of suffixes.
18


FIGURE 2.8 SUFFIX LINK
2.3.5 Iteration
Index in the array that we are currently processing, references the
characteristic of the item we are testing.
BBBBBBBBBB
Iteration
FIGURE 2.9 ITERATION
2.3.6 Position
Holds the index of the characteristic so that we can say no modification is
needed in a lower index.
19


10

BBBBBBBBB
FIGURE 2.10 POSITION
Position
20


3. Process of Tree Creation
These are the steps that interact with the information stored in the parts of
our tree.
3.1 Test
The test process checks for a split condition. By comparing the
characteristic of the previous iteration for an equal condition, we are able
to determine if a new suffix is needed to replace the existing edge. From
this point we add an edge to the node following the split suffix. We
compare the characteristics of the item received to the item at the index
equal to one past the left limit and the difference between the iteration and
the position.
Iteration
of Interest ^ ^^Characteristic
BDBBEIBBB QBE
mmm s
Position 3

^ sslsslpp^l,
"r == characteristic^ + 3 3 +1) or jrT = *s" is false so we spilt
2.t2 + 3-ai ^(2 + 3-3+l,~) ~ .-5*
si&sippi
-------M
FIGURE 3.1 TEST FUNCTION
21


3.2 Canonize
The canonize function modifies the position to facilitate splitting of suffixes.
In this function we adjust the position variable to an index where the next
split may be discovered. This function also produces the edge holding the
pointer to the node that may become our new suffix link. Canonize takes
in the node of interest, the position and the current iteration. For the trivial
case, our position is less than our iteration, so we return an edge with its
previous node pointing to the node of interest and the position we are
examining. For the nontrivial case we first pull the edge that corresponds
to the position we are currently interested in. While the length of the suffix
is equal to the difference between the current position and the iteration,
we move forward to the next node and find the edge that corresponds to
the item one index beyond the current edge. This process identifies the
node we will work with and the position where we may identify our next
split condition.
22


^ Iteration
^ Position
FIGURE 3.2 CANONIZE1
The length of the substring is compared to the length of the edge (10-10)
<= (7-4). We move our position; position = (position + rightLimit leftLimit
+ 1) or position = (4 + 10-10 + 1). As our position value is still less than
our iteration, we pull the next edge.
*
Iteration
FIGURE 3.3 CANONIZE 2
23


The length of the substring is compared to the length of the edge (2-2) <=
(7-5); therefore, we increment the position.
*

Iteration
Position
FIGURE 3.4 CANONIZE 3
The length of the substring is compared to the length of the edge (-3) <=
(7-6) giving us a false condition. We leave our loop returning the new
position and the node we found at our last edge. What we have now is the
condition that may lead us to a suffix split. We have now moved through
our string to a point where the edge returned must be tested for our
current iteration.
3.3 Preparing the Root
One of the foundations of the Ukkonen approach is the suffix link. This is a
pointer that provides linear navigation through the tree to the next node
that would need to be tested for a split condition. By convention, the suffix
24


link of the root would point to itself. As will be shown this is does not meet
our needs for creating our tree with the system. We will need to initialize
the suffix link for the root to a node holding edges for each item in the
alphabet of the string. As we iterate, each item is added to the edge list of
the node leaving us with a reference to the last instance of every unique
element. This node is then added as the suffix link to the root node and
the root is added as the suffix link to this start node. This provides for the
condition when checking the suffix link as it is passed to the canonize
function for the root node. As there may not be an edge available for our
position in the root node when we first enter an iteration of update, we can
use the start node to provide an edge that insures the increment of the
position variable by one.
3.4 Update
After we have prepared the root, we now have a node with no edges
representing the empty string. We take the first item in the string and pass
it through the system.
The first process we will look at is update. We provide the function update
with the node we are currently interested in, the position and the current
location in our string. Once we have entered the process, we create a
25


pointer to the root node and then test (described in Section 3.5) the edges
of the node for the addition of the item or a split in the tree. We then take
the edge we received from the previous test. If this edge is not the first
instance for this chain, we add the item to the node that our edge leads to.
If the node set to the previous node is not the root node, we set the suffix
link of that node to the previous node of the edge returned. The previous
pointer is then set to the node following the current edge. At this point we
canonize (described in Section 3.6), set the current node to the previous
node of the edge returned, update the position to the value found through
the canonize process, and re-test (described in Section 3.5). Once we
have an edge defined as the first we set the previous nodes suffix link to
the current node unless we find ourselves at the root of our tree. Now,
regardless of path, we create an edge where the previous node points to
our current node and we store the current position. Update is the driver
function of each iteration. The pattern of movement is to Test, and based
on the first variable being set to false, we loop through our function
adding an edge, setting our suffix link, canonizing and retesting. When our
first variable meets the condition of true, we move out of our loop, set our
suffix link and move on to the next iteration.
26


3.4.1 Updates Use of first
The first variable holds a Boolean value indicating the state of the edge.
Holding a value of true indicates that the edge needs no further
adjustment for the current position and iteration. A false value indicates
that the edge requires further testing. The first variable is set in test
under three conditions. If the position is to the right of the iteration and a
split is not required, first is set to true. After a split, first is set to false
and the change to the suffix requires retesting. If the position has passed
our iteration, then first is dependent on whether the node already holds a
suffix with the characteristic passed to it from the update function.
3.5 Testing the Node Edges
This process is given the node to be tested, the position, the previous
iteration index and the character we are interested in. For the trivial case,
the iteration is less than the current position. We check for the existence of
an edge holding the character of interest and set the first value to this
condition.
In the nontrivial case, we must first find the edge corresponding to the
current position. If we have a match, we return an edge marked as the
first. If these do not match, we need to split the edge. Our first item is set
27


with the start point as taken from the returned edge and the end point is
set by moving to the right the length of the difference between the iteration
and the current position. For our second item, the start point is set one to
the right of the end position of our first string. The right limit is set to right
of the original edge. Our first item is then added to the current node
replacing the original edge. We create a new edge by adding the second
item to the node following the edge previously added. The result returns
an edge holding a first value of false and a pointer to the node holding
the second item.
3.6 We Have Our Structures Explained
Now that we have a description of each step of the process, we have a
clearer understanding of the progression needed to build our tree. We will
now provide an example by demonstrating the iterations to create the
suffix tree for the string mississippi.
28


4. A Walk Through of the Construction
This chapter is meant to be used with the code base found in Appendix A.
We start with a root node representing the empty string with a link to a
node that contains an edge representing the last instance of each unique
item. This node holds the representation of the alphabet contained in our
string.
4.1 Introduction of m to the Empty Suffix Tree
z>
FIGURE 4.1 FIRST
Now that we have our base set, we can introduce our first item to the
process. We first enter the update function (described in Section 3.4)
passing in the root node the position of interest, currently 0, and our
iteration, also currently 0. Once inside this function we set a placeholder to
the root node and then move to test (described in Section 3.5) our current
29


set of variables. We pass in the root node, our current position, currently
set to 0, one less than the iteration, in this case -1, and the item value we
are working with for our first run m. This gives us a function signature of
test (root node, 0, -1 ,m). This condition gives us the base condition of the
position being greater than the iteration we are interested in. In this
condition, we return an edge with the first value set to false as we do not
have an edge added to the node we are working with and the previous
node pointer is set to the root. This edge is then passed back to update
where we enter our while loop creating a new item. An item is constructed
with the signature characteristic, in this case m, the iteration, to denote
the beginning of the suffix, and a right limit beyond the scope of our string.
This item is then added to the node set to next, in this case the root node.
The previous root is set to the pointer of the edges next node, in this case
the root and we move to canonize (described in Section 3.6). This gives
us the function signature of canonize (root node, 0, -1). Again, we have
the trivial case where the iteration is less than the position. With the node
of interest and position unchanged, we return an edge with the same
signature that was passed in. We now have set the position and the node
we are working with to the same values that we found initially. Next we set
the value of current node to the previous node pointer of the edge we
30


returned and our position is unchanged. We reenter test with the signature
(suffix link of the root, 0, -1 This will return an edge with the property
of first set to true as the node attached to the suffix link of the root has
seen this items characteristic. Once back in the update function, we return
an edge holding our position and the node of interest, in this case a
position of 0, and the node we created for the suffix link of the root. We
then run our current state through canonize with the signature (roots suffix
link, 0, 0). In this pass we have the situations where the iteration and
positions are equal. From the node of interest we pull the edge
corresponding to the signature of the item in the current position, in this
instance m. The right and left values exists. With all variables set to 0 we
enter our while loop and increment our position. The current node is set to
the next node pointer of our edge, in this case the root, and with our
position now greater than the iteration, we break out of our loop returning
the root node; our position is now set to 1. What we have at the end of our
first iteration is our original structure with an added suffix and our position
now set to 1. The suffix we added has the left limit set to 0 and the right
set to 00 showing that the suffix goes to the end of the string. The value of
the right variable is irrelevant as long as it extends beyond the final index
31


of the string, symbolizing that our suffix extends from the left limit to the
end of our string.
4.2 Introduction of i to the Suffix Tree of m
In our second iteration, we again turn to update with the signature (root
node, 1,1). The first task, once we enter the update process, is to test our
current condition. We enter the procedure with the signature (root node, 1,
0, i). As position of interest is greater than the iteration of interest in this
condition, we go directly to return an edge. We check to see if the node
passed in has an edge with the signature i. As this signature has not
been attached to the root node the first value is set to false and the
previous and next pointer is set to the root. With our first value set to false,
i
J^9
FIGURE 4.2 SECOND ITERATION
32


we create a new item with the left limit set to our iteration and add this to
the node marked as next, in this case the root. Once we have added the
item, we canonize with the signature (starter node, 1, 0). With this
signature we return an edge with the previous node set to the root and the
position of interest remaining at 1. Then we test again with the signature
(root node, 1,0, i). Under this condition the position of interest is greater
than our iteration of interest, we then return an edge with first set to true.
This edge was added earlier in this iteration with the previous node set to
the root. This removes us from the loop in update and returns an edge
with the position unchanged, the previous node set to the root node, and
the suffix link set to suffix link of the root. Once back in the main for loop
we canonize with the signature (root node, 1,1). With the position and
iteration equal, we grab the edge that corresponds with the position, and
compair the length of the string represented by the suffix to the length of
the string represented by the difference between the iteration and the
position from the next node. In this case the lengths are equal so we move
the position to the right the length of the suffix plus one. As the returned
edge holds the suffix corresponding with a single index, the length is zero
giving us a position of two. This is returned in an edge with the previous
33


node set to the root and the current position. We pull this information to
set our next iteration and move on to our third index.
# m riour^E 4.3 THIRD ITERATION
At this point we can see that we have a very similar configuration to our
previous iteration. All suffix links are pointing to the starter node, the left
limit is set to the position of the item representing the beginning of the
suffix, and the right limit is set to a point beyond the end of the string
index. This next iteration will be analogous to the previous one which
gives us an opportunity to look at what the start node adds to our process.
We now enter the update process with the signature (root node, 2, 2) and
immediately move into test with the signature (root node, 2, 1, s). As the
position is greater than the iteration, we simply return an edge with the
first value set to whether or not the node has an edge with the item
signature. In this case we have not seen an s for this node before so
4.3 Introduction of s to the Suffix Tree of mi
m .
34


first is set to false and the root node is set to the edges previous and
next node values. We then move through the update function, adding the
item to the node marked as next in the edge we returned, in this case the
root. We enter canonize with the signature (start node, 2, 1). As our
position is greater than our iteration of interest, we return an edge with the
position unchanged and the previous node set to our start node. After
canonizing, we test with the signature, (start node, 2, 1, s) as our start
node has the characteristic of each item; this will return an edge with first
set to true and pointing to the start node. As we have removed the
condition for the loop in update, we break out and return an edge holding
the start node and our position. Once back in our main for loop we
canonize again with the signature (start node, 2, 2). As our position and
iteration are equal, we now examine the edge corresponding to the
characteristic of our position, in this case s. Now we will recall that our
start node holds the position of the last instance of our item. This means
that the left and right values are equal. If we had the edge from the root
that corresponded to the item characteristic we would have had a
difference from our first instance to a point beyond the end of the string.
What we need is to move the position by one to the next item. This is a
case unique to the root, and by holding this condition in a separate node
35


space, we can remove further checks for this unique condition of the root.
This allows us to increment the position and return an edge holding our
new position and a pointer to the root node.
4.4 Introduction of s to the Suffix Tree of mis
FIGURE 4.4 FORTH ITERATION
In this iteration we have a condition we have not had in the previous
iterations, which is a repeat of an item we have seen before. We enter
update with the signature (root node, 3, 3); this then passes to test with
the signature (root node, 3, 2, s). Under this condition we jump down
through the process and return an edge that shows we have seen this
characteristic before, so first is true and the previous node is set to the
root as it was the node passed in. Returning these values to our update
function to be passed to canonize with the signature (root node, 3, 3). We
36


pull the edge from the root node that corresponds with the position. This
gives us a difference longer than the string itself making it impossible to
enter our loop. This returns the position unchanged and the root node. So
let us review our result. We have not added to our tree, in fact we have not
changed anything at all; this includes the position of interest. This means
that our next iteration will bypass the position.
4.5 Introduction of i to the Suffix Tree of miss
FIGURE 4.5 FIFTH ITERATION
We now enter our update with the signature (root, 3, 4) passing to test the
signature (root node, 3, 3 i). With the position and iteration of interest
equal we grab the edge that corresponds with the position we are
currently interested with. As our position did not move in the previous
iteration, we are looking at the same edge as the previous iteration. We
37


now check the characteristic of the item in the string, one index from the
beginning of the suffix, plus the difference between the iteration and the
position. Let us look at what this is saying. If the item we are passing in
has the same characteristic, we simply return an edge where the first is
set to true and the current node is returned. But in this case we are
comparing an i with an s. In this condition we split the suffix. Let us look
closely at the condition that determined this split. We will recall from our
previous iteration that the position did not change, as the suffix that
references the characteristic of the item was longer than the difference
between the position and the iteration. This means that no matter the
condition of the substring we are interested in the item characteristic at the
beginning of the suffix will match the characteristic we passed to the
function. We pull the same edge that is referenced by this position in test
for this iteration. We take the index of the start of this suffix and move
down the string the difference between the iteration and the position. What
we have is a common path to the item we passed in and the item one
index to the right, as it relates to the suffix. As the characteristic of the two
items are not equal, we have established that the signature of the suffix up
to this point has two possible outcomes. This condition triggers the split.
We create two items; the first has the signature of the edge we pulled from
38


the node and set to the right limit of the index just before our noted
characteristic. The second item is given the characteristic of the item just
beyond our common suffix. The left limit is set to the index of this
characteristic and the right is set to the right limit of the original suffix. We
then attach the first item to the node replacing the edge we originally
pulled and add the second item to the node following this edge. We now
have a separation in our suffix. We then create an edge with first set to
false, as we can be assured that we do not have an edge with this
characteristic directly after a split, and the previous and next nodes set to
the node holding the second edge we created. We reenter update and add
the item at the index of the iteration to the edges next node. At the bottom
of the loop we again enter canonize with the signature (start node, 3, 3). In
this configuration we enter and increment the position by one. We then
return an edge with the previous pointer set to the root and our new
position. Once we are back in update we reenter test with the signature
(root node, 4, 3, i). As our position is greater than the iteration, we drop
down to return an edge with first set to true, as we have already added
this item and a pointer to the root. Here we find a new condition where the
root placeholder is no longer set to the root. In this condition we set the
active node of the previous root, which is the node we created on our last
39


split, creating our suffix link to the root. We then return an edge holding
the position and a pointer to the root. Returning to the main loop we again
canonize with the signature (root node, 4, 4). This passes us directly
through, returning an edge holding the position and the current node of
interest. Once in the main loop we set the position and the node we are
working with, in this case to the root, and we are prepared for our next
iteration.
Let us look over what we have seen in this iteration. We discovered the
split in our test function. We saw that we replaced the edge we were
splitting with an edge that stopped at the position we returned from
canonize. What canonize did for us was, up to the position of interest,
provide assurance of consistency of the suffix. When finding a different
pattern at the position point we split, as we now have two routes from the
same pattern. This is our first look at the primary process of this system.
40


4.6 Introduction of s to the Suffix Tree of missi
FIGURE 4.6 SIXTH ITERATION
We now enter update with the signature (root node, 4, 5). Again we jump
into test with the signature (root node, 4, 4, s). With the position and
iteration of interest equal, we retrieve the edge that corresponds to the
characteristic of the item in the position. This gives use the edge of the
root with the suffix (1, ). Now we ask the question: does the
characteristic we sent match the characteristic of the item at the index to
the right of the left limit shifted by the distance of the iteration minus the
position? Having met this condition, we return an edge with first set to
true and a pointer to the root. With first set to true, we skip down and
return a pointer to the root and our current position. Returning to the
41


update function, we move into canonize with the signature (root node, 4,
5). As we have an iteration higher than our position we grab the edge that
corresponds with our position in the string giving the edge (1,). As the
length of the suffix is greater than the difference between the iteration and
the position, we leave our function with no change to the position and
returning our root node. We are now set for our next iteration. So we can
see that this is brief compared to the previous iterations. This is the result
of the structure requiring no change.
4.7 Introduction of s to the Suffix Tree of missis
FIGURE 4.7 SEVENTH ITERATION
We begin by entering update with the signature (root node, 4, 6) moving to
test with the signature (root node, 4, 5, s). Our iteration is greater than
42


our position so we check our corresponding edge. We shift to the right of
the left limit of this edge by the difference between the iteration and the
position plus one. This puts us at index 6 giving us the same character
that we found at our iteration and returning an edge with first set to true.
This passes over the loop in update, returning an edge holding a pointer to
the root node and our position unchanged. This moves us, in the main for
loop, and then into canonize with the signature (root node, 4, 6). As our
iteration is beyond our position, we pull the edge from our root node that
corresponds with our position. In this case, the edge with the suffix (1,)
as the difference between the right and left positions of this suffix, is larger
than the difference between the position and the iteration. This moves us
through the function returning the root node and the position unchanged.
Again we find ourselves with the same conditions and the same tree as
we pass into the next iteration.
43


4.8 Introduction of i to the Suffix Tree of mississ
FIGURE 4.8 EIGHTH ITERATION
As we start this iteration, we enter update with the signature (root node, 4,
7) then move directly to test with the signature (root node, 4, 6, i). As our
position is less than our iteration, we check the edge that corresponds with
our position. We check the characteristic of the item, shifted right one past
the iteration less the position, against the characteristic we passed in. In
this case, we are looking at the item at index 4 as compared to index 7.
We can see that the characteristic at these two indices are equal so we
return an edge with first set to true and a pointer set to the root node.
With these parameters we fall through the update function returning our
position and a pointer to the root node. We again enter canonize from our
44


outer loop with the signature (root node, 4, 7). As our iteration is greater
than our position, we pull the edge that corresponds with the current
position, in this case the edge holding the suffix (1,). We now check the
comparative length between the suffix, the iteration and position. As the
suffix of the edge represents a longer segment represented by the
iteration and the position, we return an edge with the position unchanged.
We are now prepared for the next iteration.
4.9 Introduction of p to the Suffix Tree of mississi
FIGURE 4.9 NINTH ITERATION
For the last few cycles we have been in a stable state, where the only
change has been to widen the gap between the position of interest and
45


the current iteration. For this cycle, we are introducing an item with a new
characteristic at index 8, where we find p. We will see how the process
modifies the structure to provide the representation we require.
We enter our iteration and move into the update function with the
signature (root node, 4, 8) again jump to test with the signature (root node,
4, 7, p). With our position to the left of our iteration, we check the edge
corresponding with our position. The edge with the suffix (1, ) is returned
and we check the characteristic at the position one to the right of the left
limit added to the difference between the iteration and the position. In our
current condition we are looking at index 5 that has the characteristic s.
This moves us into a split where we create two suffixes. The first suffix
replaces the edge we pulled from our node. We give it the characteristic of
the index at the left limit and then set its bounds to (1,4).
The second suffixs left limit is set to the position one beyond the first
suffixs right limit and the right limit of this new suffix is set to the right limit
of the original edge tested. Once we have the two suffixes defined, we
replace the original edge we pulled from our node and add the second
suffix to the node following the first edge. After this we return an edge with
the same signature as our second edge; first is set to false and a pointer
46


is set to the node we added to our new edge. We now move into the while
loop found in the update function. With the first value set to false, we
enter the loop and create an item with the signature (p, 8, ). This suffix
is then added to the node we returned from our previous test. This node is
then used to set the previous root variable. We will see this again in our
next loop to set the suffix link. From here we enter canonize with the
signature (start node, 4, 7) and, as our iteration is greater than our
position, we pull the edge that corresponds with the position. As the node
we passed in was the start node, the difference between the left and right
limits will be zero. This lets us enter the while loop and increment the
position. We then set the node of interest to the node following the edge
that we pulled. We now check the position and the iteration and, as the
position has not yet caught up with our current iteration, pull the next edge
that corresponds with our new position from the node of interest. This
edge has the suffix (2, 2) setting the new left and right limits. We also take
the next node in this branch and set it to the node of interest. This puts us
back to the top of our loop and we compare the length of the suffix and the
length of segment represented by the position and iteration. We again
increment the position by one and set the current node to the node of
interest. The position is still to the right of the iteration so we grab the edge
47


that corresponds with the current position, in this case the edge with the
suffix (3, ). Now that our suffix exceeds the length of the portion
represented by the iteration and position, we break out of our loop. This
tells us that the suffix we are interested in contains the index where any
changes may occur. It also tells us the position we have is the minimum
left position that we can be assured of continuity. We now return an edge
holding a pointer to the node of interest and the position we have
calculated. Once back into the update function we set the current node to
the node we found in canonize and set the position to the point of
consistency. We now enter test with the signature (current node, 6, 7, p).
Our position is less than the iteration so we check for the condition for a
split. Pulling the edge that corresponds to the current position we retrieve
the suffix (3, ). We then compare the characteristic we passed in to the
item found to the right of the left limit by one farther than the difference
between the iteration and the position. This puts us at index 5 which does
not match so we split, as described earlier in this iteration. We now return
an edge with first set to false and a pointer to the node holding our new
edge. This loops us back through the update adding an edge with the
suffix (8, ) to the node last returned. We now set the suffix link of the
previous root to the node preceding the current edge and update the
48


previous node to the node following our edge. We now go back into
canonize passing in the suffix link of the current node giving us the
signature (root node, 6, 7). We find that our iteration is greater than
position so we check the edge that corresponds to the position. The edge
we retrieve has the suffix signature (2, 2) giving us a suffix length less
than the length of the substring represented by the position and iteration.
Therefore, we increment the position and set the current node to the node
following the edge we pulled. Next we test to make sure that our position
has not passed our iteration. With our position and iteration equal, we take
the edge related to our new position. We place a holder to the node
following this edge and check the length of the suffix (4, ). This removes
us from the loop and returns an edge, holding the position and the node
that we pulled our last edge from. We now enter test with the signature
(current node, 7, 7, p). With our position equal to our iteration, we pull
the edge corresponding to the position giving us the suffix (4, ). We
check the characteristic at the index one position to the right of the left
limit against the characteristic we passed into the function. As these
characteristics do not match, we trigger a split. The suffixes (4, 4) and (5,
) are created. The first edge is used to replace the suffix (4, ) and the
second is added to the node following this edge. Now an edge is returned
49


with the first set to false and a reference to the node holding our new
edge. We then re-enter our loop, adding the item at the current
characteristic to the edge list of the node we returned. As this node was
set to the previous node the suffix link is set to the node containing the
suffixes (8, ) and (5, ). The node following this same edge is set to the
new previous root. We move into canonize with the signature (root node,
7, 7). With the iteration and the position equal, we pull the edge that
corresponds with our current position. We now have the edge with the
suffix (1,4) that has a length greater than the substring represented by the
position and the iteration. This moves us through the function returning an
edge holding our position and a pointer to the node we passed in. Once
back in the update function, we move into test with the signature (root
node, 7, 7, p). As our position and iteration are equal, we pull the edge
that corresponds with our position. This edge has the signature (1, 4) and
we shift to the right one position and check it against the characteristic we
passed in. As they do not match we are set to split the edge. We have two
edges (1,1) and (2, 4) to replace the original and add to the following
node. We now return an edge with first set to false and a pointer to the
node with the new edge. Once we return to update, we go back through
the loop adding the item found at this iteration. We then add the previous
50


node that we found from our last test as the suffix link. We then set the
previous root holder to the node following the edge. We now move into
canonize with the signature (start node, 7, 7). We pull the edge with the
suffix (10, 10) and enter our while loop. This moves the position to the
right and sets the current node to the next node of the edge we pulled.
With our position now greater than our iteration, and we leave the function
with an edge holding the current node and our new position. In the update
function we now pass back to test with the signature (root node, 8, 7, p).
As our position is greater than our iteration, we pass through test returning
an edge holding a pointer to the root and first set to false, as the root
node does not possess this item. We now re-enter the update function
moving to the top of our loop. With the edge returned holding false in the
first variable, we add the current item to the node held in the next pointer
of the edge, in this case the root. The node that was used in our previous
loop has the active node pointer set to the previous node of the edge we
returned, in this case the root node, establishing the suffix link. We now
set the previous root holder to the root and enter canonize with the
signature (start node, 8, 7). As our position is greater than our iteration, we
fall though the function returning an edge with the previous node set to the
start node and the position unchanged. Once back in the update function,
51


we set the current node to the node returned and the position is updated.
We enter test with the signature (root node, 8, 7, p) and with our position
greater than the iteration we fall through, returning an edge with the first
variable set to true. This moves us out of our loop, returning an edge with
a pointer set to the start node and our position set to 8. Once back in the
main loop, we enter canonize with the signature (start node, 8, 8). As our
position is equal to our iteration we pull the edge that corresponds with our
position and compare its length to the length of the substring represented
by our iteration and position. As both represent a single point, we
increment our position and return an edge holding our new position and a
pointer to the start node. We store these values and are prepared for our
next iteration.
4.10 Introduction of p to the Suffix Tree of mississip
FIGURE 4.10 TENTH ITERATION
52


We begin our iteration by entering update with the signature (root node, 9,
9) passing us into test with the signature (root node, 9, 8, p). As our
position is greater than the iteration of interest, we drop through the
function returning an edge with the first value set to true, we have an
edge with the p characteristic attached to this node, and the node we
passed in. This returns us to the update function skipping over our loop,
returning to the outer loop with an edge holding our values unchanged.
Entering into canonize with the signature (root node, 9, 9) we pull the edge
corresponding to our position. As this edge is longer than the suffix
represented by the position and iteration, we return our values unchanged.
4.11 Introduction of i to the Suffix Tree of mississipp
8
ZJ Z>
,2
h.
ZJ
Zf
*/ \A
Zj Zj
FIGURE 4.11 ELEVETH ITERATION
53


We enter the update of our last iteration with the signature (root node, 9,
10) passing into test the signature (root node, 9, 9, i). As our position and
iteration of interest are equal, we pull the edge that corresponds with our
position. We move down this suffix to the right, one index past the length
of the substring represented by the position and the iteration, and check
the characteristic of the item at that index and the item we passed in. As
they do not match we split our edge connecting the two that now have (8,
8) and (9, ). We now return an edge with first set to false and a pointer
to the node used to connect the two edges. We now add the item at our
iteration index to the node we returned. We set the previous root pointer to
the node holding our item and move into canonize with the signature (start
node, 9, 9). We pull the edge corresponding to our position and compare
the lengths of our suffix and substring. As the lengths are equal, we
increment the position and set the current node to the node following the
edge we pulled. Now that our position has passed our iteration of interest,
we break out of our loop, returning an edge holding the current node and
our new position. This moves us to test with the signature (start node, 10,
9, i). As our position is beyond our iteration, we pass through our
function returning an edge with the first set to true and a pointer to the
root node. This breaks us out of the loop in update and sets the suffix link
54


of the node held in the previous root pointer to the node held in current
node. We now return an edge holding the current node and our current
position. We again enter canonize with the signature (root node, 10, 10).
As our position and iteration are equal, we pull the edge corresponding to
our position and check the length of the suffix as compared to our
substring. As they as equal, we increment the position and set the current
node to the next node of the edge pulled. We now break out of the loop,
returning an edge with our new position and node of interest. This
completes our tree.
FIGURE 4.12 FINAL TREE WITH SUFFIX LINKS
55


5. Applications
Now that we have our tree and a system to generate further trees, we can
examine options to show the utility of this structure.
5.1 Does a substring exist?
This application is the base that subsequent applications build upon. With
this structure we can find if a substring exists in linear time based on the
string we are looking for, not the string we are searching. Continuing on
with our example using mississippi, if we wanted to find if the substring
sip occurs we simply find the edge corresponding to the characteristic of
our first item and follow the links to a result.
FIGURE 5.2 SUBSTRING SEARCH
56


From this we can get not only the existence but also a location in linear
time based on our pattern. In contrast, if we are interested in the pattern
sipt we find a negative result in the exact same configuration. In our
example this may seem trivial but if we look at a string of significant
length, for example the human genome at 3,080 Mb with millions of base
pairs, we can begin to understand significance of our process.
5.2 Palindrome
As stated in the previous section, locating palindromes is a subset of
finding a substring. There are applications of this in molecular biology, to
find restriction enzymes, and again in the human genome project, to find
motifs that have the potential to repair themselves. The process to find
these is slightly more complicated but in essence follows the same
principle. Continuing with our example mississippi, we take the tree
created and produce its complement by reversing the string and
evaluating it through the same process. This gives us two trees we then
overlay with each other to find where they match giving us the result we
are looking for.
57


By traversing the trees as compared to each other, we can find our targets
delivering the result. For palindromes of size three or greater we find ssip
from index 5 to 8, ississi from index 1 to 7, and sissi from index 3 to 7.
58


6. Conclusion
In our introduction we examined several popular algorithms for pattern
recognition in strings. From this evaluation we were able to conclude that
the primary advantage to Ukkonens algorithm is that the majority of the
complexity is held in the preprocessing stage. While both parts of the
algorithm have a complexity of O(n), the search stage n is based on the
smaller pattern string rather than the text being searched. We now have
the tools needed to expand and modify our process to address specific
applications. The next stage of development is producing a generalized
suffix tree (Ukkonen, On-line Construction of Suffix Trees, 1995) where
multiple strings are denoted by a terminal node giving reference to the
words contained in the overall text. From the generated suffix tree we
move to multiple trees linked together to produce Suffix tree clusters.
This methodology is used to produce search engines such as Google and
Husky Search. In these applications general suffix trees are created and
related to each other with the use of rating systems producing weighted
graphs. In these examples, the composition of strings is of interest and the
target is web site identification. This structure allows levels of commonality
to be evaluated and weighted between different texts. We can also see
59


the possibility for moving past the restriction of strings. As described in
chapter 3, our method of comparison is based on position and
characteristic. By modifying our definition of these parameters we can take
an ordered sequence of related objects and, in the same manner, apply
our method to discover our sub-sequences. This leads us to applications
such as the jigsaw puzzle problem (Altman, 1989) where the location of
strings terminating in the same path leads to the alignment of physical
edges. This will supply us with path information on, for example, chemical
composition or manufacturing processes to find points of commonality
between sequences. It is conservable to find optimal use of our ordered
process by adapting the use of the longest common ancestor that is easily
identified applying this method. Although the re-tasking of this process
could involve changes to the primary structure, we now have an
understanding of the fundamental process and are equipped with the tools
to explore innovative applications and accomplish new goals.
60


Appendix
A. Web Access
The source code and a pdf version of this work has been made available
at the following URL. The source code has been packaged in a visual
studio 2010 solution and has been tested on the windows 7 operating
system.
http://jerryredwards.net
B. Application Code
#include "Tree.h"
include "FindP.h"
#include
#include
using namespace std;
string stringFromFile(string fileLocation)
{
string thefile, tmp;
FILE *fp;
long len;
char *buf;
61


const char* cstr = fileLocation.c_str();
fp=fopen(cstr,"rb");
fseek(fp,0,SEEK_END); //go to end
len=ftell(fp); //get position at end (length)
fseek(fp,0,SEEK_SET); //go to beg.
buf=(char *)malloc(len); //malloc buffer
fread(buf,len,1 ,fp); //read into buffer
fclose(fp);
string toReturn = buf;
return toReturn.substr(OJen);
}
string reversestringFromFile(string textln)
{
string toReverse = stringFromFile(textln);
reverse(toReverse.begin(),toReverse.end());
return toReverse;
}
int main ()
{
int x = 0;
62


string textln;
string compairString;
Tree* suffixTree = new Tree();
Tree* IcaTree = new Tree();
Tree* palidromeTree = new Tree();
FindP* forFind = new FindP();
cout "Please enter the path to the file to construct your tree"
endl;
getline(cin, textln);
suffixTree->add(stringFromFile(textln));
fflush(stdin);
while(x != 1)
{
cout endl "Enter 1 to Exit" endl;
cout "Enter 2 to enter a new string to cunstruct the tree"
endl;
cout "Enter 3 Longest Common Substring" endl;
cout "Enter 4 to show the suffix tree" endl;
cout "Enter 5 to check for a sub string" endl;
cout "Enter 6 to find palindromes" endl;
63


cin x;
fflush(stdin);
switch(x)
{
case 2:
cout "Please enter the string to construct
your tree" endl;
suffixTree = new Tree();
suffixT ree->add(textln);
break;
case 3:
IcaTree = new Tree();
cout "Please enter compairison the search
string" endl;
cin compairString;
IcaT ree->add(compairString);
forFind->lca(suffixTree,IcaTree);
break;
case 4:
suffixT ree->show();
break;
case 5:
64


cout "Please enter your substring" endl
cin textln;
suffixTree->find(textln);
break;
case 6:
palidromeTree = new Tree();
palidromeTree-
>add(reversestringFromFile(textln));
forFind->lca(suffixT ree.palidromeT ree);
break;
}
fflush(stdin);
}
return 0;
};
#ifndef COUNTING JDENTIFIER
#define COUNTING IDENTIFIER
65


class Counting
{
public:
static Counting *instance()
int getEdgeCountQ;
int getNodeCount();
protected:
Counting();
static int nodeCount;
static int edgecount;
private:
static Counting *inst;

#endif
#ifndef FINDJDENTIFIER
#define FIND IDENTIFIER
#include


#include
using namespace std;
class Find
{
private:
public:
Find();
int endPosition;
string toFind;
string originalString;
bool found;
int currentPosition;
};
#endif
#ifndef FINDPJDENTIFIER
#define FINDP IDENTIFIER
#include
#include
#include "Tree.h"
67


using namespace std;
class FindP
{
private:
Tree* a;
Tree* b;
vector pars;
void lca(Edge a,Edge b, int);
void lca(Node* a, Node* b, int);
void lcaEndPoint(int endA.int endB.Edge a, Edge b, int count);
void lcajagEdge(int endA.int endB,Edge a, Edge b, int count);
string charToString(char toChange);
public:
FindP();
void lca(Tree* a,Tree* b);
};
#endif
#ifndef ITEMJDENTIFIER
#define ITEM IDENTIFIER
#include
68


#include
#include "Find.h"
using namespace std;
class Item
{
private:
string item;
public:
ltem();
int positionLeft;
int positionRight;
void add(string,int,int)
string getValue();
void show(string);
void find(Find*);
#endif
#ifndef NODEJDENTIFIER
#define NODEJDENTIFIER
#include
#include


#include
#include
#include
#include "Item.h"
include "Find.h"
using namespace std;
class Edge;
class Node
{
private:
public:
map edgeList;
Node();
bool isRoot;
void add(ltem);
void add(bool, Item);
void add(Edge edge);
void show(int);
bool hasEdge(string);
bool hasEdgePrefix(ltem);
void setRoot();
70


//Node *previousNode;
Node *activeNode;
void show(int,string);
void find(Find*);
string charToString(char);
int nodeNumber;
class Edge
{
private:
public:
Edge();
Edge(Node*);
Edge(bool,Node*);
Node *previousNode;
Node *nextNode;
Item item;
bool first;
int canonHolder;
void add(bool,ltem);
int edgeNumber;
71


};
#endif
#ifndef TREEJDENTIFIER
#define TREEJDENTIFIER
#include
#include
#include
#include "Node.h"
include "Find.h"
using namespace std;
#pragma once
class Tree
{
private:
Node* start;
Edge update(Node*,int,int);
Edge canonize(Node*,int,int);
Edge test(Node*,int,int,string);
string charToString(char);
public:
Tree();
72


Node* rootNode;
string item;
void add(ltem item);
void add(string item);
void show();
void find(string);
void Tree::find(Tree* toCompair);
int nodeCount;
};
#endif
#include "Counting.h"
int Counting::nodeCount = 0;
int Counting::edgecount = 0;
Counting *Counting::inst = 0;
Counting::Counting()
{
}
int Counting::getEdgeCount()
{
return edgecount++;
}
73


int Counting::getNodeCount()
{
return nodeCount++;
}
Counting *Counting::instance()
{
if (linst)
{
inst = new Counting();
}
return inst;
}
#include "Node.h"
#include "Counting.h"
Edge::Edge()
{
previousNode = new Node();
nextNode = new Node();
edgeNumber = Counting::instance()->getEdgeCount();
74


}
Edge::Edge(Node *node)
{
previousNode = node;
nextNode = new Node();
nextNode->activeNode = node->activeNode;
edgeNumber = Counting::instance()->getEdgeCount();
}
Edge::Edge(bool f, Node* node)
{
previousNode = node;
first = f;
nextNode = node;
edgeNumber = Counting::instance()->getEdgeCount();
}
void Edge::add(bool isLeafln, Item item)
{
this->item = item;
}
#include "Find.h"
75


Find::Find()
{
this->currentPosition = 0;
this->found = true;
}
#include "FindP.h"
#include
FindP::FindP()
{
}
void FindP::lca(Tree* a,Tree* b)
{
this->a = a;
this->b = b;
int count = 0;
lca(a->rootNode, b->rootNode, count);
for (long index=0; index<(long)pars.size(); ++index) {
coutthis->a->item.substr((pars.at(index).positionRight -
pars.at(index).positionLeft + 1), pars.at(index).positionLeft) Position:"
76


pars.at(index).positionRight pars.at(index).positionLeft + 1 to
pars.at(index).positionRight endl;
}
pars.clearQ;
}
void FindP::lca(Node* a, Node* b, int count)
{
int startTree = 0;
int endTree = 0;
int startCompair = 0;
int endCompair = 0;
for each (pair p in b->edgeList)
{
if(a->isRoot)
{
string testing =
}
Edge testing = a->edgeList[p.first];// check of empty
77


}
if(testing.item.positionLeft > 0)
{
lca(testing,p.second,count);
}
void FindP::lca(Edge a, Edge b, int count)
{
int endA = 0;
int endB = 0;
endA = a.item.positionLeft;
endB= b.item.positionLeft;
int rightA = a.item.positionRight;
int rightB = b.item.positionRight;
if(a.item.positionRight > this->a->item.size())
{
rightA = this->a->item.size() -1;
}
if(b.item.positionRight > this->b->item.size())
78


{
rightB = this->b->item.size() -1;
}
int alength = rightA a.item.positionLeft + 1;
int bLength = rightB b.item.positionLeft + 1;
string suba = this->a->item.substr(a.item.positionLeft,alength);
string subb = this->b->item.substr(b.item.positionLeft,bLength);
if(alength == bLength)
{
if(this->a->item.substr(a.item.positionLeft,alength) == this-
>b->item.substr(b. item. positionLeft, bLength))
{
lca(a.nextNode, b.nextNode, (count + alength));
}
else
{
lcaEndPoint(endA,endB,a, b,count);
}
79


else
{
if(alength > bLength)
{
suba = this->a-
>item.substr(a. item. positionLeft, bLength);
subb = this->b->item.substr(b.item.positionLeft,bLength);
if(this->a->item.substr(a.item.positionLeft,bLength) =
this->b->item.substr(b. item. positionLeft, bLength))
{
count = count + bLength;
b = b.nextNode->edgeList[charToString(this-
>a->item[a.item.positionLeft + bLength])];
if(b.item.positionLeft > 0)
{
lcajagEdge(a.item.positionLeft+bLength.b.item.positionLeft,a, b,
count);
}
else
{
lcaEndPoint(endA,endB,a, b, count);
}
}
80


else
{
lcaEndPoint(endA,endB,a, b,count);
}
}
else
{
if(this->a->item.substr(a.item.positionLeft,alength) =
this->b->item.substr(b.item.positionLeft,alength))
{
count = count + alength;
a = a.nextNode->edgel_ist[charToString(this-
>b->item[b.item.positionLeft + alength])];
if(a.item.positionLeft > 0)
{
lcajagEdge(a.item.positionLeft,b.item.positionLeft+alength,a, b,
count);
}
else
{
lcaEndPoint(endA,endB,a, b, count);
}
}
81


else
{
lcaEndPoint(endA,endB,a, b,count);
}
}
}
}
void FindP::lcajagEdge(int startA.int startB.Edge a, Edge b, int count)
{
int endA = a.item.positionLeft;
int endB = b.item.positionLeft;
int rightA = a.item.positionRight;
int rightB = b.item.positionRight;
if(a.item.positionRight > this->a->item.size())
{
rightA = this->a->item.size() 1;
}
if(b.item.positionRight > this->b->item.size())
{
82


rightB = this->b->item.size() -1;
}
int alength = rightA startA + 1;
int bLength = rightB startB + 1;
string subA = this->a->item.substr(startA,alength);
string subB = this->b->item.substr(startB,bLength);
if(alength == bLength)
{
if(this->a->item.substr(startA,alength) == this->b-
>item.substr(startB,bLength))
{
return lca(a.nextNode, b.nextNode, (count + alength +
1));
}
else
{
lcaEndPoint(startA,startB,a, b,count);
}
}
else
83


{
if(alength > bLength)
{
if(this->a->item.substr(startA,bLength) == this->b-
>item.substr(b. item. positionLeft, bLength))
{
count = count + bLength;
b = b.nextNode->edgeList[charToString(this-
>a->item[a.item.positionLeft + bLength])];
if(b.item.positionLeft > 0)
{
lcajagEdge(a.item.positionLeft+bLength.b.item.positionLeft,a, b,
count);
}
else
{
lcaEndPoint(startB,startB,a, b, count);
}
}
else
{
lcaEndPoint(startA,startB,a, b,count);
}
84


}
else
{
if(this->a->item.substr(startA,alength) == this->b-
>item.substr(startB,alength))
{
count = count + alength;
a = a.nextNode->edgeList[charToString(this-
>b->item[b.item.positionLeft + alength])];
if(a.item.positionLeft > 0)
{
lcajagEdge(a.item.positionl_eft,b.item.positionLeft+alength.a, b,
count);
}
else
{
lcaEndPoint(startA,startB,a, b, count);
}
}
else
{
lcaEndPoint(startA,startB,a, b,count);
}
85


}
}
}
void FindP::lcaEndPoint(int endAjnt endB.Edge a, Edge b, int count)
{
while(endA < a.item.positionRight && endA < this->a->item.size()
&& endB < this->b->item.size())
{
endA++;
endB++;
count++;
char charA = this->a->item[endA];
char charB = this->b->item[endB];
if(this->a->item[endA] != this->b->item[endB])
{
endA--;
break;
}
}
86


if(count > 2)
{
Item pair;
pair.positionLeft = count;
pair.positionRight = endA;
pars.push_back(pair);
string FindP::charToString(char toChange)
{
stringstream ss;
string s;
ss toChange;
ss s;
return s;
}
#include "Item.h"
ltem::ltem()
{
}
87


void ltem::add(string p, int left, int right)
{
item = p;
positionLeft = left;
positionRight = right;
}
string ltem::getValue()
{
return item;
}
void ltem::show(string p)
{
/*if(positionRight > p.size())
{
cout p[positionLeft];
}
else
{*/
for(int i = positionLeft; (i <= positionRight && i < p.size())
i++)
{
88


cout p[i];
}
cout Position Left:" positionLeft Position Right:"
positionRight endl;
//}
}
void ltem::find(Find* p)
{
for(int i = positionLeft; ((i <= positionRight) && (p->currentPosition <
p->toFind.size()) && (p->found)); i++)
{
char original = p->originalString[i];
char finding = p->toFind[p->currentPosition];
if(original != finding)
{
p->found = false;
}
p->currentPosition++;
}
}
#include "Node.h"
#include "Counting.h"
89


Node::Node()
{
isRoot = false;
nodeNumber = Counting::instance()->getNodeCount();
}
void Node::add(ltem item)
{
Edge edge(this);
edge.item = item;
edgeList[item.getValue()] = edge;
}
void Node::add(bool isLeaf, Item item)
{
Edge edge(this);
edge.item = item;
edgeList[item.getValue()] = edge;
}
90