Text association mining with cross-sentence inference, structure-based document model and multi-relational text mining

Material Information

Thaicharoen, Supphachai
Publication Date:
Physical Description:
xii, 131 leaves ; 28 cm


Subjects / Keywords:
Text processing (Computer science) ( lcsh )
Data mining ( lcsh )
Data mining ( fast )
Text processing (Computer science) ( fast )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


Includes bibliographical references (leaves 121-131).
Statement of Responsibility:
by Supphachai Thaicharoen.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
437259351 ( OCLC )
LD1193.E52 2009d T42 ( lcc )

Full Text
Supphachai Thaicharoen
B.E. King Mongkut's Institute of Technology North Bangkok, Thailand, 1994
M.S. Colorado State University, 1999
M.C.S. Colorado State University, 2004
A dissertation submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Computer Science and Information Systems

This dissertation for the Doctor of Philosophy
degree by
Supphachai Thaicharoen
has been approved
Date
Boris Stilman

Thaicharoen, Supphachai (Ph.D., Computer Science and Information Systems)
Text Association Mining with Cross-Sentence Inference, Structure-Based Doc-
ument Model and Multi-Relational Text Mining
Dissertation directed by Professor Tom Altman and Professor Krzysztof J. Cios
With the exponential growth of published documents, text mining becomes
a vital tool for the automated extraction of information and the discovery of
hidden information/knowledge. We begin this dissertation with an overview of text
mining covering key definitions, pre-processing, feature selection, text represen-
tation and types of text mining. Then, we describe a fundamental text min-
ing approach that we used for the development of a chromosome-21 database.
Next, we present our three novel text mining techniques: (i) text association
mining with cross-sentence inference, (ii) structure-based document model, and
(iii) multi-relational text mining. Our techniques emphasize novel hypothe-
sis generation, document representation and multi-relational discovery, respec-
tively. In the text association mining with cross-sentence inference, statistical
co-occurrences of terms and syntactic sentence structure analysis are initially
used to find associations among key terms in documents. Subsequently, poten-
tial novel hypotheses are derived from the discovered associations. In a different

way, the structure-based document model introduces two novel document repre-
sentations for text documents that take into account not only term frequencies
and patterns of term occurrences, but also the documents' structural informa-
tion. Based on the experimental results, our structure-based document models
are superior to existing non-structure-based ones. Finally, the multi-relational
text mining enhances a literature-based discovery method with multi-relational
data mining and Inductive Logic Programming. It aims to discover relational
knowledge in the form of frequent relational patterns and relational association
rules from disjoint sets of literatures. These relational patterns and rules are
complementary to the indirect connections found by existing literature-based
discovery, and can be used for exploratory research.
This abstract accurately represents the content of the candidate's dissertation.
I recommend its publication.

I would like to dedicate this dissertation to everyone in my family, my grand-
mother, father, mother and two sisters, who love me unconditionally, and who
support and encourage me to achieve my dream. I owe my life to them.

I would like to acknowledge my gratitude to Dr. Tom Altman, my advisor,
for his supervision and many invaluable comments and suggestions on research.
I also would like to acknowledge my appreciation to Dr. Krzysztof J. Cios,
my co-advisor, for his advice, guidance and support on education and research
directions.
I would like to gratefully acknowledge Dr. Katheleen Gardiner for her assis-
tance and thoughtfulness while working together on the chromosome-21 database.
In addition, I would like to gratefully thank Dr. Michael Mannino and Dr.
Boris Stilman, my committee members, for their kindness toward my work.
All of my friends deserve special thanks as well. No words can be used to
describe how great they are. Thank you very much.
Finally, I would like to thank God and all the divine powers for giving me
strength, courage and positive thought.

Figures .................................................................. xi
Tables................................................................... xii
1. Research Motivation and Outline of the Dissertation................... 1
1.1 Motivation............................................................ 1
1.2 Outline of the dissertation........................................... 2
2. Overview of Text Mining.............................................. 4
2.1 Introduction to text mining........................................... 4
2.2 Pre-processing........................................................ 7
2.3 Feature selection..................................................... 8
2.4 Text representation.................................................. 11
2.5 Types of text mining................................................. 11
2.5.1 Text classification ............................................... 11
2.5.2 Text clustering.................................................... 14
2.5.3 Knowledge extraction .............................................. 15
2.5.4 Association discovery.............................................. 20
2.5.5 Hypothesis generation.............................................. 21
2.5.6 Knowledge base construction and model building..................... 24
2.6 An application of text mining to support the development of the
chromosome-21 database............................................... 24

2.6.1 The ortholog identification of chromosome-21 genes................. 25
Overview................................................................. 25
Role of text mining on ortholog identification........................... 29
2.6.2 Protein interaction extraction from HPRD database.................. 31
Overview................................................................. 31
Role of text mining on protein interaction extraction.................... 31
2.6.3 Conclusions.......................................................... 32
3. Text Association Mining with Cross-Sentence Inference.............. 33
3.1 Introduction.......................................................... 33
3.1.1 Related work...................................................... 35
3.2 Technical background................................................ 36
3.2.1 Association rule mining.................................... 36
3.2.2 TF-IDF weighting scheme.......................................... 36
3.3 Data set.............................................................. 38
3.3.1 Pre-processing .................................................... 38
3.4 The TAMCSI methods.................................................... 40
3.4.1 Extended TF-IDF.................................................. 40
3.4.2 Scientifically-reported association search......................... 42
3.4.3 Statistical association rule generation............................ 43
3.4.4 Association rule to sentence conversion............................ 43
3.4.5 Cross-sentence inference........................................... 43
3.5 Experiments and results............................................. 45
3.5.1 Scientifically-reported sentences.................................. 45
3.5.2 Scientifically-reported association................................ 46

3.5.3 Word selection................................................... 46
3.5.4 Statistical association ......................................... 49
3.5.5 Cross-sentence inference......................................... 52
3.6 Conclusions........................................................ 55
4. Structure-Based Document Model: Textual Document Representation
Based on Frequency, Pattern and Document Structure ................ 57
4.1 Introduction....................................................... 57
4.2 Technical background............................................... 61
4.2.1 Term signal...................................................... 61
4.2.2 Weighting schemes................................................ 63
4.2.3 Discrete wavelet transforms...................................... 64
4.2.4 Support vector machines.......................................... 67
4.2.5 Information gain................................................. 68
4.3 Methods and data .................................................. 69
4.3.1 Methods.......................................................... 69
4.3.2 Data and pre-processing.......................................... 72
WebKB 4-Universities................................................... 72
TREC Genomics 2005 .................................................... 72
Reuters-21578 ......................................................... 73
20-Newsgroups.......................................................... 74
4.4 Experiments and results......................................... 75
4.4.1 WebKB 4-Universities............................................. 76
4.4.2 TREC Genomics 2005 .............................................. 77
4.4.3 Reuters-21578 ................................................... 80

4.4.4 20-Newsgroups.................................................... 80
4.5 Discussion....................................................... 82
4.6 Conclusions and future work...................................... 89
5. Multi-Relational Text Mining: Discovering Relational Knowledge from
Two Disjoint Sets of Literatures Using Inductive Logic Programming 91
5.1 Introduction..................................................... 91
5.2 Technical background............................................. 94
5.2.1 WARMR............................................................ 94
5.2.2 Swanson's ABC discovery models................................... 96
5.3 Data and methods................................................. 97
5.3.1 Data sets........................................................ 97
5.3.2 Methods.......................................................... 98
Pre-processing......................................................... 98
Co-sentence in first-order logic...................................... 100
Relational database in first-order logic.............................. 102
ACE data mining system................................................ 105
5.4 Experiments and results....................................... 105
5.4.1 Raynaud's disease and fish oils ................................ 108
5.4.2 Down syndrome and cell polarity................................. 109
5.5 Conclusions..................................................... 112
6. Summary and Future Work ........................................... 114
References............................................................ 121

2.1 Vector Space Model.................................................. 12
2.2 The identification process of chromosome-21 protein orthologs. ... 27
2.3 An example of a protein sequence in FASTA format............... 30
2.4 An example of the sequence alignments generated from BLASTP. . 31
3.1 The overall process diagram of the TAMCSI........................... 41
4.1 Document d with 8 partitions.................................. 62
4.2 Term signal t in document d..................................... 62
5.1 A relational database in first-order logic built from two complemen-
tary but disjoint sets of literatures 104

3.1 Scientifically-reported association................................. 47
3.2 Words selected by FxlTF IDF......................................... 48
3.3 Weighting scheme comparison......................................... 50
3.4 AprioriT results.................................................... 51
4.1 WebKB 4-Universities: Micro- and macro-averaged results............. 78
4.2 TREC Genomics 2005: Micro- and macro-averaged results............. 79
4.3 Reuters-21578: Micro- and macro-averaged results.................... 81
4.4 20-Newsgroups: Micro- and macro-averaged results.................... 82
5.1 List of the top 15 most common terms............................... 107
5.2 List of 25 user-selected verbs..................................... 108

1. Research Motivation and Outline of the Dissertation
1.1 Motivation
Text mining is an important tool for efficiently and effectively managing
and processing a large amount of textual information and data. Its fundamen-
tal contribution is to facilitate the extraction of desired data/information from
textual documents. In building a chromosome-21 database [62], several types
of data needed to be collected from many resources, and information needed to
be extracted from outputs of a number of analysis tools. To cope with these
tasks, we employed a text mining methodology by writing customized computer
programs to automatically extract the desired information in an efficient and
effective manner. At a more sophisticated level, text mining
can be used to discover hidden or potentially-novel knowledge or patterns from
textual documents. The undiscovered knowledge could be in the form of novel as-
sociations between entities. As a result, in this dissertation, we developed an as-
sociation discovery method termed "text association mining with cross-sentence
inference (TAMCSI)". The TAMCSI technique is based on an association rule
mining algorithm, pattern matching and user-defined association keywords. It
searches for associations between entities using two levels of co-occurrence of
those entities: co-occurrence at the abstract level and co-occurrence at the sen-
tence level. Potentially-novel associations are identified by crossing two
different discovered associations. In addition to the discovery of hidden knowledge,
text mining can be used for automatic document categorization. Most existing

document classification methods are based on a short paragraph, such as the
abstract text in scientific articles. However, several studies have shown that
full text contains higher information coverage than abstract text. Hence, with
the increased availability of full-text articles, we developed two novel document
representation models, called "Structure-Based Document Model with Discrete
Wavelet Transforms (SDMDWT)" [88] and "Structure-Based Document Model
with Significant Term Weight (SDMSTW)", that exploit the document structure
of full-text articles for representing text documents. With these models, the
structure of documents is first analyzed and captured. Then, features or
terms are selected, and finally the document models are constructed. In this
dissertation, our models were evaluated on WWW pages, scientific full-text ar-
ticles, news, and newsgroup documents. Finally, text mining can be used to
explore a new research domain and to find additional information for an existing
research study. Here, we developed a novel multi-relational text mining tech-
nique that is a combination of text mining, multi-relational data mining, and
Inductive Logic Programming. It is an extension of an existing literature-based
discovery method to extract relational knowledge in the form of frequent relational
patterns and relational association rules. With this method, two sets of litera-
tures and their common terms are first transformed into a PROLOG-based rela-
tional database. Then, an Inductive Logic Programming algorithm, WARMR, is
utilized to discover frequent relational patterns and relational association rules
from the database.
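The two-level co-occurrence search and the crossing step above can be sketched in miniature. The sentences, the term list, and both helper functions below are illustrative assumptions for exposition only, not the actual TAMCSI implementation:

```python
from itertools import combinations

# Toy data: substring matching stands in for the real entity detection.
sentences = [
    "fish oil reduces blood viscosity",
    "blood viscosity is elevated in raynaud disease",
]
terms = ["fish oil", "blood viscosity", "raynaud disease"]

def sentence_cooccurrences(sents, term_list):
    """Collect term pairs that co-occur within a single sentence."""
    pairs = set()
    for sent in sents:
        present = [t for t in term_list if t in sent]
        for a, b in combinations(sorted(present), 2):
            pairs.add((a, b))
    return pairs

def cross_inference(pairs):
    """Chain two discovered associations that share a common term:
    if (a, b) and (b, c) are known but (a, c) is not, propose (a, c)
    as a candidate potentially-novel association."""
    candidates = set()
    for p1 in pairs:
        for p2 in pairs:
            if p1 == p2:
                continue
            shared = set(p1) & set(p2)
            if len(shared) == 1:
                a, c = sorted((set(p1) | set(p2)) - shared)
                if (a, c) not in pairs:
                    candidates.add((a, c))
    return candidates

pairs = sentence_cooccurrences(sentences, terms)
print(cross_inference(pairs))  # → {('fish oil', 'raynaud disease')}
```

The proposed pair never appears together in any single sentence; it emerges only by crossing two sentence-level associations, which is the kind of indirect connection the method targets.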
1.2 Outline of the dissertation
This dissertation is organized into the following chapters.

Chapter 1: we describe the research motivation and present an outline of
the dissertation.
Chapter 2: we give a general overview of existing text mining techniques, and
present our work on applying text mining approaches to support the construction
of the chromosome-21 database.
Chapter 3: we explain our text association mining with cross-sentence in-
ference for discovering potentially-novel associations between textual entities.
Chapter 4: we present our two structure-based document models that ex-
ploit structural information of documents to improve the performance of docu-
ment classification.
Chapter 5: we illustrate our multi-relational text mining approach for dis-
covering relational knowledge from two disjoint sets of literatures.
The experimental results of the three approaches developed above are given
at the end of the corresponding chapters.
Chapter 6: we conclude the dissertation and discuss future work.

2. Overview of Text Mining
2.1 Introduction to text mining
With the exponential growth of newly published articles, it becomes imprac-
tical for researchers to manually review all of them even for a specific domain of
interest. Similarly, in the business sector, companies and organizations increas-
ingly store their contracts, records, correspondence, emails and other knowledge
assets in electronic text format. Manual utilization of these textual documents
would require substantial time and resources. Accordingly, text mining is grow-
ing as a crucial automated knowledge discovery method that can be used to
facilitate research and to provide competitive advantage for businesses by allow-
ing efficient and effective uses of textual resources.
Text mining has been intensively studied by various research communities,
and a number of its definitions have been proposed. We begin this chapter with
an investigation of the following three main definitions of text mining.
Dorre et al. defined text mining as a process of information and knowledge
discovery from unstructured textual form [24]. They further differentiated that
for single documents, text mining is a process of extracting codified information
or features; and for a collection of documents it is a process of analyzing the
distribution of features from the document collection to detect interesting pat-
terns or trends. Dorre et al. also pointed out that text mining is distinct from
data mining in that text mining deals with an additional highly-complex data
pre-processing phase. In addition, the number of the extracted features in this
phase is very high.

Another well-known definition of text mining was proposed by Hearst [35].
Hearst used the term "text data mining" instead of "text mining" and defined it
as the automatic discovery by a computer of previously unknown information.
Moreover, she differentiated text data mining from information retrieval and
computational linguistics by using novel and non-novel criteria. In other words,
text data mining discovers novel knowledge from text, whereas information re-
trieval and computational linguistics discover existing knowledge. In Hearst's
paper, novel knowledge refers to knowledge that might already exist, but
has not been found yet.
Because of the ambiguity of the novel vs. non-novel criteria in Hearst's definition,
Kroeze et al. clarified and extended Hearst's definition of text data mining [50].
In the paper, they explained that information retrieval should be classified as
non-novel investigation and text data mining as a semi-novel investigation. They
argued that hidden or previously unfound knowledge should not be considered
as new knowledge. As a result, they proposed a new term, "intelligent text
mining", that incorporates artificial intelligence and natural language processing
for discovering true novel knowledge.
Regardless of novelty, text mining tasks can be categorized into the follow-
ing categories: text classification, text clustering, knowledge extraction, asso-
ciation discovery, hypothesis generation, and knowledge base construction and
model building. In short descriptions, text classification is a task of assigning
documents to a set of pre-defined categories. Text clustering is a task of parti-
tioning a set of documents into groups where documents within the same group
are similar to each other and dissimilar to those in other groups. Knowledge
extraction is a task of identifying and extracting specific information, such as
entities and their relationships, from text documents. Association

discovery is a task of finding relationships among important entities in text
documents. Hypothesis generation is a task of generating meaningful novel hy-
potheses from text documents. Finally, knowledge base construction and model
building is a task of collecting information extracted from documents, which will
then be used to build a meaningful model for further interpretation.
To achieve the above tasks, text mining techniques utilize a combination of
one or more state-of-the-art approaches such as mathematics, statistics, natu-
ral language processing, and machine learning. For example, Latent Semantic
Indexing (LSI) is a mathematical technique that is used in text mining to re-
duce the dimensionality of text documents [48]. Part-of-speech tagging is a natural
language processing method that assigns functional labels such as noun, verb,
preposition, etc. to words in a sentence [10]. Probability and word frequency are
simple statistical approaches that are commonly used in computing numerical
values of terms and in selecting terms. Lastly, Naive Bayes [56, 58], deci-
sion trees [97] and Support Vector Machines [44] are inductive machine learning
techniques widely used in text classification.
Text mining has been successfully employed in various fields of research rang-
ing from finding simple associations among important entities [95], discovering
trends for further research [21], analyzing logs of intrusion detection in web
applications [1], building a knowledge base from product descriptions for de-
partment store [29], to searching for anomalies in free-text reports for aerospace
engineering [81].
The content of this chapter is organized as follows. Section 2.2 describes
text pre-processing. Sections 2.3 and 2.4 discuss feature selection and document

representation. Section 2.5 explains types of text mining and cites several pub-
lications in each category. Finally, Section 2.6 presents our work on employing
simple text mining approaches to support the construction of the comprehensive
chromosome-21 database.
2.2 Pre-processing
In text mining, textual documents typically need to be pre-processed and
transformed into a suitable representation before any text mining technique can
be applied. Although pre-processing of text documents is different according to
types of the mining methods used, the following three steps are commonly used
in every text mining approach.
1. Tokenization:
Tokenization is a task of breaking a text document into tokens. A token
normally corresponds to a single word or phrase. In general, a space or
punctuation is used as a delimiter in a tokenization process.
2. Stop word removal:
Stop word removal is a task of removing noise. Noise in this context is
referred to as words that provide no meaning or are irrelevant to the goal
of a mining task. Examples of this type of words are articles, conjunctions,
prepositions, etc.
3. Word stemming:
Stemming is a task of converting words into their canonical forms. This
process is employed in order to resolve morphological variants of words.
An example of stemming is the conversion of words from plural to singular form.
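A minimal sketch of the three steps follows. The stop word list and the one-rule stemmer are deliberate simplifications for illustration; a production system would use a curated stop word list and a proper stemmer such as Porter's:

```python
import re

# A tiny illustrative stop word list (real lists contain hundreds of words).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "is", "are"}

def preprocess(text):
    # 1. Tokenization: lowercase and split on anything non-alphabetic.
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. Stop word removal: drop words that carry no content.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. Word stemming: a crude plural-stripping rule standing in for a
    #    real stemmer (e.g., the Porter stemmer).
    return [t[:-1] if t.endswith("s") else t for t in tokens]

print(preprocess("The genes are expressed in cells."))
# → ['gene', 'expressed', 'cell']
```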

In some problem domains, additional pre-processing steps may be needed.
For example, in the biomedical domain, sentence boundary detection is necessary
for breaking retrieved abstracts or full texts into sentences before a Part-of-
Speech tagging program or other natural language processing techniques can be
applied. Entity detection or entity recognition could also be considered as a
pre-processing step in the biomedical domain, which involves a large number of
terminological terms such as protein names, gene names, disease names, etc.
2.3 Feature selection
Because the number of unique words in a document corpus is usually large,
it is impractical and inefficient to use all of them for a text mining task. Hence,
feature selection is an essential task that can be used to reduce the dimension
of a data set. A straightforward approach for feature selection is using word
frequency to rank all words in the data set and then selecting the top n words.
Using word frequency for feature selection is simple. However, words that fre-
quently occur might not be considered good features. For example, the article
"the" occurs very frequently in normal English text, but it does not give any
additional meaning to the text. Moreover, in the biomedical domain, the words
"introduction", "background", "methods", "results" and "conclusions" occur very
frequently, but are normally considered irrelevant or unimportant. The reason
is that these words occur in almost every article. As a result, they are not
good indicators of the contents of a document. The following list describes some
interesting publications for feature selection in text mining.
Blake and Pratt proposed a heuristic approach for extracting semantic rich-
ness features from biomedical literature [6]. The main idea of the approach is

that using semantic richness features for mining biomedical literature should
yield more useful and plausible rules than using words. They utilized concepts
from the Unified Medical Language System (UMLS) as semantic richness fea-
tures. The UMLS is a biomedical knowledge base that contains a list of biomed-
ical terms and corresponding concepts. In their method, they first break down
a sentence into clauses using a customized stop word list as delimiters. Then,
clauses are mapped to terms in the UMLS. Finally, concepts that correspond
to the matched terms are extracted and used as features. They evaluated the
method using bi-directional association rules based on usefulness and plausibil-
ity, and compared the results with other feature extraction methods, such as
word-based and manually assigned words.
Jing et al. proposed a feature (word) selection method based on the
TF-IDF weighting scheme [42]. TF-IDF refers to term frequency
multiplied by the inverse of document frequency. It is a weighting scheme that
takes into consideration both the frequency of a term and the number of
documents in which the term occurs. They evaluated the approach on text
categorization using Naive Bayes and found that using TF-IDF instead of word
frequency for word selection improved the classification accuracy from 61% to
76%. In addition, Mutual Information combined with TF-IDF yields even better
accuracy than TF-IDF alone, improving the classification accuracy to 88%. In
TF-IDF, term frequency is the frequency of occurrence of a term in a document,
and document frequency is the number of documents in the corpus where the term
occurs at least once. The TF-IDF weighting scheme balances term frequency
against document frequency, such that if a term occurs very frequently in a
particular document, it can be considered a key indexing term for that document.
However, if a term occurs in many documents, it is considered common and less
important. Because the TF-IDF values of a term t in different documents are
usually not equal, depending on term frequency, its maximum value is normally
used.
Chua and Kulathuramaiyer proposed a semantic feature selection approach
based on the WordNet and Word Sense Disambiguation [17]. They applied
the method to text categorization and compared the results with Information
Gain and Chi-Square feature selection techniques. WordNet is an English lexical
database that contains a list of terms and their corresponding sets of synonyms.
The lexicon in WordNet is categorized into nouns, verbs, adjectives and adverbs.
Each set of synonyms is referred to as a sense or true meaning of the term.
For example, the term "maize" has two senses, or two sets of synonyms, in the
WordNet database. Sense 1 is the set of terms corn, maize, Indian corn, and Zea
mays. Sense 2 is the set of terms gamboge, lemon, lemon yellow and maize. In
this method, for each term in a document, its sets of synonyms, or synsets, are
retrieved from WordNet. Terms that have the same or similar synsets are
selected as features. Their results demonstrated that the proposed WordNet-
based feature selection approach performed better than Information Gain and
Chi-Square when 50 or more terms were selected, as measured by F-measure.
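The synset-overlap idea can be sketched with a hard-coded stand-in for WordNet. The `SYNSETS` dictionary and the overlap threshold below are illustrative assumptions; the actual method queries the WordNet database and uses its own grouping criteria:

```python
# Toy stand-in for WordNet synset lookups (illustrative entries only).
SYNSETS = {
    "maize": {"corn", "maize", "indian corn", "zea mays"},
    "corn": {"corn", "maize", "indian corn", "zea mays"},
    "lemon": {"gamboge", "lemon", "lemon yellow", "maize"},
    "gene": {"gene", "cistron", "factor"},
}

def group_by_synset(terms, min_overlap=2):
    """Group terms whose synonym sets overlap in at least min_overlap
    words, then keep one representative feature per group."""
    groups = []
    for term in terms:
        for group in groups:
            if len(SYNSETS[term] & SYNSETS[group[0]]) >= min_overlap:
                group.append(term)
                break
        else:
            groups.append([term])
    return [g[0] for g in groups]

print(group_by_synset(["maize", "corn", "lemon", "gene"]))
# → ['maize', 'lemon', 'gene']
```

Note that "maize" and "corn" collapse into one feature because their synsets match, while "lemon" stays separate: it shares only the single ambiguous word "maize".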
Yang and Pedersen conducted a comparative study of different feature se-
lection methods for text classification [103]. In their research, they evaluated
five feature selection approaches: document frequency, mutual information, in-
formation gain, the Chi-Square test and term strength. Based on the experimental
results, they found that information gain and the Chi-Square test are the most ef-
fective feature selection methods and mutual information is the worst.
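Information gain for a term can be computed as the drop in class entropy when the corpus is split by the term's presence. The four toy documents and labels below are assumptions for illustration:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(docs, labels, term):
    """IG(term) = H(class) - H(class | term present/absent)."""
    n = len(docs)

    def class_entropy(indices):
        if not indices:
            return 0.0
        counts = {}
        for i in indices:
            counts[labels[i]] = counts.get(labels[i], 0) + 1
        return entropy([c / len(indices) for c in counts.values()])

    with_t = [i for i, d in enumerate(docs) if term in d]
    without_t = [i for i in range(n) if i not in with_t]
    h = class_entropy(list(range(n)))
    h_cond = (len(with_t) / n) * class_entropy(with_t) + \
             (len(without_t) / n) * class_entropy(without_t)
    return h - h_cond

docs = [{"gene", "cell"}, {"gene"}, {"stock", "price"}, {"price"}]
labels = ["bio", "bio", "finance", "finance"]
print(information_gain(docs, labels, "gene"))  # → 1.0 (perfect split)
print(information_gain(docs, labels, "cell"))  # lower: splits one class only
```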
2.4 Text representation
Transformation of text documents into a suitable representation is an im-
portant issue in text mining. One of the most widely-used representations is
known as the Vector Space Model (VSM). The Vector Space Model is an n x m matrix
whose rows correspond to documents d_i, i = 1, 2, ..., n, in a document collection
and whose columns correspond to unique words w_j, j = 1, 2, ..., m, chosen from
the document corpus. A cell value c_ij is the weight of term w_j in document
d_i. The weight in cell c_ij could be a binary value, term frequency, document
frequency, or term frequency multiplied by the inverse of document frequency
(TF-IDF), etc. The binary term weighting scheme is usually used for text
association mining, and the TF-IDF weighting scheme is normally used for
document classification and clustering. The Vector Space Model of n documents
and m features is shown in Figure 2.1.
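Building the n x m matrix from tokenized documents can be sketched as follows, here with term-frequency weights (the two sample documents are illustrative assumptions):

```python
def vector_space_model(docs):
    """docs: list of token lists. Returns (vocabulary, n x m matrix)
    where row i holds the term frequencies of document d_i and column j
    corresponds to vocabulary word w_j."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: j for j, t in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for t in doc:
            matrix[i][index[t]] += 1  # term-frequency weighting
    return vocab, matrix

vocab, m = vector_space_model([["gene", "cell", "gene"], ["cell", "stock"]])
# vocab → ['cell', 'gene', 'stock']; m → [[1, 2, 0], [1, 0, 1]]
```

Swapping the increment for a presence flag gives the binary scheme used in association mining; applying a TF-IDF transform to the rows gives the weighting preferred for classification and clustering.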
2.5 Types of text mining
2.5.1 Text classification
Text classification is a process of assigning pre-defined labels to new doc-
uments based on a classifier trained on training examples. In a traditional
two-class classification problem, the training examples consist of both positive
and negative examples. The positive examples usually refer to documents
that are labeled with their actual category, and the negative examples are docu-
ments that are labeled otherwise. For text classification, documents or examples
typically need to be pre-processed and transformed into a proper representation

Figure 2.1: Vector Space Model.
such as Vector Space Model, where a document is represented as a vector of
words associated with some numerical weight values. The numerical weight
could be the frequency of occurrence of those words in a document or other
weighting schemes. To build a classifier, an appropriate similarity measure will
be chosen for computing similarity values among training documents. The same
similarity measure will also be used for classifying new documents. A well-known
similarity measure is the Cosine similarity measure. Several text classification
algorithms have been proposed; the most commonly used are Rocchio [43], Naive
Bayes [58], K-nearest neighbor [51] and Support Vector Machines (SVM) [45].
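The Cosine similarity measure mentioned above compares two term-weight vectors by the angle between them; a direct computation (the example vectors are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors:
    dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine([1, 2, 0], [1, 0, 1]))  # ≈ 0.316
```

Identical directions score 1.0 and vectors with no shared terms score 0.0, which is why the measure works well on sparse document vectors of differing lengths.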
Below, we review some interesting work on text classification that address the
problem of manually obtaining positive and negative training examples.
Liu et al. proposed a text classification method using positive and unlabeled
examples based on a biased SVM [55]. Text classification using these examples

is typically a two-step process. The first step is to identify a set of negative
examples from unlabeled examples, and the second step is to iteratively build a
classifier from the existing positive examples and negative examples derived from
step 1. For each iteration in step 2, the number of unlabeled examples classified
as negative examples is increased based on the results of the previously built classifier. Finally, the best classifier is selected. Unlike the classical SVM, the key idea of their proposed text classification using positive and unlabeled examples based on a biased SVM is that the costs of errors on positive and negative examples are separated. In their method, the cost of positive errors is given a higher value than
that of negative errors. Their proposed method was evaluated on Reuters-21578
and Usenet data sets, and compared with similar methods such as PEBL, S-EM
and Roc-SVM. The results demonstrated that the biased SVM technique was
superior to the other comparable techniques.
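The two-step process described above can be illustrated with a toy sketch; a simple centroid classifier stands in for the biased SVM, and the data points and positive/unlabeled split are invented for illustration only:

```python
# Step 1: seed "reliable negatives" with the unlabeled point farthest from
# the positive centroid; Step 2: iteratively grow the negative set with a
# classifier built from the current positives and negatives. (A centroid
# classifier stands in for the biased SVM; a real implementation would
# weight positive errors more heavily than negative ones.)

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

positives = [[1.0, 0.9], [0.9, 1.0]]
unlabeled = [[1.0, 1.1], [0.1, 0.0], [0.0, 0.2], [0.9, 0.8]]

pos_c = centroid(positives)
# Step 1: farthest unlabeled point seeds the reliable-negative set.
negatives = [max(unlabeled, key=lambda x: dist(x, pos_c))]
remaining = [x for x in unlabeled if x not in negatives]

# Step 2: repeatedly move remaining unlabeled points that fall closer to
# the negative centroid than to the positive one.
for _ in range(3):
    neg_c = centroid(negatives)
    moved = [x for x in remaining if dist(x, neg_c) < dist(x, pos_c)]
    if not moved:
        break
    negatives += moved
    remaining = [x for x in remaining if x not in moved]
```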
Fung et al. presented a text classification technique without negative ex-
amples, termed Positive examples and Negative examples Labeling Heuristic
(PNLH) [28]. Unlike other similar methods that extract only negative examples
from unlabeled examples, their technique extracts both positive and negative
examples. They pointed out that extracting positive examples (P) from unlabeled examples is very difficult, because an unlabeled document that exhibits features of P is not necessarily a positive example. Therefore, they proposed a concept of core vocabulary built from features
exhibited in positive examples. A feature in positive examples was added to
the core vocabulary if it had higher feature strength than a specific threshold.
They also introduced a formula for computing the feature strength. The core

vocabulary was then used for extracting negative examples from the unlabeled
examples. Their proposed method was divided into two steps. The first step
was extracting reliable negative examples using core vocabulary, and the second
step was enlarging sets of positive and negative examples. In the second step,
negative examples were first iteratively extracted from the existing unlabeled ex-
amples. Then, positive examples were extracted from the rest of the unlabeled
examples. Finally, a classifier was built from the combined negative examples
and the merged positive examples. Fung et al. evaluated their method against
techniques such as PEBL-SVM, WL-SVM, etc. They found that their proposed
technique substantially outperformed other methods when the size of positive
examples was extremely small.
2.5.2 Text clustering
Text clustering is the process of partitioning a set of documents into groups such that documents within the same group are similar to each other and dissimilar to those in other groups. Text clustering differs from text classification in that it does not require prior knowledge of document categories. Two representative papers are reviewed below.
Song et al. presented a document clustering algorithm based on association
rules [79]. The key idea of the algorithm is to use frequent item sets (generated
by an association mining algorithm) to construct initial clusters of documents.
Documents are put into the same initial clusters if they contain the same frequent
item sets, and their association means of documents with keywords are greater
than some threshold. The association means are computed using association
intensity and document feature vector defined in the paper.
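The initial-cluster construction can be sketched as follows; the documents and frequent item sets below are invented, and the association-intensity test from the paper is omitted:

```python
# Documents sharing the same frequent item set seed the same initial cluster.
docs = {
    "d1": {"gene", "protein", "expression"},
    "d2": {"gene", "protein", "pathway"},
    "d3": {"cell", "membrane", "transport"},
    "d4": {"cell", "membrane", "receptor"},
}
frequent_itemsets = [frozenset({"gene", "protein"}),
                     frozenset({"cell", "membrane"})]

clusters = {}
for itemset in frequent_itemsets:
    # A document joins the cluster if it contains every item of the item set.
    members = {d for d, words in docs.items() if itemset <= words}
    if members:
        clusters[itemset] = members
```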

Wang and Hodges proposed a document clustering method based on semantic analysis [96]. The motivation of the paper is that the typically used bag-of-words approach does not truly reflect the semantic content of a document because of the synonymy and polysemy of words. As a result, in their method, words are first mapped to their actual senses using a word sense disambiguation algorithm based on WordNet. Then, a vector space model is constructed using the actual word senses. Finally, a clustering algorithm is applied to the sense-based document vectors.
2.5.3 Knowledge extraction
Knowledge extraction is the process of extracting information from textual
documents. Its applications are frequently found in the biomedical domain, where gene names, protein names, protein interactions, functions and other relationships are extracted from the literature. Rule-based pattern
matching, natural language processing, dictionaries and statistics are some of the
commonly-used techniques. Literature related to knowledge extraction is discussed below.
Palakal et al. presented an approach for obtaining relationships between
biological entities [63]. In the approach, Palakal et al. defined two types of
relationships to be extracted: directional relationship and hierarchical relation-
ship. The directional relationship represents a direct relationship between two
entities, for example, protein A inhibits protein B. The hierarchical relationship
represents a relationship where one entity is a part of another entity such as
brain is a part of the nervous system. They pointed out that the extracted re-
lationships could be used as a template for modeling a biological system. Their

method is based on the Hidden Markov Model (HMM), dictionaries and N-Gram
models, and is divided into three stages: entity recognition, synonym grouping
and relation extraction. For the entity recognition, dictionaries are first used to
identify known entities. Then, HMM and N-Gram models are used to identify
unknown entities and to resolve name ambiguity, including multi-word phrases.
In this stage, not only are the biological entities recognized, but they are also tagged with biological types such as protein, gene, etc. For synonym grouping, entities are first put into their own types, and then a generalized ontology
and graph structure are constructed for grouping synonyms. In the relation ex-
traction stage, HMM is trained to identify subject and object in a sentence for
specific verb patterns. The subject/object identification allows the method to
extract directions from sentences. An advantage of using the HMM method over
the rule-based approach for extracting directional relationships is that there is
no need to create rules for covering all possible relationships. Finally, Palakal
et al. utilized the grouping in stage two and co-occurrence statistics to extract
hierarchical relationships.
Temkin and Gilder proposed a context-free grammar and lexical analysis ex-
traction method for obtaining interactions of protein, gene and small molecules
(PGSM) from unstructured text [87]. The method consists of three parts: Pathway database dictionaries, a Lexical Analyzer and a Context-Free Grammar
(CFG) parser. The dictionaries contain a list of PGSM names and their synonyms collected from Swiss-Prot, GenBank and KEGG, and a list of interaction keywords gathered from the Friedman and NIH relevant term lists. These dictionaries
are used for recognizing key terms in unstructured text. The Lexical Analyzer

utilizes the dictionaries and a set of protein/gene name recognition rules described in Fukuda et al. [27] to detect PGSM names. Finally, a Context-Free
Grammar parser is used with grammar production rules to extract interactions.
Temkin and Gilder manually derived the grammar rules for detecting interactions from 500 abstracts, and used JavaCC to generate Java programs from the
grammar rules.
Daraselia et al. presented a protein extraction system called MEDSCAN
[20]. The system uses an ontology for extracting human protein interactions
from MEDLINE. MEDSCAN consists of three modules, a preprocessing module,
a natural-language processing (NLP) module, and an information extraction
module. In the preprocessing module, the protein names are identified using a
predefined dictionary. For the NLP module, semantic trees are constructed from
sentences using a Context-Free Grammar and a lexicon developed specifically for MEDLINE. Finally, the information extraction module utilizes an ontology
and knowledge base to evaluate and convert valid semantic trees into ontological
representation. This ontological system can then be used for protein interaction retrieval.
Chiang and Yu proposed a sentence pattern matching technique for extracting protein functions from biomedical literature [16]. Functions in this paper refer to those in Gene Ontology (GO). In the paper, Chiang and Yu described how sentence patterns are constructed and then applied to extract GO functional terms. In their method, protein-GO-document relations are first extracted using co-occurrence. In other words, for a particular document, a protein-GO-document relation exists if a protein name and a GO term appear together in the same sentence. Chiang and Yu resolved variants of
GO terms used by researchers by classifying the variants into three categories,
morphological variants, syntactic variants and semantic variants. Morphological variants mean that a term can be written in several different forms. For example, "cellular membrane" is a variant of "cell membrane": "cellular" and "cell" refer to the same meaning, but have different forms. Syntactic variants mean that all content words of a term are present in both the original and its variant, but in a different syntactic structure. For example, the term "binding to the origin of DNA replication" is a variant of "DNA replication origin binding". In this example, the content words "binding", "origin", "DNA" and "replication" are present in both the original term and the variant, but their syntactic roles are different. Semantic variants mean that different terms refer to the same meaning. For example, "delivery of copper ion" is a semantic variant of "copper ion transport". Chiang and Yu solved the problem of morphological variants using the Java Lexical Tools from UMLS, the syntactic variants using a rule-based method of term variations, and the semantic variants using a dictionary of synonyms of GO terms. After the co-occurrence step, in which the sentences containing protein names and their functions are extracted, these sentences are transformed into a set of phrases using a finite state automaton, where a phrase can be a noun phrase, a verb phrase, or another type. Then, the transformed co-occurrence sentences are divided into two types, positive and negative. The positive sentences are those whose co-occurring relations are present in database annotations, and
others are negative sentences. Next, candidate sentences are mined from the
positive sentences. Support and confidence values are computed for each candi-

date sentence. Finally, sentence patterns are derived using specific support and
confidence thresholds.
Koike et al. introduced shallow parsing and sentence structure analysis techniques for extracting gene/protein biological functions from biomedical literature [47]. The method can be divided into two parts. In the first part, a
dictionary of terms similar and related to GO terms and biological names is
constructed using co-occurrence, co-location and pattern matching. Then, the
rules that describe (i) syntactic/semantic variations of terms, such as "folding of protein" and "protein folding", and (ii) verb-technical terms that are related to biological functions, are generated. In the second part, the relations between
genes and gene functions are extracted using parsing and sentence structure
analysis. In this extraction phase, after detecting the terms in text using a
dictionary, sentences are shallowly parsed into subject-phrase, verb-phrase and
object-phrase. Then, gene-function relationships can be extracted for the predefined verbs when a gene is in the subject-phrase and a gene function is in the object-phrase. Finally, GO-IDs are assigned to the valid extracted relationships. An
important contribution of this paper is its sentence structure analysis. Koike
et al. described in detail how different variations of sentence structures and
patterns can be dealt with efficiently.
Chiang et al. integrated sequential pattern mining and information extrac-
tion methods for obtaining gene-gene interactions from biomedical literature
[15]. In their method, part-of-speech (POS) tags are applied to sentences. Based
on the POS tags, a sentence is broken into parts: noun phrase, verb phrase and object phrase. They captured biological entities and interaction types using

pre-defined sets of dictionaries. However, they did not provide details of how
complex and compound sentences could be treated.
Rinaldi et al. presented a method called deep-linguistic syntactic analysis for mining functional relations from biomedical literature [76]. The method
is built upon a combination of a hand-written grammar and a statistical language model based on maximum likelihood estimation. After preprocessing, which transforms documents into a rich format called dependency relations, the specific syntactic relations are determined using probabilities computed by the model.
2.5.4 Association discovery
Association discovery is the process of finding relationships among entities
or terms in documents. An example of association discovery is finding associa-
tions among genes and proteins in biomedical literature. For a discovery to be
successful, entity names and their synonyms usually need to be detected before the association discovery process. As a result, pre-processing is very important. Association rule mining algorithms, e.g., see [2] and [9], are commonly used methods for association discovery.
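As a minimal sketch of association discovery over abstracts, the following toy Apriori-style example mines frequent entity pairs and the rules that pass support and confidence thresholds (the entity names, abstracts and thresholds are hypothetical):

```python
from itertools import combinations

# Each "transaction" is the set of entity names detected in one abstract.
abstracts = [
    {"APP", "BACE2", "DYRK1A"},
    {"APP", "BACE2"},
    {"APP", "DYRK1A"},
    {"BACE2", "DYRK1A"},
]
min_support, min_confidence = 0.5, 0.6

def support(itemset):
    # Fraction of abstracts containing every item of the item set.
    return sum(1 for t in abstracts if itemset <= t) / len(abstracts)

items = sorted({i for t in abstracts for i in t})
frequent_pairs = [set(p) for p in combinations(items, 2)
                  if support(set(p)) >= min_support]

rules = []
for pair in frequent_pairs:
    for a in pair:
        b = (pair - {a}).pop()
        conf = support(pair) / support({a})   # confidence of a -> b
        if conf >= min_confidence:
            rules.append((a, b, conf))
```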
Huang et al. proposed a fuzzy predictor of new association rules from gen-
erated association rules based on the MeSH taxonomy [38]. The key idea of
this method is that a concept in an association rule can be replaced by another
concept in the taxonomy to form a new association rule. They employed fuzzy
set theory to compute degrees of concept replacement based on the membership
function, which measures the degrees of replacing one concept with another one.
A concept can replace another if the two are siblings in the hierarchical taxonomy. The authors provided a definition of the degree of change and a way to validate the replacement. In their method, instead of the textual abstracts, the MeSH terms associated with each abstract in its citation record are proposed to be used for mining association rules.
one example without any actual implementation or experiments.
Berardi et al. presented a method for mining generalized association rules
of concepts from Medline abstracts [4]. The concepts used in the approach
are based on the MeSH taxonomy, a set of hierarchically related biomedical terms. Generalized association rules are defined as association rules, antecedent → consequent, where no item in the consequent is an ancestor of any item in the antecedent in the taxonomy. They utilized the BioTeKS Text
Analysis Engine provided within the IBM UIM architecture for detecting MeSH
terms in Medline abstracts. Then, the tool generated a database table, where
each row corresponds to an abstract containing a list of detected MeSH terms.
To mine generalized association rules, the table is first extended to include the ancestors of each term in the taxonomy; then, association rules that pass the minimum support and confidence thresholds but violate the ancestral rule defined above are pruned. To cope with the large number of generated generalized association
rules, Berardi et al. further pruned the rules using an "interestingness" measure and user-defined templates.
2.5.5 Hypothesis generation
Hypothesis generation is the process of generating potentially meaningful
hypotheses from textual documents. A hypothesis could be an association be-
tween two entities resulting from inferences or a sentence describing potential
novel relations between entities. In contrast to other text mining techniques,

the generated hypotheses, which are the results from the hypothesis generation
method, typically require further verification in the form of an experiment. If
text mining is a process of discovering novel knowledge, then hypothesis generation comes closest to that definition, because it generates outcomes that have not been discovered before. Two techniques for hypothesis generation in text
mining are reviewed next.
Swanson's work is considered the initial study on hypothesis generation from literature [83]. He proposed a hypothesis that dietary fish oil might benefit Raynaud's patients using the following inference rule: if A causes B and B causes C, then A causes C. This hypothesis was derived from studying two different sets of papers, one about fish oil and the other about Raynaud's disease. In the set of fish oil papers, Swanson found that dietary fish oil causes reduction of platelet aggregability, blood viscosity, and vascular reactivity. In the set of Raynaud's disease papers, he discovered that reduction of platelet aggregability, blood viscosity, and vascular reactivity causes the amelioration of Raynaud's syndrome. As a result, he hypothesized that dietary fish oil causes the amelioration of Raynaud's syndrome.
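Swanson's ABC inference pattern can be sketched as follows, with entity pairs paraphrasing his fish-oil example:

```python
# Swanson-style ABC inference: if A relates to B in one literature and
# B relates to C in another, hypothesize an A-C link.
fish_oil_findings = {("fish oil", "platelet aggregability"),
                     ("fish oil", "blood viscosity")}
raynaud_findings = {("platelet aggregability", "Raynaud's disease"),
                    ("blood viscosity", "Raynaud's disease")}

# Join the two literatures on the shared intermediate term B.
hypotheses = {(a, c)
              for a, b1 in fish_oil_findings
              for b2, c in raynaud_findings
              if b1 == b2}
```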
Srinivasan presented open and closed discovery algorithms for generating
hypotheses from MEDLINE [80]. The algorithms utilize UMLS semantic types
and MeSH terms for making a discovery. For both algorithms, a topic profile is
constructed using MeSH terms associated with the retrieved papers. A profile is
a vector of MeSH vectors, and each MeSH vector consists of MeSH terms that
belong to the same UMLS semantic type. In the profile, TF-IDF weights are
computed for each one of the MeSH terms. Therefore, a profile represents the

relative importance of MeSH terms within UMLS semantic types of the given
topic. In the open algorithm, a user provides one topic of interest and two sets
of UMLS semantic types. After a set of literature is retrieved, the MeSH terms
associated with the papers are extracted based on their TF-IDF values and
the first UMLS semantic types. These extracted MeSH terms are then used
to retrieve other sets of papers. Profiles are built for each set of papers using
the second user-specified UMLS semantic types. After combining the profiles, a
MeSH term and its weight in the combined profile are considered as an output if a MEDLINE search using the user-given topic and the MeSH term returns a zero result. These outputs mean that there is no direct association between a
user-specified topic and the MeSH term, but there are associations (i) between a
user-specified topic and an intermediate MeSH term and (ii) between the intermediate MeSH term and a MeSH term in the last set. In the closed algorithm,
a user provides two topics of interest and two UMLS semantic types of interest.
Two different sets of literature are retrieved using the given two topics. Then,
two different profiles are constructed based on the MeSH terms in each set and
given semantic types. Next, the MeSH terms that are common between two
profiles are extracted. Finally, if a MEDLINE search that utilizes MeSH terms from the first and second sets and from the common MeSH terms returns a non-zero result, the common MeSH terms are eliminated. The remaining common MeSH terms represent novel associations. In fact, Srinivasan's method does not generate hypotheses. The method generates only a ranked list of terms that are
common or connected between the two sets of retrieved literature.

2.5.6 Knowledge base construction and model building
Knowledge base construction or model construction is the process of utilizing extracted information from literature to build a knowledge base or model for future use. Typically, the knowledge is first extracted from literature. This
knowledge can be associations among entities, protein interactions, etc. Next,
the extracted information is refined and validated, and then it will be used
for building a knowledge base or constructing a (biological) model. Automatic
construction of signaling pathways network is an example of model construction
using text mining. We review a key paper below.
Yuryev et al. described a methodology for automatically constructing signal-
ing pathway networks [104]. They exploited existing protein relations extracted
by Medscan from scientific literature in the ResNet database. They presented
two algorithms. The first one is an algorithm for curating relations in the ResNet
database to resolve issues in the jargon usage of molecular biologists and differences in the way scientists describe the interactions between proteins. The
second algorithm is a pathway building algorithm. Yuryev et al. used their method to generate signaling pathway networks of ligand regulomes.
2.6 An application of text mining to support the development of
the chromosome-21 database
We illustrate here the usefulness of simple text mining approaches in an application for finding orthologs and extracting protein interactions for the development
of the chromosome-21 database. A comprehensive chromosome-21 database was
constructed by a group of researchers at the University of Colorado Denver and

the Eleanor Roosevelt Institute led by Drs. Gardiner and Cios. It can be seen at
http://chr21 The goal of the database is to serve as a
comprehensive resource for chromosome-21 genes and Down syndrome research.
With this database, researchers can retrieve chromosome-21 related information
such as protein interaction, orthologs, gene expression data, conserved domain,
etc. In addition, it contains graphical and prediction tools that allow researchers
to carry out a number of important chromosome-21 related tasks. Details about
the chromosome-21 database can be found in [62].
In this section, we describe the components of the chromosome-21 database that are supported by text mining: the ortholog identification process and the protein interaction extraction from structured databases.
2.6.1 The ortholog identification of chromosome-21 genes
Overview
The term ortholog was introduced in 1970 by Fitch [25]. It refers to a gene
that descended from the same ancestor as another gene in a different organism. Orthologs are genes that have evolved by speciation. The counterpart of an ortholog is a paralog, which refers to a gene that evolved by gene duplication [49]. The difference between paralog and ortholog is further explained in Jensen [40].
The importance of ortholog identification relies on the assumption that orthologous genes perform similar functions, although they reside in different organisms. Because of the incompleteness of the human protein sequence database and the availability of protein sequences in model organisms, biological researchers normally annotate the functions of human genes from the corresponding genes present in model organisms, using the notion of orthology.

There are several techniques for ortholog identification, such as the InParanoid algorithm based on sequence similarity by Remm et al. [75], OrthoMCL by Li [52] and ortholog identification using protein network comparisons by Bandyopadhyay et al. [3]. Benchmark studies comparing ortholog identification methods were conducted by Hulsen et al. [39].
Rather than using the ortholog identification processes mentioned above,
the ortholog identification method used for the comprehensive chromosome-21
database is based on the basic local alignment search tool (BLAST), and is shown
in Figure 2.2. The goal of ortholog identification for chromosome-21 database is
to find the genes/proteins in model organisms that are orthologous to the human
chromosome-21 genes. The following ten model organisms are used for the
chromosome-21 database: Anopheles gambiae, Caenorhabditis elegans, Canis
familiaris, Danio rerio, Drosophila melanogaster, Gallus gallus, Pan troglodytes,
Saccharomyces cerevisiae, Schizosaccharomyces pombe and Takifugu rubripes.
Their protein sequence databases can be downloaded from the NCBI FTP web site under the BLAST tool.
According to Figure 2.2, the ortholog identification process can be divided
into the following sequence of steps:
Finding a potential ortholog from a model organism database:
In this step, a human chromosome-21 protein sequence of interest (query
sequence) is used as an input sequence for running BLASTP against
a protein sequence database of model organisms. The output from a
BLASTP search contains a set of protein sequences from the model or-

Figure 2.2: The identification process of chromosome-21 protein orthologs.

ganism database that might be similar to the query sequence. These
matched sequences are ordered by the Expect values or E-value, which
is a significant measure that describes the number of hits that one can
expect to see by chance when searching a database of a particular size. As
a result, the lower the E-value, the higher the significance of a match. For
the chromosome-21 database, two types of thresholds are used to specify
a valid matched sequence: E-value < 0.001 and sum of matched regions >
30% of the human chromosome-21 protein sequence length. In this step,
only the best-matched sequence is evaluated. If it passes the thresholds, it is designated a potential ortholog.
Filtering the potential ortholog:
The potential ortholog specified in the previous step may or may not be the actual ortholog. In this step, the first filtering task is performed by running a BL2SEQ search between the potential ortholog and
the chromosome-21 query sequence. If the sum of the non-overlapping
matched regions between two sequences is greater than or equal to 30%
of the query sequence, then proceed to the next step. Otherwise, the
potential ortholog is not a valid human chromosome-21 ortholog.
Performing the second BLASTP search:
In this step, a BLASTP search is run with the potential ortholog against a human protein database. The goal is to find a set of best-matched human protein sequences that pass the E-value threshold (< 0.001). Then, if one of the
matched human protein sequences is the same as the query sequence in

the first step, the potential ortholog can be specified as the valid human
chromosome-21 ortholog.
Assuring the result:
To ensure that the potential ortholog is actually a valid human chromosome-21 ortholog, a BL2SEQ comparison is run between the matched human protein sequence and the query sequence. If the identity value is greater than 95%, the potential ortholog is accepted as orthologous to the query sequence.
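The acceptance tests above can be sketched as follows; the threshold values follow the text, while the protein names and the best-hit table are hypothetical:

```python
# Hedged sketch of the two acceptance tests: the E-value and coverage
# thresholds for a BLAST hit, and the bi-directional best-hit check.

def passes_thresholds(e_value, matched_len, query_len,
                      max_e=0.001, min_cov=0.30):
    # E-value < 0.001 and matched regions > 30% of the query length.
    return e_value < max_e and matched_len / query_len > min_cov

def bidirectional_best_hit(best_hit_of):
    # best_hit_of maps a query protein to its best BLAST hit.
    def is_ortholog_pair(a, b):
        return best_hit_of.get(a) == b and best_hit_of.get(b) == a
    return is_ortholog_pair

# Hypothetical best-hit table for illustration.
check = bidirectional_best_hit({"human_ITSN1": "chimp_ITSN1",
                                "chimp_ITSN1": "human_ITSN1",
                                "chimp_CBR3": "human_CBR1"})
```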
The main idea of our ortholog identification for the chromosome-21 database is based on the concept of bi-directional hits with some user-defined threshold
values. In other words, a protein/gene A in a model organism X is an ortholog
of another protein/gene B in organism Y if A is the best hit from BLAST results
when B is the protein query, and then B is the best hit from BLAST results
when A is the protein query, based on some user-defined thresholds.
Role of text mining on ortholog identification
According to Figure 2.2, the ortholog identification process used for the chromosome-21 database depends exclusively on a basic sequence alignment tool known as BLAST. BLAST is available from NCBI at http://blast.ncbi.nlm., and is divided into a number of more specialized tools. To facilitate experiments, we utilize a standalone BLAST tool that can be run locally.
To prepare for BLAST input, protein or DNA sequences need to be extracted
or obtained from the protein sequence database. These sequences are usually in
the FASTA format, which is used to describe a protein or DNA sequence. Each

sequence in the FASTA format consists of two parts, header and body. The
header provides information about the sequence such as identification number
and type of organisms. It begins with the greater-than symbol, >, followed by
an identification number, organism, etc. The body is a protein or DNA sequence
identified in the header. An example of a protein sequence of ITSN1 gene in
FASTA format is illustrated in Figure 2.3.
>gi|109732781|gb|AAI16186.1| ITSN1 protein [Homo sapiens]
Figure 2.3: An example of a protein sequence in FASTA format.
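A minimal FASTA reader illustrating the header/body convention described above (the sequences below are invented for illustration):

```python
# Minimal FASTA reader: '>' starts a header; the following lines up to the
# next header form the sequence body.
def parse_fasta(text):
    records = {}
    header, body = None, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(body)
            header, body = line[1:], []
        else:
            body.append(line.strip())
    if header is not None:
        records[header] = "".join(body)
    return records

sample = """>seq1 example protein [Homo sapiens]
MSTAVLENPG
LYWDSRK
>seq2 another protein
MKV"""
records = parse_fasta(sample)
```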
From every BLAST output, we need to extract the following information: the best BLAST hits, E-value, length of the matched sequence, etc. There are four
main parts of an output generated by BLAST tool: introduction, result sum-
mary, sequence alignment and program parameter. An example of the sequence
alignment part is shown in Figure 2.4.
In order to extract protein sequences for constructing a model organism
protein database and to extract BLAST output, we employed a simple text
mining approach, known as data/information extraction. We implemented a set

>gi|114684060|ref|XP_514884.2| PREDICTED: carbonyl reductase 3 [Pan troglodytes]
Length = 277
Score = 553 bits (1426), Expect = e-158
Identities = 276/277 (99%), Positives = 276/277 (99%)
Figure 2.4: An example of the sequence alignments generated from BLASTP.
of Java programs with regular expression rules for extracting and parsing a large
amount of protein sequences and outputs generated from BLAST.
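The extraction step can be illustrated with regular expressions; the dissertation's extractor was written in Java, so the following Python sketch with simplified patterns is for illustration only:

```python
import re

# Pull the score, E-value, and identities out of one alignment header
# (patterns simplified; a real BLAST report needs more tolerant rules).
alignment = """>gi|114684060|ref|XP_514884.2| PREDICTED: carbonyl reductase 3 [Pan troglodytes]
Length = 277
Score = 553 bits (1426), Expect = e-158
Identities = 276/277 (99%), Positives = 276/277 (99%)"""

score = re.search(r"Score\s*=\s*(\d+)\s*bits", alignment).group(1)
expect = re.search(r"Expect\s*=\s*(\S+)", alignment).group(1)
ident = re.search(r"Identities\s*=\s*(\d+)/(\d+)", alignment)
identity_pct = int(ident.group(1)) / int(ident.group(2))
```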
2.6.2 Protein interaction extraction from HPRD database
Overview
The Human Protein Reference Database (HPRD),,
is a database of human proteins and other related information regarding the
human genes such as gene symbols, protein interactions, domains and motifs,
PTMs and substrates, gene expression, cellular localization, diseases, literature
references and external links to other relevant databases. Because its data are
manually curated, the HPRD database is considered a high-accuracy database. Data
from the HPRD database can be obtained through the web interface or downloaded from its web page.
Role of text mining on protein interaction extraction
To build a protein interaction network for the chromosome-21 database, protein interactions involving chromosome-21 proteins are extracted. To facilitate

this extraction, the whole database file in XML format is downloaded. Then, we implemented Java programs using an open-source XML parser, JDOM, to extract all protein interactions and other related information.
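The XML extraction step can be illustrated as follows; the actual extractor was written in Java with JDOM, and the element names below are hypothetical, not the real HPRD schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment for illustration; the real HPRD schema differs.
xml = """<interactions>
  <interaction>
    <protein1>ITSN1</protein1>
    <protein2>EPS15</protein2>
  </interaction>
  <interaction>
    <protein1>APP</protein1>
    <protein2>BACE2</protein2>
  </interaction>
</interactions>"""

root = ET.fromstring(xml)
# Collect every interaction as a (protein1, protein2) pair.
pairs = [(i.findtext("protein1"), i.findtext("protein2"))
         for i in root.iter("interaction")]
```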
2.6.3 Conclusions
As described above, the major role of text mining in building a comprehensive chromosome-21 database is the extraction of relevant information from public resources and from the outputs of analysis tools. The data to be extracted
are mostly semi-structured such as XML, or have some explicit structure such
as comma- or tab-delimited. Accordingly, a simple extraction technique using
regular expressions could be employed for the extraction tasks. However, if in the future data/information needs to be obtained from unstructured textual documents, more advanced text mining techniques, such as entity recognition [34], may be required.

3. Text Association Mining with Cross-Sentence Inference
3.1 Introduction
Text mining is known as "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources" [35]. It is a technique that utilizes a combination of various
approaches from statistics, natural language processing, artificial intelligence,
machine learning and data mining. Sub-areas of text mining include text clas-
sification, text clustering, text summarization, and text association. Text as-
sociation mining is an application of association rule mining algorithms to text
documents. In text association mining, associations among words in textual
documents are searched, and those that satisfy user-specified thresholds (support and confidence) are retained. The text mining process is typically built upon other techniques such as entity recognition, information retrieval, and information extraction. Entity recognition is how key terms presented in text documents
are recognized, information retrieval is how text documents relevant to a query
are retrieved from a text corpus, and information extraction is how information
hidden in text documents is extracted. These techniques are built using statistics, natural language processing, key term dictionaries, and rule-based pattern matching.
In this chapter, we present our text association mining method with cross-sentence inference, which exploits statistical co-occurrences of keywords and the
syntactic structure of sentences to derive potentially novel and meaningful rela-
tions [90]. In our approach, associations among words in the form of association

rules are first searched from a set of selected abstracts using a state-of-the-
art association rule mining algorithm, and those that satisfy specific thresholds
(support and confidence) are chosen. Then, each association rule is expanded
and transformed into a sentence using the semantic phrase "is associated with".
For example, a rule A → B is transformed into "A is associated with B". We call
this type of association statistical association. Next, we search for another type
of associations, which we call scientifically-reported associations, from the set of the
same selected abstracts as used in the association rule mining. In an abstract,
the most useful information is usually present in the experimental result and
conclusion sentences, where researchers report their findings. Accordingly, we
extract our scientifically-reported associations from those types of sentences. We
achieve this step through the use of part-of-speech tagging, regular expressions,
and rule-based pattern matching. Finally, we perform cross-sentence inference
between the statistical associations and the scientifically-reported associations
to derive potentially novel and meaningful hypotheses. In addition, we introduce an
enhanced weighting scheme, called extended TF-IDF, for selecting words to
be used in the association rule mining process. Because the number of unique
words in a document corpus is usually very large, choosing only those words
that are truly relevant to the document corpus is a critically important process.
The extended TF-IDF weighting scheme takes into account the positional
property of words in documents, in addition to word frequency and document
frequency. The underlying assumption behind this new weighting scheme is
that words that appear in the title of a document tend to be more relevant to
the document than those that occur in the content of the document. The goal
of the extended TF-IDF is to obtain words that are as highly relevant to the
documents as possible, and these words will then be used in the association min-
ing process. The contribution of this work is to provide an automatic method
for generating potentially novel hypotheses from unstructured documents using
association rule mining algorithms and text mining.
This chapter is organized as follows. Section 3.2 provides background infor-
mation. Section 3.3 describes data and preprocessing. Section 3.4 explains the
methods. Section 3.5 presents experimental results. Finally, Section 3.6 concludes the chapter.
3.1.1 Related work
Text mining in biomedical literature could be traced back to Swanson [83],
where he reported a relationship between Raynaud's disease and a consump-
tion of fish oil through analyses of biomedical literature. Srinivasan conducted
a similar research using text mining to generate hypotheses from MEDLINE
[80]. Other literature mining studies were reported in [30], [31], [32] and [54].
Text mining in biomedical literature typically emphasizes the recognition of key
terms such as gene and protein names, the extraction of protein functions and
protein-protein interactions, the mining of biological processes, and the discovery
of relationships and associations among those terms [5], [12], [14], [16], [20], [26],
[34], [38], [47], [61] and [87]. Unlike previous studies, our method uses a
combination of abstract-level associations through the use of an association rule
mining, and sentence-level associations through the use of rule-based pattern
matching for deriving new associations, described in Section 3.4.

3.2 Technical background
3.2.1 Association rule mining
Association rule mining technique was introduced by Agrawal et al. [2] for
finding associations among items in transaction databases. Given a transaction
database D containing a set of transactions T = {t1, t2, ..., tn} and a set of binary
items I = {i1, i2, ..., im}, a rule X → ij | (s, c) is an association rule, where X is
a set of some items in I, ij is a single item in I that is not present in X, s is a
user-defined minimum support, and c is a user-defined minimum confidence.
An association rule mining algorithm can typically be divided into two steps:
1) frequent item set search and 2) rule generation. Frequent item set search is the
problem of finding all combinations of items, known as item sets, in I from all
transactions in the database that satisfy the user-specified minimum support.
Then, with the generated item sets, the rule generation produces a number of
association rules that satisfy the user-specified confidence level. Association rule
mining has been extensively studied and several efficient algorithms have been
proposed to improve the computational time and memory requirements. Useful
resources regarding this algorithm can be found in [11], [36], [80] and [18].
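To make the two steps concrete, the following is a minimal, illustrative level-wise search in the spirit of Apriori, not one of the optimized algorithms cited above; min_support is an absolute transaction count and min_conf is a fraction:

```python
def frequent_itemsets(transactions, min_support):
    """Step 1: level-wise search for item sets whose support (absolute
    transaction count) is at least min_support."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Candidate (k+1)-item sets are unions of surviving k-item sets.
        current = {a | b for a in survivors for b in survivors
                   if len(a | b) == len(a) + 1}
    return frequent

def generate_rules(frequent, min_conf):
    """Step 2: emit X -> y rules (single-item consequent) whose confidence
    support(X union {y}) / support(X) is at least min_conf."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for y in itemset:
            x = itemset - {y}
            conf = support / frequent[x]
            if conf >= min_conf:
                rules.append((tuple(sorted(x)), y, conf))
    return rules
```

The confidence lookup frequent[x] always succeeds because every subset of a frequent item set is itself frequent (downward closure).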
3.2.2 TF-IDF weighting scheme
The TF-IDF weighting scheme is widely used in text classification. The
TF-IDF value of a word wi in a document d can be computed as a combina-
tion of term frequency TF and inverse document frequency IDF. The inverse
document frequency is calculated by Equation 3.1:

IDF = log(N / DF)    (3.1)
where N is the total number of documents in a corpus, and DF is the
number of documents in which the word wi occurs at least once. Therefore,
TF-IDF can be formulated by Equation 3.2:

TF-IDF = TF × log(N / DF)    (3.2)

According to the TF-IDF weighting scheme, a word that occurs in many
documents is less important than words that are present in fewer documents,
because its inverse document frequency is lower. In addition, a word that is
present many times in a document is more important because its TF is higher.
With the TF-IDF weighting scheme, words that are common to all documents,
such as "introduction" or "conclusion" in biomedical abstracts, will have lower
values, and are removed in a word selection process.
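As a small illustration of Equation 3.2 over tokenized documents (assuming the natural logarithm, since the base is not stated here):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: TF-IDF} dict per tokenized document, with
    TF-IDF = TF * log(N / DF)."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(d).items()}
            for d in docs]
```

A word that occurs in every document gets log(N/N) = 0, mirroring the removal of corpus-wide common words described above.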
The TAMCSI (text association mining with cross-sentence inference) method
is general and can be applied to any domain. For example, in the chromosome-21
domain, the domain used here, the method discovers associations among key
terms. Those key terms are selected using the extended TF-IDF scheme that
is defined in a later section. With the extended TF-IDF weighting scheme,
the chosen key terms are more relevant to the chromosome-21 domain than those
selected by other existing methods, such as document frequency or traditional TF-IDF.
By using part-of-speech tagging and regular expressions, association sentences
are broken up into subject phrase, verb phrase, and object phrase, and the
cross-sentence inference is then performed. The details of our techniques are

described below.
3.3 Data set
We retrieved a set of biomedical abstracts from the NCBI PubMed database
through our customized NCBI Entrez Utilities retrieval tool. Using the word
"chromosome 21" and its variations as the search keywords, and January 01,
1990 to March 22, 2007 as the date range, we obtained 3,730 abstracts. We de-
terministically selected 22 abstracts containing our association/correlation pat-
terns, and randomly chose another 278 abstracts from the set of abstracts that con-
tained our scientifically-reported key phrases such as "conclusion" and "demon-
strate that".
3.3.1 Pre-processing
One of the main problems of text mining in biomedical text is that there is
a considerable number of specialized terms and phrases. Each of these terms
and phrases typically consists of several words. Therefore, using a word-based
representation of a textual document can result in a significant loss of informa-
tion. For example, in word-based representation, the phrase "human chromosome
21 pair" will be broken into four words: "human", "chromosome", "21" and
"pair". Based on Blake et al.'s study [6], using the semantic richness of features for
text representation can improve the plausibility and usefulness of association rules.
They utilized an existing knowledge base, the Unified Medical Language System
(UMLS), to map each clause to a medical concept. Slightly different from Blake
et al.'s study, instead of mapping clauses to concepts, we map terms/phrases

to biomedical concepts. In the first step of our pre-processing, we extract
term/concept pairs from UMLS description and supplementary XML files using
the Piccolo Java library for XML. Then, with Java regular expressions and the
extracted UMLS term/concept pairs, we search for terms/phrases present in
abstracts and convert them to their corresponding concepts. Next, we replace all
spaces and punctuation within a concept with underscores. The replacement
transforms concept phrases into single words. Then, we employ the Porter stemmer
program [71] to stem words that do not contain underscores. Next, we split
each abstract into sentences and perform part-of-speech tagging using
the MontyLingua program. Then, we select only abstracts that contain
scientifically-reported sentences. In other words, only abstracts that contain key
phrases and association patterns are chosen.
In biomedical literature, some words are common and add no meaning to the
content, for instance, "method", "introduction", "background", "conclusion",
"experimental result", etc. Accordingly, they can be considered as stop words
for biomedical literature. Because these stop words are usually used in several
abstracts in a data set, they tend to have high word frequency counts. There-
fore, a list of these stop words could be constructed by selecting top-frequency
words with a specific threshold, for example, selecting the top 5% most frequent
words as stop words for biomedical literature. Moreover, important words in
biomedical literature are typically presented as nouns. As a result, in addi-
tion to English stop words and biomedical-literature stop words, words that are
not identified as nouns from the results of the part-of-speech tagging program

are not used in our association mining. To account for some typos and to further
reduce dimensionality, we also remove words that occur less than three times in
a given set of selected abstracts.
Finally, we represent an abstract as a comma-delimited list of words. The
representation is then used as input to an association rule mining algorithm, the
AprioriT algorithm [19]. In addition to this pre-processing, to perform cross-
sentence inference, we divide sentences from the chosen abstracts into three
parts (subject phrase, verb phrase, and object phrase). We achieve this using a
combination of regular expressions and part-of-speech tags.
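The concept-mapping and underscore-replacement steps above can be sketched as follows; the dict stands in for the extracted UMLS term/concept pairs, and stemming, POS tagging and stop-word removal are omitted for brevity:

```python
import re

def to_word_list(abstract, term_to_concept):
    """Map known terms/phrases to concepts, join multi-word concepts with
    underscores, and emit the comma-delimited word list used as mining input."""
    text = abstract.lower()
    # Replace longer phrases first so sub-phrases do not shadow them.
    for term in sorted(term_to_concept, key=len, reverse=True):
        concept = re.sub(r"[^\w]+", "_", term_to_concept[term].strip().lower())
        text = text.replace(term.lower(), concept)
    return ",".join(re.findall(r"\w+", text))
```
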
3.4 The TAMCSI methods
A high-level diagram for the text association mining with cross-sentence
inference process is shown in Figure 3.1.
3.4.1 Extended TF IDF
According to Jing et al.'s study [42], using the TF-IDF weighting scheme
for filtering words improves classification accuracy by approximately 15% over
using the IDF method. We extend and apply Jing et al.'s findings for our word
selection. Our extended TF-IDF (ExtTF-IDF) method is an extension of the TF-IDF
weighting scheme that additionally considers positional information of the words
in a document. The underlying assumption is that words in the title are more
relevant to the document theme than those in the content. As a result, title
words will be assigned a higher numerical weight than content words. The extended
TF-IDF of a word is formulated by Equation 3.3:

ExtTF-IDF = (Wt × TFt + Wc × TFc) × log(N / DF)    (3.3)
Figure 3.1: The overall process diagram of the TAMCSI.

where Wt is the positional weight of the title, Wc is the positional weight of
the content, TFt is the term frequency of a word present in the title of an abstract,
TFc is the term frequency of a word present in the content of an abstract, DF is
the number of documents in which a word occurs at least once, N is the total
number of documents in a corpus, and Wt > Wc. In our experiment, we assign
Wt = 0.8 and Wc = 0.2. After computing the ExtTF-IDF value for each word in
the selected abstracts, we sort the words in descending order according to their
values. Then, we select the top 200 words for further experiments.
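A sketch of Equation 3.3 under stated assumptions: the natural logarithm, and per-word scores summed over the corpus before ranking, which is one plausible reading of the sorting step:

```python
import math
from collections import Counter

def ext_tf_idf(docs, w_t=0.8, w_c=0.2):
    """docs is a list of (title_tokens, content_tokens) pairs; a word's
    weight is (w_t * TFt + w_c * TFc) * log(N / DF), summed over documents."""
    n = len(docs)
    df = Counter(w for title, content in docs for w in set(title) | set(content))
    scores = Counter()
    for title, content in docs:
        tf_t, tf_c = Counter(title), Counter(content)
        for w in set(tf_t) | set(tf_c):
            scores[w] += (w_t * tf_t[w] + w_c * tf_c[w]) * math.log(n / df[w])
    return scores

def top_words(docs, k=200):
    """Rank words by ExtTF-IDF and keep the top k for the mining step."""
    return [w for w, _ in ext_tf_idf(docs).most_common(k)]
```
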
3.4.2 Scientifically-reported association search
To obtain scientifically-reported associations from the retrieved abstracts,
we perform the following steps. First, we define a number of key phrases and
utilize them to extract sentences. These key phrases are used to locate where
the researchers report their results in abstracts. Based on our manual review of
a number of retrieved abstracts, we select a number of key phrases to be used
in this step as follows: "conclusion", "in conclusion", "indicate that", "reveal
that", "show that", "confirm that", "demonstrate that", "prove that", "result
in", "suggest that", "find that", "as a result" and "consequently". With these
key phrases, we obtain a number of sentences and their corresponding abstract
identification numbers (PubMed ID). The abstract identification numbers are
then employed for selecting abstracts for an association rule mining. We call
the extracted sentences scientifically-reported sentences. From these sentences,
we further search for sentences that contain the patterns "be associated with/to"
or "be correlated to/with". We term sentences extracted with these patterns
scientifically-reported associations. We utilize regular expressions to account for
variations of the key phrases and patterns above. For example, we also include
"indicated that" and "indicates that" for the "indicate that" key phrase, and
"been associated with", "are associated with", "is [adverb] associated with",
etc. for the "be associated with/to" key pattern.
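One way to sketch this matching is a single regular expression covering the be-verb, an optional adverb, and the associated/correlated variants; it is illustrative rather than the exact pattern set used in the experiments:

```python
import re

ASSOC_RE = re.compile(
    r"\b(?:is|are|was|were|been|be)\s+"   # be-verb
    r"(?:\w+ly\s+)?"                      # optional adverb, e.g. "strongly"
    r"(?:associated|correlated)\s+(?:with|to)\b",
    re.IGNORECASE,
)

def is_reported_association(sentence):
    """True if the sentence contains a 'be associated/correlated with/to'
    pattern variant."""
    return ASSOC_RE.search(sentence) is not None
```
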
3.4.3 Statistical association rule generation
Statistical association rules are generated using a customized version of the
AprioriT algorithm [19] that works well with sparse data sets such as text. We con-
struct an input table using 300 abstracts and 200 words, in which each abstract
is represented as a list of chosen words. We set the lower bound of the support
value to 5% of the total number of documents, and the minimum confidence value
to 80%. We require that the selected abstracts come from the abstracts
that contain (i) scientifically-reported associations and (ii) scientifically-reported
sentences. Then, words are chosen using the top 200 ExtTF-IDF values.
3.4.4 Association rule to sentence conversion
We convert association rules generated by the AprioriT algorithm [19] into
sentences as follows: a word association rule represents a co-occurrence among
words in the same abstracts that passes user-specified thresholds, such as support
and confidence. As a result, words in a rule can be considered statistically
associated. Therefore, semantically, we can transform a rule into a sentence
using the key phrase "is/are associated with". For instance, the rule A, B, C → X
can be transformed into "A, B and C are associated with X".
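This conversion can be sketched directly:

```python
def rule_to_sentence(antecedent, consequent):
    """Convert an association rule {A, B, C} -> X into the sentence
    'A, B and C are associated with X' ('is' for a single antecedent item)."""
    items = list(antecedent)
    if len(items) == 1:
        subject, verb = items[0], "is"
    else:
        subject, verb = ", ".join(items[:-1]) + " and " + items[-1], "are"
    return f"{subject} {verb} associated with {consequent}"
```
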
3.4.5 Cross-sentence inference
The goal of cross-sentence inference is to derive and discover meaningful and
novel associations using the statistical co-occurrence of words and the syntactic

structures of sentences. The underlying assumptions behind cross-sentence in-
ference are that: (i) words that co-occur in an abstract are related to each other;
accordingly, words in an association rule that pass given user-specified support
and confidence are also related, (ii) the related words could be interchangeable,
and (iii) if a sentence can be broken up into subject phrase, verb phrase, and
object phrase, a word that appears in the sentence-part could be used to repre-
sent that part. Based on these assumptions, cross-sentence inference is carried
out by replacing a phrase in one sentence with a phrase or a word from another
sentence. The result of cross-sentence inference can be formulated as follows:
given that a sentence consists of three phrase parts (subject, verb and object),
a crossed sentence is a sentence in which one phrase is replaced by a different
phrase from another sentence.
To perform a cross-sentence inference, each sentence in the scientifically-
reported-association sentences and in the converted association-rule sentences
is divided into the three parts. Then, for each item in an association rule, we
search for the same item in the scientifically-reported-association sentences. If
there is a common item between two types of associations, we perform cross-
sentence inference by replacing a phrase in one sentence with another phrase
in the other sentence. To better illustrate this idea, let us use the following
example.
Scientific association:
An association sentence "A is associated with B" is extracted from an abstract.
"A" is the subject phrase, "is associated with" is the verb phrase, and "B" is
the object phrase.

Statistical association:
An association rule, X → Y, is generated and selected by the AprioriT algorithm.
The rule is converted into "X is associated with Y", where "X" is the subject
phrase, "is associated with" is the verb phrase, and "Y" is the object phrase.
Connecting item: m
There is a connecting item or common item m present in the object phrase of
the scientific association (B) and in the subject phrase of the statistical associ-
ation (X).
A cross-sentence inference can be derived as follows: "A is associated
with m" and "m is associated with Y". Therefore, a potentially novel hypothesis
can be derived: "A is associated with Y".
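The case formulated above, where the connecting item is shared by the object phrase of a scientific association and the subject (antecedent) side of a statistical association, can be sketched as follows; the other phrase combinations illustrated by the hypotheses in Section 3.5.5 would be handled analogously:

```python
def cross_sentence_inference(scientific, rules):
    """scientific: (subject_phrase, object_phrase) pairs; rules:
    (antecedent_items, consequent) pairs. A connecting item shared by an
    object phrase and a rule antecedent yields the hypothesis
    'subject is associated with consequent'."""
    hypotheses = []
    for subject, obj in scientific:
        obj_words = set(obj.lower().split())
        for antecedent, consequent in rules:
            if obj_words & {a.lower() for a in antecedent}:
                hypotheses.append(f"{subject} is associated with {consequent}")
    return hypotheses
```
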
3.5 Experiments and results
3.5.1 Scientifically-reported sentences
From 3,730 retrieved abstracts, we extracted 1,345 scientifically-reported
sentences. Examples of these sentences are:
1. Our result indicates that DYRK1A could be a key molecule bridging be-
tween beta-amyloid production and tau phosphorylation in AD (PubMed
ID: 17135279).
2. CONCLUSIONS: MALDI-TOF mass spectrometry is a robust and repro-
ducible method for the detection of trisomy 21 (PubMed ID: 17108765).
3. These results suggest that SIM2s has breast tumor suppressive activity
(PubMed ID: 16840439).

4. Evidence suggests that one-carbon/transsulfuration (1C-TS) metabolism
is abnormal in persons with DS (PubMed ID: 16541333).
5. In this perspective I review recent studies that suggest that constitutional
trisomy 21 promotes leukemic transformation during fetal hematopoiesis
(PubMed ID: 16270377).
6. These findings indicate that mosaic mitotic error of chromosome-21 is as-
sociated with non-viability (PubMed ID: 14742700).
For a scientifically-reported sentence, a simpler sentence could be further
extracted after the key phrase. For example, from the 6th sentence above, we
obtain the simpler sentence "mosaic mitotic error of chromosome-21 is associ-
ated with non-viability", because this sentence comes after the "indicate that"
key phrase.
3.5.2 Scientifically-reported association
22 out of the 1,345 scientifically-reported sentences contain scientifically-
reported associations. In other words, these 22 sentences contain either the
patterns "be associated with/to" or "be correlated with/to", including their
pattern variations. A subset of these associations is shown in Table 3.1.
3.5.3 Word selection
After pre-processing, we sort words based on their ExtTF-IDF values and
select the top 200 words for the mining process. A subset of those words is
shown in Table 3.2.
To demonstrate the effectiveness of our weighting scheme, we compare the
words selected using different weighting schemes in Table 3.3. According

Table 3.1: Scientifically-reported association.
No. Scientifically-reported association
1. KIT mutations are strongly associated with a poor prognosis in pediatric t(8;21) AML. (PubMed ID: 16291592)
2. Activating mutations of receptor tyrosine kinases are associated with distinct genetic subtypes in AML. (PubMed ID: 16254134)
3. The measurement of lactate in fetal blood scalp seems correlated to the fetal scalp pH. (PubMed ID: 15848081)
4. The deletions in chr.8p21.3-23 and the alterations in the c-myc locus are independently associated with the development of CaCx. (PubMed ID: 15491757)
5. Bethlem myopathy linked to chromosome 21 is associated with a secondary decrease in laminin beta 1 expression. (PubMed ID: 10407855)

Table 3.2: A subset of words selected by using our ExtTF-IDF.
No. Stemmed word ExtTF-IDF
1 amll 49.1015
2 ds 46.02380
3 patient 43.9255
4 down_syndrome 38.7943
5 protein 37.1325
6 gene 36.1595
7 cell 35.8095
8 express 30.6365
9 diseas 30.1116
10 ami 29.1109
11 leukemia 29.0028
12 brain 28.7566
13 map 28.6569
14 mutat 27.0876
15 ad 26.7154

to the results, using the document frequency weighting scheme, seven out of
the top 15 words seem to be general words, among them: studi (study),
analysi (analysis), result (result), sequenc (sequence), level (level) and factor
(factor). For the TF-IDF weighting scheme, two out of the top 15 words seem
to be common: case (case) and sequenc (sequence). In contrast, using
our ExtTF-IDF, all 15 selected words are meaningful, such as ad (Alzheimer's
disease), leukemia (leukemia) and amll (one of the chromosome-21 genes).
3.5.4 Statistical association
From the set of 300 abstracts and 200 words chosen based on the ExtTF-
IDF values (from high to low), we obtain 487 association rules from the AprioriT
algorithm [19]. A subset of those association rules is shown in Table 3.4.
The association rules in Table 3.4 are very encouraging. Without
prior knowledge regarding chromosome-21, we can conclude
that human chromosome-21 genes are associated with abnormalities in the brain or
Down syndrome. In addition, because the literature was retrieved using the word
"chromosome 21" and its variations, we might be able to infer that AML-1 and
ETO are related to chromosome-21, or that they are chromosome-21 genes. Then,
we might also infer that they play an important role in Down syndrome. The
promise of these association rules is partly a result of using the Unified Medical
Language System (UMLS) in the pre-processing step.
Although the generated association rules seem to be promising, we can see
that they present mostly the facts or co-occurrences of key words. Our cross-
sentence inference technique allows an automatic derivation of potentially novel

Table 3.3: Comparisons of subsets of selected words using different weighting schemes.
No. Document frequency TF-IDF ExtTF-IDF
1 gene ds amll
2 chromosomes_human_pair_21 patient ds
3 studi amll patient
4 chromosom cell down_syndrome
5 cell protein protein
6 analysi gene gene
7 result ad cell
8 region ami express
9 patient case diseas
10 express express ami
11 protein mutat leukemia
12 level sequenc brain
13 diseas region map
14 sequenc brain mutat
15 factor chromosom ad

Table 3.4: Association rules generated by AprioriT algorithm.
No. Association rules
1 {abnorm, brain} → {chromosomes_human_pair_21}, 100.0%
2 {ag, down_syndrome} → {chromosomes_human_pair_21}, 100.0%
3 {brain, down_syndrome} → {chromosomes_human_pair_21}, 100.0%
4 {eto} → {amll}, 100.0%
5 {ami, eto} → {amll}, 100.0%
6 {brain, down_syndrome, overexpress} → {chromosomes_human_pair_21}, 100.0%
7 {brain, retard} → {chromosomes_human_pair_21}, 100.0%
8 {amll, chromosom} → {transloc}, 94.11%
9 {amll, leukemia} → {transloc}, 86.2%
10 {alzheim} → {chromosomes_human_pair_21}, 80.0%

and meaningful hypotheses using abstract-level associations from association
rules and sentence-level associations from extraction. The results of our cross-
sentence inference method are demonstrated in the next section.
3.5.5 Cross-sentence inference
We perform cross-sentence inference between the scientifically-reported as-
sociations found in the previous section and the association rules generated by
AprioriT [19] algorithm. We obtain approximately 800 crossed sentences that
could be considered potentially novel and meaningful associations. After a man-
ual review, a subset of them is shown below. The character "|" is used to separate a
sentence into the subject, verb and object phrases, and the singular present
tense is used for crossed sentences.
Hypothesis 1:
Scientific association: CD56 expression in AML with t ( 8 ; 21 ) | is associated
with | significantly shorter CR duration and survival.
Statistical association: ami, transloc | is associated with | leukemia.
Connecting item: ami
Cross sentences:
leukemia | is associated with | significantly shorter CR duration and survival.
Hypothesis 2:
Scientific association: CD56 expression in AML with t ( 8 ; 21 ) | is associated
with | significantly shorter CR duration and survival.
Statistical association: eto | is associated with | ami.
Connecting item: ami

Cross sentences:
eto | is associated with | significantly shorter CR duration and survival.
Hypothesis 3:
Scientific association: KIT mutations | are strongly associated with | a poor
prognosis in pediatric t ( 8 ; 21 ) AML.
Statistical association: ami, transloc | is associated with | leukemia.
Connecting item: ami
Cross sentences:
KIT mutations | is strongly associated with | leukemia.
Hypothesis 4:
Scientific association: activating mutations of receptor tyrosine kinases | are
associated with | distinct genetic subtypes in AML.
Statistical association: down_syndrome, mutat | is associated with | chromosomes_human_pair_21.
Connecting item: mutat
Cross sentences:
chromosomes_human_pair_21 | is associated with | distinct genetic sub-
types in AML.
Hypothesis 5:
Scientific association: several of these translocations/inversions | were asso-
ciated with | the loss of sequences located in the vicinity of the chromosomal
breakpoints.

Statistical association: ami, amll, transloc | is associated with | leukemia.
Connecting item: transloc
Cross sentences:
leukemia | is associated with | the loss of sequences located in the vicinity
of the chromosomal breakpoints.
Hypothesis 6:
Scientific association: several of these translocations/inversions | were asso-
ciated with | the loss of sequences located in the vicinity of the chromosomal
breakpoints.
Statistical association: ami, leukemia, transloc | is associated with | amll.
Connecting item: transloc
Cross sentences:
amll | is associated with | the loss of sequences located in the vicinity of
the chromosomal breakpoints.
Hypothesis 7:
Scientific association: acute nonlymphocytic leukemia with t (16; 21) | is asso-
ciated with | multilineage leukemic differentiation.
Statistical association: ami, transloc | is associated with | leukemia.
Connecting item: leukemia
Cross sentences:
ami, transloc | is associated with | multilineage leukemic differentiation.
Cross-sentence inference is achieved based on a keyword that is common
between a scientifically-reported association and a statistical association rule.

For example, in Hypothesis 2, the word "ami" acts as a connecting item be-
tween a scientifically-reported association and an association rule. From the scientific
association, we can derive that a type of ami is associated with significantly
shorter CR duration and survival. We can also draw from the association rule
that eto is associated with ami. As a result, our cross-sentence inference
method infers a potentially novel and meaningful hypothesis: that eto is asso-
ciated with significantly shorter CR duration and survival rates.
3.6 Conclusions
We developed a novel text association mining technique termed Text As-
sociation Mining with Cross-Sentence Inference (TAMCSI) that exploits both
statistical co-occurrences of words and syntactic structure of sentences in order
to derive novel and meaningful associations. Using cross-sentence infer-
ence, we could derive a number of potentially novel associations between the
scientifically-reported associations and association rules. The meaningfulness
of a crossed sentence highly depends on the role the connecting item performs
in the scientifically-reported association. If the connecting item is the
main subject in the subject phrase of an association, or the main object in the
object phrase, a meaningful crossed sentence can be inferred. In addition to
the TAMCSI method, we introduced an extended weighting scheme called the
ExtTF-IDF that takes into account the positional property, or location, of
words in documents. Words that appear in the title tend to be more relevant
to the document than those that occur in the content. As a result, with the
ExtTF-IDF, title words are assigned higher numerical weights than content
words. The difference between our weighting scheme and TF-IDF lies
basically in the positional property of words in the documents. A list of words
selected using our ExtTF-IDF has more words that appear in the title than
one selected using the TF-IDF approach. Consequently, the words selected with
ExtTF-IDF for use in the mining process are more relevant to the documents.
Our experimental results demonstrate that
using the ExtTF-IDF retains a greater number of relevant words (and fewer
general words) than using the TF-IDF and document frequency approaches.
There is still room for improvement in our TAMCSI technique. A
sentence compression method could be applied in the pre-processing step to
convert complex and compound sentences into simple sentences. This con-
version would considerably facilitate the replacement of the connecting item
and phrase. In addition, rather than extracting only scientifically-reported as-
sociations from text, we could extend the method to other types of relations,
such as interactions.

4. Structure-Based Document Model: Textual Document
Representation Based on Frequency, Pattern and Document Structure
4.1 Introduction
Document representation is one of the crucial tasks in text mining, par-
ticularly for document classification or clustering. Its traditional approach is
based on the bag of words or Vector Space Model (VSM), where a docu-
ment is represented by a vector of weights of unique terms selected from a data
set. Weights are typically computed from frequency of term occurrences either
within a document (term frequency) or across a data set (document frequency),
or both. In a typical document classification/clustering, the abstracts of articles
are normally used instead of the whole documents. However, with the increase
of public databases which provide full-text articles, such as PubMed, the direc-
tion of text mining research changes towards using full content of the articles.
As expected, several studies have shown that full-text articles have more useful
characteristics than abstracts. Schuemie et al. investigated the distribution of
information in biomedical abstracts and full-text documents. They found that
although the information density in the abstracts is higher than that in the full-
text documents, the information coverage in full texts is substantially greater
than that in the abstracts [77]. In addition to the greater information coverage
than in the abstracts, a full-text scientific article is usually organized into sec-
tions such as Title, Abstract, Introduction, Methods, Results, Discussion and

Conclusion. Shah et al. found that the keyword content in different sections is
heterogeneous and there is a correspondence between different sections and the
type and density of information they contain [78]. For example, the Introduc-
tion section is the best place to extract protein and gene names for biomedical
articles, while the Method section is good for looking for technical data. Hak-
enberg et al. used a heuristic section-weighting approach for assigning different
weights to term occurrences in different sections [33]. Their experimental results
demonstrated that using a combination of section weighting and a suitable set
of features improved classification performance by 18% as compared with the
baseline classifier. Thus, using full-text articles is better than using just the
abstracts, and exploiting the logical structure of documents can improve the
performance of text mining.
In addition to the vector space representation, Park et al. introduced a
concept of term signal that takes into account both frequency information and
patterns of term occurrences in a document [64], [65], [66], [67], [68], and [69].
Term signal is a representation that describes the frequency of a term in physical
locations of a document. With term signal, a document is first divided into a
number of partitions based on the sequence and the number of terms in the docu-
ment. Then, a term is represented as a vector of frequencies of term occurrences
in those partitions. Finally, a document, consisting of a number of chosen terms,
is represented as a vector of term signals. We will use the term spectral-based
document model (SPDM) to refer to the document representation using the term
signal concept. By representing a document using the term signal concept, the
similarity between two documents is based on both the frequency and the pat-

tern of term occurrences. In other words, two documents are considered similar
if they contain mostly the same set of terms with approximately the same fre-
quency of occurrences, and those terms in the two documents mostly occur in
the same user-defined sections. Park et al. used a number of mathematical
transforms, the Cosine, Fourier, and Wavelet, on their spectral-based document
representation and computed document ranking based on query terms [64], [65],
[66], [67], [68], and [69]. Pryczek and Szczepaniak applied this term signal con-
cept to document classification using Fourier transforms [73]. Using term signal
for document representation was shown to be better than the traditional vector
space representation for information retrieval [64], [65], [66], [67], [68], and [69]
and for document classification [73].
In this chapter, we present two structure-based document models (SDMs),
termed SDM with Discrete Wavelet Transforms (SDMDWT) and SDM with
Significant Term Weight (SDMSTW). These models are based on a novel con-
cept of structure-based term signal where a term is represented as a vector of
its frequencies of occurrences in different components of a document, and a doc-
ument is in turn represented as a vector of structure-based term signals. With
structure-based term signal, a document is initially divided into components or
sections based on its actual logical or structural organization. Next, frequen-
cies of term occurrences in those logical components are counted for all selected
terms, and then a structure-based term signal of each term is represented by a
vector of frequencies of term occurrences. Finally, a document is represented as
a vector of the structure-based term signals. The structure-based term signal is
similar to the term signal proposed by Park et al. [64], [65], [66], [67], [68], and

[69] in that it describes frequencies of term occurrences in different components
of a document.
The difference is that each component or section in a structure-based term
signal corresponds to each logical component of a document. In other words, it
enhances the term signal via logical document division. Structure-based term
signal not only takes into consideration frequency and patterns of term occur-
rences, but also the logical organization of the document. It is finer-grained than
the Vector Space Model and more flexible and relevant to the actual document
structure than the Spectral-Based Document Model.
We evaluated and compared the SDMDWT, SDMSTW, SPDMDWT and
VSM document models on four data sets: the WebKB 4-Universities, TREC
Genomics 2005, Reuters-21578 and 20-Newsgroups, using Support Vector Ma-
chines binary classification with Micro- and Macro-averaged Precisions, Recalls
and F-measures. Support Vector Machines (SVMs) have been shown to work
well with high-dimensional data, and, therefore, are appropriate for document
classification [44]. Micro- and Macro-averages measure averaged performance
of classification within the classes and within documents [102]. The results
showed that our SDMDWT and SDMSTW, in general, performed better than
the SPDMDWT and VSM, based on Micro- and Macro-averaged Recalls and
F-measures.
This chapter is organized as follows. We present the technical background
in Section 4.2, which covers the term signal concept, term and signal weighting
schemes, Discrete Wavelet Transform, Support Vector Machines, and Informa-
tion Gain. Then, we describe our structure-based document models and the

data sets used, including their preprocessing, in Section 4.3. After that, we
explain our evaluation method and provide results in Section 4.4. Finally, we
discuss the results in Section 4.5, and conclude with Section 4.6.
4.2 Technical background
In this section we describe background methods useful for understanding
the rationale behind our approach. We begin with the concept of term signal
(that our method is built on), weighting schemes for weighting terms and term
signals, Discrete Wavelet Transform used to transform term signals, Support
Vector Machines for binary classification, and Information Gain used in feature
selection.
4.2.1 Term signal
A term signal, introduced by Park et al. [64], [65], [66], [67], [68], and [69], is
a vector representation of terms that describes frequencies of term occurrences in
particular partitions within a document. To construct a term signal, a document
is first divided into a user-defined number B of sections. Then, the term signal
of a term t in a document d can be represented as a vector over those physical
sections, as shown by Equation 4.1:

s(t,d) = [f_{t,1,d}, f_{t,2,d}, ..., f_{t,B,d}],    (4.1)

where f_{t,b,d} is the frequency of term t in section b of document d, for 0 < b <= B.
f_{t,b,d} can also be considered the bth signal component of the term signal s(t,d).
For example, suppose that a document consisting of a sequence of 32 words is
divided into eight partitions. There will be four words per partition or bin. The
document d can now be graphically represented, as shown in Figure 4.1.

Figure 4.1: Document d with 8 partitions.
Accordingly, if a term t occurs once in each of the 2nd, 3rd, 10th and 24th
word positions of document d, its term signal s(t,d) can be represented by
Equation 4.2 and depicted by Figure 4.2:

Figure 4.2: Term signal t in document d.

s(t,d) = [2, 0, 1, 0, 0, 1, 0, 0].    (4.2)

As shown in Figure 4.2, term t occurs two times in the 1st bin, one time in
the 3rd bin and one time in the 6th bin, respectively. With the spectral document
model, a document d containing term signals s(t_1,d), s(t_2,d), ..., s(t_n,d) can
now be represented by Equation 4.3:

d = [s(t_1,d), s(t_2,d), ..., s(t_n,d)].    (4.3)
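As an illustration of Equations 4.1 and 4.2, the binning step can be sketched in a few lines of Python; the helper name `term_signal` is ours, not from the dissertation:

```python
def term_signal(positions, doc_len, bins):
    """Count term occurrences per equal-width partition (cf. Equation 4.1)."""
    width = doc_len // bins
    signal = [0] * bins
    for p in positions:  # 1-based word positions of the term in the document
        signal[min((p - 1) // width, bins - 1)] += 1
    return signal

# Term t occurring at word positions 2, 3, 10 and 24 of a 32-word document:
print(term_signal([2, 3, 10, 24], doc_len=32, bins=8))  # -> [2, 0, 1, 0, 0, 1, 0, 0]
```

The output reproduces the signal of Equation 4.2: two occurrences fall in the first four-word bin, one in the third and one in the sixth.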

4.2.2 Weighting schemes
Weighting scheme is an assignment of numerical weights to terms in a vector
space model. Weight of a term can be computed using a number of parame-
ters such as term frequency, document frequency, document length, number of
documents in a data set, etc. One of the most commonly used term-weighting
schemes for document classification is TF IDF, which stands for term fre-
quency multiplied by the inverse of document frequency, as shown by Equation
TF IDF = TF xlg(AyDF). (4.4)
where TF is term frequency or frequency of term occurrences within a doc-
ument, N is the total number of documents in a data set and DF is document
frequency or frequency of term occurrences across a data set. The underlying
assumption of TF IDF weighting scheme is that terms that occur in many
documents, i.e., high DF, are common and do not represent documents well. In
contrast, terms that occur very often in a document, i.e., high TF, are consid-
ered important features of the document. TF I DF compensates between term
frequency and document frequency.
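A minimal sketch of Equation 4.4 in Python, under the assumption that lg denotes the base-2 logarithm (the function name is ours):

```python
from math import log2

def tf_idf(tf, df, n_docs):
    """TF-IDF per Equation 4.4: term frequency times lg(N/DF)."""
    return tf * log2(n_docs / df)

# A term occurring 3 times in a document and present in 10 of 1000 documents:
print(round(tf_idf(tf=3, df=10, n_docs=1000), 4))  # -> 19.9316
```

The same term occurring in 500 of the 1000 documents would score only 3 × lg(2) = 3, reflecting the scheme's penalty on common terms.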
For document representation using term signals, variations of the TF-IDF
weighting scheme can be used for weighting a term signal, as described in Park
et al.'s study [66]. One of the variations, PTF-IDF, is formulated by Equation
4.5:

PTF-IDF = (1 + lg(f_{t,d})) (f_{t,b,d} / f_{t,d}) × lg(N/DF),    (4.5)

where f_{t,d} is the frequency of occurrences of term t in document d, and f_{t,b,d} is
the frequency of occurrences of term t in partition b of document d, respectively.
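A sketch of one reading of the PTF-IDF weight: the partition frequency normalized by the document frequency, dampened by 1 + lg(f_{t,d}) and scaled by the IDF factor. The function name and this exact form are our assumptions:

```python
from math import log2

def ptf_idf(f_tbd, f_td, df, n_docs):
    """Assumed form of the PTF-IDF weight for partition b of document d:
    (1 + lg(f_td)) * (f_tbd / f_td) * lg(N / DF)."""
    return (1 + log2(f_td)) * (f_tbd / f_td) * log2(n_docs / df)

# Term occurring 4 times in the document, 2 of them in partition b:
print(round(ptf_idf(f_tbd=2, f_td=4, df=10, n_docs=1000), 4))  # -> 9.9658
```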
4.2.3 Discrete wavelet transforms
A wavelet is a mathematical function in the time/space domain, which can be
expressed by Equation 4.6:

ψ_{s,l}(t) = 2^{s/2} ψ(2^s t − l),    (4.6)

where s is a dilation or scaling parameter, l is a translation or time-/space-
location parameter, and s, l ∈ Z. For any function f(t) ∈ L^2(R) for which
ψ_{s,l}(t) forms an orthonormal basis for the space of signals of interest (in this
case, f(t)), the wavelet transform of f(t) can be computed by Equation 4.7:

w_{s,l} = < f(t), ψ_{s,l}(t) > = ∫ f(t) ψ*_{s,l}(t) dt,    (4.7)

where ψ*_{s,l}(t) is the complex conjugate of ψ_{s,l}(t). The wavelet transform can be
described by the concept of multi-resolution analysis, which is the decomposition
of a signal into sub-signals of different resolutions or scales. In each step of multi-
resolution analysis, a signal is decomposed into two sub-signals: approximation
and detail. The approximation sub-signal is expressed by a linear combination
of scaling functions φ(t), and the detail sub-signal is represented by a linear
combination of wavelet functions ψ(t).
The scaling function is a finite energy function in L^2(R), as defined by Equation 4.8:

φ_{s,l}(t) = 2^{s/2} φ(2^s t − l), where s, l ∈ Z.    (4.8)

In multi-resolution analysis, the subspaces spanned by the scaling function
must satisfy the following properties:

V_s ⊂ V_{s+1} for all s ∈ Z,

∩_s V_s = {0} and the closure of ∪_s V_s = L^2(R),

{0} ← ... ⊂ V_{−1} ⊂ V_0 ⊂ V_1 ⊂ ... → L^2.
The wavelet function is also a finite energy function in L^2(R) and is defined
by Equation 4.9:

ψ_{s,l}(t) = 2^{s/2} ψ(2^s t − l), where s, l ∈ Z.    (4.9)

The subspace W_s spanned by the wavelet function is the orthogonal complement
of V_s in V_{s+1}. In other words, V_s ⊥ W_s and V_{s+1} = V_s ⊕ W_s, which leads
to the following properties:

L^2 = ... ⊕ W_{−1} ⊕ W_0 ⊕ W_1 ⊕ ...,

W_{−∞} ⊕ ... ⊕ W_{−1} = V_0.
Let V_s be a subspace spanned by scaling functions and W_s be a subspace spanned
by wavelet functions, with V_0 ⊂ V_1 ⊂ ... ⊂ L^2 and V_{s+1} = V_s ⊕ W_s. Any function
f(t) ∈ L^2(R), where L^2(R) = V_0 ⊕ W_0 ⊕ W_1 ⊕ ..., can be mathematically
expressed by Equation 4.10:

f(t) = Σ_{l=−∞}^{∞} a_l φ_l(t) + Σ_{s=0}^{∞} Σ_{l=−∞}^{∞} d_{s,l} ψ_{s,l}(t).    (4.10)
The first term is mapped to the approximation sub-signal, and the second
term is referred to as the detail sub-signal. The coefficients a_l and d_{s,l} are the
discrete wavelet transform coefficients, which can be computed by Equations 4.11
and 4.12:

a_l = < f(t), φ_l(t) > = ∫ f(t) φ_l(t) dt,    (4.11)

d_{s,l} = < f(t), ψ_{s,l}(t) > = ∫ f(t) ψ_{s,l}(t) dt.    (4.12)
Discrete wavelet transforms can be efficiently calculated by using the filter
bank tree-structured algorithm. The filter bank tree of the discrete wavelet
transform of a signal f(t) can be recursively expressed by Equation 4.13:

f(t) = A^1 + D^1
     = A^2 + D^2 + D^1
     = ...
     = A^s + D^s + D^{s−1} + ... + D^1,    (4.13)

where A^s represents the approximation sub-signal and D^s corresponds to the
detail sub-signal at the sth level of the discrete wavelet transform of
the signal f(t), respectively.
As described by Equations 4.11 and 4.12, the coefficients of the discrete wavelet
transform are computed as inner products between the signal itself and the
scaling/wavelet functions. Useful numerical examples can be found in Walker's
work [94].
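The filter-bank recursion of Equation 4.13 can be illustrated with the simplest wavelet, the Haar wavelet. The toy decomposition below is our own sketch (not the dissertation's implementation), applied to the example term signal from Section 4.2.1:

```python
from math import sqrt

def haar_step(signal):
    """One filter-bank level: pairwise sums give the approximation,
    pairwise differences give the detail, both scaled by 1/sqrt(2)."""
    a = [(signal[i] + signal[i + 1]) / sqrt(2) for i in range(0, len(signal), 2)]
    d = [(signal[i] - signal[i + 1]) / sqrt(2) for i in range(0, len(signal), 2)]
    return a, d

def haar_dwt(signal):
    """Full Haar decomposition A^s + D^s + ... + D^1 (cf. Equation 4.13);
    the signal length must be a power of two."""
    details = []
    a = list(signal)
    while len(a) > 1:
        a, d = haar_step(a)
        details.insert(0, d)  # coarsest detail level first
    return a + [c for level in details for c in level]

coeffs = haar_dwt([2, 0, 1, 0, 0, 1, 0, 0])  # the term signal from Figure 4.2
print([round(c, 4) for c in coeffs])
```

Because the Haar basis is orthonormal, the transform preserves the signal's energy (the sum of squared components), which is a convenient sanity check on any implementation.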
4.2.4 Support vector machines
Support Vector Machines (SVMs) are an inductive machine learning tech-
nique based on structural risk minimization from statistical learning theory.
They were introduced by Vapnik in 1995 [93]. A classification problem using
SVMs is the problem of searching for a hyperplane that best separates positive
instances from negative instances. The hyperplane with the best separation of
instances is the one with the maximum margin, δ, i.e., the maximum distance
from the hyperplane to the closest training instances. The hyperplane functions
as the decision boundary, and finding the hyperplane is equivalent to solving an
optimization problem.
Let the tuples (x_i, y_i) be the input training instances, where i = 1, 2, ..., n,
x_i = (x_{i1}, x_{i2}, ..., x_{im}), y_i ∈ {+1, −1}, n is the number of training instances
and m is the number of attributes. The optimization problem for a non-linear SVM
and its Lagrangian objective function can be expressed by Equations 4.14 and 4.15:

minimize W(w, b) = (1/2) w · w,    (4.14)

subject to

y_i (w · Φ(x_i) + b) ≥ 1, i = 1, 2, ..., n,

minimize W(w, b, α) = (1/2) w · w − Σ_{i=1}^{n} α_i (y_i (w · Φ(x_i) + b) − 1),    (4.15)

where w and b are the model parameters, Φ is a mapping function into a
higher-dimensional feature space, α_i is a Lagrange multiplier, α_i ≥ 0, and
α_i (y_i (w · Φ(x_i) + b) − 1) = 0. Solving the optimization problem above is
usually carried out by first converting the minimization problem (primal) into
the maximization problem (dual) and then using the Lagrangian technique to
find the desired parameter values. Hence, the primal minimization problem in
Equation 4.15 can be converted into the dual maximization problem in Equation 4.16:

maximize W(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} y_i y_j α_i α_j K(x_i, x_j),    (4.16)

where K is a kernel function. The details of how to solve the optimization
SVM problems can be found in many interesting sources and textbooks such as
the book by Tan et al. [86]. Interesting publications regarding SVM are the
work by Joachims [44] and the one by Platt [70].
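To make Equation 4.16 concrete, the sketch below evaluates the dual objective W(α) for a toy two-point problem with a linear kernel; the helper names and the data are illustrative only, not part of the dissertation's experiments:

```python
def dual_objective(alpha, X, y, kernel):
    """Evaluate W(alpha) from Equation 4.16 for given Lagrange multipliers."""
    w = sum(alpha)
    n = len(X)
    for i in range(n):
        for j in range(n):
            w -= 0.5 * y[i] * y[j] * alpha[i] * alpha[j] * kernel(X[i], X[j])
    return w

def linear(u, v):
    """Linear kernel K(u, v) = u . v."""
    return sum(a * b for a, b in zip(u, v))

# Two separable points; alpha = [0.25, 0.25] satisfies sum(alpha_i * y_i) = 0:
X, y = [(1.0, 1.0), (-1.0, -1.0)], [1, -1]
print(dual_objective([0.25, 0.25], X, y, linear))  # -> 0.25
```

For this symmetric toy problem, α_1 = α_2 = 0.25 is in fact the dual optimum, corresponding to w = (0.5, 0.5) and b = 0 in the primal.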
4.2.5 Information gain
Information gain is based on information theory. It is a measure of the number of
bits of information gained for category prediction when the presence or absence
of a term in a document is known. The Information Gain for feature selection used in
this work is based on the study by Yang and Pedersen [103], and is formulated
by Equation 4.17:

IG(t) = − Σ_{i=1}^{m} Pr(c_i) log Pr(c_i)
        + Pr(t) Σ_{i=1}^{m} Pr(c_i|t) log Pr(c_i|t)
        + Pr(t̄) Σ_{i=1}^{m} Pr(c_i|t̄) log Pr(c_i|t̄),    (4.17)

where Pr(c_i) is the probability of class c_i, Pr(t) is the probability of term t,
Pr(c_i|t) is the probability that a document has class c_i given that it contains
the term t, and Pr(c_i|t̄) is the probability that a document has class c_i given
that it does not contain the term t.
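Equation 4.17 can be computed directly from per-document (class, term-presence) observations; a small sketch, with names of our own choosing:

```python
from math import log2

def info_gain(docs):
    """Information gain of a term (cf. Equation 4.17).
    `docs` is a list of (class_label, term_present) observations."""
    n = len(docs)
    classes = sorted({c for c, _ in docs})

    def plogp(probs):  # sum of p * log2(p), skipping zero probabilities
        return sum(p * log2(p) for p in probs if p > 0)

    p_t = sum(1 for _, present in docs if present) / n
    ig = -plogp([sum(1 for c, _ in docs if c == ci) / n for ci in classes])
    for present, p_branch in ((True, p_t), (False, 1 - p_t)):
        branch = [c for c, pr in docs if pr == present]
        if branch:
            ig += p_branch * plogp(
                [sum(1 for c in branch if c == ci) / len(branch) for ci in classes])
    return ig

# A term that perfectly predicts a balanced two-class label carries one bit:
docs = [("A", True), ("A", True), ("B", False), ("B", False)]
print(info_gain(docs))  # -> 1.0
```

A term whose presence is independent of the class scores zero, which is why low-IG terms are discarded during feature selection.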
4.3 Methods and data
4.3.1 Methods
We describe here two structure-based document models:
1. Structure-Based Document Model with Discrete Wavelet Transforms (SDMDWT), and
2. Structure-Based Document Model with Significant Term Weight (SDMSTW).
The SDMDWT document model [88] is an enhancement of the term signal
proposed by Park et al. [64], [65], [66], [67], [68], and [69] such that each
bin corresponds to a logical component of a document rather than being derived
from computation. In this model, we developed a term signal that additionally
takes a document's structural information into consideration, called the
structure-based term signal, defined below.

Definition 1. Structure-based term signal is a vector of frequencies of term
occurrences in different components of a document, where the components are
derived from the actual document structure. The structure-based term signal
of a term t in a document d can be defined by Equation 4.18.
st(t,d) = [f_{t,s_1,d}, f_{t,s_2,d}, ..., f_{t,s_B,d}],    (4.18)

where f_{t,s_1,d}, f_{t,s_2,d}, ..., f_{t,s_B,d} correspond to the frequencies of term t in
components or sections s_1, s_2, ..., s_B of document d, respectively.
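Constructing the structure-based term signal of Equation 4.18 amounts to counting a term once per logical section; a minimal sketch over a hypothetical two-section document:

```python
def structure_term_signal(term, sections):
    """Structure-based term signal (cf. Equation 4.18): frequency of `term`
    in each logical component of a pre-tokenized document."""
    return [tokens.count(term) for tokens in sections]

# Hypothetical two-component document: a heading and a body, already tokenized.
heading = ["gene", "expression"]
body = ["the", "gene", "was", "expressed", "gene"]
print(structure_term_signal("gene", [heading, body]))  # -> [1, 2]
```

Unlike the spectral model's fixed-width bins, the number and boundaries of the components here follow the document's actual logical organization.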
For the SDMDWT model [88], we utilized Discrete Wavelet Transforms to
transform the structure-based term signal from the frequency domain to the
wavelet domain. Our selection of the Discrete Wavelet Transforms over the
Fourier and Cosine Transforms is based on the latest work of Park
et al. [69]. A document based on the SDMDWT is defined by Equation 4.19:

d = [st(t_1,d), st(t_2,d), ..., st(t_n,d)],    (4.19)

where t_1, t_2, ..., t_n are the term features selected in the feature selection step,
and st(t_1,d), st(t_2,d), ..., st(t_n,d) are the structure-based term signals of terms
t_1, t_2, ..., t_n in document d, respectively. In SDMDWT, the pre-weighting and
Discrete Wavelet Transforms are applied to each structure-based term signal
before performing the classification.
The SDMSTW document model is an extension of the term signal proposed
by Park et al. [64], [65], [66], [67], [68], and [69], but with an additional section
containing significant term weight and without using any type of mathematical

transforms. The structure-based term signal with significant term weight is de-
fined below.
Definition 2. Structure-based term signal with a significant term weight is
the structure-based term signal with an extra section or component containing
a significant term weight. It is a combination of the term weight used in the
Vector Space Model, such as TF-IDF, and the structure-based term signal.
The goal of the SDMSTW model is not only to take into account the patterns
of term occurrences through the structure-based term signal, but also to preserve
the power of the Vector Space Model via the significant term weight. The
structure-based term signal with a significant term weight is defined in Equation
4.20, and the corresponding document representation is defined in Equation 4.21,
where κ is the significant term weight, computed using the TF-IDF
weighting scheme, and f_{t,s_1,d}, f_{t,s_2,d}, ..., f_{t,s_B,d} are the frequencies of term t
in the 1, 2, ..., B components of a document, respectively. The pre-weighting is
performed on each structure-based term signal, excluding κ.

stκ(t,d) = [κ_{t,d}, f_{t,s_1,d}, f_{t,s_2,d}, ..., f_{t,s_B,d}],    (4.20)

d = [stκ(t_1,d), stκ(t_2,d), ..., stκ(t_n,d)].    (4.21)
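The SDMSTW signal simply prepends the document-level TF-IDF weight κ to the per-section frequencies; a sketch under the assumption that κ uses the TF-IDF form of Equation 4.4 (names and data are illustrative):

```python
from math import log2

def sdmstw_signal(term, sections, df, n_docs):
    """Structure-based term signal with significant term weight (cf. Eq. 4.20):
    a document-level TF-IDF weight kappa followed by per-section frequencies."""
    freqs = [tokens.count(term) for tokens in sections]
    kappa = sum(freqs) * log2(n_docs / df)  # TF-IDF over the whole document
    return [kappa] + freqs

# Hypothetical document: "gene" once in the title section, twice in the body.
sections = [["gene"], ["gene", "protein", "gene"]]
print(sdmstw_signal("gene", sections, df=100, n_docs=1600))  # -> [12.0, 1, 2]
```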

4.3.2 Data and pre-processing
WebKB 4-Universities
The WebKB 4-Universities data set is a collection of 8,282 web pages col-
lected from computer science departments of several universities by the World
Wide Knowledge Base project of the Carnegie Mellon text learning group in Jan-
uary, 1997. These web pages in the data set were manually classified into the
following seven categories: student, faculty, staff, course, project, department,
and other. Each category contains web pages gathered from four main uni-
versities: Texas, Washington, Wisconsin and Cornell, and the remaining pages
collected from other universities. The four most populous categories: student,
faculty, course and project, were used in this study, which accounts for 4,199
web pages. For pre-processing, we utilized a combination of JTidy1 and Java
Regular Expressions to clean up and parse the original HTML documents into
our customized semi-structured XML documents. We used single words as fea-
tures. We ranked the words based on multi-class Information Gain according to
Equation 4.17. We did not perform word stemming and stop-word removal on
this data set. According to McCallum and Nigam [57], performing stop-word re-
moval and word stemming on this data set could degrade performance, because
some simple words such as "my", "am" and "me" have high rankings.
TREC Genomics 2005
The TREC Genomics 2005 data set is a corpus of full-text biomedical doc-
uments in SGML format. The data set is a collection of mouse genome articles
over a two-year (2002-2003) period from three journals, Journal of Biological

Chemistry (JBC), Journal of Cell Biology (JCB), and Proceedings of the Na-
tional Academy of Science (PNAS). There are four major topics of articles in
the data set: Alleles of mutant phenotypes, Embryologic gene expression, Gene
Ontology and Tumor biology, which correspond to the following four class labels,
A, E, G and T, respectively. TREC Genomics 2005 data set consists of 5,837
training documents and 6,403 testing documents. A document may belong to
zero or more classes. For pre-processing, we implemented an in-house SGML
Java parser to parse the original SGML documents into our customized semi-
structured XML documents. Then, we employed Lingpipe2 to break texts into
individual sentences. Next, we applied Genia Tagger3 to individual sentences
for performing part-of-speech tagging and for detecting words, phrases and bio-
logical entities. We used phrases as features and removed those that are in the
NCBI PubMed stop word list. We stemmed phrases using Porter Stemmer [71],
and ranked them using binary-class Information Gain according to Equation
4.17. Documents labeled with the target class were assigned a positive label;
all other documents, including unlabeled documents, were assigned negative
labels.
Reuters-21578
The Reuters-21578 data set is a collection of 21,578 news documents in
SGML format from Reuters newswire in 1987. Each document consists of a
number of components such as dates, topics, places, people, organizations, com-
panies, exchanges, title, and body text. There are 135 topics in this data set
and they are usually used as class labels for document categorization tasks. A

document may have zero, one or more than one topics. Reuters-21578 data set
is available from UCI KDD archive4. We pre-processed this data set by first
splitting it into train, test, and not-used sub-data sets based on ModApte split-
ting scheme (from the README document in the Reuters-21578 distribution
package). Then, we performed word stemming using Porter Stemmer [71]. We
used single words and ranked them based on binary-class Information Gain ac-
cording to Equation 4.17. We employed the following ten most populous topics
in our study: earn, acq, money-fx, grain, crude, trade, interest, ship, wheat and
corn.
20-Newsgroups
The 20-Newsgroups data set is a collection of newsgroup documents. It
consists of approximately 20,000 documents which are equally distributed into
20 categories. Each category in the data set corresponds to a newsgroup topic,
and some of the topics are similar to each other based on the subject mat-
ter. A list of 20 newsgroups categorized according to topic similarity is pro-
vided as follows: (i), comp.os.ms-windows.misc,
comp.sys.ibm.pc.hardware, comp.sys.mac.hardware and comp.windows.x, (ii)
rec.autos, rec.motorcycles, rec.sport.baseball and rec.sport.hockey, (iii)
sci.crypt, sci.electronics, and, (iv), (v)
talk.politics.misc, talk.politics.guns and talk.politics.mideast and
(vi) talk.religion.misc, alt.atheism and soc.religion.christian. In
this study, we used the sort-by-date version, which consists of 18,846 docu-
ments. For pre-processing, we performed an initial clean-up of the data set by

removing unnecessary textual parts such as header and subject lines, uuencod-
ing text and PGP signature text. We preserved only the title and body text
to be used in our experiments, and used single words as features. We ranked
the words based on multi-class Information Gain described by Equation 4.17.
We followed McCallum and Nigam's work [57] by carrying out only word stemming
using Porter Stemmer [71] without stop-word removal.
4.4 Experiments and results
We conducted experiments using LIBSVM [13] and Weka [99], and utilized
micro-averages and macro-averages of Precisions, Recalls and F-measures to
compare performance of different document models in this study; they are de-
fined next. Precision measures how well a classifier correctly classifies positive
instances from all of its positive predictions. Recall measures how well a classi-
fier correctly classifies positive instances from all positive instances. F-measure,
or F1-measure, is the weighted harmonic mean of Precision and Recall. The
measures are expressed by Equations 4.22, 4.23 and 4.24, as follows:

Precision = TP / (TP + FP),    (4.22)

Recall = TP / (TP + FN),    (4.23)

F-measure = (2 × Precision × Recall) / (Precision + Recall),    (4.24)

where TP, TN, FP and FN are the numbers of true positives, true negatives,
false positives and false negatives, respectively.
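Equations 4.22-4.24 in code form (a straightforward sketch; it does not guard against empty denominators):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, Recall and F-measure from one contingency table
    (cf. Equations 4.22-4.24); assumes non-zero denominators."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, round(f, 4))  # -> 0.8 0.5 0.6154
```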

The micro-averaged and macro-averaged measures are performance averages
across multiple categories. As described in Yang's work [102], the micro-averaged
performance is viewed as a per-document average, because it gives equal weight
to every document. To compute a micro-averaged performance, a global con-
tingency table is constructed. Its cell value is the sum of the corresponding cell
in each contingency table of each class. For example, the number of true posi-
tives in the global contingency table is the sum of the numbers of true positives
from all contingency tables of all classes. Then, a micro-averaged performance,
such as micro-averaged Precision or micro-averaged Recall is computed from
this global contingency table. The macro-averaged performance is considered
a per-category average, because it gives equal weight to every class. It is com-
puted by the sum of performance from each class divided by the total number
of classes.
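The two averaging schemes can be contrasted on Precision alone. In this sketch (the class names and counts are invented), the micro-average is pulled toward the large class while the macro-average weights both classes equally:

```python
def micro_macro_precision(tables):
    """`tables` maps each class to its (TP, FP) counts.
    Micro-averaging pools one global contingency table (per-document view);
    macro-averaging averages per-class precisions (per-category view)."""
    tp_sum = sum(tp for tp, _ in tables.values())
    fp_sum = sum(fp for _, fp in tables.values())
    micro = tp_sum / (tp_sum + fp_sum)
    macro = sum(tp / (tp + fp) for tp, fp in tables.values()) / len(tables)
    return micro, macro

# Invented counts: one frequent class and one rare class.
tables = {"earn": (90, 10), "corn": (1, 1)}
micro, macro = micro_macro_precision(tables)
print(round(micro, 4), round(macro, 4))  # -> 0.8922 0.7
```

The gap between the two numbers is exactly why both averages are reported when class sizes are skewed, as in Reuters-21578.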
4.4.1 WebKB 4-Universities
In this data set, each document was structurally modeled as a two-component
document, which consists of a heading component (H) and a non-heading com-
ponent (N). The heading component is the text enclosed by a pair of <H#>
and </H#> HTML tags, where # corresponds to a heading-level number, and
the non-heading component is the text that is not enclosed by the heading HTML
tags. Accordingly, a structure-based term signal of selected terms for the SD-
MDWT and SDMSTW models can be represented by Equations 4.25 and 4.26,
st(t,d) = [f_{t,s_H,d}, f_{t,s_N,d}],    (4.25)

stκ(t,d) = [κ_{t,d}, f_{t,s_H,d}, f_{t,s_N,d}],    (4.26)

where st(t,d) and stκ(t,d) are the structure-based term signals of term t in
document d, κ_{t,d} the TF-IDF value of term t in document d, f_{t,s_H,d} the frequency
of term t in the heading component of document d, and f_{t,s_N,d} the frequency
of term t in the non-heading component of document d, respectively. The pre-
weighting scheme was then applied to the structure-based term signal for both
SDMDWT and SDMSTW, and the Discrete Wavelet Transform was next applied
to the pre-weighted structure-based term signal for SDMDWT. For the SPDMDWT
document model, we used two components for its term signal, the same number
as in SDMDWT and SDMSTW. A summary of the maximum and average values of
Micro- and Macro-averaged Precisions, Recalls and F-measures for the WebKB
4-Universities data set is shown in Table 4.1. Based on both Micro- and Macro-
averages of Precisions, Recalls and F-measures, the SDMDWT model performs
better than all of the other models, and the SDMSTW model ranks third,
above the VSM.
4.4.2 TREC Genomics 2005
To capture the structure of documents in the TREC Genomics 2005 data
set, we first collected the names of main sections from all documents. To iden-
tify main sections, we constructed a document tree by exploiting the SGML tags
that mark section boundaries. Based on our analysis of the documents,
we specified the sections in level zero to level three of a document tree
to be main sections. Then, we mapped related sections onto the same section name.
For example, "methods", "experimental procedures" and "materials and methods"

Table 4.1: The maximum and average (maximum/average) pairs of micro-
and macro-averaged evaluation measures where the numbers of terms used are
500, 1000, 1500, 2000, 2500, 5000 and 7500 on WebKB 4-Universities data set.
Document models
Micro-averaged Precision 0.9278/0.9216 0.9358/0.9337 0.9345/0.9313 0.9276/0.9167
Recall 0.9119/0.9056 0.9238/0.9209 0.9205/0.9156 0.8800/0.8723
F-measure 0.9182/0.9135 0.9298/0.9273 0.9274/0.9234 0.9007/0.8938
Macro-averaged Precision 0.9296/0.9230 0.9380/0.9342 0.9347/0.9312 0.9284/0.9113
Recall 0.9023/0.8931 0.9157/0.9107 0.9096/0.9035 0.8549/0.8447
F-measure 0.9135/0.9073 0.9265/0.9220 0.9217/0.9168 0.8846/0.8752
were mapped to the user-defined "method" section name, etc. In this study, the
following seven main section names were pre-defined based on the actual logical
organization of the documents: "Title" (T), "Abstract" (A), "Introduction" (I),
"Method" (M), "Result" (R), "Conclusion" (C) and "Other" (O). The "Other"
section is for main section names that obviously cannot be mapped to the previ-
ous six sections. For the SDMDWT, because the Discrete Wavelet Transforms
require the length of an input signal to be a power of two, we added an extra
zero component to the signal. Therefore, a structure-based term signal of se-
lected terms for the SDMDWT and SDMSTW models for this data set can be
represented by Equations 4.27 and 4.28, respectively.


st(t,d) = [f_{t,s_T,d}, f_{t,s_A,d}, f_{t,s_I,d}, f_{t,s_M,d}, f_{t,s_R,d}, f_{t,s_C,d}, f_{t,s_O,d}, 0],    (4.27)

stκ(t,d) = [κ_{t,d}, f_{t,s_T,d}, f_{t,s_A,d}, f_{t,s_I,d}, f_{t,s_M,d}, f_{t,s_R,d}, f_{t,s_C,d}, f_{t,s_O,d}],    (4.28)

where st(t,d) and stκ(t,d) are the structure-based term signals of term t in
document d, κ_{t,d} the TF-IDF value of term t in document d, f_{t,s_T,d} the frequency
of term t in the "Title" of document d, f_{t,s_A,d} the frequency of term t in the
"Abstract" of document d, f_{t,s_I,d} the frequency of term t in the "Introduction"
of document d, f_{t,s_M,d} the frequency of term t in the "Method" of document d,
f_{t,s_R,d} the frequency of term t in the "Result" of document d, f_{t,s_C,d} the frequency
of term t in the "Conclusion" of document d, and f_{t,s_O,d} the frequency of term
t in the "Other" section of document d, respectively. A summary of the maximum and
average values of Micro- and Macro-averaged Precisions, Recalls and F-measures
for the TREC Genomics 2005 data set is shown in Table 4.2.
Table 4.2: The maximum and average (maximum/average) pairs of micro-
and macro-averaged evaluation measures where the numbers of terms used are
500, 750, 1000, 1500, 2000, 2500, 5000, 6500, 7500, 8500 and 10000 on TREC
Genomics 2005 data set.
Document models
Micro-averaged Precision 0.6159/0.4808 0.8140/0.5987 0.9036/0.6653 0.5562/0.4302
Recall 0.3405/0.2204 0.3046/0.2110 0.2667/0.1648 0.3426/0.2328
F-measure 0.3476/0.2790 0.3451/0.2879 0.3194/0.2414 0.3401/0.2788
Macro-averaged Precision 0.4827/0.3122 0.5357/0.3668 0.6265/0.4069 0.3355/0.2772
Recall 0.3321/0.1620 0.2626/0.1490 0.1979/0.1128 0.3316/0.1717
F-measure 0.3713/0.1931 0.3226/0.1891 0.2494/0.1556 0.3288/0.1899

As shown in Table 4.2, our SDMSTW model appears to be the best
model in terms of Micro- and Macro-averaged F-measures, and is potentially a
suitable model for Macro-averaged Recall.
4.4.3 Reuters-21578
A document in Reuters-21578 data set consists of the body text news and
a set of headings such as Date, Topics, Title, Places, People, Exchanges, Orgs,
Company and Unknown. Topics are usually used as class labels for each docu-
ment. In this study, we used the "Title" and "Body" news text for constructing
a document model, as shown by Equations 4.29 and 4.30, respectively:

st(t,d) = [f_{t,s_T,d}, f_{t,s_B,d}],    (4.29)

stκ(t,d) = [κ_{t,d}, f_{t,s_T,d}, f_{t,s_B,d}],    (4.30)

where st(t,d) and stκ(t,d) are the structure-based term signals of term t
in document d, κ_{t,d} the TF-IDF value of term t in document d, f_{t,s_T,d} the
frequency of term t in the "Title" part of document d, and f_{t,s_B,d} the frequency
of term t in the "Body" part of document d, respectively. A summary of the
maximum and average values of Micro- and Macro-averaged Precisions, Recalls
and F-measures for the Reuters-21578 data set is shown in Table 4.3, which
demonstrates that our SDMSTW gives a better performance on Micro- and
Macro-averaged Recalls than the other existing models.
4.4.4 20-Newsgroups
In 20-newsgroups data set, each document consists of two main components,
headings and body text. The heading part contains a number of heading lines,

Table 4.3: The maximum and average (maximum/average) pairs of micro-
and macro-averaged evaluation measures where the numbers of terms used
are 500, 1000, 1500, 2000, 2500, 5000, 7500 and 10000 on Reuters-21578 data set.
Document models
Micro-averaged Precision 0.8881/0.8740 0.8914/0.8798 0.9116/0.8931 0.8657/0.8519
Recall 0.8841/0.8667 0.8744/0.8631 0.8733/0.8607 0.8762/0.8587
F-measure 0.8854/0.8703 0.8799/0.8713 0.8921/0.8766 0.8701/0.8553
Macro-averaged Precision 0.8234/0.8034 0.8243/0.8097 0.8539/0.8334 0.7924/0.7736
Recall 0.8019/0.7811 0.7865/0.7730 0.7905/0.7676 0.8010/0.7791
F-measure 0.8094/0.7895 0.7987/0.7887 0.8195/0.7972 0.7937/0.7734
such as "From:", which specifies the email address of the sender, "Subject:", which
describes the subject of the email, and "Organization:", which tells where the email
was sent from, etc. The body part contains the main content of the email. In
this study, we employed only the subject line in the heading component, and
the whole body text. As a result, a document in this data set can be expressed
by Equations 4.31 and 4.32, respectively:

st(t,d) = [f_{t,s_S,d}, f_{t,s_B,d}],    (4.31)

stκ(t,d) = [κ_{t,d}, f_{t,s_S,d}, f_{t,s_B,d}],    (4.32)

where st(t,d) and stκ(t,d) are the structure-based term signals of term t
in document d, κ_{t,d} the TF-IDF value of term t in document d, f_{t,s_S,d} the
frequency of term t in the "Subject" line of document d, and f_{t,s_B,d} the frequency
of term t in the "Body" part of document d, respectively. A summary of the
maximum and average values of Micro- and Macro-averaged Precisions, Recalls
and F-measures for the 20-Newsgroups data set is shown in Table 4.4.
Table 4.4: The maximum and average (maximum/average) pairs of micro-
and macro-averaged evaluation measures, where the numbers of terms used are
500, 1000, 1500, 2000, 2500, 5000, 7500 and 10000, on the 20-Newsgroups data set.

Document models    SDMSTW          SDMDWT          SPDMDWT         VSM
Micro-averaged
  Precision        0.7453/0.6516   0.7725/0.6732   0.7784/0.6705   0.7117/0.6292
  Recall           0.6900/0.6740   0.6719/0.6653   0.6579/0.6493   0.6846/0.6632
  F-measure        0.7144/0.6610   0.7152/0.6668   0.7102/0.6570   0.6939/0.6441
Macro-averaged
  Precision        0.7506/0.6588   0.7780/0.6794   0.7828/0.6775   0.7166/0.6360
  Recall           0.6829/0.6677   0.6654/0.6591   0.6575/0.6421   0.6757/0.6554
  F-measure        0.7086/0.6581   0.7073/0.6634   0.7025/0.6533   0.6877/0.6405
According to Table 4.4, our SDMSTW and SDMDWT models yield better
performance on the Micro- and Macro-averaged F-measures than the other
existing models. In addition, the SDMSTW model gives the best performance
on Micro- and Macro-averaged Recalls.
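The construction in Equations 4.31 and 4.32 can be illustrated with a short sketch. The function and example values below are hypothetical (our experiments were implemented in Java programs), but the two signals are built component by component as the equations describe:

```python
def structure_based_term_signals(term, subject_words, body_words, tfidf):
    # f_{t,ST,d}: frequency of the term in the Subject line of document d
    f_st = subject_words.count(term)
    # f_{t,SB,d}: frequency of the term in the Body part of document d
    f_sb = body_words.count(term)
    # Equation 4.31: term signal built from raw component frequencies
    s_t = [f_st, f_sb]
    # Equation 4.32: the same signal scaled by the term's TF-IDF weight
    s_tw = [tfidf * f_st, tfidf * f_sb]
    return s_t, s_tw

# A toy email: the term "bike" occurs once in the Subject and twice in the Body.
s_t, s_tw = structure_based_term_signals(
    "bike", ["bike", "helmet"], ["bike", "ride", "bike"], tfidf=0.5)
```

Each term is thus represented by a vector with one entry per document component, rather than a single scalar weight.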
4.5 Discussion
In this section, we discuss a number of interesting issues and questions, as
follows.
Does the additional inclusion of document structure into the document rep-
resentation improve the performance of document classification over existing
document models? According to our experimental results, we conclude
that additionally incorporating document structural information into doc-
ument representation, in general, improves the performance of document
classification, particularly for Micro- and Macro-averaged Recalls and F-
measures. The performance improvements were demonstrated on three
out of four data sets, which were WebKB 4-Universities, TREC Genomics
2005 and 20-Newsgroups.
Is the performance improvement from the additional incorporation of doc-
ument structural information into the document representation really worth
the increased computational time and resources for the pre-processing task?
To construct a structure-based document model using either SDMDWT
or SDMSTW, the structure of documents first needs to be captured. In
contrast, the SPDMDWT is constructed using only the number of selected
terms in the documents divided by the user-defined number of sections or
components. For the current standard of how textual documents are for-
matted and stored, our structure-based approaches require more compu-
tational resources and time on pre-processing than does the SPDMDWT.
However, published documents, especially scientific articles, are increas-
ingly stored in a semi-structured format such as XML or SGML. BioMed
Central is an example of an open-access repository that stores full-text
biomedical literature in XML format. As a result, it is practically trivial
to detect sections of a document using available XML parser software tools
such as JDOM or Piccolo. The only issue for the current standard format
is the variants of section names and different number of sections used by

different authors and publishers. However, rather than being discouraged,
we view this issue as a way to improve how textual documents should be
formatted and stored in the future. We propose that there should be a
common standard for names of major sections and the number of required
sections in a document for scientific articles in a given field. In addition,
those section names should be used as the XML or SGML elements for
their own sections. For example, the section names "Experimental procedure"
and "Experimental approach" could be converted into a common section name
such as "Method", and that section name is then used as an XML element,
<Method>. Moreover, our structure-based models are better than the SPDMDWT
because they allow researchers to compare documents based on document
structure. For example, suppose document classification focuses on comparing
patterns of term occurrences in two sections, such as the Abstract and the
Conclusion. The SPDMDWT cannot achieve this task unless the numbers of
terms in the Abstract and Conclusion sections are exactly equal. In contrast,
using our structure-based models, a structure-based term signal for each term
is easily constructed as a vector of two elements: one for the Abstract and the
other for the Conclusion section. Further evidence
of the superiority of our model to the SPDMDWT is on the WebKB 4-
Universities data set. Our models do not depend on the order of terms in
a document as the SPDMDWT does. Therefore, HTML documents can
be compared based on both section headers, i.e., the text enclosed by HTML
heading elements, and the body text. This type of comparison is impossible
for the SPDMDWT, because it considers only the physical order of terms.
As a result, although currently our model requires more pre-processing
resources than does the SPDMDWT, this disadvantage is likely to be re-
duced in the future if a uniform standard is set for published articles as
described above. Moreover, our structure-based approaches offer flexibility in
model construction and allow the structure of the documents to be exploited.
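To illustrate how trivial section detection becomes once articles are stored with section names as markup elements, the following sketch parses a toy XML article using Python's standard library (the element names and content are invented for illustration; in Java, parsers such as JDOM or Piccolo, mentioned above, play the same role):

```python
import xml.etree.ElementTree as ET

# A toy article in the proposed format, where each major section name
# is used directly as an XML element; the content here is invented.
ARTICLE = """<article>
  <Abstract>structure based document models</Abstract>
  <Method>support vector machines were trained</Method>
  <Conclusion>structure improves recall</Conclusion>
</article>"""

def extract_sections(xml_text):
    # Map each section name (the element tag) to its text content.
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}

sections = extract_sections(ARTICLE)
```

With such a format, building a structure-based term signal reduces to counting term frequencies inside each extracted section.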
Does the number of document components affect performance of document
classification? To answer this question, we conducted an additional ex-
periment by constructing document models using SDMSTW, SDMDWT
and SPDMDWT with different numbers of document components using
the TREC Genomics 2005 data set. To construct different numbers of doc-
ument components for our structure-based document models, SDMSTW
and SDMDWT, we combined relevant components together based on our
judgment. For example, for a two-section document model, we combined
"Title", "Abstract" and "Introduction" into the first section/component,
and "Method", "Result", "Conclusion" and "Other" into the second one.
The SPDMDWT was constructed by specifying the desired number of com-
ponents in our Java programs. We evaluated and compared these doc-
ument models based on Micro- and Macro-averaged F-measures (MMFs)
using Support Vector Machines binary classification. Based on our ad-
ditional experiments, we found the results inconclusive as to whether
additional document components improve the performance, because, for
all document models, different numbers of document components
yielded different MMF values. However, we may conclude that the
number of document components in some way affects the performance of
document classification. As a result, to obtain the best performance, a
trial-and-error approach should be carried out.
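The section-merging step described above can be sketched as follows; the mapping constant and helper function are illustrative assumptions, not our actual Java code:

```python
# Illustrative two-component grouping following the example above:
# component 0 = Title + Abstract + Introduction,
# component 1 = Method + Result + Conclusion + Other.
TWO_COMPONENT_MAP = {
    "Title": 0, "Abstract": 0, "Introduction": 0,
    "Method": 1, "Result": 1, "Conclusion": 1, "Other": 1,
}

def merged_term_signal(term, section_words, mapping, n_components):
    # Accumulate the term's frequency in each merged component.
    signal = [0] * n_components
    for section, words in section_words.items():
        signal[mapping[section]] += words.count(term)
    return signal

# "gene" occurs once in the Title and twice in the Method section.
signal = merged_term_signal(
    "gene",
    {"Title": ["gene", "therapy"], "Method": ["gene", "gene", "chip"]},
    TWO_COMPONENT_MAP, n_components=2)
```

Varying the number of components then amounts to supplying a different section-to-component mapping, which makes the trial-and-error comparison straightforward.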
An issue regarding unequal lengths of different document components. In
a typical document, the numbers of terms in different components or sec-
tions of a document are not equal. Our structure-based approach gives
some flexibility. If a section is too short, it can be combined with other
sections regardless of the order of sections in a document. In addition, using
order of terms in a document for constructing document model as in the
SPDMDWT does not provide any semantic interpretation. For example,
there is no semantic meaning in saying that two documents are similar because,
in both documents, terms t1 and t2 occur between the 40th and 60th positions.
In contrast, with our model there is a semantic interpretation in saying that
documents d1 and d2 are similar because, in both documents, terms t1 and t2
occur in the Title section.
An issue regarding highly unbalanced data sets, e.g., TREC Genomics 2005,
a collection in which the ratio between the numbers of positive and negative
instances is significantly far from one. For this type of data set, Accuracy is
considered an unsuitable performance measure. A classifier trained on a
highly unbalanced data set tends to predict the category that has the larger
number of instances in the data set. For example, if a classifier is trained on a data set

of 1000 negative and 10 positive instances, it will tend to predict a test
example as negative. As a result, different classifiers trained on the same
highly unbalanced data set tend to yield similar accuracy values.
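The effect can be seen with a quick calculation (a toy illustration, not taken from our experiments):

```python
def accuracy(predictions, labels):
    # Fraction of instances whose predicted label matches the true label.
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# 1000 negative (0) and 10 positive (1) instances, as in the example above.
labels = [0] * 1000 + [1] * 10
# A degenerate classifier that always predicts the majority class still
# scores about 99% accuracy while retrieving no positive instance at all.
majority_predictions = [0] * len(labels)
acc = accuracy(majority_predictions, labels)
```

This is why Precision, Recall and F-measure, rather than Accuracy, are reported throughout this chapter.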
Pre-processing issue regarding binary-class or multi-class Information Gain
feature selection. Issues regarding Information Gain computation for term
weighting on TREC Genomics 2005 and Reuters-21578 data sets should
be discussed. A similarity between these two data sets is that some docu-
ments do not have assigned class labels and some of them belong to more
than one category. Hence, if multi-class Information Gain were used, an
additional dummy class label, such as an unassigned (U) label, would need
to be assigned to unlabeled documents. Otherwise, terms in unlabeled doc-
uments would not be considered in the Information Gain computation.
As a result, for TREC Genomics 2005 data set, instead of four classes,
there would be five classes, and for Reuters-21578 data set, there would
be 136 classes instead of 135. However, the addition of an unassigned
class label may affect the Information Gain values of terms in undesirable
ways. In a categorization task, predictions of the original classes are of
interest, not predictions of an unassigned class label. Therefore, for
TREC Genomics 2005 and Reuters-21578 data sets, we used binary-class
Information Gain rather than multi-class Information Gain for comput-
ing term weights. Equation 4.17 can still be applied, with the number of
classes set to two. Using binary-class Information Gain was par-
ticularly beneficial to the Reuters-21578 data set, where only a subset of
classes is usually used for a classification task. There is also a set of pre-

processing questions worth discussing, for example: Should we stem words?
Should we remove stop words? What types of stop-word lists should we use?
Should we use single words, multi-words or phrases, concepts, or biological
entities as features? How many words should be used to represent a document?
Unfortunately, there are no definite answers to these questions. Experiments
need to be performed to find which strategies give the best performance on
the task at hand. An alternative approach, deployed in this study, is to
examine existing work related to the target data sets.
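A sketch of binary-class Information Gain over document counts follows; the contingency-count interface is a hypothetical simplification for illustration, equivalent to applying Equation 4.17 with two classes:

```python
import math

def entropy(probabilities):
    # Shannon entropy in bits, ignoring zero-probability outcomes.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def binary_information_gain(n_pt, n_nt, n_pa, n_na):
    # n_pt / n_nt: positive / negative documents containing the term;
    # n_pa / n_na: positive / negative documents where the term is absent.
    n = n_pt + n_nt + n_pa + n_na
    # Class entropy before observing the term.
    h_class = entropy([(n_pt + n_pa) / n, (n_nt + n_na) / n])
    # Expected class entropy after conditioning on term presence/absence.
    h_term = 0.0
    for pos, neg in ((n_pt, n_nt), (n_pa, n_na)):
        group = pos + neg
        if group:
            h_term += (group / n) * entropy([pos / group, neg / group])
    return h_class - h_term

# A term occurring in all positive and no negative documents is maximally
# informative; a term spread evenly across both classes carries none.
ig_perfect = binary_information_gain(5, 0, 0, 5)
ig_useless = binary_information_gain(5, 5, 5, 5)
```

Terms are then ranked by this score, and the top 500 to 10000 terms are retained as features, as in the experiments above.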
When should the typical term signal be used and when should the structure-
based term signal be used? Based on the experimental results, if Precision
is the main concern, the typical term signal or SPDMDWT model is the
most suitable choice. However, if the highest Recall and F-measure values
are desired, the structure-based models are the best option.
If we have to choose one document model, which one should we select?
This question is similar to but broader than the previous question. The
answer is that it depends on the goal of the current task. If computing
memory is the main concern, the classic VSM is the most suitable choice.
If the highest Precision value is desired, the SPDMDWT should be selected.
Finally, if the best performance on Recall and F-measure is desired, the
SDMSTW and SDMDWT are the best options.