Citation
Self-organizing neural network for plain text categorization

Material Information

Title:
Self-organizing neural network for plain text categorization
Creator:
Parry, Michael H
Publication Date:
Language:
English
Physical Description:
viii, 37 leaves : illustrations ; 28 cm

Subjects

Subjects / Keywords:
Neural networks (Computer science) ( lcsh )
Self-organizing systems ( lcsh )
Text processing (Computer science) ( lcsh )
Neural networks (Computer science) ( fast )
Self-organizing systems ( fast )
Text processing (Computer science) ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 36-37).
General Note:
Submitted in partial fulfillment of the requirements for the degree, Master of Science, Computer Science.
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Michael H. Parry.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
36419963 ( OCLC )
ocm36419963
Classification:
LD1190.E52 1996m .P37 ( lcc )

Full Text
SELF-ORGANIZING NEURAL NETWORK
FOR PLAIN TEXT CATEGORIZATION
Michael H. Parry
B.A., University of Colorado, 1983
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science
by
1996


1996 by Michael H. Parry
All rights reserved.


This thesis for the Master of Science
degree by
Michael H. Parry
has been approved
by

Date


Parry, Michael H. (MS., Computer Science)
Self-Organizing Neural Network for Plain Text Categorization
Thesis directed by Associate Professor William J. Wolfe
ABSTRACT
People and organizations communicate a high percentage of information by
means of textual reports generally written with arbitrary formats. Often, the only
effective way to assimilate this information requires experienced persons to read the
reports. This is a time consuming and ineffective process in an environment requiring
the continual processing of numerous text messages. Information filtering and
categorization timeliness becomes a major concern when requirements exist to
generate alerts based upon the atypical content of the text reports.
This paper presents the results of an experiment investigating techniques to fiber
and categorize plain text messages. The data source is a message stream reporting
various world events. The algorithms compare text messages from the current
reporting period to those from previous reporting periods assessing new texts for
characteristics indicating they may contain atypical information. The data set
comprised 30,000 messages collected over a five month period. The algorithms
demonstrate effective filtering ami categorization of unique point events and
continuously occurring events reported during the collection period. The algorithms
prioritize the most atypical messages for further review by experienced users. The low
priority (routine) texts are most similar to texts previously processed by the
algorithms during an earlier reporting period.
Non-a-priori techniques based on self-organizing neural networks and genetic
algorithms identify characteristics that distinguish atypical from routine messages.
Large numbers of a-priori defined pattern matching templates are not required. This
reduces the risk of missing unique information hidden in text messages because of a
failure to match an a-priori defined template. This behavior is a distinct advantage for
IV


monitoring a dynamic environment. The non-a-priori techniques automatically
categorize and prioritize texts with atypical characteristics as data requiring further
user review.
The described techniques have the greatest application to domains requiring
continuous monitoring of free text message streams. These domains include those in
many government agencies, news wire services, stock market and economic analysis,
and data fusion domains combining reports to enhance situation awareness and threat
assessment.
This abstract accurately represents the content of the candidates thesis. I recommend
its publication.
Signed
illiam J.Jtfolfe


ACKNOWLEDGMENTS
My thanks to Professor Wolfe and Marcia Baker and the rest of the Graduate School
staff for their support with my graduate studies.
My thanks to my supervisors and associates at TRW and Computing Devices
International for their extensive financial support and flexibility which allowed me to
complete my studies and the research reported in this thesis.


CONTENTS
Chapter
1. Introduction..................................................1
1.1 Research Goal.................................................1
1.2 Current Capabilities..........................................2
1.2.1 Message Understanding Techniques..............................4
2. Self-Organizing Approach......................................8
2.1 Metric Space..................................................8
2.2 Preliminary Self-Organizing Text Experiments.................13
3. Experimental Setup...........................................18
3.1 Data Collection..............................................18
3.2 Implementation...............................................19
4. Results......................................................20
4.1 Predicted Text Correlation Results...........................20
4.2 Event Clusters Extracted from the Message Stream.............22
5. Summary......................................................33
Appendix
A. Sample Plain Text Messages...................................35
References
36


FIGURES
Figure
1.1 Typical Message Processing System..............................3
1.2 Response to Aristide Keyword...................................7
2.1 Two-Dimensional Slice through Text Metric Space.............. 10
2.2 Longest Common Subsequence Matching Algorithm................ 11
2.3 Genetic String Merging Algorithm..............................12
2.4 Self-Organizing Architectures.................................15
2.5 Self-Organizing Neural Response (10000 Iterations)............16
2.6 Self-Organizing Neural Time Evolution (10000 Iterations)......16
4.1 Next Reporting Period Data Compared to Baseline...............20
4.2 Reporting Volume (Number of Reports per Reporting Period)....23
4.3 Reporting Volume (Percentage Atypical/Routing)................24
4.4 Response to Ferry Keyword...................................25
4.5 Response to Disaster Keyword................................26
4.6 Response to Elect Keyword...................................26
4.7 Response to Mandate Keyword.................................27
4.8 Response to Latvia Way Keyword..............................28
4.9 Response to Faction Keyword.................................29
4.10 Response to Arafat Keyword.................................29
4.11 Response to Iraq Keyword....................................30
4.12 Response to Chech* Keyword..................................31
4.13 Response to *'Yeltsin Keyword................................32
4.14 Response to Estonia.Russia Keywords.........................32
vui


1.
Introduction
1.1 Research Goal
This paper describes an investigation into algorithms to filter and categorize
plain text messages. The plain text messages were obtained from electronic news
sources reporting various real world events. The data set comprised nearly 30,000
messages collected over a five month period. The algorithms filter and categorize
unique one time events and steadily reoccurring events reported during the
collection period. The algorithms compare text messages from the current
reporting period to messages from previous reporting periods. The new texts are
assessed for characteristics indicating they may contain atypical information. The
most similar texts are prioritized as routine data. The most dissimilar texts are
prioritized as atypical and can be further reviewed by experienced users.
The goal of this research is to develop algorithms to analyze and correlate
plain text data in problem domains were it is impractical to invest significant
knowledge acquisition effort developing parsers and problem domain lexicons.
This research demonstrates non-a-priori techniques that can exploit the structure
of plain text data in a message stream to generate categories and these techniques
can distinguish routine data (similar to data already processed) from atypical data
(dissimilar to data already processed). The algorithms developed during this
research cannot and do not match the performance of domain specific algorithms
developed using a-priori techniques nor are they intended to replace detailed
human data analysis. The algorithms are capable of finding atypical data, focusing
the attention of an experienced human analyst, and reducing user workload. The
algorithms allow consistent processing of the entire input data stream without
resorting to time consuming development of a-priori techniques covering narrow
problem domains. The algorithms described here represent an additional tool to
address information extraction and assimilation problems inherent in data
production, analysis, and message dissemination domains.
This paper discusses a text processing approach using self-organizing
algorithms to enhance the pattern matching and analysis of large volumes of text
data. The design approach specifically looks for indications of atypical data in the
input text stream. The system prioritizes input text for review by experienced users
without requiring the users to expend resources reading each message in the input
1


stream. This approach prioritizes the user workload, automates text correlation
and matching tasks, and improves identification of new data trends.
The self-organizing approach presented here is not based upon extensive
parang of the input text, top down analysis of syntax or semantics, or the
development of lexicons. The approach is based upon a bottom up correlation of
text token sequences. Hie algorithms apply ample matching rules to build complex
correlated structures with no dependence upon a-priori knowledge. The approach
draws from techniques in the areas of self-organizing neural nets and genetic
algorithms. These artificial intelligence techniques are bottom up techniques, rather
than top down symbolic approaches.
1.2 Current Capabilities
Many data processing domains communicate large volumes of data between
individuals and organizations by means of electronic messages. Some automatic
processing and reporting domains generate several thousands of reports each day,
particularly in some government, news, or financial applications. Effective
assimilation of thousands of reports transmitted and received on a daily basis
requires users to identify trends, organize, and correlate plain text data. Automated
tools to analyze the message streams are seldom adequate to induce information
from the data stream and the effectiveness of the available tools depends heavily on
the level of experience and training of the system users [1][2].
Figure 1.1 shows the tools available in a typical message storage, retrieval and
dissemination system. The system receives messages over a network or
communications line, archives them to long term storage, matches messages
against user query profiles, and a user interface provides message display,
annotation, and retrospective search capabilities. Some applications may support
counting and graphing capabilities such as histograms and bar charts. Software
indexing schemes or specialized hardware accelerators support search of the
message database. Systems of this type typically receive thousands of messages
and megabytes of data daily.
2


Typical Message Processing System
Raw Text
Via Network or
Communications Lines
Template
Management
User Inputs and
Annotations
Raw Text
Archive
Keywords
Indides
Dictionaries
Gazetteers
Other Analysis Tools:
Histograms
Scattergrams
Timelines
Figure 1.1
Trained personnel responsible for analyzing and assimilating the content of
the incoming messages typically build complicated profiles to extract specific
messages from the message stream. These profiles are of two types. First, highly
specific, elaborate, and exact profiles that generate high precision matches against
the message stream for known topics of interest. Second, very broad profiles with
high recall designed to prevent missing important information. The high precision
profiles generate short read quotes but are prone to miss relevant messages that
are near but not contained within the search volume specified by the profile. The
high recall profiles generate long read queues requiring users to read and
determine the potential relevance of each message within the broad search volume
specified by the profile.
The weakness of the template profile approach is that it requires extensive a-
priori knowledge to initially construct the profiles and requires significant
resources to read the large number of messages collected in the read queues. Long
term maintenance of the profiles requires expensive user resources because of the
dynamic nature of the message streams. Maintaining high levels of query precision
and recall efficiency requires continuous tuning of the query profile terms. (If the
message streams were static then the data handling could be trivially automated
requiring no human intervention.)
3


The typical message processing system depicted in Figure 1.1 provides no
mechanism to identify new trends in the message stream. New trend identification
depends upon a smart experienced user spontaneously noticing new patterns in the
data. This difficult task demands consideration of all the factors that work against
complete familiarity with the message stream: large volumes of data, user fatigue,
differing levels of training and expeience, and personnel turnover.
New trend identification depends upon good user situation awareness of the
content and context of the data contained in the message stream. Recognizing new
situations is fundamentally a process which humans are much better at than
computers. However, there are several techniques in addition to the techniques
presented in this paper that can automate aspects of the situation awareness,
analysis, and recognition problem reducing the workload on the system users and
raising the overall quality of their product.
1.2.1 Message Understanding Techniques
The Advanced Research Projects Agency (ARPA, formerly Defense
Advanced Research Projects Agency) has conduced a series of evaluations of
English text analysis systems including the Message Understanding Conferences:
MUC-3, MUC-4 [3 6]. The evaluations explored the merits of current text
analysis techniques applied to the performance of realistic information extraction
tasks, including text retrieval and text categorization. The output of the
information extraction process constructed a series of templates capturing the
semantic content of the messages prior to categorization and storage in a database.
The evaluations prove that systems can be built to automatically extract data
from ill-formed paragraph-length texts in a narrow domain. The evaluations
examined a wide range of text interpretation techniques including statistical, key-
word, template-driven, pattern-matching, and in depth natural language
processing.
Among the key observations from the MUC evaluations are the following:
First, those systems that use purely stochastic techniques or hand-crafted pattern-
matching techniques are not able to achieve as high a level of performance as
systems that incorporate text parsing techniques. Second, a wide range of
performance was evident among the parsing systems because of limited coverage
of the problem domain. Third a positive correlation existed between the level of
parsing performance and the level of effort invested in tuning the parsing processes
[3]-
4


The MUC evaluations clearly indicate significant a-priori knowledge is
required to obtain good precision and recall performance in a narrow domain [4].
Other observations show that increased effort is required to perform sentence level
linguistic analysis and inadequate approaches exist to determine when and how to
combine information from multiple sentences into a single coherent representation.
Emphasis has increased on hybrid systems combining linguistic and nonlinguistic
approaches.
The results from MUC-4 indicate a division in approaches to the message
understanding problem [4], Some researchers believe the most successful message
understanding systems will be the most generic ones with respect to application
task and domain. Others believe the most successful systems will take advantage of
all possible reductions in level of sophistication permitted by application of prior
knowledge of the information extraction task. The answer hinges upon the
emphasis the user places upon precision and recall. The current limits in automated
applications are about 60% recall and 55% precision when compared to the
estimated upper limits of interactive human performance of 75% recall and 85%
precision. The risk assessments of users of generic (non-a-priori) systems and
specific (a-priori) systems are bound to be different.
Another consideration is the cost of portability of specific and generic system
architectures. Is it more costly to port a large complicated system with separate
domain specific module to a new domain or will it cost less to port a smaller,
simpler system to the new domain and enhance those modules that are deficient in
processing the new domain?
The approach favored by this research paper is a non-a-priori generic
approach to categorization. This approach provides a certain minimum level of
performance in a wide variety of domains and reduces the risk of missing
important unanticipated data present in the data stream.
The minimum level of performance demonstrated by the techniques
described in this paper demonstrate the capability to categorize reports as atypical
early in the evolution of an event. If reports continue to be received after an event
has been categorized as atypical, the algorithms transition the reports from the
atypical category to the routine category. This reduces the expenditure of future
processing resources on events that have already been identified as atypical in the
past. During the five month period from August 1994 through December 1994
5


several significant newsworthy events were reported in the data collected for this
experiment including:
United States Congressional Elections,
Sinking of the Automobile Ferry Estonia,
Suspicious Iraqi Troop Movements Near Kuwait,
Escalation of the Chechnya Conflict, and
Return of President Aristide to Power in Haiti.
The remainder of the paper discusses how "atypical" or "routine" each of
these events appear to the self-organizing algorithms throughout the evolution of
the reporting of these events in the source data stream.
An interesting application of the technique is to monitor individuals
discussed in the message stream. As time progresses, the circumstances
surrounding an event or an individual will change. The changes may be of
sufficient magnitude to prompt the algorithm to categorize the change as atypical
and provide an early indication to a user that noteworthy change is in progress.
An example of this is shown in Figure 1.2. This graph shows the response of
the neural network to the keyword "Aristide". Time progresses along the bottom
of the graph from August through December 1994. Each reporting period covers
nominally three or four days. For any given reporting period, the neural network
has already processed the preceding reporting periods and can assign reports in the
current reporting period to the "routine" data category or the "atypical" data
category. Of all the document assignments made by the algorithm, Figure 1.2
depicts which category reports concerning President Aristide were assigned for
each reporting period.
On 19 September 1994 Aristide addressed the multi-national coalition that
had been formed to restore democracy to Haiti and stressed "reconciliation rather
than retribution". A high percentage (89%) of the references to Aristide were
assigned by the algorithm to the "atypical" data category. Although there is a
significant amount of data and activity regarding the impending invasion of Haiti,
this is really the first time Aristide appears in the message stream. After the initial
speech in September, all references to Aristide drop from the message stream until
mid-October when Aristide actually returns to Haiti. During the first October
reporting period in which Aristide is again mentioned in the message stream, the
algorithm assigns the majority (70%) of the reports to the "atypical" data category.
For the succeeding two reporting periods the percentage of Aristide reports
6


assigned to the "atypical" data category drops significantly (to 24% and 31%). The
"atypical nature of the reports becomes "routine" as the algorithm processes more
instances of these reports.
70t
SO--
40--
30 -
20
10
Response to Aristide Keyword
17-Oct (31%)
13-Oct (24%)
10Oet (70%) M
19-Sep (89%) | |
i-Fl-rHr~D i i i i i i i
!!!fllfifi 111
6 3
"Atypical* Data "Routine' Data
Figure 1.2
_____jiil j
^ ii 2 i S
7


2.
Self-Organizing Approach
Metric Space
2.1
The vector-space model represents queries and documents as sets of terms
and computes global similarities between queries and documents based on the
presence of terms [2]. The vector-space model assumes that an available term set
represents documents as term vectors of the form
D/ (a,7, a|2,...a//).
In a vector space, the similarity among vectors x and y can be measured by
the product
x y = |x| |y| cosine a,
where |x| is the length of x and a is the angle between the vectors. Modification
and merging of queries or documents in this representation is achieved by
modifying the coefficients a/ including or removing terms from the term sets
comprising the documents.
One of the key weaknesses of the vector space model is the keyword barrier [7].
Text retrieval and categorization algorithms that rely only on the presence or
absence of keywords in the text have great difficulty in distinguishing between
relevant texts and irrelevant texts because only term presence or absence is
measured not the context or sequence of the terms. Different tom ordering make
some texts more relevant than others. The best approach to overcoming the
keyword barrier is to incorporate a-priori knowledge via extensive dictionaries,
lexicons, and natural language parsing mechanisms. Given that we can not or do
not wish to invest significant effort into developing these a-priori methods,
different techniques must be used to overcome the keyword barrier.
The algorithms used in this research use a metric space that attempts to encode
common token sequences in addition to words that are common between similar
texts. The sequences encoded into the metric representation help to ensure relevant
texts are categorized together. The metric space has the following features: there is
a metric value that can be measured between every text; an exactly overlapping
8


match has a metric value of one; and two texts that have nothing in common have
a metric value of zero.
There is an infinite number of text permutations for a full English grammar.
Representing texts as points in a hyperspace would require infinite dimensions to
exactly model the full text hyperspace. In any sample of text some dimension or
combinations of text sequences will be over represented and some under
represented. The effective number of dimension present in the text sample can be
measured and estimated. The effective number of dimensions for the 30,000 text
problem domain was estimated by calculating a non-parametric K-nearest-neighbor
dimension estimate [8]. The estimate of the effective number of dimensions of the
problem domain was 13 +/- 6 dimensions. The estimate does not indicate what the
dimensions in the hyperspace contain, but does give an indication as to how the
text hyperspace is bumpy or convoluted.
Figure 2.1 shows a two dimensional representational slice through the
metric space. Each data point represents a text message that is some metric
distance from the probe texts. The greater the match between the text messages
and the probe messages the closer they are in the metric space. Several texts reside
along one or both of the zero axes. These texts have nothing in common with the
probe texts and represent a zero match or infinite metric distance from the probe
texts. The clustering of data points within the metric space represent clusters of
texts that are similar distances from the probe texts. As the neural network is
trained and each neuron becomes more specialized the neurons will more
accurately represent the text clusters in the metric space.
Modeling this bumpy hyperspace using a-priori methods is difficult. It is
desirable to model the hyperspace using non-a-priori methods to avoid biases
inevitably introduced by a-priori techniques. The general method chosen was to
model the hyperspace using self-organizing (Kohonen style) neural networks [9 -
11].
Self-organizing neural networks model (leant) an input domain by devoting
neural responses to regions of the hyperspace in proportion to the density of the
training data. In the self-organized model adjacent neurons represent data that is
also relatively adjacent in the hyperspace. The model is sometimes referred to as a
smooth cognitive mapping of the hyperspace. The mapping is an ideal baseline
representation model for comparing data.
9


Two-Dimensonal Slice through
Text Metric Space
1
Robe Text V
0 0.25 0.5 0.75 1
Metric Match to Probe Text aX
Figure 2.1
The method used by this research to measure text similarity uses the Longest
Common Sub-sequence (LCS) algorithm [12]. This algorithm captures the
sequential structure of word tokens contained in two texts in addition to the
presence of word tokens. Figure 2.2 shows an example of the application of the
LCS algorithm. The LCS algorithm performs the task of measuring distances
between documents and neural network nodes. To update and train neural network
nodes requires a method of combining text documents so the neural network
"weights" can be modified. In the representation for this research the neural
network weights are exactly the text strings manipulated by the LCS algorithm.
A genetic algorithm (GA) performs the task of combining text documents.
The GA encodes all token sequences from two parent texts into a graph. The GA
generates new child sequences by randomly traversing the graph and extracting
token sequences. The LCS algorithm repeatedly compares the child sequences to
die parent texts and the GA replaces the parent sequences with the best matching
child as the merged representation. The graph traversal and extraction process
makes use of a topological sorting algorithm modified to work within cyclic graphs
[12]. Figure 2.3 depicts the text report merging algorithm.
10


Longest Common Subsequence
Matching Algorithm
Consider the following two strings to be matched:
peter piper picked peck pickled peppers
how many pickled peppers did peter piper pick
The LCS algorithm implemention tolerates missing and transposed
characters producing token matches for the following token pairs:
'how many pickled peppers did peter piper pick
The metric score for this match is calculated using this formula:
[min (Ien(string1), len (string2)) + sqrt (dffipen (stringl), ten(string2)])]
or for this example: 3/(6 + sqrt(2)) = 0.404628
This algorithm captures the notion of token sequences and token
presence. Metric scores range from 0 to 1. Exact matches score 1
and the score falls off rapidly for token miss-matches and less
rapidly for differences in the source string lengths.
peter peppers peter peter
piper peppers piper piper
picked pickled picked picked
picked pick pickled pickled
The longest common subsequence is of length 3:
'peter piper pick(ed)'
peter piper picked peck pickled peppers
len(LCS)
Figure 2.2
11


Genetic String Merging Algorithm
Consider the fblowing two strings to be merged:
peter piper picked peck pickled peppers Parent 1
how many pickled peppers did peter piper pick Parent 2
These two parent strings can be combined into a graph using the
LCS string matching algorithms described in the text:
Start_Symbol
produces a series of child string token sequences that can be evaluated using a
genetic algorithm for compatability with the original parent token sequences. The
merged string is considerably nearer in the hyperspace to each of the parents than
the parents are to each other. The evaluation function averages the LCS matches
between each of the parents and the child string. After an algorithmically defined
number of iterations the genetic algorithm returns the best merge evolved so far.
For this example the Parent 1 and Parent 2 Match is:
peter piper pick(ed)'
match value = 0.404028
The Best Parent 1 and Parent 2 Merge (topological sort) is:
how many peter piper picked peck pickled peppers did pick
merge_value = 0.693667 = ((0.75 + 0.637334) / 2)
The Parent 1 and Child match is:
*peter piper picked peck pickled peppers
match_value = 0.75; improved from 0.404628
The Parent 2 and Child match is:
how many pickled peppers did peter piper pick'
match_value = 0.637334; improved from 0.404628
Figure 2.3
12


The training algorithm for a self-organizing neural network depends upon a
method to measure the distance between each node and the input and a method to
modify the most responsive neuron so that it responds even more strongly to the
input. The LCS metric and the GA text merging algorithms described above
accomplish these tasks in the text domain. The GA and LCS algorithms combine
to effectively model the text message metric space making it possible to compare
nominal baselines with off-nominal input data. The "weights" that comprise the
neural representation in the text domain are actually text strings. The self-
organizing text neural network architectures investigated are shown in Figure 2.4.
This research developed and investigated two self-organizing architectures.
First, the self-organizing neural net architecture consists of a one-dimensional
circular array of processing nodes (Figure 2 .4.a). The neural network array passes
through the baseline data partitioning the hyperspace into regions of attraction
around each node. When comparing new data to the baseline, the new documents
are attached to the nearest, most responsive, baseline node. New data clusters
should all cluster about the same baseline node.
However, because of the high number of dimensions of the hyperspace a
new text document is almost identically close to several baseline nodes.
Documents forming a new data cluster subsequently span the regions of attraction
of many baseline nodes and the cluster fractures among multiple baseline nodes as
shown by the shaded cluster in Figure 2.4.a.
The second architecture design addresses the fracturing problem observed in
the one-dimensional neural net. The Extensible Minimum-Spanning-Tree (MST)
architecture grows a tree to cover the hyperspace (Figure 2.4.b). The tree
architecture is maintained when adding new data to the baseline. Adding new data
as an extension of a tree branch forces large clusters to be attached to the baseline
at a single node. This creates tighter clusters and large clusters are more likely to
rise above the noise floor and trigger further analysis. Note the same points
fractured across three baseline nodes in the one-dimensional case attach to a single
baseline node in the extensible MST architecture.
2.2 Preliminary Self-Organizing Text Experiments
Early experiments tested the ability of the neural network algorithms to map
input texts and demonstrated the occurrence of self-organizing behavior. Figures
2.5 and 2.6 plot neural network responses from these experiments. The network
architecture for these experiments consisted of a one-dimensional (circular)
13


network with three-hundred neurons (see also Figure 2.4.a). The Kohonen
"neighborhood" size was three neurons wide.
Figure 2.5 shows a trace of the most positive neural response to training
exemplars over 10000 training iterations. The network response increases steadily
throughout the experiment. Figure 2.6 shows the frequency each neuron responds
to exemplars during the same run depicted in Figure 2.5. The Figure shows the
time evolution of the neural responses (time increasing back to front). Each neuron
is represented by a point along the left to right axis. Initially only a few (on the
order of 10) neurons respond to the input exemplars giving a few very high peaks
early in the evolution of the network. As self-organization occurs and each neuron
becomes more specialized, the neural responses spread across the neural network.
When the experiment was stopped after 10000 iterations almost two-thirds of the
neurons were responding to the training data. The network exhibits a much more
uniform level of frequency response for each neuron.
This experiment demonstrates that the text neural network successfully
models (self-organizes) the text hyperspace represented by the training data. The
LCS metric and the genetic merge algorithms demonstrate capability to manipulate
and modify the neural network in a manner consistent with self-organizing
(Kohonen network style) behavior. After training is completed the network is
frozen and h becomes a baseline against which the next set of data can be
compared. Differences between the baseline and the distribution of new data
mapped to the baseline indicate atypical data that merits further study by the users.
Unfortunately, initial runs comparing baselines to new data were
unsatisfactory. Although strong response peaks were obtained, the documents
comprising the peaks subjectively appeared to discuss random topics. To further
understand the behavior responsible for the fracturing of the response peaks a new
"neural" architecture was implemented. The new the neural architecture is similar
in construction to a Minimal Spanning Tree (MST).
The MST architecture network size was chosen to be 384 nodes. The
algorithm assigns a document from the training data to each of the nodes and
calculates the distance between each document pair. Then using Prims MST
algorithm, document pairs are identified, merged, and inserted as a component of
the MST neural architecture baseline.
14


Self-Organizing Architectures
Identify the Current Reporting Periods Most Atypical Data
Previous Reporting Period Baseline
Current Period Compared to Baseline
*"
Baseline Data Routine Data
a) Self-Organizing Neural Net Architecture
A Atypical Data
Current Period Compared to Baseline
Baseline Data Routine Data
b) Minimal Spanning Tree Architecture
Figure 2.4
a Atypicrt Data
15


16


Once the baseline is constructed, new data is compared to the baseline. The
one-dimensional self-organizing algorithm assigns new documents to the most
similar, most responsive node in the network. In contrast, the MST based
algorithm attaches new documents to the most similar document in the baseline
network unless there is a more similar non-baseline node already attached to the
baseline. This modification makes the MST network a continuously extensible tree.
This modification produces significantly more coherent new data clusters attached
to the baseline than the one-dimensional neural algorithms produce.
The MST network is the final architecture used to conduct the majority of
the experiments reported in this paper. Using the extensible network technique on
the one dimensional self-organizing network should produce similar clustering
results to the MST architecture.
The advantage of the MST network is that it has a well-defined structure
that requires less compute power to construct than the training required by the
one-dimensional network. The disadvantage of the MST network is that the size of
the network must be chosen a-priori and it is difficult to consistently match the
network size to the number of documents that must be categorized. The advantage
of the self-organizing network is that it can be of smaller more consistent sizes (for
example 128 nodes compared to 384 for MST). The disadvantage of the self-
organizing network is that more computing effort is required to train the network
to fully cover the text hyperspace. Specific application areas must trade off speed,
coverage, and architecture issues.
17


3. Experimental Setup
3.1 Data Collection
The data for this research was collected from a variety of sources on the
Internet [13 21]. (Note: The exact internet addresses of the sources used are
given in the references. Since the time that the data was collected, several of the
internet addresses have changed) Some of these sources became extinct during
the course of data collection; others were added. The degree to which each source
was kept up to date also varied greatly. The Open Media Research Institute
(OMRI) Daily Digest maintained the most reliable data source (this source was
formerly called the Radio Free Europe/Radio Liberty Daily Report). Data
collection covered a five month period from August 1994 through December
1994. Processing real data rather than simulated data provided several advantages:
the wide variety of data collection anomalies present in the data more rigorously
stressed the algorithms and the direct application of the algorithms to operational
data processing domains is more readily apparent.
The raw data covers a very broad topic range and consists mostly of news
press releases. It also contains transcriptions of speeches by various dignitaries,
United States Embassy reports, and a small amount of miscellaneous network
administrative traffic. Data preprocessing removed (exactly) duplicated reports
from the various sources and split very large messages into multiple paragraph
sized units. Additional preprocessing of the data removed duplicate letters,
punctuation, stop words (the, fin*, one, are, ami) and converted uppercase
characters to lowercase. The preprocessing eased the computer resources required
to conduct the experiments but did not fundamentally change the content of the
text reports. The text reports were then grouped into nominal three day or four
day reporting periods (Monday, Tuesday, Wednesday, and Thursday, Friday,
Saturday, Sunday).
After these preprocessing steps, the text database used for this research
contains 29,841 reports requiring 12.3 Mb of storage. The five month collection
period contains 43 reporting periods. The average report contains 68 words and
412 characters. The average reporting period volume was 694 reports. The highest
reporting volume was 2,323 reports collected for November 14, IS, and 16. The
lowest reporting volume was 100 reports collected for December 26,27, and 28.
18


3.2 Implementation
The software implementation consists of several standalone modules with
intermediate files for communication. No attempt was made to make the software
modules operationally robust. The modules consist of prototype software and glue
code to implement the self-organizing text neural network proof of concept. A
short driver module and scripts order the software processing on each sequential
pair of reporting periods. The software implementation consists of 1700 lines of C
and C++ code. The executables run on a Silicon Graphics Indy (100MHz) and a
dual CPU Silicon Graphics Challenge (2x150MHz). Even given the fairly
significant computing resources, processing the five month volume of data
required about 700 CPU hours to completely categorize the data (three CPUs 24
hours a day 10 days).
19


4.
Results
4.1 Predicted Text Correlation Results
In order to objectively measure the performance of the self-organizing
algorithms and their effectiveness at differentiating atypical data from routine data,
the distance between all the documents in a given text category to all the
documents in a second text category was calculated. This measurement captures
the concept of the extent, density, or the closeness of all of the texts. I call this
measurement the cross-correlation of the text: CC(). Figure 4.1 shows a Venn
diagram representing the text hyperspace and the coverage of the hyperspace by
the baseline data and the data to be compared to the baseline.
A key assumption for the use of the CCO metric is that the distribution of
documents in the baseline hyperspace is commensurate with the distribution of
documents in the following reporting period. In other words, the data in the two
reporting periods have roughly similar, overlapping distributions. Given this
assumption, then the relations given in Table 4.1 can be predicted.
Next Reporting Period Data Compared to Baseline
Baseline
(X)
Next Reporting Period
m_________________
Figure 4.1
20


Table 4.1 Cross Correlation of Text Document Categories
Relationship Description Prediction Success
l CC(x, x) ~ CC(y, y) The shape (distribution) of the baseline is similar to the shape of comparison 6/9 67%
2 CC(yl, y2) < CC(x, x) The correlation of the atypical category points to the routine data points is less than the correlation between all the points in the baseline as a whole. 8/9 89%
3 CC(yl,y2) 4 CC(x, yl) < CC(x, x) The correlation between the baseline points and the atypical points is less than the correlation of the baseline points to themselves 8/9 89%
5 CC(x, yl) < CC(y, y) The correlation between the baseline points and the atypical points is less than the correlation of the comparison data points to themselves 8/9 89%
6 CC(x, y) < CC(x, y2) The correlation between the baseline and the comparison data is less than the correlation of the baseline and the routine data 4/9 44%
7 CC(x, y) < CC(x, x) The correlation of the baseline and the comparison data is less than correlation of the baseline data points to themselves 8/9 89%
8 CC(x, y) < CC(y, y) The correlation of the baseline and the comparison data is less than the correlation of the comparison data points to themselves 7/9 78%
9 CC(yl, y2) < CC(y, yl) The correlation between the atypical and the routine data is less than the correlation between the atypical data and the comparison data. 9/9 100%
10 CC(yl,y2) 11 CC(x, y2) < CC(x, x) The correlation between the baseline and the routine data is less than the correlation of the baseline data points with themselves. 9/9 100%
Less Correlated < More Correlated
x represents the baseline data y represents data compared to the baseline (the comparison data, the atypical + routine data) yl is data that does not overlap x (atypical data) yl is data that does overlap x (routine data) A higher value for CCO indicates the texts are close in the text metric space A lower value for CCO indicates the texts are further from each other in the metric space
21


Table 4.1 describes a series of predictions which can be made given the
distribution of text documents in the hyperspace as depicted in Figure 4.1. In
general, if a given text category is a superset or has a greater extent in the
hyperspace than a second text category, then it will be less correlated than the
second category. The CCO metric was calculated for all possible text categories
for one month of the available five months of test data. The predicted cross
correlation results hold for 83 out of 99 opportunities or 84% of the time. These
results give a strong positive measure of the degree to which the algorithms
categorize and partition data into separate categories containing similar data.
The poorest result is for prediction number 6. The reason this prediction
performs so poorly appears to be in the assumption regarding the distribution of
texts throughout the hyperspace. The Venn diagram (Figure 4.1) draws the
hyperspace as circular and it is easy to imagine there is a uniform distribution of
documents within the hyperspace. However, the actual text hyperspace is not
guaranteed to be uniform and by removing the distribution assumption it is easy to
construct distributions (in particular non-spherical ones) for which prediction 6
always fails. The fact that all the other predictions still generally hold up is
encouraging and tends to validate the premise that one can build atypical and
routine data categories based upon an initial data baseline.
4.2 Event Clusters Extracted from the Message Stream
Given the CCQ metric evidence the algorithms form coherent data clusters,
what data is contained within the clusters? During the five month period from
August 1994 through December 1994 several significant newsworthy events were
reported in the data collected for this experiment including:
United States Congressional Elections,
Sinking of the Automobile Ferry Estonia,
Suspicious Iraqi Troop Movements Near Kuwait,
Escalation of the Chechnya Conflict, and
The Return of President Aristide to Power in Haiti.
The experiment processed all of the five months of data into two main
categories fin1 each reporting period. The first category contained data derated to
be the most atypical data as compared to the previous reporting periods baseline
(yl data from Figure 4.1). The second grouping contained data derated to be the
most routine data as compared to the previous reporting period's baseline (y2
data). The ratio of the assignment of most atypical to most routine was kept
22


constant for the entire experiment. The algorithms assigned 30% of the reported
texts to the atypical category and 70% of the reported texts to the routine
category. The choice of the 30% to 70% split was arbitrary. Figure 4.2 and Figure
4.3 depict the reporting period volume as well as the percentage of documents
assigned to each category. Except for month-end boundary condition problems at
the end of October and November the algorithms conastently assigned 30% of the
reporting volume to the atypical data category.
Because of the separation of data into two categories it is posable to
perform after the fact queries on the data files to determine where the algorithms
assigned documents concerning particular topics. It is important to remember that
during processing, the algorithms used no a-priori definitions to perform the
categorization. The decision to assign a particular document to the routine or the
atypical categories was based completely on the degree the document matched the
self-organized baseline. After document assignment, the performance of the
algorithms on topics known to be important can be examined. An operational
system would categorize and prioritize topics at the time of data collection. This
would provide the user a prioritized categorization of atypical and routine data at
the earliest developmental stages of the event, possibly before changes in the data
stream are recognized as important by the user.
Reporting Volume (Number of Reports per Reporting Period)
2600 t..................
2000........
500-
14-Nov 2323 Reports
1500 -
1000.........................................
26-Dec 100 Reports
3 2
I'Atypical* Data Routine* Data
Figure 4.2
23


Reporting Volume (Percentage Atypical/Routine)
10* I I I 1 I I I I I I I I I I I II I ITH nT'l I I I II I I
80% -
60%--
40%
'Atyptcaf Data 'Routine' Data
Figure 4.3
Figure 4.4 and Figure 4.5 graphically show the responses to the query
keywords "ferry and "disaster" and reveal the network assignment of documents
reporting a unique one-time event. The automobile ferry Estonia sank on
September 28,1994 with over 1000 lives lost. The neural network demonstrates
very interesting behavior during die first three reporting periods after the ferry boat
sinking. The network displays a nearly identical response to the two probe
keywords (ferry, disaster) and the following discussion details only the response to
the "ferry keyword. (Responses to the probe keywords prior to September 28
concern obviously unrelated events.)
The first reporting period is the 26-Sep period. The neural algorithm assigns
seventy-five percent (24/32) of the documents reporting the ferry boat incident to
the atypical document category. During the next reporting period (29-Sep) the
network assigns fifty-five percent (41/75) of the documents reporting the incident
to the atypical category. Note that although the total reporting volume increased
by more than 130% (75/32), the reporting volume in the atypical category
increased only 70% (41/24) because several of the documents (routinely) repeat
information now encoded into the baseline. By the third reporting period (1-Oct)
the volume of data assigned to the atypical category has fallen dramatically, down
to twenty percent (12/59) of the total volume. By the third reporting period the
ferry boat disaster no longer represents an atypical event (there is adequate
24


coverage in the baseline) although it remains a significant newsworthy event and
steady reporting on the event continues throughout the remaining collection
period.
This represents very desirable behavior for situation awareness. With no a-priori
knowledge, the network assigned a large percentage of the initial reports on an
event to the atypical category. After rapidly building a peak in the atypical
category, the response of the network then rapidly decays as the new event
transitions to a routinely reported event. This behavior serves to focus the
attention of the user onto the prioritized atypical category and then rapidly lowers
the priority of already identified events so the user can devote analysis resources to.
subsequent new data and events.
The response of the network to keywords related to the congressional
elections is shown in Figure 4.6 and Figure 4.7. The response to the keyword
"elect" rises steadily throughout the election. This graph (Figure 4.6) represents a
steady state discussion of a topic continuously occurring in the message stream.
25


Response to Beef Keyword
UO-t............................................
120.............................................
100.......................................fT"'
60 ............................... """"
60 ............................... ....
40
20
0
'AtypicaT Data 'Routine' Data
Figure 4.6
26


251..................
Response to "Mandate' Keyword
20-....................
15-------
10-
21-Nov 45% (10/22)
29-Sep 80% (8/10)
28-Nov 53% (6/15)
IfffSIll
W- OI
IAtypical1 Data Routine* Data
Figure 4.7
The election generated considerable discussion regarding the alleged
Republican mandate and the Contract With America. The network response
(Figure 4.7) to the keyword mandate does indeed show peaks soon after the
election. However upon further analysis these peaks are not related to a mandate
regarding the Contract With America, but rather represent discussions regarding
support for the United Nations peacekeeping mandate in Bosnia. In this case, the
collection source (Bosnia/Latvia news sources) biases the expected content of the
message stream (relative to the biases of die U.S. press). This example shows the
advantage of using non-a-priori techniques to focus and prioritize analysis of the
message stream to overcome user bias.
The mandate graph displays an additional peak for the 29-Sep reporting
period. What organization has a mandate on this date? Analysis of the raw data
inspired the keyword queries depicted in Figure 4.8 and Figure 4.9. The mandate
peak on 29-Sep is due to a group referred to as the "Latvia's Way Faction". The
members of the faction were discussing a mandate regarding the composition of
political power in the faction. Indications of the power wielded by the Latvia's Way
Faction and its party members appear strongly in the atypical category as early as
the 12-Sep reporting period. Figure 4.9 shows how frequently the keyword
"faction" appears in the data. There are many factions of various types appearing
constantly in the data. This demonstrates the difficulty of using only a-priori
27


keyword query mechanisms to analyze the incoming message stream. Too broad a
query ("faction") generates too many responses; too narrow a query ("Latvia's
Way Faction") is perishable, requires a-priori knowledge, and requires continuous
maintenance to keep the query terms up to date.
Using the self-organizing approach significant indications of new
developments can be seen at an early date. In this example, the first indications of
the workings of the fiction appear during the 12-Sep reporting period; a full five
reporting periods prior to the mandate discussions during the 29-Sep reporting
period.
Figure 4.10 and Figure 4.11 show the response to keywords concerning the
Middle East. The peaks revealed by the Arafat keyword query primarily concern
trips by Secretary of State Christopher and President Clinton to the Middle East.
Significant discussion takes place regarding the Hamas terrorists, the PLO-Israeli
Peace Plan, and the kidnapping and murder of an Israeli Corporal Wachsman. The
peaks revealed by the Iraq keyword concern the movement of troops by Saddam
Hussein in a threatening manner near the Kuwait border. This activity also
coincides with Secretary Christopher's trip to the mid-east. There is repeated
support for Kuwait, warnings to Iraq not to "repeat mistakes" and a general
coming together of the opinions of the U.S., Britain, United Nations, and Egypt.
Response to "Latvia Way Keyword
AtyptcaT Data Routine Data
Figure 4.8
28


Response to Taction' Keyword
Atypical* Data "Routine Data
Figure 4.9
Response to 'ArafaT Keyword
45 ............................................ -........
40-......................................................
35-......................................................
30............................................. -r-------
25- ........................................... .........
20-............................................ .........
Atypical*Data 'Routine*Data
Figure 4.10
29


Response to Iraq* Keyword
'Atypical' Data 'Routine' Data
Figure 4.11
Some limited overlap occurs between documents that simultaneously
mention Iraqi actions and Mr. Arafat. The patterns are perhaps (subjectively) more
consistent with the focus of the collecting news agencies while they track and write
reports concerning the movements of U.S. officials and how events compel the
U.S. officials to comment during their trips through the area. There seems to be a
positive feedback between events and a person's abilities to elicit commentary upon
the same events.
The last three Figures, 4.12,4.13, and 4.14, concern the evolution of events
in Estonia, Russia and Chechnya. The civil war in Chechnya escalated severely
during the second week of December 1994. Figure 4.12 shows the reporting on
Chechnya (Chechen) activity during the collection period. As expected the volume
of reports and the percentage of reports assigned to the atypical categories
increase dramatically during the 12-Dec reporting period. Interestingly, the first
time President Yeltsin vows not to invade Chechnya with troops was reported
during the 21-Nov reporting periods. These reports were assigned to the atypical
text category. At this same time, Yeltsin's comments are widely reported in the
collected data concerning a conflict over the border with Estonia (Figure 4.13).
Russia unilaterally decided to close the border setting up guard posts, checkpoints,
and the like. The data contains frequent references to Russia building on Estonia
30


land and refusals to return Russian land to Estonia, etc. The rhetoric seems to be in
the same line as the rhetoric concerning the Chechnya conflict.
Specifically probing the data for documents regarding the Estonia Russia
border reveals the peaks shown in Figure 4.14. It is easy to see the large peaks
during the 21-Nov reporting period and the 19-Dec reporting periods that
correspond to the peaks in Figure 4.12 and Figure 4.13. Also very clear is how
early indications of the conflict are reported more than two months prior during
the 12-Sep reporting period. This figure strongly supports the argument that non-
a-priori techniques can be used to identify topics that will later become important
and that self-organizing techniques can be used to support and prioritize user
situation awareness.
so-.
45-
Response to Chech* Keyword
12-Dec 49% (24/49)
'Atypical" Data "Routine* Data
Figure 4.12
31


32


S. Summary
This paper presents results of an investigation into self-organizing methods
capable of accepting data from a series of reporting periods, building a baseline to
model the data hyperspace, and distinguishing the most atypical data in subsequent
repotting periods from data corresponding closely to the previous baseline. These
techniques are valuable tools to process data when a-priori techniques are too
expensive or insufficient to track a dynamic environment. The algorithms prioritize
and focus user attention and resources on the newest, most atypical, and
interesting data in the data stream. The results demonstrate the algorithms
successfully find early indications of events that later become significant to the data
analysts. This capability improves situation awareness and creates longer lead
times to respond to the events reported in the data stream.
The experiments described in this paper were conducted using real report
data. The data suffered from all the problems of real data including duplicate
reports, widely varying day to day data volume, and changes in the quality and
quantity of the data sources. Real data made the experimental processing more
effective and robust. The data processing algorithms have applications for many
long term data monitoring situations whether using formatted, attribute, or free
formatted data, in a government agency, or corporate environment.
An interesting topic for further research would be to conduct a detailed
precision and recall study. Precision and recall analyses generally take advantage of
a-priori measurements of how accurate a given query should be. In the case of the
experiments reported here there is no concept of a-priori knowledge. A detailed
precision and recall study would have to account for the context of messages
within the constructed baseline and account for the assignment of documents
between the atypical category and the routine category. The sensitivity of the
assignment to changes in the baseline should be studied. A detailed precision and
recall study could also determine whether the LCS metric is more or less effective
than a more conventional vector space metric such as the cosine between two
vectors.
Also of interest would be building an operational on-line situation awareness
workstation connected to live data. User feedback on the effectiveness of the
33


processing would be valuable in enhancing the algorithms. Building robust
interfaces to live data feeds requires significant work. Tools to visualize and
browse the raw data and text documents are required. Additional temporal and
spatial analysis of the text documents needs to be addressed. The current
algorithms merely compare hyperspace baselines to subsequent reporting periods.
It is likely that structural portions of the hyperspace can be logically or intuitively
linked over long temporal periods and over geographic and spatial areas enhancing
overall user situation awareness and understanding. The document browsing
functions could be used to traverse the byperspace in both a text hyperspace sense
and a temporal-spatial hyperspace sense.
Non-a-priori self-organizing techniques represent a valuable tool to remove
bias and identify atypical data. Application areas include long term dynamic data
monitoring environments and environments where it is impractical or too
expensive to invest development resources building robust domain specific data
analysis algorithms. The experiments reported in this paper demonstrate the
effectiveness of building nominal hyperspace baselines using the raw data itself.
These techniques represent labor and resource saving data processing methods as
only the most atypical data identified in the baselines requires additional analysis.
The algorithms automatically categorize redundant and routine data as lower
priority.
34


APPENDIX
A. Sample Plain Text Messages
The following sample plain text messages are from source [18] on 10 October
1994. The first message is related to data shown in Figures 4.5 and 4.6. The
second message is related to data shown in Figure 4.11. The third sample message
is related to data shown in figure 4.14.
AFTERMATH OF ESTONIAN FERRY DISASTER Nordstrom and Thulin ABthe Swedish
co-owner of the Estline company, whose feny "Estonia" sank on 28 September, killing more than
900 peoplehas decided to pull out completely from passenger feny service, BNS reported on 7
October. It will end its ties with the Stockholm-Tallinn passenger feny immediately and will not
renew its contract, due to expire in 1997, to run a passenger ferry from mainland Sweden to the
island of Gotland. The other Estline partner, Estonian Shipping Co., plans to put a replacement
ferry into service between Stockholm and Tallinn in November and is seeking another partner.
The "Estonia" was insured for $60 million in Sweden. The Estonian government on 6 October
dissolved its investigation commission on the feny disaster and appointed an Estonian
cooperation commission to work within the International Board of Inquiry. It will be headed by
Transportation and Communications Minister Andi Meister.
RUSSIA WARNS IRAQ ABOUT KUWAIT. Russia told Iraq that it must not defy the UN
Security Council over Kuwait. A Foreign Ministry statement quoted by ITAR-TASS on 10
October said the council was determined Iraq should recognize the sovereignty and independence
of Kuwait and its borders. Russia generally supported the international coalition that drove Iraq
out of Kuwait in 1991, but did not take part in the conflict It signed a defense cooperation
agreement with Kuwait in December 1993. Interfax quoted a Defense Ministry spokesman as
saying he was unaware of any military moves by Russian forces in response to the Iraqi troop
movements near the border with Kuwait.
ESTONIAN GOVERNMENT CONTINUES TO EVALUATE ACCORDS WITH RUSSIA. The
Estonian government extended from 5 October to 28 October the deadline for evaluating the
Russian-Estonian agreements on the troop pull-out and social guarantees for retired Russian
officers, according to Press Secretary Ain Saama. The decision was made following a request by
the Justice Ministry, which needs more time to assess the documents before recommending them
for ratification, BNS reported on 7 October. Russian Foreign Ministry official Sergei Prikhodko
expressed regret over the delay and said the Foreign Ministry has submitted the agreements to the
government so that the president can send them for parliamentary ratification. Claiming that
"everything is perfectly clear in legal terms, Prikhodko said "it is astonishing that Estonia still
feels the need for some kind of juridical expertise."
35


REFERENCES
[1] Holland, J., Holyoak, K., Nisbett, R., Thagard, R., Induction Processes of
Inference, Learning, and Discovery, MIT Press, Cambridge MA, 1987.
[2] Salton, G., Automatic Text Processing, Addison Wesley Publishing Company,
Reading MA, 1989
[3] Sundheim, B.M., "Overview of the Third Message Understanding Evaluation
and Conference", in Proceedings Third Message Understanding Conference
(MUC-3), Morgan Kaufinann Publishers, Inc., San Mateo CA, May 1991, pp. 3-
16.
[4] Sundheim, B.M., "Overview of the Fourth Message Understanding Evaluation
and Conference", in Proceedings Fourth Message Understanding Conference
(MUC-4), Morgan Kaufinann Publishers, Inc., San Mateo CA, June 1992, pp. 3-
21.
[5] Lehnert, W., Sundheim, B., A Performance Evaluation of Text Analysis
Technologies, AI Magazine, Fall 1991, pp. 81-94.
[6] Rilof£ E., Automatically Constructing a Dictionary for Information Extraction
Tasks, in Proceedings of the Eleventh National Conference on Artificial
Intelligence, AAAI Press/MIT Press, 1993, pp. 811-816.
[7] Mauldin, M. L, Conceptual Information Retrieval A Case Study in Adaptive
Partial Parsing, Kluwer Acedemic Publishers, Boston MA. 1991.
[8] Fukunaga, K., Statistical Pattern Recognition, Acedemic Press Inc. San Diego
CA, 1990.
[9] Kohonen, T., Self-Organization and Associative Memory, Springer-Veriag
New York, 1989.
[10] Ritter, H., Thomas, M., Schuhen, K., Neural Computation and Self-
Organizing Maps, Addison-Wesley Publishing Company, Reading MA, 1992.
36


[11] Caudill, M., Neural Network Primer, Reprinted from AI Expert Magazine,
Miller Freeman Publications, 1990.
[12] Cormen, T., Leiserson, C., Rivest, R, Introduction to Algorithms, MIT Press,
Cambridge MA, 1992.
[13] //cnd.cnd.org/11 /English-Menu/cnd-china, cnd-china
[14] //cnd.cnd.org/11 /English-Menu/cnd-global, CND Global News Service
[15] //cnd.cnd.org/11/English-Menu, English-Menu China news stuff
[16] //econ.tucson.as.us/11/foreign, Gopher server foreign news stuff
[17] //gopher.lvnet.lv/1 l/NEWS%20of%20Latvia/Baltic%20News%20Service,
Baltic states: Baltic News Service
[18] //solar.rtd.utk.edu/friends/news/rferl/master.html, OMRI Daily Digest
[19] //smile.srce.hr/11/eng/po-temama/news, Balkan: Several news sources from
the Balkan
[20] //www. nando.net/epage/htdocs/links/newspapers.htmL, On Line Newspapers
[21] //www.viabalt.ee/, The Baltics Online News
37