Knowledge extraction from kinetic series micro-array data using a knowledge discovery process

Material Information

Knowledge extraction from kinetic series micro-array data using a knowledge discovery process
Quayum, Nayeem Md
Publication Date:
Physical Description:
x, 51 leaves : ; 28 cm


Subjects / Keywords:
Gene expression ( lcsh )
Data mining ( lcsh )
Biometry -- Data processing ( lcsh )
Biometry -- Data processing ( fast )
Data mining ( fast )
Gene expression ( fast )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


Includes bibliographical references (leaves 50-51).
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Nayeem Md. Quayum.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
71814862 ( OCLC )
LD1193.E52 2006m Q82 ( lcc )

Full Text
Knowledge Extraction from Kinetic Series Micro-array Data
Using a Knowledge Discovery Process
Nayeem Md. Quayum
B.S., St. Cloud State University, 2003
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science

This thesis for the Master of Science
degree by
Nayeem Md. Quayum
has been approved

Quayum, MD. Nayeem (M.S., Computer Science)
Knowledge extraction from Kinetic Series Micro-array Data Using a Knowledge
Discovery Process
Thesis directed by Professors Krzysztof Cios and Jan Jensen
The advent of micro-array gene expression technology has enabled the measurement of
the expression levels of thousands of genes, which has led to a qualitative change in
the understanding of regulatory processes occurring at the cellular level. Novel tools
are required for comprehensive comparison of, and biological knowledge extraction
from, these massive amounts of data, which typically have many attributes (genes) but
few samples. This makes analysis of such data difficult and error prone. For this
reason it is important to follow good data mining practices, such as a Knowledge
Discovery Process (KDP), to extract potentially new and useful knowledge. In this
thesis we apply the six-step KDP model to micro-array data. For modeling we use an
unsupervised Self-Organizing Map as an effective clustering algorithm that allows for
visualizing high-dimensional complex data in 2D space. Using the KDP model we have
extracted crucial biological knowledge from the time series micro-array data. Namely,
the information we found will help biologists to find genes with similar functionality,
to find genes sharing a similar expression profile in different tissues, and to describe
the developmental process of a tissue. Most importantly, we have identified
progenitor-cell enriched genes in all three tissues studied.
This abstract accurately represents the content of the candidate's thesis. I recommend
its publication.

I owe much gratitude to Dr. Krzysztof Cios and Dr. Jan Jensen, who have provided
me with encouragement, opportunities, and challenges for the last two years. Because
of their guidance I am now looking forward to a future I would never have envisioned
when I started this work. I would also like to acknowledge the effort of the researchers
who posted the micro-array data sets at the NCBI website. In particular, however, I
acknowledge Juhl Kirstine and Dr. John Hutton at the Barbara Davis Center for
Juvenile Diabetes at the UCDHSC, who allowed me to use their pancreas data set. I
would also like to especially thank Alex Kutchma of the Jensen Lab at BDC for his
extensive help and advice throughout the thesis work. He has played the most
important domain expert role after Dr. Jensen, and without Alex's help with the
biological context this thesis work would have been much more difficult.

1. Introduction
2. The problem domain
3. Understanding the data
4. Preparation of the Data
4.1 Background/signal adjustment
4.2 Normalization
4.3 Summarization
4.4 Deriving Absolute Analysis call
4.5 Preprocessing
5. Data Mining
6. Evaluation of the Discovered Knowledge
6.1 Testis developmental time course in embryo (GSE1358)
6.2 Kinetic Heart Series of Both Ventricles (GDS627)
6.3 Kinetic Heart Series of Atrial Chamber (GDS627)
6.4 Pancreatic Developmental Time course
6.5 Spermatogenesis development time course (GSE926)
6.6 Ovary developmental time course (GSE1359)
7. Use of the discovered knowledge
7.1 Finding all the gene expression profiles
7.2 Finding genes with similar functionality
7.3 Find genes in different tissues with same profile
7.4 Describe the developmental process of a tissue
8. Conclusion

3.1 Affymetrix GeneChip structure
4.1 Proposed transformation of the data
4.2 Grid and associated Background/noise calculation process
4.3 MAS 5.0 normalization procedure
5.1 Representation of probe-sets as points in multidimensional space
5.2 SOM NN topology
5.3 Gene expression of selected probes from cluster number 1 of heart data set
5.4(A) Possible 9 cluster profiles in data containing three day points
5.4(B) Three groups of these nine clusters
6.1 30 Clusters associated with GSE1358 (MGU74A chip) data set
6.2 9 Clusters associated with GSE1358 data set
6.3 25 Clusters associated with heart Both Ventricle data set
6.4 25 Clusters associated with heart atrial chamber data set
6.5 25 Clusters associated with the Pancreas data set
6.6 Similarities among 20 clusters associated with Spermatogenesis data set
6.7 Average expression levels of 9 clusters associated with Spermatogenesis data set
6.8 Similarities among 30 clusters associated with ovary data set
6.9 Average expression levels of 9 clusters associated with ovary data set
7.1 Presence of Secondary Transition in the developmental process of Pancreas
7.2 Different expression profile in GSE1358 Data set
7.3 Gene list containing transcription factor

4.1 Probe-sets passed the filtering step in different data sets
5.1 Quantized error associated with different number of clusters in different dataset

1. Introduction
Recent studies have shown that DNA micro-arrays can measure gene expression
comprehensively in a biological system. Using this technology researchers produce
genome-wide expression data containing immense amounts of biological information
waiting to be discovered and utilized. A kinetic series micro-array experiment is set
up so that the gene expression of a tissue is measured, for all genes simultaneously,
at different developmental time points. The data provide information for inferring
transcriptional regulation among various genes, and finding such regulation is one
of the major tasks in data analysis.
The Knowledge Discovery Process (KDP) consists of several steps and provides
a procedural approach for extracting useful, potentially new, knowledge from a large
data set. Simply applying various data mining algorithms is not sufficient for a
successful data-mining project. For that reason the KDP has been formalized to help
organizations understand and conduct data mining in an organized way, by providing
a road map to be followed while planning and carrying out the project. Among the
several models proposed since the 1990s we decided to use the KDP model reported
in (Cios et al., 2000). The model emphasizes the iterative and interactive aspects
of the KDP and involves several feedback loops. It provides a general description of
the steps and guides the user in avoiding common mistakes by deploying some
countermeasures. The first step is understanding the problem domain, which
involves working closely with domain experts to define the problem and determine the
project goals. The second step is understanding the data, which includes collecting the
micro-array data, understanding its nature and background, and checking for errors,
missing values, etc. The third step is preparation of the data, which concerns
converting the input data into a suitable input format for the subsequent data mining
step. The fourth step, data mining, involves using appropriate algorithms to derive
knowledge from the data preprocessed in the previous step. After new knowledge has
been derived it needs to be validated and checked by a domain expert: this is the
purpose of the fifth step, evaluation of the discovered knowledge. The final step of
the KDP model is using the discovered knowledge, where we plan how to use the
discovered and validated knowledge in a practical setting.
In this study we have used the KDP model on micro-array data to discover
potentially new knowledge to help biologists answer specific questions of interest.
For preprocessing of the micro-array data we used the MAS 5.0 algorithm (Affymetrix,
2004), and for the data-mining step we used Bioconductor's SOM package (Jun,
2004), which is a modified version of Kohonen's Self-Organizing Feature Maps
(SOM) neural network (Kohonen, 1997).

2. The problem domain
Gene expression is a temporal process. Under different conditions, different
proteins are synthesized, which leads to the expression of different functionality of a
cell or tissue. Even under stable conditions, due to the degradation of proteins, mRNA
is continuously transcribed and new proteins are generated. In any developmental
process, such as the one occurring during embryogenesis, cell differentiation occurs
progressively as tissues mature. Therefore, to determine the complete set of genes that
are expressed under these conditions, and to determine possible interactions between
these genes allowed in the temporal space, we need to evaluate a time course set of
expression experiments. This will allow us to determine not only the stable state
following a new condition, but may also hint at the pathways and networks that were
activated to arrive at this new state.
Vertebrate regulative development coordinates progenitor cell maintenance by
terminal differentiation using specific gene networks. Progenitor cells exist in fetal or
adult tissues and are partially specialized; later, by dividing and differentiating, they
turn into different parts and become more specialized. It is known that during
development most organs form by an initial expansion of non-specified progenitor
cells, which only later acquire their terminal fates. Thus, it is expected that within
a temporal kinetic series of organ development, genes expressed at increased levels in
the early window should represent progenitor-expressed genes. If such genes are
shared between individual tissues, those genes might be expected to characterize
common progenitor characteristics between tissues.
It has been observed in different genomes (e.g., in budding yeast and human)
that genes with similar functionality tend to cluster together in the expression space
(Eisen et al., 1998). Therefore, the identification of gene expression clusters can help
to characterize genome functionality. Various clustering methods produce enormous
amounts of information about similarities of cell state and coordination of gene
regulation, and are ideal for grouping genes having similar transcriptional profiles
(Townsend et al., 2002). The main idea is to identify subsets of the genes and
samples, so that when one of these is used to cluster the other, stable and
significant partitions emerge.

3. Understanding the Data
DNA micro-array experiments are usually classified based on a particular
tissue or cell type. In time series expression experiments, a temporal process (usually
the developmental process of a tissue) is measured that exhibits a strong
autocorrelation between successive time points. Time series micro-array
experiments have provided unprecedented quantities of genome-wide data on
gene-expression patterns, and several techniques have been developed and used to
answer questions related to various biological contexts.
The time series data consist of numerical values representing the expression
level of genes at different intervals of developmental time. We considered only this
kind of micro-array data for this thesis, because it provides a possible means of
identifying different regulation relationships among genes and developmental
patterns shared between tissues. The micro-array platforms considered for the thesis
are the MOE430 and MGU74 (divided into three chips A, B and C) Gene Chip
platforms produced by Affymetrix. Both platforms claim to cover a large fraction
(over 90%) of the mouse genome.

Figure 3.1: Affymetrix Gene Chip structure (figure labels: PM, MM, probe pair, probe set).
The Affymetrix Gene Chip contains thousands of probes (or cells), each of
which contains 25 bp long oligonucleotides. Each PM (perfect match) probe has an
associated MM (mis-match) probe, and together they constitute a probe pair. The PM
probe represents the exact sequence of a sub-section of the mRNA of interest, and the
MM probe is created by changing the middle (13th) base of the PM with the intention
of measuring non-specific binding. Typically a probe set is created by combining
11-20 probe pairs, and it represents a specific mRNA molecule of interest (related
to a gene) (Irizarry et al., 2003).
We have used the following data sets:
Testis developmental time course in embryo (GSE1358)
Time course of gene expression in BL/6-129 embryonic testis from the time of
the indifferent gonad (11.5 days post coitum) to birth (18.5 days post coitum). The
day points are E11.5, E12.5, E14.5, E16.5 and E18.5 (Small et al., 2005). For each
day point the dataset has two replicates.

Ovary developmental time course in embryo (GSE1359)
Time course of gene expression in BL/6-129 embryonic ovary from the time
of the indifferent gonad (11.5 days post coitum) to birth (18.5 days post coitum). The
day points are E11.5, E12.5, E14.5, E16.5 and E18.5 (Small et al., 2005). For each
day point the dataset has two replicates.
Spermatogenesis and testis development time course (GSE926)
Spermatogenesis time course generated from BL/6-129 testis collected from
birth through adulthood (Shima et al., 2004). The day points are D0, D3, D6, D8,
D10, D14, D18, D20, D30, D35, and D56.
Kinetic Heart Series of Both Ventricles and Atrial Chamber (GDS627)
Analysis of strain C57BL/6 embryo heart development from E (embryonic
day) 10.5 to E18.5, which encompasses the stages of separation of the atrial and
ventricular chambers as well as the cardiac outflow tract (GEO, 2004). Atrial and
ventricular chambers were analyzed separately. The day points are E10.5, E11.5,
E12.5, E13.5, E14.5, E16.
Pancreatic Developmental Time course
The dataset contains expression levels of pancreas from E12.5, E13.5, E14.5,
E15.5, E16.5, E17.5 and E18.5.

4. Preparation of the Data
In this step we focus on transforming the raw data collected from the Gene Chip
experiment (from the .CEL files) into a more manageable format, so that they can be
used as input for the data mining algorithm in the next step. There are two
points we address in this step:
1. Remove non-biological effects from the data.
2. Normalize all data sets so that the different experiments are
directly comparable to each other.
The .CEL file from a Gene Chip experiment provides the x and y coordinates
of the spot, the mean and standard deviation of the intensity, and the number of
pixels associated with the spot. The raw intensity values are not suitable for further
analysis; they need to be converted into a quantized value that depicts the abundance
level, or expression level, of transcripts. Therefore, we normalize all the data and
derive one quantized expression value per probe set that measures the abundance of
the transcript. We also derived an absolute analysis call to indicate
whether the transcript was present (labeled P), absent (labeled A) or marginal (labeled
M). The advantage of having this information is that it allows us to filter or interpret
genes more easily without knowing their expression values (Affymetrix, 2004).
This absolute analysis call helps exclude some of the probe sets in the phase of data
preprocessing.
Figure 4.1: Proposed transformation of the data, from the .CEL file format
(x-coordinate, y-coordinate, mean of intensity, STDV of intensity, number of pixels)
to the input format of the DM algorithm (expression value, absolute call).
The procedure of data transformation (normalization) is broken into three
sub-procedures:
1. Background/signal adjustment.
2. Normalization.
3. Summarization.
There are many algorithms available for accomplishing these transformations.
Choosing the most appropriate one for analyzing micro-array data is a challenging
task. Even though it is clear that different methods lead to substantially different
results, there is no single best method. A recent study showed that adjusting the
perfect match probe signal with an estimate of the non-specific signal (the method
used in MAS 5.0) produces good results for the Affymetrix Gene Chip (Choe et al.,
2005). The Affymetrix micro-array machine we used had a built-in MAS 5.0
algorithm, which we used for normalization. Each of the above sub-procedures is
described in detail below.

4.1 Background/signal adjustment
The background correction process concerns correcting probe intensities on an
array using information only from that array. Arrays being compared should ideally
have comparable background values. Noise is a measure of the pixel-to-pixel
variation of probe sets on a Gene Chip array. Two main factors
that introduce noise are:
Electrical noise of the scanner
Sample quality
As every scanner has a unique inherent electrical noise associated with its
operation, noise values naturally vary among scanners. A significant portion of the
noise is therefore electrical noise. Array data (especially those of
replicates) acquired from the same scanner should ideally have comparable noise
values. Because different array sets may have been read by different scanners, the
background noise needs to be corrected before the micro-array data are processed
further.
Figure 4.2: Grid and associated Background/ noise calculation process. (The red
arrow shows distances between individual probeset and different grids).

The MAS 5.0 algorithm, by Affymetrix, accomplishes this task in the following steps:
1. Divide the whole chip into equally spaced zones and call them grids. In the above
figure, these are indicated by green squares.
2. Consider each grid individually; arrange all the probe-sets of a specific grid in
descending order of expression value. Then take the average of the lowest 2% of
the probe-sets' expression values and set that as the background/noise level
for the particular grid.
3. Then consider the distances from all the grids' centers to a particular
probe-set (shown by the green arrow in the above figure) and do the weighted
background noise adjustment. The weight factor is calculated as the reciprocal
of the sum of a constant and the square of the distance.
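The three steps above can be sketched in a few lines of Python. This is a minimal illustration and not the MAS 5.0 implementation: the function names, the number of zones, the smoothing constant `smooth` and the lower bound `floor` are assumptions chosen for the example, and the chip is represented as a plain list of rows of cell intensities.

```python
def zone_backgrounds(intensity, zones_per_side=4, fraction=0.02):
    """Return (center_x, center_y, background) for each zone (grid) of the chip."""
    n_rows, n_cols = len(intensity), len(intensity[0])
    zones = []
    for zi in range(zones_per_side):
        for zj in range(zones_per_side):
            r0 = zi * n_rows // zones_per_side
            r1 = (zi + 1) * n_rows // zones_per_side
            c0 = zj * n_cols // zones_per_side
            c1 = (zj + 1) * n_cols // zones_per_side
            cells = sorted(intensity[r][c]
                           for r in range(r0, r1) for c in range(c0, c1))
            # Average of the lowest 2% of the cells (at least one cell).
            k = max(1, int(len(cells) * fraction))
            zones.append(((c0 + c1) / 2, (r0 + r1) / 2, sum(cells[:k]) / k))
    return zones


def adjust_background(intensity, zones, smooth=100.0, floor=0.5):
    """Subtract a distance-weighted zone background from every cell."""
    adjusted = []
    for y, row in enumerate(intensity):
        new_row = []
        for x, value in enumerate(row):
            # Weight = reciprocal of (constant + squared distance to zone center).
            weights = [1.0 / ((x - cx) ** 2 + (y - cy) ** 2 + smooth)
                       for cx, cy, _ in zones]
            background = (sum(w * b for w, (_, _, b) in zip(weights, zones))
                          / sum(weights))
            # Assumed floor keeps adjusted intensities positive for later logs.
            new_row.append(max(value - background, floor))
        adjusted.append(new_row)
    return adjusted
```

For example, `adjust_background(chip, zone_backgrounds(chip, zones_per_side=2))` returns a chip of the same shape in which each cell has had the weighted background removed.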
4.2 Normalization
Coping with systematic variations between different experimental conditions
that are unrelated to biological factors is a well-known problem in biology. In the
normalization step we attempt to compensate for systematic differences between chips
to see more clearly the biological differences between samples. In most cases of
normalization it is assumed that the overall distribution of mRNA expression levels
does not change among different samples and that individual genes change very
little across different conditions. This assumption is reasonable for most
laboratory micro-array experiments (except when the expression profiles of
malignant tumors are compared with those of normal tissues or controls). In
other words, the distribution of gene abundances across different samples is similar
(Weinstein, 2004).
Assuming that equal weights of mRNA are used for all the samples (if the sizes
of the mRNA molecules are comparable, the number of RNA molecules should also
be roughly the same in each sample), the simplest approach to normalizing
Affymetrix data is to re-scale each chip in an experiment to equalize the
average (or total) signal intensity across all chips. To obtain a better result, the
relationships among replicate chips (chips hybridized to the same sample) are
considered. In MAS 5.0, linear regression is used rather than simple scaling.
Figure 4.3: MAS 5.0 normalization procedure
It is accomplished in the following steps:
1. Each probe-set is considered individually: take the expression values of a
particular probe-set in all the different replicates and construct a plot.
2. The highest and lowest 1% of the values are removed.
3. A regression line is then fitted to the remaining middle 98% of the values.
4. The values are transformed by subtracting the intercept and dividing by the
slope, so that the regression line becomes the identity (y = x) line.
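As a rough illustration of steps 2 to 4, the following Python sketch normalizes one chip against a reference replicate. It is a simplified reading of the procedure, not the vendor's code: the function names are hypothetical, an ordinary least-squares line is assumed, and the 1% trimming is assumed to be applied to the values ranked by the reference chip.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def normalize_to_reference(reference, chip, trim=0.01):
    """Rescale `chip` so that its regression line against `reference` is y = x."""
    # Pair up the two replicates and drop the lowest and highest 1% of pairs
    # (ranked by the reference value; this ranking is an assumption).
    pairs = sorted(zip(reference, chip))
    k = int(len(pairs) * trim)
    kept = pairs[k:len(pairs) - k] if k else pairs
    slope, intercept = fit_line([x for x, _ in kept], [y for _, y in kept])
    # Subtract the intercept and divide by the slope.
    return [(y - intercept) / slope for y in chip]
```

After this transformation, a chip that differed from the reference only by a linear distortion is mapped back onto the reference scale.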
4.3 Summarization
After correcting the background noise and processing effects, reducing unwanted
variation, and removing non-biological effects from the data, the next problem is to
calculate a quantized gene expression value. The main concern is how to reduce the
11-20 probe intensities of each probe-set to a single gene expression value that
represents the amount of transcription. Summarization is the process of combining
the preprocessed PM probes to compute an expression measure for each probe set on
the array.
The task is accomplished by combining the background-adjusted PM and MM
values of the probe set. The value is calculated in these steps (Affymetrix, 2004):
1. Probe intensities are adjusted for background/noise error.
2. An ideal mismatch value (IM) is calculated and subtracted from all the
PM intensities.
3. The adjusted PM intensities are then log-transformed to stabilize the variance.
4. A robust mean of the resulting values is calculated using the biweight estimator.
5. The antilog of the resulting value is output as the signal expression value.
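Steps 2 to 5 can be sketched as follows. The one-step Tukey biweight below uses the commonly cited tuning constants (c = 5, epsilon = 0.0001), which are assumptions here, and the ideal mismatch rule is deliberately simplified: where MM is not smaller than PM, a hypothetical fallback just below PM is used in place of the more elaborate rule of the real algorithm.

```python
import math
from statistics import median


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight: a robust mean that down-weights outliers."""
    m = median(values)
    s = median([abs(v - m) for v in values])  # median absolute deviation
    weights = []
    for v in values:
        u = (v - m) / (c * s + epsilon)
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def probe_set_signal(pm, mm, fallback_ratio=0.9):
    """Summarize the probe pairs of one probe set into a signal value.

    `pm` and `mm` are background-adjusted, positive intensities.
    `fallback_ratio` is a hypothetical stand-in for the ideal mismatch rule.
    """
    log_values = []
    for p, m in zip(pm, mm):
        ideal_mismatch = m if m < p else p * fallback_ratio
        log_values.append(math.log2(p - ideal_mismatch))
    # Robust mean on the log scale, then antilog.
    return 2 ** tukey_biweight(log_values)
```

Because the biweight gives zero weight to values far from the median, a single aberrant probe pair has little influence on the reported signal.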
4.4 Deriving Absolute Analysis call
The absolute analysis call (detection) algorithm uses the intensities of all probe
pairs of a probe-set to generate a Detection p-value and assign the probe-set one of
the following calls: Present (transcript is present), Marginal (presence of the
transcript is uncertain), or Absent (transcript is absent), to depict the level of
transcript associated with the probe-set. Each probe pair of a probe-set contributes
equally to determining whether the measured transcript represented by the probe-set
is detected (Present) or not detected (Absent).
In the first step of this algorithm a value called the Discrimination score
(denoted by R) is calculated using the following formula:
R = (PM - MM) / (PM + MM)
R is calculated for each probe pair and compared to a user-defined threshold
(denoted Tau). Probe pairs with R scores higher than Tau are considered to indicate
the presence of the transcript, and probe pairs with scores lower than Tau are
considered to indicate its absence. The contribution of each probe pair is summarized
with a p-value. The greater the number of probe pairs whose discrimination score is
above Tau for a probe-set, the smaller the p-value and the more likely the transcript
associated with the probe-set is truly Present in the sample. The p-value associated
with this test reflects the confidence of the detection call. The two-step procedure for
determining this Detection or absolute call for a given probe set is:
1. Calculate the Discrimination score R for each probe pair of a probe-set.
2. Compare the R scores of each probe pair against Tau and compute a p-value
for the probe-set using the one-sided Wilcoxon signed rank test.
Once the p-value is calculated, the user-modifiable detection p-value cut-offs
Alpha1 and Alpha2 set the boundaries for the Present, Marginal and Absent calls.
Any p-value that falls below Alpha1 is assigned a Present call, p-values between
Alpha1 and Alpha2 are assigned a Marginal call, and p-values above Alpha2 are
assigned an Absent call.
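The detection call can be illustrated with the short Python sketch below. It computes R = (PM - MM) / (PM + MM) for each probe pair and an exact one-sided Wilcoxon signed rank p-value by enumerating sign assignments, which is feasible for the 11-20 pairs of a probe set. The function names are hypothetical, and the default values of Tau, Alpha1 and Alpha2 (0.015, 0.04, 0.06) are assumptions for illustration; in the text they are user-defined.

```python
def discrimination_scores(pm, mm):
    """R = (PM - MM) / (PM + MM) for each probe pair."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]


def wilcoxon_one_sided_p(scores, tau):
    """Exact one-sided signed rank p-value for 'median of scores > tau'."""
    diffs = [r - tau for r in scores if r != tau]
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank |diffs|; ranks are doubled so that tied (average) ranks stay integers.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    doubled_rank = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            doubled_rank[order[k]] = (i + 1) + (j + 1)
        i = j + 1
    observed = sum(r for r, d in zip(doubled_rank, diffs) if d > 0)
    # Count, for every possible positive-rank sum, how many sign patterns reach it.
    counts = {0: 1}
    for r in doubled_rank:
        updated = dict(counts)
        for total, ways in counts.items():
            updated[total + r] = updated.get(total + r, 0) + ways
        counts = updated
    extreme = sum(ways for total, ways in counts.items() if total >= observed)
    return extreme / 2 ** n


def detection_call(pm, mm, tau=0.015, alpha1=0.04, alpha2=0.06):
    """Return 'P' (Present), 'M' (Marginal) or 'A' (Absent) for one probe set."""
    p_value = wilcoxon_one_sided_p(discrimination_scores(pm, mm), tau)
    if p_value < alpha1:
        return "P"
    if p_value < alpha2:
        return "M"
    return "A"
```

A probe set whose PM intensities are consistently well above the MM intensities receives a small p-value and therefore a Present call.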
4.5 Preprocessing
The success of any data-mining project depends on the quality of data
preprocessing, which is also the most time-consuming step of the KDP; it takes
between 40% and 70% of the entire mining effort (Cios et al., 1998; Cios et al.,
2005). The micro-array data contain large amounts of information, but fortunately
much of it corresponds to genes that do not exhibit any interesting changes during
the developmental time series. To find interesting genes, it is critical to reduce the
size of the data set by removing genes with expression profiles of no interest. It is
well known that not all genes are related to a given activation or functional process,
and thus we expect that their expression levels do not change significantly over the
developmental time course. In this context genes with constant or flat expression
profiles across different day points, or with little variation, are believed to contain no
information of biological significance. If included in the data, they might
affect the shape or even the number of clusters. The same applies to the genes that
were not expressed at all at any day point; since they contained no information
they were also excluded. We therefore used two computational methods to
filter out probe-sets, namely excluding probe-sets that
1) are not expressed at all at any of the time points, or
2) do not exhibit more than a two-fold change.
The first filter excludes probe sets that are labeled as Absent in all the
samples (each sample representing the expression level of a gene at a different
developmental time point). As mentioned before, Affymetrix uses the absolute call to
flag detection of mRNA for genes that are above a predefined threshold. Using that
information, the dormant genes across all developmental time points of a particular
tissue were excluded from cluster analysis. The following pseudo code accomplishes
the task:
Make a new empty list called post filter
Loop through each probe set P to be filtered
    If the probe set is marked present (P) in at least one time point
        Add the probe set to the post filter list
    End if
End loop
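A runnable Python equivalent of this pseudo code might look as follows. The function name and the dictionary layout (probe-set identifier mapped to its list of calls, one per time point) are assumptions made for the example; like the pseudo code, it keeps only probe sets with at least one Present call, so probe sets with only Marginal and Absent calls are dropped.

```python
def filter_present_in_any(calls_by_probe_set):
    """Keep probe sets that are called Present ('P') in at least one time point."""
    post_filter = []
    for probe_set, calls in calls_by_probe_set.items():
        if "P" in calls:
            post_filter.append(probe_set)
    return post_filter
```

For example, a probe set with calls `["A", "A", "P"]` is kept, while one with calls `["A", "A", "A"]` is removed.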
The second filter excludes probe sets (and associated genes) that show too
little variation across the developmental time points. This variation filter is used to
ensure that probe sets considered for clustering and finding different expression
patterns have a wide enough dynamic range of expression. This concept of dynamism
is a relative measure. In this study probe-sets are removed from further analysis if
they show less than a two-fold difference in expression level across all the
temporal developmental points. The pseudo code to accomplish the task follows:
Make a new empty list called post filter
Loop through each probe set P to be filtered
    Find the minimum expression level for P
    Find the maximum expression level for P
    If maximum / minimum is over 2
        Add this probe set to post filter list
    End if
End loop
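The variation filter can be written in Python along the same lines. The function name, the `min_fold` parameter and the dictionary layout are hypothetical; the guard against a non-positive minimum is an added assumption, since the ratio is undefined for a zero expression level.

```python
def filter_fold_change(expression_by_probe_set, min_fold=2.0):
    """Keep probe sets whose maximum / minimum expression ratio exceeds min_fold."""
    post_filter = []
    for probe_set, levels in expression_by_probe_set.items():
        lowest, highest = min(levels), max(levels)
        # Assumed guard: skip probe sets with a non-positive minimum level.
        if lowest > 0 and highest / lowest > min_fold:
            post_filter.append(probe_set)
    return post_filter
```

Raising `min_fold` makes the filter stricter and reduces the number of probe sets passed on to clustering.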
Some biologists may disagree with the idea of filtering out any genes, or
probe sets, and might suggest clustering the entire dataset, since there may be
biologically important changes in expression that are very small. However, it is
well known that if functional genomics experiments are considered as a
funnel to extract valid gene hits for hypothesis building, then a reduction of the
false-positive rate is a requirement. This is an issue that should eventually be decided
based on the confidence level of the domain expert. We need to keep in mind that
modifications of the filtering threshold levels may produce different results, and that
they depend on the questions we want to answer. After filtering, the numbers of
probe-sets selected for further clustering analysis, for the different tissues, are listed
in Table 4.1.

The percentage of probe-sets selected by the filtering for a specific tissue
reflects the biological significance associated with the developmental process of
the respective tissue. The larger the number of probe-sets selected by the filtering
process for a tissue, the larger the biological complexity associated with the
developmental process of that tissue. In our case, the biological
complexity is a reflection of either a more complex set of kinetic changes over the
time periods analyzed, or of the fact that certain tissues may simply
contain more cell types of different phenotypes. As the cellular phenotype is dictated
by the gene expression pattern of an individual cell, more phenotypes present naturally
equate to a larger percentage of the genome being used, and will thus result in a larger
fraction of the probe sets making it through the filtering process. In the case of our
datasets, we note that pancreas (43.84% of probe-sets were selected by the filtering
process) and spermatogenesis and testis tissue of mouse (51.45% in the MGU74A chip,
50.02% in the MGU74B chip and 34.52% in the MGU74C chip) contain the largest
fraction of genes displaying significant changes in the time series analyzed. This is
to be expected given that:
the pancreas contains multiple different cell types, each present in significant
ratios throughout development.
the pancreas is known to develop through a transition mechanism.
a transition mechanism is also present in the spermatogenesis tissue of the
postnatal mouse.
On the other hand, in tissues like the heart of the embryonic mouse (in both the
ventricles and the atrial chamber) there is mostly one cell type (cardiac cells), and
that may explain why only around 12% of the genes were left after filtering.
Different percentages of the probe sets remaining after filtering, for different
tissues, may imply co-regulation of genes. Another possible explanation is that there
is more homogeneity in the heart tissues (in the atrial chamber), the testis tissues at
the embryonic stage and the ovary of the embryonic mouse, as compared to the
pancreas tissue at the embryonic stage and the spermatogenesis tissues of the adult
mouse.
Tissue                                Chip platform   Total probe sets   Selected probe sets   %
Heart Both Ventricles                 MOE430          45101              5630                  12.48
Heart atrial chamber                  MOE430          45101              5076                  11.25
Pancreas                              MOE430          45101              19774                 43.84
Testis in embryo (GSE1358)            MGU74A          12488              1803                  14.43
Testis in embryo (GSE1358)            MGU74B          12477              1778                  14.25
Testis in embryo (GSE1358)            MGU74C          11934              1177                  9.86
Ovary in embryo (GSE1359)             MGU74A          12488              1976                  15.82
Ovary in embryo (GSE1359)             MGU74B          12477              2145                  17.19
Ovary in embryo (GSE1359)             MGU74C          11934              1567                  13.13
Spermatogenesis and testis (GSE926)   MGU74A          12488              6426                  51.45
Spermatogenesis and testis (GSE926)   MGU74B          12477              6241                  50.02
Spermatogenesis and testis (GSE926)   MGU74C          11934              4120                  34.52
Table 4.1: Probe-sets passed the filtering step in different data sets.

5. Data Mining
This step of the KDP involves using data mining methods to build models that
may reveal new knowledge. In this step, by analyzing the characteristics of the
data, we need to decide on suitable tools for discovering patterns (knowledge) in
the preprocessed data. In the genomic or molecular biology context, data mining is
used to determine the relationships among the differentially expressed genes present
in micro-arrays. Since the micro-array data are unsupervised data (meaning that we
do not know the outputs corresponding to the inputs), only a few data mining
techniques can be used for the purpose. We used clustering with the aim of grouping
genes based on their different expression profiles in the developmental time series
data, and by doing so identifying homogeneous subgroups.
Among the various data analyses successfully conducted on gene expression
micro-array data, gene selection (a process of attribute selection that finds the genes
most strongly related to a class), classification (predicting the outcome of a disease
based on gene expression patterns) and clustering (a process of finding biological
groups in the genome) are the most frequently used. In this study we conduct the
first and third types of analysis. In the preprocessing step, by gene selection, we
attempt to find invariant or differentially expressed genes in the given temporal
developmental series associated with the terminal differentiation that coordinates the
progenitor cell phenotype. In clustering, on the other hand, we attempt to identify
groups of co-expressed genes by recognizing coherent expression patterns
(Claverie et al., 1999).
There are several types of clustering algorithms, but we decided to use the
SOM artificial neural network algorithm (Dvorkin et al., 2005). The neural network
(NN) is an artificial intelligence tool used for pattern recognition and clustering (Pao,
1988). Most often, the SOM network is a 2D square matrix of neurons with weight
vectors associated with them. Through learning, the weight vectors are fitted to a set
of input vectors so as to approximate their density distribution in an ordered way.
SOM networks are well suited to organizing and visualizing complex data in a 2D or 3D
space and, at the same time, forming on their own the clusters present in the input data.
If we think about each time point (a day) of a temporal developmental series as
defining an axis of a multi-dimensional space, then we can represent each probe set's
expression as a k-dimensional vector (where k stands for the number of day points) in
this space. In Figure 5.1 we show the 3-dimensional hyper-cube space constructed from
a temporal kinetic series containing three day points. We can imagine that on the x axis
we have the expression level associated with day 1, on the y axis the expression level
associated with day 2, and on the z axis the expression level associated with day 3.
The probe sets can then be represented in this hyper-cube by three-dimensional vectors.

Figure 5.1: Representation of probe-sets as points in multidimensional space.
By using the SOM network we can then cluster the probe sets after they are
projected into a lower-dimensional space, most often a 2D space. The basic concept
of finding clusters with a SOM NN is that each neuron of the single, usually 2D, hidden
layer represents the center point of a single cluster (Figure 5.2). Each data point is
submitted to the network, and the distance between the data point and the weights of
every neuron, which represent the centers of the associated clusters, is calculated.
The neuron closest to the data point is declared the winner, and it also reveals which
cluster the point belongs to. After the winning neuron is chosen (as the right cluster
for the data point), the neuron's weight is modified so that the associated cluster
center moves closer to that particular point. In addition to the winning neuron, the
weights of the surrounding neurons are also modified, but at a different rate. Weights
associated with faraway neurons do not change much; the nearby neurons, on the other
hand, receive updates similar to that of the winning neuron. The area of surrounding
neurons to be modified and the rate of their modification are defined by the
neighborhood kernel function. When all the data points have been passed through the
network several times, the SOM finds the clusters present in the data. The original
concept, which concentrates on preserving the data's topological relationships, was
proposed in (Kohonen, 1997). The philosophy behind applying this technique is that
genes with similar expression patterns are functionally similar. Therefore, if we can
find groups of genes that are close to each other (in their expression levels at
different temporal points), they will constitute groups with related functionality
(Kohane et al., 2000).
Figure 5.2: SOM NN topology
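The training procedure described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the Bioconductor implementation used in this work: the function names, the linear decay of the learning rate and radius, and the initialization from randomly chosen data points are all hypothetical choices made for the example. It uses a bubble neighborhood, in which every neuron within the current radius of the winner receives the same update.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the neuron whose weight vector is closest to x."""
    return min(weights, key=lambda node: math.dist(weights[node], x))


def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a small rectangular SOM (illustrative sketch, assumed parameters)."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector (cluster center) per neuron, started at a random data point.
    weights = {(r, c): list(rng.choice(data))
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)          # decreasing learning rate
            radius = radius0 * (1.0 - frac)  # shrinking neighborhood
            winner = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Bubble kernel: neurons within `radius` grid steps of the winner
                # are moved toward x; all other neurons are left unchanged.
                grid_dist = max(abs(node[0] - winner[0]), abs(node[1] - winner[1]))
                if grid_dist <= radius:
                    for k in range(dim):
                        w[k] += lr * (x[k] - w[k])
            step += 1
    return weights
```

After training, calling `best_matching_unit(weights, x)` for each probe-set vector `x` gives the cluster (neuron) that the probe set is assigned to.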
Once the SOM network has clustered the data, we take each node of the, say, 3x3
network and, if it contains t probe sets, generate a 2D graph in the following
way: the X-axis represents all the day points present in the given temporal data set
(e.g. 7 day points for the pancreas), and the Y-axis represents the corresponding
expression levels. Figure 5.3 shows 29 randomly selected probe sets from cluster 1
of the heart data.

[Plot: expression level (y-axis, about 200 to 1200) against developmental times
(x-axis, E10.5 to E16.5).]
Figure 5.3: Gene expression of selected probes from cluster number 1 of heart data set.
Because even the smallest cluster contains too many probe sets to fit in a single
graph, we have randomly selected 29 probe sets to display.
Other algorithms have also been used to analyze gene expression data. One is
Bayesian clustering, but the technique depends heavily on a priori knowledge of the
data distribution, which in most cases, including our data, is not known. K-means
clustering (where K is the number of clusters, to be guessed by the user) can also be
used, but it produces clusters that are not easy to interpret, which is an important
factor for our problem. The benefit of the SOM is that we do not impose any rigid
structure on the data (say, by guessing the number of clusters), as it can find on its
own the correct number of clusters in the data. The SOM also has a built-in
visualization property and is scalable to large data sets (Tamayo et al., 1999). Most
importantly, however, almost none of the clustering algorithms besides SOM can take
into account the relationship among the adjacent points of a temporal developmental
series. Thus we have used the SOM algorithm as implemented at Bioconductor (Jun, 2004).
We have chosen a square 2D neighborhood topology for the SOM and the bubble function to
define the neighborhood of the winning neuron. Before using the SOM network, the
data were normalized so that each row has a mean of 0 and a variance of 1 (required for
proper learning). We also kept a decreasing schedule for the learning rule (winner
takes all) and the neighborhood kernel; both are essential for finding a correct
clustering.
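The row normalization mentioned above can be illustrated as follows. This is a minimal sketch: the function name is hypothetical, and it assumes the population variance (dividing by the number of time points); an implementation that uses the sample variance would differ by a constant factor per row.

```python
import statistics


def standardize_row(values):
    """Scale one probe set's expression values to mean 0 and variance 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumption)
    if sd == 0:
        # A perfectly flat profile carries no shape information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]
```

After this step, two probe sets with the same temporal shape but very different absolute expression levels map to the same vector, so the SOM groups them by profile shape rather than by magnitude.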
Determining the correct number of clusters in a data set is the most critical
problem in unsupervised learning. Most often the problem is addressed by generating
several clusterings and then validating the outcomes to choose the grouping of the
data points that corresponds to the optimal number of clusters. We need to keep in
mind, however, that there may be no single correct answer for this kind of exploratory
analysis. To guide the process of choosing the correct number of clusters we shall
use biological knowledge. Namely, there are three possible transitions of gene
expression between two different time points: a) up-regulated expression, b)
down-regulated expression, or c) flat regulation. Therefore, theoretically, for a time
series containing $N$ time points there are $N-1$ changes and thus $3^{N-1}$ possible
clusters. Fortunately, to cover all possible transitions, it is not necessary to
explore that many clusters.
If a time series contains 3 day points, then all 9 possible clusters are
shown in Figure 5.4A. However, these 9 patterns can be combined and represented by
only 3 patterns, shown in Figure 5.4B, and we can say that there are predominantly
only 3 clusters present in our data set, namely the up-regulated pattern, the
down-regulated pattern, and the flat or wobble pattern. In reality, when a data set
contains gene expression across more developmental time points of an organ, three
patterns may not be enough to account for all possible expression profiles. Therefore
a cluster number between the minimum of 3 and the theoretically possible maximum
should be chosen to cover all the gene expression profiles present in the data.
Figure 5.4: (A) Nine possible cluster profiles in data containing three day points. (B) Three
groups of these nine clusters.
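The count of possible profiles discussed above can be checked by enumeration: with three possible transitions (up, down, flat) between each pair of consecutive time points, a series with N time points has N-1 transitions and therefore 3^(N-1) distinct transition patterns, which gives 9 for three day points. The sketch below is illustrative only; the function name and the labels are hypothetical.

```python
from itertools import product

TRANSITIONS = ("up", "down", "flat")


def possible_profiles(n_time_points):
    """List every sequence of up/down/flat transitions for a time series."""
    return list(product(TRANSITIONS, repeat=n_time_points - 1))


profiles = possible_profiles(3)
print(len(profiles))  # 9 patterns for three day points
```

For a series with 7 day points the same enumeration yields 729 patterns, which illustrates why the theoretical maximum is only an upper bound on the number of clusters worth exploring.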
To evaluate the clustering results we considered two criteria: the average
quantization error and the opinion of the domain expert. The average quantization error
gives an average distortion measure of the probe sets when they are clustered. Asking
for different numbers of clusters yields different results, and the one with the
smallest average quantization error should be chosen (Kohonen, 1997). The average
quantization error is calculated as the average of $\lVert x - m_c \rVert$ over the
training inputs, where $x$ is a training input vector and $m_c$ is the weight vector
associated with its winning node. When the SOM map is large we calculate it as

$$\sum h \, \lVert x - m_c \rVert^{2}$$

where $h$ is the neighborhood function (it is then known as the average distortion
measure). Using the formulas described above we generated different clusterings/SOM
maps and calculated the corresponding quantization error values, as shown in Table 5.1:
for both ventricles of heart tissue in Table A, for the atrial heart tissue in Table B,
for the pancreas tissue in Table C, for testis development in embryo in Table D, for
the spermatogenesis tissue in Table E and for ovary development in embryo in Table F.
(A) Heart, both ventricles
Network Topology   Error
3X5                7.98
4X5                7.34
5X5                7.26
5X6                [illegible: 6JS]

(B) Heart, atrial chamber
Network Topology   Error
3X5                11.36
4X5                10.95
5X5                10.87
5X6                9.83

(C) Pancreas
Network Topology   Error
3X5                7.86
4X5                7.75
5X5                7.79
5X6                7

(D) Testis development in embryo
Cluster Topology   Error in MGU74A chip   Error in MGU74B chip   Error in MGU74C chip
3X2                7.71                   8.26                   9.29
3X3                6.8                    7.35                   [illegible: 345]
3X4                5.8                    6.27                   8.69
3X5                5.8                    6.27                   8.69
4X5                6.29                   5.33                   7.27
5X5                5.09                   5.9                    6.58
5X6                4.36                   5.45                   6.37
Table 5.1: Quantized error associated with different numbers of clusters in different data sets

(E) Spermatogenesis
Cluster Topology   Error in MGU74A chip   Error in MGU74B chip   Error in MGU74C chip
3X2                22.006                 22.81                  23.5
3X3                20.12                  20.05                  21.13
3X4                17.53                  18.9                   19.84
3X5                17.95                  19.72                  20.19
4X5                16.09                  17                     18.44
5X5                16.89                  19.94                  18.54

(F) Ovary development in embryo
Cluster Topology   Error in MGU74A chip   Error in MGU74B chip   Error in MGU74C chip
3X2                8.54                   9.23                   9.68
3X3                7.9                    8.59                   8.82
3X4                7.43                   7.92                   8.09
3X5                7.43                   7.92                   8.21
4X5                6.07                   6.99                   7.49
5X5                6                      7.32                   7.6
5X6                5.8                    6.14                   6.5
Table 5.1 (Cont.)
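As a rough illustration of how an average quantization error of this kind can be computed, the sketch below takes each input vector, finds the closest node weight, and averages those distances. The function name and the toy inputs are hypothetical, and this is not the code that produced Table 5.1.

```python
import math


def average_quantization_error(data, weights):
    """Mean Euclidean distance from each input vector to its closest node weight."""
    total = 0.0
    for x in data:
        total += min(math.dist(x, w) for w in weights)
    return total / len(data)


# Toy example: two inputs and a map with two nodes.
data = [(0.0, 0.0), (2.0, 0.0)]
weights = [(0.0, 0.0), (1.0, 0.0)]
print(average_quantization_error(data, weights))  # 0.5
```

Because adding nodes generally lets some node lie closer to each input, this error tends to fall as the map grows, which is one reason the formal measure is combined here with the judgment of the domain expert.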
The lowest quantization error value for a given network topology (rows in
bold in each table) suggests that 30 clusters (corresponding to the 5x6 topology) is
the optimum number for the pancreas, both ventricles of heart, atrial chamber of
heart, testis and ovary data sets. In the spermatogenesis data, on the other hand, the
lowest quantization error is associated with the 4x5 topology (20 clusters). According
to the above analysis alone we would have chosen these suggested numbers of clusters,
or topologies; however, the final selection changed later, after our expert's visual
analysis (see the discussion in the section entitled Evaluation of the Discovered
Knowledge).

6. Evaluation of the Discovered Knowledge
After performing data mining and extracting knowledge, the next phase is to
evaluate the results for validity and relevance. In this step of the KDP the results
are analyzed in context, and interpretation by domain experts provides the best
understanding of any data mining results. This step searches for the optimal parameter
values for the specific clustering algorithm, so that the discovered knowledge best
fits the data and better represents the biological knowledge.
As mentioned before, one of the strengths of the six-step KDP model is that it
is highly iterative and interactive. By first calculating the formal quantization error
measure in the previous step we arrived at candidate best numbers of clusters, but in
this validation step we modify them based on the domain expert's visual inspection.
Below, considering each data set separately, we report the final number of clusters
chosen to represent the data set and explain the reasons.
6.1 Testis developmental time course in embryo (GSE1358)
As mentioned before, considering the quantization errors for different numbers
of clusters, 30 clusters appears to be the optimal result for the GSE1358 data set.
When the profiles of the thirty clusters are inspected visually, it is noted that the
middle clusters of the map (represented by shaded regions in Figure 6.1) represent
relatively flat expression of genes. These clusters contain little biologically
significant information and can be combined to form a single cluster. If the number of
clusters is decreased, clusters with similar profiles tend to collapse together; if
the number of clusters is increased, the gene lists associated with the different
clusters get shorter but no significantly new profiles are detected. It appears that
with only 9 clusters all the gene expression profiles present in testis development
are well represented. Therefore the domain expert finally chose 9 clusters to
represent the data. The 9 clusters finally chosen for this data set and their mean
expression levels are shown in Figure 6.2.
Figure 6.1: 30 Clusters associated with GSE 1358 (MGU74 A chip) data set


Figure 6.2: 9 Clusters associated with GSE 1358 data set
6.2 Kinetic Heart Series of Both Ventricles
According to the calculated quantization errors for different numbers of
clusters for both ventricles of heart tissue, 30 clusters seems to be the optimal
result. After visual inspection of the 30 cluster profiles, it was found that as the
number of clusters increases, the gene lists associated with a particular profile get
shorter. The domain expert found that 25 clusters represent the data set well enough
and cover all the different gene expression patterns. The final 25 clusters for this
data set and their mean expression levels are shown in Figure 6.3.


Figure 6.3: 25 Clusters associated with heart Both Ventricle data set
6.3 Kinetic Heart Series of Atrial Chamber
The calculated quantization errors for different numbers of clusters for the
atrial chamber of heart tissue suggest that 30 clusters is the optimal result
(from Table 5.1 B). When the results are visually inspected, it is found that if the
number of clusters is increased from 25 to 30 for the atrial chamber of heart tissue,
the lists of probe sets associated with the different clusters get shorter but no
significantly new profiles are detected. Therefore the domain expert decided that 25
cluster profiles are enough to cover all possible regulations in the atrial chamber of
heart tissue. The final 25 clusters and their mean expression levels for this data set
are shown in Figure 6.4.
Figure 6.4: 25 Clusters associated with heart atrial chamber data set
6.4 Pancreatic Developmental Time course
The calculated quantization errors for different numbers of clusters for the
pancreas tissue suggest that 30 clusters is the optimal result (from Table 5.1 C).
When the results are visually inspected, it is found that if the number of clusters is
decreased from 30 to 25 for the pancreas tissue, the number of probe sets associated
with each cluster gets larger but no significant profiles are lost. Therefore the
domain expert decided that 25 cluster profiles are enough to cover all possible
regulations in pancreas tissue. In Figure 6.5 only the mean values of the expression
levels of each cluster associated with pancreas tissue are shown.

Figure 6.5: 25 Clusters associated with the Pancreas data set
6.5 Spermatogenesis and testis development time course (GSE926)
Considering the quantization errors for different numbers of clusters, 20 clusters
appeared to be the optimal result for this data set (from Table 5.1 E). When the gene
expression profiles of the 20 clusters are visually inspected, it is found that
several clusters have similar profiles and can be collapsed together to yield a
smaller number of clusters. In Figure 6.6, clusters with similar profiles found in the
Spermatogenesis data set are shown with the same color label. The domain expert
therefore chose 9 clusters, judging them sufficient to depict all the important
expression patterns present in the data set. Figure 6.7 displays only the mean values
of the expression levels of the 9 clusters associated with the Spermatogenesis data
set.
Figure 6.6: Similarities among 20 clusters associated with Spermatogenesis data set
(panels: MGU74 A chip, MGU74 B chip, MGU74 C chip).

Figure 6.7: Average expression levels of 9 clusters associated with Spermatogenesis data set
(panels: MGU74 A chip, MGU74 B chip, MGU74 C chip).

6.6 Ovary developmental time course in embryo (GSE1359)
The quantization errors found for different numbers of clusters for this data set
suggest that 30 clusters is the optimal result (from Table 5.1 F).
When the expression profiles of the 30 clusters are visually inspected, it is found
that several clusters have similar profiles and can be collapsed together to yield a
smaller number of clusters. In Figure 6.8, similar clusters are shown with the
same color label. The domain expert therefore chose 9 clusters to depict all the
important expression patterns present in the data set.
Figure 6.8: Similarities among 30 clusters associated with ovary data set
(one panel per MGU74 chip).
Figure 6.9 displays only the mean values of expression levels of chosen 9
clusters associated with ovary data set.
Figure 6.9: Average expression levels of 9 clusters associated with ovary data set
(panels: MGU74 A chip, MGU74 B chip, MGU74 C chip).

7. Using the discovered knowledge
After the knowledge has been derived and validated, the last step of the KDP
concerns how to use it. This step explores how the discovered knowledge can help in
making relevant observations in the domain field. It incorporates the knowledge
(validated and selected in the previous step) into the application domain. The main
tasks are presenting the new knowledge in a convincing, application domain-oriented
way, and formulating it so that the knowledge can be exploited.
After going through the previous five steps, the patterns or knowledge discovered
can help biologists with the following tasks:
o Find all the profiles in the tissue,
o Find genes with similar functionality,
o Find genes in different tissues with the same profile,
o Describe the developmental process of a tissue.
These points are described below in detail.
7.1 Finding all the gene expression profiles
The visualization aspect of SOM clustering depicts all the gene expression
profiles observed in the developmental time series covered by the data. Simply by
looking at these profiles, biologists can infer the developmental process of the
tissue. The profiles also help to identify the possible presence of a transcriptional
process in the development of the tissue. By examining the gene lists associated with
these different profiles, biologists can identify genes associated with different
biological hypotheses. For example, in the discovered gene expression profiles of the
developmental series of pancreas tissue, profiles with spikes in the middle of the
developmental series represent the occurrence of a secondary transition in the
development of the tissue (marked by the boxes in Figure 7.1), which has been
described by others (Jensen, 2004). The mechanism of pancreatic cytodifferentiation
involves selective regulation of the synthesis of cell-specific proteins or genes. The
developing embryonic pancreas goes through terminal differentiation in two stages. In
the first transition the embryonic endoderm gives rise to the pancreatic bud. After a
further three-day period of pancreatic growth and morphogenesis (all changes that
involve growth and maturation are collectively known as morphogenesis), the mature
pancreas apparently develops through a single secondary transition, involving
increases of several orders of magnitude in the levels of exocrine proteins or the
expression levels of the associated genes. This cell differentiation process exhibits
a sudden change in the expression level of a large number of genes, which is
identified by the appearance of the spikes in the pancreatic data.

Figure 7.1: Presence of secondary transition in the developmental process of Pancreas.
On the other hand, such profiles with a spike in the middle of the developmental
period are absent in the GSE 926 and GSE 1358 data sets. All the profiles are
relatively progressive instead, and they represent the lack of transcription factor in
the developmental process of testis tissue and ovary tissue in embryo.

Figure 7.2: Different expression profile in GSE 1358 Data set
7.2 Finding genes with similar functionality
It has been shown in the literature that genes with similar functionality in
yeast tend to cluster together (Eisen et al., 1998). It is likely that genes within a
cluster may share biological functions even in a higher species such as mouse. Yet,
given the multicellular organization of a mouse into highly specific tissues, its cell
structure and functionality are far more complex than those of yeast. We therefore
expect gene functional complexity to be simpler in the yeast genome, where it is
relatively easy to assign a common functionality to a specific cluster of genes.
Associating a common functionality with a specific cluster found in the mouse genome,
on the other hand, is not that straightforward. It is equally, if not more, likely that
the grouping of a set of genes in a cluster results from a process of cell
differentiation rather than from similar functionality. In the present analyses, a
common issue is that of developmental time. All clusters are defined based on temporal
changes, as observed in all tissues analyzed in this work. Given the above
considerations, cluster membership should not be uncritically equated with functional
similarity. Nonetheless, when spikes, or sudden abrupt changes in a developmental
series, such as those observed in the pancreatic set, are detected, this may help
identify genes that act together to execute the transition. Such genes are of natural
interest to the biologist, as understanding the nature of biological transitions
remains an important area of study. It is known, for example, that the gene called
Ngn3 (Neurogenin-3) is important for executing the endocrine differentiation program
during the transition period of the pancreas. However, it is not known with which
other genes Ngn3 operates. Yet it would be expected that such genes share the temporal
profile of Ngn3 and therefore cluster with it around the secondary transition of the
pancreas.
7.3 Finding genes in different tissues with the same profile
The numbers of genes in the overall sets of, e.g., down/down or up/up profiles
when comparing dissimilar tissues are much too large to denote a simple functional
similarity. However, the presence of common genes in such cluster comparisons may
provide insight into genes shared at, e.g., early versus late developmental stages.
It is expected that during the earlier developmental stages of a tissue a larger
number of progenitor cells is present, and that as the tissue matures along the
developmental time axis the number of progenitor cells is reduced. Therefore, genes
associated with progenitor cells are expected to show a downward profile along the
developmental time series. For example, the Olig2 gene, which has a key function in
maintaining the progenitor cell state by repressing the expression of other genes
(e.g. Hb9), exhibits high expression at an earlier stage and lower expression levels
as pMN cells develop (Lee et al., 2005). Similarly, FGF10 signaling, which serves to
integrate cell growth and terminal differentiation at the level of Notch activation,
plays a key role in pancreatic development and exhibits a downward expression profile
(Norgaard et al., 2003). For that reason, extracting and comparing genes between
tissues that show a downward trend through the sets may indicate that the genes are
commonly involved in maintaining a more undifferentiated state of the tissues, and
such genes could well be shared among multiple tissues, if not all, in a developing
organism. Thus, they may represent a progenitor cell state, which is currently an
important focus in biology. With the identification of genes associated with the
progenitor cell state we will be able to concentrate further on their regulation and,
hopefully, to control the cell transformation process.
When all the downward cluster profiles in the heart and pancreas data sets are
combined and their intersection is taken, 623 probe sets are found. The genes
associated with these probe sets have been examined, and multiple genes belong to the
WNT pathway, which is known to be associated with progenitor cell function. After
extracting the transcription factors from this list, the gene list shown in Figure 7.3
has been compiled. In a similar manner, using their own hypotheses and criteria,
biologists or other investigators can choose gene expression profiles of their own and
compile candidate gene lists from them, which may help formulate novel hypotheses or
strengthen or weaken existing ones.
[Figure 7.3 is a table of 28 probe sets with their probe-set identifiers and gene
descriptions; the identifiers are illegible in this copy. Legible gene descriptions
include: transcription factor E2a; hairy/enhancer-of-split related with YRPW motif 1;
aryl hydrocarbon receptor nuclear translocator 2; paternally expressed 3; sal-like 2
(Drosophila); odd-skipped related 1 (Drosophila); nuclear receptor subfamily 2, group
F, member 1; DNA methyltransferase (cytosine-5) 1; mini chromosome maintenance
deficient 4 homolog; mini chromosome maintenance deficient 6; SRY-box containing gene
5; transcription factor 7, T-cell specific; paired related homeobox 2; pre B-cell
leukemia transcription factor 2; ISL1 transcription factor, LIM/homeodomain; zinc
finger, imprinted 1; and several RIKEN cDNA genes.]
Figure 7.3: Gene list containing transcription factor domain, compiled by combining the
Down profiles of both heart and pancreas
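The cross-tissue comparison described above amounts to a set intersection over probe-set identifiers, which is meaningful here because the heart and pancreas data sets share the same chip platform (MOE 430). The following is a minimal sketch; the function name and the identifiers in the example are hypothetical placeholders, not probe sets from this study.

```python
def shared_probe_sets(*probe_set_lists):
    """Return the probe sets that appear in every one of the given lists."""
    sets = [set(lst) for lst in probe_set_lists]
    return sorted(set.intersection(*sets))


# Hypothetical probe-set IDs pooled from the downward clusters of each tissue.
heart_down = ["probe_a", "probe_b", "probe_c"]
pancreas_down = ["probe_b", "probe_c", "probe_d"]
print(shared_probe_sets(heart_down, pancreas_down))  # ['probe_b', 'probe_c']
```

The resulting list can then be filtered further, for example by gene annotation, in the same way the transcription factors were extracted from the 623 shared probe sets.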
7.4 Describing the developmental process of a tissue
Analyzing the total cluster geometry of different tissues allows biologists
to infer the level of complexity involved in the developmental process of a tissue. Of
the tissues analyzed here, the most complex developmental process is observed
for the pancreas. The remaining tissues (heart, testes) develop with a rather simple,
progressive change over time. For instance, testis development can be adequately
described with only three major cluster groups, suggesting much less complexity
during the developmental process of the testis.

8. Conclusion
We have shown that mining numerical gene expression data provides a fast,
easy to perform, widely applicable, and unbiased route towards the identification of
biologically relevant hypotheses and observations. Since the introduction of
high-throughput gene expression screening into biological research, a tremendous
quantity of data has been accumulated. When investigators try to identify genes closely
related to their specific research, they are often stuck with huge amounts of data. In
this study we used the knowledge discovery process as a screening tool to identify
biologically meaningful results and conditions from the genome-wide expression
profiles of data sets not initially designed for cross-comparisons. The data mining
methods used have provided us with powerful tools for accessing and interpreting the
information content of the genome, and they support the value of re-using existing
genomic data sets through novel meta-analyses inspired by valid biological questions.
Interesting and useful relationships (patterns) can henceforth be discovered from
genome-wide transcription expression level data; it is highly unlikely that the same
knowledge would have been obtained through wet laboratory experimentation.

Choe S., Michael B., Alan M. M., George M. C. and Marc S. H. (2005). Preferred
analysis methods for Affymetrix GeneChips revealed by a wholly defined control
dataset. Genome Biology. 6. R16.
Cios K.J., Pedrycz W., Swiniarski R. (1998). Data Mining Methods for Knowledge
Discovery. New York: Kluwer Academic Publishers.
Cios K.J. and Kurgan L.A. (2005). Trends in Data Mining and Knowledge Discovery.
In: Advanced Techniques in Data Mining and Knowledge Discovery, Pal, N.R., Jain,
L.C. (eds.), Springer, 1-26.
Cios K.J., Teresinska A., Konieczna S., Potocka J. and Sharma S. (2000). Diagnosing
Myocardial Perfusion SPECT Bull's-eye Maps: A Knowledge Discovery Approach.
IEEE Engineering in Medicine and Biology Magazine, 19(4): 17-25.
Dvorkin D., Fadok V. and Cios K.J. (2005). SiMCAL 1 Algorithm for Analysis of
Gene Expression Data Related to the Phosphatidylserine Receptor. Artificial
Intelligence in Medicine, (35): 49-60.
Eisen M. B., Paul S. T., Patrick B., David B. (1998). Cluster analysis and display of
genome-wide expression patterns. Proceedings of the National Academy of Sciences.
GEO Data Sets. 15 June 04, NCBI. 25 February 2006. gds/ gds_browse.cgi?gds=627.
Irizarry R.A., Bolstad B.M., Collin F., Cope L.M., Hobbs B., Speed T.P. (2003).
Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research,
Vol. 31, No. 4, e15.
Affymetrix Manual, Expression Analysis Technical, 2004.
Jensen J. (2004). Gene regulatory factors in pancreatic development. Dev Dyn.
229(1): 176-200.
Claverie J.M. (1999). Computational methods for the identification of differential and
coordinated gene expression. Human Molecular Genetics. 8. 1821-1822.

Jun Y. (2004). som: Self-Organizing Map. R package version 0.3-0.4.
Sdidactique/Rhelp/library/som/html/00Index.html
Kohane I. S., Alvin T. K., Atul J. B. (2000). Microarrays for an Integrative Genomics.
Cambridge: The MIT Press.
Kohonen T. (Ed.) (1997). Self-organizing maps. New York: Springer-Verlag.
Lee, Soo-Kyung, Lee, Bora, Ruiz, Esmeralda C., Pfaff, Samuel L. (2005). Olig2 and
Ngn2 function in opposition to modulate gene expression in motor neuron progenitor
cells. Genes Dev. 2005 9:282-294.
Norgaard G. A., Jensen J. N., Jensen J. (2003). FGF10 signaling maintains the
pancreatic progenitor cell state revealing a novel role of Notch in organ development.
Developmental Biology. 264:2, 323-338.
R Development Core Team (2005). R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria.
ISBN 3-900051-07-0.
Pao Y. (1988). Adaptive pattern recognition and neural networks. New York: Addison
Shima J. E., Derek M. J., John M. R., Michael G. D. (2004). The Murine Testicular
Transcriptome: Characterizing Gene Expression in the Testis During the Progression
of Spermatogenesis. Biology of Reproduction 71. 319-330.
Tamayo P., Slonim D., Mesirov J., Zhu Q., Kitareewan S., Dmitrovsky E., Lander E.
S., and Golub T. R. (1999). Interpreting patterns of gene expression with
self-organizing maps: methods and application to hematopoietic differentiation.
Proceedings of the National Academy of Sciences. 96. 6. 2907-2912.
Townsend J.P., Hartl D.L. (2002). Bayesian analysis of gene expression levels:
statistical quantification of relative mRNA level across multiple strains or treatments.
Genome Biol. 3(12).
Weinstein J.N. (2004). Genomics and Bioinformatics Group. Affymetrix Microarray
Data Analysis.