Citation
Analyzing microarray time series data using a Haar wavelet transformation

Material Information

Title:
Analyzing microarray time series data using a Haar wavelet transformation
Creator:
Braudrick, Sarah D
Place of Publication:
Denver, Colo.
Publisher:
University of Colorado Denver
Publication Date:
Language:
English
Physical Description:
xvii, 78 leaves : illustrations ; 28 cm

Thesis/Dissertation Information

Degree:
Master's (Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Mathematical and Statistical Sciences, CU Denver
Degree Disciplines:
Mathematical Sciences
Committee Chair:
Billups, Stephen C.
Committee Members:
Bjork, Kathe E.
Sain, Stephan R.

Subjects

Subjects / Keywords:
Haar system (Mathematics) ( lcsh )
Wavelets (Mathematics) ( lcsh )
Transformations (Mathematics) ( lcsh )
Haar system (Mathematics) ( fast )
Transformations (Mathematics) ( fast )
Wavelets (Mathematics) ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 75-78).
General Note:
Department of Mathematical and Statistical Sciences
Statement of Responsibility:
by Sarah D. Braudrick.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
166385433 ( OCLC )
ocn166385433
Classification:
LD1193.L62 2007m B72 ( lcc )

Full Text
ANALYZING MICROARRAY TIME SERIES DATA USING A HAAR
WAVELET TRANSFORMATION
by
Sarah D. Braudrick
B.A. Mathematics, B.A. Sociology, University of Northern Colorado, 1998
A thesis submitted to the
University of Colorado at Denver
and Health Sciences Center
in partial fulfillment
of the requirements for the degree of
Master of Science
Mathematical Sciences
2007


This thesis for the Master of Science
degree by
Sarah D. Braudrick
has been approved
by
Stephan R. Sain


Braudrick, Sarah D. (M.S., Applied Mathematics)
Analyzing Microarray Time Series Data Using a Haar Wavelet Transformation
Thesis directed by Associate Professor and Graduate Program Director Stephen
C. Billups
ABSTRACT
This thesis presents an algorithm for identifying significant temporal patterns in microarray time series. Using a false discovery rate procedure to assess the magnitude of wavelet coefficients resulting from a Haar wavelet transformation, the algorithm detects gene expression signals exhibiting time-dependent features. Numerous simulated time series, each consisting of a small proportion of temporally dependent functions among random signals, were generated by which to test the algorithm. The simulation encompassed the following changeable scenarios: number and distribution of time points (evenly spaced or randomly spaced), noise-to-signal ratio, and level of frequency. Testing the algorithm using these parameters, we were able to answer questions regarding the effect of sample quantity, sampling rate, and sample quality (degree of noise present in the expression signal readings) on the algorithm's performance in detecting temporal patterns in a microarray time series. Finally, the algorithm is applied to a genome-wide microarray time series collected from Rattus norvegicus mammary tissue over the course of the 4-day estrous cycle.


This abstract accurately represents the content of the candidate's thesis.


DEDICATION
This thesis is dedicated to John, for the support and the space and knowing
when to give which, and to my parents, who never ran out of encouragement.


ACKNOWLEDGMENT
I'd like to thank Steve Billups for giving me a thesis topic which was as interesting as it was challenging, and for always giving thorough feedback. And also thanks to Holly Wolf, for her limitless understanding when it came to work and school conflicts.


CONTENTS
Figures ............................................................ x
Tables............................................................ xvi
Chapter
1. INTRODUCTION .................................................... 1
1.1 Time Series Gene Expression Data Analysis..................... 1
1.2 Related Work.................................................. 1
1.2.1 Early attempts: time unordered .............................. 2
1.2.2 Recent attempts: time ordered................................ 2
1.2.3 Wavelets in signal processing ................................ 3
1.3 Our Approach.................................................. 3
1.3.1 Motivation ................................................... 4
1.4 Challenges.................................................... 5
1.4.1 Sampling...................................................... 6
1.4.2 Homogeneity................................................... 6
1.4.2.1 Within samples.............................................. 7
1.4.2.2 Between samples............................................. 7
1.4.3 Replicates.................................................... 8
1.5 Conceptual Framework.......................................... 8
1.5.1 A model of the data........................................... 9
1.6 Implementation............................................... 10


2. METHODS ............................................................ 11
2.1 Overview of the Method.......................................... 11
2.2 Normalization...................................................... 12
2.3 Wavelets........................................................... 13
2.3.1 The Haar wavelet................................................... 14
2.3.2 Defining a finite basis using Haar wavelet functions............ 15
2.3.3 Calculating the wavelet coefficients............................... 17
2.3.4 Haar wavelet coefficients.......................................... 18
2.4 Identification of Significant Signals.............................. 18
2.4.1 Bootstrapping procedure............................................ 19
2.4.2 FDR analysis....................................................... 20
2.5 Algorithm Summary and Implementation............................... 21
3. TESTING AND RESULTS .................................................. 25
3.1 Generating Simulation Time Series.................................. 25
3.1.1 Haar wavelet signals............................................... 28
3.1.2 Sinusoidal signals................................................. 29
3.1.3 Regression signals ................................................ 30
3.2 Results: Simulation Testing.......................................... 31
3.2.1 Performance using Haar-based f .................................... 31
3.2.2 Performance using sinusoidal-based f............................... 37
3.2.3 Performance using regression-based f............................... 52
3.3 Results: Microarray Gene Expression Time Series................. 60
4. SUMMARY AND CONCLUSIONS............................................... 69
4.1 Sample Size........................................................ 69


4.2 Variation of Sample Response................................. 70
4.3 Sampling Rate ............................................... 70
4.4 Frequency of the Underlying Temporal Response ............. 71
4.5 Which Wavelet Coefficient Performs Best?..................... 72
5. FUTURE WORK.................................................... 73
References........................................................ 75


LIST OF FIGURES
Figure
2.1 The eighth-order Haar wavelet basis functions on the interval [0,4].
Here, n = 8 because D = 2........................................... 16
3.1 The Haar wavelet step function, ψ............................... 28
3.2 Sinusoidal functions of various frequencies on the time interval [0,1]. 29
3.3 Regression functions 1 through 4, on the interval [0,1]............. 30
3.4 Surface plots of observed sensitivity rates. (These are 3-dimensional
renderings of Tables 3.2 and 3.3, viewed from their upper left hand
corners towards their lower right.).............................. 32
3.5 Sensitivity rates for 16 > T > 9 with η = 0.1................... 33
3.6 Haar wavelet-generated (normalized) signals for T = 40 where T is
evenly spaced and p = 0.1. This illustrates how, even with a large
number of uniform samples, true signals (black) can quickly become
nearly indistinguishable from pure noise (grey) when η increases from
1.5 to 2.0.......................................................... 34
3.7 Surface plots for the observed FDRs from uniform and randomly
spaced T, p = 0.1................................................... 35
3.8 Surface plots of observed sensitivity when the proportion of true
signals (p) is 50% (a and b) and 90% (c and d). No significant
differences were seen between small and large p.................. 36


3.9 Mean sensitivity rate across all evenly spaced T (with η = 0.1) as a
function of frequency rate............................................. 37
3.10 Surface plots of the resulting sensitivity from low frequency (near 0
and 1) sinusoidal-generated simulations................................ 38
3.11 Surface plots of the resulting sensitivity from lower frequency (near
2 to 4) sinusoidal-generated simulations............................ 39
3.12 Surface plots of the resulting sensitivity from higher frequency (greater
than 5) sinusoidal-generated simulations............................... 40
3.13 Sensitivity rates for frq = 0.3, η = 0.1, as a function of T in finer
detail. Sensitivity decreases more sharply for evenly spaced T (solid
dark line) between 20 and 15 than for randomly spaced T (dashed
light line)............................................................ 42
3.14 Sensitivity rates for frq = 0.3, T = 50, as a function of η in finer
detail. Sensitivity decreases most when 0.5 < η < 1.0. Evenly (solid
dark line) and randomly (dashed light line) spaced large T produce
similarly decreasing rates............................................. 43
3.15 Sensitivity rates for frq = 1.3, η = 0.1, as a function of T in finer
detail. Sensitivity decreases more sharply for evenly spaced T (solid
dark line) between 30 and 25 than for randomly spaced T (dashed
light line)............................................................ 45
3.16 Sensitivity rates for frq = 1.3, T = 50, as a function of η in finer
detail. Evenly (solid dark line) and randomly (dashed light line)
spaced large T produce similarly decreasing rates............. 45


3.17 Sensitivity rates for frq = 1.8, η = 0.1, as a function of T in finer
detail. Sensitivity decreases more sharply for evenly spaced T (solid
dark line) between 14 and 12 (see arrows) than for randomly spaced
T (dashed light line).................................................. 47
3.18 Sensitivity rates for frq = 1.8, T = 50, as a function of η in finer
detail. Evenly (solid dark line) and randomly (dashed light line)
spaced large T produce similarly decreasing rates and both have a
high tolerance to a large amount of added noise........................ 47
3.19 Sensitivity rates for frq = 2.5, η = 0.1, as a function of T in finer
detail. Sensitivity for evenly spaced T (solid dark line) is larger when
T = 24 than when T = 29, and a large decrease occurs between
T = 24 and 23. For randomly spaced T (dashed light line), sensitivity
is larger when T = 22 than when T = 31................................. 49
3.20 Sensitivity rates for frq = 2.5, T = 50, as a function of η in finer
detail. Evenly (solid dark line) and randomly (dashed light line)
spaced large T produce similarly decreasing rates except for 0.5 <
η < 0.7, where randomly spaced T yields smaller sensitivity rates. . 49
3.21 Sensitivity rates for frq = 3.8, η = 0.1, as a function of T in finer
detail. Sensitivity drops significantly, but uniformly, for evenly spaced
T (solid dark line) between 24 and 20 (see arrows). Randomly spaced
T (dashed light line) yields slightly higher rates for smaller values of
T...................................................................... 51


3.22 Sensitivity rates for frq = 3.8, T = 50, as a function of η in finer
detail. Evenly (solid dark line) and randomly (dashed light line)
spaced large T produce similarly decreasing rates.................... 51
3.23 Surface plots (b and c) of observed sensitivity rates from simulations
generated from regression function (a) f_regr1(t) = 1 - 48t + 218t^2 -
315t^3 + 145t^4...................................................... 53
3.24 Sensitivity rates for f_regr1(t), η = 0.1, as a function of T in finer
detail. Evenly spaced T (solid dark line) differs most from randomly
spaced T (dashed light line) near T = 30. Rates decrease significantly
for both once T reaches the mid-twenties....................... 54
3.25 Sensitivity rates for f_regr1(t), T = 50, as a function of η in finer detail.
Evenly (solid dark line) and randomly (dashed light line) spaced large
T produce similarly decreasing rates except for 0.4 < η < 0.6, where
randomly spaced T yields smaller sensitivity rates............. 54
3.26 Surface plots (b and c) of observed sensitivity rates from simulations
generated from regression function (a) f_regr2(t) = 0.3e^(-64(t-0.25)^2) +
0.7e^(-256(t-0.75)^2)................................................. 55
3.27 Surface plots (b and c) of observed sensitivity rates from simulations
generated from regression function (a) f_regr3(t) = 10e^(-10t).......... 56
3.28 Surface plots (b and c) of observed sensitivity rates from simulations
generated from regression function (a) f_regr4(t), a piecewise function
with k = 2 on one interval of t and k = 1 on the other................ 57


3.29 Sensitivity rates for f_regr4(t), η = 0.1, as a function of T in finer
detail. Rates for randomly spaced T (dashed light line) decrease
more gradually than evenly spaced T when the number of time points
is less than 15. Evenly spaced T may also favor an even number of
time points when T is small........................................ 59
3.30 Sensitivity rates for f_regr4(t), T = 50, as a function of η in finer detail.
Evenly (solid dark line) and randomly (dashed light line) spaced large
T produce similarly decreasing rates except for an interval above η =
0.7, where randomly spaced T yields smaller sensitivity rates........ 59
3.31 A plot indicating the time points sampled. Each asterisk marks the
time (in hours) that a tissue sample was taken from a rat specimen
across the 4 day estrous cycle.................................. 60
3.32 The Haar wavelet reconstruction of the normalized expression signal
for APegl. An increase in expression appears in the estrus and
metestrus phases................................................ 62
3.33 The Haar wavelet reconstruction of the normalized expression signal
for RMT-1. Expression values are high in the first half of the estrous
cycle, and low in the second half............................... 63
3.34 The Haar wavelet reconstruction of the normalized expression signal
for IL-17B. Expression is low in the first half of the cycle, and high
during the last two phases...................................... 64
3.35 The Haar wavelet reconstruction of the expression signal for TSP2.
Expression is higher in the first half, and lower in the second half of
the estrous cycle..................................................... 65


3.36 The Haar wavelet reconstruction of the expression signal for LAMA5.
Expression moderately decreases from the beginning to the end of
the cycle, with some high level readings occurring in the estrus phase. 66


LIST OF TABLES
Table
2.1 FDR procedure pseudocode.................................................. 22
2.2 Algorithm pseudocode...................................................... 23
2.3 Algorithm input parameters................................................ 24
3.1 Input parameters needed to construct F................................ 27
3.2 Observed sensitivity rates for f = ψ(t), evenly spaced T and p = 0.1 31
3.3 Observed sensitivity rates for f = ψ(t), randomly spaced T and p = 0.1 32
3.4 Observed sensitivity rates for f = sin(0.3πt), T evenly spaced and
p = 0.1 41
3.5 Observed sensitivity rates for f = sin(1.3πt), evenly spaced T and
p = 0.1 44
3.6 Observed sensitivity rates for f = sin(1.8πt), evenly spaced T and
p = 0.1 46
3.7 Observed sensitivity rates for f = sin(2.5πt), evenly spaced T and
p = 0.1 48
3.8 Observed sensitivity rates for f = sin(3.8πt), evenly spaced T and
p = 0.1 50
3.9 Observed sensitivity rates for f_regr1(t), evenly spaced T and p = 0.1 53
3.10 Observed sensitivity rates for f_regr2(t), randomly spaced T and p = 0.1 55
3.11 Observed sensitivity rates for f_regr3(t), evenly spaced T and p = 0.1 56
3.12 Observed sensitivity rates for f_regr4(t), evenly spaced T and p = 0.1 57


3.13 A frequency distribution of significance rates for 10 applications of
the algorithm using R = 5........................................... 61
3.14 Genes found to have temporal significance 100% of the time ... 67
4.1 Percentage Distribution of Positive-Responsible C_k (when T = 50
and η = 0.1) ...................................................... 72


1. INTRODUCTION
1.1 Time Series Gene Expression Data Analysis
Microarray technology allows researchers to quantify the expression levels of thousands of genes within a tissue sample. By measuring mRNA levels associated with each gene, microarrays provide a snapshot of the genetic activity present in that tissue. This information can then be used to understand patterns of gene expression. Researchers have used microarrays to investigate relationships among genes, how they react to environmental changes such as injury, disease, and drugs, and their roles in the routine maintenance of an organism. Gene function plays a dynamic role in the regulation of biological processes, and time series analysis can help us understand how this unfolds. Microarray experiments, when performed successively, can reveal temporal patterns in biological systems such as cellular cycles, organism development, and disease progression. The resulting data from these time course experiments, however, can be difficult to manage; they tend to be enormous in number (often comprising an entire genome), widely variable, and diverse in scope (some profiles averaging mRNA levels near 10,000, others near 500). The analysis of these complex data introduces unique challenges to investigators and calls for specialized techniques that take into account the sequential nature of the samples.
1.2 Related Work
1.2.1 Early attempts: time unordered


Initial efforts to make sense of microarray time series data largely ignored the continuous nature of time, instead treating time as a categorical variable. Many early undertakings sought only to differentiate among groups of genes in the hopes of gaining a more sophisticated understanding of biological processes. Groupings tend to be established either through classification schemes, in which genes are categorized based on some pre-specified criteria, or through clustering methods which identify naturally occurring similarities. Early approaches in classifying genes were largely guided by multivariate statistical and computational methods including classical statistical modeling techniques [34, 37], clustering algorithms [12, 30], principal component analysis (PCA) with singular value decomposition (SVD) [3, 14], self-organizing maps (SOM) [28], and support vector machines [8]. The shortcoming of a multivariate analysis framework, however, is that the temporal ordering of samples is inconsequential; any permutation of time points will not change the results of the analysis.
1.2.2 Recent attempts: time ordered
More recent classification techniques have addressed this problem by recognizing that temporal order and the intrinsic continuity of gene expression signals must be incorporated. Some of these time-sensitive classification schemes include the use of self-organizing maps adjusted to incorporate a temporal metric [27, 36], time warping algorithms that attempt to co-align expression signals [1], and functional data analysis (FDA) methodologies that view observed gene profiles as independent components in a smooth stochastic process [22, 35]. The latter presented methods specifically for unevenly and/or infrequently sampled data. Notable nonparametric methods presented for modeling time series expression profiles have included the use of splines [5, 11] and wavelets. In work done by Klevecz in 2000 and Klevecz and Murray in 2001 [17, 18], wavelets were used to identify dominant frequencies at which large numbers of genes exhibited oscillations in their expression profiles. Our use of wavelets is qualitatively different. Specifically, we will use wavelets as a means of identifying individual genes that exhibit significant time-dependent variation in their expression levels.
1.2.3 Wavelets in signal processing
The use of wavelets in time series analysis is not new. Wavelets have been applied to a variety of series including the detection of time-dependent climate signals [20], serial dependencies within functional magnetic resonance imaging (fMRI) [9], the analysis of electrocardiogram (EKG) heartbeat intervals [16], and electroencephalogram (EEG) epileptic spikes [19]. In image processing, wavelets have been used in the analysis of 2-dimensional images (which are just signals arising from spatial dependencies instead of temporal ones) to identify patterns at various resolutions. Examples include JPEG image compression [31], optimization of image coding [33], edge detection, and the removal of noise in images [24].
1.3 Our Approach
Here we present a method for identifying meaningful temporal patterns in gene expression microarray data. Our algorithm computes the Haar wavelet transformation on the expression profile of each gene, and then incorporates a false discovery rate analysis on the decomposition coefficients to single out genes with significantly large temporal responses, relative to the noise. Prior to describing the details of the algorithm, it will be useful to first describe the data set that motivated the development of this method.
1.3.1 Motivation
The development of our algorithm was motivated by the desire to analyze a genome-wide time series that resulted from research investigating the estrous cycle. The data came from microarray experiments measuring the levels of mRNA present in mammary tissue samples taken from female rats (Rattus norvegicus) and spanning the course of the rat estrous cycle (typically 4-5 days). The experiment, conducted by Margaret C. Neville and Pepper Schedin at the University of Colorado's School of Medicine, sampled 32 time points over a four day period at approximately 90 minute intervals between the hours of 7am and 7pm. The microarrays were set up to detect expression levels for the entire rat genome, comprising over 21,000 genes.
Much of the genetic architecture underlying the estrous cycle is not clearly understood; therefore our quest was exploratory in nature. The hope was to be able to identify a set of candidate genes exhibiting a significant temporal response worthy of further investigation in future experiments. Features of this experiment made the prospect of data analysis both encouraging and daunting. While there were many time points sampled, and with some level of regularity in sampling rate, there were also large gaps of time with no samples, coupled with the additional variability introduced by heterogeneous sample subjects.
Unlike some experiments in which the samples are homogeneous (e.g., the time points are sampled from cloned organisms or from the same individual multiple times), this experiment required the extraction of vital organs, and therefore each time point represented a different rat. The lack of sample homogeneity presents uncertainty that can complicate analysis efforts because each rat's genetic makeup varies to some unknown degree. In addition to this variability between individual rats, there can also be disparities within samples. For example, the presence of peripheral tissue whose primary biological function may be far different than that of the tissue under examination can result in expression values reflecting a different process. The heterogeneity of samples also raises concern for how much confidence we can have that the rat sacrificed at time t_k was a good representation of the estrous cycle at time t_k. In other words, how can we be sure that the rat representing time point t_7, let's say, wasn't actually better suited to represent time point t_9 in the estrous cycle?
The sampling rate of this set had irregularities resulting in large windows of unmeasured gene activity (overnight); therefore we wanted to explore how uneven sampling might affect the identification of significant temporal profiles. An integral part of our testing scheme involves the comparison between uniformly and randomly spaced time points.
Encouraging, however, was that this set had a relatively large number of samples, a luxury given that, due to the high cost involved, it is not uncommon for microarray time courses to sample fewer than 10 time points [13]. While the experiment did not perform replicates, this forfeiture afforded the advantage of more time points covered. Also attractive is the regularity of sampling that did occur within the four 12-hour sampling phases.
1.4 Challenges


The idea behind gene expression time series experiments is to obtain enough snapshots of gene activity to reveal temporal patterns within a particular biological process or stimulus response. A key factor in accomplishing this is being able to distinguish between the true signal and its accompanying noise. As will be described, there are many factors in the experimental design that can complicate the discernment between pure signal and noise.
1.4.1 Sampling
As with many time course experiments, sampling methodology can greatly
affect the subsequent assessment of data. The rate of sampling should be tailored
to the process being studied; longer intervals may accommodate slowly unfolding
gene activity, whereas shorter intervals become necessary when studying intense
biological responses. While sampling frequently can simplify and strengthen
analysis, it may be infeasible due to the high cost of microarrays. Sampling too
infrequently, however, can make it nearly impossible to interpolate gene activity
between sample points.
Even when many time points are sampled, irregularity of sampling intervals can complicate analysis because the rate of change between time points increases as the interval decreases. Sometimes irregular intervals are strategically carried out in order to capture particular phases of a process where rapid or slow behavior is expected [26]. But in many situations, an uneven rate of sampling is difficult to avoid due to the impracticality of around-the-clock sampling.
1.4.2 Homogeneity
In addition to sparse or irregular sampling, other factors may bring uncertainty into the analysis, namely, inconsistencies within and between tissue samples.
1.4.2.1 Within samples
Inconsistencies within samples occur when the extracted specimen contains tissue from multiple organs. For the purposes of many experiments, the prototypical tissue sample contains only tissue from the organ of interest, be it heart, liver, or skin, etc. This is ideal because gene activity varies among tissue types depending upon their biological functions. For example, certain genes responsible for regulating functions of sight will be highly expressed in ocular tissue, yet nearly dormant in proximal brain tissue. But during extraction, samples can be inadvertently contaminated with tissue from neighboring organs.
1.4.2.2 Between samples
Inconsistencies between tissue samples are the result of using heterogeneous subjects. A favorable experimental condition is when each tissue sample comes from the same organism, or a clone. Such control is attainable when working with populations of smaller organisms like C. elegans or Saccharomyces cerevisiae (yeast) because lines of these organisms can be perpetuated indefinitely. But often it is necessary to use organisms more similar to humans, in which case cloning is not (yet) a practical approach and each tissue sample requires the sacrifice of a unique organism. When each time point represents a unique individual, it can be difficult to know the degree to which changes in expression values are attributable to a biological response or to the peculiarities of the individual. Measures can be taken to align organisms so that each sacrificed specimen can reasonably represent the proper phase of the cycle for that time point. But there will still be variation resulting from the uniqueness of the samples, which is indistinguishable from the changes inherent in the true gene signal.
1.4.3 Replicates
In traditional types of time series, the value at time point t_j is directly influenced by the observation at time point t_{j-1}. This characteristic can bring robustness to the validity of the value at time point t_j; when the value at time point t_j is similar to its neighbors, we have more confidence in it. In gene expression time series, on the other hand, individual time points are independent of the preceding observations. Therefore, it is desirable to take measures to increase confidence in the value at time point t_j.
A common experimental design, intended to substantiate the expression value associated with each sampled time point, is to generate replicates. Replicates are additional microarray tests performed on each sample and combined into a single reading for that time point. While this design can increase confidence in the values, it does not necessarily always behoove the researcher to adopt this setup. Consider an experiment in which one is investigating a slow process over a large cycle length; requiring replicates may amount to squandering expensive microarray tests that would be more useful if spread out to cover more individual time points.
1.5 Conceptual Framework
We recognize that significant temporal patterns are not necessarily cyclical. In fact, as evidenced when one views a plot of expression values against time, what can be seen is a great deal of seemingly erratic spikes and bursts of activity. Wavelet transformations are well suited to such data since they are capable of closely approximating highly irregular, chaotic signals.
1.5.1 A model of the data
In designing a model of this type of data, we set forth some expectations. In general, we assume that the variation in gene expression signals is governed by three components: the temporal response, the sample response, and random noise. We assume that the temporal response component represents the time-dependent signal of the true genetic profile and that it is a continuous function of time.
The sample response is considered time-independent variation occurring from sample to sample (e.g., differences between individual organisms or tissue makeup). Note that the sample response may exhibit correlations among genes; there may be groups of genes that vary similarly from one sample to the next. The magnitude of the sample response is difficult to predict; it could be very large or small relative to the temporal response.
The random noise is also considered to be time-independent, but largely gene-independent and relatively small compared to the other two components. If we combine the sample response and the random noise into a single component, ε, then our model, f, can be written as

f = y + ε,     (1.1)

where y represents the temporal response, i.e., the genetic profile.
Questions that we explore using our algorithm on strategically simulated data include whether the sample response is proportional to the temporal response and what type of distribution might be representative of the sample response.
1.6 Implementation
This paper presents a method to identify significant temporal responses
among a large number of gene expression signals. We model the data via a
Haar wavelet transformation and use features of the reconstructed signal to
identify profiles with meaningful temporal patterns. Issues to be addressed in
our approach include the questions of how many samples are necessary within
a given time interval, and to what degree irregularity in sampling rate can be
tolerated in order to successfully detect temporal patterns. Also investigated
is how best to normalize the raw data prior to the wavelet transformation.
Normalization of the data is an important issue because it has the potential
to magnify small-scale elements as well as to minimize those of large scale.
To test our methodological answers to these questions, we apply our algo-
rithm to a variety of simulated datasets. After refining our algorithm during the
testing phases, we apply the final version to the set of genome-wide expression
levels sampled from a population of female rats in an attempt to explore the
regulation of the estrous cycle.
2. METHODS
2.1 Overview of the Method
Our methods were based upon the model that gene expression measure-
ments are made up of three components: the temporal response, a continuous
function of time that reflects the true expression value; the sample response, a
time-independent form of variability that results from differences between sam-
ples; and noise, which is presumed to be independent of both time and sample.
Furthermore, we assume that the non-temporal portion of the signal (the sam-
ple response combined with the random noise) tends to be proportional to the
signal's temporal response.
We are primarily interested in identifying genes that have a significant tem-
poral response. We model the temporal response for a gene as a function of
time, which we approximate from the expression data. To make this concrete,
consider a set of T ordered data points, (t_1, y_1), ..., (t_T, y_T), believed to have
been sampled with noise from some underlying function of time. The aim is
to find a new function, f, that closely approximates, or fits, this underlying
function such that for each j ∈ [1, T],

f(t_j) ≈ y_j.    (2.1)
Our approach models the data as a continuous function of time so that the
underlying function, y, is approximated by the following linear model:

f(t) = Σ_{k=1}^{n} c_k φ_k(t),    (2.2)
where φ_1(t), ..., φ_n(t) is a family of basis functions for our method, which target
the temporal features we are interested in. In this study, we will use Haar wavelet
functions, defined in Section 2.3, to form the basis. We then minimize the model's
least-squares residual to choose the best parameters c_k; that is, we minimize
Σ_{j=1}^{T} (y_j - f(t_j))^2. The result of this process is a representation of each gene
by the vector of coefficients c = [c_1, ..., c_n] corresponding to the least squares
fit. If the magnitude of any of these coefficients is large relative to the other
coefficients, we say that the gene has a significant temporal response with respect
to the corresponding basis function. A main contribution of this thesis is an
algorithm for determining when a coefficient is large enough to be considered
significant. The remainder of this chapter describes the algorithm in detail.
The raw dataset, G, is arranged into an m × T matrix such that G_{i,j} is
the expression level at the jth time point for the ith gene, where j = 1, 2, ..., T
and i = 1, 2, ..., m. Prior to calculating the functional fit, we first normalize the
data, resulting in an m × T matrix G̃. The normalization process is described
in Section 2.2. Section 2.3 describes the Haar wavelets, which will be used to
define the basis functions φ_k(t) for equation (2.2). Section 2.3 also gives the
mathematical details for calculating the least squares approximation. Section
2.4.1 describes the method for determining the significance of the magnitudes
of the wavelet coefficients.
2.2 Normalization
We examined a number of methods for standardizing the data prior to
performing the wavelet decomposition. The choice of a normalization scheme
depends on several factors, and therefore it was important to assess the
implications of each standardizing component as it affects the analysis of the
data. Our method was to first take the log-transform of the data and then
center each signal about 0.
In microarray data analysis it is standard practice to work with the logarithm
of the data instead of the raw expression values. This is primarily because
of the wide range of the expression readings, which can run from 0 to 10,000,
and because noise tends to be proportional to signal strength. Using
the log-transform of the data ensures that low-level changes in expression
receive consideration comparable to those occurring at high levels. Therefore, our
normalization routine begins by taking the log-transform of the raw expression
data.
Our interest is in the relative changes in expression values, not absolute; so
we centered each signal about 0 by subtracting its mean. That is, the ith gene
expression signal, denoted by G_{i,*}, once normalized will be

G̃_{i,*} = ln G_{i,*} - (1/T) Σ_{j=1}^{T} ln G_{i,j},    (2.3)

where the log of a vector is the vector of the logarithms. The * symbol in the
subscript indicates the entirety of that index, so that G_{i,*} is equivalent to
(G_{i,1}, ..., G_{i,T}).
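As a concrete sketch of this normalization (log-transform, then per-gene mean-centering), assuming NumPy is available; the function name `normalize` is our own illustration, not the thesis code:

```python
import numpy as np

def normalize(G):
    """Equation (2.3): log-transform each row (gene) of the raw m x T
    expression matrix G, then center it by subtracting its row mean."""
    logG = np.log(G)
    return logG - logG.mean(axis=1, keepdims=True)

# Every normalized profile has mean 0, so the first (average) wavelet
# coefficient always vanishes, as noted later in Section 2.3.4.
G = np.array([[10.0, 100.0, 1000.0],
              [5.0, 50.0, 500.0]])
G_tilde = normalize(G)
```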
2.3 Wavelets
A wavelet transform (WT) can be used to fit a linear model given a set
of sample data points, (y_1, ..., y_T). Instead of mapping the samples from a
temporal domain into a frequency domain (as with Fourier transforms), the
wavelet transform maps the samples into a time-frequency plane. A benefit of
a WT is that it can get as local as necessary, whereas the Fourier transform is
globalized over the entire signal. Wavelet transforms are capable of zooming in
or out at different locations of the signal. With wavelets, one can choose the
resolution level with which to transform a given signal [7].
The term wavelet refers to a set of basis functions defined by a pair of
related functions: the wavelet generating function, denoted by ψ, and the scaling
function, denoted by φ. These two functions work in concert to generate an
infinite set of basis functions. Typically, however, only a finite subset of these
basis functions is necessary to get a close approximation of f. By choosing
different generating and scaling functions, many different families of wavelet
functions can be produced. In this thesis we focus on Haar wavelets, which are
specified as follows.
2.3.1 The Haar wavelet
For t ∈ [0, 1], the Haar scaling function is defined as:

φ(t) = 1 if 0 ≤ t < 1;  0 otherwise,    (2.4)

and the Haar wavelet generating function is defined as:

ψ(t) = φ(2t) - φ(2t - 1),    (2.5)

or equivalently,

ψ(t) = 1 if 0 ≤ t < 1/2;  -1 if 1/2 ≤ t < 1;  0 otherwise.    (2.6)
The family of Haar basis functions is then defined by {φ(t), ψ_{0,0}(t), ψ_{1,0}(t), ψ_{1,1}(t), ...},
where ψ_{0,0}(t) = ψ(t) and

ψ_{d,s}(t) = ψ(2^d t - s),    (2.7)

for d ∈ Z⁺ and 0 ≤ s ≤ S_d = 2^d - 1.
Generalizing for a function f(t), t ∈ [a, b], the Haar scaling function is defined as:

φ(t) = 1/√(b - a) if a ≤ t < b;  0 otherwise,    (2.8)

and applying (2.5), the Haar wavelet generating function is defined as:

ψ(t) = 1/√(b - a) if a ≤ t < (a + b)/2;  -1/√(b - a) if (a + b)/2 ≤ t < b;  0 otherwise.    (2.9)

The denominator √(b - a) is chosen so that φ(t) and ψ(t) have unit norm. Given
φ(t) and ψ(t), the remaining Haar basis functions on [a, b] are then defined by

ψ_{d,s}(t) = ψ(2^d (t - a) - s(b - a) + a).    (2.10)
2.3.2 Defining a finite basis using Haar wavelet functions
The orthonormal wavelet basis is generated by dilating and shifting ψ (see
Fig. 2.1). Think of d as indexing the dilation of ψ, and s as indexing the shifting
of ψ. By choosing a cutoff, D, for d, and thereby limiting how deeply to dilate
ψ, we can establish a finite basis consisting of 2^{D+1} functions and approximate
f by the expansion

f(t) = c_0 φ(t) + Σ_{d=0}^{D} Σ_{s=0}^{S_d} c_{d,s} ψ_{d,s}(t).    (2.11)
For ease of notation, we re-index these functions with a single index, k, where
k = 1 references the first basis function, ψ_1(t) = φ(t), and for k = 2, ..., n =
2^{D+1}, we define ψ_k = ψ_{d,s}, where k = 2^d + s + 1. With this re-indexing, (2.11)
can be rewritten as

f(t) = Σ_{k=1}^{n} c_k ψ_k(t).    (2.12)
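A sketch of this construction on [0, 1] (taking a = 0, b = 1 and omitting the unit-norm scaling of (2.8)–(2.9) for readability); `haar_basis` and `design_matrix` are illustrative names, not the thesis implementation:

```python
import numpy as np

def psi(t):
    """Haar mother wavelet on [0, 1) (equation (2.6))."""
    return np.where((0 <= t) & (t < 0.5), 1.0,
                    np.where((0.5 <= t) & (t < 1.0), -1.0, 0.0))

def haar_basis(D):
    """Return the re-indexed basis psi_1, ..., psi_n on [0, 1], where
    n = 2**(D+1): psi_1 is the scaling function and psi_k = psi_{d,s}
    with k = 2**d + s + 1 (Section 2.3.2)."""
    funcs = [lambda t: np.where((0 <= t) & (t < 1.0), 1.0, 0.0)]  # phi
    for d in range(D + 1):
        for s in range(2 ** d):
            funcs.append(lambda t, d=d, s=s: psi(2 ** d * t - s))
    return funcs

def design_matrix(t, D):
    """n x T design matrix W with W[k, j] = psi_k(t_j)."""
    return np.array([f(t) for f in haar_basis(D)])
```

On a dyadic grid these unnormalized basis functions are mutually orthogonal, which is the property the p-value independence argument of Section 2.4.1 relies on.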
[Figure panels: (a) ψ_{0,0} = ψ(t); (b) ψ_{1,0}; (c) ψ_{1,1}; (d) ψ_{2,0}; (e) ψ_{2,1}; (f) ψ_{2,2}; (g) ψ_{2,3}]
Figure 2.1: The eighth-order Haar wavelet basis functions on the interval [0,4].
Here, n = 8 because D = 2.
2.3.3 Calculating the wavelet coefficients
The wavelet coefficients c_1, ..., c_n in (2.12) are chosen to minimize the least
squares residual

Σ_{j=1}^{T} (y_j - f(t_j))^2 = Σ_{j=1}^{T} (y_j - Σ_{k=1}^{n} c_k ψ_k(t_j))^2 = ||y - cW||^2,    (2.13)

where y = (y_1, ..., y_T), c = (c_1, ..., c_n), and W is the n × T design matrix
defined by W_{kj} = ψ_k(t_j). The solution c is calculated by the equation

c = yW^+,    (2.14)

where W^+ is the Moore-Penrose pseudoinverse of W. W^+ is calculated using
the singular value decomposition (SVD) of W:

W = UXV^T,    (2.15)

where the columns of U are the eigenvectors of WW^T, the diagonal of X contains
the singular values of W, and the rows of V^T are the eigenvectors of W^TW. W^+
is defined by

W^+ = VX^+U^T,    (2.16)

where

X^+_{ii} = 0 if X_{ii} = 0;  1/X_{ii} otherwise.    (2.17)

When W has full rank and T > n, then

W^+ = W^T(WW^T)^{-1}.    (2.18)

When W has full rank and T < n, then

W^+ = (W^TW)^{-1}W^T.    (2.19)

In either case, (2.18) or (2.19) yields the least-squares solution to (2.14).
We define the m × n matrix C as the set of all wavelet coefficients resulting
from the dataset G̃, where C_{i,k} is the kth wavelet coefficient for the ith gene. C
is calculated by the equation

C = G̃W^+.    (2.20)
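A minimal sketch of (2.14) and (2.20), assuming NumPy; `np.linalg.pinv` computes the Moore-Penrose pseudoinverse via the SVD, matching (2.15)–(2.17). The random W below is only a stand-in for the Haar design matrix:

```python
import numpy as np

def wavelet_coefficients(G_tilde, W):
    """Equation (2.20): C = G~ W+, where W is the n x T design matrix
    with W[k, j] = psi_k(t_j). Row i of C holds the least-squares
    coefficients for gene i."""
    return G_tilde @ np.linalg.pinv(W)

# Sanity check: a noise-free signal that is an exact linear combination
# of the basis rows is recovered exactly by the least-squares fit.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 20))   # stand-in design matrix, n = 8, T = 20
c_true = rng.standard_normal((1, 8))
y = c_true @ W                     # noise-free "gene"
c_fit = wavelet_coefficients(y, W)
```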
2.3.4 Haar wavelet coefficients
The first of the wavelet coefficients, C_1, is referred to as the sparse coefficient
and is the average value over all t_j; that is, C_1 = (1/T) Σ_{j=1}^{T} f(t_j). Because of our
normalization scheme, the calculated value of C_1 will always be 0. The remaining
wavelet coefficients, C_2, ..., C_n, are referred to as the detail coefficients. Each
detail coefficient corresponds to a localized frequency band, or scale, of the
signal.
2.4 Identification of Significant Signals
Because the relative magnitude of the Haar wavelet coefficients was the key
factor in our analysis of significant temporal patterns, it was necessary to gain
a sense of how the coefficients were distributed. We used bootstrapping to
estimate the background distribution of each gene's associated wavelet coefficients,
thereby obtaining a p-value for each coefficient C_k, k ≥ 2. Details of this are
provided in Section 2.4.1.
Using the smallest p-values from each signal, we calculate the probabilities
of their smallness based upon their overall cumulative distribution; essentially,
calculating a p-value for each minimum p-value. Applying an FDR procedure to
these p-values, we determine the cutoff point at which to reject the hypothesis
that a signal exhibits no significant temporal pattern.
2.4.1 Bootstrapping procedure
In our method, we wanted to identify genes with significantly large wavelet
coefficients. But we needed to find a way to distinguish significance. This was
accomplished using a bootstrapping procedure to estimate a p-value for the
magnitude of each wavelet coefficient |C_{i,k}|.
For each gene i we generated a set of H trial genes G̃_{i,*,h}, for h = 1, ..., H.
Each trial gene is generated by sampling (with replacement) from the expression
values (G_{i,1}, ..., G_{i,T}), and then normalizing the resulting expression profile
according to (2.3). The Haar wavelet coefficients C_{i,*,h} were then calculated for
each trial gene according to (2.14).
The trial coefficients calculated above enable us to calculate a p-value for the
magnitude of each wavelet coefficient. To do this, for each (i, k), we calculate
the fraction of corresponding trial coefficients C_{i,k,h} for which

|C_{i,k,h}| ≥ |C_{i,k}|.    (2.21)

More precisely, let P_{i,k} denote this estimated p-value for the kth coefficient for
gene i. P_{i,k} is calculated by the equation

P_{i,k} = ĥ/H,    (2.22)

where ĥ is the number of trials h for which (2.21) is satisfied.
Note that if the data were completely random, the p-values should reflect a
uniform distribution. Moreover, because the Haar wavelet functions are orthog-
onal, the p-values are also independent. These two facts enable us to calculate
a distribution for the smallest p-value.
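The bootstrap of (2.21)–(2.22) can be sketched as follows, assuming NumPy. For simplicity this sketch resamples the already-centered profile and re-centers each trial, rather than resampling the raw values and applying the full (2.3); `bootstrap_pvalues` is our own name:

```python
import numpy as np

def bootstrap_pvalues(g, W, H=1000, rng=None):
    """Estimate p-values (2.22) for one centered gene profile g (length T).
    Trial profiles are resampled with replacement, re-centered, and fit
    with the same pseudoinverse; the p-value for coefficient k is the
    fraction of trials whose |coefficient| is at least |C_k|."""
    rng = np.random.default_rng(rng)
    Wp = np.linalg.pinv(W)
    C = g @ Wp                                    # observed coefficients
    T = g.size
    idx = rng.integers(0, T, size=(H, T))         # bootstrap resampling
    trials = g[idx]
    trials -= trials.mean(axis=1, keepdims=True)  # re-center each trial
    C_h = trials @ Wp                             # H x n trial coefficients
    return (np.abs(C_h) >= np.abs(C)).mean(axis=0)
```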
The final step begins by selecting the smallest p-value associated with gene
i, which we notate as P_{i,k_min}. Since the p-values P_{i,2}, ..., P_{i,n} are independent
and uniformly distributed under the null hypothesis, we can treat them as random
variables and determine the likelihood that P_{i,k_min} is relatively small. Essentially,
the final step is to calculate a p-value for each P_{i,k_min}. To do this, we use the
random variables P_{i,2:n} associated with each gene i to calculate the probability
that the smallest of n - 1 independent and uniformly distributed random variables
is at least as small as P_{i,k_min}. This probability is calculated by:

P_i = Prob( min(x_2, ..., x_n) ≤ P_{i,k_min} ) = 1 - (1 - P_{i,k_min})^{n-1},    (2.23)

where the random variables x_k ~ Uniform(0,1). It is this set of probabilities to
which we apply the false discovery rate procedure.
2.4.2 FDR analysis
In multiple testing, the number of type I errors (false rejections) can increase
considerably. For example, when testing 10,000 genes using a significance level
of α = 0.05, we would expect to identify approximately 500 false discoveries
(FD) by chance alone. If our multiple tests resulted in 1000 discoveries (D),
then the expected rate of false discoveries would be FD/D = 500/1000 = 0.5, which
is rather large. Therefore, instead of controlling the chance of a single false
discovery occurring (e.g., via Bonferroni methods), we use a false discovery rate
(FDR) procedure developed by Benjamini and Hochberg [6] to control the overall
proportion of false discoveries.
Given a pre-defined significance level, α, and a set of sorted p-values P to
be compared, the FDR procedure defines a multiple testing threshold we will
call FDT, such that

FDT = max{ P_i : P_i ≤ (i/m)α }.    (2.24)

So instead of using a static significance level as in the Bonferroni method of
multiple testing, we implement a dynamic threshold which is a function of the
static significance level (α) and the particular set of p-values being assessed. We
are essentially calculating a significance level for each i as follows:

α_i = (i/m)α.    (2.25)
By defining α_i in this way, if P_i ≤ α_i then the desired FDR is achieved. One
can simply scan down P in descending order until the first index i for which
P_i ≤ α_i is reached, and then reject the null hypothesis (H_0: there is no significant
temporal pattern) for all j ≥ i.
This FDR procedure ensures that the expected proportion of false discoveries,
E(FD/D), as a function of FDT, will not exceed the predefined significance level
α over multiple hypothesis tests. For example, if 100 out of 1000 genes are found
to be temporally significant, and α was chosen to be 0.2, then on average we
can expect at least 80 of these to be valid discoveries. Our algorithm was tested
using a predefined significance level of α = 0.1. Since we knew that only a small
proportion of P would be rejected under the null hypothesis, P was sorted and
searched in ascending order for the sake of efficiency. The pseudocode in Tab.
2.1 summarizes our method.
2.5 Algorithm Summary and Implementation
The complete algorithm is summarized by the pseudocode in Tab. 2.2.
Input parameters are listed in Tab. 2.3. Algorithm output was in the form of
Table 2.1: FDR procedure pseudocode
Step 0 : (Initialization) Given a vector of m p-values, P, and a predefined
significance level α;
Step 1 : (Sort) Sort P in ascending order;
Step 2 : (Critical Significance Levels) For each i, calculate α_i according
to (2.25);
Step 3 : (Find Threshold) For each i, compare P_i to α_i; when P_i > α_i is
first reached, reject the null hypothesis for all j < i;
Step 4 : (Desort and Return Decision Vector) Return a decision vec-
tor of length m in the original order of P, which indicates a discovery
(1) or a nondiscovery (0) for each gene i.
a decision vector, z, of length m, where z_i = 0 indicated that the ith signal was
not flagged as significant, and z_i = 1 indicated otherwise.
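The FDR scan of Tab. 2.1 can be sketched as follows, assuming NumPy. This version takes the largest index satisfying the bound, the standard Benjamini–Hochberg step-up, which the ascending early-exit search described in the text approximates:

```python
import numpy as np

def fdr_decisions(P, alpha):
    """Benjamini-Hochberg step-up (Tab. 2.1): sort the p-values, compare
    P_(i) to (i/m)*alpha, reject every p-value at or below the largest
    P_(i) satisfying the bound, and return the 0/1 decision vector in
    the original order."""
    P = np.asarray(P, dtype=float)
    m = P.size
    order = np.argsort(P)                     # Step 1: sort ascending
    alphas = alpha * np.arange(1, m + 1) / m  # Step 2: critical levels
    below = P[order] <= alphas                # Step 3: find threshold
    z = np.zeros(m, dtype=int)
    if below.any():
        cutoff = below.nonzero()[0].max()     # largest i with P_(i) <= a_i
        z[order[: cutoff + 1]] = 1            # Step 4: decisions, original order
    return z
```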
Table 2.2: Algorithm pseudocode
Step 0 : (Initialization) Given an m × T matrix of expression values G,
a vector of T time points t, and input parameters a, b, R, H, and α (see
the description of input parameters in Tab. 2.3);
Step 1 : (Normalization) Construct the normalized dataset, G̃, accord-
ing to formula (2.3);
Step 2 : (Haar Wavelet Decomposition) Generate the wavelet coeffi-
cients, C, according to formula (2.14);
Step 3 : (Estimation of P-values) As described in Section 2.4.1, use
H bootstrap trials to estimate the p-value for each wavelet coefficient using
formula (2.22), then calculate P_i for each gene using formula (2.23);
Step 4 : (FDR Analysis) As described in Tab. 2.1, use P and α to deter-
mine temporal significance for each gene.
Table 2.3: Algorithm input parameters
t, a vector of length T, giving the ordered time points at which ex-
pression levels were sampled;
a and b, endpoints indicating the starting and ending times, used to
define the Haar wavelet functions;
R, an integer > 0, specifying the size of the Haar basis, n = 2^R;
G, an m × T matrix containing the expression values of m signals,
each sampled at the T time points indexed by t_j;
H, an integer > 0, specifying the number of trial signals to generate
for each gene in the bootstrapping procedure; and
α, a value in the interval (0,1), stipulating the false discovery rate to
be used in deciding the temporal significance of genes.
3. TESTING AND RESULTS
The method was tested using simulated signals representing the expression
levels across time for a set of genes, G. Some of the gene signals were generated
only by random noise, and the rest by a predefined "true" temporal pattern
f plus random noise. To generate temporal patterns for the true signals,
we used three families: the Haar step function, the sine function, and non-
cyclical regression functions (exponential and polynomial). The added noise was
Gaussian with mean 0 and with standard deviation σ set according to the desired
noise-to-signal ratio (see Tab. 3.1). Finally, the algorithm was applied to the genome-wide
expression levels measured from the mammary tissue of Rattus norvegicus across
the four-day estrous cycle.
3.1 Generating Simulation Time Series
In order to observe our algorithm's performance at both detecting the true
signals and dismissing the false, we generated a variety of simulated datasets
used to represent G in calling our primary algorithm. Each simulated dataset
was made up of two kinds of signals: those originating only from random noise,
and those produced from a time-dependent generating function f with Gaussian
noise added. The latter kind we refer to as the "true" signals. Since we knew
which was which, we were able to calculate actual false discovery and sensitivity
rates of our algorithm's performance.
For each f tested, a set of simulated signals, F, was generated so that a
specifiable fraction, p, of them was based upon f plus noise, and the rest consisted
only of noise. The matrix, F, consisting of m simulated signals with T time
points (placed according to t_rate) was constructed such that a fraction p of them
were true, defined as F_i = f + N(0, σ²), while the remaining were designed
to be pattern-free, defined as F_i = 0 + N(0, σ²). Sampling rate, sampling
frequency, and degree of noise-to-signal ratio were varied during testing in order
to analyze their effects upon our algorithm's identification of temporal patterns.
The testing scheme utilized the parameters listed in Tab. 3.1.
For simplicity and ease of comparability, all simulations arranged the time
points t within the interval [0, 1] (that is, a = 0, b = 1), and we used R = 3
(see Tab. 2.3), thereby keeping the number of basis functions at 8. Furthermore,
since each f used for simulation had a variance of less than 5, F was assumed
to be already log-transformed once sent to the main algorithm. Therefore, the
normalization of F consisted only of centering each F_i about 0.
F was analyzed by our primary algorithm using H = 1000 and α = 0.1
to test for significantly large coefficients. We counted the resulting number of
true positives (TP), false positives (FP), true negatives (TN) and false negatives
(FN). These variables were then used to calculate the observed FDRs,
FDR_obs = FP/(FP + TP), and the observed sensitivity rates, SR_obs = TP/(TP + FN).
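The observed rates can be computed directly from the decision vector and the known labels of the simulated signals; a small sketch (names are ours):

```python
def observed_rates(decisions, truth):
    """Observed FDR and sensitivity from a 0/1 decision vector and the
    known 0/1 truth labels of the simulated signals."""
    tp = sum(1 for z, t in zip(decisions, truth) if z == 1 and t == 1)
    fp = sum(1 for z, t in zip(decisions, truth) if z == 1 and t == 0)
    fn = sum(1 for z, t in zip(decisions, truth) if z == 0 and t == 1)
    fdr_obs = fp / (fp + tp) if (fp + tp) else 0.0  # no positives -> FDR 0
    sr_obs = tp / (tp + fn) if (tp + fn) else 0.0
    return fdr_obs, sr_obs
```

The no-positives convention mirrors the observation in Section 3.2.1 that, for very small T, no false positives occurred simply because nothing was tagged at all.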
For each generating function f, we produced simulations over various com-
binations of η, T, p, and t_rate. The testing parameters varied as follows:
0.1 ≤ η ≤ 2.0;
T = [6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50];
p = [0.1, 0.25, 0.5, 0.75, 0.9].
Table 3.1: Input parameters needed to construct F
f, a function to be used in generating the true signals in F;
η, a value used to determine the noise-to-signal ratio such that the
noise (N(0, σ²)) added to each signal, F_i, had standard deviation σ =
η·f_max, where f_max is the maximum amplitude of the zero-centered f;
T, an integer indicating the number of time points with which to
populate t;
t_rate, a string indicating the type of sampling interval with which to
build t; either a uniform (even) or random (rand) placement of T
time points;
m, an integer dictating the number of simulated signals to generate;
p, a value in the interval [0,1] indicating the fraction of the m signals to
generate using f; the fraction of true signals populating F;
H, an integer > 0, specifying the number of trial signals to generate
for each signal in the bootstrapping procedure; and
α, a value in the interval (0,1), stipulating the false discovery rate used
in determining the significance of wavelet coefficients.
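The construction of F described in this section and Tab. 3.1 can be sketched as follows (assuming NumPy; `simulate_F` is our own name, and the raw-vs-log subtleties of Section 3.1 are ignored):

```python
import numpy as np

def simulate_F(f, T, m, p, eta, t_rate="even", rng=None):
    """Build the m x T simulation matrix F: a fraction p of rows are
    f(t) plus N(0, sigma^2) noise with sigma = eta * f_max (Tab. 3.1),
    and the rest are pure noise. Returns (t, F, truth labels)."""
    rng = np.random.default_rng(rng)
    t = np.linspace(0.0, 1.0, T) if t_rate == "even" else np.sort(rng.random(T))
    signal = f(t) - f(t).mean()              # zero-centered true profile
    sigma = eta * np.abs(signal).max()       # noise scale from eta
    n_true = int(round(p * m))
    F = rng.normal(0.0, sigma, size=(m, T))  # noise for every row
    F[:n_true] += signal                     # first n_true rows are "true"
    return t, F, np.array([1] * n_true + [0] * (m - n_true))
```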
FDR_obs and SR_obs were recorded and stored for each scenario. Output was
generated in tabular form of η against T (see Tab. 3.2). For each value of p, we
generated four η-T tables: FDR_obs for evenly- and unevenly-spaced T, and
SR_obs for evenly- and unevenly-spaced T. Each table was the average result over
10 simulations for a given f, η, T, p, and t_rate.
By varying the noise-to-signal ratio (η), the number of sample points (T),
the sampling rate (t_rate), and the proportion (p) of f-generated signals, we
were able to compare results of various scenarios within each type of generating
function. The motivation for each type of test is to get a sense of how many
time points are needed to identify a significant signal, and how their placement
(uniform or not) and the degree of added noise affect this number.
3.1.1 Haar wavelet signals
The first set of tests involved signals based on the Haar wavelet itself. We
used the step function

ψ(t) = 1 if 0 ≤ t < 1/2;  -1 if 1/2 ≤ t < 1,

to act as f.
Figure 3.1: The Haar wavelet step function, ψ.
These signals clearly display a low frequency pattern across the
overall time interval, i.e., high value to low value. The motivation behind im-
plementing this test was to see how effective the algorithm would be on the
simplest possible test function. This test served as a good reference base with
which to compare the other, more complex tests.
3.1.2 Sinusoidal signals
The second set of tests used variations of the sine function with which to
build F. This test was based on the standard sine function
f(t) = sin(frq · π · t),
where the parameter frq was used to vary the frequency from approximately half
of a period to approximately two periods on the interval t = [0,1]. We varied
frq within the interval [0.05, 16]. See Fig. 3.2 for some of the frequencies we
tested. These are signals which display windows of cyclical patterns.
[Figure panels: f(t) = sin(frq · π · t) for frq = 0.3, 1.3, 1.8, 2.5, 3.8, 5.2, 8.0, and 14.8.]
Figure 3.2: Sinusoidal functions of various frequencies on the time interval
[0,1].
In varying
the frequency we intended to test whether the algorithm favors lower or higher
frequencies.
3.1.3 Regression signals
The last set of tests involved regression functions that have been previously
used in various Monte Carlo performance studies [25]. This set consists of the
following four functions:

f_regr1(t) = 1 - 48t + 218t² - 315t³ + 145t⁴;
f_regr2(t) = 0.3e^{-64(t-0.25)²} + 0.7e^{-256(t-0.75)²};
f_regr3(t) = 10e^{-10t};
f_regr4(t) = e^{-2t} for t ≤ 1/2,  e^{-2(1-t)} for t > 1/2.
Figure 3.3: Regression functions 1 through 4, on the interval [0,1].
These generating functions are neither simple as with the Haar step function,
nor cyclic as with the sinusoidal functions. They are
implemented in an effort to see how the algorithm fares when dealing with more
complex signals. The first regression function, Fig. 3.3(a), displays a trend with
a little fine structure while the third, Fig. 3.3(c), displays trend but with no
fine structure. The second, Fig. 3.3(b), has obvious time dependent variation
in curvature, and the fourth, Fig. 3.3(d), is a function whose first derivative
is undefined at t = 1/2, contradicting standard assumptions for optimal perfor-
mance. Each regression function was used to generate true signals such that
F_i = f(t) + N(0, σ²).
3.2 Results: Simulation Testing
Each simulation generated m = 100 signals, and used H = 1000 and α = 0.1
in determining significant temporal activity. Results were averaged over 10
iterations.
3.2.1 Performance using Haar-based f
The Haar step function simulations yielded similar results for evenly and
randomly spaced T (see Tables 3.2, 3.3 and Fig. 3.4). With 10% of signals true
(p = 0.1), the observed sensitivity rates stayed above 90% when T was as small
as 15 and η ≤ 0.5, and when η was as large as 1.0 and T ≥ 40.
Table 3.2: Observed sensitivity rates for f = ψ(t), evenly spaced T and p = 0.1
η \ T 6 7 8 9 10 15 20 25 30 40 50
2.0 0 0 0 0 0 0.02 0.07 0.09 0.12 0.24 0.46
1.6 0 0 0 0 0.01 0.03 0.09 0.21 0.29 0.73 0.89
1.0 0 0 0 0 0.02 0.12 0.33 0.76 0.93 0.99 1.00
0.75 0 0 0 0.03 0 0.23 0.87 0.98 1.00 1.00 1.00
0.5 0 0 0 0.04 0.13 0.80 1.00 1.00 1.00 1.00 1.00
0.25 0 0 0.01 0.11 0.43 1.00 1.00 1.00 1.00 1.00 1.00
0.1 0 0.03 0.03 0.10 0.53 1.00 1.00 1.00 1.00 1.00 1.00
Table 3.3: Observed sensitivity rates for f = ψ(t), randomly spaced T and
p = 0.1
η \ T 6 7 8 9 10 15 20 25 30 40 50
2.0 0 0 0 0 0 0.02 0.07 0.07 0.11 0.14 0.31
1.6 0 0 0 0 0.01 0.03 0.03 0.17 0.19 0.35 0.76
1.0 0 0 0 0 0.01 0.08 0.30 0.60 0.70 0.95 0.99
0.75 0 0 0.01 0.01 0.07 0.19 0.56 0.79 0.96 1.00 1.00
0.5 0 0 0 0.04 0.06 0.61 0.90 1.00 1.00 1.00 1.00
0.25 0 0 0.01 0.15 0.30 1.00 0.95 1.00 1.00 1.00 1.00
0.1 0 0 0 0.17 0.22 1.00 1.00 1.00 1.00 1.00 1.00
(a) Evenly spaced T (b) Randomly spaced T
Figure 3.4: Surface plots of observed sensitivity rates. (These are 3-
dimensional renderings of Tables 3.2 and 3.3, viewed from their upper left hand
corners towards their lower right.)
When added noise was at its best (smallest; η = 0.1), T could be as small as
15 and still identify 100% of the true signals, with FDR_obs = 0.03 for T evenly
spaced and FDR_obs = 0.06 for T randomly spaced. When T dropped from 15
to 10, sensitivity decreased to 53% (FDR_obs = 0.02) for evenly spaced, and to
22% (FDR_obs = 0.04) for randomly spaced time points.
To get a better idea of what was going on between T = 15 and T = 10, we
reran the simulations using η = 0.1 and 9 ≤ T ≤ 16. Fig. 3.5 shows the
Figure 3.5: Sensitivity rates for 16 ≥ T ≥ 9 with η = 0.1.
resulting sensitivity rates for evenly and randomly spaced T. We can see that,
while in Fig. 3.4 there appears to be a big drop between T = 15 and T = 10,
Fig. 3.5 shows that the decrease is actually gradual, except at T = 12, where
there occurs an increase for evenly spaced T and a decrease for randomly spaced
T.
When the number of time points was at its best (largest; T = 50), a relatively
large amount of added noise (η = 1.5) did not prevent the algorithm from
capturing 89% (FDR_obs = 0.14) and 76% (FDR_obs = 0.14) of the true signals,
for evenly and randomly spaced T, respectively. However, when η was increased
to 2.0, even a large sample size could not make up for the damaging effect of too
much noise; the algorithm identified 46% (FDR_obs = 0.1) and 31% (FDR_obs =
0.09) of the true signals, for evenly and randomly spaced T, respectively. Fig.
3.6 shows how when η = 1.5 the signals originating from the Haar step function
are still visible, but when η = 2.0 they are not as easily distinguished from the
(a) η = 1.5  (b) η = 2.0
Figure 3.6: Haar wavelet-generated (normalized) signals for T = 40, where T
is evenly spaced and p = 0.1. This illustrates how, even with a large number of
uniform samples, true signals (black) can quickly become nearly indistinguish-
able from pure noise (grey) when η increases from 1.5 to 2.0.
signals of pure noise.
For both evenly and randomly spaced time points, FDR_obs remained below
0.2 for η ≤ 1.0 and T ≥ 20. Unlike the fluid patterns seen with the observed
sensitivity rates, the observed FDRs were generally more erratic, with surface plots more
reminiscent of origami than landscapes (see Fig. 3.7). For T ≤ 10, the typical
outcome was that no false positives occurred; this was often a by-product of
there being no positives tagged at all. Positives (true or not) did not occur at
all when T was smaller than 7.
There were little to no differences seen in results as the proportion of true
signals (p) was increased. Observed FDRs approached zero as p increased, but
the sensitivity rates were largely unchanged (see Fig. 3.8).
(a) Evenly spaced T
(b) Randomly spaced T
Figure 3.7: Surface plots for the observed FDRs from uniform and randomly
spaced T, p = 0.1.
(a) p = 0.5; evenly spaced T  (b) p = 0.5; randomly spaced T
(c) p = 0.9; evenly spaced T  (d) p = 0.9; randomly spaced T
Figure 3.8: Surface plots of observed sensitivity when the proportion of true
signals (p) is 50% (a and b) and 90% (c and d). No significant differences were
seen between small and large p.
3.2.2 Performance using sinusoidal-based f
The results from the sinusoidal simulations were similar to those generated
using the Haar step function in that as p increased from 0.1 to 0.9, there were
only small improvements observed. Therefore, we will present only those results
associated with p = 0.1. Overall, lower frequencies (frq < 5) yielded better
results than higher frequencies (frq > 5). This contrast is illustrated in Figures
3.10 and 3.12. Haar wavelet functions are aligned with sinusoidal frequencies of
a power of two, so there were bursts of improvement near these frequencies, with
deterioration in between. This trend can be seen in Fig. 3.11 for frequencies
near 2 and 4, and in Fig. 3.12 for a frequency of 8. Fig. 3.9 plots the mean
sensitivity rates (for evenly spaced T, η = 0.1) as a function of frequency. Our
algorithm performed optimally for frequencies near 2 (see Fig. 3.11).
Figure 3.9: Mean sensitivity rate across all evenly spaced T (with η = 0.1) as
a function of frequency.
[Figure panels: f = sin(0.3πt), sin(1.0πt), and sin(1.3πt), each shown with
sensitivity surface plots for uniform and random T.]
Figure 3.10: Surface plots of the resulting sensitivity from low frequency (near
0 and 1) sinusoidal-generated simulations.
[Figure panels: f = sin(1.8πt), sin(2.0πt), sin(2.5πt), sin(3.0πt), and sin(3.8πt),
each shown with sensitivity surface plots for uniform and random T.]
Figure 3.11: Surface plots of the resulting sensitivity from lower frequency
(near 2 to 4) sinusoidal-generated simulations.


[Surface plot panels: (a) f = sin(5.2πt); (b) frq = 5.2, uniform T; (c) frq = 5.2, random T; (d) f = sin(6.0πt); (e) frq = 6.0, uniform T; (f) frq = 6.0, random T; (g) f = sin(7.0πt); (h) frq = 7.0, uniform T; (i) frq = 7.0, random T; (j) f = sin(8.0πt); (k) frq = 8.0, uniform T; (l) frq = 8.0, random T; (m) f = sin(9.0πt); (n) frq = 9.0, uniform T; (o) frq = 9.0, random T]
Figure 3.12: Surface plots of the resulting sensitivity from higher frequency
(greater than 5) sinusoidal-generated simulations.


Generally, for all frequencies examined, results improved when T was evenly
spaced. However, when T was small, the algorithm often performed slightly
better for randomly spaced T. Here we begin our discussion of the results from
some of the lower frequencies (featured in Figures 3.10 and 3.11).
Table 3.4: Observed sensitivity rates for f = sin(0.3πt), T evenly spaced and p = 0.1

η \ T    6     7     8     9     10    15    20    25    30    40    50
2.0      0     0     0     0     0.01  0.01  0     0.02  0     0.04  0.04
1.5      0     0     0     0     0     0.01  0     0.01  0.02  0.05  0.05
1.0      0     0     0     0     0     0     0.02  0.05  0.06  0.18  0.35
0.75     0     0     0     0     0.01  0     0.09  0.18  0.18  0.40  0.81
0.5      0     0     0     0     0     0.02  0.17  0.64  0.73  0.97  1.00
0.25     0     0     0     0     0.02  0.14  0.78  1.00  1.00  1.00  1.00
0.1      0     0     0     0     0.01  0.17  0.92  1.00  1.00  1.00  1.00
When frequency was set to 0.3 (see Tab. 3.4 and Fig. 3.10 e and f), minimal
added noise (η = 0.1) required as little as 20 evenly spaced sample points to
identify 92% (FDRobs = 0.07) of the true signals, or 25 randomly spaced sample
points to identify 90% (FDRobs = 0.07). However, if evenly spaced T dropped
from 20 to 15, the sensitivity dove from 92% to 17% (FDRobs = 0.15). The
decline was more gradual for randomly spaced T.
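Throughout, sensitivity is the fraction of true temporal signals that were tagged, and FDRobs is the fraction of tagged genes that are false positives; when nothing at all is tagged, FDRobs is undefined. A minimal sketch of how these observed rates can be tallied (the function and variable names here are ours, not the thesis code):

```python
import math

def observed_rates(tagged, truth):
    """Sensitivity = TP / (# true signals); FDR_obs = FP / (# tagged).

    `tagged` and `truth` are parallel boolean sequences over genes:
    tagged[i] is True if gene i was flagged, truth[i] if it truly has a
    temporal signal."""
    tp = sum(1 for a, b in zip(tagged, truth) if a and b)      # true positives
    fp = sum(1 for a, b in zip(tagged, truth) if a and not b)  # false positives
    pos = sum(truth)
    sensitivity = tp / pos if pos else math.nan
    fdr = fp / (tp + fp) if (tp + fp) else math.nan            # undefined if nothing tagged
    return sensitivity, fdr
```

This matches the convention used later in the chapter, where FDRobs is reported as undefined when no positives are tagged at all.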
Even for large T, added noise took a toll on sensitivity rates. For evenly
spaced T = 50, sensitivity rates dropped from 81% (FDRobs = 0.16) to 35%
(FDRobs = 0.15) as η increased from 0.75 to 1.0. For randomly spaced T = 50,
sensitivity dropped from 97% (FDRobs = 0.09) to 48% (FDRobs = 0.08) as η
increased from only 0.5 to 0.75.
To get a better idea of what was going on in between our standard simulation
values of T and η, we ran additional simulations using small η with T in finer


detail, and large T with η in finer detail. We can see from Fig. 3.13, which
shows how T (with minimal added noise) affects sensitivity, that the largest
jump actually occurs between evenly spaced T = 16 and T = 15. From Fig.
3.14, which shows how η (with large T) affects sensitivity, we see that the largest
change for both evenly and randomly spaced T actually occurs between η = 0.7
and η = 0.9.
Figure 3.13: Sensitivity rates for frq = 0.3, 77 = 0.1, as a function of T in finer
detail. Sensitivity decreases more sharply for evenly spaced T (solid dark line)
between 20 and 15 than for randomly spaced T (dashed light line).


Figure 3.14: Sensitivity rates for frq = 0.3, T = 50, as a function of η in finer
detail. Sensitivity decreases most when 0.5 < η < 1.0. Evenly (solid dark line)
and randomly (dashed light line) spaced large T produce similarly decreasing
rates.


Table 3.5: Observed sensitivity rates for f = sin(1.3πt), evenly spaced T and p = 0.1

η \ T    6     7     8     9     10    15    20    25    30    40    50
2.0      0     0     0     0     0     0     0.01  0     0.02  0.01  0.02
1.5      0     0     0     0     0     0.01  0     0.03  0.02  0.01  0.02
1.0      0     0     0     0     0     0.01  0.01  0.02  0.03  0.01  0.10
0.75     0     0     0     0     0     0.02  0.05  0.02  0.03  0.06  0.23
0.5      0     0     0     0     0     0     0.04  0.02  0.06  0.31  0.57
0.25     0     0     0     0     0     0     0.04  0.10  0.37  0.92  0.99
0.1      0     0     0     0     0     0     0     0.18  0.76  1.00  1.00
Sensitivity rates for frequencies near 1 were not as high as those for frequencies
near 0. For frequency 1.3 (see Tab. 3.5 and Fig. 3.10 k and l), a minimal
amount of noise still required an evenly spaced T as large as 30 to be able to
identify 76% (FDRobs = 0.14) of the true signals. If the 30 time points were
randomly spaced, only 38% (FDRobs = 0.17) were identified. And if evenly
spaced T dropped from 30 to 25, the sensitivity rate dropped from 76% to 18%
(FDRobs = 0.14).
We again ran additional simulations using small η with T in finer detail, and
large T with η in finer detail. Fig. 3.15, which plots sensitivity as a function of
T (with minimal added noise), shows that evenly spaced T is adversely affected
for T < 26, whereas the effect on sensitivity is more gradual for randomly spaced T.
Sensitivity rates for large T behaved similarly for evenly and randomly spaced
T as the level of noise changed. We can see from Fig. 3.16, which shows how η
(with large T) affects sensitivity, that for both evenly and randomly spaced T,
rates become quite low as η approaches 0.5.


Figure 3.15: Sensitivity rates for frq = 1.3, η = 0.1, as a function of T in finer
detail. Sensitivity decreases more sharply for evenly spaced T (solid dark line)
between 30 and 25 than for randomly spaced T (dashed light line).
Figure 3.16: Sensitivity rates for frq = 1.3, T = 50, as a function of η in finer
detail. Evenly (solid dark line) and randomly (dashed light line) spaced large T
produce similarly decreasing rates.


Table 3.6: Observed sensitivity rates for f = sin(1.8πt), evenly spaced T and p = 0.1

η \ T    6     7     8     9     10    15    20    25    30    40    50
2.0      0     0     0     0     0.01  0     0     0.06  0.02  0.06  0.02
1.5      0     0     0.01  0     0     0.02  0.02  0.02  0.09  0.17  0.22
1.0      0     0     0     0     0     0.03  0.11  0.12  0.22  0.50  0.72
0.75     0     0     0     0.01  0.01  0.02  0.24  0.48  0.58  0.95  0.99
0.5      0     0     0     0     0     0.07  0.67  0.88  0.99  1.00  1.00
0.25     0     0     0     0.04  0     0.48  1.00  1.00  1.00  1.00  1.00
0.1      0     0     0     0.01  0     0.99  1.00  1.00  1.00  1.00  1.00
We've seen that the Haar wavelet transform favors low frequencies, and frequencies
near powers of two. So it is not surprising that our algorithm performed
optimally for frequencies near 2, the smallest power of two. When frequency was
set to 1.8 (Tab. 3.6 and Fig. 3.11 b and c), even for relatively small T and large
η, sensitivity rates remained fairly high. For T = 20 and η = 0.5, the sensitivity
rate was still as high as 67% (FDRobs = 0.07).
When added noise was most minimal (η = 0.1), evenly spaced T could be
as small as 15 and our algorithm was still able to identify virtually all temporal
signals (FDRobs = 0.07); for randomly spaced T = 15, our algorithm identified
73% of temporal signals (FDRobs = 0.04). However, if evenly spaced T dropped
from 15 to 10, the sensitivity rates fell from 99% to 0% (with FDRobs undefined,
as no positives were tagged at all).
We ran additional simulations using small η with T in finer detail, and large
T with η in finer detail. We can see from Fig. 3.17, which shows how T (with
minimal added noise) affects sensitivity, that the largest jump actually occurs
between evenly spaced T = 13 and T = 12. Fig. 3.18 shows that large T was
able to withstand relatively high levels of added noise whether T was evenly or


randomly spaced.
Figure 3.17: Sensitivity rates for frq = 1.8, η = 0.1, as a function of T in finer
detail. Sensitivity decreases more sharply for evenly spaced T (solid dark line)
between 14 and 12 (see arrows) than for randomly spaced T (dashed light line).
Figure 3.18: Sensitivity rates for frq = 1.8, T = 50, as a function of η in finer
detail. Evenly (solid dark line) and randomly (dashed light line) spaced large
T produce similarly decreasing rates and both have a high tolerance to a large
amount of added noise.


Table 3.7: Observed sensitivity rates for f = sin(2.5πt), evenly spaced T and p = 0.1

η \ T    6     7     8     9     10    15    20    25    30    40    50
2.0      0     0     0     0     0     0     0.01  0.01  0.03  0.01  0.03
1.5      0     0     0     0     0     0     0.01  0     0.01  0.03  0.06
1.0      0     0     0     0     0     0     0.03  0     0.04  0.07  0.21
0.75     0     0     0     0     0     0.01  0.01  0.04  0.06  0.18  0.25
0.5      0     0     0     0     0     0.01  0.01  0.07  0.22  0.59  0.89
0.25     0     0     0     0     0     0.02  0.01  0.39  0.58  0.92  1.00
0.1      0     0     0     0     0     0     0.02  0.77  0.56  1.00  1.00
For frequency 2.5 (see Tab. 3.7 and Fig. 3.11 h and i), results fared slightly
better than when frequencies were near 1, but worse than when frequencies
were near 0. Minimal added noise (η = 0.1) needed evenly spaced T = 25 to
identify 77% (FDRobs = 0.08), but if T = 30, sensitivity rates curiously dropped
to 56% (FDRobs = 0.14). The same is seen for randomly spaced T, where
sensitivity is 64% (FDRobs = 0.14) when T = 25, but only 49% (FDRobs = 0.12)
when T = 30. Additionally, if evenly spaced T dropped from 25 to 20, the
sensitivity rate went from 77% to 2% (FDRobs = 0.60).
This pattern can be seen in Fig. 3.19. Note that the results appearing in
Fig. 3.19 differ from those in Tab. 3.7 and Fig. 3.11 because they were obtained
from a separate batch of simulations investigating η and T in finer detail. The
same patterns still emerge, however.
For large T, η affected sensitivity similarly for evenly and randomly spaced
time points, except when η = 0.6 (see Fig. 3.20).


Figure 3.19: Sensitivity rates for frq = 2.5, η = 0.1, as a function of T in finer
detail. Sensitivity for evenly spaced T (solid dark line) is larger when T = 24
than when T = 29, and a large decrease occurs between T = 24 and 23. For
randomly spaced T (dashed light line), sensitivity is larger when T = 22 than
when T = 31.
Figure 3.20: Sensitivity rates for frq = 2.5, T = 50, as a function of η in finer
detail. Evenly (solid dark line) and randomly (dashed light line) spaced large
T produce similarly decreasing rates except for 0.5 < η < 0.7, where randomly
spaced T yields smaller sensitivity rates.


Table 3.8: Observed sensitivity rates for f = sin(3.8πt), evenly spaced T and p = 0.1

η \ T    6     7     8     9     10    15    20    25    30    40    50
2.0      0     0     0     0     0     0     0.02  0.01  0.03  0.01  0.01
1.5      0     0     0     0.01  0     0     0.02  0.01  0.04  0.06  0.07
1.0      0     0     0     0     0     0     0     0.04  0.07  0.15  0.25
0.75     0     0     0     0     0     0.01  0.01  0.08  0.15  0.27  0.63
0.5      0     0     0     0     0     0.01  0.05  0.28  0.32  0.88  0.99
0.25     0     0     0     0     0     0.01  0     0.50  0.86  1.00  1.00
0.1      0     0     0     0     0     0     0.05  0.77  0.99  1.00  1.00
When frequency was set to 3.8 (see Tab. 3.8 and Fig. 3.11 n and o), results
were an improvement on a frequency of 2.5, but not as good as when frequencies
were nearer to 2. A minimal level of added noise (η = 0.1) required 25 evenly
spaced sample points in order to identify 77% (FDRobs = 0.08) of the true
signals, or 25 randomly spaced sample points to identify 66% (FDRobs = 0.10).
However, if evenly spaced T dropped from 25 to 20, the sensitivity went from
77% to 5% (FDRobs = 0.44). The decrease in sensitivity rates for randomly
spaced T, however, appeared more gradual. In Fig. 3.21, we can see that while
there is a big difference in sensitivity rates between evenly spaced T = 25 and
20, the decrease is actually relatively uniform. For large T, η affected sensitivity
similarly for evenly and randomly spaced time points (see Fig. 3.22).


Figure 3.21: Sensitivity rates for frq = 3.8, η = 0.1, as a function of T in finer
detail. Sensitivity drops significantly, but uniformly, for evenly spaced T (solid
dark line) between 24 and 20 (see arrows). Randomly spaced T (dashed light
line) yields slightly higher rates for smaller values of T.
Figure 3.22: Sensitivity rates for frq = 3.8, T = 50, as a function of η in finer
detail. Evenly (solid dark line) and randomly (dashed light line) spaced large T
produce similarly decreasing rates.


3.2.3 Performance using regression-based f
The results from the regression-based simulations were similar to the Haar
and sinusoidal simulations in that as p increased from 0.1 to 0.9, there were
only slight improvements in the sensitivity and observed false discovery rates.
Therefore, we will again present only those results from the cases where p = 0.1.
The functions on which our algorithm performed best were fregr1(t) and
fregr4(t). In general, sensitivity and false discovery rates were better when T
was evenly spaced. However, when T was small, we again saw that randomly
spaced T yielded some improvements.
For regression function fregr1(t) = 1 - 48t + 218t^2 - 315t^3 + 145t^4 (see Tab.
3.9 and Fig. 3.23), a minimal level of added noise (η = 0.1) required 30 evenly
spaced time points to yield a 74% sensitivity rate (FDRobs = 0.10), and 40
randomly spaced time points for 72% sensitivity (FDRobs = 0.08). If evenly
spaced T was reduced to 25, sensitivity decreased to 30% (FDRobs = 0.19).
And if randomly spaced T was reduced to 30, sensitivity decreased to 18%
(FDRobs = 0.22).
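As a concrete illustration of the simulation setup, a noisy series can be drawn from fregr1 at T evenly or randomly spaced time points in [0, 1], with N(0, η²) noise added. This is a sketch under our assumptions (uniform draws for the random spacing; the exact thesis procedure may differ):

```python
import numpy as np

def f_regr1(t):
    # f_regr1(t) = 1 - 48t + 218t^2 - 315t^3 + 145t^4
    return 1 - 48*t + 218*t**2 - 315*t**3 + 145*t**4

def simulate_series(f, T, eta, spacing="even", rng=None):
    """Sample f at T points in [0, 1] and add N(0, eta^2) Gaussian noise."""
    rng = np.random.default_rng(rng)
    if spacing == "even":
        t = np.linspace(0.0, 1.0, T)
    else:
        # randomly spaced: sorted uniform draws (our assumption)
        t = np.sort(rng.uniform(0.0, 1.0, T))
    return t, f(t) + eta * rng.normal(size=T)
```

For example, `simulate_series(f_regr1, 30, 0.1)` produces one "true signal" series at 30 evenly spaced time points with minimal added noise.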
For large T, a small amount of added noise adversely affected sensitivity
rates. For evenly spaced T = 50, sensitivity rates dropped from 99%
(FDRobs = 0.07) to 49% (FDRobs = 0.09) as η increased from 0.25 to 0.50.
For randomly spaced T = 50, sensitivity dropped from 90% (FDRobs = 0.12) to
51% (FDRobs = 0.09) as η increased from only 0.25 to 0.50.


Table 3.9: Observed sensitivity rates for fregr1(t), evenly spaced T and p = 0.1

η \ T    6     7     8     9     10    15    20    25    30    40    50
2.0      0     0     0     0     0     0     0     0.01  0.02  0.02  0.01
1.5      0     0     0     0.01  0     0     0.01  0.03  0.02  0.04  0.04
1.0      0     0     0     0     0     0     0.01  0.01  0.02  0.08  0.04
0.75     0     0     0     0     0     0.01  0     0.05  0.01  0.19  0.17
0.5      0     0     0     0.01  0     0     0.05  0.06  0.10  0.45  0.49
0.25     0     0     0     0     0     0.01  0.13  0.21  0.30  0.82  0.99
0.1      0     0     0     0     0     0     0.19  0.30  0.74  0.88  1.00
Figure 3.23: Surface plots (b and c) of observed sensitivity rates from simulations
generated from regression function (a) fregr1(t) = 1 - 48t + 218t^2 - 315t^3 + 145t^4.
To get a better idea of what was going on in between our standard simulation
values of T and η, we ran additional simulations using small η with T in finer
detail, and large T with η in finer detail. The results of these simulations are
presented in Figures 3.24 and 3.25. We can see from Fig. 3.24, which shows how
T (with minimal added noise) affects sensitivity, that the largest jump occurs
for both evenly and randomly spaced time points when T is between 26 and 28,
and that evenly spaced T outperforms randomly spaced T near 30 time points.
Fig. 3.25 shows how η (with large T) affects sensitivity; we see that the
biggest difference between evenly and randomly spaced T occurs between η = 0.4


and η = 0.6, which is also where the sensitivity rates drop the most. Otherwise,
they decrease similarly.
Figure 3.24: Sensitivity rates for fregr1(t), η = 0.1, as a function of T in finer
detail. Evenly spaced T (solid dark line) differs most from randomly spaced T
(dashed light line) near T = 30. Rates decrease significantly for both once T
reaches the mid-twenties.
Figure 3.25: Sensitivity rates for fregr1(t), T = 50, as a function of η in finer
detail. Evenly (solid dark line) and randomly (dashed light line) spaced large
T produce similarly decreasing rates except for 0.4 < η < 0.6, where randomly
spaced T yields smaller sensitivity rates.


Table 3.10: Observed sensitivity rates for fregr2(t), randomly spaced T and p = 0.1

η \ T    6     7     8     9     10    15    20    25    30    40    50
2.0      0     0     0     0     0     0     0     0     0.01  0     0
1.5      0     0     0     0     0     0     0     0     0     0     0
1.0      0     0     0     0     0     0.01  0     0     0.02  0.02  0
0.75     0     0     0     0.01  0     0.01  0     0.01  0.01  0.03  0.03
0.5      0     0     0     0     0     0.02  0     0.01  0.02  0.06  0.13
0.25     0     0     0     0     0.01  0.02  0.03  0.05  0.11  0.08  0.24
0.1      0     0     0     0     0     0     0.04  0.08  0.15  0.25  0.47
Figure 3.26: Surface plots (b and c) of observed sensitivity rates from simulations
generated from regression function (a) fregr2(t) = 0.3e^(-64(t-0.25)^2) +
0.7e^(-256(t-0.75)^2).
Simulations using regression function fregr2(t) = 0.3e^(-64(t-0.25)^2) + 0.7e^(-256(t-0.75)^2)
(see Tab. 3.10 and Fig. 3.26) yielded the poorest results of all four regression
functions. Results were most similar to those from simulations using the
function f = sin(6.0πt). Evenly spaced T yielded sensitivity rates near zero
even for large T and small η. The results from randomly spaced T are shown in
Tab. 3.10. For the best-case scenario (T = 50 and η = 0.1), randomly spaced
T produced a sensitivity rate of 47% (FDRobs = 0.11), while evenly spaced T


produced a sensitivity rate of only 11% (FDRobs = 0.21).
Regression function fregrz{t) = 10e~lot, (see Tab. 3.11 and Fig. 3.27),
yielded better results than fregri{t) but were still not as good as those produced
using fregrl(t).
Table 3.11: Observed sensitivity rates for fregr3(t), evenly spaced T and p = 0.1

η \ T    6   7   8   9       10  15      20      25      30      40      50
2.0      0   0   0   0       0   0       0       0       0       0       0.0200
1.5      0   0   0   0       0   0.0100  0.0300  0.0100  0.0100  0.0100  0.0100
1.0      0   0   0   0.0100  0   0       0.0100  0       0.0200  0       0.0200
0.75     0   0   0   0       0   0       0       0       0       0.0200  0.0200
0.5      0   0   0   0       0   0       0       0.0100  0.0200  0.0500  0.0600
0.25     0   0   0   0       0   0       0.0100  0.0400  0.0400  0.1500  0.4800
0.1      0   0   0   0       0   0       0       0.0600  0.0200  0.5900  0.9900
Figure 3.27: Surface plots (b and c) of observed sensitivity rates from simulations
generated from regression function (a) fregr3(t) = 10e^(-10t).
The smallest amount of added noise (η = 0.1) required 40 evenly spaced time
points to yield a 59% sensitivity rate (FDRobs = 0.12), and 40 randomly spaced
time points for 63% sensitivity (FDRobs = 0.12). If evenly spaced T = 30,
sensitivity decreased to 2% (FDRobs = 0.50). And if randomly spaced T = 30,


sensitivity decreased to 11% (FDRobs = 0.21).
Even for large T, it didn't take much added noise to adversely affect sensitivity
rates. For evenly spaced T = 50, sensitivity rates dropped from 99%
(FDRobs = 0.08) to 48% (FDRobs = 0.06) as η increased from 0.10 to 0.25. For
randomly spaced T = 50, sensitivity dropped from 75% (FDRobs = 0.06) to
35% (FDRobs = 0.12) as η increased from only 0.10 to 0.25.
Simulations using regression function fregr4(t) (see Tab. 3.12 and Fig. 3.28)
yielded the best observed sensitivity and false discovery rates of all four
regression-based generating functions.

Table 3.12: Observed sensitivity rates for fregr4(t), evenly spaced T and p = 0.1

η \ T    6   7       8   9   10      15      20      25      30      40      50
2.0      0   0.0100  0   0   0       0.0100  0.0100  0.0300  0       0       0.0400
1.5      0   0       0   0   0       0.0100  0.0300  0.0400  0.0100  0.0600  0.1400
1.0      0   0       0   0   0       0       0       0.0800  0.1000  0.3300  0.4000
0.75     0   0       0   0   0       0.0100  0.0800  0.1500  0.4100  0.5400  0.8100
0.5      0   0       0   0   0.0100  0.0400  0.3700  0.6400  0.9000  0.9600  0.9900
0.25     0   0       0   0   0.0400  0.2300  0.9800  0.9100  1.0000  1.0000  1.0000
0.1      0   0       0   0   0.0300  0.3800  1.0000  1.0000  1.0000  1.0000  1.0000

Figure 3.28: Surface plots (b and c) of observed sensitivity rates from simulations
generated from regression function (a) fregr4(t) = e^(k(t - 1/3)), where
k = 2 for t < 1/3 and k = 1 for t > 1/3.

Using the smallest level of added noise (η = 0.1), 20 evenly spaced time
points produced a 100% sensitivity rate (FDRobs = 0.07), and 20 randomly
spaced time points produced 79% sensitivity (FDRobs = 0.08). If evenly spaced
T = 15, however, sensitivity decreased to 38% (FDRobs = 0.24). And if randomly
spaced T = 15, sensitivity decreased to 32% (FDRobs = 0.06).
For large T, added noise did not affect sensitivity rates until η = 1.0. For
evenly spaced T = 50, sensitivity rates dropped from 81% (FDRobs = 0.08)
to 40% (FDRobs = 0.17) as η increased from 0.75 to 1.0. For randomly spaced
T = 50, sensitivity dropped from 67% (FDRobs = 0.14) to 28% (FDRobs = 0.03)
as η increased from 0.75 to 1.0.
We ran additional simulations using small η with T in finer detail, and large
T with η in finer detail, the results of which are presented in Figures 3.29 and
3.30. We can see from Fig. 3.29, which shows how T (with minimal added noise)
affects sensitivity, that evenly and randomly spaced T differ most between 22
and 16. Additionally, while rates for randomly spaced T decrease gradually for
T < 15, rates for evenly spaced T appear to favor an even number of time points,
falling when there are an odd number.
Fig. 3.30 shows how η (with large T) affects sensitivity; we see that the
biggest difference between evenly and randomly spaced T occurs between η = 0.7
and η = 1.0, which is also where the sensitivity rates drop the most. Otherwise,
they decrease similarly.


Figure 3.29: Sensitivity rates for fregr4(t), η = 0.1, as a function of T in finer
detail. Rates for randomly spaced T (dashed light line) decrease more gradually
than for evenly spaced T when the number of time points is less than 15, and
evenly spaced T appears to favor an even number of time points when T is small.
Figure 3.30: Sensitivity rates for fregr4(t), T = 50, as a function of η in finer
detail. Evenly (solid dark line) and randomly (dashed light line) spaced large
T produce similarly decreasing rates except for 0.7 < η < 1.0, where randomly
spaced T yields smaller sensitivity rates.


3.3 Results: Microarray Gene Expression Time Series
We applied our algorithm to the microarray gene expression time series data
obtained from the mammary tissue of Rattus norvegicus over the four-day estrous
cycle. We experimented with a number of different approaches. Each approach
was performed 10 times, so that we might see whether a gene was selected every
time, none of the time, or some of the time. Since the total number of time
points happened to already be a power of 2, we first used a basis size of 32
(2^5). We ran the data through the algorithm using basis sizes of 16, 8, and 4 as
well. This approach, with basis sizes of 8, 16, and 32, yielded the most promising
results; several genes were found to be significant in each of the ten rounds.
The other approaches involved using only subsets of the series in an attempt
to fashion a more uniform sampling rate. The vector of time points was uniform
within each of the four days, between 7AM and 7PM. But there were three large
gaps of time, between 7PM and 7AM, where no samples were taken (see Fig.
3.31). Keeping only the first and last time point of each day yielded 8 rather
uniform sample points across the entire time window.

Figure 3.31: A plot indicating the time points sampled. Each asterisk marks
the time (in hours) that a tissue sample was taken from a rat specimen across
the 4-day estrous cycle (Day 1, Proestrus; Day 2, Estrus; Day 3, Metestrus;
Day 4, Diestrus).

However, applying our
algorithm to this subset of the dataset (using a basis size of 8 or 16) yielded
poor results; only a few genes were tagged, and none of them were selected more
than 50% of the time.
Additionally, we applied the algorithm to each of the four days (estrous cycle
phases) separately, using a basis size of 8 and then 16. These four subsets also
consisted of 7 to 9 uniformly spaced time points, and also yielded poor results.
As we saw in the simulation testing, results for evenly spaced small T tended to
be poor, even with a minimal amount of added noise. The poor results here
support the indication that 8 sample points, although highly uniform, are not
enough to enable the detection of temporal patterns. We will therefore limit our
reporting to the results obtained when using the entire dataset.
Among the 21,440 genes examined, 40 genes were consistently tagged as
exhibiting significant temporal patterns across the four days of the estrous cycle.
Fifty genes were found to be significant 90% of the time, and another 53 were
tagged 80% of the time. Tab. 3.13 gives the frequency distribution of these
significance rates. Descriptions of the top forty are featured in Tab. 3.14.
Table 3.13: A frequency distribution of significance rates for 10 applications
of the algorithm using R = 5

Rate of Significance   1.0   0.9   0.8   0.7   0.6   0.5   0.4   0.3   0.2   0.1   0.0
Gene Count             40    50    53    53    70    90    118   157   219   492   19702
A search of the literature revealed additional information on some of the forty
genes consistently identified by our algorithm.
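The selection rates in Tab. 3.13 amount to tallying, for each gene, the fraction of the 10 applications in which it was tagged, and then binning genes by that fraction. A small sketch of this bookkeeping (our own code, not the thesis implementation):

```python
from collections import Counter

def significance_rates(runs):
    """Map each gene to the fraction of runs in which it was tagged.

    `runs` is a list of sets: the gene IDs tagged in each application."""
    counts = Counter(g for tagged in runs for g in tagged)
    return {g: c / len(runs) for g, c in counts.items()}

def rate_distribution(rates, n_runs, total_genes):
    """Frequency distribution of significance rates, as in Tab. 3.13.

    Genes never tagged in any run fall in the 0.0 bin."""
    dist = Counter(round(r * n_runs) / n_runs for r in rates.values())
    dist[0.0] = total_genes - len(rates)
    return dict(dist)
```

Genes with a rate of 1.0 correspond to the "consistently tagged" set reported above.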


The aortic preferentially expressed gene 1 (APeg1; Ref# 1 in Tab. 3.14),
which is expressed in arterial smooth muscle cells, was found by our algorithm
to exhibit a temporal pattern across the estrous cycle. Research has shown that
cycles of aortic oxygen absorption are correlated with phases of the estrous cycle
in female rats [23].
Figure 3.32: The Haar wavelet reconstruction of the normalized expression
signal for APeg1. An increase in expression appears in the estrus and metestrus
phases.
The Haar wavelet reconstruction of the expression signal (a linear combination
of the wavelet coefficients and basis functions) is shown in Fig. 3.32.
Expression appears to increase in the estrus and metestrus phases. Honda et
al. found that relaxation of the aorta was greater for rats in the estrus phase
than for those in the proestrus, metestrus or diestrus phases [15]. Relaxation
of the aorta allows blood to flow more smoothly throughout the body, thereby
facilitating oxygen absorption. No published works were found involving APeg1


expression in mammary tissue.
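The reconstruction shown in Fig. 3.32 is the inverse Haar transform: a linear combination of the wavelet coefficients with the Haar basis functions. Under the orthonormal averaging/differencing convention, the inverse step can be sketched as follows (our implementation, not the thesis code):

```python
import numpy as np

def inverse_haar(w):
    """Reconstruct a length-2^m signal from orthonormal Haar coefficients
    ordered [approximation, coarsest details, ..., finest details]."""
    w = np.asarray(w, dtype=float)
    approx = w[:1]
    pos = 1
    while pos < len(w):
        det = w[pos:pos + len(approx)]           # details at the current scale
        x = np.empty(2 * len(approx))
        x[0::2] = (approx + det) / np.sqrt(2)    # undo one averaging/differencing step
        x[1::2] = (approx - det) / np.sqrt(2)
        approx = x
        pos += len(det)
    return approx
```

Truncating the finest-scale detail coefficients before inverting yields the smoothed, piecewise-constant reconstructions seen in Figures 3.32 through 3.36.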
The gene associated with mammary cancer in rats (RMT-1; Ref# 7 in Tab.
3.14) was found by our algorithm to be temporally significant within the estrous
cycle. Researchers at UC Berkeley were the first to identify this gene and its
association with rat mammary tumors [10]. Their research has shown RMT-1
to be highly expressed in the normal breast cells of virgin rats, and even more
so in rat mammary cancers. No previously published works were found linking
this gene to the estrous cycle, however. The Haar wavelet reconstruction of this
gene's expression signal is shown in Fig. 3.33. Expression levels are high in the
first two phases, and are low in the last two.
Figure 3.33: The Haar wavelet reconstruction of the normalized expression
signal for RMT-1. Expression values are high in the first half of the estrous
cycle, and low in the second half.
Embryo receptivity, embryotrophy, and the immune system are all
partially regulated by cytokines, which are a group of proteins and peptides


used in cell signaling. Interleukins make up a subset of cytokines. The mRNA
of the inflammatory cytokines interleukin-4 (IL-4) and interleukin-6 (IL-6)
was found to be expressed in the corpus luteum of the porcine (pig) [29]. The
corpus luteum is a temporary progesterone-producing structure present in the
ovary only during the metestrus and diestrus phases of the estrous cycle. We
would expect that genes highly expressed in the corpus luteum would therefore
show higher levels of activity during these phases.
Figure 3.34: The Haar wavelet reconstruction of the normalized expression
signal for IL-17B. Expression is low in the first half of the cycle, and high during
the last two phases.
The cytokine interleukin-17B (IL-17B; Ref# 18 in Tab. 3.14) was found
to have significant temporal expression patterns by our algorithm. IL-17B is a
member of the IL-17 subfamily of interleukins, which is most notably involved in
governing inflammatory responses and the production of many other cytokines
(such as IL-6). The Haar wavelet reconstruction of the expression activity of
IL-17B is shown in Fig. 3.34. Expression levels are low in the first two phases,
and are high during the metestrus and diestrus phases, coinciding with the
presence of the corpus luteum.
The angiogenesis-inhibiting Thrombospondin proteins (TSP) are expressed
in the normal and hyperplastic human breast, and have been of interest in breast
cancer research due to their anti-tumorigenic functions. Thrombospondins are
a family of multi-functional proteins and can be divided into two subfamilies:
A, containing TSP-1 and -2, and B, containing TSP-3, -4 and -5. Our algorithm
consistently identified as temporally significant the gene responsible for
Thrombospondin-2 (TSP-2; Ref# 26 in Tab. 3.14). In reference to estrous,
TSP was found to be expressed in the preluteal follicles (present in the proestrus
phase) and the mature corpus luteum (present in the metestrus phase) of the
rat [4].

Figure 3.35: The Haar wavelet reconstruction of the expression signal for
TSP-2. Expression is higher in the first half, and lower in the second half of the
estrous cycle.

The Haar wavelet reconstruction of the expression activity of TSP-2 is
shown in Fig. 3.35. Expression levels are higher in the proestrus and estrus
phases, and are lower during the second half of the cycle.
Laminin (LN) is a component of thin connective membranes (such as ovarian
follicles) and is involved in such functions as cell growth, differentiation and
migration. Research has suggested that LN plays a significant role in follicular
development [21, 2], which occurs during the proestrus and estrus phases of the
estrous cycle. The reconstructed expression signal of Laminin alpha-5 (LAMA5;
Ref# 30 in Tab. 3.14) is shown in Fig. 3.36. Except for a few neighboring
samples in the estrus phase resulting in high levels of mRNA concentration, the
expression levels for LAMA5 appear to moderately decrease from the beginning
to the end of the cycle.
Figure 3.36: The Haar wavelet reconstruction of the expression signal for
LAMA5. Expression moderately decreases from the beginning to the end of the
cycle, with some high level readings occurring in the estrus phase.


In reference to its link to mammary tissue, Laminin alpha-5 (LAMA5; Ref# 30
in Tab. 3.14) was found to be differentially expressed in metastatic tumor cells
[32].
Table 3.14: Genes found to have temporal significance 100% of the time
Ref# Gene ID Name Description
1 1368001-at APegl Aortic preferentially expressed gene 1
2 1368470-at Gamma-glutamyl hydrolase
3 1368567_at MIPP65 protein
4 1369132-at SLC18A2 Solute carrier family 18, member 2
5 1369669-at Neurolysin (metallopeptidase M3 family)
6 1370317-at KCR1 Potassium channel regulator 1
7 1370646-at RMT-1 Mammary cancer associated protein RMT-1
8 1370805-at Melanocyte-specific gene 1 protein
9 1371830_at Similar to ubiquitin-like 1 (sentrin) activating enzyme subunit 1; ubiquitin-like (sen- trin) activating enzyme E1A; SUMO-1 activating enzyme subunit 1; DNA segment, Chr 7, ERATO Doi 177, expressed
10 1372126-at Transcribed locus
11 1372281-at Similar to hypothetical protein MGC28394
12 1373056-at Similar to RIKEN cDNA 1110014L17
13 1374265_at UI-R-CMO-bjj-d-03-0-UI.sl UI-R-CMO Rattus norvegicus cDNA clone UI-R-CMO- bjj-d-03-0-UI 3, mRNA sequence.
14 1374594-at Similar to RIKEN cDNA 1600029D21
15 1375091-at Similar to KIAA1052 protein
16 1375469-at SWI/SNF related, matrix associated, actin dependent regulator of chromatin, sub- family a, member 4
17 1375637-at Similar to RIKEN cDNA 1110003E01
18 1376445_at IL-17B Interleukin 17B
19 1377460_at Transcribed locus
20 1377982_at Similar to hypothetical protein D930042H13
21 1381130-at UI-R-BTl-bmy-a-11-O-UI.sl UI-R-BT1 Rattus norvegicus cDNA clone UI-R-BT1- bmy-a-ll-O-UI 3, mRNA sequence
22 1382187-at Similar to RIKEN cDNA 2610029K21
23 1383380_at Similar to Ribonuclease P protein subunit p38 (RNaseP protein p38)
24 1383424-at Similar to thymidylate kinase family LPS-inducible member; thymidylate kinase homologue
25 1384836_at UI-R-C0-jp-b-06-0-UI.sl UI-R-CO Rattus norvegicus cDNA clone UI-R-C0-jp-b-06- 0-UI 3, mRNA sequence.
Table 3.14: (Cont.)
Ref# Gene ID Name Description
26 1385751_at TSP-2 Thrombospondin 2
27 1385784_at Similar to calcium-activated chloride channel CLCA4
28 1387664_at ATPase, H+ transporting, lysosomal (vacuolar proton pump), beta 56/58 kDa, isoform 2
29 1387808_at SLC7A7 Solute carrier family 7 (cationic amino acid transporter, y+ system), member 7
30 1388932_at LAMA5 Laminin, alpha 5
31 1389173_at Similar to SON protein
32 1389411_at Transcribed locus
33 1389439_at Transcribed locus
34 1389612_at Transcribed locus, highly similar to XPJ542369.1 similar to RIKEN cDNA 2400002D02 (Rattus norvegicus)
35 1391520_at Transcribed locus
36 1393582_at Transcribed locus
37 1394577_at Transcribed locus
38 1395887_at Similar to RIKEN cDNA 1600029D21
39 1397729_x_at Similar to RIKEN cDNA 1600029D21
40 1398661_at UI-R-CVO-brr-e-10-0-UI.sl UI-R-CVO Rattus norvegicus cDNA clone UI-R-CVO-brr-e-10-0-UI 3, mRNA sequence.
4. SUMMARY AND CONCLUSIONS
There were several questions we hoped to be able to answer based on the
results of our simulations. In this chapter, we will summarize the results from
the previous chapter and address the questions we posed about how the size,
rate, and response variation of a sample affect the results of our methodology.
4.1 Sample Size
One of the primary questions we hoped to answer from our simulations was
how many sample points are needed to determine the presence of a temporal
pattern. In general, the less noise present in the sample response, the smaller
the sample size could be, but only to a point. No matter how small the amount
of noise, any series with fewer than 8 sample points did not yield any discoveries,
true or false.
When noise was only a small factor, we found that the answer to this ques-
tion depended greatly upon the characteristics of the underlying signal. Our
simulation results indicated that lower frequencies fared better, even with sam-
ple sizes as small as 20. Such underlying signals, including the fourth
regression function, sinusoidal functions with frequencies near 0 or 2, and the
Haar wavelet step function, consistently yielded sensitivity rates better than
90% and observed false discovery rates smaller than 0.1.
The underlying function whose sample size was least susceptible to a large
degree of noise was the Haar step function; when the standard deviation of
added Gaussian noise equalled the maximum amplitude of the signal (η = 1.0),
the sensitivity rate for evenly spaced T = 25 was 76% and for randomly spaced
T = 25 was 60%, both with observed false discovery rates at or below 0.1.
4.2 Variation of Sample Response
Another primary question we wanted to address was what degree of noise,
or sample response variation, could be tolerated in identifying the presence of a
temporal pattern. In general, the more sample points, the more noise could
be tolerated, but again only to a point. Even for a sample size as large as 50,
any series whose added noise had a standard deviation as much as double the
maximum amplitude of the underlying signal yielded sensitivity rates of less
than 5%.
When sample size was large, we found that the answer to this question
appeared to be mostly independent of the underlying signal. Our simulation
results indicated that for nearly all underlying functions simulated, it was when
η was between 0.5 and 1.0 that sensitivity rates began to decrease severely.
The underlying function which best tolerated a high level of variation in the
sample response was again the Haar step function. When the standard deviation
of added Gaussian noise doubled the maximum amplitude of the signal (77 = 2.0),
the sensitivity rate for evenly spaced T 50 was 46%, for randomly spaced T
31%, both with observed false discovery rates less than 0.1.
4.3 Sampling Rate
Another question we hoped to answer was how irregular sampling affected
the methodology. For larger T (usually T > 15), evenly spaced time points
produced moderately better results than randomly spaced samples. When T
was small, randomly spaced T sometimes yielded better results than evenly
spaced.
Notably, for a large number of sample points, results did not differ
greatly between randomly and evenly spaced time points as the level of added
noise changed. This can be seen in the Results chapter figures (for example, see
Figures 3.22 and 3.30), which plot sensitivity rates as a function of η.
4.4 Frequency of the Underlying Temporal Response
Since wavelets do not detect stationary frequencies, it was not surprising
that the higher the sine frequency, the poorer the results. But why then were
results from frq = 3.8 better than those from frq = 2.5? While results for
frq = 5.2 were better than those for frq = 6 (consistent with the lower frequency),
results for frq = 8 were actually better than those for frq = 5.2, despite the
higher frequency.
These jumps in improvement when the frequency was at or near a power of
two were a result of the underlying function mimicking the scales of the step
function. Fig. 2.1 shows the Haar wavelet scales k = 2,...,8. Just as the
algorithm performed well on the basic Haar step function (Fig. 2.1(a)), so it did
for the curvier version, the sine wave with frequency 2. Similarly, the sine wave
with frequency 4 is aligned with the scaled Haar step functions in Fig. 2.1(b
and c), and the sine wave with frequency 8 is aligned with the scaled Haar step
functions in Fig. 2.1(d-g).
The pattern of decreasing sensitivity rates as frequency increased also held
true within the power-of-two frequencies (e.g., frq = 8 was not as successful as
frq = 4, nor frq = 4 as successful as frq = 2).
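The alignment between power-of-two sine frequencies and the Haar scales can be checked numerically. The sketch below is an illustration, not the code used in this thesis: it performs a full orthonormal Haar decomposition of a sampled sine wave and reports which detail level carries the most energy. The grid size N = 64 and the level indexing (level 1 = finest, i.e., pairwise differences) are assumptions made only for this example.

```python
import numpy as np

def haar_detail_energies(x):
    """Full orthonormal Haar decomposition of a length-2^J signal.
    Returns the detail-coefficient energy at each level (level 1 = finest)."""
    approx = np.asarray(x, dtype=float)
    energies = []
    while len(approx) > 1:
        detail = (approx[0::2] - approx[1::2]) / np.sqrt(2.0)
        approx = (approx[0::2] + approx[1::2]) / np.sqrt(2.0)
        energies.append(float(np.sum(detail ** 2)))
    return energies

def dominant_level(frq, n=64):
    """Detail level holding the most energy for sin(2*pi*frq*t) on [0, 1)."""
    t = np.arange(n) / n
    return 1 + int(np.argmax(haar_detail_energies(np.sin(2 * np.pi * frq * t))))

# Doubling the sine frequency shifts the dominant energy to a finer Haar scale,
# mirroring the alignment with the scaled step functions described above.
print(dominant_level(2), dominant_level(4), dominant_level(8))
```

With these conventions, the frequency-2 sine concentrates its energy at a coarser level than the frequency-4 sine, which in turn is coarser than the frequency-8 sine: exactly the alignment effect described above.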
4.5 Which Wavelet Coefficient Performs Best?
Of the 7 wavelet coefficients assessed (C2, ..., C8), did any stand out as being
more or less often the coefficient responsible for producing a significantly small
p-value? Recall that it is the smallest of the seven coefficient-generated p-values
that is ultimately associated with each gene; the FDR procedure then determines
which of these minimum p-values is significantly small. We ran the Haar step
function-based simulation and recorded which Ck was responsible for each
positive. Tab. 4.1 shows the results (averaged over 10 runs) for each Ck using
the optimal parameters T = 50 and η = 0.1. Clearly, for very low frequencies like
Table 4.1: Percentage Distribution of Positive-Responsible Ck (when T = 50
and η = 0.1)
Category Total Count Percent of Positives Resulting from Ck
k=2 k=3 k=4 k=5 k=6 k=7 k=8
False Positives (T uniform) 11 18.18% 18.18% 18.18% 9.09% 18.18% 9.09% 9.09%
False Positives (T random) 15 20.00% 6.67% 26.67% 6.67% 0% 13.33% 26.67%
True Positives (T uniform) 100 100% 0% 0% 0% 0% 0% 0%
True Positives (T random) 100 100% 0% 0% 0% 0% 0% 0%
the Haar step function, if there are a relatively large number of time points with
a relatively small degree of noise, C2 is consistently the coefficient responsible
for correct discoveries.
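The minimum-p-value and FDR steps just described can be sketched in code. This is an illustrative reimplementation of the Benjamini-Hochberg step-up procedure [6] applied to per-coefficient p-values, not the thesis code; the example p-value matrix and the level q = 0.05 are invented for demonstration.

```python
import numpy as np

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: boolean mask of the
    p-values declared significant at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))  # largest rank meeting its threshold
        reject[order[:k + 1]] = True
    return reject

# One p-value per wavelet coefficient C2..C8 for each gene; the smallest of
# the seven is the p-value ultimately associated with the gene, and the FDR
# procedure is then run on those minima.
coeff_pvals = np.array([
    [0.001, 0.20, 0.50, 0.40, 0.90, 0.70, 0.30],   # gene 0
    [0.400, 0.60, 0.80, 0.50, 0.70, 0.90, 0.45],   # gene 1
    [0.300, 0.002, 0.60, 0.20, 0.80, 0.50, 0.40],  # gene 2
])
min_p = coeff_pvals.min(axis=1)
responsible = 2 + coeff_pvals.argmin(axis=1)  # index k of the responsible C_k
significant = bh_reject(min_p, q=0.05)
print(list(responsible), list(significant))
```

In this toy example genes 0 and 2 survive the FDR step, with C2 and C3 as their positive-responsible coefficients.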
5. FUTURE WORK
The incorporation of positive-responsible Ck considerations, and whether
they are of use in detecting temporal significance in microarray time series,
could be a fruitful avenue of exploration. For example, is C2 usually positive-
responsible even for high-frequency signals? What about when there are a
smaller number of time points? Is it possible that there are scenarios in which
a certain coefficient is often incorrectly positive-responsible? A more complete
understanding of the sensitivities of particular coefficients may be useful in the
detection of time series expression signals having specific characteristics, such
as low or high frequency.
Our use of the Haar wavelet was a simplified approach. Future research
might incorporate the use of continuous wavelets, such as the Morlet, Meyer, or
Mexican hat wavelets, in the analysis of microarray gene expression time series.
Others who implement a wavelet transform in the analysis of microarray
time series may find it useful to investigate other methods of standardizing
the data. Our approach to normalization involved taking the log and centering
each signal about zero. Dividing by the 1-norm of the signal vector may
have provided better results for certain types of signals, such as those simulated
using the second regression function. Making the length of the data vectors
equal allows small and large scale activity to exist within the same scale.
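The two standardizations discussed can be sketched as follows. This is a minimal illustration, not the thesis implementation; the function names, the use of the natural log, and the sample values are assumptions.

```python
import numpy as np

def log_center(signal):
    """Normalization used in this work: log-transform, then center about zero."""
    y = np.log(np.asarray(signal, dtype=float))
    return y - y.mean()

def log_center_unit_norm(signal):
    """Variant suggested above: additionally divide by the 1-norm, so that
    signals of different overall magnitude occupy the same scale."""
    y = log_center(signal)
    norm = np.abs(y).sum()
    return y / norm if norm > 0 else y

x = [120.0, 95.0, 300.0, 80.0]  # one gene's raw expression signal (invented)
print(log_center(x).mean())                   # ~0: centered about zero
print(np.abs(log_center_unit_norm(x)).sum())  # ~1: unit 1-norm
```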
In order to best test such an algorithm as presented in this paper, it is
important to be able to replicate in simulations the nuances of actual microarray
time series. It would be informative therefore to experiment with different types
of added noise. For example, incorporate noise that is linked to groups of signals
and varies with the time points. This could serve to simulate clusters of genes
which behave similarly (e.g., always upregulated, downregulated, or neither) from
one organism (time point) to another.
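Such group-linked noise could be sketched as below. This is a hypothetical simulation; the cluster sizes, T = 50 time points, and the two noise scales are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, genes_per_group, n_groups = 50, 5, 3

signals = []
for g in range(n_groups):
    # One noise realization shared by every gene in the group: it varies
    # across time points (organisms) but is common within the cluster.
    group_noise = rng.normal(0.0, 1.0, size=T)
    for _ in range(genes_per_group):
        individual_noise = rng.normal(0.0, 0.1, size=T)
        signals.append(group_noise + individual_noise)
signals = np.array(signals)

# Genes in the same cluster move up and down together across time points,
# while genes from different clusters are essentially uncorrelated.
within = np.corrcoef(signals[0], signals[1])[0, 1]
between = np.corrcoef(signals[0], signals[5])[0, 1]
print(within, between)
```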
It could be highly informative to apply this approach to microarray gene
expression time series in which temporally significant genes are already known.
An integrated analysis of the extensive and well-studied microarray time se-
ries (by Cho and Spellman, 1998) on gene expression in yeast (Saccharomyces
cerevisiae) could help to confirm the success of any novel approach.
And finally, further investigation into genes found temporally significant by
our algorithm could reveal underlying genetic functions within the estrous cycle.
Along with many other works, our results support the importance of maxi-
mizing the number of time points sampled when performing a microarray time
series experiment. Even when faced with a relatively large amount of sample
variation response, a larger number of time points, whether uniformly taken or
not, was usually able to preserve an existing temporal pattern.
REFERENCES
[1] J. Aach and G.M. Church. Aligning gene expression time series with time
warping algorithms. Bioinformatics, 17(6):495-508, 2001.
[2] G. Akkoyunlu, R. Demir, and I. Ustiinel. Distribution patterns of tgf-alpha,
laminin and fibronectin and their relationship with folliculogenesis in rat
ovary. Acta Histochemica, 105(4):295-301, 2003.
[3] O. Alter, P.O. Brown, and D. Botstein. Singular value decomposi-
tion for genome-wide expression data processing and modelling. PNAS,
97(18):10101-6, 2000.
[4] P. Bagavandoss, E.H. Sage, and R.B. Vernon. Secreted protein, acidic and
rich in cysteine (SPARC) and thrombospondin in the developing follicle and
corpus luteum of the rat. Journal of Histochemistry and Cytochemistry,
46(9):1043-9, 1998.
[5] Z. Bar-Joseph, G.K. Gerber, D.K. Gifford, T.S. Jaakkola, and I. Simon.
Continuous representations of time-series gene expression data. Journal of
Computational Biology, 10(3-4):341-56, 2003.
[6] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a
practical and powerful approach to multiple testing. Journal of the Royal
Statistical Society, Series B, 57:289-300, 1995.
[7] Albert Boggess and Francis J. Narcowich. A First Course in Wavelets with
Fourier Analysis. Prentice Hall, 2001.
[8] M.P.S. Brown, W.N. Grundy, D. Lin, N. Cristianini, C.W. Sugnet, T.S.
Furey, M. Ares Jr., and D. Haussler. Knowledge-based analysis of mi-
croarray gene expression data by using support vector machines. PNAS,
97(1):262-7, 2000.
[9] E. Bullmore, C. Long, J. Suckling, J. Fadili, G. Calvert, F. Zelaya, T.A.
Carpenter, and M. Brammer. Colored noise and computational inference
in neurophysiological (fmri) time series analysis: Resampling methods in
time and wavelet domains. Human Brain Mapping, 12(2):61-78, 2001.
[10] S. Chiou, J. Yoo, K.C. Loh, R.C. Guzman, G.R. Gopinath, L. Rajkumar,
Y.C. Chou, J. Yang, N.C. Popescu, and S. Nandi. Identification of rat
mammary tumor-1 gene (rmt-1), which is highly expressed in rat mammary
tumors. Cancer Letters, 174(1):45-55, 2001.
[11] M.J.L. de Hoon, S.I. Imoto, and S. Miyano. Statistical analysis of a small
set of time-ordered gene expression data using linear splines. Bioinformat-
ics, 18(11):1477-86, 2002.
[12] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein. Cluster analysis
and display of genome-wide expression patterns. PNAS, 95:14863-8, 1998.
[13] J. Ernst, G.J. Nau, and Z. Bar-Joseph. Clustering short time series gene
expression data. Bioinformatics, 21(Suppl 1):i159-i168, 2005.
[14] N.S. Holter, M. Mitra, A. Maritan, M. Cieplak, J.R. Banavar, and N.V.
Fedoroff. Fundamental patterns underlying gene expression profiles: Sim-
plicity from complexity. PNAS, 97(15):8409-14, 2000.
[15] H. Honda, H. Ishihara, M. Takei, and H. Kogo. Acetylcholine-induced
relaxation of rat aorta is greatest during estrus in the sexual cycle. Japanese
Journal of Pharmacology, 74(l):113-5, 1997.
[16] P.C. Ivanov, M.G. Rosenblum, C.K. Peng, J. Mietus, S. Havlin, H.E. Stan-
ley, and A.L. Goldberger. Scaling behaviour of heartbeat intervals obtained
by wavelet-based time-series analysis. Nature, 383:323-7, 1996.
[17] R.R. Klevecz. Dynamic architecture of the yeast cell cycle uncovered by
wavelet decomposition of expression microarray data. Functional and In-
tegrative Genomics, 1:186-192, 2000.
[18] R.R. Klevecz and D.B. Murray. Genome wide oscillations in expression:
wavelet analysis of time series data from yeast expression arrays uncovers
the dynamic architecture of phenotype. Molecular Biology Reports, 28:73-
82, 2001.
[19] M. Latka, Z. Was, A. Kozik, and B.J. West. Wavelet analysis of epileptic
spikes. Physical Review E, 67(5):2904-8, 2003.
[20] K.M. Lau and H. Weng. Climate signal detection using wavelet transform:
How to make a time series sing. Bulletin of the American Meteorological
Society, 76(12):2391-402, 1995.
[21] F. Le Bellego, S. Fabre, C. Pisselet, and D. Monniaux. Cytoskeleton reor-
ganization mediates alpha6betal integrin-associated actions of laminin on
proliferation and survival, but not on steroidogenesis of ovine granulosa
cells. Reproductive Biology and Endocrinology, pages 3-19, 2005.
[22] X. Leng and H.G. Muller. Classification using functional data analysis for
temporal gene expression data. Bioinformatics, 22(1):68-76, 2006.
[23] M.R. Malinow, J.A. Moguilevsky, and L. Gerschenson. Cyclic changes in
the oxygen consumption of the aorta in female rats. Circulation Research,
14(4):364-6, 1964.
[24] S. Mallat and W.L. Hwang. Singularity detection and processing with
wavelets. Information Theory, IEEE Transactions on, 38(2):617-643, 1992.
[25] Allan D.R. McQuarrie and Chih-Ling Tsai. Regression and Time Series
Model Selection. World Scientific Publishing Co. Pte. Ltd., 1998.
[26] C.S. Moller-Levet, F. Klawonn, K.H. Cho, H. Yin, and O. Wolkenhauer.
Clustering of unevenly sampled gene expression time-series data. Fuzzy
Sets and Systems, 152:49-66, 2005.
[27] C.S. Moller-Levet and H. Yin. Circular som for temporal characterisation
of modelled gene expressions. IDEAL, 3578:319-26, 2005.
[28] H. Ressom, D.L. Wang, and P. Natarajan. Clustering gene expression data
using adaptive double self-organizing map. Physiological Genomics, 14:35-
46, 2003.
[29] R. Sakumoto, T. Komatsu, E. Kasuya, T. Saito, and K. Okuda. Expression
of mRNAs for interleukin-4 and interleukin-6 and their receptors in porcine
corpus luteum during the estrous cycle. Domestic Animal Endocrinology,
31(3):246-57, 2005.
[30] P.T. Spellman, G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B. Eisen,
P.O. Brown, D.B. Botstein, and B. Futcher. Comprehensive identification
of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by mi-
croarray hybridization. Molecular Biology of the Cell, 9:3273-97, 1998.
[31] B.E. Usevitch. A tutorial on modern lossy wavelet image compression:
foundations of jpeg 2000. Signal Processing Magazine, IEEE, 18(5):22-35,
2001.
[32] W. Wang, J.B. Wyckoff, V.C. Frohlich, Y. Oleynikov, S. Hiittelmaier,
J. Zavadil, L. Cermak, E.P. Bottinger, R.H. Singer, J.G. White, J.E. Segall,
and J.S. Condeelis. Single cell behavior in metastatic primary mammary
tumors correlated with gene expression patterns revealed by molecular pro-
filing. Cancer Research, 62:6278-88, 2002.
[33] Z. Xiong, K. Ramchandran, and M.T. Orchard. Space-frequency quantiza-
tion for wavelet image coding. Image Processing, IEEE Transactions on,
6(5):677-93, 1997.
[34] X.L. Xu, J.M. Olson, and L.P. Zhao. A regression-based method to iden-
tify differentially expressed genes in microarray time course studies and its
application in an inducible Huntington's disease transgenic model. Human
Molecular Genetics, 11(17):1977-85, 2002.
[35] F. Yao, H.G. Muller, and J.L. Wang. Functional data analysis for
sparse longitudinal data. Journal of the American Statistical Association,
100:577-90, 2005.
[36] B.T. Zhang. Self-organizing latent lattice models for temporal gene expres-
sion profiling. Machine Learning, 52:67-89, 2003.
[37] L.P. Zhao, R. Prentice, and L. Breeden. Statistical modeling of large
microarray data sets to identify stimulus-response profiles. PNAS,
98(10):14863-8, 1998.