Citation |

- Permanent Link:
- http://digital.auraria.edu/AA00003063/00001
## Material Information- Title:
- Analyzing microarray time series data using a Haar wavelet transformation
- Creator:
- Braudrick, Sarah D
- Place of Publication:
- Denver, Colo.
- Publisher:
- University of Colorado Denver
- Publication Date:
- 2007
- Language:
- English
- Physical Description:
- xvii, 78 leaves : illustrations ; 28 cm
## Thesis/Dissertation Information- Degree:
- Master's ( Master of Science)
- Degree Grantor:
- University of Colorado Denver
- Degree Divisions:
- Department of Mathematical and Statistical Sciences, CU Denver
- Degree Disciplines:
- Mathematical Sciences
- Committee Chair:
- Billups, Stephen C.
- Committee Members:
- Bjork, Kathe E.
Sain, Stephan R.
## Subjects- Subjects / Keywords:
- Haar system (Mathematics) ( lcsh )
Wavelets (Mathematics) ( lcsh ) Transformations (Mathematics) ( lcsh ) Haar system (Mathematics) ( fast ) Transformations (Mathematics) ( fast ) Wavelets (Mathematics) ( fast ) - Genre:
- bibliography ( marcgt )
theses ( marcgt ) non-fiction ( marcgt )
## Notes- Bibliography:
- Includes bibliographical references (leaves 75-78).
- General Note:
- Department of Mathematical and Statistical Sciences
- Statement of Responsibility:
- by Sarah D. Braudrick.
## Record Information- Source Institution:
- |University of Colorado Denver
- Holding Location:
- Auraria Library
- Rights Management:
- All applicable rights reserved by the source institution and holding location.
- Resource Identifier:
- 166385433 ( OCLC )
ocn166385433 - Classification:
- LD1193.L62 2007m B72 ( lcc )
## Auraria Membership |

Full Text |

ANALYZING MICROARRAY TIME SERIES DATA USING A HAAR
WAVELET TRANSFORMATION by Sarah D. Braudrick B.A. Mathematics, B.A. Sociology, University of Northern Colorado, 1998 A thesis submitted to the University of Colorado at Denver and Health Sciences Center in partial fulfillment of the requirements for the degree of Master of Science Mathematical Sciences 2007 This thesis for the Master of Science degree by Sarah D. Braudrick has been approved by Stephan R. Sain Braudrick, Sarah D. (M.S., Applied Mathematics) Analyzing Microarray Time Series Data Using a Haar Wavelet Transformation Thesis directed by Associate Professor and Graduate Program Director Stephen C. Billups ABSTRACT This thesis presents an algorithm for identifying significant temporal pat- terns in microarray time series. Using a false discovery rate procedure to assess the magnitude of wavelet coefficients resulting from a Haar wavelet transforma- tion, the algorithm detects gene expression signals exhibiting time-dependent features. Numerous simulated time series, each consisting of a.small propor- tion of temporally dependent functions among random signals, were generated by which to test the algorithm. The simulation encompassed the following changeable scenarios: number and distribution of time points (evenly-spaced or randomly-spaced), noise-to-signal ratio, and level of frequency. Testing the algorithm using these parameters, we were able to answer questions regarding the effect of sample quantity, sampling rate, and sample quality (degree of noise present in the expression signal readings) on the algorithms performance in de- tecting temporal patterns in a microarray time series. Finally, the algorithm is applied to a genome-wide microarray time series collected from Rattus norvegi- cus mammary tissue over the course of the 4-day estrous cycle. This abstract accurately represents the content of the candidates thesis. I DEDICATION This thesis is dedicated to John, for the support and the space and knowing when to give which, and to my parents, who never ran out of encouragement. ACKNOWLEDGMENT Id like to thank Steve Billups for giving me a thesis topic which was as inter- esting as it was challenging, and for always giving thorough feedback. And also thanks to Holly Wolf, for her limitless understanding when it came to work and school conflicts. CONTENTS Figures ............................................................ x Tables............................................................ xvi Chapter 1. INTRODUCTION .................................................... 1 1.1 Time Series Gene Expression Data Analysis..................... 1 1.2 Related Work.................................................. 1 1.2.1 Early attempts time unordered .............................. 2 1.2.2 Recent attempts time ordered................................ 2 1.2.3 Wavelets in signal processing ................................ 3 1.3 Our Approach.................................................. 3 1.3.1 Motivation ................................................... 4 1.4 Challenges.................................................... 5 1.4.1 Sampling...................................................... 6 1.4.2 Homogeneity................................................... 6 1.4.2.1 Within samples.............................................. 7 1.4.2.2 Between samples............................................. 7 1.4.3 Replicates.................................................... 8 1.5 Conceptual Framework.......................................... 8 1.5.1 A model of the data........................................... 9 1.6 Implementation............................................... 10 vii 2. METHODS 11 2.1 Overview of the Method.......................................... 11 2.2 Normalization...................................................... 12 2.3 Wavelets........................................................... 13 2.3.1 The Haar wavelet................................................... 14 2.3.2 Defining a finite basis using Haar wavelet functions............ 15 2.3.3 Calculating the wavelet coefficients............................... 17 2.3.4 Haar wavelet coefficients.......................................... 18 2.4 Identification of Significant Signals.............................. 18 2.4.1 Bootstrapping procedure............................................ 19 2.4.2 FDR analysis....................................................... 20 2.5 Algorithm Summary and Implementation............................... 21 3. TESTING AND RESULTS .................................................. 25 3.1 Generating Simulation Time Series.................................. 25 3.1.1 Haar wavelet signals............................................... 28 3.1.2 Sinusoidal signals................................................. 29 3.1.3 Regression signals ................................................ 30 3.2 Results: Simulation Testing.......................................... 31 3.2.1 Performance using Haar-based / .................................... 31 3.2.2 Performance using sinusoidal-based /............................... 37 3.2.3 Performance using regression-based /............................... 52 3.3 Results: Microarray Gene Expression Time Series................. 60 4. SUMMARY AND CONCLUSIONS............................................... 69 4.1 Sample Size........................................................ 69 vm 4.2 Variation of Sample Response................................. 70 4.3 Sampling Rate ............................................... 70 4.4 Frequency of the Underlying Temporal Response ............. 71 4.5 Which Wavelet Coefficient Performs Best?..................... 72 5. FUTURE WORK.................................................... 73 References........................................................ 75 IX LIST OF FIGURES Figure 2.1 The eighth-order Haar wavelet basis functions on the interval [0,4]. Here, n = 8 because D = 2........................................... 16 3.1 The Haar Wavelet step function, ip.................................. 28 3.2 Sinusoidal functions of various frequencies on the time interval [0,1]. 29 3.3 Regression functions 1 through 4, on the interval [0,1]............. 30 3.4 Surface plots of observed sensitivity rates. (These are 3-dimensional renderings of Tables 3.2 and 3.3, viewed from their upper left hand corners towards their lower right.).............................. 32 3.5 Sensitivity rates for 16 > T > 9 with 77 = 0.1...................... 33 3.6 Haar wavelet-generated (normalized) signals for T = 40 where T is evenly spaced and p = 0.1. This illustrates how, even with a large number of uniform samples, true signals (black) can quickly become nearly indistinguishable from pure noise (grey) when 77 increases from 1.5 to 2.0.......................................................... 34 3.7 Surface plots for the observed FDRs from uniform and randomly spaced T, p = 0.1................................................... 35 3.8 Surface plots of observed sensitivity when the proportion of true signals (p) is 50% (a and b) and 90% (c and d). No significant differences were seen between small and large p.................. 36 x 3.9 Mean sensitivity rate across all evenly spaced T (with rj = 0.1) as a function of frequency rate............................................. 37 3.10 Surface plots of the resulting sensitivity from low frequency (near 0 and 1) sinusoidal-generated simulations................................ 38 3.11 Surface plots of the resulting sensitivity from lower frequency (near 2 to 4) sinusoidal-generated simulations............................ 39 3.12 Surface plots of the resulting sensitivity from higher frequency (greater than 5) sinusoidal-generated simulations............................... 40 3.13 Sensitivity rates for frq = 0.3, rj = 0.1, as a function of T in finer detail. Sensitivity decreases more sharply for evenly spaced T (solid dark line) between 20 and 15 than for randomly spaced T (dashed light line)............................................................ 42 3.14 Sensitivity rates for frq = 0.3, T = 50, as a function of r} in finer detail. Sensitivity decreases most when 0.5 < r) < 1.0. Evenly (solid dark line) and randomly (dashed light line) spaced large T produce similarly decreasing rates............................................. 43 3.15 Sensitivity rates for frq = 1.3, rj = 0.1, as a function of T in finer detail. Sensitivity decreases more sharply for evenly spaced T (solid dark line) between 30 and 25 than for randomly spaced T (dashed light line)............................................................ 45 3.16 Sensitivity rates for frq = 1.3, T = 50, as a function of 77 in finer detail. Evenly (solid dark line) and randomly (dashed light line) spaced large T produce similarly decreasing rates............. 45 xi 3.17 Sensitivity rates for frq = 1.8, 77 = 0.1, as a function of T in finer detail. Sensitivity decreases more sharply for evenly spaced T (solid dark line) between 14 and 12 (see arrows) than for randomly spaced T (dashed light line).................................................. 47 3.18 Sensitivity rates for frq = 1.8, T = 50, as a function of 77 in finer detail. Evenly (solid dark line) and randomly (dashed light line) spaced large T produce similarly decreasing rates and both have a high tolerance to a large amount of added noise........................ 47 3.19 Sensitivity rates for frq = 2.5, 77 = 0.1, as a function of T in finer detail Sensitivity for evenly spaced T (solid dark line) is larger when T = 24 than when T 29, and a large decrease occurs between T = 24 and 23. For randomly spaced T (dashed light line), sensitivity is larger when T = 22 than when T = 31................................. 49 3.20 Sensitivity rates for frq = 2.5, T = 50, as a function of 77 in finer detail. Evenly (solid dark line) and randomly (dashed light line) spaced large T produce similarly decreasing rates except for 0.5 < 77 < 0.7, where randomly spaced T yields smaller sensitivity rates. . 49 3.21 Sensitivity rates for frq = 3.8,77 = 0.1, as a function of T in finer de- tail. Sensitivity drops significantly, but uniformly, for evenly spaced T (solid dark line) between 24 and 20 (see arrows). Randomly spaced T (dashed light line) yields slightly higher rates for smaller values of T...................................................................... 51 xn 3.22 Sensitivity rates for frq = 3.8, T = 50, as a function of 77 in finer detail. Evenly (solid dark line) and randomly (dashed light line) spaced large T produce similarly decreasing rates.................... 51 3.23 Surface plots (b and c) of observed sensitivity rates from simulations generated from regression function (a) fregri(t) = 1 481 + 218t2 315f3 + 145t4........................................................... 53 3.24 Sensitivity rates for fregri(t), V = 0.1, as a function of T in finer detail. Evenly spaced T (solid dark line) differs most from randomly spaced T (dashed light line) near T = 30. Rates decrease signifi- cantly for both once T reaches the mid-twenties................ 54 3.25 Sensitivity rates for fregri(t), T = 50, as a function of 77 in finer detail. Evenly (solid dark line) and randomly (dashed light line) spaced large T produce similarly decreasing rates except for 0.4 < 77 < 0.6, where randomly spaced T yields smaller sensitivity rates............. 54 3.26 Surface plots (b and c) of observed sensitivity rates from simulations generated from regression function (a) fregr2{t) = 0.3e-64^-0'25^2 + 0.7e-256(t0.75)2....................................................... 55 3.27 Surface plots (b and c) of observed sensitivity rates from simulations generated from regression function (a) /re9r3(t) = 10e~lot.............. 56 3.28 Surface plots (b and c) of observed sensitivity rates from simulations generated from regression function (a) fregr4(t) = where k = 2 for t < 4 and k 1 for t > |........................................ 57 xni 3.29 Sensitivity rates for fregri(t), rj = 0.1, as a function of T in finer detail. Rates for randomly spaced T (dashed light line) decrease more gradually than evenly spaced T when the number of time points is less than 15. And is evenly spaced T favoring an even number of time points when T is small?....................................... 59 3.30 Sensitivity rates for fregr4{t), T = 50, as a function of t] in finer detail. Evenly (solid dark line) and randomly (dashed light line) spaced large T produce similarly decreasing rates except for 0.7 3.31 A plot indicating the time points sampled. Each asterisk marks the time (in hours) that a tissue sample was taken from a rat specimen across the 4 day estrous cycle.................................. 60 3.32 The Haar wavelet reconstruction of the normalized expression sig- nal for APegl. An increase in expression appears in the estrus and metestrus phases................................................ 62 3.33 The Haar wavelet reconstruction of the normalized expression signal for RMT-1. Expression values are high in the first half of the estrous cycle, and low in the second half............................... 63 3.34 The Haar wavelet reconstruction of the normalized expression signal for IL-17B. Expression is low in the first half of the cycle, and high during the last two phases...................................... 64 3.35 The Haar wavelet reconstruction of the expression signal for TSP2. Expression is higher in the first half, and lower in the second half of the estrous cycle..................................................... 65 XIV 3.36 The Haar wavelet reconstruction of the expression signal for LAMA5. Expression moderately decreases from the beginning to the end of the cycle, with some high level readings occurring in the estrus phase. 66 xv LIST OF TABLES Table 2.1 FDR procedure pseudocode.................................................. 22 2.2 Algorithm pseudocode...................................................... 23 2.3 Algorithm input parameters................................................ 24 3.1 Input parameters needed to construct F................................ 27 3.2 Observed sensitivity rates for / = ip(t), evenly spaced T and p = 0.1 31 3.3 Observed sensitivity rates for / = ip(t), randomly spaced T and p = 0.1 32 3.4 Observed sensitivity rates for / = sin(0.3nt), T evenly spaced and p = 0.1 41 3.5 Observed sensitivity rates for / = sin(l.3irt), evenly spaced T and p = 0.1 44 3.6 Observed sensitivity rates for / = sm(1.87rÂ£), evenly spaced T and p = 0.1 46 3.7 Observed sensitivity rates for / = sm(2.57rt), evenly spaced T and p = 0.1 48 3.8 Observed sensitivity rates for / = sin(3.8jrt), evenly spaced T and p = 0.1 50 3.9 Observed sensitivity rates for fregri(t), evenly spaced T and p 0.1 53 3.10 Observed sensitivity rates for fregr2{t), randomly spaced T and p = 0.1 55 3.11 Observed sensitivity rates for fregrz{t)^ evenly spaced T and p = 0.1 56 3.12 Observed sensitivity rates for fregr4{t), evenly spaced T and p = 0.1 57 xvi 3.13 A frequency distribution of significance rates for 10 applications of the algorithm using R = 5........................................... 61 3.14 Genes found to to have temporal significance 100% of the time ... 67 4.1 Percentage Distribution of Positive-Responsible Ck (when T = 50 and r) 0.1) ...................................................... 72 XVII 1. INTRODUCTION 1.1 Time Series Gene Expression Data Analysis Microarray technology allows researchers to quantify the expression levels of thousands of genes within a tissue sample. By measuring mRNA levels as- sociated with each gene, microarrays provide a snapshot of the genetic activity present in that tissue. This information can then be used to understand patterns of gene expression. Researchers have used microarrays to investigate relation- ships among genes, how they react to environmental changes such as injury, disease, and drugs and their roles in the routine maintenance of an organism. Gene function plays a dynamic role in the regulation of biological processes, and time series analysis can help us understand how this unfolds. Microarray exper- iments, when performed successively, can reveal temporal patterns in biological systems such as cellular cycles, organism development, and disease progression. The resulting data from these time course experiments, however, can be dif- ficult to manage; they tend to be enormous in number (often comprising an entire genome), widely variable, and diverse in scope (some profiles averaging mRNA levels near 10,000, others near 500). The analysis of this complex data introduces unique challenges to investigators and calls for specialized techniques that take into account the sequential nature of the samples. 1.2 Related Work 1.2.1 Early attempts time unordered 1 Initial efforts to make sense of microarray time series data largely ignored the continuous nature of time, instead treating time as a categorical variable. Many early undertakings sought only to differentiate among groups of genes in the hopes of gaining a more sophisticated understanding of biological processes. Groupings tend to be established either through classification schemes, in which genes are categorized based on some pre-specified criteria, or through cluster- ing methods which identify naturally occurring similarities. Early approaches in classifying genes were largely guided by multivariate statistical and com- putational methods including classical statistical modeling techniques [34, 37], clustering algorithms [12, 30], principal component analysis (PCA) with singu- lar value decomposition (SVD) [3, 14], self-organizing maps (SOM) [28], and support vector machines [8]. The shortcoming of a multivariate analysis frame- work, however, is that the temporal ordering of samples is inconsequential; any permutation of time points will not change the results of the analysis. 1.2.2 Recent attempts time ordered More recent classification techniques have addressed this problem by recog- nizing that temporal order and the intrinsic continuity of gene expression sig- nals must be incorporated. Some of these time sensitive classification schemes include the use of self-organizing maps adjusted to incorporate a temporal met- ric [27, 36], time warping algorithms that attempt to co-align expression signals [1], and functional data analysis (FDA) methodologies that view observed gene profiles as independent components in a smooth stochastic process [22, 35]. The latter presented methods specifically for unevenly and/or infrequently sampled data. Notable nonparametric methods presented for modeling time series ex- 2 pression profiles have included the use of splines [5, 11] and wavelets. In work done by Klevecz in 2000 and Klevecz and Murray in 2001 [17, 18], wavelets were used to identify dominant frequencies at which large numbers of genes exhib- ited oscillations in their expression profiles. Our use of wavelets is qualitatively different. Specifically, we will use wavelets as a means of identifying individual genes that exhibit significant time-dependent variation in their expression levels. 1.2.3 Wavelets in signal processing The use of wavelets in time series analysis is not new. Wavelets have been applied to a variety of series including the detection of time-dependent climate signals [20], serial dependencies within functional magnetic resonance imaging (fMRI) [9], the analysis of electrocardiogram (EKG) heartbeat intervals [16] and electroencephalogram (EEG) epileptic spikes [19]. In image processing, wavelets have been used in the analysis of 2-dimensional images (which are just signals arising from spacial dependencies instead of temporal), to identify patterns at various resolutions. Examples include JPEG image compression [31], optimiza- tion of image coding [33], edge detection and the removal of noise in images [24]. 1.3 Our Approach Here we present a method for identifying meaningful temporal patterns in gene expression microarray data. Our algorithm computes the Haar wavelet transformation on the expression profile of each gene, and then incorporates a false discovery rate analysis on the decomposition coefficients to single out genes with significantly large temporal responses, relative to the noise. Prior to 3 describing the details of the algorithm, it will be useful to first describe the data set that motivated the development of this method. 1.3.1 Motivation The development of our algorithm was motivated by the desire to analyze a genome-wide time series that resulted from research investigating the estrous cy- cle. The data came from microarray experiments measuring the levels of mRNA present in mammary tissue samples taken from female rats (Rattus norvegicus) and spanning the course of the rat estrous cycle (typically 4-5 days). The exper- iment, conducted by Margaret C. Neville and Pepper Schedin at the University of Colorados School of Medicine, sampled 32 time points over a four day period at approximately 90 minute intervals between the hours of 7am and 7pm. The microarrays were set up to detect expression levels for the entire rat genome, comprising over 21,000 genes. Much of the genetic architecture underlying the estrous cycle is not clearly understood, therefore our quest was exploratory in nature. The hope was to be able to identify a set of candidate genes exhibiting a significant temporal response worthy of further investigation in future experiments. Features of this experiment made the prospect of data analysis both encouraging and daunting. While there were many time points sampled, and with some level of regularity in sampling rate, there were also large gaps of time with no samples, coupled with the additional variability introduced by heterogenous sample subjects. Unlike some experiments in which the samples are homogenous, (e.g., the time points are sampled from cloned organisms or from the same individual multiple times), this experiment required the extraction of vital organs and 4 therefore each time point represented a different rat. The lack of sample homo- geneity presents uncertainty that can complicate analysis efforts because each rats genetic makeup varies to some unknown degree. In addition to this vari- ability between individual rats, there can also be disparities within samples. For example, the presence of peripheral tissue whose primary biological function may be far different than that of the tissue under examination can result in expres- sion values reflecting a different process. The heterogony of samples also raises concern for how much confidence we can have that the rat sacrificed at time t*. was a good representation of the estrous cycle at time tfc. In other words, how can we be sure that the rat representing time point t7 lets say, wasnt actually better-suited to represent time point t9 in the estrous cycle? The sampling rate of this set had irregularities resulting in large windows of unmeasured gene activity (over nights), therefore we wanted to explore how un- even sampling might affect the identification of significant temporal profiles. An integral part of our testing scheme involves the comparison between uniformly and randomly spaced time points. Encouraging however, was that this set had a relatively large number of samples a luxury given that, due to the high cost involved, it is not uncommon for microarray time courses to sample less than 10 time points [13]. While the experiment did not perform replicates, this forfeiture afforded the advantage of more time points covered. Also attractive is the regularity of sampling that did occur within the four 12 hour sampling phases. 1.4 Challenges 5 The idea behind gene expression time series experiments is to obtain enough snapshots of gene activity to reveal temporal patterns within a particular bio- logical process or stimulus response. A key factor in accomplishing this is being able to distinguish between the true signal and its accompanying noise. As will be described, there are many factors in the experimental design that can complicate the discernment between pure signal and noise. 1.4.1 Sampling As with many time course experiments, sampling methodology can greatly affect the subsequent assessment of data. The rate of sampling should be tailored to the process being studied; longer intervals may accommodate slowly unfolding gene activity, whereas shorter intervals become necessary when studying intense biological responses. While sampling frequently can simplify and strengthen analysis, it may be infeasible due to the high cost of microarrays. Sampling too infrequently, however, can make it nearly impossible to interpolate gene activity between sample points. Even when many time points are sampled, irregularity of sampling intervals can complicate analysis because the rate of change between time points increases as the interval decreases. Sometimes irregular intervals are strategically carried out in order to capture particular phases of a process where rapid or slow be- havior is expected [26]. But in many situations, an uneven rate of sampling is difficult to avoid due to the impracticality of around-the-clock sampling. 1.4.2 Homogeneity In addition to sparse or irregular sampling, other factors may bring un- certainty into the analysis, namely, inconsistencies within and between tissue 6 samples. 1.4.2.1 Within samples Inconsistencies within samples occurs when the extracted specimen contains tissue from multiple organs. For the purposes of many experiments, the proto- typical tissue sample contains only tissue from the organ of interest, be it heart, liver, or skin, etc. This is ideal because gene activity varies among tissue types depending upon their biological functions. For example, certain genes responsi- ble for regulating functions of sight will be highly expressed in ocular tissue, yet nearly dormant in proximal brain tissue. But during extraction, samples can be inadvertently contaminated with tissue from neighboring organs. 1.4.2.2 Between samples Inconsistencies between tissue samples is the result of using heterogenous subjects. A favorable experimental condition is when each tissue sample comes from the same organism, or a clone. Such control is attainable when working with populations of smaller organisms like C. elegans or Saccharomyces cere- visiae (yeast) because lines of these organisms can be perpetuated indefinitely. But often it is necessary to use organisms more similar to humans, in which case cloning is not (yet) a practical approach and each tissue sample requires the sacrifice of a unique organism. When each time point represents a unique individual, it can be difficult to know the degree to which changes in expres- sion values are attributable to a biological response or to the peculiarities of the individual. Measures can be taken to align organisms so that each sacrificed specimen can reasonably represent the proper phase of the cycle for that time 7 point. But there will still be variation resulting from the uniqueness of the sam- ples and which is indistinguishable from the changes inherent in the true gene signal. 1.4.3 Replicates In traditional types of time series, the value at time point tj is directly influenced by the observation at time point t(j-i). This characteristic can bring robustness to the validity of the value at time point tj; when the value at time point tj is similar to its neighbors, we have more confidence in it. In gene expression time series on the other hand, individual time points are independent of the preceding observations. Therefore, it is desirable to take measures to increase confidence in the value at time point t A common experimental design, intended to substantiate the expression value associated with each sampled time point, is to generate replicates. Repli- cates are additional microarray tests performed on each sample and combined into a single reading for that time point. While this design can increase con- fidence in the values, it does not necessarily always behoove the researcher to adopt this setup. Consider an experiment in which one is investigating a slow process over a large cycle length; requiring replicates may amount to squander- ing expensive microarray tests that would be more useful if spread out to cover more individual time points. 1.5 Conceptual Framework We recognize that significant temporal patterns are not necessarily cyclical. In fact as evidenced when one views a plot of expression values against time - what can be seen is a great deal of seemingly erratic spikes and bursts of activity. 8 Wavelet transformations are well-suited to such data since they are capable of closely approximating highly irregular, chaotic signals. 1.5.1 A model of the data In designing a model of this type of data, we set forth some expectations. In general, we assume that the variation in gene expression signals is governed by three components: the temporal response, the sample response, and random noise. We assume that the temporal response component represents the time- dependent signal of the true genetic profile and that it is a continuous function of time. The sample response is considered time-independent variation occurring from sample to sample (e.g., differences between individual organisms or tis- sue makeup). Note that the sample response may exhibit correlations among genes; there may be groups of genes that vary similarly from one sample to the next. The magnitude of the sample response is difficult to predict; it could be very large or small relative to the temporal response. The random noise is also considered to be time-independent, but largely gene-independent and relatively small compared to the other two components. If we combine the sample response and the random noise into a single component, e, then our model, /, can be written as f = y + Â£, (i-i) where y represents the temporal response, i.e., the genetic profile. Questions that we explore using our algorithm on strategically simulated data include whether the sample response is proportional to the temporal re- 9 sponse and what type of distribution might be representative of the sample response. 1.6 Implementation This paper presents a method to identify significant temporal responses among a large number of gene expression signals. We model the data via a Haar wavelet transformation and use features of the reconstructed signal to identify profiles with meaningful temporal patterns. Issues to be addressed in our approach include the questions of how many samples are necessary within a given time interval, and to what degree irregularity in sampling rate can be tolerated in order to successfully detect temporal patterns. Also investigated is how best to normalize the raw data prior to the wavelet transformation. Normalization of the data is an important issue because it has the potential of magnifying small scale elements as well as minimizing those of large scale. To test our methodological answers to these questions, we apply our algo- rithm to a variety of simulated datasets. After refining our algorithm during the testing phases, we apply the final version to the set of genome-wide expression levels sampled from a population of female rats in an attempt to explore the regulation of the estrous cycle. 10 2. METHODS 2.1 Overview of the Method Our methods were based upon the model that gene expression measure- ments are made up of three components: the temporal response, a continuous function of time that reflects the true expression value; the sample response, a time-independent form of variability that results from differences between sam- ples; and noise, which is presumed to be independent of both time and sample. Furthermore, we assume that the non-temporal portion of the signal (the sam- ple response combined with the random noise) tends to be proportional to the signals temporal response. We are primarily interested in identifying genes that have a significant tem- poral response. We model the temporal response for a gene as a function of time, which we approximate from the expression data. To make this concrete, consider a set of T ordered data points, (ti,yi),, (tT, yT), believed to have been sampled with noise from some underlying function of time. The aim is to find a new function, /, that closely approximates, or fits, this underlying function such that for each j Â£ [1, T] f(tj) ~ Vj- (2-1) Our approach models the data as a continuous function of time so that the underlying function, y, is approximated by the following linear model: n = (2-2) k=1 11 where fa (t),, fa{t) is a family of basis functions for our method, which target temporal features we are interested in. In this study, we will use Haar wavelet functions, defined in 2.3, to form the basis. We then minimize the models least-squares residual to choose the best parameters ck. That is, we minimize Ylj=i(Vj ~ The result of this process is a representation of each gene by the vector of coefficients c = [ci,..., cn] corresponding to the least squares fit. If the magnitude of any of these coefficients is large relative to the other coefficients, we say that the gene has a significant temporal response with respect to the corresponding basis function. A main contribution of this thesis is an algorithm for determining when a coefficient is large enough to be considered significant. The remainder of this chapter describes the algorithm in detail. The raw dataset, G, is arranged into an m x T matrix such that Gij is the expression level at the jth time point for the ith gene, where j = 1,2,..., T and i = 1,2,..., m. Prior to calculating the functional fit, we first normalize the data, resulting in an m x T matrix G. The normalization process is described in Section 2.2. Section 2.3 describes the Haar wavelets, which will be used to define the basis functions fa{t) for equation (2.1). Section 2.3 also gives the mathematical details for calculating the least squares approximation. Section 2.4.1 describes the method for determining the significance of the magnitudes of the wavelet coefficients. 2.2 Normalization We examined a number of methods for standardizing the data prior to per- forming the wavelet decomposition. The choice of a normalization scheme de- pends on several factors and therefore it was important to assess the implica- 12 tions of each standardizing component as it effects the analysis of the data. Our method was to first take the log-transform of the data and then center each signal about 0. In microarray data analysis it is standard practice to work with the loga- rithm of the data instead of the raw expression values. This is primarily because of the wide scope of the expression readings which can range from 0 to 10,000 and the fact that noise tends to be proportional to signal strength. Using the log-transform of the data ensures that low level changes in expression re- ceive relatively equal regard as those occurring at high levels. Therefore, our normalization routine begins by taking the log-transform of the raw expression data. Our interest is in the relative changes in expression values, not absolute; so we centered each signal about 0 by subtracting its mean. That is, the ith gene expression signal, denoted by Gl *, once normalized will be Gi.* lnGj.* ELl-Gy (2.3) where the log of a vector is the vector of the logarithms. The notation of the * symbol in the subscript will indicate the entirety of that index so that is equivalent to ,2.3 Wavelets A wavelet transform (WT) can be used to fit a linear model given a set of sample data points, (yi, .. ,yr)- Instead of mapping the samples from a temporal domain into a frequency domain (as with Fourier transforms), the wavelet transform maps the samples into a time-frequency plane. A benefit of 13 a WT is that it can get as local as necessary, whereas the Fourier transform is globalized over the entire signal. Wavelet transforms are capable of zooming in or out at different locations of the signal. With wavelets, one can choose the resolution level with which to transform a given signal [7]. The term wavelet refers to a set of basis functions defined by a pair of related functions: the wavelet generating function, denoted by ip, and the scaling function, denoted by (j). These two functions work in concert to generate an infinite set of basis functions. Typically, however, only a finite subset of these basis functions is necessary to get a close approximation of /. By choosing different generating and scaling functions, many different families of wavelet functions can be produced. In this thesis we focus on Haar wavelets, which are specified as follows. 2.3.1 The Haar wavelet For a function [0,1], the Haar scaling function is defined as: (2.4) and the Haar wavelet generating function is defined as: ip(t) = 0(2i) 4>(2t 1). (2.5) or equivalently, 1 if 0 < t < i; V>(<) = < 1 if \ < t < 1; (2.6) 0 otherwise, 14 The family of Haar basis functions is then defined by {(p(t), ^o,o(^), ipi,o(t), ipi,i(t), where ^0,0 (t) ip{t) and (*) = iffi s), (2.7) for d G Z+ and 0 < s < Sd = 2d 1. Generalizing for a function /(f), t {-7= if a < t < b\ 0 otherwise, and applying (2.5), the Haar wavelet generating function is defined as: 1 if a y/ba = H*<*<*>; 0 otherwise. (2.9) The denominator \Jb a is chosen so that
and ip{t), the remaining Haar basis functions on [a, 5] are then defined by |