Citation |

- Permanent Link:
- http://digital.auraria.edu/AA00000039/00001
## Material Information- Title:
- A machine learning approach for gene expression analysis and applications
- Creator:
- Le, Thanh Ngoc
- Place of Publication:
- Denver, CO
- Publisher:
- University of Colorado Denver
- Publication Date:
- 2012
- Language:
- English
- Physical Description:
- 1 electronic file. : ;
## Subjects- Subjects / Keywords:
- Gene expression -- Statistical methods ( lcsh )
DNA microarrays -- Statistical methods ( lcsh ) Cluster analysis ( lcsh ) - Genre:
- non-fiction ( marcgt )
## Notes- Review:
- High-throughput microarray technology is an important and revolutionary technique used in genomics and systems biology to analyze the expression of thousands of genes simultaneously. The popular use of this technique has resulted in enormous repositories of microarray data, for example, the Gene Expression Omnibus (GEO), maintained by the National Center for Biotechnology Information (NCBI). However, an effective approach to optimally exploit these datasets in support of specific biological studies is still lacking. Specifically, an improved method is required to integrate data from multiple sources and to select only those datasets that meet an investigator's interest. In addition, to exploit the full power of microarray data, an effective method is required to determine the relationships among genes in the selected datasets and to interpret the biological meanings behind these relationships. To address these requirements, we have developed a machine learning based approach that includes: * An effective meta-analysis method to integrate microarray data from multiple sources; the method exploits information regarding the biological context of interest provided by the biologists. * A novel and effective cluster analysis method to identify hidden patterns in selected data representing relationships between genes under the biological conditions of interest. * A novel motif finding method that discovers, not only the common transcription factor binding sites of co-regulated genes, but also the miRNA binding sites associated with the biological conditions. * A machine learning-based framework for microarray data analysis with a web application to run common analysis tasks on online.
- Thesis:
- Thesis (Ph,.D.)--University of Colorado Denver. Computer science and information systems
- Bibliography:
- Includes bibliographic references.
- General Note:
- Department of Computer Science and Engineering
- Statement of Responsibility:
- by Thanh Ngoc Le.
## Record Information- Source Institution:
- |University of Colorado Denver
- Holding Location:
- |Auraria Library
- Rights Management:
- All applicable rights reserved by the source institution and holding location.
- Resource Identifier:
- 863062630 ( OCLC )
ocn863062630
## Auraria Membership |

Downloads |

## This item has the following downloads: |

Full Text |

A MACHINE LEARNING APPROACH FOR GENE EXPRESSION ANALYSIS
AND APPLICATIONS by THANH NGOC LE B.S., University of Econimics in Hochiminh City, Vietnam, 1994 M.S., National University in Hochiminh City, Vietnam, 2000 A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy Computer Science and Information Systems 2013 This thesis for the Doctor of Philosophy degree by Thanh Ngoc Le has been approved for the Computer Science and Information Systems Program by Gita Alaghband, Chair Tom Altman, Advisor Katheleen Gardiner, Co-Advisor James Gerlach Sonia Leach Boris Stilman Date: April 10. 2013 n Le, Thanh, Ngoc (Ph.D., Computer Science and Information Systems) A Machine Learning approach for Gene Expression analysis and applications Thesis directed by Professor Tom Altman ABSTRACT High-throughput microarray technology is an important and revolutionary technique used in genomics and systems biology to analyze the expression of thousands of genes simultaneously. The popular use of this technique has resulted in enormous repositories of microarray data, for example, the Gene Expression Omnibus (GEO), maintained by the National Center for Biotechnology Information (NCBI). However, an effective approach to optimally exploit these datasets in support of specific biological studies is still lacking. Specifically, an improved method is required to integrate data from multiple sources and to select only those datasets that meet an investigators interest. In addition, to exploit the full power of microarray data, an effective method is required to determine the relationships among genes in the selected datasets and to interpret the biological meanings behind these relationships. To address these requirements, we have developed a machine learning based approach that includes: An effective meta-analysis method to integrate microarray data from multiple sources; the method exploits information regarding the biological context of interest provided by the biologists. in A novel and effective cluster analysis method to identify hidden patterns in selected data representing relationships between genes under the biological conditions of interest. A novel motif finding method that discovers, not only the common transcription factor binding sites of co-regulated genes, but also the miRNA binding sites associated with the biological conditions. A machine learning-based framework for microarray data analysis with a web application to run common analysis tasks on online. The form and content of this abstract are approved. I recommend its publication. Approved: Tom Altman IV DEDICATION I dedicate this work to my family, my wife, Lan, and my beloved daughter, Anh for their constant support and unconditional love. I love you all dearly. ACKNOWLEDGMENTS I would like to thank my advisor, Dr. Tom Altman for his guidance and support throughout my graduate studies. His way of guidance helped me to develop my self- confidence and skills necessary to address not only research problems, but also real- world problems. Thank you very much Dr. Altman. I would like to express my sincere gratitude to my co-advisor, Dr. Katheleen Gardiner for her continuous support on my research work throughout the years. She guided me with remarkable patience and understanding. She was always there to help me whenever I encounter an obstacle or require an advice. Her inputs were very useful. It was really a pleasure working with her. I would like to thank Dr. Sonia Leach for her help in solving difficult problems and for her valuable inputs as a member of my dissertation committee. She was very generous in transferring her vast knowledge to us. Also, I would like to thank Dr. Gita Alaghband, Dr. James Gerlach, and Dr. Boris Stilman for serving on my dissertation committee and for their helpful suggestions and comments for the improvement of this project; your discussion, ideas, and feedback have been absolutely invaluable. I would like to thank you all for giving me an example of excellence as a researcher, mentor, instructor, and role model. I would like to thank my fellow graduate students, my lab mates and everyone who helped me in many ways to make my graduate student life a memorable one. I am very grateful to all of you. VI I would especially like to thank my amazing family for the love, support, and constant encouragement I have gotten over the years. In particular, I would like to thank my mother, my brothers and sisters. You are the salt of the earth, and I undoubtedly could not have done this without you. Vll TABLE OF CONTENTS List of Tables.................................................................xii List of Figures.................................................................xv Chapter 1. Introduction..............................................................1 1.1 Gene expression data analysis challenges..................................1 1.2 Motivations and proposed methods..........................................3 1.2.1 Microarray meta-analysis..................................................3 1.2.2 Cluster analysis of gene expression data..................................6 1.2.3 DNA sequence analysis.....................................................8 1.3 Research contributions....................................................9 2. Mi croarray analy si s b ackground.......................................13 2.1 Microarray...............................................................13 2.1.1 Microarray platforms.....................................................13 2.1.2 Gene expression microarray data analysis.................................16 2.2 Mi croarray data meta-analy sis..........................................18 2.2.1 Microarray data integration..............................................18 2.2.2 Issues with microarray data meta-analysis................................23 2.3 Cluster analysis.........................................................32 2.3.1 Clustering algorithms....................................................34 2.3.2 Hierarchical clustering algorithms.......................................35 2.3.3 K-Means clustering algorithm.............................................36 2.3.4 The mixture model with expectation-maximization algorithm................37 2.3.5 Fuzzy clustering algorithm...............................................39 viii .42 45 .45 .46 .46 .48 .49 49 .53 .53 .57 .58 .64 68 .70 .80 85 .87 .88 89 .90 92 .98 .98 Gene regulatory sequence prediction...... Datasets................................. Artificial datasets for cluster analysis. Artificial datasets for sequence analysis. Real clustering datasets................. Gene expression datasets................. CMAP datasets............................ Real biological sequence datasets........ Fuzzy cluster analysis using FCM......... FCM algorithm............................ FCM convergence.......................... Distance measures........................ Partially missing data analysis.......... FCM Clustering solution validation....... Global validity measures................. Local validity measures.................. FCM partition matrix initialization...... Determination of the number of clusters Between-cluster-partition comparison Within-cluster-partition comparison...... Determination of the fuzzifier factor.... Defuzzification of fuzzy partition....... Methods.................................. Microarray meta-analysis................. IX 4.1.1 Selection of samples and features..........................................98 4.1.2 Sample and feature selection method........................................98 4.1.3 Methods for feature metric................................................108 4.2 Fuzzy cluster analysis methods using Fuzzy C-Means (FCM)..................109 4.2.1 Modification of the obj ective function...................................110 4.2.2 Partition matrix initialization method....................................Ill 4.2.3 Fuzzy clustering evaluation method........................................121 4.2.4 Fuzzy clustering evaluation using Gene Ontology [180]....................129 4.2.5 Imputation methods for partially missing data.............................137 4.2.6 Probability based imputation method (fzPBI) [154].........................139 4.2.7 Density based imputation method (fzDBI) [155].............................149 4.2.8 Probability based defuzzification (fzPBD) [156]...........................153 4.2.9 Fuzzy genetic subtractive clustering method (fzGASCE) [157]...............161 4.3 Motif finding problem.....................................................172 4.3.1 HIGEDA algorithm..........................................................172 4.3.2 New motif discovery using HIGEDA..........................................192 5. Applications..............................................................196 5.1 Recovery of a gene expression signature...................................196 5.2 Gene-drug association prediction..........................................199 5.2.1 Model design..............................................................199 5.2.2 Model testing.............................................................205 5.3 Drug target predi cti on..................................................216 5.4 Application of HIGEDA into prediction regulatory motifs...................217 6. Conclusions and future work...............................................218 X References 221 XI LIST OF TABLES Table 3.1 : Performance of different distance measures on the Iris dataset.........63 3.2 : Performance of different distance measures on the Wine dataset.........63 3.3 : Average results of 50 trials using incomplete Iris data [140]..........68 3.4 : Performance of three standard algorithms on ASET1......................86 3.5 : Algorithm performance on ASET4 using MISC measure......................96 3.6 : Algorithm performance on ASET5 using MISC measure......................96 4.1 : Predicted HD AC antagonists...........................................102 4.2 : Predicted HDAC agonists...............................................103 4.3 : Predicted estrogen receptor antagonists...............................105 4.4 : Predicted estrogen receptor agonists..................................107 4.5 : Algorithm performance on the ASET1 dataset............................118 4.6 : fzSC correctness in determining cluster number on artificial datasets.119 4.7 : fzSC performance on real datasets.....................................120 4.8 : Fraction of correct cluster predictions on artificial datasets........125 4.9 : Validation method performance on the Iris dataset (3 true clusters)...125 4.10 : Validation method performance on the Wine dataset (3 true clusters)..126 4.11: Validation method performance on the Glass dataset (6 true clusters)...126 4.12 : Validation method performance on the Yeast dataset (5 true clusters).127 4.13 : Validation method performance on the Yeast-MIPS dataset (4 true clusters) ............................................................127 4.14 : Validation method performance on the RCNS dataset (6 true clusters)..128 4.15 : Degrees of belief of GO annotation evidences..132 4.16 : Gene Ontology evidence codes.........................................133 xii 4.17 : Validation method performance on the Yeast dataset using GO-BP.....135 4.18 : Validation method performance on the Yeast-MIPS dataset using GO-BP....136 4.19 : Validation method performance on the RCNS dataset using GO-CC......136 4.20 : Validation method performance on the RCNS dataset using GO-BP......137 4.21 : Average results of 50 trials using an incomplete IRIS dataset with different percentages (%) of missing value.........................145 4.22 : ASET4- Compactness measure........................................157 4.23 : IRIS: Compactness measure.........................................159 4.24 : WINE- Compactness measure..........................................160 4.25 : Performance of GA algorithms on ASET2..............................168 4.26 : Performance of GA algorithms on ASET3..............................169 4.27 : Performance of GA algorithm on ASET4...............................169 4.28 : Performance of GA algorithms on ASET5..............................170 4.29 : Performance of GA algorithms on IRIS...............................171 4.30 : Performanc of GA algorithms on WINE...............................171 4.31 : Average performance (LPC/SPC) on simulated DNA datasets...........187 4.32 : Average performance (LPC/SPC/run time (seconds)) on eight DNA datasets (# of sequences/length of motif/# of motif occurrences)...188 4.33 : Detection of protein motifs (1-8, PFAM; 9-12, Prosite).............190 4.34 : Motifs of ZZ, Myb and SWIRM domains by the four algorithms.........193 4.35 : Motifs of Presenilin-1 and Signal peptide peptidase by HIGEDA......194 5.1 : Clustering results with estrogen significantly associated drugs.....197 5.2 : Estrogen signature and the signature predicted by FZGASCE.........198 5.3 : Hsa21 set-1 query genes.............................................206 xiii 5.4 : Expanded set of the Hsa21 set-1.......................................207 5.5 : Gene clusters of the Hsa21 gene set-1 expanded set....................207 5.6 : Proposed gene expression signature for the Hsa21 gene set-1...........209 5.7 : Drugs predicted to enhance expression of Hsa21 gene set-1.............210 5.8 : Drugs predicted to repress expression of Hsa21 gene set-1.............211 5.9 : Hsa21 set-2 query genes...............................................212 5.10 : Predicted gene expression signature for the Hsa21 gene set-2.........213 5.11 : Drugs predicted to enhance expression of Hsa21 gene set-2............215 5.12 : Drugs predicted to repress expression of Hsa21 gene set-2............215 XIV LIST OF FIGURES Figure 1- 1 : Microarray data analysis and the contributions of this research..........12 2- 1 : Microarray design and screening..........................................15 2- 2 : Microarray analysis......................................................18 3- 1 : Expression levels of three genes in five experimental conditions.........60 3-2 : ASET2 dataset with five well-separated clusters............................76 3-3 : PC index (maximize)........................................................77 3-4 : PE index (minimize)........................................................77 3-5 : FS index (minimize)........................................................77 3-6 : XB index (minimize)........................................................77 3-7 : CWB index (minimize).......................................................78 3-8 : PBMF index (maximize)......................................................78 3-9 : BR index (minimize)........................................................78 3-10 : Performance of the global validity measures on artificial datasets with different numbers of clusters............................................79 3-11 : Dendrogram of the Iris dataset from a 12-partition generated by FCM......83 3-12: Dendrogram of the Wine dataset from a 13-partition generated by FCM.......84 3-13 : ASET1 dataset with 6 clusters............................................85 3-14 : PC index on the ASET2 dataset (maximize).................................88 3-15 (Wu [144]): Impact of m on the misclassification rate in Iris dataset......91 3- 16 : Artificial dataset 5 (ASET5) with three clusters of different sizes.....93 4- 1 : Venn diagram for predicted HD AC antagonists............................104 4-2 : Venn diagram for HD AC inhibitors.........................................104 4-3 : Venn diagram for predicted estrogen receptor antagonists..................106 XV 4-4 : Venn diagram for predicted estrogen receptor agonists..................106 4-5 : Candidate cluster centers in the ASET1 dataset found using fzSC. Squares, cluster centers by FCM; dark circles, cluster centers found by fzSC. Classes are labeled 1 to 6............................................118 4-6 : Average RMSE of 50 trials using an incomplete ASET2 dataset with different percentages of missing values...............................142 4-7 : Average RMSE of 50 trials using an incomplete ASET5 dataset with different percentages of missing values...............................143 4-8 : Average RMSE of 50 trials using an incomplete Iris dataset with different percentages of missing values.........................................144 4-9 : Average RMSE of 50 trials using an incomplete Wine dataset with different percentages of missing values...............................145 4-10 : Average RMSE of 50 trials using an incomplete RCNS dataset with different percentages of missing values...............................146 4-11 : Average RMSE of 50 trials using an incomplete Yeast dataset with different percentages of missing values...............................147 4-12 : Average RMSE of 50 trials using an incomplete Yeast-MIPS dataset with different percentages of missing values...............................148 4-13 : Average RMSE of 50 trials using an incomplete Serum dataset with different percentages of missing values...............................148 4-14 : (fzDBI) Average RMSE of 50 trials using an incomplete ASET2 dataset with different missing value percentages..............................151 4-15 : (fzDBI) Average RMSE of 50 trials using an incomplete ASET5 dataset with different missing value percentages..............................151 4-16 : (fzDBI) Average RMSE of 50 trials using an incomplete Iris dataset with different percentages of missing values...............................152 XVI 4-17 : (fzDBI) Average RMSE of 50 trials using an incomplete Yeast-MIPS dataset with different percentages of missing values................152 4-18 : Algorithm performance on ASET2.......................................156 4-19 : Algorithm performance on ASET3.......................................156 4-20 : Algorithm performance on ASET4.......................................157 4-21 : Algorithm performance on ASET5.......................................158 4-22 : Algorithm performance on IRIS........................................158 4-23 : Algorithm performance on WINE dataset................................159 4-24 : Algorithm performance on GLASS dataset...............................160 4-25 : A motif model 0......................................................175 4-26 : Dynamic alignment of s=ACG w.r.t 0 from Figure 4-25................176 4-27 : v(t) = Vm x T / (T + t2). VmIs the maximum value of v................184 4- 28 : Strep-H-triad motif by HIGEDA......................................189 5- 1 : Gene expression signature prediction...............................196 5-2 : Gene-Drug association prediction.....................................199 5-3 : Expansion of gene set A..............................................200 5-4 : Identification of key genes..........................................202 5-5 : Gene expression signature generation and application.................204 5-6 : Transcription factor and mircoRNA binding site prediction using HIGEDA...217 xvii 1. Introduction High-throughput microarray technology for determining global changes in gene expression is an important and revolutionary experimental paradigm that facilitates advances in functional genomics and systems biology. This technology allows measurement of the expression levels of thousands of genes with a single microarray. Widespread use of the technology is evident in the rapid growth of microarray datasets stored in public repositories. For example, since its inception in the early 1990s, the Gene Expression Omnibus (GEO), maintained by the National Center for Biotechnology Information (NCBI), has received thousands of data submissions representing more than 3 billion individual molecular abundance measurements [6,7], 1.1 Gene expression data analysis challenges The growth in microarray data deposition is reminiscent of the early days of GenBank, when exponential increases in publicly accessible biological sequence data drove the development of analytical techniques required for data analysis. However, unlike biological sequences, microarray datasets are not easily shared by the research community, resulting in many investigators being unable to exploit the full potential of these data. In spite of over 15 years of experience with microarray experiments, new paradigms for integrating and mining publicly available microarray results are still needed to promote widespread, investigator-driven research on shared data. In general, integrating multiple datasets is expected to yield more reliable and more valid results due to a larger number of samples and a reduction of study-specific biases. 1 Ideally, gene expression data obtained in any laboratory, at any time, using any microarray technology should be comparable. However, this is not true in reality. Several issues arise when attempting to integrate microarray data generated using different array technologies. Meta-analysis and integrative data analysis only exploits the behavior of genes individually at the genotype level. In order to discover gene behaviors at the phenotype level by connecting genotype to phenotype, additional analyses are required. Cluster analysis, an unsupervised learning method, is a popular method that searches for patterns of gene expression associated with an experimental condition. Each experimental condition usually corresponds to a specific biological event, such as a drug treatment or a disease state, thereby allowing discovery of the drug targets, and drug and disease relationships. Numerous algorithms have been developed to address the problem of data clustering. These algorithms, notwithstanding their different approaches deal with the issue of determining the number and the construction of clusters. Although some cluster indices address this problem, they still have the drawback of model over-fitting. Alternative approaches, based on statistics with the log-likelihood estimator and a model parameter penalty mechanism, can eliminate over-fitting, but are still limited by assumptions regarding models of data distribution and by a slow convergence with model parameter estimation. Even when the number of clusters is known a priori, different clustering algorithms may provide different solutions, because of their dependence on the initialization parameters. Because most algorithms use an iterative 2 process to estimate the model parameters while searching for optimal solutions, a solution that is a global optimum is not guaranteed [122,132], 1.2 Motivations and proposed methods This dissertation focuses on construction of a machine learning and data mining framework for discovery of biological meaning in gene expression data collected from multiple sources. Three tools have been developed: (i) a novel meta-analysis method for identifying expression profiles of interest; (ii) set of novel clustering methods to incorporate prior biological knowledge and to predict co-regulated genes; and (iii) a novel motif-finding method to extract information on shared regulatory sequences, Together these methods provide comprehensive analysis of gene expression data. 1.2.1 Microarray meta-analysis Several methods have been proposed to combine inter-study microarray data. These methods fall into two major categories based on the level at which data integration is performed: meta-analysis and direct integration of expression values. Meta-analysis integrates data from different studies after they have been pre-analyzed individually. In contrast, the latter approach integrates microarray data from different studies at the level of expression after transforming expression values to numerically comparable measures and/or normalization and carries out the analysis on the combined dataset. Meta-analysis methods combine the results of analysis from individual studies in order to increase the ability to identify significantly expressed genes or altered samples across studies that used different microarray technologies (i.e., different platforms or different generations of the same platform). Lamb et al. [25] and Zhang et al. [26] 3 proposed rank-based methods for performing meta-analysis. They started within the Connectivity Map (CMAP) datasets, a collection of one-off microarray screenings of >1000 small-molecule treatments of five different human cell lines. Genes in each array were ranked using statistical methods and the array data were then integrated, based on how they fit a given gene expression signature, defined as a set of genes responding to the biological context of interest. Gene expression signatures could then be used to query array data generated using other array technologies. These methods successfully identified a number of small molecules with rational and informative associations with known biological processes. However, Lamb et al. [25] weighted up-regulated genes and down-regulated genes differently, while Zhang et al. [26] showed that genes with the same amount of expression change should be weighted equally. Both Zhang et al. [26] and Lamb et al. [25] used fold change criteria (where fold change, FC, is the ratio between the gene expression value of treated and untreated conditions) to rank the genes in expression profiles and they solved the problem of platform-specific probes by eliminating them. Because of the experimental design and the noise inherent in the screening process and imaging methods, a single gene may have probesets that differ in the direction of expression change, e.g., some probesets may be up-regulated and others down-regulated. This problem was addressed by Lamb et al. [25] and Zhang et al. [26] by including all probesets in the ranked list and averaging probesets across genes. In fact, it is not biologically necessary that a single differentially expressed gene show the identical FC for all probesets. It is not straightforward to compare differentially expressed genes just by averaging their probeset values. Witten et al. [9] showed that 4 FC-based approaches, namely FC-difference and FC-ratio, are linear-comparison methods and are not appropriate for noisy data and the data from different experiments. In addition, these methods are unable to measure the statistical significance of the results. While approaches using the t-test were proposed, Deng [10] showed that the Rank-Product (RP) method is not only more stable than the t-test, but it also provides the statistical significance of the gene expression differences. Breitling et al. [29] proposed an RP-based approach that allows the analysis of gene expression profiles from different experiments and offers several advantages over the linear approaches, including the biologically intuitive FC criterion, using fewer model assumptions. However, both the RP and t-test methods, as well as other popular meta-analysis methods, have problems with small datasets and datasets with a small number of replicates. Hong et al. [30] proposed an RP-based method that is most appropriate, not only for small sample sizes, but also for increasing the performance on noisy data. Their method has achieved widespread acceptance and has been used in such diverse fields as RNAi analysis, proteomics, and machine learning [8], However, most RP-based methods, unlike t-test based methods, do not provide an overall measure of differential expression for every gene within a given study [54], We develop a meta-analysis method using RP first to determine the differentially expressed genes in each study and then to construct ordered lists of up and down regulated genes in each study based on the p-values determined using a permutation- based method. We then use the non-parametric rank-based pattern-matching strategy based on the Kolmogorov-Smimov statistic [25] to filter the profiles of interest. We 5 again use RP to filter genes which are differentially expressed across multiple studies. In our method, a t-test based approach can also be used in the first step to generate an estimated FC for each gene within each study. These values are then used in the cluster analysis, instead of the average FC values, to group the genes based on their expression change pattern similarity across multiple studies. 1.2.2 Cluster analysis of gene expression data Numerous clustering algorithms are currently in use. Hierarchical clustering results in a tree structure, where genes on the same branch at the desired level are considered to be in the same cluster. While this structure provides a rich visualization of the gene clusters, it has the limitation that a gene can be assigned only to one branch, i.e., one cluster. This is not always biologically reasonable because most, if not all, genes have multiple functions. In addition, because of the mechanism of one-way assignment of genes to branches, the results may not be globally optimal. The alternative approach, partitioning clustering, includes two major methods, heuristic-based and model-based. The former assigns objects to clusters using a heuristic mechanism, while the latter uses quantifying uncertainty measures. Probability and possibility are the two uncertainty measures commonly used. While probabilistic bodies of evidence consist of singletons, possibilistic bodies of evidence are families of nested sets. Both probability and possibility measures are uniquely represented by distribution functions, but their normalization requirements are very different. Values of each probability distribution are required to add to 1, while for possibility distributions, the largest values are required to be 1. Moreover, the latter requirement may even be abandoned when 6 possibility theory is formulated in terms of fuzzy sets [172-173, 179], The mixture model with the expectation-maximization (EM) algorithm is a well-known method using the probability-based approach. This method has the advantages of a strong statistical basis and a statistics-based model selection. However, the EM algorithm converges slowly, particularly at regions where clusters overlap [27] and requires the data distribution to follow some specific distribution model. Because gene expression data are likely to contain overlapping clusters and do not always follow standard data distributions, (e.g., Gaussian, Chi-squared) [28], the mixture model with the EM method is not appropriate. Fuzzy clustering using the most popular algorithm, Fuzzy C-Means (FCM) [92, 121], is another model-based clustering approach that uses the possibility measure. FCM both converges rapidly and allows assignment of objects to overlapping clusters using the fuzzifier factor, m, where l clustering results when the value of m is equal to 1. However, similar to the EM-based method and the other partitioning methods, FCM has the problem of determining the correct number of clusters. Even if the cluster number is known a priori, FCM may provide different cluster partitions. Cluster validation methods are required to determine the optimal cluster solution. Unfortunately, the clustering model of FCM is a possibility-based one. There is no straightforward statistics-based approach to evaluate a clustering model except the cluster validity indices based on the compactness and separation factors of the fuzzy partitions. However, there are problems with these 7 validity indices; both with scaling the two factors and with over-fit estimates. We have combined the advantages of the EM and FCM methods, where FCM plays the key role in clustering the data, and proposed a method to convert the clustering possibility model into a probability one and then to use the Central Limit Theorem to compute the clustering statistics and determine the data distribution model that best describes the dataset. We applied the Bayesian method with log-likelihood ratio and Akaike Information Criterion measure using the estimated distribution model for a novel validation method for fuzzy clustering partitions [147], 1.2.3 DNA sequence analysis Genes within a cluster are expressed similarly under the given experimental conditions, e.g., a drug treatment, a comparison between the normal and disease tissues, or other biological comparisons of interest. Genes that are similarly expressed may have regulatory DNA sequences in common and appropriate analysis may identify motifs that are overrepresented in these gene sequences. Such motifs may contribute to expression of the entire group [31], In addition, regarding clusters of down-regulated genes, we can apply the sequence analysis to the 3 untranslated regions (3'UTR) potentially discovering motifs connected to miRNAs [32], We may discover the relationships between the miRNAs and the diseases or the treatment conditions targeted by the microarray experiments. Most algorithms developed to find motifs in biological sequences do not successfully identify motifs containing gaps and do not allow a variable number of motif instances in different sequences. In addition, they may converge to local optima 8 [33] and therefore do not guarantee that the motifs are indeed overrepresented in the sequence set. The MEME algorithm [34] solved many of these problems, but still does not model gapped motifs well. By combining the EM algorithm with the hierarchical genetics algorithm, we developed a novel motif discovery algorithm, HIGEDA [145], that automatically determines the motif consensus using Position Weight Matrix (PWM). By using a dynamic programming (DP) algorithm, HIGEDA also identifies gapped motifs. 1.3 Research contributions The rapid growth of microarray repositories has increased the need for effective methods for integration of datasets among laboratories using different microarray technologies. The meta-analysis approach has the advantage of integrating data without the problem of scaling expression values and also has multiple supporting methods to detect and order significantly differentially expressed genes. However, parametric statistics methods may perform differently on different datasets, because of different assumptions about the method model. We propose using the RP method [8] to filter the genes for differential expression. We also propose to integrate RP with a pattern matching ranking method, in a method named RPPR [169], a combination of the Rank Product and Rank Page methods, to more effectively filter gene expression profiles from multiple studies. Meta-analysis of gene expression data does not put differentially expressed genes together at the phenotype level of the experiments. Therefore, the real power of microarrays is not completely exploited. Cluster analysis can discover relationships 9 among genes, between genes and biological conditions, and between genes and biological processes. Most clustering algorithms are not appropriate for gene expression data analysis because they do not allow overlapping clusters and require the data to follow a specific distribution model. We developed a new clustering algorithm, fzGASCE [157] that combines the well-known optimization algorithm, Genetic Algorithm (GA), with FCM and fuzzy subtractive clustering to effectively cluster gene expression data without a priori specification of the number of clusters. Regarding the parameter initialization problem of the FCM algorithm, we developed a novel Fuzzy Subtractive Clustering (fzSC) algorithm [146] that uses a fuzzy partition of data instead of the data themselves. fzSC has advantages over Subtractive Clustering in that it does not require a priori specification of mountain peak and mountain radii, or the criterion on how to determine the number of clusters. In addition, the computational time of fzSC is O(cn) instead of 0(n2), where c and n are the numbers of clusters and data objects respectively. To address the problem of missing data in cluster analysis, we developed two imputation methods, fzDBI and fzPBI, using histogram based and probability based approaches to model the data distributions and apply the model in imputation of missing data. For the problem of cluster model selection and clustering result evaluation with the FCM algorithm, we developed a novel method, fzBLE, [147] that uses the likelihood function, a statistics goodness-of-fit, and a Bayesian method with the Central Limit Theorem, to effectively describe the data distribution and correctly select the optimal fuzzy partition. Using the statistical model of fzBLE, we developed a 10 probability based method for defuzzification of fuzzy partition that helps with generating classification information of data objects using fuzzy partition. In addition to clustering of gene expression data, we also propose an analysis directed at the biological relevance of the cluster results. Using our RPPR method, we provide relationships between the gene clusters and the biological conditions of interest, and determine the clusters which respond positively and negatively to the experimental conditions. Furthermore, our motif-finding algorithm, HIGEDA [145], can discover common motifs in the promoter and mRNA sequences of genes in a cluster. In addition to predicting the possible transcription factor binding sites, HIGEDA can also be used to predict potential miRNA binding sites. Figure 1-1 shows the microarray data analysis key issues which are addressed in this research. The remainder of the dissertation is organized as follows: Chapter 2 presents the background of microarray analysis including microarray technology, microarray data analysis and recent approaches, and the goals of this research; Chapter 3 describes a rigorous analysis of the Fuzzy C-Means algorithm; Chapter 4 describes our methods to address the challenges of gene expression microarray data analysis; and Chapter 5 demonstrates some specific applications of our approach, concludes our contributions and discusses our future work. 11 Microarray from lab (p) Figure 1-1 : Microarray data analysis and the contributions of this research 12 2. Microarray analysis background 2.1 Microarray DNA microarrays rely on the specificity of hybridization between complementary nucleic acid sequences in DNA fragments (termed probes) immobilized on a solid surface and labeled RNA fragments isolated from biological samples of interest [6], A typical DNA microarray consists of thousands of ordered sets of DNA fragments on a glass, filter, or silicon wafer. After hybridization, the signal intensity of each individual probe should correlate with the abundance of the labeled RNA complementary to that probe [1], 2.1.1 Microarray platforms DNA microarrays fall into two types based on the DNA fragments used to build the array: complementary DNA (cDNA) arrays and oligonucleotide arrays. Although a number of subtypes exist for each array type, spotted cDNA arrays and Affymetrix oligonucleotide arrays are the major platforms currently in use. The choice of which microarray platform to use is based on the research needs, cost, available expertise, and accessibility [1], For cDNA arrays, cDNA probes, which are usually generated by a polymerase chain reaction (PCR) amplification of cDNA clone inserts (representing genes of interest), are robotically spotted on glass slides or filters. The immobilized sequences of cDNA probes may range greatly in length, but are usually much longer than those of the corresponding oligonucleotide probes. The major advantage of cDNA arrays is the flexibility in designing a custom array for specific purposes. Numerous genes can be 13 rapidly screened, which allows very quick elaboration of functional hypotheses without any a priori assumptions [120], In addition, cDNA arrays typically cost approximately one-fourth as much as commercial oligonucleotide arrays. Flexibility and lower cost initially made cDNA arrays popular in academic research laboratories. However, the major disadvantage of these arrays is the amount of total input RNA needed. It is also difficult to have complete control over the design of the probe sequences. cDNA is generated by the enzyme reverse transcriptase RNA-dependent DNA polymerase, and like all DNA polymerases, it cannot initiate synthesis de novo, but requires a primer. It is therefore difficult to generate comprehensive coverage of all genes in a cell. Furthermore, managing large clone libraries, and the infrastructure of a relational database for keeping records, sequence verification and data extraction is a challenge for most laboratories. For oligonucleotide arrays, probes are comprised of short segments of DNA complementary to the RNA transcripts of interest and are synthesized directly on the surface of a silicon wafer. When compared with cDNA arrays, oligonucleotide arrays generally provide greater gene coverage, more consistency, and better quality control of the immobilized sequences. Other advantages include uniformity of probe length, the ability to discriminate gene splice variants, and the availability of carefully designed standard operating procedures. Another advantage particular to Affymetrix arrays is the ability to recover samples after hybridization to an array. This feature makes Affymetrix arrays attractive in situations where the amount of available tissue is limited. However, a major disadvantage is the high cost of arrays [1], 14 Following hybridization, the image is processed to obtain the hybridization signals. There are two different ways to measure signal intensity. In the two-color fluorescence hybridization scheme, the RNA from experimental and control samples (referred to as target RNAs) are differentially labeled with fluorescent dyes (Cye5 red vs. Cye3 green) and hybridized to the same array. When the region of the probe is fluorescently illuminated, both the experimental and control target RNAs fluorescence and the relative balance of red versus green fluorescence indicate the relative expression levels of experimental and control target RNAs. Therefore, gene expression values are reported as ratios between the two fluorescent values [1], cells cells ( 1 RNA extraction \.... . - Feature-level data (cel, grp...) Figure 2-1 : Microarray design and screening 15 Affymetrix oligonucleotide arrays use a one-color fluorescence hybridization system where experimental RNA is labeled with a single fluorescent dye and hybridized to an oligonucleotide array. After hybridization, the fluorescence intensity from each spot on the array provides a measurement of the abundance of the corresponding target RNA. A second array is then hybridized to the control RNA, allowing calculation of expression differences. Because Affymetrix array screening generally follows a standard protocol, results from different experiments in different laboratories can theoretically be combined [6], Following image processing, the digitized gene expression data need to be pre- processed for data normalization before carrying out further analysis. 2.1.2 Gene expression microarray data analysis Regarding differentially expressed genes, many protocols use a cutoff of a twofold difference as a criterion. However, this arbitrary cutoff value may be either too high or too low depending on the data variability. In addition, inherent data variability is not taken into account. A data point above or below the cutoff line could be there by chance or by error. To ensure that a gene is truly differentially expressed requires multiple replicate experiments and statistical testing [8], However, not all microarray experiments are array-replicated. We, therefore, need to analyze data generated by different protocols. Statistical analysis, including meta-analysis, uses microarrays to study genes in isolation while the real power of microarrays is their ability to study the relationships between genes and to identify genes or samples that behave in a similar manner. 16 Machine learning and data mining approaches have been developed for further analysis procedures. These approaches can be divided into unsupervised and supervised methods. Unsupervised methods involve the aggregation of samples, genes, or both into different clusters based on the distance between measured gene expression values. The goal of clustering is to group objects with similar properties, leading to clusters where the distance measure is small within clusters and large between clusters. Several clustering methods from classical pattern recognition, such as hierarchical clustering, K- Means clustering, fuzzy C-Means clustering, and self-organizing maps, have been applied to microarray data analysis. Using unsupervised methods, we can search the resulting clusters for candidate genes or treatments whose expression patterns could be associated with a given biological condition or gene expression signature. The advantage of this method is that it is unbiased and allows for identification of significant structure in a complex dataset without any prior information about the objects. In contrast, supervised methods integrate the knowledge of sample class information into the analysis with the goal of identifying expression patterns (i.e., gene expression signatures) which could be used to classify unknown samples according to their biological characteristics. A training dataset, consisting of gene expression values and sample class labels, is used to select a subset of expressed genes that have the most discriminative power between the classes. It is then used to build a predictive model, also called a classifier (e.g., k-nearest neighbors, neural network, support vector machines), which takes gene expression values of the pre-selected set of genes of an unknown sample as input and outputs the predicted class label of the sample. 17 Application specific Technology specific Statistical analysis: meta-analysis, DEG Data mining: Clustering, Classification Sequence analysis Biological Question Experimental Design Experiment Quantification Normalization Pre-processing 1 f 1 f Estimation J Testing i r 1 f Clustering Classification 1 Analysis Motif finding j Regulatory prediction Figure 2-2 : Microarray analysis 2.2 Microarray data meta-analysis 2.2.1 Microarray data integration The integrated analysis of data from multiple studies generally promises to increase statistical power, generalizability, and reliability, while decreasing the cost of analysis, because it is performed using a larger number of samples and the effects of 18 individual study-specific biases are reduced. There are two common approaches for the problem of integrating microarray data: meta-analysis and direct integration of expression values. The direct integration procedure [11-13, 17] can be divided into the following steps. First, a list of genes common to multiple distinct microarray platforms is extracted based on cross-referencing the annotation of each probe set represented on the microarrays. Cross-referencing of expression data is usually achieved using UniGene or LocusLink/EntrezGene databases or the best matching mapping provided by Affymetrix. Next, for each individual dataset, numerically comparable quantities are derived from the expression values of genes in the common list by application of specific data transformation and normalization methods. Finally, the newly derived quantities from individual datasets are combined. Direct integration methods include the following: Ramaswamy et al. [14] re-scaled expression values of a common set of genes for each of five cancer microarray datasets generated by independent laboratories using different microarray platforms. Combining them to form a dataset with increased sample size allowed identification of a gene expression signature that distinguished primary from metastatic cancers. Bloom et al. [15] used a scaling approach based on measurements from one common control sample to integrate microarray data from different platforms. Shen et al. [16] proposed a Bayesian mixture model to transform each raw expression value into a probability of differential expression (POD) for each gene in each independent array. Integrating multiple studies on the common probability scale of POD, they developed a 90-gene meta-signature that predicted relapse-free 19 survival in breast cancer patients with improved statistical power and reliability. In addition to common data transformation and normalization procedures, Jiang et al. [17] proposed a distribution transformation method to transform multiple datasets into a similar distribution before data integration. Data processed by distribution transformation showed a greatly improved consistency in gene expression patterns between multiple datasets. Wamat et al. [18] used two data integration methods, median rank scores and quartile discretization, to derive numerically comparable measures of gene expression from independent datasets. These transformed data were then integrated and used to build support vector machine classifiers for cancer classification. Their results showed that cancer classification based on microarray data could be greatly improved by integrating multiple datasets with a similar focus. The classifiers built from integrated data showed high predictive power and improved generalization performance. A major limitation of these direct integration methods is that filtering genes to generate a subset common to multiple distinct microarray platforms often excludes many thousands of genes, some of which may be significant. However, data transformation and normalization methods are resource-sensitive [20]; one method may be a best fit for some datasets, but not for others. It is difficult to come to a consensus regarding a method that is best for data transformation and normalization on given datasets. Several studies have shown that expression measurements from cDNA and oligonucleotide arrays may show poor correlation and may not be directly comparable 20 [21], These differences may be due to variances in probe content, deposition technologies, labeling and hybridizing protocols, as well as data extraction procedures (e.g., background correction, normalization, and calculation of expression values). For example, cDNA microarray data is usually defined as ratios between experimental and control values and cannot be directly compared with oligonucleotide microarray data that are defined as expression values of experimental samples. Across-laboratory comparisons of microarray data has also demonstrated that sometimes there are larger differences between data obtained in different laboratories using the same microarray technology than data obtained in the same laboratory using different microarray technologies [22], Wang et al. [116] also showed that the agreement between two technologies within the same lab was greater than that between two labs using the same technology; the lab effect, especially when confounded with the RNA sample effect, usually plays a bigger role than the platform effect on data agreement. Commercial microarrays, such as Affymetrix arrays, have produced several generations of arrays to keep up with advances in genomic sequence analysis. The number of known genes and the representative composition of gene sequences are frequently updated and probe sets are modified or added, to better detect target sequences and to represent newly discovered genes. A recent study has shown that expression measurements within one generation of Affymetrix arrays are highly reproducible, but that reproducibility across generations depends on the degree of similarity of the probe sets and the levels of expression measurements [17], Therefore, 21 even when using the same microarray technology, different generations of microarrays make direct integration difficult. Technical variabilities, which result from differences in sample composition and preparation, experimental protocols and parameters, RNA quality, and array quality, pose further challenges to the direct integration of microarray data from independent studies. Vert et al. [20], and Irizarry et al. [22] described the lab-affect for microarray data and concluded that direct integration of expression data is not appropriate. Cahan et al. [21] and Warnat et al. [24], using two methods to identify differentially expressed genes prior to carrying out a classification analysis, showed that gene expression levels themselves could not be directly compared between different platforms. Therefore, we propose to utilize the meta-analysis, instead of direct integration, before further computer analyses. The meta-analysis method, in contrast to the direct integration method, combines results from individual analyses. It therefore avoids the problem of scaling the gene expression levels among datasets from different laboratories. A number of studies have shown that meta-analysis provides a robust list of differentially expressed genes [21-23], With the increasing use of next generation sequencing techniques, microarrays are no longer the only high-throughput technology for gene expression studies. The direct integration approach appears to be inappropriate to those repositories, because we can only scale the expression values across multiple experiments using the same technology. Even with the same technology, e.g., Affymetrix oligonucleotide and 22 cDNA microarrays, the use of this approach seems impossible with scaling expression values. 2.2.2 Issues with microarray data meta-analysis Considering that meta-analysis involves the whole process from choosing microarray data to detecting differentially expressed genes, the key issues are as follows. Issue 1: Identify suitable microarray datasets The first step is to determine which datasets to use regarding the goal of analysis. This is done by determining the inclusion-exclusion criteria. The most important criterion is that the datasets collected should be in the standard gene expression data format, e.g. features by samples [23], Issue 2: Annotate the individual datasets Microarray probe design uses short, highly specific regions in genes of interest because using the full-length gene sequences can lead to non-specific binding or noise. Different design criteria lead to the creation of different probes for the same gene. Therefore, one needs to identify which probes represent a given gene within and across the datasets. The first option is to cluster probes based on sequence data [11, 17], A sequence match method is especially appropriate for cross-platform data integration, as well as Affymetrix cross-generation data integration. However, the probe sequence may not be available for all platforms and the clustering of probe sequences could be computationally intensive for very large numbers of probes. 23 Alternatively, one can map probe-level identifiers such as Image ClonelD, Affymetrix ID, or GenBank accession numbers to a gene-level identifier such as UniGene, RefSeq, or LocusLink/EntrezGene. UniGene, which is an experimental system for automatically partitioning sequences into non-redundant gene-oriented clusters, is a popular choice to unify the different datasets. For example, UniGene Build #211 (released March 12, 2008) reduces the nearly 7 million human cDNA sequences to 124,181 clusters. To translate probe-level identifiers to gene-level identifiers, one can use either the annotation packages in BioConductor, or the identifier mapping tables provided by NCBI and AffymetrixID for LocusLink/Entrez ID to probe ID, or UniGene for probe ID to RefSeq. Issue 3: Resolve the many-to-many relationships between probes and genes The relationship between probes and genes is unfortunately not unequivocal, which means that in some cases a probe may report more than one gene, and vice versa. Even using the same Affymetrix platform, the combination of different chip versions creates serious difficulties, because the probe identification labels (IDs) are not conserved from chip to chip. Therefore, to combine microarray data across studies, a unique nomenclature must be adopted and all the different IDs of the chips must be translated to a common system. It is reasonable that many probe identifiers are mapped onto one gene identifier. This is due to the current UniGene clustering and genome annotation, because multiple probes per gene provide internal replicates, and allow for poor performance of some probes without the loss of gene expression detection, and because microarray chips 24 contain duplicate spotted probes. The issue is when a probe identifier can be mapped to many gene identifiers. This may lead to a problem with the further meta-analysis. For example, a probe could map to gene X in half of the datasets, but to both genes X and Y in the remaining datasets. The further meta-analysis will treat such probes as two separate gene entities, failing to fully combine the information for GenelD X from all studies. If one simply throws away such probes, valuable information may be lost to further analysis. Issue 4: Choosing the meta-analysis technique The decision regarding which meta-analysis technique to use depends on the specific application. In this context, we focus on a fundamental application of microarrays: the two-class comparison, e.g. the class of treatment samples and the class of control samples, where the objective is to identify genes differentially expressed between two specific conditions. Let XNxp represent the expression matrix of selected datasets from multiple studies; where N is the number of common genes in the P selected samples. Let K be the number of studies from which the samples were selected. Denote by Pk the number of samples that belong to the kth study. Let Pkcbe the number of control samples and PkT be the number of treatment or experimental samples from the kth study, then K Pk =Pkc+Pk and P = ^ Pk. For Affymetrix oligonucleotide array experiments, we k=l have Pkc chips with gene expression measures from the control class and PkT chips with gene expression measures from the treatment class. For the two-channel array 25 experiments, we assume that the comparisons of log-ratios are all indirect, i.e., the samples from the treatment class are hybridized against a reference sample RA. Then the expression values from the kth study are collected into X: xkj =log2(Tj/R)J = 1,---,PkT and xkq =log2(Cq/R),q = 1,...,Pkc. The meta-analysis will be performed on the results from analyses that were performed on X. There are four common ways to integrate such information across studies. Vote counting This method simply counts the number of studies where a gene was declared as significant [48], If the number of studies is small, this method can be visualized using Venn diagrams. Combining ranks Unlike vote counting, this technique accounts for the order of genes declared significant [30, 52-54], There are three different approaches to aggregate the rankings of, say, the top 100 lists (the 100 most significantly up-regulated or down-regulated genes) from different studies [23], Two of the algorithms use Markov chains to convert the pair-wise preference between the gene lists to a stationary distribution; the third algorithm is based on an order-statistics model. The rank product method, proposed by Breitling et al. [29] is the most popular of this approach using the third algorithm. It is a non-parametric statistic which is originally used to detect differentially expressed genes in a single dataset. It is derived from the biological reasoning of the fold change (FC) criterion and it detects the genes that are consistently found among the most strongly up- 26 regulated (or down-regulated) genes in a number of replicate experiments. However, this method offers a natural way to overcome the heterogeneity among multiple datasets and, therefore, can be extended to meta-analysis, which generates a single significance measurement for each gene in the combined study [30], Within a given dataset of the study k, the pairwise fold change is computed for every gene i,pFCkq = Tk/C k = l,...,P^,q = 1,...,P^ There are P^ x P(. such ratio values for each gene i, let R=P^ xPk These ratios are ranked within comparison to generate a ranking matrix rFC\:R. The rank product (RP) is computed as the geometric mean of rank for each gene i, by RP;/n can be interpreted as a p-value, because it describes the probability of observing gene i at rank rFC, or better in the qth comparison, q=l,...,R. A permutation procedure is independently carried out to generate all other possible combinations of ranks relative to gene i in the dataset, nRP,. Repeat this procedure B times, where B is a positive number, to form reference distribution with nRP; across R comparisons which is then used to compute the (adjusted) p- value and false discovery rate (FDR) for each gene. Combining p-values In the 1920s, Fisher developed a meta-method that combined the p-values from individual studies [117], In the kth study, for the ith gene, the p-value p;k is generated by one-sided hypothesis testing. The logs of the p-values of the same (2.1) 27 gene are summed across K studies using the Fishers sum of logs method which is defined as follows [50, 51], Si = -2^]log2(pik) (2.2) k=l The S; values are then assessed by comparison against a chi-square distribution with 2K degrees of freedom, X2K Rhodes et al. [36] proposed a statistical model for performing meta- analysis of four independent prostate cancer microarray datasets from cDNA arrays and Affymetrix arrays. Each gene in each study was treated as an independent hypothesis and a significance (denoted by one p value and one q value) was assigned to each gene in each study based on random permutations. The similarity of significance across studies was assessed with meta-analysis methods and combined with multiple inference statistical tests for each possible combination of studies. A cohort of genes was identified to be consistently and significantly deregulated in prostate cancer. Marot et al. [40] used a sequential meta-analysis method in which they computed p-values in each sequential step and combined p-values from different sequential steps to determine the p-values for the results which are either significant genes or biological processes throughout the entire analysis process. Combining effect-sizes The first step is to calculate the effect size d,k and the variance 5,k associated with the effect size for gene i in the kth study, i = 1,... ,N; k = 1,... ,K. 28 (2.3) where S| is the pooled standard deviation estimated across N genes in the dataset of the kth study. Based on the mean differences, an estimated pooled mean difference, p;, and its variance, of, are computed for each gene across K studies. A z-score z, is then calculated using p, andaf: Effect size can be calculated using the correlation coefficient and the method of Cohen [135], which is the difference between the means of two groups standardized by its pooled standard deviation [37, 48, 51], The statistical significance can be determined using a comparison against N(0,1) on these z- scores. However, if the number of studies, K, is small, there may be a problem with over-fitting. In this case, the permutation method is used instead. The method from Hedges and Olkin [136] showed that this standardized difference overestimates the effect size for studies with small sample sizes. They proposed a small correction factor to calculate the unbiased estimate of the effect size, which is known as the Hedges adjustment. The study-specific effect sizes for every gene are then combined across studies into a weighted average. The weights are inversely proportional to the variance of the study-specific (2.4) (2.5) 29 estimates. Choi et al. [37-38] introduced a new meta-analysis method, which combines the results from individual datasets in the form of effect size and has the ability to model the inter-study variation. The effect size was defined to be a standardized mean difference between the test samples and normal samples. The effect sizes from multiple microarray datasets were combined to obtain an estimate of the overall mean, and statistical significance was determined by the permutation test extended to multiple datasets. It was demonstrated that data integration using this method promoted the discovery of small but consistent expression changes and increased the sensitivity and reliability of the analysis. An extended effect size model was then proposed by Hu et al. [39] for meta- analysis of microarray data. Issue 5: Samples/features selection Regardless of the methods of data integration, finding features or samples that together represent the biological meaning for the integrated datasets remains a challenge. Lin [42] used a heuristics method to detect significant genes from integrated data. The underlying spaces were determined using a heuristics search. Then, a set of candidate genes lists were identified again using a heuristics search. Parkhomenko et al. [43] used principle component analysis (PCA) to determine the most important components of the data spaces. Pairs of components are processed one by one using a correlation measurements method to look for the best combination of components in the dataset. This analysis is also a heuristic-based one. Su et al. [44] used a set of genes similar to the gene expression signature targeted to the biological context of interest, to 30 detect candidate genes called VIP (Very Important Pool) genes. These genes control the search process to find the significant samples (in the integrated dataset, the samples are columns, as we mentioned in Issue 1) and genes from the integrated data. Tan et al. [47] approached the problem using GA and a resampling method; the feature and sample candidates were selected under a controlled process using GA and the fitness function was computed using statistical test results on resample sets. Traditional methods to identify similar samples on the basis of gene expression values are Principle Component Analysis and Hierarchical Cluster Analysis. These methods, however, have the problem of strongly relating samples of the same cell lines and batches because of the similarity among cells grown at the same time [25], In addition, these methods require all expression profiles to be generated on the same microarray platform, because the determination of distances between expression profiles from different platforms is not straightforward, which limits use to the available microarray repositories. Lastly, these methods may not benefit from additional information regarding the biological context of interest. For such cases, Huttenhower et al. [49] proposed a meta-analysis method based on a statistical approach within a Bayesian framework. Considering the biological context of the analysis, they measured the distance between every pair of genes and constructed a Bayesian framework for each specific biological context based on the set of arrays. The Bayesian framework was then used to compute the probability of the relevance of each array regarding the context. This probabilistic measurement can be used to filter the samples. Because their method requires a set of genes targeting the biological context, in addition to a 31 definition by the biologists, the biological context can be defined by a set of genes having specific GO (Gene Ontology) terms, or participating in a specific biological process, or belonging to a gene expression signature. Lamb et al. [25] and Zhang et al. [26], by using gene expression signatures proposed pattern-matching, used ranking-based approaches to perform meta-analysis on CMAP datasets. Genes in each expression profiles were ranked using fold changes. Expression profiles were then integrated based on how they fit a given gene expression signature. Their research successfully identified a number of interesting compounds that correlated with known biological processes. 2.3 Cluster analysis After meta-analysis, we identify a list of genes differentially expressed in a single study or across multiple studies. Using this gene list, we filter out the matrix of expression levels of genes significantly expressed under the experimental conditions. Cluster analysis then groups genes where genes within the same cluster have similar expression patterns across studies. This analysis allows the simultaneous comparison of all clusters independently of the platform used and the species studied; similar samples under specific biological conditions from different species can also be analyzed together by meta-analysis. This comparison allows identification of: i) robust signatures of a pathology or a treatment across several independent studies, ii) sets of genes that may be similarly modulated in different disease states or following drug treatments, iii) common sets of co-expressed genes between human and animal models. Because a typical microarray study generates expression data for thousands of genes from a 32 relatively small number of samples, an analysis of the integrated data from several sources or laboratories may provide outcomes that may not necessarily be related to the biological processes of interest. Cluster analysis will further explore the groups of genes that are absolute key agents of the experimental results. Given a set of n objects X = {xj, X2, ...,x}, let 0 = {U, V}, where cluster set V = {v i, V2,...,VC} and partition matrix U={ukj, k=l,...,c, i=l,...,n, be a partition of X such that X = ^jck=lvkand uki = 1,i = 1,______,n Each subset Vk of X is called a cluster and {uki} is the membership degree of {x, J to Vk. Uki e {0,1J if 9 is crisp partition, otherwise, Uki e[0,l]. The goal of cluster analysis is to assign objects to clusters such that objects in the same cluster are highly similar to each other while objects from different clusters are as divergent as possible. These sub-goals create what we call the compactness and separation factors that are used, not only for modelling the clustering objectives, but also for evaluating the clustering result. These two parameters can be mathematically formulated in many different ways that lead to numerous clustering models. A dataset containing the objects to be clustered is usually represented in one of two formats, the object data matrix and the object distance matrix. In an object data matrix, the rows usually represent the objects and the columns represent the attributes of the objects regarding the context where the objects occur. The roles of the rows and columns can be interchanged for another representation method, but this one is preferred because the number of objects is always enormously large in comparison with the number of attributes. Here objects are genes and attributes are the expression levels of the genes under the experimental conditions. Assume we have n genes and p 33 experimental conditions. The gene data matrix X then has n rows and p columns where Xij is the expression level of gene i under the condition j and x, is the expression vector (or object vector) of gene i across p experiments. The distance matrix contains the pairwise distance (or dissimilarity) of objects. Specifically, the entry (ij) in the distance matrix represents the distance between objects i and j, l
and j can be computed using the object vectors i and j from the object data matrix basedon a distance measurement. However, the object data matrix cannot be fully recovered from the distance matrix, especially when the value of p is unknown. 2.3.1 Clustering algorithms Clustering algorithms are classified into either hierarchical or partitioning approaches. While the hierarchical methods group objects into a dendrogram (tree structure) based on the distance among clusters, the partitioning methods directly groups objects into clusters and objects are assigned to clusters based on the criteria of the clustering models. Clustering methods using a partitioning approach are classified into either model-based or heuristic-based approaches. Model-hased clustering assumes that the data were generated by a model and tries to recover the original model from the data. The model obtained from the data then defines the clusters and assigns objects to them. Although K-means is a heuristic-based approach, once a set of K centers that are good representatives for the data is found, we can consider it as a model that generates the data with membership degrees in {0,1} and an addition of noise. In this context, K- Means can be viewed as a model-based approach. 34 2.3.2 Hierarchical clustering algorithms Hierarchical clustering (HC) builds a dendrogram (hierarchical structure) of objects using either agglomerative (bottom-up) or divisive (top-down) approaches. In the former approach, the dendrogram is initially empty; each object is in its own sub- tree (singleton cluster). The clustering process merges similar sub-trees. At each step, the two sub-trees most similar are merged to form a new sub-tree. The process stops when the desired number of sub-trees, say c, at the highest level is achieved. In the top down approach, the dendrogram starts with a sub-tree that contains all the objects. The process then splits the sub-trees into new sub-trees. At each step, the sub-tree of objects with maximum dissimilarity is split into two sub-trees. The process stops when the number of leaves, say c, or the compactness and separation criteria, is achieved. The dissimilarity between every pair of clusters is measured by either the single linkage: Ls(vk,v1) = min{d(xI,xJ);xI Evk,x] e vj, (2.6) the complete linkage: Lc(vk,Vi) = max{d(xI,xJ);xI Evk,x] e vj, (2.7) or the average linkage: Zd(xixj) La(V1cAi) = XiGVk,XjGVi (2.8) The clustering process results in a set of c sub-trees at the level of interest in each approach. In comparison with the partitioning approach, each such sub-tree corresponds to a cluster. 35 Because of their methods of grouping or splitting sub-trees, HC algorithms are considered heuristic-based approaches. The common criticism for standard HC algorithms is that they lack robustness and, hence, are sensitive to noise and outliers. Once an object is assigned to a cluster, it will not be considered again, meaning that HC algorithms are not capable of correcting prior misclassifications and each object can belong to just one cluster. The computational complexity for most HC algorithms is at least 0(n2) and this high cost limits their application in large-scale datasets. Other disadvantages of HC include the tendency to form spherical shapes and reversal phenomenon, in which the normal hierarchical structure is distorted. 2.3.3 K- Means clustering algorithm The K-Means algorithm is a partition approach using a heuristic-based method. Its objective function is based on the square error criterion, J(U,V|X) = ^^ukl||xi-vk||2, where uki e{0,l}. (2.9) k=l i=l K-Means is the best-known square error-based clustering algorithm. Its objective, J, is based on both compactness and separation factors. Minimizing J is equivalent to minimizing the compactness while maximizing the separation. 36 K-Means algorithm Steps 1) Initialize the partition matrix randomly or based on some prior knowledge. 2) Compute the cluster set V. 3) Assign each object in the dataset to the nearest cluster. 4) Re-estimate the partition matrix using the current cluster set V. 5) Repeat Steps 1, 2 and 3 until there is no object changing its cluster. The K-Means algorithm is very simple and can easily be implemented to solve many practical problems. It works very well for compact and hyper-spherical clusters. The computational complexity of K-Means is O(cn). Because c is much less than n, K- Means can be used to cluster large datasets. The drawbacks of K-Means are the lack of an efficient and universal method for identifying the initial partitions and the number of clusters c. A general strategy for the problem is to run the algorithm many times with randomly initialized partitions, although this does not guarantee convergence to a global optimum. K-Means is sensitive to outliers and noise. Even if an object is distant from the cluster center, it can still be forced into a cluster, thus, distorting the cluster shapes. 2.3.4 The mixture model with expectation-maximization algorithm Under the probability context, objects can be assumed to be generated according to several probability distributions. Objects in different clusters may be generated by different probability distributions. They can be derived from different types of density functions (e.g., multivariate Gaussian or t-distribution), or the same families, but with 37 different parameters. If the distributions are known, finding the clusters of a given dataset is equivalent to estimating the parameters of several underlying models. Denote by P(Vk) the prior probability of cluster Vk and P(vk) = l, the conditional probability of object x; given clustering partition 9 = {U,V} is, p(x! |0) = l]P(xI |uk,vk)xP(vk). (2.10) k=l Given an instance of 9, the posterior probability for assigning a data point to a cluster can easily be calculated using Bayes theorem. The mixtures therefore can be applied using any type of components. The multivariate Gaussian distribution is commonly used due to its complete theory and analytical tractability. The likelihood function P(X|9), P(X | 9) = ]_['1iP(xi | 9) is the probability of generating X using the 9 model. The best model 9*, therefore, should maximize the log likelihood function, L(X 19) = log(P(X 19)) = ^_ilog(P(xi 19)), L(X|9*) = max{L(X|9t)}. (2.11) 9* can be estimated using the expectation-maximization (EM) algorithm. EM considers X a combination of two parts: X is the observations of X; XM is the missing information of X regarding 9, and XM is similar to U of crisp clustering. The complete data log likelihood is then defined as, L(X 10) = L(X,Xm 10) = loghv^xplx" eu] (2.12) 1=1 k=l 38 By using an initial value of 9, 9, EM searches for the optimal value of 9 by generating a series of 9 estimates {9, 91,..9T}, where T represents the reaching of the convergence criterion. EM algorithm Steps 1) Randomly generate a value of 9, 9, set t=0. 2) E-step: Compute the expectation of the complete data log likelihood, Q(9, 9l) = E( L(X|9)). 3) M-step: Select a new parameter estimate that maximizes Q(.), 9t+1 = maxe{ Q(9,9l)}. 4) Set t=t+l. Repeat Steps 2 and 3 until the convergence criterion holds. EM assumes that the data distribution follows specific multivariate distribution model, for example, a Gaussian distribution. The major disadvantages of the EM algorithm are the sensitivity to the selection of the initial value of 9, the effect of a singular covariance matrix, the possibility of converging to a local optimum, and the slow convergence rate. In addition, because most biological datasets do not follow a specific distribution model, EM may not be appropriate. 2.3.5 Fuzzy clustering algorithm The clustering techniques we have discussed so far are referred to as hard or crisp clustering, which means that each object is assigned to only one cluster. For fuzzy clustering, this restriction is relaxed, and the object x, can belong to cluster Vk, k=l,.. ,,c, 39 with certain degrees of membership, Uki, Uki e[0,l]. This is particularly useful when the boundaries among the clusters are not well separated, and are therefore ambiguous. Moreover, the memberships may help us discover more sophisticated relationships between a given object and the disclosed clusters. Fuzzy C-Means (FCM) is one of the most popular fuzzy clustering algorithms. FCM attempts to find a fuzzy partition for a set of data points while minimizing the objective function as (2-7). By using a fuzzifier factor, m, FCM is flexible in managing the overlap regions among clusters, hence J(U,V|X) = Â£i>2||*i-vl|!. (2.13) 1=1 C=1 FCM algorithm Steps 1) Randomly initialize the values of partition matrix U, set t=0. 2) Estimate the value of vl using I.r\ (2.14) 2>L)n 3) Compute the new values of Ut+1 that maximally fit with vl. x; v! 2 / ^ Xl-V; 2 / J=1 V J ) 1 4) Repeat Steps 2 and 3 until either {vk} or {u J is convergent. (2.15) 40 As with the EM and K-Means algorithms, FCM performs poorly in the presence of noise and outliers, and has difficulty with identifying the initial parameters. To address this issue, FCM is integrated with heuristic-based search algorithms, such as GA, Ant colony (AC), and Particle swarm optimization (PSO). The mountain method or Subtractive clustering can also be used as an alternative to search for the best initial value of V of the FCM algorithm. Due to Possibility Theory limitations, it is difficult to recover data from a fuzzy partition model. However, recent research has shown that it is possible to construct a probability model using the possibility one. Therefore, by using a possibility to probability transformation, it is possible to recover the data from the fuzzy partition model. The FCM algorithm is considered to be model-based algorithm. The FCM algorithm has advantages over its crisp/probabilistic counterparts, especially when there is a significant overlap between clusters [57-59], FCM can converge rapidly and provides more information about the relationships between genes and groups of genes in the cluster results. We therefore choose the FCM algorithm for the cluster analysis of gene expression data. However, the FCM algorithm also has some drawbacks. The objective function can help to estimate the model parameters, but cannot distinguish the best solution from the numerous possibly local optimum solutions. The lack of an effective cluster validation method for FCM prevents its application to real world datasets, where the number of clusters is not known. FCM is well known for its rapid convergence, but this is not guaranteed to reach the global optimum. Because of these issues, further analysis with FCM will be carried out in Chapter 3 that will help us develop a novel method. 41 2.4 Gene regulatory sequence prediction The cluster analysis process produces groups of genes with similar expression patterns. Genes in each group may also have similar regulatory sequences. Understanding global transcriptional regulatory mechanisms is one of the fundamental goals of the post-genomic era [69], Conventional computational methods using microarray data to investigate transcriptional regulation focus mainly on identification of transcription factor binding sites. However, many molecular processes contribute to changes in gene expression, including transcription rate, alternative splicing, nonsense- mediated decay and mRNA degradation (controlled, for example, by miRNAs, or RNA binding proteins). Thus, computational approaches are needed to integrate such molecular processes. miRNAs bind to complementary sites within the 3'-UTRs of target mRNAs to induce cleavage and repression of translation [175], In the past decade, several hundred miRNAs have been identified in mammalian cells. Accumulating evidence thus far indicates that miRNAs play critical roles in multiple biological processes, including cell cycle control, cell growth and differentiation, apoptosis, and embryo development. At the biochemical level, miRNAs regulate mRNA degradation in a combinatorial manner, i.e., individual miRNAs regulate degradation of multiple genes, and the regulation of a single gene may be conducted by multiple miRNAs. This combinatorial regulation is thought to be similar in scope to transcription factor partner regulation. Therefore, in addition to transcription factors and other DNA/RNA-binding proteins, comprehensive 42 investigations into transcriptional mechanisms underlying alterations in global gene expression patterns should also consider the emerging role of the miRNAs. Motif finding algorithms are designed to identify transcription factor binding sites (TFBS) and miRNA complementary binding sites (MBS). These sequences are usually short, ~ 8bp and 23bp, respectively. In addition, each gene may have zero, one, or more such binding sites (BS). These BSs or motifs may be over-represented differently in DNA and RNA sequences of genes in the same group. Motifs are difficult to recognize because they are short, often highly degenerate, and may contain gaps. Although motif- finding is not a new problem, challenges remain because no single algorithm both describes motifs and finds them effectively. Popular methods for motif-finding use a position specific score matrix (PSSM) that describes the motifs statistically, or consensus strings that represent a motif by one or more patterns that appear repeatedly with a limited number of differences. Of the two, PSSM is preferred because it is more informative and can easily be evaluated using statistical methods. In addition, a consensus motif model can be replaced by a PSSM one. One of the most popular motif-finding algorithms using PSSM is MEME proposed by Bailey and Elkan [70], The advantage of MEME is that it uses expectation- maximization (EM) which, as a probability-based algorithm, produces statistically significant results if it can reach a global optimum. The disadvantage is that for motif seeds, it uses existing subsequences from within the sequence set and, as a result, may fail to discover subtle motifs. Chang et al. [72] and Li et al. [76] overcame this drawback by using the genetics algorithm (GA) to generate a set of motif seeds 43 randomly. However, because they used GA with random evolution processes, a rapid convergence to a solution is not assured. Li et al. [77] improved on MEME by using one instance of a position weight matrix (PWM), a type of PSSM, to represent a motif, and statistical tests to evaluate the final model. However, because EM may converge to local optima, use of a single PWM may fail to find a globally optimal solution. The GA- based methods of Wei and Jensen [83] and Bi [71] use chromosomes to encode motif positions. The method in the former [83] is appropriate for models with zero to one occurrence of the motif per sequence (ZOOPS); the latter [71] is appropriate for models with one occurrence per sequence (OOPS), because one variable can be used to represent the single motif occurrence in each sequence. However, Li et al. [77] recently showed that ZOOPS and OOPS are inadequate when not every sequence has the same motif frequency, and that the two-component mixture (TCM) model, which assumes a sequence may have zero or multiple motif occurrences, should be used. However, TCM requires a set of variables for every sequence to manage motif positions, and hence, the size of a chromosome can approach the size of the dataset [145], Lastly, the above algorithms are restricted to finding gapless motifs and, therefore, will fail to find many functionally important, gapped motifs. While some methods, e.g., pattern-based methods of Pisanti et al. [81] and Frith [73], allow gapped motifs, they require the gapped patterns to be well-defined and they generate gap positions randomly or by using a heuristic method. Alternatively, Liu et al. [78] used neural networks to find gapped motifs, but their approach required a limited and specific definition of the neural network structure. 44 2.5 Datasets In order to evaluate the previous studies as well as our proposed methods for microarray analysis, we will use multiple data types with different levels of complexity. Artificial datasets are used in both clustering and motif-finding and are preferred for testing methods because of the ease in generating and running benchmarks. For cluster analysis, we generated datasets using a finite mixture model of data distribution [27], because known cluster structures are ideal for evaluating the capability of an algorithm for determining the number of clusters and the cluster prototypes. For sequence analysis, we generated motifs in the set of randomly generated DNA sequences using the normal distribution model [71] with a known number of motif occurrences and locations in every sequence. Some popular datasets in the Machine Learning Repository, University of California Ivrine (UCI) [84], are also used to show how effective our methods are for cluster analysis using real datasets. The most complex datasets, e.g., gene expression data, including CMAP datasets, are the most challenging of clustering problems. Lastly, for sequence analysis, we used eight DNA transcription factor binding site datasets; two eukaryotic datasets: ERE and E2F [71], and six bacterial datasets: CRP, ArcA, ArgR, PurR, TyrR and IHF [80], 2.5.1 Artificial datasets for cluster analysis Artificial datasets were generated using a finite mixture model of data distribution [27] with different numbers of data objects, clusters, and dimensions. The clusters were generated with a slight overlap (overlapping ratio = 0.1). For test purposes, we selected five datasets, named from ASET1 to ASET5, where the first four datasets are uniform 45 and the last one is non-uniform. Datasets of different dimensions support the stability of our methods. We expect our methods to properly evaluate clustering results, successfully detect the correct number of clusters, properly locate the cluster centers, and to correctly assign data points into their actual clusters. 2.5.2 Artificial datasets for sequence analysis We generated simulated DNA datasets as in Bi [71], with three different background base compositions: (a) uniform, where A, T, C, G occur with equal frequency, (b) AT-rich (AT = 60%), and (c) CG-rich (CG = 60%). The motif string, GTCACGCCGATATTG, was merged once or twice into each sequence, after a defined level of the string change: (i) 9% change representing limited divergence (i.e. 91% of symbols are identical to the original string), (ii) 21% change, or (iii) 30% change which is essentially background or random sequence variation. 2.5.3 Real clustering datasets Some popular real datasets from the UCI Machine Learning Repository [84] were used to test our method. Iris data The Iris dataset contains information about the sepal length, the sepal width, the petal length, and the petal width of three classes of Iris flowers; setosa, versicolor and virginica. Each class has 50 data objects from a total of 150 data objects in the dataset. This dataset does not contain missing values. The expected results should have the three classes with their own data objects. 46 Wine This dataset contains 178 data objects, each with 13 attributes corresponding to the measures of alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines and proline. The data objects are known to belong to three different classes. There are no missing values in the dataset. We expect to detect three clusters in the dataset and properly assign the data objects to their classes. Glass This dataset contains 214 data objects, each having ten attributes of which nine are the glass characteristics of refractive index, sodium, magnesium, aluminum, silicon, potassium, calcium, barium, and iron content. The last attribute, a value ranging from 1 to 6, is for glass identification. The dataset with the first nine attributes is used for clustering and the results should contain six clusters and assign the data objects to their proper class identification. Breast Cancer Wisconsin The Breast Cancer Wisconsin (BCW) dataset contains 699 data objects with nine attributes regarding the clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. The objects of the datasets are from two different classes of tumor state: benign and malignant. The number of classes in this dataset is unknown. Recent research has concluded that 6 is a reasonable number of classes. We expect to discover 6 clusters in the dataset using our methods. 47 2.5.4 Gene expression datasets Yeast and Yeast-MIPS data The yeast cell cycle data showed expression levels of approximately 6000 genes across two cell cycles comprising 17 time points [85], By visual inspection of the raw data, Cho et al. [85] identified 420 genes that show significant variation. From the subset of 420 genes, Yeung et al. [102] selected 384 genes that achieved peak values in only one phase, and obtained five standard categories by grouping into one category genes that peaked during the same phase. Among the 384 selected genes, Tavazoie et al. [118], through a search of the protein sequence database, MIPS [86], found 237 genes that can be grouped into 4 functional categories: DNA synthesis and replication, organization of centrosome, nitrogen and sulphur metabolism, and ribosomal proteins. The functional annotations show the cluster structure in the dataset. We named these two subsets Yeast and Yeast-MIPS respectively. We expect clustering results to approximate the five class partition in the former and a four class partition in the latter. RCNS Rat Central Nervous System data The RCNS dataset was obtained by reverse transcription-coupled PCR designed to study the expression levels of 112 genes over nine time points during rat central nervous system development [87], Wen et al. [87] classified these genes into four functional categories based on prior biological knowledge. These four classes are external criterion in this dataset. 48 2.5.5 CMAP datasets CMAP datasets contain expression profiles from multiple drug treatments performed at different concentrations. Originally there were 164 distinct small- molecules used in 453 different treatment instances resulting in 564 expression profiles. These datasets have been used to detect associations between genes and drugs. Lamb et al. [25] and Zhang et al. [26] were successful in recovering and exploiting some gene- drug associations using gene expression signatures with matching-and-ranking based methods. These datasets will be used in our approach to demonstrate that we can replicate the results of Lamb et al. [25] and Zhang et al. [26], In addition, more gene expression signatures will be discovered and the associations between DNA motifs and drugs will be established using our methods. By discovering novel gene expression signatures and related pathways for the drugs, we expect to exploit the desired effects as well as the side-effects of drug treatments. Additionally, by using our motif-finding methods, we hope to predict some of the associations between drugs and DNA sequences and the connections between miRNAs and diseases. 2.5.6 Real biological sequence datasets A. Eukaryotic transcription factor binding sites datasets E2F dataset E2F transcription factors play a key role in the regulation of cell cycle-associated genes. Identifying target genes for these factors is a major challenge. We used the set of 25 sequences from Kel et al. [35] that contain 27 instances of the E2F transcription 49 factor binding site. The goal is to discover motif occurrences that locate exactly or overlap with binding sites at known positions. ERE dataset The estrogen receptor (ER) is a ligand-activated enhancer protein that is a member of the steroid/nuclear receptor superfamily. Two genes encode mammalian ERs, ERalpha and ERbeta. In response to estradiol (E2), ER binds with high affinity to specific DNA sequences called estrogen response elements (EREs) and transactivates gene expression. We used the set of 25 sequences from Klinge [14] each of which contains an instance of the ERE. The purpose is to detect the motif sequences located exactly at or overlapping with EREs. B. Bacterial transcription factor binding sites datasets CRB dataset CRP is one of seven "global" transcription factors in E. coli known to regulate more than 100 transcription units. CRPs activity is triggered in response to glucose starvation and other stresses by the binding of the second messenger cAMP. CRP binding sites have proved to be particularly noisy because computational searches for consensus binding sites have missed many known binding sites. CRP was chosen for its highly indiscriminate binding site [115], ArcA dataset ArcA is a global regulator that changes in relation to the expression of fermentation genes and represses the aerobic pathways when E. coli enters low oxygen growth 50 conditions. ArcA was chosen for its different protein domain (CheY-like) and very low consensus binding site [115], ArgR dataset ArgR, complexed with L-arginine, represses the transcription of several genes involved in biosynthesis and the transport of arginine, histidine, and its own synthesis, and activates genes for arginine catabolism. ArgR is also essential for a site-specific recombination reaction that resolves plasmid ColEl multimers to monomers and is necessary for plasmid stability. PurR dataset PurR dimers control several genes involved in purine and pyrimidine nucleotide biosynthesis and its own synthesis. This regulator requires binding of two products of purine metabolism, hypoxanthine and guanine, to induce the conformational change that allows PurR to bind to DNA. TyrR dataset TyrR, tyrosine repressor, is the dual transcriptional regulator of the TyrR regulon that involves genes essential for aromatic amino acid biosynthesis and transport. TyrR can act both as a repressor and as an activator of transcription at o70-dependent promoters. I HI' dataset IHF, Integration Host factor, is a global regulatory protein that helps maintain DNA architecture. It binds to and bends DNA at specific sites. IHF plays a role in DNA 51 supercoiling and DNA duplex destabilization and affects processes such as DNA replication, recombination, and the expression of many genes. These sequence sets were chosen because of their highly degenerate binding sites. C. Protein sequences Sequences from protein families in PFAM and Prosite databases with known motifs are used to evaluate the performance of our method. The DUF356, Strep-H-triad and Excalibur (extracellular calcium-binding) protein families have unambiguous motifs. Those from Nup-retrotrp and Flagellin-C families have weak signals. Sequences from these families are also used to evaluate the capability of detecting subtle motifs. In addition to these families, sequences from Xin Repeat and Planctomycete Cytochrome C protein families are used because they have more variable motifs. For gapped motif examples, we used sequences from the zf-C2H2, EGF 2, LIG CYCLIN l, LIG PCNA and MODTYRITAM families. To test the ability of de novo motif prediction, we used sequences from ZZ, Myb and SWIRM domains. Because motifs of these domains are unknown, the results from our method are compared with other motif-finding algorithms and we expect that our methods will agree with the results of others as shown in Chapter 4. 52 3. Fuzzy cluster analysis using FCM 3.1 FCM algorithm The FCM algorithm was initially developed by Dunn [121] and generalized later by Bezdek [92] with an introduction of the fuzzifier, m, is the best well-known method for fuzzy cluster analysis. FCM allows gradual memberships of data points to clusters measured as degrees in [0,1], This provides the flexibility in describing the data points that can belong to more than one cluster. In addition, these membership degrees provide a much finer picture of the data model; they afford not only an effective way of sharing a data point among multiple clusters but also express how ambiguously or definitely the data point belongs to a specific cluster. Let X = (xi, x2,..., xn} be the dataset in a p-dimensional space Rp where each x; = (x;i, Xi2,..., Xip) g Rp is a feature vector or pattern vector, and each x,, is the jth characteristic of the data points x,. Let c be a positive integer, 2 < c < n, Rcn denote the vector space of all real c x n matrices. Let V = (vi, v2,..., vc} e Rcp of which Vk e Rp is the center or prototype of the kth cluster, 1 < k < c. The concept of the membership degrees is substantiated by the definition and interpretation of fuzzy sets. Fuzzy clustering, therefore, allows got a fine grained solution space in the form of fuzzy partitions of the dataset. Each cluster Vk in the cluster partitions is represented by a fuzzy set p,k as follows: complying with fuzzy set theory, the cluster assignment Uki is now the membership degree of the data point x, to cluster V;, such that Uki = Pk(xj) e [0,1], Because memberships in clusters are fuzzy, 53 there is not a single cluster label assigned to each data point. Instead, each data point x; will be associated by a fuzzy label vector that states its memberships in c clusters, u, =(ui1,u2l,...,ucl}T. (3.1) Definition 3.1 (fuzzy c-partition): Any matrix U = {uki} e Mfcn defines a fuzzy c- partition (or fuzzy partition), where Mfcn = ju e Rcn | ukl e [0,1] Vi, k; g ukl = 1 Vi; 0 < g ukl < n Vkj . (3.2) Row kl of a matrix U e Mfcn exhibits the kth membership function Uk in the fuzzy partition matrix U. The definition of a fuzzy partition matrix (3.2) restricts assignment of data to clusters and what membership degrees are allowed. Definition 3.2 (fuzzy c-means functionals): Let Jm: Mfcn x Rcp R+ be, Jm(U,V) = ^^ud2(xI,vk)^min, (3.3) i=l k=l where m is the fuzzifier factor, l objective functions, and d(.) is any inner product norm of Rp: d2 (xi > vk ) = ||xi vk II = (xi vk )T A(x! vk ), (3.4) where A is a positive definite matrix. If A = Ipxp then d2(x,y) = ||x y||2. Theorem 3.1 (prototypes of FCM): If m and c are fixed parameters, and I, I are sets defined as: 54 (3.5) Jk = (k 11 < k < c; d(xi,vk) = 0}, U={l,2,...,c}/I then (U,V) e (Mfcn x Rcp) may be a global minimum for Jm(U,V) with respect to (w.r.t) the restriction l
k=l(3.6) as in (3.2), only if: V ukl = { l
l(x^v,) '0, kek 1, kef m-1 z 1=1 1 d (xi>Vl) m-1 k =0 1*0 (3.7) and y vi=Zu^./Zu"^ l 1=1 1=1 Proof: We consider minimizing the Jm w.r.t. U under the restriction (3.6) using Lagrange multiplier. Let the Lagrange multiplier be k;, i=l,.. ,,n and put n c L = ZZuSd2(xivk)-Z^(Zuki -1)- =1 k=l =1 k=l As the necessary optimal condition, dL = m(u^ )d (xt,vk)-=0, (3.9) 55 (3.10) cL dvv = 2Zu||xi vk|| = 0 . Assume that k Â£ I; then d2(x,, Vk) ^ 0. We derive from the Equation (3.9) for each k: X: V_1 1 lmd (xt,vk)J 1 ^m-l Id (Xi,vk)J , l X: I-1 2X= E k=i V mJ t_, k=l d (Xi,vk) = 1. Hence, 1 A,; 'V-l c f Z k=l 1 1 V-i d (Xi,vk) Together with (3.11), we obtain the following for the membership, Z 1=1 f ^ ^ m-1 1 ^m-1 d (X;, Vj ) Id (x;,vk)J Thus, ^(x^vjj / Â§^d2(x;,Vj)^ Similarly, we derive from (3.10) for each k, l 1=1 Q.E.D 56 3.2 FCM convergence The iteration process of the FCM algorithm can be described using a map: 3m : (U, V) -> (U, V) where U = F(V) and V = G(U). (U(t), V(t)), t=l,2... is called an iteration sequence of the FCM algorithm. If (U(t),V(t)) = 3m(U(M),V(M)), t>l, where (U(0),V(0)) is any element of Mfcn x Rcp. Set Jm(u*,V*) VV e Rcp, V ^ V* Theorem 3.2 (descentfunction -Bezdek) [133]: Jmis a descent function for (3m, Q} Proof. Because (y d2(y) } and (y ym} are continuous, Jm is the sum of products of such functions, Jm is continuous on (Mfcn, Rcp}. Jm(3m(U,V)) = Jm(F(G(U)), G(U)) < Jm(U,G(U)) by the first case in the definition of Q < Jm(U, V) by the second case in the definition of Q. Thus, Jm is a descent function for (3m, Q}. Q.E.D Theorem 3.3 (solution convex set) : Let [conv(X)]c be the c-fold Cartesian product of the convex hull of X, and let (U, V) be the starting point of the sequence of 3m iteration, U g Mfcn and V = G(U). Then e Mfcn x [conv(X)]c, t=l,2... Mfcnx[conv(X)]c is compact in MfcnxRcp. Proof Let U e Mfcn be chosen. For each k, 1 < k < c, we have 57 <=k,Y\/( Let Pkl = (ukl Y/2 (< T we have v = Â£ pklXi, and Â£ pkl = 1. / 1=1 1=1 1=1 Thus v[! e conv(X), and therefore V g [conv(X)]c. Subsequently, we can prove that U(t) e Mfcn, V(t) g [conv(X)]c, Vt > 1. Q.E.D Theorem 3.4 (the convergence of FCM Bezdek et al.) [134]: Let 9 = (U, V) be the starting point of the iterations with 3m. The sequence 3^,(U,V),t = 1,2... either terminates at an optimal point 9* = (U*, V*) g Cl, or there is sub-sequence converging to a point in Cl. Proof: 3m is a continuous and descent function (Theorem 3.2) and the iteration sequences are always in a compact subset of the domain of Jm (Theorem 3.3). Thus, 3m should terminate at an optimal point 9* = (U*,V*) or there is a sub-sequence converging to a point in Cl. Q.E.D 3.3 Distance measures Similar to other clustering algorithms, FCM uses the distance or dissimilarity of data points, each described by a set of attributes and denoted as a multidimensional vector. The attributes can be quantitative or qualitative, continuous or discrete, which leads to different measurement mechanisms. Accordingly, for a dataset in the form of an object data matrix, the data matrix is designated as two-mode, because its row and column indices have different meanings. This is in contrast to the one-mode distance 58 matrix, which is symmetric with elements representing the distance or dissimilarity measure for any pair of data points, because both dimensions share the same meaning. Given a dataset X, the distance function d(.) on X is defined to satisfy the following conditions, Symmetric: d(x,y) = d(y,x), Positive: d(x,y) > 0, V x,y eX. If the following conditions hold, then d is a metric. Triangle inequality. d(x,z) < d(x,y) + d(y,z), Vx,y,zeX, Reflexive: d(x,y) = 0 iif x = y. If the triangle inequality is violated, then d is a semi-metric. Many different distance measures have been used in cluster analysis. Euclidean distance The Euclidean distance is the most widely used distance measure and defined as d2(x,y) = i](xi-yi)2. (3.12) 1=1 This distance is a true metric because it satisfies the triangle inequality. In the Equation (3.12), the expression data x; and y, are subtracted directly from each other. We, therefore, need to ensure that the expression data are properly normalized when using the Euclidean distance, for example by converting the measured gene expression levels to log-ratios. 59 Pearson correlation distance The Pearson correlation coefficient distance is based on the Pearson correlation coefficient (PCC), which describes the similarity of objects as PCC(x,y)=iÂ£- P H Xi)(Yi a a, Yi) where ax and ay are the sample standard deviation of x and y respectively. PCC has a value from -1 to 1, where PCC=1 when x and y are identical, PCC=0 when they are unrelated, and PCC = -1 when they are anti-correlated. The Pearsons correlation distance is then defined as dp(x, y) = l-PCC(x, y). (3.13) Figure 3-1 : Expression levels of three genes in five experimental conditions The value of dp lies in [0,2] where dp=0 implies that x and y are identical, and dp=2 implies that they are very different from each other. While the Euclidean measure takes the magnitude of the data into account, the Pearson correlation measure geometrically captures the pattern of the two data points [1], It is therefore widely used for gene expression data. Even if the expression levels of the two genes are different, if 60 they peak similarly in the experiments, then they have a high correlation coefficient, or a small Pearson correlation distance (Figure 3-1). Because of its definition, the Pearson correlation distance satisfies the distance conditions, but it is not a metric since it violates the triangle inequality. Absolute Pearson correlation distance The distance is defined as, daP (x, y) = 1 |PCC(x, y)|. (3.14) Because the absolute value of PCC falls in the range [0,1], the dap distance also falls between [0,1], dap is equal to 1 if the expression levels of the two genes have the same shape, i.e., either exactly the same or exactly opposite. Therefore, dap (3.14) should be used with care. Uncentered Pearson correlation distance The distance is based on the uncentered Pearson correlation coefficient (UPCC) which is defined as The dup distance in Equation (3.15) may be appropriate if there is a zero reference state. For instance, in the case of gene expression data given in terms of log- ^ zf /p The uncentered Pearson correlation distance is defined as dup(x,y) = l-UPCC(x,y). (3.15) 61 ratios, a log-ratio equal to 0 corresponds to green and red signal being equal, which means that the experimental manipulation did not affect the expression. Because UPCC lies in the range [-1,1], the dup distance falls between [0,2], Absolute uncentered Pearson correlation distance This distance is defined as in (3.16). Because the absolute value of UPCC falls between [0,1], the daup distance also falls between [0,1], d.p(x,y) = l-|UPCC(x,y)| (3.16) Comparison of measures The distance measure plays an important role in obtaining correct clusters. For simple datasets, where the data is multidimensional, the Euclidean distance measure is employed. But as the dimensions of the dataset increase, where each dimension denotes a specific aspect of the dataset, the Euclidean distance measure may not be the best one to be used. For the Iris dataset, we ran the FCM algorithm with the number of clusters set to 3, which is the number of classes of Iris flowers: Setosa, Versicolor, and Virginica in the dataset. By using five different distance measures from (3.12) to (3.16), we obtained five different cluster partitions. For each partition, we compared the cluster label of each data point with its class label to compute the correctness ratio for every class. The results are shown in Table 3.1. While the four distance measures using Pearsons correlation coefficient provide correct results, the Euclidean distance measure performs worse, particularly on the Virginica class. In this comparison, it is clear that the absolute 62 Pearson correlation measure and the Pearson correlation measure results are identical. Similarly, the result of the uncentered Pearson correlation measure is identical with that of the absolute uncentered Pearson correlation measure. Table 3.1 : Performance of different distance measures on the Iris dataset Classification correctness (%) Average Distance measure method Setosa Versicolor Virginica Correctness Euclidean 100.00 94.00 76.00 90.00 Pearson correlation 100.00 94.00 94.00 96.00 Abs. Pearson correlation 100.00 94.00 94.00 96.00 Uncentered Pearson cor. 100.00 92.00 100.00 97.33 Absolute uncentered Pearson cor. 100.00 92.00 100.00 97.33 Table 3.2 : Performance of different distance measures on the Wine dataset Classification correctness (%) Average Distance measure method #1 #2 #3 Correctness Euclidean 76.27 70.42 56.25 67.65 Pearson correlation 91.53 85.92 100.00 92.48 Abs. Pearson correlation 72.88 67.61 95.83 78.77 Uncentered Pearson cor. 91.53 85.92 97.92 91.79 Absolute uncentered Pearson cor. 91.53 85.92 97.92 91.79 We ran a similar benchmark on the Wine dataset which contains data objects of the three different classes of wines. Results are shown in Table 3.2. Again, all Pearson correlation based distance measures performed better than the Euclidean distance measure. The Pearson correlation measures seem to be better for distance measure in real and unsealed datasets. 63 3.4 Partially missing data analysis In cluster analysis, complete information is preferred throughout the experiment. Unfortunately, real world datasets frequently have missing values. This can be caused by errors that lead to incomplete attributes or by random noise. For example, sensor failures in a control system may cause the system to miss information. For gene expression data, the missing values can come from the platform level and meta-analysis level. With the former, the reasons include insufficient resolution, image corruption, spotting, scratches or dust on the slide, or hybridization failure. The latter can be caused by differences in platforms or chip generations, and the lack of an appropriate method to map probes onto genes across different microarray platforms. In a dataset with partially missing data, some attribute values of a data point x; may be not observed. For example, x; = (xn, x;2, ?, ?, x;5, ?) has missing values corresponding to the third, fourth and sixth attributes and only the first, second and fifth attributes are observed. This causes a problem with distance measurement between x, and other data points in the dataset and the cluster algorithm, therefore, cannot perform properly at x,. Let Xw = { x; g X| x; is a complete data point}, XP = { Xij | x; ^ ? }, and XM = { Xy | Xy = ? } . Clearly, Xp and X contain more information than Xw [140], 64 Three approaches [137-139] have been widely used to address the problem of missing values. (1) The ignorance-based approach is the most trivial approach to deal with datasets when the proportion of incomplete data is small, |X / Xw| |X|, but the elimination brings a loss of information. (2) The Model-based approach defines a model for the partially missing data, XM, using available information, XP, and applies the model to build a complete dataset. However, the complexity of this method can prevent application to large datasets. (3) The Imputation-based approach supplies missing values XM by certain means of approximation. Of the three, the Imputation-based approach is preferred. Because of the need to use an approximation model for the missing data, it is therefore integrated with the cluster algorithm and works based on the cluster model, which may be a probability-based or possibility-based one, estimated by the cluster algorithm. For the Imputation-based approach using FCM, Hathaway et al. [140] identified four strategies to solve the problem of missing values. Whole data strategy (WDS) This strategy, similar to the ignorance-based approach, eliminates the missing data. Results therefore present the cluster partition of Xw instead of X. This strategy is applicable when |XP| / |X| > 0.75. Partial distance strategy (PDS) This is the strategy recommended when XM is large. The distance between every pair of data points is computed using XP. 65 |2 (3.17) dp (x!, vk ) = IK vk II = (x; vk )T (AI; )(x; vk ) : where I; is the index function of the data point x;, defined as in (3.18). 0, x.. eXM Ilt = \ for 1 < i < n, 1 < t
1l, xiteXp |

Full Text |

PAGE 1 A MACHINE LEARNING APPROACH FOR GENE EXPRESSION ANALYSIS AND APPLICATIONS by THANH NGOC LE B.S., University of Econimics in Hochiminh City, Vietnam, 1994 M.S., National University in Hochiminh City, Vietnam, 2000 A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy Computer Science and Information Systems 2013 PAGE 2 ii This thesis for the Doctor of Philosophy degree by Thanh Ngoc Le has been approved for the Computer Science and Information Systems Program by Gita Alaghband, Chair Tom Altman, Advisor Katheleen Gardiner, Co-Advisor James Gerlach Sonia Leach Boris Stilman Date: April 10, 2013 PAGE 3 iii Le, Thanh, Ngoc (Ph.D., Computer Sc ience and Information Systems) A Machine Learning approach for Gene Expression analysis and applications Thesis directed by Professor Tom Altman ABSTRACT High-throughput microarray technology is an important and revolutionary technique used in genomics and systems biology to an alyze the expression of thousands of genes simultaneously. The popular use of this techni que has resulted in enormous repositories of microarray data, for example, the Gene Expression Omnibus (GEO), maintained by the National Center for Biotechnology Info rmation (NCBI). However, an effective approach to optimally exploit these datasets in support of sp ecific biological studies is still lacking. Specifically, an improved me thod is required to integrate data from multiple sources and to select only those datasets that meet an investigatorÂ’s interest. In addition, to exploit the full power of microarray data, an effective me thod is required to determine the relationships among genes in th e selected datasets and to interpret the biological meanings behind these relationships. To address these requirements, we have de veloped a machine lear ning based approach that includes: An effective meta-analysis method to in tegrate microarray data from multiple sources; the method exploits informati on regarding the biological context of interest provided by the biologists. PAGE 4 iv A novel and effective cluster analysis method to identify hidden patterns in selected data representing relationshi ps between genes un der the biological conditions of interest. A novel motif finding method that discove rs, not only the co mmon transcription factor binding sites of co-regulated genes, but also the miRNA binding sites associated with the biological conditions. A machine learning-based framework for microarray data analysis with a web application to run common analysis tasks on online. The form and content of this abstract are approved. I recommend its publication. Approved: Tom Altman PAGE 5 v DEDICATION I dedicate this work to my family, my wife, Lan, and my bel oved daughter, Anh for their constant support and unconditional love. I love you all dearly. PAGE 6 vi ACKNOWLEDGMENTS I would like to thank my advisor, Dr. Tom Altman for his guidance and support throughout my graduate studies. His way of guidance helped me to develop my selfconfidence and skills necessary to address not only research problems, but also realworld problems. Thank you very much Dr. Altman. I would like to express my sincere gratitude to my co-advisor, Dr. Katheleen Gardiner for her continuous support on my research work throughout the years. She guided me with remarkable patience and understandi ng. She was always there to help me whenever I encounter an obstacle or require an advice. Her inputs were very useful. It was really a pleasure working with her. I would like to thank Dr. Sonia Leach for her help in solving difficult problems and for her valuable inputs as a member of my di ssertation committee. She was very generous in transferring her vast knowledge to us Also, I would like to thank Dr. Gita Alaghband, Dr. James Gerlach, and Dr. Boris Stilman for serving on my dissertation committee and for their helpful suggestions and comments for the improvement of this project; your discussion, ideas, and feedback have been absolutely inva luable. I would like to thank you all for giving me an example of excellence as a researcher, mentor, instructor, and role model. I would like to thank my fellow graduate students, my lab mates and everyone who helped me in many ways to make my gradua te student life a memorable one. I am very grateful to all of you. PAGE 7 vii I would especially like to thank my amazing family for the love, support, and constant encouragement I have gotten over the years. In particular, I would like to thank my mother, my brothers and sisters. You are the salt of the earth, and I undoubtedly coul d not have done this without you. PAGE 8 viii TABLE OF CONTENTS List of Tables .............................................................................................................. x ii List of Figures ..............................................................................................................x v Chapter 1. Introduction ........................................................................................................1 1.1 Gene expression data analysis challenges ..........................................................1 1.2 Motivations and proposed methods ...................................................................3 1.2.1 Microarray meta-analysis ...................................................................................3 1.2.2 Cluster analysis of gene expression data ...........................................................6 1.2.3 DNA sequence analysis .....................................................................................8 1.3 Research contributions .......................................................................................9 2. Microarray analysis background ......................................................................13 2.1 Microarray ........................................................................................................13 2.1.1 Microarray platforms .......................................................................................13 2.1.2 Gene expression microarray data analysis .......................................................16 2.2 Microarray data meta-analysis .........................................................................18 2.2.1 Microarray data integration ..............................................................................18 2.2.2 Issues with microarray data meta-analysis ......................................................23 2.3 Cluster analysis ................................................................................................32 2.3.1 Clustering algorithms .......................................................................................34 2.3.2 Hierarchical clustering algorithms ...................................................................35 2.3.3 K-Means clustering algorithm .........................................................................36 2.3.4 The mixture model with expectation-maximization algorithm .......................37 2.3.5 Fuzzy clustering algorithm ..............................................................................39 PAGE 9 ix 2.4 Gene regulatory sequence prediction ...............................................................42 2.5 Datasets ............................................................................................................45 2.5.1 Artificial datasets fo r cluster analysis ..............................................................45 2.5.2 Artificial datasets fo r sequence analysis ..........................................................46 2.5.3 Real clustering datasets ....................................................................................46 2.5.4 Gene expression datasets .................................................................................48 2.5.5 CMAP datasets.................................................................................................49 2.5.6 Real biological se quence datasets ....................................................................49 3. Fuzzy cluster analysis using FCM ...................................................................53 3.1 FCM algorithm.................................................................................................53 3.2 FCM convergence ............................................................................................57 3.3 Distance measures ............................................................................................58 3.4 Partially missing data analysis .........................................................................64 3.5 FCM Clustering solution validation ................................................................68 3.5.1 Global validity measures ..................................................................................70 3.5.2 Local validity measures ...................................................................................80 3.6 FCM partition matr ix initialization ..................................................................85 3.7 Determination of the number of clusters .........................................................87 3.7.1 Between-cluster-par tition comparison .............................................................88 3.7.2 Within-cluster-partition comparison ................................................................89 3.8 Determination of the fuzzifier factor ...............................................................90 3.9 Defuzzification of fuzzy partition ....................................................................92 4. Methods............................................................................................................98 4.1 Microarray meta-analysis .................................................................................98 PAGE 10 x 4.1.1 Selection of samples and features ....................................................................98 4.1.2 Sample and feature selection method ..............................................................98 4.1.3 Methods for feature metric .............................................................................108 4.2 Fuzzy cluster analysis methods using Fuzzy C-Means (FCM) .....................109 4.2.1 Modification of the objective function ..........................................................110 4.2.2 Partition matrix initialization method ............................................................111 4.2.3 Fuzzy clustering evaluation method ..............................................................121 4.2.4 Fuzzy clustering evaluation using Gene Ontology [180] ..............................129 4.2.5 Imputation methods for partially missing data ..............................................137 4.2.6 Probability based imputation method (fzPBI) [154] ......................................139 4.2.7 Density based imputation method (fzDBI) [155] ..........................................149 4.2.8 Probability based defuzz ification (fzPBD) [156] ..........................................153 4.2.9 Fuzzy genetic subtractive clus tering method (fzGASCE) [157] ...................161 4.3 Motif finding problem ....................................................................................172 4.3.1 HIGEDA algorithm ........................................................................................172 4.3.2 New motif discovery using HIGEDA ............................................................192 5. Applications ...................................................................................................196 5.1 Recovery of a gene expression signature .......................................................196 5.2 Gene-drug association prediction ..................................................................199 5.2.1 Model design ..................................................................................................199 5.2.2 Model testing .................................................................................................205 5.3 Drug target prediction ....................................................................................216 5.4 Application of HIGEDA into prediction regulatory motifs ...........................217 6. Conclusions and future work .........................................................................218 PAGE 11 xi References ..................................................................................................................22 1 PAGE 12 xii LIST OF TABLES Table 3.1 : Performance of different distan ce measures on the Iris dataset ...........................63 3.2 : Performance of different distan ce measures on the Wine dataset .......................63 3.3 : Average results of 50 trials us ing incomplete Iris data [140] ..............................68 3.4 : Performance of three standard algorithms on ASET1 .........................................86 3.5 : Algorithm performance on AS ET4 using MISC measure ...................................96 3.6 : Algorithm performance on AS ET5 using MISC measure ...................................96 4.1 : Predicted HDAC antagonists .............................................................................102 4.2 : Predicted HDAC agonists ..................................................................................103 4.3 : Predicted estrogen receptor antagonists .............................................................105 4.4 : Predicted estrogen receptor agonists ..................................................................107 4.5 : Algorithm performance on the ASET1 dataset ..................................................118 4.6 : fzSC correctness in determining cl uster number on artificial datasets ..............119 4.7 : fzSC performance on real datasets .....................................................................120 4.8 : Fraction of correct cluster predictions on ar tificial datasets ..............................125 4.9 : Validation method performance on the Iris dataset (3 true clusters) .................125 4.10 : Validation method performance on the Wine dataset (3 true clusters) ............126 4.11 : Validation method performance on the Glass dataset (6 true clusters) ...........126 4.12 : Validation method performance on the Yeast dataset (5 true clusters) ...........127 4.13 : Validation method performance on the Yeast-MIPS dataset (4 true clusters) .........................................................................................................127 4.14 : Validation method performance on the RCNS dataset (6 true clusters) ..........128 4.15 : Degrees of belief of GO annotation evidences ................................................132 4.16 : Gene Ontology evidence codes ........................................................................133 PAGE 13 xiii 4.17 : Validation method performance on the Yeast dataset using GO-BP ...............135 4.18 : Validation method performance on th e Yeast-MIPS dataset using GO-BP ....136 4.19 : Validation method performance on the RCNS dataset using GO-CC .............136 4.20 : Validation method performance on the RCNS dataset using GO-BP .............137 4.21 : Average results of 50 trials usi ng an incomplete IRIS dataset with different percentages (%) of missing value ...................................................145 4.22 : ASET4Compactness measure .......................................................................157 4.23 : IRIS: Compact ness measure ............................................................................159 4.24 : WINECompact ness measure .........................................................................160 4.25 : Performance of GA algorithms on ASET2 ......................................................168 4.26 : Performance of GA algorithms on ASET3 ......................................................169 4.27 : Performance of GA algorithm on ASET4 .......................................................169 4.28 : Performance of GA algorithms on ASET5 ......................................................170 4.29 : Performance of GA algorithms on IRIS ..........................................................171 4.30 : Performanc of GA algorithms on WINE .........................................................171 4.31 : Average performance (LPC/SPC) on simulated DNA datasets .......................187 4.32 : Average performance (LPC/SPC/run time (seconds)) on eight DNA datasets (# of sequences/length of motif/# of motif occurrences). ................188 4.33 : Detection of protein motifs (1-8, PFAM; 9-12, Prosite)..................................190 4.34 : Motifs of ZZ, Myb and SWIRM domains by the four algorithms ..................193 4.35 : Motifs of Presenilin-1 and Sign al peptide peptidase by HIGEDA ..................194 5.1 : Clustering results with estroge n significantly associated drugs ........................197 5.2 : Estrogen signature and the Â‘s ignatureÂ’ predicted by FZGASCE .......................198 5.3 : Hsa21 set-1 query genes ....................................................................................206 PAGE 14 xiv 5.4 : Expanded set of the Hsa21 set-1 ........................................................................207 5.5 : Gene clusters of the Hsa21 gene set-1 expanded set .........................................207 5.6 : Proposed gene expression signatu re for the Hsa21 gene set-1 ..........................209 5.7 : Drugs predicted to enhance ex pression of Hsa21 gene set-1 .............................210 5.8 : Drugs predicted to repress e xpression of Hsa21 gene set-1 ..............................211 5.9 : Hsa21 set-2 query genes ....................................................................................212 5.10 : Predicted gene expression sign ature for the Hsa21 gene set-2 ........................213 5.11 : Drugs predicted to enhance ex pression of Hsa21 gene set-2 ...........................215 5.12 : Drugs predicted to repress e xpression of Hsa21 gene set-2 ............................215 PAGE 15 xv LIST OF FIGURES Figure 1-1 : Microarray data anal ysis and the contributions of this research .........................12 2-1 : Microarray design and screening .........................................................................15 2-2 : Microarray analysis .............................................................................................18 3-1 : Expression levels of three gene s in five experimental conditions .......................60 3-2 : ASET2 dataset with five well-separated clusters ................................................76 3-3 : PC index (maximize) ...........................................................................................77 3-4 : PE index (minimize) ............................................................................................77 3-5 : FS index (minimize) ............................................................................................77 3-6 : XB index (minimize) ...........................................................................................77 3-7 : CWB index (minimize) .......................................................................................78 3-8 : PBMF index (maximize) .....................................................................................78 3-9 : BR index (minimize) ...........................................................................................78 3-10 : Performance of the global validity measures on artificial datasets with different numbers of clusters ...........................................................................79 3-11 : Dendrogram of the Iris dataset from a 12-partition generated by FCM ............83 3-12: Dendrogram of the Wine dataset from a 13-partition generated by FCM .........84 3-13 : ASET1 dataset with 6 clusters ...........................................................................85 3-14 : PC index on the ASET2 dataset (maximize) .....................................................88 3-15 (Wu [144]) : Impact of m on the misc lassification rate in Iris dataset ................91 3-16 : Artificial dataset 5 (ASET5) with three clusters of different sizes ...................93 4-1 : Venn diagram for predicted HDAC antagonists ...............................................104 4-2 : Venn diagram for HDAC inhibitors ..................................................................104 4-3 : Venn diagram for predicte d estrogen recepto r antagonists ..............................106 PAGE 16 xvi 4-4 : Venn diagram for predicted estrogen receptor agonists ....................................106 4-5 : Candidate cluster centers in the ASET1 dataset found using fzSC. Squares, cluster centers by FCM; dark circ les, cluster centers found by fzSC. Classes are labeled 1 to 6. ..............................................................................118 4-6 : Average RMSE of 50 trials usi ng an incomplete ASET2 dataset with different percentages of missing values .........................................................142 4-7 : Average RMSE of 50 trials usi ng an incomplete ASET5 dataset with different percentages of missing values .........................................................143 4-8 : Average RMSE of 50 tria ls using an incomplete Iris dataset with different percentages of missing values ........................................................................144 4-9 : Average RMSE of 50 trials usi ng an incomplete Wine dataset with different percentages of missing values .........................................................145 4-10 : Average RMSE of 50 trials using an incomplete RCNS dataset with different percentages of missing values .........................................................146 4-11 : Average RMSE of 50 trials usi ng an incomplete Yeast dataset with different percentages of missing values .........................................................147 4-12 : Average RMSE of 50 trials using an incomplete Yeast-MIPS dataset with different percentages of missing values .........................................................148 4-13 : Average RMSE of 50 trials usi ng an incomplete Serum dataset with different percentages of missing values .........................................................148 4-14 : (fzDBI) Average RMSE of 50 tria ls using an incomplete ASET2 dataset with different missing value percentages .......................................................151 4-15 : (fzDBI) Average RMSE of 50 tria ls using an incomplete ASET5 dataset with different missing value percentages .......................................................151 4-16 : (fzDBI) Average RMSE of 50 trials using an incomplete Iris dataset with different percentages of missing values .........................................................152 PAGE 17 xvii 4-17 : (fzDBI) Average RMSE of 50 tr ials using an incomplete Yeast-MIPS dataset with different per centages of missing values .....................................152 4-18 : Algorithm performance on ASET2 .................................................................156 4-19 : Algorithm performance on ASET3 .................................................................156 4-20 : Algorithm performance on ASET4 .................................................................157 4-21 : Algorithm performance on ASET5 .................................................................158 4-22 : Algorithm performance on IRIS ......................................................................158 4-23 : Algorithm performance on WINE dataset .......................................................159 4-24 : Algorithm performance on GLASS dataset ....................................................160 4-25 : A motif model ..............................................................................................175 4-26 : Dynamic alignment of s=Â’ACGÂ’ w.r.t from Figure 4-25 ............................176 4-27 : (t) = M T / (T + t2). M is the maximum value of ..................................184 4-28 : Strep-H-triad motif by HIGEDA .....................................................................189 5-1 : Gene expression si gnature prediction ................................................................196 5-2 : Gene-Drug association prediction .....................................................................199 5-3 : Expansion of gene set A ....................................................................................200 5-4 : Identification of key genes ................................................................................202 5-5 : Gene expression signature generation and application .....................................204 5-6 : Transcription factor and mircoRNA binding site prediction using HIGEDA ...217 PAGE 18 1. Introduction High-throughput microarray t echnology for determining global changes in gene expression is an important and revolutionary experimental paradigm that facilitates advances in functional genomics and systems biology. This technology allows measurement of the expression levels of thous ands of genes with a single microarray. Widespread use of the technology is evident in the rapid growth of microarray datasets stored in public repositories. For exampl e, since its inception in the early 1990s, the Gene Expression Omnibus (GEO), main tained by the National Center for Biotechnology Information (NCBI), has rece ived thousands of data submissions representing more than 3 billion individua l molecular abundance measurements [6,7]. 1.1 Gene expression data analysis challenges The growth in microarray data deposition is reminiscent of the early days of GenBank, when exponential increases in publ icly accessible biological sequence data drove the development of analytical techni ques required for data analysis. However, unlike biological sequences, microarray datase ts are not easily shared by the research community, resulting in many investigators be ing unable to exploit the full potential of these data. In spite of over 15 years of e xperience with microarray experiments, new paradigms for integrating and mining publicly available microarray results are still needed to promote widespread, investigator-d riven research on shared data. In general, integrating multiple datasets is expected to yield more reliable and more valid results due to a larger number of samples a nd a reduction of study-specific biases. PAGE 19 Ideally, gene expression data obtained in any laboratory, at any time, using any microarray technology should be comparable. However, this is not true in reality. Several issues arise when attempting to integrate microarray data generated using different array technologies. Meta-analysis a nd integrative data analysis only exploits the behavior of genes individually at the genotype level. In order to discover gene behaviors at the phenotype level by conn ecting genotype to phenotype, additional analyses are required. Cluster analysis, an unsupervised learni ng method, is a popular method that searches for patterns of gene expression associated with an experimental condition. Each experimental condition usually corresponds to a specific biological event, such as a drug treatment or a diseas e state, thereby allowi ng discovery of the drug targets, and drug an d disease relationships. Numerous algorithms have been devel oped to address the problem of data clustering. These algorithms, notwithstanding their different approaches deal with the issue of determining the numb er and the construction of cl usters. Although some cluster indices address this problem, they still ha ve the drawback of model over-fitting. Alternative approaches, based on statistics with the log-likelihood estimator and a model parameter penalty mech anism, can eliminate over-fit ting, but are still limited by assumptions regarding models of data di stribution and by a slow convergence with model parameter estimation. Even when the number of clusters is known a priori, different clustering algorithms may provide different solutions, because of their dependence on the initializati on parameters. Because most algorithms use an iterative PAGE 20 process to estimate the model parameters while searching for optimal solutions, a solution that is a global op timum is not guaranteed [122,132]. 1.2 Motivations and proposed methods This dissertation focuses on construction of a machine learning and data mining framework for discovery of biological meani ng in gene expression data collected from multiple sources. Three tools have been deve loped: (i) a novel meta-analysis method for identifying expression profiles of interest; (ii) set of novel cl ustering methods to incorporate prior biological knowledge and to predict co-re gulated genes; and (iii) a novel motif-finding method to extract inform ation on shared regulatory sequences, Together these methods provide comprehens ive analysis of gene expression data. 1.2.1 Microarray meta analysis Several methods have been proposed to co mbine inter-study microarray data. These methods fall into two major categories based on the level at which data integration is performed: meta-analysis and direct integration of expression values Meta-analysis integrates data from different studies after th ey have been pre-anal yzed individually. In contrast, the latter approach in tegrates microarray data from different studies at the level of expression after transforming expression va lues to numerically comparable measures and/or normalization and carries out th e analysis on the combined dataset. Meta-analysis methods combine the results of analysis from individual studies in order to increase the ability to identify signi ficantly expressed gene s or altered samples across studies that used different microarra y technologies (i.e., di fferent platforms or different generations of the same platform). Lamb et al. [25] and Zhang et al. [26] PAGE 21 proposed rank-based methods for performing me ta-analysis. They started within the Connectivity Map (CMAP) data sets, a collection of one-off microarray screenings of >1000 small-molecule treatments of five diffe rent human cell lines. Genes in each array were ranked using statistical methods and the array data were then integrated, based on how they fit a given gene expression signature defined as a set of genes responding to the biological context of interest. Gene e xpression signatures could then be used to query array data generated using other a rray technologies. These methods successfully identified a number of small molecules with rational and informative associations with known biological processes. However, Lamb et al. [25] weighted up-regulated genes and down-regulated genes differently, while Zh ang et al. [26] showed that genes with the same amount of expression change shoul d be weighted equall y. Both Zhang et al. [26] and Lamb et al. [25] used fold change criteria (where Â“fol d changeÂ”, FC, is the ratio between the gene expression value of treated and untreated conditions) to rank the genes in expression profiles and they solved the problem of platform-specific probes by eliminating them. Because of the experiment al design and the noise inherent in the screening process and imaging methods, a singl e gene may have probesets that differ in the direction of expression change, e.g., some probesets may be up-regulated and others down-regulated. This problem was addressed by Lamb et al. [25] and Zhang et al. [26] by including all probesets in the ranked list and averaging probesets across genes. In fact, it is not biologically n ecessary that a single differentia lly expressed gene show the identical FC for all probesets. It is not straightforward to compare differentially expressed genes just by averaging their probeset values. Witten et al. [9] showed that PAGE 22 FC-based approaches, namely FC-differe nce and FC-ratio, are linear-comparison methods and are not appropriate for noisy data and the data from different experiments. In addition, these methods are unable to measure the statistical significance of the results. While approaches using the t-test were proposed, Deng [10] showed that the Rank-Product (RP) method is not only more st able than the t-test but it also provides the statistical significance of the gene expr ession differences. Brei tling et al. [29] proposed an RP-based approach that allows the analysis of gene expression profiles from different experiments and offers seve ral advantages over the linear approaches, including the biologically intuitive FC criterion, usin g fewer model assumptions. However, both the RP and t-test methods as well as other popular meta-analysis methods, have problems with small datasets and datasets with a small number of replicates. Hong et al. [30] proposed an RP-based method th at is most appropriate, not only for small sample sizes, but also for in creasing the performance on noisy data. Their method has achieved widespread acceptance and has been used in such diverse fields as RNAi analysis, proteomics, and machine learning [8]. However, most RP-based methods, unlike t-test based methods, do not pr ovide an overall measure of differential expression for every gene within a given study [54]. We develop a meta-analysis method using RP first to determin e the differentially expressed genes in each study and then to construct ordered li sts of up and down regulated genes in each study based on the p-values determined using a permutationbased method. We then use the non-parametr ic rank-based pattern-matching strategy based on the Kolmogorov-Smirnov statistic [25] to filter the profiles of interest. We PAGE 23 again use RP to filter genes which are differentially expressed across multiple studies. In our method, a t-test based appr oach can also be us ed in the first step to generate an estimated FC for each gene within each study. These values are then used in the cluster analysis, instead of the average FC values, to group the genes base d on their expression change pattern similarity across multiple studies. 1.2.2 Cluster analysis of gene expression data Numerous clustering algorithms are currently in use. Hierarchi cal clustering results in a tree structure, where genes on the same br anch at the desired level are considered to be in the same cluster. While this structure provides a rich visualization of the gene clusters, it has the limitation that a gene can be assigned only to one branch, i.e., one cluster. This is not always biologically reasonable because most, if not all, genes have multiple functions. In addition, because of the mechanism of one-way assignment of genes to branches, the results may not be globally optimal. The alternative approach, partitioning clustering, includes two major me thods, heuristic-based and model-based. The former assigns objects to clusters using a heuristic mechanism, while the latter uses quantifying uncertainty measures. Probability and possibility are the two uncertainty measures commonly used. While probabilistic bodies of evidence consist of singletons, possibilistic bodies of evidence are familie s of nested sets. Both probability and possibility measures are uni quely represented by distri bution functions, but their normalization requirements are very different. Values of each probability distribution are required to add to 1, while for possibil ity distributions, the largest values are required to be 1. Moreover, the latter re quirement may even be abandoned when PAGE 24 possibility theory is formulated in te rms of fuzzy sets [172-173, 179]. The mixture model with the expectation-maximizati on (EM) algorithm is a well-known method using the probability-based approach. This method has the advantages of a strong statistical basis and a statis tics-based model selection. However, the EM algorithm converges slowly, particularly at regions wh ere clusters overlap [27] and requires the data distribution to follow some specific di stribution model. Because gene expression data are likely to contain overlapping clusters and do not always follow standard data distributions, (e.g., Gaussian, Chi-squared) [28], the mixture model with the EM method is not appropriate. Fuzzy clustering using the most popular algorithm, Fuzzy C-Means (FCM) [92, 121], is another model-based clustering appro ach that uses the po ssibility measure. FCM both converges rapidly and allows assi gnment of objects to overlapping clusters using the fuzzifier factor, m, where 1 m When the value of m changes, the cluster overlaps will change accordingly and ther e will be no overlapping regions in the clustering results when the value of m is e qual to 1. However, similar to the EM-based method and the other partitioning methods, FCM has the problem of determining the correct number of clusters. Even if the cluster number is known a priori, FCM may provide different cluster parti tions. Cluster validation methods are required to determine the optimal cluster solution. Unfortunate ly, the clustering model of FCM is a possibility-based one. There is no straightforw ard statistics-based approach to evaluate a clustering model except the cluster vali dity indices based on the compactness and separation factors of the fuzzy partitions However, there are problems with these PAGE 25 validity indices; both with scaling the two fact ors and with over-fit estimates. We have combined the advantages of the EM and FCM methods, where FCM plays the key role in clustering the data, and proposed a method to convert th e clustering possibility model into a probability one and then to use the Central Limit Theorem to compute the clustering statistics and determine the data distribution model that best describes the dataset. We applied the Bayesian method with loglikelihood ratio and Akaike Information Criterion measure using the estimated distribution model for a novel validation method for fuzzy clustering partitions [147]. 1.2.3 DNA sequence analysis Genes within a cluster are expressed similarly under the given experimental conditions, e.g., a drug treatment, a comparis on between the normal and disease tissues, or other biological comparisons of interest. Genes that are similarly expressed may have regulatory DNA sequences in common and a ppropriate analysis may identify motifs that are overrepresented in these gene sequences. Such motifs may contribute to expression of the entire group [31]. In additi on, regarding clusters of down-regulated genes, we can apply the sequence analysis to the 3Â’ untranslated regions (3'UTR) potentially discovering mo tifs connected to miRNAs [32]. We may discover the relationships between the miRNAs and the diseases or the treatment conditions targeted by the microarray experiments. Most algorithms developed to find mo tifs in biological sequences do not successfully identify motifs containing gaps and do not allow a variable number of motif instances in different sequences. In addition, they may converge to local optima PAGE 26 [33] and therefore do not guarantee that th e motifs are indeed overrepresented in the sequence set. The MEME algorithm [34] solved many of these problems, but still does not model gapped motifs well. By combining the EM algorithm with the hierarchical genetics algorithm, we developed a novel motif discovery algorit hm, HIGEDA [145], that automatically determines the motif consensus using Position Weight Matrix (PWM). By using a dynamic programming (D P) algorithm, HIGEDA also identifies gapped motifs. 1.3 Research contributions The rapid growth of microarray repositories has increased the need for effective methods for integration of datasets among laboratories using di fferent microarray technologies. The meta-analysis approach has the advantage of inte grating data without the problem of scaling expression values a nd also has multiple supporting methods to detect and order significantl y differentially expressed ge nes. However, parametric statistics methods may perform differently on different datasets, because of different assumptions about the method model. We pr opose using the RP method [8] to filter the genes for differential expression. We also propose to integrate RP with a pattern matching ranking method, in a method named R PPR [169], a combination of the Rank Product and Rank Page methods, to more effe ctively filter gene expression profiles from multiple studies. Meta-analysis of gene expression data doe s not put differentially expressed genes together at the phenotype level of the e xperiments. Therefore, the real power of microarrays is not completely exploited. Cl uster analysis can di scover relationships PAGE 27 among genes, between genes and biologi cal conditions, and between genes and biological processes. Most cl ustering algorithms are not ap propriate for gene expression data analysis because they do not allow overlapping clusters and require the data to follow a specific distribution model. We developed a new clus tering algorithm, fzGASCE [157] that combines the well-k nown optimization algorithm, Genetic Algorithm (GA), with FCM and fuzzy subtractiv e clustering to effectively cluster gene expression data without a prio ri specification of the numbe r of clusters. Regarding the parameter initialization problem of the FCM algorithm, we developed a novel Fuzzy Subtractive Clustering (fzSC) al gorithm [146] that uses a fuzzy partition of data instead of the data themselves. fzSC has advantages over Subtractive Clustering in that it does not require a priori sp ecification of mountain peak and mountain ra dii, or the criterion on how to determine the number of clusters. In addition, the computational time of fzSC is O(cn) instead of O(n2), where c and n are the numbers of clusters and data objects respectively. To address the problem of missing data in cl uster analysis, we developed two imputation methods, fzDBI and fzPBI, us ing histogram based and probability based approaches to model the data distributions and apply the model in imputation of missing data. For the problem of cluster model sel ection and clustering result evaluation with the FCM algorithm, we developed a novel method, fzBLE, [147] that uses the likelihood function, a statistics goodness-of-fit, and a Bayesian method with the Central Limit Theorem, to effectively describe the data distribution and correctly select the optimal fuzzy partition. Using the statis tical model of fzBLE, we developed a PAGE 28 probability based method for defuzzificatio n of fuzzy partition that helps with generating classification information of data objects using fuzzy partition. In addition to clustering of gene expre ssion data, we also propose an analysis directed at the biological relevance of the cluster results. Using our RPPR method, we provide relationships between the gene clusters and the biological c onditions of interest, and determine the clusters which respond posit ively and negatively to the experimental conditions. Furthermore, our motif-finding algorithm, HIGEDA [ 145], can discover common motifs in the promoter and mRNA sequences of genes in a cluster. In addition to predicting the possible transcription fact or binding sites, HIGE DA can also be used to predict potential miRNA binding sites. Figure 1-1 shows the microarray data analys is key issues which are addressed in this research. The remainder of the dissert ation is organized as follows: Chapter 2 presents the background of microarray an alysis including mi croarray technology, microarray data analysis and recent approaches and the goals of this research; Chapter 3 describes a rigorous analysis of the Fu zzy C-Means algorithm; Chapter 4 describes our methods to address the challenges of gene expression microarray data analysis; and Chapter 5 demonstrates some specific app lications of our approach, concludes our contributions and discusses our future work. PAGE 29 Microarray from lab1 Microarray from lab2 Microarray from lab (p) Meta-analysis using RPPR Gene expression profiles of interest Determination of optimal fuzzy partition (fzSC) Gene groups miRNA pr ediction Biological context relevance Cluster analysis using fzGASCE, Missing data imputation using fzDBI, fzPBI Group selection (RPPR) Motif discovery (HIGEDA) Cluster validation (fzBLE) OK Defuzzification fzPBD Figure 1-1 : Microarray data analysis and the contributions of this research PAGE 30 2. Microarray analysis background 2.1 Microarray DNA microarrays rely on the specificity of hybridization between complementary nucleic acid sequences in DNA fragments (termed probes) immobilized on a solid surface and labeled RNA fragments isolated fr om biological samples of interest [6]. A typical DNA microarray consists of thousands of ordered sets of DNA fragments on a glass, filter, or silicon wafer. After hybridi zation, the signal intensity of each individual probe should correlate with the abundance of the labeled RNA complementary to that probe [1]. 2.1.1 Microarray platforms DNA microarrays fall into two types based on the DNA fragments used to build the array: complementary DNA (cDNA) arrays and oligonucleotide arrays. Although a number of subtypes exist for each array type, spotted cDNA arra ys and Affymetrix oligonucleotide arrays are the major platforms currently in use. The choice of which microarray platform to use is based on the rese arch needs, cost, available expertise, and accessibility [1]. For cDNA arrays, cDNA probes, which ar e usually generated by a polymerase chain reaction (PCR) amplification of cD NA clone inserts (representing genes of interest), are robotically spotted on glass slid es or filters. The immobilized sequences of cDNA probes may range greatly in length, but are usually much longer than those of the corresponding oligonucleotide probes. The major advantage of cDNA arrays is the flexibility in designing a custom array fo r specific purposes. Numerous genes can be PAGE 31 rapidly screened, which allows very quick elaboration of functi onal hypotheses without any a priori assumptions [120]. In addition, cDNA arrays typically cost approximately one-fourth as much as commercial oligonucle otide arrays. Flexibility and lower cost initially made cDNA arrays popular in academ ic research laborato ries. However, the major disadvantage of these arrays is the am ount of total input RNA needed. It is also difficult to have complete control over th e design of the probe sequences. cDNA is generated by the enzyme reverse transc riptase RNA-dependent DNA polymerase, and like all DNA polymerases, it cannot initiate s ynthesis de novo, but requires a primer. It is therefore difficult to generate compre hensive coverage of all genes in a cell. Furthermore, managing large clone libraries, and the infrastructure of a relational database for keeping records, sequence verifi cation and data extraction is a challenge for most laboratories. For oligonucleotide arrays, probes are comprised of short segments of DNA complementary to the RNA transcripts of in terest and are synthesized directly on the surface of a silicon wafer. When compared with cDNA arrays, oligonucleotide arrays generally provide greater gene coverage, more consistency, and better quality control of the immobilized sequences. Other advantages include uniformity of probe length, the ability to discriminate gene splice variants and the availability of carefully designed standard operating procedures. A nother advantage particular to Affymetrix arrays is the ability to recover samples after hybridization to an array. This feat ure makes Affymetrix arrays attractive in s ituations where the amount of ava ilable tissue is limited. However, a major disadvantage is the high cost of arrays [1]. PAGE 32 Following hybridization, the image is pr ocessed to obtain the hybridization signals. There are two different ways to m easure signal intensit y. In the two-color fluorescence hybridization scheme, the RNA fr om experimental and control samples (referred to as target RNAs) are differentiall y labeled with fluoresce nt dyes (Cye5 red vs. Cye3 green) and hybridized to the same array. When the region of the probe is fluorescently illuminated, both the experimental and contro l target RNAs fluorescence and the relative balance of red versus green fluorescence indicate the relative expression levels of experimental and control target RNAs. Therefore, gene expression values are reported as ratios between the two fluorescent values [1]. RNA extraction mRNA mRNA Cy3 Cy5 Sample preparation and labellin g Hybridization Washing Dirty pins Clean pins Image acquisition Image processing CEL Feature-level data (cel, grpÂ…) cells cells Figure 2-1 : Microarray design and screening PAGE 33 Affymetrix oligonucleotide arrays use a one-color fluorescence hybridization system where experimental RNA is labeled with a single fluores cent dye and hybridized to an oligonucleotide array. After hybridization, the fluores cence intensity from each spot on the array provides a measurement of the abundance of the corresponding target RNA. A second array is then hybridized to the control R NA, allowing calculation of expression differences. Because Affymetrix array screening generally follows a standard protocol, results fr om different experiments in different laboratories can theoretically be combined [6]. Following image processing, the digitized ge ne expression data need to be preprocessed for data normalization be fore carrying out further analysis. 2.1.2 Gene expression microarray data analysis Regarding differentially expressed gene s, many protocols use a cutoff of a twofold difference as a criter ion. However, this arbitrary cu toff value may be either too high or too low depending on the data variability. In addition, inherent data variability is not taken into account. A data point above or below the cutoff line could be there by chance or by error. To ensure that a gene is truly differentially expressed requires multiple replicate experiments and statistical testing [8]. However, not all microarray experiments are array-replicated. We, theref ore, need to analyz e data generated by different protocols. Statistical analysis, including meta-analysi s, uses microarrays to study genes in isolation while the real power of microarray s is their ability to study the relationships between genes and to identify genes or sa mples that behave in a similar manner. PAGE 34 Machine learning and data mining approaches have been developed for further analysis procedures. These approaches can be divided into unsupervised and supervised methods. Unsupervised methods involve the aggregation of samples, genes, or both into different clusters based on th e distance between measured gene expression values. The goal of clustering is to group obj ects with similar properties, leading to clusters where the distance measure is small within clus ters and large between clusters. Several clustering methods from classi cal pattern recognition, such as hierarchical clustering, KMeans clustering, fuzzy C-Means clustering, and self-organizing maps, have been applied to microarray data analysis. Usi ng unsupervised methods, we can search the resulting clusters for candidate genes or tr eatments whose expression patterns could be associated with a given biological condi tion or gene expression signature. The advantage of this method is that it is unbiased and allows for identification of significant structure in a complex dataset without a ny prior information about the objects. In contrast, supervised methods inte grate the knowledge of sample class information into the analysis with the goal of identifying expression patterns (i.e., gene expression signatures) which could be used to classify unknown samples according to their biological characteristics. A training dataset, consisting of gene expression values and sample class labels, is used to select a subset of expressed genes that have the most discriminative power between the classes. It is then used to build a predictive model, also called a classifier (e.g., k-nearest neighbors, neural netw ork, support vector machines), which takes gene expression values of the pre-selected set of genes of an unknown sample as input and outputs the predicted class label of the sample. PAGE 35 2.2 Microarray data meta analysis 2.2.1 Microarray data integration The integrated analysis of data from multiple studies generally promises to increase statistical power, gene ralizability, and reliability, while decreasing the cost of analysis, because it is perfor med using a larger number of samples and the effects of Statistical analysis: meta-analysis, DEG Data mining: Clustering, Classification Sequence analysis Biological Question Experimental Design Experiment Quantification Normalization Estimation Clustering Testing Classification Application specific Technology specific Pre-processing Analysis Motif finding Regulatory prediction Figure 2-2 : Microarray analysis PAGE 36 individual study-specific bias es are reduced. There are tw o common approaches for the problem of integrating microarray data: meta-analysis and direct integration of expression values. The direct integration procedure [11-13, 17] can be divided into the following steps. First, a list of genes common to multiple distinct microarray platforms is extracted based on cross-referencing the annot ation of each probe set represented on the microarrays. Cross-referencing of expression data is usually achieved using UniGene or LocusLink/EntrezGene databases or th e best matching mapping provided by Affymetrix. Next, for each individual datase t, numerically comparable quantities are derived from the expression values of ge nes in the common lis t by application of specific data transformation and normali zation methods. Finally, the newly derived quantities from individual datasets are combin ed. Direct integrati on methods include the following: Ramaswamy et al. [14] re-scaled expression values of a common set of genes for each of five cancer microarray datasets generated by independent laboratories using different microarray platforms. Combining them to form a dataset with increased sample size allowed identific ation of a gene e xpression signature that distinguished primary from metastatic cancers. Bloom et al. [15] used a scaling approach based on measurements from one common control samp le to integrate microarray data from different platforms. Shen et al. [16] propos ed a Bayesian mixture model to transform each raw expression value into a probability of differential expression (POD) for each gene in each independent array. Integrating multiple studies on the common probability scale of POD, they developed a 90-gene me ta-signature that pr edicted relapse-free PAGE 37 survival in breast cancer patients with im proved statistical power and reliability. In addition to common data transformation and nor malization procedures, Jiang et al. [17] proposed a distribution transf ormation method to transform multiple datasets into a similar distribution befo re data integration. Data processed by distribution transf ormation showed a greatly improved consistency in gene expression patterns between multiple datasets. Warnat et al. [18] used two data integration methods, median rank scores and quartile discretization, to derive numerically comparable measures of gene expression from independent datasets. These transformed data were then integrated and used to build support vector machine classifiers for cancer classification. Their resu lts showed that cancer classification based on microarray data could be greatly improved by integrating multiple datasets with a similar focus. The classifiers built from in tegrated data showed high predictive power and improved generaliz ation performance. A major limitation of these di rect integration methods is that filtering genes to generate a subset common to multiple di stinct microarray platforms often excludes many thousands of genes, some of wh ich may be significant. However, data transformation and normalization methods ar e resource-sensitive [20]; one method may be a best fit for some datasets, but not for ot hers. It is difficult to come to a consensus regarding a method that is best for data transformation and normalization on given datasets. Several studies have shown that e xpression measurements from cDNA and oligonucleotide arrays may show poor correla tion and may not be directly comparable PAGE 38 [21]. These differences may be due to variances in probe content, deposition technologies, labeling and hybridizing protocol s, as well as data extraction procedures (e.g., background correction, normalization, and cal culation of expression values). For example, cDNA microarray data is usually defined as ratio s between experimental and control values and cannot be directly compar ed with oligonucleotide microarray data that are defined as expression va lues of experimental samples. Across-laboratory comparisons of microa rray data has also demonstrated that sometimes there are larger differences between data obtained in different laboratories using the same microarray technology than data obtained in the same laboratory using different microarray technologi es [22]. Wang et al. [116 ] also showed that the agreement between two technolog ies within the same lab was greater than that between two labs using the same technology; the lab e ffect, especially when confounded with the RNA sample effect, usually plays a bigger role than the platform effect on data agreement. Commercial microarrays, such as Affymetrix arrays, have produced several generations of arrays to keep up with advances in genomic sequence analysis. The number of known genes and the representa tive composition of gene sequences are frequently updated and probe sets are modifi ed or added, to better detect target sequences and to represent newly discove red genes. A recent study has shown that expression measurements within one genera tion of Affymetrix arrays are highly reproducible, but that repr oducibility across generations depends on the degree of similarity of the probe sets and the levels of expression measurements [17]. Therefore, PAGE 39 even when using the same microarray tec hnology, different genera tions of microarrays make direct integration difficult. Technical variabilities, which result from differences in sample composition and preparation, experimental protocols and parameters, RNA quality, and array quality, pose further challenges to the direct integr ation of microarray da ta from independent studies. Vert et al. [20], and Irizarry et al. [22] describe d the lab-affect for microarray data and concluded that direct integration of expression data is not appropriate. Cahan et al. [21] and Warnat et al. [24], using two methods to id entify differentia lly expressed genes prior to carrying out a classification an alysis, showed that gene expression levels themselves could not be directly compared between different platforms. Therefore, we propose to utilize the meta-analysis, instea d of direct integrat ion, before further computer analyses. The meta-analysis method, in contrast to the direct integration method, combines results from individual analyses. It therefore avoids the problem of scaling the gene expression levels among datasets from different laboratories. A number of studies have shown that meta-analysis provides a r obust list of differentially expressed genes [2123]. With the increasing use of next generation sequenc ing techniques, microarrays are no longer the only highthroughput technology for ge ne expression studies. The direct integration approach appears to be in appropriate to those repositories, because we can only scale the expression values acro ss multiple experiments using the same technology. Even with the same technol ogy, e.g., Affymetrix oligonucleotide and PAGE 40 cDNA microarrays, the use of th is approach seems impossibl e with scaling expression values. 2.2.2 Issues with microarray data meta analysis Considering that meta-analysis involv es the whole process from choosing microarray data to detecting differentially expressed genes, the key issues are as follows. Issue 1: Identify suitable microarray datasets The first step is to determine which datasets to use regarding the goal of analysis. This is done by determining the inclusion-exclusion criteria. The most important criterion is that the datasets collected should be in the standard gene expression data format, e.g. features by samples [23]. Issue 2: Annotate the individual datasets Microarray probe design uses short, highly specific regi ons in genes of interest because using the full-length gene sequences can lead to non-specific binding or noise. Different design criteria lead to the creation of different probes for the same gene. Therefore, one needs to identify which probe s represent a given gene within and across the datasets. The first option is to clus ter probes based on seque nce data [11, 17]. A sequence match method is especially appropriate for cross-platform data integration, as well as Affymetrix cross-generation data in tegration. However, the probe sequence may not be available for all platforms and th e clustering of probe sequences could be computationally intensive for ve ry large numbers of probes. PAGE 41 Alternatively, one can map probe-level identifiers such as Image CloneID, Affymetrix ID, or GenBank accession numbers to a gene-level identifier such as UniGene, RefSeq, or LocusLink/EntrezGene UniGene, which is an experimental system for automatically partitioning se quences into non-re dundant gene-oriented clusters, is a popular choice to unify the different datasets For example, UniGene Build #211 (released March 12, 2008) reduces the ne arly 7 million human cDNA sequences to 124,181 clusters. To translate probe-level identi fiers to gene-level identifiers, one can use either the annotation p ackages in BioConductor, or the identifier mapping tables provided by NCBI and A ffymetrixID for LocusLink/Entrez ID to probe ID, or UniGene for probe ID to RefSeq. Issue 3: Resolve the many-to-many re lationships between probes and genes The relationship between probes and ge nes is unfortunately not unequivocal, which means that in some cases a probe may report more than one gene, and vice versa. Even using the same Affymetrix platform, th e combination of diffe rent chip versions creates serious difficulties, because the probe identification labels (IDs) are not conserved from chip to chip. Therefore, to combine microarray data across studies, a unique nomenclature must be adopted and all the different IDs of the chips must be translated to a common system. It is reasonable that many probe identifie rs are mapped onto one gene identifier. This is due to the current UniGene clustering and genome annotation, because multiple probes per gene provide intern al replicates, and allow fo r poor performance of some probes without the loss of gene expression detection, and because microarray chips PAGE 42 contain duplicate spotted probes. The issue is when a probe identif ier can be mapped to many gene identifiers. This may lead to a problem with the further meta-analysis. For example, a probe could map to gene X in half of the datasets, but to both genes X and Y in the remaining datasets. The further meta -analysis will treat such probes as two separate gene entities, failing to fully comb ine the information for GeneID X from all studies. If one simply throws away such pr obes, valuable information may be lost to further analysis. Issue 4: Choosing the meta-analysis technique The decision regarding which meta-ana lysis technique to use depends on the specific application. In th is context, we focus on a fundamental application of microarrays: the two-class comparison, e.g. the class of treatment samples and the class of control samples, where the objective is to identify genes differentially expressed between two specific conditions. Let XN P represent the expression matrix of selected datasets from multiple studies; where N is the number of common genes in the P selected samples. Let K be the number of studies from which the samples were selected. Denote by Pk the number of samples that belong to the kth study. Let C kPbe the number of control samples and T kPbe the number of treatment or experimental samples from the kth study, then T k C k kP P P and K 1 k kP P. For Affymetrix oligonucleotide array experiments, we have C kP chips with gene expression measures from the control class and T kPchips with gene expression measures from the trea tment class. For the two-channel array PAGE 43 experiments, we assume that the comparisons of log-ratios are all indirect, i.e., the samples from the treatment class are hybridiz ed against a reference sample RA. Then the expression values from the kth study are collected into X: T k j 2 T kjP , 1 j ), R / T ( log X and C k q 2 T kqP , 1 q ), R / C ( log X The meta-analysis will be performed on the results from analyses that were performed on X. There are four common ways to integrate such information across studies. Vote counting This method simply counts the number of studies where a gene was declared as significant [48]. If the number of studies is small, this method can be visualized using Venn diagrams. Combining ranks Unlike vote counting, this technique acc ounts for the order of genes declared significant [30, 52-54]. Ther e are three different approaches to aggregate the rankings of, say, the top 100 lists (the 100 most significantly up-regulated or down-regulated genes) from different st udies [23]. Two of the algorithms use Markov chains to convert the pair-wise preference between the gene lists to a stationary distribution; the third algorith m is based on an order-statistics model. The rank product method, proposed by Breitlin g et al. [29] is the most popular of this approach using the third algorith m. It is a non-parametric statistic which is originally used to detect differentially expressed genes in a single dataset. It is derived from the biological reasoning of the fold ch ange (FC) criterion and it detects the genes that are consiste ntly found among the most strongly up- PAGE 44 regulated (or down-regulated) genes in a number of replicate experiments. However, this method offers a natu ral way to overcome the heterogeneity among multiple datasets and, therefore, can be extended to meta-analysis, which generates a single signifi cance measurement for each gene in the combined study [30]. Within a given dataset of the study k, the pairwise fold change is computed for every gene i,C K T K q k kqP , 1 q P , 1 k C / T pFC There are C K T KP P such ratio values for each gene i, let R=C K T KP P These ratios are ranked within comparison to generate a ranking matrix rFCN:R. The rank product (RP) is computed as the geometric mean of rank for each gene i, by R / 1 R 1 r ir irFC RP (2.1) RPi/n can be interpreted as a p-value, because it describes the probability of observing gene i at rank rFCi or better in the qth comparison, q=1,Â…,R. A permutation procedure is independently car ried out to generate all other possible combinations of ranks relative to gene i in the dataset, nRPi. Repeat this procedure B times, where B is a positive number, to form reference distribution with nRPi across R comparisons which is then used to compute the (adjusted) pvalue and false discovery rate (FDR) for each gene. Combining p-values In the 1920s, Fisher developed a meta-method that combined the p-values from individual studies [117]. In the kth study, for the ith gene, the p-value pik is generated by one-sided hypothesis testing. The logs of the p-values of the same PAGE 45 gene are summed across K studies using th e FisherÂ’s sum of logs method which is defined as follows [50, 51], K 1 k ik 2 i) p ( log 2 S. (2.2) The Si values are then assessed by comparison against a chi-square distribution with 2K degrees of freedom,2 K 2 Rhodes et al. [36] proposed a st atistical model for performing metaanalysis of four indepe ndent prostate cancer micr oarray datasets from cDNA arrays and Affymetrix arrays. Each gene in each study was treated as an independent hypothesis and a significance (denoted by one p value and one q value) was assigned to each gene in each study based on random permutations. The similarity of significance across st udies was assessed with meta-analysis methods and combined with multiple infere nce statistical test s for each possible combination of studies. A cohort of genes was identified to be consistently and significantly deregulated in prostate cancer Marot et al. [40] used a sequential meta-analysis method in which they comput ed p-values in each sequential step and combined p-values from different sequential steps to determine the p-values for the results which are either signi ficant genes or biological processes throughout the entire analysis process. Combining effect-sizes The first step is to ca lculate the effect size dik and the variance ik associated with the effect size for gene i in the kth study, i = 1,Â…,N; k = 1,Â…,K. PAGE 46 p k ik ik ikS C T d (2.3) where p kS is the pooled standard deviation estimated across N genes in the dataset of the kth study. ) N N ( S ) 1 N ( SN 1 i ik 2 ik N 1 i ik p k (2.4) Based on the mean differences, an estimated pooled mean difference, i, and its variance,2 i are computed for each gene across K studies. A z-score zi is then calculated using i and2 i : 2 i i iz (2.5) Effect size can be calculated using the correlation coefficient and the method of Cohen [135], which is the differen ce between the means of two groups standardized by its pooled standard deviation [37, 48, 51]. The statistical significance can be determined using a comparison against N(0,1) on these zscores. However, if the number of studies K, is small, there may be a problem with over-fitting. In this case, the pe rmutation method is used instead. The method from Hedges and Olkin [136] show ed that this standardized difference overestimates the effect size for studies with small sample sizes. They proposed a small correction factor to calculate th e unbiased estimate of the effect size, which is known as the HedgesÂ’ adjustme nt. The study-specific effect sizes for every gene are then combined across st udies into a weighted average. The weights are inversel y proportional to the varian ce of the study-specific PAGE 47 estimates. Choi et al. [37-38] introduced a new meta-analysis method, which combines the results from individual datasets in the form of effect size and has the ability to model the inter-study variati on. The effect size was defined to be a standardized mean difference between the test samples and normal samples. The effect sizes from multiple microarray datasets were combined to obtain an estimate of the overall mean, and statis tical significance was determined by the permutation test extended to multiple datasets. It was demonstrated that data integration using this method promoted the discovery of small but consistent expression changes and increased the sensitivity and reliability of the analysis. An extended effect size model was then proposed by Hu et al. [39] for metaanalysis of microarray data. Issue 5: Samples / features selection Regardless of the methods of data integr ation, finding features or samples that together represent the biological meaning for the integrated datasets remains a challenge. Lin [42] used a heuristics method to detect significant genes from integrated data. The underlying spaces were determined using a heuristics search. Then, a set of candidate genes lists were identified again using a heuristics sear ch. Parkhomenko et al. [43] used principle compone nt analysis (PCA) to determine the most important components of the data spaces Pairs of components are processed one by one using a correlation measurements method to look for the Â‘bestÂ’ combination of components in the dataset. This analysis is also a heuristic-b ased one. Su et al. [44] used a set of genes similar to the gene expression signature target ed to the biological context of interest, to PAGE 48 detect candidate genes called VIP (Very Impor tant Pool) genes. These genes control the search process to find the significant samples (in the integrated dataset, the samples are columns, as we mentioned in Issue 1) and gene s from the integrated data. Tan et al. [47] approached the problem using GA and a resampling method; the feature and sample candidates were selected under a controlled process using GA and the fitness function was computed using statistical te st results on resample sets. Traditional methods to identify similar samples on the basis of gene expression values are Principle Component Analysis and Hierarchical Cluster Analysis. These methods, however, have the problem of strong ly relating samples of the same cell lines and batches because of the similarity amo ng cells grown at the same time [25]. In addition, these methods require all expressi on profiles to be generated on the same microarray platform, because the determin ation of distances between expression profiles from different platforms is not straig htforward, which limits use to the available microarray repositories. Lastly, these me thods may not benefit from additional information regarding the biological context of interest. For such cases, Huttenhower et al. [49] proposed a meta-analysis method ba sed on a statistical approach within a Bayesian framework. Considering the biologic al context of the analysis, they measured the distance between every pair of genes and constructed a Bayesian framework for each specific biological context based on the set of arrays. The Bayesian framework was then used to compute the probability of the relevance of each array regarding the context. This probabilistic measurement can be used to filter the samples. Because their method requires a set of genes targeting the biological context, in addition to a PAGE 49 definition by the biologists, the biological context can be defined by a set of genes having specific GO (Gene Ontology) terms, or participating in a specific biological process, or belonging to a gene expression signature. Lamb et al. [25] and Zhang et al. [2 6], by using gene expression signatures proposed pattern-matching, used ranking-base d approaches to perf orm meta-analysis on CMAP datasets. Genes in each expression profiles were ranked using fold changes. Expression profiles were then integrated base d on how they fit a given gene expression signature. Their research successfully identifi ed a number of interesting compounds that correlated with known biological processes. 2.3 Cluster analysis After meta-analysis, we identify a list of genes differentially expressed in a single study or across multiple studies. Using this gene list, we filter out the matrix of expression levels of genes significantly ex pressed under the experimental conditions. Cluster analysis then groups genes where ge nes within the same cluster have similar expression patterns across studies. This analysis allows the simultaneous comparison of all clusters independently of the platform used and the species studied; similar samples under specific biological conditions from diff erent species can also be analyzed together by meta-analysis. This comparison allows identification of: i) robust signatures of a pathology or a treatment across several independent studies. ii) sets of genes that may be similarly modulated in different disease states or following drug treatments. iii) common sets of co-expressed genes between human and animal models. Because a typical microarray study generates expressi on data for thousands of genes from a PAGE 50 relatively small number of samples, an analysis of the integrated data from several sources or laboratories may provide outcomes that may not necessarily be related to the biological processes of interest. Cluster analysis will fu rther explore the groups of genes that are absolute key agen ts of the experimental results. Given a set of n objects X = {x1, x2,Â…,xn}, let = {U, V}, where cluster set V = {v1, v2,Â…,vc} and partition matrix U={uki}, k=1,Â…,c, i=1,Â…,n, be a partition of X such that k c 1 kv X and c 1 k kin , 1 i 1 u Each subset vk of X is called a cluster and {uki} is the membership degree of {xi} to vk. uki {0,1} if is crisp partition, otherwise, uki [0,1]. The goal of cluster analysis is to as sign objects to clusters such that objects in the same cluster are highly similar to each other while objects from different clusters are as divergent as possible. Thes e sub-goals create what we call the compactness and separation factors that are used, not only for modelling the clustering objectives, but also for evaluating the clustering result. Th ese two parameters can be mathematically formulated in many different ways that lead to numerous clustering models. A dataset containing the objects to be clustered is usually represented in one of two formats, the object data matrix an d the object distance matrix. In an object data matrix, the rows usually represent the objects and the columns represent the attributes of the objects regarding the context where the ob jects occur. The role s of the rows and columns can be interchanged for another re presentation method, but this one is preferred because the number of objects is al ways enormously large in comparison with the number of attributes. Here objects are ge nes and attributes are the expression levels of the genes under the experimental cond itions. Assume we have n genes and p PAGE 51 experimental conditions. The gene data matr ix X then has n rows and p columns where xij is the expression level of gene i under the condition j and xi is the expression vector (or object vector) of gene i across p experiments. The distance matrix contains the pairwise distance (or dissimilarity) of objects. Specifically, the entry (i,j) in the distance matrix represents the distance between objects i and j, 1 i,j n. The distance of genes i and j can be computed using the object vector s i and j from the object data matrix based on a distance measurement. However, the obj ect data matrix cannot be fully recovered from the distance matrix, especially when the value of p is unknown. 2.3.1 Clustering algorithms Clustering algorithms are classified into either hierarchical or partitioning approaches. While the hierarchical methods group objects into a dendrogram (tree structure) based on the dist ance among clusters, the partitioning methods directly groups objects into clusters and objects are a ssigned to clusters based on the criteria of the clustering models. Clustering methods using a partitioning approach are classified into either model-based or heuristic-based approaches. Model-based clustering assumes that the data were generated by a model and tries to recover the original model from the data. The model obtained from the data then define s the clusters and assigns objects to them. Although K-means is a heuristic-based approach, once a set of K centers that are good representatives for the data is found, we can consider it as a model that generates the data with membership degrees in {0,1} and an addition of noise. In this context, KMeans can be viewed as a model-based approach. PAGE 52 2.3.2 Hierarchical clustering algorithms Hierarchical clustering (HC) builds a de ndrogram (hierarchical structure) of objects using either agglomerative (bottom-up) or divisive (topdown) approaches. In the former approach, the dendrogram is initia lly empty; each object is in its own subtree (singleton cluster). The clustering pro cess merges similar sub-trees. At each step, the two sub-trees most similar are merged to form a new sub-tr ee. The process stops when the desired number of sub-trees, say c, at the highest level is achieved. In the top down approach, the dendrogram starts with a sub-tree that contains all the objects. The process then splits the sub-trees into new subtrees. At each step, the sub-tree of objects with maximum dissimilarity is split into two sub-trees. The process stops when the number of leaves, say c, or the compactness and separation criter ia, is achieved. The dissimilarity between every pair of clusters is measured by either the single linkage: l j k i j i l k Sv x v x ); x x ( d min ) v v ( L (2.6) the complete linkage: l j k i j i l k Cv x v x ); x x ( d max ) v v ( L (2.7) or the average linkage: l k v x v x j i l k Av v ) x x ( d ) v v ( Ll j k i (2.8) The clustering process results in a set of c sub-trees at the level of interest in each approach. In comparison with the pa rtitioning approach, each such sub-tree corresponds to a cluster. PAGE 53 Because of their methods of grouping or splitting sub-trees, HC algorithms are considered heuristic-based approaches. The common criticism for standard HC algorithms is that they lack robustness and, hence, are sensitive to noise and outliers. Once an object is assigned to a cluster, it wi ll not be considered ag ain, meaning that HC algorithms are not capable of correcting pr ior misclassifications and each object can belong to just one cluster. The computationa l complexity for most HC algorithms is at least O(n2) and this high cost limits their application in large-scale datasets. Other disadvantages of HC include the tendency to form sphe rical shapes and reversal phenomenon, in which the normal hierar chical structure is distorted. 2.3.3 K Means clustering algorithm The K-Means algorithm is a partition appr oach using a heuristic-based method. Its objective function is based on the square error criterion, c 1 k n 1 i 2 k i kiv x u = X) | V J(U,, where uki {0,1}. (2.9) K-Means is the best-known square e rror-based clustering algorithm. Its objective, J, is based on both compactness and separation factors. Minimizing J is equivalent to minimizing the compact ness while maximizing the separation. PAGE 54 K-Means algorithm Steps 1) Initialize the partition matrix randomly or based on some prior knowledge. 2) Compute the cluster set V. 3) Assign each object in the datase t to the nearest cluster. 4) Re-estimate the partition matrix using the current cluster set V. 5) Repeat Steps 1, 2 and 3 until there is no object changing its cluster. The K-Means algorithm is very simple and can easily be implemented to solve many practical problems. It works very we ll for compact and hyper-spherical clusters. The computational complexity of K-Means is O(cn). Because c is much less than n, KMeans can be used to cluster large datasets The drawbacks of K-Means are the lack of an efficient and universal method for identify ing the initial partiti ons and the number of clusters c. A general strategy for the probl em is to run the algorithm many times with randomly initialized partitions, although this does not guarantee convergence to a global optimum. K-Means is sensitive to outliers and noise. Even if an object is distant from the cluster center, it can still be forced into a cluster, thus, distor ting the cluster shapes. 2.3.4 The mixture model with expectation maximization algorithm Under the probability context, objects can be assumed to be generated according to several probability distributions. Objects in different clusters may be generated by different probability distributions. They can be derived from differ ent types of density functions (e.g., multivariate Gaussian or t-di stribution), or the same families, but with PAGE 55 different parameters. If the distributions are known, finding the clusters of a given dataset is equivalent to estimating the pa rameters of several underlying models. Denote by P(vk) the prior probability of cluster vk and c 1 k k1 ) v ( P, the conditional probability of object xi given clustering partition = {U,V} is, ) v ( P ) v u | x ( P ) | x ( Pk k k c 1 k i i (2.10) Given an instance of the posterior probability for assigning a data point to a cluster can easily be calculat ed using BayesÂ’ theorem. The mixtures therefore can be applied using any type of components. The multivariate Gaussian distribution is commonly used due to its complete th eory and analytical tractability. The likelihood function P(X| ), n 1 i i) | x ( P ) | X ( P is the probability of generating X using the model. The best model *, therefore, should maximize the log likelihood function, n 1 i i)) | x ( P log( )) | X ( P log( ) | X ( L, ) | X ( L max ) | X ( Lt t (2.11) can be estimated using the expectat ion-maximization (EM) algorithm. EM considers X a combination of two parts: XO is the observations of X; XM is the missing information of X regarding and XM is similar to U of crisp clustering. The complete data log likelihood is then defined as, n 1 i c 1 k k O i k M i M O| x ( p ) v ( P log x ) | X X ( L ) | X ( L. (2.12) PAGE 56 By using an initial value of 0, EM searches for the optimal value of by generating a series of estimates { 0, 1,Â…, T}, where T represents the reaching of the convergence criterion. EM algorithm Steps 1) Randomly generate a value of 0, set t=0. 2) E-step: Compute the expectation of th e complete data log likelihood, Q( t) = E( ( LX|) ). 3) M-step: Select a new parameter estimate that maximizes Q(.), t+1 = max{ Q(,t) }. 4) Set t=t+1. Repeat Steps 2 and 3 until the convergence criterion holds. EM assumes that the data distribution follows specific multivariate distribution model, for example, a Gaussian distribut ion. The major disadvantages of the EM algorithm are the sensitivity to the selection of the initial value of the effect of a singular covariance matrix, the possibility of converging to a local optimum, and the slow convergence rate. In addition, because most biological datasets do not follow a specific distribution model, EM may not be appropriate. 2.3.5 Fuzzy clustering algorithm The clustering techniques we have discu ssed so far are referred to as hard or crisp clustering, which means that each object is assigned to only one cluster. For fuzzy clustering, this restriction is relaxed, and the object xi can belong to cluster vk, k=1,Â…,c, PAGE 57 with certain degrees of membership, uki, uki [0,1]. This is particularly useful when the boundaries among the clusters are not well separated, and are therefore ambiguous. Moreover, the memberships may help us di scover more sophisticated relationships between a given object and the disclosed clus ters. Fuzzy C-Means (FCM) is one of the most popular fuzzy clustering algorithms. FC M attempts to find a fuzzy partition for a set of data points while minimizing the objective function as (2-7). By using a fuzzifier factor, m, FCM is flexible in managing th e overlap regions among clusters, hence n 1 i c 1 c 2 k i m kiv x u ) X | V U ( J (2.13) FCM algorithm Steps 1) Randomly initialize the values of partition matrix U0, set t=0. 2) Estimate the value of vt using n 1 i m t ki n 1 i i m t ki t k) u ( x ) u ( v. (2.14) 3) Compute the new values of Ut+1 that maximally fit with vt. c 1 j 1 m 1 2 t j i 1 m 1 2 t k i t kiv x 1 v x 1 u. (2.15) 4) Repeat Steps 2 and 3 until either {vk} or {u.i} is convergent. PAGE 58 As with the EM and K-Means algorithms, FCM performs poorly in the presence of noise and outliers, and ha s difficulty with identifying the initial parameters. To address this issue, FCM is integrated with heuristic-based search algorithms, such as GA, Ant colony (AC), and Particle swarm opt imization (PSO). The mountain method or Subtractive clustering can also be used as an alternative to search for the best initial value of V of the FCM algorithm. Due to Possi bility Theory limitations, it is difficult to recover data from a fuzzy part ition model. However, recent re search has shown that it is possible to construct a probabi lity model using the possibili ty one. Therefore, by using a possibility to probability transformation, it is possible to recover the data from the fuzzy partition model. The FCM algorithm is considered to be model-based algorithm. The FCM algorithm has advantages over its crisp/probabilistic counterparts, especially when there is a significant ov erlap between clusters [57-59]. FCM can converge rapidly and provides more informa tion about the relations hips between genes and groups of genes in the cluster results. We therefore choose the FCM algorithm for the cluster analysis of ge ne expression data. However, the FCM algorithm also has some drawbacks. The objective function can help to estimate the model parameters, but cannot distinguish the Â“bestÂ” solution fr om the numerous possibly local optimum solutions. The lack of an effective clus ter validation method for FCM prevents its application to real world datasets, where th e number of clusters is not known. FCM is well known for its rapid convergence, but this is not guaranteed to reach the global optimum. Because of these issues, further an alysis with FCM will be carried out in Chapter 3 that will help us develop a novel method. PAGE 59 2.4 Gene regulatory sequence prediction The cluster analysis process produces gr oups of genes with similar expression patterns. Genes in each group may also have similar regulatory sequences. Understanding global transcrip tional regulatory mechanisms is one of the fundamental goals of the post-genomic era [69]. C onventional computati onal methods using microarray data to investigate transcriptiona l regulation focus mainly on identification of transcription factor bindi ng sites. However, many molecu lar processes contribute to changes in gene expression, including transcription rate, alternative splicing, nonsensemediated decay and mRNA degradation (contr olled, for example, by miRNAs, or RNA binding proteins). Thus, computational approaches are needed to integrate such molecular processes. miRNAs bind to complementary sites with in the 3'-UTRs of target mRNAs to induce cleavage and repression of translation [175]. In the past decade, several hundred miRNAs have been identified in mammalian cells. Accumulating evidence thus far indicates that miRNAs play critical roles in multiple biological processes, including cell cycle control, cell growth and differentia tion, apoptosis, and embryo development. At the biochemical level, miRNAs regulate mRNA degradation in a combinatorial manner, i.e., individual miRNAs regulate degradation of multiple genes, and the regulation of a single gene may be conducted by multiple mi RNAs. This combinatorial regulation is thought to be similar in scope to transcripti on factor partner regula tion. Therefore, in addition to transcription factors and ot her DNA/RNA-binding proteins, comprehensive PAGE 60 investigations into transcriptional mechanisms underlyi ng alterations in global gene expression patterns should also consid er the emerging role of the miRNAs. Motif finding algorithms are designed to id entify transcription factor binding sites (TFBS) and miRNA complement ary binding sites (MBS). Th ese sequences are usually short, ~ 8bp and 23bp, respectivel y. In addition, each gene may have zero, one, or more such binding sites (BS). These BSs or mo tifs may be over-represented differently in DNA and RNA sequences of genes in the same group. Motifs are difficult to recognize because they are short, often highly dege nerate, and may contai n gaps. Although motiffinding is not a new problem, challenges remain because no single algorithm both describes motifs and finds them effectivel y. Popular methods fo r motif-finding use a position specific score matrix (PSSM) that describes the motifs statistically, or consensus strings that represent a motif by one or more patterns that appear repeatedly with a limited number of differences. Of the tw o, PSSM is preferred because it is more informative and can easily be evaluated using statistical methods. In addition, a consensus motif model can be replaced by a PSSM one. One of the most popular motif-finding al gorithms using PSSM is MEME proposed by Bailey and Elkan [70]. The advantage of MEME is that it uses expectationmaximization (EM) which, as a probability -based algorithm, produces statistically significant results if it can reach a global opt imum. The disadvantage is that for motif seeds, it uses existing subsequences from w ithin the sequence set and, as a result, may fail to discover subtle motifs. Chang et al [72] and Li et al. [76] overcame this drawback by using the genetics algorithm ( GA) to generate a set of motif seeds PAGE 61 randomly. However, because they used GA w ith random evolution processes, a rapid convergence to a solution is not assured. Li et al. [77] improved on MEME by using one instance of a position weight matrix (PWM), a type of PSSM, to represent a motif, and statistical tests to evaluate the final m odel. However, because EM may converge to local optima, use of a single PWM may fail to find a globally optimal solution. The GAbased methods of Wei and Jensen [83] and Bi [71] use chromosomes to encode motif positions. The method in the former [83] is appropriate for models with zero to one occurrence of the motif per sequence (ZOOPS); the latter [71] is appropriate for models with one occurrence per sequence (OOPS), because one variable can be used to represent the single motif occurrence in each se quence. However, Li et al. [77] recently showed that ZOOPS and OOPS are inadequate when not every sequence has the same motif frequency, and that the two-compone nt mixture (TCM) model, which assumes a sequence may have zero or multiple motif o ccurrences, should be used. However, TCM requires a set of variables for every sequence to manage motif positions, and hence, the size of a chromosome can approach the si ze of the dataset [145]. Lastly, the above algorithms are restricted to fi nding gapless motifs and, therefore, will fail to find many functionally important, gapped motifs. Wh ile some methods, e.g., pattern-based methods of Pisanti et al. [81] and Frith [73], allow gapped motifs, they require the gapped patterns to be well-defined and th ey generate gap positions randomly or by using a heuristic method. Altern atively, Liu et al. [78] us ed neural networks to find gapped motifs, but their approach required a li mited and specific definition of the neural network structure. PAGE 62 2.5 Datasets In order to evaluate the previous studies as we ll as our proposed methods for microarray analysis, we will use multiple data types with different levels of complexity. Artificial datasets are used in both cluste ring and motif-finding a nd are preferred for testing methods because of the ease in ge nerating and running benchmarks. For cluster analysis, we generated datasets using a fini te mixture model of da ta distribution [27], because known cluster structures are ideal for evaluating the capab ility of an algorithm for determining the number of clusters and the cluster prototypes. For sequence analysis, we generated motifs in the set of randomly generated DNA sequences using the normal distribution model [71] with a known number of motif occurrences and locations in every sequence. Some popular datasets in the Machine Learning Repository, University of California Ivrine (UCI) [84], are also used to show how effective our methods are for cluster analys is using real datasets. The most complex datasets, e.g., gene expressi on data, including CMAP datase ts, are the most challenging of clustering problems. Lastly, for sequence analysis, we used eight DNA transcription factor binding site datasets; two eukaryot ic datasets: ERE and E2F [71], and six bacterial datasets: CRP, ArcA, Ar gR, PurR, TyrR and IHF [80]. 2.5.1 Artificial datasets for cluster analysis Artificial datasets were generated using a finite mixture model of data distribution [27] with different numbers of data objects, clusters, and dimensions. The clusters were generated with a slight overla p (overlapping ratio = 0.1). Fo r test purposes, we selected five datasets, named from ASET1 to ASET5, where the first four datasets are uniform PAGE 63 and the last one is non-uniform Datasets of different dimens ions support th e stability of our methods. We expect our methods to properly evaluate clustering results, successfully detect the correct number of clusters, properl y locate the cluster centers, and to correctly assign data points into their actual clusters. 2.5.2 Artificial datasets for sequence analysis We generated simulated DNA datasets as in Bi [71], with three different background base compositions: (a) uniform, where A, T, C, G occur with equal frequency, (b) AT-rich (AT = 60%), and (c) CG-rich (CG = 60%). The motif string, GTCACGCCGATATTG, was merged once or twi ce into each sequence, after a defined level of the string change: (i) 9% change representing lim ited divergence (i.e. 91% of symbols are identical to the original string), (ii) 21% change, or (iii) 30% change which is essentially background or random sequence variation. 2.5.3 Real clustering datasets Some popular real datasets from the UCI Machine Learning Repository [84] were used to test our method. Iris data The Iris dataset contains information about the sepal length, the sepal width, the petal length, and the petal width of three cla sses of Iris flowers; setosa, versicolor and virginica. Each class has 50 da ta objects from a total of 150 data objects in the dataset. This dataset does not contain missing values. The expected results s hould have the three classes with their own data objects. PAGE 64 Wine This dataset contains 178 data objects, each with 13 attributes corresponding to the measures of alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenol s, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines and proline. The data obje cts are known to belong to three different classes. There are no missing values in the datase t. We expect to detect three clusters in the dataset and properly assign the data objects to their classes. Glass This dataset contains 214 data objects, each having ten attributes of which nine are the glass characteristics of refractive inde x, sodium, magnesium, aluminum, silicon, potassium, calcium, barium, and iron content. The last attribute, a value ranging from 1 to 6, is for glass identification. The dataset with the first nine attributes is used for clustering and the results should contain six cl usters and assign the data objects to their proper class identification. Breast Cancer Wisconsin The Breast Cancer Wisconsin (BCW) dataset contains 699 data objects with nine attributes regarding the clump thickness, uni formity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. The objects of the datasets are from two different classes of tumor state: benign and ma lignant. The number of classes in this dataset is unknown. Recent research has concluded that 6 is a r easonable number of classes. We expect to discover 6 clusters in the dataset using our methods. PAGE 65 2.5.4 Gene expression datasets Yeast and Yeast-MIPS data The yeast cell cycle data showed expre ssion levels of approximately 6000 genes across two cell cycles comprising 17 time point s [85]. By visual in spection of the raw data, Cho et al. [85] identif ied 420 genes that show sign ificant variation. From the subset of 420 genes, Yeung et al. [102] selected 384 genes that achieved peak values in only one phase, and obtained five standard categories by grouping into one category genes that peaked during the same phase. Amon g the 384 selected genes, Tavazoie et al. [118], through a search of the protein se quence database, MIPS [86], found 237 genes that can be grouped into 4 functional cate gories: DNA synthesi s and replication, organization of centrosome, nitrogen and su lphur metabolism, and ribosomal proteins. The functional annotations show the cluster structure in th e dataset. We named these two subsets Yeast and Yeast-MIPS respectiv ely. We expect clustering results to approximate the five class partit ion in the former and a four class partition in the latter. RCNS Rat Central Nervous System data The RCNS dataset was obtaine d by reverse tran scription-coupled PCR designed to study the expression levels of 112 genes over nine time points during rat central nervous system development [87]. Wen et al. [87] cl assified these genes into four functional categories based on prior biologi cal knowledge. These four cla sses are external criterion in this dataset. PAGE 66 2.5.5 CMAP datasets CMAP datasets contain expression profiles from multiple drug treatments performed at different concentrations. Or iginally there were 164 distinct smallmolecules used in 453 different treatment in stances resulting in 564 expression profiles. These datasets have been used to detect as sociations between genes and drugs. Lamb et al. [25] and Zhang et al. [26] were successf ul in recovering and exploiting some genedrug associations using gene expression signatures with matching-and-ranking based methods. These datasets will be used in our approach to demonstrate that we can replicate the results of Lamb et al. [25] a nd Zhang et al. [26]. In addition, more gene expression signatures will be discovered and the associations between DNA motifs and drugs will be established using our methods. By disc overing novel gene expression signatures and related pathways for the drugs, we expect to e xploit the desired effects as well as the side-effects of drug treatments Additionally, by using our motif-finding methods, we hope to predict some of th e associations between drugs and DNA sequences and the connections between miRNAs and diseases. 2.5.6 Real biological sequence datasets A. Eukaryotic transcription factor binding sites datasets E2F dataset E2F transcription factors play a key role in the regulation of cell cycle-associated genes. Identifying target genes for these factors is a major ch allenge. We used the set of 25 sequences from Kel et al. [35] that co ntain 27 instances of the E2F transcription PAGE 67 factor binding site. The goal is to discover motif occurrences that locate exactly or overlap with binding sites at known positions. ERE dataset The estrogen receptor (ER) is a ligand-activat ed enhancer protein that is a member of the steroid/nuclear receptor superf amily. Two genes encode mammalian ERs, ERalpha and ERbeta. In respons e to estradiol (E2), ER bi nds with high affinity to specific DNA sequences called estrogen response elements (EREs) and transactivates gene expression. We used the set of 25 sequences from Klinge [14] each of which contains an instance of the ERE. The purpose is to detect the motif sequences located exactly at or overlapping with EREs. B. Bacterial transcription fact or binding sites datasets CRP dataset CRP is one of seven "global" transcrip tion factors in E. coli known to regulate more than 100 transcription units. CRPÂ’s activ ity is triggered in response to glucose starvation and other stresses by the bindi ng of the second messenger cAMP. CRP binding sites have proved to be particularly noisy because computational searches for consensus binding sites have missed many known binding sites. CRP was chosen for its highly indiscriminate binding site [115]. ArcA dataset ArcA is a global regulator that changes in re lation to the expression of fermentation genes and represses the aerobic pathways when E. coli enters low oxygen growth PAGE 68 conditions. ArcA was chosen for its different protein domain (CheY-like) and very low consensus binding site [115]. ArgR dataset ArgR, complexed with L-argi nine, represses the transc ription of several genes involved in biosynthe sis and the transport of arginine, histidine, and its own synthesis, and activates genes for arginine catabolism. ArgR is also essential for a site-specific recombination reaction that resolves plas mid ColE1 multimers to monomers and is necessary for plasmid stability. PurR dataset PurR dimers control several genes involve d in purine and pyrimidine nucleotide biosynthesis and its own synthe sis. This regulator require s binding of two products of purine metabolism, hypoxanthine and guanine, to induce the conformational change that allows PurR to bind to DNA. TyrR dataset TyrR, tyrosine repressor, is the dual tran scriptional regulator of the TyrR regulon that involves genes essential for aromatic am ino acid biosynthesis and transport. TyrR can act both as a repressor and as an activator of transcription at 70-dependent promoters. IHF dataset IHF, Integration Host factor, is a global regulatory protein that helps maintain DNA architecture. It binds to a nd bends DNA at specific sites. IHF plays a role in DNA PAGE 69 supercoiling and DNA duplex destabilizatio n and affects processes such as DNA replication, recombination, and the expressi on of many genes. These sequence sets were chosen because of their hi ghly degenerate binding sites. C. Protein sequences Sequences from protein fam ilies in PFAM and Prosite databases with known motifs are used to evaluate the performance of our method. The DUF356, Strep-H-triad and Excalibur (extracellular calcium-binding) protein families have unambiguous motifs. Those from Nup-retrotrp and Flagellin-C families have weak signals. Sequences from these families are also used to evaluate the capability of detecting subtle motifs. In addition to these families, sequences from Xin Repeat and Planctomycete Cytochrome C protein families are used because they have more variable motifs. For gapped motif examples, we used sequences from the zf -C2H2, EGF_2, LIG_ CYCLIN_1, LIG_PCNA and MOD_TYR_ITAM families. To test the ability of de novo motif prediction, we used sequences from ZZ, Myb and SWIRM do mains. Because motifs of these domains are unknown, the results from our method ar e compared with other motif-finding algorithms and we expect that our methods will agree with the results of others as shown in Chapter 4. PAGE 70 3. Fuzzy cluster analysis using FCM 3.1 FCM algorithm The FCM algorithm was initially developed by Dunn [121] and generalized later by Bezdek [92] with an introduction of the fuzzifier, m, is the best well-known method for fuzzy cluster analysis. FCM allows gradual memberships of data points to clusters measured as degrees in [0,1]. This provides the flexibility in describing the data points that can belong to more than one cluster. In addition, these membership degrees provide a much finer picture of the data model; they afford not only an effective way of sharing a data point among multiple clusters but also express how ambiguously or definitely the data point belongs to a specific cluster. Let X = {x1, x2,Â…, xn} be the dataset in a p-dimensional space Rp where each xi = (xi1, xi2,Â…, xip) Rp is a feature vector or pattern vector, and each xij is the jth characteristic of the data points xi. Let c be a positive integer, 2 c < n, Rcn denote the vector space of all real c n matrices. Let V = {v1, v2,Â…, vc} Rcp of which vk Rp is the center or prototype of the kth cluster, 1 k c. The concept of the membership degrees is substantiated by the definition and interpretation of fuzzy sets. Fuzzy clusteri ng, therefore, allows got a fine grained solution space in the form of fuzzy part itions of the dataset. Each cluster vk in the cluster partitions is represented by a fuzzy set k as follows: complying with fuzzy set theory, the cluster assignment uki is now the membership de gree of the data point xi to cluster vi, such that uki = k(xi) [0,1]. Because memberships in clusters are fuzzy, PAGE 71 there is not a single cluster label assigned to each data point. Instead, each data point xi will be associated by a fuzzy label vector th at states its memberships in c clusters, T ci i 2 i 1 i} u , u u ( u (3.1) Definition 3.1 (fuzzy c-partition): Any matrix U = {uki} Mfcn defines a fuzzy cpartition (or fuzzy partition), where c 1 k n 1 i ki ki ki cn fcnk n u 0 ; i 1 u ; k i ] 1 0 [ u | R U M. (3.2) Row kt of a matrix U Mfcn exhibits the kth membership function uk in the fuzzy partition matrix U. The definition of a fuzzy partition matrix (3.2) restricts assignment of data to clusters and what membership degrees are allowed. Definition 3.2 (fuzzy c-means functionals): Let Jm: Mfcn Rcp R+ be, n 1 i c 1 k k i 2 m ki mmin ) v x ( d u ) V U ( J (3.3) where m is the fuzzifier factor, 1 m< also called the wei ghting exponent, which controls the degree of fuzzi ness of the clustering; Jm(U,V) is the infinite family of objective functions, and d(.) is any inner product norm of Rp: ) v x ( A ) v x ( v x ) v x ( dk i T k i 2 A k i k i 2 (3.4) where A is a positive definite matrix. If A Ip p then d2(x,y) = ||x y||2. Theorem 3.1 (prototypes of FCM): If m and c are fi xed parameters, and I ~ I are sets defined as: PAGE 72 i i k i iI / } c ..., 2 1 { I ~ }, 0 ) v x ( d ; c k 1 | k { I n i 1 i (3.5) then (U,V) (Mfcn Rcp) may be a global minimum for Jm(U,V) with respect to (w.r.t) the restriction c 1 k ki n i 1, 1 u (3.6) as in (3.2), only if: i i i c 1 l i 1 m 1 l i 2 1 m 1 k i 2 ki c k 1 n i 1I I k 1 I k 0 I ) v x ( d 1 ) v x ( d 1 u (3.7) and n 1 i m ki n 1 i i m ki k c k 1u x u v (3.8) Proof: We consider minimizing the Jm w.r.t. U under the restriction (3.6) using Lagrange multiplier. Let the Lagrange multiplier be i, i=1,Â…,n and put n 1 i c 1 k ki i n 1 i c 1 k k i 2 m ki) 1 u ( ) v x ( d u L. As the necessary optimal condition, 0 ) v x ( d ) u ( m u Li k i 2 1 m ki ki (3.9) PAGE 73 n 1 i k i m ki k0 v x u 2 v L. (3.10) Assume that k Ii then d2(xi, vk) 0. We derive from the Equation (3.9) for each k: 1 m 1 k i 2 1 m 1 i 1 m 1 k i 2 i ki) v x ( d 1 m ) v x ( d m u 1kc. (3.11) Summing up for k=1,Â…,c and taki ng (3.6) into account, we have, 1 ) v x ( d 1 m uc 1 k 1 m 1 k i 2 1 m 1 i c 1 k ki Hence, c 1 k 1 m 1 k i 2 1 m 1 i) v x ( d 1 1 m. Together with (3.11), we obtain the following for the membership, 1 m 1 k i 2 c 1 l 1 m 1 l i 2 ki) v x ( d 1 ) v x ( d 1 1 u Thus, c 1 l 1 m 1 l i 2 1 m 1 k i 2 ki) v x ( d 1 ) v x ( d 1 u. Similarly, we derive from (3.10) for each k, 1 k c: n 1 i m ki n 1 i i m ki ku x u v. Q.E.D PAGE 74 3.2 FCM convergence The iteration process of the FCM algorithm can be described using a map: ) V Âˆ U Âˆ ( ) V U ( :m where ) V ( F U Âˆ and ) U ( G V Âˆ (U(t), V(t)), t=1,2Â… is called an iteration sequence of the FCM algorithm. If (U(t),V(t)) = m(U(t-1),V(t-1)), t 1, where (U(0),V(0)) is any element of Mfcn Rcp. Set cp m * m fcn m * m cp fcn *V V R V V U J V U J and U U M U V U J V U J R M V U. Theorem 3.2 (descent function Bezdek) [133]: Jm is a descent function for { m, } Proof: Because {y d2(y) } and {y ym} are continuous, Jm is the sum of products of such functions, Jm is continuous on {Mfcn, Rcp}. Jm( m(U,V)) = Jm(F(G(U)), G(U)) Jm(U,G(U)) by the first case in the definition of Jm(U, V) by the second case in the definition of Thus, Jm is a descent function for { m, }. Q.E.D Theorem 3.3 (solution convex set): Let [conv(X)]c be the c-fold Cartesian product of the convex hull of X, and let (U0, V0) be the starting point of the sequence of m iteration, U0 Mfcn and V0 = G(U0). Then m(U(t-1),V(t-1)) Mfcn [conv(X)]c t=1,2Â… Mfcn [conv(X)]c is compact in Mfcn Rcp. Proof: Let U0 Mfcn be chosen. For each k, 1 k c, we have PAGE 75 n 1 i m 0 ki n 1 i i m 0 ki 0 ku x u v. Let n 1 i m 0 ki m 0 ki kiu u, we have n 1 i i ki 0 kx v, and n 1 i ki1. Thus ) X ( conv v0 k, and therefore V0 [conv(X)]c. Subsequently, we can prove that U(t) Mfcn, V(t) [conv(X)]c, t 1. Q.E.D Theorem 3.4 (the convergence of FCM Bezdek et al.) [134]: Let 0 = (U0, V0) be the starting point of th e iterations with m. The sequence ... 2 1 t ), V U (0 0 t m either terminates at an optimal point = (U*, V*) or there is sub-se quence converging to a point in Proof: m is a continuous and descent function (Theorem 3.2) and the iteration sequences are always in a compact subset of the domain of Jm (Theorem 3.3). Thus, m should terminate at an optimal point = (U*,V*) or there is a sub-sequence converging to a point in Q.E.D 3.3 Distance measures Similar to other clustering algorithms, FC M uses the distance or dissimilarity of data points, each described by a set of attributes and denoted as a multidimensional vector. The attributes can be quantitative or qualitative, continuous or discrete, which leads to different measurement mechanisms. Accordingly, for a dataset in the form of an object data matrix, the data matrix is designated as two-mode because its row and column indices have different meanings. This is in contrast to the one-mode distance PAGE 76 matrix, which is symmetric with elements representing the distance or dissimilarity measure for any pair of data points, because both dimensions share the same meaning. Given a dataset X, the distance function d(.) on X is defined to satisfy the following conditions, Symmetric: d(x,y) = d(y,x), Positive: d(x,y) 0, x,y X. If the following conditions hold, then d is a metric. Triangle inequality: d(x,z) d(x,y) + d(y,z), x,y,z X, Reflexive: d(x,y) = 0 iif x = y. If the triangle inequality is viol ated, then d is a semi-metric. Many different distance measures have been used in cluster analysis. Euclidean distance The Euclidean distance is the most widely used distance measure and defined as p 1 i 2 i i 2y x ) y x ( d. (3.12) This distance is a true metric because it satisfies the triangle inequality. In the Equation (3.12), the expression data xi and yi are subtracted directly from each other. We, therefore, need to ensure that the e xpression data are properly normalized when using the Euclidean distance, for example by converting the measured gene expression levels to log-ratios. PAGE 77 Pearson correlation distance The Pearson correlation coefficient dist ance is based on the Pearson correlation coefficient (PCC), which describes the similarity of objects as p 1 i y x i i i i) y y )( x x ( p 1 ) y x ( PCC, where x and y are the sample standard deviati on of x and y respectively. PCC has a value from 1 to 1, where PCC=1 when x and y ar e identical, PCC=0 when they are unrelated, and PCC = 1 when they are anti-correlated. The PearsonÂ’s correlation distance is then defined as ) y x ( PCC 1 ) y x ( dp (3.13) Figure 3-1 : Expression levels of three genes in five experimental conditions The value of dp lies in [0,2] where dp=0 implies that x and y are identical, and dp=2 implies that they are very different fr om each other. While the Euclidean measure takes the magnitude of the data into account, the Pearson correlation measure geometrically captures the pattern of the two da ta points [1]. It is therefore widely used for gene expression data. Even if the expressi on levels of the two genes are different, if PAGE 78 they peak similarly in the experiments, then they have a high correlation coefficient, or a small Pearson correlation distance (Figure 3-1). Because of its definition, the Pearson correlation distance satisfies the distance cond itions, but it is not a metric since it violates the triangle inequality. Absolute Pearson correlation distance The distance is defined as, ) y x ( PCC 1 ) y x ( dap (3.14) Because the absolute value of P CC falls in the range [0,1], the dap distance also falls between [0,1]. dap is equal to 1 if the expression levels of the tw o genes have the same shape, i.e., either exactly the same or exactly opposite. Therefore, dap (3.14) should be used with care. Uncentered Pearson correlation distance The distance is based on the uncentered Pearson correlation coefficient (UPCC) which is defined as p 1 i 0 y i 0 x iy x p 1 ) y x ( UPCC, where p zp 1 i 2 i 0 z The uncentered Pearson correla tion distance is defined as ) y x ( UPCC 1 ) y x ( dup (3.15) The dup distance in Equation (3.15) may be appropriate if there is a zero reference state. For instance, in the case of gene expression data given in terms of log- PAGE 79 ratios, a log-ratio equal to 0 corresponds to green and red signal being equal, which means that the experimental manipulation did not affect the expression. Because UPCC lies in the range [ 1,1], the dup distance falls between [0,2]. Absolute uncentered Pearson correlation distance This distance is defined as in (3.16). Because the absolute value of UPCC falls between [0,1], the daup distance also fa lls between [0,1]. ) y x ( UPCC 1 ) y x ( daup (3.16) Comparison of measures The distance measure plays an important role in obtaining correct clusters. For simple datasets, where the data is multidim ensional, the Euclidean distance measure is employed. But as the dimensions of the data set increase, where each dimension denotes a specific aspect of the dataset, the Euclid ean distance measure may not be the best one to be used. For the Iris dataset, we ran the FCM algorithm with the number of clusters set to 3, which is the number of classes of Iris flow ers: Setosa, Versicolor, and Virginica in the dataset. By using five different distan ce measures from (3.12) to (3.16), we obtained five different cluster partitions. For each pa rtition, we compared the cluster label of each data point with its class label to compute the correctne ss ratio for every class. The results are shown in Table 3.1. While the four distance measures using PearsonÂ’s correlation coefficient provide correct result s, the Euclidean distance measure performs worse, particularly on the Virgin ica class. In this comparison, it is clear that the absolute PAGE 80 Pearson correlation measure and the Pearson correlation measure results are identical. Similarly, the result of the uncentered Pearson correlation measure is identical with that of the absolute uncentered Pearson correlation measure. Table 3.1 : Performance of different distance measures on the Iris dataset Distance measure method Classification correctness (%) Average Correctness Setosa Versicolor Virginica Euclidean 100.0094.0076.00 90.00 Pearson correlation 100.0094.0094.00 96.00 Abs. Pearson correlation 100.0094.0094.00 96.00 Uncentered Pearson cor. 100.0092.00100.00 97.33 Absolute uncentered Pearson cor. 100.0092.00100.00 97.33 Table 3.2 : Performance of different di stance measures on the Wine dataset Distance measure method Classification correctness (%) Average Correctness #1 #2 #3 Euclidean 76.2770.4256.25 67.65 Pearson correlation 91.5385.92100.00 92.48 Abs. Pearson correlation 72.8867.6195.83 78.77 Uncentered Pearson cor. 91.5385.9297.92 91.79 Absolute uncentered Pearson cor. 91.5385.9297.92 91.79 We ran a similar benchmark on the Wine dataset which contains data objects of the three different classes of wines. Resu lts are shown in Table 3.2. Again, all Pearson correlation based distance measures perfor med better than the Euclidean distance measure. The Pearson correlation measures se em to be better for distance measure in real and unscaled datasets. PAGE 81 3.4 Partially missing data analysis In cluster analysis, complete informati on is preferred through out the experiment. Unfortunately, real world datasets frequently have missing values. This can be caused by errors that lead to inco mplete attributes or by random noise. For example, sensor failures in a control system may cause th e system to miss information. For gene expression data, the missing values can come from the platform level and meta-analysis level. With the former, the reasons include insufficient resolution, image corruption, spotting, scratches or dust on the slide, or hybr idization failure. The latter can be caused by differences in platforms or chip generati ons, and the lack of an appropriate method to map probes onto genes across di fferent microarray platforms. In a dataset with partially missing data, so me attribute values of a data point xi may be not observed. For example, xi = (x1, xi2, ?, ?, xi5, ?) has missing values corresponding to the third, fourth and sixth at tributes and only the first, second and fifth attributes are observed. This causes a probl em with distance measurement between xi and other data points in the dataset and the cluster algorithm, therefore, cannot perform properly at xi. Let XW = { xi X| xi is a complete data point }, XP = { xij | xi ? }, and XM = { xij | xij = ? } Clearly, XP and X contain more information than XW [140]. PAGE 82 Three approaches [137-139] have been wi dely used to address the problem of missing values. (1) The ignorance-based approach is the most trivia l approach to deal with datasets when the proportion of incomplete data is small, |X / XW| |X|, but the elimination brings a loss of information. (2 ) The Model-based approach defines a model for the partially missing data, XM, using available information, XP, and applies the model to build a complete dataset. However, the complexity of this method can prevent application to large datasets. (3) The Im putation-based approach supplies missing values XM by certain means of approximation. Of the three, the Imputation-based approach is preferred. Because of the n eed to use an approximation model for the missing data, it is therefore integrated with the cluster algorithm and works based on the cluster model, which may be a probability-bas ed or possibility-based one, estimated by the cluster algorithm. For the Imputation-based approach using FC M, Hathaway et al. [140] identified four strategies to solve the problem of missing values. Whole data strategy (WDS) This strategy, similar to the ignorancebased approach, eliminates the missing data. Results therefore presen t the cluster partition of XW instead of X. This strategy is applicable when |XP| / |X| 0.75. Partial distance strategy (PDS) This is the strategy recommended when XM is large. The distance between every pair of data points is computed using XP. PAGE 83 ) v x )( AI ( ) v x ( v x ) v x ( dk i i T k i 2 AI k i k i 2 Pi (3.17) where Ii is the index function of the data point xi, defined as in (3.18). p t 1 n i 1 for X x 1 X x 0 IP it M it it (3.18) Optimal completion strategy (OCS) In addition to model parameters, this strategy estimates XM. The FCM algorithm is modified to meet this me thod: in the first iteration, XW is used instead of X, and XM is estimated in the favor of optimizing the Jm function using (3.19). FCM is then run using the entire dataset X until convergence. M ij c 1 k m ki c 1 k kj m ki ijX x u v u x (3.19) Nearest prototype strategy (NPS) The nearest prototype strate gy is in fact a version of the k-nearest neighbors strategy [89] for fuzzy partition. This is also considered the simple version of OCS. In each iteration, XM is updated as, ) v x ( d min ) v x ( d X x where v xl i 2 P c l 1 i 2 P M ij kj ij (3.20) Algorithm performance measures Hathaway et al. [140] proposed two m easures for performance evaluation of imputation algorithms. The misclassification measure (MISC) is computed based on the assigned cluster label and the actual class label of the data objects, defined as, PAGE 84 n 1 i i i m) x ( V ), x ( C I ) V X ( C MIS, (3.21) where C(xi) and V(xi) are the actual class and the assign ed cluster labels of the data object xi respectively. Im is the index function, defined as, y x 0 y x 1 ) y x ( Im. The cluster prototype error (PERR) measures the difference between the predicted and the actual cluster prototypes of the dataset; it is defined as, 2 1 c 1 k k k) v v ( d ) V ( PERR (3.22) where vk, the kth predicted cluster prototype, and kv, the kth actual cluster prototype, k=1,Â…,c. We applied the four strategies disc ussed above to the Iris dataset [140] using different scales of the missing values, XM. The results are shown in Table 3.3. The dataset contains 150 4 = 6000 attribute values. XM was generated randomly at various scales from 0 to 0.75. Each me thod was run 50 times. In Table 3.3, columns 2-5 show the average number of iterations to convergen ce, columns 3-6 show the misclassification errors, and columns 7-10 sh ow the mean prototype errors, computed based on the distances between the estimated prototypes and the real prototypes of the dataset for the four methods. Based on the errors in columns 7-10, the accuracy of protot ype estimation by OCS was always at least as good as that of other approaches. The WDS performed well for datasets with low percen tages of missing values. The second best method was NPS. PAGE 85 Table 3.3 : Average results of 50 trials using incomplete Iris data [140] 3.5 FCM Clustering solution validation A fundamental and the most difficult pr oblem in cluster analysis, cluster validation, is to determine whether the fuzzy partition is of good quality. Cluster validation addresses the following questions: i. If all existing algorithms are applied to the dataset, and one obtains a multitude of different partitions, which assignment is correct? ii. If the number of clusters is unknown, partit ions can be determined for different numbers of clusters. Which partition (or number of clusters) is correct? iii. Most algorithms assume a certain stru cture to determine a partition, without testing if this structure really exists in the data. Does the result of a certain clustering technique just ify its application? The cluster validation problem is the gene ral problem of determining whether the underlying assumptions (cluster shapes, number of clustersÂ…) of a clustering algorithm are satisfied for the dataset. There are three different approaches fo r cluster validation: PAGE 86 a) Using hypothesis testing on the clusters, b) Applying model selection techni ques to the cluster model, c) Using cluster validity measures. In the first two methods, statistical te sts with assumption on probabilistic data distribution model are used. Hence, clus ter validation is performed based on the assumed distribution. While the first approac h, using hypothesis testi ng, is applied to a single model that we consider the local va lidation approach (3.5.2) the approach using model selection technique, similar to the gl obal validation approach (3.5.1), takes into account a family of probabilistic models according to a set of partitions from which the best should be selected. FCM is a possibility-based approach. It is similar to the EM algorithm with the mixture model, which is a pr obability-based approach. Comp ared with FCM, EM, with a standard model assumption ab out the data distribution, ha s the disadvantage that it converges slowly. However, EM, because of its assumed statistical model, has the advantage of using the first two approaches to evaluate the cluster results. In contrast to EM, these approaches are useless for FCM. Th erefore, most methods for fuzzy partition validation are based on the third approach, where cluster validity measures are used. Lacking certain knowledge about the data, it has to be determined for each problem individually which cluster sh ape is appropriate, what dist inguishes good clusters from bad ones, and whether there are data lack ing any structure. There are two common approaches to create cluster validity measures. PAGE 87 3.5.1 Global validity measures Global validity measures are mappings g: A(D,R) R describing the quality of a complete clustering partition using a sing le real value. Th e objective function, Jm in (3.3) is a simple example of such a validity measure. When we determine the cluster parameters and memberships, we minimize the Jm function. However, the objective function is a poor choice for creating va lidity functions; it is obvious that 0 ) J ( limm n c but c = n is not the optimal number of clusters for all datasets. In recent literature, a number of cluster validity measures have been developed and have addressed the problem for specific situ ations [92-97, 121-131]. Definition 3.3 (Partition coefficient PC, Bezdek): Let Af(Rp, Rcp) be an analysis space, X Rp, and U: X V Rcp a fuzzy cluster partition. The partition coefficient PC of the partition U is defined by n 1 i c 1 k 2 ki PCu n 1 V. (3.23) The inequality 1 V c 1PC holds because of (3.2) and (3.6). VPC = 1 for all hard partitions. If the fuzzy partition contains no in formation, each object is assigned to each cluster with the same degree and the minimum value 1/c of VPC is obtained. A maximum value of VPC should be considered when ma king a selection between multiple fuzzy partitions, i.e., k PC c k 1 opt PCU V max U V The optimal partition, Uopt, is the most Â“unambiguousÂ” one regarding clus ter assignment of the objects. PAGE 88 Definition 3.4 (partition entropy PE, Bezdek): Let A(RP, Rcp) be an analysis space, X RP and U : X V Rcp a fuzzy partition. The partiti on entropy PE of the partition U is defined by n 1 i c 1 k ki ki PE) u ln( u n 1 V. (3.24) Similar to the partition coefficient, we can show that ) c ln( V 0PE holds for any fuzzy partition U: X V. When the fuzzy memberships are uniformly distributed, VPE tends towards to ln(c). If U is the most ambiguous fuzzy partition, then the maximum value of VPE is obtained. The entropy, i.e., the mean information content of a source which describes a correct partition, is 0, which is al so obtained when U is a hard partition. To find the best fuzzy partition, we therefore look for the partition with minimum entropy, i.e., k PE c k 1 opt PEU V min U V Both PC and PE have the problem of over-fitting in estimating the number of clusters, c, because the use of the all me mberships in the validity measure does not necessarily increase the depe ndence on this parameter. This is why the PC and PE factors tend to increase for large values of c and provide op timal decisions only on unambiguous partitions. For a partition to be good, its clusters should satisfy the compactness and separateness conditions. A pa rtition contains compact and separate clusters if any two objects from the same cl uster have a smaller di stance than any two objects from different clusters. PAGE 89 Fukuyama and Sugeno [93] first proposed a simple approach to measure the cluster separation using the geometry of clus ter centers; the more the clusters scatter away from the data central point, th e corresponding solutio n is preferred. Definition 3.5 (Fukuyama-Sugeno FS cluster index): Let (RP, Rcp) be an analysis space, X RP and U: X V Rcp a fuzzy partition. The Fukuyama-Sugeno cluster index of the partition U is defined by c 1 k k n 1 i 2 m ki FS) v v ( d u J V, (3.25) where c v vc 1 k k. For each cluster vk, 1 k c, the inequality d2(xi,vk) d2( v,vk) holds for every data object xi, 1 i n. We, therefore, have VFS 0. VFS = 0 when there is one cluster in the fuzzy partition. The more the clusters are away from the central prototype, v the more separated the clusters are. To find the best fuzzy partition, we therefore look for part ition with minimum VFS. There are still the cases, however, where some clusters are far away from v but still close to each other. D unn [123] proposed a cluster i ndex which measures both the compactness and the separation of the partitio n. This cluster index, therefore, was named the compact and separated cluster index. However, DunnÂ’s index is only applicable to hard partitions. To solve this problem, Xie and Beni [94] proposed their cluster index using a separation factor where the distance betw een clusters is also taken into account. PAGE 90 Definition 3.6 (Xie and Beni XB cluster index): Let (RP, Rcp) be an analysis space, X RP and U: X V Rcp a fuzzy partition. The Xie-Beni cluster index of the partition U is defined by 2 min n 1 i c 1 k k i 2 2 ki 2 2 XB) V ( D n ) v x ( d u ) V ( Sep n J V (3.26) where ) v v ( d min ) V ( D ) V ( Sepl k l k c l k 1 min is the minimum distance between cluster centers of the pa rtition. Since, }. ) v ( diam { max c ) v ( diam u ) v x ( d u c 1 u c ) v x ( d u n ) v x ( d u2 k c k 1 c 1 k 2 k c 1 k n 1 i 2 ki n 1 i k i 2 2 ki n 1 i 2 ki n 1 i c 1 k k i 2 2 ki n 1 i c 1 k k i 2 2 ki We have 2 XPDI 1 V where DI is the Dunn comp act and separated index; )} v ( diam { max ) V ( D DIk c k 1 min Compared to DI, the VXB index is a much simpler co mputation, and it takes the membership degrees into account so that a de tour via a hard partit ion is not necessary [121]. However, the cluster indices, PC, PE, FS and XB do not consider the clusters shapes and volumes. Gath and Geva, ba sed on the concepts of hypervolume and density, proposed three cluster hypervolum e measures which are the basis of their cluster validity index and the indi ces of many othe r researchers. PAGE 91 Definition 3.7 (fuzzy hyper volume FHV): Let (RP, Rcp) be an analysis space, X RP and U: X V Rcp a fuzzy partition. The fuzzy hypervolume of the partition U is defined by c 1 k i FHV) F det( V (3.27) where Fk, n 1 i ki n 1 i k i 2 ki ku ) v x ( d u Fis the fuzzy covariance matrix of Vk [125]. A fuzzy partition can be expected to have a small value of VFHV if the partition is compact. However, VFHV measures only the compactness of the partition. Rezaee et al. [95] extended the VFHV index by adding a term that measur es the scattering between clusters in the partition. Definition 3.8 (Compose within & between scattering index CWB, Rezaee et al.): Let (RP, Rcp) be an analysis space, X RP and U: X V Rcp a fuzzy partition. The compose within and between scattering inde x [95] of the partition U is defined by ) V ( Dis ) V ( Scat VCWB (3.28) Scat(V) measures the average scattering of c clusters in V and is defined as ) X ( c V ) V ( ScatFHV where n 1 i 2 ix x n 1 ) X (, and n 1 i ix n 1 x xis the center point of X. Dis(V) measures the scattering between c clusters of V and is defined as PAGE 92 1 c 1 k c 1 l l k min maxv v D D ) V ( Dis The first term of VCWB, i.e., Scat() in (3.28), indicat es the average of scattering variation within the c clusters of the partition. The smaller this term is the more compact the partition is. However, this term does not take into account any geometric assumptions of the cluster prototypes. The second term of VCWB, Dis() indicates the average scattering separation between clusters. It is influenced by the geometry of the cluster centers and will increase if the nu mber of clusters increases. For the best partition, we look for the partition with the minimum value of VCWB. Definition 3.9 (Pakhira, Bandyopadhyay and Maulik fuzzy cluster index PBMF): Let (RP, Rcp) be an analysis space, X RP and U: X V Rcp a fuzzy partition. The PBMF index [96] of the partition U is defined by D J E c 1 V2 max 1 1 PBMF (3.29) where E1 = n (X) which is a constant for each dataset. The term J1 measures the compactness, while the term Dmax measures the scattering sepa ration of the clusters in the partition. A good partition must have either a small J1 or a large Dmax. To find the best partition, we therefore look for th e partition with the maximum value of VPBMF. Definition 3.10 (Babak Re zaee BR cluster index): Let (RP, Rcp) be an analysis space, X RP and U: X V Rcp a fuzzy partition. The BR inde x [97] of the partition U is defined by PAGE 93 } J { max J )} V ( Sep { max ) V ( Sep V2 c 2 c BR (3.30) where c l k 1 l k l k REL) v v ( C ) 1 c ( c 2 ) V ( Sep. Instead of using geometric distance between the two centers as the distance of every pair of clusters, Rezaee [97] used the relative degree of sharing or connectivity, CREL of the two clusters, defined as in (3.31). To compare the performance of the global cluster validity indices from (3.23) to (3.30), we chose the three previously used real datasets: Iris, Wine, and Glass and the artificial dataset ASET2 with 200 data points in five well-separated clusters (Figure 3-2). -20 0 20 40 60 80 100 120 10 20 30 40 50 60 70 80 90 100 110 Figure 3-2 : ASET2 dataset with five well-separated clusters PAGE 94 For each of the four select ed datasets, we ran FCM with the number of clusters set from 2 to 10, each partition, corresponding to a cluster number, was evaluated using the cluster validity measures. The results ar e shown separately fo r individual validity measures in Figures 3-3 3-9. 0.5 0.6 0.7 0.8 0.9 1 2345678910 Iris Wine Glass ASET2Figure 3-3 : PC index (maximize) 0 0.2 0.4 0.6 0.8 1 2345678910 Iris Wine Glass ASET2Figure 3-4 : PE index (minimize) The two indices, PC in (3.23) and PE in (3.24) behave very similarly; when PC decreases (Figure 3-3), then PE (Figure 3-4) increases and vice versa. These two measures, however, prefer partitions less ambiguous; they corre ctly identified the number of clusters in the artificial dataset, ASET2, but on the real datasets, where the clusters significantly overlap, they failed. 40 30 20 10 0 10 2345678910 Iris Wine Glass ASET2Figure 3-5 : FS index (minimize) 0 0.5 1 1.5 2 2345678910 Iris Wine Glass ASET2Figure 3-6 : XB index (minimize) PAGE 95 0 100 200 300 400 500 2345678910 Iris Wine Glass ASET2Figure 3-7 : CWB index (minimize) 0 5 10 15 20 25 2345678910 Iris Wine Glass ASET2Figure 3-8 : PBMF index (maximize) Figure 3-9 : BR index (minimize) The other validity measures also correctly identified the cluster number in the ASET2 dataset. However, with the exception of PBMF with the Iris dataset (Figure 3-8), all failed to detect the correc t cluster number in the real datasets. For further comparisons between these va lidity measures, we used 84 artificial datasets [147] generated using a finite mixtur e model as in Section 2.5.1. Datasets are distinguished by the dimensions and cluste r number, and we generated (3-2+1)*(93+1)=14 dataset types. For each type, we ge nerated 6 datasets, for a total of 6*14=84. For each artificial dataset, we ran the standard FCM five times with m set to 2.0. In each case, the best fuzzy partition was then selected to run the validity measures to search for the optimal number of clusters between 212 and to compare with the known number of PAGE 96 clusters. We repeated the experiment 20 tim es and averaged the performance of each method. Results of these compar isons are shown in Figure 3-10. Figure 3-10 : Performance of th e global validity measures on artificial datasets with different numbers of clusters The PC measure provides better results th an the PE measure. Measures using the geometry of the prototypes and the geometric distance m easures i.e., PBMF, CWB and FS, have much better performance than thos e of validity measures not using prototype geometry. However, using geometric in formation for measuring the partition compactness does not help validity measure with better performance. The cluster compactness should be measured using the memb erships. This is reasonable because the data do not always distribute equally, and the cluster volumes, therefore, may differ from cluster to cluster; it is not necessary that any two data points having the same PAGE 97 memberships should have the same distance to the centers of clus ters to which they belong. The selection of validity measures th erefore should be specific to the individual application. 3.5.2 Local validity measures Determination of the optimum cluster num ber using global va lidity measures is very expensive, because clustering must be carried out for multiple cluster numbers. If a perfect partition cannot be r ecognized by a single run, we could at least filter the presumably correctly recogni zed clusters [121]. A validit y measure, called a local validity measure is, therefore, necessary to ev aluate clusters individually. This validity measure helps to correct the partition more directly. The concep t of local validity measures is that, by running FCM with a po ssibly maximized cluster number [122] and analyzing the resulting partition, we provide an improved initializati on for the next run. We continue this procedure until finall y the whole partition c onsists only of good clusters. If the final partition with only good cl usters is found, we have determined the optimal number of clusters and the best assign ment of data points to clusters. Hence, in addition to addressing the probl em of cluster validation, the local validity index helps with the problem of parameters initialization. Definition 3.11 (Relative degree of connectivity of two clusters CREL): Let (RP, Rcp) be an analysis space, X RP and U: X V Rcp a fuzzy partition. The relative degree of connectivity of the two clusters vk, vl V is defined as [97, 127] n 1 i i li ki l k RELe u u min ) v v ( C, (3.31) PAGE 98 where c 1 t ti ti i) u ln( u e. ei is the entropy of the c-partition at xi X and has a maximum value of ln(c) when xi has the same membership degree to all clusters. Hence, we have c ) c ln( n ) v v ( C 0l k REL The two clusters vk, vl are well separated as CREL 0. Definition 3.12 (Inter-clu ster connectivity CIC): Let (RP, Rcp) be an analysis space, X RP and U: X V Rcp a fuzzy partition. The inter-cluster connectivity [129] of the two clusters vk, vl V is defined as lk kl l k ICRF RF ) v v ( C (3.32) where } u { \ } u { max u }, u { max u | X x RFki c t 1 ti li ti c t 1 ki i kl The CIC(vk,vl) shows the overlap regi on between clusters vk and vl. If the two clusters are well separated then CIC(vk,vl) 0. A greater CIC is an indication of a greater degree of the similarity between vk and vl. If CIC(vk, vl) is larger than a selected threshold, say 0.5, then the two clusters vk and vl need to be combined. Definition 3.13 (fuzzy inclusion similarity SFI): Let (RP, Rcp) be an analysis space, X RP and U: X V Rcp a fuzzy partition. The fuzzy inclusion similarity [130] of the two clusters vk, vl V is defined as n 1 i li n 1 i ki n 1 i li ki l k FIu u min u u min ) v v ( S. (3.33) PAGE 99 Definition 3.14 (link-based similarity SLB): Let (RP, Rcp) be an analysis space, X RP and U: X V Rcp a fuzzy partition. The link-based similarity [131] of the two clusters vk, vl V is defined as max LB c 1 t l k t Indir l k LBS ) v v ( S ) v v ( S (3.34) where ) v v ( S ), v v ( S min ) v v ( Sl t Dir t k Dir l k t Indir, c l k 1 l k LB max LB) v v ( S max S and l k l k l k Dirv v v v ) v v ( S Because vk and vl are fuzzy sets, according to the fuzzy set theory, we have n 1 i li ki n 1 i li ki l k Diru u max u u min ) v v ( S. The cluster merging technique evaluate s clusters for their compatibility (similarity) using any of the local validit y measures discussed above and merges clusters that are found to be compatible. In every step of this merging procedure, the most similar pair of adjacent clusters is me rged if their compatibility level is above a user-defined threshold and the number of clusters is therefore gr adually reduced. PAGE 100 Either the compatible cluster merging (CCM) algorithm [126, 127, 128] or the agglomerative HC algorithm [132] can be used for this process. In the case of Iris dataset, we first ra n FCM with the number of clusters set to 12, which is equal to n where n=150, and is clearly gr eater than the optimal number of clusters. The HC algorithm was applied wi th the weighted linkage used as a local validity measure to merge adjacent compatible clusters. In Figure 3-11, it can be seen that most of the clusters were merged into two large groups and the third group contains only one cluster. Figure 3-11 : Dendrogram of the Iris dataset fr om a 12-partition generated by FCM For the Wine dataset, we ran FCM with the number of clusters set to 13, which is equal to n where n=178, and is larger than the optimal number of clusters. The HC algorithm was applied with the weighted linkage. It can be seen in Figure 3-12 that the clusters gather into three groups. PAGE 101 Figure 3-12: Dendrogram of the Wine da taset from a 13-partition generated by FCM By considering only the highest levels of the dendrograms as in Figure 3-11 and Figure 3-12, the number of clus ters in the Iris and Wine datasets can be identified correctly. However, in the fuzzy cluster an alysis, several prototypes can share one cluster. A local validity measure based on memberships can only identify a prototype pair instead of all of these prototypes. For ex ample, in the case of the Iris dataset, the clusters 2, 3, 7, 8 and 12 share the class Vi rginica, but only clus ter 12 was separated to form a final cluster. For the Wine dataset, th e clusters 1, 2, 3, 4, 6, 9 and 10 share the overlap region between classes 2 and 3 but only the cluste rs 9 and 10 are grouped to form a separate cluster. To solve this problem, the best cluster is removed after each merging step and the distances among th e rest are recomputed [126-132]. This PAGE 102 procedure is repeated until no further good cluste rs are found or the compatible level is under the threshold value [128]. 3.6 FCM partition matrix initialization Theorem 3.4 shows that FCM algorithm is very sensitive to the initial value, U0, of the partition matrix. The determination of U0, or the initialization of the partition matrix, can help the algorithm to converge qui ckly to the global optim um or get stuck in local optima. A common approach in most partitioning clustering methods is to generate the partition matrix randomly. However, this requir es different mechanisms for different datasets because of differing data distributions. Figure 3-13 : ASET1 dataset with 6 clusters PAGE 103 We used the artificial dataset 1 (ASET1) as an example. This dataset contains 171 data points in six clusters (Figure 3-13) generated using a finite mixture model as described in Section 2.5.1. We ran the three standard clustering algorithms: k-means, kmedians and FCM with the number of clusters set to 6, and the partition matrix generated using a non-uniform random me thod. The cluster pa rtition found by each algorithm was compared with the classification structure of the dataset. We repeated the experiment 20 times and averaged the perf ormance of each algorithm, as shown in Table 3.4. Because of the random value of the U0, these algorithms could not always converge to the local optimum Hence, the correctness ratios for every class and the whole dataset are sometimes off. Table 3.4 : Performance of three standard algorithms on ASET1 Algorithm Correctness ratio by class Average Ratio 1 2 3 4 5 6 k-means 0.97 0.871.001.001.000.750.93 k-medians 0.95 0.821.001.001.000.620.90 FCM 0.97 1.000.951.001.000.960.98 An alternative approach is to search for the best location of the cluster candidate using data points in the dataset. An in tegration of FCM with optimization algorithms such as GA and PSO is preferred. To manage a solution set, of which each solution is a set of data points selected as cluster candida tes, the optimization al gorithms try different values of the partition matrix. In this wa y, they solve the problem similarly to the method of initializing the matrix randomly, as we mentioned previously, which may be appropriate for some data sets, but not for others. PAGE 104 A simple and yet effective method for th e problem of initializing the partition matrix is subtractive clustering (SC). The su btractive algorithm, or iginally developed by Chiu [63], is a notable improvement to FCM, in which an improved mountain method was proposed to determine the number of cl usters and estimate th eir initial values. Instead of generating the clus ter candidates randomly, the SC method searches for the most dense objects in the dataset. Li et al. [ 64], Liu et al. [65], and Yang et al. [67] used SC to locate the most dense data points as cluster candidates. The partition matrix is then computed using the selected data points as cluster centers. However, the SC method requires the mountain peak and mountai n radii numbers to be specified a priori, which may differ among datasets. Yang et al [66] addressed the problem by a method that automatically computes these values from the dataset. However, the data distribution may not be identical among cluste rs, and the estimated parameters may be appropriate to some clusters but not others Cheng et al. [68] proposed a weighted SC method that makes the cluster candidate sele ction more flexible by allowing the cluster centers to be located at the means of the de nse data points instead of selecting the most dense data points themselves. This approach can help FCM work effectively. However, it still requires that all the SC parameters are specified a priori. 3.7 Determination of the number of clusters FCM has a disadvantage that the number of clusters must be known in advance. When this is not known, the results with di fferent numbers of clusters have to be compared using a validity function in orde r to find an optimal fuzzy partition with PAGE 105 respect to the new objective function. There are two common methods to address the problem of number of clusters using this approach [121]. 3.7.1 Between cluster partition comparison This method defines a global validity meas ure which evaluates a complete cluster partition. An upper bound cmax of the number of clusters is estimated, and a cluster analysis using FCM is carried out for each number of clusters, c, in {2,Â…,cmax}. For each partition solution, the va lidity measure generates a va lidity index which is then compared with the indices of other partitio n solutions to find the optimal number of clusters. Figure 3-14 : PC index on the ASET2 dataset (maximize) For the case of ASET2 dataset, we used VPC, for example, as global validity measure, and set cmax to 14 which is n where n=200 the size of ASET2. FCM algorithm was run with the number of cluste rs set from 2 to 14; each partition, corresponding to a certain number of clusters, was evaluated using VPC. The results are PAGE 106 shown in Figure 3-14. To find the optimal num ber of clusters, we look for the partition with the maximum value of VPC (Definition 3.3). Therefore, five is the optimal number of clusters, and is also equal to the number of classes in ASET2. Hence, in the case of ASET2, VPC correctly detected the number of clusters. 3.7.2 Within cluster partition comparison This method defines a local validity measure that evalua tes individual clusters of a partition. An upper bound cmax of the number of clusters is also estimated and a cluster analysis is carried out for c=cmax. The resulting clusters are then compared using the validity measure. Similar clusters are merg ed into one and very bad clusters are eliminated. These operations reduce the number of clusters in the partition. Afterwards, a cluster analysis is carried out on the rema ining clusters. This pr ocedure is repeated until the analysis result no longer contains e ither similar clusters or very bad clusters. Both 3.7.1 and 3.7.2 are based on the validity measures that we discussed in 3.5. Hence, the problems with using cluster validity measures may cause these methods to remain in local optima when searching for th e Â“bestÂ” number of cl usters; they assume that the cluster analysis results are optimal for the given numbers of clusters. However, FCM does not guarantee this [132]. Hence, an optimal cl uster partition found by these methods is possibly only a local optimum. In general [121], to evaluate a possibilistic cluster analysis, we must consider both the quality of the clusters and the proportion of the classified data. This is especially di fficult with quality measures for individual clusters according to the method in 3.7.2 [ 121]. In this case, the method in 3.7.1 is therefore recommended. PAGE 107 3.8 Determination of the fuzzifier factor The performance of FCM is affected by an exponent parameter, called the fuzzifier factor, m. This parameter affects both the convergence rate and the cluster validity of the algorithm. m=1, FCM behaves like K-Means, sa y all clusters are disjoint. m we have n i 1 and c k 1 for c 1 ) v x ( d ) v x ( d 1 lim ) v x ( d 1 ) v x ( d 1 lim u limc 1 l 1 m 1 l i 2 k i 2 m c 1 l 1 m 1 l i 2 1 m 1 k i 2 m ki m c k 1 for x n x u x u lim v limn 1 i i n 1 i m ki n 1 i i m ki m k m where xis the grand mean of X. Thus, when m go es to infinity, all the data points are considered to belong to only one cluster. Wu [144] ran FCM on the Iris dataset with different values of m (m=1.1, 1.5, 2, 3, 4,Â…, 50) and the number of clusters set to 3, and measured the misclassification rate in each cluster partition as shown in Figure 3-15. As m becomes large, the FCM algorithm becomes stable and no improvement can be achieved, even with a dditional iterations. If U Mfcn, the constraint as in (3.6) makes it difficult to interpret uki as the presentative of the kth cluster at xi. This is a very PAGE 108 important problem with FCM, and any application of FCM should take the determination of m into account. Figure 3-15 (Wu [144]) : Impact of m on the misclassification rate in Iris dataset To date, there are no reliable criteria for th e selection of the optimal fuzzifier for a given set of training vectors. Instead, the best value, with respect to a given value range of m, is fixed through experimentation; i.e ., this choice is sti ll heuristic. To our knowledge, only a few methods exist to determin e an optimal value of m. Romdhane et al. [57] proposed a method to estimate the value range that shoul d contain the optimal values of m, by defining the limits of th e overlap ratio between clusters. An iteration process is then carried out until the optimal value is reached. This approach has provided an easy yet powerful way to estim ate the value of m. However, it has the drawback that its optimal solution is strong ly dependent on the in itial random value of m. Dembele et al. [60] deve loped a method to estimate the value of m using the number of clusters and the estimated membership. The key idea of this method is that as long as m runs toward to its upper bound value, the membership of all data points will converge PAGE 109 to 1/c, where c is the number of clus ters. They estimate the upper bound value of m using this idea. This method does not have the problem of local optima caused by the use of an initially random value of m. Howeve r, it requires the number of cluster to be specified a priori. To address this issue, Schwammle et al. [61] proposed a method using cluster validity indices to estimate the value of c before determining the optimal value of m. Similarly, Zhao et al. [62] proposed a model-based approach for the m estimation using the predefined value of c, which varies from 2 to C where C is the possibly maximum number of clusters. The es timation methods of these two approaches are based on the fuzzy partition generated using the predefined value of c. This value is not guaranteed to be optimal, because most of the cluster validity measures are not appropriate to the real datasets [147]. The fuzzy partition generated by the FCM algorithm, using the selected number of clus ters, is also not optimal because FCM may remain in local optima. 3.9 Defuzzification of fuzzy partition Defuzzification in fuzzy cluster analysis is a procedure to convert the fuzzy partition matrix U into a crisp partition which is then used to determine the classification information of the data objects. The most popular approach for defuzzification of fuzzy partition is appli cation of the maximum membership degree (MMD) principle [148, 152, 153]. The data object xi is assigned to the class of vk if and only if its membership degree to cluster vk is the largest, that is } u { max uli c .. 1 l ki (3.35) PAGE 110 This approach may be inappropriate in some applications, because FCM membership is computed using distance betw een the data object and cluster center. The use of membership degree can assign margin al objects of a large cluster to the immediately adjacent small cluster. This is illustrated in Figure 3-16, where if a data object is in the gray rectangle, it may be in correctly assigned to cluster 3 instead of cluster 2. Figure 3-16 : Artificial dataset 5 (ASET5) with three clusters of different sizes Genther, Runkler and Glesner [151] pr oposed a defuzzification method based on the fuzzy cluster partition (FCB), where th e partition information is applied in membership degree computati on. For each data object xi, i=1,Â…,n, an estimated value, (xi), is computed using the fuzzy partitio n. The set of estimated values of X, {(xi)}i=1,Â…,n is then used to compute the estimated value of vk, wk, k=1,Â…,c, as: n 1 i m ki n 1 i i m ki ku ) x ( u w. (3.36) The distance between the data object and cluster center, dF, in FCB is computed using both the actual and the estimated information as: PAGE 111 )} w ), x ( ( d { max ) w ), x ( ( d } v x ( d { max ) v x ( d ) v x ( dl j 2 l j k i 2 j 2 l j k i 2 k i Fl (3.37) where dF() is used, instead of d(), to com pute the membership degree. The two parameters, and weight the distances from X and its estimated to cluster prototypes. Chuang et al. [150], in the Neighborhood Based Method (NBM), on the other hand proposed adjusting the membership status of every data point using the membership status of its neighbors. Given a cluster vk, k=1,Â…,c, a data point, xi, i=1,Â…,n, receives an additional amount of membersh ip degree from its neighbors: ) x ( NB x kj kii ju h. (3.38) This additional information is used in computing the membership degree, as in (3.39), so that the membership level of each data point is strongly associated with those of its neighbors that will he lp with determination of classification information. Here, c 1 l q li p li q ki p ki NB kih u h u u. (3.39) Similar to FCB, the NBM method requires a definition of neighborhood boundaries and the values of p and q which weight the po sition and spatial information. Chiu et al. [149] also used the spatial information to compute the distance between data point and cluster center in their sFCM (Spatial FCM) method, as in (3.40), NB 1 j k j i j k i i j k i S) v x ( d ) v x ( d ) 1 ( NB 1 ) v x ( d, (3.40) PAGE 112 where NB is the number of ne ighbors of every data point, i j is the weighted factor of xj, a neighbor of xi, defined as: ) exp( 1 1i i j i j (3.41) where ) x x ( di j i j sFCM uses dS(), instead of d(), to compute the membership degree, and updates cluster centers, {S kv}, k=1,Â…,c, as follows: n 1 i m ki n 1 i NB 1 j j i j i i j m ki S ku x x ) 1 ( NB 1 u v. (3.42) Algorithm performance measures The overall performance of defuzzification methods can be measured based on the number of data objects that are misclassified defined as in (3.21); the cluster label of each data object is compared with its actua l class label. If any do not match, then a misclassification has occurred. The compactness measure, defined as in (3.43), can be used to validate how the actual classificati on structure retains as the missing data are updated, c 1 k N 1 i k ik) v x ( d s compactnes. (3.43) We tested the four defuzzification al gorithms: MMD, FCB, NBM and sFCM using the two datasets ASET4 and ASET5 with various values of the fuzzifier factor, m. Each algorithm was run 100 times and the results were averaged. The number of neighbors was set to 8. For the ASET4 dataset which contains 300 data objects of 11 wellseparated clusters in a 5-demensional data space, the result is shown in Table 3.5. PAGE 113 Table 3.5 : Algorithm performan ce on ASET4 using MISC measure m Algo. 1.25 1.375 1.5 1.625 1.75 1.875 2 MMD 0.00 0.000.000.000.000.00 0.00 FCB 12.24 12.2412.2412.2411.7211.71 11.97 NBM 5.89 5.561.971.971.942.03 1.80 sFCM 12.23 9.600.300.750.840.84 0.84 In the Table 3.5, only the MMD method correctly detected the classification information across various values of m. The other methods performed worse, particularly the methods that used spatial information to revise the distance between data object and cluster center, FCB and sFCM That may be because the assumption of 8 as the maximum number of neighbors for eac h data point, which is correct in spatial space, is not appropriate for generic data space. We ran the same benchmark on the ASET5 that contains 150 data points of 3 clusters different in size, see Figure 316. The result is shown in Table 3.6. Table 3.6 : Algorithm performan ce on ASET5 using MISC measure m algo 1.25 1.375 1.5 1.625 1.75 1.875 2 MMD 16.00 16.0016.0016.0016.0016.00 16.00 FCB 4.32 4.324.594.594.324.32 4.32 NBM 27.19 27.1927.1927.1927.1927.19 27.19 sFCM 17.00 17.0017.0017.0016.0016.00 16.00 The FCB algorithm performed better than th e other methods. The methods that use spatial information in computing distance betw een data object and cluster center, e.g., PAGE 114 FCB and sFCM provided better results. Howeve r, these algorithms failed at producing classification information for this dataset. The test results have shown that it is not appropriate to apply the spatial information into the defuzzification problem in cluster analysis of generic data, because properly determining the number of neighbor s for each data point is complicated, and the choice of the two parameters, and in these methods may not be easy due to the data distribution differen ce among different datasets. PAGE 115 4. Methods 4.1 Microarray meta analysis 4.1.1 Selection of samples and features Key challenges in meta-analysis methods arise from the small sample sizes typical of microarray studies and variations due to di fferences in experimental protocols. Both affect the results from i ndividual studies and inevitably degrade the final results of meta-analysis. In addition, the main outcome of these approaches is the ranking of genes for signifi cance, however, it is still unk nown which feature, gene expression value, statistic test p-value, likelihood measurement, or any other statistical measures, best represents of outcome measur es. This chapter addresses the problem of choosing samples from multiple studies w ith the goal of reducing the variance and discovering more biological meaning. In addi tion, because microarray data are specific to the biological context of the experiment in the solution proposed, the meta-analysis will reflect the appropriate biological context. 4.1.2 Sample and feature selection method Common approaches for sample and feat ure selection are PCA, hierarchical clustering analysis, optimization algorith ms and ranking methods. The first three approaches are limited because their results are strongly affected by the lab in which the experiments were carried out [25, 49]. Th ese approaches are also computationally complex, because they use pair-wise sample distance measurements. When the number of probes or features is large, this process becomes ti me consuming, particularly with PAGE 116 the use of external metrics. Lastly, these me thods require the data to be obtained using the same microarray platform. The alternative approach is the ranking method. Because ranking itself is invariant to within-array monotonic data transformations, by applying the rank-based TSPG (Top-Scoring Pair of Groups) met hod, which performs on-chip comparisons within each microarray, no da ta normalization or transfor mations are required before integration [169]. Feature ID mappings / Selection For microarray data obtained using the same generation of the same platform, we directly merge individual data sets us ing probe sets common to all experiments. To merge data from different generations of the same microarray platform, we first use the Affymetrix probe set ID mapping to annotate a ll probesets. Probes that are not annotated will be processed using En trezGene with UniGene data. Probes having two or more gene IDs will have their da ta replicated, one for each unique gene. To merge data from different platforms, we first map all probe identifiers to gene identifiers. Probes having two or more gene IDs will be pr eprocessed as above. This method avoids excluding probes from furt her analysis. In this study, we focus on integrating microarray data from the CMAP datasets which were generated using the same platform but different generations. Sp ecifically, we integr ate microarray data generated from the Affymetrix HGU-133x and HG-U95x platforms to obtain data sets with large (>500) sample sizes. This integr ated data set is used to discover common gene expression signatures in re sponse to drug treatments [170]. PAGE 117 Sample selection We developed RPPR algorithm, a comb ination of the rank product (RP) statistics method with a modi fied version of the PageRankTM (PR) algorithm and the goodness-of-fit Kolmogorov-Smirnov statistical test method. The RP method generates the gene rank for gene expression profiles. The PR algorithm is then applied to augment the gene ranks using Protei n-Protein interaction and GO Biological Process annotation data. The KS statistical test method is used to filter the expression profiles of which the gene ranks best describe a gi ven gene expression signature. Given a sample expression profile X and a gene expression signature S, denote Su and Sd as the lists of up-regulated and do wn-regulated genes, respectively. The connectivity score between X and S is co mputed using the Kolmogorov-Smirnov scores of goodness-of-fit (KS) of the two sub lists in S to the profile X. The KS score of a tag list s to X is calculated as follows: Sort X based on the appropriate expression status of genes. Denote by t and n the lengths of s and X, respectively. Construct a vector O of the position (1,Â…,n) of each pr obe set of s in X and sort these components in ascending order such that O(j) is the pos ition of tag j, where j = 1, 2,Â…, t. Compute the following: t ) 1 i ( n ) i ( O max b n ) i ( O t i max at i 1 t i 1; (4.1) next b a b b a a ) X s ( KS. (4.2) PAGE 118 For each expression profile associated with a specific drug, Xp, p=1,Â…,P, where P is the number of profiles, let ) X S ( KS KSp u p up, ) X S ( KS KSp d p down. The KS connectivity score of the profile Xi to the gene expression signature S is defined as i down i up iKS KS s (4.3) Let ) s ( min S ), s ( max Si p 1 i X min i p 1 i X max The normalized connectivity score of the profile Xp to the gene expression signature S is defined as: 0 s S s 0 s S s ) KS ( sign ) KS ( sign 0 Sp X min p p X max p p down p up p. (4.4) Because RPPR uses only the preprocessed expression data, it is also appropriate for datasets generated using next gene ration sequencing, RNAseq and other highthroughput technologies. Method testing We tested RPPR using the CMAP datasets to detect expre ssion profiles strongly related to the expression signatures, S1 a nd S2, of two drugs, the Histone deacetylase PAGE 119 (HDAC) inhibitor (S1) and estr ogen (S2). We compared the performance of RPPR with those of Lamb et al. [25] and Zhang et al. [26]. HDAC inhibitor signature CMAP arrays were preprocessed usi ng RMA resulting in 564 expression profiles. The RP [29] met hod was then applied to every study to detect genes differentially expressed and to order gene s in each profile. The pattern matching ranking method was used to score all the prof iles with respect to the HDAC inhibitor signature [25]. We used a cuto ff of 0.95 for the normalized c onnectivity scores to select profiles strongly related to the signature. Results, compared with those of Lamb et al [25] and Zhang et al [26], are shown in Ta bles 4.1 and 4.2, and Figures 4-1 and 4-2, respectively. Table 4.1 : Predicted HDAC antagonists (1: Detected; 0: Not detected) Drugs Lamb Zhang RPPR valproic acid 1 1 1 trichostatin A 1 1 1 Resveratrol 0 0 1 HC toxin 1 1 1 Vorinostat 1 1 1 Wortmannin 1 0 0 5666823 0 1 0 Prochlorperazine 0 1 0 17-allylamino-geldanamycin 0 1 0 PAGE 120 Table 4.2 : Predicted HDAC agonists (1 : Detected; 0 : Not detected) Drugs Lamb Zhang RPPR estradiol 1 1 1 genistein 1 1 1 nordihydroguaiaretic acid 1 1 1 thioridazine 0 0 1 LY-294002 0 0 1 staurosporine 0 0 1 indometacin 1 1 1 alpha-estradiol 1 1 1 tretinoin 0 0 1 doxycycline 1 0 1 troglitazone 1 0 0 5149715 1 0 0 12,13-EODE 1 0 0 17-allylamino-geldanamycin 0 1 0 butein 0 1 0 wortmannin 0 1 0 tetraethylenepentamine 0 1 0 diclofenac 0 1 0 4,5-dianilinophthalimide 0 1 0 pirinixic acid 0 1 0 haloperidol 0 1 0 PAGE 121 Figure 4-1 illustrates results of the three approaches using the HDAC inhibitor signature (S1). The results of RPPR are sim ilar to those of Lamb et al. [25] because both used the same scoring method. Vorinosta t, trichostatin A and valproic acid, wellknown HDAC inhibitors, were all identified by RPPR. HC-toxin was not identified by Lamb et al. [25] or Zhang et al. [ 26], but was identifie d by RPPR. Although 17allylamino-geldanamycin, wortmanin, prochlor perazine were not re ported in the top 20 profiles from RPPR, they were ranked with high scores: 0.639, 0.557 and 0.657, Lamb Zhang RPPR 0 4 8 0 3 1 0 5 Figure 4-2 : Venn diagra m for HDAC inhibitors Lamb Zhang RPPR 0 1 3 0 1 0 0 4 Figure 4-1 : Venn diagram fo r predicted HDAC antagonists PAGE 122 respectively. It is interesti ng that only RPPR detected resveratrol which has recently been recognized as an HDAC inhi bitor like valproic acid [119]. Estrogens signature Table 4.3 : Predicted estrogen receptor antagonists (1 : Detected; 0 : Not detected) Drugs Lamb Zhangmine RPPR Tamoxifen 0 0 1 Fulvestrant 1 1 1 trichostatinA 1 0 1 Raloxifene 1 1 1 Monastrol 1 1 1 5224221 1 1 1 Oxaprozin 1 0 1 HCtoxin 0 0 1 Vorinostat 1 1 1 LY-294002 1 1 1 Demecolcine 1 0 1 5248896 1 0 0 Y-27632 0 1 0 Sirolimus 0 1 0 17-allylamino-geldanamycin 0 1 0 Monorden 0 1 0 Sodiumphenylbutyrate 0 1 0 Deferoxamine 0 1 0 Indomethacin 0 1 0 Tretinoin 0 1 0 PAGE 123 For the Estrogen signature (S2), Figures 4-3 and 4-4 show th at results of RPPR are similar to those of Lamb. It is intere sting that both RPPR and ZhangÂ’s identify estradiol with the highest rank, but only R PPR detected tamoxifen as negatively related to the estrogen signature. In fact, tamoxifen is a compound that works by blocking the estrogen receptors on breast tissue cells a nd slowing their estrogen-induced growth. Tamoxifen has recently been chosen as one of the best replacements for fulvestrant in breast cancer treatment. The results of RPPR si gnificantly agreed with those of Lamb et Lamb Zhang RPPR 0 0 3 0 1 0 0 4 Lamb Zhang RPPR 0 2 8 0 1 3 0 6 Figure 4-3 : Venn diagram for predicte d estrogen receptor antagonists Figure 4-4 : Venn diagram for predicted estrogen receptor agonists PAGE 124 al. (2006) and Zhang et al. ( 2008). Some of RPPRÂ’s results which were new to those of Lamb et al. (2006) and Zhang et al. (2008) were also proved by recent literature. Table 4.4 : Predicted estrogen receptor agonists (1 : Detected; 0 : Not detected) Drugs Known target/mechanism Lamb Zhang RPPR Valproic acid Epilepsy treatment, ER stimulator 1 1 1 trichostatinA HDAC inhibito r, ER sensitizer 1 1 1 HCtoxin HDAC inhibitor, ER signaling pathway related 1 1 1 vorinostat Skin cancer treatment, Advanced-stage breast cancer related 1 1 1 wortmannin Longterm estrogen deprivor (usually used with tamoxifen for a better ER inhibitor) 1 0 0 5666823 Similar to 2-deoxy-D-glucose (0.7765) like Tamoxifen (0.7143) 0 1 0 prochlorperazine Nausea and vomiting treatment, Endocrine: not used with hormone replacement estrogens. 0 1 0 17-allylaminogeldanamycin Hsp90 inhibitor, Hsp90 is required for the proper functioning of ER 0 1 0 Although RPPR used the same goodness-of-f it scoring method as Lamb et al. [25], it is based on different distribution func tions generated by the different methods to detect and rank genes that are differentially ex pressed. In addition, instead of using the same distribution function for both up-regulat ed and down-regulated genes, RPPR used the up-regulation distribution of the drug-spec ific profile for the up-regulated genes and the down-regulation distribution of the profile for down-regulated genes. This helps to address the weakness of KS goodne ss-of-fit test because it is less sens itive at the tailend of the distribution. In restricting the range of the distribution function, the results PAGE 125 produced by RPPR are very close to those of Lamb et al. [25] and Zhang et al. [26] although no change was made to the step functi on part of the distribution function. This restriction helps speed up the scoring proc ess when the distribution function can be approximated well by managing the step function. The experimental results in this study also show that, although CMAP datasets are generated from unreplicated single arra ys, the use of meta-analysis with the rank product and page rank methods and biological knowledge re duces the variance of the selected samples. The results therefore have strong agreement with the biological context of interest; e.g., the HDAC inhibitors and Estrogen receptor. 4.1.3 Methods for feature metric One of the most important issues with meta-analysis is how to define a similarity metric across samples or studies so that results have strong biological relevance. Recent research ha s shown that, because of differences in experimental designs and laboratory-specific effects, inte grated gene expression data from multiple studies may have poor correlation under th e biological context of the interest. To overcome this problem, we propose: To use protein-protein interactions ( PPI) with the Random Walk algorithm to determine gene-gene proximities. To use GO cellular component (CC) and biological process (BP) annotations with term-term semantic similarity to determine gene-gene similarities. PAGE 126 4.2 Fuzzy cluster analysis methods using Fuzzy C Means (FCM) The FCM algorithm has several limitations: the initialization of the partition matrix can cause the algorithm to remain in local optima, the partially missing data may cause the algorithm to work improperly, there is no strong mathematical/statistical and biological basis to validate the fuzzy part ition, FCM cannot determine the number of clusters in the dataset, and it does not have an effective method to address the problem of defuzzification, where the method using the maximum membership degree principle (MMD) has been proved inappropria te for gene expression data. To address these problems, we first m odify the FCM objective function using the partition coefficient because of its ability to maintain the partition compactness (Section 3.5.1). For initialization of the partiti on matrix, we developed a new Subtractive Clustering (SC) method using fuzzy partition of the data inst ead of the data themselves [146]. To address the problem of partia lly missing data, we developed two new imputation methods using density and pr obability based approaches [154, 155]. Regarding the cluster validation problem, we developed a new evaluation method for fuzzy clustering, particularly for gene e xpression data [147]. The evaluation model considers both the internal validity, which is statistical in nature, and the external validity, which is biological. For the problem of cluster number, we developed a new clustering algorithm that in tegrates our algorithms with the optimization genetic algorithm (GA) to manage a set of soluti on candidates [157]. Using our fuzzy cluster evaluation method, the optimizat ion algorithm can search fo r the best solutions during its evolutionary processes. To produce classi fication information using fuzzy partition, PAGE 127 we developed a novel defuzzification method th at generate a probabi listic model of the possibilistic model described by the fuzzy pa rtition and apply the model to produce the classification informati on of the dataset [156]. 4.2.1 Modification of the objective function Both the partition coefficient (PC) and partition entropy (PE) are best measures for the compactness of fuzzy partition. Becau se PC performs better than PE regarding this issue (Section 3.5.1), PC was, therefore, used as the compactness factor in the FCM objective function of our method: n 1 i c 1 k n 1 i c 1 k m ki k i 2 m ki u mmin u ) v x ( d u ) V U ( J, (4.5) where the learning rate, is defined as in (4.78), Section 4.3.1. We consider minimizing (4.5) w.r.t. (3.6) us ing the method of Lagrange multipliers. Let the Lagrange multiplier be i, i=1,Â…,n and put n 1 i c 1 k ki i n 1 i c 1 k m ki n 1 i c 1 k k i 2 m ki u) 1 u ( u ) v x ( d u L. For the necessary optimal condition, 0 ) v x ( d ) u ( m u Li k i 2 1 m ki ki u (4.6) n 1 i k i m ki k u0 v x u 2 v L. (4.7) With the same reasoning as in (3.9) and (3.10), we obtain PAGE 128 1 m 1 k i 2 1 m 1 i 1 m 1 k i 2 i ki) v x ( d 1 m ) ) v x ( d ( m u (4.8) Summing up for k=1,Â…,c and taki ng (3.6) into account, we have, c 1 k 1 m 1 k i 2 1 m 1 i c 1 k ki) v x ( d 1 m u 1. Hence, c 1 k 1 m 1 k i 2 1 m 1 i) v x ( d 1 1 m Together with (4.8), we obtain the following for the membership in the FCM version of our method, c 1 l 1 m 1 l i 2 1 m 1 k i 2 ki) v x ( d 1 ) v x ( d 1 u. (4.9) Similarly, we derive from (4.7) for each k, 1 k c: n 1 i m ki n 1 i i m ki ku x u v. (4.10) The update model of = {U,V} in our FCM met hod used (4.9) and (4.10). 4.2.2 Partition matrix initialization method FCM has recently been integrated with optimization algorithms, such as, the Genetic Algorithm, Particle Swarm Optim ization, and Ant Colony Optimization [161166, 176-177]. Alternatively, a Subtractive Cl ustering (SC) method has been used with FCM, where SC is used first to determin e the optimal number of clusters and the PAGE 129 location of cluster centers [105-108] and FCM is then used to determine the optimal fuzzy partitions [105, 111-113]. While this approach can overcome the problem of initialization of parameters, it still requires a priori specification of the parameters of the SC method: the mountain peak and the mountain radii. SC uses thes e two parameters to compute and amend the value of the mountain function for every data point while it is looking for the cluster candidates. Most appr oaches using the trad itional SC method use constant values for these parameters [105, 109-111]. However, different datasets have different data distributions, and, therefore, th ese values need to be adjusted accordingly. Yang and Wu [106] proposed a method to do th is automatically. However, because the data densities are not always distributed equa lly within the dataset, the automatic values may be appropriate only for some data points. In general, use of the SC method to de termine the optimal num ber of clusters is based on the ratio between the amended valu e and the original value of the mountain function at every data point. The data poi nts at which the ratios are above some predefined cutoff are selected and their number is used as the optimal number of clusters [105, 107, 110]. This approach, howev er, requires the specifi cation of the cutoff value which will differ among datasets. Method design Â– histogram-based density Given = {U,V}, a fuzzy partition on X, the accumulated density at vk, k=1,Â…,c, is calculated [112] as PAGE 130 n 1 i ki ku ) v ( Acc. (4.11) By assuming a uniform density of the data observed in each cell, an estimate fhist(x) of the underlying probability densit y function f(x) at any point x of vk can be computed by: sh ) v ( Acc ) x ( fk hist (4.12) where h is the bin width and s is the number of bins generated on vk. The density at each data point xi is estimated, using a histogram based method, as c 1 k ki k iu ) v ( Acc ) x ( dens. (4.13) However, the histogram density estimator as in (4.12) has some weaknesses. It is a discontinuous function, and the choice of both the values of h and s has quite an effect on the estimated density. The apriorism needed to set those values make it a tool whose robustness and reliability are too low to be used for statistical estimation. The density estimator in (4.13) is therefore more true at the centers {vk} than at other data points [112]. If we try to find the most dense data points, c, using (4.13), we will obtain the centers of c clusters of the fuzzy part ition. To address this problem, we use the concept of strong uniform fuzzy partition. Definition 4.1 (strong un iform fuzzy partition) [112]: Let m1 < m2 < Â… < mc be c fixed nodes of the universe (or universal set), =[a,b], such that m1 = a and mc = b, and c>2. We say that the set of c fuzzy subsets A1, A2,Â…, Ac, identified with their PAGE 131 membership function A1, A2,Â… Ac defined on form a strong uniform fuzzy partition of if they fulfill the following conditions, for 1 k c, 1) Ak(mk) = 1 (the protoype). 2) If x [mk-1, mk+1] then Ak(x) = 0. 3) Ak(x) is continuous. 4) Ak(x) monotically increases on [mk-1, mk] and monotically decreases on [mk, mk+1]. 5) x k such that Ak(x) > 0. 6) x c 1 k Ak1 ) x ( (strength condition). 7) mk+1-mk = h, 1 k < c. 8) Ak(mk-x) = Ak(mk+x), 1 < k < c. 9) x [mk, mk+1], Ak(x) = Ak-1(x-h) (same shape at overlap region). Note that the 6th condition is the strength condition, i.e. that without this condition the fuzzy partition is no longer strong. Proposition 4.1 (extended fuzzy partition) [112]: Let Ak, k=1,Â…, c, be a strong uniform fuzzy partition on D, extended on DÂ’, then 1) x, c D, k0 {1,Â…, c 1}, such that k {k0, k0+1}, Ak (x) = 0, and Ak0(x) + Ak0+1 (x) = 1. 2) For k = 1,Â…, c, 1 k 1 km m Akh dx ) x (. 3) KA: [-1, 1] [0,1], such that, ) 1 mk 1 mk ( U h m x K ) x (I k A Ak PAGE 132 and 1 1 A1 du ) u ( K. Definition 4.2: The fuzzy histogram density es timator defined on D is given by c 1 k Bk k hist) x ( u ) v ( Acc nh 1 ) x ( f, (4.14) where {Bk}k=1,...,c is a strong uniform fuzzy partition defined on D. It can easily be shown that fhist 0 and fhist(x)dx = 1, and that fhist goes through the c points (mk, Acc(vk)/nh). We now define a new fuzzy partition = {U',V}, based on Definition 4.1 and Proposition 4.1, using = {U,V}, C 1 l 1 v x 1 v x ki2 l l i 2 k k ie e u (4.15) where n 1 i 2 k i k i 2 kv x ) v | x ( P. (4.16) Because {U',V} is a strong uniform fuzzy partition [112], using Definition 4.2, the density at every data point can then be estimated as c 1 k ki k iu ) v ( Acc ) x ( dens. (4.17) Fuzzy mountain function amendment Each time the most dense data point is selected, the mountain function of the other data points must be amended to search for new cluster centers. In the traditional PAGE 133 SC method, the amendment is done using the direct relationships between the selected data point and its neighborhood, i.e., usi ng a pre-specified mountain radius. In our approach, this is done using the fuzzy part ition and no other parameters are required. First, the accumulated density of al l cluster centers is revised using ) x | v ( P M ) v ( Acc ) v ( Acc* t k t k t k 1 t (4.18) where xt is the data point selected as the new cluster center and Mt is the mountain function value at time t. The de nsity at each data point is th en re-estimated using (4.17) and (4.18) as follows, ) x | v ( P u M ) x ( dens ) x | v ( P u M u ) v ( Acc u ) x | v ( P M ) v ( Acc u ) v ( Acc ) x ( densc 1 k t k ki t i t c 1 k t k ki t c 1 k ki k t ki c 1 k t k t k t c 1 k ki k 1 t i 1 t Hence, the estimated density at each data point is amended as: c 1 k t k ki t i t i 1 t) x | v ( P u M ) x ( dens ) x ( dens. (4.19) PAGE 134 fzSC fuzzy subtractive clustering algorithm [146] Steps 1) Generate a fuzzy partition A using FCM. 2) Create a strong uniform fuzzy partition B using (4.15). 3) Estimate the density at every xi X using (4.17). 4) Select the most dense data point as cluster candidate. 5) If the stop-condition is met then Stop. 6) Re-estimate the density at every xi using (4.19). 7) Go to Step 4. To determine the optimal number of clusters in the dataset, we first locate the most dense data points, n Only the data points where the ratio between the mountain function values at time t and time 0 is above 0.95, which is consid ered a stop-condition, are selected. This number is considered the optimal number of clusters in the dataset. Method testing To evaluate the ability of fzSC in sear ching for the best cluster centers, we ran fzSC on the ASET1 (Figure 3-13) with six cl usters. The result is shown in Figure 4-5. We compared fzSC with three standard algorithms, k-means, k-medians and fuzzy c-means, using the ASET1 dataset (Figure 3-13). Each algorithm was run 20 times and the correctness ratio was averaged. Results are shown in Table 4.5. PAGE 135 Figure 4-5 : Candidate clus ter centers in the ASET1 dataset found using fzSC. Squares, cluster centers by FCM; dark circles, cluster cen ters found by fzSC. Classes are labeled 1 to 6. Table 4.5 : Algorithm performance on the ASET1 dataset Algorithm Correctness ratio by class Avg. Ratio 1 2 3 4 5 6 fzSC 1.00 1.001.001.001.001.001.00 k-means 0.97 0.871.001.001.000.750.93 k-medians 0.95 0.821.001.001.000.620.90 FCM 0.97 1.000.951.001.000.960.98 With a correctness ratio of 100% for every class and the whole dataset, fzSC always successfully detected the optimal partition of ASET1 wh ile the other methods using a random method for partition matrix did not. PAGE 136 To check the ability of fzSC to detect the number of clusters, we ran fzSC on the artificial datasets 20 times each and averaged the performance. Table 4.6 : fzSC correctness in determini ng cluster number on artificial datasets The number of clusters generated in the dataset The dataset dimension 2 3 4 5 5 0.971.001.00 1.00 6 1.000.980.90 1.00 7 1.001.001.00 1.00 8 1.000.990.97 1.00 9 0.870.991.00 0.96 Regarding the problem of partition matr ix initialization, fzSC provides better results in comparison with the three sta ndard clustering algorith ms using a random mechanism. When checking the ability of fzSC on artificial datasets, the performance is significant. It performs worse on the datasets with nine clusters in the data space with the dimension of 2. We tried to plot some of these datasets. The results show that some clusters mostly overlap and te nd to merge into one. We, ther efore, can assume that the performance of fzSC is not good on these data sets because the data were not properly generated. For the real datasets, Iris, Wine, Glass and Breast Cancer Wisconsin, the number of clusters are 3, 3, 6, and 6 respectively. The data in the Wine and Glass datasets were normalized by attributes before clustering. The Pearson correlation distance, which is the most appropriate for real datasets, was used to measure the similarity between data PAGE 137 points. We ran fzSC 20 times on each of these datasets with the value of L, the number of solutions, set to 15. The number of cl usters found was compared with the known number in the dataset. Results are shown in Table 4.7. Table 4.7 : fzSC performance on real datasets Dataset # data points #clusters pred icted #clusters correctness ratio Iris 150 3 3 1.00 Wine 178 3 3 1.00 Glass 214 6 6 5 0.95 0.05 Breast Cancer Wisconsin 699 6 6 5 0.65 0.35 We have proposed a novel algorithm to address the problem of parameter initialization of the standard FCM algorit hm [146]. Our method is novel because the subtractive clustering technique uses fuzzy partition of th e data instead of the data themselves. The advantages of fzSC are th at, unlike traditional SC methods, it does not require specification of the mountain peak and mountain radii, and, with a running time of O(cn), compared to O(n2) for the traditional SC method, it is more efficient for large datasets. In addition, fzSC can be integrated easily with fuzzy clustering algorithms to search for the best centers of cluster candida tes, or to address other existing problems of FCM such as missing data and fuzzifier determination problems. We combined FCM with a new SC method to automatically determine the number of clusters in a dataset. FCM randomly creates a set of fuzzy partition solutions for the dataset using a maximum number of cl usters; SC then uses the fuzzy partition approach to search for the best cluster cente rs and the optimal numb er of clusters for PAGE 138 each solution. FCM then rapidly determines the best fuzzy partiti on of every solution using the optimal number of clusters dete rmined by SC. The best solution from this solution set is the final result. 4.2.3 Fuzzy clustering evaluation method Recently, many cluster validity indices have been developed to evaluate results of FCM. Bezdek [92] measur ed performance using partiti on entropy and the overlap of adjacent clusters. Fukuyama and Sugeno [ 93] combined the FCM objective function with the separation factor, while Xie and Beni [94] integrated th e Bezdek index [92] with the cluster separation factor. Rezaee et al. [95] combined the compactness and separation factors, and Pakhira et al. [96] combined the same two factors where the separation factor was normalized. Recently Rezaee [97] proposed a new cluster index in which the two factors were normalized across the range of possible numbers of clusters. These were introduced in Section 3.5.1. Here, we propose a fuzzy cluster validati on method that uses the fuzzy partition and the distance matrix between cluster cente rs and data points. Instead of compactness and separation, fzBLE uses a Bayesian m odel and a log-likelihood estimator. With the use of both the possibility model and the pr obability model to represent the data distribution, fzBLE is appropriate for artificial datasets where the distribution follows a standard model, as well as for real datasets that lack a standard distribution. We show that fzBLE outperforms popular cluster indices on both artific ial and biological datasets. PAGE 139 Method design Â– Bayesian & likelihood based approach Instead of using the two factors, co mpactness and separation, we propose a validation method (fzBLE) that is based on a log likelihood estimator with a fuzzy based Bayesian model. Given a fuzzy clustering solution modeled by = {U, V}, where V represents the cluster centers and U is the partition ma trix representing the membership degrees of the data points to the clusters. The likeli hood of the clustering model and the data is measured as ) v | x ( P ) v ( P ) V U | x ( P ) X | V U ( L ) X | ( Ln 1 i c 1 k k i k n 1 i i (4.20) where the log likelihood is com puted using the following, max ) v | x ( P ) v ( P log ) L log(n 1 i c 1 k k i k (4.21) Because our clustering model is possibility -based, to apply Equations (4.20) and (4.21), a transformation from possibility to probability is needed. Given a fuzzy clustering model = {U, V}, according to [98], uki is the possibility that vk = xi. If is a proper fuzzy partition, then there exists some x* such that Uk(x*) = 1, k=1,Â…,c, and Uk is a normal possibility distribution. Assume Pk is the probability distribution of vk on X, where pk1 pk2 pk3 Â… pkn. We associate with Pk a possibility distribution Uk on X [98] such that uki is the possibility of xi, where PAGE 140 . 1 ),..., 1 n ( i u p p i u p n u1 i k 1 i k ki ki kn kn (4.22) Reversing (4.22), we obtain the transfor mation of a possibility distribution to a probability distribution. Assume that Uk is ordered the same way with Pk on X: uk1 uk2 uk3 Â… ukn. i / u u p p n / u p1 i k ki 1 i k ki kn kn (4.23) Pk is an approximate proba bility distribution of vk on X, and pki = P(xi|vk). If Uk is a normal possibility distribution then pki = 1. The data distributions Having computed the value of Pk, we can estimate the variance k, the prior probability P(vk), and the normal distribution of vk for cluster k [27]. n 1 i 2 k i ki kv x p, (4.24) c 1 l n 1 i l i n 1 i k i k) v | x ( P ) v | x ( P ) v ( P, (4.25) 1 2 v x k p / 1 k i p2 k 2 k ie ) 2 ( ) v | x ( P (4.26) In real datasets, for a cluster vk, the data points usuall y come from different random distributions. Because they cluster in vk, they tend to follow the normal PAGE 141 distribution of vk estimated as in (4.26). This idea is based on the Central Limit Theorem. Theorem 4.1 (Central Limit Theorem): Sn is the sum of n mutually independent random variables, then the di stribution function of Sn is well-approximated by a certain type of continuous function known as a normal density function, which is given by the formula 2 22 x ,e 2 1 ) x ( f (4.27) where and 2 are the mean and variance of Sn respectively. We integrate the probabilities computed in (4.23) and (4.27) for the probability of the data point xi given cluster vk as ) v | x ( P ), v | x ( P max ) v | x ( Pk i n k i k i *. (4.28) Equation (4.28) better re presents the data distribution, particularly in real datasets. Our method, fzBLE [147], is based on (4.21), (4.26) and (4.28). Method testing We tested fzBLE with othe r well-known cluster indices: partition coefficient PC (3.23) and partition entropy PE (3.24) [92], Fukuyama-Sugeno FS (3.25) [93], Xie and Beni XB (3.26) [94], Compose W ithin and Between scattering CWB (3.28) [95], Pakhira-Bandyopadhyay-Maulik cluster index PBMF (3.29) [96], and Babak Rezaee cluster index BR (3.30) [97] us ing the artificial and real datasets. For each artificial dataset, we ran the standard FCM algorithm five times with m set to 2.0 and the partition matrix initia lized randomly. In each case, the best fuzzy PAGE 142 partition was then selected to run fzBLE and the other cluster indices to search for the optimal number of clusters using numbers fr om 2 to 12 and to compare this with the known number of clusters. We repeated the experiment 20 times and averaged the performance of each method. Table 4.8 shows the fraction of correct predictions. fzBLE and PBMF outperform other approaches. The CF method, which is based on the compactness factor, was the least effective. Table 4.8 : Fraction of correct cluste r predictions on arti ficial datasets #c fzble PC PE FS XB CWB PBMF BR CF 3 1.00 0.42 0.42 0.42 0.42 1.00 1.00 0.83 0.00 4 1.00 0.92 0.92 0.92 0.83 1.00 1.00 1.00 0.00 5 1.00 0.75 0.75 0.83 0.75 0.83 1.00 1.00 0.00 6 1.00 0.92 0.83 0.92 0.58 0.58 1.00 0.92 0.00 7 1.00 0.83 0.83 0.83 0.67 0.58 1.00 0.67 0.00 8 1.00 1.00 0.92 1.00 0.92 0.67 1.00 0.83 0.00 9 1.00 0.92 0.67 0.92 0.67 0.33 1.00 0.83 0.00 Table 4.9 : Validation method performance on the Iris dataset (3 true clusters) #c fzble PC PE FS XB CWB PBMF BR CF 2 -763.0965 0.9554 0.0977 -10.6467 0.0203 177.1838 12.3280 1.1910 0.9420 3 -762.8034 0.8522 0.2732 -9.3369 0.1292 213.4392 17.7131 1.0382 0.3632 4 -764.8687 0.7616 0.4381 -7.4821 0.2508 613.2656 14.4981 1.1344 0.2665 5 -770.2670 0.6930 0.5703 -8.2331 0.3473 783.4697 13.6101 1.0465 0.1977 6 -773.6223 0.6549 0.6702 -7.3202 0.2805 904.3365 12.3695 1.0612 0.1542 7 -774.4740 0.6155 0.7530 -6.8508 0.2245 1029.7342 11.2850 0.9246 0.1262 8 -774.8463 0.6000 0.8111 -6.9273 0.3546 1635.3593 10.5320 0.8692 0.1072 9 -780.1901 0.5865 0.8556 -6.6474 0.3147 1831.5705 9.9357 0.7653 0.0905 10 -781.7951 0.5765 0.8991 -6.0251 0.2829 2080.3339 9.3580 0.7076 0.0787 PAGE 143 Tables 4.9 4.11 show the test results on the Iris, Wine and Glass datasets. For the Iris dataset, only fzBLE and PBMF detect ed the correct number of clusters (Table 4.9). For the Wine and Glass datasets, only fzBLE and CWB, and only fzBLE, respectively, detected the correct numbe r of clusters (Tables 4.10 and 4.11). Table 4.10 : Validation method performance on the Wine dataset (3 true clusters) #c fzble PC PE FS XB CWB PBMF BR CF 2 -926.4540 0.9264 0.1235 -113.0951 0.1786 3.9100 1.3996 2.0000 61.1350 3 -924.0916 0.8977 0.1764 -104.9060 0.2154 3.2981 0.9316 1.4199 39.3986 4 -932.8377 0.8607 0.2525 -139.9144 0.5295 6.6108 0.6306 1.1983 33.7059 5 -929.6146 0.8225 0.3281 -126.5746 0.5028 6.9001 0.4700 1.0401 28.4741 6 -928.8121 0.8066 0.3669 -118.4715 0.6173 9.2558 0.3706 0.9111 25.3451 7 -930.6451 0.7988 0.3874 -120.3128 0.6465 10.3803 0.2972 0.7629 23.1742 8 -932.0462 0.7993 0.3917 -124.7999 0.6459 11.0836 0.2471 0.6392 21.4411 9 -932.1902 0.7929 0.4120 -122.8396 0.6367 11.8373 0.2100 0.5801 19.9154 10 -935.0478 0.7909 0.4217 -130.9089 0.6270 11.9941 0.1773 0.5252 18.9891 Table 4.11 : Validation method performance on the Glass dataset (6 true clusters) #c fzble PC PE FS XB CWB PBMF BR CF 2 -1135.6886 0.8884 0.1776 0.3700 0.7222 6538.9311 0.3732 1.9817 0.5782 3 -1127.6854 0.8386 0.2747 0.1081 0.7817 4410.3006 0.4821 1.5004 0.4150 4 -1119.2457 0.8625 0.2515 -0.0630 0.6917 3266.5876 0.4463 1.0455 0.3354 5 -1123.2826 0.8577 0.2698 -0.1978 0.6450 2878.8912 0.4610 0.8380 0.2818 6 -1113.8339 0.8004 0.3865 -0.2050 1.4944 5001.1752 0.3400 0.8371 0.2430 7 -1116.5724 0.8183 0.3650 -0.2834 1.3802 5109.6082 0.3891 0.6914 0.2214 8 -1127.2626 0.8190 0.3637 -0.3948 1.4904 7172.2250 0.6065 0.5916 0.2108 9 -1117.7484 0.8119 0.3925 -0.3583 1.7503 8148.7667 0.3225 0.5634 0.1887 10 -1122.1585 0.8161 0.3852 -0.4214 1.7821 9439.3785 0.3909 0.4926 0.1758 11 -1121.9848 0.8259 0.3689 -0.4305 1.6260 9826.4211 0.3265 0.4470 0.1704 12 -1135.0453 0.8325 0.3555 -0.5183 1.4213 11318.4879 0.5317 0.3949 0.1591 13 -1138.9462 0.8317 0.3556 -0.5816 1.4918 14316.7592 0.6243 0.3544 0.1472 PAGE 144 For the Yeast dataset, we ran the FCM al gorithm with m set to 1.17 [60] and used the clustering partition to test all methods as in previous sections. Table 4.12 shows that only fzBLE detected the correct number of clusters (five) in Yeast dataset. Table 4.12 : Validation method performance on the Yeast dataset (5 true clusters) #c fzble PC PE FS XB CWB PBMF BR CF 2 -2289.8269 0.9275 0.1172 -85.1435 0.2060 8.3660 1.2138 2.0000 133.0734 3 -2296.4502 0.9419 0.0983 -157.2825 0.2099 4.7637 0.6894 1.0470 94.6589 4 -2305.3369 0.9437 0.1000 -191.7664 0.2175 4.0639 0.5575 0.7240 74.7629 5 -2289.3070 0.9087 0.1648 -187.1073 1.0473 13.6838 0.4087 0.6722 65.9119 6 -2296.3098 0.8945 0.1939 -196.6711 0.9932 13.8624 0.3050 0.6170 60.8480 7 -2296.6017 0.8759 0.2299 -198.2858 1.0558 15.4911 0.2434 0.5686 56.1525 8 -2299.4225 0.8634 0.2526 -201.7688 1.0994 16.9644 0.2050 0.5132 51.2865 9 -2299.3653 0.8453 0.2871 -205.1489 1.2340 20.2532 0.1741 0.4819 48.0737 10 -2302.7581 0.8413 0.2992 -208.5687 1.1947 20.7818 0.1512 0.4533 45.9442 11 -2300.3294 0.8325 0.3186 -209.4023 1.1731 21.1525 0.1307 0.4272 43.6600 12 -2307.5701 0.8290 0.3272 -213.4658 1.2245 23.0389 0.1157 0.4040 42.1594 13 -2310.7819 0.8270 0.3354 -215.2463 1.3036 25.4062 0.1016 0.3847 40.8654 Table 4.13 : Validation method performance on th e Yeast-MIPS dataset (4 true clusters) #c fzble PC PE FS XB CWB PBMF BR CF 2 -1316.4936 0.9000 0.1625 25.4302 0.3527 16.7630 0.7155 1.9978 81.0848 3 -1317.3751 0.9092 0.1615 -32.8476 0.2981 10.1546 0.8032 1.2476 58.2557 4 -1304.0374 0.8216 0.3252 -39.4858 2.5297 39.8434 0.5400 1.3218 48.6275 5 -1308.6776 0.8279 0.3216 -54.4979 2.4245 34.9963 0.3620 0.9558 41.9671 6 -1309.9191 0.8211 0.3460 -59.8918 2.3511 35.4533 0.2691 0.8291 38.5468 7 -1315.3692 0.8139 0.3654 -65.4866 2.3562 38.8797 0.2423 0.7252 36.0906 8 -1315.1479 0.8062 0.3918 -67.6774 2.4958 43.9502 0.1966 0.6712 34.1387 9 -1321.2280 0.8109 0.3874 -72.3197 2.2854 41.2112 0.1664 0.6072 32.3289 10 -1324.1578 0.8158 0.3847 -74.7867 2.0433 37.6154 0.1395 0.5588 30.9686 PAGE 145 For the Yeast-MIPS dataset, we ran the FCM algorithm using the same parameters as with the Yeast dataset. The results in Table 4.13 show that only fzBLE correctly detected the four clus ters in the Yeast-MIPS dataset. For the RCNS (Rat Central Nervous System) dataset, we ran fzBLE and the other cluster indices on the dataset cl ustering partition found by the standard FCM algorithm using the Euclidean metric for dist ance measurement. Table 4.14 shows that, again, only fzBLE detected the correct number of clusters. Table 4.14 : Validation method performance on the RCNS dataset (6 true clusters) #c fzble PC PE FS XB CWB PBMF BR CF 2 -580.0728 0.9942 0.0121 -568.7972 0.0594 5.5107 4.2087 1.1107 177.8094 3 -564.1986 0.9430 0.0942 -487.6104 0.4877 4.1309 4.2839 1.6634 117.9632 4 -561.0169 0.9142 0.1470 -430.4863 0.9245 6.1224 3.3723 1.3184 99.1409 5 -561.7420 0.8900 0.1941 -397.0935 1.3006 9.4770 2.6071 1.1669 88.5963 6 -552.9153 0.8695 0.2387 -300.6564 2.5231 20.6496 1.9499 1.1026 84.0905 7 -556.2905 0.8707 0.2386 -468.3121 2.1422 21.0187 2.8692 0.7875 57.5159 8 -555.3507 0.8925 0.2078 -462.0673 1.7245 20.0113 2.5323 0.5894 52.0348 9 -558.8686 0.8863 0.2192 -512.4278 1.6208 22.4772 2.6041 0.5019 45.9214 10 -565.8360 0.8847 0.2241 -644.1451 1.1897 21.9932 3.4949 0.3918 33.1378 We presented a novel method that uses the log likelihood estimator with a Bayesian model and the possibility, rather th an the probability, distribution model of the dataset from the fuzzy partition. By using th e Central Limit Theore m, fzBLE effectively represents distributions in real datasets Results have shown that fzBLE performs effectively on both artificial and real datasets. PAGE 146 4.2.4 Fuzzy clustering evaluation using Gene Ontology [180] The shortcoming of fzBLE is that it is solely based on the data and, therefore, unable to apply prior biologi cal knowledge into cluster evaluation. We developed GOfzBLE, a cluster validation method that generates the fuzzy partition for a given clustering solution using GO term based sema ntic similarity and applies fzBLE method to evaluate the clustering solution using the fuzzy partition. Gene Ontology The Gene Ontology (GO) [181] is a hi erarchy of biological terms using a controlled vocabulary that includes three i ndependent ontologies for biological process (BP), molecular function (MF) and cellular component (CC). Standardized terms known as GO terms describe roles of genes and ge ne products in any organism. GO terms are related to each other in the fo rm of parent-child relationshi ps. A gene product can have one or more molecular functions, can particip ate in one or more bi ological processes, and can be associated with one or more cellular components [182] As a way to share knowledge about functionalities of genes, GO itself does not contain gene products of any organism. Rather, expert curators sp ecialized in different organisms annotate biological roles of gene produc ts using GO annotations. Each GO annotation is assigned with an evidence code that indicates the type of eviden ce supporting th e annotation. Semantic similarity GO is structured as directed acyclic graphs (DAGs) in which the terms form nodes, and the two kinds of semantic relati onships, Â“is-aÂ” and Â“part-ofÂ”, form edges [184]. Â“is-aÂ” is a simple class-subclass re lation, where A is-a B means that A is a PAGE 147 subclass of B. Â‘part-of Â’ is a partial ownership relation; C part-of D means that whenever C is present, it is always a part of D, but C need not al ways be present. The structure of DAG allows assigning a metric to a set of terms base d on the like liness of their meaning content which is used to m easure semantic similarity between terms. Multiple GO based semantic similarity meas ures have been developed [183, 184], and are increasingly used to evaluate the relati onships between proteins in protein-protein interactions, or co-regulated genes in ge ne expression data analysis. Among of the existing semantic similarity measurement methods, that of Resnik is most appropriate for gene expression data analysis because it is strong correlated with gene sequence similarities and gene expre ssion profiles [183]. However, Wang et al. [184] had shown that the ResnikÂ’s method has a drawback in that it uses only the information content derived from annotation statistics which is not suitable for measuring semantic similarity of GO terms. We therefore pr opose to use WangÂ’s method for GO semantic similarity measurement. For each term A, a semantic value, S(A), is computed as: AT t A) t ( S ) A ( S, (4.29) where TA is a set of terms including the term A and its ancestors, and SA(.) is the sematic value regarding the term A, defined as in (4.30), A t )}, t ( ChildrenOf u ), u ( S w max{ A t 1 ) t ( SA u t A, (4.30) where u tw is the semantic contribution factor [ 184] for edge connecting term t with its child, term u. The semantic similarity between two terms, A and B, is defined as: PAGE 148 ) B ( SV ) A ( SV ) t ( S ) t ( S ) B A ( SB AT T t B A GO (4.31) Using the semantic similarities between GO terms, the semantic similarity between the two GO term sets, G1 and G2, is defined as: 2 G 1 G ) 1 G 2 g ( Sim ) 2 G 1 g ( Sim ) 2 G 1 G ( Sim2 G 2 g 1 G 1 g (4.32) where Sim(t,T) is the similarity between the term t and the term set T and is defined as in (4.33), )} u t ( S { max ) T t ( SimGO T u (4.33) Method design For a given crisp clustering solution of gene expression data, an approximate fuzzy partition is generated based on the ge ne-gene semantic similarities using GO term annotations. The fuzzy partition is applied to the fzBLE method for evaluation of the clustering solution. Each crisp clustering solution is modeled with c = {M, V}, where V represents the cluster centers and M, M={mki}; mki {0,1}; k=1,Â…,c; i=1,Â…,n, is the crisp c-partition matrix representing the membersh ip of the data points to the clusters. In case where a fuzzy clustering solution = {U, V}, is given, c can be computed based on using our defuzzification met hod for fuzzy partition [154]. Fuzzification of a crisp clustering partitionIn order to use GO terms to validate crisp clustering solution, a fuzzy partition is generated based on the crisp partition using GO semantic similarity. This is done by assi gning the crisp partition into the space of PAGE 149 GO terms. Each data object xi, i=1,Â…, n, is correspondi ng with a vector of GO annotations GO ix, and a vector of de grees of belief (DOB) CF ix, where CF ijx is the DOB of the term GO ijxannotated to xi (Table 4.15). Because GO annotations of a gene/gene product may come from different sources (T able 4.16), use of DOBs can help with combining multiple annotations using the same term of the same data object. Table 4.15 : Degrees of belie f of GO annotation evidences Evidence Degree of belief EXP 1.0 IDA, IPI, TAS 0.9 IMP, IGI, IEP 0.7 ISS, ISO, ISA, ISM 0.4 IGC 0.2 IBA, IBD, IKR, IRD, RCA 0.3 NAS, IC, ND, NR 0.0 IEA 0.1 The GO annotations for each cluster vk, GO kv for k=1,Â…, c, can be determined using the GO annotations of the members of cluster vk: GO v x GO kx vk (4.34) For each annotation GO ktv of vk, t=1,Â…, GO kv, its DOB, CF ktv, is computed using the DOBs of clusterÂ’s members annotated with term t: GO kt GO j k CF j CF ktv x v x x mean v (4.35) PAGE 150 Table 4.16 : Gene Ontology evidence codes Experimental Evidence Codes EXP Inferred from Experiment IDA Inferred from Direct Assay IPI Inferred from Physical Interaction IMP Inferred from Mutant Phenotype IGI Inferred from Genetic Interaction IEP Inferred from Expression Pattern Computational Analysis Evidence Codes ISS Inferred from Sequence or Structural Similarity ISO Inferred from Sequence Orthology ISA Inferred from Sequence Alignment ISM Inferred from Sequence Model IGC Inferred from Genomic Context IBA Inferred from Biological aspect of Ancestor IBD Inferred from Biologica l aspect of Descendant IKR Inferred from Key Residues IRD Inferred from Rapid Divergence RCA inferred from Reviewed Computational Analysis Author Statement Evidence Codes TAS Traceable Author Statement NAS Non-traceable Author Statement Curator Statement Evidence Codes IC Inferred by Curator ND No biological Data available Automatically-assigned Evidence Codes IEA Inferred from Electronic Annotation Obsolete Evidence Codes NR Not Recorded PAGE 151 We apply the semantic similarity measure as in (4.32) to compute the semantic similarity between cluster vk and data point xi using their GO annotations. We propose a modification of (4.31) so that it can be used with DOB. For the term GO ktv of cluster vk and the term GO ijxof data point xi, their semantic similarity is defined as: ) x v min( ) x v ( S ) x v ( SCF ij CF kt GO ij GO kt GO GO ij GO kt GO (4.36) We use (4.36) instead of (4.31) when co mputing the semantic similarity between vk and xi, ) x v ( m Sii k, using (4.32). The distance between vk and xi based on GO semantic similarity is then defined as: ) x v ( m Si 1 d ) x v ( di k 2 i k 2 GO (4.37) A fuzzy partition model using GO terms space GO, GO = {UGO, _}, of c is now determined where UGO is computed as in (4.9) using the distance function defined in (4.37). GO is used with fzBLE for the clustering solution evaluation. Because GO is based on GO annotation, we proposed a method to compute the prior probability of vk, Pp(vk), k=1,Â…, c, using the prior probabi lities of the GO terms used by vk, T. For a given term t, t T, ) T ( freq ) t ( freq ) t ( Pp, (4.38) ) v ( sum ) v ( sum ) v | t ( PCF k CF ku t u v u kGO k (4.39) PAGE 152 where Pp(t) and P(t|vk) are the prior probability of the term t and the conditional probability of the term t given vk respectively. The prior probability of vk, k=1,Â…, c, is computed as in (4.40), GO k GO kv t p k k v t p k k p) t ( P ) v ( P ) v | t ( P ) t ( P ) v t ( P ) v ( P. (4.40) Method testing We tested GOfzBLE with fzBLE and the ot her cluster indices as in the Section 4.2.3 using the Yeast, Yeast-MIPS and RCNS datasets. Table 4.17 : Validation method performan ce on the Yeast dataset using GO-BP #c GOfzBLE PC PE FS XB CWB PBMF BR fzBLE 2 -4832.656 0.928 0.117 -85.144 0.206 8.366 1.214 2.000 -2289.827 3 -4832.542 0.942 0.098 -157.283 0.210 4.764 0.689 1.047 -2296.450 4 -4832.751 0.944 0.100 -191.766 0.218 4.064 0.558 0.724 -2305.337 5 -4831.548 0.909 0.165 -187.107 1.047 13.684 0.409 0.672 -2289.307 6 -4831.890 0.895 0.194 -196.671 0.993 13.862 0.305 0.617 -2296.310 7 -4833.182 0.876 0.230 -198.286 1.056 15.491 0.243 0.569 -2296.602 8 -4833.644 0.863 0.253 -201.769 1.099 16.964 0.205 0.513 -2299.423 9 -4832.358 0.845 0.287 -205.149 1.234 20.253 0.174 0.482 -2299.365 10 -4832.246 0.841 0.299 -208.569 1.195 20.782 0.151 0.453 -2302.758 11 -4832.600 0.833 0.319 -209.402 1.173 21.153 0.131 0.427 -2300.329 12 -4832.668 0.829 0.327 -213.466 1.225 23.039 0.116 0.404 -2307.570 13 -4832.217 0.827 0.335 -215.246 1.304 25.406 0.102 0.385 -2310.782 For the Yeast dataset, only GOfzBLE a nd fzBLE correctly identified the number of clusters (Table 4.17). Table 4.18 again shows that these two algorithms detected correctly the number of clusters. While fz BLE works based on gene expression levels, PAGE 153 GOfzBLE works based on GO-BP annotations. The results also show that GO-BP annotations are strongly corre lated with gene-gene differe ntial co-expre ssion patterns. In other words, we may utilize either gene-gene co-expression patterns in gene expression data or GO-BP annotations to se arch for genes having similar functionality. Table 4.18 : Validation method performance on the Yeast-MIPS dataset using GO-BP #c GOfzBLE PC PE FS XB CWB PBMF BR fzBLE 2 -2288.413 0.900 0.163 25.430 0.353 16.763 0.716 1.998 -1316.494 3 -2286.843 0.909 0.162 -32.848 0.298 10.155 0.803 1.248 -1317.375 4 -2283.854 0.822 0.325 -39.486 2.530 39.843 0.540 1.322 -1304.037 5 -2285.069 0.828 0.322 -54.498 2.425 34.996 0.362 0.956 -1308.678 6 -2286.252 0.821 0.346 -59.892 2.351 35.453 0.269 0.829 -1309.919 7 -2286.834 0.814 0.365 -65.487 2.356 38.880 0.242 0.725 -1315.369 8 -2287.543 0.806 0.392 -67.677 2.496 43.950 0.197 0.671 -1315.148 9 -2288.333 0.811 0.387 -72.320 2.285 41.211 0.166 0.607 -1321.228 10 -2288.954 0.816 0.385 -74.787 2.043 37.615 0.140 0.559 -1324.158 Table 4.19 : Validation method perform ance on the RCNS dataset using GO-CC #c GOfzBLE PC PE FS XB CWB PBMF BR fzBLE 2 -970.340 0.994 0.012 -568.797 0.059 5.511 4.209 1.111 -580.073 3 -970.460 0.943 0.094 -487.610 0.488 4.131 4.284 1.663 -564.199 4 -969.909 0.914 0.147 -430.486 0.925 6.122 3.372 1.318 -561.017 5 -969.669 0.890 0.194 -397.094 1.301 9.477 2.607 1.167 -561.742 6 -969.659 0.870 0.239 -300.656 2.523 20.650 1.950 1.103 -552.915 7 -970.148 0.871 0.239 -468.312 2.142 21.019 2.869 0.788 -556.291 8 -969.787 0.893 0.208 -462.067 1.725 20.011 2.532 0.589 -555.351 9 -970.300 0.886 0.219 -512.428 1.621 22.477 2.604 0.502 -558.869 10 -970.487 0.885 0.224 -644.145 1.190 21.993 3.495 0.392 -565.836 PAGE 154 The RCNS dataset has six groups of simila rly expressed genes, two of which are invariant [87]. We first ra n GOfzBLE using GO-CC annotati ons and compared with the other methods. The results are shown in Table 4.19. Only GOfzBLE and fzBLE identified six clusters in the dataset, corresponding to the six groups in the analysis results performed by Wen et al. [87]. We then reran GOfzBLE using GO-BP annotations and compared with the other me thods. The results are shown in Table 4.20. GOfzBLE identified four clusters in the datase t that are corresponding to the four stages in the developmental process. The resu lt of GOfzBLE again shows that GO-BP annotations are strongly correla ted with gene-gene differentia l co-expression patterns. It also shows that GO-BP annotati ons are useful in creating eff ective external criteria for cluster analysis of gene expression data. Table 4.20 : Validation method perform ance on the RCNS dataset using GO-BP #c GOfzBLE PC PE FS XB CWB PBMF BR fzBLE 2 -1373.999 0.994 0.012 -568.797 0.059 5.511 4.209 1.111 -580.073 3 -1373.935 0.943 0.094 -487.610 0.488 4.131 4.284 1.663 -564.199 4 -1373.776 0.914 0.147 -430.486 0.925 6.122 3.372 1.318 -561.017 5 -1374.208 0.890 0.194 -397.094 1.301 9.477 2.607 1.167 -561.742 6 -1374.496 0.870 0.239 -300.656 2.523 20.650 1.950 1.103 552.915 7 -1374.526 0.871 0.239 -468.312 2.142 21.019 2.869 0.788 -556.291 8 -1374.811 0.893 0.208 -462.067 1.725 20.011 2.532 0.589 -555.351 9 -1375.199 0.886 0.219 -512.428 1.621 22.477 2.604 0.502 -558.869 10 -1375.426 0.885 0.224 -644.145 1.190 21.993 3.495 0.392 -565.836 4.2.5 Imputation methods for partially missing data As discussed earlier in Secti on 3.4, given a data point xi with missing attributes, the global model based approach updates the missing values of xi using the information of the whole data. In contra st, the local model based a pproach does that using the PAGE 155 information of the data poin ts in the neighborhood of xi. The former has an advantage of eliminating the noise that can be caused by the data points around xi, particularly in the case when xi is in the overlapping region. The latte r, however, has the advantage that it may provide the data distribution model that xi is following. Recent research has shown that the gl obal model based appr oach, the optimal completion strategy (OCS) is an example, outperforms the local model based one [139, 140, 143]. Existing approaches usually update XM using either a global model, or a local model. Luo et al. [141], by using a local update model integrated into the FCM algorithm, showed that his method outperf ormed two well-known methods: K-nearest neighborhood (KNN) and Local Least S quares K-nearest neighborhood (SKNN). However, the method of Luo et al. is only ba sed on the local model information. Hence, it may fail at the overlapping re gion. Mohammadi et al. [142] improve the work of Luo et al. by using an external distance based on GO terms. The relationships of the terms in GO in fact are locally detected, methods us ing them are just a kind of local model. Regarding the problem of missing data we developed separately two new algorithms using our fzBLE and fzSC methods namely probability based (fzPBI) and density based (fzDBI) imputa tion algorithms, to create Â“abst ractÂ” fuzzy partition based on the partition derived from the partially distance strategy met hod (PDS) [140]. These abstract partitions not only contain the globa l distribution model of the data, but they also represent the local distributions at every data points in the dataset. We then developed new update models for XM using both the global and local data distribution models derived from the abst ract partitions [154, 155]. PAGE 156 4.2.6 Probability based imputation method (fzPBI) [154] Method design The objective of fzPBI is to cluster a dataset X, that may be incomplete, into c clusters, Xm are estimated during the clustering pro cess with respect to the optimization of the function Jm. Let XW = {xi X | xi is a complete data object}, XP = {xij | xij ?}, and XM = {xij | xij = ?}. The distance measure, d2(.) therefore is defined as: p 1 j 2 kj ij j p 1 j j k i 2) v x ( w w p ) v x ( d, (4.41) where wj indicates the contribu tion degree of the jth attribute of xi, xij, in the distance between xi and vk. If xij XM, wj increases with each iteration to avoid premature use of estimated values. Therefore, we define wj as: M ij P ij jX x T t X x 1 w, (4.42) where t is the iteration index, and T is the number of iterations; 0 t < T. To impute the values of XM, we applied our fzBLE me thod [148] to generate a probabilistic model of the data distribution using the fuzzy partition. This model is then used to impute the missing values. For each cluster vk, k=1,Â…,c, a probability distribution {pki}i=1,Â…,n is derived from the possibility distribution {uki}i=1,Â…,n as in (4.22) and (4.23). The data distributions m odel is then estimated using (4.24), (4.25) and (4.26). The model for missing data es timation is the developed as follows: PAGE 157 Let } p , p p { Pm kp m 2 k m 1 k m k be a set of probabilities at vk, where m kjp, j=1,Â…,p, indicates the probability that attribute j is missing in cluster k, defined as: n 1 i k i n 1 i ij k i m kjv x P I 1 v x P p (4.43) where M ij P ij ijX x 0 X x 1 I. Hence, the probability that a data object xi has a missing attribute j in cluster k is: k i p j t 1 t m kt m kj k k | ) ij ( Mv | x P p p ) v ( P ) x ( P (4.44) Because each cluster vk can be considered a component of the data distribution model of X, the estimated value of xij XM is computed as: c 1 k k | ) ij ( M c 1 k k k | ) ij ( M ijx P v x P x Âˆ. (4.45) To avoid an oscillation in the missing value imputation, estimated values are used decreasingly. In contrast to their usag e in the clustering pro cess, their contribution degree in distance measurement during th e estimation process is defined as, M ij P ij jX x T t 1 X x 1 w. (4.46) PAGE 158 fzPBI algorithm Steps 1) t=0; initialize Ut randomly w.r.t (3.2). 2) Compute {d2(xi, vk)}i=1..n,k=1..c as in (4.41) and (4.42). 3) Compute Ut+1 using (3.3). 4) Compute Vt+1 using (3.4). 5) If (t > T) or ( t m 1 t mJ J) then Stop. 6) Compute {d2(xi, vk)}i=1..n,k=1..c using jw as in (4.46). 7) Create a probabilistic data distribution model from the fuzzy partition using (4.24), (4.25) and (4.26). 8) Create a probabilistic model for XM using (4.43) and (4.44). 9) Estimate XM using (4.45). t=t+1. Go to Step 2. The difference between fzPBI and the st andard FCM algorithm is that fzPBI provides a method to discover th e probabilistic model of the data distribution from the fuzzy partition and to apply the model to missing value imputation. fzPBI therefore can address the problem of missing data itself. Algorithm performance measures We used two measures to evaluate algor ithm performance. The first measure is the root mean square error (RMSE) between the true values a nd the imputed values, defined as: PAGE 159 in 1 i 2 i i ix Âˆ x n 1 RMSE, (4.47) where ni is the number of missing values imputed. The second measure assesses the overall performance determined by the number of data objects with missing attributes that were misclassified. This assessment is done by comparing the cluster label of each data object with its actual class label. If the two match, there is no misclassification. If they do not match, then a misclassifica tion has occurred, defined in (3.21). Method testing 7 12 17 22 27 32 37 42 47 52 571013152025304050RMSE Percentage of missing value fzPBI OCS NPS FCMimp CIAO FCMGOimp Figure 4-6 : Average RMSE of 50 trials us ing an incomplete ASET2 dataset with different percentages of missing values To evaluate the performa nce of fzPBI, we compar ed fzPBI with six popular imputation methods: PDS, OCS and NPS [1 40]; FCMimp [158]; CIAO [159]; and FCMGOimp [160]. For each test dataset, we ge nerated the missing data using different PAGE 160 percentages of missing values. The value of the fuzzifier factor, m, was set to 2.0 for the artificial datasets and the Iris, Wine and RCNS datasets. For the Yeast, Yeast-MIPS and Serum datasets, m was set to 1.17 and 1.25 respectively, as in [ 60]. The number of clusters, c, was set to the known number of clusters. E ach algorithm was run 5 times and the best result recorded. We repeated the experiment 50 times and averaged the performance of each algorithm using both measures. 18 23 28 33 38 43 48 53 571013152025304050RMSE Percentage of missing value fzPBI OCS NPS FCMimp CIAO FCMGOimp Figure 4-7 : Average RMSE of 50 trials us ing an incomplete ASET5 dataset with different percentages of missing values Regarding the artificial datasets, ASET2 and ASET5; ASET2 has five wellseparated clusters of similar size. ASET5 is more complex, containing three clusters that differ in size and density (Figure 316). On ASET2, fzPBI had the lowest RMSEs across different scales of mi ssing data (Figure 4-6) and th erefore performed better than PAGE 161 the other methods. On ASET5, fzPBI agai n performed best (Figure 4-7), although performance of the other methods was comparable. Applied to the Iris dataset, fzPBI and CIAO had the smallest RMSEs, although, as shown in Figure 4-8, fzPBI performed ma rginally better. Table 4.21 shows that, compared with CIAO, fzPBI had a smalle r number of misclassified objects across different scales of missing values. 0.3 0.8 1.3 1.8 2.3 2.8 571013152025304050RMSE Percentage of missing value fzPBI OCS NPS FCMimp CIAO FCMGOimp Figure 4-8 : Average RMSE of 50 trials using an incomplete Iris dataset with different percentages of missing values On the Wine dataset, on average, both fzPBI and CIAO outperformed the other methods. fzPBI and CIAO performed equally on data with percen tages of missing values of 10% and lower. However, fzPBI pe rformed slightly bett er than CIAO on data with higher percentages of missing values. FCMimp and FCMGOimp performed better than other methods, when the percentage of missing values was small (Figure 4-9). PAGE 162 However, they performed worse than all other methods on datasets with high percentages, i.e., greater than 15%, of missing values. Table 4.21 : Average results of 50 trials using an incomplete IRIS dataset with different percentages (%) of missing value % Averaged #objects misclassified fzPBI PDS OCS NPS FCMimp CIAO FCMGOimp 5 15.9 18.7 16.016.218.415.918.0 7 15.9 12.6 16.316.813.015.913.2 10 15.9 10.0 16.317.79.816.09.5 13 16.0 9.1 17.119.09.516.29.0 15 16.1 12.8 18.420.712.216.39.9 20 15.9 20.6 19.822.920.316.023.7 25 16.2 31.9 20.524.830.816.132.8 30 16.7 37.9 26.230.937.916.738.9 40 16.7 49.3 30.837.950.817.057.8 50 21.2 56.5 42.652.357.121.864.1 66.5 86.5 106.5 126.5 146.5 166.5 571013152025304050RMSE Percentage of missing value fzPBI OCS NPS FCMimp CIAO FCMGOimp Figure 4-9 : Average RMSE of 50 trials using an incomplete Wine dataset with different percentages of missing values PAGE 163 0.121 0.131 0.141 0.151 0.161 0.171 0.181 0.191 0.201 0.211 571013152025304050RMSE Percentage of missing value fzPBI OCS NPS FCMimp CIAO FCMGOimp Figure 4-10 : Average RMSE of 50 trials using an incomplete RCNS dataset with different percentages of missing values On the RCNS dataset (Figure 4-10), fzPB I outperformed other algorithms when the missing value percentage was 40% and lower. However, it performed marginally worse with a missing value percentage of 50%. This is because the RCNS dataset contains only 112 data point s and becomes very sparse with high percentages of missing values; the data probability model was not properly determined, and fzDBI, therefore, could not perform correctly. PAGE 164 0.41 0.91 1.41 1.91 2.41 2.91 571013152025304050RMSE Percentage of missing value fzPBI OCS NPS FCMimp CIAO FCMGOimp Figure 4-11 : Average RMSE of 50 trials us ing an incomplete Yeast dataset with different percentages of missing values fzPBI outperformed other methods on the Yeast and Yeast-MIPS datasets (Figures 4-11 and 4-12); FCMGOimp was run using the GO term-based distance measure [11]. However, FCMGOimp outperfor med only FCMimp. This result is similar to that reported in [158]. Using GO terms to measure the distance between genes is interesting. However, a crisp distance measure of {0,1}, where 0 is the distance between a pair of genes having at least one GO term in common, does not help much with the problem of missing value imputation. fzPBI again outperformed other algor ithms on Serum dataset with the percentage of missing value lower than 50% (Figure 4-13). For th e rest of the cases, fzPBI was only outperformed by CIAO. On average, fzPBI performed best on the Serum dataset. PAGE 165 0.38 0.88 1.38 1.88 2.38 2.88 3.38 3.88 571013152025304050RMSE Percentage of missing value fzPBI OCS NPS FCMimp CIAO FCMGOimp Figure 4-12 : Average RMSE of 50 trials usi ng an incomplete Yeast-MIPS dataset with different percentages of missing values 0.07 0.09 0.11 0.13 0.15 0.17 571013152025304050RMSE Percentage of missing value fzPBI OCS NPS FCMimp CIAO FCMGOimp Figure 4-13 : Average RMSE of 50 trials us ing an incomplete Serum dataset with different percentages of missing values PAGE 166 4.2.7 Density based imputation method (fzDBI) [155] Similar to fzPBI, the fzDBI algorithm was developed to cluster a dataset X, that may be incomplete, into c clusters based on fuzzy partition generated using partial distance measure as in (4.41) and (4.42). The abstract fuzzy partition is derived from the FCM fuzzy partition using our fzSC method, as follows: Let = {U,V} be the fuzzy partition gene rated by FCM, which is a distancebased fuzzy partition describing the data dens ity at the cluster centers. The accumulated density at the center of cluster vk, k=1,Â…,c, is calculated as in (4.11). However, cannot properly describe the data density at every data point of X (4.2.2). A strong uniform fuzzy partition is then defined as in (4.15) and (4.16). The data density at every data point is then computed using (4.17). A set V Âˆ, } v Âˆ , v Âˆ v Âˆ { V Âˆdc 2 1 of the most dense data points, cd, are then selected as cl uster candidates. Because V Âˆ represents the data densities in the dataset, we cons truct a density-based fuzzy partition Âˆ, } V Âˆ U Âˆ { Âˆ where U Âˆis defined as: c 1 l k l c 1 l li k l kiv Âˆ | v P u v Âˆ | v P u Âˆ. (4.48) The estimated value of xij XM is computed based on both the FCM fuzzy partition, and the density-based fuzzy partition, Âˆ, d dc 1 k m ki c 1 k kj m ki c 1 k m ki c 1 k kj m ki iju Âˆ v Âˆ u Âˆ ) 1 ( u v u x Âˆ, (4.49) PAGE 167 where 0< <1, indicates the contribution level of each of the tw o fuzzy partition models, and Âˆ, in the missing value imputation. We set = 0.5 so that both the models contribute equally. To avoid an o scillation in the missing value imputation, estimated values are used decreasingly. In c ontrast with their usage in the clustering process, their contribution degree in di stance measurement during the estimation process is defined as in (4.46). fzDBI algorithm Steps 1) t = 0; c = n. 2) Initialize Ut randomly w.r.t (3.2). 3) Compute {d2(xi, vk)}i=1,Â…,n,k=1,Â…,c as in (4.41) and (4.42). 4) Compute Ut+1 using (3.3). 5) Compute Vt+1 using (3.4). 6) If (t T) or ( t m 1 t mJ J) then Stop. 7) Compute {d2(xi, vk)}i=1,Â…,n,k=1,Â…,c using jw as in (4.46). 8) Create a strong uniform fuzzy partition as in (4.15) and (4.16). 9) Create a density-based fuzzy partition of X, Âˆ, as in (4.48). 10) Estimate XM using (4.49). t=t+1. Go to Step 3. The difference between fzDBI and the st andard FCM algorithm is that, in addition to a distance-based fuzzy partition, fzDBI generates a density-based fuzzy partition which is used with the distan ce-based one in missing value imputation. Method testing fzDBI was tested the same way with fz PBI, compared with the six popular imputation methods: PDS, OCS and NPS [1 40]; FCMimp [158]; CIAO [159]; and PAGE 168 FCMGOimp [160], fzDBI performed better on da tasets of different levels (Figures 4-14 4-17). 9 14 19 24 29 34 39 44 49 54 571013152025304050RMSE Percentage of missing value fzDBI OCS NPS FCMimp CIAO FCMGOimp Figure 4-14 : (fzDBI) Average RMSE of 50 trials using an incomplete ASET2 dataset with different missing value percentages 19 24 29 34 39 44 49 54 571013152025304050RMSE Percentage of missing value fzDBI OCS NPS FCMimp CIAO FCMGOimp Figure 4-15 : (fzDBI) Average RMSE of 50 trials using an incomplete ASET5 dataset with different missing value percentages PAGE 169 0.25 0.75 1.25 1.75 2.25 2.75 571013152025304050RMSE Percentage of missing value fzDBI OCS NPS FCMimp CIAO FCMGOimp Figure 4-16 : (fzDBI) Average RM SE of 50 trials using an incomplete Iris dataset with different percentages of missing values 0.3 0.8 1.3 1.8 2.3 2.8 3.3 3.8 571013152025304050RMSE Percentage of missing value fzDBI OCS NPS FCMimp CIAO FCMGOimp Figure 4-17 : (fzDBI) Average RMSE of 50 trials using an incomplete Yeast-MIPS dataset with different pe rcentages of missing values PAGE 170 4.2.8 Probability based defuzzification (fzPBD) [156] To generate classification information us ing fuzzy partition, the fuzzy partition needs to be defuzzified for crisp classification information of the data objects. A popular approach to this problem is to u se the maximum membership degree principle (MMD). This may be inappropr iate in some applications, particularly applications where the data are non-uniform, because FCM me mbership is computed using distance between the data object a nd cluster center. Use of me mbership degree can assign marginal objects of a large cluster to the immediately adjacent sm all cluster. Recent solutions include that of C huang et al. [150], who proposed using spatial information to adjust the membership status of every data point using the membership status of its neighbors. In a similar approach, Chiu et al. [149] used spatial info rmation to determine the distance between the data point and cluster ce nter. However, these methods require definition of neighborhood boundaries. Genther et al. [151] proposed defining spatial information by the clustering structure, not the neighborhood, to compute the distance from the data point to cluster center. A common limitation of met hods that use spatial information in computing membership degree is that they have to scal e between the actual pos ition information of the data point and its spatial information. In addition, while use of spatial information is appropriate for image segmentation, it may not work with generic data cluster analysis because it is difficult to de fine neighborhood boundaries. PAGE 171 We developed fzPBD that uses the fuzzy partition and the data themselves to construct a probabilistic model of the data distributions. Th e model is then applied to produce the classification information of data points in the dataset. Method design We applied the statistics model in our fzBLE method [141] to generate a probabilistic model of the data di stribution using fuzzy partition. Given a fuzzy partition matrix U, the vector Uk = {uki}i=1,Â…,n, k=1,Â…,c, is a possibility model of the data distribution of vk on X. We associate {Pk}, the probability distributions, with the po ssibility distributions {Uk} of vk on X, k=1,Â…,c, using (4.22) and (4.23). The statistics at vk, k=1,Â…,c, are then derive d using (4.24), (4.25) and (4.26). Using the idea based on the Central Limit Theorem, Â“The distribution of an average tends to be normal, even when the distribution from which the average is computed is decidedly non-normal,Â” the data objects can be assumed to follow the normal distributions at {vk}k=1,Â…,c, defined as in (4.26). Therefore, the data object xi is assigned to the class of vk, where )} x | v ( P { max ) x | v ( Pi l c .. 1 l i k (4.50) Because P(vk|xi) = P(xi,vk)/P(xi) = P(xi|vk)P(vk)/P(xi), an alternativ e to (4.50) is )} v ( P ) v | x ( P { max ) x | v ( Pl l i c .. 1 l i k (4.51) where P(vl), l=1,Â…,c, the prior probability of vl, can be computed using (4.25). PAGE 172 fzPBD algorithm Steps 1) Convert the possibility distributions in U into probability distributions using (4.22) and (4.23). 2) Construct a probabilistic model of the data distri butions using (4.24) and (4.26). 3) Apply the model to produce the classifi cation information for every data point using (4.51). Method testing We compared fzPBD with the MMD, FCB, NBM and sF CM methods on ASET2, ASET3, ASET4 and ASET5 using the compactness measure (3.43) and misclassification measure (3.21) for evaluati on of algorithm performance. For each dataset, the following values of the fuzzifier factor, m were used: 1.25, 1.375, 1.50, 1.625, 1.75, 1.875 and 2.0. The number of cl usters, c, was set to the known number of clusters. The FCM algorithm was run 3 times and the best fuzzy cl uster partition was selected to test all the algor ithms. We repeated the experiment 100 times and averaged the performance of each algor ithm using the two measures. PAGE 173 0 0.5 1 1.5 2 2.5 1.251.3751.51.6251.751.8752Number of misclassified data points Fuzzifier (m) fzPBD MMD FCB NBM sFCM Figure 4-18 : Algorithm performance on ASET2 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 1.251.3751.51.6251.751.8752Number of misclassified data points Fuzzifier (m) fzPBD MMD FCB NBM sFCM Figure 4-19 : Algorithm performance on ASET3 ASET2 and ASET3 each contain five well -separated clusters of the same sizes. ASET1 and ASET2 are in tw o-dimensional and three-dimensional data space, respectively. Performance of all algorithms is shown in Figures 4-18 and 4-19. fzPBD, MMD and sFCM generated no misclassifica tion across multiple levels of m. PAGE 174 0 2 4 6 8 10 12 1.251.3751.51.6251.751.8752Number of misclassified data points Fuzzifier (m) fzPBD MMD FCB NBM sFCM Figure 4-20 : Algorithm performance on ASET4 ASET4 contains 11 well-se parated clusters in a five-dimensional data space. Results are shown in Figure 4-20. fzPBD and MMD perfor med best. Table 4.22 shows that they also had the smallest compact ness measures across multiple levels of m. Table 4.22 : ASET4Compactness measure Algo. Fuzzifier factor m 1.25 1.3751.5 1.6251.751.8752.00 fzPBD 0.39 0.39 0.390.39 0.390.39 0.39 MMD 0.39 0.39 0.390.39 0.390.39 0.39 FCB 4.52 4.52 4.524.52 4.534.54 4.54 NBM 12.8511.288.738.72 8.728.72 8.72 sFCM 4.94 3.2 0.450.49 0.490.5 0.5 PAGE 175 ASET5 is non-uniform with three cluste rs in a two-dimensional data space (Figure 3-16). Results are shown in Figur e 4-21. fzPBD outperformed all the other algorithms. 0 5 10 15 20 25 1.251.3751.51.6251.751.8752Number of misclassified data points Fuzzifier (m) fzPBD MMD FCB NBM sFCM Figure 4-21 : Algorithm performance on ASET5 2.5 3 3.5 4 4.5 5 5.5 6 6.5 7 7.5 1.251.3751.51.6251.751.8752Number of misclassified data points Fuzzifier (m) fzPBD MMD FCB NBM sFCM Figure 4-22 : Algorithm performance on IRIS The IRIS dataset contains three clusters corresponding to the th ree classes of Iris flowers. Results are shown in Figure 4-22. NBM outperformed other algorithms. fzPBD PAGE 176 performed lightly less well, however, Table 4.23 shows that, compared with NBM, fzPBD had smaller compactness measur es across multiple levels of m. Table 4.23 : IRIS: Compactness measure Algo. Fuzzifier factor m 1.25 1.3751.51.6251.751.875 2.00 fzPBD 0.44 0.440.440.440.440.44 0.44 MMD 0.44 0.440.440.440.440.44 0.44 FCB 11.84 11.8111.7811.7611.7511.75 11.75 NBM 1.75 1.751.751.741.741.74 1.74 sFCM 0.44 0.440.440.440.440.44 0.44 31 36 41 46 51 56 61 1.251.3751.51.6251.751.8752Number of misclassified data points Fuzzifier (m) fzPBD MMD FCB NBM sFCM Figure 4-23 : Algorithm perf ormance on WINE dataset The Wine dataset contains information on 13 attributes of the three classes of wines. Results are shown in Figure 4-23. fzPBD and FCB performed better than other algorithms. fzPBD performed better than FCB with the fuzzifier factor levels less than PAGE 177 1.8. Also, Table 4.24 shows that fzPBD performed better than FCB across multiple levels of m. Table 4.24 : WINECompactness measure Algo. Fuzzifier factor m 1.25 1.3751.51.6251.751.875 2.00 fzPBD 0.08 0.080.080.080.080.08 0.08 MMD 0.05 0.050.050.050.050.05 0.06 FCB 0.19 0.180.180.170.160.16 0.17 NBM 0.17 0.170.170.180.180.18 0.18 sFCM 0.05 0.050.050.050.050.05 0.05 8 28 48 68 88 108 1.251.3751.51.6251.751.8752Number of misclassified data points Fuzzifier (m) fzPBD MMD FCB NBM sFCM Figure 4-24 : Algorithm perf ormance on GLASS dataset The Glass dataset contains information on ni ne attributes of six classes of glass used in building construction. Results ar e shown in Figure 4-24. fzPBD significantly outperformed all the other algorithms. fzPB D outperformed the other methods on both PAGE 178 artificial and real datasets, pa rticularly on the datasets with cl usters that differed in size. fzPBD is, therefore, appropriate for real-wor ld datasets, where usually the data densities are not uniformly distributed. 4.2.9 Fuzzy genetic subtractive clustering method (fzGASCE) [157] We have addressed the key problems of the fuzzy C-Means algorithm: initialization of the fuzzy pa rtition matrix, evaluation of fuzzy clustering results, missing value imputation and fuzzy partition de fuzzification. The experimental results show that our methods outperformed the ex isting algorithms addr essing the same issues. Yet another issue with FCM algorithm is that, like most other partiti oning approaches, it cannot by itself determine the number of clusters and, as in previous studies, its results depend strongly on initial parameters. For some initial values, FCM will converge rapidly to a global optimum, but, for others, it may become stuck in a local optimum. Common approach to address the limitations of FCM is to inte grate FCM with the Genetic Algorithm (GA), where the GA is used to manage a set of solution candidates. The FCM algorithm is then applied and a clus ter validity index is used as the GA fitness function to search for the best solution. Ghos h et al. [163] and Liu et al. [166] proposed to use the partition coefficient (PC) [92] a nd Xie and Beni (XB) [94] validity indices. In addition, Liu et al. [166] proposed a modified version of the PC index (MPC) in order to reduce the monotonic tendency of the index. The Fukuyama-Sugeno cluster index (FS) [93], which measures the compact ness and separation of the cl uster partition, was used by Ghosh et al. [163]. Lianji ang et al. [164], in a novel self-Adaptive Genetic Fuzzy CMeans algorithm (AGFCM), propo sed a validity index comb ining the PC index with PAGE 179 total variation of fuzzy partition. Halder, Pramanik and Kar [162] proposed a GA fitness function (HPK) based on the compactness measure and combined it with an intra-inter validity index for a novel algorithm that automatically determines the number of clusters in the dataset. Li n et al. [165] proposed a comb ination of GA and FCM for a novel method with adaptive cluster validity index (ACVI) based on the intra and inter measures of the fuzzy partition, where GA is used to solve the trade-off problem between these two factors for a better evaluation of the fuzzy partition. A common limitation of existing methods using GA with FCM is that the GA fitness functions are based on the cluster vali dity indices, which usually have a problem with scaling between the compact ness and separation factors. In addition, they use the maximum membership degree for defuzzification that may be improper, as described in Section (4.2.8). In this section, we combine FCM with GA and fuzzy SC algorithms for a novel clustering algorithm, fzGASCE, that automatically determines the number of clusters in the dataset. Th e FCM algorithm rapidly dete rmines the exact clustering prototypes of each solution candidate so that the GA algorithm, managing a set of such candidates, can select the optimal one. Th e fuzzy SC method helps the GA algorithm escape any local optima [157]. The fuzzy subtractive clustering (fz SC) method has advantages over the traditional one in that it is more efficient from the computational viewpoint. In addition, it is more appropriate for real world pr oblems, because it does not require additional information regarding the mountain peak, mount ain radii, or the th reshold between the current density and original density for a data object to be selected as a cluster PAGE 180 candidate. However, fzSC uses a random fuzz y partition, and may perform worse in cases where the partition is not the result of a convergent fuzzy partitioning process. For the artificial dataset, the FCM algorithm can converge very rapidly, but for some real datasets, because of overlap regions where a data point can belong to two or more clusters, the FCM algorithm may not converge to a local optimum. The fuzzy strong uniform criteria may not hold with the ne w fuzzy partition created from the fuzzy partition by the FCM algorithm. Hence, the cl uster candidates determined by our fzSC may not be optimal. In addition, a local optimal fuzzy partition may cause the fzBLE method to improperly pick the best partiti on from a given set of fuzzy partitions. Method design Chromosome We used chromosome to represent the whole clustering solution. Each chromosome contains a set of lo ci each standing for the inde x of the data point selected as cluster center. We set the length of chromosomes to n, which is assumed the maximum number of clusters in the dataset. Crossover operator The crossover operator is used to produ ce two new offspring from a given pair of parents. Both the roulette wheel a nd tournament selection methods are used interchangeably to select parents maintaining potentially useful solutions in the current generation. A two-point crossover operator with probability Pc, Pc = 0.5, is used. Mutation operator PAGE 181 The mutation operator is used to make changes in portions of the chromosomes of newly created members. Because each ch romosome encodes data points representing the cluster centers of a clustering soluti on, changing of the data points in the chromosome may improve the clustering quali ty. We, therefore, propose three different tasks for the mutation operator: (i) add a data point as a new cluster center, because this may help to locate a new cluster in a higher de nsity region, (ii) remove a data point to prevent the inclusion of a sparse cluster, and (iii) replace one data point with another so that the new cluster is lo cated in a higher de nsity region. These tasks are commonly used by existing methods. However, they are employed in a random or heuristic way, and cannot guarantee that the GA algorithm will escape local optima. To address this issue, we applied our fzSC method. The fuzzy partitions of the parents are used to estimate the density at every data point using (4.13), (4.14), (4.15), (4.16) and (4.17). At each time, t, the data point, tx with the highest density, tM, is selected, and the densities at the remaining data points are updated as in (4.19). Such points, where the densities change less than a predefined ratio, RM, are considered to be significantly dense and are used in the mutation operator of fzGASCE. This is instead of using randomly selected data points, as in existin g methods. We chose a value of 0.95 for RM. The selection of the value of RM does not affect the outcom e; however, a low value of RM may slow the convergen ce process of fzGASCE. Fitness function Instead of using a cluster validity i ndex for the fitness function, we use our cluster validity method, fzBLE, for cluste r evaluation. For each chromosome, FCM is PAGE 182 applied to generate the fuzzy partition, whic h is then used to generate a probabilistic model of the data distributions as in (4. 25), (4.26) and (4.27). The fitness function, given a fuzzy partition = {U,V}, is defined: fitness( ) = log[L( |X)] Â– log(c), (4.52) where L( |X), the likelihood of the clustering model and the data, is measured as: ) v | x ( P ) v ( P ) V U | x ( P ) X | V U ( L ) X | ( Ln 1 i c 1 k k i k n 1 i i (4.53) fzGASCE algorithm Input: data to cluster X={xi}, i=1,Â…,n. Output: an optimal fuzzy clustering solution, c: optimal number of clusters. V = {vi }, i =1,Â…,c: the cluster centers. U={uki}, i=1,Â…,n, k=1,Â…,c: the partition matrix. Steps 1) Randomly generate a set of chromo somes describing clustering solution candidates. 2) Compute the fitness value for every chromosome. 3) If the stop criteria are me t, then go to Step 7. 4) Apply the crossover operator with the probability Pc = 0.5, and the roulette wheel and tournament pa rent selection methods. PAGE 183 5) Apply the mutation operator with the probability Pm = 0.01. The significantly dense data points are used in the replacement, and fresh offspring are created using these points with a probability of Pc Pm. 6) Go to Step 2. 7) Select the Â‘bestÂ’ chromoso me from the population as the clustering solution. 8) Apply [157] for the defuzzification of the fuzzy partition of the solution. Method testing To evaluate the performance of fzGAS CE, we used four artificial datasets ASET2, ASET3 and ASET4 each have well-sepa rated clusters of similar sizes. The number of clusters and data dimensions of these dataset are (5,2 ), (5,3) and (11,5), respectively. ASET5 is more complex, contai ning three clusters that differ in size and density (Figure 3-16). For real datasets, we used the Iris and Wine datasets. Performance measures We used three measures, COR, EVAR and MISC, to evaluate algorithm performance. COR is the corre ctness ratio, defined as, N 1 i) c Âˆ c ( I N 1 COR, (4.54) where N is the number of trials, c and c Âˆ are the number of clus ters and the predicted number of clusters respectiv ely, and I(.) is defined as: 0 x 0 0 x 1 ) x ( I. PAGE 184 EVAR is a measure of the accuracy of the predicted number of clusters defined as in (4.55). 2) c Âˆ c ( N 1 EVAR (4.55) MISC is a measure of the ove rall performance, determined by the number of data objects that were misclassified, defined as in (3 .21). To be fair, MISC is calculated only when an algorithm correctly id entifies the number of clusters. Then, the assigned cluster label of each object is compared with its actual cluster label. If any of them do not match, a misclassification has occurred. We compared the performance of fzGA SCE with those of the eight genetic algorithm methods that also use FCM, sp ecifically, PBMF, MPC, HPK, AGFCM, XB, FS, PC and ACVI [162-166]. An earlier ve rsion of fzGASCE, fz GAE, which did not include the fuzzy SC and defuzzification met hods, was also used. For each dataset, the number of clusters, c, was se t to the known number of cluste rs. All algorithms were run using a population size of 24 and a maximu m number of 100 generations. The fuzzy partition of each chromosome was generated using the FCM algorithm with 10 iterations, and the fuzzifier f actor, m, was set to 1.25. We repeated the experiment 100 times and averaged the performance of each algorithm using values of COR, EVAR and MISC. ASET2 contains five clusters in a 2-dimensional data sp ace. The clusters are wellseparated and of the same size. Performance of all algorithms is shown in Table 4.25. All algorithms had very low MISC measur es, indicating that they grouped the data PAGE 185 points into the correct clusters. However, fzGASCE outperformed all algorithms by all three measures and fzGAE performed better than the other methods with the exception of fzGASCE. This comparison illustrates the advantage of us ing our fzBLE method [147] in the GA fitness function. The use of fuzzy SC [146] in fzGASCE also improved its performance, particular ly in escaping local optima. Table 4.25 : Performance of GA algorithms on ASET2 Algorithm COREVARMISC fzGASCE 1.0000.0000.000 fzGAE 0.6400.5000.000 PBMF 0.5100.5900.000 MPC 0.2900.9700.000 HPK 0.1005.0100.021 AGFCM 0.6002.8000.000 XB 0.4901.4500.000 FS 0.1201.1000.070 PC 0.2301.0400.000 ACVI 0.2002.4900.011 The ASET3 contains five well-separated cl usters in a 3-dimensional data space while the ASET4 contains 11 clusters in a 5dimensional data space. Performance of all algorithms on ASET3 and ASET4 are shown in Tables 4.26 and 4.27 respectively. On both datasets, fzGASCE outperformed all other algorithms, while fzGAE was the second best. PAGE 186 Table 4.26 : Performance of GA algorithms on ASET3 Algorithm COREVARMISC fzGASCE 1.0000.0000.000 fzGAE 0.7100.3800.000 PBMF 0.6000.4500.000 MPC 0.6100.8600.000 HPK 0.1205.2400.000 AGFCM 0.6501.4900.000 XB 0.6400.4300.000 FS 0.5200.8400.011 PC 0.6200.8900.000 ACVI 0.1002.1000.000 Table 4.27 : Performance of GA algorithm on ASET4 Algorithm COREVARMISC fzGASCE 1.0000.0000.000 fzGAE 0.4500.7500.000 PBMF 0.3401.0000.000 MPC 0.4200.8200.002 HPK 0.0101.9100.037 AGFCM 0.3402.3800.000 XB 0.4100.9000.000 FS 0.4000.8800.000 PC 0.4500.7000.000 ACVI 0.1704.6500.000 PAGE 187 The ASET5 dataset is non-uniform with th ree clusters in a 2-dimensional data space. Table 4.28 shows the algorithm perf ormance. HPK and AGFCM both failed to determine the number of clusters. fz GASCE not only outperformed the other algorithms, with COR = 1.0, but also was the only algorithm that gr ouped all of the data points into the correct clus ters. Although fzGAE performed better than the remaining algorithms, it failed to correctly group data points into clusters, similar to other methods that used the maximum membersh ip degree for defuzzification. Table 4.28 : Performance of GA algorithms on ASET5 Algorithm COREVARMISC fzGASCE 1.0000.0000.000 fzGAE 0.9000.1000.107 PBMF 0.7000.3000.107 MPC 0.0500.9600.107 HPK 0.0005.770AGFCM 0.0008.470XB 0.0400.9600.107 FS 0.0203.4800.107 PC 0.0500.9600.107 ACVI 0.0800.9200.107 The IRIS dataset contains three clusters corresponding to the th ree classes of Iris flowers. The performance of the algorithms on this dataset is shown in Table 4.29. HPK and AGFCM again completely failed at de tecting the number of clusters. fzGASCE outperformed other algorithms in detecti ng the number of clusters as well as in grouping data points into their own clusters. PAGE 188 Table 4.29 : Performance of GA algorithms on IRIS Algorithm COREVARMISC fzGASCE 1.0000.0000.033 fzGAE 0.8800.1200.040 PBMF 0.8600.1400.040 MPC 0.0400.9700.160 HPK 0.0005.720AGFCM 0.0008.120XB 0.0501.0100.040 FS 0.3900.7800.154 PC 0.0800.9200.115 ACVI 0.1500.8500.040 Table 4.30 : Performanc of GA algorithms on WINE Algorithm COREVARMISC fzGASCE 1.0000.0000.213 fzGAE 0.8600.1400.303 PBMF 0.0002.050MPC 0.0002.810HPK 0.0006.760AGFCM 0.0009.210XB 0.2701.0100.303 FS 0.0005.720PC 0.1100.9200.303 ACVI 0.0900.9100.303 PAGE 189 The Wine dataset contains information on 13 attributes of three classes of wines. Results on this dataset are shown in Tabl e 4.30. Only fzGASCE and fzGAE identified the correct number of cluste rs, with COR values of 1 and 0.86, respectively. Among the other algorithms, only XB and ACVI detected the correct num ber of clusters, but only with low COR values. Overall, fzGASCE out performed other methods on both artificial and real datasets. It performed particularly well on datasets with clusters that differed in size, not only in predicting the correct numbe r of clusters, but also in grouping the data points into the correct cluste rs. fzGASCE is, therefore, appropriate for real-world problems, where the data densiti es are not uniformly distributed. 4.3 Motif finding problem 4.3.1 HIGEDA algorithm We present a new algorithm, HIGEDA [145 ], applicable to either DNA or protein sequences, which uses the TCM model and combines the hierarchical gene-set genetics algorithm, HGA (Hong, 2008) w ith EM and Dynamic Programming (DP) algorithms to find motifs with gaps. The HGA algorithm helps HIGEDA manage motif seeds and escape from local optima; the EM algorithm uses the be st alignment of a motif model on the dataset, where the alignm ents are generated by DP to estimate the motif model parameters so that the model fits the best conserved forms of the motifs of interest. Method design PAGE 190 Given A = {a1, a2,Â…, al} {_}, a set of symbols used for sequence encoding and the gap symbol, l=20 for protein sequences and l=4 for DNA sequences, let S = {S1, S2,Â…, Sn} be a set of biological seque nces based on A. Assume L = {L1, L2,Â…, Ln}, where Li is the length of sequence i, i=1,Â…,n. Assume that the length of shared motifs is a known value, W. With mi = Li W+1, the number of possible positions of a motif in sequence i, denote Z = {zij}, i=1,Â…,n, j=1,Â…,mi, as the probability of motif occurrence at position j in sequence i. Let be the motif model. The motif finding problem is to determine and Z such that: max ) | Z S ( P (4.56) Motif model Of the motif model two components M and B model the motif and nonmotif (the background) positions in sequences A motif is modeled by a sequence of discrete random variables whose values give the probabilities of each of the different letters occurring in each of the different positions of a motif occurrence. The background positions in the sequences ar e modeled by a singl e discrete random variable. The motif model is as follows: PAGE 191 = { B, M} = M w al M 2 al M 1 al B 0 al M w 2 a M 2 2 a M 1 2 a B 0 2 a M w 1 a M 2 1 a M 1 1 a B 0 1 a M w _, M 2 _, M 1 _, B 0 _, (4.57) where, a,k, 1 k W, is the probability that symbol a occurs either at a background position or at position k of a motif occurrence. W , 1 k 1A a k a (4.58) The matrix in (4.57) is a position wei ght matrix (PWM), but its use differs from conventional PWMs in that it contains 21 rows for protein motifs and 5 rows for DNA motifs. The first row stands for the ga p symbol, which may occur in a motif, but not in the first or last positions. The remaining rows stand for residue symbols, 20 amino acid letters or 4 nucleotide letters. W is chosen large enough to accommodate all possible motif consensus sequences. Figure 3-7 shows a motif model representing Â‘ACÂ’, Â‘AGCÂ’ and Â‘ATCÂ’. Sequence and motif model alignment with gaps Given an input subsequence s, the best alignment with gaps of s and is created using DP algorithm. Because s comes from a real dataset, gaps are allowed only in The first symbol of s is aligned with the first column in Consecutive symbols from s are aligned with either gap or symbol columns to achieve the best alignment score. A conventional dynamic alignment score is the sum of the pair wise alignment scores. Instead, here, it is the multiplication of all pa ir wise alignment scores, and hence it is the PAGE 192 probability that the best consensus from matches s. To control the occurrence of gaps, we define PPM, POG and PEG as the reward for a perfect match and the penalties for opening and extending a gap, respectively. We choose, by empirical experiments, PPM=1, POG=0.00875, and PEG=0.325. Let Ujk be the best alignment score up to the jth symbol sj in s and column k of Ujk is calculated as in (4.59). )} 1 ( PG U PPM U max{ j k 0 j ) 1 ( PG U j k PPM U 0 j k UM 1 k s 1 k j M 1 k s 1 k 1 j M 1 k s 1 k j M 1 k s 1 k 1 j M 1 s jk1 j j 1 j j j, (4.59) where PG is either PEG or POG depending on the gap status at column k in U. Let s=Â‘ACGÂ’ and be as shown in Figure 4-25. Figure 4-26 shows the dynamic alignment matrix {Uij} of s w.r.t The best alignment score is 0.0039 which is found in the second row and the last column of the matrix Hence, the best consensus for s=Â’ACGÂ’ by is Â‘A_CÂ’. Â‘ACÂ’ is considered in-motif model and the last symbol Â‘GÂ’ is out-motif model for subsequence s. The motif model w ill be adjusted to fit Â‘A_CÂ’ instead of Â‘ACGÂ’. This feature produces the occurrence probabilities for gap symbols in our motif model even if such a symbol does not appear in the sequences set. Figure 4-25 : A motif model 0 1 2 3 0.00 0.00 0.25 0.00 A 0.02 0.50 0.10 0.04 G 0.02 0.30 0.35 0.02 C 0.03 0.10 0.01 0.90 T 0.03 0.10 0.29 0.04 PAGE 193 Figure 4-26 : Dynamic ali gnment of s=Â’ACGÂ’ w.r.t from Figure 4-25 Motif occurrence model Let Sij be the subsequence of le ngth W at position j in sequence i. Denote I(.) as the indicator function and PM(Sij) and PB(Sij) as the conditional pr obabilities that Sij is generated using the motif model and the background mode l respectively. W 1 k L 1 a ) a S ( I M ak M ij ij M1 k j i) | S ( P ) S ( P, (4.60) W 1 k L 1 a ) a S ( I B 0 a B ij B ij B1 k j i) | S ( P ) S ( P. (4.61) Let be the prior probability of motif occurrence at every possible position in the sequence set. Similarly to the TCM m odel in MEME [70], the motif occurrence probability at position j in sequence i is ) S ( P ) 1 ( ) S ( P ) S ( P zij B ij M ij M ij (4.62) The motif log-odds score of Sij is defined by: ) S ( P ) S ( P log ) S ( losij B ij M ij (4.63) Sij is considered a motif hit if it satisfies the inequality: ) 1 ( log ) S ( losij. (4.64) 1 2 3 A 0.50 0.0043 C 0.0050 0.0039 G 0.0001 PAGE 194 Gapped motif model When gaps are allowed in the alignment of Sij and PM(Sij) and PB(Sij) are calculated using the recovere d version with gaps of Sij, similar to that in Xie [178]. Let sm and sc be the in-motif and out-motif parts, respectively. Let s* be the best consensus for Sij by ) Z | s ( P ) Z | s ( P ) s ( P ) s ( P ) S ( PM B c M c B ij M (4.65) ) Z | s ( P ) Z | s ( P ) s ( P ) s ( P ) S ( PB m B c m B c B ij B (4.66) Multiple-motif finding method To find multiple motifs in given dataset, we use a mask variable M, M={Mij}i=1,Â…,n,j=1,Â…,m, where Mij represents the probability of the chance that a motif occurs again at position j in sequence i. M is initially set to {1}. The modified version of (4.62) with respect to M is } M { min z ) M | 1 z ( P z1 k j i W , 1 k ij ij M ij (4.67) Once a motif is found, its positions are updated to M. ) M 1 ( M M) k j ( i ) k j ( i ) k j ( i (4.68) where Mi(j+k) = P(zij=1)*(W-k)/W, k=0,Â…,W-1. The update mechanism in (4.68) allows for multiple overlapping motifs. Hierarchical gene-set genetics algorithm (HGA) The GA is a global optimization procedure that performs adaptive searches to find solutions to large scale optimization problems with multiple local optima. PAGE 195 Conventional GAs use crossover and mutation operators to escape local optima. These operators depend strongly on how the proba bilities of crossover and mutation are chosen. Recent improvements in GA have focused on adaptively adjusting operator probabilities so that the ge netics processes quickly esca pe local optima. However, setting up adaptive GAs is difficult and most approaches are based on heuristics. Here we use HGA, a GA improved by Hong [75]. HGA treats a chromosome as a set of gene-sets, not a set of genes as in conventional GAs, as a mechanism to escape local optima. Starting with gene-sets of the largest size equal to half the chromosome length and ending with gene sets of size 1, HGA performs crossover and mutation operations based on gene-set bound aries. When the model is ( k)-convergent, HGA expects to find a global optimum and attempts to escape local optima, if any exist, by performing genetics operations with the largest size gene-sets. HGA is most appropriate to our genetics model, because each gene-set represents a set of adjacent columns in the motif model that represents patterns of ad jacent residues in biological sequences. Genetics operations based on gene-set boundari es allow residues to come together as well as to change simultaneously. EM algorithm of HIGEDA HIGEDA solves the motif finding problem using an EM algorithm to maximize the likelihood function (4.69) over the entire dataset. max ) | Z S ( P log ) | Z S ( L (4.69) Estimation step: Z is estimated using (4.63) (or (4.67)). PAGE 196 ) S ( P ) 1 ( ) S ( P ) S ( P zij B t ij M t ij M t ) 1 t ( ij (4.70) Maximization step: and are computed with respect to (4.70), max ) log( z ) 1 log( ) z 1 ( ) S ( P log z ) S ( P log ) z 1 ( ) | Z S ( Lij ij n 1 i m 1 j ij M ij ij B ij (4.71) Such that, n 1 i m 1 j ijz nm 1 0 L. (4.72) With respect to constraints (4.58), the obj ective function is relaxed using Lagrange multipliers k, k=0,Â…,W, W 0 k l 1 a ak k1 ) | Z S ( L L. ) a S ( I ) a S ( I ) S ( P 1 ) S ( P ) S ( P 1 ) S ( P logM ak 1 k j i 1 k j i W k k 1 k L 1 a ) a S ( I M ak ij M M ak ij M ij M M ak ij M1 k j i Hence, 0 ) a S ( I z Lk n 1 i m 1 j M ak 1 k j i ij M ak So, k n 1 i m 1 j 1 k j i ij M ak) a S ( I z (4.73) PAGE 197 Using (4.58) and (4.73) 1 z ) a S ( I z ) a S ( I zk n 1 i m 1 j ij k n 1 i m 1 j l 1 a 1 k j i ij l 1 a k l 1 a n 1 i m 1 j 1 k j i ij M ak (4.74) Using (4.73) and (4.74), n 1 i m 1 j ij n 1 i m 1 j 1 k j i ij M akz ) a S ( I z. (4.75) Similarly, n 1 i m 1 j ij n 1 i m 1 j W 1 k 1 k j i ij B 0 a) z 1 ( ) a S ( I ) z 1 ( W 1. (4.76) Refining model parameters The maximization step results in model parameters and Because the values of Z are estimated during the EM pro cesses, a straightforward update to and using (4.70), (4.75) and (4.76) may cause an os cillation in convergence. MEME estimates by trying values from n m/ 1 to 1/(W+1). This requires significant computational time and squanders the maximization step benefit. We apply the gradie nt descent learning law to update model parameter, of the model, (t+1) = (t) + t (4.77) PAGE 198 where = e(t+1) (t) and (t), e(t+1) and (t+1) are the current, estimated and new values of respectively. The learning rate t may be slightly re duced by processing time. A popular form of t is as in (4.78), where T is the processing duration. T 1 1 t 1max t (4.78) If the size of the dataset is small, some elements of may be zero. These are not appropriate to the Bayesian process and re main zero. To solve this problem, MEME uses Dirichlet mixture model [34]. While th is has a strong mathem atical basis, the drawback lies in how to sele ct the right number of mixture components. In addition, this number has no meaning in sequence analys is. Instead, we use pseudo-count methods. For DNA, we borrow an added pseudo-c ounts method from He nikoff and Henikoff [74]. The added portion used is 0.01. For pr oteins, we propose a method using the motif model and the substitution proba bility matrix from BLOSUM62, and a heuristic that the pseudo-counts should be position specifi c and depend on strong signals [145]. The pseudo-count value for a given symbol a, in column k of is calculated using Equation (4.79). L 1 b b / a M akP psc1 k b, (4.79) where Pa/b is the BLOSUM substitution probability for amino acid a from the observation of amino acid b. The motif model is then refined using (4.80) in which, and are predefined, and + = 1. a,k = a,k + .pscak. (4.80) PAGE 199 HIGEDA Algorithm HIGEDA uses HGA to manage its popula tion and to escap e local optima. During the evolution process, members of th e current generation are processed with the EM algorithm a small number of times, a nd ranked based on their goodness, measured using a fitness function. The best members are used to create new ones using crossover and mutation operators. Newly created members replace the worst ones in the current generation using a tournament selection to form the new generation. This process repeats until the number of generations is equa l to a predefined number, or the current generation goodness ratio is convergent. The best members from the last generation are taken as possible motif candidates. Each member contains a variable and a chromosome encoding described in (4.57); each gene in the chromosome represents a column in that has 21 elements for a protein motif, or five elements for a DNA motif. There are W+1 such genes in each chromosome. We note that while Bi [71] proposed encoding PWM using chromosomes, he did not discuss it in detail Because the length W of a motif is small relative to the sequence lengths, our genetics model consum es less memory than those of Wei and Jensen [83] or Bi [71]. The gene-set maximum size is l0 = W/2, the initial size is 1 and the final size is l, such that, l = 2q, where 2q < l0 < 2q+1. (4.81) We use =0.05 and k=9 for ( k)-convergent criterion. PAGE 200 The crossover operator is used to produce two new children from a given pair of parents. We use a two-point gene-set crossover operator with probability pc = 0.95. The mutation operator is used to make changes in some portions of the chromosomes of newly created members. Because each chromosome encodes a resulting from an alignment on the whole sequence set, a one position shift to the left or right may improve the quality of the alignment. We propose two different tasks for the mutation operator: (i) change in a gene-set: Two rows are select ed randomly. Cells corresponding to the selected rows in the given gene-set are exchanged, and (i i) change to whole chromosome: Gene-sets in the chromosome are shifted left or right by one position. The blank gene-sets are f illed with average probability values. The fitness function of HIGEDA is a combination of the objective function (4.69) and the posterior scoring function (4.80). The obj ective function assesses the best fit model, while the posterior scoring function determ ines how well the model is distinguished from the background in the enti re dataset. We use the motif posterior scoring approach of Nowakowski-Tiuryn [79] and estimate the prior probability of motif occurrence. It follows that the posterior score of subsequence Sij at position j in sequence i has the form: ) S ( P ) 1 ( ) S ( P log ) | (S Sij B ij M ij p (4.82) (4.69) and (4.80) are normalized and co mbined. Both are important to the model evaluation, but not equally in all contexts. The fitness function is first applied to find best fit models, and then used with the posterior scoring function to select the most PAGE 201 significant model. To this ru le, we add a s caling function (t), which is shown in Figure 4-27. The fitness function of HIGEDA is then defined as: n 1 i m 1 j ij p ij ij M) | S ( S ) 1 P(z ) 1 P(z ) ( ) | Z S ( L m ) fit(. (4.83) Figure 4-27 : (t) = M T / (T + t2). M is the maximum value of To find gapped motifs, in later phases HIGEDA tries to improve the most significant model by intr oducing gaps. This is similar to a local alignment problem, but instead of aligning every sequence to the se quence set, HIGEDA, using DP, aligns the sequence to the motif model that describes the local alignment of the sequence set. While the run time of gapless alignments is O(W), that of DP alignment is O(W2). By restricting the use of DP the speed of HIGEDA is significantly improved. 0 t M PAGE 202 HIGEDA Algorithm [145] Input: Sequence set S, motif length W. Output: Motif model Steps 1) Set M={1} once, set gene-set parame ters using (4.71), and randomly generate the first generation, set gNo=0. 2) For each member in the current generation, apply EM once. Measure member goodness using the f itness function (4.73). 3) Select parents from the best members us ing ranking selection; apply genetics operator to create new members. 4) Create new generation using tour nament selection, gNo++. 5) If the current generation is ( k)-convergent, then 6) save ( l ), set l=l0, 7) Apply genetics operators to refresh current generation. 8) Restore ( l ), go to Step 2. 9) If stop criteria are met, then 10) set l = l/2, set gNo = 0. 11) If l > 0, then go to Step 2. 12) Take the best member as the motif candidate; run EM with the candidate until convergent to obtain the motif model. 13) Output the motif model a nd update M using (4.68). 14) Stop, or go to Step 1 to find more motifs. Method testing We compared HIGEDA with five ope n source algorithms, GAME-2006, MEME v3.5, GLAM2-2008, BioProspector-2004 and PRATT v2.1, one web application, GEMFA (http://gemfa.cmh.edu), and for the EM motif algorithm, we used results from PAGE 203 Bi [71]. All algorithms were run 20 times on each dataset with the runtimes recorded; for GEMFA, this was obtained from the website. Default run parameters were used for all programs. For HIGEDA, the number of generations at each gene-set level was 50 and the population size for all gene-set levels was 60. For algorithms in Tables 4.31, 4.32 and 4.33, motif lengths were known motif lengths. For Tables 4.34 and 4.35, we first ran HIGEDA to find statistically significant motifs, and then ran other algorithms using the lengths of best motifs found by HIGEDA. Performance measures As in Bi [71], two quantities are used to evaluate the algorithms: LPC, the letter level performance coefficient and SPC, the mo tif site level performance coefficient. If we denote delta(.) as the indicator function, and Oi and Ai, respectively, as the set of known and predicted motif positions in sequence i, then, i i n 1 i i iO A / O A n 1 ) S ( LPC (4.84) n 1 i i i) Empty O A ( delta n 1 = SPC(S). (4.85) Simulated DNA datasets We generated simulated DNA datasets as in Bi [71], with three different background base compositions: (a) uniform, where A, T, C, G occur with equal frequency, (b) AT-rich (AT = 60%), and (c ) CG-rich (CG = 60%). The motif string, GTCACGCCGATATTG, was merged once (or tw ice, if needed) into each sequence, after a defined level of the string change: (i ) 9% change representing limited divergence PAGE 204 (i.e. 91% of symbols are identical to the orig inal string), (ii) 21% change, or (iii) 30% change (essentially background or random sequence variation). Table 4.31 : Average performance (LPC/SPC) on simulated DNA datasets Motif identity Algorithm UniformAT-RichCG-Rich Avg. Runtime 91% HIGEDA 1.00/1.001.00/1.001.00/1.00 32s GEMFA 0.98/1.000.98/1.001.00/1.00 22s GAME 0.86/0.880.88/0.900.91/0.94 2m26s MEME 1.00/1.001.00/1.001.00/1.00 2s GLAM2 1.00/1.001.00/1.001.00/1.00 52s BioPro. 0.99/1.000.99/1.000.94/0.95 2s PRATT 0.83/0.950.88/1.000.46/0.80 0.2s EM 0.99/1.000.99/1.001.00/1.00 79% HIGEDA 0.87/0.961.00/1.001.00/1.00 32s GEMFA 0.87/0.880.87/0.900.85/0.89 18s GAME 0.43/0.550.55/0.610.64/0.71 2m11s MEME 0.95/0.951.00/1.001.00/1.00 2s GLAM2 1.00/1.000.07/0.251.00/1.00 57s BioPro. 0.86/0.920.89/0.950.94/0.98 2s PRATT 0.46/0.700.03/0.150.09/0.15 0.2s EM 0.83/0.870.89/0.910.87/0.89 70% HIGEDA 0.79/0.810.84/0.890.65/0.76 32s GEMFA 0.56/0.650.50/0.600.52/0.56 20s GAME 0.14/0.280.19/0.340.20/0.34 1m55s MEME 0.44/0.500.75/0.750.27/0.30 2s GLAM2 0.95/0.950.01/0.050.00/0.05 1m20s BioPro. 0.26/0.330.39/0.440.25/0.33 2s PRATT 0.31/0.400.05/0.100.19/0.30 0.2s EM 0.38/0.480.47/0.580.48/0.54 PAGE 205 In Table 4.31, we compared results with those obtained with seven other algorithms. When motif sequences are 91% id entical, all algorithms performed equally well regardless of base composition. However, when the identity dropped to 79% or 70%, HIGEDA in general performed as we ll or significantly better than other algorithms on all base compositions. Hence, HIGEDA is not significantly affected by noise. Finding motifs in biological DNA sequences Table 4.32 : Average performance (LPC/SPC/run time (seconds)) on eight DNA datasets (# of sequences/length of motif/# of motif occurrences). Algo. E2F(27/11/25) ERE(25/13/25) crp(24/22/18) arcA(13/15/13) HIGEDA 0.57/ 0.96 /33s 0.58/0.87/37s 0.75/0.90 /22s 0.38/0.53/56s GEMFA 0.64/0.85/39s 0.74 / 0.92 /43s 0.57/0.88/12s 0.32/0.42/26s GAME .24/.90/2m45s .24/.75/2m11s .45/.80/2m33s .05/.10/1m26s MEME 0.71/0.85/16s 0.68/0.68/4s 0.55/0.68/1s 0.47/0.54 /4s GLAM2 .84 /.93/1m28s .74 / .92 /1m10s .54/.64/1m13s .47/.54 /1m17s PRATT 0.17/0.33/.2s 0.24/0.44/.2s 0.22/0.56/.1s 0.17/0.23/.4s BioPros. 0. 50/0.63/3s 0.65/0.71/3s 0.37/0.45/1s 0.01/0.01/6s Algo. argR(17/18/17) purR(20/26/20) tyrR(17/22/17) ihf(17/48/24) HIGEDA .75/.94 /1m22s .90 /.93/1m38s .34/.41/1m19s .14 /.31/3m44s GEMFA 0.35/0.38/30s 0.81/0.90/39s 0.32/0.49/27s 0.11/0.26/1m GAME .36/.55/2m31 .33/.53/3m23s .11/.23/2m39s .07/.18/10m7s MEME 0.75/0.94 /5s 0.58/0.85/4s 0.43/0.47 /3s 0.00/0.00/5s GLAM2 .38/.47/1m32s .17/.50/3m11s .35/.35/1m55s .12/ .33 /9m55s PRATT 0.16/0.29/.2s 0.46/ 0.95 /.2s 0.06/0.12/.3s 0.06/0.25/.5s BioPros. 0.62/0.69/8s 0.18/0.20/7s 0.33/0.42/6s 0.05/0.13/17s PAGE 206 We used eight DNA transcription factor binding site datasets, two eukaryotic datasets, ERE and E2F, and six bacterial datasets: CRP, ArcA, ArgR, PurR, TyrR and IHF. Table 4.32 shows that HIGEDA performe d as well or better than other algorithms on most datasets. One exception is ERE, po ssibly because it is an example of OOPS which is the motif finding model used by GEMFA. Also, GLAM2 performed better than HIGEDA on IHF, which contains a motif of length 48. Finding motifs in biological protein sequences We selected twelve protein families from PFAM and Prosite representing motifs with different levels of sequence specificity from completely defi ned to more subtle, degenerate and gapped. Figure 4-28 : Strep-Htriad motif by HIGEDA Results obtained with HIGEDA are comp ared to those obtained with other algorithms (Table 4.33). The gap parameters of all algorithms, if supported, are set to known gap structures. All algorithms identify the unambiguous DUF356 IHPPAH motif. For the Strep-H-triad motif, the HIDEGA, MEME and GLAM2 consensus sequences are more specific than the known motif, because of the motif decoding procedure. Full degeneracy is shown in Figure 4-28. PAGE 207 Table 4.33 : Detection of protein mo tifs (1-8, PFAM; 9-12, Prosite) Family (#seq, max L) Algorithm (run time) Known motif Predicted motif 1. DUF356 (20,158) HIGEDA (12s) MEME (2s) GLAM2 (1m58s) PRATT (0.1s) I H P P A H I H P P A H I H P P A H I H P P A H I H P P x H 2. Strep-H-triad (21,486) HIGEDA (40s) MEME (5s) GLAM2 (1m51s) PRATT (0.1s) H x x H x H H G D H Y H H G D H Y H H G D H Y H H x x H x H 3. Excalibur (26,296) HIGEDA (1m) MEME (4s) GLAM2 (2m56s) PRATT (0.5s) D x D x DG xx CE D R D [RNGK] DG [IV] [AG]CE [WY] Q [GA] [NW] Y Y L K S D D R D K D G V A C E D x D xxx C 4. Nup-retrotrp (14,1475) HIGEDA (3m18s) MEME (10s) GLAM2 (6m55s) PRATT (1s) G R K I x x x x x R R K x S G R K I K T A V R R K K W[DE]C[DE][TV]C[LC][VL]QNK[AP][ED] S N G K N M F S S S G T S F S S S [GP] T x x S x(1,2) R K 5. Flagellin-C (42, 1074) HIGEDA (1m29s) MEME (6s) GLAM2 (3m35s) PRATT (0.07s) N R F x S x I x x L RA [NDQG] L G A [FV] Q N R R [AS] [DNQ] L G A [VF] Q N R R A D L G A F Q N R A-x-Q 6. Xin Repeat (25,3785) HIGEDA (5m45s) MEME (1m1s) GLAM2 (6m36s) PRATT (6s) GDV[KQR][TSG]x[RKT]WLFETxPLD GDV[RK] [ST] [ACT] [RK] WLFETQPLD KGDV [RK] T [CA][RK]W[LM]FETQPL H K G D V R T C R W L F E T Q P G D V x T x x W x F E T x P 7. Planc.Cyt. C (12,1131) HIGEDA (1m) MEME (9s) GLAM2 (3m42s) PRATT (0.02s) C{CPWHF}{CPWR}CH{CFYW} [NHKSY] C [AQELMFTV][AGS] CH F S P D G K R F S P D G P D x x x L 8. zf-C2H2 (115,99) HIGEDA (2m12s) MEME (45s) GLAM2 (19m17s) PRATT (0.1s) Cx(1-5)Cx3#x5#x2Hx(3-6)[HC] C [_Y][_RQEK][_C][_EKP][_EGIK] C [_G]K[ARST] F S[RQK][KS]S[NHS] L [NKT][RKST] H [QILKM]R[IST V] H EIC[NG]KGFQRDQNLQLHRRGHNLPW YKC__P_CGK_FS_KSSLT_H__RI__HT Hx(3-5)H PAGE 208 Table 4.33 (con t.) Family (#seq, max L) Algorithm (run time) Known motif Predicted motif 9. EGF_2 (16,3235) HIGEDA (4m19) MEME (15s) GLAM2 (8m) PRATT (1s) CxCx2[GP][FYW]x(4-8)C C[_EK][_C][_EILSV]C[NDE][NDQEPS]G [FWY][AQESTY]G[DS]DCS[GI] GECxCNxGYxGSDCSI CNx(0-9)GECxCNEGWSGDDC Gx(0-1)CxCx5Gx2C 10. LIG-Cyclin-1 (23,1684) HIGEDA (1m8s) MEME (18s) GLAM2 (2m35s) PRATT (0.5s) [RK]xLx(0-1)[FYLIVMP] [RK]R[RL][RIL][DEIFY] PAPAP Tx(0-1)RKP Kx(0-1)R 11. LIG_PCNA (13, 1616) HIGEDA (1m32s) MEME (9s) GLAM2 (4m59s) PRATT (0.3s) x(0-3)x{FHWY}[ILM]{P}{FHILVWYP} [DHFM][FMY] [RGIKP][_QK][_RDKS][DPST][IL] [KMTY] [SV]FFG GQKTIMSFFS Qx(0-1)SIDSFF[K_][R_] Lx2Sx(2-4)K 12. Mod_Tyr_Itam (10,317) HIGEDA (51s) MEME (2s) GLAM2 (3m43s) PRATT (0.1s) [DE]x(2)Yx(2)[LI]xx(6-12)Yx(2)[LI] [RDQE][_S][_CET][_IL][_GM][_QTY][_QELK][_D GK][_EI][_RS][_RN][_RGP]LQ[DGV][GHT]YxM[CI Y]Q[NGT]L[ILS] GKEDDGLYEGLNIDDCATYEDIHM E_H_____SLAQKSM_DH _SRQ_ Lx2Lx(0-1)L HIGEDA and GLAM2 identified the exact motif for the moderately degenerate Excalibur (Section 2.5.6). Only HIGEDA identified Nup-retrotrp motif erring in only one residue and identifying the key arginine (R) residues. No algorithm identified the Flagellin-C motif, possibly because of the w eak motif signal. While all performed well in finding the Xin repeat motif, onl y HIGEDA identified the Planctomycete Cytochrome C motif, erring in only one resi due. For the gapped motif ZF-C2H2 family, only HIGEDA identified the key C, H, F and L residues and the spacing between them. HIGEDA also provided better results on EGF_2 and LIG_ CYCLIN_1. No algorithm PAGE 209 identified the LIG_PCNA and MOD_TYR_ITAM motifs. Together, these results show that HIGEDA can effectively find more subtle motifs, including those with gaps. Because HIGADA uses GA to find motif seeds, its run times are longer than those of MEME and PRATT, but shorter than those of GLAM2, as indicated in Table 4.33. 4.3.2 New motif discovery using HIGEDA To test HIGEDA in prediction of nove l motifs, we selected the following proteins from PFAM: those containing onl y a ZZ domain, only a SWIRM domain, a ZZ plus a Myb domain, and a ZZ plus bot h a Myb and a SWIRM domain. While ZZ domains have a consensus sequence in PFA M, SWIRM and MYB proteins are defined only by experimental demonstration of f unctional properties and protein sequence alignments. We ran HIGEDA on these groups va rying the output motif length from 4 to 33. We defined statistically significant mo tifs as those found in more than 85% of sequences, and having a p-value less than 0.0001 (computed using Touzet and Varre [82]. We also ran MEME, GLAM2 and PRATT. In Table 4.34, we first show that HIGEDA, MEME and PRATT successfully identified patterns of the known ZZ motif, CX (1-5)C. No motif was discovered in the SWIRM-containing proteins. But HIGEDA, MEME and GLAM2 each discovered a motif, WxAxEELLLL, common to Myb proteins which we propose as a novel domain. Uniquely discovered by HIGEDA, we propose a second novel domain, GNW[AQ]DIADH[IV]G[ NGST], of unknown function, which is specific to the ZZSWIRM-Myb protein set. PAGE 210 Table 4.34 : Motifs of ZZ, Myb and SW IRM domains by the four algorithms Algorithm Domain / Found motif p-value % Hit ZZ (67,5125) HIGEDA D[FY]DL C [AQES]x C -[EYV] 5.76e-7 91.67 C [GHPY]D[FY]DL C [AQES]x C 2.25e-6 93.06 MEME CLICPDYDLC 73.00 GLAM2 HSRDHP[ML][IL][QRK] 91.00 PRATT Cx2C 92.00 SWIRM (49,624): No motif found ZZ-Myb (35,2697) HIGEDA [ND]W[ST]A[DE]EELLLL 3.81e-9 97.14 MEME WTADEELLLLD 100.0 GLAM2 WTAEEELLLLE 97.00 PRATT Ex2Lx(4-5)E 96.00 ZZ-Myb-SWIRM (43,1049) HIGEDA GNW[AQ]DIADH[IV]G[NGST] 3.6e-10 100.0 WGADEELLLLEG 2.4e-10 100.0 MEME WGADEELLLLEG 100.0 GLAM2 KQLCNTLRILPK 85.00 PRATT Wx(1-2)D[ADEQ]Ex(2-3)L[ILV] 100.0 Lastly, we selected the protein families, Presenilin-1 and Signal peptide peptidase, that share the conserved seque nce GXGD and are grouped in the same clan in PFAM, as shown in Table 4.35. Only HIGEDA successfully discovered motifs containing GLGD in these families. In addition, two patterns, YDIFWVF which is also identified by MEME and GLAM2, and FGT[NDP]VMVTVA[KT], which is also identified by MEME, appear PAGE 211 frequently (in 86% of 151 protein sequences ) in the Signal peptide peptidase family. We, therefore, propose these patterns as new motifs of this family. Table 4.35 : Motifs of Presenilin-1 a nd Signal peptide peptidase by HIGEDA Algorithm Family (# seq, max) / Found motif p-value % Hit Presenilin-1 (86,622) HIGEDA GLGD FIFYS 1.3e-10 85.00 KL GLGD FIFY 1.6e-10 85.00 MEME HWKGPLRLQQ 66.00 GLAM2 GDFIFYSLVL 85.00 PRATT Lx(1-3)Lx(2-3)I 100.0 Signal p. Peptidase (151, 690) HIGEDA F[AS]ML GLGD IVIPG 8.9e-08 88.74 YDIFWVF[GF]T[NDP] 1.7e-08 86.75 GLF[IFV]YDIFWVF 8.7e-08 86.75 FGT[NDP]VMVTVA[KT] 8.6e-08 86.09 MEME FWVFGT[ND]VMV YDIFWVFGT[NDP]VMV [AS]MLGLGDIVIPGI 86.00 87.00 92.00 GLAM2 [FL][FI]YD[IV]F[WF]VF[GF] GL_FFYDIFWVFGT 95.00 90.00 PRATT Lx3F 100.0 HIGEDA integrates HGA, EM and dynamic programming (DP) to find motifs in biological sequences. HIGEDA uses HGA to manage different motif seeds and to escape local optima more efficiently than conventional GAs by usi ng gene-set levels. By applying DP and proposi ng a new technique for aligning sequences to a motif model, and then using EM with the dynami c alignments, the HIGEDA creates a new PAGE 212 way to automatically insert gaps in motif models. Using the gradient descent learning law, HIGEDA effectively estimates its mode l parameters without the greedy searches used in MEME. Using the pseudo-counts me thod based on a simple mechanism and the BLOSUM62 substitution probability matrix, HIGEDA avoids the small dataset problem of zeros in motif model elem ents, without using the Diri chlet mixture prior knowledge used in MEME and similar approaches. B ecause HIGEDA uses a set of motif seeds generated randomly, it outperforms MEME, GLAM2 and several other algorithms in finding subtle, more degenerate and gapped mo tifs. Lastly, we have shown that, as a TCM based algorithm, HIGEDA can identify novel motifs. PAGE 213 5. Applications 5.1 Recovery of a gene expression signature In this application, we apply our methods to recover the estrogen gene expression signature from the CMAP dataset. Th e flowchart is shown in Figure 5-1. Figure 5-1 : Gene expressi on signature prediction The fzGASCE with fzSC, fzBLE, fzPBD and RPPR methods were applied to the estrogen treatment datasets in CMAP with the estrogen gene expression signature [85]. We first preprocessed the CMAP arrays using th e RMA method to obtain 564 expression profiles. Using our meta-analysi s method, RPPR [169] we selected the Integrated gene expression data fzGASCE : Genetic algorithm with fSC and fzBLE Cluster candidates Fuzzy partition (1) fzSC : Fuzzy Subtractive Clustering Immunized candidates Cluster candidates Fuzzy partition Cluster solution Competitors Generation selection Potential offsprings The same? Done Update the Solution set Y PAGE 214 expression profiles that were most strongly related to the estrogen expression signature [25, 26]. Expression profiles from the five experiments, #373, 988, 1021, 1079 and 1113 were selected. We then used the Rank-Product method [30, 54] across multiple experiments to filter probes differentially expressed using a p-value cutoff at 0.005. This process selected 204 probes. The expression levels of these 204 probe s across the select ed profiles were extracted and the fzGASCE method clusteri ng was applied using Pearson Correlation distance. Four clusters were found. Each cluster, a set of probesets, is considered a candidate estrogen gene expres sion signature. We applied R PPR to each cluster using CMAP. Significantly related expression prof iles, with cutoffs of 0.5 and -0.5 for positive and negative scores respectively, for each cluster are shown in Table 5.1. Table 5.1 : Clustering results with es trogen significantly associated drugs Cluster (#probesets) Associated drugs [CMAP experiment#] (+/-hitscore) 1 (60) estradiol[373](+0.93), estradio l[1021](+0.85), estradiol[988] (+0.733), estradiol[1079](+0.69), doxycycline[1113](+0.64), genistein[1073](+0.63), estradiol[365](+0.63), nordihydroguaiaretic acid[415](+0.59), alpha-estradiol[ 1048](+0.58), genistein[268](+0.57), estradiol[414](+0.57), genistein[1015]( +0.56), dexamethasone[374](+0.54), estradiol[121](+0.53), alpha-estradiol [990](+0.52), genistein[267](+0.52), fulvestrant[1076](-0.75), fulvestrant[1043](-0.71), fulvestrant[985](-0.66), fulvestrant[367](-0.57), 5224221[956](-0 .55), 15-delta prostaglandin J2[1011](-0.54), 5230742[862](-0.54), fulvestrant[310](-0.52) 2 (53) doxycycline[1113](+0.85), estradiol[988](+0.74), estradiol[1021](0.61), estradiol[373](0.58), 12,13-EODE[1108](+0.56) 3 (44) estradiol[988](+0.84), estradi ol[1079](+0.80), estradiol[1021](+0.58), alpha-estradiol[990](+0.52) 4 (47) estradiol[1021](+0.92), estradiol[988](+0.66), doxycycline[1113](+0.61), estradiol[1079](+0.58) PAGE 215 Table 5.2 : Estrogen signature and th e Â‘signatureÂ’ predicted by FZGASCE Affy-id Selected expression profiles predicted 373988102110791113 202437_s_at 1.721.921.971.681.75x 206115_at 1.291.291.281.401.54x 209339_at 1.661.801.771.761.82x 209687_at 1.511.661.591.631.65x 210367_s_at 1.621.771.641.701.71x 210735_s_at 1.711.731.711.811.81x 211421_s_at 1.351.551.471.631.59x 213906_at 1.291.481.451.431.62x 215771_x_at 1.451.691.711.791.63x 215867_x_at 1.691.821.771.701.82x 205879_x_at 1.501.811.681.681.95x 205862_at 1.691.811.691.691.72x 203666_at 1.551.611.601.601.68x 203963_at 1.711.881.711.851.81x 204508_s_at 1.601.821.761.931.91x 204595_s_at 1.781.801.621.751.99x 205380_at 1.311.591.501.641.84x 205590_at 1.691.711.871.821.79x 217764_s_at 1.621.831.741.851.84x 204597_x_at 1.701.741.752.021.69205239_at 1.711.731.771.621.81204497_at 1.791.761.701.791.84Table 5.1 shows that cluster 1 strongly hit the estrogen expr ession profiles in CMAP. It is also the only cluster with a significant negative correlation with fulvestrant, an estrogen antagonist. This cluster is theref ore the best candidate for the estrogen gene expression signature. Interestingly, this cl uster contains 19 of 22 probesets from the PAGE 216 actual estrogen signature that were significan tly regulated in the five selected profiles (Table 5.2). We therefore conclude that fzGASCE successfully di scovered the estrogen gene expression signature from the CMAP datasets. 5.2 Gene drug association prediction Our algorithm methods can be combined in a solution model for the problem of gene-drug association prediction. Given a set of genes that may describe completely or partially the biological phenot ype of interest, the problem is to search the CMAP datasets for drug treatments that directly or indirectly alter their expression. 5.2.1 Model design Let A be the given set of genes. Figure 5-2 : Gene-Drug association prediction Gene set A Expanded set B Key gene set BA Gene expression signature generation Predicted gene-drug associations PAGE 217 Gene set expansion Figure 5-3 : Expansion of gene set A Because our knowledge of the molecular basi s of the phenotype is limited, the set A probably does not describe th e phenotype completely. To address this problem, we propose to expand the gene set using Protei n-Protein Interaction (PPI) and pathway data. A gene g is added to set A if g interacts with at least one gene in set A, and g and/or its interaction partner in A is annotated as a component of some pathway. These criteria are chosen because, while physically interacting proteins are not necessarily Selected gene set: B A B Apply KEGG pathway annotation Given A a set of genes describe the biological phenotype Search for in-vivo interactors of A a b c a b c d f e a b c d f e Select the genes having at least one interaction partner that is annotated in KEGG a b c d f e Genes of the given set A KEGG-pathway annotated Expanded set: B PAGE 218 components of a biological proce ss, a pair of interacting pr oteins which are associated with one pathway is more likely to participate in another pathway. We first use PPI data from the Human Protein Reference Da tabase (HPRD) to expand set A. We then use KEGG pathway data to select proteins that meet the criteria. If a gene g, g A, is annotated as a component of any pathway, we will include genes that interact with g, whether or not they are annotated as pathway components. If gene g is not a component of any pathway, its interaction partners will be included only if they are annotated as components of at least one pathway. These included genes then suggest a novel potential role for g. Let B be the expanded gene set. B contains the set A and the proteins that interact with A, and may mediate the association of proteins in A with some pathways. Every pair of genes in B is annotated in at least one pathway. For example, consider a set of three genes, a, b, c, and their interaction pa rtners, d, e, f (Figure 5-3). The interaction between the genes f and b is not annotated in any KEGG pathway. Therefore, f is not selected. The expanded set B then cont ains the genes a, b, c, d and e. The key genes Set B may be large, and it is possible that not all of its members are involved in a single pathway relevant to the biological phe notype. The next step, therefore, is to search for the genes that are most likely to interact with genes in A (Figure 5-4). We call these the key genes associated with th e phenotype. Key genes may be members of A or the genes that interact with members of A. PAGE 219 Figure 5-4 : Identification of key genes We cluster the genes in B into groups where genes of the same group most often occupy the same cellular locati on to facilitate th e interaction. We use the GO cellular component (CC) annotations to measure th e distances between genes in B and then apply our clustering methods to cl assify genes in B into groups, B1, B2,Â…, Bk. These subsets may be the set A itself or have a member that interacts with at least one member of A and is located within the same cellular locations as that member of A. These sets Bi, i=1,Â…,k, are not necessarily related to A in a bi ological process, we therefore search for AB, where AB{Bi}i=1..k, and AB is functionally closest to A. We address this problem using a semantic si milarity measure based on GO Biological Process (BP) Annotate B using GO CC terms Create MB, a CC-terms based distance matrix for genes in B Select BA from {Bi, i=1,k} where BA is the closest to A using GO BP based similarity Cluster B using MB to have sub-groups of B: B1, B2,Â…, B k Radius BA using PPI: BA*, create the expression profiles SA for BA* using random walk and heuristics PAGE 220 annotations. The selected set ABshould have the highest semantic similarity to A. We define BA, A AB A B the key genes of the expanded se t B of set A. Since the set BA was generated by examining known interact ors of A that occupy similar cellular locations and are annotated to pathways, BA is believed to better describe the biological phenotype of interest than the set A. Generate and apply gene expression signature In this step, we use BA and the CMAP gene expression microarray data to generate a gene expression signature for th e biological phenotype. Because not all the genes in BA exhibit a change in all expression profiles in the da taset, the key issues in this step are to select expression profiles and determine the direction change in expression of genes in BA. We extract the fold change (FC) levels of genes in BA using all gene expression profiles in CMAP, compute p-values from FC to filter genes differentially expressed in each expression profile using Rank Produc t [30], and conduct cluster analysis to group all CMAP expression profiles over genes in BA into p groups, {Pi}i=1..p. We choose PA as the group of samples where the genes in BA are the most changed. We use the selected genes to define the gene expr ession signature, SA, for the phenotype. The FC levels of genes in SA are averaged across the expression profiles in PA to determine the expression direction. SA is the gene expressi on signature we suggest for the phenotype. PAGE 221 Figure 5-5 : Gene expression signatu re generation and application The predicted expression signature, SA, is then used to identify a subset of CMAP, CMAPA which contains the expression pr ofiles strongly relevant to the phenotype. The small-molecule/drugs used as treatments in CMAPA suggest potential therapeutics to modulate expr ession of the given gene set. By conducting cluster analysis on CMAPA and using the GO BP-based seman tic similarity measure, we can also obtain the set CA that contains additional genes potentially related to the biological Key genes Result-1: A set of drugs/treatments that may target genes in A Cluster CMAPA into m clusters of genes: C1,Â…Cm using gene expression levels Select CA from {Ci, i=1,m} where CA is the closest to A using expression levels Result-3: Cc A provides additional gene groups relating to the p henot yp e Select from CMAP a set of expression profiles related to SA: CMAPA Eliminate CA and clusters of unchanged g enes Result-2: CA provides additional genes relating to the phenotype PAGE 222 phenotype of interest, and the set Cc A of additional groups of genes changed, but differently that did the ge nes in the given set A. 5.2.2 Model testing Intellectual disability (ID) in Down syndrome (DS) ranges from low normal to severely impaired, and has a significant impact on the quality of lif e of the individuals affected and their families. Development of pharmacotherapies for cognitive deficits is, therefore, an important goal. The worki ng hypothesis in DS research is that the phenotypic features are caused by the modest 50% increases in expression of trisomic genes and that these in turn result in m odest perturbations of otherwise normal cell functions and pathways. However, the primary difficulties of targeting therapeutics to Hsa21q genes are the determinations of the number of genes to be regulated simultaneously, the number of drugs to be used and the gene and/or drug interactions might occur in full trisomies to produce advers e effects [171]. This application will help with identification of the set of drug treatme nts that may affect a given set of Hsa21q genes using the CMAP datasets. Hsa21 gene set 1 This gene set contains six Hsa21q genes: APP, DYRK1A, ITSN1, RCAN1, S100B and TIAM1 [171] as shown in Table 5.3. 1) As in Figure 5-3, to generate the e xpanded gene set, we used the PPI and pathway database in th e GardinerÂ’s lab [167] which is compiled from multiple bioinformatics resources [168], (available online at PAGE 223 http://gfuncpathdb.ucdenver.edu/iddrc/ ). We used the HPRD PPIs at Â“invivoÂ” level and the KEGG pathways to filter the genes. The expanded set B of A is shown in Table 5.4. Table 5.3 : Hsa21 set-1 query genes No. Gene Description Chr. 1 APP amyloid beta (A4) precursor protein 21 2 DYRK1A dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 1A 21 3 ITSN1 intersectin 1 (SH3 domain protein) 21 4 RCAN1 regulator of calcineurin 1 21 5 S100B S100 calcium binding protein B 21 6 TIAM1 T-cell lymphoma invasion and metastasis 1 21 2) As in Figure 5-4, we conduct cluste r analysis on B us ing fzGASCE and GOfzBLE methods with GO-CC annotati ons. Gene groups are then ranked based on how close they are to the given gene set A using GO-BP based semantic similarity measurement. Results are shown in Table 5.5. We choose the first group, which has the highest semantic similarity score to the set A, for the next steps. PAGE 224 Table 5.4 : Expanded set of the Hsa21 set-1 Selected proteins in BOLD are an notated in at least one pathway # Selected protein Interactor 1 APP SERPINA3, ABL1, APBA1, APBA2, APBB1, APBB2, CALR, CASP3, CDK5, CHRNA7, CLU, CTSD, F12, GRB2, GSK3B, GSN, HSD17B10, HMOX2, KNG1, KLC1, LRP1, NF1, PIN1, PPID, PSEN2, SHC1, TGFB1, TGFB2, NUMB, NAE1, NUMBL, ITM2B, HOMER3, HOMER2, MAPK8IP1, APBA3, BCAP31, APBB3, KAT5, NCSTN, BACE2, LDLRAP1, HTRA2, LRP1B, SHC3, NECAB3, COL25A1 2 DYRK1A CREB1, FOXO1, GLI1, YWHAB, YWHAE, YWHAG 3 ITSN1 DNM1, EPHB2, EPS15, HRAS, SNAP25, SOS1, SNAP23, WASL 4 RCAN1 MAP3K3, PPP3CA 5 S100B GUCY2D, MAPT, IQGAP1 6 TIAM1 ANK1, CAMK2G, CD44, EFNB1, EPHA2, HRAS, MYC, NME1, PRKCA, RAC1, SRC, TIAM1, YWHAG, MAPK8IP1, MAPK8IP2, PARD3, PPP1R9B Table 5.5 : Gene clusters of th e Hsa21 gene set-1 expanded set Genes in BOLD are in the original gene set A # Genes Similarity score to A 1 APBA1(9), APBA2(15), APBB1(11), APBB2(4), APP(21) CHRNA7(15), CLU(8), CREB1(2), CTSD(11), DNM1(9), RCAN1(21) DYRK1A(21) EPHA2(1), EPHB2(1), KNG1(3), LRP1(12), MYC(8), NF1(17), PIN1(19), PPID(4), PSEN2(1), S100B(21) SNAP25(20), TGFB1(19), TGFB2(1), SNAP23(15), IQGAP1(15), NUMBL(19), HOMER3(19), KAT5(11), LRP1B(2), PPP1R9B(17) 0.32 2 SERPINA3(14), ABL1(9), ANK1(8), CALR(19), CAMK2G(10), CASP3(4), CD44(11), CDK5(7), EFNB1(X), EPS15(1), F12(5), FOXO1(13), GLI1(12), GRB2(17), GSK3B(3), GSN(9), GUCY2D(17), HSD17B10(X), HMOX2(16), HRAS(11), KLC1(14), MAPT(17), MAP3K3(17), NME1(17), PPP3CA(4), PRKCA(17), RAC1(7), ITSN1(21) SHC1(1), SOS1(2), SRC(20), TIAM1(21) YWHAB(20), YWHAE(17), YWHAG(7), NUMB(14), NAE1(16), WASL(7), ITM2B(13), HOMER2(15), MAPK8IP1(11), APBA3(19), BCAP31(X), APBB3(5), NCSTN(1), MAPK8IP2(22), BACE2(21), LDLRAP1(1), HTRA2(2), SHC3(9), PARD3(10), NECAB3(20), COL25A1(4) 0.26 PAGE 225 3) As in Figure 5-5, we search for the probesets of the selected genes and extract their expression levels. Differential expression significance is set using a cutoff p-value at 0.005. We c onduct cluster analysis of expression profiles using our cluster method, fzGASCE where the fuzzifier factor, m, is set to 1.15. The expression profile cl usters are ranked based on the average fold change of genes that are differe ntially expressed. The cluster with highest expressed le vel is selected. 4) We average the expression levels of the selected probesets (SA) across selected profiles and assign the Â‘Dow nÂ’ label to probesets where the averaged FC value is less than 1 and vice versa. Results are shown in Table 5.6. We propose these genes with their direction of expression change as the expression signature for the phenotype most related to the given set A of Hsa21 genes. 5) We use the gene expression signature in Table 5.6 with RPPR [169] (support tool is available online [ 168]) to select the list of drug treatments in CMAP using a score cutoff at 0.8. Results ar e shown in Tables 5.7 and 5.8. Our approach successfully identifies dr ugs commonly used in treatment of mental illness. Mebhydrolin is used to treat depression in older ADULT with ID; cytochalasin B causes filament shor tening, however, Krucker et al. [186] showed that the effect of cytochalas in B can be reversible making stable long-term potentiation in region CA1 of the hippocampus. Geldanamycin PAGE 226 can help with reducing brain injury; Levomepromazine is used to relieve symptoms of schizophrenia. All of th ese drugs are listed in Table 5.8. Table 5.6 : Proposed gene expression signature for the Hsa21 gene set-1 (67 genes, 147 probesets) No Probeset Sign Gene Chr. No Probeset Sign Gene Chr. 1 202123_s_at ABL1 9 75 206689_x_at KAT5 11 2 207087_x_at ANK1 8 76 209192_x_at KAT5 11 3 208352_x_at ANK1 8 77 214258_x_at KAT5 11 4 205389_s_at ANK1 8 78 212877_at KLC1 14 5 208353_x_at ANK1 8 79 212878_s_at KLC1 14 6 205391_x_at ANK1 8 80 213656_s_at KLC1 14 7 205390_s_at ANK1 8 81 217512_at KNG1 3 8 206679_at APBA1 9 82 206054_at KNG1 3 9 215148_s_at APBA3 19 83 57082_at LDLRAP1 1 10 205146_x_at APBA3 19 84 221790_s_at LDLRAP1 1 11 202652_at APBB1 11 85 203514_at MAP3K3 17 12 204650_s_at APBB3 5 86 213013_at MAPK8IP1 11 13 211277_x_at APP 21 87 213014_at MAPK8IP1 11 14 200602_at APP 21 88 205050_s_at MAPK8IP2 22 15 214953_s_at APP 21 89 208603_s_at MAPK8IP2 22 16 217867_x_at BACE2 21 90 206401_s_at MAPT 17 17 212952_at CALR 19 91 203928_x_at MAPT 17 18 212953_x_at CALR 19 92 203929_s_at MAPT 17 19 214315_x_at CALR 19 93 203930_s_at MAPT 17 20 214316_x_at CALR 19 94 202431_s_at MYC 8 21 200935_at CALR 19 95 202268_s_at NAE1 16 22 212669_at CAMK2G 10 96 208759_at NCSTN 1 23 214322_at CAMK2G 10 97 210720_s_at NECAB3 20 24 212757_s_at CAMK2G 10 98 207545_s_at NUMB 14 25 202763_at CASP3 4 99 209073_s_at NUMB 14 26 204490_s_at CD44 11 100 221280_s_at PARD3 10 27 212014_x_at CD44 11 101 210094_s_at PARD3 10 28 209835_x_at CD44 11 102 221526_x_at PARD3 10 29 210916_s_at CD44 11 103 221527_s_at PARD3 10 30 212063_at CD44 11 104 202927_at PIN1 19 31 217523_at CD44 11 105 204186_s_at PPID 4 32 216056_at CD44 11 106 204185_x_at PPID 4 33 204489_s_at CD44 11 107 202429_s_at PPP3CA 4 34 204247_s_at CDK5 7 108 202457_s_at PPP3CA 4 35 210123_s_at CHRNA7 15 109 202425_x_at PPP3CA 4 36 208791_at CLU 8 110 215194_at PRKCA 17 37 208792_s_at CLU 8 111 213093_at PRKCA 17 38 222043_at CLU 8 112 215195_at PRKCA 17 39 214513_s_at CREB1 2 113 206923_at PRKCA 17 40 204312_x_at CREB1 2 114 211373_s_at PSEN2 1 41 204314_s_at CREB1 2 115 204262_s_at PSEN2 1 42 204313_s_at CREB1 2 116 204261_s_at PSEN2 1 43 200766_at CTSD 11 117 208641_s_at RAC1 7 44 209033_s_at DYRK1A 21 118 209686_at S100B 21 PAGE 227 Table 5.6 (con t.) No Probeset Sign Gene Chr. No Probeset Sign Gene Chr. 45 211541_s_at DYRK1A 21 119 202376_at SERPINA3 14 46 211079_s_at DYRK1A 21 120 214853_s_at SHC1 1 47 203499_at EPHA2 1 121 201469_s_at SHC1 1 48 217887_s_at EPS15 1 122 217048_at SHC1 1 49 217886_at EPS15 1 123 206330_s_at SHC3 9 50 215961_at F12 5 124 209131_s_at SNAP23 15 51 205774_at F12 5 125 209130_at SNAP23 15 52 202724_s_at FOXO1 13 126 214544_s_at SNAP23 15 53 202723_s_at FOXO1 13 127 202508_s_at SNAP25 20 54 206646_at GLI1 12 128 202507_s_at SNAP25 20 55 215075_s_at GRB2 17 129 212777_at SOS1 2 56 209945_s_at GSK3B 3 130 212780_at SOS1 2 57 200696_s_at GSN 9 131 221284_s_at SRC 20 58 214040_s_at GSN 9 132 221281_at SRC 20 59 218120_s_at HMOX2 16 133 213324_at SRC 20 60 218121_at HMOX2 16 134 203085_s_at TGFB1 19 61 217080_s_at HOMER2 15 135 203084_at TGFB1 19 62 212983_at HRAS 11 136 209908_s_at TGFB2 1 63 203089_s_at HTRA2 2 137 220407_s_at TGFB2 1 64 211152_s_at HTRA2 2 138 209909_s_at TGFB2 1 65 213446_s_at IQGAP1 15 139 206409_at TIAM1 21 66 210840_s_at IQGAP1 15 140 213135_at TIAM1 21 67 200791_s_at IQGAP1 15 141 205810_s_at WASL 7 68 217731_s_at ITM2B 13 142 205809_s_at WASL 7 69 217732_s_at ITM2B 13 143 217717_s_at YWHAB 20 70 209297_at ITSN1 21 144 208743_s_at YWHAB 20 71 35776_at ITSN1 21 145 210317_s_at YWHAE 17 72 207322_at ITSN1 21 146 210996_s_at YWHAE 17 73 210713_at ITSN1 21 147 213655_at YWHAE 17 74 209298_s_at ITSN1 21 Table 5.7 : Drugs predicted to enha nce expression of Hsa21 gene set-1 Drug (concentration) Target or disease Estradiol (10nM) Treatment of urogenital symptoms associated with postmenopausal atrophy of the vagina and/or the lower urinary tract. 4,5-dianilinophthalimide (10mM) Potent inhibitors of amyloid fibrillization. Colforsin (50mM) Novel drug for the treatment of acute heart failure. Fisetin (50mM) A compound that has been shown to be able to alleviate ageing effects in certain model organisms. PAGE 228 Table 5.8 : Drugs predicted to repre ss expression of Hsa21 gene set-1 Drug (concentration) Target or disease Mebhydrolin (4.8mM) Enhanced sedation w ith alcohol or other CNS depressants. Additive antimuscarinic effects with MAOIs, atropine, TCAs which are the treatment of major depressive disorder, panic disorder and other anxiety disorders, causing increased levels of norepinephrine and serotonin that act as communication agents between different brain cells. Cytochalasin B (20.8mM) A cell-permeable mycotoxin. It inhibits cytoplasmic division by blocking the formation of contractile microfilaments. It inhibits cell movement and induces nuclear extrusion. Geldanamycin (1.mM) A benzoquinone ansamycin antibiotic that binds to Hsp90 and inhibits its function. H SP90 proteins play important roles in the regulation of the cell cycle, cell growth, cell survival, apoptosis, angiogenesis and oncogenesis. Tanespimycin (1.mM) It helps cause the br eakdown of certain proteins in the cell, and may kill cancer cells. It is a type of antineoplastic antibiotic and a type of HSP90 inhibitor. Vorinostat (10.mM) A drug that is used to treat cutaneous T-cell lymphoma that does not get better, gets worse, or comes back during or after treatment with other drugs. It is also being studied in the treatment of other types of cancer. Vorinostat is a type of histone deacetylase inhibitor. Trichostatin A (1.mM 100.nM) An organic compound that serves as an antifungal antibiotic and selectively inhibits the class I and II mammalian histone deacetylase families of enzymes, but not class III HDACs. PAGE 229 Table 5.8 (con t.) Drug (concentration) Target or disease Levomepromazine (9.mM) Levomepromazine is used in palliative care to help ease distressing symptoms such as pain, restlessness, anxiety and being sick. It works on chemical substances acting on the nervous system in brain. Levomepromazine is also used to relieve symptoms of schizophrenia. Schizophrenia is a mental health condition that causes disordered ideas, beliefs and experiences. Symptoms of schizophrenia include hearing, seeing, or sensing things that are not real, having mistaken beliefs, and feeling unusually suspicious. Hsa21 gene set 2 Given a set of Hsa21 genes: APP, DSCAM, ITSN1, PCP4, RCAN1, TIAM1. Table 5.9 : Hsa21 set-2 query genes No. Gene Description Chr. 1 APP amyloid beta (A4) precursor protein 21 2 DSCAM Down syndrome cell adhesion molecule 21 3 ITSN1 intersectin 1 (SH3 domain protein) 21 4 PCP4 Purkinje cell protein 4 21 5 RCAN1 regulator of calcineurin 1 21 6 TIAM1 T-cell lymphoma invasion and metastasis 1 21 Table 5.10 shows the predicted gene expr ession signature for Hsa21 gene set 2. We apply the gene expression signature to s earch for the drug treatments that may affect genes in the given gene set using CMAP datasets. A list of drug treatments that positively regulates genes in the given set is shown in Table 5.11. Table 5.12 lists drugs PAGE 230 predicted to negatively affect these genes. These include geldanamycin which can help with reducing brain injury, podophyllotoxin wh ich is used in the treatment of brain tumors, and chlorpromazine which is used in the treatment of various psychiatric illnesses. Table 5.10 : Predicted gene expression signature for the Hsa21 gene set-2 (70 genes, 157 probesets) No. AffyID Sign Gene Chr. No. AffyID Sign Gene Chr. 1 202123_s_at ABL1 9 80 210713_at ITSN1 21 2 205390_s_at ANK1 8 81 209192_x_at KAT5 11 3 208352_x_at ANK1 8 82 214258_x_at KAT5 11 4 207087_x_at ANK1 8 83 206689_x_at KAT5 11 5 205389_s_at ANK1 8 84 212877_at KLC1 14 6 205391_x_at ANK1 8 85 212878_s_at KLC1 14 7 208353_x_at ANK1 8 86 213656_s_at KLC1 14 8 206679_at APBA1 9 87 206054_at KNG1 3 9 209870_s_at APBA2 15 88 217512_at KNG1 3 10 209871_s_at APBA2 15 89 57082_at LDLRAP1 1 11 215148_s_at APBA3 19 90 221790_s_at LDLRAP1 1 12 205146_x_at APBA3 19 91 200784_s_at LRP1 12 13 202652_at APBB1 11 92 200785_s_at LRP1 12 14 212972_x_at APBB2 4 93 203514_at MAP3K3 17 15 213419_at APBB2 4 94 213013_at MAPK8IP1 11 16 216750_at APBB2 4 95 213014_at MAPK8IP1 11 17 216747_at APBB2 4 96 208603_s_at MAPK8IP2 22 18 40148_at APBB2 4 97 205050_s_at MAPK8IP2 22 19 212970_at APBB2 4 98 202431_s_at MYC 8 20 212985_at APBB2 4 99 202268_s_at NAE1 16 21 204650_s_at APBB3 5 100 208759_at NCSTN 1 22 200602_at APP 21 101 210720_s_at NECAB3 20 23 214953_s_at APP 21 102 211914_x_at NF1 17 24 211277_x_at APP 21 103 212676_at NF1 17 25 217867_x_at BACE2 21 104 216115_at NF1 17 26 200837_at BCAP31 X 105 204325_s_at NF1 17 27 200935_at CALR 19 106 211094_s_at NF1 17 28 212952_at CALR 19 107 212678_at NF1 17 29 214315_x_at CALR 19 108 210631_at NF1 17 30 212953_x_at CALR 19 109 204323_x_at NF1 17 31 214316_x_at CALR 19 110 211095_at NF1 17 32 214322_at CAMK2G 10 111 207545_s_at NUMB 14 33 212669_at CAMK2G 10 112 209073_s_at NUMB 14 34 212757_s_at CAMK2G 10 113 209615_s_at PAK1 11 35 202763_at CASP3 4 114 221280_s_at PARD3 10 36 212063_at CD44 11 115 221527_s_at PARD3 10 37 204489_s_at CD44 11 116 221526_x_at PARD3 10 38 210916_s_at CD44 11 117 210094_s_at PARD3 10 PAGE 231 Table 5.10 (con t.) No. AffyID Sign Gene Chr. No. AffyID Sign Gene Chr. 39 204490_s_at CD44 11 118 202927_at PIN1 19 40 212014_x_at CD44 11 119 204186_s_at PPID 4 41 217523_at CD44 11 120 204185_x_at PPID 4 42 209835_x_at CD44 11 121 202429_s_at PPP3CA 4 43 216056_at CD44 11 122 202425_x_at PPP3CA 4 44 204247_s_at CDK5 7 123 202457_s_at PPP3CA 4 45 210123_s_at CHRNA7 15 124 215195_at PRKCA 17 46 208792_s_at CLU 8 125 215194_at PRKCA 17 47 208791_at CLU 8 126 213093_at PRKCA 17 48 222043_at CLU 8 127 206923_at PRKCA 17 49 200766_at CTSD 11 128 204261_s_at PSEN2 1 50 217341_at DNM1 9 129 204262_s_at PSEN2 1 51 215116_s_at DNM1 9 130 211373_s_at PSEN2 1 52 211484_s_at DSCAM 21 131 208641_s_at RAC1 7 53 202711_at EFNB1 X 132 208370_s_at RCAN1 21 54 203499_at EPHA2 1 133 215253_s_at RCAN1 21 55 210651_s_at EPHB2 1 134 202376_at SERPINA3 14 56 211165_x_at EPHB2 1 135 201469_s_at SHC1 1 57 209589_s_at EPHB2 1 136 214853_s_at SHC1 1 58 209588_at EPHB2 1 137 206330_s_at SHC3 9 59 217887_s_at EPS15 1 138 214544_s_at SNAP23 15 60 217886_at EPS15 1 139 209130_at SNAP23 15 61 215961_at F12 5 140 209131_s_at SNAP23 15 62 205774_at F12 5 141 202508_s_at SNAP25 20 63 215075_s_at GRB2 17 142 202507_s_at SNAP25 20 64 209945_s_at GSK3B 3 143 212777_at SOS1 2 65 200696_s_at GSN 9 144 212780_at SOS1 2 66 214040_s_at GSN 9 145 221281_at SRC 20 67 218121_at HMOX2 16 146 213324_at SRC 20 68 218120_s_at HMOX2 16 147 221284_s_at SRC 20 69 217080_s_at HOMER2 15 148 203084_at TGFB1 19 70 212983_at HRAS 11 149 203085_s_at TGFB1 19 71 202282_at HSD17B10 X 150 220407_s_at TGFB2 1 72 211152_s_at HTRA2 2 151 209908_s_at TGFB2 1 73 203089_s_at HTRA2 2 152 220406_at TGFB2 1 74 217731_s_at ITM2B 13 153 209909_s_at TGFB2 1 75 217732_s_at ITM2B 13 154 213135_at TIAM1 21 76 207322_at ITSN1 21 155 206409_at TIAM1 21 77 209298_s_at ITSN1 21 156 205810_s_at WASL 7 78 209297_at ITSN1 21 157 205809_s_at WASL 7 79 35776_at ITSN1 21 PAGE 232 Table 5.11 : Drugs predicted to enha nce expression of Hsa21 gene set-2 Drug (concentration) Target or disease Estradiol (10nM) Treatment of urogenital symptoms associated with postmenopausal atrophy of the vagina and/or the lower urinary tract. 4,5-dianilinophthalimide (10mM) Potent inhibitors of amyloid fibrillization. Colforsin (50mM) Novel drug for the treatment of acute heart failure. Table 5.12 : Drugs predicted to repre ss expression of Hsa21 gene set-2 Drug (concentration) Target or disease Trichostatin A(1.mM, 100.nM) An organic compound that serves as an antifungal antibiotic and selectively inhibits the class I and II mammalian histone deacetylase families of enzymes, but not class III HDACs. Geldanamycin (1.mM) A benzoquinone ansamycin antibiotic that binds to Hsp90 and inhibits its function. H SP90 proteins play important roles in the regulation of the cell cycle, cell growth, cell survival, apoptosis, angiogenesis and oncogenesis. Podophyllotoxin (9.6mM) Used in the treatment of many cancers, particularly small cell lung carcinoma and testicular cancer and brain tumors. It arrests ce ll growth by inhibiting DNA topo-isomerase II, which cau ses double strand breaks in DNA. Cytochalasin B (20.8mM) A cell-permeable mycotoxin. It inhibits cytoplasmic. division by blocking the formation of contractile microfilaments. It inhibits cell movement and induces nuclear extrusion. PAGE 233 Table 5.12 (con t.) Drug (concentration) Target or disease Propafenone (10.6mM) Used to prevent or treat serious irregular heartbeats (arrhythmia). Belongs to a class of drugs called antiarrhythmics. Chlorpromazine (11.2mM) It is used in the treatment of various psychiatric illnesses and is also used in the management of nausea and vomiting associated with terminal illness. Tanespimycin (1.mM) It helps cause the breakdown of certain proteins in the cell, and may kill cancer cells. It is a type of antineoplastic antibiotic and a type of HSP90 inhibitor. 5.3 Drug target prediction In this application, we search for the se t of genes affected by a given set of drug treatments as follows: 1) Relate the given set of drug treatments with drug treatments in the CMAP a. Use Drug Bank chemical similarity. b. Based on overlap of drug targets. 2) Extract expression profiles of the CMAP drug treatments rela ted to the given drug treatments. 3) Generate the gene expression si gnature as in Section 5.2.1. 4) Apply RPPR [169] to the gene expressi on signature to expand the selected expression profiles. PAGE 234 5) Apply our clustering methods to s earch for the group of genes most differentially expressed. This set of genes is the answer to the problem. 5.4 Application of HIGEDA into prediction regulatory motifs HIGEDA is capable of discovering new motifs in a sequence set. In this application, we use HIGEDA to discover transcription factor binding sites (TFBS) and miRNA complementary binding sites (MBS) in the genomic and mRNA sequences of genes within each cluster from Section 5.1. Figure 5-6 illustrates the proposed method. Figure 5-6 : Transcription f actor and mircoRNA binding si te prediction using HIGEDA NCBI DB Gene RefSeqs HIGEDA Seq-Motifs TFBS Cluster analysis Gene groups MBS Motif Scanner Integrated gene expression data Biological context of interest PAGE 235 6. Conclusions and future work Our studies regarding microa rray analysis included meta -analysis methods, cluster analysis of gene expressi on data, and motif finding in biological sequences. Use of meta-analysis methods should provide better results in analysis of microarray data obtained from multiple studies because of a st ronger statistical basis, and a reduction in bias caused by results of any individual study. In cluster analysis, we showed that soft cluster methods are more appropriate for gene expression data because it allows genes to belong to multiple clusters, reflecting their participation in more than one biological process. Use of the FCM algorithm is optimal because it eliminates problems seen in the EM algorithm of slow convergence at regions of overlap between adjacent cl usters and its assumption about the data distribution model may violate th at of most microarray data that do not follow a specific distribution model, such as Ga ussian or Chi-Squared. We also proved that, by using the possibility to probability model conversi on, we can apply probability based model selection to possibility based models. We th erefore benefit from the power of statistical methods but also effectively model the na tural data distributions because of the flexibility of possibility based models. Th e advantages of our approach have been shown in our novel methods of fuzzy cluster validation using either gene expression data or GO annotations, missing data imput ation, fuzzy cluster defuzzification and cluster number detection. PAGE 236 In the motif-finding problem, we showed that a PWM based model is most appropriate. With the use of the hierarchi cal genetics algorithm where patterns of subsequences in motifs can change simultaneous ly, the motif-finding algorithm can avoid local optima. In addition, with the use of the dynamic programming algorithm, gapped motifs can be discovered. In this dissertation, we have made th e following contributions to bioinformatics research: 1) Development of a meta-analysis met hod to effectively filter differentially expressed genes using external metrics, in formation based on the specific biological context, and prior knowledge of existing biol ogical pathways, PPIs and Gene Ontology. The method is applicable to cases where data come from different laboratories using different versions of diffe rent microarray platforms, and should provide more biologically relevant results. We further propose a novel use of the Connectivity Map datasets to predict gene-drug associations. 2) Development of a novel clustering algorithm that combines algorithms developed for FCM parameter initiali zation, FCM missing data imputation, defuzzification of FCM fuzzy partition an d fuzzy cluster evaluation with the optimization algorithm, GA. The algorithm is a pplicable to cluster analysis of gene expression data without a prio r specification of the number of clusters and assumption of the data distributions. Our clusteri ng method successfully supported specific applications described in Section 5. PAGE 237 3) Development of a novel motif-finding algorithm for the problem of gene regulatory sequence prediction for groups of genes with similar expression patterns under some biological conditions of intere st, demonstrated in Section 4.3.2. 4) Development of a machine learning base d framework for a complete solution to the problem of microarray data analysis. 5) We have developed a website [168] wh ere the methods of this dissertation are made available online. The website uses R software as the computational back-end so that our methods can work seamlessly with other statistical, machine learning and data mining methods available in R packages. We will update our website with our machine learning based framework to allow microarray data analysis online. In future work, we will improve our algorithm methods with parallelization and implement them as web applications to support gene-drug-disease association prediction. These applications will be r un on High Performance Computing (HPC) and accessible online. In addition, we will adap t our methods to other research domains such as Magnetic Resonance Imaging (MRI) da ta analysis, Social network information retrieval, cluster analysis of internet websites and users data. For the MRI data analysis, our cluster methods can help with image segm entation, marker identification and cluster analysis. For social network information retr ieval, our cluster methods can help with identification the group of users having similar social behaviors, namely entertainment, buying and travelling. PAGE 238 REFERENCES [1] Stekel D. (2003) Microarray Bioinformatics, Cambridge University Press. [2] Kohane I.S., Kho A.T. and Butte A.J. (2003) Micr oarrays for an Integrativ e Genomics, MIT Press. [3] Amaratunga D. and Cabrera J. (2004) Exploration and Analysis of DNA Microarray and Protein Array Data, John Wiley & Sons, Inc., Hoboken, New Jersey. [4] Ewens W. and Grant G. (2005) Statisical Methods in Bioinformatics: An Introduction 2ed, Springer. [5] Shewhart W.A. and Wilks S.S. (2009) Batch effects and noise in microarray experiments, Wiley. [6] Barrett T., Suzek T.O., Troup D.B., Wilhite S.E., Ngau W.C., Ledoux P., Rudnev D., Lash A.E., Fujibuchi W. and Edgar R. (2005) NCBI GEO: mining millions of expression profiles--database and tools. Nucleic Acids Res. Vol. 33, D562Â–D566. doi: 10.1093/nar/gki022. [7] Barrett T., Troup D.B., Wilhite S.E. Ledoux P., Rudnev D., Evangelista C., Kim I.F., Soboleva A., Tomashevsky M. and Edgar R. (2007) NCBI GEO: mining tens of millions of expression profiles database and tools update. Nucleic Acids Res. Vol. 35, D760Â–D765. doi: 10.1093/nar/gkl887. [8] Koziol A.J. (2011) Comments on the rank product method for analyzing replicated experiments, Elsevier Â– FEBS letters, Vol. 584, pp. 941-944. [9] Witten D. and Tibshirani R. (2007) A comparison of fold change and the t-statistic for microarray data analysis, Technical report, Stanford University. [10] Deng X., Xu J., Hui J. and Wang C. (2009) Probability fold change: A robust computational approach for identifying differentially expressed gene lists, Computer methods and Programs in Biomedicine, Elsevier, Vol. 93, pp. 124-139. [11] Warnat P., Eils R. and Brors B. (2005) Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics, Vol. 6, pp. 265. [12] Xu L., Geman D. and Winslow R. (2007) Large-scale integration of cancer microarray data identifies a robust common cancer signature. BMC Bioinformatics Vol. 8, pp. 275. [13] Xu L., Tan A.C., Winslow R.L. and Geman D. (2008) Merging microarray data from separate breast cancer studies provides a robust prognostic test. BMC Bioinformatics, Vol. 9, pp. 125. PAGE 239 [14] Ramaswamy S., Ross K.N., Lander E.S. and Golub T.R. (2003) A molecular signature of metastasis in primary solid tumors. Nat Genet, Vol. 33, pp. 49-54. [15] Bloom G., Yang I.V., Boulware D., Kwong K.Y., Coppola D., Eschrich S., Quackenbush J. and Yeatman T.J. (2004) Multi-Platform, Multi-Site, Microarray-Based Human Tumor Classification. Am J. Pathol, Vol. 164, pp. 9-16. [16] Shen R., Ghosh D. and Chinnaiyan A.M. (2004) Prognostic meta-signature of breast cancer developed by two-stage mixture modeling of microarray data. BMC Genomics, Vol.5, pp. 94. [17] Jiang H., Deng Y., Chen H.S., Tao L., Sha Q., Chen J., Tsai C.J. and Zhang S. (2004) Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes. BMC Bioinformatics, Vol 5, pp.81. [18] Warnat P., Eils R. and Brors B. (2005) Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics, pp. 265. [19] Rao Y., Lee Y., Jarjoura D., Ruppert A.S., Liu C.G., Hsu J.C. and Hagan J.P. (2008) A Comparison of Normalization Tech niques for Microarray Data. Berk eley Electronic Press, Vol. 7. [20] Vert G., Nemhauser J.L., Geldner N., Hong F. and Chory J. (2005) Molecular Mechanisms of Steroid Hormone Signaling in Plants. The Annual Review of Cell and Developmental Biology. [21] Cahan P., Rovegno F., Mooney D., Newman J.C., Laurent III G. and McCaffrey T. (2007) Metaanalysis of microarray results: challenges, opportu nities, and recommendations for standardization. Gene, Elsevier, pp.12-18. [22] Irizarry R.A., Warren D., Spencer F., Kim I.F., Biswal S., Frank B.C., Gabrielson E., Garcia J.G., Geoghegan J., Germino G., Griffin C., Hilmer S. C., Hoffman E., Jedlicka A.E., Kawasaki E., Martnez-Murillo F., Morsberger L., Lee H., Pete rsen D., Quackenbush J., Scott A., Wilson M., Yang Y., Ye S.Q. and Yu W. (2005) Multiple-lab oratory comparison of microarray platforms. Nat Methods, Vol. 2, pp.239-243. [23] Ramasamy A., Mondry A., Holmes C.C. and Altman D.G. (2008) Key Issues in Conducting a Meta-Analysis of Gene Expression Microarray Datasets, PLoS, Vol. 5, pp. 184-196. PAGE 240 [24] Warnat P., Eils R. and Brors B. (2006) Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics. Vol. 6, pp. 265. [25] Lamb J., Crawford E.D., Peck D., Modell J.W., Blat I.C., Wrobel M.J., Lerner J., Brunet J.P., Subramanian A., Ross K.N., Reich M., Hieronymus H., Wei G., Armstrong S.A., Haggarty S.J., Clemons P.A., Wei R., Carr S.A., Lander E.S. and Golub T.R. (2006) The Connectivity Map: Using Gene-Expression Signatures to Connect Sma ll Molecules, Genes, and Disease. Science. Vol. 313, pp. 1929-35. [26] Zhang S.D., Gant T.W. (2008) A simple and robust method for connecting small-molecule drugs using gene-expression signatures. BMC Bioinformatics, Vol. 9, pp. 258-264. [27] Xu L. and Jordan M.I. (1996) On convergence properties of the EM algorithm for Gaussian Mixtures. Neural Computation, Vol. 8, pp. 409-1431. [28] Ma S. and Dai Y. (2011) Principal component analysis based methods in bioinformatics studies, Bioinformatics, Vol. 10, pp.1-9. [29] Breitling R., Armengaud P., Amtma nn A. and Herzyk P. (2004) Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Letters, Vol. 573, pp. 83-92. [30] Hong F., Breitling R., McEntee C.W., Wittner B.S., Nemhauser J.L. and Chory J. (2006) RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis, Bioinformatics, Vol. 22, pp.2825-2827. [31] Zhang N.R., Wildermuth M.C. and Speed T.P. (2008) Transcription Factor Binding site prediction with multivariate gene expression data, Annua l applied statistic, Vol. 2, pp. 332-365. [32] Alexiou P., Maragkakis M., Papadopoulos G.L., Si mmosis V.A., Zhang L. and Hatzigeorgiou A.G. (2010) The DIANA-mirExTra Web Server: From Gene Expression Data to MicroRNA Function, PLOS ONE, Vol. 5, e.9172. [33] Das M.K. and Dai H.K. (2007) A survey of DNA motif finding algorithms, BMC Bioinformatics, Vol. 8, S21. PAGE 241 [34] Bailey T.L. and Elkan C. (1995). The value of pr ior knowledge in discovering motifs with MEME. Proc. Intl. Conf. Intel. Syst. Mol. Biol., 3, 21Â–29. [35] Kel A.E., Kel-Margoulis O.V., Farnham P.J., Bartley S.M., Wingender E. and Zhang M.Q. (2001) Computer-assisted identification of cell cycle-related genes: new targets for E2F transcription factors. Journal of Molicular Biology, Vol. 25, pp. 99-120. [36] Rhodes D.R., Barrette T.R., Rubin M.A., Ghosh D. and Chinnaiyan A.M. (2002) Meta-Analysis of Microarrays: Interstudy Validation of Gene Expr ession Profiles Reveals Pathway Dysregulation in Prostate Cancer. Cancer Res, Vol. 62, pp. 4427-4433. [37] Choi J.K, Yu U., Kim S. and Yoo O.J. (20 03) Combining multiple microarray studies and modeling inter study variation. Bioinformatics, Vol. 19, pp. 84-90. [38] Choi J.K., Choi J.Y., Kim D.G., Choi D.W., Kim B.Y., Lee K.H., Yeom Y.I., Yoo H.S., Yoo O.J. and Kim S. (2004) Integrative analysis of multiple gene expression profiles applied to liver cancer study. FEBS Letters, Vol. 565, pp. 93-100. [39] Hu P., Greenwood C.M. and Beyene J. (2005) In tegrative analysis of multiple gene expression profiles with quality-adjusted effect size models. BMC Bioinformatics, Vol. 6, pp. 128. [40] Marot G. and Mayer C.D. (2009) Sequential Anal ysis for Microarray Data Based on Sensitivity and Meta-Analysis. Berkeley Electronic Press, Vol. 8. [41] Strimmer K. and Opgen-Rhein R. (2007) Accurate Ranking of Differentially Expressed Genes by a Distribution-Free Shrinkage Approach. Be rkeley Electronic Press, Vol. 6. [42] Lin S. (2010) Space Oriented Rank-Based Data In tegration. Berkeley Elec tronic Press, Vol.9. [43] Parkhomenko E., Tritchler D. and Beyene J. (2009) Sparse Canonical Correlation Analysis with Application to Genomic Data Integratio n. Berkeley Electronic Press, Vol.8. [44] Su Z., Hong H., Fang H., Shi L., Perkins R. and Tong W. (2008) Very Important Pool (VIP) genes Â– an application for microarray-based molecular signatures. BMC Bioinformatics, Vol. 9, S9. [45] Rossell D., Guerra R., Scott C. (2008) Semi-Parametric Differential Expression Analysis via Partial Mixture Estimation. Berkeley Electronic Press, Vol. 7. PAGE 242 [46] Zhang S. (2006) An Improved Nonparametric Approach for Detecting Differentially Expressed Genes with Replicated Microarray Data. Berkeley Electronic Press, Vol. 5. [47] Tan F., Fu X., Zhang Y. and Bourgeois A.G. (2006) Improving Feature Subset Selection Using a Genetic Algorithm for Microarray Gene Expression Data. IEEE Congress on Evolutionary Computation. [48] Choi J.K., Choi J.Y., Kim D.G., Choi D.W., Kim B.Y., Lee K.H., Yeom Y.I., Yoo H.S., Yoo O.J. and Kim S. (2004) Integrative analysis of multiple gene expression profiles applied to liver cancer study. FEBS Letter, Vol. 565, pp. 93-100. [49] Huttenhower C., Hibbs M., Myers C. and Troyanskaya O.G. (2006) A scalable method for integration and functional analysis of multiple mi croarray datasets. Bioinformatics, Vol. 1, pp. 2890-2896. [50] Rodriguez-Zas S.L., Ko Y., Adams H.A. and Southey B.R. (2008) Advancing the understanding of the embryo transcriptome co-regulation using meta-, functional, and gene network analysis tools. Reproduction. Vol. 135, pp. 213-236 [51] Burguillo F.J., Martin J., Barrer a I. and Bardsley W.G. (2010) Meta-analysis of microarray data: The case of imatinib resistance in chronic myelogenous leukemia. Comput Biol Chem, Vol. 34(3), pp. 184-192. [52] Kadota K., Nakai Y. and Shimizu K. (2009) Ranking differentially expressed genes from Affymetrix gene expression data: methods with reproducibility, sensitivity, and specificity. Algorithms Mol Biol. Vol. 22. [53] Xu L., Tan A.C., Winslow R.L. and Geman D. (2008) Merging microarray data from separate breast cancer studies provides a robust prognostic test. BMC Bioinformatics, Vol. 27, pp. 9. [54] Hong F. and Breitling R. (2008) A comparis on of meta-analysis methods for detecting differentially expressed genes in microarray experiments. Bioinfomatics, Vol. 24, pp. 374-382. [55] Yuan T., Gui-xia L., Ming Z., Yi Z. and Chun-guang Z. (2010) In Proc. of the 2010 2nd International on Signal Processing Systems (ICSPS), Vol. 3, pp. 484-488. PAGE 243 [56] Xu L., Geman D. and Winslow R.L. (2007) Larg e-scale integration of cancer microarray data identifies a robust common cancer signature. BMC Bioinformatics, Vol. 8, pp. 275+. [57] Romdhane L.B., Ayeb B. and Wang S. (2002) On computing the fuzzifier in FLVQ: A data driven approach, International Journal of Neural Systems, Vol. 12, pp. 149-157. [58] Bezdek J.C., Pal L.R. and Hathaw ay R.J. (1996) Sequential comp etitive learning and the Fuzzy Cmeans clustering algorithms. Neural Networks, Vol. 9, pp. 787-796. [59] Bezdek J.C. and Pal N.R. (1995) Two soft relativ es of learning vector quantization. Neural Networks, Vol. 8, pp. 729-743. [60] Dembele D. and Kastner P. (2003) Fuzzy C-means method for clustering microarray data. Bioinformatics, Vol. 19, pp. 973Â–980. [61] Schwammle V. and Jensen O.N. (2010) A simple and fast method to determine the parameters for fuzzy C-Means cluster analysis, Bioinformatics, Vol. 26, pp. 2841-2848. [62] Zhao Q., Li Q. and Xing S. (2010) FCM Algorithm Based on the Optimization Parameters of Objective Function Point, Proc. of 2010 International Conference on Computing, Control and Industrial Engineering, Vol. 2. [63] Chiu S.L. (1994) Fuzzy Model Identification Base d on Cluster Estimation. Journal of Intelligent and Fuzzy Systems, Vol. 2, pp. 267-278. [64] Li J., Chu C.H., Wang Y. and Yan W. (2002) An Improved Fuzzy C-means Algorithm for Manufacturing Cell Formation, IEEE. [65] Liu W., Xiao C.J., Wang B.W., Shi Y. and Fang S.F. (2003) Study On Combining Subtractive Clustering With Fuzzy C-Means Clustering, In Proc. of the Second In ternational Conference on Machine Learning and Cybernetics. [66] Yang M.S. and Wu A.K.L. (2005) A modified mountain clustering algorithm, Pattern Anal Applic, Vol. 8, pp. 125Â–138. [67] Yang Q., Zhang D. and Tian F. (2010) An initialization method for fuzzy c-means algorithm using subtractive clustering. In Proc. of 2010 Third In ternational Conference on Intelligent Networks and Intelligent Systems. PAGE 244 [68] Cheng J.Y., Quin C. and Jia J. (2008) A weighted mean Subtractive Clustering, Info. Technology Journal, Vol. 7, pp. 356-360. [69] Collins F.S., Green E.D., Guttmacher A.E. and Gu ye M.S. (2003) A vision for the future of genomics research. Nature Vol. 422, pp. 835-847 [70] Bailey T.L. and Elkan C. (1995) The value of prior knowledge in discovering motifs with MEME. Proc. Intl. Conf. Intel. Syst. Mol. Biol. Vol. 3, pp. 21-29. [71] Bi C. (2007) A genetic-based EM motif-finding al gorithm for biological sequence analysis. Proc. IEEE Symp. Comput. Intel. Bioinf o. Comput. Biol pp. 275-282. [72] Chang X., Zhou W., Zhou C. and Liang Y. (2006) Prediction of transcription factor binding sites using genetic algorithm. 1st Conf. Ind. Elec. Apps. pp. 1-4. [73] Frith M.C., Saunders N.F., Kobe B. and Bailey T.L. (2008) Discovering sequence motifs with arbitrary insertions and deletions. PLoS Comput. Biol. Vol. 4, e1000071 [74] Henikoff J.G. and Henikoff S. (1996) Using substitution probabilities to improve position-specific scoring matrices. Comp. App. Biosci. Vol. 12, pp. 135-143. [75] Hong T.P. and Wu M.T. (2008) A hierarchical gene-set genetic algorithm. J. Comp. Vol. 3, pp. 6775. [76] Li L., Liang Y. and Bass R.L. (2007) GAPWM: a genetic algorithm method for optimizing a position weight matrix. Bioinformatics, Vol. 23, pp.1188-1194. [77] Li L., Bass R.L. and Liang Y. (2008) fdrMotif : identifying cis-elements by an EM algorithm coupled with false discovery rate control. Bioinformatics, Vol. 24, pp. 629-636. [78] Liu D, Xiong X., Dasgupta B and Zhang H. (2006) Motif discoveries in unaligned molecular sequences using self-organizing neural networks. IEEE Trans. Neural Networks, Vol. 17, pp.919928. [79] Nowakowski S. and Tiuryn J. (2007) A new ap proach to the assessment of the quality of predictions of transcription factor binding sites. J. Biomed. Info. Vol. 40, pp. 139-149. [80] Osada R., Zaslavsky E. and Singh M. (2004) Comparative analysis of methods for representing and searching for transcription factor binding sites. Bioinfor matics, Vol. 20, pp.3516-3525. PAGE 245 [81] Pisanti N., Crochemore M. Grossi R. and Sagot M.F. (2005) Bases of motifs for generating repeated patterns with w ildcards. IEEE/ACM Trans. Comput. Biol and Bioinfo. Vol. 2, pp.40-50. [82] Touzet H. and Varr J.S. (2007) Efficient and accurate P-value computation for position weight matrices. Algorithms for Mol. Biol. Vol. 2, pp. 15-26. [83] Wei Z. and Jensen S.T. (2006) GAME: Detecting cis-regulatory elements using a genetic algorithm. Bioninformatics Vol. 22, pp. 1577-1584. [84] Frank A. and Asuncion A. (2010) Machine Learning Repository, [Online], http://archive.ics.uci.edu/ml. [85] Cho R.J., Campbell M.J., Winzeler E.A., Steinmet z L., Conway A., Wodick a L., Wolfsberg T.G., Gabrielian A.E., Landsman D., Lockhart D.J. and Davis R.W. (1998) A genome-wide transcriptional analysis of the mitotic cellcycle. Molecular Cell, Vol. 2, pp. 65-73. [86] Mewes H.W., Frishman D., Gldener U., Mannha upt G., Mayer K., Mokrejs M., Morgenstern B., Mnsterktter M., Rudd S. and Weil B. (1999) MIPS: a database for protein sequences and complete genomes, Nucleic Acids Research, Vol. 27, pp. 44-48. [87] Wen X., Fuhrman S., Michaels G.S., Carr D.B., Sm ith S., Barker J.L. and Somogyi R. (1998) Large-scale temporal gene expression mapping of central nervous system development. Proceeding of the National Academy of Science USA, Vol. 95, pp. 334-339. [88] Troyanskaya O., Cantor M., Sherlock G., Brown P., Hastie T. and Tibshirani R. (2011) Missing value estimation methods for DNA microarrays. Bioinformatics, Vol. 17, pp. 520-525. [89] Shi F., Abraham G., Leckie C., Haviv I. and Kowalczyk A. (2011) Meta-analysis of gene expression microarrays with missing replicates, BMC Bioinformatics, Vol. 12, pp. 84-99. [90] Pedro J., Curado J. and Church G.M. (2009) Meta -analysis of age-related gene expression profiles identifies common signatures of aging, Bioinformatics, Vol. 25, pp. 875Â–881. [91] Zhang Z., Chen D. and Fenstermacher D. (2007) Integrated analysis of independent gene expression microarray datasets improves the predictability of breast cancer outcome, BMC Genomics, Vol. 8, pp.331-343. PAGE 246 [92] Bezdek J.C. (1981) Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, NewYork. [93] Fukuyama Y. and Sugeno M. (1989) A new method of choosing the number of clusters for the fuzzy c-means method. In Proc. of the Fifth Fuzzy Systems Symp., pp. 247Â–250. [94] Xie X.L. and Beni G. (1991) A validity measure for fuzzy clustering, I EEE Trans. Pattern Anal. Mach. Intell, Vol. 13, pp. 841Â–847. [95] Rezaee M.R., Lelieveldt B.P.F and Reiber J.H.C (199 8) A new cluster validity index for the fuzzy c-mean, Pattern Recognition Letter, Vol. 19, pp. 237Â–246. [96] Pakhira M.K., Bandyopadhyay S. and Maulik U. (2004) Validity index for crisp and fuzzy clusters, Pattern Recognition, Vol. 37, pp. 481Â–501. [97] Rezaee B. (2010) A cluster validity index for fuzzy clustering, Fuzzy Sets and Systems, Vol. 161, pp. 3014-3025. [98] Florea M.C., Jousselme A.L., Grenier D. and Boss e E. (2008) Approxima tion techniques for the transformation of fuzzy sets into random sets, Fuzzy Sets and Systems, Vol. 159, pp. 270Â–288. [99] Xu L. and Jordan M.I. (1996) On convergence properties of the EM algorithm for Gaussian mixtures, Neural Computation, Vol. 8, pp. 129-151. [100] Somogyi R., Wen X., Ma W. and Barker J.L. (1995) Developmental kinetic of GLAD family mRNAs parallel neurogenesis in the rat, Journal of Neurosciences, Vol. 15, pp. 2575-2591. [101] Wen X., Fuhrman S., Michaels G.S., Carr G.S., Sm ith D.B., Barker J.L., Somogyi R. (1998) Large scale temporal gene expression mapping of central nervous system development. In Proc. of the National Academy of Science USA, Vol. 95, pp. 334-339. [102] Yeung K.Y., Haynor D.R. and Ruzzo W. (2001) Validating clustering for gene expression data. Bioinformatics, Vol. 17, pp. 309Â–318. [103] Yeung K.Y., Fraley C., Murua A., Raftery A.E. and Ruzzo W.L. (2001) Model based clustering and data transformations for gene expression data, Bioinformatics, Vol. 17, pp. 977-987. [104] Mewes H.W., Hani J., Pfeiffer F. and Frishman D. (1998) MIPS: A database for protein sequences and complete genomes, Nucleic Acid s Research, Vol. 26, pp. 33-37. PAGE 247 [105] Liu W.Y., Xaio C.J., Wang B.W., Shi Y. and Fang S.F. (2003) Study On Combining Subtractive Clustering With Fuzzy C-Means Clustering, in Proc. of Machine Learning and Cybernetics, International Conference, Xi'an, pp. 2659Â–2662. [106] Yang M.S. and Wu K.L. (2005) A modified mountain clustering algorithm, Pattern Anal Applic, Vol. 8, pp. 125Â–138. [107] Chen J.Y., Quin Z. and Jia J. (2008) A weighted mean s ubtractive clustering algorithm, Information Technology, Vol. 7, pp. 356Â–360. [108] Tuan C.C., Lee J.H. and Chao S.J. (2009) Using Nearest Neighbor Method and Subtractive Clustering-Based Method on Antenna-Array Selec tion Used in Virtual MIMO in Wireless Sensor Network, in Proc. of the Tenth International Conference on Mobile Data Management: Systems, Services and Middleware, Taipei, pp. 496Â–501. [109] Collazo J.C., Aceves F.M., Gorr ostieta E.H., Pedraza J.O., Soto mayor A.O. and Delgado M.R. (2010) Comparison between Fuzzy C-means clustering and Fuzzy Cl ustering Subtractive in urban air pollution, in Proc. of Electronics, Communications and Computer (CONIELECOMP), 2010 20th International Conference, Cholula, pp. 174Â–179. [110] Yang Q., Zhang D. and Tian F. (2010) An in itialization method for Fuzzy C-means algorithm using Subtractive Clustering, Third Interna tional Conference on Intelligent Networks and Intelligent Systems, Vol. 10, pp. 393Â–396. [111] Li J., Chu C.H., Wang Y. and Yan W. (2002) An Improved Fuzzy C-means Algorithm for Manufacturing Cell Formation, Fuzzy Systems, Vol. 2, pp. 1505Â–1510. [112] Loquin K. and Strauss O. (2008) Histogram density estimators based upon a fuzzy partitionÂ”, Statistics and Probability Letter s, Vol. 78, pp. 1863Â–1868. [113] Houte B.P.P. and Heringa J. (2010) Accurate co nfidence aware clustering of array CGH tumor profiles, BioInformatics, Vol. 26, pp. 6Â–14. [114] Klinge C.M. (2001) Estrogen rece ptor interaction with estrogen response elements. Nucleic Acids Research, Vol. 29, pp.2905-2919. PAGE 248 [115] Salama R.A. and Stekel D.J. (2010) Inclusion of neighboring base interdependencies substantially improves genome-wide prokaryotic transcription factor binding site prediction. Nucleic Acids Research, Vol. 38, pp. e135. [116] Wang H., He X., Band M., Wilson C. and Liu L. (2005) A study of inter-lab and inter-platform agreement of DNA microarray data. BMC Genomics, Vol. 6, pp. 71. [117] Fisher R.A. (1925) Statistical methods for research workers, 13 th ed. Oliver & Loyd, London. [118] Tavazoie S., Hughes J.D., Campbell M.J., Cho R.J. and Church G.M. (1999) Systematic determination of genetic network architecture. Nat. Genet. 22, 281Â–285. [119] Dayanga-Erden D., Bora G., Ayhan P., Kocaefe ., Dalkara S., Yeleki K., Demir A.S. and Erdem-Yurter H. (2009), Histone Deacetylase In hibition Activity and Molecular Docking of (E )Resveratrol: Its Therapeutic Potential in Spinal Muscular Atrophy. Chemical Biology & Drug Design, Vol. 73, pp. 355Â–364. [120] Faumont N., Durand-Panteix S., Schlee M., Grm minger S., Schuhmacher M., Hlzel M., Laux G., Mailhammer R., Rosenwald A., Staudt L.M., Bornkamm G.W. a nd Feuillard J. (2009) c-Myc and Rel/NF-kappaB are the two master transcriptiona l systems activated in the latency III program of Epstein-Barr virus-immortalized B cells. Journal of Virology, Vol. 83, pp. 5014-5027. [121] Dunn J.C. (1974) A fuzzy relative of the ISODAT A process and its use in detecting compact wellseparated clusters, J. Cybernet, Vol. 3, pp. 32-57. [122] Hoppner F., Klawomm F., Kruse R. and Runkler T. (2000) Fuzzy cluster analysis Â– methods for classification, data analysis and image recognition, John Wiley and Son Ltd. [123] Dunn J. (1974) Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, Vol. 4, pp. 95-104. [124] Wang W. and Zhang Y. (2007) On fuzzy cluster validity indices, Journal of Fuzzy Sets and Systems, Vol. 158, pp. 2095 Â– 2117. [125] Gath I. and Geva A.B. (1989) Unsupervised Optimal Fuzzy Clustering, Pattern Analysis and Machine Intelligence, Vol.11, pp.773-781. PAGE 249 [126] Krishnnapuram R. and Freg C.P. (1992) Fitting an unknown number of lines and planes to image data through compatible cluster merging. Pattern Recognition, Vol. 25, pp. 385-400. [127] Kim Y., Kim D., Lee D. and Lee K.H. (2004) A cl uster validation index for GK cluster analysis based on relative degree of sharing. Information Science, 168, pp. 225-242. [128] Kaymak U. and Babu ska R. (1995) Compatible cluster merging for fuzzy modeling, In Proc. IEEE Int. Conf. Fuzzy System, Yokoha ma, Japan, Mar. 1995, pp. 897Â–904. [129] Tasdemir K. and Merenyi E. (2011) Exploiting Data Topology in Visualization and clustering of Self-Organizing Maps, Neural Networks, Vol. 20, pp. 549-562. [130] Kaymak U. and Setnes M. (2002) Fuzzy Clusteri ng With Volume Prototypes and Adaptive Cluster Merging, Fuzzy Systems, Vol. 10, pp. 705-711. [131] Iam-on N., Boongoen T. and Garrett S. (2010) LCE: a link-based cluster ensemble method for improved gene expression data analysis, Bioinformatics, Vol. 26, pp. 1513Â–1519. [132] Valente de Oliveira J. and Pedrycz W. (2007) Advances in Fuzzy Clustering and its Applications, John Wiley & Sons Ltd. [133] Bezdek J.C. (1980) A Convergence Theorem fo r the Fuzzy ISODATA Clustering Algorithms, Pattern analysis and Machine intelligence, Vol. 2, pp. 1-8. [134] Bezdek J.C., Hathaway R.J., Sa bin M.J. and Tucker W.T. (1987 ) Convergence Theory for Fuzzy cMeans: Counterexamples and Repairs, Systems, Man AND Cybernetics, Vol. 17, pp. 873-877. [135] Cohen J., Cohen P., West S.G. and Aiken L.S. (2002) Applied multiple regression/correlation analysis for the behavioral sciences, 3rd ed. Psychology Press. [136] Hedges L.V. and Olkin I. (1985) Statistical Meth ods for Meta-Analysis, Or lando: Academic Press. [137] Li D., Zhong C. and Zhang L. (2010) Fuzzy c-means clustering of partially missing data sets based on statistical representation, In Proc. of Intl' Conf. Fuzzy Systems and Knowledge Discovery, pp. 460-464, Dalian, China Aug 2010. [138] Te-Shun C., Yen K.K., Liwei A., Pissinou N. and Makki K. (2007) Fuzzy belief pattern classification of incomplete data Systems, pp. 535-540, In Proc. Intl' Systems, Man and Cybernetics, Oct 2007 Montreal CA. PAGE 250 [139] Garc a-Laencina P.J., Sancho-Gomez J.L. and Figueir as-Vidal A.R. (2010) Pattern classification with missing data: a review, Neural Compu ting & Applications, Vol. 19, pp. 263Â–282. [140] Hathaway R.J. and Bezdek J.C. (2001) Fuzzy c-Means Clustering of Incomplete Data, Systems, Man and Cybernetics, Vol. 31, pp. 735-744. [141] Luo J.W., Yang T. and Wang Y. (2005) Missing Value Estimation For Microarray Data Based On Fuzzy C-means Clustering, In Proc. Intl' Conf. on High-Performance Computing, Changsha China, Dec 2003. [142] Mohammadi A. and Saraee M.H. (2008) Estimating Missing Value in Microarray Data Using Fuzzy Clustering and Gene Ontology, In Proc. Intl' Conf. on Bioinformatics and Biomedicine, pp. 382-385, Washington, DC, USA. [143] Himmelspach L. and Conrad S. (2010) Clustering approaches for data with missing values: Comparison and evaluation, In Proc. of Intl' conf. on Digital Information Management, pp. 19-28, Thunder Bay, ON. [144] Wu K.L. (2010) Parameter Selections of Fuzzy C-Means Based on Robust Analysis, Engineering and Technology, Vol. 65, 554-557. [145] Le T., Altman T. and Gardiner J.K. (2010) HI GEDA: a hierarchical gene-set genetics based algorithm for finding subtle motifs in biological sequences, Bioinformatics, Vol. 26, pp.302-309. [146] Le T. and Altman T. (2011) A new initializati on method for the Fuzzy CMeans Algorithm using Fuzzy Subtractive Clustering, In Proc. of the 20 11 International Confer ence on Information and Knowledge Engineering, pp. 144Â–150, Las Vegas, Nevada, USA. [147] Le T. and Gardiner J.K. (2011) A validation met hod for fuzzy clustering of gene expression data, In Proc. of the 2011 International Conference on Bioinformatics & Computational Biology, Vol. I, pp. 23Â–29, Las Vegas, Nevada, USA. [148] Bodjanova S. (2004) Linear intensification of probabilistic fuzzy partitions, Fuzzy Sets and Systems, Vol. 141, pp. 319-332. [149] Chiu W.Y. and Couloigner I. (2006) Modified fuzzy c-means classification technique for mapping vague wetlands using Landsat ETM+ imagery, Hydrol. Process., Vol. 20, pp. 3623Â–3634. PAGE 251 [150] Chuang K.S., TZeng H.L., Chen S., Wu J. and Ch en T.J. (2006) Fuzzy cmeans clustering with spatial information for image segmentation, Com puterized Medical Imagin g and Graphics, Vol. 30, pp. 9-15. [151] Genther H., Runkler T.A. and Glesner M. (1994) Defuzzification based on fuzzy clustering, Fuzzy Systems, Vol. 3, 1646-1648. [152] Roychowdhury S. and Pedrycz W. (2001) A Su rvey of Defuzzification Strategies, Intelligent Systems, Vol. 16, pp. 679-695. [153] Yang Y., Huang S. and Rao N. (2008) An Automatic Hybrid Method for Retinal Blood Vessel Extraction, Applied Mathematics and Computer Science, Vol. 18, pp. 399-407. [154] Le T., Altman T and Gardiner J.K. (2012) Pr obability-based imputation method for fuzzy cluster analysis of gene expression microarray data, Proc. Intl' Conf. on Information Technology-New Generations, pp. 42-47, Las Vegas, NV, USA. [155] Le T., Altman T. and Gardiner J.K. (2012) Density-based imputation method for fuzzy cluster analysis of gene expression microarray data, Proc. Intl' Conf. on Bioinformatics and Computational Biology, pp. 190-195, Las Vegas, NV, USA. [156] Le T., Altman T. and Gardiner J.K. (2012) A Probability Based Defuzzification Method for Fuzzy Cluster Partition, Proc. Intl' Conf. on Artificia l Intelligence (WORLDCOMP-ICAI'12), Vol. 2, pp. 1038-1043, Las Vegas, NV, USA. [157] Le T., Altman T. and Gardiner J.K. (2012) A Fu zzy Clustering Method using Genetic Algorithm and Fuzzy Subtractive Clustering, Proc. Intl' Conf. on Information and Knowledge Engineering (WORLDCOMP-IKE'12), pp. 426-432, Las Vegas, NV, USA. [158] Luo J.W., Yang T. and Wang Y. (2006) Missing Value Estimation For Microarray Data Based On Fuzzy C-means Clustering, Proc. Intl' Conf. High-Performance Computing (HPCASIA'05), IEEE Press, Feb. 2006, pp. 11-16. [159] Kim D.W, Lee K.Y., Lee K.H. and Lee D. (2007) Towards clustering of incomplete microarray data without the use of imputation, Bioinformatics, Vol. 23, Jan. 2007, pp. 107-113. PAGE 252 [160] Mohammadi A. and Saraee M.H (2008) Estimating Missing Value in Microarray Data Using Fuzzy Clustering and Gene Ontology, Proc. Intl' Conf. Bioinformatics and Biomedicine (BIBM'08), IEEE Press, Nov. 2008 pp. 382-385. [161] Mehdizadeh E, Sadi-Nezhad S And Tavakkoli-Moghaddam R (2008) Optimization of Fuzzy Clustering Criteria by a Hybrid Pso and Fuzzy C-Means Clustering Algorithm, Iranian Journal of Fuzzy Systems, Vol. 5, pp. 1-14. [162] Halder A., Pramanik S., and Kar A. (2011) Dy namic Image Segmentation using Fuzzy C-Means based Genetic Algorithm, International Journal of Computer Applications, Vol. 28, pp. 15-20. [163] Ghosh A., Mishra N. S., and Ghosh S. (2011) Fuzzy clustering algorithms for unsupervised change detection in remote sensing images, Information Sciences, Vol. 181, pp. 699-715. [164] Lianjiang Z., Shouning Q., and Tao D. (2010) Adaptive fuzzy clustering based on genetic algorithm, In Proc. of 2nd conference on advanced computer control, Shenyang China, pp. 79-82. [165] Lin T.C., Huang H.C., Liao B.Y., and Pan J.S. (2007) An Optimized Approach on Applying Genetic Algorithm to Adaptive Cluster Validity Inde x, International Journal of Computer Sciences and Engineering Systems, Vol. 1, pp. 253-257. [166] Liu Y. and Zhang Y. (2007) Optimizing Parameters of Fuzzy c-Means Clustering Algorithm, In Proc. of the 4th conference on Fuzzy Systems and Knowledge Discovery (FSKD '07), Vol. 1, 2007, pp. 633-638. [167] Sturgeon X., Le T. and Gardiner J.K. (2010) Pathways to Intellectual Disability in Down Syndrome, Tenth Annual Coleman Institute Conference (The Coleman Institute for Cognitive Disabilities, University of Colorado), Westminster, Colorado USA, 10/2010. [168] Sturgeon X., Le T., Ahmed M. and Gardiner J.K. (2012) Pathways to cognitive deficits in Down syndrome, Progress in Brain Research Elsevier, Vol. 197, pp. 73-100. [169] Le T. and Altman T. (in preparation) RPPR: a rank product statistics with page rank algorithm combination method for expression profiles selection using gene expression signature. [170] Le T., Leach S. and Gardiner J.K. (in preparation) A new approach of Connectivity MAP to discover gene-drug association using gene expression microarray data. PAGE 253 [171] Gardiner J.K. (2010) Molecular basis of pharmacotherapies for cognition in Down syndrome, Trends Pharmacol Sci., Vol. 31, pp. 31: 66-73. [172] Dubois D. and Prade M. H. (1988) Possibility theory : an approach to computerized processing of uncertainty, Plenum Press. [173] Dubois D. and Prade M. H. (1980) Fuzzy sets an d systems: Theory and applications, Mathematics in Science and Engineering, Accademic Press. [174] Wolkenhauer O. (1998) Possibility theory with applications to data analysis, John Wiley & Sons Inc. [175] Chu C.Y. and Rana T.M. (2006) Translation Repression in Human Cells by MicroRNA-Induced Gene Silencing Requires RCK/p54, PLoS Biololy, Vol. 4, pp. 1122-1136. [176] Velusamy K. and Manavalan R. (2011) Performance Analysis of Unsupervised Classification Based on Optimization, Int'l Journal of Computer Applications, Vol. 42, pp. 22-27. [177] Rana S., Jasola S. and Kumar R. (2011) A review on particle swarm optimization algorithms and their applications to data clustering, Artif Intell Rev, Vol. 35, pp. 211-222. [178] Xie J., Li K.C. and Bina M. (2004) A Bayesian Insertion/Deletion Algorithm for Distant Protein Motif Searching via Entropy Filtering, Journal of th e American Statistical Association, Vol. 99, pp. 409-420. [179] Klir G.J. and Yuan B. (1995) Fu zzy Sets and Fuzzy Logic: Theory and Applications, Prentice Hall. [180] Le T, Altman T., Vu L. and Gardiner J.K. (in preparation) A cluster validation method using GO for gene expression data analysis. [181] Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium, Nat Genet, Vol. 25(1), pp. 25-29. [182] Tari L., Baral C. and Kim S. (2009) Fuzzy c-means clustering with prior biological knowledge, J. Biomed Inform., Vol. 42, pp. 74-81. PAGE 254 [183] Xu T., Du L.F., and Zhou Y. (2008) Evaluation of GO-based functional similarity measures using S. cerevisiae protein interaction and expression profile data, BMC Bioinformatics, Vol. 9, pp. 472481. [184] Wang J.Z., Du Z., Payattakool R., Yu P.S and Chen C.F. (2007) A new method to measure the semantic similarity of GO terms, Bioinformatics, Vol. 23, pp. 1274-1281. [185] Yu G., Li F., Qin Y. et al. (2010) GOSemSim: an r package for measuring semantic similarity among GO terms and gene products, Bioinformatics, Vol. 26, pp. 976-983. [186] Krucker T., Siggins G.R. and Halpain S. (2000) Dynamic actin filaments are required for stable long-term potentiation (LTP) in area CA1 of the hippocampus, PNAS, Vol. 97, pp. 6856-6861 |