METHODS FOR PREDICTION OF PROTEIN INTERFACES AND
FUNCTIONS
by
CAO DANG NGUYEN
M.S., Vietnam National University, Ho Chi Minh City, 2000
A thesis submitted to the
University of Colorado at Denver and Health Sciences Center
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Computer Science and Information Systems
Summer 2007
This thesis for the Doctor of Philosophy
degree by
Cao D. Nguyen
has been approved
by
Dr. Ellen Gethner
Dr. Michael Mannino
Cao D. Nguyen (Ph.D., Computer Science and Information Systems)
Methods For Prediction of Protein Interfaces and Functions
Dissertation directed by Professor Krzysztof J. Cios
ABSTRACT
Proteins play key roles in all activities that take place within a living cell. One of the
main goals of proteomics research is to understand protein functions. Critical to
reaching this goal is information on protein-protein interactions. Protein-protein
interactions play a major role in many biological processes. Identifying the sites of
interaction in a protein is an important step in understanding its functional
mechanisms, and is crucial information for drug design. This work involves design of
methods for prediction of protein interfaces and for prediction of protein functions.
The main contributions are as follows:
1) a novel method based on the Hidden Markov Model (HMM) for predicting
interfaces of a protein chain unit that participates in protein complexes.
2) a novel approach for assigning functions to proteins by using a combination of
clustering and Fuzzy Cognitive Maps, namely ClusFCM.
3) a hybrid clustering method, namely GAKREM, a combination of a genetic
algorithm, k-means, logarithmic regression, and the expectation maximization
clustering method.
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.
Signed
Krzysztof J. Cios
DEDICATION
To my parents who have inspired in me the enthusiasm for learning and taught me the
value of perseverance and resolve.
To my brothers and sisters.
To my wife for her support and understanding during my studies.
To my son for his future inspiration.
ACKNOWLEDGEMENTS
First of all, I give utmost thanks to my great advisor, Dr. Krzysztof Cios, who
has continually encouraged me to achieve great things and supported me at every step
of my study. From the bottom of my heart, thank you for everything you have done
for me!
My appreciation goes to my fantastic co-advisor, Dr. Katheleen Gardiner, who
has patiently taught me biology. Parts of this work would not have been possible
without her.
I am thankful to Dr. Michael Mannino for his valuable comments that helped
in improving my dissertation. I also thank Dr. Tom Altman and Dr. Ellen Gethner for
their very much appreciated input that helped me to clarify the subject.
I have always received support from my father's friends: Mr. Dinh Lai, Dr.
Duong Nguyen, Mr. Son Nguyen, Mr. Tien Nguyen, and my best friend, Mr. Hien
Huynh, since the time I first arrived in America. I am very grateful to all of them.
I acknowledge financial support from the Vietnamese Ministry of Education
and Training and from the Butcher Foundation for my study.
Last, but not least, I would like to thank my parents, brothers and sisters for
their constant support. I thank my wife for her persistent confidence and
understanding. I thank my son for the joy he has brought to my life. Without them I
could not have achieved what I did.
TABLE OF CONTENTS
Figures........................................................................ix
Tables .........................................................................x
Chapter
1. Introduction.................................................................1
1.1 Overview....................................................................1
1.2 Background Information.....................................................2
1.2.1 Biological background....................................................2
1.2.1.1 Protein Structure.......................................................2
1.2.1.2 Protein Interface.......................................................8
1.2.1.3 Protein-Protein Interaction Network.....................................9
1.2.2 Mathematics background...................................................10
1.2.2.1 Hidden Markov Model (HMM)..............................................10
1.2.2.2 Clustering.............................................................14
1.2.2.3 Genetic Algorithms.....................................................16
1.2.2.4 Fuzzy Cognitive Maps...................................................17
1.2.3 Related Research.........................................................17
1.2.3.1 Prediction of Protein Interfaces.......................................17
1.2.3.2 Prediction of Protein Function.........................................18
1.2.3.3 Hybrid Clustering Methods..............................................20
2. Prediction of Protein Interfaces...........................................21
2.1 Overview..................................................................21
2.2 Methods...................................................................21
2.2.2 Definitions.............................................................21
2.2.3 HMM architecture........................................................22
2.2.3.1 HMM training phase.....................................................23
2.2.3.2 HMM prediction phase...................................................25
2.2.4 Assessment of predictions...............................................25
2.4 Results and Discussion...................................................26
2.4.1 HMM prediction results.................................................26
2.4.2.1 HMM comparison with two-stage classifier on 139-protein dataset......28
2.4.2.2 HMM comparison with two-stage classifier on 77-protein dataset.......29
3. Prediction of Protein Functions............................................35
3.1 Overview..................................................................35
3.2 Methods..................................................................35
3.2.1 Materials..............................................................35
3.2.2 ClusFCM Algorithm......................................................38
3.2.2.1 PPI Network Definition................................................38
3.2.2.2 Stage 1...............................................................39
3.2.2.3 Stage 2...............................................................41
3.2.3 Description of methods used for comparison............................43
3.2.3.1 Majority..............................................................43
3.2.3.2 χ²-statistics.........................................................43
3.2.3.3 FunctionalFlow....................................................... 43
3.2.3.4 MRF...................................................................44
3.2.4 Assessment of prediction...............................................45
3.3 Results and Discussion................................................46
4. GAKREM: a Clustering Algorithm that Automatically Generates
a Number of Clusters........................................................52
4.1 Overview..................................................................52
4.2 Methods..................................................................53
4.2.1 Definitions............................................................53
4.2.2 GAKREM Algorithm.......................................................54
4.2.2.1 EM and K-means algorithms overview....................................54
4.2.2.2 GAKREM's Pseudocode...................................................57
4.3 Results and Discussion...................................................61
4.3.1 Experiments on manuallygenerated data..................................61
4.3.1.1 The performance of GAKREM.............................................61
4.3.1.2 GAKREM versus Likelihood Cross-Validation technique...................70
4.3.1.3 GAKREM versus EM and K-means algorithms...............................72
4.3.2 Experiments on automaticallygenerated data............................74
4.3.2.1 The performance of GAKREM.............................................75
4.3.2.2 GAKREM versus Likelihood Cross-Validation technique...................77
4.3.2.3 GAKREM versus EM and Kmeans algorithms...............................78
5. Conclusions and Future Work...............................................81
Appendix
A. 139 protein units used in HMM methods......................................83
References....................................................................85
LIST OF FIGURES
Figure
1.1: Amino acid......................................................3
1.2: Twenty amino acids and their properties are shown in a Venn diagram 3
1.3: Dehydration synthesis reaction..................................4
1.4: Peptide bond and peptide unit...................................5
1.5: A diagram of the α helix structure of amino acids...............5
1.6: Diagram of β sheet structure of protein (taken from Berg et al., 2002). 7
1.7: Loop regions connect beta strands and alpha helices.............7
1.8: Tertiary structure of protein PDB id 1AEH, taken from Protein Data Bank .8
1.9: Two subunits are shown (PDB id 1MSB)........................... 8
1.10: The yeast two-hybrid system and its high-throughput applications ..10
1.11: Hidden Markov Model...........................................11
1.12: Starting with a dataset on the left...........................16
2.13: Example distribution of interfaces and non-interfaces according .... 22
2.14: HMM architecture..............................................23
2.15: Example of transformation.....................................25
2.16: Distribution of precision and recall..........................27
2.17: Overall view of complexes with identified interfaces..........30
2.18: Overall view of the ligand HEMK in the target 20 (heterodimer) .... 31
2.19: HMM architectures for monomer grouping........................33
3.20: GO is a hierarchical structure consisting of three related ontologies 38
3.21: Clusters generated by the ClusFCM algorithm...................41
3.22: In this FCM, proteins shown in black are annotated............42
3.23: TPs are shown as the white circle (intersection)..............45
3.24: Results of ClusFCM prediction for different thresholds........46
3.25: Comparison of ClusFCM results with four other methods.........49
3.26: Performance of ClusFCM and the other methods..................50
4.27: Old Faithful data and the 2-cluster mixture...................62
4.28: GAKREM's behavior on the Old Faithful dataset.................62
4.29: GAKREM obtained the correct number of clusters ...............66
4.30: The behavior of GAKREM on the tested datasets in a trial......69
4.31: A comparison for determination of the best number of clusters.72
4.32: Average maximum log-likelihood functions .....................73
4.33: The drawback of sensitive initialization......................74
4.34: An example of the drawback of sensitive initialization........77
4.35: The accuracy of GAKREM........................................78
4.36: Comparison of average maximum log-likelihood functions........80
LIST OF TABLES
Table
2.1: Performance measures of HMM method on 139-protein dataset........27
2.2: Yan et al.'s SVM first stage tested on 139-protein dataset.......28
2.3: Yan et al.'s Bayesian second stage tested on 139-protein dataset.29
2.4: A comparison of the HMM method with Yan et al. (2004)............29
2.5: Results of the HMM method when monomers are grouped by...........32
3.6: The 39 GO terms used in this study...............................36
3.7: The number of proteins in the Yeast network that have at least:..37
3.8: Average number of common functions of interacting proteins:......39
3.9: Performance measures of ClusFCM on three datasets................46
3.10: Performance measures of ClusFCM and the other methods...........47
3.11: The number of TP, FP, TN and FN of ClusFCM and the other methods 48
3.12: The AUC values of ClusFCM and the other methods.................49
3.13: 95% confidence intervals .......................................51
3.14: The result of ClusFCM and the other methods.....................51
1. Introduction
1.1. Overview
The main research objective of this work is the development of methods for
prediction of protein interfaces and prediction of protein functions. Additionally, we
propose a hybrid approach to clustering that addresses disadvantages of existing
clustering algorithms such as expectation maximization (EM) and K-means. In
particular, the work includes:
1) a method based on the Hidden Markov Model (HMM) for predicting
interfaces of a protein chain unit that participates in protein complexes. The
HMM is designed by combining biological characteristics, such as structural
information, accessible surface area and transition probability among amino
acids, of the sequences neighboring a target residue.
2) a novel approach for assigning functions to proteins that combines a clustering
technique and a fuzzy cognitive map approach (ClusFCM). The method
exploits the fact that proteins of known function and cellular location tend to
cluster together and that protein functions can be predicted not only through
interaction partners but also from other connected proteins in the network.
3) a hybrid clustering method, called GAKREM, which is a combination of a
genetic algorithm, K-means, logarithmic regression, and expectation
maximization.
One of the most important problems in the post-genomic era is determining protein
function. With the development of high-throughput techniques, such as the Yeast
Two-Hybrid method and improved mass spectrometry, large protein interaction
datasets are being generated rapidly (Uetz et al., 2000; Ito et al., 2001). These provide
opportunities for inference of protein function. Protein-protein interactions are
important for many biological processes, from the formation of cellular
macromolecular structures and enzymatic complexes to the regulation of signal
transduction pathways.
To understand the physiological function of a protein we must first identify protein
interfaces, defined as the domains for selective recognition of interacting molecules
and for the formation of complexes. Identification of these interfaces is important not
only for understanding protein function but also for understanding the effects of
mutations and for efficient drug design. Currently, the only reliable method for
predicting interfaces requires knowledge of the tertiary structures of protein
complexes. However, because such information requires sophisticated NMR and
X-ray diffraction studies, only a small fraction of known protein sequences have tertiary
structures fully determined. Thus, there is a need for accurate methods for prediction
of interfaces (Valencia and Pazos, 2002).
Another contribution of this research is a hybrid clustering method. Clustering is one
of the most difficult and important methods for data analysis, and is broadly used in
many fields, including data mining/machine learning, pattern recognition, and
bioinformatics. However, clustering, as an unsupervised learning method, has many
practical shortcomings. For example:
- it is difficult to specify the correct number of clusters to be generated;
- the clustering result is very sensitive to the initial choice of parameters,
including the similarity measure, initial cluster centers, etc.;
- the resulting solution may be significantly inferior to the global optimum.
To address these problems we use a combination of machine learning techniques,
including k-means, expectation maximization, logarithmic regression, and genetic
algorithms, that results in a clustering method avoiding the above weaknesses.
1.2 Background Information
1.2.1 Biological background
1.2.1.1 Protein Structure
1) Amino Acids
Amino acids are the basic units of proteins. They form polymer chains called
peptides which in turn form longer molecules called proteins. Proteins are produced
by translation from an mRNA template. There are 20 amino acids encoded by the
standard genetic code. All of them have in common a central carbon atom (Cα), a
hydrogen atom, an amino group (NH2), and a carboxyl group (COOH) (see Figure
1.1).
Figure 1.1: Amino acid.
What distinguishes one amino acid from another is the side chain R attached to the Cα.
Amino acids are usually classified by the properties of the side chain into four groups:
weak acid, weak base, hydrophile, and hydrophobe (Doolittle, 1989; David et al.,
2000), or into eight groups: tiny, small, aliphatic, aromatic, nonpolar, charged, polar,
positive (Taylor, 1968). Symbols of the 20 amino acids and a Venn diagram
conveying several of their properties are listed below (Figure 1.2):
Alanine (Ala/A),
Arginine (Arg/R),
Asparagine (Asn/N),
Aspartic Acid (Asp/D),
Cysteine (Cys/C),
Glutamic Acid (Glu/E),
Glutamine (Gln/Q),
Glycine (Gly/G),
Histidine (His/H),
Isoleucine (Ile/I),
Leucine (Leu/L),
Lysine (Lys/K),
Methionine (Met/M),
Phenylalanine (Phe/F),
Proline (Pro/P),
Serine (Ser/S),
Threonine (Thr/T),
Tryptophan (Trp/W),
Tyrosine (Tyr/Y),
Valine (Val/V)
2) Protein
A protein is a molecule that consists of amino acids joined by peptide bonds.
Proteins are essential structural and functional components of all living cells and
viruses. Amino acids are linked to form proteins using information encoded in the
DNA of genes. Genes are transcribed into RNA; the RNA is then subject to
post-transcriptional modification and splicing to produce a mature mRNA that undergoes
translation into a protein (Branden and Tooze, 1991).
a) Primary structure:
The amino acid sequence of a protein is called its primary structure. Amino acids are
joined end to end during protein synthesis by formation of peptide bonds. The
carboxy group of the first amino acid condenses with the amino group of the next to
eliminate water and yield a peptide bond (see Figure 1.3). This process is repeated as
the chain elongates. The two ends of the amino acid chain are referred to as the amino
terminus (Nterminus) and the carboxy terminus (Cterminus) based on the nature of
the free group on each extremity and the direction of synthesis.
R-COOH + H2N-R′ → R-CO-NH-R′ + H2O
Figure 1.3: Dehydration synthesis reaction.
The formation of a succession of peptide bonds generates a main chain, or backbone,
from which the various side chains project. The main-chain atoms are a carbon atom
Cα to which the side chain is attached, an NH group bound to Cα, and a carbonyl
group, C=O, also bound to Cα.
Peptide units
The polypeptide chain is divided into peptide units that go from one Cα atom to the
next Cα (see Figure 1.4). Each Cα atom thus belongs to two such units. The reason for
dividing the chain in this way is that all the atoms in such a unit are fixed in a plane
with the bond length and bond angles very nearly the same in all units in all proteins
(Branden and Tooze, 1991).
Figure 1.4: Peptide bond and peptide unit.
Since the peptide units are effectively rigid groups that are linked into a chain by
covalent bonds at the Cα atoms, the only degrees of freedom they have are rotations
around these bonds. Each unit can rotate around two such bonds: the Cα-C′ (ψ) and
the N-Cα (φ) bond.
b) Secondary structure:
Secondary structure describes the general three-dimensional form of local regions or
overall shape of biopolymers. The secondary structure of a protein may include
regions of alpha helices, beta sheets and loop regions (Kabsch and Sander, 1983).
α helix
The α helix is the essential element of protein structure. α helices in proteins are
found when a stretch of consecutive residues all have the φ, ψ angle pair
approximately −60° and −50°, respectively. The α helix has 3.6 residues per turn
(corresponding to 5.4 Å, 1.5 Å per residue) with hydrogen bonds between the C=O of
residue n and the NH of residue n+4 (see Figure 1.5). Thus all NH and C=O groups are
joined with hydrogen bonds except the first NH groups and the last C=O groups at the
ends of the α helix (Branden and Tooze, 1991).
Figure 1.5: A diagram of the α helix structure of
amino acids (created by http://www.jmol.org).
β sheets
The second major structural element found in globular proteins (proteins that are
relatively spherical in shape with no systematic structures) is the β sheet. This
structure is built up from a combination of several regions of the polypeptide chain, in
contrast to the α helix, which is built up from one continuous region. The β sheet
consists of two or more nonconsecutive stretches of amino acids within the same
protein that are arranged adjacently and in parallel, with alternating orientation such
that hydrogen bonds can form between the two strands. The amino acid chain is fully
extended throughout a β strand. The NH groups in the backbone of one strand
establish hydrogen bonds with the C=O groups in the backbone of the adjacent,
parallel strand(s). The Cα atoms of adjacent strands are 3.5 Å apart. (Branden and
Tooze, 1991)
Figure 1.6: Diagram of β sheet structure of protein (taken from Berg et al., 2002).
Loop regions
Most protein structures are built up from combinations of secondary structure
elements, alpha helices and beta strands, which are connected by loop regions of
various lengths and irregular shape. The loop regions are at the surface of the
molecule. (Branden and Tooze, 1991)
Figure 1.7: Loop regions connect beta strands and alpha helices (a beta-alpha-beta unit).
c) Tertiary structure
Secondary structure elements are connected into simple motifs, such as
helix-loop-helix and beta-alpha-beta. Several motifs usually combine to form compact globular
structures, which are called domains. The fundamental unit of tertiary structure is the
domain (see Figure 1.8).
Figure 1.8: Tertiary structure of protein PDB id 1AEH, taken from the Protein Data Bank
(http://www.rcsb.org).
1.2.1.2 Protein Interface
Proteins perform biological functions by associating with other proteins, small
molecules, RNA or DNA. This association, the physical binding of protein structures
through weak, noncovalent bonds, is called an interaction. Two proteins interact
through active regions on their surfaces, called binding sites. The region where two
protein subunits in the same protein complex come into contact is called the interface
(see Figure 1.9). Understanding the characteristics of interfaces is necessary for
understanding the molecular recognition process and in addition, the ability to predict
interfaces is very important in understanding the functional consequences of
mutations and in design of pharmaceuticals.
Figure 1.9: Two subunits are shown (PDB id 1MSB): one subunit is colored blue and one red. The
atoms in green are interaction sites in the blue subunit and those in yellow are the interface atoms in the
red subunit (Jones and Thornton, 1997).
1.2.1.3 Protein-Protein Interaction Networks
Interactions between proteins are crucial for all biological functions. For
example, signals from the exterior of a cell are mediated to the inside of that cell by
protein-protein interactions of the signaling molecules. Protein A interacts with
protein B, protein B in turn interacts with protein C, and so on. These interacting proteins
create a protein-protein interaction network (PPI).
Several experimental methods have been developed to identify protein-protein
interactions. They are briefly described below.
Yeast Two Hybrid System (Y2H)
A protein of interest, the bait, is fused to the DNAbinding domain derived from a
transcription factor. Each proteincoding gene in the genome is then individually
fused to the activation domain of the same transcription factor, forming a library of
preys. To detect interactions, the bait protein is expressed in one yeast strain which
is mated to each member of the prey library expressed in a different yeast strain. In
cells where the bait interacts with a prey, a functional transcription factor is
assembled and the interaction is detected via activation of a reporter gene (usually by
a color assay). In a large-scale whole-genome experiment, all possible pairwise
interactions are in theory tested in a set of crosses (Uetz et al., 2000; Ito et al., 2001;
Auerbach et al., 2002) (see Figure 1.10).
Mass Spectrometry
In one high-throughput method, a DNA sequence encoding an affinity tag is ligated to
the cDNA encoding the target bait protein and the DNA is introduced into cells to
allow the modified protein to be expressed and form normal physiological complexes
with other proteins. The tagged protein is then precipitated on an affinity column
along with any associated proteins. Proteins extracted with the tagged bait are then
identified by mass spectrometry (Mann and Pandey, 2001).
Protein Chips
Genes encoding proteins of interest are cloned, and the proteins are expressed and purified.
They are then attached to the surface of a microarray in a way that preserves their
ability to interact specifically with other proteins. Initially, protein chips were used to
identify yeast genes encoding defined biochemical activities (Martzen et al., 1999;
MacBeath and Schreiber, 2000). Zhu et al. (2001) then applied this technique to the
study of whole proteome protein interactions. The entire set of yeast proteins was
printed on a microarray and screened for interactions with proteins and phospholipids.
They identified many new calmodulin and phospholipid interacting proteins, and
identified a novel common potential binding motif for many of the calmodulin
binding proteins.
Figure 1.10: The yeast two-hybrid system and its high-throughput applications (adapted from
Auerbach et al., 2002).
1.2.2 Mathematics background
1.2.2.1 Hidden Markov Model (HMM)
a) Definition
A first-order discrete HMM is a stochastic generative model for linear
problems, such as sequences or time series, that is defined by a finite set Q of states, a
discrete alphabet Σ of symbols, a probability transition matrix T = [t_k(l)], and a
probability emission matrix E = [e_k(b)]. Each state k emits symbol b according to E,
creating an observable sequence of symbols. The states are interconnected by state
transition probabilities T. Starting from an initial state, a sequence of states is
generated by moving from state to state according to the transition probabilities T
until the end state is reached (Durbin et al., 1998; Baldi and Brunak, 2001).
- Transition probabilities between states: t_l(1) + ... + t_l(K) = 1, for all states l = 1...K
- Start probabilities t_0(i): t_0(1) + ... + t_0(K) = 1
- Emission probabilities within each state: e_l(b_1) + ... + e_l(b_|Σ|) = 1, for all states
l = 1...K and for all symbols b_i ∈ Σ, with e_k(b) = P(x_i = b | π_i = k)
Figure 1.11: Hidden Markov Model.
- At each time step t, the only thing that affects future states is the current state π_t:
P(π_{t+1} = k | whatever happened so far)
= P(π_{t+1} = k | π_1, π_2, ..., π_t, x_1, x_2, ..., x_t)
= P(π_{t+1} = k | π_t)
- A parse of a sequence: given a sequence x = x_1...x_N, a parse of x is a sequence of
states π = π_1, ..., π_N
- Likelihood of a parse: given a sequence x = x_1...x_N, a parse π = π_1, ..., π_N, and
a given HMM, the likelihood of the parse is:
P(x, π) = P(x_1, ..., x_N, π_1, ..., π_N)
= P(x_N, π_N | π_{N-1}) P(x_{N-1}, π_{N-1} | π_{N-2}) ... P(x_2, π_2 | π_1) P(x_1, π_1)
= P(x_N | π_N) P(π_N | π_{N-1}) ... P(x_2 | π_2) P(π_2 | π_1) P(x_1 | π_1) P(π_1)
= t_0(π_1) t_{π_1}(π_2) ... t_{π_{N-1}}(π_N) e_{π_1}(x_1) ... e_{π_N}(x_N)
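The parse likelihood above is a plain product of start, emission, and transition probabilities, so it can be evaluated directly. Below is a minimal sketch for a toy two-state model; the state names, symbols, and all probability values are invented for illustration and are not the HMM trained in this work.

```python
# Toy illustration of the parse likelihood P(x, pi) for a 2-state HMM.
# All probabilities below are invented for illustration only.

t0 = {"A": 0.6, "B": 0.4}                      # start probabilities t_0(k)
t = {"A": {"A": 0.7, "B": 0.3},                # transition matrix t_k(l)
     "B": {"A": 0.4, "B": 0.6}}
e = {"A": {"x": 0.9, "y": 0.1},                # emission matrix e_k(b)
     "B": {"x": 0.2, "y": 0.8}}

def parse_likelihood(x, pi):
    """P(x, pi) = t0(pi_1) e_{pi_1}(x_1) * prod_i t_{pi_{i-1}}(pi_i) e_{pi_i}(x_i)."""
    p = t0[pi[0]] * e[pi[0]][x[0]]
    for i in range(1, len(x)):
        p *= t[pi[i - 1]][pi[i]] * e[pi[i]][x[i]]
    return p

print(parse_likelihood("xxy", "AAB"))
```

In practice such products underflow quickly for long sequences, which is why the algorithms below work with logarithms.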
b) Key issues in HMMs
There are two key issues in designing an HMM:
1. Decoding:
Given an HMM with parameters θ and a sequence x of symbols, find the sequence π
of states that maximizes P(x, π | θ)
2. Learning:
Given an HMM with unspecified transition and emission probabilities and a
sequence x, find the parameters θ = {T, E} that maximize P(x | θ)
1. Decoding problem
Given the input sequence x = x_1 x_2 ... x_N with x_i ∈ Σ, we want to find the output state
sequence π* = π_1 π_2 ... π_N with π_i ∈ Q such that the probability P(x, π | θ) is maximized:
π* = argmax_π P(x, π | θ) (1)
Equation (1) can be solved by a combination of Bayes' rule and dynamic
programming, such as the Viterbi algorithm, as follows:
- Let δ_k(i) = max_{π_1...π_{i-1}} P(x_1...x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i = k) be the probability of the most
likely sequence of states ending at state π_i = k. Then:
δ_l(i+1) = max_{π_1...π_i} P(x_1...x_{i+1}, π_1, ..., π_i, π_{i+1} = l)
= max_{π_1...π_i} P(x_{i+1}, π_{i+1} = l | x_1...x_i, π_1, ..., π_i) P(x_1...x_i, π_1, ..., π_i)
= max_{π_i} P(x_{i+1}, π_{i+1} = l | π_i) P(x_1...x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i)
= max_k P(x_{i+1}, π_{i+1} = l | π_i = k) max_{π_1...π_{i-1}} P(x_1...x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i = k)
= e_l(x_{i+1}) max_k t_k(l) δ_k(i)
By induction, we have: P(x, π*) = max_k δ_k(N)
For implementation, the Viterbi algorithm includes the following steps:
Input: x = x_1...x_N
Output: π* = π_1 π_2 ... π_N
- Initialization:
δ_0(0) = 1
δ_k(0) = 0, for all k > 0
- Iteration:
δ_l(i) = e_l(x_i) × max_k t_k(l) δ_k(i-1)
Ptr_l(i) = argmax_k t_k(l) δ_k(i-1)
- Termination:
P(x, π*) = max_k δ_k(N)
- Traceback:
π_N* = argmax_k δ_k(N)
π_{i-1}* = Ptr_{π_i*}(i)
To avoid the numerical underflow problem, we take the logs of all values δ_l(i):
δ_l(i) = log e_l(x_i) + max_k [δ_k(i-1) + log t_k(l)]
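The Viterbi steps above can be sketched in log space as follows. The two-state model is again an invented toy for illustration, not the interface-prediction HMM of Chapter 2.

```python
import math

# Log-space Viterbi decoding sketch; model dictionaries are illustrative
# placeholders, not a trained model.
t0 = {"A": 0.5, "B": 0.5}
t = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.3, "B": 0.7}}
e = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}

def viterbi(x):
    states = list(t0)
    # Initialization: delta_k(1) = log t0(k) + log e_k(x_1)
    delta = {k: math.log(t0[k]) + math.log(e[k][x[0]]) for k in states}
    ptr = []
    # Iteration: delta_l(i) = log e_l(x_i) + max_k [delta_k(i-1) + log t_k(l)]
    for sym in x[1:]:
        back = {l: max(states, key=lambda k: delta[k] + math.log(t[k][l]))
                for l in states}
        delta = {l: math.log(e[l][sym]) + delta[back[l]] + math.log(t[back[l]][l])
                 for l in states}
        ptr.append(back)
    # Termination and traceback via the stored back-pointers
    last = max(states, key=lambda k: delta[k])
    path = [last]
    for back in reversed(ptr):
        path.append(back[path[-1]])
    return "".join(reversed(path)), delta[last]

path, logp = viterbi("xxyyy")  # path == "AABBB" for this toy model
```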
2. Learning problem
Given an HMM with unspecified transition and emission probabilities and a sequence
x, find the parameters θ = {T, E} that maximize P(x | θ). There are two cases:
a) If we know the output state sequence π = π_1 π_2 ... π_N:
We can show that the maximum likelihood parameters θ that maximize P(x | θ) are:
1. e_k(b) = P(x_i = b | π_i = k) = E_k(b) / Σ_{b'∈Σ} E_k(b')
with E_k(b) = the number of times state k emits symbol b in x
2. t_k(l) = P(π_{i+1} = l | π_i = k) = T_k(l) / Σ_{l'∈Q} T_k(l')
with T_k(l) = the number of times the transition k → l occurs in π
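For the supervised case (a), where the state path is known, the maximum likelihood estimates are simple relative frequencies. A minimal counting sketch with invented toy data:

```python
from collections import Counter

# ML estimation of e_k(b) and t_k(l) by counting, for the supervised case
# where the state path pi is known. Toy data, illustrative only.
x  = "xxyyxy"   # observed symbols
pi = "AABBAB"   # known state path

E = Counter(zip(pi, x))        # E_k(b): times state k emits symbol b
T = Counter(zip(pi, pi[1:]))   # T_k(l): times transition k -> l occurs

def e_hat(k, b):
    total = sum(c for (s, _), c in E.items() if s == k)
    return E[(k, b)] / total

def t_hat(k, l):
    total = sum(c for (s, _), c in T.items() if s == k)
    return T[(k, l)] / total

print(e_hat("A", "x"), t_hat("A", "B"))
```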
b) If we do not know the true E_k(b) and T_k(l):
We can use the expectation maximization (EM) algorithm, the Baum-Welch
algorithm, or Viterbi training to find the maximum likelihood parameters θ. For
example, the principle of the EM algorithm can be summarized as follows:
Step 0: Pick a best guess for the model parameters θ
Step 1: Estimate T_k(l) and E_k(b) on the training data based on the current
parameters θ
Step 2: Update θ according to T_k(l) and E_k(b)
Step 3: Repeat steps 1 and 2 until convergence
To estimate T_k(l) and E_k(b), we note that:
P(π_i = k, π_{i+1} = l | x) = P(π_i = k, π_{i+1} = l, x) / P(x) (*)
where:
i) P(x) = Σ_π P(x, π) = Σ_π P(x | π) P(π)
We define the forward variable f_k(i) = P(x_1...x_i, π_i = k)
= Σ_{π_1...π_{i-1}} P(x_1...x_{i-1}, π_1, ..., π_{i-1}, π_i = k) e_k(x_i)
= Σ_l Σ_{π_1...π_{i-2}} P(x_1...x_{i-1}, π_1, ..., π_{i-2}, π_{i-1} = l) t_l(k) e_k(x_i)
= Σ_l P(x_1...x_{i-1}, π_{i-1} = l) t_l(k) e_k(x_i)
= e_k(x_i) Σ_l f_l(i-1) t_l(k)
Thus P(x) = Σ_k f_k(N) t_k(end)
ii) P(π_i = k, π_{i+1} = l, x)
= P(π_{i+1} = l, x_{i+1}...x_N | π_i = k) P(x_1...x_i, π_i = k)
= P(π_{i+1} = l, x_{i+1} x_{i+2}...x_N | π_i = k) f_k(i)
= P(x_{i+2}...x_N | π_{i+1} = l) P(x_{i+1} | π_{i+1} = l) P(π_{i+1} = l | π_i = k) f_k(i)
We define the backward variable:
b_k(i) = P(x_{i+1}...x_N | π_i = k)
= Σ_{π_{i+1}...π_N} P(x_{i+1}, x_{i+2}, ..., x_N, π_{i+1}, ..., π_N | π_i = k)
= Σ_l Σ_{π_{i+2}...π_N} P(x_{i+1}, x_{i+2}, ..., x_N, π_{i+1} = l, π_{i+2}, ..., π_N | π_i = k)
= Σ_l e_l(x_{i+1}) t_k(l) Σ_{π_{i+2}...π_N} P(x_{i+2}, ..., x_N, π_{i+2}, ..., π_N | π_{i+1} = l)
= Σ_l e_l(x_{i+1}) t_k(l) b_l(i+1)
Thus, P(π_i = k, π_{i+1} = l, x) = b_l(i+1) e_l(x_{i+1}) t_k(l) f_k(i)
So (*) is now:
P(π_i = k, π_{i+1} = l | x) = f_k(i) t_k(l) e_l(x_{i+1}) b_l(i+1) / P(x)
Finally,
T_k(l) = Σ_i P(π_i = k, π_{i+1} = l | x) = Σ_i f_k(i) t_k(l) e_l(x_{i+1}) b_l(i+1) / P(x)
Similarly,
E_k(b) = Σ_{i: x_i = b} f_k(i) b_k(i) / P(x)
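The forward and backward recursions above can be sketched as follows, for a toy two-state model with an implicit end state (t_k(end) = 1 for all k), so that P(x) = Σ_k f_k(N). All probabilities are invented for illustration.

```python
# Forward/backward recursions for a toy 2-state HMM; illustrative values only.
t0 = {"A": 0.5, "B": 0.5}
t = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
e = {"A": {"x": 0.8, "y": 0.2}, "B": {"x": 0.2, "y": 0.8}}
states = ["A", "B"]

def forward(x):
    # f_k(i) = e_k(x_i) * sum_l f_l(i-1) t_l(k)
    f = [{k: t0[k] * e[k][x[0]] for k in states}]
    for sym in x[1:]:
        f.append({k: e[k][sym] * sum(f[-1][l] * t[l][k] for l in states)
                  for k in states})
    return f

def backward(x):
    # b_k(i) = sum_l e_l(x_{i+1}) t_k(l) b_l(i+1), with b_k(N) = 1
    b = [{k: 1.0 for k in states}]
    for sym in reversed(x[1:]):
        b.insert(0, {k: sum(e[l][sym] * t[k][l] * b[0][l] for l in states)
                     for k in states})
    return b

x = "xxy"
f, b = forward(x), backward(x)
px = sum(f[-1][k] for k in states)  # P(x), implicit end state
# one posterior transition term P(pi_1 = A, pi_2 = A | x) from (*)
post = f[0]["A"] * t["A"]["A"] * e["A"][x[1]] * b[1]["A"] / px
```

The identity Σ_k f_k(i) b_k(i) = P(x) holds at every position i, which is a useful check on an implementation.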
1.2.2.2 Clustering
Clustering means identifying groups of objects that are similar to each other
based on the chosen similarity measure but dissimilar to objects belonging to other
groups (Cios et al., 1998). Most clustering algorithms perform a localized search
where the solution obtained at the next iteration is in the vicinity of the current
solution (Jain et al., 1999). We briefly review three popular clustering methods below.
1. Kmeans clustering (McQueen, 1967)
The objective of the Kmeans algorithm is to minimize total intracluster variance:
k
v= ii1^ jui I with Si is a cluster /, i=\..k, and p; is the mean point of all
i=\ .r ; eS,
the points XjeSj.
Briefly, given a k number of clusters, the kmean algorithm performs these steps:
Step 1: Randomly partition objects into k clusters
Step 2: Compute the centroids of the clusters. The centroid is the mean point
of the cluster.
Step 3: Assign each object to the cluster with the nearest mean point.
Step 4: Go back to Step 2; stop when no new assignments are made
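The steps above can be sketched as follows. This is an illustrative implementation only (names are mine), using the common variant that seeds the centroids directly from the data rather than from a random initial partition:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    """Lloyd-style k-means; returns (cluster assignments, centroids)."""
    centroids = random.sample(points, k)          # seed centroids from the data
    assign = None
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest mean point
        new_assign = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                      for p in points]
        if new_assign == assign:                  # stop: no new assignments
            break
        assign = new_assign
        # Recompute each centroid as the mean point of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(vals) / len(members)
                                     for vals in zip(*members))
    return assign, centroids
```

For example, on two well-separated pairs of points such as (0,0), (0,1) and (10,10), (10,11) with k=2, the loop converges in a few iterations to the two obvious clusters.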
2. Hierarchical clustering
Hierarchical clustering techniques create a hierarchical decomposition of the set of
data using some criterion. This proceeds by either a series of successive divisions or
mergers. The result is a nested sequence of clusters which can be represented with a
tree called a dendrogram.
a) Divisive algorithm (MacNaughtonSmith, 1964)
A divisive algorithm starts with the whole data set and proceeds to divide it into
successively smaller clusters:
 In the cluster being processed, select the object whose average dissimilarity to the remaining objects is the greatest and move it to a splinter cluster.
 Compare the average dissimilarity of each object remaining in the processing cluster with its average dissimilarity to the splinter cluster. If any object is on average more similar to the splinter cluster, move the object for which the difference in average dissimilarity is greatest.
 If there are no such objects, stop.
b) Agglomerative algorithm (Jain et al., 1999)
This method initially considers that each object belongs to a different cluster. The
most similar clusters are merged into a new cluster (see Figure 1.12). This process
stops when a criterion is met. There are several linkage methods that measure
similarity between two clusters:
1. Single linkage:
The distance between two clusters A and B is the distance of the two
closest objects in different clusters:
min { d(x,y): x ∈ A, y ∈ B }
2. Complete linkage:
Clusters are joined based on the maximum distance between objects in
different clusters:
max { d(x,y): x ∈ A, y ∈ B }
3. Average linkage:
The distance between two clusters is determined by the average of all
distances between the points in the two clusters:

$$\frac{1}{card(A) \cdot card(B)} \sum_{x \in A} \sum_{y \in B} d(x, y)$$
Figure 1.12: Starting with the dataset on the left, the resulting hierarchical clustering dendrogram is shown on the right.
3. Probabilistic clustering
The probabilistic algorithm computes probabilities of cluster memberships based
on probability distributions. The goal of the algorithm is to maximize the overall
likelihood of the data. One of these probabilistic algorithms is the expectation
maximization (EM) algorithm (Dempster et al., 1977; Bishop, 1995; Xu and
Jordan, 1996). The EM algorithm is described in detail in Chapter 4.
1.2.2.3 Genetic Algorithms
In contrast to clustering methods, Genetic Algorithms (GAs) perform a global
search for solutions. By using crossover and mutation operators, GAs can produce
new solutions that are completely different from the current ones (Cios et al., 1998).
Crossover: given two parent chromosomes, the crossover operator randomly
selects one (or more) position(s) in the chromosomes at which they exchange
content, resulting in two novel offspring.
    10100011                    10100000
                 crossover
    01010100                    01010111
Mutation: given a chromosome, mutation outputs a chromosome by
complementing the bit value at a randomly selected location in the input
chromosome.
    01010100      mutation      01010110
These operators are used with some predetermined probabilities which depend on the
fitness values. Mutation is used in GA to make sure that no part of the search space is
left unexplored.
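The two operators can be sketched on bit strings as follows. This is an illustrative sketch (names are mine); the `pos` parameter is exposed only so that specific crossover and mutation points can be reproduced, whereas a GA would normally leave it random:

```python
import random

def crossover(p1, p2, pos=None):
    """Single-point crossover of two equal-length bit strings."""
    if pos is None:
        pos = random.randint(1, len(p1) - 1)    # random crossover point
    # Exchange content from the crossover point onward
    return p1[:pos] + p2[pos:], p2[:pos] + p1[pos:]

def mutate(chrom, pos=None):
    """Complement the bit at one (randomly chosen) position."""
    if pos is None:
        pos = random.randrange(len(chrom))
    flipped = '1' if chrom[pos] == '0' else '0'
    return chrom[:pos] + flipped + chrom[pos + 1:]

# crossover("10100011", "01010100", pos=6) -> ("10100000", "01010111")
# mutate("01010100", pos=6)                -> "01010110"
```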
1.2.2.4 Fuzzy Cognitive Maps
FCM originated as a combination of fuzzy logic and neural networks designed to capture causal relationships between concepts. While other approaches, such as Bayesian networks, can be used for reasoning about causal relationships, FCMs have the advantage of allowing feedback between concepts. An FCM is represented as a directed cyclic graph where the nodes represent concepts and a directed edge $w_{ij}$ measures how much concept $C_i$ causes $C_j$ (Kosko, 1986; Fletcher et al., 2005). A time-varying concept $C_i(t)$ measures the degree of occurrence of some fuzzy event. The edges $w_{ij}$ take on values in the fuzzy interval $[-1, 1]$, where $w_{ij} = 0$ indicates no causality from $C_i$ to $C_j$, $w_{ij} > 0$ indicates causal correlation in the same direction ($C_j$ increases as $C_i$ increases, or $C_j$ decreases as $C_i$ decreases), and $w_{ij} < 0$ indicates negative causal correlation ($C_j$ decreases as $C_i$ increases, or $C_j$ increases as $C_i$ decreases). A concept cannot cause itself, i.e. there is no self-feedback, so the entries on the diagonal of the edge matrix are always zero. An important task in designing an FCM is to find its hidden patterns. Let $C_1, \ldots, C_n$ be the concepts of the FCM and $W = (w_{ij})$ be the connection matrix of the FCM, where $w_{ij}$ is the weight of the directed edge $C_i C_j$. Let $s(t) = (C_1(t), C_2(t), \ldots, C_n(t))$ be a state vector denoting the values of the concepts at time $t$. The edges $C_i C_j$ ($i \neq j$) of the FCM may form a directed cycle. When a state vector flows through such a cycle, the FCM acts as a dynamic system: the value of a concept $C_i$ changes, its causality flows through the edges of the cycle, and if this in turn causes $C_i$ to change again, the system goes round and round. The equilibrium state found for this dynamical system is called the hidden pattern, or fixed point.
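The dynamics described above, a state vector pushed repeatedly through the edge matrix until it stops changing, can be sketched with one common FCM update rule. This is a generic illustration with a simple binary threshold function, not the thesis implementation, and all names are my own:

```python
import numpy as np

def fcm_hidden_pattern(W, s0, max_iter=100):
    """Iterate an FCM state vector until it reaches a fixed point.

    W  : n x n edge matrix, W[i, j] = causal weight of concept i on j,
         values in [-1, 1], zero diagonal (no self-feedback)
    s0 : initial binary state vector of concept activations
    """
    threshold = lambda v: (v > 0).astype(float)   # simple binary squashing
    s = np.asarray(s0, dtype=float)
    for _ in range(max_iter):
        s_next = threshold(s @ W)                 # push state through the edges
        if np.array_equal(s_next, s):             # equilibrium: hidden pattern
            return s_next
        s = s_next
    return s                                      # may be cycling (limit cycle)
```

Note that, depending on W and the squashing function, the iteration may settle into a limit cycle rather than a fixed point; the cap on iterations guards against that case.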
1.2.3 Related Research
1.2.3.1 Prediction of Protein Interfaces
A number of researchers have analyzed features of protein interface sites.
Chothia and Janin (1975) determined that such sites are more hydrophobic, flat, or
protruding than other surfaces. Jones and Thornton (1996) studied properties of
protein surfaces surrounding interacting residues and concluded that several physical
characteristics, such as Accessible Surface Area (ASA), were very influential.
Sheinerman and Honig (2002) looked at the contribution of electrostatic interactions
from the population of charged and polar residues on interfaces. Ofran and Rost
(2003a) found that amino acid composition and preferences for residueresidue
interactions could be used to differentiate six types of protein interfaces, each
corresponding to a different functional association between residues.
Based on features of the known protein interfaces, several computational methods
have been reported for their prediction. One approach was based on physical and
chemical characteristics of protein interfaces. Jones and Thornton (1997) predicted
interface patches using the sum of values of solvation, residue interface propensity,
hydrophobicity, flatness, protrusion index and ASA. The algorithm by Kini and
Evans (1996) used the fact that proline residues occurred near interfaces at a
frequency 2.5 times higher than expected by chance. Pazos et al. (1997) used multiple
sequence alignments and correlated mutations based on the hypothesis that residues
participating in functionally required interactions tend to show compensatory
mutations during evolution. Gallet et al. (2000) reported that interfaces can be
predicted by using the hydrophobic moment and averaged hydrophobicity.
Another approach to predict interfaces is to use machine learning methods, such as
neural networks, Bayesian statistics and support vector machines (SVM), with
sequence profiles and/or ASA as input data. Zhou and Shan (2001) and Fariselli et al.
(2002) used a neural network to learn whether or not exposed residues at the protein
surface were in a contact patch. Koike and Takagi (2004) applied SVM to classify
interfaces of residues for both homo and heterodimers. Bradford and Westhead
(2005) combined a support vector machine approach with the previous work of Jones
and Thornton (1997) to predict proteinprotein binding sites. Bordner and Abagyan
(2005) detected protein interfaces by using a SVM trained on a combination of
evolutionary conservation signals plus local surface properties. Chen and Zhou
(2005) extended their previous work by developing a consensus method that
combined two neural networks consecutively. Yan et al. (2004) reported a twostage
method in which residue clusters belonging to interfaces were detected through the
SVM and then used as input to a Bayesian network classifier to predict the most likely
class, interface or noninterface, for the target residues.
The above approaches share a common characteristic: inputs are the encoded
identities of n contiguous amino acid residues, corresponding to a window of size n
containing the target residue. Each residue in the window is represented by a 20-dimensional vector whose values are the corresponding frequencies in sequence profiles of the residue (e.g. in the method of Yan et al., the 20-dimensional vector is a 20-bit vector with 1 bit for each letter code of the 20-amino-acid alphabet). These
methods are limited because a large amount of data is needed for training, and more
importantly, because the biological relationships between sequence profiles and
interfaces are not well established.
1.2.3.2 Prediction of Protein Function
One of the main goals of proteomics research is to understand and predict
protein functions. Even for a simple organism, such as the baker's yeast,
Saccharomyces cerevisiae, functions are unknown for approximately one third of the
proteins. For more complex organisms, functional annotation is lacking for a much
larger fraction of the proteome.
A number of computational methods have been developed for protein function
prediction. A general method uses sequence similarity, such as BLAST, to identify
homologous proteins in protein databases and then to assign functions to the query
protein based on known functions of the matches (Altschul et al., 1997). A more
limited approach, Rosetta stone sequence, infers protein interactions from genomic
sequences using the observation that some pairs of interacting proteins have
homologies in another organism fused into a single protein chain. Many such protein
pairs have been confirmed as functionally related (Marcotte et al., 1999a). However,
the reliability of such methods is not satisfactorily high. Grouping proteins by
correlated evolution, correlated messenger RNA expression patterns plus patterns of
domain fusion has been successfully applied to yeast proteins (Marcotte et al., 1999b). Troyanskaya et al. (2003) used Bayesian reasoning to similarly integrate heterogeneous types of high-throughput biological data for protein function prediction. Zhou et al. (2002) identified correlations between genes that have similar expression patterns and those that have similar functions. Lewis et al. (2006) used
support vector machines to infer gene functional annotations from a combination of
protein sequence and structure data.
Information on ProteinProtein Interactions (PPI) is used for function predictions.
Proteins interact with each other for a common purpose and thus the function of an
unannotated protein may be annotated by information on the functions of its
neighbors in an interaction complex. Recent highthroughput experiments have
generated large scale protein physical interaction data for several organisms such as
Caenorhabditis elegans (Li et al., 2004), Drosophila melanogaster (Giot et al., 2003), Helicobacter pylori (Rain et al., 2001) and Saccharomyces cerevisiae (Fromont-Racine et al., 1997; Schwikowski et al., 2000; Uetz et al., 2000; Ho et al., 2002).
Using these datasets, a number of protein function prediction methods have been
developed, each having its unique strengths and limitations. The Majority method
infers functions for a protein using the most frequent annotations among its nearest
neighbors (Schwikowski et al., 2000). In this method, some functions may have a very
high frequency in the network but will not be annotated for the query if they do not
occur in the nearest-neighbor set. Chua et al. (2006) extended the Majority method to predict protein functions by exploiting indirect neighbors and devising a topological weight to estimate functional similarity. Hishigaki et al. (2001) predicted protein functions based on χ²-statistics. This extended the Majority method by looking at all proteins within a specified radius within the network, thus taking into account the frequency of all proteins having a particular function. However, the χ² statistic does not consider any aspect of the underlying topology of the PPI network. Letovsky et al.
(2003) and Deng et al. (2003) used a Markov Random Field (MRF) to assign
functions based on a probabilistic analysis of graph neighborhoods in a PPI network.
This method assumes that the probability distribution for the annotation of any node
is conditionally independent of all other nodes given its neighbors. This method is
sensitive to the neighborhood size and the parameters of the binomial distribution
used in function assignments. FunctionalFlow considers each protein of known function as a source of functional flow for that function (Nabieva et al., 2005).
Functional flow spreads through the neighborhoods of the sources. Proteins receiving
the highest amount of flow of a function are assigned that function. This algorithm,
however, does not consider indirect flow of functions to other proteins after labeling
of the functions.
1.2.3.3 Hybrid Clustering Methods
Raghavan and Birchand (1979) used GA for clustering, where GA was used to minimize the squared error $\sum_{i=1}^{k} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2$ of a clustering. To partition a dataset with m data points into k clusters, they used a k-ary string of length m to represent each
chromosome. In this representation scheme, the resulting offspring may be inferior. For example, by applying single-point crossover at the 4th position on the two chromosomes 111000 and 000111, we get 111111 and 000000 as offspring, corresponding to an inferior partition. To improve this representation, Bhuyan et al.
(1991) used an additional separator symbol along with the pattern labels to represent a partition. For example, with the separator symbol '*' and six patterns {A,B,C,D,E,F}, the chromosome ACF*BDE corresponds to a 2-partition {A,C,F} and {B,D,E}. Thus, the clustering problem now becomes a permutation problem. However, this solution also suffers from permutation redundancy; for example, there are 6! × 2 equivalent chromosomes corresponding to the same partition of the data into the two clusters. Babu and Murty (1993) introduced a hybrid method
by using GA to initialize the centroids and the K-means algorithm to find the final partitions. They selected a candidate set of k centroids from the vertices of the d-dimensional hyperboxes formed by dividing each dimension into 2^B - 1 segments. A chromosome is represented by a bit string of length kdB formed by the concatenation of k B-bit strings for each of the d dimensions. Recently, Laszlo and Mukherjee (2006) used a hyper-quadtree constructed on the data to represent the set of centers, where each gene in a chromosome is randomly associated with a node in the hyper-quadtree.
2. Prediction of Protein Interfaces
2.1 Overview
Identifying the sites of interaction in a protein is a critical problem for
understanding its functional mechanisms, as well as for drug design. To predict sites
within a protein chain that participate in protein complexes, we have developed a
novel method based on the Hidden Markov Model (HMM), which combines several
biological characteristics of the sequences neighboring a target residue: structural
information, accessible surface area, and transition probability among amino acids.
We have evaluated the method using 5-fold cross-validation on 139 unique proteins and shown a precision of 66% and a recall of 61% in identifying interfaces. These
results are better than those achieved by using other methods.
2.2 Methods
2.2.1 Materials
The method of collecting protein sequence data for the purpose of training and
testing of our HMM is similar to that of Zhou and Shan (2001) and Koike and Takagi
(2004). We extract nonhomologous proteins from all multiple chain protein entries in
the PDB (December 2005 release) by using PSI-BLAST (Altschul et al., 1997) with the cut-offs: identity < 30% and E-value > 0.14. Also, with a confidence level of >95%,
we exclude all chains shorter than 50 residues as well as those longer than 1100
residues. After this filtering, our resulting first dataset consists of 139 non-homologous complex-forming protein chains in the PDB, including 68 homodimers
and 71 heterodimers. The list of the proteins is provided in Appendix A. The set of
77 protein sequences used by Yan et al. (2004) forms the second dataset.
2.2.2 Definitions
There are several methods to predict which residues in a protein are part of the
proteinprotein interface. One method is to go through one of the proteins and check
which residues have a residue of the other protein within a specific distance (e.g. 5 Å)
(Zhou and Shan, 2001; Koike and Takagi, 2004). In this work, to define interfaces,
we used the definition of Jones and Thornton (1996): we calculated the difference in
the accessible surface area (ASA) upon complex formation. Using the DSSP
(Definition of Secondary Structure of Proteins) program of Kabsch and Sander (1983),
we extracted the ASA for each residue in the subunits forming the complex and in the
complex. When the ratio of ASA of a residue to its nominal maximum area in the
subunit is at least 25%, the residue is called a surface residue. A surface residue is
defined as an interface residue when its calculated ASA value in the complex is
decreased by at least 1 Å². By using this definition, also used in Yan et al. (2004), we obtained a total of 11555 interfaces from a total of 33632 residues in the 139-protein dataset (see Figure 2.13).
Figure 2.13: Distribution of interfaces and noninterfaces using the threeamino acid categories,
applied to the two datasets. A is Ambivalent, I is Internal, and E is External.
2.2.3 HMM architecture
A first-order discrete HMM is a stochastic generative model for linear problems, such as sequences or time series, defined by a finite set $D$ of states, a discrete alphabet $S$ of symbols, a probability transition matrix $T = [t_k(l)]$ and a probability emission matrix $E = [e_k(b)]$. Each state $k$ emits a symbol $b$ according to $E$, creating an observable sequence of symbols. The states are interconnected by the state transition probabilities $T$. Starting from an initial state, a sequence of states is generated by moving from state to state according to the transition probabilities $T$ until the end state is reached (Baldi and Brunak, 2001).
First, a protein sequence is translated from the 20-letter amino acid alphabet to the three-letter amino acid categories based on broad biochemical similarity (Karlin et al., 1989): Ambivalent (Ala, Cys, Gly, Pro, Ser, Thr, Trp, Tyr), External (Arg, Asn, Asp, Gln, Glu, His, Lys), and Internal (Ile, Leu, Met, Phe, Val) (see Figure 2.13). We
observe that, for both the 139 and the 77 protein sequence datasets, the fraction of
interface amino acids in the external category is much higher than that in the internal
or ambivalent categories (see Figure 2.13). This grouping not only significantly reduces the dimensionality of the input space but also increases the accuracy of the HMM. In our HMM, the set of states is defined as $D = \{A^+, E^+, I^+, A^-, E^-, I^-\}$ and the symbol alphabet is defined as $S = \{A, E, I\}$, where A stands for ambivalent, E for external and I for internal (see Figure 2.14). For completeness we also show, in Table 2.5 and Figure 2.19, the results obtained using different HMM models, corresponding to four different monomer groupings based on: a) chemical, b) functional, c) charge, and d) hydrophobic alphabets.
Figure 2.14: HMM architecture: there are six interconnected states plus two fictitious start/end states
according to the three categories. The state designated with a superscript + emits symbols located in
interface domains. The emission probability and transition probability parameters are trained by
converted protein sequences in both strands, from N-terminal to C-terminal, and in the opposite
direction.
2.2.3.1 HMM training phase
To train the model, the 139 protein sequences are first converted from the 20 amino acids to the three-category-letter sequences (A, I, E). Next, we use two operators on each sequence. The expansion operator expands each residue in the three-category-letter sequence into an odd k-residue window containing the residue and ⌊k/2⌋ neighboring residues on both sides of the residue. It should be noted that when windows i and i+1 are connected, the last residue of window i and the first residue of window i+1 are put together, which results in a residue pair. Because this pair does not exist in the original sequence, it is eliminated from the training data. The
contraction operator cuts the expanded sequence into connected windows of size m, where each window must have an interface residue in the middle and ⌊m/2⌋ neighboring residues on either side of the interface. In using the contraction operator, the following rule is applied. Let window i be a window formed with an interface residue in the middle; then, if the next observed interface residue belongs to this window i, the said interface residue is omitted and, as a result, no window is formed. Otherwise, window i+1 is formed with the said interface in the middle and merged with window i (see Figure 2.15). These two operators exploit the fact that interface
residues tend to form clusters within the amino acid sequence (Ofran and Rost, 2003).
Firstly, the expansion operator expands residues around the target residues, and thus
improves the transition probability among the amino acids around the target residues.
Secondly, the contraction operator cuts the expanded sequence into connected
windows, where the middle residue is an interface. Thus, we again focus on clustering
of interface residues. We empirically determined values of k=5 for the first operator
and of m=7 for the second. The transformed sequences are input into the model in both strands, from the N-terminal to the C-terminal and in the reverse direction, to learn the emission and transition probabilities E and T by the following equations:
 The emission probability of a state $k$ emitting a symbol $b$:

$$e_k(b) = P(x_i = b \mid \pi_i = k) = \frac{E_k(b)}{\sum_{c \in S} E_k(c)}$$

with $E_k(b)$ = the number of times state $k$ emits symbol $b$ in $x$.

 The transition probability of a state $k$ moving to a state $l$:

$$t_k(l) = P(\pi_{i+1} = l \mid \pi_i = k) = \frac{T_k(l)}{\sum_{j \in D} T_k(j)}$$

with $T_k(l)$ = the number of times the $k \to l$ transition occurs in $\pi$.
For more details, the reader can refer to section 1.2.2 Mathematics background.
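Because the training sequences are labeled, the state path for each sequence is known, so the two estimates above reduce to counting and normalizing. A minimal sketch, with invented helper names and a toy (state, symbol) path format for illustration:

```python
from collections import Counter

def train_hmm(labeled_paths):
    """Estimate e_k(b) and t_k(l) by counting over labeled state paths.

    labeled_paths: list of paths, each a list of (state, symbol) pairs,
    e.g. [('E+', 'E'), ('I+', 'I'), ('A-', 'A')]  (names illustrative).
    """
    E, T = Counter(), Counter()
    for path in labeled_paths:
        for (k, b) in path:
            E[(k, b)] += 1                        # E_k(b): state k emitted b
        for (k, _), (l, _) in zip(path, path[1:]):
            T[(k, l)] += 1                        # T_k(l): k -> l occurred
    # Normalize the counts into probabilities, per source state k
    e = {(k, b): n / sum(m for (k2, _), m in E.items() if k2 == k)
         for (k, b), n in E.items()}
    t = {(k, l): n / sum(m for (k2, _), m in T.items() if k2 == k)
         for (k, l), n in T.items()}
    return e, t
```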
[Figure 2.15 (graphic): an example protein sequence (CKVLTVFGTRP... AEIIAIIAAEA...) is converted to category letters, then expanded with k=5 and contracted with m=7 into connected training windows; interface positions are labeled I and non-interface positions N.]
Figure 2.15: Example of transformation from an original sequence into a training sequence that constitutes input to the model. In the expansion (with k=5), the 1st letter A is expanded to AEI (because it has only two residues on the right flank side), the 2nd letter E is expanded to AEII (because it has one residue on the left and two residues on the right flank side), and the 3rd letter I is expanded to AEIIA (because it has two residues on both flank sides), and so on. The transition between the last residue of window i and the first residue of window i+1 is not used in the training task. The contraction operator (with m=7) cuts the 1st observed interface I into a size-7 window; the 2nd observed interface I into a size-7 window merged with the previous window; and the 3rd observed interface I into a size-7 window merged with the previous two. The 4th observed interface I belongs to the previous window, thus no window is formed for the 4th interface, and so on.
2.2.3.2 HMM prediction phase
For prediction, a protein sequence is first converted into the three-category-letter sequence before it is used as input to the model. In this phase, the two operators are not used. We then apply the Viterbi algorithm to find the most probable path of output states $\pi = \pi_1 \pi_2 \ldots \pi_N$. The initial state $\pi_1$ is chosen according to an initial distribution of the states. Interfaces are the residues emitted by the states designated with a superscript +.
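A compact log-space sketch of the Viterbi dynamic program (generic, not the thesis code; all names are my own assumptions):

```python
import numpy as np

def viterbi(x, t, e, start):
    """Most probable state path for observed sequence x.

    t : K x K transition matrix, e : K x S emission matrix,
    start : initial state distribution. Works in log space for stability.
    """
    K, N = t.shape[0], len(x)
    logt, loge = np.log(t), np.log(e)
    v = np.log(start) + loge[:, x[0]]      # v[k]: best log-prob ending in k
    ptr = np.zeros((N, K), dtype=int)      # back-pointers
    for i in range(1, N):
        scores = v[:, None] + logt         # scores[k, l]: come from k into l
        ptr[i] = scores.argmax(axis=0)
        v = scores.max(axis=0) + loge[:, x[i]]
    # Trace the best path back through the pointers
    path = [int(v.argmax())]
    for i in range(N - 1, 0, -1):
        path.append(int(ptr[i][path[-1]]))
    return path[::-1]
```

On a two-state toy model with sticky transitions and near-diagonal emissions, the decoded path simply follows the observed symbols, which is the expected behavior.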
2.2.4 Assessment of predictions
The following measures are used for assessing the performance of our
method:
Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Matthews Correlation Coefficient (MCC):

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}$$

where:
TP = the number of residues predicted to be interfaces that actually are interfaces (true positives)
TN = the number of residues predicted not to be interfaces that actually are not interfaces (true negatives)
FP = the number of residues predicted to be interfaces that actually are not interfaces (false positives)
FN = the number of residues predicted not to be interfaces that actually are interfaces (false negatives)

Precision (a.k.a. positive predictive value) is the fraction of predicted interface residues that actually are interfaces, while recall (a.k.a. sensitivity) is the fraction of actual interface residues that are predicted as such. The MCC value is between -1 and +1 and measures how well the predicted class labels agree with the actual class labels. An MCC of -1 (when TP=0 and TN=0) means complete disagreement, +1 (when FP=0 and FN=0) means complete agreement, and 0 (when TP=FP and TN=FN) means the prediction is random (Baldi et al., 2000). We did not rely on accuracy alone because it is sensitive to the class distribution and our data are highly skewed (the negative class greatly outnumbers the positive class). Thus, a classifier that sacrificed true positives and predicted all examples as negative could obtain a high accuracy while being useless (Cios and Kurgan, 2004).
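The four measures follow directly from the confusion-matrix counts; a small sketch (the function name is mine):

```python
from math import sqrt

def performance(tp, tn, fp, fn):
    """Precision, recall, accuracy and Matthews correlation coefficient."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    mcc = ((tp * tn - fp * fn) /
           sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn)))
    return precision, recall, accuracy, mcc

# performance(50, 50, 0, 0) -> (1.0, 1.0, 1.0, 1.0), i.e. perfect agreement
```

Note that the MCC (and precision/recall) are undefined when a denominator factor is zero; production code would need to guard that case.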
2.4 Results and Discussion
2.4.1 HMM prediction results
We trained our HMM by using 5-fold cross-validation on the dataset of 139
proteins. Table 2.1 summarizes the results in terms of performance measures defined
above: MCC, precision, recall and accuracy. The method's accuracy is 73% with an MCC of 0.42. This result shows that it performs significantly better than a random
predictor (Baldi et al., 2000). Of the residues predicted as interfaces, 66% are actual
interfaces (precision), and of all interfaces, 61% are identified as such (recall). Figure
2.16 shows precision and recall for each predicted protein data point in our dataset.
The results show that the HMM method is useful for actual prediction.
Table 2.1: Performance measures of the HMM method on the 139-protein dataset using 5-fold cross-validation.
Homodimers Heterodimers Overall
Precision 0.65 0.68 0.66
Recall 0.64 0.57 0.61
MCC 0.43 0.41 0.42
Accuracy 0.74 0.72 0.73
Figure 2.16: Distribution of precision and recall of the prediction results for 139 proteins using 5-fold cross-validation.
Work of other authors (Jones and Thornton, 1997; Ofran and Rost, 2003; Fariselli et
al., 2002; Chen and Zhou, 2005) indicated that the characteristics of homodimer and
heterodimer interfaces are different. We separately tested homodimers and heterodimers in our dataset and found that the performance results for homodimers are slightly better than those for heterodimers.
2.4.2 Comparison with other methods
Zhou and Shan (2001) predicted interfaces using neural networks. The input
values were sequence profiles and ASA of the target residue and of the neighbor
residues. The precision and recall for their classifier was reported as 51% and 50%,
respectively. Koike and Takagi (2004) used SVMs with sequential neighboring
profiles, i.e., profiles of residues in the neighborhood, and they reported precision and
recall of 50% and 54-56% for homo/hetero mixed test data. However, because of
differences in definitions, results of these methods cannot be directly compared with
ours. For direct comparison, we tested our HMM method on the 77 proteins from the
dataset reported by Yan et al. (2004). We used exactly the same definition to extract
the interfaces. We evaluated the two-stage classifier method of Yan et al. using 5-fold cross-validation on our 139-protein dataset. We selected Yan et al.'s method for comparison because it used the open-source SVM and Bayesian network implementations from http://www.cs.waikato.ac.nz/~ml/weka (Witten and Frank, 2005), so it was easy to replicate. In addition, to the best of our knowledge, the two-stage classifier method of Yan et al. gave the best performance results.
2.4.2.1 HMM comparison with two-stage classifier on the 139-protein dataset
The input to the SVM consists of the encoded identities of nine amino acid residues: the target residue and the four residues on each side. Each residue in the window is represented by a 20-bit vector, with 1 bit for each letter representing the 20 amino acids. Thus, each input data point is of dimension 9×20, plus a class attribute (1 = interface, 0 = non-interface). The results are shown in Table 2.2.
They indicate that the performance measures of our HMM are slightly better than those of the SVM used in the first stage in terms of precision, recall and MCC.
Next, we evaluated the SVM method by using a different way of coding residues, as in Zhou and Shan (2001) and Koike and Takagi (2004). Each residue in the window, instead of taking into account only the nearest-neighbor information as in Yan et al. (2004), is now coded as a 20-dimensional vector, with its components corresponding to frequencies in the multiple sequence alignment of the protein, taken from the HSSP file. When using these input vectors, the SVM's performance was only at 58%
precision, 38% recall and 0.27 MCC. This result is lower than using just the nearest
neighbor information as input. Thus, finding the biological relationship between
sequence profiles and interface residue information remains an open research
question.
Table 2.2: Yan et al.'s SVM first stage tested on the 139-protein dataset.
HMM method Yan et al.'s first stage
Precision 0.66 0.61
Recall 0.61 0.59
MCC 0.42 0.40
Accuracy 0.73 0.73
In the second stage, to compare with the method of Yan et al., a Bayesian network classifier is trained to identify an interaction-site residue based on the class labels (1 = interface, 0 = non-interface) of its neighbors. The class-label outputs of the SVM in the first stage constitute the input to the Bayesian classifier in the second stage. Table 2.3 shows the results of the second stage using BayesNetB from the Weka package (Witten and Frank, 2005), with the input being the class labels z of the eight residues surrounding a target residue, four on each side. The method of Yan et al. classified the target residue as an interaction site if p(1|z) > λ·p(0|z), where λ ranges from 0.01 to 1, in increments of 0.01, so as to maximize the correlation coefficient. The HMM performed better on the original 139 proteins than the method of Yan et al. on all measures.
Table 2.3: Yan et al.'s Bayesian second stage tested on the 139-protein dataset
Yan et al.'s two-stage method
HMM method SVM Bayesian
Precision 0.66 0.61 0.39
Recall 0.61 0.59 0.40
MCC 0.42 0.40 0.10
Accuracy 0.73 0.73 0.65
2.4.2.2 HMM comparison with two-stage classifier on the 77-protein dataset
To compare with Yan et al.'s method, we use their dataset (77 proteins) and results. The results in Table 2.4 show that our HMM method achieves higher recall (53%) than the two-stage method of Yan et al. (39%) when tested on their 77-protein dataset, while the other performance measures are equal. Also, the performance measures of the HMM method are significantly better than those of the method of Gallet et al. (2000) (we use results provided by the authors) when tested on the same dataset.
Table 2.4: A comparison of the HMM method with Yan et al.'s using the same 77-protein dataset.
            HMM method    Yan et al.'s two-stage method    Gallet et al.'s method
                          SVM      Bayesian
Precision   0.58          0.44     0.58                    0.30
Recall      0.53          0.43     0.39                    0.44
MCC         0.30          0.19     0.30                    0.02
Accuracy    0.68          0.66     0.72                    0.51
On the dataset of 139 proteins, our six-state HMM method achieved 73% accuracy,
66% precision, 61% recall, and a Matthews correlation coefficient of 0.42. Results of
the other HMM variations we designed are shown in Table 2.5. In Figure 2.17 we
illustrate the performance of the six-state HMM method on the tertiary structures
of two complexes: PDB 1BM9 and 1EFN. In several comparisons, our method
performed better than the two-stage method of Yan et al., both on their dataset of 77
proteins and when testing their method on our 139-protein dataset. We have
also shown that if sequence profiles are used to encode the residues (Zhou and Shan,
2001; Koike and Takagi, 2004), the performance of the SVM method substantially
decreases. This suggests that the HMM method might also be useful for predicting the
effects of point mutations on protein interactions or solvent accessibility.
Figure 2.17: Overall view of complexes with identified interfaces. The predicted protein in each
complex is shown in space-fill, coded as: blue, interfaces classified correctly (TP); yellow, residues
incorrectly classified as interfaces (FP); green, non-interfaces correctly identified as such (TN); red,
residues incorrectly identified as non-interfaces (FN); binding proteins are shown as wireframes.
(a) 1BM9 (replication terminator protein from Bacillus subtilis) (homodimer): the predicted protein
(chain A), the binding protein (chain B).
(b) 1EFN (HIV-1 NEF protein in complex with R96I mutant FYN SH3 domain) (heterodimer): the
predicted protein (chain B), the binding proteins (chains A, C, D). The figures were created using RasMol
(http://www.openrasmol.org).
Validation with the CAPRI targets
CAPRI (Critical Assessment of Prediction of Interactions;
http://capri.ebi.ac.uk) is a community-wide experiment to assess the capacity of
protein docking algorithms on targets based on structures of the unbound
components. To evaluate our method, we trained the HMM on the 139-protein
dataset and used the resulting classifier to identify the protein interfaces on the
targets. The prediction for the ligand HEMK in target 20 (protein
methyltransferase HEMK complexed with Release Factor 1, heterodimer) is shown
in Figure 2.18. The HMM identified 32 interface residues out of 59 positive-class
residues and 132 non-interface residues out of 217 negative-class residues.
Figure 2.18: Overall view of the ligand HEMK in target 20 (heterodimer) with identified
interfaces. The residues of interest are shown in space-fill, coded as: blue, TPs; yellow, FPs; green,
TNs; red, FNs; binding proteins are shown as wireframes. The figure was created using RasMol
(http://www.openrasmol.org).
Table 2.5: Results of the HMM method when monomers are grouped by:
1) chemical alphabet: (A) acidic (Asp, Glu), (L) aliphatic (Ala, Gly, Ile, Leu, Val), (M)
amide (Asn, Gln), (R) aromatic (Phe, Trp, Tyr), (B) basic (Arg, His, Lys), (H) hydroxyl
(Ser, Thr), (I) imino (Pro), (S) sulfur (Cys, Met)
            77 proteins   139 proteins
Precision   0.53          0.66
Recall      0.51          0.61
MCC         0.24          0.42
2) functional alphabet: (A) acidic and basic (Asp, Glu, Arg, His, Lys), (H) hydrophobic
non-polar (Ala, Ile, Leu, Met, Phe, Pro, Trp, Val), (P) polar uncharged (Asn, Gln, Ser,
Thr, Pro, Cys, Met)
            77 proteins   139 proteins
Precision   0.59          0.69
Recall      0.39          0.49
MCC         0.26          0.39
3) charge alphabet: (A) acidic and basic (Asp, Glu, Arg, His, Lys), (N) neutral (Ala, Ile,
Leu, Met, Phe, Pro, Trp, Val, Asn, Gln, Ser, Thr, Pro, Cys, Met)
            77 proteins   139 proteins
Precision   0.59          0.69
Recall      0.39          0.49
MCC         0.26          0.39
4) hydrophobic alphabet: (H) hydrophobic (Ala, Ile, Leu, Met, Phe, Pro, Trp, Val), (P)
hydrophilic (Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Lys, Ser, Thr, Tyr)
            77 proteins   139 proteins
Precision   0.44          0.51
Recall      0.74          0.77
MCC         0.19          0.30
Figure 2.19: HMM architectures for monomer grouping based on:
a) chemical alphabet: there are sixteen interconnected states (A+, L+, M+, R+, B+, H+, I+, S+,
A−, L−, M−, R−, B−, H−, I−, S−) plus two fictitious start/end states, according to the eight categories.
b) functional alphabet: there are six interconnected states (A+, H+, P+, A−, H−, P−) plus two fictitious
start/end states, according to the three categories.
c) charge alphabet: there are four interconnected states (A+, N+, A−, N−) plus two fictitious start/end
states, according to the two categories.
d) hydrophobic alphabet: there are four interconnected states (H+, P+, H−, P−) plus two fictitious
start/end states, according to the two categories.
A state designated with a superscript + emits symbols located in interface domains. The emission
probability and transition probability parameters are trained on the converted protein sequences in both
directions, from N-terminal to C-terminal and the opposite.
Having dealt with the problem of predicting protein interfaces, we can move
to the next step of our research, namely the design of methods for prediction of
protein functions. This is described in the next chapter.
3. Prediction of Protein Functions
3.1 Overview
We take into consideration both the homologies between proteins and the
underlying topology of the PPI networks to introduce a hybrid approach that overcomes
some of the weaknesses of current methods. First, we assign biological homology
scores to the edges in the PPI network. Second, we use an agglomerative clustering
method on the weighted graphs, taking into account the fact that proteins of known
function and cellular location tend to cluster together. Then, we consider each cluster
as a fuzzy cognitive map (FCM), with proteins constituting causal nodes and edges
representing cause-effect relationships. For a protein of interest, we simulate the
effects of the function-annotated proteins on it within the FCM to which it belongs.
We ran the ClusFCM algorithm on the yeast, worm and fly PPI networks
obtained from the GRID (Breitkreutz et al., 2003) and compared
its performance with those of Majority, χ²-statistics, MRF and FunctionalFlow.
3.2 Methods
3.2.1 Materials
Even though there exist programs that take advantage of GO and homology
(Blast2GO: Conesa et al., 2005; GOblet: Groth et al., 2004; GOtcha: Martin et al.,
2004; OntoBlast: Zehetner et al., 2003), to prepare data for the experiments we use our
own procedure, described below.
GO annotations:
GO (http://www.geneontology.org; Ashburner et al., 2000) is composed of three
related ontologies: the molecular function of gene products, their associated
biological processes, and their physical location as cellular components. Each
ontology is constructed as a directed acyclic graph (see Figure 3.20). We consider
functional annotation only at the third level of the GO hierarchies. After converting
protein labels through the IS-A ancestors of each label, there are 39 GO
terms, including 11 biological processes, 10 cellular components, and 18 molecular
functions (see Table 3.6 for details).
Physical Interactions:
Yeast interaction data: The yeast GRID (release 3/2006) contains 20519 interaction
pairs. Of the 4948 proteins participating in interactions, 4600 have been labeled
with GO terms (the yeast genome database at http://www.yeastgenome.org).
Worm interaction data: There are 2780 unique proteins participating in 4453 distinct
interactions in the worm GRID dataset (release 3/2006). Of these, 1680 proteins are
labeled with GO terms (WormBase (release WS155) at http://www.wormbase.org).
Fly interaction data: The fly GRID dataset (release 3/2006) includes 28406 distinct
interactions in which 7938 proteins participate; 5098 of these are annotated
with GO terms (the fly genome database at http://www.flybase.org). See Tables 3.7.a, b
and c for details.
Table 3.6: The 39 GO terms used in this study.
GO ID        GO Term                                      Ontology
GO:0030533   triplet codon-amino acid adaptor activity    Molecular function
GO:0005215   transporter activity                         Molecular function
GO:0045182   translation regulator activity               Molecular function
GO:0030528   transcription regulator activity             Molecular function
GO:0005198   structural molecule activity                 Molecular function
GO:0004871   signal transducer activity                   Molecular function
GO:0031386   protein tag                                  Molecular function
GO:0045735   nutrient reservoir activity                  Molecular function
GO:0003774   motor activity                               Molecular function
GO:0005554   molecular function unknown                   Molecular function
GO:0030234   enzyme regulator activity                    Molecular function
GO:0031992   energy transducer activity                   Molecular function
GO:0045499   chemorepellent activity                      Molecular function
GO:0042056   chemoattractant activity                     Molecular function
GO:0030188   chaperone regulator activity                 Molecular function
GO:0003824   catalytic activity                           Molecular function
GO:0005488   binding                                      Molecular function
GO:0016209   antioxidant activity                         Molecular function
GO:0019012   virion                                       Cellular component
GO:0045202   synapse                                      Cellular component
GO:0043234   protein complex                              Cellular component
GO:0043226   organelle                                    Cellular component
GO:0031974   membrane-enclosed lumen                      Cellular component
GO:0005576   extracellular region                         Cellular component
GO:0031012   extracellular matrix                         Cellular component
GO:0031975   envelope                                     Cellular component
GO:0008372   cellular component unknown                   Cellular component
GO:0005623   cell                                         Cellular component
GO:0016032   viral life cycle                             Biological process
GO:0050896   response to stimulus                         Biological process
GO:0000003   reproduction                                 Biological process
GO:0050789   regulation of biological process             Biological process
GO:0043473   pigmentation                                 Biological process
GO:0007582   physiological process                        Biological process
GO:0007583   killer activity                              Biological process
GO:0007584   response to nutrient                         Biological process
GO:0007585   respiratory gaseous exchange                 Biological process
GO:0007586   digestion                                    Biological process
GO:0007587   sugar utilization                            Biological process
Table 3.7.a: The number of proteins in the Yeast network having at least:
               1 neighbor  2 neighbors  3 neighbors  4 neighbors  5 neighbors  6 neighbors
Annotated      4600        3561         2832         2330         1992         1701
Non-annotated  348         177          88           50           33           22
Total          4948        3778         2920         2380         2025         1723
Table 3.7.b: The number of proteins in the Fly network having at least:
               1 neighbor  2 neighbors  3 neighbors  4 neighbors  5 neighbors  6 neighbors
Annotated      4975        3561         2775         2285         1913         1652
Non-annotated  2963        1785         1277         991          815          676
Total          7938        5346         4052         3276         2728         2328
Table 3.7.c: The number of proteins in the Worm network having at least:
               1 neighbor  2 neighbors  3 neighbors  4 neighbors  5 neighbors  6 neighbors
Annotated      1680        801          520          379          301          236
Non-annotated  1100        412          240          171          134          104
Total          2780        1213         760          550          435          340
[Figure 3.20 here: a screenshot of the GO database browser, showing the three root ontologies
(biological_process, cellular_component, molecular_function) with the antioxidant activity and
peroxidase activity branches expanded.]
Figure 3.20: GO is a hierarchical structure consisting of three related ontologies: molecular function,
biological process, and cellular component (taken from http://www.godatabase.org).
3.2.2 ClusFCM Algorithm
The algorithm works in two stages. In stage 1 it performs clustering, and in
stage 2 it performs function flow in an FCM. Both stages use the PPI network.
3.2.2.1 PPI Network Definition
The PPI network is an undirected graph G(V, E), where V is a set of proteins
and E is a set of edges connecting proteins u and v if the corresponding proteins
physically interact. We assign weights to the edges by a homology score between the two
interacting proteins. We use the bl2seq program (Tatusova and Madden, 1999) to
extract the expected values (E-values), in the range from 0 to 10, as the strength of the
biological relationship between all interacting proteins. The lower the value between two
proteins, the higher the possibility that they share the same functions (Altschul et al.,
1997). We observe that, for all three datasets, interacting proteins are more likely to
be involved in the same functions if they are more homologous (see Table 3.8). We
use the following notation:
N: the total number of proteins in the graph;
P_u: protein u (u = 1..N);
Nei(u): the set of proteins interacting directly with protein P_u;
w_{u,v}: the weight assigned to the edge connecting proteins P_u and P_v, where:
w_{u,v} = the expected value, if P_u and P_v directly interact; ∞, otherwise.
Table 3.8: Average number of common functions of interacting proteins:
        E-value ≤ 1E-3   E-value > 1E-3
Yeast   5.28             4.27
Worm    4.02             2.38
Fly     5.75             3.78
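A minimal sketch of the weighted-graph construction described above; the protein identifiers and E-values are hypothetical, and a plain dictionary stands in for an actual bl2seq run:

```python
import math

def build_weighted_ppi(interactions, evalues):
    """Build an undirected weighted PPI graph.

    interactions: list of (u, v) pairs; evalues: dict mapping frozenset({u, v})
    to the bl2seq expected value, capped at 10. Unknown E-values default to 10.
    """
    graph = {}
    for u, v in interactions:
        w = min(evalues.get(frozenset((u, v)), 10.0), 10.0)
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph

def weight(graph, u, v):
    """w_{u,v}: the E-value if u and v directly interact, infinity otherwise."""
    return graph.get(u, {}).get(v, math.inf)
```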
3.2.2.2 Stage 1
Based on the fact that proteins of known function and cellular location tend to
cluster together in a PPI network (Schwikowski et al., 2000), we perform
agglomerative clustering (Jain et al., 1999) in the weighted graph G. We use this
method because it can generate many small clusters. Initially we treat each node in
the graph as its own cluster, and then proceed by merging the two nearest clusters into
a new cluster. The distance between clusters C_i and C_h is calculated as
d(C_i, C_h) = min{w_{u,v} : P_u ∈ C_i, P_v ∈ C_h}, in a single-linkage manner. A cluster is
excluded from the agglomeration process when the following heuristic rules are
satisfied:
a) the cardinality of the cluster exceeds a threshold τ, and (1)
b) the entropy of the cluster, H = (−Σ_s p_s log2 p_s) / log2 c, does not exceed a threshold γ, (2)
where c is the number of function categories in the cluster, and p_s is the relative
frequency of category s in the cluster. In our experiments, we use τ = N/50 and γ =
0.85 (the values were determined experimentally).
The merging process stops when no further new clusters are formed. The following
pseudocode describes the algorithm:
Algorithm 1: Agglomerative clustering in the 1st stage
Input: Matrix W = {w_{u,v}}
1 Initialize the distance matrix D = {d_{u,v}}, with d_{u,v} = w_{u,v} (u = 1..N, v = 1..N)
2 Initialize N clusters corresponding to the N proteins.
3 Find the smallest distance in D, and merge the corresponding clusters, say C_i
and C_h, to get a new cluster (C_iC_h)
4 Recompute the distance between (C_iC_h) and any other cluster C_k in matrix D
by:
  d((C_iC_h), C_k) = ∞, if the cluster (C_iC_h) satisfies rules (1) and (2);
  d((C_iC_h), C_k) = min{d(C_i, C_k), d(C_h, C_k)}, otherwise
5 If all distances in D are ∞ then stop, otherwise go to step 3.
Output: Cluster sets
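Algorithm 1 can be sketched as follows, assuming the entropy rule (2) excludes a cluster once its normalized entropy does not exceed γ; the protein names and annotations in the test are hypothetical:

```python
import math

def entropy(cluster, annotations):
    """Normalized entropy over the function categories present in the cluster."""
    counts = {}
    for p in cluster:
        for f in annotations.get(p, ()):
            counts[f] = counts.get(f, 0) + 1
    total, c = sum(counts.values()), len(counts)
    if c <= 1:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values()) / math.log2(c)

def single_linkage(ci, ch, w):
    """d(Ci, Ch) = minimum weight between any pair of members (single linkage)."""
    return min((w.get(frozenset((u, v)), math.inf) for u in ci for v in ch),
               default=math.inf)

def agglomerate(proteins, w, annotations, tau, gamma):
    clusters = [{p} for p in proteins]   # one cluster per protein initially
    done = []
    while True:
        best, pair = math.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j], w)
                if d < best:
                    best, pair = d, (i, j)
        if pair is None or best == math.inf:
            break
        i, j = pair
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        # exclusion rules (1) and (2): big enough and functionally homogeneous
        if len(merged) > tau and entropy(merged, annotations) <= gamma:
            done.append(merged)
        else:
            clusters.append(merged)
    return done + clusters
```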
Figure 3.21 shows clusters generated at this stage for the yeast, worm and fly
datasets, respectively. The figures indicate that the method takes into account not only
the homologies (distances) but also the functions shared among the proteins. For
example, although the yeast proteins NAS2 and ROT1 are not homologous, they share
GO terms for the same functions (physiological process (GO:0007582), cell
(GO:0005623), and cellular process (GO:0009987)); thus, they are put
in the same cluster. Similarly, the worm proteins jkk-1 and F43G6 share the binding
molecular function (GO:0005488); the fly proteins pont and Or67c share the functions
binding (GO:0005488), physiological process (GO:0007582) and response to
stimulus (GO:0050896).
a) a cluster generated by ClusFCM in the Yeast network
Figure 3.21: Clusters generated by the ClusFCM algorithm using the yeast, worm and fly data.
3.2.2.3 Stage 2
Each cluster is now considered as an FCM (Kosko, 1984; Fletcher et al., 2005).
A PPI network is a dynamic system in which proteins, through interacting edges, flow
their functions to other proteins. We use the FCM to simulate how likely it is that a
protein receives a function annotation from its neighbors and, in turn, how the newly
annotated protein propagates the function of interest through the network. In the
terminology of the FCM, nodes (proteins) are causal concepts and edges
(interactions) are cause-effect relationships (see Figure 3.22). For a function of
interest, we denote S = (s_1, s_2, ..., s_n) as an instantaneous state vector, where s_u = 1 (on) if
protein u is annotated or predicted with the function, and s_u = 0 (off) otherwise. As
defined in the FCM, an unannotated protein u is switched on if rule (3) is activated:
Σ_{v∈Nei(u)} s_v (10 − w_{v,u}) > θ (3)
where θ is an activation threshold (explained later).
Figure 3.22: In this FCM, proteins shown in black are annotated. The initial state vector is
(1 0 1 1 1 1 0 0 0).
Starting with an initial state vector, we iteratively apply rule (3) to all unannotated
proteins until an equilibrium state (hidden pattern) is reached. The algorithm for
prediction of a function of interest is outlined below.
Algorithm 2: Finding the hidden pattern in the 2nd stage
Input: A cluster C and an initial state vector
1 Initially assign to each unannotated protein u the functions a that have the highest
scores |Nei(u)| × π_a, where π_a is the fraction of proteins annotated with
function a.
2 Set the initial state vector S_0 = (s_{0,1}, s_{0,2}, ..., s_{0,|C|}), where |C| is the cardinality of
cluster C, and
  s_{0,u} = 1, if protein u is annotated or predicted with the function of interest;
  s_{0,u} = 0, otherwise
3 Apply rule (3) to each protein in cluster C.
4 If S_t = S_{t−1} then S_t is the hidden pattern, otherwise go to step 3.
Output: The hidden pattern of the cluster.
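Algorithm 2 can be sketched as follows, assuming the unnormalized form of activation rule (3), i.e. a node switches on when Σ s_v (10 − w_{v,u}) over its neighbors exceeds θ; the toy network in the test is hypothetical:

```python
def hidden_pattern(state, neighbors, w, theta):
    """Iterate the FCM update until the state vector stops changing.

    state: dict protein -> 0/1 (1 = annotated/predicted with the function);
    neighbors: dict protein -> iterable of neighbors;
    w: dict frozenset({u, v}) -> edge weight (E-value, 0..10).
    Assumes the activation rule: switch u on when
    sum over neighbors v of s_v * (10 - w_{v,u}) exceeds theta.
    """
    state = dict(state)
    while True:
        changed = False
        for u in state:
            if state[u] == 0:
                flow = sum(state[v] * (10.0 - w.get(frozenset((u, v)), 10.0))
                           for v in neighbors.get(u, ()))
                if flow > theta:
                    state[u] = 1
                    changed = True
        if not changed:
            return state  # the hidden (equilibrium) pattern
```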
3.2.3 Description of methods used for comparison
We compared the performance of the ClusFCM algorithm with four methods:
Majority (Schwikowski et al., 2000), χ²-statistics (Hishigaki et al., 2001), MRF
(Letovsky and Kasif, 2003) and FunctionalFlow (Nabieva et al., 2005).
3.2.3.1 Majority
For each protein in a set, we assume it is unannotated and then count the
number of times each function occurs in its nearest neighbors. The functions with the
highest frequencies are assigned to the query protein.
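A minimal sketch of the Majority rule; `top_k` reflects the top-ten selection used later in the experiments:

```python
from collections import Counter

def majority_predict(protein, neighbors, annotations, top_k=10):
    """Assign the top_k functions occurring most often among direct neighbors."""
    counts = Counter(f for v in neighbors.get(protein, ())
                     for f in annotations.get(v, ()))
    return [f for f, _ in counts.most_common(top_k)]
```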
3.2.3.2 χ²-statistics
First, for each function a among the 39 GO terms we derive the fraction π_a =
(number of proteins having function a) / N. Then we calculate n_a as the number of
proteins among the query's neighbors that have function a, and e_a as the expected
number: e_a = (number of proteins in its nearest neighborhood) × π_a. The query protein is
annotated with the functions having the highest χ²_a values among the functions of all
proteins in its nearest neighborhood, where χ²_a = (n_a − e_a)² / e_a.
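The χ²-statistics scoring can be sketched as follows; the network and the overall function fractions π_a in the test are hypothetical:

```python
def chi_square_predict(protein, neighbors, annotations, pi, top_k=10):
    """Score each candidate function a by chi2_a = (n_a - e_a)^2 / e_a.

    pi: dict function -> overall fraction of proteins annotated with it.
    """
    nei = list(neighbors.get(protein, ()))
    scores = {}
    for a in {f for v in nei for f in annotations.get(v, ())}:
        n_a = sum(a in annotations.get(v, ()) for v in nei)  # observed count
        e_a = len(nei) * pi[a]                               # expected count
        if e_a > 0:
            scores[a] = (n_a - e_a) ** 2 / e_a
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```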
3.2.3.3 FunctionalFlow
We implemented this method as outlined in Nabieva et al. (2005), using the
following notation:
– R^a_t(u) is the amount in the reservoir for function a that node u has at time t.
Initially (t = 0): R^a_0(u) = ∞ if node u is annotated with a, otherwise 0; and,
at time t:
R^a_t(u) = R^a_{t−1}(u) + Σ_{v:(u,v)∈E} (g^a_t(v,u) − g^a_t(u,v))
– g^a_t(u,v) is the flow of function a at time t from protein u to protein v.
Initially (t = 0): g^a_0(u,v) = 0; and,
at time t: g^a_t(u,v) = min(1, R^a_{t−1}(u) / Σ_{y:(u,y)∈E} w_{u,y}) if R^a_{t−1}(u) > R^a_{t−1}(v),
otherwise 0.
– f_a(u), the functional score for node u and function a over d iterations, is calculated
as the total amount of flow that has entered node u:
f_a(u) = Σ_t Σ_{v:(u,v)∈E} g^a_t(v,u) (4)
The algorithm is iteratively run d times, where d is set to half the diameter of the
interaction network, namely 6, 7 and 6 for the yeast, worm and fly physical
interaction networks, respectively. The functions having the highest scores (Equation
4) for a protein are assigned to that protein.
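A compact sketch of one FunctionalFlow run for a single function, assuming the min(1, ·) capacity rule quoted above; annotated seed nodes start with infinite reservoirs, and the toy chain graph in the test is hypothetical:

```python
import math

def functional_flow(graph, annotated, d):
    """Run d rounds of flow for one function and return the in-flow scores f(u).

    graph: dict u -> {v: edge weight}; annotated: set of seed proteins.
    """
    reservoir = {u: (math.inf if u in annotated else 0.0) for u in graph}
    score = {u: 0.0 for u in graph}
    for _ in range(d):
        # flows g_t(u, v) depend only on the reservoirs at time t-1
        flow = {}
        for u in graph:
            degree_w = sum(graph[u].values())
            for v in graph[u]:
                if reservoir[u] > reservoir[v] and degree_w:
                    flow[(u, v)] = min(1.0, reservoir[u] / degree_w)
        for (u, v), g in flow.items():
            if reservoir[u] != math.inf:   # seeds keep an unlimited reservoir
                reservoir[u] -= g
            reservoir[v] += g
            score[v] += g                  # accumulate total in-flow (Equation 4)
    return score
```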
3.2.3.4 MRF
This algorithm exploits the difference between two probabilities:
– p1 is the probability that the two nodes in an interacting pair share function a;
– p0 is the probability that the two nodes in an interacting pair do not share the
function.
Using the Markov assumption, the probability distribution for the annotating function a
of node u, L^a_u, is conditionally independent of the rest of the network given its
neighbors. To derive the probability that protein u has function a, conditioned on its
neighbors Nei(u) and the number n^a_u of those neighbors annotated with function a,
the Bayes rule is used:
P(L^a_u | Nei(u), n^a_u) = P(n^a_u | L^a_u, Nei(u)) P(L^a_u) / P(n^a_u | Nei(u)) (5)
where:
– P(L^a_u) = f_a is the frequency of function a in the network;
– P(n^a_u | L^a_u, Nei(u)) is the probability of having n^a_u proteins labeled with
function a among the neighbors Nei(u), given that protein u has the function. This
probability is expected to follow a binomial distribution:
P(n^a_u | L^a_u, Nei(u)) = B(|Nei(u)|, n^a_u, p1), with B(N, k, p) = C(N, k) p^k (1 − p)^{N−k}
– P(n^a_u | Nei(u)) is the probability of having n^a_u proteins labeled with function a
among the neighbors Nei(u):
P(n^a_u | Nei(u)) = f_a B(|Nei(u)|, n^a_u, p1) + (1 − f_a) B(|Nei(u)|, n^a_u, p0)
Thus, Equation (5) becomes:
P(L^a_u | Nei(u), n^a_u) = f_a B(|Nei(u)|, n^a_u, p1) / [f_a B(|Nei(u)|, n^a_u, p1) + (1 − f_a) B(|Nei(u)|, n^a_u, p0)] (6)
Equation (6) is applied iteratively to each unannotated protein u in the network,
separately for each function a; if the probability in (6) exceeds a threshold, protein
u is reclassified with function a. The process stops when no further labeling
occurs.
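Equation (6) reduces to a small computation; a sketch, with the p1/p0 notation assumed from the reconstruction above and illustrative parameter values in the test:

```python
from math import comb

def binom(n, k, p):
    """B(n, k, p) = C(n, k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def mrf_posterior(n_neighbors, n_with_a, f_a, p1, p0):
    """Equation (6): posterior that u has function a, given that n_with_a of its
    n_neighbors neighbors are labeled with a.

    f_a: network frequency of a; p1/p0: probabilities that an interacting pair
    does / does not share the function.
    """
    num = f_a * binom(n_neighbors, n_with_a, p1)
    den = num + (1 - f_a) * binom(n_neighbors, n_with_a, p0)
    return num / den if den else 0.0
```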
3.2.4 Assessment of prediction
We use the leave-one-out cross-validation procedure to evaluate the predictions
resulting from each method. For each protein in the datasets, we assume that it is
unannotated and then use each of the above methods to predict the protein's functions.
Let A be the annotated function set, P the predicted function set, and F the
whole collection of 39 GO functions. We calculate the numbers of true positives (TP),
false positives (FP), true negatives (TN) and false negatives (FN) as follows:
– TP (the number of predicted functions that actually are annotated) = |A ∩ P|
– FP (the number of predicted functions that are not annotated) = |P \ A|
– FN (the number of annotated functions that are not predicted) = |A \ P|
– TN (the number of functions neither predicted nor annotated) = |F \ (A ∪ P)|
This calculation, used in information retrieval (van Rijsbergen, 1979) (where A, the
annotated function set, corresponds to the relevant documents and P, the predicted
function set, to the retrieved documents), is illustrated in Figure 3.23.
Figure 3.23: Within the set of 39 GO functions, TPs are shown as the white intersection of the
Annotated and Predicted circles, FPs as the grey area on the right, FNs as the grey area on the left,
and TNs as the rest of the entire area.
The following measures are used for assessing the performance of ClusFCM:
Precision = TP / (TP + FP) = |A ∩ P| / |P|
Recall = TP / (TP + FN) = |A ∩ P| / |A|
Matthews Correlation Coefficient (MCC) = (TP × TN − FP × FN) / √((TP + FN)(TP + FP)(TN + FP)(TN + FN))
Harmonic mean (HM) = 2 / (1/Precision + 1/Recall)
The precision (a.k.a. positive predictive value) is the probability that a predicted
function is correct (|A ∩ P| / |P|), while the recall (a.k.a. sensitivity) is the
probability that an annotated function is predicted (|A ∩ P| / |A|). The HM combines
precision and recall into a single number ranging from 0 to 1. If HM is 0 then no
annotated functions have been predicted, while if it is 1 then all annotated functions
are predicted and all predicted functions are correct (van Rijsbergen, 1979).
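The four measures can be computed directly from the annotated and predicted function sets; a sketch, with made-up function names in the test:

```python
def evaluate(annotated, predicted, all_functions):
    """Compute TP/FP/FN/TN set-wise and the four performance measures."""
    A, P, F = set(annotated), set(predicted), set(all_functions)
    tp = len(A & P)          # predicted and annotated
    fp = len(P - A)          # predicted but not annotated
    fn = len(A - P)          # annotated but not predicted
    tn = len(F - (A | P))    # neither predicted nor annotated
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    d = ((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn)) ** 0.5
    mcc = (tp * tn - fp * fn) / d if d else 0.0
    hm = 2 / (1 / precision + 1 / recall) if precision and recall else 0.0
    return {"precision": precision, "recall": recall, "MCC": mcc, "HM": hm}
```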
3.3 Results and Discussion
ClusFCM and the other four methods were implemented in Java and tested on
the three datasets. Table 3.9 shows the performance of our algorithm. The
relationship between precision and recall, using different thresholds in stage 2, is
shown in Figure 3.24. To determine the value of the threshold θ used in rule (3), we
applied the algorithm with different values of θ in the range from 0.5 to 1, in
increments of 0.01. The value of θ for which the algorithm yields the highest MCC
value is used; the chosen thresholds are 0.61 for the yeast, 0.43 for the worm, and 0.44
for the fly data.
Figure 3.24: Results of ClusFCM prediction for different thresholds on the three datasets.
Table 3.9: Performance measures of ClusFCM on the three datasets using leave-one-out validation.
Yeast Worm Fly
Precision 0.63 0.33 0.34
Recall 0.70 0.70 0.73
MCC 0.61 0.43 0.44
HM 0.66 0.45 0.47
The results of the four methods using leave-one-out cross-validation are shown in
Table 3.10. For the Majority, FunctionalFlow and χ²-statistics methods we select the
top ten functions having the highest scores and assign these functions to each
unannotated protein. For the MRF method, we use threshold values from 0 to 1, in
increments of 0.1, as used in Equation (6), to predict the functions. For each method,
we choose the threshold which yields the highest MCC value. Interestingly, we found
that with the selected thresholds the HM measure also achieves its highest value for
each method. As shown, ClusFCM outperforms the other methods in terms of the
recall, MCC and HM measures. The results of χ²-statistics, when compared with
Majority and FunctionalFlow, are worse on the Yeast data but better on the Worm and
Fly data. This is because the fraction of proteins having more than one neighbor is
smaller in the Fly and Worm datasets, which prominently shows that the underlying
topology of the PPI network plays a crucial role. FunctionalFlow performs similarly
to Majority on the three datasets. These methods, however, do not take into account
indirect effects, in which a protein, after being annotated, is used to annotate other
proteins. MRF performs well on the yeast dataset, where almost all proteins are
annotated, but on the worm and fly datasets ClusFCM performs significantly better in
terms of the recall, MCC and HM. Because the results for ClusFCM and MRF are
similar, we provide additional analysis using a bootstrap percentile test.
Table 3.10: Performance measures of ClusFCM and the other methods on the yeast, worm
and fly physical interaction networks using the leave-one-out test.
            ClusFCM   Majority   χ²-statistics   FunctionalFlow   MRF
Yeast network
Precision 0.63 0.63 0.27 0.64 0.72
Recall 0.70 0.45 0.53 0.44 0.57
MCC 0.61 0.47 0.24 0.47 0.59
HM 0.66 0.52 0.35 0.52 0.64
Worm network
Precision 0.33 0.21 0.21 0.28 0.29
Recall 0.70 0.37 0.71 0.34 0.63
MCC 0.43 0.22 0.32 0.26 0.38
HM 0.45 0.27 0.32 0.31 0.40
Fly network
Precision 0.34 0.28 0.20 0.34 0.34
Recall 0.73 0.32 0.63 0.28 0.55
MCC 0.44 0.24 0.25 0.25 0.37
HM 0.47 0.30 0.30 0.31 0.42
We note that functional annotations for the proteins are incomplete at present. In the
Yeast database, there are 348 unannotated proteins out of the 4948 proteins (7%) that
participate in interactions. This fraction is approximately 35% for Fly proteins and
40% for Worm proteins. Therefore, a protein may have a function that has not yet
been experimentally verified. We wish to decrease the number of annotated functions
that are not predicted and to increase the number of predicted functions that are actually
annotated. The fact that the values of recall are always higher than the values of
precision on all datasets increases confidence in our methods. Table 3.11 shows the
numbers of TPs and FNs predicted by each method. ClusFCM identifies more TPs and
fewer FNs than any other method over the three datasets.
To visualize the performance of ClusFCM we use Receiver Operating Characteristic
(ROC) curves, which plot the TP rate (TPR, equal to TP/(TP+FN)) against the FP rate
(FPR, equal to FP/(FP+TN)) for different thresholds (Bradley, 1997). For comparison,
we also show the ROC curves for the Majority, χ²-statistics, FunctionalFlow and
MRF methods, obtained by varying the number of top-scoring functions (from 1 to 10) and
the thresholds (from 0 to 1, in increments of 0.1), as shown in Figure 3.25. The closer the
curve follows the top left-hand area of the ROC space, the more accurate the classifier.
A random classifier would have its ROC lying along the diagonal line connecting
points (0,0) and (1,1). We see that the performance of ClusFCM is better than that of
Majority, χ²-statistics and FunctionalFlow. The values of the areas under the ROC curves
(AUC) for all classifiers are shown in Table 3.12.
Table 3.11: The numbers of TP, FP, TN and FN of ClusFCM and the other methods on the yeast,
worm and fly physical interaction networks using the leave-one-out test on the 39 protein functions.
            ClusFCM   Majority   χ²-statistics   FunctionalFlow   MRF
Yeast network (4948x39 data points)
TP 17411 11213 13178 11035 14302
FP 10195 6658 36302 6316 5671
TN 157792 161329 131685 161671 162316
FN 7574 13772 11807 13950 10683
Worm network (2780x39 data points)
TP 4539 2409 4603 2190 4104
FP 9311 9108 17567 5656 9992
TN 92629 92832 84373 96284 91948
FN 1941 4071 1877 4290 2376
Fly network (7938x39 data points)
TP 18072 8099 15776 7032 13608
FP 34555 20516 64736 13780 26040
TN 250104 264143 219923 270879 258619
FN 6851 16824 9147 17891 11315
Figure 3.25: Comparison of ClusFCM results with the four other methods on the three datasets.
Table 3.12: The AUC values of ClusFCM and the other methods on the yeast, worm and fly physical
interaction networks.
        ClusFCM   Majority   χ²-statistics   FunctionalFlow   MRF
Yeast 0.89 0.62 0.63 0.62 0.87
Worm 0.86 0.61 0.72 0.60 0.86
Fly 0.86 0.61 0.67 0.59 0.82
Next, we analyze the predictive power of ClusFCM and the other four methods by
using leave-one-out cross-validation on proteins having at least two neighbors, at
least three neighbors, and so on. The results show that the performance of ClusFCM
improves as the number of neighbors increases (see Figure 3.26). Moreover, all
performance measures for ClusFCM increase monotonically over all datasets. This
important characteristic is absent in the other four methods.
Figure 3.26: Performance of ClusFCM and the other methods on the yeast, worm and fly interaction
networks using the leave-one-out procedure for proteins having at least one neighbor, two neighbors, etc.
To gauge the uncertainty of the algorithms' performance we next use the bootstrap
percentile test (DiCiccio and Efron, 1996; Davison and Hinkley, 1997; Keller, 2005)
to calculate confidence intervals. Bootstrapping is a non-parametric technique
involving a large number of repetitive computations: from each protein dataset of size
N, we sample n = 50/α datasets of size N (with replacement), where α is the
confidence level of the test. With a confidence level of 95%, we have 1000 datasets
for each of the three organisms: yeast, worm and fly. For each dataset i, we compute the
differences of the performance measures, d_MCC,i, d_HM,i and d_AUC,i, between our
method and each of the other four. These 1000 values represent the non-parametric
distribution of the random variables d_MCC, d_HM and d_AUC. We consider a confidence
interval [low_ms, up_ms] such that p(low_ms < d_ms < up_ms) = 0.95 under the
distribution of d_ms, where ms is the MCC, HM or AUC measure. Formally, with the
confidence level of 95%, the lower bound low_ms is the quantile of the bootstrap
distribution at level Φ(z_0 + (z_0 + z_{0.025}) / (1 − a(z_0 + z_{0.025}))), and the upper
bound up_ms is the quantile at level Φ(z_0 + (z_0 + z_{0.975}) / (1 − a(z_0 + z_{0.975}))),
where Φ is the standard normal c.d.f., z_α is the α-quantile of the standard normal
distribution, z_0 is the bias-correction, and a is the acceleration. The null hypothesis,
that both algorithms perform equally well, can be rejected if 0 lies outside of the
interval. Table 3.13 shows the confidence intervals for ClusFCM and the other methods
over 1000 bootstrap datasets for yeast, worm, and fly.
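For illustration, a plain percentile bootstrap (without the bias-correction z_0 and acceleration a of the BCa variant used here) can be sketched as follows; the statistic in the test is a simple mean over made-up data:

```python
import random

def bootstrap_ci(stat_fn, data, n_boot=1000, alpha=0.05, seed=0):
    """Plain percentile bootstrap CI for a statistic such as a performance difference.

    stat_fn: maps a resampled dataset to the statistic d_ms
    (e.g. MCC of our method minus MCC of another method).
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]   # resample with replacement
        stats.append(stat_fn(sample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]           # 2.5% percentile
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]   # 97.5% percentile
    return lo, hi
```

The null hypothesis of equal performance is rejected when 0 falls outside the returned interval, mirroring the test applied in Table 3.13.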
The intervals show that ClusFCM performs significantly better in terms of the MCC,
HM and AUC (except for the similar AUC measure of MRF on the worm dataset).
Table 3.13: 95% confidence intervals for the difference of MCC, HM and AUC values between ClusFCM and the four methods over 1000 bootstrap datasets.

                  Majority               χ²-statistics          Functional Flow        MRF
                  low        up          low        up          low        up          low        up
Yeast  MCC    0.136041   0.136465    0.370374   0.370917    0.135682   0.136096    0.016199   0.016530
       HM     0.139726   0.139119    0.308030   0.308444    0.140698   0.141098    0.025778   0.026078
       AUC    0.259770   0.259997    0.256976   0.257223    0.262113   0.262330    0.016500   0.016841
Worm   MCC    0.215037   0.215894    0.116154   0.116763    0.173820   0.174690    0.055099   0.055582
       HM     0.178593   0.179347    0.124988   0.125462    0.140512   0.141335    0.047477   0.047882
       AUC    0.244011   0.244480    0.141463   0.141948    0.260912   0.261352   -0.002379   0.001815
Fly    MCC    0.199280   0.199709    0.185497   0.185852    0.182837   0.183273    0.066934   0.067247
       HM     0.163064   0.163491    0.166621   0.166905    0.158066   0.158527    0.044335   0.044620
       AUC    0.251137   0.251342    0.191224   0.191451    0.267226   0.267428    0.039924   0.040251
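The resampling step described above can be sketched as follows. This is a minimal percentile-bootstrap illustration, not the full bias-corrected and accelerated variant; the record lists and the caller-supplied `metric` function are hypothetical stand-ins for the per-protein prediction records and the MCC/HM/AUC computations.

```python
import random

def bootstrap_diff_ci(records_a, records_b, metric, n_boot=1000, alpha=0.05):
    """Percentile bootstrap CI for the difference metric(A) - metric(B).

    records_a / records_b are paired per-protein prediction records for two
    methods; `metric` maps a list of records to a scalar score (e.g. MCC).
    The null hypothesis of equal performance is rejected if 0 lies outside
    the returned interval.
    """
    n = len(records_a)
    diffs = []
    for _ in range(n_boot):
        # resample protein indices with replacement (a bootstrap dataset)
        idx = [random.randrange(n) for _ in range(n)]
        d = (metric([records_a[i] for i in idx])
             - metric([records_b[i] for i in idx]))
        diffs.append(d)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

With α = 0.05 and 1000 resamples this reproduces the setup of Table 3.13: one interval per (organism, measure, competing method) triple.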
Let us note that the yeast two-hybrid method has been popular because it can be scaled up. However, it produces false positive and false negative errors (Letovsky and Kasif, 2003). We addressed this problem by repeating the predictions of the methods on the fly dataset using only high-confidence interaction data. The results of this study are shown in Table 3.14.
Table 3.14: Results of ClusFCM and the other methods on the 4621 high-confidence fly interactions.

             ClusFCM   Majority   χ²-statistics   Functional Flow   MRF
Precision    0.28      0.19       0.20            0.28              0.31
Recall       0.76      0.35       0.79            0.41              0.67
MCC          0.40      0.18       0.32            0.28              0.40
HM           0.41      0.24       0.32            0.33              0.41
4. GAKREM: a Clustering Algorithm that Automatically Generates a Number
of Clusters
4.1 Overview
Clustering means identifying groups of objects that are similar to each other (based on a chosen similarity measure) but dissimilar to objects belonging to other groups. Clustering is among the most important, and at the same time most difficult, data mining methods. The reason for its importance is that most of the huge amounts of data being collected in any given domain are unsupervised, meaning that there are no known outputs corresponding to the inputs. In such a situation one can use only one of two key unsupervised learning methods: clustering or association rules. If the outcome of clustering is successful, namely, an algorithm is able to reliably find true natural groupings in the data, then the problem of analyzing such data becomes much easier, because domain experts can label (assign meaning to) the found clusters, and one can then use a multitude of supervised learning algorithms to further analyze the data. Unfortunately, assessing the reliability/correctness of clustering is an open research question. First, for almost all clustering algorithms the user must guess the number of clusters to be generated as an input parameter. If no deep knowledge about the domain and the data exists, then correct guessing is very difficult. To address this problem, several measures for evaluating the correctness of clustering, called cluster validity measures, have been developed. They are used as follows: one generates several clusterings of the data and then uses a measure whose optimum value should indicate the correct grouping (say n_corr, the correct number of clusters). However, almost all of the cluster validity measures suffer from the monotonicity property: the more clusters are generated, the better the cluster validity performance index becomes. Second, clustering algorithms such as K-means and EM perform a localized search, where the solution obtained at the next iteration is in the vicinity of the current solution. In other words, they are very sensitive to initialization. Such clustering algorithms may suffer from overfitting, generating too many clusters that do not reflect the true underlying structure (Dempster et al., 1977; Jain and Dubes, 1988; McQueen, 1967; Jain et al., 1999).
Well-known clustering algorithms include statistical clustering (Dempster et al., 1977; Wu, 1983; Xu and Jordan, 1996), hierarchical clustering (Jain and Dubes, 1988), partitional clustering (McQueen, 1967; Zahn, 1971), nearest neighbor clustering (Su and Ku, 1978), fuzzy C-means clustering (Ruspini, 1969; Bezdek, 1981) and neural network clustering (Kohonen, 1989). Among them, K-means is widely used because
generated the data point x_i. In a finite mixture model, technically, y_i = {y_i^1, ..., y_i^k} is a k-dimensional binary vector corresponding to the k clusters, i.e. if cluster h (h = 1..k) produced the data point x_i then y_i^h = 1 and y_i^j = 0 for j ≠ h.
 Θ denotes the parameters of the density to be estimated. In normal mixtures, Θ contains two parts: Θ = {α, θ}, where α denotes the mixing prior probabilities of the unobserved data Y and θ is the set of parameters {μ, Σ} defining the Gaussian distributions p. More specifically, if the normal mixture has k clusters then α = {α_1, ..., α_k} is a k-dimensional vector, where α_h is the a priori probability of cluster h, and μ = {μ_1, ..., μ_k} and Σ = {Σ_1, ..., Σ_k} are the mean and covariance vectors of the Gaussian distributions p, respectively.
 Θ^(t) = {α^(t), θ^(t)} are the parameters of the density at time t.
 Z = {X, Y} is a complete data set, where X is the observed (or incomplete) and Y the unobserved (or missing) data, respectively.
4.2.2 GAKREM Algorithm
In section 4.2.2.2 the three-phase GAKREM algorithm is introduced and explained. To understand GAKREM, we first briefly review the EM and K-means algorithms, which are used within it.
4.2.2.1 EM and K-means algorithms overview
Suppose that the data points in the incomplete data set X are independently and identically distributed with an underlying distribution p; then the probability density function of X can be expressed as:

p(X \mid \Theta) = \prod_{i=1}^{N} p(x_i \mid \Theta)    (1)

From Equation (1) the following incomplete-data log-likelihood of the parameter Θ given the data set X can be obtained:

l(\Theta \mid X) = \log p(X \mid \Theta) = \log \prod_{i=1}^{N} p(x_i \mid \Theta) = \sum_{i=1}^{N} \log p(x_i \mid \Theta)    (2)
In the maximum likelihood (ML) problem, we are interested in finding the Θ that maximizes the incomplete-data log-likelihood function in Equation (2), namely:

\Theta^{*} = \arg\max_{\Theta} l(\Theta \mid X)    (3)

Typically, Equation (3) can be solved (using a Lagrange multiplier for any constraints) by setting:

\frac{\partial l(\Theta \mid X)}{\partial \Theta} = 0    (4)

In many problems, however, Equation (4) becomes nonlinear. Thus it is hard to find a closed-form solution for Equation (4), although there exist iterative methods, such as Newton or quasi-Newton, for solving it. The general EM algorithm is an elaborate technique for obtaining Θ* (Dempster et al., 1977; Wu, 1983; Xu and Jordan, 1996).
Recall that Z = {X, Y} is a complete data set; thus, the complete-data log-likelihood function is defined by:

l(\Theta \mid Z) = \log p(Z \mid \Theta) = \log p(X, Y \mid \Theta)    (5)
Given the current parameter estimates and the observed data X, the EM algorithm maximizes the incomplete-data log-likelihood function by iteratively maximizing the expectation of the complete-data log-likelihood function of Equation (5). Intuitively, the EM algorithm consists of the following two steps:
 E-step: at time t, the EM computes the expected value of the complete-data log-likelihood function l(Θ^(t) | Z), given the parameters at time t-1, Θ^(t-1), and the observed data X. The result is the function:

Q(\Theta^{(t)} \mid \Theta^{(t-1)}) = E[\log p(Z \mid \Theta^{(t)}) \mid X, \Theta^{(t-1)}] = \int \log p(X, Y \mid \Theta^{(t)}) \, p(Y \mid X, \Theta^{(t-1)}) \, dY    (6)

where p(Y | X, Θ^(t-1)) is the marginal distribution of the unobserved data Y, computed using Bayes' rule:

p(Y \mid X, \Theta^{(t-1)}) = \frac{p(X, Y \mid \Theta^{(t-1)})}{\int p(X, Y \mid \Theta^{(t-1)}) \, dY}    (7)
 M-step: the EM finds the parameter Θ that maximizes the expectation computed in the E-step, that is:

\Theta^{(t)} = \arg\max_{\Theta} Q(\Theta \mid \Theta^{(t-1)})    (8)

Starting from an initial guess of the parameter Θ^(0), the E-step and M-step are repeated until convergence. At each iteration the EM increases the incomplete-data log-likelihood function l(Θ | X) and is guaranteed to converge to a local maximum of the log-likelihood function (Dempster et al., 1977; Wu, 1983; Xu and Jordan, 1996).
In the mixture density parameter estimation problem, assuming that the incomplete data X is governed by a k-cluster finite mixture distribution, its density function can be written as the following probabilistic model:

p(x_i \mid \Theta) = \sum_{h=1}^{k} \alpha_h \, p_h(x_i \mid \theta_h)    (9)

where:
 x_i is a particular d-dimensional data point of the data set X;
 Θ = {α, θ} is the complete parameter set of the mixture;
 α = {α_1, ..., α_k} are the mixing probabilities, satisfying α_h > 0 ∀h and \sum_{h=1}^{k} \alpha_h = 1;
 θ = {θ_1, ..., θ_k} are the parameters defining the clusters;
 p_h is a Gaussian density function parameterized by θ_h = {μ_h, Σ_h}:
p_h(x_i \mid \theta_h) = p_h(x_i \mid \mu_h, \Sigma_h) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_h|^{1/2}} \exp\left( -\frac{1}{2} (x_i - \mu_h)^T \Sigma_h^{-1} (x_i - \mu_h) \right)    (10)
Consequently, the incomplete-data log-likelihood function corresponding to the k-cluster mixture is defined by:

l(\Theta \mid X) = \log \prod_{i=1}^{N} p(x_i \mid \Theta) = \sum_{i=1}^{N} \log \sum_{h=1}^{k} \alpha_h \, p_h(x_i \mid \theta_h)    (11)

In the case of k-cluster finite mixtures, Equation (5) for the complete-data log-likelihood function can be rewritten as:

l(\Theta \mid Z) = \log p(Z \mid \Theta) = \log p(X, Y \mid \Theta) = \sum_{i=1}^{N} \sum_{h=1}^{k} y_i^h \log\left[ \alpha_h \, p_h(x_i \mid \theta_h) \right]    (12)

where y_i^h = 1 and y_i^j = 0 for j ≠ h if cluster h produced data point x_i (see the definitions above).
Given the initial guess of Θ^(0), the EM iteratively performs the following two steps until convergence:

E-step: at time t, compute the expected value of the complete-data log-likelihood function in Equation (12). Note that the conditional expected values of y_i^h for every data point x_i of the unobserved data Y are derived by Bayes' rule:

E(y_i^h \mid x_i, \Theta^{(t)}) = p(y_i^h = 1 \mid x_i, \Theta^{(t)}) = \frac{\alpha_h^{(t)} \, p_h(x_i \mid \theta_h^{(t)})}{\sum_{j=1}^{k} \alpha_j^{(t)} \, p_j(x_i \mid \theta_j^{(t)})}    (13)
M-step: maximize the log-likelihood function of the complete data Z by updating the parameters for time t+1:

\alpha_h^{(t+1)} = \frac{1}{N} \sum_{i=1}^{N} E(y_i^h \mid x_i, \Theta^{(t)})    (14)

\mu_h^{(t+1)} = \frac{\sum_{i=1}^{N} x_i \, E(y_i^h \mid x_i, \Theta^{(t)})}{\sum_{i=1}^{N} E(y_i^h \mid x_i, \Theta^{(t)})}    (15)

\Sigma_h^{(t+1)} = \frac{\sum_{i=1}^{N} E(y_i^h \mid x_i, \Theta^{(t)}) \, (x_i - \mu_h^{(t+1)})(x_i - \mu_h^{(t+1)})^T}{\sum_{i=1}^{N} E(y_i^h \mid x_i, \Theta^{(t)})}    (16)
In addition to the initial guess of the parameter Θ^(0), the EM algorithm also requires the number of clusters to be specified a priori.
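The E-step and M-step updates of Equations (13) through (16) can be sketched in a few lines. This is a minimal NumPy illustration assuming full covariance matrices and no numerical safeguards, not the thesis implementation:

```python
import numpy as np

def em_gmm(X, alpha, mu, sigma, n_iter=50):
    """EM for a k-cluster Gaussian mixture, following Eqs. (10), (13)-(16).

    X: (N, d) data; alpha: (k,) mixing weights; mu: (k, d) means;
    sigma: (k, d, d) covariances (the initial guesses Theta^(0)).
    """
    N, d = X.shape
    k = len(alpha)
    for _ in range(n_iter):
        # E-step (Eq. 13): responsibilities E(y_i^h | x_i, Theta^(t))
        resp = np.empty((N, k))
        for h in range(k):
            diff = X - mu[h]
            inv = np.linalg.inv(sigma[h])
            norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma[h])))
            resp[:, h] = alpha[h] * norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step (Eqs. 14-16): update weights, means, covariances
        nh = resp.sum(axis=0)
        alpha = nh / N
        mu = (resp.T @ X) / nh[:, None]
        for h in range(k):
            diff = X - mu[h]
            sigma[h] = (resp[:, h, None] * diff).T @ diff / nh[h]
    return alpha, mu, sigma
```

On well-separated data this recovers the component means from a rough initial guess; the number of clusters k is still fixed in advance, which is exactly the limitation GAKREM removes.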
On the other hand, the objective of the K-means algorithm is to minimize the total intra-cluster variance, namely V = \sum_{h=1}^{k} \sum_{x_j \in C_h} \lVert x_j - \mu_h \rVert^2, where C_h is cluster h, h = 1..k, and μ_h is the mean point of all the points x_j ∈ C_h. Briefly, given a number k, the K-means algorithm performs these steps:
1) Randomly partition the data points into k clusters.
2) Compute the centroids of the clusters; the centroid is the mean point of the cluster.
3) Assign each data point to the cluster with the nearest mean point.
4) Repeat from step 2 until there is no further reassignment.
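Steps 1 through 4 above can be sketched as follows; this is a minimal illustration assuming points are given as tuples of floats, with the common variant of seeding the centroids from k sampled points:

```python
import random

def kmeans(points, k, max_iter=100):
    """Plain K-means following steps 1-4; points are d-dimensional tuples."""
    # Initialize centroids from k distinct data points (a common variant).
    centroids = random.sample(points, k)
    assign = [None] * len(points)
    for _ in range(max_iter):
        # Step 3: assign each point to the cluster with the nearest centroid.
        new_assign = [
            min(range(k),
                key=lambda h: sum((p - c) ** 2 for p, c in zip(x, centroids[h])))
            for x in points
        ]
        if new_assign == assign:          # Step 4: stop when nothing changes
            break
        assign = new_assign
        # Step 2: recompute each centroid as the mean of its members.
        for h in range(k):
            members = [x for x, a in zip(points, assign) if a == h]
            if members:
                centroids[h] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assign
```

Like EM, this requires k as an input and converges only to a local minimum of V, which motivates the global search in GAKREM.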
4.2.2.2 GAKREM's Pseudocode
GAKREM works in three phases. Phase I computes the initial guess of the parameters. Phase II is a novel method to evaluate the fitness of the generated chromosomes, while Phase III uses evolutionary generation to simultaneously search for the optimal fitness value and the best number of clusters of the mixture.
Phase I: Computing the first guess of parameters Θ^(0)
To address the drawbacks of EM and K-means, namely that they often end up in a local optimum, we use a GA to perform a global search for the best solution. In contrast to the EM's and K-means' use of a hill-climbing search, GAs are used to move the population away from local optima (Holland, 1975; Davis, 1991). GAKREM starts with a population P, which is a set of a) initial guesses of the parameter Θ^(0) and b) numbers of clusters k used in the EM, both represented by chromosomes. A chromosome in P is encoded as a binary string of length N (corresponding to the N data points in the incomplete data X). The i-th locus of a chromosome is set to 1 if the corresponding i-th data point in data set X is a centroid of a cluster; otherwise it is 0. As a result, the number of 1-loci in each chromosome is also the number of clusters k input to the EM. We illustrate the idea by means of the following example:
Example: suppose that N is 10 and the number of chromosomes in P is 3; then each chromosome can be randomly configured by a uniform distribution as:

Position:      1234567890
Chromosome 1:  0001000010
Chromosome 2:  0100011001
Chromosome 3:  1010100100
This means that chromosome 1 specifies 2 clusters: the centroid of cluster 1 is the 4th data point and the centroid of cluster 2 is the 9th data point. Chromosome 2 has 4 clusters, whose centroids are the 2nd, 6th, 7th and 10th data points, respectively. Similarly, the 1st, 3rd, 5th and 8th data points are the centroids of the 4 clusters represented by chromosome 3.
To optimize the use of memory and time, instead of storing the full length of the chromosomes we encode only the positions of the 1-loci; thus the chromosomes in the above example can be represented as the sets of numbers (4,9), (2,6,7,10) and (1,3,5,8), respectively. Each chromosome now specifies the number of clusters, k, in the data. To use the EM algorithm we also need the value of the parameter Θ^(0), which is obtained by performing K-means clustering as follows:
Phase I: Computing initial parameters Θ^(0) = {α^(0), θ^(0)}
Input:  a data set X with N data points
        a chromosome C encoding k clusters: loci[1..k]
Output: Θ^(0) = {α^(0), θ^(0)}
1   For i ← 1 to N
2       For h ← 1 to k
3           Calculate the distance || x[i] - x[loci[h]] ||,
            where x[loci[h]] is the centroid of cluster h
4       End For
5       Assign the data point x_i to the cluster with the closest centroid
6   End For
7   For h ← 1 to k
8       Calculate α_h^(0) = |C_h| / N
9       Calculate μ_h^(0) = (1/|C_h|) Σ_{x_i ∈ C_h} x_i
10      Calculate Σ_h^(0) = (1/|C_h|) Σ_{x_i ∈ C_h} (x_i - μ_h^(0))(x_i - μ_h^(0))^T,
        where |C_h| is the cardinality of cluster C_h
11  End For
12  Return {α^(0), θ^(0)}
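The Phase I pseudocode can be sketched as follows; a minimal NumPy illustration assuming distinct loci, so every cluster contains at least its own centroid point and is therefore nonempty:

```python
import numpy as np

def initial_parameters(X, loci):
    """Phase I sketch: given the chromosome's 1-loci (indices of the chosen
    centroid points), assign every point to its nearest centroid and return
    the initial mixture parameters {alpha, mu, Sigma}."""
    k = len(loci)
    centroids = X[list(loci)]
    # nearest-centroid assignment (lines 1-6 of the pseudocode)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    alpha, mu, sigma = [], [], []
    for h in range(k):                    # lines 7-11 of the pseudocode
        members = X[labels == h]
        alpha.append(len(members) / len(X))        # alpha_h = |C_h| / N
        m = members.mean(axis=0)                   # mu_h = cluster mean
        mu.append(m)
        diff = members - m
        sigma.append(diff.T @ diff / len(members)) # Sigma_h = cluster covariance
    return np.array(alpha), np.array(mu), np.array(sigma)
```

The returned triple, together with k, is exactly the input the partial EM of Phase II needs.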
Phase II: Evaluation of the chromosomes' fitness
In addition to defining the chromosome encoding scheme, we need to specify a function that evaluates a chromosome's fitness. In the case of the finite mixture problem, we cannot simply perform EM on each chromosome and use the resulting log-likelihood function as a fitness value. The reason is that the EM suffers from the monotonicity property, namely, when the number of clusters increases, the maximum log-likelihood l(Θ | X) also increases. Therefore, each chromosome should be evaluated by its maximum log-likelihood function minus a penalty corresponding to the number of clusters encoded by the chromosome. Inspired by Occam's razor, we use the following simple fitness function to evaluate chromosome C:

fit(C) = l(\Theta \mid C) - \log(k)    (15)

where k is the number of clusters encoded by chromosome C and l(Θ | C) is the log-likelihood function resulting from performing the EM on chromosome C.
Let us note that the GA is still an exhaustive search algorithm, so Equation (15) would be an expensive formula to use. To decrease the computational cost, instead of running a full EM on each chromosome until convergence, we perform only a partial EM for (a few) r initial steps. Then we use logarithmic regression to predict the log-likelihood function at the convergence point. There are several papers on the convergence rate of the log-likelihood function l(Θ | C) of the general EM algorithm (Dempster et al., 1977; Wu, 1983; Redner and Walker, 1984; Xu and Jordan, 1996). In most cases, the convergence curves of l(Θ | C) are well fitted by logarithmic curves. The expected maximum log-likelihood function is computed as:

E(C) = a \log(r) + b    (16)

where, by least squares over the first r iterations,

a = \frac{r \sum_{t=1}^{r} l_t \log t - \sum_{t=1}^{r} \log t \sum_{t=1}^{r} l_t}{r \sum_{t=1}^{r} (\log t)^2 - \left( \sum_{t=1}^{r} \log t \right)^2} \qquad \text{and} \qquad b = \frac{\sum_{t=1}^{r} l_t - a \sum_{t=1}^{r} \log t}{r}

with l_t the log-likelihood value at iteration t of the EM; r is set to 5 in all experiments. The fitness value in Equation (15) now becomes:

fit(C) = E(C) - \log(k)    (17)
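The partial-EM fitness of Equations (16) and (17) can be sketched as follows; a minimal illustration that fits l_t = a log t + b by ordinary least squares over the r recorded iterations:

```python
import math

def predicted_fitness(loglik, k):
    """Fitness sketch (Eqs. 16-17): fit the first r partial-EM log-likelihood
    values l_t = a*log(t) + b by least squares, take E(C) = a*log(r) + b as
    the predicted value, and penalize by log(k)."""
    r = len(loglik)                          # r = 5 in the experiments
    xs = [math.log(t) for t in range(1, r + 1)]
    sx, sy = sum(xs), sum(loglik)
    sxy = sum(x * y for x, y in zip(xs, loglik))
    sxx = sum(x * x for x in xs)
    a = (r * sxy - sx * sy) / (r * sxx - sx * sx)   # slope of Eq. (16)
    b = (sy - a * sx) / r                           # intercept of Eq. (16)
    return a * math.log(r) + b - math.log(k)        # Eq. (17)
```

Because the penalty grows with log(k), two chromosomes with equal predicted log-likelihood are ranked in favor of the one encoding fewer clusters.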
Several cluster validity techniques, such as the partition coefficient (Bezdek, 1981), minimum description length (Rissanen, 1978) and the Bayesian information criterion (BIC) (Schwarz, 1978), have been suggested for finding an optimal number of clusters. We compare our approach to the likelihood cross-validation technique reported by Smyth et al. (1998) for determining the best number of clusters.
Phase III: Evolutionary generation for finding the optimal fitness value
This phase works iteratively in three steps, as outlined below.
Step 1 Initialize the population and evaluate the fitness value of chromosomes
Generate M chromosomes for the population. Each N-bit chromosome has k 1-loci, where k is a random number ranging from 2 to √N. The positions of the 1-loci in each chromosome are randomly distributed by a uniform distribution (see the example in section 4.2.2.2). We use M = 64 in our experiments.
For each chromosome C, a finite mixture model, we first compute the initial guess of the parameters Θ^(0) = {α^(0), θ^(0)} using the method presented in Phase I. Together with the number of clusters k encoded in the chromosome, the parameters Θ^(0) and k constitute the input to the EM algorithm used to calculate the fitness value fit(C) by Equation (17).
1   M ← 64
2   For i ← 1 to M
3       k ← random number in 2 .. √N
4       For h ← 1 to k
5           chromosome[i].loci[h] ← random number in 1 .. N
6       End For
7       Use the method presented in Phase I to compute Θ^(0) of chromosome[i]
8       Use Equation (17) in Phase II to calculate fit(chromosome[i])
9   End For
Step 2 Generate a new population
Using the roulette wheel method, a pair of chromosomes (parents) in the population is selected for mating. Mating involves two operators:
Crossover, which selects loci from the parent chromosomes and creates a new offspring. GAKREM uses the uniform crossover prototype, where loci are randomly copied from the first or the second parent. The new offspring may have a different number of clusters and different 1-loci, thus changing the first guess of the parameters.
Mutation, which occurs with a given probability rate. A randomly selected locus of the offspring is inverted, which may increase or decrease the number of clusters in the offspring. The mutation operator can move the population away from local optima.
Note that if the fitness value of the new offspring is better than the lowest fitness value in the population, the corresponding chromosome is replaced by the new offspring. Step 2 is repeated for g generations to improve the fitness values of the chromosomes in the population. We use g = 4000 in our experiments.
10  g ← 4000
11  For i ← 1 to g
12      dad ← select a chromosome from the population by the roulette wheel method
13      mom ← select a chromosome from the population by the roulette wheel method
14      offspring ← crossover(dad, mom)
15      If (the mutation rate is satisfied)
16          offspring ← mutation(offspring)
17      End If
18      Use the method presented in Phase I to compute Θ^(0) of offspring
19      Use Equation (17) in Phase II to compute fit(offspring)
20      If (fit(offspring) > the lowest fitness value in the population)
21          Replace the corresponding chromosome with the offspring
22      End If
23  End For
Step 3 Run the general EM, until convergence, on the chromosome that has the highest fitness value.
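The selection, crossover and mutation operators of Step 2 can be sketched on the compact set-of-1-loci representation. This is a minimal illustration; the positive shift applied before roulette selection (the fitness values here are log-likelihood based and may be negative) is our assumption, not specified in the text:

```python
import random

def roulette_select(population, fitnesses):
    """Roulette-wheel selection: pick a chromosome with probability
    proportional to its (shifted) fitness value."""
    shift = min(fitnesses)
    weights = [f - shift + 1e-9 for f in fitnesses]   # make weights positive
    return random.choices(population, weights=weights, k=1)[0]

def uniform_crossover(dad, mom):
    """Uniform crossover on 1-locus sets: a locus present in both parents is
    always kept; a locus present in exactly one parent is copied with
    probability 1/2 (i.e. each bit comes from either parent)."""
    child = set()
    for locus in dad | mom:
        if locus in dad and locus in mom:
            child.add(locus)
        elif random.random() < 0.5:
            child.add(locus)
    return child

def mutate(offspring, n_points):
    """Invert one randomly selected locus: adding or removing a centroid,
    which changes the number of clusters k encoded by the offspring."""
    locus = random.randrange(n_points)
    return offspring ^ {locus}            # symmetric difference flips the bit
```

Because crossover and mutation change the set of 1-loci, each offspring implicitly proposes both a new k and new initial centroids, which is what lets Phase III search both at once.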
4.3 Results and Discussion
We have conducted extensive experiments using GAKREM on two kinds of data: manually generated datasets, and automatically generated datasets as used in Xu and Jordan (1996) and Jain and Dubes (1988). The probability of mutation was set experimentally to 0.15 for all experiments.
4.3.1 Experiments on manually generated data
4.3.1.1 The performance of GAKREM
The experiments are performed on cluster mixtures from 2 through 9 clusters.
To test the robustness of GAKREM we repeated the experiments 100 times for each
dataset, and we have found that GAKREM correctly calculated the number of clusters
for each mixture in all trials.
Figure 4.27 illustrates the result of GAKREM on a 2-cluster mixture derived from the Old Faithful dataset in Bishop (1995). This dataset has two dimensions: the duration of an eruption of the geyser in minutes, and the waiting time between eruptions. GAKREM precisely recognizes the 2-cluster mixture, of course without a predefined number of clusters. Figure 4.28 describes the behavior of GAKREM, in one trial out of 100, while searching for the globally optimal fitness values of chromosomes on the Old Faithful dataset.
Figure 4.27: Old Faithful data and the 2-cluster mixture obtained from GAKREM. The optimal log-likelihood value is -8.9078.

Figure 4.28: GAKREM's behavior on the Old Faithful dataset. It performs a heuristic search over the global space. At the 1728th generation in this trial, the optimal fitness value stabilizes at -7.5568.
The results of GAKREM on 3, 4, 5, 6, 7, 8 and 9cluster mixtures are shown in
Figure 4.29. These datasets, manually generated by our own tool, have 374, 402, 818,
724, 695, 988, 1122 data points with two dimensions, respectively. GAKREM
robustly generates the correct number of clusters for each mixture in all trials.
a) 3-cluster mixture dataset and the exact result obtained from GAKREM. The optimal log-likelihood value is -6.85.
b) 4-cluster mixture dataset and the correct result obtained from GAKREM. The optimal log-likelihood value is -7.34.
c) 5-cluster mixture dataset and the correct result obtained from GAKREM. The optimal log-likelihood value is -7.05.
d) 6-cluster mixture dataset and the correct result obtained from GAKREM. The optimal log-likelihood value is -7.16.
e) 7-cluster mixture dataset and the correct result obtained from GAKREM. The optimal log-likelihood value is -6.97.
f) 8-cluster mixture dataset and the correct result obtained from GAKREM. The optimal log-likelihood value is -7.38.
g) 9-cluster mixture dataset and the correct result obtained from GAKREM. The optimal log-likelihood value is -7.59.
Figure 4.29: GAKREM obtained the correct number of clusters for each tested dataset in all 100 trials, of course without prior knowledge of the number of clusters.
The behavior of GAKREM in one trial is demonstrated in Figure 4.30. As shown, it performs a thorough search simultaneously over both the number of clusters and the initial centroids of the clusters.
a) GAKREM's behavior on the 3-cluster dataset. After trying 4 clusters, GAKREM stabilizes at the optimal fitness value of -4.4869 at generation 646 with 3 clusters.
b) GAKREM's behavior on the 4-cluster dataset. After trying 7, 5, 5 clusters, GAKREM stabilizes at the optimal fitness value of -5.4547 at generation 1059 with 4 clusters.
c) GAKREM's behavior on the 5-cluster dataset. After trying 9, 7, 5, 6 clusters, GAKREM stabilizes at the optimal fitness value of -5.3616 at generation 2426 with 5 clusters.
d) GAKREM's behavior on the 6-cluster dataset. After trying 10, 8, 7, 6 clusters, GAKREM stabilizes at the optimal fitness value of -5.5028 at generation 3606 with 6 clusters.
e) GAKREM's behavior on the 7-cluster dataset. After trying 6, 5, 8, 6, 5 clusters, GAKREM stabilizes at the optimal fitness value of -4.9659 at generation 3635 with 7 clusters.
f) GAKREM's behavior on the 8-cluster dataset. After trying 9, 8, 10, 9 clusters, GAKREM stabilizes at the optimal fitness value of -6.8795 at generation 1946 with 8 clusters.
g) GAKREM's behavior on the 9-cluster dataset. After trying 11, 10 clusters, GAKREM stabilizes at the optimal fitness value of -7.5016 at generation 3869 with 9 clusters.
Figure 4.30: The behavior of GAKREM on the tested datasets in one trial. GAKREM performs a global search simultaneously over both the number of clusters and the initial centroids of the clusters.
4.3.1.2 GAKREM versus the Likelihood Cross-Validation technique
For comparison in determining the best number of clusters in each generated dataset, we applied the likelihood cross-validation technique (Smyth, 1998) as implemented in the widely used open-source Weka package (Witten and Frank, 2005). The computationally very expensive likelihood cross-validation technique determined the number of clusters correctly only for the 2-, 4- and 6-cluster mixture datasets, while it failed to recognize the underlying clusters in the 3-, 5-, 7-, 8- and 9-cluster mixture datasets, as shown in Figure 4.31.
a) 3-cluster mixture dataset and the wrong result obtained from EM with the likelihood cross-validation technique.
b) 5-cluster mixture dataset and the wrong result obtained from EM with the likelihood cross-validation technique.
c) 7-cluster mixture dataset and the wrong result obtained from EM with the likelihood cross-validation technique.
d) 8-cluster mixture dataset and the wrong result obtained from EM with the likelihood cross-validation technique.
e) 9-cluster mixture dataset and the wrong result obtained from EM with the likelihood cross-validation technique.
Figure 4.31: A comparison between GAKREM and the likelihood cross-validation technique in determining the best number of clusters.
4.3.1.3 GAKREM versus the EM and K-means algorithms
For comparison of the maximum log-likelihood functions of the mixtures obtained by GAKREM, EM and K-means, we ran the latter two algorithms 100 times on the same datasets described above. It is important to note that to operate the EM and K-means we need to specify the number of clusters a priori; thus we used 2 through 9 clusters as the input parameter. Figure 4.32 shows the comparison of the average maximum log-likelihood functions resulting from the EM, K-means and GAKREM while testing on 2 through 9 clusters. As can be seen, GAKREM significantly outperforms both the conventional EM and the K-means algorithms in terms of the maximum log-likelihood functions.
Figure 4.32: Average maximum log-likelihood functions of GAKREM, EM and K-means tested on the 2-, 3-, 4-, 5-, 6-, 7-, 8- and 9-cluster datasets, respectively.
An example of the initialization sensitivity of the EM and K-means is presented in Figure 4.33, in which the maximum log-likelihood functions converge to local maxima on the 7-cluster mixture.
a) K-means converges to a locally optimal solution with the number of clusters predefined to 7. In this trial, K-means reaches a maximum log-likelihood of -7.6184.
b) The conventional EM converges to a locally optimal solution with the number of clusters predefined to 7. In this trial, EM reaches a maximum log-likelihood of -7.2367.
Figure 4.33: The initialization sensitivity of K-means and the conventional EM. Both algorithms require the number of clusters a priori. Note that on this dataset GAKREM's average maximum log-likelihood is stable at -6.9666, of course without a predefined number of clusters.
4.3.2 Experiments on automatically generated data
To automatically generate clustered datasets, we use the following algorithm based on the work in Jain and Dubes (1988) and Xu and Jordan (1996):

Pseudocode to generate data:
Input:  N = the number of data points to be generated
        d = the dimension of the data points
        k = the number of clusters to be generated
Output: a clustered data set of N x d dimensions
1   For h ← 1 to k
2       Set the size of cluster h: c_h
3       For i ← 1 to d
4           Generate the i-th component of the cluster center μ_h from a
            uniform distribution over [0..100]
5       End For
6       Scatter c_h points around μ_h according to a Gaussian distribution
        centered on μ_h and having covariance matrix Σ_h constantly set equal to 1
7       For j ← 1 to h-1
8           Go to line 2 if cluster h overlaps cluster j
9       End For
10  End For

In this algorithm, the two clusters h and j overlap if (m_h + m_j)/(c_h + c_j) > 0, where m_h and m_j denote the numbers of misclassified points from clusters h and j. Note that a data point x_i belonging to cluster h has been misclassified if p(h | x_i) < p(j | x_i).
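The generator pseudocode can be sketched as follows; a minimal illustration that omits the overlap-rejection loop (lines 7 to 9) and splits the N points evenly across the k clusters:

```python
import random

def generate_dataset(n_points, d, k, spread=1.0, low=0.0, high=100.0):
    """Sketch of the generator: k cluster centers drawn uniformly over
    [low, high]^d, with points scattered around each center by an isotropic
    Gaussian (unit variance per dimension, matching Sigma_h = 1)."""
    sizes = [n_points // k] * k
    sizes[0] += n_points - sum(sizes)        # give the remainder to cluster 0
    data, labels = [], []
    for h in range(k):
        center = [random.uniform(low, high) for _ in range(d)]
        for _ in range(sizes[h]):
            data.append([random.gauss(c, spread) for c in center])
            labels.append(h)
    return data, labels
```

Adding the rejection test would amount to classifying each point under the true mixture and regenerating any cluster whose points are ever misclassified against an earlier one.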
4.3.2.1 The performance of GAKREM
We used the above algorithm to generate 1000 datasets, with the number of data points fixed at 500, the dimension d randomly set from 2 to 5 and the number of clusters k randomly set from 3 to 10. For each generated dataset we ran GAKREM, EM and K-means 100 times to gauge the robustness of the algorithms. GAKREM obtained the mixtures with the correct number of clusters at an average accuracy of 97%, of course without requiring the number of clusters. Figure 4.34 demonstrates the experiments for a generated dataset with 3 dimensions and 10 clusters, in which GAKREM recognized the correct number of clusters while EM and K-means misclassified the clusters.
a) 10-cluster generated dataset and the correct result obtained from GAKREM, with maximum log-likelihood of -6.463. The image is generated by our tool, available upon request.
b) 10-cluster generated dataset and the misclassified result obtained from EM, with a local maximum log-likelihood of -7.276. EM requires the number of clusters a priori.
c) 10-cluster generated dataset and the misclassified result obtained from K-means, with a local maximum log-likelihood of -7.0783. K-means requires the number of clusters a priori.
Figure 4.34: An example of the initialization sensitivity of EM and K-means. While GAKREM achieves a globally optimal solution, EM and K-means get stuck in local optima.
4.3.2.2 GAKREM versus the Likelihood Cross-Validation technique
Again, we applied the likelihood cross-validation technique to the 1000 generated datasets to test its capability of obtaining the optimal number of clusters underlying each mixture. To gauge the robustness of this technique, we repeated the experiments for 100 trials on each dataset. The overall accuracy of the likelihood cross-validation technique is only 31%. More importantly, as shown in Figure 4.35, the performance of this technique is good only if the mixture has fewer than 5 underlying clusters; otherwise it fails significantly in determining the best number of clusters in the mixture datasets.
Figure 4.35: The accuracy of GAKREM and the likelihood cross-validation technique tested on 3- through 10-cluster mixtures of the 1000 datasets.
4.3.2.3 GAKREM versus the EM and K-means algorithms
As noted, both the conventional EM and K-means require the number of clusters to be specified a priori, and thus we compare the performance of GAKREM, EM and K-means based on the average maximum log-likelihood functions over all trials. Figure 4.36 illustrates the performance of GAKREM, EM and K-means in 100 trials over the 2-, 3-, 4- and 5-dimensional datasets, respectively. The average maximum log-likelihood values of GAKREM are significantly better than those of EM and K-means in all tests.
Figure 4.36: Comparison of the average maximum log-likelihood functions of GAKREM, EM and K-means over the 1000 datasets (2-, 3-, 4- and 5-dimensional).
5. Conclusions and Future Work
To predict sites within a protein chain that participate in protein complexes, we have developed a novel method based on the HMM, which combines several biological characteristics of the sequences neighboring a target residue: structural information, accessible surface area, and transition probabilities among amino acids. We have evaluated the method using 5-fold cross-validation on 139 unique proteins and demonstrated precision of 66% and recall of 61% in identifying interfaces. These results are better than those achieved by other methods used for the identification of interfaces. This suggests that the HMM method might be used for predicting the effects of point mutations on protein interactions, or solvent accessibility, from protein sequences.
We have developed and extensively tested the new ClusFCM algorithm for protein
function prediction based on physical protein interaction networks. The algorithm
uses several characteristics of interaction networks: direct and indirect interactions,
the underlying network topology, clusters, and edges weighted by homology measures
between the interacting proteins. We demonstrated ClusFCM's robustness by
testing it on the curated interaction data from the GRID database using leave-one-out
cross-validation. In addition, we used the bootstrap percentile test to establish
the statistical significance of the results. All results showed that ClusFCM
outperformed the Majority, χ²-statistics, Functional Flow, and MRF methods. We
plan to extend the association rule algorithm to discover relationships between
functions present in a protein and its neighborhood, in order to assign new functions to the
protein. In a further study, we will use the ClusFCM method to predict chromosome 21
gene functions. The long arm of human chromosome 21 contains 430 genes
identified by standard genomic sequence analysis tools. For approximately 50% of
these genes, the functions are completely unknown (Gardiner et al., 2003; Gardiner,
unpublished; and International Human Chr21 Annotation Consortium, ms in
preparation). Our developing chromosome 21 database (Nikolaienko et al., 2005;
Nguyen et al., 2007) currently consists of protein interaction data from several
curated databases: 65 chromosome 21 proteins from HPRD, 27 from BIND, and 65
from DIP plus PPID, which are based on results of gene-specific experiments, and
from the recent human interactome project report (36 chromosome 21 proteins from
Rual et al., 2005).
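The bootstrap percentile test mentioned above resamples a statistic with replacement and reads confidence bounds directly from the percentiles of the resampled distribution. A minimal sketch, assuming the statistic of interest is the mean of per-protein score differences between two methods (the numbers below are hypothetical, not the thesis results):

```python
import random

def bootstrap_percentile_ci(scores, n_resamples=10000, alpha=0.05, seed=42):
    """(1 - alpha) bootstrap percentile confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement, recompute the mean each time, then sort.
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-protein accuracy differences (method A minus method B);
# an interval lying entirely above 0 indicates a significant improvement.
diffs = [0.05, 0.12, 0.03, 0.08, 0.10, 0.02, 0.07, 0.09, 0.04, 0.06]
lo, hi = bootstrap_percentile_ci(diffs, n_resamples=2000)
print(lo > 0.0)  # True for these all-positive differences
```

The percentile approach makes no normality assumption about the statistic, which is why it suits leave-one-out scores over heterogeneous proteins.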
We also presented a powerful new algorithm, called GAKREM, that combines the
best characteristics of the K-means and EM algorithms while avoiding their weaknesses:
the need to specify the number of clusters a priori, convergence to local
optima only, and lengthy computation times. To achieve these goals, we first
use genetic algorithms for estimating parameters and initializing starting points for
the EM. Then, the log-likelihood of each configuration of parameters and number of
clusters resulting from the EM is used as the fitness value for each chromosome in the
population. The novelty of GAKREM is that, in each evolutionary generation, it very
efficiently approximates the log-likelihood for each chromosome by using
logarithmic regression instead of running the conventional EM algorithm until
convergence, and it uses a simple K-means algorithm to initially assign data points to
the clusters. The algorithm was evaluated by comparing its performance on several
datasets with that of the conventional EM algorithm, the K-means algorithm and the
likelihood cross-validation technique. We show that GAKREM is not sensitive to the
initialization of parameters and thus may escape local optima. Our
approach can therefore be extended to any type of mixture model that uses EM and K-means
for parameter estimation. We have tested the algorithm extensively, on both
manually and automatically generated datasets. The results show that it outperforms
the EM and K-means algorithms. We plan to use GAKREM to discover new
pathways in chromosome 21 proteins by using heterogeneous data combining
microarray data and interaction data.
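The logarithmic-regression shortcut described above can be sketched as follows: fit the model ll ≈ a + b·ln(t) to the log-likelihoods of a few early EM iterations and extrapolate to a late iteration, instead of iterating EM to convergence. This is a simplified illustration of the idea, not the thesis implementation; the log-likelihood trace below is hypothetical:

```python
import numpy as np

def predict_converged_loglik(ll_history, t_predict=50):
    """Extrapolate an EM log-likelihood trace via logarithmic regression.

    Fits ll ≈ a + b * ln(t) to the log-likelihoods of the first few EM
    iterations (ll_history, iteration t starting at 1) and predicts the
    value at a late iteration t_predict, avoiding a full EM run.
    """
    t = np.arange(1, len(ll_history) + 1)
    # Least-squares fit of ll against ln(t); polyfit returns [slope, intercept].
    b, a = np.polyfit(np.log(t), ll_history, 1)
    return a + b * np.log(t_predict)

# Hypothetical log-likelihoods from three early EM iterations.
early = [-1200.0, -1130.0, -1090.0]
est = predict_converged_loglik(early, t_predict=50)
print(est > early[-1])  # True: the fitted curve keeps increasing with t
```

In a GAKREM-style loop, a value such as `est` would serve as the chromosome's fitness, so each generation costs only a few EM iterations per candidate configuration rather than a full run.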
APPENDIX
A. The 139 protein units used in the HMM method.
1qha_A M 1fwy_A M 1nul_A M
1hjr_A M 1rl2_A M 1hxm_D H
1bmt_A M 1g57_A M 1iqp_A H
2scu_A H 1g8q_A M 1g5c_A H
1efn_B H 1qhf_A M 2chs_A M
1jtg_B H 1otc_B H 1iib_A M
1a04_A M 1dqn_A M 1rva_A M
1tht_A M 1ekj_B H 1ecm_A M
1m6p_A M 1qup_A M 1qb2_A M
1luc_A M 1efc_A M 8atc_B H
1al2_A M 1qg3_A M 1kb5_B H
1id1_A M 1prt_F H 1dk5_A M
1f3u_A H 1g4u_S M 1fa0_B H
1gpm_A H 1ryp_B H 1e96_B H
1a2z_A M 1azs_A H 1ifq_A M
1rho_A M 1qr7_A M 1b35_A H
1d1g_A H 2fok_A M 1ais_B H
1h7w_A H 1h54_B H 1ibz_A H
1qbz_C H 1b9y_C H 1sky_E H
1mpy_A M 3ink_C M 1qo0_D H
1iq4_A M 1qj2_C H 1cvj_B H
1qkr_A M 1bkc_A H 2pol_A M
1fsi_A H 1g71_A M 1fxz_A M
1g2i_A M 1qpo_A M 1fjf_K H
1bou_A H 1qf8_A M 1d9e_A M
1sei_A M 1fjf_B H 1der_A M
1bm9_A M 1eer_A H 1fc3_A H
1fjf_J H 1bht_A M 1gd8_A M
1h6g_A M 1ryp_C H 1dce_A H
1poi_A H 1d5y_A M 2occ_E H
1qmg_A H 2iif_G M 1dkg_A H
1dzr_A M 1qsl_A M 1agr_E H
1ade_A M 1fjf_M H 1f6f_A H
1msp_A M 1kpt_A M 1ct9_A H
1qs0_B H 1mjh_A M 1ghq_B H
1hav_A M 1d4a_A M 2tnf_A M
1qip_A H 1fjf_Q H 1cjd_A H
1d02_A M 1c9k_B H 1qla_B H
1ytf_D H 1mty_G H 1ejh_A H
2bpa_2 H 1g8y_A M 1zym_A M
2prg_A H 1jtd_B H 1p32_A H
1dvj_A H 1ajq_A M 1itb_B H
1wdc_C H 1hur_A M 1ckl_B M
3chb_D M 1ixm_A M 1jeq_B H
1yai_A M 1f6d_A H 1jxp_A H
1g6o_A M 1c1g_B M
2mev_1 H 1g81_A M
REFERENCES
Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z., Miller, W. and Lipman,
D. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Res., 25, 3389-3402.
Ashburner, M., Ball, C. et al. (2000) Gene ontology: tool for the unification of
biology. The Gene Ontology Consortium. Nat. Genet., 25, 25-29.
Auerbach, D., Thaminy, S., Hottiger, M. and Stagljar, I. (2002) The post-genomic era
of interactive proteomics: facts and perspectives. Proteomics, 2, 611-623.
Baldi, P., Brunak, S., Chauvin, Y. and Andersen, C. (2000) Assessing the accuracy of
prediction algorithms for classification: an overview. Bioinformatics, 16, 412-424.
Baldi, P. and Brunak, S. (2001) Bioinformatics: the machine learning approach. The
MIT Press.
Babu, P. and Murty, N. (1993) A near-optimal initial seed value selection for k-means
algorithm using genetic algorithm. Pattern Recognition Letters, 14, 763-769.
Berg, J., Tymoczko, J. and Stryer, L. (2002) Biochemistry. W.H. Freeman.
Bezdek, J. (1981) Pattern recognition with fuzzy objective function algorithms.
Plenum Press, New York, NY.
Bilmes, J. (1997) A gentle tutorial on the EM algorithm and its application to
parameter estimation for Gaussian mixture and hidden Markov models.
Technical Report ICSI-TR-97-021, International Computer Science Institute
(ICSI), Berkeley, CA.
Bishop, C. (1996) Neural networks for pattern recognition. Oxford University Press.
Bhuyan, N., Raghavan, V. and Venkatesh, E. (1991) Genetic algorithm for clustering
with an ordered representation. In Proceedings of the Fourth International
Conference on Genetic Algorithms, 408-415.
Boyles, A. (1983) On the convergence of the EM algorithm. Journal of the Royal
Statistical Society, Series B, 45, 47-50.
Bordner, A. and Abagyan, R. (2005) Statistical analysis and prediction of protein-protein
interfaces. Proteins: Structure, Function, and Bioinformatics, 60(3), 353-366.
Branden, C. and Tooze, J. (1991) Introduction to protein structure. Garland Publishing,
Inc.
Bradford, J. and Westhead, D. (2005) Improved prediction of protein-protein binding
sites using a support vector machines approach. Bioinformatics, 21(8), 1487-1494.
Bradley, A. P. (1997) The use of the area under the ROC curve in the evaluation of
machine learning algorithms. Pattern Recognition, 30(7), 1145-1159.
Breitkreutz, B.J., Stark, C. and Tyers, M. (2003) The GRID: the general repository for
interaction datasets. Genome Biol., 4(3), R23.
Chen, H. and Zhou, H. (2005) Prediction of interface residues in protein-protein
complexes by a consensus neural network method: test against NMR data.
Proteins: Structure, Function, and Bioinformatics, 61, 21-35.
Chothia, C. and Janin, J. (1975) Principles of protein-protein recognition. Nature, 256,
705-708.
Chua, H., Sung, W. and Wong, L. (2006) Exploiting indirect neighbours and
topological weight to predict protein function from protein-protein interactions.
Bioinformatics, 22, 1623-1630.
Cios, K., Pedrycz, W. and Swiniarski, R. (1998) Data mining methods for knowledge
discovery. Kluwer Academic Publishers.
Cios, K. and Kurgan, L. (2004) CLIP4: hybrid inductive machine learning algorithm
that generates inequality rules. Information Sciences, 163(1-3), 37-83.
Conesa, A., Gotz, S., Garcia-Gomez, J., Terol, J., Talon, M. and Robles, M. (2005)
Blast2GO: a universal tool for annotation, visualization and analysis in
functional genomics research. Bioinformatics, 21, 3674-3676.
Davis, L. (1991) Handbook of genetic algorithms. Van Nostrand Reinhold.
Davison, A. and Hinkley, S. (1997) Bootstrap methods and their application.
Cambridge University Press.
Dempster, A., Laird, N. and Rubin, D. (1977) Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39,
1-38.
Deng, M., Zhang, K., Mehta, S., Chen, T. and Sun, F. (2003) Prediction of protein
function using protein-protein interaction data. J. Comput. Biol., 10, 947-960.
DiCiccio, T.J. and Efron, B. (1996) Bootstrap confidence intervals (with discussion).
Statistical Science, 11, 189-228.
Fromont-Racine, M. et al. (1997) Toward a functional analysis of the yeast genome
through exhaustive two-hybrid screens. Nat. Genet., 16, 277-282.
Doolittle, R. (1989) Redundancies in protein sequences. In Predictions of Protein
Structure and the Principles of Protein Conformation (Fasman, G.D., ed.), Plenum
Press, New York, 599-623.
Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (1998) Biological sequence
analysis. Cambridge University Press, Cambridge.
Gallet, X., Charloteaux, B., Thomas, A. and Brasseur, R. (2000) A fast method to
predict protein interfaces from sequences. J. Mol. Biol., 302, 917-926.
Gardiner, K., Fortna, A., Bechtel, L. and Davisson, M. (2003) Mouse models of
Down syndrome: how useful can they be? Comparison of the gene content of
human chromosome 21 with homologous mouse genomic regions. Gene, 318,
137-147.
Giot, L. et al. (2003) A protein interaction map of Drosophila melanogaster. Science,
302, 1727-1736.
Groth, D., Lehrach, H. and Hennig, S. (2004) GOblet: a platform for Gene Ontology
annotation of anonymous sequence data. Nucleic Acids Res., 32, W313-W317.
Fariselli, P., Pazos, F., Valencia, A. and Casadio, R. (2002) Prediction of protein-protein
interfaces in heterocomplexes with neural networks. Eur. J. Biochem., 269,
1356-1361.
Fawcett, T. (2001) Using rule sets to maximize ROC performance. Conference on
Data Mining (ICDM-2001), 131-138.
Fletcher, D., Nguyen, D. and Cios, K. (2005) Autonomous synthesis of fuzzy
cognitive maps from observational data: preliminaries. IEEE Aerospace
Conference, Big Sky, Montana, March 5-12.
Hishigaki, H., Nakai, K., Ono, T., Tanigami, A. and Takagi, T. (2001) Assessment of
prediction accuracy of protein function from protein-protein interaction data.
Yeast, 18, 523-531.
Ho, Y. et al. (2002) Systematic identification of protein complexes in Saccharomyces
cerevisiae by mass spectrometry. Nature, 415, 180-183.
Holland, J. (1975) Adaptation in natural and artificial systems. University of
Michigan Press, Ann Arbor.
Ito, T., Chiba, T. et al. (2001) A comprehensive two-hybrid analysis to explore the
yeast protein interactome. Proc. Natl Acad. Sci. USA, 98, 4569-4574.
Jain, K., Murty, N. and Flynn, J. (1999) Data clustering: a review. ACM Computing
Surveys, 31, 264-323.
Jain, A. and Dubes, R. (1988) Algorithms for clustering data. Prentice-Hall advanced
reference series, Prentice-Hall, Inc., Upper Saddle River, NJ.
Jones, S. and Thornton, J.M. (1996) Principles of protein-protein interaction. Proc.
Natl. Acad. Sci. USA, 93, 13-20.
Jones, S. and Thornton, J.M. (1997) Prediction of protein-protein interfaces using
patch analysis. J. Mol. Biol., 272, 133-143.
Kabsch, W. and Sander, C. (1983) Dictionary of protein secondary structure: pattern
recognition of hydrogen-bonded and geometrical features. Biopolymers, 22,
2577-2637.
Karlin, S., Ost, F. and Blaisdell, B. (1989) Patterns in DNA and amino acid sequences
and their statistical significance. In Mathematical Methods for DNA Sequences,
133-157, CRC Press.
Keller, M., Bengio, S. and Wong, S. (2005) Benchmarking non-parametric statistical
tests. In Advances in Neural Information Processing Systems (NIPS), MIT Press,
1-8.
Kini, R. and Evans, H. (1996) Prediction of potential protein-protein interfaces from
amino acid sequence: identification of a fibrin polymerization site. FEBS Letters,
385, 81-86.
Kohonen, T. (1989) Self-organization and associative memory. Springer-Verlag, New
York, NY.
Koike, A. and Takagi, T. (2004) Prediction of protein-protein interfaces using support
vector machines. Protein Engineering, Design & Selection, 17, 165-173.
Kosko, B. (1986) Fuzzy cognitive maps. Int. J. of Man-Machine Studies, 24, 65-75.
Kurgan, L., Cios, K. and Scott, D. (2006) Highly scalable and robust rule learner:
performance evaluation and comparison. IEEE Transactions on Systems, Man
and Cybernetics, Part B, 36, 32-53.
Laszlo, M. and Mukherjee, S. (2006) A genetic algorithm using hyper-quadtrees
for low-dimensional k-means clustering. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 28, 533-543.
Letovsky, S. and Kasif, S. (2003) Predicting protein function from protein/protein
interaction data: a probabilistic approach. Bioinformatics, 19, 197-204.
Lewis, D., Jebara, T. and Noble, W. (2006) Support vector machine learning from
heterogeneous data: an empirical analysis using protein sequence and structure.
Bioinformatics, 22, 2753-2760.
Li, S. et al. (2004) A map of the interactome network of the metazoan C. elegans.
Science, 303, 540-543.
Lu, S. and Fu, K. (1978) A sentence-to-sentence clustering procedure for pattern
analysis. IEEE Transactions on Systems, Man and Cybernetics, 8, 381-389.
MacBeath, G. and Schreiber, L. (2000) Printing proteins as microarrays for high-throughput
function determination. Science, 289, 1760-1763.
MacLachlan, G. and Basford, K. (1988) Mixture models: inference and applications
to clustering. New York: Marcel Dekker.
Macnaughton-Smith, P., Williams, T., Dale, B. and Mockett, G. (1964) Dissimilarity
analysis: a new technique of hierarchical subdivision. Nature, 202, 1034-1035.
Mann, M. and Pandey, A. (2001) Use of mass spectrometry-derived data to annotate
nucleotide and protein sequence databases. Trends Biochem. Sci., 26, 54-61.
Marcotte, E., Pellegrini, M., Ng, H., Rice, D., Yeates, T. and Eisenberg, D. (1999a)
Detecting protein function and protein-protein interactions from genome
sequences. Science, 285, 751-753.
Marcotte, E., Pellegrini, M., Thompson, M., Yeates, T. and Eisenberg, D. (1999b) A
combined algorithm for genome-wide prediction of protein function. Nature,
402, 83-86.
Martin, D., Berriman, M. and Barton, G. (2004) GOtcha: a new method for prediction
of protein function assessed by the annotation of seven genomes. BMC
Bioinformatics, 5, 178-195.
Martzen, M., McCraith, S., Spinelli, S. et al. (1999) A biochemical genomics
approach for identifying genes by the activity of their products. Science, 286,
1153-1155.
McQueen, J. (1967) Some methods for classification and analysis of multivariate
observations. In Proceedings of the 5th Berkeley Symposium on Mathematical
Statistics and Probability, Berkeley, University of California Press, 1, 281-297.
Nabieva, E., Jim, K., Agarwal, A., Chazelle, B. and Singh, M. (2005) Whole-proteome
prediction of protein function via graph-theoretic analysis of interaction maps.
Bioinformatics, 21, 302-310.
Nelson, D. and Cox, M. (2000) Lehninger Principles of Biochemistry, 3rd edition.
Worth Publishers.
Nikolaienko, O., Nguyen, C., Crnic, L., Cios, K. and Gardiner, K. (2005) Human
chromosome 21/Down syndrome gene function and pathway database. Gene,
364, 90-98.
Nguyen, C., Thaicharoen, S., Lacroix, T., Gardiner, K. and Cios, K. (2006) A
comprehensive human chromosome 21 database. IEEE Engineering in Medicine
and Biology, 26(2), 86-93.
Nguyen, C., Gardiner, K. and Cios, K. (2007) A hidden Markov model for prediction
of protein interfaces. Journal of Bioinformatics and Computational Biology,
5(2), 1-15.
Nguyen, C., Mannino, M., Gardiner, K. and Cios, K. (2007) ClusFCM algorithm for
prediction of protein functions using homologies and protein interactions,
submitted.
Nguyen, C. and Cios, K. (2007) GAKREM: a clustering algorithm that
automatically generates a number of clusters, submitted.
Ofran, Y. and Rost, B. (2003a) Analysing six types of protein-protein interfaces. J.
Mol. Biol., 325, 377-387.
Ofran, Y. and Rost, B. (2003b) Predicted protein-protein interfaces from local
sequence information. FEBS Lett., 544, 236-239.
Pazos, F., Helmer-Citterich, M., Ausiello, G. and Valencia, A. (1997) Correlated
mutations contain information about protein-protein interaction. J. Mol. Biol.,
271, 511-523.
Raghavan, V. and Birchand, K. (1979) A clustering strategy based on a formalism of
the reproductive process in a natural system. In Proceedings of the Second
International Conference on Information Storage and Retrieval, 10-22.
Rain, J.C. et al. (2001) The protein-protein interaction map of Helicobacter pylori.
Nature, 409, 211-215.
Redner, A. and Walker, F. (1984) Mixture densities, maximum likelihood, and the
EM algorithm. SIAM Review, 26, 195-239.
Rissanen, J. (1978) Modeling by shortest data description. Automatica, 14, 465-471.
Rual, F., Venkatesan, K., Hao, T. et al. (2005) Towards a proteome-scale map of the
human protein-protein interaction network. Nature, 437, 1173-1178.
Ruspini, E. (1969) A new approach to clustering. Information Control, 15, 22-32.
Schwarz, G. (1978) Estimating the dimension of a model. The Annals of Statistics, 6,
461-464.
