Citation
Data mining and quantitative analysis of proteomics

Material Information

Title:
Data mining and quantitative analysis of proteomics
Creator:
Yen, Chia-Yu
Publication Date:
Language:
English
Physical Description:
78 leaves : illustrations ; 28 cm

Subjects

Subjects / Keywords:
Proteomics ( lcsh )
Mass spectrometry ( lcsh )
Peptides ( lcsh )
Data mining ( lcsh )
Data mining ( fast )
Mass spectrometry ( fast )
Peptides ( fast )
Proteomics ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 76-78).
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Chia-Yu Yen.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
55626422 ( OCLC )
ocm55626422
Classification:
LD1190.E52 2003m Y46 ( lcc )

Full Text
DATA MINING AND QUANTITATIVE ANALYSIS OF PROTEOMICS
by
Chia-Yu Yen
B.S., Feng Chia University, 1994
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
in
Computer Science and Engineering
2003


This thesis for the Master of Science Degree
by
Chia-Yu Yen
has been approved
Steve Helmke
Tom Altman
Ellen Gethner

Date


Yen, Chia-Yu (M.S., Computer Science and Engineering)
Data Mining and Quantitative Analysis of Proteomics
Thesis directed by Professor Krzysztof (Krys) Cios
ABSTRACT
Although mass spectrometry data are usually used to identify proteins, their other
application, the quantification of proteins, is still developing. This thesis discusses the
quantification of protein isoforms in mass spectrometry data derived from
proteomic studies. The ratio of protein isoforms of a protein mixture can be used
to determine the health of patients. Here, we present an automatic system for
protein quantification so that biologists can fully focus on the experiments and
find the result quickly and precisely. To accomplish this, two subsystems were
developed. One subsystem is employed to find the quantification peptides and
design an internal standard peptide for quantitative analysis. The other subsystem
is used to quantify mass spectrometry data from these essential peptides. With the
real example of human heart proteins, the alpha and beta myosin heavy chains,
we demonstrate the correctness of our system and results.
This abstract accurately represents the content of the candidate's thesis. I
recommend its publication.


ACKNOWLEDGEMENTS
I would like to take this opportunity to thank my advisor Dr. Krzysztof (Krys)
Cios for his encouragement, support, guidance and comments.
I would also like to thank Dr. Steve Helmke, University of Colorado Health
Sciences Center, for the mass spectrometry data and helping me understand the
biological knowledge, and my committee members Drs. Tom Altman and Ellen
Gethner.


CONTENTS
Figures
Tables
CHAPTER
1. INTRODUCTION
2. BACKGROUND
   Definition of Proteomics
   Proteins
      Amino Acids
         Conservative Amino Acid Substitution
      Polypeptide
   One-dimensional and Two-dimensional Gel Electrophoresis
   In-gel Digestion and In-silico Enzyme Digestion
   MALDI-TOF Mass Spectrometry Experiment
   Data Description
   Theoretical Mass Value
3. MINING INFORMATION FOR QUANTIFICATION PEPTIDES
   AND INTERNAL STANDARD PEPTIDE
   Amino Acid Families for Conservative Amino Acid Substitution
   Quantification Peptides
      The Required Data
      Algorithm for Finding Quantification Peptides
      Discovered Result
   Internal Standard Peptide
      The Required Data
      Algorithm for Finding Internal Standard Peptide
      Discovered Result
4. QUANTIFICATION OF MASS SPECTROMETRY DATA
   Isotopes
   Standard Curves
   Data Quantification
      Patient Mass Spectrometry Data
      Strategy of Choosing Desired Peaks
      Calculation of the Quantification Data
      Error Estimation
5. RESULTS OF QUANTIFYING HUMAN HEART MYOSIN
   HEAVY CHAIN
   Quantification Peptides and Internal Standard Peptide
      Input Data
      Execution
      Validation
   Protein Quantification
      Project Management
      Isotope Management
      Sample Management
      Standard Curve
      Results
6. CONCLUSION AND FUTURE WORK
APPENDIX
A. TWENTY AMINO ACIDS
B. AMINO ACID SEQUENCE OF ALPHA-MyHC
C. AMINO ACID SEQUENCE OF BETA-MyHC
BIBLIOGRAPHY


FIGURES
Figure
1.1 System overview
2.1 Common structure of an amino acid
2.2 Formation of a polypeptide
2.3 Gel images
2.4 Cleavage site
2.5 MALDI plate
2.6 MALDI-TOF mass spectrometry
2.7 Linear and reflectron time of flight
2.8 Graphical view of MS data
2.9 Example of MS data
3.1 Intensity chart of a MS data file
4.1 Mono-isotopic peaks and isotopic peaks
5.1 Amino acid families
5.2 Result of processing given data
5.3 Possible quantification peptide pairs
5.4 Possible internal standard peptides
5.5 Myosin heavy chain quantification peptides
5.6 Project management
5.7 Isotope management
5.8 Sample management
5.9 Standard curves and their detailed information
5.10 Manually calculated quantification results
5.11 Standard curve data
5.12 Quantification result of patient sample


TABLES
Table
2.1 Amino acid families
2.2 Cleavage rules of trypsin
2.3 Theoretical mass values of amino acids
2.4 Other useful mass values
4.1 Standard curves


CHAPTER 1
INTRODUCTION
An important problem in biology is to determine the concentration of a particular
protein in a complex protein mixture because the concentration, or amount of a
protein, can be used to determine the state of a patient's health. Proteomics
provides a systematic way to solve this problem (Chambers et al., 2000; Patterson
et al., 2003).
Researchers at the Cardiology Laboratory of the University of Colorado Health
Sciences Center (UCHSC) found that changes in the gene expression of the myosin
heavy chain (MyHC) isoforms, alpha (α) and beta (β), in the human heart are
correlated with heart failure (Nakao et al., 1997). Changes in gene expression were measured
by quantification of mRNA. However, changes in mRNA do not match changes
in protein (Gygi et al., 1999; dos Remedios et al., 1996), and protein is more
important because protein carries out the actual function. The α and β MyHC
proteins cannot be distinguished by conventional means. Therefore the lab
developed an assay using matrix-assisted laser desorption ionization time of flight
mass spectrometry (MALDI-TOF MS) to measure protein concentrations by
quantifying specific tryptic peptides from these proteins (Helmke et al., 2003).


The quantification process is as follows.
a) Find a pair of protein isoforms which change with heart failure.
b) Find a peptide which can be used to distinguish each protein and call this
peptide the quantification peptide.
c) Design an internal standard peptide by using quantification peptides.
d) Synthesize quantification peptides and the internal standard peptide.
e) Mix known amounts of these synthetic peptides for experiment.
f) Create standard curves from these experimental results.
g) Experiment on patient heart tissue.
h) Collect MS data signals of quantification peptides and internal standard
peptide.
i) Use the collected signals to get the real amounts of the two protein isoforms.
In this quantification process, the tasks of finding quantification peptides,
designing the internal standard peptide, processing the MS data, calculating
standard curves, and quantifying sample MS data are very labor intensive. For
example, the average number of amino acids of a protein is over 300, and the
average number of amino acids of a peptide is 10. This means that if one tries to
find a pair of quantification peptides manually, the current procedure at UCHSC,
it will require considering over 1,000 different peptide combinations. A large


protein like myosin would have 40,000 combinations. Even by cutting this
amount through the use of the background biological knowledge, the number of
peptide combinations is still large. That is why the quantification process is
extremely time consuming.
This thesis is focused on peptide quantification. The goal is to develop an
automated system that will assure a robust quantification process, which will help
biologists quantify the desired proteins much faster. The system can be applied to
similar problems since our design is very flexible thus allowing easy adaptation.


Figure 1.1 System overview (diagram; one block is labeled "Quantification
peptides and IS peptide finding subsystem")


Figure 1.1 shows that the system consists of two main parts: one for the discovery of
the two quantification peptides and the internal standard peptide, and the other for
the quantification of patients' samples. If the required input data are available, each part of the
system can be run separately.
This thesis is organized as follows. Chapter 2 provides biological background
related to this research. Chapter 3 discusses the subsystem for finding
quantification peptides and the internal standard peptide. Chapter 4 describes the
method of quantification. Then, the working of the system is illustrated in Chapter
5. The final chapter provides conclusions and a brief discussion of future work.


CHAPTER 2
BACKGROUND
In order to better understand the problem domain with which we are dealing in
this work, we provide background biological knowledge first.
Definition of Proteomics
Patterson et al. (2003) describe proteomics as the systematic study of the many and
diverse properties of proteins in a parallel manner, with the aim of providing
detailed descriptions of the structure, function and control of biological systems in
health and disease. Tyers et al. (2003) define proteomics as the study of the
function of all expressed proteins.
The most common method to generate protein fingerprints is using gel
electrophoresis to separate proteins which are excised from the gel, digested
enzymatically, and subjected to mass spectrometry.
Proteins
Proteins are the end products of genes and serve as critical structural and
functional components in cells and tissues. They are comprised of amino acids


joined together by peptide bonds. Although many amino acids exist, only 20 of
them are commonly used to make proteins.
Amino Acids
Amino acids are molecules which consist of an amino group (-NH2), an acidic
carboxyl group (-COOH), and a side chain (R group), except proline. The
structure of R makes one amino acid different from others (Vander et al., 1998).
Figure 2.1 Common structure of an amino acid (a central carbon bonded to H, the
R group, -COOH, and -NH2)
Conservative Amino Acid Substitution. The substitution of amino acids for one
another in protein sequences can be viewed either as a pairwise phenomenon or as
a groupwise relationship (Wu et al., 1996). Following Helmke et al.
(2003), we consider only pairwise substitution in our system. For the
purpose of pairwise substitution, amino acids are classified into 6 groups
according to the physical and chemical properties of their side chains (USPTO, 2002).


Family Name                        Amino Acids
With basic side chains             lysine (K), arginine (R), histidine (H)
With acidic side chains            aspartate (D), glutamate (E)
With uncharged polar side chains   glycine (G), asparagine (N), glutamine (Q), serine (S), threonine (T), tyrosine (Y), cysteine (C)
With nonpolar side chains          alanine (A), valine (V), leucine (L), isoleucine (I), proline (P), phenylalanine (F), methionine (M), tryptophan (W)
With branched side chains          threonine (T), valine (V), isoleucine (I)
With aromatic side chains          tyrosine (Y), phenylalanine (F), tryptophan (W), histidine (H)
Table 2.1 Amino acid families
Each amino acid in a family has chemical properties close to other family
members. This feature is very important for the conservative amino acid
substitution idea because, after in-family substitution, the chemical features of a
protein fragment or a peptide still remain similar to the original one.
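
The family membership test behind an in-family substitution can be sketched in a few lines of Python. This is only an illustration of Table 2.1, not the code of the system described in this thesis; the dictionary layout and the function names shared_families and is_conservative are assumptions made for the example.

```python
# Default families from Table 2.1 (one-letter codes); the layout is an assumption.
FAMILIES = {
    "basic": "KRH",
    "acidic": "DE",
    "uncharged polar": "GNQSTYC",
    "nonpolar": "AVLIPFMW",
    "branched": "TVI",
    "aromatic": "YFWH",
}

def shared_families(a, b):
    """Return the names of all families that contain both amino acids."""
    return [name for name, members in FAMILIES.items()
            if a in members and b in members]

def is_conservative(a, b):
    """Treat a -> b as conservative if the two differ and share a family."""
    return a != b and bool(shared_families(a, b))

print(shared_families("V", "I"))  # valine and isoleucine share two families
print(is_conservative("K", "D"))  # basic versus acidic: no shared family
```

Because an amino acid can belong to more than one family (threonine, for example, is both uncharged polar and branched), the check looks for any shared family rather than a single family label.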
Polypeptide
Amino acids can join together by linking one's carboxyl group to another's
amino group. The bond between two amino acids is called the peptide bond, and
the resulting molecule is called a polypeptide. The terms protein and peptide are
distinguished by the number of amino acids they contain. A polypeptide
composed of more than 50 amino acids is called a protein. If a polypeptide has 50
or fewer amino acids, it is called a peptide (Vander et al., 1998).


Figure 2.2 Formation of a polypeptide (picture taken from
http://tonga.usip.edu/gmoyna/biochem341/lecture11.html)
One-dimensional and Two-dimensional Gel Electrophoresis
1-D and 2-D gel electrophoresis are used to resolve complex mixtures of proteins
(Hunter et al., 2002), and the resolved proteins are stained with dyes. The stained
proteins can be seen clearly, and the deeper the color, the more abundant the
protein. Figure 2.3 (a) shows a scanned 1-D gel image. Each column, called a
lane, is a sample, and the molecular weight decreases from top to bottom.
Figure 2.3 (b) shows a 2-D gel image, which contains only one sample, but the
sample contains several proteins. The interpretation from top to bottom is the
same as in the 1-D gel, but the left to right interpretation is quite different. On the
2-D gel, the left to right position correlates with the protein's charge. Neutral proteins are


in the middle. On the left, there are negatively charged proteins, on the right there
are positively charged proteins.
Figure 2.3 Gel images: (a) 1-D gel, (b) 2-D gel
In-gel Digestion and In-silico Enzyme Digestion
The biological way to digest a protein is in-gel digestion. The biologist excises
dyed proteins from a 1-D or 2-D gel and then puts them into different containers
and dries the samples. Because the samples are dried, the added trypsin
solution soaks completely into the samples. This makes the process of


digestion happen in-gel and cleaves the proteins more completely. Instead of
using in-gel digestion, in-silico enzyme digestion simulates the behavior of
enzymes based on the in-gel digestion knowledge.
There are several enzymes used to digest proteins. The most popular one is
trypsin because its cleavage behavior is the most predictable. Normally, trypsin
cleaves a protein after amino acid K or R. Table 2.2 shows the cleavage rules of
trypsin (IPMCS, 2003).
Figure 2.4 Cleavage site (picture taken from
http://tonga.usip.edu/gmoyna/biochem341/lecture11.html)


Rule Type    Cleavage Site    Action
Normal       C-term of K      Cleave
Normal       C-term of R      Cleave
Exception    KP               Don't cleave
Exception    RP               Don't cleave
Exception    WKP              Cleave
Exception    MRP              Cleave
Exception    CKD              Don't cleave
Exception    DKD              Don't cleave
Exception    CKH              Don't cleave
Exception    CKY              Don't cleave
Exception    CRK              Don't cleave
Table 2.2 Cleavage rules of trypsin
Theoretically, all these rules apply when a protein is digested but, in practice,
for reasons such as a shortened digestion time, the actual cleavage sites might not be
the same as the theoretical sites. In other words, the set of peptides that a user
finds by experiment might not be the same as the theoretically predicted set.
Given this uncertainty, this system uses only the
predicted sites.
Example 1: This is the amino acid sequence of a trypsin precursor:
FPTDDDDKIV GGYTCAANSI PYQVSLNSGS HFCGGSLINS QWVVSAAHCY
KSRIQVRLGE HNIDVLEGNE QFINAAKIIT HPNFNGNTLD NDIMLIKLSS
PATLNSRVAT VSLPRSCAAA GTECLISGWG NTKSSGSSYP SLLQCLKAPV
LSDSSCKSSY PGQITGNMIC VGFLEGGKDS CQGDSGGPVV CNGQLQGIVS
WGYGCAQKNK PGVYTKVCNY VNWIQQTIAA N


After applying the first two rules, we have:
FPTDDDDK/IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYK/SR/I
QVR/LGEHNIDVLEGNEQFINAAK/IITHPNFNGNTLDNDIMLIK/LSSPATLNSR/V
ATVSLPR/SCAAAGTECLISGWGNTK/SSGSSYPSLLQCLK/APVLSDSSCK/SSYPG
QITGNMICVGFLEGGK/DSCQGDSGGPVVCNGQLQGIVSWGYGCAQK/NK/PGVYTK/
VCNYVNWIQQTIAAN
The Ks and Rs marked with a following slash (/) are the expected cleavage sites.
Then we apply rules 3 and 4. A KP is found, so this site cannot be a cleavage
site; the remaining cleavage sites are marked with a slash (/).
FPTDDDDK/IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYK/SR/I
QVR/LGEHNIDVLEGNEQFINAAK/IITHPNFNGNTLDNDIMLIK/LSSPATLNSR/V
ATVSLPR/SCAAAGTECLISGWGNTK/SSGSSYPSLLQCLK/APVLSDSSCK/SSYPG
QITGNMICVGFLEGGK/DSCQGDSGGPVVCNGQLQGIVSWGYGCAQK/NKPGVYTK/
VCNYVNWIQQTIAAN
Because there is no matched sequence for the remaining rules, we find that this
protein consists of 15 peptides.
FPTDDDDK, IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYK,
SR, IQVR, LGEHNIDVLEGNEQFINAAK, IITHPNFNGNTLDNDIMLIK,
LSSPATLNSR, VATVSLPR, SCAAAGTECLISGWGNTK,
SSGSSYPSLLQCLK, APVLSDSSCK, SSYPGQITGNMICVGFLEGGK,
DSCQGDSGGPVVCNGQLQGIVSWGYGCAQK, NKPGVYTK,
VCNYVNWIQQTIAAN
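
The digestion walked through in Example 1 can be sketched in Python as follows. This is a minimal illustration of the rules in Table 2.2, not the implementation used in this work. It assumes that each three-letter exception is read as (preceding residue, K or R, following residue); the names cleaves_after and digest are hypothetical.

```python
# Three-letter contexts from Table 2.2, read as (previous, K/R, next) residue.
# This reading of the exception rules is an assumption made for the sketch.
FORCE_CLEAVE = {"WKP", "MRP"}
NO_CLEAVE = {"CKD", "DKD", "CKH", "CKY", "CRK"}

def cleaves_after(seq, i):
    """Decide whether trypsin is predicted to cleave after position i."""
    if seq[i] not in "KR" or i == len(seq) - 1:
        return False
    context = seq[max(i - 1, 0):i + 2]
    if context in FORCE_CLEAVE:
        return True
    if context in NO_CLEAVE:
        return False
    return seq[i + 1] != "P"  # KP and RP are not cleaved

def digest(seq):
    """Split a protein sequence into its predicted tryptic peptides."""
    peptides, start = [], 0
    for i in range(len(seq)):
        if cleaves_after(seq, i):
            peptides.append(seq[start:i + 1])
            start = i + 1
    peptides.append(seq[start:])
    return peptides

TRYPSIN = (
    "FPTDDDDKIVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCY"
    "KSRIQVRLGEHNIDVLEGNEQFINAAKIITHPNFNGNTLDNDIMLIKLSS"
    "PATLNSRVATVSLPRSCAAAGTECLISGWGNTKSSGSSYPSLLQCLKAPV"
    "LSDSSCKSSYPGQITGNMICVGFLEGGKDSCQGDSGGPVVCNGQLQGIVS"
    "WGYGCAQKNKPGVYTKVCNYVNWIQQTIAAN"
)
peptides = digest(TRYPSIN)
print(len(peptides), peptides[2:4])
```

Applied to the trypsin precursor sequence, the sketch yields the same 15 peptides as the manual walk-through, with NKPGVYTK left uncut at the KP site.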
MALDI-TOF Mass Spectrometry Experiment
A MALDI-TOF MS experiment has four steps (Aebersold et al., 2003). The first
step is to separate proteins from the sample of tissues or cells. 1-D or 2-D gel


electrophoresis is usually used for this separation. Then, the proteins are
excised carefully from the gel to avoid including any unwanted parts. Figure 2.3
shows that either a band of a 1-D gel or a spot of a 2-D gel contains a protein or
protein isoforms. A careless excision might cause the removed part to contain
other proteins. The removed part has been chemically stained, so it needs to be
de-stained. After that, an enzyme, such as trypsin, is used to digest the protein in-gel
into several peptides. It is important to ensure that the protein is digested
thoroughly. Incomplete digestion leads to wrong mass spectra, and neither protein
identification nor protein quantification would then be correct. Hence, the sample
needs to be dried so that it soaks up the enzyme solution completely. If possible,
it is best to digest the sample twice. In the last step, the sample is spotted onto a
steel MALDI plate (Figure 2.5) and subjected to MALDI-TOF MS.
Figure 2.5 MALDI plate


Figure 2.6 MALDI-TOF mass spectrometry
(picture taken from Amersham Biosciences)
Figure 2.6 illustrates how the sample is ionized by a laser flash. Then, the ions are
accelerated by a high voltage and enter the time of flight tube, where they
separate from each other according to their mass to charge ratios (m/z)
(Aebersold et al., 2003). The lighter the ion, the sooner it is detected.
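
Why lighter ions arrive first can be illustrated with the ideal time of flight relation: an ion of charge z accelerated through a voltage V gains kinetic energy zeV, so its flight time over a drift length L is proportional to the square root of m/z. The Python sketch below is only a back-of-the-envelope illustration; the 20 kV accelerating voltage and the 1 m tube length are hypothetical values, not the settings of the instrument used in this work.

```python
import math

DALTON_KG = 1.66053906660e-27        # kilograms per dalton
ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs

def flight_time(mz, voltage=20_000.0, tube_length=1.0):
    """Ideal linear-TOF flight time in seconds for an ion of the given m/z.

    voltage (V) and tube_length (m) are hypothetical example values.
    """
    # z*e*V = (1/2)*m*v^2  ->  v = sqrt(2*e*V / (m/z))
    velocity = math.sqrt(2 * ELEMENTARY_CHARGE * voltage / (mz * DALTON_KG))
    return tube_length / velocity

for mz in (515.33054, 842.51013, 2211.10453):
    print(mz, flight_time(mz))
```

In this idealized model, an ion with four times the m/z takes twice as long to reach the detector.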


Figure 2.7 Linear and reflectron time of flight (labels in the picture: Linear
Time Of Flight tube, ion source, For Proteins, Reflectron Time Of Flight tube)
(picture taken from Amersham Biosciences)
There are two different types of TOF: linear and reflectron. The linear TOF
accelerates ions along a straight path. The advantage of this type of TOF is that it
can detect larger ions, such as proteins. The disadvantage is that the resolution of the
detected peak list is low. The reflectron TOF, on the other hand,
compensates for energy differences and increases the path length (Aebersold et al., 2003). This
makes its resolution higher, but it is not suited to detecting larger ions.
Figure 2.7 shows the difference between the linear and reflectron TOF. We can


see that for the same sample, a peak is wider with the linear TOF
and sharper with the reflectron TOF.
Figure 2.8 Graphical view of MS data (panel title: Voyager Spec #1 [BP = 712.3, 26107])
(picture taken from Helmke et al., 2003)
Figure 2.8 shows the output of MALDI-TOF MS. The Y axis is the intensity of
the MS signal. The scale on the right-hand side gives the actual detected counts, while the
scale on the opposite side gives the intensity relative to the highest signal. The unit of the X
axis is the mass to charge ratio (m/z). In MALDI the charge is 1, so m/z is the same
as mass.
Data Description
At UCHSC the mass spectrometry instrument is bundled with data processing software,
Data Explorer, which can perform functions such as noise reduction and
de-isotoping. It can also save the processed data as a plain text file or a Microsoft
Excel file. Unfortunately, the plain text file format does not include all the
information needed for the quantification process. Therefore, the Microsoft Excel format
was chosen for this system.
A mass spectrometry data file can contain several different experiment results, so
we need an identifier to distinguish different samples. For each result, the first
row is the header, which indicates the names of columns but the first column of
this header is empty originally. Therefore, we used this blank field to record
important information both for human discrimination and for automated parsing.
An important fact about the data is that peptides do not match peaks one to one.
The reasons are as follows.
a) Abundance of a peptide: the more abundant the peptide, the higher the peak. In other
words, a peptide with very low abundance might not show up in a peak list.
b) Chemical and physical features of a peptide: some peptides might not be
detected because of their particular features.
c) Existence of isotopes: peptides contain naturally occurring isotopes, so a
peptide can have more than one peak in MS data.


GJ1 1 3   Centroid Mass   Lower Bound   Upper Bound   Charge (z)   Height        Relative Intensity   Area
1         1735.963422     1735.63       1736.38       0            651           6.75                 2439.27
2         1737.972348     1737.5        1738.46       1            659           6.84                 5092.55
3         1738.903358     1738.46       1739.33       1            695           7.21                 5464.07
4         1740.083599     1739.83       1740.41       1            643           6.67                 1370.12
5         1741.050023     1740.41       1741.53       1            2065          21.42                10453.39
6         1742.041495     1741.53       1742.7        1            2134          22.13                8730.15
7         1743.043347     1742.7        1743.62       1            [illegible]   18.43                6448
8         1744.013599     1743.62       1744.32       1            970           10.06                2356.13
9         1744.843901     1744.32       1745.12       0            709           7.36                 1939.97
10        1745.306146     1745.12       1745.57       0            785           8.14                 1081.27
11        1747.039762     1746.74       1747.28       0            649           6.73                 2143.61
12        1747.85497      1747.28       1748.28       0            529           5.49                 2149.87
13        1749.482056     1749.2        1749.79       0            748           7.75                 1237.41
14        1750.091275     1749.79       1750.58       0            641           6.64                 2580.87
Figure 2.9 Example of MS data
Figure 2.9 shows a partial example. The first column records the sequence
number of each peak. The second column shows the centroid mass value of a
peak. The third and fourth columns are the mass values of the lower bound and
upper bound of a peak. The fifth column is the charge of a peak. The sixth column
is the peak height. The seventh column is the same as the sixth column but is
expressed as a percentage of the highest peak. The last column is the calculated
area of a peak. In this work only the identifier, column 2, and column 6 are used
for quantification.
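
The way the identifier, column 2, and column 6 are picked out of such a file can be sketched as follows. The sketch assumes that the spreadsheet has already been read into rows of strings (the actual system reads the Microsoft Excel file directly) and that a header row is recognized by the text "Centroid Mass" in its second cell. The names Peak and split_samples and the identifier "sample 2" are hypothetical.

```python
from typing import NamedTuple

class Peak(NamedTuple):
    centroid_mass: float  # column 2 of the data file
    height: float         # column 6 of the data file

def split_samples(rows):
    """Group peak rows by the identifier stored in the first header cell."""
    samples, current = {}, None
    for row in rows:
        if row[1] == "Centroid Mass":   # header row of a new result
            current = samples.setdefault(row[0], [])
        elif current is not None:
            current.append(Peak(float(row[1]), float(row[5])))
    return samples

HEADER = ["Centroid Mass", "Lower Bound", "Upper Bound", "Charge (z)",
          "Height", "Relative Intensity", "Area"]
rows = [
    ["GJ1 1 3"] + HEADER,
    ["1", "1735.963422", "1735.63", "1736.38", "0", "651", "6.75", "2439.27"],
    ["5", "1741.050023", "1740.41", "1741.53", "1", "2065", "21.42", "10453.39"],
    ["sample 2"] + HEADER,              # hypothetical second identifier
    ["1", "1742.041495", "1741.53", "1742.7", "1", "2134", "22.13", "8730.15"],
]
samples = split_samples(rows)
print(samples["GJ1 1 3"])
```

Keeping only the centroid mass and the height per peak mirrors the statement above that the other columns are not needed for quantification.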


Theoretical Mass Value
The theoretical mass value of a peptide or a protein can be determined by
summing up the theoretical mass values of the amino acids. Table 2.3 (PMWC,
2003; FM, 2003) shows the mass values for each amino acid.
Short Code   Name            Average Mass   Mono-isotopic mass
A            Alanine         71.0788        71.03711
C            Cysteine        103.1448       103.00919
D            Aspartic Acid   115.0886       115.02694
E            Glutamic Acid   129.1155       129.04259
F            Phenylalanine   147.1766       147.06841
G            Glycine         57.0520        57.02146
H            Histidine       137.1412       137.05891
I            Isoleucine      113.1595       113.08406
K            Lysine          128.1742       128.09496
L            Leucine         113.1595       113.08406
M            Methionine      131.1986       131.04049
N            Asparagine      114.1039       114.04293
P            Proline         97.1167        97.05276
Q            Glutamine       128.1308       128.05858
R            Arginine        156.1876       156.10111
S            Serine          87.0782        87.03203
T            Threonine       101.1051       101.04786
V            Valine          99.1326        99.06841
W            Tryptophan      186.2133       186.07931
Y            Tyrosine        163.1760       163.06333
Table 2.3 Theoretical mass values of amino acids


Name   Average Mass   Mono-isotopic mass
H+     1.00794        1.00782
OH-    17.0073        17.00274
H2O    18.01524       18.01056
Table 2.4 Other useful mass values
There are two different mass values in these tables: one is the average mass, and the
other is the mono-isotopic mass. In this work only the mono-isotopic mass is
used. Note that the mass values in Table 2.3 are those of amino acid residues, which
differ from free amino acids by one H2O. Therefore, the mass values of an H and an OH
(Table 2.4) (PMWC, 2003; FM, 2003) have to be added, one at each end of the
peptide or protein, which amounts to adding the mass of one H2O.
Example 2: Find the theoretical mass values of
(1) IQVR
113.08406 + 128.05858 + 99.06841 + 156.10111 + 18.01056 = 514.32272
(2) LGEHNIDVLEGNEQFINAAK
113.08406 * 2 + 57.02146 * 2 + 129.04259 * 3 + 137.05891 + 114.04293 * 3 +
113.08406 * 2 + 115.02694 + 99.06841 + 128.05858 + 147.06841 + 71.03711 * 2
+ 128.09496 + 18.01056 = 2210.09671
(3) LSSPATLNSR
113.08406 * 2 + 87.03203 * 3 + 97.05276 + 71.03711 + 101.04786 + 114.04293
+ 156.10111 + 18.01056 = 1044.55654
(4) VATVSLPR
99.06841 * 2 + 71.03711 + 101.04786 + 87.03203 + 113.08406 + 97.05276 +
156.10111 + 18.01056 = 841.50231
In mass spectrometry, the analysis is always in terms of ions. Mass spectrometry
actually measures the mass to charge ratio, m/z. The MALDI method tends to
produce only ions with a plus one charge. This is convenient because then z is one
and m/z equals mass. The plus one peptide ion comes from the addition to the
peptide of a proton, which is an H+ ion. This proton comes from the trifluoroacetic
acid used to prepare the sample. By convention, the molecule being analyzed is called M,
and the protonated ion [M+H]+, which signifies a single added H.
Example 3: Find the [M+H]+ ions:
(1) IQVR
514.32272 + 1.00782 = 515.33054
(2) LGEHNIDVLEGNEQFINAAK
2210.09671 + 1.00782 = 2211.10453
(3) LSSPATLNSR


1044.55654 + 1.00782 = 1045.56436
(4) VATVSLPR
841.50231 + 1.00782 = 842.51013
The values calculated in Example 3 are used to calibrate mass spectrometry data.
They also appear very frequently (Duncan et al., 2003) because they are
fragments of the trypsin used to digest proteins.
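
The calculations in Examples 2 and 3 can be reproduced with a short Python sketch. It uses only the mono-isotopic values from Tables 2.3 and 2.4; the names MONO, peptide_mass, and mh_ion are chosen for the illustration and are not taken from the system described in this thesis.

```python
# Mono-isotopic residue masses from Table 2.3.
MONO = {
    "A": 71.03711, "C": 103.00919, "D": 115.02694, "E": 129.04259,
    "F": 147.06841, "G": 57.02146, "H": 137.05891, "I": 113.08406,
    "K": 128.09496, "L": 113.08406, "M": 131.04049, "N": 114.04293,
    "P": 97.05276, "Q": 128.05858, "R": 156.10111, "S": 87.03203,
    "T": 101.04786, "V": 99.06841, "W": 186.07931, "Y": 163.06333,
}
WATER = 18.01056   # H2O from Table 2.4 (an H and an OH)
H_PLUS = 1.00782   # H+ value from Table 2.4

def peptide_mass(sequence):
    """Theoretical mono-isotopic mass M: residue masses plus one H2O."""
    return sum(MONO[aa] for aa in sequence) + WATER

def mh_ion(sequence):
    """Mass of the singly protonated ion [M+H]+."""
    return peptide_mass(sequence) + H_PLUS

for peptide in ("IQVR", "LGEHNIDVLEGNEQFINAAK", "LSSPATLNSR", "VATVSLPR"):
    print(peptide, round(peptide_mass(peptide), 5), round(mh_ion(peptide), 5))
```

Results are rounded to five decimal places only for display, because floating-point summation can differ from the hand calculation in the last digits.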


CHAPTER 3
MINING INFORMATION FOR QUANTIFICATION
PEPTIDES AND INTERNAL STANDARD PEPTIDE
Careful selection of the quantification peptide and design of the internal standard
peptide are crucial for accurate quantitative MALDI-TOF MS data. This chapter
presents the subsystem that can help biologists find quantification peptides and
the internal standard peptide (IS) by using the biological background knowledge.
This subsystem requires the information of amino acid families to apply the
conservative amino acid substitution. With this information, the subsystem finds
possible quantification peptide pairs and designs an internal standard peptide for
each pair.
Amino Acid Families for Conservative
Amino Acid Substitution
As mentioned before, amino acids can be divided into families according to their
chemical properties. However, different users might require different divisions,
families, and family priorities, so we use the families shown in Table 2.1 as
the default classification, and users can change the families and priorities to
suit their own interests. Once a user changes and saves the setting, the processes


of finding quantification peptides and the internal standard peptide will perform
conservative amino acid substitution based on the user's definition.
Quantification Peptides
For a given protein, a quantification peptide must be specific and
unique to that protein in the context in which it will be measured. For example,
a highly conserved protein such as human cardiac alpha myosin heavy chain
would have quantification peptides shared with other species, but if only human
samples are analyzed, the quantification peptide only has to
discriminate human cardiac alpha myosin heavy chain from other human cardiac
myosin isoforms.
There are three biological rules that can be used for finding the pair of
quantification peptides (Helmke et al., 2003). They are the following.
a) The cleavage sites of the two peptides are identical.
b) The two peptides differ only by a single conservative amino acid
substitution.
c) The peptides have the strongest MS signals.


The Required Data
To find quantification peptides, five input files are needed: two protein sequence
files for the proteins being analyzed, and three MS data files, one for each protein
and one for a patient sample. A protein sequence file is a text file which contains the amino acid
sequence of a protein. We assume that the user already knows which target protein
he/she wants to use and obtains the target protein sequence from a web-based
protein database such as Swiss-Prot. An MS data file is a Microsoft Excel
file which contains the actual MALDI-TOF MS data of a sample. The sample can
be a protein or a protein mixture. This file is generated by Data Explorer, which is
the software bundled with the MALDI-TOF MS system.
Algorithm for Finding Quantification Peptides
In order to find the desired quantification peptides, an algorithm is designed based
on the above biological background knowledge.
Before we can use this knowledge, we first need to discuss why this algorithm
uses protein sequences as input, instead of peptide lists of proteins. There are
several web-based applications which help people auto-digest a given protein
sequence, such as MS-Digest. Those applications show the peptides of a protein
and give information needed to determine quantification peptides, except for the


required amino acids before and after each peptide. Our algorithm needs two
amino acids before and after each peptide to keep the chemistry more strictly similar,
while those web-based applications give only one. Therefore, our algorithm
accepts two protein sequences as input and performs in-silico digestion by itself.
If all the rules were applied at the same time, every rule would have to be evaluated
for every peptide pair. Therefore, the algorithm applies the rules one by one. The following considerations
decide the order of their application. The algorithm is designed to show not only
the pair with the strongest signals but also all pairs fulfilling the other rules, so rule c)
is implemented by sorting the candidates found by signal strength. This means that it
does not reduce the number of candidates. Thus, rule c) can only be applied as the
last rule. Substitution in rule b) means replacing one amino acid with another, which implies
that the lengths of the two peptides are the same. As a result, rule b) not only checks for
the conservative amino acid substitution but also implicitly checks the lengths of the
two peptides. This property lets it remove more unnecessary pairs than rule a).
Hence, we apply rule b) as the first rule.
The pseudocode of the algorithm follows:
INPUT Pα: protein sequence 1, Pβ: protein sequence 2, fα: MS data of
protein 1, fβ: MS data of protein 2
Apply in-silico digestion to get peptide sets Pα and Pβ
// apply rule b)
For each peptide pair of pα ∈ Pα AND pβ ∈ Pβ
    If the Hamming distance of pα and pβ = 1 AND
    the differing amino acids are located in the same amino acid family Then
        Insert this pair into Candidates
// apply rule a)
For each pair in Candidates
    If the first two amino acids of pα and pβ are different OR
    the two amino acids before pα and pβ are different OR
    the last two amino acids of pα and pβ are different OR
    the two amino acids after pα and pβ are different Then
        Remove this pair from Candidates
// apply rule c)
For each pair in Candidates
    Set the peak heights of pα and pβ from the protein MS data
Sort Candidates by peak heights of pα and pβ, descending
Return Candidates
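The pseudocode can be sketched in Python as follows. This is a minimal
illustration under stated assumptions, not the thesis implementation: the
digestion simply cuts after every K or R, the members of the NON_POLAR family are
hypothetical, the peak heights are passed in as plain dictionaries keyed by
peptide instead of being looked up by mass in the MS data, and candidates are
sorted by the sum of the two peak heights. All function names are invented for
the example.

```python
from itertools import product

# Hypothetical non-polar family; the thesis does not list the members it uses.
NON_POLAR = set("GAVLIMFWP")

def digest(protein):
    """Simplified in-silico tryptic digest: cut after every K or R.
    Returns (start_index, peptide) tuples; the proline exception is ignored."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append((start, protein[start:i + 1]))
            start = i + 1
    if start < len(protein):
        peptides.append((start, protein[start:]))
    return peptides

def cleavage_context(protein, start, peptide, width=2):
    """Two residues before, first two, last two, and two after the peptide."""
    end = start + len(peptide)
    return (protein[max(0, start - width):start], peptide[:width],
            peptide[-width:], protein[end:end + width])

def single_conservative_substitution(pep_a, pep_b, families):
    """Rule b): Hamming distance 1 and both residues in one family."""
    if len(pep_a) != len(pep_b):
        return False
    diffs = [(x, y) for x, y in zip(pep_a, pep_b) if x != y]
    return len(diffs) == 1 and any(
        diffs[0][0] in fam and diffs[0][1] in fam for fam in families)

def find_quantification_pairs(prot_a, prot_b, heights_a, heights_b, families):
    candidates = []
    for (sa, pa), (sb, pb) in product(digest(prot_a), digest(prot_b)):
        if not single_conservative_substitution(pa, pb, families):  # rule b)
            continue
        if cleavage_context(prot_a, sa, pa) != cleavage_context(prot_b, sb, pb):
            continue                                                 # rule a)
        candidates.append(
            (pa, pb, heights_a.get(pa, 0.0), heights_b.get(pb, 0.0)))
    # rule c): strongest combined signal first (the combination is assumed)
    return sorted(candidates, key=lambda c: c[2] + c[3], reverse=True)

alpha = "MKYRILNPVAIPEGQFIDSRKGAAK"  # toy sequences, not real MyHC
beta = "MKYRILNPAAIPEGQFIDSRKGAAK"
pairs = find_quantification_pairs(
    alpha, beta, {"ILNPVAIPEGQFIDSR": 900.0}, {"ILNPAAIPEGQFIDSR": 700.0},
    [NON_POLAR])
print(pairs)
```

Applying rule b) before rule a) mirrors the ordering argument above: the length
and Hamming-distance test is cheap and discards most pairs before any flanking
residues are compared.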
Several protein sequence alignment algorithms exist, such as BLAST (Altschul et
al., 1990) and dynamic programming pairwise sequence alignment (Needleman et al.,
1970; Smith et al., 1981). Those algorithms calculate the similarity of a pair of
sequences by using the accepted point mutation (PAM) matrix of Dayhoff et al.
(1978) and produce a score that indicates the relationship between the sequences.
In our system, we only need to handle sequence pairs that differ by a single
conservative amino acid substitution, so we chose the straightforward approach,
which checks the lengths of the given pair of peptides and then determines the
relationship of the amino acids at the site where they differ.
Discovered Result
This algorithm finds peptide pairs satisfying all three rules, sorted by abundance
in a descending order. They are used as quantification peptide pairs to determine
the internal standard peptide later.
Internal Standard Peptide
A stable isotope labeled version of the quantification peptide is the easiest
internal standard peptide to find, and the use of such peptides in concert with
mass spectrometry is commonplace for quantitative analyses in the small-molecule
field (Bucknall et al., 2002; Ong et al., 2003). However, the cost of
synthesizing a stable isotope labeled peptide is very high. Hence, in this work
we adopt another method, which only needs one normal synthetic peptide for the
quantification of two proteins.
In the method, if a peptide fulfills all requirements stated below, it can be chosen
as the internal standard peptide (Helmke et al., 2003):
a) it is highly homologous to the quantification peptide.
b) the sequence of this peptide must be altered from that of the quantification
peptide so that the mass of this peptide can be discriminated from the
quantification peptide by MALDI-TOF MS while maintaining the chemistry
of the original quantification peptide.
c) the substitution should not change the charge or hydrophobicity of the peptide
as this would change the recovery of the peptide or the ability of the peptide to
co-crystallize with the matrix or the ability to ionize, therefore change the
production of its MALDI-TOF signal.
d) the mass of the internal standard peptide should be in an open region of the
spectrum in which the peptide signal could appear without interference by
other peptides. This open region must be near the quantification peptide since
the peptide will have a mass close to that of the quantification peptide.
e) if there are several potential quantification peptides, then the sample spectra
can be inspected to find the quantification peptides that have the highest signal
and that have nearby open regions for the internal standard peptide signal.
The most important property of these rules is that they keep the chemical
characteristics of the internal standard peptide close to those of the
quantification peptide. We achieve this by applying conservative amino acid
substitution. The other part is to find an open region of the spectrum where a
synthetic peptide can be inserted without interference from the sample.
The Required Data
The quantification peptides are the first required input. Second, an MS data file
for a sample is required because it shows the peak heights of the peptides in
the sample. The following algorithm uses this file to find an open region for
the internal standard peptide.
Algorithm for Finding Internal Standard Peptide
This algorithm applies the above stated rules to find the internal standard peptide
candidates.
The first step of the algorithm is to predict the mass of the internal standard
peptide. A reasonable prediction is the midpoint between the masses of the two
quantification peptides. Assume that the given quantification peptide pair is
(pα, pβ). The distance from the mass of pα to this predicted value is called D.
The mass change produced by exchanging two amino acids within an amino acid
family may not match every value of D. Because of this limitation, the algorithm
finds all mass changes that are possible within the given amino acid families
and uses the one closest to D instead of the initial guess.
Second, the algorithm checks for the existence of an open region. So that the MS
data does not have to be de-noised first, the algorithm makes some assumptions
to deal with noise. By observation (Figure 3.1), more than 90% of the peaks in
an MS data file have intensities below 10%, so this algorithm uses 10% as the
average noise level and a factor of 1.5 as the upper boundary for all noise.
Because of isotopes, the algorithm inspects a range wide enough to hold the
internal standard and its isotopic peaks. If all peaks in this range are below
the noise boundary, the range can be an open region. When the predicted internal
standard can be inserted into the sample without interference, the algorithm
goes to the next step; otherwise, the process stops.
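The open-region test can be sketched as below. This is only an illustration: it
assumes that the 10% noise level is taken relative to the most intense peak in
the file, that the inspected window runs from the predicted mass minus the error
tolerance to the last isotopic peak plus the tolerance, and that peaks are given
as (mass, intensity) tuples. The function and parameter names are hypothetical.

```python
def is_open_region(peaks, center_mass, n_isotopes=3, tolerance=0.5,
                   noise_fraction=0.10, noise_factor=1.5):
    """Return True if every peak in the inspected window is below the noise bound.

    peaks: list of (mass, intensity) tuples from one MS data file.
    The noise bound is noise_fraction * noise_factor * (highest intensity),
    which is an assumed reading of the 10% and 1.5 values in the text.
    """
    if not peaks:
        return True
    noise_bound = noise_fraction * noise_factor * max(i for _, i in peaks)
    low = center_mass - tolerance
    high = center_mass + n_isotopes + tolerance   # isotopes are 1 Da apart
    return all(intensity < noise_bound
               for mass, intensity in peaks if low <= mass <= high)

peaks = [(1740.93, 100.0), (1741.93, 80.0), (1754.90, 5.0), (1768.96, 90.0)]
print(is_open_region(peaks, 1754.94))  # only a 5.0 peak falls in the window
print(is_open_region(peaks, 1740.93))  # the window contains strong peaks
```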
Figure 3.1 Intensity chart of a MS data file
In the last step, conservative amino acid substitution has to be checked. The
algorithm tries substitutions starting from the site where pα and pβ differ and
moves outward on both sides. There are two biological considerations behind
this. The first is that the rest of the peptide remains very similar to both
quantification peptides, so it can be used for either. The other is that the end
amino acids determine more of the peptide's chemistry. A substitution near the
end of a peptide will therefore probably change its chemical properties more
than a substitution close to the site where the quantification peptide pair
differs, and this is not what we want. The algorithm only checks possible
substitutions within the given amino acid families. Once it finds a site that
fits the conditions of mass value and conservative substitution, a new peptide
is created by applying the found substitution.
The pseudocode of the algorithm is shown below.
INPUT pα: quantification peptide 1, pβ: quantification peptide 2, f: patient
MS data
Predict a possible location of the IS as the average of the masses of
pα and pβ, and call the distance from pα to this predicted value D
Use D to find the closest possible amino acid mass change in the given amino
acid families and assign it back to D
Assign the mass of pα + D to M_IS
If M_IS is not located in an open region Then
    Return null
Find all possible amino acid substitutions S whose mass changes match D
For each amino acid AA ∈ pα
    If AA is in S Then
        Get the substitute amino acid AA' from S
        Create a new peptide sequence IS which is the same as pα but with AA
        replaced by AA'
        Append IS to Candidates
OUTPUT Candidates
All these steps are based only on pα. In order to find all possible internal
standard peptides, we need to exchange pα and pβ and repeat the process.
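A minimal Python sketch of this search is given below, assuming a hypothetical
aliphatic family (A, V, L, I) with standard monoisotopic residue masses, an
open-region test supplied by the caller, and candidates ordered by their
distance from the site where the two quantification peptides differ. It makes no
attempt to reproduce any further filtering the actual program may apply, so the
number of candidates it returns may differ from the program's output.

```python
# Monoisotopic residue masses in Da for a hypothetical aliphatic family.
RESIDUE_MASS = {"A": 71.03711, "V": 99.06841, "L": 113.08406, "I": 113.08406}

def family_substitutions(family):
    """(old, new, mass change) for every exchange that changes the mass."""
    return [(old, new, RESIDUE_MASS[new] - RESIDUE_MASS[old])
            for old in family for new in family
            if abs(RESIDUE_MASS[new] - RESIDUE_MASS[old]) > 1e-9]

def internal_standard_candidates(pep_a, mass_a, mass_b, diff_site, family,
                                 is_open_region, tol=0.01):
    """Return (site, sequence, mass) tuples, closest to diff_site first."""
    target = (mass_a + mass_b) / 2.0 - mass_a        # predicted distance D
    subs = family_substitutions(family)
    if not subs:
        return []
    # Replace D by the closest mass change the family can actually produce.
    d = min((delta for _, _, delta in subs),
            key=lambda delta: abs(delta - target))
    mass_is = mass_a + d
    if not is_open_region(mass_is):
        return []
    allowed = sorted((old, new) for old, new, delta in subs
                     if abs(delta - d) <= tol)
    found = []
    for site, aa in enumerate(pep_a):
        for old, new in allowed:
            if aa == old:
                found.append(
                    (site, pep_a[:site] + new + pep_a[site + 1:], mass_is))
    found.sort(key=lambda c: abs(c[0] - diff_site))
    return found

candidates = internal_standard_candidates(
    "ILNPVAIPEGQFIDSR", 1768.96, 1740.93, diff_site=4, family="AVLI",
    is_open_region=lambda mass: True)   # open-region test stubbed out
for site, sequence, mass in candidates:
    print(site, sequence, round(mass, 2))
```

To cover both directions, the same function would be called a second time with
the two peptides and their masses exchanged.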
Discovered Result
The final products of this algorithm are potential internal standard peptides.
These peptides are arranged in the order of the given quantification pairs.
Within each quantification pair, a substitution site closer to the site where
the pair differs has higher priority. Although this algorithm returns internal
standard peptides for all possible quantification pairs, the user only needs to
focus on the pair with the highest signals.
CHAPTER 4
QUANTIFICATION OF MASS SPECTROMETRY DATA
This chapter presents the subsystem that can quantify MS data automatically.
Before quantifying the experimental MS data, we first obtain the necessary
information in terms of the quantification peptides and the internal standard
peptide. They can be found by the subsystem described in Chapter 3, or manually.
The other required information is the isotopes of the quantification peptides, the
isotopes of the internal standard peptide, and the standard curves. Using this
information, MS data can be quantified.
Isotopes
In mass spectrometry data, every peak maps to a mass value. Protein
identification typically uses only the mono-isotopic peak, but because of
naturally occurring heavier isotopes such as 13C and 15N, a series of isotopic
peaks follows the mono-isotopic peak. The mass value of each isotopic peak is 1
Dalton (Da) higher than that of the preceding peak because each heavy isotope is
1 Da heavier than its parent element. The charge z is usually equal to 1, so the
value of m/z is equivalent to the mass value m. As explained above, the
allowable error for determining the expected peaks cannot be larger than 0.5 Da.
If an expected peak is found, the first isotope will usually be the next peak,
whose mass value is that of the expected peak plus one. The reason to consider
the intensities, or amounts, of isotopes is that the intensities of isotopic
peaks are sometimes almost the same as, or greater than, that of the
mono-isotopic peak. If we ignored them, we might discard a large portion of the
MS signal. As described above, the subsystem can generate all the consecutive
isotopic peaks, and the user specifies the number of isotopic peaks to be used
in quantification.
The issue for the user is to decide how many isotopic peaks to use. Using too
few may waste potentially useful information, while using too many increases
noise because the signal decreases with increasing mass (see Figure 4.1). Thus,
for flexibility, the subsystem leaves it to the user to input the mono-isotopic
mass values of the peptides and the number of isotopic peaks.
Figure 4.1 Mono-isotopic peaks and isotopic peaks
(picture taken from Helmke et al., 2003)
Standard Curves
The standard curve is a very important factor in the quantification process.
Without the standard curve, the user cannot get a quantification result from
mass spectrometry data. To create a standard curve, the user needs to prepare
standard curve data in the same format as normal mass spectrometry data. The
only difference between standard curve data and normal data is the identifier.
The format of the identifier is SC_SampleName_FileName, where SC is a constant
string that marks standard curve data, SampleName has the form XX %, which
indicates the percentage of the higher protein, and FileName is a sequence
number. Furthermore, these standard curve data are made entirely from synthetic
peptides, not from a real sample. That is why these data can be used to make the
standard.
This subsystem provides two different standard curves: one is the ratio of
quantification peptides; the other is the ratio of the quantification peptide to the
internal standard. The first can be used to find the relative quantification of two
quantification peptides. The second is used to calculate the absolute quantification
of each quantification peptide.
Standard Curve                 X axis              Y axis
Quantification peptide ratio   Peptide ratio       Peptide MS signal ratio
Peptide 1 standard curve       pmol of peptide 1   Peptide 1 / IS MS signal ratio
Peptide 2 standard curve       pmol of peptide 2   Peptide 2 / IS MS signal ratio
Table 4.1 Standard curves
All standard curves share the same MS data source. The user needs to provide two
inputs. One is the total amount (pmol) of the two quantification peptides.
Because the identifier of the standard curve data gives only the percentage of
the quantification peptide with the higher molecular weight, the subsystem needs
to convert this percentage to an amount. For example, suppose a synthetic sample
has 25% of the higher quantification peptide and the total amount of the
quantification peptides is 4 pmol. Then there is 25% x 4 pmol = 1 pmol of the
higher quantification peptide and 3 pmol of the lower one. The other input is
the amount of internal standard, which is needed for absolute quantification.
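The conversion from identifier to amounts can be illustrated with a short Python
sketch. The exact spelling of a real identifier (for example "SC_25%_10") is an
assumption based on the format described above, and the function name is
hypothetical.

```python
def standard_curve_amounts(identifier, total_pmol):
    """Split an SC_SampleName_FileName identifier and convert the percentage
    of the higher peptide into pmol of the higher and lower peptide."""
    tag, sample_name, file_name = identifier.split("_")
    if tag != "SC":
        raise ValueError("not a standard curve identifier: " + identifier)
    percent_higher = float(sample_name.rstrip("%").strip())
    higher_pmol = total_pmol * percent_higher / 100.0
    return higher_pmol, total_pmol - higher_pmol, file_name

# 25% of 4 pmol: 1 pmol of the higher peptide and 3 pmol of the lower one.
print(standard_curve_amounts("SC_25%_10", 4.0))
```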
In order to accommodate data sets that might not be fit well by linear
regression, the non-linear regression model is chosen:
\[ y = f(x, \theta) + \varepsilon \qquad (4.1) \]
where θ represents the unknown parameters (Montgomery et al., 2001; Motulsky et
al., 1999). Then, a non-linear least-squares function is used to estimate the
unknown parameters (Montgomery et al., 2001; Motulsky et al., 1999):
\[ S(\theta) = \sum_{i=1}^{n} \left[ y_i - f(x_i, \theta) \right]^2 \qquad (4.2) \]
The basic idea is to evaluate the residual sum of squares function S(θ)
iteratively. The set of unknown parameters with the smallest value of S(θ) is
chosen as the parameters of function f.
There are four regression types in the program: linear, quadratic, cubic, and
quartic. When the regression type is chosen, the subsystem uses it as function f
in Equation (4.1). The user can evaluate how well the chosen type fits the input
data set by the goodness of fit, or coefficient of determination, R2
(Montgomery et al., 2001; Motulsky et al., 1999).
\[ \text{Goodness} = R^2 = 1 - \frac{\sum_{i=1}^{n} \left[ y_i - f(x_i, \hat{\theta}) \right]^2}{\sum_{i=1}^{n} \left( y_i - y_{mean} \right)^2} \qquad (4.3) \]
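For the linear regression type, the fit and the goodness measure of Equation
(4.3) can be illustrated in a few lines of Python. Note that this sketch uses
the closed-form least-squares solution for a straight line y = a + bx rather
than the iterative minimization of S(θ) described above; the two should agree
for a linear model, but the function names and the sample data are invented for
the example.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_xy = sum(y * (x - x_mean) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    return y_mean - b * x_mean, b

def r_squared(xs, ys, f):
    """Coefficient of determination of a fitted function f, Equation (4.3)."""
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - f(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

xs = [0.0, 1.0, 2.0, 3.0, 4.0]          # invented, perfectly linear data
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print(a, b, r_squared(xs, ys, lambda x: a + b * x))
```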
Data Quantification
After the information about quantification peptides, internal standard peptide,
desired isotopes, and standard curves is collected, we can finally start to quantify
the MS data of the samples.
Patient Mass Spectrometry Data
Patient mass spectrometry data are collected from real patient samples instead
of synthetic peptides. The format is the same as described above; the only
difference is the identifier, whose format is SampleName_Microgram_Filename.
For example, GJ1_1_3 means that the experiment uses the sample GJ1 with a weight
of 1 microgram and experiment sequence number 3. When the user wants to read
mass spectrometry data, he/she needs to provide the amount (pmol) of internal
standard peptide for all experiments contained in the chosen file. This amount
is used to quantify the given proteins.
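A patient identifier of this form can be split as sketched below. The function
name is hypothetical, and splitting from the right is a design choice made here
so that a sample name containing an underscore would still parse.

```python
def parse_patient_identifier(identifier):
    """Split SampleName_Microgram_Filename into (sample, micrograms, sequence)."""
    sample_name, micrograms, sequence = identifier.rsplit("_", 2)
    return sample_name, float(micrograms), sequence

print(parse_patient_identifier("GJ1_1_3"))
```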
Strategy of Choosing Desired Peaks
The data of each experiment include many peaks. Only the mono-isotopic peaks of
the quantification peptides and the internal standard peptide, and their chosen
isotopic peaks, are required.
Some criteria have to be met when the subsystem collects the peak data. The
first is the error range. Theoretically, the mass values of the required peaks
are the same as the given mass values, but for reasons such as experimental
error, the mass values shift. Therefore, an error range has to be set for
determining each peak. As discussed before, 0.5 Da is chosen as the standard
error range in this subsystem. The second criterion is the existence of the
required peaks. If the user asks for three isotopic peaks, the subsystem has to
sum the intensities (peak heights) of the mono-isotopic peak and all required
isotopic peaks. Sometimes, some of these peaks are not detected. In this
situation, the subsystem assumes that the quantification peptide or internal
standard peptide is not abundant enough in this sample. The reason is that a
peptide naturally shows both mono-isotopic and isotopic peaks in MS data. In
other words, a run of consecutive peaks cannot be treated as peptide peaks when
it lacks either the mono-isotopic peak or any required isotopic peak, and the
abundance of these peaks cannot be counted. The third criterion is to compare
the intensity of each found peak with the background noise, which can come from
many sources. If any required peak of a quantification peptide is below the
background noise, the subsystem treats this quantification peptide as detected
but too small and does not use its value for calculation. The last criterion
applies if there are two peaks within the same error range. In this case, the
subsystem takes only the highest one for calculation, because this is the true
peak and the lower one is only a shoulder of the main peak.
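The four criteria can be combined into one small routine, sketched here in
Python. It is an illustration only: peaks are assumed to be (mass, intensity)
tuples, the background noise level is passed in by the caller, and a return
value of None stands for "not usable for calculation". The names are
hypothetical.

```python
def summed_intensity(peaks, mono_mass, n_isotopes, noise, tolerance=0.5):
    """Sum the mono-isotopic peak and n_isotopes following isotopic peaks.

    Returns None if any required peak is missing (criterion 2) or below the
    background noise (criterion 3). Within each error range (criterion 1) only
    the highest peak is used (criterion 4).
    """
    total = 0.0
    for k in range(n_isotopes + 1):
        expected = mono_mass + k                       # isotopes are 1 Da apart
        in_range = [intensity for mass, intensity in peaks
                    if abs(mass - expected) <= tolerance]
        if not in_range:
            return None
        highest = max(in_range)
        if highest < noise:
            return None
        total += highest
    return total

peaks = [(1740.95, 9609.0), (1741.10, 500.0), (1741.92, 8000.0),
         (1754.94, 21674.0)]                           # invented peak list
print(summed_intensity(peaks, 1740.93, 1, noise=100.0))
print(summed_intensity(peaks, 1754.94, 1, noise=100.0))  # isotope peak missing
```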
Calculation of the Quantification Data
Using the strategy outlined above, the required intensities of the two
quantification peptides and the internal standard peptide are known. Now, let
Hα, Hβ, and HIS be the intensities of quantification peptides Pα and Pβ and
internal standard peptide PIS. First, the MS signal ratio of Pα and Pβ is
calculated and presented as a percentage, %Pα MS.
\[ \%P_{\alpha}\,\mathrm{MS} = \frac{H_{\alpha}}{H_{\alpha} + H_{\beta}} \qquad (4.4) \]
Then, let the standard curve of the ratio of the quantification peptides be fR.
The inverse estimation of fR is used to determine the amount ratio of these two
peptides, and the answer is called %Pα Peptide.
\[ \%P_{\alpha}\,\mathrm{Peptide} = f_R^{-1}(\%P_{\alpha}\,\mathrm{MS}) \qquad (4.5) \]
Next, the ratios Pα / IS and Pβ / IS have to be found. As above, we apply the
inverse functions of the standard curves fα and fβ of Pα / IS and Pβ / IS to
these ratios to estimate the amounts of Pα and Pβ.
\[ \mathrm{pmol}\ P_i = f_i^{-1}(H_i / H_{IS}), \quad i \in \{\alpha, \beta\} \qquad (4.6) \]
Using the provided total amount W (in micrograms), the amount of each peptide
per microgram is found.
\[ \mathrm{pmol}\ P_i / \mu\mathrm{g} = \frac{\mathrm{pmol}\ P_i}{W}, \quad i \in \{\alpha, \beta\} \qquad (4.7) \]
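As a worked illustration of Equations (4.4) to (4.7), the sketch below assumes
that all three standard curves are straight lines y = a + bx, so that the
inverse estimate is x = (y - a) / b. The peak heights are those of one patient
measurement from the results chapter (file 22 of sample FC_3, 3 micrograms), and
the curve parameters are the linear fits reported there for the MyHC standard
curves, rounded here. The function names are hypothetical.

```python
def invert_linear(y, a, b):
    """Inverse estimate for a linear standard curve y = a + b*x."""
    return (y - a) / b

def quantify(h_alpha, h_beta, h_is, ratio_curve, alpha_curve, beta_curve,
             micrograms):
    """Each curve is an (a, b) pair of a linear standard curve."""
    pct_alpha_ms = 100.0 * h_alpha / (h_alpha + h_beta)            # (4.4), in %
    pct_alpha_peptide = invert_linear(pct_alpha_ms, *ratio_curve)  # (4.5)
    pmol_alpha = invert_linear(h_alpha / h_is, *alpha_curve)       # (4.6)
    pmol_beta = invert_linear(h_beta / h_is, *beta_curve)          # (4.6)
    return {
        "pct_alpha_ms": pct_alpha_ms,
        "pct_alpha_peptide": pct_alpha_peptide,
        "pmol_alpha": pmol_alpha,
        "pmol_beta": pmol_beta,
        "pmol_alpha_per_ug": pmol_alpha / micrograms,              # (4.7)
        "pmol_beta_per_ug": pmol_beta / micrograms,                # (4.7)
    }

result = quantify(
    h_alpha=24191.0, h_beta=9609.0, h_is=21674.0,
    ratio_curve=(-0.58294, 0.98740),    # rounded parameters a, b
    alpha_curve=(0.059246, 0.423405),
    beta_curve=(0.037776, 0.488485),
    micrograms=3.0,
)
for name, value in result.items():
    print(name, round(value, 4))
```

With these inputs the sketch should give roughly 73.07 for %Pα Peptide, 0.83
pmol for Pβ, and 2.50 pmol for Pα, in line with the values listed for that
measurement. A non-linear standard curve would need a numerical inverse instead
of invert_linear.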
Error Estimation
Because the end results of the quantification process are found by the inverse
functions of the standard curves, the error range of the estimates must be
considered. Although the standard deviation can be used as an easy error
estimator, this subsystem implements a statistical method, the confidence
interval, to determine the error range. The method is described below
(Montgomery et al., 2001).
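For an estimated value x̂0, it is possible to locate a range
\[ \bar{x} + d_1 \le x_0 \le \bar{x} + d_2 \qquad (4.8) \]
where d1 and d2 are the roots of
\[ d^2 \left( \hat{\beta}_1^2 - \frac{t_{\alpha/2,\,n-2}^2\, \hat{\sigma}^2}{S_{xx}} \right) - 2 d\, \hat{\beta}_1 (y_0 - \bar{y}) + (y_0 - \bar{y})^2 - t_{\alpha/2,\,n-2}^2\, \hat{\sigma}^2 \left( 1 + \frac{1}{n} \right) = 0 \qquad (4.9) \]
In Equations (4.8) and (4.9):
a) t(α/2, n-2) is the percentage point of the t-distribution, where n is the
number of given points and 1 - α is the confidence coefficient. The value of t
can be found in a statistical table.
b) σ̂2 is the residual mean square of the regression.
\[ \hat{\sigma}^2 = \frac{SS_{Res}}{n - 2} \qquad (4.10) \]
The residual sum of squares is SSRes:
\[ SS_{Res} = \sum_{i=1}^{n} y_i^2 - n \bar{y}^2 - \hat{\beta}_1 S_{xy} \qquad (4.11) \]
The corrected sum of cross products of xi and yi is Sxy:
\[ S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)}{n} = \sum_{i=1}^{n} y_i (x_i - \bar{x}) \qquad (4.12) \]
c) β̂1 is the least-squares estimator of β1.
\[ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \qquad (4.13) \]
The corrected sum of squares of the xi is Sxx:
\[ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (4.14) \]
d) x̄ is the average of the x values.
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (4.15) \]
e) ȳ is the average of the y values.
\[ \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \qquad (4.16) \]
f) y0 is the new observation on the y axis.
Using the above equations, one can find the coefficients of Equation (4.9) and
then use the quadratic formula to find d1 and d2. Usually, the user would prefer
the format x ± ε, because this format is widely used to present the range of an
estimated value. Unfortunately, because d1 and d2 are the roots of a quadratic,
Equation (4.8) cannot provide this format unless d1 and d2 are equal. Hence,
this subsystem uses the average value of d1 and d2 instead of the estimated
value x̂0.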
The advantage of this error estimation is that it is more precise and the
disadvantage is that it is only for linear regression. Since the current subsystem
deals only with linear regression, this error estimation works well.
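The interval computation can be sketched in Python as follows. The sketch
assumes the standard calibration (inverse estimation) interval for simple linear
regression, in which d1 and d2 are the roots of
d^2 (b1^2 - t^2 s2 / Sxx) - 2 d b1 (y0 - ybar) + (y0 - ybar)^2 - t^2 s2 (1 + 1/n) = 0
and the interval is [xbar + d1, xbar + d2]. The t value is supplied by the
caller from a statistical table; the function name and the data are invented.

```python
import math

def calibration_interval(xs, ys, y0, t):
    """Confidence interval for x0 given a new observation y0 (linear fit)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum(y * (x - x_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sigma2 = ss_res / (n - 2)
    # Coefficients of the quadratic A*d^2 + B*d + C = 0.
    a_coef = b1 ** 2 - t ** 2 * sigma2 / s_xx
    b_coef = -2.0 * b1 * (y0 - y_bar)
    c_coef = (y0 - y_bar) ** 2 - t ** 2 * sigma2 * (1.0 + 1.0 / n)
    if a_coef <= 0:
        raise ValueError("slope too uncertain: no finite interval exists")
    root = math.sqrt(b_coef ** 2 - 4.0 * a_coef * c_coef)
    d1 = (-b_coef - root) / (2.0 * a_coef)
    d2 = (-b_coef + root) / (2.0 * a_coef)
    return x_bar + d1, x_bar + d2

xs = [0.0, 1.0, 2.0, 3.0, 4.0]        # invented calibration data
ys = [0.1, 0.9, 2.1, 2.9, 4.0]
low, high = calibration_interval(xs, ys, y0=2.0, t=3.182)  # t(0.025, 3)
print(low, high)
```

The midpoint and half-width of the returned interval give one way to report the
result in an x ± ε style.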
CHAPTER 5
RESULTS OF QUANTIFYING HUMAN
HEART MYOSIN HEAVY CHAIN
Most human heart failure is caused by heart muscle disease. Helmke et al. (2003)
of UCHSC found that the amounts of two isoforms of cardiac myosin heavy chain
(MyHC), alpha (α) and beta (β), can be correlated with whether or not a patient
has heart disease.
Helmke provided not only the experimental MS data used as input for our system
but also manually found quantification peptides and a manually designed internal
standard peptide. These manually found peptides are used to verify our system.
In addition, manually calculated quantification results were provided by Helmke
and used to confirm the output of our system.
Quantification Peptides and Internal Standard Peptide
Input Data
The protein sequence files were obtained from the Swiss-Prot database and
confirmed by Helmke. In addition, three MS data files, generated from an α-MyHC
sample, a β-MyHC sample, and a patient sample respectively, were provided by
UCHSC.
Execution
The first step is to decide which amino acid families will be used for conservative
amino acid substitution. The example shown here uses only the non-polar family.
Figure 5.1 Amino acid families
Then, the above mentioned input data were assigned to the proper fields for
further analysis.
[Screenshot of the message log. Applying the conservative amino acid
substitution rule leaves 6 pairs, the cleavage site rule leaves 4 pairs, and the
highest signal rule keeps 4 pairs. The internal standard search then reports 2,
2, 1, and 0 candidates for the four pairs.]
Figure 5.2 Result of processing given data
Figure 5.2 shows the number of possible quantification peptide pairs (red
rectangles) and the number of internal standard peptides (blue rectangles) found
for each pair of quantification peptides. No internal standard peptide candidate
was found for the fourth pair. The reason can be either that no open region
exists or that no suitable substitution exists.
Figure 5.3 shows detailed information of found quantification pairs in descending
order of MS signal intensity.
[Screenshot: table of the candidate quantification peptide pairs, listing the
site, flanking amino acids, peptide sequence, and mass of each peptide.]
Figure 5.3 Possible quantification peptide pairs
Although several candidate pairs were found, only the first pair was used
according to rules described above. Figure 5.4 shows the internal standard peptide
candidates for the peptide pair with the strongest MS signal.
[Screenshot: table of internal standard peptide candidates, listing the lower
peptide mass, substitution site, internal standard mass, mass difference, and
higher peptide.]
Figure 5.4 Possible internal standard peptides
Validation
The manually found quantification peptides and the internal standard peptide are
as follows.
                                                              m/z [M+H]+
                                                              Mono-isotopic
α-MyHC 726-741 Peptide:      YR | ILNPVAIPEGQFIDSR | KG       1768.96
Internal Standard Peptide:        ILNPVAVPEGQFIDSR            1754.94
β-MyHC 724-739 Peptide:      YR | ILNPAAIPEGQFIDSR | KG       1740.93
(The vertical bars mark the trypsin sites.)
Figure 5.5 Myosin heavy chain quantification peptides
(picture taken from Helmke et al., 2003)
Comparing these data with the automatically found peptides in Figures 5.3 and
5.4, the first pair of quantification peptides in Figure 5.3 shows the
following:
a) the quantification peptide found from β-MyHC (lower protein peptide) is
ILNPAAIPEGQFIDSR. This peptide is located in the protein from 724 to 739,
and its mono-isotopic mass value is around 1740.9282.
b) the quantification peptide found from α-MyHC (higher protein peptide) is
ILNPVAIPEGQFIDSR. This peptide is located in the protein from 726 to 741,
and its mono-isotopic mass value is around 1768.9595.
The sequences and locations of these two automatically found peptides are the
same as those of the peptides shown in Figure 5.5. Furthermore, if we round the
theoretical masses to two decimal places, they match the mass values of Figure
5.5.
As for the internal standard peptide, the first peptide in Figure 5.4 is the
same as the given one because it follows all the rules. The second peptide also
follows all of the rules, but its substitution site is too far from the site
(site 4) where the two quantification peptides differ. Helmke thought that the
second one might also work because it follows all the rules as well; however, he
did not use the second internal standard peptide in the experiment. As mentioned
before, a peptide does not always show up in the MS peak list, so perhaps this
second peptide is not good for detection. Hence, we kept all the candidates as
potential replacements.
Protein Quantification
Project Management
The quantification process is managed by project name. The user can manipulate
several quantification projects at the same time. The most important information
is the two quantification peptides and the internal standard peptide. Because this
system only accepts two proteins for comparison, the peptides are defined as
lower and higher quantification peptide respectively. The names of the proteins
are needed for identification purpose, and the mass values of two quantification
peptides and the internal standard are needed for quantification. In addition, the
amount of internal standard mixed into the real sample is required and recorded.
Figure 5.6 Project management
Isotope Management
This example takes only the first isotope of each quantification peptide and, as
discussed before, the default allowed error is 0.5 Da.
[Screenshot of the isotope table:]
Belong To   Molecular Weight   Allowed Error (Da)
1740.93     1741.93            0.5
1754.94     1755.94            0.5
1768.96     1769.96            0.5
Figure 5.7 Isotope management
Sample Management
In this example, there are 15 samples: 5 standard curve samples and 10 patient
samples. After reading these samples, we recognized the data as shown in Figure
5.8.
Sample Name   Amount     IS pmol   Type
FC_3          3.0 µg     2.0       Patient
GJ_1          1.0 µg     2.0       Patient
GJ_2          2.0 µg     2.0       Patient
GJ_3          3.0 µg     2.0       Patient
GJ_4          4.0 µg     2.0       Patient
PO_3          3.0 µg     2.0       Patient
RO_3          3.0 µg     2.0       Patient
SA_3          3.0 µg     2.0       Patient
SM_3          3.0 µg     2.0       Patient
WY_3          3.0 µg     2.0       Patient
0%            4.0 pmol   2.0       Standard Curve
100%          4.0 pmol   2.0       Standard Curve
25%           4.0 pmol   2.0       Standard Curve
50%           4.0 pmol   2.0       Standard Curve
75%           4.0 pmol   2.0       Standard Curve

Figure 5.8 Sample management
Standard Curve
Three different standard curves are generated: the α-MyHC / β-MyHC peptide ratio
standard curve, the β-MyHC peptide standard curve, and the α-MyHC peptide
standard curve. The detailed information of these standard curves is shown in
Figure 5.9.
(a) Quantification peptide ratio standard curve
(b) β-MyHC peptide standard curve
(c) α-MyHC peptide standard curve


Alpha-MyHC peptide / Beta-MyHC
Standard Curve Information
Number of iterations: 117
Maximum number of iterations: 2000
Sum of residuals squared: 122.69220502419353
Standard deviation: 6.395107635898803
Goodness of fit: 0.9999998253507646
Parameters:
a = -0.5829379537896258
b = 0.9873996873119156
[Points (Data) table not legible.]
(d) Detailed information of quantification peptide ratio standard curve

Beta-MyHC peptide Standard Curve
Standard Curve Information
Number of iterations: 76
Maximum number of iterations: 2000
Sum of residuals squared: 0.16891766170500883
Standard deviation: 0.23728861589845168
Goodness of fit: 0.9999994440946864
Parameters:
a = 0.037775911489048415
b = 0.48848515702045214
y = a + bx
[Points (Data) table: the ten visible rows have X = 4 and Y values between
1.8022 and 2.091.]
(e) Detailed information of β-MyHC peptide standard curve

Alpha-MyHC peptide Standard Curve
Standard Curve Information
Number of iterations: 75
Maximum number of iterations: 2000
Sum of residuals squared: 0.31752715542109868
Standard deviation: 0.3253342667785953
Goodness of fit: 0.9999998487870273
Parameters:
a = 0.059246488827915555
b = 0.4234052555119099
y = a + bx
[Points (Data) table not legible.]
(f) Detailed information of α-MyHC peptide standard curve
Figure 5.9 Standard curves and their detailed information
Results
Figure 5.10 shows the manually calculated quantification results provided by
Helmke. These data were used to verify the output of our system. After the
correctness was confirmed by Helmke, the identifier and output format were
changed to make the output better organized. In addition, Helmke ran more
experiments for each sample, so it is difficult to show the relationship between
the old data (Figure 5.10) and the new data (Figure 5.12).
Sample  File Name  Beta        Beta/IS  Internal Standard  Alpha/IS  Alpha
3EP     0007       21002.0000  2.1813   9628.0000          1.5972    15378.0000
        0008       18191.0000  1.8029   10090.0000         1.3663    13786.0000
        0009       24008.0000  1.9078   12584.0000         1.4388    18106.0000
        Average    21067.0000  1.9640   10767.3333         1.4674    15756.6667
        STDEV      2909.0447   0.1954   1590.1476          0.1181    2184.7520

6EG     0010       3914.0000   0.4648   8421.0000          1.7772    14966.0000
        0011       2964.0000   0.4141   7158.0000          2.0504    14677.0000
        0012       2854.0000   0.3718   7677.0000          1.7008    13057.0000
        Average    3244.0000   0.4169   7752.0000          1.8428    14233.3333
        STDEV      582.8379    0.0466   634.8315           0.1838    1028.9316

7EG     0016       4503.0000   0.4701   9578.0000          2.2553    21601.0000
        0017       5055.0000   0.5190   9739.0000          2.0112    19587.0000
        0016       4284.0000   0.4607   9298.0000          1.9293    17939.0000
        Average    4614.0000   0.4833   9538.3333          2.0653    19709.0000
        STDEV      397.3047    0.0313   223.1599           0.1696    1834.0458
Figure 5.10 Manually calculated quantification results
Our system provides two different kinds of results: one is the standard curve
data, which is used only for reference (Figure 5.11); the other is the
quantification result from the patient samples (Figure 5.12). Their formats are
the same, but the quantification result displays more columns than the standard
curve data. The reason is that the standard curve data is generated from
synthetic peptides whose amounts are known, so it is unnecessary to show the
quantification columns for this type of data.
[Screenshot of the result window for standard curve sample 25%. It lists the
same values as the file view in (b).]
(a) Screen view

SC25%   IS: 2.0 pmol   Total: 4.0 pmol

Filename  β-MyHC      IS         α-MyHC     % α-MyHC MS  β-MyHC / IS  α-MyHC / IS
10        36382       24256      11483      23.9904      1.4999       0.4734
11        42139       27894      14081      25.0462      1.5107       0.5048
12        19616       12179      6139       23.8361      1.6106       0.5041
13        43169       30014      14972      25.7512      1.4383       0.4988
14        55032       36080      18715      25.3773      1.5253       0.5187
36        37617       24572      11839      23.9385      1.5309       0.4818
37        39659       25184      13656      25.6138      1.5748       0.5422
38        39444       26256      13014      24.8084      1.5023       0.4957
39        50544       33585      17912      26.1657      1.505        0.5333
40        65283       42343      22155      25.338       1.5418       0.5232

Average   42888.5     28236.3    14396.6    24.9866      1.524        0.5076
STDEV     12212.2529  8115.5097  4438.5744  0.8229       0.0464       0.0219
(b) File view
Figure 5.11 Standard curve data
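The "% α-MyHC MS" column of Figure 5.11 is consistent with the share of the α-MyHC peak in the combined α-MyHC and β-MyHC signal. The sketch below illustrates that relationship under this assumption; the function name `percent_alpha` is hypothetical, and the formula is inferred from the tabulated values rather than taken from a definition in this section.

```python
def percent_alpha(beta, alpha):
    """Percentage of the alpha-MyHC peak in the combined alpha + beta signal."""
    total = alpha + beta
    if total <= 0:
        raise ValueError("alpha + beta must be positive")
    return 100.0 * alpha / total


# (file name, beta-MyHC intensity, alpha-MyHC intensity) taken from Figure 5.11
standard_curve_rows = [
    ("10", 36382.0, 11483.0),
    ("14", 55032.0, 18715.0),
]

for name, beta, alpha in standard_curve_rows:
    print(name, round(percent_alpha(beta, alpha), 4))
```

For a standard curve sample, the computed percentage can be compared directly with the known mixing ratio of the synthetic peptides (25% for the sample shown).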
Sample: FC_3

Filename  β-MyHC     IS        α-MyHC      % α-M...  % α-M...  β-MyH...  pmol β...  pmol β...  α-MyH...  pmol α...  pmol α...  pmol (...
22        9609.0     21674.0   24191.0     71.571    73.0747   0.4433    0.8302     0.2767     1.1161    2.4961     0.832      1.1087
23        18978.0    39767.0   47222.0     71.3323   72.833    0.4772    0.8996     0.2999     1.1875    2.6647     0.8882     1.1881
24        8106.0     18110.0   20564.0     71.7265   73.2322   0.4476    0.839      0.2797     1.1355    2.5419     0.8473     1.127
25        14723.0    32599.0   41051.0     73.6024   75.132    0.4516    0.8472     0.2824     1.2593    2.8343     0.9448     1.2272
26        14655.0    32533.0   38039.0     72.1885   73.7001   0.4505    0.8449     0.2816     1.1692    2.6215     0.8738     1.1554
27        17027.0    38903.0   48012.0     73.8203   75.3527   0.4377    0.8187     0.2729     1.2341    2.7748     0.9249     1.1978
28        11874.0    26783.0   32082.0     72.9866   74.5084   0.4433    0.8302     0.2767     1.1978    2.689      0.8963     1.173
29        19921.0    42210.0   46810.0     70.1473   71.6328   0.4719    0.8887     0.2962     1.109     2.4793     0.8264     1.1226
30        16291.0    37988.0   44180.0     73.0598   74.5825   0.4288    0.8005     0.2668     1.163     2.6068     0.8689     1.1357
31        21368.0    42498.0   49614.0     69.8966   71.3789   0.5028    0.952      0.3173     1.1674    2.6172     0.8724     1.1897

Average   15255.2    33306.5   39176.5     72.0331   73.5427   0.4555    0.8551     0.285      1.1739    2.6326     0.8775     1.1625
StdDev    4375.1068  8619.325  10334.2625  1.3566    1.3739    0.0221    0.0453     0.0151     0.0482    0.1138     0.0379     0.0387
(a) Screen view

FC_3   IS: 2.0 pmol

Filename  β-MyHC     IS        α-MyHC      % α-MyHC MS  % α-MyHC peptide  β-MyHC / IS  pmol β-MyHC  pmol β-MyHC/µg protein
22        9609       21674     24191       71.571       73.0747           0.4433       0.8302       0.2767
23        18978      39767     47222       71.3323      72.833            0.4772       0.8996       0.2999
24        8106       18110     20564       71.7265      73.2322           0.4476       0.839        0.2797
25        14723      32599     41051       73.6024      75.132            0.4516       0.8472       0.2824
26        14655      32533     38039       72.1885      73.7001           0.4505       0.8449       0.2816
27        17027      38903     48012       73.8203      75.3527           0.4377       0.8187       0.2729
28        11874      26783     32082       72.9866      74.5084           0.4433       0.8302       0.2767
29        19921      42210     46810       70.1473      71.6328           0.4719       0.8887       0.2962
30        16291      37988     44180       73.0598      74.5825           0.4288       0.8005       0.2668
31        21368      42498     49614       69.8966      71.3789           0.5028       0.952        0.3173

Average   15255.2    33306.5   39176.5     72.0331      73.5427           0.4555       0.8551       0.285
STDEV     4375.1068  8619.325  10334.2625  1.3566       1.3739            0.0221       0.0453       0.0151
(b) File view
Figure 5.12 Quantification result of patient sample
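The per-microgram values in Figure 5.12 are consistent with dividing each pmol amount by 3, and the last column of the screen view is consistent with the sum of the two per-microgram values. The sketch below illustrates this arithmetic only. The protein amount of 3 µg is an inference from the tabulated numbers, not a value stated in this section, and the names `PROTEIN_UG` and `per_microgram` are hypothetical.

```python
PROTEIN_UG = 3.0  # assumption: inferred from the table, not stated in the text


def per_microgram(pmol, protein_ug=PROTEIN_UG):
    """Normalize a peptide amount (pmol) to the amount of protein analyzed (µg)."""
    if protein_ug <= 0:
        raise ValueError("protein amount must be positive")
    return pmol / protein_ug


# Quantified amounts for file 22 in Figure 5.12, in pmol
pmol_beta = 0.8302
pmol_alpha = 2.4961

beta_per_ug = per_microgram(pmol_beta)
alpha_per_ug = per_microgram(pmol_alpha)
total_per_ug = beta_per_ug + alpha_per_ug

print(f"beta  {beta_per_ug:.4f} pmol/µg")
print(f"alpha {alpha_per_ug:.4f} pmol/µg")
print(f"total {total_per_ug:.4f} pmol/µg")
```

Because the inputs here are the rounded values printed in the figure, the last digit of a result may differ from the figure by one unit.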

CHAPTER 6
CONCLUSION AND FUTURE WORK
Peptide quantification is becoming increasingly important for diagnosing diseases. Our
system is designed to improve the efficiency of the current manual quantification
process, which is labor intensive and prone to human error.
The key feature of our system is the integration of different subsystems into a
single application for discovery of quantification peptides, design of the internal
standard peptide, and quantification of all given data in a rapid and efficient
manner.
In short, we developed and validated a fully automatic system. By generalizing
the concept, it is possible to use the system for similar problems.
Although we achieved our goal of automating the process of protein
quantification, the following improvements are planned for the future:
a) Enzyme selection: Because trypsin is the most commonly used enzyme, this
system implements in-silico digestion only for trypsin, but some proteins are
digested by other enzymes. Therefore, other enzymes could be implemented
to make the system more widely applicable.
b) Noise determination: Noise determination is very important for determining
the open region and the desired peaks. A 10% noise assumption was made for this
system, but it would be better to design an algorithm that can find the
optimal noise ratio for each of the different MS data files.
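One possible way to support additional enzymes, as suggested in item a), is to express each enzyme as a small cleavage rule that the digestion routine looks up. The following is only a sketch of that idea, not the digestion code of our system. The rule table `ENZYME_RULES`, the function `digest`, and the Lys-C entry are illustrative assumptions, and the trypsin rule shown (cleave after K or R unless the next residue is P, no missed cleavages) is the common textbook simplification.

```python
# Each rule: (residues to cleave after, residues that block cleavage when they follow).
# Both rules are textbook simplifications and are assumptions of this sketch,
# not the rules implemented in the system described above.
ENZYME_RULES = {
    "trypsin": (frozenset("KR"), frozenset("P")),  # after K or R, not before P
    "lys-c": (frozenset("K"), frozenset()),        # after K
}


def digest(sequence, enzyme="trypsin"):
    """Split a protein sequence into peptides, assuming no missed cleavages."""
    cut_after, blocked_by = ENZYME_RULES[enzyme]
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else None
        if residue in cut_after and next_residue not in blocked_by:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide without a cut site
    return peptides


example = "MTDAQMADFGAAAQYLRKSEKERLEAQTRP"  # short illustrative sequence
print(digest(example, "trypsin"))
print(digest(example, "lys-c"))
```

With this layout, adding an enzyme means adding one entry to the rule table, and the digestion loop stays unchanged.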
This is only the first step of our research. In the immediate future, we will start
working on a similar problem: a switch between human cardiac tropomyosin and
human skeletal tropomyosin in heart disease.

APPENDIX
A. TWENTY AMINO ACIDS
(picture taken from http://www.agsci.ubc.ca/courses/fnh/410/protein/l_12.htm)
Name           Abbreviation
Alanine        Ala (a)
Arginine       Arg (r)
Asparagine     Asn (n)
Aspartate      Asp (d)
Cysteine       Cys (c)
Glutamate      Glu (e)
Glutamine      Gln (q)
Glycine        Gly (g)
Histidine      His (h)
Isoleucine     Ile (i)
Leucine        Leu (l)
Lysine         Lys (k)
Methionine     Met (m)
Phenylalanine  Phe (f)
Proline        Pro (p)
Serine         Ser (s)
Threonine      Thr (t)
Tryptophan     Trp (w)
Tyrosine       Tyr (y)
Valine         Val (v)
[Structure column: chemical structure drawings, not reproduced.]


B. AMINO ACID SEQUENCE OF ALPHA-MyHC
MTDAQMADFG AAAQYLRKSE KERLEAQTRP FDIRTECFVP DDKEEFVKAK
ILSREGGKVI AETENGKTVT VKEDQVLQQN PPKFDKIQDM AMLTFLHEPA
VLFNLKERYA AWMIYTYSGL FCVTVNPYKW LPVYNAEVVA AYRGKKRSEA
PPHIFSISDN AYQYMLTDRE NQSILITGES GAGKTVNTKR VIQYFASIAA
IGDRGKKDNA NANKGTLEDQ IIQANPALEA FGNAKTVRND NSSRFGKFIR
IHFGATGKLA SADIETYLLE KSRVIFQLKA ERNYHIFYQI LSNKKPELLD
MLLVTNNPYD YAFVSQGEVS VASIDDSEEL MATDSAFDVL GFTSEEKAGV
YKLTGAIMHY GNMKFKQKQR EEQAEPDGTE DADKSAYLMG LNSADLLKGL
CHPRVKVGNE YVTKGQSVQQ VYYSIGALAK AVYEKMFNWM VTRINATLET
KQPRQYFIGV LDIAGFEIFD FNSFEQLCIN FTNEKLQQFF NHHMFVLEQE
EYKKEGIEWT FIDFGMDLQA CIDLIEKPMG IMSILEEECM FPKATDMTFK
AKLYDNHLGK SNNFQKPRNI KGKQEAHFSL IHYAGTVDYN ILGWLEKNKD
PLNETVVALY QKSSLKLMAT LFSSYATADT GDSGKSKGGK KKGSSFQTVS
ALHRENLNKL MTNLRTTHPH FVRCIIPNER KAPGVMDNPL VMHQLRCNGV
LEGIRICRKG FPNRILYGDF RQRYRILNPV AIPEGQFIDS RKGTEKLLSS
LDIDHNQYKF GHTKVFFKAG LLGLLEEMRD ERLSRIITRM QAQARGQLMR
IEFKKIVERR DALLVIQWNI RAFMGVKNWP WMKLYFKIKP LLKSAETEKE
MATMKEEFGR IKETLEKSEA RRKELEEKMV SLLQEKNDLQ LQVQAEQDNL
NDAEERCDQL IKNKIQLEAK VKEMNERLED EEEMNAELTA KKRKLEDECS
ELKKDIDDLE LTLAKVEKEK HATENKVKNL TEEMAGLDEI IAKLTKEKKA
LQEAHQQALD DLQVEEDKVN SLSKSKVKLE QQVDDLEGSL EQEKKVRMDL
ERAKRKLEGD LKLTQESIMD LENDKLQLEE KLKKKEFDIN QQNSKIEDEQ
ALALQLQKKL KENQARIEEL EEELEAERTA RAKVEKLRSD LSRELEEISE
RLEEAGGATS VQIEMNKKRE AEFQKMRRDL EEATLQHEAT AAALRKKHAD
SVAELGEQID NLQRVKQKLE KEKSEFKLEL DDVTSNMEQI IKAKANLEKV
SRTLEDQANE YRVKLEEAQR SLNDFTTQRA KLQTENGELA RQLEEKEALI
SQLTRGKLSY TQQMEDLKRQ LEEEGKAKNA LAHALQSARH DCDLLREQYE
EETEAKAELQ RVLSKANSEV AQWRTKYETD AIQRTEELEE AKKKLAQRLQ
DAEEAVEAVN AKCSSLEKTK HRLQNEIEDL MVDVERSNAA AAALDKKQRN
FDKILAEWKQ KYEESQSELE SSQKEARSLS TELFKLKNAY EESLEHLETF
KRENKNLQEE ISDLTEQLGE GGKNVHELEK VRKQLEVEKL ELQSALEEAE
ASLEHEEGKI LRAQLEFNQI KAEIERKLAE KDEEMEQAKR NHQRVVDSLQ
TSLDAETRSR NEVLRVKKKM EGDLNEMEIQ LSHANRMAAE AQKQVKSLQS
LLKDTQIQLD DAVRANDDLK ENIAIVERRN NLLQAELEEL RAVVEQTERS
RKLAEQELIE TSERVQLLHS QNTSLINQKK KMEADLTQLQ SEVEEAVQEC
RNAEEKAKKA ITDAAMMAEE LKKEQDTSAH LERMKKNMEQ TIKDLQHRLD
EAEQIALKGG KKQLQKLEAR VRELEGELEA EQKRNAESVK GMRKSERRIK
ELTYQTEEDK KNLLRLQDLV DKLQLKVKAY KRQAEEAEEQ ANTNLSKFRK
VQHELDEAEE RADIAESQVN KLRAKSRDIG AKQKMHDEE


C. AMINO ACID SEQUENCE OF BETA-MyHC
MGDSEMAVFG AAAPYLRKSE KERLEAQTRP FDLKKDVFVP DDKQEFVKAK
IVSREGGKVT AETEYGKTVT VKEDQVMQQN PPKFDKIEDM AMLTFLHEPA
VLYNLKDRYG SWMIYTYSGL FCVTVNPYKW LPVYTPEVVA AYRGKKRSEA
PPHIFSISDN AYQYMLTDRE NQSILITGES GAGKTVNTKR VIQYFAVIAA
IGDRSKKDQS PGKGTLEDQI IQANPALEAF GNAKTVRNDN SSRFGKFIRI
HFGATGKLAS ADIETYLLEK SRVIFQLKAE RDYHIFYQIL SNKKPELLDM
LLITNNPYDY AFISQGETTV ASIDDAEELM ATDNAFDVLG FTSEEKNSMY
KLTGAIMHFG NMKFKLKQRE EQAEPDGTEE ADKSAYLMGL NSADLLKGLC
HPRVKVGNEY VTKGQNVQQV IYATGALAKA VYERMFNWMV TRINATLETK
QPRQYFIGVL DIAGFEIFDF NSFEQLCINF TNEKLQQFFN HHMFVLEQEE
YKKEGIEWTF IDFGMDLQAC IDLIEKPMGI MSILEEECMF PKATDMTFKA
KLFDNHLGKS ANFQKPRNIK GKPEAHFSLI HYAGIVDYNI IGWLQKNKDP
LNETVVGLYQ KSSLKLLSTL FANYAGADAP IEKGKGKAKK GSSFQTVSAL
HRENLNKLMT NLRSTHPHFV RCIIPNETKS PGVMDNPLVM HQLRCNGVLE
GIRICRKGFP NRILYGDFRQ RYRILNPAAI PEGQFIDSRK GAEKLLSSLD
IDHNQYKFGH TKVFFKAGLL GLLEEMRDER LSRIITRIQA QSRGVLARME
YKKLLERRDS LLVIQWNIRA FMGVKNWPWM KLYFKIKPLL KSAEREKEMA
SMKEEFTRLK EALEKSEARR KELEEKMVSL LQEKNDLQLQ VQAEQDNLAD
AEERCDQLIK NKIQLEAKVK EMNERLEDEE EMNAELTAKK RKLEDECSEL
KRDIDDLELT LAKVEKEKHA TENKVKNLTE EMAGLDEIIA KLTKEKKALQ
EAHQQALDDL QAEEDKVNTL TKAKVKLEQQ VDDLEGSLEQ EKKVRMDLER
AKRKLEGDLK LTQESIMDLE NDKQQLDERL KKKDFELNAL NARIEDEQAL
GSQLQKKLKE LQARIEELEE ELESERTARA KVEKLRSDLS RELEEISERL
EEAGGATSVQ IEMNKKREAE FQKMRRDLEE ATLQHEATAA ALRKKHADSV
AELGEQIDNL QRVKQKLEKE KSEFKLELDD VTSNMEQIIK AKANLEKMCR
TLEDQMNEHR SKAEETQRSV NDLTSQRAKL QTENGELSRQ LDEKEALISQ
LTRGKLTYTQ QLEDLKRQLE EEVKAKNALA HALQSARHDC DLLREQYEEE
TEAKAELQRV LSKANSEVAQ WRTKYETDAI QRTEELEEAK KKLAQRLQEA
EEAVEAVNAK CSSLEKTKHR LQNEIEDLMV DVERSNAAAA ALDKKQRNFD
KILAEWKQKY EESQSELESS QKEARSLSTE LFKLKNAYEE SLEHLETFKR
ENKNLQEEIS DLTEQLGSSG KTIHELEKVR KQLEAEKMEL QSALEEAEAS
LEHEEGKILR AQLEFNQIKA EIERKLAEKD EEMEQAKRNH LRVVDSLQTS
LDAETRSRNE ALRVKKKMEG DLNEMEIQLS HANRMAAEAQ KQVKSLQSLL
KDTQIQLDDA VRANDDLKEN IAIVERRNNL LQAELEELRA VVEQTERSRK
LAEQELIETS ERVQLLHSQN TSLINQKKKM DADLSQLQTE VEEAVQECRN
AEEKAKKAIT DAAMMAEELK KEQDTSAHLE RMKKNMEQTI KDLQHRLDEA
EQIALKGGKK QLQKLEARVR ELENELEAEQ KRNAESVKGM RKSERRIKEL
TYQTEEDRKN LLRLQDLVDK LQLKVKAYKR QAEEAEEQAN TNLSKFRKVQ
HELDEAEERA DIAESQVNKL RAKSRDIGTK GLNEE
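The two isoform sequences in Appendices B and C can be compared residue by residue wherever they are in register. The sketch below is a naive illustration of such a comparison and is not the peptide discovery procedure of our system. The function name `differing_positions` is hypothetical, and the comparison is valid only for segments of equal length without insertions or deletions; full-length sequences would first need a sequence alignment.

```python
def differing_positions(seq_a, seq_b):
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [
        position
        for position, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


alpha_block = "MTDAQMADFG"  # first ten residues listed in Appendix B
beta_block = "MGDSEMAVFG"   # first ten residues listed in Appendix C

positions = differing_positions(alpha_block, beta_block)
print(positions)
for position in positions:
    print(position, alpha_block[position - 1], beta_block[position - 1])
```

Positions where the residues differ indicate where peptides specific to one isoform may be found.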


BIBLIOGRAPHY
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment
search tool. J. Mol Biol, Oct 5; 215 (3):403-10, 1990
Aebersol R, Mann M: Mass Spectrometry-based Proteomics. Nature, 422, 198-
207, 2003.
Bucknall M, Fung KY, Duncan MW: Practical Quantitative Biomedical
applications of MALDI-TOF Mass Spectrometry. J Am Soc Mass Spectrom,
vol 13 (9): 1015-27, 2002.
Chambers G, Lawrie L, Cash P, Murray GI: Proteomics: a new approach to the
study of disease. The Journal of Pathology, vol. 192 (3), 2000.
Cios K, Pedrycz W, Swiniarski RW: Data Mining Methods for Knowledge
Discovery. Kluwer Academic Publishers, Norwell, MA, 1998.
Cios K, Teresinska A, Konieczna S, Potocka J, Sharma S: Diagnosing Myocardial
Perfusion SPECT Bulls-eye Maps A Knowledge Discovery Approach.
IEEE Engineering in Medicine and Biology Magazine, 19(4): 17-25, 2000.
Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in
proteins. Atlas of Protein Sequence and Structure, Dayhoff MO, editor, Vol. 5
suppl. 3, 345 352, National Biomedical Research Foundation, Washington,
D.C., 1978.
dos Remedios CG, Berry DA, Carter LK, Coumans JVF, Heinke MY, Kiessling
PC, Seeto RK, Thorvaldson T, Trahair T, Yeoh T, Yao M, Gunning PW,
Hardeman E, Humphery-Smith I, Naidoo D, Keogh A: Different
Electrophoretic Techniques Produce Conflicting Data in the Analysis of
Myocardial Samples from Dilated Cardiomyopathy Patients Protein Levels
Do Not Necessaruly Reflect mRNA Levels. Electrophoresis, 17:235-238,
1996.
Duncan M, Fung K, Wang H, Yen C, Cios K: Identification of Contaminants in
Proteomics Mass Spectrometry Data. IEEE CSBC, 2003.
76


FM: FindMod tool, http://us.expasy.org/tools/findmod/fmdmod_masses.html,
2003.
Gygi SP, Rochon Y, Franza BR, Aebersold R: Correlation between protein and
rnRNA abundance in yeast. Mol Cell Biol, 19 (3): 1720-30,1999
Helmke SM, Yen C, Nunley K, Cios K, Duncan M, Perryman MB: Quantification
of Human Cardiac a- and P-Myosin Heavy Chain Protein by MALDI-TOF
Mass Spectrometry. Submitted to Analytical Chemistry.
Hunter TC, Andon NL, Roller A, Yates JR, Haynes PS: The functional
proteomics toolbox: methods and applications. J Chromatography B Analyt
Technol Biomed Life Sci, 782 (1-2):165-81, 2002
IPMCS: Instruction for PeptideMass peptide Characterisation Software.
http://us.expasy.org/tools/peptide-mass-doc.html, 2003
Montgomery DC, Peck EA, Vining G: Introduction to Linear Regression Analysis,
third edition. Wiley-Interscience, 2001.
Motulsky H, Christopoulos A: Fitting Models to Biological Ddtd USing Linear
and Nonlinear Regression. http://www.curvefit.comMdmdex.htm, 1999.
MS-Digest. http://prospector.ucsf.edU/ucsfhtml4.0/msdigest.htm, 2002.
1 vuvuu
, ------- .., jiv, xjiioujw mi*., jLemwand LA; Myosin we
Chain Gene Expression in Human Heart Failure. J Clin Invest, 100 (9):23
70,1997,
Needleman SB, Wunsch CD: A general method applicable to the ^search
similarities in the amino acid sequences of two pffltd/M
Biology 48:443-453,1970.
Ong S, Foster U, Mann M; UqsSflfyft'Q
proteomics. Methods, 29(2);Ui
n ' 1 "
PatteonSD.AebersoMMA/
Genetics,vol.3]suppled 7


FM: FindMod tool, http://us.expasy.org/tools/findmod/findmod_masses.htnil,
2003.
Gygi SP, Rochon Y, Franza BR, Aebersold R: Correlation between protein and
mRNA abundance in yeast. Mol Cell Biol, 19 (3): 1720-30,1999
Helmke SM, Yen C, Nunley K, Cios K, Duncan M, Perryman MB: Quantification
of Human Cardiac a- and (1-Myosin Heavy Chain Protein by MALDI-TOF
Mass Spectrometry. Submitted to Analytical Chemistry.
Hunter TC, Andon NL, Roller A, Yates JR, Haynes PS: The functional
proteomics toolbox: methods and applications. J Chromatography B Analyt
Technol Biomed Life Sci, 782 (1-2):165-81, 2002
IPMCS: Instruction for PeptideMass peptide Characterisation Software.
http://us.expasy.org/tools/peptide-mass-doc.html, 2003
Montgomery DC, Peck EA, Vining G: Introduction to Linear Regression Analysis,
third edition. Wiley-Interscience, 2001.
Motulsky H, Christopoulos A: Fitting Models to Biological Data using Linear
and Nonlinear Regression, http://www.curvefit.com/oldindex.htm, 1999.
MS-Digest. http://prospector.ucsf.edU/ucsfhtml4.0/msdigest.htm, 2002.
Nakao K, Minobe W, Roden R, Bristow MR, Lein wand LA: Myosin Heavy --
Chain Gene Expression in Human Heart Failure. J Clin Invest, 100 (9):2362-
70, 1997.
Needleman SB, Wunsch CD: A general method applicable to the search for
similarities in the amino acid sequences of two proteins. Journal of Molecular
Biology, 48: 443-453,1970.
Ong S, Foster LJ, Mann M: Mass spectrometric-based approaches in quantitative
proteomics. Methods, 29(2): 124-30, 2003.
Patterson SD, Aebersold RH:;Proteomics: the first decade and beyond. Nature
Genetics, vol. 33 supplement pp 311-323, 2003.
77


PMWC: Peptide Molecular Weight Calculator.
http://www.tpims.org/info/calculator/mw_calculator_parent.html, 2003.
Press WH, Teukolsky SA, Vetterling WT, Flannery BP: Numerical Recipes in
C++, second edition. Cambridge University Press, 2002.
Smith TF, Waterman MS: Identification of common molecular subsequences. J.
Mol. Biol. 147, 195-7, 1981.
Tyers M, Mann M: From genomics to proteomics. Nature, 422, 193-197, 2003.
USPTO: US patent and trademark office, http://appftl.uspto.gov/netacgi/nph-
Parser?Sectl=PT01 &Sect2=HITOFF&d=PG01 &p=l &u=/netahtml/PTO/srch
num.html&r= 1 &f=G&l=50&s 1='20020128194'.PGNR.&OS=DN/200201281
94&RS=DN/20020128194,2002.
Vander A, Sherman J, Luciano D: Human Physiology the Mechanisms of Body
Function, seventh edition. McGraw-Hill, 1998.
Wu TD, Brutlag DL: Discovering Empirically Conserved Amino Acid
Substitution Groups in Databases of Protein Families. Proc Int Conflntell Syst
Mol Biol, 4:230-40,1996.
Third-party Java packages used:
ImageJ: ImageJ. http://rsb.info.nih.gov/ij/, 2003.
JXL: Java Excel API. http://www.andykhan.com/jexcelapi/, 2003.
SGT: Scientific Graphics Toolkit. http://www.epic.noaa.gov/java/sgt/index.html, 2003.