A tool for identifying proteins from Maldi-Tof data

Material Information

A tool for identifying proteins from Maldi-Tof data
Rao, Rachana
Publication Date:
Physical Description:
x, 67 leaves : ; 28 cm


Subjects / Keywords:
Proteins -- Analysis ( lcsh )
Proteomics ( lcsh )
Mass spectrometry ( lcsh )
Mass spectrometry ( fast )
Proteins -- Analysis ( fast )
Proteomics ( fast )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


Includes bibliographical references (leaves 65-67).
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Rachana Rao.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
63756096 ( OCLC )
LD1193.E52 2004m R36 ( lcc )

Full Text
Rachana Rao
B. E., Bangalore University, India, 2000
A thesis submitted to the
University of Colorado at Denver
in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science

This thesis for the Master of Science
degree by
Rachana Rao
has been approved
Min-Hyung Choi
Dinesh Mehta

Rao, Rachana (M.S., Computer Science)
A Tool for Identifying Proteins from MALDI-TOF Data
Thesis directed by Professor Krzysztof (Krys) Cios
Mass spectrometry has become a widely used method for
identifying and characterizing proteins. It supports peptide mass
fingerprinting, which identifies a protein from the unique collection
of peptides produced when the protein is digested. Identification of
proteins has remained a challenge for several reasons. The protein
sample is usually a complex mixture of proteins, and the spectrum
recorded by the mass spectrometer may contain contaminant and noise
peaks. In addition, proteins undergo post-translational modification,
which makes the real peaks harder to identify. For these reasons,
different approaches to protein identification continue to be explored.
This thesis aims at identifying a good match for a given protein sample.
Using prior knowledge of known protein sequences, amino acid masses and
enzyme cleavage rules, the unknown protein is identified by matching
its peptide masses against the masses of the known proteins in the
protein sequence database. If the database has no exact match to the
unknown protein in the sample, the closest match, usually a homologous
protein, is extracted from the database. However, this procedure does
not allow the elucidation of new proteins.
This abstract accurately represents the content of the candidate's
thesis. I recommend its publication.

I would like to take this opportunity to thank my advisor Dr.
Krzysztof (Krys) Cios for his encouragement, support, guidance and
comments. I would like to specially thank Dr. Dinesh Mehta from
Colorado School of Mines for helping me in devising algorithms and
strategies to solve the problem and also being my committee member. I
would also like to thank my committee members Drs. Choi and Stilman.
I would also like to thank Allison Gehrke, Srdjan Askovic, and Dr.
Mark Duncan for providing Mass Spectrometry data and helping me
understand the biological aspect of the thesis. I would like to extend my
thanks to Dr. Larry Hunter from University of Colorado Health Sciences
Center, Dr. Karen Kafadar and Russel Boise from the Math department
at University of Colorado at Denver for providing me with ideas and
work stations to accomplish my project. I would also take the
opportunity to thank my colleagues Archana Nanjundaswamy, Chia-Yu
Yen and Thami Rachidi for their help and support.

Figures
Tables
1. INTRODUCTION
2. BACKGROUND
Proteins and Amino Acids
2D Gel Electrophoresis (2DE)
Mass Spectrometry
Mass Values
Peptide Mass Fingerprinting
Protein Databases
NCBI
SWISS PROT
OWL
Ludwignr
Protein Sequences
FASTA Format

Other Protein Identification Tools
MS Fit
ProFound
Sequest
AACompIdent
CombSearch
PeptIdent
PeptideCutter
Pre-computation
Steps for Building Table 1
Implementation of Table 1
Analysis
Steps for Building Table 2
Implementation of Table 2
Analysis
Merging of the Two Tables - Table 1 and Table 2
Implementation of Merge
Analysis

Mass Calculation
Implementation of Mass Calculation
Analysis
Illustrative Example 1
Illustrative Example 2
Analysis
Pre-computation
Search Procedure
5. RESULTS
Pre-computation
Illustrative Example 1
Illustrative Example 2
Illustrative Example 3
Illustrative Example 4
Illustrative Example 5
Analysis
6. CONCLUSION AND FUTURE WORK

A. AMINO ACIDS
B. DATABASE
D. RESULTS
BIBLIOGRAPHY

2.1. 2D map of the human-prostate proteome
2.2. PerSeptive Biosystems Voyager DE-PRO MALDI-TOF
2.3. Schematic of MALDI-TOF mass spectrometer
2.4. Mass spectrum
2.5. FASTA file format
2.6. Mascot Peptide Mass Fingerprint
2.7. Identification of proteins procedure
5.1. Formatted FASTA file format
5.2. Result Peak List 1
5.3. Result Peak List 2
5.4. Result Peak List 3
5.5. Result Peak List 4
5.6. Result Peak List 5

2.1. Amino acids, abbreviations and masses
2.2. List of identifiers used by NCBI
3.1. Trypsin cleavage rules - Normal
3.2. Number of mass files generated
4.1. Trypsin cleavage rules

Proteomics has become the domain in which it is required to
analyze huge amounts of experimental data. It is the study of proteins
in a specific organism or cell type. It can be defined as the identification,
characterization and quantification of proteins which are engaged in a
specific pathway, cell, tissue or organism and by studying which we can
provide accurate and comprehensive data about the entire system.
Because proteins are important functional molecules, the molecular
characterization of proteins gives a complete understanding of the
biological system. For example, a study of a particular configuration of
proteins in a tissue could help in defining a particular disease.
Proteomics allows looking at hundreds or thousands of protein
interactions at the same time. One of the results of proteomics is
discovery of biomarker proteins for detection and diagnosis of diseases
and identification of drug treatments.
For identification of any protein, information that is unique to that
particular protein or its family has to be obtained. The concentration
value or molecular weight of the protein could be used for this, but
these alone are unlikely to give a definitive answer. One alternative
is to sequence the protein; however, this is a time-consuming
procedure. For this reason, peptide mass fingerprinting (PMF) is widely
used to provide faster and more accurate results [Cottrell et al.,
1994]. The set of peptide masses obtained from a protein can be compared
to a fingerprint unique to that molecule: the information comes from the
unique collection of peptides that result when the protein is digested.
Mass spectrometry has become one of the most widely used techniques for
identifying proteins.
Identification of proteins has been a challenge ever since the
problem was first posed, and several factors contribute to the
complexity of the procedure. Protein samples to be analyzed are mostly
complex mixtures of different proteins. Contaminants such as keratin and
trypsin autolysis products may be present in the sample. When the
protein is digested, the enzyme may fail to cleave at the expected
sites. Identifying the real peaks after the experiment is not easy
because of noise and experimental error. Proteins also undergo
post-translational modification, which makes the real peaks even harder
to identify [Knudsen et al., 2002]. For these reasons, different
approaches have always been explored.
This thesis aims at identifying a match for the protein in a given
sample. The digestion of a protein by an enzyme gives rise to a set of
peptides, which are fragments of the protein to be identified. The
masses of these peptides are matched against the available database of
all known proteins, resulting in candidate matches for the input mass
list. Each match is then measured against various criteria to arrive at
the closest match. However, identification is limited to proteins
already present in the database; this procedure does not allow the
elucidation of new proteins.
This thesis is organized as follows. Chapter 2 provides the
necessary biological background used for the thesis. Chapter 3
discusses the protein identification procedure using a forward
approach. Chapter 4 describes the procedure of protein identification
using the backward approach. The results obtained are illustrated and
discussed in Chapter 5. Chapter 6 provides the conclusions drawn from
the thesis and possible future work.

Identifying proteins requires knowledge of some biological
terminology. This chapter discusses the biological terms and
procedures used.
Proteins and Amino Acids
Proteins are molecules that carry out most of the functions in cells
and tissues in a biological system. Changes in the cells and tissues
due to disease can be identified by comparing proteins from normal and
diseased samples. Amino acids are the building blocks of proteins. Each
protein is a combination of the 20 different amino acids available. An
amino acid can be represented by either a three-letter code or a single
letter. For example, alanine is represented by Ala or A.
Table 2.1 shows the different amino acids, their single-letter
representation and their masses.

Amino Acid       Letter   Average Residue Mass (Da)
Alanine          A         71.079
Arginine         R        156.188
Asparagine       N        114.104
Aspartic acid    D        115.089
Cysteine         C        103.144
Glutamine        Q        128.131
Glutamic acid    E        129.116
Glycine          G         57.052
Histidine        H        137.142
Isoleucine       I        113.160
Leucine          L        113.160
Lysine           K        128.174
Methionine       M        131.198
Phenylalanine    F        147.177
Proline          P         97.117
Serine           S         87.078
Threonine        T        101.105
Tryptophan       W        186.213
Tyrosine         Y        163.170
Valine           V         99.113
Table 2.1: Amino acids, abbreviations and masses
Amino acids are organic molecules that have -COOH and -NH2
groups. The C-terminal end refers to the extremity of a protein
terminated by an amino acid with a free carboxyl group -COOH. The
N-terminal end refers to the extremity of a protein terminated by an
amino acid with a free amino group -NH2. These groups are used when
amino acids join together to form a protein. -OH is lost from the -COOH
group, while -H is lost from the -NH2 group, and the resulting -CO- and
-NH- bind together to form what is called the peptide bond (-CO-NH-),
while OH and H go on to form a water molecule (H2O). This is known as
water loss.
Protein digestion is a process in which proteins are broken down into
peptides by the action of an enzyme, e.g. trypsin. Cleavage occurs on
either the C-terminal or the N-terminal side of specific residues,
depending on the enzyme involved. Each enzyme has its own set of rules
for cleaving proteins [Graves et al., 2002].
We now explore the methods used for separation and isolation of
proteins. Two methods are widely used: 1D gel electrophoresis and 2D
gel electrophoresis. The 1D gel process separates the proteins based on
their size and shape. The 2D gel process separates the proteins based
on their charge and molecular weight. The data used in this thesis is
based on 2D gel electrophoresis.

2D Gel Electrophoresis (2DE)
Two-dimensional gel electrophoresis is one of the methods used
for the separation and isolation of proteins. 2DE is a separation
technique that allows the simultaneous resolution of numerous proteins.
It is based on the orthogonal separation of the proteins by
isoelectric focusing (IEF) and by molecular weight (MW). The proteins
are first separated by isoelectric focusing according to their charge
and then separated orthogonally according to their molecular weights.
This allows even similar proteins to be resolved [Henzel et al., 1993].
Experimental difficulties are reduced by using immobilized pH
gradient gels for isoelectric focusing. With these standardized gels,
more proteins can undergo further characterization, and highly
reproducible two-dimensional maps can be generated. The proteins in the
two-dimensional gels are then visualized by staining. Proteins can be
quantified, and the spot patterns in different gels can be matched and
compared.

Figure 2.1 shows 600 µg of protein from a whole tissue
homogenate separated by 2DE. The gel has been stained and proteins
detected are labeled with numbers.
Figure 2.1: 2D map of the human-prostate proteome.
After the separation of proteins in the gel, each protein has to be
identified. For this, the proteins undergo another procedure, mass
spectrometry, in which special devices are used to measure the peptide
fragments of the protein.
Mass Spectrometry
Mass spectrometry is used to measure the masses of the peptides
generated by proteolytic digestion of the proteins. The enzyme commonly
used for protein digestion is trypsin, which cleaves the protein at the
C-terminal side of the amino acids lysine and arginine. A mass
spectrometer consists of three components: an ionization device, a mass
analyzer and a detector [Mann et al., 2001]. A widely used ionization
technique is Matrix Assisted Laser Desorption Ionization (MALDI); a
MALDI instrument is shown in Figure 2.2.
Figure 2.2: PerSeptive Biosystems Voyager DE-PRO MALDI-TOF

First, the protein sample is deposited on a MALDI plate in the
form of spots. The spots are then irradiated by a laser pulse, which
generates a burst of ions. The ions are accelerated to a fixed kinetic
energy and travel down the flight tube. Ions with higher velocity, that
is, lower mass-to-charge ratio, reach the detector first, and the
recorded arrival times produce the time-of-flight (TOF) spectrum.
Figure 2.3 shows a schematic of the MALDI-TOF mass spectrometer.
Figure 2.3: Schematic of MALDI-TOF mass spectrometer
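The dependence of flight time on mass can be illustrated with the ideal
time-of-flight relation t = L * sqrt(m / (2 z e V)), which ignores the
ion optics of real instruments. The sketch below is an illustration
only: the function name, the flight tube length and the accelerating
voltage are assumed example values, not settings of the instrument used
in this thesis.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19   # coulombs
DALTON = 1.66053906660e-27            # kilograms per dalton

def flight_time(mass_da, charge=1, tube_length_m=1.3, voltage_v=20000.0):
    """Ideal flight time in seconds of an ion in a linear TOF tube.

    tube_length_m and voltage_v are assumed example values.
    """
    mass_kg = mass_da * DALTON
    energy_j = charge * ELEMENTARY_CHARGE * voltage_v   # kinetic energy z*e*V
    velocity = math.sqrt(2.0 * energy_j / mass_kg)      # from E = m*v^2 / 2
    return tube_length_m / velocity

# Lighter ions arrive first: flight time grows with the square root of m/z.
for mass in (500.0, 1000.0, 2000.0):
    print(f"{mass:7.1f} Da -> {flight_time(mass) * 1e6:6.2f} microseconds")
```

Under this idealization, an ion four times heavier takes twice as long
to reach the detector.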
The time-of-flight (TOF) analyzer is used as the mass analyzer and
detector. It measures the mass-to-charge ratio (m/z) and records the
number of ions at each m/z value. MALDI is combined with TOF to measure
the masses of the most prominent peptides. An example of a mass
spectrum obtained after analysis by MALDI-TOF is shown in Figure 2.4.
Figure 2.4: Mass spectrum
(Source: University of Colorado Health Sciences Center)
Mass Values
Mass values can be either monoisotopic or average. The
monoisotopic mass is the mass of the first peak of the isotopic
distribution, whereas the average mass is the abundance-weighted
average over all the isotopic peaks for that species.
Peptide Mass Fingerprinting
The peptide fragments resulting from digestion of a protein by an
enzyme can uniquely identify the protein; they act as a fingerprint for
that particular protein. The peptide masses are compared to the
available database of known proteins. The idea is to retrieve proteins
that would produce a set of peptides similar to that of the protein
being analyzed. As a result of this process, a list of proteins that
closely match the input data is obtained [Pappin et al., 1993].
After a discussion of various methods, we now review the
available databases with known protein information. Various vendors
provide protein databases, and each has its own properties.
Protein Databases
The National Center for Biotechnology Information (NCBI) provides
one of the largest and most frequently updated non-redundant databases,
combining most of the public domain databases. NCBI allows free
download of these databases. It has a large set of nucleotide and
protein databases in both pre-formatted and FASTA formats. The NCBI
non-redundant protein database cross-references sequences from GenBank
CDS translations, PIR, SWISS-PROT, PRF and PDB after eliminating
duplicates.

SWISS-PROT was established in 1986 and has been maintained
collaboratively since 1987 by the Department of Medical Biochemistry
of the University of Geneva and the EMBL Data Library (now the EMBL
Outstation, the European Bioinformatics Institute, EBI). SWISS-PROT is
a protein sequence database that provides information about each
protein, such as a description of its function, its domain structure,
post-translational modifications and variants. It maintains a minimal
level of redundancy and is well integrated with other databases. It is
one of the smallest databases and is highly annotated.
The OWL database is a non-redundant composite of four
publicly available primary sources: SWISS-PROT, PIR (1-3), GenBank
(translation) and NRL-3D. Amongst these, SWISS-PROT is the highest
priority source. The other databases are compared against SWISS-PROT
to eliminate identical and trivially different sequences. However,
this database is the least frequently updated and does not have a very
consistent nomenclature.

The Ludwignr database is also a non-redundant database, built from
a number of other databases such as Swiss-Prot, TrEMBL, TrEMBL-New,
Genpept-updates, Genpept, Yeastpep and Wormpep. Databases that hold
collections of protein isoforms (splice variants) are also included in
Ludwignr, namely the Varsplic databases, Swiss-Prot Varsplic and TrEMBL
Varsplic. A variant constructed from an original protein entry can have
a very different sequence, with segments altered by the addition or
deletion of amino acids.
Protein Sequences
A non-redundant database is one in which identical sequences are
merged into one entry. To be merged, sequences must have identical
lengths and every residue at every position must be the same. For
example, the entries gi|1469284 and gi|1477453 have the same sequence:

>gi|1469284|gb|AAB05030.1|afuC gene productAA
gi|1477453|gb|AAB17216.1|afuC [Actinobacillus pleuropneumoniae]
An NCBI entry has the following parts:
an identifier signifying the database from which the protein
was obtained,
a protein id called its accession number,
the sequence of amino acids in the protein.
The entry above is in NCBI format. Table 2.2 shows a list of
identifiers used by NCBI BLAST.

Database Name                   Identifier Syntax
GenBank                         gb|accession|locus
EMBL Data Library               emb|accession|locus
DDBJ, DNA Database of Japan     dbj|accession|locus
NBRF PIR                        pir||entry
Protein Research Foundation     prf||name
SWISS-PROT                      sp|accession|entry name
Brookhaven Protein Data Bank    pdb|entry|chain
Patents                         pat|country|number
GenInfo Backbone                bbs|number
General database identifier     gnl|database|identifier
NCBI Reference Sequence         ref|accession|locus
Local Sequence identifier       lcl|identifier
NCBI sequence databases         gi|accession
Table 2.2: List of identifiers used by NCBI
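As an illustration of how such identifiers can be read by a program, the
following minimal sketch splits a gi-style title line into its fields.
The function name parse_title and the returned field names are
assumptions made for this example; only the simple
gi|number|database|accession|description layout is handled, not every
syntax in Table 2.2.

```python
def parse_title(title_line):
    """Split a 'gi|number|db|accession|description' title line into fields."""
    text = title_line.lstrip(">").strip()
    parts = text.split("|", 4)
    if len(parts) < 5 or parts[0] != "gi":
        raise ValueError("not a gi-style title line: " + title_line)
    return {
        "gi": parts[1],
        "database": parts[2],      # e.g. gb, emb, sp (see Table 2.2)
        "accession": parts[3],
        "description": parts[4].strip(),
    }

entry = parse_title(
    ">gi|37794|emb|CAA44721.1| (X62949) vacuolar isoform 2 of "
    "H+ATPase Mr 56,000 subunit [Homo sapiens]"
)
print(entry["database"], entry["accession"])
```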
FASTA Format
FASTA stands for FAST-All, as it can be used for a fast protein
comparison or a fast nucleotide comparison.
A sequence in FASTA format consists of:
Title line starting with the '>' symbol
Protein identification (source and number)
Optional annotation lines, beginning with

Sequence of amino acids in the protein. This may continue over several
lines, and the next '>' symbol signifies the end of this protein
An example of a sequence in FASTA format is:
>gi|37794|emb|CAA44721.1| (X62949) vacuolar isoform 2 of
H+ATPase Mr 56,000 subunit [Homo sapiens]
Figure 2.5: FASTA file format
For example, the protein sequence above comes originally from the
EMBL protein database and is a human protein. The entry starts with a
title line beginning with the '>' symbol, followed by the protein
identification information and then the sequence of amino acids in the
protein.
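A reader for this format can be sketched as follows. This is a minimal
illustration under stated assumptions, not the parser used in the
thesis: the name read_fasta and the two sample records are hypothetical,
and optional annotation lines are not handled.

```python
def read_fasta(lines):
    """Yield (title, sequence) pairs from an iterable of FASTA lines."""
    title, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:].strip(), []
        elif title is not None:
            chunks.append(line)        # sequence may span several lines
    if title is not None:
        yield title, "".join(chunks)

# Hypothetical records, for illustration only.
sample = """>protein_one hypothetical entry
MKTAYIAK
QRQISFVK
>protein_two hypothetical entry
GAVLK
"""
records = list(read_fasta(sample.splitlines()))
print(records)
```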
Various identification tools are available that identify proteins
based on different criteria and features. A few of them, such as
Mascot, MS-Fit and ProFound, are discussed below.

Mascot is a commercial search engine that helps identify proteins.
A mass list is submitted to Mascot, which retrieves the matches with
known proteins. Mascot allows the search criteria to be altered to:
match different databases
vary number of missed cleavages
choose the species
use different enzymes
vary peptide tolerance
After the peak list is submitted, Mascot returns the best possible
match along with other matches. The results give an overview of the
matched protein, the peptide sequences, and the masses matched and
unmatched [http://www.matrixscience.com]. Figure 2.6 shows the Mascot
search form.

[Screenshot of the Mascot Peptide Mass Fingerprint search form, with
fields for search title, database (NCBInr), taxonomy, enzyme (Trypsin)
and allowed missed cleavages, fixed and variable modifications, protein
mass, peptide tolerance, mass values (monoisotopic or average), data
file or query peak list, and number of hits to report.]
Figure 2.6: Mascot Peptide Mass Fingerprint
Other Protein Identification Tools
MS Fit
This product of the Regents of the University of California helps
identify proteins from mass spectrometry data. It identifies proteins
by comparing experimentally measured mass values with predicted mass
values.
ProFound is a tool from Rockefeller University. It employs a
Bayesian algorithm to identify proteins. Masses of peptides from a
proteolytic digestion of a protein are compared to the masses in a
peptide database calculated from NCBI's non-redundant protein
database.
The Sequest program (version 28) is a California-based product by
ThermoFinnigan. Sequest analyzes data from MS/MS spectra and provides
a score for each match with the database. It uses NCBI's non-redundant
protein database for protein comparison.

AACompIdent allows identification of proteins from their amino
acid composition. The concentration and molecular weight values can
also be entered to give a more definitive answer.
CombSearch provides a combined interface to identify proteins
using several protein identification tools at the same time. It includes
PeptIdent, TagIdent and MultiIdent from ExPASy, MS-Fit from
ProteinProspector, Mowse from UK Human Genome Mapping Project
Resource Centre, ProFound from PROWL and PeptideSearch from the
Bioanalytical Research Group at EMBL.
PeptIdent allows identification of proteins using peptide mass
fingerprinting data together with experimentally measured concentration
values and molecular weights. The masses from the sample are compared
with the proteins in the SWISS-PROT or TrEMBL databases.

PeptideCutter predicts the potential cleavage sites of various
enzymes in a given protein sequence. It searches the SWISS-PROT and
TrEMBL databases. The results are the possible cleavage sites mapped
onto the sequence [http://us.expasy.org/tools].
We summarize the procedure that this thesis is based on:
2-D gel electrophoresis is used to separate the complex
protein mixtures based on physical techniques.
The protein spot is removed from the gel.
Each spot is digested using trypsin.
The digestion of proteins gives a set of peptide masses.
These peptide masses are fragments of the protein that has to be
identified.
These fragments are then compared with the set of known
proteins to arrive at a match.
The match is then refined against other criteria to yield the
best match.
Figure 2.7 illustrates the step-by-step process of identifying the
proteins.

[Figure: in-gel digestion with trypsin; 96-well plate; mass
spectrometry; mass spectrum; protein identity]
Figure 2.7: Identification of proteins procedure
(Source: Adapted from: files/Nutrition 20 155.pdf)

The aim of our research is to develop software to identify the
proteins in a given sample. Let the set of known proteins be
represented by P, and let the protein in the sample be represented by
p. We assume that p is an element of P. Protein p is digested to arrive
at peptide fragments p1, p2, p3, etc. Let the peptide fragments have
masses m1, m2, m3, etc., respectively. These masses have to be compared
to the set of known proteins P to arrive at a close match.
To arrive at the closest match, a few criteria are to be satisfied.
At least 4 of the peptides p1, p2, p3, etc. must match a particular
protein in the database. Also, a mass error tolerance of +/-0.5 Da is
allowed to accommodate experimental and calibration errors. However,
there may not be an exact match to the protein in the protein database.
In such cases, the protein matches that exhibit the closest homology,
often equivalent proteins from related species, are retrieved as
results.
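The two criteria above, a tolerance of +/-0.5 Da and at least 4 matching
peptides, can be sketched as a simple counting procedure. This is an
illustrative sketch only, not the search procedure of the thesis: the
function names and the small database of peptide masses are
hypothetical.

```python
TOLERANCE_DA = 0.5    # mass error tolerance from the text
MIN_MATCHES = 4       # minimum number of matching peptides from the text

def count_matches(observed, theoretical, tol=TOLERANCE_DA):
    """Count observed masses lying within tol of some theoretical mass."""
    return sum(
        any(abs(m - t) <= tol for t in theoretical)
        for m in observed
    )

def rank_candidates(observed, database, tol=TOLERANCE_DA,
                    min_matches=MIN_MATCHES):
    """Return (protein, match count) pairs meeting the minimum, best first."""
    scored = [(name, count_matches(observed, masses, tol))
              for name, masses in database.items()]
    kept = [item for item in scored if item[1] >= min_matches]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Hypothetical peptide masses, for illustration only.
database = {
    "protein_a": [500.2, 612.3, 745.4, 880.5, 1020.6],
    "protein_b": [500.2, 650.1, 910.7],
}
peak_list = [500.4, 612.1, 745.0, 880.9, 1500.0]
ranking = rank_candidates(peak_list, database)
print(ranking)
```

Here protein_b is dropped because only one of its masses matches the
peak list, which is below the minimum of 4.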

We first tackled the protein identification problem using a forward
approach. Theoretical peptide masses are calculated using the knowledge
of trypsin cleavage rules and amino acid masses. The input peptide
masses m1, m2, m3, etc. are compared to these calculated masses and
appropriate matches are arrived at. The matches then have to be
authenticated by comparing the results with the database of known
proteins. The flow of the solution here is in a forward direction.
However, for each input peptide mass there can be numerous
matching peptides. The reason is that various combinations of
different amino acids can lead to the same peptide mass. Also, two
amino acids, Isoleucine (I) and Leucine (L), have the same mass
value. As there may be experimental errors, an error tolerance of
+/-0.5 Da is permitted. This can lead to more matches, as a few other
pairs of amino acids (for example Glutamine (Q) and Lysine (K)) have
mass value differences of 1 dalton or less.
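The pairs of amino acids affected by this can be listed directly from
the masses in Table 2.1. The following sketch is illustrative; the
function name close_pairs is an assumption of this example.

```python
from itertools import combinations

# Amino acid masses from Table 2.1.
RESIDUE_MASS = {
    "A": 71.079, "R": 156.188, "N": 114.104, "D": 115.089, "C": 103.144,
    "Q": 128.131, "E": 129.116, "G": 57.052, "H": 137.142, "I": 113.160,
    "L": 113.160, "K": 128.174, "M": 131.198, "F": 147.177, "P": 97.117,
    "S": 87.078, "T": 101.105, "W": 186.213, "Y": 163.170, "V": 99.113,
}

def close_pairs(masses, max_diff=1.0):
    """Return letter pairs whose masses differ by at most max_diff Da."""
    return [(a, b) for a, b in combinations(sorted(masses), 2)
            if abs(masses[a] - masses[b]) <= max_diff]

pairs = close_pairs(RESIDUE_MASS)
print(pairs)
```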
This problem can be viewed as a generalized version of the
partition problem, since a combination of the twenty amino acids is
used to arrive at the required mass value. We start by choosing a
cleaving amino acid and adding other amino acids until the required
value in the input mass list is reached. Since we use trypsin as the
enzyme, according to the normal cleavage rules a protein is cleaved
wherever Lysine (K) or Arginine (R) is present.
Rule Type   Cleavage Site   Action
Normal      C-term of K     Cleave
Normal      C-term of R     Cleave
Table 3.1: Trypsin cleavage rules - Normal
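Applied to a protein sequence, the two normal rules of Table 3.1 can be
sketched as below. The function name trypsin_digest and the input
sequence are hypothetical; the sketch assumes complete digestion with no
missed cleavages and applies no rules beyond the two in the table.

```python
def trypsin_digest(sequence):
    """Cleave after every K or R (normal rules, no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])   # cut at C-term of K/R
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])            # C-terminal remainder
    return peptides

print(trypsin_digest("AGKCDRKLM"))
```

Every peptide except possibly the last one ends in K or R, which is the
property the lookup tables below rely on.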
Reaching a target mass value dynamically from the amino acid
masses can be tedious, as it requires checking a huge number of amino
acid combinations. Hence, to be time efficient, we tackle the problem
in a different way. The peptide sequences are generated once and
stored, so their generation does not have to be repeated, which saves
time. The input masses m1, m2, m3, etc. are then compared to the masses
of the stored peptide sequences.
For this purpose, two separate lookup tables are created, Table
1 and Table 2. The first table has all peptide sequences of lengths 1
to 5 such that the last amino acid in each sequence is one of the
cleaving amino acids Lysine (K) or Arginine (R). The second table has
peptide sequences of lengths 1 to 5 without the cleaving amino acids.
The purpose is that, by looking into the two tables, we can arrive at
all peptide sequences of length 1 to 10. For faster access, the two
tables are also combined so that the combined sequences are available
directly. Also, with Table 2 present, the sequence length can be
increased to any length according to the requirement. For example, for
input masses greater than 1800, peptide sequences can be arrived at by
appending sequences from Table 2 to the existing sequences. A detailed
explanation is as follows.
Known: Amino acid masses
Mass of Water
Normal Trypsin Cleavage Rules
Required: List of Peptide Sequences and masses
Steps for Building Table 1
Let A = Set of 20 amino acids
Let C = {K, R} // Cleaving amino acids
Let N = A - C // Non-cleaving amino acids
Elements of C can exist independently as peptide sequences.
All other amino acids cannot exist independently. The next possible
combinations are the pairs of amino acids that can exist. So the
lookup table build continues with the addition of all possible amino
acid pairs.
T1 = N.C // Elements of N are prefixed to elements of C.
The peptide sequence is now expanded to a combination of
three amino acids. Each entry in Table 1 is prefixed with the elements
of N to give a triplet sequence of amino acids.
T1 = T1 ∪ N.T1 // Elements of N are prefixed to elements of Table 1.
A quadruplet sequence is formed for all the triplet sequence
combinations, taking into account that any amino acid added should be
added to the beginning of the already existing peptide sequence.
T1 = T1 ∪ N.T1 // Elements of N are prefixed to elements of Table 1.
The table is further expanded to a sequence length of 5.
T1 = T1 ∪ N.T1 // Elements of N are prefixed to elements of Table 1.
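The construction above can be sketched as follows. This is a minimal
in-memory illustration, not the thesis implementation, which writes the
sequences to a file; the function name build_table1 is an assumption.
The sketch keeps K and R themselves in the table and, on each pass,
prefixes the non-cleaving amino acids to the longest sequences built so
far.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"     # the set A of 20 amino acids
CLEAVING = "KR"                          # the set C
NON_CLEAVING = [a for a in AMINO_ACIDS if a not in CLEAVING]   # N = A - C

def build_table1(max_len=5):
    """Return all sequences of length 1..max_len ending in K or R,
    with no K or R at any earlier position."""
    table = list(CLEAVING)       # K and R can exist on their own
    layer = list(CLEAVING)       # sequences of the current length
    for _ in range(max_len - 1):
        # Prefix every non-cleaving amino acid to the current layer.
        layer = [n + seq for n in NON_CLEAVING for seq in layer]
        table.extend(layer)
    return table

table1 = build_table1()
print(len(table1))
```

The total is 2 + 36 + 648 + 11,664 + 209,952 sequences.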

Implementation of Table 1
As elements of set C can exist independently, K and R are written to
the Sequence file.
Five iterations implementing T1 = T1 ∪ N.T1 are followed to build
sequences of length 5.
All the sequences are written to the Sequence file.
After the first iteration 18*2=36 combinations can be arrived at.
An example of such a combination would be as follows:
{AK, GK, AR, GR}
However a pair such as KK, KR, or RR cannot exist. At the
second iteration, 18*18*2=648 triplet sequences are arrived at. An
example of this form would be:
A triplet sequence such as KCK, RCK etc. cannot exist.
Continuing such iterations, 11,664 quadruplet sequences and
209,952 sequences of length 5 are arrived at. This leads to a
database with over 222,300 entries.

Steps for Building Table 2
Table 2 is created by the same method as Table 1, except that C
is not used to generate the sequences. The 18 non-cleaving amino acids
are used to generate sequences up to length 5 in a similar fashion to
Table 1.
Implementation of Table 2
Five iterations implementing T2 = T2 ∪ N.T2 are followed to build
sequences of length 5.
All the sequences are written to the Sequence file.
18*18*18*18*18=1,889,568 combinations are arrived at for this
table. A sequence from this table looks like: etc.
For faster and more efficient access, the two lookup tables are
merged into a single table with all the possible combinations.
Merging of the Two Tables- Table 1 and Table 2
Let Sequence1= {seq11, seq12, seq13...}//sequences from Table 1
Let Sequence2= {seq21, seq22, seq23,....}//sequences from Table 2

These two sets are now merged to arrive at a single table.
Let Sequence3= {seq31, seq32, seq33,....} //sequences resulting from
merging Table 1 and Table 2
Sequence 3= Sequence2.Sequence1
Implementation of Merge
For each entry seq1n in Table 1
    For each entry seq2m in Table 2
        Insert seq2m.seq1n into Table 3
For example:
Let Sequence2 = {ANDCQ}
Let Sequence1 = {AACGK}
Sequence3 = Sequence2.Sequence1 = {ANDCQAACGK}
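The merge step amounts to concatenating every Table 2 entry with every Table 1 entry, with the Table 2 entry first. A minimal Java sketch is given below; the class name `MergeSketch` and the method `merge` are illustrative assumptions and not part of the original implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the merge; names are illustrative only.
final class MergeSketch {
    /** Prefixes each Table 2 sequence to each Table 1 sequence. */
    static List<String> merge(List<String> sequence2, List<String> sequence1) {
        List<String> sequence3 = new ArrayList<>();
        for (String seq1 : sequence1) {
            for (String seq2 : sequence2) {
                sequence3.add(seq2 + seq1);   // Sequence3 = Sequence2.Sequence1
            }
        }
        return sequence3;
    }

    public static void main(String[] args) {
        // prints [ANDCQAACGK]
        System.out.println(merge(Arrays.asList("ANDCQ"), Arrays.asList("AACGK")));
    }
}
```

The size of the result is the product of the two table sizes, which is why the merged table grows so quickly.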
The merging of tables gives all possible sequences of length 1 to
Mass Calculation
Masses are then calculated for each sequence in Sequence 3.
For each sequence, the mass of each amino acid present in the peptide

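sequence is summed up. The amino acid masses used are residue
masses, from which the water lost in peptide bonding is already
excluded, so the mass of one water molecule is added for the terminal
groups of the peptide.
Mass of the peptide sequence = Σ Mass of amino acids + Mass
of water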
This calculated weight is written into the destination mass file
along with the sequence.
Implementation of Mass Calculation
For each sequence from Sequence3 = {seq31, seq32, seq33, ...}
Mass of seq31 = Σ Mass of amino acids in seq31 + Mass of water
The masses in this database range from 128.1741 to 1850. Any input
mass value within this range that corresponds to a sequence of string
length 10 or less can be looked up directly. However, this first step
may not yield a match. For example, if the peak list has a molecular
weight exceeding 1850, the database has no entry for it. (The peak
list cannot have values lower than 128.1741, as K is the smallest
amino acid that can occur by itself.) Hence, to reach higher values,
sequences from Table 2 are appended to the existing merged table.

For faster access, the peptide sequences and the masses are
separated into different files. The files are divided according to the
masses. For example, if mass of the peptide sequence is 565.45 then
this is stored in a file Database6_01.
The files are named according to the following rules. If the mass
arrived at is between 501 and 600 then it is stored in the file Database6,
where 6 denotes the higher limit of the range. Also _01 denotes that
this is the first file in the range 501 to 600. Each file can hold a
maximum of 1,000,000 entries, which is regulated by a count variable.
When a file reaches this limit, a new file is created and the suffix of
the file name changes to _02.
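The naming rule can be illustrated with a short Java sketch. The class `MassFileNameSketch` and its method are hypothetical, and the treatment of the range boundary (any mass above 500 and up to 600 maps to Database6) is an assumption, since the text only gives the range as 501 to 600.

```java
import java.util.Locale;

// Hypothetical sketch of the file naming rule; names are illustrative only.
final class MassFileNameSketch {
    static final long MAX_ENTRIES_PER_FILE = 1_000_000L;

    /**
     * @param mass                  mass of the peptide sequence
     * @param entriesAlreadyInRange entries already written for this mass range
     */
    static String fileName(double mass, long entriesAlreadyInRange) {
        // The digit after "Database" is the upper limit of the range in hundreds.
        int upperLimit = (int) Math.ceil(mass / 100.0);
        // A new file (suffix _02, _03, ...) starts after every 1,000,000 entries.
        long fileIndex = entriesAlreadyInRange / MAX_ENTRIES_PER_FILE + 1;
        return String.format(Locale.ROOT, "Database%d_%02d", upperLimit, fileIndex);
    }

    public static void main(String[] args) {
        System.out.println(fileName(565.45, 0));          // prints Database6_01
        System.out.println(fileName(565.45, 1_000_000));  // prints Database6_02
    }
}
```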
For each mass value in the input sequence the search is
performed on the available sequence file. When a mass value is
matched with the sequence file, the peptide has to be appended to all
the previously found peptide sequences to build a complete sequence.
The sequence constructed can now be compared with the database of

This approach was implemented using Java and executed on a
Pentium III 1.7 GHz machine with 256 MB RAM and a 10 GB hard disk drive.
The number of mass files generated is represented in Table 3.2.
Mass range                 Number of files
< 600.0                    4
600.0-700.0                4
700.0-800.0                29
800.0-900.0                109
900.0-1000.0               159
1000.0-1100.0              71
1100.0-1200.0              9
> 1200.0 and < 1800.0      6
Table 3.2: Number of mass files generated
With these database mass files generated, the input peak list
was searched in the generated files.

Illustrative Example 1
An illustrative peak list:
515.32259 548.30981 568.12187 601.38116
650.03003 666.00447 672.01784 681.97912
687.9904 766.45428 778.41962 842.51152
861.05377 870.54684 877.03117 893.007
908.50384 973.53404 1015.56238 1045.58101
1140.61462 1179.61014 1277.72312 1310.69251
1365.65968 1475.76474 1682.88034 1716.85418
1793.8115 1911.02395 1993.98259 2169.07623
2211.09428 2378.15092 2383.96381
Consider the first peptide, with molecular weight 515.32259. 449
matches were found. Of these, 16 were exact matches and the others
were within the range of +/- 0.1. For the peptide 548.30981, 32
matches were found; none were exact and all were within the range of
+/- 0.1.

For these two peaks alone, this gives 449*32 = 14,368 possible
combinations of proteins. For example, a few of the combinations are as
follows:
Illustrative Example 2
An illustrative peak list:
508.5875 523.59069 524.59766 533.89677 536.58985
538.44283 550.59667 551.59133 552.59289 553.58633
559.30777 586.90759 599.30273 602.85659 649.91539
655.88875 656.90546 657.90295 659.09212 671.83348
673.83213 685.25884 687.75172 866.57273 951.4055
For the peptide 599.30273, 1601 matches were found. For the
peptide 951.4055, 20 matches were found after searching only 10 of the
159 files available for that mass range. All of these were exact
matches. This gives 1601*20 = 32,020 possible combinations of
proteins:
While implementing this approach, various problems were faced:
Amino acid isoforms.
Various combinations of amino acids leading to the same mass.
Space and time consumption. While generating the mass files,
some mass ranges had more files or combinations than could be
generated, due to space and time usage. Hence for an input
sequence, not all the possible combinations could be verified.
This led to the possibility of missing key matches for a few
input peaks.
Affirmative results for the input sequence were not obtained. This
was because even if all the input sequences were matched for,
they had to be authenticated with the existing database of

Because the previous approach failed, another approach was tried.
This approach is a backward, or trace-back, approach in which the flow
of the solution is reversed: instead of generating all possible peptide
combinations, the input sequence is searched for directly in the
database of known proteins. As the match is made only against proteins
known to exist, rather than against all possible combinations, the
search is more efficient.
For the masses obtained from the MS experiment, the protein
has to be identified. As initial data, along with masses of proteins from
the experiment, the database of known proteins is available. This
database of proteins has the list of all the proteins identified. The
protein in the sample is a match to any of these pre-identified proteins
of the database.
The database chosen is the NCBI database, which is non-redundant
and is the largest and most frequently updated database. For
convenience, the FASTA format representation is used. The database
currently has 1,782,119 protein sequences spread over a wide variety of
species.
This approach was implemented using Java and executed on a
Pentium III 1.7 GHz machine with 256 MB RAM and a 10 GB hard disk drive.
The procedure followed for the identification of protein can be explained
as follows.
Input: Mass sequence from the MALDI-TOF MS experiment.
Sorted in ascending order.
Database: Database of Proteins from NCBI.
1,782,119 protein entries in FASTA format.
Expected Output: Protein Match for the input sequence.
Given the protein database and the masses of the amino acids,
the protein sequence can be translated to a sequence of masses
representing the peptides in the protein sequence.

Each line is read from the database file. Each line in the
database file has the following:
accession number of the protein
name of the protein
the organism it belongs to
the amino acid sequence
From this, the amino acid sequence is extracted. The database
is first converted to a readable format by segregating the protein
information from the amino acid sequence. There are many amino acid
sequences in the database which have sequence length less than 40.
Such sequences are not considered for computational purposes. This is
because, for identification of proteins, a minimum of three peptides
have to be identified. The average peptide length can be anywhere
from 8-12. Taking this into consideration the minimum amino acid
sequence length is set to 40. Only valid amino acid sequences are
transferred into another file. Each amino acid sequence has to be
computationally digested by trypsin, that is, the cleavage rules for
trypsin have to be applied. Trypsin usually cleaves at the C-terminus
of K and R, but this rule has a few exceptions. For example, if the
amino acid lysine (K) is followed by the amino acid proline (P) and the
amino acid prior to K is not W, then the protein sequence is not
cleaved at that site. Table 4.1 lists the rules in detail.
Rule Type     Cleavage Site     Action
Normal        C-term of K       Cleave
Normal        C-term of R       Cleave
Exception     KP                Don't Cleave
Exception     RP                Don't Cleave
Exception     WKP               Cleave
Exception     MRP               Cleave
Exception     CKD               Don't Cleave
Exception     DKD               Don't Cleave
Exception     CKH               Don't Cleave
Exception     CKY               Don't Cleave
Exception     CRK               Don't Cleave
Table 4.1: Trypsin cleavage rules
This is implemented using the following pseudo code:
Program: Set Database
INPUT: NCBI FASTA format database
OUTPUT: Formatted, Structured database
DBSeqs Structured Format Output file
Read the DBNcbi line by line
For each protein sequence in DBNcbi
    If the length of the protein sequence is >= 40 then
        Write the sequence along with the protein name
        into DBSeqs
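A minimal Java sketch of this formatting step is shown below. It is an illustration under stated assumptions rather than the thesis code: the class `SetDatabaseSketch` is hypothetical, the input is taken as a list of lines already read from the FASTA file, a line starting with ">" is treated as the protein information line, and sequences of at least 40 residues are kept, following the minimum length described in the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Set Database step; names are illustrative only.
final class SetDatabaseSketch {
    static final int MIN_LENGTH = 40;   // minimum amino acid sequence length

    /** Joins each protein's sequence lines into one line and drops short sequences. */
    static List<String> format(List<String> fastaLines) {
        List<String> out = new ArrayList<>();
        String header = null;
        StringBuilder seq = new StringBuilder();
        for (String line : fastaLines) {
            if (line.startsWith(">")) {          // start of a new protein record
                flush(header, seq, out);
                header = line;
                seq.setLength(0);
            } else {
                seq.append(line.trim());         // sequence may span several lines
            }
        }
        flush(header, seq, out);                 // write the last record
        return out;
    }

    private static void flush(String header, StringBuilder seq, List<String> out) {
        if (header != null && seq.length() >= MIN_LENGTH) {
            out.add(header);                     // protein information line
            out.add(seq.toString());             // whole sequence on one line
        }
    }
}
```

The output alternates protein information lines and single-line sequences, which matches the layout the later steps read line by line.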

After digesting each amino acid sequence, a collection of
peptides for that protein are obtained. Molecular weights for each
peptide are calculated as follows:
Mass of peptide = Mass of each amino acid present +
Mass of water (18.0152)
The input data has masses in ascending order. Hence for easier
comparison, masses of peptides for each sequence are also sorted in
ascending order using Quick Sort. The masses thus calculated and
sorted are written into a separate file. This is repeated for each of the
amino acid sequences in the file.
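The mass calculation and sorting can be sketched in Java as follows. The class `PeptideMassSketch` and its methods are hypothetical. The sketch assumes the mono-isotopic residue masses from the amino acid table in the appendix; with these, a lone K gives about 146.1102, in line with the 146.11015 entries in the calculated mass lists. `Arrays.sort` is used here in place of a hand-written Quick Sort.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of mass calculation and sorting; names are illustrative only.
final class PeptideMassSketch {
    static final double WATER = 18.0152;
    static final Map<Character, Double> RESIDUE = new HashMap<>();
    static {
        // Assumption: mono-isotopic residue masses, in the order of LETTERS.
        final String LETTERS = "ACDEFGHIKLMNPQRSTVWY";
        final double[] masses = {
            71.03711, 103.00919, 115.02694, 129.04259, 147.06841,
            57.02146, 137.05891, 113.08406, 128.09496, 113.08406,
            131.04049, 114.04293, 97.05276, 128.05858, 156.10111,
            87.03203, 101.04786, 99.06841, 186.07931, 163.06333
        };
        for (int i = 0; i < LETTERS.length(); i++) {
            RESIDUE.put(LETTERS.charAt(i), masses[i]);
        }
    }

    /** Mass of peptide = mass of each amino acid present + mass of water. */
    static double mass(String peptide) {
        double sum = WATER;
        for (char aa : peptide.toCharArray()) {
            Double m = RESIDUE.get(aa);
            if (m == null) {
                throw new IllegalArgumentException("Unknown amino acid: " + aa);
            }
            sum += m;
        }
        return sum;
    }

    /** Masses of all peptides of one protein, sorted in ascending order. */
    static double[] sortedMasses(List<String> peptides) {
        double[] result = new double[peptides.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = mass(peptides.get(i));
        }
        Arrays.sort(result);
        return result;
    }
}
```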
The pseudo code used to implement this code is as follows:
Program: StripDatabase
INPUT: Formatted, Structured database
OUTPUT: Database with masses calculated and sorted for each
DBSeqs Structured Format Output file
DBMasses Database with masses sorted in ascending order for each
Read DBSeqs line by line
For each line in DBSeqs
    Skip the protein information line
    For each protein sequence in DBSeqs
        Call procedure CalcMasses: Digest it according to
        trypsin rules, simultaneously
        calculating masses for the
        Call procedure QSortMasses: Quick sort the
        masses in ascending order.
        Write the calculated and sorted masses into
//Procedure CalcMasses
For each protein sequence in DBSeqs
    Tokenize protein sequence into single amino acids
    Starting from the mass of water, add the mass of each amino acid
    in the sequence until one of these cases occurs:
    If the amino acid is K then
        Case 1: And if the following amino acid is P then
            If the amino acid previous to K was W then
                Cleave the protein sequence
            Else
                Do not cleave the protein
        Case 2: And if the following amino acid is D then
            If the amino acid previous to K was C OR D then
                Do not cleave the protein
        Case 3: And if the following amino acid is H OR Y then
            If the amino acid previous to K was C then
                Do not cleave the protein
        Otherwise: Cleave the protein sequence
    If the amino acid is R then
        Case 1: And if the following amino acid is P then
            If the amino acid previous to R was M then
                Cleave the protein sequence
            Else
                Do not cleave the protein
        Case 2: And if the following amino acid is K then
            If the amino acid previous to R was C then
                Do not cleave the protein
        Otherwise: Cleave the protein sequence
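The digestion rules can be sketched in Java as shown below. This is an illustrative reading of Table 4.1, not the original implementation: the class `TrypsinDigestSketch` and its methods are hypothetical, the RP exception is lifted only when the preceding residue is M (as in the table's MRP row), and the last R exception is taken to be CRK, as the R branch of the pseudo code suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of in-silico trypsin digestion; names are illustrative only.
final class TrypsinDigestSketch {
    /** True if the sequence is cleaved at the C-term of the residue at index i. */
    static boolean cleavesAfter(String seq, int i) {
        char aa = seq.charAt(i);
        if (aa != 'K' && aa != 'R') return false;
        if (i == seq.length() - 1) return false;      // end of protein, nothing to cut
        char prev = i > 0 ? seq.charAt(i - 1) : '\0';
        char next = seq.charAt(i + 1);
        if (aa == 'K') {
            if (next == 'P') return prev == 'W';                   // KP, except WKP
            if (next == 'D') return !(prev == 'C' || prev == 'D'); // CKD, DKD
            if (next == 'H' || next == 'Y') return prev != 'C';    // CKH, CKY
            return true;                                           // normal rule
        }
        if (next == 'P') return prev == 'M';                       // RP, except MRP
        if (next == 'K') return prev != 'C';                       // CRK (assumed)
        return true;                                               // normal rule
    }

    /** Splits a protein sequence into its tryptic peptides. */
    static List<String> digest(String protein) {
        List<String> peptides = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < protein.length(); i++) {
            if (cleavesAfter(protein, i)) {
                peptides.add(protein.substring(start, i + 1));
                start = i + 1;
            }
        }
        if (start < protein.length()) {
            peptides.add(protein.substring(start));   // C-terminal peptide
        }
        return peptides;
    }
}
```

For example, under these rules `digest("AKGRPLKDE")` yields AK, GRPLK and DE: the R is followed by P and not preceded by M, so the sequence is not cleaved there.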
Search Procedure
With the pre-computations, the search for protein identification
can now be done on the masses file. The mass sequence file is read
line by line and the masses for each sequence are populated in an
array. Let this be the DBMasses array. The input mass sequence is
also populated in an array. Let this be the InputSequence array. The
two arrays are simultaneously traversed to match the peptide masses.
Each value in InputSequence is compared with the values in DBMasses.
Because both arrays are sorted in ascending order, if the value from
DBMasses exceeds the value from InputSequence, then there is no match
for that particular InputSequence value in the DBMasses array. The
check is then continued for the next peptide mass value from the
InputSequence array. This loop is repeated till the end of either of
the arrays. The masses are compared within an error tolerance of
+/- 0.50.
If a match is found, the value is saved and a count of matches is
maintained. If the number of matches is 5 or more, the DBMasses entry
is considered a possible match for the protein. This is then recorded
in a file along with the respective protein name and peptide sequence.
The protein information is derived from the main database file, which
is traversed in the same order as the mass sequence file. The number
of matches is also recorded in the result file. The process of
matching the InputSequence to the DBMasses is continued till the end
of the mass sequence file.
The pseudo code for the algorithm used is as follows:
Program: SearchMatches
INPUT: Databases of Masses of Sequences and Sequences, Input
peak list
OUTPUT: Matched Proteins - ProteinMatches
Read DBMasses and DBSeqs line by line
Each mass from a line in DBMasses is added to an array
DBMasses Array
For each mass in Input peak list
    Traverse the DBMasses Array to the point where it either
    matches or exceeds the Input peak list value.
    Calculate Error = Input peak list value -
    DBMasses Array value
    If Error is within the Error Tolerance then
        Record masses matched
        Increment number of matches
    If no match to the Input peak list value then
        Continue with the next Input peak list value
If number of matches >= 5 then
    Record protein sequence, number of matches into
    ProteinMatches output file.
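The matching of one input peak list against one protein's mass list can be sketched in Java as follows. The class `SearchMatchesSketch` and its methods are hypothetical. The sketch assumes both arrays are sorted in ascending order and that each database mass is matched to at most one input peak; the second point is an assumption, as the text does not state it.

```java
// Hypothetical sketch of the search step; names are illustrative only.
final class SearchMatchesSketch {
    static final double TOLERANCE = 0.5;   // error tolerance of +/- 0.5
    static final int MIN_MATCHES = 5;      // peptides needed for a valid match

    /** Counts input peaks that match a database mass within the tolerance. */
    static int countMatches(double[] inputSequence, double[] dbMasses) {
        int i = 0;
        int j = 0;
        int matches = 0;
        while (i < inputSequence.length && j < dbMasses.length) {
            double error = inputSequence[i] - dbMasses[j];
            if (Math.abs(error) <= TOLERANCE) {
                matches++;   // record the match and move on in both arrays
                i++;
                j++;
            } else if (error > 0) {
                j++;         // database mass is too small, try the next one
            } else {
                i++;         // database mass exceeds the peak: no match for it
            }
        }
        return matches;
    }

    /** True if the protein qualifies as a possible match for the peak list. */
    static boolean isPossibleMatch(double[] inputSequence, double[] dbMasses) {
        return countMatches(inputSequence, dbMasses) >= MIN_MATCHES;
    }
}
```

Because both arrays are traversed once from left to right, the comparison for one protein takes time proportional to the combined length of the two lists.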

As discussed in the previous chapter, the identification of protein
is done with comparison of the unknown input protein with the known
protein database.
An extract from the FASTA file from NCBI is shown in Figure 2.5.
This form of the database is then converted to the format shown in
Figure 5.1, with all the amino acid sequences belonging to the
same protein on one line to enable easy and convenient reading.

>gi|30090032|gb|AA086520.1| (AY188332) phytochelatin synthetase [Triticum
>gi|30090035|gb|AA086522.1| (AY188333) AGLG1 [Triticum monococcum]
>gi|30466148|gb|AAP33283.1| (AY214600) DhpS [uncultured Acidobacterium sp.]
Figure 5.1: Formatted FASTA file format
From this file only the sequences are extracted and it is as below:

For this particular sequence, masses are calculated using the
masses of each amino acid and also using trypsin cleavage rules. The
list of masses of the peptides is as below:
146.11015 245.17856 293.17856 332.1742 344.21057 346.22626
374.20343 486.32123 514.32733 518.2383 519.2773 536.3005
566.27466 638.2859 640.3301 649.3594 732.37537 818.4598
841.3622 846.35547 859.45984 917.46936 977.4435 1087.5491
1101.5024 1126.5817 1312.5991 1732.8884 1882.9263 1981.9249
2037.9182 2116.9895 2386.171 2418.1758 2438.0828 2463.226
2524.2236 2660.1707 2918.6465 2937.384 3012.3716 3249.796
3443.443 3656.8145 3893.7617
This process is repeated for all the available protein sequences
in the database. The database has 1,782,119 protein sequences. The
lists of masses are stored in DBMasses and sequences are stored in
DBSeqs. Snapshots of these files have been included as part of
After this pre-computation, the matching of the unknown input
mass list with the masses in the database is done. The unknown
protein sample is compared to all the proteins in the database one by
one.

These are the parameters considered for matching the database:
Error Tolerance: +/- 0.5
Number of Peptides to be matched to qualify as valid match to
the protein: 5 or more
Number of matches with the database: As many as possible.
Results are checked till the end of the database.
The data are all provided by University of Colorado Health
Sciences Center. These masses have been submitted to the software
to obtain results. Also to validate the results, the same peak lists have
been submitted to Mascot. The results were then compared to see if the
two coincided. An extract of the results file looks like:
>gi|28076812|gb|AAO31594.1| (AF499685) ribulose 1,5-bisphosphate
carboxylase/oxygenase large subunit [Myrmecia biatorellae]

Illustrative Example 1
For the input peak list:
550.12437 568.12278 614.41047 651.37805 666.00304 870.54967
919.46884 986.52513 997.5255 1074.56884 1172.66589 1191.64913
The output for this input peak list is shown in Figure 5.2.
[Chart: Matches with the Database; axis: Number of Matches per peak list]
Figure 5.2: Result Peak List 1
Total number of peaks submitted: 47
Maximum number of peptides matched: 16
Total number of matches with the database found: 310
Matched to Mascot: Yes
Matched to: gi| 16507237 heat shock 70kDa protein 5 (glucose-
regulated protein, 78kDa

Illustrative Example 2
568.13036 615.36546 636.31313 767.37906 800.47471 833.39365
848.46861 853.47844 855.48474 870.55085 875.45598 917.49718
930.53735 946.51218 960.52326 975.46485 1016.5226 1021.51935
1029.56535 1056.61629 1061.53807 1072.55128 1121.58132
1124.59484 1133.58026 1220.63081 1267.60844 1281.6569
1344.71873 1390.76292 1484.78217 1518.86207 1519.3783
1544.79235 1640.87637 1684.9121 1765.89032 1792.77029
1798.91328 1905.06347 2033.15547 2039.9989 2052.05046 2143.047
2175.03775 2225.11241 2283.1841 2365.3358 2497.34211
2560.24987 2807.29641 3931.24277 3946.02749
The output for this input peak list is shown in Figure 5.3.
[Chart: Matches with the Database; axis: Number of Matches per peak list]
Figure 5.3: Result Peak List 2
Total number of peaks submitted: 53
Maximum number of peptides matched: 17
Total number of matches with the database found: 408
Matched to Mascot: No
Matched to: gi|45552029 Drosophila melanogaster

Illustrative Example 3
550.11731 567.26365 568.1272 581.01275 598.04547 615.38707
631.35544 637.37934 653.34925 666.01996 709.36309 782.35972
787.41587 814.34526 842.88949 856.52832 870.54321 877.49134
978.54647 995.56857 997.52074 1084.56289 1123.64908 1124.5579
1172.55451 1188.53757 1191.60366 1204.54388 1236.51944
1258.6802 1341.67992 1379.68438 1382.68956 1394.66285
1397.71609 1528.705 1587.83171 1593.85088 1607.7537 1652.74773
1653.76257 1814.88584 2225.10164 2249.05751 2283.15798
2299.17456 2302.14645 2303.13043 2591.28571 2719.37405
The output for this input peak list is shown in Figure 5.4.
[Chart: Matches with the Database; axis: Number of Matches per peak list]
Figure 5.4: Result Peak List 3
Total number of peaks submitted: 51
Maximum number of peptides matched: 12
Total number of matches with the database found: 268
Matched to Mascot: Yes
Matched to: gi|23323046 glucose regulated protein, 58 kDa; ER-60
protease; oxidoreductase ERp57 [Rattus norvegicus]

Illustrative Example 4
548.31237 565.31512 568.12569 581.06586 598.0792 601.37573
615.37206 637.37425 650.04558 653.34889 666.00276 679.51535
766.46728 795.46237 813.44475 870.5358 908.50423 1015.57495
1084.62552 1140.60871 1246.73151 1249.68111 1262.66085
1310.6868 1323.74612 1620.80934 1682.88783 1683.42125
1741.81181 1792.84171 1793.84034 1822.92549 1911.05433
1915.95998 1947.94789 2086.90941 2148.06819 2149.07042
2169.08761 2183.89998 2225.13299 2233.12551 2249.08485
2365.08574 2377.19381 2378.18037 2900.51667
The output for this input peak list is shown in Figure 5.5.
[Chart: Matches with the Database; axis: Number of Matches per peak list]
Figure 5.5: Result Peak List 4
Total number of peaks submitted: 47
Maximum number of peptides matched: 15
Total number of matches with the database found: 385
Matched to Mascot: Yes
Matched to: gi|33337538 cytokine-inducible inhibitor of signalling type
IV [Homo sapiens]

Illustrative Example 5
524.13087 550.1178 568.12341 587.16191 590.26946 602.28302
610.35507 628.36045 644.37944 650.04725 652.35009 666.01597
672.03378 681.99055 712.36337 728.35668 743.37521 828.44439
842.89988 856.53513 870.549 877.04607 893.0272 908.9909
909.42082 1004.5685 1007.54 1017.55787 1018.56586 1051.55434
1055.62971 1063.49483 1116.65083 1132.66068 1137.59826
1179.62705 1214.61464 1298.6609 1455.7383 1475.78105
1488.79852 1493.73381 1769.89315 1812.95965 1886.95516
1924.90407 1925.92067 1937.84138 2038.97186 2054.01033
2225.11083 2233.07347 2249.06567 2283.17573 2299.17785
The output for this input peak list is shown in Figure 5.6.
[Chart: Matches with the Database; axis: Number of Matches per peak list]
Figure 5.6: Result Peak List 5
Total number of peaks submitted: 55
Maximum number of peptides matched: 11
Total number of matches with the database found: 247
Matched to Mascot: No
Matched to: gi|29895655 Homoserine kinase [Bacillus cereus ATCC

As seen from the results, three of the five peak lists above gave
the same match as Mascot. More peak lists were run on the two programs,
with the following result:
Number of input peak lists submitted to Mascot and the software: 60
Number of input peak lists whose results matched Mascot: 23
That is, about 38% of the peak lists gave results matching Mascot.
The possible reasons include:
Many of the input peak files were not given a high score,
so the mass values may not be valid.
Mascot does not use all the trypsin cleavage rules. It
only uses the normal trypsin rules and does not consider the
exception rules.
Mascot uses formatted DB output versus the FASTA format
used in the software. Hence there is a possibility of
discrepancy in the accession numbers.

Knowledge of proteins and protein behavior plays a key role in
disease identification and in understanding many other changes in
organisms. Proteins carry a huge amount of information which could be
useful to biologists in identifying various behavioral changes in
organisms.
The search for a better search engine has been the motivation for
this project. To summarize, software has been developed to identify
proteins in a sample against the database of known proteins. It is a
license-free tool developed for the use of the university. It
incorporates all the rules involved in the digestion of a protein by
the enzyme used, trypsin, thus giving a stronger biological
identification of the protein.
As future enhancements to this project a lot of improvements
could be made:
Design a GUI to clearly represent the results and provide a more
user-friendly environment.

Allowance for missed cleavages. Missed cleavages occur when
the protein sequence does not cleave at the site that it has to
actually cleave.
Incorporation of the software developed to handle Post
Translational Modifications.
Inclusion of other widely used enzymes. The software only
incorporates trypsin, but other enzymes like chymotrypsin etc.
are used to digest some proteins.
Algorithm improvement. The algorithm can always be improved
to incorporate better speed efficiency.
This being the first step towards building software to identify
proteins, we look forward to the incorporation of these enhancements
into the project.

Name             Abbreviation    Average Mass    Mono-isotopic Mass
Alanine          A               71.0788         71.03711
Cysteine         C               103.1448        103.00919
Aspartic Acid    D               115.0886        115.02694
Glutamic Acid    E               129.1155        129.04259
Phenylalanine    F               147.1766        147.06841
Glycine          G               57.0520         57.02146
Histidine        H               137.1412        137.05891
Isoleucine       I               113.1595        113.08406
Lysine           K               128.1742        128.09496
Leucine          L               113.1595        113.08406
Methionine       M               131.1986        131.04049
Asparagine       N               114.1039        114.04293
Proline          P               97.1167         97.05276
Glutamine        Q               128.1308        128.05858
Arginine         R               156.1876        156.10111
Serine           S               87.0782         87.03203
Threonine        T               101.1051        101.04786
Valine           V               99.1326         99.06841
Tryptophan       W               186.2133        186.07931
Tyrosine         Y               163.1760        163.06333


[Screenshot: text file listing FASTA header lines (gi number, GenBank
accession, description and organism), including entries for Triticum
monococcum and an uncultured Acidobacterium sp.]
Formatted NCBI database

[Screenshot: text file with one numbered line per protein, each listing
the peptide masses calculated for that protein in ascending order.]
Masses calculated for NCBI database

[Screenshot: results file in which each entry gives a record number
followed by the FASTA header of a matched protein, for example ribulose
1,5-bisphosphate carboxylase/oxygenase large subunit [Myrmecia
biatorellae].]
Results file showing protein match details

Aebersold, R., & Mann, M. (2003). Mass spectrometry-based
proteomics. Nature, 422,198-207.
Cagney, G., Amiri, S., Premawaradena, T., Lindo, M., & Emili, A.
(2003). In silico proteome analysis to facilitate proteomics
experiments using mass spectrometry. Proteome Science, 1, 1-5
Chaurand, P., Luetzenkirchen, F., Spengler, B. (1999). Peptide and
protein identification by matrix-assisted laser desorption
ionization (MALDI) and maldi-post-source decay time-of-flight
mass spectrometry. Journal of the American Society of Mass
Spectrum, 10, 91-103.
Cios K., Pedrycz, W., & Swiniarski, R.W. (1998). Data mining methods
for knowledge discovery. Norwell, MA: Kluwer Academic
Cottrell, J.S. (1994). Protein identification by peptide mass
fingerprinting. Peptide Research, 7, 115-124.
Duncan, M., Fung, K., Wang, H., Yen, C., & Cios, K. (2003).
Identification of contaminants in proteomics mass spectrometry
data procedure of the computational systems bioinformatics
(CBS03). IEEE Computer Society, 409-410.
Eng, J. K., McCormack, A. L., & Yates, J.R. (1994). An approach to
correlate MS/MS data to amino acid sequences in a protein
database. Annual Review Biochemistry, 5(11), 976-89.
ExPASy tools. [World Wide Web]. Available:
Fenn, J.B., Mann, M., Meng, C.K., Wong, S.F. & Whitehouse, C.M.
(1989). Electrospray ionization for mass spectrometry of large
biomolecules. Science, 246(4926), 64-71.

Graves, P. & Haystead, T.A.J. (2002, March 12). Molecular biologists
guide to proteomics. Microbiology and Molecular Biology
Reviews, 65(1), 84 96.
Helmke S.M., Yen C., Cios K.J., Nunley K., Bristow M.R., Duncan M., &
Perryman M.B. (2004). Quantification of human cardiac A- and
B-myosin heavy chain protein by MALDI-TOF mass
spectrometry. Analytical Chemistry, 76(6), 1683-1689.
Henzel, W.J., Billed, T.M., Stults, J.T., Wong, S.C., Grimley, C., &
Watanabe C. (1993). Identifying proteins from two-dimensional
gels by molecular mass searching of peptide fragments in
protein sequence databases. National Academic Science, 90,
James, P., Quadroni, M., Carafoli, E., & Gonnet, G. (1994). Protein
identification in DNA databases by peptide mass fingerprinting.
Protein Science, 3(8), 1347-50.
Jefferies, J. R. (2003). Protein identification by peptide mass
fingerprinting tutorial. [World Wide Web], Available:
Kleinsmith, J. L. & Kish, V. M. (1995). Principles of cell and molecular
biology. New York, NY: Harper Collins College Publishers.
Kratchmarova, I., Kalume, D. E., Blagoev, B., Scherer, P. E.,
Podtelejnikov, A. V., Molina, H., Bickel, P. E, Andersen, J. S.,
Fernandez, M. M., Bunkenborg, J., Roepstorff, P., Kristiansen,
K., Lodish, H.F., Mann, M. & Pandey, A. (2002). A proteomic
approach for identification of secreted proteins during the
differentiation of 3T3-L1 preadipocytes to adipocytes. Molecular
and Cellular Proteomics, 1, 213-222.
Mann, M., Hendrickson, R., & Pandey, A. (2001). Analysis of proteins
and proteomes by mass spectrometry. Annual Review
Biochemistry, 70, 437-473.
Mascot peptide mass fingerprinting. [World Wide Web], Available:

Moore, R., Young, M., & Lee, T. (2002). Qscore: An algorithm for
evaluating sequest database search results. Journal of the
American Society for Mass Spectrum, 13(4),378-386.
Mowse scoring algorithm. [World Wide Web].Available: help.html
Moyer, S. C., Marzilli, L., Woods, A., Laiko, V., Doroshenko, V., &
Cotter, R. (2003). Atmospheric pressure matrix-assisted laser
desorption/ionization on a quadrupole ion trap mass
spectrometer. International Journal of Mass Spectrum, 226, 133-
NCBI database. [World Wide Web], Available:
Ong, S., Foster, L.J., & Mann, M. (2003). Mass spectrometric-based
approaches in quantitative proteomics. Methods, 29(2), 124-30.
Pappin, D.J.C., Hojrup, P., & Bleasby, A.J. (1993). Rapid identification
of proteins by peptide-mass finger printing. Current Biology, 3,
Parker, K.C. (2002). Scoring methods in MALDI peptide mass
fingerprinting: Chemscore, and the chemapplex program.
Journal of the American Society of Mass Spectrum, 13, 22-39.
Patterson, S., & Aebersold, R. (1995) Mass spectrometric approaches
for the identification of gel-separated proteins. Electrophoresis,
20, 310-319.
Peri, S., Ibarrola, N., Blagoev, B., Mann, M., & Pandey, A. (2001).
Common pitfalls in bioinformatics based analyses: look before
you leap. Trends in Genetics, 17, 541-545.
Perkel, J. M. (2001). Mass spectrometry applications for proteins.
Scientist, 15 (16), 31-32.
Perkins, D.N., Pappin, D.J., Creasy, D.M., & Cottrell, J.S. (1999).
Probability-based protein identification by searching sequence

databases using mass spectrometry data. Electrophoresis,
20(18), 3551-67.
Qin, J., & Chait, B. T. (1997). Identification and characterization of post-
translational modifications of proteins by MALDI ion trap mass
spectrometry. Annual Review Biochemistry, 69(19), 4002-9.
Rabilloud, T. (2002), Two-dimensional gel electrophoresis in
proteomics: Old, old fashioned, but it still climbs up the
mountains. Proteomics, 2(1), 3-10.
Werner, H. & Langen, L.H. (2000). Mass spectrometry: A tool for the
identification of proteins separated by gels. Electrophoresis,
21(11), 2105-14.
Yen, C., Helmke, S.N., Cios, K.J., Perryman, M.B., & Duncan, M.
(2004). Quantitative analysis of proteomics using data mining.
IEEE Engineering in Medicine and Biology Magazine, in print.
Zhang, W., & Chait, B. (2000, June 1). Profound: An expert system for
protein identification using mass spectrometric peptide mapping
information. Analytical Chemistry, 72(11), 2482-2489.