Citation
New strategies for identifying peptides using mass spectrometric fragmentation data

Material Information

Title:
New strategies for identifying peptides using mass spectrometric fragmentation data
Creator:
Yen, Chia-Yu
Publication Date:
Language:
English
Physical Description:
xxi, 285 leaves ; 28 cm

Subjects

Subjects / Keywords:
Peptides -- Identification ( lcsh )
Mass spectrometry ( lcsh )
Algorithms ( lcsh )
Algorithms ( fast )
Mass spectrometry ( fast )
Peptides ( fast )
Genre:
Guidebooks. ( fast )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 278-285).
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Chia-Yu Yen.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
318900215 ( OCLC )
ocn318900215
Classification:
LD1193.E52 2008d Y46 ( lcc )

Full Text
NEW STRATEGIES FOR IDENTIFYING PEPTIDES
USING MASS SPECTROMETRIC FRAGMENTATION DATA
by
Chia-Yu Yen
B.S., Feng Chia University, 1994
M.S., University of Colorado at Denver, 2003
A thesis submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
in Computer Science and Information Systems
2008


This thesis for the Doctor of Philosophy degree
by Chia-Yu Yen
has been approved by
Tom Altman
Bogdan S. Chlebus
Michael V. Mannino
Date


Yen, Chia-Yu (Doctor of Philosophy, Computer Science and Information Systems)
New Strategies for Identifying Peptides using Mass Spectrometric Fragmentation
Data
Thesis directed by Professor Lawrence E. Hunter
ABSTRACT
Identification of proteins in complex samples is a major area in bioinformatics. The
most successful method currently available is shotgun proteomics, where proteins are
proteolyzed into peptides, followed by large-scale sequencing of peptides by on-line
chromatographic separation and fragmentation in a mass spectrometer (LC-MS/MS).
The fragmentation process generates spectra (referred to as MS/MS spectra) from
which peptide sequences consistent with the observed fragment ions can be identified.
A common computational strategy for matching an MS/MS spectrum to a peptide
sequence involves inter-converting spectral and sequence information. An alternative
strategy uses direct spectrum-to-spectrum matching against a reference library of
previously observed MS/MS spectra, which has the advantage of evaluating matches
using fragment ion intensities and ion types beyond the simple set normally used.
However, this approach is limited by the small sizes of the available peptide MS/MS
libraries and the inability to evaluate the rate of false assignments. In this thesis, the
methods used to assess the performance of search algorithms are first clarified and
extended to this new search method. Second, a framework is developed to manipulate the search


space to improve the performance of commercial search programs. Finally, a rigorous
evaluation of the spectrum-to-spectrum search approach is conducted. To do this, a
general computational framework was developed for assessing the reliability of the
spectrum-to-spectrum search programs, manipulating the search space by using a full
theoretical spectral library, and maximizing peptide identification using a new
consensus approach. Furthermore, using a theoretical spectral library allows for
creation of a library of inverted sequences for evaluation of the false assignment rates.
Because the theoretical spectral library is comparable in search space to that of a
protein database, this method allows direct comparison of results with
spectrum-to-sequence searches.
This abstract accurately represents the content of the candidate's thesis. I recommend
its publication.
Signed
Lawrence E. Hunter


ACKNOWLEDGMENT
I would like to express my gratitude to Dr. Katheryn Resing for her supervision,
advice, and guidance from the very early stage of my research, as well as for providing
extraordinary experiences throughout the work.
I gratefully acknowledge Dr. Lawrence Hunter for his advice and support,
particularly in the final stages of completing the thesis.
I also gratefully thank Dr. Natalie Ahn for her support. I benefited from comments
and suggestions from members of her laboratory, as well as those of the Resing
laboratory.
I would also like to acknowledge the contribution of Dr. Krzysztof J. Cios in earlier studies.
My wife deserves special mention for her unrivaled and wholehearted support of my
work.
This work was primarily supported by NIH grant R01-CA126240 (KAR).
Finally, I would like to thank everybody who was important to the successful
realization of this thesis, and to apologize that I could not mention everyone
personally by name.


TABLE OF CONTENTS
Figures......................................................................xi
Tables.......................................................................xx
Chapter
1. Introduction............................................................1
1.1 Generation of MS and MS/MS Spectra in Proteomic Profiling...............4
1.2 Peptide Fragmentation..................................................11
1.3 Meaning of Term Mass or m/z............................................16
1.4 Search Programs for Mapping MS/MS Information to
Peptide Sequences.....................................................17
1.4.1 Descriptive Models.....................................................19
1.4.2 Interpretative Models..................................................20
1.4.3 Probability Based Models...............................................21
1.5 Limitations of Search Programs.........................................22
1.5.1 Identifying MS/MS Spectra that Score
Below the Acceptance Thresholds.......................................26
1.6 Determining the Number of False Positives..............................31
1.7 Testing a New Search Approach..........................................35
2. Performance Measures...................................................37
2.1 Confusion Matrix.......................................................38


Utilization of Confusion Matrix
in Mass Spectrometry-Based Proteomics......................................42
Defining Confusion Matrix Classes Using Standard Proteins..................47
The General Case...........................................................49
Receiver Operating Characteristic Curve....................................51
ROC Plot for Control Mixture Sample........................................55
ROC Plot Using False Discovery Rate........................................56
ROC Plot for High Mass Accuracy Dataset....................................57
The Area Under the ROC Curve...............................................60
Standard Datasets..........................................................62
Generation of DTA Files....................................................63
PLC, the Fully Curated Dataset.............................................64
ABRF, the Standard Protein Dataset.........................................69
Phoebe3, the Complex Sample Dataset........................................73
Q, a Complex LCQ Sample....................................................77
Separated and Concatenated Target-Decoy Searches...........................77
Reproduction of Elias and Gygi's Method....................................78
Improving Search Sensitivity Using Peptide-Centric Database................89
Workflow...................................................................91
Missed Cleavage Rule Tool..................................................94
Strong Cation Exchange Tool for PC-DB......................................95


Strong Cation Exchange Tool for Dataset....................................96
Result Compare Tool........................................................97
Verification of Correctness of PC-DB.......................................98
Missed Cleavage Rules.....................................................101
Converting Missed Cleavage Rules to Regular Expression....................106
Strong Cation Exchange Rules..............................................108
Experiments...............................................................111
Databases.................................................................113
Analysis of MC PC-DB Results..............................................114
Analysis of MC-SCX PC-DB Results..........................................116
Impact of Database Size on Assignment Confidence..........................117
Analysis of Score Distribution for Incorrect Assignments..................118
Experiment for Non-tryptic Peptides.......................................122
Protein Profile Analysis..................................................123
Extension in Performance Measures.........................................124
Conclusion................................................................142
Overview of Spectral Library Searching
and Theoretical Spectral Library..........................................146
Spectrum-to-Spectrum Mapping Approach.....................................148
Spectral Library Search System............................................151
Spectral Library..........................................................151


Similarity Measure........................................................155
Differences Between X!Hunter and BiblioSpec...............................158
Evaluation of the Software Engineering
for Spectral Library Search Programs......................................161
Testing the Workflow Using Same Dataset for Library
and Experimental Dataset..................................................162
Spectra Preprocessing and Filtering.......................................164
Preprocessing of the Spectra Causing the Loss of Information..............168
Results with the Curated Dataset..........................................173
Results with the ABRF LTQ Dataset.........................................175
General Search Configuration..............................................176
Initial Test of the Zhang Spectra in a Reference Library..................178
ROC Analysis of the xRef and xHybrid Results..............................180
Discussion................................................................184
Fundamental Studies of Theoretical Spectral Library
Using Standard Protein Dataset............................................186
Construction of the Theoretical Spectral Library..........................187
X!Hunter Search Agent.....................................................190
Expectation Scoring by X!Hunter...........................................194
Analyses of Spectrum-to-Spectrum Search Results
Using ABRF Dataset........................................................196
The Spectral Libraries and Search Parameters..............................197
Comparisons of Spectral Library Search Results............................197


5.4.3 Venn Analyses.........................................................201
5.4.4 Score Distribution....................................................205
5.4.5 ROC Analysis..........................................................215
5.4.6 Use of Consensus to Validate Assignments
Despite Poor Discrimination...........................................226
5.4.7 Consensus Analysis of the ABRF Dataset................................230
6. A Theoretical MS/MS Library for Spectrum-to-Spectrum Searching
in Large-Scale Identification of Proteins.............................234
6.1 Analysis of the Large Scale Dataset...................................246
6.1.1 Classify the Search Result by High Mass Accuracy......................247
6.1.2 Score Distribution....................................................254
6.1.3 Consensus Analysis of the Phoebe3 Dataset.............................268
6.2 Conclusion and Future Work............................................272
Bibliography................................................................278


LIST OF FIGURES
Figure
1.1 Tandem Processing of an Ion in the Mass Spectrometer
for MS/MS Analysis...................................................5
1.2 A Contour Plot of MS Data, Where Each Dark Area Represents
a Peptide Ion.........................................................9
1.3 A Sample MS Spectrum, Where the Grey Arrow Shows an Ion that is
Targeted for Fragmentation in Figure 1.4.............................10
1.4 A Sample MS/MS Spectrum of the 1039 Ion in Figure 1.3, Showing the
Fragment Ions Produced and the Amino Acid Increments.................10
1.5 Nomenclature of Fragment Ions Generated by Breaking the Bonds
in a Peptide Backbone, Where R represents the side-chains
of the Amino Acids...................................................13
1.6 Possible b and y Ions for Peptide DLSLKEIQK..........................15
1.7 Simplified Representation of an MS/MS Spectrum from Peptide
DLSLKEIQK............................................................15
1.8 XCorr Distribution for PLC Dataset Reveals Poor Discrimination.......24
1.9 Mascot Mowse Scores Charge +1 Distribution...........................29
1.10 Mascot Mowse Scores Charge +2 Distribution...........................30
1.11 Mascot Mowse Scores Charge +3 Distribution...........................30
1.12 A Separated Target-Decoy Search at a Score
Above the Unequivocal Score Threshold................................34
1.13 A Separated Target-Decoy Search at a Score
Below the Unequivocal Threshold......................................34


1.14 A Concatenated Target-Decoy Search
with the Same False Discovery Rate Threshold as Figure 1.13.................34
2.1 Various Coverage for Experimental Spectra...................................43
2.2 Simple ROC Plot.............................................................52
2.3 ppm-Score Plot Examples.....................................................59
2.4 Spectra Distribution for PLC Dataset by Charge State.......................66
2.5 Spectra Distribution for PLC Dataset by MH+................................67
2.6 Spectra Distribution for PLC Dataset by MH+ and Charge State.............67
2.7 Spectra Distribution for PLC Dataset by m/z................................68
2.8 Spectra Distribution for PLC Dataset by m/z and Charge State...............68
2.9 Spectra Distribution for ABRF Dataset by Charge State.....................70
2.10 Spectra Distribution for ABRF Dataset by MH+..............................71
2.11 Spectra Distribution for ABRF Dataset by MH+ and Charge State.............71
2.12 Spectra Distribution for ABRF Dataset by m/z..............................72
2.13 Spectra Distribution for ABRF Dataset by m/z and Charge State.............72
2.14 Spectra Distribution for Phoebe3 Dataset by Charge State....................74
2.15 Spectra Distribution for Phoebe3 Dataset by MH+.............................75
2.16 Spectra Distribution for Phoebe3 Dataset by MH+ and Charge State............75
2.17 Spectra Distribution for Phoebe3 Dataset by m/z.............................76
2.18 Spectra Distribution for Phoebe3 Dataset by m/z and Charge State...........76
2.19 The Observed Score Distribution of the Separated Target-Decoy Search.......83


2.20 The Observed Score Distribution
of the Concatenated Target-Decoy Search..................................83
2.21 The Estimated Score Distribution of the Separated Target-Decoy Search...84
2.22 The Estimated Score Distribution
of the Concatenated Target-Decoy Search..................................84
2.23 The Expected Score Distribution Generated by
the Target Database Only Search Using the ABRF Proteins
to Classify Correct and Incorrect Hits...................................85
2.24 The Correlation Between the FDRs of the Separated and Concatenated
Target-Decoy Searches....................................................86
2.25 The Correlation Between the FDRs of the Target Only and Concatenated
Target-Decoy Searches....................................................87
2.26 The Correlation Between the FDRs of the Target Only and Separated
Target-Decoy Searches....................................................88
3.1 The Effect of Database Size on the Score Distribution.....................91
3.2 A Conventional Workflow...................................................93
3.3 A Modified Workflow for PC-DB.............................................94
3.4 The Internal Missed Cleavage Sites of a Very Short Protein
(IPI00032221.1)........................................................ 103
3.5 Internal or Endopeptidase Type Cleavages.................................103
3.6 Exopeptidase Type Cleavages..............................................103
3.7 Discrimination Between Correct vs. Incorrect Assignments
for Sequest Searches....................................................120
3.8 Discrimination Between Correct vs. Incorrect Assignments
for Mascot Searches.....................................................120
3.9 Score Distribution for Sequest Using Inverted Database...................121


3.10 Score Distribution for Mascot Using Inverted Database..............121
3.11 ROC Curves for Mascot Using PLC Dataset...............................................126
3.12 ROC Curves for Sequest Using PLC Dataset...............................................127
3.13 ROC Curves for Mascot Charge 1 Result Using PLC Dataset..............128
3.14 ROC Curves for Mascot Charge 2 Result Using PLC Dataset..............129
3.15 ROC Curves for Mascot Charge 3 Result Using PLC Dataset..............130
3.16 ROC Curves for Sequest Charge 1 Result Using PLC Dataset.............131
3.17 ROC Curves for Sequest Charge 2 Result Using PLC Dataset.............132
3.18 ROC Curves for Sequest Charge 3 Result Using PLC Dataset.............133
3.19 ROC Curves for Mascot Using PLC Dataset with NOE Database............135
3.20 ROC Curves for Sequest Using PLC Dataset with NOE Database...........136
3.21 ROC Curves for Mascot Charge 1 Result
Using PLC Dataset with NOE Database.................................137
3.22 ROC Curves for Mascot Charge 2 Result
Using PLC Dataset with NOE Database.................................138
3.23 ROC Curves for Mascot Charge 3 Result
Using PLC Dataset with NOE Database.................................139
3.24 ROC Curves for Sequest Charge 1 Result
Using PLC Dataset with NOE Database.................................140
3.25 ROC Curves for Sequest Charge 2 Result
Using PLC Dataset with NOE Database.................................141
3.26 ROC Curves for Sequest Charge 3 Result
Using PLC Dataset with NOE Database.................................142
4.1 Spectrum for Peptide FKDLGQENFK.....................................156


4.2 Spectrum for Peptide FKDLGEENFK..........................................157
4.3 The Range from 815 to 838 Da Taken from Figure 4.1........................157
4.4 The Range from 815 to 838 Da Taken from Figure 4.2........................157
4.5 Comparison of the Score Distribution of the Regular and
Simplified DTAs..........................................................167
4.6 Comparison of the Score Distribution of the Regular and Simplified DTAs
Using the Inverted Database..............................................168
4.7 Illustration of Construction of the Hybrid Spectral Library...............179
4.8 ROC Curves for Original and Hybrid Reference Library Searches..........183
4.9 ROC Curves for Original and Hybrid Reference Library Searches
Using True Positive Count................................................184
5.1 The Standard Workflow for X!Hunter Search.................................191
5.2 The Modified Workflows for the Partitioned X!Hunter Searches..............192
5.3 Score Distributions for Different Initial Expectation Value Settings.....196
5.4 Overlap of xRef and xTS Search Results Using ABRF Dataset
Showing Similarity at the Experimental Spectra Level.....................204
5.5 Overlap of xRef and xTS Search Results Using ABRF Dataset
Showing Similarity at the Unique Spectra Level..........................204
5.6 Overlap of xRef and xTS Search Results Using ABRF Dataset
Showing Similarity at the Unique Sequence Level.........................205
5.7 Score Distributions of X!Hunter Searches for the ABRF Dataset.............208
5.8 Score Distributions of BiblioSpec Searches for the ABRF Dataset...........209
5.9 Score Distributions of X!Hunter Charge 1 Searches
for the ABRF Dataset.....................................................210


5.10 Score Distributions of X!Hunter Charge 2 Searches
for the ABRF Dataset........................................................211
5.11 Score Distributions of X!Hunter Charge 3 Searches
for the ABRF Dataset........................................................212
5.12 Score Distributions of BiblioSpec Charge 1 Searches
for the ABRF Dataset........................................................213
5.13 Score Distributions of BiblioSpec Charge 2 Searches
for the ABRF Dataset........................................................214
5.14 Score Distributions of BiblioSpec Charge 3 Searches
for the ABRF Dataset........................................................215
5.15 ROC Curves of Spectral Library Searches for the ABRF Dataset................218
5.16 ROC Curves of Spectral Library Searches for the ABRF Dataset
Using the True Positive Count...............................................219
5.17 ROC Curves of Spectral Library Charge 1 Searches
for the ABRF Dataset........................................................220
5.18 ROC Curves of Spectral Library Charge 1 Searches for the ABRF Dataset
Using the True Positive Count...............................................221
5.19 ROC Curves of Spectral Library Charge 2 Searches
for the ABRF Dataset........................................................222
5.20 ROC Curves of Spectral Library Charge 2 Searches for the ABRF Dataset
Using the True Positive Count...............................................223
5.21 ROC Curves of Spectral Library Charge 3 Searches
for the ABRF Dataset........................................................224
5.22 ROC Curves of Spectral Library Charge 3 Searches for the ABRF Dataset
Using the True Positive Count...............................................225
5.23 Overlap of X!Hunter and Mascot Search Results Using the ABRF Dataset
and Showing Similarity at the Experimental Spectra Level...................228


5.24 Overlap of X!Hunter and Mascot Search Results Using the ABRF Dataset
and Showing Similarity at the Unique Spectra Level.....................229
5.25 Overlap of X!Hunter and Mascot Search Results Using the ABRF Dataset
and Showing Similarity at the Unique Sequence Level....................230
6.1 Using High Mass Accuracy of the Peptide Ion (ppm on y axis)
to Scores of the Correct (red) and Incorrect (blue) Assignments of
X!Hunter Reference Library Search of the ABRF Dataset...........237
6.2 Using Mass Accuracy to Scores of Correct (red) and Incorrect (blue)
Assignments for the X!Hunter Theoretical Spectral Library Search of
the ABRF Dataset.......................................................238
6.3 Using Mass Accuracy of the Parent Ion to Scores of Correct (red) and
Incorrect (blue) Assignments of BiblioSpec Theoretical Spectral Library
Search for ABRF Dataset with 20 Peaks...........................239
6.4 Using Mass Accuracy of the Parent Ion to Scores of Correct (red) and
Incorrect (blue) Assignments of BiblioSpec Theoretical Spectral Library
Search for the ABRF Dataset with 50 Peaks.......................240
6.5 Using Mass Accuracy of the Parent Ion to Scores of Correct (red) and
Incorrect (blue) Assignments of BiblioSpec Theoretical Spectral Library
Search for the ABRF Dataset with 100 Peaks......................241
6.6 Using Mass Accuracy of the Parent Ion (ppm on x axis) to Evaluate
the X!Hunter Reference Library Search for the ABRF Dataset.............242
6.7 Using Mass Accuracy of the Parent Ion to Evaluate
the X!Hunter Theoretical Spectral Library Search for the ABRF Dataset..243
6.8 Using Mass Accuracy of the Parent Ion to Evaluate the BiblioSpec
Theoretical Spectral Library Search for the ABRF Dataset with 20 Peaks.244
6.9 Using Mass Accuracy of the Parent Ion to Evaluate the BiblioSpec
Theoretical Spectral Library Search for the ABRF Dataset with 50 Peaks.245
6.10 Using Mass Accuracy of the Parent Ion to Evaluate the BiblioSpec
Theoretical Spectral Library Search for ABRF Dataset with 100 Peaks....246


6.11 Using Mass Accuracy to Evaluate the X!Hunter Reference Library Search
for the Phoebe3 Dataset..................................................249
6.12 Using Mass Accuracy to Evaluate the X!Hunter Theoretical Spectral
Library Search for the Phoebe3 Dataset..................................250
6.13 Using Mass Accuracy to Evaluate the BiblioSpec Theoretical Spectral
Library Search for the Phoebe3 Dataset with 100 peaks...................251
6.14 Using Mass Accuracy to Evaluate the X!Hunter Reference Library Search
for the Phoebe3 Dataset..................................................252
6.15 Using Mass Accuracy to Evaluate the X!Hunter Theoretical Spectral
Library Search for the Phoebe3 Dataset...................................253
6.16 Using Mass Accuracy to Evaluate the BiblioSpec Theoretical Spectral
Library Search for the Phoebe3 Dataset with 100 Peaks...................254
6.17 Score Distribution of X!Hunter Reference Library Search for
the Phoebe3 Dataset......................................................257
6.18 Score Distribution of X!Hunter Reference Library
Charge 1 Search for the Phoebe3 Dataset.................................258
6.19 Score Distribution of X!Hunter Reference Library
Charge 2 Search for the Phoebe3 Dataset.................................259
6.20 Score Distribution of X!Hunter Reference Library
Charge 3 Search for the Phoebe3 Dataset.................................260
6.21 Score Distribution of X!Hunter Theoretical Spectral Library Search for
the Phoebe3 Dataset.....................................................261
6.22 Score Distribution of X!Hunter Theoretical Spectral Library
Charge 1 Search for the Phoebe3 Dataset.................................262
6.23 Score Distribution of X!Hunter Theoretical Spectral Library
Charge 2 Search for the Phoebe3 Dataset.................................263
6.24 Score Distribution of X!Hunter Theoretical Spectral Library
Charge 3 Search for the Phoebe3 Dataset.................................264


6.25 Score Distribution of BiblioSpec Theoretical Spectral Library Search for
the Phoebe3 Dataset with 100 Peaks.....................................265
6.26 Score Distribution of BiblioSpec Theoretical Spectral Library
Charge 1 Search for the Phoebe3 Dataset with 100 Peaks.................266
6.27 Score Distribution of BiblioSpec Theoretical Spectral Library
Charge 2 Search for the Phoebe3 Dataset with 100 Peaks..................267
6.28 Score Distribution of BiblioSpec Theoretical Spectral Library
Charge 3 Search for the Phoebe3 Dataset with 100 Peaks.................268


LIST OF TABLES
Table
1.1 Comparison of results with different search programs......................25
2.1 Formal definition of a confusion matrix...................................38
2.2 The formal definition of a confusion matrix for MS-based proteomics.......45
2.3 A confusion matrix for a fully curated dataset............................46
2.4 A confusion matrix for a standard dataset.................................48
2.5 Sizes of datasets.........................................................63
3.1 Summary of the missed cleavage rules.....................................105
3.2 Regular expressions specifying missed cleavage rules.....................108
3.3 SCX rules................................................................110
3.4 Summary of protein and peptide centric databases used in this study.....114
3.5 Results of searches against protein and PC-DBs..........................115
3.6 Searching result comparison..............................................116
3.7 Summary of changes in protein profiles for the peptide assignments
from searches shown in Table 3.5.......................................124
3.8 Confusion matrix and performance metrics for PLC dataset.................125
3.9 Confusion matrix and performance metrics for PLC dataset.................134
4.1 Summary of spectral libraries............................................154
4.2 Differences between X!Hunter and BiblioSpec..............................161


4.3 Initial search algorithm tests............................................164
4.4 Preprocessing filters in X!Hunter.........................................172
4.5 Effects of X!Hunter preprocessing filters
on the manually curated PLC dataset.....................................174
4.6 Effects of X!Hunter preprocessing filters on the ABRF dataset.............176
5.1 Sizes of the partitioned theoretical spectral libraries...................190
5.2 Summary of all searches for ABRF..........................................201
5.3 AUC for different searches................................................226
5.4 Searches of the ABRF dataset using Mascot with and without consensus
with Sequest or X!Hunter.................................................233
6.1 Search results of a large scale dataset filtered by mass accuracy of
the parent ion, the peptide length, missed cleavage rules, and charge...271


1. Introduction
Protein profiling, a central problem in molecular biology, involves the identification
and characterization of the thousands of proteins found in a cell extract. A key
computational problem central to this task is the interpretation of the data produced
by mass spectrometry (MS) instruments in light of pre-existing databases of genomic
sequences. The current state of the art involves a two-step process (called MS/MS)
which produces thousands of spectra of dissociated protein fragments for each sample;
the programmatic problem is to use this information to identify the best matching
protein sequences from a database. Several studies, including additional work
presented here, have shown that major problems with protein profiling are due to the
search programs. These problems include overall low discrimination between proteins
in the mixture and similar distractors in the databases, reduced sensitivity with the
large search space inherent in mammalian databases, and extremely long computing
times.
This thesis will make five significant contributions in this important area: (1)
clarifying and clearly defining the performance measures used to evaluate alternative
search strategies; (2) developing novel algorithms that address the large search space
issue, making mammalian databases more tractable; (3) designing a method to test the
correctness of the spectral library search applications; (4) using theoretical spectra to


evaluate and render comparable the most widely used spectral library search
applications; and (5) devising and testing a new consensus method to improve the
search sensitivity.
Many of these issues arise because the search programs use a protein sequence
database as the input against which the MS/MS must be matched. This requires that
the search program have three functions: (1) conversion of the protein sequence into a
form that the MS/MS data can be mapped against or conversion of the MS/MS data
into sequence information for mapping against the protein sequences, (2) a scoring
method to evaluate the thousands of possible mappings (to link a specific MS/MS to a
specific peptide sequence), and (3) reassembly of the best mapped peptide sequence
into the protein profiles. Previous work by Karen Meyer-Arendt has shown that the
best solution for the problems associated with the third function was to shift to a
peptide-centric approach. This thesis extends this peptide-centric approach to the
whole search system, developing new methods for the first function in order to
manipulate the search space and improve results, as well as developing principled
methods for evaluating the scoring methods and overall performance of the system. In
addition, this thesis also tests an alternative mapping approach, in which peptide
sequences are converted to theoretical spectra that can be matched directly against
the observed MS/MS spectra. This direct mapping improves computing speed, and the
search speed can be increased further by drawing on the experience gained in search
space manipulation. Furthermore, combining this faster search method with a
traditional search approach is shown to improve sensitivity.
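As a concrete illustration of converting a peptide sequence into a theoretical spectrum, the sketch below generates the singly charged b- and y-ion m/z values for a peptide (the fragmentation nomenclature is introduced in Section 1.2). This is a hedged, minimal example, not the thesis's actual library-construction code; the residue-mass table and function name are my own.

```python
# Minimal sketch: singly charged b- and y-ion m/z values for a peptide,
# the core computation behind a theoretical MS/MS spectrum.

# Monoisotopic residue masses (Da) for the amino acids used below.
RESIDUE_MASS = {
    "A": 71.03711, "D": 115.02694, "E": 129.04259, "G": 57.02146,
    "I": 113.08406, "K": 128.09496, "L": 113.08406, "N": 114.04293,
    "Q": 128.05858, "S": 87.03203, "F": 147.06841,
}
PROTON = 1.00728   # proton mass (Da)
WATER = 18.01056   # H2O, carried by the C-terminal fragment

def theoretical_by_ions(peptide):
    """Return lists of singly charged b- and y-ion m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses) + WATER          # neutral peptide mass
    b_ions, y_ions = [], []
    prefix = 0.0
    for m in masses[:-1]:                # fragments exclude the intact peptide
        prefix += m
        b_ions.append(prefix + PROTON)           # b: N-terminal fragment
        y_ions.append(total - prefix + PROTON)   # y: complementary C-terminal
    return b_ions, y_ions

# Peptide from Figure 1.6; b1 is the D residue plus a proton (~116.03 m/z).
b, y = theoretical_by_ions("DLSLKEIQK")
```

Note the complementarity used throughout fragmentation analysis: each b/y pair sums to the neutral peptide mass plus two protons.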
The remainder of Chapter 1 introduces the domain knowledge that is essential to
mass spectrometry (MS) based proteomics. Chapter 2 describes the performance
measures utilized in this thesis to evaluate novel search strategies, outlines the issues
of utilizing these measures in MS-based proteomics, and provides my solutions to
these issues. The proteomics datasets used throughout this thesis as evaluation
standards are also described, along with the reasoning for using those
particular datasets. Chapter 3 presents the results of novel methods to manipulate the
search space in MS-based proteomics. A set of algorithms is developed to address this
issue by manipulating the database in a peptide-centric manner, using the chemical
properties uncovered from a large dataset. The achievement is improved
sensitivity and confidence in the search results obtained by manipulating the search
space, as well as development of computational tools to generate and utilize the
subset databases that were required to manipulate the search space. Chapter 4
introduces the spectral library search, which is a new, computationally faster method
to map MS/MS spectra to peptide sequences. This approach
required generation of a library of theoretically generated MS/MS spectra that is ten
times larger than any previous spectral library used in this way. The large size
required adaptation of the use of subset databases, and drew on the experience and
methods developed in Chapter 3. Importantly, the use of theoretical spectra allows the
first evaluation of this approach using principled performance measures. Chapter 5
presents applications developed to allow a fair comparison with a traditional search of
the protein database, and shows that this new approach provides comparable
sensitivity and discrimination, using a standard protein dataset. Chapter 6 shows the
application of the programs from Chapter 5 to a complex protein sample, and a new
consensus method that takes advantage of the new search approach.
1.1 Generation of MS and MS/MS Spectra in
Proteomic Profiling
This section expands on the methods used to analyze the peptides produced from
digesting a complex sample, illustrating how the mapping from sequence to spectra
and vice versa is accomplished.
MS and MS/MS data are collected by applying the digest to a reverse phase liquid
chromatography column directly coupled to the mass spectrometer (LC/MS). Peptides
are eluted from the column by an organic solvent, in order of increasing
hydrophobicity. As the peptides elute, they are transferred to the gas phase as they
enter the mass spectrometer, and data are collected on the peptide masses. Then the
mass spectrometer manipulates the ions to isolate the peptide ions for fragmentation
to generate the MS/MS spectra. This process is called MS/MS in reference to the two
stage operating mode used in the fragmentation and is illustrated for an ion trap mass
spectrometer in Figure 1.1. In MS/MS, a population of the selected peptide ion (also
referred to as the parent ion) is isolated electronically in the MS, and fragmented by
collision-induced dissociation (CID). In CID, the ions are accelerated in the presence
of an inert gas. The collision of the peptide ions with the gas breaks molecular bonds
and produces fragmentation ions which are scanned by the mass analyzer and
recorded by the detector. There are other dissociation methods, but CID is currently
the only one used in proteomics profiling.
Figure 1.1 Tandem Processing of an Ion in the Mass Spectrometer for MS/MS
Analysis (Injection/Trapping, Isolation, Fragmentation, MS/MS Spectrum)
The MS and MS/MS data are written to a raw file in a proprietary format. The
important features of an MS instrument raw file are the mass to charge ratio of the
peptide parent ions and the fragment ions, and the number of ions detected, along
with the time stamp for each scan. An LC/MS analysis consists of thousands of MS
and MS/MS scans generating individual spectra that must be extracted for further
study. Software is available that extracts out only the MS scans, for visualization as a
contour plot, or that combines the information on the peptide mass and the MS/MS
spectrum to generate peak list (DTA) files, text-format representations of spectra
that are used by search programs to identify the peptide sequence (discussed later).
Figure 1.2 shows the contour plot representation of the MS data from a single LC/MS
analysis. In this plot, each spot represents a peptide, whose darkness indicates the
abundance of the peptide over the elution time (the x-axis is the retention time of a
peptide from the reverse phase column, while the y-axis is the mass to charge ratio of
the ion detected by the mass spectrometer). This contour plot is manually annotated.
Each red or blue spot is a group of MS/MS scans for a specific peptide; the blue spots
with labels indicate the search program was successful in identifying the peptide
sequence. The labels are gene names of the proteins that gave rise to each peptide
after digestion, along with a peptide indicator. The overall contour plot represents the
whole population of peptides detected as ions in this LC/MS experiment. A vertical
slice of this data will look like Figure 1.3 where each peak represents a peptide ion;
this is referred to as an MS spectrum. Each peak in an MS spectrum represents a
peptide ion (the charges of the major ions are shown as z = 1,2, etc.). For example, in
Figure 1.3, the peak at m/z 1,039.528 Da (gray arrow) is a doubly charged ion that
can be mapped to peptide sequence MVNPTVFFDIAVDGEPLGR, a peptide derived
from the human Cyclosporin A-binding protein. Figure 1.4 is the MS/MS spectrum of
that peak fragmented in an MS/MS scan. Each peak in an MS/MS spectrum
represents a fragment ion generated by CID of the input peptide ion. The mass
difference between two adjacent peaks corresponds to an amino acid residue, which
helps to map a spectrum to a peptide sequence. In this case, the mapping reads the
reverse of NPTVFFDIAVD in the peptide.
It is clear from Figure 1.3 that there are a large number of peptide ions in a single
analysis set; in fact, an MS instrument collects up to 2.5 MS/MS spectra per second; a
single analysis typically requires data collection for two hours and results in 10,000-
15,000 MS/MS spectra. Samples are usually fractionated, so that datasets have
multiple analyses, with millions of spectra to be searched. The underlying statistical
theory for these complex datasets is not well understood, making it difficult to
convert between sequence and MS/MS spectral information (described in the next
section). The result is that search programs work best for about 55-65% of the
experimental spectra where this process is straightforward. Borderline cases lead to
incorrect assignments scoring better than correct assignments. Furthermore, the size
of the database becomes problematic. Thus, when comparing methods involving
search strategies in this thesis, it is important to distinguish the search space issue
from the scoring issue, as well as ensure that the distribution of easy and borderline
cases in the experimental dataset is not changed by preprocessing or scoring methods
(the application of performance measures for classifiers to this issue is discussed in
Chapter 2).
The large size of proteomics datasets also presents computational challenges in
managing the searches, which are time-consuming, with searches of full datasets
requiring several days of computing time. The direct
mapping of experimental spectra to previously observed spectra has been proposed as
a way to improve computational speed, but currently suffers from lack of reference
spectra and the fact that current libraries are dominated by the easy to identify cases.
In Chapter 4, I show that the theoretical spectra calculated by the Zhang
MassAnalyzer MS/MS modeling software can be used by a spectra/spectra matching
program. In addition, those results indicate that optimizing the method for assessing
the similarity between spectra would improve discrimination.
Figure 1.2 A Contour Plot of MS Data, Where Each Dark Area Represents a
Peptide Ion
Figure 1.3 A Sample MS Spectrum, Where the Grey Arrow Shows an Ion
that is Targeted for Fragmentation in Figure 1.4

Figure 1.4 A Sample MS/MS Spectrum of the 1039 Ion in Figure 1.3,
Showing the Fragment Ions Produced and the Amino Acid
Increments (Indicated by Single Letter Code)
1.2 Peptide Fragmentation
This section describes the concepts involved in mapping the MS/MS spectral
information onto the protein sequence, as illustrated in Figure 1.4. The specific
fragment ions generated in an MS/MS spectrum have m/z values dependent on the
sequence of the peptide that was selected for fragmentation. In an ideal case, mass
differences between the ions in the MS/MS spectrum can be used to read the
sequence, as shown in Figure 1.4. In that spectrum, a part of the peptide sequence is
shown mapped onto the spectrum, illustrating the relationship between the peptide
sequence and the fragment ions in the MS/MS spectrum. This exemplifies the
essential programmatic problem that is required for automated analysis of MS/MS
spectra by search programs, the conversion of the information in the MS/MS
spectrum into a form that can be mapped onto a sequence (or vice versa). In this
section, the principles behind the mapping of the information in the MS/MS spectrum
onto a peptide sequence will be discussed.
Before explaining the various search programs that have been developed to carry out
this mapping, the basic concept of peptide fragmentation in a mass spectrometer is
described and how the mapping from the fragment ions to the sequence is carried out.
Understanding these principles is important to this thesis, because accounting for the
complexity of the chemistry has been a major block to analysis of MS/MS spectra.
Search programs are most useful with those peptides that have relatively simple gas
phase chemistry. However, about 40% of peptides are estimated to have gas phase
chemistry that is unusual enough to cause problems with the search program scoring
methods. This thesis will exploit new methods of predicting the gas phase chemistry,
in order to capture a much wider range of data.
Figure 1.5 shows a peptide which contains 4 amino acids (each side chain R denotes an
amino acid), using the nomenclature first proposed by Biemann (1990). An amino acid
or peptide contains two ends that are chemically different: the N terminus (left hand
side), which is a primary amine (except for proline, which is a secondary amine), and
the C terminus (right hand side), which is a carboxylic acid.
Where amino acids join together to make the peptide, the two ends of the adjacent
amino acids are combined to make a peptide bond. In a mass spectrometer, the peptide
bonds normally break most easily, producing two types of fragments: the b ion, the
fragment extending to the left of the fragmentation site, and the y ion, the one
extending to the right.
Figure 1.5 Nomenclature of Fragment Ions (a, b, c, x, y, and z) Generated by
Breaking the Bonds in a Peptide Backbone, Where R Represents
the Side-Chains of the Amino Acids
Also, other bonds can fragment; if the bond between the alpha carbon of an amino
acid and the carbonyl is cleaved, a so-called a ion is produced. Other cleavages (x,
z, and c ions) are shown in Figure 1.5, but these are almost never observed in the MS
instruments used for proteomics profiling. Secondary losses of water or ammonia are
often observed, and a second cleavage of the peptide backbone can produce internal
fragment ions.
Figure 1.6 illustrates a peptide fragmentation example which consists of eight b ions
and eight y ions. These are numbered in order of the peptide bond that was
fragmented; b ions count from left to right, while the order of the y ions is
reversed. Peaks have different intensities because the bonds vary in the ease with
which they cleave. This example shows all the possible fragments, but usually not all the
possible ions are generated, or if generated, may not be detected by the mass
spectrometer. Based on mass differences between ions and experience with the
chemistry, a manual de novo analysis can be done. For simplicity, I will first assume
that we know which peaks are b or y ions, shown respectively in red or blue (Figure 1.7). In this case,
the sequence is directly read off by determining the m/z differences between each
series of ions. For instance, the m/z distance in Daltons between y7 and y6 is the
residue mass of tyrosine (Y, 163 Da), between y6 and y5 it is glutamate (E, 129 Da), and
between y5 and y4 it is valine (V, 99 Da). So, we know that the unknown peptide contains a
sequence tag VEY. However, the sequence direction may be reversed, so the
unknown peptide may contain a YEV, instead. Fortunately, the direction can usually
be determined because y ions have a C-terminal OH, which shifts their mass.
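The manual reading of residues from m/z spacings described above can be sketched programmatically. The following is an illustration only (not software from this thesis); the function name, the tolerance, and the use of monoisotopic residue masses are my assumptions, and a real implementation must also consider the b-ion direction:

```python
# Illustrative sketch: map each spacing between consecutive y ions (largest
# first) to an amino acid residue. Residue masses are monoisotopic; '?' marks
# a spacing that matches no single residue within the tolerance.

RESIDUE_MASS = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'L': 113.08406, 'I': 113.08406,
    'N': 114.04293, 'D': 115.02694, 'Q': 128.05858, 'K': 128.09496,
    'E': 129.04259, 'M': 131.04049, 'H': 137.05891, 'F': 147.06841,
    'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}

def tag_from_y_ions(y_mz, tol=0.02):
    """Read a sequence tag from the spacings of a descending list of y ions."""
    tag = []
    for hi, lo in zip(y_mz, y_mz[1:]):
        delta = hi - lo
        hits = [aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) < tol]
        tag.append(hits[0] if hits else '?')
    return ''.join(tag)
```

With y-ion m/z values spaced by 163.06, 129.04, and 99.07 Da, the sketch returns the residues Y, E, and V; as the text notes, the read direction must still be resolved, for instance via the C-terminal mass shift of the y-ion series.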
Figure 1.6 Possible b and y Ions for Peptide DLSLKEIQK (Represented by
the Single Letter Code)

Figure 1.7 Simplified Representation of an MS/MS Spectrum from Peptide
DLSLKEIQK
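The b and y ladders of Figure 1.6 can be computed directly from the sequence. A minimal sketch, assuming singly charged, monoisotopic fragment ions (the function name is mine, not from this thesis):

```python
# Sketch: enumerate singly charged b and y fragment ions for a peptide.
# b_i is the first i residues plus a proton; y_i is the last i residues
# plus water (the C-terminal OH plus H) and a proton.

PROTON = 1.00728   # mass of a proton (H+), Da
WATER = 18.01056   # mass of H2O, Da

RESIDUE_MASS = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'L': 113.08406, 'I': 113.08406,
    'N': 114.04293, 'D': 115.02694, 'Q': 128.05858, 'K': 128.09496,
    'E': 129.04259, 'M': 131.04049, 'H': 137.05891, 'F': 147.06841,
    'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}

def fragment_ions(peptide):
    """Return the singly charged b and y ion m/z ladders for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b, y
```

For DLSLKEIQK this yields the eight b and eight y ions of Figure 1.6; the extra water in each y ion is the mass shift that distinguishes the two series.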
Note that the intensities of the ions in Figure 1.7 are different from each other. This is
because each cleavage is a separate chemical reaction, driven by the presence of
protons near the peptide bond. Protons move around the peptide depending upon the
basic nature (or proton affinity) of the various sites. If a peptide bond is very basic, it
will attract protons, and cleave more readily, producing more ion intensity in the
fragment ion products of that cleavage. This is referred to as the mobile proton
hypothesis. If a peptide contains several arginine residues, which bind protons very
tightly, the cleavage of sites in that peptide are restricted to sites that have a resident
proton (for instance, adjacent to an acidic group). These are referred to as the non-
mobile proton cases, and present special challenges for search programs, because
their cleavage products are not randomly generated along the sequence.
1.3 Meaning of the Term Mass or m/z
Because mass spectrometers actually detect the mass to charge ratio of analytes rather
than the masses, it is important to clarify the meaning of the various representations
of mass and mass to charge used in this field. The unit of the x axis in Figure 1.3 or of
the y axis in Figure 1.2 is the Dalton, which can be used for charged or uncharged
molecules. The uncharged mass of a peptide is called the peptide mass, denoted
M. The peptide is charged by adding protons (H+), with the maximum number
determined by the number of sites that can accept the protons. Thus, the peptide may
receive multiple charges; the charged form is referred to as the parent ion mass,
denoted [M+zH]z+, or (m+z)/z (usually shortened to m/z), where z is the number
of charges. (It is referred to as a parent ion, because it is the parent of the MS/MS.)
The relationships between these mass representations are listed in equations 1.1 and
1.2. Note that (m+z)/z is often referred to as m/z, as it is in Figures 1.2 and 1.3.
Although the parent ion mass is a general form, the special form where z equals 1 is
used very often and is called the peptide ion mass, denoted [M+H]+ or MH+.
Different forms are used interchangeably, and it is important to know which form is
being reported. For example, the charged parent ion mass is listed on the top of an
MS/MS spectrum (Figure 1.4). On the other hand, in the header of the DTA text file
that is a popular format to store a spectrum, the peptide ion mass or MH+ is reported.
[M + zH]z+ = M + z x proton                                  (1.1)

m/z = (m + z)/z = (MH+ - proton)/z + proton                  (1.2)

where proton is the mass of a proton (H+) and z is the charge.
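Equations 1.1 and 1.2 translate directly into small conversion helpers. A sketch (the function names are mine; the monoisotopic proton mass is assumed):

```python
# Conversions among the mass representations of equations 1.1 and 1.2.

PROTON = 1.00728  # monoisotopic mass of a proton (H+), Da

def parent_mz(M, z):
    """m/z of a peptide of uncharged mass M carrying z protons."""
    return (M + z * PROTON) / z

def peptide_ion_mass(M):
    """MH+, the singly protonated form reported in DTA file headers."""
    return M + PROTON

def mz_from_mh(mh, z):
    """Equation 1.2 restated from MH+: (MH+ - proton)/z + proton."""
    return (mh - PROTON) / z + PROTON
```

For the doubly charged ion at m/z 1,039.528 in Figure 1.3, the uncharged peptide mass is 2 x 1,039.528 - 2 x 1.00728, about 2,077.04 Da, and both forms of equation 1.2 agree.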
1.4 Search Programs for Mapping MS/MS Information to
Peptide Sequences
In general, a search program takes an MS/MS spectrum from an LC/MS analysis (the
experimental MS/MS spectrum), information about the peptide mass and charge, and
a protein sequence database as inputs. The search program generates an indexed list
of all possible peptide sequences from the protein sequence database and information
about the cleavage specificity of the protease used in generating the sample. From this
indexed list, the possible candidate peptide sequences are selected, typically by
matching the predicted mass and the observed mass of the peptide. Then, each search
program uses some strategy to convert either the experimental spectrum or the
candidate sequences to a common format for scoring. The search program scores the
goodness-of-fit between the candidates and the observed experimental spectrum,
ranks the candidates, and then reports back the top ranked hit.
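The candidate-selection stage shared by these programs can be outlined as follows. This is an illustrative sketch only, not the code of any actual search program; the function names, the simplified tryptic rule, and the default tolerance are my assumptions:

```python
# Outline: digest the protein database in silico, then keep peptides whose
# predicted uncharged mass matches the observed parent mass within a tolerance.
import re

PROTON = 1.00728
WATER = 18.01056

RESIDUE_MASS = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'L': 113.08406, 'I': 113.08406,
    'N': 114.04293, 'D': 115.02694, 'Q': 128.05858, 'K': 128.09496,
    'E': 129.04259, 'M': 131.04049, 'H': 137.05891, 'F': 147.06841,
    'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}

def tryptic_peptides(protein, missed=2):
    """Cleave after K or R (not before P), allowing missed cleavages."""
    pieces = [p for p in re.split(r'(?<=[KR])(?!P)', protein) if p]
    for i in range(len(pieces)):
        for j in range(i, min(i + missed + 1, len(pieces))):
            yield ''.join(pieces[i:j + 1])

def peptide_mass(peptide):
    """Uncharged monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def candidates(proteins, observed_mass, tol=2.5):
    """Peptides from the database within tol Da of the observed parent mass."""
    return sorted({pep for prot in proteins for pep in tryptic_peptides(prot)
                   if abs(peptide_mass(pep) - observed_mass) <= tol})
```

Each surviving candidate would then be converted to a common format and scored against the experimental MS/MS spectrum.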
Even with a simple example, such as shown in Figure 1.4, it is clear that the MS/MS
spectrum is very information rich, and that the information is complex. A major
complication is the fact that each fragment ion is actually multiple ions due to the
presence of molecular isotopes (the most common is 12C vs. 13C, so that each fragment
ion is actually an envelope of ions that differ by 1 Da in mass). In addition, the MS/MS may vary
from weak (revealing only a few of the possible fragment ions) to intense (revealing a
large number of ions), and there are also noise and contaminant ions present in the
spectrum. These problems complicate the peak picking and normalization required
for scoring similarity between the experimental spectrum generated in the LC/MS and
the theoretical model of the spectrum utilized by the search program. In addition,
earlier proteomic profiling experiments were commonly carried out with the LCQ
instrument, which did not measure the parent mass very accurately, so the number of
candidate peptides was enormous due to the requirement of a wide error tolerance.
Fortunately, high mass accuracy instruments (such as the LTQ/Orbitrap) are now widely
adopted; this makes the parent mass tolerance a better filter to remove irrelevant
candidates. However, the computational task of identifying the correct sequence is
still very difficult, and much of the research in this field has focused on ways to speed up
these comparisons and improve discrimination between the correct and incorrect
assignments. Several alternative methods have been developed to compare the
observed fragmentation pattern to some type of predicted fragmentation pattern. In
each case, only part of the available information is used, in order to simplify the
computational task.
The approaches used by database search algorithms can be categorized into four
different types, depending on how they handle the theoretical spectra and the scoring
of results: descriptive, interpretative, stochastic, and statistical or probability
modeling (reviewed in Sadygov et al., 2004).
1.4.1 Descriptive Models
Descriptive models calculate a theoretical spectrum for each peptide based on a set of
rules, and then calculate a similarity score between the theoretical and observed
spectra. Sequest is an example of a search program using a descriptive model of the
MS/MS spectra (Eng et al., 1994). The theoretical spectrum is very simplified; the rules
assign the same arbitrary intensity to all b and y ions (50) and their dehydrated or
deammoniated forms (15), and include only the second isotopic peaks, set at 25
(Sadygov et al., 2004). Because there are often mass inaccuracies in the fragment ion
spectra, Sequest uses a fast Fourier transform to calculate a cross-correlation score
(XCorr) between the experimental MS/MS spectrum and the model MS/MS spectra
for a subset of the top scoring candidates (using a relatively fast preliminary scoring
method). Although this was very useful for the low resolution MS instruments
available when Sequest was developed, it is less advantageous for more recent
instruments.
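The descriptive rules above can be made concrete. The following sketch applies the stated intensities (50/25/15) to singly charged fragments; it is an illustration of the rules, not Sequest's actual implementation, and the helper names are mine:

```python
# Sketch of a descriptive-model theoretical spectrum: b and y ions at 50,
# their second isotope peaks at 25, and water/ammonia loss peaks at 15.

PROTON, WATER, AMMONIA = 1.00728, 18.01056, 17.02655
ISOTOPE = 1.00335  # approximate 13C-12C mass spacing, Da

RESIDUE_MASS = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'L': 113.08406, 'I': 113.08406,
    'N': 114.04293, 'D': 115.02694, 'Q': 128.05858, 'K': 128.09496,
    'E': 129.04259, 'M': 131.04049, 'H': 137.05891, 'F': 147.06841,
    'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}

def _put(spec, mz, intensity):
    key = round(mz, 2)
    spec[key] = max(spec.get(key, 0), intensity)

def theoretical_spectrum(peptide):
    """Return an {m/z: intensity} model for singly charged fragments."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    spec = {}
    for i in range(1, len(masses)):
        for mz in (sum(masses[:i]) + PROTON,            # b ion
                   sum(masses[i:]) + WATER + PROTON):   # y ion
            _put(spec, mz, 50)
            _put(spec, mz + ISOTOPE, 25)
            _put(spec, mz - WATER, 15)
            _put(spec, mz - AMMONIA, 15)
    return spec
```

The resulting model spectrum would then be compared to the experimental spectrum, for example by Sequest's cross-correlation.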
1.4.2 Interpretative Models
In interpretative models, a peptide sequence tag of the given spectrum is
automatically determined (the sequence illustrated in Figure 1.4 is a sequence tag)
and this information is employed to match to the candidate peptide sequences by
pattern matching. Because the sequence tag can be found by b or y ions, it needs to be
considered bi-directionally. The matching of the other ions in the MS/MS spectrum is
then used to score differences between the candidates. X!Tandem (Craig and Beavis,
2004) is an example of this type of algorithm. No available search programs of this
type use the intensity information. In general, this approach is most successful for
well behaved, doubly charged parent ions, and often has difficulty when the MS/MS
has chemical noise (where fragment ions are present that are derived from other
parent ions in the isolation window, referred to as chimera spectra), or the peptide has
unusual gas phase chemistry (singly charged cases, or non-mobile proton cases where
fragmentation is restricted to a few sites, instead of producing long, continuous
strings of fragment ions).
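At its simplest, the bi-directional tag matching used by interpretative models reduces to checking a tag and its reverse against each candidate sequence. A deliberately minimal sketch (function names are mine):

```python
# Minimal sketch of bi-directional sequence tag matching: a tag read from the
# spectrum may come from b ions or y ions, so both directions must be tried.

def tag_matches(tag, candidate):
    return tag in candidate or tag[::-1] in candidate

def filter_by_tag(tag, candidates):
    """Keep only the candidate peptides consistent with the sequence tag."""
    return [c for c in candidates if tag_matches(tag, c)]
```

The remaining ions in the MS/MS spectrum would then be used to score the surviving candidates, as described above.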
1.4.3 Probability Based Models
Probability based models use a probability function to determine the likelihood of
observing the set of fragment ions in the experimental MS/MS spectrum. Mascot is an
example, and it is the most successful search algorithm available (Pappin et al., 1993).
The probability function is based on an adaptation of Mowse scoring methods
originally developed for peptide mass fingerprinting, where the probability of
observing the set of peptides is calculated based on the frequency of each peptide in
the protein database. This is normalized so that the size of the protein database does
not affect the scores. The probability functions for singly and doubly charged ions are
reasonably discriminatory, but multiple charging creates ambiguity by greatly
increasing the number of possible fragment ions (each charge form must be included
in the probability function). This ambiguity lowers the discrimination with larger,
highly charged parent ions. Intensity information is utilized indirectly in the peak
selection. In addition, a few researchers have recently attempted to utilize very simple
models of the intensity to weight the probabilities (Eddes et al., 2002; Chen et al.,
2005; Johnson et al., 2005). Methods have also utilized intensity information in statistical
or machine learning approaches for validation (Wysocki et al., 2000; Elias et al., 2004;
Havilio et al., 2003). The most interesting approach has been used by Narasimhan et
al. (2005) in their MASPIC software, which weights the probabilities according to the
m/z of the fragment ions. This is potentially very useful, because the fragment ions
that are larger than the parent ion are less ambiguously determined (this region of the
spectrum usually has only singly charged fragment ions and relatively simple
fragmentation chemistry, so that the probabilities can be more accurately determined).
1.4.4 Stochastic Models
In stochastic models, the probabilities of observing the different types of fragment
ions are determined by analysis of a large, curated dataset. For example, dehydrated
ions are less likely to be observed than their undehydrated precursors. In the
estimation process, the physical and chemical properties can be taken into account for
the scoring process. Once the probabilities of observing the individual ion types are
known, that information can be used to calculate the probability of a given observed
spectrum. SCOPE is an algorithm in this category (Bafna and Edwards, 2001). This
method is difficult to implement for broad classes of peptide ions, because each class
will have a different gas phase chemistry and thus probability of generating the
various types of ions. Thus, this approach is not currently used in proteomics
profiling studies.
1.5 Limitations of Search Programs
A major limitation of current search programs is the ability to assign peptide
sequences with high confidence. This is primarily due to the oversimplification of the
theoretical MS/MS spectrum that is compared to the observed spectrum during
scoring. An example of one of the scoring methods (Cross-correlation or XCorr score
from Sequest) is shown in Figure 1.8. This distribution was generated from Sequest
search results for doubly charged peptides, because peptides with this charge show the best
discrimination between correct and incorrect assignments in Sequest. In this case,
incorrect assignments were determined from the score distribution for a set of triply
charged ions from a larger dataset of the same sample that were arbitrarily assigned a
charge of two, so that none of these assignments could be correct (the number of
incorrect assignments was chosen so that it equaled total assignments minus correct
assignments). An alternative approach is to search against a randomized or sequence
inverted database, which would yield a curve that looked almost identical (showing a
peak extending from XCorr 0.8 to 3.0, with a few cases extending up to XCorr 3.3).
The score distribution for manually validated cases or for standard peptides extends
from XCorr 0.8 to 9.0; and this distribution overlaps significantly with the
distribution of incorrect assignments, which extends from XCorr 0.1 to 3.3 (Resing et
al., 2004). Most users set a threshold between correct vs. incorrect peptide sequence
assignments that minimizes the number of peptide sequences that are incorrect (false
positives). This leads to larger numbers of peptides that are correctly assigned but
rejected (false negatives), when their scores are below stringent thresholds. In the
case shown in Figure 1.8, the false negative rate was 45% at false discovery rate = 1%.
The low discrimination also means that incorrect peptide sequences can score higher
than the correct sequence in the low scoring region of the distribution (referred to as
distraction). Exhaustive analysis of the dataset shown in Figure 1.8 revealed that
distraction occurred in 20% of assignments when an incorrect sequence was chosen
over the correct one.
Figure 1.8 XCorr Distribution for PLC Dataset Reveals Poor Discrimination
To test if this low discrimination was a common problem of other search programs, I
compared the results obtained with the most commonly used search programs:
Sequest (Eng et al., 1994), Mascot (Pappin et al., 1993), OMSSA (Geer et al., 2004),
and X!Tandem (Craig and Beavis, 2004) (results summarized in Table 1.1).
Table 1.1 shows the total numbers of DTA files identified in a highly annotated
human shotgun "test" dataset of 1,617 MS/MS, referred to as the PLC dataset, where
it is estimated that there are 950 DTAs of tryptic, unmodified peptides. Results are
summarized as the number of those tryptic DTAs validated (also given as the
percentage of the 950 cases) and the number of false positives identified by manual
analysis. Searches were carried out using recommended default parameters for each
program (requiring fully tryptic, 2.5 Da peptide mass tolerance and 2 missed
cleavages). Search conditions were adjusted to hold the false discovery rate [FDR =
incorrect assignments/(incorrect + correct assignments)] constant at 3%. FP equals the
number of false positives identified by manual analysis. For X!Tandem and OMSSA,
the expectation score was adjusted until the self-reported FDR was 3%.
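The FDR adjustment used for Table 1.1 can be expressed as a simple threshold search over scored assignments. A sketch (the labels here stand in for the manual or decoy-based determination of incorrect hits; function names are mine):

```python
# Sketch: given (score, is_correct) pairs, compute FDR above a threshold and
# find the lowest threshold holding FDR at or below a target such as 3%.

def fdr(assignments, threshold):
    """FDR = incorrect / (incorrect + correct) among hits at/above threshold."""
    hits = [correct for score, correct in assignments if score >= threshold]
    if not hits:
        return 0.0
    return sum(1 for c in hits if not c) / len(hits)

def threshold_for_fdr(assignments, target=0.03):
    """Lowest observed score whose FDR does not exceed the target."""
    for t in sorted({score for score, _ in assignments}):
        if fdr(assignments, t) <= target:
            return t
    return None
```

Raising the threshold trades false positives for false negatives, which is exactly the trade-off discussed for Figure 1.8.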
Table 1.1 Comparison of results with different search programs

Search program             #DTAs IDed   FPs        %IDed
Sequest                    510          15         54%
Mascot                     569          18         60%
OMSSA                      583          P < 0.05   61%
X!Tandem                   513          P < 0.05   54%
IDed by all four           489          n.d.       54%
Sequest (dCn filtering)    652          11         69%
MSPlus                     735          17         77%
MAE (top Mascot hits)      816          12         86%
MAE (top 10 Mascot)        905          19         95%
Although differences were noted among the results, the most striking result was that 54%
of the MS/MS could be identified by any of the search programs, and the best among them
was only able to identify an additional 7%. Thus, the discriminations of the individual
search programs are only slightly different from each other, and all have significant
false negative rates. Recently, a similar comparison was carried out on Sequest,
Mascot, OMSSA, and X!Tandem by Balgley et al. (2007) using LTQ data; this study
revealed results similar to those shown here, with the exception that a more recent
version of OMSSA gave 64% more peptide identifications than Mascot, at an FDR of
1%.
1.5.1 Identifying MS/MS Spectra that Score Below the
Acceptance Thresholds
Methods have been devised to capture more data below the acceptance threshold. For
example, in Table 1.1, three such approaches are illustrated. Looking at the difference
between the score of the top hit and the second hit is one such method. The use of the
Sequest dCn score (the difference in XCorr between the first and second ranked hits) was
evaluated. The dCn filter was set at 0.08, using XCorr thresholds described in the
literature, and this method captured 15% more cases than using XCorr alone. The use
of consensus between two search programs (Mascot and Sequest, referred to as the
MSPlus approach) was also evaluated, using physicochemical properties to filter out
the high number of false positives that this method produces, and captured a larger
proportion of the expected data, but were still missing at least 23% of the expected
information. However, these methods clearly demonstrate that there are many cases
that yield borderline scores in the search program scoring methods but can be
captured by using information beyond the original search score.
The third approach to capturing more of the borderline cases involves rescoring of the
peptide sequences using the intensity information in the MS/MS spectrum (referred to
as MAE in Table 1.1). This is based on the concept that predicting the fragmentation
chemistry of peptides will provide estimates of fragment ion intensities and thus
should provide higher discrimination in evaluating search results. This method is used
qualitatively in validation of MS/MS peptide assignments when using manual
analysis (Wysocki et al., 2000; Wysocki et al., 2005; Paizs and Suhai, 2005), using a
simple model of chemical plausibility of the fragment ion chemistry. A programmatic
method of assessing chemical plausibility was recently developed by Z. Zhang. His
MassAnalyzer program produces a quantitative model for predicting fragment ion
intensities based on classical chemical kinetics and the mobile proton hypothesis
(Zhang, 2004; Zhang, 2005). Rate constants for peptide bond fragmentation were
modeled based on proton affinity at the peptide bond, availability of protons at the
cleavage site (proton mobility), intrinsic cleavage kinetics determined by the adjacent
amino acids, and other reactions, such as dehydration, deammoniation, C-terminal
rearrangement, and loss of CO from b ions. Shaojun Sun in our lab wrote a program
called Manual Analysis Emulator (MAE) which calculates a dot product similarity
score (Sim) in order to rescore MS/MS search results with the Zhang simulated
spectra (Sun et al., 2007). That study found excellent discrimination between correct
and incorrect assignments (discussed in detail in a later chapter), and captured 95% of
the expected hits in our test dataset (Table 1.1). Improved discrimination and
increased data capture were observed in independent experiments on larger datasets
of K562 proteins (>400,000 MS/MS, not shown). As this will be an important aspect
of the new search program, the results are discussed more fully in Chapter 4.
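A dot-product similarity of the kind MAE computes can be sketched as a cosine score over intensity vectors aligned on shared m/z bins. This illustrates the idea only; MAE's actual binning and normalization may differ, and the names here are mine:

```python
# Sketch: normalized dot product (cosine) between an observed and a predicted
# {m/z: intensity} spectrum, aligned by binning m/z values.
import math

def sim(observed, predicted, bin_width=0.5):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts."""
    def binned(spec):
        out = {}
        for mz, intensity in spec.items():
            key = round(mz / bin_width)
            out[key] = out.get(key, 0.0) + intensity
        return out
    a, b = binned(observed), binned(predicted)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Identical spectra score 1.0 and spectra with no shared bins score 0.0, giving a bounded score suitable for rescoring borderline assignments.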
Thus, our analysis shows that there is a high scoring set of MS/MS that any search
program can identify, along with a borderline scoring set that requires additional
scoring to validate. If these two sets are chemically the same, then the borderline set
can be ignored in analyses. On the other hand, if clear chemical differences between
these classes can be found, then analyses of search results must consider both sets of
data. Comparison of the score distributions can reveal the discrimination between the
correct and incorrect assignments (Keller et al., 2002; Resing et al., 2004; Frewen et
al., 2006). For example, Figures 1.9, 1.10, and 1.11 show that the score distributions
are not similar for different charge forms, as expected because the gas phase chemistry varies as a function of charge (Wysocki et al., 2005). The score
distributions show that singly and triply charged ions have more borderline cases than
the doubly charged. Manual validation reveals that those borderline cases are more
likely to be weak spectra, peptides fragmenting by the non-mobile proton cleavage
mechanism, and cases where there are multiple parent ions in the isolation window
for MS/MS (referred to as chimera MS/MS), although other reasons are possible.
Due to the existence of these problems, the performance of search programs must be
evaluated independently for each charge class present in the datasets and for weak vs.
intense spectra. In the datasets utilized in this thesis, the highest numbers of cases are
seen for the doubly and triply charged subsets; higher charged forms are not analyzed,
because they usually are not present in significant numbers. Chimera spectra are more
likely to be present in complex samples, and thus it is necessary to test the methods with datasets of both simple and complex samples, to evaluate the impact of chimera spectra on results.
Figure 1.9 Mascot Mowse Scores Charge +1 Distribution

Figure 1.10 Mascot Mowse Scores Charge +2 Distribution

Figure 1.11 Mascot Mowse Scores Charge +3 Distribution
1.6 Determining the Number of False Positives
It is clear that the determination of the number of false positives is critical in analysis
of data below the unequivocal acceptance threshold. One method is to fit a FP and TP
distribution to the score distribution. This approach is most commonly used with the
PeptideProphet software (Keller et al., 2002), but recent studies suggest that the
fitting of the TP distribution is unreliable. This is most likely because the assumption
that there is only one population in the positive set is incorrect, because there are
several overlapping distributions (chimera, non-mobile proton cases, etc., as
described in the previous section).
Another approach to estimating the false positives is to add proteins to the search
database that are not in the sample. Early studies added proteins from other species.
These added proteins are referred to as decoy proteins. The problem with this
approach is that the actual false positive cases from the right species could not be
identified, and the relative number of peptide candidates in the proteins from each
species could not be determined. Thus, the false positive rate could not be accurately
determined, and the method focused on the fact that any protein identified as the
correct species was most likely a true positive.
More often, proteomics studies make a decoy database by inverting the protein
sequences which are read from C- to N-terminus (rather than the normal N- to C-
terminus). If small peptides are excluded, the peptide sequences generated in-silico
for the candidate list will produce unique sequences that are all incorrect. Thus, any
hit against those peptides is by definition a false positive (Figures 1.12, 1.13, and
1.14). In these figures, the blank and dotted squares represent the search spaces of the
target and decoy databases, respectively. These databases can be merged as a big
database for search (Figure 1.14), or used separately (Figures 1.12 and 1.13). The
advantage of the concatenated target-decoy search is that it requires only one search.
When the score threshold is stringent, no peptide sequence in the decoy database is expected to be assigned, so all the accepted hits are true positives (denoted as T). As the threshold is lowered, the peptide sequences in the decoy database start to show up as false positives (denoted as F).
All these cases assume that the various MS/MS have an equal probability of yielding
a false positive hit within the search space input to the search program (e.g., the
peptide sequences derived from a protein database). This has never been tested in a
published study, but it seems likely that the high quality spectra are more likely to
give a false positive in the inverted database search (because these have a large
number of fragment ions, and matching even a small subset of these ions can produce
a score above threshold). The Gygi group advocates searching the concatenation of
the normal and decoy databases, instead of searching separately. This has the
advantage that the number of false candidates in each half should be equal, so that the
number of FPs is simply twice the number of hits in the inverted sequence database
(Elias and Gygi, 2007) (Figure 1.14).
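For illustration, the bookkeeping behind the concatenated target-decoy estimate can be sketched as follows; the list-of-tuples input format and function name are illustrative assumptions, not any particular search program's output.

```python
def estimate_false_positives(hits, threshold):
    """Estimate FPs and FDR from a concatenated target-decoy search.

    `hits` is a list of (score, is_decoy) tuples, one best hit per MS/MS.
    Following Elias and Gygi (2007), hits in the decoy half are assumed
    to account for half of all false positives, so FP ~= 2 * decoy hits.
    """
    accepted = [(s, d) for s, d in hits if s >= threshold]
    decoy = sum(1 for _, d in accepted if d)
    fp = 2 * decoy
    fdr = fp / len(accepted) if accepted else 0.0
    return fp, fdr
```

Lowering the threshold admits more decoy hits, driving the estimated FDR up, which is exactly the behavior described above.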
For decoy database generation, variations other than inversion have been developed, such as reshuffling and Markov-chain generation. In reshuffling, the amino acid
sequences are reshuffled, keeping the protein sizes the same. It is important in these
methods to keep the overall amino acid composition of the proteins constant. One
must also consider the size distribution for the predicted tryptic peptides, because the
trypsin targets (Lys and Arg) are not randomly distributed in the protein sequences.
A Markov chain is another method used to generate a randomized database. Finney et al. (2005) showed that the behavior of this method is similar to that of the reshuffled one, but an analysis of composition has not been published. One reason an inverted sequence database is preferred is that its composition, peptide sizes, and adjacent amino acid frequencies are nearly identical to those of a normal database.
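The two most common decoy constructions described above, inversion and reshuffling, can be sketched as follows (the sequence strings and function names are illustrative):

```python
import random

def invert_sequence(protein):
    """Decoy by reversing the sequence (read C- to N-terminus)."""
    return protein[::-1]

def reshuffle_sequence(protein, seed=0):
    """Decoy by reshuffling residues; length and overall amino acid
    composition are preserved exactly, as required above."""
    residues = list(protein)
    random.Random(seed).shuffle(residues)
    return ''.join(residues)
```

Note that inversion also preserves adjacent-residue frequencies (read backwards), whereas reshuffling scrambles them and changes where the trypsin targets (Lys and Arg) fall.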
The only systematic study on the performance of these target-decoy or modified
databases focused only on the FDR, although clear differences in coverage can be
demonstrated. Thus, no systematic study has evaluated all the confusion matrix
classes.
Figure 1.12 A Separated Target-Decoy Search at a Score Above the Unequivocal Score Threshold

Figure 1.13 A Separated Target-Decoy Search at a Score Below the Unequivocal Threshold

Figure 1.14 A Concatenated Target-Decoy Search with the Same False Discovery Rate Threshold as Figure 1.13
1.7 Testing a New Search Approach
It is useful to think of the peptide sequences input to the search program as a search
space. A major contributor to the discrimination is the size of the search space. The
first major study in this thesis involves analysis of results as the search space is varied
(Chapter 3). Computational methods based on the properties of the peptides were
developed that manipulate the search space; this required developing a new approach
to management of the peptide sequences (referred to as a peptide centric database or
PCDB) in carrying out searches. In the second major study in the thesis, an alternative
search approach is evaluated; this approach uses a database of previously observed
MS/MS spectra as the search space, instead of the peptide sequences. This allows
direct mapping of spectra to spectra, but has the obvious limitation that the spectral
library of available cases may not include many of the peptides that are in the sample
being analyzed. In fact, the currently available spectral libraries are dominated by the
most abundant proteins in the cell. This thesis will test the use of the Zhang
theoretical spectra as a source of reference MS/MS for searching using direct spectra
to spectra mapping.
In Chapter 4, I compared results with two available programs for spectral library
searching: X!Hunter (Craig et al., 2006) and BiblioSpec (Frewen et al., 2006), utilizing a dataset generated by analysis of 49 standard proteins and a dataset generated from a complex sample with ~1,600 proteins. In Chapter 2, score
distributions and ROC curves are utilized to assess the performance of these search
programs. Because both X!Hunter and BiblioSpec have different advantages and
disadvantages, a major goal of this thesis is to develop alternative algorithmic
approaches in order to achieve better discrimination.
2. Performance Measures
The identification of a peptide sequence based upon the MS/MS fragmentation
spectrum is a specific example of the general problem of classification, where the
search program fits the general definition of a classifier, and a classifier is defined as
a mapping from unlabeled instances to (discrete) classes (Kohavi and Provost,
1998). Because Mass Spectrometry search programs map individual MS/MS spectra
to peptide sequences, in this thesis, the general methods of assessing performance of
classifiers will be applied to analysis of proteomics data.
Several proteomics studies have addressed the performance issue, but the
comparative evaluation of classifiers has not been uniformly carried out across
different studies. Therefore, an important aspect of this thesis is the development of
principled evaluation methods for these search programs, based on standard theory
for comparative evaluation of classifiers. In Chapter 1, the use of score distributions
in evaluating high confidence data for these classifiers was discussed. In this chapter,
the two traditional methods of examining the performance of a classifier will be
discussed: (1) the confusion matrix, and (2) the ROC curve. Also, the test datasets are
described. In addition, the commonly used approach (the target-decoy search strategy)
for assessing the search performance in MS-based proteomics is discussed in detail. I
will investigate the performance of separated and concatenated variations. In this
chapter, I will describe the definitions of these methods, and their applications and issues in our research.
2.1 Confusion Matrix
This section starts with the basic definitions of the confusion matrix and performance metrics, and then points out the issues that arise when applying these to results of searching LC/MS datasets.
Confusion matrix analyses can be used to analyze any number of classes; thus, in
general, the size of a confusion matrix is n by n where n is the number of different
classes (Kohavi and Provost, 1998). However, the usual application of this approach
is when n equals 2, where the two classes are negative and positive. The matrix for
a two class classifier is then defined as follows (Cios et al., 1998).
Table 2.1 Formal definition of a confusion matrix

                              Test Result
                              Positive (1)          Negative (0)
Hypothesis   Positive (1)     True Positive (TP)    False Negative (FN)
             Negative (0)     False Positive (FP)   True Negative (TN)
Based on Table 2.1, the coverage is defined as the total instances that are classified by
a classifier (Kohavi and Provost, 1998).
Coverage = TP + FP + FN + TN    (2.1)
Also, six commonly used performance metrics can be defined (Kohavi and Provost,
1998; Pepe, 2003).
(a) Accuracy: estimates the proportion of the correct predictions out of the total
instances.
Accuracy = (TP + TN) / (TP + TN + FN + FP)    (2.2)
(b) Sensitivity, recall, or true positive rate: measures the proportion of positive
instances identified correctly.
Sensitivity = P(Test = 1 | Hypothesis = 1) = TP / (TP + FN)    (2.3)
(c) Specificity or true negative rate: measures the proportion of negative instances
identified correctly.
Specificity = P(Test = 0 | Hypothesis = 0) = TN / (FP + TN)    (2.4)
(d) Precision: measures the proportion of the correct positive identifications out of the
total instances correctly identified (both negative and positive).
Precision = TP / (TP + FP)    (2.5)
(e) False positive rate: measures the proportion of incorrect positive identifications out of the negative class.

FPR = P(Test = 1 | Hypothesis = 0) = FP / (FP + TN)    (2.6)
(f) False negative rate: measures the proportion of incorrect negative identifications out of the positive class.

FNR = P(Test = 0 | Hypothesis = 1) = FN / (TP + FN)    (2.7)
Another widely used metric is the false discovery rate (FDR) which is the expected
proportion of falsely rejected hypotheses among the number of rejections (Benjamini
and Hochberg, 1995).
FDR = FP / (FP + TP)    (2.8)
Note that most of these metrics use only two out of the four values in the confusion matrix. This suggests that these estimations are possibly biased (Baldi et al., 2000). For example, two classification results can have the same sensitivity: according to the equation, TP and FN are the same for both, but FP and TN are not necessarily identical. Therefore, I adopted the Matthews correlation coefficient (MCC) suggested by Baldi et al. (2000) and the harmonic mean (HM) utilized by Nguyen et al. (2008).
The equations are listed below.
MCC = (TP x TN - FP x FN) / sqrt((TP + FN)(TP + FP)(TN + FP)(TN + FN))    (2.9)

HM = 2 / (1/Precision + 1/Recall)    (2.10)
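For concreteness, Equations 2.1 through 2.10 can be computed together from the four cell counts; the following is a direct transcription of the definitions above (the function and dictionary-key names are my own, not from any cited software).

```python
import math

def confusion_metrics(tp, fp, fn, tn):
    """Performance metrics from a 2x2 confusion matrix (Eqs. 2.1-2.10)."""
    coverage = tp + fp + fn + tn                     # Eq. 2.1
    accuracy = (tp + tn) / coverage                  # Eq. 2.2
    sensitivity = tp / (tp + fn)                     # Eq. 2.3 (recall, TPR)
    specificity = tn / (fp + tn)                     # Eq. 2.4 (TNR)
    precision = tp / (tp + fp)                       # Eq. 2.5
    fpr = fp / (fp + tn)                             # Eq. 2.6
    fnr = fn / (tp + fn)                             # Eq. 2.7
    fdr = fp / (fp + tp)                             # Eq. 2.8
    mcc = (tp * tn - fp * fn) / math.sqrt(           # Eq. 2.9
        (tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    hm = 2 / (1 / precision + 1 / sensitivity)       # Eq. 2.10
    return {'coverage': coverage, 'accuracy': accuracy,
            'sensitivity': sensitivity, 'specificity': specificity,
            'precision': precision, 'FPR': fpr, 'FNR': fnr,
            'FDR': fdr, 'MCC': mcc, 'HM': hm}
```

Unlike the two-value metrics, MCC uses all four cells, which is why it is adopted here; a classifier no better than chance gives MCC near zero even when accuracy looks acceptable.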
This section lists most of the well-known performance metrics because the terms are sometimes defined incorrectly. For example, FDR has recently become broadly used in proteomics, where the number of TNs is seldom known. However, the terms FPR and FDR are often confused, and the confusion can be obscured by the definitions used by the authors. Higdon et al. (2005) refer to the FDR as the FPR in their study, and this error is clearly revealed in their equation for FPR (which is identical to the FDR as defined above). Also, Peng et al. (2003) defined the overall false-positive rate for a target-decoy database search as:
%fal = 2[nrev / (nrev + nreal)]    (2.11)

where nrev is the number of FPs and nreal is the number of TPs. Substituting these terms into Eq. 2.11 gives

%fal = 2[FP / (FP + TP)] = 2 FDR    (2.12)

This gives 2 FDR because it is the combined number of FP hits in the inverted and normal databases. The FDR for a normal search would be half that number. Recently, Elias and Gygi (2007) clarified this incorrect FPR as:
FP = 2 x passing decoy assignments    (2.13)

TP = total passing assignments - number of FPs    (2.14)

FPR = FP / (TP + FP)    (2.15)
This new definition avoids the apparent error that made the maximum FPR two times too large, but it is still not the conventional definition of FPR. It seems the authors intended to redefine the meaning of FPR, but in the review of Choi and Nesvizhskii (2008), the term FDR is still preferred.
2.1.1 Utilization of Confusion Matrix in
Mass Spectrometry-Based Proteomics
Comparison of the performance of different search programs using a confusion
matrix is complicated by the fact that categorization in a confusion matrix is
ambiguous. Furthermore, search programs often utilize prefiltering of the data to
eliminate poor quality MS/MS spectra; the algorithms used are different, resulting in
different coverage for each program or removal of a different subset of the data.
Figure 2.1 illustrates the coverage issue, showing the prefiltering processes in going from the MS/MS data in the raw file to the subset that is scored by the search program. The raw file contains S MS/MS spectra collected by the MS instrument.
The vendor application then extracts these spectra and saves them in text format (DTA files). However, not all spectra contain data (some have no fragment ions with intensity above the noise level), so the number of extracted spectra (DT) is smaller than the number of MS/MS collected. Because the vendor software is conservative, many very weak spectra are included in DT. Therefore, the search programs often
remove additional spectra that are considered to be bad or weak. Typically, the number of DTAs (Ds) retained for scoring by search programs is reduced by about 10%.

Figure 2.1 Various Coverage for Experimental Spectra (S = # detected MS/MS; Ds = # DTAs accepted by search program)

Determining the coverage must be the first step in categorizing the classes in a confusion matrix. Proteomics articles using ROC plots to determine performance commonly give no explicit description of coverage, and often do not give the total number of MS/MS or sufficient information to allow calculation of coverage (Elias et al., 2004; Lam et al., 2007; Frewen et al., 2006; Kapp et al., 2005). One exception to this is a recent study evaluating search programs using ROC plots in machine learning studies of proteomics data (Ulintz et al., 2006). Elias and Gygi (2007) do not show ROC curve analysis, but in a table, the coverage is shown as Ds implicitly.
The coverage is ignored because most proteomics studies make the assumption that
the spectra eliminated from DT are similar to the retained spectra (Ds). For instance, if
spectra are eliminated because they are weak, the retained and eliminated spectra
most likely are chemically similar, and just the concentration varies. However, the
algorithms may remove specific classes of peptides; specifically, it may remove those
that have poor fragmentation (for example, a case where there is no mobile proton).
The coverage issue must be evaluated to definitively determine that the eliminated
spectra are not biased in some way. Without precise information about the criteria for
elimination, the evaluation of performance using ROC plots may be biased in an
unpredictable manner.
A second major issue in utilizing confusion matrix terms in proteomics is defining the hypothesis positive and negative classes. In proteomics analyses, the terms hypothesis positive and negative are replaced by terms referring to the identifiability of the MS/MS spectrum, that is, whether it can reliably be mapped to a peptide sequence within the search space. For example, for a set of manually curated MS/MS spectra, the positive set is defined as the spectra whose corresponding peptide sequences have been determined by experts, and the negative set as the others.
The test result is defined by the search score and acceptance threshold (search
program dependent). If a sequence assignment scores better than the acceptance
threshold, it will be classified as test positive; otherwise, it will be test negative.
Table 2.2 shows the redefined confusion matrix for MS-based proteomics.
Table 2.2 The formal definition of a confusion matrix for MS-based proteomics

                                              Test Result
                                              Positive (1)   Negative (0)
Curated Spectra   Sequence mapped (1)         TP             FN
                  Sequence not mapped (0)     FP             TN
However, researchers care about the correctness of the assigned peptide sequences
more than the score. Specifically, an MS/MS of an identifiable peptide sequence may
also map to a different peptide sequence, which scores higher than the correct
sequence (and is also above the acceptance threshold). These are the distracted cases
referred to in Figure 1.8. Table 2.3 is a modification of Table 2.2 including these
cases. Note that an identifiable MS/MS spectrum with a wrong peptide sequence
assigned is classified as a FP.
Table 2.3 A confusion matrix for a fully curated dataset

                                              Test Result
                                              Positive (1)                        Negative (0)
                                              Sequence Agree   Sequence Disagree
Curated spectra   Sequence mapped (1)         TP               FP                 FN
                  Sequence not mapped (0)     FP               FP                 TN
Estimating the performance using the confusion matrix fundamentally requires a method to separate the identifiable set from the non-identifiable set. However, in most complex proteomics datasets, the hypothesis positive and negative classes cannot be clearly defined. In order to evaluate performance in a principled
manner, a dataset must be used where these classes can be defined. The best method
is to utilize a fully curated dataset, as indicated in Tables 2.2 and 2.3. In Chapters 3
and 4, studies are described that make use of such a dataset in evaluating performance
of search strategies and sequence validation methods. However, use of curation in
evaluation of methods for the new fast scanning mass spectrometers is difficult
because of the large size of the datasets. Instead, two alternative methods will be
considered, utilizing standard proteins or utilizing accurate masses, which are
described in the next two sections.
2.1.2 Defining Confusion Matrix Classes
Using Standard Proteins
It is known that simply classifying hits to be correct or incorrect by a score threshold
does not allow assessment of the FPR and FNR. A fully curated dataset is the ideal
situation, but it is expensive to make a set of fully curated spectra for comparative
performance estimation and impractical for the large datasets generated by the
LTQ/Orbitrap mass spectrometer almost universally used in profiling at this time
(Ulintz et al., 2006). A simple alternative is to use a control mixture or a standard
mixture to produce MS/MS spectra (Keller et al., 2002; Purvine et al., 2004). The
idea is to create a sample using individually purified proteins, so the composition of
the sample is known (although minor contaminants can also be present). As long as
the control mixture contains proteins ranging in concentration, molecular weight,
physicochemical properties, and sequence, this artificial sample can be used to
determine the performance of search algorithms. In such a case, hypothesis positive and negative are defined as the expected proteins and the unexpected proteins, respectively.
The definitions for the confusion matrix of a standard protein dataset are shown in
Table 2.4, where standard vs. others refers to whether the hit is to one of the proteins in the standard list. Thus the assessment of the hypothesis class (true or false) is score-independent. Within each class, the TPs are those cases where the score passes a
score threshold and the assigned peptide sequence is one of the 49 proteins, the TNs
are those where the score is below the threshold and the assigned peptide sequence is
not one of the 49 proteins, the FPs are those that pass the score threshold and are not
one of the 49 proteins, and the FNs are those that are below the threshold and are one
of the 49 proteins. Here the coverage is still defined as the number of MS/MS spectra
scored by search programs.
Table 2.4 A confusion matrix for a standard dataset

                        Test Result
                        Positive (1)   Negative (0)
Protein   Standard      TP             FN
          Others        FP             TN
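The class assignment in Table 2.4 can be sketched as a simple rule combining the score threshold with membership in the standard-protein list; the protein identifiers and function name below are hypothetical.

```python
def classify_hit(score, assigned_protein, threshold, standard_proteins):
    """Confusion-matrix class for a single hit against a
    standard-protein dataset (Table 2.4)."""
    passes = score >= threshold                        # test positive?
    expected = assigned_protein in standard_proteins   # hypothesis positive?
    if passes:
        return 'TP' if expected else 'FP'
    return 'FN' if expected else 'TN'
```

Because `expected` depends only on the protein list, the hypothesis class is fixed before any score threshold is chosen, which is the score-independence property noted above.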
When using the test dataset consisting of 49 standard proteins searched against a
human database of about 27,000 genes, the chance of finding false positive hits is
negligible. To test this hypothesis, protein sequences in a human database were inverted, and a standard protein dataset containing 49 proteins was searched against this database. The result showed that only 7 out of 5,854 MS/MS files, or 0.12% of the whole dataset, were FPs assigned to the inverted protein sequences of those 49 standard proteins. This is less than the predicted chance of hitting a peptide in the
protein database (3,682 unique peptides in 49 proteins divided by 1,452,058 in the
whole protein database or 0.25%). Thus, the number of incorrectly classified cases is
very small (ignoring the cases where an identifiable spectrum was incorrectly
identified as a protein other than the standards, by distraction). Despite this limitation, the fact that the FPR is so low means that this method can be used for ROC plots, as long as the ratio of identifiable to unidentifiable MS/MS is not significantly smaller than that of the ABRF sample.
2.1.3 The General Case
As discussed above, the hypothesis positives and negatives can be determined for
both curated and standard protein datasets. However, for a real world proteomics
sample, there is no information about hypothesis positives and negatives, because the
proteins in the sample are not known. This means that there is no way to fill out the
confusion matrix. Because performance comparisons are essential in the MS-based
proteomics field, an alternative approach is often used to estimate the number of false
positives by searching a database of proteins that cannot be identified. This database
is referred to as a decoy database and is usually generated by inverting the protein
sequences in a database so that they read from C- to N-terminus instead of the correct
direction from N- to C-terminus. The database of correct protein sequences is referred
to as the target database. Elias and Gygi (2007) concatenate the two databases and
then define the classes of a confusion matrix for a target-decoy search as follows
(Elias et al., 2004; Moore et al., 2002; Peng et al., 2003).
FP = 2 x decoy assignments    (2.16)

TP = Total assignments - FP    (2.17)

TC (total correct) = Total TPs for different searches    (2.18)

TI (total incorrect) = Total assignments - TC    (2.19)

FN = TC - TP    (2.20)

TN = TI - FN    (2.21)
The assumption for a target-decoy method is that the TC (Equation 2.18) of a dataset is discovered by the searches combined, although this is known not always to be true. Also, if the
false positives arise preferentially from the identifiable MS/MS, the number of FPs in
the decoy search will be larger than the number in the target search. As long as the
dataset has a relatively small number of identifiable vs. nonidentifiable cases, then
this will not make a significant impact. Even though the estimated values are not
perfect, until recently this method was the only published method to estimate the FDR
for comparative evaluation.
Note that Elias and Gygi (2007) clarified and extended the original idea (Equation 2.11) to allow estimation of a complete confusion matrix. Furthermore, Balgley et al. (2007) pushed the idea further, applying it at the protein level. In that study,
classes in a confusion matrix were defined at the protein level, based on the peptide
identifications in a target-decoy protein database as above. The assigned peptide
sequences are then used to determine the number of identified proteins. The number
of proteins identified in the decoy database defines the number of FP proteins (multiplied by two). The other numbers are defined as follows:
TN = Σ Pi    (2.22)

TPx = Px - FPx    (2.24)

where Ptotal is the number of proteins in the database, and Pi and Px are the numbers of protein hits for the i-th and x-th search algorithms, respectively.

Gygi's study showed that these efforts provide a better theoretical basis for the utilization of performance metrics and the analysis of ROC curves for search programs using spectra-to-sequence mapping by Sequest. In Section 2.4, I extend this study to Mascot, and in future studies, I propose using the Zhang theoretical spectra to extend the approach to spectra/spectra mapping.

2.2 Receiver Operating Characteristic Curve

A receiver operating characteristic (ROC) curve was originally used for analysis of radar signals. In the late 1990s, it was extended to analysis of the performance of machine learning algorithms (Provost and Fawcett, 1997; Provost et al., 1998). Figure 2.2 shows a simple ROC plot with two curves comparing results of two classifiers. The X and Y axes are the FPR and TPR, respectively. The diagonal line indicates the
random performance for a classifier (Figure 2.2). In comparing the experimental plots, the solid line representing results of one classifier is closer to the top-left corner than the dashed line representing the other classifier. This indicates that the performance of the classifier represented by the solid line is better.
Figure 2.2 Simple ROC Plot
A practical algorithm used to construct an ROC curve is described below (Fawcett,
2003):
INPUT: L: the set of test instances, f: the classifier,
       N: total negative, P: total positive
OUTPUT: R: a list of ROC points

S <- Sort L by f scores, descending
FP <- 0
TP <- 0
R <- { }
f_prev <- -infinity
For Each i in S
    If f(i) != f_prev Then
        R.append( (FP/N, TP/P) )
        f_prev <- f(i)
    End If
    If i is a positive example Then
        TP <- TP + 1
    Else
        FP <- FP + 1
    End If
End For
R.append( (FP/N, TP/P) )
Return R
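The steps above can be transcribed directly into Python; the score function and the pair representation of instances are illustrative choices, not part of the published algorithm.

```python
def roc_points(instances, score_fn):
    """ROC points for a scored test set, after Fawcett (2003).

    `instances` is a list of (example, is_positive) pairs; `score_fn`
    maps an example to its classifier score. A point is emitted each
    time the score changes, so tied scores share one threshold.
    """
    pos = sum(1 for _, p in instances if p)
    neg = len(instances) - pos
    ranked = sorted(instances, key=lambda x: score_fn(x[0]), reverse=True)
    fp = tp = 0
    points = []
    f_prev = None
    for example, is_positive in ranked:
        score = score_fn(example)
        if score != f_prev:                 # new threshold: record a point
            points.append((fp / neg, tp / pos))
            f_prev = score
        if is_positive:
            tp += 1
        else:
            fp += 1
    points.append((fp / neg, tp / pos))     # final point is (1, 1)
    return points
```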
Based on this algorithm, a modified algorithm is built for a search program with a
fully curated input.
INPUT: L: the set of DTAs, g: the search program,
       N: number of non-identifiable spectra,
       P: number of identifiable spectra
OUTPUT: R: a list of ROC points

S <- Sort L by g scores
FP <- 0
TP <- 0
R <- { (0, 0) }
For Each d in S
    If d is identifiable And
       d.knownPepSeq = g(d).assignedPepSeq Then
        TP <- TP + 1
    Else
        FP <- FP + 1
    End If
    R.append( (FP/N, TP/P) )
End For
Return R
Notice that the most important difference between these two algorithms is how to
classify an instance as a TP or a FP. The original algorithm uses the known
classification of each entry, while the modified algorithm compares the assigned
peptide sequences to the expected peptide sequences. Although this algorithm is
simple, it requires a curated dataset, and the PLC dataset analyzed by our laboratory
is the only such proteomics dataset of a complex sample. I use it to compare the
performances of different search settings, but it is an LCQ dataset, and the main focus
of this thesis is to determine the performance of classifiers with LTQ datasets. In
order to deal with non-curated datasets, three other ROC variations will be described
in the following sections.
2.2.1 ROC Plot for Control Mixture Sample
As mentioned above, TPR and FPR can be obtained by searching a set of spectra
generated from the control mixture sample. Hence, I revised the ROC algorithm to accommodate the change in counting the TPs and FPs; a DTA is counted as a TP if the assigned peptide sequence is a subsequence of the standard protein set. The modified algorithm is shown below. The main difference between this algorithm and the one in the previous section is the way the positive and negative entries are determined.
INPUT: L: the set of DTAs, g: the search program,
       F: the known proteins,
       N: number of spectra receiving sequences not in F,
       P: number of spectra receiving sequences in F
OUTPUT: R: a list of ROC points

S <- Sort L by g scores
FP <- 0
TP <- 0
R <- { (0, 0) }
For Each d in S
    If g(d).assignedPepSeq is in F Then
        TP <- TP + 1
    Else
        FP <- FP + 1
    End If
    R.append( (FP/N, TP/P) )
End For
Return R
2.2.2 ROC Plot Using False Discovery Rate
Due to the difficulty of obtaining the real FPR, Peng et al. (2003) proposed using the
false discovery rate (FDR) calculated from the search results of the target-decoy
search strategy. The equation is shown below:
FDR = 2 x Hd / (Ht + Hd)    (2.25)

where Ht and Hd are the numbers of target and decoy hits above a certain threshold,
respectively. Researchers have used FDR as the x axis in a ROC curve instead of FPR.
The maximum value of FDR is not 100%, but rather is some fraction of 100%,
because it depends on the number of TP cases in the dataset. In particular, if the
fraction is different for the methods that are compared, the performance cannot be
rigorously compared. Another way to estimate the error rate from a search result is to use applications such as PeptideProphet, and this error rate can be used to plot a ROC curve (Lam et al., 2007). In the case of PeptideProphet, the TPR is underestimated for lower scoring cases, compromising this method as well.
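A sketch of how such (FDR, TPR-estimate) points might be tabulated from a concatenated target-decoy search, applying Eq. 2.25 at each score cutoff; the input format, function name, and TPR estimate (estimated TPs divided by a supplied count of identifiable spectra) are illustrative assumptions.

```python
def fdr_roc_points(hits, total_identifiable):
    """(FDR, TPR-estimate) points from a concatenated target-decoy search.

    `hits` is a list of (score, is_decoy) pairs. At each cutoff the FDR
    follows Eq. 2.25 and estimated TPs = accepted hits - 2 * decoy hits.
    """
    points = []
    target = decoy = 0
    for score, is_decoy in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_decoy:
            decoy += 1
        else:
            target += 1
        fdr = 2 * decoy / (target + decoy)
        tp_est = (target + decoy) - 2 * decoy
        points.append((fdr, tp_est / total_identifiable))
    return points
```

As noted above, the curve's x axis saturates below 100% whenever the dataset contains many true positives, which is what limits rigorous comparison between methods.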
The algorithm for making a ROC curve using FDR is almost identical to the standard
algorithm described above, so it is not shown.
2.2.3 ROC Plot for High Mass Accuracy Dataset
Because the reported masses of new MS instruments are more accurate, we can take
advantage of high mass accuracy to determine TP/FP. The idea is to calculate the
mass difference (error) between theoretical and experimental m/z (in parts per million
or ppm), and then set a range for distinguishing correct and incorrect hits. The mass
error in ppm is calculated as:
MassError = (mzE - mzT) / mzT x 10^6    (2.26)

where mzE and mzT are the experimental mass (reported by the instrument) of a spectrum and the theoretical mass of the assigned peptide sequence, respectively.
The usefulness of the mass error in distinguishing correct vs. incorrect assignments is
shown in Figure 2.3. This shows a plot of the results in a study from the Gygi lab
(Beausoleil et al., 2006) comparing the mass error in ppm with XCorr from Sequest.
It can be seen that most of the target (or correct) hits are found in the range of about -2 to
+8 ppm. As expected, the FPs from the decoy database are distributed randomly
across the full error range, as are a subset of the hits from the target database. These
latter cases are presumably the FPs in the target database search.
Thus, the reason why the mass accuracy filter is so useful is that combining the mass
accuracy filter with a wide mass tolerance search is similar to the target-decoy
approach. For instance, for a MS/MS spectrum, there are n candidates in the searched
database and only m of them can be correct based on the mass accuracy filter.
Typically, the tolerance for the parent mass error is set to 1.2 Da, and the mass
accuracy filter then removes cases where the mass error is greater than about 0.05 Da (here
roughly converting the ppm threshold to Da), with the result that the sizes of the
target and decoy search are very unequal. However, the cases excluded because they
have an incorrect mass may include some cases that are actually correct, but the mass
has been incorrectly assigned when making the DTA file. In our datasets, these are a
relatively small percent of the hypothesis positive cases, and thus, the method can be
used to ask if there are significant differences between standard protein datasets and
datasets from complex, real-world samples.
[Figure 2.3: scatter plot of XCorr vs. mass deviation (p.p.m.) for n = 1,462 target and decoy hits; x axis spans -50 to +50 ppm]
Figure 2.3 ppm-Score Plot Examples (Taken from Beausoleil et al., 2006)
The following algorithm describes how to use the accurate mass filter to generate a
ROC curve. The change in this algorithm from the original one is the way to
determine positive and negative entries using the error of the parent ion mass in ppm
(Eq. 2.26).
INPUT: L: the set of DTAs, g: the search program,
       B: the ppm boundary, N: number of spectra not in B,
       P: number of spectra in B
OUTPUT: R: a list of ROC points
S <- Sort L by g scores
FP <- 0
TP <- 0
R <- { (0, 0) }
For Each d in S
  If g(d).ppm is in B Then
    TP <- TP + 1
  Else
    FP <- FP + 1
  End If
  R.append((FP/N, TP/P))
End For
Return R
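The algorithm above can be sketched in Python as follows. The container names and the example ppm window are my own illustrative choices, not fixed by the algorithm:

```python
def roc_points_by_mass_accuracy(dtas, score, ppm_error, low=-2.0, high=8.0):
    """Build ROC points using the accurate-mass filter to label hits.

    dtas      : spectrum identifiers
    score     : dict, identifier -> search score (higher is better)
    ppm_error : dict, identifier -> parent-ion mass error in ppm
    low, high : ppm window treated as correct (hypothesis positives)
    Assumes at least one identifier inside and one outside the window.
    """
    p = sum(1 for d in dtas if low <= ppm_error[d] <= high)
    n = len(dtas) - p
    points, tp, fp = [(0.0, 0.0)], 0, 0
    # Walk the hits from best to worst score, accumulating TP/FP counts.
    for d in sorted(dtas, key=lambda d: score[d], reverse=True):
        if low <= ppm_error[d] <= high:
            tp += 1
        else:
            fp += 1
        points.append((fp / n, tp / p))
    return points
```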
2.2.4 The Area Under the ROC Curve
Although a two-dimensional ROC curve can be used to represent classifier
performance, a single scalar value is preferred. The area under the ROC curve (AUC)
is a commonly used measure of the overall performance of a classifier
(Bradley, 1997), and serves as a summary statistic (Bouckaert, 2006). Huang
and Ling (2005) compared the use of accuracy and AUC in evaluating
learning algorithms, concluding that for balanced datasets, AUC is better than
accuracy in terms of statistical consistency and discrimination.
I demonstrate two approaches to calculating the AUC. The first was proposed by
Hand and Till (2001).
AUC = (S_0 - 0.5 × n_0 × (n_0 + 1)) / (n_0 × n_1)    (2.27)
where n_0 and n_1 are the numbers of positive and negative instances, respectively, and
S_0 is the sum of the ranks of the positive instances when all scores are ranked
together. The other is a general method that uses trapezoidal integration.
AUC = Σ_i [ (y_{i+1} + y_i) / 2 × (x_{i+1} - x_i) ]    (2.28)
where y_i and y_{i+1} are the ith and (i+1)th TPRs, respectively, and x_i and x_{i+1} are
the ith and (i+1)th FPRs, respectively. The x and y values can be the numbers of FPs and
TPs, respectively, instead of the ratios; in that case the sum must be divided by a
denominator, the total number of false cases multiplied by the total number of true
cases. With this method, the AUC may be underestimated when there are too few input
instances, but because MS-based proteomics experiments produce thousands of inputs, this
problem can be ignored. In this thesis, I adopt the trapezoidal integration method to
calculate the AUC.
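Both approaches can be sketched as short functions; these are illustrative implementations of Eq. 2.27 and Eq. 2.28, with names of my own choosing:

```python
def auc_trapezoid(points):
    """Eq. 2.28: trapezoidal integration over ROC points, given as
    (FPR, TPR) pairs ordered from (0, 0) to (1, 1)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (y1 + y0) / 2.0 * (x1 - x0)
    return area

def auc_hand_till(pos_scores, neg_scores):
    """Eq. 2.27: rank-based AUC (Hand and Till, 2001). All scores are
    ranked together (rank 1 = lowest, ties get the average rank) and
    S_0 is the sum of the ranks of the positive instances."""
    ordered = sorted(pos_scores + neg_scores)
    rank = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank[ordered[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        i = j
    s0 = sum(rank[s] for s in pos_scores)
    n0, n1 = len(pos_scores), len(neg_scores)
    return (s0 - 0.5 * n0 * (n0 + 1)) / (n0 * n1)
```

A perfect ranking gives an AUC of 1.0 under both formulas, and a random ranking about 0.5.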
A few proteomics studies have used the AUC to compare many search conditions
without estimating its error. Because any two AUC values are often very similar (for
example, 0.988 vs. 0.979), the difference between them is much smaller than the
individual values, and it is critical to know the error. To estimate the error in our
analyses, the AUC was calculated four times, each time using a randomly selected
subset containing 75% of the DTA files. The standard deviation
of the four measurements was calculated. The result was a mean AUC value of 0.897,
with standard deviation of 0.004.
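The subsampling procedure just described can be sketched as below. The function name, the use of `random.sample`, and the fixed seed are my own choices for the sake of a reproducible example:

```python
import random
import statistics

def auc_error_estimate(dtas, compute_auc, repeats=4, fraction=0.75, seed=0):
    """Estimate the variability of an AUC by recomputing it on
    repeated random subsets of the DTA files.

    compute_auc: callable taking a list of DTAs and returning an AUC.
    Returns (mean AUC, standard deviation) over the subsamples.
    """
    rng = random.Random(seed)
    k = int(len(dtas) * fraction)
    aucs = [compute_auc(rng.sample(dtas, k)) for _ in range(repeats)]
    return statistics.mean(aucs), statistics.stdev(aucs)
```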
In this section, I discussed methods for estimating search performance and noted the
issues in performance evaluation for MS-based proteomics. Throughout this thesis, all
performance measures follow the methods described in this section; the datasets used
as inputs are discussed next.
2.3 Standard Datasets
An important part of this thesis is the choice of benchmarking datasets to evaluate our
methods. Two criteria need to be considered in choosing them: the sample size and the
representativeness. Medical studies normally suffer from limited sample sizes, but in
MS-based proteomics the sample size is usually not a problem, because a single
biological sample easily generates up to 100,000 MS/MS spectra. Our concern is
therefore how well the datasets generalize. The charge state is a widely acknowledged
factor that affects identification. Hence, in this section, I describe the method used
to extract the DTA files from the raw formats and the properties of the three chosen
datasets, breaking down the analysis according to the charge state distribution of the
parent ions. In addition, the MH+ (the peptide ion mass) and m/z (the parent ion mass)
distributions are shown, together with the charge state, in order to show how the
datasets vary in these critical parameters.
2.3.1 Generation of DTA files
The three major datasets used in these analyses are referred to as the PLC, ABRF, and
Phoebe3 datasets, and are described more completely below. In each case, DTA files
were generated using the vendor's extract_msn software, which uses different methods
for LCQ and LTQ/Orbitrap datasets (PLC is LCQ data, while ABRF and Phoebe3 are
LTQ/Orbitrap datasets). Vendor-recommended parameters were used in processing the raw
data into the DTA files. Table 2.5 summarizes the number of DTA files generated for
each dataset, broken down by charge form. For the PLC dataset, the numbers of DTA
files for charges 2 and 3 are the same because the LCQ instrument produces low
resolution measurements of the parent ion masses, with the consequence that parent
ions with multiple charges cannot be distinguished. Thus, extract_msn.exe creates two
charge forms for any DTA with multiple charges in that case.
Table 2.5 Sizes of datasets

# spectra   PLC      ABRF    Phoebe3
Total       4,245    5,854   114,039
Charge 1    1,015    2,314    24,740
Charge 2    1,615    2,342    42,182
Charge 3    1,615      907    34,890
Others          0      291    12,227
Figures 2.4, 2.9, and 2.14 show the charge distributions of each dataset. In Figure 2.4,
the numbers of doubly and triply charged ions are identical, because the low resolution
precludes charge determination, and two DTAs are generated assuming charges of 2
or 3. In Figures 2.9 and 2.14, the resolution of the spectra is high enough for the
LTQ/Orbitrap to determine the charge, even for ions charged higher than triply.
However, there are few MS/MS spectra of higher charge forms, and their numbers are
insufficient for statistical significance. Hence, I use only data up to charge 3, as other
groups have done (Peng et al., 2003; Ulintz et al., 2006; Lam et al., 2007).
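The duplicate charge-2/charge-3 DTAs arise because a low resolution precursor m/z is compatible with more than one MH+. A sketch of the conversion is shown below; the proton mass constant and the example m/z value are my own illustrative choices:

```python
PROTON = 1.00728  # approximate proton mass in Da (assumed value)

def mh_plus(mz: float, charge: int) -> float:
    """Singly protonated peptide mass (MH+) implied by a precursor
    m/z measured at the given charge state."""
    return mz * charge - (charge - 1) * PROTON

# A low-resolution precursor at m/z 700.85 is ambiguous between charge 2
# and charge 3, so one DTA is written per charge hypothesis:
candidate_masses = [mh_plus(700.85, z) for z in (2, 3)]
```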
2.3.2 PLC: the Fully Curated Dataset
To assess the performance of search algorithms under different conditions, a fully
curated dataset is the best situation. However, creating such a dataset is very time
consuming and error-prone. In the public domain, there is only one fully curated
dataset, the PLC dataset, which was generated by our lab (Resing et al., 2004). This
dataset was collected on an LCQ MS/MS instrument from a soluble cell lysate of a human
erythroleukemia cell line (K562) and contained 2,117 (or 4,245 with decoy spectra)
MS/MS spectra. Proteomics datasets consist of multiple LC/MS analyses on the same
sample, where the sample is fractionated in different ways. Therefore, this dataset
consisted of 17 LC/MS analyses, each only 20 min long. This resulted in a relatively
small dataset, although the range of peptide intensities and percentage of chimera
spectra was typical of datasets generated from longer runs or after more fractionation.
The small size made it possible to carry out manual analysis of every search result.
Manual analysis involves evaluating the chemical plausibility of the fragmentation
spectra, given the peptide assignment. It is done by an expert and is considered the
"gold standard" for peptide identification. The manual analysis of the PLC dataset
showed that the MS/MS classes of complex samples were well represented (Resing et
al., 2004) (Figures 2.4 to 2.8). Because the number of identifiable spectra in the PLC
dataset is known, true positive (TP), true negative (TN), false positive (FP), and false
negative (FN) cases can be determined for any search condition. For example, in
Chapter 3, this approach is used to compare sensitivity of various search programs,
when holding FPR constant. In addition, in Chapters 3 and 4, this dataset is used in
ROC analyses to compare discrimination in different situations.
Figure 2.4 Spectra Distribution for PLC Dataset by Charge State
Figure 2.5 Spectra Distribution for PLC Dataset by MH+
Figure 2.6 Spectra Distribution for PLC Dataset by MH+ and Charge State
Figure 2.7 Spectra Distribution for PLC Dataset by m/z
Figure 2.8 Spectra Distribution for PLC Dataset by m/z and Charge State
2.3.3 ABRF: the Standard Protein Dataset
Due to the issues mentioned above, studies utilize mixtures of standard proteins
because of the need to unambiguously identify the MS/MS in the positive class (true
positives and false negatives) (Keller et al., 2002; Purvine et al., 2004). The benefit of
using a standard protein sample is that the range of positive hits is predictable. Also,
because the PLC dataset was generated using the low resolution LCQ instrument, it
does not represent the current experimental environment. To make the performance
estimation more convincing, a dataset that originated from a nationally distributed
sample by the Association of Biomolecular Resource Facilities (ABRF) was
produced. The advantages of using this dataset are: (1) it is a standard protein sample,
and (2) it was generated on a high resolution instrument, the LTQ/Orbitrap. This
dataset was collected on a mixture of 49 known proteins (the ABRF standard mixture);
accurate masses of the parent ions were analyzed in the Orbitrap and MS/MS spectra
were analyzed in the LTQ. This dataset allows validation of the method of using
accurate masses to analyze correct vs. incorrect assignments. The dataset is
similar in size and chimera content to the PLC dataset.
Figures 2.9 to 2.13 show the spectra distributions by charge state and/or m/z for the
ABRF sample. Note that there are only a few spectra above charge 3, so in this thesis,
all the analyses are shown up to charge 3.
Figure 2.9 Spectra Distribution for ABRF Dataset by Charge State
Figure 2.10 Spectra Distribution for ABRF Dataset by MH+
Figure 2.11 Spectra Distribution for ABRF Dataset by MH+ and Charge State
Figure 2.12 Spectra Distribution for ABRF Dataset by m/z
Figure 2.13 Spectra Distribution for ABRF Dataset by m/z and Charge State
2.3.4 Phoebe3: the Complex Sample Dataset
To represent the type of data obtained when complex samples with hundreds of
proteins are analyzed, a dataset referred to as Phoebe3 was utilized. The sample was
generated by fractionating a soluble extract of a human melanoma cell line (WM115;
Bernard et al., 2003). Fractionation was done by ion exchange (MonoQ), and the full
profile generated 30 fractions, although for convenience only a portion of the
fractions is utilized in this analysis. The dataset includes fractions 20-27, with three
LC/MS analyses of each; thus, the dataset consists of 24 LC/MS analyses, each 2
hours long. The total number of MS/MS spectra was 114,039. Data were collected on the
LTQ/Orbitrap as described for the ABRF sample, and accurate masses must be used
to identify correct vs. incorrect assignments, because the proteins in the sample are
unknown. Figures 2.14 to 2.18 show the spectra distributions by charge state and/or
m/z for the Phoebe3 sample. As with the ABRF dataset, there are only a few spectra
above charge 3, so in this thesis, all the analyses are shown up to charge 3.
Figure 2.14 Spectra Distribution for Phoebe3 Dataset by Charge State
Figure 2.15 Spectra Distribution for Phoebe3 Dataset by MH+
Figure 2.16 Spectra Distribution for Phoebe3 Dataset by MH+ and Charge State
Figure 2.17 Spectra Distribution for Phoebe3 Dataset by m/z
Figure 2.18 Spectra Distribution for Phoebe3 Dataset by m/z and Charge State
2.3.5 Q: a Complex LCQ Sample
For data mining studies of protease specificity and other peptide properties, another
dataset (the Q dataset) was collected under the same conditions as the PLC dataset,
except that the proteins were fractionated, and then each fraction was digested for
LC/MS/MS analysis. In addition, the data were collected under conditions of higher
sampling, which made the dataset much bigger (~800,000 MS/MS total). The high
confidence peptide assignments from this dataset were used for data mining studies;
specifically, this approach was only used to determine the missed cleavage rules
described in Chapter 3.
2.4 Separated and Concatenated Target-Decoy Searches
For complex samples, where the protein composition is not known, neither
performance metrics nor ROC curves can be applied to assess performance. The
target-decoy search strategy and the high mass accuracy method are the solutions in
that scenario. Due to its simplicity, the target-decoy search strategy described in
Section 1.6 is the commonly used method to evaluate the performance of protein
sequence search programs on a complex sample. In this method, searching is done
against two databases, the target and the decoy. The target database is usually the
normal sequence database, or it can be the set of protein sequences the researchers
expect. The decoy database can be the reversed sequence database or a database from a
species different from the target database (with enough evolutionary distance that
there are few exactly identical peptides). However, the reversed database is the most
common choice because it preserves the properties of the target database and makes
false positives equally likely in the two databases (Elias and Gygi, 2007).
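Constructing a reversed-sequence decoy database is mechanical. The sketch below is illustrative; the `rev_` header prefix is an arbitrary convention of mine, not a fixed standard:

```python
def reversed_decoy(entries):
    """Build reversed-sequence decoy entries from a list of
    (header, sequence) pairs parsed from a FASTA file.

    Prefixing the header lets decoy hits be recognized later when the
    target and decoy databases are concatenated and searched together.
    """
    return [("rev_" + header, sequence[::-1]) for header, sequence in entries]
```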
There are two methods to use these databases: separation or concatenation of the
target and decoy databases. The separated target-decoy database search strategy is
also called a normal-flip search or normal-inverted search. It searches two databases
separately using the same dataset and estimates the false positives using the result of
the decoy search. This method has been used for many years. The concatenated
strategy was first advocated by Gygi in 2003 (Peng et al., 2003). It combines the
target and decoy databases and searches this combined database once. This method
makes the target and decoy sequences compete for the best hits. The advantage of the
concatenated method is that only one search is required, and it eliminates the
complication of FPs arising in the inverse search from the MS/MS that are high
scoring TPs in the normal search. Moreover, Elias and Gygi (2007) show that the
concatenated method predicts the false positive rate more accurately than the
separated one.
2.4.1 Reproduction of Elias and Gygi's Method
The Gygi study only evaluated the results of a Sequest search and only utilized a
complex sample; in this section, I used their method to evaluate the Mascot search
78


program with the standard ABRF dataset. The IPI Human database v3.27 was the
target database, and the decoy database was constructed by reversing the protein
sequences of the target database. The concatenated database was then built by
merging two databases. The ABRF dataset was searched using Mascot against the
target, decoy, and concatenated databases, keeping all other search parameters the
same. The search results were then sorted by Mowse score and binned so that each
bin was 2 Mowse units wide. For the separated target-decoy method, the positive and
negative peaks in the score distribution plot (Figure 2.19) were generated from the
frequencies of the hits in each score bin of the target-only and decoy-only searches,
respectively. For the concatenated target-decoy search, the positive and negative
peaks in the score distribution plot (Figure 2.20) were generated from the frequencies
of the target and decoy hits, respectively, in each score bin. These figures represent
the observed score distributions. Note that the total numbers of DTAs differ between
Figures 2.19 and 2.20, because Figure 2.19 combines the results of two searches. This
is why a method to unify these results is needed, so the next step is to estimate the
score distributions from the observed distributions.
The estimated number of correct and incorrect hits for the concatenated search is
based on the assumption that the chances of having the FPs for the target and decoy