Citation
An expert system incorporating data simulation, feature recognition, model fitting, and data analysis functions

Material Information

Title:
An expert system incorporating data simulation, feature recognition, model fitting, and data analysis functions
Creator:
Sun, Shaojun
Publication Date:
Language:
English
Physical Description:
xix, 244 leaves : ; 28 cm

Subjects

Subjects / Keywords:
Proteins -- Analysis ( lcsh )
Mass spectrometry ( lcsh )
Amino acid sequence ( lcsh )
Expert systems (Computer science) ( lcsh )
Amino acid sequence ( fast )
Expert systems (Computer science) ( fast )
Mass spectrometry ( fast )
Proteins -- Analysis ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 238-244).
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Shaojun Sun.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
527656381 ( OCLC )
ocn527656381
Classification:
LD1193.E52 2009d S86 ( lcc )

Full Text
AN EXPERT SYSTEM INCORPORATING DATA SIMULATION, FEATURE
RECOGNITION, MODEL FITTING, AND DATA ANALYSIS FUNCTIONS
by
Shaojun Sun
B.S., Southeast University, 1990
M.S., Southeast University, 1993
M.S., University of Colorado Denver, 2003
A thesis submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
in Computer Science and Information Systems
2009


This thesis for the Doctor of Philosophy
degree by
Shaojun Sun
has been approved by
Dr. Bogdan Chlebus
Dr. Michael V. Mannino

Date


Sun, Shaojun (Doctor of Philosophy, Computer Science and Information Systems)
An Expert System Incorporating Data Simulation, Feature Recognition, Model Fitting,
and Data Analysis Functions
Thesis directed by Associate Professor Ilkyeun Ra
ABSTRACT
A major limitation in protein identification from complex mixtures is the ability
of search programs to accurately assign peptide sequences using mass spectrometric
fragmentation spectra (MS/MS). Manual analysis is used to assess borderline
identifications; however, it is error-prone and time consuming, and criteria for
acceptance or rejection are not well defined. The primary computational strategy for
MS/MS identification requires the prediction of spectra from peptide sequences.
However, even state-of-the-art algorithms such as MASCOT do not evaluate intensity
information in experimental spectra, considering only the theoretical fragment masses
of a candidate peptide sequence. In this dissertation, I report a Manual
Analysis Emulator (MAE) expert system which implements criteria used in manual
analysis of low energy collision-activated dissociation (CAD) spectra.
MAE evaluates the chemical plausibility of peptide assignments by measuring the
similarity between experimental spectra and predicted spectra. The predicted spectra
are simulated using a kinetic model of peptide fragmentation for each candidate
sequence assignment. The kinetic model used in the initial version of MAE was
developed by Z. Zhang and is based on known gas phase mechanisms of peptide
dissociation (Anal. Chem. 2004, 76:3908-3922), as implemented in the software
program MassAnalyzer. To add new chemical mechanisms and use a more powerful
optimization method for parameter fitting, I created my own implementation of the
kinetic model, S3. Parameters are fit using a constrained Levenberg-Marquardt (LM)
algorithm with a novel merit function (dSIM). Classic chi-square LM fitting
followed by dSIM score fitting significantly improves the similarity between the
experimental and predicted spectra. Receiver Operating Characteristic (ROC) plots for
peptide identification using S3 in conjunction with MAE demonstrate that the
constrained LM algorithm improves discrimination. Additionally, the fast running
time and effective convergence to local minima in S3 will aid in the testing of new
chemical mechanisms. Machine learning methods are used to determine the factors
(features) that contribute most significantly to inaccurate prediction in the current
kinetic model (Zhang, 2004, 2005), thus revealing where new chemical mechanisms
are required. We demonstrate improvement in prediction accuracy by incorporating a
novel N2 C-terminal proline cleavage mechanism into the kinetic model.


This abstract accurately represents the content of the candidate's thesis. I recommend
its publication.
Signed


ACKNOWLEDGMENTS
I am especially indebted to Dr. Katheryn Resing for her supervision, advice, and
guidance from the start of my research, and for sharing her extraordinary
knowledge and experience throughout my research.
I would like to express my gratitude to Dr. Zhong-qi Zhang (Amgen) for
providing predicted spectra and for advice in reproducing the program in our
laboratory.
I gratefully acknowledge Dr. Ilkyeun Ra for his advice and support, particularly in the
final stage of completing the thesis proposal.
I also acknowledge the contribution of Dr. William M. Old in providing
training data, statistical analysis methods, and help with writing and revising the dissertation.
I gratefully thank Dr. Natalie Ahn for her support, and I have benefited from the comments
and suggestions of the members of the Old laboratory.
I also thank Dr. Meredith Betterton for help with nonlinear multidimensional dynamic model
fitting, Dr. Krzysztof J. Cios for help with machine learning approaches, Karen Kafadar
(UC Denver Health Science Center) and Dr. Robin Knight (UC Boulder) for
assistance with ROC analyses and other statistical advice, and Alex Mendoza for
assistance with data management in early phases of this project.
My wife deserves special mention for her wholehearted support of my work.
This work was supported by NIH R01 CA87648 (KAR) and NCI R21 CA125291,
new methods for phosphopeptide identification.
I would like to thank everyone who has helped me in the course of my research,
and I apologize that I cannot mention them one by one.


TABLE OF CONTENTS
Figures.........................................................................xiii
Tables...........................................................................xvi
Chapter
1. Introduction....................................................................1
1.1 Background.................................................................1
1.1.1 Using Score Thresholds to Replace Manual Curation....................3
1.1.2 Predicting Peak Intensities..........................................5
1.1.3 Feature Recognition of Peptide Bond Cleavages........................6
1.1.4 New Knowledge Generation.............................................8
1.1.5 The S3 MS/MS Simulation Model and Fitting Algorithm.................9
1.2 Comparison of MAE Expert System with the Gygi Approach...................10
1.3 Specific Tasks...........................................................12
1.3.1 Fragment Ions Identification to Understand Peptide Dissociation ....14
1.3.2 Create a New Scoring Schema(Inference Engine).......................15
1.3.3 Improve the Kinetic Model...........................................16
1.3.4 Fitting Parameters Simultaneously ..................................17
2. Literature Review..............................................................19
2.1 Expert System...............................................................19
2.1.1 Introduction of Expert System........................................19
2.1.2 Computer Hardware and Software Development in Mass Spectrometry.... 21
2.1.3 Expert System Application in Mass Spectrometry......................21
2.1.4 Rule-Based Expert System............................................27
2.1.5 Expert System Design................................................28
2.1.6 MAE Expert System Structure.........................................30
2.2 The Spectra Prediction Model and Its Fitting................................35
2.2.1 Description of the Spectra Prediction Model..........................35
2.2.2 Fitting the Nonlinear Multidimension Kinetic Simulation Model.......39
2.2.3 The Nature of the Chemistry Considered in Designing Fitting Protocol..41
2.2.4 Overall Strategy for Fitting Parameters of the Model................42
2.2.5 Limitation of the Model.............................................44
2.3 Chemistry and Mass Spectrometry Domain Knowledge............................51
2.3.1 Overview of Our in-house Analysis System.............................51
2.3.2 Generation of MS and MS/MS Spectra in Proteomic Profiling...........54
2.3.3 Converting Peptide Sequence to Predicted Spectra....................56
2.3.4 Predicting Peptide Fragmentation is Critical for Identifying Sequence.57
2.3.5 Spectrum to Spectrum Matching.......................................59
2.3.6 Peptide Fragmentation...............................................60
2.3.7 Interpretation of MS/MS Spectra.....................................62
2.3.8 Mobile Proton Hypothesis...........................................64
3. Feature Recognition..........................................................67
3.1 Simplifying Information in DTA Files....................................68
3.1.1 Creating an Ion List...............................................68
3.1.2 Removing Noisy Ions................................................70
3.1.3 Illustrating the Full Process......................................71
3.1.4 Implementing the De isotope Function...............................74
3.1.5 Characterizing the sDTA Information................................76
3.1.6 Testing Whether sDTA are Sufficient for the Database Search........78
3.2 Available Feature Recognition Methods...................................79
3.3 Matching Predicted to Observed Peaks by Mass/Charge....................82
3.3.1 Meaning of Term mass or m/z........................................83
3.3.2 Predicted m/z Values...............................................84
3.3.3 Matching Using m/z Tolerance.......................................84
3.4 Assigning Ions in Order of Likelihood of Observation...................85
3.4.1 Ion Type List......................................................87
3.4.2 Rules for Fragment Ions Generated by Secondary Cleavage...........90
3.5 Implementing the Priority Rules.........................................93
3.5.1 Pseudocode for Implementing the Chemical Rules.....................93
3.5.2 Pseudocode for Labeling Parent and Its Neutral Loss Products.......95
3.5.3 Pseudocode for Labeling b and y Ions...............................97
3.5.4 Pseudocode for Labeling a, Dehydrated and Deammoniated b, y Ions ... 98
3.5.5 Pseudocode for Labeling Dehydrated and De-ammoniated a Ion,
Double Dehydrated and De-ammoniated b and y Ions...................100
3.5.6 Pseudocode for Labeling Triple Dehydrated b and y Ions............101
3.5.7 Pseudocode for Labeling Internal Fragment Ions....................103
3.6 Additional Heuristic Rules..............................................105
3.7 Formatting the Report as a Worksheet....................................106
3.8 Evaluating the Feature Recognition..................................... 108
3.9 Conclusion..............................................................108
4. Metrics for Evaluation Goodness of Fit between Two Spectra..................110
4.1 Scores for Characterizing the Similarity to the Predicted Spectra
and the Chemical Plausibility of the Peptide Assignment.................110
4.1.1 Similarity Score...................................................111
4.1.2 Correlation Score................................................112
4.1.3 Proportion of Ion Current (PIC) Score............................112
4.1.4 y-b/y+b Score....................................................113
4.1.5 Internal Fragment Ions Score..................................... 113
4.1.6 Cross-Correlation Score...........................................114
4.2 MAE Score Schema........................................................115
4.3 Similarity Scoring against Predicted MS/MS Spectra......................117
4.4 Discussion of Scoring...................................................127
4.4.1 Validation of a Spectral Chimera through SIM and PIC Score.......127
4.4.2 Protein Profiling Results......................................... 130
4.4.3 Conclusion........................................................ 133
5. Data Mining in Spectra Prediction Model......................................136
5.1 Introduction of Data Mining..............................................136
5.1.1 Data Mining Process................................................137
5.2 N2 Cleavage Fragmentation Mechanism.....................................139
5.2.1 Problem Definition.................................................140
5.2.2 Data Collection....................................................142
5.2.3 Data Prepare.......................................................143
5.2.4 Data Preprocessing.................................................143
5.2.5 Algorithm Selection and Execution..................................145
5.2.6 Final Evaluation...................................................154
5.3 Other Chemistries not well Modelled.....................................155
5.3.1 Three Types of Problems in MassAnalyzer.............................155
5.4 Conclusion..............................................................156
6. Model Fitting Algorithm......................................................158
6.1 Introduction.............................................................159
6.2 Least Squares as a Merit Function.......................................161
6.3 Nonlinear Multidimensional Dynamic Model and Principle of Fitting.......162
6.4 Fitting Methods.........................................................166
6.4.1 Naive Fitting Method.............................................. 169
6.4.2 Levenberg-Marquardt Method..................................... 173
6.4.3 Levenberg-Marquardt Execution...................................180
6.4.4 Fitting Results.................................................187
6.5 Test and Comparison..................................................192
6.6 Improvement of the Kinetic Model.....................................197
6.6.1 Adding Mechanism N2 for Second Amino Bond Cleavage.............197
6.6.2 Adjusting the Effective Temperature for the Current Cycle.......200
7. Summary and Conclusion....................................................203
7.1 Conclusion...........................................................203
7.2 Contribution.........................................................208
7.3 Further Works........................................................210
Appendix A: First Derivative for Each Parameter..............................213
Appendix B: Math Principle of Square Root Fitting............................236
Bibliography.................................................................238


LIST OF FIGURES
Figure
1.1 The Basic Function of an Expert System.......................................2
1.2 A Peptide Sequence String Connected by Peptide Bonds that are cleaved
during MS/MS Process, Producing b and y ions.................................7
1.3 High-level View of the MAE Expert System.....................................13
2.1 Rule-based Model in Expert System............................................27
2.2 Knowledge Elicitation Process...............................................28
2.3 High Order View of the MAE Expert System....................................32
2.4 Four Levels Tree Structure for Predicted Spectra Storage on Hard Disk.......34
2.5 The Kinetic Modeling of Peptide Fragmentation...............................37
2.6 Fragment Pathways Modeled by MassAnalyzer...................................38
2.7 Processing Diagram for MS/MS Spectra Prediction Model Fitting...............44
2.8 The Difference between Predicted and Experimental Spectra
on the Dehydrated Fragmental Ions.............................................48
2.9 The Difference between Predicted and Experimental Spectra
on N Terminal Second Amino Bond Broken (yn-2 ion).............................49
2.10 The Difference between Predicted and Experimental Spectra
on Internal Fragmental Ions...................................................50
2.11 Overview of Protein Identification in Shotgun Proteomics, showing the Peptide
Identification Functions that are Impacted by the MAE Expert System Functions .... 53
2.12 MS/MS Analysis of an Ion in a Complex Sample...............................55
2.13 An MS/MS Experimental Spectra.................................................55
2.14 Nomenclature of Fragmental Ions...............................................62
2.15 Peptide Ions Generated in MS/MS Fragmentation.................................63
2.16 Mobile Proton Model..........................................................65
2.17 Performance of Charge Remote Fragmentation....................................66
3.1 Three Scans MS/MS Spectra.......................................................73
3.2 Three Scans Merge into One Spectrum............................................73
3.3 sDTA Processing for Removing Isotope Ions and Noises...........................73
3.4 Error Analyses of the Singly Charged Ions from High Scoring Peptides...........77
3.5 A Features Labeled Experimental Spectrum by Mascot correctly Assignment.........81
3.6 A Features Labeled Experimental Spectrum by Mascot incorrectly assignment.......81
4.1 Machine Learning: Divide the Testing Data,
Which was already Manually Identified, into Two Groups (Yes and No)..............116
4.2 Decision Tree: Five Features, SIM, PIC, y-b/y+b, intfrag, charge...............117
4.3 sDTA and DTA Different Effects on SIM Scoring..................................119
4.4 Receiver Operating Characteristic (ROC) Analyses Comparing MAE and Mascot 123
4.5 An Example of a Spectral Chimera and the MS/MS Spectra of Peptides.............129
4.6 Comparison of Peptides and Proteins from MSPlus, Mascot,
and MAE Rescoring of the Mascot Assignments......................................131
5.1 Knowledge Discovery Tasks.......................................................138
5.2 Data Mining Process and Lifecycle..............................................138
5.3 Log plot of intensity of the y(n-2) ions theoretical (theor) vs. observed(obs).142
5.4 Decision Tree for WP vs. UP.................................................153
6.1 An Overview of How to Simultaneously fit a Set of 264 Parameters a .....167
6.2 Naive Fitting Processing of one Parameter Q for Intuitive Analysis........172
6.3 Naive Fitting Processing of two Parameters (f1, f2)
One by One Fitting for Intuitive Analysis.................................173
6.4 Experiment MSMS Spectra Identified Sequence as ESQSHPGDFVLSVB.........188
6.5 SIM Score Assignment Improvement by the Fitted Model ....................191
6.6 SIM Score Assignment on the large Independent set
Showing the Same Result of Figure 6.51 ...................................192
6.7 Comparison of the Discrimination before and after Fitting...............194
6.8 ROC Curves Showing Three Classifiers Identification.....................195
6.9 Different Predictions by S3 Model without/with N2 Cleavage..............200
6.10 Curve of a Function with Denominator $1 + Ae^{-h(x-W)}$................202


LIST OF TABLES
Table
3.1 MAE Expert System Labeling Fragment Ions....................................107
4.1 Coefficient Value for Calculation of MAE Score..............................115
4.2 Datasets Used in This Study
(score distributions for samples 1 and 2 or 3-5 are shown in Figure 4.4).....125
5.1 Attributes Used in Machine Learning Algorithm...............................146
6.1 Processing of Fitting 264 Parameters Simultaneously.........................190
6.2 Fitting Algorithms Comparison..............................................196
6.3 Chi-square Improvement by Adding N2 Fragmental Pathway.....................199


Preface
Proteomics is the systematic measurement of proteins within living systems. In
humans, complete protein profiling is especially challenging, where a single cell
expresses more than 10,000 different proteins. Recent advances in mass spectrometry
instrumentation have now enabled the simultaneous identification and quantification
of nearly all expressed proteins in a single sample, opening the door for system-wide
study of complex human diseases, in particular cancer, where the cell's normal
regulatory circuits fail to properly control cell growth. Despite these recent
technological advances, a major bottleneck remains in the computational methods for
identifying proteins from the millions of peptide fragmentation spectra (MS/MS)
typically generated in large scale proteomics studies. The predominant approach to
protein identification is sequence database searching, where the experimental spectra
are compared to sequences in a protein database, by translating these sequences to
model spectra. Current approaches suffer from poor accuracy and discrimination,
primarily due to the use of naive fragmentation models for predicting spectra,
ignoring the rich information contained within the relative intensities of peaks in a
typical MS/MS spectrum. These intensities are a direct reflection of the underlying
peptide chemistry, and are typically used in manual analysis by experts to evaluate
the quality and plausibility of MS/MS peptide assignments. This thesis describes an
expert system, the Manual Analysis Emulator (MAE), for automated large scale
peptide identification. By using expert knowledge to evaluate chemical plausibility of
peptide identifications, as encapsulated in a kinetic model of fragmentation developed
by the author, MAE is more accurate and has significantly higher discrimination than
current spectrum to sequence matching algorithms. Importantly, MAE automates the
identification validation process, making it amenable to current large scale proteomic
workflows. In addition to the significant contributions to the field of computational
proteomics, the author has developed an innovative constrained Levenberg-
Marquardt algorithm for non-linear optimization of very large models containing
hundreds of free parameters. This is used to fit S3, a 264-parameter kinetic model of
gas phase peptide fragmentation developed by the author. When incorporated into the
MAE expert system, S3 increases discrimination beyond that achieved with
previously published kinetic models.
This thesis is organized as follows. Chapter 1 begins with a brief background and
rationale for developing the MAE expert system, followed by a general overview of
each component of MAE. Chapter 2 reviews expert system design and provides a
more detailed description of the MAE architecture. Chapter 3 focuses on the feature
extraction component of MAE, which is used to identify new chemical mechanisms.
Chemical plausibility of MS/MS peptide assignments is evaluated by comparing
simulated spectra with the experimental spectrum using an extensive set of scoring
methods, as described in Chapter 4. Chapter 5 describes the use of machine learning
to discover new mechanisms for improving the model's prediction accuracy. A
constrained version of the classical Levenberg-Marquardt algorithm was developed to
optimize the S3 model, as described in Chapter 6. The extensive set of equations
derived for implementation of the Levenberg-Marquardt algorithm is found in
Appendix A. Chapter 7 is a summary.


1. Introduction
1.1 Background
A major problem in proteomics is the identification of proteins in samples as
complex as whole-cell lysates or tissue extracts (McCormack, et al., 1997, Eng, et al.,
1994). At this time, the foremost method for accomplishing this goal is to treat the
proteins with an enzyme that cleaves the proteins into peptides. These peptides are
resolved by a chromatographic method, and then analyzed by a mass spectrometer
(MS) as they elute from the chromatography system. Each peptide ionizes in the MS
and is then fragmented in the gas phase into even smaller pieces, yielding
fragmentation or MS/MS spectra. The information in the MS/MS spectra is then
queried against a database of protein sequences to identify the peptide sequence and
infer the presence of the protein (Sadygov, et al., 2004).
It was recognized early on that scores from search programs were poor at
discriminating between correct and incorrect sequence assignments (MacCoss, et al.,
2002). When experiments involved samples with only a handful of proteins and MS
instruments collected data slowly, the number of MS/MS spectra was small, and
manual curation was used to validate the low scoring cases. However, manual
analysis lacks uniform criteria and can be error prone (Johnson et al., 2005).
Furthermore, current experiments can generate millions of MS/MS spectra, making manual
analysis impossible. This thesis describes the development of an expert system to
automate the validation process, increase search engine discrimination and provide
the basis for data mining studies to identify unknown problems in the evaluation of
MS/MS spectra. An expert system is a computer program that represents and reasons
based on knowledge of some specialist subject with a view to solving problems or
giving advice (Jackson, 1999), or that models the problem-solving ability of a human
expert (Durkin, 1994). In this case, I am modeling the expert curation of MS/MS data,
and refer to the program as the Manual Analysis Emulator, or MAE. The major parts
of an expert system are the knowledge base utilized by the software, and inference
mechanisms, as diagrammed in Figure 1.1.
[Figure 1.1: diagram of an expert system consisting of a knowledge base and an inference engine]
Figure 1.1 The Basic Functions of an Expert System


1.1.1 Using Score Thresholds to Replace Manual Curation
The most common alternative to manual curation is to specify score thresholds for
acceptance to control the false discovery rate. These thresholds are set either by
searching datasets against an inverted sequence database of similar size to identify
false positive scores (Moore, et al., 2002), or by statistical analysis of multiple scores
(Keller, et al., 2002). Early studies validated the score threshold method using yeast
extracts of yeast, which has a relatively small genome. The sensitivity of this approach
degrades when larger databases are used, such as those required for human samples.
We have estimated that >40% of correct peptide assignments in a mammalian sample
are rejected when using stringent thresholds (Resing, et al., 2004), reducing sensitivity
of protein identification. This results in a large amount of wasted data collection time.
For example, in a recent comparison of mass spectrometry platforms, the majority of
collected spectra were not identified correctly; of an average of 15,309 spectra
acquired in three replicate runs, only 23% yielded high-confidence peptide-spectral
matches (Elias, et al., 2004).
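The threshold-setting strategy based on an inverted (decoy) database can be illustrated with a minimal sketch. It assumes that each spectrum has been searched against both the normal and the inverted database, and that the false discovery rate at a threshold is estimated as the number of decoy hits divided by the number of target hits at or above that threshold. This estimator is a common choice and is not necessarily the one used in the studies cited above; the function names and score lists are hypothetical.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate the FDR at a threshold as (#decoy hits) / (#target hits).

    Assumption: hits against the inverted database approximate the number
    of false positives among the hits against the normal database.
    """
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        return 0.0
    return min(1.0, n_decoy / n_target)


def lowest_threshold_for_fdr(target_scores, decoy_scores, max_fdr):
    """Return the lowest observed target score whose estimated FDR is
    at most max_fdr, or None if no such threshold exists."""
    for threshold in sorted(set(target_scores)):
        if estimate_fdr(target_scores, decoy_scores, threshold) <= max_fdr:
            return threshold
    return None


if __name__ == "__main__":
    # Hypothetical search scores, for illustration only.
    targets = [10, 20, 30, 40, 50, 60]
    decoys = [5, 12, 15, 22]
    print(lowest_threshold_for_fdr(targets, decoys, max_fdr=0.25))
```

Raising the threshold lowers the estimated false discovery rate but also discards correct assignments that score below it, which is the loss of sensitivity discussed above.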
To minimize false negatives, investigators often reduce the acceptance threshold
in order to capture more information, which results in increased false positives.
Several methods have been developed to filter the resulting false positives, based on
agreement between chemical properties of the peptides and behavior on ion exchange
or reversed phase chromatography (Norbeck, et al., 2005), exact mass measurements,
or differences in scores between the top scoring peptides and lower scoring
candidates (Fenyo, et al., 2003, Sadygov, et al., 2004). Methods have also utilized
intensity information in statistical approaches for validation (Havilo, et al., 2003,
Narasimhan, et al., 2005).
Nevertheless, manual analysis of MS/MS spectra by an experienced annotator,
who examines each spectrum for chemical plausibility, is regarded by many as the
best method for validating borderline cases (Eddes, et al., 2002). In particular, manual
analysis (1) evaluates fragment ion peak intensities for chemical plausibility, where
the intensity of the observed fragment ions is correlated with the chemical reactivity
of the bonds in the peptide, (2) considers other fragment ion types not evaluated by
the search program, and (3) accounts for all the ion intensity in the MS/MS spectrum,
excluding noise peaks. The MAE expert system incorporates all three of these
components. Simulated spectral intensities, based on a chemical model of
fragmentation, are used to evaluate chemical plausibility. A novel feature recognition
method was designed to consider all ion types typically observed, rather than the
limited set of canonical ion types considered by most search algorithms. The ions are
labeled according to a hierarchical set of expert rules, which allows for an accurate
accounting of the total identified intensity.


1.1.2 Predicting Peak Intensities
The development of accurate methods for predicting fragment ion intensity would
allow evaluation of chemical plausibility in an automated fashion. Statistical analysis
of fragment ion intensities in MS/MS spectra has led to the development of
mechanisms that account for differences in gas phase chemical reactivity at different
peptide bonds (Paisz et al., 2005, Wysocki, et al., 2005). In manual analysis, heuristic
rules are used to test whether observed fragment ion intensities are consistent with the
expected probability of cleavage at specific sites (Johnson, et al., 2005, Wysocki, et
al., 2005). Common examples include enhanced cleavage N-terminal of Proline
(usually a major ion, even in weak spectra) or reduced cleavage C-terminal of Gly
(usually a minor ion, even when overall intensity is high) (Paisz et al., 2005,
Wysocki, et al., 2000).
A different approach became possible when the established mechanisms for
peptide MS/MS fragmentation were incorporated into a kinetic model by Z. Zhang
and implemented in the program MassAnalyzer. The simulator predicts relative
fragment ion intensities, showing excellent agreement with experimental intensities for a
wide variety of peptides (Zhang, 2004, 2005). This suggested that simulated spectra
can be used to automate the evaluation of chemical plausibility in validation of search
results in complex samples.
In the first part of this thesis, I describe the construction and validation of the
MAE expert system, which incorporates this kinetic model. In particular, MAE
calculates similarity scores between the MassAnalyzer simulated MS/MS spectra and
the experimental MS/MS spectra (SIM, or other scores, as described in Section 4.1).
Using MAE SIM scoring to validate search results, we found a substantial
improvement in discrimination between correct and incorrect assignments for LCQ
and LTQ MS/MS spectra from simple peptide digests (Sun et al., 2007). These results are
described in Section 4.4.
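To make the idea of a spectrum-to-spectrum similarity score concrete, the following minimal sketch compares a predicted and an experimental spectrum. It assumes one plausible form, the cosine of the angle between the square-root intensity vectors, and is not a statement of the SIM definition, which is given in Section 4.1. The function name and the dictionary representation of a spectrum (ion label or m/z bin mapped to intensity) are hypothetical.

```python
import math


def similarity(predicted, observed):
    """Cosine similarity between square-root intensity vectors.

    Each spectrum is a dict mapping an ion label (or m/z bin) to a
    non-negative intensity. The square root reduces the dominance of
    the few most intense peaks. This form is an assumption made for
    illustration, not the SIM score itself.
    """
    keys = set(predicted) | set(observed)
    # Dot product of sqrt-intensities: sqrt(p) * sqrt(o) = sqrt(p * o).
    dot = sum(
        math.sqrt(predicted.get(k, 0.0) * observed.get(k, 0.0)) for k in keys
    )
    # Product of the two vector norms: sqrt(sum p) * sqrt(sum o).
    norm = math.sqrt(sum(predicted.values()) * sum(observed.values()))
    return dot / norm if norm > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical intensities, for illustration only.
    predicted = {"b2": 10.0, "y3": 40.0, "y4": 50.0}
    observed = {"b2": 8.0, "y3": 45.0, "y5": 12.0}
    print(round(similarity(predicted, observed), 3))
```

The score is 1 when the two spectra have identical relative intensities and 0 when they share no peaks, and it does not depend on the absolute intensity scale of either spectrum.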
1.1.3 Feature Recognition of Peptide Bond Cleavages
All three of the tasks described in 1.1.1 require labelling of the ions in the MS/MS
by their chemical type. Currently, the only available feature recognition algorithms
are found in search programs, such as Mascot and Profound. These programs
recognize only the canonical cleavage products, most of which represent breaking of
the peptide bonds. A peptide can be represented as a sequence string (Figure 1.2),
and cleavage of each peptide bond will produce two products, a b-ion and a y-ion.
When viewed as an MS/MS spectrum, only some of these products are observed, and
they vary in intensity. Other ions are produced from other types of cleavages, but are
not shown here, for simplicity. The development of this part of MAE is described in
more detail in chapter 3.
b8 i
b7
, b4
b3i
b6
b5i
b1^
H- D
Y
V
Q
y7
L-y8
y2
,-y3
L-y4
_y5
y6
K- OH
yi
Figure 1.2 A peptide sequence string connected by peptide bonds that are
cleaved during the MS/MS process, producing b and y ions.
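The b- and y-ion series described above can be computed directly from residue
masses. The following is a minimal Python sketch, not the MAE implementation. It
assumes singly charged ions and monoisotopic masses; the function name by_ions and
the example peptide PEPTIDE are illustrative choices and are not the peptide shown
in Figure 1.2.

```python
# Monoisotopic residue masses (Da), listed only for the residues in the example.
RESIDUE_MASS = {
    "P": 97.05276,
    "E": 129.04259,
    "T": 101.04768,
    "I": 113.08406,
    "D": 115.02694,
}
WATER = 18.010565   # H2O, retained by the C-terminal (y) fragment
PROTON = 1.007276   # charge carrier for a singly charged ion


def by_ions(sequence):
    """Return singly charged b- and y-ion m/z values for each peptide bond."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    n = len(masses)
    b_ions, y_ions = {}, {}
    for i in range(1, n):
        # Cleaving bond i gives b_i (first i residues) and y_(n-i) (the rest).
        b_ions[i] = sum(masses[:i]) + PROTON
        y_ions[n - i] = sum(masses[i:]) + WATER + PROTON
    return b_ions, y_ions


peptide = "PEPTIDE"  # hypothetical example sequence
b, y = by_ions(peptide)
for i in sorted(b):
    j = len(peptide) - i
    print(f"b{i} = {b[i]:.4f}   y{j} = {y[j]:.4f}")
```

Each bond contributes one complementary pair, so a peptide of n residues has n-1
b ions and n-1 y ions; as noted above, only some of them are observed in practice.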
The MAE feature recognition method is fundamentally different from that used by
other programs. A novel probability ranking approach is used to annotate peaks
according to their likelihood of appearance, mimicking what an expert does when
annotating the features manually. This approach contrasts with other programs of
a similar nature, which attempt to develop a set of specific probabilities. However,
there is not enough information available to produce an accurate set of probabilities.
In fact, studies by Wysocki's group (Wysocki, et al., 2000, 2005) attempted to use
this type of data to develop a search program. Although a program was developed, it
has poor sensitivity and accuracy. My method is a type of rank products approach,
which has proven very powerful in dealing with situations where the underlying
probabilities are hard to evaluate (Bern, et al., 2008).
1.1.4 New Knowledge Generation
To illustrate how this feature recognition method has been used in new knowledge
generation, I will begin with the development of a scoring method that addresses the
third task in manual analysis (from 1.1.1): the evaluation of the proportion of the
MS/MS ion current that is accounted for by the peptide sequence (Proportion of Ion
Current, PIC) (Johnson, et al., 2005; Chen, 2005). This required development of
appropriate pre-processing of the data to remove noise and development of rules to
select the set of ions that would be classified by the feature recognition module.
From this, the PIC score is calculated. Examining a combination of SIM and PIC
scores revealed MS/MS spectra that are produced from more than one peptide ion
present in the MS isolation window (chimera spectra). This work provided the
motivation for development of additional software to characterize these chimeras in
quality control for data collection in the Old laboratory (Houel, in preparation).
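The idea behind the PIC score can be sketched in a few lines of Python. This is an
illustration under stated assumptions, not the MAE scoring code: the noise removal
is reduced to a single intensity threshold (noise_floor), the rules that select
which ions are classified are represented only by a label on each peak, and the
function name pic_score and the example peaks are hypothetical.

```python
def pic_score(peaks, noise_floor=0.0):
    """Fraction of the retained ion current carried by annotated peaks.

    peaks: list of (mz, intensity, label); label is None for unassigned peaks.
    noise_floor: peaks at or below this intensity are dropped first
                 (a simple stand-in for the noise removal step).
    """
    kept = [(inten, label) for _, inten, label in peaks if inten > noise_floor]
    total = sum(inten for inten, _ in kept)
    if total == 0:
        return 0.0
    explained = sum(inten for inten, label in kept if label is not None)
    return explained / total


# Hypothetical spectrum: three annotated peaks, one unassigned, one noise peak.
peaks = [
    (175.1, 40.0, "y1"),
    (262.2, 5.0, None),
    (288.2, 100.0, "y2"),
    (401.3, 55.0, "b3"),
    (450.0, 1.0, None),
]
score = pic_score(peaks, noise_floor=2.0)
print(f"PIC = {score:.3f}")
```

A spectrum whose PIC is low even though the assigned sequence is otherwise
plausible is the kind of case that can point to a chimera spectrum.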
Another important result of the feature recognition module in identifying new
knowledge was shown in our machine learning studies. In a collaboration with A.
Gehrke, K. Kaffadar, and K. Cios (Gehrke, 2008), I identified gas phase mechanisms
not previously recognized as important in low energy peptide fragmentation. The
development of attribute tables for these studies, as part of the MAE knowledge base,
and the study itself, are described in Chapter 5, sections 5.2 through 5.3. Thus, MAE
provides a useful platform for assessing validity of search program results and for
mining information about unusual cases.
1.1.5 The S3 MS/MS Simulation Module and Fitting Algorithm
The original design for MAE also included the simulation of the MS/MS spectra,
with a module to update the simulator with the new mechanisms. However, the
simulator in MassAnalyzer is not able to produce the next generation of simulated
spectra. This is due to the limitations in the methods developed by Zhang, in
particular the naive method of fitting the large number of parameters required for this
simulator and the inability to add new chemical mechanisms. In the final part of this
thesis, I describe the implementation of an alternative algorithm for fitting, and a
simulator that will incorporate these new mechanisms. This is described in Chapter
6, sections 6.2 through 6.4.
1.2 Comparison of MAE Expert System with the Gygi Approach.
A major reason for feature recognition is to provide input for machine learning
studies to understand the gas phase chemistry of the peptides. Even though this
approach was used for study of small molecules early in the development of machine
learning methods, it was not applied to peptides until 2004 (Elias, et al., 2004). This
was motivated by efforts to predict the intensity of ions in the MS/MS spectrum, in
order to improve scoring of database search results. However, the first big study,
conducted by a collaboration of groups at Harvard, produced very disappointing results:
only the so-called proline effect was detected. This is a very strong difference
in intensity between the C-terminal and N-terminal cleavages on either side of the
amino acid proline. Similar differences are detected adjacent to glycine, aspartic
acid, glutamic acid, and hydrophobic residues.
When features were detected by MAE and used in a similar study,
although directed at only one cleavage site, the sensitivity was much higher, and the
analysis accounted for nearly all the differences between predicted and observed
intensities. This is attributed to several things. First, the data were preprocessed to
remove noise and isotope ions in a very accurate manner. Second, the attributes
were simplified to eliminate all the correlated cases, choosing the chemically most
plausible alternative. Third, the attributes were normalized, because unnormalized
values gave more weight to the attributes that had higher values. The scoring methods
in MAE were used to select the MS/MS spectra where I had the most confidence in the
data. The full study describing how this was done is shown in Chapter 4, sections 4.1
to 4.2.
Difference in machine learning method between Gygi (Elias, JE et al. 2004)
and Gehrke (Gehrke, A et al. 2008): Gygi used two training data sets, one of MS/MS
spectra assigned to high-quality, nonredundant matched peptides, the other of the
same spectra assigned to mismatched peptides, to build two probabilistic models that
compute the likelihood of an experimental MS/MS spectrum given a candidate peptide.
Gygi chose 40 features to build a decision tree that classifies a candidate peptide as
a match or mismatch to an experimental MS/MS spectrum.
There are two primary limitations of the study: (1) only canonical fragment ions
were considered, leaving many ions in the MS/MS spectra unannotated and not
considered in the statistical model. For example, it did not consider multiply
dehydrated and de-ammoniated b and y ions and internal fragment ions. (2) Gygi
chose 40 features to train the model, many of them redundant, which can reduce
the accuracy of the model.
Gehrke used the same machine learning method to determine what factors
(features) affect the result (classes). The strengths of this analysis are (1) the use of a
feature selection algorithm to choose (prune) features that are highly predictive, and
(2) the use of nearly all known fragment ion types that exist in the experimental
MS/MS spectra. This maximized the amount of information used from the
MS/MS data and thus improved the model's accuracy of prediction.
1.3 Specific Tasks
Overview of expert system: The identification/validation of peptides by tandem
mass spectrometry is a central task in MS/MS research, but due to the complex
chemical properties of peptides, the accuracy of peptide identification/validation is
still limited. I designed the MAE expert system (Figure 1.4) to address this limitation.
MAE utilizes chemical knowledge to develop a feature recognition function for
identification of fragment ions, and to create a spectra prediction model. The scoring
schema in MAE is designed as an inference engine to decide whether a candidate
peptide is identified. Figure 1.3 shows how machine learning is used as an
assistant tool to improve the model. The flowchart in Figure 1.3 depicts a
high-level view of the proposed MAE expert system. The expert system and its
components are described in detail later in the proposal. Here I show it to set the stage
for the discussion.
Four tasks are indicated in Figure 1.3.
Task 1: feature recognition
Task 2: scoring schema
Task 3: machine learning
Task 4: fitting parameters
Figure 1.3 High-level view of the MAE Expert System
1.3.1 Fragment Ion Identification to Understand Peptide Dissociation
(Feature Recognition based on chemistry knowledge)
Fragment ions in MS/MS spectra are a function of the gas phase chemistry of the
peptide. Recognition of fragment ions can help researchers further understand the
peptide fragmentation process. Given an experimental spectrum assigned to a
candidate peptide, the MAE expert system assigns fragment ions by matching the
peaks to the fragments' theoretical m/z values within a user-specified mass tolerance.
Identifying fragment ions is a key process in identifying the peptide, and increases the
specificity and sensitivity of the peptide sequence assignments. This is required when
the candidate peptide sequence is identified by a search engine and the
fragment ion assignment is not certain. MAE significantly decreases the number of
unidentified ions. I have implemented this feature in MAE to recognize fragment ions
left unidentified by the search engine Mascot, based on an expert knowledge base.
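The tolerance-based matching step can be sketched as follows. This is a minimal
Python illustration, not the MAE code: it covers only the m/z comparison and leaves
out the probability ranking and the expert knowledge base. The function name
match_peaks, the tolerance of 0.5, and the example values are assumptions.

```python
def match_peaks(observed, theoretical, tol=0.5):
    """Assign observed peaks to theoretical fragment ions.

    observed: list of (mz, intensity) peaks from the experimental spectrum.
    theoretical: dict mapping an ion label to its theoretical m/z.
    tol: user-specified mass tolerance, in m/z units.
    Returns a dict mapping each matched label to the closest observed peak.
    """
    matches = {}
    for label, theo_mz in theoretical.items():
        candidates = [p for p in observed if abs(p[0] - theo_mz) <= tol]
        if candidates:
            # If several peaks fall inside the window, keep the closest one.
            matches[label] = min(candidates, key=lambda p: abs(p[0] - theo_mz))
    return matches


# Hypothetical observed peaks and theoretical ions for a candidate peptide.
observed = [(98.1, 12.0), (148.0, 80.0), (227.3, 30.0), (300.9, 5.0)]
theoretical = {"b1": 98.060, "y1": 148.060, "b2": 227.103, "y2": 263.087}

matches = match_peaks(observed, theoretical, tol=0.5)
matched_peaks = set(matches.values())
unassigned = [p for p in observed if p not in matched_peaks]
print(matches)
print("unassigned:", unassigned)
```

Peaks that remain unassigned after the canonical ions are matched are the ones
that additional ion types would need to explain.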
My goal is to identify ions in the experimental MS/MS spectra to achieve better
scoring results via the use of doubly or triply charged ions, internal fragment ions,
isotope ions, dehydrated and de-ammoniated ions, and immonium ions. In fact, I believe
better performance can be achieved using the unique chemical properties of these
assigned ions. For example, a typical MS/MS spectrum with a triply dehydrated ion
indicates the presence of at least 3 serine or threonine amino acids in the candidate
peptide sequence (Sun, et al., 2007), and aspartic acid in a singly charged peptide
sequence always leads to observation of a corresponding y ion with higher
intensity. I will use all this chemistry information to identify ions and calculate
scores.
The benchmarks for completion of this specific aim, feature recognition (ion
identification), are (1) extending my current ion types to include triply charged ions,
doubly or triply dehydrated and de-ammoniated ions, and internal fragment ions, and (2)
implementing this in source code to provide standard ion identification information
for users.
1.3.2 Create a New Scoring Schema (Inference Engine)
The inference engine is a key component in an expert system and acts as the
decision engine to judge whether a candidate peptide sequence is a good or bad fit
to the experimental spectra. Here I create a combined scoring schema as an
inference engine in the MAE expert system. I apply the results of Task 1 to calculate
the PIC score, and the new predicted spectra from the improved prediction
model to calculate the SIM score. Finally, I create a MAE combined score to judge
whether a candidate peptide matches the experimental MS/MS spectra. The new scoring
scheme can achieve higher discrimination without compromising speed.
1.3.3 Improving the Kinetic Model
Improving the accuracy of MS/MS spectra prediction can increase the number of
accurate peptide matches provided from an initial search engine, further increasing
discrimination. I develop an entirely new approach to validating identification that
uses a chemical kinetic model of cleavage to predict the relative ion intensities of the
fragment ions (Zhang, 2004, 2005). I have already shown excellent discrimination
between correct and incorrect assignments based on our application of predicted
MS/MS spectra for low resolution LCQ/LTQ mass spectrometers (Sun, et al., 2007).
I implemented a prototype program that calculates the predicted spectra for singly
and doubly charged ions for the LCQ/LTQ traps, as a replacement for MassAnalyzer.
To fit the parameters of the model, I designed the fitting strategy (Task 4) to handle
264 parameters and optimize all the chemical constants for our sequence spectra
prediction model.
There are three important classes of ions that are poorly predicted in
MassAnalyzer spectra. They are de-ammoniated and dehydrated ions,
internal fragment ions, and N2 cleavage ions (N2 cleavage means that the
peptide fragments at the second N-terminal amino acid bond, resulting in two
complementary fragments). In particular, I focus on adding new mechanisms and
adjusting or fitting parameters to generate the fragment ions with more accurate
intensities. The parameters of the model are optimized to minimize the chi-square
between the predicted and experimental spectra.
1.3.4 Fitting Parameters Simultaneously
Current spectral library search engines, such as X! Hunter, use a reference library
of previously observed spectra to identify experimental spectra. Alternatively, we can
use our kinetic model to predict MS/MS spectra to populate such a reference library
and use this for searching. This addresses the extremely low coverage of current
libraries, which cover no more than 10-15% of the human proteome. However, the
success of this approach depends on the accuracy of the predicted spectral intensities.
The problem with the model developed by Zhang at Amgen Inc. is that fitting is
performed on each parameter independently, which is not guaranteed to find the
optimal parameter set, thus compromising the ultimate accuracy of MassAnalyzer
spectra.
The diversity in amino acid chemistries and the complexity of peptide backbone
fragmentation require a model with many free parameters; our model contains 264
parameters. Complex kinetic models frequently have many dependent parameters.
Reaching the global minimum of chi-square by fitting the parameters
one by one is infeasible. I propose an algorithm that fits parameters
simultaneously through a divide-and-conquer strategy to reach a local minimum of
chi-square with the best-fit parameters.
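To make the contrast with one-at-a-time fitting concrete, the following Python sketch
fits parameters in blocks, moving every parameter of a block together at each step.
It is an illustration only and is not the fitting algorithm of this thesis: the
technique shown is plain block coordinate descent with a backtracking line search,
the four-parameter linear model stands in for the 264-parameter kinetic model, and
all names (model, chi_square, fit_blockwise) are hypothetical.

```python
def model(params):
    """Toy stand-in for the spectra prediction model (4 parameters, 4 intensities)."""
    a, b, c, d = params
    return [a + b, a - b, c + 2 * d, c - d]


# "Experimental" intensities generated from known parameters for the demo.
OBSERVED = model([1.0, -2.0, 0.5, 3.0])


def chi_square(params):
    """Unweighted chi-square between predicted and observed intensities."""
    return sum((p - o) ** 2 for p, o in zip(model(params), OBSERVED))


def block_gradient(objective, params, block, h=1e-6):
    """Central-difference gradient restricted to the parameters in one block."""
    grad = []
    for i in block:
        up, down = list(params), list(params)
        up[i] += h
        down[i] -= h
        grad.append((objective(up) - objective(down)) / (2 * h))
    return grad


def fit_blockwise(objective, start, blocks, sweeps=500):
    """Minimize objective by updating one block of parameters at a time."""
    params = list(start)
    best = objective(params)
    for _ in range(sweeps):
        for block in blocks:
            grad = block_gradient(objective, params, block)
            step = 1.0
            # Backtracking: shrink the step until chi-square decreases.
            while step > 1e-12:
                trial = list(params)
                for i, g in zip(block, grad):
                    trial[i] -= step * g  # all parameters of the block move together
                value = objective(trial)
                if value < best:
                    params, best = trial, value
                    break
                step *= 0.5
    return params, best


fitted, chi2 = fit_blockwise(chi_square, [0.0, 0.0, 0.0, 0.0], blocks=[[0, 1], [2, 3]])
print([round(p, 4) for p in fitted], chi2)
```

In this toy case the two blocks are independent, so each block converges on its own.
In a real kinetic model the blocks are coupled, which is why the sweeps over all
blocks are repeated and why only a local minimum can be expected.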
The benchmarks for completion of this specific aim are (1) completion of the
spectra prediction program for singly and doubly charged sequences, (2)
achievement of fitting parameters for the prediction model with improved accuracy
of intensity prediction, and (3) comparison of the predicted spectra from our model
with those from other existing models.
2. Literature Review
In this chapter, I will first give a brief overview of expert systems and their design
(section 2.1) and the fitting procedure (section 2.2), then focus on the MS domain
and expand on the domain knowledge (section 2.3), and describe the current MS/MS
spectra prediction model developed by Zhang (section 2.4). Section 2.5 will describe
training data for fitting the Zhang model and other aspects of the proposal.
2.1 Expert System
2.1.1 Introduction of Expert System
This proposal describes the creation of an expert system for managing scientific
information about the chemistry of peptides when analyzed by a mass spectrometer.
It incorporates several different algorithms for different problems, implements
decision and control mechanisms that incorporate the domain knowledge of experts,
and organizes information in a structure that facilitates downstream applications.
In fact, expert systems derive from Artificial Intelligence (AI), which is a
branch of computer science concerned with the emulation of human cognitive skills such
as problem solving. Moreover, AI is associated with the higher intellectual processes,
such as the ability to reason, discover meanings, generalize, or learn (Beavis et al.,
2000). Expert system technology has been successfully applied to a diverse range of
domains, including the mass spectrometry field, which has grown rapidly with the
development of computer software and hardware. Typical tasks for expert systems
involve:
the interpretation of data (MSMS experimental spectra)
diagnosis of malfunctions (MSMS spectrometry faults)
structural analysis of complex objects (MSMS prediction Model)
planning sequences of action (MSMS chemical actions for fragment ions)
There are many definitions for an expert system, but none are guaranteed to
satisfy all requirements. However, there are a number of functions which are
sufficiently important that they should comprise the desired expert system. The term
knowledge-based system is sometimes used as a synonym for expert system and is
more general. The process of constructing an expert system is often considered to be
applied artificial intelligence (Feigenbaum, 1977).
An expert system can be distinguished from an application program based on the
following criteria:
(1) Whether it simulates human reasoning;
(2) Whether it performs reasoning based on human knowledge;
(3) Whether it solves problems by heuristic or approximate methods.
2.1.2 Computer Hardware and Software Development in Mass Spectrometry
The improvement in computer hardware and software has enabled an incredible
increase in the performance and types of mass spectrometers available. These
improvements have also resulted in higher experimental data throughput, faster results
to the user, and increased profitability. They have allowed process
multitasking during data acquisition, where the computer both collects data and
controls the instrument operation, and automated peak search and spectra matching
to identify peptides, where large volumes of data are quickly analyzed (Beavis, et al.,
2000).
2.1.3 Expert System Application in Mass Spectrometry
Computers are now an indispensable part of analytical instrumentation and
identification of proteins and peptides. Both hardware and software engineers have
been interested in computer development tools that can monitor their instruments,
subsequently make decisions, and analyze data. In the mass spectrometry field,
expert system tools are used in three primary areas: optimization and control of the
performance of the mass spectrometer itself, collection of the detector signal as a
function of m/z, and analysis of the data (Beavis, et al., 2000).
Controlling the MS instrument: Mass spectrometry instrument parameters must
be adjusted to their optimum values for the best performance. Mass spectrometers
must also be tuned and calibrated to a state in which peak intensity, peak shape, and
mass calibration are within the acceptable range. After tuning, the computers control
instrument execution and monitor all regular and irregular signs of the instruments.
Data Collection: Data collection and data storage are also very important in the
identification of proteins and peptides. The fragment ions generated during the
MS/MS process are collected and recorded via a direct digital interface into a binary
Raw file that is in the vendor's proprietary format. We must depend on the vendor
software to extract the information. This software generates problems in the spectra
that must be handled in some sort of post-processing to evaluate the quality of the
spectra, eliminate noise, and deal with what are called isotope peaks. We have
extensively evaluated this for the LCQ MS instrument, and the same evaluation must
be carried out on the LTQ software, to make MAE better able to evaluate this new
instrument.
Data Analysis: After data collection, experts or computers must extract and
interpret the chemical information. There has been significant development in
data analysis software for mass spectrometry since the early report in 1959
(Gillette, 1959). There are software and expert systems that preprocess the mass
spectrometry data through range cut and peak smoothing (Gullo, et al., 2008), and
use these spectra as inputs to search engines that use intelligent peak-searching and
pattern-matching algorithms with the most likely fitness (Zhang, ProFound 2000), in
order to predict the theoretical spectra and then identify the MS/MS spectra (Zhang,
2004).
MS/MS data analysis is a rapidly expanding field, and there are now many applications
of expert systems. One example of an expert system application is
DENDRAL (Lindsay, et al., 1993), applied to a variety of analytic tasks in MS.
The fundamental problem in analytical chemistry is how to determine and assign
the chemical structure of molecules. The DENDRAL expert system was designed to
solve this problem. It used a substantial knowledge of chemistry, and was the first
rule-based expert system applied to a real-world chemical structure problem,
and also the first application of AI to a problem of scientific reasoning.
DENDRAL is a set of programs, divided into Heuristic DENDRAL
and Meta-DENDRAL. DENDRAL originally stood for DENDRitic ALgorithm,
enumerating all the topologically distinct arrangements of atoms. Heuristic
DENDRAL incorporates specific mass spectrometry knowledge to produce an
ordered set of chemical structures that explain the data. Meta-DENDRAL
accepts known mass spectrum/structure pairs as input and attempts to infer the
specific knowledge of mass spectrometry that can be used by Heuristic DENDRAL to
explain new spectra (Lindsay, et al., DENDRAL: a case study of the first expert
system for scientific hypothesis formation, 1993).
DENDRAL is an extensive case study of one of the first knowledge-based expert
systems. DENDRAL's knowledge base was substantial, supporting the generation of all
isomeric structures, but was admittedly incomplete. It included four subfields: (1)
knowledge of chemical graphs, (2) knowledge of chemical stability, (3) knowledge
of mass spectrometry, and (4) knowledge of synthetic chemistry.
Another example of an expert system is ProFound, for protein identification using
mass spectrometric peptide mapping information (Zhang, et al., 2000). ProFound
employs a Bayesian algorithm to rank the proteins by their probability of
occurrence, combined with additional information such as the protease used,
amino acid tag, and sequence information, for identification of proteins from
protein databases. Bayesian probability theory and the maximum entropy principle are
applied to derive the probability for the candidate protein given the MS spectra and
background information. For any given candidate protein in the search database, the
probability of the protein increases with an increasing number of matched ions,
increasing mass accuracy, and a decreasing number of theoretical digested fragments.
The ranking of the candidate proteins is based on the value of their probability.
Another example of data analysis in MS was developed by Arshadi (Niloofar Arshadi,
Data Mining for Case-Based Reasoning in High-Dimensional Biological Domains,
IEEE 2005). The author developed an algorithm for case-based reasoning and then used
two mass spectrometry-based ovarian data sets to check the specificity and
sensitivity of the algorithm.
There are many tools for mass spectrometry data preprocessing. One example
is MSPtool (Gullo, et al., 2008), which was developed as a crucial phase in performing
data management and knowledge discovery tasks on MS/MS spectra.
Mass Spectra Preprocessing tool (MSPtool) is a graphical tool to preprocess mass
spectrometry data. A raw spectrum output by a mass spectrometer is
essentially a combination of three components: the true signal, a baseline signal,
and noise (K.R. Coombes, K.A. Baggerly, and J.S. Morris. Pre-Processing Mass
Spectrometry Data. Fundamentals of Data Mining in Genomics and Proteomics.
Kluwer, Boston, 2007). There is general agreement about the preprocessing schema,
which is calibration, filtering or denoising, baseline correction, normalization, peak
detection, peak quantification, and peak matching. First, MSPtool provides a cut of
the m/z range of the spectra to filter out biologically irrelevant information. Second,
MSPtool smooths peaks using its own algorithm based on local
maxima of pseudo-Gaussians. Third, MSPtool recognizes the peaks as a further step
of peak detection using a user-defined signal-to-noise ratio (S/N) threshold. Fourth,
MSPtool supports four baseline functions (linear, logarithmic, exponential,
and piecewise linear) for the user to choose from. Fifth, MSPtool discretizes the
original intensity according to specific quantization levels. Sixth, MSPtool normalizes
the spectra by mapping the original intensities into a certain fixed range.
My expert system focuses on data preprocessing, feature recognition,
knowledge representation, and theoretical spectra prediction, and then evaluates the
sensitivity and efficiency of peptide identification and validation. These processes
can be utilized with any search program, producing a generalizable method that is not
available in any other system.
2.1.4 Rule-Based Expert System
Rule-based expert systems are currently the most popular type of expert system in
knowledge engineering. A rule-based expert system can be defined as a program that
processes problem-specific information contained in problem facts with a set of rules
contained in the knowledge base, using an inference mechanism to infer the result.
Whereas the knowledge base is a set of rules, the inference mechanism is a reasoning
engine that combines problem facts with rules to infer new information. Though
rule-based expert systems are not an exact match for human problem solving, they
provide a reasonable model of human reasoning. Figure 2.1 shows the rule-based
expert system model.
Figure 2.1 Rule-based Model in Expert System
2.1.5 Expert System Design
There are four topics we need to consider in designing an expert system. They are:
(1) Acquiring knowledge.
Knowledge acquisition plays an important role in a rule-based expert system.
Here the knowledge engineer interacts with the domain expert to acquire, organize, and
study the problem's knowledge. Acquiring knowledge from the domain expert is
called knowledge elicitation. Most expert system researchers have realized that
acquiring knowledge is the most difficult task in designing an expert system (for
example, creating a theoretical spectra prediction model from chemistry knowledge). The
identification and encoding of knowledge is one of the most complex and arduous
tasks encountered in the construction of an expert system. Thus the process of
building a knowledge base has usually required a time-consuming collaboration
between a domain expert and an expert system researcher (Duda and Shortliffe 1983).
Figure 2.2 shows the knowledge elicitation process.
[Figure 2.2 image labels: Questions, Results, Answer, Knowledge]
Figure 2.2 Knowledge Elicitation Process
(2) Representing knowledge.
There are five fundamental knowledge representation techniques: object-attribute-value
triplets, rules, semantic networks, frames, and logic. My MAE expert
system is a rule-based expert system and focuses on the acquisition and representation
of rule knowledge.
A rule is a form of procedural knowledge. Its structure contains one or more IF
and THEN parts. In a rule-based expert system, domain knowledge is represented as a
set of rules and entered in the system's knowledge base.
(3) Controlling reasoning.
The reasoning process in expert systems uses a technique called inference. There
are several fundamental reasoning techniques: deductive, inductive,
abductive, analogical, common-sense, and non-monotonic reasoning.
Deductive reasoning infers new information from logically related known
information. One part of MAE, feature recognition, uses this reasoning technique to
derive new information (identified ions). The new information becomes part of the
knowledge base for the whole MAE expert system.
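How IF/THEN rules and deductive inference fit together can be shown with a small
forward-chaining loop in Python. This is a generic sketch and not the MAE rule base:
the rules, the fact names, and the function name forward_chain are all hypothetical
and chosen only to resemble the ion identification setting.

```python
# Each rule is (IF: set of facts that must all hold, THEN: fact to add).
RULES = [
    ({"peak_matches_b_ion_mz"}, "ion_is_b_candidate"),
    ({"ion_is_b_candidate", "intensity_above_noise"}, "b_ion_identified"),
    ({"b_ion_identified", "complementary_y_ion_identified"}, "cleavage_site_confirmed"),
]


def forward_chain(facts, rules):
    """Apply rules repeatedly until no rule can add a new fact."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            # A rule fires when all of its IF facts are already known.
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts


initial = {
    "peak_matches_b_ion_mz",
    "intensity_above_noise",
    "complementary_y_ion_identified",
}
derived = forward_chain(initial, RULES)
print(sorted(derived - initial))
```

The facts derived in one pass become available to later rules, which mirrors how
newly identified ions are added to the knowledge base and reused.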
(4) Explaining resolution.
Resolution is an inference strategy used in expert systems to judge
the truth of an assertion. The MAE expert system applies feature recognition and
the theoretical spectra prediction model as the knowledge base, combined with the
scoring method as the inference engine, and writes its output to a result file.
2.1.6 MAE Expert System Structure
All expert systems are composed of four basic components: a user interface, a
database, a knowledge base, and an inference mechanism. The MAE expert system
evaluates and manages high-throughput mass spectrometry (MS) data
used to identify proteins in complex samples by analyzing peptides generated
from those proteins. The most critical use is in relating predicted and experimental
MS fragmentation data. The four major parts of this expert system are diagrammed
in Figure 2.3. These include: (1) the knowledge base provided by a feature
recognition tool for analysis of the experimental data, built on the methods used by an
expert in manual annotation of that data, (2) a
simulator that predicts the fragmentation MS spectra, generating a database of
simulated spectra that are most likely required for studies in the Old Laboratory,
organized in a tree structure for rapid access, (3) an inference mechanism or data
analysis module that provides comparisons between the experimental inputs and the
theoretical information; these inference functions include centralized information on
peptide chemistry, scoring metrics for evaluating spectral quality and goodness of fit
between the simulated and experimental spectra, and algorithms for comparisons of
different aspects of the spectra, and (4) a simple user interface that accepts
information from a simple CSV file generated from SQL queries of an Oracle 9i database
or upstream programs; although simple, this user interface is used by all the
programmers in the Old Laboratory, and several tools have been developed for
managing the Oracle database and the SQL queries. Each part of this expert system
can be used independently on specific tasks; the expert system as a whole can also be
used on peptide sequence identification/validation to replace expert
analysis. Moreover, expert system development usually proceeds through several
phases including problem selection, knowledge acquisition, knowledge representation,
programming, testing and evaluation.
[Figure 2.3 image labels: MSMS Spectra (DTA files), MSPlus file]
Figure 2.3 High Order View of the MAE Expert System with input MS/MS
spectra, search results for identification of the peptide sequence (MSPlus file),
and the inference mechanism that connects the database of theoretical
information with the knowledge base provided by the functions that provide
feature recognition in the experimental data.
There are two major aspects of manual analysis of spectra that must be automated.
Both assess chemical plausibility, and will utilize the simulated MS/MS
spectra as the major inference method. First, the expert asks whether the overall
spectrum is accounted for by the assigned peptide sequence. Second, the expert
analyzes whether the intensity rank of each peak within the spectrum is consistent
with the chemical properties of the peptide.
It is important to realize the magnitude of the problem in managing the data for
these studies. The major data unit is an MS/MS fragmentation spectrum of a peptide;
this fragmentation spectrum produces information that is used to identify the peptide
by string matching to the protein sequences in a reference library. The sequence
assignments will vary in quality, and manual analysis was originally used to evaluate
this. However, not only are samples now orders of magnitude more complex, but
current MS instruments collect six MS/MS spectra/sec, and typically collect data for
12 hours a day for weeks. This makes manual analysis impossible. This motivated
the need for an expert system that automated key aspects of manual analysis,
minimized subjective decisions, and enabled high-throughput processing to support
research efforts that rely on data mining MS data or that utilize the simulated MS data
for evaluating the MS chemistry or search results.
Feature recognition and knowledge representation are key parts of our expert
system; one of the most powerful parts of expert systems is the ability to extract
abstract information and to explain reasoning. This section will introduce how to
extract abstract information and includes the associated pseudocode.
The scoring method, designed as the inference mechanism in our expert system,
reinforces the expert system's ability to review a consultation and provide the user
with an explanation for how its conclusion was derived. The inference mechanism
essentially represents the reasoning process used by the expert to resolve the
problem. It provides for a better understanding of how the confidence conclusion was
reached. Section 4.1 will introduce the design of the scoring method and some
associated algorithms for score calculation.
The predicted spectral library works as a database in the MAE expert system. We
store the spectra alphabetically in a four-level tree structure due to the large size
of the library (60 GB). The next section will introduce the tree structure used to
balance the information among the branches.
Figure 2.4 Four-Level Tree Structure for Predicted Spectra Storage on Hard
Disk. The folder of predicted spectra is the root; the first level has 20 subfolders
named by the one-letter symbol of the first N-terminal amino acid in the peptide
sequence, the second level has 20x20 subfolders named by the one-letter symbol of
the second N-terminal amino acid, and the third level has 20x20x20 subfolders
named by the one-letter symbol of the third N-terminal amino acid.
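The folder layout in Figure 2.4 can be sketched in a few lines of Python. This is
only an illustration of the idea: the function name spectrum_folder, the root folder
name, and the validation rules are assumptions and are not taken from the MAE
source code.

```python
from pathlib import PurePosixPath

# The 20 standard amino acids, by one-letter symbol.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def spectrum_folder(root: str, peptide: str) -> PurePosixPath:
    """Return the folder for a peptide, keyed on its first three N-terminal residues.

    Hypothetical helper: one subfolder level per residue, as in Figure 2.4.
    """
    prefix = peptide.upper()[:3]
    if len(prefix) < 3 or any(aa not in AMINO_ACIDS for aa in prefix):
        raise ValueError(f"need at least three standard residues, got {peptide!r}")
    return PurePosixPath(root, prefix[0], prefix[1], prefix[2])


# Number of leaf folders: 20 x 20 x 20.
LEAF_FOLDERS = len(AMINO_ACIDS) ** 3
```

With this layout, a lookup for the peptide NPTVFFDIAVD only has to open the
folder predicted_spectra/N/P/T, which is one of 8000 leaf folders.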
2.2 Spectra Prediction Model and Its Fitting
2.2.1 Description of the Spectra Prediction Model
The peptide spectra prediction model is based on a kinetic approach to describing
the gas phase chemistry, following the mobile proton model (Wysocki et al., 2000)
of peptide fragmentation, and was developed to quantitatively simulate the low-
energy collision-induced dissociation (CID) spectra of peptides dissociating in a
quadrupole ion trap mass spectrometer (Zhang, 2004). The model includes most
fragmentation pathways plus some additional pathways based on the author's
observations.
The model is based on the following major assumptions:
Teff or effective temperature of activation is related to the mass, charge and
collision energy of the peptide ion as it is being accelerated by a radio
frequency (RF) electric field during fragmentation (as shown in Figure 2.5).
The internal energy of the peptide population is assumed to be distributed
according to a Boltzmann distribution represented by the Teff.
Intramolecular proton transfer to the various cleavable bonds occurs rapidly in
comparison to the rate of dissociation. For simplification, the model assumes
that the proton does not move after that.
The side chain of each residue has a constant gas-phase basicity (GB),
independent of the surrounding sequence (this is not true, but was assumed to
make the charge calculations easier, and is a reasonable assumption for doubly
charged peptides).
GB of a backbone amide depends on the immediately adjacent amino acids,
and the effects are additive (this means that only the two adjacent amino acids
need be considered in predicting the cleavage of a peptide bond).
Each backbone cleavage is dependent on the peptide achieving sufficient
energy above the activation energy (Ea) to cleave that bond (different bonds
require different Ea thresholds).
The Ea required for a specific bond can be expressed as an additive function
where each amino acid is described by parameters that are independent of
other factors.
This internal energy of a peptide is decreased by an increment when a peptide
bond is cleaved. There may be sufficient energy for secondary cleavages.
Thus, the simulation is carried out iteratively, until the cleavages have
exhausted all the added energy. This iterative process is shown in Figure 2.6.
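The assumptions about an additive Ea and an effective temperature can be
illustrated with a small sketch. It assumes an Arrhenius-type rate expression,
k = A exp(-Ea / (R Teff)), which is a standard form for such kinetic models but is
stated here as an assumption; the per-residue Ea contributions and the
pre-exponential factor are hypothetical numbers, not fitted values from
MassAnalyzer or S3. Only the primary cleavage step is shown, not the iterative
secondary cleavages.

```python
import math

R = 8.314e-3  # gas constant in kJ/(mol K)

# Hypothetical per-residue contributions to Ea (kJ/mol) for the residue on the
# N-terminal side and on the C-terminal side of the bond. Illustrative only.
EA_N_SIDE = {"A": 50.0, "P": 55.0, "D": 45.0, "G": 52.0}
EA_C_SIDE = {"A": 50.0, "P": 40.0, "D": 50.0, "G": 51.0}


def activation_energy(n_res: str, c_res: str) -> float:
    """Additive Ea for the bond between two adjacent residues."""
    return EA_N_SIDE[n_res] + EA_C_SIDE[c_res]


def arrhenius_rate(ea: float, t_eff: float, a_factor: float = 1e11) -> float:
    """Assumed Arrhenius rate constant at effective temperature t_eff (K)."""
    return a_factor * math.exp(-ea / (R * t_eff))


def branching_fractions(peptide: str, t_eff: float) -> list:
    """Relative probability of primary cleavage at each backbone bond."""
    rates = [
        arrhenius_rate(activation_energy(peptide[i], peptide[i + 1]), t_eff)
        for i in range(len(peptide) - 1)
    ]
    total = sum(rates)
    return [rate / total for rate in rates]
```

In this toy parameter set the D-P bond has the lowest Ea, so it dominates the
primary cleavage, and raising the effective temperature spreads the cleavage
more evenly over the other bonds.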
Figure 2.5 The Kinetic Modeling of Peptide Fragmentation (taken from Zhongqi
Zhang, Anal. Chem. 2004, 76, 3908-3922).
Figure 2.6 Fragment Pathways Modeled by MassAnalyzer showing three
generation iterations, reflecting primary and secondary cleavages (Neu
represents neutral losses) that show the rapid proliferation of many products;
often hundreds are produced in a full simulation.
The original published studies by Zhang show good agreement between the
predicted and observed peptide chemistry, using simple similarity scores to evaluate a
large population of experimental spectra. Because the kinetic model relies on the
mobile proton model of peptide fragmentation, this agreement also supports the
validity of the mobile proton model.
2.2.2 Fitting the Nonlinear Multidimensional Kinetic Simulation Model
Published studies from our lab (Sun et al., 2007, Yen et al., 2008) have revealed
that specific mechanisms are not well predicted by the current model in
MassAnalyzer, so they must be added to the model to increase MS/MS
prediction accuracy. This has proven difficult, because the model is nonlinear with
~264 parameters, and these parameters were fit in MassAnalyzer using a simple naive
method that requires many weeks of fitting after each mechanism is added. I propose
to complete the expert system by replacing the current simulator with one that I will
develop. The new simulator will use an improved fitting procedure based on more
sophisticated algorithms that require high performance computing, along with a
divide and conquer modeling approach and reporting functions that allow better
selection of training data and faster convergence.
This number of parameters is required because of the nature of the chemistry,
which involves hundreds of individual reactions that must be adequately
represented both in the input experimental data and in the model. To select and
manage this complex data, I will build on two programs I have written in preparation
for this thesis proposal: a data mining tool (MAE) that provides for targeted fitting to
specific parts of the model (Sun et al., 2007), and the simulator of the gas phase
chemistry, S3, which is based on the published kinetic model implemented in the
simulation function in the program MassAnalyzer, recently described by Z. Zhang
(Zhang, 2004, 2005).
The MassAnalyzer simulator was fit one parameter at a time, and is now proving
difficult to extend to include additional chemistries that I have identified in the initial
studies. To solve this problem, this thesis proposal will develop novel methods for
fitting high dimensional nonlinear models, making three significant contributions: (1)
develop preprocessing methods to identify the best experimental data for the training
set, develop statistical measures that weight the data according to peptide and data
attributes, using measures and attribute tables that I developed in MAE, and test
new algorithms for simplifying and de-noising the experimental spectra; (2) design a
novel fitting method for high dimensional models, utilizing a modified Levenberg-
Marquardt algorithm for fitting; and (3) develop performance measures to evaluate
the fitting of the S3 model, developing MAE-based reporting functions to identify
which parts of the model and which input data are problematic during the fitting.
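To make the fitting idea concrete, the following sketch implements the textbook
Levenberg-Marquardt iteration in plain Python. It is not the modified algorithm
proposed in this thesis, and the two-parameter exponential model
y = a exp(-b x) is only a small stand-in for the ~264-parameter kinetic model;
the function names and default settings are assumptions made for illustration.

```python
import math


def model(x, a, b):
    """Toy stand-in model: y = a * exp(-b * x)."""
    return a * math.exp(-b * x)


def chi_square(params, xs, ys):
    """Sum of squared residuals between data and model."""
    a, b = params
    return sum((y - model(x, a, b)) ** 2 for x, y in zip(xs, ys))


def lm_fit(xs, ys, params, lam=1e-3, max_iter=200, tol=1e-12):
    """Textbook Levenberg-Marquardt for the two-parameter toy model."""
    a, b = params
    chi = chi_square((a, b), xs, ys)
    for _ in range(max_iter):
        # Accumulate J^T J and J^T r, where r = y - f(x) and J = df/dparams.
        jtj = [[0.0, 0.0], [0.0, 0.0]]
        jtr = [0.0, 0.0]
        for x, y in zip(xs, ys):
            e = math.exp(-b * x)
            grad = (e, -a * x * e)  # (df/da, df/db)
            r = y - a * e
            for i in range(2):
                jtr[i] += grad[i] * r
                for j in range(2):
                    jtj[i][j] += grad[i] * grad[j]
        # Damped normal equations: (J^T J + lam * diag(J^T J)) delta = J^T r.
        m00 = jtj[0][0] * (1.0 + lam)
        m11 = jtj[1][1] * (1.0 + lam)
        m01 = jtj[0][1]
        det = m00 * m11 - m01 * m01
        if det == 0.0:
            break
        da = (jtr[0] * m11 - m01 * jtr[1]) / det
        db = (m00 * jtr[1] - m01 * jtr[0]) / det
        new_chi = chi_square((a + da, b + db), xs, ys)
        if new_chi < chi:
            # Accept the step and move toward Gauss-Newton behavior.
            converged = chi - new_chi < tol
            a, b, chi = a + da, b + db, new_chi
            lam /= 10.0
            if converged:
                break
        else:
            # Reject the step and move toward gradient descent behavior.
            lam *= 10.0
    return (a, b), chi
```

The damping factor lam is the key design choice: it is decreased after a
successful step and increased after a rejected one, so the method moves between
fast Gauss-Newton steps near the minimum and cautious gradient steps far from it.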
A successful fitting method should allow the addition of new chemical
mechanisms to S3 that predict the gas phase chemistry better than was achieved
using the single parameter fitting utilized by the MassAnalyzer simulator. I have tested
the usefulness of this approach in a collaborative study with K. Cios and A. Gehrke
that analyzed the cleavage of the second peptide bond (N2 cleavage study briefly
described in Chapter 3). This study developed a set of attributes to support machine
learning studies, which are also available for use by MAE in the modeling efforts.
The completion of the proposed fitting experiments will provide better parameters
for the MS/MS simulations, along with error estimates of these parameters. In
addition, an important aspect of managing this large fitting process will be
development of statistical tests to identify and characterize subsets of the data that are
outliers or that have a negative impact on the local minima or the convergence. A
divide and conquer strategy will be developed to optimize the fitting of different
subsets of the data, based on exploiting domain knowledge about peptide chemistry
and my experience with using a large range of peptide attributes in robust regression
and supervised machine learning to optimize the fitting of subgroups.
2.2.3 The Nature of the Chemistry Considered in Designing Fitting Protocol
The complexity of the gas phase chemistry of peptides is the reason for the high
dimensionality of the model. The ease with which each peptide bond is cleaved must
be evaluated. Fortunately, the majority of the chemistry is driven by the nature of the
two adjacent amino acids. Even with that simplification, there are 20 amino acids,
meaning that there are 400 pairs that must be sufficiently represented in our
experimental data to evaluate these chemistries. Because some pairs are low in
abundance, a fairly large database must be accumulated to provide information on
these amino acid pairs. We estimate that about 5000 spectra will be the minimum
for fitting one of our simpler models, based on the known frequency of the amino
acid pairs. This assumes that these spectra reflect the same basic chemistry. If
the peptides fall into subset types that are significantly different from each other, then
each subset type will require a similar number of peptide MS/MS spectra in the
fitting, and care must be taken to ensure that no group outweighs another group. It
is likely that this effect is important in the problems that Zhang now has in adding
additional chemistries to his model.
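A simple way to check whether a training set covers the 400 adjacent amino acid
pairs is to count them directly. The sketch below is an illustration only; the
function names and the min_count threshold are assumptions, not part of MAE.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, by one-letter symbol.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pair_counts(peptides):
    """Count every adjacent residue pair (one per peptide bond) in the peptides."""
    counts = Counter()
    for peptide in peptides:
        for left, right in zip(peptide, peptide[1:]):
            counts[left + right] += 1
    return counts


def underrepresented_pairs(peptides, min_count):
    """List the pairs, out of all 20 x 20, seen fewer than min_count times."""
    counts = pair_counts(peptides)
    return [
        left + right
        for left, right in product(AMINO_ACIDS, repeat=2)
        if counts[left + right] < min_count
    ]
```

For example, the single peptide NPTVFFDIAVD contributes ten distinct pairs, so
390 of the 400 pairs would still be reported as missing at min_count = 1.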
2.2.4 Overall Strategy for Fitting Parameters of the Model
The prediction of peptide spectra will play a primary role in optimizing
current peptide identification algorithms used in large scale proteomics experiments.
This requires that the peptide spectra prediction model have the following properties:
(1) the model can correctly predict the relative intensities of the fragment ions in the
observed spectra; (2) the model represents the chemical properties of all peptide
sequences, even though they have large sequence-driven differences in the types of
cleavages observed; (3) the model can be calibrated to fit any MS instrument; (4)
the model is open and can be modified if new chemical knowledge is found.
A high-level view of the fitting strategy is shown in Figure 2.7, where the input
experimental data is derived from search results of datasets collected in the Old
laboratory over the last two years. These datasets will be filtered for high quality
MS/MS spectra that we are confident were correctly identified, using the MAE
software to manage the data selection. For the initial experiments, we will utilize a
simplified model, which we estimate will require about 5,000 peptide sequences
(because we are really interested in the individual fragmentation events, it is the
ratios between various ions in the spectra that are of interest, so these 5,000 spectra
will actually include more than 15,000 experimental data points). The modeling
method is described in the pink outlined box. The spectra prediction model was
described in Section 2.2.1. It is a difficult task to fit a nonlinear, dynamic,
multidimensional model with ~264 parameters. The processing diagram shown below
(Figure 2.7) illustrates the principle of least squares fitting using a maximum
likelihood estimator (Press et al., 2007). For the first iteration, an initial set of
parameters is assigned to the model, and the Chi_square value is calculated by the
fitting program. For the second iteration, an adjusted set of parameters is chosen by
operator experience or by a specific fitting algorithm, and the fitting program
calculates Chi_square again. Overall, the procedure minimizes Chi_square over the
set of parameters to find the best-fit parameters.
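The iteration described above (assign parameters, compute Chi_square, adjust,
repeat) can be sketched as follows. The adjustment rule here is a naive search
that changes one parameter at a time, in the spirit of the single parameter
fitting mentioned earlier; it is an assumption for illustration and is not the
procedure used by MassAnalyzer or S3. The names predict, step, and min_step are
hypothetical.

```python
def chi_square(observed, predicted, sigma=1.0):
    """Chi_square between observed and predicted values with a common sigma."""
    return sum(((o - p) / sigma) ** 2 for o, p in zip(observed, predicted))


def fit_one_parameter_at_a_time(predict, params, observed, step=0.5, min_step=1e-6):
    """Minimize Chi_square by nudging one parameter at a time.

    predict(params) must return the predicted values for a parameter list.
    """
    params = list(params)
    best = chi_square(observed, predict(params))
    while step > min_step:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = params.copy()
                trial[i] += delta
                chi = chi_square(observed, predict(trial))
                if chi < best:
                    # Keep the adjusted parameter set only if the fit improves.
                    params, best, improved = trial, chi, True
        if not improved:
            step /= 2.0  # refine the search once no single nudge helps
    return params, best
```

A search of this kind needs many evaluations of the model, which is one reason
why fitting hundreds of parameters this way is slow.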
Figure 2.7 Processing Diagram for MS/MS Spectra Prediction Model Fitting.
The program calculates Chi_square against a set of parameters for each iteration,
and minimizes Chi_square to obtain the set of best-fit parameters.
2.2.5 Limitations of the Model
The model is complicated and the computation is time intensive. Moreover, there
are at least 264 parameter values, which are very hard to optimize simultaneously,
even for the best fit to the experimental spectra. The values of the gas-phase
basicities (GB) in the model are relative to a fixed GB value of 1000 for the Arginine
side chain instead of their real physical values. The proton probability at one position
in the whole peptide depends on the basicity (GB) of the corresponding microstate
and the effect of its adjacent microstates (ΔGB). Unfortunately, peptide structures are
usually folded into a compact shape using a variety of loops and turns, so determining
how many adjacent microstates affect the target microstate is infeasible. The model is
then not sufficiently accurate in simulating the peptide fragmentation procedure in the
LC/MS instrument.
Due to the oversimplification of the kinetic model, even with optimized
parameters the predicted values do not fully fit the experimental data. For example,
we have found that the agreement between calculated and experimental spectra is
worse when the collision energy is outside the 30-40% range. It is clear that the ion
temperature correlates strongly with the collision energy (Gabelica et al., 2003).
However, the effective temperature of an ion may be highly dependent on the nature
of the peptide in ways not accounted for by the model.
My studies showed that many spectra, particularly those other than the doubly
charged cases, do not show a good fit between the predicted and observed spectra. I
also demonstrated that the predicted dehydrated and deammoniated fragment
ions were underpredicted relative to the corresponding observed ions. In particular,
the predicted spectra were very poor when multiple losses of water or ammonia
occurred. N-terminal cleavages and internal fragment ions are also underpredicted.
Figures 2.8, 2.9, and 2.10 show examples of MS/MS spectra where the gas phase
chemistry is incompletely predicted by the model. In these cases, I compare the
processed observed spectra where the noise and isotope ions have been removed (the
resulting ions are referred to as Average Mass Ions in sDTA files) (top panels) and
predicted spectra generated by MassAnalyzer (bottom panels). In each case,
summaries of observed fragment ions from the MS/MS spectra are shown along with
the MS/MS to help in the interpretation. Canonical sequence-specific b and y ions are
respectively shown by 1 and L symbols above and below the sequence. Dehydrated
ions are represented as triangles, and a ions are indicated. Internal fragment ions
are shown by bars below the sequence. Multiply charged canonical fragment ions are
shown above or below the singly charged ions.
Figure 2.8 shows an example of a peptide showing unusual dehydration. This is a
singly charged peptide with two Ser, one Thr, and two acidic groups. Multiple
dehydrations from bn and bi2 representing neutral loss of one, two, or three water
molecules are observed, consistent with the number of Ser/Thr residues on these
fragment ions. Note that the larger bn fragment ion shows less dehydration. Multiple
dehydrations are also observed from the parent ion.
Figure 2.9 shows an example of a doubly charged parent ion undergoing neutral
loss of D or DT from the N-terminus. This results in the generation of y8+2 and y9+2
ions (see insert), even though the peptide has only one basic residue. MassAnalyzer
underpredicts the intensity of the y8 ion.
Figure 2.10 shows an example of a peptide where multiple internal fragment ions
are observed. Most of the internal fragments were generated by cleavage first
between Ile7 and Pro8, and second within an active region at QNVP. As a result, b7
shows significantly lower intensity than predicted, due to its depletion following
internal fragmentation. Several internal fragment ions appear in the experimental
spectrum, but there are no corresponding ions in the predicted spectrum.
[Figure 2.8 image: observed sDTA spectrum (top panel) and Theoretical spectrum (bottom panel), annotated with MH+ and M-H2O ions.]
Figure 2.8 The Difference between Predicted and Experimental Spectra for the
Dehydrated Fragment Ions
Figure 2.9 Difference between Predicted and Experimental Spectra on N
Terminal Second Amino Bond Broken (y_2 ion)
Figure 2.10 Difference between Predicted and Experimental Spectra for Internal
Fragments
2.3 Chemistry and Mass Spectrometry Domain Knowledge
2.3.1 Overview of Our in-house Analysis System
The proteomics data is managed using an Oracle 9i database-backed architecture,
in which modular components exchange data through defined .csv (comma separated
values) interfaces. This modularity allows testing of prototype programs and
comparing different components that perform similar tasks, including those from
other labs. By maintaining defined interfaces, the information from many analyses
can be integrated in a common data model. The system has two major branches,
those involved in peptide identification and those involved in protein inference from
the peptide identification. This thesis involves only the peptide characterization
branch (shown in blue in Figure 2.11), and the protein inference branch will not be
discussed.
The identification of the peptide sequence from the fragmentation spectra is
carried out by a SearchManager which automatically processes MS data files
(Thermo-Electron .raw, Agilent .pkl, ABI .wif), directs the database searches to
identify the peptides (Mascot, Sequest, and X!Hunter are described in this thesis,
although others are available), and parses search results into the Oracle database.
MSPlus and the basic MAE (Rescore) module act directly on the information in the
database, then an SQL query produces the initial .csv for other applications.
SearchManager includes a simple user interface to activate the rescore module.
However, the major user interface used in this thesis is the SQL-generated .csv file;
the tools developed to manage it make up a simple user interface to additional MAE
functions.
[Figure 2.11 image labels: Protein or spectral database; Peptide-centric mapping of peptides to proteins.]
Figure 2.11 Overview of Protein Identification in Shotgun Proteomics, Showing
the Peptide Identification Functions That are Impacted by the MAE Expert
System Functions
2.3.2 Generation of MS and MS/MS Spectra in Proteomic Profiling
The Old laboratory utilizes the simulated spectra in protein profiling, a central
problem in molecular biology, which involves the identification and characterization
of the thousands of proteins found in a cell extract. The most successful method is
referred to as shotgun proteomics, and involves analysis of peptides generated by
enzymatic digestion of the proteins. This method uses a mass spectrometer (MS) to
ionize the peptides and fragment them, as illustrated in Figure 2.12. The
fragmentation spectrum is also referred to as an MS/MS spectrum because of the
two-stage process required to produce it. The MS/MS fragment ions contain
information that can be mapped back to the peptide amino acid sequence string
derived from the protein in the original sample. An example is shown in Figure 2.13,
where the data on the fragment ions are plotted by the mass of each fragment and the
number of molecules of each fragment observed (intensity). A key computational
problem is the interpretation of the MS/MS data, in order to assign a peptide sequence
based on the theoretical fragmentation patterns derived from a protein sequence
database. Accurate prediction of the spectra of the peptide candidates
is a key factor for correct peptide sequence assignment.
[Figure 2.12 image labels: Injection/Trapping, Isolation, Fragmentation, MS/MS Spectrum.]
Figure 2.12 MS/MS Analysis of an Ion in a Complex Sample (introduced from
the left) by Isolating a Single Population of Ions (represented by a single ion),
Which are then Fragmented by Collision Induced Dissociation with Helium Ions,
Followed by Counting of the Fragment Ions to Produce the MS/MS Spectrum
(Figure taken from Chem Biol 2 (2007) 39-52, Ahn et. al.)
[Figure 2.13 image; scan header: ABRF_20ferTtomolul_10ul_orb1 #5335 RT: 107.25 AV: 1 NL: 1.38E3; T: UMS + c NSI d Full ms2 1039.53@35.00 [275.00-2000.00]]
Figure 2.13 An Experimental MS/MS Spectrum, Where Each Major Peak in the
MS/MS Spectrum Represents Cleavage at a Peptide Bond, and the Molecular
Weight Difference between Two Peaks is the Amino Acid Difference (thus
mapping the fragment ions to the peptide sequence NPTVFFDIAVD)


2.3.3 Converting Peptide Sequences to Predicted Spectra
In general, when a search program is utilized to identify a peptide sequence, the
information in the experimental MS/MS spectrum is converted into a peptide
sequence based on the similarity between the experimental MS/MS spectrum
and a predicted spectrum of a candidate peptide from the protein database. The
possible candidate peptide sequences are first selected from an indexed list of all
possible peptide sequences from the protein sequence database, typically by matching
the predicted mass and the observed mass of the peptide, with some error tolerance
depending on the MS instrument used. The candidates are then scored to find the best
fit between the experimental spectrum and a predicted spectrum.
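The candidate selection step can be sketched as a filter on the peptide mass. The
residue masses below are standard monoisotopic values; the function names and
the use of a parts-per-million tolerance are assumptions for illustration and do
not describe how SearchManager or any specific search program is implemented.

```python
# Monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the residue sum for an intact peptide


def peptide_mass(peptide):
    """Monoisotopic mass of the neutral, unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def select_candidates(observed_mass, peptides, tol_ppm):
    """Keep the peptides whose predicted mass is within tol_ppm of the observed mass."""
    selected = []
    for peptide in peptides:
        predicted = peptide_mass(peptide)
        if abs(predicted - observed_mass) / predicted * 1e6 <= tol_ppm:
            selected.append(peptide)
    return selected
```

Note that PEPTIDE and its permutation EPPTIDE have identical masses, so both
pass the filter at any tolerance. This is why the surviving candidates must still be
scored against the fragmentation spectrum.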
It is clear that the MS/MS spectrum is very rich in information, and that this
information is complex. The generation of the predicted spectrum usually involves
simplification of the expected spectrum, in order to speed the computational process.
In general, all approaches are most successful for well-behaved, doubly charged
parent ions, and often have difficulty when the MS/MS spectrum has chemical noise
(where fragment ions are present that are derived from other parent ions in the
isolation window, referred to as chimera spectra), or the peptide has unusual gas
phase chemistry (singly charged cases, or non-mobile proton cases where
fragmentation is restricted to a few sites, instead of producing long, continuous
strings of fragment ions).
These problems complicate the comparison between the experimental spectra
generated in the LC/MS and predicted spectra from peptide sequences. In addition,
earlier proteomic profiling experiments were commonly carried out with the LCQ
instrument, which did not measure the parent mass very accurately, so the number of
candidate peptides was enormous due to the requirement of a wide error tolerance.
Fortunately, the high mass accuracy instruments currently in common use
(LTQ/Orbitrap) provide more accurate parent mass measurements, allowing better
filtering to remove irrelevant candidates. However, predicting the spectrum of a
candidate peptide sequence is still very difficult, and much research in this field has
focused on ways to predict spectra accurately in order to improve discrimination
between the correct and incorrect assignments.
2.3.4 Predicting Peptide Fragmentation is Critical for Identifying Sequence
There are several basic approaches to using the information in fragmentation
spectra to identify the peptide sequence, but only four have proven useful (Sadygov et
al., 2004). The first method used in automated searching (Mann et al., 1994) uses a
peptide sequence tag like that shown in Figure 1.6. This information can be
employed to match to the candidate peptide sequences by pattern matching. Because
the sequence tag can be found from b or y ions, it needs to be considered
bidirectionally. The matching of the other ions in the MS/MS spectrum is then used to
score differences between the candidates. X!Tandem (Craig et al., 2004) is an
example of this type of algorithm.
A second approach uses a probability function to account for the likelihood of
observing the set of fragment ions in the experimental spectrum by calculating their
probabilities in the candidate spectra (this method is utilized by the search program
Mascot (Stein et. al., 1994)). Neither of these two approaches uses the intensity
information, although several groups have used it in validation methods.
A third approach makes a very simple model of intensities. For example, the
SEQUEST search engine uses a simple cleavage model to produce a spectrum for
cross-correlation scoring to identify which peptide candidate is likely the correct
assignment (Eng et al., 1994). First, the b and y fragment ions for the candidate
peptide sequence are calculated, and all are given the same intensity of 50 counts. In
addition, isotopic variants due to the presence of 13C are added, by adding an ion that
is +1 Da greater in mass than each of the b and y fragment ions, and these 13C ions
are all given an intensity of 25.0. Three other kinds of ions are added that mimic the
losses of water and ammonia and the a-type ions that are often present, and these are
assigned an intensity value of 10.0. As we observed, such a simply predicted spectrum,
which does not consider peptide sequence chemistry, does not really represent the
experimental spectrum from LC/MS instrument analysis, but it is sufficient to produce
a reasonable cross-correlation score, and typically identifies about 70% of the cases
with reasonable confidence. Problems arise when the predicted spectrum contains
ions with low m/z values that the experimental spectrum does not show, or when the
predicted spectrum neglects neutral losses that the experimental spectrum does
contain. Another problem is that the ratio among the types of predicted ions in the
model does not really fit the ratio we observed in the experimental spectra, particularly
for smaller and larger peptides (the ratios given are optimum for peptides of about
1400 Da).
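The fixed-intensity scheme described above can be sketched as follows. The
intensities 50, 25, and 10 and the +1 Da isotope offset come from the text;
everything else is an assumption for illustration, including the use of
monoisotopic masses, singly charged ions only, and the choice to apply water and
ammonia losses to both ion series. It is not the SEQUEST implementation.

```python
# Monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, AMMONIA, CO, PROTON = 18.01056, 17.02655, 27.99491, 1.00728


def by_ions(peptide):
    """Singly charged b and y ion m/z values for each backbone cleavage site."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal fragment
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal fragment
    return b_ions, y_ions


def simple_spectrum(peptide):
    """Return sorted (m/z, intensity, label) peaks with fixed intensities."""
    b_ions, y_ions = by_ions(peptide)
    peaks = []
    for label, mzs in (("b", b_ions), ("y", y_ions)):
        for mz in mzs:
            peaks.append((mz, 50.0, label))
            peaks.append((mz + 1.0, 25.0, label + "+13C"))    # +1 Da isotope peak
            peaks.append((mz - WATER, 10.0, label + "-H2O"))
            peaks.append((mz - AMMONIA, 10.0, label + "-NH3"))
    for mz in b_ions:
        peaks.append((mz - CO, 10.0, "a"))                    # a ion = b ion minus CO
    return sorted(peaks)
```

Because every peptide bond receives the same intensity, this kind of spectrum
carries no information about which bonds cleave easily, which is exactly the
information that the kinetic model tries to supply.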
A fourth approach uses stochastic models for database searching. SCOPE is
an example of a program based on a probability estimator for peptide fragmentation
and the subsequent generation of tandem mass spectra. It models a two-step
stochastic process: fragmentation and measurement.
2.3.5 Spectrum to Spectrum Matching
More recently, direct comparisons between the experimental spectrum and a
library of reference MS/MS spectra have been utilized for searching, using observed
spectra (Craig et al., 2004, Frewen et al., 2006, Lam et al., 2007). For example,
X!Hunter uses a library generated by averaging together spectra that were identified
with the same peptide sequences, sequence modifications and parent ion charge
(Craig et al., 2006). This method is limited to previously observed spectra using the
same instrument type and fragmentation method. The Old lab has tested a library of
predicted spectra generated by the Zhang MassAnalyzer program (Yen et al., 2009).
This revealed that these spectra were reasonably comparable with the library of
previously observed spectra, but to be maximally useful, the library would require
improvement of the subset of spectra that were not well predicted by the simulator in
MassAnalyzer.
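Spectrum to spectrum matching can be illustrated with a binned average and a
cosine (normalized dot product) similarity. This is a generic sketch: the 1 Da bin
width, the function names, and the choice of cosine similarity are assumptions,
and the scoring actually used by X!Hunter or by the Old lab may differ.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, width=1.0):
    """Sum the intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / width)] += intensity
    return dict(binned)


def average_spectra(spectra, width=1.0):
    """Average several spectra of the same peptide into one library entry."""
    total = defaultdict(float)
    for peaks in spectra:
        for key, value in bin_spectrum(peaks, width).items():
            total[key] += value
    return {key: value / len(spectra) for key, value in total.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two binned spectra (1.0 means identical shape)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The same similarity function can compare an experimental spectrum with either
an averaged observed spectrum or a predicted spectrum, which is what makes a
predicted library a drop-in alternative to an observed one.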
2.3.6 Peptide Fragmentation
The specific fragment ions generated in an MS/MS spectrum have masses
dependent on the sequence of the peptide that was selected for fragmentation.
Activation to produce the fragmentation occurs by introducing helium gas to the MS,
then moving the peptide ions through an electric field at a rapid rate, so that they
collide with many helium atoms over a few microseconds. This is considered a slow-
heating procedure in an ion trap, and usually only enough energy is introduced into a
peptide ion to induce one fragmentation event. But the spectrum represents a
population of ions, so the full set of breakable bonds is observed in the resulting
mixture of fragment ions. In an ideal case, mass differences between the ions in the
MS/MS spectrum can be used to read the sequence. This is shown in Figure 1.2,
where a part of the peptide sequence is shown mapped onto the spectrum, illustrating
the relationship between the peptide sequence and the fragment ions in the MS/MS
spectrum. This exemplifies the essential programmatic problem that is required for
automated analysis of MS/MS spectra by search programs, the conversion of the
information in the MS/MS spectrum into a form that can be mapped onto a sequence
(or vice versa). In this section, the principles behind the mapping of the information
in the MS/MS spectrum onto a peptide sequence will be discussed.
Because a peptide is a linear string of amino acids, each fragmentation at a
peptide bond will produce two products. Figure 2.14 shows a peptide which contains
5 amino acids (each side chain R represents an amino acid). An amino acid or peptide
contains two ends that are chemically different: the N terminus (left hand side) that
is a primary amine (except for proline, which is a secondary amine) and the C
terminus (right hand side) that is a carboxylic acid. Where amino acids join together
to make the peptide, the two ends of the adjacent amino acids are combined to make a
peptide bond. There are three bonds in this region which can be broken, each
indicated by dotted lines. Normally in a mass spectrometer, the peptide bonds are
broken most easily, producing two types of fragments. The two peptide fragments are
named the b ion, for the fragment ion extending to the left of the fragmentation site,
and the y ion, for the one extending to the right.
Figure 2.14 Nomenclature of Fragment Ions, such as the N-Terminal a, b, c
Ions and the C-Terminal x, y, z Ions, Generated by Breaking the Peptide
Backbone. R stands for the side chain of any Amino Acid. (Figure taken from
Methods 35 (2005) 211-222, Wysocki et al.)
Also, other bonds can fragment; if the bond between the alpha carbon of an amino
acid and the carboxyl is cleaved, a so-called a ion is produced. Other cleavages (x,
z, and c ions) are shown in Figure 2.14, but these are almost never observed in the
MS instruments used for proteomics profiling. Secondary losses of water or ammonia
are often observed, and a second cleavage of the peptide backbone can produce
internal fragment ions.
2.3.7 Interpretation of MS/MS Spectra
Figure 2.15 illustrates a peptide fragmentation example which consists of eight b
ions and eight y ions. These are numbered in order of the peptide bond that was
fragmented; b ion counts from left to right, while the order of the y ion counts is
reversed. Peaks have different intensities because the bonds vary in the ease with
which they cleave. This example shows all the possible fragments, but usually not all
the possible ions are generated, or if generated, they may not be detected by the mass
spectrometer. Based on mass differences between ions and experience with the
chemistry, a manual de novo analysis can be done. For simplicity, we will first
assume that we know which peaks are b or y ions, respectively red or blue (Figure
2.15). In this case, the sequence is directly read off by determining the m/z differences
within each series of ions. For instance, the m/z difference in Daltons between y7 and
y6 is the residue mass of tyrosine (Y or 163 Da), y6 to y5 is glutamate (E or 129 Da),
and y5 to y4 is valine (V or 99 Da). So, we know that the unknown peptide contains a
sequence tag VEY. However, the sequence direction may be reversed, so the
unknown peptide may contain YEV instead. Fortunately, the direction can usually
be determined because y ions have a C-terminal OH, which shifts their mass.
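Reading a sequence tag from mass differences can be sketched as a lookup of
each difference in a table of residue masses. The residue masses are standard
monoisotopic values; the function names, the 0.02 Da tolerance, and the m/z
values in the usage note are hypothetical and chosen only to reproduce the V, E,
Y example from the text.

```python
# Monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residue_for(delta, tol=0.02):
    """Return the residue(s) whose mass matches delta, or '?' if none does.

    Isobaric residues are reported together, for example 'L/I'.
    """
    matches = [aa for aa, mass in MONO.items() if abs(mass - delta) <= tol]
    return "/".join(matches) if matches else "?"


def read_tag(series_mz, tol=0.02):
    """Read residues from consecutive singly charged ions of one series.

    The residues are returned in order of increasing m/z; whether this is the
    forward or reversed sequence depends on whether the series is b or y.
    """
    ordered = sorted(series_mz)
    return [residue_for(high - low, tol) for low, high in zip(ordered, ordered[1:])]
```

For hypothetical y4 to y7 ions at m/z 500.30, 599.36841, 728.41100, and
891.47433, read_tag returns V, E, and Y, matching the tag in the text. Leucine
and isoleucine cannot be distinguished by mass, so they are reported as L/I.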
Figure 2.15 Peptide Ions Generated in MS/MS Fragmentation. The Resulting
Spectrum Showing the Observed Set of Ions, Illustrating the Intermixing of the
Two Types of Sequence ions, including the C-terminus (y ions) and the N-
terminus (b ions) (Sadygov, et. al., 2004)
Note that the intensities of the ions in Figure 2.12 are different from each other.
This is because each cleavage is a separate chemical reaction, driven by the presence
of protons near the peptide bond. Protons move around the peptide depending upon
the basic nature (or proton affinity) of the various sites. If a peptide bond is very basic,
it will attract protons, and cleave more readily, producing more ion intensity in the
fragment ion products of that cleavage. This is referred to as the mobile proton
hypothesis. If a peptide contains several Arginine residues, which bind protons very
tightly, cleavage in that peptide is restricted to sites that have a resident
proton (for instance, adjacent to an acidic group). These are referred to as the non-
mobile proton cases, and present special challenges for search programs, because
their cleavage products are not randomly generated along the sequence. In addition,
in modeling this chemistry, it might be necessary to consider the two types of
peptides separately.
2.3.8 Mobile Proton Hypothesis
The kinetic model for prediction of dissociation of spectra was originally
conceived as a test of the Mobile Proton Hypothesis for gas phase chemistry of
peptides. This model was defined by V.H. Wysocki (Wysocki, et al., 2000) to
explain the relationship between peptide charge, acquired by addition of protons, and
fragmentation. Her group, as well as others, has contributed to a large body of data
studying these mechanisms (Dongre, et al., 1996; Tsaprailis, et al., 1999; Jones, et
al., 1994). The model predicts that protons will move around on a peptide that is in
the gas phase inside an MS, to produce a population of differentially protonated forms.
These various forms depend on the internal energy content of the peptide and the gas-
phase basicities of the different protonation sites of the peptide. The specific bond
that fragments is the one that has attracted a proton, as shown in Figure 2.16. These
cleavages are called charge-directed.
Figure 2.16 Mobile Proton Model Showing a Proton (red H+) Moving from
One NH2 Position to the Oxygen in a Peptide Bond, Producing a Rearrangement
and Breaking of the Peptide Bond. (Figure taken from V.H. Wysocki et al., J.
Mass Spectrom. (2000), 35, 1399-1406)
Protons in other parts of the peptide can be involved in a similar type of chemistry,
but only locally. An example is shown in Figure 2.17, where an acidic side chain
contributes a proton to cleavage of the adjacent bond. This type of cleavage is called
charge-remote, because the proton that charges the peptide so it can be seen in the
MS is someplace else on the peptide. This is most likely to occur when there are
arginine amino acids in the peptide that are equal in number or greater than the
number of protons. The energy required for proton mobilization from a basic side-chain
or the amino terminus depends on the amino acid composition, with dissociation
energy requirements greatest for arginine-containing peptides and decreasing in the
order Arg-containing > Lys-containing > non-basic, mimicking the order of
decreasing gas-phase basicity (Dongre AR et al. J. Chem Soc 1996; 118:8365).
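The rule of thumb stated above (arginine count equal to or greater than the number of protons) can be written as a small check. This is a deliberately simplified sketch: the function name proton_mobility and the two-way "mobile" / "non-mobile" labels are assumptions made for illustration, and lysine or histidine effects are ignored.

```python
def proton_mobility(sequence, charge):
    """Classify a peptide ion with a simplified arginine-count rule.

    sequence: peptide sequence in one-letter code.
    charge:   number of protons on the precursor ion.
    Returns 'non-mobile' when the number of arginine residues is equal
    to or greater than the number of protons, otherwise 'mobile'.
    """
    n_arg = sequence.upper().count("R")
    return "non-mobile" if n_arg >= charge else "mobile"


# A doubly charged peptide without arginine is treated as mobile.
print(proton_mobility("SVEMHHEALSEALPGDNVGFNVK", 2))
```

Such a flag could be used to separate the two peptide classes before modeling, in line with the suggestion above that they may need to be considered separately.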
Figure 2.17 Mechanism of Charge Remote Fragmentation for a Sequence (DX)
Dissociation to b and y Ions (Taken from V.H. Wysocki et al., J. Mass
Spectrom. (2000), 35, 1399-1406)
3. Feature Recognition
This chapter describes the feature recognition functions in the MAE expert system.
The major question that the feature annotation was designed to address was whether
we can deduce a set of predicted fragment ions from the peptide sequence, which will
account for all or nearly all of the ion intensity observed in the fragment ions. From
this information, the Proportion of Ion Current (PIC) score is calculated, in order to
provide an inference mechanism to evaluate this aspect. Two other inference
mechanisms were added, in order to provide sufficient information on specific types
of sequences.
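As a rough illustration of the idea behind the PIC score, the fraction of observed ion intensity that is explained by predicted fragment ions can be computed as below. This is a sketch under stated assumptions, not the MAE scoring code: the function name pic_score, the 0.5 Da matching tolerance, and the simple "nearest predicted m/z within tolerance" matching rule are all hypothetical.

```python
def pic_score(observed, predicted_mz, tol=0.5):
    """Fraction of the observed ion current explained by predicted ions.

    observed:     list of (m/z, intensity) pairs from the spectrum.
    predicted_mz: list of predicted fragment ion m/z values.
    tol:          assumed matching tolerance in Da (hypothetical).
    """
    total = sum(intensity for _, intensity in observed)
    if total == 0:
        return 0.0
    matched = sum(
        intensity
        for mz, intensity in observed
        if any(abs(mz - p) <= tol for p in predicted_mz)
    )
    return matched / total


# Hypothetical spectrum: two of the three ions are predicted.
spectrum = [(100.0, 50.0), (200.2, 30.0), (300.0, 20.0)]
print(pic_score(spectrum, [100.0, 200.0]))
```

A value near 1 would indicate that the predicted ions account for nearly all of the observed intensity.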
A major part of spectral analysis by an expert is the initial classification of the
fragment ions by their chemical type and annotation of the amino acid sequence
string that they represent. In this chapter, the programmatic method for carrying out
that analysis is provided.
In order to exploit the information in experimental and predicted spectra for
studies involving chemistry, the chemical nature of the ions must be evaluated. For
example, a user may want to study the fragmentation of the second peptide bond, the
cleavage between two specific amino acids, the dehydration of the C-terminus, or the
difference between the singly and doubly charged forms of the same ion. A major
function of the MAE expert system is to classify the fragment ions in the same way
that an expert does, when carrying out manual analysis. This must be done in a
general way that allows targeting of various aspects of the chemistry.
3.1 Simplifying Information in DTA Files
An important result from the study of the rescoring of the search results was that
preprocessing the experimental MS/MS spectra had a significant impact on
discrimination and feature recognition. Preprocessing mass spectrometry (MS/MS)
data has been recognized as a crucial preliminary phase for performing data
management and knowledge discovery tasks on mass spectra (Gullo, et al., 2008).
Processing involves combining spectra derived from the same peptide ion, in order to
enhance signal to noise. This produces complications in the spectra that required
development of an algorithmic approach to remove noise and then combine the
related ions into a single peak.
3.1.1 Creating an Ion List
To preprocess LCQ data, I begin with the information in the DTA file generated
by the vendor software to present the essential information in the MS/MS scans in the
Raw file (the data file collected by the MS instrument). The first step is to create an
"ion list" of the most intense DTA ions (a DTA ion is one line in the DTA file) equal
to seven or fourteen times the number of amino acids in the candidate peptide for
singly or multiply charged cases, respectively. These values were determined as
twice the number that would encompass all the obvious ions in a survey of high
intensity, richly fragmenting MS/MS spectra. The remaining DTA ions are
categorized as bulk noise ions. The intensity mean and standard deviation (s.d.) are
then calculated for the bulk noise ions.
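The ion list step can be sketched as follows. The function name build_ion_list and the tuple layout are assumptions made for illustration, and the use of the sample standard deviation is also an assumption, since the text does not say which estimator was used.

```python
import statistics


def build_ion_list(dta_ions, n_residues, precursor_charge):
    """Split DTA ions into an ion list and bulk noise ions.

    dta_ions:         list of (m/z, intensity) pairs, one per DTA line.
    n_residues:       number of amino acids in the candidate peptide.
    precursor_charge: 1 for singly charged, >1 for multiply charged.
    Returns (ion_list, noise_ions, noise_mean, noise_sd).
    """
    # Seven ions per residue for singly charged precursors,
    # fourteen per residue for multiply charged precursors.
    per_residue = 7 if precursor_charge == 1 else 14
    limit = per_residue * n_residues

    ranked = sorted(dta_ions, key=lambda ion: ion[1], reverse=True)
    ion_list = ranked[:limit]
    noise_ions = ranked[limit:]

    intensities = [intensity for _, intensity in noise_ions]
    noise_mean = statistics.mean(intensities) if intensities else 0.0
    # Sample standard deviation (assumption); needs at least two values.
    noise_sd = statistics.stdev(intensities) if len(intensities) > 1 else 0.0
    return ion_list, noise_ions, noise_mean, noise_sd
```

The returned mean and s.d. of the bulk noise ions are the quantities used later to set the noise threshold.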
To illustrate the handling of the information in the DTA file by these three
preprocessing functions, an example was chosen where three MS/MS
spectra (Figure 3.1) were summed to produce a single DTA file (Figure 3.2). This
DTA gave a high confidence assignment after database searching, corresponding to
the peptide sequence SVEMHHEALSEALPGDNVGFNVK (XCorr 4.92, Mowse 94).
A triply charged form of the same peptide co-eluted with the doubly charged
form, increasing the confidence in the assignment.
This extraction of the DTA file was performed by the vendor Extract_MSN
software. The file contains 1,313 m/z entries, including noise ions, most of which are
too low to see in the spectrum. The mass accuracy is given to one decimal place by
the software. For the weak, doubly charged ion (1082-1084 Da), note that the process
of centroiding and summation used to generate the DTA file produces several
apparent fragment ions with no obvious isotope envelope; thus, deconvoluting charge
information and identifying the monoisotopic ion is difficult. In Figure 3.1, expanded
views are shown of three mass regions, containing a weak doubly charged ion and
two stronger singly charged ions, in order to illustrate how DTA and sDTA files are
processed (Left panel: the full MS/MS spectrum; Left expanded panel: 1082-1084 Da;
Middle expanded panel: 1321-1324 Da; Right expanded panel: 1433.5-1437 Da).
3.1.2 Removing Noisy Ions
In order to simplify the spectrum, the lines in the ion list are then grouped
into "clusters" of DTA ions and combined. The resulting single peak is referred to as
an "Average Mass Ion", with intensity equal to the sum of all the included DTA ions.
The m/z and intensity of the Average Mass Ions are calculated using Eq. 3.1 and Eq.
3.2, shown here for an example cluster with three fragment ions. We refer to the
weighted average m/z of each reprocessed peak as "M/z"; I_k and M_k are the intensity
and m/z of the DTA ions within each cluster, which are combined during processing of
the DTA file.
$$I = \sum_k I_k = 2656 + 19046 + 5015 = 26717 \tag{3.1}$$
$$M/z = \frac{\sum_k I_k M_k}{I} = \frac{1103.3 \times 2656 + 1103.5 \times 19046 + 1104.5 \times 5015}{2656 + 19046 + 5015} = 1103.67 \tag{3.2}$$
The resulting I and M/z values for each Average Mass Ion are written to the
sDTA. An error correction of 1.0002 x M/z was applied to each M/z value, after
analyzing the high quality assignments in the list. A second noise threshold was then
determined, where Average Mass Ions with I greater than the mean + 2.8 x s.d. of the
bulk noise ions were classified as "major ions" for the next step of processing. This
threshold was chosen to eliminate >98% of noise ions in spectra where MS/MS had
apparently been carried out on an MS noise peak. Such cases were identified when
spectra were very weak, no MS/MS was observed for ions with similar m/z and
reverse phase retention time but with higher signal, and no peptide tag sequence could
be identified.
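The Average Mass Ion calculation, the 1.0002 mass correction, and the major-ion threshold can be sketched together as below. The function names are hypothetical and chosen only for this illustration; the numbers in the usage example are the three-ion cluster used in Eq. 3.1 and Eq. 3.2.

```python
def average_mass_ion(cluster):
    """Combine a cluster of DTA ions into one Average Mass Ion.

    cluster: list of (m/z, intensity) pairs.
    Returns (M/z, I): the intensity-weighted average m/z and the
    summed intensity of the cluster.
    """
    total = sum(intensity for _, intensity in cluster)
    weighted_mz = sum(mz * intensity for mz, intensity in cluster) / total
    return weighted_mz, total


def correct_mz(weighted_mz, factor=1.0002):
    """Apply the multiplicative mass correction to an M/z value."""
    return weighted_mz * factor


def is_major_ion(intensity, noise_mean, noise_sd, k=2.8):
    """True when intensity exceeds mean + k * s.d. of the bulk noise ions."""
    return intensity > noise_mean + k * noise_sd


cluster = [(1103.3, 2656), (1103.5, 19046), (1104.5, 5015)]
mz, intensity = average_mass_ion(cluster)
print(round(mz, 2), intensity)  # 1103.67 26717
```

Keeping the correction in a separate function makes it easy to compare corrected and uncorrected M/z values when checking mass errors.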
3.1.3 Illustrating the Full Process
Figure 3.3 shows the final simplified DTA (sDTA) file after MAE processing.
Many of the noise ions were removed and overall signal of most remaining ions
increased. The three expanded panels show the resulting fragment ions of 1083.73
Da (doubly charged y20, pred m/z = 1083.72), 1322.03 Da (b12, pred m/z = 1322.45), and 1435.11
Da (b13, pred m/z = 1435.61), where the clustering process produces combined ions with
greater intensity. Clusters are identified by choosing the most intense DTA ion as the
starting point, then assigning ions to the cluster that are within -2.0 to +2.5 Da of that
ion in three steps: (1) check for ions within 1 Da of the starting point; (2) check for
additional ions within 1 Da of those identified in the first step; (3) check for
additional ions within 1 Da of those identified in the second step. The 4.5 Da
window was chosen to include 95% of the ion intensity of the most intense ions
(based on a random survey of 30 cases from 5 LC/MS files).
The process is repeated with the next most intense ion in the ion list, until all ions
have been assigned to a cluster. For example, to produce the ion at 1083.73 Da, MAE
first selects the most intense ion in the region (1084.2 Da) and looks for adjacent ions
between 1082.2-1086.7 Da, extending in one Da steps. The extension of the cluster
stops if ions fail to appear within 1 Da, or if ion(s) within 1 Da have intensity below
3000 counts (e.g., 1082.1 Da). Thus, MAE first clusters ions within 1 Da of the
1084.2 peak (1083.5 and 1084.3 Da), then clusters adjacent ions extended by 1 Da
(1082.8 Da), followed by ions further extended by 1 Da from 1082.8 (none observed,
therefore search is stopped for this cluster). All ions within this cluster are then
combined to produce an ion with weighted average mass 1083.73 Da and intensity
equal to the sum of each ion. Note that several small ions are within these limits, but
are not included in the calculation because they are classified as bulk noise (2.8 x s.d.,
see Methods).
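The three-step cluster extension can be sketched as below. This is one possible reading of the procedure, not the MAE source: the function name cluster_ions is hypothetical, the 3000-count cutoff from the worked example is treated as a general parameter, and the input is assumed to be the ion list after bulk noise ions have been set aside. The intensities in the usage example are invented; only the m/z values come from the text.

```python
def cluster_ions(ions, back_win=2.0, forw_win=2.5, step=1.0,
                 min_intensity=3000.0, n_steps=3):
    """Group (m/z, intensity) pairs into clusters around intense ions.

    The most intense unassigned ion seeds a cluster. The cluster is then
    extended up to n_steps times: each step adds ions that lie within
    `step` Da of an ion added in the previous step, stay inside the
    -back_win .. +forw_win window around the seed, and have at least
    min_intensity counts. Extension stops when a step adds nothing.
    """
    remaining = sorted(ions, key=lambda ion: ion[1], reverse=True)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        low, high = seed[0] - back_win, seed[0] + forw_win
        cluster, frontier = [seed], [seed]
        for _ in range(n_steps):
            added = [
                ion for ion in remaining
                if low <= ion[0] <= high
                and ion[1] >= min_intensity
                and any(abs(ion[0] - f[0]) <= step for f in frontier)
            ]
            if not added:
                break
            for ion in added:
                remaining.remove(ion)
            cluster.extend(added)
            frontier = added
        clusters.append(cluster)
    return clusters


# m/z values from the 1084.2 Da example; intensities are hypothetical.
example = [(1084.2, 20000.0), (1083.5, 9000.0), (1084.3, 8000.0),
           (1082.8, 5000.0), (1082.1, 1000.0), (1090.0, 4000.0)]
for cluster in cluster_ions(example):
    print(sorted(mz for mz, _ in cluster))
```

In this sketch a weak ion that is never absorbed ends up as its own single-ion cluster; in practice such ions would be expected to fall below the noise threshold.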
Figure 3.1 Three Scans MS/MS Spectra
Figure 3.2 Three Scans Merge into One Spectrum (panel label: DTA(316-322))
Figure 3.3 sDTA Processing for Removing Isotope Ions and Noise (expanded panels: 1082-1084, 1321-1323, and 1434-1436 Da)
3.1.4 Implementing the De_Isotope Function
The De_Isotope function was implemented in order to merge isotope ions into
one ion with average ion mass and erase noise, as described in this section. First, the
observed MS/MS spectrum is translated into an n×2 matrix in which the first column
represents m/z, the second column represents the corresponding intensity, and n is the
number of ions.
The following pseudocode describes the De_Isotope function.
// input : ion_Mlimit
//         an N×2 matrix matN2. First column stores the mass of each ion,
//         second column stores the corresponding intensity of each ion
//         N is the number of ions in the MS/MS experimental spectrum
// output: an M×2 matrix matM2
//         M is the number of ions in the sDTA
De_Isotope(ion_Mlimit, matN2)
1   N ← size of matN2
2   create matrix matM2
3   sort matN2 in order of increasing mass
4   set back_win
5   set forw_win
6   for i ← 0 to ion_Mlimit-1
7       id_max ← index of the ion with max intensity in matN2
8       matM2[i][0] ← matN2[id_max][0]
9       matM2[i][1] ← matN2[id_max][1]
10      delete matN2[id_max]
11      id_max_next ← id_max+1
12      id_max_pre ← id_max-1
13      while matN2[id_max_next] - matN2[id_max] is within forw_win
14          matM2[i][1] ← matM2[i][1] + matN2[id_max_next][1]
15          delete matN2[id_max_next]
16          id_max ← id_max_next
17          id_max_next ← id_max+1
18      id_max ← id_max_pre+1
19      while matN2[id_max_pre] - matN2[id_max] is within back_win
20          matM2[i][1] ← matM2[i][1] + matN2[id_max_pre][1]
21          delete matN2[id_max_pre]
22          id_max ← id_max_pre
23          id_max_pre ← id_max-1
24  Output matM2
This De_Isotope function outputs an M×2 matrix of de-isotoped ions,
where M is the number of ions in the final sDTA data. The algorithm first assigns the
size of the matrix matN2 to N, creates a new matrix matM2, and sorts the
matrix matN2 in order of increasing mass of the ion entries
on code lines 1-3. At the start of each iteration of the for loop on code lines 6-23,
id_max records the index of the ion with the maximum current intensity in matN2 on
code line 7; the program assigns that ion's information to matM2 on code lines 8-9 and
deletes the ion from the matrix matN2 on code line 10. On code lines 11-23, the program
searches forward and backward from that ion for isotope ions within the isotope window
(called forw_win and back_win). If an adjacent ion falls in the isotope window on code
line 13 or 19, the program adds the ion intensity to matM2[i][1] on code line 14 or
20, deletes the ion from the matrix matN2 on code line 15 or 21, and then assigns
id_max_next or id_max_pre to id_max on code line 16 or 22.
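A runnable sketch of the same idea in Python is given below. It follows the pseudocode in keeping the m/z of the most intense ion and summing the intensities of its neighbours, but it is not a line-by-line transcription: the index handling is rewritten so that deletions do not shift the comparison target, neighbours are compared with the m/z of the most intense ion, and the default window values (-2.0 and +2.5 Da, taken from Section 3.1.3) are an assumption about how back_win and forw_win are set. The weighted average m/z of Eq. 3.2 could be substituted for the retained m/z.

```python
def de_isotope(ion_limit, ions, back_win=2.0, forw_win=2.5):
    """Merge isotope and neighbouring ions around the most intense ions.

    ion_limit: maximum number of merged ions to produce.
    ions:      list of (m/z, intensity) pairs.
    Returns a list of (m/z, summed intensity) pairs, most intense
    seed first.
    """
    remaining = sorted(ions)  # increasing m/z
    merged = []
    for _ in range(ion_limit):
        if not remaining:
            break
        # Index of the most intense ion still unassigned.
        idx = max(range(len(remaining)), key=lambda k: remaining[k][1])
        apex_mz, total = remaining.pop(idx)

        # Forward search: after the pop, position idx holds the next ion.
        while idx < len(remaining) and remaining[idx][0] - apex_mz <= forw_win:
            total += remaining.pop(idx)[1]

        # Backward search: position idx - 1 holds the previous ion.
        while idx > 0 and apex_mz - remaining[idx - 1][0] <= back_win:
            idx -= 1
            total += remaining.pop(idx)[1]

        merged.append((apex_mz, total))
    return merged


ions = [(1103.3, 2656.0), (1103.5, 19046.0), (1104.5, 5015.0), (1200.0, 100.0)]
print(de_isotope(2, ions))
```

Working on a sorted copy leaves the caller's list untouched, which makes it easier to compare the DTA and sDTA ion lists afterwards.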
3.1.5 Characterizing the sDTA Information
Accuracy was assessed by examining mass errors of fragment ion assignments
(predicted m/z - observed M/z) in the cases where we were confident of the peptide
identification. The mass error distribution of all b and y fragment ions showed an
offset of the mean from zero (mean = 0.1535, s.d. = 0.492), indicating a systematic
error in mass determination, which varied with fragment ion M/z. This was not an
artifact due to combining cluster lines, because in unprocessed clusters the mass
accuracy of the m/z lines corresponding to the monoisotopic peaks showed the same
systematic mass error. It appears to be an aspect of the mass inaccuracy of the
ThermoElectron 3D ion traps (Cox, et al., 1995), because a nearly identical
systematic error was observed in LCQ data from another lab (Chen, et al., 2005) and
in both the LCQ Classic and LCQ XP instruments in our lab.
Figure 3.4 Error Analyses of the Singly Charged Ions from High Scoring
Peptides Identified by MAE Compared with the Predicted Masses. The
Corrections Were Applied to the Small Test MuDPIT Dataset of K562 Proteins
To compensate for the systematic error, we multiplied each M/z value by a factor
of 1.0002, based on linear regression of (m/z - M/z) vs. M/z. The smaller standard
deviation may be related to the fact that this dataset sampled a wider concentration
range of parent ions than the small test dataset. The range of observed errors is well
within the error level expected for 3D ion traps, and indicates that ions within 1.5 Da
of each other cannot be accurately distinguished.
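One way such a multiplicative factor could be estimated is sketched below. The text states only that linear regression of the mass error against M/z was used; fitting a line through the origin, as done here, is an assumption, and the function name fit_scale_factor and the synthetic data are hypothetical.

```python
def fit_scale_factor(observed, predicted):
    """Estimate a multiplicative mass correction factor.

    Fits error = slope * observed by least squares through the origin,
    where error = predicted m/z - observed M/z, and returns 1 + slope,
    so that observed * factor approximates predicted.
    """
    numerator = sum(o * (p - o) for o, p in zip(observed, predicted))
    denominator = sum(o * o for o in observed)
    return 1.0 + numerator / denominator


# Synthetic check: observed values that are low by a factor of 1.0002.
predicted = [500.0, 1000.0, 1500.0, 2000.0]
observed = [p / 1.0002 for p in predicted]
print(fit_scale_factor(observed, predicted))
```

With real assignments, the residual scatter after correction would give the standard deviation discussed above.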
3.1.6 Testing Whether sDTA are Sufficient for the Database Search
To test whether processing into sDTA files removed any critical information, all
sDTA files in the small test dataset were used in searches with Mascot and Sequest
and the results were compared with those obtained when searching using the
unprocessed DTA files. Overall, Mascot Mowse scores for correct assignments
remained the same or increased by 1-5% due to better matching of the high mass ions,
and the scores for incorrect hits decreased slightly due to reduced annotation of noise.
This resulted in a small increase in discrimination. On the other hand, >90% of
Sequest XCorr scores decreased with the sDTA files, sometimes by as much as 25%,
which we attributed to the reduction in noise. Sequest SP and ion scores increased,
because these scores are sensitive to greater matching between observed M/z and
predicted m/z values. (SP evaluates the presence of continuous runs of b and y
ions, while ion score evaluates the percentage of predicted b and y ions that are
observed.) Importantly, no correct assignments found with DTA files were removed
by searching sDTA files.
These results are consistent with other studies reporting similar improvements
upon deisotoping and removing noise from DTA files (Wehofsky, et al., 2002). As
the purpose of our study is to evaluate search results on DTA files, all other searches
in this study were done using DTA files. However, this test showed that the sDTA
files lost no significant information after processing by MAE, and in fact the
processing increased the likelihood that predicted and observed fragment ions could
be matched.
3.2 Available Feature Recognition Methods
To illustrate why a feature recognition system was necessary, I describe the
available programmatic methods for labeling the fragment ion features in MS/MS
spectra. The best of these are provided by the two major search programs, Sequest
and Mascot, which provide identical labeling. Figures 3.5 and 3.6 show the output of
Mascot, with the ions labeled for a correct peptide assignment and an incorrect
assignment. There are four major problems with the Mascot/Sequest approach to
feature recognition. 1) Only singly and doubly charged fragment ions are recognized,
which is a problem for higher charged peptides (all other algorithms for fragment ion
feature recognition recognize only singly charged ions). 2) There are no decisions for
alternative assignments; therefore, all possibilities are presented, which is very
confusing for large peptides where there are many alternatives, most of which are not
plausible. 3) There are secondary chemical products labeled, such as dehydrated
(marked with a 0) or deammoniated ions (marked with an *), without any primary
products from which they could be derived. 4) The doubly charged fragment ions are
often not plausible (very small fragment ions cannot physically accommodate two
protons).
Internal fragment ions, generated by cleavage at two peptide bonds, were not
annotated by any algorithm until recently, when Mascot made this available as an
option. These are usually relatively small; for example, the ion at 180 Da is an
internal fragment ion. These ions present a unique problem, in that they are generated
from two peptide bond cleavages, and can be observed as secondary product ions,
without primary products. This presents a combinatorial explosion for large peptides,
and requires some limitation to be useful; the method used by Mascot to limit these
assignments is not known.
Figure 3.5 An Experimental Spectrum with Features Labeled by Mascot for a Correct
Assignment (VLDLIAHISK)
Figure 3.6 An Experimental Spectrum with Features Labeled by Mascot for an Incorrect
Assignment (VITSGQLVYK)