Citation
Development of a computational model using cytokine Biomarkers for identifying lung disease types

Material Information

Title:
Development of a computational model using cytokine Biomarkers for identifying lung disease types
Creator:
Gnabasik, David F. ( author )
Language:
English
Physical Description:
1 electronic file (149 pages)

Subjects

Subjects / Keywords:
Lungs -- Diseases ( lcsh )
Cytokines ( lcsh )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Review:
This dissertation presents a computational model that distinguishes among 7 lung disease clinical types: healthy non-smokers, smokers diagnosed with and without Chronic Obstructive Pulmonary Disease (COPD), adenocarcinoma, squamous cell carcinoma, cystic fibrosis, and acute lung injury. The model reliably assigns an individual patient into one of these types from the conditional relationships of noisy, incomplete, and widely variable cytokine biomarker concentration measurements taken from blood serum. Panels of 12 cytokine measurements precisely classify both known and unknown patients into one of these distinct clinical types. The approach assigns patients to known clinical types from the conditional relationships of noisy, incomplete, and widely variable protein concentration measurements, including outliers. A discrete topological structure (DTS) formulation induces discrete state variables from concentration measurements through a binning algorithm that exposes the conditional relationships and dependencies among the concentration data. A unique application of an exclusive-or (XOR) operation on the state space extracts the patterns identifying the set of distinctive features for each clinical type.
Review:
The computational model builds a discrete topological structure from a baseline data set, and is developed using several novel schemes designed specifically for this analysis.
• All biomarker concentration values are placed into discrete bins forming bin states from multiple biomarkers measured from individual samples. The result is a multidimensional space representing a characteristic set of states for each clinical type population.
• The model builds conditional probabilistic relationships between bin states that successfully separate and distinguish bin states belonging to different clinical types.
• The model incorporates new clinical types and distinguishes the set of targeted biomarker variables that uniquely characterize the new type.
Review:
The computational model has been validated by processing a random 10% of the data that has been initially excluded, and by processing a separate data set extracted from the literature.
Thesis:
Thesis (Ph.D.)--University of Colorado Denver
Bibliography:
Includes bibliographical references.
System Details:
System requirements: Adobe Reader.
Statement of Responsibility:
by David F. Gnabasik.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
on10073 ( NOTIS )
1007341316 ( OCLC )
on1007341316


Full Text
DEVELOPMENT OF A COMPUTATIONAL MODEL USING CYTOKINE
BIOMARKERS FOR IDENTIFYING LUNG DISEASE TYPES
-by-
DAVID F. GNABASIK
B.A., University of Chicago, 1979
M.S., University of Colorado, 2002
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Computer Science and Information Systems
2017


© 2017
DAVID F. GNABASIK
ALL RIGHTS RESERVED


This thesis for the Doctor of Philosophy degree by David F. Gnabasik has been approved for the
Computer Science and Information Systems Program
by
Tom Altman, Committee Chair
Gita Alaghband, Advisor
Stephen Billups
Michael Mannino
Ilkyeun Ra
May 13, 2017


Gnabasik, David F. (Ph.D., Computer Science and Information Systems)
Development of a Computational Model Using Cytokine Biomarkers for Identifying Lung Disease Types
Thesis directed by Professor Gita Alaghband.
ABSTRACT
This dissertation presents a computational model that distinguishes among 7 lung disease clinical types: healthy non-smokers, smokers diagnosed with and without Chronic Obstructive Pulmonary Disease (COPD), adenocarcinoma, squamous cell carcinoma, cystic fibrosis, and acute lung injury. The model reliably assigns an individual patient into one of these types from the conditional relationships of noisy, incomplete, and widely variable cytokine biomarker concentration measurements taken from blood serum. Panels of 12 cytokine measurements precisely classify both known and unknown patients into one of these distinct clinical types. The approach assigns patients to known clinical types from the conditional relationships of noisy, incomplete, and widely variable protein concentration measurements, including outliers. A discrete topological structure (DTS) formulation induces discrete state variables from concentration measurements through a binning algorithm that exposes the conditional relationships and dependencies among the concentration data. A unique application of an exclusive-or (XOR) operation on the state space extracts the patterns identifying the set of distinctive features for each clinical type.
The computational model builds a discrete topological structure from a baseline data set, and is developed using several novel schemes designed specifically for this analysis.
• All biomarker concentration values are placed into discrete bins forming bin states from multiple biomarkers measured from individual samples. The result is a multidimensional space representing a characteristic set of states for each clinical type population.
• The model builds conditional probabilistic relationships between bin states that successfully separate and distinguish bin states belonging to different clinical types.
• The model incorporates new clinical types and distinguishes the set of targeted biomarker variables that uniquely characterize the new type.
The computational model has been validated by processing a random 10% of the data that was initially excluded, and by processing a separate data set extracted from the literature.
The form and content of this abstract is approved. I recommend its publication.
Approved: Gita Alaghband


ACKNOWLEDGEMENTS
I deeply thank my advisor, Dr. Gita Alaghband, for her intellectual support and encouragement during this journey.
We thank Dr. Mark W. Duncan, Ph.D., of the University of Colorado Anschutz Medical Campus, School of Medicine, for providing us with 5 of the 7 original unpublished data sets.
We thank Dr. Paul A. Bunn, Jr., MD, of the University of Colorado Anschutz Medical Campus, School of Medicine, for providing us with 2 of the 7 original unpublished data sets.
We thank Dr. Brid Ryan and Dr. Curtis C. Harris of NIH/NCI for the NCI-Maryland Cancer, NCI-Maryland Control, NCI-PLCOC Cancer, and NCI-PLCOC Control data sets.
[Brid M. Ryan, Ph.D., M.P.H., Center for Cancer Research, National Cancer Institute, Building 37, Room 3060C, Bethesda, MD 20892, Brid Ryan@nih.gov. https://ccr.cancer.gov/laboratory-of-human-carcinogenesis/brid-m-ryan ]
[Curtis C. Harris, M.D., Center for Cancer Research, National Cancer Institute, Building 37, Room 3068A, Bethesda, MD 20892, curtis harris@nih.gov. https://ccr.cancer.gov/Laboratory-of-Human-Carcinogenesis/curtis-c-harris ]


TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
CHAPTER I: INTRODUCTION
1.1 Motivation
1.2 Outline of the Thesis
CHAPTER II: RELATED WORK
2.1 Modeling Lung Disease
2.2 Topological Approaches
2.3 Protein Interaction Networks
CHAPTER III: DATA SOURCES and STATISTICAL ANALYSIS
3.1 Cytokine Biomarkers and Their Relevance
3.2 Data Sources
3.3 Statistical Analysis
3.4 Statistical Inference
3.5 Defining Variables in High-Dimensional Configuration and State Space
3.6 Statistical Assumptions
3.7 Handling Variation Through Model Constraints
CHAPTER IV: COMPUTATIONAL MODEL
4.1 Defining the Model
4.2 Motivation of the Discrete Topological Structure
4.3 Bin Size and Bin States
4.4 Developing the Computational Model
CHAPTER V: RESULTS and VALIDATION STUDIES
5.1 Distinguishing Biomarkers and the Measure of Similarity
5.2 10% Study Validation
5.3 Healthy Serum Validation
5.4 Population Variability and Conditional Structure
5.5 Strategies for Modeling Measurement Variation
5.6 NCI Study Validation
CHAPTER VI: TOPOLOGICAL ANALYSIS
6.1 Motivation for a Topological Analysis
6.2 Topological Homology and Betti Numbers
6.3 Computing Topological Connectedness
6.4 Topological Results
CHAPTER VII: CONCLUSION
REFERENCES
APPENDIX A Percent Missing Values per Clinical Type Biomarker
APPENDIX B The Chemical Stochastic Master Equation
APPENDIX C Definitions and Characteristics of Lung Disease
APPENDIX D Test Sensitivity and Specificity
APPENDIX E Description of the Binning Algorithm as a Learning Algorithm


LIST OF FIGURES
FIGURE
Figure 1: DTS values per biomarker for all patients of clinical type Adenocarcinoma.
Figure 2: Mean concentrations and standard error bars for Never Smokers & Adenocarcinoma.
Figure 3 (a-l): Mean concentration values for each biomarker for every clinical type.
Figure 4: Number of biomarker measurements for all baseline samples.
Figure 5 (a-i): All clinical type concentration values.
Figure 6: Ratio of the number of empty bins to the total number of bins.
Figure 7: Assigning Adenocarcinoma concentration [c] values to bin states.
Figure 8 (a-i): Complete set of 12 biomarker concentration values for every clinical type.
Figure 9 (a-i): All clinical type biomarker DTS values.
Figure 10: Point cloud for the baseline clinical types.
Figure 11: Plot of IFNG-only bin state Probability and DTS values for Never Smokers.
Figure 12: Plot of IFNG bin state Probability and DTS values for every clinical type.
Figure 13 (a-l): Bin state Probability and DTS values for every clinical type per biomarker.
Figure 14 (a): Partial integer matrix for IFNG values of Adenocarcinoma.
Figure 14 (b): Partial integer matrix for biomarker IFNG for all 7 clinical types.
Figure 15 (a): Partial integer matrix for biomarker EGF for all 7 clinical types.
Figure 15 (b): Partial integer matrix for biomarker IFNG for all 7 clinical types.
Figure 15 (c): Partial integer matrix for biomarker IL1A for all 7 clinical types.
Figure 15 (d): Partial integer matrix for biomarker IL1B for all 7 clinical types.
Figure 15 (e): Partial integer matrix for biomarker IL2 for all 7 clinical types.
Figure 15 (f): Partial integer matrix for biomarker IL4 for all 7 clinical types.
Figure 15 (g): Partial integer matrix for biomarker IL6 for all 7 clinical types.
Figure 15 (h): Partial integer matrix for biomarker IL8 for all 7 clinical types.
Figure 15 (i): Partial integer matrix for biomarker IL10 for all 7 clinical types.
Figure 15 (j): Partial integer matrix for biomarker MCP1 for all 7 clinical types.
Figure 15 (k): Partial integer matrix for biomarker TNFA for all 7 clinical types.
Figure 15 (l): Partial integer matrix for biomarker VEGF for all 7 clinical types.
Figure 16 (a-h): Every clinical type distinguished by their Probability and DTS states.
Figure 17: All baseline clinical types distinguished by Probability and DTS states.
Figure 18 (a-l): Each biomarker distinguished by Probability and DTS states within all baseline clinical types.
Figure 19: All baseline biomarkers distinguished by Probability and DTS states within all baseline clinical types.
Figure 20 (a-b): Best fit vectors for Adenocarcinoma [EGF, IFNG].
Figure 21: Quantitative clinical type classification: Po values for each clinical type.


LIST OF TABLES
TABLE
Table 1: All clinical type biomarker bin sizes in pg/ml.
Table 2: Biomarker Concentration Averages and Standard Deviation Values per Clinical Type.
Table 3: Biomarker Data Characteristics.
Table 4: Number of points for biomarkers; their Ô and Ō values for the first 6 bins.
Table 5: Clinical type identifiers used to distinguish bin states.
Table 6: Distinguishing biomarkers per clinical type in Probability and DTS dimensions.
Table 7: Number of distinguishing biomarker states n per clinical type Ct.
Table 8: Counts and percentages of permissible and non-permissible bin states.
Table 9: Using cosine similarity to assign baseline patients to a clinical type.
Table 10: Distinguishing biomarkers of excluded patients in Probability, DTS dimensions.
Table 11: Data sources for the NIH Maryland and PLCOC Studies.
Table 12: Topological properties for each clinical type.


LIST OF ABBREVIATIONS
Symbol      Description
[c]         set of concentration data values for Bi
μi          concentration means
Bi          biomarker; 1 ≤ i ≤ Q
Ct          clinical type; 1 ≤ t ≤ 7
Dr          set of observed concentrations per combination
DTS         discrete topological structure: a model of how protein concentrations change relative to one another
Gi          number of concentration values grouped in each bin
Hi          threshold in binning algorithm
log_e       natural logarithm
MC          population DTS matrix
MC-cond     population conditional probability matrix per clinical type
MC-joint    population joint probability matrix per clinical type
MC-marg     population marginal probability matrix per clinical type
Mz          sample DTS matrix
Mα          α interaction matrix
Mβ          β interaction matrix
Nr          number of bins for a combination
Ô (O-hat)   the set of occupied bin states per clinical type biomarker combination
Ō (O-bar)   the set of distinguishing bin states per clinical type biomarker combination; Ō ⊂ Ô
P(X|Y)      conditional distribution of X given Y
Pb,n        set of population probabilities per combination per bin Ni
PCt(Bi)     population probability for all patients
pg/ml       picograms (10^-12 grams) per milliliter
Q           biomarkers per patient panel (12)
R           total number of clinical type biomarker combinations
Wr          bin size; 1 ≤ r ≤ R, per combination
X, Y        bin states considered as random variables
XOR         the exclusive-or logical operation that outputs true only when inputs differ
Zt          set of Z patient samples in Ct; 1 ≤ z ≤ Z
σi          concentration standard deviations


CHAPTER I: INTRODUCTION
1.1 Motivation
Proteomics is the investigation of the entire protein content, or the protein complement of the genome, of a biological system: the proteome [63]. Specific biomarker proteins act as measurable indicators of biological disease [52], including lung disease. The overall goal is to ensure that people with lung diseases are accurately and cost-effectively diagnosed and then treated accordingly. The problem is acute, as the American Lung Association states that an estimated 158,080 Americans are expected to die from lung disease in 2016 [74].
The seven clinical types analyzed here account for some of the most frequent forms of lung disease [73]. Respiratory diseases are of multiple origin, and the selected clinical types cover a wide spectrum of suspected causes. Diagnosis and treatment are also problematic, given that Guarascio et al. [75] declare that "more conclusive evidence is needed concerning ideal therapy selection and appropriate sequencing of therapy." Regarding COPD, Spyratos et al. [76] state that "Due to the progressive nature of the disease, underestimation of symptoms by the patients, lack of knowledge and underuse of spirometry by the Primary Care providers the disease remains under-diagnosed in about half of the cases." See Appendix C for a brief description of lung disease characteristics.
Reliably measuring proteomic biomarker concentrations is difficult due to technical and biological variation, their wide dynamic range of concentrations, and numerous post-translational modifications [61]. However, the major limitation of proteomic investigations remains the complexity of biological structures and physiological processes [77]. Despite these variations, we developed a computational model that reliably distinguishes among various clinically diagnosed lung disease types. The model hypothesizes that biomarker interactivity induces a distinctive protein concentration topology for each clinical type, and that certain topological patterns revealed by these host-response proteins remain characteristically invariant.
The model selects the unique set of biomarkers, given a small number of biological and statistical assumptions, whose protein host-response topology corresponds to a patient's clinical type, thereby simplifying the high-dimensional biomarker concentration space so that certain topological patterns of lung disease are extracted. We propose a new computational model that:
1) distinguishes among multiple clinical types,
2) accurately assigns patients to these known types,
3) distinguishes the unique set of biomarkers that characterize each clinical type.
The model represents a space of distributions built upon computable discrete states that drastically reduces the number of plausible hypotheses despite significant data variation. We were initially motivated to explore this approach by Figure 1, which plots the DTS values (explained in §4.4:A) for every patient diagnosed with Adenocarcinoma for each of the 12 biomarkers. For the sake of visualization, a curve is fitted to each patient's panel of 12 measurements. Note how DTS values can be less than zero. We decided to investigate whether the gaps or holes visible in this plot uniquely identify each clinical type. Formulating proteomic concentration data as multidimensional discrete bin states offers several advantages:
1) It abstracts away some of the variation in the data as given, while still allowing direct comparisons among multiple cytokine biomarker variables.
2) It puts to use the cytokine biomarkers as effective host-response signals indicative of disease state.
3) It allows for the systematic comparison of the conditional relationships among the chosen cytokine biomarkers.
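To make the idea of discrete bin states concrete, the following is a minimal sketch that maps each concentration to an integer bin index using a fixed bin width per biomarker. It is an illustration only: the function names, the bin widths, and the sample values are hypothetical, and the binning algorithm the thesis actually uses is developed in Chapter IV and may differ.

```python
import math

def to_bin_state(concentration, bin_width):
    """Map a concentration (pg/ml) to an integer bin index; None if missing."""
    if concentration is None or math.isnan(concentration):
        return None  # a missing measurement stays missing
    return int(concentration // bin_width)

def panel_to_bin_states(panel, bin_widths):
    """Discretize one patient's panel of biomarker concentrations."""
    return {name: to_bin_state(value, bin_widths[name])
            for name, value in panel.items()}

# Hypothetical bin widths and measurements, chosen only for illustration.
bin_widths = {"EGF": 50.0, "IFNG": 2.0, "IL6": 5.0}
patient = {"EGF": 212.4, "IFNG": 3.1, "IL6": None}

states = panel_to_bin_states(patient, bin_widths)
print(states)  # {'EGF': 4, 'IFNG': 1, 'IL6': None}
```

Two measurements that differ only by noise within one bin width (for example 201.0 and 249.0 pg/ml with a width of 50) land in the same state, which is the sense in which discretization abstracts away some of the variation.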
Figure 1: DTS values per biomarker for all patients of clinical type Adenocarcinoma.


1.2 Outline of the Thesis
Chapter I introduces the problem being studied and what initially motivated this analysis. Chapter II presents the related research in this area, describing both the topological and non-topological approaches, including the use of protein interaction networks. Chapter III describes the baseline dataset used in the analysis and the suitability and relevance of cytokine biomarkers. It discusses the inability of standard statistical analysis and inference to make accurate assignments to clinical types using biomarker concentration measurements. We also discuss the importance of statistical modeling, in particular the definition of the model's random variables and the model's assumptions, and how variation is handled through model constraints. Chapter IV defines the computational model, the motivation of the discrete topological structure (DTS), and its manifestation through bin states. The computational model is developed through a series of algorithms and equations. Chapter V identifies the biomarkers that distinguish the clinical type populations and classify individual samples. Three separate validation studies are presented. Chapter VI discusses the potential of a formal topological analysis. Whereas the binning algorithm exposes the conditional relationships and dependencies among the concentration data by discretizing the concentration values of paired biomarkers into static DTS bin states, we also ask whether it is possible to structure the interactions of protein biomarker concentration values by computing the connectedness and disconnectedness of these conditional values in concentration topological space. This approach would allow for the dynamic visualization of biomarker concentration values in multidimensional space over the course of a patient's disease trajectory. Chapter VII concludes the thesis.


CHAPTER II: RELATED WORK
2.1 Modeling Lung Disease
We consider the following established and recent approaches to computationally modeling lung disease:
• cellular tumor cell morphology,
• multiscale, multi-physics "omics" data mining,
• Bayesian approaches,
• equation-based and agent-based models,
• protein interaction network analysis, and
• topological analysis.
Tumor cell morphology on light microscopy has traditionally been used to predict disease behavior and prognosis, though recent techniques build upon stem-cell research [36]. Cellular morphology often drives in vivo animal models of lung disease, but lung disease modeled in vivo is often not initiated by the same events that cause the disease in humans [57].
Newer approaches to modeling lung disease now include mining large genomic, transcriptomic, and proteomic databases, which requires advanced computational techniques to reveal relevant patterns in these data. A multiscale, multi-physics modeling approach implies that "omics" data from different sources (genomics, proteomics, metabolomics) can be integrated through multiple computational informatics techniques [37]. For example, to diagnose and assess COPD, Burrowes et al. describe a multiscale computational model to investigate the relationship between computerized tomography (CT) and magnetic resonance imaging (MRI) measurements and disease severity [34, 62]. Their physics-based approach to relating structure and function recognizes the non-linear relationships operating among respiratory components. They acknowledge that the modeling process requires multiple simplifying assumptions of the lung system, and that current models predict lung function at a specific point in time. Burrowes also states that "One of the major hurdles standing in the way of computational modeling and its clinical use is lack of model validation" [34, page 6 of 8].
Gefen [35] provides several examples of equation-based and agent-based (ABM) computational models. Both simulation approaches specify the rules of behavior of individual entities and their interactions within a population. In equation-based modeling the model is a set of equations, whereas ABMs assert that the various non-linear and adaptive interactions between the entities are too complicated to be represented by analytical expressions. An ABM typically has the following components: agents (such as immune cells, bacteria, or cytokines), the environment where agents reside (such as a two-dimensional grid representing a section of lung tissue), probabilistic rules that govern the dynamics of agents (including movement, actions, and interactions among agents and between agents and the environment), and the time-scales on which the rules are executed. Dick et al. [38] develop a multiscale model of acute inflammatory disease that describes baro-, chemo-, and cytokine-reflex control of cardio-pulmonary function. They developed algorithms that quantify the nonlinear characteristics of variability in some biological signals, and acknowledge that the primary challenge lies in integrating a large body of data into a cohesive whole that can guide novel therapies [38, page 1].
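The ABM components listed above (agents, an environment, probabilistic rules, and a time-scale) can be sketched in a few lines. The example below is a toy, not a model from Gefen [35] or Dick et al. [38]: the agent kinds, the wrapped grid, the random-walk movement rule, the `kill_prob` parameter, and all counts are assumptions made purely for illustration.

```python
import random

class Agent:
    """One entity on the grid; kind is 'immune' or 'bacterium' (assumed labels)."""
    def __init__(self, kind, x, y):
        self.kind, self.x, self.y = kind, x, y

def step(agents, size, rng, kill_prob=0.5):
    """Advance the simulation by one time step."""
    # Movement rule: every agent takes a random step on a wrapped 2-D grid.
    for agent in agents:
        dx, dy = rng.choice([(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)])
        agent.x = (agent.x + dx) % size
        agent.y = (agent.y + dy) % size
    # Interaction rule: a bacterium sharing a site with an immune cell
    # is removed with probability kill_prob.
    immune_sites = {(a.x, a.y) for a in agents if a.kind == "immune"}
    survivors = []
    for agent in agents:
        caught = agent.kind == "bacterium" and (agent.x, agent.y) in immune_sites
        if caught and rng.random() < kill_prob:
            continue
        survivors.append(agent)
    return survivors

def simulate(n_immune=20, n_bacteria=30, size=10, steps=50, seed=0):
    """Return the bacterium count after each time step."""
    rng = random.Random(seed)
    agents = [Agent("immune", rng.randrange(size), rng.randrange(size))
              for _ in range(n_immune)]
    agents += [Agent("bacterium", rng.randrange(size), rng.randrange(size))
               for _ in range(n_bacteria)]
    history = []
    for _ in range(steps):
        agents = step(agents, size, rng)
        history.append(sum(a.kind == "bacterium" for a in agents))
    return history

history = simulate()
print(history[0], history[-1])  # bacterium count after the first and last step
```

Because the rules are probabilistic, the trajectory depends on the random seed; fixing the seed makes a run repeatable, which is one simple way to compare rule changes.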
For an example of a Bayesian approach, Ostroff et al. [26] published a comprehensive clinical biomarker study of non-small cell lung cancer (NSCLC) to discover 44 possible protein biomarkers in 1326 archived serum samples from four independent studies. (VEGF is the single shared biomarker between their study and our experiments.) They developed a panel using 12 of their protein biomarkers that distinguished NSCLC from controls with 89% sensitivity and 83% specificity. This work is encouraging because of the generality of their aptamer-based proteomic technology, their specific evidence against over-fitting, and the fact that all their samples were confirmed by expert pathology review. To construct classifiers they used a naive Bayes model, which assumes that the presence of a feature in a class is unrelated to, or strongly independent of, the presence of other features; a log-normal distribution to model their data; and a log-normal parametric model to capture the protein distributions for a given clinical state. However, their results do not yet include independent validation studies.
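The combination described above, a naive Bayes classifier with log-normal feature distributions evaluated by sensitivity and specificity, can be sketched as follows. This is not the implementation of Ostroff et al. [26]; the function names and the two-biomarker toy data are hypothetical, and the sketch assumes strictly positive concentrations with at least two training samples per class.

```python
import math
from statistics import mean, stdev

def fit(samples_by_class):
    """Fit per-class, per-feature normal parameters on log-concentrations."""
    total = sum(len(rows) for rows in samples_by_class.values())
    model = {}
    for label, rows in samples_by_class.items():
        log_columns = zip(*[[math.log(v) for v in row] for row in rows])
        model[label] = {
            "log_prior": math.log(len(rows) / total),
            "params": [(mean(col), stdev(col)) for col in log_columns],
        }
    return model

def predict(model, row):
    """Return the class with the highest log-posterior for one sample."""
    def score(label):
        s = model[label]["log_prior"]
        for value, (mu, sigma) in zip(row, model[label]["params"]):
            z = (math.log(value) - mu) / sigma
            # Log-normal log-density; terms identical across classes are dropped.
            s += -0.5 * z * z - math.log(sigma)
        return s
    return max(model, key=score)

def sensitivity_specificity(truth, predicted, positive):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical training data: two biomarker concentrations per sample.
train = {
    "case": [[10.0, 200.0], [12.0, 220.0], [9.0, 180.0], [11.0, 210.0]],
    "control": [[3.0, 90.0], [4.0, 100.0], [2.5, 110.0], [3.5, 95.0]],
}
model = fit(train)
print(predict(model, [10.5, 205.0]))  # case
print(predict(model, [3.2, 98.0]))    # control
```

The independence assumption is what lets the score be a simple sum over biomarkers; it is also the assumption that a model built on conditional relationships between biomarkers deliberately does not make.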
These complex physics-based, equation-based, or agent-based modeling approaches often require years of development, and they make many simplifying biological assumptions for the sake of reproducible and automated predictions. Instead, our model assigns patients to known clinical types from the marginal and conditional relationships of biomarker concentration measurements. Our approach selects the unique set of biomarkers, given a small number of biological and statistical assumptions (see §3.6 and §3.7), that induces the protein host-response topology corresponding to a patient's clinical type. The resulting computational model represents a space of distributions built upon computable discrete states that drastically reduces the space of plausible hypotheses despite the significant data variation and extreme heterogeneity of lung disease [81].


2.2 Topological Approaches
Carlsson [20] discusses how topological data analysis (TDA) can extract information from high-dimensional, noisy, incomplete but massive, biological data sets using techniques from topology. TDA is the study of the shape of data, and it provides a general framework to analyze such data that is insensitive to the measurement metric while providing both dimension reduction and robustness to noise. Carlsson enumerates the analytical obstacles involved in such research as including:
• the requirement that large-scale qualitative information must first be extracted to determine the specific direction of subsequent research,
• the fact that biological distance metrics are often difficult to theoretically justify,
• the unnaturalness, even meaninglessness, of using Euclidean metric coordinates, and
• the greater value of categorical data summaries over individual parameter choices.
Carlsson also leads the Applied and Computational Algebraic Topology (CompTop) group at Stanford University (http://comptop.stanford.edu/), which seeks to develop flexible topological methods for the analysis of data that is difficult to analyze using classical linear methods. Their Javaplex library [59] computes Betti numbers, a measure used to distinguish topological spaces based on the connectivity of n-dimensional simplicial complexes, and persistent homology intervals, a method for computing topological features of a space at different spatial resolutions. These homology intervals describe how the homology of a topological structure changes over time.
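As a small illustration of a Betti number computed at different spatial resolutions, the sketch below counts connected components (the zeroth Betti number) of a point cloud when points within a distance `epsilon` of each other are linked. It is a pure-Python stand-in written for this illustration and does not use or reproduce the Javaplex API; the point coordinates and the `epsilon` values are hypothetical, and higher-dimensional Betti numbers are not computed.

```python
import math
from itertools import combinations

def betti_0(points, epsilon):
    """Count connected components when points within epsilon are linked."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= epsilon:
            parent[find(i)] = find(j)  # merge the two components
    return len({find(i) for i in range(len(points))})

# Hypothetical point cloud: one cluster of three points, one of two.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (5.0, 5.0), (5.5, 5.0)]
for epsilon in (0.1, 1.0, 10.0):
    print(epsilon, betti_0(points, epsilon))
# 0.1 5
# 1.0 2
# 10.0 1
```

The count that survives over a wide range of `epsilon` (here, two components) is the kind of feature that a persistent homology interval records; counts that appear only at one scale are more likely to be noise.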
As a clinical example, Carlsson topologically analyzed the differences among healthy, pre-diabetic, and diabetic populations for two variables, insulin response and glucose level, using the Miller-Reaven Diabetes dataset [27]. The two variables effectively distinguish among the clinical types in the ℝ³ space using a typical set of 1-dimensional filters such as:
• density estimators,
• measures of data depth,
• eigenfunctions of graph Laplacian, and
• principal coordinates analysis or multidimensional scaling coordinates.
The Computational Homology Group (CHomP) at Rutgers University [http://chomp.rutgers.edu/] performs protein data analysis using persistent homology to extract geometrical and topological information from protein data available in the Protein Data Bank, the worldwide repository of information about the 3-dimensional structures of proteins and nucleic acids [http://www.wwpdb.org/]. The CHomP software computes certain topological invariants of protein molecules, such as the 112M Sperm Whale Myoglobin D122n N-propyl Isocyanide protein [24]. The focus in this work is to locate valid protein docking conformations, a subset of the protein-folding problem. Edelsbrunner defines many of the topological techniques and terms used in this 3-dimensional type of analysis [23].
Blinder et al. [25] propose the use of functional topology to identify several characteristics of biological computing networks in terms of form-function fingerprints. These network structures are represented by 3 matrices:
1. a topological connectivity matrix where each row is the shortest topological path lengths of a node with all other nodes;
2. a topological correlation matrix where an element (i,j) represents the correlation between the topological connectivity of nodes (i) and (j); and
3. a weighted graph matrix that represents the strengths of the connections.


Their functional holography analysis distinguishes among similar neuronal networks that standard statistical determinants of network structure, such as averaged path length, did not.
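The first two matrices described above can be sketched for a small unweighted graph: breadth-first search gives the shortest path lengths for the connectivity matrix, and the Pearson correlation between pairs of its rows gives the correlation matrix. This is an illustrative sketch, not the method of [25]; the choice of Pearson correlation, the function names, and the four-node path graph are assumptions, and the graph is assumed to be connected.

```python
import math
from collections import deque

def shortest_path_lengths(adjacency, source):
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def connectivity_matrix(adjacency):
    """Row i holds the shortest path lengths from node i to every node."""
    nodes = sorted(adjacency)
    rows = []
    for u in nodes:
        dist = shortest_path_lengths(adjacency, u)
        rows.append([dist[v] for v in nodes])
    return rows

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_matrix(connectivity):
    """Element (i, j) correlates the connectivity rows of nodes i and j."""
    return [[pearson(row_i, row_j) for row_j in connectivity]
            for row_i in connectivity]

# Hypothetical network: a path 0 - 1 - 2 - 3.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
conn = connectivity_matrix(adjacency)
corr = correlation_matrix(conn)
print(conn)  # [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
```

In this path graph the two end nodes see the network from opposite sides, so their connectivity rows are perfectly anti-correlated, while each node correlates perfectly with itself.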
2.3 Protein Interaction Networks
Some biomarker proteins often form protein-protein interaction (PPI) networks, both permanent and transient, distinct from their function as host-response signals [50]. These PPI networks often produce high specificity due to the physical contacts between two or more protein molecules because of biochemical events steered by electrostatic forces. Many PPI networks manifest the associations between molecular chains that occur in a cell or in a living organism in a specific biomolecular context [64]. That is, by providing a network-level map of the cell, PPI networks could identify the biomarkers functioning in the same disease pathway, presumably those with an effective disease predictive ability [60].
Protein interactions also cover a spectrum of order and function, from weakly random to highly structured, because protein function is aggregated from multiple sources. Kumar et al state that "the function of a protein and its properties are decided not only by the static folded three-dimensional structure but also by the distribution and redistributions of the populations of its conformational and dynamic sub-states under different (physical or binding) environments" [10].
Network analysis quantifies PPI networks by various metrics, including the number of nodes (N), the number of links (L), the average distance in the network (av d), the diameter of the network (the maximum of d), the number of components, the average degree (avD), the clustering coefficient (CC), degree centrality (D) and degree distribution, the shortest distance (d) between a focal node and one other node, link density, averaged shortest distance between
pairs of nodes, clique density, among others [65]. Topological importance (TIn) is a general measure of centrality, focusing on how effects originating from a focal node can spread throughout the network. The topological overlap (TOnt) measure is derived from TIn. Given a defined threshold level (t), it is possible to determine which are the stronger (over the threshold) and which are the weaker (below the threshold) interactors of a network node i [65].
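Several of the basic metrics named above (N, L, average degree, average shortest distance, diameter, and clustering coefficient) can be computed directly from an edge list. The sketch below assumes a small hypothetical undirected network with no duplicate links; the node labels and function names are illustrative only, and the TIn and TOnt measures are not implemented here.

```python
from collections import deque

# Hypothetical undirected PPI links between four proteins.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]


def build_adjacency(edges):
    """Map each node to the set of its neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Shortest hop-count distance d from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def network_metrics(edges):
    adj = build_adjacency(edges)
    nodes = sorted(adj)
    n_nodes, n_links = len(nodes), len(edges)

    # Shortest distances over all reachable unordered node pairs.
    distances = []
    for i, u in enumerate(nodes):
        dist = bfs_distances(adj, u)
        distances.extend(dist[v] for v in nodes[i + 1:] if v in dist)

    # Local clustering: fraction of a node's neighbour pairs that are linked.
    clustering = []
    for u in nodes:
        nbrs = sorted(adj[u])
        k = len(nbrs)
        if k < 2:
            clustering.append(0.0)
            continue
        linked = sum(1 for i, a in enumerate(nbrs)
                     for b in nbrs[i + 1:] if b in adj[a])
        clustering.append(linked / (k * (k - 1) / 2))

    return {
        "N": n_nodes,
        "L": n_links,
        "average_degree": 2 * n_links / n_nodes,
        "average_distance": sum(distances) / len(distances),
        "diameter": max(distances),
        "clustering_coefficient": sum(clustering) / n_nodes,
    }


METRICS = network_metrics(EDGES)
```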
Erten et al use their Vavien algorithm to analyze the topology of PPI networks to infer functional information for the sake of disease association, assuming these networks are organized into recurrent schemes that underlie the mechanisms of cooperation among different proteins and that proteins tend to interact with other proteins of similar function. They show that proteins associated with similar diseases exhibit similar topological characteristics in their PPI networks, and that the Vavien algorithm outperforms existing information flow-based models in terms of ranking the true disease gene highest among other candidate genes [44],
Protein-protein interaction behavior is determined by protein concentrations that represent a balance between functional and structural interactions [51]. An interaction can produce a change in the relative concentration gradient of either or both of the interacting proteins. Crucially, the interaction between two proteins depends not only on their physical binding affinity, but also on their relative concentrations, to the extent that the control of protein abundances becomes important in the functional operation and evolution of natural protein-protein interactions [3]. Even given the wide variation of protein concentrations, their relative abundances may still be under tight evolutionary control [2]. That is, measured protein concentration values are neither simple nor static, but they may be conditionally bounded ratios that dynamically fluctuate among a preferred or characteristic set of value ranges. The
non-linear complexity of the concentration patterns produced by these ratios explains why there is not a single identifiable protein concentration state for a clinical type.
For example, Heo, Maslov, and Shakhnovich [2] address the question of how living cells achieve sufficient quantity of functional protein complexes while minimizing their promiscuous non-functional interactions. They modeled the topology of a protein interaction network to shape these protein abundances and the strengths of their functional and nonspecific interactions. They found a positive relationship between evolved physical-chemical properties of protein interactions and their abundances due to a frustration effect. However, the PPI approach to modeling depends upon a complete knowledge of the interactions among a set of proteins.
Network analysis can quantify the functioning and relative importance of proteins in cell function, but high-throughput experimental detection methods for PPI (such as mass spectrometry) often generate both high false-positive and high false-negative rates [66]. We propose an approach that combines the accuracy and precision of protein abundance measured using established 2-D PAGE gel electrophoresis technology with a novel analysis of the topological and conditional relationships and dependencies derived from these concentration data.
CHAPTER III: DATA SOURCES and STATISTICAL ANALYSIS
3.1 Cytokine Biomarkers and Their Relevance
We wish to show that protein concentration distributions contain information about their conditional inter-dependencies, and whether those inter-dependencies are characteristically maintained and conserved. That is, the biomarkers to collect must act as disease state signals due to the existence and modulating strength of their relative and mutual effects upon each other. Cytokine proteins satisfy these data source requirements.
Cytokines are a broad category of small proteins acting as immunomodulating agents; they elicit regulatory function in immunologic pathways that are important in cell signaling. They are secreted by components of the adaptive immune system, and they act as effectors or modulators of lung tissue inflammatory response [45], to the extent that Biancotto claims that the level and type of cytokine production has become critical in distinguishing physiologic from pathologic immune conditions [40]. The 12 baseline protein biomarkers {EGF, IFNG, IL1A, IL1B, IL2, IL4, IL6, IL8, IL10, MCP1, TNFA, VEGF} (EGF: epidermal growth factor; IFNG: interferon γ; IL: interleukin; MCP: monocyte chemo-attractant protein; TNF: tumor necrosis factor; VEGF: vascular endothelial growth factor) are chosen because of their known sensitivity as host-response signals to various lung diseases [4], so that concentrations of circulating cytokines in serum may be associated with lung disease survival [5]. Evidence also suggests that biomarkers IL6 and IL8 are specifically associated with an increased risk of lung disease [6, 7].
Host-response sensitivity is not the only possible criterion for selecting a biomarker panel to reflect disease state. Every biomarker exists within a protein family, a group of evolutionarily related proteins that descend from a common ancestor, whose members are
characterized by the rate and amount of product produced, the number of possible stable energy configurations, the speed of feedback mechanisms, similar three-dimensional structures, sequence similarity, and other characteristics [58]. Other possible criteria include biomarkers that are known to be functionally coupled by stoichiometric feedback mechanisms, derived from the same biochemical family, or co-located spatially or temporally. The strategy for selecting an effective panel of biomarkers for a specific disease or disease family requires an educated guess initially, most likely based upon known clinical disease correlations from the literature. Because multiple justifications are available for the biomarker panel selection, depending upon the hypothesis under investigation, the modeling approach allows for the systematic and automated substitution and comparison among various biomarker variables.
3.2 Data Sources
In these experiments, the model is constructed using host-response cytokine biomarker concentration data from 343 patients, given to us in standard units of picograms (10^-12 grams) per milliliter (pg/ml). Other data sets obtained from the literature are standardized to these units. The baseline data set includes 7 clinical types from which the 12 protein biomarkers are measured. The number of patients per clinical type ranges from 24 to 56 (see Table 3). The Q=12 baseline biomarkers {EGF, IFNG, IL1A, IL1B, IL2, IL4, IL6, IL8, IL10, MCP1, TNFA, VEGF} measured from each patient's blood serum are chosen because of their known or suspected relationship to lung disease. Two specimens are collected from each patient at the same time, and these two specimens are averaged over each biomarker to provide a single biomarker panel of 12 measurements per patient, except for cases of missing data. Each of the 343 patients had been expertly diagnosed as
belonging to only one of 7 lung-related clinical types C_t, 1 ≤ t ≤ 7: adenocarcinoma, squamous cell carcinoma, never smokers, smokers with chronic obstructive pulmonary disease (COPD), smokers without COPD, acute lung injury, or cystic fibrosis [1]. The original experiments considered never smokers, smokers with COPD, and smokers without COPD as the control groups. We sequestered a random 10% of the given data for subsequent model validation, leaving 310 patients. Of the possible 310 × 12 = 3720 biomarker measurements, 659 are missing (17.7%), leaving a total of 3061 measured values (82.3%).
Precise protein assay techniques using 2-D PAGE gel electrophoresis [53] are used to consistently collect homogeneous blood serum specimens. 2-D PAGE gel electrophoresis is a precise and established method for the separation of proteins in 2 dimensions, where isoelectric focusing is used to separate proteins by their isoelectric point (pI) in the first dimension and SDS-PAGE (sodium dodecyl sulfate polyacrylamide gel) electrophoresis is used to separate proteins by their molecular weight in the second dimension. The technique is often used for the isolation of proteins for further characterization by mass spectrometry, a high-throughput analytical technique that ionizes chemical species and sorts the ions based on their mass-to-charge ratio.
The first five data sets are all from the same unpublished set of experiments [Acknowledgement A] conducted at laboratories at the University of Colorado Anschutz Medical Campus. The last two data sets, cystic fibrosis and acute lung injury, originate from different experiments, though the wet-lab protocols and analytics are performed in the same way as for the first five data sets [Acknowledgement B]. To minimize batch effects, both laboratories incorporated a standard sample in each electrophoresis gel, which was subsequently subtracted during analysis, and both used the Cy2 channel from each gel to
normalize spot intensities and for automated matching between gels. All patients underwent expert pathology review and are histologically assigned to one and only one clinical type, provided with the original data sets. There are many more data values than the number of biomarkers, which helps to avoid the issue of overfitting. We are working with precise concentrations of secreted proteins expressed in the blood. This sampling strategy is justified because it is non-invasive, generates a large set of data with quantitative accuracy involving a small number of variables (the 12 biomarkers), and works with a homogeneous composition indicative of the entire organism.
3.3 Statistical Analysis
A standard statistical analysis [26] to distinguish among clinical types calculates the concentration population mean μ_i and standard deviation σ_i values, 1 ≤ i ≤ Q, for each clinical type biomarker to reveal the differences among the various clinical types and to see whether they correctly assign an unknown patient to a known clinical type. For example, the bar chart in Figure 2 plots the means and standard error bars for Never Smokers and Adenocarcinoma, and Figure 3 (a-l) plots the mean concentration value of each of the 12 biomarkers for every clinical type. The small error bars in these figures suggest these data were produced precisely and with quantitative accuracy, though it is not known why the mean values for Acute Lung Injury are the highest for every biomarker except EGF and VEGF. Table 2 tabulates the biomarker concentration averages and standard deviations per clinical type.
The concentration μ and σ values do not unambiguously assign patients to the clinical types for the 12 given biomarkers. Data averages do not provide enough information to reliably classify individual patient data, due to the variation of individual patient measurements. §4.4.1 describes several statistical averaging attempts using established methods, such as the naive Bayes classifier, Markov chain analysis, and cosine similarity. Instead, subsequent work focused on developing a computational model that effectively processes individual concentration values and not population averages.
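A minimal sketch of the per-type summary statistics, and of why averages alone can fail to separate types, is given below. The records are made-up illustrative values (not study data), the type labels are placeholders, and the use of the sample standard deviation is an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical records: (clinical type, biomarker, concentration in pg/ml).
# None marks a missing measurement and is skipped, not imputed.
RECORDS = [
    ("TypeA", "IL6", 2.0), ("TypeA", "IL6", 4.0), ("TypeA", "IL6", None),
    ("TypeA", "IL6", 6.0), ("TypeB", "IL6", 3.0), ("TypeB", "IL6", 5.0),
]


def summarize(records):
    """Return {(type, biomarker): (mean, sample standard deviation)}."""
    groups = defaultdict(list)
    for ctype, biomarker, value in records:
        if value is not None:
            groups[(ctype, biomarker)].append(value)
    return {
        key: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for key, vals in groups.items()
    }


SUMMARY = summarize(RECORDS)
```

Both hypothetical types have the same IL6 mean here, so a new patient measured at 4.0 pg/ml could not be assigned to either type from the averages alone.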
Figure 2: Mean concentrations and standard error bars for Never Smokers & Adenocarcinoma. (Chart title: Biomarker Mean Values with Standard Errors; vertical axis: Concentration (pg/ml).)
Figure 3 (a-l): Mean concentration values for each biomarker for every clinical type. (One panel per biomarker, titled "<biomarker> Biomarker Mean Values"; vertical axis: Concentration (pg/ml); horizontal axis: Adenocarcinoma, Squamous, NeverSmokers, SmokersWithCOPD, SmokersWithoutCOPD, AcuteLungInjury, CysticFibrosis, TestSet.)
3.4 Statistical Inference
A model-based approach is suitable for statistical inference in these experiments because the biological causes that produce the relative biomarker concentrations are generally unknown; it is not precisely known why a certain panel of biomarkers clearly distinguishes among the various clinical types. A fully realized model-based approach computes all possible alternatives within the high-dimensional biomarker concentration space and then selects the alternative that effectively assigns patients to their known clinical types. This strategy is different from a design-based approach to inference where the model
is generally known beforehand and where it is important to ensure that enough sample data are selected randomly from known populations.
The model-based approach requires a precise identification and definition of the model's random variables. Given the observation that these biomarker measurements reveal characteristic gaps or holes not explainable as missing data, a promising way to analyze these gaps is to group or bin concentration data, where each bin is considered as a discrete random variable. Individual bins can then be compared to every other bin in every other biomarker to reveal their interactive or conditional structure, as fully explained in §4.4:A.
3.5 Defining Variables in High-Dimensional Configuration and State Space
Much proteomics research operates within a high-dimensional configuration space, also called parametric space, where a panel of measurements exists in the number of dimensions defined by the number of measured protein signals (e.g., R^12). The notion of configuration space has been used in molecular biology [69] to represent the space of all possible states of a system characterized by many degrees of freedom. However, the non-intuitiveness of high dimensions poses difficult analytical issues. The scalability of measures in Euclidean spaces is generally poor as dimensionality increases, and data can become uniformly distributed [28, 29]. Given a set of targeted biomarker variables thought to characterize the overall state of a living organism, the purpose of a configuration space is to structure the possible states of the variables that represent the state of the organism.
To structure the configuration space, we define two kinds of random variables. The first definition declares the observed measurements of the 343 patients' 12 biomarkers as variables assigned to the configuration space. The second definition declares the generated bin states as variables assigned to the computational state space, which is larger
than the configuration space because of the addition of non-permissible bin states. When we claim to distinguish the set of targeted variables that uniquely characterize each clinical type, this refers to each biomarker's set of bin states. Whereas the configuration space is the space of all possible biomarker measurement values, the state space is the space of probabilities acting upon the configuration space.
3.6 Statistical Assumptions
Several model-based statistical assumptions must also be made explicit.
Distributional assumptions. When a statistical model employs terms involving random errors, assumptions are made about the probability distribution of these errors. The use of the same data collection and processing protocols over all the biomarker samples assumes a Gaussian distribution of errors, so long as the sensitivity and specificity of the various biomarker measurements are roughly the same (see Appendix D). But when a biomarker is difficult to measure, this often produces incomplete data. For example, biomarker IL1A has the lowest concentration range of all the given biomarkers, and quite a few patient values are missing. We included all biomarker data values and assumed a Gaussian distribution of errors in the measurement values.
Structural assumptions. Statistical relationships between variables are often modelled by equating one variable to a function of another or several others plus a random error. Models often involve making a structural assumption about the form of this functional relationship, such as linear or multiple regression. This can sometimes be generalized to models involving relationships among multiple underlying unobserved latent variables. In our experiments, the multiple bins of concentration values serve as jointly distributed random variables. But as labels, bin 3 is not related to bin 4 in a sense of predication or lesser-than ordering. There are simply bins that are occupied, some more than others, and bins that are empty.
Cross-variation assumptions. Cross-variation assumptions involve the joint probability distributions (JPD) of either the measurements themselves or the random errors in a model. Given at least two random variables X, Y, ... defined on a probability space, the JPD for X, Y, ... is the probability distribution that gives the probability that each of X, Y, ... falls within a discrete set of values specified for that variable [70]. The model assumes that measurement errors are statistically independent, but the analysis explicitly characterizes the conditionality of measurements, where each clinical type is hypothesized to have a unique JPD of characteristic biomarker relationships.
Guarding against data dredging. Data dredging presents data patterns as statistically significant by exhaustively searching for combinations of variables that might show a correlation without first devising a specific hypothesis for the underlying cause. We guard against mistaking such statistical noise for significance by validating the primary hypothesis with data that was not used in constructing the hypothesis.
3.7 Handling Variation Through Model Constraints
A computational model must efficiently navigate within the large and complex probability space in which both the disease responses and the protein measurements occur. Measured protein concentrations in individuals vary across orders of magnitude in both the technical and biological dimensions depending upon the progression of the disease state, how the specimens are collected, the tissue or fluid that is sampled, and by the assay
protocol used to measure the concentrations [32]. There is also the possibility of introducing bias due to poor experimental design, sampling, and protocol differences [33]. The model addresses these sources of universal biological variation as follows.
1) Biomarker measurements are made from noisy, incomplete, and widely variable protein concentration values, so the assignments of the model are necessarily probabilistic. The analytical challenge is to reduce the effect of measurement variations enough to make accurate clinical classification despite significant measurement variations, even though proteomics experiments conducted in different laboratories using the same specimen-handling protocols can produce drastically different results [8, 9]. Though the baseline data comes from two different labs, both labs applied the same methods of random population selection, specimen collection and handling protocols, instrumentation calibration and use, and non-biased experimental design within their individual experiments. Assuming the use of standard gel electrophoresis assay techniques allows us to test whether the model works at all without having to model these known sources of variation.
2) Patients that are sampled are nearly always being clinically treated at the same time, yet the effects of those treatments on relative biomarker concentration levels are usually unknown. This influence is mitigated somewhat by focusing on the concentrations of host-response proteins circulating in homogeneous serum instead of targeted tissue specimens.
3) The model does not require the extraction of specimens from diseased tissues to measure and compare all the proteins therein to find a set of disease-specific biomarkers. The tissue extraction approach is both costly and invasive. Even though differences have been uncovered in protein expression between normal and diseased tissues that may have
specificity for different tumor types, there is increasing evidence of a detectable human immune response to cancer in blood sera, which may aid in disease immunotherapy [47].
4) There exist overly frequent or non-specific variations involving issues of scale: those proteins that change concentration in response to a very large number of inflammatory signal events or stimuli, and not necessarily those involved in lung disease. For example, IL10 is a pleiotropic immuno-regulatory cytokine (where a single protein influences two or more seemingly unrelated physiological functions) that protects from several infection-based immunopathology and allergy responses. These non-specific types of variations are managed by avoiding those protein biomarkers that respond to every inflammatory event and selecting only those with other evidence that links them to the disease under investigation.
5) A great deal of proteomics research is plagued by the issue of overfitting, which occurs when a statistical model describes random error or noise instead of the underlying relationships. Overfitting generally occurs when a model is excessively complex, such as having too many parameters or experimental variables relative to the number of observations or measurements, making it easy to fit multiple models to the data and expose structure that does not correlate with the hypothesis under study. Often preliminary or training data fits the model well, but independent validation studies perform very poorly. To counter overfitting, we restrict the number of targeted biomarker variables to 12, well below the number of separate observations or measurements.
6) Nearly all the measurement data sets available to us are incomplete. Only 39 of the 343 patients (11.4%) have all 12 biomarker measurements, but 85.4% have 9 or more biomarkers. A total of 17.7% of biomarker values are missing from the baseline data sets.
Figure 4 shows the mode of measurements per patient panel as 10. The mean is 9.84. No data was interpolated or averaged to fill in missing data. Instead, we chose to declare a sufficiency constraint value of 95%. That is, in the training set, the amount of a clinical type's data is accepted as sufficient if at least one of its distinguishing biomarker variables accurately assigns patients at least 95% of the time.
Figure 4: Number of biomarker measurements for all baseline samples.
7) Concentration data can be transformed by excluding extreme outliers, or by taking their natural logarithm (log_e) to reduce the differences in measured concentration values. We observed that removing outliers removes a few characteristic topological features among clinical types, though it requires significantly less overall computational time. To balance these objectives, we excluded as outliers only the 7 individual measurements that fell into the farthest range of standard deviation, those values greater than μ + 6σ. We did not apply other data transformations. All clinical type concentration values are shown in Figure 5 (a-i).
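Two of the constraints in the list above lend themselves to a short sketch: counting the measurements present in each patient panel (item 6) and flagging values above μ + 6σ (item 7). The data below are made up for illustration, the function names are hypothetical, and computing μ and σ over all values of one biomarker with the sample standard deviation is an assumption about how the threshold is applied.

```python
from statistics import mean, mode, stdev

# Hypothetical 12-biomarker panels; None marks a missing measurement.
PANELS = [
    [1.0] * 10 + [None] * 2,
    [2.0] * 10 + [None] * 2,
    [3.0] * 12,
]


def measurements_per_panel(panels):
    """Number of non-missing measurements in each patient panel."""
    return [sum(value is not None for value in panel) for panel in panels]


def split_outliers(values, n_sigma=6.0):
    """Separate values above mean + n_sigma * standard deviation."""
    mu, sigma = mean(values), stdev(values)
    threshold = mu + n_sigma * sigma
    kept = [v for v in values if v <= threshold]
    outliers = [v for v in values if v > threshold]
    return kept, outliers


COUNTS = measurements_per_panel(PANELS)
PANEL_MODE = mode(COUNTS)
PANEL_MEAN = mean(COUNTS)

# Hypothetical single-biomarker values with one extreme measurement.
VALUES = [1.0, 2.0] * 25 + [1000.0]
KEPT, OUTLIERS = split_outliers(VALUES)
```

Because the extreme value itself inflates μ and σ, a value can exceed μ + 6σ only in a reasonably large sample, which is why the example uses 51 values.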
Figure 5 (a-i): All clinical type concentration values. (One panel per clinical type, titled "All Concentration Values for Clinical Type <type>"; vertical axis: concentration [c] in pg/ml; panels: Adenocarcinoma, Squamous, NeverSmokers, SmokersWithCOPD, SmokersWithoutCOPD, AcuteLungInjury, CysticFibrosis, TestSet, HealthySerum.)
CHAPTER IV: COMPUTATIONAL MODEL
4.1 Defining the Model
Given these difficulties and conflicting constraints in analyzing proteomic data, the problem is to formulate the right space in which the model can effectively address several fundamental questions.
1. How should the set of targeted biomarker variables be computed to reliably separate and group clinical type populations? Is there an essential set of variables?
2. What is the fine-grained conditional structure relating the biomarkers per clinical type?
3. Can unknown patients be reliably assigned to a clinical type?
4. How can data variation (e.g., outliers) be effectively included in the computational model?
5. Is there an effective strategy for handling missing data values?
The computational model builds a discrete topological structure (DTS) from the baseline dataset. The model is developed using several novel schemes designed specifically for this analysis.
A customized binning algorithm discretizes the concentration values of paired biomarkers into bins and forms bin states corresponding to biomarkers belonging to individual samples.
The model builds conditional probability relationships between bin states that successfully separate and distinguish the bin states belonging to different clinical types. A clinical type population is then represented as a characteristic set of biomarker bin states, where each biomarker pair is assigned a distinct bin size and
number of bin states, whether occupied or empty of concentration values. Every biomarker belongs to multiple combinations, each with its own characteristic bin size and set of bin states.
A unique application of an exclusive-or (XOR) set of operations then extracts the distinctive and distinguishing patterns that successfully assign a set of states to a clinical type. The resulting analysis indicates which of the biomarkers are key in recognizing the discriminating signs of the lung diseases under this study.
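As a rough illustration of how an exclusive-or can isolate distinguishing bin states, the sketch below encodes the occupied bins of one biomarker as a bitmask per clinical type and XORs two masks. The bin indices, type names, and helper functions are hypothetical, and this is only one plausible reading of the idea; it is not presented as the XOR scheme actually used by the model.

```python
def occupancy_mask(occupied_bins):
    """Encode a collection of occupied bin indices as an integer bitmask."""
    mask = 0
    for b in occupied_bins:
        mask |= 1 << b
    return mask


def mask_to_bins(mask):
    """Decode a bitmask back into a sorted list of bin indices."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


# Hypothetical occupied bin states of one biomarker in two clinical types.
type_a = occupancy_mask([0, 1, 2, 5])
type_b = occupancy_mask([1, 2, 3])

# XOR keeps the bins occupied in exactly one of the two types.
differing = type_a ^ type_b
only_a = mask_to_bins(differing & type_a)
only_b = mask_to_bins(differing & type_b)
```

Bins shared by both types (1 and 2 here) cancel out, leaving only the bin states that could help tell the two types apart.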
4.2 Motivation of the Discrete Topological Structure
The DTS was initially motivated by considering a derivation of the Chemical Stochastic Master Equation (CSME), as given in Appendix B. The CSME originally described the probability of a vector of chemical measurements X belonging to a certain state and how X changes with respect to time as t > 0 [13]. Instead, we formulate the DTS equation by representing concentration [c] values as bin state variables under a joint probability distribution (JPD) and discretize time t as step transformations among these various bin states. The JPD p(X, t) represents the probability that at bin state transformation t the measurement panel contains X_1 proteins of the first clinical type, X_2 proteins of the second clinical type, and so on [15, 16]. The inputs to this formulation are the distributions of protein concentrations for a clinical measurement panel relative to a standard in picograms per milliliter (pg/ml). These measurement vectors are transformed to matrices of conditional bin states. The change of X in p(X, t) is then reformulated as a unit-less quantity in which the probability that state transformation t_i will occur given bin state X is represented by an interaction matrix of marginal probabilities (marginal signifies the probabilities of a subset of variable values without reference to the values of other variables [70]), and the probability that state transformation t_j will bring the system into bin state X from another state (say Y) is represented by an interaction matrix of conditional probabilities. The set of discrete bin states indexes their respective matrices. This reformulation assigns concentration values to a vector of discrete bins to reveal the topological structure underlying the distribution of these values. The discrete bins of concentration are computed as state variables, represented by unit-less probabilities, by a binning algorithm that maximizes the total number of bins while simultaneously minimizing the number of empty bins where no concentration data reside.
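For orientation, the chemical master equation is commonly written in the form below. The symbols a_j (the propensity of transformation j), ν_j (the change in state that it causes), and M (the number of transformations) follow the usual textbook convention and are assumptions here; the notation of Appendix B may differ.

```latex
\frac{\partial p(X, t)}{\partial t}
  = \sum_{j=1}^{M} \Big[ a_j(X - \nu_j)\, p(X - \nu_j, t) \;-\; a_j(X)\, p(X, t) \Big]
```

The first term collects the transformations that bring the system into state X from another state Y = X - ν_j, and the second term collects the transformations that leave X. Read loosely, these two terms correspond to the conditional and marginal interaction matrices of the reformulation, respectively.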
The discrete topological structure of the concentration data then effectively represents how the various biomarker concentrations relate to each other, as formalized in Equation 1 (see §4.4:A). The model computes a conditional probability matrix of all the mutual interactions of the entire protein ensemble for a measurement panel at discrete bin state transformation t, where the conditional probability of random variable bin state A given random variable bin state B is equal to the joint probability of A and B divided by the marginal of B.
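The relationship P(A | B) = P(A, B) / P(B) can be sketched for a single pair of biomarkers as follows. The paired bin indices are made-up values, the function names are hypothetical, and estimating probabilities by simple relative frequencies is an assumption; the full model's Equation 1 is not reproduced here.

```python
from collections import Counter, defaultdict

# Hypothetical (bin of biomarker A, bin of biomarker B) pairs, one per patient.
PAIRS = [(0, 0), (0, 0), (1, 0), (1, 1), (2, 1), (2, 1), (2, 1), (0, 1)]


def joint_distribution(pairs):
    """Relative frequency of each (A bin, B bin) state."""
    total = len(pairs)
    return {state: count / total for state, count in Counter(pairs).items()}


def marginal_of_b(joint):
    """Sum the joint probabilities over A to get P(B = b)."""
    marginal = defaultdict(float)
    for (_, b), p in joint.items():
        marginal[b] += p
    return dict(marginal)


def conditional_a_given_b(joint):
    """P(A = a | B = b) = P(A = a, B = b) / P(B = b)."""
    marginal = marginal_of_b(joint)
    return {(a, b): p / marginal[b] for (a, b), p in joint.items()}


JOINT = joint_distribution(PAIRS)
MARGINAL_B = marginal_of_b(JOINT)
CONDITIONAL = conditional_a_given_b(JOINT)
```

Only occupied state pairs appear as keys, so bin combinations that never occur are simply absent from the conditional table.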
There are two different ways of interpreting the nature of a bin state. As a static and aggregate snapshot of a clinical type population, a state represents a range of concentration values, so computing the DTS uses the population bin probabilities as the possible number of states in relation to all the biomarkers for a given clinical type. But for an individual patient, a sample represents a snapshot that varies within the static constraints of the population snapshot. In this case, the DTS computation uses the one available state, the patient's panel of 12 concentration probabilities, to make probabilistic clinical type assignments.
We now develop a binning algorithm that implements this DTS equation, providing the step-by-step details that construct this new probabilistic computational model.
4.3 Bin Size and Bin States
Inferring data structure and classification from concentration values depends upon how these values are binned or grouped. Once a unique and constant bin size is calculated for a biomarker-clinical type-coordinate combination, the calculated bins are plugged into the DTS equation as separate and biomarker-specific state variables produced by the binning algorithm. Empty bins, where there are no data values within that bin range, are then considered as non-permissible states. The ratio of the number of empty bins against the total number of bins for every CBr is linear, as shown in Figure 6. The binning algorithm provides the basis for calculating the probability of each biomarker clinical type coordinate bin state. All calculated bin sizes are given in Table 1. TestSet refers to a small set of fabricated data points that is used to validate calculations and is never used in pooled calculations.
[Figure 6 plot: Empty bins (vertical axis) versus Total bins (horizontal axis), both with ticks from 50 to 200.]
Figure 6: Ratio of the number of empty bins to the total number of bins.


Table 1: All clinical type biomarker bin sizes in pg/ml.
Biomarker          EGF      IFNG      IL1A      IL1B       IL2       IL4       IL6       IL8      IL10      MCP1      TNFA      VEGF
Adenocarcinoma
  Bin size      4.0188    1.6849    1.3170    2.9013    2.6063    2.2277    3.3217    1.8597    1.4961    4.8582    2.1165    3.2821
  Min value     0.9415    1.1810    1.5025    0.3820    3.7500    3.2090    0.7500    1.9510    0.7260   47.8670    5.0180    4.1060
  Max value   370.6725   34.8780   22.5750   93.2225   76.7270   47.7620  173.4790   46.5840   24.6630  902.9175   55.8135  201.0290
Squamous
  Bin size     25.6025    2.4179    1.7094    2.2409    2.7740    1.0377    3.5711    2.8973    2.2414    4.2379    2.9231   28.8565
  Min value     1.3955    1.1810    1.3470    0.3820    3.9060    3.2845    0.9170    1.8580    0.7260   32.7945    4.7975    3.0665
  Max value   308.6250   59.2095   35.5350   45.2005  103.7710   19.8870  229.4685  129.3395   54.5195  541.3370  121.7230  349.3445
Controls - Never Smokers
  Bin size      4.9663    0.6260    0.0234    1.7292    0.9130    0.9029    1.5829    1.7797    0.2707    4.3751    1.8843    4.5018
  Min value     1.2180    1.1810    1.3470    0.3820    3.6715    3.2845    0.7500    1.3965    0.7260   37.5950    4.9080    4.6275
  Max value   954.7375   11.1970    1.5340   34.9650   18.2800   14.1195   32.4080   44.1100    3.9740  615.1115   50.1300  598.8635
Controls - Smokers with COPD
  Bin size      3.7650    0.6750    0.5640    0.6228    1.2156    0.4684    0.3933    2.0148    1.1390    4.3275    1.2309    4.4091
  Min value     1.6760    1.3080    1.9100    0.3820    3.9825    3.3600    0.7130    1.1840    0.6880   31.5945    4.6870    3.3930
  Max value   317.9365    9.4080    6.4220    7.8550   28.2935    7.1075    7.0050   49.5400   18.9125  602.8180   29.3055  567.7530
Controls - Smokers without COPD
  Bin size      4.0905    0.3150    0.1254    0.8070    0.5339    0.4576    0.2815    1.2047    0.1974   26.4710    1.2202    4.5300
  Min value     0.9600    1.1810    1.2840    0.3820    3.5930    3.3600    0.7870    1.1840    0.6880   22.8825    4.6870    4.4300
  Max value   377.2835    4.9615    2.2870   10.0660   12.1360   10.6810    5.2910   20.4585    3.0570  340.5345   29.0900  638.6300
Acute Lung Injury
  Bin size      4.2996    4.9875    3.2633    4.1500    4.4825    4.4266    3.6994    5.2871    4.4546    5.0662    4.7249    4.7113
  Min value     1.2430    1.7590    7.6080    0.0360    0.3000    0.1830    8.6420    0.7890    0.0050   65.1690    1.8720    1.9530
  Max value      500.0  959.3670  164.2470  398.4370  520.2710  531.3790     275.0 1227.3860  516.7390    1200.0  738.9600  774.6060
Cystic Fibrosis
  Bin size      2.5936    1.6285    0.8440    3.4580    4.1233    1.1234    2.3259    2.2821    0.2086   53.4563    0.8463    2.6786
  Min value     1.4420    0.9110    0.6710    0.0690    0.7180    0.5250    2.6650    0.6290    0.2710   12.0120    1.8930    1.8260
  Max value    94.8110   26.9670   14.1750  166.0550  330.5780   18.5000   77.0940   55.4000    2.7740  653.4870   15.4330    108.97
TestSet
  Bin size       1.875       1.5    1.6875       1.8      2.25      2.25     2.625    2.5714    2.5313    2.8125      2.75       3.0
  Min value        5.0       2.0       3.0       4.0       5.0       6.0       7.0       8.0       9.0      10.0      11.0      12.0
  Max value       35.0      20.0      30.0      40.0      50.0      60.0      70.0      80.0      90.0     100.0     110.0     120.0
Healthy Serum
  Bin size         N/A      4.78    3.6107    1.9456       N/A      1.72     1.917     1.835     1.725    3.0115    3.1132    3.9559
  Min value        N/A    153.50     181.0      2.47       N/A       3.3     12.46       7.5       0.9     39.74      1.42      28.3
  Max value        N/A     822.7     383.2      33.6       N/A      37.7      50.8      44.2      28.5     160.2     138.4     297.3


Table 2: Biomarker Concentration Averages and Standard Deviation Values per Clinical Type.
Clinical Type (Average Biomarker Values) EGF IFNG IL1A IL1B IL2 IL4 IL6 IL8 IL10 MCP1 TNFA VEGF
All Clinical Types 65.03 14.83 42.38 3.01 18.99 8.10 10.73 81.95 34.77 232.64 10.08 85.02
Adenocarcinoma 54.69 3.85 6.36 4.24 16.00 6.67 28.21 10.66 2.94 240.91 17.92 46.14
Squamous 85.94 3.91 5.89 3.72 12.44 7.55 20.59 40.56 4.56 226.03 17.47 117.85
Controls Never Smokers 90.00 3.60 1.44 2.20 7.06 5.40 3.59 5.49 1.25 183.76 12.93 78.13
Controls Smokers with COPD 66.49 2.77 3.12 1.09 7.16 5.54 2.95 6.77 2.11 237.67 12.35 99.08
Controls Smokers without COPD 56.74 2.18 1.82 1.16 6.19 5.15 3.84 4.32 1.31 185.77 9.54 64.78
Acute Lung Injury 59.16 118.97 57.57 69.04 75.53 107.73 126.85 155.14 64.69 331.61 88.03 123.93
Cystic Fibrosis 20.97 4.54 4.51 12.80 27.80 4.92 30.09 8.86 0.84 253.22 5.64 25.36
Healthy Serum N/A 569.12 270.15 8.74 N/A 31.15 24.27 34.92 7.65 68.87 65.92 212.37
Clinical Type (Standard Deviation Values) EGF IFNG IL1A IL1B IL2 IL4 IL6 IL8 IL10 MCP1 TNFA VEGF
All Clinical Types 95.15 81.58 83.47 20.04 61.52 38.59 41.87 590.22 136.73 158.95 33.27 124.20
Adenocarcinoma 72.70 5.69 6.49 15.52 59.29 7.08 93.77 14.01 4.79 181.36 46.90 42.45
Squamous 92.03 9.32 10.10 9.02 23.14 14.10 36.48 168.22 10.63 106.68 19.55 98.38
Controls Never Smokers 143.51 2.87 0.09 5.77 7.38 2.39 5.74 7.04 0.58 106.41 10.45 114.50
Controls Smokers with COPD 79.85 1.80 1.91 1.46 4.38 4.32 2.60 7.07 3.44 116.64 11.50 123.29
Controls Smokers without COPD 77.89 0.91 0.38 1.81 1.81 1.41 12.04 2.98 0.62 77.53 4.59 122.35
Acute Lung Injury 93.73 243.15 54.51 115.91 138.43 171.76 103.13 299.78 129.25 228.59 182.00 193.35
Cystic Fibrosis 25.97 6.16 4.20 35.89 70.79 4.37 26.12 11.78 0.73 153.35 3.15 24.09
Healthy Serum N/A 219.32 58.52 10.24 N/A 11.47 11.74 11.66 9.47 38.45 44.01 82.23


Table 3: Biomarker Data Characteristics.
ClinicalType Marker Minimum Concentration Maximum Concentration Minimum Probability Maximum Probability Minimum DTS Maximum DTS
Adenocarcinoma EGF 0.9415 370.6725 0 0.10638 0.09524 1.16292
Adenocarcinoma IFNG 1.181 34.878 0 0.42857 0.25 3.05278
Adenocarcinoma IL1A 1.5025 22.575 0 0.22222 1.10926 4.07049
Adenocarcinoma IL1B 0.382 93.2225 0 0.8125 0.4 4.88426
Adenocarcinoma IL2 3.75 436.6875 0 0.51064 0.28572 3.48885
Adenocarcinoma IL4 3.209 47.762 0 0.60465 0.28572 3.489
Adenocarcinoma IL6 0.75 649.422 0 0.44681 0.18182 2.22012
Adenocarcinoma IL8 1.951 86.9895 0 0.25532 0.15385 1.87863
Adenocarcinoma IL10 0.726 24.663 0 0.63043 0.73951 2.71369
Adenocarcinoma MCP1 47.867 902.9175 0 0.08333 0.0625 0.76311
Adenocarcinoma TNFA 5.018 348.16145 0 0.34783 0.20001 2.44225
Adenocarcinoma VEGF 4.106 201.029 0 0.08511 0.07143 0.87221
Squamous EGF 1.3955 308.625 0 0.28205 0.52588 1.71242
Squamous IFNG 1.181 59.2095 0 0.61765 0.5 5.1372
Squamous IL1A 1.347 35.535 0 0.55556 0.5 5.13724
Squamous IL1B 0.382 45.2005 0 0.75758 0.33333 3.42483
Squamous IL2 3.906 118.909 0 0.45 0.28571 2.93554
Squamous IL4 3.2845 98.298 0 0.30769 0.656 2.56863
Squamous IL6 0.917 229.4685 0 0.4 0.14286 1.46776
Squamous IL8 1.858 1133.017 0 0.23077 0.15384 1.58065
Squamous IL10 0.688 54.5195 0 0.71795 0.28571 2.93554
Squamous MCP1 32.7945 541.337 0 0.075 0.06667 0.68496
Squamous TNFA 4.7975 121.723 0 0.20513 0.16666 1.71238
Squamous VEGF 3.0665 349.3445 0.025 0.2 0.48544 1.58072
Controls Never Smokers EGF 1.218 954.7375 0 0.08511 0.07408 0.90653
Controls Never Smokers IFNG 1.181 11.197 0 0.27273 0.56124 2.22541
Controls Never Smokers IL1A 1.347 1.534 0 0.5 3.08946 12.24058
Controls Never Smokers IL1B 0.382 34.965 0 0.71875 1.54157 6.11979
Controls Never Smokers IL2 3.6715 57.4705 0 0.25 0.68597 2.72
Controls Never Smokers IL4 3.2845 16.6065 0 0.26667 0.8827 3.49731
Controls Never Smokers IL6 0.75 32.408 0 0.51064 0.61665 2.44799
Controls Never Smokers IL8 1.275 44.11 0 0.42553 0.68513 2.71985
Controls Never Smokers IL10 0.726 3.974 0 0.46667 0.77236 3.06011


Controls Never Smokers MCP1 37.595 615.1115 0 0.06 0.05263 0.64409
Controls Never Smokers TNFA 4.908 50.13 0 0.23913 0.41108 1.63191
Controls Never Smokers VEGF 4.6275 598.8635 0 0.16 0.07692 0.94136
Controls Smokers with COPD EGF 0.998 317.9365 0 0.14286 0.07692 0.81833
Controls Smokers with COPD IFNG 1.308 9.408 0 0.31034 0.93738 2.8529
Controls Smokers with COPD IL1A 1.91 6.422 0 0.5 2.18995 6.65757
Controls Smokers with COPD IL1B 0.382 7.855 0 0.625 1.31234 3.9941
Controls Smokers with COPD IL2 3.9825 28.2935 0 0.34091 0.37469 2.12787
Controls Smokers with COPD IL4 3.36 33.981 0.02439 0.2439 0.72997 2.21917
Controls Smokers with COPD IL6 0.713 14.7855 0 0.18605 0.50433 1.63688
Controls Smokers with COPD IL8 1.184 49.54 0 0.29545 0.22222 2.36427
Controls Smokers with COPD IL10 0.688 18.9125 0 0.54054 1.0927 3.54654
Controls Smokers with COPD MCP1 31.5945 602.818 0 0.06818 0.05715 0.60791
Controls Smokers with COPD TNFA 4.687 82.8344 0 0.16279 0.24979 1.41859
Controls Smokers with COPD VEGF 3.393 567.753 0 0.11364 0.07143 0.75987
Controls Smokers without COPD EGF 0.96 377.2835 0 0.17391 0.08 0.99052
Controls Smokers without COPD IFNG 1.181 4.9615 0 0.27586 0.99833 2.47649
Controls Smokers without COPD IL1A 1.284 2.287 0 0.2 2.12253 4.95383
Controls Smokers without COPD IL1B 0.382 10.066 0 0.64 1.99667 4.95304
Controls Smokers without COPD IL2 3.593 12.136 0 0.20833 0.46145 1.90491
Controls Smokers without COPD IL4 3.36 10.681 0 0.28261 0.49991 2.06368
Controls Smokers without COPD IL6 0.787 88.552 0 0.15556 0.42849 1.76887
Controls Smokers without COPD IL8 1.184 20.4585 0 0.25 0.66654 2.75154
Controls Smokers without COPD IL10 0.688 3.057 0 0.2439 0.90757 2.25134
Controls Smokers without COPD MCP1 22.8825 340.5345 0.02083 0.16667 0.76795 1.90501
Controls Smokers without COPD TNFA 4.687 29.09 0 0.22917 0.15385 1.9049
Controls Smokers without COPD VEGF 4.43 638.63 0 0.14583 0.09523 1.17911
Acute Lung Injury EGF 1.243 500 0 0.23256 0.08696 1.36894
Acute Lung Injury IFNG 1.759 959.367 0 0.34783 0.11765 1.85198
Acute Lung Injury IL1A 7.608 164.247 0 0.2 0.25 3.93577
Acute Lung Injury IL1B 0.036 398.437 0 0.5 0.18182 2.86219
Acute Lung Injury IL2 0.3 520.271 0 0.47222 0.15385 2.42187
Acute Lung Injury IL4 0.183 531.379 0 0.5 0.125 1.96766
Acute Lung Injury IL6 8.642 275 0 0.275 0.09091 1.43113
Acute Lung Injury IL8 0.789 1227.386 0 0.23214 0.09524 1.49923
Acute Lung Injury IL10 0.005 516.739 0 0.5 0.125 1.96775
Acute Lung Injury MCP1 65.169 1200 0 0.05357 0.04445 0.69967
Acute Lung Injury TNFA 1.872 738.96 0 0.30909 0.09524 1.49916
Acute Lung Injury VEGF 0.381 774.606 0 0.16667 0.07142 1.12431
Cystic Fibrosis EGF 1.442 94.811 0 0.22222 0.18182 1.09312
Cystic Fibrosis IFNG 0.911 26.967 0 0.33333 0.4 2.40579
Cystic Fibrosis IL1A 0.399 14.175 0 0.25 0.25 1.5036
Cystic Fibrosis IL1B 0.069 166.055 0 0.63158 0.28571 1.71766
Cystic Fibrosis IL2 0.718 330.578 0 0.26316 0.22222 1.33591
Cystic Fibrosis IL4 0.525 18.5 0 0.41667 0.33333 2.00484
Cystic Fibrosis IL6 2.665 77.094 0 0.21429 0.2 1.20244
Cystic Fibrosis IL8 0.629 55.4 0 0.30435 0.25 1.50344
Cystic Fibrosis IL10 0.205 2.774 0 0.33333 0.437 1.50404
Cystic Fibrosis MCP1 12.012 653.487 0 0.20833 0.31782 1.09385
Cystic Fibrosis TNFA 1.893 15.433 0 0.15789 0.19999 1.20287
Cystic Fibrosis VEGF 1.826 108.97 0 0.17391 0.13334 0.8016


Healthy Serum EGF N/A N/A N/A N/A N/A N/A
Healthy Serum IFNG 153.5 822.7 0 0.14286 0.28572 1.18402
Healthy Serum IL1A 181 383.2 0 0.14286 0.28572 1.18606
Healthy Serum IL1B 2.47 33.6 0 0.28571 0.5 2.08886
Healthy Serum IL2 N/A N/A N/A N/A N/A N/A
Healthy Serum IL4 3.3 37.7 0 0.28571 0.5 2.08274
Healthy Serum IL6 12.46 50.8 0 0.14286 0.28572 1.19017
Healthy Serum IL8 7.5 44.2 0 0.28571 0.4 1.6662
Healthy Serum IL10 0.9 28.5 0 0.33333 0.5 2.08888
Healthy Serum MCP1 39.74 160.2 0 0.28571 0.33334 1.38589
Healthy Serum TNFA 1.42 138.4 0 0.16667 0.33334 1.38544
Healthy Serum VEGF 28.3 297.3 0 0.28571 0.33334 1.38304
TestSet EGF 5 35 0 0.25 0.5 1.45182
TestSet IFNG 2 20 0 0.25 0.5 1.46094
TestSet IL1A 3 30 0 0.25 0.5 1.45182
TestSet IL1B 4 40 0 0.25 0.5 1.44661
TestSet IL2 5 50 0 0.25 0.5 1.44661
TestSet IL4 6 60 0 0.25 0.5 1.44401
TestSet IL6 7 70 0 0.25 0.5 1.44401
TestSet IL8 8 80 0 0.25 0.5 1.44271
TestSet IL10 9 90 0 0.25 0.5 1.4401
TestSet MCP1 10 100 0 0.25 0.5 1.4401
TestSet TNFA 11 110 0 0.25 0.5 1.4375
TestSet VEGF 12 120 0 0.25 0.5 1.4375


4.4 Developing the Computational Model
We seek to develop a model that describes the probability that a host-response system occupies a discrete set of states characteristic of a patient's clinical type. We model this topological property by discretizing the concentration values of biomarkers in each panel for all combinations of biomarkers and clinical types from the given population of patients. The term combination is restricted to mean processing an aggregate concentration of data values for a subset of biomarkers from a set of clinical types to determine which subset of biomarkers distinguishes among those clinical types.
We perform this discretization using a new binning algorithm, Max-Bins-Min-Empty-Bins (presented in Algorithm 1), where we first compute a bin size and number of bins (Wr, Nr, 1 ≤ r ≤ R) for every clinical type paired biomarker combination CBr. Each clinical type has 77 different combinations of pairs of biomarkers (except when an entire biomarker is missing), including pairs of the same biomarker. A CBr is defined as the aggregate concentration data from each of these pairs of biomarkers. Throughout the paper, the term clinical type paired biomarker combination CBr refers to forming combinations of biomarkers belonging to a specific clinical type. Every clinical type has the same number of CBrs. The bin size computation maximizes the number of occupied bins Ô (O-hat), separating data values at the highest possible resolution, while minimizing the number of gaps or empty bins Õ (O-tilde) where no data values reside. Empty bins are then considered non-permissible states. Maximizing the number of occupied bins separates the states within each combination. The model computes a different total number of bins (states) Nr for each combination CBr. We then compute the probability of each concentration data point belonging to a bin within that combination. Each of these combinations has a characteristic number of occupied bin states Ô determined by the binning algorithm, though it is possible for a biomarker not to exhibit distinguishing bin states, labeled Ō (O-bar), depending upon the set of clinical types that are processed.
These computations produce multiple bin states for different combinations embedded within a space that exposes the conditional structure underlying the distribution of these values. The model computes a coordinate space whose dimensions are the computed DTS values, individual bin probabilities, and the concentration values for every CBr. The resulting space is analyzed to reveal the distinguishing patterns of each combination in this coordinate space.
a. Formulate the discrete topological structure (DTS)
We are developing a model that represents the conditional relationships of expressed host-response biomarkers, whose interactivity induces a probable biomarker concentration distribution as a discrete set of states characteristic of a patient's clinical type. We represent these interactive relationships by calculating their joint probability matrix Mc-joint: the probability that biomarker B2 measured at concentration c2 (event A) occurs at the same time biomarker B1 is measured at concentration c1 (event Γ). A related but separate concept is also necessary: the conditional probabilities of biomarker interactivity, where, given concentration measurement c1 for B1 as event Γ, we ask how likely the measured concentration c2 for B2 is as event A. Call this matrix Mβ. To represent the influence of individual biomarkers, we use marginal probabilities, the probabilities of various concentration values of a subset of biomarker variables without reference to the values of the other variables being considered. Call this matrix Mα. These marginal probabilities represent the probability distribution of event A when the probability value of event Γ is not known. Together, these types of probability express the mutual interactivity and distribution of the biomarker measurements that are indicative of each clinical type. We equate these concepts to a discrete topological structure matrix, where the various biomarker concentration measurements and their probabilities reveal topological patterns characteristic of each clinical type. A DTS matrix is computed for each CBr, and the matrix (i.e., the specific set of biomarkers) that produces the most accurate set of patient assignments per clinical type is designated Mc for that population.
$$ M_C = M_{C\text{-joint}} \, (\mathbf{1} - M_\alpha) \, M_\beta $$
Equation 1: DTS matrix equation for population Mc.
In Equation 1, Mc-joint is the population joint probability matrix, 1 is a complete matrix of ones (not the identity matrix), Mα is the α interaction matrix of marginal probabilities, and Mβ is the β interaction matrix of conditional probabilities for the clinical type. The DTS equation is implemented in terms of matrices of conditional and marginal probabilities involving bivariate pairs of biomarkers, each of which is indexed by its respective set of discrete bin states as computed by the binning algorithm, as explained below. This formulation assigns each set of biomarker concentration measurements to its own vector of discrete bins to reveal the discrete topological structure underlying the distribution of these concentration values. These discrete bins of concentration are computed as state variables represented by unitless probabilities by the Max-Bins-Min-Empty-Bins algorithm. The discrete topological structure of the bin states then effectively describes how the various biomarker concentrations relate to each other.
b. Develop the binning algorithm
We motivate the discussion by developing a customized binning algorithm that discretizes the concentration measurements into bins and induces a multidimensional space of discrete bin states characteristic of each clinical type. The binning algorithm is a necessary computational component of the overall DTS equation. The algorithm accepts as input vectors of biomarker concentration measurements relative to a standard in picograms per milliliter (pg/ml) and outputs the number of bins and the constant bin size for each combination of concentration values. This process is described below.
Step 1: Define the Possible Combinations. After experimenting with all possible clinical type biomarker combinations (pairs, triplets, quadruples, etc.), we determined that analyzing the conditional probabilities of every clinical type and biomarker pair {Bi, Bj} is the most effective means for distinguishing among all possible combinations. We denote these combinations CBr, indexed as 1 ≤ r ≤ R. We refer to Dr as the combined set of observed concentration data values within each CBr, for a specific clinical type and biomarker pair {Bi, Bj}. In these experiments this produces 7 x 77 = 455 Wr bin sizes, where the number of bins Nr is different for each unique CBr. The same Wr value is computed for both {Bi, Bj} and {Bj, Bi}.
Step 2: Compute the Combination Bin Sizes. We next compute the bin sizes and number of bins Wr, Nr, 1 ≤ r ≤ R, for each CBr using the Max-Bins-Min-Empty-Bins algorithm. The output of the algorithm is a set of Nr bins of fixed bin size Wr per CBr. The algorithm guarantees that each biomarker concentration value is assigned to a single bin state n within the set of bin states for each CBr, where 1 ≤ n ≤ Nr.
The Max-Bins-Min-Empty-Bins algorithm first initializes the bin size and then iteratively re-computes it as the difference of the lowest and highest concentration data values [c] for each combination divided by the current number of bins Nr.
$$ W_r = \frac{\max[c] - \min[c]}{N_r} $$
Equation 2: Used to calculate the bin size Wr.


This bin size is recomputed as the number of bins is incremented by a constant value √MaxNBins within a loop, until the absolute value of the difference between the current bin size and the log_e of the current number of empty bins, the threshold Hi, becomes greater than or equal to the previous threshold value Hi-1.
$$ H_i \geq \left| W_r - \log_e(N_{\text{empty}}) \right| $$
Equation 3: Used to calculate the threshold value Hi, where N_empty is the current number of empty bins.
The bin size is a function of whether the data set Dr is processed in log_e form or not, the specific distribution of concentration values, and the upper limit on the number of bins, MaxNBins. This constant upper limit is the square of the number of bins that, through experimentation, guaranteed at least the same number of empty bins as the number of data values over the concentration measurements. In these experiments, MaxNBins = 233 for the baseline data set, but subsequent matrix dimensions are rounded up to 256 x 256 rows and columns to simplify model calculations. Pseudo-code for the binning algorithm is given in Algorithm 1.
forEach combination Dr = [ D(Bi), D(Bj) ]
    GetMaxBinsMinEmptyBinValue(Dr, MaxNbins) returns Wr, Nr
        # Dr: set of concentration data for given CBr
        # MaxNbins: maximum number of bins
        # Initialize number of bins (Nr), bin step size (binIncrement),
        # bin size (Wr) and number of empty bins (emptyBins).
        Nr = binIncrement = √MaxNbins
        Wr = | Max(Dr) − Min(Dr) | / Nr
        emptyBins = CountNumberOfEmptyBins(Dr, Nr, Wr)
        result = | Wr − Loge(emptyBins) |
        while (result != 1 and Nr < MaxNbins − binIncrement)
            Nr = Nr + binIncrement
            Wr = | Max(Dr) − Min(Dr) | / Nr
            emptyBins = CountNumberOfEmptyBins(Dr, Nr, Wr)
            if (emptyBins > 0) or (| Wr − Loge(emptyBins) |) then result = 1
        return (Wr, Nr)
Algorithm 1: Pseudo-code for the binning algorithm.
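As an illustration only, the loop described in the text can be sketched in Python. This is not the author's implementation: it follows the prose description (grow Nr in steps of √MaxNBins and stop once the threshold |Wr − log_e(empty bins)| is no longer smaller than its previous value), and the function names, the anchoring of bins at the minimum value, and the handling of the zero-empty-bin case are all assumptions made for this sketch.

```python
import math

def count_empty_bins(values, n_bins, bin_size):
    """Count bins over [min, max] that receive no value (bins anchored at min: an assumption)."""
    lo = min(values)
    occupied = set()
    for v in values:
        # Clamp so the maximum value falls into the last bin.
        occupied.add(min(int((v - lo) / bin_size), n_bins - 1))
    return n_bins - len(occupied)

def max_bins_min_empty_bins(values, max_n_bins):
    """Hypothetical reading of Algorithm 1; returns (bin_size, n_bins)."""
    step = max(1, round(math.sqrt(max_n_bins)))   # binIncrement
    span = max(values) - min(values)
    if span <= 0:
        raise ValueError("need at least two distinct concentration values")
    n_bins = step
    bin_size = span / n_bins                      # Equation 2
    prev_h = None
    while n_bins < max_n_bins - step:
        empty = count_empty_bins(values, n_bins, bin_size)
        if empty > 0:                             # log is undefined for zero empty bins
            h = abs(bin_size - math.log(empty))   # threshold of Equation 3
            if prev_h is not None and h >= prev_h:
                break                             # threshold stopped decreasing
            prev_h = h
        n_bins += step
        bin_size = span / n_bins
    return bin_size, n_bins

# Adenocarcinoma IFNG and IL1A values as listed in the text.
IFNG = [1.181, 1.244, 1.244, 1.244, 1.277, 1.373, 1.537, 1.634, 1.7, 1.7035,
        1.766, 1.766, 1.766, 1.8055, 2.0015, 2.0705, 2.17, 2.24, 2.513, 2.538,
        2.6525, 2.6955, 2.791, 2.827, 3.002, 3.43, 3.505, 3.647, 3.9893,
        4.0835, 5.515, 9.1425, 10.974, 14.048, 34.878]
IL1A = [1.5025, 1.534, 2.44415, 2.475, 4.759, 6.294, 6.4865, 16.1703, 22.575]

bin_size, n_bins = max_bins_min_empty_bins(IFNG + IL1A, 233)
```

Because the stopping rule here is one interpretation of the prose, the bin size it returns should not be expected to reproduce the values in Table 1.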
The output of the Max-Bins-Min-Empty-Bins binning algorithm is a bin size Wr and the number of bins Nr for each clinical type paired biomarker combination CBr. Define NBi as the number of bins for biomarker Bi and NBj as the number of bins for biomarker Bj. For example, consider the CBr for Adenocarcinoma {Bi=IFNG, Bj=IL1A}. Since there are 53 Adenocarcinoma patients, there are at most 53 concentration measurements for either IFNG or IL1A, less missing values. IFNG actually has 35 [c] values (sorted):
{1.181, 1.244, 1.244, 1.244, 1.277, 1.373, 1.537, 1.634, 1.7, 1.7035, 1.766, 1.766, 1.766, 1.8055, 2.0015, 2.0705, 2.17, 2.24, 2.513, 2.538, 2.6525, 2.6955, 2.791, 2.827, 3.002, 3.43, 3.505, 3.647, 3.9893, 4.0835, 5.515, 9.1425, 10.974, 14.048, 34.878},
whereas IL1A has 9 [c] values (sorted):
{1.5025, 1.534, 2.44415, 2.475, 4.759, 6.294, 6.4865, 16.1703, 22.575},
for a total of 44 measurements. See Appendix A for a complete list of missing values. During step 2, we discretize these measured values by assigning each measurement of [c] to a bin state for this CBr. Each value in a set of combined concentration values [c] is assigned to a single bin, but multiple concentration values may be assigned to the same bin, as plotted in Figure 7 for the [c] values of clinical type Adenocarcinoma for biomarkers {Bi=IFNG, Bj=IL1A}. The top 2 rows in Figure 7 refer to the actual concentration values as measured in pg/ml, given by the [c] axis. Concentration values are mapped to specific bin states in the Bin Intervals row. Many of the [c] values are grouped in the first few bins. The first 6 states are labeled, and bin 5 is the first empty bin out of the 23 bins. Bins 1 through 4 illustrate the joint probabilities of IL1A and IFNG values occupying the same state. Table 4 tabulates, for the first 6 bins only, the number of points for {Bi=IFNG, Bj=IL1A} per bin and their values of Ô (occupied bins) and Õ (empty bins).
[Figure 7 plot: Binned [c] for Adenocarcinoma [IFNG, IL1A], BinSize=1.68485; rows: IL1A [c] Values, IFNG [c] Values, Bin Intervals; horizontal axis: [c] from 0 to 35; annotation: joint probability of occupying the same state.]
Figure 7: Assigning Adenocarcinoma concentration [c] values to bin states.
Table 4: Number of points for biomarkers; their Ô and Õ values for the first 6 bins.
Bin   IFNG   IL1A
1     8      2
2     17     2
3     5      1
4     1      2
5     0      0
6     1      0
Ô     5      4
Õ     1      2
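The counts in Table 4 can be checked with a short Python sketch. It assumes that bin k covers the interval [(k − 1)·Wr, k·Wr), that is, that bins are anchored at zero. This anchoring is an inference from the tabulated counts rather than something the text states, and the function name `bin_counts` is hypothetical.

```python
BIN_SIZE = 1.68485  # bin size reported for Adenocarcinoma {IFNG, IL1A}

IFNG = [1.181, 1.244, 1.244, 1.244, 1.277, 1.373, 1.537, 1.634, 1.7, 1.7035,
        1.766, 1.766, 1.766, 1.8055, 2.0015, 2.0705, 2.17, 2.24, 2.513, 2.538,
        2.6525, 2.6955, 2.791, 2.827, 3.002, 3.43, 3.505, 3.647, 3.9893,
        4.0835, 5.515, 9.1425, 10.974, 14.048, 34.878]
IL1A = [1.5025, 1.534, 2.44415, 2.475, 4.759, 6.294, 6.4865, 16.1703, 22.575]

def bin_counts(values, bin_size, n_bins):
    """Count values per bin for the first n_bins bins, with bins anchored at zero (assumed)."""
    counts = [0] * n_bins
    for v in values:
        k = int(v // bin_size)      # zero-based bin index
        if k < n_bins:
            counts[k] += 1
    return counts

ifng_counts = bin_counts(IFNG, BIN_SIZE, 6)
il1a_counts = bin_counts(IL1A, BIN_SIZE, 6)
print(ifng_counts)  # [8, 17, 5, 1, 0, 1]
print(il1a_counts)  # [2, 2, 1, 2, 0, 0]

# Occupied and empty bins among the first six, matching the last two rows of Table 4.
occupied_ifng = sum(1 for g in ifng_counts if g > 0)   # 5
empty_ifng = 6 - occupied_ifng                         # 1
occupied_il1a = sum(1 for g in il1a_counts if g > 0)   # 4
empty_il1a = 6 - occupied_il1a                         # 2
```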
[Figure 8 panels: All Concentration Values for Clinical Type Adenocarcinoma, SmokersWithCOPD, Squamous, SmokersWithoutCOPD, NeverSmokers, AcuteLungInjury, CysticFibrosis, TestSet, and HealthySerum; axis: [c] in pg/ml.]
Figure 8 (a-i): Complete set of 12 biomarker concentration values for every clinical type.
Figure 8 (a) plots the full dynamic range of the 12 biomarker concentrations for all Adenocarcinoma patients. The difference in scale between ranges is significant. For example, Adenocarcinoma MCP1 is expressed over a large dynamic range, from 47.9 to 902.9 pg/ml, whereas the measured range of IL1A is small in comparison, from 1.5 to 22.5 pg/ml. This issue of scale is mitigated by instead comparing the DTS values for Adenocarcinoma, as plotted in Figure 9 (a). Of all the clinical types, only a few DTS values were greater than 7.0 (6 IL1A values from Never Smokers). These few values were excluded from that plot to keep axis dimensions consistent. The figures make explicit the difference between the dynamic scale of the biomarker concentrations and the calculated DTS values. For every clinical type, the binning algorithm yields DTS values that all fall within one order of magnitude.
[Figure 9 panels: DTS Values for Clinical Type Adenocarcinoma, Squamous, SmokersWithCOPD, SmokersWithoutCOPD, NeverSmokers (partial), AcuteLungInjury, CysticFibrosis, TestSet, and HealthySerum; horizontal axis: EGF, IFNG, IL1A, IL1B, IL2, IL4, IL6, IL8, IL10, MCP1, TNFA, VEGF; vertical axis: DTS from 0 to 7.]
Figure 9 (a-i): All clinical type biomarker DTS values.
Figure 10 plots in 3 dimensions the biomarker concentrations, the probability of each concentration occurring in the population, and the computed DTS values for every clinical type. The probabilities are banded and there are areas of localized density for each type. Each clinical type also exhibits a characteristic plume effect. The plot excludes zero probabilities. Four Never Smokers values greater than 8.0 are excluded to increase plot resolution: {8.06626, 8.13518, 8.6218, 12.2406}.
[Figure 10 plot: All Clinical Types Point Cloud; axes: Probability, DTS, and Concentration (0 to 1000); legend: Adenocarcinoma, Squamous, NeverSmokers, SmokersWithCOPD, SmokersWithoutCOPD, AcuteLungInjury, CysticFibrosis, TestSet, HealthySerum.]
Figure 10: Point cloud for the baseline clinical types.
c. Compute the clinical type discrete topological structure matrix Mc
Once the fixed bin size Wr per combination is known, and the individual biomarkers in each combination have been mapped and assigned to bin states, we compute the population joint probabilities for Dr inside a nested loop for each clinical type Ct, combination CBr ∈ Ct, biomarker Bi ∈ CBr, and bin b from 1 to Nr using Equation 4, where Gb is the number of concentration values of Bi in bin b, as in Table 4.
$$ P_i[b] = \frac{G_b}{|D_r|}, \quad 1 \le r \le R, \; 1 \le b \le N_r $$
Equation 4: Compute population joint probabilities of data set Dr for clinical type Ct.
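A small Python sketch of Equation 4 follows, under the reading that a bin probability is the bin count Gb divided by |Dr|, the size of the combined data set for the biomarker pair, so that the probabilities of both biomarkers together sum to 1. The count vectors are for Adenocarcinoma {IFNG, IL1A} at bin size 1.68485: the first six entries match Table 4, while the remaining entries were derived for this sketch from the listed concentration values, assuming bins anchored at zero. The function name is hypothetical.

```python
# Bin counts G_b per biomarker (21 bins each); entries past bin 6 are derived, not tabulated.
IFNG_COUNTS = [8, 17, 5, 1, 0, 1, 1, 0, 1] + [0] * 11 + [1]
IL1A_COUNTS = [2, 2, 1, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1] + [0] * 7

def bin_probabilities(counts, total):
    """Equation 4: P[b] = G_b / |D_r| for every bin b."""
    return [g / total for g in counts]

total = sum(IFNG_COUNTS) + sum(IL1A_COUNTS)      # |D_r| = 35 + 9 = 44
p_ifng = bin_probabilities(IFNG_COUNTS, total)
p_il1a = bin_probabilities(IL1A_COUNTS, total)

# Empty bins carry probability zero; all bins of both biomarkers together sum to 1.
combined = sum(p_ifng) + sum(p_il1a)
```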


Given a computed bin size Wr for Dr, Pi is the set of probabilities for observing the biomarker concentrations in each bin, often zero. A bin probability equals the number of concentration values Gb grouped in each bin divided by |Dr| so that the sum of probabilities over the set of bins is 1. The population probabilities for each clinical type paired biomarker combination CBr include all the sample values in that combination. Calculating probabilities this way assumes that each population of samples adequately represents the set of possible concentration bins. Computing the population joint probabilities involves formulating several probability matrices.
1. Compute the population joint probability matrix Mc-joint. Let Zt = |Ct|, 1 ≤ t ≤ 7, be the number of patient samples in clinical type population Ct. Let Bi be biomarker i for sample Sj, where Sj ∈ Ct. For each combination of pairs of biomarkers Bi and Bj, we define their joint probability as the probability of belonging to bin concentration states Xi and Yj together at the same time. Considering bin concentration states X and Y as random variables, we compute the population joint probability matrix Mc-joint by multiplying each bin probability Pi for biomarker Bi with each bin probability Pj for biomarker Bj, where Bi is indexed by i from 1 to the number of bins NBi for biomarker Bi and Bj is indexed by j from 1 to the number of bins NBj for biomarker Bj. Equation 5 multiplies two vectors (one of them transposed) as an outer product to form a 2-dimensional matrix for that biomarker combination of Bi and Bj. The dimensions of Mc-joint, one for each CBr, are NBi x NBj. Bins 1 through 4 in Figure 7 illustrate joint probability values greater than zero.
$$ M_{C\text{-joint}}(i, j) = P(B_i) \, P(B_j)^{T}, \quad 1 \le i \le N_{B_i}, \; 1 \le j \le N_{B_j} $$
Equation 5: Used to calculate each population joint probability matrix Mc-joint.
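The outer product of Equation 5 can be illustrated with a few lines of plain Python. The probability vectors below are made-up illustrative values, not data from this study, and the function name `joint_matrix` is hypothetical.

```python
def joint_matrix(p_i, p_j):
    """Outer product: entry (i, j) is P_i[i] * P_j[j], giving an N_Bi x N_Bj matrix."""
    return [[a * b for b in p_j] for a in p_i]

p_i = [0.5, 0.25, 0.0]     # illustrative bin probabilities for biomarker Bi (3 bins)
p_j = [0.125, 0.125]       # illustrative bin probabilities for biomarker Bj (2 bins)

joint = joint_matrix(p_i, p_j)
# joint == [[0.0625, 0.0625], [0.03125, 0.03125], [0.0, 0.0]]
```

An empty bin (probability zero) in either biomarker produces a zero row or column, so only pairs of occupied bins contribute non-zero joint probabilities.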
2. Compute the population marginal distributions Mi-marg and Mj-marg. We compute the row Mi-marg and column Mj-marg marginal probability values as:


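$$ M_{i\text{-marg}}(i) = \sum_{j=1}^{N_{B_j}} M_{C\text{-joint}}(i, j), \quad 1 \le i \le N_{B_i} $$
Equation 6: Used to calculate the Mi-marg row matrix.
$$ M_{j\text{-marg}}(j) = \sum_{i=1}^{N_{B_i}} M_{C\text{-joint}}(i, j), \quad 1 \le j \le N_{B_j} $$
Equation 7: Used to calculate the Mj-marg column matrix.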
The α interaction matrix Mα from Equation 1, with dimensions NBi x NBj, is then composed as the transposition of Mi-marg repeated NBj times.
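A minimal sketch of the marginal computations and the α interaction matrix, assuming that the row marginal sums each row of the joint matrix over j, the column marginal sums each column over i, and that "repeated NBj times" means every column of the α matrix equals the row-marginal vector. The joint matrix is an illustrative example, and all function names are hypothetical.

```python
def row_marginals(joint):
    """M_i-marg(i): sum of row i of the joint matrix."""
    return [sum(row) for row in joint]

def col_marginals(joint):
    """M_j-marg(j): sum of column j of the joint matrix."""
    return [sum(col) for col in zip(*joint)]

def alpha_matrix(joint):
    """N_Bi x N_Bj matrix whose every column is the row-marginal vector (assumed layout)."""
    n_cols = len(joint[0])
    return [[m] * n_cols for m in row_marginals(joint)]

joint = [[0.0625, 0.0625], [0.03125, 0.03125], [0.0, 0.0]]   # illustrative values

m_i = row_marginals(joint)    # [0.125, 0.0625, 0.0]
m_j = col_marginals(joint)    # [0.09375, 0.09375]
alpha = alpha_matrix(joint)   # [[0.125, 0.125], [0.0625, 0.0625], [0.0, 0.0]]
```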
3. Compute the population conditional probability matrix Mc-cond. The conditional distribution of random variable X given random variable Y is computed as the joint probability values of X and Y divided by the marginal values of Y. Given a pair of biomarkers {Bi, Bj}, this element-by-element matrix division implies for our purposes that:
$$ M_{Ci\text{-cond}} = M_{C\text{-joint}} / M_{j\text{-marg}}, \qquad M_{Cj\text{-cond}} = M_{C\text{-joint}} / M_{i\text{-marg}} $$
Equation 8: Calculate each conditional probability matrix Mc-cond, one per CBr.
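The element-wise division of Equation 8 might be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: the column marginal is applied to every entry in its column (and the row marginal to every entry in its row), and a zero marginal yields a zero conditional probability instead of a division error. The function names and the example matrix are hypothetical.

```python
def cond_i_given_j(joint, m_j):
    """M_Ci-cond: divide each entry by the marginal of its column j."""
    return [[(x / m_j[j] if m_j[j] > 0 else 0.0) for j, x in enumerate(row)]
            for row in joint]

def cond_j_given_i(joint, m_i):
    """M_Cj-cond: divide each entry by the marginal of its row i."""
    return [[(x / m_i[i] if m_i[i] > 0 else 0.0) for x in row]
            for i, row in enumerate(joint)]

joint = [[0.0625, 0.0625], [0.03125, 0.03125], [0.0, 0.0]]   # illustrative values
m_i = [sum(row) for row in joint]                             # row marginals
m_j = [sum(col) for col in zip(*joint)]                       # column marginals

ci = cond_i_given_j(joint, m_j)   # each column sums to 1
cj = cond_j_given_i(joint, m_i)   # each non-empty row sums to 1
```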
4. Compute the population discrete topological structure matrix Mc. The marginal Mc-marg, joint Mc-joint, and conditional Mc-cond probability matrices are computed for each CBr within each population. Define the corresponding β interaction matrix Mβ as Pi divided element-wise by Pj from step 1, as in Equation 9:
$$ M_\beta(i, j) = \begin{cases} P_i / P_j, & P_j > 0 \\ 0, & \text{otherwise} \end{cases} $$
Equation 9: Used to calculate the β interaction matrix Mβ.
We finally compute the DTS matrix Mc for each CBr using Equation 10.
$$ M_C = M_{C\text{-joint}} \, (\mathbf{1} - M_\alpha) \, M_\beta $$
Equation 10: Used to calculate the DTS matrix Mc, one per CBr.
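To make the sequence of steps concrete, the following Python sketch assembles a DTS matrix from two bin probability vectors. It rests on assumptions that the text does not confirm: the factors in Equation 10 are combined element-wise, the α matrix repeats the row marginals across columns, and entry (i, j) of the β matrix is Pi[i] / Pj[j] when Pj[j] > 0 and 0 otherwise. The input vectors are illustrative values, not study data, and `dts_matrix` is a hypothetical name.

```python
def dts_matrix(p_i, p_j):
    """Element-wise reading of Equation 10: joint * (1 - alpha) * beta."""
    n_i, n_j = len(p_i), len(p_j)
    joint = [[a * b for b in p_j] for a in p_i]                        # Equation 5
    row_marg = [sum(row) for row in joint]                             # Equation 6
    alpha = [[m] * n_j for m in row_marg]                              # alpha interaction matrix
    beta = [[(a / b if b > 0 else 0.0) for b in p_j] for a in p_i]     # Equation 9
    return [[joint[i][j] * (1.0 - alpha[i][j]) * beta[i][j]
             for j in range(n_j)]
            for i in range(n_i)]

p_i = [0.5, 0.25, 0.0]     # illustrative bin probabilities for Bi
p_j = [0.125, 0.125]       # illustrative bin probabilities for Bj

m_c = dts_matrix(p_i, p_j)
# m_c == [[0.21875, 0.21875], [0.05859375, 0.05859375], [0.0, 0.0]]
```

Under this reading, a bin with zero probability in either biomarker yields a zero DTS entry, consistent with empty bins being treated as non-permissible states.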


We have now constructed a set of CBr matrices representing the conditional probability relationships between all pairs of biomarkers within each population. Each CBr combination has a characteristic vector of occupied bin states Ô and empty bin states Õ out of a possible number of bins Nr, as determined by the binning algorithm. Each CBr combination now composes an object with the following properties, which will be used to determine the set of distinguishing biomarkers per clinical type:
clinical type population Ct,
biomarker pair {Bi, Bj},
bin size Wr,
number of bins Nr,
bin state vector [Pi, Ô, Õ, Gb],
set of observed concentration values Dr,
matrices Mc, Mc-cond, Mc-joint, Mc-marg, Mα, Mβ.
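The property list above maps naturally onto a record type. The following Python dataclass is one hypothetical way to hold a CBr object; the class name, field names, and the derived `occupied_bins` and `empty_bins` properties are illustrative choices, not the author's data structure. The example instance is restricted to the first six bins of Table 4 for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Matrix = List[List[float]]

@dataclass
class CombinationRecord:
    """Hypothetical container for one clinical type paired biomarker combination CBr."""
    clinical_type: str                     # population Ct
    biomarker_pair: Tuple[str, str]        # {Bi, Bj}
    bin_size: float                        # Wr
    n_bins: int                            # Nr
    counts: List[int]                      # Gb per bin
    bin_probabilities: List[float]         # Pi per bin
    observed: List[float] = field(default_factory=list)         # Dr
    matrices: Dict[str, Matrix] = field(default_factory=dict)   # e.g. "joint", "cond", "dts"

    @property
    def occupied_bins(self) -> int:
        """Number of bins holding at least one value."""
        return sum(1 for g in self.counts if g > 0)

    @property
    def empty_bins(self) -> int:
        """Number of bins holding no value (non-permissible states)."""
        return self.n_bins - self.occupied_bins

ifng_counts = [8, 17, 5, 1, 0, 1]          # first six IFNG bins from Table 4
record = CombinationRecord(
    clinical_type="Adenocarcinoma",
    biomarker_pair=("IFNG", "IL1A"),
    bin_size=1.68485,
    n_bins=6,
    counts=ifng_counts,
    bin_probabilities=[g / 44 for g in ifng_counts],   # |Dr| = 44 for this pair
)
```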
d. Determine the distinguishing biomarkers
In the next step, we determine the distinguishing biomarkers per clinical type, allowing us to select the essential set of biomarkers for the sake of effective diagnosis.
From the analysis above, each biomarker Bi is a member of a number of CBrs with different bin sizes, each of which corresponds to a Dr where several bin states were assigned to the concentration values of this Bi. Each bin state in Ô is characterized by its bin size, the binned concentration values, and their DTS and probability values. Bin states can be compared to each other because their properties are all constructed using the same algorithms. We can therefore form a coordinate system of the bin state probability values and the DTS values per biomarker for each clinical type, instead of comparing concentration values, as plotted below in Figure 11 for the IFNG values of Never Smokers. The characteristic vertical bands of discrete bin state probability exist in every combination.
Each point represents a distinct bin state. Bin states that lie on the ordinate with zero probability are non-permissible bin states, since there are no data values within that bin range. They are shown to indicate the ratio of permissible to non-permissible bin states, in this case a ratio of 2-to-1. Figure 12 displays a close-up of the IFNG values for every clinical type together. Individual bin states to the right have higher probabilities, but there are nearly always states with the same (usually low) probability value separated by their DTS values. The arrows in Figure 12 point to the first four Adenocarcinoma IFNG values plotted, corresponding to Figure 7. Figure 13 (a-l) plots all 12 biomarkers on standard axes for every clinical type, including the validation datasets Healthy Serum and TestSet.
[Plot "NeverSmokers: IFNG"; axis label: DTS]
Figure 11: Plot of IFNG-only bin state Probability and DTS values for Never Smokers.
[Plot "All Clinical Types for IFNG"; axes: Probability, DTS; legend (Clinical Types): Adenocarcinoma, Squamous, NeverSmokers, SmokersWithCOPD, SmokersWithoutCOPD, AcuteLungInjury, CysticFibrosis]
Figure 12: Plot of IFNG bin state Probability and DTS values for every clinical type.
[Plots "All Clinical Types for <biomarker>", one panel each for EGF, IFNG, IL1A, IL1B, IL2, IL4, IL6, IL8, IL10, MCP1, TNFA, and VEGF; axes: Probability, DTS; legend (Clinical Types): Adenocarcinoma, Squamous, NeverSmokers, SmokersWithCOPD, SmokersWithoutCOPD, AcuteLungInjury, CysticFibrosis, TestSet, HealthySerum]
Figure 13 (a-l): Bin state Probability and DTS values for every clinical type per biomarker.
These clinical type bin states are then represented in matrix form to determine their characteristic and distinguishing states. We formulate 7x12 individual integer matrices that represent all occupied (non-distinguishing) bin states, one for each clinical type and biomarker. These integer matrices are constructed by first standardizing the bin state probability and DTS values. The probability values are multiplied by 100 and rounded to integers as percent values along the x-axis to form a standard 100 cells. The corresponding DTS values are exponentiated (e raised to the DTS value) and rounded to integers, standardizing the y-axis to 256 cells, starting from the upper left corner. This process excludes only 8 values (all of them Never Smokers IL1A values) whose exponentiated value exceeds e^5.5452 ≈ 256, out of 3061 values. This forms a cellular structure where a whole integer in a cell indicates the presence of a probability-DTS value, and 0 otherwise. For example, the (partial) cellular structure in Figure 14 (a) plots the first 4 values of IFNG for only Adenocarcinoma (id = 2). An element-by-element XOR operation between the cellular structures of two clinical types of the same biomarker reveals which probability-DTS bin values are unique between those two clinical types.
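The standardization and pairwise XOR can be sketched as follows. This is an illustrative reading rather than the author's code: the function names, the sparse set-of-cells representation (used here in place of a dense matrix), and the sample bin states are assumptions. The symmetric difference of two occupancy sets is the same as an element-by-element XOR of two 0/1 grids.

```python
import math

MAX_DTS_CELL = 256  # values whose e**DTS rounds above this are excluded


def to_cell(probability, dts):
    """Map a (probability, DTS) bin state to a (dts_cell, prob_cell) grid cell."""
    prob_cell = round(probability * 100)   # percent value along the x-axis
    dts_cell = round(math.exp(dts))        # e raised to the DTS value, y-axis
    # Note: Python's round() rounds exact halves to the nearest even integer.
    if dts_cell > MAX_DTS_CELL:
        return None
    return (dts_cell, prob_cell)


def occupied_cells(bin_states):
    """Collect the occupied cells for one clinical type and one biomarker."""
    cells = set()
    for probability, dts in bin_states:
        cell = to_cell(probability, dts)
        if cell is not None:
            cells.add(cell)
    return cells


def unique_between(cells_a, cells_b):
    """XOR of two occupancy grids: cells occupied by exactly one of the two types."""
    return cells_a ^ cells_b


# Hypothetical bin states as (probability, DTS) pairs.
type_a = occupied_cells([(0.10, 1.0), (0.25, 2.0), (0.50, 6.0)])
type_b = occupied_cells([(0.10, 1.0), (0.30, 0.0)])
only_one = unique_between(type_a, type_b)
```

In this example the third state of `type_a` is dropped because e^6 rounds to 403, which is above the 256-cell limit.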
An elaboration of this logic is used to obtain the complete list of distinguishing bin states of the same biomarker among every clinical type. The objective is the same: to identify those matrix cells that are occupied by one and only one clinical type for that biomarker. To begin with, each clinical type is represented by a unique 2^n identifier starting at n = 1, as tabulated in Table 5.
Table 5: Clinical type identifiers used to distinguish bin states.
n: Clinical Type Unique 2^n Identifier
1: Adenocarcinoma 2
2: Squamous 4
3: Controls Never Smokers 8
4: Controls Smokers with COPD 16
5: Controls Smokers without COPD 32
6: Acute Lung Injury 64
7: Cystic Fibrosis 128
We replace the occupied matrix integer values with their respective clinical type identifiers, and then add every matrix together per biomarker, $M_B = \sum_{t=1}^{C_T} M_t$, $1 \le t \le C_T$, so that each matrix cell contains zero, one, or more than one clinical type identifier. An element-by-element log2 operation that returns a whole integer identifies a single clinical type occupying that cell. A cell that contains one of these unique values (2, 4, 8, etc.) identifies the only clinical type for that distinguishing biomarker in that cell. This method depends upon the fact that a binomial coefficient (m choose n) (mod 2) is computable using the operation n XOR m. Figure 14 (b) plots the integer matrix for biomarker IFNG for every clinical type, where Adenocarcinoma is distinguished by 3 (red) circled cells. The (blue) circled value of 34 indicates that both Adenocarcinoma (id = 2) and Smokers without COPD (id = 32) exist in the same cell. Figure 15 (a-l) plots all the biomarkers for every clinical type, where each diagram displays the first 24 rows and columns out of 100.
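A small sketch of this identifier arithmetic follows, using the Table 5 identifiers. The function names, the dictionary-of-cells representation, and the sample cells are assumptions for illustration. Because each identifier is a distinct power of two, a summed cell value is a whole-integer log2 exactly when one clinical type occupies that cell.

```python
import math

CLINICAL_TYPES = [
    "Adenocarcinoma", "Squamous", "Never Smokers", "Smokers with COPD",
    "Smokers without COPD", "Acute Lung Injury", "Cystic Fibrosis",
]
# Unique 2^n identifiers starting at n = 1, as in Table 5.
TYPE_IDS = {name: 2 ** n for n, name in enumerate(CLINICAL_TYPES, start=1)}


def sum_identifier_grids(cells_by_type):
    """Add the identifier of every clinical type that occupies each cell."""
    summed = {}
    for name, cells in cells_by_type.items():
        for cell in cells:
            summed[cell] = summed.get(cell, 0) + TYPE_IDS[name]
    return summed


def decode(value):
    """List the clinical types whose identifiers make up a summed cell value."""
    return [name for name, ident in TYPE_IDS.items() if value & ident]


def distinguishing_cells(summed):
    """Keep cells whose value has a whole-integer log2 (a single occupant)."""
    return {
        cell: decode(value)[0]
        for cell, value in summed.items()
        if value > 0 and math.log2(value).is_integer()
    }


# Hypothetical occupied cells for one biomarker.
cells_by_type = {
    "Adenocarcinoma": {(2, 14), (2, 18)},
    "Smokers without COPD": {(2, 18), (6, 14)},
    "Acute Lung Injury": {(1, 12)},
}
summed = sum_identifier_grids(cells_by_type)
distinct = distinguishing_cells(summed)
```

Here the shared cell sums to 34, which decodes to Adenocarcinoma (2) and Smokers without COPD (32), so it is not a distinguishing cell.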
0. 0. 0. 0. 0. 0 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0 0. 0. 0. 0. 0. 0. 64. 64. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0, 0. 0. CD 0 0. 0. 0 0. 0. 0. 0. 0. 0. 32. 16. 4, O 0. 32
0. 0. 0. 0. 0. 0 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 8. 8. 0. 8. 0. 0. 8. 0. 0. 0.
0. 0. 0. 0. 0. 0 CD 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. O 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0. 32. 64. 0. 0. 0. 0. 16. 0. 16. 0.
0. 0. 0. 0. 0. 0 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 8. 8. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 32. 0. 16. 0. 0. 0. 0. 0.
0. 0, 0. 0. 0. 0 0. 0. 0. 0, 0. 0. 0. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0,
0. 0. 0. 0. 0. 0 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 64. 0. 0. 64. 0.
0. 0. 0. 0. 0. 0 0. 0. 0. 0. 0. 0. 0. 0. 32. 0. 0. 0. 0. 16. 0. 0. 0. 0.
Figure 14 (a): Partial integer matrix for IFNG values of Adenocarcinoma.
Figure 14 (b): Partial integer matrix for biomarker IFNG for all 7 clinical types.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
122. 122. 0. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 4. 0. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
42. 42. 34. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
<£ CO 84. 4. 0. 0. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
72. 74. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
64. 112. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 4. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
10. 10. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 64. 66. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
16. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 32. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
64. 4, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
Figure 15 (a): Partial integer matrix for biomarker EGF for all 7 clinical types.
0. 0
0. 0.
64. 64.
0. 0.
64. 0.
0.
0.
0.
0.
0.
0.
0.
0.
0.
0.
0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 2. 32. 16. 4. 2. 34. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 8. 8. 0. 8. 0. 0. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 32. 64. 0. 0. 0. 0. 16. 0. 16. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 8. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 32. 0. 16. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 0. 0. 64. 0. 0. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 32. 0. 0. 0. 0. 16. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
Figure 15 (b): Partial integer matrix for biomarker IFNG for all 7 clinical types.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 64. 0. 0. 64. 0. 0. 0. 0. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 64. 64. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 2. 4. 0. 2. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 64. 0. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
512. 512. 512. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 64. 64. 0. 0. 0. 0. 32. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 2. 4. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
Figure 15 (c): Partial integer matrix for biomarker IL1A for all 7 clinical types.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 68. 66. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0. 32. 16. 16. 0. 0. 0, 0. 0. 0. 0. 32
0. 64. 64. 64. 64. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0.
64. 64, 64. 0. 2. 0. 0. 0. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0, 0. 0. 0. 0. 0, 0. 0. 0. 0. 0,
0. 0. 4. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 64. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0, 0. 0. 0. 0. 0, 0. 0. 16. 0. 0,
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 512. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0, 0. 0. 0. 0. 0, 0. 0. 0. 0. 0,
0. 0. 0. 16. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0, 0. 0. 0. 0. 0, 0. 0. 0. 0. 0,
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 32, 0. 0, 0. 0. 0, 0. 0. 0. 0. 0, 0. 0. 0. 0. 0,
Figure 15 (d): Partial integer matrix for biomarker IL1B for all 7 clinical types.
0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 36. 4. 58. 0. 0. 22. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 64. 64. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 32. 0. 40. 0. 0. 0. 2. 0. 0. 0. 0. 10. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
64. 64. 80. 16. 16. 0. 0. 0. 0. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
64. 0. 96. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 32. 0. 0. 0. 8. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0, 0. 80. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0.
0. 0, 0. 32. 0. 0, 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0.
0, 64. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0.
0. 32. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 8. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
Figure 15 (e): Partial integer matrix for biomarker IL2 for all 7 clinical types.
0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 32. 0. 32. 32. 0. 0, 34, 16. 2. 0. 8. 0. 0. 2. 8. 0. 0. 0. 0. 0. 0, 0. 0.
64. 68. 64. 4. 0. 0. 64. 0. 4, 0. 0. 0. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 16. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 16. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 64. 0. 0. 0. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 16. 0. 0, 0. 0. 0. 0. 0, 0. 0. 0. 0, 0. 0. 0. 0. 0, 0. 0. 0.
0. 32. 0. 0. 0. 0. 0. 0. 0. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 16. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 32. 0. 0. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0.
0. 0. 0. 0. 0. 0. 0. 512. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 32. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
6. 0. 0. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 16. 0. 64. 0. 64, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0. 0, 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0. 0, 0. 0. 0. 0, 0. 0. 0. 0. 0. 0, 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 4. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 16. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
Figure 15 (f): Partial integer matrix for biomarker IL4 for all 7 clinical types.
0. 0. 0, 0. 0. 0, 0. 0, 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0.
64. 70. 60. 126. 8. 0. 0. 8. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 32. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 84. 84. 0. 16. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 10. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
64. 112. 80. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 32. CO 0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 4. 0. 0. 0, 0. 0, 0. 0, 0. 0. 0. 0. 0. 0. 0, 0. 0, 0. 0. 0. 0. 0. 0.
0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 16. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 32. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
512. 512. 576. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 16. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 16. 0. 0. 0, 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 66. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
Figure 15 (g): Partial integer matrix for biomarker IL6 for all 7 clinical types.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
64. 66. 82. 64. 8. 2. 42. 0. 24. 0. 0. 0. 0. 0. 8. 32. 0. 0. 0. 0. 0. 0. 0. 0.
4. 4. 4. 0. 4. 0. 0. 0. 0. 0, 0, 0, 0, 0, 0, 0. 0. 0. 0. 0. 0. 0. 0. 0.
64. 128. 2. 64. 0. 0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 64. 4. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 2. 0. 2. 0. 42. 0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 16. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 64. 0. 80. 16. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 2. 64. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 576. 0. 0. 512. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 64. 0. 0. 16. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 2. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 4. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 68. 4. 0. 16. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
Figure 15 (h): Partial integer matrix for biomarker IL8 for all 7 clinical types.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
64. 64. 66. 2. 32. 0. 2. 34. 8. 0. 8. 0. 0. 0. 2. 0. 0. 0. 0. 0. 8. 0. 0. 0
0. 0. 4. 16. 4. 0. 4. 0. 0. 0. 16. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0, 0. 0. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 128, 128. 96, 32. 32. 64. 0. 0, 0. 0. 0. 16, 0. 0, 0. 0, 0. 0, 0. 0, 0. 0, 0
0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0. 0. 0, 0. 0, 0. 0, 0. 0. 0. 0. 0
0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 2. 0. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 96. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 64. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 6. 0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 512. 0, 0. 0. 0. 512. 0, 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0, 0. 0, 0
0, 0. 0, 0. 0. 0. 0. 0. 0, 0. 0. 0. 0, 0. 0, 0. 0. 0. 0, 0. 0, 0. 0, 0
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 32. 0. 0. 0. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0
Figure 15 (i): Partial integer matrix for biomarker IL10 for all 7 clinical types.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
94. 94. 32. 0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
74. 162. 64. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
GO 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
10. 0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
16. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
6, 96. 64. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 96. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 512. 0. 512. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 64. 0. 0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
Figure 15 (j): Partial integer matrix for biomarker MCP1 for all 7 clinical types.
0. 0. 0. 0. 0. 0, 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0, 0, 0. 0. 0. 0. 0.
64. 88. 122. 82. 8. 32. 32. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 4. 0. 0. 0. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 8. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 144. 84. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 24. 16. 16. 0. 2. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 36. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 80. 10. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 32. 32. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 64. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 16. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 16. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 16. 64. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 520. 0. 512. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 64, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 32. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0.
0. 0, 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
Figure 15 (k): Partial integer matrix for biomarker TNFA for all 7 clinical types.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
64. 88. 122. 82. 8. 32. 32. 0. 0, 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 4. 0. 0. 0. 4. 0. 0. 0, 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 8. 32, 0. 0. 0. 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0. 0. 0. 0.
0. 144. 84. 0. 0. 0. 0, 0, 0, 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 24. 16. 16. 0. 2. 0. 0, 0, 0, 2, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 36. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 80. 10. 0. 0. 0, 2, 0, 0, 0, 0, 0. 0. 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0. 32. 32. 4. 0. 0, 0, 0. 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0, 0, 0,
0. 64. 64. 0. 0. 0. 0. 0, 0, 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 16. 0. 0. 0. 0. 0. 0, 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 4. 0. 0. 0. 0. 0. 0, 0, 0, 0, 0. 0. 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0. 0. 16. 0. 0. 0. 0, 0, 0, 0, 0. 0. 0. 0. 0. 0. 0, 0, 0. 0. 0. 0. 0. 0. 0.
0. 0. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 16. 64. 0. 0. 0. 0. 0, 0, 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 520. 0. 512. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0, 0. 0. 0, 0, 0, 0, 0, 0. 0. 0. 0. 0, 0, 0, 0, 0, 0, 0, 0, 0. 0. 0.
0. 0. 32. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 64. 0. 0. 0. 0. 0. 0, 0, 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 4. 0. 0. 0. 0. 0. 0. 0. 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0, 0, 0, 0. 0, 0. 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0. 32. 0. 0. 0. 0, 0, 0, 0, 0, 0, 0, 0, 0, 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 8. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
Figure 15 (l): Partial integer matrix for biomarker VEGF for all 7 clinical types.
e. Plot and tabulate the distinguishing matrices
The 12 individual integer matrices produced for each clinical type can be consolidated into 3 dimensions to plot their distinguishing biomarkers with respect to the aforementioned probability cell and DTS cell states. Figure 16 (a-h) plots the distinguishing probability cell and DTS cell states for each clinical type separately, and Figure 17 plots them all together. Each clinical type has several stretches of distinguishing bin states, and these bin states are spread out over the total state space for every clinical type. Interestingly, Never Smokers displays the most spread or variation among all the clinical types: one is normal for a large number of states. We observe that the range of values in the Probability dimension is low: no single biomarker overwhelms the others in terms of frequency. It is also clear that the DTS coordinate effectively separates the clinical types.
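One way to picture the consolidation into 3 dimensions is to turn every single-occupant cell into a (biomarker, probability cell, DTS cell) point, grouped by clinical type identifier. The sketch below assumes each biomarker's summed identifier matrix is stored as a dictionary keyed by (DTS cell, probability cell). The function name and the sample values are hypothetical.

```python
def distinguishing_points(summed_by_biomarker):
    """Group single-occupant cells into 3-D points per clinical type identifier."""
    points = {}
    for biomarker, grid in summed_by_biomarker.items():
        for (dts_cell, prob_cell), value in grid.items():
            # A positive power of two means exactly one clinical type is present.
            if value > 0 and value & (value - 1) == 0:
                points.setdefault(value, []).append((biomarker, prob_cell, dts_cell))
    # Sort each list so the result does not depend on insertion order.
    return {ident: sorted(pts) for ident, pts in points.items()}


# Hypothetical summed identifier grids for two biomarkers.
summed_by_biomarker = {
    "IFNG": {(2, 14): 2, (2, 18): 34, (6, 14): 32},
    "EGF": {(1, 1): 122, (3, 3): 2},
}
points = distinguishing_points(summed_by_biomarker)
```

Each list of points can then be plotted on biomarker, Probability Cell, and DTS Cell axes, one series per clinical type.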
[Figure 16 panels; axes: Probability Cell and DTS Cell. Recovered panel titles: NeverSmokers, SmokersWithCOPD, SmokersWithoutCOPD, AcuteLungInjury, CysticFibrosis, and HealthySerum Distinguished by Probability & DTS Cells.]
Figure 16 (a-h): Every clinical type distinguished by their Probability and DTS states.


[Figure 17 plot: Baseline Clinical Types Distinguished by Probability & DTS Cells. Legend (Clinical Types): Adenocarcinoma, Squamous, NeverSmokers, SmokersWithCOPD, SmokersWithoutCOPD, AcuteLungInjury, CysticFibrosis. Axes: Probability Cell, DTS Cell, and biomarker (VEGF, TNFA, MCP1, IL10, IL8, IL6, IL4, IL2, IL1B, IL1A, IFNG, EGF).]
Figure 17: All baseline clinical types distinguished by Probability and DTS states.
Different biomarkers vary in their ability to distinguish among the clinical types. For example, Figure 18 (a-l) plots the distinguishing bin states for each of the 12 biomarkers within every clinical type. Each biomarker has a distinguishing topology of bin states. Figure 19 plots the baseline biomarkers distinguished by their Probability and DTS states, again expressed as raised exponents to the natural logarithm and rounded to integers. The biomarkers are close together in state space, though IL1A (blue) seems to stand apart. Together, these plots reveal which CBr is the most effective for determining the clinical type of a patient.
[Figure 18 panels; axes: Probability Cell and DTS Cell. Recovered panel titles: EGF, IFNG, IL1A, IL1B, IL6, and IL8 Distinguished by Probability & DTS For All Clinical Types.]


Full Text

PAGE 1

DEVELOPMENT OF A COMPUTATIONAL MODEL USING CYTOKINE BIOMARKERS FOR IDENTIFYING LUNG DISEASE TYPES
by
DAVID F. GNABASIK
B.A., University of Chicago, 1979
M.S., University of Colorado, 2002
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Computer Science and Information Systems
2017

PAGE 2

2017 DAVID F. GNABASIK ALL RIGHTS RESERVED

PAGE 3

This thesis for the Doctor of Philosophy degree by David F. Gnabasik has been approved for the Computer Science and Information Systems Program by
Tom Altman, Committee Chair
Gita Alaghband, Advisor
Stephen Billups
Michael Mannino
Ilkyeun Ra
May 13, 2017

PAGE 4

Gnabasik, David F. (Ph.D., Computer Science and Information Systems)
Development of a Computational Model Using Cytokine Biomarkers for Identifying Lung Disease Types
Thesis directed by Professor Gita Alaghband

ABSTRACT
This dissertation presents a computational model that distinguishes among 7 lung disease clinical types: healthy non-smokers, smokers diagnosed with and without Chronic Obstructive Pulmonary Disease (COPD), adenocarcinoma, squamous cell carcinoma, cystic fibrosis, and acute lung injury. The model reliably assigns an individual patient into one of these types from the conditional relationships of noisy, incomplete, and widely variable cytokine biomarker concentration measurements taken from blood serum. Panels of 12 cytokine measurements precisely classify both known and unknown patients into one of these distinct clinical types. The approach assigns patients to known clinical types from the conditional relationships of noisy, incomplete, and widely variable protein concentration measurements, including outliers. A discrete topological structure (DTS) formulation induces discrete state variables from concentration measurements through a binning algorithm that exposes the conditional relationships and dependencies among the concentration data. A unique application of an exclusive-or (XOR) operation on the state space extracts the patterns identifying the set of distinctive features for each clinical type.

The computational model builds a discrete topological structure from a baseline data set, and is developed using several novel schemes designed specifically for this analysis.
• All biomarker concentration values are placed into discrete bins forming bin states from multiple biomarkers measured from individual samples. The result is a multidimensional space representing a characteristic set of states for each clinical type population.

PAGE 5

• The model builds conditional probabilistic relationships between bin states that successfully separate and distinguish bin states belonging to different clinical types.
• The model incorporates new clinical types and distinguishes the set of targeted biomarker variables that uniquely characterize the new type.

The computational model has been validated by processing a random 10% of the data that had been initially excluded and by processing a separate data set extracted from the literature.

The form and content of this abstract are approved. I recommend its publication.
Approved: Gita Alaghband

PAGE 6

ACKNOWLEDGEMENTS
I deeply thank my advisor, Dr. Gita Alaghband, for her intellectual support and encouragement during this journey. We thank Dr. Mark W. Duncan, Ph.D., of the University of Colorado Anschutz Medical Campus, School of Medicine, for providing us with 5 of the 7 original unpublished data sets. We thank Dr. Paul A. Bunn Jr., MD, of the University of Colorado Anschutz Medical Campus, School of Medicine, for providing us with 2 of the 7 original unpublished data sets. We thank Dr. Brid Ryan and Dr. Curtis C. Harris of NIH/NCI for the NCI Maryland Cancer, NCI Maryland Control, NCI PLCOC Cancer, and NCI PLCOC Control data sets.
[ Brid M. Ryan, Ph.D., M.P.H., Center for Cancer Research, National Cancer Institute, Building 37, Room 3060C, Bethesda, MD 20892, Brid_Ryan@nih.gov https://ccr.cancer.gov/laboratory of human carcinogenesis/brid m ryan ]
[ Curtis C. Harris, M.D., Center for Cancer Research, National Cancer Institute, Building 37, Room 3068 A, Bethesda, MD 20892, curtis_harris@nih.gov https://ccr.cancer.gov/Laboratory of Human Carcinogenesis/curtis c harris ]

PAGE 7

TABLE OF CONTENTS
LIST OF FIGURES ... ix
LIST OF TABLES ... xi
LIST OF ABBREVIATIONS ... xii
CHAPTER I: INTRODUCTION ... 1
1.1 Motivation ... 1
1.2 Outline of the Thesis ... 4
CHAPTER II: RELATED WORK ... 5
2.1 Modeling Lung Disease ... 5
2.2 Topological Approaches ... 8
2.3 Protein Interaction Networks ... 10
CHAPTER III: DATA SOURCES and STATISTICAL ANALYSIS ... 13
3.1 Cytokine Biomarkers and Their Relevance ... 13
3.2 Data Sources ... 14
3.3 Statistical Analysis ... 16
3.4 Statistical Inference ... 21
3.5 Defining Variables in High Dimensional Configuration and State Space ... 22
3.6 Statistical Assumptions ... 23
3.7 Handling Variation Through Model Constraints ... 24
CHAPTER IV: COMPUTATIONAL MODEL ... 33
4.1 Defining the Model ... 33
4.2 Motivation of the Discrete Topological Structure ... 34
4.3 Bin Size and Bin States ... 36
4.4 Developing the Computational Model ... 43
CHAPTER V: RESULTS and VALIDATION STUDIES ... 102
5.1 Distinguishing Biomarkers and the Measure of Similarity ... 102
5.2 10% Study Validation ... 102

PAGE 8

5.3 Healthy Serum Validation ... 103
5.4 Population Variability and Conditional Structure ... 106
5.5 Strategies for Modeling Measurement Variation ... 106
5.6 NCI Study Validation ... 107
CHAPTER VI: TOPOLOGICAL ANALYSIS ... 109
6.1 Motivation for a Topological Analysis ... 109
6.2 Topological Homology and Betti Numbers ... 109
6.3 Computing Topological Connectedness ... 111
6.4 Topological Results ... 112
CHAPTER VII: CONCLUSION ... 117
REFERENCES ... 119
APPENDIX A Percent Missing Values per Clinical Type Biomarker ... 125
APPENDIX B The Chemical Stochastic Master Equation ... 130
APPENDIX C Definitions and Characteristics of Lung Disease ... 132
APPENDIX D Test Sensitivity and Specificity ... 134
APPENDIX E Description of the Binning Algorithm as a Learning Algorithm ... 135

PAGE 9

LIST OF FIGURES
Figure 1: DTS values per biomarker for all patients of clinical type Adenocarcinoma. ... 3
Figure 2: Mean concentrations and standard error bars for Never Smokers & Adenocarcinoma. ... 17
Figure 3 (a-l): Mean concentration values for each biomarker for every clinical type. ... 21
Figure 4: Number of biomarker measurements for all baseline samples. ... 27
Figure 5 (a-i): All clinical type concentration values. ... 32
Figure 6: Ratio of the number of empty bins to the total number of bins. ... 36
Figure 7: Assigning Adenocarcinoma concentration [c] values to bin states. ... 49
Figure 8 (a-i): Complete set of 12 biomarker concentration values for every clinical type. ... 53
Figure 9 (a-i): All clinical type biomarker DTS values. ... 58
Figure 10: Point cloud for the baseline clinical types. ... 59
Figure 11: Plot of IFNG only bin state Probability and DTS values for Never Smokers. ... 63
Figure 12: Plot of IFNG bin state Probability and DTS values for every clinical type. ... 64
Figure 13 (a-l): Bin state Probability and DTS values for every clinical type per biomarker. ... 70
Figure 14 (a): Partial integer matrix for IFNG values of Adenocarcinoma. ... 72
Figure 14 (b): Partial integer matrix for biomarker IFNG for all 7 clinical types. ... 72
Figure 15 (a): Partial integer matrix for biomarker EGF for all 7 clinical types. ... 72
Figure 15 (b): Partial integer matrix for biomarker IFNG for all 7 clinical types. ... 73
Figure 15 (c): Partial integer matrix for biomarker IL1A for all 7 clinical types. ... 73
Figure 15 (d): Partial integer matrix for biomarker IL1B for all 7 clinical types. ... 74
Figure 15 (e): Partial integer matrix for biomarker IL2 for all 7 clinical types. ... 74
Figure 15 (f): Partial integer matrix for biomarker IL4 for all 7 clinical types. ... 75
Figure 15 (g): Partial integer matrix for biomarker IL6 for all 7 clinical types. ... 75

PAGE 10

Figure 15 (h): Partial integer matrix for biomarker IL8 for all 7 clinical types. ... 76
Figure 15 (i): Partial integer matrix for biomarker IL10 for all 7 clinical types. ... 76
Figure 15 (j): Partial integer matrix for biomarker MCP1 for all 7 clinical types. ... 77
Figure 15 (k): Partial integer matrix for biomarker TNFA for all 7 clinical types. ... 77
Figure 15 (l): Partial integer matrix for biomarker VEGF for all 7 clinical types. ... 78
Figure 16 (a-h): Every clinical type distinguished by their Probability and DTS states. ... 82
Figure 17: All baseline clinical types distinguished by Probability and DTS states. ... 83
Figure 18 (a-l): Each biomarker distinguished by Probability and DTS states within all baseline clinical types. ... 89
Figure 19: All baseline biomarkers distinguished by Probability and DTS states within all baseline clinical types. ... 90
Figure 20 (a-b): Best fit vectors for Adenocarcinoma [EGF, IFNG]. ... 98
[...] 0 values for each clinical type. ... 113

PAGE 11

LIST OF TABLES
Table 1: All clinical type biomarker bin sizes in pg/ml. ... 37
Table 2: Biomarker Concentration Averages and Standard Deviation Values per Clinical Type. ... 39
Table 3: Biomarker Data Characteristics. ... 40
Table 4: Number of points for biomarkers; their [...] and [...] values for the first 6 bins. ... 49
Table 5: Clinical type identifiers used to distinguish bin states. ... 71
Table 6: Distinguishing biomarkers per clinical type in Probability and DTS dimensions. ... 91
Table 7: Number of distinguishing biomarker states n per clinical type C t. ... 91
Table 8: Counts and percentages of permissible and non-permissible bin states. ... 93
Table 9: Using cosine similarity to assign baseline patients to a clinical type. ... 99
Table 10: Distinguishing biomarkers of excluded patients in Probability, DTS dimensions. ... 103
Table 11: Data sources for the NIH Maryland and PLCOC Studies. ... 108
Table 12: Topological properties for each clinical type. ... 113

PAGE 12

LIST OF ABBREVIATIONS
Symbol        Description
[c]           set of concentration data values for B i
[...] i       concentration means
B i           [...]
C t           [...]
D r           set of observed concentrations per combination
DTS           discrete topological structure: a model of how protein concentrations change relative to one another
G i           number of concentration values grouped in each bin
H i           threshold in binning algorithm
log e         natural logarithm
M C           population DTS matrix
M C cond      population conditional probability matrix per clinical type
M C joint     population joint probability matrix per clinical type
M C marg      population marginal probability matrix per clinical type
M z           sample DTS matrix
M [...]       interaction matrix
M [...]       interaction matrix
N r           number of bins for a combination
(O hat)       the set of occupied bin states per clinical type biomarker combination
(O bar)       the set of distinguishing bin states per clinical type biomarker combination
P (X|Y)       conditional distribution of X given Y
P B,N         set of population probabilities per combination per bin N i
P Ct (B i)    population probability for all patients
pg/ml         picograms (10^-12 grams) per milliliter
Q             biomarkers per patient panel (12)
R             total number of clinical type biomarker combinations
W r           bin size per combination
X, Y          bin states considered as random variables
XOR           the exclusive-or logical operation that outputs true only when inputs differ
Z t           set of Z patient samples in C t
[...] i       concentration standard deviations

PAGE 13

CHAPTER I: INTRODUCTION
1.1 Motivation
Proteomics is the investigation of the entire protein content or the protein comple[...] proteome [63]. Specific biomarker proteins act as measurable indicators of biological disease [52], including lung disease. The overall goal is to ensure that people with lung diseases are accurately and cost-effectively diagnosed and then treated accordingly. The problem is acute, as the American Lung Association states that an estimated 158,080 Americans are expected to die from lung disease in 2016 [74]. The seven clinical types analyzed here account for some of the most frequent forms of lung disease [73]. Respiratory diseases are of multiple origin, and the selected clinical types cover a wide spectrum of suspected causes. Diagnosis and treatment are also problematic, given that Guarascio et al. declare that [...] [75]. [...] et al. [76] [...] by the patients, lack of knowledge and underuse of spirometry by the Primary Care providers the disease remains under[...]. See Appendix C for a brief description of lung disease characteristics.

Reliably measuring proteomic biomarker concentrations is difficult due to technical and biological variation, their wide dynamic range of concentrations, and numerous post-translational modifications [61] [77]. Despite these variations, we developed a computational model that reliably distinguishes among various clinically diagnosed lung disease types. The model hypothesizes that biomarker interactivity induces a distinctive protein concentration topology for each clinical

PAGE 14

type, and that certain topological patterns revealed by these host response proteins remain characteristically invariant. The model selects the unique set of biomarkers, given a small number of biological and statistical assumptions, whose protein host response topology corresponds to a [...], thereby simplifying the high-dimensional biomarker concentration space so that certain topological patterns of lung disease are extracted.

We propose a new computational model that: 1) distinguishes among multiple clinical types, 2) accurately assigns patients to these known types, and 3) distinguishes the unique set of biomarkers that characterize each clinical type. The model represents a space of distributions built upon computable discrete states that drastically reduces the number of plausible hypotheses despite significant data variation.

We were initially motivated to explore this approach by Figure 1, which plots the DTS values (explained in §4.4:A) for every patient diagnosed with Adenocarcinoma for each of the 12 biomarkers. For the sake of visualization, a [...] 12 measurements. Note how DTS values can be less than zero. We decided to investigate whether these gaps or holes uniquely identified each clinical type.

Formulating proteomic concentration data as multidimensional discrete bin states offers several advantages:
1) It abstracts away some of the variation in the data as given, while still allowing direct comparisons among multiple cytokine biomarker variables.
2) It puts to use the cytokine biomarkers as effective host response signals indicative of disease state.
3) It allows for the systematic comparison of the conditional relationships among the chosen cytokine biomarkers.

PAGE 15

Figure 1: DTS values per biomarker for all patients of clinical type Adenocarcinoma.

PAGE 16

1.2 Outline of the Thesis
Chapter I introduces the problem being studied and what initially motivated this analysis.

Chapter II presents the related research in this area, describing both the topological and non-topological approaches, including the use of protein interaction networks.

Chapter III describes the baseline dataset used in the analysis, the suitability of cytokine biomarkers, and their relevance. The inability of standard statistical analysis and inference to make accurate assignments to clinical types using biomarker concentration measurements is discussed. We discuss the importance of statistical modeling, in particular the definition of the random variables, the statistical assumptions, and how variation is handled through model constraints.

Chapter IV defines the computational model, the motivation of the discrete topological structure (DTS), and its manifestation through bin states. The computational model is developed through a series of algorithms and equations.

Chapter V identifies the biomarkers that distinguish the clinical type populations and classify individual samples. Three separate validation studies are presented.

Chapter VI discusses the potential of a formal topological analysis. Whereas the binning algorithm exposes the conditional relationships and dependencies among the concentration data by discretizing the concentration values of paired biomarkers into static DTS bin states, we also ask whether it is possible to structure the interactions of protein biomarker concentration values by computing the connectedness and disconnectedness of these conditional values in concentration topological space. This approach would allow for the dynamic visualization of biomarker concentration values in multidimensional space over time along a patient trajectory.

Chapter VII concludes the thesis.

PAGE 17

CHAPTER II: RELATED WORK
2.1 Modeling Lung Disease
We consider the following established and recent approaches to computationally modeling lung disease: cellular tumor cell morphology, multiscale multi-physics "omics" data mining, Bayesian approaches, equation-based and agent-based models, protein interaction network analysis, and topological analysis.

Tumor cell morphology on light microscopy has traditionally been used to predict disease behavior and prognosis, though recent techniques build upon stem cell research [36]. Cellular morphology often drives in vivo animal models of lung disease, but lung disease as modeled in vivo is often not initiated by the same events that cause the disease in humans [57]. Newer approaches to modeling lung disease now include mining large genomic, transcriptomic, and proteomic databases, requiring advanced computational techniques to reveal relevant patterns in these data.

A multiscale multi-physics modeling approach implies that "omics" data from different sources (genomics, proteomics, metabolomics) can be integrated through multiple computational informatics techniques [37]. For example, to diagnose and assess COPD, Burrowes et al. describe a multiscale computational model to investigate the relationship between computerized tomography (CT) and magnetic resonance imaging (MRI) measurements and disease severity [34, 62]. Their physics-based

PAGE 18

approach to relating structure and function recognizes the non-linear relationships operating among respiratory components. They acknowledge that the modeling process requires multiple simplifying assumptions of the lung system and that current models predict lung function at a specific point in time. Burrowes also states [...] [34, page 6 of 8].

Gefen [35] provides several examples of equation-based and agent-based (ABM) computational models. Both of these simulation approaches specify the rules of behavior of individual entities and their interactions within a population. The model is a set of equations in equation-based modeling, whereas ABM models assert that the various non-linear and adaptive interactions between the entities are too complicated to be represented by analytical expressions. An ABM typically has the following components: agents (such as immune cells, bacteria, or cytokines); the environment where agents reside (such as a two-dimensional grid representing a section of lung tissue); probabilistic rules that govern the dynamics of agents, including movement, actions, and interactions among agents and between agents and environment; and time scales on which the rules are executed.

Dick et al. [38] develop a multiscale model of acute inflammatory disease that describes baro-, chemo-, and cytokine reflex control of cardio-pulmonary function. They developed algorithms that quantify the non-linear characteristics of variability in some biological signals and [...] primary challenge lies in integrating a large body of data into a cohesive whole that can guide [...].

For an example of a Bayesian approach, Ostroff et al. [26] published a comprehensive clinical biomarker study of non-small cell lung cancer (NSCLC) to discover 44 possible protein

PAGE 19

biomarkers in 1326 archived serum samples from four independent studies. (VEGF is the single shared biomarker between their study and our experiments.) They developed a panel using 12 of their protein biomarkers that distinguished NSCLC from controls with 89% sensitivity and 83% specificity. This work is encouraging because of the generality of their aptamer-based proteomic technology, their specific evidence against over-fitting, and the fact that all their samples were confirmed by expert pathology review. They used a naive Bayes model (which assumes that the presence of a feature in a class is unrelated to, or strongly independent of, the presence of other features) for constructing classifiers, a log-normal distribution to model their data, and a log-normal parametric model to capture the protein distributions for a given clinical state. However, their results do not yet include independent validation studies.

These complex physics-based, equation-based, or agent-based modeling approaches often require years of development, and they make many simplifying biological assumptions for the sake of reproducible and automated predictions. Instead, our model assigns patients to known clinical types from the marginal and conditional relationships of biomarker concentration measurements. Our approach selects the unique set of biomarkers, given a small number of biological and statistical assumptions (see §3.6 and §3.7), that induces the protein host response topology corresponding [...]. The resulting computational model represents a space of distributions built upon computable discrete states that drastically reduces the scope of the possible space of plausible hypotheses despite the significant data variation and extreme heterogeneity of lung disease [81].
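To make the naive Bayes idea attributed above to Ostroff et al. concrete, the following is a minimal sketch under stated assumptions, not their implementation and not part of our model. It assumes strictly positive concentrations, treats each feature as independent and log-normally distributed within a class, and uses made-up toy numbers; the names `fit` and `predict` are hypothetical.

```python
import math
from collections import defaultdict


def fit(samples, labels):
    """Estimate a log prior and per-feature (mean, std) of log-values per class."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append([math.log(v) for v in x])
    model = {}
    for y, rows in by_class.items():
        params = []
        for column in zip(*rows):
            mu = sum(column) / len(column)
            var = sum((v - mu) ** 2 for v in column) / len(column)
            # Small floor keeps the standard deviation strictly positive.
            params.append((mu, math.sqrt(var) + 1e-6))
        model[y] = (math.log(len(rows) / len(samples)), params)
    return model


def predict(model, x):
    """Return the class with the highest log posterior score.

    Terms shared by every class (the -log(x) Jacobian of the log-normal
    density and the 2*pi constant) are omitted; they do not change the argmax.
    """
    log_x = [math.log(v) for v in x]

    def score(y):
        log_prior, params = model[y]
        total = log_prior
        for v, (mu, sd) in zip(log_x, params):
            total += -math.log(sd) - 0.5 * ((v - mu) / sd) ** 2
        return total

    return max(model, key=score)


# Toy two-feature data for illustration only; not measurements from any study.
toy_samples = [[1.0, 10.0], [1.2, 12.0], [10.0, 1.0], [12.0, 1.1]]
toy_labels = ["control", "control", "case", "case"]
toy_model = fit(toy_samples, toy_labels)
```

The independence assumption is what lets the per-feature log-likelihoods simply add, which is the contrast with the conditional relationships between biomarkers that our model relies on.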

PAGE 20

2.2 Topological Approaches
Carlsson [20] discusses how topological data analysis (TDA) can extract information from high-dimensional, noisy, incomplete but massive biological data sets using techniques from topology. TDA is the study of the shape of data, and it provides a general framework to analyze such data that is insensitive to the measurement metric while providing both dimension reduction and robustness to noise. Carlsson enumerates the analytical obstacles involved in such research as including: the requirement that large-scale qualitative information must first be extracted to determine the specific direction of subsequent research; the fact that biological distance metrics are often difficult to theoretically justify; the unnaturalness, even meaninglessness, of using Euclidean metric coordinates; and the greater value of categorical data summaries over individual parameter choices.

Carlsson also leads the Applied and Computational Algebraic Topology (CompTop) group at Stanford University (http://comptop.stanford.edu/), which seeks to develop flexible topological methods for the analysis of data that is difficult to analyze using classical linear methods. Their Javaplex library [59] computes Betti numbers (a measure used to distinguish topological spaces based on the connectivity of n-dimensional simplicial complexes) and persistent homology intervals (a method for computing topological features of a space at different spatial resolutions). These homology intervals describe how the homology of a topological structure changes over time. As a clinical example, Carlsson topologically analyzed the differences among healthy, pre-diabetic, and diabetic populations for two variables, insulin response and glucose level,

PAGE 21

using the Miller-Reaven Diabetes dataset [27]. The two variables effectively distinguish among the clinical types in the 3-space using a typical set of 1-dimensional filters such as density estimators, measures of data depth, eigenfunctions of the graph Laplacian, and principal coordinates analysis or multidimensional scaling coordinates.

The Computational Homology Group (CHomP) at Rutgers University [ http://chomp.rutgers.edu/ ] performs protein data analysis using persistent homology to extract geometrical and topological information from protein data available in the Protein Data Bank, the worldwide repository of information about the 3-dimensional structures of proteins and nucleic acids [ http://www.wwpdb.org/ ]. The CHomP software computes certain topological invariants of protein molecules such as the 112M Sperm Whale Myoglobin D122n N-propyl Isocyanide protein [24]. The focus in this work is to locate valid protein docking conformations, a subset of the protein folding problem. Edelsbrunner defines many of the topological techniques and terms used in this 3-dimensional type of analysis [23].

Blinder et al. [25] propose the use of functional topology to identify several characteristics of biological computing networks in terms of form-function fingerprints. These network structures are represented by 3 matrices:
1. a topological connectivity matrix where each row is the shortest topological path lengths of a node with all other nodes;
2. a topological correlation matrix where an element (i, j) represents the correlation between the topological connectivity of nodes (i) and (j); and
3. a weighted graph matrix that represents the strengths of the connections.
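The first two matrices in the list above can be sketched as follows. This is a minimal illustration and not the method of Blinder et al.: it assumes an unweighted, connected, undirected graph given as an adjacency list, measures path length in hops with breadth-first search, and uses Pearson correlation between whole rows (including the diagonal entry), all of which are assumptions. The function names are hypothetical, and the weighted graph matrix is omitted.

```python
from collections import deque
import math


def connectivity_matrix(adjacency):
    """Row i holds the shortest hop counts from node i to every node."""
    n = len(adjacency)
    matrix = []
    for source in range(n):
        dist = [None] * n
        dist[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if dist[neighbor] is None:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        if None in dist:
            raise ValueError("graph must be connected")
        matrix.append(dist)
    return matrix


def pearson(u, v):
    """Pearson correlation of two equal-length rows; 0.0 if either is constant."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm = math.sqrt(
        sum((a - mean_u) ** 2 for a in u) * sum((b - mean_v) ** 2 for b in v)
    )
    return cov / norm if norm else 0.0


def correlation_matrix(connectivity):
    """Element (i, j) correlates the connectivity rows of nodes i and j."""
    return [[pearson(ri, rj) for rj in connectivity] for ri in connectivity]


# Toy 4-node path graph 0-1-2-3, for illustration only.
path_graph = [[1], [0, 2], [1, 3], [2]]
conn = connectivity_matrix(path_graph)
corr = correlation_matrix(conn)
```

In the toy path graph the two end nodes see the network from opposite sides, so their connectivity rows are perfectly anti-correlated, while each node correlates perfectly with itself.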

PAGE 22

Their functional holography analysis distinguishes neuronal networks that standard statistical determinants of network structure, such as averaged path length, did not.

2.3 Protein Interaction Networks
Some biomarker proteins form protein-protein interaction (PPI) networks, both permanent and transient, distinct from their function as host response signals [50]. These PPI networks often produce high specificity due to the physical contacts between two or more protein molecules because of biochemical events steered by electrostatic forces. Many PPI networks manifest the associations between molecular chains that occur in a cell or in a living organism in a specific biomolecular context [64]. That is, by providing a network-level map of the cell, PPI networks could identify the biomarkers functioning in the same disease pathway, presumably those with an effective disease-predictive ability [60]. Protein interactions also cover a spectrum of order and function from weakly random to highly structured because protein function is aggregated from multiple sources. Kumar et al. [...] by the static folded three-dimensional structure but also by the distribution and redistributions of the populations of its conformational and dynamic sub[...] [10].

Network analysis quantifies PPI networks by various metrics including the number of nodes (N), the number of links (L), the average distance in the network (av d), the diameter of the network (the maximum of d), the number of components, the average degree (av D) and the clustering coefficient (CC), degree centrality (D) and degree distribution, shortest distance (d) between a focal node and one other node, link density, averaged shortest distance between

PAGE 23

pairs of nodes, and clique density, among others [65]. Topological importance ($TI^n$) is a general measure of centrality, focusing on how effects originating from a focal node can spread throughout the network. The topological overlap ($TO^{n,t}$) measure is derived from $TI^n$. Given a defined threshold level ($t$), it is possible to determine which are the stronger (over the threshold) and which are the weaker (below the threshold) interactors of a network node $i$ [65]. Erten et al. use their Vavien algorithm to analyze the topology of PPI networks to infer functional information for the sake of disease association, assuming these networks are organized into recurrent schemes that underlie the mechanisms of cooperation among different proteins and that proteins tend to interact with other proteins of similar function. They show that proteins associated with similar diseases exhibit similar topological characteristics in their PPI networks, and that the Vavien algorithm outperforms existing information-flow-based models in terms of ranking the true disease gene highest among other candidate genes [44].

Protein-protein interaction behavior is determined by protein concentrations that represent a balance between functional and structural interactions [51]. An interaction can produce a change in the relative concentration gradient of either or both of the interacting proteins. Crucially, the interaction between two proteins depends not only on their physical binding affinity, but also on their relative concentrations, to the extent that the control of protein abundances becomes important in the functional operation and evolution of natural protein-protein interactions [3]. Even given the wide variation of protein concentrations, their relative abundances may still be under tight evolutionary control [2]. That is, measured protein concentration values are neither simple nor static, but they may be conditionally bounded ratios that dynamically fluctuate among a preferred or characteristic set of value ranges. The

PAGE 24

nonlinear complexity of the concentration patterns produced by these ratios explains why there is not a single identifiable protein concentration state for a clinical type. For example, Heo, Maslov, and Shakhnovich [2] address the question of how living cells achieve a sufficient quantity of functional protein complexes while minimizing their promiscuous non-functional interactions. They modeled the topology of a protein interaction network to shape these protein abundances and the strengths of their functional and nonspecific interactions. They found a positive relationship between evolved physical-chemical properties of protein interactions and their abundances due to a frustration effect.

However, the PPI approach to modeling depends upon a complete knowledge of the interactions among a set of proteins. Network analysis can quantify the functioning and relative importance of proteins in cell function, but high-throughput experimental detection methods for PPI (such as mass spectrometry) often generate both high false-positive and high false-negative rates [66]. We propose an approach that combines the accuracy and precision of protein abundance measured using established 2-D PAGE gel electrophoresis technology with a novel analysis of the topological and conditional relationships and dependencies derived from these concentration data.

PAGE 25

CHAPTER III: DATA SOURCES and STATISTICAL ANALYSIS

3.1 Cytokine Biomarkers and Their Relevance

We wish to show that protein concentration distributions contain information about their conditional interdependencies and whether those interdependencies are characteristically maintained and conserved. That is, the biomarkers to collect must act as disease state signals due to the existence and modulating strength of their relative and mutual effects upon each other. Cytokine proteins satisfy these data source requirements. Cytokines are a broad category of small proteins acting as immunomodulating agents; they elicit regulatory function in immunologic pathways that are important in cell signaling. They are secreted by components of the adaptive immune system, and they act as effectors or modulators of lung tissue inflammatory response [45], to the extent that Biancotto claims that the level and type of cytokine production has become critical in distinguishing physiologic fr[...].

The 12 baseline protein biomarkers {EGF, IFNG, IL1A, IL1B, IL2, IL4, IL6, IL8, IL10, MCP1, TNFA, VEGF} (EGF: epidermal growth factor; IFN: interferon; IL: interleukin; MCP: monocyte chemoattractant protein; TNF: tumor necrosis factor; VEGF: vascular endothelial growth factor) are chosen because of their known sensitivity as host response signals to various lung diseases [4], so that concentrations of circulating cytokines in serum may be associated with lung disease survival [5]. Evidence also suggests that biomarkers IL6 and IL8 are specifically associated with an increased risk of lung disease [6, 7].

Host response sensitivity is not the only possible criterion for selecting a biomarker panel to reflect disease state. Every biomarker exists within a protein family, a group of evolutionarily related proteins that descend from a common ancestor, whose members are

PAGE 26

characterized by the rate and amount of product produced, the number of possible stable energy configurations, the speed of feedback mechanisms, similar three-dimensional structures, sequence similarity, and other characteristics [58]. Other possible criteria include biomarkers that are known to be functionally coupled by stoichiometric feedback mechanisms, derived from the same biochemical family, or co-located spatially or temporally. The strategy for selecting an effective panel of biomarkers for a specific disease or disease family requires an educated guess initially, most likely based upon known clinical disease correlations from the literature. Because multiple justifications are available for the biomarker panel selection, depending upon the hypothesis under investigation, the modeling approach allows for the systematic and automated substitution and comparison among various biomarker variables.

3.2 Data Sources

In these experiments, the model is constructed using host response cytokine biomarker concentration data from 343 patients, given to us in standard units of picograms ($10^{-12}$ grams) per milliliter (pg/ml). Other data sets obtained from the literature are standardized to these units. The baseline data set includes 7 clinical types from which the 12 protein biomarkers are measured. The number of patients per clinical type ranges from 24 to 56 (see Table 3). The $Q = 12$ baseline biomarkers {EGF, IFNG, IL1A, IL1B, IL2, IL4, IL6, IL8, IL10, MCP1, TNFA, VEGF} are chosen because of their known or suspected relationship to lung disease. Two specimens are collected from each patient at the same time, and these two specimens are averaged over each biomarker to provide a single biomarker panel of 12 measurements per patient, except for cases of missing data. Each of the 343 patients had been expertly diagnosed as

PAGE 27

belonging to only one of 7 lung-related clinical types $C_t$: adenocarcinoma, squamous cell carcinoma, never smokers, smokers with chronic obstructive pulmonary disease (COPD), smokers without COPD, acute lung injury, or cystic fibrosis [1]. The original experiments considered never smokers, smokers with COPD, and smokers without COPD as the control groups. We sequestered a random 10% of the given data for subsequent model validation, leaving 310 patients. There are 659 missing biomarker measurements out of a possible 310 × 12 = 3720 (17.7%), for a total of 3061 measured values (82.3%).

Precise protein assay techniques using 2-D PAGE gel electrophoresis [53] are used to consistently collect homogeneous blood serum specimens. 2-D PAGE gel electrophoresis is a precise and established method for the separation of proteins in 2 dimensions, where isoelectric focusing is used to separate proteins by their charge (pI) in the first dimension and SDS-PAGE (sodium dodecyl sulfate polyacrylamide gel) electrophoresis is used to separate proteins by their molecular weight in the second dimension. The technique is often used for the isolation of proteins for further characterization by mass spectrometry, a high-throughput analytical technique that ionizes chemical species and sorts the ions based on their mass-to-charge ratio.

The first five data sets are all from the same unpublished set of experiments [Acknowledgement A] conducted at laboratories at the University of Colorado Anschutz Medical Campus. The last two data sets, cystic fibrosis and acute lung injury, originate from different experiments, though the wet lab protocols and analytics are performed in the same way as the first five data sets [Acknowledgement B]. To minimize batch effects, both laboratories incorporated a standard sample in each electrophoresis gel, which was subsequently subtracted during analysis, and both used the Cy2 channel from each gel to

PAGE 28

normalize spot intensities and for automated matching between gels. All patients underwent expert pathology review and are histologically assigned to one and only one clinical type, provided with the original data sets. There are many more data values than the number of biomarkers, which avoids the issue of overfitting. We are working with precise concentrations of secreted proteins expressed in the blood. This sampling strategy is justified because it is non-invasive, generates a large set of data with quantitative accuracy involving a small number of variables (the 12 biomarkers), and works with a homogeneous composition indicative of the entire organism.

3.3 Statistical Analysis

A standard statistical analysis [26] to distinguish among clinical types calculates the concentration population mean $\mu_i$ and standard deviation $\sigma_i$ values for each clinical type biomarker to reveal the differences among the various clinical types and to see [...]. For example, the bar chart in Figure 2 plots the means and standard error bars for Never Smokers and Adenocarcinoma, and Figure 3 (a-l) plots the mean concentration value of each of the 12 biomarkers for every clinical type. The small error bars in these figures suggest these data were produced precisely and with quantitative accuracy, though it is not known why the mean values for Acute Lung Injury are the highest for every biomarker except EGF and VEGF. Table 2 tabulates the biomarker concentration averages and standard deviations per clinical type. The concentration $\mu$ and $\sigma$ values do not unambiguously assign patients to the clinical types for the 12 given biomarkers. Data averages do not provide enough information to reliably classify individual patient data, due to the variation of individual patient

PAGE 29

measurements. §4.4.I describes several statistical averaging attempts using established methods, such as the naive Bayes classifier, Markov chain analysis, and cosine similarity. Instead, subsequent work focused on developing a computational model that effectively processes individual concentration values and not population averages.

Figure 2: Mean concentrations and standard error bars for Never Smokers & Adenocarcinoma.
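The per-type summary statistics discussed in Section 3.3 can be computed as in the following minimal Python sketch. The IL6 values are hypothetical and chosen only for illustration, and the use of the sample standard deviation (ddof=1) is an assumption; the source does not state which estimator was used.

```python
import numpy as np

# Hypothetical IL6 concentrations (pg/ml) for two clinical types; NaN marks a missing value.
il6 = {
    "Never Smokers": np.array([2.0, 4.0, np.nan, 6.0]),
    "Adenocarcinoma": np.array([10.0, 30.0, 50.0, np.nan]),
}

def summarize(values: np.ndarray) -> tuple[float, float, float]:
    """Return (mean, sample standard deviation, standard error), ignoring NaNs."""
    n = int(np.count_nonzero(~np.isnan(values)))
    mean = float(np.nanmean(values))
    sd = float(np.nanstd(values, ddof=1))
    return mean, sd, float(sd / np.sqrt(n))

summary = {name: summarize(v) for name, v in il6.items()}
```

Missing values are skipped rather than imputed, which matches the stated policy that no data was interpolated to fill gaps.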

PAGE 30


PAGE 31


PAGE 32


PAGE 33

Figure 3 (a-l): Mean concentration values for each biomarker for every clinical type.

3.4 Statistical Inference

A model-based approach is suitable for statistical inference in these experiments because the biological causes that produce the relative biomarker concentrations are generally unknown; it is not precisely known why a certain panel of biomarkers clearly distinguishes among the various clinical types. A fully realized model-based approach computes all possible alternatives within the high-dimensional biomarker concentration space and then selects the alternative that effectively assigns patients to their known clinical types. This strategy is different from a design-based approach to inference, where the model

PAGE 34

is generally known beforehand and where it is important to ensure that enough sample data are selected randomly from known populations. The model-based approach requires a precise identification and definition of the random variables. Given the observation that these biomarker measurements reveal characteristic gaps or holes not explainable as missing data, a promising way to analyze these gaps is to group or bin concentration data, where each bin is considered as a discrete random variable. Individual bins can then be compared to every other bin in every other biomarker to reveal their interactive or conditional structure, as fully explained in §4.4:A.

3.5 Defining Variables in High-Dimensional Configuration and State Space

Much proteomics research operates within a high-dimensional configuration space, also called parametric space, where a panel of measurements exists in the number of dimensions defined by the number of measured protein signals (e.g., 12). The notion of configuration space has been used in molecular biology [69] to represent the space of all possible states of a system characterized by many degrees of freedom. However, the non-intuitiveness of high dimensions poses difficult analytical issues. The scalability of measures in Euclidean spaces is generally poor as dimensionality increases, and data can become uniformly distributed [28, 29]. Given a set of targeted biomarker variables thought to characterize the overall state of a living organism, the purpose of a configuration space is to structure the possible states of the variables that represent the state of the organism.

To structure the configuration space, we define two kinds of random variables. The first definition declares the observed measurements of the 12 biomarkers as variables assigned to the configuration space. The second definition declares the generated bin states as variables assigned to the computational state space, which is larger

PAGE 35

than the configuration space because of the addition of non-permissible bin states. When we claim to distinguish the set of targeted variables that uniquely characterize each clinical type, this refers to each biomarker. Whereas the configuration space is the space of all possible biomarker measurement values, the state space is the space of probabilities acting upon the configuration space.

3.6 Statistical Assumptions

Several model-based statistical assumptions must also be made explicit.

Distributional assumptions. When a statistical model employs terms involving random errors, assumptions are made about the probability distribution of these errors. The use of the same data collection and processing protocols over all the biomarker samples assumes a Gaussian distribution of errors, so long as the sensitivity and specificity of the various biomarker measurements are roughly the same (see Appendix D). But when a biomarker is difficult to measure, this often produces incomplete data. For example, biomarker IL1A has the lowest concentration range of all the given biomarkers, and quite a few patient values are missing. We included all biomarker data values and assumed a Gaussian distribution of errors in the measurement values.

Structural assumptions. Statistical relationships between variables are often modelled by equating one variable to a function of another, or several others, plus a random error. Models often involve making a structural assumption about the form of this functional relationship, such as linear or multiple regression. This can sometimes be generalized to models involving relationships among multiple underlying unobserved latent variables. In our experiments, the multiple bins of concentration

PAGE 36

values serve as jointly distributed random variables. But as labels, bin 3 is not related to bin 4 in a sense of predication or lesser-than ordering. There are simply bins that are occupied, some more than others, and bins that are empty.

Cross-variation assumptions. Cross-variation assumptions involve the joint probability distributions (JPD) of either the measurements themselves or the random errors in a model. A JPD is defined as follows: given at least two random variables $X, Y, \ldots$ defined on a probability space, the joint probability distribution for $X, Y, \ldots$ is a probability distribution that gives the probability that each of $X, Y, \ldots$ falls within a discrete set of values specified for that variable [70]. The model assumes that measurement errors are statistically independent, but the analysis explicitly characterizes the conditionality of measurements, where each clinical type is hypothesized to have a unique JPD of characteristic biomarker relationships.

Guarding against data dredging. Data dredging presents data patterns as statistically significant by exhaustively searching for combinations of variables that might show a correlation without first devising a specific hypothesis for the underlying cause. We guard against data dredging by testing the primary hypothesis with data that was not used in constructing the hypothesis.

3.7 Handling Variation Through Model Constraints

A computational model must efficiently navigate within the large and complex probability space in which both the disease responses and the protein measurements occur. Measured protein concentrations in individuals vary across orders of magnitude in both the technical and biological dimensions, depending upon the progression of the disease state, how the specimens are collected, the tissue or fluid that is sampled, and by the assay

PAGE 37

protocol used to measure the concentrations [32]. There is also the possibility of introducing bias due to poor experimental design, sampling, and protocol differences [33]. The model addresses these sources of universal biological variation as follows.

1) Biomarker measurements are made from noisy, incomplete, and widely variable protein concentration values, so the assignments of the model are necessarily probabilistic. The analytical challenge is to reduce the effect of measurement variations enough to make accurate clinical classifications, even though proteomics experiments conducted in different laboratories using the same specimen handling protocols can produce drastically different results [8, 9]. Though the baseline data comes from two different labs, both labs applied the same methods of random population selection, specimen collection and handling protocols, instrumentation calibration and use, and non-biased experimental design within their individual experiments. Assuming the use of electrophoresis assay techniques allows us to test whether the model works at all without having to model these known sources of variation.

2) Patients that are sampled are nearly always being clinically treated at the same time, yet the effects of those treatments on relative biomarker concentration levels are usually unknown. This influence is mitigated somewhat by focusing on the concentrations of host response proteins circulating in homogeneous serum instead of targeted tissue specimens.

3) The model does not require the extraction of specimens from diseased tissues to measure and compare all the proteins therein to find a set of disease-specific biomarkers. The tissue extraction approach is both costly and invasive. Even though differences have been uncovered in protein expression between normal and diseased tissues that may have

PAGE 38

specificity for different tumor types, there is increasing evidence of a detectable human immune response to cancer in blood sera, which may aid in disease immunotherapy [47].

4) There exist overly frequent or non-specific variations involving issues of scale: those proteins that change concentration in response to a very large number of inflammatory signal events or stimuli, and not necessarily those involved in lung disease. For example, IL10 is a pleiotropic immunoregulatory cytokine (where a single protein influences two or more seemingly unrelated physiological functions) that protects from several infection-based immunopathology and allergy responses. These non-specific types of variations are managed by avoiding those protein biomarkers that respond to every inflammatory event and selecting only those with other evidence that links them to the disease under investigation.

5) A great deal of proteomics research is plagued by the issue of overfitting, which occurs when a statistical model describes random error or noise instead of the underlying relationships. Overfitting generally occurs when a model is excessively complex, such as having too many parameters or experimental variables relative to the number of observations or measurements, making it easy to fit multiple models to the data and expose structure that does not correlate with the hypothesis under study. Often preliminary or training data fits the model well, but independent validation studies perform very poorly. To counter overfitting, we restrict the number of targeted biomarker variables to 12, well below the number of separate observations or measurements.

6) Nearly all the measurement data sets available to us are incomplete. Only 39 of the 343 patients (11.4%) have all 12 biomarker measurements, but 85.4% have 9 or more biomarkers. A total of 17.7% of biomarker values are missing from the baseline data sets.

PAGE 39

Figure 4 shows the mode of measurements per patient panel as 10. The mean is 9.84. No data was interpolated or averaged to fill in missing data. Instead, we chose to declare a sufficiency constraint value of 95%. That is, in the training set, the amount of a clinical [...] is accepted as sufficient if at least one of its distinguishing biomarker variables accurately assigned patients at least 95% of the time.

Figure 4: Number of biomarker measurements for all baseline samples.

7) Concentration data can be transformed by excluding extreme outliers, or by taking their natural logarithm ($\log_e$) to reduce the differences in measured concentration values. We observed that removing outliers removes a few characteristic topological features among clinical types, though it requires significantly less overall computational time. To balance these objectives, we excluded as outliers only the 7 individual measurements that fell into the farthest range of standard deviation, those values greater than [...]. We did not apply other data transformations. All clinical type concentration values are shown in Figure 5 (a-i).
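The missing-data bookkeeping reported in Sections 3.2 and 3.7 can be reproduced with a short Python sketch. The first part uses only counts stated in the text for the 310-patient training set; the small panel in the second part is hypothetical and only illustrates how the per-patient mode and mean of Figure 4 would be computed.

```python
from collections import Counter
import numpy as np

# Counts reported in the text for the training set.
patients, biomarkers, missing = 310, 12, 659
possible = patients * biomarkers          # total measurement slots
measured = possible - missing             # slots that hold a value
missing_pct = 100 * missing / possible    # share of slots that are empty

# Hypothetical 4-patient panel (rows) with 3 biomarkers (columns); NaN = missing.
panel = np.array([
    [1.2, np.nan, 3.4],
    [0.9, 2.2, 3.1],
    [np.nan, np.nan, 2.8],
    [1.1, 2.0, np.nan],
])
per_patient = np.count_nonzero(~np.isnan(panel), axis=1)   # measurements per patient
mode_count = Counter(per_patient.tolist()).most_common(1)[0][0]
mean_count = float(per_patient.mean())
```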

PAGE 40


PAGE 41


PAGE 42


PAGE 43


PAGE 44

Figure 5 (a-i): All clinical type concentration values.

PAGE 45

CHAPTER IV: COMPUTATIONAL MODEL

4.1 Defining the Model

Given these difficulties and conflicting constraints in analyzing proteomic data, the problem is to formulate the right space in which the model can effectively address several fundamental questions.

1. How should the set of targeted biomarker variables be computed to reliably separate and group clinical type populations? Is there an essential set of variables?
2. What is the fine-grained conditional structure relating the biomarkers per clinical type?
3. Can unknown patients be reliably assigned to a clinical type?
4. How can data variation (e.g., outliers) be effectively included in the computational model?
5. Is there an effective strategy for handling missing data values?

The computational model builds a discrete topological structure (DTS) from the baseline dataset. The model is developed using several novel schemes designed specifically for this analysis. A customized binning algorithm discretizes the concentration values of paired biomarkers into bins and forms bin states corresponding to biomarkers belonging to individual samples. The model builds conditional probability relationships between bin states that successfully separate and distinguish the bin states belonging to different clinical types. A clinical type population is then represented as a characteristic set of biomarker bin states, where each biomarker pair is assigned a distinct bin size and

PAGE 46

number of bin states, whether occupied or empty of concentration values. Every biomarker belongs to multiple combinations, each with its own characteristic bin size and set of bin states. A unique application of an exclusive-or (XOR) set of operations then extracts the distinctive and distinguishing patterns that successfully assign a set of states to a clinical type. The resulting analysis indicates which of the biomarkers are key in recognizing the discriminating signs of the lung diseases under study.

4.2 Motivation of the Discrete Topological Structure

The DTS was initially motivated by considering a derivation of the Chemical Stochastic Master Equation (CSME), as given in Appendix B. The CSME originally described the probability of a vector of chemical measurements belonging to a certain state and how that probability changes with respect to time [13]. Instead, we formulate the DTS equation by representing concentration [c] values as bin state variables under a joint probability distribution (JPD) and discretize time $t$ as step transformations among these various bin states. The JPD $p(X, t)$ represents the probability that, at bin state transformation $t$, the measurement panel contains $X_1$ proteins of the first clinical type, $X_2$ proteins of the second clinical type, and so on [15, 16]. The inputs to this formulation are the distributions of protein concentrations for a clinical measurement panel relative to a standard in picograms per milliliter (pg/ml). These measurement vectors are transformed to matrices of conditional bin states. The change in $p(X, t)$ is then reformulated as a unitless quantity, where the probability that state transformation $t_i$ will occur given the current bin state is represented by an interaction matrix of marginal probabilities (marginal signifies the probabilities of a subset of variable values without reference to the values of other variables [70]); and the probability

PAGE 47

that state transformation $t_j$ will bring the system into a given bin state from another state is represented by an interaction matrix of conditional probabilities. The set of discrete bin states indexes their respective matrices. This reformulation assigns concentration values to a vector of discrete bins to reveal the topological structure underlying the distribution of these values. The discrete bins of concentration are computed as state variables, represented by unitless probabilities, by a binning algorithm that maximizes the total number of bins while simultaneously minimizing the number of empty bins where no concentration data reside. The discrete topological structure of the concentration data then effectively represents how the various biomarker concentrations relate to each other, as formalized in Equation 1 (see §4.4:A).

The model computes a conditional probability matrix of all the mutual interactions of the entire protein ensemble for a measurement panel at discrete bin state transformation $t$, where the conditional probability of random variable bin state $A$ given random variable bin state $B$ is equal to the joint probability of $A$ and $B$ divided by the marginal of $B$: $P(A \mid B) = P(A, B) / P(B)$. Bin states are used in two ways. As an aggregate snapshot of a clinical type population, a state represents a range of concentration values, so computing the DTS uses the population bin probabilities as the possible number of states in relation to all the biomarkers for a given clinical type. But for an individual patient, a sample represents a snapshot that varies within the static constraints of the population snapshot. In this case the DTS computation uses the one available state of 12 concentration probabilities to make probabilistic clinical type assignments.

We now develop a binning algorithm that implements this DTS equation, providing the step-by-step details that construct this new probabilistic computational model.
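The definition of conditional probability used above (joint probability divided by the marginal) can be sketched in Python for one pair of biomarkers. The bin indices are hypothetical, and the function is only an illustration of the definition; it is not the DTS equation of §4.4:A, which is not reproduced here.

```python
import numpy as np

# Hypothetical bin indices for two biomarkers, one entry per patient of a clinical type.
bins_a = np.array([0, 0, 1, 2, 2, 2])   # biomarker A falls in bins 0..2
bins_b = np.array([0, 1, 1, 1, 0, 1])   # biomarker B falls in bins 0..1

def conditional_matrix(a: np.ndarray, b: np.ndarray, n_a: int, n_b: int) -> np.ndarray:
    """Return M with M[i, j] = P(A = i | B = j) = P(A = i, B = j) / P(B = j)."""
    joint = np.zeros((n_a, n_b))
    np.add.at(joint, (a, b), 1)            # co-occurrence counts
    joint /= joint.sum()                   # joint probabilities
    marginal_b = joint.sum(axis=0)         # P(B = j)
    # Empty B bins have zero marginal; their columns are left at 0.
    return np.divide(joint, marginal_b, out=np.zeros_like(joint), where=marginal_b > 0)

cond = conditional_matrix(bins_a, bins_b, n_a=3, n_b=2)
```

Each occupied column of the result sums to 1, and a zero entry marks a bin pairing that never occurs in the population, which is the kind of gap the binning approach is meant to expose.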

PAGE 48

4.3 Bin Size and Bin States

Inferring data structure and classification from concentration values depends upon how these values are binned or grouped. Once a unique and constant bin size is calculated for a biomarker-clinical type coordinate combination, the calculated bins are plugged into the DTS equation as separate and biomarker-specific state variables produced by the binning algorithm. Empty bins, where there are no data values within that bin range, are then considered as non-permissible states. The ratio of the number of empty bins against the total number of bins for every $CB_r$ is linear, as shown in Figure 6. The binning algorithm provides the basis for calculating the probability of each biomarker-clinical type coordinate bin state. All calculated bin sizes are given in Table 1. TestSet refers to a small set of fabricated data points used to validate calculations; these points are never used in pooled calculations.

Figure 6: Ratio of the number of empty bins to the total number of bins.
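The following Python sketch illustrates fixed-width binning and the empty-bin ratio described in Section 4.3. It takes the bin size as given and does not reproduce the customized algorithm that selects it; the concentration values, the bin size of 2.0, and the choice to place the maximum value in the last bin are assumptions made for illustration.

```python
import numpy as np

def bin_states(values: np.ndarray, bin_size: float) -> tuple[np.ndarray, np.ndarray]:
    """Assign each value to a fixed-width bin and count the occupancy of every bin."""
    lo, hi = values.min(), values.max()
    n_bins = max(1, int(np.ceil((hi - lo) / bin_size)))
    # The maximum value is placed in the last bin rather than opening a new one.
    index = np.minimum(((values - lo) // bin_size).astype(int), n_bins - 1)
    counts = np.bincount(index, minlength=n_bins)
    return index, counts

# Hypothetical concentrations (pg/ml) with a gap between 3 and 9.
values = np.array([1.0, 1.5, 2.5, 9.0, 9.5, 11.0])
index, counts = bin_states(values, bin_size=2.0)
empty_ratio = np.count_nonzero(counts == 0) / counts.size
```

Here the three middle bins are empty, so they would be treated as non-permissible states, and the empty-bin ratio is the quantity plotted in Figure 6.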

PAGE 49

Table 1: All clinical type biomarker bin sizes in pg/ml.

Biomarker          EGF      IFNG      IL1A      IL1B       IL2       IL4       IL6       IL8      IL10      MCP1      TNFA      VEGF
Adenocarcinoma
  Bin size      4.0188    1.6849    1.3170    2.9013    2.6063    2.2277    3.3217    1.8597    1.4961    4.8582    2.1165    3.2821
  Min value     0.9415    1.1810    1.5025    0.3820    3.7500    3.2090    0.7500    1.9510    0.7260   47.8670    5.0180    4.1060
  Max value   370.6725   34.8780   22.5750   93.2225   76.7270   47.7620  173.4790   46.5840   24.6630  902.9175   55.8135  201.0290
Squamous
  Bin size     25.6025    2.4179    1.7094    2.2409    2.7740    1.0377    3.5711    2.8973    2.2414    4.2379    2.9231   28.8565
  Min value     1.3955    1.1810    1.3470    0.3820    3.9060    3.2845    0.9170    1.8580    0.7260   32.7945    4.7975    3.0665
  Max value   308.6250   59.2095   35.5350   45.2005  103.7710   19.8870  229.4685  129.3395   54.5195  541.3370  121.7230  349.3445
Controls Never Smokers
  Bin size      4.9663    0.6260    0.0234    1.7292    0.9130    0.9029    1.5829    1.7797    0.2707    4.3751    1.8843    4.5018
  Min value     1.2180    1.1810    1.3470    0.3820    3.6715    3.2845    0.7500    1.3965    0.7260   37.5950    4.9080    4.6275
  Max value   954.7375   11.1970    1.5340   34.9650   18.2800   14.1195   32.4080   44.1100    3.9740  615.1115   50.1300  598.8635
Controls Smokers with COPD
  Bin size      3.7650    0.6750    0.5640    0.6228    1.2156    0.4684    0.3933    2.0148    1.1390    4.3275    1.2309    4.4091
  Min value     1.6760    1.3080    1.9100    0.3820    3.9825    3.3600    0.7130    1.1840    0.6880   31.5945    4.6870    3.3930
  Max value   317.9365    9.4080    6.4220    7.8550   28.2935    7.1075    7.0050   49.5400   18.9125  602.8180   29.3055  567.7530
Controls Smokers without COPD
  Bin size      4.0905    0.3150    0.1254    0.8070    0.5339    0.4576    0.2815    1.2047    0.1974   26.4710    1.2202    4.5300
  Min value     0.9600    1.1810    1.2840    0.3820    3.5930    3.3600    0.7870    1.1840    0.6880   22.8825    4.6870    4.4300
  Max value   377.2835    4.9615    2.2870   10.0660   12.1360   10.6810    5.2910   20.4585    3.0570  340.5345   29.0900  638.6300
Acute Lung Injury
  Bin size      4.2996    4.9875    3.2633    4.1500    4.4825    4.4266    3.6994    5.2871    4.4546    5.0662    4.7249    4.7113
  Min value     1.2430    1.7590    7.6080    0.0360    0.3000    0.1830    8.6420    0.7890    0.0050   65.1690    1.8720    1.9530
  Max value      500.0  959.3670  164.2470  398.4370  520.2710  531.3790     275.0 1227.3860  516.7390    1200.0  738.9600  774.6060
Cystic Fibrosis
  Bin size      2.5936    1.6285    0.8440    3.4580    4.1233    1.1234    2.3259    2.2821    0.2086   53.4563    0.8463    2.6786
  Min value     1.4420    0.9110    0.6710    0.0690    0.7180    0.5250    2.6650    0.6290    0.2710   12.0120    1.8930    1.8260
  Max value    94.8110   26.9670   14.1750  166.0550  330.5780   18.5000   77.0940   55.4000    2.7740  653.4870   15.4330    108.97

PAGE 50

TestSet
  Bin size       1.875       1.5    1.6875       1.8      2.25      2.25     2.625    2.5714    2.5313    2.8125      2.75       3.0
  Min value        5.0       2.0       3.0       4.0       5.0       6.0       7.0       8.0       9.0      10.0      11.0      12.0
  Max value       35.0      20.0      30.0      40.0      50.0      60.0      70.0      80.0      90.0     100.0     110.0     120.0
Healthy Serum
  Bin size         N/A      4.78    3.6107    1.9456       N/A      1.72     1.917     1.835     1.725    3.0115    3.1132    3.9559
  Min value        N/A    153.50     181.0      2.47       N/A       3.3     12.46       7.5       0.9     39.74      1.42      28.3
  Max value        N/A     822.7     383.2      33.6       N/A      37.7      50.8      44.2      28.5     160.2     138.4     297.3
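As a reading aid for Table 1, the tabulated values appear consistent with each bin size being the concentration range divided by a whole number of bins. The Python sketch below checks this on four entries of the TestSet row; treating the relationship as range / bin size is an inference from the tabulated numbers, not a statement of how the author's binning algorithm defines bin size.

```python
# Selected TestSet entries from Table 1: (bin size, min value, max value) in pg/ml.
test_set = {
    "EGF": (1.875, 5.0, 35.0),
    "IFNG": (1.5, 2.0, 20.0),
    "IL8": (2.5714, 8.0, 80.0),
    "VEGF": (3.0, 12.0, 120.0),
}

def bin_count(bin_size: float, lo: float, hi: float) -> int:
    """Number of equal-width bins of the given size that tile the range [lo, hi]."""
    return round((hi - lo) / bin_size)

counts = {name: bin_count(*row) for name, row in test_set.items()}
```

Rounding absorbs the four-decimal truncation of the tabulated bin sizes, as in the IL8 entry.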

PAGE 51

Table 2: Biomarker Concentration Averages and Standard Deviation Values per Clinical Type.

Average Biomarker Values
Clinical Type                      EGF    IFNG    IL1A    IL1B     IL2     IL4     IL6     IL8    IL10    MCP1    TNFA    VEGF
All Clinical Types               65.03   14.83   42.38    3.01   18.99    8.10   10.73   81.95   34.77  232.64   10.08   85.02
Adenocarcinoma                   54.69    3.85    6.36    4.24   16.00    6.67   28.21   10.66    2.94  240.91   17.92   46.14
Squamous                         85.94    3.91    5.89    3.72   12.44    7.55   20.59   40.56    4.56  226.03   17.47  117.85
Controls Never Smokers           90.00    3.60    1.44    2.20    7.06    5.40    3.59    5.49    1.25  183.76   12.93   78.13
Controls Smokers with COPD       66.49    2.77    3.12    1.09    7.16    5.54    2.95    6.77    2.11  237.67   12.35   99.08
Controls Smokers without COPD    56.74    2.18    1.82    1.16    6.19    5.15    3.84    4.32    1.31  185.77    9.54   64.78
Acute Lung Injury                59.16  118.97   57.57   69.04   75.53  107.73  126.85  155.14   64.69  331.61   88.03  123.93
Cystic Fibrosis                  20.97    4.54    4.51   12.80   27.80    4.92   30.09    8.86    0.84  253.22    5.64   25.36
Healthy Serum                      N/A  569.12  270.15    8.74     N/A   31.15   24.27   34.92    7.65   68.87   65.92  212.37

Standard Deviation Values
Clinical Type                      EGF    IFNG    IL1A    IL1B     IL2     IL4     IL6     IL8    IL10    MCP1    TNFA    VEGF
All Clinical Types               95.15   81.58   83.47   20.04   61.52   38.59   41.87  590.22  136.73  158.95   33.27  124.20
Adenocarcinoma                   72.70    5.69    6.49   15.52   59.29    7.08   93.77   14.01    4.79  181.36   46.90   42.45
Squamous                         92.03    9.32   10.10    9.02   23.14   14.10   36.48  168.22   10.63  106.68   19.55   98.38
Controls Never Smokers          143.51    2.87    0.09    5.77    7.38    2.39    5.74    7.04    0.58  106.41   10.45  114.50
Controls Smokers with COPD       79.85    1.80    1.91    1.46    4.38    4.32    2.60    7.07    3.44  116.64   11.50  123.29
Controls Smokers without COPD    77.89    0.91    0.38    1.81    1.81    1.41   12.04    2.98    0.62   77.53    4.59  122.35
Acute Lung Injury                93.73  243.15   54.51  115.91  138.43  171.76  103.13  299.78  129.25  228.59  182.00  193.35
Cystic Fibrosis                  25.97    6.16    4.20   35.89   70.79    4.37   26.12   11.78    0.73  153.35    3.15   24.09
Healthy Serum                      N/A  219.32   58.52   10.24     N/A   11.47   11.74   11.66    9.47   38.45   44.01   82.23

Table 3: Biomarker Data Characteristics

ClinicalType                   Marker  Min Conc  Max Conc    Min Prob  Max Prob  Min DTS   Max DTS
Adenocarcinoma                 EGF     0.9415    370.6725    0         0.10638   0.09524   1.16292
Adenocarcinoma                 IFNG    1.181     34.878      0         0.42857   0.25      3.05278
Adenocarcinoma                 IL1A    1.5025    22.575      0         0.22222   1.10926   4.07049
Adenocarcinoma                 IL1B    0.382     93.2225     0         0.8125    0.4       4.88426
Adenocarcinoma                 IL2     3.75      436.6875    0         0.51064   0.28572   3.48885
Adenocarcinoma                 IL4     3.209     47.762      0         0.60465   0.28572   3.489
Adenocarcinoma                 IL6     0.75      649.422     0         0.44681   0.18182   2.22012
Adenocarcinoma                 IL8     1.951     86.9895     0         0.25532   0.15385   1.87863
Adenocarcinoma                 IL10    0.726     24.663      0         0.63043   0.73951   2.71369
Adenocarcinoma                 MCP1    47.867    902.9175    0         0.08333   0.0625    0.76311
Adenocarcinoma                 TNFA    5.018     348.16145   0         0.34783   0.20001   2.44225
Adenocarcinoma                 VEGF    4.106     201.029     0         0.08511   0.07143   0.87221
Squamous                       EGF     1.3955    308.625     0         0.28205   0.52588   1.71242
Squamous                       IFNG    1.181     59.2095     0         0.61765   0.5       5.1372
Squamous                       IL1A    1.347     35.535      0         0.55556   0.5       5.13724
Squamous                       IL1B    0.382     45.2005     0         0.75758   0.33333   3.42483
Squamous                       IL2     3.906     118.909     0         0.45      0.28571   2.93554
Squamous                       IL4     3.2845    98.298      0         0.30769   0.656     2.56863
Squamous                       IL6     0.917     229.4685    0         0.4       0.14286   1.46776
Squamous                       IL8     1.858     1133.017    0         0.23077   0.15384   1.58065
Squamous                       IL10    0.688     54.5195     0         0.71795   0.28571   2.93554
Squamous                       MCP1    32.7945   541.337     0         0.075     0.06667   0.68496
Squamous                       TNFA    4.7975    121.723     0         0.20513   0.16666   1.71238
Squamous                       VEGF    3.0665    349.3445    0.025     0.2       0.48544   1.58072
Controls Never Smokers         EGF     1.218     954.7375    0         0.08511   0.07408   0.90653
Controls Never Smokers         IFNG    1.181     11.197      0         0.27273   0.56124   2.22541
Controls Never Smokers         IL1A    1.347     1.534       0         0.5       3.08946   12.24058
Controls Never Smokers         IL1B    0.382     34.965      0         0.71875   1.54157   6.11979
Controls Never Smokers         IL2     3.6715    57.4705     0         0.25      0.68597   2.72
Controls Never Smokers         IL4     3.2845    16.6065     0         0.26667   0.8827    3.49731
Controls Never Smokers         IL6     0.75      32.408      0         0.51064   0.61665   2.44799
Controls Never Smokers         IL8     1.275     44.11       0         0.42553   0.68513   2.71985
Controls Never Smokers         IL10    0.726     3.974       0         0.46667   0.77236   3.06011
Controls Never Smokers         MCP1    37.595    615.1115    0         0.06      0.05263   0.64409
Controls Never Smokers         TNFA    4.908     50.13       0         0.23913   0.41108   1.63191
Controls Never Smokers         VEGF    4.6275    598.8635    0         0.16      0.07692   0.94136
Controls Smokers with COPD     EGF     0.998     317.9365    0         0.14286   0.07692   0.81833
Controls Smokers with COPD     IFNG    1.308     9.408       0         0.31034   0.93738   2.8529
Controls Smokers with COPD     IL1A    1.91      6.422       0         0.5       2.18995   6.65757
Controls Smokers with COPD     IL1B    0.382     7.855       0         0.625     1.31234   3.9941
Controls Smokers with COPD     IL2     3.9825    28.2935     0         0.34091   0.37469   2.12787
Controls Smokers with COPD     IL4     3.36      33.981      0.02439   0.2439    0.72997   2.21917
Controls Smokers with COPD     IL6     0.713     14.7855     0         0.18605   0.50433   1.63688
Controls Smokers with COPD     IL8     1.184     49.54       0         0.29545   0.22222   2.36427
Controls Smokers with COPD     IL10    0.688     18.9125     0         0.54054   1.0927    3.54654
Controls Smokers with COPD     MCP1    31.5945   602.818     0         0.06818   0.05715   0.60791
Controls Smokers with COPD     TNFA    4.687     82.8344     0         0.16279   0.24979   1.41859
Controls Smokers with COPD     VEGF    3.393     567.753     0         0.11364   0.07143   0.75987
Controls Smokers without COPD  EGF     0.96      377.2835    0         0.17391   0.08      0.99052
Controls Smokers without COPD  IFNG    1.181     4.9615      0         0.27586   0.99833   2.47649
Controls Smokers without COPD  IL1A    1.284     2.287       0         0.2       2.12253   4.95383
Controls Smokers without COPD  IL1B    0.382     10.066      0         0.64      1.99667   4.95304
Controls Smokers without COPD  IL2     3.593     12.136      0         0.20833   0.46145   1.90491
Controls Smokers without COPD  IL4     3.36      10.681      0         0.28261   0.49991   2.06368
Controls Smokers without COPD  IL6     0.787     88.552      0         0.15556   0.42849   1.76887
Controls Smokers without COPD  IL8     1.184     20.4585     0         0.25      0.66654   2.75154
Controls Smokers without COPD  IL10    0.688     3.057       0         0.2439    0.90757   2.25134
Controls Smokers without COPD  MCP1    22.8825   340.5345    0.02083   0.16667   0.76795   1.90501
Controls Smokers without COPD  TNFA    4.687     29.09       0         0.22917   0.15385   1.9049
Controls Smokers without COPD  VEGF    4.43      638.63      0         0.14583   0.09523   1.17911
Acute Lung Injury              EGF     1.243     500         0         0.23256   0.08696   1.36894
Acute Lung Injury              IFNG    1.759     959.367     0         0.34783   0.11765   1.85198
Acute Lung Injury              IL1A    7.608     164.247     0         0.2       0.25      3.93577
Acute Lung Injury              IL1B    0.036     398.437     0         0.5       0.18182   2.86219
Acute Lung Injury              IL2     0.3       520.271     0         0.47222   0.15385   2.42187
Acute Lung Injury              IL4     0.183     531.379     0         0.5       0.125     1.96766
Acute Lung Injury              IL6     8.642     275         0         0.275     0.09091   1.43113
Acute Lung Injury              IL8     0.789     1227.386    0         0.23214   0.09524   1.49923
Acute Lung Injury              IL10    0.005     516.739     0         0.5       0.125     1.96775
Acute Lung Injury              MCP1    65.169    1200        0         0.05357   0.04445   0.69967
Acute Lung Injury              TNFA    1.872     738.96      0         0.30909   0.09524   1.49916
Acute Lung Injury              VEGF    0.381     774.606     0         0.16667   0.07142   1.12431
Cystic Fibrosis                EGF     1.442     94.811      0         0.22222   0.18182   1.09312
Cystic Fibrosis                IFNG    0.911     26.967      0         0.33333   0.4       2.40579
Cystic Fibrosis                IL1A    0.399     14.175      0         0.25      0.25      1.5036
Cystic Fibrosis                IL1B    0.069     166.055     0         0.63158   0.28571   1.71766
Cystic Fibrosis                IL2     0.718     330.578     0         0.26316   0.22222   1.33591
Cystic Fibrosis                IL4     0.525     18.5        0         0.41667   0.33333   2.00484
Cystic Fibrosis                IL6     2.665     77.094      0         0.21429   0.2       1.20244
Cystic Fibrosis                IL8     0.629     55.4        0         0.30435   0.25      1.50344
Cystic Fibrosis                IL10    0.205     2.774       0         0.33333   0.437     1.50404
Cystic Fibrosis                MCP1    12.012    653.487     0         0.20833   0.31782   1.09385
Cystic Fibrosis                TNFA    1.893     15.433      0         0.15789   0.19999   1.20287
Cystic Fibrosis                VEGF    1.826     108.97      0         0.17391   0.13334   0.8016
Healthy Serum                  EGF     N/A       N/A         N/A       N/A       N/A       N/A
Healthy Serum                  IFNG    153.5     822.7       0         0.14286   0.28572   1.18402
Healthy Serum                  IL1A    181       383.2       0         0.14286   0.28572   1.18606
Healthy Serum                  IL1B    2.47      33.6        0         0.28571   0.5       2.08886
Healthy Serum                  IL2     N/A       N/A         N/A       N/A       N/A       N/A
Healthy Serum                  IL4     3.3       37.7        0         0.28571   0.5       2.08274
Healthy Serum                  IL6     12.46     50.8        0         0.14286   0.28572   1.19017
Healthy Serum                  IL8     7.5       44.2        0         0.28571   0.4       1.6662
Healthy Serum                  IL10    0.9       28.5        0         0.33333   0.5       2.08888
Healthy Serum                  MCP1    39.74     160.2       0         0.28571   0.33334   1.38589
Healthy Serum                  TNFA    1.42      138.4       0         0.16667   0.33334   1.38544
Healthy Serum                  VEGF    28.3      297.3       0         0.28571   0.33334   1.38304
TestSet                        EGF     5         35          0         0.25      0.5       1.45182
TestSet                        IFNG    2         20          0         0.25      0.5       1.46094
TestSet                        IL1A    3         30          0         0.25      0.5       1.45182
TestSet                        IL1B    4         40          0         0.25      0.5       1.44661
TestSet                        IL2     5         50          0         0.25      0.5       1.44661
TestSet                        IL4     6         60          0         0.25      0.5       1.44401
TestSet                        IL6     7         70          0         0.25      0.5       1.44401
TestSet                        IL8     8         80          0         0.25      0.5       1.44271
TestSet                        IL10    9         90          0         0.25      0.5       1.4401
TestSet                        MCP1    10        100         0         0.25      0.5       1.4401
TestSet                        TNFA    11        110         0         0.25      0.5       1.4375
TestSet                        VEGF    12        120         0         0.25      0.5       1.4375

4.4 Developing the Computational Model

We seek to develop a model that describes the probability of a host response system occupying a discrete set of states characteristic of a clinical type. We model this topological property by discretizing the concentration values of biomarkers in each panel for all combinations of biomarkers and clinical types from the given population of patients. The term combination is restricted to mean processing the aggregate concentration data values for a subset of biomarkers from a set of clinical types to determine which subset of biomarkers distinguishes among those clinical types.

We perform this discretization using a new binning algorithm, Max Bins Min Empty Bins (presented in Algorithm 1), in which we first compute a bin size and number of bins (W_r, N_r) for every clinical type paired biomarker combination CB_r. Each clinical type has 77 different combinations of pairs of biomarkers (except when an entire biomarker is missing), including pairs of the same biomarker. A CB_r is defined as the aggregate concentration data from each of these pairs of biomarkers. Throughout the paper, the term clinical type paired biomarker combination CB_r refers to forming combinations of biomarkers belonging to a specific clinical type. Every clinical type has the same number of CB_r.

The algorithm maximizes the number of occupied bins (Ô), separating data values at the highest possible resolution, while minimizing the number of gaps or empty bins (Õ) where no data values reside. Empty bins are then considered non-permissible states. Maximizing the number of occupied bins separates the states within each combination. The model computes a different total number of bins (states) N_r for each combination CB_r. We then compute the probability of each concentration data point belonging to a bin within that combination.

Each of these combinations has a characteristic number of occupied bin states determined by the binning algorithm, though it is possible for a biomarker not to exhibit distinguishing bin states, labeled as (Ō), depending upon the set of clinical types that are processed. These computations produce multiple bin states for different combinations embedded within a space that exposes the conditional structure underlying the distribution of these values. The model computes a coordinate space whose dimensions are the computed DTS values, the individual bin probabilities, and the concentration values for every CB_r. The resulting space is analyzed to reveal the distinguishing patterns of each combination in this coordinate space.

a. Formulate the discrete topological structure (DTS)

We are developing a model that represents the conditional relationships of expressed host response biomarkers, whose interactivity induces a probable biomarker con[...]. We represent these interactive relationships by calculating their joint probability matrix M_C^joint: the probability that biomarker B_2 measured at concentration c_2 (event X) occurs at the same time biomarker B_1 is measured at concentration c_1 (event Y). A related but separate concept is also necessary: the conditional probabilities of biomarker interactivity, which answer the question, given concentration measurement c_1 for B_1 as event Y, how likely is the measured concentration c_2 for B_2 as event X? Call this matrix M. To represent the influence of individual biomarkers, we use marginal probabilities: the probabilities of various concentration values of a subset of biomarker variables without reference to the values of the other variables being considered. Call this matrix M. These marginal probabilities represent the probability distribution of event X when the probability value of event Y is not known. Together, these types of probability express the mutual interactivity and distribution of the biomarker measurements that are indicative of each clinical type. We equate these concepts to a discrete topological structure matrix, where the various biomarker concentration measurements and their probabilities reveal topological patterns characteristic of each clinical type. A DTS matrix is computed for each CB_r, and the matrix (i.e., the specific set of biomarkers) that produces the most accurate set of patient assignments per clinical type is designated M_C for that population.

Equation 1: DTS matrix equation for population M_C.

In Equation 1, M_C^joint is the population joint probability matrix, 1 is a complete matrix of ones (not the identity matrix), M is the interaction matrix of marginal probabilities, and M is the interaction matrix of conditional probabilities for the clinical type. The DTS equation is implemented in terms of matrices of conditional and marginal probabilities involving bivariate pairs of biomarkers, each of which is indexed by its respective set of discrete bin states as computed by the binning algorithm, as explained below. This formulation assigns each set of biomarker concentration measurements to its own vector of discrete bins to reveal the discrete topological structure underlying the distribution of these concentration values. These discrete bins of concentration are computed as state variables, represented by unitless probabilities, by the Max Bins Min Empty Bins algorithm. The discrete topological structure of the bin states then effectively describes how the various biomarker concentrations relate to each other.

b. Develop the binning algorithm

We motivate the discussion by developing a customized binning algorithm that discretizes the concentration measurements into bins and induces a multidimensional space of discrete bin states characteristic for each clinical type. The binning algorithm is a necessary computational component of the overall DTS equation. The algorithm accepts as input vectors of biomarker concentration measurements relative to a standard in picograms per milliliter (pg/ml) and outputs the number of bins and the constant bin size for each combination of concentration values. This process is described below.

Step 1: Define the Possible Combinations

After experimenting with all possible clinical type biomarker combinations (pairs, triplets, quadruples, etc.), we determined that analyzing the conditional probabilities of every clinical type and pair of biomarkers {B_i, B_j} is the most effective means for distinguishing among all possible combinations, denoted CB_r. We refer to D_r as the combined set of observed concentration data values within each CB_r for a specific clinical type and biomarker pair {B_i, B_j}. In these experiments this produces 7 x 77 = 455 W_r bin sizes, where the number of bins N_r is different for each unique CB_r. The same W_r value is computed for both {B_i, B_j} and {B_j, B_i}.

Step 2: Compute the Combination Bin Sizes

We next compute the bin size and number of bins (W_r, N_r) for each CB_r using the Max Bins Min Empty Bins algorithm. The output of the algorithm is a set of N_r bins of fixed bin size W_r per CB_r. The algorithm guarantees that each biomarker concentration value is assigned to a single bin state n within the set of bin states for each CB_r, where 1 ≤ n ≤ N_r. The Max Bins Min Empty Bins algorithm first initializes the bin size and then iteratively recomputes it as the difference of the lowest and highest concentration data values [c] for each combination divided by the current number of bins:

$$W_r = \frac{\left|\max(D_r) - \min(D_r)\right|}{N_r}$$

Equation 2: Used to calculate the bin size W_r.

The number of bins is incremented by a constant value within a loop, and the bin size is recomputed each time, until the absolute value of the difference of the current bin size and the log_e of the current number of empty bins, threshold H_i, becomes greater than or equal to the previous threshold value H_(i-1):

$$H_i = \left| W_r - \log_e(\text{emptyBins}) \right|$$

Equation 3: Used to calculate the threshold value H_i.

The bin size is a function of whether the data set D_r is processed in log_e form or not, the specific distribution of concentration values, and the upper limit to the number of bins, MaxNBins. This constant upper limit is the square of the number of bins that through experimentation guaranteed at least the same number of empty bins as the number of data values over the concentration measurements. In these experiments MaxNBins = 233 for the baseline data set, but subsequent matrix dimensions are rounded up to 256 x 256 rows and columns to simplify model calculations. Pseudocode for the binning algorithm is given in Algorithm 1.

    forEach combination D_r = [D(B_i), D(B_j)]
        GetMaxBinsMinEmptyBinValue(D_r, MaxNBins) returns (W_r, N_r)

    # D_r: set of concentration data for the given CB_r
    # MaxNBins: maximum number of bins
    # Initialize number of bins (N_r), bin step size (binIncrement),
    # bin size (W_r), and number of empty bins (emptyBins).
    N_r = <initial number of bins>
    binIncrement = <constant step>
    W_r = |Max(D_r) - Min(D_r)| / N_r
    emptyBins = Count_Number_Of_Empty_Bins(D_r, N_r, W_r)
    H_prev = |W_r - Log_e(emptyBins)|
    result = 0
    while (result != 1 and N_r < MaxNBins - binIncrement)
        N_r = N_r + binIncrement
        W_r = |Max(D_r) - Min(D_r)| / N_r
        emptyBins = Count_Number_Of_Empty_Bins(D_r, N_r, W_r)
        if (emptyBins > 0) then
            H = |W_r - Log_e(emptyBins)|
            if (H >= H_prev) then result = 1
            H_prev = H
    return (W_r, N_r)

Algorithm 1: Pseudocode for the binning algorithm.

The output of the Max Bins Min Empty Bins binning algorithm is a bin size W_r and the number of bins N_r for each clinical type paired biomarker combination CB_r. Define N_Bi as the number of bins for biomarker B_i and N_Bj as the number of bins for biomarker B_j. For example, consider the CB_r for Adenocarcinoma {B_i = IFNG, B_j = IL1A}. Since there are 53 Adenocarcinoma patients, there are at most 53 concentration measurements for either IFNG or IL1A, less missing values. IFNG actually has 35 [c] values (sorted): {1.181, 1.244, 1.244, 1.244, 1.277, 1.373, 1.537, 1.634, 1.7, 1.7035, 1.766, 1.766, 1.766, 1.8055, 2.0015, 2.0705, 2.17, 2.24, 2.513, 2.538, 2.6525, 2.6955, 2.791, 2.827, 3.002, 3.43, 3.505, 3.647, 3.9893, 4.0835, 5.515, 9.1425, 10.974, 14.048, 34.878}, whereas IL1A has 9 [c] values (sorted): {1.5025, 1.534, 2.44415, 2.475, 4.759, 6.294, 6.4865, 16.1703, 22.575}, for a total of 44 measurements. See Appendix A for a complete list of missing values.

During step 2, we discretize these measured values by assigning each measurement of [c] to a bin state for this CB_r. Each value in a set of combined concentration values [c] is assigned to a single bin, but multiple concentration values may be assigned to the same bin, as plotted in Figure 7 for the [c] values of clinical type Adenocarcinoma for biomarkers {B_i = IFNG, B_j = IL1A}. The top 2 rows in Figure 7 refer to the actual concentration values as measured in pg/ml, given by the [c] axis. Concentration values are mapped to specific bin states in the Bin Intervals row. Many of the [c] values are grouped in the first few bins. The first 6 states are labeled, and bin 5 is the first empty bin out of the 23 bins. Bins 1 through 4 illustrate the joint probabilities of IL1A and IFNG values occupying the same state. Table 4 tabulates, for the first 6 bins only, the number of points for {B_i = IFNG, B_j = IL1A} per bin and their values of Ô (occupied bins) and Õ (empty bins).

Figure 7: Assigning Adenocarcinoma concentration [c] values to bin states.

Table 4: Number of points for biomarkers; their Ô and Õ values for the first 6 bins.

Bin                  IFNG   IL1A
1                    8      2
2                    17     2
3                    5      1
4                    1      2
5                    0      0
6                    1      0
Ô (occupied bins)    5      4
Õ (empty bins)       1      2
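The binning loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the starting number of bins (`initial_bins`), the step (`bin_increment`), the choice to start bins at the minimum value, and the handling of the case with no empty bins (where the logarithm is undefined) are all hypothetical choices, so the resulting bin count need not match the 23 bins reported for this combination.

```python
import math

# Sorted Adenocarcinoma concentration values quoted in the text (pg/ml).
IFNG = [1.181, 1.244, 1.244, 1.244, 1.277, 1.373, 1.537, 1.634, 1.7, 1.7035,
        1.766, 1.766, 1.766, 1.8055, 2.0015, 2.0705, 2.17, 2.24, 2.513, 2.538,
        2.6525, 2.6955, 2.791, 2.827, 3.002, 3.43, 3.505, 3.647, 3.9893,
        4.0835, 5.515, 9.1425, 10.974, 14.048, 34.878]
IL1A = [1.5025, 1.534, 2.44415, 2.475, 4.759, 6.294, 6.4865, 16.1703, 22.575]


def count_empty_bins(values, n_bins, width):
    """Count bins that receive no value when `values` are split into
    `n_bins` equal-width bins starting at min(values) (an assumption)."""
    lowest = min(values)
    occupied = set()
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        occupied.add(min(int((v - lowest) / width), n_bins - 1))
    return n_bins - len(occupied)


def max_bins_min_empty_bins(values, max_n_bins=233, initial_bins=2,
                            bin_increment=1):
    """Return (bin_size, n_bins) for one combination D_r.

    initial_bins and bin_increment are illustrative defaults; the source
    does not state the values it used.
    """
    span = abs(max(values) - min(values))
    if span == 0:
        return 0.0, 1  # all values identical: a single bin holds everything
    n_bins = initial_bins
    width = span / n_bins                                  # Equation 2
    empty = count_empty_bins(values, n_bins, width)
    # Equation 3; treated as infinite until at least one bin is empty.
    prev_h = abs(width - math.log(empty)) if empty > 0 else math.inf
    while n_bins < max_n_bins - bin_increment:
        n_bins += bin_increment
        width = span / n_bins
        empty = count_empty_bins(values, n_bins, width)
        if empty > 0:
            h = abs(width - math.log(empty))
            if h >= prev_h:                                # H_i >= H_(i-1)
                break
            prev_h = h
    return width, n_bins


d_r = IFNG + IL1A                 # combined data set D_r for {IFNG, IL1A}
bin_size, n_bins = max_bins_min_empty_bins(d_r)
```

Because the bin size is always recomputed from the current bin count, `bin_size * n_bins` equals the span of `d_r` whatever stopping point the loop reaches.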

Figure 8 (a-i): Complete set of 12 biomarker concentration values for every clinical type.

Figure 8 (a) plots the entire dynamic scale of the 12 biomarker concentration ranges for all Adenocarcinoma patients. The difference in scale between ranges is significant. For example, Adenocarcinoma MCP1 is expressed over a large dynamic range, from 47.9 to 902.9 pg/ml, whereas the measured range of IL1A is tiny in comparison, from 1.5 to 22.5 pg/ml. This issue of scale is mitigated by instead comparing the DTS values for Adenocarcinoma, as plotted in Figure 9 (a). Of all the clinical types, only a few DTS values were greater than 7.0 (6 IL1A values from Never Smokers). These few values were excluded in that plot to keep axes dimensions consistent. The figures make explicit the difference between the dynamic scale of the biomarker concentrations and the calculated DTS values. For every clinical type, the binning algorithm calculated the DTS values all to within one order of magnitude.

Figure 9 (a-i): All clinical type biomarker DTS values.

Figure 10 plots in 3 dimensions the biomarker concentrations, the probability of that concentration occurring in the population, and the computed DTS values for every clinical type. The probabilities are banded, and there are areas of localized density for each type. Each clinical type also exhibits a characteristic plume effect. The plot excludes zero probabilities. Four Never Smokers values greater than 8.0 are excluded to increase plot resolution: {8.06626, 8.13518, 8.6218, 12.2406}.

Figure 10: Point cloud for the baseline clinical types.

c. Compute the clinical type discrete topological structure matrix M_C

Once the fixed bin size W_r per combination is known, and the individual biomarkers in each combination have been mapped and assigned to bin states, we compute the population joint probabilities for D_r inside a nested loop (for each clinical type C_t, each combination CB_r in C_t, each biomarker B_i in CB_r, and each bin b from 1 to N_r) using Equation 4, where G_b is the number of concentration values of B_i in bin b, as in Table 4.

$$P_i(b) = \frac{G_b}{\left|D_r\right|}$$

Equation 4: Compute population joint probabilities of data set D_r for clinical type C_t.

Given a computed bin size W_r for D_r, P_i is the set of probabilities for observing the biomarker concentrations in each bin, oftentimes zero. A bin probability equals the number of concentration values G_b grouped in each bin divided by |D_r|, so that the sum of probabilities over the set of bins is 1. The population probabilities for each clinical type paired biomarker combination CB_r include all the sample values in that combination. Calculating probabilities this way assumes that each population of samples adequately represents the set of possible concentration bins. Computing the population joint probabilities involves formulating several probability matrices.

1. Compute the population joint probability matrix M_C^joint

Each clinical type population C_t contains a number of patient samples, and B_i denotes biomarker i for a given sample. For each combination of pairs of biomarkers B_i and B_j, we define their joint probability as the probability of belonging to bin concentration states X_i and Y_j together at the same time. Considering bin concentration states X and Y as random variables, we compute the population joint probability matrix M_C^joint by multiplying each bin probability P_i for biomarker B_i with each bin probability P_j for biomarker B_j, where B_i is indexed by i from 1 to the number of bins N_Bi for biomarker B_i, and B_j is indexed by j from 1 to the number of bins N_Bj for biomarker B_j. Equation 5 multiplies the two vectors (one as a column, one as a row) together as an outer product to form a 2-dimensional matrix for that biomarker combination of B_i and B_j. The dimensions of M_C^joint, one for each CB_r, are N_Bi x N_Bj. Bins 1 through 4 in Figure 7 illustrate joint probability values greater than zero.

$$M_C^{joint} = P_i^{\mathsf{T}} P_j$$

where P_i and P_j are row vectors of length N_Bi and N_Bj.

Equation 5: Used to calculate each population joint probability matrix M_C^joint.

2. Compute the population marginal distributions M_i^marg and M_j^marg

We compute the row M_i^marg and column M_j^marg marginal probability values as:

Equation 6: Used to calculate the M_i^marg row matrix.

Equation 7: Used to calculate the M_j^marg column matrix.

The interaction matrix M, the matrix from Equation 1 with dimensions N_Bi x N_Bj, is then composed as the transposition of M_i^marg repeated N_Bj times.

3. Compute the population conditional probability matrix M_C^cond

The conditional distribution of random variable X given random variable Y is computed as the joint probability values of X and Y divided by the marginal values of Y. Given a pair of biomarkers {B_i, B_j}, this element-by-element matrix division implies for our purposes that:

Equation 8: Calculate each conditional probability matrix M_C^cond, one per CB_r.

4. Compute the population discrete topological structure matrix M_C

The marginal M_C^marg, joint M_C^joint, and conditional M_C^cond probability matrices are computed for each CB_r within each population. Define the corresponding interaction matrix M as P_i divided element-wise by P_j from step 1, as in Equation 9:

Equation 9: Used to calculate the interaction matrix M.

We finally compute the DTS matrix M_C for each CB_r using Equation 10.

Equation 10: Used to calculate the DTS matrix M_C, one per CB_r.
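Steps 1 through 3 above can be sketched in plain Python. This is an illustrative sketch, not the author's code: it takes bin probabilities as G_b / |D_r|, builds the joint matrix as an outer product as described for Equation 5, and assumes the textbook definitions for the marginals (row and column sums of the joint matrix) and for the conditional matrix (joint divided by the marginal of Y), since Equations 6 through 8 are not reproduced here. The counts are the first 6 bins of Table 4 only, so these probabilities do not sum to 1. The DTS matrix of Equation 10 is not attempted.

```python
def bin_probabilities(counts, total):
    """Bin probability G_b / |D_r| for each bin b."""
    return [g / total for g in counts]


def joint_matrix(p_i, p_j):
    """Outer product: entry (i, j) = P_i[i] * P_j[j]; shape N_Bi x N_Bj."""
    return [[a * b for b in p_j] for a in p_i]


def row_marginals(joint):
    """Marginal of X (biomarker B_i): sum each row over j. Assumed form."""
    return [sum(row) for row in joint]


def column_marginals(joint):
    """Marginal of Y (biomarker B_j): sum each column over i. Assumed form."""
    return [sum(col) for col in zip(*joint)]


def conditional_matrix(joint):
    """P(X = i | Y = j) = joint[i][j] / marginal_Y[j].
    Columns whose marginal is zero (empty bins) are left at 0.0."""
    col = column_marginals(joint)
    return [[v / col[j] if col[j] > 0 else 0.0 for j, v in enumerate(row)]
            for row in joint]


# First 6 bins of Table 4 for Adenocarcinoma {IFNG, IL1A}; |D_r| = 44.
ifng_counts = [8, 17, 5, 1, 0, 1]
il1a_counts = [2, 2, 1, 2, 0, 0]
total = 44

p_ifng = bin_probabilities(ifng_counts, total)
p_il1a = bin_probabilities(il1a_counts, total)
joint = joint_matrix(p_ifng, p_il1a)
cond = conditional_matrix(joint)
```

In this sketch bin 5, which holds no IFNG values, yields an all-zero row in `joint`, which corresponds to a non-permissible state.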

We constructed a set of CB_r matrices representing the conditional probability relationship between all pairs of biomarkers within each population. Each CB_r combination has a characteristic vector of occupied bin states and empty bin states out of a possible number of bins N_r, as determined by the binning algorithm. Each CB_r combination now composes an object with the following properties, which will be used to determine the set of distinguishing biomarkers per clinical type: clinical type population C_t, biomarker pair {B_i, B_j}, bin size W_r, number of bins N_r, bin state vector [P_i', G_b], set of observed concentration values D_r, and matrices M_C, M_C^cond, M_C^joint, M_C^marg, and the two interaction matrices M.

d. Determine the distinguishing biomarkers

In the next step, we determine the distinguishing biomarkers per clinical type, allowing us to select the essential set of biomarkers for the sake of effective diagnosis. From the analysis above, each biomarker B_i is a member of a number of CB_r with different bin sizes, each of which corresponds to a D_r where several bin states were assigned to the concentration values of this B_i. Each bin state is characterized by its bin size, the binned concentration values, and their DTS and probability values. Bin states can be compared to each other because their properties are all constructed using the same algorithms. We can therefore form a coordinate system of the bin state probability values and the DTS values per biomarker for each clinical type, instead of comparing concentration values, as plotted in Figure 11 for the IFNG values of Never Smokers. The characteristic vertical bands of discrete bin state probability exist in every combination.

Each point represents a distinct bin state. Bin states that lie on the ordinate with zero probability are non-permissible bin states, since there are no data values within that bin range. They are shown to indicate the ratio of permissible to non-permissible bin states, in this case a ratio of 2 to 1. Figure 12 displays a close-up of the IFNG values for every clinical type together. Individual bin states to the right have higher probabilities, but there are nearly always states with the same (usually low) probability value separated by their DTS values. The arrows in Figure 12 point to the first four Adenocarcinoma IFNG values plotted, corresponding to Figure 7. Figure 13 (a-l) plots all 12 biomarkers in standard axes for every clinical type, including the validation datasets Healthy Serum and TestSet.

Figure 11: Plot of IFNG-only bin state Probability and DTS values for Never Smokers.

Figure 12: Plot of IFNG bin state Probability and DTS values for every clinical type.

Figure 13 (a-l): Bin state Probability and DTS values for every clinical type per biomarker.

These clinical type bin states are then represented in matrix form to determine their characteristic and distinguishing states. We formulate 7 x 12 individual integer matrices that represent all occupied (non-distinguishing) bin states, one for each clinical type biomarker. These integer matrices are constructed by first standardizing the bin state probability and DTS values. The probability values are multiplied by 100 and rounded to integers as percent values along the x-axis to form a standard 100 cells. The corresponding DTS values are used as exponents of e, the base of the natural logarithm, and rounded to integers, standardizing the y-axis to 256 cells and starting from the upper left corner. This process excludes only 8 values (all of them Never Smokers IL1A values) that are greater than e^5.5452 = 256, out of 3061 values. This forms a cellular structure where a whole integer in a cell indicates the presence of a probability-DTS value and 0 otherwise. For example, the (partial) cellular structure in Figure 14 (a) plots the first 4 values of IFNG for only Adenocarcinoma (id = 2).

An element-by-element XOR operation between the cellular structures of two clinical types of the same biomarker reveals which clinical type probability-DTS bin values are unique between those two clinical types. An elaboration of this logic is used to obtain the complete list of distinguishing bin states of the same biomarker among every clinical type. The objective is the same: to identify those matrix cells that are occupied by one and only one clinical type for that biomarker. To begin with, each clinical type is represented by a unique 2^n identifier starting at n = 1, as tabulated in Table 5.

Table 5: Clinical type identifiers used to distinguish bin states.

n   Clinical Type                   Unique 2^n Identifier
1   Adenocarcinoma                  2
2   Squamous                        4
3   Controls Never Smokers          8
4   Controls Smokers with COPD      16
5   Controls Smokers without COPD   32
6   Acute Lung Injury               64
7   Cystic Fibrosis                 128

We replace the occupied matrix integer values with their respective clinical type identifiers, and then add every matrix together per biomarker, so that each matrix cell contains zero, one, or more than one clinical type identifier. An element-by-element log_2 operation that returns a whole integer identifies a single clinical type occupying that cell. A cell that contains one of these unique values (2, 4, 8, etc.) identifies the only clinical type for that distinguishing biomarker in that cell. This method depends upon the fact that a binomial coefficient (m choose n) (mod 2) is computable using the operation n XOR m. Figure 14 (b) plots the integer matrix for biomarker IFNG for every clinical type, where Adenocarcinoma is distinguished by 3 (red) circled cells. The (blue) circled value of 34 indicates that both Adenocarcinoma (id = 2) and Smokers without COPD (id = 32) exist in the same cell. Figure 15 (a-l) plots all the biomarkers for every clinical type, where each diagram displays the first 24 rows and columns out of 100.
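The cell standardization and the power-of-two identifier scheme described above can be sketched as follows. This is a hedged illustration, not the dissertation's code: the function names, the sparse dictionary representation of the integer matrix, and the example cell coordinates are hypothetical; only the identifiers (Table 5), the scaling rules, and the value 34 = 2 + 32 come from the text. Note that Python's `round` rounds exact halves to the nearest even integer, which may differ from the rounding rule actually used.

```python
import math

# Table 5 identifiers.
TYPE_IDS = {
    "Adenocarcinoma": 2,
    "Squamous": 4,
    "Controls Never Smokers": 8,
    "Controls Smokers with COPD": 16,
    "Controls Smokers without COPD": 32,
    "Acute Lung Injury": 64,
    "Cystic Fibrosis": 128,
}


def to_cell(probability, dts):
    """Map a (probability, DTS) pair to an integer (row, column) cell.
    Column = probability as a rounded percent; row = round(e ** DTS).
    Returns None when the row exceeds the 256-cell axis."""
    row = round(math.exp(dts))
    if row > 256:
        return None
    return row, round(probability * 100)


def xor_cells(cells_a, cells_b):
    """Cells occupied by exactly one of two clinical types (element-wise XOR)."""
    return set(cells_a) ^ set(cells_b)


def combine(cells_by_type):
    """Add the identifier of every clinical type occupying each cell."""
    grid = {}
    for name, cells in cells_by_type.items():
        for cell in set(cells):
            grid[cell] = grid.get(cell, 0) + TYPE_IDS[name]
    return grid


def distinguishing_cells(grid):
    """Keep cells whose sum is a single power of two, i.e. log2 is a whole
    number, so exactly one clinical type occupies the cell."""
    id_to_type = {v: k for k, v in TYPE_IDS.items()}
    return {cell: id_to_type[total] for cell, total in grid.items()
            if total & (total - 1) == 0}


# Hypothetical occupied cells for one biomarker.
example = {
    "Adenocarcinoma": [(1, 5), (2, 7)],
    "Controls Smokers without COPD": [(1, 5)],
    "Squamous": [(3, 3)],
}
grid = combine(example)
unique = distinguishing_cells(grid)
```

Here the shared cell (1, 5) sums to 34, the same value circled in Figure 14 (b), and is therefore not distinguishing, while (2, 7) and (3, 3) each map back to a single clinical type.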

Figure 14 (a): Partial integer matrix for IFNG values of Adenocarcinoma.
Figure 14 (b): Partial integer matrix for biomarker IFNG for all 7 clinical types.
Figure 15 (a): Partial integer matrix for biomarker EGF for all 7 clinical types.
Figure 15 (b): Partial integer matrix for biomarker IFNG for all 7 clinical types.
Figure 15 (c): Partial integer matrix for biomarker IL1A for all 7 clinical types.
Figure 15 (d): Partial integer matrix for biomarker IL1B for all 7 clinical types.
Figure 15 (e): Partial integer matrix for biomarker IL2 for all 7 clinical types.
Figure 15 (f): Partial integer matrix for biomarker IL4 for all 7 clinical types.
Figure 15 (g): Partial integer matrix for biomarker IL6 for all 7 clinical types.
Figure 15 (h): Partial integer matrix for biomarker IL8 for all 7 clinical types.
Figure 15 (i): Partial integer matrix for biomarker IL10 for all 7 clinical types.
Figure 15 (j): Partial integer matrix for biomarker MCP1 for all 7 clinical types.
Figure 15 (k): Partial integer matrix for biomarker TNFA for all 7 clinical types.
Figure 15 (l): Partial integer matrix for biomarker VEGF for all 7 clinical types.

e. Plot and tabulate the distinguishing matrices

The 12 individual integer matrices produced for each clinical type can be consolidated into 3 dimensions to plot their distinguishing biomarkers with respect to the aforementioned probability cell and DTS cell states. Figure 16 (a-h) plots the distinguishing probability cell and DTS cell states for each clinical type separately, and Figure 17 plots them all together. Each clinical type has several stretches of distinguishing bin states, and these bin states are spread out over the total state space for every clinical type. Interestingly, Never Smokers displays the most spread or variation among all the clinical types, across a large number of states. We observe that the range of probability values is low in the Probability dimension: no single biomarker overwhelms the others in terms of frequency. It is also clear that the DTS coordinate effectively separates out the clinical types.

Figure 16 (a-h): Every clinical type distinguished by its Probability and DTS states.

Figure 17: All baseline clinical types distinguished by Probability and DTS states.

Different biomarkers have varying ability to distinguish among the clinical types. For example, Figure 18 (a-l) plots the distinguishing bin states for each of the 12 biomarkers within every clinical type. Each biomarker has a distinguishing topology of bin states. Figure 19 plots the baseline biomarkers distinguished by their Probability and DTS states, with the DTS values again used as exponents of e and rounded to integers. The biomarkers are close together in state space, though IL1A (blue) seems to stand apart. Together, these plots reveal which CB_r is the most effective for determining the clinical type of a patient.


Figure 18 (a-l): Each biomarker distinguished by Probability and DTS states within all baseline clinical types.


Figure 19: All baseline biomarkers distinguished by Probability and DTS states within all baseline clinical types.

Table 6 summarizes the distinguishing biomarkers per clinical type, the number of patients, the total number of bins for these distinguishing biomarkers, and the number of incorrectly assigned patients. All the patients were correctly assigned to their clinical type. The specific rules for making assignments are given in §F below. Table 7 tabulates which clinical types, and how many, are distinguished by each biomarker. Acute Lung Injury has the most distinguishing biomarkers (all 12) due to its large variation in patient measurement values. The biomarker pairs that distinguish among all 7 clinical types are: {IL2, VEGF}, {IL6, VEGF}, {IL6, TNFA}, {IL2, TNFA}, {EGF, MCP1}, {IL2, MCP1}, {EGF, IL10}, {EGF, IL2}, and {EGF, IFNG}. EGF and IL2 are the most frequent biomarkers in this list of pairs. However, since IL2 has very low concentration values, making it more difficult to measure accurately, it may not be a clinically effective biomarker choice.

Table 6: Distinguishing biomarkers per clinical type in Probability and DTS dimensions.

Clinical Type Classification C_t | N: Distinguishing Biomarkers (with at least 1 distinguishing bin state) | Total Patients | Total Bins | Incorrect Assignments
Adenocarcinoma | 7: IFNG IL1B IL2 IL6 IL10 MCP1 VEGF | 53 | 444 | 0
Squamous | 7: IFNG IL2 IL6 IL8 IL10 MCP1 TNFA | 44 | 1664 | 0
Never Smokers | 5: EGF IFNG MCP1 TNFA VEGF | 55 | 624 | 0
Smokers with COPD | 5: EGF IL10 MCP1 TNFA VEGF | 49 | 492 | 0
Smokers without COPD | 7: EGF IFNG IL2 IL4 IL10 TNFA VEGF | 53 | 386 | 0
Acute Lung Injury | 12: EGF IFNG IL1A IL1B IL2 IL4 IL6 IL8 IL10 MCP1 TNFA VEGF | 62 | 572 | 0
Cystic Fibrosis | 4: EGF IL1A IL2 IL6 | 27 | 360 | 0

Table 7: Number of distinguishing biomarker states n per clinical type C_t. A dash marks a biomarker with no distinguishing bin state for that clinical type.

C_t | EGF | IFNG | IL1A | IL1B | IL2 | IL4 | IL6 | IL8 | IL10 | MCP1 | TNFA | VEGF
Adenocarcinoma | - | 1 | - | 1 | 2 | - | 3 | - | 1 | 7 | - | 4
Squamous | - | 1 | - | - | 2 | - | 1 | 4 | 1 | 5 | 2 | -
Never Smokers | 6 | 1 | - | - | - | - | - | - | - | 6 | 2 | 5
Smokers with COPD | 5 | - | - | - | - | - | - | - | 1 | 7 | 1 | 7
Smokers without COPD | 5 | 2 | - | - | 2 | 1 | - | - | 1 | - | 1 | 3
Acute Lung Injury | 7 | 12 | 4 | 7 | 9 | 12 | 8 | 14 | 10 | 12 | 11 | 9
Cystic Fibrosis | 1 | - | 1 | - | 1 | - | 4 | - | - | - | - | -
Number of Clinical Types Distinguished Per Biomarker | 5 | 5 | 2 | 2 | 5 | 2 | 4 | 2 | 5 | 5 | 5 | 5

f. Assign a sample's clinical type

We can now assign an unknown patient sample to a clinical type. Assigning a unique bin number and bin probability for each sample biomarker value simply involves looking up the corresponding bin number in the known population probability list for that biomarker; the probability of a sample concentration value is the expected probability of its assigned bin. For an individual patient, a sample represents a snapshot from within the complete set of population states. The DTS computation must use the one available state, the sample's biomarker concentrations, to make probabilistic clinical type assignments. Whereas the population M_C matrices are computed for every biomarker combination, a single Q x Q matrix M_z is computed per sample. But the matrix formulation for a sample z in Equation 11 is the same as the population version in Equation 10.

Equation 11: Used to calculate sample DTS matrix M_z.

g. Develop the fitness function

Now that the probabilities and DTS values are known for every individual sample and clinical type population, and the distinguishing biomarkers are identified, we define a state similarity or fitness function between the biomarker states of each sample and the same biomarkers of each of the available clinical type populations. The purpose of the fitness function is to match each sample to its closest population state. There are three rules.

1) Assign each sample biomarker its bin number and probability by indexing its concentration value into the set of population bins, as done in step 2.

2) Select the closest clinical type coordinate value(s) from the set of possible population values for each bin. This fitness function must use the same coordinate(s) from the previous step, as either a single coordinate, such as the DTS values, or combined with another coordinate, such as probability. For example, the function below assigns a sample s_z to clinical type C_t as the minimum value of the difference between the sample's DTS value for biomarker B_j (for only the specified bin N_x) and all the analyzed population DTS values for biomarker B_j. This equation is repeated for all biomarkers.

Equation 12: Used to calculate the fitness function.

3) Since there are multiple biomarkers, a third rule is necessary to correctly assign a sample to a population as a function of how many biomarkers were correct or not, and/or whether they were distinguishing or not. We chose as the third rule the correct assignment of all but one distinguishing biomarker and all but two non-distinguishing biomarkers. Assigning a sample to a non-permissible bin state is considered an incorrect assignment, such as state 5 in Figure 7. A summary of the permissible and non-permissible bin states per biomarker is given in Table 8.

Table 8: Counts and percentages of permissible and non-permissible bin states.

Clinical Type | Biomarker | Bin count | Non-perm. bin count | Perm. bin count | % non-perm. bins | % perm. bins | Non-perm. / perm. bins | Non-perm. % > perm. %?
Adenocarcinoma | EGF | 13 | 1 | 12 | 7.7% | 92.3% | 0.1 | false
Adenocarcinoma | IFNG | 25 | 21 | 4 | 84.0% | 16.0% | 5.3 | true
Adenocarcinoma | IL1A | 21 | 17 | 4 | 81.0% | 19.0% | 4.3 | true
Adenocarcinoma | IL1B | 21 | 15 | 6 | 71.4% | 28.6% | 2.5 | true
Adenocarcinoma | IL2 | 37 | 30 | 7 | 81.1% | 18.9% | 4.3 | true
Adenocarcinoma | IL4 | 17 | 9 | 8 | 52.9% | 47.1% | 1.1 | true
Adenocarcinoma | IL6 | 65 | 51 | 14 | 78.5% | 21.5% | 3.6 | true
Adenocarcinoma | IL8 | 45 | 32 | 13 | 71.1% | 28.9% | 2.5 | true
Adenocarcinoma | IL10 | 25 | 18 | 7 | 72.0% | 28.0% | 2.6 | true
Adenocarcinoma | MCP1 | 121 | 91 | 30 | 75.2% | 24.8% | 3.0 | true
Adenocarcinoma | TNFA | 41 | 29 | 12 | 70.7% | 29.3% | 2.4 | true
Adenocarcinoma | VEGF | 13 | 0 | 13 | 0.0% | 100.0% | 0.0 | false
Squamous | EGF | 117 | 61 | 56 | 52.1% | 47.9% | 1.1 | true
Squamous | IFNG | 193 | 176 | 17 | 91.2% | 8.8% | 10.4 | true
Squamous | IL1A | 49 | 41 | 8 | 83.7% | 16.3% | 5.1 | true
Squamous | IL1B | 97 | 86 | 11 | 88.7% | 11.3% | 7.8 | true
Squamous | IL2 | 117 | 104 | 13 | 88.9% | 11.1% | 8.0 | true
Squamous | IL4 | 121 | 105 | 16 | 86.8% | 13.2% | 6.6 | true
Squamous | IL6 | 73 | 51 | 22 | 69.9% | 30.1% | 2.3 | true
Squamous | IL8 | 233 | 212 | 21 | 91.0% | 9.0% | 10.1 | true
Squamous | IL10 | 117 | 101 | 16 | 86.3% | 13.7% | 6.3 | true
Squamous | MCP1 | 225 | 180 | 45 | 80.0% | 20.0% | 4.0 | true
Squamous | TNFA | 157 | 136 | 21 | 86.6% | 13.4% | 6.5 | true
Squamous | VEGF | 165 | 137 | 28 | 83.0% | 17.0% | 4.9 | true
Never Smokers | EGF | 193 | 166 | 27 | 86.0% | 14.0% | 6.1 | true
Never Smokers | IFNG | 17 | 6 | 11 | 35.3% | 64.7% | 0.5 | false
Never Smokers | IL1A | 13 | 11 | 2 | 84.6% | 15.4% | 5.5 | true
Never Smokers | IL1B | 21 | 17 | 4 | 81.0% | 19.0% | 4.3 | true
Never Smokers | IL2 | 17 | 8 | 9 | 47.1% | 52.9% | 0.9 | false
Never Smokers | IL4 | 13 | 6 | 7 | 46.2% | 53.8% | 0.9 | false
Never Smokers | IL6 | 21 | 11 | 10 | 52.4% | 47.6% | 1.1 | true
Never Smokers | IL8 | 25 | 16 | 9 | 64.0% | 36.0% | 1.8 | true
Never Smokers | IL10 | 13 | 5 | 8 | 38.5% | 61.5% | 0.6 | false
Never Smokers | MCP1 | 133 | 95 | 38 | 71.4% | 28.6% | 2.5 | true
Never Smokers | TNFA | 25 | 10 | 15 | 40.0% | 60.0% | 0.7 | false
Never Smokers | VEGF | 133 | 107 | 26 | 80.5% | 19.5% | 4.1 | true
Smokers with COPD | EGF | 85 | 59 | 26 | 69.4% | 30.6% | 2.3 | true
Smokers with COPD | IFNG | 13 | 6 | 7 | 46.2% | 53.8% | 0.9 | false
Smokers with COPD | IL1A | 9 | 6 | 3 | 66.7% | 33.3% | 2.0 | true
Smokers with COPD | IL1B | 13 | 8 | 5 | 61.5% | 38.5% | 1.6 | true
Smokers with COPD | IL2 | 21 | 11 | 10 | 52.4% | 47.6% | 1.1 | true
Smokers with COPD | IL4 | 9 | 0 | 9 | 0.0% | 100.0% | 0.0 | false
Smokers with COPD | IL6 | 17 | 4 | 13 | 23.5% | 76.5% | 0.3 | false
Smokers with COPD | IL8 | 25 | 16 | 9 | 64.0% | 36.0% | 1.8 | true
Smokers with COPD | IL10 | 17 | 11 | 6 | 64.7% | 35.3% | 1.8 | true
Smokers with COPD | MCP1 | 133 | 98 | 35 | 73.7% | 26.3% | 2.8 | true
Smokers with COPD | TNFA | 21 | 6 | 15 | 28.6% | 71.4% | 0.4 | false
Smokers with COPD | VEGF | 129 | 101 | 28 | 78.3% | 21.7% | 3.6 | true
Smokers without COPD | EGF | 93 | 68 | 25 | 73.1% | 26.9% | 2.7 | true
Smokers without COPD | IFNG | 13 | 3 | 10 | 23.1% | 76.9% | 0.3 | false
Smokers without COPD | IL1A | 10 | 5 | 5 | 50.0% | 50.0% | 1.0 | false
Smokers without COPD | IL1B | 13 | 8 | 5 | 61.5% | 38.5% | 1.6 | true
Smokers without COPD | IL2 | 17 | 4 | 13 | 23.5% | 76.5% | 0.3 | false
Smokers without COPD | IL4 | 17 | 5 | 12 | 29.4% | 70.6% | 0.4 | false
Smokers without COPD | IL6 | 17 | 3 | 14 | 17.6% | 82.4% | 0.2 | false
Smokers without COPD | IL8 | 17 | 8 | 9 | 47.1% | 52.9% | 0.9 | false
Smokers without COPD | IL10 | 14 | 3 | 11 | 21.4% | 78.6% | 0.3 | false
Smokers without COPD | MCP1 | 13 | 0 | 13 | 0.0% | 100.0% | 0.0 | false
Smokers without COPD | TNFA | 21 | 8 | 13 | 38.1% | 61.9% | 0.6 | false
Smokers without COPD | VEGF | 141 | 120 | 21 | 85.1% | 14.9% | 5.7 | true
Acute Lung Injury | EGF | 93 | 72 | 21 | 77.4% | 22.6% | 3.4 | true
Acute Lung Injury | IFNG | 21 | 13 | 8 | 61.9% | 38.1% | 1.6 | true
Acute Lung Injury | IL1A | 17 | 11 | 6 | 64.7% | 35.3% | 1.8 | true
Acute Lung Injury | IL1B | 33 | 28 | 5 | 84.8% | 15.2% | 5.6 | true
Acute Lung Injury | IL2 | 29 | 22 | 7 | 75.9% | 24.1% | 3.1 | true
Acute Lung Injury | IL4 | 21 | 14 | 7 | 66.7% | 33.3% | 2.0 | true
Acute Lung Injury | IL6 | 53 | 42 | 11 | 79.2% | 20.8% | 3.8 | true
Acute Lung Injury | IL8 | 25 | 12 | 13 | 48.0% | 52.0% | 0.9 | false
Acute Lung Injury | IL10 | 17 | 8 | 9 | 47.1% | 52.9% | 0.9 | false
Acute Lung Injury | MCP1 | 177 | 145 | 32 | 81.9% | 18.1% | 4.5 | true
Acute Lung Injury | TNFA | 25 | 15 | 10 | 60.0% | 40.0% | 1.5 | true
Acute Lung Injury | VEGF | 61 | 33 | 28 | 54.1% | 45.9% | 1.2 | true
Cystic Fibrosis | EGF | 37 | 26 | 11 | 70.3% | 29.7% | 2.4 | true
Cystic Fibrosis | IFNG | 17 | 12 | 5 | 70.6% | 29.4% | 2.4 | true
Cystic Fibrosis | IL1A | 17 | 9 | 8 | 52.9% | 47.1% | 1.1 | true
Cystic Fibrosis | IL1B | 49 | 42 | 7 | 85.7% | 14.3% | 6.0 | true
Cystic Fibrosis | IL2 | 81 | 72 | 9 | 88.9% | 11.1% | 8.0 | true
Cystic Fibrosis | IL4 | 17 | 11 | 6 | 64.7% | 35.3% | 1.8 | true
Cystic Fibrosis | IL6 | 33 | 23 | 10 | 69.7% | 30.3% | 2.3 | true
Cystic Fibrosis | IL8 | 25 | 17 | 8 | 68.0% | 32.0% | 2.1 | true
Cystic Fibrosis | IL10 | 13 | 5 | 8 | 38.5% | 61.5% | 0.6 | false
Cystic Fibrosis | MCP1 | 13 | 2 | 11 | 15.4% | 84.6% | 0.2 | false
Cystic Fibrosis | TNFA | 17 | 7 | 10 | 41.2% | 58.8% | 0.7 | false
Cystic Fibrosis | VEGF | 41 | 26 | 15 | 63.4% | 36.6% | 1.7 | true
TestSet | EGF | 17 | 13 | 4 | 76.5% | 23.5% | 3.3 | true
TestSet | IFNG | 13 | 9 | 4 | 69.2% | 30.8% | 2.3 | true
TestSet | IL1A | 17 | 13 | 4 | 76.5% | 23.5% | 3.3 | true
TestSet | IL1B | 21 | 17 | 4 | 81.0% | 19.0% | 4.3 | true
TestSet | IL2 | 21 | 17 | 4 | 81.0% | 19.0% | 4.3 | true
TestSet | IL4 | 25 | 21 | 4 | 84.0% | 16.0% | 5.3 | true
TestSet | IL6 | 25 | 21 | 4 | 84.0% | 16.0% | 5.3 | true
TestSet | IL8 | 29 | 25 | 4 | 86.2% | 13.8% | 6.3 | true
TestSet | IL10 | 33 | 29 | 4 | 87.9% | 12.1% | 7.3 | true
TestSet | MCP1 | 33 | 29 | 4 | 87.9% | 12.1% | 7.3 | true
TestSet | TNFA | 37 | 33 | 4 | 89.2% | 10.8% | 8.3 | true
TestSet | VEGF | 37 | 33 | 4 | 89.2% | 10.8% | 8.3 | true
Healthy Serum | EGF | 0 | 0 | 0 | N/A | N/A | N/A | false
Healthy Serum | IFNG | 141 | 134 | 7 | 95.0% | 5.0% | 19.1 | true
Healthy Serum | IL1A | 57 | 50 | 7 | 87.7% | 12.3% | 7.1 | true
Healthy Serum | IL1B | 17 | 13 | 4 | 76.5% | 23.5% | 3.3 | true
Healthy Serum | IL2 | 0 | 0 | 0 | N/A | N/A | N/A | false
Healthy Serum | IL4 | 21 | 17 | 4 | 81.0% | 19.0% | 4.3 | true
Healthy Serum | IL6 | 21 | 14 | 7 | 66.7% | 33.3% | 2.0 | true
Healthy Serum | IL8 | 21 | 16 | 5 | 76.2% | 23.8% | 3.2 | true
Healthy Serum | IL10 | 17 | 13 | 4 | 76.5% | 23.5% | 3.3 | true
Healthy Serum | MCP1 | 41 | 35 | 6 | 85.4% | 14.6% | 5.8 | true
Healthy Serum | TNFA | 45 | 39 | 6 | 86.7% | 13.3% | 6.5 | true
Healthy Serum | VEGF | 69 | 63 | 6 | 91.3% | 8.7% | 10.5 | true

h. Differences between the binning algorithm and various clustering methods

The binning algorithm maximizes the total number of bins for the sake of higher data resolution, while minimizing the number of empty bins for the sake of reducing the number of bin states.
This approach is distinct from most standard clustering methods, such as connectivity-based clustering (e.g., hierarchical clustering) or centroid-based clustering (e.g., k-means clustering), which seek to minimize the number of partitions of data set D into disjoint subsets [56]. The chosen clustering algorithm must satisfy two necessary constraints: that we know the number of clusters k beforehand as simply the number of clinical types, and that each n-dimensional point belongs to exactly one clinical type cluster (i.e., strict partitioning clustering). In the Euclidean case, each cluster has a well-defined centroid (i.e., an average across all points in the cluster), but as noted previously, the scalability of measures in Euclidean spaces is generally poor as dimensionality increases. Following Skillicorn [68], consider a dataset with Q = 12 biomarker attributes in a 12-space, and suppose these attributes are divided into C = 7 clinical types. Considering those records whose attribute values are equally likely to be one of these C types, what is the likelihood that such a record will be close to the origin in the Q-dimensional space? This happens only if all its attribute values are in the close-to-zero category, and the chance of this is $(1/C)^Q$, which is $(1/7)^{12}$, or about $7.2 \times 10^{-11}$. The same analysis applies for the likelihood that two records are close to each other.

i. Results from other approaches

During these initial experiments, we found that methods based upon averaging did not classify the baseline clinical types very well. We first developed a naive Bayes classifier program trained on the bin number, probability, and DTS values of the baseline (non-excluded) data set. The 33 excluded (non-baseline) patient samples were then processed, but only 17.68% of these samples were correctly assigned to their clinical type.

Following Blum, Hopcroft, and Kannan [71, ch. 5], we also investigated whether a Markov chain analysis would expose biomarker patterns or structure. The analysis was initially appealing because a Markov chain represents a stochastic process with a finite set of states where, for each pair of states x and y, there is a transition probability p_xy of going from state x to state y such that, for each x, $\sum_y p_{xy} = 1$. The fundamental theorem of Markov chains asserts [79] that for a connected Markov chain, the long-term average probability distribution converges to a limit probability vector x which satisfies the equations $xP = x$ and $\sum_i x_i = 1$. Furthermore, if the corresponding Markov graph is strongly connected, then the fraction of time the random Markov walk spends at the vertices of the graph converges to a stationary probability distribution [71, p. 160], which may correspond to a clinical type. However, Markov chains are used to model situations where all the information of the system necessary to predict the future can be encoded in the current state [71, p. 141], and without longitudinal patient data, this would be difficult to justify. To draw a uniform random sample from a d-dimensional set, the set is required to be convex so that the Markov chain technique will converge quickly to its stationary probability [71, p. 150]. Such a convex set is defined as a region where, for every pair of points within the region, every point on the straight line segment that joins the pair of points is also within the region. However, the topological analysis showed that various gaps or holes were present in each of the data sets except TestSet, thereby violating convexity. For these reasons, the Markov chain approach was abandoned.

Cosine similarity is sometimes considered more appropriate for high-dimensional spaces, where the space spanned by the variables is considered a vector space instead of a Cartesian space. The high-dimensional vector space represents a state space for systems with many degrees of freedom or for many trials of a random variable. Then the cosine of the angle between the vectors for 2 points a and b is given by the dot product of a and b divided by their vector norms [74, p. 17]:

$\cos\theta = \dfrac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$

The best-fit vector can then be used as an alternate ranking method; that is, the unit vector u for which the sum of squared perpendicular distances of all the vectors to u is minimized.
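The two quantities just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used in these experiments: the function names and the four 2-D patient vectors are hypothetical, and the best-fit unit vector is obtained from a singular value decomposition, which is one standard way to minimize the sum of squared perpendicular distances to a line through the origin.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_fit_unit_vector(vectors):
    """Unit vector u minimizing the summed squared perpendicular distance
    of the rows of `vectors` to the line through the origin along u.
    This is the first right singular vector of the data matrix."""
    X = np.asarray(vectors, dtype=float)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]
    # The SVD fixes u only up to sign; point it along the mean vector.
    if u @ X.mean(axis=0) < 0:
        u = -u
    return u

# Hypothetical 2-D vectors, one per patient (not data from this study).
patients = np.array([[1.0, 0.9], [2.0, 2.1], [3.0, 2.9], [0.5, 0.6]])
u = best_fit_unit_vector(patients)

# Rank patients from most to least aligned with the best-fit vector.
ranking = sorted(range(len(patients)),
                 key=lambda i: -cosine_similarity(patients[i], u))
```

Because cosine similarity ignores vector magnitude, two patients with very different DTS magnitudes but the same direction rank identically, which is one reason a ranking of this kind can fail to separate clinical types.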
In these experiments, every biomarker value for every clinical type population is plotted as a set of vectors emanating from the origin where, for bin probabilities greater than 0, the bin probability maps to the vector angle (in radians) and the DTS value maps to the vector magnitude, as shown below in Figure 20 (a-b) for two Adenocarcinoma biomarkers.

Figure 20 (a-b): Best-fit vectors for Adenocarcinoma [EGF, IFNG].


Each black vector represents an individual patient. Each clinical type biomarker is then represented by the (red) averaged best-fit vector. Ideally, the biomarker vector for each patient would be closest to the best-fit vector of their known clinical type. The bin probability and DTS value vector is calculated for each patient's set of biomarkers. The cosine similarity is calculated between each patient and each clinical type per biomarker and then ranked, when the patient had data for that biomarker. Table 9 summarizes the results. Only AcuteLungInjury patients had consistently high clinical type assignments, as well as TestSet patients because of their fabricated nature. Percentages greater than 50% are highlighted. We conclude that cosine similarity is not an effective way of assigning patients to their clinical types.

Table 9: Using cosine similarity to assign baseline patients to a clinical type.

Clinical Type | Biomarker | Correct | Total | Percent
Adenocarcinoma | EGF | 1 | 21 | 4.76%
Adenocarcinoma | IFNG | 2 | 8 | 25.00%
Adenocarcinoma | IL1A | 3 | 6 | 50.00%
Adenocarcinoma | IL1B | 0 | 5 | 0.00%
Adenocarcinoma | IL2 | 1 | 7 | 14.29%
Adenocarcinoma | IL4 | 1 | 7 | 14.29%
Adenocarcinoma | IL6 | 0 | 11 | 0.00%
Adenocarcinoma | IL8 | 4 | 13 | 30.77%
Adenocarcinoma | IL10 | 0 | 9 | 0.00%
Adenocarcinoma | MCP1 | 1 | 32 | 3.13%
Adenocarcinoma | TNFA | 4 | 10 | 40.00%
Adenocarcinoma | VEGF | 0 | 28 | 0.00%
Squamous | EGF | 1 | 12 | 8.33%
Squamous | IFNG | 0 | 4 | 0.00%
Squamous | IL1A | 1 | 4 | 25.00%
Squamous | IL1B | 1 | 6 | 16.67%
Squamous | IL2 | 1 | 7 | 14.29%
Squamous | IL4 | 1 | 8 | 12.50%
Squamous | IL6 | 3 | 14 | 21.43%
Squamous | IL8 | 4 | 13 | 30.77%
Squamous | IL10 | 1 | 7 | 14.29%
Squamous | MCP1 | 1 | 30 | 3.33%
Squamous | TNFA | 1 | 12 | 8.33%
Squamous | VEGF | 6 | 13 | 46.15%
NeverSmokers | EGF | 2 | 27 | 7.41%
NeverSmokers | IFNG | 7 | 11 | 63.64%
NeverSmokers | IL1A | 2 | 2 | 100.00%
NeverSmokers | IL1B | 1 | 4 | 25.00%
NeverSmokers | IL2 | 2 | 9 | 22.22%
NeverSmokers | IL4 | 0 | 7 | 0.00%
NeverSmokers | IL6 | 1 | 10 | 10.00%
NeverSmokers | IL8 | 2 | 9 | 22.22%
NeverSmokers | IL10 | 1 | 8 | 12.50%
NeverSmokers | MCP1 | 2 | 38 | 5.26%
NeverSmokers | TNFA | 0 | 15 | 0.00%
NeverSmokers | VEGF | 5 | 26 | 19.23%
SmokersWithCOPD | EGF | 13 | 26 | 50.00%
SmokersWithCOPD | IFNG | 1 | 7 | 14.29%
SmokersWithCOPD | IL1A | 1 | 3 | 33.33%
SmokersWithCOPD | IL1B | 1 | 5 | 20.00%
SmokersWithCOPD | IL2 | 2 | 10 | 20.00%
SmokersWithCOPD | IL4 | 6 | 9 | 66.67%
SmokersWithCOPD | IL6 | 1 | 13 | 7.69%
SmokersWithCOPD | IL8 | 1 | 9 | 11.11%
SmokersWithCOPD | IL10 | 1 | 6 | 16.67%
SmokersWithCOPD | MCP1 | 2 | 35 | 5.71%
SmokersWithCOPD | TNFA | 3 | 15 | 20.00%
SmokersWithCOPD | VEGF | 16 | 28 | 57.14%
SmokersWithoutCOPD | EGF | 5 | 25 | 20.00%
SmokersWithoutCOPD | IFNG | 4 | 10 | 40.00%
SmokersWithoutCOPD | IL1A | 1 | 5 | 20.00%
SmokersWithoutCOPD | IL1B | 1 | 5 | 20.00%
SmokersWithoutCOPD | IL2 | 2 | 13 | 15.38%
SmokersWithoutCOPD | IL4 | 6 | 12 | 50.00%
SmokersWithoutCOPD | IL6 | 0 | 14 | 0.00%
SmokersWithoutCOPD | IL8 | 0 | 9 | 0.00%
SmokersWithoutCOPD | IL10 | 7 | 11 | 63.64%
SmokersWithoutCOPD | MCP1 | 6 | 13 | 46.15%
SmokersWithoutCOPD | TNFA | 4 | 13 | 30.77%
SmokersWithoutCOPD | VEGF | 0 | 21 | 0.00%
AcuteLungInjury | EGF | 3 | 23 | 13.04%
AcuteLungInjury | IFNG | 14 | 17 | 82.35%
AcuteLungInjury | IL1A | 4 | 8 | 50.00%
AcuteLungInjury | IL1B | 9 | 11 | 81.82%
AcuteLungInjury | IL2 | 11 | 13 | 84.62%
AcuteLungInjury | IL4 | 13 | 16 | 81.25%
AcuteLungInjury | IL6 | 17 | 22 | 77.27%
AcuteLungInjury | IL8 | 16 | 21 | 76.19%
AcuteLungInjury | IL10 | 13 | 16 | 81.25%
AcuteLungInjury | MCP1 | 30 | 45 | 66.67%
AcuteLungInjury | TNFA | 17 | 21 | 80.95%
AcuteLungInjury | VEGF | 0 | 28 | 0.00%
CysticFibrosis | EGF | 2 | 11 | 18.18%
CysticFibrosis | IFNG | 0 | 5 | 0.00%
CysticFibrosis | IL1A | 6 | 8 | 75.00%
CysticFibrosis | IL1B | 1 | 7 | 14.29%
CysticFibrosis | IL2 | 1 | 9 | 11.11%
CysticFibrosis | IL4 | 2 | 6 | 33.33%
CysticFibrosis | IL6 | 0 | 10 | 0.00%
CysticFibrosis | IL8 | 0 | 8 | 0.00%
CysticFibrosis | IL10 | 0 | 8 | 0.00%
CysticFibrosis | MCP1 | 7 | 11 | 63.64%
CysticFibrosis | TNFA | 0 | 10 | 0.00%
CysticFibrosis | VEGF | 0 | 15 | 0.00%
TestSet | EGF | 4 | 4 | 100.00%
TestSet | IFNG | 3 | 4 | 75.00%
TestSet | IL1A | 2 | 4 | 50.00%
TestSet | IL1B | 0 | 4 | 0.00%
TestSet | IL2 | 4 | 4 | 100.00%
TestSet | IL4 | 3 | 4 | 75.00%
TestSet | IL6 | 4 | 4 | 100.00%
TestSet | IL8 | 4 | 4 | 100.00%
TestSet | IL10 | 4 | 4 | 100.00%
TestSet | MCP1 | 4 | 4 | 100.00%
TestSet | TNFA | 4 | 4 | 100.00%
TestSet | VEGF | 4 | 4 | 100.00%


CHAPTER V: RESULTS AND VALIDATION STUDIES

5.1 Distinguishing Biomarkers and the Measure of Similarity

Different combinations of the various coordinate dimensions produce different distinguishing biomarker sets, both in the number and type of individual biomarkers and in the number of unique states per biomarker. Using single biomarkers, single coordinates, whether log_e[concentration], probability, or DTS, provided little separation value. As given in Table 6, the Probability and DTS dimensions distinguish among all 7 baseline clinical types when using all 12 biomarkers. These are the minimum coordinates needed to separate every clinical type and to measure their similarity or closeness. These conditionally dependent biomarkers and their bin states distinguish the baseline types in the aggregate; that is, given a certain set of biomarkers and clinical types processed together. For example, if the clinical goal is only to distinguish among patients with either Adenocarcinoma or Squamous cell carcinoma, then the essential set of distinguishing biomarker bins is different between these two clinical types compared to the baseline clinical types. The investigator still has the choice, given biomarker availability and cost, to decide which set of biomarkers to use in subsequent investigations, confident that unknown clinical type patients are correctly assigned with a high probability.

5.2 10% Study Validation

10% of the original patients from each clinical type, a total of 33, were excluded as a separate data set prior to processing.
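A per-type exclusion of this kind is a stratified holdout. The sketch below shows one way such a split could be drawn; it is an assumption-laden illustration rather than the procedure actually used here. The function name, the random seed, the rounding rule, and the two-type cohort are all hypothetical, and the sketch is not expected to reproduce the 33 excluded patients.

```python
import random
from collections import defaultdict

def stratified_holdout(samples, fraction=0.10, seed=0):
    """Split (patient_id, clinical_type) pairs into baseline and excluded
    sets, drawing about `fraction` of each clinical type at random."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for patient_id, clinical_type in samples:
        by_type[clinical_type].append(patient_id)

    baseline, excluded = [], []
    for clinical_type, ids in by_type.items():
        ids = sorted(ids)
        rng.shuffle(ids)
        # Assumed rounding rule: nearest integer, at least one per type.
        k = max(1, round(fraction * len(ids)))
        excluded += [(pid, clinical_type) for pid in ids[:k]]
        baseline += [(pid, clinical_type) for pid in ids[k:]]
    return baseline, excluded

# Hypothetical cohort: 30 patients of type "A" and 20 of type "B".
cohort = [(f"A{i}", "A") for i in range(30)] + [(f"B{i}", "B") for i in range(20)]
baseline, excluded = stratified_holdout(cohort)
```

Drawing the holdout within each clinical type keeps small populations, such as Cystic Fibrosis, represented in the excluded set.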
We ignore the known clinical types of these samples during processing but use the sample's observed concentration value as a lookup into the permissible bin states and probability values of every baseline clinical type, as calculated in §4.3:C. Every excluded patient, except one Cystic Fibrosis patient who was assigned to Acute Lung Injury, was correctly assigned to their known clinical type, as listed in Table 10. We account for this incorrect assignment by the tiny sample size of the excluded Cystic Fibrosis patients, which was the smallest size to begin with. Although there is some overlap, processing the excluded data set did not produce the same set of distinguishing biomarkers as the baseline biomarkers. The number of common biomarkers between the baseline and excluded data sets (a), the number of distinguishing biomarkers in the baseline data set (b), and the number of distinguishing biomarkers in the excluded data set (c) are given in the Common Biomarkers column.

Table 10: Distinguishing biomarkers of excluded patients in Probability and DTS dimensions.

Clinical Type | N: Distinguishing Biomarkers | Patients | Incorrect Assignment | Common Biomarkers a/b/c
Adenocarcinoma | 9: EGF IFNG IL10 IL2 IL6 IL8 MCP1 TNFA VEGF | 5 | 0 | 6/7/9
Squamous | 9: IFNG IL10 IL1B IL2 IL4 IL6 IL8 MCP1 VEGF | 4 | 0 | 6/7/9
Controls Never Smokers | 6: EGF IL1B IL2 IL6 MCP1 VEGF | 5 | 0 | 3/5/6
Controls Smokers with COPD | 6: EGF IFNG IL4 IL6 MCP1 VEGF | 5 | 0 | 3/5/6
Controls Smokers without COPD | 6: IL2 IL4 IL6 MCP1 TNFA VEGF | 5 | 0 | 4/7/6
Acute Lung Injury | 9: EGF IFNG IL10 IL2 IL6 IL8 MCP1 TNFA VEGF | 6 | 0 | 9/12/9
Cystic Fibrosis | 2: MCP1 VEGF | 3 | 1 | 0/4/2

5.3 Healthy Serum Validation

Whereas the baseline clinical types were collected by standard 2-D PAGE gel electrophoresis protocols, the 7 Healthy Serum measurements come from a completely different sampling protocol and experimental design (Luminex fluorescent bead-based immunoassay [20]). However, these measurements are given as means where the number of observations is unknown for various patient classes: all male patients, females, Caucasians, non-Caucasians, day 0 median values, day 7 median values, and the internal control bridge sample means. Data was not collected for the EGF or IL2 biomarkers, but included the other 10 biomarkers. When processed along with the baseline data sets, all 7 samples are correctly assigned to their Healthy Serum clinical type. A similar Luminex experiment was performed by Biancotto et al. [40] on paired samples taken 7 days apart from 144 healthy donors. Only 5 out of 27 cytokines showed significant differences in concentrations between the paired samples, two of which were included in these experiments: MCP1 and VEGF. From their averaged data, they found higher variations in cytokine levels between individuals than were observed for samples obtained one week apart from the same patients.

Even when ignoring the differences in experimental protocol, several validation issues arise for this data set. How does processing averaged data, typical of nearly all published data, affect the outcome? If individual patient measurements are not available, the averaged data naturally regresses to the population mean values, losing much variation relative to, say, Never Smokers. We simply treated each sample as a separate patient because of their differences. We also expected Healthy Serum (HS) to somehow resemble the Never Smokers (NS) control group in terms of distinguishing biomarkers. The new data can be processed in several different ways. Is it better to replace the Never Smokers data with the Healthy Serum data in a different processing run and then compare the differences, or is it better to simply add the Healthy Serum data to the other baseline data sets? This decision involves making 3 different comparisons, where each distinguishing biomarker set is given in (a), the clinical type classification accuracy in (b), and the common biomarker set in (c).

1. Process only the Healthy Serum and Never Smokers data sets in a pair-wise comparison.
   a. HS={IL1A IL4 IL6 IL10 TNFA}, NS={EGF IFNG IL2 IL4 IL6 IL8 IL10 MCP1 TNFA}
   b. 6 of the 7 HS patients are correctly assigned as HS.
   c. 4 common biomarkers: {IL4 IL6 IL10 TNFA}
2. Process a new baseline data set where Healthy Serum data replaces Never Smokers data.
   a. HS={IL1A IL1B}, original NS={EGF IFNG MCP1 TNFA VEGF}
   b. All 7 HS patients are correctly assigned as HS.
   c. No common biomarkers.
3. Process a new baseline data set where Healthy Serum data is added to the original baseline data set.
   a. HS={IL1A}, NS={EGF IFNG IL1B IL2 IL6 IL8 IL10 MCP1 TNFA}
   b. All 7 HS patients are correctly assigned as HS.
   c. No common biomarkers.

In comparison 1, Healthy Serum and Never Smokers share the common biomarker set {IL4 IL6 IL10 TNFA}. In comparison 2, when Healthy Serum replaces the Never Smokers data, Healthy Serum is distinguished by {IL1A IL1B}. In comparison 3, when Healthy Serum is added to the original baseline data set, it is distinguished only by the IL1A biomarker. Since every Healthy Serum patient is correctly assigned, the classification process benefits by adding Healthy Serum to the baseline clinical types, even though Healthy Serum ends up with just one distinguishing biomarker. This illustrates the difficulty of determining an essential identifying set of biomarkers to distinguish among various clinical types. Multiple identifying sets of biomarkers can exist, so long as a set of biomarkers accurately assigns all or nearly all patients.

5.4 Population Variability and Conditional Structure

There is often a large variation in concentration values within clinical type populations, which produces a large degree of overlap in the concentration space. There are fewer distinguishing biomarkers overall with large overlap, and these distinguishing biomarkers differ when comparing among clinical types. That is, comparing clinical type A with clinical type B produces a different set of distinguishing biomarkers than comparing clinical type A with clinical type C. What distinguishes one clinical type from another are their conditional probabilities materialized in the complete set of occupied bin states, and there are more non-distinguishing bin states than distinguishing bin states in every population. In §4.3:G, the fitness function compares states in terms of the complete set of occupied bin states per biomarker per clinical type and not just the set of distinguishing bin states. The model is necessarily coarse-grained because it does not capture all the relevant state transitions, model parameters, or regulation mechanisms. The coarse-grained conditional structure of the biomarkers corresponds to the distinguishing set of biomarkers in each computed combination, but the fine-grained conditional structure is only revealed by the complete set of occupied bin states.

5.5 Strategies for Modeling Measurement Variation

Four modeling decisions impact the control of measurement variation. First, the model processes all individual patient concentration values and not their population averages. Second, it can still be computationally thrifty to remove extreme outliers past a declared range of standard deviations.
Third, the choice of baseline clinical types and their measured biomarkers determines relative concentration ranges. Different combinations of clinical types and biomarkers produce different assignments. Finally, the biomarker variables are analyzed in terms of relative protein abundances, in ratios that are less variable than their individual biomarker measurements.

5.6 NCI Study Validation

The second validation study processed the larger data set generated by the NCI Maryland Cancer/Control [80] and NCI PLCOC Cancer/Control studies through the same set of programs and algorithms that produced the baseline results. The NIH Clinical Center in Bethesda, Maryland is the research hospital for the NIH, the federal government's principal agency for biomedical research. These studies had more patients than the baseline data set with a slightly different mix of biomarkers, and used electrochemiluminescence immunoassay (ECLIA) plates instead of gel electrophoresis to measure the cytokine levels in blood serum. The Maryland study did not collect EGF, IL1A, IL2, MCP1, or VEGF measurements but added IL5, IL12, and GMCSF biomarker concentration values. The set of NCI PLCOC biomarkers is a proper subset of the NCI Maryland biomarker set. Table 11 summarizes the NCI data sources. The cytokine biomarker granulocyte-macrophage colony-stimulating factor (GMCSF) is a monomeric glycoprotein secreted by macrophages, T cells, mast cells, NK cells, endothelial cells, and fibroblasts. We used the following tests to determine how well the model distinguished among these clinical types:

1. Test 1: between the NCI Maryland Cancer and NCI Maryland Control patients.
2. Test 2: between the NCI PLCOC Cancer and NCI PLCOC Control patients.
3. Test 3: aggregate all 4 types together.
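The outlier removal named as the second modeling decision in §5.5 can be sketched as follows. This is a minimal illustration, not the program used for these studies: the function name and the synthetic concentration values are hypothetical, the cutoff of 6 standard deviations is taken from the "Outliers > 6 S.D." column of Table 11, and a single pass over raw (not log-transformed) values is assumed.

```python
from statistics import mean, stdev

def remove_outliers(values, max_sd=6.0):
    """Return (kept, dropped), where dropped holds the values lying more
    than `max_sd` sample standard deviations from the mean."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        return list(values), []
    kept = [v for v in values if abs(v - m) <= max_sd * s]
    dropped = [v for v in values if abs(v - m) > max_sd * s]
    return kept, dropped

# Hypothetical concentrations: 60 ordinary values and one extreme reading.
concentrations = [9.0, 10.0, 11.0] * 20 + [1000.0]
kept, dropped = remove_outliers(concentrations)
```

A cutoff this wide only ever fires on reasonably large samples, because in a sample of n values no single point can lie more than (n - 1)/sqrt(n) sample standard deviations from the mean; for 6 standard deviations that requires about 39 or more values.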


Table 11: Data sources for the NIH Maryland and PLCOC Studies.

Clinical Type | Patients | Outliers > 6 S.D. | Missing Data | Incorrect Assignment | N: Distinguishing Biomarkers
NCI Maryland Cancer | 355 | 20 | 5 | 0 | 10: IFNG, IL1B, IL4, IL5, IL6, IL8, IL10, IL12, TNFA, GMCSF
NCI Maryland Control | 466 | 30 | 5 | 0 | 10: IFNG, IL1B, IL4, IL5, IL6, IL8, IL10, IL12, TNFA, GMCSF
NCI PLCOC Cancer | 591 | 9 | 8 | 0 | 4: IL1B, IL6, IL8, TNFA
NCI PLCOC Control | 650 | 7 | 8 | 0 | 4: IL1B, IL6, IL8, TNFA

Every NCI Maryland Control patient was correctly assigned to their clinical type when processed through the model. 3 out of 355 NCI Maryland Cancer patients had 1 IL8 biomarker assigned to the control group, and 6 had 1 IL10 biomarker improperly assigned. Since the third rule allows one incorrect biomarker assignment per patient, we consider all the cancer patients to be correctly assigned to their group. The set of distinguishing biomarkers for the Maryland Cancer and Control groups is the same. Every patient in the NCI PLCOC Cancer and NCI PLCOC Control groups was correctly assigned, and the set of distinguishing biomarkers for these groups is the same.
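The way the second and third assignment rules interact in this result can be sketched in Python. This is only an illustration under stated assumptions, not the fitness function of Equation 12: the function names, the population DTS values, and the example patient are hypothetical. Rule 2 is read here as choosing the clinical type whose population DTS values lie closest to the sample's DTS value, and rule 3 as tolerating at most one incorrect distinguishing biomarker and at most two incorrect non-distinguishing biomarkers.

```python
def closest_type(sample_dts, population_dts):
    """Rule 2 sketch: return the clinical type whose population DTS values
    (for one biomarker bin) contain the value closest to the sample's."""
    return min(
        population_dts,
        key=lambda t: min(abs(sample_dts - v) for v in population_dts[t]),
    )

def passes_third_rule(per_biomarker_correct, distinguishing):
    """Rule 3 sketch: accept the assignment if at most one distinguishing
    and at most two non-distinguishing biomarkers were assigned wrongly."""
    wrong_distinguishing = sum(
        1 for b, ok in per_biomarker_correct.items()
        if not ok and b in distinguishing
    )
    wrong_other = sum(
        1 for b, ok in per_biomarker_correct.items()
        if not ok and b not in distinguishing
    )
    return wrong_distinguishing <= 1 and wrong_other <= 2

# Hypothetical population DTS values for one biomarker bin.
population_dts = {"Cancer": [4.0, 7.5], "Control": [1.0, 2.0]}
assigned = closest_type(6.9, population_dts)

# A cancer patient whose IL8 biomarker alone matched the control group.
distinguishing = {"IFNG", "IL1B", "IL4", "IL5", "IL6", "IL8", "IL10",
                  "IL12", "TNFA", "GMCSF"}
patient = {b: True for b in distinguishing}
patient["IL8"] = False
accepted = passes_third_rule(patient, distinguishing)
```

Under this reading, the 3 patients with one misassigned IL8 value and the 6 with one misassigned IL10 value still pass, which matches the zero incorrect assignments reported in Table 11.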


CHAPTER VI: TOPOLOGICAL ANALYSIS

6.1 Motivation for a Topological Analysis

The binning algorithm exposes the conditional relationships and dependencies among the concentration data by discretizing the concentration values of paired biomarkers into static DTS bin states. We ask whether it is possible to structure the interactions of protein biomarker concentration values by formalizing the connectedness and disconnectedness of these conditional values in concentration space. A topological space is a set of points and their neighborhoods that provides a general framework for the study of continuity, connectedness, and convergence, where a topological space is said to be connected if it is not the union of two disjoint nonempty open sets [54]. This approach is motivated by the fact that many pathological conditions are characterized by significant and sudden changes in a common set of biochemical variables [48], at which point a catastrophic response affects how the biochemical concentrations functionally depend upon the variables. We define a topology as a dynamic system of sets that describes the connectivity of each set, where a topology of a point cloud (defined as a finite metric space: a finite set of points equipped with a notion of distance) is a collection of subsets that implicitly defines which points are near each other without necessarily specifying a numeric distance between them [20]. Topological objects can then be grouped into classes that have the same standard measure or invariant describing their connectivity and continuity. These topological summaries are robust to stochastic perturbations and can even be used as random variables in statistical inference [49].

6.2 Topological Homology and Betti Numbers

The most useful topological invariants involve homology, which defines a sequence of groups describing the connectedness of a topological space; in short, homology measures connectivity.
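As a small illustration of homology measuring connectivity, consider a graph, which is a 1-dimensional simplicial complex. Its zeroth Betti number counts the connected components, and its first Betti number (the cyclomatic number) counts the independent loops, computed as edges minus vertices plus components. The sketch below is an assumption-labeled toy: the function name and the example graph are hypothetical, and it does not reproduce the filtration computations applied to the clinical data.

```python
def betti_numbers_of_graph(vertices, edges):
    """Return (b0, b1) for an undirected graph with distinct vertices.
    b0 = number of connected components (found with union-find),
    b1 = number of independent cycles = |E| - |V| + b0."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the component representative.
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    components = len(vertices)
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1

    b0 = components
    b1 = len(edges) - len(vertices) + b0
    return b0, b1

# Hypothetical example: a triangle plus a separate single edge.
vertices = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
b0, b1 = betti_numbers_of_graph(vertices, edges)
```

Here the triangle contributes the one loop and the detached edge contributes the second component, so two graphs with different (b0, b1) pairs cannot be topologically equivalent.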
For example, the n dimensional holes in a space are represented by

PAGE 123

110 the homology count for each dimension n That is, homology measures holes and holes are things you can run a loop around. Topologically characterizing the difference between functional and diseased concentration spaces implies that clinical type samples reveal topologically non equivalent conc entration surfaces. This is predicated upon the mathematical definition of configuration space in C n X as the set of n element subsets of a topological space X The advantage of this type of space is that if two systems have the same topological structures (i.e., homeomorphisms) in their configuration spaces then the two systems can be qualitatively considered the same [20]. The simplest topological measure of connectedness is called a Betti number Betti numbers are used to distinguish topological spaces based on the connectivity of n dimensional simplicial complexes a topological space constructed by the union of points, line segments, triangles, and th eir n dimensional counterparts [55] Informally, th e k th Betti number refers to the number of k dimensional holes on a topological surface, where 0 symbolizes the number of connected components. For example, t he cyclomatic number of a graph the first Betti number is defined as 0 The k th Be tti number tells us the maximum number of cuts that must be made before separating a surface into two pieces or 0 cycles, 1 cycles, etc. Informally, the k th Betti number refers to the number of unconnected k dimensional surfaces or the number of k dimensional holes [19]. The Betti numbers of an object embedded in 3 are respectively: 0 the number of connected parts separated by gaps or holes 1 the number of circles surrounding tunnels, 2 the number of shells surrounding holes The set of Betti number s that are iteratively calculated for each clinical type biomarker set combination CB r provides a n alternate way of comparing data combinations so long as we uniformly determine when to stop computing the topological filtration val-

PAGE 124

111 ue of the data under co mparison. A filtration on a complex X is a collection of sub complexes of X such that whenever This filtration value is functiona lly equivalent to the bin size W r in two dimensions. The filtration value basically defines the maximum resolution of the components of the complex. Complex construction is very sensitive to the maximum filtration value, so it is important to establish an algorithm that assigns a consistent filtration value. 6.3 Computing Topological Connectedness Betti intervals describe how the homology of a simplicial complex X(t) changes with time t We want to find Betti intervals that persist for a relatively long time as those intervals might characterize the dataset under consideration We use the open source Javaplex library [20] within MATLAB R2013a and 64 bit Java JRE v1.6.32 to compute the Betti numbers. Computati ons are performed on a 3. 2 GHz dual core 64 bit Windows 7 computer with 8 Gb of RAM and 2 Gb of java heap space. The JavaPlex library for computing 0 1 2 proceeds as follows. a. A point cloud is assigned as a set of n points for each biomarker per clinical type. b. A Euclidean metric space is calculated from the point cloud. c. A Vietoris Rips stream is created using inputs of the maximum dimension (1, 2, or 3), the maximum filtration value, and the number of divisions, whi ch is set to the number of points in the cloud. d. The set of simplices is calculated from the stream. e. The persistent intervals are computed using the default simplicial algorithm. f. The 0 1 2 Betti numbers are computed. A Vietoris Rips complex is an abstract simplicial complex that can be defined from metric space M and distance by forming a simplex for every finite set of points that has

PAGE 125

112 diameter at most A fixed set of cloud points can be completed to a Vietoris Rips complex based on a pr oximity parameter [18] In a Vietoris Rips stream, once the filtration value t is greater than the diameter of the point cloud, the Betti numbers will become 0 = 1, 1 = 2 = = 0. The persistent intervals are simply those that persist until t = t max A significant value is the minimum proximity parameter that yields 0 = 1 = 1 Calculating the 0 values requires choosing a consistent filt ration value, the maximum resolution of the components of the Rips complex which compels computing another topological invariant the Euler characteristic of a point set. Intuitively, is the number of disjoint closed intervals that comprise the set s [21]. The filtration value for each clinical type bio marker combination is dete rmined as the value that produces a value of zero (computed as the alternating sum of the Betti numbers ( 1 2 ). Monotonically increasing the filtration value decreases the value from positive values through zero to negative values. Choosing this filtration value allows for a consistent comparison among the Betti numbers of the various clinical types and by inference, which clinical types are topologically equivalent (homol ogous). 6.4 Topological Results The clinical types are partially distinguished by their 0 values per biomarker, as plotted in Figure 2 1 Only Acute Lung Injury appears to be completely distinguishable Table 1 2 summarizes the t opological properties for each clinical type O nly the 0 values are non zero all the 1 and 2 values are zero so no further structure is revealed. T h e Betti numbers and Euler characteristic values are no t sufficient to distinguish a mong the various clinical types.
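The construction in Section 6.3 can be illustrated with a small, self-contained Python sketch. It is not the JavaPlex/MATLAB pipeline used in this work: it enumerates the simplices of a Vietoris-Rips complex by brute force at one fixed filtration value, counts connected components (β₀) with union-find, and takes the Euler characteristic as the alternating sum of simplex counts up to dimension 2. The function names and the sample concentration values are assumptions chosen for illustration only.

```python
from itertools import combinations


def rips_simplices(points, t, max_dim=2):
    """List simplices (as index tuples) of the Vietoris-Rips complex at scale t.

    A set of points spans a simplex when every pairwise distance is <= t.
    Points are 1-D concentration values, so the distance is |a - b|.
    """
    n = len(points)
    simplices = [(i,) for i in range(n)]
    for k in range(2, max_dim + 2):  # k vertices form a (k-1)-simplex
        for combo in combinations(range(n), k):
            if all(abs(points[i] - points[j]) <= t for i, j in combinations(combo, 2)):
                simplices.append(combo)
    return simplices


def betti0(points, t):
    """Count connected components at scale t using union-find."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if abs(points[i] - points[j]) <= t:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})


def euler_characteristic(simplices):
    """Alternating sum of simplex counts: vertices - edges + triangles - ..."""
    return sum((-1) ** (len(s) - 1) for s in simplices)


# Hypothetical concentration values for one biomarker in one clinical type.
points = [1.0, 1.5, 2.0, 10.0, 10.4, 30.0]
simplices = rips_simplices(points, t=1.0)
print(len(simplices), betti0(points, 1.0), euler_characteristic(simplices))
```

Because the complex is truncated at dimension 2, the Euler characteristic computed here is that of the 2-skeleton only. Sweeping `t` upward and recording where it changes sign would be one way to mimic the filtration-value selection described above, under the same assumptions.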

Figure 21: β₀ values for each clinical type.

Table 12: Topological properties for each clinical type.

Clinical Type         Marker  Divisions (Points)  Filter     Simplices  Euler  β₀,β₁,β₂
Cystic Fibrosis       EGF     11                  15.5650    65         1      3,0,0
Cystic Fibrosis       IFNG    5                   26.1000    30         0      1,0,0
Cystic Fibrosis       IL1A    10                  3.4440     53         1      3,0,0
Cystic Fibrosis       IL1B    7                   24.3000    46         0      2,0,0
Cystic Fibrosis       IL2     9                   18.2760    49         1      3,0,0
Cystic Fibrosis       IL4     7                   5.7000     46         0      2,0,0
Cystic Fibrosis       IL6     10                  18.7000    54         0      2,0,0
Cystic Fibrosis       IL8     8                   16.0000    48         0      2,0,0
Cystic Fibrosis       IL10    9                   0.8700     46         0      1,0,0
Cystic Fibrosis       MCP1    11                  198.2000   66         0      1,0,0
Cystic Fibrosis       TNFA    10                  3.3967     63         1      3,0,0
Cystic Fibrosis       VEGF    15                  10.7180    97         1      4,0,0
Acute Lung Injury     EGF     23                  17.3000    136        0      5,0,0
Acute Lung Injury     IFNG    17                  119.0000   82         2      4,0,0
Acute Lung Injury     IL1A    8                   55.5000    48         0      2,0,0
Acute Lung Injury     IL1B    11                  92.5000    72         0      2,0,0
Acute Lung Injury     IL2     14                  62.8000    86         0      3,0,0
Acute Lung Injury     IL4     17                  70.9000    80         0      2,0,0
Acute Lung Injury     IL6     24                  18.5000    134        0      3,0,0
Acute Lung Injury     IL8     22                  26.4520    127        1      12,0,0
Acute Lung Injury     IL10    17                  62.4000    102        0      3,0,0
Acute Lung Injury     MCP1    49                  25.3310    240        2      9,0,0
Acute Lung Injury     TNFA    21                  23.6440    111        1      8,0,0
Acute Lung Injury     VEGF    28                  23.6150    128        2      9,0,0
Adenocarcinoma        EGF     23                  18.0360    124        2      5,0,0
Adenocarcinoma        IFNG    8                   9.8000     54         0      2,0,0
Adenocarcinoma        IL1A    6                   15.8000    32         0      1,0,0
Adenocarcinoma        IL1B    5                   90.7770    23         1      1,0,0
Adenocarcinoma        IL2     6                   359.3500   31         1      2,0,0
Adenocarcinoma        IL4     8                   13.8000    50         0      2,0,0
Adenocarcinoma        IL6     12                  20.6360    53         3      5,0,0
Adenocarcinoma        IL8     12                  10.6310    57         1      2,0,0
Adenocarcinoma        IL10    9                   7.5161     49         1      2,0,0
Adenocarcinoma        MCP1    35                  29.1490    201        3      7,0,0
Adenocarcinoma        TNFA    7                   44.9870    39         1      2,0,0
Adenocarcinoma        VEGF    28                  13.1280    159        1      3,0,0
Never Smokers         EGF     28                  19.8670    141        1      4,0,0
Never Smokers         IFNG    11                  2.6190     51         1      1,0,0
Never Smokers         IL1A    3                   4.9528     3          0      1,0,0
Never Smokers         IL1B    4                   27.0000    8          2      2,0,0
Never Smokers         IL2     7                   13.7000    39         1      2,0,0
Never Smokers         IL4     9                   3.8139     49         1      2,0,0
Never Smokers         IL6     10                  6.5516     47         1      2,0,0
Never Smokers         IL8     9                   10.7420    49         1      3,0,0
Never Smokers         IL10    8                   1.6462     31         1      1,0,0
Never Smokers         MCP1    39                  17.5010    239        1      7,0,0
Never Smokers         TNFA    15                  7.5636     87         1      2,0,0
Never Smokers         VEGF    26                  22.5140    152        2      6,0,0
Healthy Serum         EGF     N/A                 N/A        N/A        N/A    0,0,0
Healthy Serum         IFNG    7                   262.0000   20         2      2,0,0
Healthy Serum         IL1A    7                   79.4360    39         1      2,0,0
Healthy Serum         IL1B    4                   49.0000    15         1      1,0,0
Healthy Serum         IL2     N/A                 N/A        N/A        N/A    0,0,0
Healthy Serum         IL4     4                   49.0000    15         1      1,0,0
Healthy Serum         IL6     7                   12.8000    46         0      2,0,0
Healthy Serum         IL8     5                   48.8000    30         0      1,0,0
Healthy Serum         IL10    5                   48.8000    30         0      1,0,0
Healthy Serum         MCP1    6                   49.0000    31         1      2,0,0
Healthy Serum         TNFA    6                   49.0000    15         1      1,0,0
Healthy Serum         VEGF    6                   118.0000   24         2      2,0,0
Smokers Without COPD  EGF     25                  18.3680    139        3      6,0,0
Smokers Without COPD  IFNG    10                  1.3776     49         1      2,0,0
Smokers Without COPD  IL1A    5                   3.9910     23         1      1,0,0
Smokers Without COPD  IL1B    5                   9.0132     23         1      1,0,0
Smokers Without COPD  IL2     13                  2.1375     59         1      1,0,0
Smokers Without COPD  IL4     12                  1.8472     57         1      1,0,0
Smokers Without COPD  IL6     4                   82.0000    8          2      2,0,0
Smokers Without COPD  IL8     9                   4.8400     55         1      2,0,0
Smokers Without COPD  IL10    15                  0.6548     87         1      2,0,0
Smokers Without COPD  MCP1    13                  26.0000    13         13     13,0,0
Smokers Without COPD  TNFA    13                  4.8902     83         1      3,0,0
Smokers Without COPD  VEGF    21                  22.6500    106        2      6,0,0
Squamous              EGF     12                  51.0000    24         2      2,0,0
Squamous              IFNG    4                   50.0000    8          2      2,0,0
Squamous              IL1A    4                   27.0000    8          2      2,0,0
Squamous              IL1B    8                   14.9950    41         1      2,0,0
Squamous              IL2     8                   46.0020    35         1      2,0,0
Squamous              IL4     6                   17.0000    24         2      2,0,0
Squamous              IL6     15                  15.9880    99         1      2,0,0
Squamous              IL8     11                  26.2340    63         1      3,0,0
Squamous              IL10    7                   29.9770    31         1      1,0,0
Squamous              MCP1    32                  29.6650    167        1      6,0,0
Squamous              TNFA    12                  11.6990    79         1      3,0,0
Squamous              VEGF    13                  28.0000    14         12     12,0,0
Smokers With COPD     EGF     30                  15.0930    195        1      7,0,0
Smokers With COPD     IFNG    7                   2.7383     33         1      2,0,0
Smokers With COPD     IL1A    3                   2.0903     3          0      1,0,0
Smokers With COPD     IL1B    5                   7.5208     23         1      1,0,0
Smokers With COPD     IL2     10                  4.9473     59         1      3,0,0
Smokers With COPD     IL4     4                   26.0000    8          2      2,0,0
Smokers With COPD     IL6     9                   3.7065     55         1      3,0,0
Smokers With COPD     IL8     9                   8.0716     55         1      2,0,0
Smokers With COPD     IL10    6                   14.8360    25         1      1,0,0
Smokers With COPD     MCP1    38                  21.6380    205        1      6,0,0
Smokers With COPD     TNFA    11                  9.7731     67         1      2,0,0
Smokers With COPD     VEGF    29                  22.0450    175        3      8,0,0
NCI Maryland Cancer   IFNG    13                  24.8270    77         1      4,0,0
NCI Maryland Cancer   IL1B    11                  17.0740    51         1      2,0,0
NCI Maryland Cancer   IL4     17                  8.9729     95         1      3,0,0
NCI Maryland Cancer   IL6     15                  17.5980    99         1      4,0,0
NCI Maryland Cancer   IL8     9                   2227.0000  25         3      3,0,0
NCI Maryland Cancer   IL10    7                   846.0000   26         2      2,0,0
NCI Maryland Cancer   TNFA    12                  5.3324     61         1      1,0,0
NCI Maryland Control  IFNG    16                  18.6550    95         1      4,0,0
NCI Maryland Control  IL1B    13                  30.3990    81         1      5,0,0
NCI Maryland Control  IL4     18                  17.4950    109        1      5,0,0
NCI Maryland Control  IL6     16                  24.7730    93         3      6,0,0
NCI Maryland Control  IL8     13                  1434.0000  23         5      5,0,0
NCI Maryland Control  IL10    41                  21.1000    253        1      10,0,0
NCI Maryland Control  TNFA    12                  15.8220    69         1      2,0,0
NCI PLCOC Cancer      IL1B    12                  8.1714     61         1      3,0,0
NCI PLCOC Cancer      IL6     16                  19.6000    94         2      6,0,0
NCI PLCOC Cancer      IL8     30                  21.5540    191        1      7,0,0
NCI PLCOC Cancer      TNFA    13                  11.6230    77         1      3,0,0
NCI PLCOC Control     IL1B    13                  5.2963     75         1      3,0,0
NCI PLCOC Control     IL6     14                  19.0540    79         1      4,0,0
NCI PLCOC Control     IL8     11                  46.5300    57         1      1,0,0
NCI PLCOC Control     TNFA    18                  7.2762     94         2      4,0,0
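To make concrete why the β₀ values alone do not separate the clinical types, the short sketch below compares the per-biomarker β₀ values of two types, copied from Table 12, and lists the biomarkers on which they coincide. The helper name `shared_markers` and the choice of these two types are illustrative assumptions, not part of the original analysis.

```python
# beta_0 per biomarker, copied from Table 12 for two clinical types.
never_smokers = {"EGF": 4, "IFNG": 1, "IL1A": 1, "IL1B": 2, "IL2": 2, "IL4": 2,
                 "IL6": 2, "IL8": 3, "IL10": 1, "MCP1": 7, "TNFA": 2, "VEGF": 6}
smokers_with_copd = {"EGF": 7, "IFNG": 2, "IL1A": 1, "IL1B": 1, "IL2": 3, "IL4": 2,
                     "IL6": 3, "IL8": 2, "IL10": 1, "MCP1": 6, "TNFA": 2, "VEGF": 8}


def shared_markers(a, b):
    """Return the biomarkers whose beta_0 value is identical in both types."""
    return sorted(m for m in a if m in b and a[m] == b[m])


print(shared_markers(never_smokers, smokers_with_copd))
```

Four of the twelve biomarkers carry the same β₀ in both types, which is consistent with the observation that these invariants only partially distinguish the clinical types.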

CHAPTER VII: CONCLUSION

We developed a computational model that reliably distinguishes among lung pathologies by assigning biomarker concentration values to discrete states despite significant data variation and technical challenges. The model addresses various sources of biological variation, missing data, outliers, and model overfitting, yet accurately assigns patients to their respective clinical types. New clinical types can be incorporated into the model, and the model distinguishes the set of targeted biomarker variables that uniquely characterize the clinical types under analysis.

The source data, concentration values of host response serum cytokines, serve as adequate biomarker variables. Certain features of the cytokine concentration distributions remain characteristically invariant with respect to a [...] The Discrete Topological Structure computational model distinguishes among clinical type populations by discretizing concentration values to populate only certain bin states. A unique exclusive-or (XOR) operation extracts distinguishing patterns that successfully assign specific sets of states to specific clinical types, thereby distinguishing clinical types with an accuracy of over 99% within the given population. Distinguishing among multiple clinical types is a function of the number of distinguishing biomarker bin states. The resulting analysis also indicates which of the biomarkers are better at recognizing the discriminating signs of the lung diseases under study.

The resulting DTS model simplifies the high-dimensional biomarker concentration space to reveal distinguishing features of lung disease. The biomarker pairs that distinguish among all 7 clinical types are: {IL2, VEGF}, {IL6, VEGF}, {IL6, TNFA}, {IL2, TNFA}, {EGF, MCP1}, {IL2, MCP1}, {EGF, IL10}, {EGF, IL2}, and {EGF, IFNG}.

The model has been validated in several different ways.

• All 310 of the patients are correctly assigned to their clinical type.
• 32 out of 33 previously excluded patients are correctly assigned to their clinical type.
• A separate set of 7 Healthy Serum patients are correctly assigned.

Besides analyzing other disease data sets, possible directions for future work include:

1. Incorporating longitudinal patient data, which would allow the model to make specific patient prognoses, although these data are difficult to obtain. However, the model [...]tory over time.
2. Processing large proteomic microarray data sets involving thousands of measured protein biomarkers.
3. Formalizing a concept of convergence to characterize the relative importance of each biomarker in each clinical type population.

REFERENCES

1. J. Subramanian, R. Govindan, "Lung Cancer in Never Smokers: A Review", Journal of Clinical Oncology 25(5): 561-70. (2007)
2. M. Heo, S. Maslov, E. Shakhnovich, "Topology of protein interaction network shapes protein abundances and strengths of their functional and nonspecific interactions", PNAS vol. 108 no. 10, 4258-4263. (Mar 2011)
3. J. Zhang, S. Maslov, E. Shakhnovich, "Constraints imposed by non-functional protein-protein interactions on gene expression and proteome size", Molecular Systems Biology 4:210. (2008)
4. L. Enewold, L. E. Mechanic, E. D. Bowman, Y. L. Zheng, Z. Yu, G. Trivers, A. J. Alberg, C. C. Harris, "Serum concentrations of cytokines and lung cancer survival in African Americans and Caucasians", Cancer Epidemiol Biomarkers Prev. 18(1):215-22. doi: 10.1158/1055-9965.EPI-08-0705. (Jan 2009)
5. C. A. Dinarello, "Proinflammatory Cytokines", Chest 2000;118;503-508.
6. H. Yanagawa, S. Sone, Y. Takahashi, et al., "Serum levels of interleukin 6 in patients with lung cancer", Br J Cancer. 1995;71(5):1095-1098.
7. M. Orditura, F. De Vita, G. Catalano, et al., "Elevated serum levels of interleukin-8 in advanced non-small cell lung cancer patients: relationship with prognosis", J Interferon Cytokine Res. 22(11):1129-1135. (2002)
8. J. Peng, J. E. Elias, C. C. Thoreen, L. J. Licklider, S. P. Gygi, "Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: The yeast proteome", Journal of Proteome Research 2(1): 43-50. PubMedId 12643542. (2003)
9. M. P. Washburn, D. Wolters, J. R. Yates, "Large-scale analysis of the yeast proteome by multidimensional protein identification technology", Nature Biotechnology 19(3): 242-247. PubMedId 11231557. (2001)
10. S. Kumar, B. Ma, C.J. Tsai, N. Sinha, R. Nussinov, "Folding and binding cascades: dynamic landscapes and population shifts", Protein Sci. 9, 10. (2000)
11. R. Laubenbacher, "System identification of biochemical networks using discrete models", Computation of biochemical pathways and genetic networks, U. Kummer (ed.), Petronius Verlag, Berlin. (2005)
12. J.M. Kleinberg, "An impossibility theorem for clustering", NIPS 2002: 446-453.
13. H. de Jong, "Modeling and Simulation of Genetic Regulatory Systems: A Literature Review", J. Comput. Biol. 9, 103-129. (2002)
14. N. G. van Kampen, "Stochastic Processes in Physics and Chemistry" (3rd ed.), Elsevier, Amsterdam. (2007)

15. D. T. Gillespie, "Exact stochastic simulation of coupled chemical reactions", J. Phys. Chem. 81(25), 2340-2361. (1977)
16. D. T. Gillespie, "A rigorous derivation of the chemical master equation", Physica A 188, 404-425. (1992)
17. S. H. Strogatz, "Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering", Perseus, New York. (1994)
18. R. Ghrist, "Barcodes: The Persistent Topology of Data", Bulletin of the American Mathematical Society, v. 45, Number 1, Jan 2008, pp. 61-75.
19. T. K. Dey, H. Edelsbrunner, S. Guha, "Computational Topology", Advances in Discrete and Computational Geometry, eds. B. Chazelle, J. E. Goodman and R. Pollack. Contemporary Mathematics, AMS, Providence. (1998)
20. G. Carlsson, "Topology and Data", Bulletin (New Series) of the American Mathematical Society, Volume 46, Number 2, April 2009, Pages 255-308. E-published on Jan 29, 2009.
21. D. Klain, K. Rybnikov, K. Daniels, B. Jones, C. Neacsu, "Estimation of Euler Characteristic from Volumetric Data", Proceedings of 4th International Symposium on 3D Data Processing, Visualization and Transmission, Georgia Institute of Technology, Atlanta, Georgia, p. 51-58, June 18-20, 2008.
22. D. Gnabasik, G. Alaghband, "Discrete Topological Structure of Proteomic Biomarkers", Int'l Conf. on Computational Science & Computational Intelligence, Las Vegas, Nevada, March 10-12, 2014.
23. H. Edelsbrunner, "Biological Applications of Computational Topology", Chapter 63 of Handbook of Discrete and Computational Geometry, 1395-1412, eds. J. E. Goodman and J. O'Rourke, CRC Press, Boca Raton, Florida. (2004)
24. 112M Sperm Whale Myoglobin D122n N-propyl Isocyanide protein. Available Nov 2016: http://www.rcsb.org/pdb/explore.do?structureId=112m http://www.ncbi.nlm.nih.gov/Structure/mmdb/mmdbsrv.cgi?ShowOp=VastSum&uid=112M
25. P. Blinder, I. Baruchi, V. Volman, H. Levine, D. Baranes, E. B. Jacob, "Functional topology classification of biological computing networks", Natural Computing 4: 339-361. (2005)
26. R.M. Ostroff, W.L. Bigbee, W. Franklin, L. Gold, M. Mehan, Y.E. Miller, H.I. Pass, W.N. Rom, J.M. Siegfried, A. Stewart, J.J. Walker, J.L. Weissfeld, S. Williams, D. Zichi, E.N. Brody, "Unlocking Biomarker Discovery: Large Scale Application of Aptamer Proteomic Technology for Early Detection of Lung Cancer", PLoS ONE, Dec. 2010, 5(12).
27. G. Carlsson, "Overview of Topological Data Analysis", Presentation to the Institute for Mathematics and its Applications in Minneapolis on October 7, 2013.

28. C. M. Micheel, S. J. Nass, G. S. Omenn (eds.), "Evolution of Translational Omics: Lessons Learned and the Path Forward", National Academies Press. (2012)
29. S. Erten, G. Bebek, M. Koyutürk, "Vavien: An Algorithm for Prioritizing Candidate Disease Genes Based on Topological Similarity of Proteins in Interaction Networks", Journal of Computational Biology. (Nov 2011)
30. E. Wang, D. Liu, J. Silva, D. Dunson, L. Carin, "Joint Analysis of Time-Evolving Binary Matrices", Advances in Neural Information Processing Systems 23, pp. 2370-2378. (2010)
31. R. Clarke, H. W. Ressom, A. Wang, J. Xuan, M. C. Liu, E. A. Gehan, Y. Wang, "The properties of high-dimensional data spaces: implications for exploring gene and protein expression data", Nature Reviews Cancer, 8(1), 37-49. (2008) Available: http://doi.org/10.1038/nrc2294
32. Wolfram Research, Inc., Mathematica, Version 10.3, Champaign, IL. (2015)
33. Luminex Assays from Thermo Fisher Scientific. Available Nov 2016: http://www.thermofisher.com/us/en/home/life-science/protein-biology/protein-assays-analysis/luminex-assays.html
34. K. S. Burrowes, T. Doel, C. Brightling, "Computational modeling of the obstructive lung diseases asthma and COPD", Journal of Translational Medicine, 12(Suppl 2), S5. http://doi.org/10.1186/1479-5876-12-S2-S5 (2014)
35. A. Gefen, "Multiscale Computer Modeling in Biomechanics and Biomedical Engineering", Springer Berlin Heidelberg. (2013)
36. H.-W. Snoeck, "Modeling human lung development and disease using pluripotent stem cells", Development 2015 142: 13-16; doi: 10.1242/dev.115469
37. S. Bhattacharya, T. J. Mariani, "Systems biology approaches to identify developmental bases for lung diseases", Pediatric Research 73, 514-522. doi:10.1038/pr.2013.7 (2013)
38. T. E. Dick, Y. I. Molkov, G. Nieman, Y. H. Hsieh, F. J. Jacono, J. Doyle, J. D. Scheff, S. E. Calvano, I. P. Androulakis, G. An, Y. Vodovotz, "Linking Inflammation, Cardiorespiratory Variability, and Neural Control in Acute Inflammation via Computational Modeling", Front. Physio. 3:222. doi:10.3389/fphys.2012.00222 (2012)
39. O. Eickmeier, M. Huebner, E. Herrmann, U. Zissler, M. Rosewich, P. C. Baer, R. Buhl, S. Schmitt-Grohé, S. Zielen, R. Schubert, "Sputum biomarker profiles in cystic fibrosis (CF) and chronic obstructive pulmonary disease (COPD) and association between pulmonary function", Cytokine. 2010 May;50(2):152-7. doi: 10.1016/j.cyto.2010.02.004. Epub 2010 Feb 23.
40. A. Biancotto, A. Wank, S. Perl, W. Cook, M. J. Olnes, et al., "Baseline Levels and Temporal Stability of 27 Multiplexed Serum Cytokine Concentrations in Healthy Subjects", PLoS ONE 8(12): e76091. doi:10.1371/journal.pone.0076091 (2013)

41. R. Weber, H.-J. Schek, S. Blott, "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces", Proc 24th Int Very Large Database Conf; p. 194-205. (1998)
42. R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, "Automatic subspace clustering of high dimensional data for data mining applications", Proc 1998 ACM SIGMOD Int Conf Management Data; p. 94-105. (1998)
43. G. Caldarelli, R. Pastor-Satorras, A. Vespignani, "Structure of cycles and local ordering in complex networks", Eur Phys J B 2004;38:183-186.
44. S. Erten, G. Bebek, M. Koyutürk, "Vavien: An Algorithm for Prioritizing Candidate Disease Genes Based on Topological Similarity of Proteins in Interaction Networks", J Comput Biol. 2011 Nov;18(11):1561-74. doi: 10.1089/cmb.2011.0154. Epub Oct 28, 2011.
45. R. Sivangala, G. Sumanlatha, "Cytokines that Mediate and Regulate Immune Responses", Austin Publishing Group. Innovative Immunology. Available: www.austinpublishinggroup.com/ebooks (2015)
46. P. Drineas, M. W. Mahoney, "RandNLA: Randomized Numerical Linear Algebra", Communications of the ACM, Vol. 59 No. 6, pp. 80-90. 10.1145/2842602
47. S. Hanash, "Harnessing immunity for cancer marker discovery", Nature Biotechnol. 21, 37-38. (2003)
48. M. E. Kotas, R. Medzhitov, "Homeostasis, Inflammation, and Disease Susceptibility", Cell. 2015 Feb 26; 160(5): 816-827. doi: 10.1016/j.cell.2015.02.010
49. G. Carlsson, R. Jardine, D. Feichtner-Kozlov, D. Morozov, "Topological Data Analysis and Machine Learning Theory", Tech. Rep. 12w5081, Banff International Research Station for Mathematical Innovation and Discovery. (2009)
50. J. R. Perkins, I. Diboun, B. H. Dessailly, J. G. Lees, C. Orengo, "Transient Protein-Protein Interactions: Structural, Functional, and Network Properties", Structure, Volume 18, Issue 10, 13 October 2010, Pages 1233-1243, ISSN 0969-2126.
51. A.-L. Barabasi, Z. N. Oltvai, "Network biology: understanding the cell's functional organization", Nature Reviews Genetics 5, 101-113. doi:10.1038/nrg1272 (Feb 2004)
52. R. Schiess, B. Wollscheid, R. Aebersold, "Targeted proteomic strategy for clinical biomarker discovery", Mol Oncol. 2009 February; 3(1): 33-44. (2008)
53. P. H. O'Farrell, "High resolution two-dimensional electrophoresis of proteins", J. Biol. Chem. 250(10): 4007-21. (1975) Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2874754/
54. B. Mendelson, "Introduction to Topology" (3rd ed.), Dover Publications, New York. (1975) ISBN 0-486-66352-3.

55. C.R.F. Maunder, "Algebraic Topology", Dover Publications, New York. (1996) ISBN 0-486-69131-4
56. M. Emre Celebi (ed.), "Partitional Clustering Algorithms", Springer Intl Publishing. (2015)
57. T.L. Bonfield, "In Vivo Models of Lung Disease", Lung Diseases: Selected State of the Art Reviews, edited by Irusen. Hampshire, UK: DEMInTech, 2012.
58. Y. Qi, W. S. Noble, "Protein interaction networks: Protein domain interaction and protein function prediction", in Handbook of Statistical Bioinformatics (ch. 10), H. Horng-Shing Lu et al. (eds.). Springer-Verlag, Berlin, 2011.
59. A. Tausz, M. Vejdemo-Johansson, H. Adams, "Javaplex: a Research Platform for Persistent Homology", Geometric/Topological Software symposium. (2012)
60. T. Ideker, R. Sharan, "Protein networks in disease", Genome Research. 2008;18(4):644-652. doi:10.1101/gr.071852.107.
61. R. Schiess, B. Wollscheid, R. Aebersold, "Targeted Proteomic Strategy for Clinical Biomarker Discovery", Molecular Oncology, 3(1), 33-44. Available: http://doi.org/10.1016/j.molonc.2008.12.001
62. K. S. Burrowes, J. De Backer, R. Smallwood, P. J. Sterk, I. Gut, R. Wirix-Speetjens, "Multi-scale computational models of the airways to unravel the pathophysiological mechanisms in asthma and chronic obstructive pulmonary disease (AirPROM)", Interface Focus, 3(2), 20120057. http://doi.org/10.1098/rsfs.2012.0057 (2013)
63. J. Hirsch, K. C. Hansen, A. L. Burlingame, M. A. Matthay, "Proteomics: current techniques and potential applications to lung disease", Am J Physiol Lung Cell Mol Physiol 287: L1-L23, 2004; doi:10.1152/ajplung.00301.2003
64. J. De Las Rivas, C. Fontanillo, "Protein-protein interactions essentials: key concepts to building and analyzing interactome networks", PLoS Computational Biology 6(6): e1000807. doi:10.1371/journal.pcbi.1000807.
65. F. Jordán, T.-P. Nguyen, W.-C. Liu, "Studying protein-protein interaction networks: A systems view on diseases", Briefings in Functional Genomics 11(6): pp. 497-504. doi: 10.1093/bfgp/els035 (2012)
66. J. Zahiri, J.H. Bozorgmehr, A. Masoudi-Nejad, "Computational Prediction of Protein-Protein Interaction Networks: Algorithms and Resources", Current Genomics. 14(6):397-414. doi:10.2174/1389202911314060004. (2013)
67. G. McPherson, "Statistics in Scientific Investigation: Its Basis, Application and Interpretation", Springer-Verlag. ISBN 0-387-97137-8 (1990)
68. D. Skillicorn, "Understanding High-Dimensional Spaces", Springer Briefs in Computer Science. (2012)

69. M. Bern, D. Eppstein, P. K. Agarwal, N. Amenta, P. Chew, T. Dey, D. P. Dobkin, H. Edelsbrunner, C. Grimm, L. J. Guibas, J. Harer, J. Hass, A. Hicks, C. K. Johnson, G. Lerman, D. Letscher, P. Plassmann, E. Sedgwick, J. Snoeyink, J. Weeks, C. Yap, D. Zorin, "Emerging Challenges in Computational Topology", NSF Report. (1999)
70. M. Hazewinkel (ed.), "Joint distribution", Encyclopedia of Mathematics, Springer. (2001) ISBN 978-1-55608-010-4
71. A. Blum, J. Hopcroft, R. Kannan, "Foundations of Data Science", pre-version of a textbook. Available at http://www.cs.cornell.edu/jeh/ May 2015.
72. T. Loong, "Understanding sensitivity and specificity with the right side of the brain", BMJ 327(7417): 716-719. (2003)
73. P.T. Reid, J.A. Innes, "Respiratory disease", in: B.R. Walker, N.R. Colledge, S.H. Ralston, I.D. Penman (eds.), Davidson's Principles and Practice of Medicine, 22nd ed. Philadelphia, PA: Elsevier Churchill Livingstone; chap 19. (2014)
74. Lung Cancer Fact Sheet from the American Lung Association. (Nov. 2016) Available at http://www.lung.org/lung-health-and-diseases/lung-disease-lookup/lung-cancer/resource-library/lung-cancer-fact-sheet.html
75. A.J. Guarascio, S.M. Ray, C.K. Finch, T.H. Self, "The clinical and economic burden of chronic obstructive pulmonary disease in the USA", ClinicoEconomics and Outcomes Research. Jun 17;5:235-45. (2013)
76. D. Spyratos, D. Chloros, L. Sichletidis, "Diagnosis of chronic obstructive pulmonary disease in the primary care setting", Hippokratia. 2012 Jan-Mar; 16(1): 17-22.
77. K. Chandramouli, P.-Y. Qian, "Human Genomics and Proteomics", Volume 2009, Article ID 239204.
78. Philip Morris Products S.A., COPD Biomarker Identification Study. Available Nov 2016: https://www.clinicaltrials.gov/ct2/show/NCT01780298
79. A. Plavnick, "The fundamental theorem of Markov chains", University of Chicago VIGRE REU. (2008)
80. S. R. Pine, L. E. Mechanic, L. Enewold, et al., "Increased levels of circulating interleukin 6, interleukin 8, C-reactive protein and risk of lung cancer", J Natl Cancer Inst. 2011;103(14):1112-1122.
81. V. Brower, "Biomarker studies abound for early detection of lung cancer", J Natl Cancer Inst 101: 11-13. (2009)

125 APPENDI X A Percent Missing Values per Clinical Type Biomarker (11 / 119 > 50% are highlighted in red) Marker Clinical Type Count Total %Missing [EGF] Adenocarcinoma 47 53 11.3% [EGF] Squamous 39 44 11.4% [EGF] Never Smokers 47 55 14.5% [EGF] Smokers with COPD 42 49 14.3% [EGF] Smokers without COPD 46 53 13.2% [EGF] Acute Lung Injury 43 62 30.6% [EGF] Cystic Fibrosis 18 27 33.3% [EGF] TestSet 4 4 0.0% [EGF] Healthy Serum 0 7 N/A [EGF] NCI Maryland Cancer 0 355 N/A [EGF] NCI Maryland Control 0 466 N/A [EGF] NCI PLCOC Cancer 0 591 N/A [EGF] NCI PLCOC Control 0 650 N/A [IFNG] Adenocarcinoma 35 53 34.0% [IFNG] Squamous 34 44 22.7% [IFNG] Never Smokers 22 55 60.0% [IFNG] Smokers with COPD 29 49 40.8% [IFNG] Smokers without COPD 29 53 45.3% [IFNG] Acute Lung Injury 46 62 25.8% [IFNG] Cystic Fibrosis 15 27 44.4% [IFNG] TestSet 4 4 0.0% [IFNG] Healthy Serum 7 7 0.0% [IFNG] NCI Maryland Cancer 355 355 0.0% [IFNG] NCI Maryland Control 466 466 0.0% [IFNG] NCI PLCOC Cancer 0 591 N/A [IFNG] NCI PLCOC Control 0 650 N/A [IL1A] Adenocarcinoma 9 53 83.0% [IL1A] Squamous 9 44 79.5% [IL1A] Never Smokers 2 55 96.4% [IL1A] Smokers with COPD 4 49 91.8% [IL1A] Smokers without COPD 5 53 90.6% [IL1A] Acute Lung Injury 15 62 75.8% [IL1A] Cystic Fibrosis 12 27 55.6% [IL1A] TestSet 4 4 0.0% [IL1A] Healthy Serum 7 7 0.0% [IL1A] NCI Maryland Cancer 0 355 N/A [IL1A] NCI Maryland Control 0 466 N/A [IL1A] NCI PLCOC Cancer 0 591 N/A [IL1A] NCI PLCOC Control 0 650 N/A [IL1B] Adenocarcinoma 32 53 39.6% [IL1B] Squamous 33 44 25.0%

PAGE 139

126 [IL1B] Never Smokers 32 55 41.8% [IL1B] Smokers with COPD 24 49 51.0% [IL1B] Smokers without COPD 25 53 52.8% [IL1B] Acute Lung Injury 32 62 48.4% [IL1B] Cystic Fibrosis 19 27 29.6% [IL1B] TestSet 4 4 0.0% [IL1B] Healthy Serum 7 7 0.0% [IL1B] NCI Maryland Cancer 355 355 0.0% [IL1B] NCI Maryland Control 466 466 0.0% [IL1B] NCI PLCOC Cancer 588 591 0.5% [IL1B] NCI PLCOC Control 642 650 1.2% [IL2] Adenocarcinoma 48 53 9.4% [IL2] Squamous 40 44 9.1% [IL2] Never Smokers 49 55 10.9% [IL2] Smokers with COPD 44 49 10.2% [IL2] Smokers without COPD 48 53 9.4% [IL2] Acute Lung Injury 36 62 41.9% [IL2] Cystic Fibrosis 19 27 29.6% [IL2] TestSet 4 4 0.0% [IL2] Healthy Serum 0 7 N/A [IL2] NCI Maryland Cancer 0 355 N/A [IL2] NCI Maryland Control 0 466 N/A [IL2] NCI PLCOC Cancer 0 591 N/A [IL2] NCI PLCOC Control 0 650 N/A [IL4] Adenocarcinoma 43 53 18.9% [IL4] Squamous 40 44 9.1% [IL4] Never Smokers 45 55 18.2% [IL4] Smokers with COPD 42 49 14.3% [IL4] Smokers without COPD 46 53 13.2% [IL4] Acute Lung Injury 34 62 45.2% [IL4] Cystic Fibrosis 12 27 55.6% [IL4] TestSet 4 4 0.0% [IL4] Healthy Serum 7 7 0.0% [IL4] NCI Maryland Cancer 355 355 0.0% [IL4] NCI Maryland Control 466 466 0.0% [IL4] NCI PLCOC Cancer 0 591 N/A [IL4] NCI PLCOC Control 0 650 N/A [IL6] Adenocarcinoma 48 53 9.4% [IL6] Squamous 40 44 9.1% [IL6] Never Smokers 47 55 14.5% [IL6] Smokers with COPD 43 49 12.2% [IL6] Smokers without COPD 46 53 13.2% [IL6] Acute Lung Injury 40 62 35.5% [IL6] Cystic Fibrosis 14 27 48.1%

PAGE 140

127 [IL6] TestSet 4 4 0.0% [IL6] Healthy Serum 7 7 0.0% [IL6] NCI Maryland Cancer 355 355 0.0% [IL6] NCI Maryland Control 466 466 0.0% [IL6] NCI PLCOC Cancer 591 591 0.0% [IL6] NCI PLCOC Control 650 650 0.0% [IL8] Adenocarcinoma 47 53 11.3% [IL8] Squamous 40 44 9.1% [IL8] Never Smokers 47 55 14.5% [IL8] Smokers with COPD 44 49 10.2% [IL8] Smokers without COPD 48 53 9.4% [IL8] Acute Lung Injury 56 62 9.7% [IL8] Cystic Fibrosis 23 27 14.8% [IL8] TestSet 4 4 0.0% [IL8] Healthy Serum 7 7 0.0% [IL8] NCI Maryland Cancer 350 355 1.4% [IL8] NCI Maryland Control 454 466 2.6% [IL8] NCI PLCOC Cancer 591 591 0.0% [IL8] NCI PLCOC Control 650 650 0.0% [IL10] Adenocarcinoma 46 53 13.2% [IL10] Squamous 39 44 11.4% [IL10] Never Smokers 45 55 18.2% [IL10] Smokers with COPD 37 49 24.5% [IL10] Smokers without COPD 41 53 22.6% [IL10] Acute Lung Injury 44 62 29.0% [IL10] Cystic Fibrosis 21 27 22.2% [IL10] TestSet 4 4 0.0% [IL10] Healthy Serum 6 7 14.3% [IL10] NCI Maryland Cancer 355 355 0.0% [IL10] NCI Maryland Control 466 466 0.0% [IL10] NCI PLCOC Cancer 0 591 N/A [IL10] NCI PLCOC Control 0 650 N/A [MCP1] Adenocarcinoma 48 53 9.4% [MCP1] Squamous 40 44 9.1% [MCP1] Never Smokers 50 55 9.1% [MCP1] Smokers with COPD 44 49 10.2% [MCP1] Smokers without COPD 48 53 9.4% [MCP1] Acute Lung Injury 56 62 9.7% [MCP1] Cystic Fibrosis 24 27 11.1% [MCP1] TestSet 4 4 0.0% [MCP1] Healthy Serum 7 7 0.0% [MCP1] NCI Maryland Cancer 0 355 N/A [MCP1] NCI Maryland Control 0 466 N/A [MCP1] NCI PLCOC Cancer 0 591 N/A

[MCP1] NCI PLCOC Control 0 650 N/A
[TNFA] Adenocarcinoma 47 53 11.3%
[TNFA] Squamous 39 44 11.4%
[TNFA] Never Smokers 46 55 16.4%
[TNFA] Smokers with COPD 44 49 10.2%
[TNFA] Smokers without COPD 48 53 9.4%
[TNFA] Acute Lung Injury 55 62 11.3%
[TNFA] Cystic Fibrosis 19 27 29.6%
[TNFA] TestSet 4 4 0.0%
[TNFA] Healthy Serum 6 7 14.3%
[TNFA] NCI Maryland Cancer 355 355 0.0%
[TNFA] NCI Maryland Control 466 466 0.0%
[TNFA] NCI PLCOC Cancer 591 591 0.0%
[TNFA] NCI PLCOC Control 650 650 0.0%
[VEGF] Adenocarcinoma 47 53 11.3%
[VEGF] Squamous 40 44 9.1%
[VEGF] Never Smokers 50 55 9.1%
[VEGF] Smokers with COPD 44 49 10.2%
[VEGF] Smokers without COPD 48 53 9.4%
[VEGF] Acute Lung Injury 48 62 22.6%
[VEGF] Cystic Fibrosis 23 27 14.8%
[VEGF] TestSet 4 4 0.0%
[VEGF] Healthy Serum 7 7 0.0%
[VEGF] NCI Maryland Cancer 0 355 N/A
[VEGF] NCI Maryland Control 0 466 N/A
[VEGF] NCI PLCOC Cancer 0 591 N/A
[VEGF] NCI PLCOC Control 0 650 N/A
[IL5] Adenocarcinoma 0 53 N/A
[IL5] Squamous 0 44 N/A
[IL5] Never Smokers 0 55 N/A
[IL5] Smokers with COPD 0 49 N/A
[IL5] Smokers without COPD 0 53 N/A
[IL5] Acute Lung Injury 0 62 N/A
[IL5] Cystic Fibrosis 0 27 N/A
[IL5] TestSet 0 4 N/A
[IL5] Healthy Serum 0 7 N/A
[IL5] NCI Maryland Cancer 355 355 0.0%
[IL5] NCI Maryland Control 466 466 0.0%
[IL5] NCI PLCOC Cancer 0 591 N/A
[IL5] NCI PLCOC Control 0 650 N/A
[IL12] Adenocarcinoma 0 53 N/A
[IL12] Squamous 0 44 N/A
[IL12] Never Smokers 0 55 N/A
[IL12] Smokers with COPD 0 49 N/A

[IL12] Smokers without COPD 0 53 N/A
[IL12] Acute Lung Injury 0 62 N/A
[IL12] Cystic Fibrosis 0 27 N/A
[IL12] TestSet 0 4 N/A
[IL12] Healthy Serum 0 7 N/A
[IL12] NCI Maryland Cancer 355 355 0.0%
[IL12] NCI Maryland Control 466 466 0.0%
[IL12] NCI PLCOC Cancer 0 591 N/A
[IL12] NCI PLCOC Control 0 650 N/A
[GMCSF] Adenocarcinoma 0 53 N/A
[GMCSF] Squamous 0 44 N/A
[GMCSF] Never Smokers 0 55 N/A
[GMCSF] Smokers with COPD 0 49 N/A
[GMCSF] Smokers without COPD 0 53 N/A
[GMCSF] Acute Lung Injury 0 62 N/A
[GMCSF] Cystic Fibrosis 0 27 N/A
[GMCSF] TestSet 0 4 N/A
[GMCSF] Healthy Serum 0 7 N/A
[GMCSF] NCI Maryland Cancer 355 355 0.0%
[GMCSF] NCI Maryland Control 466 466 0.0%
[GMCSF] NCI PLCOC Cancer 0 591 N/A
[GMCSF] NCI PLCOC Control 0 650 N/A
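The percentage column in the rows above is consistent with a simple missing-data rate. The following is a minimal sketch under two assumptions that are not stated in this part of the table: that the first count is the number of samples with a usable measurement and the second is the total number of samples, and that "N/A" marks a biomarker that was not assayed at all in that data set. The function name `percent_missing` is hypothetical.

```python
def percent_missing(measured: int, total: int) -> str:
    """Format the share of samples that lack a measurement.

    Assumption: `measured` is the count of samples with a value and
    `total` is the number of samples in the data set.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if measured == 0:
        # Assumption: a count of zero means the biomarker was not assayed
        # for this data set, which the table reports as N/A.
        return "N/A"
    return f"{100.0 * (total - measured) / total:.1f}%"


# A few rows copied from the table: (biomarker, data set, measured, total).
rows = [
    ("IL1B", "Never Smokers", 32, 55),
    ("IL1B", "NCI PLCOC Cancer", 588, 591),
    ("IL2", "Healthy Serum", 0, 7),
    ("IL6", "TestSet", 4, 4),
]

for biomarker, dataset, measured, total in rows:
    print(f"[{biomarker}] {dataset} {measured} {total} "
          f"{percent_missing(measured, total)}")
```

Under these assumptions the sketch reproduces the tabulated values for the sampled rows, which suggests the column can be recomputed from the two counts alone.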

APPENDIX B

The Chemical Stochastic Master Equation

The derivation of the DTS equation was motivated by the Chemical Stochastic Master Equation (CSME) as given by van Kampen [14] and Gillespie [15,16]. The CSME is a set of first-order differential equations describing the time evolution of the probability p(t) that a system occupies each one of a discrete set of states, as a function of a continuous time variable t. For van Kampen, the CSME describes the probability that a vector of measurements belongs to a certain state and how that probability changes with respect to time [13].

[Equation: The differential Chemical Stochastic Master Equation (CSME)]

In the CSME, m and n are the numbers of interactions between chemical states in their respective concentration measurements. The first coefficient gives the probability that interaction i will occur in a short interval beginning at time t, given that the system is in a particular state at time t. The second coefficient gives the probability that interaction j will bring the system into that state from another state. The CSME is a gain-loss equation for state probabilities, where a state is understood as a vector of chemical concentration measurements.

However, there are several issues with this conceptual model, as de Jong states in [13]. Differential equations assume that concentrations vary "continuously and deterministically, both of which assumptions may be questionable in the case of gene regulation. In the first place, the small numbers of molecules of some of the components of the regulatory system compromise the continuity assumption. Second, deterministic change presupposed by the differential operator d/dt may be questionable due to fluctuations in the timing of cellular events, such as the delay between start and finish of transcription."

We treat time t as discrete because of the inherently discrete nature of sampling clinical data.
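For reference, a gain-loss equation of the kind described in this appendix can be sketched as follows. The notation is an assumption modeled on de Jong's presentation [13] and may differ from the author's: X is the state vector, α_i Δt is the probability that interaction i occurs in [t, t + Δt] given that the system is in state X at time t, and β_j Δt is the probability that interaction j brings the system into state X from another state.

```latex
% Assumed notation: X, \Delta t, \alpha_i and \beta_j are illustrative symbols.
\begin{align}
  % Discrete-time balance: probability of staying in X plus probability of entering X.
  p(X, t + \Delta t) &= p(X, t)\Bigl(1 - \sum_{i=1}^{m} \alpha_i \,\Delta t\Bigr)
                        + \sum_{j=1}^{n} \beta_j \,\Delta t \\
  % Differential form: gain term minus loss term.
  \frac{\partial p(X, t)}{\partial t} &= \sum_{j=1}^{n} \beta_j
                        - p(X, t) \sum_{i=1}^{m} \alpha_i
\end{align}
```

The second line follows from the first by subtracting p(X, t) from both sides, dividing by Δt, and letting Δt tend to zero. Keeping Δt finite, as in the first line, corresponds to the discrete treatment of time adopted here.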

APPENDIX C

Definitions and Characteristics of Lung Disease

These definitions are from the National Cancer Institute Dictionary of Cancer Terms at https://www.cancer.gov/publications/dictionaries/cancer-terms

Chronic Obstructive Pulmonary Disease (COPD): a lung disease characterized by chronic obstruction of lung airflow that interferes with normal breathing.

Adenocarcinoma: defined as neoplasia of epithelial tissue; a type of cancerous tumor that can occur in several parts of the body.

Squamous Cell Carcinoma: cancer that begins from squamous cells, a type of skin cell.

Cystic Fibrosis: a genetic disorder that affects mostly the lungs, whose long-term issues include difficulty breathing and coughing up mucus because of frequent lung infections.

Acute Lung Injury: acute respiratory distress syndrome (ARDS) is a medical condition occurring in critically ill patients, characterized by widespread inflammation in the lungs.

There are 2 major types of lung cancer: small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). About 85% to 90% of lung cancers are non-small cell lung cancer. Refer to: http://www.cancer.org/acs/groups/cid/documents/webcontent/003115-pdf.pdf

About 25% to 30% of lung cancers are NSCLC squamous cell carcinomas.
About 40% are NSCLC adenocarcinomas.
About 10% to 15% are NSCLC large cell (undifferentiated) carcinomas.
About 10% to 15% are small cell lung cancer (SCLC). It is very rare for someone who has never smoked to have SCLC.

There are a few other subtypes of non-small cell lung cancer that are much less common. Along with the 2 main types of lung cancer, other tumors can occur in the lungs, such as lung carcinoid tumors, which account for fewer than 5% of lung tumors.

Neoplasia is new, uncontrolled growth of cells that is not under physiologic control, and there is no single mechanism by which a neoplasm arises. Neoplasms must establish a blood supply to keep growing. Carcinomas arise from epithelial surfaces. Carcinomas that form glandular configurations are called adenocarcinomas. Carcinomas that form solid nests of cells with distinct borders, intercellular bridges, and pink keratinized cytoplasm are called squamous cell carcinomas.

Immunohistochemical staining is helpful to determine the cell type of a neoplasm when the degree of differentiation, or morphology alone, does not allow an exact classification. Traditionally, tumor cell morphology on light microscopy has been used to predict tumor behavior and prognosis. Tumor biomarkers in serum, such as carcinoembryonic antigen (CEA), alpha-fetoprotein (AFP), or human chorionic gonadotropin (HCG), can also be measured. Unfortunately, they are not very specific or sensitive, particularly when applied as screening tests to a general population.

APPENDIX D

Test Sensitivity and Specificity

Test sensitivity is the ability of a test to correctly identify those with an identifiable disease (the true positive rate), whereas test specificity is the ability of the same test to correctly identify those without the disease (the true negative rate) [72].

Given TP = number of true positives and FN = number of false negatives, test sensitivity (or true positive rate, TPR) is defined as

\[ \mathrm{TPR} = \frac{TP}{TP + FN} \]

Given TN = number of true negatives and FP = number of false positives, test specificity (or true negative rate, TNR) is defined as

\[ \mathrm{TNR} = \frac{TN}{TN + FP} \]

The false positive rate (the type I error rate) is defined as

\[ \mathrm{FPR} = \frac{FP}{FP + TN} = 1 - \mathrm{TNR} \]

The false negative rate (the type II error rate) is defined as

\[ \mathrm{FNR} = \frac{FN}{FN + TP} = 1 - \mathrm{TPR} \]

Table 5.3 summarizes the sensitivity and specificity values and the error rates for each of the clinical types per experiment. Columns (I) and (II) give the type I and type II error rates.

Table 5.3: Sensitivity, specificity, and error rates per clinical type

Experiment: Clinical Type       TP    FN    TPR   TN    FP    TNR   (I)   (II)
Baseline: Adenocarcinoma        53    0     1.0   53    0     1.0   0     0
Baseline: Squamous              44    0     1.0   44    0     1.0   0     0
Baseline: Never Smokers         55    0     1.0   55    0     1.0   0     0
Baseline: Smokers with COPD     49    0     1.0   49    0     1.0   0     0
Baseline: Smokers without COPD  53    0     1.0   53    0     1.0   0     0
Baseline: Acute Lung Injury     62    0     1.0   62    0     1.0   0     0
Baseline: Cystic Fibrosis       27    0     1.0   27    0     1.0   0     0
10% Adenocarcinoma              5     0     1.0   5     0     1.0   0     0
10% Squamous                    4     0     1.0   4     0     1.0   0     0
10% Never Smokers               5     0     1.0   5     0     1.0   0     0
10% Smokers with COPD           5     0     1.0   5     0     1.0   0     0
10% Smokers without COPD        5     0     1.0   5     0     1.0   0     0
10% Acute Lung Injury           6     0     1.0   6     0     1.0   0     0
10% Cystic Fibrosis             3     1     0.75  3     1     0.75  0.25  0.25
Validation: Healthy Serum       7     0     1.0   7     0     1.0   0     0
NCI Maryland Cancer             355   0     1.0   355   0     1.0   0     0
NCI Maryland Control            466   0     1.0   466   0     1.0   0     0
NCI PLCOC Cancer                591   0     1.0   591   0     1.0   0     0
NCI PLCOC Control               650   0     1.0   650   0     1.0   0     0
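The four rates defined in this appendix follow directly from the confusion counts. The sketch below recomputes them for two rows of Table 5.3 using the standard definitions of sensitivity, specificity, and the type I and type II error rates. The class name `Counts` and its method names are hypothetical, and the code assumes every denominator is non-zero, as it is for all rows of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion counts for one clinical type in one experiment."""
    tp: int  # true positives
    fn: int  # false negatives
    tn: int  # true negatives
    fp: int  # false positives

    def sensitivity(self) -> float:
        """True positive rate: TP / (TP + FN)."""
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        """True negative rate: TN / (TN + FP)."""
        return self.tn / (self.tn + self.fp)

    def type_i_error(self) -> float:
        """False positive rate: FP / (FP + TN), equal to 1 - TNR."""
        return self.fp / (self.fp + self.tn)

    def type_ii_error(self) -> float:
        """False negative rate: FN / (FN + TP), equal to 1 - TPR."""
        return self.fn / (self.fn + self.tp)


# Two rows copied from Table 5.3.
rows = {
    "Baseline: Adenocarcinoma": Counts(tp=53, fn=0, tn=53, fp=0),
    "10% Cystic Fibrosis": Counts(tp=3, fn=1, tn=3, fp=1),
}

for label, c in rows.items():
    print(f"{label}: TPR={c.sensitivity():.2f} TNR={c.specificity():.2f} "
          f"(I)={c.type_i_error():.2f} (II)={c.type_ii_error():.2f}")
```

For the 10% Cystic Fibrosis row, the only row with misclassifications, the computed rates match the tabulated 0.75 sensitivity and specificity and the 0.25 error rates.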

APPENDIX E

Description of the Binning Algorithm as a Learning Algorithm

This section discusses some background material before presenting a description of the binning algorithm as a learning algorithm. We show that describing the binning algorithm as a probabilistic graphical model is problematic, as are the maximum likelihood and maximum entropy estimation approaches. Instead, we use a discriminative conditional density estimation approach.

There are two kinds of probabilistic graphical models used in machine learning. The first kind are directed acyclic graphs (DAGs), such as Bayesian networks [a]. The binning algorithm cannot be represented as a Bayesian network because it is not possible to determine the ordering or parent-child hierarchy of such a network. That is, the ordering of nodes is important in Bayesian networks and should go from cause to effect. However, the biological causes that produce the relative biomarker concentrations in our experiments are not known.

The second kind of probabilistic graphical models are undirected graphs, such as Markov networks. The joint distribution p of a Markov network can be written as a product of non-negative functions (potential functions) over the cliques of the graph, provided p has full support (i.e., p(x) > 0 for all x). The Hammersley-Clifford theorem provides the necessary and sufficient conditions under which a positive probability distribution can be represented as a Markov network [b]. However, the condition of full support is not satisfied in our experiments because p(x) > 0 does not hold for all x.

The task of probability modeling (density estimation), that is, the problem of estimating a probability density function from random samples distributed according to this density, can be attacked using a variety of approaches, of which three are discussed here [c]. The first approach uses Bayes' Theorem to estimate the conditional density of an unknown distribution in either of two modes: discriminative or generative.

The discriminative mode computes a hypothesis that models the conditional probability distribution P[y|x] up to a small error. The generative mode estimates the probability distribution P[x|y] separately for every y. Our approach is discriminative since we directly model the conditional distribution without assuming anything about the input distribution P(x).

The maximum likelihood estimation approach measures the data-fitting quality of a discrete model by maximizing the likelihood of the observed samples under a density q, subject to $q \in Q$, where domain Q is the set of all density functions of a chosen form, say exponential. Our early experiments, however, did not produce a reliable set of density functions.

The maximum entropy estimation approach approximates the true expectation $E_D[f_j]$ of each feature $f_j$ by its empirical average over the given samples $x_1, \dots, x_N$, that is, $\frac{1}{N}\sum_{i=1}^{N} f_j(x_i)$. We reject this averaging approach for the same reasons as given in §4.4.I.

The primary disadvantage of discriminative model-free learning (the estimation of conditional densities) versus model-based learning (the estimation of joint densities) is that adding a new class or type requires the entire system to be retrained, since inter-class dynamics are significant.

[a] R. Daly, Q. Shen, S. Aitken. "Learning Bayesian networks: approaches and issues". The Knowledge Engineering Review, Vol. 26:2, 99-157. Cambridge University Press, 2011. doi:10.1017/S0269888910000251
[b] S.L. Lauritzen, A.P. Dawid, B.N. Larsen and H. G. Leimer. "Independence
[c] V. N. Vapnik. "An Overview of Statistical Learning Theory". IEEE Transactions on Neural Networks, Vol. 10, No. 5, September 1999.